
Handle Transform features that need access to all values of entity #91

Merged
22 commits merged into master on Mar 6, 2018

Conversation

@bschreck (Contributor)

Adds logic to pass in all values of an entity before cutoff to features that need them.

The problem:
Features like Percentile and CumMean/CumMax/etc. need access to all the values of the entity before the cutoff time, rather than only the specified instance ids. Normal Transform features are one-to-one mappings, where the output for a specific instance depends only on a single input value for that instance. In contrast, features like Percentile use all the available data to determine the rank of a particular value. Currently, they are only provided with a subset of the data: the subset linked to the set of specified instance ids on the target entity.

This PR creates an additional entity_frames object inside pandas_backend.py, which contains all the data before the cutoff time. It adds logic to determine whether to pass this object or the normal entity_frames object to each feature calculation function, and it decides where to place the output of these functions.

Lastly, it makes sure to extract only the relevant columns from the entityset using EntitySet.get_pandas_data_slice. Previously, we simply used all the columns, which wasted memory.
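To see why a feature like Percentile cannot be computed from a subset of rows alone, here is a minimal pure-Python sketch; the names `pct_rank`, `full`, and `subset` are hypothetical and not from the PR:

```python
def pct_rank(values, x):
    """Fraction of values less than or equal to x (a simple percentile rank)."""
    return sum(v <= x for v in values) / len(values)

# All values of the entity before the cutoff time
full = [10, 20, 30, 40, 50]
# Only the rows for the requested instance ids
subset = [30, 40]

# The same value gets a different rank depending on which data is passed in:
print(pct_rank(full, 30))    # 0.6
print(pct_rank(subset, 30))  # 0.5
```

This is exactly the gap the extra entity_frames object closes: the feature function must see the full pre-cutoff data, even when output is only needed for a few instances.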

@Seth-Rothschild (Contributor)

Trying to use this with custom primitives now: what's the expected workflow? I thought that you might be able to set use_previous by doing

def primitive1(value):
    numeric = pd.Series(value)
    return (numeric - numeric.shift(1))

Prim1 = make_trans_primitive(primitive1,
                             input_types=[vtypes.Numeric],
                             cls_attributes={'use_previous': True},
                             return_type=vtypes.Numeric)

but that doesn't seem to be the way to enable calculation using additional data.

@codecov-io

codecov-io commented Feb 14, 2018

Codecov Report

Merging #91 into master will increase coverage by 0.53%.
The diff coverage is 98.76%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master      #91      +/-   ##
==========================================
+ Coverage   87.65%   88.18%   +0.53%     
==========================================
  Files          73       73              
  Lines        7192     7392     +200     
==========================================
+ Hits         6304     6519     +215     
+ Misses        888      873      -15
Impacted Files Coverage Δ
...computational_backends/calculate_feature_matrix.py 96.87% <100%> (+0.02%) ⬆️
featuretools/primitives/transform_primitive.py 97.76% <100%> (ø) ⬆️
...eaturetools/computational_backends/feature_tree.py 99.28% <100%> (+6.42%) ⬆️
.../feature_function_tests/test_transform_features.py 87.64% <100%> (+1.51%) ⬆️
...aturetools/tests/entityset_tests/test_pandas_es.py 99.72% <100%> (ø) ⬆️
featuretools/entityset/entityset.py 89.8% <100%> (+0.04%) ⬆️
...utational_backend/test_calculate_feature_matrix.py 98.79% <100%> (+0.18%) ⬆️
featuretools/primitives/cum_transform_feature.py 97.63% <100%> (+0.01%) ⬆️
featuretools/primitives/primitive_base.py 77.92% <100%> (+0.74%) ⬆️
...turetools/computational_backends/pandas_backend.py 89.24% <92.85%> (+2.04%) ⬆️
... and 4 more

Continue to review full report at Codecov.

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 3be6fb3...861f22b. Read the comment docs.

@bschreck (Contributor, Author)

@Seth-Rothschild the way to do it would be cls_attributes={'needs_all_values': True}. You just have the wrong variable name.
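Putting the two comments together, a corrected version of the earlier snippet might look like the following. The runnable part re-implements the first-difference logic in plain Python for illustration; the commented registration call is a sketch based on this thread and assumes the same `make_trans_primitive`/`vtypes` imports as the snippet above:

```python
def primitive1(values):
    # First difference: out[i] = values[i] - values[i-1]; None for the first row
    return [None] + [b - a for a, b in zip(values, values[1:])]

print(primitive1([1, 3, 6]))  # [None, 2, 3]

# Registration per this thread (hypothetical sketch, not verified API):
# Prim1 = make_trans_primitive(primitive1,
#                              input_types=[vtypes.Numeric],
#                              cls_attributes={'needs_all_values': True},
#                              return_type=vtypes.Numeric)
```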

@@ -95,14 +96,28 @@ def calculate_all_features(self, instance_ids, time_last,
ordered_entities = FeatureTree(self.entityset, self.features, ignored=ignored).ordered_entities
else:
ordered_entities = self.feature_tree.ordered_entities

necessary_columns = self._find_necessary_columns(only_if_needs_all_values=False)
Contributor:

is there a way to refactor to not need this parameter? not clear to me what it does

@@ -210,6 +243,72 @@ def generate_default_df(self, instance_ids, extra_columns=None):
default_df[c] = [np.nan] * len(instance_ids)
return default_df

def _find_necessary_columns(self, only_if_needs_all_values=False):
Contributor:

any reason not to make this a method of featuretree?

handler = self._feature_type_handler(test_feature)
result_frame = handler(group, input_frames)

self._place_in_output_frames(result_frame,
Contributor:

looks like this is a function that is only called once. any reason you chose to make it a function rather than include the code inline?

@@ -256,6 +272,7 @@ def query_entity_by_values(self, entity_id, instance_vals, variable_id=None,
variable_id=variable_id,
columns=columns,
time_last=time_last,
training_window=training_window,
Contributor:

can you explain this addition? how were things working previously without this?

Contributor Author:

I think they weren't

Contributor Author:

should write a test to confirm

Contributor Author:

We actually never use this function except in tests... We always call the one on the Entity itself (Entity.query_by_values). I think we should just delete this function and change those tests to the one we actually use in EntitySet

Contributor Author:

Changed in latest commit. Let me know what you think

Contributor:

yep, let's remove if it's unused

self.all_features = list(all_features.values())
self.feature_deps = feature_deps
self.feature_dependents = self._build_dependents()
Contributor:

i think this is the inverse of self.feature_deps, but the variable name doesn't make it very clear since it is the same

Contributor Author:

i didn't want to change feature_deps when I was first doing it for fear it was used elsewhere, but now that everything works it would be better to name it feature_dependencies

self.all_features = list(all_features.values())
self.feature_deps = feature_deps
self.feature_dependents = self._build_dependents()
Contributor:

why break this out into a different function? can it be combined into the logic above that calculates self.feature_deps?

Contributor Author:

we would have to change the above logic to accommodate it, since we rely on get_deep_dependencies() but we don't have an equivalent get_deep_dependents() method. It would just have to be a bit more complicated as a queue/search/while loop instead of just the two for loops. But yeah might make it look more uniform
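For reference, the queue-based search mentioned here could look something like the following sketch; `deep_dependents` and the example graph are hypothetical, not code from the PR:

```python
from collections import deque

def deep_dependents(direct_dependents, feature):
    """All features that transitively depend on `feature` (breadth-first search)."""
    seen = set()
    queue = deque([feature])
    while queue:
        f = queue.popleft()
        for d in direct_dependents.get(f, []):
            if d not in seen:
                seen.add(d)
                queue.append(d)
    return seen

# 'f1' depends on 'base', 'f2' depends on 'f1'
graph = {'base': ['f1'], 'f1': ['f2'], 'f2': []}
print(sorted(deep_dependents(graph, 'base')))  # ['f1', 'f2']
```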

Contributor Author:

Nevermind i think i can make it work

@@ -151,15 +169,30 @@ def calculate_all_features(self, instance_ids, time_last,
if not self.entityset.find_backward_path(start_entity_id=filter_eid,
goal_entity_id=eid):
entity_frames[eid] = eframes_by_filter[eid][eid]
if (large_eframes_by_filter is not None and
eid in large_eframes_by_filter and eid in large_eframes_by_filter[eid]):
large_entity_frames[eid] = large_eframes_by_filter[eid][eid]
Contributor:

what is this line about? seems a bit weird

test_feature = group[0]

input_frames, output_frames = \
self._determine_input_output_frames(
Contributor:

seems like this logic can be done in a simpler way to avoid implementing a function that only gets called once.

input_frames = large_entity_frames
if self.feature_tree.needs_all_values_differentiator(test_feature) == 'normal_no_dependent':
    input_frames = entity_frames

then you can put the output frame logic in existing _place_in_output_frames function you've defined.

@@ -210,6 +243,72 @@ def generate_default_df(self, instance_ids, extra_columns=None):
default_df[c] = [np.nan] * len(instance_ids)
return default_df

def _find_necessary_columns(self, only_if_needs_all_values=False):
Contributor:

is only_if_needs_all_values an optimization or is it required for this to work? a bit hard to follow its role. in general, I find it confusing when a function has a single boolean parameter that controls behavior like this unless it is very obvious what it does. as far as I can tell, there is only one place in the code where we set it to true. if that is the case, I'd prefer the optimization happen more explicitly outside the function. that makes the function itself more readable and maintainable.

@rwedge (Contributor)

rwedge commented Feb 23, 2018

Codecov has feature_tree.get_all_features not being hit by the test cases and it doesn't look like it's called by other parts of featuretools. Is this a method we want to keep?

@@ -19,6 +19,7 @@ class CumFeature(TransformPrimitive):
allow_where = True
agg_feature = None
rolling_function = True
needs_all_values = True
Contributor:

thinking through other names. do any of these stick out as better or trigger other ideas?

uses_all_values
uses_all_data
uses_full_data

Contributor Author:

What about something like "cross_entity", or "spans_entity"

Contributor Author:

"full_entity_transform"

Contributor:

historical_transform

Contributor:

I like the idea of full_entity_transform. any thoughts on other versions that don't repeat transform?

Contributor Author:

applied_to_full_entity
applied_to_full_data
?

Contributor:

uses_full_entity


# These functions are used for sorting and grouping features

def needs_all_values_differentiator(self, f):
if self._dependent_needs_all_values(f) and self._is_output_feature(f):
Contributor:

i think this is an example of where extra one time use functions don't necessarily help.

is_output = f.hash() in self.feature_hashes

then use the variable in your conditionals below

Contributor Author:

yeah i agree

Contributor Author:

i think they're more to help me think while I go about solving the problem, and makes sense to review them at final solution

def _needs_all_values(self, feature):
if feature.needs_all_values:
return True
for d in self.feature_dependents[feature.hash()]:
Contributor:

can this be replaced with a call to self._dependent_needs_all_values? at first glance, the code looks copy and pasted


# These functions are used for sorting and grouping features

def needs_all_values_differentiator(self, f):
Contributor:

can you briefly document what each of the 4 cases mean for someone looking at this code in the future?

large_eframes_by_filter = None
if any([f.needs_all_values for f in self.feature_tree.all_features]):
large_necessary_columns = self.feature_tree.necessary_columns_for_all_values_features
all_instances = self.entityset[self.target_eid].get_all_instances()
Contributor:

is it a bit weird to get all instances here when you'd likely have to filter some of them out by time later?

Contributor Author:

workaround for issue #93. can merge with master and remove

result_frame = handler(group, input_frames)

output_frames = []
if needs_all_values_type.startswith('dependent'):
Contributor:

I'd make this more explicit like the other conditionals around it

if needs_all_values_type in ["dependent", "dependent_and_output"]:
    pass

# turn values which were hashes of features into the features themselves
# (note that they were hashes to prevent checking equality of feature objects,
# which is not currently an allowed operation)
self.feature_dependents = {fhash: [all_features[dhash] for dhash in feature_dependents[fhash]]
Contributor:

why do you iterate over the keys in all_features rather than feature_dependents?

Contributor Author:

If a feature has no dependencies, it won't be put in feature_dependents by the above code. But, looking at it now, I can just explicitly add those in the for loop.

Contributor Author:

hmm actually that doesn't make much sense because it's a defaultdict, so a feature with no deps will get treated as an empty value in the defaultdict anyway.

all_features[dep.hash()] = dep
feature_deps[dep.hash()] = dep.get_deep_dependencies(ignored=ignored)
subdeps = dep.get_deep_dependencies(
Contributor:

@bschreck i see we had this before, but can you explain why we need to do the sub dependency call here? like why only once rather than go to sub sub dependencies?

Contributor Author:

get_deep_dependencies() gets all the dependencies recursively, so features + each one's deep dependencies defines every feature. To build the dependents dict, for each one of those, we need all of its dependents, not dependencies. deps is every dependency of the top level features, and subdeps is every dependency of all the other ones. So deps also contains all the dependents of all the subdeps, and top level features are all the other dependents for all the rest. kind of tricky to wrap your head around
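Concretely, the inversion being described, building a dependents mapping out of the dependency lists, amounts to something like this sketch; the hash strings are hypothetical stand-ins for feature.hash() values:

```python
from collections import defaultdict

# Dependencies keyed by feature hash (hypothetical values):
# f1 depends on a and b, f2 depends on b
feature_deps = {'f1': ['a', 'b'], 'f2': ['b'], 'a': [], 'b': []}

# Invert: for every edge "feat depends on d", record feat as a dependent of d
feature_dependents = defaultdict(set)
for feat, deps in feature_deps.items():
    for d in deps:
        feature_dependents[d].add(feat)

print(sorted(feature_dependents['b']))  # ['f1', 'f2']
```

Using a defaultdict here is what makes features with no dependents come back as an empty set, as noted earlier in the thread.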

@@ -153,9 +198,53 @@ def _get_feature_depths(self, entity_id):

return list(features.values()), out

def _build_dependents(self):
Contributor:

is this used anywhere?

# to calculate_feature_matrix), then we also need
# to subselect the output based on the desired instance ids
# and place in the return data frame.
if self._dependent_needs_all_values(f) and is_output:
Contributor:

i think you can refactor this to get rid of extra functions

dependent_needs_all_values = False
for d in self.feature_dependents[f.hash()]:
        if d.needs_all_values:
            dependent_needs_all_values = True
needs_all_values = f.needs_all_values or dependent_needs_all_values

then use variables below

@@ -10,9 +10,9 @@

import numpy as np
import pandas as pd
# progress bar
Contributor:

delete?

from featuretools.utils.gen_utils import make_tqdm_iterator

# featuretools
Contributor:

delete?

@@ -151,15 +170,48 @@ def calculate_all_features(self, instance_ids, time_last,
if not self.entityset.find_backward_path(start_entity_id=filter_eid,
goal_entity_id=eid):
entity_frames[eid] = eframes_by_filter[eid][eid]
# precalculated features will only be placed in entity_frames,
Contributor:

is it possible to have precalculated a needs_all_values feature? do we need to add code to handle that case?

Contributor Author:

i thought that I considered all cases before but I did miss the case when you have a needs_all_values feature that depends on an approximated/precalculated feature. This will need to be reworked.

Contributor:

can we remove the TODO?


for frames in output_frames:
index = frames[entity_id].index
frames[entity_id] = result_frame.loc[index]
Contributor:

you might want to use result_frame.reindex() here. you'll want it if some of the index values aren't going to be in result_frame.
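The difference the reviewer is pointing at can be seen in a small pandas sketch (hypothetical data, not from the PR): .loc raises on labels that are missing from the frame, while .reindex keeps them as NaN rows.

```python
import pandas as pd

result_frame = pd.DataFrame({'val': [1.0, 2.0]}, index=['a', 'b'])

# .loc raises a KeyError when any requested label is missing from the index...
try:
    result_frame.loc[['a', 'c']]
except KeyError:
    print('loc failed on the missing label')

# ...while .reindex keeps the missing rows, filled with NaN
aligned = result_frame.reindex(['a', 'c'])
print(aligned['val'].isna().tolist())  # [False, True]
```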

@@ -19,6 +19,7 @@ class CumFeature(TransformPrimitive):
allow_where = True
agg_feature = None
rolling_function = True
needs_all_values = True
Contributor:

historical_transform

@@ -19,6 +19,7 @@ class CumFeature(TransformPrimitive):
allow_where = True
agg_feature = None
rolling_function = True
needs_all_values = True
Contributor:

I like the idea of full_entity_transform. any thoughts on other versions that don't repeat transform?

@kmax12 (Contributor)

kmax12 commented Mar 6, 2018

Looks great. Thanks for the work on this!

@kmax12 kmax12 merged commit 4189b0b into master Mar 6, 2018
@rwedge rwedge mentioned this pull request Mar 21, 2018
@rwedge rwedge deleted the needs-all-values-transform-features branch June 3, 2019 15:46