
Handle Transform features that need access to all values of entity #91

Merged
22 commits merged into master on Mar 6, 2018

Conversation

@bschreck (Contributor)

Adds logic to pass in all values of an entity before cutoff to features that need them.

The problem:
Features like Percentile and CumMean/CumMax/etc. need access to all the values of the entity before the cutoff time, rather than only the specified instance ids. Normal Transform features are one-to-one mappings, where the output for a specific instance depends only on a single input value for that instance. In contrast, features like Percentile use all the available data to determine the rank of a particular value. Currently, they are only provided with a subset of the data: the subset linked to the set of specified instance ids on the target entity.

This PR creates an additional entity_frames object inside pandas_backend.py, which contains all the data before the cutoff time. It adds logic to determine whether to pass this object or the normal entity_frames object to each feature calculation function, and it decides where to place the output of these functions.

Lastly, it makes sure to extract only the relevant columns from the entityset using EntitySet.get_pandas_data_slice. Previously, we simply used all the columns, which wasted memory.
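To see why a feature like Percentile cannot be computed from a subset of rows alone, here is a minimal pure-Python sketch; the names `pct_rank`, `full`, and `subset` are hypothetical and not from the PR:

```python
def pct_rank(values, x):
    """Fraction of values less than or equal to x (a simple percentile rank)."""
    return sum(v <= x for v in values) / len(values)

# All values of the entity before the cutoff time
full = [10, 20, 30, 40, 50]
# Only the rows for the requested instance ids
subset = [30, 40]

# The same value gets a different rank depending on which data is passed in:
print(pct_rank(full, 30))    # 0.6
print(pct_rank(subset, 30))  # 0.5
```

This is exactly the gap the extra entity_frames object closes: the feature function must see the full pre-cutoff data, even when output is only needed for a few instances.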

@Seth-Rothschild (Contributor)

Trying to use this with custom primitives now: what's the expected workflow? I thought that you might be able to set use_previous by doing

def primitive1(value):
    numeric = pd.Series(value)
    return (numeric - numeric.shift(1))

Prim1 = make_trans_primitive(primitive1,
                             input_types=[vtypes.Numeric],
                             cls_attributes={'use_previous': True},
                             return_type=vtypes.Numeric)

but that doesn't seem to be the way to enable calculation using additional data.

@codecov-io

codecov-io commented Feb 14, 2018

Codecov Report

Merging #91 into master will increase coverage by 0.53%.
The diff coverage is 98.76%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master      #91      +/-   ##
==========================================
+ Coverage   87.65%   88.18%   +0.53%     
==========================================
  Files          73       73              
  Lines        7192     7392     +200     
==========================================
+ Hits         6304     6519     +215     
+ Misses        888      873      -15
Impacted Files Coverage Δ
...computational_backends/calculate_feature_matrix.py 96.87% <100%> (+0.02%) ⬆️
featuretools/primitives/transform_primitive.py 97.76% <100%> (ø) ⬆️
...eaturetools/computational_backends/feature_tree.py 99.28% <100%> (+6.42%) ⬆️
.../feature_function_tests/test_transform_features.py 87.64% <100%> (+1.51%) ⬆️
...aturetools/tests/entityset_tests/test_pandas_es.py 99.72% <100%> (ø) ⬆️
featuretools/entityset/entityset.py 89.8% <100%> (+0.04%) ⬆️
...utational_backend/test_calculate_feature_matrix.py 98.79% <100%> (+0.18%) ⬆️
featuretools/primitives/cum_transform_feature.py 97.63% <100%> (+0.01%) ⬆️
featuretools/primitives/primitive_base.py 77.92% <100%> (+0.74%) ⬆️
...turetools/computational_backends/pandas_backend.py 89.24% <92.85%> (+2.04%) ⬆️
... and 4 more

Continue to review full report at Codecov.

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 3be6fb3...861f22b. Read the comment docs.

@bschreck (Contributor, Author)

@Seth-Rothschild the way to do it would be cls_attributes={'needs_all_values': True}. You just have the wrong variable name.
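Putting the two comments together, a corrected version of the earlier snippet might look like the following. The runnable part re-implements the first-difference logic in plain Python for illustration; the commented registration call is a sketch based on this thread and assumes the same `make_trans_primitive`/`vtypes` imports as the snippet above:

```python
def primitive1(values):
    # First difference: out[i] = values[i] - values[i-1]; None for the first row
    return [None] + [b - a for a, b in zip(values, values[1:])]

print(primitive1([1, 3, 6]))  # [None, 2, 3]

# Registration per this thread (hypothetical sketch, not verified API):
# Prim1 = make_trans_primitive(primitive1,
#                              input_types=[vtypes.Numeric],
#                              cls_attributes={'needs_all_values': True},
#                              return_type=vtypes.Numeric)
```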

@@ -95,14 +96,28 @@ def calculate_all_features(self, instance_ids, time_last,
ordered_entities = FeatureTree(self.entityset, self.features, ignored=ignored).ordered_entities
else:
ordered_entities = self.feature_tree.ordered_entities

necessary_columns = self._find_necessary_columns(only_if_needs_all_values=False)
Contributor:

is there a way to refactor to not need this parameter? not clear to me what it does

@@ -210,6 +243,72 @@ def generate_default_df(self, instance_ids, extra_columns=None):
default_df[c] = [np.nan] * len(instance_ids)
return default_df

def _find_necessary_columns(self, only_if_needs_all_values=False):
Contributor:

any reason not to make this a method of featuretree?

handler = self._feature_type_handler(test_feature)
result_frame = handler(group, input_frames)

self._place_in_output_frames(result_frame,
Contributor:

looks like this is a function that is only called once. any reason you chose to make it a function rather than include the code inline?

@@ -256,6 +272,7 @@ def query_entity_by_values(self, entity_id, instance_vals, variable_id=None,
variable_id=variable_id,
columns=columns,
time_last=time_last,
training_window=training_window,
Contributor:

can you explain this addition? how were things working previously without this?

Contributor Author:

I think they weren't

Contributor Author:

should write a test to confirm

Contributor Author:

We actually never use this function except in tests... We always call the one on the Entity itself (Entity.query_by_values). I think we should just delete this function and change those tests to the one we actually use in EntitySet

Contributor Author:

Changed in latest commit. Let me know what you think

Contributor:

yep, let's remove if it's unused

self.all_features = list(all_features.values())
self.feature_deps = feature_deps
self.feature_dependents = self._build_dependents()
Contributor:

i think this is the inverse of self.feature_deps, but the variable name doesn't make it very clear since it is the same

Contributor Author:

i didn't want to change feature_deps when I was first doing it for fear it was used elsewhere, but now that everything works it would be better to name it feature_dependencies

self.all_features = list(all_features.values())
self.feature_deps = feature_deps
self.feature_dependents = self._build_dependents()
Contributor:

why break this out into a different function? can it be combined into the logic above that calculates self.feature_deps?

Contributor Author:

we would have to change the above logic to accommodate it, since we rely on get_deep_dependencies() but we don't have an equivalent get_deep_dependents() method. It would just have to be a bit more complicated as a queue/search/while loop instead of just the two for loops. But yeah might make it look more uniform
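For reference, the queue-based search mentioned here could look something like the following sketch; `deep_dependents` and the example graph are hypothetical, not code from the PR:

```python
from collections import deque

def deep_dependents(direct_dependents, feature):
    """All features that transitively depend on `feature` (breadth-first search)."""
    seen = set()
    queue = deque([feature])
    while queue:
        f = queue.popleft()
        for d in direct_dependents.get(f, []):
            if d not in seen:
                seen.add(d)
                queue.append(d)
    return seen

# 'f1' depends on 'base', 'f2' depends on 'f1'
graph = {'base': ['f1'], 'f1': ['f2'], 'f2': []}
print(sorted(deep_dependents(graph, 'base')))  # ['f1', 'f2']
```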

Contributor Author:

Nevermind i think i can make it work

@@ -151,15 +169,30 @@ def calculate_all_features(self, instance_ids, time_last,
if not self.entityset.find_backward_path(start_entity_id=filter_eid,
goal_entity_id=eid):
entity_frames[eid] = eframes_by_filter[eid][eid]
if (large_eframes_by_filter is not None and
eid in large_eframes_by_filter and eid in large_eframes_by_filter[eid]):
large_entity_frames[eid] = large_eframes_by_filter[eid][eid]
Contributor:

what is this line about? seems a bit weird

test_feature = group[0]

input_frames, output_frames = \
self._determine_input_output_frames(
Contributor:

seems like this logic can be done in a simpler way to avoid implementing a function that only gets called once.

input_frames = large_entity_frames
if self.feature_tree.needs_all_values_differentiator(test_feature) == 'normal_no_dependent':
    input_frames = entity_frames

then you can put the output frame logic in existing _place_in_output_frames function you've defined.

@@ -210,6 +243,72 @@ def generate_default_df(self, instance_ids, extra_columns=None):
default_df[c] = [np.nan] * len(instance_ids)
return default_df

def _find_necessary_columns(self, only_if_needs_all_values=False):
Contributor:

is only_if_needs_all_values an optimization or is it required for this to work? a bit hard to follow its role. in general, I find it confusing when a function has a single boolean parameter that controls behavior like this unless it is very obvious what it does. as far as I can tell, there is only one place in the code where we set it to true. if that is the case, I'd prefer the optimization happen more explicitly outside the function. that makes the function itself more readable and maintainable.

@rwedge (Contributor)

rwedge commented Feb 23, 2018

Codecov has feature_tree.get_all_features not being hit by the test cases and it doesn't look like it's called by other parts of featuretools. Is this a method we want to keep?

@@ -19,6 +19,7 @@ class CumFeature(TransformPrimitive):
allow_where = True
agg_feature = None
rolling_function = True
needs_all_values = True
Contributor:

thinking through other names. do any of these stick out as better or trigger other ideas?

uses_all_values
uses_all_data
uses_full_data

Contributor Author:

What about something like "cross_entity", or "spans_entity"

Contributor Author:

"full_entity_transform"

Contributor:

historical_transform

Contributor:

I like the idea of full_entity_transform. any thoughts on other versions that don't repeat transform?

Contributor Author:

applied_to_full_entity
applied_to_full_data
?

Contributor:

uses_full_entity


# These functions are used for sorting and grouping features

def needs_all_values_differentiator(self, f):
if self._dependent_needs_all_values(f) and self._is_output_feature(f):
Contributor:

i think this is an example of where extra one time use functions don't necessarily help.

is_output = f.hash() in self.feature_hashes

then use the variable in your conditionals below

Contributor Author:

yeah i agree

Contributor Author:

i think they're more to help me think while I go about solving the problem, and makes sense to review them at final solution

def _needs_all_values(self, feature):
if feature.needs_all_values:
return True
for d in self.feature_dependents[feature.hash()]:
Contributor:

can this be replaced with a call to self._dependent_needs_all_values? at first glance, the code looks copy and pasted


# These functions are used for sorting and grouping features

def needs_all_values_differentiator(self, f):
Contributor:

can you briefly document what each of the 4 cases mean for someone looking at this code in the future?

large_eframes_by_filter = None
if any([f.needs_all_values for f in self.feature_tree.all_features]):
large_necessary_columns = self.feature_tree.necessary_columns_for_all_values_features
all_instances = self.entityset[self.target_eid].get_all_instances()
Contributor:

is it a bit weird to get all instances here when you'd likely have to filter some of them out by time later?

Contributor Author:

workaround for issue #93. can merge with master and remove

result_frame = handler(group, input_frames)

output_frames = []
if needs_all_values_type.startswith('dependent'):
Contributor:

I'd make this more explicit like the other conditionals around it

if needs_all_values_type in ["dependent", "dependent_and_output"]:
    pass

# turn values which were hashes of features into the features themselves
# (note that they were hashes to prevent checking equality of feature objects,
# which is not currently an allowed operation)
self.feature_dependents = {fhash: [all_features[dhash] for dhash in feature_dependents[fhash]]
Contributor:

why do you iterate over the keys in all_features rather than feature_dependents?

Contributor Author:

If a feature has no dependencies, it won't be put in feature_dependents by the above code. But, looking at it now, I can just explicitly add those in the for loop.

Contributor Author:

hmm actually that doesn't make much sense because it's a defaultdict, so a feature with no deps will get treated as an empty value in the defaultdict anyway.

all_features[dep.hash()] = dep
feature_deps[dep.hash()] = dep.get_deep_dependencies(ignored=ignored)
subdeps = dep.get_deep_dependencies(
Contributor:

@bschreck i see we had this before, but can you explain why we need to do the sub dependency call here? like why only once rather than go to sub sub dependencies?

Contributor Author:

get_deep_dependencies() gets all the dependencies recursively, so features + each one's deep dependencies defines every feature. To build the dependents dict, for each one of those, we need all of its dependents, not dependencies. deps is every dependency of the top level features, and subdeps is every dependency of all the other ones. So deps also contains all the dependents of all the subdeps, and top level features are all the other dependents for all the rest. kind of tricky to wrap your head around
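Concretely, the inversion being described, building a dependents mapping out of the dependency lists, amounts to something like this sketch; the hash strings are hypothetical stand-ins for feature.hash() values:

```python
from collections import defaultdict

# Dependencies keyed by feature hash (hypothetical values):
# f1 depends on a and b, f2 depends on b
feature_deps = {'f1': ['a', 'b'], 'f2': ['b'], 'a': [], 'b': []}

# Invert: for every edge "feat depends on d", record feat as a dependent of d
feature_dependents = defaultdict(set)
for feat, deps in feature_deps.items():
    for d in deps:
        feature_dependents[d].add(feat)

print(sorted(feature_dependents['b']))  # ['f1', 'f2']
```

Using a defaultdict here is what makes features with no dependents come back as an empty set, as noted earlier in the thread.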

@@ -153,9 +198,53 @@ def _get_feature_depths(self, entity_id):

return list(features.values()), out

def _build_dependents(self):
Contributor:

is this used anywhere?

# to calculate_feature_matrix), then we also need
# to subselect the output based on the desired instance ids
# and place in the return data frame.
if self._dependent_needs_all_values(f) and is_output:
Contributor:

i think you can refactor this to get rid of extra functions

dependent_needs_all_values = False
for d in self.feature_dependents[f.hash()]:
        if d.needs_all_values:
            dependent_needs_all_values = True
needs_all_values = f.needs_all_values or dependent_needs_all_values

then use variables below

@@ -10,9 +10,9 @@

import numpy as np
import pandas as pd
# progress bar
Contributor:

delete?

from featuretools.utils.gen_utils import make_tqdm_iterator

# featuretools
Contributor:

delete?

@@ -151,15 +170,48 @@ def calculate_all_features(self, instance_ids, time_last,
if not self.entityset.find_backward_path(start_entity_id=filter_eid,
goal_entity_id=eid):
entity_frames[eid] = eframes_by_filter[eid][eid]
# precalculated features will only be placed in entity_frames,
Contributor:

is it possible to have precalculated a needs_all_values feature? do we need to add code to handle that case?

Contributor Author:

i thought that I considered all cases before but I did miss the case when you have a needs_all_values feature that depends on an approximated/precalculated feature. This will need to be reworked.

Contributor:

can we remove the TODO?


for frames in output_frames:
index = frames[entity_id].index
frames[entity_id] = result_frame.loc[index]
Contributor:

you might want to use result_frame.reindex() here. you'll want it if some of the index values aren't going to be in result_frame.
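The difference the reviewer is pointing at can be seen in a small pandas sketch (hypothetical data, not from the PR): .loc raises on labels that are missing from the frame, while .reindex keeps them as NaN rows.

```python
import pandas as pd

result_frame = pd.DataFrame({'val': [1.0, 2.0]}, index=['a', 'b'])

# .loc raises a KeyError when any requested label is missing from the index...
try:
    result_frame.loc[['a', 'c']]
except KeyError:
    print('loc failed on the missing label')

# ...while .reindex keeps the missing rows, filled with NaN
aligned = result_frame.reindex(['a', 'c'])
print(aligned['val'].isna().tolist())  # [False, True]
```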

@@ -19,6 +19,7 @@ class CumFeature(TransformPrimitive):
allow_where = True
agg_feature = None
rolling_function = True
needs_all_values = True
Contributor:

historical_transform

@@ -19,6 +19,7 @@ class CumFeature(TransformPrimitive):
allow_where = True
agg_feature = None
rolling_function = True
needs_all_values = True
Contributor:

I like the idea of full_entity_transform. any thoughts on other versions that don't repeat transform?

@kmax12 (Contributor)

kmax12 commented Mar 6, 2018

Looks great. Thanks for the work on this!

@kmax12 kmax12 merged commit 4189b0b into master Mar 6, 2018
@rwedge rwedge mentioned this pull request Mar 21, 2018
@rwedge rwedge deleted the needs-all-values-transform-features branch June 3, 2019 15:46