Generate transform features of direct features #623
args_set = frozenset({arg1.unique_name(), arg2.unique_name()})
unordered_args.add(args_set)

assert len(add_feats) == len(unordered_args)
Checking the count doesn't communicate why it should have that value. I changed this to instead test that no features are created which have the same two inputs.
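A minimal sketch of what such a check could look like, assuming each feature exposes its two inputs as `base_features` and each input has a `unique_name()` method as in the snippet above. `FakeInput`, `FakeFeature`, and `find_duplicate_input_pairs` are hypothetical names used for illustration only, not the test actually added in this PR:

```python
from collections import namedtuple


# Hypothetical stand-ins for real feature objects (illustration only).
class FakeInput(object):
    def __init__(self, name):
        self._name = name

    def unique_name(self):
        return self._name


FakeFeature = namedtuple('FakeFeature', ['base_features'])


def find_duplicate_input_pairs(features):
    """Return the unordered input pairs used by more than one feature."""
    seen = set()
    duplicates = set()
    for feat in features:
        arg1, arg2 = feat.base_features
        # frozenset ignores order, so (a, b) and (b, a) collide.
        args_set = frozenset({arg1.unique_name(), arg2.unique_name()})
        if args_set in seen:
            duplicates.add(args_set)
        seen.add(args_set)
    return duplicates


a, b, c = FakeInput('a'), FakeInput('b'), FakeInput('c')
ok_feats = [FakeFeature((a, b)), FakeFeature((a, c))]
bad_feats = ok_feats + [FakeFeature((b, a))]

# The assertion states the intent directly: no pair of inputs is reused.
assert not find_duplicate_input_pairs(ok_feats)
```

Compared with a length check, a failure here names the offending input pair, which makes the reason for the expectation visible in the test itself.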
Codecov Report
@@            Coverage Diff             @@
##           master     #623      +/-   ##
==========================================
+ Coverage   97.47%   97.49%   +0.02%
==========================================
  Files         118      118
  Lines        9671     9713      +42
==========================================
+ Hits         9427     9470      +43
+ Misses        244      243       -1
This is necessary for Python 2 because "!=" does not default to the negation of "==" there.
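To illustrate the point, here is a small sketch, assuming the comment refers to an explicit `__ne__` method defined next to `__eq__`. The `Pair` class is hypothetical and is not code from this PR:

```python
class Pair(object):
    """Hypothetical value class used only to illustrate the point."""

    def __init__(self, left, right):
        self.left = left
        self.right = right

    def __eq__(self, other):
        return (isinstance(other, Pair) and
                (self.left, self.right) == (other.left, other.right))

    def __ne__(self, other):
        # Python 3 derives != from __eq__ automatically. Python 2 does not,
        # so without this method != would not consult __eq__ there.
        return not self == other

    def __hash__(self):
        # Keep hashing consistent with the custom equality.
        return hash((self.left, self.right))


same = Pair(1, 2) != Pair(1, 2)
different = Pair(1, 2) != Pair(2, 1)
```

With `__ne__` defined this way, `==` and `!=` agree on both Python 2 and Python 3.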
Will do a bit more testing, but implementation looks good.
Left one question
self._build_transform_features(
    all_features, entity, max_depth=max_depth)
# add seed features, if any, for dfs to build on top of
for f in self.seed_features:
Any reason this can't stay inside of _add_identity_features?
I think I extracted this because if seed features are added before building aggregations then this test fails:
https://github.com/Featuretools/featuretools/blob/4a620527c34a5f7a7484ed9f3a2f954eb7d1c1d8/featuretools/tests/synthesis/test_deep_feature_synthesis.py#L367-L378
But I'm realizing now that you could write a similar test without seed features and it would fail with the current implementation because I moved _add_identity_features before building aggregations:
# We don't want this feature because it will always be equal to 'age'
assert not feature_with_name(features, 'LAST(sessions.customers.age)')
I need to think some more about how to solve this.
I resolved this by adding identity and seed features first, and then making sure not to build aggregations of direct features of the target. For example, if we are building features for customers, we do not use direct features such as customers.age as the base for aggregations.
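The filtering described above could be sketched roughly as follows. This is an illustration under assumptions, not the PR's implementation: the `Candidate` record, its `kind` and `source_entity` fields, and `aggregation_inputs` are all hypothetical simplifications of the real feature classes.

```python
from collections import namedtuple

# Hypothetical, simplified feature record; real feature classes differ.
# `source_entity` is only set for direct features and names the entity
# the value is pulled from.
Candidate = namedtuple('Candidate', ['name', 'kind', 'source_entity'])


def aggregation_inputs(child_features, target_entity):
    """Drop direct features that merely copy a value from the target."""
    return [
        f for f in child_features
        if not (f.kind == 'direct' and f.source_entity == target_entity)
    ]


# Features available on the child entity (sessions) when the target is
# customers. 'customers.age' would only echo the target's own column.
session_features = [
    Candidate('device_type', 'identity', None),
    Candidate('customers.age', 'direct', 'customers'),
]
kept = aggregation_inputs(session_features, target_entity='customers')
```

Under these assumptions, an aggregation such as LAST(sessions.customers.age) is never built because its base feature is filtered out before aggregation.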
First on identity and aggregations, then on direct features. This allows features to be constructed which are direct features of aggregations of transforms on the target. E.g. "log: sessions.MEAN(log.ABSOLUTE(value))" Previously this was not being built because the transform was not built when the aggregation was being built.
@kmax12 the entityset tests revealed that this was no longer building direct features of aggregations of transforms. For example, To solve this I made two calls to
looks good. just some comments on how to structure.
LGTM
But we do not create transforms of single direct features, or when all inputs are direct features with the same relationship path. This would be redundant as WEEKDAY(customers.signup_date) is equivalent to customers.WEEKDAY(signup_date), which should already have been calculated.
This is based on #123
Resolves #119
Resolves #81
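The redundancy rule from the description (no transforms when every input is a direct feature with the same relationship path) can be sketched as below. This is a hedged illustration: representing each input as an `(is_direct, relationship_path)` tuple and the name `should_build_transform` are assumptions made for the example, not the library's API.

```python
def should_build_transform(inputs):
    """Decide whether a transform over `inputs` is worth building.

    `inputs` is a list of (is_direct, relationship_path) tuples, a
    hypothetical simplification of real feature objects. The path is a
    tuple of entity names, and is empty for non-direct features.
    """
    if not all(is_direct for is_direct, _ in inputs):
        # At least one input lives on the target entity itself, so the
        # transform cannot be rewritten as a direct feature.
        return True
    # All inputs are direct features. If they share one relationship
    # path, the same transform exists on the parent entity and is
    # reachable as a direct feature, so building it again is redundant.
    paths = {path for _, path in inputs}
    return len(paths) > 1


# WEEKDAY(customers.signup_date): a single direct input, so it is skipped.
single_direct = [(True, ('customers',))]
# Two direct inputs that come from different parents are still built.
mixed_paths = [(True, ('customers',)), (True, ('products',))]
```

A transform mixing a direct feature with a feature of the target entity is still built, since no single parent entity could have computed it.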