Generate transform features of direct features #623

Merged
CJStadler merged 27 commits into master from dfs-trans-after-direct on Jul 25, 2019

CJStadler
Contributor

This PR builds transform features on top of direct features. However, we do not create transforms of a single direct feature, or transforms whose inputs are all direct features with the same relationship path. These would be redundant: WEEKDAY(customers.signup_date) is equivalent to customers.WEEKDAY(signup_date), which should already have been calculated.
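
The redundancy rule can be sketched roughly as follows. This is a toy model, not the featuretools implementation: the `Feature` dataclass, its `direct_path` field, and `is_redundant_transform_input` are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass(frozen=True)
class Feature:
    """Toy stand-in for a feature (hypothetical, not the featuretools class)."""
    name: str
    # Relationship path of a direct feature, e.g. ("customers",); None otherwise.
    direct_path: Optional[Tuple[str, ...]] = None


def is_redundant_transform_input(inputs: Sequence[Feature]) -> bool:
    """True when every input is a direct feature reached by the same path.

    A single direct feature is the one-input case of this rule.
    """
    if not inputs:
        return False
    paths = {f.direct_path for f in inputs}
    return None not in paths and len(paths) == 1


signup = Feature("customers.signup_date", ("customers",))
age = Feature("customers.age", ("customers",))
region = Feature("stores.region", ("stores",))
amount = Feature("amount")  # identity feature on the current entity
```

Under these assumptions, `[signup]` and `[signup, age]` would be skipped, while `[signup, amount]` and `[signup, region]` would still be used as transform inputs.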

This is based on #123

Resolves #119
Resolves #81

args_set = frozenset({arg1.unique_name(), arg2.unique_name()})
unordered_args.add(args_set)

assert len(add_feats) == len(unordered_args)
Contributor Author

Checking the count doesn't communicate why it should have that value. I changed this to instead test that no features are created which have the same two inputs.
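
One possible reading of that check, sketched on plain name pairs rather than real feature objects: collect each unordered pair of inputs as a `frozenset` and report any pair that occurs more than once. The helper name `find_duplicate_input_pairs` is hypothetical and does not come from the test suite.

```python
def find_duplicate_input_pairs(input_pairs):
    """Return the unordered input pairs used by more than one feature.

    `input_pairs` is an iterable of (name_a, name_b) tuples, one per feature.
    (a, b) and (b, a) count as the same pair.
    """
    seen = set()
    duplicates = set()
    for arg1, arg2 in input_pairs:
        key = frozenset({arg1, arg2})
        if key in seen:
            duplicates.add(key)
        seen.add(key)
    return duplicates


# Example: the third pair repeats the first one with the arguments swapped.
example_pairs = [("a", "b"), ("a", "c"), ("b", "a")]
example_duplicates = find_duplicate_input_pairs(example_pairs)
```

An assertion that the returned set is empty states the intent directly, where a length comparison only implies it.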

codecov bot commented Jun 24, 2019

Codecov Report

Merging #623 into master will increase coverage by 0.02%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master     #623      +/-   ##
==========================================
+ Coverage   97.47%   97.49%   +0.02%     
==========================================
  Files         118      118              
  Lines        9671     9713      +42     
==========================================
+ Hits         9427     9470      +43     
+ Misses        244      243       -1
| Impacted Files | Coverage Δ |
|---|---|
| featuretools/synthesis/deep_feature_synthesis.py | 96.25% <100%> (+0.18%) ⬆️ |
| ...ols/tests/synthesis/test_deep_feature_synthesis.py | 100% <100%> (ø) ⬆️ |
| featuretools/entityset/relationship.py | 98.71% <100%> (+0.03%) ⬆️ |
| ...s/computational_backends/feature_set_calculator.py | 98.42% <0%> (+0.31%) ⬆️ |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update eabad43...7dba0a8.

Contributor

@kmax12 kmax12 left a comment

Will do a bit more testing, but implementation looks good.

Left one question

self._build_transform_features(
    all_features, entity, max_depth=max_depth)
# add seed features, if any, for dfs to build on top of
for f in self.seed_features:
Contributor

any reason this can't stay inside of _add_identity_features?

Contributor Author

I think I extracted this because if seed features are added before building aggregations then this test fails:
https://github.com/Featuretools/featuretools/blob/4a620527c34a5f7a7484ed9f3a2f954eb7d1c1d8/featuretools/tests/synthesis/test_deep_feature_synthesis.py#L367-L378

But I'm realizing now that you could write a similar test without seed features and it would fail with the current implementation because I moved _add_identity_features before building aggregations:

# We don't want this feature because it will always be equal to 'age'
assert not feature_with_name(features, 'LAST(sessions.customers.age)')

I need to think some more about how to solve this.

Contributor Author

I resolved this by adding identity and seed features first, and then making sure to not build aggregations of direct features of the target. For example, if we are building features for customers we do not use direct features such as customers.age as the base for aggregations.
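
A minimal sketch of that filter, assuming features are plain dicts with a hypothetical `direct_from` key naming the parent entity a direct feature was taken from. None of these names come from featuretools, and `device_type` is an invented example column.

```python
def aggregation_base_features(child_features, target_entity_id):
    """Keep the child features that may serve as aggregation inputs.

    Each feature is a dict with hypothetical keys:
      "name": display name of the feature
      "direct_from": parent entity id for a direct feature, else None
    Direct features taken from the target itself are dropped, because
    aggregating them back to the target only reproduces the original value.
    """
    return [
        f for f in child_features
        if f["direct_from"] != target_entity_id
    ]


# Features on `sessions` while building features for `customers`.
sessions_features = [
    {"name": "device_type", "direct_from": None},
    {"name": "customers.age", "direct_from": "customers"},
]
kept = aggregation_base_features(sessions_features, "customers")
```

With this filter, `customers.age` is never offered as an aggregation input, so a feature like LAST(sessions.customers.age) is not generated.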

First on identity and aggregations, then on direct features.

This allows features to be constructed which are direct features of
aggregations of transforms on the target.
E.g. "log: sessions.MEAN(log.ABSOLUTE(value))"

Previously this was not being built because the transform was not built
when the aggregation was being built.
Contributor Author

@kmax12 the entityset tests revealed that this was no longer building direct features of aggregations of transforms. For example, log: sessions.MEAN(log.ABSOLUTE(value)) would not have been built because the ABSOLUTE transform feature was not built until after we recursed into sessions.

To solve this I made two calls to _build_transform_features: one in the original location to build on identity and aggregation features, and one at the end to build on direct features. To avoid making duplicate features, the second call only uses sets of inputs which contain at least one direct feature.
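
The two-pass idea can be illustrated with a small sketch over pairs of inputs. It is a simplified model under stated assumptions, not the code from this PR: `Feat`, `candidate_pairs`, and the feature names are hypothetical, and it ignores the separate rule about direct features that share a relationship path.

```python
from itertools import combinations
from typing import Iterator, List, NamedTuple, Tuple


class Feat(NamedTuple):
    """Toy feature record; `is_direct` marks a direct feature."""
    name: str
    is_direct: bool = False


def candidate_pairs(
    features: List[Feat], require_direct: bool = False
) -> Iterator[Tuple[Feat, Feat]]:
    """Yield input pairs; optionally keep only pairs with a direct feature."""
    for pair in combinations(features, 2):
        if require_direct and not any(f.is_direct for f in pair):
            continue
        yield pair


# First pass: identity and aggregation features only.
base = [Feat("value"), Feat("MEAN(log.value)")]
# Direct features become available after recursing into parent entities.
direct = [Feat("customers.age", is_direct=True)]

first_pass = list(candidate_pairs(base))
second_pass = list(candidate_pairs(base + direct, require_direct=True))
```

Because the second pass rejects any pair without a direct feature, it cannot repeat a pair from the first pass, and together the two passes cover every pair of the combined feature list.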

Contributor

@kmax12 kmax12 left a comment

looks good. just some comments on how to structure.

featuretools/synthesis/deep_feature_synthesis.py Outdated
@CJStadler CJStadler requested a review from kmax12 July 24, 2019 14:45
Contributor

@kmax12 kmax12 left a comment

LGTM

@CJStadler CJStadler merged commit c583ff1 into master Jul 25, 2019
@CJStadler CJStadler deleted the dfs-trans-after-direct branch July 25, 2019 21:32
@rwedge rwedge mentioned this pull request Aug 19, 2019

Successfully merging this pull request may close these issues.

Why not build Entity features after Direct features?
Primitive stacking on direct features
3 participants