Set index after adding ancestor relationship variables #668
Codecov Report
@@            Coverage Diff             @@
##           master     #668      +/-   ##
==========================================
+ Coverage   97.44%   97.44%   +<.01%
==========================================
  Files         118      118
  Lines        9618     9634      +16
==========================================
+ Hits         9372     9388      +16
  Misses        246      246
LGTM!
@@ -337,6 +337,9 @@ def _add_ancestor_relationship_variables(self, child_df, parent_df,
                       left_on=relationship.child_variable.id,
                       right_on=relationship.child_variable.id)

        # ensure index is maintained
        df = df.set_index(relationship.child_entity.index, drop=False)
Probably not a big deal, but inplace=True looks like it might be faster.
In [1]: import pandas as pd
In [2]: df10k = pd.DataFrame({'a': range(10000)}, index=range(10000))
In [3]: df100k = pd.DataFrame({'a': range(100000)}, index=range(100000))
In [4]: %timeit df10k.set_index('a', drop=False)
312 µs ± 7.26 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [5]: %timeit df100k.set_index('a', drop=False)
912 µs ± 50.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [6]: %timeit df10k.set_index('a', drop=False, inplace=True)
101 µs ± 5.42 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [7]: %timeit df100k.set_index('a', drop=False, inplace=True)
113 µs ± 25.7 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
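For anyone who wants to repeat this comparison outside IPython, here is a rough sketch using the standard-library timeit module. The helper name time_set_index and the repeat count are made up for illustration, and the numbers will vary with machine and pandas version, so treat it as a way to re-measure rather than as evidence either way.

```python
import timeit

import pandas as pd


def time_set_index(n_rows, inplace, number=200):
    """Return the average seconds per set_index call on an n_rows frame.

    Hypothetical helper for illustration; not part of Featuretools.
    """
    df = pd.DataFrame({"a": range(n_rows)}, index=range(n_rows))

    def call():
        # drop=False keeps column "a", so repeated calls do the same work
        # even when inplace=True has already re-indexed the frame.
        return df.set_index("a", drop=False, inplace=inplace)

    return timeit.timeit(call, number=number) / number


if __name__ == "__main__":
    for n_rows in (10_000, 100_000):
        copy_time = time_set_index(n_rows, inplace=False)
        inplace_time = time_set_index(n_rows, inplace=True)
        print(f"{n_rows:>7} rows: copy {copy_time * 1e6:.0f} µs, "
              f"inplace {inplace_time * 1e6:.0f} µs")
```

Because drop=False leaves the column in place, the in-place variant can be called repeatedly on the same frame, which is what %timeit does above.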
This fixes a bug where the dataframe index was effectively reset after adding ancestor relationship variables. That breaks the later merge when creating aggregation features, because that merge is done on the index:
https://github.com/Featuretools/featuretools/blob/master/featuretools/computational_backends/feature_set_calculator.py#L611
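The failure mode can be reproduced with plain pandas. The sketch below uses made-up toy frames and column names (nothing here is taken from the Featuretools code base) and assumes a left merge: a column-on-column merge discards the child's index, so a later index-on-index merge finds no matches until the index is restored.

```python
import pandas as pd

# Hypothetical toy data; names are illustrative only.
child = pd.DataFrame(
    {"id": ["a", "b", "c"], "parent_id": ["p1", "p2", "p1"]}
).set_index("id", drop=False)
parent = pd.DataFrame(
    {"parent_id": ["p1", "p2"], "grandparent_id": ["g1", "g1"]}
)

# Merging column on column ignores the existing index and returns
# a fresh positional one.
merged = child.merge(parent, how="left",
                     left_on="parent_id", right_on="parent_id")
print(list(merged.index))  # [0, 1, 2]

# Stand-in for aggregation results keyed by the child ids.
agg = pd.DataFrame({"COUNT": [2, 1, 4]}, index=["a", "b", "c"])

# Merging on the index now matches nothing, so COUNT is all NaN.
broken = merged.merge(agg, left_index=True, right_index=True, how="left")

# Restoring the index first, as this patch does, makes the rows line up.
fixed = merged.set_index("id", drop=False).merge(
    agg, left_index=True, right_index=True, how="left"
)
print(fixed["COUNT"].tolist())
```

Passing drop=False keeps the id column in the frame as well as in the index, matching the call in the diff.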
The bug only appears when features are stacked to a certain depth, because it requires creating features for an entity whose dataframe has had ancestor relationship variables added to it.
The test case uses a string index so that the reset is not masked: with an integer index, the reset index can be identical to the existing one.
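A small sketch of why the index type matters for the test, again with hypothetical data and a hypothetical helper name (index_survives_merge): when the ids are 0..n-1, the positional index produced by the merge is indistinguishable from the original one.

```python
import pandas as pd


def index_survives_merge(ids):
    """Return True if the ids still label the rows after a column merge.

    Hypothetical helper for illustration; not part of Featuretools.
    """
    child = pd.DataFrame(
        {"id": ids, "parent_id": ["p1"] * len(ids)}
    ).set_index("id", drop=False)
    parent = pd.DataFrame({"parent_id": ["p1"], "region": ["west"]})
    merged = child.merge(parent, how="left", on="parent_id")
    return list(merged.index) == list(merged["id"])


# Integer ids 0..n-1 coincide with the positional index, hiding the reset.
print(index_survives_merge([0, 1, 2]))        # True
# String ids cannot coincide with it, so the reset is visible.
print(index_survives_merge(["a", "b", "c"]))  # False
```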
Fixes #643