Set index after adding ancestor relationship variables #668

Merged
merged 5 commits into from
Jul 19, 2019

@kmax12 kmax12 commented Jul 14, 2019

This fixes a bug where we essentially reset the dataframe index after adding ancestor relationship variables. That breaks merging later, when we try to create aggregation features, because we merge on the index:

https://github.com/Featuretools/featuretools/blob/master/featuretools/computational_backends/feature_set_calculator.py#L611

This only occurs when features are stacked to a certain depth, because it requires creating features for an entity whose dataframe has had ancestor relationship variables added to it.

The test case uses a string index so that the reset is not masked: with a default integer index, the reset index could happen to equal the existing one.
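
The failure mode described above can be reproduced with plain pandas. The following is a minimal sketch, not the featuretools code itself: the frames, the column names (`id`, `parent_id`, `grandparent_id`) and the aggregate column `COUNT(rows)` are all hypothetical, chosen only to mirror the situation in this PR.

```python
import pandas as pd

# Hypothetical child entity with a string index (as in the test case).
child = pd.DataFrame(
    {"id": ["a", "b", "c"], "parent_id": ["p1", "p2", "p1"]},
    index=["a", "b", "c"],
)
# Hypothetical parent entity carrying an ancestor relationship variable.
parent = pd.DataFrame(
    {"parent_id": ["p1", "p2"], "grandparent_id": ["g1", "g1"]}
)

# Merging on columns ignores the existing index and returns a fresh
# integer index, which is the implicit "reset" this PR is about.
merged = child.merge(parent, left_on="parent_id", right_on="parent_id")

# The fix in this PR: set the index again, keeping the column.
restored = merged.set_index("id", drop=False)

# Hypothetical aggregation result, indexed by the child's string ids.
agg = pd.DataFrame({"COUNT(rows)": [2, 1, 3]}, index=["a", "b", "c"])

# Merging on the index finds no matches when the index was reset...
broken = merged.merge(agg, left_index=True, right_index=True, how="left")
# ...and lines up again once the index is restored.
fixed = restored.merge(agg, left_index=True, right_index=True, how="left")

print(broken["COUNT(rows)"].isna().all())
print(fixed["COUNT(rows)"].notna().all())
```

With an integer `id` column equal to the row position, `broken` would look correct by accident, which is why a string index is needed to expose the bug.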

Fixes #643

codecov bot commented Jul 14, 2019

Codecov Report

Merging #668 into master will increase coverage by <.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master     #668      +/-   ##
==========================================
+ Coverage   97.44%   97.44%   +<.01%     
==========================================
  Files         118      118              
  Lines        9618     9634      +16     
==========================================
+ Hits         9372     9388      +16     
  Misses        246      246

Impacted Files                                           Coverage Δ
...mputational_backend/test_feature_set_calculator.py    100% <100%> (ø) ⬆️
...s/computational_backends/feature_set_calculator.py    98.1% <100%> (ø) ⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update d191c64...e452c62.

CJStadler previously approved these changes Jul 18, 2019
@CJStadler CJStadler left a comment

LGTM!

@@ -337,6 +337,9 @@ def _add_ancestor_relationship_variables(self, child_df, parent_df,
left_on=relationship.child_variable.id,
right_on=relationship.child_variable.id)

# ensure index is maintained
df = df.set_index(relationship.child_entity.index, drop=False)

Probably not a big deal, but inplace looks like it might be faster.

In [1]: import pandas as pd

In [2]: df10k = pd.DataFrame({'a': range(10000)}, index=range(10000))

In [3]: df100k = pd.DataFrame({'a': range(100000)}, index=range(100000))

In [4]: %timeit df10k.set_index('a', drop=False)
312 µs ± 7.26 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [5]: %timeit df100k.set_index('a', drop=False)
912 µs ± 50.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [6]: %timeit df10k.set_index('a', drop=False, inplace=True)
101 µs ± 5.42 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [7]: %timeit df100k.set_index('a', drop=False, inplace=True)
113 µs ± 25.7 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
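
Outside IPython, the same comparison could be run with the standard `timeit` module. This is only a sketch under assumptions: `make_frame` is a hypothetical helper, the repetition count is arbitrary, and it assumes a pandas version in which `set_index` still accepts `inplace`. It also checks that both forms produce the same frame, so any timing difference is not a difference in behavior.

```python
import timeit

import pandas as pd


def make_frame(n):
    # Hypothetical helper: one column 'a' with a default-style integer index.
    return pd.DataFrame({"a": range(n)}, index=range(n))


df = make_frame(10_000)
# Without inplace, set_index returns a new frame and leaves df untouched.
copied = df.set_index("a", drop=False)

inplace_df = make_frame(10_000)
# With inplace=True, the frame is modified and the call returns None.
result = inplace_df.set_index("a", drop=False, inplace=True)

# Both forms should end up with the same data and index.
same = copied.equals(inplace_df)

# Timings depend on the machine and pandas version, so none are asserted here.
t_copy = timeit.timeit(lambda: df.set_index("a", drop=False), number=100)
t_inplace = timeit.timeit(
    lambda: inplace_df.set_index("a", drop=False, inplace=True), number=100
)
print(same, t_copy, t_inplace)
```

Note that the in-place variant is timed repeatedly on a frame whose index has already been set by the previous iteration, which is also the case in the `%timeit` runs above.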

@kmax12 kmax12 merged commit 278c0c4 into master Jul 19, 2019
@rwedge rwedge mentioned this pull request Aug 19, 2019
@kmax12 kmax12 deleted the set-index-featureset-calculator branch September 11, 2019 15:21
Successfully merging this pull request may close these issues.

Depth 3 feature always equals to 0