Improved chunking when calculating feature matrices #121

Merged
merged 38 commits into from Mar 28, 2018

@rwedge rwedge commented Mar 26, 2018

Instead of calculating all rows of a feature matrix that share the same cutoff time together, Featuretools breaks the feature matrix rows into chunks to calculate separately, prioritizing grouping rows with the same cutoff time in the same chunk.
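As a rough illustration of the idea (not the code in this PR), the sketch below splits rows into fixed-size chunks while keeping rows that share a cutoff time together where possible. The `chunk_rows` helper, the `(instance_id, cutoff_time)` row layout, and the leftover-pooling strategy are all assumptions made for the example.

```python
from collections import defaultdict


def chunk_rows(rows, chunk_size):
    """Split (instance_id, cutoff_time) rows into chunks of at most chunk_size rows.

    Hypothetical helper: rows sharing a cutoff time are kept in the same
    chunk where possible. This is not the Featuretools implementation.
    """
    if chunk_size < 1:
        raise ValueError("chunk_size must be a positive integer")

    # Group rows by cutoff time, preserving first-seen order of cutoff times.
    groups = defaultdict(list)
    for row in rows:
        groups[row[1]].append(row)

    chunks = []
    leftovers = []
    for group in groups.values():
        # Emit as many full, single-cutoff chunks as the group allows.
        full = len(group) - len(group) % chunk_size
        for start in range(0, full, chunk_size):
            chunks.append(group[start:start + chunk_size])
        # Rows that do not fill a whole chunk are pooled with other leftovers.
        leftovers.extend(group[full:])

    # Pack the pooled leftovers; rows with equal cutoff times stay adjacent.
    for start in range(0, len(leftovers), chunk_size):
        chunks.append(leftovers[start:start + chunk_size])
    return chunks


rows = [(0, "2018-01-01"), (1, "2018-01-01"), (2, "2018-01-01"),
        (3, "2018-01-02"), (4, "2018-01-03")]
chunks = chunk_rows(rows, chunk_size=2)
```

Here the three rows at the first cutoff time yield one full chunk, and the remaining row is packed together with rows from the later cutoff times.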

ft.dfs and ft.calculate_feature_matrix now have a chunk_size parameter to allow for custom chunk sizes. chunk_size accepts positive integers for explicit chunk sizes or floats between 0 and 1 for percentage-based chunk sizes. The old 'group all rows that share a cutoff time' method can be used by setting the string "cutoff time" as the chunk size.
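To make the accepted values concrete, here is a minimal sketch of how a `chunk_size` argument could be normalized into a row count. `resolve_chunk_size` is a hypothetical helper, and both the rounding rule and the fallback used for `None` are assumptions for illustration, not behavior taken from this PR.

```python
import math

CUTOFF_TIME = "cutoff time"


def resolve_chunk_size(chunk_size, num_rows, default=0.1):
    """Turn a user-supplied chunk_size into a row count or the cutoff-time sentinel.

    Hypothetical helper. The `default` used when chunk_size is None is an
    assumption, as is rounding percentage-based sizes up to a whole row.
    """
    if chunk_size is None:
        chunk_size = default
    if chunk_size == CUTOFF_TIME:
        # Old behavior: one chunk per group of rows sharing a cutoff time.
        return CUTOFF_TIME
    if isinstance(chunk_size, bool):
        raise ValueError("chunk_size must not be a bool")
    if isinstance(chunk_size, int) and chunk_size > 0:
        # Explicit number of rows per chunk.
        return chunk_size
    if isinstance(chunk_size, float) and 0 < chunk_size < 1:
        # Percentage of all rows, rounded up so a chunk has at least one row.
        return max(1, math.ceil(chunk_size * num_rows))
    raise ValueError(
        "chunk_size must be a positive integer, a float between 0 and 1, "
        "None, or the string 'cutoff time'"
    )
```

For example, `resolve_chunk_size(500, 1000)` passes the integer through, `resolve_chunk_size(0.25, 1000)` resolves to a quarter of the rows, and `resolve_chunk_size("cutoff time", 1000)` returns the sentinel unchanged.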

The documentation also has new information on using chunking and other parameters to improve the performance of Featuretools.

@kmax12 kmax12 changed the title Chunking Improved chunking when calculating feature matrices Mar 26, 2018

codecov-io commented Mar 26, 2018

Codecov Report

Merging #121 into master will increase coverage by 0.19%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master     #121      +/-   ##
==========================================
+ Coverage   88.27%   88.47%   +0.19%     
==========================================
  Files          73       73              
  Lines        7457     7558     +101     
==========================================
+ Hits         6583     6687     +104     
+ Misses        874      871       -3

Impacted Files                                           Coverage Δ
featuretools/synthesis/dfs.py                            100% <ø> (ø) ⬆️
featuretools/computational_backends/api.py               100% <ø> (ø) ⬆️
...computational_backends/calculate_feature_matrix.py    98.59% <100%> (+1.7%) ⬆️
...utational_backend/test_calculate_feature_matrix.py    98.93% <100%> (+0.13%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 6f1b813...97cfd5a.

@@ -82,6 +82,18 @@ def calculate_feature_matrix(features, cutoff_time=None, instance_ids=None,

profile (bool, optional): Enables profiling if True.

chunk_size (int or float or None or "cutoff time"): Instead of

Because we have the usage guide, let's just make this description:

Number of rows of output feature matrix to calculate at a time. If passed an integer greater than 0, will try to use that many rows per chunk. If passed a float value between 0 and 1, sets the chunk size to that percentage of all instances. If passed the string "cutoff time", rows are split per cutoff time.


chunks = []

for group_name in groups:

@kmax12 kmax12 Mar 26, 2018

I can understand the logic in this function, but can you add some brief comments explaining it?

else:
    for chunk in iterator:
        chunks.append(chunk)

pbar_string = ("Elapsed: {elapsed} | Remaining: {remaining} | "
"Progress: {l_bar}{bar}|| "

Can we get rid of the || after the progress bar? The output looks better without it.

kmax12 commented Mar 28, 2018

Good work! Merging in

@kmax12 kmax12 merged commit d60c664 into master Mar 28, 2018
@rwedge rwedge mentioned this pull request Apr 13, 2018
rwedge added a commit that referenced this pull request Apr 13, 2018
**v0.1.20** Apr 13, 2018
* Improved chunking when calculating feature matrices  (#121)
* Primitives as strings in DFS parameters (#129)
* Integer time index bugfixes (#128)
* Add make_temporal_cutoffs utility function (#126)
* Show all entities, switch shape display to row/col (#124)
* fixed num characters nan fix (#118)
* modify ignore_variables docstring (#117)
@kmax12 kmax12 deleted the chunking branch June 11, 2018 21:35

5 participants