Improved chunking when calculating feature matrices #121
…es that need them
* lint
* Sort imports
Codecov Report
@@ Coverage Diff @@
## master #121 +/- ##
==========================================
+ Coverage 88.27% 88.47% +0.19%
==========================================
Files 73 73
Lines 7457 7558 +101
==========================================
+ Hits 6583 6687 +104
+ Misses 874 871 -3
@@ -82,6 +82,18 @@ def calculate_feature_matrix(features, cutoff_time=None, instance_ids=None,

        profile (bool, optional): Enables profiling if True.

        chunk_size (int or float or None or "cutoff time"): Instead of
Because we have the usage guide, let's just make this description:
Number of rows of output feature matrix to calculate at a time. If passed an integer greater than 0, will try to use that many rows per chunk. If passed a float value between 0 and 1, sets the chunk size to that percentage of all instances. If passed the string "cutoff time", rows are split per cutoff time.
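To make the proposed description concrete, here is a minimal sketch of how a `chunk_size` argument could be turned into a row count. It is not the code from this PR: `rows_per_chunk`, `num_instances`, and `default_fraction` are hypothetical names, rounding up is an assumption, and the description does not say what `None` means, so the sketch falls back to a caller-supplied fraction.

```python
import math


def rows_per_chunk(chunk_size, num_instances, default_fraction=0.1):
    """Hypothetical helper: resolve a chunk_size argument to a row count.

    Returns None for "cutoff time", meaning "one chunk per cutoff time".
    """
    if chunk_size == "cutoff time":
        return None
    if chunk_size is None:
        # Assumption: None is not described in the comment above, so use
        # a caller-supplied fraction of all instances instead.
        chunk_size = default_fraction
    if isinstance(chunk_size, bool):
        # bool is a subclass of int in Python; reject it explicitly.
        raise ValueError("chunk_size must not be a bool")
    if isinstance(chunk_size, int) and chunk_size > 0:
        # Positive integer: use that many rows per chunk.
        return chunk_size
    if isinstance(chunk_size, float) and 0 < chunk_size < 1:
        # Fraction of all instances; rounding up (an assumption) keeps
        # every chunk at one row or more.
        return max(1, math.ceil(chunk_size * num_instances))
    raise ValueError(
        'chunk_size must be a positive int, a float between 0 and 1, '
        'None, or "cutoff time"'
    )
```

For example, `rows_per_chunk(0.25, 1000)` resolves to 250 rows, while `rows_per_chunk("cutoff time", 1000)` returns `None` to signal per-cutoff-time splitting.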
    chunks = []

    for group_name in groups:
I can understand the logic in this function, but can you add some brief comments explaining it?
    else:
        for chunk in iterator:
            chunks.append(chunk)

    pbar_string = ("Elapsed: {elapsed} | Remaining: {remaining} | "
                   "Progress: {l_bar}{bar}|| "
Can we get rid of the `||` after the progress bar? The output looks better without it.
…ose option from approximate_features
Good work! Merging in
**v0.1.20** Apr 13, 2018
* Improved chunking when calculating feature matrices (#121)
* Primitives as strings in DFS parameters (#129)
* Integer time index bugfixes (#128)
* Add make_temporal_cutoffs utility function (#126)
* Show all entities, switch shape display to row/col (#124)
* fixed num characters nan fix (#118)
* modify ignore_variables docstring (#117)
Instead of calculating all rows of a feature matrix that share the same cutoff time together, Featuretools breaks the feature matrix rows into chunks to calculate separately, prioritizing grouping rows with the same cutoff time in the same chunk.
`ft.dfs` and `ft.calculate_feature_matrix` now have a `chunk_size` parameter to allow for custom chunk sizes. `chunk_size` accepts positive integers for explicit chunk sizes or floats between 0 and 1 for percentage-based chunk sizes. The old 'group all rows that share a cutoff time' method can be used by setting the string "cutoff time" as the chunk size.

There is also new information in the documentation about using chunking and other parameters to improve the performance of Featuretools.
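The chunking behaviour described above can be illustrated with a small, self-contained sketch. This is one plausible way to favour keeping rows with the same cutoff time together, not the implementation merged in this PR. The function name `chunk_rows`, the `(instance_id, cutoff_time)` row layout, and the choice to pack leftover rows from different cutoff times into shared chunks are all assumptions made for illustration.

```python
from collections import defaultdict


def chunk_rows(rows, size):
    """Hypothetical sketch: split (instance_id, cutoff_time) rows into chunks.

    size is the maximum number of rows per chunk; size=None mimics the
    "cutoff time" mode, with exactly one chunk per distinct cutoff time.
    """
    # Group rows by cutoff time, preserving input order within each group.
    groups = defaultdict(list)
    for instance_id, cutoff_time in rows:
        groups[cutoff_time].append((instance_id, cutoff_time))

    if size is None:
        return [groups[t] for t in sorted(groups)]

    chunks = []
    leftovers = []
    for cutoff_time in sorted(groups):
        group = groups[cutoff_time]
        # Emit as many full chunks as possible from a single cutoff time.
        full = len(group) - len(group) % size
        for start in range(0, full, size):
            chunks.append(group[start:start + size])
        # Rows that do not fill a whole chunk are packed together later.
        leftovers.extend(group[full:])

    # Leftovers are already ordered by cutoff time, so neighbouring rows
    # in these mixed chunks still tend to share a cutoff time.
    for start in range(0, len(leftovers), size):
        chunks.append(leftovers[start:start + size])
    return chunks
```

With five rows at cutoff time 1, two at cutoff time 2, one at cutoff time 3, and `size=2`, the sketch yields three chunks that each contain a single cutoff time, plus one mixed chunk holding the two leftover rows. Passing `size=None` instead yields one chunk per cutoff time, whatever its length.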