Add per column imputer #824
Codecov Report
Coverage Diff

| | master | #824 | +/- |
|---|---|---|---|
| Coverage | 99.67% | 99.68% | |
| Files | 191 | 193 | +2 |
| Lines | 7428 | 7530 | +102 |
| Hits | 7404 | 7506 | +102 |
| Misses | 24 | 24 | |
was there an offline discussion where we decided whether or not to treat this as a new component? my initial inclination would be to just add support in our current imputer for passing a dictionary of column names to the current strategy parameter, but i would be curious to discuss.
@kmax12 Not yet - I went with this approach first since there will be complications with having input-defined hyperparameter ranges (in this case column names or column indices), but happy to discuss!
@kmax12: after speaking to @dsherry we agreed on having the
Let me know what you think!
@jeremyliweishih that works for me. i could see it going either way, but it's good to have documentation on the rationale and i like that we're taking the simpler path right now.
evalml/pipelines/components/transformers/imputers/per_column_imputer.py
Looking good, left comments about impl, will review unit tests on next pass
Valid values include "mean", "median", "most_frequent", "constant" for numerical data,
and "most_frequent", "constant" for object data types.
fill_value (string): When impute_strategy == "constant", fill_value is used to replace missing data.
Defaults to 0 when imputing numerical data and "missing_value" for strings or object data types.
Where is this default of 0 encoded? I don't see it in the code here--the default value appears to be None
It's part of the scikit-learn implementation:
fill_value : string or numerical value, default=None
When strategy == “constant”, fill_value is used to replace all occurrences of missing_values. If left to the default, fill_value will be 0 when imputing numerical data and “missing_value” for strings or object data types.
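The scikit-learn behavior quoted above can be checked directly. This is a minimal sketch, assuming scikit-learn and NumPy are installed; the arrays are made-up illustration data and are not taken from this PR.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Illustrative data only: one numeric column and one object column,
# each with a single missing entry.
numeric = np.array([[1.0], [np.nan], [3.0]])
strings = np.array([["a"], [np.nan], ["b"]], dtype=object)

# fill_value is left at its default (None), so scikit-learn chooses
# the replacement based on the dtype of the input.
numeric_filled = SimpleImputer(strategy="constant").fit_transform(numeric)
strings_filled = SimpleImputer(strategy="constant").fit_transform(strings)

print(numeric_filled.ravel().tolist())
print(strings_filled.ravel().tolist())
```

With `fill_value` unset, the numeric column should receive 0 and the object column the string "missing_value", which is what the docstring under review describes.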
Left a few more suggestions
imputers = dict()
for column in X.columns:
    strategy = self.impute_strategies.get(column, self.default_impute_strategy)
    print(strategy)
errant print here
+1. We should always use our logger instead of print. And perhaps in this particular case it's better to just delete it.
Looking good, left suggestions, main suggestion is about how to organize the `impute_strategies` data structure
from pandas.testing import assert_frame_equal

from evalml.pipelines.components import PerColumnImputer
@jeremyliweishih are there other error cases which should be covered? What if the strategy is `constant` but there's no fill value? What if the schema of the `impute_strategies` param isn't valid / isn't what your code expects?
@dsherry Sklearn has default values when strategy is `constant` but fill value is not provided (this is documented in our docstring as well). I will add a check that `impute_strategies` is a dictionary, since `strategy_dict = self.impute_strategies.get(column, dict())` already falls back to the `default_impute_strategy` for columns that are not listed.
strategy_dict = self.impute_strategies.get(column, dict())
strategy = strategy_dict.get('impute_strategy', self.default_impute_strategy)
fill_value = strategy_dict.get('fill_value', None)
self.imputers[column] = SkImputer(strategy=strategy, fill_value=fill_value)
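The lookup in the snippet above can be exercised on its own, without pandas or scikit-learn. The following is a rough sketch, not the PR's code: `resolve_column_strategy` is a hypothetical helper, the `"most_frequent"` default and the `ValueError` for a non-dictionary argument are assumptions, and the column names are invented.

```python
# Hypothetical helper, not part of evalml: mirrors the per-column lookup in the diff.
def resolve_column_strategy(impute_strategies, column,
                            default_impute_strategy="most_frequent"):
    """Return (strategy, fill_value) for one column."""
    # Assumed validation: the thread only says a dictionary check will be added.
    if not isinstance(impute_strategies, dict):
        raise ValueError("impute_strategies must be a dictionary")
    # Columns without an entry get an empty dict, so both lookups
    # below fall through to their defaults.
    strategy_dict = impute_strategies.get(column, dict())
    strategy = strategy_dict.get("impute_strategy", default_impute_strategy)
    # None leaves the choice of constant to scikit-learn.
    fill_value = strategy_dict.get("fill_value", None)
    return strategy, fill_value


# Invented column names, for illustration only.
impute_strategies = {
    "age": {"impute_strategy": "mean"},
    "city": {"impute_strategy": "constant", "fill_value": "unknown"},
}

resolved = {
    col: resolve_column_strategy(impute_strategies, col)
    for col in ["age", "city", "zip"]
}
print(resolved)
```

Here `zip` has no entry, so it picks up the default strategy and a `fill_value` of `None`, while `age` and `city` use their own settings.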
👍 cool
👍 🚢
Fixes #768.