Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add per column imputer #824

Merged
merged 24 commits into from Jun 9, 2020
Merged

Add per column imputer #824

merged 24 commits into from Jun 9, 2020

Conversation

jeremyliweishih
Copy link
Contributor

Fixes #768.

@codecov
Copy link

codecov bot commented Jun 1, 2020

Codecov Report

Merging #824 into master will increase coverage by 0.00%.
The diff coverage is 100.00%.

Impacted file tree graph

@@           Coverage Diff            @@
##           master     #824    +/-   ##
========================================
  Coverage   99.67%   99.68%            
========================================
  Files         191      193     +2     
  Lines        7428     7530   +102     
========================================
+ Hits         7404     7506   +102     
  Misses         24       24            
Impacted Files Coverage Δ
evalml/pipelines/__init__.py 100.00% <ø> (ø)
evalml/pipelines/components/__init__.py 100.00% <ø> (ø)
evalml/pipelines/components/utils.py 100.00% <ø> (ø)
...alml/pipelines/components/transformers/__init__.py 100.00% <100.00%> (ø)
...lines/components/transformers/imputers/__init__.py 100.00% <100.00%> (ø)
...onents/transformers/imputers/per_column_imputer.py 100.00% <100.00%> (ø)
evalml/tests/component_tests/test_components.py 100.00% <100.00%> (ø)
...l/tests/component_tests/test_per_column_imputer.py 100.00% <100.00%> (ø)
evalml/tests/component_tests/test_utils.py 96.42% <100.00%> (ø)
... and 1 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 167d1ed...96ba84f. Read the comment docs.

@kmax12
Copy link
Contributor

kmax12 commented Jun 1, 2020

was there an offline discussion where we decide whether or not to treat this a new component?

my initial inclination would be to just add support in our current imputer to pass a dictionary of columns names to the current strategy parameter, but i would be curious to discuss.

@jeremyliweishih
Copy link
Contributor Author

was there an offline discussion where we decide whether or not to treat this a new component?

my initial inclination would be to just add support in our current imputer to pass a dictionary of columns names to the current strategy method, but i would be curious to discuss.

@kmax12 Not yet - I went with this approach first since there will be complications with having input defined hyperparameter ranges (in this case column names or index of columns) but happy to discuss!

@jeremyliweishih
Copy link
Contributor Author

@kmax12: after speaking to @dsherry we agreed on having the PerColumnImputer as a separate component for two reasons.

  1. If we were to replace SimpleImputer, we would need to design and discuss how automl and tuners accept input (number of columns etc.) as factors to consider. This can be done outside the scope of this PR.

  2. The purpose of creating this component would to be used in EvalML pipelines and outside the scope of automl.

Let me know what you think!

@jeremyliweishih jeremyliweishih marked this pull request as ready for review June 1, 2020 20:59
@kmax12
Copy link
Contributor

kmax12 commented Jun 1, 2020

@jeremyliweishih that works for me. i could see it going either way, but good to have documentation on rationale and i like that were taking the simpler path right now.

Copy link
Contributor

@dsherry dsherry left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good, left comments about impl, will review unit tests on next pass

docs/source/changelog.rst Outdated Show resolved Hide resolved
Valid values include "mean", "median", "most_frequent", "constant" for numerical data,
and "most_frequent", "constant" for object data types.
fill_value (string): When impute_strategy == "constant", fill_value is used to replace missing data.
Defaults to 0 when imputing numerical data and "missing_value" for strings or object data types.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where is this default of 0 encoded? I don't see it in the code here--the default value appears to be None

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Its part of the sk-learn implementation:

fill_value : string or numerical value, default=None

When strategy == “constant”, fill_value is used to replace all occurrences of missing_values. If left to the default, fill_value will be 0 when imputing numerical data and “missing_value” for strings or object data types.

Copy link
Contributor

@dsherry dsherry left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left a few more suggestions

imputers = dict()
for column in X.columns:
strategy = self.impute_strategies.get(column, self.default_impute_strategy)
print(strategy)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

errant print here

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1. We should always use our logger instead of print. And perhaps in this particular case its better to just delete

Copy link
Contributor

@dsherry dsherry left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good, left suggestions, main suggestion is about how to organize the impute_strategies datastructure

imputers = dict()
for column in X.columns:
strategy = self.impute_strategies.get(column, self.default_impute_strategy)
print(strategy)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1. We should always use our logger instead of print. And perhaps in this particular case its better to just delete

evalml/tests/component_tests/test_per_column_imputer.py Outdated Show resolved Hide resolved
evalml/tests/component_tests/test_per_column_imputer.py Outdated Show resolved Hide resolved
evalml/tests/component_tests/test_per_column_imputer.py Outdated Show resolved Hide resolved
from pandas.testing import assert_frame_equal

from evalml.pipelines.components import PerColumnImputer

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jeremyliweishih are there other error cases which should be covered? What if the strategy is constant but there's no fill value? What if the schema of the impute_strategies param isn't valid / isn't what your code expects?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dsherry Sklearn has default values when strategy is constant but fill value is not provided (this is documented in our docstring as well). I will add a check for impute_strategies to be a dictionary - since strategy_dict = self.impute_strategies.get(column, dict()) provides a default into the default_impute_strategy.

strategy_dict = self.impute_strategies.get(column, dict())
strategy = strategy_dict.get('impute_strategy', self.default_impute_strategy)
fill_value = strategy_dict.get('fill_value', None)
self.imputers[column] = SkImputer(strategy=strategy, fill_value=fill_value)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 cool

Copy link
Contributor

@dsherry dsherry left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 🚢

@jeremyliweishih jeremyliweishih merged commit 2ed8545 into master Jun 9, 2020
@angela97lin angela97lin mentioned this pull request Jun 30, 2020
@dsherry dsherry deleted the js_column_768 branch October 29, 2020 23:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

SimpleImputer: support different strategy for each column
4 participants