Add per column imputer #824

jeremyliweishih · 2020-06-01T15:36:14Z

Fixes #768.

codecov · 2020-06-01T15:39:25Z

Codecov Report

Merging #824 into master will increase coverage by 0.00%.
The diff coverage is 100.00%.

@@           Coverage Diff            @@
##           master     #824    +/-   ##
========================================
  Coverage   99.67%   99.68%            
========================================
  Files         191      193     +2     
  Lines        7428     7530   +102     
========================================
+ Hits         7404     7506   +102     
  Misses         24       24

Impacted Files	Coverage Δ
evalml/pipelines/__init__.py	`100.00% <ø> (ø)`
evalml/pipelines/components/__init__.py	`100.00% <ø> (ø)`
evalml/pipelines/components/utils.py	`100.00% <ø> (ø)`
...alml/pipelines/components/transformers/__init__.py	`100.00% <100.00%> (ø)`
...lines/components/transformers/imputers/__init__.py	`100.00% <100.00%> (ø)`
...onents/transformers/imputers/per_column_imputer.py	`100.00% <100.00%> (ø)`
evalml/tests/component_tests/test_components.py	`100.00% <100.00%> (ø)`
...l/tests/component_tests/test_per_column_imputer.py	`100.00% <100.00%> (ø)`
evalml/tests/component_tests/test_utils.py	`96.42% <100.00%> (ø)`
... and 1 more

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 167d1ed...96ba84f.

kmax12 · 2020-06-01T16:22:58Z

was there an offline discussion where we decided whether or not to treat this as a new component?

my initial inclination would be to just add support in our current imputer for passing a dictionary of column names to the current strategy parameter, but i would be curious to discuss.

jeremyliweishih · 2020-06-01T16:25:14Z

> was there an offline discussion where we decided whether or not to treat this as a new component?
>
> my initial inclination would be to just add support in our current imputer to pass a dictionary of column names to the current strategy method, but i would be curious to discuss.

@kmax12 Not yet. I went with this approach first since there will be complications with having input-defined hyperparameter ranges (in this case column names or column indices), but happy to discuss!

jeremyliweishih · 2020-06-01T20:27:12Z

@kmax12: after speaking to @dsherry we agreed on having the PerColumnImputer as a separate component for two reasons.

1. If we were to replace SimpleImputer, we would need to design and discuss how automl and tuners accept input (number of columns etc.) as factors to consider. This can be done outside the scope of this PR.
2. The purpose of creating this component would be for it to be used in EvalML pipelines, outside the scope of automl.

Let me know what you think!

kmax12 · 2020-06-01T21:20:52Z

@jeremyliweishih that works for me. i could see it going either way, but it's good to have documentation on the rationale, and i like that we're taking the simpler path right now.

dsherry

Looking good, left comments about impl, will review unit tests on next pass

dsherry · 2020-06-03T16:49:21Z

evalml/pipelines/components/transformers/imputers/per_column_imputer.py

+                Valid values include "mean", "median", "most_frequent", "constant" for numerical data,
+                and "most_frequent", "constant" for object data types.
+            fill_value (string): When impute_strategy == "constant", fill_value is used to replace missing data.
+               Defaults to 0 when imputing numerical data and "missing_value" for strings or object data types.


Where is this default of 0 encoded? I don't see it in the code here--the default value appears to be None

jeremyliweishih · Jun 4, 2020

It's part of the scikit-learn implementation:

> fill_value : string or numerical value, default=None
>
> When strategy == “constant”, fill_value is used to replace all occurrences of missing_values. If left to the default, fill_value will be 0 when imputing numerical data and “missing_value” for strings or object data types.
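
The scikit-learn defaults quoted above can be checked directly. This is a minimal sketch, assuming scikit-learn and NumPy are installed; the sample arrays are made up for illustration and are not taken from the PR's tests.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up sample data: one numeric column and one object (string) column,
# each with a single missing entry.
numeric = np.array([[1.0], [np.nan], [3.0]])
strings = np.array([["a"], [np.nan], ["b"]], dtype=object)

# strategy="constant" with fill_value left at its default of None.
numeric_filled = SimpleImputer(strategy="constant").fit_transform(numeric)
strings_filled = SimpleImputer(strategy="constant").fit_transform(strings)

print(numeric_filled.ravel().tolist())  # [1.0, 0.0, 3.0]
print(strings_filled.ravel().tolist())  # ['a', 'missing_value', 'b']
```

So a `fill_value` of None in the component is resolved by scikit-learn itself rather than by code in this PR.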

dsherry

Left a few more suggestions

kmax12 · 2020-06-08T14:46:20Z

evalml/pipelines/components/transformers/imputers/per_column_imputer.py

+        imputers = dict()
+        for column in X.columns:
+            strategy = self.impute_strategies.get(column, self.default_impute_strategy)
+            print(strategy)


errant print here

dsherry

Looking good, left suggestions. The main one is about how to organize the impute_strategies data structure.

dsherry · 2020-06-08T15:09:01Z

evalml/pipelines/components/transformers/imputers/per_column_imputer.py

+        imputers = dict()
+        for column in X.columns:
+            strategy = self.impute_strategies.get(column, self.default_impute_strategy)
+            print(strategy)


+1. We should always use our logger instead of print. And perhaps in this particular case it's better to just delete it.
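
For reference, a rough sketch of what the logger suggestion could look like if the message were kept rather than deleted. It uses the standard library `logging` module; the logger name and the `pick_strategies` helper are hypothetical and stand in for the project's own logger and the loop shown in the diff above.

```python
import logging

# Hypothetical logger name; the project's real logger setup may differ.
logger = logging.getLogger("per_column_imputer_example")


def pick_strategies(columns, impute_strategies, default_impute_strategy):
    """Return the strategy chosen for each column, logging at DEBUG level."""
    chosen = {}
    for column in columns:
        strategy = impute_strategies.get(column, default_impute_strategy)
        # With default logging settings a DEBUG message is not emitted,
        # whereas print always writes to stdout.
        logger.debug("column %r uses strategy %r", column, strategy)
        chosen[column] = strategy
    return chosen


# "a" has an explicit strategy, "b" falls back to the default.
chosen = pick_strategies(["a", "b"], {"a": "median"}, "most_frequent")
```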

dsherry · 2020-06-08T15:33:17Z

evalml/tests/component_tests/test_per_column_imputer.py

+from pandas.testing import assert_frame_equal
+
+from evalml.pipelines.components import PerColumnImputer
+


@jeremyliweishih are there other error cases which should be covered? What if the strategy is constant but there's no fill value? What if the schema of the impute_strategies param isn't valid / isn't what your code expects?

@dsherry scikit-learn supplies default values when the strategy is constant but no fill value is provided (this is documented in our docstring as well). I will add a check that impute_strategies is a dictionary. A column missing from it is already handled, since strategy_dict = self.impute_strategies.get(column, dict()) falls back to default_impute_strategy.
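
A minimal sketch of the kind of schema check discussed here. The function name, the `ValueError` type, the messages, and the extra per-column check are all assumptions made for illustration; the merged code may validate differently.

```python
def check_impute_strategies(impute_strategies):
    """Return a usable dict, or raise ValueError if the schema looks wrong.

    Hypothetical helper: treats None as "no per-column overrides".
    """
    if impute_strategies is None:
        return {}
    if not isinstance(impute_strategies, dict):
        raise ValueError("impute_strategies must be a dictionary keyed by column")
    for column, strategy_dict in impute_strategies.items():
        # Assumed extra check: each column maps to its own settings dict.
        if not isinstance(strategy_dict, dict):
            raise ValueError(
                "settings for column {!r} must be a dictionary".format(column)
            )
    return impute_strategies


# A list is not a valid schema, so the check rejects it.
try:
    check_impute_strategies(["mean"])
    error_message = None
except ValueError as err:
    error_message = str(err)
```

A unit test for the error case could then assert that constructing or fitting the component with a non-dictionary argument raises.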

dsherry · 2020-06-09T14:02:15Z

evalml/pipelines/components/transformers/imputers/per_column_imputer.py

+            strategy_dict = self.impute_strategies.get(column, dict())
+            strategy = strategy_dict.get('impute_strategy', self.default_impute_strategy)
+            fill_value = strategy_dict.get('fill_value', None)
+            self.imputers[column] = SkImputer(strategy=strategy, fill_value=fill_value)
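
To make the dict-in-dict lookup above concrete, here is a small standalone sketch of how each column's settings resolve. `resolve_imputer_kwargs`, the column names, and the `"most_frequent"` default are hypothetical; only the two `.get` fallbacks mirror the diff.

```python
def resolve_imputer_kwargs(columns, impute_strategies,
                           default_impute_strategy="most_frequent"):
    """Map each column to the keyword arguments its imputer would receive."""
    kwargs_by_column = {}
    for column in columns:
        # Columns missing from impute_strategies get an empty dict,
        # so both lookups below fall through to their defaults.
        strategy_dict = impute_strategies.get(column, dict())
        kwargs_by_column[column] = {
            "strategy": strategy_dict.get("impute_strategy", default_impute_strategy),
            "fill_value": strategy_dict.get("fill_value", None),
        }
    return kwargs_by_column


# Made-up schema: two columns with overrides, one ("income") without.
impute_strategies = {
    "age": {"impute_strategy": "median"},
    "city": {"impute_strategy": "constant", "fill_value": "unknown"},
}
resolved = resolve_imputer_kwargs(["age", "city", "income"], impute_strategies)
```

Here `income` is not listed, so it resolves to the default strategy with a `fill_value` of None, while `city` keeps its explicit constant.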


dsherry

👍 🚢

jeremyliweishih added 2 commits June 1, 2020 11:35

Add per column imputer and tests

e83f79c

CL

28ba62e

jeremyliweishih added 4 commits June 1, 2020 11:43

Fix util tests

f5ea930

lint

02890ea

Add to api reference

39f84bc

add describe test

c8e73e3

Lint

d76da31

jeremyliweishih added 3 commits June 1, 2020 16:32

Add none case error

552ff14

Revise none case

3694201

lint

0d6dd43

jeremyliweishih marked this pull request as ready for review June 1, 2020 20:59

auto-assign bot assigned jeremyliweishih Jun 1, 2020

jeremyliweishih requested review from dsherry and eccabay June 1, 2020 20:59

eccabay reviewed Jun 2, 2020

dsherry reviewed Jun 2, 2020

evalml/pipelines/components/transformers/imputers/per_column_imputer.py (outdated)

dsherry reviewed Jun 2, 2020

evalml/pipelines/components/transformers/imputers/per_column_imputer.py (outdated)

dsherry reviewed Jun 2, 2020

evalml/pipelines/components/transformers/imputers/per_column_imputer.py (outdated)

dsherry suggested changes Jun 2, 2020

dsherry reviewed Jun 2, 2020

evalml/tests/component_tests/test_components.py (outdated)

Address comments

e4a9c57

jeremyliweishih mentioned this pull request Jun 3, 2020

Standardize: in predict, validate that component/pipeline has been fit #831

Closed

jeremyliweishih requested a review from dsherry June 3, 2020 14:56

dsherry reviewed Jun 3, 2020

docs/source/changelog.rst (outdated)

dsherry reviewed Jun 3, 2020

evalml/pipelines/components/transformers/imputers/per_column_imputer.py (outdated)

dsherry reviewed Jun 3, 2020

evalml/pipelines/components/transformers/imputers/per_column_imputer.py (outdated)

dsherry reviewed Jun 3, 2020

evalml/pipelines/components/transformers/imputers/per_column_imputer.py (outdated)

dsherry reviewed Jun 3, 2020

evalml/pipelines/components/transformers/imputers/per_column_imputer.py (outdated)

dsherry suggested changes Jun 3, 2020

jeremyliweishih added 6 commits June 4, 2020 12:15

Address comments

510bfb0

Merge branch 'master' of https://github.com/FeatureLabs/evalml into j…

b4e9d42

…s_column_768

lint

00375a7

Fix tests after merge

840285e

Merge branch 'master' of https://github.com/FeatureLabs/evalml into j…

c6b2788

…s_column_768

lint

a12f114

jeremyliweishih requested a review from dsherry June 4, 2020 20:54

kmax12 reviewed Jun 8, 2020

dsherry suggested changes Jun 8, 2020

jeremyliweishih added 7 commits June 8, 2020 14:36

Style changes

1790a78

Fix tests

f81c943

Change schema to use dict in dict

703ae81

Add check schema

75f4fb7

Merge branch 'master' of https://github.com/FeatureLabs/evalml into j…

8b99791

…s_column_768

Change checking order

fb74f25

Add self

96ba84f

jeremyliweishih requested a review from dsherry June 9, 2020 00:45

dsherry reviewed Jun 9, 2020

dsherry approved these changes Jun 9, 2020

jeremyliweishih merged commit 2ed8545 into master Jun 9, 2020

angela97lin mentioned this pull request Jun 30, 2020

Release v0.11.0 #901

Merged

dsherry deleted the js_column_768 branch October 29, 2020 23:48
