Split `fill_value` into `categorical_fill_value` and `numeric_fill_value` for Imputer #1019

angela97lin · 2020-08-04T20:09:56Z

Also fixes a bug with imputing pandas category-typed columns. For these columns, the dtype stores the possible category values (e.g. "a", "b"). After imputing new values, we cast the column back to its original dtype, which doesn't contain the imputed value as a category. Any value that isn't in the categories is set to NaN, so the imputed value gets reverted back to np.nan 😟. This PR fixes it by dropping the category dtype from the dtypes dictionary, then resetting the column as a category dtype column after the transformation. Since this is also an issue with the SimpleImputer, and the two imputers would otherwise need the same fix, I decided to update the Imputer to use our SimpleImputer rather than sklearn's.
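For reviewers who want to see the category behavior in isolation, here is a minimal sketch in plain pandas. It is not the evalml code; the variable names and the `"missing"` fill value are made up for illustration.

```python
import pandas as pd

# Column whose category dtype only knows about "a" and "b".
col = pd.Series(["a", "b", None, "a"], dtype="category")
original_dtype = col.dtype

# Impute with a value that is not one of the original categories
# ("missing" is an illustrative fill value, not an evalml default).
imputed = col.astype(object).fillna("missing")

# Casting back to the original dtype drops "missing", because it is
# not among the stored categories: the cell becomes NaN again.
reverted = imputed.astype(original_dtype)

# Rebuilding the category dtype from the imputed data keeps the value.
fixed = pd.Series(imputed, dtype="category")

print(reverted.isna().sum(), fixed.isna().sum())
```

The first count is the reverted cell; the second shows that letting pandas infer the categories from the imputed data avoids the problem.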

Finally, moved the test data into pytest fixtures so it can be reused in each test!

angela97lin · 2020-08-05T15:11:05Z

evalml/pipelines/components/transformers/imputers/imputer.py

+        for c in category_cols:
+            X_null_dropped[c] = pd.Series(X_null_dropped[c], dtype="category")
+            dtypes.pop(c)


There was a subtle bug with the pandas category type. The dtype stores the possible category values (e.g. "a", "b"). After filling the column with the imputed values, we cast it back to the original dtype, which doesn't contain the imputed value as a category. Any value that isn't in the categories is set to NaN, so we never actually impute. This fixes it by dropping the column's dtype from the dtypes dictionary, then resetting the column as a category dtype column after the transformation :)
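The approach described above can be sketched at the DataFrame level roughly as follows. This is an assumption-laden illustration, not the code in this PR: `restore_dtypes` is a hypothetical helper, and the "imputed" frame is built by hand rather than by an imputer.

```python
import numpy as np
import pandas as pd


def restore_dtypes(original: pd.DataFrame, imputed: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: restore original dtypes, skipping category columns."""
    dtypes = original.dtypes.to_dict()
    category_cols = [
        c for c, dt in dtypes.items() if isinstance(dt, pd.CategoricalDtype)
    ]
    # Drop category dtypes from the mapping so astype() does not cast
    # imputed values back into the old, narrower set of categories.
    for c in category_cols:
        dtypes.pop(c)
    restored = imputed.astype(dtypes)
    # Re-create each category column so its categories include any
    # newly imputed value.
    for c in category_cols:
        restored[c] = pd.Series(restored[c], dtype="category")
    return restored


# Hand-built stand-ins for the data before and after imputation.
X = pd.DataFrame({
    "num": [1.0, np.nan, 3.0],
    "cat": pd.Series(["a", None, "b"], dtype="category"),
})
X_imputed = pd.DataFrame({"num": [1.0, 2.0, 3.0], "cat": ["a", "missing", "b"]})

X_out = restore_dtypes(X, X_imputed)
```

The category columns come back as category dtype, but with categories inferred from the imputed data rather than copied from the input.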

codecov · 2020-08-05T21:16:20Z

Codecov Report

Merging #1019 into main will increase coverage by 0.00%.
The diff coverage is 100.00%.

@@           Coverage Diff           @@
##             main    #1019   +/-   ##
=======================================
  Coverage   99.90%   99.90%           
=======================================
  Files         181      181           
  Lines        9690     9748   +58     
=======================================
+ Hits         9681     9739   +58     
  Misses          9        9

| Impacted Files | Coverage Δ |
|---|---|
| evalml/tests/pipeline_tests/test_pipelines.py | `100.00% <ø> (ø)` |
| ...elines/components/transformers/imputers/imputer.py | `100.00% <100.00%> (ø)` |
| ...components/transformers/imputers/simple_imputer.py | `100.00% <100.00%> (ø)` |
| evalml/tests/component_tests/test_imputer.py | `100.00% <100.00%> (ø)` |
| ...valml/tests/component_tests/test_simple_imputer.py | `100.00% <100.00%> (ø)` |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 2c1aa75...81f5da1.

docs/source/release_notes.rst

angela97lin · 2020-08-06T02:38:31Z

evalml/pipelines/components/transformers/imputers/imputer.py

        if self._numeric_cols is not None and len(self._numeric_cols) > 0:
            X_numeric = X_null_dropped[self._numeric_cols]
            X_null_dropped[X_numeric.columns] = self._numeric_imputer.transform(X_numeric)
        if self._categorical_cols is not None and len(self._categorical_cols) > 0:
            X_categorical = X_null_dropped[self._categorical_cols]
            X_null_dropped[X_categorical.columns] = self._categorical_imputer.transform(X_categorical)
-
-        transformed = X_null_dropped.astype(dtypes)


Restoring the dtypes is now handled in our SimpleImputer, and since the Imputer has been updated to use our version instead of sklearn's, we get this for free :D
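To illustrate why the PR title splits `fill_value` in two, here is a rough pandas-only sketch of constant imputation with one fill value per column type, mirroring the numeric/categorical branches in the diff above. It is not the evalml Imputer: `impute_constant` and its defaults are hypothetical, and only the parameter names `categorical_fill_value` and `numeric_fill_value` come from this PR.

```python
import numpy as np
import pandas as pd


def impute_constant(X, categorical_fill_value="missing", numeric_fill_value=0):
    """Hypothetical sketch: fill numeric and non-numeric columns separately."""
    X = X.copy()
    numeric_cols = X.select_dtypes(include="number").columns
    categorical_cols = X.columns.difference(numeric_cols)

    if len(numeric_cols) > 0:
        X[numeric_cols] = X[numeric_cols].fillna(numeric_fill_value)

    for c in categorical_cols:
        was_category = isinstance(X[c].dtype, pd.CategoricalDtype)
        # Fill on an object copy so a value outside the existing
        # categories is accepted, then rebuild the category dtype.
        filled = X[c].astype(object).fillna(categorical_fill_value)
        X[c] = filled.astype("category") if was_category else filled
    return X


X = pd.DataFrame({
    "n": [1.0, np.nan],
    "c": pd.Series(["a", None], dtype="category"),
    "o": ["x", None],
})
out = impute_constant(X, categorical_fill_value="missing", numeric_fill_value=-1)
```

A single shared `fill_value` would force either a string into numeric columns or a number into categorical ones, which is the motivation for separate arguments.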

dsherry

@angela97lin great! Nothing blocking IMO except a docs tweak and resolving the conversation about arg order. I left a few miscellaneous suggestions.

evalml/pipelines/components/transformers/imputers/imputer.py

evalml/pipelines/components/transformers/imputers/simple_imputer.py

evalml/tests/component_tests/test_imputer.py

evalml/tests/component_tests/test_simple_imputer.py

angela97lin added 3 commits August 4, 2020 15:58

init

336677c

fix

9043568

Merge branch 'main' into 1000_fill

5e0cae9

angela97lin commented Aug 5, 2020

angela97lin added 4 commits August 5, 2020 12:44

fix simpleimputer and add tests

d389717

remove unnecessary lines

2fb724a

address some tests, cleanup

142071c

fix tests and release note

ca2c6d5

angela97lin added 3 commits August 5, 2020 20:49

Merge branch 'main' into 1000_fill

5060506

use evalml simpleimputer for imputer

73a29c8

Merge branch '1000_fill' of github.com:FeatureLabs/evalml into 1000_fill

c433b4d

angela97lin commented Aug 6, 2020

docs/source/release_notes.rst (outdated, resolved)

angela97lin commented Aug 6, 2020

angela97lin added 3 commits August 5, 2020 22:43

cleanup

a91ef51

category update

34997f9

add pyfixture and reorganize tests

c0a48c2

angela97lin self-assigned this Aug 6, 2020

angela97lin marked this pull request as ready for review August 6, 2020 15:18

angela97lin requested review from dsherry, freddyaboulton, jeremyliweishih and eccabay and removed request for dsherry and freddyaboulton August 6, 2020 15:18

angela97lin added this to the August 2020 milestone Aug 6, 2020

Merge branch 'main' into 1000_fill

a514f3c

dsherry approved these changes Aug 7, 2020

angela97lin added 6 commits August 7, 2020 16:44

Merge branch 'main' into 1000_fill

5dad6cf

update via most comments

1f1dee2

fix test

02528be

update simple imputer

3f8bcfd

cleanup

84ef7b3

linting

81f5da1

angela97lin merged commit bbc315f into main Aug 10, 2020

angela97lin deleted the 1000_fill branch August 10, 2020 16:39

dsherry mentioned this pull request Aug 25, 2020

Release v0.13.1 #1101

Merged
