Contributor

@angela97lin angela97lin commented Aug 4, 2020

Closes #1000

Also fixes a bug with imputing pandas category-typed columns. For these columns, the dtype stores the possible category values (e.g. "a", "b"). After imputing new values, we cast the column back to its original dtype, which doesn't contain the imputed value as a category. Any value that isn't in the categories is set to NaN, so the imputed value gets reverted back to np.nan 😟. This PR fixes it by dropping the category columns' dtypes from the dtype dictionary, then resetting each of those columns as a category dtype column after the transformation. Since I realized this is also an issue with the SimpleImputer, and the two imputers would thus share the same code, I decided to update the Imputer to use our SimpleImputer rather than sklearn's.
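For readers who want to see the failure mode described above, here is a minimal sketch using plain pandas. It is not the evalml code; the fill value "missing" and the variable names are assumptions chosen for illustration.

```python
import pandas as pd

# A category column whose dtype only knows the categories "a" and "b".
original = pd.Series(["a", "b", None, "a"], dtype="category")
original_dtype = original.dtype

# Impute with a value outside the original categories. Casting to object
# first avoids the category restriction during the fill itself.
imputed = original.astype(object).fillna("missing")

# Casting back to the original dtype discards the imputed value, because
# "missing" is not one of the original categories.
round_tripped = imputed.astype(original_dtype)

print(imputed.isna().sum(), round_tripped.isna().sum())
```

The first count should be zero and the second should be one: the imputed row turns back into a missing value after the cast.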

Finally, moved the test data into a pytest fixture so it can be reused in each test!

Comment on lines 106 to 108
for c in category_cols:
    X_null_dropped[c] = pd.Series(X_null_dropped[c], dtype="category")
    dtypes.pop(c)
Contributor Author

There was a subtle bug with the pandas category dtype. The dtype stores the possible category values (e.g. "a", "b"). After filling the column with the new imputed values, we set it back to the original dtype, which doesn't contain the imputed value as a category. Any value that isn't in the categories is set to NaN, so we never actually imputed anything. This fixes it by dropping the dtype from the dictionary, then resetting the column as a category dtype column after the transformation :)
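The fix described in this comment can be sketched roughly as follows. This is a standalone approximation in plain pandas, not the code from the PR: the DataFrame `X`, the "missing" fill value, and the object-cast `fillna` that stands in for the imputer's transform step are all assumptions.

```python
import pandas as pd

X = pd.DataFrame({
    "cat": pd.Series(["a", "b", None, "a"], dtype="category"),
    "num": pd.Series([1, 2, 3, 4], dtype="int64"),
})

# Record the original dtypes, then drop the category columns from the mapping
# so the later astype() call cannot re-apply the stale category list.
dtypes = X.dtypes.to_dict()
category_cols = X.select_dtypes(include=["category"]).columns
for c in category_cols:
    dtypes.pop(c)

# Stand-in for the imputer's transform step (an assumption for this sketch).
X_t = X.copy()
for c in category_cols:
    X_t[c] = X_t[c].astype(object).fillna("missing")

# Restore the remaining dtypes, then rebuild each category column so its
# categories are inferred from the imputed values.
X_t = X_t.astype(dtypes)
for c in category_cols:
    X_t[c] = pd.Series(X_t[c], dtype="category")
```

Because the category dtype is rebuilt from the imputed data, the fill value becomes a valid category instead of being coerced back to NaN.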

codecov bot commented Aug 5, 2020

Codecov Report

Merging #1019 into main will increase coverage by 0.00%.
The diff coverage is 100.00%.

@@           Coverage Diff           @@
##             main    #1019   +/-   ##
=======================================
  Coverage   99.90%   99.90%           
=======================================
  Files         181      181           
  Lines        9690     9748   +58     
=======================================
+ Hits         9681     9739   +58     
  Misses          9        9           
| Impacted Files | Coverage Δ |
|---|---|
| evalml/tests/pipeline_tests/test_pipelines.py | 100.00% <ø> (ø) |
| ...elines/components/transformers/imputers/imputer.py | 100.00% <100.00%> (ø) |
| ...components/transformers/imputers/simple_imputer.py | 100.00% <100.00%> (ø) |
| evalml/tests/component_tests/test_imputer.py | 100.00% <100.00%> (ø) |
| ...valml/tests/component_tests/test_simple_imputer.py | 100.00% <100.00%> (ø) |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 2c1aa75...81f5da1.

if self._numeric_cols is not None and len(self._numeric_cols) > 0:
    X_numeric = X_null_dropped[self._numeric_cols]
    X_null_dropped[X_numeric.columns] = self._numeric_imputer.transform(X_numeric)
if self._categorical_cols is not None and len(self._categorical_cols) > 0:
    X_categorical = X_null_dropped[self._categorical_cols]
    X_null_dropped[X_categorical.columns] = self._categorical_imputer.transform(X_categorical)

transformed = X_null_dropped.astype(dtypes)
Contributor Author

The dtype restoration is now handled in SimpleImputer, and since the Imputer has been updated to use our version instead of sklearn's, we get this for free :D
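To illustrate the idea behind the hunk above (numeric and categorical columns imputed separately, each with its own fill value, as the linked issue asks for the "constant" strategy), here is a rough sketch in plain pandas. The helper `impute_constant` and its parameter names are hypothetical and are not evalml's API.

```python
import pandas as pd


def impute_constant(X, numeric_fill_value=0, categorical_fill_value="missing"):
    """Hypothetical helper: fill numeric and category columns with
    different constants. Not evalml's API."""
    X_t = X.copy()
    numeric_cols = X_t.select_dtypes(include=["number"]).columns
    categorical_cols = X_t.select_dtypes(include=["category"]).columns

    if len(numeric_cols) > 0:
        X_t[numeric_cols] = X_t[numeric_cols].fillna(numeric_fill_value)
    for c in categorical_cols:
        # Fill as object, then rebuild the category dtype from the new values
        # so the fill value is a valid category.
        filled = X_t[c].astype(object).fillna(categorical_fill_value)
        X_t[c] = pd.Series(filled, dtype="category")
    return X_t


X = pd.DataFrame({
    "num": [1.0, None, 3.0],
    "cat": pd.Series(["a", None, "b"], dtype="category"),
})
X_filled = impute_constant(X, numeric_fill_value=-1, categorical_fill_value="unknown")
```

Keeping the two fill values as separate parameters means a numeric constant is never written into a categorical column, or the other way around.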

@angela97lin angela97lin self-assigned this Aug 6, 2020
@angela97lin angela97lin marked this pull request as ready for review August 6, 2020 15:18
@angela97lin angela97lin requested review from dsherry, freddyaboulton, jeremyliweishih and eccabay and removed request for dsherry and freddyaboulton August 6, 2020 15:18
@angela97lin angela97lin added this to the August 2020 milestone Aug 6, 2020
Contributor

@dsherry dsherry left a comment

@angela97lin great! Nothing blocking IMO except a docs tweak and resolving the conversation about arg order. I left a few miscellaneous suggestions.

@angela97lin angela97lin merged commit bbc315f into main Aug 10, 2020
@angela97lin angela97lin deleted the 1000_fill branch August 10, 2020 16:39
@dsherry dsherry mentioned this pull request Aug 25, 2020
Successfully merging this pull request may close these issues.

Imputer: same fill value used for numeric/categorical for "constant" strategy
2 participants