-
Notifications
You must be signed in to change notification settings - Fork 90
Split fill_value
into categorical_fill_value
and numeric_fill_value
for Imputer
#1019
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
for c in category_cols: | ||
X_null_dropped[c] = pd.Series(X_null_dropped[c], dtype="category") | ||
dtypes.pop(c) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a tiny nuanced bug with pandas category
type. The dtype stores the possible category values (ex: "a", "b"). After we filled it with the new imputed values, we set it back to the original dtype, which doesn't contain the imputed value as a category. Any value that isn't in the categories is set to nan, so we never actually impute. This fixes it by dropping the dtype from the dictionary, then resetting the column as a category dtype column after the transformation :)
Codecov Report
@@ Coverage Diff @@
## main #1019 +/- ##
=======================================
Coverage 99.90% 99.90%
=======================================
Files 181 181
Lines 9690 9748 +58
=======================================
+ Hits 9681 9739 +58
Misses 9 9
Continue to review full report at Codecov.
|
if self._numeric_cols is not None and len(self._numeric_cols) > 0: | ||
X_numeric = X_null_dropped[self._numeric_cols] | ||
X_null_dropped[X_numeric.columns] = self._numeric_imputer.transform(X_numeric) | ||
if self._categorical_cols is not None and len(self._categorical_cols) > 0: | ||
X_categorical = X_null_dropped[self._categorical_cols] | ||
X_null_dropped[X_categorical.columns] = self._categorical_imputer.transform(X_categorical) | ||
|
||
transformed = X_null_dropped.astype(dtypes) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is now handled in SimpleImputer
and since this has been updated to use our version instead, we get this for free :D
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@angela97lin great! Nothing blocking IMO except a docs tweak and resolving the conversation about arg order. I left a few miscellaneous suggestions.
Closes #1000
Also fixes a bug with imputing pandas category typed columns. For these columns, the dtype stores the possible category values (ex: "a", "b"). After we impute it with new values, we set it back to the original dtype, which doesn't contain the imputed value as a category. Any value that isn't in the categories is set to nan, so the imputed value gets reverted back to np.nan 😟. This PR fixes it by dropping the dtype from the dictionary, then resetting the column as a category dtype column after the transformation. Since I realized this is also an issue with the
SimpleImputer
and the two imputers thus would share the same code, I decided to update theImputer
to use ourSimpleImputer
rather than sklearn's.Finally, moved data for tests to pyfixture to be reused in each test!