DataFrameToMatrix train_transform and transform dealing with categories #108
Comments
The issue seems to be caused by self.lock_categorical(_internal_df) in the DataFrameToMatrix class. This is applied during fit_transform but not during transform. Commenting out the line resolves the issue. Would it be an idea for the DataFrameToMatrix object to hold the levels for each categorical variable, so that they can be applied to new data when running the transformation?
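A minimal sketch of the idea suggested above, assuming plain pandas rather than the package's internals. `LevelKeeper`, its methods, and the `color` column are hypothetical names used only for illustration; this is not the actual DataFrameToMatrix implementation:

```python
import pandas as pd


class LevelKeeper:
    """Hypothetical helper: remembers the category levels seen at fit time."""

    def __init__(self):
        self.levels = {}  # column name -> list of levels seen during fit

    def fit_transform(self, df, cat_cols):
        out = df.copy()
        for col in cat_cols:
            out[col] = out[col].astype("category")
            self.levels[col] = list(out[col].cat.categories)
        return pd.get_dummies(out, columns=list(self.levels), dtype=float)

    def transform(self, df):
        out = df.copy()
        for col, levels in self.levels.items():
            # Re-apply the stored levels so new data gets the same column layout
            out[col] = pd.Categorical(out[col], categories=levels)
        return pd.get_dummies(out, columns=list(self.levels), dtype=float)


train = pd.DataFrame({"color": ["red", "green", "blue", "red"]})
keeper = LevelKeeper()
train_matrix = keeper.fit_transform(train, ["color"])

# A single new row still yields one column per stored level
new_row = pd.DataFrame({"color": ["red"]})
row_matrix = keeper.transform(new_row)

print(train_matrix.to_numpy()[0])
print(row_matrix.to_numpy()[0])
```

Because the levels are stored on the object, the first training row and the single new row should encode to the same vector, even though the new data contains only one of the three levels.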
@DionAkkerman sorry for the delay on this (just missed it somehow). I'm looking at this now and there definitely is a bug, so thanks for reporting this. It appears that if you set the DF types as categories manually before calling fit_transform, the category bookkeeping somehow gets messed up (which is obviously a bug). I'll fix this asap... for now you can just skip the 'astype' call and it should work. But either way this needs to be fixed... thanks again for reporting it.
Okay, so this was a subtle issue with how Pandas improperly handles astype for a categorical into a categorical with different value orders.
Lines 40 and 41 should both give the output from 41 but don't :)
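For readers following along, a small sketch of why category order matters for any code-based encoding. It does not reproduce the lines 40 and 41 referenced above (that snippet is not included in this thread); the values and category lists here are made up for illustration:

```python
import pandas as pd

values = ["b", "a", "c", "a"]

# Same values, two different category orders
first = pd.Categorical(values, categories=["a", "b", "c"])
second = pd.Categorical(values, categories=["c", "b", "a"])

# The integer codes index into the category list, so they differ
print(first.codes.tolist())   # [1, 0, 2, 0]
print(second.codes.tolist())  # [1, 2, 0, 2]

# Re-aligning to one fixed category list makes the codes comparable again
realigned = second.set_categories(["a", "b", "c"])
print(realigned.codes.tolist())  # [1, 0, 2, 0]
```

Any encoder that turns codes into matrix columns needs both the training data and the new data to share one category order, otherwise the same value can land in a different column.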
Thank you for looking into the issue (and for developing/maintaining this amazing package!). I'll get around to testing it next Wednesday and will report back.
Works perfectly on the test data and actual logs. Thank you!
@DionAkkerman hey, just wanted to circle back... I ^finally^ released a PyPI version (0.3.9) with this fix in it... anyway thanks again for catching this.
Calling dataframe_to_matrix.transform() to apply a fitted transformation to new categorical data can produce a different encoding than fit_transform gave for the same row.
print(cat_df[0]) results in [1. 0. 0. 1. 0. 0.]
whereas
print(cat_row[0]) results in [1. 0. 0. 0. 0. 1.]