DataFrameToMatrix train_transform and transform dealing with categories #108

Closed
DionAkkerman opened this issue Jan 30, 2020 · 6 comments

@DionAkkerman

Applying dataframe_to_matrix.transform() with a fitted transformer to new categorical data can produce a different encoding than fit_transform() produced for the same values.

import pandas as pd

# Set up test data
cat_df = pd.DataFrame({'a': ['a', 'b', 'c'], 'b': ['v', 'c', 'a']})
cat_df = cat_df.astype('category')
cat_row = cat_df.iloc[[0]]

# Create and fit transformer
to_matrix = DataFrameToMatrix()
cat_df = to_matrix.fit_transform(cat_df)

# Apply transformer to subset
cat_row = to_matrix.transform(cat_row)

print(cat_df[0]) results in [1. 0. 0. 1. 0. 0.]
whereas
print(cat_row[0]) results in [1. 0. 0. 0. 0. 1.]

@DionAkkerman
Author

The issue seems to be caused by self.lock_categorical(_internal_df) in the DataFrameToMatrix class. This is applied during fit_transform but not transform. Commenting out the line resolves the issue.

Would it be an idea for the DataFrameToMatrix object to hold the levels of each categorical variable, so that they can be applied to new data during transform?
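To illustrate the suggestion above, here is a minimal sketch of an encoder that stores the category levels at fit time and reapplies them at transform time. The class name CategoryLockingEncoder and its attribute categories_ are hypothetical and are not part of the package; this is not how DataFrameToMatrix is implemented, only one way the idea could look using plain pandas.

```python
import pandas as pd


class CategoryLockingEncoder:
    """Hypothetical helper, not part of the package discussed in this issue."""

    def __init__(self):
        # Maps column name -> list of category levels seen at fit time
        self.categories_ = {}

    def fit_transform(self, df):
        # Remember the levels (and their order) for every column
        self.categories_ = {
            col: list(pd.Categorical(df[col]).categories) for col in df.columns
        }
        return self.transform(df)

    def transform(self, df):
        locked = pd.DataFrame(index=df.index)
        for col, levels in self.categories_.items():
            # Go through object first so the stored level order is always
            # applied, even if the column is already categorical
            values = df[col].astype(object)
            locked[col] = pd.Categorical(values, categories=levels)
        # One-hot encode; categorical columns yield one column per stored level
        return pd.get_dummies(locked).to_numpy(dtype=float)


# Usage: the first row encodes the same way in both calls
cat_df = pd.DataFrame({'a': ['a', 'b', 'c'], 'b': ['v', 'c', 'a']}).astype('category')
encoder = CategoryLockingEncoder()
full_matrix = encoder.fit_transform(cat_df)
row_matrix = encoder.transform(cat_df.iloc[[0]])
print((full_matrix[0] == row_matrix[0]).all())
```

Because the levels live on the encoder rather than on the incoming DataFrame, a subset that contains only some of the values still gets the full set of one-hot columns in the same order.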

@brifordwylie
Member

@DionAkkerman sorry for the delay on this (I just missed it somehow). I'm looking at this now and there definitely is a bug, so thanks for reporting it. It appears that if you manually set the DataFrame column types to category before calling fit_transform, the category bookkeeping gets messed up. I'll fix this ASAP. For now you can skip the 'astype' call and it should work, as shown here:

# Set up test data 
cat_df = pd.DataFrame({'a': ['a', 'b', 'c'], 'b': ['v', 'c', 'a']})                        
cat_row = cat_df.iloc[[0]]                                                                 

# Create and fit transformer 
to_matrix = DataFrameToMatrix()                                                            

print(to_matrix.fit_transform(cat_df))                                                     
Changing column a to category...
Changing column b to category...
[[1. 0. 0. 1. 0. 0.]
 [0. 1. 0. 0. 1. 0.]
 [0. 0. 1. 0. 0. 1.]]

print(to_matrix.transform(cat_row))                                                        
[[1. 0. 0. 1. 0. 0.]]

@brifordwylie
Member

brifordwylie commented Feb 27, 2020

Okay, so this was a subtle issue with how Pandas handles astype when casting a categorical to a categorical dtype that has the same values in a different order: the existing category order is kept instead of the requested one.

In [38]: cat_df = pd.DataFrame({'test': ['v', 'c', 'a']}) 
    ...: cat_df = cat_df.astype('category')                                                         

In [39]: cat_df['test'].dtype                                                                       
Out[39]: CategoricalDtype(categories=['a', 'c', 'v'], ordered=False)

In [40]: cat_df['test'].astype(pd.api.types.CategoricalDtype(categories=['v', 'c', 'a'], ordered=False))              
Out[40]: 
0    v
1    c
2    a
Name: test, dtype: category
Categories (3, object): [a, c, v]

In [41]: cat_df['test'].astype(object).astype(pd.api.types.CategoricalDtype(categories=['v', 'c', 'a'], ordered=False)
    ...: )                                                                                                            
Out[41]: 
0    v
1    c
2    a
Name: test, dtype: category
Categories (3, object): [v, c, a]

In [40] and In [41] should both give the output shown in Out[41], but they don't :)
I've pushed a workaround into the master branch, so everything should now work properly. Please let me know if you can test it with a fresh pull from master. I'll do a PyPI release within the next week or so.
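For readers who hit the same behaviour in their own code, the round trip through object from In [41] can be wrapped in a small helper. This is a sketch only: the function name cast_with_category_order is hypothetical, and it is not necessarily the workaround that was pushed to master. Whether the direct cast in In [40] reorders the categories may depend on the Pandas version, so the sketch does not rely on it.

```python
import pandas as pd


def cast_with_category_order(series, categories):
    """Hypothetical helper: cast to a categorical with exactly this level order."""
    target = pd.CategoricalDtype(categories=categories, ordered=False)
    # Casting to object first drops any existing categorical dtype, so the
    # requested order is applied rather than the old order being kept
    return series.astype(object).astype(target)


test = pd.Series(['v', 'c', 'a'], name='test').astype('category')
fixed = cast_with_category_order(test, ['v', 'c', 'a'])
print(list(fixed.cat.categories))  # ['v', 'c', 'a']
```

An alternative that stays within the categorical accessor is Series.cat.set_categories, which also takes the desired list of levels in order.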

@DionAkkerman
Author

Thank you for looking into the issue (and for developing/maintaining this amazing package!). I'll get around to testing it next Wednesday and will report back.

@DionAkkerman
Author

Works perfectly on the test data and actual logs. Thank you!

@brifordwylie
Member

@DionAkkerman hey, just wanted to circle back... I ^finally^ released a PyPI version (0.3.9) with this fix in it... anyway thanks again for catching this.
