DataFrameToMatrix train_transform and transform dealing with categories #108

DionAkkerman · 2020-01-30T14:09:52Z

Calling dataframe_to_matrix.transform() on a fitted DataFrameToMatrix can encode new categorical data differently from how fit_transform() encoded the same values.

import pandas as pd
from zat.dataframe_to_matrix import DataFrameToMatrix

# Set up test data
cat_df = pd.DataFrame({'a': ['a', 'b', 'c'], 'b': ['v', 'c', 'a']})
cat_df = cat_df.astype('category')
cat_row = cat_df.iloc[[0]]

# Create and fit transformer
to_matrix = DataFrameToMatrix()
cat_df = to_matrix.fit_transform(cat_df)

# Apply transformer to subset
cat_row = to_matrix.transform(cat_row)

print(cat_df[0]) results in [1. 0. 0. 1. 0. 0.]
whereas
print(cat_row[0]), which is the same row, results in [1. 0. 0. 0. 0. 1.]


DionAkkerman · 2020-01-30T15:41:20Z

The issue seems to be caused by self.lock_categorical(_internal_df) in the DataFrameToMatrix class. This is applied during fit_transform but not transform. Commenting out the line resolves the issue.

Would it be an idea for the DataFrameToMatrix object to hold the levels for each categorical variable, so that they can be applied to new data during the transformation?
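A minimal sketch of that idea, assuming plain pandas rather than zat internals. The `StoredLevelsEncoder` class below is hypothetical and not part of zat; it records each column's levels at fit time and re-applies them, in the same order, to any new data before one-hot encoding.

```python
import pandas as pd


class StoredLevelsEncoder:
    """Hypothetical helper (not zat's API): remembers category levels per column."""

    def __init__(self):
        self.levels = {}

    def fit(self, df):
        # Record the levels of every column, whether or not it is already categorical
        for col in df.columns:
            self.levels[col] = list(pd.Categorical(df[col]).categories)
        return self

    def transform(self, df):
        out = df.copy()
        for col, cats in self.levels.items():
            dtype = pd.CategoricalDtype(categories=cats, ordered=False)
            # Cast through object so the stored level order is always the one applied
            out[col] = out[col].astype(object).astype(dtype)
        # get_dummies emits one column per stored level, even for unobserved levels
        return pd.get_dummies(out).to_numpy(dtype=float)


train = pd.DataFrame({'a': ['a', 'b', 'c'], 'b': ['v', 'c', 'a']}).astype('category')
encoder = StoredLevelsEncoder().fit(train)

full_matrix = encoder.transform(train)
single_row = encoder.transform(train.iloc[[0]])
print(full_matrix[0])
print(single_row[0])
```

With the levels stored on the encoder, the first row of the full matrix and the separately transformed single row come out identical. Values that were not seen at fit time become missing and encode as all zeros in this sketch.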

brifordwylie · 2020-02-27T02:16:24Z

@DionAkkerman sorry for the delay on this (just missed it somehow). I'm looking at this now and there definitely is a bug, so thanks for reporting this. It appears that if you manually set the DF column types to category before calling fit_transform, the category bookkeeping somehow gets messed up (which is obviously a bug). I'll fix this asap... for now you can just skip the 'astype' call and it should work. But either way this needs to be fixed... thanks again for reporting it.

# Set up test data 
cat_df = pd.DataFrame({'a': ['a', 'b', 'c'], 'b': ['v', 'c', 'a']})                        
cat_row = cat_df.iloc[[0]]                                                                 

# Create and fit transformer 
to_matrix = DataFrameToMatrix()                                                            

print(to_matrix.fit_transform(cat_df))                                                     
Changing column a to category...
Changing column b to category...
[[1. 0. 0. 1. 0. 0.]
 [0. 1. 0. 0. 1. 0.]
 [0. 0. 1. 0. 0. 1.]]

print(to_matrix.transform(cat_row))                                                        
[[1. 0. 0. 1. 0. 0.]]

brifordwylie · 2020-02-27T03:25:06Z

Okay, so this was a subtle issue with how pandas handles astype from one categorical dtype to another categorical dtype that has the same values in a different order:

In [38]: cat_df = pd.DataFrame({'test': ['v', 'c', 'a']}) 
    ...: cat_df = cat_df.astype('category')                                                         

In [39]: cat_df['test'].dtype                                                                       
Out[39]: CategoricalDtype(categories=['a', 'c', 'v'], ordered=False)

In [40]: cat_df['test'].astype(pd.api.types.CategoricalDtype(categories=['v', 'c', 'a'], ordered=False))              
Out[40]: 
0    v
1    c
2    a
Name: test, dtype: category
Categories (3, object): [a, c, v]

In [41]: cat_df['test'].astype(object).astype(pd.api.types.CategoricalDtype(categories=['v', 'c', 'a'], ordered=False)
    ...: )                                                                                                            
Out[41]: 
0    v
1    c
2    a
Name: test, dtype: category
Categories (3, object): [v, c, a]

Inputs 40 and 41 should both give the output from 41, but they don't :)
I've pushed a workaround into the master branch and so everything should now work properly, please let me know if you can test it with a new pull from master. I'll do a PyPI release within the next week or so.
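For readers who hit the same pandas behavior in their own code, here is a small sketch of one way to force a specific level order. It mirrors the object round trip from In [41] above and is not necessarily the workaround that was pushed to master; the function name `force_category_order` is made up for illustration.

```python
import pandas as pd


def force_category_order(series, categories):
    """Return `series` as an unordered categorical with exactly this level order."""
    dtype = pd.CategoricalDtype(categories=categories, ordered=False)
    # Casting to object first drops the old categorical dtype, so the requested
    # order is applied from scratch instead of being compared to the old one
    return series.astype(object).astype(dtype)


s = pd.Series(['v', 'c', 'a'], name='test').astype('category')
fixed = force_category_order(s, ['v', 'c', 'a'])

print(list(s.cat.categories))      # ['a', 'c', 'v']
print(list(fixed.cat.categories))  # ['v', 'c', 'a']
print(fixed.cat.codes.tolist())    # [0, 1, 2]
```

The integer codes are what a one-hot encoding ultimately follows, so pinning the level order is what keeps the matrix columns stable between fit and transform.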

DionAkkerman · 2020-02-27T12:45:14Z

Thank you for looking into the issue (and for developing/maintaining this amazing package!). I'll get around to testing it next Wednesday and will report back.

DionAkkerman · 2020-03-04T19:17:52Z

Works perfectly on the test data and actual logs. Thank you!

brifordwylie · 2020-04-19T18:50:49Z

@DionAkkerman hey, just wanted to circle back... I ^finally^ released a PyPI version (0.3.9) with this fix in it... anyway thanks again for catching this.

https://pypi.org/project/zat/

DionAkkerman closed this as completed Mar 4, 2020
