FeatureBinarizerFromTrees #77

Merged · 4 commits into Trusted-AI:master · Mar 24, 2020

Conversation

floidgilbert (Contributor)

Proposing a transformer compatible with FeatureBinarizer. The new transformer, FeatureBinarizerFromTrees, significantly shortens training times and often results in simpler rule sets. Please see examples/rbm/feature_binarizer_from_trees.ipynb for an overview and a formal performance comparison.

A test module is included. Detailed parameter information is available in the docstrings.
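As a minimal sketch of the intended workflow (hypothetical data and column names; it assumes, per the docstrings, that fit requires labels y because the thresholds come from decision trees, and that returnOrd=True makes transform return both the binarized frame and standardized ordinal features, as with FeatureBinarizer):

```python
import pandas as pd
from aix360.algorithms.rbm import FeatureBinarizerFromTrees

# Hypothetical toy data: any tabular DataFrame X with a binary target y will do.
X = pd.DataFrame({'age': [23, 45, 61, 38, 52, 29],
                  'income': [30_000, 52_000, 88_000, 41_000, 67_000, 35_000]})
y = pd.Series([0, 1, 1, 0, 1, 0])

# Unlike FeatureBinarizer, fitting requires y: the candidate thresholds are the
# split points of decision trees fit on (X, y).
fbt = FeatureBinarizerFromTrees(returnOrd=True)
fbt.fit(X, y)

# With returnOrd=True, transform also returns standardized ordinal features,
# matching FeatureBinarizer's behavior.
X_bin, X_std = fbt.transform(X)

# X_bin can then be passed to BooleanRuleCG, LogisticRuleRegression, etc.
```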

Thank you for sharing AIX360.

@dennislwei (Collaborator)

@floidgilbert This is really great. Thanks for contributing! The notebook examples/rbm/feature_binarizer_from_trees.ipynb is pretty compelling.

Just to confirm, the features returned by FeatureBinarizerFromTrees are all of the form [feature] [operation] [value], e.g. age >= 50, like FeatureBinarizer, right? In other words, it doesn't create interactions between two or more original features, leaving that to BooleanRuleCG, LogisticRuleRegression, etc. (Although such interactions are happening within the decision trees that FeatureBinarizerFromTrees uses.)
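For illustration, one quick way to inspect the form of the returned columns (hypothetical data; the threshold values shown are illustrative):

```python
import pandas as pd
from aix360.algorithms.rbm import FeatureBinarizerFromTrees

X = pd.DataFrame({'age': [23, 45, 61, 38, 52, 29]})
y = pd.Series([0, 1, 1, 0, 1, 0])

fbt = FeatureBinarizerFromTrees()
fbt.fit(X, y)
X_bin = fbt.transform(X)

# Each column is a single-feature condition indexed as (feature, operation, value);
# no interaction columns appear.
print(X_bin.columns.tolist())
# e.g. [('age', '<=', 41.5), ('age', '>', 41.5), ...]  (values are illustrative)
```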

@vijay-arya Does the test module tests/rbm/test_Feature_Binarizer_From_Trees.py look comparable to the tests for existing algorithms?

@floidgilbert (Contributor, Author)

@dennislwei The transformer does not produce interactions. Every split in the tree is considered an independent feature. FeatureBinarizerFromTrees attempts to maintain compatibility with FeatureBinarizer in almost every case. For example, public members like maps, enc, etc. are all included. There are only three practical compatibility differences:

  1. FeatureBinarizerFromTrees does not accept missing values. Because it fits scikit-learn decision trees, missing values must be imputed. It's difficult to generalize an appropriate imputation method, so users must do it themselves beforehand (see the sketch after this list).
  2. FeatureBinarizerFromTrees populates its ordinal member even when returnOrd=False. This is a matter of preference and convenience. It's nice to have the list of ordinal features available, but it could be changed if necessary.
  3. FeatureBinarizerFromTrees does not convert categorical feature values to strings in the transformed data frame's multi-index unless the user sets threshStr=True. There are a few reasons for this which I can explain if it's a problem. Perhaps I should rename the threshStr parameter to something more inclusive... I just took the name from FeatureBinarizer.
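A minimal sketch of point 1: imputing before fitting. The data and imputation strategy are hypothetical; any standard approach (here scikit-learn's SimpleImputer) works.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from aix360.algorithms.rbm import FeatureBinarizerFromTrees

# Hypothetical frame with missing values; the scikit-learn decision trees used
# internally cannot handle NaNs, so impute before fitting.
X = pd.DataFrame({'age': [23, np.nan, 61, 38, 52, 29],
                  'income': [30_000, 52_000, np.nan, 41_000, 67_000, 35_000]})
y = pd.Series([0, 1, 1, 0, 1, 0])

# Impute first (the strategy is the user's choice), then binarize as usual.
X_imp = pd.DataFrame(SimpleImputer(strategy='median').fit_transform(X),
                     columns=X.columns, index=X.index)

fbt = FeatureBinarizerFromTrees()
fbt.fit(X_imp, y)
X_bin = fbt.transform(X_imp)
```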

@vijay-arya (Collaborator) left a comment

@vijay-arya merged commit 5ef5213 into Trusted-AI:master on Mar 24, 2020