force/auto-convert to one-hot encoding for categorical features #61

Open · Eight1911 opened this issue Jun 3, 2018 · 1 comment

Eight1911 (Contributor) commented Jun 3, 2018

The current implementation uses lexicographic ordering to compute splits on string features. In practice this is rarely what the user intends, since categorical features are by definition unordered (for example, it makes no sense to say "Blue" < "Red" < "Yellow"). One-hot encoding would decouple a categorical variable from any unintended ordering and, unlike the current behavior, would also make regression possible on datasets with categorical features.
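
For concreteness, here is a minimal sketch of what one-hot encoding does, written in Python for illustration only (the `one_hot` helper is hypothetical, not part of this package's API): each category gets its own 0/1 column, so no ordering between categories is implied.

```python
# Minimal sketch of one-hot encoding, for illustration only (the
# helper below is hypothetical, not part of this package's API).
# Each category maps to its own 0/1 column, so no artificial
# ordering such as "Blue" < "Red" < "Yellow" is introduced.
import numpy as np

def one_hot(values):
    """Encode a list of category labels as a 0/1 matrix."""
    categories = sorted(set(values))  # column order is arbitrary
    index = {cat: i for i, cat in enumerate(categories)}
    encoded = np.zeros((len(values), len(categories)), dtype=int)
    for row, value in enumerate(values):
        encoded[row, index[value]] = 1
    return categories, encoded

cats, X = one_hot(["Blue", "Red", "Yellow", "Red"])
print(cats)  # ['Blue', 'Red', 'Yellow']
print(X)     # [[1 0 0] [0 1 0] [0 0 1] [0 1 0]]
```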

ValdarT commented Jun 3, 2018

Unlike most algorithms, decision trees can handle categorical variables natively (without one-hot encoding), which gives a performance boost and also makes a difference in decision forests. R packages do this, and so does Microsoft's LightGBM. And of course you can do the one-hot encoding yourself if you prefer it.

We don't have the NumPy issue that Python has, so in my opinion there's no reason to restrict ourselves to operations on Floats. Being able to handle mixed features natively is, for me, one of the main selling points of decision trees.
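
To make the "native handling" point concrete, below is a rough sketch of how a regression tree can split directly on a set of categories (my own illustration in Python, with a hypothetical `best_categorical_split` helper; this is not this package's or LightGBM's actual code). For squared-error loss, ordering categories by their mean target and scanning only contiguous partitions is a classical trick (due to Fisher, used in CART-style trees) that avoids enumerating all 2^(k-1) subsets.

```python
# Rough sketch of a native categorical split for a regression tree.
# Hypothetical helper, not this package's or LightGBM's actual code.
# For squared-error loss, sorting categories by mean target and
# scanning only contiguous partitions finds the optimal binary
# split without enumerating all 2^(k-1) subsets.
from collections import defaultdict

def best_categorical_split(categories, targets):
    """Return (left_category_set, total squared error of the split)."""
    groups = defaultdict(list)
    for cat, y in zip(categories, targets):
        groups[cat].append(y)
    # Order categories by their mean target value.
    ordered = sorted(groups, key=lambda c: sum(groups[c]) / len(groups[c]))

    def sse(ys):  # sum of squared errors around the mean
        mean = sum(ys) / len(ys)
        return sum((y - mean) ** 2 for y in ys)

    best_set, best_score = None, float("inf")
    for i in range(1, len(ordered)):  # contiguous partitions only
        left_y = [y for c in ordered[:i] for y in groups[c]]
        right_y = [y for c in ordered[i:] for y in groups[c]]
        score = sse(left_y) + sse(right_y)
        if score < best_score:
            best_set, best_score = set(ordered[:i]), score
    return best_set, best_score

cats = ["Blue", "Red", "Yellow", "Blue", "Red"]
ys = [1.0, 5.0, 2.0, 1.5, 4.5]
print(best_categorical_split(cats, ys))
# -> ({'Blue', 'Yellow'}, 0.625): the learned split is a category
#    set, not a threshold on some artificial encoding of the labels.
```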
