force/auto-convert to one-hot encoding for categorical features #61

Open · Eight1911 opened this issue Jun 3, 2018 · 1 comment

Eight1911 (Contributor) commented Jun 3, 2018

The current implementation uses lexicographic ordering to compute splits on string features. In practice this is rarely what the user intends, since categorical features are by definition unordered (for example, it makes no sense to say "Blue" < "Red" < "Yellow"). One-hot encoding would decouple a categorical variable from any unintended ordering and, unlike the current behavior, would also make regression possible on datasets with categorical features.
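
For concreteness, here is a minimal sketch of what one-hot encoding does, written in Python for illustration only (the `one_hot` helper is hypothetical, not part of this package's API): each category gets its own 0/1 column, so no ordering between categories is implied.

```python
# Minimal sketch of one-hot encoding, for illustration only (the
# helper below is hypothetical, not part of this package's API).
# Each category maps to its own 0/1 column, so no artificial
# ordering such as "Blue" < "Red" < "Yellow" is introduced.
import numpy as np

def one_hot(values):
    """Encode a list of category labels as a 0/1 matrix."""
    categories = sorted(set(values))  # column order is arbitrary
    index = {cat: i for i, cat in enumerate(categories)}
    encoded = np.zeros((len(values), len(categories)), dtype=int)
    for row, value in enumerate(values):
        encoded[row, index[value]] = 1
    return categories, encoded

cats, X = one_hot(["Blue", "Red", "Yellow", "Red"])
print(cats)  # ['Blue', 'Red', 'Yellow']
print(X)     # [[1 0 0] [0 1 0] [0 0 1] [0 1 0]]
```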

ValdarT commented Jun 3, 2018

Unlike most algorithms, decision trees can handle categorical variables natively (without one-hot encoding), which gives a performance boost and also makes a difference in decision forests. R packages do this, and so does Microsoft's LightGBM. And of course you can do the one-hot encoding yourself if you prefer it.

We don't have the NumPy issue that Python has, so in my opinion there's no reason to restrict ourselves to operations on Floats. Being able to handle mixed features natively is, for me, one of the main selling points of decision trees.
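
To make the "native handling" point concrete, below is a rough sketch of how a regression tree can split directly on a set of categories (my own illustration in Python, with a hypothetical `best_categorical_split` helper; this is not this package's or LightGBM's actual code). For squared-error loss, ordering categories by their mean target and scanning only contiguous partitions is a classical trick (due to Fisher, used in CART-style trees) that avoids enumerating all 2^(k-1) subsets.

```python
# Rough sketch of a native categorical split for a regression tree.
# Hypothetical helper, not this package's or LightGBM's actual code.
# For squared-error loss, sorting categories by mean target and
# scanning only contiguous partitions finds the optimal binary
# split without enumerating all 2^(k-1) subsets.
from collections import defaultdict

def best_categorical_split(categories, targets):
    """Return (left_category_set, total squared error of the split)."""
    groups = defaultdict(list)
    for cat, y in zip(categories, targets):
        groups[cat].append(y)
    # Order categories by their mean target value.
    ordered = sorted(groups, key=lambda c: sum(groups[c]) / len(groups[c]))

    def sse(ys):  # sum of squared errors around the mean
        mean = sum(ys) / len(ys)
        return sum((y - mean) ** 2 for y in ys)

    best_set, best_score = None, float("inf")
    for i in range(1, len(ordered)):  # contiguous partitions only
        left_y = [y for c in ordered[:i] for y in groups[c]]
        right_y = [y for c in ordered[i:] for y in groups[c]]
        score = sse(left_y) + sse(right_y)
        if score < best_score:
            best_set, best_score = set(ordered[:i]), score
    return best_set, best_score

cats = ["Blue", "Red", "Yellow", "Blue", "Red"]
ys = [1.0, 5.0, 2.0, 1.5, 4.5]
print(best_categorical_split(cats, ys))
# -> ({'Blue', 'Yellow'}, 0.625): the learned split is a category
#    set, not a threshold on some artificial encoding of the labels.
```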
