The current implementation uses lexicographical ordering to calculate splits of string features. In practice, this is rarely intended, since categorical features are by definition unordered: it wouldn't make any sense, for example, that "Blue" < "Red" < "Yellow". One-hot encoding would decouple the categorical variable from any unintended ordering, and would allow regression on datasets with categorical features, which is not currently possible.
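For illustration, here is a minimal self-contained sketch of what the proposed one-hot encoding could look like (plain Rust, no crate API assumed; `one_hot_encode` is a hypothetical helper, not part of this library):

```rust
use std::collections::BTreeSet;

/// One-hot encode a column of string categories into binary indicator columns.
/// The BTreeSet only fixes a deterministic column order; the encoding itself
/// carries no ordering information between categories.
fn one_hot_encode(column: &[&str]) -> (Vec<String>, Vec<Vec<f64>>) {
    let categories: Vec<String> = column
        .iter()
        .collect::<BTreeSet<_>>()
        .into_iter()
        .map(|s| s.to_string())
        .collect();

    let rows = column
        .iter()
        .map(|value| {
            categories
                .iter()
                .map(|c| if c == value { 1.0 } else { 0.0 })
                .collect()
        })
        .collect();

    (categories, rows)
}

fn main() {
    let colors = ["Blue", "Red", "Yellow", "Blue"];
    let (cats, encoded) = one_hot_encode(&colors);
    println!("{:?}", cats); // ["Blue", "Red", "Yellow"]
    for row in &encoded {
        println!("{:?}", row); // "Blue" -> [1.0, 0.0, 0.0], etc.
    }
}
```

After this transformation, every column is numeric, so the existing threshold-based split search works unchanged and no spurious order between categories can leak into the tree.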
Unlike most algorithms, decision trees can handle categorical variables natively (without one-hot encoding), which gives a performance boost and also makes a difference in decision forests. R packages do this, and so does Microsoft's LightGBM. And of course you can do the OHE yourself if you prefer that.
We don't have the NumPy issue that Python has, so there's no reason to restrict ourselves to operations on Floats, in my opinion. Being able to handle mixed features natively is, for me, one of the main selling points of decision trees.
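To sketch what a native categorical split could look like, here is the simplest one-vs-rest variant, scored by Gini gain, in self-contained Rust; `gini` and `best_equality_split` are hypothetical helpers, not this library's or LightGBM's actual API (LightGBM searches category subsets, typically after sorting categories by gradient statistics, rather than single-category branches):

```rust
/// Gini impurity of a slice of binary class labels (0/1).
fn gini(labels: &[u8]) -> f64 {
    if labels.is_empty() {
        return 0.0;
    }
    let p = labels.iter().filter(|&&l| l == 1).count() as f64 / labels.len() as f64;
    2.0 * p * (1.0 - p)
}

/// Best one-vs-rest split "feature == category" for a single categorical
/// column, scored by Gini gain. No order is imposed on the categories:
/// each distinct value is tried as its own branch.
fn best_equality_split<'a>(column: &[&'a str], labels: &[u8]) -> Option<(&'a str, f64)> {
    let parent = gini(labels);
    let n = labels.len() as f64;

    // Candidate categories are just the distinct values seen in the column.
    let mut seen: Vec<&str> = Vec::new();
    for &v in column {
        if !seen.contains(&v) {
            seen.push(v);
        }
    }

    let mut best: Option<(&str, f64)> = None;
    for cat in seen {
        let (mut left, mut right) = (Vec::new(), Vec::new());
        for (v, &l) in column.iter().zip(labels) {
            if *v == cat { left.push(l) } else { right.push(l) }
        }
        // Weighted child impurity; gain is the decrease relative to the parent.
        let weighted = (left.len() as f64 / n) * gini(&left)
            + (right.len() as f64 / n) * gini(&right);
        let gain = parent - weighted;
        if best.map_or(true, |(_, g)| gain > g) {
            best = Some((cat, gain));
        }
    }
    best
}

fn main() {
    let color = ["Blue", "Red", "Yellow", "Blue", "Red"];
    let label = [1, 0, 0, 1, 0];
    println!("{:?}", best_equality_split(&color, &label)); // Some(("Blue", 0.48))
}
```

Because the split tests category membership rather than a threshold, no ordering among the categories is ever assumed, and the string column never has to be expanded into extra columns.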