## scikit-learn tip #6

Two common ways to encode categorical features:

- OneHotEncoder for unordered (nominal) data
- OrdinalEncoder for ordered (ordinal) data

In [1]:
import pandas as pd
X = pd.DataFrame({'Shape':['square', 'square', 'oval', 'circle'],
                  'Class': ['third', 'first', 'second', 'third'],
                  'Size': ['S', 'S', 'L', 'XL']})

In [3]:
# "Shape" is unordered, "Class" and "Size" are ordered
X

    Shape   Class  Size
0  square   third     S
1  square   first     S
2    oval  second     L
3  circle   third    XL


In [4]:
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

In [5]:
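# output columns follow the categories in alphabetical order (circle, oval, square)
# sparse_output=False returns a dense array (scikit-learn 1.2+; older versions use sparse=False)
ohe = OneHotEncoder(sparse_output=False)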
ohe.fit_transform(X[['Shape']])



array([[0., 0., 1.],
       [0., 0., 1.],
       [0., 1., 0.],
       [1., 0., 0.]])
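
To check which output column belongs to which category, one option is to inspect the fitted encoder's `categories_` attribute. The sketch below is illustrative only: it rebuilds just the "Shape" column, and the names `encoded`, `shape_categories`, and `labeled` are made up for this example. It calls `.toarray()` on the default sparse output so that it does not depend on the name of the dense-output parameter.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

X = pd.DataFrame({'Shape': ['square', 'square', 'oval', 'circle']})

# The default output is a sparse matrix; .toarray() converts it to a dense array
ohe = OneHotEncoder()
encoded = ohe.fit_transform(X[['Shape']]).toarray()

# categories_ holds one array per input feature, in output-column order
shape_categories = list(ohe.categories_[0])
print(shape_categories)  # ['circle', 'oval', 'square']

# Label the encoded columns so each one is easy to identify
labeled = pd.DataFrame(encoded, columns=shape_categories)
print(labeled)
```

The labeled DataFrame makes it easier to confirm that the first two rows (both "square") have their 1 in the last column.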

In [6]:
# you define the category order for each feature, one list per column ("M" is listed even though it is absent from the data)
oe = OrdinalEncoder(categories=[['first', 'second', 'third'], ['S', 'M', 'L', 'XL']])
oe.fit_transform(X[['Class', 'Size']])

array([[2., 0.],
       [0., 0.],
       [1., 2.],
       [2., 3.]])
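
Why pass `categories` explicitly? As a rough illustration, the sketch below compares the default behavior (`categories='auto'`, which sorts the observed values) with the explicit ordering used above, on the "Size" column only. The variable names `auto_oe`, `manual_oe`, `auto_codes`, and `manual_codes` are made up for this example.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

X = pd.DataFrame({'Size': ['S', 'S', 'L', 'XL']})

# Default categories='auto' sorts the observed values alphabetically: L, S, XL
auto_oe = OrdinalEncoder()
auto_codes = auto_oe.fit_transform(X[['Size']]).ravel().tolist()

# Explicit ordering, as in the cell above: S < M < L < XL
manual_oe = OrdinalEncoder(categories=[['S', 'M', 'L', 'XL']])
manual_codes = manual_oe.fit_transform(X[['Size']]).ravel().tolist()

print(list(auto_oe.categories_[0]))  # ['L', 'S', 'XL']
print(auto_codes)                    # [1.0, 1.0, 0.0, 2.0]
print(manual_codes)                  # [0.0, 0.0, 2.0, 3.0]
```

With the default, "L" is encoded below "S", which contradicts the real size order. The explicit list keeps the codes monotonic in size and reserves a code for "M" even though it does not appear in this data.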