# one-hot encoding for categorical data

#### computers can't learn much from words; they need numbers.

in order to be machine-readable, categorical data need to be converted into something the computer can understand and learn from: numbers.

there are two steps to encoding categorical data, though depending on the data you may not need both:

##### 1) integer encoding
##### 2) one-hot encoding

### integer encoding

for categories with an inherent numerical hierarchy, i.e. ordinal relationships, integer encoding is sufficient.

if there's no ordinal relationship in the data, stopping at this step might cause the model to infer an ordinal relationship where none exists. 

#### example:

a customer service rating survey where the choices were "not satisfied, somewhat satisfied, very satisfied" could be encoded into integers [1, 2, 3], for example, allowing the computer to infer relationships successfully. this is an example of ordinal data--the ordering of the integers means something. 

choices of colors like "blue, red, yellow", if labeled [1, 2, 3] respectively, might imply to the model that yellow ranks above red, and red above blue. these categories have no ordinal relationship, so that ordering is meaningless.
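going back to the survey example, here's a minimal sketch of integer encoding for ordinal data. it uses a hand-written mapping so the integers follow the real ranking of the answers. the names `rating_order`, `responses`, and `encoded` are made up for illustration, as are the sample responses.

```python
# hand-written mapping: the integers follow the actual ranking of the answers
# (illustrative names and sample data, not from the notebook)
rating_order = {
    "not satisfied": 1,
    "somewhat satisfied": 2,
    "very satisfied": 3,
}

responses = ["very satisfied", "not satisfied", "somewhat satisfied", "very satisfied"]

# look up each response's rank
encoded = [rating_order[r] for r in responses]

print(encoded)  # [3, 1, 2, 3]
```

writing the mapping out by hand keeps the order under your control, instead of depending on however an encoder happens to sort the labels.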

### one-hot encoding

for categories with no ordinal relationship, such as the color example above, one-hot encoding is necessary to properly vectorize the data. one-hot encoding converts each category into its own binary column: in each row, the column for that row's category is assigned a 1, while all other category columns are assigned a 0.
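to make the idea concrete before bringing in any encoder objects, here's a rough by-hand sketch using only numpy. the variable names and the four-item color list are illustrative, and the column order (alphabetical) is a choice made for this sketch.

```python
import numpy as np

# small illustrative sample
colors = ["blue", "red", "yellow", "red"]

# one column per unique category, in alphabetical order
categories = sorted(set(colors))
index = {c: i for i, c in enumerate(categories)}

# start with all zeros, then put a single 1 in each row
one_hot = np.zeros((len(colors), len(categories)), dtype=int)
for row, color in enumerate(colors):
    one_hot[row, index[color]] = 1

print(categories)  # ['blue', 'red', 'yellow']
print(one_hot)
```

every row sums to exactly 1, since each observation belongs to exactly one category.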

### scikit-learn's categorical encoding

scikit-learn's library includes encoder objects that make things easy. we'll start with the LabelEncoder object for integer encoding, and then demo the OneHotEncoder for a complete encoding intro.


## encoding

let's define some data. we'll stick with the color categories "blue, red, yellow" from above.

In [6]:
# imports 

from numpy import array
from numpy import argmax
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder

data = ['yellow', 'yellow', 'red', 'blue', 'yellow', 'blue', 'blue', 'red', 'yellow', 'red', 'blue']

data_arr = array(data)

print(data_arr)

['yellow' 'yellow' 'red' 'blue' 'yellow' 'blue' 'blue' 'red' 'yellow'
 'red' 'blue']


perfect.

#### step 1: integer encoding

In [8]:
# instantiate LabelEncoder object

label_encoder = LabelEncoder()

# fit & transform the categorical labels into integers
# feed in the array we just made

int_encode = label_encoder.fit_transform(data_arr)

# test

print(int_encode)

[2 2 1 0 2 0 0 1 2 1 0]
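where did those integers come from? LabelEncoder sorts the unique labels and numbers them by position, so here blue=0, red=1, yellow=2. a quick sketch to check that mapping and to go back from integers to labels (the data is re-declared so the snippet runs by itself; `mapping` is just an illustrative name):

```python
from numpy import array
from sklearn.preprocessing import LabelEncoder

data_arr = array(['yellow', 'yellow', 'red', 'blue', 'yellow', 'blue',
                  'blue', 'red', 'yellow', 'red', 'blue'])

label_encoder = LabelEncoder()
int_encode = label_encoder.fit_transform(data_arr)

# classes_ holds the sorted unique labels; a label's position is its integer code
mapping = {str(label): code for code, label in enumerate(label_encoder.classes_)}
print(mapping)  # {'blue': 0, 'red': 1, 'yellow': 2}

# inverse_transform turns integer codes back into the original labels
first_three = label_encoder.inverse_transform(int_encode[:3])
print(first_three)
```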


if these data were ordinal, we'd be done now. but we don't want the model to infer an ordinal relationship where none exists, as this could negatively impact our model's performance. 

so we move on to binary (one hot) encoding.

#### step 2: one hot encoding

note: by default, sklearn's OneHotEncoder returns a sparse matrix, which saves memory. not every machine learning library accepts sparse input, though; the keras DL library, for example, generally expects dense arrays. to get a dense array, set the "sparse_output" parameter to False when creating the OneHotEncoder object (in scikit-learn versions before 1.2 this parameter is called "sparse").

In [10]:
# instantiate encoder object
# (on scikit-learn < 1.2, use sparse=False instead)

one_hot_encoder = OneHotEncoder(sparse_output=False)

# reshape data into n x 1 matrix

int_data = int_encode.reshape(len(int_encode), 1)

# fit & transform

one_hot_data = one_hot_encoder.fit_transform(int_data)

# test

print(one_hot_data)

[[0. 0. 1.]
 [0. 0. 1.]
 [0. 1. 0.]
 [1. 0. 0.]
 [0. 0. 1.]
 [1. 0. 0.]
 [1. 0. 0.]
 [0. 1. 0.]
 [0. 0. 1.]
 [0. 1. 0.]
 [1. 0. 0.]]
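the imports at the top pull in argmax without using it. one plausible use, sketched here as an assumption rather than the notebook's own step, is decoding: the position of the 1 in each row is the integer code, which indexes back into the sorted labels. the `classes` array below is typed in by hand to match LabelEncoder's alphabetical order, and only three rows are shown.

```python
from numpy import array, argmax

# labels in the same sorted order LabelEncoder uses (hand-typed for this sketch)
classes = array(['blue', 'red', 'yellow'])

# three illustrative one-hot rows
one_hot_rows = array([[0., 0., 1.],
                      [0., 1., 0.],
                      [1., 0., 0.]])

# the column index of the 1 in each row is that row's integer code
codes = argmax(one_hot_rows, axis=1)
print(codes)  # [2 1 0]

# index into the labels to recover the original categories
labels = classes[codes]
print(labels)  # ['yellow' 'red' 'blue']
```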


sweet. our data is now machine-readable.
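a side note: recent scikit-learn versions (0.20 and later) let OneHotEncoder take string categories directly, so the integer step can be skipped when all you want is the one-hot matrix. a hedged sketch with a shortened color list; it calls `.toarray()` on the default sparse result so it doesn't depend on which name the sparse parameter has in your version.

```python
from numpy import array
from sklearn.preprocessing import OneHotEncoder

# shortened sample of the color data
data_arr = array(['yellow', 'red', 'blue', 'yellow'])

encoder = OneHotEncoder()

# the encoder expects a 2D input: one row per sample, one column per feature
one_hot = encoder.fit_transform(data_arr.reshape(-1, 1)).toarray()

# categories_ lists the labels behind each output column
print(encoder.categories_)
print(one_hot)
```

the columns come out in the same sorted order as before (blue, red, yellow), so the result lines up with the two-step version.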
 
## for more information on sklearn's categorical label encoders:

##### sklearn LabelEncoder documentation

http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html

##### sklearn OneHotEncoder documentation

http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html