# One-Hot Encoding

## Preprocessing Review

Why do we **preprocess** data when we build machine learning pipelines?

We preprocess data for two principal reasons:

1. To transform the data to better suit a model's underlying assumptions.
2. To format the data in the way a model expects.

Today, we're concerned with this second reason.

## Inputs to Neural Networks

What does the input to a neural network look like?

Inputs to neural networks are **vectors**. Each entry in the vector corresponds to a feature, which the net uses to make predictions. 

Crucially, these vectors can contain only _numerical_ data. They _cannot_ contain string data.

In [1]:
# Good!
good_input_row1 = [1.3, 2.2, 5.4, 5.8, 0]
good_input_row2 = [1.3, 2.2, 5.4, 5.8, 1]

In [2]:
# Bad...
bad_input_row1 = [1.3, 2.2, 5.4, 5.8, 'dog']
bad_input_row2 = [1.3, 2.2, 5.4, 5.8, 'cat']
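
To see why the second pair of rows is a problem, here is a minimal sketch in plain Python. The helper `to_float_vector` is a hypothetical name introduced only for this illustration; it tries to turn every entry of a row into a float, which is roughly what any numerical model needs to be able to do with its input.

```python
def to_float_vector(row):
    """Convert every entry of a row to float, or raise ValueError."""
    return [float(value) for value in row]


good_row = [1.3, 2.2, 5.4, 5.8, 0]
bad_row = [1.3, 2.2, 5.4, 5.8, 'dog']

# Every entry is a number, so the conversion succeeds.
print(to_float_vector(good_row))

# The string 'dog' has no numerical value, so the conversion fails.
try:
    to_float_vector(bad_row)
except ValueError as err:
    print('Cannot convert row:', err)
```

The row ending in `'dog'` cannot be used until the string is replaced by some numerical representation.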

## One-Hot Encoding

This poses a problem when we want to train a neural network on categorical data, such as the classic [Iris data set](https://archive.ics.uci.edu/ml/datasets/Iris).

![](https://upload.wikimedia.org/wikipedia/commons/5/56/Kosaciec_szczecinkowaty_Iris_setosa.jpg)

In [3]:
import pandas as pd

# Read from: https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.names
names = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'class']
df = pd.read_csv(
    'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data', names=names)

In [4]:
# Note the `Iris-virginica` entries in the `class` column
df.tail(5)

| | sepal_length | sepal_width | petal_length | petal_width | class |
|---|---|---|---|---|---|
| 145 | 6.7 | 3.0 | 5.2 | 2.3 | Iris-virginica |
| 146 | 6.3 | 2.5 | 5.0 | 1.9 | Iris-virginica |
| 147 | 6.5 | 3.0 | 5.2 | 2.0 | Iris-virginica |
| 148 | 6.2 | 3.4 | 5.4 | 2.3 | Iris-virginica |
| 149 | 5.9 | 3.0 | 5.1 | 1.8 | Iris-virginica |


Note that all of our data is numerical, _except_ for the data in the `class` column, which contains strings.

Each entry in the `class` column is one of three values:

1. `Iris-setosa`
2. `Iris-versicolor`
3. `Iris-virginica`

As these are not numerical values, we can't use them to fit our neural network. To fix this, we must convert each class label to a numerical value.

We do this via the following steps:

1. **Label Encoding**. First, we convert the three possible classes to integer labels. E.g., `Iris-setosa` will be `0`; `Iris-versicolor`, `1`; and `Iris-virginica`, `2`.
2. **One-Hot Encoding**. Then, we replace each row's `class` value with an _array_ that has one slot per class. This array has a `1` in the slot whose index matches the integer label and a `0` everywhere else. E.g., after one-hot encoding, a row with the class `Iris-setosa` will have the array `[1, 0, 0]`; a row with the class `Iris-virginica`, the array `[0, 0, 1]`; etc.

In many data sets, the categories will already be label-encoded, in which case you can apply one-hot encoding immediately.
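
The two steps can be sketched by hand in plain Python before reaching for a library. This is only an illustration under a few assumptions: the `labels` list is a made-up sample rather than the real data set, the integer labels are zero-based so that label `k` maps to slot `k`, and `one_hot` is a hypothetical helper name.

```python
classes = ['Iris-setosa', 'Iris-versicolor', 'Iris-virginica']

# Made-up sample of class labels, standing in for the `class` column.
labels = ['Iris-setosa', 'Iris-virginica', 'Iris-versicolor', 'Iris-setosa']

# Step 1: label encoding. Map each class name to an integer (zero-based).
class_to_index = {name: index for index, name in enumerate(sorted(classes))}
encoded = [class_to_index[name] for name in labels]


# Step 2: one-hot encoding. Build a vector of zeros with a single 1
# in the slot given by the integer label.
def one_hot(index, num_classes):
    vector = [0] * num_classes
    vector[index] = 1
    return vector


one_hot_labels = [one_hot(index, len(classes)) for index in encoded]

print(encoded)
print(one_hot_labels)
```

If the labels were already integers, only the second step would be needed.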

## Applying One-Hot Encoding

In [5]:
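# Step 0: Reformat data
data = df.values
# The first four columns are the numerical features. `df.values` is an
# object array here (it mixes floats and strings), so cast them to float.
X = data[:, 0:4].astype(float)
# The fifth column holds the class labels as strings.
y = data[:, 4]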

In [6]:
from sklearn.preprocessing import LabelEncoder

# Step 1: Label-encode data set
label_encoder = LabelEncoder()
label_encoder.fit(y)
encoded_y = label_encoder.transform(y)

In [7]:
for label, original_class in zip(encoded_y, y):
    print('Original Class: ' + str(original_class))
    print('Encoded Label: ' + str(label))
    print('-' * 12)

Original Class: Iris-setosa
Encoded Label: 0
------------
...

Note that each of the original labels has been replaced with an integer.
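
It is often useful to map the integers back to the original class names, for example when reporting predictions. scikit-learn's `LabelEncoder` stores the sorted class names in its `classes_` attribute and offers `inverse_transform` for this purpose. The plain-Python sketch below mimics that behaviour on a made-up sample (`sample_y` is an assumption for illustration, not the real data) so that the mapping in both directions is explicit.

```python
# Made-up sample of class labels.
sample_y = ['Iris-virginica', 'Iris-setosa', 'Iris-versicolor', 'Iris-setosa']

# Integer labels are assigned by the sorted order of the distinct classes.
classes = sorted(set(sample_y))
name_to_label = {name: label for label, name in enumerate(classes)}

# Forward direction: class name -> integer label.
encoded = [name_to_label[name] for name in sample_y]

# Reverse direction: integer label -> class name.
decoded = [classes[label] for label in encoded]

print(classes)
print(encoded)
print(decoded == sample_y)
```

Because the integer label is simply an index into the sorted list of classes, decoding is a list lookup.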

In [8]:
from keras.utils import to_categorical

# Step 2: One-hot encoding
one_hot_y = to_categorical(encoded_y)
one_hot_y

Using TensorFlow backend.


array([[ 1.,  0.,  0.],
       [ 1.,  0.,  0.],
       [ 1.,  0.,  0.],
       [ 1.,  0.,  0.],
       [ 1.,  0.,  0.],
       [ 1.,  0.,  0.],
       [ 1.,  0.,  0.],
       [ 1.,  0.,  0.],
       [ 1.,  0.,  0.],
       [ 1.,  0.,  0.],
       [ 1.,  0.,  0.],
       [ 1.,  0.,  0.],
       [ 1.,  0.,  0.],
       [ 1.,  0.,  0.],
       [ 1.,  0.,  0.],
       [ 1.,  0.,  0.],
       [ 1.,  0.,  0.],
       [ 1.,  0.,  0.],
       [ 1.,  0.,  0.],
       [ 1.,  0.,  0.],
       [ 1.,  0.,  0.],
       [ 1.,  0.,  0.],
       [ 1.,  0.,  0.],
       [ 1.,  0.,  0.],
       [ 1.,  0.,  0.],
       [ 1.,  0.,  0.],
       [ 1.,  0.,  0.],
       [ 1.,  0.,  0.],
       [ 1.,  0.,  0.],
       [ 1.,  0.,  0.],
       [ 1.,  0.,  0.],
       [ 1.,  0.,  0.],
       [ 1.,  0.,  0.],
       [ 1.,  0.,  0.],
       [ 1.,  0.,  0.],
       [ 1.,  0.,  0.],
       [ 1.,  0.,  0.],
       [ 1.,  0.,  0.],
       [ 1.,  0.,  0.],
       [ 1.,  0.,  0.],
       [ 1.,  0.,  0.],
       [ 1.,  0.