# Understanding One Hot Encoding

In this notebook we explore why we one-hot encode our labels.

In [1]:
import numpy as np
from keras.datasets import mnist
import pandas as pd

## MNIST

Let's set up our standard approach to MNIST.

In [2]:
## mnist.load_data() will automatically download the dataset if you don't have it
(MNIST_train_X, MNIST_train_y), (MNIST_test_X, MNIST_test_y) = mnist.load_data()

In [3]:
MNIST_train_X = MNIST_train_X.reshape((60000, 28 * 28))
MNIST_test_X = MNIST_test_X.reshape((10000, 28 * 28))

MNIST_train_X = MNIST_train_X.astype('float32') / 255
MNIST_test_X = MNIST_test_X.astype('float32') / 255

## One Hot Encoding

Here is the one-hot encoding (OHE) that we have been doing until now.  Let's look at the labels before and after to convince ourselves of what it's doing.

In [4]:
from keras.utils import to_categorical

In [5]:
print(MNIST_train_y[:5])
print(MNIST_train_y.shape)

[5 0 4 1 9]
(60000,)


In [6]:
MNIST_train_y_ohe = to_categorical(MNIST_train_y)
MNIST_test_y_ohe = to_categorical(MNIST_test_y)

print(MNIST_train_y_ohe[:5])
print(MNIST_train_y_ohe.shape)

[[0. 0. 0. 0. 0. 1. 0. 0. 0. 0.]
 [1. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 1. 0. 0. 0. 0. 0.]
 [0. 1. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]]
(60000, 10)


### What does the difference in shape mean?

We see that the non-encoded labels have shape `(60000,)`, which means they are a rank-1 tensor (a vector) with one entry per image.
Each label is a single integer.  This makes sense: it's just the list of answers.

Once we encode them, the labels become a rank-2 tensor (a matrix) of shape `(60000, 10)`.  Each label is now a one-hot vector.  This makes the labels 10x larger, but it lets the network learn each class separately from the others.  The encoding assumes no ordering or relationship between the classes.
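To make the encoding concrete, here is a minimal NumPy sketch of what a one-hot encoder does for integer labels. The helper name `one_hot` is made up for this illustration; it is not the implementation of `to_categorical`, only a stand-in that should produce the same kind of matrix for labels in the range 0-9.

```python
import numpy as np

def one_hot(labels, num_classes=10):
    """Return a (len(labels), num_classes) float32 matrix with a single 1 per row."""
    labels = np.asarray(labels)
    encoded = np.zeros((labels.shape[0], num_classes), dtype="float32")
    # Row i gets a 1 in the column given by labels[i].
    encoded[np.arange(labels.shape[0]), labels] = 1.0
    return encoded

sample_labels = np.array([5, 0, 4, 1, 9])  # the first five training labels shown above
sample_ohe = one_hot(sample_labels)

print(sample_ohe.shape)           # (5, 10)
print(sample_ohe.argmax(axis=1))  # [5 0 4 1 9]
```

Taking `argmax` along each row recovers the original integer label, so nothing is lost by encoding; only the shape changes.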

We have seen the results from OHE many times: we get around `97%` accuracy.  So what about just leaving the labels as they are?

Would we get better results by not increasing the dimensionality of our labels?

### Changes we have to make in order to use scalar labels

OK, so if we are going to make the output a scalar value (not a vector), then each label has shape `(1,)` because it's a single number.
This means we need to change our final layer to be `Dense(1, activation=None)`.  We choose `1` for the final output because that's the shape of a label.  We also have to remove the `softmax`, because it is designed to produce a probabilistic output across a vector, while our output will be a single value.  So we are basically doing regression on the labels.
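A regression output is a real number, so to compare it with an integer label we have to turn it back into a class somehow. One simple option, sketched below, is to round to the nearest integer and clip to the valid range. The helper `regression_to_class` and the `fake_predictions` array are hypothetical, standing in for what `network.predict(...)` would return. Depending on the Keras version, the built-in `accuracy` metric may not do this rounding for a single-unit output, so computing it by hand is one way to be sure what is being measured.

```python
import numpy as np

def regression_to_class(predictions, num_classes=10):
    """Round real-valued predictions to the nearest valid class index."""
    rounded = np.rint(np.asarray(predictions, dtype="float64")).astype("int64")
    # Predictions below 0 or above 9 are pulled back to the nearest valid digit.
    return np.clip(rounded, 0, num_classes - 1).ravel()

# Hypothetical raw outputs with shape (5, 1), like a Dense(1) layer would produce.
fake_predictions = np.array([[4.6], [0.2], [3.9], [-0.7], [11.3]])
true_labels = np.array([5, 0, 4, 1, 9])

predicted_classes = regression_to_class(fake_predictions)
rounded_accuracy = float(np.mean(predicted_classes == true_labels))

print(predicted_classes)  # [5 0 4 0 9]
print(rounded_accuracy)   # 0.8
```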

Let's let our model just try to learn the answer as if they are ordinal (greater and less than each other).
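To see what "ordinal" means for the loss, the sketch below compares two wrong guesses for a true digit of 3. With a squared-error loss, guessing 9 is punished far more than guessing 4, purely because 9 is numerically further away. With one-hot labels and cross-entropy, both wrong guesses cost the same. The two helper functions and the 0.9 confidence value are assumptions chosen for illustration.

```python
import numpy as np

def squared_error(true_digit, guessed_digit):
    """Regression-style loss: depends on the numeric distance between digits."""
    return float((guessed_digit - true_digit) ** 2)

def cross_entropy(true_digit, guessed_digit, confidence=0.9, num_classes=10):
    """Cross-entropy when `confidence` mass sits on the guess and the rest is spread evenly."""
    probs = np.full(num_classes, (1.0 - confidence) / (num_classes - 1))
    probs[guessed_digit] = confidence
    return float(-np.log(probs[true_digit]))

true_digit = 3
se_near, se_far = squared_error(true_digit, 4), squared_error(true_digit, 9)
ce_near, ce_far = cross_entropy(true_digit, 4), cross_entropy(true_digit, 9)

print(se_near, se_far)              # 1.0 36.0
print(np.isclose(ce_near, ce_far))  # True
```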


## Run the non OHE data

In [7]:
from keras import models
from keras import layers

In [8]:
network = models.Sequential() #we'll stick to sequential for this course

network.add(layers.Dense(512, activation='relu', input_shape=(784,)))  
network.add(layers.Dense(1, activation=None))  # two important changes here: one output unit, no softmax

network.compile(optimizer='rmsprop',
                loss='mae',  # have to use a regression loss function here
                metrics=['accuracy'])

network.fit(MNIST_train_X, MNIST_train_y, epochs=5, batch_size=128)
test_loss, test_acc = network.evaluate(MNIST_test_X, MNIST_test_y)
print('test_acc:', test_acc)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
test_acc: 0.18549999594688416


## Conclusions

So, what did you notice?

your answer here : 

## Additional Experiment

So maybe the problem was that we used the default labels 0-9.  What if we re-ordered the numbers so they made more sense?  For example, we could sort the digits by visual similarity and remap them so the regression can find the patterns more easily.



In [9]:
MNIST_test_y = pd.Series(MNIST_test_y)
MNIST_train_y = pd.Series(MNIST_train_y)

### Make A Mapping

Fill out the dictionary below to make a mapping.  The number on the left side is the original value, the number on the right side is the "new" value.  Try to order it so it makes a case for digits being physically similar in shape.


In [10]:
mapping = { 9:0,
            4:1,
            7:2,
            8:3,
            1:4,
            6:5,
            5:6, 
            2:7,
            3:8,
            0:9 }
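Whatever ordering you choose, the mapping only works as a relabeling if it is one-to-one over the digits 0-9. If a key is missing, `Series.map` produces `NaN` for that digit, and if two digits share a new value they can no longer be told apart. A small sanity check, sketched here with the mapping from the cell above copied in so the snippet runs on its own (the name `inverse_mapping` is introduced for illustration):

```python
mapping = {9: 0, 4: 1, 7: 2, 8: 3, 1: 4, 6: 5, 5: 6, 2: 7, 3: 8, 0: 9}

# Every original digit must appear once as a key, and every new value once.
assert sorted(mapping.keys()) == list(range(10))
assert sorted(mapping.values()) == list(range(10))

# The inverse mapping lets us translate predictions back to real digits.
inverse_mapping = {new: old for old, new in mapping.items()}

sample = [5, 0, 4, 1, 9]
remapped = [mapping[d] for d in sample]
restored = [inverse_mapping[d] for d in remapped]

print(remapped)  # [6, 9, 1, 4, 0]
print(restored)  # [5, 0, 4, 1, 9]
```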

In [11]:
MNIST_train_y_reorder = MNIST_train_y.map(mapping)
MNIST_test_y_reorder = MNIST_test_y.map(mapping)
print("The original data: \n{}".format(MNIST_train_y[:10]))
print(" ----- ")
print("the reordered data: \n{}".format(MNIST_train_y_reorder[:10]))

The original data: 
0    5
1    0
2    4
3    1
4    9
5    2
6    1
7    3
8    1
9    4
dtype: uint8
 ----- 
the reordered data: 
0    6
1    9
2    1
3    4
4    0
5    7
6    4
7    8
8    4
9    1
dtype: int64


In [12]:
network = models.Sequential() #we'll stick to sequential for this course

network.add(layers.Dense(512, activation='relu', input_shape=(784,)))  
network.add(layers.Dense(1, activation=None))  # two important changes here: one output unit, no softmax

network.compile(optimizer='rmsprop',
                loss='mse',  # have to use a regression loss function here
                metrics=['accuracy'])

network.fit(MNIST_train_X, MNIST_train_y_reorder, epochs=5, batch_size=128)
test_loss, test_acc = network.evaluate(MNIST_test_X, MNIST_test_y_reorder)
print('test_acc:', test_acc)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
test_acc: 0.15950000286102295


## Your Conclusion:

Did it help? Hurt?

your answer here : 

## Final experiments

One other idea for you: maybe you need to add more layers.  Try a deeper network and see if that helps the regression case work better.  I'd consider something more like `256 >> 128 >> 64 >> 16 >> 1`.

In [13]:
network = models.Sequential() #we'll stick to sequential for this course

network.add(layers.Dense(256, activation='relu', input_shape=(784,)))  
network.add(layers.Dense(128, activation='relu'))  
network.add(layers.Dense(64, activation='relu'))  
network.add(layers.Dense(16, activation='relu'))  

network.add(layers.Dense(1, activation=None))  # two important changes here: one output unit, no softmax

network.compile(optimizer='rmsprop',
                loss='mse',  # have to use a regression loss function here
                metrics=['accuracy'])

network.fit(MNIST_train_X, MNIST_train_y_reorder, epochs=5, batch_size=128)
test_loss, test_acc = network.evaluate(MNIST_test_X, MNIST_test_y_reorder)
print('test_acc:', test_acc)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
test_acc: 0.18379999697208405


## Same experiment, no remapping

In [14]:
network = models.Sequential() #we'll stick to sequential for this course

network.add(layers.Dense(256, activation='relu', input_shape=(784,)))  
network.add(layers.Dense(128, activation='relu'))  
network.add(layers.Dense(64, activation='relu'))  
network.add(layers.Dense(16, activation='relu'))  

network.add(layers.Dense(1, activation=None))  # two important changes here: one output unit, no softmax

network.compile(optimizer='rmsprop',
                loss='mse',  # have to use a regression loss function here
                metrics=['accuracy'])

network.fit(MNIST_train_X, MNIST_train_y, epochs=5, batch_size=128)
test_loss, test_acc = network.evaluate(MNIST_test_X, MNIST_test_y)
print('test_acc:', test_acc)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
test_acc: 0.20880000293254852
