# Softmax activation, one-hot encoding, categorical cross-entropy and accuracy

In this tutorial we will review concepts seen in previous lectures and tutorials. The learning goals are:
- Review the softmax activation for converting a vector of scores into a vector of probabilities;
- Review the one-hot encoding representation;
- Contrast categorical cross-entropy and accuracy as potential loss functions. Categorical cross-entropy is better!


## Softmax

The softmax function is commonly used as an activation for multi-class supervised classification in deep learning.
The softmax function converts an input vector of real values into an output vector that can be interpreted as categorical probabilities. Each element of the output vector lies between 0 and 1 and the elements sum to 1, which are the two properties required to interpret the output of the softmax function as a probability distribution.

The softmax is often used as the activation for the last layer of a classification model.

The softmax equation for an input vector $\overrightarrow{Z}$ is given by:


$$ Softmax(\overrightarrow{Z}) = \frac{e^{\overrightarrow{Z}}}{\sum_{j=0}^{k-1}e^{z_j}} $$

Rewriting the vector notation to make the equation simpler:

$$ Softmax(\left [ z_0, z_1, \ldots, z_{k-1} \right ]) = \left [ \frac{e^{z_0}}{\sum_{i=0}^{k-1} e^{z_i}}, \frac{e^{z_1}}{\sum_{i=0}^{k-1} e^{z_i}}, \ldots, \frac{e^{z_{k-1}}}{\sum_{i=0}^{k-1} e^{z_i}} \right ] $$
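
As a quick numerical check of the formula, the sketch below evaluates the softmax of a single vector step by step. The input values `[1.0, 2.0, 3.0]` are an arbitrary choice for illustration and do not come from the tutorial's data.

```python
import numpy as np

z = np.array([1.0, 2.0, 3.0])  # illustrative scores (assumed values)
exp_z = np.exp(z)              # numerators: e^{z_j} for each element
probs = exp_z / exp_z.sum()    # divide by the sum over all k elements

print(probs)        # approximately [0.090 0.245 0.665]
print(probs.sum())  # 1 up to floating-point rounding
```

Larger scores receive larger probabilities, and the ordering of the scores is preserved.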

## Softmax implementation using NumPy

In this section, we define the softmax function for a two-dimensional input matrix in which each row is a vector of real values (*i.e.*, $\overrightarrow{Z}$). We denote this matrix by $\boldsymbol{Z}$. It has $n$ rows and $k$ columns, where $n$ is the number of real-valued vectors and $k$ is the number of probabilities to output for each vector. The softmax is applied to each row independently.

$$ \boldsymbol{Z} = [z_{i,j}], \quad i \in \{0,1,\ldots, n-1\}, \quad j \in \{0,1, \ldots, k-1\} $$


$$
Softmax(\boldsymbol{Z}) = Softmax(\begin{bmatrix}  
z_{0,0} & z_{0,1} & \ldots & z_{0,k-1} \\
z_{1,0} & z_{1,1} & \ldots & z_{1,k-1} \\
\vdots & \vdots & & \vdots \\
z_{n-1,0} & z_{n-1,1} & \ldots & z_{n-1,k-1} 
 \end{bmatrix}) = 
\begin{bmatrix}
\frac{e^{z_{0,0}}}{\sum_{j=0}^{k-1} e^{z_{0,j}}}& \frac{e^{z_{0,1}}}{\sum_{j=0}^{k-1} e^{z_{0,j}}} & \ldots & \frac{e^{z_{0,k-1}}}{\sum_{j=0}^{k-1} e^{z_{0,j}}} \\
\frac{e^{z_{1,0}}}{\sum_{j=0}^{k-1} e^{z_{1,j}}}& \frac{e^{z_{1,1}}}{\sum_{j=0}^{k-1} e^{z_{1,j}}} & \ldots & \frac{e^{z_{1,k-1}}}{\sum_{j=0}^{k-1} e^{z_{1,j}}} \\
\vdots & \vdots & & \vdots \\
\frac{e^{z_{n-1,0}}}{\sum_{j=0}^{k-1} e^{z_{n-1,j}}}& \frac{e^{z_{n-1,1}}}{\sum_{j=0}^{k-1} e^{z_{n-1,j}}} & \ldots & \frac{e^{z_{n-1,k-1}}}{\sum_{j=0}^{k-1} e^{z_{n-1,j}}} \\
\end{bmatrix}
$$

In [2]:
%matplotlib inline
import numpy as np
import pandas as pd
import tensorflow as tf
import matplotlib.pyplot as plt
np.set_printoptions(suppress=True, precision=3) # Limits printed values to 3 decimal places
import ipywidgets as widgets # for cells with interactivity
from tensorflow.keras.utils import to_categorical # Function to convert labels to one-hot encoding

In [20]:
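def softmax(Z):
    """Row-wise softmax of an (n, k) array of scores."""
    EZ = np.exp(Z) # element-wise exponential
    S = EZ / EZ.sum(axis=1, keepdims=True) # normalize each row so that it sums to 1
    return S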
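
One practical caveat: `np.exp` overflows for large scores (roughly above 709 in float64), which turns the result into `nan`. A common remedy, sketched below as a variant and not as part of the original tutorial, subtracts each row's maximum before exponentiating. The result is mathematically unchanged because the factor $e^{-\max}$ cancels between numerator and denominator. The name `softmax_stable` is hypothetical.

```python
import numpy as np

def softmax_stable(Z):
    """Row-wise softmax that subtracts each row's maximum before exponentiating."""
    Z = np.asarray(Z, dtype=float)
    shifted = Z - Z.max(axis=1, keepdims=True)  # largest entry in each row becomes 0
    EZ = np.exp(shifted)                        # every exponential is now at most 1
    return EZ / EZ.sum(axis=1, keepdims=True)

Z_large = np.array([[1000.0, 1001.0, 1002.0]])  # illustrative scores that would overflow np.exp
S_large = softmax_stable(Z_large)
print(S_large)  # approximately [[0.090 0.245 0.665]]
```

For moderate scores this variant returns the same values as the plain implementation, so it can be used as a drop-in replacement.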

Let's test the code on a case where n = 10 and k = 3! 

In [10]:
aux = np.linspace(-2, 6.0,10).reshape(-1,1)
Z = np.hstack([aux, np.ones_like(aux), 0.2 * np.ones_like(aux)]) # we vary the first column and
                                                                 # keep the other two constant
S = softmax(Z)
print('Z=\n',Z)
print('S=\n',S)

Z=
 [[-2.     1.     0.2  ]
 [-1.111  1.     0.2  ]
 [-0.222  1.     0.2  ]
 [ 0.667  1.     0.2  ]
 [ 1.556  1.     0.2  ]
 [ 2.444  1.     0.2  ]
 [ 3.333  1.     0.2  ]
 [ 4.222  1.     0.2  ]
 [ 5.111  1.     0.2  ]
 [ 6.     1.     0.2  ]]
S=
 [[0.033 0.667 0.3  ]
 [0.077 0.637 0.286]
 [0.169 0.573 0.258]
 [0.331 0.462 0.207]
 [0.546 0.313 0.141]
 [0.745 0.176 0.079]
 [0.877 0.085 0.038]
 [0.945 0.038 0.017]
 [0.977 0.016 0.007]
 [0.99  0.007 0.003]]


In the example below, we will change the values of $\overrightarrow{Z}$ interactively. The interactivity is generated using the ipywidgets Python module.

In [13]:
def plotmodel(s1,s2,s3):
    scores = np.array([[s1, s2, s3]]) # shape: (1,3) 
    S = softmax(scores)[0] # (1,3) array as (3,) array
    plt.rcdefaults()
    fig, ax = plt.subplots(figsize=(3, 2))
    classes = ('0', '1', '2')
    x_pos = [2,4,6]
    ax.bar(x_pos, S, align='center',color='green', ecolor='black')
    ax.set_xticks(x_pos)
    ax.set_xticklabels(classes)
    ax.set_ylim([0,1])
    ax.set_xlabel(r'$\overrightarrow{Z}$') # raw string so the backslash is not read as an escape
    ax.set_ylabel('Softmax')
    plt.show()
                       
widgets.interact(plotmodel,s1 = (1,10,.1),s2 = (1,10,.1),s3 = (1,10,.1))

## One hot encoding

One hot encoding represents categorical data as a list of binary values with one element in the list for each possible category. The name "one hot" comes from the fact that only one binary element is set to 1 (hot) at a time and all other elements are set to 0 (cold).

Most deep learning algorithms cannot work with categorical data directly. The categories need to be converted into numerical representations. This is true for both the input ($X$) and output ($\widehat{Y}$) of our models.

Let's think about this class's garbage classification assignment. There are three classes: "green", "blue" and "black" garbage bins. We often encode these classes by assigning an integer label without giving it much thought:

- "green" - class 0
- "blue" - class 1
- "black" - class 2

This label assignment is called label encoding. Label encoding can be a proper representation if there is a natural ordering relationship between the categories. In the garbage classification example, where there is no clear ordering, label encoding is not a good strategy because it suggests an order that does not exist. One example of categorical data that has an ordering relationship is the [Likert scale](https://en.wikipedia.org/wiki/Likert_scale), which is split into five categories that clearly have an ordering relationship among them: "Like", "Like Somewhat", "Neutral", "Dislike Somewhat", "Dislike".
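
To make the ordered case concrete, here is a small sketch that label-encodes Likert responses so that the integer codes follow the order of the scale. It uses `pd.Categorical` with `ordered=True`; the sample responses are made up for illustration.

```python
import pandas as pd

# Categories listed from most negative to most positive, so the codes respect the ordering
likert_order = ["Dislike", "Dislike Somewhat", "Neutral", "Like Somewhat", "Like"]

responses = ["Like", "Neutral", "Dislike", "Like Somewhat"]  # made-up sample data
encoded = pd.Categorical(responses, categories=likert_order, ordered=True)

print(encoded.codes.tolist())  # [4, 2, 0, 3]
```

Here a larger code really does mean a more positive response, which is the property that label encoding implicitly assumes.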

When our data does not have an ordering relationship, we employ one hot encoding. 

The code snippet below shows how to get the one hot encoding representation from a list of strings.

In [69]:
s = pd.DataFrame({"bin": ["blue","blue","green","black","green","black","black","green"]})
s.head(8)

| | bin |
|---|---|
| 0 | blue |
| 1 | blue |
| 2 | green |
| 3 | black |
| 4 | green |
| 5 | black |
| 6 | black |
| 7 | green |


In [70]:
# Get one-hot encoding of variable; dtype=int gives 0/1 columns
# (newer pandas versions return booleans by default)
one_hot = pd.get_dummies(s, dtype=int)
# Join the one hot encoding to the data frame
s2 = s.join(one_hot)
s2.head(8)

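| | bin | bin_black | bin_blue | bin_green |
|---|---|---|---|---|
| 0 | blue | 0 | 1 | 0 |
| 1 | blue | 0 | 1 | 0 |
| 2 | green | 0 | 0 | 1 |
| 3 | black | 1 | 0 | 0 |
| 4 | green | 0 | 0 | 1 |
| 5 | black | 1 | 0 | 0 |
| 6 | black | 1 | 0 | 0 |
| 7 | green | 0 | 0 | 1 |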


In [46]:
s3 = s2.drop('bin',axis = 1) # Drop the bin column 
s3.head(8)

| | bin_black | bin_blue | bin_green |
|---|---|---|---|
| 0 | 0 | 1 | 0 |
| 1 | 0 | 1 | 0 |
| 2 | 0 | 0 | 1 |
| 3 | 1 | 0 | 0 |
| 4 | 0 | 0 | 1 |
| 5 | 1 | 0 | 0 |
| 6 | 1 | 0 | 0 |
| 7 | 0 | 0 | 1 |


The code snippet below shows how to get the one hot encoding representation from an array of integers using the Keras [to_categorical](https://keras.io/api/utils/python_utils/#to_categorical-function) function.

In [73]:
Y = np.array([0, 0, 1, 0, 2, 1, 1])
Yoh = to_categorical(Y)

print('Y=')
print(Y)
print('\nYoh=')
print(Yoh)

Y=
[0 0 1 0 2 1 1]

Yoh=
[[1. 0. 0.]
 [1. 0. 0.]
 [0. 1. 0.]
 [1. 0. 0.]
 [0. 0. 1.]
 [0. 1. 0.]
 [0. 1. 0.]]
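
If TensorFlow is not installed, the same encoding can be produced with NumPy alone. The sketch below picks rows of an identity matrix. It assumes the labels are integers from 0 to k-1, and `one_hot_numpy` is a hypothetical helper name, not a library function.

```python
import numpy as np

def one_hot_numpy(labels, num_classes=None):
    """One-hot encode integer labels 0..k-1 by selecting rows of a k-by-k identity matrix."""
    labels = np.asarray(labels)
    if num_classes is None:
        num_classes = int(labels.max()) + 1  # infer k from the largest label
    return np.eye(num_classes)[labels]       # row i of the identity is the one-hot vector of class i

Y = np.array([0, 0, 1, 0, 2, 1, 1])
Yoh_np = one_hot_numpy(Y)
print(Yoh_np)
```

Passing `num_classes` explicitly is useful when a batch happens not to contain the largest class.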


To go back from one hot encoding to label encoding, you just need to use the [numpy argmax](https://numpy.org/doc/stable/reference/generated/numpy.argmax.html) function across the columns of the array.

In [74]:
Y2 = np.argmax(Yoh, axis = 1)
print('\nY2=')
print(Y2)
print("\nY = Y2?")
print(np.all(Y == Y2))


Y2=
[0 0 1 0 2 1 1]

Y = Y2?
True


## Categorical cross-entropy and Accuracy

Cross-entropy builds upon the idea of entropy from information theory. It essentially measures the average number of bits needed to encode events drawn from one distribution when using a code optimized for another distribution. In our case, one distribution is the ground-truth distribution represented by the labels of our data and the other distribution is represented by the probabilities that our model outputs (*i.e.*, softmax output of the final layer). The categorical cross-entropy (CCE) is computed by the following equation:

$$CCE[Y_{oh},\widehat{Y}_{oh}] = -\frac{1}{N}\sum_{i=0}^{N-1}\sum_{j=0}^{k-1}Y_{oh}[i,j]\,\log(\widehat{Y}_{oh}[i,j])$$

Another important metric is accuracy, which ranges between 0 and 1: zero means that the model classified all samples incorrectly, and one means that it classified all samples correctly. The accuracy is computed by the following equation:

$$accuracy = \frac{\text{samples classified correctly}}{\text{total number of samples}}$$

Accuracy values are a lot easier to understand and interpret than CCE values. So why do we use CCE most of the time for training our models?

**Important comment**: If accuracy were used as the loss function of our model, our goal would be to maximize it. In the case of CCE, we want to minimize it.

In [63]:
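def compute_cce(Yoh,Yoh_pred):
    # Note: .mean() averages over all N*k entries, so the result is the CCE
    # from the equation above divided by k (the number of classes)
    cce = (-Yoh*np.log(Yoh_pred)).mean()
    return cce

def compute_accuracy(Yoh,Yoh_pred):
    Y = np.argmax(Yoh, axis = 1) # true labels
    Ypred = np.argmax(Yoh_pred, axis = 1) # predicted labels (most probable class)
    accuracy = (Y == Ypred).sum()/Y.size # fraction of correct predictions
    return accuracy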
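
The CCE equation normalizes by the number of samples $N$ only. A minimal sketch of that normalization, summing over the classes and then averaging over the samples, is given below. The name `compute_cce_per_sample` and the clipping constant `eps` (which guards against `log(0)`) are assumptions for illustration and not part of the original tutorial.

```python
import numpy as np

def compute_cce_per_sample(Yoh, Yoh_pred, eps=1e-12):
    """CCE with 1/N normalization: sum over the k classes, then average over the N samples."""
    Yoh_pred = np.clip(Yoh_pred, eps, 1.0)              # avoid log(0) for zero probabilities
    per_sample = -(Yoh * np.log(Yoh_pred)).sum(axis=1)  # one loss value per sample
    return per_sample.mean()

Yoh_demo = np.array([[0, 0, 1],
                     [1, 0, 0],
                     [0, 1, 0]])
Yoh_pred_demo = np.array([[0.01, 0.02, 0.97],
                          [0.94, 0.03, 0.03],
                          [0.02, 0.95, 0.03]])

cce_demo = compute_cce_per_sample(Yoh_demo, Yoh_pred_demo)
print(cce_demo)  # about 0.0479
```

The two normalizations differ only by the constant factor $k$, so they rank any two sets of predictions in the same way.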

In [66]:
# Labels
Yoh = np.array([[0, 0, 1],
                [1, 0, 0],
                [0, 1, 0]])

# Confident predictions
Yoh_pred = np.array([[0.01, 0.02, 0.97],
                     [0.94, 0.03, 0.03],
                     [0.02, 0.95, 0.03]])

# Low confidence predictions
Yoh_pred2 = np.array([[0.33, 0.33, 0.34],
                      [0.40, 0.30, 0.30],
                      [0.31, 0.36, 0.33]])

In [67]:
print("Confident predictions case")
print("CCE:")
print(compute_cce(Yoh,Yoh_pred))
print("Accuracy:")
print(compute_accuracy(Yoh,Yoh_pred))

Confident predictions case
CCE:
0.015958656176705183
Accuracy:
1.0


In [68]:
print("Low confidence predictions case")
print("CCE:")
print(compute_cce(Yoh,Yoh_pred2))
print("Accuracy:")
print(compute_accuracy(Yoh,Yoh_pred2))

Low confidence predictions case
CCE:
0.3351946267531185
Accuracy:
1.0


In both the high- and low-confidence cases, the accuracy was 1. The CCE, on the other hand, was considerably smaller for the high-confidence predictions than for the low-confidence ones. We want to have confident predictions, and that is one of the many reasons why we prefer CCE to accuracy as a loss function for training our models.
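
The same point can be seen along a path between the two cases. In the sketch below, predictions move gradually from the low-confidence case to the confident case: accuracy stays flat while CCE keeps decreasing, so only CCE signals that the model is improving. The interpolation weights are an arbitrary choice for illustration, and the two helper functions are repeated in compact form so that the block runs on its own.

```python
import numpy as np

def compute_cce(Yoh, Yoh_pred):
    return (-Yoh * np.log(Yoh_pred)).mean()

def compute_accuracy(Yoh, Yoh_pred):
    return (np.argmax(Yoh, axis=1) == np.argmax(Yoh_pred, axis=1)).mean()

Yoh = np.array([[0, 0, 1], [1, 0, 0], [0, 1, 0]])
low = np.array([[0.33, 0.33, 0.34], [0.40, 0.30, 0.30], [0.31, 0.36, 0.33]])
high = np.array([[0.01, 0.02, 0.97], [0.94, 0.03, 0.03], [0.02, 0.95, 0.03]])

results = []
for t in np.linspace(0.0, 1.0, 5):   # t = 0: low confidence, t = 1: confident
    pred = (1 - t) * low + t * high  # convex combination, so each row still sums to 1
    results.append((compute_cce(Yoh, pred), compute_accuracy(Yoh, pred)))
    print(f"t={t:.2f}  CCE={results[-1][0]:.4f}  accuracy={results[-1][1]:.2f}")
```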

## Suggested References

- https://gombru.github.io/2018/05/23/cross_entropy_loss/
- http://www.jussihuotari.com/2018/01/17/why-loss-and-accuracy-metrics-conflict/