# neuralthreads
[medium](https://neuralthreads.medium.com/i-was-not-satisfied-by-any-deep-learning-tutorials-online-37c5e9f4bea1)

## Chapter 3 — Activation functions and their derivatives

Softmax function — It is frustrating that everyone talks about it but very few talk about its Jacobian

### 3.7 What is the Softmax function and how to compute its Jacobian?

This is the most important post in the third chapter. In this post, we will talk about the Softmax activation function and how to compute its Jacobian.

This is the definition of the Softmax function.

In [1]:
%%latex
\begin{gather*}
    y_i = softmax(x)_i = f(x)_i = \dfrac{e^{x_i}}{\sum_{j=1}^{n}e^{x_j}}
    \\
    \text{Suppose we have 'x'}\\
    \\
    x = 
    \begin{bmatrix}
    x_1\\
    x_2\\
    x_3\\
    x_4
    \end{bmatrix} \\
    \\
    \text{Then 'y' is}\\
    \\
    \Rightarrow y = 
    \begin{bmatrix}
    y_1\\
    y_2\\
    y_3\\
    y_4
    \end{bmatrix} = softmax(x) = f(x) =  
    \begin{bmatrix}
    \dfrac{e^{x_1}}{e^{x_1} + e^{x_2} + e^{x_3} + e^{x_4}} \\
    \\
    \dfrac{e^{x_2}}{e^{x_1} + e^{x_2} + e^{x_3} + e^{x_4}} \\
    \\
    \dfrac{e^{x_3}}{e^{x_1} + e^{x_2} + e^{x_3} + e^{x_4}} \\
    \\
    \dfrac{e^{x_4}}{e^{x_1} + e^{x_2} + e^{x_3} + e^{x_4}} \\
    \end{bmatrix} \\
    \\
    \text{We can easily define the Softmax function in Python by doing this reduction} \\
    \\
    \begin{bmatrix}
    \dfrac{e^{x_1}}{e^{x_1} + e^{x_2} + e^{x_3} + e^{x_4}} \\
    \\
    \dfrac{e^{x_2}}{e^{x_1} + e^{x_2} + e^{x_3} + e^{x_4}} \\
    \\
    \dfrac{e^{x_3}}{e^{x_1} + e^{x_2} + e^{x_3} + e^{x_4}} \\
    \\
    \dfrac{e^{x_4}}{e^{x_1} + e^{x_2} + e^{x_3} + e^{x_4}} \\
    \end{bmatrix}  = 
    \begin{bmatrix}
    e^{x_1} \\
    e^{x_2} \\
    e^{x_3} \\
    e^{x_4} \\    
    \end{bmatrix} \big/ (e^{x_1} + e^{x_2} + e^{x_3} + e^{x_4}) \Rightarrow \\
    \\
    \Rightarrow e^{x} / sum(e^{x})       
\end{gather*}


In [2]:
import numpy as np                             # importing NumPy
np.random.seed(42)

def softmax(x):                                # Softmax
    return np.exp(x) / np.sum(np.exp(x))
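
The definition above is fine for small inputs, but np.exp overflows in float64 once an entry of x goes much above 709. A common workaround, sketched here as an assumption and not as part of the original post, is to subtract max(x) before exponentiating. The name softmax_stable and the test vector x_big are hypothetical.

```python
import numpy as np

def softmax_stable(x):
    # Hypothetical helper, not from the original post.
    # Subtracting max(x) does not change the result, because the factor
    # e^{-max(x)} cancels between the numerator and the denominator.
    e = np.exp(x - np.max(x))
    return e / np.sum(e)

# For moderate inputs it agrees with the plain definition e^x / sum(e^x).
x_small = np.array([[0.25], [-1.0], [2.3], [-0.2], [1.0]])
plain = np.exp(x_small) / np.sum(np.exp(x_small))
print(np.allclose(softmax_stable(x_small), plain))

# For large inputs the plain definition would give inf / inf,
# while the shifted version stays finite.
x_big = np.array([[1000.0], [1001.0], [1002.0]])
print(softmax_stable(x_big))
```

The rest of this post keeps using the simple softmax defined above, since the example inputs are small.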

Now, the most important question: how do we compute its derivative, that is, its Jacobian?

Unlike the Sigmoid or any of the previous activation functions, **we don’t have a situation like this**, where each output depends only on its own input.

In [3]:
%%latex
\begin{gather*}
    y_1 = f_1(x_1) \\
    \\
    y_2 = f_2(x_2) \\
    \\
    y_3 = f_3(x_3) \\
    \\
    y_4 = f_4(x_4) \\
    \\
    \text{Instead, we have a situation like this.}\\
    \\
    y_1 = f_1(x_1,x_2,x_3,x_4) \\
    \\
    y_2 = f_2(x_1,x_2,x_3,x_4) \\
    \\
    y_3 = f_3(x_1,x_2,x_3,x_4) \\
    \\
    y_4 = f_4(x_1,x_2,x_3,x_4) \\
    \\    
    \text{In such a case, we use something called Jacobians.} \\
    \text{In simple terms, a Jacobian is the matrix of all first-order partial derivatives.}\\
    \text{So, the Jacobian 'J' for the Softmax function is:}\\
    \\
    \newcommand{\arraystretch}{2}
    J = \frac{\partial (y_1,y_2,y_3,y_4)}{\partial (x_1,x_2,x_3,x_4)} = 
    \begin{bmatrix*}
      \frac{\partial y_1}{\partial x_1}   & \frac{\partial y_1}{\partial x_2} & \frac{\partial y_1}{\partial x_3} & \frac{\partial y_1}{\partial x_4}  \\
      \frac{\partial y_2}{\partial x_1}   & \frac{\partial y_2}{\partial x_2} & \frac{\partial y_2}{\partial x_3} & \frac{\partial y_2}{\partial x_4} \\
      \frac{\partial y_3}{\partial x_1}   & \frac{\partial y_3}{\partial x_2} & \frac{\partial y_3}{\partial x_3} & \frac{\partial y_3}{\partial x_4} \\
      \frac{\partial y_4}{\partial x_1}   & \frac{\partial y_4}{\partial x_2} & \frac{\partial y_4}{\partial x_3} & \frac{\partial y_4}{\partial x_4} \\
    \end{bmatrix*}\\
    \newcommand{\arraystretch}{1}
    \\
    \text{Let us start finding each term in this Jacobian.}\\
    \text{Starting with the first term, i.e.,} \\
    \\
    \frac{\partial y_1}{\partial x_1}
    \\
    \\
    y_1 = \frac{e^{x_1}}{e^{x_1} + e^{x_2} + e^{x_3} + e^{x_4}}
    \\
    \\
    \frac{\partial y_1}{\partial x_1} = \frac{\partial( \frac{e^{x_1}}{e^{x_1} + e^{x_2} + e^{x_3} + e^{x_4}})}{\partial x_1} \Rightarrow \\
    \\
\Rightarrow \frac{\partial y_1}{\partial x_1} = \frac{e^{x_1}(e^{x_1} + e^{x_2} + e^{x_3} + e^{x_4}) - e^{x_1}(e^{x_1})}
{(e^{x_1} + e^{x_2} + e^{x_3} + e^{x_4})^2} \Rightarrow \\
    \\
\Rightarrow \frac{\partial y_1}{\partial x_1} = \frac{e^{2x_1} + e^{x_1} (e^{x_2} + e^{x_3} + e^{x_4}) - e^{2x_1}}
{(e^{x_1} + e^{x_2} + e^{x_3} + e^{x_4})^2} \Rightarrow \\
    \\
\Rightarrow \frac{\partial y_1}{\partial x_1} = \frac{e^{x_1} (e^{x_2} + e^{x_3} + e^{x_4})}
{(e^{x_1} + e^{x_2} + e^{x_3} + e^{x_4})^2} \Rightarrow \\
    \\
\Rightarrow \frac{\partial y_1}{\partial x_1} = (\frac{e^{x_1}}{{e^{x_1} + e^{x_2} + e^{x_3} + e^{x_4}}})
(\frac{e^{x_2} + e^{x_3} + e^{x_4}}{{e^{x_1} + e^{x_2} + e^{x_3} + e^{x_4}}}) \Rightarrow \\
    \\
\Rightarrow \frac{\partial y_1}{\partial x_1} = y_1
(\frac{e^{x_1} + e^{x_2} + e^{x_3} + e^{x_4} - e^{x_1}}{{e^{x_1} + e^{x_2} + e^{x_3} + e^{x_4}}}) \Rightarrow \\
    \\
\Rightarrow \frac{\partial y_1}{\partial x_1} = y_1
(1 - \frac{e^{x_1}}{{e^{x_1} + e^{x_2} + e^{x_3} + e^{x_4}}})\Rightarrow \\
    \\
\Rightarrow \frac{\partial y_1}{\partial x_1} = y_1 ( 1 - y_1)
\end{gather*}


Similarly, we can calculate an off-diagonal term:

In [4]:
%%latex
\begin{gather*}
    \frac{\partial y_1}{\partial x_2} \\
    \\
    \frac{\partial y_1}{\partial x_2} = \frac{\partial(\frac{e^{x_1}}{e^{x_1} + e^{x_2} + e^{x_3} + e^{x_4}})}{\partial x_2} \Rightarrow \\
    \\
\Rightarrow \frac{\partial y_1}{\partial x_2} = \frac{-e^{x_1}e^{x_2}}
{(e^{x_1} + e^{x_2} + e^{x_3} + e^{x_4})^2} \Rightarrow \\
    \\
\Rightarrow \frac{\partial y_1}{\partial x_2} = - (\frac{e^{x_1}}
{e^{x_1} + e^{x_2} + e^{x_3} + e^{x_4}})
(\frac{e^{x_2}}
{e^{x_1} + e^{x_2} + e^{x_3} + e^{x_4}})\Rightarrow \\
    \\
\Rightarrow \frac{\partial y_1}{\partial x_2} = - y_1 y_2
\end{gather*}



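Repeating this for every term in the Jacobian, we get a symmetric matrix: the diagonal terms have the form y_i(1 - y_i) and the off-diagonal terms have the form -y_i y_j.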

In [11]:
%%latex
\begin{gather*}
    \newcommand{\arraystretch}{2}
    J =  
    \begin{bmatrix*}
      y_1(1-y_1)    & -y_1 y_2      & -y_1 y_3      & -y_1 y_4   \\
      -y_1 y_2      & y_2(1-y_2)    & -y_2 y_3      & -y_2 y_4   \\
      -y_1 y_3      & -y_2 y_3      & y_3(1-y_3)    & -y_3 y_4   \\
      -y_1 y_4      & -y_2 y_4      & -y_3 y_4      & y_4(1-y_4) \\
    \end{bmatrix*}\\
\end{gather*}
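
The two cases we derived can be written as a single formula. This is only a compact restatement of the matrix above; the Kronecker delta symbol is notation that the post itself does not use.

```latex
\begin{gather*}
    \frac{\partial y_i}{\partial x_j} = y_i(\delta_{ij} - y_j),
    \qquad
    \delta_{ij} =
    \begin{cases}
      1 & \text{if } i = j \\
      0 & \text{if } i \neq j
    \end{cases}
\end{gather*}
```

Setting i = j = 1 gives y_1(1 - y_1), and setting i = 1, j = 2 gives -y_1 y_2, which matches the two derivations.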


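We can factor this matrix into a form that is easy to write in Python. Here * means element-wise multiplication with NumPy broadcasting, not matrix multiplication: row i of the matrix on the right is scaled by y_i.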

In [6]:
%%latex
\begin{gather*}
    \newcommand{\arraystretch}{2}
    J =  \begin{bmatrix*}
      y_1 \\
      y_2 \\
      y_3 \\
      y_4 \\
    \end{bmatrix*} * 
    \begin{bmatrix*}
      1-y_1   & -y_2    & -y_3     & -y_4   \\
      -y_1    & 1-y_2   & -y_3     & -y_4   \\
      -y_1    & -y_2    & 1-y_3    & -y_4   \\
      -y_1    & -y_2    & -y_3     & 1-y_4  \\
    \end{bmatrix*}\\
    \\
    J =  \begin{bmatrix*}
      y_1 \\
      y_2 \\
      y_3 \\
      y_4 \\
    \end{bmatrix*} * (
    \begin{bmatrix*}
      1   & 0    & 0    & 0   \\
      0   & 1    & 0    & 0   \\
      0   & 0    & 1    & 0   \\
      0   & 0    & 0    & 1  \\
    \end{bmatrix*} - 
    \begin{bmatrix*}
      y_1   & y_2    & y_3     & y_4   \\
    \end{bmatrix*} ) \Rightarrow \\
    \\
\Rightarrow J = softmax(x) * (I - softmax(x)^{T})
\end{gather*}


Notice that the diagonal terms y_i(1 - y_i) have the same form as the derivative of the Sigmoid function. The off-diagonal terms are what make the Softmax Jacobian different.

In [7]:
def softmax_dash(x):                           # Softmax Jacobian
    
    y = softmax(x)                             # column vector, shape (n, 1)
    I = np.eye(x.shape[0])                     # n x n identity matrix
    
    return y * (I - y.T)                       # element-wise product with broadcasting
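
One way to gain confidence in this formula is to compare it with a finite-difference estimate. The sketch below is not part of the original post: numerical_jacobian, softmax_jacobian, and the step size eps are hypothetical choices, and the block redefines softmax so that it runs on its own. It also checks the equivalent matrix form diag(y) - y y^T.

```python
import numpy as np

def softmax(x):
    e = np.exp(x)
    return e / np.sum(e)

def softmax_jacobian(x):
    # Same formula as softmax_dash: y * (I - y^T) with broadcasting
    y = softmax(x)
    return y * (np.eye(x.shape[0]) - y.T)

def numerical_jacobian(f, x, eps=1e-6):
    # Hypothetical helper: central differences, one input coordinate at a time
    n = x.shape[0]
    J = np.zeros((n, n))
    for j in range(n):
        step = np.zeros_like(x)
        step[j, 0] = eps
        # Column j holds the change of every output with respect to x_j
        J[:, j] = ((f(x + step) - f(x - step)) / (2 * eps)).ravel()
    return J

x = np.array([[0.25], [-1.0], [2.3], [-0.2], [1.0]])
y = softmax(x)

J = softmax_jacobian(x)
J_num = numerical_jacobian(softmax, x)
J_alt = np.diag(y.ravel()) - y @ y.T           # diag(y) - y y^T

print(np.allclose(J, J_num, atol=1e-8))
print(np.allclose(J, J_alt))
```

Both checks should print True. The finite-difference estimate only agrees up to a small numerical error, which is why the comparison uses np.allclose and not ==.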

Let us look at an example.

In [8]:
x = np.array([[0.25], [-1], [2.3], [-0.2], [1]])
print(x)
print(softmax(x))
print(np.sum(softmax(x)))

[[ 0.25]
 [-1.  ]
 [ 2.3 ]
 [-0.2 ]
 [ 1.  ]]
[[0.08468093]
 [0.02426149]
 [0.6577931 ]
 [0.05399495]
 [0.17926953]]
0.9999999999999999


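Notice that the entries of the softmax output ‘y’ sum to 1. The printed 0.9999999999999999 differs from 1 only because of floating-point rounding.
Why? Every entry has the same denominator, and that denominator is exactly the sum of all the numerators.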
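
Written out, the argument is one line. The summation index j in the denominator is notation added here to keep it separate from the outer index i.

```latex
\begin{gather*}
    \sum_{i=1}^{n} y_i
    = \sum_{i=1}^{n} \frac{e^{x_i}}{\sum_{j=1}^{n} e^{x_j}}
    = \frac{\sum_{i=1}^{n} e^{x_i}}{\sum_{j=1}^{n} e^{x_j}}
    = 1
\end{gather*}
```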

Because every entry is positive and they sum to 1, y1, y2, y3, y4… can be treated as probabilities over the possible answers to a question. We will see more on this later when we talk about the Categorical Cross-entropy loss function.

In [9]:
print(softmax_dash(x))
softmax_dash(x) == softmax_dash(x).T

[[ 0.07751007 -0.00205449 -0.05570253 -0.00457234 -0.01518071]
 [-0.00205449  0.02367287 -0.01595904 -0.00131    -0.00434935]
 [-0.05570253 -0.01595904  0.22510134 -0.0355175  -0.11792226]
 [-0.00457234 -0.00131    -0.0355175   0.05107949 -0.00967965]
 [-0.01518071 -0.00434935 -0.11792226 -0.00967965  0.14713197]]


array([[ True,  True,  True,  True,  True],
       [ True,  True,  True,  True,  True],
       [ True,  True,  True,  True,  True],
       [ True,  True,  True,  True,  True],
       [ True,  True,  True,  True,  True]])

We can see that the Jacobian of the Softmax function is a symmetric matrix.
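
Two more properties follow from the formula and are easy to check. Since the outputs always sum to 1, the partial derivatives in each column of the Jacobian must sum to 0, and by symmetry so must each row. The sketch below is an addition to the post, under the assumption that comparing floats with np.allclose is preferable to ==; it redefines softmax so that it runs on its own.

```python
import numpy as np

def softmax(x):
    e = np.exp(x)
    return e / np.sum(e)

x = np.array([[0.25], [-1.0], [2.3], [-0.2], [1.0]])
y = softmax(x)
J = y * (np.eye(x.shape[0]) - y.T)             # same formula as softmax_dash

print(np.allclose(J, J.T))                     # symmetric up to rounding
print(np.allclose(J.sum(axis=0), 0.0))         # every column sums to 0
print(np.allclose(J.sum(axis=1), 0.0))         # every row sums to 0
```

All three lines should print True.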

I hope that now you understand the Softmax function and its Jacobian.

With this, the third chapter is over. In the next post, we will start the Fourth Chapter — Losses and their derivatives with Mean Square Error.