# Softmax Activation Function

This notebook explores the properties of the softmax activation function and its derivatives, as well as the use of the cross-entropy loss function.

## Overview
The notebook has two parts: deriving the partial derivatives of the softmax function, and using them to show that the gradient of the cross-entropy loss with respect to the softmax inputs is $o_i - y_i$.

In [1]:
import pandas as pd
import numpy as np


## Softmax Activation Function

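$$
\sigma_{v_j} = \frac{e^{v_j}}{\sum_{l=1}^{k}e^{v_l}}
$$

Here $v_j$ is the $j$-th input to the softmax, $k$ is the number of classes, and $\sigma_{v_j}$ is the $j$-th output.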
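The definition above can be sketched in NumPy as follows. This is a minimal illustration, not part of the original notebook: the function name `softmax` and the max-subtraction step are assumptions added here. Subtracting the maximum input does not change the result, because the common factor cancels between numerator and denominator, but it avoids overflow in the exponentials.

```python
import numpy as np

def softmax(v):
    """Softmax of a 1-D array of inputs v (hypothetical helper)."""
    v = np.asarray(v, dtype=float)
    # Shifting by the maximum leaves the output unchanged
    # but keeps np.exp from overflowing for large inputs.
    exp_v = np.exp(v - np.max(v))
    return exp_v / np.sum(exp_v)

probs = softmax([1.0, 2.0, 3.0])
print(probs)        # three positive values, largest for the largest input
print(probs.sum())  # sums to 1 up to floating-point error
```

The outputs are positive and sum to one, so they can be read as class probabilities.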

### Part 1

We differentiate $\sigma_{v_j}$ with respect to $v_i$ using the quotient rule, treating the cases $i = j$ and $i \neq j$ separately.


For $i = j$: 



\begin{align*}
    \frac{\partial \sigma_{v_j}}{\partial v_i}
        &= \frac{ e^{v_j} \sum_{l=1}^{k} e^{v_l} - e^{v_j}e^{v_i}}{\left(\sum_{l=1}^{k} e^{v_l}\right)^2} \\
        &= \frac{ e^{v_j}}{\sum_{l=1}^{k} e^{v_l}} - \left(\frac{ e^{v_j}}{\sum_{l=1}^{k} e^{v_l}}\right)^2  \\
        &= \sigma_{v_j} - \sigma_{v_j}^2 \\
        &= \sigma_{v_j}(1 - \sigma_{v_j})
\end{align*}


For $i \neq j$:

\begin{align*}
    \frac{\partial \sigma_{v_j}}{\partial v_i}  
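        &= \frac{- e^{v_j}e^{v_i}}{\left(\sum_{l=1}^{k} e^{v_l}\right)^2} \\
        &= - \frac{e^{v_j}}{\sum_{l=1}^{k} e^{v_l}} \cdot \frac{e^{v_i}}{\sum_{l=1}^{k} e^{v_l}} \\
        &= - \sigma_{v_j}\sigma_{v_i}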
\end{align*}
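Both cases can be combined into a single Jacobian matrix with entries $\sigma_{v_j}(\delta_{ij} - \sigma_{v_i})$, which is $\mathrm{diag}(\sigma) - \sigma\sigma^T$ in matrix form. The sketch below compares that closed form with a central finite-difference estimate. The helper names `softmax_jacobian` and `numerical_jacobian`, the test vector, and the step size are assumptions chosen for illustration.

```python
import numpy as np

def softmax(v):
    v = np.asarray(v, dtype=float)
    exp_v = np.exp(v - np.max(v))
    return exp_v / exp_v.sum()

def softmax_jacobian(v):
    """Analytic Jacobian: J[j, i] = s_j * (delta_ij - s_i)."""
    s = softmax(v)
    return np.diag(s) - np.outer(s, s)

def numerical_jacobian(f, v, eps=1e-6):
    """Central finite differences, one input coordinate at a time."""
    v = np.asarray(v, dtype=float)
    cols = []
    for i in range(v.size):
        step = np.zeros_like(v)
        step[i] = eps
        cols.append((f(v + step) - f(v - step)) / (2 * eps))
    # Column i holds the derivatives of every output w.r.t. v_i.
    return np.stack(cols, axis=1)

v = np.array([0.5, -1.0, 2.0])  # arbitrary test inputs
J = softmax_jacobian(v)
print(np.allclose(J, numerical_jacobian(softmax, v), atol=1e-6))
```

The diagonal of `J` corresponds to the $i = j$ case and the off-diagonal entries to the $i \neq j$ case.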





### Part 2

We want to find the gradient of the loss function with respect to the input $v_i$, which is $\frac{\partial L}{\partial v_i}$. Throughout this part, $o_j = \sigma_{v_j}$ denotes the $j$-th softmax output.

To compute $\frac{\partial L}{\partial v_i}$, we use the chain rule, summing over all classes because every output $o_j$ depends on the input $v_i$:

$$ \frac{\partial L}{\partial v_i} = \sum_{j=1}^{k} \frac{\partial L}{\partial o_j} \frac{\partial o_j}{\partial v_i} $$

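First, let's find the derivative of the loss with respect to the softmax output. For the cross-entropy loss $L = -\sum_{j=1}^{k} y_j \log o_j$, only the $i$-th term depends on $o_i$, so: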
$$
\frac{\partial L}{\partial o_i} = -\frac{y_i}{o_i}
$$
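As a quick numerical sanity check of this derivative, the sketch below assumes the standard cross-entropy loss $L = -\sum_j y_j \log o_j$ with the natural logarithm and treats each $o_j$ as a free variable. The function names and the example values of `o` and `y` are hypothetical.

```python
import numpy as np

def cross_entropy(o, y):
    """L = -sum_j y_j * log(o_j) for outputs o and targets y."""
    return -np.sum(y * np.log(o))

def cross_entropy_grad_o(o, y):
    """Elementwise derivative dL/do_j = -y_j / o_j."""
    return -y / o

o = np.array([0.2, 0.7, 0.1])  # hypothetical softmax outputs
y = np.array([0.0, 1.0, 0.0])  # one-hot target for the second class

# Central finite differences along each coordinate of o.
eps = 1e-6
numeric = np.array([
    (cross_entropy(o + eps * e, y) - cross_entropy(o - eps * e, y)) / (2 * eps)
    for e in np.eye(o.size)
])
print(np.allclose(cross_entropy_grad_o(o, y), numeric, atol=1e-6))
```

With a one-hot target, only the entry for the true class is nonzero.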


Substituting $\frac{\partial L}{\partial o_j} = -\frac{y_j}{o_j}$ and using the results from part 1:


* For $j = i$, $\frac{\partial o_j}{\partial v_i} = o_i(1 - o_i)$

* For $j \neq i$, $\frac{\partial o_j}{\partial v_i} = -o_io_j$

We get:

\begin{align*}
    \frac{\partial L}{\partial v_i}
        &= -\frac{y_i}{o_i} o_i(1 - o_i) + \sum_{j \neq i} \left(-\frac{y_j}{o_j}\right)(-o_i o_j) \\
        &= -y_i(1 - o_i) + \sum_{j \neq i} y_j o_i \\
        &= -y_i + y_i o_i + o_i\sum_{j \neq i} y_j
\end{align*}

Since $y$ is one-hot encoded, $\sum_{j=1}^{k} y_j = 1$, and therefore $\sum_{j \neq i} y_j = 1 - y_i$. Substituting this in gives:

\begin{align*}
    \frac{\partial L}{\partial v_i}
        &= -y_i + y_i o_i + o_i(1 - y_i) \\
        &= o_i - y_i
\end{align*}

This confirms the equation $\frac{\partial L}{\partial v_i} = o_i - y_i$, using the results from Part 1.
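The result can also be checked numerically. The sketch below is an illustration under stated assumptions, not code from the original notebook: it assumes the cross-entropy loss $L = -\sum_j y_j \log o_j$ applied to softmax outputs, and the names `loss` and `loss_grad` and the test values are hypothetical. It compares $o - y$ against a central finite-difference gradient of the loss with respect to the inputs $v$.

```python
import numpy as np

def softmax(v):
    v = np.asarray(v, dtype=float)
    exp_v = np.exp(v - np.max(v))
    return exp_v / exp_v.sum()

def loss(v, y):
    """Cross-entropy of softmax(v) against the target y."""
    return -np.sum(y * np.log(softmax(v)))

def loss_grad(v, y):
    """Analytic gradient derived above: dL/dv = o - y."""
    return softmax(v) - y

v = np.array([0.5, -1.0, 2.0])  # arbitrary test inputs
y = np.array([0.0, 1.0, 0.0])   # one-hot target

# Central finite differences along each coordinate of v.
eps = 1e-6
numeric = np.array([
    (loss(v + eps * e, y) - loss(v - eps * e, y)) / (2 * eps)
    for e in np.eye(v.size)
])
print(np.allclose(loss_grad(v, y), numeric, atol=1e-6))
```

Because the softmax outputs and the one-hot target each sum to one, the entries of the gradient sum to zero.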