# LESSON 6: SOFTMAX REGRESSION
<table><tr>
<td> <img src="../images/linear_logistic_regression_logo.jpeg" width="600px"/> </td>
</tr></table>

*This lecture is based on [machinelearningcoban.com](https://machinelearningcoban.com/2017/02/17/softmax/)*

## 1. Softmax regression introduction
Plain logistic regression only solves binary classification, that is, problems with two classes.

To solve a multi-class classification problem with logistic regression, we have to build multiple logistic regression models, one per class, each separating that class from all the others. This approach is called ***one-vs-rest***.

<img src="../images/softmax_regression_one_vs_rest.png" width="600px"/>

The outputs $a_i$ with $i = 1, 2, \dots, C$ are computed almost independently, so their sum can be larger or smaller than 1 and they cannot be read directly as class probabilities.
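
The point about the sum can be illustrated with a small sketch. The scores below are made-up values standing in for the outputs of three independent one-vs-rest classifiers; they are assumptions for illustration only.

```python
import math

def sigmoid(z):
    """Logistic function applied to a single score."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical scores from three independent one-vs-rest classifiers.
high_scores = [2.0, 1.0, 0.5]
low_scores = [-2.0, -1.0, -0.5]

high_outputs = [sigmoid(z) for z in high_scores]
low_outputs = [sigmoid(z) for z in low_scores]

# Each output lies in (0, 1), but nothing forces the outputs to sum to 1.
print(sum(high_outputs))  # larger than 1
print(sum(low_outputs))   # smaller than 1
```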

<img src="../images/softmax_regression_softmax_net.png" width="700px"/>

<img src="../images/softmax_regression_example.png" width="700px"/>

Because we have to calculate $e^{z_i}$, a large $z_i$ makes $e^{z_i}$ very large and causes a floating-point overflow. We need a numerically stable version of softmax, obtained by multiplying the numerator and the denominator by the same constant:

<center>
    \[
    softmax(z_i)
    = \frac{e^{z_i}}{\sum_{j=1}^C e^{z_j}}
    = \frac{e^{-\max_k(z_k)} \cdot e^{z_i}}{e^{-\max_k(z_k)} \cdot \sum_{j=1}^C e^{z_j}}
    = \frac{e^{z_i - \max_k(z_k)}}{\sum_{j=1}^C e^{z_j - \max_k(z_k)}}
    \]
</center>
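
The identity above can be checked with a minimal NumPy sketch. The function names `softmax_naive` and `softmax_stable` and the test vectors are assumptions chosen for illustration.

```python
import numpy as np

def softmax_naive(z):
    """Direct translation of e^{z_i} / sum_j e^{z_j}; overflows for large z."""
    e = np.exp(z)
    return e / e.sum()

def softmax_stable(z):
    """Subtract max(z) before exponentiating; same value, no overflow."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

z_small = np.array([1.0, 2.0, 3.0])
z_large = np.array([1000.0, 1001.0, 1002.0])  # z_small shifted by 999

# exp(1000) overflows to inf in float64, and inf / inf gives nan.
with np.errstate(over="ignore", invalid="ignore"):
    naive_large = softmax_naive(z_large)

stable_large = softmax_stable(z_large)
print(naive_large)
print(stable_large)
```

Since shifting every score by the same constant does not change the softmax, `stable_large` matches the softmax of `z_small`.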

## 2. Loss function and Optimizer for Softmax regression
Instead of having only 2 classes like logistic regression, softmax regression has $C$ classes, so it needs a different form of the cross-entropy loss function.

Its name is ***CATEGORICAL CROSS ENTROPY***

<center>
    \[
    J(W; X, Y) = -\sum_{i=1}^N \sum_{j=1}^C y_{ji}\log(a_{ji})
    = -\sum_{i=1}^N \sum_{j=1}^C y_{ji}\log\frac{e^{z_{ji}}}{\sum_{k=1}^C e^{z_{ki}}}
    \]
</center>

For each sample from the dataset,
<center>
    \[
    J({W}; {x}_i, {y}_i)
    = -\sum_{j=1}^C y_{ji}\log\frac{e^{z_{ji}}}{\sum_{k=1}^C e^{z_{ki}}} \\
    = -\sum_{j=1}^C \left(y_{ji}z_{ji} - y_{ji}\log{\sum_{k=1}^C e^{z_{ki}}}\right) \\
    = -\sum_{j=1}^C y_{ji}z_{ji} + \sum_{j=1}^C y_{ji}\log{\sum_{k=1}^C e^{z_{ki}}} \\
    = -\sum_{j=1}^C y_{ji}z_{ji} + \log{\sum_{k=1}^C e^{z_{ki}}}
    \]
</center>

Note:
- $\sum_{j=1}^C y_{ji}=1$ because $y_i$ is a probability vector (one-hot for a labelled sample), so its entries sum to 1
- $\log{\sum_{k=1}^C e^{z_{ki}}}$ does not depend on $j$, so it can be taken out of $\sum_{j=1}^C$, which then equals 1
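
The simplification can be verified numerically. The sketch below assumes a single sample with $C = 3$, made-up scores `z_i`, and a one-hot label `y_i`.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax of a 1-D score vector."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

z_i = np.array([2.0, -1.0, 0.5])  # hypothetical scores z_{ji} for one sample
y_i = np.array([0.0, 0.0, 1.0])   # one-hot label: the true class is index 2

# Definition: -sum_j y_ji * log(a_ji)
loss_definition = -np.sum(y_i * np.log(softmax(z_i)))

# Simplified form: -sum_j y_ji * z_ji + log(sum_k exp(z_ki))
loss_simplified = -np.dot(y_i, z_i) + np.log(np.sum(np.exp(z_i)))

print(loss_definition, loss_simplified)
```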

To calculate the derivative of $J$ with respect to $W$, where $w_j$ is the $j$-th column of $W$ and $z_{ji} = w_j^T x_i$, we can use the following formula

<center>
    \[
    \frac{\partial J_i(W)}{\partial W} = [\frac{\partial J_i(W)}{\partial w_1}, \frac{\partial J_i(W)}{\partial w_2}, \dots, \frac{\partial J_i(W)}{\partial w_C}]
    \]
</center>

and the gradient with respect to each column can be calculated by

<center>
    \[
    \frac{\partial J_i(W)}{\partial w_j}
    = -y_{ji}x_i + \frac{e^{z_{ji}} x_i}{\sum_{k=1}^C e^{z_{ki}}}
    = -y_{ji}x_i + a_{ji} x_i
    = x_i (a_{ji} - y_{ji})
    \]
</center>

Note:
- In the first equality, because we differentiate with respect to $w_j$, the derivative of every term in $\sum_{k=1}^C e^{z_{ki}}$ is 0 except that of $e^{z_{ji}}$
- $e_{ji} = a_{ji} - y_{ji}$ is the difference between the prediction and the true value
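
A finite-difference check is one way to gain confidence in this gradient. The sketch below assumes $W$ has shape $d \times C$ with $w_j$ as column $j$ and $z_i = W^T x_i$; the sizes and random values are arbitrary.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def sample_loss(W, x, y):
    """Per-sample loss -sum_j y_j z_j + log(sum_k exp(z_k)) with z = W^T x."""
    z = W.T @ x
    return -np.dot(y, z) + np.log(np.sum(np.exp(z)))

rng = np.random.default_rng(0)
d, C = 4, 3
W = rng.normal(size=(d, C))
x = rng.normal(size=d)
y = np.array([0.0, 1.0, 0.0])

# Analytic gradient: column j is x * (a_j - y_j).
a = softmax(W.T @ x)
analytic = np.outer(x, a - y)

# Central finite differences, one entry of W at a time.
numeric = np.zeros_like(W)
eps = 1e-6
for r in range(d):
    for c in range(C):
        W_plus, W_minus = W.copy(), W.copy()
        W_plus[r, c] += eps
        W_minus[r, c] -= eps
        numeric[r, c] = (sample_loss(W_plus, x, y) - sample_loss(W_minus, x, y)) / (2 * eps)

max_abs_diff = np.max(np.abs(analytic - numeric))
print(max_abs_diff)
```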

Now, we have 

<center>
    \[
    \frac{\partial J_i(W)}{\partial W} = x_i[e_{1i}, e_{2i}, \dots, e_{Ci}] = x_i e_i^T
    \]
</center>

and for the whole dataset, with $x_i$ and $e_i$ as the columns of $X$ and $E$,

<center>
    \[
    \frac{\partial J(W)}{\partial W} = \sum_{i=1}^N x_i e_{i}^T = XE^T
    \]
</center>
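
The whole-dataset gradient can be sketched in vectorized form. This assumes $X$ stores samples as columns (shape $d \times N$) and $E = A - Y$ has shape $C \times N$, so the product with the right shape $d \times C$ is `X @ E.T`; the sizes and random data are arbitrary.

```python
import numpy as np

def softmax_columns(Z):
    """Column-wise stable softmax; Z has shape (C, N)."""
    expZ = np.exp(Z - Z.max(axis=0, keepdims=True))
    return expZ / expZ.sum(axis=0, keepdims=True)

rng = np.random.default_rng(1)
d, C, N = 4, 3, 5
W = rng.normal(size=(d, C))
X = rng.normal(size=(d, N))           # column i is x_i
labels = rng.integers(0, C, size=N)
Y = np.eye(C)[:, labels]              # one-hot labels, shape (C, N)

A = softmax_columns(W.T @ X)          # predictions, shape (C, N)
E = A - Y                             # errors, column i is e_i

# Sum of per-sample outer products x_i e_i^T ...
grad_loop = sum(np.outer(X[:, i], E[:, i]) for i in range(N))
# ... equals a single matrix product.
grad_vectorized = X @ E.T
```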

Using SGD, which moves $W$ against the gradient of a single sample, we have the formula to update the parameters

<center>
    \[
    W = W - \eta x_i e_i^T = W + \eta x_i (y_{i} - a_{i})^T
    \]
</center>
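
A single SGD step can be sketched as follows. Gradient descent subtracts the per-sample gradient, so the step is written as `W - eta * outer(x_i, a_i - y_i)`; the learning rate `eta = 0.05` and the random sample are assumptions for illustration.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def sample_loss(W, x, y):
    """Per-sample cross entropy, using a stable log-sum-exp."""
    z = W.T @ x
    m = np.max(z)
    return -np.dot(y, z) + m + np.log(np.sum(np.exp(z - m)))

rng = np.random.default_rng(2)
d, C = 4, 3
W = rng.normal(size=(d, C))
x_i = rng.normal(size=d)
y_i = np.array([1.0, 0.0, 0.0])
eta = 0.05

a_i = softmax(W.T @ x_i)
W_new = W - eta * np.outer(x_i, a_i - y_i)  # one SGD step on sample i

loss_before = sample_loss(W, x_i, y_i)
loss_after = sample_loss(W_new, x_i, y_i)
print(loss_before, loss_after)
```

With a small enough learning rate the loss on that sample goes down after the step.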

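**To conclude, logistic regression is a special case of softmax regression:** with $C = 2$, $a_1 = \frac{e^{z_1}}{e^{z_1} + e^{z_2}} = \frac{1}{1 + e^{-(z_1 - z_2)}}$, which is the sigmoid of $z_1 - z_2$.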
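
The two-class case can be checked numerically. In this sketch the helper `softmax2` and the score pairs are assumptions for illustration; it compares the first softmax output for $C = 2$ with the sigmoid of the score difference.

```python
import math

def softmax2(z1, z2):
    """First output of a two-class softmax, computed stably."""
    m = max(z1, z2)
    e1, e2 = math.exp(z1 - m), math.exp(z2 - m)
    return e1 / (e1 + e2)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

pairs = [(0.0, 0.0), (2.0, -1.0), (-3.0, 1.5)]
diffs = [abs(softmax2(z1, z2) - sigmoid(z1 - z2)) for z1, z2 in pairs]
print(diffs)
```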

## 3. Implementation example

### 3.1. Prepare library and data

### 3.2. Implement from scratch
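
A minimal sketch of a from-scratch implementation, assuming NumPy, samples stored as columns of $X$, a bias handled by appending a row of ones, and synthetic Gaussian blobs as data. The function names, hyperparameters, and dataset are illustrative choices, not part of the lesson.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax of a 1-D score vector."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

def train_softmax_sgd(X, labels, C, eta=0.05, epochs=50, seed=0):
    """Train W (d x C) with SGD. X: (d, N), samples as columns; labels: ints in [0, C)."""
    rng = np.random.default_rng(seed)
    d, N = X.shape
    W = np.zeros((d, C))
    for _ in range(epochs):
        for i in rng.permutation(N):       # visit samples in random order
            x_i = X[:, i]
            a_i = softmax(W.T @ x_i)
            y_i = np.zeros(C)
            y_i[labels[i]] = 1.0           # one-hot target
            W -= eta * np.outer(x_i, a_i - y_i)
    return W

def predict(W, X):
    """Predicted class = index of the largest score in each column of W^T X."""
    return np.argmax(W.T @ X, axis=0)

# Synthetic data (assumption): three Gaussian blobs in 2-D, 100 points each.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 4.0], [4.0, 0.0], [-4.0, -4.0]])
n_per_class = 100
points = np.vstack([c + rng.normal(size=(n_per_class, 2)) for c in centers])
labels = np.repeat(np.arange(3), n_per_class)
X = np.vstack([points.T, np.ones(points.shape[0])])  # bias row -> shape (3, N)

W = train_softmax_sgd(X, labels, C=3)
accuracy = np.mean(predict(W, X) == labels)
print(accuracy)
```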

### 3.3. Use `sklearn`
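
A minimal sketch using scikit-learn, assuming the Iris dataset as example data (the lesson does not name a dataset). In recent scikit-learn versions, `LogisticRegression` with its default solver fits a multinomial (softmax) model when there are more than two classes.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Iris: 150 samples, 4 features, 3 classes.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)

proba = clf.predict_proba(X_test)          # one row of class probabilities per sample
test_accuracy = clf.score(X_test, y_test)
print(test_accuracy)
```

Each row of `proba` sums to 1, in contrast to the independent one-vs-rest outputs discussed in section 1.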

## 4. Homework