# Lecture #12: Multiclass Classification

## A. Multiclass

* Logistic regression, which we have used so far, handles binary classification
* In many real-world problems, however, an instance can belong to one of more than 2 possible classes
  * i.e. the prediction picks one of several classes for each instance, hence *multiclass*

## B. Softmax - Generalization of Logistic Regression

Recall that in logistic regression we had the following:

$$
\begin{aligned}
z &= \vec w \cdot \vec x + b\\
a_1 &= g(z) = \frac{1}{1+e^{-z}} = P(y=1|\vec x)\\
a_2 &= 1 - a_1 = P(y=0|\vec x)
\end{aligned}
$$

In softmax regression, we can generalize to the following:

$$
\begin{aligned}
z_j &= \vec w_j \cdot \vec x + b_j\\
a_j &= \frac{e^{z_j}}{\sum_{k=1}^N e^{z_k}} = P(y=j|\vec x)
\end{aligned}
$$
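As a minimal sketch of the formula above, the softmax can be written in a few lines of plain Python. The scores in `z` are made-up values for illustration, and the helper names `softmax` and `sigmoid` are not from the lecture. The last lines check that with $N=2$ and scores $[z, 0]$ the softmax output matches the logistic function.

```python
import math

def softmax(z):
    """Return a_j = exp(z_j) / sum_k exp(z_k) for a list of scores z."""
    exps = [math.exp(z_k) for z_k in z]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(z):
    """Logistic function from the binary case."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical scores for N = 4 classes
z = [2.0, 1.0, 0.1, -1.0]
a = softmax(z)
print(a)       # one probability per class, largest for the largest score
print(sum(a))  # the probabilities sum to 1 (up to round-off)

# With N = 2 and scores [z, 0], softmax reduces to logistic regression
print(softmax([0.7, 0.0])[0], sigmoid(0.7))
```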

The loss function with softmax regression is the following:

$$
\text{loss}(a_1,a_2,\dots,a_N;y) = -\log a_j \quad \text{if } y = j
$$
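A small sketch of this loss for a single example: only the probability assigned to the true class enters, and the loss is its negative log. The probabilities below are invented for illustration, and `softmax_loss` is a hypothetical helper name.

```python
import math

def softmax_loss(a, y):
    """Loss for one example: -log of the probability given to the true class.

    a: list of class probabilities a_1..a_N (should sum to 1)
    y: true class as a 1-based index, matching j = 1..N in the notes
    """
    return -math.log(a[y - 1])

a = [0.7, 0.2, 0.1]        # hypothetical softmax output for N = 3
print(softmax_loss(a, 1))  # true class got high probability: small loss
print(softmax_loss(a, 3))  # true class got low probability: large loss
```

The closer the probability of the true class is to 1, the closer the loss is to 0.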

## C. Neural Network with Softmax Output

Start with specifying the model:

```python
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

model = Sequential([
    Dense(units=25, activation='relu'),
    Dense(units=15, activation='relu'),
    # output layer: 10 classes, softmax turns the 10 scores into probabilities
    Dense(units=10, activation='softmax')
])
```

Specify the loss and the cost:

```python
from tensorflow.keras.losses import SparseCategoricalCrossentropy
model.compile(loss=SparseCategoricalCrossentropy())
```

* `SparseCategoricalCrossentropy`: the target is an integer class index
  * i.e. if $n=10$, then $0 \le y \le 9$
* `CategoricalCrossentropy`: the target is one-hot encoded
  * i.e. if $y=2$, the target is a vector of 0's with a single 1 at index 2
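To make the two target formats concrete, here is a minimal sketch that converts an integer class index into its one-hot form and back. `one_hot` is a hypothetical helper written for these notes, not a Keras function.

```python
def one_hot(y, n):
    """Convert an integer class index y (0 <= y < n) into a one-hot list of length n."""
    if not 0 <= y < n:
        raise ValueError("class index out of range")
    vec = [0] * n
    vec[y] = 1
    return vec

sparse_target = 2                       # format for the sparse loss
dense_target = one_hot(sparse_target, 10)  # format for the one-hot loss
print(dense_target)

# Recover the integer index from the one-hot vector
print(dense_target.index(1))
```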

Train on data:

```python
# This call fails as written because X and Y (the training data) have not been defined yet
model.fit(X, Y, epochs=100)
# NameError: name 'X' is not defined
```

## D. Improved Implementation of Softmax

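* Computing the activations $a_j$ first and then plugging them into the loss introduces roundoff errors
* A more numerically accurate implementation of the logistic and softmax losses computes the loss directly from $z$
  * in Keras: use a linear output layer and pass `from_logits=True` to the loss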
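To see why the order of operations matters numerically, here is a minimal pure-Python sketch that compares a two-step loss (softmax first, then log) with a loss computed directly from the scores using the log-sum-exp trick. This illustrates the idea only and is not how TensorFlow implements it internally; the function names and scores are assumptions, and `y` is a 0-based index here.

```python
import math

def naive_loss(z, y):
    """Two steps: compute softmax probabilities, then take -log of the true class."""
    exps = [math.exp(z_k) for z_k in z]
    return -math.log(exps[y] / sum(exps))

def stable_loss(z, y):
    """One step: -log softmax(z)[y] computed from the scores via log-sum-exp."""
    m = max(z)  # subtracting the max keeps every exponent <= 0
    log_sum_exp = m + math.log(sum(math.exp(z_k - m) for z_k in z))
    return log_sum_exp - z[y]

small = [2.0, 1.0, 0.1]
print(naive_loss(small, 0), stable_loss(small, 0))  # the two versions agree

large = [1000.0, 999.0, 0.0]  # hypothetical large scores
try:
    naive_loss(large, 0)
except OverflowError as err:
    print("naive version failed:", err)
print(stable_loss(large, 0))  # still finite
```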

## E. Additional NN Concepts

1. Advanced optimization with gradient descent
   * $w_j = w_j - \alpha \frac{\partial}{\partial w_j}J(\vec w, b)$
   * on a contour plot of $J$, each update is one step across the contour lines toward the minimum
   * to take larger steps (go faster), increase $\alpha$
   * to take smaller steps (go slower), decrease $\alpha$
2. Adam algorithm
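   * usually faster than plain gradient descent; a standard default choice for training neural networks
   * Adam: Adaptive Moment Estimation (uses a separate learning rate for each parameter, not just one $\alpha$)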
   * $w_1 = w_1 - \alpha_1 \frac{\partial}{\partial w_1}J(\vec w, b)$
   * $w_2 = w_2 - \alpha_2 \frac{\partial}{\partial w_2}J(\vec w, b)$
   * ...
   * $b = b - \alpha_{n+1} \frac{\partial}{\partial b}J(\vec w, b)$
   * if $w_j$ or $b$ keeps moving in the same direction, increase $\alpha_j$
   * if $w_j$ or $b$ keeps oscillating, decrease $\alpha_j$
   * code is the following:

```python
# from_logits=True assumes a linear output layer (no softmax activation)
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
)
```
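For intuition about what the optimizer does with each parameter, here is a rough pure-Python sketch of the commonly published Adam update rule. The cost function, starting point, and the values of `beta1`, `beta2`, and `eps` are assumptions chosen for illustration (they are not given in these notes), and this is not the Keras implementation.

```python
import math

def cost(w):
    """Hypothetical cost J(w) = w[0]**2 + 10 * w[1]**2, minimized at (0, 0)."""
    return w[0] ** 2 + 10.0 * w[1] ** 2

def grad(w):
    """Gradient of the hypothetical cost above."""
    return [2.0 * w[0], 20.0 * w[1]]

def adam(w, steps, alpha=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    """Run `steps` Adam updates starting from w and return the new parameters."""
    w = list(w)
    m = [0.0] * len(w)  # running average of gradients (first moment)
    v = [0.0] * len(w)  # running average of squared gradients (second moment)
    for t in range(1, steps + 1):
        g = grad(w)
        for j in range(len(w)):
            m[j] = beta1 * m[j] + (1 - beta1) * g[j]
            v[j] = beta2 * v[j] + (1 - beta2) * g[j] ** 2
            m_hat = m[j] / (1 - beta1 ** t)  # bias correction for early steps
            v_hat = v[j] / (1 - beta2 ** t)
            # effective step size differs per parameter: alpha / sqrt(v_hat)
            w[j] -= alpha * m_hat / (math.sqrt(v_hat) + eps)
    return w

w0 = [3.0, -2.0]
print(adam(w0, steps=1))  # first step moves each parameter by about alpha
w = adam(w0, steps=500)
print(cost(w0), cost(w))  # cost before and after
```

Dividing by the running magnitude of each parameter's gradient is what gives every $w_j$ and $b$ its own effective learning rate.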


3. Additional layer types
   * Dense layer: each neuron is a function of **all** the activation outputs of the previous layer
   * Convolutional layer: each neuron looks at only part of the previous layer's outputs
     * faster computation
     * needs less training data and is less prone to overfitting
   * other types: transformer, attention, etc.
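The difference between dense and convolutional layers can be sketched on a 1-D input. This is a simplified illustration with no activation function; the helper names, kernel, and input values are made up for the example.

```python
def dense_layer(x, weights, biases):
    """Each output uses ALL inputs: one weight per (output, input) pair."""
    return [sum(w_i * x_i for w_i, x_i in zip(w_row, x)) + b
            for w_row, b in zip(weights, biases)]

def conv1d_layer(x, kernel, bias):
    """Each output uses only a window of len(kernel) adjacent inputs.

    The same kernel is reused for every window.
    """
    k = len(kernel)
    return [sum(kernel[i] * x[start + i] for i in range(k)) + bias
            for start in range(len(x) - k + 1)]

x = [1.0, 2.0, 3.0, 4.0, 5.0]

# Dense: 3 outputs x 5 inputs = 15 weights, plus 3 biases
weights = [[1.0] * 5 for _ in range(3)]
print(dense_layer(x, weights, [0.0, 0.0, 0.0]))

# Convolutional: 3 kernel weights plus 1 bias, shared across all windows
kernel = [1.0, 0.0, -1.0]
print(conv1d_layer(x, kernel, 0.0))
```

In this toy case the dense layer has 18 parameters while the convolutional layer has 4, which is one way to see why the latter computes faster and needs less data.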