<a href="https://colab.research.google.com/github/PaulToronto/Stanford-Andrew-Ng-Machine-Learning-Specialization/blob/main/2_2_3_Multiclass_Classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 2.2.3 Multiclass Classification

## 2.2.3.1 Multiclass

The target, $y$, can take on **more than two** possible values.

<img src='https://drive.google.com/uc?export=view&id=1akw0YgvtT5dRBMH1kX1-n4FEyXuhnLTL'>

### Examples

- MNIST: each handwritten digit could be any one of 10 digits
- Trying to classify whether patients have any of five different possible diseases
- Visual inspection of part defects in a factory
 - e.g. scratch, discoloration, or chip defect

## 2.2.3.2 Softmax

The **Softmax regression algorithm** is a generalization of **logistic regression**.

<img src='https://drive.google.com/uc?export=view&id=10lKA9fxtJWQIENs-LXKkIxajuut25eec'>

- If you apply softmax regression with $n = 2$, it computes the same thing as logistic regression.
 - The parameters end up being a little bit different, but it reduces to a logistic regression model.
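
A minimal pure-Python sketch of this reduction, assuming the standard softmax definition $a_j = e^{z_j} / \sum_k e^{z_k}$ (the helper names `softmax` and `sigmoid` are chosen here for illustration and are not from the course code). With two classes, the first softmax output equals the sigmoid of $z_1 - z_2$:

```python
import math

def softmax(z):
    """Return softmax probabilities for a list of scores z."""
    exps = [math.exp(v) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(z):
    """Logistic function used by logistic regression."""
    return 1.0 / (1.0 + math.exp(-z))

# Two-class case: softmax over (z1, z2) matches sigmoid(z1 - z2).
z1, z2 = 1.5, -0.5
a = softmax([z1, z2])
print(a[0], sigmoid(z1 - z2))  # the two values agree up to roundoff
print(sum(a))                  # the probabilities sum to 1 up to roundoff
```

Only the difference $z_1 - z_2$ matters in the two-class case, which is one way to see why the parameters differ from those of logistic regression while the predictions do not.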

### Softmax Cost Function

<img src='https://drive.google.com/uc?export=view&id=1itSB-E4kiInMQkOClEznl0uuGunvZpTW'>
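
As a sketch of the loss for a single training example, assuming the usual cross-entropy form $L = -\log a_j$ when $y = j$ (the function name `softmax_loss`, the 0-based class index, and the probability values are choices made here for illustration):

```python
import math

def softmax_loss(a, y):
    """Cross-entropy loss for one example.

    a: list of softmax probabilities, one per class
    y: integer index (0-based) of the true class
    """
    return -math.log(a[y])

a = [0.7, 0.2, 0.1]                   # hypothetical softmax outputs for 3 classes
loss_confident = softmax_loss(a, 0)   # the true class received probability 0.7
loss_wrong = softmax_loss(a, 2)       # the true class received probability 0.1
print(loss_confident, loss_wrong)     # the second loss is larger
```

The loss depends only on the probability assigned to the true class: the smaller that probability, the larger the loss.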

## 2.2.3.3 Neural Network with Softmax output

Essentially, we take the Softmax regression model and put it into the output layer of a neural network.

Previously, we did handwritten digit recognition with just two classes.

<img src='https://drive.google.com/uc?export=view&id=1JDkX0dONTpAPLr6SaqEmvSaN1C-l3UzJ'>

We can modify this to accommodate 10 digits.

<img src='https://drive.google.com/uc?export=view&id=1-OjamBEZL1J7vkCIl-CgM3-MWoO1Dig0'>


- The softmax layer, or softmax activation function, is different from the other activation functions
 - With the other activation functions, $a_j$ is a function of $z_j$ and only $z_j$
   - $a_j = g\left(z_j\right)$
   - that means we can apply the activation function element-wise
 - With softmax, $a_j$ is a function of all of $z_1, z_2, \cdots, z_n$ **simultaneously**
   - each activation value depends on all of the values of $z$
   - we can no longer apply the activation function element-wise
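
The difference can be checked with a short pure-Python sketch. ReLU stands in here for the element-wise activations, and the helper names and input values are illustrative assumptions:

```python
import math

def relu(z):
    """Element-wise activation: each output uses only its own input."""
    return [max(0.0, v) for v in z]

def softmax(z):
    """Each output depends on every entry of z through the shared sum."""
    exps = [math.exp(v) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

z = [1.0, 2.0, 3.0]
z_changed = [1.0, 2.0, 5.0]   # only the last entry differs

# ReLU: the first two outputs are unchanged.
print(relu(z)[:2] == relu(z_changed)[:2])        # True
# Softmax: the first two outputs change, even though their inputs did not.
print(softmax(z)[:2] == softmax(z_changed)[:2])  # False
```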

### How to implement Softmax in TensorFlow

- **NOTE: don't actually use this code; there is a better version, covered in the next section**
- As before, there are 3 steps: specify the model, specify the loss, and train the model
- The loss function is `SparseCategoricalCrossentropy()`
 - `Sparse` means that each label is a single integer class index (such as 4) rather than a one-hot vector. The categories are mutually exclusive: a digit can't be a 4 and a 5 simultaneously

 <img src='https://drive.google.com/uc?export=view&id=1bzIY-6_3DMFDd-W7xMvBMTVetXlvkyFn'>
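
To illustrate the label format that the `Sparse` variant expects, the following pure-Python sketch computes the same cross-entropy loss from an integer label and from the equivalent one-hot vector. This is not TensorFlow code; the function names and values are made up for this example:

```python
import math

def sparse_categorical_crossentropy(y, a):
    """Loss when the label y is a single integer class index."""
    return -math.log(a[y])

def categorical_crossentropy(y_one_hot, a):
    """Loss when the label is a one-hot vector over the classes."""
    return -sum(t * math.log(p) for t, p in zip(y_one_hot, a))

a = [0.1, 0.6, 0.3]      # hypothetical softmax outputs for 3 classes
y = 1                    # sparse label: the class index
y_one_hot = [0, 1, 0]    # the same label, one-hot encoded

loss_sparse = sparse_categorical_crossentropy(y, a)
loss_dense = categorical_crossentropy(y_one_hot, a)
print(loss_sparse, loss_dense)   # both equal -log(0.6)
```

Both forms give the same value; the sparse form simply skips building the one-hot vector.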

## 2.2.3.4 Improved implementation of Softmax

### Numerical Roundoff Errors

#### Option 1

$$
x = \frac{2}{10000}
$$

#### Option 2

$$
x = \left(1 + \frac{1}{10000}\right) - \left(1 - \frac{1}{10000}\right)
$$

```python
2 / 10_000
# 0.0002
```

```python
(1 + 1/10_000) - (1 - 1/10_000)
# 0.00019999999999997797
```

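- Both options are mathematically equal, but Option 2 has a roundoff error because the intermediate values $1 + \frac{1}{10000}$ and $1 - \frac{1}{10000}$ are stored with finite precision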

#### More numerically accurate implementation of logistic loss

<img src='https://drive.google.com/uc?export=view&id=14SB2ci9facqQIs_gaUwG0B45glT4icA7'>

- Original loss:
    - for logistic regression this works OK, and the roundoff error is usually small
- More accurate loss:
    - note that this version does not insist on computing the intermediate value $a$ explicitly
    - this gives TensorFlow more flexibility in how it does the computation
- Note the changes to the TensorFlow code:
    1. `loss=BinaryCrossentropy(from_logits=True)`
    2. In the final layer, `activation='linear'`
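
The idea behind working from logits can be sketched in pure Python. This illustrates the principle only and is not TensorFlow's actual implementation; the function names and the particular rearranged formula, $\max(z, 0) - yz + \log\left(1 + e^{-|z|}\right)$, are assumptions chosen for this example:

```python
import math

def loss_via_activation(z, y):
    """Original form: compute a = sigmoid(z) first, then the loss from a."""
    a = 1.0 / (1.0 + math.exp(-z))
    return -y * math.log(a) - (1 - y) * math.log(1 - a)

def loss_from_logit(z, y):
    """Rearranged form: work with z directly and never form a explicitly."""
    return max(z, 0.0) - y * z + math.log1p(math.exp(-abs(z)))

# For moderate z the two forms agree up to roundoff.
print(loss_via_activation(2.0, 1), loss_from_logit(2.0, 1))

# For large z with y = 0, a rounds to exactly 1.0 and log(1 - a) fails.
try:
    loss_via_activation(40.0, 0)
except ValueError as err:
    print("original form failed:", err)
print(loss_from_logit(40.0, 0))  # close to 40
```

Both functions compute the same mathematical quantity; the second one just avoids the intermediate value that loses precision.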

#### For Softmax, the roundoff errors become more serious

- This is particularly true when $z$ is very negative or very large, because $e^{z}$ then becomes extremely small or extremely large

<img src='https://drive.google.com/uc?export=view&id=1vUAskmkS8IosowqYT3XyG53-RiYKANUw'>
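
A small pure-Python sketch of why this matters for softmax. It contrasts a naive loss that forms the activations first with a rearranged loss that subtracts the largest $z$ before exponentiating (the log-sum-exp trick). This shows the general principle, not TensorFlow's actual code, and the function names and input values are made up for this example:

```python
import math

def naive_softmax_loss(z, y):
    """Compute the activations first, then -log(a_y)."""
    exps = [math.exp(v) for v in z]      # overflows for large z
    total = sum(exps)
    a = [e / total for e in exps]
    return -math.log(a[y])

def stable_softmax_loss(z, y):
    """Compute -log(a_y) directly from z using log-sum-exp."""
    m = max(z)
    log_sum = m + math.log(sum(math.exp(v - m) for v in z))
    return log_sum - z[y]

z_small = [1.0, 2.0, 3.0]
print(naive_softmax_loss(z_small, 2), stable_softmax_loss(z_small, 2))

z_large = [1000.0, 1001.0, 1002.0]       # same differences, shifted by 999
try:
    naive_softmax_loss(z_large, 2)
except OverflowError as err:
    print("naive form failed:", err)
print(stable_softmax_loss(z_large, 2))   # matches the z_small loss up to roundoff
```

Shifting every $z_j$ by the same constant leaves the softmax unchanged, so subtracting the maximum keeps every exponent at or below zero without changing the result.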

- IMPORTANT: With the more accurate version, the model no longer outputs $a_1, \cdots, a_{10}$; instead it outputs $z_1, \cdots, z_{10}$

<img src='https://drive.google.com/uc?export=view&id=1vBYuQch3p2I0pvYHRmJeqJYkEnNYTtC_'>
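
Because the model now outputs logits, a prediction step needs one extra conversion to recover probabilities. A minimal pure-Python sketch, with made-up logit values and a helper named `softmax` chosen here for illustration:

```python
import math

def softmax(z):
    """Convert logits to probabilities, shifting by max(z) for stability."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits z_1..z_10 from the linear output layer for one image.
logits = [-1.2, 0.3, 4.1, 0.0, -0.5, 2.2, -3.0, 0.9, 1.1, -0.7]

probs = softmax(logits)                                    # a_1..a_10
predicted = max(range(len(probs)), key=probs.__getitem__)  # most likely class
print(predicted)    # 2, the index of the largest logit
print(sum(probs))   # 1 up to roundoff
```

In TensorFlow, `tf.nn.softmax` performs this conversion on the model's output.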



## 2.2.3.5 Classification with multiple outputs

### Multi-label Classification

<img src='https://drive.google.com/uc?export=view&id=1W76-dglbRlhfgC8i3DD8Lpm3JFyKzvJK'>

- One way to handle this problem is to just treat it as 3 separate machine learning problems and build 3 neural networks
    - This is not an unreasonable approach, but there is another way

<img src='https://drive.google.com/uc?export=view&id=106xGoOhw4jW8UBSg5Jfj1pbfe8ofkJnu'>

- Note that `sigmoid` can be used for each output unit, since each of the 3 outputs is its own binary classification problem
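
A sketch of how the three sigmoid outputs of a single network might be read, in pure Python. The label names, the logit values, and the 0.5 decision threshold are all assumptions made for this illustration:

```python
import math

def sigmoid(z):
    """Logistic function applied to one output unit."""
    return 1.0 / (1.0 + math.exp(-z))

LABELS = ["car", "bus", "pedestrian"]   # assumed label names for illustration

# Hypothetical logits from the 3 output units for one image.
z = [2.0, -1.5, 0.4]

# Each output gets its own sigmoid, so the values need not sum to 1.
a = [sigmoid(v) for v in z]

# Threshold each output independently; several labels can be present at once.
present = [name for name, p in zip(LABELS, a) if p >= 0.5]
print(present)   # ['car', 'pedestrian']
```

Unlike softmax, the outputs here are independent probabilities, which is what allows one image to carry several labels at the same time.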



## 2.2.3.6 Lab - Softmax

https://colab.research.google.com/drive/1wkkYfpsImPmYdVB4h4uz7JsVZMH5QBl2

## 2.2.3.7 Lab - Multiclass

https://colab.research.google.com/drive/1zBhYkfsFZ-LW7foakf0oc0nvg3W1_ks7