# Multiclass classification problem:

- target y can take on more than two possible values.

## Softmax regression algorithm

- generalization of a logistic regression model to the case where we want to handle multiple classes. 

- Example: softmax regression model for a 4-class classification problem (4 possible values for y).

$N = 4$

$z_j = \vec{w}_j \cdot \vec{x} + b_j$, for $j = 1, 2, 3, 4$

**softmax formula**

$a_1 = \frac{e^{z_1}}{e^{z_1} + e^{z_2} + e^{z_3} + e^{z_4}} = P(y = 1|\vec{x})$

$a_2 = \frac{e^{z_2}}{e^{z_1} + e^{z_2} + e^{z_3} + e^{z_4}} = P(y = 2|\vec{x})$

$\cdots$

**General formula:**

$z_j = \vec{w}_j \cdot \vec{x} + b_j$, for $j = 1, 2, ..., N$ (the target $y$ takes one of the values $1, 2, ..., N$)

$a_j = \frac{e^{z_j}}{\sum_{k=1}^N e^{z_k}} = P(y = j|\vec{x})$
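
As a minimal sketch of the general formula, the function below computes the softmax outputs from a list of scores in plain Python. The function name `softmax` and the example scores are illustrative assumptions, not values taken from these notes.

```python
import math

def softmax(z):
    """Return a_j = e^{z_j} / sum_k e^{z_k} for every score z_j in the list z."""
    exps = [math.exp(z_j) for z_j in z]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores z_1..z_4 for a 4-class problem
z = [2.0, 1.0, 0.1, -1.0]
a = softmax(z)

print(a)       # one probability per class
print(sum(a))  # the probabilities add up to 1 (up to floating-point rounding)
```

Every output is positive and the outputs sum to 1, which is what allows each one to be read as a class probability. The class with the largest score also gets the largest probability.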

## Cost Function

The cost function for softmax regression is the average of the loss over all training examples, plus a regularization term:

$loss(a_1, a_2, ..., a_N, y) = -\log a_j \quad \text{if } y = j$

This is called the **cross-entropy** loss. Only the output for the true class appears in the loss, and $-\log a_j$ gets smaller as $a_j$ approaches 1, so minimizing the loss incentivizes the algorithm to maximize $a_j$, the predicted probability of the correct class, for each training example.
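
The loss for a single training example can be sketched in a few lines of plain Python. The function name `cross_entropy_loss` and the probability values are hypothetical, and classes are numbered from 1 to match the notation in these notes.

```python
import math

def cross_entropy_loss(a, y):
    """Loss for one example: -log of the probability assigned to the true class.

    a is the list of softmax outputs, y is the true class numbered from 1.
    """
    return -math.log(a[y - 1])

# Hypothetical softmax outputs for a 4-class problem
a = [0.7, 0.2, 0.05, 0.05]

loss_when_confident = cross_entropy_loss(a, y=1)  # true class has probability 0.7
loss_when_wrong = cross_entropy_loss(a, y=3)      # true class has probability 0.05

print(loss_when_confident, loss_when_wrong)
```

The loss is small when the true class receives a high probability and large when it receives a low one, which is the behavior the cross-entropy loss is meant to reward.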

## Neural Network with Softmax Output

For the example with classification of ten handwritten digits, we have 10 output neurons, one for each possible digit. The output of the softmax function for each of these neurons is the probability that the input image is that digit. The cost function is the average of the cross-entropy loss for each training example plus a regularization term.

## TensorFlow Implementation

In [None]:
import tensorflow as tf
from tensorflow.keras import Sequential, layers, losses

model = Sequential([
    layers.Dense(25, activation='relu'),
    layers.Dense(15, activation='relu'),
    layers.Dense(10, activation='softmax')
])

model.compile(loss=losses.SparseCategoricalCrossentropy()) 

# "Sparse" refers to the fact that each label is a single integer class index: each digit can only be classified into one class.

model.fit(X_train, y_train, epochs=100)


## Improved Implementation of softmax

### For Logistic regression

A more numerically accurate implementation of the logistic loss:


In [None]:
# Instead of this
model = Sequential([
    layers.Dense(25, activation='relu'),
    layers.Dense(15, activation='relu'),
    layers.Dense(1, activation='sigmoid')  # one output unit for binary classification
])

model.compile(loss=losses.BinaryCrossentropy())

# We can do this
model = Sequential([
    layers.Dense(25, activation='relu'),
    layers.Dense(15, activation='relu'),
    layers.Dense(1, activation='linear')  # output the logit z instead of a
])

model.compile(loss=losses.BinaryCrossentropy(from_logits=True))

- With `from_logits=True`, the output layer only produces "z", and the loss function receives "z" directly instead of "a". TensorFlow is then free to rearrange the terms of the loss expression, rather than being forced to compute "a" as an intermediate value and then plug "a" into the loss.

- This allows TensorFlow to have a little bit less numerical roundoff error.
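
To illustrate why working from "z" can matter, the sketch below compares two plain-Python ways of computing the logistic loss. The function names are hypothetical. The rearranged expression is a standard numerically stable formulation, shown here as an example of the kind of rearrangement a library can apply, not as a copy of TensorFlow's internals.

```python
import math

def naive_logistic_loss(z, y):
    """Compute a = sigmoid(z) first, then plug a into the loss."""
    a = 1.0 / (1.0 + math.exp(-z))
    return -y * math.log(a) - (1 - y) * math.log(1.0 - a)

def stable_logistic_loss(z, y):
    """Same loss, rearranged algebraically so that it only depends on z."""
    return max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))

# For a moderate z, both versions agree
print(naive_logistic_loss(2.0, 1), stable_logistic_loss(2.0, 1))

# For a large z with label 0, sigmoid(z) rounds to exactly 1.0,
# so the naive version ends up taking log(0) and fails
try:
    naive_logistic_loss(40.0, 0)
    naive_failed = False
except ValueError:
    naive_failed = True

print(naive_failed)
print(stable_logistic_loss(40.0, 0))
```

The rearranged version never forms "a", so the information lost when "a" is rounded to 1.0 does not affect the result.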


In [None]:
# So the whole code will be

import tensorflow as tf
from tensorflow.keras import Sequential, layers, losses

model = Sequential([
    layers.Dense(25, activation='relu'),
    layers.Dense(15, activation='relu'),
    layers.Dense(1, activation='linear')  # one output unit that returns the logit z
])

# Loss

model.compile(loss=losses.BinaryCrossentropy(from_logits=True))

# Fit the model

model.fit(X, Y, epochs=100)

# Predictions: the model outputs logits, so apply the sigmoid to get probabilities

logit = model(X)

f_x = tf.nn.sigmoid(logit)

### For Softmax regression:

In [None]:
# Instead of this 

model = Sequential([
    layers.Dense(25, activation='relu'),
    layers.Dense(15, activation='relu'),
    layers.Dense(10, activation='softmax')
])

model.compile(loss=losses.SparseCategoricalCrossentropy())

# We can do this

model = Sequential([
    layers.Dense(25, activation='relu'),
    layers.Dense(15, activation='relu'),
    layers.Dense(10, activation='linear')
])

model.compile(loss=losses.SparseCategoricalCrossentropy(from_logits=True))

# This version is more numerically accurate.

So the whole code will look like this:

In [None]:
import tensorflow as tf
from tensorflow.keras import Sequential, layers, losses

model = Sequential([
    layers.Dense(25, activation='relu'),
    layers.Dense(15, activation='relu'),
    layers.Dense(10, activation='linear')
])

#Loss

model.compile(loss=losses.SparseCategoricalCrossentropy(from_logits=True)) 

# Fit the model

model.fit(X, Y, epochs=100)

# Predictions: the model outputs logits, so apply softmax to get probabilities

logits = model(X)

f_x = tf.nn.softmax(logits)
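
Once the probabilities are available, a predicted class is usually read off by taking the most probable output. The plain-Python sketch below shows this for a single example. The helper names and the logit values are hypothetical, and the indices are assumed to map directly to the digits 0 to 9.

```python
import math

def softmax(logits):
    """Convert a list of logits into probabilities that sum to 1."""
    shift = max(logits)  # subtracting the max avoids overflow in exp
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def argmax(values):
    """Return the 0-based index of the largest value."""
    return max(range(len(values)), key=lambda j: values[j])

# Hypothetical logits for one image, one score per digit 0..9
logits = [-1.2, 0.3, 4.1, 0.0, -0.5, 1.7, -2.0, 0.9, 2.2, -0.1]

probabilities = softmax(logits)
predicted_digit = argmax(probabilities)

print(predicted_digit)
```

Because softmax preserves the ordering of the logits, taking the argmax of the logits directly gives the same predicted class.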


# Multi-label classification problem:

A multi-label classification problem is one where each input can have several labels at the same time. For example, an image that contains both a dog and a cat has the labels "dog" and "cat".


## How to implement a neural network for a multi-label classification problem?

**Option One:**

Treat the situation as two completely separate machine learning problems. One is a binary classification problem for the dog and the other is a binary classification problem for the cat.

**Option Two:**

Train a single neural network to predict both the dog and the cat simultaneously. For this we need to change the output layer of the neural network: it has two output neurons, one for the dog and one for the cat.

As the two separate problems are binary classification problems, we can use the sigmoid function as the activation function for each neuron in the output layer.
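
The output layer of Option Two can be sketched in plain Python as follows. Each output gets its own sigmoid, so the probabilities are independent and do not need to sum to 1. The function names, the logit values, and the 0.5 threshold are assumptions made for illustration.

```python
import math

def sigmoid(z):
    """Map a logit to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_labels(logits, label_names, threshold=0.5):
    """Apply a sigmoid to each output and keep every label above the threshold."""
    probabilities = [sigmoid(z) for z in logits]
    return [name for name, p in zip(label_names, probabilities) if p >= threshold]

label_names = ["dog", "cat"]

# Hypothetical output-layer logits for two images
only_dog = predict_labels([2.3, -1.1], label_names)
dog_and_cat = predict_labels([1.5, 0.8], label_names)

print(only_dog)
print(dog_and_cat)
```

Unlike softmax, where raising one class probability lowers the others, the two sigmoid outputs here can both be high, both be low, or differ.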