# Multiclass Classification

In the previous units, we discussed logistic regression and how it is well suited to classifying discrete instances that belong to one of two classes.

What happens if we need to classify into more than two classes? Examples include classifying handwritten digits (the MNIST dataset) into 10 classes, or classifying the Iris dataset into three classes.

Let's consider the following cases for classification.

|![Binary classification](https://i.postimg.cc/TPPXxRK4/image.png)|![multiclass](https://drive.google.com/uc?id=1T0kDZqp622AJeANHPmAZSYnwevmuhqhp)|
|-|-|

<center><figcaption>Figure 1: Binary class and Multiclass</figcaption></center>

In the first figure there are two classes, so logistic regression can be used directly for binary classification. The second figure, on the other hand, contains three classes. How can we find decision boundaries that effectively divide the data into three classes?


## One-vs-All (One-vs-Rest)

The first method we discuss is the one-vs-all method (also referred to as one-vs-rest or OvR). It splits the original problem into multiple binary classification problems, one per class, and trains a logistic regression classifier for each. For a particular class, the classifier estimates the probability that a sample belongs to that class rather than to any of the others. We predict the class whose classifier outputs the highest probability.

Consider the following classification problem:

<div>
<center>
<!-- <img src='https://drive.google.com/uc?id=1T0kDZqp622AJeANHPmAZSYnwevmuhqhp' width="700px"/> -->
<img src='https://i.postimg.cc/2Syhx1j2/image.png' width="700px"/>
<figcaption>Figure 2: Multiclass Classification</figcaption>
</center>
</div>



We can observe that there are three classes. Let A, B, and C be the classes. We then need to

- train a logistic regression classifier $h_w^{(i)}(x)$ for each class $i$ to predict the probability that $y=i$.

- make a prediction on a new input $x$ by picking the class $i$ that maximizes $h_w^{(i)}(x)$:
$$
\DeclareMathOperator*{\argmax}{arg\,max}
 \hat{y} = \argmax_i h_w^{(i)}(x) $$

Let's split the problem into three binary classification problems.
For class A,
<div>
<center>
<!-- <img src='https://drive.google.com/uc?id=1RBY_alSnriTd76ZZTSaRkAj_obW7bJtI' width="700px"/> -->
<img src='https://i.postimg.cc/2Syhx1j2/image.png' width="700px"/>
<figcaption> Figure 3: Class A vs Not Class A </figcaption>
</center>
</div>

The probability that an instance belongs to class A is given as $$\mathcal{P}(A|x) = h_w^{(1)}(x) $$

Similarly for class B,
<div>
<center>
<!-- <img src='https://drive.google.com/uc?id=1t7FFHEBfwQlZZdetr-8BrKbLrAo8AjNH' width="700px"/> -->
<img src="https://i.postimg.cc/LsyvdSPv/image.png" width= "700px"/>
<figcaption> Figure 4: Class B vs Not Class B </figcaption>
</center>
</div>

The probability that an instance belongs to class B is given as $$\mathcal{P}(B|x) = h_w^{(2)}(x) $$

And lastly for class C,
<div>
<center>
<!-- <img src='https://drive.google.com/uc?id=1SFlD6aro4g-oTWRUQ8iA0l8A0O0gw7G0' width="700px"/> -->
<img src="https://i.postimg.cc/1tfk2J8n/image.png" width="700px"/>
<figcaption> Figure 5: Class C vs Not Class C </figcaption>
</center>
</div>

The probability that an instance belongs to class C is given as $$\mathcal{P}(C|x) = h_w^{(3)}(x) $$

Now for any new input $x$, we evaluate all three probabilities and choose the class with the maximum probability for that input.

$$
\DeclareMathOperator*{\argmax}{arg\,max}
 \hat{y} = \argmax_i h_w^{(i)}(x) $$


<div>
<center>
<!-- <img src='https://drive.google.com/uc?id=1wWDELsZFfyEdDk8bT0Wzk5A9CjAwFlMZ' width="700px"/> -->
<img src="https://i.postimg.cc/fRRB8ktk/image.png" width="700px"/>
<figcaption> Figure 6: Combined Classifiers  </figcaption>
</center>
</div>

The issue with this method is that even if the training examples have an even class distribution, each binary classifier sees an unbalanced distribution, because for any particular class the negative examples far outnumber the positive examples.
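The one-vs-rest procedure described in this section can be sketched in plain Python as follows. This is a minimal illustration rather than a reference implementation: the toy clusters, the learning rate, the number of epochs, and the helper names (`train_binary`, `train_ovr`, `predict_ovr`) are all assumptions made for the example.

```python
import math
import random

def sigmoid(z):
    # Logistic function, written to avoid overflow for large |z|.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def predict_proba(w, x):
    # h_w(x) for one binary classifier; the last entry of w is the bias.
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + w[-1])

def train_binary(X, t, lr=0.1, epochs=1000):
    """Fit one logistic regression with batch gradient descent."""
    w = [0.0] * (len(X[0]) + 1)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for x, target in zip(X, t):
            err = predict_proba(w, x) - target
            for j, xj in enumerate(x):
                grad[j] += err * xj
            grad[-1] += err
        w = [wj - lr * gj / len(X) for wj, gj in zip(w, grad)]
    return w

def train_ovr(X, y, classes):
    # One binary problem per class: target 1 for "this class", 0 for "the rest".
    return {c: train_binary(X, [1.0 if yi == c else 0.0 for yi in y])
            for c in classes}

def predict_ovr(models, x):
    # Pick the class whose classifier reports the highest probability.
    return max(models, key=lambda c: predict_proba(models[c], x))

# Toy data (made up for illustration): three well-separated 2-D clusters.
random.seed(0)
centers = {"A": (0.0, 0.0), "B": (4.0, 0.0), "C": (2.0, 4.0)}
X, y = [], []
for label, (cx, cy) in centers.items():
    for _ in range(30):
        X.append([cx + random.gauss(0, 0.5), cy + random.gauss(0, 0.5)])
        y.append(label)

models = train_ovr(X, y, classes=list(centers))
accuracy = sum(predict_ovr(models, x) == yi for x, yi in zip(X, y)) / len(X)
print(f"training accuracy: {accuracy:.2f}")
```

Note that each call to `train_binary` sees 30 positive and 60 negative examples, which is the class imbalance mentioned above even though the original three classes are balanced.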

## One-vs-One (OvO)

Another strategy for multiclass classification using logistic regression is to train a binary classifier for every pair of classes. Thus if the problem has $N$ classes, we need to train $$\frac{N(N-1)}{2}$$ binary classifiers. This count is the same as the number of handshakes in a room of $N$ people, or the number of matches when $N$ football teams each play one another once.

During prediction, all $N(N-1)/2$ classifiers are applied to an unseen sample and a voting scheme is used. The combined classifier predicts the class that receives the highest number of positive predictions (votes).

This strategy is ambiguous when two or more classes receive the same highest number of votes, so a tie-breaking rule is needed in that case.
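The pair counting and the voting scheme can be sketched as below. The function `pairwise_winner` is a hypothetical stand-in for the trained pairwise classifiers, and the two toy decision rules are made up purely to show one clear winner and one tie.

```python
from collections import Counter
from itertools import combinations

def ovo_pairs(classes):
    # Every unordered pair of classes gets its own binary classifier.
    return list(combinations(classes, 2))

def predict_ovo(pairwise_winner, classes, x):
    """Majority vote over all pairwise classifiers.

    `pairwise_winner(a, b, x)` is a hypothetical callable returning whichever
    of the classes `a` or `b` the (a vs b) classifier picks for the sample x.
    """
    votes = Counter(pairwise_winner(a, b, x) for a, b in ovo_pairs(classes))
    best = max(votes.values())
    # More than one entry in the returned list means the vote is tied.
    return [c for c in classes if votes[c] == best], votes

classes = ["A", "B", "C", "D"]
pairs = ovo_pairs(classes)
print(len(pairs))  # N(N-1)/2 pairs for N = 4 classes

# Stand-in rule (made up): the alphabetically earlier class always wins.
winners, votes = predict_ovo(lambda a, b, x: min(a, b), classes, x=None)
print(winners, dict(votes))

# A cyclic outcome (A beats B, B beats C, C beats A) gives every class one vote.
cycle = {("A", "B"): "A", ("B", "C"): "B", ("A", "C"): "C"}
tied, _ = predict_ovo(lambda a, b, x: cycle[(a, b)], ["A", "B", "C"], x=None)
print(tied)
```

Returning the full list of top-voted classes, instead of a single label, leaves the choice of tie-breaking rule to the caller.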


## Softmax Function

Softmax regression, also known as multinomial logistic regression, is a direct approach to classifying a sample among more than two classes. Unlike the one-vs-rest and one-vs-one methods, it does not need multiple binary classifiers.

For each class $k$, the softmax regression model computes a score $s_k(\boldsymbol x)$ for a given instance $\boldsymbol x$. The softmax function (also called the _normalized exponential_) is then applied to the scores to estimate the probability of each class. We compute the scores using the following expression:
$$ s_k(\boldsymbol x) = \boldsymbol w^T_k \boldsymbol x \tag{Equation 1} $$

$\boldsymbol w_k$ (also written $\boldsymbol w^{(k)}$) is the parameter vector of class $k$. All these vectors are stored as rows in a parameter matrix $\boldsymbol W$.

The estimated probability $\mathcal{\hat{P}}_k$ that an instance $\boldsymbol x$ belongs to class $k$ given the scores of every class for that instance is computed as

$$ \mathcal{\hat{P}}_k = softmax(\boldsymbol s(\boldsymbol x))_k = \frac{\exp(s_k(\boldsymbol x))} {\sum_{j=1}^K \exp(s_j(\boldsymbol x))} \tag{Equation 2}
$$


Equation 2 is known as the softmax function. Here,
- $K$ is the number of classes.
- $\boldsymbol s(\boldsymbol x)$ is a vector containing the scores of each class for the instance $\boldsymbol x$.
- the exponential maps each score to a positive value, since a score may be negative.
- the denominator normalizes the values so that the probabilities sum to 1.
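As a small worked example of the score and softmax computations, the sketch below uses a made-up weight matrix with three classes and two features. Subtracting the largest score before exponentiating is an added numerical-stability detail not discussed in the text; it cancels in the ratio and does not change the result.

```python
import math

def class_scores(W, x):
    # s_k(x) = w_k^T x, with one row of W per class.
    return [sum(w_j * x_j for w_j, x_j in zip(w_k, x)) for w_k in W]

def softmax(scores):
    # Shift by the maximum so exp() cannot overflow on large scores.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

W = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]  # made-up weights: 3 classes, 2 features
x = [2.0, 1.0]

s = class_scores(W, x)            # one score per class; the last one is negative
p = softmax(s)                    # positive values that sum to 1
predicted = p.index(max(p))       # class with the highest estimated probability

print(s)
print([round(v, 4) for v in p], sum(p))
print(predicted, s.index(max(s)))  # same class either way
```

The last line illustrates that picking the largest probability and picking the largest score select the same class.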

Similar to the binary classifier, the multinomial regression classifier predicts the class with the highest estimated probability (class with the highest score).

$$
\DeclareMathOperator*{\argmax}{argmax}
\hat{y} = \argmax_k  (softmax(\boldsymbol s(\boldsymbol x))_k ) = \argmax_k s_k(\boldsymbol x) = \argmax_k \left(\boldsymbol w_k^T \boldsymbol x \right) \tag{Equation 3}
$$

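The $\arg\max$ operator in Equation 3 returns the value of $k$ that maximizes the estimated probability $softmax(\boldsymbol s(\boldsymbol x))_k$. Because the exponential is an increasing function and the denominator in Equation 2 is the same for every class, this is also the class with the highest score.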



# Cost Function

Our training objective for the softmax regression model is to estimate a high probability for the target class and low probabilities for the remaining $K-1$ classes. We need a cost function that gives a low penalty for a correct prediction and a high penalty for a wrong prediction. Since the premise is similar to binary logistic regression, we can generalize the cost function for the binary classifier from 2 to $K$ classes.

The cost function for binary logistic regression is the binary cross-entropy, which has the following form:


$$ \text{Cost}(\hat{y}, y) = -y\log(\hat{y}) - (1-y)\log(1-\hat{y}) \tag{Equation 4}$$

We can generalize the two terms in Equation 4 to $K$ terms. In multinomial regression, we represent both the true label $\boldsymbol y$ and the predicted label $\hat{\boldsymbol y}$ as vectors with $K$ elements. The true label $\boldsymbol y$ is a one-hot vector of length $K$, with $y_c=1$ for the correct class $c$ and all remaining elements equal to 0. The prediction $\hat{\boldsymbol y}$ is the estimated vector with $K$ elements, where each element $\hat{\boldsymbol y}_k$ is the probability our classifier estimates for class $k$.

The generalization of binary cross-entropy (the binary logistic regression cost function) to multiple classes is known as **cross-entropy**.


## Cross Entropy
The cost function for multinomial logistic regression, the cross-entropy for a single example $\boldsymbol x$, is expressed as:
$$
 \text{Cost}(\hat{\boldsymbol y}, \boldsymbol y) = - \sum_{k=1}^K \boldsymbol y_k \log (\hat{\boldsymbol y}_k) \tag{Equation 5}
$$

Equation 5 is the sum of the log probabilities of the $K$ output classes, each weighted by the corresponding label $\boldsymbol y_k$. Since $\boldsymbol y$ is a one-hot vector with $y_c = 1$ for the correct class $c$ and the remaining elements being $0$, only one term survives, and the cost reduces to the negative log probability of the correct class $c$:
\begin{align*}
  \text{Cost}(\hat{\boldsymbol y}, \boldsymbol y) &= - \sum_{k=1}^K \boldsymbol y_k \log (\hat{\boldsymbol y}_k) \\
  &= -\log(\hat{\boldsymbol y}_c) \tag{where $c$ is the correct class}
\end{align*}
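A quick numerical check of Equation 5 is sketched below. The two prediction vectors are made up: one is confident in the correct class and one is confident in a wrong class, so the penalties can be compared directly.

```python
import math

def cross_entropy(y_hat, y):
    # Equation 5: -sum_k y_k * log(y_hat_k). Terms with y_k = 0 are skipped,
    # which also avoids evaluating log(0) for classes that are not the target.
    return -sum(y_k * math.log(p_k) for y_k, p_k in zip(y, y_hat) if y_k > 0)

y = [0, 1, 0]                     # one-hot label: the correct class c is index 1
confident = [0.05, 0.90, 0.05]    # made-up prediction, mostly right
wrong = [0.70, 0.10, 0.20]        # made-up prediction, mostly wrong

low_penalty = cross_entropy(confident, y)    # equals -log(0.90)
high_penalty = cross_entropy(wrong, y)       # equals -log(0.10)
print(low_penalty, high_penalty)
```

Only the probability assigned to the correct class enters the result, which is why the two wrong-class entries of each prediction have no effect on the cost.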

The cost function for $m$ examples will then be given as:
$$
\mathcal{J}(\boldsymbol W) = -\frac{1}{m}\sum_{i=1}^m \sum_{k=1}^K \boldsymbol y_k^{(i)} \log \left(\hat{\boldsymbol y}_k^{(i)}\right) \tag{Equation 6}
$$

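Here, $y_k^{(i)}$ is the $k$-th element of the one-hot label vector of the $i^{th}$ example: it is $1$ if the example belongs to class $k$ and $0$ otherwise. Likewise, $\hat{\boldsymbol y}_k^{(i)}$ is the estimated probability that the $i^{th}$ example belongs to class $k$.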

When there are only two classes ($K=2$), Equation 6 is equivalent to the binary cross-entropy.
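To see why, here is a short derivation, assuming class 1 plays the role of the positive class. With $K=2$ the one-hot label satisfies $y_2 = 1 - y_1$ and the predicted probabilities satisfy $\hat{y}_2 = 1 - \hat{y}_1$. Writing $y = y_1$ and $\hat{y} = \hat{y}_1$ for a single example, the inner sum of Equation 6 becomes

\begin{align*}
  - \sum_{k=1}^{2} y_k \log (\hat{y}_k) &= -y_1 \log(\hat{y}_1) - y_2 \log(\hat{y}_2) \\
  &= -y \log(\hat{y}) - (1-y) \log(1-\hat{y})
\end{align*}

which is the binary cross-entropy of Equation 4.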

More details on [Cross Entropy](https://youtu.be/ErfnhcEV1O8).

# Optimizing Parameters using Gradient Descent

The cross-entropy cost of softmax regression is a convex function of the parameters, so gradient descent with a suitable learning rate can find a parameter matrix $\boldsymbol W$ that minimizes the cost function.

Computing the gradient of Equation 6 with respect to the parameter vector $\boldsymbol w^{(k)}$ of each class (_refer to the previous unit's notebook for the computation of the partial derivative of the cross-entropy_), we get
$$
\nabla_{\boldsymbol w^{(k)}} \mathcal{J}(\boldsymbol W) = \frac{1}{m} \sum_{i=1}^{m} \left( \hat{\boldsymbol y}_k^{(i)} - \boldsymbol y_k^{(i)} \right) \boldsymbol x^{(i)} \tag{Equation 7}
$$

Equation 7 shows that the gradient vector for each class $k$ is the average, over the $m$ training examples, of the difference between the probability the classifier outputs for that class and the true value (the one-hot label), multiplied by the input vector $\boldsymbol x^{(i)}$ of that example.
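One way to gain confidence in Equation 7 is to compare it against a finite-difference approximation of the cost in Equation 6. The sketch below does this on a tiny made-up dataset; the data, the weights, and the step size `eps` are assumptions chosen only for the check.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def predict(W, x):
    return softmax([sum(w * xj for w, xj in zip(w_k, x)) for w_k in W])

def cost(W, X, Y):
    # Equation 6: mean cross-entropy over the m examples.
    total = 0.0
    for x, y in zip(X, Y):
        p = predict(W, x)
        total -= sum(y_k * math.log(p_k) for y_k, p_k in zip(y, p) if y_k > 0)
    return total / len(X)

def gradient(W, X, Y):
    # Equation 7: row k is (1/m) * sum_i (y_hat_k - y_k) * x.
    grad = [[0.0] * len(W[0]) for _ in W]
    for x, y in zip(X, Y):
        p = predict(W, x)
        for k in range(len(W)):
            for j, xj in enumerate(x):
                grad[k][j] += (p[k] - y[k]) * xj / len(X)
    return grad

# Made-up data: 3 examples, 2 features, 3 classes (one-hot labels).
X = [[1.0, 2.0], [-1.0, 0.5], [0.3, -1.2]]
Y = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
W = [[0.1, -0.2], [0.0, 0.3], [-0.4, 0.2]]

analytic = gradient(W, X, Y)
eps = 1e-6
max_diff = 0.0
for k in range(len(W)):
    for j in range(len(W[0])):
        W_plus = [row[:] for row in W]
        W_minus = [row[:] for row in W]
        W_plus[k][j] += eps
        W_minus[k][j] -= eps
        # Central difference approximation of dJ/dW[k][j].
        numeric = (cost(W_plus, X, Y) - cost(W_minus, X, Y)) / (2 * eps)
        max_diff = max(max_diff, abs(numeric - analytic[k][j]))

print(f"largest gap between Equation 7 and finite differences: {max_diff:.2e}")
```

A gap close to zero indicates that the analytic gradient and the numerical slope of the cost agree on this example.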

Now we can use the following gradient descent update to optimize the parameter vector $\boldsymbol w^{(k)}$ of each class, where $j$ indexes the iteration:

$$
 \boldsymbol w^{(k)}_{j+1} = \boldsymbol w^{(k)}_j - \alpha  \nabla_{\boldsymbol w^{(k)}} \mathcal{J}(\boldsymbol W)\tag{Equation 8}
$$

The size of each gradient descent step is given by the slope of the cost function, scaled by the learning rate $\alpha$.
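Putting the gradient and the update rule together, a full training loop for softmax regression might look like the sketch below. The toy clusters, the constant bias feature appended to each input, the learning rate, and the number of iterations are all assumptions made for illustration.

```python
import math
import random

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def probabilities(W, x):
    return softmax([sum(w * xj for w, xj in zip(w_k, x)) for w_k in W])

def mean_cross_entropy(W, X, labels):
    # Cost over all examples; labels hold the index of the correct class.
    return -sum(math.log(probabilities(W, x)[c]) for x, c in zip(X, labels)) / len(X)

def gradient_step(W, X, labels, alpha):
    K, n, m = len(W), len(W[0]), len(X)
    grad = [[0.0] * n for _ in range(K)]
    for x, c in zip(X, labels):
        p = probabilities(W, x)
        for k in range(K):
            err = p[k] - (1.0 if k == c else 0.0)   # y_hat_k - y_k
            for j in range(n):
                grad[k][j] += err * x[j] / m
    # Gradient descent update applied to every class vector at once.
    return [[W[k][j] - alpha * grad[k][j] for j in range(n)] for k in range(K)]

def predict(W, x):
    p = probabilities(W, x)
    return p.index(max(p))

# Toy data (made up): three 2-D clusters, with a constant 1 appended as a bias feature.
random.seed(0)
centers = [(0.0, 0.0), (4.0, 0.0), (2.0, 4.0)]
X, labels = [], []
for c, (cx, cy) in enumerate(centers):
    for _ in range(30):
        X.append([cx + random.gauss(0, 0.5), cy + random.gauss(0, 0.5), 1.0])
        labels.append(c)

W = [[0.0] * 3 for _ in centers]                  # one row of weights per class
initial_cost = mean_cross_entropy(W, X, labels)   # log(3): all classes equally likely
for _ in range(300):
    W = gradient_step(W, X, labels, alpha=0.1)
final_cost = mean_cross_entropy(W, X, labels)

accuracy = sum(predict(W, x) == c for x, c in zip(X, labels)) / len(X)
print(f"cost {initial_cost:.3f} -> {final_cost:.3f}, training accuracy {accuracy:.2f}")
```

Starting from all-zero weights gives every class the same probability, so the initial cost is $\log 3$; the cost should then fall as the updates proceed. A larger learning rate takes bigger steps but can overshoot, so the value used here is a cautious choice for this data rather than a general recommendation.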



# Key Takeaways
- Binary classifiers can be used for multiclass classification through the one-vs-rest and one-vs-one methods.
- The softmax function estimates class probabilities for multiclass problems.
- Cross-entropy generalizes the binary cross-entropy cost function to multiclass classification.
- The gradient vector for each class is the average over the training examples of the difference between the probability the classifier outputs for that class and the true value, multiplied by the input vector $\boldsymbol x^{(i)}$.