# Multiclass Classification



## Introduction

> Here, we cover classification with more than two classes.

## Binary vs Multiclass



In binary classification, the output must be either `True` or `False`: an example either falls in the class or it does not.

As we have seen, the model can represent this with a single output node whose value is forced between 0 and 1 and interpreted as the probability that the example belongs to the positive class.



## Multiclass


<p align=center><img width=1000 src=images/binary-class.jpg></p>

A case that involves two nodes representing true and false is analogous to a case where two separate models are trained.

The idea of treating `True` and `False` as separate classes with separate output nodes can be extended to multiclass classification. 

>Simply add more nodes, ensuring that their values are positive and sum to one.

__Each node is a single `logit`, and all of them combined are later passed to `softmax`.__

<p align=center><img width=1000 src=images/multiclass.jpg></p>



## Multiclass vs Multilabel



In this notebook, we will not explore the __multilabel__ case in depth; however, be aware of the following:

> In a multilabel problem, several labels can apply to the same example simultaneously, whereas in multiclass the labels are mutually exclusive.

This might be a single vector, e.g. where we have a `cat` and a `dog` in a picture, but not a turtle (using the label order `[dog, cat, turtle]`):

$$
[1, 1, 0]
$$

> In multiclass, __there is always exactly one '`1`' label__ (no fewer, no more).

## Logits


For multiclass, logits are outputted as well. The only difference is that __they will be a vector of values__. Each value in the output vector corresponds to a certain class.

Assuming that we wish to classify the input image into one of three classes, `{dog=0, cat=1, turtle=2}`, our model's output may be similar to this:

$$
    [-5, -3, 2]
$$

This would be a prediction of class `turtle` as its value is the highest.
To obtain a label from this operation, we simply use [`argmax`](https://numpy.org/doc/stable/reference/generated/numpy.argmax.html):

>`argmax` returns the __index__ of the array entry with __the highest value.__

Note that logits can be transformed into probabilities here as well.
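As a minimal sketch of the step above, the snippet below applies `np.argmax` to the example logits. The `class_names` list and the second batch row are illustrative assumptions, not part of the original example.

```python
import numpy as np

# Class indices follow the mapping used in the text: dog=0, cat=1, turtle=2.
class_names = ["dog", "cat", "turtle"]  # hypothetical helper list

# Example logits for a single image.
logits = np.array([-5, -3, 2])

# argmax returns the index of the largest entry, i.e. the predicted class.
predicted_index = int(np.argmax(logits))
predicted_name = class_names[predicted_index]
print(predicted_index, predicted_name)  # 2 turtle

# For a batch (one row per example), take argmax along the class axis.
batch_logits = np.array([[-5, -3, 2], [4, 0, -1]])  # second row is made up
batch_predictions = np.argmax(batch_logits, axis=1)
print(batch_predictions.tolist())  # [2, 0]
```

Note that `argmax` only needs the ordering of the logits, so the label is the same whether it is computed from the logits or from the probabilities derived from them.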



## Softmax


>The **softmax function** exponentiates each value in a vector to make it positive and subsequently divides each of them by their sum to normalise them (ensure that their sum equals 1).

This ensures that the vector can be interpreted as a probability distribution.

<p align=center><img width=1000 src=images/softmax.jpg></p>

For example, we can replace each variable with values:

<p align=center><img width=1000 src=images/softmax_example.jpg></p>
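The description above can also be written out as a formula. The notation here is an assumption and may differ from the figures: $z$ is the vector of logits and $K$ is the number of classes.

$$
\text{softmax}(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}, \qquad i = 1, \dots, K
$$

As a worked example, applying this to the logits $[-5, -3, 2]$ from the previous section gives approximately:

$$
\text{softmax}([-5, -3, 2]) = \frac{[e^{-5},\ e^{-3},\ e^{2}]}{e^{-5} + e^{-3} + e^{2}} \approx [0.0009,\ 0.0067,\ 0.9924]
$$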



## Differentiating the Softmax


- The softmax derivative differs based on the index of the element with respect to which we obtain the derivative. 
- If it is the same as the index of the element to which we applied softmax, the derivative becomes the equation at the bottom.
- Otherwise, it is the one above it.

<p align=center><img width=1000 src=images/softmax_deriv.jpg></p>
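For reference, the two cases can be written as follows. The notation is an assumption and may differ from the figure: $s = \text{softmax}(z)$, $i$ indexes the output element, and $j$ indexes the input element with respect to which we differentiate.

$$
\frac{\partial s_i}{\partial z_j} =
\begin{cases}
s_i (1 - s_i) & \text{if } i = j \\
-\, s_i\, s_j & \text{if } i \neq j
\end{cases}
$$

Both cases follow from applying the quotient rule to $s_i = e^{z_i} / \sum_k e^{z_k}$: the numerator depends on $z_j$ only when $i = j$, whereas the denominator always does.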



### Softmax properties



- An increase in the value of any entry corresponds to a decrease in the value of all the others; this is because the whole vector must always sum to one. 
- An increase in one input element results in an exponential increase in its corresponding output element and a simultaneous decrease in others; this indicates that __it is very easy for the largest output element to become dominant.__ 
- This results in `softmax` being overconfident, a problem to which there are several solutions, including `label smoothing`.
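The first two properties can be checked numerically. The sketch below uses a hypothetical helper, `softmax_1d`, and made-up logits; it raises a single logit and compares the outputs before and after.

```python
import numpy as np

def softmax_1d(z):
    """Plain softmax of a 1-D array (fine for small values like these)."""
    exps = np.exp(z)
    return exps / exps.sum()

base = np.array([1.0, 2.0, 3.0])
bumped = np.array([1.0, 2.0, 5.0])  # only the last logit is raised, by 2

p_base = softmax_1d(base)      # approximately [0.090, 0.245, 0.665]
p_bumped = softmax_1d(bumped)  # approximately [0.017, 0.047, 0.936]

print(p_base.round(3))
print(p_bumped.round(3))
print(p_base.sum(), p_bumped.sum())  # both sum to 1 (up to rounding)
```

Raising one logit by 2 moves its probability from roughly 0.67 to roughly 0.94, while both of the other probabilities shrink.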



### Softmax vs argmax



- As explained above, one output is typically close to `1`, while all the others are close to `0`. This is similar to the `argmax` operation mentioned previously; however, it is 'soft' as it can be differentiated.
- `argmax` changes abruptly. A small difference between two values would result in an output of either `0` or `1`. Contrarily, softmax changes gradually when the maximum changes.
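To make the contrast concrete, the sketch below nudges one of two nearly equal logits and records both outputs. The helper `softmax_1d`, the `results` list, and the chosen `delta` values are illustrative assumptions.

```python
import numpy as np

def softmax_1d(z):
    """Plain softmax of a 1-D array."""
    exps = np.exp(z)
    return exps / exps.sum()

results = []
for delta in (-0.1, -0.01, 0.01, 0.1):
    logits = np.array([2.0, 2.0 + delta])
    hard = int(np.argmax(logits))        # jumps from 0 to 1 when delta changes sign
    soft = float(softmax_1d(logits)[1])  # moves smoothly through 0.5
    results.append((delta, hard, soft))
    print(f"delta={delta:+.2f}  argmax={hard}  softmax[1]={soft:.3f}")
```

The `argmax` output flips as soon as `delta` changes sign, whereas the softmax output for the second class stays close to 0.5 throughout.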


### Example

Let us implement a softmax function.

It should accept a 2-D array `x` (one row per example), exponentiate it, and divide each row by its `sum` across `axis=1` (we are normalising across the class scores of each example):

In [1]:
import numpy as np

def softmax(x):
    """Compute softmax values for each set of scores in x."""
    exponentials = np.exp(x)
    return exponentials / exponentials.sum(axis=1).reshape(-1, 1)

## Stable Softmax

As seen in the `sigmoid` case, this version also suffers from numerical instability. Consider the example below:

In [2]:
softmax(np.array([[1000, 9, 8], [11, 12, 15]]))


array([[       nan, 0.        , 0.        ],
       [0.01714783, 0.04661262, 0.93623955]])

This time, the result is worse as we obtain `np.nan` due to overflow. The solution to the problem is to subtract the maximum value from each row.

As `softmax` works along the horizontal axis (`1`) and each row is normalised to sum to `1`, only the differences between the numbers within a row matter, not their absolute values.

This implies that we can subtract __any value__ from every entry in a row and still obtain the same results:

In [3]:
original = np.array([5, -2, 0]).reshape(1, -1)
subtracted = original - 6
softmax(original), softmax(subtracted)

(array([[9.92408247e-01, 9.04959183e-04, 6.68679417e-03]]),
 array([[9.92408247e-01, 9.04959183e-04, 6.68679417e-03]]))
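The reason the two results match can be shown in one line. The notation is an assumption: $z$ is one row of logits and $c$ is any constant.

$$
\text{softmax}(z - c)_i = \frac{e^{z_i - c}}{\sum_j e^{z_j - c}} = \frac{e^{-c}\, e^{z_i}}{e^{-c} \sum_j e^{z_j}} = \frac{e^{z_i}}{\sum_j e^{z_j}} = \text{softmax}(z)_i
$$

The factor $e^{-c}$ appears in both the numerator and the denominator, so it cancels.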

### The value to subtract

It is impossible to pick a single fixed `const` in advance that is right for every row (`1000` or `1_000_000`?).

Fortunately, we can find the maximum of each row and simply subtract that: the largest exponent in the row then becomes `0`, so `np.exp` can no longer overflow.

### Example

Implement a stable version of the `softmax` function:
- subtract the row-wise `np.max` of `logits` (along axis `1`) before exponentiating.
- normalise the exponentials by their row sums, as done previously.

Here is the `stable softmax`:

In [4]:
def softmax(logits):
    exps = np.exp(logits - np.max(logits, axis=1).reshape(-1, 1))
    return exps / np.sum(exps, axis=1).reshape(-1, 1)

In [5]:
softmax(np.array([[1000, 9, 8], [11, 12, 15]]))

array([[1.        , 0.        , 0.        ],
       [0.01714783, 0.04661262, 0.93623955]])

## One Hot Encoding

Our targets can be encoded in multiple ways. Conventionally, the class numbers are passed as follows (for `5` samples):

$$
[0, 3, 1, 1, 4]
$$

Alternatively, we could use one-hot encoding:

$$
\begin{align}
&[1, 0, 0, 0, 0]\\
&[0, 0, 0, 1, 0]\\
&[0, 1, 0, 0, 0]\\
&[0, 1, 0, 0, 0]\\
&[0, 0, 0, 0, 1]\\
\end{align}
$$

Since most datasets use the first option, we will code the transformation of the `labels` into one-hot encoding and vice versa:

In [6]:
def to_one_hot(labels, max_labels: int = None):
    if max_labels is None:
        max_labels = np.max(labels) + 1
    return np.eye(max_labels)[labels]


def to_labels(one_hot):
    return np.argmax(one_hot, axis=-1)


data = np.array([0, 1, 0, 3, 5])
to_one_hot(data)

array([[1., 0., 0., 0., 0., 0.],
       [0., 1., 0., 0., 0., 0.],
       [1., 0., 0., 0., 0., 0.],
       [0., 0., 0., 1., 0., 0.],
       [0., 0., 0., 0., 0., 1.]])
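As a quick sanity check, the sketch below repeats the two functions so that it runs on its own and confirms that encoding followed by decoding returns the original labels. The `wide` example is an added illustration of the `max_labels` argument.

```python
import numpy as np

def to_one_hot(labels, max_labels=None):
    # Default width: largest class index + 1.
    if max_labels is None:
        max_labels = np.max(labels) + 1
    # Row i of the identity matrix is the one-hot vector for class i.
    return np.eye(max_labels)[labels]

def to_labels(one_hot):
    # The position of the single 1 in each row is the class index.
    return np.argmax(one_hot, axis=-1)

data = np.array([0, 1, 0, 3, 5])
encoded = to_one_hot(data)
decoded = to_labels(encoded)

print(encoded.shape)                  # (5, 6)
print(np.array_equal(decoded, data))  # True

# Passing max_labels fixes the width even if the largest class
# does not appear in this particular batch.
wide = to_one_hot(np.array([0, 1]), max_labels=6)
print(wide.shape)                     # (2, 6)
```

Setting `max_labels` explicitly is worth considering whenever batches are encoded separately, since otherwise the width depends on which classes happen to be present.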

One hot encoding can also be applied to the inputs. However, please exercise caution with using the whole output directly in the model as this can result in a problem. 

To see the problem, examine the `preprocessing` module of the sklearn library:

In [3]:
from sklearn.preprocessing import OneHotEncoder

ohe = OneHotEncoder(drop='first')

Notice that one of the arguments that can be passed is `drop`. Can you work out why we would want to drop a column from the one-hot encoder output?

<details>
  <summary>Click to view the answer.</summary>

  When one-hot encoding is performed, the columns in each row of the output always sum to one, because exactly one column is 1 while the rest are 0. The problem is that these columns are therefore perfectly correlated: any one of them can be computed from the others.
  
  In ML models, particularly linear ones, this correlation can lead to numerically unstable solutions. For more information on the subject, consult [this page](https://en.wikipedia.org/wiki/Multicollinearity#Consequences).
  
  Therefore, when you one-hot encode the inputs of such a model, you should drop one column from the output.

</details>
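The redundancy can be checked with NumPy alone. The sketch below assumes a linear model with an intercept (a column of ones), which is an assumption not stated above, and uses made-up category values.

```python
import numpy as np

# Six samples of a categorical input with three possible values (made up).
categories = np.array([0, 2, 1, 2, 0, 1])
one_hot = np.eye(3)[categories]
intercept = np.ones((len(categories), 1))

# Intercept plus all three one-hot columns: 4 columns in total.
full = np.hstack([intercept, one_hot])
# Intercept plus the one-hot columns with the first one dropped: 3 columns.
dropped = np.hstack([intercept, one_hot[:, 1:]])

rank_full = np.linalg.matrix_rank(full)
rank_dropped = np.linalg.matrix_rank(dropped)

print(full.shape[1], rank_full)        # 4 3 -> one column is redundant
print(dropped.shape[1], rank_dropped)  # 3 3 -> no redundancy left
```

With all columns kept, the intercept equals the sum of the one-hot columns, so the matrix has four columns but only rank three. Dropping one column removes the redundancy without losing information.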

## The Cross-Entropy Loss Function

An appropriate loss function to use for multiclass classification is the __cross-entropy loss function__.

- It is a __generalisation of the BCE loss;__ therefore, it will work in the binary case as well.
- BCE is faster and more stable than the cross-entropy loss for the binary case; thus, __it should be implemented and used separately__.

The cross-entropy loss uses the same term as the BCE loss: __the negative natural log of the output probability__ is utilised to penalise outputs exponentially as they stray from the ground truth.

> There is no need to explicitly push down the incorrect class probabilities while pushing up the correct class probability.

Because the softmax outputs sum to one, increasing the correct class likelihood element implicitly decreases the incorrect class likelihood elements, so the loss only needs to look at the correct class.

<p align=center><img width=1000 src=images/cross_entropy_loss.jpg></p>
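A minimal NumPy sketch of this loss is given below. The function name `cross_entropy`, the `eps` clipping, and averaging over the batch are assumptions made for illustration; the figure above may use a sum or different notation.

```python
import numpy as np

def cross_entropy(probabilities, labels, eps=1e-12):
    """Mean negative log-probability assigned to the correct class.

    probabilities: array of shape (n_samples, n_classes), rows sum to 1.
    labels: integer class indices of shape (n_samples,).
    """
    n_samples = probabilities.shape[0]
    # Pick out, for each row, the probability of the true class.
    correct = probabilities[np.arange(n_samples), labels]
    # Clip to avoid log(0), then average the negative logs.
    return float(np.mean(-np.log(np.clip(correct, eps, 1.0))))

probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.1, 0.8]])
labels = np.array([0, 2])

loss = cross_entropy(probs, labels)
print(round(loss, 4))  # 0.2899

# A confident wrong prediction is penalised far more heavily.
bad_loss = cross_entropy(np.array([[0.01, 0.98, 0.01]]), np.array([0]))
print(bad_loss > loss)  # True
```

Only the probability of the correct class enters the loss; the other entries affect it indirectly through the softmax normalisation.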

## Using Simple Linear Models

We have learnt how to create and use `linear models` for regression and classification. Shortly, we will explore more powerful models. 

Here is a rough summary of when the simple models should be used in practice:

- As a baseline: this gives us an overview and 'starting point' for improvement.
- To obtain an easily explainable model: each weight shows the impact of a factor on our target.
- When there are many features (possibly even more than there are data points), and we do not want to overfit the data.

With time and experience, it will become increasingly apparent when to use each.

## Multiclass Classification in sklearn

Implementing multiclass classification in sklearn is similar to implementing binary classification in sklearn. Logistic regression can be utilised for this purpose; however, in this case, instead of loading a dataset with two possible outputs, we will load the `iris` dataset, which has three possible outputs.

In [5]:
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
X, y = load_iris(return_X_y=True, as_frame=True)
display(X)
display(y)

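| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) |
|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 |
| ... | ... | ... | ... | ... |
| 145 | 6.7 | 3.0 | 5.2 | 2.3 |
| 146 | 6.3 | 2.5 | 5.0 | 1.9 |
| 147 | 6.5 | 3.0 | 5.2 | 2.0 |
| 148 | 6.2 | 3.4 | 5.4 | 2.3 |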


0      0
1      0
2      0
3      0
4      0
      ..
145    2
146    2
147    2
148    2
149    2
Name: target, Length: 150, dtype: int64

You can see that the dataset has 150 samples, 4 features, and 3 different classes, each one corresponding to a different flower species.

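Now, we split the dataset into a training set and a test set, and train the logistic regression. One argument that can be passed to the `LogisticRegression` class is `multi_class`. By default, it is set to `auto`, which selects `ovr` for binary classification (or when the `liblinear` solver is used) and `multinomial` otherwise. Note that recent scikit-learn releases deprecate this argument, so the training cell may emit a warning or need adjusting depending on your installed version.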

- `ovr` means One-vs-Rest, which fits one binary classifier per class (that class against all the others), so each class is treated like a separate binary classification problem.
- `multinomial` calculates the probability distribution for all the classes; however, this does not mean that it is always more suitable for multiclass classification.
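To illustrate the One-vs-Rest idea, the sketch below uses scikit-learn's `OneVsRestClassifier` wrapper around a binary `LogisticRegression`. This is an illustrative alternative to the `multi_class` argument rather than the approach taken in this notebook, and `max_iter=1000` is an arbitrary choice intended to avoid convergence warnings.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

X, y = load_iris(return_X_y=True)

# One binary logistic regression is fitted per class (that class vs the rest).
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000))
ovr.fit(X, y)

# Three classes in iris, hence three underlying binary models.
n_binary_models = len(ovr.estimators_)
print(n_binary_models)  # 3

# Each row holds one score per class, normalised to sum to 1.
probabilities = ovr.predict_proba(X[:2])
print(probabilities.shape)  # (2, 3)
```

Inspecting `ovr.estimators_` shows that each underlying model is an ordinary binary classifier, which is what distinguishes this scheme from a single softmax-based model.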

In [7]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
log_reg = LogisticRegression(multi_class='multinomial', solver='newton-cg')
log_reg.fit(X_train, y_train)
y_pred = log_reg.predict(X_test)
print(accuracy_score(y_test, y_pred))

1.0


The iris dataset is a toy dataset; as such, it is not difficult to train a good model on it. On a real-world dataset, be suspicious if you get a score of 1.0.

## Conclusion

At this point, you should have a good understanding of

- how classification can be implemented with more than two classes.
- how to implement a multiclass classifier from scratch.