# Multiclass classification

## Learning objectives
- Understand how classification can be implemented when there are more than 2 classes
- Implement a multiclass classifier from scratch

## Intro - Binary classification vs multiclass classification

In binary classification the output must be either true or false: either the example falls into the class or it doesn't. We have seen that we can represent this with a model that has a single output node whose value is forced between 0 and 1, so that it represents the confidence that the example belongs to the positive class. Alternatively, still for binary classification, we could have two output nodes, where the value of the first represents the confidence that the input belongs to the positive class (true/class 1) and the value of the second represents the confidence that the input belongs to the negative class (false/class 2). In this case, the value of each output node must be positive and the values must sum to 1, because this output layer represents a probability distribution over the output classes.


![](./images/binary-class.jpg)
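As a minimal sketch of the two representations described above, the snippet below builds the two-node output from the one-node confidence `p` as `[p, 1 - p]`. It assumes the single output node is squashed with a sigmoid; the text only says the value is forced between 0 and 1. The names `sigmoid`, `raw_output` and `two_node_output` are illustrative and not part of the lesson.

```python
import numpy as np

def sigmoid(z):
    # Squash a real number into the open interval (0, 1)
    return 1 / (1 + np.exp(-z))

raw_output = 0.8  # hypothetical raw model output, chosen for illustration

# One-node representation: confidence in the positive class
p_positive = sigmoid(raw_output)

# Two-node representation: [confidence in positive, confidence in negative]
two_node_output = np.array([p_positive, 1 - p_positive])

print(p_positive)
print(two_node_output, two_node_output.sum())
```

Both entries of `two_node_output` are positive and they sum to 1, so the pair can be read as a distribution over the two classes.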

In the case where we have two nodes to represent true and false, we can think about it as having trained two models.

Treating true and false as separate classes with separate output nodes shows us how we can extend this idea to do multiclass classification; we simply add more nodes and ensure that their values are positive and sum to one.

![](./images/multiclass.jpg)

### What function can we use to convert the output layer into a distribution over classes?

The **softmax function** exponentiates each value in a vector to make it positive, then divides each result by the sum of all the exponentiated values to normalise them (make them sum to 1). The resulting vector can therefore be interpreted as a probability distribution.

![](./images/softmax.jpg)
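Written out, the description above corresponds to the following formula. The symbols are assumptions introduced here for illustration: $z$ is the vector of raw output values, $K$ is the number of classes, and $i$ indexes a single output node.

$$
\mathrm{softmax}(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}, \qquad i = 1, \dots, K
$$

Every numerator is positive because of the exponential, and dividing by the shared denominator makes the $K$ values sum to 1.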

### Properties of softmax
- Increasing the value of any entry decreases the value of all of the others, because the whole vector must always sum to one. This means that when calculating the loss for a prediction, we don't need to consider every class, pushing the probability of the correct label up and the probabilities of the incorrect labels down. Instead, we can just push the probability of the correct label up, and the normalisation in the softmax formula pushes the others down.
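The property above can be checked numerically with a small sketch. It uses a bare-bones inline helper, hypothetically named `normalised_exp`, ahead of the fuller implementation that follows; the input values are made up for illustration.

```python
import numpy as np

def normalised_exp(z):
    # Exponentiate each entry, then divide by the sum so the result sums to 1
    exp_z = np.exp(z)
    return exp_z / np.sum(exp_z)

z = np.array([1.0, 2.0, 0.5])  # hypothetical raw outputs for 3 classes
before = normalised_exp(z)

z_up = z.copy()
z_up[1] += 1.0                 # raise only the entry for class 1
after = normalised_exp(z_up)

print(before)
print(after)
```

Only one input entry was changed, yet the probability of class 1 goes up while the probabilities of classes 0 and 2 both go down.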

Let's implement our own softmax function.

In [4]:
import numpy as np

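def softmax(z, label=None, grad=False):
    # Shift by the max so np.exp cannot overflow; the result is unchanged
    exp_z = np.exp(z - np.max(z))
    s = exp_z / np.sum(exp_z)
    if not grad:
        return s
    if label is None:  # `not label` would wrongly reject class 0
        raise ValueError("label is required when grad=True")
    # Gradient of the label's probability s[label] with respect to each z[j]
    g = -s[label] * s
    g[label] = s[label] * (1 - s[label])
    return g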

x = np.random.rand(6)
print(x)
print(softmax(x))
print(sum(softmax(x)))

[0.51303363 0.75243903 0.38204912 0.31023859 0.47814088 0.72571698]
[0.16215675 0.2060191  0.14224899 0.13239216 0.15659623 0.20058675]
1.0
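One way to gain confidence in the gradient branch of the function above is to compare it against a finite-difference estimate. The sketch below is self-contained and restates the same gradient formula, where the derivative of the label's probability with respect to input `j` is `s[label] * (1 - s[label])` when `j == label` and `-s[label] * s[j]` otherwise. The helper names `softmax_grad` and `numerical_grad`, the step size `eps`, and the test vector are assumptions made for illustration.

```python
import numpy as np

def softmax(z):
    # Shift by the max for numerical stability, then normalise
    exp_z = np.exp(z - np.max(z))
    return exp_z / np.sum(exp_z)

def softmax_grad(z, label):
    # Analytic gradient of softmax(z)[label] with respect to every z[j]
    s = softmax(z)
    g = -s[label] * s
    g[label] = s[label] * (1 - s[label])
    return g

def numerical_grad(z, label, eps=1e-6):
    # Central finite differences, perturbing one input entry at a time
    g = np.zeros_like(z)
    for j in range(len(z)):
        step = np.zeros_like(z)
        step[j] = eps
        g[j] = (softmax(z + step)[label] - softmax(z - step)[label]) / (2 * eps)
    return g

z = np.array([0.5, -1.2, 2.0, 0.1])  # hypothetical raw outputs
label = 0                            # index of the class being differentiated
analytic = softmax_grad(z, label)
numeric = numerical_grad(z, label)
print(np.max(np.abs(analytic - numeric)))
```

The printed value is the largest disagreement between the two gradients and should be very close to zero. The entries of the analytic gradient also sum to zero, which reflects the fact that the probabilities must keep summing to one.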


## Summary
- multiclass classification requires a different loss function 
- softmax is a differentiable function that turns a vector of real numbers into a probability distribution

## Next steps
- 
