# Multiclass classification

## Learning objectives
- Understand how classification can be implemented when there are more than two classes
- Implement a multiclass classifier from scratch

## Intro - Binary classification vs multiclass classification

In binary classification, the output must be either true or false: the example either belongs to the class or it doesn't. We have seen that we can represent this with a model that has a single output node whose value is forced between 0 and 1, so that it represents the model's confidence that the example belongs to the positive class. Alternatively, still for binary classification, we could have two output nodes, where the value of the first represents the confidence that the input belongs to the positive class (true/class 1) and the value of the second represents the confidence that it belongs to the negative class (false/class 2). In this case, the values of the output nodes must be positive and must sum to 1, because the output layer represents a probability distribution over the output classes.

## Softmax

![](./images/binary-class.jpg)

In the case where we have two nodes to represent true and false, we can think of it as having trained two models, one for each class.

Treating true and false as separate classes with separate output nodes shows us how we can extend this idea to do multiclass classification; we simply add more nodes and ensure that their values are positive and sum to one.

![](./images/multiclass.jpg)

### What function can we use to convert the output layer into a distribution over classes?

The **softmax function** exponentiates each value in a vector to make it positive and then divides each of them by their sum to normalise them (make them sum to 1). This ensures that the vector can then be interpreted as a probability distribution.

![](./images/softmax.jpg)
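In symbols, the description above is $\mathrm{softmax}(z)_i = e^{z_i} / \sum_j e^{z_j}$. The sketch below applies the two steps to a small vector and also checks the two-node binary case. It assumes that the single output node used earlier for binary classification was a sigmoid. `softmax_sketch` is a hypothetical helper name used only for this illustration.

```python
import numpy as np

def softmax_sketch(z):
    # Hypothetical helper for illustration: exponentiate, then normalise
    exp_z = np.exp(z)            # every entry becomes positive
    return exp_z / exp_z.sum()   # entries now sum to 1

z = np.array([2.0, 1.0, 0.1])
p = softmax_sketch(z)
print(p, p.sum())

# Two-class special case: the first output equals the sigmoid of the
# difference between the two values (assumed single-node formulation)
z2 = np.array([1.5, -0.5])
sigmoid = 1 / (1 + np.exp(-(z2[0] - z2[1])))
print(softmax_sketch(z2)[0], sigmoid)
```

The two numbers on the last line should agree, which suggests that the two-node and single-node formulations of binary classification describe the same model.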

## Differentiating the softmax

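Writing $s = \mathrm{softmax}(z)$, each output $s_i$ depends on every input $z_j$, so the derivative is a matrix (the Jacobian) with one entry for each pair $(i, j)$. Applying the quotient rule to $s_i = e^{z_i} / \sum_k e^{z_k}$ gives

$$\frac{\partial s_i}{\partial z_j} = \begin{cases} s_i (1 - s_i) & \text{if } i = j \\ -s_i s_j & \text{if } i \neq j \end{cases}$$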

### Properties of softmax
- Increasing the value of any input entry increases the corresponding output and decreases all of the others, because the whole output vector must always sum to one.

Let's implement our own softmax function, and again include a boolean flag that will return the gradient.

In [1]:
import numpy as np

def softmax(z, label=None, grad=False):
    # z is a 1-D vector of values; label is unused and kept so existing calls still work
    # Subtracting the max leaves the result unchanged but avoids overflow in np.exp
    exp_z = np.exp(z - np.max(z))
    s = exp_z / np.sum(exp_z)
    if grad:
        # Jacobian: g[i][j] = s[i] * (1 - s[i]) if i == j, else -s[i] * s[j]
        return np.diag(s) - np.outer(s, s)
    return s

x = np.random.rand(3)
print(x)
print(softmax(x))
print('softmax sums up to:', sum(softmax(x)))
print(softmax(x, label=2, grad=True))

[0.46523752 0.05120626 0.74960415]
[0.33445931 0.22107101 0.44446968]
softmax sums up to: 1.0
[[ 0.22259628 -0.07393926 -0.14865702]
 [-0.07393926  0.17219862 -0.09825936]
 [-0.14865702 -0.09825936  0.24691638]]
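One way to gain confidence in a hand-written gradient is to compare it with a finite-difference estimate. The following is a minimal sketch of that check. The names `softmax_sketch`, `softmax_jacobian` and `numerical_jacobian` are hypothetical helpers introduced here, not part of the lesson's code.

```python
import numpy as np

def softmax_sketch(z):
    exp_z = np.exp(z - np.max(z))
    return exp_z / exp_z.sum()

def softmax_jacobian(z):
    # Analytic Jacobian: entry (i, j) is d softmax_i / d z_j
    s = softmax_sketch(z)
    return np.diag(s) - np.outer(s, s)

def numerical_jacobian(f, z, eps=1e-6):
    n = len(z)
    J = np.zeros((n, n))
    for j in range(n):
        step = np.zeros(n)
        step[j] = eps
        # Central difference: column j holds the change in f when z_j is nudged
        J[:, j] = (f(z + step) - f(z - step)) / (2 * eps)
    return J

z = np.array([0.5, -1.0, 2.0])
analytic = softmax_jacobian(z)
numeric = numerical_jacobian(softmax_sketch, z)
print('largest difference:', np.max(np.abs(analytic - numeric)))
print('column sums:', analytic.sum(axis=0))
```

The off-diagonal entries are negative and each column sums to zero (up to rounding), which is the property listed above: pushing one entry up pushes the others down by the same total amount.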


## The cross entropy loss function

In the BCE loss function, a single equation contained all of the "switches" needed to turn its terms on or off depending on the label.
This was possible because in binary classification the label is always either 0 or 1.

In multiclass classification, the label is instead the index of the correct class, so we use it to select the term we need.

An appropriate loss function to use for multiclass classification is the cross entropy loss function.
Like BCE loss, cross entropy uses the negative natural log of the output probability, which penalises the output increasingly heavily as the probability assigned to the ground truth falls towards zero.
Not all of the terms are needed: only the term for the correct class is kept.
Increasing the value of one element of the output of a softmax forces the others to decrease, because the whole vector has to sum to 1.
So if we focus on increasing the likelihood of the correct class, we implicitly decrease the likelihood of the incorrect classes.

Let's implement the cross entropy loss function.

In [3]:
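def CrossEntropyLoss(z, label, grad=False):
    # z is the vector of predicted class probabilities (the softmax output)
    # label is the integer index of the correct class
    if grad:
        # dL/dz is zero everywhere except at the correct class
        g = np.zeros_like(z)
        g[label] = - 1 / z[label]
        return g
    return - np.log(z[label])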
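To get a feel for the values this loss produces, the sketch below evaluates it on three hand-picked predictions for the same label. The probability vectors are made up for illustration, and `cross_entropy_sketch` is a hypothetical helper that keeps only the correct-class term.

```python
import numpy as np

def cross_entropy_sketch(probs, label):
    # Hypothetical helper: negative log of the probability given to the correct class
    return -np.log(probs[label])

label = 2  # index of the correct class
confident_right = np.array([0.05, 0.05, 0.90])
unsure = np.array([0.30, 0.30, 0.40])
confident_wrong = np.array([0.90, 0.05, 0.05])

for name, probs in [('confident and right', confident_right),
                    ('unsure', unsure),
                    ('confident and wrong', confident_wrong)]:
    print(name, cross_entropy_sketch(probs, label))

# The same loss written with a one-hot label vector acting as the "switches"
one_hot = np.eye(3)[label]
print(-np.sum(one_hot * np.log(confident_right)))
```

The loss grows as less probability is assigned to the correct class, and the one-hot form gives the same value as indexing with the label.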

Other than the final layer of the model, where the softmax is applied, and the loss function, the model and algorithm stay the same.

However, changing the model changes the gradient of the loss function with respect to the model parameters.
So we'll need to change the code that performs the parameter updates.
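As a rough sketch of what the new gradient looks like, the chain rule can be applied by hand for a single example. This assumes a linear layer `z = x @ w + b` followed by a softmax and the cross entropy loss. The variable names and example values below are illustrative only.

```python
import numpy as np

def softmax_sketch(z):
    exp_z = np.exp(z - np.max(z))
    return exp_z / exp_z.sum()

z = np.array([0.2, -1.3, 0.7])  # outputs of the linear layer for one example
label = 1                        # index of the correct class
h = softmax_sketch(z)

# dL/dh: only the correct class contributes to L = -log(h[label])
dLdh = np.zeros_like(h)
dLdh[label] = -1 / h[label]

# dh/dz: Jacobian of the softmax
dhdz = np.diag(h) - np.outer(h, h)

# Chain rule: dL/dz_j = sum_i dL/dh_i * dh_i/dz_j
dLdz = dLdh @ dhdz

# The result simplifies to the predicted probabilities minus the one-hot label
one_hot = np.eye(len(z))[label]
print(dLdz)
print(h - one_hot)

# For z = x @ w + b, the parameter gradients follow directly
x = np.array([1.0, 2.0])
dLdw = np.outer(x, dLdz)  # shape (n_features, n_classes)
dLdb = dLdz               # dz/db = 1
```

The two printed vectors should match, so in practice the gradient with respect to the linear layer's output can be computed as "prediction minus one-hot label" without building the full Jacobian.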

Below is the code we wrote to perform binary classification, adapted to generate and plot data with three classes.

In [4]:
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from get_colors import colors


def make_binary_data(m=50):
    # Despite the name (kept from the binary example), this generates three classes
    X = 3 * np.random.rand(m, 2)  # sample uniformly from [0, 3) for each feature
    X = X[np.argsort(X[:, 0])]  # sort by first feature value so that hypothesis boundaries can be plotted later
    w = np.array([1, -2])
    Xw = np.matmul(X, w)
    Y = np.zeros(X.shape[0], dtype=int)  # every example starts as class 0
    Y[Xw > -2] = 1
    Y[Xw > 0.5] = 2
    return X, Y  # returns X (the input) and Y (integer class labels)

def plot_data(X, Y):
    fig = plt.figure(figsize=(20, 10))
    ax = fig.add_subplot(111)
    for y in range(3):  # one scatter plot per class
        x = X[Y == y]
        ax.scatter(x[:, 0], x[:, 1], c=colors[y], marker='x', s=100)
    ax.set_xlabel('$x_1$')
    ax.set_ylabel('$x_2$')
    return fig, ax

def plot_hypothesis(X, H, ax=None):
    if ax is None:
        ax = plt.gca()  # fall back to the current axes instead of relying on a global figure
    x = X[:, 0]
    # For each hypothesis the linear score is w[0] * x1 + w[1] * x2 + b,
    # which is zero along the line x2 = -(w[0] * x1 + b) / w[1]
    w0 = H.w[:, 0] # w for hypothesis 0
    y0 = -(w0[0] * x + H.b[0]) / w0[1]
    w1 = H.w[:, 1] # w for hypothesis 1
    y1 = -(w1[0] * x + H.b[1]) / w1[1]
    ax.plot(x, y0)
    ax.plot(x, y1)
    ax.set_xlabel('$x_1$')
    ax.set_ylabel('$x_2$')
    
X, Y = make_binary_data()

# The Classifier is defined in the next cell, so only the data is plotted here
fig, ax = plot_data(X, Y)
plt.show()

In [5]:
import matplotlib.pyplot as plt

class Classifier:
    def __init__(self, n_features, n_classes):
        self.w = np.random.rand(n_features, n_classes)
        self.b = np.random.rand(n_classes)

    def __call__(self, x):
        z = np.matmul(x, self.w) + self.b  # linear scores, one per class
        return softmax(z)  # convert the scores into class probabilities

    def update_params(self, new_w, new_b):
        self.w = new_w
        self.b = new_b

    def calc_deriv(self, x, y_hat, label):
        # Gradient of the cross entropy loss with respect to the linear scores z.
        # Chaining dL/dh (non-zero only at the correct class) with the softmax
        # Jacobian dh/dz simplifies to: predicted probabilities minus one-hot label
        one_hot = np.zeros_like(y_hat)
        one_hot[label] = 1
        dLdz = y_hat - one_hot  # shape (n_classes,)
        dLdw = np.outer(x, dLdz)  # z = xw + b, so dz_k/dw_jk = x_j; shape (n_features, n_classes)
        dLdb = dLdz  # dz/db = 1
        return dLdw, dLdb

learning_rate = 0.001

H = Classifier(n_features=2, n_classes=3)

# PLOT OUR HYPOTHESIS BEFORE TRAINING
fig, ax = plot_data(X, Y)
plot_hypothesis(X, H, ax)

epochs = 1000
losses = []
for epoch in range(epochs):
    epoch_losses = []
    for x, y in zip(X, Y):  # one parameter update per example
        prediction = H(x)
        loss = CrossEntropyLoss(prediction, y)
        epoch_losses.append(loss)
        dLdw, dLdb = H.calc_deriv(x, prediction, y)
        new_w = H.w - learning_rate * dLdw
        new_b = H.b - learning_rate * dLdb
        H.update_params(new_w, new_b)
    losses.append(np.mean(epoch_losses))

# PLOT OUR HYPOTHESIS AFTER TRAINING
plot_hypothesis(X, H, ax)
plt.show()

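To visualise the model, plot the class probability landscape for each class: evaluate the model on a mesh of points covering the feature space and plot the predicted probability of belonging to each class on the vertical axis.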
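A minimal sketch of the mesh evaluation is shown below. It uses random placeholder weights in place of a trained model, assumes the features lie in the range [0, 3) as in the generated data, and introduces a hypothetical row-wise helper `softmax_rows` so that the whole grid can be evaluated at once.

```python
import numpy as np

def softmax_rows(z):
    # Softmax applied independently to each row of a 2-D array of scores
    exp_z = np.exp(z - z.max(axis=1, keepdims=True))
    return exp_z / exp_z.sum(axis=1, keepdims=True)

# Placeholder parameters standing in for a trained model's weights and biases
rng = np.random.default_rng(0)
w = rng.random((2, 3))  # (n_features, n_classes)
b = rng.random(3)

# Grid of points covering the assumed range of both features
x1, x2 = np.meshgrid(np.linspace(0, 3, 50), np.linspace(0, 3, 50))
grid = np.column_stack([x1.ravel(), x2.ravel()])  # shape (2500, 2)

probs = softmax_rows(grid @ w + b)    # shape (2500, 3)
P = probs.reshape(x1.shape + (3,))    # shape (50, 50, 3); P[..., k] is the surface for class k
print(P.shape)
```

Each slice `P[..., k]` can then be drawn as a surface, for example with matplotlib's `ax.plot_surface(x1, x2, P[..., k])` on an axes created with `projection='3d'`.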

## Summary
- Multiclass classification requires a different loss function from binary classification: the cross entropy loss
- Softmax is a differentiable function that turns a vector of real numbers into a probability distribution

## Next steps
- 
