In [1]:
%config IPCompleter.greedy=True
%load_ext autoreload
%autoreload 2

import numpy as np
import matplotlib.pyplot as plt
import sklearn.preprocessing, sklearn.datasets, sklearn.model_selection, sklearn.metrics

In this notebook, we will deal with multiclass classification. We will finally have a model that can distinguish between all the digits in the MNIST dataset, so we no longer need to restrict ourselves to 4 and 9. The proper way of handling this problem is to use the *softmax* function. I will show several other approaches first, so that we can compare them.

Firstly, we need the data. The template is still the same, so I will not describe it anymore.

In [2]:
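# as_frame=False asks for NumPy arrays (newer scikit-learn versions may return a DataFrame by default)
data, target = sklearn.datasets.fetch_openml('mnist_784', version=1, return_X_y=True, as_frame=False)
target = target.astype(int)
data = data.reshape(-1, 784)
# binarize the pixels: values below 128 become 0, everything else becomes 1
data[data < 128] = 0
data[data > 0] = 1
# append a constant column of ones (the bias input)
data = np.hstack([data, np.ones((data.shape[0],1))])
train_data, test_data, train_target, test_target = sklearn.model_selection.train_test_split(data, target, test_size=0.3, random_state=47)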

# Perceptron

If you remember, we dealt with this problem in one of the previous notebooks, when we were talking about the perceptron algorithm. As a reminder, let's do it once again here, so we can compare the results. I moved it into a separate module, so I don't need to copy-paste it here once again. If you are interested, it is in the [src/perceptron.py](src/perceptron.py) file.

In [3]:
from src.perceptron_05 import multiclass_perceptron

train_acc, test_acc = multiclass_perceptron(train_data, train_target, test_data, test_target, iters=500, random_state=42)

print(f"Train accuracy: {train_acc}, Test accuracy: {test_acc}")

Train accuracy: 0.9741020408163266, Test accuracy: 0.911


The results don't tell us much yet. The test accuracy is perhaps too low compared to the train accuracy, but we will see how the other models behave.

# One-vs-one

We may use the same approach for logistic regression as well. I will use the `Neuron` implementation from the previous notebooks. You can look it up in the [src/neuron_05.py](src/neuron_05.py) file. We have seen the `BCELoss` class as well, so I will just copy it.

In [4]:
from src.neuron_05 import Neuron

In [5]:
class BCELoss:
    def __call__(self, target, predicted):
        return np.sum(-target * np.log(np.maximum(predicted, 1e-15)) - (1 - target) * np.log(np.maximum(1 - predicted, 1e-15)), axis=0)
    def gradient(self, target, predicted):
        return - target / (np.maximum(predicted, 1e-15)) + (1 - target) / (np.maximum(1 - predicted, 1e-15))

Now let's train a model for each pair of digits. We have seen the code before, so again, I will not comment on it.
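
One detail of the training loop that is easy to miss is how the two digits of a pair are turned into binary labels. A minimal sketch on toy labels (the values of `pair_labels` are made up for illustration) shows that `(label - j) / (i - j)` sends digit `j` to 0 and digit `i` to 1:

```python
import numpy as np

# Toy labels that contain only the two digits of one pair (i = 9, j = 4).
i, j = 9, 4
pair_labels = np.array([4, 9, 9, 4, 9])

# Subtracting j moves digit j to 0; dividing by (i - j) moves digit i to 1.
binary_labels = (pair_labels - j) / (i - j)
print(binary_labels)  # [0. 1. 1. 0. 1.]
```

This only works because the mask keeps exactly two distinct digits; any third digit would end up at a value other than 0 or 1.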

In [6]:
# train models
models = np.empty((10,10), dtype=object)
for i in range(10):
    for j in range(i):
        models[i][j] = Neuron(BCELoss(), epochs=200, learning_rate=0.001, batch_size=128, random_state=42+i*10+j)
        mask = np.logical_or(train_target == i, train_target == j)
        current_X = train_data[mask]
        current_y = (train_target[mask] - j) / (i - j)
        models[i][j].fit(current_X, current_y, progress=True)

100% (200 of 200) |######################| Elapsed Time: 0:00:07 Time:  0:00:07
100% (200 of 200) |######################| Elapsed Time: 0:00:06 Time:  0:00:06
100% (200 of 200) |######################| Elapsed Time: 0:00:07 Time:  0:00:07
100% (200 of 200) |######################| Elapsed Time: 0:00:06 Time:  0:00:06
100% (200 of 200) |######################| Elapsed Time: 0:00:07 Time:  0:00:07
100% (200 of 200) |######################| Elapsed Time: 0:00:07 Time:  0:00:07
100% (200 of 200) |######################| Elapsed Time: 0:00:06 Time:  0:00:06
100% (200 of 200) |######################| Elapsed Time: 0:00:07 Time:  0:00:07
100% (200 of 200) |######################| Elapsed Time: 0:00:06 Time:  0:00:06
100% (200 of 200) |######################| Elapsed Time: 0:00:07 Time:  0:00:07
100% (200 of 200) |######################| Elapsed Time: 0:00:06 Time:  0:00:06
100% (200 of 200) |######################| Elapsed Time: 0:00:06 Time:  0:00:06

That took a while, but we have the models. We may now let each model predict a class and simply take the majority vote for each example.
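
To see the voting in isolation, here is a small sketch with three classes and two samples. The rounded outputs in `pairwise_outputs` are invented for illustration; only the counting logic mirrors the cell that follows.

```python
import numpy as np

# Hypothetical rounded outputs of three pairwise models for 2 samples.
# Key (i, j): output 1 is a vote for class i, output 0 is a vote for class j.
pairwise_outputs = {
    (1, 0): np.array([1, 0]),
    (2, 0): np.array([0, 0]),
    (2, 1): np.array([0, 1]),
}

votes = np.zeros((2, 3), dtype=int)
for (i, j), output in pairwise_outputs.items():
    votes[output == 0, j] += 1
    votes[output == 1, i] += 1

print(votes)                 # per-sample vote counts for classes 0, 1, 2
print(votes.argmax(axis=1))  # [1 0]
```

Note that `argmax` returns the lowest index when two classes receive the same number of votes, so ties are silently resolved in favour of the smaller digit.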

In [7]:
# predict
train_predictions = np.zeros((train_target.shape[0], 10), dtype=int)
test_predictions = np.zeros((test_target.shape[0], 10), dtype=int)
for i in range(10):
    for j in range(i):
        prediction = np.around(models[i][j].predict(train_data))
        train_predictions[prediction == 0, j] += 1
        train_predictions[prediction == 1, i] += 1
        prediction = np.around(models[i][j].predict(test_data))
        test_predictions[prediction == 0, j] += 1
        test_predictions[prediction == 1, i] += 1
train_predictions = train_predictions.argmax(axis=1)
test_predictions = test_predictions.argmax(axis=1)

In [8]:
train_acc = sklearn.metrics.accuracy_score(train_target, train_predictions)
test_acc = sklearn.metrics.accuracy_score(test_target, test_predictions)
print(f"Train accuracy: {train_acc}, Test accuracy: {test_acc}")

Train accuracy: 0.9466122448979591, Test accuracy: 0.9168571428571428


That doesn't seem bad. The test accuracy is a bit higher than with the simple perceptron. Although the training accuracy is lower than the perceptron's, that is not what we care about. When we have a model, we care about its generalization error. Logistic regression generalizes better here, as its test accuracy is higher and its training accuracy is closer to its test accuracy.
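
The gap between train and test accuracy can be made explicit with a few lines of arithmetic. The numbers below are the accuracies printed above, rounded to four decimals; the dictionary names are just labels for this sketch.

```python
# (train accuracy, test accuracy) as reported by the cells above, rounded.
results = {
    "perceptron": (0.9741, 0.9110),
    "one-vs-one (votes)": (0.9466, 0.9169),
}

# A smaller gap suggests the model fits the training set less tightly.
gaps = {name: train - test for name, (train, test) in results.items()}
for name, gap in gaps.items():
    print(f"{name}: train - test = {gap:.4f}")
```

The perceptron's gap is roughly twice as large as the gap of the one-vs-one logistic regression.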

However, we didn't use the full potential of the model. Remember that logistic regression returns probabilities, not classes. As you may expect, probabilities of $45\% : 55\%$ should have a lower impact on the result than probabilities of $1\% : 99\%$. But in the previous case, we treated them the same.

What if we were adding the probabilities instead of classes?
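
Before running it on MNIST, a toy example shows that the two schemes can disagree. The probabilities in `probs` are made up; the point is only that two narrow wins can outvote one confident win, while summing probabilities takes the confidence into account.

```python
import numpy as np

# Hypothetical outputs for ONE sample: probs[(i, j)] = P(class i) from the i-vs-j model.
probs = {(1, 0): 0.55, (2, 0): 0.05, (2, 1): 0.45}

votes = np.zeros(3, dtype=int)
scores = np.zeros(3)
for (i, j), p in probs.items():
    votes[i if p > 0.5 else j] += 1  # hard vote: only the rounded output counts
    scores[i] += p                   # soft vote: add the probability itself
    scores[j] += 1 - p

print(votes, votes.argmax())    # class 1 wins two narrow duels
print(scores, scores.argmax())  # class 0 wins once the confidence is counted
```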

In [9]:
# predict
train_predictions = np.zeros((train_target.shape[0], 10), dtype=float)
test_predictions = np.zeros((test_target.shape[0], 10), dtype=float)
for i in range(10):
    for j in range(i):
        prediction = models[i][j].predict(train_data)
        train_predictions[:, j] += 1-prediction
        train_predictions[:, i] += prediction
        prediction = models[i][j].predict(test_data)
        test_predictions[:, j] += 1-prediction
        test_predictions[:, i] += prediction
train_predictions = train_predictions.argmax(axis=1)
test_predictions = test_predictions.argmax(axis=1)

In [10]:
train_acc = sklearn.metrics.accuracy_score(train_target, train_predictions)
test_acc = sklearn.metrics.accuracy_score(test_target, test_predictions)
print(f"Train accuracy: {train_acc}, Test accuracy: {test_acc}")

Train accuracy: 0.9476326530612245, Test accuracy: 0.9248095238095239


Finally, this is the best result so far. By using probabilities, we allowed the model to use its full potential.

# Tree based approach

> Note that this chapter is more playing around than what would be done in reality. However, I thought it might be interesting for some of you to see what can be done and how it performs. If you want to see only the mandatory parts that you are going to need in the following notebooks, please skip to the next chapter.

The previous approach, sometimes called *one-vs-one*, had good results, but we needed to train too many models. In fact, when we have $c$ classes, we need $\frac{c\cdot(c-1)}{2}$ models. In our case, that is $45$ models in total. In reality, that's too many, especially if training one model takes hours or days. The first thing we may think of is to build a tree.
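
The difference in the number of models grows quickly with the number of classes. A short sketch (the function names are chosen here for illustration) compares the one-vs-one count with a binary tree that has one leaf per class, which always has $c - 1$ inner nodes:

```python
def one_vs_one_count(num_classes: int) -> int:
    """One model per unordered pair of classes: c * (c - 1) / 2."""
    return num_classes * (num_classes - 1) // 2

def binary_tree_count(num_classes: int) -> int:
    """A binary tree with c leaves has c - 1 inner nodes, one model each."""
    return num_classes - 1

for c in (2, 10, 100):
    print(c, one_vs_one_count(c), binary_tree_count(c))
```

For the ten digits this gives 45 models versus 9; for 100 classes it is already 4950 versus 99.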

We may think about it as a decision tree. At each node, there is a logistic regression that tells us which side to continue on. We put the resulting classes in the leaves. We still need only binary classification, so we may use the perceptron or logistic regression. We may visualize the tree.
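
The routing idea can be sketched on a tiny tree with four classes. This is not the code from `src/treebased_07.py`: the nested-tuple layout, `make_threshold` and `route` are hypothetical, and simple thresholds on a scalar input stand in for the trained logistic regressions.

```python
# A node is (left_subtree, right_subtree, classifier); a leaf is an int class label.
# Each classifier is a stand-in that returns P(go right) for a scalar input.
def make_threshold(threshold):
    return lambda x: 1.0 if x >= threshold else 0.0

tree = (
    (0, 1, make_threshold(1.0)),  # separates class 0 from class 1
    (2, 3, make_threshold(3.0)),  # separates class 2 from class 3
    make_threshold(2.0),          # root: {0, 1} versus {2, 3}
)

def route(node, x):
    """Follow the more probable branch until a leaf is reached."""
    while not isinstance(node, int):
        left, right, classifier = node
        node = right if classifier(x) >= 0.5 else left
    return node

print([route(tree, x) for x in (0.5, 1.5, 2.5, 3.5)])  # [0, 1, 2, 3]
```

Every prediction evaluates only the classifiers on one root-to-leaf path, so inference is cheap, but a wrong turn near the root cannot be corrected further down.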

![tree in order](img/TreeInOrder_07.svg)

Let's try it. As the implementation is straightforward but still quite long (I preferred simplicity in this case), I left the implementation itself in the [src/treebased_07.py](src/treebased_07.py) file. Feel free to look at it.

In [11]:
from src.treebased_07 import TreeInOrder

In [12]:
tree = TreeInOrder(BCELoss(), epochs=400, learning_rate=0.001, batch_size=128, random_state=42)
tree.fit(train_data, train_target, progress=True)

100% (400 of 400) |######################| Elapsed Time: 0:01:06 Time:  0:01:06
100% (400 of 400) |######################| Elapsed Time: 0:00:34 Time:  0:00:34
100% (400 of 400) |######################| Elapsed Time: 0:00:32 Time:  0:00:32
100% (400 of 400) |######################| Elapsed Time: 0:00:20 Time:  0:00:20
100% (400 of 400) |######################| Elapsed Time: 0:00:19 Time:  0:00:19
100% (400 of 400) |######################| Elapsed Time: 0:00:14 Time:  0:00:14
100% (400 of 400) |######################| Elapsed Time: 0:00:13 Time:  0:00:13
100% (400 of 400) |######################| Elapsed Time: 0:00:12 Time:  0:00:12
100% (400 of 400) |######################| Elapsed Time: 0:00:13 Time:  0:00:13


In [13]:
train_predictions = tree.predict_probbased(train_data)
test_predictions = tree.predict_probbased(test_data)

In [14]:
train_acc = sklearn.metrics.accuracy_score(train_target, train_predictions)
test_acc = sklearn.metrics.accuracy_score(test_target, test_predictions)

In [15]:
print(f"Train accuracy: {train_acc}, Test accuracy: {test_acc}")

Train accuracy: 0.8295102040816327, Test accuracy: 0.814952380952381


That doesn't look good. We had almost $92.5\%$ test accuracy before, now we are down to about $81\%$. Why is that? We must understand that in the tree-based approach, the error accumulates along the path to the leaf. Even if each model had only a $5\%$ error, it accumulates to roughly $15\%$ or $20\%$ along the way. On the other hand, we didn't need to train 45 models; 9 were enough. In other words, we may spend more time training each model to achieve better accuracy compared to the one-vs-one approach.
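
The accumulation can be checked with a quick calculation. The sketch assumes that the decisions along a path are independent and that every node has the same error rate, which is a simplification; `path_error` is a name introduced here.

```python
def path_error(per_node_error: float, depth: int) -> float:
    """Chance that at least one of `depth` independent decisions is wrong."""
    return 1 - (1 - per_node_error) ** depth

for depth in (1, 2, 3, 4):
    print(depth, round(path_error(0.05, depth), 4))
```

With a $5\%$ error per node, a path of three decisions is wrong about $14\%$ of the time and a path of four about $19\%$ of the time, which matches the rough $15\%$ to $20\%$ estimate.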

During inference (the prediction), I used probabilities and computed the probability of each class. Finally, I picked the class with the highest probability. Once again, we may get rid of the probabilities and round the value to either $0$ or $1$. That results in a binary search tree of some sort. It shouldn't be surprising that the accuracy gets worse, as we are completely ignoring the cases where the model knows it is not sure.
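
A toy example shows how the two prediction modes can disagree. It assumes that the probability of a class is the product of the node probabilities along its path; whether `predict_probbased` does exactly this is an assumption, and all numbers and variable names are made up.

```python
import numpy as np

# Hypothetical node outputs for ONE sample in a two-level tree over 4 classes.
p_root_right = 0.45   # root:        P(sample is in {2, 3})
p_left_right = 0.40   # left child:  P(class 1 | sample is in {0, 1})
p_right_right = 0.10  # right child: P(class 3 | sample is in {2, 3})

# Probability-based: multiply the probabilities along each root-to-leaf path.
leaf_probs = np.array([
    (1 - p_root_right) * (1 - p_left_right),  # class 0
    (1 - p_root_right) * p_left_right,        # class 1
    p_root_right * (1 - p_right_right),       # class 2
    p_root_right * p_right_right,             # class 3
])
prob_based = int(leaf_probs.argmax())

# Direct: round at every node and follow a single path.
if p_root_right > 0.5:
    direct = 3 if p_right_right > 0.5 else 2
else:
    direct = 1 if p_left_right > 0.5 else 0

print(leaf_probs, prob_based, direct)
```

The root is only slightly in favour of the left side, so direct routing commits to class 0, while the confident right child makes class 2 the most probable leaf overall.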

In [16]:
train_predictions = tree.predict_direct(train_data)
test_predictions = tree.predict_direct(test_data)

In [17]:
train_acc = sklearn.metrics.accuracy_score(train_target, train_predictions)
test_acc = sklearn.metrics.accuracy_score(test_target, test_predictions)

In [18]:
print(f"Train accuracy: {train_acc}, Test accuracy: {test_acc}")

Train accuracy: 0.818530612244898, Test accuracy: 0.8069047619047619


Note that the order of the nodes is one of the hyperparameters: when we change it, the accuracy can increase or decrease. For example, digits $1$ and $7$ are harder to distinguish than $6$ and $8$, so we may keep them together further down in the tree. Let's take, for example, the following tree.

![tree with custom order](img/TreeSpecialOrder_07.svg)

The order is just the first one that came to my mind. Let's see how the accuracies change.

In [19]:
from src.treebased_07 import TreeSpecialOrder
tree = TreeSpecialOrder(BCELoss(), epochs=400, learning_rate=0.001, batch_size=128, random_state=42)
tree.fit(train_data, train_target, progress=True)

100% (400 of 400) |######################| Elapsed Time: 0:01:07 Time:  0:01:07
100% (400 of 400) |######################| Elapsed Time: 0:00:34 Time:  0:00:34
100% (400 of 400) |######################| Elapsed Time: 0:00:33 Time:  0:00:33
100% (400 of 400) |######################| Elapsed Time: 0:00:22 Time:  0:00:22
100% (400 of 400) |######################| Elapsed Time: 0:00:14 Time:  0:00:14
100% (400 of 400) |######################| Elapsed Time: 0:00:13 Time:  0:00:13
100% (400 of 400) |######################| Elapsed Time: 0:00:20 Time:  0:00:20
100% (400 of 400) |######################| Elapsed Time: 0:00:15 Time:  0:00:15
100% (400 of 400) |######################| Elapsed Time: 0:00:13 Time:  0:00:13


In [20]:
train_predictions = tree.predict_direct(train_data)
test_predictions = tree.predict_direct(test_data)
train_acc = sklearn.metrics.accuracy_score(train_target, train_predictions)
test_acc = sklearn.metrics.accuracy_score(test_target, test_predictions)
print(f"Train accuracy: {train_acc}, Test accuracy: {test_acc}")

Train accuracy: 0.838265306122449, Test accuracy: 0.814952380952381


In [21]:
train_predictions = tree.predict_probbased(train_data)
test_predictions = tree.predict_probbased(test_data)
train_acc = sklearn.metrics.accuracy_score(train_target, train_predictions)
test_acc = sklearn.metrics.accuracy_score(test_target, test_predictions)
print(f"Train accuracy: {train_acc}, Test accuracy: {test_acc}")

Train accuracy: 0.8449387755102041, Test accuracy: 0.8217619047619048


As you may see, both the direct prediction (when no probabilities are used) and the probability-based accuracies increased. Still not very good, but we were able to increase the test accuracy by roughly $0.7$ to $0.8$ percentage points simply by changing the order of the leaves.

As I said, this chapter was more playing around than what would be done in reality. Now let's see the proper way.

# One-vs-rest

The final and correct approach, the one used in neural networks (we are slowly getting there, aren't we), is something called one-vs-rest. We are going to train one model for each digit, and it will return the probability of that digit. The complement is "it is any other digit". In the end, we are going to pick the most probable digit as the prediction. The first thing we need is to train the models.
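
The training cell that follows builds the binary target by shifting the labels up by 100 first, so that writing 1 for the current digit cannot be confused with the original digit 1. As a sketch on toy labels (the array `labels` is made up), the same target can be obtained with a plain comparison:

```python
import numpy as np

labels = np.array([3, 0, 7, 1, 3, 9])
digit = 3

# Direct encoding: 1 where the label equals the current digit, 0 elsewhere.
binary_target = (labels == digit).astype(int)

# The shift-by-100 variant, reproduced here for comparison.
shifted = labels.copy() + 100
shifted[shifted == 100 + digit] = 1
shifted[shifted >= 100] = 0

print(binary_target)                           # [1 0 0 0 1 0]
print(np.array_equal(binary_target, shifted))  # True
```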

In [22]:
models = np.empty((10,), dtype=object)
for i in range(10):
    current_target = train_target.copy() + 100     # copy the train target and shift it up by 100
    current_target[current_target == 100 + i] = 1  # set the current training digit to equal 1
    current_target[current_target >= 100] = 0      # all other digits set to 0
    models[i] = Neuron(BCELoss(), epochs=400, learning_rate=0.001, batch_size=128, random_state=42+i)
    models[i].fit(train_data, current_target, progress=True)

100% (400 of 400) |######################| Elapsed Time: 0:01:06 Time:  0:01:06
100% (400 of 400) |######################| Elapsed Time: 0:01:06 Time:  0:01:06
100% (400 of 400) |######################| Elapsed Time: 0:01:05 Time:  0:01:05
100% (400 of 400) |######################| Elapsed Time: 0:01:06 Time:  0:01:06
100% (400 of 400) |######################| Elapsed Time: 0:01:06 Time:  0:01:06
100% (400 of 400) |######################| Elapsed Time: 0:01:06 Time:  0:01:06
100% (400 of 400) |######################| Elapsed Time: 0:01:06 Time:  0:01:06
100% (400 of 400) |######################| Elapsed Time: 0:01:06 Time:  0:01:06
100% (400 of 400) |######################| Elapsed Time: 0:01:06 Time:  0:01:06
100% (400 of 400) |######################| Elapsed Time: 0:01:06 Time:  0:01:06


The code should be straightforward. The variable `models` holds the logistic regressions: at index $i$ (`models[i]`) is the model that predicts the probability of the input being digit $i$. Now let's predict the labels.

In [23]:
train_distribution = np.zeros((len(train_target), 10))
test_distribution = np.zeros((len(test_target), 10))
for i in range(10):
    train_distribution[:,i] = models[i].predict(train_data)
    test_distribution[:,i] = models[i].predict(test_data)
train_predictions = np.argmax(train_distribution, axis=1)
test_predictions = np.argmax(test_distribution, axis=1)

Now we have our predictions. Notice that although `train_distribution` and `test_distribution` are called distributions, in fact they are not distributions yet. Remember, the probabilities in a distribution need to sum to 1. To really obtain distributions, we need to divide each row by the sum of its probabilities.
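
A small sketch on made-up scores shows the row-wise normalization and confirms that it does not change which class wins. It uses `keepdims=True`, which is an alternative way of writing the `[:, np.newaxis]` indexing.

```python
import numpy as np

# Hypothetical one-vs-rest outputs for 2 samples and 3 classes; rows do not sum to 1.
scores = np.array([[0.9, 0.3, 0.0],
                   [0.2, 0.2, 0.1]])

# keepdims=True keeps the row sums as a column, so the division broadcasts per row.
distribution = scores / scores.sum(axis=1, keepdims=True)

print(distribution.sum(axis=1))  # every row now sums to 1 (up to rounding)
print(scores.argmax(axis=1), distribution.argmax(axis=1))
```

Dividing a row by a positive constant keeps the order of its entries, which is why the predictions could be taken from the unnormalized outputs.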

In [24]:
train_distribution = train_distribution / np.sum(train_distribution, axis=1)[:,np.newaxis]
test_distribution = test_distribution / np.sum(test_distribution, axis=1)[:,np.newaxis]
with np.printoptions(precision=3, suppress=True):
    print(test_distribution[:10])

[[0.    0.    0.003 0.979 0.    0.016 0.    0.    0.002 0.   ]
 [0.    0.953 0.003 0.018 0.    0.003 0.    0.005 0.006 0.011]
 [0.    0.787 0.    0.004 0.    0.018 0.003 0.    0.186 0.   ]
 [0.    0.802 0.161 0.001 0.005 0.01  0.    0.004 0.014 0.002]
 [0.    0.    0.011 0.    0.007 0.    0.982 0.    0.    0.   ]
 [0.    0.001 0.03  0.948 0.    0.015 0.    0.    0.006 0.   ]
 [0.002 0.    0.    0.    0.007 0.984 0.001 0.    0.003 0.002]
 [0.002 0.    0.    0.    0.933 0.002 0.    0.    0.025 0.038]
 [0.    0.    0.002 0.    0.    0.004 0.    0.    0.989 0.006]
 [0.    0.208 0.078 0.485 0.002 0.218 0.001 0.    0.008 0.001]]


We usually want to have the distribution, so the model should return it instead of the unnormalized scores. The function that is responsible for that is called **softmax**, and we will talk about it in the very next notebook. I don't want to implement it yet, as our model needs some refactoring before we can plug it in.

Let's see the accuracies.

In [25]:
train_acc = sklearn.metrics.accuracy_score(train_target, train_predictions)
test_acc = sklearn.metrics.accuracy_score(test_target, test_predictions)
print(f"Train accuracy: {train_acc}, Test accuracy: {test_acc}")

Train accuracy: 0.9236938775510204, Test accuracy: 0.9060952380952381


That looks good, although the accuracy is not as good as with the one-vs-one approach. We trade fewer models to train for roughly $1$ to $2$ percentage points of test accuracy. That sounds bad right now, but this approach has one benefit: we can join all ten models into one and train them in parallel. That allows us to train the model better, in a shorter time, and to track the performance during training: we may monitor the progress and, for example, stop models with bad hyperparameters. A shorter time means we may test more hyperparameter combinations, and we may, in fact, achieve a better score just by modifying the learning rate and other hyperparameters (we will have plenty of them in the following notebooks).
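
To give an idea of what joining the models means, here is a minimal sketch. It assumes each model computes a sigmoid of a dot product between the input and a weight vector; the random `X` and `W` are placeholders, not the trained weights of `models`.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features, n_classes = 5, 785, 10

# Placeholder binary inputs with the same width as our data (784 pixels + bias column).
X = rng.integers(0, 2, size=(n_samples, n_features)).astype(float)
# Hypothetical weights: column k plays the role of the weight vector of model k.
W = rng.normal(scale=0.01, size=(n_features, n_classes))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Ten separate models, one dot product each ...
separate = np.stack([sigmoid(X @ W[:, k]) for k in range(n_classes)], axis=1)
# ... versus a single matrix product that evaluates all ten at once.
joint = sigmoid(X @ W)

print(joint.shape, np.allclose(separate, joint))  # (5, 10) True
```

Stacking the weight vectors as columns of one matrix turns ten predictions into one matrix multiplication, and the same holds for the gradient computation during training.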

Let's refactor the code and learn about softmax - the final stop before our first neural network.