# Problem 03.06

## Problem:

There is a link to a very large dataset of handwritten figures on the book website (the MNIST dataset). Download it and use a Perceptron to learn about the dataset.

## Solution:

Let's first import essential modules:

In [1]:
import numpy as np
import pandas as pd
import pcn  # perceptron implementation used for this exercise
from sklearn.metrics import confusion_matrix

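Load the dataset. The CSV file has no header row, so `read_csv` with its default arguments consumes the first sample as the column header; this is why the outputs below show 59999 rows rather than 60000. Passing `header=None` would keep every sample.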

In [2]:
data = pd.read_csv("../../../datasets/MNIST/mnist_train.csv")
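As a small illustration of how `read_csv` treats the first row, the sketch below parses a made-up, header-less three-row CSV from an in-memory string. The contents are hypothetical and are not MNIST data; only the row counts matter here.

```python
import io
import pandas as pd

# Hypothetical header-less CSV: a label followed by three "pixel" values.
csv_text = "5,0,0,0\n0,0,255,0\n4,12,0,7\n"

# Default behaviour: the first row is consumed as the column header.
with_default = pd.read_csv(io.StringIO(csv_text))

# header=None keeps every row as data and numbers the columns 0..3.
with_header_none = pd.read_csv(io.StringIO(csv_text), header=None)

print(with_default.shape)      # (2, 4)
print(with_header_none.shape)  # (3, 4)
```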

Next, we add appropriate column headings and separate the labels from the pixel data:

In [3]:
column_heads = ["{:03d}".format(x) for x in range(1, 785)]
column_heads.insert(0, "label")
data.columns = column_heads

labels = data["label"]
data.pop("label")

0        0
1        4
2        1
3        9
4        2
5        1
6        3
7        1
8        4
9        3
10       5
11       3
12       6
13       1
14       7
15       2
16       8
17       6
18       9
19       4
20       0
21       9
22       1
23       1
24       2
25       4
26       3
27       2
28       7
29       3
        ..
59969    2
59970    2
59971    0
59972    9
59973    2
59974    4
59975    6
59976    7
59977    3
59978    1
59979    3
59980    6
59981    6
59982    2
59983    1
59984    2
59985    6
59986    0
59987    7
59988    8
59989    9
59990    2
59991    9
59992    5
59993    1
59994    8
59995    3
59996    5
59997    6
59998    8
Name: label, Length: 59999, dtype: int64

Convert the label vector into a one-hot label matrix, with one column per digit:

In [4]:
matLabels = np.zeros((data.shape[0], 10))
for k in range(labels.shape[0]):
    matLabels[k, labels[k]] = 1
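The same one-hot encoding can also be built without an explicit loop. The following is a minimal sketch on a few made-up labels; `example_labels` and `n_classes` are illustrative names, not variables from the notebook.

```python
import numpy as np

# Hypothetical labels for illustration; the notebook uses the MNIST label column.
example_labels = np.array([0, 4, 1, 9])
n_classes = 10

# Row k of the identity matrix is the one-hot vector for class k,
# so indexing with the label array builds the whole matrix at once.
one_hot = np.eye(n_classes)[example_labels]

print(one_hot.shape)  # (4, 10)
```

Taking `argmax` along each row of `one_hot` recovers the original labels, which is a quick sanity check on the encoding.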

Set the number of training samples:

In [5]:
nTraining = 5000

Split the data into a training set (the first 5000 rows) and a cross-validation set (the remaining 54999 rows):

In [6]:
# DataFrame.as_matrix() was removed in pandas 1.0; to_numpy() is its replacement.
trainData = data[:nTraining].to_numpy()
matTrainLabels = matLabels[:nTraining, :]
trainLabels = labels[:nTraining].to_numpy()

cvData = data[nTraining:].to_numpy()
matCVLabels = matLabels[nTraining:, :]
cvLabels = labels[nTraining:].to_numpy()

Train the model, then decode the thresholded outputs into digit predictions:

In [7]:
p = pcn.pcn(10, eta = 0.25, iter = 5000)
trainoutput = p.trainWeights(trainData, matTrainLabels)

# Predicted digit = index of the first output neuron that fired.
# argmax of an all-False row is 0, so samples where no neuron fired
# fall back to digit 0.
trainpred = np.argmax(trainoutput == 1, axis=1)
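The `pcn` module itself is not shown here. As a rough illustration of what a perceptron with thresholded outputs might do internally, the sketch below trains a batch perceptron in plain NumPy on the logical OR problem. The function names, the bias handling (a constant -1 input), and the weight initialisation are all assumptions made for this example and are not taken from `pcn`.

```python
import numpy as np

def train_perceptron(inputs, targets, eta=0.25, n_iter=100):
    """Batch perceptron with threshold outputs (illustrative, not the pcn module)."""
    n_samples = inputs.shape[0]
    # Append a constant -1 column so the bias is learned as an ordinary weight.
    x = np.hstack([inputs, -np.ones((n_samples, 1))])
    rng = np.random.default_rng(0)
    weights = rng.uniform(-0.05, 0.05, size=(x.shape[1], targets.shape[1]))
    for _ in range(n_iter):
        # A neuron fires (outputs 1) when its weighted sum is positive.
        outputs = (x @ weights > 0).astype(float)
        # Perceptron rule: move weights against the output error.
        weights -= eta * x.T @ (outputs - targets)
    return weights

def predict_perceptron(weights, inputs):
    """Thresholded forward pass using the same bias convention as training."""
    x = np.hstack([inputs, -np.ones((inputs.shape[0], 1))])
    return (x @ weights > 0).astype(int)

# Toy check on logical OR, which is linearly separable.
or_inputs = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
or_targets = np.array([[0], [1], [1], [1]], dtype=float)
w = train_perceptron(or_inputs, or_targets)
print(predict_perceptron(w, or_inputs).ravel())
```

With ten target columns instead of one, the same update trains ten independent threshold neurons, which is why several neurons (or none) can fire for a single image.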

Check how the model performed on the training data:

In [8]:
conf_mat = confusion_matrix(trainLabels, trainpred)
percentRight = np.trace(conf_mat)/np.sum(conf_mat)*100
print(conf_mat)
print(percentRight)

[[479   0   0   0   0   0   0   0   0   0]
 [  0 563   0   0   0   0   0   0   0   0]
 [  2   0 486   0   0   0   0   0   0   0]
 [ 21   0   5 458   0   1   0   0   8   0]
 [  0   0   7   0 528   0   0   0   0   0]
 [  1   0   5   3   0 424   0   0   0   0]
 [  0   0   7   0   0   2 492   0   0   0]
 [  0   0   4   4   0   0   0 543   0   0]
 [  9   0   7   0   0   1   0   0 445   0]
 [ 57   0   3   0   0   0   0   0  56 379]]
95.94
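The overall accuracy hides how unevenly the errors are spread across digits. As a sketch, per-digit accuracy (recall) can be read off the same confusion matrix by dividing its diagonal by the row sums; the matrix below is copied from the training output above, and `per_class` is an illustrative name.

```python
import numpy as np

# Training confusion matrix copied from the output above
# (rows: true digit, columns: predicted digit).
conf_mat = np.array([
    [479,   0,   0,   0,   0,   0,   0,   0,   0,   0],
    [  0, 563,   0,   0,   0,   0,   0,   0,   0,   0],
    [  2,   0, 486,   0,   0,   0,   0,   0,   0,   0],
    [ 21,   0,   5, 458,   0,   1,   0,   0,   8,   0],
    [  0,   0,   7,   0, 528,   0,   0,   0,   0,   0],
    [  1,   0,   5,   3,   0, 424,   0,   0,   0,   0],
    [  0,   0,   7,   0,   0,   2, 492,   0,   0,   0],
    [  0,   0,   4,   4,   0,   0,   0, 543,   0,   0],
    [  9,   0,   7,   0,   0,   1,   0,   0, 445,   0],
    [ 57,   0,   3,   0,   0,   0,   0,   0,  56, 379],
])

# Overall accuracy: correctly classified samples over all samples.
overall = np.trace(conf_mat) / np.sum(conf_mat) * 100

# Per-digit recall: diagonal entry over the number of true samples of that digit.
per_class = np.diag(conf_mat) / conf_mat.sum(axis=1) * 100

print(round(overall, 2))
print(np.round(per_class, 1))
```

On this matrix the digit 9 has the lowest recall, with most of its errors landing in columns 0 and 8.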


Predict the cross-validation data:

In [9]:
cvoutput = p.forwardPredict(cvData)

# Predicted digit = index of the first output neuron that fired.
# argmax of an all-False row is 0, so samples where no neuron fired
# fall back to digit 0.
cvpred = np.argmax(cvoutput == 1, axis=1)
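The decoding rule used here takes the first output neuron that fired and falls back to 0 when none did, so every sample with no fired neuron is counted as a predicted 0. A minimal sketch for checking how often that happens is given below on a made-up output array; `toy_output` is hypothetical and stands in for `cvoutput`.

```python
import numpy as np

# Hypothetical thresholded outputs for four samples and three classes.
toy_output = np.array([
    [0, 1, 0],   # exactly one neuron fired
    [0, 0, 0],   # no neuron fired
    [1, 0, 1],   # two neurons fired
    [0, 0, 1],   # exactly one neuron fired
])

# Count how many neurons fired for each sample.
fired_per_row = (toy_output == 1).sum(axis=1)
n_none = int((fired_per_row == 0).sum())
n_multiple = int((fired_per_row > 1).sum())

# First-fired decoding: argmax of a boolean row returns the first True,
# or 0 when the row is all False.
decoded = np.argmax(toy_output == 1, axis=1)

print(n_none, n_multiple, decoded)
```

Running the same counts on the real outputs would show how much of the first column of the confusion matrix comes from the fallback rather than from the first neuron actually firing.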

Check how the model performed on the cross-validation data:

In [10]:
conf_mat = confusion_matrix(cvLabels, cvpred)
percentRight = np.trace(conf_mat)/np.sum(conf_mat)*100
print(conf_mat)
print(percentRight)

[[5313    0   39    6    9    9   16   12   39    1]
 [ 105 5843   36   27    6   17    5    8  129    3]
 [ 348  111 4645   77   35    9   43   50  143    9]
 [ 505   67  210 4413    7  158   16   41  206   15]
 [ 260   31  156   74 4550   15   26   18  125   52]
 [ 505   34  135  358  150 3539   59   23  168   16]
 [ 356   34  373   18  194  246 4173    4   17    2]
 [ 322   40  160  273   69   41    3 4669   31  106]
 [ 505  123  244  156   71  237   46   35 3956   16]
 [ 964   65   55  131  303  133    2  351  532 2918]]
80.0360006546
