# Practical 6: Image classification

In this practical, you will build a model for hand-written digit classification on the [MNIST](http://yann.lecun.com/exdb/mnist/) dataset. MNIST is a famous benchmark dataset for machine learning. It consists of pairs of images and labels (10 classes, from 0 to 9). Some example images are shown below.

![](mnist.png)

Your model, once trained, will be able to predict which digit each image shows. You can use any classification model from the [sklearn](https://scikit-learn.org) library.

In [None]:
# Import libraries (provided)
import numpy as np
import pandas as pd
import gzip
import os
import struct
import matplotlib.pyplot as plt
import sklearn
from sklearn import neighbors, ensemble, svm, metrics
import time

In [None]:
# Functions for loading MNIST image data (provided)
def extract_images(f_name):
    """ Extract the images into a 2D uint8 numpy array [index, rows x cols]. """
    print('Extracting', f_name)
    with gzip.open(f_name, 'rb') as f:
        # Read file header
        buffer = f.read(16)
        magic, num_images, rows, cols = struct.unpack(">IIII", buffer)
        if magic != 2051:
            raise ValueError('Invalid magic number {0} in MNIST image file {1}.'.format(magic, f_name))

        # Read data
        buffer = f.read(rows * cols * num_images)
        data = np.frombuffer(buffer, dtype=np.uint8)
        data = data.reshape(num_images, rows * cols)
        return data


# Functions for loading MNIST label data (provided)
def extract_labels(f_name):
    """ Extract the labels into a 1D uint8 numpy vector [index,]. """
    print('Extracting', f_name)
    with gzip.open(f_name, 'rb') as f:
        # Read file header
        buffer = f.read(8)
        magic, num_items = struct.unpack(">II", buffer)
        if magic != 2049:
            raise ValueError('Invalid magic number {0} in MNIST label file {1}.'.format(magic, f_name))

        # Read data
        buffer = f.read(num_items)
        data = np.frombuffer(buffer, dtype=np.uint8)
        return data

## 1. Load and browse data.

The MNIST dataset is split into a training set (60,000 samples) and a test set (10,000 samples). In total, there are 4 files.

* `train-images-idx3-ubyte.gz`: training images
* `train-labels-idx1-ubyte.gz`: training labels
* `t10k-images-idx3-ubyte.gz`: test images
* `t10k-labels-idx1-ubyte.gz`: test labels

#### 1.1 Load data (provided).

In [None]:
# Training set
X_train = extract_images('train-images-idx3-ubyte.gz')
y_train = extract_labels('train-labels-idx1-ubyte.gz')

# Test set
X_test = extract_images('t10k-images-idx3-ubyte.gz')
y_test = extract_labels('t10k-labels-idx1-ubyte.gz')

#### 1.2 Print out the shapes of the four arrays.
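One possible way to approach this is sketched below. The helper `describe_arrays` is a hypothetical name, and `X_demo` and `y_demo` are small random stand-ins (an assumption, so that the cell runs without the MNIST files). In the notebook you would pass the four arrays loaded in 1.1 instead.

```python
import numpy as np


def describe_arrays(**arrays):
    """Print one line per array and return a dict mapping name -> shape."""
    shapes = {}
    for name, arr in arrays.items():
        shapes[name] = arr.shape
        print(f"{name}: shape={arr.shape}, dtype={arr.dtype}")
    return shapes


# Stand-in data (assumption): 6 flat "images" of 784 pixels and 6 labels.
rng = np.random.default_rng(0)
X_demo = rng.integers(0, 256, size=(6, 784)).astype(np.uint8)
y_demo = rng.integers(0, 10, size=6).astype(np.uint8)

shapes = describe_arrays(X_demo=X_demo, y_demo=y_demo)
```

With the real data, the call would look like `describe_arrays(X_train=X_train, y_train=y_train, X_test=X_test, y_test=y_test)`.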

#### 1.3 For the first 25 samples in the training set, show their images and corresponding labels.

Hint: Each image is 28 by 28 pixels but is stored as a flat vector of 784 values. You need to reshape it to 28 by 28 before displaying it.
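The hint above could be applied roughly as follows. This is only a sketch: `plot_digit_grid` is a hypothetical helper, and the random `X_demo` and `y_demo` arrays are stand-ins (an assumption) for `X_train` and `y_train`, so the images shown here are noise rather than digits.

```python
import numpy as np
import matplotlib.pyplot as plt


def plot_digit_grid(X, y, n_rows=5, n_cols=5, side=28):
    """Show the first n_rows * n_cols rows of X as side x side images."""
    fig, axes = plt.subplots(n_rows, n_cols, figsize=(6, 6), squeeze=False)
    for i, ax in enumerate(axes.ravel()):
        # Each row of X is a flat vector; reshape it back into a 2D image.
        ax.imshow(X[i].reshape(side, side), cmap="gray")
        ax.set_title(str(y[i]), fontsize=8)
        ax.axis("off")
    fig.tight_layout()
    return fig


# Stand-in data (assumption): 25 random "images" with random labels.
rng = np.random.default_rng(0)
X_demo = rng.integers(0, 256, size=(25, 784)).astype(np.uint8)
y_demo = rng.integers(0, 10, size=25).astype(np.uint8)

fig = plot_digit_grid(X_demo, y_demo)
```

In the notebook, `plot_digit_grid(X_train, y_train)` would draw the first 25 training samples with their labels as titles.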

#### 1.4 For each of the 10 classes (digit 0 to 9), count the number of samples in the training and test sets.
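A minimal sketch of one way to count samples per class, using `np.bincount`. The helper name `class_counts` and the tiny label vector `y_demo` are assumptions for illustration. In the notebook you would call it on `y_train` and `y_test`.

```python
import numpy as np


def class_counts(y, n_classes=10):
    """Return an array whose entry c is the number of samples with label c."""
    # minlength keeps a slot for every class, even one that never occurs.
    return np.bincount(y, minlength=n_classes)


# Stand-in labels (assumption): a tiny vector in place of y_train.
y_demo = np.array([0, 1, 1, 3, 9, 9, 9], dtype=np.uint8)
counts = class_counts(y_demo)

for digit, count in enumerate(counts):
    print(f"digit {digit}: {count} samples")
```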

## 2. Analyse data.

#### 2.1 Train a classification model on the training set.

You will apply the model to the test set in the steps that follow. Most [sklearn](https://scikit-learn.org) classifiers require two inputs for training: the features X and the labels y.

X: N x M matrix, where N is the number of samples and M is the number of features per sample.

y: vector of length N, where each element is the label of one sample.

Since we have already prepared the data in this format, you only need to train the model. You can use any classification model supported by sklearn, including

* [K nearest neighbour classifier](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.neighbors)
* [Random forest classifier](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.ensemble)
* [Support vector machine classifier](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.svm)
* or any other model

Since today's lecture covered the nearest neighbour method, why not try it?

Also, please use the `time` module to measure how long training takes.
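As a rough sketch of timing the training step, the cell below fits a K nearest neighbour classifier and measures the elapsed time with `time.perf_counter()` (`time.time()` would also work). The stand-in data and the choice of `n_neighbors=3` are assumptions for illustration, not a recommended setting. In the notebook you would fit on `X_train` and `y_train`.

```python
import time

import numpy as np
from sklearn import neighbors

# Stand-in data (assumption): 200 flat "images" of 784 uint8 pixels.
# Class 0 images are dark (0-49), class 1 images are bright (200-249).
rng = np.random.default_rng(0)
y_demo = np.repeat([0, 1], 100).astype(np.uint8)
X_demo = (rng.integers(0, 50, size=(200, 784)) + 200 * y_demo[:, None]).astype(np.uint8)

model = neighbors.KNeighborsClassifier(n_neighbors=3)  # k = 3 is an arbitrary choice

start = time.perf_counter()
model.fit(X_demo, y_demo)
train_seconds = time.perf_counter() - start

print(f"Training took {train_seconds:.4f} s")
```

Fitting a nearest neighbour model mostly stores (and possibly indexes) the training data, so training tends to be fast and most of the cost shows up at prediction time.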

#### 2.2 Apply the model to the first `100` samples in the test set. Measure how long it takes.
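Prediction can be timed in the same way as training. The sketch below is self-contained, so it first fits a small model on stand-in data. The `make_demo` helper, the array names and `n_neighbors=3` are all assumptions. With the real data you would slice `X_test[:100]`.

```python
import time

import numpy as np
from sklearn import neighbors

rng = np.random.default_rng(0)


def make_demo(n):
    """Stand-in data (assumption): label 0 = dark images, label 1 = bright images."""
    y = rng.integers(0, 2, size=n).astype(np.uint8)
    X = (rng.integers(0, 50, size=(n, 784)) + 200 * y[:, None]).astype(np.uint8)
    return X, y


X_tr, y_tr = make_demo(300)
X_te, y_te = make_demo(150)

model = neighbors.KNeighborsClassifier(n_neighbors=3).fit(X_tr, y_tr)

# Time only the prediction on the first 100 test samples.
start = time.perf_counter()
y_pred = model.predict(X_te[:100])
predict_seconds = time.perf_counter() - start

print(f"Predicting 100 samples took {predict_seconds:.4f} s")
```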

#### 2.3 What is the classification accuracy for the first 100 test samples?

Accuracy is defined as the percentage of correctly classified samples.
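The definition above can be written directly in NumPy and cross-checked against `sklearn.metrics.accuracy_score`, which returns a fraction rather than a percentage. The helper `accuracy_percent` and the two short label vectors are assumptions for illustration. In the notebook you would compare `y_test[:100]` with your predictions.

```python
import numpy as np
from sklearn import metrics


def accuracy_percent(y_true, y_pred):
    """Percentage of samples whose predicted label equals the true label."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    return 100.0 * np.mean(y_true == y_pred)


# Stand-in labels (assumption): four of the five predictions are correct.
y_true_demo = np.array([7, 2, 1, 0, 4], dtype=np.uint8)
y_pred_demo = np.array([7, 2, 1, 0, 9], dtype=np.uint8)

acc = accuracy_percent(y_true_demo, y_pred_demo)
acc_sklearn = 100.0 * metrics.accuracy_score(y_true_demo, y_pred_demo)

print(f"Accuracy: {acc:.1f}% (sklearn: {acc_sklearn:.1f}%)")
```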

#### 2.4 Display the wrongly classified images, their predictions and ground truth.
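One way to approach this is to find the indices where prediction and ground truth disagree, then plot those images. The sketch below uses a hypothetical helper `plot_misclassified` and random stand-in arrays (an assumption). In the notebook you would pass `X_test[:100]`, `y_test[:100]` and your predictions.

```python
import numpy as np
import matplotlib.pyplot as plt


def plot_misclassified(X, y_true, y_pred, side=28, max_images=10):
    """Plot up to max_images wrongly classified rows of X; return (indices, figure)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    wrong = np.flatnonzero(y_true != y_pred)[:max_images]
    if wrong.size == 0:
        return wrong, None  # nothing to show

    fig, axes = plt.subplots(1, wrong.size, figsize=(1.5 * wrong.size, 2), squeeze=False)
    for ax, i in zip(axes[0], wrong):
        ax.imshow(X[i].reshape(side, side), cmap="gray")
        ax.set_title(f"pred {y_pred[i]}\ntrue {y_true[i]}", fontsize=8)
        ax.axis("off")
    fig.tight_layout()
    return wrong, fig


# Stand-in data (assumption): samples 1 and 4 are misclassified.
rng = np.random.default_rng(0)
X_demo = rng.integers(0, 256, size=(5, 784)).astype(np.uint8)
y_true_demo = np.array([7, 2, 1, 0, 4], dtype=np.uint8)
y_pred_demo = np.array([7, 3, 1, 0, 9], dtype=np.uint8)

wrong, fig = plot_misclassified(X_demo, y_true_demo, y_pred_demo)
print("Misclassified indices:", wrong.tolist())
```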