# Assignment 1: ML workflow with scikit-learn
## Description
In this lab you will have a chance to develop a machine learning pipeline that recognises hand-written digits. The dataset that you will use for this assignment is [MNIST](http://yann.lecun.com/exdb/mnist/): 27x27 pixels grayscale images.  
In this assignment you are asked to pick any 2 digits: `0, 1, 2, 3, 4, 5, 6, 7, 8, 9`, from the MNIST dataset (500 random samples per chosen digit). Then extract any 2 selected features from the ones prepared for you: `f1, f2, f3, f4, f5, f6, f7, f8, f9`.

Once you have prepared the dataset you need to choose a machine learning algorithm from `scikit-learn` and apply it. You need to decide how to evaluate your results (cross-validation, training-test split, etc.), what performance metric to use and how to report your results. Motivate all your choices.

To get you started you are given the dataset loader and two functions:
* `pick_digits`, and
* `pick_features`;

together with and an example how to use them.

One thing you might want to do is to plot the decision boundaries. Lab3 showed how to do this for a linear classifier, but for other classifiers this can be a bit more involved. One way to visualise decision regions is to colour-code the predictions of the classifier over a mesh grid (see [link](http://scikit-learn.org/stable/auto_examples/neighbors/plot_classification.html#sphx-glr-auto-examples-neighbors-plot-classification-py) to NearestNeighbors in scikit-learn). Another possibility is to calculate posterior class probabilities over a mesh grid and draw a contour line at p=$\frac{1}{2}$ (see [link](http://matplotlib.org/examples/pylab_examples/contour_demo.html) to matplotlib contour demo).

## Marking criteria
* 30% - develop one working solution;
* 10% - support it with relevant plots and figures eg. data scatter plot with decision boundary, predictive accuracy;
---
* 20% - develop another working solution of different type;
* 10% - support it with relevant plots and figures eg. data scatter plot with decision boundary, predictive accuracy;
---
* 10% - motivate your choices (5% per solution);
* 10% - create a baseline for your dataset;
* 10% - compare your results against each other (5%) and the baseline (5%).
---

Please put all your working and comments in this Jupyter notebook and submit it as **the only** file with name `<your_candidate_number>.ipynb` eg. `12321.ipynb`. If you do not use this naming convention **10%** will be subtracted from your mark!  
Your candidate number can be found on your [FEN](https://wwwa.fen.bris.ac.uk/coms/index.jsp) **profile** next to *Candidate:* keyword.

In [1]:
# Import all necessary packages
import math
import numpy as np
import pandas as pd
from sklearn.datasets import fetch_mldata
%matplotlib inline
mnist = fetch_mldata("MNIST original")

In [2]:
# Load and reformat the data
data = pd.DataFrame(mnist.data)
data["label"] = mnist.target

In [3]:
# Pick selected digits from the dataset; random sample 500 instances
# for each selected digit
def pick_digits(l, d, random_sample=500, seed=42):
    digits = []
    for i in l:
        digit = d[d.label == i]
        digits.append(digit.sample(random_sample, random_state=seed))
    return pd.concat(digits)

# Compute selected features
def pick_features(l, d):
    df = d.copy()
    df.drop("label", axis=1, inplace=True)
    
    features = pd.DataFrame(d["label"])
    for i in l:
        features[i] = np.nan
    
    for index, row in df.iterrows():
        if "f1" in l:
            sum = 0
            for ii, i in row.iteritems():
                image_row = math.floor(ii/28.0)
                if 1 < image_row and image_row < 8:
                    sum += i
            features["f1"][index] = sum
        if "f2" in l:
            sum = 0
            for ii, i in row.iteritems():
                image_row = math.floor(ii/28.0)
                if 10 < image_row and image_row < 17:
                    sum += i
            features["f2"][index] = sum
        if "f3" in l:
            sum = 0
            for ii, i in row.iteritems():
                image_row = math.floor(ii/28.0)
                if 19 < image_row and image_row < 26:
                    sum += i
            features["f3"][index] = sum

        if "f4" in l:
            sum = 0
            for ii, i in row.iteritems():
                image_col = ii%28
                if 1 < image_col and image_col < 8:
                    sum += i
            features["f4"][index] = sum
        if "f5" in l:
            sum = 0
            for ii, i in row.iteritems():
                image_col = ii%28
                if 10 < image_col and image_col < 17:
                    sum += i
            features["f5"][index] = sum
        if "f6" in l:
            sum = 0
            for ii, i in row.iteritems():
                image_col = ii%28
                if 19 < image_col and image_col < 26:
                    sum += i
            features["f6"][index] = sum

        if "f7" in l:
            sum = 0
            for i in range(1, 28):
                j = i - 1
                ind = j * 28 + i
                sum += row[ind]
            for i in range(0, 28):
                j = i
                ind = j * 28 + i
                sum += row[ind]
            for i in range(0, 27):
                j = i + 1
                ind = j * 28 + i
                sum += row[ind]
            features["f7"][index] = sum
        if "f8" in l:
            sum = 0
            for i in range(1, 28):
                j = i - 1
                i = 27 - i
                ind = j * 28 + i
                sum += row[ind]
            for i in range(0, 28):
                j = i
                i = 27 - i
                ind = j * 28 + i
                sum += row[ind]
            for i in range(0, 27):
                j = i + 1
                i = 27 - i
                ind = j * 28 + i
                sum += row[ind]
            features["f8"][index] = sum

        if "f9" in l:
            sum = 0
            for ii, i in row.iteritems():
                image_row = math.floor(ii/28.0)
                image_col = ii%28
                if 10 < image_row and image_row < 17 and 10 < image_col and image_col < 17:
                    sum += i
            features["f9"][index] = sum

    return features

In [4]:
#
## Example
#
x = pick_digits([2,7], data)
y = pick_features(["f7", "f8"], x)
y

Unnamed: 0,label,f7,f8
17646,2.0,2210.0,1958.0
17530,2.0,3837.0,6503.0
15599,2.0,2958.0,5292.0
17671,2.0,2387.0,4201.0
17072,2.0,6644.0,5106.0
12898,2.0,3972.0,4322.0
14219,2.0,2699.0,5738.0
62348,2.0,1571.0,6442.0
16713,2.0,3719.0,5586.0
18051,2.0,2432.0,5065.0
