# Session 6 - Benchmark classification on ```cifar-10```

This notebook builds on what we were doing last week with the handwritten digits from the MNIST dataset.

This week, we're working with another famous dataset in computer vision and image processing research - [cifar10](https://www.cs.toronto.edu/~kriz/cifar.html).

In [None]:
# path tools
import os

# data loader
from tensorflow.keras.datasets import cifar10

# machine learning tools
# from sklearn.preprocessing import LabelBinarizer not needed
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# classification models
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier

# image processing and array tools
import cv2
import numpy as np

We're going to load the data using a function from the library ```TensorFlow```, which we'll be looking at in more detail next week. 

For now, we're just using it to fetch the data!

In [None]:
(X_train, y_train), (X_test, y_test) = cifar10.load_data()

**Question:** What is the shape of the data?

In [None]:
X_train.shape

Unfortunately, this version of the dataset doesn't have explicit labels, so we need to create our own.

In [None]:
labels = ["airplane", 
          "automobile", 
          "bird", 
          "cat", 
          "deer", 
          "dog", 
          "frog", 
          "horse", 
          "ship", 
          "truck"]

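The integer class IDs in `y_train` index directly into this list. Here's a minimal sketch of that lookup, using a small made-up array `y_example` as a stand-in for the real `y_train` (which has shape `(50000, 1)`):

```python
import numpy as np

labels = ["airplane", "automobile", "bird", "cat", "deer",
          "dog", "frog", "horse", "ship", "truck"]

# hypothetical stand-in for y_train: one integer class ID per row
y_example = np.array([[6], [9], [3]])

# flatten the column vector, then look up each ID in the list
names = [labels[i] for i in y_example.ravel()]
print(names)
```
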
### Convert all the data to greyscale

In the following cell, I'm converting all of my images to greyscale and then making a ```numpy``` array at the end.

Notice that I'm using something funky here called *[list comprehensions](https://docs.python.org/3/tutorial/datastructures.html#list-comprehensions)*.

In [None]:
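# placeholder function and list so the examples below run
def do_this(x):
    return x * 2

y = [1, 2, 3]

# normal for loop
for x in y:
    do_this(x)

# list comprehension for loop
[do_this(x) for x in y]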

In [None]:
cols = ["red", "green", "blue"]
uppers = []

# normal for loop
for col in cols:
    upper = col.upper()
    uppers.append(upper)

# list comprehension
uppers = [col.upper() for col in cols]

In [None]:
# you could make a function that does everything you would put inside a loop and then put it in a list comprehension
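
To make that idea concrete, here's a small sketch. The helper `shout` is made up for illustration; it just bundles a few steps that would otherwise sit inside the loop body.

```python
def shout(word):
    """Hypothetical helper: strip whitespace, upper-case, add '!'."""
    cleaned = word.strip()
    return cleaned.upper() + "!"

cols = [" red", "green ", "blue"]

# the function does the work; the comprehension just applies it to each item
shouted = [shout(col) for col in cols]
print(shouted)
```
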

In [None]:
# the Keras loader returns images in RGB channel order, so we convert from RGB (not BGR)
X_train_grey = np.array([cv2.cvtColor(image, cv2.COLOR_RGB2GRAY) for image in X_train])
X_test_grey = np.array([cv2.cvtColor(image, cv2.COLOR_RGB2GRAY) for image in X_test])
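
Greyscale conversion is a weighted sum of the three colour channels. The sketch below reproduces the idea in plain `numpy` on a made-up 2x2 image, using the standard weights 0.299, 0.587 and 0.114 for red, green and blue. It's an illustration of the idea rather than OpenCV's exact implementation. Because the weights differ per channel, the channel order of the input matters.

```python
import numpy as np

# hypothetical 2x2 RGB image with values 0-255:
# pure red, pure green, pure blue and white
image = np.array([[[255, 0, 0], [0, 255, 0]],
                  [[0, 0, 255], [255, 255, 255]]], dtype=np.uint8)

# standard luminance weights for the R, G and B channels
weights = np.array([0.299, 0.587, 0.114])

# weighted sum over the last (channel) axis, rounded back to 8-bit integers
grey = (image @ weights).round().astype(np.uint8)

print(grey.shape)  # the channel axis is gone
print(grey)
```

Green contributes most to the result and blue the least, so swapping the red and blue channels would change the greyscale values.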

Then, we're going to do some simple scaling by dividing by 255.

In [None]:
X_train_scaled = (X_train_grey)/255.0
X_test_scaled = (X_test_grey)/255.0

Models tend to converge more quickly when the inputs are scaled down: the weights stay smaller, so suitable weights and biases are found sooner.
No information is lost; the pixel values simply range from 0 to 1 instead of 0 to 255.
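
As a quick sanity check, here's a tiny sketch with a made-up array `pixels` showing what the division does to the range and the data type:

```python
import numpy as np

# hypothetical 2x2 greyscale patch with 8-bit integer values
pixels = np.array([[0, 64], [128, 255]], dtype=np.uint8)

# dividing by a float converts the result to floats between 0 and 1
scaled = pixels / 255.0

print(scaled.min(), scaled.max(), scaled.dtype)
```

The ordering of the values is unchanged, so a brighter pixel is still brighter after scaling.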

### Reshaping the data

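Next, we're going to reshape this data.

Unpacking the shape as below is perhaps a slightly roundabout way to do it, but it makes the dimensions explicit. For the training data:

- `nsamples` is 50000
- `nx` is 32
- `ny` is 32

Each set of images is a 3D array (a cube), and we flatten it into a 2D array with shape `(nsamples, nx*ny)`, which is `(50000, 1024)` for the training data. Row `i` is then image `i`, and column `j` holds the value of pixel `j`.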

In [None]:
nsamples, nx, ny = X_train_scaled.shape
X_train_dataset = X_train_scaled.reshape((nsamples,nx*ny))

In [None]:
nsamples, nx, ny = X_test_scaled.shape
X_test_dataset = X_test_scaled.reshape((nsamples,nx*ny))
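
To see what the reshape does, here's a sketch on a made-up mini-dataset of four 2x3 "images". It also shows a shorter alternative, where `-1` asks `numpy` to work out the second dimension itself.

```python
import numpy as np

# hypothetical mini-dataset: 4 greyscale "images" of 2x3 pixels each
images = np.arange(24).reshape((4, 2, 3))

# same approach as above: unpack the shape, then flatten each image
nsamples, nx, ny = images.shape
flat = images.reshape((nsamples, nx * ny))

# shorter alternative: -1 lets numpy infer the remaining dimension
flat_auto = images.reshape((nsamples, -1))

print(flat.shape)  # one row per image
print(flat[1])     # all pixels of the second image, row by row
```
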

## Simple logistic regression classifier

We define our Logistic Regression classifier as we have done previously. You'll notice that I've set a lot of different parameters here - you can learn more in the documentation [here](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html).

In [None]:
clf = LogisticRegression(penalty="none", # no regularisation penalty (an L1 penalty would instead push small weights to exactly 0); recent scikit-learn versions expect penalty=None instead of the string
                        tol=0.1, # tolerance for the stopping criterion - training stops once the updates fall below this, so larger values stop earlier
                        verbose=True, # prints some updates while training
                        solver="saga", # the optimisation algorithm used to update the weights
                        multi_class="multinomial" # one multi-class model, not just a binary classifier
                        ).fit(X_train_dataset, y_train.ravel()) # ravel() turns the (n, 1) labels into the 1D array sklearn expects

The solver is the algorithm that solves the optimisation problem, which here means minimising the loss function.
An epoch is one full pass over all the data points.
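
To illustrate how a loss function, weight updates and epochs fit together, here's a rough sketch in plain `numpy`. Note the assumptions: it uses simple mini-batch gradient descent on a made-up binary problem, not the SAGA algorithm that `scikit-learn` runs above, and the data, learning rate and batch size are all invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# hypothetical toy data: 200 points, 2 features, binary labels
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)

w = np.zeros(2)   # weights
b = 0.0           # bias
lr, batch_size, n_epochs = 0.5, 20, 5

def log_loss(X, y, w, b):
    """Average logistic (cross-entropy) loss over the whole dataset."""
    p = 1 / (1 + np.exp(-(X @ w + b)))
    p = np.clip(p, 1e-12, 1 - 1e-12)  # avoid log(0)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

losses = [log_loss(X, y, w, b)]
for epoch in range(n_epochs):
    order = rng.permutation(len(X))   # visit the data in a new random order
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]
        p = 1 / (1 + np.exp(-(X[idx] @ w + b)))
        # gradient of the loss with respect to the weights and bias
        grad_w = X[idx].T @ (p - y[idx]) / len(idx)
        grad_b = np.mean(p - y[idx])
        w -= lr * grad_w
        b -= lr * grad_b
    # one epoch = every data point has been used once
    losses.append(log_loss(X, y, w, b))
    print(f"epoch {epoch + 1}: loss {losses[-1]:.3f}")
```
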

In [None]:
y_pred = clf.predict(X_test_dataset)

We can then print our classification report, using the label names that we defined earlier.

In [None]:
report = classification_report(y_test, 
                               y_pred, 
                               target_names=labels)
print(report)
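
For each class, the report lists precision, recall and F1-score. As a reminder of what those numbers mean, here's a sketch that computes them by hand on a made-up set of labels. The helper `scores` is hypothetical and skips edge cases such as a class that is never predicted.

```python
# hypothetical true and predicted labels for a small 3-class problem
y_true = ["cat", "cat", "dog", "dog", "dog", "ship"]
y_hat = ["cat", "dog", "dog", "dog", "cat", "ship"]

def scores(label):
    """Precision, recall and F1 for one class (no zero-division handling)."""
    pairs = list(zip(y_true, y_hat))
    tp = sum(t == label and p == label for t, p in pairs)  # correctly predicted as label
    fp = sum(t != label and p == label for t, p in pairs)  # wrongly predicted as label
    fn = sum(t == label and p != label for t, p in pairs)  # label that was missed
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

for label in ["cat", "dog", "ship"]:
    precision, recall, f1 = scores(label)
    print(f"{label}: precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```
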

## Neural network classifier

I've set a couple of different parameters here - you can see more in the [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html).

**NB!** This will take a long time to run! On the 32 CPU machine on UCloud, this takes around 30 seconds per iteration.

In [None]:
clf = MLPClassifier(random_state=42,
                    hidden_layer_sizes=(64, 10), # 64 nodes in the first hidden layer, 10 nodes in the second
                    learning_rate="adaptive", # learning rate schedule; only has an effect with solver="sgd", not the default "adam"
                    early_stopping=True, # hold out 10% of the training data and stop when the validation score stops improving
                    verbose=True, # print some updates while training
                    max_iter=20 # stop training after at most 20 iterations
                    ).fit(X_train_dataset, y_train.ravel()) # ravel() gives the 1D label array sklearn expects

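With `early_stopping=True`, the data plays three roles:

- Training data: used to compute the loss and update the weights to minimise it.
- Validation data (a random 10% of the training data): used to measure accuracy after each iteration and decide when to stop.
- Test data: used only at the end, for the final scores.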
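
Here's a rough sketch of how a 10% validation split can be carved out of the training data. It's a plain random split written in `numpy` for illustration, not the exact procedure `MLPClassifier` uses internally, and the names `val_idx` and `fit_idx` are made up.

```python
import numpy as np

rng = np.random.default_rng(42)

n_train = 50000                     # size of the CIFAR-10 training set
indices = rng.permutation(n_train)  # shuffled row indices

n_val = n_train // 10               # hold out 10% for validation
val_idx = indices[:n_val]           # rows used to monitor accuracy
fit_idx = indices[n_val:]           # rows used to update the weights

print(len(fit_idx), len(val_idx))
```

The test set never enters this split, which is what makes the final scores an honest estimate.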

In [None]:
y_pred = clf.predict(X_test_dataset)

Lastly, we can get our classification report as usual.

In [None]:
report = classification_report(y_test, 
                               y_pred, 
                               target_names=labels)
print(report)

## Tasks

Take the code outlined in this notebook and turn it into two separate Python scripts, one which performs Logistic Regression classification and one which uses the MLPClassifier on the ```Cifar10``` dataset.

Try to use the things we've spoken about in class:
- Requirements.txt
- Virtual environment
- Setup scripts
- Argparse
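
As a starting point for the `argparse` part, here's a minimal sketch. The argument names `--max_iter` and `--report_path` and their defaults are only suggestions, so pick whatever makes sense for your scripts.

```python
import argparse

def parse_args(argv=None):
    """Parse command-line arguments (argument names here are just examples)."""
    parser = argparse.ArgumentParser(
        description="Train a classifier on CIFAR-10")
    parser.add_argument("--max_iter", type=int, default=20,
                        help="maximum number of training iterations")
    parser.add_argument("--report_path", default="out/report.txt",
                        help="where to save the classification report")
    return parser.parse_args(argv)

# passing a list makes this runnable in a notebook;
# in a script, call parse_args() to read from the command line
args = parse_args(["--max_iter", "5"])
print(args.max_iter, args.report_path)
```
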

This task is [Assignment 2 for Visual Analytics](https://classroom.github.com/a/KLVvny7d).