# Session 6 - Benchmark classification on ```cifar-10```

This notebook builds on what we were doing last week with the handwritten digits from the MNIST dataset.

This week, we're working with another famous dataset in computer vision and image processing research - [cifar10](https://www.cs.toronto.edu/~kriz/cifar.html).

In [17]:
# path tools
import os
import cv2

# data loader
import numpy as np
from tensorflow.keras.datasets import cifar10

# machine learning tools
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# classificatio models
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier

We're going to load the data using a function from the library ```TensorFlow```, which we'll be looking at in more detail next week. 

For now, we're just using it to fetch the data!

In [7]:
(X_train, y_train), (X_test, y_test) = cifar10.load_data()
#return training data and training data labels, and test data and test data labelse. 

**Question:** What is the shape of the data?

In [8]:
type(X_train)

numpy.ndarray

In [9]:
X_train.shape
# its a four-dimentional numpy.array. The four numbers: 50000, 32, 32, 3 => 32x32 pixels, 3 colour channels, 50000 images of this sort. 

(50000, 32, 32, 3)

Unfortunately, this version of the data set doesn't have explict labels, so we need to create our own.

In [10]:
y_train
#the labels doesnøt have names, only a number refering to the names listed alphabetically. 

array([[6],
       [9],
       [9],
       ...,
       [9],
       [1],
       [1]], dtype=uint8)

In [11]:
labels = ["airplane", 
          "automobile", 
          "bird", 
          "cat", 
          "deer", 
          "dog", 
          "frog", 
          "horse", 
          "ship", 
          "truck"]

### Convert all the data to greyscale

In the following cell, I'm converting all of my images to greyscale and then making a ```numpy``` array at the end.

Notice that I'm using something funky here called *[list comprehensions](https://docs.python.org/3/tutorial/datastructures.html#list-comprehensions)*.

In [18]:
# turn images into greyscale using list comprehensions. 
X_train_grey = np.array([cv2.cvtColor(image, cv2.COLOR_BGR2GRAY) for image in X_train])
X_test_grey = np.array([cv2.cvtColor(image, cv2.COLOR_BGR2GRAY) for image in X_test])
#list comprehension = can do the same as a for loop, but in a shorter way (I think). 
# The list comprehension turns everything into a list, so you don't have to append something to an empty list as we often do in for loops. 
# sometimes list comprehensions decrease readibility compared to for loops, but here it increases readability (it becomes clear that the thing happening in the two lines are identical).

In [20]:
X_train_grey.shape # now we have a 3D image, because we have only one colour channel.

(50000, 32, 32)

Then, we're going to do some simple scaling by dividing by 255.

In [21]:
# Make the pixel values in each image smaller (between 0 and 1 instead of between 0 and 255)
# This helps models convert faster and get better results, because the weights and biases, the model has to learn are smaller values. 
X_train_scaled = (X_train_grey)/255.0
X_test_scaled = (X_test_grey)/255.0

### Reshaping the data

Next, we're going to reshape this data. 

In [25]:
# reshaping images in training data
nsamples, nx, ny = X_train_scaled.shape # the .shape returns the number of samples (5000) and the number of values on the x and y axes (32 and 32)
X_train_dataset = X_train_scaled.reshape((nsamples,nx*ny)) 
# reshaping: we want only two values: the number of samples (50000) and 32x32 (=1024) 
# --> the images are each 'flattened down' to one string of values

In [29]:
X_train_dataset.shape

(50000, 1024)

In [30]:
#reshaping images in test data
nsamples, nx, ny = X_test_scaled.shape
X_test_dataset = X_test_scaled.reshape((nsamples,nx*ny))

In [31]:
X_test_dataset.shape

(10000, 1024)

## Simple logistic regression classifier

We define our Logistic Regression classifier as we have done previously. You'll notice that I've set a lot of different parameters here - you can learn more in the documentation [here](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html).

In [32]:
clf = LogisticRegression(penalty="none",
                        tol=0.1, # if the model is not improving when the weights are changed by this value, it should stop (?)
                        verbose=True, # a flag
                        solver="saga", # 
                        multi_class="multinomial").fit(X_train_dataset, y_train)

  y = column_or_1d(y, warn=True)
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


Epoch 1, change: 1.00000000
Epoch 2, change: 0.22901270
Epoch 3, change: 0.12814616
convergence after 4 epochs took 10 seconds


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:   10.4s finished


In [None]:
y_pred = clf.predict(X_test_dataset)

We can then print our classification report, using the label names that we defined earlier.

In [None]:
report = classification_report(y_test, 
                               y_pred, 
                               target_names=labels)
print(report)

## Neural network classifier

I've set a couple of different parameters here - you can see more in the [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html).

**NB!** This will take a long time to run! On the 32 CPU machine on UCloud, this takes around 30 seconds per iteration.

In [None]:
clf = MLPClassifier(random_state=42,
                    hidden_layer_sizes=(64, 10),
                    learning_rate="adaptive",
                    early_stopping=True,
                    verbose=True,
                    max_iter=20).fit(X_train_dataset, y_train)

In [None]:
y_pred = clf.predict(X_test_dataset)

Lastly, we can get our classification report as usual.

In [None]:
report = classification_report(y_test, 
                               y_pred, 
                               target_names=labels)
print(report)

## Tasks

Take the code outlined in this notebook and turn it into two separate Python scripts, one which performs Logistic Regression classification and one which uses the MLPClassifier on the ```Cifar10``` dataset.

Try to use the things we've spoken about in clas
- Requirements.txt
- Virtual environment
- Setup scripts
- Argparse

This task is [Assignment 2 for Visual Analytics](https://classroom.github.com/a/KLVvny7d).