# Session 6 - Benchmark classification on ```cifar-10```

This notebook builds on what we were doing last week with the handwritten digits from the MNIST dataset.

This week, we're working with another famous dataset in computer vision and image processing research - [cifar10](https://www.cs.toronto.edu/~kriz/cifar.html).

## Notes
- The cifar-10 dataset consists of 60 000 colour images of 32x32 pixels, split evenly across 10 classes.
- The test batch contains 10 000 images: 1000 randomly selected images from each class.
- The training set contains the remaining 50 000 images (five batches of 10 000).


- TensorFlow is better suited than scikit-learn to building deep neural networks.
- Today we are only using it to load the dataset.


In [13]:
# path tools
import os
import numpy as np
import cv2

# data loader
from tensorflow.keras.datasets import cifar10

# machine learning tools
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# classification models
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier

We're going to load the data using a function from the library ```TensorFlow```, which we'll be looking at in more detail next week. 

For now, we're just using it to fetch the data!

In [14]:
(X_train, y_train), (X_test, y_test) = cifar10.load_data()
# load_data() returns two tuples: (training images, training labels) and (test images, test labels).
# The parentheses on the left unpack them into four separate arrays.
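
The nested unpacking can be tried out without downloading anything. The sketch below uses a hypothetical stand-in, `fake_load_data`, that returns two `(images, labels)` pairs in the same layout as `cifar10.load_data()`; the array sizes are made up for illustration.

```python
import numpy as np

def fake_load_data():
    """Hypothetical stand-in mimicking the return layout of cifar10.load_data()."""
    train = (np.zeros((5, 32, 32, 3), dtype=np.uint8), np.zeros((5, 1), dtype=np.uint8))
    test = (np.zeros((2, 32, 32, 3), dtype=np.uint8), np.zeros((2, 1), dtype=np.uint8))
    return train, test  # a tuple of two tuples

# The parentheses on the left mirror the nesting on the right,
# so each of the four arrays gets its own name in one statement.
(X_tr, y_tr), (X_te, y_te) = fake_load_data()

print(X_tr.shape, y_tr.shape)
print(X_te.shape, y_te.shape)
```

Note that the label arrays come back as column vectors of shape `(n, 1)`, not flat arrays.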

**Question:** What is the shape of the data?

In [15]:
X_train.shape
# 1st number = number of images
# 2nd number = image height in pixels (32)
# 3rd number = image width in pixels (32)
# 4th number = colour channels (3)

(50000, 32, 32, 3)

Unfortunately, this version of the data set doesn't have explicit label names, so we need to create our own.

In [16]:
# the current labels are just numbers from 0-9
labels = ["airplane", 
          "automobile", 
          "bird", 
          "cat", 
          "deer", 
          "dog", 
          "frog", 
          "horse", 
          "ship", 
          "truck"]

### Convert all the data to greyscale

In the following cell, I'm converting all of my images to greyscale and then making a ```numpy``` array at the end.

Notice that I'm using something funky here called *[list comprehensions](https://docs.python.org/3/tutorial/datastructures.html#list-comprehensions)*.

## Notes
- List comprehension = building a list with a for loop written as a single expression.
- Syntax of a for loop = __for x in y: do_this(x)__
- Syntax of a list comprehension = __[do_this(x) for x in y]__
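
The two forms in the notes above can be compared side by side. This is a small illustration with made-up values, not code from the notebook:

```python
values = [1, 2, 3, 4]

# Explicit for loop: build the list step by step
squares_loop = []
for x in values:
    squares_loop.append(x ** 2)

# List comprehension: same result in one expression
squares_comp = [x ** 2 for x in values]

print(squares_loop == squares_comp)  # True
```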

In [17]:
X_train_grey = np.array([cv2.cvtColor(image, cv2.COLOR_BGR2GRAY) for image in X_train])
X_test_grey = np.array([cv2.cvtColor(image, cv2.COLOR_BGR2GRAY) for image in X_test])
# Uses OpenCV's colour conversion on each image in turn.
# The list comprehension produces a list of greyscale images, which np.array turns into a single array.
# Note: cifar10.load_data() returns RGB images, so cv2.COLOR_RGB2GRAY is the matching flag;
# COLOR_BGR2GRAY swaps the weights given to red and blue.
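
To see why the channel order matters, the weighted sum behind the greyscale conversion can be written out by hand. This is a plain numpy sketch using the standard luminance weights (0.299, 0.587, 0.114 for red, green, blue), not a call into OpenCV, and the pixel is made up:

```python
import numpy as np

# One pure-red pixel stored in RGB order (made-up example data)
pixel_rgb = np.array([255.0, 0.0, 0.0])

# Luminance weights for (R, G, B)
weights_rgb = np.array([0.299, 0.587, 0.114])

grey_as_rgb = float(pixel_rgb @ weights_rgb)        # channels read as R, G, B
grey_as_bgr = float(pixel_rgb @ weights_rgb[::-1])  # channels read as B, G, R

print(round(grey_as_rgb, 2), round(grey_as_bgr, 2))
```

Reading the same pixel with the wrong channel order gives a noticeably darker grey, because red is then weighted as if it were blue.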

Then, we're going to do some simple scaling by dividing by 255.

In [19]:
X_train_scaled = (X_train_grey)/255.0
X_test_scaled = (X_test_grey)/255.0
# Divide every pixel value by 255 to scale it into the range 0-1.
# Smaller, similarly scaled inputs make the weights easier for the model to learn.
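
A quick check of what the division does to the data type and value range, using a handful of made-up pixel values in place of the full arrays:

```python
import numpy as np

# Made-up greyscale pixel values covering the full uint8 range
pixels = np.array([0, 64, 128, 255], dtype=np.uint8)

scaled = pixels / 255.0  # true division promotes uint8 to float64

print(scaled.dtype, scaled.min(), scaled.max())
```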

### Reshaping the data

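Next, we're going to reshape this data so that each image becomes a single row of 1024 values. The scikit-learn classifiers expect a 2D array of shape (samples, features).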

In [20]:
# Use numpy's reshape to flatten each image.
# X_train_scaled.shape holds three numbers:
# nsamples = 50 000 (number of images)
# nx and ny = 32 (height and width)
nsamples, nx, ny = X_train_scaled.shape
# The new shape has two values: 50 000 and 1024 (32 times 32)
X_train_dataset = X_train_scaled.reshape((nsamples,nx*ny))

# This has the same effect as calling numpy's flatten on each image separately

In [21]:
nsamples, nx, ny = X_test_scaled.shape
X_test_dataset = X_test_scaled.reshape((nsamples,nx*ny))
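
The reshape can be checked on a tiny made-up array. The sketch below also shows `-1`, which asks numpy to work out the second dimension itself; that variant is an alternative, not what the notebook uses.

```python
import numpy as np

# Made-up stack of 4 "images" of 3x2 pixels
images = np.arange(24).reshape((4, 3, 2))

n_samples, nx, ny = images.shape
flat = images.reshape((n_samples, nx * ny))

# -1 asks numpy to infer the remaining dimension
flat_inferred = images.reshape((n_samples, -1))

print(flat.shape)
print(np.array_equal(flat[0], images[0].flatten()))
```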

## Simple logistic regression classifier

We define our Logistic Regression classifier as we have done previously. You'll notice that I've set a lot of different parameters here - you can learn more in the documentation [here](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html).

In [22]:
clf = LogisticRegression(penalty="none", # no regularisation: the weights are not penalised (newer scikit-learn versions expect penalty=None instead)
                        # with penalty="l1", very small weights are pushed to zero, keeping only the meaningful ones
                        tol=0.1, # stopping tolerance: training stops once the change in the weights between epochs falls below this value
                        verbose=True, # print progress while the model is fitting
                        solver="saga", # optimiser that supports the multinomial loss and scales to large datasets
                        multi_class="multinomial").fit(X_train_dataset, y_train) # one softmax model over all 10 classes rather than one-vs-rest

  y = column_or_1d(y, warn=True)
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


Epoch 1, change: 1.00000000
Epoch 2, change: 0.28085799
Epoch 3, change: 0.13340751
convergence after 4 epochs took 10 seconds


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:   10.2s finished
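
The `column_or_1d` line at the top of this output appears to be the tail of a scikit-learn warning, which is raised when the target is passed as a column vector of shape `(n, 1)` rather than a flat array. A minimal sketch of how the labels could be flattened before fitting, using a made-up array in place of `y_train`:

```python
import numpy as np

# Made-up labels in the (n, 1) layout returned by the Keras loader
y_column = np.array([[3], [8], [0], [6]], dtype=np.uint8)

# ravel() returns a flat array of shape (n,), which is what scikit-learn expects
y_flat = y_column.ravel()

print(y_column.shape, "->", y_flat.shape)
```

Passing `y_train.ravel()` to `.fit()` in the same way should avoid the warning without changing the labels themselves.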


In [23]:
y_pred = clf.predict(X_test_dataset)

We can then print our classification report, using the label names that we defined earlier.

## Notes 
- Use a logistic regression to set a benchmark that more complex models can be compared against.

In [24]:
report = classification_report(y_test, 
                               y_pred, 
                               target_names=labels) # show the class names instead of the numbers 0-9
print(report)

              precision    recall  f1-score   support

    airplane       0.37      0.35      0.36      1000
  automobile       0.38      0.33      0.35      1000
        bird       0.28      0.18      0.22      1000
         cat       0.23      0.14      0.18      1000
        deer       0.25      0.19      0.22      1000
         dog       0.29      0.31      0.30      1000
        frog       0.29      0.32      0.30      1000
       horse       0.32      0.31      0.32      1000
        ship       0.31      0.49      0.38      1000
       truck       0.35      0.51      0.42      1000

    accuracy                           0.31     10000
   macro avg       0.31      0.31      0.30     10000
weighted avg       0.31      0.31      0.30     10000
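
To make the columns of the report concrete, precision, recall and F1 can be counted by hand for one class. This is a sketch on a made-up two-class toy problem; `scores_for` is a hypothetical helper, not part of scikit-learn.

```python
# Made-up true and predicted labels for a two-class toy problem
y_true = ["cat", "cat", "cat", "dog", "dog", "dog"]
y_hat = ["cat", "cat", "dog", "cat", "dog", "dog"]

def scores_for(cls, y_true, y_hat):
    """Precision, recall and F1 for one class, counted by hand."""
    tp = sum(t == cls and p == cls for t, p in zip(y_true, y_hat))  # correctly predicted as cls
    fp = sum(t != cls and p == cls for t, p in zip(y_true, y_hat))  # wrongly predicted as cls
    fn = sum(t == cls and p != cls for t, p in zip(y_true, y_hat))  # cls missed by the model
    precision = tp / (tp + fp)  # assumes the class was predicted at least once
    recall = tp / (tp + fn)     # assumes the class occurs at least once
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

precision, recall, f1 = scores_for("cat", y_true, y_hat)
print(round(precision, 2), round(recall, 2), round(f1, 2))
```

Read this way, a class such as `ship` in the report above, with recall well above precision, is one the model predicts often: it finds many of the real ships but also labels many other images as ships.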



## Neural network classifier

I've set a couple of different parameters here - you can see more in the [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html).

**NB!** This will take a long time to run! On the 32-CPU machine on UCloud, this takes around 30 seconds per iteration.

In [28]:
clf = MLPClassifier(random_state=42, 
                    hidden_layer_sizes=(100, 10), # two hidden layers with 100 and 10 neurons; the weights between layers are updated by backpropagation
                    learning_rate="adaptive", # keep the learning rate constant while the loss keeps falling and reduce it when progress stalls (only takes effect with solver="sgd")
                    early_stopping=True, # hold out part of the training data and stop early if the validation score stops improving; the threshold is set by tol
                    verbose=True, # print the loss and validation score for each iteration
                    max_iter=20).fit(X_train_dataset, y_train) # train for at most 20 iterations

  y = column_or_1d(y, warn=True)


Iteration 1, loss = 2.18081081
Validation score: 0.230200
Iteration 2, loss = 2.03512719
Validation score: 0.285000
Iteration 3, loss = 1.96959989
Validation score: 0.277400
Iteration 4, loss = 1.93144341
Validation score: 0.313000
Iteration 5, loss = 1.89991063
Validation score: 0.318000
Iteration 6, loss = 1.86922342
Validation score: 0.334000
Iteration 7, loss = 1.84902258
Validation score: 0.340000
Iteration 8, loss = 1.82470768
Validation score: 0.340600
Iteration 9, loss = 1.80246310
Validation score: 0.347600
Iteration 10, loss = 1.78740588
Validation score: 0.353000
Iteration 11, loss = 1.77777913
Validation score: 0.353200
Iteration 12, loss = 1.75619007
Validation score: 0.360800
Iteration 13, loss = 1.74167545
Validation score: 0.368000
Iteration 14, loss = 1.73132602
Validation score: 0.373400
Iteration 15, loss = 1.71718153
Validation score: 0.372400
Iteration 16, loss = 1.70771119
Validation score: 0.374600
Iteration 17, loss = 1.69674908
Validation score: 0.381800
Iterat



## Notes 
- Iteration x, loss = number: the loss should go down with each iteration, which shows the model is improving on the training data.
- Validation score = accuracy on a held-out part of the training data for which the correct answers are known.
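
For a sense of scale on the loss values: the loss printed for a classifier like this is essentially the cross-entropy (log loss). A model that gives every one of the 10 classes the same probability would score ln(10), so that is a useful baseline to compare the first iterations against. A small check, assuming uniform guessing:

```python
import math

n_classes = 10

# Uniform guess: probability 1/10 for the true class on every sample
uniform_prob = 1 / n_classes
chance_loss = -math.log(uniform_prob)

print(round(chance_loss, 4))  # 2.3026
```

The loss of 2.18 after the first iteration is therefore only slightly better than guessing, and the later iterations move steadily away from that baseline.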

In [29]:
y_pred = clf.predict(X_test_dataset)

Lastly, we can get our classification report as usual.

In [30]:
report = classification_report(y_test, 
                               y_pred, 
                               target_names=labels)
print(report)

              precision    recall  f1-score   support

    airplane       0.49      0.30      0.37      1000
  automobile       0.46      0.50      0.48      1000
        bird       0.30      0.23      0.26      1000
         cat       0.28      0.20      0.23      1000
        deer       0.31      0.32      0.31      1000
         dog       0.37      0.36      0.36      1000
        frog       0.35      0.48      0.40      1000
       horse       0.44      0.46      0.45      1000
        ship       0.42      0.60      0.49      1000
       truck       0.45      0.44      0.45      1000

    accuracy                           0.39     10000
   macro avg       0.39      0.39      0.38     10000
weighted avg       0.39      0.39      0.38     10000



## Tasks

Take the code outlined in this notebook and turn it into two separate Python scripts, one which performs Logistic Regression classification and one which uses the MLPClassifier on the ```Cifar10``` dataset.

Try to use the things we've spoken about in class:
- Requirements.txt
- Virtual environment
- Setup scripts
- Argparse
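
As a starting point for the argparse part, a script could read its settings from the command line roughly like this. It is only a sketch: the flag names `--max-iter` and `--report-path` and their defaults are hypothetical, and the scripts are free to expose different options.

```python
import argparse

def parse_args(argv=None):
    """Parse command-line options; all flag names here are hypothetical."""
    parser = argparse.ArgumentParser(description="Benchmark classifier on CIFAR-10")
    parser.add_argument("--max-iter", type=int, default=20,
                        help="maximum number of training iterations")
    parser.add_argument("--report-path", default="out/report.txt",
                        help="where to save the classification report")
    return parser.parse_args(argv)

# Passing a list makes the parser easy to try outside the command line
args = parse_args(["--max-iter", "5"])
print(args.max_iter, args.report_path)
```

In a real script, calling `parse_args()` with no argument reads the options from `sys.argv` instead.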

This task is [Assignment 2 for Visual Analytics](https://classroom.github.com/a/KLVvny7d).