# Session 6 - Benchmark classification on ```cifar-10```

This notebook builds on what we were doing last week with the handwritten digits from the MNIST dataset.

This week, we're working with another famous dataset in computer vision and image processing research - [cifar10](https://www.cs.toronto.edu/~kriz/cifar.html).

In [16]:
# path tools
import os
import numpy as np
import cv2
# data loader
from tensorflow.keras.datasets import cifar10

# machine learning tools
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# classificatio models
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier

We're going to load the data using a function from the library ```TensorFlow```, which we'll be looking at in more detail next week. 

For now, we're just using it to fetch the data!

In [4]:
(X_train, y_train), (X_test, y_test) = cifar10.load_data() # cifar10 method "load_data" its returning 4 object. training data and label + test data and label
                                                           # gives touple 

Downloading data from https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz


**Question:** What is the shape of the data?

In [5]:
X_train.shape # shape = 4 numbers = 4 dimensions
              # 50000 = amount of images 
              # 32 x 32 pixel intensity
              # 3 = the 3 colour channels

(50000, 32, 32, 3)

Unfortunately, this version of the data set doesn't have explict labels, so we need to create our own.

In [6]:
labels = ["airplane", 
          "automobile", 
          "bird", 
          "cat", 
          "deer", 
          "dog", 
          "frog", 
          "horse", 
          "ship", 
          "truck"]

### Convert all the data to greyscale

In the following cell, I'm converting all of my images to greyscale and then making a ```numpy``` array at the end.

Notice that I'm using something funky here called *[list comprehensions](https://docs.python.org/3/tutorial/datastructures.html#list-comprehensions)*.

In [17]:
X_train_grey = np.array([cv2.cvtColor(image, cv2.COLOR_BGR2GRAY) for image in X_train]) #list comprehensions = 
X_test_grey = np.array([cv2.cvtColor(image, cv2.COLOR_BGR2GRAY) for image in X_test])

In [None]:
X_train_grey

In [None]:
# list comprehension

# for loop example

# for x in y:
    # do_this(x)

#list comprehension example

#[do_this(x) for x in y]

#example with a list of colours
  
#colours = ["red", "gren", "blue"]
#uppers = []

#for colour in colours:
    #upper = colour.upper()
    #uppers.append(upper)


#uppers = [colour.upper() for colour in colours]

Then, we're going to do some simple scaling by dividing by 255.

In [19]:
X_train_scaled = (X_train_grey)/255.0 # pixel values scaled down between 0 and 1
X_test_scaled = (X_test_grey)/255.0 # just compressing the pixel values. You can recreate it by multiplying it again

### Reshaping the data

Next, we're going to reshape this data. 

In [20]:
# n samples is 50000, nx = number of values on y and y 32. then we reshape the x_train_scaled into a 1-dimensional array. 
# by telling numpy that it shall time nx by ny in order to flatten
nsamples, nx, ny = X_train_scaled.shape 
X_train_dataset = X_train_scaled.reshape((nsamples,nx*ny))

In [21]:
nsamples, nx, ny = X_test_scaled.shape
X_test_dataset = X_test_scaled.reshape((nsamples,nx*ny))

In [22]:
#checking that the training data has been flattened to 1 dimension
X_train_dataset.shape

(50000, 1024)

## Simple logistic regression classifier

logistic regression classifier is a good initial approach to scratch the surface. But for more in depth machine learning and prediction use Neural networks

We define our Logistic Regression classifier as we have done previously. You'll notice that I've set a lot of different parameters here - you can learn more in the documentation [here](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html).

In [25]:
clf = LogisticRegression(penalty="none", # force our model to make very small values be set to zero. tool for only keeping the most meaningful weights in a model. if set to L1 or L2 penalties
                        tol=0.1, # tolerance = this is by how much weights should be changig when our model improves every time. if it doesnt improve to satisfy the tolerance, the model stops. If the weights arent modified by more than 0.1 
                        verbose=True, # if verbose set to true it gives a rolling update of how the model is performing
                        solver="saga", 
                        multi_class="multinomial").fit(X_train_dataset, y_train) # its a multiclass problem

  y = column_or_1d(y, warn=True)
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


Epoch 1, change: 1.00000000
Epoch 2, change: 0.22068667
Epoch 3, change: 0.15329279
Epoch 4, change: 0.10016197
convergence after 5 epochs took 13 seconds


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:   12.6s finished


In [26]:
y_pred = clf.predict(X_test_dataset)

We can then print our classification report, using the label names that we defined earlier.

In [27]:
report = classification_report(y_test, 
                               y_pred, 
                               target_names=labels) # using labels assigned earlier 
print(report)

              precision    recall  f1-score   support

    airplane       0.35      0.36      0.35      1000
  automobile       0.37      0.40      0.38      1000
        bird       0.25      0.26      0.25      1000
         cat       0.23      0.14      0.17      1000
        deer       0.26      0.16      0.19      1000
         dog       0.27      0.35      0.31      1000
        frog       0.29      0.30      0.30      1000
       horse       0.31      0.32      0.32      1000
        ship       0.35      0.40      0.37      1000
       truck       0.39      0.44      0.41      1000

    accuracy                           0.31     10000
   macro avg       0.31      0.31      0.31     10000
weighted avg       0.31      0.31      0.31     10000



## Neural network classifier

I've set a couple of different parameters here - you can see more in the [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html).

**NB!** This will take a long time to run! On the 32 CPU machine on UCloud, this takes around 30 seconds per iteration.

In [31]:
clf = MLPClassifier(random_state=42,
                    hidden_layer_sizes=(100, 10),
                    learning_rate="adaptive", #when its gotten a rough idea of where the weights should be. Start learning, but when you achieve something, slow down and finetune. Learn at different speeds
                    early_stopping=True, 
                    verbose=True,
                    max_iter=20).fit(X_train_dataset, y_train) # only running for a maximum of 20 iterations

  y = column_or_1d(y, warn=True)


Iteration 1, loss = 2.18081081
Validation score: 0.230200
Iteration 2, loss = 2.03512719
Validation score: 0.285000
Iteration 3, loss = 1.96959989
Validation score: 0.277400
Iteration 4, loss = 1.93144341
Validation score: 0.313000
Iteration 5, loss = 1.89991063
Validation score: 0.318000
Iteration 6, loss = 1.86922342
Validation score: 0.334000
Iteration 7, loss = 1.84902258
Validation score: 0.340000
Iteration 8, loss = 1.82470768
Validation score: 0.340600
Iteration 9, loss = 1.80246310
Validation score: 0.347600
Iteration 10, loss = 1.78740588
Validation score: 0.353000
Iteration 11, loss = 1.77777913
Validation score: 0.353200
Iteration 12, loss = 1.75619007
Validation score: 0.360800
Iteration 13, loss = 1.74167545
Validation score: 0.368000
Iteration 14, loss = 1.73132602
Validation score: 0.373400
Iteration 15, loss = 1.71718153
Validation score: 0.372400
Iteration 16, loss = 1.70771119
Validation score: 0.374600
Iteration 17, loss = 1.69674908
Validation score: 0.381800
Iterat



In [None]:
# validation should be going up in percentage == loss score should be going down pr. iteration
# the model is only using a small portion of the data to train
# during training we minimize loss value and gain an accuracy score on the validation score.
# this is what happens for 20 iterations
# then we test it on data it has never seen before

In [32]:
y_pred = clf.predict(X_test_dataset)

Lastly, we can get our classification report as usual.

In [33]:
report = classification_report(y_test, 
                               y_pred, 
                               target_names=labels)
print(report)

              precision    recall  f1-score   support

    airplane       0.49      0.30      0.37      1000
  automobile       0.46      0.50      0.48      1000
        bird       0.30      0.23      0.26      1000
         cat       0.28      0.20      0.23      1000
        deer       0.31      0.32      0.31      1000
         dog       0.37      0.36      0.36      1000
        frog       0.35      0.48      0.40      1000
       horse       0.44      0.46      0.45      1000
        ship       0.42      0.60      0.49      1000
       truck       0.45      0.44      0.45      1000

    accuracy                           0.39     10000
   macro avg       0.39      0.39      0.38     10000
weighted avg       0.39      0.39      0.38     10000



In [None]:
# pip install pipreqs
# optimizing the packages 

## Tasks

Take the code outlined in this notebook and turn it into two separate Python scripts, one which performs Logistic Regression classification and one which uses the MLPClassifier on the ```Cifar10``` dataset.

Try to use the things we've spoken about in clas
- Requirements.txt
- Virtual environment
- Setup scripts
- Argparse

This task is [Assignment 2 for Visual Analytics](https://classroom.github.com/a/KLVvny7d).