# Assignment 2 - Classification benchmarks with Logistic Regression and Neural Networks

For this assignment, we'll be writing scripts which classify the ```Cifar10``` dataset. You can read more about this dataset [here](https://www.cs.toronto.edu/~kriz/cifar.html).

You should write code which does the following:

- Load the Cifar10 dataset
- Preprocess the data (e.g. greyscale, normalize, reshape)
- Train a classifier on the data
    - A logistic regression classifier *and* a neural network classifier
- Save a classification report
- Save a plot of the loss curve during training

You should write **two scripts** for this assignment: one which does this for a logistic regression classifier **and** one which does it for a neural network classifier. In both cases, you should use the machine learning tools available via ```scikit-learn``` to evaluate model performance.
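
For the loss curve, one option is a minimal sketch like the following. It assumes the neural network is built with ```MLPClassifier``` from ```scikit-learn```, which records the training loss in its ```loss_curve_``` attribute. The random data, the layer size, the iteration count, and the file name ```loss_curve_nn.png``` are illustrative assumptions, not requirements of the assignment.

```python
from pathlib import Path

import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs without a display
import matplotlib.pyplot as plt
import numpy as np
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in for the flattened, scaled images (assumption: 200 samples, 1024 features)
rng = np.random.default_rng(42)
X_demo = rng.random((200, 1024))
y_demo = rng.integers(0, 10, size=200)

# hidden_layer_sizes and max_iter are illustrative values, not tuned settings
nn = MLPClassifier(hidden_layer_sizes=(64,), max_iter=20, random_state=42)
nn.fit(X_demo, y_demo)

# loss_curve_ holds the training loss recorded after each iteration
out_dir = Path("out")
out_dir.mkdir(parents=True, exist_ok=True)
plt.plot(nn.loss_curve_)
plt.title("Loss curve during training")
plt.xlabel("Iteration")
plt.ylabel("Loss")
plt.savefig(out_dir / "loss_curve_nn.png")
plt.close()
```

With only 20 iterations on random data the model will likely stop before converging, so expect a convergence warning here; on the real data you would raise ```max_iter```.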

## Starter code

The data already has a train/test split and can be loaded in the following way:

```python
from tensorflow.keras.datasets import cifar10
(X_train, y_train), (X_test, y_test) = cifar10.load_data()
```

Once you have loaded the data, converted it to greyscale, and scaled the values, you will need to *reshape* the array to the correct dimensions, essentially flattening each 32x32 image into a row of 1024 values, like we saw with greyscale histograms.

You can do that in the following way using ```numpy```:

```python
# reshape returns a new array, so assign the result
X_train_scaled = X_train_scaled.reshape(-1, 1024)
X_test_scaled = X_test_scaled.reshape(-1, 1024)
```
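
To see what the reshape does, here is a small sketch on made-up data. The array ```X_demo_scaled``` is a random stand-in for five scaled greyscale images, not part of the dataset.

```python
import numpy as np

# Stand-in for 5 scaled greyscale images of 32x32 pixels (random values, for illustration only)
X_demo_scaled = np.random.default_rng(0).random((5, 32, 32))

# -1 tells numpy to infer the number of rows; 1024 = 32 * 32 pixels per image
X_demo_flat = X_demo_scaled.reshape(-1, 1024)

print(X_demo_scaled.shape)  # (5, 32, 32): reshape does not modify the original array
print(X_demo_flat.shape)    # (5, 1024)
```

Each row of the flattened array is one image, which is the layout ```scikit-learn``` classifiers expect: one sample per row, one feature per column.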

## Tips

- The Cifar10 dataset you download does not have explicit label names but instead has numbers from 0-9. You'll need to make a list of labels based on the object names - you can find these on the website.
- You should structure your project by having scripts saved in a folder called ```src```, and have a folder called ```out``` where you save the classification reports.
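
Both tips can be sketched with the standard library alone. The helper names ```to_names``` and ```report_path``` are hypothetical, and the sketch assumes the ```out``` folder sits in the directory you run the script from; adjust the path if you run from inside ```src```.

```python
from pathlib import Path

# Class names in index order, as listed on the Cifar10 website
label_names = ["airplane", "automobile", "bird", "cat", "deer",
               "dog", "frog", "horse", "ship", "truck"]


def to_names(numeric_labels):
    """Translate numeric labels (0-9) into their class names."""
    return [label_names[int(label)] for label in numeric_labels]


def report_path(filename, out_dir="out"):
    """Return a path inside the output folder, creating the folder if needed."""
    folder = Path(out_dir)
    folder.mkdir(parents=True, exist_ok=True)
    return folder / filename


print(to_names([0, 3, 9]))  # ['airplane', 'cat', 'truck']
print(report_path("classification_report_LR.txt"))
```

Creating the folder in code means the script does not fail on a fresh clone where ```out``` does not exist yet.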

## Purpose

- To ensure that you can use ```scikit-learn``` to build simple benchmark classifiers on image classification data
- To demonstrate that you can build reproducible pipelines for machine learning projects
- To make sure that you can structure repos appropriately


## Import ```modules```

```python
import os
import sys
sys.path.append("../../..")  # go three directory levels up to find the utils package

import numpy as np
import cv2
import matplotlib.pyplot as plt  # for loss curve - delete for logreg

import utils.classifier_utils as clf_util

# Import sklearn metrics
from sklearn import metrics
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Data
from tensorflow.keras.datasets import cifar10
```

## ```process``` function

```python
# Convert each image to greyscale and scale the pixel values to [0, 1], then
# *reshape* the array to the correct dimensions, essentially flattening each
# 2d image like we saw with greyscale histograms.

def process(X_train, X_test):
    # initialize empty lists
    X_train_results = []
    X_test_results = []
    # preprocessing
    for image in X_train:
        # Keras loads Cifar10 in RGB channel order, so convert with RGB2GRAY
        greyed_X_train = cv2.cvtColor(image, cv2.COLOR_RGB2GRAY)
        scaled_X_train = greyed_X_train/255.0
        X_train_results.append(scaled_X_train)
    reshaped_X_train = np.array(X_train_results).reshape(-1, 1024)

    for image in X_test:
        greyed_X_test = cv2.cvtColor(image, cv2.COLOR_RGB2GRAY)
        scaled_X_test = greyed_X_test/255.0
        X_test_results.append(scaled_X_test)
    reshaped_X_test = np.array(X_test_results).reshape(-1, 1024)

    return reshaped_X_train, reshaped_X_test
```
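
The same three steps (greyscale, scale, flatten) can also be written without a loop. The following is a sketch of an alternative using only ```numpy```, not a drop-in copy of the function above: ```process_split``` is a hypothetical helper that handles one split at a time, and it applies the standard luminance weights directly. Its values can differ slightly from the ```cv2``` route, because ```cv2``` rounds the greyscale image to whole numbers before scaling.

```python
import numpy as np

# Luminance weights for the R, G and B channels (the weighting cv2 uses for RGB2GRAY)
RGB_WEIGHTS = np.array([0.299, 0.587, 0.114])


def process_split(images):
    """Greyscale, scale to [0, 1] and flatten a batch of RGB images (hypothetical helper)."""
    grey = images.astype("float64") @ RGB_WEIGHTS   # (n, 32, 32, 3) -> (n, 32, 32)
    scaled = grey / 255.0
    return scaled.reshape(len(images), -1)          # (n, 32, 32) -> (n, 1024)


# Random stand-in for a batch of 32x32 RGB images
demo_images = np.random.default_rng(1).integers(0, 256, size=(4, 32, 32, 3), dtype=np.uint8)
demo_flat = process_split(demo_images)
print(demo_flat.shape)  # (4, 1024)
```

Calling one helper for each split also removes the duplicated train and test loops.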



## Train the classifier

```python
# Train a classifier and save the classification report
# Note: this function does not yet save a plot of the loss curve
from joblib import dump, load


def train_classifier_LR(reshaped_X_train, y_train, reshaped_X_test, y_test):
    # logistic regression
    # y_train has shape (n, 1); ravel() flattens it to (n,) to avoid a DataConversionWarning
    clf = LogisticRegression(tol=0.1,
                             solver='saga',
                             multi_class='multinomial').fit(reshaped_X_train, y_train.ravel())

    # calculate predictions for the scaled test data
    y_pred = clf.predict(reshaped_X_test)

    # get label names
    label_names = ["airplane",
                   "automobile",
                   "bird",
                   "cat",
                   "deer",
                   "dog",
                   "frog",
                   "horse",
                   "ship",
                   "truck"]
    # report
    LR_report = metrics.classification_report(y_test, y_pred, target_names=label_names)

    # save report
    with open('../out/classification_report_LR.txt', 'w') as f:  # open in write mode
        f.write(LR_report)  # write the report into the txt file
```
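
The Cifar10 labels arrive as a column vector, which is why ```scikit-learn``` prints a ```column_or_1d``` warning when they are passed to ```fit``` unchanged. The sketch below shows the shape issue on made-up data and, as an assumption about what "reproducible" should mean here, sets ```random_state``` so that the shuffling done by the ```saga``` solver is the same on every run. The data and the value 42 are illustrative only.

```python
import numpy as np
from sklearn import metrics
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_demo = rng.random((120, 16))               # random features, for illustration only
y_demo = rng.integers(0, 3, size=(120, 1))   # column vector, like the Cifar10 labels

print(y_demo.shape)          # (120, 1)
print(y_demo.ravel().shape)  # (120,)

# random_state fixes the data shuffling done by the saga solver
clf = LogisticRegression(tol=0.1, solver="saga", random_state=42)
clf.fit(X_demo, y_demo.ravel())

y_pred = clf.predict(X_demo)
report = metrics.classification_report(y_demo.ravel(), y_pred, zero_division=0)
print(report)
```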

                         

```python
# main: load the data, preprocess it, then train the classifier and save the report

def main():
    # load data
    (X_train, y_train), (X_test, y_test) = cifar10.load_data()
    # process
    X_train_scaled, X_test_scaled = process(X_train, X_test)
    # train classifier and make report
    train_classifier_LR(X_train_scaled, y_train, X_test_scaled, y_test)

if __name__ == "__main__":  # run main() only when the script is executed from the command line
    main()
```





```python
import logistic_class

logistic_class.main()
```

```
2024-03-14 10:34:01.677001: I external/local_tsl/tsl/cuda/cudart_stub.cc:32] Could not find cuda drivers on your machine, GPU will not be used.
2024-03-14 10:34:01.681527: I external/local_tsl/tsl/cuda/cudart_stub.cc:32] Could not find cuda drivers on your machine, GPU will not be used.
2024-03-14 10:34:01.738430: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
  y = column_or_1d(y, warn=True)
```
