## Assignment 4

Classifier benchmarks using Logistic Regression and a Neural Network



This assignment builds on the work we did in class and in session 6.



You'll use your new knowledge and skills to create two command-line tools which can be used to perform a simple classification task on the MNIST data and print the output to the terminal. These scripts can then be used to provide easy-to-understand benchmark scores for evaluating these models.



You should create two Python scripts. One takes the full MNIST dataset, trains a Logistic Regression classifier, and prints the evaluation metrics to the terminal. The other takes the full MNIST dataset, trains a neural network classifier, and prints the evaluation metrics to the terminal.





Tips

I suggest using scikit-learn for the Logistic Regression Classifier
In class, we only looked at a small sample of MNIST data. I suggest using fetch_openml() to get the full dataset, like we did in session 6
You can use the NeuralNetwork() class that I introduced you to during the code along session
I recommend saving your .py scripts in a folder called src, and keeping your NeuralNetwork class in a folder called utils, like we have on worker02
You may need to do some data manipulation to get the MNIST data into a usable format for your models
If you have trouble doing this on your own machine, use worker02!
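
The data manipulation mentioned in the tips usually comes down to two steps: scaling the pixel values and, for a from-scratch neural network, turning the labels into one-hot vectors. Below is a minimal sketch using only NumPy. The helper names `scale_pixels` and `one_hot` and the toy arrays are made up for illustration, and whether the NeuralNetwork class from class expects one-hot labels is an assumption you should check against its code. scikit-learn's `LabelBinarizer` produces the same kind of encoding.

```python
import numpy as np


def scale_pixels(X):
    """Map 8-bit pixel intensities (0-255) to floats in [0, 1]."""
    return np.asarray(X, dtype="float64") / 255.0


def one_hot(labels, classes):
    """Return an (n_samples, n_classes) 0/1 matrix with one column per class."""
    labels = np.asarray(labels)
    classes = np.asarray(classes)
    # Broadcasting compares every label against every class name
    return (labels[:, None] == classes[None, :]).astype(int)


# Tiny stand-in for MNIST: 3 "images" of 4 pixels, labels stored as strings
X_demo = np.array([[0, 255, 128, 64],
                   [255, 0, 0, 0],
                   [10, 20, 30, 40]])
y_demo = np.array(["3", "0", "3"])
classes_demo = sorted(set(y_demo))

X_scaled = scale_pixels(X_demo)
y_encoded = one_hot(y_demo, classes_demo)
print(X_scaled.min(), X_scaled.max())
print(y_encoded)
```

Logistic Regression in scikit-learn accepts the string labels directly, so the one-hot step is only relevant for the neural network script.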


Bonus Challenges

Have the scripts save the classifier reports in a folder called out, as well as printing them to screen. The user should be able to define the file name as a command line argument (easier)
Allow users to define the number and size of the hidden layers using command line arguments (intermediate)
Allow the user to define Logistic Regression parameters using command line arguments (intermediate)
Add an additional step where you import some unseen image, process it, and use the trained model to predict its value - like we did in session 6 (intermediate)
Add a new method to the Neural Network class which will allow you to save your trained model for future use (advanced)
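
The first two bonus challenges can be approached with `argparse` from the standard library. The sketch below is one possible layout, not a required one: the flag names `--outfile` and `--hidden-layers`, their defaults, and the helper names `parse_args` and `save_report` are all assumptions chosen for illustration. The report text is a placeholder where the real script would use the output of `classification_report`.

```python
import argparse
import os
import tempfile


def parse_args(argv=None):
    """Parse command-line options (flag names here are illustrative)."""
    parser = argparse.ArgumentParser(
        description="Train a classifier on MNIST and report metrics.")
    parser.add_argument("--outfile", default="classification_report.txt",
                        help="file name for the saved report")
    parser.add_argument("--hidden-layers", type=int, nargs="+",
                        default=[32, 16],
                        help="size of each hidden layer, e.g. 64 32")
    return parser.parse_args(argv)


def save_report(report, filename, out_dir="out"):
    """Write the report text to out_dir/filename and return the path."""
    os.makedirs(out_dir, exist_ok=True)
    path = os.path.join(out_dir, filename)
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(report)
    return path


# Demo with an explicit argument list; a real script would call
# parse_args() with no arguments so that sys.argv is used instead.
args = parse_args(["--outfile", "nn_report.txt", "--hidden-layers", "64", "32"])
report = "placeholder report text\n"
with tempfile.TemporaryDirectory() as tmp:
    saved_path = save_report(report, args.outfile,
                             out_dir=os.path.join(tmp, "out"))
    print(args.hidden_layers, os.path.basename(saved_path))
```

With this layout, a call such as `python nn-mnist.py --outfile nn_report.txt --hidden-layers 64 32` would pass a list of two layer sizes to whatever builds the network.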


General instructions

Save your scripts as lr-mnist.py and nn-mnist.py
If you have external dependencies, you must include a requirements.txt
You can either upload the script here or push to GitHub and include a link - or both!
Your code should be clearly documented in a way that allows others to easily follow along
Similarly, remember to use descriptive variable names! A name like X_train is (just) more readable than x1.
The filenames of the saved images should clearly relate to the original image


Purpose

This assignment is designed to test that you have an understanding of:

how to train classification models using machine learning and neural networks;
how to create simple models that can be used as statistical benchmarks;
how to do this using scripts which can be executed from the command line

In [1]:
#path tools
import sys,os
sys.path.append(os.path.join(".."))

import argparse
import numpy as np
import utils.classifier_utils as clf_util
#neural networks with numpy
from utils.neuralnetwork import NeuralNetwork
from sklearn.datasets import fetch_openml
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn import datasets
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

In [2]:
#Fetch data

X, y = fetch_openml("mnist_784", version=1, return_X_y=True)

In [3]:
#Convert to numpy arrays
X = np.array(X)
y = np.array(y)

In [4]:
# Specify categories
classes = sorted(set(y)) # The names of each class (0-9)
nclasses = len(classes) #number of classes

In [31]:
classes

['0', '1', '2', '3', '4', '5', '6', '7', '8', '9']

In [5]:
# Create training data and test dataset 
X_train, X_test, y_train, y_test = train_test_split(X, # our data
                                                    y, # our labels
                                                    #random_state=9,
                                                    train_size=7500, 
                                                    test_size=2500)
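
Because `random_state` is commented out in the cell above, each run draws a different split, so the scores can shift slightly between runs. The sketch below uses made-up toy data to show that fixing the seed makes the split repeatable. The helper name `split` is hypothetical, and `stratify` is an optional extra that the notebook does not use.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-in for MNIST: 20 samples, 2 features, string labels
X_demo = np.arange(40).reshape(20, 2)
y_demo = np.array(["0", "1"] * 10)


def split(seed):
    """Split the toy data with a fixed seed, keeping class proportions."""
    return train_test_split(X_demo, y_demo,
                            train_size=15,
                            test_size=5,
                            random_state=seed,
                            stratify=y_demo)


first = split(9)
second = split(9)
# Same seed, same split: every returned array matches
print(all(np.array_equal(a, b) for a, b in zip(first, second)))
```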


In [6]:
#scaling the features 
X_train_scaled = X_train/255.0
X_test_scaled = X_test/255.0

In [7]:
# train a logistic regression model
# (scikit-learn >= 1.2 expects penalty=None instead of the string 'none')
clf = LogisticRegression(penalty='none',
                         tol=0.1,
                         solver='saga',
                         multi_class='multinomial').fit(X_train_scaled, y_train)

In [8]:
#to check the shape of the coefficient matrix
clf.coef_.shape

(10, 784)

In [9]:
#Predict test data
y_pred = clf.predict(X_test_scaled)

In [10]:
# Build and print the classification report for the test set
report = metrics.classification_report(y_test, y_pred)
print(report)

              precision    recall  f1-score   support

           0       0.96      0.94      0.95       234
           1       0.93      0.96      0.95       278
           2       0.90      0.90      0.90       267
           3       0.89      0.89      0.89       261
           4       0.90      0.90      0.90       232
           5       0.88      0.86      0.87       224
           6       0.91      0.94      0.92       257
           7       0.91      0.95      0.93       248
           8       0.89      0.83      0.86       243
           9       0.90      0.88      0.89       256

    accuracy                           0.91      2500
   macro avg       0.91      0.90      0.91      2500
weighted avg       0.91      0.91      0.91      2500
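
For the lr-mnist.py script, the cells above can be collected into a single function. This is a rough sketch under some assumptions: the function name `benchmark_logistic_regression`, its defaults, and the synthetic two-class data are made up for illustration, and it keeps scikit-learn's default L2 penalty rather than the unpenalised model trained in the notebook. For MNIST you would pass in the scaled pixel array and the labels from `fetch_openml`.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split


def benchmark_logistic_regression(X, y, test_size=0.25, seed=9):
    """Split the data, fit the model, and return (report text, accuracy)."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=seed, stratify=y)
    clf = LogisticRegression(solver="saga", tol=0.1)
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    return classification_report(y_test, y_pred), accuracy_score(y_test, y_pred)


# Synthetic stand-in for MNIST: two well-separated clusters, string labels
rng = np.random.default_rng(0)
X_demo = np.vstack([rng.normal(-2.5, 1.0, size=(100, 5)),
                    rng.normal(2.5, 1.0, size=(100, 5))])
y_demo = np.array(["0"] * 100 + ["1"] * 100)

report, accuracy = benchmark_logistic_regression(X_demo, y_demo)
print(report)
```

Returning the report as a string keeps printing and saving separate, so the same function can serve both the terminal output and a file in the out folder.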

