In [1]:
# importing libraries, etc...

import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
sns.set()

path = "https://raw.githubusercontent.com/LennardVaarten/ML-Workshops/main/data/"

The MNIST dataset contains 70,000 labeled images of handwritten digits. Each digit was recorded as a 28x28 image, and the dataset stores the grey value of every pixel (so 28*28=784 pixel values for each sample). The aim is to create a model that can accurately classify which digit (0 through 9) a given image shows. 

Examples:

<div>
<img src="https://raw.githubusercontent.com/LennardVaarten/ML-Workshops/main/media/mnist.png" width="500"/>
</div>

To make model-building a little more feasible, I've taken a subset of the dataset containing 20,000 samples, rather than the full 70,000. 

In [2]:
# loading

mnist = pd.read_csv(path+"mnist.csv")

In [3]:
# viewing

mnist

Unnamed: 0,label,1x1,1x2,1x3,1x4,1x5,1x6,1x7,1x8,1x9,...,28x19,28x20,28x21,28x22,28x23,28x24,28x25,28x26,28x27,28x28
0,5,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,4,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,9,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
19995,9,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
19996,5,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
19997,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
19998,4,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
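
Each row can be turned back into an image by reshaping its 784 pixel values into a 28x28 grid. The sketch below assumes that the pixel columns are stored row by row (which the column names 1x1, 1x2, ..., 28x28 suggest). `row_to_image` is a hypothetical helper, and the pixel values are made up so that the example runs on its own.

```python
import numpy as np

def row_to_image(pixels, size=28):
    """Reshape a flat sequence of size*size grey values into a size x size array."""
    pixels = np.asarray(pixels)
    if pixels.size != size * size:
        raise ValueError(f"expected {size * size} pixel values, got {pixels.size}")
    return pixels.reshape(size, size)

# Made-up sample (not real MNIST data): a black image with one bright line on row 10
fake_pixels = np.zeros(784, dtype=int)
fake_pixels[10 * 28 : 11 * 28] = 255

image = row_to_image(fake_pixels)
print(image.shape)
```

With the real data, something like `row_to_image(mnist.iloc[0, 1:])` followed by `plt.imshow(image, cmap="gray")` should display the first digit, provided the column layout matches the assumption above.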


In [4]:
# Check if there are any missing values...

mnist.isna().any(axis=1).sum()

# Nope!

0
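
To see what the check above counts, here is a small sketch on a made-up DataFrame (the `toy` data is purely illustrative). `isna().any(axis=1)` flags every row that has at least one missing value, so summing it gives the number of affected rows rather than the number of missing cells.

```python
import numpy as np
import pandas as pd

# Made-up data with a few missing pixel values
toy = pd.DataFrame({
    "label": [5, 0, 4, 1],
    "1x1": [0, np.nan, 0, 0],
    "1x2": [0, np.nan, 0, np.nan],
})

rows_with_missing = toy.isna().any(axis=1).sum()  # rows containing at least one NaN
missing_cells = toy.isna().sum().sum()            # individual NaN cells
print(rows_with_missing, missing_cells)
```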

Splitting into training and test set. 

(Changing the random_state parameter in the train_test_split function to a different number will result in a different random split of the data. Try playing around with it and then running your model(s) again to see how a different split might result in a different score.)

In [5]:
from sklearn.model_selection import train_test_split

features_train, features_test, target_train, target_test = train_test_split(mnist.iloc[:,1:], 
                                                                    mnist.iloc[:,0], 
                                                                    random_state=99)

Let's start with our good old k-NN classifier...

Note that even this subset I've taken of the mnist dataset is quite large, with 20,000 samples and 784 features. Because of this, model-building / prediction can take some time.

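However, we can save a bit of time by using the n_jobs parameter when building a model. Setting n_jobs=-1 (as done below) tells sklearn to use all of your CPU cores, whereas by default it will only use 1. This means that if your PC has 8 CPU cores, the parallelized work can run up to roughly 8 times as fast, although the real speed-up is usually smaller because of overhead. (For k-NN, the parallelized work is the neighbor search, which happens when predicting or scoring rather than when fitting.)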

In [6]:
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=9, weights="uniform", n_jobs=-1).fit(features_train, target_train)
print("Training set score: {:.4f}".format(knn.score(features_train, target_train)))
print("Test set score: {:.4f}".format(knn.score(features_test, target_test)))

Training set score: 0.9611
Test set score: 0.9530
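
If you are curious how much n_jobs helps on your own machine, a rough timing sketch like the one below can give an impression. It uses made-up data from `make_classification` rather than the MNIST subset, and `timed_predict` is a hypothetical helper. The timings will vary from machine to machine, but the predictions themselves should not depend on n_jobs.

```python
import time
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

# Made-up classification data, small enough to run quickly
X, y = make_classification(n_samples=3000, n_features=50, n_informative=20,
                           n_classes=3, random_state=0)

def timed_predict(n_jobs):
    """Fit a k-NN model and time only the prediction step (the neighbor search)."""
    model = KNeighborsClassifier(n_neighbors=9, n_jobs=n_jobs).fit(X, y)
    start = time.perf_counter()
    predictions = model.predict(X)
    return predictions, time.perf_counter() - start

pred_single, secs_single = timed_predict(1)
pred_all, secs_all = timed_predict(-1)

print(f"n_jobs=1:  {secs_single:.2f} s")
print(f"n_jobs=-1: {secs_all:.2f} s")
print("Same predictions:", np.array_equal(pred_single, pred_all))
```

On a dataset this small the difference may be negligible, or even reversed, because starting the parallel workers has a cost of its own.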


Now, it's your turn to use any of the models we've discussed to see how well they perform on this task. Note that this is a classification problem, so only classification models will work on it. Perhaps even more important than choosing a classifier is trying out different parameter settings (e.g. n_neighbors for k-Nearest Neighbors, C for Logistic Regression, n_estimators for the Random Forest Classifier, etc...). 

Below are the classification models we've discussed, along with the import statement and the parameters that we've covered during the sessions.

- **k-Nearest Neighbors Classifier** (already imported in the cell above)
    - n_neighbors (a whole number above 0)
    - weights ("uniform", "distance")
- **Decision Tree Classifier** (from sklearn.tree import DecisionTreeClassifier)
    - max_depth (a whole number above 0)
    - min_samples_split (a whole number above 1)
- **Random Forest Classifier** (from sklearn.ensemble import RandomForestClassifier)
    - n_estimators (a whole number above 0)
    - max_depth (a whole number above 0)
    - min_samples_split (a whole number above 1)
- **Gradient Boosting Classifier** (from sklearn.ensemble import GradientBoostingClassifier)
    - n_estimators (a whole number above 0)
    - max_depth (a whole number above 0)
    - min_samples_split (a whole number above 1)
    - learning_rate (a number above 0; values between 0 and 1 are typical)
    - subsample (a number above 0 and at most 1)
    
If you want to access even more parameter settings than we've discussed in class (models tend to have a lot), you can also access the sklearn documentation. For example, [here](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html), you can find all possible parameters to tune for the KNeighborsClassifier.
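
One way to try out several parameter settings in a systematic way is to loop over them and compare cross-validation scores. The sketch below does this for k-NN. To keep it fast and self-contained, it uses sklearn's small built-in 8x8 digits dataset as a stand-in for the MNIST subset, and the candidate values are arbitrary choices rather than recommendations.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Stand-in data: sklearn's built-in 8x8 digits (not the MNIST subset used in this notebook)
X, y = load_digits(return_X_y=True)

results = {}
for n_neighbors in [1, 5, 9]:
    for weights in ["uniform", "distance"]:
        model = KNeighborsClassifier(n_neighbors=n_neighbors, weights=weights)
        scores = cross_val_score(model, X, y, cv=5)  # one accuracy score per fold
        results[(n_neighbors, weights)] = scores.mean()

# Print the settings from best to worst mean score
for setting, score in sorted(results.items(), key=lambda item: item[1], reverse=True):
    print(setting, f"{score:.4f}")

best_setting = max(results, key=results.get)
```

To apply the same idea to this notebook's data, replace `X, y` with `features_train, target_train`, and keep the test set aside for a final check of the setting you end up choosing.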

Good luck and feel free to share your model (and the results you obtain with it) on the Canvas discussion page!

In [None]:
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=500, random_state=99)
# Fit on the full training set (only needed if you want to use rf itself afterwards)
rf.fit(features_train, target_train)

# cross_val_score refits a fresh copy of rf on each of the folds
n_folds = 10
rf_scores = cross_val_score(rf, features_train, target_train, cv=n_folds)
print(rf_scores)
print(f"Random Forest mean {n_folds}-fold Cross-Validation score: {np.mean(rf_scores):.3f}")