# Supervised Learning: Classification with sklearn

Let's take a look at some supervised learning examples using sklearn. We'll start with some image classification examples, followed by a look at linear regression. One important point first: your choice of classification model matters greatly, because different models excel at different tasks. You can see a comparison of classifiers here:

![](https://scikit-learn.org/stable/_images/sphx_glr_plot_classifier_comparison_001.png)

We'll look at a few options, but we don't have nearly enough time to cover all of them in detail. For now, explore and test!

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import metrics

We'll import our dataset using the Keras library. You might recall that Keras is another Python machine learning library. We're only using it to easily obtain the dataset here; we'll still do all our training with sklearn.

In [None]:
import keras

(X_train, y_train), (X_test, y_test) = keras.datasets.fashion_mnist.load_data()  # returns two tuples of NumPy arrays: (images, labels) for train and test

# X_train: NumPy array of grayscale image data with shape (60000, 28, 28), containing the training data.
# y_train: NumPy array of labels (integers in range 0-9) with shape (60000,) for the training data.
# X_test: NumPy array of grayscale image data with shape (10000, 28, 28), containing the test data.
# y_test: NumPy array of labels (integers in range 0-9) with shape (10000,) for the test data.

In [None]:
# For model training speed purposes, we'll cut out the majority of our dataset
X_train = X_train[:1200] #keep only the first 1200 images
y_train = y_train[:1200] #keep only the first 1200 labels
X_test = X_test[:200] #keep only the first 200 images
y_test = y_test[:200] #keep only the first 200 labels

In [None]:
X_train[0]

In the dataset, each class of clothing is encoded as an integer label:

*    0 = T-shirt/top
*    1 = Trouser
*    2 = Pullover
*    3 = Dress
*    4 = Coat
*    5 = Sandal
*    6 = Shirt
*    7 = Sneaker
*    8 = Bag
*    9 = Ankle boot

In [None]:
y_train[0] #this is the label, i.e. the classification, of X_train[0]
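
To make the integer labels easier to read, they can be translated back into class names. The following is a minimal sketch: `class_names` is copied from the list above, while `example_labels` is a hypothetical array standing in for a slice such as `y_train[:5]`.

```python
import numpy as np

# Class names in label order, copied from the list above (index = label).
class_names = [
    "T-shirt/top", "Trouser", "Pullover", "Dress", "Coat",
    "Sandal", "Shirt", "Sneaker", "Bag", "Ankle boot",
]

# Hypothetical labels standing in for something like y_train[:5].
example_labels = np.array([9, 0, 0, 3, 0])

# Translate each integer label into its class name.
example_names = [class_names[label] for label in example_labels]
print(example_names)
```

The same list comprehension should work on any array of labels, including model predictions.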

We can take a quick look at a subset of images in the dataset by plotting them with matplotlib:

In [None]:
n_row = 1
n_col = 5
plt.figure(figsize=(10,8))
for i in range(n_row * n_col):
    plt.subplot(n_row, n_col, i+1)
    plt.imshow(X_train[i,:].reshape(28,28), cmap="gray")
    title_text = "Image" + str(i+1)
    plt.title(title_text, size=6.5)

plt.show()

One important step in preparing our training data is to reduce the dimensionality of our arrays. They are currently three-dimensional (samples x height x width), but sklearn models expect two-dimensional input (samples x features).

In [None]:
print(X_train.shape)
print(X_test.shape)

To fix this, we can use the [.reshape() method](https://numpy.org/doc/stable/reference/generated/numpy.reshape.html) to flatten our 28 x 28 image data:

In [None]:
nsamples, nx, ny = X_train.shape
X_train_d2 = X_train.reshape((nsamples, nx * ny))
nsamples, nx, ny = X_test.shape
X_test_d2 = X_test.reshape((nsamples, nx * ny))

In [None]:
print(X_train_d2.shape)
print(X_test_d2.shape)
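
To see what the flattening does on a scale small enough to inspect by eye, here is a minimal sketch using a made-up array. `tiny_images` is a hypothetical stand-in for the image data (3 "images" of 2 x 2 pixels), and passing `-1` asks NumPy to infer the row length.

```python
import numpy as np

# A hypothetical stand-in for the image array: 3 "images" of 2 x 2 pixels.
tiny_images = np.arange(12).reshape(3, 2, 2)

# Flatten each image into one row; -1 lets NumPy infer the row length (2 * 2 = 4).
tiny_flat = tiny_images.reshape(len(tiny_images), -1)

print(tiny_images.shape)  # (3, 2, 2)
print(tiny_flat.shape)    # (3, 4)
print(tiny_flat[0])       # [0 1 2 3]
```

Each row of `tiny_flat` holds the pixels of one image read row by row, so reshaping back to `(3, 2, 2)` recovers the original array.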

Now that we've prepped our training data, we can train a model. We'll start with an [MLP classifier](https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html). As we discussed earlier, this can work for simple images but struggles with high-resolution, complex images. Our images here are quite simple, so let's see how it does:

In [None]:
from sklearn.neural_network import MLPClassifier

MLP_model = MLPClassifier(max_iter=500, tol=1e-3)
MLP_model.fit(X_train_d2,y_train)
mlp_predict = MLP_model.predict(X_test_d2)

In [None]:
print(metrics.classification_report(y_test, mlp_predict))
print("average accuracy:", np.mean(y_test == mlp_predict) * 100)
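
The accuracy computed with `np.mean` above is the fraction of predictions that match the true labels. As a rough illustration on made-up data, the sketch below shows that it agrees with `metrics.accuracy_score`, and uses `metrics.confusion_matrix` to show which classes get mixed up. `y_true` and `y_pred` are hypothetical arrays, not output from the model above.

```python
import numpy as np
from sklearn import metrics

# Hypothetical true and predicted labels for six samples.
y_true = np.array([0, 1, 2, 2, 1, 0])
y_pred = np.array([0, 1, 2, 1, 1, 2])

# Fraction of predictions that match, computed two equivalent ways.
manual_accuracy = np.mean(y_true == y_pred)
sklearn_accuracy = metrics.accuracy_score(y_true, y_pred)

# Rows are true labels, columns are predicted labels.
confusion = metrics.confusion_matrix(y_true, y_pred)

print(manual_accuracy, sklearn_accuracy)
print(confusion)
```

Off-diagonal entries in the confusion matrix count misclassifications, which can be more informative than a single accuracy number when some classes look alike.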

Next, let's try a [Logistic Regression classifier](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) (logistic regression was originally designed for binary classification, but sklearn's implementation also supports multi-class classification):

In [None]:
from sklearn.linear_model import LogisticRegression
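# With the saga solver, sklearn fits a multinomial model for multi-class targets by default;
# the explicit multi_class argument is deprecated in recent sklearn releases, so we omit it.
lr_model = LogisticRegression(solver="saga", max_iter=100, tol=1e-3)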
lr_model.fit(X_train_d2, y_train)
lr_predict = lr_model.predict(X_test_d2)

In [None]:
print(metrics.classification_report(y_test, lr_predict))
print("average accuracy:", np.mean(y_test == lr_predict) * 100)

We can try other models, and you can explore on your own as well: simply import the model and skim its documentation to see whether any parameters need to be specified.

In [None]:
from sklearn.naive_bayes import GaussianNB

gNB_model = GaussianNB()
gNB_model.fit(X_train_d2,y_train)
nb_predict = gNB_model.predict(X_test_d2)

print(metrics.classification_report(y_test, nb_predict))
print("average accuracy:", np.mean(y_test == nb_predict) * 100)

In [None]:
from sklearn.svm import SVC

svm_model = SVC(max_iter=100, tol=1e-3)
svm_model.fit(X_train_d2,y_train)
svm_predict = svm_model.predict(X_test_d2)

print(metrics.classification_report(y_test, svm_predict))
print("average accuracy:", np.mean(y_test == svm_predict) * 100)
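
When trying several models, it can be convenient to loop over them instead of repeating the fit/predict/report code. The sketch below is one way to do that under some assumptions: it uses sklearn's small bundled digits dataset (`load_digits`, 8 x 8 images) rather than Fashion MNIST so that it runs quickly and offline, and it includes `KNeighborsClassifier`, which is not used elsewhere in this notebook. The names `candidates` and `scores` are made up for illustration.

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

# Assumption: the small 8 x 8 digits dataset bundled with sklearn stands in
# for Fashion MNIST so the sketch runs quickly and offline.
X, y = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# Candidate models; the dictionary keys are just display names.
candidates = {
    "GaussianNB": GaussianNB(),
    "KNeighbors": KNeighborsClassifier(n_neighbors=5),
    "LogisticRegression": LogisticRegression(max_iter=2000),
}

# Fit each model on the same split and record its test accuracy.
scores = {}
for name, model in candidates.items():
    model.fit(X_tr, y_tr)
    scores[name] = model.score(X_te, y_te)

# Print the models from most to least accurate.
for name, accuracy in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {accuracy * 100:.1f}%")
```

The same loop should carry over to the flattened Fashion MNIST arrays by swapping in `X_train_d2`, `y_train`, `X_test_d2`, and `y_test`.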

## Principal Component Analysis (PCA)

**Principal Component Analysis** (PCA) is a dimensionality reduction technique used to simplify high-dimensional data while preserving most of its important features. It achieves this by transforming the original features into a new set of variables called principal components. This can help eliminate redundancy: if 10 out of 12 variables all measure similar things, they might be given too much weight. For example:
*    variable1 = temperature
*    variable2 = humidity
*    variable3 = wind speed

These variables might all be reduced to one feature called weather.




Note that PCA can reduce accuracy in some cases because some information is lost when components are discarded, especially if the retained principal components do not capture most of the variation in the original data. There is also a risk that the retained components capture noise rather than signal, which likewise hurts accuracy.

*    **noise**: irrelevant or random variations in the data that do not represent meaningful patterns or relationships
*    **signal**: meaningful patterns in the data that are relevant to the task at hand

PCA also assumes a linear relationship between variables. If the relationships are non-linear, PCA may struggle to capture them properly.
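
To make the weather example concrete, here is a minimal sketch on synthetic data. Everything in it is an assumption made for illustration: a hidden `weather` factor drives three measured variables (with made-up coefficients and a little noise), and `explained_variance_ratio_` then shows how much of the total variance each principal component accounts for.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Hypothetical data: one hidden "weather" factor drives three measured variables,
# each with a little independent noise added.
weather = rng.normal(size=500)
temperature = 2.0 * weather + rng.normal(scale=0.1, size=500)
humidity = -1.5 * weather + rng.normal(scale=0.1, size=500)
wind_speed = 0.5 * weather + rng.normal(scale=0.1, size=500)
X_weather = np.column_stack([temperature, humidity, wind_speed])

# Fit PCA and inspect how much variance each component explains.
pca_demo = PCA(n_components=3)
pca_demo.fit(X_weather)
print(pca_demo.explained_variance_ratio_)

# Keeping only the first component reduces 3 columns to 1.
X_reduced = PCA(n_components=1).fit_transform(X_weather)
print(X_reduced.shape)  # (500, 1)
```

Because the three columns are almost perfectly linearly related here, the first component should explain nearly all of the variance. Real data is rarely this tidy, which is where the information loss described above comes in.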

In [None]:
from sklearn.decomposition import PCA

# let's reload the full dataset (60,000 training and 10,000 test images) and use PCA
(X_train, y_train), (X_test, y_test) = keras.datasets.fashion_mnist.load_data()
nsamples, nx, ny = X_train.shape
X_train_d2 = X_train.reshape((nsamples, nx * ny))
nsamples, nx, ny = X_test.shape
X_test_d2 = X_test.reshape((nsamples, nx * ny))

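# specify the number of principal components to retain
n_components = 400
pca = PCA(n_components=n_components)
X_train_pca = pca.fit_transform(X_train_d2)  # learn the components from the training data
X_test_pca = pca.transform(X_test_d2)  # apply the same components to the test data (do not refit)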

In [None]:
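from sklearn.linear_model import LogisticRegression

# Train and evaluate on the PCA-reduced features rather than the raw pixels.
# multi_class is omitted: saga already fits a multinomial model for multi-class
# targets, and the argument is deprecated in recent sklearn releases.
lr_model = LogisticRegression(solver="saga", max_iter=100, tol=1e-3)
lr_model.fit(X_train_pca, y_train)
lr_predict = lr_model.predict(X_test_pca)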

print(metrics.classification_report(y_test, lr_predict))
print("average accuracy:", np.mean(y_test == lr_predict) * 100)
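
Instead of fixing the number of components by hand, PCA can also be asked to keep enough components to explain a chosen fraction of the variance. The sketch below illustrates this under some assumptions: it uses sklearn's small bundled digits dataset rather than Fashion MNIST so that it runs quickly, the 95% threshold is an arbitrary choice, and the variable names are made up. The components are learned from the training data only and then reused on the test data.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Assumption: sklearn's bundled 8 x 8 digits dataset stands in for Fashion MNIST.
X, y = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# A float n_components asks PCA to keep enough components to explain
# that fraction of the variance (here 95%); the full SVD solver supports this.
pca_95 = PCA(n_components=0.95, svd_solver="full")
X_tr_pca = pca_95.fit_transform(X_tr)  # learn the components from training data only
X_te_pca = pca_95.transform(X_te)      # reuse those same components on the test data

# Train and score a classifier on the reduced features.
clf = LogisticRegression(max_iter=2000)
clf.fit(X_tr_pca, y_tr)

print("components kept:", pca_95.n_components_)
print("test accuracy:", clf.score(X_te_pca, y_te))
```

Comparing the test accuracy and training time with and without PCA is a simple way to judge whether the reduction is worthwhile for a given model.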

## Exercise: Training a Classification Model

We just learned how to do classification using Fashion MNIST, a dataset containing images of clothing items. There's another, similar dataset called MNIST, which has images of handwriting -- specifically, handwritten digits 0 through 9.

*    Write an MNIST classifier that is trained to recognise the written digit. I've started the code for you below -- how would you finish it? What's the best accuracy you can achieve?

In [None]:
import tensorflow as tf
mnist = tf.keras.datasets.mnist

(X_train, y_train),(X_test, y_test) = mnist.load_data()

In [None]:
X_train[0]

In [None]:
y_train[0] #once again, this is the label, i.e. classification, of X_train[0]

In [None]:
nsamples, nx, ny = X_train.shape
X_train_d2 = X_train.reshape((nsamples, nx * ny))
nsamples, nx, ny = X_test.shape
X_test_d2 = X_test.reshape((nsamples, nx * ny))