# scikit-learn classification using SVC - recognizing hand-written digits

This lesson trains a support vector classifier (SVC) on the [MNIST Digits dataset](course_datasets.md#mnist-digits) to recognize hand-written digits. It is based on the [Recognizing hand-written digits](https://scikit-learn.org/stable/auto_examples/classification/plot_digits_classification.html#sphx-glr-auto-examples-classification-plot-digits-classification-py) tutorial in the scikit-learn documentation.


In [None]:
import matplotlib.pyplot as plt

from sklearn import datasets, svm, metrics  # import datasets, classifiers and performance metrics
from sklearn.model_selection import train_test_split    # import train_test_split function

Load the digits dataset, which is built into scikit-learn. It is much smaller than the full MNIST dataset and is not a subset of it: it is a copy of the test set of the UCI Optical Recognition of Handwritten Digits dataset. It has 1797 samples of hand-written digits, each represented as an 8x8 pixel image (64 features). The target variable is the digit (0-9) that each image represents.

In [None]:
digits = datasets.load_digits()  # load the dataset (a dictionary-like Bunch object)
print(f'digits.keys(): {digits.keys()}')  # print the keys of the dataset to understand its structure
print('digits.data.shape:', digits.data.shape)  # print the shape of the data array
print('digits.images.shape:', digits.images.shape)  # print the shape of the images array
print('digits.target.shape:', digits.target.shape)  # print the shape of the target array
print('first 20 elements of digits.target:', digits.target[:20])  # print the first 20 elements of the target array
print('first element of digits.data:\n', digits.data[0])  # print the first element of the data array (64 flattened pixel values)
print('first element of digits.images:\n', digits.images[0])  # print the first element of the images array
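The relationship between `digits.data` and `digits.images` can be checked directly. The following is a minimal sketch, not part of the original tutorial; the variable names `flattened`, `same_values`, and `class_counts` are introduced here for illustration only.

```python
import numpy as np
from sklearn import datasets

digits = datasets.load_digits()

# Flatten each 8x8 image row by row into a vector of 64 values.
flattened = digits.images.reshape(len(digits.images), -1)

# Each flattened image should match the corresponding row of digits.data.
same_values = np.array_equal(flattened, digits.data)
print("digits.data equals flattened digits.images:", same_values)

# Range of the pixel intensities.
print("pixel range:", digits.data.min(), "to", digits.data.max())

# Number of samples for each digit class (index = digit).
class_counts = np.bincount(digits.target, minlength=10)
print("samples per digit:", class_counts)
```

This suggests that either attribute can be used as the starting point: `digits.images` for plotting, `digits.data` for fitting a model.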


Plot one image from the dataset along with its label.

In [None]:
n = 14
image = digits.images[n]
plt.figure(figsize=(3, 3))
plt.imshow(image, cmap=plt.cm.gray_r, interpolation='nearest')
plt.title(f"Label: {digits.target[n]}")
plt.axis('off')
plt.show()

Aside: an image is just a 2D array of pixel values, so any 2D array can be displayed as an image.

In [None]:
# create an array and show it
my_number_array = [[1,2,3], [4,5,6], [7,8,9]]
plt.figure(figsize=(3, 3))
plt.imshow(my_number_array, cmap=plt.cm.gray_r, interpolation='nearest')
plt.axis('off')
plt.show()


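Split the data into training and test sets. With `test_size=0.5` and `shuffle=False`, the first half of the samples is used for training and the second half for testing.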

In [None]:
n_samples = len(digits.images)
data = digits.images.reshape((n_samples, -1))  # flatten each 8x8 image into 64 features (same values as digits.data)
X_train, X_test, y_train, y_test = train_test_split(data, digits.target, test_size=0.5, shuffle=False)
print(f'X_train.shape: {X_train.shape}, X_test.shape: {X_test.shape}, y_train.shape: {y_train.shape}, y_test.shape: {y_test.shape}')
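To see what `shuffle=False` means in practice, the sketch below checks that the training set is simply the first block of rows. It also shows a shuffled, stratified split for comparison; that alternative is an assumption added here for illustration and is not what this lesson uses. The names `keeps_order`, `Xs_train`, `Xs_test`, `ys_train`, and `ys_test` are hypothetical.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split

digits = datasets.load_digits()

# Unshuffled split, as in the lesson: rows keep their original order.
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.5, shuffle=False
)
keeps_order = np.array_equal(X_train, digits.data[: len(X_train)])
print("training set is the first block of rows:", keeps_order)

# Alternative (not the lesson's choice): shuffle and stratify so that both
# halves contain similar proportions of each digit.
Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    digits.data, digits.target, test_size=0.5,
    stratify=digits.target, random_state=0
)
print("class counts, unshuffled test set:", np.bincount(y_test, minlength=10))
print("class counts, stratified test set:", np.bincount(ys_test, minlength=10))
```

Fixing `random_state` makes the shuffled split reproducible between runs.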

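Create a support vector classifier, fit it on the training set, and predict the labels of the test set.

In [None]: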
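clf = svm.SVC(gamma=0.001)  # create a support vector classifier
clf.fit(X_train, y_train)  # learn the digits from the training set
predicted = clf.predict(X_test)  # predict the digit for each test sample
predicted[:10]  # first 10 predictions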

Show the first 10 (predicted, actual) pairs where the prediction does not match the test label.

In [None]:
my_list = list(zip(predicted, y_test))  # pair each prediction with its actual label
[(x, y) for x, y in my_list if x != y][:10]  # keep only the mismatches and show the first 10
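The list of mismatches can also be summarized as a count and an overall accuracy. This is a minimal sketch that repeats the lesson's split and model so it runs on its own; the names `wrong`, `n_wrong`, and `accuracy` are introduced here for illustration.

```python
import numpy as np
from sklearn import datasets, svm, metrics
from sklearn.model_selection import train_test_split

digits = datasets.load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.5, shuffle=False
)
clf = svm.SVC(gamma=0.001).fit(X_train, y_train)
predicted = clf.predict(X_test)

# Positions in the test set where the prediction differs from the actual label.
wrong = np.flatnonzero(predicted != y_test)
n_wrong = len(wrong)

# Fraction of test samples that were classified correctly.
accuracy = metrics.accuracy_score(y_test, predicted)

print(f"{n_wrong} of {len(y_test)} test images misclassified")
print(f"accuracy: {accuracy:.3f}")
print("first misclassified test indices:", wrong[:10])
```

The indices in `wrong` can be used to look up the misclassified images, for example `X_test[wrong[0]].reshape(8, 8)`.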


Show the first few images of the test dataset along with the actual and predicted labels

In [None]:
_, axes = plt.subplots(nrows=1, ncols=6, figsize=(10, 3))
for ax, image, actual, prediction in zip(axes, X_test, y_test, predicted):
    ax.set_axis_off()
    image = image.reshape(8, 8)
    ax.imshow(image, cmap=plt.cm.gray_r, interpolation='nearest')
    ax.set_title(f' Actual: {actual}\nPrediction: {prediction}')

Evaluate the classifier's performance using a classification report and a confusion matrix.

In [None]:
print(f"Classification report for classifier {clf}:\n" f"{metrics.classification_report(y_test, predicted)}\n")

In [None]:
disp = metrics.ConfusionMatrixDisplay.from_predictions(y_test, predicted)
disp.figure_.suptitle("Confusion Matrix")
print(f"{disp.confusion_matrix}")
plt.show()
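The precision and recall figures in the classification report can be derived from the confusion matrix. The sketch below, which repeats the lesson's split and model so it runs on its own, computes both per digit; the names `cm`, `correct`, `precision`, and `recall` are introduced here for illustration.

```python
import numpy as np
from sklearn import datasets, svm, metrics
from sklearn.model_selection import train_test_split

digits = datasets.load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.5, shuffle=False
)
predicted = svm.SVC(gamma=0.001).fit(X_train, y_train).predict(X_test)

# Rows are actual digits, columns are predicted digits.
cm = metrics.confusion_matrix(y_test, predicted)

# The diagonal holds the correctly classified samples for each digit.
correct = np.diag(cm)

# Recall: correct predictions divided by all actual samples of that digit (row sum).
recall = correct / cm.sum(axis=1)

# Precision: correct predictions divided by all predictions of that digit (column sum).
precision = correct / cm.sum(axis=0)

for digit in range(10):
    print(f"digit {digit}: precision={precision[digit]:.2f}, recall={recall[digit]:.2f}")
```

Off-diagonal entries show which digits are confused with which: the entry in row `i`, column `j` counts test images of digit `i` that were predicted as digit `j`.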