### Exploring Handwritten Digits

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import requests
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture
from sklearn.datasets import load_digits
from sklearn.manifold import Isomap

#### 1. Loading and Visualizing data

In [None]:
digits = load_digits()
digits.images.shape

`digits.images` is a 3-D array of 1,797 samples, each consisting of an 8x8 grid of pixels.\
Now let's visualize the first 100 of these.

In [None]:
fig, axes = plt.subplots(
    10,
    10,
    figsize=(8, 8),
    subplot_kw={"xticks": [], "yticks": []},
    gridspec_kw=dict(hspace=0.1, wspace=0.1),
)

for i, ax in enumerate(axes.flat):
    ax.imshow(digits.images[i], cmap="binary", interpolation="nearest")
    ax.text(0.05, 0.05, str(digits.target[i]), transform=ax.transAxes, color="green")

In order to work with this data in Scikit-learn, we need a 2D representation to serve as our feature matrix.\
We can accomplish this by <u>treating each pixel in the image as a feature</u>, i.e. by flattening the pixel arrays so\
that each digit is represented by a length-64 array of pixel values.\
<br>
Additionally, of course, we need the target vector, which gives the previously determined label for each digit.
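
As a minimal sketch of the flattening described above, the 8x8 images can be reshaped by hand and compared with the ready-made `digits.data`. The name `X_flat` is introduced here only for illustration and is not used elsewhere in the notebook.

```python
import numpy as np
from sklearn.datasets import load_digits

digits = load_digits()

# Flatten each 8x8 image into a single row of 64 pixel values.
# X_flat is an illustrative name, not part of the original notebook.
X_flat = digits.images.reshape(len(digits.images), -1)

print(X_flat.shape)
# Check whether the manual flattening matches the dataset's own feature matrix
print(np.array_equal(X_flat, digits.data))
```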

In [None]:
X = digits.data
X.shape

In [None]:
y = digits.target
y.shape

Unsupervised learning: we use dimensionality reduction to visualize our points in 2D instead of 64D.\
We will use the manifold learning algorithm <b>Isomap</b>.

In [None]:
iso = Isomap(n_components=2)
data_projected = iso.fit_transform(digits.data)
data_projected.shape


Plotting our results

In [None]:
plt.scatter(
    x=data_projected[:, 0],
    y=data_projected[:, 1],
    c=digits.target,
    edgecolor="none",
    alpha=0.5,
    cmap=plt.get_cmap("nipy_spectral", 10),
)
plt.colorbar(label="digit label", ticks=range(10))
plt.clim(-0.5, 9.5)

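Since the digit groups appear fairly well separated in the 2D projection, even a simple supervised classifier should do reasonably well. We hold out a test set and fit a Gaussian naive Bayes model on the rest.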

In [None]:
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, random_state=0)
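
When no `test_size` is passed, `train_test_split` holds out 25% of the samples by default. A quick, self-contained check of the resulting split sizes might look like this (the variable `test_fraction` is illustrative):

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

digits = load_digits()
X, y = digits.data, digits.target

# Same call as in the notebook: default test size, fixed seed for reproducibility
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, random_state=0)

print(Xtrain.shape, Xtest.shape)

# Share of samples that ended up in the held-out test set
test_fraction = len(Xtest) / len(X)
print(f"{test_fraction:.2%}")
```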

In [None]:
model = GaussianNB()
model.fit(Xtrain, ytrain)
y_model = model.predict(Xtest)


In [None]:
print(f"{accuracy_score(ytest, y_model) * 100:.2f}%")

Accuracy alone does not tell us where the model went wrong, so we compute a confusion matrix and visualize it as a heatmap.

In [None]:
mat = confusion_matrix(ytest, y_model)

In [None]:
sns.heatmap(mat, square=True, annot=True, cbar=False)
plt.xlabel('Predicted Values')
plt.ylabel('True Values')

We can get intuition for the next steps (such as moving to a more sophisticated algorithm) from our heatmap
* A large number of twos were misclassified as either ones or eights
* A fraction of nines were misclassified as either ones, threes, sevens or eights
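
To read the same information off the matrix numerically rather than from the heatmap, one possible sketch zeroes the diagonal and ranks the remaining cells. The names `off_diag` and `top_pairs` are illustrative, and the split and model simply mirror the cells above.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

digits = load_digits()
Xtrain, Xtest, ytrain, ytest = train_test_split(
    digits.data, digits.target, random_state=0
)
y_model = GaussianNB().fit(Xtrain, ytrain).predict(Xtest)

# Fix the label order so that row/column index i corresponds to digit i
mat = confusion_matrix(ytest, y_model, labels=np.arange(10))

# Keep only the misclassifications by zeroing the correct predictions
off_diag = mat.copy()
np.fill_diagonal(off_diag, 0)

# Indices of the five largest off-diagonal cells, as (true, predicted) pairs
flat_order = np.argsort(off_diag, axis=None)[::-1][:5]
top_pairs = [
    tuple(int(v) for v in np.unravel_index(i, off_diag.shape)) for i in flat_order
]

for true_label, pred_label in top_pairs:
    count = off_diag[true_label, pred_label]
    print(f"true {true_label} -> predicted {pred_label}: {count} samples")
```

Rows of the matrix correspond to true labels and columns to predicted labels, which matches the axis labels used on the heatmap.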