# Digits Project

## About the project

This project identifies which handwritten digits are hardest for image recognition to classify correctly. For the hardest digits we want to prepare more training data so that a model gets better at recognizing them. The question is: which digits need the extra training data?

## About the data

The data source is the `digits.csv` file.

|Column|Definition|
|---|---|
|pixel_x_x|Each column in this format holds the value of one pixel position in the image of a handwritten number|
|number_label|The actual number which was handwritten|

Note that there are 64 pixel columns. Each image is 8x8 pixels, so every consecutive group of 8 columns forms one row of the image.
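To make the column layout concrete, here is a minimal sketch of how a flat run of 64 values maps onto an 8x8 grid. It assumes the columns are stored row by row (as described above); the array `flat` and the helper `flat_index_to_row_col` are made up for illustration and are not part of the dataset.

```python
import numpy as np

# Hypothetical stand-in for one row of the 64 pixel columns.
flat = np.arange(64)

# Row-major reshape: values 0-7 become image row 0, values 8-15 row 1, and so on.
image = flat.reshape(8, 8)

def flat_index_to_row_col(index, width=8):
    """Map a flat column position (0-63) to (row, col) in the 8x8 image."""
    return divmod(index, width)

print(image[1])                   # the second image row: flat positions 8..15
print(flat_index_to_row_col(10))  # (1, 2)
```

The same `reshape(8, 8)` call is what the solution uses later to turn a dataframe row into an image.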

## Solution

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

### Data exploration

In [None]:
digits = pd.read_csv('digits.csv')
digits

Keep only the pixel columns from the original dataset.

In [None]:
pixels = digits.drop('number_label', axis=1)
pixels

Let's take the first row of the dataframe as an example and display it.

In [None]:
first_image = pixels.iloc[0]
first_image

Let's also convert the row (a pandas Series) into a NumPy array so we can use it as our image source.

In [None]:
first_image.to_numpy()

In [None]:
first_image.to_numpy().shape

Reshape the 64 values into an 8x8 matrix to get them into image form.

In [None]:
first_image.to_numpy().reshape(8, 8)

Let's use the pixel matrix to draw the image. We can see that it is the digit 4

In [None]:
plt.imshow(first_image.to_numpy().reshape(8, 8))

Grayscale version of the image

In [None]:
plt.imshow(first_image.to_numpy().reshape(8, 8), cmap='gray')

In [None]:
sns.heatmap(first_image.to_numpy().reshape(8, 8), annot=True, cmap='gray')

### Scaling data

In [None]:
from sklearn.preprocessing import StandardScaler

In [None]:
scaler = StandardScaler()
scaled_pixels = scaler.fit_transform(pixels)
scaled_pixels
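As a rough illustration of what the scaling step does, the sketch below standardizes each column by hand with NumPy: subtract the column mean and divide by the column standard deviation. The `standardize` helper and the `toy` array are hypothetical. The guard for constant columns mirrors, as far as I know, how `StandardScaler` treats zero-variance features (they come out as all zeros rather than dividing by zero).

```python
import numpy as np

def standardize(values):
    """Column-wise z-score: subtract the mean, divide by the standard deviation."""
    values = np.asarray(values, dtype=float)
    mean = values.mean(axis=0)
    std = values.std(axis=0)   # population standard deviation (ddof=0)
    std[std == 0] = 1.0        # constant columns: avoid division by zero
    return (values - mean) / std

# Hypothetical toy data: the first column is constant, like an always-blank pixel.
toy = np.array([[0.0, 1.0, 5.0],
                [0.0, 3.0, 5.0],
                [0.0, 5.0, 8.0]])
scaled = standardize(toy)

print(scaled.mean(axis=0))  # every column is centered on zero
print(scaled.std(axis=0))   # non-constant columns have unit spread
```

Scaling matters for the next step because PCA is driven by variance, so unscaled columns with large values would dominate the components.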

### Principal component analysis

In [None]:
from sklearn.decomposition import PCA

In [None]:
pca_model = PCA(n_components=2)

In [None]:
pca_pixels = pca_model.fit_transform(scaled_pixels)

Let's check how much variance is explained by the 2 principal components

In [None]:
np.sum(pca_model.explained_variance_ratio_)
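For intuition about what this number means, here is a minimal NumPy sketch that computes the same kind of ratio from a singular value decomposition. It is not how the notebook computes it (that is `explained_variance_ratio_` from scikit-learn); the function name and the synthetic `toy` data are assumptions for illustration.

```python
import numpy as np

def explained_variance_ratio(data, n_components):
    """Share of total variance captured by each of the first n_components."""
    centered = data - data.mean(axis=0)
    # Squared singular values of the centered data are proportional
    # to the variance along each principal direction.
    singular_values = np.linalg.svd(centered, compute_uv=False)
    variances = singular_values ** 2
    return variances[:n_components] / variances.sum()

# Hypothetical data: 5 features with very different spreads.
rng = np.random.default_rng(0)
toy = rng.normal(size=(200, 5)) * np.array([5.0, 2.0, 1.0, 0.5, 0.1])

ratios = explained_variance_ratio(toy, n_components=2)
print(ratios, ratios.sum())
```

A sum close to 1 means the 2D projection keeps most of the information; a low sum means the scatterplot hides a lot of structure, which is worth keeping in mind when reading it.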

Let's create a scatterplot of the digits in the 2-dimensional PCA space, colored and labeled by the `number_label` column of the original dataset.

In the scatterplot, the digit `4` is very easy to tell apart from the others, while the digit `8` is not. This means we need to prepare many more examples of `8` so that the algorithm is better trained to recognize it without mistaking it for other digits.

In [None]:
plt.figure(figsize=(10, 6))
sns.scatterplot(x=pca_pixels[:,0], y=pca_pixels[:,1], hue=digits['number_label'].values, palette='Set1')
plt.legend(loc=(1.05, 0))
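The visual impression from the scatterplot can be backed with a rough number per digit. The sketch below is one possible way to do that, not part of the original analysis: for each class it measures the share of points that lie closer to their own class centroid than to any other (a nearest-centroid check). The function name and the toy blobs are hypothetical.

```python
import numpy as np

def nearest_centroid_accuracy(points, labels):
    """Per-class share of points whose closest class centroid is their own."""
    points = np.asarray(points, dtype=float)
    labels = np.asarray(labels)
    classes = np.unique(labels)
    centroids = np.array([points[labels == c].mean(axis=0) for c in classes])
    # Distance from every point to every centroid, shape (n_points, n_classes).
    distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    predicted = classes[distances.argmin(axis=1)]
    return {c.item(): float(np.mean(predicted[labels == c] == c)) for c in classes}

# Hypothetical toy data: class 1 is isolated, classes 0 and 2 overlap heavily.
rng = np.random.default_rng(0)
toy_points = np.vstack([
    rng.normal(loc=(0.0, 0.0), size=(100, 2)),
    rng.normal(loc=(10.0, 10.0), size=(100, 2)),
    rng.normal(loc=(0.5, 0.5), size=(100, 2)),
])
toy_labels = np.repeat([0, 1, 2], 100)

scores = nearest_centroid_accuracy(toy_points, toy_labels)
print(scores)
```

In the notebook, the call would presumably be `nearest_centroid_accuracy(pca_pixels, digits['number_label'].to_numpy())`; digits with the lowest score are the ones that mix most with their neighbors in the PCA space.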

### 3D version of our PCA

Below is a 3D version of the above findings, using 3 PCA components.

In [None]:
pca_model = PCA(n_components=3)

In [None]:
pca_pixels = pca_model.fit_transform(scaled_pixels)

In [None]:
from mpl_toolkits import mplot3d

In [None]:
# the below command will make the notebook image interactive
%matplotlib notebook

plt.figure(figsize=(8, 8))
ax = plt.axes(projection='3d')
ax.scatter3D(pca_pixels[:,0], pca_pixels[:,1], pca_pixels[:,2], c=digits['number_label']);