# HANDWRITTEN DIGITS DATASET

**File:** HandwrittenDigits.ipynb

**Course:** Data Science Foundations: Data Mining in Python

# IMPORT LIBRARIES

In [None]:
import pandas as pd                                   # For dataframes
import numpy as np                                    # For various functions
import matplotlib.pyplot as plt                       # For plotting functions
import seaborn as sns                                 # For additional plotting functions
from sklearn.model_selection import train_test_split  # For train/test splits

# LOAD AND PREPARE DATA
Many of the datasets for this course come from the Machine Learning Repository at the University of California, Irvine (UCI) at [https://archive.ics.uci.edu/](https://archive.ics.uci.edu/).

For all three demonstrations of dimensionality reduction, we'll use the "Optical Recognition of Handwritten Digits Data Set," which can be accessed via [https://j.mp/34NFNGn](https://j.mp/34NFNGn). We'll use the dataset saved in "optdigits.tra," which is the training dataset. 

## Import Data

- To read the dataset from a local CSV file, run the following cell. (This is the recommended approach.)

In [None]:
df = pd.read_csv('data/optdigits_raw.csv')

- Alternatively, to read the data from the UCI ML Repository, uncomment the lines in the cell below and run them.

In [None]:
# df = pd.read_csv(
#     'https://archive.ics.uci.edu/ml/machine-learning-databases/optdigits/optdigits.tra', 
#     header=None)

- Check the data.

In [None]:
df.head()

## Rename Variables

- Rename all attribute columns (i.e., pixel data) sequentially as `P0`, `P1`, etc.
- Rename the class variable as `y`.

In [None]:
df.columns = ["P" + str(i) for i in range(0, len(df.columns) - 1)] + ["y"]
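
To make the naming pattern concrete, here is a minimal sketch on a hypothetical toy dataframe (`toy` and its values are made up for illustration and are not part of the dataset). It applies the same list comprehension to three attribute columns and one class column.

```python
import pandas as pd

# Hypothetical toy frame: three "pixel" columns plus a class column, no names yet
toy = pd.DataFrame([[0, 5, 16, 1], [2, 0, 9, 3]])

# Same pattern as the cell above: P0, P1, ... for attributes, y for the class
toy.columns = ["P" + str(i) for i in range(len(toy.columns) - 1)] + ["y"]

print(list(toy.columns))
```

The numbering starts at 0 and the last column is always named `y`, so a frame with 65 columns would end up with `P0` through `P63` plus `y`.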

## Select Cases

- Select cases for the digits {1, 3, 6}.
- Look at the first 5 rows.

In [None]:
df = df.loc[df.y.isin([1, 3, 6])]

df.head()
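
The row selection above relies on `isin()`, which returns a Boolean mask that `.loc` then uses to keep matching rows. A small sketch, assuming a hypothetical toy dataframe (`toy` and `keep` are illustrative names only):

```python
import pandas as pd

# Hypothetical toy data with five cases and five different class labels
toy = pd.DataFrame({"P0": [0, 1, 2, 3, 4], "y": [1, 2, 3, 6, 9]})

# isin() marks the rows whose label is in the list; .loc keeps only those rows
keep = toy.loc[toy.y.isin([1, 3, 6])]

print(keep.y.tolist())
print(keep.index.tolist())
```

Note that the surviving rows keep their original index labels, so the index of the filtered dataframe is no longer a continuous range.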

## Split Data

- `train_test_split()` splits the data into training and testing sets.
- Specify the data matrix `X`, which contains the attributes of the pixel data.
- Extract columns `P0`, `P1`, ..., `P63` with `df.filter()` and the regular expression `\d`, which keeps only column names that contain a digit.
- Specify the target variable as `df.y`.
- Create the `trn` and `tst` dataframes.

In [None]:
# Splits the data into training and testing sets
X_trn, X_tst, y_trn, y_tst = train_test_split(
    df.filter(regex=r'\d'),
    df.y, 
    test_size=0.30,
    random_state=1)

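# Creates the training dataset, trn
# (trn is the same object as X_trn, so X_trn also gains the "y" column)
trn = X_trn
trn["y"] = y_trn

# Creates the testing dataset, tst
# (tst is the same object as X_tst, so X_tst also gains the "y" column)
tst = X_tst
tst["y"] = y_tst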
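
The cell above assigns `trn = X_trn`, which makes both names refer to the same dataframe. As a hedged alternative, the sketch below runs the same steps on hypothetical toy data (`toy` is made up for illustration) and uses `assign()` to build `trn` as a separate dataframe. This is one possible variation, not the approach the notebook itself takes.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 rows, two "pixel" columns, and a class column y
toy = pd.DataFrame({
    "P0": np.arange(10),
    "P1": np.arange(10) * 2,
    "y": [1, 3, 6, 1, 3, 6, 1, 3, 6, 1]})

# The regex \d matches any column name containing a digit, so "y" is excluded
X = toy.filter(regex=r'\d')

# Same split settings as the cell above: 30% of rows go to the testing set
X_trn, X_tst, y_trn, y_tst = train_test_split(
    X, toy.y, test_size=0.30, random_state=1)

# assign() returns a new dataframe, so X_trn itself keeps only pixel columns
trn = X_trn.assign(y=y_trn)

print(X_trn.shape, X_tst.shape)
print(list(trn.columns))
```

With 10 rows and `test_size=0.30`, 7 rows land in the training set and 3 in the testing set. The labels stay matched to their rows because `assign()` aligns on the index.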

# EXPLORE TRAINING DATA

## Display Images
Display the images of the first 20 digits in the training data.

In [None]:
# Sets up a grid for the images
fig, ax = plt.subplots(
    nrows=1, 
    ncols=20, 
    figsize=(15, 3.5), 
    subplot_kw=dict(xticks=[], yticks=[]))

# Plots 20 digits (pixel columns only; the "y" column is excluded)
pixels = trn.filter(regex=r'\d').to_numpy()
for i in range(20):
    ax[i].imshow(pixels[i].reshape(8, 8), cmap=plt.cm.gray)
plt.show()
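
Each case holds 64 pixel values (`P0` to `P63`), and `reshape(8, 8)` turns that flat row into an 8-by-8 grid for `imshow()`. The following sketch uses a made-up row of values 0 to 63 (not real pixel data) to show how the grid is filled.

```python
import numpy as np

# Hypothetical flat row standing in for P0..P63 (values equal their position)
row = np.arange(64)

# reshape() fills the grid row by row: P0..P7 form the first image row, etc.
img = row.reshape(8, 8)

print(img[0])      # first image row
print(img[1, 0])   # first value of the second image row
```

Because the grid is filled row by row, `P8` becomes the first pixel of the second image row.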

## Explore Attribute Variables
Select four arbitrary features (any four will do) and get paired plots.

In [None]:
# Creates a grid using Seaborn's PairGrid()
g = sns.PairGrid(
    trn, 
    vars=["P25", "P30", "P45", "P60"], 
    hue="y", 
    diag_sharey=False, 
    palette=["red", "green", "blue"])

# Adds histograms on the diagonal
g.map_diag(plt.hist)

# Adds density plots above the diagonal
g.map_upper(sns.kdeplot)

# Adds scatterplots below the diagonal
g.map_lower(sns.scatterplot)

# Adds a legend
g.add_legend()

# SAVE DATA
Save `df`, `trn`, and `tst` to CSV files to be used later.

In [None]:
df.to_csv('data/optdigits.csv', sep=',', index=False)
trn.to_csv('data/optdigits_trn.csv', sep=',', index=False)
tst.to_csv('data/optdigits_tst.csv', sep=',', index=False)
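
The `index=False` argument leaves the row labels out of the file, which matters here because the rows have been filtered and shuffled. A minimal round-trip sketch, assuming a hypothetical toy dataframe and a temporary directory (neither is part of the course files):

```python
import os
import tempfile

import pandas as pd

# Hypothetical toy data with non-sequential row labels, as after a shuffle
toy = pd.DataFrame({"P0": [0, 4], "P1": [16, 7], "y": [1, 3]}, index=[10, 42])

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy.csv")
    toy.to_csv(path, sep=',', index=False)   # row labels are not written
    back = pd.read_csv(path)                 # gets a fresh 0, 1, ... index

print(list(back.columns))
print(back.index.tolist())
```

When the file is read back, the columns and values are unchanged, but the original row labels are replaced by a default index starting at 0.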

# CLEAN UP

- If desired, clear the results with Cell > All Output > Clear. 
- Save your work by selecting File > Save and Checkpoint.
- Shut down the Python kernel and close the file by selecting File > Close and Halt.