# Digit Recognizer

### Competition Description
MNIST (*Modified National Institute of Standards and Technology*) is the de facto “hello world” dataset of computer vision. Since its release in 1999, this classic dataset of handwritten images has served as the basis for benchmarking classification algorithms. As new machine learning techniques emerge, MNIST remains a reliable resource for researchers and learners alike.

In this competition, our goal is to correctly identify digits from a dataset of tens of thousands of handwritten images. We will experiment with different algorithms to learn first-hand what works well and how techniques compare.

### Plan:
1. Data Ingestion and Split
2. EDA
3. Data Cleaning
   * Reshaping data
   * Dimensionality Reduction
   * Normalization
4. Train Distinct ML Models
   * Linear Classifier
   * Logistic Regression
   * SVM Classifier
   * Random Forest
   * XGBoost Classifier
5. Evaluation
   * Cross Validation
   * Metrics (accuracy, precision, recall, f-score)
   * Confusion Matrix (use heatmap for error analysis)
   * Diagnostic Assessment (Bias-Variance tradeoff, Learning Curves)
6. Data Pipelines
7. Custom Function(s) - code reuse
8. CNN
9. Transfer Learning (Fine Tuning)
10. Choose the best model and submit predictions

## 1. Data Ingestion and Split

In [1]:
# Let's import all the necessary libraries for this task first
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
data = pd.read_csv("train.csv")

In [3]:
data.head()

Unnamed: 0,label,pixel0,pixel1,pixel2,pixel3,pixel4,pixel5,pixel6,pixel7,pixel8,...,pixel774,pixel775,pixel776,pixel777,pixel778,pixel779,pixel780,pixel781,pixel782,pixel783
0,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,4,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [4]:
# Let's check the dimensions of our dataset next
data.shape

(42000, 785)
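
The 785 columns are consistent with one `label` column plus 784 pixel columns (`pixel0` through `pixel783`), which matches a flattened 28×28 image. A minimal sketch of that check follows, using a small synthetic frame in place of `train.csv`; the name `demo` and its random values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for train.csv (assumption): 5 rows, 1 label + 784 pixel columns.
rng = np.random.default_rng(0)
pixel_cols = [f"pixel{i}" for i in range(784)]
demo = pd.DataFrame(rng.integers(0, 256, size=(5, 784)), columns=pixel_cols)
demo.insert(0, "label", rng.integers(0, 10, size=5))

n_rows, n_cols = demo.shape
n_pixels = n_cols - 1              # everything except the label column
side = int(np.sqrt(n_pixels))      # side length of the square image

# Reshape the first row's pixels back into a square image.
first_image = demo.iloc[0, 1:].to_numpy().reshape(side, side)
print(n_cols, n_pixels, first_image.shape)
```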

Next, let's split the data into training, validation, and test sets. We will start with a plain random split and fall back to a stratified split in case the resulting sets are not good representatives of the population.

In [5]:
from sklearn.model_selection import train_test_split

X = data.iloc[:,1:]
y = data['label']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_test, y_test, test_size=0.3, random_state=42)

In [6]:
# Let's check whether each split is a good sample of the class distribution
perc = list()
for labels in [y_train, y_val, y_test]:  # not named `y`, so the full label column is not overwritten
    perc.append(labels.value_counts(normalize=True).sort_index())
pd.DataFrame(np.array(perc).T, columns=['y_train', 'y_val', 'y_test'])

|   | y_train | y_val | y_test |
|---|---|---|---|
| 0 | 0.099728 | 0.096032 | 0.093386 |
| 1 | 0.112075 | 0.110091 | 0.110582 |
| 2 | 0.098061 | 0.103741 | 0.100265 |
| 3 | 0.101905 | 0.107029 | 0.10873 |
| 4 | 0.096939 | 0.099887 | 0.090212 |
| 5 | 0.092177 | 0.085374 | 0.087831 |
| 6 | 0.097993 | 0.09932 | 0.100529 |
| 7 | 0.103469 | 0.107483 | 0.10873 |
| 8 | 0.097075 | 0.093537 | 0.101587 |
| 9 | 0.100578 | 0.097506 | 0.098148 |


The class proportions differ by less than one percentage point across the three sets, so each set is a good representative of the population and we can move on to the next task. Otherwise, we could have used StratifiedShuffleSplit from sklearn's model_selection module.
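
For reference, here is a minimal sketch of what the stratified alternative could look like. It is not part of this notebook's pipeline: `X_demo` and `y_demo` are synthetic, deliberately imbalanced stand-ins (assumptions for illustration), and both `StratifiedShuffleSplit` and the `stratify=` argument of `train_test_split` are shown because either should preserve class proportions.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit, train_test_split

# Synthetic, deliberately imbalanced labels standing in for data['label'] (assumption).
rng = np.random.default_rng(42)
y_demo = rng.choice(10, size=2000, p=[0.19] + [0.09] * 9)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # placeholder features

# Option 1: StratifiedShuffleSplit yields index arrays for each split.
sss = StratifiedShuffleSplit(n_splits=1, test_size=0.3, random_state=42)
train_idx, test_idx = next(sss.split(X_demo, y_demo))

# Option 2: train_test_split with stratify= returns the split arrays directly.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=42
)

# Class proportions in the held-out part should closely track the full label vector.
full = np.bincount(y_demo, minlength=10) / len(y_demo)
part = np.bincount(y_te, minlength=10) / len(y_te)
print(np.abs(full - part).max())
```

Passing `stratify=` to the existing `train_test_split` calls would be the smaller change; `StratifiedShuffleSplit` is more useful when several stratified resamples are needed.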

It is good practice to hold out the test data until the very end of our work and never explore it. Let's quickly look at the sizes of these sets next.

In [7]:
print(f"Training set: {X_train.shape[0]}")
print(f"Validation set: {X_val.shape[0]}")
print(f"Test set: {X_test.shape[0]}")

Training set: 29400
Validation set: 8820
Test set: 3780
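
These sizes follow from applying `test_size=0.3` twice: the second call splits the 30% holdout again, so the effective proportions are roughly 70% / 21% / 9% rather than an even division between validation and test. A quick arithmetic check, written with integer arithmetic as a sketch rather than a reproduction of scikit-learn's internal rounding:

```python
n_total = 42000                    # rows in train.csv, from data.shape

# First split: 30% held out, 70% kept for training.
n_holdout = n_total * 3 // 10
n_train = n_total - n_holdout

# Second split: 30% of the holdout becomes the test set, the rest validation.
n_test = n_holdout * 3 // 10
n_val = n_holdout - n_test

for name, n in [("train", n_train), ("val", n_val), ("test", n_test)]:
    print(f"{name}: {n} rows ({n / n_total:.0%})")
```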
