# 📝 Handwritten Character Recognition using Machine Learning

This notebook walks through a machine learning project that classifies handwritten characters, including letters and digits.

To do this, we will be using data downloaded from [EMNIST (Extended MNIST)](https://www.kaggle.com/datasets/crawford/emnist). It consists of six different splits with details provided below: 
* EMNIST ByClass: 814,255 characters. 62 unbalanced classes.
* EMNIST ByMerge: 814,255 characters. 47 unbalanced classes.
* EMNIST Balanced:  131,600 characters. 47 balanced classes.
* EMNIST Letters: 145,600 characters. 26 balanced classes.
* EMNIST Digits: 280,000 characters. 10 balanced classes.
* EMNIST MNIST: 70,000 characters. 10 balanced classes.

A visual breakdown of the EMNIST datasets is shown below:
<img src="../report/images/emnistdataset.jpg" width="400" height="800">

Please refer to the [EMNIST: an extension of MNIST to handwritten letters](https://arxiv.org/abs/1702.05373v1) paper for more details of the dataset. 

This project uses the `EMNIST Balanced` dataset, which reduces misclassification errors caused by confusing uppercase and lowercase letters and has an equal number of samples per class. `EMNIST Balanced` is provided as two files:
* **emnist-balanced-train.csv** - 112,800 images 
* **emnist-balanced-test.csv** - 18,800 images

Each image is stored as one row of a `CSV` file with 785 columns: the first column holds the class label and the remaining 784 columns hold the pixel values of the 28×28 image.
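
As a minimal sketch of that row layout, the snippet below splits one row into its label and a 28×28 image. The `row` array is a hypothetical stand-in built by hand, not data read from the EMNIST files.

```python
import numpy as np

# Hypothetical stand-in for one CSV row: a class label followed by 784 pixel values
row = np.concatenate(([36], np.zeros(28 * 28, dtype=np.int64)))

label = int(row[0])             # first column: class label
pixels = row[1:]                # remaining 784 columns: flattened image
image = pixels.reshape(28, 28)  # restore the 2-D grid

print(label, pixels.shape, image.shape)
```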


In [3]:
# Regular EDA (exploratory data analysis) and plotting libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

import CNN.cnn 

ModuleNotFoundError: No module named 'layers'

In [10]:
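# Note: read_csv treats the first row as a header by default, so one sample
# per file becomes column names; pass header=None to keep every row
train_data = pd.read_csv('../../data/emnist-balanced-train.csv')
test_data = pd.read_csv('../../data/emnist-balanced-test.csv')
# squeeze=True was removed in pandas 2.0; .squeeze("columns") is the replacement
mapp = pd.read_csv("../../data/emnist-balanced-mapping.txt",
                   delimiter=' ', index_col=0, header=None).squeeze("columns")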
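
The notebook does not describe the contents of the mapping file. The sketch below assumes that each line pairs a class index with an ASCII code, which is how the EMNIST mapping files are usually laid out. The four-line `mapping_text` excerpt and the `label_to_char` helper are hypothetical and only illustrate how a predicted label could be turned back into a character.

```python
import io
import pandas as pd

# Hypothetical excerpt of a mapping file: "<class index> <ASCII code>" per line (assumed layout)
mapping_text = "0 48\n1 49\n10 65\n36 97\n"

mapp = pd.read_csv(io.StringIO(mapping_text),
                   delimiter=' ', index_col=0, header=None).squeeze("columns")

def label_to_char(label):
    """Look up the ASCII code for a class index and convert it to a character."""
    return chr(int(mapp.loc[label]))

print([label_to_char(i) for i in mapp.index])
```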

In [11]:
train_data.head()

Unnamed: 0,45,0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,...,0.524,0.525,0.526,0.527,0.528,0.529,0.530,0.531,0.532,0.533
0,36,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,43,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,15,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,4,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,42,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [12]:
test_data.head()

Unnamed: 0,41,0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,...,0.523,0.524,0.525,0.526,0.527,0.528,0.529,0.530,0.531,0.532
0,39,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,9,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,26,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,44,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,33,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [23]:
print(f'Train: {train_data.shape}, Test: {test_data.shape}, Map: {mapp.shape}')

Train: (112799, 785), Test: (18799, 785), Map: (47,)
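
The shapes above show 112,799 and 18,799 rows, one fewer than the 112,800 and 18,800 images listed for the dataset. This is consistent with `pd.read_csv` using the first row as column names by default, which also explains the `45, 0, 0.1, 0.2, ...` headers in the `head()` output. The sketch below reproduces the effect on a hypothetical three-row file and shows that `header=None` keeps every row.

```python
import io
import pandas as pd

# Hypothetical three-row file with no header line: a label, then two pixel columns
csv_text = "45,0,0\n36,0,0\n43,0,0\n"

with_default = pd.read_csv(io.StringIO(csv_text))                   # first row becomes column names
with_header_none = pd.read_csv(io.StringIO(csv_text), header=None)  # every row kept as data

print(with_default.shape, list(with_default.columns))
print(with_header_none.shape)
```

Loading the real files with `header=None` would change the row counts and column names in the outputs shown in this notebook.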


In [26]:
# Define width and height of images
WIDTH = 28
HEIGHT = 28

In [30]:
# Split X and y
X_train = train_data.iloc[:,1:]
y_train = train_data.iloc[:,0]

X_test = test_data.iloc[:,1:]
y_test = test_data.iloc[:,0]

In [31]:
print(f'Train data: {X_train.shape}')
print(f'Train labels: {y_train.shape}')
print(f'Test data: {X_test.shape}')
print(f'Test labels: {y_test.shape}')

Train data: (112799, 784)
Train labels: (112799,)
Test data: (18799, 784)
Test labels: (18799,)


In [32]:
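def rotate(image):
    """Reshape a flat pixel row to (HEIGHT, WIDTH) and correct its orientation."""
    image = image.reshape([HEIGHT, WIDTH])
    # A horizontal flip followed by a 90-degree counterclockwise rotation
    # is equivalent to a transpose
    image = np.fliplr(image)
    image = np.rot90(image)
    return image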
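
A horizontal flip followed by a 90-degree counterclockwise rotation is the same as a transpose, so the per-row function can be checked against a vectorized version. The sketch below does that on a small random batch. The `flat_batch` data is hypothetical, and the vectorized form is an alternative offered for comparison, not the notebook's own method.

```python
import numpy as np

HEIGHT, WIDTH = 28, 28

def rotate(image):
    image = image.reshape([HEIGHT, WIDTH])
    image = np.fliplr(image)
    image = np.rot90(image)
    return image

# Hypothetical batch of four flattened images with random pixel values
rng = np.random.default_rng(0)
flat_batch = rng.integers(0, 256, size=(4, HEIGHT * WIDTH))

# Row-by-row, as in the notebook
per_row = np.apply_along_axis(rotate, 1, flat_batch)

# Vectorized: reshape the whole batch, then swap the two image axes
vectorized = flat_batch.reshape(-1, HEIGHT, WIDTH).transpose(0, 2, 1)

print(per_row.shape)
print(np.array_equal(per_row, vectorized))
```

The vectorized form avoids a Python-level call per image, which may matter on the full training set.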

In [33]:
# Flip and rotate image
X_train = np.asarray(X_train)
X_train = np.apply_along_axis(rotate, 1, X_train)
print(f'Train data: {X_train.shape}')

X_test = np.asarray(X_test)
X_test = np.apply_along_axis(rotate, 1, X_test)
print(f'Test data: {X_test.shape}')

Train data: (112799, 28, 28)
Test data: (18799, 28, 28)


In [34]:
# Normalize pixel values from [0, 255] to [0, 1]
X_train = X_train.astype('float32')/255
X_test = X_test.astype('float32')/255
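
As a small illustration of the scaling step, the sketch below applies the same expression to three hypothetical 8-bit pixel values and checks the resulting range and dtype.

```python
import numpy as np

# Hypothetical 8-bit pixel values: darkest, mid-grey, brightest
pixels = np.array([0, 128, 255], dtype=np.uint8)

# Cast first so the division happens in float32 rather than integer arithmetic
scaled = pixels.astype('float32') / 255

print(scaled.dtype, scaled.min(), scaled.max())
```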

In [37]:
# Reshaping for input to CNN
X_train = X_train.reshape([-1,1,HEIGHT,WIDTH])
X_test = X_test.reshape([-1,1,HEIGHT,WIDTH])
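
The reshape above inserts a singleton channel axis, giving a (samples, channels, height, width) layout for greyscale images. Whether `CNN.cnn` expects exactly this order is not shown in the notebook, so treat that as an assumption. The sketch below applies the same reshape to a hypothetical batch of five blank images.

```python
import numpy as np

HEIGHT, WIDTH = 28, 28

# Hypothetical batch of five blank greyscale images
batch = np.zeros((5, HEIGHT, WIDTH), dtype=np.float32)

# -1 lets NumPy infer the number of samples; 1 is the single greyscale channel
nchw = batch.reshape([-1, 1, HEIGHT, WIDTH])

print(nchw.shape)
```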

In [40]:
# Number of classes
num_classes = y_train.nunique()
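
Since the dataset is described as balanced, one quick sanity check is to count the samples per class alongside `nunique()`. The sketch below uses a hypothetical six-label Series, and `is_balanced` is an illustrative name rather than something defined in the notebook.

```python
import pandas as pd

# Hypothetical labels: three classes with two samples each
y = pd.Series([0, 1, 2, 0, 1, 2])

num_classes = y.nunique()            # number of distinct class labels
counts = y.value_counts()            # samples per class
is_balanced = counts.nunique() == 1  # True when every class has the same count

print(num_classes, is_balanced)
```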