# Handwritten Character Recognition

In [1]:
import csv
import numpy as np
import tensorflow as tf
from sklearn.model_selection import train_test_split
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.activations import relu,linear
from tensorflow.keras.losses import SparseCategoricalCrossentropy
from tensorflow.keras.optimizers import Adam

2023-01-17 23:26:41.807875: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.


## The Data
The data was taken from [this](https://www.kaggle.com/datasets/sachinpatel21/az-handwritten-alphabets-in-csv-format) Kaggle dataset, which provides a CSV version of the original data from the [NIST](https://www.nist.gov/srd/nist-special-database-19) and MNIST datasets.
<br>
The data represents 372451 images drawn on a 28x28 grid. The first column contains the class label (0 to 25, representing A to Z), and the remaining 784 columns contain the input features (values from 0 to 255 representing the grayscale intensity of each pixel).
<br>
*Note: the next cell takes some time to run.*

In [2]:
# Open the CSV file
with open('A_Z handwritten Data.csv', 'r') as file:
    reader = csv.reader(file)
    data = []
    for i, row in enumerate(reader):
        if i % 50000 == 0:
            print(f'iteration: {i}')
        data.append(row)
    data = np.array(data)

# Get the first column (labels) and store it in a numpy array
y = np.array(data[:, 0], dtype=float)

# Get the remaining columns (pixel values) and store them in a 2-D numpy array
# (np.array rather than np.matrix, which NumPy no longer recommends)
X = np.array(data[:, 1:], dtype=float)

print(y.shape)
print(X.shape)

iteration: 0
iteration: 50000
iteration: 100000
iteration: 150000
iteration: 200000
iteration: 250000
iteration: 300000
iteration: 350000
(372451,)
(372451, 784)
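As a rough illustration of the row layout described above, the sketch below parses two made-up rows (not taken from the real CSV), separates the label column from the 784 pixel columns, and reshapes each row back into a 28x28 grid. The names `rows`, `sample`, `images`, and `letters` are only for this example.

```python
import io
import numpy as np

# Two made-up rows in the dataset's layout: a label followed by 784 pixel values.
# (Illustrative data only, not taken from the real CSV.)
rows = [
    [0] + [0] * 784,      # label 0 -> 'A', blank image
    [25] + [255] * 784,   # label 25 -> 'Z', every pixel at full intensity
]
csv_text = "\n".join(",".join(str(v) for v in row) for row in rows)

# np.loadtxt parses numeric CSV text straight into a float array.
sample = np.loadtxt(io.StringIO(csv_text), delimiter=",")

labels = sample[:, 0].astype(int)       # first column: class index 0..25
pixels = sample[:, 1:]                  # remaining 784 columns: intensities 0..255
images = pixels.reshape(-1, 28, 28)     # one 28x28 grid per row
letters = [chr(ord("A") + k) for k in labels]

print(letters, images.shape)
```

The same `reshape(-1, 28, 28)` call can be applied to `X` if you want to plot individual characters.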


## Splitting our Data
We split our dataset into training and test sets so we can later compare the misclassification rates on each and check for bias and variance issues.
<br>
Training Set: 75%
<br>
Test Set: 25%

In [3]:
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.25, random_state=1)
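Conceptually, a random 75/25 split shuffles the row indices and then slices off a test portion. The sketch below shows that idea with NumPy only. `shuffle_split` is a hypothetical helper written for this example, not scikit-learn's implementation, and rounding the test size up is an assumption.

```python
import math
import numpy as np

def shuffle_split(X, y, test_fraction=0.25, seed=1):
    """Hypothetical helper: shuffle row indices, then cut off a test slice."""
    n = len(y)
    n_test = math.ceil(test_fraction * n)   # assumption: test size is rounded up
    order = np.random.default_rng(seed).permutation(n)
    test_idx, train_idx = order[:n_test], order[n_test:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]

# Tiny demo: 10 rows, where each row of X_demo starts with 2 * its label,
# so we can confirm that features and labels stay aligned after shuffling.
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)
Xtr, Xte, ytr, yte = shuffle_split(X_demo, y_demo)
print(Xtr.shape, Xte.shape)
```

Under the same rounding assumption, 25% of 372451 rows gives 93113 test rows and 279338 training rows.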

## The Model
### Layers
I tested different models with different layer structures: 100-100-100, 500-200-100, 784-784-784, etc. Increasing the number of nodes per layer didn't seem to reduce the cost/loss much beyond 100 nodes. The model I ended up with was 100-100-100-100 (4 hidden layers of 100 nodes each, followed by a final layer of 26 nodes).
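To get a feel for the size of the chosen architecture, the sketch below counts the trainable parameters of a fully connected 784-100-100-100-100-26 network (weights plus one bias per node). The input width of 784 comes from the data description; `layer_sizes` and `params_per_layer` are names used only for this example.

```python
# Layer widths: 784 inputs, four hidden layers of 100 nodes, 26 outputs.
layer_sizes = [784, 100, 100, 100, 100, 26]

# Each dense layer has (inputs * outputs) weights plus one bias per output node.
params_per_layer = [
    n_in * n_out + n_out
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])
]
total_params = sum(params_per_layer)

print(params_per_layer, total_params)
```

Most of the parameters sit in the first layer, because it connects all 784 pixels to the first 100 nodes.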
### Regularizer
Using a regularizer caused the model to underfit the data, and leaving it out did not cause overfitting, so I ended up not using one.
### Loss & Optimizer
I used the Sparse Categorical Crossentropy loss function because this is a multi-class classification problem. The `from_logits=True` setting tells TensorFlow to apply the softmax function inside the loss, since the output layer uses a linear activation for better numerical stability.
<br>
I used the Adam optimizer with a small learning rate of 0.0001, since larger values seemed to keep gradient descent from converging properly.
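The numerical benefit of computing the loss from raw logits can be sketched in plain NumPy. The function below is an illustrative reimplementation using the log-sum-exp trick, not TensorFlow's actual code. `sparse_ce_from_logits` is a hypothetical name.

```python
import numpy as np

def sparse_ce_from_logits(logits, labels):
    """Mean cross-entropy for integer labels, computed from raw logits."""
    logits = np.asarray(logits, dtype=float)
    labels = np.asarray(labels, dtype=int)
    # Subtracting the row maximum leaves the softmax unchanged but keeps exp() from overflowing.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # Pick the log-probability of the true class in each row.
    return -log_probs[np.arange(len(labels)), labels].mean()

# Extreme logits would overflow a naive exp(logits) / sum(exp(logits)).
logits = np.array([[1000.0, 0.0, -1000.0],
                   [0.0, 0.0, 0.0]])
labels = np.array([0, 2])
loss = sparse_ce_from_logits(logits, labels)
print(loss)
```

The first row is classified with near certainty and contributes almost nothing to the loss. The second row is a uniform guess over three classes and contributes ln(3).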

In [4]:
model = Sequential([
    Dense(100, activation='relu'),
    Dense(100, activation='relu'),
    Dense(100, activation='relu'),
    Dense(100, activation='relu'),
    Dense(26, activation='linear')
])



In [5]:
model.compile(
    loss=SparseCategoricalCrossentropy(from_logits=True),
    optimizer=Adam(learning_rate=0.0001)
)

In [6]:
model.fit(X_train, y_train, epochs=30)

Epoch 1/30
Epoch 2/30
Epoch 3/30
Epoch 4/30
Epoch 5/30
Epoch 6/30
Epoch 7/30
Epoch 8/30
Epoch 9/30
Epoch 10/30
Epoch 11/30
Epoch 12/30
Epoch 13/30
Epoch 14/30
Epoch 15/30
Epoch 16/30
Epoch 17/30
Epoch 18/30
Epoch 19/30
Epoch 20/30
Epoch 21/30
Epoch 22/30
Epoch 23/30
Epoch 24/30
Epoch 25/30
Epoch 26/30
Epoch 27/30
Epoch 28/30
Epoch 29/30
Epoch 30/30


<keras.callbacks.History at 0x2536ef8e0>

## Testing our Model
To test our model, we can simply loop over our training set and count how many examples it classifies correctly (or misclassifies, for that matter). We do the same for our test set so we can compare the two accuracies and see whether our model generalizes well to data it has not seen before.

In [7]:
# training set
yhat = model.predict(X_train)
correct = 0
for i in range(len(y_train)):
    if y_train[i] == np.argmax(tf.nn.softmax(yhat[i])):
        correct += 1
print('accuracy with training set: ' + str(correct/len(y_train)))

# test set
yhat = model.predict(X_test)
correct = 0
for i in range(len(y_test)):
    if y_test[i] == np.argmax(tf.nn.softmax(yhat[i])):
        correct += 1
print('accuracy with test set: ' + str(correct/len(y_test)))

accuracy with training set: 0.9945585634607537
accuracy with test set: 0.9821507200927905
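Because softmax preserves the ordering of the logits, the predicted class is the same whether or not softmax is applied first. A vectorized version of the accuracy check might therefore look like the sketch below. `accuracy_from_logits` is a hypothetical helper and the arrays are made-up demo values.

```python
import numpy as np

def accuracy_from_logits(logits, labels):
    """Fraction of rows whose highest logit matches the true label."""
    predicted = np.argmax(logits, axis=1)
    return float(np.mean(predicted == labels))

def softmax(z):
    """Row-wise softmax, shifted by the row maximum for stability."""
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

# Made-up logits for 4 samples and 3 classes; labels stored as floats, like y above.
logits_demo = np.array([[2.0, 0.1, -1.0],
                        [0.0, 3.0, 1.0],
                        [0.5, 0.2, 0.1],
                        [-1.0, 0.0, 4.0]])
labels_demo = np.array([0.0, 1.0, 2.0, 2.0])

acc = accuracy_from_logits(logits_demo, labels_demo)
same_argmax = np.array_equal(np.argmax(logits_demo, axis=1),
                             np.argmax(softmax(logits_demo), axis=1))
print(acc, same_argmax)
```

On the full dataset this avoids a Python-level loop over several hundred thousand rows.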


## Conclusion
On our training set, the model predicted ~99.5% of the labels correctly. On our test set, it predicted ~98.2% of the labels correctly. The gap of about 1.2 percentage points is small, so the model does seem to generalize well to unseen data.

In [9]:
model.save('my_model')



INFO:tensorflow:Assets written to: my_model/assets




## Error Analysis
We can run some code on our test set predictions to check whether there is a pattern in the letters our model has trouble predicting. We can use this information to tune our model further and get better results.

In [8]:
incorrect_count = [0] * 26
for i in range(len(y_test)):
    if y_test[i] != np.argmax(tf.nn.softmax(yhat[i])):
        incorrect_count[int(y_test[i])] += 1
for i in range(26):
    print(f'{i}: {incorrect_count[i]}')


0: 56
1: 50
2: 52
3: 119
4: 40
5: 11
6: 64
7: 80
8: 14
9: 97
10: 44
11: 35
12: 77
13: 97
14: 136
15: 64
16: 104
17: 63
18: 61
19: 26
20: 115
21: 40
22: 52
23: 40
24: 96
25: 29
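The counts above are easier to read once the class indices are mapped back to letters. The sketch below copies the printed counts, ranks them, and computes the total and mean. The test set size of 93113 is an assumption (25% of 372451 rows, rounded up), used only to cross-check against the reported test accuracy.

```python
import string
import numpy as np

# Misclassification counts per class index, copied from the output above.
incorrect_count = np.array([56, 50, 52, 119, 40, 11, 64, 80, 14, 97, 44, 35, 77,
                            97, 136, 64, 104, 63, 61, 26, 115, 40, 52, 40, 96, 29])
letters = np.array(list(string.ascii_uppercase))   # index 0 -> 'A', ..., 25 -> 'Z'

order = np.argsort(incorrect_count)                # ascending by count
worst_three = list(letters[order[::-1][:3]])       # most misclassified letters
best_two = list(letters[order[:2]])                # least misclassified letters

total = int(incorrect_count.sum())
mean_per_letter = total / len(incorrect_count)

n_test = 93113                                     # assumption: 25% of 372451, rounded up
implied_accuracy = 1 - total / n_test

print(worst_three, best_two, total, round(mean_per_letter, 1), implied_accuracy)
```

Under that assumption, the implied accuracy agrees with the test accuracy printed earlier, which suggests the per-letter counts account for all test errors.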


From the looks of it, the misclassifications are spread fairly evenly among the letters, apart from a few outliers.
<br>
The model has the most trouble with the letters 'O', 'D', and 'U', with 136, 119, and 115 misclassifications respectively. For comparison, the mean per letter is about 64 (1662 misclassifications across 26 letters). Note that these are raw counts and are not normalized by how often each letter appears in the test set.
<br>
On the other hand, it almost always correctly classified 'F' and 'I', with only 11 and 14 misclassifications.
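Raw counts depend on how many test images each letter has, so a per-letter error rate can be a fairer comparison, and a confusion matrix also shows which letters get mistaken for which. The sketch below is one possible way to compute both with NumPy. `per_class_error_rate` is a hypothetical helper and the labels are made-up demo values, not results from the model.

```python
import numpy as np

def per_class_error_rate(y_true, y_pred, n_classes):
    """Return the confusion matrix and the error rate of each true class."""
    y_true = np.asarray(y_true, dtype=int)
    y_pred = np.asarray(y_pred, dtype=int)
    confusion = np.zeros((n_classes, n_classes), dtype=int)
    # Row = true class, column = predicted class.
    np.add.at(confusion, (y_true, y_pred), 1)
    totals = confusion.sum(axis=1)
    wrong = totals - np.diag(confusion)
    # Classes with no samples get an error rate of 0 instead of a division by zero.
    rates = np.divide(wrong, totals, out=np.zeros(n_classes), where=totals > 0)
    return confusion, rates

# Made-up example: classes 0 and 1 each have one error, but class 1 has fewer samples.
y_true_demo = [0, 0, 0, 0, 1, 1, 2, 2, 2, 2]
y_pred_demo = [0, 0, 0, 1, 1, 0, 2, 2, 2, 2]
confusion, rates = per_class_error_rate(y_true_demo, y_pred_demo, n_classes=3)
print(confusion)
print(rates)
```

In the notebook, this could be called with `y_test`, `np.argmax(yhat, axis=1)`, and `n_classes=26`.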