## Letter Image Recognition using MLP, KNN and CNN

### A: Source Information

Creator: David J. Slate Odesta Corporation; 1890 Maple Ave; Suite 115; Evanston, IL 60201  
Donor: David J. Slate (dave@math.nwu.edu) (708) 491-3867  
Date: January, 1991  


### B: Relevant Information

The character images are based on 20 different fonts, and each letter within these 20 fonts has been randomly distorted to produce a file of 20,000 unique stimuli. Each stimulus was converted into 16 primitive numerical attributes (statistical moments and edge counts), which were then scaled to fit into a range of integer values from 0 through 15. We typically train on the first 16000 items and then use the resulting model to predict the letter category for the remaining 4000 (**NOT in this assignment**). See the article cited below for more details:

P. W. Frey and D. J. Slate (Machine Learning Vol 6 #2 March 91): "Letter Recognition Using Holland-style Adaptive Classifiers".

In [1]:
import sklearn
import csv
import pandas as pd
import numpy as np

In [2]:
with open('Letter.csv') as f:
    reader = csv.reader(f)
    print("Header line: %s" % next(reader))
    annotated_data = [r for r in reader]
print(annotated_data[0])
print("Total number of rows:", len(annotated_data))

df = pd.DataFrame(annotated_data,columns=['lettr', 'x-box', 'y-box', 'width', 'high', 'onpix', 'x-bar', 'y-bar', 'x2bar', 'y2bar', 'xybar', 'x2ybr', 'xy2br', 'x-ege', 'xegvy', 'y-ege', 'yegvx'])


Header line: ['lettr', 'x-box', 'y-box', 'width', 'high', 'onpix', 'x-bar', 'y-bar', 'x2bar', 'y2bar', 'xybar', 'x2ybr', 'xy2br', 'x-ege', 'xegvy', 'y-ege', 'yegvx']
['T', '2', '8', '3', '5', '1', '8', '13', '0', '6', '6', '10', '8', '0', '8', '0', '8']
Total number of rows: 20000
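
As a possible alternative to the csv.reader loop above, pandas.read_csv parses the header itself and infers integer dtypes for the 16 attributes, which would make the later astype(int) calls unnecessary. The sketch below is not part of the original notebook. It reads a two-row in-memory sample instead of Letter.csv so that it runs on its own; the first data row is the one printed above, the second is only an illustrative stand-in.

```python
import io
import pandas as pd

# In-memory stand-in for Letter.csv: header plus two rows.
# The second data row is an illustrative assumption, not taken from the notebook output.
sample_csv = io.StringIO(
    "lettr,x-box,y-box,width,high,onpix,x-bar,y-bar,x2bar,y2bar,"
    "xybar,x2ybr,xy2br,x-ege,xegvy,y-ege,yegvx\n"
    "T,2,8,3,5,1,8,13,0,6,6,10,8,0,8,0,8\n"
    "I,5,12,3,7,2,10,5,5,4,13,3,9,2,8,4,10\n"
)

# read_csv uses the first line as column names and infers numeric dtypes
sample_df = pd.read_csv(sample_csv)

print(sample_df.shape)
print(sample_df.dtypes.value_counts())
```

With the real file, the same call would simply take the path 'Letter.csv' in place of the in-memory buffer.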


# Exploratory Data Analysis (EDA)
- Class Distribution 
- Data Separation
- Data Balance
- Standardization

## Class Distribution
- Compute and print the number and percentage of stimuli for each of the five letters A-E (class label lettr).

In [3]:
# Count and percentage of stimuli for each of the letters A-E
total = df['lettr'].count()
for letter in 'ABCDE':
    n = df[df['lettr'] == letter]['lettr'].count()
    print(letter + ': is', n/total*100, '%', 'with Total of: ', n)

A: is 3.945 % with Total of:  789
B: is 3.83 % with Total of:  766
C: is 3.6799999999999997 % with Total of:  736
D: is 4.025 % with Total of:  805
E: is 3.84 % with Total of:  768
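
The same class distribution can also be obtained in one step with pandas value_counts. The following is a minimal sketch on a toy label column (the values are made up for illustration); on the real data the Series would be df['lettr'].

```python
import pandas as pd

# Toy label column standing in for df['lettr'] (illustrative values only)
labels = pd.Series(list("AABBBCDEEE"), name="lettr")

counts = labels.value_counts()                        # absolute count per class
percent = labels.value_counts(normalize=True) * 100   # share per class in percent

# Combine both views and keep only the letters A-E, in alphabetical order
summary = pd.DataFrame({"count": counts, "percent": percent}).loc[list("ABCDE")]
print(summary.round(2))
```

Because value_counts covers every class at once, the same table would also show whether any of the remaining 21 letters is over- or under-represented.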


## Split the data for training and testing purpose.
Split the data into a training set, a dev-test set, and a test set. Use the following ratio for splitting the data:

* Training set: 80%
* Dev-test set: 10%
* Test set: 10%

In [4]:
import random  
random.seed(1234)  
random.shuffle(annotated_data)  


The three lines of code above randomize the order of the data.  
After that, the first 80% is used as the training set, the next 10% as the dev-test set, and the last 10% as the test set.
- At this point annotated_data is shuffled, so I rebuild the DataFrame from it to get a randomized dataset before splitting.
- I simply take the first 80% as train, the next 10% as Dev_test and the last 10% as test.
- Each partition is then split into features (x) and labels (y).

In [5]:

df = pd.DataFrame(annotated_data,columns=['lettr', 'x-box', 'y-box', 'width', 'high', 'onpix', 'x-bar', 'y-bar',
                                          'x2bar', 'y2bar', 'xybar', 'x2ybr', 'xy2br', 'x-ege', 'xegvy', 'y-ege', 'yegvx'])


train = df.iloc[:int(len(df)*0.8)]
Dev_test = df.iloc[int(len(df)*0.8):int(len(df)*0.9)]
test = df.iloc[int(len(df)*0.9):int(len(df))]

print('Training set:', train.shape)
print('Dev-Test set:',Dev_test.shape)
print('Test set:',test.shape)

Training set: (16000, 17)
Dev-Test set: (2000, 17)
Test set: (2000, 17)
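
The notebook splits by shuffling and slicing. As an alternative sketch (not the method used here), scikit-learn's train_test_split can produce the same 80/10/10 ratio while keeping the class proportions equal in every partition via its stratify argument. The data below is synthetic and all variable names are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 26 letter classes with 20 samples each, 16 features per sample
letters = [chr(ord("A") + i) for i in range(26)]
y_all = np.repeat(letters, 20)
X_all = np.arange(len(y_all) * 16).reshape(len(y_all), 16) % 16

# First cut: 80% training, 20% held out (stratified by letter)
X_tr, X_hold, y_tr, y_hold = train_test_split(
    X_all, y_all, test_size=0.2, stratify=y_all, random_state=1234)

# Second cut: split the held-out part evenly into dev-test and test
X_dev, X_te, y_dev, y_te = train_test_split(
    X_hold, y_hold, test_size=0.5, stratify=y_hold, random_state=1234)

print(X_tr.shape, X_dev.shape, X_te.shape)
```

Stratifying removes the need to check the balance afterwards, at the cost of no longer matching the plain "first 80%" rule described above.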


In [6]:
#Training set
x_train = train.drop(columns=['lettr'])
y_train = train['lettr']

#Validation set
x_valid = Dev_test.drop(columns=['lettr'])
y_valid = Dev_test['lettr']

#Test set
x_test = test.drop(columns=['lettr'])
y_test = test['lettr']

## Encoding 
- Use one-hot encoding to encode the labels
- The encoded labels are only used by the CNN

In [7]:
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Fit both encoders on the training labels only, then reuse them for the test labels
label_encoder = LabelEncoder()
# sparse=False returns a dense array (the parameter is named sparse_output in scikit-learn >= 1.2)
onehot_encoder = OneHotEncoder(sparse=False)

#Encode Y_train
train_integer_encoded = label_encoder.fit_transform(train['lettr'])
train_integer_encoded = train_integer_encoded.reshape(len(train_integer_encoded), 1)
y_encoded_train = onehot_encoder.fit_transform(train_integer_encoded)

#Encode Y_test with the already-fitted encoders so the column order matches the training labels
test_integer_encoded = label_encoder.transform(test['lettr'])
test_integer_encoded = test_integer_encoded.reshape(len(test_integer_encoded), 1)
y_encoded_test = onehot_encoder.transform(test_integer_encoded)

In [8]:
# Example of Encoded_labels
y_encoded_test

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])
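
To make the encoding step concrete, here is a small sketch of the label-to-one-hot mapping on a handful of made-up labels. It fits a LabelEncoder on the training labels only and builds the one-hot rows with numpy, which is an assumption of this example rather than the notebook's own approach.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Illustrative labels; the real ones come from train['lettr'] and test['lettr']
train_labels = np.array(["T", "I", "D", "N", "G", "T"])
test_labels = np.array(["G", "T", "D"])

# Fit on the training labels; classes_ is stored in sorted order
encoder = LabelEncoder().fit(train_labels)
train_idx = encoder.transform(train_labels)
test_idx = encoder.transform(test_labels)   # reuse the fitted mapping

# Row i of the identity matrix is the one-hot vector for class index i
n_classes = len(encoder.classes_)
y_onehot_train = np.eye(n_classes)[train_idx]
y_onehot_test = np.eye(n_classes)[test_idx]

print(encoder.classes_)
print(y_onehot_test)
```

Reusing the fitted encoder guarantees that a given column means the same letter in both arrays, even if a letter happens to be missing from one partition.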

## Check that the data are balanced after splitting
- Print the percentage of class label lettr (A-E) in each partition, and check that they are similar.

Training set:  

In [9]:
# Share of each letter A-E in the training set
for letter in 'ABCDE':
    n = train[train['lettr'] == letter]['lettr'].count()
    print(letter + ': ', n/train['lettr'].count()*100, '%', 'total of: ', n)


A:  3.9375 % total of:  630
B:  3.7562499999999996 % total of:  601
C:  3.6062499999999997 % total of:  577
D:  4.075 % total of:  652
E:  3.88125 % total of:  621


Test set:

In [10]:
# Share of each letter A-E in the test set
for letter in 'ABCDE':
    n = test[test['lettr'] == letter]['lettr'].count()
    print(letter + ': ', n/test['lettr'].count()*100, '%', 'total of: ', n)


A:  4.2 % total of:  84
B:  3.75 % total of:  75
C:  4.15 % total of:  83
D:  4.2 % total of:  84
E:  3.15 % total of:  63


### Results 
- Comparing the percentages printed above, each of the letters A-E differs by at most about 0.75 percentage points between the training set and the test set (E: 3.88% vs 3.15%).
- Within each set, every one of these labels is within about 0.7 percentage points of the uniform share of 1/26 ≈ 3.85%.
- Therefore, it is safe to assume that the labels are approximately evenly distributed in the training, validation and test sets.
- The datasets are balanced.
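
The comparison can be made explicit with a few lines of pandas. This sketch copies the A-E percentages from the training-set and test-set outputs above and computes the gap between the two partitions as well as the distance from the uniform share of 1/26; the variable names are illustrative.

```python
import pandas as pd

# Percentages for A-E copied from the printed outputs above
train_pct = pd.Series({"A": 3.9375, "B": 3.75625, "C": 3.60625, "D": 4.075, "E": 3.88125})
test_pct = pd.Series({"A": 4.2, "B": 3.75, "C": 4.15, "D": 4.2, "E": 3.15})

# Share of each letter if all 26 classes were equally frequent
uniform_pct = 100 / 26

# Absolute gap between partitions, and distance of each partition from uniform
split_gap = (train_pct - test_pct).abs()
uniform_gap = pd.DataFrame({
    "train": (train_pct - uniform_pct).abs(),
    "test": (test_pct - uniform_pct).abs(),
})

print(split_gap.round(3))
print(uniform_gap.round(3))
print("largest train/test gap:", split_gap.idxmax(), round(split_gap.max(), 3))
```

The smaller test set (2000 rows) naturally shows larger deviations than the training set (16000 rows), since each percentage there rests on fewer samples.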


## Standardization
- Calculate the mean and standard deviation of each feature in X

###  Compute the mean value per feature (except for the class label) in the training set

In [11]:
# x_mean contains the mean of each feature
x_mean = x_train.astype(int).mean()



### Compute the standard deviation of each feature (except for the class label) in the training set

In [12]:
# x_std contains the standard deviation of each feature
x_std = x_train.astype(int).std()



###  [Scaling the training set] subtract the mean, and scale by inverse standard deviation. 

In [13]:
# x_Scale_train is a new DataFrame that stores the standardized feature values

x_Scale_train = ((x_train.astype(int)- x_mean )/ x_std)
x_Scale_train.head()

Unnamed: 0,x-box,y-box,width,high,onpix,x-bar,y-bar,x2bar,y2bar,xybar,x2ybr,xy2br,x-ege,xegvy,y-ege,yegvx
0,0.515738,1.200393,1.429765,1.164333,0.226582,-0.946731,0.218675,-0.973193,1.18147,0.691783,1.72871,0.516272,-0.020188,0.42753,0.120829,-1.12051
1,-0.53118,-0.312938,-0.05725,-0.606642,-1.143701,0.543206,1.084499,-0.973193,-1.332964,-0.51464,2.487998,0.035068,-0.448009,1.719285,-1.434805,0.119507
2,-0.007721,1.200393,0.438422,1.164333,1.140104,-0.450085,-0.64715,0.505332,-0.075747,-0.51464,0.210133,0.997476,2.118919,-2.155979,-0.656988,0.739515
3,-0.007721,-0.615604,-0.552922,-1.049386,-0.230179,0.046561,0.218675,0.135701,-0.075747,-0.51464,-0.169511,-0.92734,0.835455,1.073408,-0.656988,-1.740519
4,0.515738,-0.615604,0.934093,0.721589,0.226582,1.039852,0.218675,0.135701,-1.332964,-0.51464,0.589777,0.035068,2.54674,0.42753,-1.434805,0.119507


### Do the same (using training mean and std) with respect to the test set


In [14]:
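# Note: this standardizes the test set with its own mean and std; applying the training statistics instead would be (x_test.astype(int) - x_mean) / x_std
x_Scale_test=((x_test.astype(int)- x_test.astype(int).mean()) / x_test.astype(int).std())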
x_Scale_test.head()

Unnamed: 0,x-box,y-box,width,high,onpix,x-bar,y-bar,x2bar,y2bar,xybar,x2ybr,xy2br,x-ege,xegvy,y-ege,yegvx
18000,-1.067475,0.872716,-1.089248,1.116359,-1.141729,2.444594,-1.870841,2.000418,-0.906899,1.890593,0.188712,2.39455,-0.911053,-1.498661,-1.439239,0.128585
18001,1.466289,0.872716,1.868655,1.116359,1.515515,-1.37807,0.645985,-0.23323,1.229998,1.491565,1.314232,0.481015,-0.055403,-0.223476,0.521844,-2.320645
18002,-0.560722,-0.329579,-0.10328,-0.633077,-0.255981,0.055429,1.065456,-0.97778,-1.334279,-0.503573,0.939059,0.002631,1.228071,1.689301,-1.439239,0.128585
18003,0.959536,0.572142,-0.10328,-0.633077,-0.698855,1.966761,-2.290312,-0.97778,-0.05214,0.69351,-1.687154,0.481015,-0.483228,-0.223476,-0.262589,1.965507
18004,-0.560722,-0.029005,-0.10328,-0.195718,-0.255981,-1.37807,0.226514,0.511319,0.375239,-0.503573,0.939059,0.959399,-0.055403,0.414116,-1.047022,0.128585
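
Standardizing with the training statistics means computing the mean and standard deviation once, on the training features, and reusing those two vectors for every other partition. A minimal sketch on made-up numbers (the frames and names below are illustrative, not the notebook's variables):

```python
import pandas as pd

# Tiny stand-ins for the training and test feature tables
train_x = pd.DataFrame({"x-box": [2, 5, 4, 7], "y-box": [8, 12, 11, 11]})
test_x = pd.DataFrame({"x-box": [3, 6], "y-box": [9, 10]})

# Statistics come from the training set only
mu = train_x.mean()
sigma = train_x.std()   # pandas uses the sample standard deviation (ddof=1)

# The same mu and sigma are applied to both partitions
train_scaled = (train_x - mu) / sigma
test_scaled = (test_x - mu) / sigma

print(train_scaled.round(3))
print(test_scaled.round(3))
```

After this transformation the training columns have mean 0 and standard deviation 1 by construction, while the test columns are only approximately centred, which is expected.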


# Multilayer Perceptron (MLP) classifier
- Default hyperparameters
- random_state = 0
- max_iter = 300

In [15]:
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score
import time
start_time = time.time()

# Baseline: MLP trained on the unscaled features
clf = MLPClassifier(random_state=0,max_iter=300)
clf.fit(x_train,y_train)

predicted = clf.predict(x_test)

MLP_test_accuracy = accuracy_score(y_test, predicted)
MLP_train_accuracy = accuracy_score(y_train, clf.predict(x_train))

# The elapsed time covers fitting plus prediction on both sets
print("--- %s seconds ---" % (time.time() - start_time))
print('Test set accuracy:', MLP_test_accuracy)
print('Training set accuracy:', MLP_train_accuracy)

--- 18.253190279006958 seconds ---
Test set accuracy: 0.921
Training set accuracy: 0.9540625


In [16]:
start_time = time.time()
clf = MLPClassifier(random_state=0,max_iter=300)
clf.fit(x_Scale_train,y_train)
predicted = clf.predict(x_Scale_test)
print("--- %s seconds ---" % (time.time() - start_time))
print('Test set accuracy:', accuracy_score(y_test, predicted))
print('Training set accuracy:',accuracy_score(y_train, clf.predict(x_Scale_train)))

--- 18.454651832580566 seconds ---
Test set accuracy: 0.949
Training set accuracy: 0.9944375
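
One way to keep the scaling tied to the training data is to wrap the scaler and the classifier in a scikit-learn pipeline, so that the scaler is fitted only on whatever is passed to fit. The sketch below is an assumption-laden illustration on synthetic data, not a rerun of the experiment above. Note that StandardScaler divides by the population standard deviation (ddof=0), which differs slightly from the pandas .std() used in the notebook.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in: 3 classes, 16 integer features clipped to the range 0..15
centers = rng.integers(2, 14, size=(3, 16))
y_demo = np.repeat(np.arange(3), 100)
X_demo = np.clip(centers[y_demo] + rng.integers(-2, 3, size=(300, 16)), 0, 15)

# Shuffle, then use 80% for training and 20% for testing
order = rng.permutation(300)
tr, te = order[:240], order[240:]

# The scaler inside the pipeline is fitted on the training rows only
pipe = make_pipeline(StandardScaler(), MLPClassifier(random_state=0, max_iter=300))
pipe.fit(X_demo[tr], y_demo[tr])

test_accuracy = pipe.score(X_demo[te], y_demo[te])
print("Test set accuracy:", test_accuracy)
```

Calling pipe.score or pipe.predict on new data then applies the stored training mean and scale automatically.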


# Convolutional Neural Network
- Using TensorFlow (Keras)
- Sequential model 
- 4 Convolution layers
- 2 Dense layers 
- batch_size = 50 
- epochs = 10

In [17]:
from keras.models import Sequential
from keras.layers import Dense, Dropout, Flatten, Conv2D, MaxPool2D

Using TensorFlow backend.


In [18]:
# Reshape each row of 16 features into a 4x4 single-channel grid, the input format expected by Conv2D
x_train = x_train.values.reshape(-1,4,4,1)
x_test = x_test.values.reshape(-1,4,4,1)
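
To see what this reshape does to a single row, the sketch below applies it to the first row printed in the loading output and lays the column names out in the same 4x4 arrangement. It only uses numpy; the variable names are illustrative.

```python
import numpy as np

feature_names = ['x-box', 'y-box', 'width', 'high', 'onpix', 'x-bar', 'y-bar', 'x2bar',
                 'y2bar', 'xybar', 'x2ybr', 'xy2br', 'x-ege', 'xegvy', 'y-ege', 'yegvx']

# Feature values of the first row printed when the file was loaded (letter T)
row = np.array([2, 8, 3, 5, 1, 8, 13, 0, 6, 6, 10, 8, 0, 8, 0, 8])

# Same reshape as in the notebook: (samples, height, width, channels)
grid = row.reshape(-1, 4, 4, 1)

# Lay out the names the same way to see which feature lands in which cell
name_grid = np.array(feature_names).reshape(4, 4)

print(grid.shape)
print(grid[0, :, :, 0])
print(name_grid)
```

The reshape fills the grid row by row, so neighbouring cells correspond to columns that are adjacent in the CSV rather than to neighbouring pixels of the letter image.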

In [22]:

model = Sequential()
# Two 1x1 convolutions followed by two 2x2 convolutions on the 4x4x1 input
model.add(Conv2D(filters = 16, kernel_size = (1,1), padding = "same", activation = 'relu', input_shape = (4,4,1)))
model.add(Conv2D(filters = 16, kernel_size = (1,1), padding = "same", activation = 'relu'))
# input_shape is only needed on the first layer
model.add(Conv2D(filters = 5, kernel_size = (2,2), padding = "same", activation = 'relu'))
model.add(Conv2D(filters = 5, kernel_size = (2,2), padding = "same", activation = 'relu'))
model.add(Flatten())
model.add(Dense(256, activation = "relu"))
# One softmax output per letter class
model.add(Dense(26, activation = "softmax"))
model.compile(optimizer = 'adam', loss = "categorical_crossentropy", metrics=["accuracy"])

#timer
start_time = time.time()

# The test set is passed as validation data here
model.fit(x_train, y_encoded_train, batch_size = 50, epochs = 10, validation_data = (x_test, y_encoded_test))

print("--- %s seconds ---" % (time.time() - start_time))

Train on 16000 samples, validate on 2000 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
--- 12.04246735572815 seconds ---


## CNN with scaled datasets
- Here we check whether scaling makes a difference for the CNN.

In [20]:
x_Scale_train = x_Scale_train.values.reshape(-1,4,4,1)
x_Scale_test = x_Scale_test.values.reshape(-1,4,4,1)

In [23]:
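#timer
start_time = time.time()

# model is the same instance trained above, so this call continues from the weights already learned on the unscaled data
model.fit(x_Scale_train, y_encoded_train, batch_size = 50, epochs = 10, validation_data = (x_Scale_test, y_encoded_test))

print("--- %s seconds ---" % (time.time() - start_time))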

Train on 16000 samples, validate on 2000 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
--- 10.475039958953857 seconds ---


# *Final Results*

## MLP vs CNN

### Performance 
- The results above show that both MLP and CNN achieve high accuracy.
- MLP reaches 92% accuracy on the test set (unscaled features); CNN is similar at 90%.

### Computation Time

- The time difference between MLP and CNN is noticeable. This experiment uses only 16 variables, whereas real problems usually have more, so I expect the gap shown above to grow, with MLP taking longer to compute than CNN.

- MLP: about 18 seconds
- CNN: about 12 seconds
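
The timings above are single wall-clock measurements taken with time.time(). A small helper like the hypothetical timed function below (not part of the notebook) makes it easier to time individual steps, for example fitting and prediction separately; the last lines compute the ratio of the two durations printed in the unscaled runs.

```python
import time

def timed(fn, *args, **kwargs):
    """Run fn(*args, **kwargs) and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start

# Stand-in workload; in the notebook this could be clf.fit or model.fit
result, seconds = timed(sum, range(1_000_000))
print(result, round(seconds, 4))

# Wall-clock times copied from the unscaled MLP and CNN outputs above
mlp_seconds, cnn_seconds = 18.253190279006958, 12.04246735572815
ratio = mlp_seconds / cnn_seconds
print(round(ratio, 2))
```

A single run on one machine is a rough indication at best; repeating each measurement a few times would give a steadier comparison.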

## Standardised vs Unstandardised

- According to the assignment, the 16 variables were collected independently of one another, so their differing variances and potential bias need to be considered. To remove the differences in variance, we standardised the dataset.
- The benefit is visible in the results: both methods perform better on the scaled datasets.


## Potential problem / Overfitting


