# Assignment 1 (20 marks)

Letter Image Recognition

In this assignment, you will use neural networks (MLP) to classify each of a large number of black-and-white rectangular pixel displays as one of the 26 capital letters of the English alphabet.

### A: Source Information

Creator: David J. Slate Odesta Corporation; 1890 Maple Ave; Suite 115; Evanston, IL 60201  
Donor: David J. Slate (dave@math.nwu.edu) (708) 491-3867  
Date: January, 1991  


### B: Relevant Information

The character images were based on 20 different fonts, and each letter within these 20 fonts was randomly distorted to produce a file of 20,000 unique stimuli. Each stimulus was converted into 16 primitive numerical attributes (statistical moments and edge counts), which were then scaled to fit into a range of integer values from 0 through 15. We typically train on the first 16,000 items and then use the resulting model to predict the letter category for the remaining 4,000. See the article cited below for more details:

P. W. Frey and D. J. Slate (Machine Learning Vol 6 #2 March 91): "Letter Recognition Using Holland-style Adaptive Classifiers".

### C: Attribute Information

<img src="./Att_Letter.png" width = "600" height = "300" align=center />

The following code uses Python's `csv` module to load the data and prints the first row and the total number of rows.

In [2]:
import sklearn
import csv

In [12]:
with open('Letter.csv') as f:
    reader = csv.reader(f)
    print("Header line: %s" % next(reader))
    annotated_data = [r for r in reader]
print(annotated_data[0])
print("Total number of rows:", len(annotated_data))

Header line: ['lettr', 'x-box', 'y-box', 'width', 'high', 'onpix', 'x-bar', 'y-bar', 'x2bar', 'y2bar', 'xybar', 'x2ybr', 'xy2br', 'x-ege', 'xegvy', 'y-ege', 'yegvx']
['T', '2', '8', '3', '5', '1', '8', '13', '0', '6', '6', '10', '8', '0', '8', '0', '8']
Total number of rows: 20000


In [88]:
Header=['lettr', 'x-box', 'y-box', 'width', 'high', 'onpix', 'x-bar', 'y-bar', 'x2bar', 'y2bar', 'xybar', 'x2ybr', 'xy2br', 'x-ege', 'xegvy', 'y-ege', 'yegvx']

17


## Exercise 1 (1 mark) - Class Distribution
Print the number of instances of each class label `lettr` (A-E).

* A: ?
* B: ?
* C: ?
* D: ?
* E: ?

In [72]:
import string
from collections import Counter

# Count how often each letter appears in the label column (first field).
counts = Counter(row[0] for row in annotated_data)
for letter in string.ascii_uppercase:
    print("%s: %s" % (letter, counts[letter]))

A: 789
B: 766
C: 736
D: 805
E: 768
F: 775
G: 773
H: 734
I: 755
J: 747
K: 739
L: 761
M: 792
N: 783
O: 753
P: 803
Q: 783
R: 758
S: 748
T: 796
U: 813
V: 764
W: 752
X: 787
Y: 786
Z: 734


## Exercise 2 (1 mark) - Split the Data
Split the data into a training set, a dev-test set, and a test set. Use the following ratio for splitting the data:

* Training set: 80%
* Dev-test set: 10%
* Test set: 10%

In [73]:
#import random  
#random.seed(1234)  
#random.shuffle(annotated_data)

In [74]:
from sklearn.model_selection import train_test_split

# train_test_split shuffles the rows by default before splitting.
# First hold out 20%, then halve that into dev-test and test (10% each).
training_set, other_sets = train_test_split(annotated_data, test_size=0.2)
dev_test_set, test_set = train_test_split(other_sets, test_size=0.5)

print(len(training_set)/len(annotated_data))
print(len(dev_test_set)/len(annotated_data))
print(len(test_set)/len(annotated_data))

0.8
0.1
0.1


## Exercise 3 (1 mark) - Check that the data are balanced
Print the proportion of rows whose class label `lettr` is one of A-E in each partition, and check that the values are similar.

In [75]:
# Fraction of training rows whose label is one of A-E.
letter_count = sum(1 for row in training_set if row[0] in "ABCDE")
average = letter_count/len(training_set)
print(average)

0.1924375


In [76]:
# Fraction of dev-test rows whose label is one of A-E.
letter_count = sum(1 for row in dev_test_set if row[0] in "ABCDE")
average = letter_count/len(dev_test_set)
print(average)

0.181


In [77]:
# Fraction of test rows whose label is one of A-E.
letter_count = sum(1 for row in test_set if row[0] in "ABCDE")
average = letter_count/len(test_set)
print(average)

0.2115
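The three proportions above differ (roughly 0.19, 0.18 and 0.21) because a purely random split does not control class proportions. One way to make the partitions agree more closely is a stratified split, which `train_test_split` supports through its `stratify` argument. The following is a minimal standard-library sketch of the idea, not the method used in this notebook; the function name `stratified_split` and the synthetic `toy_rows` are assumptions made for illustration.

```python
import random
from collections import defaultdict

def stratified_split(rows, fractions=(0.8, 0.1, 0.1), seed=0):
    """Split rows into partitions so each class keeps roughly the same share.

    rows: list of records whose first field is the class label.
    (Hypothetical helper, written only for this illustration.)
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[0]].append(row)

    parts = [[] for _ in fractions]
    for label_rows in by_label.values():
        rng.shuffle(label_rows)
        n = len(label_rows)
        start = 0
        cumulative = 0.0
        for i, frac in enumerate(fractions):
            cumulative += frac
            # The last partition takes whatever remains, so no row is dropped.
            end = n if i == len(fractions) - 1 else round(cumulative * n)
            parts[i].extend(label_rows[start:end])
            start = end
    for part in parts:
        rng.shuffle(part)
    return parts

# Synthetic stand-in for annotated_data: 26 letters x 100 rows each.
toy_rows = [[chr(ord("A") + i), str(j)] for i in range(26) for j in range(100)]
train, dev, test = stratified_split(toy_rows)

for name, part in (("train", train), ("dev", dev), ("test", test)):
    share = sum(1 for r in part if r[0] in "ABCDE") / len(part)
    print(name, len(part), round(share, 4))
```

Because each letter is split separately, the A-E share is the same in every partition of the toy data, which is the property Exercise 3 checks for.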


## Exercise 4 (3 marks) - Neural Network MLP in Scikit-Learn 
Train an `sklearn` `MLPClassifier` with default settings (`random_state=0`) on the training set and report the accuracy on the training and test sets.

In [101]:
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score
import numpy as np

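# Separate the label column from the 16 numeric feature columns.
training_set_letters = np.array(training_set)[:,0]
training_set_data = np.array(training_set)[:,1:].astype(float)

clf = MLPClassifier(random_state=0)
# Note: this further splits the training set (75%/25% by default). The model
# is fitted on x_train, and the "training set" score below is measured on the
# held-out x_test rows rather than on the rows used for fitting.
x_train, x_test, y_train, y_test = train_test_split(training_set_data,training_set_letters)
clf.fit(x_train, y_train)

x_test_pred = clf.predict(x_test)
print("Training set prediction score:")
print(accuracy_score(y_test, x_test_pred))  # accuracy_score(y_true, y_pred)

test_set_letters = np.array(test_set)[:,0]
test_set_data = np.array(test_set)[:,1:].astype(float)

test_pred = clf.predict(test_set_data)
print("Test set prediction score:")
print(accuracy_score(test_set_letters, test_pred))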


Training set prediction score:
0.89825
Test set prediction score:
0.9075


## Exercise 5 (8 marks) - Neural Network MLP with Scaling of the Data
Neural networks expect all input features to vary in a similar way, and ideally to have a mean of 0 and a variance of 1.
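Written out, the scaling targeted by the steps in this exercise is the usual standardization, shown here as a sketch. The symbols are assumptions introduced for illustration: $x_{ij}$ is the value of feature $j$ for training row $i$, and $N$ is the number of training rows. The $1/N$ form of the standard deviation matches the default behaviour of NumPy's `std`.

```latex
z_{ij} = \frac{x_{ij} - \mu_j}{\sigma_j},
\qquad
\mu_j = \frac{1}{N}\sum_{i=1}^{N} x_{ij},
\qquad
\sigma_j = \sqrt{\frac{1}{N}\sum_{i=1}^{N} \left(x_{ij} - \mu_j\right)^2}
```

The same $\mu_j$ and $\sigma_j$, computed on the training set, are then applied to the rows of every other partition.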

### 5.1 Compute the mean value per feature on the training set [1 mark]

In [106]:
for column in range(training_set_data.shape[1]):
    print("%s mean: %s" % (Header[column+1],training_set_data[:,column].mean()))

x-box mean: 4.0240625
y-box mean: 7.0431875
width mean: 5.1286875
high mean: 5.38125
onpix mean: 3.5175
x-bar mean: 6.8875625
y-bar mean: 7.515875
x2bar mean: 4.6114375
y2bar mean: 5.198875
xybar mean: 8.2895
x2ybr mean: 6.461625
xy2br mean: 7.921875
x-ege mean: 3.0325625
xegvy mean: 8.3413125
y-ege mean: 3.7068125
yegvx mean: 7.8024375


### 5.2 Compute the standard deviation of each feature on the training set [1 mark]

In [107]:
for column in range(training_set_data.shape[1]):
    print("%s std: %s" % (Header[column+1],training_set_data[:,column].std()))

x-box std: 1.908627385346273
y-box std: 3.301487897879341
width std: 2.0130206226821796
high std: 2.264927910000669
onpix std: 2.1904494401834524
x-bar std: 2.024764260005038
y-bar std: 2.3195900466192296
x2bar std: 2.680896619340953
y2bar std: 2.373594054250853
xybar std: 2.4884613217809917
x2ybr std: 2.6347252910645165
xy2br std: 2.074010965345892
x-ege std: 2.3245703223593277
xegvy std: 1.5331970771377532
y-ege std: 2.5697477677475953
yegvx std: 1.6118022703153603


### 5.3 Subtract the mean, and scale by the inverse standard deviation; afterward, mean=0 and std=1 [1 mark]

In [126]:
training_set_data_normalized = training_set_data.copy()
for column in range(training_set_data.shape[1]):
    array = training_set_data[:,column]
    array = (array - array.mean()) / array.std()
    training_set_data_normalized[:,column] = array
    
    print("%s mean: %0.1f, std: %0.1f" % (Header[column+1],array.mean(),array.std()))

x-box mean: -0.0, std: 1.0
y-box mean: -0.0, std: 1.0
width mean: 0.0, std: 1.0
high mean: 0.0, std: 1.0
onpix mean: -0.0, std: 1.0
x-bar mean: 0.0, std: 1.0
y-bar mean: -0.0, std: 1.0
x2bar mean: -0.0, std: 1.0
y2bar mean: -0.0, std: 1.0
xybar mean: -0.0, std: 1.0
x2ybr mean: 0.0, std: 1.0
xy2br mean: -0.0, std: 1.0
x-ege mean: -0.0, std: 1.0
xegvy mean: -0.0, std: 1.0
y-ege mean: 0.0, std: 1.0
yegvx mean: 0.0, std: 1.0


### 5.4 Use THE SAME transformation (using training mean and std) on the test set [1 mark]

In [127]:
test_set_data_normalized = test_set_data.copy()
for column in range(test_set_data.shape[1]):
    # Reuse the TRAINING mean and std so both sets share one transformation.
    train_column = training_set_data[:,column]
    array = (test_set_data[:,column] - train_column.mean()) / train_column.std()
    test_set_data_normalized[:,column] = array
    
    print("%s mean: %0.1f, std: %0.1f" % (Header[column+1],array.mean(),array.std()))

x-box mean: -0.0, std: 1.0
y-box mean: 0.0, std: 1.0
width mean: 0.0, std: 1.0
high mean: -0.0, std: 1.0
onpix mean: -0.0, std: 1.0
x-bar mean: 0.0, std: 1.0
y-bar mean: 0.0, std: 1.0
x2bar mean: -0.0, std: 1.0
y2bar mean: -0.0, std: 1.0
xybar mean: -0.0, std: 1.0
x2ybr mean: -0.0, std: 1.0
xy2br mean: -0.0, std: 1.0
x-ege mean: -0.0, std: 1.0
xegvy mean: -0.0, std: 1.0
y-ege mean: 0.0, std: 1.0
yegvx mean: 0.0, std: 1.0
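Steps 5.3 and 5.4 can also be written without a column loop. The sketch below is an illustration on synthetic data, not the notebook's own arrays: `train_X` and `test_X` are made-up stand-ins for `training_set_data` and `test_set_data`. It computes the statistics once on the training partition and applies them to both partitions. scikit-learn's `StandardScaler` follows the same fit-on-train, transform-on-test pattern.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-ins for the 16 integer features in the range 0..15.
train_X = rng.integers(0, 16, size=(800, 16)).astype(float)
test_X = rng.integers(0, 16, size=(100, 16)).astype(float)

# Per-feature statistics come from the training partition only.
mu = train_X.mean(axis=0)
sigma = train_X.std(axis=0)

train_Z = (train_X - mu) / sigma
test_Z = (test_X - mu) / sigma  # same mu and sigma, not the test set's own

# The training features are exactly standardized ...
print(np.allclose(train_Z.mean(axis=0), 0.0), np.allclose(train_Z.std(axis=0), 1.0))
# ... while the test features are only approximately so.
print(test_Z.mean(axis=0).round(2))
```

After this transformation the test-set means and standard deviations are close to, but generally not exactly, 0 and 1. That is expected, since the statistics were estimated on different rows.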


### 5.5 Train an `sklearn` `MLPClassifier` with default settings (`random_state=0`) using the scaled training set and report the accuracy on the scaled training and scaled test sets. [2 marks]

In [128]:
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score
import numpy as np

clf = MLPClassifier(random_state=0)
# As in Exercise 4, the "training set" score is measured on a held-out
# 25% split of the scaled training data, not on the rows used for fitting.
x_train, x_test, y_train, y_test = train_test_split(training_set_data_normalized,training_set_letters)
clf.fit(x_train, y_train)

x_test_pred = clf.predict(x_test)
print("Training set prediction score:")
print(accuracy_score(y_test, x_test_pred))

test_pred = clf.predict(test_set_data_normalized)
print("Test set prediction score:")
print(accuracy_score(test_set_letters, test_pred))

Training set prediction score:
0.9425
Test set prediction score:
0.941




### 5.6 Increase the number of iterations to 1000 to see whether the optimization has converged. [2 marks]

In [129]:
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score
import numpy as np

# max_iter defaults to 200; raising it to 1000 gives the optimizer more room to converge.
clf = MLPClassifier(random_state=0,max_iter=1000)
x_train, x_test, y_train, y_test = train_test_split(training_set_data_normalized,training_set_letters)
clf.fit(x_train, y_train)

x_test_pred = clf.predict(x_test)
print("Training set prediction score:")
print(accuracy_score(y_test, x_test_pred))

test_pred = clf.predict(test_set_data_normalized)
print("Test set prediction score:")
print(accuracy_score(test_set_letters, test_pred))

Training set prediction score:
0.94575
Test set prediction score:
0.9455
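Comparing accuracies is an indirect way to judge convergence. A more direct check, sketched below on synthetic data, is to look at the fitted model's `n_iter_` attribute and to see whether scikit-learn raised a `ConvergenceWarning` when `max_iter` was reached. The dataset from `make_classification` and the `results` dictionary are assumptions made so that the example runs on its own; they are not part of the assignment data.

```python
import warnings

from sklearn.datasets import make_classification
from sklearn.exceptions import ConvergenceWarning
from sklearn.neural_network import MLPClassifier

# Synthetic data, used only so the sketch is self-contained.
X, y = make_classification(n_samples=300, n_features=16, n_informative=8,
                           n_classes=3, random_state=0)

results = {}
for max_iter in (5, 1000):
    clf = MLPClassifier(random_state=0, max_iter=max_iter)
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        clf.fit(X, y)
    # True if the optimizer stopped because it ran out of iterations.
    hit_limit = any(issubclass(w.category, ConvergenceWarning) for w in caught)
    results[max_iter] = (clf.n_iter_, hit_limit)
    print(max_iter, clf.n_iter_, hit_limit, round(clf.loss_curve_[-1], 4))
```

If `n_iter_` is smaller than `max_iter` and no warning was raised, the optimizer stopped because the loss no longer improved by more than the tolerance, which is the usual sign of convergence.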


## Exercise 6 (2 marks) - KNN with different k values
Train KNN models with different k values (1-10), then report the best accuracy and its k value on the unscaled training/test data and on the scaled training/test data.

[code...]
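The code for this exercise is not reproduced here. Purely as an illustration of the kind of sweep the exercise describes, the sketch below tries k = 1 to 10 with scikit-learn's `KNeighborsClassifier` on synthetic data and keeps the k with the best held-out accuracy. The dataset, the variable names, and the choice of a dev split for selecting k are all assumptions, not the original solution.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data standing in for the letter features.
X, y = make_classification(n_samples=600, n_features=16, n_informative=8,
                           n_classes=4, random_state=0)
X_train, X_dev, y_train, y_dev = train_test_split(
    X, y, test_size=0.2, random_state=0)

scores = {}
for k in range(1, 11):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    # score() returns the mean accuracy on the given data.
    scores[k] = knn.score(X_dev, y_dev)

best_k = max(scores, key=scores.get)
print("best k:", best_k, "accuracy:", scores[best_k])
```

Running the same loop once on unscaled and once on scaled features gives the two comparisons the exercise asks for. Choosing k on a dev split rather than the test set keeps the test accuracy an unbiased estimate.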

## Exercise 7 (4 marks) - Analysis of Results
Analyse the results of all the classifiers from the previous exercises, and answer these questions. In all answers you must include any code that you need to implement to answer the questions, the output of the code, and an interpretation of the output that shows how it can be used to answer the questions.

1. (1 mark) Did you observe any overfitting in any of the classifiers? How did you determine whether they have overfitting?
2. (3 marks) Do we have too little training data, or do we have too much training data for these classifiers?
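One common way to approach both questions is a learning curve: train on increasing amounts of data and compare the accuracy on the rows used for fitting with the accuracy on held-out rows. The sketch below uses scikit-learn's `learning_curve` with a KNN classifier on synthetic data; the dataset and the choice of classifier are assumptions for illustration, and the same call works with any estimator.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import learning_curve
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data, used only so the sketch is self-contained.
X, y = make_classification(n_samples=600, n_features=16, n_informative=8,
                           n_classes=4, random_state=0)

# Five training-set sizes, each evaluated with 5-fold cross-validation.
sizes, train_scores, val_scores = learning_curve(
    KNeighborsClassifier(n_neighbors=3), X, y,
    train_sizes=np.linspace(0.2, 1.0, 5), cv=5)

for n, tr, va in zip(sizes, train_scores.mean(axis=1), val_scores.mean(axis=1)):
    print(n, round(tr, 3), round(va, 3), "gap:", round(tr - va, 3))
```

A large, persistent gap between the training and validation accuracy suggests overfitting. A validation accuracy that is still rising at the largest size suggests more data would help, while a curve that has flattened suggests additional data adds little.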


# Submission of Results

Your submission should consist of this Jupyter notebook with all your code and explanations inserted in the notebook. The notebook should contain the output of the runs so that it can be read by the assessor without needing to run the code.

For an overview of Jupyter notebooks (how to install and run them, why they are a good idea, and some key features), see the DataCamp tutorial "Jupyter Notebook Tutorial: The Definitive Guide": https://www.datacamp.com/community/tutorials/tutorial-jupyter-notebook

Late submissions will have a penalty of **4 marks deduction per day late**.

Each question specifies a mark. The final mark of the assignment is the sum of all the individual marks, after applying any deductions for late submission.

By submitting this assignment you are acknowledging that this is your own work. Any submissions that break the code of academic honesty will be penalised as per the [academic honesty policy](https://staff.mq.edu.au/work/strategy-planning-and-governance/university-policies-and-procedures/policies/academic-honesty).