# Handwritten digit recognition using kNN


# Importing the dataset

The dataset is a modified version of the “Optical Recognition of Handwritten Digits Data Set” by E. Alpaydin and
C. Kaynak (Department of Computer Engineering, Bogazici University, 80815 Istanbul, Turkey), retrieved
from the UCI Machine Learning Repository (http://archive.ics.uci.edu/ml) on October 3, 2010. You can get the dataset here: https://github.com/TarunNanduri/Artificial-Intelligence/kNN-DigitRecognition/

In [1]:
from os import listdir
for name in listdir('./'):
    print(name)

.ipynb_checkpoints
kNN-DigitRecognition.ipynb
testDigits
trainingDigits


In [2]:
files = listdir('./trainingDigits')
for i in range(0,5):
    print(files[i])

0_0.txt
0_1.txt
0_10.txt
0_100.txt
0_101.txt


The files are plain text, and each digit has nearly 200 files. The first character of each file name is the digit the file represents.
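
To sanity-check the "nearly 200 files per digit" figure, we can count the files per digit by parsing the label out of each file name. This is only a sketch: the helper name `count_per_digit` is made up here, the `<digit>_<index>.txt` naming pattern is an assumption inferred from the listing above, and the sample list is illustrative rather than the real directory contents.

```python
from collections import Counter

def count_per_digit(filenames):
    """Count files per digit label, assuming names like '<digit>_<index>.txt'."""
    counts = Counter()
    for name in filenames:
        if not name.endswith(".txt"):
            continue  # skip anything that is not a digit file
        label = int(name.split("_")[0])  # text before the first underscore
        counts[label] += 1
    return counts

# Illustrative names only; in the notebook you would pass listdir('./trainingDigits')
sample = ["0_0.txt", "0_1.txt", "0_10.txt", "1_0.txt", "9_3.txt"]
print(count_per_digit(sample))
```

Passing the real directory listing instead of `sample` gives one count per digit, which also shows whether the classes are roughly balanced.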

# Let's check what a file looks like

In [3]:
# Open the file in a with-block so it is closed automatically
with open("./trainingDigits/0_13.txt", "r") as digit:
    for line in digit:
        print(line)

00000000000000011100000000000000

00000000000001111110000000000000

00000000000011111111000000000000

00000000000001111111100000000000

00000000000011111111110000000000

00000000000111111111110000000000

00000001100111111001111000000000

00000001111111110001111100000000

00000011111111000000011100000000

00000011111111000000011100000000

00000011111111000000001100000000

00000011111111000000000110000000

00000011111111000000000110000000

00000011111111000000000110000000

00000011100110000000000110000000

00000001110011000000000111000000

00000001100000000000000111000000

00000001110000000000000111000000

00000000111000000000000111100000

00000001111000000000000111000000

00000001111000000000001111000000

00000000111000000000011110000000

00000000011100000000001111000000

00000000011110000000011111000000

00000000011111000000111100000000

00000000001110000011111100000000

00000000000111111111111100000000

00000000000111111111111000000000

00000000000111111111111000000000

00000000000011

# Splitting data

The data is already split into "testDigits" and "trainingDigits". So, we just need to convert both into a form suitable for building a model.

## Pre-processing dataset

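Each sample in our dataset is a 32x32 image made of 0s and 1s. To train the model, we need to flatten each 32x32 image into a single vector of 1024 values (32 x 32 = 1024). The function below returns that vector as a plain Python list, which scikit-learn accepts and converts to a NumPy array internally.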

In [4]:
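def img2vector(filename):
    # Flatten a 32x32 text image of '0'/'1' characters into a list of 1024 ints
    imgVector = []
    # The with-block closes the file once all 32 rows have been read
    with open(filename) as file:
        for i in range(32):
            lineStr = file.readline()
            for j in range(32):
                imgVector.append(int(lineStr[j]))
    return imgVector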
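
To check that the flattening keeps the pixels in row-major order, we can run a round trip on an in-memory image. The helpers `flatten_rows` and `unflatten` are hypothetical names introduced for this sketch, and they work on a list of strings instead of a file name so the example runs without the dataset.

```python
def flatten_rows(rows, size=32):
    """Flatten `size` strings of `size` '0'/'1' characters into one list of ints."""
    if len(rows) < size:
        raise ValueError(f"expected at least {size} rows, got {len(rows)}")
    vector = []
    for row in rows[:size]:
        row = row.rstrip("\n")
        if len(row) < size:
            raise ValueError("row is shorter than expected")
        vector.extend(int(ch) for ch in row[:size])
    return vector

def unflatten(vector, size=32):
    """Rebuild the rows of text from a flat vector (inverse of flatten_rows)."""
    return ["".join(str(v) for v in vector[r * size:(r + 1) * size])
            for r in range(size)]

# Synthetic image: a vertical bar in columns 15 and 16
rows = ["0" * 15 + "11" + "0" * 15 for _ in range(32)]
vec = flatten_rows(rows)
print(len(vec))                # 1024
print(unflatten(vec) == rows)  # True
```

The length check matters in practice: a truncated file would otherwise fail with an unhelpful `IndexError` partway through the loop.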

In [5]:
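trainData = './trainingDigits'
trainingFiles = listdir(trainData)
trainDigits = []
trainLabels = []
for file in trainingFiles:
    # Each file becomes one 1024-element feature vector
    trainDigits.append(img2vector(trainData+'/'+file))
    # The label is the first character of the file name, e.g. '0_13.txt' -> 0
    trainLabels.append(int(file[0]))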

### Our training data is ready. Now, let's prepare the test data.

In [6]:
testData = './testDigits'
testingFiles = listdir(testData)
testDigits = []
testLabels = []
# Same preprocessing as for the training set
for file in testingFiles:
    testDigits.append(img2vector(testData+'/'+file))
    testLabels.append(int(file[0]))

### The data is now ready, so let's build our model

# Building the model

In [7]:
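from sklearn.neighbors import KNeighborsClassifier

# Default settings: 5 neighbours, uniform weights, Euclidean distance
knn = KNeighborsClassifier()

knn.fit(trainDigits, trainLabels)
# Accuracy measured on the training data itself
trainScore = knn.score(trainDigits, trainLabels)

print(trainScore)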

0.9819027921406411


Whoa! That is a great accuracy, but it is measured on the same data the model was fitted on. Now let's see how our model behaves on unseen data.
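
For intuition about what happens at prediction time, here is a minimal from-scratch sketch of k-nearest-neighbours voting. It assumes Euclidean distance and an unweighted majority vote, which match the default settings of `KNeighborsClassifier`. The function name `knn_predict` and the toy data are made up for illustration, and this is not how scikit-learn implements the search internally (it can use faster index structures and may break ties differently).

```python
from collections import Counter
import math

def knn_predict(train_X, train_y, query, k=5):
    """Predict a label by majority vote among the k closest training vectors."""
    # Distance from the query to every training sample
    distances = [(math.dist(x, query), label) for x, label in zip(train_X, train_y)]
    # Keep the k nearest samples
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # The most frequent label among the neighbours wins
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy 2-D data: class 0 near the origin, class 1 near (5, 5)
train_X = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
train_y = [0, 0, 0, 1, 1, 1]
print(knn_predict(train_X, train_y, [0.2, 0.1], k=3))  # 0
print(knn_predict(train_X, train_y, [5.5, 5.2], k=3))  # 1
```

With our digits, each vector would have 1024 entries instead of 2, but the voting logic is the same.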

## Testing the model

In [8]:
pred = knn.predict(testDigits)

## Evaluating the model

In [9]:
from sklearn.metrics import classification_report, confusion_matrix
print("Confusion Matrix:\n")
print(confusion_matrix(testLabels,pred))
print("\nClassification Report:\n")
print(classification_report(testLabels,pred))


Confusion Matrix:

[[ 87   0   0   0   0   0   0   0   0   0]
 [  0  96   0   0   0   0   0   1   0   0]
 [  0   0  92   0   0   0   0   0   0   0]
 [  0   0   0  83   0   0   0   0   1   1]
 [  1   0   0   0 113   0   0   0   0   0]
 [  0   0   0   1   0 106   0   0   0   1]
 [  0   0   0   0   0   0  87   0   0   0]
 [  0   0   0   0   0   0   0  96   0   0]
 [  0   5   0   2   0   0   1   0  83   0]
 [  0   1   0   1   0   1   0   1   0  85]]

Classification Report:

              precision    recall  f1-score   support

           0       0.99      1.00      0.99        87
           1       0.94      0.99      0.96        97
           2       1.00      1.00      1.00        92
           3       0.95      0.98      0.97        85
           4       1.00      0.99      1.00       114
           5       0.99      0.98      0.99       108
           6       0.99      1.00      0.99        87
           7       0.98      1.00      0.99        96
           8       0.99      0.91     

In [10]:
from sklearn.metrics import accuracy_score
accuracy_score(testLabels,pred)

0.9809725158562368
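
We can cross-check this number against the confusion matrix: accuracy is the sum of the diagonal (correct predictions) divided by the total number of test samples. The sketch below copies the matrix printed above; the variable names are my own.

```python
# Confusion matrix copied from the output above
# (rows: true digit, columns: predicted digit)
cm = [
    [87, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 96, 0, 0, 0, 0, 0, 1, 0, 0],
    [0, 0, 92, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 83, 0, 0, 0, 0, 1, 1],
    [1, 0, 0, 0, 113, 0, 0, 0, 0, 0],
    [0, 0, 0, 1, 0, 106, 0, 0, 0, 1],
    [0, 0, 0, 0, 0, 0, 87, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 96, 0, 0],
    [0, 5, 0, 2, 0, 0, 1, 0, 83, 0],
    [0, 1, 0, 1, 0, 1, 0, 1, 0, 85],
]

correct = sum(cm[i][i] for i in range(len(cm)))  # diagonal entries
total = sum(sum(row) for row in cm)              # all test samples
accuracy = correct / total
print(correct, total, accuracy)

# Per-digit recall: correct predictions for a digit divided by its row total
recall = [cm[i][i] / sum(cm[i]) for i in range(len(cm))]
worst = min(range(len(cm)), key=lambda i: recall[i])
print(worst, round(recall[worst], 2))
```

That is 928 correct predictions out of 946 test samples. Digit 8 has the lowest recall, with five of its samples predicted as 1.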

## Our model has an accuracy of about 0.98 on the test set