## KNN Handwritten Digit Recognition

In this notebook, we will use the HWdigits dataset to demonstrate how to implement a simple K-nearest neighbors (KNN) algorithm for handwritten digit recognition using Python. The HWdigits dataset contains 1934 images of digits from 0 to 9, each with a size of 32 x 32 pixels. The images are stored as text files, where each line represents a row of pixels and each character represents a pixel value (0 or 1). Our goal is to use the pixel values as features to classify the images into their correct labels.
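To make the data format and the classification idea concrete before the full cell, here is a minimal sketch on made-up data. It assumes tiny 4 x 4 "images" written as strings of 0/1 characters; the names `text_to_vector` and `knn_predict` and the `bar`, `ring`, and `noisy_bar` patterns are hypothetical, not part of the HWdigits dataset or the notebook's own functions.

```python
import numpy as np
from collections import Counter

def text_to_vector(lines):
    """Flatten rows of '0'/'1' characters into a 1-D float vector."""
    return np.array([int(ch) for line in lines for ch in line.strip()], dtype=float)

def knn_predict(query, train_x, train_y, k):
    """Return the majority label among the k training vectors nearest to query."""
    # Euclidean distance from the query to every training row (broadcasting)
    distances = np.sqrt(((train_x - query) ** 2).sum(axis=1))
    nearest = np.argsort(distances)[:k]
    votes = Counter(train_y[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical 4 x 4 "images": a vertical bar (label 1) and a ring (label 0)
bar = ["0100", "0100", "0100", "0100"]
ring = ["1111", "1001", "1001", "1111"]
noisy_bar = ["0100", "0100", "0110", "0100"]  # the bar with one extra pixel set

train_x = np.vstack([text_to_vector(bar), text_to_vector(ring)])
train_y = [1, 0]

prediction = knn_predict(text_to_vector(noisy_bar), train_x, train_y, k=1)
print(prediction)  # the noisy bar differs from the bar in one pixel, so this prints 1
```

The real images work the same way, only with 32 x 32 = 1024 features per sample and a larger k.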

In [1]:
import numpy as np
from os import listdir

# Define a function to convert an image file to a vector
def img2vector(filename, h, w):
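    # Flatten an h x w text image of '0'/'1' characters into a 1 x (h * w) row vector
    imgVector = np.zeros((1, h * w))
    # The with-block closes the file even if a line fails to parse
    with open(filename) as fileIn:
        for row in range(h):
            lineStr = fileIn.readline()
            for col in range(w):
                # Index by w rather than a hard-coded 32 so other widths work
                imgVector[0, row * w + col] = int(lineStr[col])
    return imgVector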

# Define a function to load the dataset from two directories
def loadDataSet():
    print("1.Loading trainSet...")
    trainFileList = listdir('../data/HWdigits/trainSet')
    trainNum = len(trainFileList)

    trainX = np.zeros((trainNum, 32 * 32))
    trainY = []
    for i in range(trainNum):
        trainFile = trainFileList[i]
        trainX[i, :] = img2vector('../data/HWdigits/trainSet/%s'%trainFile, 32, 32)
        label = int(trainFile.split('_')[0])
        trainY.append(label)
    print("2.Loading testSet...")
    testFileList = listdir('../data/HWdigits/testSet')
    testNum = len(testFileList)
    testX = np.zeros((testNum, 32 * 32))
    testY = []
    for i in range(testNum):
        testFile = testFileList[i]
        # Read test images from testSet; reading them from trainSet would leak training data
        testX[i, :] = img2vector('../data/HWdigits/testSet/%s'%testFile, 32, 32)
        label = int(testFile.split('_')[0])
        testY.append(label)
    return trainX, trainY, testX, testY

# Define a function to apply the KNN algorithm to a test digit
def myKNN(testDigit, trainX, trainY, k):
    # Broadcasting subtracts testDigit from every training row at once
    diff = trainX - testDigit
    # Euclidean distance from the test digit to each training sample
    squareDist = np.sum(diff ** 2, axis=1)
    distance = squareDist ** 0.5
    sortedDistIndices = np.argsort(distance)
    classCount = {}
    for i in range(k):
        voteLabel = trainY[sortedDistIndices[i]]
        classCount[voteLabel] = classCount.get(voteLabel, 0) + 1
    # Return the label with the most votes among the k nearest neighbors
    return max(classCount, key=classCount.get)

# Load dataset using our function
train_x, train_y, test_x, test_y = loadDataSet()

# Get the number of samples in the test set
numTestSamples = test_x.shape[0]

# Initialize a variable for storing the number of matches
matchCount = 0

print("3.Find the most frequent label in k-nearest...")
print("4.Show the result...")

for i in range(numTestSamples):
    predict = myKNN(test_x[i], train_x, train_y, 3)
    print("result is: %d, real answer is: %d" %(predict, test_y[i]))
    if predict == test_y[i]:
        matchCount += 1

# Calculate accuracy score as matchCount / numTestSamples
accuracy = float(matchCount) / numTestSamples

# Print accuracy
print("5.Show the accuracy...")

# Print how many errors we made out of all predictions
print("The total number of errors is: %d" % (numTestSamples - matchCount))

# Print accuracy score as percentage with two decimal
print('The classify accuracy is: %.2f%%' % (accuracy * 100))

1.Loading trainSet...
2.Loading testSet...
3.Find the most frequent label in k-nearest...
4.Show the result...
result is: 3, real answer is: 3
result is: 2, real answer is: 2
result is: 8, real answer is: 8
result is: 8, real answer is: 8
result is: 2, real answer is: 2
result is: 3, real answer is: 3
result is: 4, real answer is: 4
result is: 8, real answer is: 8
result is: 1, real answer is: 1
result is: 9, real answer is: 9
result is: 1, real answer is: 1
result is: 0, real answer is: 0
result is: 4, real answer is: 4
result is: 3, real answer is: 3
result is: 2, real answer is: 2
result is: 3, real answer is: 3
result is: 0, real answer is: 0
result is: 9, real answer is: 9
result is: 0, real answer is: 0
result is: 9, real answer is: 9
result is: 7, real answer is: 7
result is: 1, real answer is: 1
result is: 4, real answer is: 4
result is: 5, real answer is: 5
result is: 4, real answer is: 4
result is: 3, real answer is: 3
result is: 3, real answer is: 3
result is: 0, real answer 