### 1. Handle Data

#### First we load the data file, then randomly split it into a training dataset and a test dataset.

In [1]:
import csv
import random
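def loadDataset(filename, split, trainingSet, testSet):
    # Read the CSV file and append each row to trainingSet or testSet in place.
    with open(filename, 'r') as csvfile:
        lines = csv.reader(csvfile)
        dataset = [row for row in lines if row]  # skip blank lines
        for x in range(len(dataset)):
            # The first four columns are numeric features; the last is the class label.
            for y in range(4):
                dataset[x][y] = float(dataset[x][y])
            # Each row goes to the training set with probability `split`.
            if random.random() < split:
                trainingSet.append(dataset[x])
            else:
                testSet.append(dataset[x])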

In [2]:
trainingSet=[]
testSet=[]
loadDataset('iris.data.txt', 0.66, trainingSet, testSet)
print ('Train: ' + repr(len(trainingSet)))
print ('Test: ' + repr(len(testSet)) )

Train: 98
Test: 52
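
Because the split relies on `random.random()`, the train and test sizes change from run to run. Below is a minimal sketch of a variant that returns the two lists and accepts a seed, assuming the same layout of four numeric columns followed by a label. The name `load_dataset_seeded` and the `seed` parameter are illustrative and not part of the notebook, and the inline rows stand in for the real file.

```python
import csv
import io
import random

def load_dataset_seeded(csvfile, split, seed=None):
    """Return (training_set, test_set) read from an open CSV file object."""
    rng = random.Random(seed)  # private generator, so the same seed repeats the split
    training_set, test_set = [], []
    for row in csv.reader(csvfile):
        if not row:  # skip blank lines, e.g. a trailing newline at end of file
            continue
        # Convert the four feature columns to floats and keep the label as a string.
        instance = [float(value) for value in row[:4]] + row[4:]
        if rng.random() < split:
            training_set.append(instance)
        else:
            test_set.append(instance)
    return training_set, test_set

# Hypothetical sample rows in the Iris layout, used instead of a file on disk.
sample = io.StringIO(
    "5.1,3.5,1.4,0.2,Iris-setosa\n"
    "7.0,3.2,4.7,1.4,Iris-versicolor\n"
    "6.3,3.3,6.0,2.5,Iris-virginica\n"
    "\n"
)
train, test = load_dataset_seeded(sample, 0.66, seed=1)
print('Train: ' + repr(len(train)))
print('Test: ' + repr(len(test)))
```

Calling the function again with the same seed and the same rows gives the same split, which makes accuracy figures comparable between runs.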


### 2. Similarity



#### In order to make predictions we need to calculate the similarity between any two given data instances. This is needed so that we can locate the k most similar data instances in the training dataset for a given member of the test dataset and in turn make a prediction.

In [3]:
import math
def euclideanDistance(instance1, instance2, length):
    distance = math.sqrt(sum([math.pow(instance1[i]-instance2[i],2) for i in range(length)]))
    return distance

In [4]:
data1 = [2, 2, 2, 'a']
data2 = [4, 4, 4, 'b']
distance = euclideanDistance(data1, data2, 3)
print('Distance:' + repr(distance))

Distance:3.4641016151377544
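
To see where this number comes from, the calculation can be broken into its steps. The sketch below reuses `data1` and `data2` from the cell above and cross-checks the result against the standard library's `math.dist` (available from Python 3.8). The variable names `squared_diffs`, `by_hand` and `builtin` are only for illustration.

```python
import math

data1 = [2, 2, 2, 'a']
data2 = [4, 4, 4, 'b']

# Square each per-feature difference, sum the squares, then take the square root.
squared_diffs = [(a - b) ** 2 for a, b in zip(data1[:3], data2[:3])]
by_hand = math.sqrt(sum(squared_diffs))

# math.dist computes the same Euclidean distance between two points.
builtin = math.dist(data1[:3], data2[:3])

print(squared_diffs)                    # [4, 4, 4]
print(math.isclose(by_hand, builtin))   # True
```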


### 3. Neighbors

#### Now that we have a similarity measure, we can use it to collect the k most similar instances for a given unseen instance.

In [5]:
import operator
def getNeighbors(trainingSet, testInstance, k):
    distances = []
    # The last element of an instance is its class label, so leave it out of the distance.
    length = len(testInstance)-1
    for x in range(len(trainingSet)):
        dist = euclideanDistance(testInstance, trainingSet[x], length)
        distances.append((trainingSet[x], dist))
    # Sort once, after every distance has been computed.
    distances.sort(key=operator.itemgetter(1))
    neighbors = []
    for x in range(k):
        neighbors.append(distances[x][0])
    return neighbors

In [6]:
trainSet = [[2, 2, 2, 'a'], [4, 4, 4, 'b']]
testInstance = [5, 5, 5, '?']  # last element is a placeholder label; getNeighbors ignores it
k = 1
neighbors = getNeighbors(trainSet, testInstance, k)
print(neighbors)

[[4, 4, 4, 'b']]
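
Sorting every distance is more work than needed when only the k smallest matter. As one possible alternative, the sketch below uses `heapq.nsmallest` from the standard library. The function name `get_neighbors_heap` is hypothetical, and it assumes, as above, that the last element of each instance is the class label.

```python
import heapq
import math

def get_neighbors_heap(training_set, test_instance, k):
    """Return the k training rows closest to test_instance by Euclidean distance."""
    length = len(test_instance) - 1  # exclude the trailing class label

    def distance_to(row):
        return math.dist(test_instance[:length], row[:length])

    # nsmallest returns the k rows with the smallest key, closest first.
    return heapq.nsmallest(k, training_set, key=distance_to)

train_set = [[2, 2, 2, 'a'], [4, 4, 4, 'b']]
test_instance = [5, 5, 5, '?']
print(get_neighbors_heap(train_set, test_instance, 1))  # [[4, 4, 4, 'b']]
```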


### 4. Response

#### Once we have located the most similar neighbors for a test instance, the next task is to devise a predicted response based on those neighbors. We can do this by allowing each neighbor to vote for their class attribute, and take the majority vote as the prediction.

In [7]:
def getResponse(neighbors):
    classVotes = {}
    # Count one vote per neighbor for its class label (the last element).
    for x in range(len(neighbors)):
        response = neighbors[x][-1]
        if response in classVotes:
            classVotes[response] += 1
        else:
            classVotes[response] = 1
    # Only after all votes are counted, return the class with the most votes.
    sortedVotes = sorted(classVotes.items(), key=operator.itemgetter(1), reverse=True)
    return sortedVotes[0][0]

###### We can test this function with some sample neighbors:

In [8]:
neighbors = [[1,1,1,'a'], [2,2,2,'a'], [3,3,3,'b']]
response = getResponse(neighbors)
print(response)

a
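
The same majority vote can be written more compactly with `collections.Counter`. This is a sketch of an equivalent approach, not the notebook's own function; the name `get_response_counter` is made up for the example.

```python
from collections import Counter

def get_response_counter(neighbors):
    """Return the most common class label among the neighbors."""
    votes = Counter(row[-1] for row in neighbors)  # label is the last element
    return votes.most_common(1)[0][0]

neighbors = [[1,1,1,'a'], [2,2,2,'a'], [3,3,3,'b']]
print(get_response_counter(neighbors))  # a
```

When two classes receive the same number of votes, `most_common` keeps them in the order first encountered, so a tie goes to the class that appears earlier in `neighbors`.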


### 5. Accuracy


#### We have all of the pieces of the kNN algorithm in place. What remains is to evaluate the predictions: classification accuracy is the percentage of test instances whose predicted class matches the actual class.

In [9]:
def getAccuracy(testSet, predictions):
    correct = 0
    for i in range(len(testSet)):
        if testSet[i][-1] == predictions[i]:
            correct += 1
    return (correct/float(len(testSet))) * 100.0

In [10]:
testSet = [[1,1,1,'a'], [2,2,2,'a'], [3,3,3,'b']]
predictions = ['a', 'a', 'a']
accuracy = getAccuracy(testSet, predictions)
print(accuracy)

66.66666666666666
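
A single accuracy figure does not show which classes are being confused. As a small illustrative extension, the sketch below counts each (actual, predicted) pair for the same sample data. The helper `get_confusion_counts` is hypothetical and not part of the notebook.

```python
from collections import Counter

def get_confusion_counts(test_set, predictions):
    """Count how often each (actual, predicted) label pair occurs."""
    return Counter((row[-1], predicted) for row, predicted in zip(test_set, predictions))

test_set = [[1,1,1,'a'], [2,2,2,'a'], [3,3,3,'b']]
predictions = ['a', 'a', 'a']
counts = get_confusion_counts(test_set, predictions)
for (actual, predicted), n in sorted(counts.items()):
    print(f"actual={actual} predicted={predicted}: {n}")
# actual=a predicted=a: 2
# actual=b predicted=a: 1
```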


### 6. Main

#### We now have all the elements of the algorithm, so we can put them together in one main function.

In [11]:
def main(filename,split,k):
    trainingSet,testSet,predictions=[],[],[]
    loadDataset(filename,split,trainingSet,testSet)
    # Predict a class for each test instance from its k nearest training neighbors.
    for x_test in testSet:
        neighbors = getNeighbors(trainingSet, x_test, k)
        predictions.append(getResponse(neighbors))
    return getAccuracy(testSet, predictions)

In [12]:
for k in range(1,30):
    print(f"Accuracy for {k} is {main('iris.data.txt',0.66,k)}")


Accuracy for 1 is 95.16129032258065
Accuracy for 2 is 96.07843137254902
Accuracy for 3 is 94.23076923076923
Accuracy for 4 is 94.54545454545455
Accuracy for 5 is 97.67441860465115
Accuracy for 6 is 94.54545454545455
Accuracy for 7 is 95.45454545454545
Accuracy for 8 is 90.9090909090909
Accuracy for 9 is 95.91836734693877
Accuracy for 10 is 95.74468085106383
Accuracy for 11 is 95.74468085106383
Accuracy for 12 is 94.11764705882352
Accuracy for 13 is 100.0
Accuracy for 14 is 96.22641509433963
Accuracy for 15 is 93.87755102040816
Accuracy for 16 is 98.0392156862745
Accuracy for 17 is 95.83333333333334
Accuracy for 18 is 96.15384615384616
Accuracy for 19 is 100.0
Accuracy for 20 is 96.15384615384616
Accuracy for 21 is 92.3076923076923
Accuracy for 22 is 93.33333333333333
Accuracy for 23 is 95.91836734693877
Accuracy for 24 is 94.44444444444444
Accuracy for 25 is 96.72131147540983
Accuracy for 26 is 94.23076923076923
Accuracy for 27 is 93.65079365079364
Accuracy for 28 is 98.0392156862745

### 7. Another distance metric

In [13]:
def manhattanDistance(instance1, instance2, length):
    distance = sum([abs(instance1[i]-instance2[i]) for i in range(length)])
    return distance

In [14]:
data1 = [2, 2, 2, 'a']
data2 = [4, 4, 4, 'b']
distance = manhattanDistance(data1, data2, 3)
print('Distance:' + repr(distance))

Distance:6
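
For reference, the two metrics differ only in how the per-feature differences are combined. Writing $x$ and $y$ for the two instances and $n$ for the number of features compared (notation introduced here, not taken from the notebook), the definitions and the values for `data1` and `data2` work out as:

```latex
\begin{align*}
d_{\text{Euclidean}}(x, y) &= \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2} = \sqrt{3 \cdot (2 - 4)^2} = \sqrt{12} \approx 3.4641 \\
d_{\text{Manhattan}}(x, y) &= \sum_{i=1}^{n} \lvert x_i - y_i \rvert = 3 \cdot \lvert 2 - 4 \rvert = 6
\end{align*}
```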


In [15]:
def getNeighbors(trainingSet, testInstance, k):
    distances = []
    # The last element of an instance is its class label, so leave it out of the distance.
    length = len(testInstance)-1
    for x in range(len(trainingSet)):
        dist = manhattanDistance(testInstance, trainingSet[x], length)
        distances.append((trainingSet[x], dist))
    # Sort once, after every distance has been computed.
    distances.sort(key=operator.itemgetter(1))
    neighbors = []
    for x in range(k):
        neighbors.append(distances[x][0])
    return neighbors
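
Redefining `getNeighbors` replaces the Euclidean version for the rest of the session. One alternative, sketched here under the assumption that both metrics share the `(instance1, instance2, length)` signature, is to pass the distance function as an argument. All names in this block are illustrative, and the two-feature data is made up to show the metrics disagreeing.

```python
import operator

def manhattan_distance(instance1, instance2, length):
    return sum(abs(instance1[i] - instance2[i]) for i in range(length))

def euclidean_distance(instance1, instance2, length):
    return sum((instance1[i] - instance2[i]) ** 2 for i in range(length)) ** 0.5

def get_neighbors_with(distance_fn, training_set, test_instance, k):
    """Return the k nearest training rows under the given distance function."""
    length = len(test_instance) - 1  # exclude the trailing class label
    distances = [(row, distance_fn(test_instance, row, length)) for row in training_set]
    distances.sort(key=operator.itemgetter(1))
    return [row for row, _ in distances[:k]]

train_set = [[0, 3, 'a'], [2, 2, 'b']]
test_instance = [0, 0, '?']
print(get_neighbors_with(euclidean_distance, train_set, test_instance, 1))  # [[2, 2, 'b']]
print(get_neighbors_with(manhattan_distance, train_set, test_instance, 1))  # [[0, 3, 'a']]
```

Here the nearest neighbor depends on the metric: `[2, 2]` is closer in Euclidean terms (about 2.83 versus 3), while `[0, 3]` is closer in Manhattan terms (3 versus 4).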

In [16]:
for k in range(1,30):
    print(f"Accuracy for {k} is {main('iris.data.txt',0.66,k)} using manhattan distance")

Accuracy for 1 is 89.28571428571429 using manhattan distance
Accuracy for 2 is 100.0 using manhattan distance
Accuracy for 3 is 93.33333333333333 using manhattan distance
Accuracy for 4 is 90.74074074074075 using manhattan distance
Accuracy for 5 is 91.66666666666666 using manhattan distance
Accuracy for 6 is 93.10344827586206 using manhattan distance
Accuracy for 7 is 94.0 using manhattan distance
Accuracy for 8 is 98.11320754716981 using manhattan distance
Accuracy for 9 is 96.36363636363636 using manhattan distance
Accuracy for 10 is 96.29629629629629 using manhattan distance
Accuracy for 11 is 88.63636363636364 using manhattan distance
Accuracy for 12 is 95.45454545454545 using manhattan distance
Accuracy for 13 is 94.0 using manhattan distance
Accuracy for 14 is 88.0 using manhattan distance
Accuracy for 15 is 98.18181818181819 using manhattan distance
Accuracy for 16 is 96.07843137254902 using manhattan distance
Accuracy for 17 is 98.14814814814815 using manhattan distance
Accura