# k-Nearest Neighbors

When given unseen data, k-Nearest Neighbors (kNN) searches the training dataset for the k most similar instances. Here k is the number of neighbors consulted, not the number of groups in the data. For example, if the training data contains three separate groups of points, kNN predicts which group a new data point belongs to by looking at the groups of its k nearest training points.

There are different ways to measure similarity, and the right choice depends on the type of data. For real-valued data, the Euclidean distance can be used. For categorical or binary data, the Hamming distance is a common choice.
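
The two distance measures mentioned above can be sketched in a few lines of plain Python. The function names `euclidean_distance` and `hamming_distance` are illustrative choices, not part of any library, and the sketch assumes both inputs have the same length.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two equal-length numeric vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def hamming_distance(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))

# Two iris measurement vectors (real-valued data).
print(euclidean_distance([5.1, 3.5, 1.4, 0.2], [4.9, 3.0, 1.4, 0.2]))

# Two binary strings that differ in two positions.
print(hamming_distance("10110", "10011"))  # 2
```

The smaller the distance, the more similar the two instances are taken to be.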

The principle behind nearest neighbor methods (not just k-Nearest Neighbors) is to find a predefined number of training samples closest in distance to the new point, and predict the label from these. In kNN learning, that number of samples is a user-defined constant, k. To find the value of k that gives the minimum error, we usually try several values and choose the one with the lowest error, ideally measured on data held out from training.
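
One possible way to run through several k values is sketched here. It assumes leave-one-out evaluation (each point is predicted from all the others) as the error measure, which is only one of several reasonable choices. The helper names `knn_predict` and `leave_one_out_error` and the toy data are made up for illustration.

```python
import math
from collections import Counter

def knn_predict(train, query, k):
    """Majority label among the k training points closest to `query`."""
    def distance(item):
        features, _ = item
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(features, query)))
    neighbors = sorted(train, key=distance)[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

def leave_one_out_error(data, k):
    """Fraction of points misclassified when each is predicted from the rest."""
    errors = 0
    for i, (features, label) in enumerate(data):
        rest = data[:i] + data[i + 1:]
        if knn_predict(rest, features, k) != label:
            errors += 1
    return errors / len(data)

# Toy data (invented for this example): two well-separated groups of four points.
toy = [
    ((1.0, 1.1), "a"), ((1.2, 0.9), "a"), ((0.8, 1.0), "a"), ((1.1, 1.3), "a"),
    ((3.0, 3.2), "b"), ((3.1, 2.9), "b"), ((2.9, 3.0), "b"), ((3.3, 3.1), "b"),
]

errors_by_k = {k: leave_one_out_error(toy, k) for k in (1, 3, 5, 7)}
best_k = min(errors_by_k, key=errors_by_k.get)  # first k with the lowest error
print(errors_by_k)
print(best_k)
```

On this toy data, k = 7 uses every remaining point, so the four points of the other group always outvote the three points of the correct one. That illustrates why a k that is too large can hurt.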

You can apply kNN to either regression problems or classification problems. For regression, the average of the neighbors' target values may be returned. For classification, the most prevalent class among the neighbors may be returned.
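
The difference between the two cases comes down to how the neighbors' values are combined once the k nearest points have been found. A minimal sketch, with hypothetical function names and made-up neighbor values:

```python
from collections import Counter
from statistics import mean

def aggregate_classification(neighbor_labels):
    """Return the most common label among the neighbors (majority vote)."""
    return Counter(neighbor_labels).most_common(1)[0][0]

def aggregate_regression(neighbor_values):
    """Return the average of the neighbors' numeric target values."""
    return mean(neighbor_values)

# Labels and values of k = 3 neighbors, invented for this example.
print(aggregate_classification(["Iris-setosa", "Iris-versicolor", "Iris-setosa"]))
print(aggregate_regression([1.0, 2.0, 3.0]))  # 2.0
```

When two classes tie, this sketch returns whichever label appeared first among the neighbors. Other tie-breaking rules are possible.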

# How does k-Nearest Neighbors work?

(Will give a detailed explanation)

# Let's classify the iris dataset using kNN!

The iris dataset contains 150 observations of iris flowers from three different species. Each flower has 4 measurements: sepal length, sepal width, petal length and petal width. We will be using kNN to classify and predict the species of the flower: setosa, versicolor or virginica.

# Handle the data

Let's load the dataset and see what it looks like! The dataset is available here: https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data (the cells in this notebook read it from a local copy named `iris.data.txt`).

In [5]:
import csv
with open('iris.data.txt', 'r') as csvfile:
    lines = csv.reader(csvfile)
    for row in lines:
        print(', '.join(row))

5.1, 3.5, 1.4, 0.2, Iris-setosa
4.9, 3.0, 1.4, 0.2, Iris-setosa
4.7, 3.2, 1.3, 0.2, Iris-setosa
4.6, 3.1, 1.5, 0.2, Iris-setosa
5.0, 3.6, 1.4, 0.2, Iris-setosa
5.4, 3.9, 1.7, 0.4, Iris-setosa
4.6, 3.4, 1.4, 0.3, Iris-setosa
5.0, 3.4, 1.5, 0.2, Iris-setosa
4.4, 2.9, 1.4, 0.2, Iris-setosa
4.9, 3.1, 1.5, 0.1, Iris-setosa
5.4, 3.7, 1.5, 0.2, Iris-setosa
4.8, 3.4, 1.6, 0.2, Iris-setosa
4.8, 3.0, 1.4, 0.1, Iris-setosa
4.3, 3.0, 1.1, 0.1, Iris-setosa
5.8, 4.0, 1.2, 0.2, Iris-setosa
5.7, 4.4, 1.5, 0.4, Iris-setosa
5.4, 3.9, 1.3, 0.4, Iris-setosa
5.1, 3.5, 1.4, 0.3, Iris-setosa
5.7, 3.8, 1.7, 0.3, Iris-setosa
5.1, 3.8, 1.5, 0.3, Iris-setosa
5.4, 3.4, 1.7, 0.2, Iris-setosa
5.1, 3.7, 1.5, 0.4, Iris-setosa
4.6, 3.6, 1.0, 0.2, Iris-setosa
5.1, 3.3, 1.7, 0.5, Iris-setosa
4.8, 3.4, 1.9, 0.2, Iris-setosa
5.0, 3.0, 1.6, 0.2, Iris-setosa
5.0, 3.4, 1.6, 0.4, Iris-setosa
5.2, 3.5, 1.5, 0.2, Iris-setosa
5.2, 3.4, 1.4, 0.2, Iris-setosa
4.7, 3.2, 1.6, 0.2, Iris-setosa
4.8, 3.1, 1.6, 0.2, Iris-setosa
5.4, 3.4

To use this data, we need to convert the flower measurements from strings to numbers. Once we've done this, we will split the data into a training set and a test set. kNN makes its predictions from the training set, and we then use the test set to evaluate the accuracy of the model. In this case, we will split the data with roughly a 66/34 ratio for the train/test sets.

Let's create a function to load a CSV file and then split the data randomly into the train/test sets.

In [2]:
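import csv
import random  # to split the data randomly

def loadDataset(filename, split, trainingSet=None, testSet=None):
    """Read a CSV of iris rows and randomly assign each row to train or test."""
    # Avoid mutable default arguments, which would be shared across calls.
    if trainingSet is None:
        trainingSet = []
    if testSet is None:
        testSet = []
    with open(filename, 'r') as csvfile:  # python 2.7 uses 'rb' instead
        for row in csv.reader(csvfile):
            if not row:  # skip blank lines, such as a trailing empty line
                continue
            # The first four columns are measurements; the fifth is the species label.
            row[:4] = [float(value) for value in row[:4]]
            # Each row lands in the training set with probability `split`.
            if random.random() < split:
                trainingSet.append(row)
            else:
                testSet.append(row)
    return trainingSet, testSet

trainingSet = []
testSet = []
loadDataset('iris.data.txt', 0.66, trainingSet, testSet)
print('Train:' + repr(len(trainingSet)))
print('Test:' + repr(len(testSet)))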

Train:97
Test:53
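
Because each row is assigned by an independent random draw, the resulting sizes only approximate the requested ratio and change from run to run (97/53 here rather than 99/51). If an exact and repeatable split is preferred, one alternative is to shuffle the rows with a seeded generator and cut the list at a fixed index. This is a sketch of that alternative, not the method used above; `split_exact` and its `seed` parameter are hypothetical names, and the integers stand in for the 150 iris rows.

```python
import random

def split_exact(rows, train_fraction, seed=None):
    """Shuffle a copy of `rows` and cut it at a fixed index."""
    rng = random.Random(seed)   # a seeded generator makes the split repeatable
    shuffled = list(rows)       # copy so the caller's list is left untouched
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

rows = list(range(150))         # placeholder for the 150 iris observations
train, test = split_exact(rows, 0.66, seed=42)
print(len(train), len(test))    # 99 51
```

Calling `split_exact` again with the same seed returns the same two lists, which makes later accuracy figures easier to compare.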


In [4]:
import pandas as pd
import random
def loadDataset(filename, split, trainingSet=[], testSet=[]):
        df = pd.read_csv(filename)
        dataset = list(df)
        for x in range(len(dataset)-1):
            for y in range(4):
                dataset[x][y] = float(dataset[x][y])
            if random.random() < split:
                trainingSet.append(dataset[x])
            else:
                testSet.append(dataset[x])
                
trainingSet = []
testSet = []
loadDataset('iris.data.txt', 0.66, trainingSet, testSet)
print('Train:' + repr(len(trainingSet)))
print('Test:' + repr(len(testSet)))

TypeError: 'str' object does not support item assignment

The error comes from `list(df)`: iterating over a DataFrame yields its column labels, which are strings, rather than its rows, so `dataset[x][y] = ...` tries to assign into a string. The next two cells confirm this.

In [6]:
df = pd.read_csv('iris.data.txt')
dataset = list(df)
print(dataset)

['5.1', '3.5', '1.4', '0.2', 'Iris-setosa']


In [7]:
print(df)

     5.1  3.5  1.4  0.2     Iris-setosa
0    4.9  3.0  1.4  0.2     Iris-setosa
1    4.7  3.2  1.3  0.2     Iris-setosa
2    4.6  3.1  1.5  0.2     Iris-setosa
3    5.0  3.6  1.4  0.2     Iris-setosa
4    5.4  3.9  1.7  0.4     Iris-setosa
5    4.6  3.4  1.4  0.3     Iris-setosa
6    5.0  3.4  1.5  0.2     Iris-setosa
7    4.4  2.9  1.4  0.2     Iris-setosa
8    4.9  3.1  1.5  0.1     Iris-setosa
9    5.4  3.7  1.5  0.2     Iris-setosa
10   4.8  3.4  1.6  0.2     Iris-setosa
11   4.8  3.0  1.4  0.1     Iris-setosa
12   4.3  3.0  1.1  0.1     Iris-setosa
13   5.8  4.0  1.2  0.2     Iris-setosa
14   5.7  4.4  1.5  0.4     Iris-setosa
15   5.4  3.9  1.3  0.4     Iris-setosa
16   5.1  3.5  1.4  0.3     Iris-setosa
17   5.7  3.8  1.7  0.3     Iris-setosa
18   5.1  3.8  1.5  0.3     Iris-setosa
19   5.4  3.4  1.7  0.2     Iris-setosa
20   5.1  3.7  1.5  0.4     Iris-setosa
21   4.6  3.6  1.0  0.2     Iris-setosa
22   5.1  3.3  1.7  0.5     Iris-setosa
23   4.8  3.4  1.9  0.2     Iris-setosa
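
The printed DataFrame shows a second problem: `pd.read_csv` treated the first observation (5.1, 3.5, 1.4, 0.2, Iris-setosa) as the header row. One way to address both issues is sketched here. It passes `header=None` so that no row is consumed as a header, and uses `df.values.tolist()` to get one list per row. The column names in `COLUMNS`, the function name `load_dataset_pandas`, and the `seed` parameter are assumptions made for this example, and a small in-memory sample stands in for the file.

```python
import io
import random

import pandas as pd

# Column names are not stored in the file; these labels are chosen for illustration.
COLUMNS = ["sepal_length", "sepal_width", "petal_length", "petal_width", "species"]

def load_dataset_pandas(source, split, seed=None):
    """Load iris rows with pandas and randomly assign each to train or test."""
    # header=None: the file has no header row, so every line is data.
    df = pd.read_csv(source, header=None, names=COLUMNS)
    rng = random.Random(seed)
    training_set, test_set = [], []
    # values.tolist() gives one Python list per row; the measurements are
    # already parsed as floats by read_csv, so no manual conversion is needed.
    for row in df.values.tolist():
        if rng.random() < split:
            training_set.append(row)
        else:
            test_set.append(row)
    return training_set, test_set

# The first three rows of the dataset, used here in place of 'iris.data.txt'.
sample = io.StringIO(
    "5.1,3.5,1.4,0.2,Iris-setosa\n"
    "4.9,3.0,1.4,0.2,Iris-setosa\n"
    "4.7,3.2,1.3,0.2,Iris-setosa\n"
)
train, test = load_dataset_pandas(sample, 0.66, seed=0)
print(len(train) + len(test))  # 3
```

Passing `'iris.data.txt'` instead of `sample` should load the full file the same way, since `read_csv` accepts either a path or a file-like object.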
