# k-Nearest Neighbors

The k-Nearest Neighbors algorithm (or kNN for short) is an easy algorithm to understand and to implement, and a powerful tool to have at your disposal.

In this tutorial you will implement the k-Nearest Neighbors algorithm from scratch in Python. The implementation will be specific to classification problems and will be demonstrated using the Iris flowers classification problem.

This tutorial is for you if you are a Python programmer, or a programmer who can pick up Python quickly, and you are interested in how to implement k-Nearest Neighbors from scratch.

<img src="files/k-Nearest-Neighbors-algorithm.png">

## What is k-Nearest Neighbors

The model for kNN is the entire training dataset. When a prediction is required for an unseen data instance, the kNN algorithm will search through the training dataset for the k-most similar instances. The prediction attribute of the most similar instances is summarized and returned as the prediction for the unseen instance.

The similarity measure depends on the type of data. For real-valued data, the Euclidean distance can be used. For other types of data, such as categorical or binary data, the Hamming distance can be used.
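
As a minimal sketch of the Hamming distance mentioned above, the function below counts the positions at which two equal-length instances differ. The name `hamming_distance()` and the sample values are assumptions made for illustration; they are not part of the implementation developed in this tutorial.

```python
def hamming_distance(instance1, instance2):
    # Hypothetical helper for illustration; both instances must have the same length.
    if len(instance1) != len(instance2):
        raise ValueError('instances must have the same length')
    # Count the positions where the two instances hold different values.
    return sum(a != b for a, b in zip(instance1, instance2))

# Categorical data: only the second attribute differs.
print(hamming_distance(['red', 'small', 'round'], ['red', 'large', 'round']))
# Binary data: the first and last attributes differ.
print(hamming_distance([1, 0, 1, 1], [0, 0, 1, 0]))
```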

In the case of regression problems, the average of the predicted attribute may be returned. In the case of classification, the most prevalent class may be returned.
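
The two ways of summarizing neighbors described above can be sketched as follows. The function names and sample values are hypothetical, and the sketch assumes the outputs of the k nearest neighbors have already been collected into a non-empty list.

```python
from collections import Counter

def summarize_regression(values):
    # Regression: return the mean of the neighbors' real-valued outputs.
    return sum(values) / len(values)

def summarize_classification(labels):
    # Classification: return the most prevalent class among the neighbors.
    return Counter(labels).most_common(1)[0][0]

print(summarize_regression([2.0, 3.0, 4.0]))
print(summarize_classification(['setosa', 'virginica', 'setosa']))
```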

## How does k-Nearest Neighbors Work

The kNN algorithm belongs to the family of instance-based, competitive learning and lazy learning algorithms.

Instance-based algorithms are those algorithms that model the problem using data instances (or rows) in order to make predictive decisions. The kNN algorithm is an extreme form of instance-based methods because all training observations are retained as part of the model.

It is a competitive learning algorithm, because it internally uses competition between model elements (data instances) in order to make a predictive decision. The objective similarity measure between data instances causes each data instance to compete to "win" or be most similar to a given unseen data instance and contribute to a prediction.

Lazy learning refers to the fact that the algorithm does not build a model until the time that a prediction is required. It is lazy because it only does work at the last second. This has the benefit of only including data relevant to the unseen data, called a localized model. A disadvantage is that it can be computationally expensive to repeat the same or similar searches over larger training datasets.

Finally, kNN is powerful because it does not assume anything about the data, other than that a distance measure can be calculated consistently between any two instances. As such, it is called non-parametric or non-linear, as it does not assume a functional form.

## Classify Flowers Using Measurements

The test problem we will be using in this tutorial is iris classification.

The problem consists of 150 observations of iris flowers from three different species. Each flower has four measurements: sepal length, sepal width, petal length and petal width, all in centimeters. The predicted attribute is the species, which is one of setosa, versicolor or virginica.

It is a standard dataset where the species is known for all instances. As such we can split the data into training and test datasets and use the results to evaluate our algorithm implementation. Good classification accuracy on this problem is above 90% correct, typically 96% or better.

-  [Download the Iris Flowers Dataset](https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data)

Save the file in your current working directory with the file name "_iris.data_".

## How to implement k-Nearest Neighbors in Python

This tutorial is broken down into the following steps:

1. __Handle Data__: Open the dataset from CSV and split into test/train datasets.
2. __Similarity__: Calculate the distance between two data instances.
3. __Neighbors__: Locate k-most similar data instances.
4. __Response__: Generate a response from a set of data instances.
5. __Make Predictions__: Generate a response for every instance in the test dataset.
6. __Accuracy__: Summarize the accuracy of predictions.
7. __Main__: Tie it all together.

### 1. Handle Data

The first thing we need to do is load our data file. The data is in CSV format without a header line or any quotes. We also need to convert the attributes that were loaded as strings into numbers so that we can work with them. Below is the __read_csv()__ function for loading the Iris dataset.

In [1]:
def read_csv(filename):
    with open(filename) as f:
        dataset = [[x for x in line.split(',')] for line in f if line.strip()]
        for row in dataset:
            for i in range(len(row)-1):
                row[i] = float(row[i])
            row[-1] = row[-1].strip()
    return dataset

We can test this function by loading the Iris dataset and printing the number of data instances that were loaded.

In [2]:
filename = 'iris.data'
dataset = read_csv(filename)
print('Loaded data file "{0}" with {1} rows'.format(filename, len(dataset)))

Loaded data file "iris.data" with 150 rows


Next we need to split the data into a training dataset that kNN can use to make predictions and a test dataset that we can use to evaluate the accuracy of the model. We need to split the dataset randomly into train and test datasets with a ratio of 67% train and 33% test (this is a common ratio for testing an algorithm on a dataset).

Below is the __train_test_split()__ function that will split a given dataset into a given split ratio.

In [3]:
import random

def train_test_split(dataset, split_ratio):
    train_size = int(len(dataset) * split_ratio)
    train = []
    test = list(dataset)
    while len(train) < train_size:
        index = random.randrange(len(test))
        train.append(test.pop(index))
    return train, test

We can test this function with the Iris dataset, as follows:

In [4]:
dataset = read_csv('iris.data')
train, test = train_test_split(dataset, 0.67)
print('Split %d rows into train = %d and test = %d rows'
      % (len(dataset), len(train), len(test)))

Split 150 rows into train = 100 and test = 50 rows
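
Because the split is random, the rows that land in each dataset change from run to run. If you want a repeatable split, one option is to drive the shuffle from a seeded generator. The sketch below is a hypothetical variant (the name `seeded_train_test_split()` and the `seed` parameter are assumptions, not part of this tutorial's code) that shuffles a copy of the dataset and slices it.

```python
import random

def seeded_train_test_split(dataset, split_ratio, seed):
    # Hypothetical variant: a private, seeded generator makes the split repeatable.
    rng = random.Random(seed)
    shuffled = list(dataset)  # copy so the caller's list is left untouched
    rng.shuffle(shuffled)
    train_size = int(len(shuffled) * split_ratio)
    return shuffled[:train_size], shuffled[train_size:]

rows = [[i] for i in range(10)]
train_a, test_a = seeded_train_test_split(rows, 0.67, seed=1)
train_b, test_b = seeded_train_test_split(rows, 0.67, seed=1)
print(len(train_a), len(test_a))
# The same seed yields the same split.
print(train_a == train_b and test_a == test_b)
```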


### 2. Similarity

In order to make predictions we need to calculate the similarity between any two given data instances. This is needed so that we can locate the k most similar data instances in the training dataset for a given member of the test dataset and in turn make a prediction.

Given that all four flower measurements are numeric and have the same units, we can directly use the Euclidean distance measure. This is defined as the square root of the sum of the squared differences between the two arrays of numbers (read that again a few times and let it sink in).
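
Written as a formula, the definition above reads as follows. The symbols are notation chosen here for illustration: $x$ and $y$ are two instances and $n$ is the number of numeric attributes. The second line works through the instances $(2, 2, 2)$ and $(4, 4, 4)$.

```latex
d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}

d\big((2,2,2),\,(4,4,4)\big) = \sqrt{(2-4)^2 + (2-4)^2 + (2-4)^2} = \sqrt{12} \approx 3.464
```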

Additionally, we want to control which fields to include in the distance calculation. Specifically, we only want to include the first 4 attributes. One approach is to limit the Euclidean distance to a fixed length, ignoring the final dimension.

Putting all of this together we can define the __euclidean_distance()__ function as follows:

In [5]:
import math

def euclidean_distance(instance1, instance2, length):
    distance = 0
    for i in range(length):
        distance += (instance1[i] - instance2[i])**2
    return math.sqrt(distance)

We can test this function with some sample data, as follows:

In [6]:
data1 = [2, 2, 2, 'a']
data2 = [4, 4, 4, 'b']
distance = euclidean_distance(data1, data2, 3)
print('Distance: {0}'.format(distance))

Distance: 3.4641016151377544


### 3. Neighbors

Now that we have a similarity measure, we can use it to collect the k most similar instances for a given unseen instance.

This is a straightforward process of calculating the distance for all instances and selecting a subset with the smallest distance values.

Below is the __get_neighbors()__ function that returns k most similar neighbors from the training set for a given test instance (using the already defined __euclidean_distance()__ function).

In [7]:
def get_neighbors(training_set, test_instance, k):
    distances = []
    length = len(training_set[0]) - 1
    for train in training_set:
        dist = euclidean_distance(test_instance, train, length)
        distances.append((train, dist))
    distances.sort(key = lambda x: x[1])
    neighbors = [instance for instance, _ in distances[:k]]
    return neighbors

We can test out this function as follows:

In [8]:
training_set = [[2, 2, 2, 'a'], [4, 4, 4, 'b'], [6, 6, 6, 'c']]
test_instance = [1, 1, 1]
neighbors = get_neighbors(training_set, test_instance, 2)
print(neighbors)

[[2, 2, 2, 'a'], [4, 4, 4, 'b']]
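
Sorting every distance is not the only way to select the k smallest. As a sketch of one alternative, the standard library's `heapq.nsmallest()` can pick the k closest rows directly. The name `get_neighbors_heap()` is hypothetical, and `euclidean_distance()` is repeated here only so that the block runs on its own.

```python
import heapq
import math

def euclidean_distance(instance1, instance2, length):
    # Same distance function as defined earlier in the tutorial.
    distance = 0
    for i in range(length):
        distance += (instance1[i] - instance2[i])**2
    return math.sqrt(distance)

def get_neighbors_heap(training_set, test_instance, k):
    # Hypothetical variant of get_neighbors(): select the k rows with the
    # smallest distance to test_instance without sorting the whole list.
    length = len(training_set[0]) - 1
    return heapq.nsmallest(
        k, training_set,
        key=lambda train: euclidean_distance(test_instance, train, length))

training_set = [[2, 2, 2, 'a'], [4, 4, 4, 'b'], [6, 6, 6, 'c']]
print(get_neighbors_heap(training_set, [1, 1, 1], 2))
```

The neighbors come back ordered from nearest to farthest, as they do from __get_neighbors()__.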


### 4. Response

Once we have located the most similar neighbors for a test instance, the next task is to devise a predicted response based on those neighbors.

We can do this by allowing each neighbor to vote for their class attribute, and take the majority vote as the prediction.

Below is the __get_response()__ function that returns the majority-voted response from a number of neighbors. It assumes the class is the last attribute for each neighbor.

In [9]:
def get_response(neighbors):
    votes = [neighbor[-1] for neighbor in neighbors]
    return max(set(votes), key=votes.count)

We can test this function with some test neighbors, as follows:

In [10]:
neighbors = [[1, 1, 1, 'a'], [2, 2, 2, 'a'], [3, 3, 3, 'b']]
response = get_response(neighbors)
print(response)

a


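This approach still returns a single response in the case of a draw, but which of the tied classes wins is arbitrary, because it depends on the iteration order of the set of votes. You could handle such cases in a specific way, such as returning no response or selecting an unbiased random response.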
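
As one possible way to handle a draw, the sketch below picks uniformly at random among the classes that share the highest vote count. The name `get_response_random_tie()` and its `rng` parameter are assumptions for illustration, and the function expects a non-empty list of neighbors.

```python
import random
from collections import Counter

def get_response_random_tie(neighbors, rng=random):
    # Hypothetical variant of get_response(): count the votes per class.
    counts = Counter(neighbor[-1] for neighbor in neighbors)
    top = max(counts.values())
    # Collect every class that reached the highest count.
    tied = [label for label, count in counts.items() if count == top]
    # With a single winner this returns it; with a draw it picks one at random.
    return rng.choice(tied)

# Clear majority: always 'a'.
print(get_response_random_tie([[1, 1, 1, 'a'], [2, 2, 2, 'a'], [3, 3, 3, 'b']]))
# Draw: either 'a' or 'b', chosen at random.
print(get_response_random_tie([[1, 1, 1, 'a'], [3, 3, 3, 'b']]))
```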

### 5. Make Predictions

Next, we can make predictions for each data instance in our test dataset, which we will use in the next step to estimate the accuracy of the model. The __get_predictions()__ function will do this and return a list containing one prediction for each test instance.

In [11]:
def get_predictions(training_set, test_set, k):
    predictions = [get_response(get_neighbors(training_set, input_vector, k)) for input_vector in test_set]
    return predictions

We can test the __get_predictions()__ function as follows:

In [12]:
training_set = [[2, 2, 2, 'a'], [5, 5, 5, 'b']]
test_set = [[1, 1, 1], [3, 3, 3], [4, 4, 4], [6, 6, 6]]
predictions = get_predictions(training_set, test_set, 1)
print('Predictions: {0}'.format(predictions))

Predictions: ['a', 'a', 'b', 'b']


### 6. Accuracy

We have all the pieces of the kNN algorithm in place. An important remaining concern is how to evaluate the accuracy of predictions.

An easy way to evaluate the accuracy of the model is to calculate a ratio of the total correct predictions out of all predictions made, called the classification accuracy.

Below is the __get_accuracy()__ function that sums the total correct predictions and returns the accuracy as a percentage of correct classifications.

In [13]:
def get_accuracy(test_set, predictions):
    correct = 0
    for i in range(len(test_set)):
        if test_set[i][-1] == predictions[i]:
            correct += 1
    return correct / float(len(test_set)) * 100.0

We can test this function with a test dataset and predictions, as follows:

In [14]:
test_set = [[1, 1, 1, 'a'], [2, 2, 2, 'a'], [3, 3, 3, 'b']]
predictions = ['a', 'a', 'a']
accuracy = get_accuracy(test_set, predictions)
print('Accuracy: {0}'.format(accuracy))

Accuracy: 66.66666666666666


### 7. Main

We now have all the elements of the algorithm and we can tie them together with a main function.

Below is the complete example of implementing the kNN algorithm from scratch in Python.

In [15]:
# Example of kNN implemented from scratch in Python
import random
import math

def read_csv(filename):
    with open(filename) as f:
        dataset = [[x for x in line.split(',')] for line in f if line.strip()]
        for row in dataset:
            for i in range(len(row)-1):
                row[i] = float(row[i])
            row[-1] = row[-1].strip()
    return dataset

def train_test_split(dataset, split_ratio):
    train_size = int(len(dataset) * split_ratio)
    train = []
    test = list(dataset)
    while len(train) < train_size:
        index = random.randrange(len(test))
        train.append(test.pop(index))
    return train, test

def euclidean_distance(instance1, instance2, length):
    distance = 0
    for i in range(length):
        distance += (instance1[i] - instance2[i])**2
    return math.sqrt(distance)

def get_neighbors(training_set, test_instance, k):
    distances = []
    length = len(training_set[0]) - 1
    for train in training_set:
        dist = euclidean_distance(test_instance, train, length)
        distances.append((train, dist))
    distances.sort(key = lambda x: x[1])
    neighbors = [instance for instance, _ in distances[:k]]
    return neighbors

def get_response(neighbors):
    votes = [neighbor[-1] for neighbor in neighbors]
    return max(set(votes), key=votes.count)

def get_predictions(training_set, test_set, k):
    predictions = [get_response(get_neighbors(training_set, input_vector, k)) for input_vector in test_set]
    return predictions

def get_accuracy(test_set, predictions):
    correct = 0
    for i in range(len(test_set)):
        if test_set[i][-1] == predictions[i]:
            correct += 1
    return correct / float(len(test_set)) * 100.0

if __name__ == '__main__':
    dataset = read_csv('iris.data')
    train, test = train_test_split(dataset, 0.67)
    print('Split %d rows into train = %d and test = %d rows'
          % (len(dataset), len(train), len(test)))
    
    # Test model
    predictions = get_predictions(train, test, 3)
    accuracy = get_accuracy(test, predictions)
    print('Accuracy: {0}%'.format(accuracy))

Split 150 rows into train = 100 and test = 50 rows
Accuracy: 94.0%


## Ideas For Extensions

This section provides you with ideas for extensions that you could apply and investigate with the Python code you have implemented as part of this tutorial.

-  __Regression__: You could adapt the implementation to work for regression problems (predicting a real-valued attribute). The summarization of the closest instances could involve taking the mean or the median of the predicted attribute.
-  __Normalization__: When the units of measure differ between attributes, attributes with larger ranges can dominate the distance measure. For these types of problems, you will want to rescale all data attributes into the range 0-1 (called normalization) before calculating similarity. Update the model to support data normalization.
-  __Alternative Distance Measure__: There are many distance measures available, and you can even develop your own domain-specific distance measures if you like. Implement an alternative distance measure, such as Manhattan distance, or a similarity measure such as the vector dot product.
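
As a starting point for the normalization and alternative distance ideas above, here is a minimal sketch. The names `normalize()` and `manhattan_distance()` and the sample data are hypothetical. The sketch assumes the same row layout as the rest of the tutorial, with numeric attributes first and the class label last.

```python
def normalize(dataset):
    # Hypothetical helper: rescale every numeric column to the range 0-1.
    n_features = len(dataset[0]) - 1  # the last column is the class label
    minimums = [min(row[i] for row in dataset) for i in range(n_features)]
    maximums = [max(row[i] for row in dataset) for i in range(n_features)]
    normalized = []
    for row in dataset:
        scaled = []
        for i in range(n_features):
            span = maximums[i] - minimums[i]
            # A constant column has no spread; map it to 0.0 to avoid dividing by zero.
            scaled.append((row[i] - minimums[i]) / span if span else 0.0)
        normalized.append(scaled + [row[-1]])
    return normalized

def manhattan_distance(instance1, instance2, length):
    # Hypothetical alternative measure: the sum of absolute differences.
    return sum(abs(instance1[i] - instance2[i]) for i in range(length))

data = [[1.0, 100.0, 'a'], [2.0, 300.0, 'b'], [3.0, 200.0, 'a']]
print(normalize(data))
print(manhattan_distance([2, 2, 2, 'a'], [4, 4, 4, 'b'], 3))
```

When evaluating on a held-out test dataset, a common choice is to compute the minimums and maximums from the training data only and reuse them to rescale the test data.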

There are many more extensions to this algorithm you might like to explore. Two additional ideas are distance-weighted contributions from the k-most similar instances to the prediction, and more advanced tree-based data structures for searching for similar instances.
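
A distance-weighted vote could look roughly like the sketch below, in which each neighbor's vote counts in proportion to the inverse of its distance. The name `get_weighted_response()`, the small constant added to the distance, and the sample data are assumptions. The sketch also expects `(instance, distance)` pairs, so __get_neighbors()__ would need to be adapted to return the distances as well.

```python
from collections import defaultdict

def get_weighted_response(neighbors_with_distances):
    # Hypothetical variant: each (instance, distance) pair votes with weight 1 / distance.
    scores = defaultdict(float)
    for instance, distance in neighbors_with_distances:
        # The small constant (an assumption) avoids division by zero for exact matches.
        scores[instance[-1]] += 1.0 / (distance + 1e-9)
    # Return the class with the largest total weight.
    return max(scores, key=scores.get)

neighbors = [([1, 1, 1, 'a'], 0.5), ([2, 2, 2, 'b'], 2.0), ([3, 3, 3, 'b'], 4.0)]
print(get_weighted_response(neighbors))
```

In this example a plain majority vote would return 'b', while the weighted vote favors the single, much closer 'a' neighbor.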