# **K-Nearest Neighbours**

## Introduction

Before diving into the K-Nearest Neighbours algorithm, let’s first take a minute to think about an example.

Consider a dataset of movies. Let’s brainstorm some features of a movie data point. A feature is a piece of information associated with a data point. Here are some potential features of movie data points:
- the *length* of the movie in minutes.
- the *budget* of a movie in dollars.

If you think back to the previous exercise, you could imagine movies being placed in that two-dimensional space based on those numeric features. There could also be some boolean features: features that are either true or false. For example, here are some potential boolean features:
- *Black and white*. This feature would be `True` for black and white movies and `False` otherwise.
- *Directed by Stanley Kubrick*. This feature would be `False` for almost every movie, but for the few movies that were directed by Kubrick, it would be `True`.

Finally, let’s think about how we might want to classify a movie. For the rest of this lesson, we’re going to be classifying movies as either good or bad. In our dataset, we’ve classified a movie as good if it had an IMDb rating of 7.0 or greater. Every “good” movie will have a class of `1`, while every bad movie will have a class of `0`.

## Distance Between Points - 2D

We need to define what it means for two points to be close together or far apart. To do this, we’re going to use the Distance Formula.

For this example, the data has two dimensions:
- The length of the movie
- The movie’s release date

Consider *Star Wars* and *Raiders of the Lost Ark*. *Star Wars* is 125 minutes long and was released in 1977. *Raiders of the Lost Ark* is 115 minutes long and was released in 1981.

The distance between the movies is computed below:
$$\sqrt{(125 - 115)^2 + (1977 - 1981)^2} = 10.77$$
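The same arithmetic can be checked in a few lines of Python. This is only a sketch of the calculation above; the tuple layout (length, release year) is an assumption made for illustration.

```python
import math

# (length in minutes, release year), taken from the example above
star_wars = (125, 1977)
raiders = (115, 1981)

length_diff = star_wars[0] - raiders[0]  # 10
year_diff = star_wars[1] - raiders[1]    # -4

# Square each difference, add them up, then take the square root
distance_2d = math.sqrt(length_diff ** 2 + year_diff ** 2)
print(round(distance_2d, 2))  # 10.77
```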

## Distance Between Points - 3D

Making a movie rating predictor based on just the length and release date of movies is pretty limited. There are so many more interesting pieces of data about movies that we could use! So let’s add another dimension.

Let’s say this third dimension is the movie’s budget. We now have to find the distance between these two points in three dimensions.

<img src = "3d_graph_img.png" height=40% width=40%/>

What if we’re not happy with just three dimensions? Unfortunately, it becomes pretty difficult to visualize points in dimensions higher than 3. But that doesn’t mean we can’t find the distance between them.

The generalised distance formula between points A and B is as follows:
$$\sqrt{(A_1-B_1)^2+(A_2-B_2)^2+ \dots+(A_n-B_n)^2}$$

Here, $A_1 - B_1$ is the difference between the first feature of each point, and $A_n - B_n$ is the difference between the last feature of each point.

Using this formula, we can find the K-Nearest Neighbours of a point in N-dimensional space! We now can use as much information about our movies as we want.

We will eventually use these distances to find the nearest neighbours to an unlabelled point.
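As a rough sketch, the generalised formula might be written as a function that works for any number of features. The name `euclidean_distance` and the four-feature points are hypothetical, chosen only so the result is easy to verify by hand.

```python
def euclidean_distance(point_a, point_b):
    """Distance between two points that have the same number of features."""
    if len(point_a) != len(point_b):
        raise ValueError("points must have the same number of features")
    # Sum the squared difference of each pair of features
    squared_sum = sum((a - b) ** 2 for a, b in zip(point_a, point_b))
    return squared_sum ** 0.5

# Hypothetical four-feature points: differences are 1, 2, 2 and 4,
# so the squared differences sum to 25
print(euclidean_distance([1, 2, 3, 4], [2, 4, 5, 8]))  # 5.0
```

With two features, the same function reproduces the *Star Wars* and *Raiders of the Lost Ark* result from the previous section.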

## Data with Different Scales - Normalisation

In the next three lessons, we’ll implement the three steps of the K-Nearest Neighbour Algorithm:
- **Normalise the data**
- Find the `k` nearest neighbours
- Classify the new point based on those neighbours

When we added the dimension of budget, you might have realised there are some problems with the way our data currently looks.

Consider the two dimensions of release date and budget. The maximum difference between two movies’ release dates is about 125 years (the Lumière Brothers were making movies in the 1890s). However, the difference between two movies’ budgets can be millions of dollars.

The problem is that the distance formula treats all dimensions equally, regardless of their scale. If two movies came out 70 years apart, that should be a pretty big deal. However, right now, that’s exactly equivalent to two movies that have a difference in budget of 70 dollars. The difference in one year is exactly equal to the difference in one dollar of budget. That’s absurd!

Another way of thinking about this is that the budget completely outweighs the importance of all other dimensions because it is on such a huge scale. The fact that two movies were 70 years apart is essentially meaningless compared to the difference in millions in the other dimension.
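To make the imbalance concrete, here is a small sketch with made-up numbers. The three movies and their budgets are hypothetical; only the 70-year gap comes from the text above.

```python
import math

# (release year, budget in dollars); the budget figures are hypothetical
movie_a = (1950, 10_000_000)
movie_b = (2020, 10_000_000)  # 70 years later, same budget
movie_c = (1950, 12_000_000)  # same year, budget 2 million dollars higher

def raw_distance(p, q):
    # Plain distance formula on the unscaled values
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

print(raw_distance(movie_a, movie_b))  # 70.0
print(raw_distance(movie_a, movie_c))  # 2000000.0
```

On the raw values, a modest change in budget dwarfs a 70-year gap in release date, which is the problem normalisation is meant to address.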

The solution to this problem is to normalise the data so every value is between 0 and 1. In this lesson, we’re going to be using min-max normalisation.

-----

It is unlikely, but possible, that the minimum and maximum values for some feature are the same value. In that case, our calculation for normalisation would fail, due to division by zero.

$$\frac{value-minimum}{maximum-minimum}$$

To account for this possibility, one thing you can do is skip the calculation, and instead set all the values of that feature to the same value, say 0 or 1, for each data point. This way, they will all be weighed the same.

However, we may determine that when all values are the same, then this does not provide any useful information to us. So, we might also consider excluding that feature entirely. For example, say that we had a dataset for animal physical features and that every animal in our dataset had two legs. Since we know that each animal has two legs, then we might exclude that feature in our calculations.

```
release_dates = [1897, 1998, 2000, 1948, 1962, 1950, 1975, 1960, 2017, 1937, 1968, 1996, 1944, 1891, 1995, 1948, 2011, 1965, 1891, 1978]

def min_max_normalise(lst):
    minimum = min(lst)
    maximum = max(lst)

    normalised = []
    for i in range(len(lst)):
        new_val = (lst[i] - minimum) / (maximum - minimum)
        normalised.append(new_val)

    return normalised

print(min_max_normalise(release_dates))
```

```
[0.047619047619047616, 0.8492063492063492, 0.8650793650793651, 0.4523809523809524, 0.5634920634920635, 0.46825396825396826, 0.6666666666666666, 0.5476190476190477, 1.0, 0.36507936507936506, 0.6111111111111112, 0.8333333333333334, 0.42063492063492064, 0.0, 0.8253968253968254, 0.4523809523809524, 0.9523809523809523, 0.5873015873015873, 0.0, 0.6904761904761905]
```
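One possible way to guard against the division-by-zero case described earlier is sketched below. The function name `min_max_normalise_safe` and its `fallback` parameter are hypothetical, and using a constant for a feature with no spread is just one of the options discussed above.

```python
def min_max_normalise_safe(values, fallback=0.0):
    """Min-max normalise a list, tolerating a feature whose values are all equal."""
    minimum = min(values)
    maximum = max(values)
    value_range = maximum - minimum

    if value_range == 0:
        # Every value is identical, so give them all the same constant
        return [fallback for _ in values]

    return [(v - minimum) / value_range for v in values]

print(min_max_normalise_safe([2, 2, 2]))           # [0.0, 0.0, 0.0]
print(min_max_normalise_safe([1891, 1954, 2017]))  # [0.0, 0.5, 1.0]
```

Dropping such a feature altogether, as suggested for the two-legged animals example, would be an equally reasonable choice.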


## Finding the Nearest Neighbours

The K-Nearest Neighbour Algorithm:
- Normalise the data
- **Find the `k` nearest neighbours**
- Classify the new point based on those neighbours

Now that our data has been normalised and we know how to find the distance between two points, we can begin classifying unknown data!

To do this, we want to find the `k` nearest neighbours of the unclassified point. In a few exercises, we’ll learn how to properly choose `k`, but for now, let’s choose a number that seems somewhat reasonable. Let’s choose 5.

In order to find the 5 nearest neighbours, we need to compare this new unclassified movie to every other movie in the dataset. This means we’re going to be using the distance formula again and again. We ultimately want to end up with a sorted list of distances and the movies associated with those distances.

It might look something like this:
```
[
  [0.30, 'Superman II'],
  [0.31, 'Finding Nemo'],
  ...
  ...
  [0.38, 'Blazing Saddles']
]
```

In this example, the unknown movie has a distance of 0.30 to *Superman II*.

```
import json

# Use "with" so each file is closed as soon as it has been read
with open('movie_labels.json') as f:
    movie_labels = json.load(f)

with open('movie_dataset.json') as f:
    movie_dataset = json.load(f)
```

```
def distance(movie1, movie2):
    squared_difference = 0
    for i in range(len(movie1)):
        squared_difference += (movie1[i] - movie2[i]) ** 2
    # Take the square root once, after summing every squared difference
    return squared_difference ** 0.5

def classify(unknown, dataset, k):
    distances = []

    # Measure the distance from the unknown point to every movie in the dataset
    for title in dataset:
        movie_data = dataset[title]
        distance_to_point = distance(movie_data, unknown)
        distances.append([distance_to_point, title])

    # Sort by distance and keep the k closest movies
    distances.sort()

    neighbours = distances[:k]

    return neighbours

classify([0.4, 0.2, 0.9], movie_dataset, 5)
```

```
[[0.08273614694606074, 'Lady Vengeance'],
 [0.22989623153818367, 'Steamboy'],
 [0.23641372358159884, 'Fateless'],
 [0.26735445689589943, 'Princess Mononoke'],
 [0.3311022951533416, 'Godzilla 2000']]
```

## Count Neighbours

The K-Nearest Neighbour Algorithm:
- Normalise the data
- Find the `k` nearest neighbours
- **Classify the new point based on those neighbours**

We’ve now found the `k` nearest neighbours, and have stored them in a list that looks like this:
```
[
  [0.083, 'Lady Vengeance'],
  [0.230, 'Steamboy'],
  ...
  ...
  [0.331, 'Godzilla 2000']
]
```

Our goal now is to count the number of good movies and bad movies in the list of neighbours. If more of the neighbours were good, then the algorithm will classify the unknown movie as good. Otherwise, it will classify it as bad.

In order to find the class of each of the labels, we’ll need to look at our `movie_labels` dataset. For example, `movie_labels['Akira']` would give us `1` because Akira is classified as a good movie.

You may be wondering what happens if there’s a tie. What if `k = 8` and four neighbours were good and four neighbours were bad? There are different strategies, but one way to break the tie would be to choose the class of the closest point.

```
def classify(unknown, dataset, labels, k):
    distances = []

    for title in dataset:
        movie_data = dataset[title]
        distance_to_point = distance(movie_data, unknown)
        distances.append([distance_to_point, title])

    distances.sort()

    neighbours = distances[:k]

    # Count how many of the k nearest neighbours are good (1) and bad (0)
    num_good = 0
    num_bad = 0

    for movie in neighbours:
        title = movie[1]

        if labels[title] == 1:
            num_good += 1
        else:
            num_bad += 1

    # A tie (possible when k is even) falls through to 0 in this version
    if num_good > num_bad:
        return 1
    else:
        return 0

classify([0.4, 0.2, 0.9], movie_dataset, movie_labels, 5)
```

```
1
```
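The tie-breaking strategy mentioned above (choosing the class of the closest point) could be sketched as follows. The helper `vote_with_tie_break` and the toy neighbours and labels are hypothetical; the sketch assumes the neighbour list is already sorted by distance and that every label is either `0` or `1`.

```python
def vote_with_tie_break(neighbours, labels):
    """neighbours: [distance, title] pairs sorted by ascending distance."""
    num_good = sum(1 for _, title in neighbours if labels[title] == 1)
    num_bad = len(neighbours) - num_good

    if num_good > num_bad:
        return 1
    if num_bad > num_good:
        return 0

    # Tie: fall back to the class of the single closest neighbour
    closest_title = neighbours[0][1]
    return labels[closest_title]

# Hypothetical neighbours for k = 4: two good movies and two bad movies
toy_neighbours = [[0.10, 'A'], [0.20, 'B'], [0.30, 'C'], [0.40, 'D']]
toy_labels = {'A': 1, 'B': 0, 'C': 0, 'D': 1}

print(vote_with_tie_break(toy_neighbours, toy_labels))  # 1
```

Here the vote is split two against two, so the class of the closest movie, `'A'`, decides the result.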

## Classify Your Favourite Movie

Nice work! Your classifier is now able to predict whether a movie will be good or bad. So far, we’ve only tested this on a completely random point, `[0.4, 0.2, 0.9]`. In this exercise we’re going to pick a real movie, normalise it, and run it through our classifier to see what it predicts!

We are going to be testing our classifier using the 2020 movie *The Call Of The Wild*.

```
movie_title = 'The Call Of The Wild'
movie_data = [150000, 100, 2020]
# Check that the movie is not already in the dataset
print(movie_title in movie_dataset)

# Caution: this call rescales the movie's three values against each other,
# not against each feature's minimum and maximum across the dataset
normalised_data = min_max_normalise(movie_data)
print(normalised_data)

print(classify(normalised_data, movie_dataset, movie_labels, 5))
```

```
False
[1.0, 0.0, 0.012808539026017345]
1
```
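Calling `min_max_normalise` on a single movie compares its budget, length and year with each other, whereas the dataset was normalised one feature at a time. A sketch of per-feature normalisation for a new point is shown below. The helper `normalise_point` and the minimum and maximum values are hypothetical (they are not taken from the real dataset); in practice they would be the per-feature minimums and maximums of the data the classifier was built from.

```python
def normalise_point(point, feature_mins, feature_maxs):
    """Scale each feature of one point using that feature's own min and max."""
    normalised = []
    for value, lo, hi in zip(point, feature_mins, feature_maxs):
        if hi == lo:
            # No spread in this feature, so use a constant
            normalised.append(0.0)
        else:
            normalised.append((value - lo) / (hi - lo))
    return normalised

# Hypothetical per-feature ranges, in the order [budget, length, year]
feature_mins = [0, 50, 1920]
feature_maxs = [300_000_000, 250, 2020]

# Hypothetical new movie: 150 million dollars, 100 minutes, released in 2020
print(normalise_point([150_000_000, 100, 2020], feature_mins, feature_maxs))
# [0.5, 0.25, 1.0]
```

A new movie whose value lies outside the original range would land below 0 or above 1 with this approach, which is usually acceptable for a distance calculation.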


## Training and Validation Sets

You’ve now built your first K Nearest Neighbours algorithm capable of classification. You can feed your program a never-before-seen movie and it can predict whether its IMDb rating was above or below 7.0. However, we’re not done yet. We now need to report how effective our algorithm is. After all, it’s possible our predictions are totally wrong!

As with most machine learning algorithms, we have split our data into a training set and validation set.

Once these sets are created, we will want to use every point in the validation set as input to the K Nearest Neighbour algorithm. We will take a movie from the validation set, compare it to all the movies in the training set, find the K Nearest Neighbours, and make a prediction. After making that prediction, we can then peek at the real answer (found in the validation labels) to see if our classifier got the answer correct.

If we do this for every movie in the validation set, we can count the number of times the classifier got the answer right and the number of times it got it wrong. Using those two numbers, we can compute the validation accuracy.

Validation accuracy will change depending on what K we use. In the next exercise, we’ll use the validation accuracy to pick the best possible K for our classifier.
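The lesson loads sets that have already been split, and how they were produced is not shown. As a rough sketch, a random split of a title-keyed dataset might look like the following. The function `split_dataset`, the 20% validation fraction and the toy movies are all assumptions made for illustration.

```python
import random

def split_dataset(dataset, labels, validation_fraction=0.2, seed=0):
    """Randomly split title-keyed dicts into training and validation parts."""
    titles = sorted(dataset)  # fixed starting order, so the seed is reproducible
    random.Random(seed).shuffle(titles)

    n_validation = int(len(titles) * validation_fraction)
    validation_titles = titles[:n_validation]
    training_titles = titles[n_validation:]

    training_set = {t: dataset[t] for t in training_titles}
    training_labels = {t: labels[t] for t in training_titles}
    validation_set = {t: dataset[t] for t in validation_titles}
    validation_labels = {t: labels[t] for t in validation_titles}
    return training_set, training_labels, validation_set, validation_labels

# Ten hypothetical movies with made-up normalised features and labels
toy_dataset = {f"Movie {i}": [i / 10, 1 - i / 10, 0.5] for i in range(10)}
toy_labels = {title: int(features[0] >= 0.5) for title, features in toy_dataset.items()}

train_x, train_y, val_x, val_y = split_dataset(toy_dataset, toy_labels)
print(len(train_x), len(val_x))  # 8 2
```

The important property is that no movie appears in both sets, so the validation movies really are unseen by the classifier.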

```
# Use "with" so each file is closed as soon as it has been read
with open('training_labels.json') as f:
    training_labels = json.load(f)

with open('training_set.json') as f:
    training_dataset = json.load(f)

with open('validation_labels.json') as f:
    validation_labels = json.load(f)

with open('validation_set.json') as f:
    validation_dataset = json.load(f)
```

```
print(validation_dataset['Bee Movie'])
print(validation_labels['Bee Movie'])

guess = classify(validation_dataset['Bee Movie'], training_dataset, training_labels, 5)

if guess == validation_labels['Bee Movie']:
    print("Correct!")
else:
    print("Wrong!")
```

```
[0.012279463360232739, 0.18430034129692832, 0.898876404494382]
0
Correct!
```


## Choosing K

In the previous exercise, we found that our classifier got one point in the validation set correct. Now we can test every point to calculate the validation accuracy.

The validation accuracy changes as `k` changes. The first situation that will be useful to consider is when `k` is very small. Let’s say `k = 1`. We would expect the validation accuracy to be fairly low due to *overfitting*. Overfitting is a concept that will appear almost any time you are writing a machine learning algorithm. Overfitting occurs when you rely too heavily on your training data; you assume that data in the real world will always behave exactly like your training data. In the case of K-Nearest Neighbours, overfitting happens when you don’t consider enough neighbours. A single outlier could drastically determine the label of an unknown point. Consider the image below.

<img src = "dots_img.png" height=40% width=40%/>

The dark blue point in the top left corner of the graph looks like a fairly significant outlier. When `k = 1`, all points in that general area will be classified as dark blue when they should probably be classified as green. Our classifier has relied too heavily on the small quirks in the training data.

On the other hand, if `k` is very large, our classifier will suffer from *underfitting*. Underfitting occurs when your classifier doesn’t pay enough attention to the small quirks in the training set. Imagine you have 100 points in your training set and you set `k = 100`. Every single unknown point will be classified in the same exact way. The distances between the points don’t matter at all! This is an extreme example, but it demonstrates how the classifier can lose understanding of the training data if `k` is too big.

```
def find_validation_accuracy(training_set, training_labels, validation_set, validation_labels, k):
    num_correct = 0.0
    for title in validation_set:
        guess = classify(validation_set[title], training_set, training_labels, k)

        if guess == validation_labels[title]:
            num_correct += 1

    # Fraction of validation movies that were classified correctly
    accuracy = num_correct / len(validation_set)

    return accuracy

print(find_validation_accuracy(training_dataset, training_labels, validation_dataset, validation_labels, k = 3))
```

```
0.6639344262295082
```


Outliers can affect the classifications in a negative way, because of the sensitivity of K-Nearest Neighbours to them.

Even a single outlier can cause problems if the value of `k` is very small, like `k = 1`, because any point near the outlier will be more influenced by it.

One reason why outliers are so impactful is that the K-Nearest Neighbours technique is completely dependent upon the input data. Outliers in the input data can impact the boundaries of classification because points that fall near to them can be classified differently than expected.

To avoid these issues, it can be a good idea to try to remove outliers from the data first. Another option is to choose a value of `k` larger than 1, though not so large that it causes underfitting. With a well-chosen `k`, the classifier can remain accurate despite possible outliers, because each prediction takes into account the surrounding neighbour points as well as the outlier.

## Graph of K

The graph below shows the validation accuracy of our movie classifier as `k` increases. When `k` is small, overfitting occurs and the accuracy is relatively low. On the other hand, when `k` gets too large, underfitting occurs and accuracy starts to drop.

<img src = "validation_acc_img.png" height=40% width=40%/>

Most datasets produce a curve with a broadly similar shape, although the details will differ:
- For small values of `k`, the accuracy tends to be low, because the model overfits the data.
- As `k` increases, accuracy rises until it reaches a “hump” that contains the best values of `k`. In this particular graph, validation accuracy is highest at around `k = 74`.
- Beyond this “hump”, accuracy begins to drop as `k` increases further and underfitting sets in.
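A curve like this comes from computing the validation accuracy for a range of `k` values and keeping the best one. The sketch below does this on a tiny, hypothetical dataset that includes one deliberate outlier (`'g'`). The helper names and all of the points are made up, so the numbers only illustrate the overfitting and underfitting pattern and say nothing about the movie data.

```python
def distance_sketch(point_a, point_b):
    return sum((a - b) ** 2 for a, b in zip(point_a, point_b)) ** 0.5

def classify_sketch(unknown, dataset, labels, k):
    # Sort every training point by its distance to the unknown point
    distances = sorted([distance_sketch(point, unknown), title] for title, point in dataset.items())
    votes = [labels[title] for _, title in distances[:k]]
    # Class 1 wins only with a strict majority of the k votes
    return 1 if sum(votes) > len(votes) / 2 else 0

def accuracy_sketch(train_x, train_y, val_x, val_y, k):
    correct = sum(classify_sketch(val_x[t], train_x, train_y, k) == val_y[t] for t in val_x)
    return correct / len(val_x)

# Hypothetical training data: a class-0 cluster, a class-1 cluster,
# and one class-1 outlier ('g') sitting inside the class-0 cluster
train_points = {'a': [0.1, 0.1], 'b': [0.2, 0.1], 'c': [0.15, 0.2],
                'd': [0.9, 0.9], 'e': [0.8, 0.9], 'f': [0.85, 0.8],
                'g': [0.12, 0.12]}
train_classes = {'a': 0, 'b': 0, 'c': 0, 'd': 1, 'e': 1, 'f': 1, 'g': 1}

val_points = {'v1': [0.13, 0.13], 'v2': [0.9, 0.85], 'v3': [0.2, 0.2], 'v4': [0.8, 0.8]}
val_classes = {'v1': 0, 'v2': 1, 'v3': 0, 'v4': 1}

accuracies = {k: accuracy_sketch(train_points, train_classes, val_points, val_classes, k)
              for k in (1, 3, 5, 7)}
print(accuracies)  # {1: 0.75, 3: 1.0, 5: 1.0, 7: 0.5}

best_k = max(accuracies, key=accuracies.get)
print(best_k)  # 3
```

With `k = 1` the outlier misleads the nearby validation point, and with `k = 7` (every training point) all predictions collapse to the majority class, which mirrors the two ends of the graph.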

## Using sklearn

You’ve now written your own K-Nearest Neighbour classifier from scratch! However, rather than writing your own classifier every time, you can use Python’s `sklearn` library. `sklearn` is a Python library specifically used for Machine Learning. It has an amazing number of features, but for now, we’re only going to investigate its K-Nearest Neighbour classifier.

There are a couple of steps we’ll need to go through in order to use the library. First, you need to create a `KNeighborsClassifier` object. Its `n_neighbors` parameter sets `k`. For example, the code below will create a classifier where `k = 3`:

`classifier = KNeighborsClassifier(n_neighbors = 3)`

Next, we’ll need to train our classifier. The `.fit()` method takes two parameters. The first is a list of points, and the second is the labels associated with those points. So for our movie example, we might have something like this:
```
training_points = [
  [0.5, 0.2, 0.1],
  [0.9, 0.7, 0.3],
  [0.4, 0.5, 0.7]
]

training_labels = [0, 1, 1]
classifier.fit(training_points, training_labels)
```

Finally, after training the model, we can classify new points. The `.predict()` method takes a list of points that you want to classify. It returns a list of its guesses for those points.
```
unknown_points = [
  [0.2, 0.1, 0.7],
  [0.4, 0.7, 0.6],
  [0.5, 0.8, 0.1]
]
guesses = classifier.predict(unknown_points)
```
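Putting the three steps together, a complete run might look like the sketch below. It assumes `scikit-learn` is installed; the points and labels are hypothetical, chosen as two well-separated groups so the expected predictions are easy to reason about.

```python
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical normalised points forming two well-separated groups
toy_points = [
    [0.1, 0.2, 0.1], [0.2, 0.1, 0.2], [0.1, 0.1, 0.3],  # class 0
    [0.9, 0.8, 0.9], [0.8, 0.9, 0.8], [0.9, 0.9, 0.7],  # class 1
]
toy_point_labels = [0, 0, 0, 1, 1, 1]

# Step 1: create the classifier with k = 3
toy_classifier = KNeighborsClassifier(n_neighbors=3)

# Step 2: train it on the labelled points
toy_classifier.fit(toy_points, toy_point_labels)

# Step 3: classify new points
toy_unknown = [[0.15, 0.1, 0.2], [0.8, 0.9, 0.85]]
toy_guesses = toy_classifier.predict(toy_unknown)
print(toy_guesses)  # [0 1]

# .score() returns the fraction of points whose prediction matches the given labels
print(toy_classifier.score(toy_unknown, [0, 1]))  # 1.0
```

The `.score()` call plays the same role as the hand-written validation accuracy function from earlier in the lesson.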

```
import pandas as pd
from random import shuffle, seed
import numpy as np

seed(100)

df = pd.read_csv("movies.csv")
df = df.dropna()

# A movie counts as good if its IMDb score is 7 or higher
good_movies = df.loc[df['imdb_score'] >= 7]
bad_movies = df.loc[df['imdb_score'] < 7]

def min_max_normalise(lst):
    minimum = min(lst)
    maximum = max(lst)

    normalised = []
    for value in lst:
        normalised_num = (value - minimum) / (maximum - minimum)
        normalised.append(normalised_num)

    return normalised

x_good = good_movies["budget"]
y_good = good_movies["duration"]
z_good = good_movies['title_year']
x_bad = bad_movies["budget"]
y_bad = bad_movies["duration"]
z_bad = bad_movies['title_year']

data = [x_good, y_good, z_good, x_bad, y_bad, z_bad]

# Note: each of the six columns is normalised on its own, so good and bad
# movies are scaled with different minimums and maximums
arrays_data = []
for d in data:
    norm_d = min_max_normalise(d)
    arrays_data.append(np.array(norm_d))

# Attach the class label as a fourth element: 1 for good, 0 for bad
good_class = list(zip(arrays_data[0].flatten(), arrays_data[1].flatten(), arrays_data[2].flatten(), np.array([1] * len(arrays_data[0]))))
# Note: the label array below has len(arrays_data[0]) entries and zip stops at
# the shortest input, so bad_class has at most as many rows as there are good movies
bad_class = list(zip(arrays_data[3].flatten(), arrays_data[4].flatten(), arrays_data[5].flatten(), np.array([0] * len(arrays_data[0]))))

dataset = good_class + bad_class
shuffle(dataset)

# Separate the features from the labels
movie_dataset = []
labels = []
for movie in dataset:
    movie_dataset.append(movie[:-1])
    labels.append(movie[-1])
```

```
from sklearn.neighbors import KNeighborsClassifier

classifier = KNeighborsClassifier(n_neighbors=5)
classifier.fit(movie_dataset, labels)

print(classifier.predict([[.45, .2, .5], [.25, .8, .9],[.1, .1, .9]]))
```

```
[1 1 0]
```


## Review

Congratulations! You just implemented your very own classifier from scratch and used Python’s sklearn library. In this lesson, you learned some techniques very specific to the K-Nearest Neighbour algorithm, but some general machine learning techniques as well. Some of the major takeaways from this lesson include:
- Data with `n` features can be conceptualised as points lying in n-dimensional space.
- Data points can be compared by using the distance formula. Data points that are similar will have a smaller distance between them.
- A point with an unknown class can be classified by finding the `k` nearest neighbours.
- To verify the effectiveness of a classifier, data with known classes can be split into a training set and a validation set. Validation accuracy can then be calculated.
- Classifiers have parameters that can be tuned to increase their effectiveness. In the case of K-Nearest Neighbours, `k` can be changed.
- A classifier can be trained improperly and suffer from overfitting or underfitting. In the case of K-Nearest Neighbours, a low `k` often leads to overfitting and a large `k` often leads to underfitting.
- Python’s `sklearn` library can be used for many classification and machine learning algorithms.