# K-Nearest Neighbours Classifier (KNN)

KNN is a classification algorithm based on the idea that data points with similar attributes tend to fall into similar categories.

In the graph below, every data point, whether its color is red, green, or white, has an x value and a y value. As a result, it can be plotted on this two-dimensional graph.

The color represents the class that the KNN algorithm is trying to classify. Data points can either have the class green or the class red. If a data point is white, this means that it doesn't have a class yet. The purpose of the algorithm is to classify these unknown points.

<img src="img/knn-1.gif" width="700">

Finally, consider the expanding circle around the white point. This circle is finding the k nearest neighbors to the white point. When k = 3, the circle is fairly small. Two of the three nearest neighbors are green, and one is red. So in this case, the algorithm would classify the white point as green. However, when we increase k to 5, the circle expands, and the classification changes. Three of the nearest neighbors are red and two are green, so now the white point will be classified as red.

This is the central idea behind the K-Nearest Neighbor algorithm. If you have a dataset of points where the class of each point is known, you can take a new point with an unknown class, find its nearest neighbors, and classify it.
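The effect of `k` described above can be reproduced with a handful of made-up points. The sketch below is only an illustration: the coordinates, the `knn_predict` helper, and the colour labels are all assumptions chosen so that the vote flips between `k = 3` and `k = 5`, as in the animation.

```python
from collections import Counter

# Hypothetical labelled points: ((x, y), class)
known_points = [
    ((1.0, 0.0), "green"),
    ((0.0, 1.5), "green"),
    ((2.0, 0.0), "red"),
    ((0.0, 2.5), "red"),
    ((3.0, 0.0), "red"),
    ((6.0, 6.0), "green"),
]
unknown = (0.0, 0.0)  # the "white" point


def euclidean(a, b):
    # straight-line distance between two points of equal dimension
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def knn_predict(point, labelled_points, k):
    # keep the k labelled points closest to `point`, then take a majority vote
    nearest = sorted(labelled_points, key=lambda item: euclidean(item[0], point))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


print(knn_predict(unknown, known_points, 3))  # green (2 green, 1 red)
print(knn_predict(unknown, known_points, 5))  # red (3 red, 2 green)
```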

## Thinking about data points

A feature is a piece of information (an attribute or property) associated with a data point. K-Nearest Neighbours can work with any number of features greater than 0. When thinking about features of movie data points, we might consider:

 - the length of the movie in minutes.
 
 - the budget of a movie in dollars.
 
 - whether a certain individual directed, produced, or starred in the movie - True/False (boolean feature)
 
 - whether it was shot in black and white - True/False (boolean feature)
 
You then need to consider how you are going to classify your data points. In this example, we're going to be classifying movies as either good or bad. In our dataset, we're going to classify a movie as good if it had an IMDb rating of 7.0 or greater. Every "good" movie will have a class of 1, while every bad movie will have a class of 0.

Below are some movie data points where the first item in the list is the length, the second is the budget, and the third is whether the movie was directed by Stanley Kubrick.

```py
mean_girls = [97, 17000000, False]
the_shining = [146, 19000000, True]
gone_with_the_wind = [238, 3977000, False]
```
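Boolean features can sit alongside numeric ones because Python treats `True` as 1 and `False` as 0 in arithmetic. The short sketch below reuses two of the data points above; the `as_numbers` helper is a hypothetical name introduced only for illustration.

```python
mean_girls = [97, 17000000, False]
the_shining = [146, 19000000, True]

# True - False evaluates to 1, so a boolean dimension contributes
# either 0 (same value) or 1 (different value) to a difference
kubrick_gap = the_shining[2] - mean_girls[2]
print(kubrick_gap)  # 1


def as_numbers(point):
    # hypothetical helper: make the 0/1 encoding explicit
    return [float(value) for value in point]


print(as_numbers(the_shining))  # [146.0, 19000000.0, 1.0]
```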

## Distance between 2D points

To determine whether two points are close together or far apart, we'll use the `Distance Formula`. When thinking about a movie, we can consider the following dimensions: its `length` and its `release date`.

```py
# data points: [length in minutes, release year]
star_wars = [125, 1977]
raiders = [115, 1981]
mean_girls = [97, 2004]

# using euclidean distance
def distance(pt1, pt2):
    squared_difference = 0
    for i in range(len(pt1)):
        squared_difference += (pt1[i] - pt2[i]) ** 2
    return squared_difference ** 0.5

print(distance(star_wars, raiders))     # 10.770329614269007
print(distance(star_wars, mean_girls))  # 38.897300677553446
```


## Distance between N-dimensional points

If we were to add the movie's budget to our data points, we would have to find the distance between two points in three dimensions. We don't have to stop there. We can add further dimensions to our data points. The generalized distance formula between points A and B is as follows:

![Formula](img/formula-2.png)

Using the `distance()` function above, we can find the K-Nearest Neighbors of a point in N-dimensional space! We can now use as much information about our movies as we want.
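As a quick check that the same formula carries over to more dimensions, here is a minimal sketch that adds a third dimension to two of the earlier points. The budget figures are illustrative placeholders rather than values from the dataset, and the length check is an added safeguard.

```python
def distance(pt1, pt2):
    """Euclidean distance between two points with the same number of dimensions."""
    if len(pt1) != len(pt2):
        raise ValueError("points must have the same number of dimensions")
    return sum((a - b) ** 2 for a, b in zip(pt1, pt2)) ** 0.5


# [length in minutes, release year, budget in dollars]; budgets are illustrative
star_wars = [125, 1977, 11000000]
raiders = [115, 1981, 18000000]

print(distance(star_wars, raiders))          # dominated by the budget gap
print(distance([0, 0, 0, 0], [1, 1, 1, 1]))  # 2.0, a four-dimensional example
```

Note how the three-dimensional distance is almost entirely determined by the budget difference, which is why the data should be normalized before distances are compared.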

## Calculating K-Nearest Neighbour - Write a KNN Classifier from Scratch
 
Implementing the KNN algorithm involves three steps:
 
**1. Normalize the data**
 
When you consider the movie data, it is easy for certain features/dimensions to overwhelm others. For example, the difference between the release dates of two movies is measured in single or double digits of years, while the budgets can be hundreds of millions of dollars apart. A distance formula treats all dimensions equally, regardless of their scale: a difference of one year counts exactly the same as a difference of one dollar of budget. As a result, the budget completely outweighs every other dimension because it is on such a huge scale. The fact that two movies were released 10 years apart is essentially meaningless compared to a difference of millions in the budget dimension.

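The solution is to `Normalize` the data so that every value lies between 0 and 1. One such technique is `Min-Max` normalization, which maps the smallest value of a feature to 0 and the largest to 1. Once every feature is on the same scale, the distance formula can be applied across any number of dimensions/features.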

```py
release_dates = [1897, 1998, 2000, 1948, 1962, 1950, 1975, 1960, 2017, 1937, 1968, 1996, 1944, 1891, 1995, 1948, 2011, 1965, 1891, 1978]

def min_max_normalize(lst):
  minimum = min(lst) # 1891 == 0.0
  maximum = max(lst) # 2017 == 1.0
  normalized = []

  for value in lst:
    normalized_num = (value - minimum) / (maximum - minimum)
    normalized.append(normalized_num)

  return normalized

print(min_max_normalize(release_dates))
# [0.047619047619047616, 0.8492063492063492, 0.8650793650793651, 0.4523809523809524, 0.5634920634920635, 0.46825396825396826, 0.6666666666666666, 0.5476190476190477, 1.0, 0.36507936507936506, 0.6111111111111112, 0.8333333333333334, 0.42063492063492064, 0.0, 0.8253968253968254, 0.4523809523809524, 0.9523809523809523, 0.5873015873015873, 0.0, 0.6904761904761905]

# What does the date 1897 get normalized to?
print(min_max_normalize(release_dates)[0])
# 0.047619047619047616
```
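To see how normalization changes which points count as "close", the sketch below compares raw and min-max normalized distances. The four movies, their years, and their budgets are invented for the illustration.

```python
def distance(pt1, pt2):
    return sum((a - b) ** 2 for a, b in zip(pt1, pt2)) ** 0.5


def min_max_normalize(lst):
    minimum, maximum = min(lst), max(lst)
    return [(value - minimum) / (maximum - minimum) for value in lst]


# Hypothetical movies: release years and budgets in dollars
years = [1970, 1980, 2010, 1990]
budgets = [15000000, 15500000, 15000000, 115000000]

raw = list(zip(years, budgets))
scaled = list(zip(min_max_normalize(years), min_max_normalize(budgets)))

# Raw: movie 0 looks closer to movie 2 (40 years apart, same budget)
# than to movie 1 (10 years apart, $500,000 budget gap)
print(distance(raw[0], raw[1]), distance(raw[0], raw[2]))

# Normalized: the 10-year gap now counts for less than the 40-year gap,
# so movie 1 becomes the nearer neighbour
print(distance(scaled[0], scaled[1]), distance(scaled[0], scaled[2]))
```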


**2. Find the `k` nearest neighbours**

Next we want to find the k nearest neighbors of the unclassified point. In this case we will set `k` to 5.

In order to find the 5 nearest neighbors, we need to compare this new unclassified movie to every other movie in the dataset. This means we’re going to be using the distance formula again and again. We ultimately want to end up with a sorted list of distances and the movies associated with those distances, e.g. the unknown movie has a distance of 0.30 to Superman II.

```py
[
  [0.30, 'Superman II'],
  [0.31, 'Finding Nemo'],
  ...
  ...
  [0.38, 'Blazing Saddles']
]
```



#### Example

```py
from movies import training_set, training_labels, validation_set, validation_labels

# print(movie_dataset['Bruce Almighty'])
# print(movie_labels['Bruce Almighty'])

def distance(movie1, movie2):
  squared_difference = 0
  for i in range(len(movie1)):
    squared_difference += (movie1[i] - movie2[i]) ** 2
  final_distance = squared_difference ** 0.5
  return final_distance
  
# compute the distance from the 'unknown' point to every point in the dataset,
# sort from smallest to largest and return the k nearest neighbours
def classify(unknown, dataset, k):
  distances = []
  for title in dataset:
    distance_to_point = distance(dataset[title], unknown)
    distances.append([distance_to_point, title])
  distances.sort()
  neighbors = distances[0:k]
  return neighbors
  

# TEST - each movie data dimension has already been normalized
print(classify([0.4, 0.2, 0.9], training_set, 5))

# k nearest neighbours - assuming k == 5
[
    [0.08273614694606074, 'Lady Vengeance'], 
    [0.22989623153818367, 'Steamboy'], 
    [0.23641372358159884, 'Fateless'], 
    [0.26735445689589943, 'Princess Mononoke'], 
    [0.3311022951533416, 'Godzilla 2000']
]
```
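The example above depends on the course's `movies` module. For experimenting without it, here is a self-contained sketch of the same neighbour search over a small made-up dictionary. It uses `heapq.nsmallest` from the standard library instead of sorting the whole list, which is one possible design choice when only the `k` smallest distances are needed; `k_nearest` and `toy_set` are hypothetical names.

```python
import heapq


def distance(movie1, movie2):
    return sum((a - b) ** 2 for a, b in zip(movie1, movie2)) ** 0.5


def k_nearest(unknown, dataset, k):
    """Return the k closest [distance, title] pairs, nearest first."""
    pairs = ([distance(features, unknown), title]
             for title, features in dataset.items())
    return heapq.nsmallest(k, pairs)


# Hypothetical, already-normalized points standing in for training_set
toy_set = {
    "Movie A": [0.4, 0.2, 0.8],
    "Movie B": [0.9, 0.9, 0.1],
    "Movie C": [0.6, 0.2, 0.9],
    "Movie D": [0.0, 0.0, 0.0],
}

for dist, title in k_nearest([0.4, 0.2, 0.9], toy_set, 2):
    print(title, round(dist, 3))
```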

**3. Classify the new point based on those neighbours**

Next we count the number of good movies and bad movies in the list of neighbors. If more of the neighbors were good, then the algorithm will classify the unknown movie as good. Otherwise, it will classify it as bad.

In order to find the class of each neighbour, we look up its title in the labels dictionary (`training_labels` in the full example). For example, if Akira is labelled as a good movie, `training_labels['Akira']` gives us 1.

If there's a tie, e.g. there are four neighbours classified as good and four as bad, we would choose the class of the closest neighbour to classify the unknown point.
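The voting and tie-breaking rule just described could be sketched as follows. This is an illustration under assumptions: `vote` is a hypothetical helper name, and the neighbours and labels are made up.

```python
def vote(neighbors, labels):
    """neighbors: list of [distance, title] pairs sorted nearest first."""
    num_good = sum(1 for _, title in neighbors if labels[title] == 1)
    num_bad = len(neighbors) - num_good
    if num_good > num_bad:
        return 1
    if num_bad > num_good:
        return 0
    # Tie: fall back to the class of the single closest neighbour
    return labels[neighbors[0][1]]


# Hypothetical neighbours and labels
labels = {"A": 0, "B": 1, "C": 1, "D": 0}
neighbors = [[0.10, "A"], [0.20, "B"], [0.25, "C"], [0.30, "D"]]

print(vote(neighbors, labels))      # 0: two good, two bad, closest ("A") is bad
print(vote(neighbors[:3], labels))  # 1: two good, one bad
```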

```py
# Full code example
from movies import training_set, training_labels, validation_set, validation_labels, normalize_point

def distance(movie1, movie2):
  squared_difference = 0
  for i in range(len(movie1)):
    squared_difference += (movie1[i] - movie2[i]) ** 2
  final_distance = squared_difference ** 0.5
  return final_distance

def classify(unknown, dataset, labels, k):
  distances = []
  #Looping through all points in the dataset
  for title in dataset:
    movie = dataset[title]
    distance_to_point = distance(movie, unknown)
    #Adding the distance and point associated with that distance
    distances.append([distance_to_point, title])
  distances.sort()
  #Taking only the k closest points
  neighbors = distances[0:k]
	
  # classify the k closest neighbours - classify movies based on good(1) or bad(0)
  num_good = 0
  num_bad = 0
  for movie in neighbors:
    title = movie[1] # [distance, title]
    if labels[title]:
      num_good += 1
    else:
      num_bad += 1

  # compare the counts only after every neighbour has voted
  if num_good > num_bad:
    return 1
  return 0

# test
print(classify([0.4, 0.2, 0.9], training_set, training_labels, 5)) # 1
```

Now that we've built a classifier, it's time to test it on a real movie: create a data point for the movie, normalize it, and run it through the classifier to see what it predicts.

1. check that the movie is not in our movie database

```py
# check if the film title 'Little Big Man' is in the database
print('Little Big Man' in training_set) # False
```

2. create a datapoint for the movie: budget (dollars), runtime (minutes), and year released, in that order

```py
my_movie = [15000000, 139, 1970]
```

3. normalize the datapoint

```py
normalized_my_movie = normalize_point(my_movie)
```

4. classify the normalized data point

```py
print(classify(normalized_my_movie, training_set, training_labels, 5))
# 1 - Good, IMDb rating 7.6
```
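The `normalize_point` helper used in step 3 comes from the course's `movies` module and its implementation is not shown here. One way such a helper might work is to apply min-max normalization with the minimum and maximum of each feature in the raw training data. In the sketch below, the function name, the `feature_ranges` values, and the clamping of out-of-range values are all assumptions.

```python
# Hypothetical (minimum, maximum) pairs for each feature in the raw training data:
# budget in dollars, runtime in minutes, release year
feature_ranges = [(0, 300000000), (60, 240), (1920, 2020)]


def normalize_point_sketch(point, ranges):
    normalized = []
    for value, (minimum, maximum) in zip(point, ranges):
        scaled = (value - minimum) / (maximum - minimum)
        # assumption: clamp values that fall outside the training range to [0, 1]
        normalized.append(min(1.0, max(0.0, scaled)))
    return normalized


my_movie = [15000000, 139, 1970]
print(normalize_point_sketch(my_movie, feature_ranges))
```

The key point is that a new movie must be scaled with the same ranges as the training data; otherwise its distances to the training points are not comparable.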

Recall that a movie was classified as good if it had an IMDb rating of 7.0 or greater. Every "good" movie has a class of 1, while every bad movie has a class of 0.

## Validating our Classifier

It's now time to validate the classifier and determine how effective it is by feeding it every data point in our validation set and calculating the accuracy (we know what the actual class is for each movie in the validation set). The validation accuracy will change depending on what `k` value we use.

```py
print(validation_set['Bee Movie'])
# [0.012279463360232739, 0.18430034129692832, 0.898876404494382]

print(validation_labels['Bee Movie']) # 0 - actual

# use the classifier to predict the movie
guess = classify(validation_set['Bee Movie'], training_set, training_labels, 5)
print(guess) # 0 - predicted
```

## Choosing K

The validation accuracy changes as `k` changes. When choosing the `k` value we run into the problems of `overfitting` and `underfitting`. `Overfitting` happens when we pick a small `k` value and don't consider enough neighbours: the classifier relies too heavily on the training data and assumes that data in the real world will always behave exactly like it. When `k` is too small, outliers dominate the result. A single outlier could cause a datapoint to be predicted as class A even if every other point in the same area is class B.

<img src="img/scatter-plot-4.png" width="400">

Consider the black dot in the top left. With a very small `k`, such as `k == 1`, a single outlier can determine the label of an unknown point. All points in that general area will be classified as dark blue when they should probably be classified as green.

`Underfitting` occurs when we have a very large `k` value and so consider too many points: the classifier doesn't pay enough attention to the small quirks in the training set. If you have `k == 100` and the data set has 100 data points, then every unknown point will be classified in the same way, because every training point votes and the distances between the points no longer matter. When `k` is too big, the local structure of the dataset isn't represented and, in the extreme, all predictions will be exactly the same.

One problem can occur when choosing an even value of `k`. In such cases, if there are equal numbers of nearest neighbours of both classes, the vote is tied and the classifier needs a tie-breaking rule, such as using the class of the closest neighbour.

```py
def find_validation_accuracy(training_set, training_labels, validation_set, validation_labels, k):
    num_correct = 0.0
    for title in validation_set:
        guess = classify(validation_set[title], training_set, training_labels, k)
        if guess == validation_labels[title]:
            num_correct += 1
    # fraction of validation movies whose predicted class matches the actual class
    validation_accuracy = num_correct / len(validation_set)
    return validation_accuracy

# calculate accuracy with k == 3
print(find_validation_accuracy(training_set, training_labels, validation_set, validation_labels, 3)) # 0.6639344262295082
```

The graph below shows the validation accuracy of our movie classifier as `k` increases. When `k` is small, overfitting occurs and the accuracy is relatively low. On the other hand, when `k` gets too large, underfitting occurs and accuracy starts to drop.

<img src="img/validation-accuracy.png" width="600">

As k increases, you begin to avoid overfitting and accuracy goes up. Once k gets too big, you begin to underfit, and accuracy will go back down.
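The rise and fall of accuracy can be reproduced on a tiny scale by computing the validation accuracy for several values of `k` and keeping the best one. Everything in the sketch below is made up for illustration: the two-feature points, the labels (with `"g"` acting as an outlier), and the compact `classify` and `validation_accuracy` helpers.

```python
def distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def classify(unknown, dataset, labels, k):
    # titles of the k training points closest to `unknown`
    nearest = sorted(dataset, key=lambda title: distance(dataset[title], unknown))[:k]
    num_good = sum(labels[title] for title in nearest)
    return 1 if num_good > len(nearest) - num_good else 0


def validation_accuracy(train, train_labels, valid, valid_labels, k):
    correct = sum(
        classify(valid[title], train, train_labels, k) == valid_labels[title]
        for title in valid
    )
    return correct / len(valid)


# Hypothetical normalized training data; "g" is a good movie among bad ones
train = {
    "a": [0.1, 0.1], "b": [0.2, 0.1], "c": [0.15, 0.3],
    "d": [0.8, 0.9], "e": [0.9, 0.8], "f": [0.7, 0.7],
    "g": [0.2, 0.2],
}
train_labels = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1, "g": 1}
valid = {"v1": [0.22, 0.22], "v2": [0.85, 0.85], "v3": [0.1, 0.15], "v4": [0.75, 0.8]}
valid_labels = {"v1": 0, "v2": 1, "v3": 0, "v4": 1}

results = {
    k: validation_accuracy(train, train_labels, valid, valid_labels, k)
    for k in (1, 3, 5, 7)
}
best_k = max(results, key=results.get)
print(results)  # {1: 0.75, 3: 1.0, 5: 1.0, 7: 0.5}
print(best_k)   # 3
```

With `k = 1` the outlier misleads one prediction, and with `k = 7` every training point votes, so every prediction is the majority class.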

## Summary

1. Data with n features can be conceptualized as points lying in n-dimensional space.

2. It is essential to normalize data before computing distances for KNN; otherwise a feature with a vastly different scale will dominate the other features.

3. Data points can be compared by using the distance formula. Data points that are similar will have a smaller distance between them.

4. A point with an unknown class can be classified by finding the k nearest neighbors.

5. To verify the effectiveness of a classifier, data with known classes can be split into a training set and a validation set. Validation accuracy can then be calculated.

6. Classifiers have parameters that can be tuned to increase their effectiveness. In the case of K-Nearest Neighbors, k can be changed.

7. A classifier can be trained improperly and suffer from overfitting or underfitting. In the case of K-Nearest Neighbors, a low k often leads to overfitting and a large k often leads to underfitting.