### K-Nearest Neighbors Classifier

K-Nearest Neighbors (KNN) is a classification algorithm. The central idea is that data points with similar attributes tend to fall into similar categories.

Before diving into the K-Nearest Neighbors algorithm, let’s first take a minute to think about an example.

Consider a dataset of movies. Let’s brainstorm some features of a movie data point. A feature is a piece of information associated with a data point. Here are some potential features of movie data points:

* the length of the movie in minutes.
* the budget of a movie in dollars.
* whether the movie was directed by Stanley Kubrick (True or False).

In [1]:
# each movie is [length in minutes, budget in dollars, directed by Stanley Kubrick (True/False)]
mean_girls = [97, 17000000, False]
the_shining = [146, 19000000, True]
gone_with_the_wind = [238, 3977000, False]

#### Distance Between Points - 2D

We were able to visualize the dataset and estimate the k nearest neighbors of an unknown point. But a computer isn’t going to be able to do that!

We need to define what it means for two points to be close together or far apart. To do this, we’re going to use the Distance Formula:
$$ \sqrt{(A_{0}-B_{0})^2+(A_{1}-B_{1})^2} $$
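
As a quick sketch of the formula above, the snippet below computes the distance between two 2D points. The function name `distance_2d` and the sample points are illustrative assumptions, not part of the original notebook.

```python
def distance_2d(a, b):
    # Euclidean distance between two points that each have exactly two coordinates
    return ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5

# hypothetical points: [length in minutes, release year]
point_a = [125, 1977]
point_b = [115, 1981]
print(distance_2d(point_a, point_b))
```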

#### Distance Between Points - 3D

Making a movie rating predictor based on just the length and release date of movies is pretty limited. There are so many more interesting pieces of data about movies that we could use! So let’s add another dimension.

Let’s say this third dimension is the movie’s budget. We now have to find the distance between these two points in three dimensions.
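
Extending the 2D formula by one term gives the distance in three dimensions. The notation is the same as before, with index 2 standing for the added dimension:

$$ \sqrt{(A_{0}-B_{0})^2+(A_{1}-B_{1})^2+(A_{2}-B_{2})^2} $$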

What if we’re not happy with just three dimensions? Unfortunately, it becomes pretty difficult to visualize points in dimensions higher than 3. But that doesn’t mean we can’t find the distance between them.

The generalized distance formula between points A and B is as follows:
$$ \sqrt{(A_{0}-B_{0})^2+(A_{1}-B_{1})^2+...+(A_{n}-B_{n})^2} $$

Using this formula, we can find the K-Nearest Neighbors of a point in N-dimensional space! We now can use as much information about our movies as we want.

In [2]:
star_wars = [125, 1977, 11000000]
raiders = [115, 1981, 18000000]
mean_girls = [97, 2004, 17000000]

# distance function that works for any number of dimensions
def distance(movie1, movie2):
  squared_difference = 0.0
  for i in range(len(movie1)):
    squared_difference += (movie1[i] - movie2[i]) ** 2
  # after the loop, take the square root of the summed squares
  final_distance = squared_difference ** 0.5
  return final_distance

# print the new distance between Star Wars and Raiders
print(f"distance between Star Wars and Raiders: {distance(star_wars, raiders)}")
# print the new distance between Star Wars and Mean Girls
print(f"distance between Star Wars and Mean Girls: {distance(star_wars, mean_girls)}")

distance between Star Wars and Raiders: 7000000.000008286
distance between Star Wars and Mean Girls: 6000000.000126083


### Data with Different Scales: Normalization

We’ll implement the three steps of the K-Nearest Neighbors algorithm:

1. Normalize the data
2. Find the k nearest neighbors
3. Classify the new point based on those neighbors

When we added the dimension of budget, you might have realized there are some problems with the way our data currently looks.

Consider the two dimensions of release date and budget. The maximum difference between two movies’ release dates is about 125 years (The Lumière Brothers were making movies in the 1890s). However, the difference between two movies’ budgets can be millions of dollars.

The problem is that the distance formula treats all dimensions equally, regardless of their scale. If two movies came out 70 years apart, that should be a pretty big deal. However, right now, that’s exactly equivalent to two movies that have a difference in budget of 70 dollars. The difference in one year is exactly equal to the difference in one dollar of budget. That’s absurd!

Another way of thinking about this is that the budget completely outweighs the importance of all other dimensions because it is on such a huge scale. The fact that two movies were 70 years apart is essentially meaningless compared to the difference in millions in the other dimension.
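
To make the imbalance concrete, the sketch below splits the squared distance between Star Wars and Raiders (using the raw values from the earlier cell) into the share contributed by each feature. The variable names `squared_diffs`, `shares`, and `names` are illustrative assumptions.

```python
star_wars = [125, 1977, 11000000]
raiders = [115, 1981, 18000000]

# squared difference contributed by each dimension
squared_diffs = [(a - b) ** 2 for a, b in zip(star_wars, raiders)]
total = sum(squared_diffs)

# share of the total squared distance coming from each feature
names = ["length", "release year", "budget"]
shares = [sq / total for sq in squared_diffs]
for name, share in zip(names, shares):
    print(f"{name}: {share:.12f}")
```

On these raw values, the budget term accounts for almost all of the squared distance, which is why length and release year barely influence which neighbors are found.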

The solution to this problem is to normalize the data so every value is between 0 and 1. In this case, we’re going to be using min-max normalization.
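
For reference, min-max normalization rescales each value $x$ of a feature using that feature's smallest and largest values (the symbols $x$ and $x'$ are notation introduced here):

$$ x' = \frac{x - \min(x)}{\max(x) - \min(x)} $$

The smallest value of the feature maps to 0 and the largest maps to 1.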

In [4]:
release_dates = [1897, 1998, 2000, 1948, 1962, 1950, 
                 1975, 1960, 2017, 1937, 1968, 1996, 
                 1944, 1891, 1995, 1948, 2011, 1965, 
                 1891, 1978]

### 1. Normalize the data

In [5]:
# Normalize the data
# create min-max normalization
def min_max_normalize(lst):
  minimum = min(lst)
  maximum = max(lst)
  normalized = []
  for value in lst:
    normalized_value = (value - minimum) / (maximum - minimum)
    normalized.append(normalized_value)
  return normalized
  
# call min_max_normalize on the release_dates list
result_normalized = min_max_normalize(release_dates)
# print the resulting list
print(f"resulting list: {result_normalized}")
# print the normalized value of the first date, 1897
print(f"result of 1897: {result_normalized[0]}")

resulting list: [0.047619047619047616, 0.8492063492063492, 0.8650793650793651, 0.4523809523809524, 0.5634920634920635, 0.46825396825396826, 0.6666666666666666, 0.5476190476190477, 1.0, 0.36507936507936506, 0.6111111111111112, 0.8333333333333334, 0.42063492063492064, 0.0, 0.8253968253968254, 0.4523809523809524, 0.9523809523809523, 0.5873015873015873, 0.0, 0.6904761904761905]
result of 1897: 0.047619047619047616
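
A movie has several features, so each feature needs to be normalized separately. The following is a minimal sketch of one way to do that, assuming the dataset is a dictionary that maps a title to a list of raw feature values. The helper `normalize_dataset`, the dictionary layout, and the titles are assumptions made for illustration. `min_max_normalize` is repeated so the snippet runs on its own.

```python
def min_max_normalize(lst):
    minimum, maximum = min(lst), max(lst)
    return [(value - minimum) / (maximum - minimum) for value in lst]

def normalize_dataset(dataset):
    # dataset maps title -> list of raw feature values (hypothetical layout)
    titles = list(dataset)
    # transpose rows into columns, one column per feature
    columns = list(zip(*(dataset[title] for title in titles)))
    normalized_columns = [min_max_normalize(column) for column in columns]
    # transpose back so each title gets its own normalized feature list
    rows = zip(*normalized_columns)
    return {title: list(row) for title, row in zip(titles, rows)}

movies = {
    "Star Wars": [125, 1977, 11000000],
    "Raiders": [115, 1981, 18000000],
    "Mean Girls": [97, 2004, 17000000],
}
normalized_movies = normalize_dataset(movies)
for title, features in normalized_movies.items():
    print(title, features)
```

Note that this sketch divides by zero if every movie has the same value for some feature, so a real dataset may need a guard for constant columns.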


### 2. Finding the Nearest Neighbors

Now that our data has been normalized and we know how to find the distance between two points, we can begin classifying unknown data!

To do this, we want to find the k nearest neighbors of the unclassified point.

For now, let’s choose a number that seems somewhat reasonable: k = 5.

In order to find the 5 nearest neighbors, we need to compare this new unclassified movie to every other movie in the dataset. This means we’re going to be using the distance formula again and again. We ultimately want to end up with a sorted list of distances and the movies associated with those distances.

In [7]:
# create a classify function that takes 3 parameters
def classify(unknown, dataset, k):
  distances = []
  for title in dataset:
    distance_to_point = distance(dataset[title], unknown)
    distances.append([distance_to_point, title])
  # sort the list by distance (from smallest to largest)
  distances.sort()
  # keep only the k nearest neighbors
  neighbors = distances[0:k]
  return neighbors
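
As a usage sketch, the snippet below runs this neighbor search on a tiny made-up dataset. The titles and feature values in `toy_dataset` are hypothetical and assumed to be already normalized. `distance` and `classify` are repeated so the example is self-contained.

```python
def distance(movie1, movie2):
    squared_difference = 0.0
    for i in range(len(movie1)):
        squared_difference += (movie1[i] - movie2[i]) ** 2
    return squared_difference ** 0.5

def classify(unknown, dataset, k):
    distances = []
    for title in dataset:
        distances.append([distance(dataset[title], unknown), title])
    # sort by distance, then keep the k closest entries
    distances.sort()
    return distances[0:k]

# hypothetical, already-normalized movies: [length, release year, budget]
toy_dataset = {
    "Movie A": [0.1, 0.2, 0.3],
    "Movie B": [0.9, 0.8, 0.7],
    "Movie C": [0.2, 0.2, 0.2],
    "Movie D": [0.5, 0.5, 0.5],
}
neighbors = classify([0.1, 0.2, 0.25], toy_dataset, 2)
for dist, title in neighbors:
    print(f"{title}: {dist:.4f}")
```

Each entry in the result is a `[distance, title]` pair, ordered from the closest movie to the farthest.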

### 3. Classify the new point based on those neighbors

#### Count Neighbors

We’ve now found the k nearest neighbors and have stored them in a list.

Our goal now is to count the number of good movies and bad movies in the list of neighbors. If more of the neighbors were good, then the algorithm will classify the unknown movie as good. Otherwise, it will classify it as bad.

What happens if there’s a tie? What if k = 8 and four neighbors were good and four neighbors were bad? There are different strategies, but one way to break the tie would be to choose the class of the closest point.
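
The closest-point tie-break can be sketched as follows. This is one possible approach, not necessarily how the notebook resolves ties. It assumes binary labels where 1 means good and 0 means bad, a neighbors list of `[distance, title]` pairs sorted from closest to farthest, and at least one neighbor. The function name `vote_with_tie_break` and the sample data are hypothetical.

```python
def vote_with_tie_break(neighbors, labels):
    # neighbors: list of [distance, title], sorted from closest to farthest
    num_good = sum(1 for _, title in neighbors if labels[title] == 1)
    num_bad = len(neighbors) - num_good
    if num_good > num_bad:
        return 1
    if num_bad > num_good:
        return 0
    # tie: fall back to the label of the single closest neighbor
    closest_title = neighbors[0][1]
    return labels[closest_title]

# hypothetical neighbors and labels that produce a 2-2 tie
sample_neighbors = [[0.1, "A"], [0.2, "B"], [0.3, "C"], [0.4, "D"]]
sample_labels = {"A": 1, "B": 0, "C": 0, "D": 1}
print(vote_with_tie_break(sample_neighbors, sample_labels))
```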

Our classify function now needs to know the labels, so we add a parameter named labels as its third parameter.

In [10]:
# classify now also takes the labels as its third parameter
def classify(unknown, dataset, labels, k):
  distances = []
  #Looping through all points in the dataset
  for title in dataset:
    movie = dataset[title]
    distance_to_point = distance(movie, unknown)
    #Adding the distance and point associated with that distance
    distances.append([distance_to_point, title])
  distances.sort()
  #Taking only the k closest points
  neighbors = distances[0:k]

  num_class1 = 0
  num_class0 = 0
  for movie in neighbors:
    title = movie[1]
    if labels[title] == 0:
      num_class0 += 1
    else:
      num_class1 += 1
  # classify our unknown movie; a tie falls through to class 0
  if num_class1 > num_class0:
    return 1
  else:
    return 0

Now we can classify new data by calling this function. Before doing so, we must normalize the new data point that we want to classify.
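
One common way to normalize a new point, assumed here rather than taken from the original notebook, is to reuse the minimum and maximum of each feature from the training data. The helper `normalize_point`, the sample rows, and the new movie's values are all hypothetical.

```python
def normalize_point(point, feature_mins, feature_maxs):
    # rescale each feature of the new point with the training set's min and max
    return [
        (value - low) / (high - low)
        for value, low, high in zip(point, feature_mins, feature_maxs)
    ]

# hypothetical raw training rows: [length, release year, budget]
training_rows = [
    [125, 1977, 11000000],
    [115, 1981, 18000000],
    [97, 2004, 17000000],
]
columns = list(zip(*training_rows))
feature_mins = [min(column) for column in columns]
feature_maxs = [max(column) for column in columns]

new_movie = [111, 1990, 14500000]
normalized_new_movie = normalize_point(new_movie, feature_mins, feature_maxs)
print(normalized_new_movie)
```

A new point whose raw value lies outside the training range ends up below 0 or above 1 with this approach.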

#### Training and Validation Sets

We’re not done yet. We now need to report how effective our algorithm is. After all, it’s possible our predictions are totally wrong!

As with most machine learning algorithms, we have split our data into a training set and validation set.

Once these sets are created, we will want to use every point in the validation set as input to the K-Nearest Neighbors algorithm. We will take a movie from the validation set, compare it to all the movies in the training set, find the k nearest neighbors, and make a prediction. After making that prediction, we can then peek at the real answer (found in the validation labels) to see if our classifier got the answer correct.

If we do this for every movie in the validation set, we can count the number of times the classifier got the answer right and the number of times it got it wrong. Using those two numbers, we can compute the validation accuracy.

Validation accuracy will change depending on what K we use.
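
The counting procedure described above can be sketched as follows. The function name `find_validation_accuracy` and the tiny already-normalized dataset are assumptions made for illustration. `distance` and `classify` are compact restatements of the functions from earlier so the snippet runs on its own.

```python
def distance(movie1, movie2):
    return sum((a - b) ** 2 for a, b in zip(movie1, movie2)) ** 0.5

def classify(unknown, dataset, labels, k):
    distances = sorted([distance(dataset[title], unknown), title] for title in dataset)
    neighbors = distances[0:k]
    num_class1 = sum(1 for _, title in neighbors if labels[title] != 0)
    num_class0 = len(neighbors) - num_class1
    return 1 if num_class1 > num_class0 else 0

def find_validation_accuracy(training_set, training_labels,
                             validation_set, validation_labels, k):
    num_correct = 0
    for title in validation_set:
        guess = classify(validation_set[title], training_set, training_labels, k)
        if guess == validation_labels[title]:
            num_correct += 1
    # fraction of validation movies that were classified correctly
    return num_correct / len(validation_set)

# tiny hypothetical, already-normalized data
training_set = {"A": [0.1, 0.1], "B": [0.2, 0.1], "C": [0.9, 0.8], "D": [0.8, 0.9]}
training_labels = {"A": 0, "B": 0, "C": 1, "D": 1}
validation_set = {"E": [0.15, 0.2], "F": [0.85, 0.85]}
validation_labels = {"E": 0, "F": 1}

for k in [1, 3]:
    accuracy = find_validation_accuracy(training_set, training_labels,
                                        validation_set, validation_labels, k)
    print(f"k = {k}: validation accuracy = {accuracy}")
```

Running the same loop over a wider range of k values on a real dataset is one way to pick a k that generalizes well.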

Note: the dataset is not included here, so the following cell cannot be run as-is.

In [None]:
# call the classify function to predict with k = 5
guess = classify(validation_set["Bee Movie"], training_set, training_labels, k=5)
# print the guess
print(f"the guess for Bee Movie is: {guess}")

# check whether our classification got it right
if guess == validation_labels["Bee Movie"]:
  print("classification of Bee Movie was correct!")
else:
  print("classification of Bee Movie was wrong!")