### K-Nearest Neighbors Classifier

K-Nearest Neighbors (KNN) is a classification algorithm. The central idea is that data points with similar attributes tend to fall into similar categories.

Before diving into the K-Nearest Neighbors algorithm, let’s first take a minute to think about an example.

Consider a dataset of movies. Let’s brainstorm some features of a movie data point. A feature is a piece of information associated with a data point. Here are some potential features of movie data points:

* the length of the movie in minutes.
* the budget of the movie in dollars.
* whether the movie was directed by Stanley Kubrick (True or False).

In [1]:
# each movie is [length in minutes, budget in dollars, directed by Stanley Kubrick (True/False)]
mean_girls = [97, 17000000, False]
the_shining = [146, 19000000, True]
gone_with_the_wind = [238, 3977000, False]

#### Distance Between Points - 2D

We were able to visualize the dataset and estimate the k nearest neighbors of an unknown point. But a computer isn’t going to be able to do that!

We need to define what it means for two points to be close together or far apart. To do this, we’re going to use the Distance Formula:
$$ \sqrt{(A_{0}-B_{0})^2+(A_{1}-B_{1})^2} $$
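As a quick check of the formula, here is a minimal sketch in Python. It assumes each movie is a two-element list of `[length in minutes, release year]`; the helper name `distance_2d` is made up for illustration.

```python
import math

# Hypothetical helper: distance between two 2D points a and b,
# where a[0], b[0] are lengths and a[1], b[1] are release years.
def distance_2d(a, b):
    return math.sqrt((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2)

star_wars = [125, 1977]
raiders = [115, 1981]

# (125 - 115)^2 + (1977 - 1981)^2 = 100 + 16 = 116, so the distance is sqrt(116), about 10.77
print(distance_2d(star_wars, raiders))
```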

#### Distance Between Points - 3D

Making a movie rating predictor based on just the length and release date of movies is pretty limited. There are so many more interesting pieces of data about movies that we could use! So let’s add another dimension.

Let’s say this third dimension is the movie’s budget. We now have to find the distance between these two points in three dimensions.
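Written out, the three-dimensional distance adds one more squared term to the two-dimensional formula. Here index 2 is assumed to hold the budget:

$$ \sqrt{(A_{0}-B_{0})^2+(A_{1}-B_{1})^2+(A_{2}-B_{2})^2} $$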

What if we’re not happy with just three dimensions? Unfortunately, it becomes pretty difficult to visualize points in dimensions higher than 3. But that doesn’t mean we can’t find the distance between them.

The generalized distance formula between points A and B, each with n features indexed from 0 to n-1, is as follows:
$$ \sqrt{(A_{0}-B_{0})^2+(A_{1}-B_{1})^2+\dots+(A_{n-1}-B_{n-1})^2} $$

Using this formula, we can find the K-Nearest Neighbors of a point in n-dimensional space! We can now use as much information about our movies as we want.

In [2]:
star_wars = [125, 1977, 11000000]
raiders = [115, 1981, 18000000]
mean_girls = [97, 2004, 17000000]

# distance between two points with any number of dimensions
def distance(movie1, movie2):
    squared_difference = 0.0
    # add up the squared difference along each dimension
    for i in range(len(movie1)):
        squared_difference += (movie1[i] - movie2[i]) ** 2
    # the distance is the square root of the summed squared differences
    return squared_difference ** 0.5

# print the new distance between Star Wars and Raiders
print(f"distance between Star Wars and Raiders: {distance(star_wars, raiders)}")
# print the new distance between Star Wars and Mean Girls
print(f"distance between Star Wars and Mean Girls: {distance(star_wars, mean_girls)}")

distance between Star Wars and Raiders: 7000000.000008286
distance between Star Wars and Mean Girls: 6000000.000126083
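
With a distance function in hand, finding the k nearest neighbors of a point comes down to sorting the known points by distance and keeping the first k. The sketch below is one possible way to do that, not a definitive implementation. The `k_nearest` helper, the `movies` dictionary layout, and the `unknown` movie are all made up for illustration.

```python
# Euclidean distance between two points with the same number of dimensions
def distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

# Hypothetical helper: return the titles of the k movies closest to `unknown`.
# `dataset` maps each title to its feature list.
def k_nearest(unknown, dataset, k):
    ranked = sorted(dataset, key=lambda title: distance(unknown, dataset[title]))
    return ranked[:k]

# features are [length in minutes, release year, budget in dollars]
movies = {
    "Star Wars": [125, 1977, 11000000],
    "Raiders": [115, 1981, 18000000],
    "Mean Girls": [97, 2004, 17000000],
}

# an invented movie, used only to exercise the function
unknown = [120, 1990, 16000000]

print(k_nearest(unknown, movies, 2))  # ['Mean Girls', 'Raiders']
```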
