### Regression

The K-Nearest Neighbors algorithm is a powerful supervised machine learning algorithm typically used for classification. However, it can also perform regression.

Instead of classifying a new movie as either good or bad, we are now going to predict its IMDb rating as a real number.

This process is almost identical to classification, except for the final step. Once again, we are going to find the k nearest neighbors of the new movie by using the distance formula. However, instead of counting the number of good and bad neighbors, the regressor averages their IMDb ratings.

For example, if the three nearest neighbors to an unrated movie have ratings of 5.0, 9.2, and 6.8, then we could predict that this new movie will have a rating of 7.0.
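The steps above can be sketched in plain Python. This is a minimal illustration rather than a reference implementation: the helper names `distance` and `predict_unweighted` and the toy feature values are assumptions made up for this example, and Euclidean distance is assumed as the distance formula.

```python
import math

def distance(a, b):
    """Euclidean distance between two equal-length feature lists."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def predict_unweighted(query, points, ratings, k):
    """Average the ratings of the k training points closest to query."""
    # Pair each training point's distance with its rating, nearest first.
    by_distance = sorted(
        (distance(query, point), rating)
        for point, rating in zip(points, ratings)
    )
    nearest = by_distance[:k]
    return sum(rating for _, rating in nearest) / k

# Hypothetical toy data: three movies close to the query and one far away.
points = [[0.1, 0.2], [0.2, 0.1], [0.15, 0.25], [0.9, 0.9]]
ratings = [5.0, 9.2, 6.8, 2.0]

prediction = predict_unweighted([0.15, 0.15], points, ratings, k=3)
print(round(prediction, 2))  # 7.0
```

The distant movie rated 2.0 is ignored because it is not among the three nearest neighbors, so the prediction is the mean of 5.0, 9.2, and 6.8.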

#### Weighted Regression

We’re off to a good start, but we can be even more clever in the way that we compute the average. We can compute a weighted average based on how close each neighbor is.

Let’s say we’re trying to predict the rating of movie X and we’ve found its three nearest neighbors. Consider the following table:

| Movie | Rating | Distance to movie X |
| --- | --- | --- |
| A | 5.0 | 3.2 |
| B | 6.8 | 11.5 |
| C | 9.0 | 1.1 |

If we find the mean, the predicted rating for X would be 6.93. However, movie X is most similar to movie C, so movie C’s rating should be more important when computing the average. Using a weighted average, we can find movie X’s rating:

$$\frac{\frac{5.0}{3.2} + \frac{6.8}{11.5} + \frac{9.0}{1.1}}{\frac{1}{3.2}+\frac{1}{11.5}+\frac{1}{1.1}} \approx 7.9$$

The numerator is the sum of each rating divided by its distance to movie X. The denominator is the sum of the reciprocals of those distances. The ratings are the same as before, but because movie C is by far the closest neighbor, the weighted average rises from 6.93 to about 7.9.
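To check the arithmetic, here is a short sketch that computes both averages from the table. The function name `weighted_average` is made up for this example, and it assumes every distance is greater than zero (a neighbor at distance 0 would cause a division by zero and needs separate handling).

```python
def weighted_average(ratings, distances):
    """Inverse-distance weighted mean: sum(r / d) / sum(1 / d).

    Assumes all distances are nonzero.
    """
    numerator = sum(r / d for r, d in zip(ratings, distances))
    denominator = sum(1 / d for d in distances)
    return numerator / denominator

# Ratings and distances for movies A, B, and C from the table
ratings = [5.0, 6.8, 9.0]
distances = [3.2, 11.5, 1.1]

plain_mean = sum(ratings) / len(ratings)
weighted = weighted_average(ratings, distances)

print(round(plain_mean, 2))  # 6.93
print(round(weighted, 2))    # 7.9
```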

In [91]:
import pandas as pd

# Load the movie features and their rating labels into DataFrames
movie_dataset = pd.read_csv('movie_regression_dataset.csv')
movie_ratings = pd.read_csv('movie_regression_labels.csv')


In [92]:
movie_dataset.shape

(3654, 3)

In [93]:
movie_ratings.shape

(3654, 1)

#### Scikit-learn

Now that you’ve written your own K-Nearest Neighbor regression model, let’s take a look at scikit-learn’s implementation. The KNeighborsRegressor class is very similar to KNeighborsClassifier.

We first need to create the regressor. We can use the parameter n_neighbors to define our value for k.

We can also choose whether or not to use a weighted average using the parameter weights. If weights equals "uniform", all neighbors will be considered equally in the average. If weights equals "distance", then a weighted average is used.

Next, we need to fit the model to our training data using the .fit() method. .fit() takes two parameters. The first is a list of points, and the second is a list of values associated with those points.

In [94]:
from sklearn.neighbors import KNeighborsRegressor

# Create a KNN regressor with k = 5. weights="distance" uses a weighted average;
# weights="uniform" would treat all neighbors equally.
regressor = KNeighborsRegressor(n_neighbors=5, weights="distance")

# Train the model on the movie features and their ratings
regressor.fit(movie_dataset, movie_ratings)
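To see what the weights parameter changes, the sketch below reuses the three neighbors from the table in the Weighted Regression section. The one-dimensional feature values are made up so that each movie sits at its listed distance from a query point at 0.0; they are not taken from the movie dataset.

```python
from sklearn.neighbors import KNeighborsRegressor

# Toy 1-D features placed so their distances to the query at 0.0
# match the table: A = 3.2, B = 11.5, C = 1.1
points = [[3.2], [11.5], [-1.1]]
ratings = [5.0, 6.8, 9.0]
query = [[0.0]]

uniform = KNeighborsRegressor(n_neighbors=3, weights="uniform").fit(points, ratings)
weighted = KNeighborsRegressor(n_neighbors=3, weights="distance").fit(points, ratings)

uniform_prediction = float(uniform.predict(query)[0])
weighted_prediction = float(weighted.predict(query)[0])

print(round(uniform_prediction, 2))   # 6.93
print(round(weighted_prediction, 2))  # 7.9
```

With "uniform" the result is the plain mean of the three ratings, while "distance" weights each rating by the inverse of its distance, which reproduces the hand-computed weighted average.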

Let’s predict some movie ratings for the following feature lists:

* [0.016, 0.300, 1.022]
* [0.0004092981, 0.283, 1.0112]
* [0.00687649, 0.235, 1.0112]

These three lists are the features for Incredibles 2, The Big Sick, and The Greatest Showman, respectively. The three numbers in each list are the movie’s normalized budget, runtime, and year of release.

In [95]:
# Feature lists: normalized budget, runtime, and year of release
new_movies = [
    [0.016, 0.300, 1.022],
    [0.0004092981, 0.283, 1.0112],
    [0.00687649, 0.235, 1.0112],
]
ratings_predictions = regressor.predict(new_movies)

# Print the predicted rating for each new movie
print(f"rating of Incredibles 2: {ratings_predictions[0]}")
print(f"rating of The Big Sick: {ratings_predictions[1]}")
print(f"rating of The Greatest Showman: {ratings_predictions[2]}")

rating of Incredibles 2: [6.84913968]
rating of The Big Sick: [5.47572913]
rating of The Greatest Showman: [6.91067999]
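Each prediction above prints as a one-element array (for example [6.84913968]) because the labels were passed to .fit() as a single-column table of shape (3654, 1). The following sketch, which uses made-up toy data and assumes NumPy is installed, shows that a flattened 1-D target produces plain numbers instead:

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Hypothetical toy data: four points with one feature each
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y_column = np.array([[5.0], [6.0], [7.0], [8.0]])  # shape (4, 1), like a one-column labels table
y_flat = y_column.ravel()                          # shape (4,)

query = [[1.4]]

pred_column = KNeighborsRegressor(n_neighbors=2).fit(X, y_column).predict(query)
pred_flat = KNeighborsRegressor(n_neighbors=2).fit(X, y_flat).predict(query)

print(pred_column.shape)  # (1, 1)
print(pred_flat.shape)    # (1,)
```

For the movie data, passing something like movie_ratings.to_numpy().ravel() as the second argument of .fit() should have the same effect, though that variation is not part of the original notebook.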




### Review

Great work! Here are some of the major takeaways:

* The K-Nearest Neighbors algorithm can be used for regression. Rather than returning a classification, it returns a number.
* By using a weighted average, data points that are closer to the input point have more of a say in the final result.
* scikit-learn has an implementation of a K-Nearest Neighbors regressor named KNeighborsRegressor.