# K-Nearest Neighbors

In [52]:
from sklearn.neighbors import KNeighborsClassifier
import json

## 3. Distance Between Points - 2D

**Task 1**  
- Write a function named `distance` that takes two lists named `movie1` and `movie2` as parameters.
- You can assume that each of these lists contains two numbers — the first number being the movie’s runtime and the second number being the year the movie was released. 
- The function should return the distance between the two lists.
- Remember, in Python, `x ** 0.5` will give you the square root of `x`.
- Similarly, `x ** 2` will give you the square of `x`.
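
The distance meant here is presumably the Euclidean (straight-line) distance, which is what the hints about `x ** 2` and `x ** 0.5` point to. Using notation chosen here for illustration, with $a = (a_1, a_2)$ and $b = (b_1, b_2)$ as the two movies (index 1 is runtime, index 2 is release year), it can be written as:

$$d(a, b) = \sqrt{(a_1 - b_1)^2 + (a_2 - b_2)^2}$$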

<br>

**Task 2**  
- Call the function on some of the movies we’ve given you.
- Print the distance between *Star Wars* and *Raiders of the Lost Ark*.
- Print the distance between *Star Wars* and *Mean Girls*.
- Which movie is *Star Wars* more similar to?

In [3]:
star_wars = [125, 1977]
raiders = [115, 1981]
mean_girls = [97, 2004]

In [None]:
# Task 1
def distance(movie1, movie2):
    dist = 0
    if len(movie1) != len(movie2):
        raise ValueError("Movies must have the same number of features")
    for i in range(len(movie1)):
        dist += (movie1[i] - movie2[i])**2
    return dist**0.5

# Task 2
print(distance(star_wars, raiders)) # smaller distance
print(distance(star_wars, mean_girls))

10.770329614269007
38.897300677553446


## 4. Distance Between Points - 3D

**Task 1**  
- Modify your `distance` function to work with any number of dimensions. 
- Use a `for` loop to iterate through the dimensions of each movie.
- Return the total distance between the two movies.

<br>

**Task 2**  
- We’ve added a third dimension to each of our movies.
- Print the new distance between *Star Wars* and *Raiders of the Lost Ark*.
- Print the new distance between *Star Wars* and *Mean Girls*.
- Which movie is *Star Wars* closer to now?


In [5]:
star_wars = [125, 1977, 11000000]
raiders = [115, 1981, 18000000]
mean_girls = [97, 2004, 17000000]

In [None]:
def distance(movie1, movie2):
    dist = 0
    if len(movie1) != len(movie2):
        raise ValueError("Movies must have the same number of features")
    for i in range(len(movie1)):
        dist += (movie1[i] - movie2[i])**2
    return dist**0.5

print(distance(star_wars, raiders)) 
print(distance(star_wars, mean_girls)) # smaller distance

7000000.000008286
6000000.000126083
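
To see why the ranking flipped, it may help to look at how much each dimension contributes to the squared distance. The sketch below reuses the values from the cell above and cross-checks the result with the standard library's `math.dist` (available from Python 3.8); the name `contributions` is introduced here for illustration.

```python
import math

star_wars = [125, 1977, 11000000]
raiders = [115, 1981, 18000000]

# Squared difference contributed by each dimension (runtime, year, budget).
contributions = [(a - b) ** 2 for a, b in zip(star_wars, raiders)]
print(contributions)

# math.dist computes the same Euclidean distance as the hand-written function.
print(math.dist(star_wars, raiders))
```

The budget term is many orders of magnitude larger than the other two, so it effectively decides the distance on its own.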


## 5. Data with Different Scales: Normalization

**Task 1**  
- Write a function named `min_max_normalize` that takes a list of numbers named `lst` as a parameter (`lst` short for list).
- Begin by storing the minimum and maximum values of the list in variables named `minimum` and `maximum`.

<br>

**Task 2**  
- Create an empty list named `normalized`. 
- Loop through each value in the original list.
- Using min-max normalization, normalize the value and add the normalized value to the new list.
- After adding every normalized value to `normalized`, return `normalized`.

<br>

**Task 3**  
- Call `min_max_normalize` using the given list `release_dates`.
- Print the resulting list.
- What does the date `1897` get normalized to? Why is it closer to `0` than `1`?
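
Min-max normalization is commonly written as follows, where $x_{\min}$ and $x_{\max}$ are the smallest and largest values in the list (the symbols are notation chosen here, not part of the exercise):

$$x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$$

For `release_dates` the smallest value is 1891 and the largest is 2017, so 1897 maps to $(1897 - 1891) / (2017 - 1891) = 6 / 126 \approx 0.048$.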

In [7]:
release_dates = [1897, 1998, 2000, 1948, 1962, 1950, 1975, 1960, 2017, 1937, 1968, 1996, 1944, 1891, 1995, 1948, 2011, 1965, 1891, 1978]

In [8]:
def min_max_normalize(lst):
    minimum = min(lst)
    maximum = max(lst)
    normalized = []
    for i in lst:
        normalized.append((i - minimum) / (maximum - minimum))
    return normalized

print(min_max_normalize(release_dates))

[0.047619047619047616, 0.8492063492063492, 0.8650793650793651, 0.4523809523809524, 0.5634920634920635, 0.46825396825396826, 0.6666666666666666, 0.5476190476190477, 1.0, 0.36507936507936506, 0.6111111111111112, 0.8333333333333334, 0.42063492063492064, 0.0, 0.8253968253968254, 0.4523809523809524, 0.9523809523809523, 0.5873015873015873, 0.0, 0.6904761904761905]


1897 is closer to 0 because it is one of the smallest values in the list.
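
One edge case the function above does not cover is a list whose values are all identical, where `maximum - minimum` is zero and the division fails. A minimal sketch of one way to guard against that follows; the name `min_max_normalize_safe` and the choice to return all zeros are assumptions made here, not part of the exercise.

```python
def min_max_normalize_safe(lst):
    """Min-max normalize lst; a constant list maps to all zeros (assumed convention)."""
    minimum = min(lst)
    maximum = max(lst)
    spread = maximum - minimum
    if spread == 0:
        # Every value is identical, so there is no range to scale by.
        return [0.0 for _ in lst]
    return [(value - minimum) / spread for value in lst]

print(min_max_normalize_safe([0, 5, 10]))
print(min_max_normalize_safe([2000, 2000]))
```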

## 6. Finding the Nearest Neighbors

**Task 1**  
- We’ve imported and normalized a movie dataset for you and printed the data for the movie `Bruce Almighty`. 
- Each movie in the dataset has three features:
    - the normalized budget (dollars)
    - the normalized duration (minutes)
    - the normalized release year.
- We’ve also imported the labels associated with every movie in the dataset. 
- The label associated with `Bruce Almighty` is a `0`, indicating that it is a bad movie. 
- Remember, a bad movie had a rating less than 7.0 on IMDb.
- Comment out the two print lines after you have run the program.

<br>

**Task 2**  
- Create a function called `classify` that has three parameters: the data point you want to classify named `unknown`, the dataset you are using to classify it named `dataset`, and `k`, the number of neighbors you are interested in.
- For now, put `pass` inside your function.

<br>

**Task 3**  
- Inside the `classify` function remove `pass`. Create an empty list called `distances`.
- Loop through every `title` in the `dataset`.
- Access the data associated with every title by using `dataset[title]`.
- Find the distance between `dataset[title]` and `unknown` and store this value in a variable named `distance_to_point`.
- Add the list `[distance_to_point, title]` to `distances`.
- Outside of the loop, return `distances`.

<br>

**Task 4**  
- We now have a list of distances and points. 
- We want to sort this list by the distance (from smallest to largest). 
- Before returning `distances`, use the list's built-in `sort()` method to sort `distances`.

<br>

**Task 5**  
- The `k` nearest neighbors are now the first `k` items in `distances`. 
- Create a new variable named `neighbors` and set it equal to the first `k` items of `distances`. 
- You can use Python’s built-in slice syntax.
- For example, `lst[2:5]` will give you a list of the items at indices 2, 3, and 4 of `lst`.
- Return `neighbors`.
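
Tasks 4 and 5 rely on how Python compares nested lists: element by element, so `[distance, title]` pairs are ordered by distance first and by title only on ties. A small sketch, using distances rounded from this notebook's later output purely for illustration:

```python
# [distance, title] pairs in arbitrary order (distances rounded for illustration).
distances = [[0.33, "Godzilla 2000"], [0.08, "Lady Vengeance"], [0.23, "Steamboy"]]

distances.sort()           # sorts in place: smallest distance first
neighbors = distances[:2]  # slice keeps the first k = 2 items
print(neighbors)
```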

<br>

**Task 6**  
- Test the `classify` function and print the results. 
- The three parameters you should use are:
    - `[.4, .2, .9]`
    - `movie_dataset`
    - `5`
- Take a look at the `5` nearest neighbors. 
- In the next exercise, we’ll check to see how many of those neighbors are good and how many are bad.

In [None]:
movie_dataset = json.loads(open("movie_dataset.json").read())
movie_labels = json.loads(open("movie_labels.json").read())


print(movie_dataset['Bruce Almighty'])
print(movie_labels['Bruce Almighty'])

def distance(movie1, movie2):
    squared_difference = 0
    for i in range(len(movie1)):
        squared_difference += (movie1[i] - movie2[i]) ** 2
    final_distance = squared_difference ** 0.5
    return final_distance

[0.006630902005283176, 0.21843003412969283, 0.8539325842696629]
0


In [17]:
def classify(unknown, dataset, k):
    distances = []
    for title in dataset:
        distance_to_point = distance(dataset[title], unknown)
        distances.append([distance_to_point, title])
    distances.sort()
    neighbors = distances[:k]
    return neighbors

for data in classify([.4, .2, .9], movie_dataset, 5):
    print(data)

[0.08273614694606074, 'Lady Vengeance']
[0.22989623153818367, 'Steamboy']
[0.23641372358159884, 'Fateless']
[0.26735445689589943, 'Princess Mononoke']
[0.3311022951533416, 'Godzilla 2000']
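
Sorting the whole list works well at this size. If only the `k` smallest entries are needed, the standard library's `heapq.nsmallest` is one alternative that avoids a full sort. The helper name `k_nearest` and the toy pairs below are hypothetical, shown only as a sketch.

```python
import heapq

def k_nearest(distances, k):
    """Return the k smallest [distance, title] pairs in ascending order."""
    return heapq.nsmallest(k, distances)

# Toy pairs invented for illustration.
pairs = [[0.9, "c"], [0.1, "a"], [0.5, "b"], [0.7, "d"]]
print(k_nearest(pairs, 2))
```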


## 7. Count Neighbors

**Task 1**  
- Our classify function now needs to have knowledge of the labels. 
- Add a parameter named `labels` to `classify`. It should be the third parameter.

<br>

**Task 2**  
- Continue writing your classify function.
- Create two variables named `num_good` and `num_bad` and set them each at `0`. 
- Use a for loop to loop through every `movie` in `neighbors`. 
- Store their title in a variable called `title`.
- Remember, every neighbor is a list of `[distance, title]` so the title can be found at index `1`.
- For now, return `title` at the end of your function (outside of the loop).

<br>

**Task 3**  
- Use `labels` and `title` to find the label of each movie:
    - If that label is a `0`, add one to `num_bad`.
    - If that label is a `1`, add one to `num_good`.
- For now, return `num_good` at the end of your function.

<br>

**Task 4**  
- We can finally classify our unknown movie:
    - If `num_good` is greater than `num_bad`, return a `1`.
    - Otherwise, return a `0`.

<br>

**Task 5**  
- Call `classify` using the following parameters and print the result.
    - `[.4, .2, .9]` as the movie you’re looking to classify.
    - `movie_dataset` as the training dataset.
    - `movie_labels` as the training labels.
    - `k = 5`
- Does the system predict this movie will be good or bad?

In [None]:
movie_dataset = json.loads(open("movie_dataset.json").read())
movie_labels = json.loads(open("movie_labels.json").read())


def distance(movie1, movie2):
    squared_difference = 0
    for i in range(len(movie1)):
        squared_difference += (movie1[i] - movie2[i]) ** 2
    final_distance = squared_difference ** 0.5
    return final_distance

def classify(unknown, dataset, labels, k):
    distances = []
    #Looping through all points in the dataset
    for title in dataset:
        movie = dataset[title]
        distance_to_point = distance(movie, unknown)
        #Adding the distance and point associated with that distance
        distances.append([distance_to_point, title])
    distances.sort()
    #Taking only the k closest points
    neighbors = distances[:k]
    num_good = 0
    num_bad = 0
    for movie in neighbors:
        title = movie[1]
        if labels[title] == 0:
            num_bad += 1
        else:
            num_good += 1
    return 1 if num_good > num_bad else 0

print(classify([.4, .2, .9], movie_dataset, movie_labels, 5))

1
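
The counting step can also be sketched with `collections.Counter`. The version below keeps the same rule as `classify` above (predict `1` only when good neighbors strictly outnumber bad ones, so a tie yields `0`); the helper name `majority_label` and the toy labels are assumptions made for illustration.

```python
from collections import Counter

def majority_label(neighbors, labels):
    """Vote over [distance, title] neighbors; predict 1 only if good outnumbers bad."""
    counts = Counter(labels[title] for _, title in neighbors)
    # Counter returns 0 for a label that never appears, so no KeyError here.
    return 1 if counts[1] > counts[0] else 0

# Toy neighbors and labels, invented for illustration.
toy_labels = {"A": 1, "B": 0, "C": 1, "D": 0}
print(majority_label([[0.1, "A"], [0.2, "B"], [0.3, "C"]], toy_labels))
print(majority_label([[0.1, "A"], [0.2, "B"]], toy_labels))
```

With an even `k` a tie is possible, which is one reason odd values of `k` are often preferred for two-class problems.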


## 8. Classify Your Favorite Movie

Nice work! Your classifier is now able to predict whether a movie will be good or bad. So far, we’ve only tested this on a completely random point `[.4, .2, .9]`. In this exercise we’re going to pick a real movie, normalize it, and run it through our classifier to see what it predicts!

In the instructions below, we are going to be testing our classifier using the 2017 movie *Call Me By Your Name*. Feel free to pick your favorite movie instead!

**Task 1**  
- To begin, we want to make sure the movie that we want to classify isn’t already in our database. 
- This is important because we don’t want one of the nearest neighbors to be itself!
- You can do this by using the `in` keyword.
- Begin by printing if the title of your movie is in `movie_dataset`. This should print False.

<br>

**Task 2**  
- Once you confirm your movie is not in your database, we need to make a datapoint for your movie. 
- Create a variable named `my_movie` and set it equal to a list of three numbers. They should be:
    - The movie’s budget (dollars)
    - The movie’s runtime (minutes)
    - The year the movie was released
- Make sure to put the information in that order.

<br>

**Task 3**  
- Next, we want to normalize this datapoint. 
- We’ve included the function `normalize_point` which takes a datapoint as a parameter and returns the point normalized. 
- Create a variable called `normalized_my_movie` and set it equal to the normalized value of `my_movie`. 
- Print the result!

<br>

**Task 4**  
- Finally, call classify with the following parameters:
    - `normalized_my_movie`
    - `movie_dataset`
    - `movie_labels`
    - `5`
- Print the result.
- Did your classifier think your movie was good or bad?

In [None]:
movie_dataset = json.loads(open("movie_dataset.json").read())
movie_labels = json.loads(open("movie_labels.json").read())


def distance(movie1, movie2):
    squared_difference = 0
    for i in range(len(movie1)):
        squared_difference += (movie1[i] - movie2[i]) ** 2
    final_distance = squared_difference ** 0.5
    return final_distance

def classify(unknown, dataset, labels, k):
    distances = []
    #Looping through all points in the dataset
    for title in dataset:
        movie = dataset[title]
        distance_to_point = distance(movie, unknown)
        #Adding the distance and point associated with that distance
        distances.append([distance_to_point, title])
    distances.sort()
    #Taking only the k closest points
    neighbors = distances[0:k]
    num_good = 0
    num_bad = 0
    for neighbor in neighbors:
        title = neighbor[1]
        if labels[title] == 0:
            num_bad += 1
        elif labels[title] == 1:
            num_good += 1
    if num_good > num_bad:
        return 1
    else:
        return 0


In [32]:
# Task 1
print("The Hobbit: An Unexpected Journey" in movie_dataset)

# Task 2
my_movie = [180000000, 182, 2012]

# Task 3
# normalized_my_movie = normalize_point(my_movie)
normalized_my_movie = [0.014735359601515157, 0.4948805460750853, 0.9550561797752809]
print(normalized_my_movie)

# Task 4
print(classify(normalized_my_movie, movie_dataset, movie_labels, 5))

False
[0.014735359601515157, 0.4948805460750853, 0.9550561797752809]
1
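
The provided `normalize_point` is not shown in this notebook, which is why the cell above hard-codes the normalized values. As a rough idea of what such a helper might do, the sketch below applies min-max normalization per feature. The bounds in `ASSUMED_BOUNDS` are hypothetical placeholders; the real ones would come from the un-normalized dataset.

```python
# Hypothetical (min, max) bounds per feature: budget, runtime, release year.
# These are NOT the bounds used to build movie_dataset.
ASSUMED_BOUNDS = [(0, 300_000_000), (60, 240), (1900, 2020)]

def normalize_point_sketch(point, bounds=ASSUMED_BOUNDS):
    """Min-max normalize each feature of point using the given (min, max) bounds."""
    return [(value - low) / (high - low) for value, (low, high) in zip(point, bounds)]

print(normalize_point_sketch([150_000_000, 150, 1960]))
```

The important point is that a new movie has to be scaled with the same bounds as the training data; otherwise its distances to the dataset are not comparable.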


## 9. Training and Validation Sets

**Task 1**  
- We’ve provided `training_set`, `training_labels`, `validation_set`, and `validation_labels`.
- Let’s take a look at one of the movies in `validation_set`.
- The movie `"Seven Samurai"` is in `validation_set`. 
- Print out the data associated with *Seven Samurai*. 
- Print *Seven Samurai*’s label as well (which can be found in `validation_labels`).
- Is *Seven Samurai* a good or bad movie?

<br>

**Task 2**  
- Let’s have our classifier predict whether *Seven Samurai* is good or bad using k = 5. Call the classify function using the following parameters:
    - *Seven Samurai*’s data
    - `training_set`
    - `training_labels`
    - `5`
- Store the results in a variable named `guess` and print `guess`.

<br>

**Task 3**  
- Let’s check to see if our classification got it right. 
- If `guess` is equal to *Seven Samurai*’s real class (found in `validation_labels`), print `"Correct!"`. Otherwise, print `"Wrong!"`.

In [None]:
movie_dataset = json.loads(open("movie_dataset.json").read())
movie_labels = json.loads(open("movie_labels.json").read())
print(len(movie_dataset)) # 3655 movies in dataset

# 3289 movies in training set
training_set = {title: movie_dataset[title] for title in list(movie_dataset.keys())[:3289]}
# Look labels up by the same titles so the split does not depend on key order in movie_labels
training_labels = {title: movie_labels[title] for title in training_set}
# Last 366 movies in validation set
validation_set = {title: movie_dataset[title] for title in list(movie_dataset.keys())[3289:]}
validation_labels = {title: movie_labels[title] for title in validation_set}

print(len(training_set)) # 3289
print(len(validation_set)) # 366


def distance(movie1, movie2):
    squared_difference = 0
    for i in range(len(movie1)):
        squared_difference += (movie1[i] - movie2[i]) ** 2
    final_distance = squared_difference ** 0.5
    return final_distance

def classify(unknown, dataset, labels, k):
    distances = []
    #Looping through all points in the dataset
    for title in dataset:
        movie = dataset[title]
        distance_to_point = distance(movie, unknown)
        #Adding the distance and point associated with that distance
        distances.append([distance_to_point, title])
    distances.sort()
    #Taking only the k closest points
    neighbors = distances[0:k]
    num_good = 0
    num_bad = 0
    for neighbor in neighbors:
        title = neighbor[1]
        if labels[title] == 0:
            num_bad += 1
        elif labels[title] == 1:
            num_good += 1
    if num_good > num_bad:
        return 1
    else:
        return 0


3655
3289
366
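
The split above takes the first 3289 titles for training and the rest for validation. If the source file happens to be ordered in some meaningful way, a shuffled split may be more representative. Below is a minimal sketch with a fixed seed for reproducibility; the function name `split_dataset`, its parameters, and the toy data are all assumptions made for illustration.

```python
import random

def split_dataset(dataset, labels, validation_fraction=0.1, seed=0):
    """Shuffle titles with a fixed seed, then split data and labels together."""
    titles = sorted(dataset)             # deterministic starting order
    random.Random(seed).shuffle(titles)  # reproducible shuffle
    n_validation = int(len(titles) * validation_fraction)
    validation_titles = titles[:n_validation]
    training_titles = titles[n_validation:]

    def pick(source, keys):
        return {title: source[title] for title in keys}

    return (pick(dataset, training_titles), pick(labels, training_titles),
            pick(dataset, validation_titles), pick(labels, validation_titles))

# Toy data invented for illustration: ten one-feature "movies".
toy_dataset = {f"movie_{i}": [i / 10] for i in range(10)}
toy_labels = {f"movie_{i}": i % 2 for i in range(10)}
train_x, train_y, val_x, val_y = split_dataset(toy_dataset, toy_labels)
print(len(train_x), len(val_x))
```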


In [42]:
# Task 1
print(validation_set["Seven Samurai"])

# Task 2
guess = classify(validation_set["Seven Samurai"], training_set, training_labels, 5)

# Task 3
if guess == validation_labels["Seven Samurai"]:
    print("Correct!")
else:
    print("Wrong!")

[0.00016370856990614123, 0.5631399317406144, 0.30337078651685395]
Correct!


## 10. Choosing K

**Task 1**  
- Begin by creating a function called `find_validation_accuracy` that takes five parameters. 
- The parameters should be `training_set`, `training_labels`, `validation_set`, `validation_labels`, and `k`.

<br>

**Task 2**  
- Create a variable called `num_correct` and have it begin at `0.0`. 
- Loop through the movies of `validation_set`, and call `classify` using each movie’s data, the `training_set`, the `training_labels`, and `k`. 
- Store the result in a variable called `guess`. 
- For now, return `guess` outside of your loop.
- Remember, the movie’s data can be found by using `validation_set[title]`.

<br>

**Task 3**  
- Inside the for loop, compare `guess` to the corresponding label in `validation_labels`. 
- If they were equal, add `1` to `num_correct`. 
- For now, outside of the for loop, return `num_correct`

<br>

**Task 4**  
- Outside the for loop, return the validation accuracy. 
- This should be `num_correct` divided by the total number of points in the validation set.

<br>

**Task 5**  
- Call `find_validation_accuracy` with `k = 3`. 
- Print the results. The code should take a couple of seconds to run.

In [None]:
movie_dataset = json.loads(open("movie_dataset.json").read())
movie_labels = json.loads(open("movie_labels.json").read())

# 3289 movies in training set
training_set = {title: movie_dataset[title] for title in list(movie_dataset.keys())[:3289]}
# Look labels up by the same titles so the split does not depend on key order in movie_labels
training_labels = {title: movie_labels[title] for title in training_set}
# Last 366 movies in validation set
validation_set = {title: movie_dataset[title] for title in list(movie_dataset.keys())[3289:]}
validation_labels = {title: movie_labels[title] for title in validation_set}

def distance(movie1, movie2):
    squared_difference = 0
    for i in range(len(movie1)):
        squared_difference += (movie1[i] - movie2[i]) ** 2
    final_distance = squared_difference ** 0.5
    return final_distance

def classify(unknown, dataset, labels, k):
    distances = []
    #Looping through all points in the dataset
    for title in dataset:
        movie = dataset[title]
        distance_to_point = distance(movie, unknown)
        #Adding the distance and point associated with that distance
        distances.append([distance_to_point, title])
    distances.sort()
    #Taking only the k closest points
    neighbors = distances[0:k]
    num_good = 0
    num_bad = 0
    for neighbor in neighbors:
        title = neighbor[1]
        if labels[title] == 0:
            num_bad += 1
        elif labels[title] == 1:
            num_good += 1
    if num_good > num_bad:
        return 1
    else:
        return 0

def find_validation_accuracy(training_set, training_labels, validation_set, validation_labels, k):
    num_correct = 0.0
    for movie in validation_set:
        guess = classify(validation_set[movie], training_set, training_labels, k)
        if guess == validation_labels[movie]:
            num_correct += 1
    return num_correct / len(validation_set)

In [51]:
print(find_validation_accuracy(training_set, training_labels, validation_set, validation_labels, 3))

0.6120218579234973
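
Choosing `k` usually means repeating this measurement for several values and comparing the results. The sketch below does that on a tiny one-dimensional toy dataset invented for illustration, with compact re-implementations of `classify` and `find_validation_accuracy` so it runs on its own; it is not the movie data and says nothing about which `k` is best there.

```python
import math

def classify(unknown, dataset, labels, k):
    """Same voting rule as above: predict 1 only if good neighbors outnumber bad."""
    nearest = sorted((math.dist(point, unknown), title)
                     for title, point in dataset.items())[:k]
    num_good = sum(labels[title] for _, title in nearest)
    num_bad = len(nearest) - num_good
    return 1 if num_good > num_bad else 0

def find_validation_accuracy(training_set, training_labels,
                             validation_set, validation_labels, k):
    """Fraction of validation points whose predicted label matches the true one."""
    num_correct = sum(
        classify(validation_set[t], training_set, training_labels, k) == validation_labels[t]
        for t in validation_set
    )
    return num_correct / len(validation_set)

# Toy data, invented for illustration only.
training_set = {"a": [0.0], "b": [0.1], "c": [0.7], "d": [0.8], "e": [0.9], "f": [1.0]}
training_labels = {"a": 0, "b": 0, "c": 1, "d": 1, "e": 1, "f": 1}
validation_set = {"x": [0.05], "y": [0.85]}
validation_labels = {"x": 0, "y": 1}

accuracies = {k: find_validation_accuracy(training_set, training_labels,
                                          validation_set, validation_labels, k)
              for k in (1, 3, 5)}
best_k = max(accuracies, key=accuracies.get)  # first k with the highest accuracy
print(accuracies, best_k)
```

In this toy case a large `k` pulls in neighbors from the majority class and misclassifies the point near 0, which is the kind of effect a sweep over `k` is meant to reveal.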


## 12. Using sklearn

**Task 1**  
- Create a `KNeighborsClassifier` named `classifier` that uses `k=5`.

<br>

**Task 2**  
- Train your classifier using `movie_dataset` as the training points and `labels` as the training labels.

<br>

**Task 3**  
- Let’s classify some movies. 
- Classify the following movies: `[.45, .2, .5]`, `[.25, .8, .9]`, `[.1, .1, .9]`. 
- Print the classifications!
- Which movies were classified as good movies and which were classified as bad movies?
- Remember, those three numbers associated with a movie are the normalized budget, run time, and year of release.

In [None]:
from movies import movie_dataset, labels
assert len(movie_dataset) == len(labels)

In [58]:
# Task 1
classifier = KNeighborsClassifier(n_neighbors = 5)

# Task 2
classifier.fit(movie_dataset, labels)

# Task 3
classifier.predict([[.45, .2, .5], [.25, .8, .9], [.1, .1, .9]])

array([1, 1, 0])
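
Here `movie_dataset` and `labels` come from the `movies` module and appear to be parallel sequences, which is the shape `fit` expects. If you start instead from the title-keyed dictionaries used in the earlier exercises, they would first need to be turned into aligned lists. A minimal sketch of that conversion follows; the helper name `to_arrays` and the toy rows are hypothetical.

```python
def to_arrays(dataset, labels):
    """Return (features, targets) lists whose rows are aligned by title."""
    titles = list(dataset)
    features = [dataset[title] for title in titles]
    targets = [labels[title] for title in titles]
    return features, targets

# Toy rows invented for illustration; labels deliberately use a different key order.
toy_dataset = {"Movie A": [0.1, 0.2, 0.3], "Movie B": [0.4, 0.5, 0.6]}
toy_labels = {"Movie B": 1, "Movie A": 0}

features, targets = to_arrays(toy_dataset, toy_labels)
print(features, targets)
# The resulting lists could then be passed to classifier.fit(features, targets).
```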