This workshop is based on [the original work](https://www.kaggle.com/rounakbanik/movie-recommender-systems/notebook) done by Rounak Banik. Thank you! 🙏

# Step 1: Download dataset
The following command will download our dataset, a file called `movies.csv`, from Google Drive. Run it and then check the files tab on the left to make sure it's there.

In [18]:
!gdown https://drive.google.com/uc?id=1vbCnO9blSUE8IR5BHNMXY_OfOY_Cos88

Downloading...
From: https://drive.google.com/uc?id=1vbCnO9blSUE8IR5BHNMXY_OfOY_Cos88
To: /home/owen/Documents/projects/Workshops/movie-recommendation-021222/movies.csv
100%|██████████████████████████████████████| 18.1M/18.1M [00:00<00:00, 54.0MB/s]


# Step 2: Load movie data in Python
The following code is *almost* correct, but it contains a typo. It's *supposed* to load the file `movies.csv` into a pandas dataframe called `movies`. Can you fix it? (Once you have it working, you should be able to see the first 5 rows of data displayed.)

In [19]:
import pandas as pd

# Load the "movies" dataframe from movies.csv
movies = pd.read_csv("movies.csv")

# Display the first 5 rows
movies.head(5)

Unnamed: 0,title,tagline,overview,popularity,vote_count,vote_average,release_date,runtime,budget,language
0,Toy Story,,"Led by Woody, Andy's toys live happily in his ...",21.946943,5415,7.7,1995-10-30,81.0,30000000,en
1,Jumanji,Roll the dice and unleash the excitement!,When siblings Judy and Peter discover an encha...,17.015539,2413,6.9,1995-12-15,104.0,65000000,en
2,Grumpier Old Men,Still Yelling. Still Fighting. Still Ready for...,A family wedding reignites the ancient feud be...,11.7129,92,6.5,1995-12-22,101.0,0,en
3,Waiting to Exhale,Friends are the people who let you be yourself...,"Cheated on, mistreated and stepped on, the wom...",3.859495,34,6.1,1995-12-22,127.0,16000000,en
4,Father of the Bride Part II,Just When His World Is Back To Normal... He's ...,Just when George Banks has recovered from his ...,8.387519,173,5.7,1995-02-10,106.0,0,en


That's a lot of data! Having all these columns is awesome, but it's a bit distracting when we just want to focus on a few. Let's fix that. The following code simplifies the dataframe to only contain three columns: `title`, `release_date`, and `budget`.

**Your task:** Can you change it to keep the `title` and `popularity` columns instead?

In [20]:
# Create a simpler dataframe with only a few columns
# Your job: Change this to keep the `title` and `popularity` columns instead
movies_simple = movies[["title", "release_date", "budget"]]

# Display the first 5 rows of the simplified dataframe
movies_simple.head(5)

Unnamed: 0,title,release_date,budget
0,Toy Story,1995-10-30,30000000
1,Jumanji,1995-12-15,65000000
2,Grumpier Old Men,1995-12-22,0
3,Waiting to Exhale,1995-12-22,16000000
4,Father of the Bride Part II,1995-02-10,0


Hopefully you see just two columns: "title" and "popularity".

# Step 3: Sort movies by popularity
By default, `movies.csv` is sorted from oldest to newest. But we're looking for a nice Valentine's day option, and the oldest movies aren't necessarily the best. Can you sort the movies so that the most popular are at the top?

The following code sorts the movies alphabetically based on their title. Can you change it to sort by popularity? We want the most popular movies at the top.

In [21]:
# Sort movies_simple
movies_simple_sorted = movies_simple.sort_values("title", ascending=True)

# Display the first 10 rows
movies_simple_sorted.head(10)

Unnamed: 0,title,release_date,budget
18757,!Women Art Revolution,2010-01-01,0
30957,#1 Cheerleader Camp,2010-07-27,0
36147,#Horror,2015-11-20,1500000
39988,#Pellichoopulu,2016-07-29,200000
42266,#SELFIEPARTY,2016-03-31,0
23499,#chicagoGirl,2013-11-21,0
16773,$ Dollars,1971-12-17,0
15775,$5 a Day,2008-01-01,0
36180,$50K and a Call Girl: A Love Story,2014-01-10,0
14832,$9.99,2008-09-04,0


This... This is not good. We were looking for a good Valentine's day movie, and what we got was *Minions*.

!["You are 1 in a minion" meme](https://i.imgur.com/bQ5VwDs.png)

# Step 4: Sort movies by votes (optional)

(This step is optional. So if you are in the mood to relax a bit, you are free to skip on to the next step. But if you want to learn about a cool equation that solves a tricky problem, then continue here.)

It's not entirely clear what this "popularity" metric even means, but it doesn't seem to be giving us the best results. When we simplified our database, we got rid of two rows called `vote_count` and `vote_average`. Those columns keep track of the ratings (out of 10 stars) that reviewers gave each movie. So perhaps instead of relying on popularity, we could sort movies based on the votes they received.

The following code creates a dataframe that includes the vote columns. **Can you add a final line that will sort it based on `vote_average`?** We want the highest vote averages at the top. (Refer to your previous code for a reminder on how to sort.)

In [22]:
# Create a dataframe with the vote columns:
movies_with_votes = movies[["title", "vote_count", "vote_average"]]

# Sort the `movies_with_votes` dataframe by "vote_average":
# ???

# Display the first 10 results:
movies_with_votes.head(10)

Unnamed: 0,title,vote_count,vote_average
0,Toy Story,5415,7.7
1,Jumanji,2413,6.9
2,Grumpier Old Men,92,6.5
3,Waiting to Exhale,34,6.1
4,Father of the Bride Part II,173,5.7
5,Heat,1886,7.7
6,Sabrina,141,6.2
7,Tom and Huck,45,5.4
8,Sudden Death,174,5.5
9,GoldenEye,1194,6.6


Hmmm... There's something interesting about these results. They all have a `vote_count` of 1.

This makes sense. We're just sorting movies by their average rating. If only one person has rated your movie, it's much easier to achieve a 10/10 than if 500 people all rate your movie. With fewer total votes, it's easier to achieve a more extreme result (good or bad) than it is with many votes.

So just looking for the highest average score won't do. We also need to reward the movies with more total votes.

This is tricky to do correctly, but IMDB allegedly uses the following formula to compute a score for each movie:

$$
\text{Weighted Rating} = \left(\frac{v}{v+m} \cdot R\right) + \left(\frac{m}{v+m} \cdot C\right)
$$

Where...
* $v$ is the `vote_count` for the movie
* $m$ is the minimum `vote_count` required to be included on the chart
* $R$ is the `vote_average` for the movie
* $C$ is the vote average across all movies

Notice that the definition of $m$ means that we need to choose a cutoff point for our list. (i.e. We can't include all the movies.) For our list, let's look at which movies have the most votes cast (the highest `vote_count`) and choose the top 30% (i.e. the 70th percentile). We'll calculate that value $m$ in the code below:

In [23]:
# Get the entire vote_counts column as a list
vote_counts = movies_with_votes["vote_count"]

# Compute the 70th percentile vote count
m = vote_counts.quantile(0.70)

print(m)

25.0


Now that we have our cutoff point $m$, the minimum `vote_count` required to be considered for our top movies list, let's filter to only look at movies with a `vote_count` of at least $m$:

In [24]:
top_movies = movies_with_votes[movies_with_votes["vote_count"] >= m]

print("Total number of movies:", len(movies_with_votes))
print("Number of movies above the 70th percentile:", len(top_movies))

Total number of movies: 45460
Number of movies above the 70th percentile: 13810


The number of movies above the 70th percentile should be about 30% the total number of movies, and it is! Now let's compute $C$, the average vote across all our `top_movies`:

In [25]:
vote_averages = top_movies['vote_average']
C = vote_averages.mean()
print(C)

6.2833743664011585


One issue with the above calculation is that we aren't taking into account the number of votes each movie has received. You may try to fix it if you are so inclined. Otherwise, we'll take this value of $C$ as a good-enough approximation.

Now we have $m$ and $C$, the two variables that depend on *all* the movies. The other two variables, $v$ and $R$, are the `vote_count` and `vote_average` for a particular movie respectively.

To remind you, the equation we're using to score each movie is this:

$$
\text{Weighted Rating} = \left(\frac{v}{v+m} \cdot R\right) + \left(\frac{m}{v+m} \cdot C\right)
$$

Your job is to create a `weighted_rating` function which computes the weighted rating for a given movie. Then we can use this function to add a new `weighted_rating` column to the table.

The function is already started for you; your job is to complete it based on the equation above. (Remember that you have the variables `m` and `C` available to you because we calculated them earlier.)

In [26]:
def weighted_rating(movie):
    v = movie["vote_count"]
    R = movie["vote_average"]

    # Calculated weighted rating based on v, m, R, and C
    # return # ???

top_movies["weighted_rating"] = top_movies.apply(weighted_rating, axis=1)

top_movies.head(5)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  top_movies["weighted_rating"] = top_movies.apply(weighted_rating, axis=1)


Unnamed: 0,title,vote_count,vote_average,weighted_rating
0,Toy Story,5415,7.7,
1,Jumanji,2413,6.9,
2,Grumpier Old Men,92,6.5,
3,Waiting to Exhale,34,6.1,
4,Father of the Bride Part II,173,5.7,


Amazing! If your function is correct, you should see a new `weighted_rating` column, with values between about 0 and 10.

All that's left to do is to sort the movies based on their weighted rating. Can you do it?

In [27]:
# Sort top_movies based on the new weighted_rating column:
top_movies = # ???

# Display the top 10 movies:
top_movies.head(10)

SyntaxError: invalid syntax (4229843810.py, line 2)

Looking good. :)

<img src="https://img.buzzfeed.com/buzzfeed-static/static/2015-07/10/13/campaign_images/webdr04/how-much-do-you-hate-minions-2-19159-1436549099-1_dblbig.jpg" alt="Goodbye minions" width="300" />

# What's next?
Okay! This is great. We've created a list of good movies; the peoples' choice. Is this all we need to do?

Well, no.

Our top 10 list includes Schindler's List. It's an excellent movie, with a [98% on rotten tomatoes](https://www.rottentomatoes.com/m/schindlers_list). But it doesn't exactly give me Valentine's vibes. It would be lovely if we could create a recommendation algorithm that recommends new movies based on what you've already watched. That way, we could pick a few movies that we know fit the mood, and then find more with similar vibes.

# Step 5: Create descriptions for each movie
To determine which movies are similar, we are going to compare text descriptions of one movie to another. This process requires a lot of memory, and doing it for all ~50,000 movies in the dataset will be too much for the computer to handle.

So let's cut the size of the dataset in half by selecting only the most popular movies.

We can filter for popular movies by setting a lower bound, such as the following code, which selects movies with a popularity of more than 4.00:

In [28]:
# Select movies with popularity > 4
popular_movies = movies[movies["popularity"] > 4]

print("Total number of movies:", len(movies))
print("Number of popular movies:", len(popular_movies))

Total number of movies: 45460
Number of popular movies: 10801


The code above selects the 10,801 most popular movies, because that's how many movies have a popularity greater than 4.

But we want to select about half the movies. We *could* do this by guessing and checking for a threshold, but there's an easier way.

The following code prints out the popularity threshold that selects for only the top 30% of movies (i.e. the 30th percentile). Can you change it to get the threshold at 50%?

In [29]:
# Print the 70th percentile popularity (which will select the top 30% of movies)
# Can you change it to select 50% of the movies?
print(movies["popularity"].quantile(0.7))

2.7150251


Now use the value you get to grab the top 50% of movies:

In [30]:
# CHANGE THIS LINE. We want the top 50% of movies. Use your value from above.
popular_movies = movies[movies["popularity"] > 4]

print("Total number of movies:", len(movies))
print("Number of popular movies (hopefully about 22730):", len(popular_movies))

Total number of movies: 45460
Number of popular movies (hopefully about 22730): 10801


Amazing. We've simplified our dataset to only contain the top half of movies. (The computer will be thanking us later for making the dataset smaller. 😅)

Now... We want to compare movies based on their text description. So let's create a simplified table which only stores the text information about each movie.

The following code selects just the "title" and "language" columns. Can you change it to select "title", "tagline", and "overview" instead?

In [31]:
# Get just the text columns for each movie. (Requires changes)
movies_text = popular_movies[["title", "language"]]

movies_text.head(5)

Unnamed: 0,title,language
0,Toy Story,en
1,Jumanji,en
2,Grumpier Old Men,en
4,Father of the Bride Part II,en
5,Heat,en


Pretty good! But as you can see, the `tagline` for Toy Story is `NaN`, which stands for "not a number". Obviously the tagline is never a number, but in this case `NaN` means that there is no tagline. The same can also happen in the `overview` column. To fix this, let's replace `NaN` values with empty strings instead:

In [32]:
movies_text["tagline"].fillna("", inplace=True)
movies_text["overview"].fillna("", inplace=True)

# Display the first 5 rows now:
movies_text.head(5)

KeyError: 'tagline'

Nice! Now every cell contains a string (albeit sometimes an empty one).

Now, we want a description of each movie that contains as much information as possible. We could use just the tagline or just the overview, but then we would be throwing away the other column, which seems bad.

Instead, let's merge the two columns into one new column called "description". The description of a movie will just be the tagline plus the overview, concatenated together.

The following code creates a "description" column which is just the title doubled. Can you change it to merge the "tagline" and "overview" columns into one instead?

In [None]:
# Create a new column called "description" which
# merges "tagline" and "overview" together. (Requires changes.)
movies_text["description"] = movies_text["title"] + movies_text["title"]

# Display the first 3 rows:
movies_text.head(3)

(If you think your code is correct but can't tell because the text is getting cut off, click the little magic wand button beneath the table.)

---

Amazing! Now that we've created the description column, we don't really need the original tagline and overview columns anymore. The following code deletes some columns, but accidentally deletes too many. Can you change it to *keep* the "title" and "description" columns?

In [None]:
# Oops! This code drops too many columns. Can you change it to keep the description column?
movie_descriptions = movies_text.drop(["tagline", "overview", "description"], axis=1)

movie_descriptions.head(5)

Fantastic. Now we just need a way to compare movie descriptions to each other. This will allow us to find similar movies based on a description alone.

# Step 6: Use Tfidf to compare movies to each other
At the beginning of this workshop, we talked about TF-IDF, which stands for "term frequency, inverse document frequency". It's a technique for comparing two pieces of text to see if they are similar. And it uses the context clues from all the other texts to ignore common words.

Fortunately, the `sklearn` package has lots of help in this department. The following code creates a `TfidfVectorizor` and uses it to compute the `tfidf_matrix`. There's no need to make any changes; just run it and see what happens.

In [17]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Get a list of all the video descriptions:
all_descriptions = movie_descriptions["description"]

# Create tfidf vectorizer:
tf = TfidfVectorizer(analyzer='word', ngram_range=(1, 2), min_df=0, stop_words='english')

# Use vectorizer to create tfidf matrix.
# This will take a moment to complete.
tfidf_matrix = tf.fit_transform(all_descriptions)

# Print out the shape of the tfidf_matrix (it is large):
print(tfidf_matrix.shape)

NameError: name 'movie_descriptions' is not defined

If all went well, you should see the shape of the tfidf matrix printed above. Mine has 22730 rows and 594892 columns. Hopefully yours is similar.

**Question:** What do you think the rows and columns represent? The number of rows should look familiar. What does that tell you about the meaning of each row?

The number of columns is much larger than anything we've seen so far. What do you think each column might represent?

---

Now that we have the tfidf matrix, we can use it to compute the similarity matrix. If the tfidf matrix is $M$, then the similarity matrix is computed by performing $M \cdot M^T$. If you've taken linear algebra, you might be able to figure out why this makes sense. If not, don't worry too much about it.

Let's compute this new similarity matrix. Again, no code changes are necessary:

In [None]:
from sklearn.metrics.pairwise import linear_kernel

# Compute the similarity matrix. (This will take some time.)
similarity_matrix = linear_kernel(tfidf_matrix, tfidf_matrix)

print(similarity_matrix.shape)

If all went well, you should see the shape of the similarity matrix printed above. What are the dimensions? Mine has 22730 rows and 22730 columns. Hopefully yours is similar.

**Question:** What does each entry of the similarity matrix tell you? The number of rows/columns is a clue. (We also talked about this during the presentation.)

---

Now that we have `similarity_matrix`, we can use it to check the similarity between any two movies. Each movie has a row number in our dataset, so to check the similarity between the first two movies (in rows 0 and 1 respectively), we get the corresponding entry from the similarity matrix: `similarity_matrix[0, 1]`.

Let's write some code that can take any two movies, display their title and description, and then print their similarity. The code has been started for you, and your job is to finish it:

In [None]:
movie_A = 0
movie_B = 1

# Print the title and description for movie A:
print("Movie A:", movie_descriptions.loc[movie_A, "title"])
print(movie_descriptions.loc[movie_A, "description"])
print()

# Print the title and description for movie B:
# ???

# Get the similarity of the two movies:
# (Which entry of similarity_matrix should you grab? You
# should be using the `movie_A` and `movie_B` variables.)
similarity = # ???
print("Similarity:", similarity)

If your code is correct and you check the similarity between movie 0 and movie 1, you should see that they are Toy Story and Jumanji respectively, and their similarity score is 0.0056 (not very similar).

# Step 7: Recommend similar movies
Now that we can check the similarity between any two movies, we can take any existing movie and find all the other movies that are *most* similar!

Looking up movies by their row number is pretty annoying, so let's start by making it possible to look up a movie by its title.

The following code creates a mapping between titles and row numbers. Run it and see what you get:

In [None]:
# Reset the numbering of the rows to make sure no numbers are skipped because we removed rows
movie_descriptions.reset_index(inplace=True)

# Create a dictionary (or, really, a pandas "Series") mapping
# movie titles to the index of that movie in movie_descriptions
indices = pd.Series(movie_descriptions.index, index=movie_descriptions['title'])

print(indices)

Great! Now we can use the following code to check the row number for a particular title.

In [None]:
print(indices["Grumpier Old Men"])

**Question:** What is the row number of the movie `"Minions"`? (Asking for a friend.)

Now, if we know the row number for a movie, we can get that same row out of the similarity_matrix. That row will contain the similarities of the movie with every other movie:

In [None]:
# Get the movie's index based on its title:
index = indices["Toy Story"]

# Use that index to get the similarities matrix row
# that gives a similarity score for this movie
# compared to each other movie:
row = similarity_matrix[index]

# Print the similarity scores with the first 20 movies:
print(row[:20])

# (...Of course this row goes on for a very long time
# if we don't look at just the first few entries.)

As you can see, Toy Story has a similarity of 0 to most of the movies, but it is much more similar to a few of them.

We would like to sort this array in order to get the most similar movies first. But if we just do that, we actually lose track of which similarity score corresponds to which movie. (Because the order of this array currently matters.)

To solve this, let's use the `enumerate` function to turn every entry into a pair of values: The row of the movie, and the similarity score.

In [None]:
# Convert that row to a list of (movie_row, similarity_score) pairs:
sim_scores = list(enumerate(row))

print(sim_scores)

Now we can sort these pairs by their second value (the similarity score). Right now the following code sorts from lowest similarity to highest. Can you reverse the sort?

In [None]:
# Sort the (movie_row, similarity_score) pairs by similarity score:
sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=False)

print(sim_scores)

Once you have the high similarity scores first, we can grab just the top 10 highest scorers:

In [None]:
# Get the top 10 (movie_row, similarity_score) pairs:
closest_matches = sim_scores[:10]

print(closest_matches)

Now that we have our top 10, we don't need the similarity scores any more. We just care about the first number in each pair, which is the row number of the movie.

The following code is *supposed* to grab just the first entry from each pair. But instead, it grabs the second entry (the similarity score). Can you fix it?

In [None]:
# Supposed to get just the movie_row out of each (movie_row, similarity_score) pair.
# But currently gets the similarity_score instead. (Can you fix it?)
movie_indices = [i[1] for i in closest_matches]

print(movie_indices)

You should see a list like `[0, 2408, 11336, ...]`.

Finally, we can use these row numbers to filter our original movie descriptions table and get a nice top 10 list:

In [None]:
# Filter the movie_descriptions table to only include the top 10 rows:
movie_descriptions.iloc[movie_indices]

Amazing! This top 10 list isn't perfect, but it's not terrible either. We got the first three toy story movies, in order, as being most similar to "Toy Story". Pretty good!

For convenience, I've taken all the steps from above and merged them into one code block so that you can try running the code with a different movie. Do the results seem any good? (What happens if you try to find movies similar to `"Minions"`?)

In [None]:
# Get the movie's index based on its title:
index = indices["Toy Story"]

# Use that index to get the similarities matrix row
# that gives a similarity score for this movie
# compared to each other movie:
row = similarity_matrix[index]

# Convert that row to a list of (movie_index, similarity_score) pairs:
sim_scores = list(enumerate(row))

# Sort the (movie_index, similarity_score) pairs by similarity score:
sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

# Get the top 10 (movie_index, similarity_score) pairs:
sim_scores = sim_scores[:10]

# Get just the movie_index out of each pair:
movie_indices = [i[0] for i in sim_scores]

# Filter the movie_descriptions table to only include the top 10 rows:
movie_descriptions.iloc[movie_indices]

Hopefully with this knowledge, you are well-equiped to choose a movie.

Or, y'know, you could just follow your heart.