# Lab 10 - Recommender Systems

The objective of a recommender system is to recommend relevant items for users, based on their preferences. User preference is generally inferred from previous usage and ratings. In practice, there are a few different kinds of recommendation systems:

- **Demographic Filtering**: Offers generalized recommendations to all users, based on movie popularity and/or genre. Usually the simplest to implement but also the most impersonal.
- **Collaborative Filtering**: Filters items that a user might like based on interactions of similar users.
- **Content-based Filtering**: Uses descriptions or attributes of the items a user has previously consumed to learn the user's preferences and make recommendations.
- **Hybrid**: Combines collaborative filtering and content-based filtering to generate recommendations.

This notebook demonstrates how to implement a popularity-based recommender, a collaborative filtering recommender, and a content-based recommender for movie recommendations. We'll use a subset of data from the [MovieLens](https://grouplens.org/datasets/movielens/latest/) and [TMDB](https://www.themoviedb.org/) databases.

## Import Libraries & Datasets

We'll install and load the Surprise (Simple Python RecommendatIon System Engine) library. In addition, we'll load two datasets, one containing the user ratings for movies they've watched, and one containing (some) movies' metadata.

In [1]:
# Uncomment and execute this cell if running locally or on Colab
# to install the surprise package for recommender systems
# !pip install scikit-surprise --upgrade

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns


%matplotlib inline
plt.style.use('fivethirtyeight')
sns.set_style('whitegrid')

In [3]:
try:
    ratings = pd.read_csv('../data/movie_ratings.csv')
except FileNotFoundError:
    # Fall back to the hosted copy when the local file is not available
    ratings = pd.read_csv('https://raw.githubusercontent.com/GUC-DM/W2021-Berlin/main/data/movie_ratings.csv')
ratings.head()

Unnamed: 0,userId,tmdbId,rating
0,1,9909,2.5
1,7,9909,3.0
2,31,9909,4.0
3,32,9909,4.0
4,36,9909,3.0


In [4]:
ratings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 99810 entries, 0 to 99809
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   userId  99810 non-null  int64  
 1   tmdbId  99810 non-null  int64  
 2   rating  99810 non-null  float64
dtypes: float64(1), int64(2)
memory usage: 2.3 MB


In [5]:
ratings.describe()

Unnamed: 0,userId,tmdbId,rating
count,99810.0,99810.0,99810.0
mean,346.983228,13044.446729,3.542972
std,195.166743,31288.952146,1.057939
min,1.0,2.0,0.5
25%,182.0,674.0,3.0
50%,367.0,4964.0,4.0
75%,520.0,11310.0,4.0
max,671.0,416437.0,5.0


In [6]:
try:
    movies_db = pd.read_csv('../data/movies_db.csv')
except FileNotFoundError:
    # Fall back to the hosted copy when the local file is not available
    movies_db = pd.read_csv('https://raw.githubusercontent.com/GUC-DM/W2021-Berlin/main/data/movies_db.csv')

# We'll set the TMDB ID as the index for quick indexing by ID
movies_db = movies_db.set_index('tmdbId')
movies_db.head()

tmdbId,imdb_id,title,overview,original_language,vote_average,vote_count,release_year,genre_1,genre_2
862,tt0114709,Toy Story,"Led by Woody, Andy's toys live happily in his ...",en,7.7,5415.0,1995.0,Animation,Comedy
8844,tt0113497,Jumanji,When siblings Judy and Peter discover an encha...,en,6.9,2413.0,1995.0,Adventure,Fantasy
15602,tt0113228,Grumpier Old Men,A family wedding reignites the ancient feud be...,en,6.5,92.0,1995.0,Romance,Comedy
31357,tt0114885,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom...",en,6.1,34.0,1995.0,Comedy,Drama
11862,tt0113041,Father of the Bride Part II,Just when George Banks has recovered from his ...,en,5.7,173.0,1995.0,Comedy,


In [7]:
movies_db.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 9042 entries, 862 to 265189
Data columns (total 9 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   imdb_id            9042 non-null   object 
 1   title              9042 non-null   object 
 2   overview           9030 non-null   object 
 3   original_language  9042 non-null   object 
 4   vote_average       9042 non-null   float64
 5   vote_count         9042 non-null   float64
 6   release_year       9042 non-null   float64
 7   genre_1            9007 non-null   object 
 8   genre_2            7143 non-null   object 
dtypes: float64(3), object(6)
memory usage: 706.4+ KB


## Demographic Filtering Recommender System

This type of recommender system is fairly simple in practice, but also the least personalized. The general idea behind a popularity-based recommender system is that movies that are more popular and critically acclaimed will *probably* be liked by the average audience.

For this purpose, we'll use a weighted rating formula (IMDB's) instead of the raw average score. The idea is that, if a movie has few votes, we shouldn't put much trust in its rating and should lean towards a conservative estimate (the overall mean) instead. The more votes there are for a particular movie, the more we can trust its rating.

$$ \text{Weighted Rating (WR)} = (\frac{v}{v + m} \cdot R) + (\frac{m}{v + m} \cdot C)$$
Where:
- $v$ is the number of votes for the movie
- $m$ is the minimum votes required to be listed in the chart
- $R$ is the average rating of the movie
- $C$ is the mean vote across the whole report


We'll set the minimum number of votes to be the 90th percentile. In other words, a movie must have more votes than 90% of the movies in the list. 
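To see how the formula pulls a rating towards the overall mean when there are few votes, here is a minimal sketch. The values of `m_demo`, `C_demo` and the vote counts are made up for illustration and are not the ones computed from the dataset.

```python
def weighted_rating_example(v, R, m, C):
    """IMDB-style weighted rating: blend the movie's own average R with the overall mean C."""
    return (v / (v + m)) * R + (m / (v + m)) * C

# Illustrative values only (not taken from the dataset)
m_demo, C_demo = 1000, 6.0

# Two hypothetical movies with the same raw average (9.0) but very different vote counts
few_votes = weighted_rating_example(v=50, R=9.0, m=m_demo, C=C_demo)
many_votes = weighted_rating_example(v=20000, R=9.0, m=m_demo, C=C_demo)

print(f"50 votes:     {few_votes:.2f}")   # 6.14, close to the overall mean
print(f"20,000 votes: {many_votes:.2f}")  # 8.86, close to the movie's own average
```

The movie with only 50 votes ends up near `C_demo`, while the heavily voted movie keeps most of its raw average.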

In [8]:
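vote_counts = movies_db[movies_db['vote_count'].notnull()]['vote_count'].astype('int')
# Note: astype('int') truncates each average (e.g., 7.7 becomes 7) before the mean is taken,
# so C is somewhat lower than the mean of the raw vote_average column
vote_averages = movies_db[movies_db['vote_average'].notnull()]['vote_average'].astype('int')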
C = vote_averages.mean()
C

5.91616898916169

In [9]:
m = vote_counts.quantile(0.90)
m

1129.800000000001

Filter the movies to those that meet the minimum vote count.

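Note: `&` is the element-wise (bitwise) AND operator. Python's `and`, `or`, and `not` operators cannot be used to combine NumPy arrays or pandas Series element-wise; use `&`, `|`, and `~` instead, and wrap each condition in parentheses.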
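A small sketch of the difference, using a made-up three-row dataframe (`demo` is a hypothetical stand-in, not part of the lab data):

```python
import pandas as pd

# Toy data, for illustration only
demo = pd.DataFrame({'vote_count': [10, 500, 2000],
                     'vote_average': [9.0, None, 7.5]})

# Element-wise AND: each comparison needs its own parentheses,
# because & binds more tightly than >=
mask = (demo['vote_count'] >= 100) & (demo['vote_average'].notnull())
print(mask.tolist())  # [False, False, True]

# Python's `and` asks for a single truth value of a whole Series, which is ambiguous
try:
    (demo['vote_count'] >= 100) and (demo['vote_average'].notnull())
except ValueError as err:
    print('ValueError:', err)
```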

In [10]:
qualified_movies = movies_db[
    (movies_db['vote_count'] >= m) & 
    (movies_db['vote_count'].notnull()) & 
    (movies_db['vote_average'].notnull())
].copy()
qualified_movies.shape

(905, 9)

In [11]:
def weighted_rating(row):
    v = row['vote_count']
    R = row['vote_average']
    return (v/(v+m) * R) + (m/(m+v) * C)

# Apply the weighted_rating over each row in the dataframe
qualified_movies['weighted_rating'] = qualified_movies.apply(weighted_rating, axis=1)

# Get the top 250 movies according to their calculated weighted_rating
qualified_movies = qualified_movies.nlargest(250, 'weighted_rating')
qualified_movies.head(15)

tmdbId,imdb_id,title,overview,original_language,vote_average,vote_count,release_year,genre_1,genre_2,weighted_rating
278,tt0111161,The Shawshank Redemption,Framed in the 1940s for the double murder of h...,en,8.5,8358.0,1994.0,Drama,Crime,8.192319
155,tt0468569,The Dark Knight,Batman raises the stakes in his war on crime. ...,en,8.3,12269.0,2008.0,Drama,Action,8.098993
238,tt0068646,The Godfather,"Spanning the years 1945 to 1955, a chronicle o...",en,8.5,6024.0,1972.0,Drama,Crime,8.091935
550,tt0137523,Fight Club,A ticking-time-bomb insomniac and a slippery s...,en,8.3,9678.0,1999.0,Drama,,8.050805
680,tt0110912,Pulp Fiction,"A burger-loving hit man, his philosophical par...",en,8.3,8670.0,1994.0,Thriller,Crime,8.025173
27205,tt1375666,Inception,"Cobb, a skilled thief who commits corporate es...",en,8.1,14075.0,2010.0,Action,Thriller,7.937729
13,tt0109830,Forrest Gump,A man with a low IQ has accomplished great thi...,en,8.2,8147.0,1994.0,Comedy,Drama,7.921858
157336,tt0816692,Interstellar,Interstellar chronicles the adventures of a gr...,en,8.1,11187.0,2014.0,Adventure,Drama,7.899681
1891,tt0080684,The Empire Strikes Back,"The epic saga continues as Luke Skywalker, in ...",en,8.2,5998.0,1980.0,Adventure,Action,7.837999
122,tt0167260,The Lord of the Rings: The Return of the King,Aragorn is revealed as the heir to the ancient...,en,8.1,8226.0,2003.0,Adventure,Fantasy,7.836282


In [12]:
def genre_chart(genre, movies_db):
    
    qualified_movies = movies_db[
        (movies_db['vote_count'] >= m) & 
        (movies_db['vote_count'].notnull()) & 
        (movies_db['vote_average'].notnull())
    ].copy()
    qualified_movies['weighted_rating'] = qualified_movies.apply(weighted_rating, axis=1)

    # Filter to just the genre specified
    # Note, | is bitwise OR
    qualified_movies = qualified_movies[
        (qualified_movies['genre_1'] == genre) |
        (qualified_movies['genre_2'] == genre)
    ]
    return qualified_movies.nlargest(250, 'weighted_rating')

In [13]:
genre_chart('Thriller', movies_db).head(15)

tmdbId,imdb_id,title,overview,original_language,vote_average,vote_count,release_year,genre_1,genre_2,weighted_rating
680,tt0110912,Pulp Fiction,"A burger-loving hit man, his philosophical par...",en,8.3,8670.0,1994.0,Thriller,Crime,8.025173
27205,tt1375666,Inception,"Cobb, a skilled thief who commits corporate es...",en,8.1,14075.0,2010.0,Action,Thriller,7.937729
101,tt0110413,Leon: The Professional,"Leon, the top hit man in New York, has earned ...",fr,8.2,4293.0,1994.0,Thriller,Crime,7.724181
77,tt0209144,Memento,Suffering short-term memory loss after a head ...,en,8.1,4168.0,2000.0,Mystery,Thriller,7.63428
694,tt0081505,The Shining,Jack Torrance accepts a caretaker job at the O...,en,8.1,3890.0,1980.0,Horror,Thriller,7.608488
500,tt0105236,Reservoir Dogs,A botched robbery indicates a police informant...,en,8.1,3821.0,1992.0,Crime,Thriller,7.601638
210577,tt2267998,Gone Girl,With his wife's disappearance having become th...,en,7.9,6023.0,2014.0,Mystery,Thriller,7.58665
11324,tt1130884,Shutter Island,World War II soldier-turned-U.S. Marshal Teddy...,en,7.8,6559.0,2010.0,Drama,Thriller,7.523188
1422,tt0407887,The Departed,"To take down South Boston's Irish Mafia, the p...",en,7.9,4455.0,2006.0,Drama,Thriller,7.498673
264644,tt3170832,Room,Jack is a young boy of 5 years old who has liv...,en,8.1,2838.0,2015.0,Drama,Thriller,7.478171


## Collaborative Filtering Recommender System

This method makes automatic predictions (filtering) about the interests of a user by collecting preferences or taste information from many users (collaborating). The underlying assumption of the collaborative filtering approach is that if a person A has the same opinion as a person B on a set of items, A is more likely to have B's opinion for a given item than that of a randomly chosen person. 
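The assumption above can be illustrated with a tiny user-user example. This is only a sketch on a made-up rating matrix, using NumPy rather than the Surprise library; the names `toy_ratings` and `user_cosine` are hypothetical.

```python
import numpy as np

# Toy user-item rating matrix (rows: users A, B, C; columns: 4 items; 0 = not rated).
# Values are made up for illustration.
toy_ratings = np.array([
    [5.0, 4.0, 1.0, 0.0],   # user A has not rated the last item
    [5.0, 5.0, 1.0, 2.0],   # user B: tastes similar to A
    [1.0, 2.0, 5.0, 5.0],   # user C: tastes quite different from A
])

def user_cosine(u, v):
    """Cosine similarity between two users, computed over the items both have rated."""
    both = (u > 0) & (v > 0)
    return u[both] @ v[both] / (np.linalg.norm(u[both]) * np.linalg.norm(v[both]))

target, item = 0, 3      # predict user A's rating for the last item
neighbours = [1, 2]      # users B and C
sims = np.array([user_cosine(toy_ratings[target], toy_ratings[n]) for n in neighbours])

# Similarity-weighted average of the neighbours' ratings for the unrated item
prediction = sims @ toy_ratings[neighbours, item] / sims.sum()
print(sims.round(3), round(float(prediction), 2))
```

Because user B is more similar to A than user C is, the prediction lands closer to B's rating (2) than a plain average of the two neighbours' ratings (3.5) would.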

In [14]:
from surprise import Dataset
from surprise import Reader

# The Reader class is used to parse a file containing ratings
# Since we already loaded it as a dataframe, we only need to set the rating_scale parameter.
reader = Reader(rating_scale=(0.5, 5))

# The columns must correspond to user id, item id and ratings (in that order).
data = Dataset.load_from_df(ratings[['userId', 'tmdbId', 'rating']], reader)


To build user-user and item-item similarity-based models like the ones covered in the lecture, we can use the `KNNBasic` class from the surprise library. We'll create two models, one that computes similarities between users (user-user) and the other computes similarities between items (item-item).

In [None]:
from surprise import KNNBasic

sim_options_user = {
    'name': 'cosine', # there are other options as well, including pearson
    'user_based': True  # compute similarities between users
}

user_knn_model = KNNBasic(k=40, min_k=1, sim_options=sim_options_user)

In [None]:
sim_options_item = {
    'name': 'cosine',
    'user_based': False  # compute similarities between items
}

item_knn_model = KNNBasic(k=40, min_k=1, sim_options=sim_options_item)

Let's also use the Singular Value Decomposition (SVD) model for comparison. This is the matrix factorization approach popularized by [Simon Funk](https://sifter.org/~simon/journal/20061211.html) in his December 2006 write-up during the Netflix Prize competition.

In [None]:
from surprise import SVD

svd_model = SVD()
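To give a feel for what this kind of model learns, here is a minimal NumPy sketch of Funk-style matrix factorization trained with stochastic gradient descent. It is not Surprise's implementation: the toy ratings, the names (`predict_rating`, `training_rmse`) and the hyperparameters (`lr`, `reg`, `n_factors`) are assumptions chosen for illustration. Each rating is approximated as the global mean plus a user bias, an item bias, and the dot product of a user and an item factor vector.

```python
import numpy as np

# Toy (user, item, rating) triples; made-up data for illustration
triples = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0), (1, 2, 1.0),
           (2, 1, 2.0), (2, 2, 4.5), (3, 0, 4.5), (3, 2, 2.0)]
n_users, n_items, n_factors = 4, 3, 2

rng = np.random.default_rng(0)
mu = np.mean([r for _, _, r in triples])        # global mean rating
b_u = np.zeros(n_users)                         # user biases
b_i = np.zeros(n_items)                         # item biases
P = rng.normal(0, 0.1, (n_users, n_factors))    # user factor vectors
Q = rng.normal(0, 0.1, (n_items, n_factors))    # item factor vectors

def predict_rating(u, i):
    return mu + b_u[u] + b_i[i] + Q[i] @ P[u]

def training_rmse():
    return np.sqrt(np.mean([(r - predict_rating(u, i)) ** 2 for u, i, r in triples]))

rmse_before = training_rmse()
lr, reg = 0.01, 0.02   # hypothetical learning rate and regularization strength
for epoch in range(200):
    for u, i, r in triples:
        err = r - predict_rating(u, i)
        b_u[u] += lr * (err - reg * b_u[u])
        b_i[i] += lr * (err - reg * b_i[i])
        # Update both factor vectors from the values they had before this step
        P[u], Q[i] = (P[u] + lr * (err * Q[i] - reg * P[u]),
                      Q[i] + lr * (err * P[u] - reg * Q[i]))
rmse_after = training_rmse()
print(f"training RMSE: {rmse_before:.3f} -> {rmse_after:.3f}")
```

The error here is measured on the training ratings themselves, so it only shows that the factors are being fitted; it says nothing about how well the model generalizes.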

## Evaluating Recommender System Performance

Since we're predicting user ratings (a numeric value ranging from 0.5 to 5), we can use the root mean squared error ($\text{RMSE}$), given by the following formula, where $N$ is the number of predictions:
$$\text{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(y_{pred,i} - y_{actual,i})^2}$$

We can use the `cross_validate` function to automatically split the data into folds and compute the $\text{RMSE}$ (and MAE) per fold for us.
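As a quick sanity check of the RMSE formula, here is a hand computation on four made-up ratings (the lists `y_actual` and `y_pred` are illustrative, not model output):

```python
import math

# Made-up true and predicted ratings, for illustration only
y_actual = [4.0, 3.0, 5.0, 2.0]
y_pred = [3.5, 3.0, 4.0, 3.0]

# Square each error, average the squares, then take the square root
squared_errors = [(p - a) ** 2 for p, a in zip(y_pred, y_actual)]
rmse_value = math.sqrt(sum(squared_errors) / len(squared_errors))
print(round(rmse_value, 3))  # 0.75
```

Squaring the errors means that the two predictions that are off by a full point dominate the score, while the exact prediction contributes nothing.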

In [None]:
from surprise.model_selection import cross_validate

cross_validate(user_knn_model, data, cv=2)

Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.


{'test_rmse': array([1.01438303, 1.01842069]),
 'test_mae': array([0.78453074, 0.78751173]),
 'fit_time': (0.312028169631958, 0.30510735511779785),
 'test_time': (3.0288262367248535, 2.9403207302093506)}

In [None]:
cross_validate(item_knn_model, data, cv=2)

Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.


{'test_rmse': array([0.9986614 , 0.98807488]),
 'test_mae': array([0.77288428, 0.76749731]),
 'fit_time': (9.937341213226318, 9.433929443359375),
 'test_time': (14.273617506027222, 13.356948137283325)}

In [None]:
cross_validate(svd_model, data, cv=2)

{'test_rmse': array([0.91041097, 0.91429124]),
 'test_mae': array([0.70319538, 0.70683345]),
 'fit_time': (3.2958407402038574, 3.058335542678833),
 'test_time': (0.5096189975738525, 0.46736860275268555)}

## Predicting Ratings for Users and Recommending Items

`Dataset.build_full_trainset()` builds a training set from the entire dataset (no splitting is done). We're no longer evaluating the models here; instead, we'd like to generate recommendations from all the ratings that we already have.

In [None]:
trainset = data.build_full_trainset()

# Fit each model to the training set
item_knn_model.fit(trainset)
user_knn_model.fit(trainset)
svd_model.fit(trainset)

Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.


<surprise.prediction_algorithms.matrix_factorization.SVD at 0x7f5840babeb0>

Predicting the rating a particular user would give a particular item can be done using the model's `predict` method, which takes the user ID and the item ID.

In [None]:
svd_model.predict(1,819)

Prediction(uid=1, iid=819, r_ui=None, est=2.725490198270817, details={'was_impossible': False})

The predicted score itself can be accessed through the `.est` attribute of the returned `Prediction`.

In [None]:
svd_model.predict(1,819).est

2.725490198270817

Recommender system implementations differ greatly depending on the problem, the data/attributes available, and the approach.

Here, we simply predict the rating for all the movies the user has not rated yet, and return the top `n` movies to recommend based on the predicted rating.

In [None]:
def get_top_recommendations(user_id, algorithm, n=10):
    # Get the IDs of movies that the user has already rated
    rated_movies = ratings.loc[ratings['userId'] == user_id, 'tmdbId']
    
    # Get the IDs of movies that were not yet rated by the user
    # Note: ~ is bitwise not
    movies_to_predict = movies_db[~movies_db.index.isin(rated_movies)].index

    # Setup dataframe to use for building and sorting the movie rating predictions for the user
    user_predictions = pd.DataFrame(movies_to_predict)

    # Predict the user's rating for each of the movies that were not previously rated
    user_predictions['predicted_rating'] = user_predictions['tmdbId'].apply(lambda movie_id: algorithm.predict(user_id, movie_id).est)

    # Return the top n recommendations based on the predicted score
    # (the result can be merged with movies_db to see movie title, genre, etc.)
    return user_predictions.nlargest(n, 'predicted_rating')
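The function returns only TMDB IDs and predicted ratings. One possible way to make the result readable is to look up each ID in the metadata table. The sketch below uses small made-up stand-ins (`toy_movies`, `toy_predictions`) that mimic the shape of `movies_db` and of the function's output.

```python
import pandas as pd

# Toy stand-ins for the notebook's objects (hypothetical values)
toy_movies = pd.DataFrame(
    {'title': ['Movie A', 'Movie B', 'Movie C'],
     'genre_1': ['Drama', 'Comedy', 'Action']},
    index=pd.Index([11, 22, 33], name='tmdbId'),
)
toy_predictions = pd.DataFrame({'tmdbId': [33, 11],
                                'predicted_rating': [4.2, 3.9]})

# Match the tmdbId column against the metadata index to attach title and genre
readable = toy_predictions.join(toy_movies[['title', 'genre_1']], on='tmdbId')
print(readable)
```

With the real data, the same `join` call could be applied to the dataframe returned by `get_top_recommendations`, since `movies_db` is indexed by `tmdbId`.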

These are the movies that the user (whose `user_id` is 1 in this example) has already rated.

In [None]:
ratings.loc[ratings['userId'] == 1].merge(movies_db.reset_index()).sort_values('rating', ascending=False)

Unnamed: 0,userId,tmdbId,rating,imdb_id,title,overview,original_language,vote_average,vote_count,release_year,genre_1,genre_2
4,1,11216,4.0,tt0095765,Cinema Paradiso,"A filmmaker recalls his childhood, when he fel...",it,8.2,834.0,1988.0,Drama,Romance
13,1,97,4.0,tt0084827,Tron,As Kevin Flynn searches for proof that he inve...,en,6.6,717.0,1982.0,Science Fiction,Action
12,1,1051,4.0,tt0067116,The French Connection,Tough narcotics detective 'Popeye' Doyle is in...,en,7.4,435.0,1971.0,Action,Crime
8,1,6114,3.5,tt0103874,Dracula,When Dracula leaves the captive Jonathan Harke...,en,7.1,1087.0,1992.0,Romance,Horror
19,1,11072,3.0,tt0071230,Blazing Saddles,A town – where everyone seems to be named John...,en,7.2,619.0,1974.0,Western,Comedy
1,1,11360,3.0,tt0033563,Dumbo,Dumbo is a baby elephant born with oversized e...,en,6.8,1206.0,1941.0,Animation,Family
2,1,819,3.0,tt0117665,Sleepers,Two gangsters seek revenge on the state jail w...,en,7.3,729.0,1996.0,Crime,Drama
14,1,8393,3.0,tt0080801,The Gods Must Be Crazy,Misery is brought to a small group of Sho in t...,en,7.1,251.0,1980.0,Action,Comedy
17,1,9426,2.5,tt0091064,The Fly,When Seth Brundle makes a huge scientific and ...,en,7.1,1038.0,1986.0,Horror,Science Fiction
0,1,9909,2.5,tt0112792,Dangerous Minds,Former Marine Louanne Johnson lands a gig teac...,en,6.4,249.0,1995.0,Drama,Crime


These are the recommendations the user would receive using the SVD model with our function.

In [None]:
get_top_recommendations(1, svd_model)

Unnamed: 0,tmdbId,predicted_rating
47,629,3.764362
283,278,3.694023
1492,654,3.612122
722,3078,3.568589
782,488,3.564249
2458,11787,3.548065
958,144,3.544187
716,872,3.530533
1486,887,3.527895
740,15,3.524882


## Content-based Filtering

As an example of a content-based recommender system, we'll recommend movies with similar plot descriptions, which are provided in the `'overview'` column.

In [None]:
movies_db['overview'].head(5)

tmdbId
862      Led by Woody, Andy's toys live happily in his ...
8844     When siblings Judy and Peter discover an encha...
15602    A family wedding reignites the ancient feud be...
31357    Cheated on, mistreated and stepped on, the wom...
11862    Just when George Banks has recovered from his ...
Name: overview, dtype: object

To compute the similarity between different plot descriptions (or any text for that matter), we'll need to represent the text numerically. TF-IDF, short for Term Frequency-Inverse Document Frequency, is a popular numerical statistic that represents how important each word is to a document in a collection, and is often used as a weighting factor for information retrieval and text mining tasks. We'll compute the TF-IDF vectors for each overview, and then compute the similarity between them.

Term frequency refers to the relative frequency of a word in a document, given as **term instances/total instances**.

Inverse Document Frequency measures how rare a term is across all documents, given as **log(number of documents/documents with term)**. The more documents contain the term, the lower its IDF.

The overall importance of each word to the documents in which they appear is equal to TF * IDF.
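The two definitions above can be checked by hand on a tiny corpus. This is a plain-Python sketch on three made-up "overviews"; the helper `tf_idf` is hypothetical. Note that scikit-learn's `TfidfVectorizer` uses a smoothed IDF and normalizes each row by default, so its numbers differ from this textbook version.

```python
import math
from collections import Counter

# Three tiny made-up "overviews", already lowercased and tokenized
docs = [
    "space crew explores wormhole".split(),
    "space station crew".split(),
    "detective hunts killer".split(),
]

def tf_idf(term, doc, docs):
    """Textbook TF-IDF; assumes the term appears in at least one document."""
    tf = Counter(doc)[term] / len(doc)             # term instances / total instances
    n_containing = sum(term in d for d in docs)    # documents with the term
    idf = math.log(len(docs) / n_containing)       # log(number of documents / documents with term)
    return tf * idf

print(round(tf_idf("space", docs[0], docs), 3))     # 0.101
print(round(tf_idf("wormhole", docs[0], docs), 3))  # 0.275
```

Both words occur once in the first overview, but "wormhole" appears in only one document while "space" appears in two, so "wormhole" receives the higher weight.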

Computing the TF-IDF for the plot descriptions generates a matrix in which each row represents a movie and each column represents a word in the overview vocabulary (all the words that appear in at least one document). The IDF weighting reduces the importance (and consequently, the weights) of words that occur frequently across plot overviews when computing the final similarity score.

Scikit-learn has a built-in `TfidfVectorizer` class that we can use to do this entire process in a few lines of code, which is quite convenient.

For a breakdown of the text processing done, see the **Extra** section at the end of this notebook.

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Define a TF-IDF Vectorizer Object. Remove all english stop words such as 'the', 'a'
tfidf = TfidfVectorizer(stop_words='english')

# Replace NaN with an empty string
movies_db['overview'] = movies_db['overview'].fillna('')

# Construct the required TF-IDF matrix by fitting and transforming the data
tfidf_matrix = tfidf.fit_transform(movies_db['overview'])

# Output the shape of tfidf_matrix
tfidf_matrix.shape

(9042, 29636)

By default, `TfidfVectorizer` scales each row to unit (L2) length, so the dot product of two rows is already their cosine similarity. Therefore, we can use sklearn's `linear_kernel` instead of `cosine_similarity`, which skips the redundant normalization step and is faster.
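The equivalence between the dot product and the cosine similarity for unit-length vectors can be verified with a small NumPy sketch. The two vectors are made up and merely stand in for two rows of a TF-IDF matrix.

```python
import numpy as np

# Two made-up TF-IDF-like rows, for illustration only
a = np.array([0.0, 2.0, 1.0, 0.0])
b = np.array([1.0, 1.0, 0.0, 3.0])

# Cosine similarity from its definition
cos_ab = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# After scaling each vector to unit length, the plain dot product gives the same number
a_unit = a / np.linalg.norm(a)
b_unit = b / np.linalg.norm(b)
dot_of_units = a_unit @ b_unit

print(np.isclose(cos_ab, dot_of_units))  # True
```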

In [None]:
from sklearn.metrics.pairwise import linear_kernel

# Compute the cosine similarity matrix
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)
cosine_sim.shape

(9042, 9042)

Construct a reverse map of indices and movie titles. This step is just for our convenience and to demonstrate easily using movie titles instead of the TMDB ID. For simplicity, we'll keep only the most recently released movie in case of movies with the same title.

In [None]:
no_dup_titles = movies_db.reset_index().sort_values('release_year').drop_duplicates(subset=['title'], keep='last')
indices = pd.Series(no_dup_titles.index, index=no_dup_titles['title'])
indices

title
A Trip to the Moon              6089
The Birth of a Nation           4952
The Immigrant                   5416
A Dog's Life                    2649
Billy Blazes, Esq.              7352
                                ... 
SOMM: Into the Bottle           8989
The Boy                         8987
The Lovers and the Despot       8986
Zootopia                        8993
Independence Day: Resurgence    8850
Length: 8754, dtype: int64

Now we're in a good position to define our recommendation function. We'll follow these steps:

- Get the index of the movie given its title (just for our convenience and to demonstrate easily, normally we'll be using the ID directly)
- Get the list of cosine similarity scores for that particular movie with all movies. Convert it into a list of tuples where the first element is its position and the second is the similarity score.
- Sort the aforementioned list of tuples based on the similarity scores; that is, the second element.
- Get the top 10 elements of this list. Ignore the first element as it refers to self (the movie most similar to a particular movie is the movie itself).
- Return the titles corresponding to the indices of the top elements.

In [None]:
# Function that takes in movie title as input and outputs most similar movies
def get_recommendations(title, cosine_sim=cosine_sim):
    # Get the index of the movie that matches the title
    idx = indices[title]

    # Get the pairwise similarity scores of all movies with that movie
    sim_scores = list(enumerate(cosine_sim[idx]))

    # Sort the movies based on the similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # Get the scores of the 10 most similar movies
    sim_scores = sim_scores[1:11]

    # Get the movie indices
    movie_indices = [i[0] for i in sim_scores]

    # Return the top 10 most similar movies
    return movies_db['title'].iloc[movie_indices]

Below are a few recommendations based on the plot overview for some popular movies. As seen, the system tends to recommend movies in the same series (e.g., other 007 movies when recommending movies similar to *Quantum of Solace*), which is quite natural since they share similar plot overviews.

To improve the content-based recommender system, we can consider other attributes available to compute similarity between movies, such as genre, directors, cast, language, tags, etc., with different weights for each, depending on what we'd like to prioritize for content similarity.

In [None]:
get_recommendations('The Matrix')

tmdbId
10428                   Hackers
10999                  Commando
9682                      Pulse
55931             The Animatrix
1557                         23
19995                    Avatar
157353            Transcendence
157834         The Zero Theorem
21519                 Project A
8766      Hellraiser: Bloodline
Name: title, dtype: object

In [None]:
get_recommendations('Quantum of Solace')

tmdbId
36670           Never Say Never Again
206647                        Spectre
681              Diamonds Are Forever
657             From Russia with Love
36557                   Casino Royale
253                  Live and Let Die
646                            Dr. No
42657               The Mark of Zorro
700                         Octopussy
682       The Man with the Golden Gun
Name: title, dtype: object

In [None]:
get_recommendations('The Dark Knight Rises')

tmdbId
414                                Batman Forever
155                               The Dark Knight
364                                Batman Returns
40662                  Batman: Under the Red Hood
268                                        Batman
69735                            Batman: Year One
123025    Batman: The Dark Knight Returns, Part 1
14919                Batman: Mask of the Phantasm
142061    Batman: The Dark Knight Returns, Part 2
272                                 Batman Begins
Name: title, dtype: object

## Extra: TF-IDF Text Preprocessing

"Stop words" are very common words (such as "the" or "a") that are excluded from the TF-IDF matrix, since they carry little information about a document's content. The list can be extended or trimmed as needed (see the documentation for scikit-learn's `TfidfVectorizer`).

In [None]:
tfidf.get_stop_words()

frozenset({'a',
           'about',
           'above',
           'across',
           'after',
           'afterwards',
           'again',
           'against',
           'all',
           'almost',
           'alone',
           'along',
           'already',
           'also',
           'although',
           'always',
           'am',
           'among',
           'amongst',
           'amoungst',
           'amount',
           'an',
           'and',
           'another',
           'any',
           'anyhow',
           'anyone',
           'anything',
           'anyway',
           'anywhere',
           'are',
           'around',
           'as',
           'at',
           'back',
           'be',
           'became',
           'because',
           'become',
           'becomes',
           'becoming',
           'been',
           'before',
           'beforehand',
           'behind',
           'being',
           'below',
           'beside',
           'besides'

Each column in the TF-IDF matrix corresponds to a particular word present in at least one of the movie overviews.

In [None]:
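# Note: get_feature_names() was deprecated in scikit-learn 1.0 and removed in 1.2;
# on newer versions, use tfidf.get_feature_names_out() instead
tfidf.get_feature_names()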

['00',
 '000',
 '007',
 '01',
 '05pm',
 '10',
 '100',
 '1000',
 '100th',
 '101',
 '101st',
 '103',
 '108',
 '10th',
 '11',
 '111',
 '1138',
 '114',
 '117',
 '1183',
 '119',
 '11th',
 '12',
 '120',
 '1200',
 '1250',
 '125th',
 '12th',
 '13',
 '1300',
 '1344',
 '13th',
 '14',
 '140',
 '1408',
 '142',
 '1429',
 '143',
 '145',
 '1475',
 '14pm',
 '14th',
 '15',
 '150',
 '150000',
 '150th',
 '1536',
 '1564',
 '1572',
 '15th',
 '15yrs',
 '15½',
 '16',
 '1600',
 '1600s',
 '1606',
 '161',
 '1630s',
 '164',
 '165',
 '1681',
 '168th',
 '1691',
 '1692',
 '16th',
 '17',
 '1700',
 '1700s',
 '174',
 '1746',
 '175',
 '1786',
 '1787',
 '1797',
 '17th',
 '18',
 '180',
 '1800',
 '1804',
 '181',
 '1812',
 '1816',
 '1818',
 '1820',
 '1820s',
 '1828',
 '1830s',
 '1831',
 '1836',
 '1839',
 '1840s',
 '1847',
 '1850s',
 '1858',
 '1860',
 '1860s',
 '1862',
 '1863',
 '1868',
 '1870',
 '1870s',
 '1871',
 '1873',
 '1875',
 '1877',
 '1879',
 '1880',
 '1880s',
 '1885',
 '1888',
 '1890',
 '1890s',
 '1893',
 '1896',
 

To better understand the TF-IDF matrix, we'll create a dataframe from it and assign the feature names as the columns and the movie titles as the index. The TF-IDF matrix is stored in a sparse format because most of its cells are 0, which is why we have to initialize the dataframe using a different method than we're accustomed to.

If a word does not appear in the movie overview text, its weight is 0.

In [None]:
# Note: on scikit-learn 1.0 and later, replace get_feature_names() with get_feature_names_out()
tfidf_df = pd.DataFrame.sparse.from_spmatrix(tfidf_matrix, columns=tfidf.get_feature_names(), index=movies_db.title)
tfidf_df

title,00,000,007,01,05pm,10,100,1000,100th,101,...,æon,édith,élan,émigré,état,étienne,évocateur,ôtomo,østergaard,žižek
Toy Story,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Jumanji,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Grumpier Old Men,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Waiting to Exhale,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Father of the Bride Part II,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Mohenjo Daro,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
The Beatles: Eight Days a Week - The Touring Years,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Pokémon: Spell of the Unknown,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Pokémon 4Ever: Celebi - Voice of the Forest,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


As an example, let's have a look at the overview of a single movie, e.g., Interstellar.

In [None]:
interstellar_overview = movies_db.loc[movies_db.title == 'Interstellar', 'overview'].iloc[0]
interstellar_overview

'Interstellar chronicles the adventures of a group of explorers who make use of a newly discovered wormhole to surpass the limitations on human space travel and conquer the vast distances involved in an interstellar voyage.'

Essentially, each overview is lowercased and tokenized (i.e., the text is split into a list of words), and stop words are removed. Additional text preprocessing, such as stemming (reducing words to their stem or base/root form), may be applied, but is not done by default. Then, for each of the remaining words in each movie overview, we compute its TF-IDF weight, and we end up with something that looks like the following (not shown: all the other columns contain a 0 because those words are not present in the movie's overview).

If interested, have a look at the [NLTK Book](http://nltk.org/book) for a practical introduction to Python programming for language processing. It is freely available online and written by the creators of [NLTK](https://www.nltk.org/), one of the most popular Python libraries for NLP tasks.

In [None]:
movie_title = 'Interstellar'
tfidf_df.loc[movie_title, tfidf_df.loc[movie_title] != 0]

adventures      0.171389
chronicles      0.192296
conquer         0.212016
discovered      0.182800
distances       0.266633
explorers       0.219011
group           0.123767
human           0.146318
interstellar    0.492246
involved        0.153132
limitations     0.266633
make            0.121416
newly           0.180033
space           0.169851
surpass         0.266633
travel          0.165344
use             0.160056
vast            0.210498
voyage          0.201619
wormhole        0.278630
Name: Interstellar, dtype: Sparse[float64, 0]
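The preprocessing described above (lowercasing, tokenizing, removing stop words) can be reproduced by hand for this one overview. This is only a sketch: `demo_stop_words` is a tiny hand-picked list covering this sentence rather than scikit-learn's full English list, and the regular expression is the same pattern as `TfidfVectorizer`'s default `token_pattern`, which keeps tokens of two or more characters.

```python
import re

overview = ('Interstellar chronicles the adventures of a group of explorers who make use '
            'of a newly discovered wormhole to surpass the limitations on human space travel '
            'and conquer the vast distances involved in an interstellar voyage.')

# Tiny hand-picked stop list for this sentence; scikit-learn's English list is much longer
demo_stop_words = {'the', 'of', 'who', 'to', 'on', 'and', 'in', 'an'}

# Lowercase, then keep runs of two or more word characters
# (single-character tokens such as 'a' are dropped by the pattern itself)
tokens = re.findall(r'(?u)\b\w\w+\b', overview.lower())
kept = [t for t in tokens if t not in demo_stop_words]

print(sorted(set(kept)))
```

The 20 distinct words that remain are the same 20 words that have a non-zero weight in the output above. "interstellar" occurs twice in the overview, which is consistent with it having the largest weight.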

## References

- Ahmed, I. (2020). Getting Started with a Movie Recommendation System. Kaggle. Retrieved from https://www.kaggle.com/ibtesama/getting-started-with-a-movie-recommendation-system
- Daniel, F. (2017). Film recommendation engine. Kaggle. Retrieved from https://www.kaggle.com/fabiendaniel/film-recommendation-engine
- Banik, R. (2017). The Movies Dataset. Kaggle. Retrieved from https://www.kaggle.com/rounakbanik/the-movies-dataset
- DLao (2020). Netflix Movie Recommendation. Kaggle. Retrieved from https://www.kaggle.com/rounakbanik/the-movies-dataset
- Surprise Library Documentation. Retrieved from https://surprise.readthedocs.io/en/stable/