# Movie Recommender System

This notebook implements a simple movie recommender system on the TMDB 5000 Movie Dataset ([more info here](https://www.kaggle.com/datasets/tmdb/tmdb-movie-metadata)). The high-level idea: each movie comes with descriptive metadata stored as strings (here, its genres and keywords), so we can use the [TF-IDF](https://en.wikipedia.org/wiki/Tf%E2%80%93idf) method to turn each movie into a vector and then recommend the movies whose vectors have the highest cosine similarity to the vector of the query movie.

Notebook Workflow:


1.   Download data from Kaggle.
2.   Combine the useful information for each movie (genres and keywords) into a single string.
3.   Fit a TfidfVectorizer() on those strings.
4.   Map the titles to indices.
5.   Make recommendations.
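
To make steps 3 and 5 concrete before touching the real data, here is a minimal pure-Python sketch of TF-IDF weighting followed by cosine similarity. The three toy "movies" and the helper names `tfidf_vectors` and `cosine` are made up for illustration. The weighting follows what I understand to be scikit-learn's default (raw term counts, smoothed idf, L2 normalisation), but treat it as an approximation of the idea rather than a reimplementation of the library.

```python
import math
from collections import Counter

# Hypothetical toy "movies": each is a string of genre/keyword tokens.
docs = {
    "Movie A": "action space alien",
    "Movie B": "action space war",
    "Movie C": "romance drama",
}

def tfidf_vectors(texts):
    """Return one L2-normalised TF-IDF vector (dict: token -> weight) per text."""
    tokenized = [t.split() for t in texts]
    n_docs = len(tokenized)
    # Document frequency: in how many texts each token appears
    df = Counter(tok for toks in tokenized for tok in set(toks))
    # Smoothed idf (assumed to match scikit-learn's default): ln((1+n)/(1+df)) + 1
    idf = {tok: math.log((1 + n_docs) / (1 + d)) + 1 for tok, d in df.items()}
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vec = {tok: count * idf[tok] for tok, count in tf.items()}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        vectors.append({tok: w / norm for tok, w in vec.items()})
    return vectors

def cosine(u, v):
    """Cosine similarity of two unit-length sparse vectors (a plain dot product)."""
    return sum(w * v.get(tok, 0.0) for tok, w in u.items())

vec_a, vec_b, vec_c = tfidf_vectors(list(docs.values()))
print(cosine(vec_a, vec_b))  # Movie B shares "action" and "space" with Movie A
print(cosine(vec_a, vec_c))  # Movie C shares no tokens with Movie A
```

Movies that share many (and especially rare) tokens end up with a similarity close to 1, while movies with no tokens in common score 0.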



In [1]:
# Get dataset from Kaggle

!pip install -q opendatasets

import opendatasets as od
import pandas as pd

od.download('https://www.kaggle.com/datasets/tmdb/tmdb-movie-metadata')
# Kaggle username and token required

Skipping, found downloaded files in "./tmdb-movie-metadata" (use force=True to force download)


In [2]:
# Load dataset into a pandas dataframe
data = pd.read_csv('/content/tmdb-movie-metadata/tmdb_5000_movies.csv')

# Inspection
data.head()

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2009-12-10,2787965087,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800
1,300000000,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...",http://disney.go.com/disneypictures/pirates/,285,"[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...",en,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...",139.082615,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2007-05-19,961000000,169.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"At the end of the world, the adventure begins.",Pirates of the Caribbean: At World's End,6.9,4500
2,245000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.sonypictures.com/movies/spectre/,206647,"[{""id"": 470, ""name"": ""spy""}, {""id"": 818, ""name...",en,Spectre,A cryptic message from Bond’s past sends him o...,107.376788,"[{""name"": ""Columbia Pictures"", ""id"": 5}, {""nam...","[{""iso_3166_1"": ""GB"", ""name"": ""United Kingdom""...",2015-10-26,880674609,148.0,"[{""iso_639_1"": ""fr"", ""name"": ""Fran\u00e7ais""},...",Released,A Plan No One Escapes,Spectre,6.3,4466
3,250000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 80, ""nam...",http://www.thedarkknightrises.com/,49026,"[{""id"": 849, ""name"": ""dc comics""}, {""id"": 853,...",en,The Dark Knight Rises,Following the death of District Attorney Harve...,112.31295,"[{""name"": ""Legendary Pictures"", ""id"": 923}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2012-07-16,1084939099,165.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,The Legend Ends,The Dark Knight Rises,7.6,9106
4,260000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://movies.disney.com/john-carter,49529,"[{""id"": 818, ""name"": ""based on novel""}, {""id"":...",en,John Carter,"John Carter is a war-weary, former military ca...",43.926995,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}]","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2012-03-07,284139100,132.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"Lost in our world, found in another.",John Carter,6.1,2124


In [3]:
# Get info about the data
print(f"The shape of the data is {data.shape[0]} by {data.shape[1]}.")
print(f"We have {data.shape[0]} movies and for each movie we have {data.shape[1]} features.")
print(f"The features are: {data.columns.values}.")

The shape of the data is 4803 by 20.
We have 4803 movies and for each movie we have 20 features.
The features are: ['budget' 'genres' 'homepage' 'id' 'keywords' 'original_language'
 'original_title' 'overview' 'popularity' 'production_companies'
 'production_countries' 'release_date' 'revenue' 'runtime'
 'spoken_languages' 'status' 'tagline' 'title' 'vote_average' 'vote_count'].


In [4]:
import json

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

In [5]:
# Let's do some further exploration

# First row
x = data.iloc[0]
print(f"{x['title']} belongs to these genres: {x['genres']}")
print(f"{x['title']} has these keywords associated with it: {x['keywords']}")

Avatar belongs to these genres: [{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}, {"id": 14, "name": "Fantasy"}, {"id": 878, "name": "Science Fiction"}]
Avatar has these keywords associated with it: [{"id": 1463, "name": "culture clash"}, {"id": 2964, "name": "future"}, {"id": 3386, "name": "space war"}, {"id": 3388, "name": "space colony"}, {"id": 3679, "name": "society"}, {"id": 3801, "name": "space travel"}, {"id": 9685, "name": "futuristic"}, {"id": 9840, "name": "romance"}, {"id": 9882, "name": "space"}, {"id": 9951, "name": "alien"}, {"id": 10148, "name": "tribe"}, {"id": 10158, "name": "alien planet"}, {"id": 10987, "name": "cgi"}, {"id": 11399, "name": "marine"}, {"id": 13065, "name": "soldier"}, {"id": 14643, "name": "battle"}, {"id": 14720, "name": "love affair"}, {"id": 165431, "name": "anti war"}, {"id": 193554, "name": "power relations"}, {"id": 206690, "name": "mind and soul"}, {"id": 209714, "name": "3d"}]


In [6]:
type(x['genres'])

str

In [7]:
x_list = json.loads(x['genres']) # returns a list of dictionaries
type(x_list),type(x_list[0])

(list, dict)

In [8]:
x_list

[{'id': 28, 'name': 'Action'},
 {'id': 12, 'name': 'Adventure'},
 {'id': 14, 'name': 'Fantasy'},
 {'id': 878, 'name': 'Science Fiction'}]

In [9]:
# Now that the JSON string is parsed into Python objects, we can pull out each genre name
# and later join all the relevant info into a single string
x_list[0]['name'], x_list[1]['name'], x_list[2]['name'], x_list[3]['name']

('Action', 'Adventure', 'Fantasy', 'Science Fiction')

In [10]:
# Join all genre names into one space-separated string
# Each genre must be a single token, so remove the spaces inside multi-word names
x_genres = ' '.join(''.join(i['name'].split()) for i in x_list) # Science Fiction -> ScienceFiction
x_genres # the same trick lets us encode all relevant information about a movie in one string

'Action Adventure Fantasy ScienceFiction'

In [11]:
# Create a function to convert the genres and keywords of each movie into a single string

def to_single_string(row):
  """Take a row of the dataset (one movie) and return its genres and keywords
  joined into a single space-separated string."""
  # Genres: parse the JSON string and collapse each name into one token
  genres = json.loads(row['genres'])
  genres = ' '.join(''.join(i['name'].split()) for i in genres)
  # Keywords: same treatment
  keywords = json.loads(row['keywords'])
  keywords = ' '.join(''.join(i['name'].split()) for i in keywords)

  return f"{genres} {keywords}"



# Test
to_single_string(x)



'Action Adventure Fantasy ScienceFiction cultureclash future spacewar spacecolony society spacetravel futuristic romance space alien tribe alienplanet cgi marine soldier battle loveaffair antiwar powerrelations mindandsoul 3d'
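
Why bother gluing multi-word names together? The sketch below mimics, with the standard `re` module, what I believe are the default `TfidfVectorizer` settings (`lowercase=True` and `token_pattern=r"(?u)\b\w\w+\b"`). The helper `default_tokens` is a hypothetical stand-in for the real tokenizer, not part of scikit-learn.

```python
import re

# Assumption: this mirrors TfidfVectorizer's default lowercasing and token pattern,
# which keeps only runs of two or more word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def default_tokens(text):
    """Approximate the tokens the default vectorizer would extract from text."""
    return TOKEN_PATTERN.findall(text.lower())

print(default_tokens("Science Fiction"))  # split into two unrelated tokens
print(default_tokens("ScienceFiction"))   # kept together as one token
print(default_tokens("anti war 3d"))      # "3d" survives because it has two characters
```

Without the joining step, "Science Fiction" would contribute the generic tokens "science" and "fiction" separately, and a keyword such as "space war" would partly match every movie tagged with just "space" or "war".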

In [12]:
# Get string representation for each movie
data['string'] = data.apply(to_single_string, axis=1) # This will be used as an input to the tfidf vectorizer

In [13]:
# Create tfidf vectorizer object
tfidf = TfidfVectorizer(max_features=2000) # keep only the 2000 most frequent tokens across all movies

In [14]:
# Create data matrix
X = tfidf.fit_transform(data['string'])
X.shape # docs by tokens

(4803, 2000)

In [15]:
# Now we need a mapping from titles to row indices
# Note: if two movies share a title, looking that title up returns several indices
title_to_index = pd.Series(data.index, index=data['title'])
title_to_index

title
Avatar                                         0
Pirates of the Caribbean: At World's End       1
Spectre                                        2
The Dark Knight Rises                          3
John Carter                                    4
                                            ... 
El Mariachi                                 4798
Newlyweds                                   4799
Signed, Sealed, Delivered                   4800
Shanghai Calling                            4801
My Date with Drew                           4802
Length: 4803, dtype: int64
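
One thing to watch with a title-based lookup: a pandas Series indexed by title does not require the titles to be unique. The sketch below uses a made-up frame (`toy`, with placeholder titles) to show what a lookup returns in each case; `first_index` is a hypothetical helper, one possible way to always get a single row index.

```python
import pandas as pd

# Hypothetical toy frame in which two different movies share a title
toy = pd.DataFrame({"title": ["Movie A", "Movie B", "Movie C", "Movie B"]})
toy_title_to_index = pd.Series(toy.index, index=toy["title"])

print(toy_title_to_index.index.is_unique)  # False: "Movie B" appears twice
print(toy_title_to_index["Movie A"])       # a single integer
print(toy_title_to_index["Movie B"])       # a Series holding both row indices

def first_index(mapping, title):
    """Return one row index for title, taking the first match if it is duplicated."""
    match = mapping[title]
    return int(match.iloc[0]) if isinstance(match, pd.Series) else int(match)

print(first_index(toy_title_to_index, "Movie B"))
```

Running `title_to_index.index.is_unique` on the real mapping is a quick way to check whether this matters for the dataset at hand.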

In [16]:
index =  title_to_index['Avatar']
index # Avatar is the first movie

0

In [17]:
query = X[index]
query = query.toarray() # Dense row vector; next we compute how similar the query is to every vector in X
query.shape

(1, 2000)

In [18]:
scores = cosine_similarity(query,X)
scores = scores.flatten()
recommendation_idx = (-scores).argsort()[1:6] # Position 0 is normally the movie itself (similarity 1), so skip it and keep the next 5
recommendations = data['title'].iloc[recommendation_idx]
recommendations # It works! Next, wrap these steps in a function

47      Star Trek Into Darkness
3214                 Barbarella
1287         A Monster in Paris
61            Jupiter Ascending
3730                      Cargo
Name: title, dtype: object
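
Skipping position 0 relies on the query movie sorting first. If another movie happens to have exactly the same vector, the two tie at similarity 1 and the query may land in position 1 instead. A minimal sketch of a more defensive ranking, using made-up scores and a hypothetical helper `top_k_excluding_query`:

```python
import numpy as np

def top_k_excluding_query(scores, query_index, k=5):
    """Return the indices of the k highest scores, never including query_index."""
    order = np.argsort(-scores, kind="stable")  # best first, ties keep original order
    order = order[order != query_index]         # drop the query wherever it landed
    return order[:k]

# Hypothetical scores: item 0 is an exact duplicate of the query (item 2)
toy_scores = np.array([1.0, 0.2, 1.0, 0.7, 0.0])

print((-toy_scores).argsort(kind="stable")[1:3])               # the positional skip returns the query itself
print(top_k_excluding_query(toy_scores, query_index=2, k=2))   # the query is filtered out explicitly
```

For the examples in this notebook the simple positional skip is fine; the explicit filter only matters when ties at the top are possible.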

In [37]:
def recommend(title):
  """Take a movie title as a string and return the top 5 recommendations
  as a list, ranked by cosine similarity score."""

  # Get index (if several movies share the title, use the first one)
  index = title_to_index[title]
  if isinstance(index, pd.Series):
    index = index.iloc[0]
  # Get query
  query = X[index]
  query = query.toarray()
  # Get similarity scores
  scores = cosine_similarity(query, X)
  scores = scores.flatten()
  # Make recommendations
  recommendation_idx = (-scores).argsort()[1:6] # Position 0 is normally the movie itself, so skip it and keep the next 5
  recommendations = data['title'].iloc[recommendation_idx]

  return list(recommendations)



In [38]:
print(f"The top 5 recommendations for Barbarella are {recommend('Barbarella')}.")

The top 5 recommendations for Barbarella are ['Soldier', 'Planet 51', 'Space Battleship Yamato', 'Planet of the Apes', 'Transformers: Dark of the Moon'].


In [39]:
print(f"The top 5 recommendations for El Mariachi are {recommend('El Mariachi')}.")

The top 5 recommendations for El Mariachi are ['No Country for Old Men', 'The Three Burials of Melquiades Estrada', 'Trade', 'Traffic', 'Checkmate'].


In [42]:
print(f"The top 5 recommendations for Quantum of Solace are {recommend('Quantum of Solace')}.") # Interesting: not all recommendations are Bond movies

The top 5 recommendations for Quantum of Solace are ['Die Another Day', 'Spectre', 'Skyfall', 'Lethal Weapon 4', 'The Pursuit of D.B. Cooper'].
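
When a recommendation looks surprising, one quick sanity check is to list the tokens two movies have in common, since only shared tokens can contribute to the cosine similarity. The sketch below uses two made-up strings and a hypothetical helper `shared_tokens`. On the real data you would pass in two entries of `data['string']` instead.

```python
def shared_tokens(string_a, string_b):
    """Return the sorted tokens that appear in both space-separated strings."""
    return sorted(set(string_a.lower().split()) & set(string_b.lower().split()))

# Hypothetical strings in the same format as data['string']
movie_a = "Action Thriller spy secretagent revenge"
movie_b = "Action Crime police revenge buddycop"

print(shared_tokens(movie_a, movie_b))
```

Note that tokens outside the vectorizer's 2000-token vocabulary are ignored by the model, so a shared token only matters if it was retained.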
