# Movie Recommendation System

#### Using the "IMDB Movies Dataset", we will create different Movie Recommendation systems. 

#### First a simple system that offers general recommendations based on movie popularity and rating.

#### Then a system which finds movies whith similar plot description to the selected one

#### Moreover a system which also takes into consideration Credits, Genres and Keywords to find similar movies to the selected one.

In [1]:
import pandas as pd
import numpy as np

# Load Movies Metadata
metadata = pd.read_csv('movies_metadata.csv', low_memory=False)

metadata.shape

(45466, 24)

In [2]:
# Print the first three rows
metadata.head(3)

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",...,1995-10-30,373554033.0,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415.0
1,False,,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,...,1995-12-15,262797249.0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0
2,False,"{'id': 119050, 'name': 'Grumpy Old Men Collect...",0,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",,15602,tt0113228,en,Grumpier Old Men,A family wedding reignites the ancient feud be...,...,1995-12-22,0.0,101.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,False,6.5,92.0


## Simple recommender: offers generalized recommendations to every user, based on overall movie popularity and rating

In order to avoid that movies with just few but high ratings (say 10 movies with avg rating 9/10), being higher in the ranking then movies with many more a little lower rating (say 10000 movies with avg rating 8.8/10), we will have to calculate a weighted average

$$WR = R*v/(v+m) +  C*m/(v+m)$$

Where:
> *v* is the number of votes for the movie (vote_count)
>
> *m* is the minimum votes required to be listed in the chart (to be set)
>
> *R* is the average rating of the movie (vote_average)
>
> *C* is the mean vote across the whole report

In [3]:
# calculate C
C = metadata['vote_average'].mean()
print(C)

5.618207215133889


In [4]:
# calculate the minimum number of votes required to be in the chart: it will be the 95% percentile
m = metadata['vote_count'].quantile(0.95)
print(m)

434.0


In [5]:
# save out all qualified movies into a new DataFrame
q_movies = metadata.copy().loc[metadata['vote_count'] >= m]
q_movies.shape

(2274, 24)

In [6]:
# function that computes the weighted rating of each movie
def weighted_rating(x, m=m, C=C):
    v = x['vote_count']
    R = x['vote_average']
    return (v/(v+m) * R) + (m/(m+v) * C) # calculation based on the IMDB formula

In [7]:
# Define a new feature 'score' and calculate its value with `weighted_rating()`
q_movies['score'] = q_movies.apply(weighted_rating, axis=1)

In [8]:
q_movies.head(3)

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count,score
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",...,373554033.0,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415.0,7.545529
1,False,,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,...,262797249.0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0,6.704602
5,False,,60000000,"[{'id': 28, 'name': 'Action'}, {'id': 80, 'nam...",,949,tt0113277,en,Heat,"Obsessive master thief, Neil McCauley leads a ...",...,187436818.0,170.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,A Los Angeles Crime Saga,Heat,False,7.7,1886.0,7.310561


In [9]:
#Sort movies based on score calculated above
q_movies = q_movies.sort_values('score', ascending=False)

In [10]:
#Print the top 15 movies
q_movies[['title', 'vote_count', 'vote_average', 'score']].head(15)

Unnamed: 0,title,vote_count,vote_average,score
314,The Shawshank Redemption,8358.0,8.5,8.357746
834,The Godfather,6024.0,8.5,8.306334
12481,The Dark Knight,12269.0,8.3,8.208376
2843,Fight Club,9678.0,8.3,8.184899
292,Pulp Fiction,8670.0,8.3,8.172155
351,Forrest Gump,8147.0,8.2,8.069421
522,Schindler's List,4436.0,8.3,8.061007
23673,Whiplash,4376.0,8.3,8.058025
5481,Spirited Away,3968.0,8.3,8.035598
1154,The Empire Strikes Back,5998.0,8.2,8.025793


## Plot Description Based Recommender

#### Recommending movies that are similar to a particular movie, by computing pairwise similarity scores for all movies based on their plot descriptions and recommend movies based on that similarity score.

In [11]:
#Print plot description (overview) of the first 5 movies.
q_movies['overview'].head()  # We will use again q_movies instead of the original "metadata" because smaller and easier to handle

314      Framed in the 1940s for the double murder of h...
834      Spanning the years 1945 to 1955, a chronicle o...
12481    Batman raises the stakes in his war on crime. ...
2843     A ticking-time-bomb insomniac and a slippery s...
292      A burger-loving hit man, his philosophical par...
Name: overview, dtype: object

In order to compute the similarity between any two overviews, we need to calculate the *Term Frequency-Inverse Document Frequency* (TF-IDF) vectors first.

TF-IDF score is the frequency of a word occurring in a document, down-weighted by the number of documents in which it occurs. This is done to reduce the importance of words that occur frequently in plot overviews and therefore, their significance in computing the final similarity score.

In [12]:
#Import TfIdfVectorizer from scikit-learn
from sklearn.feature_extraction.text import TfidfVectorizer

#Define a TF-IDF Vectorizer Object. Remove all english stop words such as 'the', 'a'
tfidf = TfidfVectorizer(stop_words='english')

#Replace NaN with an empty string
q_movies['overview'] = q_movies['overview'].fillna('')

#Construct the required TF-IDF matrix by fitting and transforming the data
tfidf_matrix = tfidf.fit_transform(q_movies['overview'])

#Output the shape of tfidf_matrix
tfidf_matrix.shape

(2274, 13570)

With this matrix in hand, we can now compute a similarity score. There are several candidates for this; such as the euclidean, the Pearson and the cosine similarity scores. 

We will be using the cosine similarity to calculate a numeric quantity that denotes the similarity between two movies (it is independent of magnitude).

$$ cosine(x,y) = (xy^T)/||x||||y||$$

Since we have used the TF-IDF vectorizer, calculating the dot product will directly give us the cosine similarity score. Therefore, we will use sklearn's linear_kernel() instead of cosine_similarities() since it is faster.

In [13]:
# Import linear_kernel
from sklearn.metrics.pairwise import linear_kernel

# Compute the cosine similarity matrix
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)

We're going to define a function that takes in a movie title as an input and outputs a list of the 10 most similar movies. Firstly, for this, we need a reverse mapping of movie titles and DataFrame indices. In other words, we need a mechanism to identify the index of a movie in our q_movies DataFrame, given its title

In [14]:
#Construct a reverse map of indices and movie titles
indices = pd.Series(q_movies.index, index=q_movies['title']).drop_duplicates()

In [15]:
# Function that takes in movie title as input and outputs most similar movies
def get_recommendations(title, cosine_sim=cosine_sim):
    # Get the index of the movie that matches the title
    idx = indices[title]

    # Get the pairwsie similarity scores of all movies with that movie
    sim_scores = list(enumerate(cosine_sim[idx]))

    # Sort the movies based on the similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # Get the scores of the 10 most similar movies
    sim_scores = sim_scores[1:11]

    # Get the movie indices
    movie_indices = [i[0] for i in sim_scores]

    # Return the top 10 most similar movies
    return q_movies['title'].iloc[movie_indices]

In [16]:
get_recommendations('Batman Returns')

21068              Man of Steel
21418                   Riddick
28131         The Hateful Eight
18995                Prometheus
3182                Pitch Black
3559     For a Few Dollars More
20383               Upside Down
18100                 Immortals
31953                  Southpaw
6783     The Matrix Revolutions
Name: title, dtype: object

In [17]:
get_recommendations('The Godfather')

5878                             City of God
1650                        The Big Lebowski
10841                  Thank You for Smoking
7234                        Dawn of the Dead
13219    The Curious Case of Benjamin Button
38898                           Café Society
35362                           Daddy's Home
5315                               Mr. Deeds
23465                       Edge of Tomorrow
1199                               Manhattan
Name: title, dtype: object

While this system has done a decent job of finding movies with similar plot descriptions, the quality of recommendations is not that great.

## Credits, Genres and Keywords Based Recommender

#### The quality of your recommender would be increased with the usage of better metadata. 
#### Four relevant info to base the model on would be: the 3 top actors, the director, related genres and the movie plot keywords.

Cast and crew data is not available in the current dataset so we are going to load and merge them into the main DataFrame "metadata".

In [20]:
# Load keywords and credits
credits = pd.read_csv('credits.csv')
keywords = pd.read_csv('keywords.csv')

# Convert IDs to int. Required for merging
keywords['id'] = keywords['id'].astype('int')
credits['id'] = credits['id'].astype('int')
q_movies['id'] = q_movies['id'].astype('int')

# Merge keywords and credits into your main metadata dataframe
q_movies = q_movies.merge(credits, on='id')
q_movies = q_movies.merge(keywords, on='id')

In [21]:
# Print the first two movies of the newly merged metadata
q_movies.head(2)

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,status,tagline,title,video,vote_average,vote_count,score,cast,crew,keywords
0,False,,25000000,"[{'id': 18, 'name': 'Drama'}, {'id': 80, 'name...",,278,tt0111161,en,The Shawshank Redemption,Framed in the 1940s for the double murder of h...,...,Released,Fear can hold you prisoner. Hope can set you f...,The Shawshank Redemption,False,8.5,8358.0,8.357746,"[{'cast_id': 3, 'character': 'Andy Dufresne', ...","[{'credit_id': '52fe4231c3a36847f800b127', 'de...","[{'id': 378, 'name': 'prison'}, {'id': 417, 'n..."
1,False,"{'id': 230, 'name': 'The Godfather Collection'...",6000000,"[{'id': 18, 'name': 'Drama'}, {'id': 80, 'name...",http://www.thegodfather.com/,238,tt0068646,en,The Godfather,"Spanning the years 1945 to 1955, a chronicle o...",...,Released,An offer you can't refuse.,The Godfather,False,8.5,6024.0,8.306334,"[{'cast_id': 5, 'character': 'Don Vito Corleon...","[{'credit_id': '52fe422bc3a36847f80093db', 'de...","[{'id': 131, 'name': 'italy'}, {'id': 699, 'na..."


In [22]:
# At the moment, the data is present in the form of "stringified" lists. We need to convert them 
# into a form that is usable.

# Parse the stringified features into their corresponding python objects
from ast import literal_eval  # conveerts sting to dictionary

features = ['cast', 'crew', 'keywords', 'genres']
for feature in features:
    q_movies[feature] = q_movies[feature].apply(literal_eval)

In [23]:
# Write functions that help to extract required info from each feature

In [24]:
# Get the director's name from the crew feature. If director is not listed, return NaN
def get_director(x):
    for i in x:
        if i['job'] == 'Director':
            return i['name']
    return np.nan

In [25]:
# Returns the list top 3 elements or entire list; whichever is more.
def get_list(x):
    if isinstance(x, list):
        names = [i['name'] for i in x]
        #Check if more than 3 elements exist. If yes, return only first three. If no, return entire list.
        if len(names) > 3:
            names = names[:3]
        return names

    #Return empty list in case of missing/malformed data
    return []

In [26]:
# Define new director, cast, genres and keywords features that are in a suitable form.
q_movies['director'] = q_movies['crew'].apply(get_director)

features = ['cast', 'keywords', 'genres']
for feature in features:
    q_movies[feature] = q_movies[feature].apply(get_list)

In [27]:
# Print the new features of the first 3 films
q_movies[['title', 'cast', 'director', 'keywords', 'genres']].head(3)

Unnamed: 0,title,cast,director,keywords,genres
0,The Shawshank Redemption,"[Tim Robbins, Morgan Freeman, Bob Gunton]",Frank Darabont,"[prison, corruption, police brutality]","[Drama, Crime]"
1,The Godfather,"[Marlon Brando, Al Pacino, James Caan]",Francis Ford Coppola,"[italy, love at first sight, loss of father]","[Drama, Crime]"
2,The Dark Knight,"[Christian Bale, Michael Caine, Heath Ledger]",Christopher Nolan,"[dc comics, crime fighter, secret identity]","[Drama, Action, Crime]"


The next step is to convert the names and keyword instances into lowercase and strip all the spaces between them. This is done so that our vectorizer doesn't count the Johnny of "Johnny Depp" and "Johnny Galecki" as the same. After this processing step, the aforementioned actors will be represented as "johnnydepp" and "johnnygalecki" and will be distinct to our vectorizer.

In [28]:
# Function to convert all strings to lower case and strip names of spaces
def clean_data(x):
    if isinstance(x, list):
        return [str.lower(i.replace(" ", "")) for i in x]
    else:
        #Check if director exists. If not, return empty string
        if isinstance(x, str):
            return str.lower(x.replace(" ", ""))
        else:
            return ''

In [29]:
# Apply clean_data function to your features.
features = ['cast', 'keywords', 'director', 'genres']

for feature in features:
    q_movies[feature] = q_movies[feature].apply(clean_data)

Create a "metadata soup", which is a string that contains all the metadata that we want to feed to our vectorizer (namely actors, director and keywords).

In [30]:
def create_soup(x):
    return ' '.join(x['keywords']) + ' ' + ' '.join(x['cast']) + ' ' + x['director'] + ' ' + ' '.join(x['genres'])

In [31]:
# Create a new soup feature
q_movies['soup'] = q_movies.apply(create_soup, axis=1)

In [32]:
# Import CountVectorizer and create the count matrix
from sklearn.feature_extraction.text import CountVectorizer

count = CountVectorizer(stop_words='english')
count_matrix = count.fit_transform(q_movies['soup'])

In [33]:
# Compute the Cosine Similarity matrix based on the count_matrix
from sklearn.metrics.pairwise import cosine_similarity

cosine_sim2 = cosine_similarity(count_matrix, count_matrix)

In [34]:
# Reset index of the main DataFrame and construct reverse mapping as before
q_movies = q_movies.reset_index()
indices = pd.Series(q_movies.index, index=q_movies['title'])

In [35]:
get_recommendations('Batman Returns', cosine_sim2)

540                                        Batman
1970                                 Dark Shadows
59                                       Scarface
342     The Hobbit: The Battle of the Five Armies
2289                               Batman & Robin
885                                       Matilda
1794                                 Sucker Punch
1633                   Once Upon a Time in Mexico
1888                                       Duplex
2232        The Mummy: Tomb of the Dragon Emperor
Name: title, dtype: object

In [36]:
get_recommendations('The Godfather', cosine_sim2)

526             The Godfather: Part III
14               The Godfather: Part II
78                       Apocalypse Now
59                             Scarface
172                                Heat
350                       Carlito's Way
389                       Donnie Brasco
480                   Dog Day Afternoon
748             The Talented Mr. Ripley
1955    Wall Street: Money Never Sleeps
Name: title, dtype: object

The recommender has been successful in capturing more information due to more metadata and has given us better recommendations.