In this notebook, I will attempt to implement a few recommendation algorithms (content based, popularity based 
and collaborative filtering) and try to build an ensemble of these models to come up with our final recommendation system. We have two MovieLens datasets to work with.

The Full Dataset: Consists of 26,000,000 ratings and 750,000 tag applications applied to 45,000 movies by 270,000 users. Includes tag genome data with 12 million relevance scores across 1,100 tags.

The Small Dataset: Comprises 100,000 ratings and 1,300 tag applications applied to 9,000 movies by 700 users.

I will build a Simple Recommender using movies from the Full Dataset, whereas all personalised recommender systems will use the Small Dataset (the computing power available to me is very limited). As a first step, I will build the simple recommender.


In [1]:
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from ast import literal_eval
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.metrics.pairwise import linear_kernel, cosine_similarity
from nltk.stem.snowball import SnowballStemmer
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.corpus import wordnet
from surprise import Reader, Dataset, SVD 
# cross_validate supersedes the deprecated evaluate() helper
from surprise.model_selection import cross_validate
from surprise import accuracy
from surprise.model_selection import KFold
from surprise import NormalPredictor

import warnings; warnings.simplefilter('ignore')

In [2]:
dataset = pd.read_csv('datasets/movies_metadata.csv')
dataset.head()

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",...,1995-10-30,373554033.0,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415.0
1,False,,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,...,1995-12-15,262797249.0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0
2,False,"{'id': 119050, 'name': 'Grumpy Old Men Collect...",0,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",,15602,tt0113228,en,Grumpier Old Men,A family wedding reignites the ancient feud be...,...,1995-12-22,0.0,101.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,False,6.5,92.0
3,False,,16000000,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...",,31357,tt0114885,en,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom...",...,1995-12-22,81452156.0,127.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Friends are the people who let you be yourself...,Waiting to Exhale,False,6.1,34.0
4,False,"{'id': 96871, 'name': 'Father of the Bride Col...",0,"[{'id': 35, 'name': 'Comedy'}]",,11862,tt0113041,en,Father of the Bride Part II,Just when George Banks has recovered from his ...,...,1995-02-10,76578911.0,106.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Just When His World Is Back To Normal... He's ...,Father of the Bride Part II,False,5.7,173.0


In [3]:
dataset['genres'] = dataset['genres'].fillna('[]').apply(literal_eval).apply(
    lambda x: [i['name'] for i in x] if isinstance(x, list) else [])

I use the TMDB ratings to come up with our Top Movies Chart, ranking movies with IMDB's weighted rating formula. Mathematically, it is represented as follows:

$$WR = \frac{v}{v+m} \cdot R + \frac{m}{v+m} \cdot C$$
where,

v is the number of votes for the movie
m is the minimum votes required to be listed in the chart
R is the average rating of the movie
C is the mean rating across all movies in the dataset
The next step is to determine an appropriate value for m. I will use the 95th percentile of the vote counts as our cutoff. In other words, for a movie to feature in the chart, it must have at least as many votes as 95% of the movies in the list.

I will build our overall Top 250 Chart and will define a function to build charts for a particular genre. Let's begin!
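To make the percentile cutoff concrete, here is a minimal pure-Python sketch of how such a threshold can be computed with linear interpolation between neighbouring values, which is the default behaviour of pandas' `quantile`. The helper name `percentile_cutoff` and the vote counts are made up for illustration only.

```python
def percentile_cutoff(values, q):
    """Return the q-th quantile (0 <= q <= 1) using linear interpolation."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q          # fractional position in the sorted list
    lower = int(pos)
    upper = min(lower + 1, len(ordered) - 1)
    frac = pos - lower
    return ordered[lower] + frac * (ordered[upper] - ordered[lower])


# Toy vote counts (hypothetical, not taken from the dataset)
toy_votes = [0, 10, 20, 30, 40]

cutoff = percentile_cutoff(toy_votes, 0.95)
qualifying = [v for v in toy_votes if v >= cutoff]

print(cutoff, qualifying)
```

Only movies whose vote count reaches the cutoff stay in the chart, so a higher percentile gives a shorter but more reliable list.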

In [4]:
vote_counts = dataset[dataset['vote_count'].notnull()]['vote_count'].astype('int')
vote_averages = dataset[dataset['vote_average'].notnull()]['vote_average'].astype('int')
C = vote_averages.mean()
C

5.244896612406511

In [5]:
m = vote_counts.quantile(0.95)
m

434.0

In [6]:
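# Extract the release year as a string. Note that `x != np.nan` is always True
# (NaN never compares equal to anything), so unparseable dates end up as the string 'NaT'.
dataset['year'] = pd.to_datetime(dataset['release_date'], errors='coerce').apply(
    lambda x: str(x).split('-')[0] if x != np.nan else np.nan)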

In [7]:
qualified = dataset[(dataset['vote_count'] >= m) & (dataset['vote_count'].notnull()) & (
    dataset['vote_average'].notnull())][['title', 'year', 'vote_count', 'vote_average', 'popularity', 'genres']]
qualified['vote_count'] = qualified['vote_count'].astype('int')
qualified['vote_average'] = qualified['vote_average'].astype('int')
qualified.shape

(2274, 6)

Therefore, to be considered for the chart, a movie must have at least 434 votes on TMDB. 
We also see that the average rating for a movie on TMDB is 5.244 on a scale of 10 (ratings are truncated to 
integers before averaging). 2274 movies qualify to be on our chart.
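As a quick worked example of the formula, the sketch below plugs in the m and C values computed above. It is a standalone variant that takes m and C as explicit arguments (an assumption made here for clarity; the notebook's own function reads them from the surrounding scope). The vote count and integer rating used for Inception are the ones shown in this notebook's chart.

```python
def weighted_rating_explicit(v, R, m, C):
    """IMDB-style weighted rating: pulls R towards the global mean C when v is small."""
    return v / (v + m) * R + m / (v + m) * C


m = 434.0                  # 95th percentile of vote counts (cell In [5])
C = 5.244896612406511      # mean of the integer-truncated ratings (cell In [4])

# A heavily voted movie stays close to its own rating R
inception = weighted_rating_explicit(v=14075, R=8, m=m, C=C)

# A movie with exactly m votes lands halfway between R and C
borderline = weighted_rating_explicit(v=434, R=8, m=m, C=C)

print(round(inception, 6), round(borderline, 6))
```

The second call shows the role of m: it is the number of votes at which a movie's own rating and the global mean carry equal weight.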

In [8]:
def weighted_rating(x):
    v = x['vote_count']
    R = x['vote_average']
    return (v/(v+m) * R) + (m/(m+v) * C)

In [9]:
qualified['wr'] = qualified.apply(weighted_rating, axis=1)
qualified = qualified.sort_values('wr', ascending=False).head(250)

Top Movies

In [10]:
qualified.head(15)

Unnamed: 0,title,year,vote_count,vote_average,popularity,genres,wr
15480,Inception,2010,14075,8,29.1081,"[Action, Thriller, Science Fiction, Mystery, A...",7.917588
12481,The Dark Knight,2008,12269,8,123.167,"[Drama, Action, Crime, Thriller]",7.905871
22879,Interstellar,2014,11187,8,32.2135,"[Adventure, Drama, Science Fiction]",7.897107
2843,Fight Club,1999,9678,8,63.8696,[Drama],7.881753
4863,The Lord of the Rings: The Fellowship of the Ring,2001,8892,8,32.0707,"[Adventure, Fantasy, Action]",7.871787
292,Pulp Fiction,1994,8670,8,140.95,"[Thriller, Crime]",7.86866
314,The Shawshank Redemption,1994,8358,8,51.6454,"[Drama, Crime]",7.864
7000,The Lord of the Rings: The Return of the King,2003,8226,8,29.3244,"[Adventure, Fantasy, Action]",7.861927
351,Forrest Gump,1994,8147,8,48.3072,"[Comedy, Drama, Romance]",7.860656
5814,The Lord of the Rings: The Two Towers,2002,7641,8,29.4235,"[Adventure, Fantasy, Action]",7.851924


We see that the top three movies on the chart, Inception, The Dark Knight and Interstellar, are all 
Christopher Nolan films. Moreover, the chart indicates a strong bias of TMDB users towards particular 
genres and directors.

Let us now construct a function that builds charts for particular genres. For this, I will relax our 
default cutoff to the 85th percentile instead of the 95th.

In [11]:
s = dataset.apply(lambda x: pd.Series(x['genres']),axis=1).stack().reset_index(level=1, drop=True)
s.name = 'genre'
gen_dataset = dataset.drop('genres', axis=1).join(s)

In [12]:
def build_chart(genre, percentile=0.85):
    df = gen_dataset[gen_dataset['genre'] == genre]
    vote_counts = df[df['vote_count'].notnull()]['vote_count'].astype('int')
    vote_averages = df[df['vote_average'].notnull()]['vote_average'].astype('int')
    C = vote_averages.mean()
    m = vote_counts.quantile(percentile)
    
    qualified = df[(df['vote_count'] >= m) & (df['vote_count'].notnull()) & (
        df['vote_average'].notnull())][['title', 'year', 'vote_count', 'vote_average', 'popularity']]
    qualified['vote_count'] = qualified['vote_count'].astype('int')
    qualified['vote_average'] = qualified['vote_average'].astype('int')
    
    qualified['wr'] = qualified.apply(lambda x: (x['vote_count']/(x['vote_count']+m) * x['vote_average']) + (
        m/(m+x['vote_count']) * C), axis=1)
    qualified = qualified.sort_values('wr', ascending=False).head(250)
    
    return qualified

Let us see our method in action by displaying the Top 15 Romance Movies 
(Romance almost didn't feature at all in our Generic Top Chart despite being one of the most popular movie genres).

Top Romance Movies

In [13]:
build_chart('Romance').head(15)

Unnamed: 0,title,year,vote_count,vote_average,popularity,wr
10309,Dilwale Dulhania Le Jayenge,1995,661,9,34.457,8.565285
351,Forrest Gump,1994,8147,8,48.3072,7.971357
876,Vertigo,1958,1162,8,18.2082,7.811667
40251,Your Name.,2016,1030,8,34.461252,7.789489
883,Some Like It Hot,1959,835,8,11.8451,7.745154
1132,Cinema Paradiso,1988,834,8,14.177,7.744878
19901,Paperman,2012,734,8,7.19863,7.713951
37863,Sing Street,2016,669,8,10.672862,7.689483
882,The Apartment,1960,498,8,11.9943,7.599317
38718,The Handmaiden,2016,453,8,16.727405,7.566166


The top romance movie according to our metrics is Dilwale Dulhania Le Jayenge.

Content Based Recommender

The chart we built earlier has some severe limitations. For one, it barely features romantic movies. If a person who loves romantic movies (and hates action) were to look at our Top 15 Chart, s/he would probably not like most of the movies. Even if s/he were to go one step further and look at our charts by genre, s/he still would not be getting the best recommendations.

For instance, consider a person who loves Dilwale Dulhania Le Jayenge.
We might infer that this person would enjoy other movies like it. Yet even if 
s/he were to access the romance chart, s/he wouldn't find such movies as the top recommendations.

To personalise our recommendations more, I am going to build an engine that computes similarity between movies based 
on certain metrics and suggests movies that are most similar to a particular movie that a user liked. Since we will 
be using movie metadata (or content) to build this engine, this is also known as Content Based Filtering.

I will build two Content Based Recommenders based on:

Movie Overviews and Taglines
Movie Cast, Crew, Keywords and Genre
Also, as mentioned in the introduction, I will be using a subset of all the movies available to us due to the limited 
computing power available to me.

In [14]:
links_small = pd.read_csv('datasets/links_small.csv')
links_small = links_small[links_small['tmdbId'].notnull()]['tmdbId'].astype('int')

Before we can merge anything into the main dataframe, we need to make sure that its ID column is clean and of 
type integer. To do this, let us try to perform an integer conversion of our IDs 
and, if an exception is raised, replace the ID with NaN. We will then proceed to drop these rows from our 
dataframe.

In [15]:
def convert_int(x):
    try:
        return int(x)
    except (ValueError, TypeError):
        return np.nan

In [16]:
dataset['id'] = dataset['id'].apply(convert_int)

In [17]:
dataset[dataset['id'].isnull()]

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count,year
19730,- Written by Ørnås,0.065736,/ff9qCepilowshEtG2GYWwzt2bs4.jpg,"[Carousel Productions, Vision View Entertainme...","[{'iso_3166_1': 'CA', 'name': 'Canada'}, {'iso...",,0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,...,,,,,,,,,,NaT
29503,Rune Balot goes to a casino connected to the ...,1.931659,/zV8bHuSL6WXoD6FWogP9j4x80bL.jpg,"[Aniplex, GoHands, BROSTA TV, Mardock Scramble...","[{'iso_3166_1': 'US', 'name': 'United States o...",,0,68.0,"[{'iso_639_1': 'ja', 'name': '日本語'}]",Released,...,,,,,,,,,,NaT
35587,Avalanche Sharks tells the story of a bikini ...,2.185485,/zaSf5OG7V8X8gqFvly88zDdRm46.jpg,"[Odyssey Media, Pulser Productions, Rogue Stat...","[{'iso_3166_1': 'CA', 'name': 'Canada'}]",,0,82.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,...,,,,,,,,,,NaT


In [18]:

links_small_data = dataset.drop([19730, 29503, 35587])

In [19]:
links_small_data['id'] = links_small_data['id'].astype('int')


In [20]:
s_links_small_data = links_small_data[links_small_data['id'].isin(links_small)]
s_links_small_data.shape

(9099, 25)

We have 9099 movies available in our small movies metadata dataset, which is about 5 times smaller than our original 
dataset of 45000 movies.

Movie Description Based Recommender

Let us first try to build a recommender using movie descriptions and taglines. We do not have a quantitative metric 
to judge our machine's performance so this will have to be done qualitatively.

In [21]:
s_links_small_data['tagline'] = s_links_small_data['tagline'].fillna('')
s_links_small_data['description'] = s_links_small_data['overview'] + s_links_small_data['tagline']
s_links_small_data['description'] = s_links_small_data['description'].fillna('')

In [22]:
# Unigrams and bigrams with English stop words removed; min_df=1 keeps every term
tf = TfidfVectorizer(analyzer='word', ngram_range=(1, 2), min_df=1, stop_words='english')
tfidf_matrix = tf.fit_transform(s_links_small_data['description'])

tfidf_matrix.shape

(9099, 268124)

Cosine Similarity

I will be using the Cosine Similarity to calculate a numeric quantity that denotes the similarity between two 
movies. Mathematically, it is defined as follows:

$$\text{cosine}(x, y) = \frac{x \cdot y^\intercal}{\|x\| \cdot \|y\|}$$

Since the TF-IDF Vectorizer L2-normalizes every row by default, each vector has unit length and calculating the 
Dot Product will directly give us the Cosine Similarity Score. Therefore, we will use sklearn's linear_kernel 
instead of cosine_similarity since it is much faster.
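The claim that a plain dot product equals the cosine similarity for unit-length vectors can be checked with a few lines of pure Python. This is only a sketch on two made-up vectors; the helper names are hypothetical and it does not use scikit-learn.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    """Cosine similarity computed from the definition."""
    denom = math.sqrt(dot(a, a)) * math.sqrt(dot(b, b))
    return dot(a, b) / denom if denom else 0.0


def l2_normalize(vec):
    """Scale a vector to unit length (what TF-IDF does to each row by default)."""
    norm = math.sqrt(dot(vec, vec))
    return [x / norm for x in vec] if norm else list(vec)


# Two toy term-weight vectors (made-up values)
x = [1.0, 2.0, 0.0]
y = [2.0, 1.0, 1.0]

cos_xy = cosine(x, y)
dot_of_units = dot(l2_normalize(x), l2_normalize(y))

print(cos_xy, dot_of_units)  # the two numbers agree up to rounding
```

Skipping the normalization step for every pair is what makes the dot-product route cheaper on an already normalized matrix.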

In [23]:
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)

cosine_sim[0]

array([1.        , 0.00680476, 0.        , ..., 0.        , 0.00344913,
       0.        ])

We now have a pairwise cosine similarity matrix for all the movies in our dataset. The next step is to write a 
function that returns the 30 most similar movies based on the cosine similarity score.

In [24]:
s_links_small_data = s_links_small_data.reset_index()
titles = s_links_small_data['title']
indices = pd.Series(s_links_small_data.index, index=s_links_small_data['title'])

In [25]:
def get_recommendations(title):
    idx = indices[title]
    sim_scores = list(enumerate(cosine_sim[idx]))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    sim_scores = sim_scores[1:31]
    movie_indices = [i[0] for i in sim_scores]
    return titles.iloc[movie_indices]

Let us now try and get the top recommendations for a few movies and see how good the recommendations are.

In [26]:
get_recommendations('The Dark Knight').head(10)

7931                      The Dark Knight Rises
132                              Batman Forever
1113                             Batman Returns
8227    Batman: The Dark Knight Returns, Part 2
7565                 Batman: Under the Red Hood
524                                      Batman
7901                           Batman: Year One
2579               Batman: Mask of the Phantasm
2696                                        JFK
8165    Batman: The Dark Knight Returns, Part 1
Name: title, dtype: object

In [27]:
get_recommendations('The Godfather').head(10)

973      The Godfather: Part II
8387                 The Family
3509                       Made
4196         Johnny Dangerously
29               Shanghai Triad
5667                       Fury
2412             American Movie
1582    The Godfather: Part III
4221                    8 Women
2159              Summer of Sam
Name: title, dtype: object

We see that for The Dark Knight, our system is able to identify it as a Batman film and subsequently recommend 
other Batman films as its top recommendations. But unfortunately, that is all this system can do at the moment. 
This is not of much use to most people as it doesn't take into consideration very important features such as cast, 
crew, director and genre, which determine the rating and the popularity of a movie. Someone who liked 
The Dark Knight probably likes it more because of Nolan and would hate Batman Forever and every other substandard 
movie in the Batman Franchise.

Therefore, we are going to use much more suggestive metadata than Overview and Tagline. In the next subsection, 
we will build a more sophisticated recommender that takes genre, keywords, cast and crew into consideration.

Metadata Based Recommender

To build our standard metadata based content recommender, we will need to merge our current dataset with the credits 
(cast and crew) and the keywords datasets. Let us prepare this data as our first step.

In [28]:
credits = pd.read_csv('datasets/credits.csv')
keywords = pd.read_csv('datasets/keywords.csv')

In [29]:
keywords['id'] = keywords['id'].astype('int')
credits['id'] = credits['id'].astype('int')
links_small_data['id'] = links_small_data['id'].astype('int')

In [30]:
links_small_data.shape

(45463, 25)

In [31]:
links_small_data = links_small_data.merge(credits, on='id')
links_small_data = links_small_data.merge(keywords, on='id')

In [32]:
s_links_small_data = links_small_data[links_small_data['id'].isin(links_small)]

s_links_small_data.shape

(9219, 28)

We now have our cast, crew, genres and keywords, all in one dataframe. Let us wrangle this a little more using the 
following intuitions:

Crew: From the crew, we will only pick the director as our feature since the others don't contribute that much to 
    the feel of the movie.
    
Cast: Choosing Cast is a little more tricky. Lesser known actors and minor roles do not really affect people's 
    opinion of a movie. Therefore, we must only select the major characters and their respective actors. 
    Arbitrarily we will choose the top 3 actors that appear in the credits list.

In [33]:
s_links_small_data['cast'] = s_links_small_data['cast'].apply(literal_eval)
s_links_small_data['crew'] = s_links_small_data['crew'].apply(literal_eval)
s_links_small_data['keywords'] = s_links_small_data['keywords'].apply(literal_eval)
s_links_small_data['cast_size'] = s_links_small_data['cast'].apply(lambda x: len(x))
s_links_small_data['crew_size'] = s_links_small_data['crew'].apply(lambda x: len(x))

In [34]:
def get_director(x):
    for i in x:
        if i['job'] == 'Director':
            return i['name']
    return np.nan

In [35]:
s_links_small_data['director'] = s_links_small_data['crew'].apply(get_director)

In [36]:
s_links_small_data['cast'] = s_links_small_data['cast'].apply(lambda x: [i['name'] for i in x] 
                                                              if isinstance(x, list) else [])
s_links_small_data['cast'] = s_links_small_data['cast'].apply(lambda x: x[:3] if len(x) >=3 else x)

In [37]:
s_links_small_data['keywords'] = s_links_small_data['keywords'].apply(lambda x: [i['name'] for i in x] 
                                                                      if isinstance(x, list) else [])

My approach to building the recommender is going to be extremely hacky. What I plan on doing is creating a metadata
dump for every movie which consists of genres, director, main actors and keywords. I then use a Count Vectorizer to 
create our count matrix as we did in the Description Recommender. The remaining steps are similar to what we did 
earlier: we calculate the cosine similarities and return movies that are most similar.

These are the steps I will follow in the preparation of my genres and credits data:

Strip spaces from all our features and convert them to lowercase. Each full name then becomes a single token, so 
our engine will not confuse people such as Johnny Depp and Johnny Galecki just because they share a first name.

Mention Director 3 times to give it more weight relative to the entire cast.
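The two preparation steps above can be sketched on a single movie in plain Python. The function names `normalize` and `build_soup`, the `director_weight` parameter and the sample metadata are all hypothetical and only illustrate the idea; the notebook's own cells apply the same steps to dataframe columns.

```python
def normalize(name):
    """Remove spaces and lowercase so a full name becomes one token."""
    return name.replace(" ", "").lower()


def build_soup(keywords, cast, director, genres, director_weight=3):
    """Concatenate metadata into one space-separated string for a count vectorizer."""
    tokens = [normalize(k) for k in keywords]
    tokens += [normalize(actor) for actor in cast[:3]]     # top 3 billed actors only
    tokens += [normalize(director)] * director_weight      # repeat to up-weight the director
    tokens += list(genres)
    return " ".join(tokens)


# Made-up sample metadata for one movie
soup = build_soup(
    keywords=["dc comics"],
    cast=["Christian Bale", "Heath Ledger"],
    director="Christopher Nolan",
    genres=["Drama"],
)
print(soup)

# Without normalization these two actors would share the token "johnny"
print(normalize("Johnny Depp"), normalize("Johnny Galecki"))
```

Repeating a token is a crude but effective way of weighting: a count-based vectorizer simply sees the director three times as often as any single actor.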

In [38]:
s_links_small_data['cast'] = s_links_small_data['cast'].apply(lambda x: [str.lower(i.replace(" ", "")) 
                                                                         for i in x])

In [39]:
s_links_small_data['director'] = s_links_small_data['director'].astype('str').apply(
    lambda x: str.lower(x.replace(" ", "")))
s_links_small_data['director'] = s_links_small_data['director'].apply(lambda x: [x,x, x])


Keywords

I will do a small amount of pre-processing of our keywords before putting them to any use. As a first step, we 
calculate the frequency counts of every keyword that appears in the dataset.

In [40]:
s = s_links_small_data.apply(lambda x: pd.Series(x['keywords']),axis=1).stack().reset_index(level=1, drop=True)
s.name = 'keyword'

In [41]:
s = s.value_counts()
s[:5]

independent film        610
woman director          550
murder                  399
duringcreditsstinger    327
based on novel          318
Name: keyword, dtype: int64

Keywords occur in frequencies ranging from 1 to 610. We do not have any use for keywords that occur only once. 
Therefore, these can be safely removed. Finally, we will convert every word to its stem so that words such as Dogs 
and Dog are considered the same.
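The frequency filter described above can be illustrated with `collections.Counter` on a toy mapping of movies to keywords. The movie names and keyword lists are made up, and stemming is left out here because it depends on NLTK.

```python
from collections import Counter

# Made-up keyword lists for three hypothetical movies
movie_keywords = {
    "Movie A": ["murder", "independent film", "heist"],
    "Movie B": ["murder", "based on novel"],
    "Movie C": ["independent film", "time travel"],
}

# Count how many times each keyword occurs across the whole collection
counts = Counter(k for kws in movie_keywords.values() for k in kws)

# Keep only keywords that occur more than once
kept = {k for k, n in counts.items() if n > 1}

filtered = {
    title: [k for k in kws if k in kept]
    for title, kws in movie_keywords.items()
}
print(filtered)
```

A keyword that appears on a single movie cannot link that movie to any other, which is why dropping it loses nothing for similarity purposes.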


In [42]:
s = s[s > 1]

stemmer = SnowballStemmer('english')
stemmer.stem('dogs')

'dog'

In [43]:
def filter_keywords(x):
    words = []
    for i in x:
        if i in s:
            words.append(i)
    return words


In [44]:
s_links_small_data['keywords'] = s_links_small_data['keywords'].apply(filter_keywords)
s_links_small_data['keywords'] = s_links_small_data['keywords'].apply(lambda x: [stemmer.stem(i) for i in x])
s_links_small_data['keywords'] = s_links_small_data['keywords'].apply(
    lambda x: [str.lower(i.replace(" ", "")) for i in x])

s_links_small_data['soup'] = s_links_small_data['keywords'] + s_links_small_data['cast'] + s_links_small_data[
    'director'] + s_links_small_data['genres']
s_links_small_data['soup'] = s_links_small_data['soup'].apply(lambda x: ' '.join(x))

In [45]:
# Unigrams and bigrams with English stop words removed; min_df=1 keeps every term
count = CountVectorizer(analyzer='word', ngram_range=(1, 2), min_df=1, stop_words='english')
count_matrix = count.fit_transform(s_links_small_data['soup'])

In [46]:
cosine_sim = cosine_similarity(count_matrix, count_matrix)


In [47]:
s_links_small_data = s_links_small_data.reset_index()
titles = s_links_small_data['title']
indices = pd.Series(s_links_small_data.index, index=s_links_small_data['title'])

We will reuse the get_recommendations function that we had written earlier. Since our cosine similarity scores 
have changed, we expect it to give us different (and probably better) results. Let us check for The Dark Knight 
again and see what recommendations I get this time around.

In [48]:
get_recommendations('The Dark Knight').head(10)

8031         The Dark Knight Rises
6218                 Batman Begins
6623                  The Prestige
2085                     Following
7648                     Inception
4145                      Insomnia
3381                       Memento
8613                  Interstellar
7659    Batman: Under the Red Hood
1134                Batman Returns
Name: title, dtype: object

I am much more satisfied with the results I get this time around. The recommendations seem to have recognized other 
Christopher Nolan movies (due to the high weight given to the director) and put them as top recommendations. 

We can of course experiment on this engine by trying out different weights for our features 
(directors, actors, genres), limiting the number of keywords that can be used in the soup, weighing genres based 
on their frequency, only showing movies with the same languages, etc.

Let me also get recommendations for another movie, Friday the 13th.

In [49]:
get_recommendations('Friday the 13th').head(10)

3368                       DeepStar Six
5573                    To End All Wars
6855     The Seeker: The Dark Is Rising
1567      Hello Mary Lou: Prom Night II
1325    I Know What You Did Last Summer
1555           Friday the 13th Part III
4326                        The Burning
2439                        Creepshow 2
4867                          Phenomena
1566                         Prom Night
Name: title, dtype: object

Popularity and Ratings

One thing that we can notice about our recommendation system is that it recommends movies regardless of ratings and 
popularity.

Therefore, we will add a mechanism to remove bad movies and return movies which are popular and have had a good 
critical response.

I will take the top 25 movies based on similarity scores and compute the vote count at the 60th percentile of 
that set. Using this as the value of m, we keep only the movies with at least that many votes and then calculate 
the weighted rating of each one using IMDB's formula, as we did in the Simple Recommender section.


In [50]:
def improved_recommendations(title):
    idx = indices[title]
    sim_scores = list(enumerate(cosine_sim[idx]))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    sim_scores = sim_scores[1:26]
    movie_indices = [i[0] for i in sim_scores]
    
    movies = s_links_small_data.iloc[movie_indices][['title', 'vote_count', 'vote_average', 'year']]
    vote_counts = movies[movies['vote_count'].notnull()]['vote_count'].astype('int')
    vote_averages = movies[movies['vote_average'].notnull()]['vote_average'].astype('int')
    C = vote_averages.mean()
    m = vote_counts.quantile(0.60)
    qualified = movies[(movies['vote_count'] >= m) & (movies['vote_count'].notnull()) & (
        movies['vote_average'].notnull())]
    qualified['vote_count'] = qualified['vote_count'].astype('int')
    qualified['vote_average'] = qualified['vote_average'].astype('int')
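    # weighted_rating reads the module-level m and C defined earlier, so the local C is unused
    # and the local m only drives the vote-count filter above
    qualified['wr'] = qualified.apply(weighted_rating, axis=1)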
    qualified = qualified.sort_values('wr', ascending=False).head(10)
    return qualified
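Note that `weighted_rating`, as defined in the Simple Recommender section, looks up `m` and `C` in module scope, so the values computed inside `improved_recommendations` only affect which movies pass the filter. If the intent is for the candidate-level m and C to enter the formula as well, one option is to pass them explicitly. The following is a minimal sketch on plain dictionaries rather than a dataframe; `quantile` and `rerank` are hypothetical helpers, and the five rows are a toy subset of the table shown in this section.

```python
def quantile(values, q):
    """q-th quantile with linear interpolation between neighbours."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q
    lower = int(pos)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (pos - lower) * (ordered[upper] - ordered[lower])


def rerank(candidates, q=0.60, top_n=10):
    """Filter by a candidate-level vote cutoff, then sort by weighted rating."""
    m = quantile([c["vote_count"] for c in candidates], q)
    C = sum(c["vote_average"] for c in candidates) / len(candidates)
    scored = []
    for c in candidates:
        v, R = c["vote_count"], c["vote_average"]
        if v >= m and v + m > 0:
            # m and C come from the candidate set itself, not from module scope
            scored.append(dict(c, wr=v / (v + m) * R + m / (v + m) * C))
    return sorted(scored, key=lambda c: c["wr"], reverse=True)[:top_n]


candidates = [
    {"title": "Inception", "vote_count": 14075, "vote_average": 8},
    {"title": "The Prestige", "vote_count": 4510, "vote_average": 8},
    {"title": "Batman Returns", "vote_count": 1706, "vote_average": 6},
    {"title": "Batman Forever", "vote_count": 1529, "vote_average": 5},
    {"title": "Batman & Robin", "vote_count": 1447, "vote_average": 4},
]

result = rerank(candidates)
print([(c["title"], round(c["wr"], 3)) for c in result])
```

With a candidate-level C, scores are pulled towards the average of the similar movies rather than the average of the whole catalogue, so the resulting numbers would differ from the ones printed in this notebook.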

In [51]:
improved_recommendations('The Dark Knight')

Unnamed: 0,title,vote_count,vote_average,year,wr
7648,Inception,14075,8,2010,7.917588
8613,Interstellar,11187,8,2014,7.897107
6623,The Prestige,4510,8,2006,7.758148
3381,Memento,4168,8,2000,7.740175
8031,The Dark Knight Rises,9263,7,2012,6.921448
6218,Batman Begins,7511,7,2005,6.904127
1134,Batman Returns,1706,6,1992,5.846862
132,Batman Forever,1529,5,1995,5.054144
9024,Batman v Superman: Dawn of Justice,7189,5,2016,5.013943
1260,Batman & Robin,1447,4,1997,4.287233


Let me also get the recommendations for Friday the 13th, my favorite movie.

In [52]:
improved_recommendations('Friday the 13th')

Unnamed: 0,title,vote_count,vote_average,year,wr
3845,The Devil's Backbone,277,7,2001,5.928671
7145,Camp Rock,432,6,2008,5.621576
7222,Eden Lake,415,6,2008,5.613999
1554,Friday the 13th Part 2,321,6,1981,5.565941
1559,Friday the 13th Part VII: The New Blood,196,5,1988,5.168707
1555,Friday the 13th Part III,257,5,1982,5.153814
1868,I Still Know What You Did Last Summer,381,5,1998,5.130411
4775,Freddy vs. Jason,608,5,2003,5.102001
1325,I Know What You Did Last Summer,698,5,1997,5.093891
8511,Carrie,1505,5,2013,5.054814


Collaborative Filtering

Our content based engine suffers from some severe limitations. It is only capable of suggesting movies which are close to a certain movie. That is, it is not capable of capturing tastes and providing recommendations across genres.

Also, the engine that we built is not really personal in that it doesn't capture the personal tastes and biases of a user. Anyone querying our engine for recommendations based on a movie will receive the same recommendations for that movie, regardless of who s/he is.

Therefore, in this section, we will use a technique called Collaborative Filtering to make recommendations to Movie Watchers. Collaborative Filtering is based on the idea that users similar to me can be used to predict how much I will like a particular product or service those users have used/experienced but I have not.

I will not be implementing Collaborative Filtering from scratch. Instead, I will use the Surprise library, which provides algorithms such as Singular Value Decomposition (SVD) that are trained to minimise the prediction error, measured here by RMSE (Root Mean Square Error).
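For intuition only, the "users similar to me" idea can be sketched as a tiny user-based neighbourhood predictor in pure Python. This is not the SVD model that Surprise fits; the users, items and ratings are made up, and `similarity` and `predict` are hypothetical helpers.

```python
import math

# Made-up ratings: user -> {item: rating}
toy_ratings = {
    "ann": {"A": 5.0, "B": 4.0, "C": 1.0},
    "bob": {"A": 4.0, "B": 5.0, "C": 2.0, "D": 5.0},
    "cat": {"A": 1.0, "B": 2.0, "C": 5.0, "D": 1.0},
}


def similarity(u, v):
    """Cosine similarity over the items both users have rated."""
    common = set(toy_ratings[u]) & set(toy_ratings[v])
    if not common:
        return 0.0
    num = sum(toy_ratings[u][i] * toy_ratings[v][i] for i in common)
    den_u = math.sqrt(sum(toy_ratings[u][i] ** 2 for i in common))
    den_v = math.sqrt(sum(toy_ratings[v][i] ** 2 for i in common))
    return num / (den_u * den_v)


def predict(user, item):
    """Similarity-weighted average of other users' ratings for the item."""
    num = den = 0.0
    for other, their in toy_ratings.items():
        if other == user or item not in their:
            continue
        w = similarity(user, other)
        num += w * their[item]
        den += w
    return num / den if den else None


# ann has not rated D; bob (similar taste) loved it, cat (different taste) did not
estimate = predict("ann", "D")
print(round(estimate, 3))
```

The estimate lands closer to bob's rating than to cat's because ann's past ratings resemble bob's more.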


In [53]:
reader = Reader()

In [54]:
ratings = pd.read_csv('datasets/ratings_small.csv')
ratings.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,31,2.5,1260759144
1,1,1029,3.0,1260759179
2,1,1061,3.0,1260759182
3,1,1129,2.0,1260759185
4,1,1172,4.0,1260759205


Model Evaluation and Optimization

In [55]:

data = Dataset.load_from_df(ratings[['userId', 'movieId', 'rating']], reader)
cross_validate(NormalPredictor(), data, cv=5)

{'test_rmse': array([1.45044135, 1.43638771, 1.4288821 , 1.44119028, 1.43905455]),
 'test_mae': array([1.15775451, 1.14720704, 1.1406164 , 1.15271804, 1.14879277]),
 'fit_time': (0.15515685081481934,
  0.280595064163208,
  0.862389087677002,
  0.6193978786468506,
  0.45327091217041016),
 'test_time': (0.23175597190856934,
  0.24556398391723633,
  0.25471019744873047,
  0.3945457935333252,
  0.4882500171661377)}

In [56]:
# svd = SVD()
# cross_validate(svd, data, measures=['RMSE', 'MAE'])

# define a cross-validation iterator
kf = KFold(n_splits=5)

svd = SVD()

for trainset, testset in kf.split(data):

    # train and test algorithm.
    svd.fit(trainset)
    predictions = svd.test(testset)

    # Compute and print Root Mean Squared Error
    accuracy.rmse(predictions, verbose=True)

RMSE: 0.8986
RMSE: 0.8917
RMSE: 0.8993
RMSE: 0.8977
RMSE: 0.8953


We get a mean Root Mean Square Error of about 0.8965 across the five folds, well below the roughly 1.44 of the random NormalPredictor baseline above, which is good enough for our case. Let us now train on the full dataset and arrive at predictions.
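For reference, RMSE and the fold average can be reproduced by hand. The sketch below averages the five fold scores printed above and also evaluates the RMSE definition on three made-up predictions; `rmse` is a hypothetical helper, not part of Surprise.

```python
import math


def rmse(predicted, actual):
    """Root of the mean squared difference between predictions and true ratings."""
    squared = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared) / len(squared))


# Fold scores copied from the SVD cross-validation output above
fold_rmse = [0.8986, 0.8917, 0.8993, 0.8977, 0.8953]
mean_rmse = sum(fold_rmse) / len(fold_rmse)

# Made-up predictions vs. true ratings, just to exercise the definition
toy_rmse = rmse([3.5, 4.0, 2.0], [4.0, 4.0, 1.0])

print(round(mean_rmse, 4), round(toy_rmse, 4))
```

Because errors are squared before averaging, a single prediction that is off by a full star weighs far more than several that are off by a quarter star.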

In [57]:
trainset = data.build_full_trainset()
svd.fit(trainset)

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x7fb246ca9fd0>

In [58]:
ratings[ratings['userId'] == 1]

Unnamed: 0,userId,movieId,rating,timestamp
0,1,31,2.5,1260759144
1,1,1029,3.0,1260759179
2,1,1061,3.0,1260759182
3,1,1129,2.0,1260759185
4,1,1172,4.0,1260759205
5,1,1263,2.0,1260759151
6,1,1287,2.0,1260759187
7,1,1293,2.0,1260759148
8,1,1339,3.5,1260759125
9,1,1343,2.0,1260759131


In [59]:
svd.predict(1, 302, 3)

Prediction(uid=1, iid=302, r_ui=3, est=2.6001314837572442, details={'was_impossible': False})

Advanced Recommender

In this section, I will try to build a simple Advanced recommender that brings together techniques implemented in the content based and collaborative filtering based engines. This is how it will work:

Input: User ID and the Title of a Movie
Output: Similar movies sorted on the basis of expected ratings by that particular user.
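The input/output contract above boils down to two stages: collect content-based candidates, then re-order them by a per-user rating estimate. A minimal sketch with a stubbed predictor follows; `hybrid_rank`, `toy_predict` and the estimate values are hypothetical stand-ins for the similarity lookup and for a fitted model's prediction.

```python
def hybrid_rank(similar_items, predict_rating, user_id, top_n=10):
    """Re-rank content-based candidates by the user's predicted rating."""
    scored = [(item, predict_rating(user_id, item)) for item in similar_items]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_n]


# Made-up estimates standing in for a trained collaborative filtering model
fake_estimates = {(5, "A"): 3.1, (5, "B"): 4.2, (5, "C"): 3.8}


def toy_predict(user_id, item):
    # Fall back to a neutral guess for unseen (user, item) pairs
    return fake_estimates.get((user_id, item), 3.0)


# Candidates as they might come out of the similarity stage, most similar first
ranking = hybrid_rank(["A", "B", "C", "D"], toy_predict, user_id=5, top_n=3)
print(ranking)
```

Keeping the predictor as a plain callable makes it easy to swap models without touching the ranking logic, and it means two users asking about the same title can receive differently ordered lists.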

In [60]:
def convert_int(x):
    try:
        return int(x)
    except (ValueError, TypeError):
        return np.nan

In [61]:
data_id_mapping = pd.read_csv('datasets/links_small.csv')[['movieId', 'tmdbId']]
data_id_mapping['tmdbId'] = data_id_mapping['tmdbId'].apply(convert_int)
data_id_mapping.columns = ['movieId', 'id']
data_id_mapping = data_id_mapping.merge(s_links_small_data[['title', 'id']], on='id').set_index('title')


In [62]:
data_indices_mapping = data_id_mapping.set_index('id')

In [63]:
def advanced_recommender(userId, title):
    idx = indices[title]
    tmdbId = data_id_mapping.loc[title]['id']
    #print(idx)
    movie_id = data_id_mapping.loc[title]['movieId']
    
    sim_scores = list(enumerate(cosine_sim[int(idx)]))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    sim_scores = sim_scores[1:26]
    movie_indices = [i[0] for i in sim_scores]
    
    movies = s_links_small_data.iloc[movie_indices][['title', 'vote_count', 'vote_average', 'year', 'id']]
    movies['est'] = movies['id'].apply(lambda x: svd.predict(userId, data_indices_mapping.loc[x]['movieId']).est)
    movies = movies.sort_values('est', ascending=False)
    return movies.head(10)

In [64]:
advanced_recommender(5, 'Friday the 13th')

Unnamed: 0,title,vote_count,vote_average,year,id,est
3357,The Abominable Dr. Phibes,79.0,6.9,1971,17965,4.108996
4420,Alone in the Dark,26.0,5.5,1982,40952,3.908112
4326,The Burning,90.0,6.2,1981,24124,3.844549
3845,The Devil's Backbone,277.0,7.2,2001,1433,3.842347
3777,13 Ghosts,31.0,5.6,1960,29756,3.830841
2129,The Mummy's Tomb,6.0,8.3,1942,29239,3.795079
1559,Friday the 13th Part VII: The New Blood,196.0,5.3,1988,10281,3.783589
2696,Two Thousand Maniacs!,30.0,6.2,1964,28177,3.781924
7222,Eden Lake,415.0,6.7,2008,13510,3.773525
5573,To End All Wars,42.0,6.7,2001,1783,3.769191


In [65]:
advanced_recommender(300, 'Friday the 13th')

Unnamed: 0,title,vote_count,vote_average,year,id,est
7222,Eden Lake,415.0,6.7,2008,13510,4.158224
3357,The Abominable Dr. Phibes,79.0,6.9,1971,17965,4.114824
2129,The Mummy's Tomb,6.0,8.3,1942,29239,3.988308
4867,Phenomena,146.0,6.7,1985,29161,3.983526
3845,The Devil's Backbone,277.0,7.2,2001,1433,3.969813
5573,To End All Wars,42.0,6.7,2001,1783,3.953151
6855,The Seeker: The Dark Is Rising,111.0,4.8,2007,2274,3.903525
2439,Creepshow 2,133.0,5.9,1987,16288,3.903442
8511,Carrie,1505.0,5.8,2013,133805,3.901217
7145,Camp Rock,432.0,6.0,2008,13655,3.881036


In [66]:
advanced_recommender(201, 'Thor')

Unnamed: 0,title,vote_count,vote_average,year,id,est
1163,Hamlet,118.0,7.3,1996,10549,4.543065
7009,Iron Man,8951.0,7.4,2008,1726,4.542843
444,Much Ado About Nothing,194.0,7.2,1993,11971,4.483492
997,Henry V,73.0,7.4,1989,10705,4.383318
7969,The Avengers,12000.0,7.4,2012,24428,4.293839
8626,Captain America: The Winter Soldier,5881.0,7.6,2014,100402,4.290788
2463,Dead Again,81.0,6.7,1991,11498,4.268429
8869,Ant-Man,6029.0,7.0,2015,102899,4.148832
7923,Captain America: The First Avenger,7174.0,6.6,2011,1771,4.128314
8872,Captain America: Civil War,7462.0,7.1,2016,271110,4.114705


In [67]:
!pip install streamlit