# MOVIES Recommender Systems

In [1]:
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from ast import literal_eval
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.metrics.pairwise import linear_kernel, cosine_similarity
from nltk.stem.snowball import SnowballStemmer
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.corpus import wordnet


import warnings; warnings.simplefilter('ignore')



## Simple Recommender

The Simple Recommender offers generalized recommnendations to every user based on movie popularity and (sometimes) genre. The basic idea behind this recommender is that movies that are more popular and more critically acclaimed will have a higher probability of being liked by the average audience. This model does not give personalized recommendations based on the user. 

The implementation of this model is extremely trivial. All we have to do is sort our movies based on ratings and popularity and display the top movies of our list. As an added step, we can pass in a genre argument to get the top movies of a particular genre. 

In [2]:
md = pd. read_csv('movies_metadata.csv')
pd.options.display.max_columns = None
md.head()

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,popularity,poster_path,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",21.946943,/rhIRbceoE9lR4veEXuwCC2wARtG.jpg,"[{'name': 'Pixar Animation Studios', 'id': 3}]","[{'iso_3166_1': 'US', 'name': 'United States o...",1995-10-30,373554033.0,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415.0
1,False,,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,17.015539,/vzmL6fP7aPKNKPRTFnZmiUfciyV.jpg,"[{'name': 'TriStar Pictures', 'id': 559}, {'na...","[{'iso_3166_1': 'US', 'name': 'United States o...",1995-12-15,262797249.0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0
2,False,"{'id': 119050, 'name': 'Grumpy Old Men Collect...",0,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",,15602,tt0113228,en,Grumpier Old Men,A family wedding reignites the ancient feud be...,11.7129,/6ksm1sjKMFLbO7UY2i6G1ju9SML.jpg,"[{'name': 'Warner Bros.', 'id': 6194}, {'name'...","[{'iso_3166_1': 'US', 'name': 'United States o...",1995-12-22,0.0,101.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,False,6.5,92.0
3,False,,16000000,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...",,31357,tt0114885,en,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom...",3.859495,/16XOMpEaLWkrcPqSQqhTmeJuqQl.jpg,[{'name': 'Twentieth Century Fox Film Corporat...,"[{'iso_3166_1': 'US', 'name': 'United States o...",1995-12-22,81452156.0,127.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Friends are the people who let you be yourself...,Waiting to Exhale,False,6.1,34.0
4,False,"{'id': 96871, 'name': 'Father of the Bride Col...",0,"[{'id': 35, 'name': 'Comedy'}]",,11862,tt0113041,en,Father of the Bride Part II,Just when George Banks has recovered from his ...,8.387519,/e64sOI48hQXyru7naBFyssKFxVd.jpg,"[{'name': 'Sandollar Productions', 'id': 5842}...","[{'iso_3166_1': 'US', 'name': 'United States o...",1995-02-10,76578911.0,106.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Just When His World Is Back To Normal... He's ...,Father of the Bride Part II,False,5.7,173.0


In [3]:
md['genres'] = md['genres'].fillna('[]').apply(literal_eval).apply(lambda x: [i['name'] for i in x] if isinstance(x, list) else [])


In [4]:
md.head()

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,popularity,poster_path,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[Animation, Comedy, Family]",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",21.946943,/rhIRbceoE9lR4veEXuwCC2wARtG.jpg,"[{'name': 'Pixar Animation Studios', 'id': 3}]","[{'iso_3166_1': 'US', 'name': 'United States o...",1995-10-30,373554033.0,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415.0
1,False,,65000000,"[Adventure, Fantasy, Family]",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,17.015539,/vzmL6fP7aPKNKPRTFnZmiUfciyV.jpg,"[{'name': 'TriStar Pictures', 'id': 559}, {'na...","[{'iso_3166_1': 'US', 'name': 'United States o...",1995-12-15,262797249.0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0
2,False,"{'id': 119050, 'name': 'Grumpy Old Men Collect...",0,"[Romance, Comedy]",,15602,tt0113228,en,Grumpier Old Men,A family wedding reignites the ancient feud be...,11.7129,/6ksm1sjKMFLbO7UY2i6G1ju9SML.jpg,"[{'name': 'Warner Bros.', 'id': 6194}, {'name'...","[{'iso_3166_1': 'US', 'name': 'United States o...",1995-12-22,0.0,101.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,False,6.5,92.0
3,False,,16000000,"[Comedy, Drama, Romance]",,31357,tt0114885,en,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom...",3.859495,/16XOMpEaLWkrcPqSQqhTmeJuqQl.jpg,[{'name': 'Twentieth Century Fox Film Corporat...,"[{'iso_3166_1': 'US', 'name': 'United States o...",1995-12-22,81452156.0,127.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Friends are the people who let you be yourself...,Waiting to Exhale,False,6.1,34.0
4,False,"{'id': 96871, 'name': 'Father of the Bride Col...",0,[Comedy],,11862,tt0113041,en,Father of the Bride Part II,Just when George Banks has recovered from his ...,8.387519,/e64sOI48hQXyru7naBFyssKFxVd.jpg,"[{'name': 'Sandollar Productions', 'id': 5842}...","[{'iso_3166_1': 'US', 'name': 'United States o...",1995-02-10,76578911.0,106.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Just When His World Is Back To Normal... He's ...,Father of the Bride Part II,False,5.7,173.0


We will use IMDB's *weighted rating* formula to come up with our **Top Movies Chart.**. Mathematically, it is represented as follows:

Weighted Rating (WR) = $(\frac{v}{v + m} . R) + (\frac{m}{v + m} . C)$

where,
* *v* is the number of votes for the movie
* *m* is the minimum votes required to be listed in the chart
* *R* is the average rating of the movie
* *C* is the mean vote across the whole report

The next step is to determine an appropriate value for *m*, the minimum votes required to be listed in the chart. We will use **95th percentile** as our cutoff. In other words, for a movie to feature in the charts, it must have more votes than at least 95% of the movies in the list.

Let's build our overall Top 250 Chart by defining a function to build charts for a particular genre!

In [5]:
vote_counts = md[md['vote_count'].notnull()]['vote_count'].astype('int')
vote_averages = md[md['vote_average'].notnull()]['vote_average'].astype('int')
C = vote_averages.mean()
C

5.244896612406511

In [6]:
m = vote_counts.quantile(0.95)
m

434.0

In [7]:
md['year'] = pd.to_datetime(md['release_date'], errors='coerce').apply(lambda x: str(x).split('-')[0] if x != np.nan else np.nan)

In [8]:
qualified = md[(md['vote_count'] >= m) & (md['vote_count'].notnull()) & (md['vote_average'].notnull())][['title', 'year', 'vote_count', 'vote_average', 'popularity', 'genres']]
qualified['vote_count'] = qualified['vote_count'].astype('int')
qualified['vote_average'] = qualified['vote_average'].astype('int')
qualified.shape

(2274, 6)

Therefore, to qualify to be considered for the chart, a movie has to have at least **425 votes** on IMDB. We also see that the average rating for a movie on IMDB is **5.238** on a scale of 10. **2335** Movies qualify to be on our chart.

In [9]:
def weighted_rating(x):
    v = x['vote_count']
    R = x['vote_average']
    return (v/(v+m) * R) + (m/(m+v) * C)

In [10]:
qualified['wr'] = qualified.apply(weighted_rating, axis=1)

In [11]:
qualified = qualified.sort_values('wr', ascending=False).head(250)

### Top Movies

In [12]:
qualified.head(15)

Unnamed: 0,title,year,vote_count,vote_average,popularity,genres,wr
15480,Inception,2010,14075,8,29.108149,"[Action, Thriller, Science Fiction, Mystery, A...",7.917588
12481,The Dark Knight,2008,12269,8,123.167259,"[Drama, Action, Crime, Thriller]",7.905871
22879,Interstellar,2014,11187,8,32.213481,"[Adventure, Drama, Science Fiction]",7.897107
2843,Fight Club,1999,9678,8,63.869599,[Drama],7.881753
4863,The Lord of the Rings: The Fellowship of the Ring,2001,8892,8,32.070725,"[Adventure, Fantasy, Action]",7.871787
292,Pulp Fiction,1994,8670,8,140.950236,"[Thriller, Crime]",7.86866
314,The Shawshank Redemption,1994,8358,8,51.645403,"[Drama, Crime]",7.864
7000,The Lord of the Rings: The Return of the King,2003,8226,8,29.324358,"[Adventure, Fantasy, Action]",7.861927
351,Forrest Gump,1994,8147,8,48.307194,"[Comedy, Drama, Romance]",7.860656
5814,The Lord of the Rings: The Two Towers,2002,7641,8,29.423537,"[Adventure, Fantasy, Action]",7.851924


We see that three Christopher Nolan Films, **Inception**, **The Dark Knight** and **Interstellar** occur at the very top of our chart. The chart also indicates a strong bias of IMDB Users towards particular genres and directors. 

Let us now construct our function that builds charts for particular genres. For this, we will use relax our default conditions to the **85th** percentile instead of 95. 

In [13]:
s = md.apply(lambda x: pd.Series(x['genres']),axis=1).stack().reset_index(level=1, drop=True)
s.name = 'genre'
gen_md = md.drop('genres', axis=1).join(s)

In [14]:
def build_chart(genre, percentile=0.85):
    df = gen_md[gen_md['genre'] == genre]
    vote_counts = df[df['vote_count'].notnull()]['vote_count'].astype('int')
    vote_averages = df[df['vote_average'].notnull()]['vote_average'].astype('int')
    C = vote_averages.mean()
    m = vote_counts.quantile(percentile)
    
    qualified = df[(df['vote_count'] >= m) & (df['vote_count'].notnull()) & (df['vote_average'].notnull())][['title', 'year', 'vote_count', 'vote_average', 'popularity']]
    qualified['vote_count'] = qualified['vote_count'].astype('int')
    qualified['vote_average'] = qualified['vote_average'].astype('int')
    
    qualified['wr'] = qualified.apply(lambda x: (x['vote_count']/(x['vote_count']+m) * x['vote_average']) + (m/(m+x['vote_count']) * C), axis=1)
    qualified = qualified.sort_values('wr', ascending=False).head(250)
    
    return qualified

Let us see our method in action by displaying the Top 15 Horror Movies (Horror almost didn't feature at all in our Generic Top Chart despite  being one of the most popular movie genres).

### Top Horror Movies

In [15]:
build_chart('Horror').head(15)

Unnamed: 0,title,year,vote_count,vote_average,popularity,wr
1213,The Shining,1980,3890,8,19.611589,7.901294
1176,Psycho,1960,2405,8,36.826309,7.843335
1171,Alien,1979,4564,7,23.37742,6.941936
41492,Split,2016,4461,7,28.920839,6.940631
14236,Zombieland,2009,3655,7,11.063029,6.927969
1158,Aliens,1986,3282,7,21.761179,6.920081
21276,The Conjuring,2013,3169,7,14.90169,6.917338
42169,Get Out,2017,2978,7,36.894806,6.912248
1338,Jaws,1975,2628,7,19.726114,6.901088
8147,Shaun of the Dead,2004,2479,7,14.902948,6.895426


The top horror movie according to our metrics is **The Shining**. This movie also happens to be one of my personal favorites! Try this with other genre to see if it works for you as well! (maybe it can give you some idea about what to watch this weekend!) 

## Content Based Recommender

The recommender we built in the previous section suffers some severe limitations. For one, it gives the same recommendation to everyone, regardless of the user's personal taste. If a person who loves romantic movies (and hates action) were to look at our Top 15 Chart, s/he wouldn't probably like most of the movies. If s/he were to go one step further and look at our charts by genre, s/he wouldn't still be getting the best recommendations.

For instance, consider a person who loves *Fight Club*, *Seven* and *The curious case of Benjmin Button*. One inference we can obtain is that the person loves the actor *Brad Pitt* and the director *David Fincher*. Even if s/he were to access the genre chart, s/he wouldn't find these as the top recommendations.

To personalise our recommendations more, we are going to build an engine that computes similarity between movies based on certain metrics and suggests movies that are most similar to a particular movie that a user liked. Since we will be using movie metadata (or content) to build this engine, this also known as **Content Based Filtering.**

We will build two Content Based Recommenders based on:
* Movie Overviews and Taglines
* Movie Cast, Crew, Keywords and Genre

Also, as mentioned in the introduction, we will be using a subset of all the movies available to us for the sake of simplicity (But I recommend you try it on the full dataset later on)

In [21]:
links_small = pd.read_csv('links_small.csv')
links_small = links_small[links_small['tmdbId'].notnull()]['tmdbId'].astype('int')
links_small.head(5)


0      862
1     8844
2    15602
3    31357
4    11862
Name: tmdbId, dtype: int64

In [23]:
md = md.drop([19730, 29503, 35587])

In [30]:
smd = md[md['id'].astype('int').isin(links_small)]
smd.shape

(9099, 25)

We have **9099** movies avaiable in our small movies metadata dataset which is 5 times smaller than our original dataset of 45000 movies.

### Movie Description Based Recommender

Let us first try to build a recommender using movie descriptions and taglines. We do not have a quantitative metric to judge our machine's performance so this will have to be done qualitatively.

In [31]:
smd['tagline'] = smd['tagline'].fillna('')
smd['description'] = smd['overview'] + smd['tagline']
smd['description'] = smd['description'].fillna('')

In [32]:
tf = TfidfVectorizer(analyzer='word',ngram_range=(1, 2),min_df=0, stop_words='english')
tfidf_matrix = tf.fit_transform(smd['description'])

In [33]:
tfidf_matrix.shape

(9099, 268124)

#### Cosine Similarity

We will be using the Cosine Similarity to calculate a numeric quantity that denotes the similarity between two movies. Mathematically, it is defined as follows:

$cosine(x,y) = \frac{x. y^\intercal}{||x||.||y||} $

Since we have used the TF-IDF Vectorizer, calculating the Dot Product will directly give us the Cosine Similarity Score. Therefore, we will use sklearn's **linear_kernel** instead of cosine_similarities since it is much faster.

In [34]:
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)

In [35]:
cosine_sim[0]

array([1.        , 0.00680476, 0.        , ..., 0.        , 0.00344913,
       0.        ])

We now have a pairwise cosine similarity matrix for all the movies in our dataset. The next step is to write a function that returns the 30 most similar movies based on the cosine similarity score.

In [36]:
smd = smd.reset_index()
titles = smd['title']
indices = pd.Series(smd.index, index=smd['title'])

In [37]:
def get_recommendations(title):
    idx = indices[title]
    sim_scores = list(enumerate(cosine_sim[idx]))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    sim_scores = sim_scores[1:31]
    movie_indices = [i[0] for i in sim_scores]
    return titles.iloc[movie_indices]

We're all set. Let us now try and get the top recommendations for a few movies and see how good the recommendations are.

In [38]:
get_recommendations('Se7en').head(10)

3425     The Magnificent Seven
3085                  The Cell
6678                     21 Up
7102              Seven Pounds
428                 Kalifornia
5391               Mindhunters
2410        The Bone Collector
7611    The Poughkeepsie Tapes
9012                    Solace
8014                 The Raven
Name: title, dtype: object

In [39]:
get_recommendations('The Curious Case of Benjamin Button').head(10)

7365                                           The Box
6838    Sweeney Todd: The Demon Barber of Fleet Street
4097                                      The Believer
4029                                 The Piano Teacher
2771                                    Jacob's Ladder
3335                                             Alfie
7214                                    Shall We Kiss?
6697                                        88 Minutes
3822                     That Obscure Object of Desire
2569                                  Five Easy Pieces
Name: title, dtype: object

In [40]:
get_recommendations('The Dark Knight').head(10)

7931                      The Dark Knight Rises
132                              Batman Forever
1113                             Batman Returns
8227    Batman: The Dark Knight Returns, Part 2
7565                 Batman: Under the Red Hood
524                                      Batman
7901                           Batman: Year One
2579               Batman: Mask of the Phantasm
2696                                        JFK
8165    Batman: The Dark Knight Returns, Part 1
Name: title, dtype: object

We see that for **The Dark Knight**, our system is able to identify it as a Batman film and subsequently recommend other Batman films as its top recommendations. But unfortunately, that is all this system can do at the moment. This is not of much use to most people as it doesn't take into considerations very important features such as cast, crew, director and genre, which determine the rating and the popularity of a movie. Someone who liked **The Dark Knight** probably likes it more because of Nolan and would hate **Batman Forever** and every other substandard movie in the Batman Franchise.

Therefore, we are going to use much more suggestive metadata than **Overview** and **Tagline**. In the next subsection, we will build a more sophisticated recommender that takes **genre**, **keywords**, **cast** and **crew** into consideration.

### Metadata Based Recommender

To build our standard metadata based content recommender, we will need to merge our current dataset with the crew and the keyword datasets. Let us prepare this data as our first step.

We now have our cast, crew, genres and credits, all in one dataframe. Let us wrangle this a little more using the following intuitions:

1. **Crew:** From the crew, we will only pick the director as our feature since the others don't contribute that much to the *feel* of the movie.
2. **Cast:** Choosing Cast is a little more tricky. Lesser known actors and minor roles do not really affect people's opinion of a movie. Therefore, we must only select the major characters and their respective actors. Arbitrarily we will choose the top 3 actors that appear in the credits list. 

In [42]:
smd.head(5)

Unnamed: 0,index,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,popularity,poster_path,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count,year,description
0,0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[Animation, Comedy, Family]",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",21.946943,/rhIRbceoE9lR4veEXuwCC2wARtG.jpg,"[{'name': 'Pixar Animation Studios', 'id': 3}]","[{'iso_3166_1': 'US', 'name': 'United States o...",1995-10-30,373554033.0,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415.0,1995,"Led by Woody, Andy's toys live happily in his ..."
1,1,False,,65000000,"[Adventure, Fantasy, Family]",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,17.015539,/vzmL6fP7aPKNKPRTFnZmiUfciyV.jpg,"[{'name': 'TriStar Pictures', 'id': 559}, {'na...","[{'iso_3166_1': 'US', 'name': 'United States o...",1995-12-15,262797249.0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0,1995,When siblings Judy and Peter discover an encha...
2,2,False,"{'id': 119050, 'name': 'Grumpy Old Men Collect...",0,"[Romance, Comedy]",,15602,tt0113228,en,Grumpier Old Men,A family wedding reignites the ancient feud be...,11.7129,/6ksm1sjKMFLbO7UY2i6G1ju9SML.jpg,"[{'name': 'Warner Bros.', 'id': 6194}, {'name'...","[{'iso_3166_1': 'US', 'name': 'United States o...",1995-12-22,0.0,101.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,False,6.5,92.0,1995,A family wedding reignites the ancient feud be...
3,3,False,,16000000,"[Comedy, Drama, Romance]",,31357,tt0114885,en,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom...",3.859495,/16XOMpEaLWkrcPqSQqhTmeJuqQl.jpg,[{'name': 'Twentieth Century Fox Film Corporat...,"[{'iso_3166_1': 'US', 'name': 'United States o...",1995-12-22,81452156.0,127.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Friends are the people who let you be yourself...,Waiting to Exhale,False,6.1,34.0,1995,"Cheated on, mistreated and stepped on, the wom..."
4,4,False,"{'id': 96871, 'name': 'Father of the Bride Col...",0,[Comedy],,11862,tt0113041,en,Father of the Bride Part II,Just when George Banks has recovered from his ...,8.387519,/e64sOI48hQXyru7naBFyssKFxVd.jpg,"[{'name': 'Sandollar Productions', 'id': 5842}...","[{'iso_3166_1': 'US', 'name': 'United States o...",1995-02-10,76578911.0,106.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Just When His World Is Back To Normal... He's ...,Father of the Bride Part II,False,5.7,173.0,1995,Just when George Banks has recovered from his ...


In [43]:
credits = pd.read_csv('credits.csv')
keywords = pd.read_csv('keywords.csv')

In [44]:
keywords['id'] = keywords['id'].astype('int')
credits['id'] = credits['id'].astype('int')
md['id'] = md['id'].astype('int')

In [45]:
md.shape

(45463, 25)

In [46]:
md = md.merge(credits, on='id')
md = md.merge(keywords, on='id')

In [47]:
smd = md[md['id'].isin(links_small)]
smd.shape

(9219, 28)

In [48]:
smd['cast'] = smd['cast'].apply(literal_eval)
smd['crew'] = smd['crew'].apply(literal_eval)
smd['keywords'] = smd['keywords'].apply(literal_eval)
smd['cast_size'] = smd['cast'].apply(lambda x: len(x))
smd['crew_size'] = smd['crew'].apply(lambda x: len(x))

In [49]:
def get_director(x):
    for i in x:
        if i['job'] == 'Director':
            return i['name']
    return np.nan

In [50]:
smd['director'] = smd['crew'].apply(get_director)

In [51]:
smd['cast'] = smd['cast'].apply(lambda x: [i['name'] for i in x] if isinstance(x, list) else [])
smd['cast'] = smd['cast'].apply(lambda x: x[:3] if len(x) >=3 else x)

In [52]:
smd['keywords'] = smd['keywords'].apply(lambda x: [i['name'] for i in x] if isinstance(x, list) else [])

Our approach to build the recommender is a bit *hacky*! What we plan on doing is creating a metadata dump for every movie which consists of **genres, director, main actors and keywords.** Then we use a **Count Vectorizer** to create our count matrix as we did in the Description Recommender. The remaining steps are similar to what we did earlier: we calculate the cosine similarities and return movies that are most similar.

These are steps we follow in the preparation of our genres and credits data:
1. **Strip Spaces and Convert to Lowercase** from all our features. This way, our engine will not confuse between **Johnny Depp** and **Johnny Galecki.** 
2. **Mention Director 2 times** to give it more weight relative to the entire cast.

In [53]:
smd['cast'] = smd['cast'].apply(lambda x: [str.lower(i.replace(" ", "")) for i in x])

In [54]:
smd['director'] = smd['director'].astype('str').apply(lambda x: str.lower(x.replace(" ", "")))
smd['director'] = smd['director'].apply(lambda x: [x,x])

#### Keywords

We will do a small amount of pre-processing of our keywords before putting them to any use. As a first step, we calculate the frequenct counts of every keyword that appears in the dataset.

In [55]:
s = smd.apply(lambda x: pd.Series(x['keywords']),axis=1).stack().reset_index(level=1, drop=True)
s.name = 'keyword'

In [56]:
s = s.value_counts()
s[:5]

independent film        610
woman director          550
murder                  399
duringcreditsstinger    327
based on novel          318
Name: keyword, dtype: int64

Keywords occur in frequencies ranging from 1 to 610. We do not have any use for keywords that occur only once. Therefore, these can be safely removed. Finally, we will convert every word to its stem so that words such as *Dogs* and *Dog* are considered the same.

In [57]:
s = s[s > 1]

In [58]:
stemmer = SnowballStemmer('english')
stemmer.stem('dogs')

'dog'

In [59]:
def filter_keywords(x):
    words = []
    for i in x:
        if i in s:
            words.append(i)
    return words

In [60]:
smd['keywords'] = smd['keywords'].apply(filter_keywords)
smd['keywords'] = smd['keywords'].apply(lambda x: [stemmer.stem(i) for i in x])
smd['keywords'] = smd['keywords'].apply(lambda x: [str.lower(i.replace(" ", "")) for i in x])

In [61]:
smd['soup'] = smd['keywords'] + smd['cast'] + smd['director'] + smd['genres']
smd['soup'] = smd['soup'].apply(lambda x: ' '.join(x))

In [62]:
count = CountVectorizer(analyzer='word',ngram_range=(1, 2),min_df=0, stop_words='english')
count_matrix = count.fit_transform(smd['soup'])

In [63]:
cosine_sim = cosine_similarity(count_matrix, count_matrix)

In [64]:
smd = smd.reset_index()
titles = smd['title']
indices = pd.Series(smd.index, index=smd['title'])

We will reuse the get_recommendations function that we had written earlier. Since our cosine similarity scores have changed, we expect it to give us different (and probably better) results. Let us check for **The Dark Knight** again and see what recommendations I get this time around.

In [65]:
get_recommendations('The Dark Knight').head(10)

8031                 The Dark Knight Rises
6218                         Batman Begins
7659            Batman: Under the Red Hood
6623                          The Prestige
1134                        Batman Returns
8927               Kidnapping Mr. Heineken
5943                              Thursday
1260                        Batman & Robin
2085                             Following
9024    Batman v Superman: Dawn of Justice
Name: title, dtype: object

I am much more satisfied with the results I get this time around. The recommendations seem to have recognized other Christopher Nolan movies (due to the high weightage given to director) and put them as top recommendations. I enjoyed watching **The Dark Knight** as well as some of the other ones in the list including **Batman Begins**, **The Prestige** and **The Dark Knight Rises**. 

We can of course experiment on this engine by trying out different weights for our features (directors, actors, genres), limiting the number of keywords that can be used in the soup, weighing genres based on their frequency, only showing movies with the same languages, etc.

In [66]:
get_recommendations('Se7en').head(10)

6719                                 Zodiac
8041        The Girl with the Dragon Tattoo
2390                             Fight Club
4068                             Panic Room
8698                              Gone Girl
1311                               The Game
3648                        The January Man
7186    The Curious Case of Benjamin Button
3391                    Along Came a Spider
596                              True Crime
Name: title, dtype: object

#### Popularity and Ratings

One thing that we notice about our recommendation system is that it recommends movies regardless of ratings and popularity. It is true that **Batman and Robin** has a lot of similar characters as compared to **The Dark Knight** but it was a terrible movie that shouldn't be recommended to anyone.

Therefore, we will add a mechanism to remove bad movies and return movies which are popular and have had a good critical response.

I will take the top 25 movies based on similarity scores and calculate the vote of the 60th percentile movie. Then, using this as the value of $m$, we will calculate the weighted rating of each movie using IMDB's formula like we did in the Simple Recommender section.

In [67]:
def improved_recommendations(title):
    idx = indices[title]
    sim_scores = list(enumerate(cosine_sim[idx]))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    sim_scores = sim_scores[1:26]
    movie_indices = [i[0] for i in sim_scores]
    
    movies = smd.iloc[movie_indices][['title', 'vote_count', 'vote_average', 'year']]
    vote_counts = movies[movies['vote_count'].notnull()]['vote_count'].astype('int')
    vote_averages = movies[movies['vote_average'].notnull()]['vote_average'].astype('int')
    C = vote_averages.mean()
    m = vote_counts.quantile(0.60)
    qualified = movies[(movies['vote_count'] >= m) & (movies['vote_count'].notnull()) & (movies['vote_average'].notnull())]
    qualified['vote_count'] = qualified['vote_count'].astype('int')
    qualified['vote_average'] = qualified['vote_average'].astype('int')
    qualified['wr'] = qualified.apply(weighted_rating, axis=1)
    qualified = qualified.sort_values('wr', ascending=False).head(10)
    return qualified

In [68]:
improved_recommendations('The Dark Knight')

Unnamed: 0,title,vote_count,vote_average,year,wr
6623,The Prestige,4510,8,2006,7.758148
8031,The Dark Knight Rises,9263,7,2012,6.921448
6218,Batman Begins,7511,7,2005,6.904127
7659,Batman: Under the Red Hood,459,7,2010,6.147016
2085,Following,363,7,1998,6.044272
1134,Batman Returns,1706,6,1992,5.846862
7561,Harry Brown,351,6,2009,5.582529
8026,Bullet to the Head,490,5,2013,5.115027
9024,Batman v Superman: Dawn of Justice,7189,5,2016,5.013943
1260,Batman & Robin,1447,4,1997,4.287233


In [69]:
improved_recommendations('Se7en')

Unnamed: 0,title,vote_count,vote_average,year,wr
2390,Fight Club,9678,8,1999,7.881753
8698,Gone Girl,6023,7,2014,6.882033
7186,The Curious Case of Benjamin Button,3398,7,2008,6.801223
8041,The Girl with the Dragon Tattoo,2479,7,2011,6.738512
6719,Zodiac,2080,7,2007,6.697011
1311,The Game,1556,7,1997,6.617229
4068,Panic Room,1303,6,2002,5.811333
2430,The Bone Collector,843,6,1999,5.743371
3391,Along Came a Spider,408,6,2001,5.61079
5176,Taking Lives,356,6,2004,5.585171


Unfortunately, **Batman and Robin** does not disappear from our recommendation list. This is probably due to the fact that it is rated a 4, which is only slightly below average on TMDB. But at least we decreased its position in the list. There is nothing much we can do about this right now. Therefore, we will conclude our Content Based Recommender section here and come back to it when we build a hybrid engine.

## Collaborative Filtering

Our content based engine suffers from some severe limitations. It is only capable of suggesting movies which are *close* to a certain movie. That is, it is not capable of capturing tastes and providing recommendations across genres.

Also, the engine that we built is not really personal in that it doesn't capture the personal tastes and biases of a user. Anyone querying our engine for recommendations based on a movie will receive the same recommendations for that movie, regardless of who s/he is.

Therefore, in this section, we will use a technique called **Collaborative Filtering** to make recommendations to Movie Watchers. Collaborative Filtering is based on the idea that users similar to me can be used to predict how much I will like a particular product or service those users have used/experienced but I have not.

I will not be implementing Collaborative Filtering from scratch. Instead, I will use the [**Surprise**](http://surpriselib.com/) library that uses extremely powerful algorithms such as **KNN** or  **Singular Value Decomposition (SVD)** to minimise RMSE (Root Mean Square Error) and give great recommendations. To install Surprise, run the following command in Anaconda Prompt command line (Run the prompt with administrator privilages if you get permission error)

In [70]:
pip install scikit-surprise

Collecting scikit-surprise
  Downloading scikit-surprise-1.1.1.tar.gz (11.8 MB)
[K     |████████████████████████████████| 11.8 MB 10.3 MB/s eta 0:00:01
Building wheels for collected packages: scikit-surprise
  Building wheel for scikit-surprise (setup.py) ... [?25ldone
[?25h  Created wheel for scikit-surprise: filename=scikit_surprise-1.1.1-cp38-cp38-macosx_10_9_x86_64.whl size=767882 sha256=16f1b06c218deb7d8948d75daf105d8e9f5f39101cae4ba831794793896ab7b4
  Stored in directory: /Users/navalaggarwal/Library/Caches/pip/wheels/20/91/57/2965d4cff1b8ac7ed1b6fa25741882af3974b54a31759e10b6
Successfully built scikit-surprise
Installing collected packages: scikit-surprise
Successfully installed scikit-surprise-1.1.1
Note: you may need to restart the kernel to use updated packages.


In [71]:
from surprise import Reader, Dataset, SVD, KNNBasic
from surprise.model_selection import cross_validate, train_test_split
reader = Reader()

In [72]:
ratings = pd.read_csv('ratings_small.csv')
ratings.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,31,2.5,1260759144
1,1,1029,3.0,1260759179
2,1,1061,3.0,1260759182
3,1,1129,2.0,1260759185
4,1,1172,4.0,1260759205


In [73]:
data = Dataset.load_from_df(ratings[['userId', 'movieId', 'rating']], reader)

In [74]:
#let's start with KNN 
algo = KNNBasic()
# Run 5-fold cross-validation and print results. It may take a while!
cross_validate(algo, data, measures=['RMSE', 'MAE'], cv=5, verbose=True)

Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Evaluating RMSE, MAE of algorithm KNNBasic on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.9730  0.9702  0.9654  0.9690  0.9666  0.9689  0.0027  
MAE (testset)     0.7481  0.7452  0.7396  0.7460  0.7441  0.7446  0.0028  
Fit time          0.16    0.12    0.13    0.11    0.11    0.13    0.02    
Test time         1.28    1.30    1.43    1.30    1.14    1.29    0.09    


{'test_rmse': array([0.97301977, 0.97021175, 0.96538225, 0.96903109, 0.96664669]),
 'test_mae': array([0.74807845, 0.74519963, 0.73959559, 0.74597578, 0.74405132]),
 'fit_time': (0.16214609146118164,
  0.11602902412414551,
  0.13273096084594727,
  0.11194896697998047,
  0.1136770248413086),
 'test_time': (1.2793700695037842,
  1.2980890274047852,
  1.4329829216003418,
  1.3015573024749756,
  1.1435377597808838)}

In [75]:
algo = SVD()
# Run 5-fold cross-validation and print results. It may take a while!
cross_validate(algo, data, measures=['RMSE', 'MAE'], cv=5, verbose=True)

Evaluating RMSE, MAE of algorithm SVD on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.8934  0.8897  0.9017  0.9005  0.9020  0.8975  0.0050  
MAE (testset)     0.6863  0.6850  0.6925  0.6952  0.6938  0.6906  0.0041  
Fit time          3.86    4.08    3.75    4.91    3.87    4.09    0.42    
Test time         0.14    0.20    0.09    0.19    0.09    0.14    0.05    


{'test_rmse': array([0.89339555, 0.8897461 , 0.90165348, 0.90051311, 0.90195575]),
 'test_mae': array([0.68628194, 0.68503057, 0.69252096, 0.69522966, 0.6938468 ]),
 'fit_time': (3.8613510131835938,
  4.0761919021606445,
  3.748966932296753,
  4.907029390335083,
  3.8738369941711426),
 'test_time': (0.1413118839263916,
  0.2046349048614502,
  0.09171915054321289,
  0.19199180603027344,
  0.09275317192077637)}

Look at the **fit time** and the **test time** for two algorithms. Which one do you think is more important in our application? 

For SVD, we get an average **Root Mean Sqaure Error** around 0.69 on test data which is good enough for our case. Let us now train on our dataset and arrive at predictions.

In [76]:
#Now that we know SVD algorithm is what we want, let's train it on the entire dataset 
trainset = data.build_full_trainset()
algo.fit(trainset)

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x7fbf3d6e0190>

Let us pick user id 1 and check the ratings s/he has given.

In [77]:
ratings[ratings['userId'] == 1]

Unnamed: 0,userId,movieId,rating,timestamp
0,1,31,2.5,1260759144
1,1,1029,3.0,1260759179
2,1,1061,3.0,1260759182
3,1,1129,2.0,1260759185
4,1,1172,4.0,1260759205
5,1,1263,2.0,1260759151
6,1,1287,2.0,1260759187
7,1,1293,2.0,1260759148
8,1,1339,3.5,1260759125
9,1,1343,2.0,1260759131


In [78]:
algo.predict(uid=1,iid=31,r_ui=2.5)

Prediction(uid=1, iid=31, r_ui=2.5, est=2.409464403524547, details={'was_impossible': False})

For movie with ID 31, we get an estimated prediction of **2.41** (the true rating was 2.5, not bad ha?!). One startling feature of this recommender system is that it doesn't care what the movie is (or what it contains). It works purely on the basis of an assigned movie ID and tries to predict ratings based on how the other users have rated the movie.

# Evaluating Recommender Systems

Let’s look into the different ways we can evaluate our Recommendation Systems. There are two main approaches for evaluation:
1. Evaluating Offline
2. Online A/B Testing

Beware, that **NONE OF THE METRICS THAT WE WILL DISCUSS MATTER MORE THAN HOW REAL CUSTOMERS REACT TO RECOMMENDATION YOU PRODUCE IN REAL WORLD!**

In other words, offline metrics such as mae, diversity, hitrate, etc. (that we will discuss below) can all be indicators you can look at while developing Recommendation Systems offline but never declare victory until you measure real impact on real users. Sometimes they don’t work at all! User behavior is the ultimate test of our work. There is a real challenge where often accuracy metrics tell your algorithm is great only to have it do HORRIBLE in online A/B Test. YouTube studies this and calls it **“THE SURROGATE PROBLEM”**. At the end of the day, online A/B tests are the only thing you can use for your Recommendation System.

## Different Offline Metrics
The different offline metrics and other measures that define our Recommendation System are mentioned below. Don’t let the terms scare you; they are actually quite simple. for more information, please read this nice [article](https://medium.com/fnplus/evaluating-recommender-systems-with-python-code-ae0c370c90be) at medium.

1. **Mean Absolute Error (MAE):** It is the average of the differences between the actual value(rating) and the predicted value.
2. **Root Mean Square Error (RMSE):** RMSE is similar to MAE but the only difference is that the absolute value of the residual is squared and the square root of the whole term is taken for comparison.
3. **Hit Rate (HR):** HIT RATE = (HITS IN TEST) / (NUMBER OF USERS). Hit Rate is a better alternative to MAE or RMSE. To measure a Hit Rate, we first generate top N recommendations for all the users in our test data set. If generated top N recommendations contain something that users rated — 1 hit! Greater the Hit Rate better will be our Recommendation System.
4. **Average Reciprocal Hit Rate (ARHR):** It is a variation of Hit Rate. The difference is that we sum up reciprocal of the rank of each hit. It accounts for “where” in the top N lists our hits appear. We get greater credit for successfully recommending item in the top position than in bottom posision. As it takes rankings into account, higher is better. Note that ARHR is more user-focused (in contrast to Hit Rate) metric since users tend to focus more on the beginning of the list.
5. **Coverage:** It is the percentage of possible recommendations (user-item pair) that system can predict. Coverage can be at odds with accuracy. If you enforce a higher quality threshold on recommendations you make, then you might improve your accuracy at the expense of coverage. Find the balance.
6. **Diversity:** It is a measure of how broad or variety of items our Recommender System is putting in front of people. Diversity and Similarity between recommendation pairs are opposite to each other (Diversity = 1 — Similarity). One thing to take care is that we can achieve high diversity by recommending completely random things. Hence it is not always a good thing. Unusually high diversity leads to bad recommendations.
7. **Churn:** How often do recommendations change? If a user rates a new movie does it substantially change their recommendations? If yes, your churn score is high. Showing the same recommendations all the time is a BAD IDEA. But just like Diversity, high churn score is NOT a good thing. You can recommend randomly and still get high churn score. Note that these metrics must be looked at together and we must understand the trade-offs between them.
8. **Responsiveness:** It measures “How quickly does new user behavior influence recommendations”. It may seem similar to Churn score. But the key difference is Responsiveness is measured in time(How quickly changes are made) and Churn score in “How often” i.e number of times the changes are made in a given time interval. High Responsiveness is a good thing but in a world of business, you must decide how responsive your Recommendation System must be.

Ok, now let's see how these metrics can be implemented in Python and how we can use them to evlute our RS. The full code can be found [here](https://www.kaggle.com/l0new0lf/recommendermetrics)



In [79]:
#BUILDING RECOMMENDATION METRICS LIBRARY

#surprise dccumetation : https://surpriselib.com
#you can see the code for metrics at github page in above website

#importing libraries
import itertools
from surprise import accuracy
from collections import defaultdict

def MAE(predictions):
    return accuracy.mae(predictions, verbose=False)
def RMSE(predictions):
    return accuracy.rmse(predictions, verbose=False)

#GetTopN takes in complete list of ratings prediction that come back from some recommender and
#returns a dictionary that maps user ids to their Top N Ratings.
#We are using defaultdict object which is simmilar to normal python dictionary  but has 
#concept of default empty values
def GetTopN(predictions, n=10, minimumRating=4.0):
    topN = defaultdict(list)
    for userID, movieID, actualRating, estimatedRating, _ in predictions:
        if(estimatedRating >= minimumRating):
            topN[int(userID)].append((int(movieID), estimatedRating)) #note parenthesis
    for userID, ratings in topN.items():
        ratings.sort(key=lambda x: x[1], reverse=True)
        topN[int(userID)] = ratings[:n]
    return topN

#To predict HitRate, we need to pass in both our dictionary of Top N Movies for each user ID and 
#the set of test movie ratings that were left out of training dataset.
#We are using Leave One Out Cross Validation to hold back one rating per user and test our ability to 
#recommend that movie in our Top N lists
def HitRate(topNPredicted, leftOutPredictions):
    hits = 0
    total = 0
    #for each left out rating
    for leftOut in leftOutPredictions:
        userID = leftOut[0]
        leftOutMovieID = leftOut[1]
        #is it in predicted top 10 for this user?
        hit = False
        for movieID, predictedRating in topNPredicted[int(userID)]:
            if (int(leftOutMovieID) == int(movieID)):
                hit = True
                break
            if (hit):
                hits += 1
            total += 1
    #compute overall precision
    return (hits/total)

#Cumilative Hit Rate or CHR works exactly the same way as hit rate except now we have rating cutoff value.
#So, we dont count hit unless predited rating is higher than some threshold
def CumulativeHitRate(topNPredicted, leftOutPredictions, ratingCutoff=0):
    hits = 0
    total = 0
    #for each left out rating
    for userID, leftOutMovieID, actualRating, estimatedRating, _ in leftOutPredictions:
        #only look at ability to recommend things the user actually liked...
        if(actualRating >= ratingCutoff):
            #is it in predicted top 10 of this user?
            hit=False
            for movieID, predictedRating in topNPredicted[int(userID)]:
                if(int(leftOutMovieID) == int(movieID)):
                    hit = True
                    break
            if(hit):
                hits += 1
            total += 1
    #compute overall precision
    return (hits/total)



#Rating Hit Rate (RHR) : Smilar to Hit rate but We keep track of hit rate for each unique rating value
#So instead of keeping one variable to keep track  of hits and total users, we use another dictionary
#to keep track of hits and totals of each rating type. Then, we print them all out
def RatingHitRate(topNPredicted, leftOutPredictions):
    hits = defaultdict(float)
    total = defaultdict(float)

    # For each left-out rating
    for userID, leftOutMovieID, actualRating, estimatedRating, _ in leftOutPredictions:
        # Is it in the predicted top N for this user?
        hit = False
        for movieID, predictedRating in topNPredicted[int(userID)]:
            if (int(leftOutMovieID) == movieID):
                hit = True
                break
        if (hit) :
            hits[actualRating] += 1

        total[actualRating] += 1

    # Compute overall precision
    for rating in sorted(hits.keys()):
        print (rating, hits[rating] / total[rating])


#ARHR : Simillar to hit rate. Difference uis that we count things up by the reciprocal of ranks of each 
#hits, inorder to get more credit for hits that occured near the top of Top N list
#
def AverageReciprocalHitRank(topNPredicted, leftOutPredictions):
    summation = 0
    total = 0
    # For each left-out rating
    for userID, leftOutMovieID, actualRating, estimatedRating, _ in leftOutPredictions:
        # Is it in the predicted top N for this user?
        hitRank = 0
        rank = 0
        for movieID, predictedRating in topNPredicted[int(userID)]:
            rank = rank + 1
            if (int(leftOutMovieID) == movieID):
                hitRank = rank
                break
        if (hitRank > 0) :
            summation += 1.0 / hitRank

        total += 1

    return summation / total



# COVERAGE : What percentage of users have at least one "good" recommendation (ABOVE SOME THRESHHOLD)
#In real world, you would probably have a catalog of items that is larger than the set of items you
#have recommendations data for and would compute coverage based on that lager dataset intead.
def UserCoverage(topNPredicted, numUsers, ratingThreshold=0):
    hits = 0
    for userID in topNPredicted.keys():
        hit = False
        for movieID, predictedRating in topNPredicted[userID]:
            if (predictedRating >= ratingThreshold):
                hit = True
                break
        if (hit):
            hits += 1

    return hits / numUsers

#DIVERSITY : To measure diversity, we not only need all of the Top N Recommendations from our system,
#we need a matrix of similarity scores between every pair of items in our dataset. 
#Although divesity is easy to explain, coding is little tricky -
#We start by retreiving similarity matrix(Basically a 2x2 Matrix array that contain similarity scores for every posible
#combination of items that we can quickly lookup). Then, we go to Top N Recommendations for each user one user at a time.
#itertools.combinations :
#This call gives us back every combination of item pairs within the Top N List. Wecanthen iterate through each pair
#and look at similarity between each pair o items
#NOTE : SURPRISE maintains INTERNAL IDs for both users and items that are  sequential, and these are different from raw
#user ids and movie ids that are presentin our actual ratings data.
#Similarity mmatrix uses those inner user IDs so we need to convert our raw IDs into inner IDs before looking up 
#similarity scores. We add up all the similarity scores, take the average and subtract it from one to get out
#our diversity metric
def Diversity(topNPredicted, simsAlgo):
    n = 0
    total = 0
    simsMatrix = simsAlgo.compute_similarities()
    for userID in topNPredicted.keys():
        pairs = itertools.combinations(topNPredicted[userID], 2)
        for pair in pairs:
            movie1 = pair[0][0]
            movie2 = pair[1][0]
            innerID1 = simsAlgo.trainset.to_inner_iid(str(movie1))
            innerID2 = simsAlgo.trainset.to_inner_iid(str(movie2))
            similarity = simsMatrix[innerID1][innerID2]
            total += similarity
            n += 1

    S = total / n
    return (1-S)      

            
    #NOTE: ABOVE DATA IS HIGHLY COMPUTATIONAL. USE SAMPLE DATA IN REAL WORLD SCENARIOS


In [80]:
trainset, testset = train_test_split(data, test_size=0.25)
algo = SVD()
predictions  = algo.fit(trainset).test(testset) 



In [81]:
print("MAE is: " , MAE(predictions))
print("RMSE is: " , RMSE(predictions))
top10 = GetTopN(predictions,10)
print("HitRate is: " , HitRate(top10,testset))
print("ARHR is: " , AverageReciprocalHitRank(top10,predictions))
numUsers = len(set([l[0] for l in testset]))
print("Coverage is: " , UserCoverage(top10,numUsers, ratingThreshold=4.5 ))



MAE is:  0.6953873628081256
RMSE is:  0.9026594121865089
HitRate is:  0.0
ARHR is:  0.046567296038317246
Coverage is:  0.33134328358208953


## Hybrid Recommender

In this section, I will try to build a simple hybrid recommender that brings together techniques we have implemented in the content based and collaborative filter based engines. This is how it will work:

* **Input:** User ID and the Title of a Movie
* **Output:** Similar movies sorted on the basis of expected ratings by that particular user.

In [82]:
def convert_int(x):
    try:
        return int(x)
    except:
        return np.nan

In [85]:
id_map = pd.read_csv('links_small.csv')[['movieId', 'tmdbId']]
id_map['tmdbId'] = id_map['tmdbId'].apply(convert_int)
id_map.columns = ['movieId', 'id']
id_map = id_map.merge(smd[['title', 'id']], on='id').set_index('title')
#id_map = id_map.set_index('tmdbId')

In [86]:
indices_map = id_map.set_index('id')

In [87]:
def hybrid(userId, title):
    idx = indices[title]
    tmdbId = id_map.loc[title]['id']
    #print(idx)
    movie_id = id_map.loc[title]['movieId']
    
    sim_scores = list(enumerate(cosine_sim[int(idx)]))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    sim_scores = sim_scores[1:26]
    movie_indices = [i[0] for i in sim_scores]
    
    movies = smd.iloc[movie_indices][['title', 'vote_count', 'vote_average', 'year', 'id']]
    movies['est'] = movies['id'].apply(lambda x: algo.predict(userId, indices_map.loc[x]['movieId']).est)
    movies = movies.sort_values('est', ascending=False)
    return movies.head(10)

In [88]:
hybrid(1, 'Avatar')

Unnamed: 0,title,vote_count,vote_average,year,id,est
1011,The Terminator,4208.0,7.4,1984,218,3.422825
522,Terminator 2: Judgment Day,4274.0,7.7,1991,280,3.369016
974,Aliens,3282.0,7.7,1986,679,3.331167
8401,Star Trek Into Darkness,4479.0,7.4,2013,54138,3.210078
7705,Alice in Wonderland,8.0,5.4,1933,25694,3.094014
922,The Abyss,822.0,7.1,1989,2756,3.063606
8658,X-Men: Days of Future Past,6155.0,7.5,2014,127585,3.063024
2834,Predator,2129.0,7.3,1987,106,3.013339
7265,Dragonball Evolution,475.0,2.9,2009,14164,2.965376
3060,Sinbad and the Eye of the Tiger,39.0,6.3,1977,11940,2.937637


In [89]:
hybrid(500, 'Avatar')

Unnamed: 0,title,vote_count,vote_average,year,id,est
8401,Star Trek Into Darkness,4479.0,7.4,2013,54138,3.652201
522,Terminator 2: Judgment Day,4274.0,7.7,1991,280,3.539362
1011,The Terminator,4208.0,7.4,1984,218,3.149135
922,The Abyss,822.0,7.1,1989,2756,3.078115
8658,X-Men: Days of Future Past,6155.0,7.5,2014,127585,3.050156
7705,Alice in Wonderland,8.0,5.4,1933,25694,3.03264
3060,Sinbad and the Eye of the Tiger,39.0,6.3,1977,11940,3.00095
7265,Dragonball Evolution,475.0,2.9,2009,14164,2.962526
8419,Man of Steel,6462.0,6.5,2013,49521,2.898903
1668,Return from Witch Mountain,38.0,5.6,1978,14822,2.867042


We see that for our hybrid recommender, we get different recommendations for different users although the movie is the same. Hence, our recommendations are more personalized and tailored towards particular users.

## Conclusion

In this notebook, we have built 5 different recommendation engines based on different ideas and algorithms. They are as follows:

1. **Simple Recommender:** This system used overall TMDB Vote Count and Vote Averages to build Top Movies Charts, in general and for a specific genre. The IMDB Weighted Rating System was used to calculate ratings on which the sorting was finally performed.
2. **Content Based Recommender:** We built two content based engines; one that took movie overview and taglines as input and the other which took metadata such as cast, crew, genre and keywords to come up with predictions. We also deviced a simple filter to give greater preference to movies with more votes and higher ratings.
3. **Collaborative Filtering:** We used the powerful Surprise Library to build a collaborative filter based on single value decomposition. The RMSE obtained was less than 1 and the engine gave estimated ratings for a given user and movie.
4. **Hybrid Engine:** We brought together ideas from content and collaborative filterting to build an engine that gave movie suggestions to a particular user based on the estimated ratings that it had internally calculated for that user.


## Apendix

you can check out the below Kaggle page and explore its Kernels to learn more about different idea that can be used to build a recommender system on this data set. This code is the modification of one of its Kernels which you can find [Here](https://www.kaggle.com/rounakbanik/movie-recommender-systems)

https://www.kaggle.com/rounakbanik/the-movies-dataset/kernels
