### About the Movie Recommendation Project

We will implement two recommendation algorithms for movie recommendation:
-  Simple Recommender
-  Content based   


### About the Dataset

The dataset consists of movies released on or before July 2017. Data points include cast, crew, plot keywords, budget, revenue, posters, release dates, languages, production companies, countries, TMDB vote counts and vote averages.

This dataset also has files containing 26 million ratings from 270,000 users for all 45,000 movies. Ratings are on a scale of 1-5 and have been obtained from the official GroupLens website.

* **The Full Dataset:** Consists of 26,000,000 ratings and 750,000 tag applications applied to 45,000 movies by 270,000 users. Includes tag genome data with 12 million relevance scores across 1,100 tags.
* **The Small Dataset:** Comprises 100,000 ratings and 1,300 tag applications applied to 9,000 movies by 700 users.

We will build our Simple Recommender using movies from the *Full Dataset*, whereas all personalised recommender systems will make use of the small dataset.

###  I. Setting the Working Environment

In [1]:

import pandas as pd ## for data processing
import numpy as np ## for data processing
from ast import literal_eval ## for text preprocessing
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer ## for Text Data Processing
from sklearn.metrics.pairwise import linear_kernel, cosine_similarity ## for Recommender Data Similarity
import warnings; warnings.simplefilter('ignore')

## II. Simple Recommender

The Simple Recommender offers generalized recommendations to every user based on movie popularity and (sometimes) genre. The basic idea behind this recommender is that movies that are more popular and more critically acclaimed will have a higher probability of being liked by the average audience.

We will sort our movies based on ratings and popularity and display the top movies of our list. As an added step, we can pass in a genre argument to get the top movies of a particular genre. 

In [2]:
md = pd.read_csv('tmdb_5000_movies.csv')
print("the shape of movies data is -->"+str(md.shape))
md.head(2)

the shape of movies data is -->(4803, 20)


Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2009-12-10,2787965087,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800
1,300000000,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...",http://disney.go.com/disneypictures/pirates/,285,"[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...",en,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...",139.082615,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2007-05-19,961000000,169.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"At the end of the world, the adventure begins.",Pirates of the Caribbean: At World's End,6.9,4500


In [22]:
#For cleaning the Genre Column
md['genres'] = md['genres'].fillna('[]').apply(literal_eval).apply(lambda x: [i['name'] for i in x] if isinstance(x, list) else [])
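
To make the cleaning step above easier to follow, here is a minimal sketch of what it does to a single cell. The helper name `parse_genres` and the sample string are assumptions for illustration; the notebook itself does the same thing inline with `fillna`, `literal_eval` and a lambda.

```python
from ast import literal_eval

def parse_genres(raw):
    """Hypothetical helper: turn one raw 'genres' cell into a list of names."""
    # Missing cells are treated like an empty list, as fillna('[]') does above.
    if raw is None:
        raw = '[]'
    parsed = literal_eval(raw)  # safely evaluates the list-of-dicts string
    # Anything that is not a list yields no genres.
    return [item['name'] for item in parsed] if isinstance(parsed, list) else []

# Sample string in the same shape as the raw 'genres' column.
sample = '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]'
genre_names = parse_genres(sample)
print(genre_names)
```

`literal_eval` works here because the stored strings only contain lists, dicts, numbers and quoted strings, which are valid Python literals.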

In [23]:
md.head(2)

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count
0,237000000,"[Action, Adventure, Fantasy, Science Fiction]",http://www.avatarmovie.com/,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2009-12-10,2787965087,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800
1,300000000,"[Adventure, Fantasy, Action]",http://disney.go.com/disneypictures/pirates/,285,"[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...",en,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...",139.082615,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2007-05-19,961000000,169.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"At the end of the world, the adventure begins.",Pirates of the Caribbean: At World's End,6.9,4500


### II.I. Methodology for Simple Recommender System

We will build our overall Top 250 Chart and will define a function to build charts for a particular genre. Let's begin!

In [24]:
vote_counts = md[md['vote_count'].notnull()]['vote_count'].astype('int')
vote_averages = md[md['vote_average'].notnull()]['vote_average'].astype('int')
C = vote_averages.mean()
C

5.6529252550489275

In [25]:
m = vote_counts.quantile(0.95)  # 95th percentile of vote counts: the minimum number of votes needed to qualify
m

3040.8999999999996

In [26]:
#Extracting the Year
md['year'] = pd.to_datetime(md['release_date'], errors='coerce').apply(lambda x: str(x).split('-')[0] if pd.notnull(x) else np.nan)  # x != np.nan is always True, so test for missing dates with pd.notnull

In [27]:
#Extracting the Qualified Movies based on the cutoff 
qualified = md[(md['vote_count'] >= m) & (md['vote_count'].notnull()) & (md['vote_average'].notnull())][['title', 'year', 'vote_count', 'vote_average', 'popularity', 'genres']]
qualified['vote_count'] = qualified['vote_count'].astype('int')
qualified['vote_average'] = qualified['vote_average'].astype('int')
qualified.shape

(241, 6)

In [28]:
#Below is the structure of the Qualified Movies
qualified.head(2)

Unnamed: 0,title,year,vote_count,vote_average,popularity,genres
0,Avatar,2009,11800,7,150.437577,"[Action, Adventure, Fantasy, Science Fiction]"
1,Pirates of the Caribbean: At World's End,2007,4500,6,139.082615,"[Adventure, Fantasy, Action]"


Therefore, to qualify for the chart, a movie must have at least **3041 votes** on TMDB (the 95th percentile, m ≈ 3040.9). The mean rating C, computed after the vote averages are truncated to integers, is about **5.65** on a scale of 10. **241** movies qualify to be on our chart.

In [29]:
#Defining the Weighted Average for Simple Recommender
def weighted_rating(x):
    v = x['vote_count']
    R = x['vote_average']
    return (v/(v+m) * R) + (m/(m+v) * C)
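
The function above computes $WR = \frac{v}{v+m} R + \frac{m}{v+m} C$, where $v$ is the movie's vote count, $R$ its vote average, $m$ the vote cutoff and $C$ the mean rating. As a quick sanity check, the sketch below plugs in the values of `m` and `C` printed earlier and the Inception row from the chart. The standalone signature with explicit `m` and `C` arguments and the 50-vote movie are assumptions for illustration.

```python
def weighted_rating(v, R, m, C):
    """Weighted rating with the cutoff m and mean C passed in explicitly."""
    return (v / (v + m)) * R + (m / (v + m)) * C

m = 3040.9               # 95th percentile of vote counts, from the output above
C = 5.6529252550489275   # mean of the truncated vote averages, from the output above

# Inception: 13752 votes, truncated vote average of 8.
inception = weighted_rating(v=13752, R=8, m=m, C=C)
print(round(inception, 3))  # about 7.575, in line with the wr column

# A hypothetical movie with the same rating but only 50 votes is pulled
# strongly towards the mean C.
few_votes = weighted_rating(v=50, R=8, m=m, C=C)
print(round(few_votes, 3))
```

The fewer votes a movie has relative to `m`, the more its score shrinks towards `C`, which is why a handful of enthusiastic votes cannot push an obscure film to the top.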

In [30]:
qualified['wr'] = qualified.apply(weighted_rating, axis=1)

In [31]:
qualified = qualified.sort_values('wr', ascending=False).head(250)

### Top Movies

In [32]:
qualified.head(10)

Unnamed: 0,title,year,vote_count,vote_average,popularity,genres,wr
96,Inception,2010,13752,8,167.58371,"[Action, Thriller, Science Fiction, Mystery, A...",7.574986
65,The Dark Knight,2008,12002,8,187.322927,"[Drama, Action, Crime, Thriller]",7.525542
95,Interstellar,2014,10867,8,724.247784,"[Adventure, Drama, Science Fiction]",7.486823
662,Fight Club,1999,9413,8,146.757391,[Drama],7.426909
262,The Lord of the Rings: The Fellowship of the Ring,2001,8705,8,138.049577,"[Adventure, Fantasy, Action]",7.392365
3232,Pulp Fiction,1994,8428,8,121.463076,"[Thriller, Crime]",7.377689
1881,The Shawshank Redemption,1994,8205,8,136.747729,"[Drama, Crime]",7.365349
329,The Lord of the Rings: The Return of the King,2003,8064,8,123.630332,"[Adventure, Fantasy, Action]",7.357291
809,Forrest Gump,1994,7927,8,138.133331,"[Comedy, Drama, Romance]",7.349263
330,The Lord of the Rings: The Two Towers,2002,7487,8,106.914973,"[Adventure, Fantasy, Action]",7.322066


In [34]:
#qualified.drop(['vote_average','wr'],axis = 1,inplace= True)

We see that three Christopher Nolan Films, **Inception**, **The Dark Knight** and **Interstellar** occur at the very top of our chart. The chart also indicates a strong bias of TMDB Users towards particular genres and directors. 
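
The genre-specific chart mentioned earlier is not implemented in this notebook. A minimal sketch of one way to build it follows, assuming a DataFrame with `title`, `genres` (a list per row), `vote_count` and `vote_average` columns. The function name `build_genre_chart`, the default `percentile=0.85` and the toy data are assumptions, not part of the original code.

```python
import pandas as pd

def build_genre_chart(df, genre, percentile=0.85, top_n=250):
    """Hypothetical helper: rank the movies of one genre by weighted rating."""
    # One row per (movie, genre) pair, then keep only the requested genre.
    exploded = df.explode('genres')
    subset = exploded[exploded['genres'] == genre]
    subset = subset[subset['vote_count'].notnull() & subset['vote_average'].notnull()]

    C = subset['vote_average'].mean()              # mean rating within the genre
    m = subset['vote_count'].quantile(percentile)  # vote cutoff within the genre

    chart = subset[subset['vote_count'] >= m].copy()
    v, R = chart['vote_count'], chart['vote_average']
    chart['wr'] = (v / (v + m)) * R + (m / (v + m)) * C
    return chart.sort_values('wr', ascending=False).head(top_n)

# Toy data standing in for md.
toy = pd.DataFrame({
    'title': ['A', 'B', 'C', 'D'],
    'genres': [['Action'], ['Action', 'Drama'], ['Drama'], ['Action']],
    'vote_count': [1000, 500, 2000, 10],
    'vote_average': [8.0, 6.0, 9.0, 9.5],
})
action_chart = build_genre_chart(toy, 'Action', percentile=0.5)
print(action_chart[['title', 'wr']])
```

Computing `m` and `C` within the genre, rather than reusing the global values, keeps niche genres from being filtered out entirely by a cutoff tuned to blockbusters.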



## Content Based Recommender

The recommender we built in the previous section suffers from some severe limitations. For one, it gives the same recommendation to everyone, regardless of the user's personal taste. If a person who loves romantic movies (and hates action) were to look at our Top 250 Chart, s/he probably wouldn't like most of the movies. If s/he were to go one step further and look at our charts by genre, s/he still wouldn't be getting the best recommendations.

For instance, consider a person who loves *Dilwale Dulhania Le Jayenge*, *My Name is Khan* and *Kabhi Khushi Kabhi Gham*. One inference we can obtain is that the person loves the actor Shahrukh Khan and the director Karan Johar. Even if s/he were to access the romance chart, s/he wouldn't find these as the top recommendations.

To personalise our recommendations more, we are going to build an engine that computes similarity between movies based on certain metrics and suggests movies that are most similar to a particular movie that a user liked. Since we will be using movie metadata (or content) to build this engine, this is also known as **Content Based Filtering.**

I will build two Content Based Recommenders based on:
* Movie Overviews and Taglines
* Movie Cast, Crew, Keywords and Genre

Also, as mentioned in the introduction, I will be using a subset of all the movies available to us due to the limited computing power available to me.

In [35]:
links_small = pd.read_csv('tmdb_5000_movies.csv')
links_small = links_small[links_small['id'].notnull()]['id'].astype('int')

In [36]:
#md = md.drop([19730, 29503, 35587])

In [37]:
md['id'] = md['id'].astype('int')

In [38]:
#links_small.head(2)

In [39]:
smd = md[md['id'].isin(links_small)]
smd.shape

(4803, 21)

In [40]:
smd.head(2)

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,...,release_date,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count,year
0,237000000,"[Action, Adventure, Fantasy, Science Fiction]",http://www.avatarmovie.com/,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...",...,2009-12-10,2787965087,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800,2009
1,300000000,"[Adventure, Fantasy, Action]",http://disney.go.com/disneypictures/pirates/,285,"[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...",en,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...",139.082615,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}, {""...",...,2007-05-19,961000000,169.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"At the end of the world, the adventure begins.",Pirates of the Caribbean: At World's End,6.9,4500,2007


We have **4803** movies available in `smd`. Because the list of IDs was read from the same `tmdb_5000_movies.csv` file, this is every movie in `md` rather than a smaller subset.

### Movie Description Based Recommender

Let us first try to build a recommender using movie descriptions and taglines. We do not have a quantitative metric to judge our machine's performance so this will have to be done qualitatively.

In [41]:
smd['tagline'] = smd['tagline'].fillna('')
smd['description'] = smd['overview'] + smd['tagline']
smd['description'] = smd['description'].fillna('')
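
One caveat about the cell above: adding a missing overview to a tagline gives a missing description, so the tagline is lost, and the two texts are joined without a space between them. The sketch below shows an alternative on toy data. It is an assumption about how one might handle this, not what the notebook runs, and applying it would change the TF-IDF output shown below.

```python
import numpy as np
import pandas as pd

# Toy rows standing in for smd: the second movie has no overview.
toy = pd.DataFrame({
    'overview': ['A marine on Pandora.', np.nan],
    'tagline': ['Enter the World of Pandora.', 'Only a tagline.'],
})

# Fill both columns before concatenating, and separate them with a space
# so the last word of the overview does not merge with the tagline.
description = (toy['overview'].fillna('') + ' ' + toy['tagline'].fillna('')).str.strip()
print(description.tolist())
```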

In [42]:
smd.head(2)

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,...,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count,year,description
0,237000000,"[Action, Adventure, Fantasy, Science Fiction]",http://www.avatarmovie.com/,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...",...,2787965087,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800,2009,"In the 22nd century, a paraplegic Marine is di..."
1,300000000,"[Adventure, Fantasy, Action]",http://disney.go.com/disneypictures/pirates/,285,"[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...",en,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...",139.082615,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}, {""...",...,961000000,169.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"At the end of the world, the adventure begins.",Pirates of the Caribbean: At World's End,6.9,4500,2007,"Captain Barbossa, long believed to be dead, ha..."


In [43]:
tf = TfidfVectorizer(analyzer='word',ngram_range=(1, 2),min_df=0., stop_words='english')
tfidf_matrix = tf.fit_transform(smd['description'])

In [44]:
tfidf_matrix

<4803x149317 sparse matrix of type '<class 'numpy.float64'>'
	with 281824 stored elements in Compressed Sparse Row format>

#### Cosine Similarity

I will be using the Cosine Similarity to calculate a numeric quantity that denotes the similarity between two movies. Mathematically, it is defined as follows:

$\text{cosine}(x, y) = \frac{x \cdot y^\intercal}{\lVert x \rVert \, \lVert y \rVert}$

Because `TfidfVectorizer` L2-normalises each row by default, every non-empty description vector has unit length, so the dot product of two rows is already their cosine similarity. Therefore, we will use sklearn's **linear_kernel** instead of **cosine_similarity**, since it skips the normalisation step and is faster.

In [45]:
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)
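
The reason a plain dot product is enough here can be checked with a small NumPy sketch. The three vectors below are made-up stand-ins for TF-IDF rows; the only assumption carried over from the notebook is that the rows are scaled to unit length, which is what `TfidfVectorizer` does with its default `norm='l2'`.

```python
import numpy as np

def cosine(x, y):
    """Cosine similarity computed directly from the definition."""
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))

# Made-up vectors standing in for three rows of a TF-IDF matrix.
vectors = np.array([
    [1.0, 2.0, 0.0],
    [0.0, 1.0, 3.0],
    [2.0, 0.0, 1.0],
])

# Scale every row to unit L2 norm.
unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)

# On unit-length rows, the matrix of dot products is the cosine similarity matrix.
dot_products = unit @ unit.T
print(np.isclose(dot_products[0, 1], cosine(vectors[0], vectors[1])))
```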

In [46]:
cosine_sim[0]

array([1.        , 0.00449945, 0.        , ..., 0.00272894, 0.        ,
       0.        ])

We now have a pairwise cosine similarity matrix for all the movies in our dataset. The next step is to write a function that returns the 30 most similar movies based on the cosine similarity score.

In [47]:
smd = smd.reset_index()
titles = smd['title']
indices = pd.Series(smd.index, index=smd['title'])

In [48]:
indices

title
Avatar                                         0
Pirates of the Caribbean: At World's End       1
Spectre                                        2
The Dark Knight Rises                          3
John Carter                                    4
                                            ... 
El Mariachi                                 4798
Newlyweds                                   4799
Signed, Sealed, Delivered                   4800
Shanghai Calling                            4801
My Date with Drew                           4802
Length: 4803, dtype: int64

In [49]:
def get_recommendations(title):
    idx = indices[title]
    sim_scores = list(enumerate(cosine_sim[idx]))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    sim_scores = sim_scores[1:31]
    movie_indices = [i[0] for i in sim_scores]
    return titles.iloc[movie_indices]
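
The function above assumes that every title is unique and present in `indices`. If a title occurs more than once, `indices[title]` returns several positions and the lookup no longer behaves as intended. Below is a minimal sketch of a more defensive variant on toy data. The name `top_similar`, its signature and the choice to use the first matching title are assumptions for illustration.

```python
import numpy as np

def top_similar(sim_matrix, titles, title, k=10):
    """Hypothetical variant of get_recommendations working on plain lists."""
    matches = [i for i, t in enumerate(titles) if t == title]
    if not matches:
        raise KeyError(f"title not found: {title!r}")
    idx = matches[0]  # assumption: use the first movie with this title

    scores = np.asarray(sim_matrix[idx], dtype=float)
    order = np.argsort(-scores, kind='stable')     # highest similarity first
    order = [i for i in order if i != idx][:k]     # drop the movie itself
    return [titles[i] for i in order]

# Toy similarity matrix for three movies.
toy_sim = np.array([
    [1.0, 0.2, 0.9],
    [0.2, 1.0, 0.1],
    [0.9, 0.1, 1.0],
])
toy_titles = ['A', 'B', 'C']
recommended = top_similar(toy_sim, toy_titles, 'A', k=2)
print(recommended)
```

Excluding the query movie by index, rather than skipping the first sorted entry, also covers the case where another movie has a similarity of exactly 1.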

We're all set. Let us now try and get the top recommendations for a few movies and see how good the recommendations are.

In [50]:
Recommendations_based_on_a_movie = 'Toy Story 2'
print("Below are the Top 10 Recommendations based on the movie -->"+Recommendations_based_on_a_movie)
get_recommendations(Recommendations_based_on_a_movie).head(10)

Below are the Top 10 Recommendations based on the movie -->Toy Story 2


1541                 Toy Story
42                 Toy Story 3
1191            Small Soldiers
1779    The 40 Year Old Virgin
4387      A LEGO Brickumentary
891            Man on the Moon
787             The Great Raid
3379              Factory Girl
2569               Match Point
2303            The Nutcracker
Name: title, dtype: object

In [51]:
Recommendations_based_on_a_movie = 'Avatar'
print("Below are the Top 10 Recommendations based on the movie -->"+Recommendations_based_on_a_movie)
get_recommendations(Recommendations_based_on_a_movie).head(10)

Below are the Top 10 Recommendations based on the movie -->Avatar


634                                     The Matrix
3604                                     Apollo 18
1341                          The Inhabited Island
529                               Tears of the Sun
369     Lara Croft Tomb Raider: The Cradle of Life
312                                     Green Zone
2130                                  The American
3578                                 House Party 2
775                                      Supernova
2705                                Soul Survivors
Name: title, dtype: object

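We see that for **Toy Story 2**, our system recommends the other two Toy Story films first. The recommendations for **Avatar** are looser: apart from a few science-fiction titles such as The Matrix, several have no obvious connection to it, which shows the limits of matching on description text alone.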

## Conclusion

In this notebook, I have built 2 different recommendation engines based on different ideas and algorithms. They are as follows:

1. **Simple Recommender:** This system used overall TMDB Vote Count and Vote Averages to build Top Movies Charts, in general and for a specific genre. The IMDB Weighted Rating System was used to calculate ratings on which the sorting was finally performed.
2. **Content Based Recommender:** We built content based engines: one that took movie overviews and taglines as input, and another that took metadata such as cast, crew, genre and keywords to come up with predictions. We also devised a simple filter to give greater preference to movies with more votes and higher ratings.


# For Webpage

In [52]:
#import pickle

In [53]:
#pickle.dump(cosine_sim,open('sim.pkl','wb'))

In [54]:
#pickle.dump(qualified.head(20),open('qua.pkl','wb'))

In [55]:
#pickle.dump(md,open('database.pkl','wb'))