In this notebook, I will attempt to implement a few recommendation algorithms (content based, popularity based and collaborative filtering) and try to build an ensemble of these models to come up with our final recommendation system. We have two MovieLens datasets to work with.

The Full Dataset: Consists of 26,000,000 ratings and 750,000 tag applications applied to 45,000 movies by 270,000 users. Includes tag genome data with 12 million relevance scores across 1,100 tags.
The Small Dataset: Comprises 100,000 ratings and 1,300 tag applications applied to 9,000 movies by 700 users.
I will build a Simple Recommender using movies from the Full Dataset, whereas all personalized recommender systems will make use of the small dataset (because the computing power I have is very limited). As a first step, I will build my simple recommender system.

In [2]:
%matplotlib inline

from ast import literal_eval
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.metrics.pairwise import linear_kernel, cosine_similarity
from nltk.stem.snowball import SnowballStemmer
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.corpus import wordnet

import warnings; warnings.simplefilter('ignore')

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

### Simple Recommender
The Simple Recommender offers generalized recommendations to every user based on movie popularity and (sometimes) genre. The basic idea behind this recommender is that movies that are more popular and more critically acclaimed will have a higher probability of being liked by the average audience. This model does not give personalized recommendations based on the user.

The implementation of this model is straightforward. All we have to do is sort our movies based on ratings and popularity and display the top movies of our list. As an added step, we can pass in a genre argument to get the top movies of a particular genre.

In [5]:
mov = pd.read_csv('movies_metadata.csv')
mov.head(6)

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",...,1995-10-30,373554033.0,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415.0
1,False,,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,...,1995-12-15,262797249.0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0
2,False,"{'id': 119050, 'name': 'Grumpy Old Men Collect...",0,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",,15602,tt0113228,en,Grumpier Old Men,A family wedding reignites the ancient feud be...,...,1995-12-22,0.0,101.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,False,6.5,92.0
3,False,,16000000,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...",,31357,tt0114885,en,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom...",...,1995-12-22,81452156.0,127.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Friends are the people who let you be yourself...,Waiting to Exhale,False,6.1,34.0
4,False,"{'id': 96871, 'name': 'Father of the Bride Col...",0,"[{'id': 35, 'name': 'Comedy'}]",,11862,tt0113041,en,Father of the Bride Part II,Just when George Banks has recovered from his ...,...,1995-02-10,76578911.0,106.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Just When His World Is Back To Normal... He's ...,Father of the Bride Part II,False,5.7,173.0
5,False,,60000000,"[{'id': 28, 'name': 'Action'}, {'id': 80, 'nam...",,949,tt0113277,en,Heat,"Obsessive master thief, Neil McCauley leads a ...",...,1995-12-15,187436818.0,170.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,A Los Angeles Crime Saga,Heat,False,7.7,1886.0


I will use the TMDB ratings to build our Top Movies Chart, using IMDB's weighted rating formula to construct the chart.
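The formula itself is not spelled out in this notebook, so the following is reconstructed from the `weightedrating` function defined further down, with $m$ standing for the minimum-votes cutoff (stored in the variable `x` in the code):

$$
WR = \frac{v}{v+m}\,R + \frac{m}{v+m}\,C
$$

Here $v$ is the number of votes for the movie, $R$ is its average rating, and $C$ is the mean rating across all movies. A movie with few votes is pulled towards $C$, while a heavily voted movie keeps a score close to its own $R$.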

In [6]:
mov['genres'] = mov['genres'].fillna('[]').apply(literal_eval).apply(lambda x: [i['name'] for i in x] if isinstance(x, list) else [])
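The one-liner above does a lot in a single chain, so here is a small sketch of the same idea applied to one cell. The helper name `parse_genres` and the sample string are made up for illustration; the string only mimics the shape of the `genres` column.

```python
from ast import literal_eval

def parse_genres(raw):
    """Turn a stringified list of {'id', 'name'} dicts into a list of genre names."""
    # Missing cells (NaN) are treated as an empty list, like fillna('[]') above
    parsed = literal_eval(raw if isinstance(raw, str) else '[]')
    if not isinstance(parsed, list):
        return []
    return [entry['name'] for entry in parsed]

# Sample cell in the same shape as the 'genres' column (illustrative only)
raw_cell = "[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}]"
genre_names = parse_genres(raw_cell)
missing = parse_genres(float('nan'))

print(genre_names)
print(missing)
```

`literal_eval` only evaluates Python literals, so it is a safer choice than `eval` for parsing strings that come from a data file.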

The next step is to determine an appropriate value for m, the minimum number of votes required to be listed in the chart (stored in the variable x in the code below). We will use the 95th percentile as our cutoff. In other words, for a movie to feature in the chart, it must have more votes than at least 95% of the movies in the list.
I will build our overall Top 250 Chart and will define a function to build charts for a particular genre.

In [7]:
votecounts = mov[mov['vote_count'].notnull()]['vote_count'].astype('int')
voteaverages = mov[mov['vote_average'].notnull()]['vote_average'].astype('int')
C = voteaverages.mean()
C


5.244896612406511

In [8]:
x = votecounts.quantile(0.95)
x

434.0

In [9]:
# Extract the release year; rows with a missing or unparseable date get NaN
mov['year'] = pd.to_datetime(mov['release_date'], errors='coerce').apply(lambda d: str(d).split('-')[0] if pd.notnull(d) else np.nan)

In [10]:
qualify = mov[(mov['vote_count'] >= x) & (mov['vote_count'].notnull()) & (mov['vote_average'].notnull())][['title', 'year', 'vote_count', 'vote_average', 'popularity', 'genres']]
qualify['vote_count'] = qualify['vote_count'].astype('int')
qualify['vote_average'] = qualify['vote_average'].astype('int')
qualify.shape

(2274, 6)

Therefore, to qualify for the chart, a movie must have at least 434 votes on TMDB. We also see that the average rating for a movie on TMDB is 5.244 on a scale of 10. 2274 movies qualify to be on our chart.
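To make these numbers concrete, here is a minimal sketch of how the weighted rating blends a movie's own average with the global mean. The standalone function `weighted_rating` is written for illustration only; the notebook's own version is the `weightedrating` function in the next cell. The second movie is hypothetical.

```python
def weighted_rating(v, R, m, C):
    """IMDB-style weighted rating: blends a movie's own average R with the global mean C."""
    return (v / (v + m)) * R + (m / (v + m)) * C

m = 434.0                 # 95th-percentile vote count reported above
C = 5.244896612406511     # mean vote average reported above

# Inception as listed in the chart: 14075 votes, average of 8 after the int cast
popular = weighted_rating(v=14075, R=8, m=m, C=C)

# Hypothetical movie with the same average but only just enough votes to qualify
barely = weighted_rating(v=434, R=8, m=m, C=C)

print(round(popular, 4), round(barely, 4))
```

With these inputs the heavily voted movie stays close to its own rating (about 7.92), while the barely qualifying one lands halfway between 8 and the global mean (about 6.62).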

In [11]:
def weightedrating(y):
    v = y['vote_count']
    R = y['vote_average']
    return (v/(v+x) * R) + (x/(x+v) * C)

In [13]:
qualify['wr'] = qualify.apply(weightedrating, axis=1)

In [15]:
qualify = qualify.sort_values('wr', ascending=False)

Top Movies

In [17]:
qualify.head(15)

Unnamed: 0,title,year,vote_count,vote_average,popularity,genres,wr
15480,Inception,2010,14075,8,29.1081,"[Action, Thriller, Science Fiction, Mystery, A...",7.917588
12481,The Dark Knight,2008,12269,8,123.167,"[Drama, Action, Crime, Thriller]",7.905871
22879,Interstellar,2014,11187,8,32.2135,"[Adventure, Drama, Science Fiction]",7.897107
2843,Fight Club,1999,9678,8,63.8696,[Drama],7.881753
4863,The Lord of the Rings: The Fellowship of the Ring,2001,8892,8,32.0707,"[Adventure, Fantasy, Action]",7.871787
292,Pulp Fiction,1994,8670,8,140.95,"[Thriller, Crime]",7.86866
314,The Shawshank Redemption,1994,8358,8,51.6454,"[Drama, Crime]",7.864
7000,The Lord of the Rings: The Return of the King,2003,8226,8,29.3244,"[Adventure, Fantasy, Action]",7.861927
351,Forrest Gump,1994,8147,8,48.3072,"[Comedy, Drama, Romance]",7.860656
5814,The Lord of the Rings: The Two Towers,2002,7641,8,29.4235,"[Adventure, Fantasy, Action]",7.851924


We see that three Christopher Nolan films, Inception, The Dark Knight and Interstellar, occur at the very top of our chart. The chart also indicates a strong bias of TMDB users towards particular genres and directors.
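The text earlier mentions a function for building charts for a particular genre. A minimal sketch of what such a function could look like is given here. The name `build_genre_chart`, the choice to compute the cutoff and the mean within the genre rather than over all movies, and the toy data are all assumptions made for illustration.

```python
import pandas as pd

def build_genre_chart(movies, genre, percentile=0.95):
    """Rank the movies of one genre by weighted rating (hypothetical helper)."""
    # Keep only rows whose genre list contains the requested genre
    subset = movies[movies['genres'].apply(lambda names: genre in names)]
    subset = subset[subset['vote_count'].notnull() & subset['vote_average'].notnull()]

    C = subset['vote_average'].mean()              # mean rating within the genre
    m = subset['vote_count'].quantile(percentile)  # vote cutoff within the genre

    chart = subset[subset['vote_count'] >= m].copy()
    v, R = chart['vote_count'], chart['vote_average']
    chart['wr'] = (v / (v + m)) * R + (m / (v + m)) * C
    return chart.sort_values('wr', ascending=False)

# Toy data, invented purely for illustration
toy = pd.DataFrame({
    'title': ['A', 'B', 'C', 'D'],
    'genres': [['Drama'], ['Drama', 'Crime'], ['Comedy'], ['Drama']],
    'vote_count': [1000, 50, 700, 400],
    'vote_average': [8.0, 9.0, 7.0, 6.0],
})
drama_chart = build_genre_chart(toy, 'Drama', percentile=0.5)
print(drama_chart[['title', 'wr']])
```

In the toy data, movie B has the highest average but too few votes to pass the cutoff, which is exactly the kind of entry the vote threshold is meant to filter out.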

Content Based Recommender

To personalize our recommendations more, I am going to build an engine that computes similarity between movies based on certain metrics and suggests movies that are most similar to a particular movie that a user liked. Since we will be using movie metadata (or content) to build this engine, this is also known as Content Based Filtering.

I will build two Content Based Recommenders based on:

Movie Overviews and Taglines

Movie Cast, Crew, Keywords and Genre

Also, as mentioned in the introduction, I will be using a subset of all the movies available to us because of the limited computing power available to me.

In [21]:
linkssmall = pd.read_csv('links_small.csv')
linkssmall = linkssmall[linkssmall['tmdbId'].notnull()]['tmdbId'].astype('int')

In [22]:
mov = mov.drop([19730, 29503, 35587])

In [23]:
mov['id'] = mov['id'].astype('int')

In [25]:
# Take a copy so the column assignments below do not operate on a slice of mov
sm = mov[mov['id'].isin(linkssmall)].copy()
sm.shape

(9099, 25)

We have 9099 movies available in our small movies metadata dataset, which is about 5 times smaller than our original dataset of 45000 movies.

Movie Description Based Recommender
Let us first try to build a recommender using movie descriptions and taglines. We do not have a quantitative metric to judge our machine's performance so this will have to be done qualitatively.

In [26]:
sm['tagline'] = sm['tagline'].fillna('')
sm['description'] = sm['overview'] + sm['tagline']
sm['description'] = sm['description'].fillna('')

In [27]:
# min_df=1 keeps every term; recent scikit-learn versions reject an integer min_df of 0
tf = TfidfVectorizer(analyzer='word', ngram_range=(1, 2), min_df=1, stop_words='english')
tfidfmatrix = tf.fit_transform(sm['description'])

In [29]:
tfidfmatrix.shape

(9099, 268124)

Implementing cosine similarity

In [30]:
cosinesim = linear_kernel(tfidfmatrix, tfidfmatrix)
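The cell above calls `linear_kernel` even though the goal is cosine similarity. This works because `TfidfVectorizer` scales every row to unit L2 norm by default, so the plain dot product of two rows already equals their cosine similarity. The sketch below checks this on a tiny made-up corpus that stands in for the movie descriptions.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Tiny made-up corpus standing in for the movie descriptions
docs = [
    "masked vigilante protects gotham city",
    "vigilante returns to gotham after years away",
    "mafia family struggles to keep power",
]

tfidf = TfidfVectorizer(stop_words='english').fit_transform(docs)

# Each row has unit L2 norm because TfidfVectorizer defaults to norm='l2'
row_norms = np.sqrt(tfidf.multiply(tfidf).sum(axis=1))

dot = linear_kernel(tfidf, tfidf)      # plain dot products
cos = cosine_similarity(tfidf, tfidf)  # normalizes first, then dot products
same = np.allclose(dot, cos)
print(same)
```

Skipping the extra normalization step is the reason `linear_kernel` is the cheaper call here. If the vectorizer were created with `norm=None`, the two functions would no longer agree.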

In [32]:
sm = sm.reset_index()
titles = sm['title']
indices = pd.Series(sm.index, index=sm['title'])

In [36]:
def getrecommendations(title):
    idx = indices[title]
    sim_scores = list(enumerate(cosinesim[idx]))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    sim_scores = sim_scores[1:31]
    movie_indices = [i[0] for i in sim_scores]
    return titles.iloc[movie_indices]
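One thing to watch for in the function above: if a title occurs more than once in the dataset, `indices[title]` returns a Series rather than a single position, and the sorting step fails. A hedged sketch of a more defensive variant follows. The name `recommend`, its extra parameters, and the toy similarity matrix are assumptions made for illustration and are not part of the original notebook.

```python
import numpy as np
import pandas as pd

def recommend(title, titles, indices, sim, n=10):
    """Return the n titles most similar to `title` (hypothetical variant of getrecommendations)."""
    idx = indices[title]
    # A duplicated title yields a Series of positions; fall back to the first match
    if isinstance(idx, pd.Series):
        idx = idx.iloc[0]
    order = np.argsort(-sim[idx], kind='stable')   # most similar first
    order = order[order != idx][:n]                # drop the query movie itself
    return titles.iloc[order]

# Toy data invented for illustration: 'Batman' appears twice
titles = pd.Series(['Batman', 'Batman Returns', 'Heat', 'Batman'])
indices = pd.Series(titles.index, index=titles)
sim = np.array([
    [1.0, 0.8, 0.1, 0.9],
    [0.8, 1.0, 0.2, 0.7],
    [0.1, 0.2, 1.0, 0.1],
    [0.9, 0.7, 0.1, 1.0],
])
recs = recommend('Batman', titles, indices, sim, n=2)
print(list(recs))
```

Unlike the original, which skips the first sorted entry on the assumption that it is the query movie, this variant removes the query by position, so a tie at the top of the ranking cannot cause the wrong row to be dropped.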

Let us now get the top recommendations for a few movies and see how good the recommendations are.

In [37]:
getrecommendations('The Godfather').head(10)

973      The Godfather: Part II
8387                 The Family
3509                       Made
4196         Johnny Dangerously
29               Shanghai Triad
5667                       Fury
2412             American Movie
1582    The Godfather: Part III
4221                    8 Women
2159              Summer of Sam
Name: title, dtype: object

In [38]:
getrecommendations('The Dark Knight').head(10)

7931                      The Dark Knight Rises
132                              Batman Forever
1113                             Batman Returns
8227    Batman: The Dark Knight Returns, Part 2
7565                 Batman: Under the Red Hood
524                                      Batman
7901                           Batman: Year One
2579               Batman: Mask of the Phantasm
2696                                        JFK
8165    Batman: The Dark Knight Returns, Part 1
Name: title, dtype: object

We see that for The Dark Knight, our system is able to identify it as a Batman film and subsequently recommend other Batman films as its top recommendations. Unfortunately, that is all this system can do at the moment. This is not of much use to most people, as it does not take into consideration very important features such as cast, crew, director and genre, which determine the rating and the popularity of a movie. Someone who liked The Dark Knight probably liked it more because of Nolan and would hate Batman Forever and every other substandard movie in the Batman franchise.