# 1. Import Libraries

In [10]:


import matplotlib.pyplot as plt

import seaborn as sns
import pandas as pd
import numpy as np
import ast
from scipy import stats
from ast import literal_eval


from sklearn.feature_extraction.text import TfidfVectorizer,CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity


from nltk.stem.snowball import SnowballStemmer
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.corpus import wordnet

from surprise import Reader, Dataset,SVD


# 2. Load Dataset

In [17]:
credits = pd.read_csv(r".\movie_dataset\credits.csv")
keywords = pd.read_csv(r".\movie_dataset\keywords.csv")
links_small = pd.read_csv(r".\movie_dataset\links_small.csv")
md = pd.read_csv(r".\movie_dataset\movies_metadata.csv")
ratings = pd.read_csv(r".\movie_dataset\ratings.csv")




# 3. Understand dataset

### Credits Dataframe

In [22]:
credits.head(1).T

Unnamed: 0,0
cast,"[{'cast_id': 14, 'character': 'Woody (voice)',..."
crew,"[{'credit_id': '52fe4284c3a36847f8024f49', 'de..."
id,862


In [23]:
credits.columns

Index(['cast', 'crew', 'id'], dtype='object')

* **cast:** Information about the cast: each actor's name, gender, and the character they play in the movie.
* **crew:** Information about crew members, such as the director, the editor, and so on.
* **id:** The movie ID assigned by TMDb.

In [24]:
credits.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45476 entries, 0 to 45475
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   cast    45476 non-null  object
 1   crew    45476 non-null  object
 2   id      45476 non-null  int64 
dtypes: int64(1), object(2)
memory usage: 1.0+ MB


### Keywords Dataframe

In [26]:
keywords.head()

Unnamed: 0,id,keywords
0,862,"[{'id': 931, 'name': 'jealousy'}, {'id': 4290,..."
1,8844,"[{'id': 10090, 'name': 'board game'}, {'id': 1..."
2,15602,"[{'id': 1495, 'name': 'fishing'}, {'id': 12392..."
3,31357,"[{'id': 818, 'name': 'based on novel'}, {'id':..."
4,11862,"[{'id': 1009, 'name': 'baby'}, {'id': 1599, 'n..."


In [27]:
keywords.columns

Index(['id', 'keywords'], dtype='object')

* **id:** The movie ID assigned by TMDb.
* **keywords:** A list of tags/keywords for the movie.

In [28]:
keywords.shape

(46419, 2)

In [29]:
keywords.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 46419 entries, 0 to 46418
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   id        46419 non-null  int64 
 1   keywords  46419 non-null  object
dtypes: int64(1), object(1)
memory usage: 725.4+ KB


### Link Dataframe

In [32]:
links_small.head()

Unnamed: 0,movieId,imdbId,tmdbId
0,1,114709,862.0
1,2,113497,8844.0
2,3,113228,15602.0
3,4,114885,31357.0
4,5,113041,11862.0


In [34]:
links_small.columns

Index(['movieId', 'imdbId', 'tmdbId'], dtype='object')

* **movieId:** A serial number for the movie (the same `movieId` key that appears in the ratings dataframe).
* **imdbId:** The movie ID on the IMDb platform.
* **tmdbId:** The movie ID on the TMDb platform.

In [35]:
links_small.shape

(9125, 3)

In [36]:
links_small.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9125 entries, 0 to 9124
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   movieId  9125 non-null   int64  
 1   imdbId   9125 non-null   int64  
 2   tmdbId   9112 non-null   float64
dtypes: float64(1), int64(2)
memory usage: 214.0 KB


### Metadata Dataframe

In [38]:
md.iloc[0:3].transpose()

Unnamed: 0,0,1,2
adult,False,False,False
belongs_to_collection,"{'id': 10194, 'name': 'Toy Story Collection', ...",,"{'id': 119050, 'name': 'Grumpy Old Men Collect..."
budget,30000000,65000000,0
genres,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...","[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...","[{'id': 10749, 'name': 'Romance'}, {'id': 35, ..."
homepage,http://toystory.disney.com/toy-story,,
id,862,8844,15602
imdb_id,tt0114709,tt0113497,tt0113228
original_language,en,en,en
original_title,Toy Story,Jumanji,Grumpier Old Men
overview,"Led by Woody, Andy's toys live happily in his ...",When siblings Judy and Peter discover an encha...,A family wedding reignites the ancient feud be...


In [39]:
md.columns

Index(['adult', 'belongs_to_collection', 'budget', 'genres', 'homepage', 'id',
       'imdb_id', 'original_language', 'original_title', 'overview',
       'popularity', 'poster_path', 'production_companies',
       'production_countries', 'release_date', 'revenue', 'runtime',
       'spoken_languages', 'status', 'tagline', 'title', 'video',
       'vote_average', 'vote_count'],
      dtype='object')

**Features**

* **adult:** Indicates if the movie is X-Rated or Adult.
* **belongs_to_collection:** A stringified dictionary that gives information on the movie series the particular film belongs to.
* **budget:** The budget of the movie in dollars.
* **genres:** A stringified list of dictionaries that list out all the genres associated with the movie.
* **homepage:** The official homepage of the movie.
* **id:** The ID of the movie.
* **imdb_id:** The IMDB ID of the movie.
* **original_language:** The language in which the movie was originally shot.
* **original_title:** The original title of the movie.
* **overview:** A brief blurb of the movie.
* **popularity:** The Popularity Score assigned by TMDB.
* **poster_path:** The URL of the poster image.
* **production_companies:** A stringified list of production companies involved with the making of the movie.
* **production_countries:** A stringified list of countries in which the movie was shot or produced.
* **release_date:** Theatrical Release Date of the movie.
* **revenue:** The total revenue of the movie in dollars.
* **runtime:** The runtime of the movie in minutes.
* **spoken_languages:** A stringified list of spoken languages in the film.
* **status:** The status of the movie (Released, To Be Released, Announced, etc.)
* **tagline:** The tagline of the movie.
* **title:** The Official Title of the movie.
* **video:** Indicates if there is a video present of the movie with TMDB.
* **vote_average:** The average rating of the movie.
* **vote_count:** The number of votes by users, as counted by TMDB.


In [40]:
md.shape

(45466, 24)

In [41]:
md.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45466 entries, 0 to 45465
Data columns (total 24 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   adult                  45466 non-null  object 
 1   belongs_to_collection  4494 non-null   object 
 2   budget                 45466 non-null  object 
 3   genres                 45466 non-null  object 
 4   homepage               7782 non-null   object 
 5   id                     45466 non-null  object 
 6   imdb_id                45449 non-null  object 
 7   original_language      45455 non-null  object 
 8   original_title         45466 non-null  object 
 9   overview               44512 non-null  object 
 10  popularity             45461 non-null  object 
 11  poster_path            45080 non-null  object 
 12  production_companies   45463 non-null  object 
 13  production_countries   45463 non-null  object 
 14  release_date           45379 non-null  object 
 15  re

### Ratings dataframe

In [42]:
ratings.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,110,1.0,1425941529
1,1,147,4.5,1425942435
2,1,858,5.0,1425941523
3,1,1221,5.0,1425941546
4,1,1246,5.0,1425941556


In [43]:
ratings.columns

Index(['userId', 'movieId', 'rating', 'timestamp'], dtype='object')

* **userId:** The ID of the user.
* **movieId:** The movie ID. It matches `movieId` in the links dataframe, which maps it to the TMDb ID (`tmdbId`).
* **rating:** The rating the user gave the movie.
* **timestamp:** The time at which the rating was given.

In [45]:
ratings.shape

(26024289, 4)

In [46]:
ratings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26024289 entries, 0 to 26024288
Data columns (total 4 columns):
 #   Column     Dtype  
---  ------     -----  
 0   userId     int64  
 1   movieId    int64  
 2   rating     float64
 3   timestamp  int64  
dtypes: float64(1), int64(3)
memory usage: 794.2 MB


# 4. Data Wrangling

* The data comes from The Movie Database ([TMDb](https://www.themoviedb.org/movie/269149-zootopia?language=en)).
* The data has already been converted from JSON to CSV format.

# 5. Pre-processing

We will perform pre-processing as and when needed throughout the notebook.

# 6. Build recommendation system

### 6.1. Simple recommendation system

**Approach:**

* The Simple Recommender offers __generalized recommendations__ to every user __based on movie popularity and (sometimes) genre__. 

* The __basic idea__ behind this recommender is that __movies that are more popular and more critically acclaimed will have a higher probability of being liked by the average audience.__ 

* This model __does not give personalized recommendations__ based on the user.



**What we are actually doing:**

* The implementation of this model is extremely trivial. 
* All we have to do is __sort our movies based on ratings and popularity__ and display the top movies of our list. 
* As an added step, we can __pass in a genre argument to get the top movies of a particular genre.__

I will build our overall Top 250 Chart and will define a function to build charts for a particular genre. Let's begin!

In [48]:
md['genres'] =md['genres'].fillna('[]').apply(literal_eval).apply(lambda x : [i['name'] for i in x] if isinstance(x,list) else [])
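
To make the parsing step above concrete, here is a minimal sketch on a single hand-written string in the same shape as the `genres` column. The sample string and the helper name `extract_names` are illustrative assumptions, not part of the dataset or the notebook.

```python
from ast import literal_eval

def extract_names(raw):
    """Parse a stringified list of dicts and return the 'name' values."""
    # literal_eval safely turns the string into Python objects (no code execution)
    parsed = literal_eval(raw) if isinstance(raw, str) else raw
    # Anything that is not a list (e.g. a missing value) yields an empty list
    return [item['name'] for item in parsed] if isinstance(parsed, list) else []

sample = "[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}]"
names = extract_names(sample)
empty = extract_names('[]')
print(names, empty)
```

Keeping only the names makes the later genre filtering a simple membership test on a list of strings.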

* I use the TMDB Ratings to come up with our Top Movies Chart. 
* I will use IMDB's weighted rating formula to construct my chart.
* Mathematically, it is represented as follows:



$\large \text{Weighted Rating (WR)} = \left(\frac{v}{v + m} \cdot R\right) + \left(\frac{m}{v + m} \cdot C\right)$
```
where,
    v is the number of votes for the movie
    m is the minimum votes required to be listed in the chart
    R is the average rating of the movie
    C is the mean vote across the whole report
```

In [50]:
vote_counts = md[md['vote_count'].notnull()]['vote_count'].astype('int')

vote_averages = md[md['vote_average'].notnull()]['vote_average'].astype('int')

C = vote_averages.mean()
C

5.244896612406511

* Next, we need to determine an appropriate value for `m`, the minimum number of votes required to be listed in the chart.

* We will use the 95th percentile as our cutoff. In other words, for a movie to feature in the chart, it must have more votes than at least 95% of the movies in the list.

In [51]:
m = vote_counts.quantile(0.95)
m

434.0
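
As a rough illustration of how a percentile cutoff behaves, the sketch below applies `Series.quantile` to a small set of made-up vote counts. The numbers and the names `toy_votes`, `toy_m` and `toy_qualified` are hypothetical, and the 80th percentile is used only so that the toy example keeps more than one movie.

```python
import pandas as pd

# Hypothetical vote counts for ten movies (not taken from the dataset)
toy_votes = pd.Series([0, 2, 5, 10, 20, 50, 100, 400, 1200, 9000])

# Cutoff at the 80th percentile (pandas interpolates linearly between neighbours)
toy_m = toy_votes.quantile(0.80)

# Movies at or above the cutoff qualify, mirroring a `>= m` filter
toy_qualified = toy_votes[toy_votes >= toy_m].tolist()
print(toy_m, toy_qualified)
```

Because vote counts are heavily skewed, a high percentile removes most movies and leaves only the widely rated ones.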

In [52]:
# Note: `x != np.nan` is always True, so unparseable dates become the string 'NaT', not NaN.
md['year'] = pd.to_datetime(md['release_date'], errors='coerce').apply(lambda x: str(x).split('-')[0] if x != np.nan else np.nan)
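
One caveat about the comparison in the cell above: NaN never compares equal to anything, so a check of the form `x != np.nan` passes for every value, missing or not. The sketch below shows this on two made-up dates and contrasts it with `pd.notna`. The sample dates and variable names are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Two hypothetical release dates: one valid, one unparseable
dates = pd.to_datetime(pd.Series(['1995-10-30', 'not a date']), errors='coerce')

# NaN never compares equal to anything, so this check is True for every value
checks = [x != np.nan for x in dates]

# pd.notna recognises NaT, so the missing date stays missing
years = [str(x.year) if pd.notna(x) else np.nan for x in dates]
print(checks, years)
```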

In [53]:
qualified =md[(md['vote_count'] >= m) &
             (md['vote_count'].notnull()) &
             (md['vote_average'].notnull())][['title','year','vote_count'
                                             ,'vote_average','popularity'
                                             ,'genres']]

qualified['vote_count'] = qualified['vote_count'].astype('int')
qualified['vote_average'] = qualified['vote_average'].astype('int')

qualified.shape

(2274, 6)

* Therefore, to qualify for the chart, a movie must have at least __434 votes__ on TMDB.

* We also see that the __average rating__ for __a movie on TMDB__ is about __5.24 on a scale of 10__ (computed from vote averages truncated to integers).

* Here, only __2274 movies__ qualify to be on our chart.

In [54]:
def weighted_rating(x):
    v = x['vote_count']
    R = x['vote_average']
    return (v/(v+m)*R + m/(m+v)*C)
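
As a sanity check on the formula, the standalone sketch below plugs in the `m` and `C` values reported above. The function name `weighted_rating_value` and the 500-vote movie are hypothetical and used only for illustration.

```python
# Values reported earlier in this notebook
m = 434.0                  # 95th-percentile vote count
C = 5.244896612406511      # mean of the integer-truncated vote averages

def weighted_rating_value(v, R, m=m, C=C):
    """IMDB-style weighted rating for a movie with v votes and average rating R."""
    return (v / (v + m)) * R + (m / (v + m)) * C

# Illustrative inputs: 14075 votes with an average of 8, versus a hypothetical 500 votes
wr_many_votes = weighted_rating_value(14075, 8)
wr_few_votes = weighted_rating_value(500, 8)
print(round(wr_many_votes, 6), round(wr_few_votes, 6))
```

With the same average rating, the movie with fewer votes is pulled further towards the global mean `C`, which is the point of the weighting.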

In [55]:
qualified['wr'] = qualified.apply(weighted_rating,axis=1)

In [56]:
qualified = qualified.sort_values('wr',ascending=False).head(250)

### Top Movies

In [57]:
qualified.head(15)

Unnamed: 0,title,year,vote_count,vote_average,popularity,genres,wr
15480,Inception,2010,14075,8,29.1081,"[Action, Thriller, Science Fiction, Mystery, A...",7.917588
12481,The Dark Knight,2008,12269,8,123.167,"[Drama, Action, Crime, Thriller]",7.905871
22879,Interstellar,2014,11187,8,32.2135,"[Adventure, Drama, Science Fiction]",7.897107
2843,Fight Club,1999,9678,8,63.8696,[Drama],7.881753
4863,The Lord of the Rings: The Fellowship of the Ring,2001,8892,8,32.0707,"[Adventure, Fantasy, Action]",7.871787
292,Pulp Fiction,1994,8670,8,140.95,"[Thriller, Crime]",7.86866
314,The Shawshank Redemption,1994,8358,8,51.6454,"[Drama, Crime]",7.864
7000,The Lord of the Rings: The Return of the King,2003,8226,8,29.3244,"[Adventure, Fantasy, Action]",7.861927
351,Forrest Gump,1994,8147,8,48.3072,"[Comedy, Drama, Romance]",7.860656
5814,The Lord of the Rings: The Two Towers,2002,7641,8,29.4235,"[Adventure, Fantasy, Action]",7.851924


* We see that three Christopher Nolan films, __Inception__, __The Dark Knight__ and __Interstellar__, occur at the very top of our chart.

* The chart also indicates a strong bias of TMDB users towards particular genres and directors.

In [58]:
'''
>>> s
     a   b
one  1.  2.
two  3.  4.

>>> s.stack()
one a    1
    b    2
two a    3
    b    4
'''

s = md.apply(lambda x: pd.Series(x['genres']),axis=1).stack().reset_index(level=1,drop=True)
s.name ='genre'

gen_md = md.drop('genres',axis=1).join(s)
gen_md.head(3).transpose()

  


Unnamed: 0,0,0.1,0.2
adult,False,False,False
belongs_to_collection,"{'id': 10194, 'name': 'Toy Story Collection', ...","{'id': 10194, 'name': 'Toy Story Collection', ...","{'id': 10194, 'name': 'Toy Story Collection', ..."
budget,30000000,30000000,30000000
homepage,http://toystory.disney.com/toy-story,http://toystory.disney.com/toy-story,http://toystory.disney.com/toy-story
id,862,862,862
imdb_id,tt0114709,tt0114709,tt0114709
original_language,en,en,en
original_title,Toy Story,Toy Story,Toy Story
overview,"Led by Woody, Andy's toys live happily in his ...","Led by Woody, Andy's toys live happily in his ...","Led by Woody, Andy's toys live happily in his ..."
popularity,21.9469,21.9469,21.9469
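
The reshaping above produces one row per (movie, genre) pair. A comparable result can be sketched with `DataFrame.explode` (available in pandas 0.25 and later), shown here on a toy frame. The frame `toy` and its contents are made up for illustration, and this is an alternative sketch rather than what the notebook runs.

```python
import pandas as pd

# Hypothetical two-movie frame with already-parsed genre lists
toy = pd.DataFrame({
    'title': ['Toy Story', 'Heat'],
    'genres': [['Animation', 'Comedy', 'Family'], ['Action', 'Crime']],
})

# One row per (movie, genre) pair; the original row index is repeated
toy_long = toy.explode('genres').rename(columns={'genres': 'genre'})
print(toy_long)
```

Filtering such a long-format frame on `genre` then selects every movie tagged with that genre, however many genres it has.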


In [59]:
def build_chart(genre, percentile=0.85):
    df = gen_md[gen_md['genre']== genre]
    vote_counts = df[df['vote_count'].notnull()]['vote_count'].astype('int')
    vote_averages =df[df['vote_average'].notnull()]['vote_average'].astype('int')
    C= vote_averages.mean()
    m = vote_counts.quantile(percentile)
    
    qualified =df[(df['vote_count'] >= m) &
                 (df['vote_count'].notnull()) &
                 (df['vote_average'].notnull())][['title','year', 'vote_count', 'vote_average', 'popularity']]
    
    qualified['vote_count'] = qualified['vote_count'].astype('int')
    qualified['vote_average'] = qualified['vote_average'].astype('int')
    
    qualified['wr'] = qualified.apply(lambda x : (x['vote_count']/(x['vote_count']+m) * x['vote_average']) + (m/(m+x['vote_count'])* C),axis=1)
    qualified = qualified.sort_values('wr', ascending=False).head(250)
    
    return qualified

Let us see our method in action by displaying the __Top 15 Romance Movies__ (Romance almost didn't feature at all in our Generic Top Chart despite being one of the most popular movie genres).


**Top 15 Romantic Movies**

In [60]:
build_chart('Romance').head(15)

Unnamed: 0,title,year,vote_count,vote_average,popularity,wr
10309,Dilwale Dulhania Le Jayenge,1995,661,9,34.457,8.565285
351,Forrest Gump,1994,8147,8,48.3072,7.971357
876,Vertigo,1958,1162,8,18.2082,7.811667
40251,Your Name.,2016,1030,8,34.461252,7.789489
883,Some Like It Hot,1959,835,8,11.8451,7.745154
1132,Cinema Paradiso,1988,834,8,14.177,7.744878
19901,Paperman,2012,734,8,7.19863,7.713951
37863,Sing Street,2016,669,8,10.672862,7.689483
882,The Apartment,1960,498,8,11.9943,7.599317
38718,The Handmaiden,2016,453,8,16.727405,7.566166


### 6.2. Content based recommendation system

In [63]:
links_small = links_small[links_small['tmdbId'].notnull()]['tmdbId'].astype('int')

In [64]:
def convert_int(x):
    try:
        return int(x)
    except (ValueError, TypeError):
        return np.nan
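
To show what this kind of helper does with well-formed and malformed IDs, here is a small self-contained sketch. The name `to_int_or_nan` and the malformed sample inputs are assumptions for illustration and are not taken from the dataset.

```python
import math

def to_int_or_nan(value):
    """Return int(value), or NaN when the value cannot be parsed as an integer."""
    try:
        return int(value)
    except (ValueError, TypeError):
        return math.nan

# '862' is a well-formed ID; the other inputs are hypothetical malformed values
results = [to_int_or_nan(v) for v in ['862', '1997-08-20', None]]
print(results)
```

Rows whose ID comes back as NaN can then be located with `isnull()` and dropped.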

In [65]:
md['id'] = md['id'].apply(convert_int)
md[md['id'].isnull()]

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count,year
19730,- Written by Ørnås,0.065736,/ff9qCepilowshEtG2GYWwzt2bs4.jpg,"[Carousel Productions, Vision View Entertainme...","[{'iso_3166_1': 'CA', 'name': 'Canada'}, {'iso...",,0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,...,,,,,,,,,,NaT
29503,Rune Balot goes to a casino connected to the ...,1.931659,/zV8bHuSL6WXoD6FWogP9j4x80bL.jpg,"[Aniplex, GoHands, BROSTA TV, Mardock Scramble...","[{'iso_3166_1': 'US', 'name': 'United States o...",,0,68.0,"[{'iso_639_1': 'ja', 'name': '日本語'}]",Released,...,,,,,,,,,,NaT
35587,Avalanche Sharks tells the story of a bikini ...,2.185485,/zaSf5OG7V8X8gqFvly88zDdRm46.jpg,"[Odyssey Media, Pulser Productions, Rogue Stat...","[{'iso_3166_1': 'CA', 'name': 'Canada'}]",,0,82.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,...,,,,,,,,,,NaT


In [66]:
md =md.drop([19730,29503,35587])

In [67]:
md['id'] = md['id'].astype('int')

In [68]:
smd = md[md['id'].isin(links_small)].copy()
smd.shape

(9099, 25)

We have __9099 movies__ available in our small movies metadata dataset, which is about 5 times smaller than our original dataset of roughly 45,000 movies.

### Content based recommendation system: Using movie description and taglines

* Let us first try to build a recommender using movie description and taglines

* We do not have a quantitative metric to judge our machine's performance so this will have to be done qualitatively

In [71]:
smd['tagline'] =smd['tagline'].fillna('')
smd['description'] = smd['overview'] + smd['tagline']
smd['description'] =smd['description'].fillna('')



In [73]:
smd.T

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,39752,39771,39784,40004,40058,40224,40503,44821,44826,45265
adult,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
belongs_to_collection,"{'id': 10194, 'name': 'Toy Story Collection', ...",,"{'id': 119050, 'name': 'Grumpy Old Men Collect...",,"{'id': 96871, 'name': 'Father of the Bride Col...",,,,,"{'id': 645, 'name': 'James Bond Collection', '...",...,,"{'id': 286023, 'name': 'Sharknado Collection',...",,,,,,"{'id': 34055, 'name': 'Pokémon Collection', 'p...","{'id': 34055, 'name': 'Pokémon Collection', 'p...",
budget,30000000,65000000,0,16000000,0,60000000,58000000,0,35000000,58000000,...,0,0,8000000,1000000,15050000,15000000,0,16000000,0,0
genres,"[Animation, Comedy, Family]","[Adventure, Fantasy, Family]","[Romance, Comedy]","[Comedy, Drama, Romance]",[Comedy],"[Action, Crime, Drama, Thriller]","[Comedy, Romance]","[Action, Adventure, Drama, Family]","[Action, Adventure, Thriller]","[Adventure, Action, Thriller]",...,"[Drama, Thriller]","[Comedy, Horror, Science Fiction]",[Drama],"[Thriller, Romance]","[Adventure, Drama, History, Romance]","[Action, Adventure, Drama, Horror, Science Fic...","[Documentary, Music]","[Adventure, Fantasy, Animation, Action, Family]","[Adventure, Fantasy, Animation, Science Fictio...","[Comedy, Drama]"
homepage,http://toystory.disney.com/toy-story,,,,,,,,,http://www.mgm.com/view/movie/757/Goldeneye/,...,,http://www.syfy.com/sharknado4,,,,,http://www.thebeatlesliveproject.com/,http://movies.warnerbros.com/pk3/,http://www.pokemon.com/us/movies/movie-pokemon...,
id,862,8844,15602,31357,11862,949,11860,45325,9091,710,...,314420,390989,159550,392572,402672,315011,391698,10991,12600,265189
imdb_id,tt0114709,tt0113497,tt0113228,tt0114885,tt0113041,tt0113277,tt0114319,tt0112302,tt0114576,tt0113189,...,tt3732950,tt4831420,tt0255313,tt5165344,tt3859980,tt4262980,tt2531318,tt0235679,tt0287635,tt2121382
original_language,en,en,en,en,en,en,en,en,en,en,...,en,en,en,hi,hi,ja,en,ja,ja,sv
original_title,Toy Story,Jumanji,Grumpier Old Men,Waiting to Exhale,Father of the Bride Part II,Heat,Sabrina,Tom and Huck,Sudden Death,GoldenEye,...,Body,Sharknado 4: The 4th Awakens,The Last Brickmaker in America,रुस्तम,Mohenjo Daro,シン・ゴジラ,The Beatles: Eight Days a Week - The Touring Y...,Pokémon 3: The Movie,劇場版ポケットモンスター セレビィ 時を越えた遭遇（であい）,Turist
overview,"Led by Woody, Andy's toys live happily in his ...",When siblings Judy and Peter discover an encha...,A family wedding reignites the ancient feud be...,"Cheated on, mistreated and stepped on, the wom...",Just when George Banks has recovered from his ...,"Obsessive master thief, Neil McCauley leads a ...",An ugly duckling having undergone a remarkable...,"A mischievous young boy, Tom Sawyer, witnesses...",International action superstar Jean Claude Van...,James Bond must unmask the mysterious head of ...,...,A night out turns deadly when three girls brea...,The new installment of the Sharknado franchise...,A man must cope with the loss of his wife and ...,"Rustom Pavri, an honourable officer of the Ind...","Village lad Sarman is drawn to big, bad Mohenj...",From the mind behind Evangelion comes a hit la...,"The band stormed Europe in 1963, and, in 1964,...",When Molly Hale's sadness of her father's disa...,"All your favorite Pokémon characters are back,...","While holidaying in the French Alps, a Swedish..."


In [74]:
# min_df=1 keeps every term that occurs in at least one document
tf = TfidfVectorizer(analyzer='word', ngram_range=(1, 2), min_df=1, stop_words='english')
tfidf_matrix = tf.fit_transform(smd['description'])

In [88]:
tfidf_matrix.shape

(9099, 268124)

In [78]:
tfidf_matrix.toarray()

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

* TfidfVectorizer L2-normalizes each row by default, so the dot product of two rows is exactly their cosine similarity.

* sklearn's `linear_kernel` computes that dot product directly and would therefore give the same scores with less work. The cell below calls `cosine_similarity`, which returns the same matrix.
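
The relationship between the dot product and cosine similarity can be checked by hand. The sketch below uses plain Python and two made-up term-weight vectors. The helper names and the vectors are assumptions for illustration, not the notebook's data.

```python
import math

def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def l2_normalize(vec):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(dot(vec, vec))
    return [x / norm for x in vec]

def cosine(a, b):
    """Cosine similarity computed from the raw (unnormalized) vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Two hypothetical term-weight vectors over a four-word vocabulary
doc_a = [0.0, 2.0, 1.0, 0.0]
doc_b = [1.0, 1.0, 0.0, 3.0]

dot_of_normalized = dot(l2_normalize(doc_a), l2_normalize(doc_b))
cosine_of_raw = cosine(doc_a, doc_b)
print(dot_of_normalized, cosine_of_raw)
```

The two numbers agree up to floating-point rounding, which is why a plain dot product suffices once the rows are already unit length.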

In [79]:
cosine_sim = cosine_similarity(tfidf_matrix,tfidf_matrix)

In [80]:
cosine_sim[0]

array([1.        , 0.00680476, 0.        , ..., 0.        , 0.00344913,
       0.        ])

* We now have a pairwise cosine similarity matrix for all the movies in our dataset
* The next step is to write a function that returns the 30 most similar movies based on the cosine similarity score.

In [81]:
smd = smd.reset_index()
titles =smd['title']
indices =pd.Series(smd.index,index=smd['title'])

In [83]:
indices

title
Toy Story                                                0
Jumanji                                                  1
Grumpier Old Men                                         2
Waiting to Exhale                                        3
Father of the Bride Part II                              4
                                                      ... 
Shin Godzilla                                         9094
The Beatles: Eight Days a Week - The Touring Years    9095
Pokémon: Spell of the Unknown                         9096
Pokémon 4Ever: Celebi - Voice of the Forest           9097
Force Majeure                                         9098
Length: 9099, dtype: int64

In [85]:
def get_recommendations(title):
    idx = indices[title]
    sim_scores =list(enumerate(cosine_sim[idx]))
    sim_scores = sorted(sim_scores,key=lambda x:x[1], reverse=True)
    sim_scores =sim_scores[1:31]
    movie_indices =[i[0] for i in sim_scores]
    return titles.iloc[movie_indices]
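
The ranking step inside `get_recommendations` can be seen in isolation on a toy similarity row. The scores below are made up, and only the top two neighbours are kept instead of thirty.

```python
# Hypothetical similarity scores between movie 0 and five movies (itself included)
toy_sim_row = [1.0, 0.1, 0.7, 0.0, 0.4]

# Pair each score with its row position, then sort by score, highest first
ranked = sorted(enumerate(toy_sim_row), key=lambda pair: pair[1], reverse=True)

# Skip the first entry of the ranking (the movie itself) and keep the next two
top_indices = [idx for idx, _ in ranked[1:3]]
print(top_indices)
```

Dropping the first ranked entry assumes the query movie ranks first, which holds when its self-similarity of 1.0 is the unique maximum. Two movies with identical descriptions would tie, so the query itself could then appear among the results.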

In [86]:
get_recommendations('The Godfather').head(10)

973      The Godfather: Part II
8387                 The Family
3509                       Made
4196         Johnny Dangerously
29               Shanghai Triad
5667                       Fury
2412             American Movie
1582    The Godfather: Part III
4221                    8 Women
2159              Summer of Sam
Name: title, dtype: object

In [87]:
get_recommendations('The Dark Knight').head(10)

7931                      The Dark Knight Rises
132                              Batman Forever
1113                             Batman Returns
8227    Batman: The Dark Knight Returns, Part 2
7565                 Batman: Under the Red Hood
524                                      Batman
7901                           Batman: Year One
2579               Batman: Mask of the Phantasm
2696                                        JFK
8165    Batman: The Dark Knight Returns, Part 1
Name: title, dtype: object

* We see that for __The Dark Knight__, our system is able to identify it as a __Batman film and subsequently recommend other Batman films__ as its top recommendations.

* But unfortunately, that is all this system can do at the moment. 

* This is not of much use to most people, as it doesn't take into consideration important features such as cast, crew, director and genre, which determine the rating and the popularity of a movie.

* Someone who liked The Dark Knight probably likes it more because of Nolan and would hate Batman Forever and every other substandard movie in the Batman Franchise.

* Therefore, we are going to use much more suggestive metadata than Overview and Tagline. 
* In the next subsection, we will build a more sophisticated recommender that takes __genre, keywords, cast and crew__ into consideration.

### Content based RS : Using movie description, taglines, keywords, cast, director and genres

* To build our standard metadata based content recommender, we will need to __merge our current dataset with the 
  crew and the keyword datasets__. 
* Let us prepare this data as our first step.