# Recommending Top Movies Based on a User's Viewing History

In this notebook we use TF-IDF vectorization to recommend movies similar to ones a user has previously seen.

First, let's import our libraries and load our dataframe.

In [1]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Load the TMDB 5000 movies dataset and preview the first rows
df = pd.read_csv("tmdb_5000_movies.csv")
df.head()

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2009-12-10,2787965087,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800
1,300000000,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...",http://disney.go.com/disneypictures/pirates/,285,"[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...",en,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...",139.082615,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2007-05-19,961000000,169.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"At the end of the world, the adventure begins.",Pirates of the Caribbean: At World's End,6.9,4500
2,245000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.sonypictures.com/movies/spectre/,206647,"[{""id"": 470, ""name"": ""spy""}, {""id"": 818, ""name...",en,Spectre,A cryptic message from Bond’s past sends him o...,107.376788,"[{""name"": ""Columbia Pictures"", ""id"": 5}, {""nam...","[{""iso_3166_1"": ""GB"", ""name"": ""United Kingdom""...",2015-10-26,880674609,148.0,"[{""iso_639_1"": ""fr"", ""name"": ""Fran\u00e7ais""},...",Released,A Plan No One Escapes,Spectre,6.3,4466
3,250000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 80, ""nam...",http://www.thedarkknightrises.com/,49026,"[{""id"": 849, ""name"": ""dc comics""}, {""id"": 853,...",en,The Dark Knight Rises,Following the death of District Attorney Harve...,112.31295,"[{""name"": ""Legendary Pictures"", ""id"": 923}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2012-07-16,1084939099,165.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,The Legend Ends,The Dark Knight Rises,7.6,9106
4,260000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://movies.disney.com/john-carter,49529,"[{""id"": 818, ""name"": ""based on novel""}, {""id"":...",en,John Carter,"John Carter is a war-weary, former military ca...",43.926995,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}]","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2012-03-07,284139100,132.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"Lost in our world, found in another.",John Carter,6.1,2124


We notice that the genres and keywords columns hold JSON strings: each one is a list of objects with an id and a name (a genre name or a keyword).

In [2]:
import json
x=df.iloc[0]
j= json.loads(x['genres'])
j

[{'id': 28, 'name': 'Action'},
 {'id': 12, 'name': 'Adventure'},
 {'id': 14, 'name': 'Fantasy'},
 {'id': 878, 'name': 'Science Fiction'}]

Let's write a function that takes a row of the movie dataframe and turns the genres and keywords columns into a single string containing all the name values separated by spaces, with the whitespace inside each name removed.

In [3]:
def genres_and_keywords_to_string(row):
    # Parse the JSON text and keep only the names, removing inner whitespace
    genres = json.loads(row['genres'])
    genres = ' '.join(''.join(j['name'].split()) for j in genres)
    keywords = json.loads(row['keywords'])
    keywords = ' '.join(''.join(j['name'].split()) for j in keywords)
    return "%s %s" % (genres, keywords)

We apply this function to every row of the movie dataframe.

In [4]:
df['string'] = df.apply(genres_and_keywords_to_string, axis=1)
df['string']

0       Action Adventure Fantasy ScienceFiction cultur...
1       Adventure Fantasy Action ocean drugabuse exoti...
2       Action Adventure Crime spy basedonnovel secret...
3       Action Crime Drama Thriller dccomics crimefigh...
4       Action Adventure ScienceFiction basedonnovel m...
                              ...                        
4798    Action Crime Thriller unitedstates–mexicobarri...
4799                                      Comedy Romance 
4800    Comedy Drama Romance TVMovie date loveatfirsts...
4801                                                     
4802      Documentary obsession camcorder crush dreamgirl
Name: string, Length: 4803, dtype: object
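
To make the transformation concrete, here is a minimal sketch on a single hand-written row. The helper name `names_to_tokens` and the `toy_row` dictionary are illustrative assumptions, not part of the notebook; the ids and names are taken from the output shown earlier. Removing the whitespace inside each name matters because the vectorizer splits text into words, so "Science Fiction" would otherwise be counted as two unrelated tokens.

```python
import json

# Hypothetical toy row that mimics the JSON text stored in the real columns.
toy_row = {
    "genres": '[{"id": 28, "name": "Action"}, {"id": 878, "name": "Science Fiction"}]',
    "keywords": '[{"id": 1463, "name": "culture clash"}]',
}

def names_to_tokens(json_text):
    """Parse a JSON list of {id, name} objects and return space-separated names,
    with the whitespace inside each name removed."""
    return " ".join("".join(item["name"].split()) for item in json.loads(json_text))

combined = names_to_tokens(toy_row["genres"]) + " " + names_to_tokens(toy_row["keywords"])
print(combined)  # Action ScienceFiction cultureclash
```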

Let's initialize our vectorizer. In this project I used a TF-IDF vectorizer in order to place emphasis on the more distinctive keywords. We store the resulting TF-IDF matrix in X, which holds the score of each word (columns) in every piece of text (rows). We cap the vocabulary at 2000 words with max_features.

In [5]:
tfidf = TfidfVectorizer(max_features = 2000)
X=tfidf.fit_transform(df['string'])

In [6]:
X

<4803x2000 sparse matrix of type '<class 'numpy.float64'>'
	with 37285 stored elements in Compressed Sparse Row format>
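
As a small illustration of why TF-IDF emphasizes distinctive keywords, the sketch below fits the same vectorizer on three made-up strings (the `toy_docs` list is an assumption for illustration only). A word that appears in every string, such as "action", gets a lower weight than a word that appears in only one, such as "spy".

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up strings standing in for df['string']; not real dataset rows.
toy_docs = [
    "action adventure spy",
    "action adventure fantasy",
    "action comedy",
]

toy_tfidf = TfidfVectorizer(max_features=2000)
toy_matrix = toy_tfidf.fit_transform(toy_docs)

# vocabulary_ maps each word to its column in the matrix.
toy_vocab = toy_tfidf.vocabulary_
first_row = toy_matrix.toarray()[0]

# In the first string, the rarer the word across all strings, the higher its score.
print(first_row[toy_vocab["spy"]] > first_row[toy_vocab["action"]])  # True
```

With `max_features` set, scikit-learn keeps only the most frequent terms across the corpus, so on the real data very rare keywords are dropped from the matrix.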

We will measure the similarity between movies with the cosine similarity function. In X the movies are identified by their row index, so we need a key-value mapping from each title to its index.

In [7]:
movie2idx = pd.Series(df.index, index=df['title'])
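
One thing to keep in mind with this kind of lookup is that titles are not guaranteed to be unique. The sketch below uses a made-up three-row dataframe (`toy_df` and its titles are assumptions for illustration) to show that a unique title returns a single index, while a repeated title returns a Series of indices.

```python
import pandas as pd

# Made-up titles; "Alpha" appears twice on purpose.
toy_df = pd.DataFrame({"title": ["Alpha", "Beta", "Alpha"]})
toy_lookup = pd.Series(toy_df.index, index=toy_df["title"])

unique_hit = toy_lookup["Beta"]      # a single integer position
duplicate_hit = toy_lookup["Alpha"]  # a Series holding both positions

print(unique_hit)
print(list(duplicate_hit))
```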

In [8]:
idx = movie2idx['Scream 3']

In [9]:
query = X[idx]

In [10]:
query.toarray()


array([[0., 0., 0., ..., 0., 0., 0.]])

In [23]:
scores = cosine_similarity(query, X)
(-scores).argsort()[1:6]

array([], shape=(0, 4803), dtype=int64)
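
The empty result above comes from the shape of the scores: `cosine_similarity` returns a 2D array with one row per query, so `[1:6]` slices rows rather than movies. The sketch below reproduces this on a small made-up matrix (`toy_X` is an assumption for illustration) and shows that flattening first gives a usable ranking.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Made-up feature matrix: 4 "movies" with 3 features each.
toy_X = np.array([
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])
toy_query = toy_X[0:1]  # keep the query 2D, shape (1, 3)

toy_scores = cosine_similarity(toy_query, toy_X)
print(toy_scores.shape)  # (1, 4)

# Slicing the 2D result selects rows 1..2, and there is only row 0.
empty = (-toy_scores).argsort()[1:3]
print(empty.shape)  # (0, 4)

# Flattening first makes the slice apply to the ranked movies instead.
ranked = (-toy_scores.flatten()).argsort()
print(ranked[:2])  # [0 1]
```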

Let's write the function that recommends movies. It takes the title of a previously watched movie, e.g. 'Scream 3', and looks up its index. We then compute the cosine similarity between that movie and all the other movies, flatten the scores into a 1D array, and return the titles of the 5 closest movies after looking up their indices in the dataframe. The slice starts at 1 to skip the top hit, which is normally the query movie itself. Note that the if statement checks whether the index is a pandas Series, which happens when several movies share the same title; in that case we just take the first one.

In [12]:
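def recommend(title):
    # Look up the row index; duplicate titles return a Series, so keep the first
    idx = movie2idx[title]
    if isinstance(idx, pd.Series):
        idx = idx.iloc[0]
    query = X[idx]
    # cosine_similarity returns shape (1, n_movies); flatten to 1D before sorting
    scores = cosine_similarity(query, X).flatten()
    # Sort by descending score and skip position 0, normally the query itself
    recommend_idx = (-scores).argsort()[1:6]
    return df['title'].iloc[recommend_idx]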


In [13]:
print(recommend('Furious 7'))



706              Days of Thunder
935          Herbie Fully Loaded
1186       The Final Destination
2050    The Transporter Refueled
500             2 Fast 2 Furious
Name: title, dtype: object


In [25]:
print(recommend('Mortal Kombat'))

1611              Mortal Kombat: Annihilation
1670                       DOA: Dead or Alive
3856              In the Name of the King III
1001    Street Fighter: The Legend of Chun-Li
2237                        Alone in the Dark
Name: title, dtype: object


In [26]:
print(recommend('Runaway Bride'))

4115                    House of D
2325    My Big Fat Greek Wedding 2
4604         It Happened One Night
3313                  An Education
2689            Our Family Wedding
Name: title, dtype: object


In [28]:
print(recommend('Harry Potter and the Goblet of Fire'))

276      Harry Potter and the Chamber of Secrets
113    Harry Potter and the Order of the Phoenix
191     Harry Potter and the Prisoner of Azkaban
8         Harry Potter and the Half-Blood Prince
197     Harry Potter and the Philosopher's Stone
Name: title, dtype: object
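
The recommender assumes that the best-scoring row is always the query movie. That may not hold for a movie whose string is empty (such as row 4801 above), because its vector is all zeros and its similarity to every movie, including itself, is 0. The sketch below is one possible variant, not the notebook's method: it excludes the query by index and rejects unknown titles. The names `recommend_safe`, `toy_df`, `toy_X` and `toy_movie2idx` are assumptions, and the data is made up.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Made-up stand-in for the movie dataframe; "D" has no genres or keywords.
toy_df = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "string": ["action spy", "action spy car", "romance wedding", ""],
})
toy_X = TfidfVectorizer().fit_transform(toy_df["string"])
toy_movie2idx = pd.Series(toy_df.index, index=toy_df["title"])

def recommend_safe(title, n=5):
    """Return up to n titles most similar to `title`, never the title itself."""
    if title not in toy_movie2idx.index:
        raise KeyError(f"Unknown title: {title!r}")
    idx = toy_movie2idx[title]
    if isinstance(idx, pd.Series):
        idx = idx.iloc[0]  # duplicate titles: keep the first match
    scores = cosine_similarity(toy_X[idx], toy_X).flatten()
    scores[idx] = -np.inf  # push the query to the bottom of the ranking
    top = np.argsort(-scores)[:n]
    return toy_df["title"].iloc[top]

print(recommend_safe("A", n=1))
```

Excluding by index costs one extra line and keeps the result correct even when several movies tie with the query.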
