S.a.M: Movie Recommender Feature
Hello! In this notebook we will be implementing the movie recommendation system of our AI, S.a.M. S.a.M. allows users to input their favorite movies and get curated recommendations based on them. This is achieved through content-based filtering, collaborative filtering, and a hybrid of both filtering techniques.

Let's start!

Step 1: Loading the data
First we have to import the necessary libraries and resources.

In [1]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel
from sklearn.neighbors import NearestNeighbors
from sklearn.metrics.pairwise import cosine_similarity
from ast import literal_eval
from datetime import datetime

We will now read the CSV files containing our dataset into pandas DataFrames. In DataFrame format, we can inspect important information such as the layout of the dataset and its shape. df1 contains movie_id, title, cast, and crew information. df2 contains budget, genres, homepage, id, keywords, and other movie features. We will eventually combine the two datasets.

In [2]:
df1 = pd.read_csv(r"C:\Users\kylek\Downloads\CS 450 Project\data\tmdb_5000_credits.csv")
df1.head()

Unnamed: 0,movie_id,title,cast,crew
0,19995,Avatar,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."
1,285,Pirates of the Caribbean: At World's End,"[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de..."
2,206647,Spectre,"[{""cast_id"": 1, ""character"": ""James Bond"", ""cr...","[{""credit_id"": ""54805967c3a36829b5002c41"", ""de..."
3,49026,The Dark Knight Rises,"[{""cast_id"": 2, ""character"": ""Bruce Wayne / Ba...","[{""credit_id"": ""52fe4781c3a36847f81398c3"", ""de..."
4,49529,John Carter,"[{""cast_id"": 5, ""character"": ""John Carter"", ""c...","[{""credit_id"": ""52fe479ac3a36847f813eaa3"", ""de..."


This is our first dataset. Now let's take a look at our second. 

In [3]:
df2 = pd.read_csv(r"C:\Users\kylek\Downloads\CS 450 Project\data\tmdb_5000_movies.csv")
df2.head()

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2009-12-10,2787965087,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800
1,300000000,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...",http://disney.go.com/disneypictures/pirates/,285,"[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...",en,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...",139.082615,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2007-05-19,961000000,169.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"At the end of the world, the adventure begins.",Pirates of the Caribbean: At World's End,6.9,4500
2,245000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.sonypictures.com/movies/spectre/,206647,"[{""id"": 470, ""name"": ""spy""}, {""id"": 818, ""name...",en,Spectre,A cryptic message from Bond’s past sends him o...,107.376788,"[{""name"": ""Columbia Pictures"", ""id"": 5}, {""nam...","[{""iso_3166_1"": ""GB"", ""name"": ""United Kingdom""...",2015-10-26,880674609,148.0,"[{""iso_639_1"": ""fr"", ""name"": ""Fran\u00e7ais""},...",Released,A Plan No One Escapes,Spectre,6.3,4466
3,250000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 80, ""nam...",http://www.thedarkknightrises.com/,49026,"[{""id"": 849, ""name"": ""dc comics""}, {""id"": 853,...",en,The Dark Knight Rises,Following the death of District Attorney Harve...,112.31295,"[{""name"": ""Legendary Pictures"", ""id"": 923}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2012-07-16,1084939099,165.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,The Legend Ends,The Dark Knight Rises,7.6,9106
4,260000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://movies.disney.com/john-carter,49529,"[{""id"": 818, ""name"": ""based on novel""}, {""id"":...",en,John Carter,"John Carter is a war-weary, former military ca...",43.926995,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}]","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2012-03-07,284139100,132.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"Lost in our world, found in another.",John Carter,6.1,2124
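
Both reads above use an absolute path on one particular machine. As a minimal sketch of a more portable setup, the snippet below builds the two file paths relative to the notebook. The DATA_DIR name and the "data" folder location are assumptions for illustration, not part of the original project layout.

```python
from pathlib import Path

# Hypothetical layout: a "data" folder sitting next to the notebook
DATA_DIR = Path("data")
credits_path = DATA_DIR / "tmdb_5000_credits.csv"
movies_path = DATA_DIR / "tmdb_5000_movies.csv"

# Report whether each file is where we expect it before trying to read it
for path in (credits_path, movies_path):
    status = "found" if path.exists() else "missing"
    print(f"{path}: {status}")
```

With paths built this way, pd.read_csv(credits_path) and pd.read_csv(movies_path) would load the same files on any machine that keeps the CSVs in that folder.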


Now let's join the two datasets together and take a look at our new dataset.

In [4]:
df1.columns = ['id','tittle','cast','crew']
movie_df = df2.merge(df1,on='id')
movie_df.head()

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,...,runtime,spoken_languages,status,tagline,title,vote_average,vote_count,tittle,cast,crew
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...",...,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800,Avatar,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."
1,300000000,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...",http://disney.go.com/disneypictures/pirates/,285,"[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...",en,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...",139.082615,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}, {""...",...,169.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"At the end of the world, the adventure begins.",Pirates of the Caribbean: At World's End,6.9,4500,Pirates of the Caribbean: At World's End,"[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de..."
2,245000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.sonypictures.com/movies/spectre/,206647,"[{""id"": 470, ""name"": ""spy""}, {""id"": 818, ""name...",en,Spectre,A cryptic message from Bond’s past sends him o...,107.376788,"[{""name"": ""Columbia Pictures"", ""id"": 5}, {""nam...",...,148.0,"[{""iso_639_1"": ""fr"", ""name"": ""Fran\u00e7ais""},...",Released,A Plan No One Escapes,Spectre,6.3,4466,Spectre,"[{""cast_id"": 1, ""character"": ""James Bond"", ""cr...","[{""credit_id"": ""54805967c3a36829b5002c41"", ""de..."
3,250000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 80, ""nam...",http://www.thedarkknightrises.com/,49026,"[{""id"": 849, ""name"": ""dc comics""}, {""id"": 853,...",en,The Dark Knight Rises,Following the death of District Attorney Harve...,112.31295,"[{""name"": ""Legendary Pictures"", ""id"": 923}, {""...",...,165.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,The Legend Ends,The Dark Knight Rises,7.6,9106,The Dark Knight Rises,"[{""cast_id"": 2, ""character"": ""Bruce Wayne / Ba...","[{""credit_id"": ""52fe4781c3a36847f81398c3"", ""de..."
4,260000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://movies.disney.com/john-carter,49529,"[{""id"": 818, ""name"": ""based on novel""}, {""id"":...",en,John Carter,"John Carter is a war-weary, former military ca...",43.926995,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}]",...,132.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"Lost in our world, found in another.",John Carter,6.1,2124,John Carter,"[{""cast_id"": 5, ""character"": ""John Carter"", ""c...","[{""credit_id"": ""52fe479ac3a36847f813eaa3"", ""de..."


Let's take a look at the shape of our dataset. This will tell us how many movies and features are in our dataset.

In [5]:
movie_df.shape

(4803, 23)

We are returned a tuple, (4803, 23). This means that our dataset has 4803 movies, each with 23 features.

We will begin our recommendation system with content-based filtering. The content of a movie (overview, cast, crew, keywords, and tagline) is used to compute a similarity score with other movies, and the movies with the highest similarity scores are recommended. We'll start by computing pairwise similarity scores for all movies based on their plot descriptions, which are given in the overview feature of our dataset.

Let's take a look.  

In [6]:
movie_df['overview'].head(5)

0    In the 22nd century, a paraplegic Marine is di...
1    Captain Barbossa, long believed to be dead, ha...
2    A cryptic message from Bond’s past sends him o...
3    Following the death of District Attorney Harve...
4    John Carter is a war-weary, former military ca...
Name: overview, dtype: object

Now we must perform data preprocessing in order to transform our raw data into something usable. We want to process the text so that we can accurately analyze the plot description for each movie. This is achieved by computing the Term Frequency-Inverse Document Frequency (TF-IDF) vector for each overview. Term frequency (TF) is the relative frequency of a word in a document: (occurrences of the term in the document) / (total terms in the document). Inverse document frequency (IDF) down-weights terms that appear in many documents: log(number of documents / number of documents containing the term). The importance of each word to a document is TF * IDF. Computing this for every overview gives a matrix where each row represents a movie and each column represents a word in the overview vocabulary. This weighting reduces the importance of words that occur frequently across plot overviews.

In [7]:
#Define a TF-IDF Vectorizer Object. Remove all english stop words such as 'the', 'a'
tfidf = TfidfVectorizer(stop_words='english')

#Replace NaN with an empty string
movie_df['overview'] = movie_df['overview'].fillna('')

#Construct the required TF-IDF matrix by fitting and transforming the data
tfidf_matrix = tfidf.fit_transform(movie_df['overview'])

#Output the shape of tfidf_matrix
tfidf_matrix.shape

(4803, 20978)

There are 20,978 distinct words (after removing stop words) used to describe the 4,803 movies in our dataset.
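
To make the TF and IDF formulas concrete, here is a minimal sketch in plain Python that applies the textbook definitions given above to a toy corpus. The three short "overviews" are made up for illustration. Note that sklearn's TfidfVectorizer, with its default settings, uses a smoothed IDF and L2-normalizes each row, so its numbers will differ from these.

```python
import math
from collections import Counter

# Toy corpus standing in for movie overviews (illustrative, not from the dataset)
docs = [
    "marine explores alien moon",
    "pirate captain sails ocean",
    "alien invasion threatens ocean city",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: number of documents that contain each term
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

def tf_idf(tokens):
    """Return {term: TF * IDF} for one document using the textbook formulas."""
    counts = Counter(tokens)
    total = len(tokens)
    return {
        term: (count / total) * math.log(n_docs / doc_freq[term])
        for term, count in counts.items()
    }

weights = [tf_idf(tokens) for tokens in tokenized]
for doc, w in zip(docs, weights):
    print(doc, {t: round(v, 3) for t, v in w.items()})
```

In the first toy document, "marine" appears in only one document and so receives a higher weight than "alien", which appears in two.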

Using this matrix, we can now compute a similarity score. There are multiple ways to measure similarity, such as Euclidean distance, Pearson correlation, and cosine similarity. No single method is best in every case; each has advantages in certain scenarios. In this case, we will use the cosine similarity score to calculate a numeric quantity that represents the similarity between two movies.

Similarity = cos(θ) = (A · B) / (||A|| ||B||), where A · B is the dot product of the two vectors.

Because TfidfVectorizer L2-normalizes each row by default, every TF-IDF vector has unit length, so the dot product of two rows directly gives their cosine similarity. Therefore, we will use sklearn's linear_kernel(), which computes these dot products.

In [8]:
#Compute the cosine similarity matrix
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)
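
The shortcut used above, taking a plain dot product in place of the full cosine formula, can be checked by hand. The sketch below is a minimal plain-Python illustration with two made-up vectors; the helper names are hypothetical and are not part of sklearn.

```python
import math

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def cosine_full(a, b):
    """Cosine similarity using the full formula (A . B) / (||A|| ||B||)."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

a = [1.0, 2.0, 0.0]
b = [2.0, 1.0, 1.0]

full = cosine_full(a, b)
# After normalizing both vectors, the dot product alone gives the same value
shortcut = dot(l2_normalize(a), l2_normalize(b))
print(full, shortcut)
```

The two printed values agree up to floating-point rounding, which is why a dot product is enough once the rows are unit length.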

Now we will define a function that takes a movie title as input and outputs a list of the 10 most similar movies. We need a way to identify the index of a movie in our metadata DataFrame given its title. This can be achieved with a reverse mapping from movie titles to DataFrame indices.

In [9]:
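#Construct a reverse map from movie titles to DataFrame indices
indices = pd.Series(movie_df.index, index=movie_df['title'])
#Keep the first row for any duplicated title so indices[title] returns a single index
indices = indices[~indices.index.duplicated(keep='first')]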

Basic Steps of Recommendation System
1) Get the index of the movie from its title
2) Compute the cosine similarity scores between that movie and all movies. Then convert them into a list of tuples where the first element is the movie's position and the second is the similarity score
3) Sort the list of tuples by similarity score in descending order
4) Get the top 10 elements of this list. Skip the first element since it refers to the movie itself
5) Return the titles (and similarity scores) corresponding to the indices of the top elements

In [23]:
#Function that takes in movie title as input and outputs most similar movies
def get_content_recommendations(title, cosine_sim=cosine_sim):
    #Get the index of the movie that matches the title
    idx = indices[title]

    #Get the pairwise similarity scores of all movies with that movie
    sim_scores = list(enumerate(cosine_sim[idx]))

    #Sort the movies based on the similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    #Get the scores of the 10 most similar movies
    sim_scores = sim_scores[1:11]

    #Get the movie indices
    movie_indices = [i[0] for i in sim_scores]

    #Create a DataFrame with the top 10 most similar movies and their similarity scores
    result_df = pd.DataFrame(columns=['Title', 'Similarity Score'])
    result_df['Title'] = movie_df['title'].iloc[movie_indices]
    result_df['Similarity Score'] = [i[1] for i in sim_scores]

    #Return the top 10 most similar movies
    return result_df
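
The function above sorts every score and then drops position 0, which assumes the queried movie always ranks first. As a minimal alternative sketch, the helper below skips the queried movie by index and uses heapq.nlargest so that only the top k entries are kept. The name top_k_similar and the toy scores are hypothetical, for illustration only.

```python
import heapq

def top_k_similar(scores, query_idx, k=10):
    """Return [(index, score), ...] for the k highest scores, skipping query_idx."""
    candidates = ((i, s) for i, s in enumerate(scores) if i != query_idx)
    return heapq.nlargest(k, candidates, key=lambda pair: pair[1])

# Toy row of a similarity matrix; position 0 is the queried movie itself
toy_scores = [1.0, 0.12, 0.40, 0.05, 0.33]
print(top_k_similar(toy_scores, query_idx=0, k=3))
```

Excluding by index is a little more robust when another movie has an identical overview, or when the queried movie has an empty overview and therefore scores 0 against everything, itself included.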



Let's test out our content-based recommendations by inputting our favorite movie.

In [24]:
get_content_recommendations('Batman & Robin')

Unnamed: 0,Title,Similarity Score
1359,Batman,0.178817
299,Batman Forever,0.164285
428,Batman Returns,0.138518
212,The Day After Tomorrow,0.137278
514,Ice Age: The Meltdown,0.134456
3,The Dark Knight Rises,0.130455
4768,The Exploding Girl,0.120091
3854,"Batman: The Dark Knight Returns, Part 2",0.109152
65,The Dark Knight,0.106896
9,Batman v Superman: Dawn of Justice,0.105312


Our recommendation system does a good job of recommending movies whose plot descriptions use similar words. In this example, the user inputs Batman & Robin and is recommended other Batman movies along with movies whose overviews share similar wording.

Now let's create our collaborative filtering recommendation system. We want to create a dictionary mapping movie IDs to their corresponding movie titles.

In [12]:
# Created a (movieId: title) dictionary for all movieId's for replacing them with their names
movieIdDict = movie_df.drop_duplicates('title')[['id', 'title']].set_index('id').to_dict()['title']

# First 5 elements of this dictionary
list(movieIdDict.items())[:5]

[(19995, 'Avatar'),
 (285, "Pirates of the Caribbean: At World's End"),
 (206647, 'Spectre'),
 (49026, 'The Dark Knight Rises'),
 (49529, 'John Carter')]

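Now let's create a pivot table (dataRecommendation) from the movies DataFrame. Each row is a movie id, each column is a movie title, and each cell holds that movie's vote_average (0 where the id and title do not belong to the same movie).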

In [13]:
#Create a pivot table with one row per movie id and one column per movie title, filled with vote_average
dataRecommendation = movie_df.pivot_table(index='id', columns='title', values='vote_average').fillna(0)

#Replacing dataRecommendation columns with the movie titles
#dataRecommendation.columns = dataRecommendation.columns.map(movieIdDict)

#Output a sample of the pivot table: the first 5 rows (movie ids) and the first 5 columns (titles)
dataRecommendation.head(5).iloc[:, [0,1,2,3,4]]


id,#Horror,(500) Days of Summer,10 Cloverfield Lane,10 Days in a Madhouse,10 Things I Hate About You
5,0.0,0.0,0.0,0.0,0.0
11,0.0,0.0,0.0,0.0,0.0
12,0.0,0.0,0.0,0.0,0.0
13,0.0,0.0,0.0,0.0,0.0
14,0.0,0.0,0.0,0.0,0.0


Now let's use k-nearest neighbors (KNN) on the columns of this table to generate movie recommendations for collaborative filtering.

In [14]:
knn = NearestNeighbors(n_neighbors=11, metric='cosine', algorithm='brute')
knn.fit(dataRecommendation.values.T)

Each feature vector here is one column of dataRecommendation, i.e. one movie. In general, such vectors can represent various attributes of a movie, such as user ratings, genres and other relevant features. Cosine is chosen as the metric because it compares the direction of two feature vectors rather than their magnitude.

In [15]:
# Here is our movie recommendations for Toy Story
recommendationResult = list(knn.kneighbors([dataRecommendation['Toy Story'].values], 8))

recommendationResult 
#The first array gives the cosine distances (1 - cosine similarity) to each neighbor.
#The second array gives the column positions of those neighbors in dataRecommendation.
#We'll need to convert it to a more readable form.

[array([[0., 1., 1., 1., 1., 1., 1., 1.]]),
 array([[4448, 3201, 3200, 3198, 3202, 3203, 3197, 3199]], dtype=int64)]

This step generates movie recommendations for Toy Story. The first array holds the cosine distances (1 - cosine similarity) between the target movie and each neighbor: 0 for Toy Story itself and 1 for every other neighbor returned. The second array holds the position of each neighbor among the columns of dataRecommendation, not a TMDB movie id.
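
The pattern in the output above (one distance of 0 followed by distances of 1) follows from how the pivot table is built. The sketch below is a minimal plain-Python illustration with three made-up movies, assuming each movie id appears in exactly one row, as in the sample of the pivot table shown earlier. The cosine_distance helper is hypothetical, not an sklearn function.

```python
import math

def cosine_distance(a, b):
    """Cosine distance = 1 - cosine similarity."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

# Toy stand-in for the pivot table: three movies, one row per movie id.
# Each title's column is zero except in the row of its own id.
vote_average = {"Movie A": 7.2, "Movie B": 6.9, "Movie C": 6.3}
titles = list(vote_average)
columns = {
    title: [vote_average[title] if row == col else 0.0 for row in range(len(titles))]
    for col, title in enumerate(titles)
}

for other in titles:
    d = cosine_distance(columns["Movie A"], columns[other])
    print(f"Movie A vs {other}: {d:.1f}")
```

Because no two columns share a non-zero row, every pair of different movies has a dot product of 0 and so a cosine distance of 1. All neighbors are therefore tied, and the order in which they are returned carries no preference signal. A matrix of per-user ratings (one row per user) would be needed for the distances to differ, and the columns shown in this dataset do not appear to include one.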

In [16]:
recommendations = pd.DataFrame(np.vstack((recommendationResult[1], recommendationResult[0])),
                 index=['movieId', 'Cosine_Similarity (degree)']).T
recommendations = recommendations.drop([0]).reset_index(drop=True)
# This DataFrame stores each neighbor's column position (labelled movieId) and its cosine distance (labelled Cosine_Similarity (degree)); the values are distances, not angles in degrees
recommendations

Unnamed: 0,movieId,Cosine_Similarity (degree)
0,3201.0,1.0
1,3200.0,1.0
2,3198.0,1.0
3,3202.0,1.0
4,3203.0,1.0
5,3197.0,1.0
6,3199.0,1.0


Now let's use this information to create a function that generates movie recommendations. Our function uses the k-nearest-neighbors model to find movie recommendations for collaborative filtering.

First we find the nearest neighbors. We extract the column of dataRecommendation that corresponds to the movie title provided. dataRecommendation is a DataFrame whose columns are movie titles and whose rows are movie ids, with vote_average as the values. Then we reshape that column into a 2D array with one row and one entry per feature, which is the shape kneighbors() expects. Finally we flatten the returned 2D array of neighbor positions into a 1D array and pick out the titles of the movies that are the nearest neighbors to the provided movie.

In [17]:
#Use the k-nearest-neighbors model to find movie recommendations for collaborative filtering
def get_collaborative_recommendations(title):
    distances, indices = knn.kneighbors(dataRecommendation[title].values.reshape(1, -1), n_neighbors=11)
    titles = dataRecommendation.columns[indices.flatten()][1:]
    return pd.Series(titles, name='title')

Here is what the individual pieces of the function do.

"dataRecommendation[title].values" - extracts the column for the given title from dataRecommendation as a NumPy array

".reshape(1, -1)" - Reshapes the data into a 2D array with one row and a column for each feature

"knn.kneighbors()" - Method of the knn object that finds the nearest neighbors of the movie and returns their distances and positions

"indices.flatten()" - Flattens the 2D array of indices into a 1D array

"dataRecommendation.columns[indices.flatten()]" - Selects the movie titles from dataRecommendation that correspond to these indices, which picks out the titles of the movies that are the nearest neighbors to the provided movie

"[1:]" - Slices the array to exclude the first element, which is expected to be the movie itself (the closest neighbor is the same movie)

In [18]:
get_collaborative_recommendations('Broken Arrow')

0                           Spotlight
1                              Splice
2                            Spy Kids
3                            Spy Hard
4                            Spy Game
5                                Spun
6                     Spring Breakers
7             Spy Kids 3-D: Game Over
8                        Split Second
9    Spirit: Stallion of the Cimarron
Name: title, dtype: object

Now we have 2 separate filtering techniques: content-based and collaborative. We noticed that the quality of our recommendations could improve, so we created a hybrid recommendation system. This system aims to combine the strengths of content similarity and user preference patterns, which can lead to more accurate recommendations. Combining the methods can reduce the limitations of each one. It also caters to both aspects of what makes a movie appealing: similar content and similar user preferences.

In [19]:
def hybrid_recommendation(title, content_weight=0.5, collab_weight=0.5):
    #Use the previously defined functions to get content and collaborative recommendations.
    #get_content_recommendations returns a DataFrame, so keep only its 'Title' column.
    content_recs = get_content_recommendations(title)['Title']
    collab_recs = get_collaborative_recommendations(title)

    #Initialize dictionaries to store the weighted scores for movies recommended by both methods.
    #Weights are applied to emphasize or de-emphasize the influence of each recommendation type.
    content_scores = {movie: content_weight for movie in content_recs}
    collab_scores = {movie: collab_weight for movie in collab_recs}

    #Create dictionary to combine scores from both recs
    combined_scores = {}
    
    #Update combined scores with weighted scores from content based recommendations
    for movie, score in content_scores.items():
        if movie in combined_scores:
            combined_scores[movie] += score
        else:
            combined_scores[movie] = score

    #Update combined scores with weighted scores from collaborative recommendations 
    for movie, score in collab_scores.items():
        if movie in combined_scores:
            combined_scores[movie] += score
        else:
            combined_scores[movie] = score

    #Sort movies based on the combined scores in descending order 
    sorted_movies = sorted(combined_scores, key=combined_scores.get, reverse=True)
    
    #Select the top 10 movies from the sorted list 
    top_movies = sorted_movies[:10]  # Get top 10 movies
    
    #Return these top 10 movies as a pandas Series
    return pd.Series(top_movies, name='title')
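
In hybrid_recommendation above, every recommended title receives the same flat weight, so with equal weights and no overlap between the two lists the final order is simply the order of insertion (content-based titles first). As a minimal sketch of one alternative, the helper below scores each title by its rank in each list before blending. The function name blend_ranked_lists and the linear rank scoring are assumptions for illustration, not the method used in this notebook.

```python
def blend_ranked_lists(content_titles, collab_titles,
                       content_weight=0.5, collab_weight=0.5, top_n=10):
    """Blend two ranked title lists into one list ordered by weighted rank score."""
    scores = {}
    for weight, titles in ((content_weight, content_titles),
                           (collab_weight, collab_titles)):
        n = len(titles)
        for rank, title in enumerate(titles):
            # Rank 0 earns the full weight; later ranks earn linearly less
            scores[title] = scores.get(title, 0.0) + weight * (n - rank) / n
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

# Toy ranked lists standing in for the two recommenders' outputs
content = ["A", "B", "C"]
collab = ["C", "A", "D"]
print(blend_ranked_lists(content, collab))
```

Titles that appear in both lists, or near the top of either, rise in the blended ranking, and changing content_weight or collab_weight shifts the balance between the two sources.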

Now let's test out all of our recommendation systems.

In [20]:
print("Content-based Recommendations for 'The Godfather':")
print(get_content_recommendations('The Godfather'))

Content-based Recommendations for 'The Godfather':
2731     The Godfather: Part II
1873                 Blood Ties
867     The Godfather: Part III
3727                 Easy Money
3623                       Made
3125                     Eulogy
3896                   Sinister
4506            The Maid's Room
3783                        Joe
2244      The Cold Light of Day
Name: title, dtype: object


In [21]:
print("\nCollaborative Recommendations for 'The Godfather':")
print(get_collaborative_recommendations('The Godfather'))


Collaborative Recommendations for 'The Godfather':
0                                   Splash
1                                 Spy Hard
2                                 Spy Game
3                                     Spun
4                          Spring Breakers
5                                Spotlight
6                             Split Second
7    Spy Kids 2: The Island of Lost Dreams
8                                   Splice
9                             Spider-Man 3
Name: title, dtype: object


In [22]:
print("\nHybrid Recommendations for 'The Godfather':")
print(hybrid_recommendation('The Godfather'))


Hybrid Recommendations for 'The Godfather':
0     The Godfather: Part II
1                 Blood Ties
2    The Godfather: Part III
3                 Easy Money
4                       Made
5                     Eulogy
6                   Sinister
7            The Maid's Room
8                        Joe
9      The Cold Light of Day
Name: title, dtype: object
