## Movie Recommendation System
* This script implements a hybrid movie recommendation system that combines content-based filtering with collaborative filtering.
* The system uses cosine similarity over TF-IDF title vectors to find the movie that best matches the user's search, then applies collaborative filtering to recommend movies that similar users also rated highly.

### Data Files Used:
* movies.csv: Contains details like movie titles and genres.
* ratings.csv: Contains user ratings for different movies.

In [23]:
# importing necessary libraries
import pandas as pd
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
import warnings 
warnings.filterwarnings('ignore')

In [2]:
movies = pd.read_csv('movies.csv')
movies.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [3]:
movies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 62423 entries, 0 to 62422
Data columns (total 3 columns):
movieId    62423 non-null int64
title      62423 non-null object
genres     62423 non-null object
dtypes: int64(1), object(2)
memory usage: 1.4+ MB


In [4]:
# Checking for any null values
movies.isnull().sum()

movieId    0
title      0
genres     0
dtype: int64

In [5]:
# Find all duplicate rows based on the 'title' and 'genres' columns
duplicate_rows = movies[movies.duplicated(subset=['title','genres'], keep=False)].sort_values(by='title', ascending = False)
# keep=False marks every member of a duplicate group. keep='first' would leave the first
# occurrence unmarked and flag only the later copies.
# print(duplicate_rows.shape)
duplicate_rows

Unnamed: 0,movieId,title,genres
61521,206117,The Lonely Island Presents: The Unauthorized B...,Comedy
60497,203449,The Lonely Island Presents: The Unauthorized B...,Comedy
42155,163246,Seven Years Bad Luck (1921),Comedy
13555,70155,Seven Years Bad Luck (1921),Comedy
36472,150310,Macbeth (2015),Drama
30563,136564,Macbeth (2015),Drama
50572,181329,Lucky (2017),Drama
49220,178401,Lucky (2017),Drama
10924,46865,Little Man (2006),Comedy
46429,172427,Little Man (2006),Comedy


In [6]:
# Remove duplicate movie entries based on the combination of title and genres
movies = movies.drop_duplicates(subset = ['title', 'genres'])
movies.shape

(62409, 3)
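
The difference between `keep=False` and `keep='first'` is easiest to see on a tiny table. The sketch below uses a hypothetical four-row catalogue (the rows are made up for illustration, not taken from movies.csv) and the same `title` + `genres` key as the cells above.

```python
import pandas as pd

# Hypothetical toy catalogue: movieIds 2 and 3 share the same title and genres.
toy = pd.DataFrame({
    "movieId": [1, 2, 3, 4],
    "title": ["Lucky (2017)", "Macbeth (2015)", "Macbeth (2015)", "Lucky (2011)"],
    "genres": ["Drama", "Drama", "Drama", "Comedy"],
})

key = ["title", "genres"]

# keep=False flags every member of a duplicate group, which is handy for inspection.
all_dupes = toy[toy.duplicated(subset=key, keep=False)]

# keep="first" flags only the later copies, i.e. the rows drop_duplicates removes.
later_copies = toy[toy.duplicated(subset=key, keep="first")]

deduped = toy.drop_duplicates(subset=key)

print(all_dupes["movieId"].tolist())     # [2, 3]
print(later_copies["movieId"].tolist())  # [3]
print(deduped["movieId"].tolist())       # [1, 2, 4]
```

Note that after `drop_duplicates` the index keeps its original labels, so positional access (`iloc`) and label access (`loc`) can refer to different rows.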

In [7]:
# Clean movie titles by removing special characters to standardize titles for better matching
def clean_title(title):
    title = re.sub("[^a-zA-Z0-9 ]", "", title)
    return title
# Apply clean_title to every value in the title column
movies["clean_title"] = movies["title"].apply(clean_title)
# Saved with the default index, which comes back as an 'Unnamed: 0' column when the file is re-read
movies.to_csv('cleaned_movies.csv')
movies.head()

Unnamed: 0,movieId,title,genres,clean_title
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,Toy Story 1995
1,2,Jumanji (1995),Adventure|Children|Fantasy,Jumanji 1995
2,3,Grumpier Old Men (1995),Comedy|Romance,Grumpier Old Men 1995
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance,Waiting to Exhale 1995
4,5,Father of the Bride Part II (1995),Comedy,Father of the Bride Part II 1995
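
To make the effect of the regular expression concrete, here is a small standalone check of `clean_title`. The second and third titles are illustrative inputs chosen to show edge cases; they are assumptions, not rows verified against the dataset.

```python
import re

def clean_title(title):
    # Same pattern as the notebook: drop everything except ASCII letters, digits and spaces.
    return re.sub("[^a-zA-Z0-9 ]", "", title)

print(clean_title("Avengers, The (2012)"))    # Avengers The 2012
print(clean_title("Spider-Man (2002)"))       # SpiderMan 2002
print(clean_title("Am\u00e9lie (2001)"))      # Amlie 2001
```

Two things are worth noticing: removed characters are not replaced by a space, so hyphenated words are joined, and the function does not change case. Lowercasing happens later, because `TfidfVectorizer` lowercases its input by default.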


In [8]:
# Creating a TF-IDF vectorizer to convert the 'clean_title' column into feature vectors
# Considering both unigrams and bigrams
vectorizer = TfidfVectorizer(ngram_range=(1,2))
# Fit on the cleaned titles so the vocabulary matches the cleaned queries built in find()
tfidf = vectorizer.fit_transform(movies["clean_title"])
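
As a rough illustration of what `ngram_range=(1,2)` produces, the sketch below fits the same kind of vectorizer on three toy titles and scores a query against them. The titles and the query are chosen for illustration only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

titles = ["Toy Story 1995", "Toy Story 3 2010", "Jumanji 1995"]

vectorizer = TfidfVectorizer(ngram_range=(1, 2))
tfidf = vectorizer.fit_transform(titles)  # sparse matrix, one row per title

# The vocabulary holds single words and adjacent word pairs, all lowercased.
features = sorted(vectorizer.vocabulary_)
print(features)

# Score a query against every title; flatten() gives one score per title.
query_vec = vectorizer.transform(["Toy Story"])
scores = cosine_similarity(query_vec, tfidf).flatten()
print(scores.round(3))
```

With the default tokenizer, single-character tokens such as the "3" in "Toy Story 3" are dropped, so sequels that differ only by a one-digit number end up looking very similar to each other. The Jumanji row shares no feature with the query and scores 0.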


In [9]:
def find(title):
    title = clean_title(title) # Remove non-alphanumeric characters (TfidfVectorizer lowercases by default)
    
    query_vec = vectorizer.transform([title]) # Vector for the movie title being searched
    
    # Calculate cosine similarity between the searched title and every title in the dataset
    similarity = cosine_similarity(query_vec, tfidf).flatten()
    # Cosine similarity scores how closely two vectors point in the same direction. Here we compare
    # the searched title's vector with every title vector in the dataset to find the closest matches.
    # cosine_similarity returns a 2D array of shape (1, number of titles); .flatten() turns it into
    # a 1D array holding one score per title.

    # Get the indices of the 5 most similar titles (argpartition does not order them)
    indices = np.argpartition(similarity, -5)[-5:]
    # Order those 5 indices from highest to lowest similarity so the best match comes first
    indices = indices[np.argsort(similarity[indices])[::-1]]

    # Retrieve the corresponding movie entries from the 'movies' DataFrame, best match first
    results = movies.iloc[indices]
    return results
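
One detail of the top-5 lookup deserves a quick check: `np.argpartition` guarantees only that the k largest values end up in the last k positions, not that they are ordered among themselves. The sketch below, on a made-up score array, shows one way to get a best-first ranking by sorting just those k candidates.

```python
import numpy as np

# Hypothetical similarity scores for seven titles.
similarity = np.array([0.10, 0.90, 0.30, 0.75, 0.05, 0.60, 0.20])
k = 3

# The k largest values land in the last k slots, in no guaranteed order.
top_k = np.argpartition(similarity, -k)[-k:]

# Sorting only those k candidates by score yields a best-first ranking.
ranked = top_k[np.argsort(similarity[top_k])[::-1]]

print(sorted(top_k.tolist()))  # [1, 3, 5]
print(ranked.tolist())         # [1, 3, 5]
```

This matters whenever the first row of the result is treated as the best match, as the widget callback later in the notebook does.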

In [10]:
find('Avengers')
# This returns movies with titles similar to the searched term. This in itself is not the 
# recommendation system being built as two movies could have a very similar name but could be of 
# completely different genres.

Unnamed: 0,movieId,title,genres,clean_title
34536,145676,3 Avengers (1964),(no genres listed),3 Avengers 1964
17067,89745,"Avengers, The (2012)",Action|Adventure|Sci-Fi|IMAX,Avengers The 2012
2063,2153,"Avengers, The (1998)",Action|Adventure,Avengers The 1998
40636,159920,Shaolin Avengers (1994),Action,Shaolin Avengers 1994
45394,170297,Ultimate Avengers 2 (2006),Action|Animation|Sci-Fi,Ultimate Avengers 2 2006


In [11]:
# loading the ratings.csv file
ratings = pd.read_csv('ratings.csv')
ratings.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,296,5.0,1147880044
1,1,306,3.5,1147868817
2,1,307,5.0,1147868828
3,1,665,5.0,1147878820
4,1,899,3.5,1147868510


In [12]:
ratings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25000095 entries, 0 to 25000094
Data columns (total 4 columns):
userId       int64
movieId      int64
rating       float64
timestamp    int64
dtypes: float64(1), int64(3)
memory usage: 762.9 MB


In [13]:
movie_id = 78499  # Toy Story 3 (2010); renamed from `id` to avoid shadowing the built-in
similar_users = ratings[(ratings["movieId"] == movie_id) & (ratings["rating"] == 5)]["userId"].unique()
# Users who rated the movie we like 5 stars

In [14]:
# Movie IDs of other movies that similar_users rated highly (rating > 4)
recommended_ids = ratings[(ratings["userId"].isin(similar_users)) & (ratings["rating"] > 4)]["movieId"]
print(recommended_ids.head())
# Percentage of similar users who rated each movie in the recommended list highly
rec_id = (recommended_ids.value_counts()/len(similar_users))*100
# rec_id is already in percent, so this cutoff keeps movies liked by more than 0.2% of similar users
# (use rec_id > 20 for a 20% cutoff)
rec_id = rec_id[rec_id > .20]
rec_id.head()

12896        1
12900     2028
12904     5618
12912    59315
12914    67087
Name: movieId, dtype: int64


78499    100.000000
1         61.712668
68954     55.201699
58559     53.609342
79132     49.823071
Name: movieId, dtype: float64
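
The similar-user step can be traced by hand on a few rows. The sketch below uses a hypothetical ratings table (all values invented) and also shows why the unit of the cutoff matters once the share is multiplied by 100.

```python
import pandas as pd

# Hypothetical ratings: users 1-3 gave movie 10 five stars, user 4 never rated it.
ratings = pd.DataFrame({
    "userId":  [1, 1, 1, 2, 2, 3, 3, 4, 4],
    "movieId": [10, 20, 30, 10, 20, 10, 30, 20, 40],
    "rating":  [5.0, 5.0, 3.0, 5.0, 4.5, 5.0, 4.5, 5.0, 5.0],
})

movie_id = 10
similar_users = ratings[(ratings["movieId"] == movie_id) & (ratings["rating"] == 5)]["userId"].unique()

# Movies those users rated above 4, as a percentage of the similar users.
liked = ratings[ratings["userId"].isin(similar_users) & (ratings["rating"] > 4)]["movieId"]
share = (liked.value_counts() / len(similar_users) * 100).sort_index()

print(share.index.tolist())           # [10, 20, 30]
print(share.round(1).tolist())        # [100.0, 66.7, 33.3]

# With values in percent, a 50% cutoff is written as 50, not 0.5.
print(share[share > 50].index.tolist())   # [10, 20]
print(share[share > 0.5].index.tolist())  # [10, 20, 30]
```

Movie 40 never appears because only user 4, who is not a similar user, rated it.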

In [15]:
# Some users never rated the searched movie but still rated the recommended movies highly.
# Collect every high rating (> 4) of a recommended movie, from any user.
broader_users = ratings[(ratings["movieId"].isin(rec_id.index)) & (ratings["rating"] > 4)]
broader_users.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,296,5.0,1147880044
2,1,307,5.0,1147868828
3,1,665,5.0,1147878820
8,1,1237,5.0,1147868839
16,1,2351,4.5,1147877957


In [24]:
# For each recommended movie, calculate the percentage of users in the broader audience
# (everyone who rated at least one recommended movie highly) who rated it highly
broader_users_rec = (broader_users['movieId'].value_counts())/len(broader_users['userId'].unique())*100
broader_users_rec.head()

318     32.230462
296     26.810695
2571    22.983179
356     22.157429
593     21.276171
Name: movieId, dtype: float64

In [17]:
# Combine the percentages from similar users and from the broader audience into one table.
# The columns hold the percentage of likes from similar users ("similar") and the broader audience ("all").
# If a movie is rated highly by similar users but by few users in the broader audience, it is
# not simply a popular movie that everyone watches, so it is a more distinctive recommendation.
rec_percentages = pd.concat([rec_id, broader_users_rec], axis=1)
rec_percentages.columns = ["similar", "all"]
rec_percentages.head()

Unnamed: 0,similar,all
1,61.712668,11.746986
2,6.546355,1.663351
3,0.460014,0.725338
5,0.9908,0.603721
6,6.050955,4.597135


In [18]:
# Calculate a recommendation score by comparing the percentage of similar users who liked a movie 
# to the percentage of all users who liked it (similar/all). Higher score indicates better recommendation.
rec_percentages["score"] = rec_percentages["similar"] / rec_percentages["all"]
rec_percentages = rec_percentages.sort_values("score", ascending=False)
# Reload the cleaned movie table saved earlier (its saved index shows up as 'Unnamed: 0')
movies = pd.read_csv('cleaned_movies.csv')
rec_percentages.head(10).merge(movies, left_index=True, right_on="movieId")

Unnamed: 0.1,similar,all,score,Unnamed: 0,movieId,title,genres,clean_title
22633,0.283086,0.006237,45.389667,22634,115879,Toy Story Toons: Small Fry (2011),Adventure|Animation|Children|Comedy|Fantasy,Toy Story Toons Small Fry 2011
24061,0.2477,0.005613,44.128843,24062,120468,Toy Story Toons: Partysaurus Rex (2012),Animation|Children|Comedy,Toy Story Toons Partysaurus Rex 2012
22632,0.389243,0.009355,41.607195,22633,115875,Toy Story Toons: Hawaiian Vacation (2011),Adventure|Animation|Children|Comedy|Fantasy,Toy Story Toons Hawaiian Vacation 2011
20497,0.920028,0.026818,34.306144,20497,106022,Toy Story of Terror (2013),Animation|Children|Comedy,Toy Story of Terror 2013
24063,0.778485,0.023076,33.735564,24064,120474,Toy Story That Time Forgot (2014),Animation|Children,Toy Story That Time Forgot 2014
23188,0.353857,0.011226,31.520602,23189,117368,The Madagascar Penguins in a Christmas Caper (...,Animation|Comedy,The Madagascar Penguins in a Christmas Caper 2005
14813,100.0,3.308615,30.224128,14813,78499,Toy Story 3 (2010),Adventure|Animation|Children|Comedy|Fantasy|IMAX,Toy Story 3 2010
28164,0.2477,0.008732,28.368542,28166,131080,Cinderella III: A Twist in Time (2007),Animation|Children|Fantasy|Musical|Romance,Cinderella III A Twist in Time 2007
12797,0.212314,0.007484,28.368542,12797,63540,Beverly Hills Chihuahua (2008),Adventure|Children|Comedy,Beverly Hills Chihuahua 2008
10189,0.2477,0.008732,28.368542,10189,36397,Valiant (2005),Adventure|Animation|Children|Comedy|Fantasy|War,Valiant 2005
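
The score is a ratio of two shares, so a movie liked by only a small fraction of similar users can still rank first if almost nobody else likes it. The sketch below reproduces that behaviour with invented percentages. The minimum-support filter at the end is an optional variation for illustration, not something the notebook does.

```python
import pandas as pd

# Hypothetical shares in percent for three movies.
similar = pd.Series({101: 60.0, 102: 50.0, 103: 2.0}, name="similar")
everyone = pd.Series({101: 30.0, 102: 5.0, 103: 0.1}, name="all")

table = pd.concat([similar, everyone], axis=1)
table["score"] = table["similar"] / table["all"]
table = table.sort_values("score", ascending=False)

print(table.index.tolist())  # [103, 102, 101]

# Optional variation (an assumption, not in the notebook): require a minimum
# share among similar users before ranking by score.
supported = table[table["similar"] >= 10]
print(supported.index.tolist())  # [102, 101]
```

Movie 103 wins on score even though only 2% of similar users liked it, which mirrors how the short Toy Story specials rise to the top of the table above.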


In [19]:
# Function combining all the above steps: takes a movie ID and returns the top 10 recommendations
def find_similar_movies(movie_id):
    similar_users = ratings[(ratings["movieId"] == movie_id) & (ratings["rating"] > 4)]["userId"].unique()
    similar_user_recs = ratings[(ratings["userId"].isin(similar_users)) & (ratings["rating"] > 4)]["movieId"]
    similar_user_recs = similar_user_recs.value_counts() / len(similar_users)

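    # Shares here are fractions (not multiplied by 100), so this keeps movies liked by more than 20% of similar users
    similar_user_recs = similar_user_recs[similar_user_recs > .20]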
    all_users = ratings[(ratings["movieId"].isin(similar_user_recs.index)) & (ratings["rating"] > 4)]
    all_user_recs = all_users["movieId"].value_counts() / len(all_users["userId"].unique())
    rec_percentages = pd.concat([similar_user_recs, all_user_recs], axis=1)
    rec_percentages.columns = ["similar", "all"]
    
    rec_percentages["score"] = rec_percentages["similar"] / rec_percentages["all"]
    rec_percentages = rec_percentages.sort_values("score", ascending=False)
    return rec_percentages.head(10).merge(movies, left_index=True, right_on="movieId")[["score", "title", "genres"]]
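
For experimenting without the 25-million-row ratings file, the same steps can be wrapped in a function that receives its data as arguments. The sketch below is a hypothetical variant: the name `recommend`, its parameters `min_share` and `top_n`, the empty-result guard, and the toy tables are all assumptions made for illustration, not part of the notebook.

```python
import pandas as pd

def recommend(movie_id, ratings, movies, min_share=0.20, top_n=10):
    """Hypothetical variant of find_similar_movies that takes its data as arguments."""
    liked = ratings[ratings["rating"] > 4]
    similar_users = liked.loc[liked["movieId"] == movie_id, "userId"].unique()
    if len(similar_users) == 0:
        # Nobody rated this movie highly, so there is nothing to recommend.
        return pd.DataFrame(columns=["score", "title", "genres"])

    # Share of similar users who liked each movie (a fraction, not a percent).
    similar = liked[liked["userId"].isin(similar_users)]["movieId"].value_counts() / len(similar_users)
    similar = similar[similar > min_share]

    # Share of everyone who liked at least one candidate movie.
    pool = liked[liked["movieId"].isin(similar.index)]
    everyone = pool["movieId"].value_counts() / pool["userId"].nunique()

    table = pd.concat([similar.rename("similar"), everyone.rename("all")], axis=1)
    table["score"] = table["similar"] / table["all"]
    table = table.sort_values("score", ascending=False).head(top_n)
    table = table.rename_axis("movieId").reset_index()
    return table.merge(movies, on="movieId")[["score", "title", "genres"]]

# Toy data: users 1 and 2 like movie 1; movie 3 is liked by almost everyone.
toy_ratings = pd.DataFrame({
    "userId":  [1, 1, 2, 2, 2, 3, 4, 5, 5, 6],
    "movieId": [1, 2, 1, 2, 3, 3, 3, 3, 2, 2],
    "rating":  [5.0, 5.0, 5.0, 4.5, 5.0, 5.0, 4.5, 5.0, 3.0, 5.0],
})
toy_movies = pd.DataFrame({
    "movieId": [1, 2, 3],
    "title": ["Searched Movie", "Niche Favorite", "Popular Hit"],
    "genres": ["Drama", "Drama", "Comedy"],
})

result = recommend(1, toy_ratings, toy_movies)
print(result)
```

In this toy run the broadly popular title ends up last, because its share among all users is higher than its share among the similar users.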

In [22]:
import ipywidgets as widgets
from IPython.display import display

# Text input widget for users to enter the movie title they want recommendations for
movie_name_input = widgets.Text(
    value='Interstellar 2014',
    description='Movie Title:',
    disabled=False
)
recommendation_list = widgets.Output() # Output widget to display the recommendations

# function to handle changes in the text input
def on_type(data):
    recommendation_list.clear_output() # Clear previous output to refresh recommendations
    with recommendation_list: # Redirect output to the recommendation list widget
        
        title = data["new"] # Get the new title input by the user
        if len(title) > 5:
            results = find(title) # Call the 'find' function to search for the movie
            movie_id = results.iloc[0]["movieId"] # Extract the movie ID of the first result
            display(find_similar_movies(movie_id)) # Display similar movies based on the movie ID

# Observer to trigger the on_type function whenever the user types in the input field
movie_name_input.observe(on_type, names='value')

# Display the movie title input widget and the output widget for recommendations
display(movie_name_input, recommendation_list)
# Please enter the full movie name followed by the year of release. 
# It might take around 10 seconds to load the recommendation.

Text(value='Interstellar 2014', description='Movie Title:')

Output()

Note: This project uses Jupyter widgets, which GitHub does not render. Please refer to the separately attached output screenshot. For full interactivity, download the code and run it in a Jupyter notebook.

The recommendation algorithm takes into account:
* Content-based filtering using cosine similarity to match the input movie.
* Collaborative filtering based on similar users who highly rated the input movie.
* A scoring system that evaluates the relevance of the recommended movies by comparing popularity among similar users versus the entire user base.
