## Recommender System
A **recommender system** is a type of machine learning model designed to suggest items, products, or content to users based on various criteria. In this project, we implement a content-based recommender system that recommends movies by analyzing their features such as overviews, genres, directors, actors, and other relevant attributes.

In [1]:
# Importing required libraries

import numpy as np
import pandas as pd
import ast
from nltk.stem.porter import PorterStemmer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
import pickle

In [2]:
# import warnings filter
from warnings import simplefilter
# ignore all future warnings
simplefilter(action='ignore', category=FutureWarning)

In [3]:
# Reading the CSV file
df = pd.read_csv('../data/complete_movie_dataset.csv')

In [4]:
# Displaying the top 5 rows
df.head()

Unnamed: 0,title,movie_ID,overview,original_language,release_date,popularity,vote_count,vote_average,budget,genres,homepage,production_companies,production_countries,revenue,runtime,director,top_actors,keywords
0,Venom: The Last Dance,912649,Eddie and Venom are on the run. Hunted by both...,en,2024-10-22,3047.508,789,6.489,120000000,"['Science Fiction', 'Action', 'Adventure']",https://venom.movie,"['Columbia Pictures', 'Pascal Pictures', 'Matt...",['United States of America'],394000000,109,Kelly Marcel,"Tom Hardy, Chiwetel Ejiofor, Juno Temple","hero, superhero, anti hero, villain, alien lif..."
1,Terrifier 3,1034541,Five years after surviving Art the Clown's Hal...,en,2024-10-09,1929.351,1031,6.909,2000000,"['Horror', 'Thriller', 'Mystery']",https://terrifier3.com/,"['Cineverse', 'Bloody Disgusting', 'Dark Age C...",['United States of America'],78573405,125,Damien Leone,"Lauren LaVera, David Howard Thornton, Samantha...","monster, post-traumatic stress disorder (ptsd)..."
2,The Wild Robot,1184918,"After a shipwreck, an intelligent robot called...",en,2024-09-12,1808.363,2938,8.471,78000000,"['Animation', 'Science Fiction', 'Family']",https://www.thewildrobotmovie.com,['DreamWorks Animation'],['United States of America'],308583746,102,Chris Sanders,"Lupita Nyong'o, Pedro Pascal, Kit Connor","robot, based on children's book, aftercreditss..."
3,Apocalypse Z: The Beginning of the End,1118031,When a kind of rabies that transforms people i...,es,2024-10-04,1638.618,498,6.784,0,"['Drama', 'Action', 'Horror']",https://nostromopictures.com/en/movies/coming-...,['Nostromo Pictures'],['Spain'],0,119,Carles Torrens,"Francisco Ortiz, José María Yázpik, Berta Vázquez","based on novel or book, cat, human animal rela..."
4,Gladiator II,558449,Years after witnessing the death of the revere...,en,2024-11-13,1742.5,450,6.791,310000000,"['Action', 'Adventure', 'Drama']",https://www.gladiator.movie,"['Paramount Pictures', 'Red Wagon Entertainmen...",['United States of America'],87000000,148,Ridley Scott,"Paul Mescal, Denzel Washington, Pedro Pascal","epic, gladiator, roman empire, ancient rome, s..."


Columns to retain:
- movie_ID - to fetch the poster in the Streamlit app
- title
- overview
- genres - to classify movies
- director
- top_actors
- keywords

In [5]:
# Select required columns for the recommender system
df_req = df[['movie_ID','title','overview','genres','director','top_actors','keywords']]

In [6]:
# Displaying the top 5 rows of 'df_req'
df_req.head()

Unnamed: 0,movie_ID,title,overview,genres,director,top_actors,keywords
0,912649,Venom: The Last Dance,Eddie and Venom are on the run. Hunted by both...,"['Science Fiction', 'Action', 'Adventure']",Kelly Marcel,"Tom Hardy, Chiwetel Ejiofor, Juno Temple","hero, superhero, anti hero, villain, alien lif..."
1,1034541,Terrifier 3,Five years after surviving Art the Clown's Hal...,"['Horror', 'Thriller', 'Mystery']",Damien Leone,"Lauren LaVera, David Howard Thornton, Samantha...","monster, post-traumatic stress disorder (ptsd)..."
2,1184918,The Wild Robot,"After a shipwreck, an intelligent robot called...","['Animation', 'Science Fiction', 'Family']",Chris Sanders,"Lupita Nyong'o, Pedro Pascal, Kit Connor","robot, based on children's book, aftercreditss..."
3,1118031,Apocalypse Z: The Beginning of the End,When a kind of rabies that transforms people i...,"['Drama', 'Action', 'Horror']",Carles Torrens,"Francisco Ortiz, José María Yázpik, Berta Vázquez","based on novel or book, cat, human animal rela..."
4,558449,Gladiator II,Years after witnessing the death of the revere...,"['Action', 'Adventure', 'Drama']",Ridley Scott,"Paul Mescal, Denzel Washington, Pedro Pascal","epic, gladiator, roman empire, ancient rome, s..."


Transform the above into a DataFrame with only three columns:
- movie_ID
- title
- tags

## Data Pre-Processing

In [7]:
# Checking for null values
df_req.isnull().sum()

movie_ID        0
title           0
overview       78
genres          0
director       64
top_actors     75
keywords      919
dtype: int64

In [8]:
# Filling the null values with empty strings so that no rows are dropped
df_req = df_req.fillna('')

In [9]:
# Verifying the shape of 'df_req' after filling the null values
df_req.shape

(9130, 7)

In [10]:
df_req.isnull().sum()

movie_ID      0
title         0
overview      0
genres        0
director      0
top_actors    0
keywords      0
dtype: int64

In [11]:
# Checking for duplicates
df_req.duplicated().sum()

0

There are no duplicates in the 'df_req' dataset

##### Convert the string value of 'genres' column to a python list using ast.literal_eval

In [12]:
df_req.iloc[0].genres

"['Science Fiction', 'Action', 'Adventure']"

In [13]:
df_req['genres'] = df_req['genres'].apply(ast.literal_eval)

In [14]:
df_req.iloc[0].genres

['Science Fiction', 'Action', 'Adventure']
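`ast.literal_eval` raises an exception on empty or malformed strings. That did not matter here because 'genres' has no missing values, but other datasets may not be as clean. A minimal sketch of a more defensive parse, assuming a hypothetical helper named `parse_list_cell` that is not part of the notebook:

```python
import ast


def parse_list_cell(cell):
    """Parse a stringified Python list; return [] for empty or malformed cells.

    Hypothetical helper for illustration, not part of the original notebook.
    """
    if not isinstance(cell, str) or not cell.strip():
        return []
    try:
        value = ast.literal_eval(cell)
    except (ValueError, SyntaxError):
        # literal_eval rejects anything that is not a valid Python literal
        return []
    # Only accept lists, so that a plain string literal is not mistaken for one
    return value if isinstance(value, list) else []


print(parse_list_cell("['Science Fiction', 'Action', 'Adventure']"))
print(parse_list_cell(""))
```

With pandas, this could be used as `df_req['genres'].apply(parse_list_cell)` in place of the direct `ast.literal_eval` call.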

##### Converting string 'top_actors' column to a list

In [15]:
df_req.iloc[0].top_actors

'Tom Hardy, Chiwetel Ejiofor, Juno Temple'

In [16]:
df_req['top_actors'] = df_req['top_actors'].apply(lambda x: x.split(', '))

In [17]:
df_req.iloc[0].top_actors

['Tom Hardy', 'Chiwetel Ejiofor', 'Juno Temple']

##### Converting 'overview' and 'keywords' to lists

In [18]:
df_req['overview'] = df_req['overview'].apply(lambda x:x.split())
df_req['keywords'] = df_req['keywords'].apply(lambda x:x.split())

In [19]:
df_req.head()

Unnamed: 0,movie_ID,title,overview,genres,director,top_actors,keywords
0,912649,Venom: The Last Dance,"[Eddie, and, Venom, are, on, the, run., Hunted...","[Science Fiction, Action, Adventure]",Kelly Marcel,"[Tom Hardy, Chiwetel Ejiofor, Juno Temple]","[hero,, superhero,, anti, hero,, villain,, ali..."
1,1034541,Terrifier 3,"[Five, years, after, surviving, Art, the, Clow...","[Horror, Thriller, Mystery]",Damien Leone,"[Lauren LaVera, David Howard Thornton, Samanth...","[monster,, post-traumatic, stress, disorder, (..."
2,1184918,The Wild Robot,"[After, a, shipwreck,, an, intelligent, robot,...","[Animation, Science Fiction, Family]",Chris Sanders,"[Lupita Nyong'o, Pedro Pascal, Kit Connor]","[robot,, based, on, children's, book,, aftercr..."
3,1118031,Apocalypse Z: The Beginning of the End,"[When, a, kind, of, rabies, that, transforms, ...","[Drama, Action, Horror]",Carles Torrens,"[Francisco Ortiz, José María Yázpik, Berta Váz...","[based, on, novel, or, book,, cat,, human, ani..."
4,558449,Gladiator II,"[Years, after, witnessing, the, death, of, the...","[Action, Adventure, Drama]",Ridley Scott,"[Paul Mescal, Denzel Washington, Pedro Pascal]","[epic,, gladiator,, roman, empire,, ancient, r..."
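Splitting 'keywords' on whitespace leaves trailing commas on the tokens and breaks multi-word keywords such as "anti hero" into separate words. The notebook keeps that behavior. As an alternative sketch only, assuming a hypothetical helper named `split_keywords`, the comma-separated phrases could be kept together in the same way as actor names:

```python
raw = "hero, superhero, anti hero, villain"

# What a plain whitespace split produces (the approach used in the notebook)
whitespace_tokens = raw.split()
print(whitespace_tokens)
# ['hero,', 'superhero,', 'anti', 'hero,', 'villain']


def split_keywords(cell):
    """Split on commas and fuse each multi-word keyword into one token.

    Hypothetical helper for illustration, not part of the original notebook.
    """
    return [part.strip().replace(" ", "") for part in cell.split(",") if part.strip()]


phrase_tokens = split_keywords(raw)
print(phrase_tokens)
# ['hero', 'superhero', 'antihero', 'villain']
```

Switching to this variant would change the tags and therefore the recommendations shown later, so the outputs in this notebook would no longer match.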


### Transforming the columns by removing spaces within each entry

In [20]:
df_req['genres'] = df_req['genres'].apply(lambda x:[i.replace(' ','') for i in x])

df_req['overview'] = df_req['overview'].apply(lambda x:[i.replace(' ','') for i in x])

df_req['director'] = df_req['director'].apply(lambda x: x.replace(" ", "") if isinstance(x, str) else x)

df_req['top_actors'] = df_req['top_actors'].apply(lambda x:[i.replace(' ','') for i in x])


In [21]:
df_req['director'] = df_req['director'].apply(lambda x: [x] if isinstance(x, str) else x)

In [22]:
df_req.head()

Unnamed: 0,movie_ID,title,overview,genres,director,top_actors,keywords
0,912649,Venom: The Last Dance,"[Eddie, and, Venom, are, on, the, run., Hunted...","[ScienceFiction, Action, Adventure]",[KellyMarcel],"[TomHardy, ChiwetelEjiofor, JunoTemple]","[hero,, superhero,, anti, hero,, villain,, ali..."
1,1034541,Terrifier 3,"[Five, years, after, surviving, Art, the, Clow...","[Horror, Thriller, Mystery]",[DamienLeone],"[LaurenLaVera, DavidHowardThornton, SamanthaSc...","[monster,, post-traumatic, stress, disorder, (..."
2,1184918,The Wild Robot,"[After, a, shipwreck,, an, intelligent, robot,...","[Animation, ScienceFiction, Family]",[ChrisSanders],"[LupitaNyong'o, PedroPascal, KitConnor]","[robot,, based, on, children's, book,, aftercr..."
3,1118031,Apocalypse Z: The Beginning of the End,"[When, a, kind, of, rabies, that, transforms, ...","[Drama, Action, Horror]",[CarlesTorrens],"[FranciscoOrtiz, JoséMaríaYázpik, BertaVázquez]","[based, on, novel, or, book,, cat,, human, ani..."
4,558449,Gladiator II,"[Years, after, witnessing, the, death, of, the...","[Action, Adventure, Drama]",[RidleyScott],"[PaulMescal, DenzelWashington, PedroPascal]","[epic,, gladiator,, roman, empire,, ancient, r..."


In [23]:
# Concatenating the columns

df_req['tags'] = df_req['overview'] + df_req['genres'] + df_req['director'] + df_req['top_actors'] + df_req['keywords']

In [24]:
# Displaying tags
df_req.head()

Unnamed: 0,movie_ID,title,overview,genres,director,top_actors,keywords,tags
0,912649,Venom: The Last Dance,"[Eddie, and, Venom, are, on, the, run., Hunted...","[ScienceFiction, Action, Adventure]",[KellyMarcel],"[TomHardy, ChiwetelEjiofor, JunoTemple]","[hero,, superhero,, anti, hero,, villain,, ali...","[Eddie, and, Venom, are, on, the, run., Hunted..."
1,1034541,Terrifier 3,"[Five, years, after, surviving, Art, the, Clow...","[Horror, Thriller, Mystery]",[DamienLeone],"[LaurenLaVera, DavidHowardThornton, SamanthaSc...","[monster,, post-traumatic, stress, disorder, (...","[Five, years, after, surviving, Art, the, Clow..."
2,1184918,The Wild Robot,"[After, a, shipwreck,, an, intelligent, robot,...","[Animation, ScienceFiction, Family]",[ChrisSanders],"[LupitaNyong'o, PedroPascal, KitConnor]","[robot,, based, on, children's, book,, aftercr...","[After, a, shipwreck,, an, intelligent, robot,..."
3,1118031,Apocalypse Z: The Beginning of the End,"[When, a, kind, of, rabies, that, transforms, ...","[Drama, Action, Horror]",[CarlesTorrens],"[FranciscoOrtiz, JoséMaríaYázpik, BertaVázquez]","[based, on, novel, or, book,, cat,, human, ani...","[When, a, kind, of, rabies, that, transforms, ..."
4,558449,Gladiator II,"[Years, after, witnessing, the, death, of, the...","[Action, Adventure, Drama]",[RidleyScott],"[PaulMescal, DenzelWashington, PedroPascal]","[epic,, gladiator,, roman, empire,, ancient, r...","[Years, after, witnessing, the, death, of, the..."


In [25]:
# DataFrame for recommender system with movie_ID, title and tags
# (.copy() avoids pandas' SettingWithCopyWarning on the later assignments to 'tags')

new_df = df_req[['movie_ID','title','tags']].copy()

In [26]:
new_df.head()

Unnamed: 0,movie_ID,title,tags
0,912649,Venom: The Last Dance,"[Eddie, and, Venom, are, on, the, run., Hunted..."
1,1034541,Terrifier 3,"[Five, years, after, surviving, Art, the, Clow..."
2,1184918,The Wild Robot,"[After, a, shipwreck,, an, intelligent, robot,..."
3,1118031,Apocalypse Z: The Beginning of the End,"[When, a, kind, of, rabies, that, transforms, ..."
4,558449,Gladiator II,"[Years, after, witnessing, the, death, of, the..."


In [27]:
new_df.loc[:,'tags'] = new_df['tags'].apply(lambda x : ' '.join(x))

In [28]:
new_df['tags'][0]

"Eddie and Venom are on the run. Hunted by both of their worlds and with the net closing in, the duo are forced into a devastating decision that will bring the curtains down on Venom and Eddie's last dance. ScienceFiction Action Adventure KellyMarcel TomHardy ChiwetelEjiofor JunoTemple hero, superhero, anti hero, villain, alien life-form, based on comic, sequel, aftercreditsstinger, woman director, absurd"

In [29]:
# converting to lowercase
new_df.loc[:,'tags'] = new_df['tags'].apply(lambda x : x.lower())

In [30]:
new_df.head()

Unnamed: 0,movie_ID,title,tags
0,912649,Venom: The Last Dance,eddie and venom are on the run. hunted by both...
1,1034541,Terrifier 3,five years after surviving art the clown's hal...
2,1184918,The Wild Robot,"after a shipwreck, an intelligent robot called..."
3,1118031,Apocalypse Z: The Beginning of the End,when a kind of rabies that transforms people i...
4,558449,Gladiator II,years after witnessing the death of the revere...


## Stemming

Different forms of the same word, such as "activity" and "activities", would otherwise be treated as unrelated terms. Stemming reduces such variants to a common root form, in this case "activ".

Applying stemming to the 'tags' column removes these inconsistencies, so the recommender treats the variants as the same word. This makes it easier to match movies with similar themes or topics, which should lead to more relevant recommendations.
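To make the idea concrete without any dependencies, here is a deliberately crude suffix stripper. It is a toy written for this illustration, not the Porter algorithm that the notebook uses through NLTK, and the suffix list is an assumption chosen only to cover the example words:

```python
def crude_stem(word):
    """Strip a few common English suffixes.

    Illustration only: this is NOT the Porter stemmer, just a toy that
    shows how different word forms can collapse to one token.
    """
    for suffix in ("ities", "ity", "ing", "es", "s"):
        # Keep at least three characters so short words are left alone
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def stem_text(text):
    """Lowercase the text, stem every whitespace-separated word, and rejoin."""
    return " ".join(crude_stem(word) for word in text.lower().split())


print(stem_text("Activity activities"))
# activ activ
```

Both forms end up as the same token, so a later word count treats them as one term.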

In [31]:
ps = PorterStemmer()

In [32]:
# Function to apply stemming to a given text

def stem(text):
    y = []
    
    for i in text.split(): # splits the text into individual words, applies the stemmer to each word 
        y.append(ps.stem(i))
        
    return " ".join(y) # joins the stemmed words back into a single string

In [33]:
new_df.loc[:,'tags'] = new_df['tags'].apply(stem)

## Vectorization

In this project, **vectorization** is used to represent movie metadata (tags) as numerical vectors. This process transforms text data (overviews, genres, directors, top actors, and keywords) into a format that can be processed mathematically.

### Bag of words

One common technique for vectorization is the ***Bag of Words*** approach, which creates a matrix where rows represent movies, columns represent unique words, and the values indicate the frequency of those words in the tags.
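The matrix described above can be built by hand for a tiny corpus. The following sketch uses only the standard library; the two movie names and their tags are made-up toy data, and `count_vector` is a hypothetical helper:

```python
from collections import Counter

# Toy corpus: invented tags, not taken from the dataset
docs = {
    "Movie A": "wizard school magic wizard",
    "Movie B": "robot island magic",
}

# The vocabulary is the sorted set of all unique words; one column per word
vocabulary = sorted({word for text in docs.values() for word in text.split()})
print(vocabulary)
# ['island', 'magic', 'robot', 'school', 'wizard']


def count_vector(text, vocabulary):
    """Return the word counts of `text` in vocabulary order."""
    counts = Counter(text.split())
    return [counts[word] for word in vocabulary]


# One row per movie
matrix = {title: count_vector(text, vocabulary) for title, text in docs.items()}
print(matrix["Movie A"])  # [0, 1, 0, 1, 2]
print(matrix["Movie B"])  # [1, 1, 1, 0, 0]
```

`CountVectorizer` does the same kind of counting, with extra steps such as tokenization rules, stop-word removal, and a cap on the vocabulary size.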

In [34]:
# Convert text to vectors with the top 6000 words, excluding English stop words
cv = CountVectorizer(max_features=6000, stop_words='english')

In [35]:
# Generate feature vectors from the 'tags' column
vectors = cv.fit_transform(new_df['tags']).toarray()

In [36]:
vectors[0]

array([0, 0, 0, ..., 0, 0, 0])

Most values in vectors[0] will be 0 because the Bag of Words representation creates a large sparse matrix where each column represents a unique word from the corpus. For any given movie, only a small subset of words from the total vocabulary will appear in its tags, leading to many zeros in its vector.

In [37]:
cv.get_feature_names_out()

array(['000', '10', '100', ..., 'zone', 'zoo', 'zooeydeschanel'],
      dtype=object)

## Cosine Similarity

Cosine similarity is then applied to measure the similarity between two movies based on their vector representations. It is the cosine of the angle between two vectors in a high-dimensional space. Because word-count vectors are non-negative, the score falls between 0 and 1: a score close to 1 indicates that the movies are highly similar, while a score close to 0 means they share few or no terms.

The next cell computes the cosine similarity matrix for the vectors. The result is a square matrix in which each element represents the similarity between two movies based on their vectorized 'tags'.

The shape of the matrix will be (number_of_movies, number_of_movies).
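For two vectors u and v, the score is the dot product divided by the product of their lengths: cos(u, v) = (u · v) / (‖u‖ ‖v‖). A minimal pure-Python sketch for a single pair, assuming two toy count vectors and a hypothetical function named `cosine`:

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        # Design choice for this sketch: an all-zero vector is similar to nothing
        return 0.0
    return dot / (norm_u * norm_v)


# Toy count vectors over a five-word vocabulary
a = [0, 1, 0, 1, 2]
b = [1, 1, 1, 0, 0]

print(round(cosine(a, b), 4))  # 0.2357
print(round(cosine(a, a), 4))  # 1.0
```

Here the two vectors share a single word, so the dot product is 1, the lengths are √6 and √3, and the score is 1/√18 ≈ 0.2357.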

In [38]:
# Compute the cosine similarity matrix for the movie vectors
similarity = cosine_similarity(vectors)

In [39]:
similarity.shape

(9130, 9130)

### Main function - Bag of Words

In [40]:
def recommend_bow(movie):
    
    # Get the index of the movie in the dataset based on its title
    movie_index = new_df[new_df['title'] == movie].index[0]
    
    # Retrieve the similarity scores of the selected movie
    distances = similarity[movie_index]
    
    # Sort the movies by similarity score in descending order and take the top 10, skipping the first entry (the movie itself)
    movies_list = sorted(list(enumerate(distances)),reverse=True,key = lambda x: x[1])[1:11]
    
    # Return the titles of the top 10 most similar movies
    return [new_df.iloc[i[0]].title for i in movies_list]
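Two edge cases are worth keeping in mind with this function: a title that is not in the dataset makes `.index[0]` raise an `IndexError`, and the `[1:11]` slice assumes the queried movie is ranked first. The sketch below shows one way to handle both, using plain lists instead of the DataFrame; `top_k_similar` and the three-movie toy data are assumptions for illustration:

```python
def top_k_similar(title, titles, similarity_matrix, k=10):
    """Return up to k titles most similar to `title`, excluding the title itself.

    Hypothetical helper for illustration, not part of the original notebook.
    """
    if title not in titles:
        # Unknown title: return no recommendations instead of raising
        return []
    idx = titles.index(title)
    scores = similarity_matrix[idx]
    # Exclude the query by index rather than assuming it sorts first
    ranked = sorted(
        (i for i in range(len(titles)) if i != idx),
        key=lambda i: scores[i],
        reverse=True,
    )
    return [titles[i] for i in ranked[:k]]


# Toy data: three movies and a symmetric similarity matrix
titles = ["A", "B", "C"]
sim = [
    [1.0, 0.2, 0.9],
    [0.2, 1.0, 0.4],
    [0.9, 0.4, 1.0],
]

print(top_k_similar("A", titles, sim, k=2))        # ['C', 'B']
print(top_k_similar("Missing", titles, sim, k=2))  # []
```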

In [41]:
recommend_bow('Harry Potter and the Prisoner of Azkaban')

["Harry Potter and the Philosopher's Stone",
 'Harry Potter and the Goblet of Fire',
 'Harry Potter and the Deathly Hallows: Part 2',
 'Harry Potter and the Chamber of Secrets',
 'Harry Potter and the Order of the Phoenix',
 'Fantastic Beasts: The Crimes of Grindelwald',
 'The Irregular at Magic High School: The Girl Who Summons the Stars',
 'Harry Potter and the Half-Blood Prince',
 'Upside-Down Magic',
 'Fantastic Beasts and Where to Find Them']

### TF-IDF

In [42]:
# Initialize the TfidfVectorizer
tfidf = TfidfVectorizer(max_features=6000, stop_words='english')

# Fit and transform the 'tags' column using TF-IDF
tfidf_vectors = tfidf.fit_transform(new_df['tags']).toarray()

# Check the shape of the resulting TF-IDF matrix (number of rows and features)
print(tfidf_vectors.shape)

# Compute the cosine similarity between the movies
# (cosine_similarity was already imported at the top of the notebook)
similarity_tfidf = cosine_similarity(tfidf_vectors)

# Display the shape of the TF-IDF similarity matrix
print(similarity_tfidf.shape)

(9130, 6000)
(9130, 9130)
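For intuition on how TF-IDF differs from raw counts: each count is multiplied by an inverse document frequency, so words that appear in many movies contribute less than rare ones. The sketch below uses the smoothed formula that scikit-learn applies by default, idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalization of each row. The three toy documents and the helper names `idf` and `tfidf_vector` are assumptions for illustration:

```python
import math
from collections import Counter

# Toy corpus: invented tags, not taken from the dataset
docs = ["wizard school magic", "robot island magic", "wizard duel magic"]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

vocabulary = sorted({word for doc in tokenized for word in doc})
# Document frequency: in how many documents each word appears
doc_freq = Counter(word for doc in tokenized for word in set(doc))


def idf(word):
    """Smoothed inverse document frequency."""
    return math.log((1 + n_docs) / (1 + doc_freq[word])) + 1


def tfidf_vector(tokens):
    """Weight the term counts by idf, then scale the vector to unit length."""
    counts = Counter(tokens)
    weights = [counts[word] * idf(word) for word in vocabulary]
    norm = math.sqrt(sum(w * w for w in weights))
    return [w / norm for w in weights] if norm else weights


print({word: round(idf(word), 3) for word in ("magic", "wizard", "school")})
# {'magic': 1.0, 'wizard': 1.288, 'school': 1.693}
```

"magic" appears in every document and gets the lowest weight, while "school" appears in only one and gets the highest.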


### Main Function - TF-IDF

In [43]:
def recommend_tfidf(movie):
    
    # Get the index of the movie in the dataset based on its title
    movie_index = new_df[new_df['title'] == movie].index[0]
    
    # Retrieve the similarity scores of the selected movie
    distances = similarity_tfidf[movie_index]
    
    # Sort the movies by similarity score in descending order and take the top 10, skipping the first entry (the movie itself)
    movies_list = sorted(list(enumerate(distances)),reverse=True,key = lambda x: x[1])[1:11]
    
    # Return the titles of the top 10 most similar movies
    return [new_df.iloc[i[0]].title for i in movies_list]

In [44]:
recommend_tfidf('Harry Potter and the Prisoner of Azkaban')

["Harry Potter and the Philosopher's Stone",
 'Harry Potter and the Goblet of Fire',
 'Harry Potter and the Deathly Hallows: Part 2',
 'Harry Potter and the Order of the Phoenix',
 'Harry Potter and the Chamber of Secrets',
 'Harry Potter and the Half-Blood Prince',
 'Fantastic Beasts: The Crimes of Grindelwald',
 'Fantastic Beasts and Where to Find Them',
 'Fantastic Beasts: The Secrets of Dumbledore',
 'Upside-Down Magic']

## Comparison

In [45]:
recommend_bow('Harry Potter and the Goblet of Fire')

["Harry Potter and the Philosopher's Stone",
 'Harry Potter and the Prisoner of Azkaban',
 'Harry Potter and the Deathly Hallows: Part 2',
 'Harry Potter and the Order of the Phoenix',
 'Harry Potter and the Half-Blood Prince',
 'Harry Potter and the Chamber of Secrets',
 'Fantastic Beasts: The Crimes of Grindelwald',
 'Eragon',
 'The NeverEnding Story',
 'Fantastic Beasts: The Secrets of Dumbledore']

In [46]:
recommend_tfidf('Harry Potter and the Goblet of Fire')

["Harry Potter and the Philosopher's Stone",
 'Harry Potter and the Prisoner of Azkaban',
 'Harry Potter and the Order of the Phoenix',
 'Harry Potter and the Deathly Hallows: Part 2',
 'Harry Potter and the Half-Blood Prince',
 'Harry Potter and the Chamber of Secrets',
 'Fantastic Beasts: The Crimes of Grindelwald',
 'Fantastic Beasts: The Secrets of Dumbledore',
 'Harry Potter and the Deathly Hallows: Part 1',
 'Black Clover: Sword of the Wizard King']

### Calculating Precision

In [47]:
def calculate_precision_at_k(system_recommendations, curated_recommendations, k=10):
    # Count matches between system and curated lists
    matches = sum(1 for movie in system_recommendations[:k] if movie in curated_recommendations)
    
    # Calculate precision
    precision = matches / k
    return precision

In [48]:
curated_recommendations = [
    "Harry Potter and the Half-Blood Prince",
    "Harry Potter and the Prisoner of Azkaban",
    "Harry Potter and the Order of the Phoenix",
    "Harry Potter and the Chamber of Secrets",
    "Harry Potter and the Deathly Hallows: Part 1",
    "Harry Potter and the Deathly Hallows: Part 2",
    "Fantastic Beasts and Where to Find Them",
    "Fantastic Beasts: The Crimes of Grindelwald",
    "Harry Potter and the Philosopher's Stone",
    "Fantastic Beasts: The Secrets of Dumbledore"
]

# Bag of Words recommendations
bow_recommendations = recommend_bow("Harry Potter and the Goblet of Fire")

# TF-IDF recommendations
tfidf_recommendations = recommend_tfidf("Harry Potter and the Goblet of Fire")

# Calculate Precision@10 for Bag of Words
precision_bow = calculate_precision_at_k(bow_recommendations, curated_recommendations)

# Calculate Precision@10 for TF-IDF
precision_tfidf = calculate_precision_at_k(tfidf_recommendations, curated_recommendations)

# Print results
print(f"Precision@10 for Bag of Words: {precision_bow * 100:.2f}%")
print(f"Precision@10 for TF-IDF: {precision_tfidf * 100:.2f}%")

Precision@10 for Bag of Words: 80.00%
Precision@10 for TF-IDF: 90.00%
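Precision@k is the number of recommended items in the top k that also appear in the reference list, divided by k. In the run above, Bag of Words has 8 of 10 matches and TF-IDF has 9 of 10. A small worked sketch on toy data, assuming a hypothetical variant named `precision_at_k` that adds a set lookup and a guard on `k`:

```python
def precision_at_k(recommended, relevant, k=10):
    """Fraction of the top-k recommendations that appear in `relevant`."""
    if k <= 0:
        raise ValueError("k must be positive")
    relevant_set = set(relevant)  # constant-time membership checks
    hits = sum(1 for item in recommended[:k] if item in relevant_set)
    return hits / k


# Toy data: three of the five recommendations are in the reference list
recommended = ["A", "B", "X", "C", "Y"]
relevant = ["A", "B", "C", "D"]

print(precision_at_k(recommended, relevant, k=5))  # 0.6
```

The comparison in this notebook rests on a single query movie and one hand-curated list, so the 80% versus 90% gap is best read as an indication rather than a general result.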


### Exporting Data Using Pickle

In [54]:
# The context manager closes the file once the dump completes
with open('movies.pkl','wb') as f:
    pickle.dump(new_df, f)

In [49]:
# The context manager closes the file once the dump completes
with open('similarity.pkl','wb') as f:
    pickle.dump(similarity_tfidf, f)
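The exported files can be read back with `pickle.load`, for example from the Streamlit app. The sketch below shows the round trip on a small stand-in object written to a temporary directory; the payload and file name are placeholders, not the notebook's actual data:

```python
import os
import pickle
import tempfile

# Stand-in for the real objects (the DataFrame and the similarity matrix)
payload = {"titles": ["A", "B"], "similarity": [[1.0, 0.3], [0.3, 1.0]]}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example.pkl")

    # Write the object to disk
    with open(path, "wb") as f:
        pickle.dump(payload, f)

    # Read it back
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored == payload)  # True
```

Only unpickle files from a trusted source, since loading a pickle can execute arbitrary code. Note also that a 9130 x 9130 matrix of 64-bit floats takes roughly 670 MB, so the similarity file will be large.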