# Content-Based Movie Recommendation System

The goal of this notebook is to build a **Content-Based Movie Recommendation System** that suggests movies based on what a user has already watched.
By analyzing content features such as genres, cast, crew, and plot keywords, the system recommends movies that share similar attributes.

The system uses three datasets that have been pre-cleaned and stored in the cleaned folder:
- movies_metadata.csv: Detailed information about each movie, such as title, genres, and ratings.
- movie_credits.csv: Cast and crew details for each movie.
- movie_keywords.csv: Keywords related to the theme or plot of each movie.

Using these datasets, the system applies text vectorization and cosine similarity to calculate similarity scores between movies.
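To make the idea concrete, here is a minimal sketch of text vectorization followed by cosine similarity. The three descriptions are made up for illustration and are not taken from the datasets.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy descriptions, not taken from the datasets.
docs = [
    "space adventure robot friendship",
    "space adventure alien war",
    "romantic comedy wedding",
]

# Turn each description into a TF-IDF weighted vector.
toy_matrix = TfidfVectorizer().fit_transform(docs)

# Pairwise cosine similarity: a 3 x 3 matrix with 1.0 on the diagonal.
toy_sim = cosine_similarity(toy_matrix)

print(toy_sim.round(2))
```

The first two descriptions share words, so their score is above zero, while the first and third share none and score zero.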

### Loading and Preparing the Data

In [1]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
# Show full content in each cell
#pd.set_option('display.max_colwidth', None)
#pd.set_option('display.max_rows', None)  # Show all rows
#pd.set_option('display.max_columns', None)  # Show all columns
# Reset the display options to their default values
pd.reset_option('display.max_rows')
pd.reset_option('display.max_columns')
pd.reset_option('display.max_colwidth')

In [2]:
movies_metadata = pd.read_csv("cleaned/movies_metadata.csv")
movies_keywords = pd.read_csv("cleaned/movie_keywords.csv")
movies_credits = pd.read_csv("cleaned/movie_credits.csv")

In [3]:
movies_metadata

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,status,tagline,title,video,vote_average,vote_count,adult_category,budget_category,popularity_category,runtime_category
0,False,Toy Story Collection,30000000.0,"Animation, Comedy, Family",http://toystory.disney.com/toy-story,862.0,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",...,Released,,Toy Story,False,7.7,5415.0,General Audiences,HighBudget,BlockbusterPopularity,UltraShortDuration
1,False,,65000000.0,"Adventure, Fantasy, Family",,8844.0,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,...,Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0,General Audiences,BlockbusterBudget,BlockbusterPopularity,HighDuration
2,False,Grumpy Old Men Collection,,"Romance, Comedy",,15602.0,tt0113228,en,Grumpier Old Men,A family wedding reignites the ancient feud be...,...,Released,Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,False,6.5,92.0,General Audiences,,BlockbusterPopularity,HighDuration
3,False,,16000000.0,"Comedy, Drama, Romance",,31357.0,tt0114885,en,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom...",...,Released,Friends are the people who let you be yourself...,Waiting to Exhale,False,6.1,34.0,General Audiences,HighBudget,HighPopularity,UltraHighDuration
4,False,Father of the Bride Collection,,Comedy,,11862.0,tt0113041,en,Father of the Bride Part II,Just when George Banks has recovered from his ...,...,Released,Just When His World Is Back To Normal... He's ...,Father of the Bride Part II,False,5.7,173.0,General Audiences,,BlockbusterPopularity,HighDuration
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
45428,False,,,"Drama, Family",http://www.imdb.com/title/tt6209470/,439050.0,tt6209470,fa,رگ خواب,Rising and falling between a man and woman.,...,Released,Rising and falling between a man and woman,Subdue,False,4.0,1.0,General Audiences,,UltraLowPopularity,ShortDuration
45429,False,,,Drama,,111109.0,tt2028550,tl,Siglo ng Pagluluwal,An artist struggles to finish his work while a...,...,Released,,Century of Birthing,False,9.0,3.0,General Audiences,,UltraLowPopularity,UltraHighDuration
45430,False,,,"Action, Drama, Thriller",,67758.0,tt0303758,en,Betrayal,"When one of her hits goes wrong, a professiona...",...,Released,A deadly game of wits.,Betrayal,False,3.8,6.0,General Audiences,,MediumPopularity,ShortDuration
45431,False,,,,,227506.0,tt0008536,en,Satana likuyushchiy,"In a small town live two brothers, one a minis...",...,Released,,Satan Triumphant,False,0.0,0.0,General Audiences,,UltraLowPopularity,ShortDuration


In [4]:
movies_keywords

Unnamed: 0,id,keywords
0,862,"jealousy, toy, boy, friendship, friends, rival..."
1,8844,"board game, disappearance, based on children's..."
2,15602,"fishing, best friend, duringcreditsstinger, ol..."
3,31357,"based on novel, interracial relationship, sing..."
4,11862,"baby, midlife crisis, confidence, aging, daugh..."
...,...,...
45427,439050,tragic love
45428,111109,"artist, play, pinoy"
45429,67758,
45430,227506,


In [5]:
movies_credits

Unnamed: 0,cast,crew,id
0,"Tom Hanks, Tim Allen, Don Rickles, Jim Varney,...","John Lasseter, Joss Whedon, Andrew Stanton, Jo...",862
1,"Robin Williams, Jonathan Hyde, Kirsten Dunst, ...","Larry J. Franco, Jonathan Hensleigh, James Hor...",8844
2,"Walter Matthau, Jack Lemmon, Ann-Margret, Soph...","Howard Deutch, Mark Steven Johnson, Mark Steve...",15602
3,"Whitney Houston, Angela Bassett, Loretta Devin...","Forest Whitaker, Ronald Bass, Ronald Bass, Ezr...",31357
4,"Steve Martin, Diane Keaton, Martin Short, Kimb...","Alan Silvestri, Elliot Davis, Nancy Meyers, Na...",11862
...,...,...,...
45427,"Leila Hatami, Kourosh Tahami, Elham Korda","Hamid Nematollah, Hamid Nematollah, Farshad Mo...",439050
45428,"Angel Aquino, Perry Dizon, Hazel Orencio, Joel...","Lav Diaz, Lav Diaz, Dante Perez, Lav Diaz, Lav...",111109
45429,"Erika Eleniak, Adam Baldwin, Julie du Page, Ja...","Mark L. Lester, C. Courtney Joyner, Jeffrey Go...",67758
45430,"Iwan Mosschuchin, Nathalie Lissenko, Pavel Pav...","Yakov Protazanov, Joseph N. Ermolieff",227506


In [6]:
# Select the columns that are most relevant for the recommendation system
columns = ['id', 'original_title', 'title', 'tagline', 'overview', 'belongs_to_collection', 'genres', 'adult_category', 'original_language', 'spoken_languages', 'runtime_category', 'production_companies', 'production_countries', 'budget_category', 'popularity_category']
features = movies_metadata[columns]

In [7]:
# Merge the main movies_metadata dataframe with keywords and credits. Left joins ensure all movies are kept.
movies_df = pd.merge(pd.merge(movies_metadata[columns], movies_keywords, how='left', on='id' ), movies_credits, how='left', on='id')

In [8]:
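# Replace missing values with empty strings, since NaN causes issues during string concatenation.
# Note: columns[1:] holds only the metadata columns, so the merged keywords, cast, and crew columns are not part of 'combined'.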
movies_df['combined'] = movies_df[columns[1:]].fillna('').astype(str).agg(' '.join, axis=1)
movies_df

Unnamed: 0,id,original_title,title,tagline,overview,belongs_to_collection,genres,adult_category,original_language,spoken_languages,runtime_category,production_companies,production_countries,budget_category,popularity_category,keywords,cast,crew,combined
0,862.0,Toy Story,Toy Story,,"Led by Woody, Andy's toys live happily in his ...",Toy Story Collection,"Animation, Comedy, Family",General Audiences,en,English,UltraShortDuration,Pixar Animation Studios,United States of America,HighBudget,BlockbusterPopularity,"jealousy, toy, boy, friendship, friends, rival...","Tom Hanks, Tim Allen, Don Rickles, Jim Varney,...","John Lasseter, Joss Whedon, Andrew Stanton, Jo...","Toy Story Toy Story Led by Woody, Andy's toys..."
1,8844.0,Jumanji,Jumanji,Roll the dice and unleash the excitement!,When siblings Judy and Peter discover an encha...,,"Adventure, Fantasy, Family",General Audiences,en,"English, Français",HighDuration,"TriStar Pictures, Teitler Film, Interscope Com...",United States of America,BlockbusterBudget,BlockbusterPopularity,"board game, disappearance, based on children's...","Robin Williams, Jonathan Hyde, Kirsten Dunst, ...","Larry J. Franco, Jonathan Hensleigh, James Hor...",Jumanji Jumanji Roll the dice and unleash the ...
2,15602.0,Grumpier Old Men,Grumpier Old Men,Still Yelling. Still Fighting. Still Ready for...,A family wedding reignites the ancient feud be...,Grumpy Old Men Collection,"Romance, Comedy",General Audiences,en,English,HighDuration,"Warner Bros., Lancaster Gate",United States of America,,BlockbusterPopularity,"fishing, best friend, duringcreditsstinger, ol...","Walter Matthau, Jack Lemmon, Ann-Margret, Soph...","Howard Deutch, Mark Steven Johnson, Mark Steve...",Grumpier Old Men Grumpier Old Men Still Yellin...
3,31357.0,Waiting to Exhale,Waiting to Exhale,Friends are the people who let you be yourself...,"Cheated on, mistreated and stepped on, the wom...",,"Comedy, Drama, Romance",General Audiences,en,English,UltraHighDuration,Twentieth Century Fox Film Corporation,United States of America,HighBudget,HighPopularity,"based on novel, interracial relationship, sing...","Whitney Houston, Angela Bassett, Loretta Devin...","Forest Whitaker, Ronald Bass, Ronald Bass, Ezr...",Waiting to Exhale Waiting to Exhale Friends ar...
4,11862.0,Father of the Bride Part II,Father of the Bride Part II,Just When His World Is Back To Normal... He's ...,Just when George Banks has recovered from his ...,Father of the Bride Collection,Comedy,General Audiences,en,English,HighDuration,"Sandollar Productions, Touchstone Pictures",United States of America,,BlockbusterPopularity,"baby, midlife crisis, confidence, aging, daugh...","Steve Martin, Diane Keaton, Martin Short, Kimb...","Alan Silvestri, Elliot Davis, Nancy Meyers, Na...",Father of the Bride Part II Father of the Brid...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
45428,439050.0,رگ خواب,Subdue,Rising and falling between a man and woman,Rising and falling between a man and woman.,,"Drama, Family",General Audiences,fa,فارسی,ShortDuration,,Iran,,UltraLowPopularity,tragic love,"Leila Hatami, Kourosh Tahami, Elham Korda","Hamid Nematollah, Hamid Nematollah, Farshad Mo...",رگ خواب Subdue Rising and falling between a ma...
45429,111109.0,Siglo ng Pagluluwal,Century of Birthing,,An artist struggles to finish his work while a...,,Drama,General Audiences,tl,,UltraHighDuration,Sine Olivia,Philippines,,UltraLowPopularity,"artist, play, pinoy","Angel Aquino, Perry Dizon, Hazel Orencio, Joel...","Lav Diaz, Lav Diaz, Dante Perez, Lav Diaz, Lav...",Siglo ng Pagluluwal Century of Birthing An ar...
45430,67758.0,Betrayal,Betrayal,A deadly game of wits.,"When one of her hits goes wrong, a professiona...",,"Action, Drama, Thriller",General Audiences,en,English,ShortDuration,American World Pictures,United States of America,,MediumPopularity,,"Erika Eleniak, Adam Baldwin, Julie du Page, Ja...","Mark L. Lester, C. Courtney Joyner, Jeffrey Go...",Betrayal Betrayal A deadly game of wits. When ...
45431,227506.0,Satana likuyushchiy,Satan Triumphant,,"In a small town live two brothers, one a minis...",,,General Audiences,en,,ShortDuration,Yermoliev,Russia,,UltraLowPopularity,,"Iwan Mosschuchin, Nathalie Lissenko, Pavel Pav...","Yakov Protazanov, Joseph N. Ermolieff",Satana likuyushchiy Satan Triumphant In a sma...
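The `combined` column above is built from `columns[1:]`, which lists metadata fields only. If the keywords, cast, and crew are meant to influence the similarity scores, one possible variant is to join every column except the id. The sketch below uses small made-up tables, and the names `meta`, `keywords`, `credits`, and `text_columns` are illustrative only.

```python
import pandas as pd

# Hypothetical miniature versions of the three tables.
meta = pd.DataFrame({"id": [1, 2], "title": ["Movie A", "Movie B"], "genres": ["Comedy", None]})
keywords = pd.DataFrame({"id": [1], "keywords": ["toy, friendship"]})
credits = pd.DataFrame({"id": [1, 2], "cast": ["Actor One", "Actor Two"], "crew": ["Director One", None]})

# Left joins keep every movie in meta, even when keywords or credits are missing.
toy_df = meta.merge(keywords, how="left", on="id").merge(credits, how="left", on="id")

# Every column except the id, so keywords, cast, and crew also end up in the text.
text_columns = [col for col in toy_df.columns if col != "id"]
toy_df["combined"] = toy_df[text_columns].fillna("").astype(str).agg(" ".join, axis=1)

print(toy_df["combined"].tolist())
```

Changing the text that feeds the vectorizer would change every similarity score, so the outputs shown in this notebook would no longer match.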


In [9]:
# Apply TF-IDF Vectorization for text to numerical feature transformation
# Option 1: TFIDF Vectorizer
# Assigns weights to words based on frequency. Words that are common across documents get lower weights.
# max_df=0.9 ignores words appearing in more than 90% of the documents. English stop words are removed because they carry little meaning.
tfidf = TfidfVectorizer(max_df=0.9, stop_words='english')
tfidf_matrix = tfidf.fit_transform(movies_df['combined'])
tfidf_matrix.shape


# Option 2: CountVectorizer counts the frequency of words in the text without weighting.
# vectorizer = CountVectorizer()
# vectorized_matrix = vectorizer.fit_transform(movies_df['combined'])
# vectorized_matrix.shape

In [10]:
# Compute cosine similarity
# Option 1
cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)
# Option 2
# cosine_sim = cosine_similarity(vectorized_matrix, vectorized_matrix)
cosine_sim

array([[1.        , 0.01913877, 0.01486669, ..., 0.00755015, 0.00504998,
        0.00573448],
       [0.01913877, 1.        , 0.04071265, ..., 0.06907149, 0.01641236,
        0.00929048],
       [0.01486669, 0.04071265, 1.        , ..., 0.00843701, 0.01004329,
        0.02761358],
       ...,
       [0.00755015, 0.06907149, 0.00843701, ..., 1.        , 0.00454169,
        0.0056754 ],
       [0.00504998, 0.01641236, 0.01004329, ..., 0.00454169, 1.        ,
        0.0031309 ],
       [0.00573448, 0.00929048, 0.02761358, ..., 0.0056754 , 0.0031309 ,
        1.        ]])
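With roughly 45,000 movies, the full pairwise matrix has about two billion entries, which is on the order of 16 GB in float64. When only a few movies are of interest, one alternative is to compare just those rows against the whole matrix. This is a sketch on a made-up corpus, and `watched_rows` is a hypothetical list of row positions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy corpus standing in for movies_df['combined'].
texts = [
    "superhero team saves city",
    "superhero spy thriller",
    "prison drama friendship",
    "romantic comedy wedding",
]
matrix = TfidfVectorizer().fit_transform(texts)

# Row positions of the movies of interest (assumed to be known already).
watched_rows = [0, 2]

# Compare only the selected rows against every movie: shape (2, 4) instead of (4, 4).
partial_sim = cosine_similarity(matrix[watched_rows], matrix)

print(partial_sim.shape)
```

Each row of `partial_sim` equals the corresponding row of the full matrix, so the later ranking logic can work from it unchanged.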

In [11]:
# Cosine similarity score matrix
# Scores range from 0 to 1, where 1 means the two movies have identical feature text and 0 means no similarity.
# Movie ids are added as row and column labels for easy lookup.
similarity_df = pd.DataFrame(cosine_sim, columns=movies_df['id'], index=movies_df['id']).reset_index()
similarity_df

id,id.1,862.0,8844.0,15602.0,31357.0,11862.0,949.0,11860.0,45325.0,9091.0,...,84419.0,390959.0,289923.0,222848.0,30840.0,439050.0,111109.0,67758.0,227506.0,461257.0
0,862.0,1.000000,0.019139,0.014867,0.012351,0.014120,0.007110,0.009000,0.008515,0.005452,...,0.008681,0.004507,0.018899,0.003158,0.031112,0.004989,0.000000,0.007550,0.005050,0.005734
1,8844.0,0.019139,1.000000,0.040713,0.007309,0.018797,0.045300,0.022838,0.015295,0.089024,...,0.009799,0.001815,0.013434,0.009871,0.029079,0.005505,0.026907,0.069071,0.016412,0.009290
2,15602.0,0.014867,0.040713,1.000000,0.010030,0.032551,0.024028,0.013442,0.009515,0.008254,...,0.006318,0.001838,0.005374,0.003529,0.015638,0.005575,0.000000,0.008437,0.010043,0.027614
3,31357.0,0.012351,0.007309,0.010030,1.000000,0.008058,0.013818,0.014304,0.017594,0.004176,...,0.006656,0.001936,0.008514,0.015227,0.072191,0.014423,0.017853,0.011153,0.006520,0.022422
4,11862.0,0.014120,0.018797,0.032551,0.008058,1.000000,0.009362,0.074206,0.007399,0.050070,...,0.011691,0.002166,0.010071,0.007506,0.014361,0.000000,0.000000,0.023614,0.008070,0.003703
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
45428,439050.0,0.004989,0.005505,0.005575,0.014423,0.000000,0.015148,0.000000,0.011304,0.000000,...,0.005868,0.005547,0.000000,0.003479,0.002345,1.000000,0.010339,0.012033,0.006774,0.005664
45429,111109.0,0.000000,0.026907,0.000000,0.017853,0.000000,0.006830,0.004749,0.001646,0.000000,...,0.005483,0.005183,0.000000,0.000000,0.014465,0.010339,1.000000,0.003473,0.003071,0.019356
45430,67758.0,0.007550,0.069071,0.008437,0.011153,0.023614,0.041812,0.014806,0.033377,0.107624,...,0.024813,0.003320,0.009706,0.035654,0.020286,0.012033,0.003473,1.000000,0.004542,0.005675
45431,227506.0,0.005050,0.016412,0.010043,0.006520,0.008070,0.000552,0.006945,0.004151,0.000423,...,0.003244,0.003066,0.000574,0.001900,0.000568,0.006774,0.003071,0.004542,1.000000,0.003131
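As a small illustration of id-based lookup, the sketch below labels both axes of a made-up 3 x 3 matrix with movie ids. Unlike `similarity_df`, it keeps the ids as the index instead of calling `reset_index()`, which is an assumption made here so that `.loc` can address a pair directly.

```python
import numpy as np
import pandas as pd

# Hypothetical 3 x 3 similarity matrix and movie ids.
toy_scores = np.array([
    [1.0, 0.2, 0.0],
    [0.2, 1.0, 0.5],
    [0.0, 0.5, 1.0],
])
toy_ids = [862, 8844, 15602]

# Label both axes with the movie ids.
toy_similarity = pd.DataFrame(toy_scores, index=toy_ids, columns=toy_ids)

# Similarity between two specific movies.
pair_score = toy_similarity.loc[8844, 15602]

# Most similar other movies to 8844, best first.
ranked = toy_similarity[8844].drop(8844).sort_values(ascending=False)

print(pair_score)
print(ranked)
```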


# Movie Recommender

In [34]:
# Ask the user what they have watched.
# user_input = input("Enter all the movies you have watched. Use commas to separate them.")
# user_input = [title.strip() for title in user_input.split(',')]
user_input = ['The Avengers', 'Black Widow', 'Black Widow', 'The Shawshank Redemption']

In [35]:
# Convert to movie id.
watched_movies_id = movies_df.loc[movies_df['title'].isin(user_input),'id'].tolist()
watched_movies_id

[278.0, 9320.0, 19345.0, 44052.0, 24428.0, 255577.0]
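Three distinct titles produce six ids here because different movies can share a title, and the lookup matches titles exactly. A possible refinement, sketched below on a made-up catalogue, is to normalise case and whitespace and to report titles that matched nothing. The names `catalogue`, `requested`, and `unmatched` are hypothetical.

```python
import pandas as pd

# Hypothetical miniature catalogue; two different movies share the title "Black Widow".
catalogue = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "title": ["Black Widow", "Black Widow", "The Avengers", "Jumanji"],
})
requested = ["black widow", "The Avengers", "Unknown Movie"]

# Normalise case and surrounding whitespace on both sides before comparing.
normalised_titles = catalogue["title"].str.strip().str.lower()
wanted = {title.strip().lower() for title in requested}

matched = catalogue[normalised_titles.isin(wanted)]
matched_ids = matched["id"].tolist()

# Titles that matched nothing, so the user can be told about them.
found = set(normalised_titles[normalised_titles.isin(wanted)])
unmatched = sorted(wanted - found)

print(matched_ids)
print(unmatched)
```

Telling same-titled movies apart would need extra input, such as a release year, which is outside what this sketch covers.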

In [36]:
# Collect the similar movies for each watched movie, then combine them once.
# Concatenating once avoids the warning raised when concatenating onto an empty dataframe.
similar_frames = []

# Iterate through each watched movie id to find similar movies.
for movie_id in watched_movies_id:
    scores = similarity_df[movie_id]
    # Select movies whose similarity score is at or above the average for the current movie.
    temp_df = similarity_df.loc[scores >= scores.mean(), ['id', movie_id]]
    # Rename similarity scores to weights
    temp_df = temp_df.rename(columns={movie_id: 'weights'})
    # Exclude every movie the user has watched. The system will not recommend what the user has already seen.
    temp_df = temp_df.loc[~temp_df['id'].isin(watched_movies_id)]
    similar_frames.append(temp_df)

rank_similarity = pd.concat(similar_frames)

# A movie id that appears for several watched movies is likely more relevant. Sum its weights so it ranks higher.
rank_similarity['weights'] = rank_similarity.groupby('id')['weights'].transform('sum')

# Rows with the same movie id now carry the same weight. Drop the duplicates and show the 10 highest weights.
top10similar = rank_similarity.drop_duplicates(subset='id').sort_values(by='weights', ascending=False).head(10)
top10similar



Unnamed: 0,id,weights
42979,14611.0,0.679593
25399,118134.0,0.6424
30718,14613.0,0.63391
26540,99861.0,0.607356
8354,43689.0,0.595584
10822,14609.0,0.575919
23037,100402.0,0.571668
5016,56133.0,0.483224
23048,257346.0,0.469892
36905,30345.0,0.461167


In [37]:
# Show Result
print(f"Based on movies you have watched: {user_input}, these are your recommendations:")
pd.merge(top10similar, movies_df[['id', 'title', 'overview']], on='id', how='left')[['title', 'overview']]

Based on movies you have watched: ['The Avengers', 'Black Widow', 'Black Widow', 'The Shawshank Redemption'], these are your recommendations:


Unnamed: 0,title,overview
0,Ultimate Avengers 2,Mysterious Wakanda lies in the darkest heart o...
1,The Widow From Chicago,A woman infiltrates a criminal mob to avenge h...
2,Next Avengers: Heroes of Tomorrow,The children of the Avengers hone their powers...
3,Avengers: Age of Ultron,When Tony Stark tries to jumpstart a dormant p...
4,The Merry Widow,A prince from a small kingdom courts a wealthy...
5,Ultimate Avengers,When a nuclear missile was fired at Washington...
6,Captain America: The Winter Soldier,After the cataclysmic events in New York with ...
7,Black Like Me,Black Like Me is the true account of John Grif...
8,Avengers Confidential: Black Widow & Punisher,When the Punisher takes out a black-market wea...
9,The Black Castle,A Man investigates the disappearance of two of...
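The ranking steps can also be wrapped in a reusable function. This is a sketch under stated assumptions, not the notebook's exact code: the hypothetical `recommend` helper expects a square DataFrame with movie ids on both the index and the columns, and it excludes every watched movie. The 5 x 5 matrix and its ids are made up.

```python
import numpy as np
import pandas as pd

def recommend(similarity, watched_ids, top_n=10):
    """Sum above-average similarity scores over the watched movies and rank the rest.

    `similarity` is assumed to be a square DataFrame whose index and columns are movie ids.
    """
    totals = pd.Series(0.0, index=similarity.index)
    for movie_id in watched_ids:
        scores = similarity[movie_id]
        # Scores below the movie's average contribute nothing.
        totals += scores.where(scores >= scores.mean(), 0.0)
    # Never recommend something the user has already watched.
    totals = totals.drop(index=watched_ids)
    # Drop movies that were not above average for any watched movie.
    totals = totals[totals > 0]
    return totals.sort_values(ascending=False).head(top_n)

# Hypothetical symmetric similarity matrix with made-up ids.
ids = [10, 20, 30, 40, 50]
toy = pd.DataFrame(
    np.array([
        [1.0, 0.2, 0.8, 0.6, 0.0],
        [0.2, 1.0, 0.7, 0.1, 0.3],
        [0.8, 0.7, 1.0, 0.4, 0.2],
        [0.6, 0.1, 0.4, 1.0, 0.5],
        [0.0, 0.3, 0.2, 0.5, 1.0],
    ]),
    index=ids, columns=ids,
)

result = recommend(toy, watched_ids=[10, 20], top_n=2)
print(result)
```

In this toy case movie 30 is above average for both watched movies, so its two scores are summed and it ranks ahead of movie 40, which is above average for only one.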
