# Paria Rezayan
# Movie Recommendation System using Content-based Filtering

This project builds a movie recommendation system using content-based filtering. The dataset comes from https://www.kaggle.com/datasets/thedevastator/imdb-movie-data-from-2006-2016 and contains 1000 movies with features such as title, genre, description, director, actors, year, runtime, rating, votes, revenue, and metascore.

# Libraries 

The following libraries have been imported for this project:

1) numpy: for numerical arrays
2) pandas: for data manipulation and analysis
3) matplotlib.pyplot and seaborn: for data visualization
4) difflib: for finding close matches between strings
5) sklearn.feature_extraction.text: for converting text into numerical vectors
6) sklearn.metrics.pairwise: for computing pairwise similarity scores between samples
7) IPython: for displaying outputs in a more user-friendly manner

In [1]:
# importing the necessary libraries
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt
import seaborn as sns
import difflib 
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from IPython.display import display

# Data Exploration

In [2]:
# loading the data 
data = pd.read_csv('IMDB-Movie-Data.csv')

In [3]:
# getting to know the dataset
data.shape

(1000, 13)

In [4]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 13 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   index               1000 non-null   int64  
 1   Rank                1000 non-null   int64  
 2   Title               1000 non-null   object 
 3   Genre               1000 non-null   object 
 4   Description         1000 non-null   object 
 5   Director            1000 non-null   object 
 6   Actors              1000 non-null   object 
 7   Year                1000 non-null   int64  
 8   Runtime (Minutes)   1000 non-null   int64  
 9   Rating              1000 non-null   float64
 10  Votes               1000 non-null   int64  
 11  Revenue (Millions)  872 non-null    float64
 12  Metascore           936 non-null    float64
dtypes: float64(3), int64(5), object(5)
memory usage: 101.7+ KB


In [5]:
data.head()

| | index | Rank | Title | Genre | Description | Director | Actors | Year | Runtime (Minutes) | Rating | Votes | Revenue (Millions) | Metascore |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | Guardians of the Galaxy | Action,Adventure,Sci-Fi | A group of intergalactic criminals are forced ... | James Gunn | Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S... | 2014 | 121 | 8.1 | 757074 | 333.13 | 76.0 |
| 1 | 1 | 2 | Prometheus | Adventure,Mystery,Sci-Fi | Following clues to the origin of mankind, a te... | Ridley Scott | Noomi Rapace, Logan Marshall-Green, Michael Fa... | 2012 | 124 | 7.0 | 485820 | 126.46 | 65.0 |
| 2 | 2 | 3 | Split | Horror,Thriller | Three girls are kidnapped by a man with a diag... | M. Night Shyamalan | James McAvoy, Anya Taylor-Joy, Haley Lu Richar... | 2016 | 117 | 7.3 | 157606 | 138.12 | 62.0 |
| 3 | 3 | 4 | Sing | Animation,Comedy,Family | In a city of humanoid animals, a hustling thea... | Christophe Lourdelet | Matthew McConaughey,Reese Witherspoon, Seth Ma... | 2016 | 108 | 7.2 | 60545 | 270.32 | 59.0 |
| 4 | 4 | 5 | Suicide Squad | Action,Adventure,Fantasy | A secret government agency recruits some of th... | David Ayer | Will Smith, Jared Leto, Margot Robbie, Viola D... | 2016 | 123 | 6.2 | 393727 | 325.02 | 40.0 |


In [6]:
data.describe()

| | index | Rank | Year | Runtime (Minutes) | Rating | Votes | Revenue (Millions) | Metascore |
|---|---|---|---|---|---|---|---|---|
| count | 1000.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 | 872.0 | 936.0 |
| mean | 499.5 | 500.5 | 2012.783 | 113.172 | 6.7232 | 169808.3 | 82.956376 | 58.985043 |
| std | 288.819436 | 288.819436 | 3.205962 | 18.810908 | 0.945429 | 188762.6 | 103.25354 | 17.194757 |
| min | 0.0 | 1.0 | 2006.0 | 66.0 | 1.9 | 61.0 | 0.0 | 11.0 |
| 25% | 249.75 | 250.75 | 2010.0 | 100.0 | 6.2 | 36309.0 | 13.27 | 47.0 |
| 50% | 499.5 | 500.5 | 2014.0 | 111.0 | 6.8 | 110799.0 | 47.985 | 59.5 |
| 75% | 749.25 | 750.25 | 2016.0 | 123.0 | 7.4 | 239909.8 | 113.715 | 72.0 |
| max | 999.0 | 1000.0 | 2016.0 | 191.0 | 9.0 | 1791916.0 | 936.63 | 100.0 |


In [7]:
# checking for non-numeric data points
data.dtypes

index                   int64
Rank                    int64
Title                  object
Genre                  object
Description            object
Director               object
Actors                 object
Year                    int64
Runtime (Minutes)       int64
Rating                float64
Votes                   int64
Revenue (Millions)    float64
Metascore             float64
dtype: object

In [8]:
# selecting relevant features for our content-based recommendation system
selected_features = ['Title', 'Genre', 'Description']

In [9]:
# checking the total count of NaN values in the selected features
print(data[selected_features].isnull().sum().sum())

0
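
The count above is 0 for this dataset, so no cleaning is needed here. If a selected text column did contain missing values, joining the row with `' '.join` would fail because NaN is a float, not a string. A minimal sketch of one way to handle that case, using a hypothetical toy frame rather than the IMDB data:

```python
import pandas as pd

# Hypothetical toy frame (not the IMDB data) with one missing description
toy = pd.DataFrame({
    'Title': ['Movie A', 'Movie B'],
    'Genre': ['Drama', 'Comedy'],
    'Description': ['A quiet story.', None],
})
features = ['Title', 'Genre', 'Description']

# Replace missing text with an empty string so ' '.join only sees strings
cleaned = toy[features].fillna('')
combined = cleaned.apply(lambda row: ' '.join(row), axis=1)
print(combined.tolist())
```

Filling with an empty string keeps the row in the dataset while contributing no extra terms to its text.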


In [10]:
# combining the selected features 
combined_features = data[selected_features].apply(lambda row: ' '.join(row), axis=1)
print(combined_features)

0      Guardians of the Galaxy Action,Adventure,Sci-F...
1      Prometheus Adventure,Mystery,Sci-Fi Following ...
2      Split Horror,Thriller Three girls are kidnappe...
3      Sing Animation,Comedy,Family In a city of huma...
4      Suicide Squad Action,Adventure,Fantasy A secre...
                             ...                        
995    Secret in Their Eyes Crime,Drama,Mystery A tig...
996    Hostel: Part II Horror Three American college ...
997    Step Up 2: The Streets Drama,Music,Romance Rom...
998    Search Party Adventure,Comedy A pair of friend...
999    Nine Lives Comedy,Family,Fantasy A stuffy busi...
Length: 1000, dtype: object


In [11]:
# vectorizing the text data 
vectorizer = TfidfVectorizer()
feature_vectors = vectorizer.fit_transform(combined_features)
print(feature_vectors)

  (0, 6024)	0.25115936025447294
  (0, 1304)	0.2415450902181415
  (0, 5577)	0.2415450902181415
  (0, 2350)	0.12486716808237268
  (0, 6227)	0.23368965962139007
  (0, 2122)	0.3108886612320426
  (0, 5403)	0.2020661696930252
  (0, 5772)	0.19143005927690324
  (0, 6376)	0.2095782870537396
  (0, 5769)	0.12539023032030575
  (0, 2271)	0.1987502583552243
  (0, 380)	0.14933909156708955
  (0, 1390)	0.2714097407069263
  (0, 3005)	0.2714097407069263
  (0, 2537)	0.1739603586438204
  (0, 2177)	0.1341238793599193
  (0, 4967)	0.1341238793599193
  (0, 190)	0.10067380579795737
  (0, 159)	0.09400873685865625
  (0, 2392)	0.2934189605989598
  (0, 5687)	0.10700718085895795
  (0, 3990)	0.1956292089531824
  (0, 2547)	0.2934189605989598
  (1, 266)	0.2384514074750334
  (1, 3941)	0.18869405779984558
  :	:
  (998, 5691)	0.14090659259920557
  (998, 2636)	0.1406085590583855
  (998, 1197)	0.11555056954061664
  (998, 6343)	0.12110646544163163
  (998, 4020)	0.1306552787920351
  (998, 5769)	0.1478835970287896
  (998, 190)

In [12]:
# cosine similarity 
similarity_score = cosine_similarity(feature_vectors)

In [13]:
print('Cosine Similarity of the feature vectors', similarity_score)
print('The shape of the Cosine Similarity', similarity_score.shape)

Cosine Similarity of the feature vectors [[1.         0.08996481 0.04292115 ... 0.043895   0.05229416 0.02160264]
 [0.08996481 1.         0.06504706 ... 0.01175418 0.04185964 0.04015897]
 [0.04292115 0.06504706 1.         ... 0.01092367 0.0265568  0.00733756]
 ...
 [0.043895   0.01175418 0.01092367 ... 1.         0.01460152 0.01447088]
 [0.05229416 0.04185964 0.0265568  ... 0.01460152 1.         0.02304049]
 [0.02160264 0.04015897 0.00733756 ... 0.01447088 0.02304049 1.        ]]
The shape of the Cosine Similarity (1000, 1000)
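
To make the numbers in this matrix easier to read: cosine similarity is the dot product of two vectors divided by the product of their lengths, so it is 1 on the diagonal (each movie compared with itself) and 0 when two movies share no terms. A small sketch with hypothetical vectors, comparing a hand-written version against `cosine_similarity`:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical vectors for illustration only
a = np.array([[1.0, 2.0, 0.0]])
b = np.array([[2.0, 4.0, 0.0]])   # same direction as a
c = np.array([[0.0, 0.0, 3.0]])   # shares no non-zero component with a

def cosine(u, v):
    """Cosine of the angle between two row vectors."""
    u, v = u.ravel(), v.ravel()
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine(a, b), cosine(a, c))
print(cosine_similarity(np.vstack([a, b, c])))
```

Because the score depends only on direction, a long description and a short one can still be rated as very similar if they use the same terms in similar proportions.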


* As shown above, the dataset has 1000 rows and 13 columns. The "Title", "Genre", and "Description" columns selected for the content-based recommendation system contain no missing values. I joined the selected features into one string per movie, converted those strings into TF-IDF vectors with TfidfVectorizer, and computed the cosine similarity between every pair of movies, which gives a 1000 x 1000 matrix.

# User Input

In this section, the user enters the title of a favorite movie, and the system recommends other movies from the dataset that are similar to it.

In [14]:
# getting input data from the user 
user_input = input('Please type in the title of your favorite movie:')
print(user_input)

Please type in the title of your favorite movie:Fight Club
Fight Club


# Movie Recommendation System Implementation

Next, I find the closest match in the dataset for the user's input and then use the cosine similarity scores to suggest movies similar to the matched one.

In [17]:
# creating a list of all the movie titles 
all_titles = data['Title'].tolist()
print(all_titles)

['Guardians of the Galaxy', 'Prometheus', 'Split', 'Sing', 'Suicide Squad', 'The Great Wall', 'La La Land', 'Mindhorn', 'The Lost City of Z', 'Passengers', 'Fantastic Beasts and Where to Find Them', 'Hidden Figures', 'Rogue One', 'Moana', 'Colossal', 'The Secret Life of Pets', 'Hacksaw Ridge', 'Jason Bourne', 'Lion', 'Arrival', 'Gold', 'Manchester by the Sea', 'Hounds of Love', 'Trolls', 'Independence Day: Resurgence', 'Paris pieds nus', 'Bahubali: The Beginning', 'Dead Awake', 'Bad Moms', "Assassin's Creed", 'Why Him?', 'Nocturnal Animals', 'X-Men: Apocalypse', 'Deadpool', 'Resident Evil: The Final Chapter', 'Captain America: Civil War', 'Interstellar', 'Doctor Strange', 'The Magnificent Seven', '5- 25- 77', 'Sausage Party', 'Moonlight', "Don't Fuck in the Woods", 'The Founder', 'Lowriders', 'Pirates of the Caribbean: On Stranger Tides', 'Miss Sloane', 'Fallen', 'Star Trek Beyond', 'The Last Face', 'Star Wars: Episode VII - The Force Awakens', 'Underworld: Blood Wars', "Mother's Day",

* Here, I use the difflib library to find titles in the dataset that closely match the user input. get_close_matches returns its results sorted from most to least similar, so the best match is the first element of the list:

In [18]:
# finding the close matches for the user input 
close_matches = difflib.get_close_matches(user_input, all_titles)
print(close_matches)

['The Hangover', 'The Last Face', 'The Bad Batch']
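
`get_close_matches` is case-sensitive and returns an empty list when no title reaches its similarity cutoff (0.6 by default), in which case taking the first element would raise an IndexError. The sketch below shows one possible way to guard against both. The helper name `best_title_match` and the title list are hypothetical and not part of the notebook.

```python
import difflib

# Hypothetical title list for illustration
titles = ['The Hangover', 'Fight Club', 'The Last Face', 'Interstellar']

def best_title_match(query, titles, cutoff=0.6):
    """Return the closest title to `query`, or None if nothing is close enough."""
    # Compare in lower case, but remember the original spelling of each title
    lowered = {t.lower(): t for t in titles}
    matches = difflib.get_close_matches(query.lower(), list(lowered), n=1, cutoff=cutoff)
    return lowered[matches[0]] if matches else None

print(best_title_match('fight clb', titles))   # prints: Fight Club
print(best_title_match('zzzzzz', titles))      # prints: None
```

Returning None lets the caller ask the user for another title instead of failing on an empty list.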


In [19]:
# finding the best match for the user input
best_match = close_matches[0]
print(best_match)

The Hangover


In [20]:
# getting the IMDB rating for the best match
rating_best_match = data[data.Title == best_match]['Rating'].values[0]
print(rating_best_match)

7.8


In [21]:
# getting the index of the best match
index_best_match = data[data.Title == best_match]['index'].values[0]
print(index_best_match)

255


In [22]:
# creating a list of (movie index, similarity to the best match) tuples, one per movie
similarities = list(enumerate(similarity_score[index_best_match]))

In [23]:
print(similarities)

[(0, 0.04425881960054167), (1, 0.045450151828962955), (2, 0.08446564413028597), (3, 0.08194805120555405), (4, 0.05607300489462608), (5, 0.035388978757614056), (6, 0.016667287111676726), (7, 0.03692562781379106), (8, 0.08504756770301145), (9, 0.013961252304717023), (10, 0.08150580173438571), (11, 0.039740516684689906), (12, 0.05268964225275697), (13, 0.03434388426778809), (14, 0.06780418119334362), (15, 0.03758445619142756), (16, 0.0354633587324231), (17, 0.021379869455621654), (18, 0.06851743445955923), (19, 0.062352150872596164), (20, 0.057008068444836545), (21, 0.03144222153065617), (22, 0.04643473113657672), (23, 0.03133392249904536), (24, 0.012455101559875791), (25, 0.01654497660223469), (26, 0.02847375427146433), (27, 0.025494048100835744), (28, 0.07336227556630945), (29, 0.035550360901519806), (30, 0.030451744923868977), (31, 0.0), (32, 0.03256688465082344), (33, 0.027724194003225806), (34, 0.0656698688008444), (35, 0.017936340530728626), (36, 0.02557468495614954), (37, 0.0271926

In [24]:
len(similarities)

1000

* After retrieving the cosine similarity between the best match and every movie in the dataset, the similarity scores are sorted in descending order:

In [25]:
# sorting movies based on their cosine similarity scores (highest first)
sorted_movies = sorted(similarities, key=lambda x: x[1], reverse=True)
print(sorted_movies)

[(255, 1.0000000000000002), (452, 0.15541715393332886), (607, 0.15184169586308494), (696, 0.1477352594170288), (994, 0.14340266465069954), (115, 0.1404729214527269), (582, 0.1377210920077751), (738, 0.1307993897551599), (122, 0.13046084292569948), (873, 0.12720744734848322), (790, 0.12322924563873416), (784, 0.11989731614826521), (795, 0.11980103812607917), (529, 0.1175920589745993), (220, 0.11491037114751275), (765, 0.11246179334607574), (430, 0.11106530680797469), (921, 0.1104004374803714), (550, 0.11038800765109906), (975, 0.11036694020805288), (635, 0.1083739337872599), (537, 0.10758525894315286), (953, 0.1074022524114319), (835, 0.10612165962030823), (773, 0.10525625631098665), (226, 0.10316211706525617), (729, 0.1019093576046905), (998, 0.10185390249624846), (749, 0.09937367425211421), (399, 0.09761326494084237), (50, 0.09642848867328736), (629, 0.09606825509420357), (409, 0.09443470951321327), (411, 0.09420844198953382), (708, 0.09376395550196327), (826, 0.09298773949885401), (3
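
The first entry of the sorted list is the best match itself, since every movie has a similarity of 1 with itself. As an alternative to sorting a list of tuples, the ranking could also be done directly on the NumPy row with `argsort`, dropping the query movie along the way. A sketch with a hypothetical five-element similarity row:

```python
import numpy as np

# Hypothetical similarity row; position 1 plays the role of the best match
scores = np.array([0.10, 1.00, 0.35, 0.05, 0.35])
query_index = 1

# Negating the scores turns the ascending argsort into a descending ranking;
# kind='stable' keeps tied scores in their original order
order = np.argsort(-scores, kind='stable')
order = order[order != query_index]        # drop the query movie itself

top_k = [(int(i), float(scores[i])) for i in order[:3]]
print(top_k)   # prints: [(2, 0.35), (4, 0.35), (0, 0.1)]
```

Whether to drop the query movie is a design choice; the notebook keeps it, which is why it appears at rank 1 in the lists that follow.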

* Finally, the top 10 and then the top 30 most similar movies are presented to the user. The best match itself appears first because its similarity to itself is 1:

In [26]:
print('Here are the top 10 suggested movies for you: \n')
# sorted_movies is already in descending order, so its first 10 entries are the top 10
for i, (index, score) in enumerate(sorted_movies[:10], start=1):
    title_from_index = data.loc[index, 'Title']
    print(i, '.', title_from_index)

Here are the top 10 suggested movies for you: 

1 . The Hangover
2 . Pandorum
3 . Horrible Bosses
4 . 10 Years
5 . Project X
6 . Office Christmas Party
7 . Sex Tape
8 . Knight of Cups
9 . Mike and Dave Need Wedding Dates
10 . One Day


In [27]:
print('Here are the top 30 suggested movies for you: \n')

# every index in sorted_movies is a row label of data, so no bounds check is needed
for i, (index, score) in enumerate(sorted_movies[:30], start=1):
    title_from_index = data.loc[index, 'Title']
    print(i, '.', title_from_index)


Here are the top 30 suggested movies for you: 

1 . The Hangover
2 . Pandorum
3 . Horrible Bosses
4 . 10 Years
5 . Project X
6 . Office Christmas Party
7 . Sex Tape
8 . Knight of Cups
9 . Mike and Dave Need Wedding Dates
10 . One Day
11 . Sisters
12 . Before We Go
13 . No Strings Attached
14 . The Bourne Legacy
15 . Hardcore Henry
16 . PK
17 . 3 Idiots
18 . Scouts Guide to the Zombie Apocalypse
19 . The Break-Up
20 . My Big Fat Greek Wedding 2
21 . Knocked Up
22 . The Do-Over
23 . The Kings of Summer
24 . The Loft
25 . Lady in the Water
26 . The Lobster
27 . The Guest
28 . Search Party
29 . Percy Jackson: Sea of Monsters
30 . Magic Mike
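
The steps above (combine features, vectorize, compute similarity, match the title, rank) could be wrapped into a single reusable function. The following is only a sketch under stated assumptions: the function name `recommend`, its parameters, and the toy frame are hypothetical, and unlike the lists above it leaves the matched movie out of its own recommendations.

```python
import difflib
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def recommend(title, frame, n=10, features=('Title', 'Genre', 'Description')):
    """Return up to n titles most similar to the closest match for `title`."""
    titles = frame['Title'].tolist()
    matches = difflib.get_close_matches(title, titles, n=1)
    if not matches:
        return []                              # nothing close enough to the input
    pos = titles.index(matches[0])             # row position of the matched movie

    # One text string per movie, then TF-IDF vectors and pairwise cosine similarity
    text = frame[list(features)].fillna('').apply(lambda row: ' '.join(row), axis=1)
    sim = cosine_similarity(TfidfVectorizer().fit_transform(text))

    ranked = sim[pos].argsort()[::-1]          # positions, most similar first
    return [titles[i] for i in ranked if i != pos][:n]

# Hypothetical toy frame, not the IMDB data
toy = pd.DataFrame({
    'Title': ['Space Raiders', 'Galaxy Patrol', 'Wedding Weekend', 'Quiet Harvest'],
    'Genre': ['Action,Sci-Fi', 'Action,Sci-Fi', 'Comedy,Romance', 'Drama'],
    'Description': [
        'A crew of outlaws crosses the galaxy.',
        'Space officers chase outlaws across the galaxy.',
        'Two friends crash a wedding.',
        'A farmer faces a hard winter.',
    ],
})
recs = recommend('Space Raiders', toy, n=2)
print(recs)
```

This version recomputes the similarity matrix on every call to stay self-contained; for repeated queries it would make more sense to compute the matrix once, as the notebook does, and pass it in.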


# Conclusion

This project implemented a content-based recommendation system on an IMDB dataset. Given a movie title entered by the user, the system finds the closest matching title in the dataset and recommends the movies whose titles, genres, and descriptions are most similar to it.