<h1 style="text-align: center;">Movie Recommendation System Using Content-Based Filtering</h1>
<h3 style="text-align: center;">Elia Samuel</h3>
<h3 style="text-align: center;">Muhammad Fa'iz Ismail</h3>

---

## **Background**

This notebook implements a content-based movie recommendation system using the TMDB movies dataset. The system combines each movie's plot overview and genres into a single text feature, converts that text into bag-of-words count vectors, and uses cosine similarity between those vectors to recommend movies whose content is similar to a given title.

## **Section 1. Data Loading and Initial Exploration**

In [1]:
import pandas as pd
import pickle
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

The next cell loads the movie dataset, which contains the movie ID, title, genre, original language, overview, popularity, release date, and ratings.

In [2]:
data = pd.read_csv('top10K-TMDB-movies.csv')
data.head()

Unnamed: 0,id,title,genre,original_language,overview,popularity,release_date,vote_average,vote_count
0,278,The Shawshank Redemption,"Drama,Crime",en,Framed in the 1940s for the double murder of h...,94.075,1994-09-23,8.7,21862
1,19404,Dilwale Dulhania Le Jayenge,"Comedy,Drama,Romance",hi,"Raj is a rich, carefree, happy-go-lucky second...",25.408,1995-10-19,8.7,3731
2,238,The Godfather,"Drama,Crime",en,"Spanning the years 1945 to 1955, a chronicle o...",90.585,1972-03-14,8.7,16280
3,424,Schindler's List,"Drama,History,War",en,The true story of how businessman Oskar Schind...,44.761,1993-12-15,8.6,12959
4,240,The Godfather: Part II,"Drama,Crime",en,In the continuing saga of the Corleone crime f...,57.749,1974-12-20,8.6,9811


## **Section 2. Data Analysis and Statistics**

In [3]:
data.describe()

Unnamed: 0,id,popularity,vote_average,vote_count
count,10000.0,10000.0,10000.0,10000.0
mean,161243.505,34.697267,6.62115,1547.3094
std,211422.046043,211.684175,0.766231,2648.295789
min,5.0,0.6,4.6,200.0
25%,10127.75,9.15475,6.1,315.0
50%,30002.5,13.6375,6.6,583.5
75%,310133.5,25.65125,7.2,1460.0
max,934761.0,10436.917,8.7,31917.0


In [4]:
data.describe(include='object')

Unnamed: 0,title,genre,original_language,overview,release_date
count,10000,9997,10000,9987,10000
unique,9661,2123,43,9985,6113
top,Beauty and the Beast,Comedy,en,"""Loro"", in two parts, is a period movie that c...",2017-10-20
freq,4,744,7810,2,9


In [5]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 9 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   id                 10000 non-null  int64  
 1   title              10000 non-null  object 
 2   genre              9997 non-null   object 
 3   original_language  10000 non-null  object 
 4   overview           9987 non-null   object 
 5   popularity         10000 non-null  float64
 6   release_date       10000 non-null  object 
 7   vote_average       10000 non-null  float64
 8   vote_count         10000 non-null  int64  
dtypes: float64(2), int64(2), object(5)
memory usage: 703.3+ KB


1. Dataset Size and Completeness:
- The dataset contains exactly 10,000 movies
- There are very few missing values: only 3 missing genre entries and 13 missing overviews
- All other fields are complete (10,000 non-null entries each)

2. Movie Titles and Uniqueness:
- There are 9,661 unique titles out of 10,000 entries
- "Beauty and the Beast" appears most frequently with 4 occurrences
- This suggests some remakes or different versions of the same movie exist in the database

3. Genre Distribution:
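- There are 2,123 unique values in the `genre` column, each a comma-separated combination of genres
- The single-genre value "Comedy" is the most common combination, appearing 744 times
- The large number of unique values indicates that many movies combine several genres (e.g., "Comedy,Drama")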

4. Language Distribution:
- The dataset covers 43 different original languages
- English ("en") is heavily dominant with 7,810 movies
- This shows a strong bias towards English-language content

5. Ratings and Popularity:
- Vote averages range from 4.6 to 8.7, with a mean of 6.62
- The median vote average is 6.6, almost equal to the mean, which suggests a roughly symmetric distribution (this alone does not show that it is normal)
- Vote counts vary widely from 200 to 31,917, with a mean of 1,547
- The high standard deviation in vote counts (2,648) indicates some movies are much more widely rated than others

6. Release Dates:
- The most common release date is 2017-10-20, which appears 9 times
- There are 6,113 unique release dates
- The summary does not show the date range: `top` is the most frequent value, not the latest date

7. Content Descriptions:
- Almost all movies (9,987 out of 10,000) have overview descriptions
- The most common overview appears only twice, suggesting unique descriptions for most movies
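
When reading `describe` output for text columns, it helps to remember that `top` is the most frequent value and `freq` is its count. The sketch below uses made-up dates (not rows from the TMDB file) to show the difference between the most frequent date and the latest date.

```python
import pandas as pd

# Toy data for illustration only; these dates are not taken from the dataset.
toy = pd.DataFrame({
    "release_date": ["2017-10-20", "2019-05-01", "2017-10-20", "2021-03-12"],
})

summary = toy["release_date"].describe()
most_common = summary["top"]                         # the mode of the column
latest = pd.to_datetime(toy["release_date"]).max()   # the actual latest date

print("most common:", most_common)
print("latest:", latest.date())
```

Applied to the real data, the same `pd.to_datetime(...).min()` and `.max()` calls would give the actual range of release dates.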

## **Section 3. Data Cleaning**

Check and handle any missing values in the dataset:

In [6]:
data.isnull().sum()

id                    0
title                 0
genre                 3
original_language     0
overview             13
popularity            0
release_date          0
vote_average          0
vote_count            0
dtype: int64

In [7]:
data = data.dropna()

In [8]:
data.isnull().sum()

id                   0
title                0
genre                0
original_language    0
overview             0
popularity           0
release_date         0
vote_average         0
vote_count           0
dtype: int64
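
One detail worth keeping in mind after `dropna()`: the remaining rows keep their original index labels, so a row's label and its position can differ. The toy frame below (hypothetical data) illustrates this, and shows `Index.get_loc` as one way to convert a label to a position.

```python
import pandas as pd

# Toy frame for illustration; row 1 has a missing overview.
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "overview": ["text a", None, "text c", "text d"],
})

cleaned = toy.dropna()
print(list(cleaned.index))  # labels of the surviving rows

# Label of movie "D" versus its position in the cleaned frame
label = cleaned[cleaned["title"] == "D"].index[0]
position = cleaned.index.get_loc(label)
print(label, position)
```

This matters later because a NumPy array built from the cleaned frame is indexed by position, not by label. Calling `reset_index(drop=True)` after `dropna()` is another way to keep the two aligned.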

Check for duplicate entries:

In [9]:
# Count fully duplicated rows (every column identical)
print('Total Duplicated Rows: ', data.duplicated().sum())

Total Duplicated Rows:  0


In [10]:
data_copy = data.copy()

# Drop the 'id' column so rows that differ only by id count as duplicates
data_copy = data_copy.drop(columns=['id'])

# Count duplicated rows once 'id' is ignored
print('Total Duplicated Rows in data_copy: ', data_copy.duplicated().sum())

Total Duplicated Rows in data_copy:  0
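
Both checks above compare whole rows, so they do not detect movies that share a title but differ in other columns (the statistics in Section 2 show fewer unique titles than rows). A minimal sketch on toy data, using the `subset` parameter of `DataFrame.duplicated`:

```python
import pandas as pd

# Toy frame for illustration; two different movies share the title "Movie A".
toy = pd.DataFrame({
    "id": [1, 2, 3],
    "title": ["Movie A", "Movie A", "Movie B"],
    "overview": ["first version", "remake", "something else"],
})

full_row_dupes = toy.duplicated().sum()                  # compares every column
title_dupes = toy.duplicated(subset=["title"]).sum()     # compares titles only

print(full_row_dupes, title_dupes)
```

Repeated titles are not necessarily errors (remakes are legitimate entries), but they matter for any lookup that identifies a movie by title alone.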


## **Section 4. Feature Engineering**

In [11]:
data.columns

Index(['id', 'title', 'genre', 'original_language', 'overview', 'popularity',
       'release_date', 'vote_average', 'vote_count'],
      dtype='object')

In [12]:
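# Take an explicit copy so that adding a column later does not trigger SettingWithCopyWarning
data = data[['id', 'title', 'overview', 'genre']].copy()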

In [13]:
data

Unnamed: 0,id,title,overview,genre
0,278,The Shawshank Redemption,Framed in the 1940s for the double murder of h...,"Drama,Crime"
1,19404,Dilwale Dulhania Le Jayenge,"Raj is a rich, carefree, happy-go-lucky second...","Comedy,Drama,Romance"
2,238,The Godfather,"Spanning the years 1945 to 1955, a chronicle o...","Drama,Crime"
3,424,Schindler's List,The true story of how businessman Oskar Schind...,"Drama,History,War"
4,240,The Godfather: Part II,In the continuing saga of the Corleone crime f...,"Drama,Crime"
...,...,...,...,...
9995,10196,The Last Airbender,"The story follows the adventures of Aang, a yo...","Action,Adventure,Fantasy"
9996,331446,Sharknado 3: Oh Hell No!,The sharks take bite out of the East Coast whe...,"Action,TV Movie,Science Fiction,Comedy,Adventure"
9997,13995,Captain America,"During World War II, a brave, patriotic Americ...","Action,Science Fiction,War"
9998,2312,In the Name of the King: A Dungeon Siege Tale,A man named Farmer sets out to rescue his kidn...,"Adventure,Fantasy,Action,Drama"


This step combines movie overviews and genres into a single text feature that we'll use for finding similar movies.

In [14]:
data['attributes'] = data['overview'] + ' ' + data['genre']
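
The concatenation above can be tried on a toy frame (hypothetical rows) to see why missing values were dropped first: if either part is missing, the combined value is missing too.

```python
import pandas as pd

# Toy frame for illustration; the second row has no overview.
toy = pd.DataFrame({
    "overview": ["A heist goes wrong.", None],
    "genre": ["Crime,Thriller", "Drama"],
})

# Same pattern as in the notebook: overview, a space, then the genre string
toy["attributes"] = toy["overview"] + " " + toy["genre"]

print(toy["attributes"].iloc[0])
print(toy["attributes"].isna().tolist())
```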

In [15]:
data = data.drop(columns=['overview', 'genre'])

In [16]:
data

Unnamed: 0,id,title,attributes
0,278,The Shawshank Redemption,Framed in the 1940s for the double murder of h...
1,19404,Dilwale Dulhania Le Jayenge,"Raj is a rich, carefree, happy-go-lucky second..."
2,238,The Godfather,"Spanning the years 1945 to 1955, a chronicle o..."
3,424,Schindler's List,The true story of how businessman Oskar Schind...
4,240,The Godfather: Part II,In the continuing saga of the Corleone crime f...
...,...,...,...
9995,10196,The Last Airbender,"The story follows the adventures of Aang, a yo..."
9996,331446,Sharknado 3: Oh Hell No!,The sharks take bite out of the East Coast whe...
9997,13995,Captain America,"During World War II, a brave, patriotic Americ..."
9998,2312,In the Name of the King: A Dungeon Siege Tale,A man named Farmer sets out to rescue his kidn...


## **Section 5. Text Processing and Similarity Calculation**

This section processes the text data and calculates similarity scores. It:

- Converts the `attributes` text into word-count vectors using CountVectorizer
- Removes common English stop words
- Calculates the cosine similarity between every pair of movies

In [17]:
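# max_features caps the vocabulary size (columns); it is independent of the number of movies (rows)
cv = CountVectorizer(max_features=9985, stop_words='english')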

In [18]:
vector = cv.fit_transform(data['attributes'].values.astype('U')).toarray()

In [19]:
vector.shape

(9985, 9985)

In [25]:
vector

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int64)

In [20]:
similarity = cosine_similarity(vector)

In [21]:
similarity

array([[1.        , 0.05634362, 0.13041013, ..., 0.07559289, 0.11065667,
        0.06900656],
       [0.05634362, 1.        , 0.07715167, ..., 0.        , 0.03636965,
        0.        ],
       [0.13041013, 0.07715167, 1.        , ..., 0.02300219, 0.0673435 ,
        0.09449112],
       ...,
       [0.07559289, 0.        , 0.02300219, ..., 1.        , 0.03253   ,
        0.03042903],
       [0.11065667, 0.03636965, 0.0673435 , ..., 0.03253   , 1.        ,
        0.04454354],
       [0.06900656, 0.        , 0.09449112, ..., 0.03042903, 0.04454354,
        1.        ]])
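
Each entry of this matrix is the cosine of the angle between two count vectors: their dot product divided by the product of their lengths. A minimal sketch with two made-up vectors, checking a manual calculation against `cosine_similarity`:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Two made-up word-count vectors over a four-term vocabulary.
a = np.array([1, 2, 0, 1])
b = np.array([0, 1, 1, 1])

# Dot product divided by the product of the Euclidean norms
manual = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
from_sklearn = cosine_similarity([a], [b])[0, 0]

print(manual, from_sklearn)
```

Because count vectors are non-negative, the scores fall between 0 (no shared terms) and 1 (same direction), and every movie scores 1 against itself, which is why the diagonal of the matrix is all ones.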

In [22]:
# Rank all movies by similarity to the movie at position 2 and print the top 5
distance = sorted(enumerate(similarity[2]), reverse=True, key=lambda pair: pair[1])
for i in distance[0:5]:
    print(data.iloc[i[0]]['title'])

The Godfather
The Godfather: Part II
Blood Ties
Joker
Bomb City


## **Section 6. Recommendation Function**

Create a function to recommend similar movies:

In [23]:
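def recommend(movie):
    # Row label of the first title match. Labels are no longer contiguous after
    # dropna(), so convert the label to a position before indexing `similarity`.
    label = data[data['title'] == movie].index[0]
    index = data.index.get_loc(label)
    distance = sorted(enumerate(similarity[index]), reverse=True, key=lambda pair: pair[1])
    for i in distance[0:5]:
        print(data.iloc[i[0]]['title'])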

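This function:

- Takes a movie title as input
- Finds the first row with that title in our dataset (it raises an `IndexError` if the title is not present)
- Prints the 5 highest-scoring titles by content similarity; because every movie has a similarity of 1.0 with itself, the queried movie normally appears first in that list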
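
A possible variant, shown here only as a sketch on toy data: the hypothetical helper `recommend_titles` returns a list instead of printing, skips the queried movie, and returns an empty list for unknown titles. The names `movies`, `sim`, and `k` are assumptions for this example and not part of the notebook.

```python
import numpy as np
import pandas as pd

def recommend_titles(movie, movies, sim, k=5):
    """Return up to k titles most similar to `movie`, excluding the movie itself."""
    # Positions (not index labels) of rows whose title matches
    matches = np.flatnonzero((movies["title"] == movie).to_numpy())
    if matches.size == 0:
        return []
    pos = matches[0]
    # Sort positions by descending similarity, then drop the query position
    order = np.argsort(-sim[pos], kind="stable")
    order = order[order != pos][:k]
    return movies["title"].iloc[order].tolist()

# Toy data with non-contiguous index labels, as left behind by dropna()
movies = pd.DataFrame({"title": ["A", "B", "C"]}, index=[0, 2, 5])
sim = np.array([
    [1.0, 0.2, 0.9],
    [0.2, 1.0, 0.4],
    [0.9, 0.4, 1.0],
])

print(recommend_titles("A", movies, sim, k=2))
print(recommend_titles("Unknown", movies, sim))
```

Working with positions throughout keeps the lookup aligned with the similarity matrix regardless of the frame's index labels.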

## **Section 7. Save Model Files**

In [24]:
with open('movies.pkl', 'wb') as f:
    pickle.dump(data, f)
with open('similarity.pkl', 'wb') as f:
    pickle.dump(similarity, f)

These files can be loaded later to make recommendations without recomputing the vectors and the similarity matrix.
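
A minimal sketch of the save-and-load round trip, using small stand-in objects and a temporary directory so that it runs on its own. In practice the paths would be the `movies.pkl` and `similarity.pkl` files written above. Note that a 9,985 × 9,985 matrix of 64-bit floats takes roughly 0.8 GB, so the similarity file is large.

```python
import os
import pickle
import tempfile

import numpy as np
import pandas as pd

# Small stand-ins for the notebook's `data` and `similarity` objects
movies = pd.DataFrame({"id": [1, 2], "title": ["A", "B"]})
sim = np.array([[1.0, 0.3], [0.3, 1.0]])

with tempfile.TemporaryDirectory() as tmp:
    movies_path = os.path.join(tmp, "movies.pkl")
    sim_path = os.path.join(tmp, "similarity.pkl")

    # Save both objects
    with open(movies_path, "wb") as f:
        pickle.dump(movies, f)
    with open(sim_path, "wb") as f:
        pickle.dump(sim, f)

    # Load them back, as a separate script or app would do
    with open(movies_path, "rb") as f:
        loaded_movies = pickle.load(f)
    with open(sim_path, "rb") as f:
        loaded_sim = pickle.load(f)

print(loaded_movies.equals(movies), np.array_equal(loaded_sim, sim))
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.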