# **Movies Recommender System**

A recommender system is an intelligent system designed to help users discover items they are likely to enjoy from a large pool of available options. In the context of movies, users often struggle to decide what to watch due to the enormous variety of content available on streaming platforms. A movie recommender system addresses this challenge by suggesting relevant movies based on user preferences, past behavior, or similarities between movies.

By providing personalized and meaningful recommendations, these systems simplify the movie selection process and significantly enhance the overall user experience. As a result, recommender systems have become a core component of modern streaming platforms such as Netflix, Amazon Prime, and IMDb.

In this project, multiple recommendation approaches are implemented to explore different recommendation strategies. These include:

- IMDb’s Weighted Average Rating to rank movies fairly based on ratings and vote counts.
- IMDb’s Weighted Rating across different genres to identify top movies within specific genres.
- Content-based movie recommendation system that suggests movies based on similarity in movie features.
- Collaborative filtering-based recommendation system that leverages user–movie interactions.
- Hybrid recommendation system that combines both content-based and collaborative filtering approaches for improved recommendations.

All models are developed using [The Movies Dataset](https://www.kaggle.com/datasets/rounakbanik/the-movies-dataset?select=ratings_small.csv) available on Kaggle, which contains detailed information about movies, ratings, and user interactions. This project provides a comprehensive understanding of how different recommendation techniques work and how they can be applied to build scalable, real-world movie recommendation systems.

In [1]:
# Import Libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import ast
import nltk
import warnings; warnings.simplefilter('ignore')

# Download Dataset

In [147]:
import kagglehub

# Download latest version
path = kagglehub.dataset_download("rounakbanik/the-movies-dataset")

print("Path to dataset files:", path)

Downloading from https://www.kaggle.com/api/v1/datasets/download/rounakbanik/the-movies-dataset?dataset_version_number=7...


100%|██████████| 228M/228M [00:01<00:00, 165MB/s]

Extracting files...





Path to dataset files: /root/.cache/kagglehub/datasets/rounakbanik/the-movies-dataset/versions/7


# Dataset First View

**Movies Dataset**

In [3]:
mdf = pd.read_csv('/content/movies_metadata.csv')
mdf.head(2)

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",...,1995-10-30,373554033.0,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415.0
1,False,,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,...,1995-12-15,262797249.0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0


In [4]:
# Number of rows and columns in the movies dataset.
print("The Rows in Movies Dataset = {}".format(mdf.shape[0]))
print("The Columns in Movies Dataset = {}".format(mdf.shape[1]))

The Rows in Movies Dataset = 45466
The Columns in Movies Dataset = 24


**Credits Dataset**

In [5]:
cdf = pd.read_csv('/content/credits.csv')
cdf.head(2)

Unnamed: 0,cast,crew,id
0,"[{'cast_id': 14, 'character': 'Woody (voice)',...","[{'credit_id': '52fe4284c3a36847f8024f49', 'de...",862
1,"[{'cast_id': 1, 'character': 'Alan Parrish', '...","[{'credit_id': '52fe44bfc3a36847f80a7cd1', 'de...",8844


In [6]:
# Number of rows and columns in the credits dataset.
print("The Rows in Credits Dataset = {}".format(cdf.shape[0]))
print("The Columns in Credits Dataset = {}".format(cdf.shape[1]))

The Rows in Credits Dataset = 45476
The Columns in Credits Dataset = 3


**Keywords Dataset**

In [7]:
kdf = pd.read_csv('/content/keywords.csv')
kdf.head(2)

Unnamed: 0,id,keywords
0,862,"[{'id': 931, 'name': 'jealousy'}, {'id': 4290,..."
1,8844,"[{'id': 10090, 'name': 'board game'}, {'id': 1..."


In [8]:
# Number of rows and columns in the keywords dataset.
print("The Rows in Keywords Dataset = {}".format(kdf.shape[0]))
print("The Columns in Keywords Dataset = {}".format(kdf.shape[1]))

The Rows in Keywords Dataset = 46419
The Columns in Keywords Dataset = 2


**Links Dataset**

In [9]:
ldf = pd.read_csv('/content/links.csv')
ldf.head(5)

Unnamed: 0,movieId,imdbId,tmdbId
0,1,114709,862.0
1,2,113497,8844.0
2,3,113228,15602.0
3,4,114885,31357.0
4,5,113041,11862.0


In [10]:
# Number of rows and columns in the links dataset.
print("The Rows in links Dataset = {}".format(ldf.shape[0]))
print("The Columns in links Dataset = {}".format(ldf.shape[1]))

The Rows in links Dataset = 45843
The Columns in links Dataset = 3


# Dataset Description

This dataset is taken from the Full MovieLens Dataset and contains details of about 45,000 movies released up to July 2017. It includes different types of information about movies, which makes it useful for building and testing movie recommendation systems.

The dataset is divided into the following files:

`movies_metadata.csv`: - This is the main file of the dataset. It contains basic and detailed information about each movie, such as:
- Movie title and description
- Release date and language
- Budget and revenue
- Genres and runtime
- Ratings and popularity
- Production companies and countries

`keywords.csv`: - This file contains important keywords related to the movie story. These keywords help in understanding the movie’s theme and are useful for content-based recommendations.

`credits.csv`: - This file includes information about the cast and crew of each movie, such as:
- Actors and their roles
- Directors and other crew members

`links.csv`: - This file connects movies to external databases. This allows the dataset to be linked with other movie data sources for additional information. It contains:
- MovieLens IDs
- TMDB and IMDb IDs

# Dataset Column Description

**Movies Dataset Columns** (movies_metadata.csv) - The movies dataset contains the following columns:

- `adult` – Indicates whether the movie is for adults
- `belongs_to_collection` – Information about the movie collection.
- `budget` – Budget of the movie
- `genres` – Genres of the movie
- `homepage` – Official movie website
- `id` – TMDB movie ID
- `imdb_id` – IMDb movie ID
- `original_language` – Original language of the movie
- `original_title` – Original title of the movie
- `overview` – Short description of the movie
- `popularity` – Popularity scor
- `poster_path` – Path to the movie poster image
- `production_companies` – Companies involved in production
- `production_countries` – Countries where the movie was produced
- `release_date` – Movie release date
- `revenue` – Revenue generated by the movie
- `runtime` – Duration of the movie (in minutes)
- `spoken_languages` – Languages spoken in the movie
- `status` – Current status of the movie (e.g., Released)
- `tagline` – Movie tagline
- `title` – Movie title
- `video` – Indicates whether a video is available
- `vote_average` – Average user rating
- `vote_count` – Total number of user votes

**Credits Dataset Columns** (credits.csv) - The credits dataset contains the following columns:

- `cast` – List of actors and their roles
- `crew` – List of crew members (director, writer, producer, etc.)
- `id` – Movie ID (used to link with other datasets)

**Keywords Dataset Columns** (keywords.csv) - The keywords dataset contains the following columns:

- `id` – Movie ID (used to link with other datasets)
- `keywords` – List of important keywords related to the movie’s story and themes

**Links Dataset Columns** (links.csv) - The links dataset contains the following columns:

- `movieId` – MovieLens movie ID
- `imdbId` – IMDb movie ID
- `tmdbId` – TMDB movie ID

# Data Preprocessing

**Data Preprocessing on the Movies Dataset**

In [11]:
# Selection of Important Features from the Movies Dataset

movies_df = mdf[['id', 'title', 'release_date', 'genres', 'overview', 'tagline', 'popularity', 'vote_average', 'vote_count']]
movies_df.head(2)

Unnamed: 0,id,title,release_date,genres,overview,tagline,popularity,vote_average,vote_count
0,862,Toy Story,1995-10-30,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...","Led by Woody, Andy's toys live happily in his ...",,21.946943,7.7,5415.0
1,8844,Jumanji,1995-12-15,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",When siblings Judy and Peter discover an encha...,Roll the dice and unleash the excitement!,17.015539,6.9,2413.0


In [12]:
# Checking missing values in dataset
movies_df.isnull().sum()

Unnamed: 0,0
id,0
title,6
release_date,87
genres,0
overview,954
tagline,25054
popularity,5
vote_average,6
vote_count,6


In [13]:
x = movies_df[movies_df['overview'].isnull()]
x.head(5)

Unnamed: 0,id,title,release_date,genres,overview,tagline,popularity,vote_average,vote_count
32,78802,Wings of Courage,1996-09-18,"[{'id': 10749, 'name': 'Romance'}, {'id': 12, ...",,,0.745542,6.8,4.0
300,161495,Roommates,1995-03-01,"[{'id': 18, 'name': 'Drama'}, {'id': 35, 'name...",,,3.395867,6.4,7.0
634,287305,Peanuts – Die Bank zahlt alles,1996-03-21,"[{'id': 35, 'name': 'Comedy'}]",,,0.066123,4.0,1.0
635,339428,Happy Weekend,1996-03-14,"[{'id': 35, 'name': 'Comedy'}]",,,0.002229,0.0,0.0
641,10801,The Superwife,1996-03-06,"[{'id': 35, 'name': 'Comedy'}]",,,0.821299,5.3,7.0


In [14]:
x['vote_count'].max()

299.0

In [15]:
# There are 954 rows with missing values in the 'overview' column.
# The maximum vote_count among these movies is 299, indicating low audience engagement.
# Since the 'overview' column is important for a content-based filtering recommendation system,
# these 954 movies are removed from the dataset.

movies_df.dropna(subset=['overview'], inplace=True)

In [16]:
# The missing values in the 'tagline' column are replaced with an empty string ('') instead of 'NaN'.
movies_df['tagline']=movies_df['tagline'].fillna('')

In [17]:
# Rows with missing movie title are removed from the Dataset.
movies_df.dropna(subset=['title'], inplace=True)

In [18]:
# Extracting 'year' from release_date column
movies_df['release_date'] = pd.to_datetime(movies_df['release_date']).dt.year
movies_df.rename(columns={'release_date':'release_year'}, inplace=True)

In [19]:
# The missing values in the 'release_year' column are replaced with an (0) instead of 'NaT'.
movies_df['release_year'].fillna(0, inplace=True)

In [20]:
movies_df.isnull().sum()

Unnamed: 0,0
id,0
title,0
release_year,0
genres,0
overview,0
tagline,0
popularity,0
vote_average,0
vote_count,0


In [21]:
# Check for duplicated values in the dataset.
movies_df.duplicated().sum()

np.int64(13)

In [22]:
# Remove duplicate values from the dataset.
movies_df.drop_duplicates(inplace=True)

In [23]:
movies_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 44493 entries, 0 to 45465
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   id            44493 non-null  object 
 1   title         44493 non-null  object 
 2   release_year  44493 non-null  float64
 3   genres        44493 non-null  object 
 4   overview      44493 non-null  object 
 5   tagline       44493 non-null  object 
 6   popularity    44493 non-null  object 
 7   vote_average  44493 non-null  float64
 8   vote_count    44493 non-null  float64
dtypes: float64(3), object(6)
memory usage: 3.4+ MB


In [24]:
# Change the data type of the 'id' and 'release_year' column to 'int64'
movies_df[['id', 'release_year']] = movies_df[['id', 'release_year']].astype('int64')

In [25]:
# Change the data type of the 'popularity' column
movies_df['popularity'] = movies_df['popularity'].astype('float64')

In [26]:
movies_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 44493 entries, 0 to 45465
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   id            44493 non-null  int64  
 1   title         44493 non-null  object 
 2   release_year  44493 non-null  int64  
 3   genres        44493 non-null  object 
 4   overview      44493 non-null  object 
 5   tagline       44493 non-null  object 
 6   popularity    44493 non-null  float64
 7   vote_average  44493 non-null  float64
 8   vote_count    44493 non-null  float64
dtypes: float64(3), int64(2), object(4)
memory usage: 3.4+ MB


In [27]:
movies_df.head(2)

Unnamed: 0,id,title,release_year,genres,overview,tagline,popularity,vote_average,vote_count
0,862,Toy Story,1995,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...","Led by Woody, Andy's toys live happily in his ...",,21.946943,7.7,5415.0
1,8844,Jumanji,1995,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",When siblings Judy and Peter discover an encha...,Roll the dice and unleash the excitement!,17.015539,6.9,2413.0


**Data Preprocessing on Credits Dataset**

In [28]:
credits = cdf.copy()
credits.head(2)

Unnamed: 0,cast,crew,id
0,"[{'cast_id': 14, 'character': 'Woody (voice)',...","[{'credit_id': '52fe4284c3a36847f8024f49', 'de...",862
1,"[{'cast_id': 1, 'character': 'Alan Parrish', '...","[{'credit_id': '52fe44bfc3a36847f80a7cd1', 'de...",8844


In [29]:
# Checking missing values in dataset
credits.isnull().sum()

Unnamed: 0,0
cast,0
crew,0
id,0


In [30]:
# Check for duplicated values in the dataset.
credits.duplicated().sum()

np.int64(37)

In [31]:
# Remove duplicate values from the dataset.
credits.drop_duplicates(inplace=True)

In [32]:
credits.info()

<class 'pandas.core.frame.DataFrame'>
Index: 45439 entries, 0 to 45475
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   cast    45439 non-null  object
 1   crew    45439 non-null  object
 2   id      45439 non-null  int64 
dtypes: int64(1), object(2)
memory usage: 1.4+ MB


In [33]:
credits.head(2)

Unnamed: 0,cast,crew,id
0,"[{'cast_id': 14, 'character': 'Woody (voice)',...","[{'credit_id': '52fe4284c3a36847f8024f49', 'de...",862
1,"[{'cast_id': 1, 'character': 'Alan Parrish', '...","[{'credit_id': '52fe44bfc3a36847f80a7cd1', 'de...",8844


**Data Preprocessing on keywords Dataset**

In [34]:
keywords = kdf.copy()
keywords.head(2)

Unnamed: 0,id,keywords
0,862,"[{'id': 931, 'name': 'jealousy'}, {'id': 4290,..."
1,8844,"[{'id': 10090, 'name': 'board game'}, {'id': 1..."


In [35]:
# Checking missing values in dataset
keywords.isnull().sum()

Unnamed: 0,0
id,0
keywords,0


In [36]:
# Check for duplicated values in the dataset.
keywords.duplicated().sum()

np.int64(987)

In [37]:
# Remove duplicate values from the dataset.
keywords.drop_duplicates(inplace=True)

In [38]:
keywords.info()

<class 'pandas.core.frame.DataFrame'>
Index: 45432 entries, 0 to 46418
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   id        45432 non-null  int64 
 1   keywords  45432 non-null  object
dtypes: int64(1), object(1)
memory usage: 1.0+ MB


**Data Preprocessing on link Dataset**

In [39]:
link = ldf.copy()
link.head(2)

Unnamed: 0,movieId,imdbId,tmdbId
0,1,114709,862.0
1,2,113497,8844.0


In [40]:
# Checking missing values in dataset
link.isnull().sum()

Unnamed: 0,0
movieId,0
imdbId,0
tmdbId,219


In [41]:
# Remove rows from the dataset where tmdbId is missing.
link.dropna(inplace=True)

In [42]:
link.isnull().sum()

Unnamed: 0,0
movieId,0
imdbId,0
tmdbId,0


In [43]:
# Check for duplicated values in the dataset.
link.duplicated().sum()

np.int64(0)

In [44]:
link.info()

<class 'pandas.core.frame.DataFrame'>
Index: 45624 entries, 0 to 45842
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   movieId  45624 non-null  int64  
 1   imdbId   45624 non-null  int64  
 2   tmdbId   45624 non-null  float64
dtypes: float64(1), int64(2)
memory usage: 1.4 MB


In [45]:
# The tmdbId column in the links dataset is the same as the id column in the movies_df dataset,
# so the tmdbId column is renamed to id in the links dataset.

link['tmdbId'] = link['tmdbId'].astype('int64')
link.rename(columns={'tmdbId': 'id'}, inplace=True)

# Merging the Dataset

In [46]:
# Preprocessed movies dataset and their features
movies = movies_df.merge(credits, how = 'inner', on = 'id').merge(keywords, how = 'inner', on = 'id').merge(link, how='inner', on = 'id')


In [47]:
movies = movies[['id', 'imdbId', 'movieId', 'title', 'release_year', 'genres', 'keywords', 'overview', 'tagline', 'cast', 'crew', 'popularity', 'vote_average', 'vote_count']]
movies.head(3)

Unnamed: 0,id,imdbId,movieId,title,release_year,genres,keywords,overview,tagline,cast,crew,popularity,vote_average,vote_count
0,862,114709,1,Toy Story,1995,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...","[{'id': 931, 'name': 'jealousy'}, {'id': 4290,...","Led by Woody, Andy's toys live happily in his ...",,"[{'cast_id': 14, 'character': 'Woody (voice)',...","[{'credit_id': '52fe4284c3a36847f8024f49', 'de...",21.946943,7.7,5415.0
1,8844,113497,2,Jumanji,1995,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...","[{'id': 10090, 'name': 'board game'}, {'id': 1...",When siblings Judy and Peter discover an encha...,Roll the dice and unleash the excitement!,"[{'cast_id': 1, 'character': 'Alan Parrish', '...","[{'credit_id': '52fe44bfc3a36847f80a7cd1', 'de...",17.015539,6.9,2413.0
2,15602,113228,3,Grumpier Old Men,1995,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...","[{'id': 1495, 'name': 'fishing'}, {'id': 12392...",A family wedding reignites the ancient feud be...,Still Yelling. Still Fighting. Still Ready for...,"[{'cast_id': 2, 'character': 'Max Goldman', 'c...","[{'credit_id': '52fe466a9251416c75077a89', 'de...",11.7129,6.5,92.0


In [48]:
movies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 44565 entries, 0 to 44564
Data columns (total 14 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   id            44565 non-null  int64  
 1   imdbId        44565 non-null  int64  
 2   movieId       44565 non-null  int64  
 3   title         44565 non-null  object 
 4   release_year  44565 non-null  int64  
 5   genres        44565 non-null  object 
 6   keywords      44565 non-null  object 
 7   overview      44565 non-null  object 
 8   tagline       44565 non-null  object 
 9   cast          44565 non-null  object 
 10  crew          44565 non-null  object 
 11  popularity    44565 non-null  float64
 12  vote_average  44565 non-null  float64
 13  vote_count    44565 non-null  float64
dtypes: float64(3), int64(4), object(7)
memory usage: 4.8+ MB


In [49]:
# Checking missing values in dataset
movies.isnull().sum()

Unnamed: 0,0
id,0
imdbId,0
movieId,0
title,0
release_year,0
genres,0
keywords,0
overview,0
tagline,0
cast,0


In [50]:
# Check for duplicated values in the dataset.
movies.duplicated().sum()

np.int64(12)

In [51]:
# Remove duplicate values from the dataset.
movies.drop_duplicates(inplace=True)

In [52]:
# Found 48 duplicate rows based on the 'imdbId' column
movies['imdbId'].duplicated().sum()

np.int64(48)

In [53]:
# Found 78 duplicate rows based on the 'id' column
movies['id'].duplicated().sum()

np.int64(78)

In [54]:
# Found 48 duplicate rows based on the 'movieId' column
movies['movieId'].duplicated().sum()

np.int64(48)

In [55]:
# Based on the id, imdbId, and movieId columns, there are 48 duplicate rows in the dataset.
movies[['id', 'imdbId', 'movieId']].duplicated().sum()

np.int64(48)

In [56]:
# Although the id, imdbId, and movieId columns are expected to be unique, duplicate entries were found.
# Since the same movie can have different values across these identifiers, the exact cause of duplication is unclear.
# Therefore, I am going to remove duplicate rows based on the id column.

movies.drop_duplicates(subset=['id'], inplace=True)

In [57]:
# This is the final movies dataset that will be used in the recommendation system.
movies = movies.reset_index(drop=True)
movies.head(3)

Unnamed: 0,id,imdbId,movieId,title,release_year,genres,keywords,overview,tagline,cast,crew,popularity,vote_average,vote_count
0,862,114709,1,Toy Story,1995,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...","[{'id': 931, 'name': 'jealousy'}, {'id': 4290,...","Led by Woody, Andy's toys live happily in his ...",,"[{'cast_id': 14, 'character': 'Woody (voice)',...","[{'credit_id': '52fe4284c3a36847f8024f49', 'de...",21.946943,7.7,5415.0
1,8844,113497,2,Jumanji,1995,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...","[{'id': 10090, 'name': 'board game'}, {'id': 1...",When siblings Judy and Peter discover an encha...,Roll the dice and unleash the excitement!,"[{'cast_id': 1, 'character': 'Alan Parrish', '...","[{'credit_id': '52fe44bfc3a36847f80a7cd1', 'de...",17.015539,6.9,2413.0
2,15602,113228,3,Grumpier Old Men,1995,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...","[{'id': 1495, 'name': 'fishing'}, {'id': 12392...",A family wedding reignites the ancient feud be...,Still Yelling. Still Fighting. Still Ready for...,"[{'cast_id': 2, 'character': 'Max Goldman', 'c...","[{'credit_id': '52fe466a9251416c75077a89', 'de...",11.7129,6.5,92.0


In [58]:
print("The Rows in Preprocessed Movies Dataset = {}".format(movies.shape[0]))
print("The Columns in Preprocessed Movies Dataset = {}".format(movies.shape[1]))

The Rows in Preprocessed Movies Dataset = 44475
The Columns in Preprocessed Movies Dataset = 14


# The following preprocessing steps are applied to selected columns in the movies dataset:

- `Genres` column – Extract genre names and store them as a list.
- `Keywords` column – Extract relevant keywords and store them as a list.
- `Overview` and `tagline` columns – Remove punctuation and convert all text to lowercase to ensure consistency in text processing.
- `Cast` column – Lesser-known actors and minor roles have limited influence on audience perception. Therefore, only major characters and their actors are considered by selecting the top three cast members listed in the credits.
- `Crew` column – Extract the director’s name from the crew information.

**Genres**

In [59]:
# The genres column contains a list of dictionaries where each dictionary has 'id' and 'name' as genre attributes.
movies['genres'][0]

"[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}, {'id': 10751, 'name': 'Family'}]"

In [60]:
# The 'genres_name' function extracts genre names for each row.
def genres_name(obj):
  l = []
  for i in ast.literal_eval(obj):
    l.append(i['name'])
  return l

In [61]:
movies['genres'] = movies['genres'].apply(genres_name)

# Remove whitespace from each genre name in the list, if it exists
movies['genres'] = movies['genres'].apply(lambda x: [i.replace(" ", "") for i in x])

In [62]:
movies['genres'][0]

['Animation', 'Comedy', 'Family']

**keywords**

In [63]:
# The keywords column contains a list of dictionaries where each dictionary has 'id' and 'name' as keyword attributes.
movies['keywords'][0]

"[{'id': 931, 'name': 'jealousy'}, {'id': 4290, 'name': 'toy'}, {'id': 5202, 'name': 'boy'}, {'id': 6054, 'name': 'friendship'}, {'id': 9713, 'name': 'friends'}, {'id': 9823, 'name': 'rivalry'}, {'id': 165503, 'name': 'boy next door'}, {'id': 170722, 'name': 'new toy'}, {'id': 187065, 'name': 'toy comes to life'}]"

In [64]:
# The 'keyword' function extracts keywords for each row.
def keyword(obj):
  l = []
  for i in ast.literal_eval(obj):
    l.append(i['name'])
  return l

In [65]:
movies['keywords'] = movies['keywords'].apply(keyword)

# Remove whitespace from each keyword name in the list, if it exists
movies['keywords'] = movies['keywords'].apply(lambda x: [i.replace(" ", '') for i in x])

In [66]:
movies['keywords'][0]

['jealousy',
 'toy',
 'boy',
 'friendship',
 'friends',
 'rivalry',
 'boynextdoor',
 'newtoy',
 'toycomestolife']

**Overview and Tagline**

In [67]:
# Remove punctuation and special characters from a given text
import re

def remove_punctuation(text):
  return re.sub(r'[^\w\s]', '', text)

In [68]:
movies['overview'][0]

"Led by Woody, Andy's toys live happily in his room until Andy's birthday brings Buzz Lightyear onto the scene. Afraid of losing his place in Andy's heart, Woody plots against Buzz. But when circumstances separate Buzz and Woody from their owner, the duo eventually learns to put aside their differences."

In [69]:
movies['overview'] = movies['overview'].apply(remove_punctuation)

In [70]:
movies['overview'][0]

'Led by Woody Andys toys live happily in his room until Andys birthday brings Buzz Lightyear onto the scene Afraid of losing his place in Andys heart Woody plots against Buzz But when circumstances separate Buzz and Woody from their owner the duo eventually learns to put aside their differences'

In [71]:
# Convert the 'overview' column to string and split each overview into a list of words
movies['overview'] = movies['overview'].astype(str).apply(lambda x: x.split())

In [72]:
movies['tagline'][7]

'The Original Bad Boys.'

In [73]:
movies['tagline'] = movies['tagline'].apply(remove_punctuation)

In [74]:
# Convert the 'tagline' column to string and split each tagline into a list of words
movies['tagline'] = movies['tagline'].astype(str).apply(lambda x: x.split())

In [75]:
movies['tagline'][7]

['The', 'Original', 'Bad', 'Boys']

**Cast**

In [76]:
# Display the cast information for the first movie
movies['cast'][0]

"[{'cast_id': 14, 'character': 'Woody (voice)', 'credit_id': '52fe4284c3a36847f8024f95', 'gender': 2, 'id': 31, 'name': 'Tom Hanks', 'order': 0, 'profile_path': '/pQFoyx7rp09CJTAb932F2g8Nlho.jpg'}, {'cast_id': 15, 'character': 'Buzz Lightyear (voice)', 'credit_id': '52fe4284c3a36847f8024f99', 'gender': 2, 'id': 12898, 'name': 'Tim Allen', 'order': 1, 'profile_path': '/uX2xVf6pMmPepxnvFWyBtjexzgY.jpg'}, {'cast_id': 16, 'character': 'Mr. Potato Head (voice)', 'credit_id': '52fe4284c3a36847f8024f9d', 'gender': 2, 'id': 7167, 'name': 'Don Rickles', 'order': 2, 'profile_path': '/h5BcaDMPRVLHLDzbQavec4xfSdt.jpg'}, {'cast_id': 17, 'character': 'Slinky Dog (voice)', 'credit_id': '52fe4284c3a36847f8024fa1', 'gender': 2, 'id': 12899, 'name': 'Jim Varney', 'order': 3, 'profile_path': '/eIo2jVVXYgjDtaHoF19Ll9vtW7h.jpg'}, {'cast_id': 18, 'character': 'Rex (voice)', 'credit_id': '52fe4284c3a36847f8024fa5', 'gender': 2, 'id': 12900, 'name': 'Wallace Shawn', 'order': 4, 'profile_path': '/oGE6JqPP2xH4t

In [77]:
# Extract the names of the top three cast members from the cast column

def cast_name(obj):
  l = []
  counter = 0
  for i in ast.literal_eval(obj):
    if counter != 3:
      l.append(i['name'])
      counter += 1
    else:
      break
  return l

In [78]:
movies['cast'] = movies['cast'].apply(cast_name)

In [79]:
# Remove whitespace from each cast member's name in the 'cast' column
movies['cast'] = movies['cast'].apply(lambda x: [i.replace(" ", '') for i in x])

**Crew**

In [80]:
# Display the crew information for the first movie
movies['crew'][0]

'[{\'credit_id\': \'52fe4284c3a36847f8024f49\', \'department\': \'Directing\', \'gender\': 2, \'id\': 7879, \'job\': \'Director\', \'name\': \'John Lasseter\', \'profile_path\': \'/7EdqiNbr4FRjIhKHyPPdFfEEEFG.jpg\'}, {\'credit_id\': \'52fe4284c3a36847f8024f4f\', \'department\': \'Writing\', \'gender\': 2, \'id\': 12891, \'job\': \'Screenplay\', \'name\': \'Joss Whedon\', \'profile_path\': \'/dTiVsuaTVTeGmvkhcyJvKp2A5kr.jpg\'}, {\'credit_id\': \'52fe4284c3a36847f8024f55\', \'department\': \'Writing\', \'gender\': 2, \'id\': 7, \'job\': \'Screenplay\', \'name\': \'Andrew Stanton\', \'profile_path\': \'/pvQWsu0qc8JFQhMVJkTHuexUAa1.jpg\'}, {\'credit_id\': \'52fe4284c3a36847f8024f5b\', \'department\': \'Writing\', \'gender\': 2, \'id\': 12892, \'job\': \'Screenplay\', \'name\': \'Joel Cohen\', \'profile_path\': \'/dAubAiZcvKFbboWlj7oXOkZnTSu.jpg\'}, {\'credit_id\': \'52fe4284c3a36847f8024f61\', \'department\': \'Writing\', \'gender\': 0, \'id\': 12893, \'job\': \'Screenplay\', \'name\': \'A

In [81]:
# Extract the director's name from the crew metadata
def fetch_directior(obj):
  l = []
  for i in ast.literal_eval(obj):
    if i['job'] == 'Director':
      l.append(i['name'])
      break
  return l

In [82]:
movies['crew'] = movies['crew'].apply(fetch_directior)

In [83]:
# Rename the 'crew' column to 'director'
movies.rename(columns={'crew': 'director'}, inplace=True)

In [84]:
# Remove whitespace from the director name in the 'director' column
movies['director'] = movies['director'].apply(lambda x: [i.replace(" ", '') for i in x])

# **Movies Recommendation Based on IMDB's Weighted Average Score.**

In this method, movies are recommended based on how popular they are and how well they are rated by viewers. The idea is simple, if the movies that many people watch and rate highly are more likely to be liked by most users. However, since everyone’s taste is different, this approach is not very personalized.


To make the recommendations fairer, IMDB’s weighted rating is used instead of a simple average rating. This method considers both the movie’s rating and the number of votes, so a movie with a few high ratings does not rank above a well-rated movie that has been reviewed by many people.

**Weighted Rating (WR)** = $(\frac{v}{v + m} . R) + (\frac{m}{v + m} . C)$

where,
* *v* is the number of votes for the movie
* *m* is the minimum votes required to be listed in the chart
* *R* is the average rating of the movie
* *C* is the mean vote across the whole report

In [85]:
# C is the mean vote across the whole report
c = movies['vote_average'].mean()
c

np.float64(5.639145587408657)

In [86]:
# m is the minimum votes required to be listed in the chart.
m = movies['vote_count'].quantile(0.95)
m

np.float64(447.0)

In [87]:
# Selecting movies with a vote count greater than the minimum votes required to be listed in the chart.

movies_rec1 = movies.copy().loc[movies['vote_count'] >= m]
movies_rec1.shape

(2225, 14)

In [88]:
# Function to calculate the weighted rating
def weighted_rating(obj, m=m, c=c):
  V = obj['vote_count']
  R = obj['vote_average']
  return (V/(V+m) * R) + (m / (m+V) * c)

In [89]:
# Add a score column to the 'movies_rec1' dataset based on the weighted rating.
movies_rec1['score'] = movies_rec1.apply(weighted_rating, axis=1)

In [90]:
# Sort the dataset by the score column in descending order.
movies_rec1 = movies_rec1.sort_values('score', ascending=False)

In [91]:
# Recommend the top 10 movies based on the score.
score_based_recommendation = movies_rec1[['title', 'release_year', 'popularity', 'score']].head(10)
score_based_recommendation

Unnamed: 0,title,release_year,popularity,score
312,The Shawshank Redemption,1994,51.645403,8.354764
823,The Godfather,1972,41.109264,8.30238
12443,The Dark Knight,2008,123.167259,8.206464
2824,Fight Club,1999,63.869599,8.182528
291,Pulp Fiction,1994,140.950236,8.16954
349,Forrest Gump,1994,48.307194,8.066802
520,Schindler's List,1993,41.725123,8.05642
23461,Whiplash,2014,64.29999,8.05339
5459,Spirited Away,2001,41.048867,8.0306
15407,Inception,2010,29.108149,8.024253


We initially built a movie recommender system based on weighted ratings. However, the main limitation of this approach is that it does not consider individual taste. If a user prefers specific genres, this system may still recommend only overall top-rated movies, even if they belong to different or unwanted genres. To overcome this issue, we shift to a genre-based recommendation approach, where movies are suggested based on the genres a user likes, making the recommendations more relevant and personalized.

# **Movie Recommendation Based on IMDB’s Weighted Rating Across Different Genres**

While overall top-rated movies are useful, users often prefer recommendations within a specific genre such as action, comedy, drama, or romance. This approach extends IMDb’s weighted rating method by applying it separately across different genres, allowing the system to identify the best movies within each category.

In this approach, movies are recommended using IMDB’s weighted rating while also considering different genres. Instead of showing only the overall top-rated movies, the system ranks movies within each genre based on their weighted scores. This helps users discover highly rated movies in the genres they prefer. By combining popularity, ratings, and genre information, this method provides more balanced and relevant recommendations compared to a general popularity-based system.

In [92]:
movies.head(3)

Unnamed: 0,id,imdbId,movieId,title,release_year,genres,keywords,overview,tagline,cast,director,popularity,vote_average,vote_count
0,862,114709,1,Toy Story,1995,"[Animation, Comedy, Family]","[jealousy, toy, boy, friendship, friends, riva...","[Led, by, Woody, Andys, toys, live, happily, i...",[],"[TomHanks, TimAllen, DonRickles]",[JohnLasseter],21.946943,7.7,5415.0
1,8844,113497,2,Jumanji,1995,"[Adventure, Fantasy, Family]","[boardgame, disappearance, basedonchildren'sbo...","[When, siblings, Judy, and, Peter, discover, a...","[Roll, the, dice, and, unleash, the, excitement]","[RobinWilliams, JonathanHyde, KirstenDunst]",[JoeJohnston],17.015539,6.9,2413.0
2,15602,113228,3,Grumpier Old Men,1995,"[Romance, Comedy]","[fishing, bestfriend, duringcreditsstinger, ol...","[A, family, wedding, reignites, the, ancient, ...","[Still, Yelling, Still, Fighting, Still, Ready...","[WalterMatthau, JackLemmon, Ann-Margret]",[HowardDeutch],11.7129,6.5,92.0


In [93]:
# Split the list of genres into separate rows
ind_gen = movies.copy().apply(lambda x: pd.Series(x['genres']), axis = 1).stack().reset_index(level=1, drop=True)

# Rename the stacked column to 'genre'
ind_gen.name = 'genre'

# Remove the original 'genres' column and join the expanded genre column
movies_rec2 = movies.copy().drop('genres', axis=1).join(ind_gen)

In [94]:
# This transforms the dataset from a list-based genre column into a row-wise format,
# making it easier to analyze and filter movies by individual genres.

movies_rec2.head(3)

Unnamed: 0,id,imdbId,movieId,title,release_year,keywords,overview,tagline,cast,director,popularity,vote_average,vote_count,genre
0,862,114709,1,Toy Story,1995,"[jealousy, toy, boy, friendship, friends, riva...","[Led, by, Woody, Andys, toys, live, happily, i...",[],"[TomHanks, TimAllen, DonRickles]",[JohnLasseter],21.946943,7.7,5415.0,Animation
0,862,114709,1,Toy Story,1995,"[jealousy, toy, boy, friendship, friends, riva...","[Led, by, Woody, Andys, toys, live, happily, i...",[],"[TomHanks, TimAllen, DonRickles]",[JohnLasseter],21.946943,7.7,5415.0,Comedy
0,862,114709,1,Toy Story,1995,"[jealousy, toy, boy, friendship, friends, riva...","[Led, by, Woody, Andys, toys, live, happily, i...",[],"[TomHanks, TimAllen, DonRickles]",[JohnLasseter],21.946943,7.7,5415.0,Family


In [95]:
def genres_based_recommendation(genre, top=10):
  """
  Generate movie recommendations within a specific genre using IMDb's
  weighted rating formula.

  This function filters movies based on the selected genre and ranks them
  by a weighted score that considers both the average rating and the number
  of votes. This ensures that movies with a sufficient number of votes are
  prioritized, resulting in more reliable recommendations.

  Parameters:
  ----------
  genre : str
      The genre for which movie recommendations are to be generated
      (e.g., 'Action', 'Comedy', 'Drama').

  top : int, optional (default=10)
      The number of top movies to return.

  Returns:
  -------
  pandas.DataFrame
      A DataFrame containing the top-ranked movies of the selected genre
      with the following columns:
      ['id', 'title', 'release_year', 'popularity', 'score']
  """

  # Filter movies belonging to the selected genre
  df = movies_rec2[movies_rec2['genre'] == genre]

  # Calculate the mean of average ratings for the selected genre
  c = df['vote_average'].mean()

  # Determine the minimum number of votes required (90th percentile)
  m = df['vote_count'].quantile(0.90)

  # Select movies that have a vote count greater than or equal to the threshold
  df2 = df.copy().loc[df['vote_count'] >= m]

  # Compute the IMDb weighted rating score
  df2['score'] = (df2['vote_count']/(df2['vote_count'] + m) * df2['vote_average']) + (m / (df2['vote_count'] + m) * c)

  # Sort movies based on the weighted score in descending order
  df2 = df2.sort_values('score', ascending=False)

  # Retrieve recommended movies with selected metadata
  recommend_movies = df2[['id', 'title', 'release_year', 'popularity', 'score']].head(top)

  return recommend_movies

In [96]:
genres_based_recommendation('Action').head(10)

Unnamed: 0,id,title,release_year,popularity,score
12443,155,The Dark Knight,2008,123.167259,8.194812
15407,27205,Inception,2010,29.108149,8.01466
1140,1891,The Empire Strikes Back,1980,19.470959,8.00088
6975,122,The Lord of the Rings: The Return of the King,2003,29.324358,7.957432
255,11,Star Wars,1977,42.149697,7.929053
4841,120,The Lord of the Rings: The Fellowship of the Ring,2001,32.070725,7.872831
5792,121,The Lord of the Rings: The Two Towers,2002,29.423537,7.853281
23540,118340,Guardians of the Galaxy,2014,53.291601,7.791128
2439,603,The Matrix,1999,33.366332,7.780497
13561,16869,Inglourious Basterds,2009,16.89564,7.738704


Movie recommendation based on IMDB’s weighted rating across different genres has several advantages. It provides fair and reliable rankings by considering both the movie’s rating and the number of votes, which helps avoid bias toward movies with very few reviews. By recommending movies within specific genres, it allows users to easily find highly rated films that match their interests. This approach is also simple to implement and easy to understand, making it suitable for basic recommendation systems.

However, this method also has some limitations. It is not fully personalized, as it does not take individual user behavior or watch history into account. Popular movies may still dominate the recommendations within a genre, reducing diversity. Additionally, lesser-known or niche movies with fewer votes may not be recommended, even if they are high quality.

# **Content Based Movie Recommendation System**

A content-based movie recommendation system suggests movies based on what a user likes. It looks at movie details such as genre, cast, director, and story, and then recommends movies that are similar to those the user has already watched or enjoyed. This type of system does not depend on other users’ ratings, so the recommendations are personalized. However, it may keep suggesting similar kinds of movies and offer less variety.

The `movies` dataset contains 44,475 movies. After creating a tags column by combining the `genres`, `keywords`,`overview`, `tagline`, `cast`, and `director` columns, computing cosine similarity for all movies becomes computationally expensive. Our system is not capable of handling this large-scale similarity computation efficiently.

To address this limitation, the `links_small` dataset is used instead. This dataset contains approximately 9,000 movies, which are a subset of the original `movies` dataset. Using this smaller dataset makes it feasible to compute cosine similarity and build the recommendation system efficiently.

Therefore, the links_small dataset is selected for implementing the recommendation system.

In [97]:
# links_small dataset first view
links_small = pd.read_csv('/content/links_small.csv')
links_small.head(2)

Unnamed: 0,movieId,imdbId,tmdbId
0,1,114709,862.0
1,2,113497,8844.0


In [98]:
# Checking missing values in dataset
links_small.isnull().sum()

Unnamed: 0,0
movieId,0
imdbId,0
tmdbId,13


In [99]:
# Remove rows with missing values from the links_small dataset
links_small.dropna(inplace=True)

In [100]:
# Inspect the dimensionality of the links_small dataset
links_small.shape

(9112, 3)

In [101]:
# Check the number of duplicate rows in the links_small dataset
links_small.duplicated().sum()

np.int64(0)

In [102]:
links_small.info()

<class 'pandas.core.frame.DataFrame'>
Index: 9112 entries, 0 to 9124
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   movieId  9112 non-null   int64  
 1   imdbId   9112 non-null   int64  
 2   tmdbId   9112 non-null   float64
dtypes: float64(1), int64(2)
memory usage: 284.8 KB


In [103]:
# Convert the 'tmdbId' column to int64
links_small['tmdbId'] = links_small['tmdbId'].astype('int64')

# Rename 'tmdbId' column to 'id'
links_small.rename(columns={'tmdbId': 'id'}, inplace=True)

In [104]:
links_small.info()

<class 'pandas.core.frame.DataFrame'>
Index: 9112 entries, 0 to 9124
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype
---  ------   --------------  -----
 0   movieId  9112 non-null   int64
 1   imdbId   9112 non-null   int64
 2   id       9112 non-null   int64
dtypes: int64(3)
memory usage: 284.8 KB


In [105]:
# Create a subset of the movies dataset using IDs from the links_small dataset
movies_rec3 = movies[movies['id'].isin(links_small['id'])]

In [106]:
# Reset the index of the filtered movies dataset
movies_rec3.reset_index(inplace = True, drop = True)

In [107]:
movies_rec3.head(3)

Unnamed: 0,id,imdbId,movieId,title,release_year,genres,keywords,overview,tagline,cast,director,popularity,vote_average,vote_count
0,862,114709,1,Toy Story,1995,"[Animation, Comedy, Family]","[jealousy, toy, boy, friendship, friends, riva...","[Led, by, Woody, Andys, toys, live, happily, i...",[],"[TomHanks, TimAllen, DonRickles]",[JohnLasseter],21.946943,7.7,5415.0
1,8844,113497,2,Jumanji,1995,"[Adventure, Fantasy, Family]","[boardgame, disappearance, basedonchildren'sbo...","[When, siblings, Judy, and, Peter, discover, a...","[Roll, the, dice, and, unleash, the, excitement]","[RobinWilliams, JonathanHyde, KirstenDunst]",[JoeJohnston],17.015539,6.9,2413.0
2,15602,113228,3,Grumpier Old Men,1995,"[Romance, Comedy]","[fishing, bestfriend, duringcreditsstinger, ol...","[A, family, wedding, reignites, the, ancient, ...","[Still, Yelling, Still, Fighting, Still, Ready...","[WalterMatthau, JackLemmon, Ann-Margret]",[HowardDeutch],11.7129,6.5,92.0


In [108]:
movies_rec3.shape

(9070, 14)

In [109]:
# Create a 'tags' column by combining overview, tagline, cast, director, keywords, and genres

movies_rec3['tags'] = movies_rec3['overview'] + movies_rec3['tagline'] + movies_rec3['cast'] + movies_rec3['director'] + movies_rec3['keywords'] + movies_rec3['genres']

In [110]:
# converting datatype of tags into string

movies_rec3['tags'] = movies_rec3['tags'].apply(lambda x: " ".join(x))

In [111]:
movies_rec3['tags'][0]

'Led by Woody Andys toys live happily in his room until Andys birthday brings Buzz Lightyear onto the scene Afraid of losing his place in Andys heart Woody plots against Buzz But when circumstances separate Buzz and Woody from their owner the duo eventually learns to put aside their differences TomHanks TimAllen DonRickles JohnLasseter jealousy toy boy friendship friends rivalry boynextdoor newtoy toycomestolife Animation Comedy Family'

In [112]:
# Lowering the case of the string
movies_rec3['tags'] = movies_rec3['tags'].apply(lambda x:x.lower())

In [113]:
movies_rec3['tags'][0]

'led by woody andys toys live happily in his room until andys birthday brings buzz lightyear onto the scene afraid of losing his place in andys heart woody plots against buzz but when circumstances separate buzz and woody from their owner the duo eventually learns to put aside their differences tomhanks timallen donrickles johnlasseter jealousy toy boy friendship friends rivalry boynextdoor newtoy toycomestolife animation comedy family'

**Text Preprocessing on Tags Column**

In [114]:
from nltk import WordNetLemmatizer
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...


True

In [115]:
wl = WordNetLemmatizer()

In [116]:
# This function Lemmatize each word in the text and return the processed string.
def lemmatize(text):
  y = []
  for i in text.split():
    y.append(wl.lemmatize(i))
  return " ".join(y)

In [117]:
movies_rec3['tags'] = movies_rec3['tags'].apply(lemmatize)

In [118]:
movies_rec3['tags'][0]

'led by woody andys toy live happily in his room until andys birthday brings buzz lightyear onto the scene afraid of losing his place in andys heart woody plot against buzz but when circumstance separate buzz and woody from their owner the duo eventually learns to put aside their difference tomhanks timallen donrickles johnlasseter jealousy toy boy friendship friend rivalry boynextdoor newtoy toycomestolife animation comedy family'

**Text Vectorization on tags column**

In [119]:
# Initialize the TF-IDF vectorizer with English stop words

from sklearn.feature_extraction.text import TfidfVectorizer
tfid = TfidfVectorizer(stop_words='english')

In [120]:
# and convert the 'tags' text column into a TF-IDF feature matrix
vector = tfid.fit_transform(movies_rec3['tags'])

In [121]:
vector.shape

(9070, 52766)

**Cosine Similarity between vector**

In [122]:
from sklearn.metrics.pairwise import cosine_similarity

In [123]:
# Calculate the cosine similarity matrix based on the TF-IDF vectors
similarity = cosine_similarity(vector)

In [124]:
similarity[0]

array([1.        , 0.01470924, 0.00421439, ..., 0.        , 0.        ,
       0.00942179])

In [125]:
def content_based_recommendation(movie, top=10):

  """
  Generate movie recommendations using content-based filtering.

  This function identifies movies similar to a given movie by computing
  cosine similarity on TF-IDF feature vectors derived from movie metadata.
  It returns the top N most similar movies along with basic metadata.

  Parameters:
  ----------
  movie : str
      Title of the movie for which recommendations are generated.
  top : int, optional
      Number of similar movies to recommend (default is 10).

  Returns:
  -------
  pandas.DataFrame
      A DataFrame containing the recommended movies with the following columns:
      ['id', 'title', 'release_year', 'popularity']
  """

  # Find the index of the given movie title in the movies dataset
  movie_index = movies_rec3[movies_rec3['title'] == movie].index[0]

  # Retrieve similarity scores for the selected movie
  distances = similarity[movie_index]

  # Sort movies by similarity score in descending order
  # and select the top N most similar movies (excluding the movie itself)
  movies_list = sorted(list(enumerate(distances)), reverse=True, key=lambda x:x[1])[1:top]

  # Extract the indices of the recommended movies
  indices = [item[0] for item in movies_list]

  # Retrieve recommended movies with selected metadata
  recommend_movies = movies_rec3.iloc[indices][['id', 'title', 'release_year', 'popularity']]
  return recommend_movies

In [126]:
content_based_recommendation('The Dark Knight')

Unnamed: 0,id,title,release_year,popularity
7920,49026,The Dark Knight Rises,2012,20.58258
1110,364,Batman Returns,1992,15.001681
6136,272,Batman Begins,2005,28.505341
7555,40662,Batman: Under the Red Hood,2010,7.039325
8215,142061,"Batman: The Dark Knight Returns, Part 2",2013,12.576611
132,414,Batman Forever,1995,13.321354
523,268,Batman,1989,19.10673
2573,14919,Batman: Mask of the Phantasm,1993,7.29114
8897,209112,Batman v Superman: Dawn of Justice,2016,31.435879


While this approach works well for providing relevant and personalized suggestions, it also has some limitations. One major drawback is that the system often recommends movies that are very similar to each other, which can make the recommendations feel repetitive and limit the discovery of new or different types of movies. Since the system depends heavily on movie metadata, any missing or poorly written information can reduce the quality of recommendations. Additionally, content-based systems do not learn from other users’ preferences, so they cannot take advantage of popular trends or collective opinions. They also fail to consider a user’s changing mood or context, such as wanting something different from their usual choices. As a result, while content-based recommendation systems are useful and simple to implement, they may not always provide diverse or highly engaging recommendations on their own.

# **Recommend Movies Based on collaborative filtering**

Collaborative filtering is a widely used recommendation technique that suggests movies by analyzing the preferences and behavior of multiple users. Instead of focusing on movie features, this approach identifies patterns in user–movie interactions, such as ratings or viewing history, to find similarities between users or between movies. The underlying idea is that users with similar tastes in the past are likely to enjoy similar movies in the future. By leveraging collective user feedback, collaborative filtering can provide more personalized and dynamic recommendations compared to content-based methods, making it a popular choice for modern recommendation systems.


For collaborative filtering–based movie recommendation, we require a dataset that contains user–movie interactions, where different users rate different movies. For this purpose, we use the `ratings_small` dataset.

The `ratings_small` dataset contains approximately 100,000 ratings provided by around 700 users for about 9,000 movies. This dataset is suitable for building and evaluating collaborative filtering recommendation models.

In [127]:
rdf = pd.read_csv('/content/ratings_small.csv')

In [128]:
rdf.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,31,2.5,1260759144
1,1,1029,3.0,1260759179
2,1,1061,3.0,1260759182
3,1,1129,2.0,1260759185
4,1,1172,4.0,1260759205


In [129]:
ratings_small = rdf[['userId', 'movieId', 'rating']]
ratings_small.head()

Unnamed: 0,userId,movieId,rating
0,1,31,2.5
1,1,1029,3.0
2,1,1061,3.0
3,1,1129,2.0
4,1,1172,4.0


In [130]:
# Checking missing values in dataset
ratings_small.isnull().sum()

Unnamed: 0,0
userId,0
movieId,0
rating,0


In [131]:
# Check for duplicated values in the dataset.
ratings_small.duplicated().sum()

np.int64(0)

In [132]:
ratings_small.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100004 entries, 0 to 100003
Data columns (total 3 columns):
 #   Column   Non-Null Count   Dtype  
---  ------   --------------   -----  
 0   userId   100004 non-null  int64  
 1   movieId  100004 non-null  int64  
 2   rating   100004 non-null  float64
dtypes: float64(1), int64(2)
memory usage: 2.3 MB


In [133]:
# Extract the movie IDs from the ratings_small dataset
y = ratings_small.groupby('movieId').count()['rating'].index

In [134]:
# Create a subset of the movies dataset using movie IDs from the ratings_small dataset
movies_rec4 = movies[movies['movieId'].isin(y)]
movies_rec4.head()

Unnamed: 0,id,imdbId,movieId,title,release_year,genres,keywords,overview,tagline,cast,director,popularity,vote_average,vote_count
0,862,114709,1,Toy Story,1995,"[Animation, Comedy, Family]","[jealousy, toy, boy, friendship, friends, riva...","[Led, by, Woody, Andys, toys, live, happily, i...",[],"[TomHanks, TimAllen, DonRickles]",[JohnLasseter],21.946943,7.7,5415.0
1,8844,113497,2,Jumanji,1995,"[Adventure, Fantasy, Family]","[boardgame, disappearance, basedonchildren'sbo...","[When, siblings, Judy, and, Peter, discover, a...","[Roll, the, dice, and, unleash, the, excitement]","[RobinWilliams, JonathanHyde, KirstenDunst]",[JoeJohnston],17.015539,6.9,2413.0
2,15602,113228,3,Grumpier Old Men,1995,"[Romance, Comedy]","[fishing, bestfriend, duringcreditsstinger, ol...","[A, family, wedding, reignites, the, ancient, ...","[Still, Yelling, Still, Fighting, Still, Ready...","[WalterMatthau, JackLemmon, Ann-Margret]",[HowardDeutch],11.7129,6.5,92.0
3,31357,114885,4,Waiting to Exhale,1995,"[Comedy, Drama, Romance]","[basedonnovel, interracialrelationship, single...","[Cheated, on, mistreated, and, stepped, on, th...","[Friends, are, the, people, who, let, you, be,...","[WhitneyHouston, AngelaBassett, LorettaDevine]",[ForestWhitaker],3.859495,6.1,34.0
4,11862,113041,5,Father of the Bride Part II,1995,[Comedy],"[baby, midlifecrisis, confidence, aging, daugh...","[Just, when, George, Banks, has, recovered, fr...","[Just, When, His, World, Is, Back, To, Normal,...","[SteveMartin, DianeKeaton, MartinShort]",[CharlesShyer],8.387519,5.7,173.0


In [135]:
movies_rec4.shape

(9010, 14)

In [136]:
# Create a movie-user rating matrix using a pivot table
# Fill missing ratings with 0 to prepare the matrix for similarity computation
pt = ratings_small.pivot_table(index='movieId', columns='userId', values='rating')
pt.fillna(0, inplace=True)

In [137]:
pt

userId,1,2,3,4,5,6,7,8,9,10,...,662,663,664,665,666,667,668,669,670,671
movieId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,0.0,0.0,0.0,0.0,0.0,0.0,3.0,0.0,4.0,0.0,...,0.0,4.0,3.5,0.0,0.0,0.0,0.0,0.0,4.0,5.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,5.0,0.0,0.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,4.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
161944,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
162376,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
162542,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
162672,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [138]:
# Compute cosine similarity between movies based on user ratings
similarity_score = cosine_similarity(pt)

In [139]:
similarity_score.shape

(9066, 9066)

In [140]:
def colaborative_filtering_recommendation(movie, top=10):
    """
    Generate movie recommendations using collaborative filtering.

    This function identifies movies similar to a given movie based on
    user rating patterns using item–item cosine similarity. It returns
    the top N recommended movies along with basic metadata.

    Parameters:
    ----------
    movie : str
        Title of the movie for which recommendations are generated.
    top : int, optional
        Number of similar movies to recommend (default is 10).

    Returns:
    -------
    pandas.DataFrame
        A DataFrame containing the recommended movies with the following columns:
        ['id', 'title', 'release_year', 'popularity']
    """

    # Find the movieId corresponding to the given movie title
    movieId = movies_rec4[movies_rec4['title'] == movie]['movieId'].values[0]

    # Get the index of the movieId in the pivot table
    index = np.where(pt.index == movieId)[0][0]

    # Sort movies by similarity score in descending order
    # and select the top N most similar movies (excluding the movie itself)
    similar_items = sorted(list(enumerate(similarity_score[index])), key=lambda x: x[1], reverse=True)[1:top]

    # Extract indices of similar movies
    indices = [item[0] for item in similar_items]

    # Map indices back to movieIds using the pivot table
    movieId_list = [pt.index[i] for i in indices]

    # Retrieve recommended movies with selected metadata
    recommend_movies = movies_rec4[movies_rec4['movieId'].isin(movieId_list)][['id', 'title', 'release_year', 'popularity']]
    return recommend_movies

In [141]:
colaborative_filtering_recommendation('The Dark Knight')

Unnamed: 0,id,title,release_year,popularity
6975,122,The Lord of the Rings: The Return of the King,2003,29.324358
10089,272,Batman Begins,2005,28.505341
11281,1422,The Departed,2006,18.515448
11626,1271,300,2006,18.108408
12550,1726,Iron Man,2008,22.073099
12666,10681,WALL·E,2008,16.088366
14492,19995,Avatar,2009,185.070892
15407,27205,Inception,2010,29.108149
18136,49026,The Dark Knight Rises,2012,20.58258


Collaborative filtering relies heavily on user ratings and interaction data to generate recommendations, which leads to several limitations. One major challenge is the cold-start problem, where the system struggles to recommend movies to new users or suggest newly added movies because there is little or no interaction data available. The quality of recommendations also depends on the availability of sufficient user data; if users rate very few movies, the system may fail to identify meaningful patterns. Additionally, collaborative filtering can face scalability issues as the number of users and movies grows, increasing computational complexity. It may also produce biased recommendations by favoring popular movies while overlooking niche or less-rated content. Despite being highly effective for personalization, collaborative filtering requires large, well-structured datasets to perform reliably.

#**Hybrid Movie Recommendation System (Content-Based + Collaborative Filtering)**

The Hybrid Recommendation System combines Content-Based Filtering and Collaborative Filtering to generate more accurate and personalized movie recommendations. Each approach has its own strengths and limitations, and the hybrid model leverages both to improve recommendation quality.

The hybrid system combines similarity scores from both approaches using a weighted average method. A parameter α (alpha) controls the contribution of each method:

- α determines the weight of the content-based similarity
- (1 − α) determines the weight of the collaborative filtering similarity

This allows the system to balance movie content features with user preferences and reduces issues such as cold start and data sparsity.
Overall, the hybrid recommendation system provides more reliable and relevant movie suggestions than using either content-based or collaborative filtering alone.

In [142]:
movies_rec3.shape

(9070, 15)

In [143]:
movies_rec4.shape

(9010, 14)

In [144]:
# Count the number of movies that appear in both the content-based and collaborative filtering datasets
movies_rec3['movieId'].isin(movies_rec4['movieId']).sum()

np.int64(9010)

In [145]:
def hybrid_recommendation(movie, top=10, alpha=0.5):

  """
  Generate movie recommendations using a hybrid approach that combines
  content-based filtering and collaborative filtering.

  This function computes similarity scores from both content-based
  features and user rating patterns, then combines them using a
  weighted average controlled by the parameter `alpha`.

  Parameters:
  ----------
  movie : str
      Title of the movie for which recommendations are generated.
  top : int, optional
      Number of similar movies to recommend (default is 10).
  alpha : float, optional
      Weight assigned to content-based similarity (range: 0–1).
      The collaborative filtering weight is calculated as (1 - alpha).

  Returns:
  -------
  pandas.DataFrame
      A DataFrame containing the recommended movies with the following columns:
      ['id', 'title', 'release_year', 'popularity']
  """

  # Content-Based Similarity
  movie_index = movies_rec3[movies_rec3['title'] == movie].index[0]
  movie_similarity = similarity[movie_index]
  content_scores = dict(zip(movies_rec3['movieId'], movie_similarity))

  # Collaborative Filtering Similarity
  movieId = movies_rec4[movies_rec4['title'] == movie]['movieId'].values[0]
  movie_idx = np.where(pt.index == movieId)[0][0]
  movie_sim = similarity_score[movie_idx]
  collaborative_scores = dict(zip(pt.index, movie_sim))

  # Hybrid score calculation
  hybrid_scores = {}

  # Add weighted content-based similarity scores
  for mid in content_scores:
      hybrid_scores[mid] = alpha * content_scores[mid]

  # Add weighted collaborative filtering similarity scores
  for mid in collaborative_scores:
      hybrid_scores[mid] = hybrid_scores.get(mid, 0) + (1 - alpha) * collaborative_scores[mid]

  # Sort movies by hybrid score in descending order
  # The result is in the form of (movieId, hybrid_score)
  hybrid_scores = sorted(hybrid_scores.items(), key=lambda x:x[1], reverse=True)

  # Extract movieIds of the top N similar movies (excluding the input movie)
  movie_ids = [mid for mid, score in hybrid_scores if mid != movieId][:top]

  # Recommended movies with selected metadata
  recommend_movies = movies_rec4[movies_rec4['movieId'].isin(movie_ids)][['id', 'title', 'release_year', 'popularity']]

  return recommend_movies


In [146]:
hybrid_recommendation("The Dark Knight")

Unnamed: 0,id,title,release_year,popularity
6975,122,The Lord of the Rings: The Return of the King,2003,29.324358
10089,272,Batman Begins,2005,28.505341
11281,1422,The Departed,2006,18.515448
11319,1124,The Prestige,2006,16.94556
12550,1726,Iron Man,2008,22.073099
12666,10681,WALL·E,2008,16.088366
13561,16869,Inglourious Basterds,2009,16.89564
14492,19995,Avatar,2009,185.070892
15407,27205,Inception,2010,29.108149
18136,49026,The Dark Knight Rises,2012,20.58258


# **Conclusion**

In this project, multiple movie recommendation techniques were explored and implemented to understand how different recommendation strategies work in real-world applications. The project began with popularity-based recommendations using IMDb’s weighted average rating, which helped identify top-rated and widely appreciated movies. This approach was further extended to provide genre-wise recommendations, offering more targeted suggestions within specific movie categories.

To move beyond general recommendations, a content-based recommendation system was developed using movie metadata and text-based features. By applying TF-IDF vectorization and cosine similarity, the system was able to recommend movies based on their content similarity, making it suitable for personalized recommendations without relying on user interaction data. Collaborative filtering was also implemented to leverage user ratings and interaction patterns, enabling more dynamic and personalized movie suggestions.

Finally, a hybrid recommendation approach was introduced to combine the strengths of both content-based and collaborative filtering methods. This helped reduce the limitations of individual approaches and improved the overall recommendation quality. Overall, this project demonstrates how different recommendation techniques can be designed, compared, and integrated to build effective and scalable movie recommendation systems commonly used in modern streaming platforms.