# **MOVIELENS RECOMMENDER SYSTEM**

## **1. Business Understanding**
------------------------------------

### 1.1: Problem Statement

In the modern streaming era, users are often overwhelmed by vast content libraries, leading to "choice paralysis" and platform abandonment. When users cannot quickly find content that aligns with their tastes, engagement drops and churn rates consequently increase.

### 1.2: Objective

The goal of this project is to build a Hybrid Recommendation System that leverages historical user data to provide personalized 'Top 5' movie recommendations. 
<br>
By automating content discovery, the system aims to:

  * Increase User Retention: Keep users on the platform longer by surfacing relevant content.

  * Enhance Experience: Reduce the cognitive load of searching for movies.

  * Optimize Engagement: Use predictive modeling (Collaborative Filtering) to identify high-interest items a user hasn't yet discovered.

### 1.3: Success Criteria

The model will be considered successful if it achieves:

  - A low RMSE (Root Mean Squared Error), indicating high accuracy in predicting user ratings.

  - High Precision@k, ensuring that the top recommendations are truly relevant to the user.
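The two metrics above can be made concrete with a minimal sketch. The helper names `rmse` and `precision_at_k` and the toy values are illustrative assumptions, not part of this project's pipeline, and Precision@k is computed here as hits divided by k, which is one common convention.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between actual and predicted ratings."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def precision_at_k(recommended, relevant, k=5):
    """Share of the first k recommended items that appear in the relevant set."""
    top_k = list(recommended)[:k]
    relevant = set(relevant)
    hits = sum(item in relevant for item in top_k)
    return hits / k

# Toy ratings (hypothetical values, for illustration only)
actual = [4.0, 3.0, 5.0, 2.0]
predicted = [3.5, 3.0, 4.0, 3.0]
error = rmse(actual, predicted)

# Toy recommendation list and the items the user actually liked (hypothetical)
recommended_ids = [10, 20, 30, 40, 50]
relevant_ids = {20, 40, 60}
p_at_5 = precision_at_k(recommended_ids, relevant_ids, k=5)

print(f"RMSE: {error:.2f}, Precision@5: {p_at_5:.2f}")
```

What counts as "relevant" (for example, a rating at or above some threshold) is a modeling choice that needs to be fixed before Precision@k values can be compared across models.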

## **2. Data Understanding**
------------------------------


### 2.1: Importing the necessary Libraries

In [2]:
import pandas as pd
import numpy as np

### 2.2: Data Loading and Inspection

The dataset provides a rich snapshot of user-item interactions, allowing for the development of both collaborative and content-based models.

**2.2.1: Dataset Composition**

This dataset comprises four interconnected files that map movie attributes to user behaviors and external metadata.

1. movies.csv: The primary catalog containing movieId, title (with release year) and a pipe-separated list of genres (e.g., Action|Sci-Fi).

In [3]:
movies = pd.read_csv('Datasets/Movielens_data/movies.csv')
movies.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy
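Because `genres` is stored as a single pipe-separated string, content-based features usually need it split first. The sketch below shows two ways this might be done on a small sample copied from the output above; the variable names `sample`, `genre_flags` and `genre_lists` are illustrative assumptions.

```python
import pandas as pd

# Three rows copied from the movies.head() output
sample = pd.DataFrame({
    "movieId": [1, 2, 3],
    "title": ["Toy Story (1995)", "Jumanji (1995)", "Grumpier Old Men (1995)"],
    "genres": [
        "Adventure|Animation|Children|Comedy|Fantasy",
        "Adventure|Children|Fantasy",
        "Comedy|Romance",
    ],
})

# One indicator column per genre (1 if the movie has that genre, else 0)
genre_flags = sample["genres"].str.get_dummies(sep="|")

# Alternatively, keep the genres as a Python list per movie
genre_lists = sample["genres"].str.split("|", regex=False)

print(genre_flags)
print(genre_lists)
```

The indicator matrix is convenient for similarity computations, while the list form is easier to feed into a text vectorizer.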


2. ratings.csv: A collection of user-movie interactions featuring userId, movieId and a rating on a 5-star scale with 0.5-star increments.

In [4]:
ratings = pd.read_csv('Datasets/Movielens_data/ratings.csv')
ratings.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931


3. tags.csv: User-generated metadata providing short, descriptive phrases (e.g., "cult classic") for specific films. It includes userId, movieId and the tag content.

In [5]:
tags = pd.read_csv('Datasets/Movielens_data/tags.csv')
tags.head()

Unnamed: 0,userId,movieId,tag,timestamp
0,2,60756,funny,1445714994
1,2,60756,Highly quotable,1445714996
2,2,60756,will ferrell,1445714992
3,2,89774,Boxing story,1445715207
4,2,89774,MMA,1445715200


4. links.csv: A relational bridge containing movieId, imdbId and tmdbId to facilitate integration with external platforms like IMDb and The Movie Database.

In [6]:
links = pd.read_csv('Datasets/Movielens_data/links.csv')
links.head()

Unnamed: 0,movieId,imdbId,tmdbId
0,1,114709,862.0
1,2,113497,8844.0
2,3,113228,15602.0
3,4,114885,31357.0
4,5,113041,11862.0


**2.2.2: Merging the Datasets**

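Using `movieId` as the join key, we combine the metadata from movies.csv and links.csv with the user ratings in ratings.csv. The tags from tags.csv are then joined on both `userId` and `movieId`, so each tag is attached to the rating given by the same user for the same movie.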

In [7]:
df1 = pd.merge(movies, links, on='movieId', how='inner')
df1.head()

Unnamed: 0,movieId,title,genres,imdbId,tmdbId
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,114709,862.0
1,2,Jumanji (1995),Adventure|Children|Fantasy,113497,8844.0
2,3,Grumpier Old Men (1995),Comedy|Romance,113228,15602.0
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance,114885,31357.0
4,5,Father of the Bride Part II (1995),Comedy,113041,11862.0


In [8]:
df2 = pd.merge(df1, ratings, on='movieId', how='left')
df2.head()

Unnamed: 0,movieId,title,genres,imdbId,tmdbId,userId,rating,timestamp
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,114709,862.0,1.0,4.0,964982700.0
1,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,114709,862.0,5.0,4.0,847435000.0
2,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,114709,862.0,7.0,4.5,1106636000.0
3,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,114709,862.0,15.0,2.5,1510578000.0
4,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,114709,862.0,17.0,4.5,1305696000.0


In [9]:
final_df = pd.merge(df2, tags, on=['userId', 'movieId'], 
                    how='left', suffixes=('_rating', '_tag'))
final_df.head()


Unnamed: 0,movieId,title,genres,imdbId,tmdbId,userId,rating,timestamp_rating,tag,timestamp_tag
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,114709,862.0,1.0,4.0,964982700.0,,
1,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,114709,862.0,5.0,4.0,847435000.0,,
2,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,114709,862.0,7.0,4.5,1106636000.0,,
3,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,114709,862.0,15.0,2.5,1510578000.0,,
4,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,114709,862.0,17.0,4.5,1305696000.0,,
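A left merge can change the row count: when a user applied several tags to the same movie, the matching rating row is repeated once per tag. The sketch below illustrates this on toy frames and shows one possible way to keep a single row per rating. The tag values come from the `tags.head()` output above, while the ratings and the names `ratings_toy`, `tags_toy` and `tags_agg` are hypothetical.

```python
import pandas as pd

# Hypothetical ratings; the tags mirror the tags.head() output
ratings_toy = pd.DataFrame({
    "userId": [2, 2, 3],
    "movieId": [60756, 89774, 60756],
    "rating": [5.0, 5.0, 4.0],
})
tags_toy = pd.DataFrame({
    "userId": [2, 2, 2, 2, 2],
    "movieId": [60756, 60756, 60756, 89774, 89774],
    "tag": ["funny", "Highly quotable", "will ferrell", "Boxing story", "MMA"],
})

# indicator=True adds a _merge column showing where each row came from
merged = ratings_toy.merge(tags_toy, on=["userId", "movieId"],
                           how="left", indicator=True)
rows_added = len(merged) - len(ratings_toy)            # duplicated rating rows
untagged = int((merged["_merge"] == "left_only").sum())  # ratings with no tag

# Alternative: collapse tags to one string per (userId, movieId) first,
# so the merge cannot duplicate rating rows
tags_agg = (tags_toy.groupby(["userId", "movieId"], as_index=False)["tag"]
            .agg(lambda s: "|".join(s)))
merged_one_row = ratings_toy.merge(tags_agg, on=["userId", "movieId"],
                                   how="left", validate="many_to_one")

print(rows_added, untagged, len(merged_one_row))
```

Duplicated rating rows matter for collaborative filtering, because a rating that appears several times is effectively given extra weight during training.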


The merged data is then saved for subsequent analysis.

In [10]:
final_df.to_csv('Datasets/merged_movie_data.csv', index=False)

## **3. Data Preparation**
----------------------------------------------------------

In [11]:
data = pd.read_csv('Datasets/merged_movie_data.csv')
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 102695 entries, 0 to 102694
Data columns (total 10 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   movieId           102695 non-null  int64  
 1   title             102695 non-null  object 
 2   genres            102695 non-null  object 
 3   imdbId            102695 non-null  int64  
 4   tmdbId            102682 non-null  float64
 5   userId            102677 non-null  float64
 6   rating            102677 non-null  float64
 7   timestamp_rating  102677 non-null  float64
 8   tag               3476 non-null    object 
 9   timestamp_tag     3476 non-null    float64
dtypes: float64(5), int64(2), object(3)
memory usage: 7.8+ MB


a. Checking for duplicates

In [12]:
data.duplicated().sum()

0

b. Checking for missing values

In [13]:
data.isnull().sum()

movieId                 0
title                   0
genres                  0
imdbId                  0
tmdbId                 13
userId                 18
rating                 18
timestamp_rating       18
tag                 99219
timestamp_tag       99219
dtype: int64

In [14]:
# 1. Fill missing tags with a placeholder 
data['tag'] = data['tag'].fillna('no_tag')
data['timestamp_tag'] = data['timestamp_tag'].fillna(0)  # 0 will indicate no tag

# 2. Drop rows where there is no rating or userId 
data.dropna(subset=['userId', 'rating'], inplace=True)

# 3. Drop the 13 rows with a missing tmdbId
data.dropna(subset=['tmdbId'], inplace=True)

data.isnull().sum()

movieId             0
title               0
genres              0
imdbId              0
tmdbId              0
userId              0
rating              0
timestamp_rating    0
tag                 0
timestamp_tag       0
dtype: int64
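The `data.info()` output shows `userId` and `tmdbId` stored as `float64`, which is what pandas does when an integer column contains missing values. Once those rows are dropped, the columns can be cast back to integers. The following is a minimal sketch on a toy frame (the name `toy` and its values are illustrative assumptions), not a step taken in the original notebook.

```python
import numpy as np
import pandas as pd

# Toy frame with the same kind of gaps a left merge produces
toy = pd.DataFrame({
    "movieId": [1, 1, 2],
    "userId": [1.0, 5.0, np.nan],
    "rating": [4.0, 4.0, np.nan],
    "tmdbId": [862.0, 862.0, 8844.0],
})

# Drop incomplete rows, then restore integer identifiers
clean = (toy.dropna(subset=["userId", "rating", "tmdbId"])
            .astype({"userId": "int64", "tmdbId": "int64"}))

print(clean.dtypes)
```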

c. Standardizing the title column before extracting the release year for time series analysis

In [16]:
# Remove leading/trailing spaces and make titles title-case
data['title'] = data['title'].str.strip().str.title()
data

Unnamed: 0,movieId,title,genres,imdbId,tmdbId,userId,rating,timestamp_rating,tag,timestamp_tag
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,114709,862.0,1.0,4.0,9.649827e+08,no_tag,0.0
1,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,114709,862.0,5.0,4.0,8.474350e+08,no_tag,0.0
2,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,114709,862.0,7.0,4.5,1.106636e+09,no_tag,0.0
3,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,114709,862.0,15.0,2.5,1.510578e+09,no_tag,0.0
4,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,114709,862.0,17.0,4.5,1.305696e+09,no_tag,0.0
...,...,...,...,...,...,...,...,...,...,...
102690,193581,Black Butler: Book Of The Atlantic (2017),Action|Animation|Comedy|Fantasy,5476944,432131.0,184.0,4.0,1.537109e+09,no_tag,0.0
102691,193583,No Game No Life: Zero (2017),Animation|Comedy|Fantasy,5914996,445030.0,184.0,3.5,1.537110e+09,no_tag,0.0
102692,193585,Flint (2017),Drama,6397426,479308.0,184.0,3.5,1.537110e+09,no_tag,0.0
102693,193587,Bungo Stray Dogs: Dead Apple (2018),Action|Animation,8391976,483455.0,184.0,3.5,1.537110e+09,no_tag,0.0
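The release year sits in parentheses at the end of each title, so it can be pulled out with a regular expression. This is a minimal sketch on a few sample titles; the names `year` and `clean_title` are illustrative, and the last title is a made-up example of a title without a year.

```python
import pandas as pd

titles = pd.Series([
    "Toy Story (1995)",
    "Jumanji (1995)",
    "Flint (2017)",
    "Untitled Film",   # hypothetical title with no year
])

# Capture a four-digit year in parentheses at the end of the title
year = titles.str.extract(r"\((\d{4})\)\s*$", expand=False)
# Nullable integer dtype keeps missing years as <NA> instead of forcing floats
year = pd.to_numeric(year).astype("Int64")

# Title without the trailing "(YYYY)"
clean_title = titles.str.replace(r"\s*\(\d{4}\)\s*$", "", regex=True)

print(pd.DataFrame({"title": clean_title, "year": year}))
```

Titles that do not end in a four-digit year are left unchanged and receive a missing year, so they can be inspected separately rather than silently dropped.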


### 3.1: Statistical Challenges & Constraints

To build a "best-in-class" system, we must account for the following data characteristics:

  * Matrix Sparsity: Using the MovieLens 100K benchmark figures (943 users, 1,682 movies and 100,000 ratings), the interaction matrix has a density of only about 6.3%, because most users have rated only a small fraction of the library. Our model (SVD) must be robust enough to "fill in the gaps" via latent factor estimation.

  * The Long Tail Distribution: A small percentage of movies (popular blockbusters) receive the vast majority of ratings. Conversely, a large "tail" of movies has very few ratings, which can bias recommendations toward popular titles if it is not handled during preprocessing.

  * Rating Bias: Users typically rate movies they have already chosen to watch, which creates a "selection bias." We will examine the rating distribution and its mean to determine whether the baseline is skewed toward higher scores (e.g., a mean near 3.5 rather than the 2.75 midpoint of a 0.5 to 5 scale).
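The long-tail and rating-bias checks described above might be sketched as follows. The toy ratings, the 25% cutoff and the variable names are illustrative assumptions; on the real data the same steps would run on the `movieId` and `rating` columns.

```python
import pandas as pd

# Toy ratings: movie 1 is a "blockbuster", the rest are niche (hypothetical)
toy = pd.DataFrame({
    "movieId": [1, 1, 1, 1, 1, 1, 2, 2, 3, 4],
    "rating":  [4.0, 4.5, 5.0, 3.5, 4.0, 4.5, 3.0, 4.0, 2.5, 5.0],
})

# Ratings per movie, most-rated first
counts = toy["movieId"].value_counts()

# Share of all ratings captured by the top 25% most-rated movies
top_n = max(1, int(len(counts) * 0.25))
top_share = counts.iloc[:top_n].sum() / counts.sum()

# Mean rating, to compare against the midpoint of the rating scale
mean_rating = toy["rating"].mean()

print(f"Top {top_n} movie(s) hold {top_share:.0%} of ratings; mean = {mean_rating:.2f}")
```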

### 3.2: Data Preparation

The statistics quoted in this section describe MovieLens 100K, a benchmark dataset in recommendation system research with 100,000 ratings from 943 users across 1,682 movies. The files loaded in Section 2 include tags.csv and links.csv and use half-star ratings, and the merged frame has 102,695 rows, so these counts should be recomputed from the loaded data before they are relied on.

**Data Sources**

The recommender relies primarily on two of the files loaded above:

  * ratings.csv: Contains the core interaction data, including userId, movieId, rating (a 5-star scale with 0.5-star increments) and timestamp.

  * movies.csv: Contains metadata, including movieId, title and genres (a pipe-separated list).

**Key Data Characteristics**

  * Rating Distribution: We will analyze the frequency of each rating value to identify any positive bias (e.g., users tending to rate movies they liked rather than every movie they watched).

  * Sparsity: With 943 users and 1,682 movies, the user-item matrix has 1,586,126 possible interactions. With only 100,000 actual ratings, the matrix is approximately 93.7% sparse.

  * Popularity Bias: A small subset of movies (blockbusters) often accounts for a large percentage of total ratings, while many niche films have very few ratings.
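The sparsity arithmetic above can be reproduced directly, and the same calculation can be applied to any ratings frame. This is a minimal sketch: the function name `matrix_sparsity` and the toy frame are illustrative assumptions, while the three counts are the MovieLens 100K figures quoted in the text.

```python
import pandas as pd

# Figures quoted in the text for MovieLens 100K
n_users, n_movies, n_ratings = 943, 1682, 100_000

possible = n_users * n_movies   # size of the full user-item matrix
density = n_ratings / possible  # share of cells that hold a rating
sparsity = 1 - density          # share of cells that are empty

print(f"{possible:,} cells, density {density:.1%}, sparsity {sparsity:.1%}")

def matrix_sparsity(df, user_col="userId", item_col="movieId"):
    """Sparsity of the user-item matrix implied by a ratings frame."""
    n_u = df[user_col].nunique()
    n_i = df[item_col].nunique()
    # Count each (user, item) pair once, even if the row was duplicated by a merge
    n_obs = len(df.drop_duplicates([user_col, item_col]))
    return 1 - n_obs / (n_u * n_i)

# Toy check: 2 users x 2 movies with 3 observed ratings -> 25% sparse
toy = pd.DataFrame({"userId": [1, 1, 2], "movieId": [10, 20, 10]})
toy_sparsity = matrix_sparsity(toy)
```

Calling `matrix_sparsity(data)` on the merged frame would give the sparsity of the dataset actually loaded, which may differ from the benchmark figure.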