# Movie Recommendation System Using Collaborative and Content-Based Filtering

## 1. Business Understanding
### 1.1 Project Overview
Personalized recommendation systems are a core component of modern digital platforms, particularly in media and entertainment services where users are exposed to large volumes of content. Effective personalization helps users discover relevant content efficiently while improving engagement and satisfaction.
This project develops a movie recommendation system using historical user rating data from the [MovieLens](https://grouplens.org/datasets/movielens/) (ml-latest-small) dataset. Collaborative filtering is the primary recommendation mechanism. Content-based filtering on movie genre information supplements it to handle cold-start scenarios, where a user or movie has too few ratings for collaborative filtering to work well.
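
To make the content-based side concrete, the following is a minimal sketch of genre-based similarity using the Jaccard index on pipe-separated genre strings. It is an illustration under stated assumptions, not necessarily the method used later in this project: the choice of Jaccard similarity and the helper names `genre_set`, `jaccard`, and `most_similar` are hypothetical. The three titles and genre strings are taken from `movies.csv`.

```python
def genre_set(genres):
    """Split a MovieLens genre string such as 'Comedy|Romance' into a set."""
    return set(genres.split("|"))


def jaccard(a, b):
    """Jaccard index: size of the intersection divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def most_similar(title, catalog, k=2):
    """Rank the other movies in `catalog` by genre overlap with `title`."""
    target = genre_set(catalog[title])
    scores = [
        (other, jaccard(target, genre_set(genres)))
        for other, genres in catalog.items()
        if other != title
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:k]


# Three rows from movies.csv (title -> genres)
movies = {
    "Toy Story (1995)": "Adventure|Animation|Children|Comedy|Fantasy",
    "Jumanji (1995)": "Adventure|Children|Fantasy",
    "Grumpier Old Men (1995)": "Comedy|Romance",
}

ranked = most_similar("Toy Story (1995)", movies)
print(ranked)
```

Because this score depends only on movie metadata, it can rank a movie that has no ratings yet, which is the cold-start case collaborative filtering cannot cover on its own.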

### 1.2 Business Problem Statement

A movie streaming platform seeks to improve user engagement and retention by recommending movies that align with individual user preferences. Without effective personalization, users may struggle to discover relevant content, leading to reduced satisfaction and platform usage. The challenge is to leverage historical user ratings to generate accurate and personalized movie recommendations.

### 1.3 Project Objectives
1. Build a hybrid recommendation system that combines collaborative filtering and content-based techniques, using explicit user ratings and movie metadata.
2. Generate personalized top-5 movie recommendations for individual users.
3. Evaluate model performance using appropriate validation techniques and error-based metrics.
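
The error-based metrics in objective 3 are typically RMSE and MAE for rating prediction. The sketch below computes both in plain Python so the definitions are explicit. The rating values are hypothetical and chosen only for illustration, and the function names `rmse` and `mae` are assumptions rather than part of any library used in this project.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error: penalizes large errors more heavily."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def mae(actual, predicted):
    """Mean absolute error: average size of the error in rating stars."""
    absolute = [abs(a - p) for a, p in zip(actual, predicted)]
    return sum(absolute) / len(absolute)


# Hypothetical true and predicted ratings for four (user, movie) pairs
actual = [4.0, 3.0, 5.0, 2.0]
predicted = [3.5, 3.0, 4.0, 3.0]

rmse_value = rmse(actual, predicted)
mae_value = mae(actual, predicted)
print(f"RMSE: {rmse_value:.3f}, MAE: {mae_value:.3f}")
```

Both metrics are expressed in rating stars, so an MAE of 0.625 means predictions are off by a little over half a star on average.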

### 1.4 Research Questions
1. How effectively can collaborative filtering learn user preferences from historical movie ratings?
2. How accurately can the model predict user ratings for unseen movies?
3. How does a hybrid recommendation approach compare to pure collaborative filtering in terms of recommendation quality?

## 2. Data Understanding
### 2.1 Dataset Overview
The MovieLens (ml-latest-small) dataset is a publicly available movie recommendation dataset developed by the GroupLens Research Lab at the University of Minnesota. It contains explicit user ratings and user-generated tags collected from the MovieLens platform between 1996 and 2018.

The dataset includes:
- 100,836 ratings
- 3,683 tag applications
- 9,742 movies
- 610 anonymized users
- Ratings on a 0.5–5.0 star scale

Each user in the dataset has rated at least 20 movies, making it suitable for collaborative filtering techniques.
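
The counts above imply a very sparse user-item matrix, which matters when choosing a collaborative filtering algorithm. The short calculation below works this out. It assumes every listed movie is a column of the matrix, so it slightly understates the density if some movies have no ratings.

```python
# Counts taken from the dataset overview above
n_ratings = 100_836
n_users = 610
n_movies = 9_742

# Number of cells in a full user x movie rating matrix
possible_ratings = n_users * n_movies

density = n_ratings / possible_ratings
sparsity = 1 - density

print(f"Possible ratings: {possible_ratings:,}")
print(f"Density:  {density:.2%}")
print(f"Sparsity: {sparsity:.2%}")
```

Under that assumption, fewer than 2% of all possible user-movie pairs have a rating.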

### 2.2 Data Files
The dataset is provided in four comma-separated value (CSV) files:
- ratings.csv: user ratings for movies, including timestamps
- movies.csv: movie titles and genre information
- tags.csv: user-generated tags applied to movies
- links.csv: external identifiers linking movies to IMDb and TMDB

## 3. Environment Setup & Reproducibility

In [1]:
# Libraries for data manipulation and visualization
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Surprise library for collaborative filtering
from surprise import Dataset, Reader
from surprise import KNNBasic, SVD
from surprise.model_selection import train_test_split
from surprise import accuracy

# Evaluation metrics 
from sklearn.metrics import mean_absolute_error

# Utility libraries
import warnings
import random

# Reproducibility
RANDOM_SEED = 42
np.random.seed(RANDOM_SEED)
random.seed(RANDOM_SEED)

# Suppress unnecessary warnings for cleaner output
warnings.filterwarnings("ignore")

# Confirm Surprise version
import surprise
print("Surprise version:", surprise.__version__)

Surprise version: 1.1.1
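
As a quick check that the seeding above behaves as intended, the sketch below resets both random number generators and confirms that the same draws come back. The helper names `set_seed` and `draw_sample` are hypothetical and used only for this illustration.

```python
import random

import numpy as np


def set_seed(seed):
    """Seed Python's and NumPy's global random number generators."""
    random.seed(seed)
    np.random.seed(seed)


def draw_sample():
    """Draw one value from each generator."""
    return random.random(), float(np.random.rand())


set_seed(42)
first = draw_sample()

set_seed(42)
second = draw_sample()

print("Identical draws after re-seeding:", first == second)
```

Global seeds only cover code that reads from these two generators. Surprise's `train_test_split` and `SVD` also accept a `random_state` argument, and passing it explicitly is a more direct way to fix a split or a model initialization.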


## 4. Data Loading

In [4]:

ratings_df = pd.read_csv("../data/ratings.csv")
movies_df = pd.read_csv("../data/movies.csv")
tags_df = pd.read_csv("../data/tags.csv")
links_df = pd.read_csv("../data/links.csv")

# Basic validation checks
print("Ratings shape:", ratings_df.shape)
print("Movies shape:", movies_df.shape)
print("Tags shape:", tags_df.shape)
print("Links shape:", links_df.shape)

# Inspect first few rows to confirm successful loading
ratings_df.head(), movies_df.head()

Ratings shape: (100836, 4)
Movies shape: (9742, 3)
Tags shape: (3683, 4)
Links shape: (9742, 3)


(   userId  movieId  rating  timestamp
 0       1        1     4.0  964982703
 1       1        3     4.0  964981247
 2       1        6     4.0  964982224
 3       1       47     5.0  964983815
 4       1       50     5.0  964982931,
    movieId                               title  \
 0        1                    Toy Story (1995)   
 1        2                      Jumanji (1995)   
 2        3             Grumpier Old Men (1995)   
 3        4            Waiting to Exhale (1995)   
 4        5  Father of the Bride Part II (1995)   
 
                                         genres  
 0  Adventure|Animation|Children|Comedy|Fantasy  
 1                   Adventure|Children|Fantasy  
 2                               Comedy|Romance  
 3                         Comedy|Drama|Romance  
 4                                       Comedy  )
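
Shape checks confirm that the files loaded, but not that the values are sane. The following is a minimal sketch of additional checks that could be run on the ratings table. It uses the five rows shown in the output above as an in-memory sample so that it runs without the CSV files. The function name `validate_ratings` and the specific set of checks are assumptions, not part of the original notebook.

```python
import pandas as pd

# The first five rows of ratings.csv, as shown in the output above
ratings_sample = pd.DataFrame(
    {
        "userId": [1, 1, 1, 1, 1],
        "movieId": [1, 3, 6, 47, 50],
        "rating": [4.0, 4.0, 4.0, 5.0, 5.0],
        "timestamp": [964982703, 964981247, 964982224, 964983815, 964982931],
    }
)


def validate_ratings(df):
    """Return a list of problems found; an empty list means all checks passed."""
    problems = []

    expected = {"userId", "movieId", "rating", "timestamp"}
    missing = expected - set(df.columns)
    if missing:
        # Later checks need these columns, so stop here
        return [f"missing columns: {sorted(missing)}"]

    if df[sorted(expected)].isna().any().any():
        problems.append("null values present")

    if df.duplicated(subset=["userId", "movieId"]).any():
        problems.append("duplicate (userId, movieId) pairs")

    # The dataset overview states a 0.5-5.0 star scale
    if not df["rating"].between(0.5, 5.0).all():
        problems.append("ratings outside the 0.5-5.0 scale")

    return problems


problems = validate_ratings(ratings_sample)
print(problems or "All checks passed")
```

The same function could be applied to the full `ratings_df` after loading. A non-empty result would point to a loading or data quality issue worth resolving before modeling.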