### Data Preparation - Cleaning & Feature Engineering

Key step: split each movie's pipe-separated `genres` string into separate indicator columns (or a list of genres) so it can feed the content-based part of the hybrid approach.

In [13]:
import pandas as pd

# Load the merged dataset saved by the team
merged_df = pd.read_csv('./Datasets/merged_movie_data.csv')


### Why merge?
- Combines user ratings with movie metadata (title, genres) and tags → one clean table.
- Enables both:
  - **Collaborative Filtering** (userId + movieId + rating)
  - **Content-Based Filtering** (genres + tags)
- Saves the merged file so everyone uses the same data later.

### Data Preparation – Using the group's merged file

The team merged the raw files (movies, ratings, tags, links) into one table so we can:
- See ratings together with movie titles/genres (for collaborative + content filtering)
- Handle tags (sparse, but useful for content similarity)
- Avoid repeating merge code in every notebook

## 1. Business Understanding

### 1.1 Problem Statement
In today’s streaming world (Netflix, Disney+, etc.), users face **choice overload**.  
There are thousands of movies, so many people get stuck scrolling and eventually leave the platform (churn).  

**Real stakeholder example**:  
A streaming company like "MovieStream" wants to keep users watching longer.  
They need a system that quickly shows each person movies they’re very likely to enjoy.

### 1.2 Project Objective
We are building a **hybrid recommender system** that gives users **personalized Top-5 movie recommendations**.  

**Hybrid = two approaches combined**:
- **Collaborative Filtering** → learns from what similar users liked (based on ratings)
- **Content-Based Filtering** → looks at movie features (genres, tags) to find similar movies
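
One common way to combine the two approaches is a weighted blend of their scores. The sketch below only illustrates that idea under assumptions: the function name `blend_scores`, the weight `alpha`, and the toy scores are hypothetical, and this notebook does not state how the team actually combines the two models.

```python
import pandas as pd

def blend_scores(cf_scores: pd.Series, cb_scores: pd.Series, alpha: float = 0.5) -> pd.Series:
    """Weighted blend: alpha * collaborative + (1 - alpha) * content-based."""
    # Align both score series on movieId; a movie missing from one side
    # contributes 0 from that side (an assumption made for this sketch).
    cf, cb = cf_scores.align(cb_scores, fill_value=0.0)
    return alpha * cf + (1 - alpha) * cb

# Made-up scores on a 0-1 scale, indexed by movieId.
cf_scores = pd.Series({1: 0.9, 2: 0.4, 3: 0.7})
cb_scores = pd.Series({1: 0.5, 2: 0.8, 4: 0.6})

hybrid = blend_scores(cf_scores, cb_scores, alpha=0.6)
top5 = hybrid.sort_values(ascending=False).head(5)
print(top5)
```

Movie 4 has no collaborative score here, yet it still receives a (lower) hybrid score from its content features, which is the behaviour a hybrid relies on for rarely rated movies.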

**Goal**:  
- Reduce decision fatigue  
- Increase time spent on platform  
- Lower churn rate  
- Improve user satisfaction

### 1.3 Success Criteria (how we’ll know it worked)
- **Low RMSE** on predicted ratings → model predicts user preferences accurately  
- **High Precision@5** → most of the Top-5 movies we recommend are actually ones the user likes  
- Bonus: users get relevant suggestions even for movies with few ratings (hybrid helps with cold-start)
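
Both metrics are short to compute. The sketch below is a minimal version under assumptions: the helper names `rmse` and `precision_at_k` are hypothetical, the numbers are toy values, and "relevant" is left to the caller to define (for example, movies the user rated at or above some threshold, which this notebook does not fix).

```python
import numpy as np

def rmse(actual, predicted):
    """Root mean squared error between true and predicted ratings."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))

def precision_at_k(recommended, relevant, k=5):
    """Fraction of the top-k recommended items that are in the relevant set."""
    top_k = list(recommended)[:k]
    if not top_k:
        return 0.0
    hits = sum(1 for item in top_k if item in relevant)
    return hits / k

# Toy ratings: errors are 0.5, 0, 1, -1.
error = rmse([4.0, 3.0, 5.0, 2.0], [3.5, 3.0, 4.0, 3.0])

# Toy recommendation list (movieIds) and the set the user actually liked.
p_at_5 = precision_at_k([10, 20, 30, 40, 50], relevant={10, 30, 50, 60}, k=5)

print(f"RMSE: {error:.2f}, Precision@5: {p_at_5:.2f}")
```

Lower RMSE is better, while Precision@5 ranges from 0 to 1 with higher being better.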

## 2. Data Understanding

Now that we know **why** we’re doing this, let’s look at **what data** we actually have.

The team used the **MovieLens small dataset** (~100,000 ratings).  
It has 4 main files — I’ll explain each one and why we need them.

In [14]:
import pandas as pd

movies = pd.read_csv('Datasets/Movielens_data/movies.csv')
print("Movies shape:", movies.shape)
movies.head()

Movies shape: (9742, 3)


| | movieId | title | genres |
|---|---|---|---|
| 0 | 1 | Toy Story (1995) | Adventure\|Animation\|Children\|Comedy\|Fantasy |
| 1 | 2 | Jumanji (1995) | Adventure\|Children\|Fantasy |
| 2 | 3 | Grumpier Old Men (1995) | Comedy\|Romance |
| 3 | 4 | Waiting to Exhale (1995) | Comedy\|Drama\|Romance |
| 4 | 5 | Father of the Bride Part II (1995) | Comedy |


**movies.csv**  
- movieId: unique movie identifier  
- title: movie name + year  
- genres: pipe-separated (e.g. Adventure|Animation|Children)  

→ This gives us the **content features** (genres) we need for content-based filtering.
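
As a small illustration of working with these two text columns, the sketch below splits `genres` into a Python list and pulls the release year out of `title`. It uses three rows copied from the output above; the column names `genre_list` and `year` and the regular expression are assumptions made for this example, not part of the team's pipeline.

```python
import pandas as pd

# Three rows copied from the movies.head() output.
movies_sample = pd.DataFrame({
    "movieId": [1, 2, 5],
    "title": ["Toy Story (1995)", "Jumanji (1995)", "Father of the Bride Part II (1995)"],
    "genres": ["Adventure|Animation|Children|Comedy|Fantasy",
               "Adventure|Children|Fantasy",
               "Comedy"],
})

# A single-character separator is treated literally, so "|" needs no escaping here.
movies_sample["genre_list"] = movies_sample["genres"].str.split("|")

# Capture a 4-digit year in parentheses at the end of the title.
# Titles without such a suffix would yield NaN.
year_text = movies_sample["title"].str.extract(r"\((\d{4})\)\s*$", expand=False)
movies_sample["year"] = pd.to_numeric(year_text)

print(movies_sample[["title", "genre_list", "year"]])
```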

In [15]:
ratings = pd.read_csv('Datasets/Movielens_data/ratings.csv')
print("Ratings shape:", ratings.shape)
print("Unique users:", ratings['userId'].nunique())
print("Unique movies rated:", ratings['movieId'].nunique())
ratings.head()

Ratings shape: (100836, 4)
Unique users: 610
Unique movies rated: 9724


| | userId | movieId | rating | timestamp |
|---|---|---|---|---|
| 0 | 1 | 1 | 4.0 | 964982703 |
| 1 | 1 | 3 | 4.0 | 964981247 |
| 2 | 1 | 6 | 4.0 | 964982224 |
| 3 | 1 | 47 | 5.0 | 964983815 |
| 4 | 1 | 50 | 5.0 | 964982931 |


**ratings.csv**  
- userId + movieId + rating (0.5 to 5.0) + timestamp  

→ This is the heart of **collaborative filtering**: who rated what, and how highly.
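
To make that concrete, collaborative filtering usually starts from a user-item matrix with one row per user and one column per movie. The sketch below builds one from a few ratings that appear in this notebook's outputs, then uses the counts printed above to estimate how full the real matrix is. The variable names are hypothetical.

```python
import pandas as pd

# A handful of (userId, movieId, rating) triples taken from the outputs in this notebook.
ratings_sample = pd.DataFrame({
    "userId":  [1, 1, 1, 5, 7],
    "movieId": [1, 3, 6, 1, 1],
    "rating":  [4.0, 4.0, 4.0, 4.0, 4.5],
})

# Rows = users, columns = movies, NaN = the user has not rated that movie.
user_item = ratings_sample.pivot(index="userId", columns="movieId", values="rating")
print(user_item)

# Density of the full matrix, using the counts printed by the cell above.
n_users, n_movies, n_ratings = 610, 9724, 100836
density = n_ratings / (n_users * n_movies)
print(f"Share of user-movie pairs with a rating: {density:.2%}")
```

Fewer than 2% of the possible user-movie pairs carry a rating, which is why sparsity matters when choosing a collaborative model.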

In [16]:
tags = pd.read_csv('Datasets/Movielens_data/tags.csv')
print("Tags shape:", tags.shape)
tags.head(3)  # not displayed: a cell only renders its last expression (links.head(3) below)

links = pd.read_csv('Datasets/Movielens_data/links.csv')
links.head(3)

Tags shape: (3683, 4)


| | movieId | imdbId | tmdbId |
|---|---|---|---|
| 0 | 1 | 114709 | 862.0 |
| 1 | 2 | 113497 | 8844.0 |
| 2 | 3 | 113228 | 15602.0 |


**tags.csv** → user-generated keywords (very sparse)  
**links.csv** → connects to IMDb/TMDb (optional extra metadata)

Most important: ratings + movies + tags → enough for a hybrid system.

### 2.2.2 Why & How We Merge (this is the key step)

Working from four separate tables makes recommendations awkward, because every step would need its own join.  
We need **one table** that has:
- User ratings
- Movie titles & genres
- Tags (where available)

**Team’s merge logic (same as index.ipynb)**:
1. movies + links → df1 (added external IDs)
2. df1 + ratings → df2 (now every rating has movie info)
3. df2 + tags → final_df (added tags, most are missing → we fill with 'no_tag')
4. Saved as `merged_movie_data.csv` so everyone uses the same clean data

This merge lets us do both collaborative (ratings) and content-based (genres + tags) in one place.
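
The four steps above can be sketched with tiny in-memory tables. This is an illustration, not the code from `index.ipynb`: the join keys, the `how="left"` choice, and the `suffixes` are assumptions picked so that the result has the same column names as the merged file, and the tag row and its timestamp are made up.

```python
import pandas as pd

movies = pd.DataFrame({
    "movieId": [1, 2],
    "title": ["Toy Story (1995)", "Jumanji (1995)"],
    "genres": ["Adventure|Animation|Children|Comedy|Fantasy", "Adventure|Children|Fantasy"],
})
links = pd.DataFrame({"movieId": [1, 2], "imdbId": [114709, 113497], "tmdbId": [862.0, 8844.0]})
# Toy ratings: only movie 1 has been rated (timestamps are placeholders).
ratings = pd.DataFrame({"userId": [1, 5], "movieId": [1, 1], "rating": [4.0, 4.0],
                        "timestamp": [964982703, 847434000]})
# Made-up tag row for illustration.
tags = pd.DataFrame({"userId": [5], "movieId": [1], "tag": ["pixar"], "timestamp": [1139045000]})

# 1. movies + links: add the external IDs.
df1 = movies.merge(links, on="movieId", how="left")
# 2. + ratings: one row per rating; unrated movies keep a single row with NaN rating.
df2 = df1.merge(ratings, on="movieId", how="left")
# 3. + tags: both tables have a `timestamp` column, so suffixes keep them apart.
final_df = df2.merge(tags, on=["movieId", "userId"], how="left", suffixes=("_rating", "_tag"))
final_df["tag"] = final_df["tag"].fillna("no_tag")
# 4. Saving would be: final_df.to_csv("Datasets/merged_movie_data.csv", index=False)

print(final_df[["movieId", "userId", "rating", "tag"]])
```

With left joins, a movie nobody rated still appears once with a missing `userId`, which would explain why `userId` is stored as a float in the merged file.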

In [17]:
# Load the final merged file the team created
merged_df = pd.read_csv('Datasets/merged_movie_data.csv')

print("Merged data shape:", merged_df.shape)
print("\nColumns:", merged_df.columns.tolist())
merged_df.head(5)

Merged data shape: (102695, 10)

Columns: ['movieId', 'title', 'genres', 'imdbId', 'tmdbId', 'userId', 'rating', 'timestamp_rating', 'tag', 'timestamp_tag']


| | movieId | title | genres | imdbId | tmdbId | userId | rating | timestamp_rating | tag | timestamp_tag |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Toy Story (1995) | Adventure\|Animation\|Children\|Comedy\|Fantasy | 114709 | 862.0 | 1.0 | 4.0 | 964982700.0 | | |
| 1 | 1 | Toy Story (1995) | Adventure\|Animation\|Children\|Comedy\|Fantasy | 114709 | 862.0 | 5.0 | 4.0 | 847435000.0 | | |
| 2 | 1 | Toy Story (1995) | Adventure\|Animation\|Children\|Comedy\|Fantasy | 114709 | 862.0 | 7.0 | 4.5 | 1106636000.0 | | |
| 3 | 1 | Toy Story (1995) | Adventure\|Animation\|Children\|Comedy\|Fantasy | 114709 | 862.0 | 15.0 | 2.5 | 1510578000.0 | | |
| 4 | 1 | Toy Story (1995) | Adventure\|Animation\|Children\|Comedy\|Fantasy | 114709 | 862.0 | 17.0 | 4.5 | 1305696000.0 | | |


### Quick Summary of the Merged Data

We now have one big table (102,695 rows, 10 columns) that combines:
- User ratings (userId, movieId, rating)
- Movie details (title, genres)
- Tags (mostly missing → filled with 'no_tag')

This single file is what the team saved so everyone can start modeling from the same place — no need to re-merge every time.
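
Before modeling, it can help to derive two narrower views from the merged table: one for ratings and one for movie content. The sketch below is an assumption-laden illustration on a toy frame shaped like `merged_df` (the tag values are made up). It applies the `'no_tag'` fill explicitly, since the sample rows shown earlier have empty `tag` cells, and it removes repeated rating rows, which can appear when a user attached several tags to the same movie.

```python
import numpy as np
import pandas as pd

# Toy frame with the same kind of rows as merged_df (tags are made-up values).
merged_sample = pd.DataFrame({
    "movieId": [1, 1, 1, 2],
    "title": ["Toy Story (1995)"] * 3 + ["Jumanji (1995)"],
    "genres": ["Adventure|Animation|Children|Comedy|Fantasy"] * 3 + ["Adventure|Children|Fantasy"],
    "userId": [1.0, 5.0, 5.0, np.nan],   # NaN: a movie with no ratings
    "rating": [4.0, 4.0, 4.0, np.nan],
    "tag": [np.nan, "pixar", "fun", np.nan],
})

# Share of rows without a tag.
tag_missing_share = merged_sample["tag"].isna().mean()

# Ratings view for collaborative filtering: one row per (user, movie).
ratings_view = (
    merged_sample.dropna(subset=["userId", "rating"])
    .drop_duplicates(subset=["userId", "movieId"])[["userId", "movieId", "rating"]]
)

# Movie view for content-based filtering: one row per movie with its tags joined.
movie_view = (
    merged_sample.assign(tag=merged_sample["tag"].fillna("no_tag"))
    .groupby(["movieId", "title", "genres"], as_index=False)["tag"]
    .agg(lambda s: " ".join(sorted(set(s) - {"no_tag"})) or "no_tag")
)

print(f"Rows without a tag: {tag_missing_share:.0%}")
print(ratings_view)
print(movie_view[["movieId", "tag"]])
```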

Next steps in our project:
- Split genres into dummy columns (for content-based filtering)
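
A minimal sketch of that next step, assuming pandas' `Series.str.get_dummies` is used for the split (the team may choose a different encoder). It runs on three rows copied from the `movies.head()` output; the names `movie_sample`, `genre_dummies`, and `movie_features` are hypothetical.

```python
import pandas as pd

# Three rows copied from the movies.head() output.
movie_sample = pd.DataFrame({
    "movieId": [1, 3, 5],
    "title": ["Toy Story (1995)", "Grumpier Old Men (1995)", "Father of the Bride Part II (1995)"],
    "genres": ["Adventure|Animation|Children|Comedy|Fantasy", "Comedy|Romance", "Comedy"],
})

# One 0/1 column per genre; columns come back in alphabetical order.
genre_dummies = movie_sample["genres"].str.get_dummies(sep="|")

# Keep the identifiers next to the indicator columns.
movie_features = pd.concat([movie_sample[["movieId", "title"]], genre_dummies], axis=1)
print(movie_features)
```

On the merged table each movie appears once per rating, so it would make sense to call `drop_duplicates(subset="movieId")` first and encode each movie only once.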
