# Movie Recommender System

This notebook builds a movie recommendation system using the MovieLens dataset and collaborative filtering.


# Business Understanding

The goal of this project is to build a **Movie Recommendation System** using the MovieLens dataset.  

### Problem Statement
Movie streaming platforms need effective ways to recommend movies that users are likely to enjoy.  
This project applies **collaborative filtering** to recommend movies based on past user ratings.

### Objectives
- Provide personalized movie recommendations.  
- Explore user-item interactions from the MovieLens dataset.  
- Build a baseline recommender using item-based collaborative filtering.  
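
To make the item-based approach concrete, here is a minimal sketch on a made-up ratings table in the same long format as `ratings.csv`. The toy values, the `similar_items` helper, and the choice to fill unrated cells with 0 before computing cosine similarity are all illustrative assumptions, not necessarily the exact method used later in this notebook.

```python
import numpy as np
import pandas as pd

# Toy ratings (made-up values) in the same long format as ratings.csv
toy_ratings = pd.DataFrame({
    "userId":  [1, 1, 1, 2, 2, 3, 3, 3],
    "movieId": [10, 20, 30, 10, 20, 10, 20, 30],
    "rating":  [5.0, 4.0, 1.0, 4.0, 5.0, 1.0, 2.0, 5.0],
})

# User-item matrix: rows are users, columns are movies.
# Assumption: unrated cells are filled with 0, a common simplification.
user_item = toy_ratings.pivot_table(
    index="userId", columns="movieId", values="rating"
).fillna(0)

# Cosine similarity between item (column) vectors
item_vectors = user_item.to_numpy().T
norms = np.linalg.norm(item_vectors, axis=1, keepdims=True)
item_sim = pd.DataFrame(
    (item_vectors @ item_vectors.T) / (norms @ norms.T),
    index=user_item.columns,
    columns=user_item.columns,
)

def similar_items(movie_id, top_n=2):
    """Return the top_n movies most similar to movie_id, excluding itself."""
    return item_sim[movie_id].drop(movie_id).sort_values(ascending=False).head(top_n)

print(similar_items(10))
```

In this toy data, movie 20 comes out as the closest neighbour of movie 10 because the same users rate both highly.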


# Data Understanding

The MovieLens dataset contains multiple CSV files:
- `movies.csv`: movieId, title, genres  
- `ratings.csv`: userId, movieId, rating, timestamp  
- `tags.csv`: userId, movieId, tag, timestamp  
- `links.csv`: movieId, IMDb ID, TMDb ID  

We will primarily use `ratings.csv` and `movies.csv` to build our recommender.


In [5]:
import pandas as pd

# Load data
movies = pd.read_csv("../data/movies.csv")
ratings = pd.read_csv("../data/ratings.csv")
tags = pd.read_csv("../data/tags.csv")
links = pd.read_csv("../data/links.csv")

# Inspect shapes
print("Movies:", movies.shape)
print("Ratings:", ratings.shape)
print("Tags:", tags.shape)
print("Links:", links.shape)

# Preview datasets
display(movies.head())
display(ratings.head())


Movies: (9742, 3)
Ratings: (100836, 4)
Tags: (3683, 4)
Links: (9742, 3)


|   | movieId | title | genres |
|---|---|---|---|
| 0 | 1 | Toy Story (1995) | Adventure\|Animation\|Children\|Comedy\|Fantasy |
| 1 | 2 | Jumanji (1995) | Adventure\|Children\|Fantasy |
| 2 | 3 | Grumpier Old Men (1995) | Comedy\|Romance |
| 3 | 4 | Waiting to Exhale (1995) | Comedy\|Drama\|Romance |
| 4 | 5 | Father of the Bride Part II (1995) | Comedy |


|   | userId | movieId | rating | timestamp |
|---|---|---|---|---|
| 0 | 1 | 1 | 4.0 | 964982703 |
| 1 | 1 | 3 | 4.0 | 964981247 |
| 2 | 1 | 6 | 4.0 | 964982224 |
| 3 | 1 | 47 | 5.0 | 964983815 |
| 4 | 1 | 50 | 5.0 | 964982931 |


# Data Preparation

In this section, we clean and transform the raw MovieLens datasets to make them ready for analysis and modeling.  
Key tasks include:  
1. Handling missing values.  
2. Removing duplicates.  
3. Converting data types.  
4. Splitting and processing the `genres` column.  
5. Merging datasets (`movies.csv`, `ratings.csv`, `tags.csv`, `links.csv`) into a unified structure.  

This ensures the data is consistent, reliable, and suitable for building our recommender system.
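
As a rough illustration of the merging task, the sketch below joins ratings to movie titles on `movieId`. The small frames are stand-ins for the real files (the third rating row is made up), and `validate="many_to_one"` is an optional safeguard added here that raises an error if `movieId` is not unique on the movies side.

```python
import pandas as pd

# Stand-in frames; the real notebook loads these from movies.csv and ratings.csv
movies_toy = pd.DataFrame({
    "movieId": [1, 2, 3],
    "title": ["Toy Story (1995)", "Jumanji (1995)", "Grumpier Old Men (1995)"],
})
ratings_toy = pd.DataFrame({
    "userId": [1, 1, 2],          # the userId 2 row is hypothetical
    "movieId": [1, 3, 1],
    "rating": [4.0, 4.0, 5.0],
})

# Left join keeps every rating; many ratings map to one movie
merged = ratings_toy.merge(
    movies_toy, on="movieId", how="left", validate="many_to_one"
)

# Any rating whose movieId has no match would show up with a missing title
unmatched = merged["title"].isna().sum()
print(merged)
print("Ratings without a matching movie:", unmatched)
```

A left join is used so that no rating is silently dropped; an inner join would hide ratings that point at a missing `movieId`.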


In [6]:
# Check missing values
print("Missing values per dataset:")
print(movies.isnull().sum())
print(ratings.isnull().sum())


# Check duplicates in each dataset
print("Movies duplicates:", movies.duplicated().sum())
print("Ratings duplicates:", ratings.duplicated().sum())


Missing values per dataset:
movieId    0
title      0
genres     0
dtype: int64
userId       0
movieId      0
rating       0
timestamp    0
dtype: int64
Movies duplicates: 0
Ratings duplicates: 0
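
`duplicated()` with no arguments only flags rows that are identical in every column. For ratings, it can also be worth checking whether a user rated the same movie more than once. The sketch below shows that check on made-up rows; keeping the most recent rating is one possible policy, assumed here for illustration.

```python
import pandas as pd

# Made-up rows: user 1 rated movie 1 twice, at different times
ratings_toy = pd.DataFrame({
    "userId": [1, 1, 1],
    "movieId": [1, 1, 3],
    "rating": [4.0, 3.5, 4.0],
    "timestamp": [964982703, 964982999, 964981247],
})

# Full-row check misses the repeat because rating and timestamp differ
full_row_dupes = ratings_toy.duplicated().sum()

# Key-based check flags repeated (userId, movieId) pairs
key_dupes = ratings_toy.duplicated(subset=["userId", "movieId"]).sum()

# One possible policy: keep only the most recent rating per pair
deduped = (
    ratings_toy.sort_values("timestamp")
    .drop_duplicates(subset=["userId", "movieId"], keep="last")
)

print("Full-row duplicates:", full_row_dupes)
print("Repeated user-movie pairs:", key_dupes)
print(deduped)
```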


## Step 3: Check & Convert Data Types (Movies & Ratings)

For our recommender system, we will mainly use the **movies** and **ratings** datasets.  
It is important to ensure that the columns in these two datasets have the correct data types.  

- **Movies dataset**:  
  - `movieId` → should be integer (unique identifier).  
  - `title` → string.  
  - `genres` → string (pipe-separated).  

- **Ratings dataset**:  
  - `userId` → should be integer (unique identifier for each user).  
  - `movieId` → integer (foreign key matching movies).  
  - `rating` → float (numerical rating).  
  - `timestamp` → needs conversion from UNIX time to `datetime` for interpretability.  

Converting timestamps matters because it lets us analyze rating activity over time (for example, the number of ratings or the average rating per year) instead of working with raw second counts.  
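
As a small example of what the conversion enables, the sketch below converts UNIX seconds to `datetime` and summarizes ratings by year. The first two rows are taken from the `ratings.csv` preview; the third is made up so that more than one year appears.

```python
import pandas as pd

toy = pd.DataFrame({
    "rating": [4.0, 4.0, 5.0],
    "timestamp": [964982703, 964981247, 1500000000],  # last value is hypothetical
})

# UNIX seconds -> datetime64[ns]
toy["timestamp"] = pd.to_datetime(toy["timestamp"], unit="s")

# Count and average rating per calendar year
per_year = toy.groupby(toy["timestamp"].dt.year)["rating"].agg(["count", "mean"])
print(per_year)
```

Once the column is a `datetime`, the `.dt` accessor gives direct access to year, month, weekday, and similar fields for grouping.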


In [7]:
# 1. Check data types for movies and ratings
print("Movies dtypes:\n", movies.dtypes, "\n")
print("Ratings dtypes:\n", ratings.dtypes, "\n")

# 2. Convert ratings timestamp from UNIX seconds to datetime
ratings['timestamp'] = pd.to_datetime(ratings['timestamp'], unit='s')

# 3. Confirm conversion
print("Ratings dtypes after conversion:\n", ratings.dtypes, "\n")


Movies dtypes:
 movieId     int64
title      object
genres     object
dtype: object 

Ratings dtypes:
 userId         int64
movieId        int64
rating       float64
timestamp      int64
dtype: object 

Ratings dtypes after conversion:
 userId                int64
movieId               int64
rating              float64
timestamp    datetime64[ns]
dtype: object 



## Step 4: Splitting and Processing the `genres` Column

The `genres` column in the **movies dataset** contains multiple genres separated by the `|` symbol.  
To make this column easier to work with, we will:  

1. Split genres into individual lists.  
2. Create a one-hot encoded structure where each genre becomes its own column with binary values (0/1).  

This transformation is useful for:  
- Filtering movies by genre.  
- Building content-based recommendation models in the future.  
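
The sketch below shows how one-hot genre columns make filtering straightforward. It uses pandas' `Series.str.get_dummies` as a pandas-only alternative to `MultiLabelBinarizer`, applied to the raw pipe-separated strings; the three rows come from the `movies.csv` preview above.

```python
import pandas as pd

movies_toy = pd.DataFrame({
    "movieId": [1, 2, 3],
    "title": ["Toy Story (1995)", "Jumanji (1995)", "Grumpier Old Men (1995)"],
    "genres": [
        "Adventure|Animation|Children|Comedy|Fantasy",
        "Adventure|Children|Fantasy",
        "Comedy|Romance",
    ],
})

# One 0/1 column per genre, split on the pipe separator
genre_flags = movies_toy["genres"].str.get_dummies(sep="|")
movies_encoded = movies_toy.join(genre_flags)

# Filter by a single genre
comedies = movies_encoded.loc[movies_encoded["Comedy"] == 1, "title"].tolist()

# Combine conditions: Adventure but not Comedy
adventure_not_comedy = movies_encoded.loc[
    (movies_encoded["Adventure"] == 1) & (movies_encoded["Comedy"] == 0), "title"
].tolist()

print(comedies)
print(adventure_not_comedy)
```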


In [8]:
from sklearn.preprocessing import MultiLabelBinarizer

# Step 4: Split the pipe-separated genre strings into lists.
# Run this cell once: after the split, `genres` holds lists rather than
# strings, so re-running the cell fails.
movies['genres'] = movies['genres'].str.split('|')

# One-hot encode: one 0/1 column per genre, aligned to the movies index
mlb = MultiLabelBinarizer()
genres_encoded = pd.DataFrame(
    mlb.fit_transform(movies['genres']),
    columns=mlb.classes_,
    index=movies.index
)

# Join the genre columns back onto the movies dataframe
movies = movies.join(genres_encoded)

# Preview processed movies dataset
movies.head(5)


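|   | movieId | title | genres | (no genres listed) | Action | Adventure | Animation | Children | Comedy | Crime | ... | Film-Noir | Horror | IMAX | Musical | Mystery | Romance | Sci-Fi | Thriller | War | Western |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Toy Story (1995) | [Adventure, Animation, Children, Comedy, Fantasy] | 0 | 0 | 1 | 1 | 1 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 2 | Jumanji (1995) | [Adventure, Children, Fantasy] | 0 | 0 | 1 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 3 | Grumpier Old Men (1995) | [Comedy, Romance] | 0 | 0 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 3 | 4 | Waiting to Exhale (1995) | [Comedy, Drama, Romance] | 0 | 0 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 4 | 5 | Father of the Bride Part II (1995) | [Comedy] | 0 | 0 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |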
