# 🎬 Movie Recommendation System with Hybrid Collaborative Filtering

This notebook demonstrates how to build a **hybrid recommendation system** that combines the power of **collaborative filtering** with rich **content-based features** such as:

- 🏷️ **Tags**
- 🎬 **Genres**
- 🎥 **Director & Cast**
- 📝 **User Reviews**
- 🕓 **Temporal Features** (timestamps encoded with Cartesian coordinates)

We use **PyTorch** to implement a neural network that learns to predict user ratings by combining learned user and movie embeddings with textual movie metadata embedded using **SentenceTransformers**.

Key components of the notebook include:
- 🧹 Data preprocessing and enrichment
- 🧠 Embedding textual features with transformer-based models
- 🔁 Train/validation/test split
- 🧱 Custom PyTorch dataset and model
- 📈 Training and evaluation
- 🎯 Movie recommendation generation

By the end, you'll have a flexible and scalable framework for delivering personalized movie recommendations using both **collaborative signals** and **semantic content**.

Data is from: https://grouplens.org/datasets/movielens/

In [1]:
%reset -f

## Libraries

In [2]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder

import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
import torch.optim as optim


## Import data

### Metadata

In [3]:
metadata = pd.read_json("data/raw/metadata_updated.json", lines = True)

In [4]:
metadata.head(3)

Unnamed: 0,title,directedBy,starring,avgRating,imdbId,item_id
0,Toy Story (1995),John Lasseter,"Tim Allen, Tom Hanks, Don Rickles, Jim Varney,...",3.89146,114709,1
1,Jumanji (1995),Joe Johnston,"Jonathan Hyde, Bradley Pierce, Robin Williams,...",3.26605,113497,2
2,Grumpier Old Men (1995),Howard Deutch,"Jack Lemmon, Walter Matthau, Ann-Margret , Sop...",3.17146,113228,3


In [5]:
metadata.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 84661 entries, 0 to 84660
Data columns (total 6 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   title       84661 non-null  object 
 1   directedBy  84661 non-null  object 
 2   starring    84661 non-null  object 
 3   avgRating   84661 non-null  float64
 4   imdbId      84661 non-null  int64  
 5   item_id     84661 non-null  int64  
dtypes: float64(1), int64(2), object(3)
memory usage: 3.9+ MB


In [6]:
#metadata.to_csv("data/metadata.csv")

### Reviews
The reviews file is huge (3.9 GB). Since this project is for educational purposes, we only take a sample of it.

In [7]:
reviews = pd.read_json("data/raw/reviews.json", lines = True)

We group by `item_id` so that the sampling is **stratified** by movie: each movie contributes about 10% of its reviews. Movies with only a handful of reviews may end up with none in the sample.

In [8]:
reviews_sample = reviews.groupby("item_id", group_keys = False).sample(frac = 0.1, random_state = 42)

In [9]:
reviews_sample.head(3)

Unnamed: 0,item_id,txt
1083889,1,The pioneer of animation movies.; This was the...
2044248,1,The masterpiece that started it all; I remembe...
1248469,1,A magnificent milestone in animation!; Toy Sto...


In [10]:
reviews_sample.info()

<class 'pandas.core.frame.DataFrame'>
Index: 260048 entries, 1083889 to 412345
Data columns (total 2 columns):
 #   Column   Non-Null Count   Dtype 
---  ------   --------------   ----- 
 0   item_id  260048 non-null  int64 
 1   txt      260048 non-null  object
dtypes: int64(1), object(1)
memory usage: 6.0+ MB


In [11]:
reviews_sample.dropna(inplace = True)
reviews_sample.reset_index(drop = True, inplace = True)

In [12]:
reviews_sample.to_csv("data/processed/reviews_sample.csv")

### Ratings

The ratings file is also huge (1.9 GB). Since this project is for educational purposes, we only take a sample of it.

In [13]:
ratings = pd.read_csv("data/raw/rating.csv", parse_dates=["timestamp"])

In [14]:
ratings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20000263 entries, 0 to 20000262
Data columns (total 4 columns):
 #   Column     Dtype         
---  ------     -----         
 0   userId     int64         
 1   movieId    int64         
 2   rating     float64       
 3   timestamp  datetime64[ns]
dtypes: datetime64[ns](1), float64(1), int64(2)
memory usage: 610.4 MB


In [15]:
ratings_sample = ratings.groupby("movieId", group_keys = False).sample(frac = 0.1, random_state = 42)

In [16]:
# This is our main dataframe (rating is the prediction target)
ratings_sample.head(3)

Unnamed: 0,userId,movieId,rating,timestamp
4834169,33218,1,5.0,1997-04-03 22:17:32
1716733,11586,1,3.0,2009-01-09 21:44:52
7170935,49444,1,4.0,2003-02-03 22:37:10


In [17]:
ratings_sample.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1998483 entries, 4834169 to 8504617
Data columns (total 4 columns):
 #   Column     Dtype         
---  ------     -----         
 0   userId     int64         
 1   movieId    int64         
 2   rating     float64       
 3   timestamp  datetime64[ns]
dtypes: datetime64[ns](1), float64(1), int64(2)
memory usage: 76.2 MB


In [18]:
# Clean the sample (the frame used downstream); the original index is kept
ratings_sample = ratings_sample.dropna()

In [19]:
ratings_sample.to_csv("data/processed/ratings_sample.csv")

### Movies

In [20]:
movies = pd.read_csv("data/raw/movie.csv")

In [21]:
movies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 27278 entries, 0 to 27277
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   movieId  27278 non-null  int64 
 1   title    27278 non-null  object
 2   genres   27278 non-null  object
dtypes: int64(1), object(2)
memory usage: 639.5+ KB


In [22]:
movies.head(3)

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance


In [23]:
movies = movies.sample(frac = 0.05, random_state = 42)

In [24]:
movies.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1364 entries, 12922 to 15555
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   movieId  1364 non-null   int64 
 1   title    1364 non-null   object
 2   genres   1364 non-null   object
dtypes: int64(1), object(2)
memory usage: 42.6+ KB


### Tags

These are the tags given (suggested) by users.

In [25]:
tags = pd.read_csv("data/raw/tag.csv")

In [26]:
tags.head(3)

Unnamed: 0,userId,movieId,tag,timestamp
0,18,4141,Mark Waters,2009-04-24 18:19:40
1,65,208,dark hero,2013-05-10 01:41:18
2,65,353,dark hero,2013-05-10 01:41:19


In [27]:
tags.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 465564 entries, 0 to 465563
Data columns (total 4 columns):
 #   Column     Non-Null Count   Dtype 
---  ------     --------------   ----- 
 0   userId     465564 non-null  int64 
 1   movieId    465564 non-null  int64 
 2   tag        465548 non-null  object
 3   timestamp  465564 non-null  object
dtypes: int64(2), object(2)
memory usage: 14.2+ MB


There are missing values in tags provided by users. So we drop them.

In [28]:
tags.dropna(inplace = True)
tags.reset_index(drop = True, inplace = True)

## 📦 Data Processing

To build a hybrid recommendation system, we combine collaborative and content-based features from multiple data sources. Here's what we extract and prepare from each dataset:

- **Tags**  
  → Extract user-generated ***tags*** and embed them for semantic representation.

- **Movies**  
  → Extract ***genres*** and embed them to capture thematic information.

- **Metadata**  
  → Use:
  - ***directedBy*** and ***starring*** fields for embedding key contributors
  - ***item_id*** to join with other sources

- **Reviews** (`reviews_sample`)  
  → Extract ***item_id*** for joining and ***txt*** reviews to embed textual sentiment and themes.

- **Ratings**  
  → Use:
  - ***userId*** and ***movieId*** for collaborative filtering
  - ***timestamp*** converted to **Cartesian features** (sin/cos) to capture temporal patterns

Additionally, we **aggregate multi-valued fields** (e.g. tags, genres, reviews) into lists per movie to ensure a clean, one-row-per-movie format before embedding.

### 💡 Why Aggregate Tags and Genres as Lists per Movie?

When enriching movies with metadata like `genres`, `tags`, `reviews`, or `people`, we often face many-to-one relationships. A single movie might have multiple tags or genres, and representing each as a separate row would lead to a **combinatorial explosion** in the dataset.

Consider:
- A movie with **5 tags** and **3 genres**
- If we created a row for each unique tag-genre combination, we'd have **15 rows per movie** (5 × 3)
- For millions of movies, this quickly becomes inefficient and introduces **redundant data**

✅ **Instead, we group metadata into lists:**
- `["Action", "Sci-Fi", "Thriller"]` for genres
- `["space", "alien", "explosions"]` for tags

This has several benefits:
- 🧠 **One row per movie** → easier to align with rating-based collaborative filtering
- 💾 **Memory efficient** → avoids row duplication
- 🧮 **Simple embedding** → encode each list as a single embedding (e.g., average of tag vectors)
- ⚡ **Faster training** → since we don’t repeat scalar or embedding features

This structure is particularly well-suited for **hybrid recommender systems**, where we combine collaborative embeddings (user/movie IDs) with content-based embeddings (tags, genres, etc.).
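
As a rough illustration of the row explosion described above, the following sketch compares a row-per-combination merge with list aggregation. The toy frames and column names (`tags`, `genres`, `genre`) are made up for this example and are not the notebook's data.

```python
import pandas as pd

# Toy data (hypothetical): one movie with 3 tags and 2 genres
tags = pd.DataFrame({"movieId": [1, 1, 1], "tag": ["space", "alien", "explosions"]})
genres = pd.DataFrame({"movieId": [1, 1], "genre": ["action", "sci-fi"]})

# Row-per-combination: merging the two long tables multiplies the rows
exploded = tags.merge(genres, on="movieId")
print(len(exploded))  # 3 tags x 2 genres = 6 rows for a single movie

# List-per-movie: aggregate each table first, then merge
tags_agg = tags.groupby("movieId")["tag"].agg(list).reset_index()
genres_agg = genres.groupby("movieId")["genre"].agg(list).reset_index()
compact = tags_agg.merge(genres_agg, on="movieId")
print(len(compact))  # 1 row, with list-valued columns
```

The aggregated frame keeps one row per movie, so joining it with the ratings table adds columns without multiplying rating rows.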


### 🕒 Why We Convert Timestamps to Sin/Cos Features (Cartesian Time Embedding)?

Timestamps are naturally **cyclical** — patterns repeat daily, weekly, or annually. For example:
- People tend to watch movies more on weekends or holidays
- Viewing behavior might shift during summer breaks or winter seasons

However, raw timestamps or even numeric encodings like `"hour = 23"` or `"month = 12"` don't capture this **cyclicity** well in models — especially in neural networks.

🔁 **The Problem:**
The model sees `"hour = 0"` and `"hour = 23"` as far apart, even though they’re right next to each other on the clock.

✅ **The Solution: Project time into a circle using sin/cos**
By converting a time `t` into its **sinusoidal components**, `sin(2πt / T)` and `cos(2πt / T)` for a period `T` (one day or one year in this notebook), we preserve its cyclical nature in a **smooth, continuous** way.

This maps time onto a **unit circle**, where:
- Midnight (0 seconds) → sin = 0, cos = 1
- Noon → sin = 0, cos = -1
- Midnight again → back to cos = 1

✨ **Benefits:**
- 🧭 Captures **temporal cycles** (day/night, seasons, holidays)
- 🧠 Makes time features more interpretable for neural nets
- 📉 Improves generalization on time-related patterns

This technique is widely used in time series modeling, recommender systems, and even transformer-based models when representing positional data.
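
To make the "hour 0 vs. hour 23" problem concrete, here is a minimal sketch of the encoding on hour-of-day values. The helper name `encode_cyclical` is an assumption made for this example; the notebook itself applies the same formula directly to Unix timestamps in seconds.

```python
import numpy as np

def encode_cyclical(value, period):
    """Map a value with the given period onto the unit circle as (sin, cos)."""
    angle = 2 * np.pi * value / period
    return np.sin(angle), np.cos(angle)

hours = np.array([0.0, 12.0, 23.0])
sin_h, cos_h = encode_cyclical(hours, period=24)
points = np.column_stack([sin_h, cos_h])

# Raw encoding: hour 0 and hour 23 look far apart
raw_gap = abs(hours[2] - hours[0])

# Circle encoding: they are neighbours, while noon sits on the opposite side
dist_0_23 = np.linalg.norm(points[0] - points[2])
dist_0_12 = np.linalg.norm(points[0] - points[1])

print(raw_gap, round(dist_0_23, 3), round(dist_0_12, 3))
```

On the circle, the distance between 23:00 and 00:00 is a small fraction of the distance between midnight and noon, which is the behaviour the raw integer encoding fails to express.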

### #️⃣ TAGs

We lowercase and trim the user-provided tags, then aggregate them per movie so they can be merged with the movies table.

In [29]:
tags['tag'] = tags['tag'].apply(lambda x : x.strip().lower())

In [30]:
tags.head(3)

Unnamed: 0,userId,movieId,tag,timestamp
0,18,4141,mark waters,2009-04-24 18:19:40
1,65,208,dark hero,2013-05-10 01:41:18
2,65,353,dark hero,2013-05-10 01:41:19


In [31]:
# Utility function to remove duplicates while preserving order
def unique_preserve_order(tag_list):
    seen = set()
    return [tag for tag in tag_list if not (tag in seen or seen.add(tag))]

In [32]:
# Aggregate tag by movieId → get list of tags (no duplicates)
df_tags = (
    tags.sort_values(by=['movieId'])  # Optional: sort to keep consistency
    .groupby('movieId')['tag']
    .apply(lambda tags: unique_preserve_order(list(tags)))
    .reset_index()
)

In [33]:
df_tags.head(3)

Unnamed: 0,movieId,tag
0,1,"[pixar, children, clever, disney, family, funn..."
1,2,"[robin williams, scary, board game, saturn awa..."
2,3,"[walter matthau, comedy, howard deutch, jack l..."


In [34]:
df_tags.to_csv("data/processed/df_tags.csv")

### 🎬 Movies

In [35]:
movies['genres'] = movies['genres'].apply(lambda x : x.lower())

In [36]:
def clean_genre_list(genres_str):
    if pd.isna(genres_str):
        return []
    genre_list = genres_str.split('|')
    genre_list = [g.strip() for g in genre_list if g.strip()]
    genre_list = sorted(set(genre_list))  # remove duplicates, sort alphabetically
    return genre_list

In [37]:
movies['genres'] = movies['genres'].apply(clean_genre_list)

In [38]:
movies.head(3)

Unnamed: 0,movieId,title,genres
12922,61116,Black Caesar (1973),"[crime, drama]"
14085,70697,G-Force (2009),"[action, adventure, children, fantasy]"
23517,111931,Raze (2013),"[action, horror]"


In [39]:
movies.to_csv("data/processed/df_movies.csv")

### 🧾 Metadata

In [40]:
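metadata['directedBy'] = metadata['directedBy'].apply(lambda x : x.strip().lower())
# Note: this wraps the whole comma-separated cast string in a one-element list;
# it does not split the cast into individual names.
metadata['starring'] = metadata['starring'].apply(lambda x : [x.strip().lower()])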

In [41]:
metadata.head(3)

Unnamed: 0,title,directedBy,starring,avgRating,imdbId,item_id
0,Toy Story (1995),john lasseter,"[tim allen, tom hanks, don rickles, jim varney...",3.89146,114709,1
1,Jumanji (1995),joe johnston,"[jonathan hyde, bradley pierce, robin williams...",3.26605,113497,2
2,Grumpier Old Men (1995),howard deutch,"[jack lemmon, walter matthau, ann-margret , so...",3.17146,113228,3


In [42]:
metadata.to_csv("data/processed/df_metadata.csv")

### 📝 reviews_sample

In [43]:
reviews_sample.head()

Unnamed: 0,item_id,txt
0,1,The pioneer of animation movies.; This was the...
1,1,The masterpiece that started it all; I remembe...
2,1,A magnificent milestone in animation!; Toy Sto...
3,1,Good and groundbreaking; Good and groundbreaki...
4,1,Great Movie; All kids should watch this movie....


In [44]:
reviews_sample['review'] = reviews_sample['txt'].apply(lambda x : x.strip().lower().replace(";",""))

In [45]:
reviews_sample.head(3)

Unnamed: 0,item_id,txt,review
0,1,The pioneer of animation movies.; This was the...,the pioneer of animation movies. this was the ...
1,1,The masterpiece that started it all; I remembe...,the masterpiece that started it all i remember...
2,1,A magnificent milestone in animation!; Toy Sto...,a magnificent milestone in animation! toy stor...


In [46]:
# Utility function to remove duplicates while preserving order
def unique_preserve_order(review_list):
    seen = set()
    return [review for review in review_list if not (review in seen or seen.add(review))]

In [47]:
# Aggregate review by item_id → get list of reviews (no duplicates)
df_reviews = (
    reviews_sample.sort_values(by=['item_id'])  # Optional: sort to keep consistency
    .groupby('item_id')['review']
    .apply(lambda reviews: unique_preserve_order(list(reviews)))
    .reset_index()
)

In [48]:
df_reviews.head(3)

Unnamed: 0,item_id,review
0,1,[the pioneer of animation movies. this was the...
1,2,"[pretty good, except... jumanji is an enjoyabl..."
2,3,[as good as the original some people see this ...


In [49]:
df_reviews.to_csv("data/processed/df_reviews_sample.csv")

### ⭐ Ratings

In [50]:
ratings_sample.head(3)

Unnamed: 0,userId,movieId,rating,timestamp
4834169,33218,1,5.0,1997-04-03 22:17:32
1716733,11586,1,3.0,2009-01-09 21:44:52
7170935,49444,1,4.0,2003-02-03 22:37:10


In [51]:
timestamp_s = ratings_sample['timestamp'].map(pd.Timestamp.timestamp)

In [52]:
day = 24 * 60 * 60
year = 365.2425 * day  # Accounting for leap years

ratings_sample.loc[:, 'Day sin'] = np.sin(timestamp_s * (2 * np.pi / day))
ratings_sample.loc[:, 'Day cos'] = np.cos(timestamp_s * (2 * np.pi / day))
ratings_sample.loc[:, 'Year sin'] = np.sin(timestamp_s * (2 * np.pi / year))
ratings_sample.loc[:, 'Year cos'] = np.cos(timestamp_s * (2 * np.pi / year))

In [53]:
ratings_sample.head(3)

Unnamed: 0,userId,movieId,rating,timestamp,Day sin,Day cos,Year sin,Year cos
4834169,33218,1,5.0,1997-04-03 22:17:32,-0.432348,0.901707,0.999366,-0.035615
1716733,11586,1,3.0,2009-01-09 21:44:52,-0.556054,0.831146,0.161828,0.986819
7170935,49444,1,4.0,2003-02-03 22:37:10,-0.353611,0.935393,0.55125,0.83434


In [54]:
ratings_sample.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1998483 entries, 4834169 to 8504617
Data columns (total 8 columns):
 #   Column     Dtype         
---  ------     -----         
 0   userId     int64         
 1   movieId    int64         
 2   rating     float64       
 3   timestamp  datetime64[ns]
 4   Day sin    float64       
 5   Day cos    float64       
 6   Year sin   float64       
 7   Year cos   float64       
dtypes: datetime64[ns](1), float64(5), int64(2)
memory usage: 137.2 MB


In [55]:
ratings_sample.to_csv("data/processed/df_ratings_sample.csv")

### Merging the data

In [56]:
# Merge tags with movies → enrich movies with tag data
df_movies_tags = pd.merge(movies, df_tags, on="movieId")

# Merge with metadata → enrich with director, starring, item_id
df_movies_tags_meta = pd.merge(df_movies_tags, metadata, left_on="movieId", right_on="item_id")

# Merge with reviews → attach textual reviews
df_movies_tags_meta_reviews = pd.merge(df_movies_tags_meta, df_reviews, on="item_id")

# Final merge with ratings → include user interaction data
df_final = pd.merge(df_movies_tags_meta_reviews, ratings_sample, on="movieId")

In [57]:
df_final = df_final[['movieId', 'userId', 'genres', 'tag',  'directedBy', 'starring','review', 'rating', 'Day sin', 'Day cos', 'Year sin', 'Year cos']]

In [59]:
df_final.head(1)

Unnamed: 0,movieId,userId,genres,tag,directedBy,starring,review,rating,Day sin,Day cos,Year sin,Year cos
0,61116,130459,"[crime, drama]",[blaxploitation],larry cohen,"[fred williamson, gloria hendry, art lund, d'u...",[great black movie of the 70s!! this was a gre...,2.5,0.12793,0.991783,0.968119,0.250489


In [61]:
df_final.dropna(inplace = True)

## Encoding and Embedding data

### 🔢 Encode Labels

To prepare our data for collaborative filtering, we need to convert categorical identifiers like **`userId`** and **`movieId`** into integer indices that can be passed into embedding layers.

We use `LabelEncoder` from `sklearn` for this purpose:

- **`userId`** → becomes a user index
- **`movieId`** → becomes a movie index

These integer labels are later used by `nn.Embedding` layers in the model.

> ⚠️ **Note:**  
We can safely apply this encoding **before splitting** the data into train/val/test sets. This is because:
- The IDs are **categorical**, not **learnable features**
- The model doesn't "learn" from the IDs themselves — it just uses them as **indices** to look up learned embeddings
- There's no data leakage, since we're not leaking any information about the rating values or user behavior — just mapping IDs to indices

This step ensures a consistent indexing scheme across the dataset and simplifies model training.

In [62]:
from sklearn.preprocessing import LabelEncoder

user_encoder = LabelEncoder()
movie_encoder = LabelEncoder()

# Apply to the WHOLE dataset BEFORE any split
df_final['user'] = user_encoder.fit_transform(df_final['userId'])
df_final['movie'] = movie_encoder.fit_transform(df_final['movieId'])

In [63]:
df_final.shape

(101541, 14)

df_final is very large, so we keep only 50% of it, sampled per movie.

In [64]:
df_final = df_final.groupby('movie', group_keys = False).sample(frac = 0.5, random_state = 42)

In [65]:
df_final.shape

(50731, 14)

### 🧠 Create Semantic Embeddings

To enrich our model with **content-based information**, we embed high-level textual features using a **pre-trained SentenceTransformer** model.

We generate embeddings for the following fields:
- 🎬 **Genres**
- #️⃣ **Tags**
- 🎥 **DirectedBy**
- 👥 **Starring (Cast)**
- 📝 **Reviews**

These embeddings are created using models like `all-MiniLM-L6-v2` from [SentenceTransformers](https://www.sbert.net/), which are trained to capture semantic similarity between phrases.

---

🧊 **Why It's Safe to Embed *Before* Splitting**

> 💡 Unlike neural network layers that are trained during model fitting, **SentenceTransformer models are frozen** — they don't update weights based on your dataset.

- ✅ **No data leakage** occurs since we're not learning from the data itself.
- ✅ Embeddings are generated purely based on **pre-trained language knowledge**, not on labels (like ratings).

---

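🔁 **Attention over Tokens (Not Order)**

A label encoder or a lookup on exact strings treats `"sci-fi thriller"` and `"thriller sci-fi"` as two unrelated inputs.

SentenceTransformers are more forgiving:
- Both phrases typically produce **similar** (though not necessarily identical) embeddings
- The model captures **semantic meaning**, not just the exact token sequence

In addition, the embedding code sorts each list before joining it, so the same set of tags or genres always produces the same input string.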

This makes these embeddings ideal for use in a hybrid recommender system, where nuanced text understanding boosts recommendation quality.
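
The sort-then-join step can be sketched on its own, without loading a model. The helper name `list_to_text` is hypothetical; it mirrors the logic inside the lambdas in the embedding cell.

```python
def list_to_text(items):
    """Join a list of strings into one sentence, ignoring non-strings and order."""
    if not isinstance(items, list):
        return ""
    return " ".join(sorted(item for item in items if isinstance(item, str)))

a = list_to_text(["thriller", "sci-fi"])
b = list_to_text(["sci-fi", "thriller"])
print(a == b)  # True: both become "sci-fi thriller"
```

The resulting string is what would be passed to `model_emb.encode`. Since `encode` also accepts a list of sentences, one possible speed-up (not used in this notebook) is to encode the unique strings in a single batched call and map the results back to rows.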

In [66]:
from sentence_transformers import SentenceTransformer
model_emb = SentenceTransformer('all-MiniLM-L6-v2')

In [67]:
df_final['tags_emb'] = df_final['tag'].map(
    lambda lst: model_emb.encode(" ".join(sorted([str(tag) for tag in lst if isinstance(tag, str)]))) if isinstance(lst, list) else model_emb.encode("")
)

df_final['genres_emb'] = df_final['genres'].map(
    lambda lst: model_emb.encode(" ".join(sorted([str(g) for g in lst if isinstance(g, str)]))) if isinstance(lst, list) else model_emb.encode("")
)

df_final['director_emb'] = df_final['directedBy'].map(
    # directedBy is a plain string (not a list), so encode it directly
    lambda s: model_emb.encode(s) if isinstance(s, str) else model_emb.encode("")
)

df_final['starring_emb'] = df_final['starring'].map(
    lambda lst: model_emb.encode(" ".join(sorted([str(g) for g in lst if isinstance(g, str)]))) if isinstance(lst, list) else model_emb.encode("")
)

df_final['review_emb'] = df_final['review'].map(
    lambda lst: model_emb.encode(" ".join(sorted([str(g) for g in lst if isinstance(g, str)]))) if isinstance(lst, list) else model_emb.encode("")
)

### 💾 Load Preprocessed Dataset with Embedded Features

If you've previously saved your fully processed dataset (including embedded features like genres, tags, and reviews), you can simply reload it from CSV without rerunning the entire preprocessing pipeline.

Since embeddings are written to CSV as stringified NumPy arrays (e.g. `"[0.12 -0.34 ...]"`, space-separated with no commas), we apply **parsing logic** to convert them back into numerical arrays after loading.

This ensures:
- 🧠 Embedded columns like `tags_emb`, `genres_emb`, `director_emb`, `starring_emb`, and `review_emb` are restored to their proper format
- 🏎️ Fast loading for training/testing without recomputing embeddings
- ✅ Consistent data structure for downstream PyTorch `Dataset` and model input

This step is especially useful when working with large datasets or when embedding generation is time-consuming.
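
The round trip can be checked in isolation. The sketch below uses a synthetic 384-dimensional vector and parses it with `str.split`, which is an alternative to the `np.fromstring` call used in this notebook; it tolerates the line breaks NumPy inserts when printing long arrays.

```python
import numpy as np

# Synthetic stand-in for one 384-dimensional sentence embedding
emb = np.linspace(-1.0, 1.0, 384).astype(np.float32)

# What ends up in the CSV cell: NumPy's string form (whitespace-separated, no commas)
cell = str(emb)

# Strip the brackets, split on any whitespace, and rebuild the array
restored = np.array(cell.strip("[]").split(), dtype=np.float32)

print(restored.shape, np.allclose(restored, emb, atol=1e-6))
```

Storing embeddings in a binary format instead of CSV would avoid the parsing step entirely, at the cost of a less human-readable file.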

In [778]:
embedding_columns = [
    'tags_emb', 'genres_emb', 'director_emb', 'starring_emb', 'review_emb'
]

# Read the CSV; embedding columns come back as strings such as "[0.12 -0.34 ...]".
# NumPy's string form has no commas, so ast.literal_eval cannot parse it;
# the conversion back to arrays happens in the next cell.
df_final = pd.read_csv("data/df_final.csv", index_col = 0)

In [798]:
embedding_columns = ['tags_emb', 'genres_emb', 'director_emb', 'starring_emb', 'review_emb']

for col in embedding_columns:
    df_final[col] = df_final[col].apply(
        lambda x: np.fromstring(x.strip("[]"), sep=" ", dtype=np.float32) if isinstance(x, str) else x
    )


In [68]:
df_final.head(1)

Unnamed: 0,movieId,userId,genres,tag,directedBy,starring,review,rating,Day sin,Day cos,Year sin,Year cos,user,movie,tags_emb,genres_emb,director_emb,starring_emb,review_emb
98570,98,122607,"[action, thriller]","[want to own, jude law, directorial debut]",paul w.s. anderson,"[sadie frost, jude law, sean pertwee, sean bea...",[ram-raiding joy riding thrill seeking fun jud...,4.0,-0.290563,0.956856,0.037591,-0.999293,46789,0,"[-0.019601567, -0.031717356, -0.021985551, -0....","[-0.0436603, -0.017041653, -0.0807151, 0.05877...","[-0.11883843, 0.04829872, -0.0025480906, -0.01...","[-0.037837047, -0.13291277, 0.059829786, -0.02...","[-0.072004735, -0.08433034, 0.035813972, -0.11..."


In [69]:
df_final.reset_index(drop = True, inplace = True)
df_final.to_csv("data/processed/df_final.csv")

In [70]:
df = df_final[['rating','Day sin', 'Day cos', 'Year sin', 'Year cos', 'user', 'movie',
       'tags_emb', 'genres_emb', 'director_emb', 'starring_emb', 'review_emb']]

## Data Split -> Dataset -> DataLoader

### 🔀 Split Data: Train / Validation / Test

To evaluate the performance of our recommendation model fairly and avoid overfitting, we split the dataset into three parts:
1. Training Set
2. Validation Set 
3. Test Set

- The split is done **after encoding** the IDs and (optionally) embedding the content features.
- We ensure the split is **random but reproducible** by using a fixed `random_state`.
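
The two-step split used in this notebook holds out 20% for testing and then 25% of the remainder for validation, which works out to 60/20/20 overall. A quick check on 1,000 placeholder rows:

```python
from sklearn.model_selection import train_test_split

rows = list(range(1000))  # stand-in for 1,000 rating rows

# Step 1: hold out 20% as the test set
train_val, test = train_test_split(rows, test_size=0.2, random_state=42)

# Step 2: take 25% of the remaining 80% as validation (0.25 * 0.8 = 0.2 overall)
train, val = train_test_split(train_val, test_size=0.25, random_state=42)

print(len(train), len(val), len(test))  # 600 200 200
```

Because the split is random over rows, the same user or movie can appear in all three sets, which suits rating prediction for known users but does not measure cold-start performance.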

In [71]:
from sklearn.model_selection import train_test_split
from torch.utils.data import DataLoader

In [72]:
def data_split(df_final):
    """
    split data into train, val and test datasets
    """
    # Split into training+validation and test sets first
    train_val_data, test_data = train_test_split(df_final, test_size=0.2, random_state=42)

    # Then split train+validation into training and validation sets (75/25 of the remaining 80%, i.e. 60/20/20 overall)
    train_data, val_data = train_test_split(train_val_data, test_size=0.25, random_state=42)  

    return train_data, val_data, test_data

In [73]:
train_data, val_data, test_data = data_split(df)

### 📦 Create Datasets and DataLoaders

We wrap the processed data into custom PyTorch `Dataset` objects and use `DataLoader` to efficiently batch, shuffle, and feed data into the model during training and evaluation.

In [74]:
from torch.utils.data import Dataset, DataLoader

class MovieLensDataset(Dataset):
    # Initialize the data objects
    def __init__(self, data):
        self.users = torch.tensor(data['user'].values, dtype=torch.long)    # integers
        self.movies = torch.tensor(data['movie'].values, dtype=torch.long)   # integers
        self.ratings = torch.tensor(data['rating'].values, dtype=torch.float32)
        self.Daysin = torch.tensor(data['Day sin'].values, dtype=torch.float32)
        self.Daycos = torch.tensor(data['Day cos'].values, dtype=torch.float32)
        self.Yearsin = torch.tensor(data['Year sin'].values, dtype=torch.float32)
        self.Yearcos = torch.tensor(data['Year cos'].values, dtype=torch.float32)
        self.tags_emb = torch.tensor(np.stack(data['tags_emb'].values), dtype=torch.float32)
        self.genres_emb = torch.tensor(np.stack(data['genres_emb'].values), dtype=torch.float32)
        self.director_emb = torch.tensor(np.stack(data['director_emb'].values), dtype=torch.float32)
        self.starring_emb = torch.tensor(np.stack(data['starring_emb'].values), dtype=torch.float32)
        self.review_emb = torch.tensor(np.stack(data['review_emb'].values), dtype=torch.float32)


    # Return the total number of samples
    def __len__(self):
        return len(self.ratings)

    # Get a single sample for a given index
    def __getitem__(self, idx):
        return (
            self.users[idx],
            self.movies[idx],
            self.ratings[idx],
            self.Daysin[idx],
            self.Daycos[idx],
            self.Yearsin[idx],
            self.Yearcos[idx],
            self.tags_emb[idx],
            self.genres_emb[idx],
            self.director_emb[idx],
            self.starring_emb[idx],
            self.review_emb[idx],)

In [75]:
def make_dataloader(df, shuffle=False):
    dataset = MovieLensDataset(df)
    data_loader = DataLoader(dataset, batch_size=64, shuffle=shuffle)
    return data_loader

In [76]:
# Shuffle only the training data; evaluation order does not need to be random
train_loader = make_dataloader(train_data, shuffle=True)
val_loader = make_dataloader(val_data)
test_loader = make_dataloader(test_data)

In [77]:
for batch in train_loader:
    users, movies, ratings, daysin, daycos, yearsin, yearcos, tags_emb, genres_emb, director_emb, starring_emb, review_emb = batch

    print(review_emb.shape)
    break


torch.Size([64, 384])


## 🧠 Model

We define a hybrid recommendation model that combines **collaborative filtering** (user and movie embeddings) with **content-based features** (e.g. tags, genres, reviews). These inputs are concatenated and passed through fully connected layers to predict user ratings.

In [78]:
class CFHybridModel(nn.Module):
    def __init__(self, n_users, n_movies, emb_dim, content_dim):
        super().__init__()
        self.user_embedding = nn.Embedding(n_users, emb_dim)
        self.movie_embedding = nn.Embedding(n_movies, emb_dim)

        # MLP for combining everything
        self.fc = nn.Sequential(
            nn.Linear(emb_dim * 2 + content_dim, 128),
            nn.ReLU(),
            nn.Linear(128, 1)
        )

    def forward(self, users, movies, daysin, daycos, yearsin, yearcos,
                tags_emb, genres_emb, director_emb, starring_emb, review_emb):
        
        # Collaborative embeddings
        user_vec = self.user_embedding(users)
        movie_vec = self.movie_embedding(movies)

        # Concatenate all content features
        content = torch.cat([
            daysin.unsqueeze(1),
            daycos.unsqueeze(1),
            yearsin.unsqueeze(1),
            yearcos.unsqueeze(1),
            tags_emb,
            genres_emb,
            director_emb,
            starring_emb,
            review_emb
        ], dim=1)

        # Final input vector
        x = torch.cat([user_vec, movie_vec, content], dim=1)

        # Predict
        # Squeeze only the last dimension so a batch of size 1 keeps shape [1]
        return self.fc(x).squeeze(-1)

### 📐 Define Embedding Dimensions

We initialize the **collaborative vector space** for users and movies:

- `n_users` and `n_movies` are set to the maximum encoded index plus one (not `nunique()`), because `df_final` was subsampled after encoding and some indices may no longer appear.
- `emb_dim` defines the dimensionality of the collaborative embeddings.
- `content_dim` represents the total size of all additional content-based features (e.g. timestamp, tags, genres, etc.) to be concatenated with user/movie vectors.

This forms the complete input vector for the hybrid model.
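
The dimension bookkeeping can be verified with placeholder arrays. This sketch assumes a batch of 64, `emb_dim = 5`, and 384-dimensional sentence embeddings (the size printed by the DataLoader check earlier); the variable names are illustrative only.

```python
import numpy as np

batch_size, emb_dim, text_dim = 64, 5, 384

user_vec = np.zeros((batch_size, emb_dim))
movie_vec = np.zeros((batch_size, emb_dim))
time_feats = np.zeros((batch_size, 4))  # Day sin/cos, Year sin/cos
# tags, genres, director, starring, review
text_feats = [np.zeros((batch_size, text_dim)) for _ in range(5)]

content = np.concatenate([time_feats, *text_feats], axis=1)
x = np.concatenate([user_vec, movie_vec, content], axis=1)

content_dim = content.shape[1]
print(content_dim, x.shape)  # 1924 (64, 1934)
```

With these sizes, the 10 collaborative dimensions make up a very small share of the 1,934 inputs to the first linear layer, which may be worth considering when tuning `emb_dim`.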


In [79]:
# Create the vector space
n_users = df_final['user'].max() + 1  # ✅ NOT nunique()
n_movies = df_final['movie'].max() + 1
emb_dim = 5 #dimension of the vector space
content_dim = 4 + 384 * 5 

In [80]:
#initiate the model
model = CFHybridModel(
    n_users=n_users,
    n_movies=n_movies,
    emb_dim=emb_dim,
    content_dim=content_dim
)

### ⚙️ Loss Function & Optimizer

We use **Mean Squared Error (MSE)** as the loss function to measure the difference between predicted and actual ratings.  
The model is optimized using the **Adam optimizer** with a learning rate of `0.01` for efficient and adaptive gradient updates.
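
MSE is expressed in squared rating units, so its square root (RMSE) is often easier to read as "typical error in stars". A minimal sketch with made-up predictions:

```python
import numpy as np

predicted = np.array([3.5, 4.0, 2.0])
actual = np.array([4.0, 4.0, 3.0])

# Mean of squared errors: (0.25 + 0.0 + 1.0) / 3
mse = np.mean((predicted - actual) ** 2)

# Root of the MSE, back in the original rating units
rmse = np.sqrt(mse)

print(round(mse, 4), round(rmse, 4))
```

The same conversion applies to the losses printed during training: taking the square root of a reported MSE gives an approximate error on the rating scale.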

In [81]:
loss_fn = nn.MSELoss()
optimizer = optim.Adam(model.parameters(), lr=0.01)

## Train & Eval

In [82]:
n_epochs = 20

device = "cuda" if torch.cuda.is_available() else (
    "mps" if torch.backends.mps.is_available() else "cpu"
)
model.to(device)

for epoch in range(n_epochs):
    model.train()
    train_loss = 0

    for batch in train_loader:
        (
            users, movies, ratings,
            daysin, daycos, yearsin, yearcos,
            tags_emb, genres_emb, director_emb, starring_emb, review_emb
        ) = [x.to(device) for x in batch]

        optimizer.zero_grad()

        # Forward pass
        predictions = model(
            users, movies, daysin, daycos, yearsin, yearcos,
            tags_emb, genres_emb, director_emb, starring_emb, review_emb
        )

        loss = loss_fn(predictions, ratings)
        loss.backward()
        optimizer.step()

        train_loss += loss.item()

    # Validation loop
    model.eval()
    val_loss = 0

    with torch.no_grad():
        for batch in val_loader:
            (
                users, movies, ratings,
                daysin, daycos, yearsin, yearcos,
                tags_emb, genres_emb, director_emb, starring_emb, review_emb
            ) = [x.to(device) for x in batch]

            predictions = model(
                users, movies, daysin, daycos, yearsin, yearcos,
                tags_emb, genres_emb, director_emb, starring_emb, review_emb
            )

            loss = loss_fn(predictions, ratings)
            val_loss += loss.item()

    print(
        f"Epoch {epoch+1}/{n_epochs} | "
        f"Train Loss: {train_loss / len(train_loader):.4f} | "
        f"Val Loss: {val_loss / len(val_loader):.4f}"
    )


Epoch 1/20 | Train Loss: 1.0745 | Val Loss: 1.0619
Epoch 2/20 | Train Loss: 0.9814 | Val Loss: 0.9528
Epoch 3/20 | Train Loss: 0.9022 | Val Loss: 0.9951
Epoch 4/20 | Train Loss: 0.7086 | Val Loss: 1.1896
Epoch 5/20 | Train Loss: 0.5173 | Val Loss: 1.2069
Epoch 6/20 | Train Loss: 0.3855 | Val Loss: 1.2126
Epoch 7/20 | Train Loss: 0.3091 | Val Loss: 1.3106
Epoch 8/20 | Train Loss: 0.2701 | Val Loss: 1.3364
Epoch 9/20 | Train Loss: 0.2499 | Val Loss: 1.3043
Epoch 10/20 | Train Loss: 0.2352 | Val Loss: 1.3274
Epoch 11/20 | Train Loss: 0.2239 | Val Loss: 1.3221
Epoch 12/20 | Train Loss: 0.2184 | Val Loss: 1.2824
Epoch 13/20 | Train Loss: 0.2144 | Val Loss: 1.2918
Epoch 14/20 | Train Loss: 0.2110 | Val Loss: 1.2890
Epoch 15/20 | Train Loss: 0.2053 | Val Loss: 1.3077
Epoch 16/20 | Train Loss: 0.2038 | Val Loss: 1.2613
Epoch 17/20 | Train Loss: 0.1992 | Val Loss: 1.2740
Epoch 18/20 | Train Loss: 0.1964 | Val Loss: 1.2713
Epoch 19/20 | Train Loss: 0.1954 | Val Loss: 1.3361
Epoch 20/20 | Train L
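
In the log above, validation loss reaches its minimum at epoch 2 (0.9528) and rises afterwards while training loss keeps falling, which is a typical sign of overfitting. One common remedy is to keep the weights from the best validation epoch and stop once validation loss has not improved for a few epochs. The sketch below is a hypothetical helper, not part of the original notebook: the class name `EarlyStopper` and the `patience` value are assumptions, and it replays the first six validation losses from the log instead of training a model.

```python
import copy


class EarlyStopper:
    """Tracks the best validation loss and signals when to stop (hypothetical helper)."""

    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience      # epochs to wait without improvement
        self.min_delta = min_delta    # minimum decrease that counts as improvement
        self.best_loss = float("inf")
        self.best_epoch = None
        self.best_state = None
        self.bad_epochs = 0

    def step(self, epoch, val_loss, state):
        """Record one epoch; return True when training should stop."""
        if val_loss < self.best_loss - self.min_delta:
            self.best_loss = val_loss
            self.best_epoch = epoch
            # Snapshot of the weights, e.g. model.state_dict() in the real loop
            self.best_state = copy.deepcopy(state)
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Replay the first validation losses from the log above
val_losses = [1.0619, 0.9528, 0.9951, 1.1896, 1.2069, 1.2126]
stopper = EarlyStopper(patience=3)
stopped_at = None

for epoch, val_loss in enumerate(val_losses, start=1):
    # A placeholder dict stands in for model.state_dict()
    if stopper.step(epoch, val_loss, state={"epoch": epoch}):
        stopped_at = epoch
        break

print(f"Best epoch: {stopper.best_epoch} | stopped at epoch: {stopped_at}")
```

In the training loop, this would amount to calling `stopper.step(epoch + 1, val_loss / len(val_loader), model.state_dict())` after validation, breaking out of the loop when it returns `True`, and restoring the best weights with `model.load_state_dict(stopper.best_state)` before evaluating on the test set.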

In [83]:
# Test
model.eval()

with torch.no_grad():
    test_loss = 0
    for batch in test_loader:
        (
            users, movies, ratings,
            daysin, daycos, yearsin, yearcos,
            tags_emb, genres_emb, director_emb, starring_emb, review_emb
        ) = [x.to(device) for x in batch]

        predictions = model(
            users, movies, daysin, daycos, yearsin, yearcos,
            tags_emb, genres_emb, director_emb, starring_emb, review_emb
        )

        loss = loss_fn(predictions, ratings)
        test_loss += loss.item()

    print(f"✅ Test Loss: {test_loss / len(test_loader):.4f}")


✅ Test Loss: 1.2805
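
Because the loss is MSE, its square root (RMSE) is easier to read: it is expressed in rating units. The sketch below converts the reported test loss. It also shows, with made-up batch losses and batch sizes, that averaging per-batch means (as the loops above do) differs slightly from a sample-weighted mean when the last batch is smaller. The variable names are illustrative assumptions.

```python
import math

# Test loss (MSE) reported above
test_mse = 1.2805
test_rmse = math.sqrt(test_mse)
print(f"Test RMSE: {test_rmse:.4f}")

# Made-up per-batch losses and batch sizes to show the averaging difference
batch_losses = [1.0, 2.0]
batch_sizes = [64, 16]

# Mean of batch means: every batch counts equally, regardless of its size
mean_of_batches = sum(batch_losses) / len(batch_losses)

# Sample-weighted mean: every rating counts equally
sample_weighted = sum(l * n for l, n in zip(batch_losses, batch_sizes)) / sum(batch_sizes)

print(f"Mean of batches: {mean_of_batches:.2f} | sample-weighted: {sample_weighted:.2f}")
```

With equally sized batches the two averages coincide, so the difference only matters when the final batch is much smaller than the rest.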


## 🎯 Model In Application 

We define a function to **recommend top-N movies** for a given user based on the model's predicted ratings.  

- It filters out movies the user has already rated.
- It gathers relevant content features for each candidate movie.
- It runs the model to predict ratings for all remaining movies.
- It returns the **highest-rated recommendations** for that user.

This enables personalized recommendations using both collaborative and content-based signals.

In [86]:
def recommend_movies(model, user_id, df, all_movie_ids, rated_movie_ids=frozenset(), top_n=10, device="cpu"):
    """
    Return a list of top_n recommended movie IDs (encoded) for a given user.

    Args:
        model (nn.Module): Trained hybrid model.
        user_id (int): Encoded user ID.
        df (pd.DataFrame): Dataset with all features and movie encodings.
        all_movie_ids (list): All encoded movie IDs.
        rated_movie_ids (set): Movies already rated by this user.
        top_n (int): Number of recommendations.
        device (str): "cpu", "cuda", or "mps".

    Returns:
        List[int]: Top-N recommended movie IDs.
    """
    model.eval()

    # Filter candidate movies
    candidate_ids = [m for m in all_movie_ids if m not in rated_movie_ids]
    if not candidate_ids:
        return []

    # Keep one feature row per movie, aligned to the candidate order;
    # candidates with missing features are dropped
    df_unique_movies = df.drop_duplicates(subset="movie", keep="first")
    features = df_unique_movies.set_index("movie").reindex(candidate_ids).dropna()
    candidate_ids = features.index.tolist()
    if not candidate_ids:
        return []

    # Build the ID tensors only after filtering, so they line up with the feature rows
    user_tensor = torch.tensor([user_id] * len(candidate_ids), dtype=torch.long, device=device)
    movie_tensor = torch.tensor(candidate_ids, dtype=torch.long, device=device)

    # Scalar features
    daysin  = torch.tensor(features["Day sin"].values, dtype=torch.float32, device=device)
    daycos  = torch.tensor(features["Day cos"].values, dtype=torch.float32, device=device)
    yearsin = torch.tensor(features["Year sin"].values, dtype=torch.float32, device=device)
    yearcos = torch.tensor(features["Year cos"].values, dtype=torch.float32, device=device)

    # Embedded features
    tags_emb     = torch.tensor(np.stack(features["tags_emb"].values), dtype=torch.float32, device=device)
    genres_emb   = torch.tensor(np.stack(features["genres_emb"].values), dtype=torch.float32, device=device)
    director_emb = torch.tensor(np.stack(features["director_emb"].values), dtype=torch.float32, device=device)
    starring_emb = torch.tensor(np.stack(features["starring_emb"].values), dtype=torch.float32, device=device)
    review_emb   = torch.tensor(np.stack(features["review_emb"].values), dtype=torch.float32, device=device)

    # Run model
    with torch.no_grad():
        preds = model(
            user_tensor, movie_tensor,
            daysin, daycos, yearsin, yearcos,
            tags_emb, genres_emb, director_emb, starring_emb, review_emb
        )

    # Get top-N movie indices (flatten first in case the model returns shape (N, 1))
    top_indices = preds.reshape(-1).cpu().numpy().argsort()[::-1][:top_n]

    # Return top-N movie IDs (encoded)
    return [candidate_ids[i] for i in top_indices]
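
The ranking step at the end of `recommend_movies` sorts every prediction and then keeps the first `top_n`. An alternative, sketched below with made-up scores and IDs, is `torch.topk`, which returns the `k` largest values and their indices directly. The toy tensor and the `recommended` variable are illustrative assumptions.

```python
import torch

# Toy predicted ratings for six candidate movies (made-up values)
preds = torch.tensor([3.1, 4.7, 2.2, 4.9, 3.8, 4.5])
candidate_ids = [10, 11, 12, 13, 14, 15]

top_n = 3
# topk raises an error if k exceeds the number of elements, so clamp it
k = min(top_n, preds.numel())

# Results are sorted by score in descending order by default
top_scores, top_indices = torch.topk(preds, k)

# Map positions back to the encoded movie IDs
recommended = [candidate_ids[i] for i in top_indices.tolist()]
print(recommended)
```

Either approach gives the same ranking when scores are distinct. `topk` avoids sorting the full candidate list, which may matter when the catalogue is large.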


Now we can test the function and see, for example, which movies it suggests for userId = 10.

In [87]:
user_id = 10  # Encoded user ID
rated_movies = set(df_final[df_final['userId'] == user_id]['movie'].values)
all_movies = df_final['movie'].unique()

recommended_movies = recommend_movies(
    model=model,
    user_id=user_id,
    df=df,
    all_movie_ids=all_movies,
    rated_movie_ids=rated_movies,
    top_n=5,
    device=device
)

print("🎬 Recommended movie IDs:", recommended_movies)


🎬 Recommended movie IDs: [179, 444, 174, 112, 228]


In [88]:
movie_ids = np.unique(df_final[df_final.movie.isin(recommended_movies)]['movieId'])

In [90]:
# List of suggested movies
movies = pd.read_csv("data/raw/movie.csv")
movies[movies.movieId.isin(movie_ids)]

Unnamed: 0,movieId,title,genres
2819,2905,Sanjuro (Tsubaki Sanjûrô) (1962),Action|Adventure|Drama
4203,4298,Rififi (Du rififi chez les hommes) (1955),Crime|Film-Noir|Thriller
4296,4391,"Vertical Ray of the Sun, The (Mua he chieu tha...",Drama
5610,5709,"Amateur, The (1981)",Crime|Thriller
10577,40491,"Match Factory Girl, The (Tulitikkutehtaan tytt...",Comedy|Drama


The recommendations suggest that userId = 10 likes Drama, Thriller and Crime movies, with a bit of Comedy.

Let's check whether this user actually liked these types of movies:

In [98]:
movies[movies.movieId.isin(ratings_sample[ratings_sample.userId == 10]['movieId'])]

Unnamed: 0,movieId,title,genres
895,912,Casablanca (1942),Drama|Romance
1113,1136,Monty Python and the Holy Grail (1975),Adventure|Comedy|Fantasy
1195,1221,"Godfather: Part II, The (1974)",Crime|Drama
1974,2058,"Negotiator, The (1998)",Action|Crime|Drama|Mystery|Thriller
2299,2384,Babe: Pig in the City (1998),Adventure|Children|Drama


We can see that this user has previously rated movies in these genres highly, so the model's suggestions look reasonable.
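
The comparison above is done by eye. A rough way to make it concrete is to count genres in both listings and measure their overlap. The sketch below hardcodes the genre strings from the two outputs above; the helper `count_genres` and the use of Jaccard overlap are illustrative assumptions, not part of the original notebook.

```python
from collections import Counter

# Genre strings copied from the two listings above
recommended_genres = [
    "Action|Adventure|Drama",
    "Crime|Film-Noir|Thriller",
    "Drama",
    "Crime|Thriller",
    "Comedy|Drama",
]
rated_genres = [
    "Drama|Romance",
    "Adventure|Comedy|Fantasy",
    "Crime|Drama",
    "Action|Crime|Drama|Mystery|Thriller",
    "Adventure|Children|Drama",
]


def count_genres(genre_strings):
    """Count how often each genre appears in pipe-separated genre strings."""
    return Counter(g for s in genre_strings for g in s.split("|"))


rec_counts = count_genres(recommended_genres)
rated_counts = count_genres(rated_genres)

# Genres present in both listings, and the share of all genres they represent
shared = set(rec_counts) & set(rated_counts)
jaccard = len(shared) / len(set(rec_counts) | set(rated_counts))

print("Shared genres:", sorted(shared))
print("Only in recommendations:", sorted(set(rec_counts) - set(rated_counts)))
print(f"Jaccard overlap: {jaccard:.2f}")
```

With only five movies on each side this is a sanity check, not an evaluation metric. It also ignores the rating values, so it measures genre overlap and not whether the user would enjoy the recommendations.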