# Train Hybrid Movie Recommender System

This notebook trains:
- A content-based recommendation model using TF-IDF + Nearest Neighbors
- A collaborative filtering model using SVD (Singular Value Decomposition)

## Step 1: Load and Clean Dataset

Rows with missing genre, rating, or vote data would either break the TF-IDF step or feed unusable ratings to SVD, so they are removed before any model is trained.

### What it Does
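Loads the CSV dataset and removes rows with missing values for genre, rating, or votes. It also drops rows whose genre string is empty or whitespace-only, which a missing-value check alone would keep. This gives the TF-IDF and SVD steps consistent input.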

### Variables
- `movies_df`: DataFrame containing the cleaned dataset.

In [4]:
import pandas as pd
import joblib
import time
import os
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors
from surprise import Dataset, Reader, SVD

start_time = time.time()
print("[INFO] Loading cleaned dataset...")
movies_df = pd.read_csv("cleaned_imdb_movies.csv")
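# Drop rows where any of the required columns is missing (NaN).
movies_df = movies_df.dropna(subset=["genres", "averageRating", "numVotes"])
# Drop rows whose genre string is empty or contains only whitespace.
movies_df = movies_df[movies_df["genres"].str.strip().astype(bool)]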
print("[INFO] Movies loaded:", len(movies_df))

[INFO] Loading cleaned dataset...
[INFO] Movies loaded: 100000
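
The row-level logic of this cleaning step can be sketched in plain Python. This is a minimal illustration rather than the notebook's pandas code: the helper `is_usable` and the sample rows are hypothetical, and `None` stands in for the `NaN` that pandas uses for missing CSV values.

```python
# Hypothetical sample rows; None plays the role of a missing CSV value.
rows = [
    {"genres": "Action,Drama", "averageRating": 7.1, "numVotes": 1200},
    {"genres": None, "averageRating": 6.0, "numVotes": 300},
    {"genres": "   ", "averageRating": 5.5, "numVotes": 90},
    {"genres": "Comedy", "averageRating": None, "numVotes": 45},
]

REQUIRED = ("genres", "averageRating", "numVotes")


def is_usable(row):
    """Keep a row only if every required field is present and genres is non-blank."""
    if any(row[col] is None for col in REQUIRED):
        return False  # plays the role of dropna(subset=[...])
    return bool(row["genres"].strip())  # plays the role of .str.strip().astype(bool)


cleaned = [row for row in rows if is_usable(row)]
print(len(cleaned))
```

The third sample row shows why the second filter exists: a whitespace-only string is not a missing value, so a null check alone would let it through.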


## Step 2: TF-IDF Genre Vectorization

To represent genres numerically for similarity computation, we apply TF-IDF, a term-weighting scheme widely used in NLP that gives more weight to terms appearing in fewer documents.

### What it Does
Transforms genre text using TF-IDF into a matrix where rows represent movies and columns represent genre terms. Each cell reflects term importance.

### Variables
- `tfidf_vectorizer`: Fitted TF-IDF model.
- `tfidf_matrix`: TF-IDF score matrix for genres.

In [7]:
print("[INFO] Starting TF-IDF vectorization...")
tfidf_vectorizer = TfidfVectorizer(stop_words="english")
tfidf_matrix = tfidf_vectorizer.fit_transform(movies_df["genres"])
print(f"[DONE] TF-IDF shape: {tfidf_matrix.shape}")

[INFO] Starting TF-IDF vectorization...
[DONE] TF-IDF shape: (100000, 28)
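
To make the weighting concrete, here is a small pure-Python sketch of the computation. It follows what I understand to be scikit-learn's defaults (lowercasing, tokens of two or more word characters, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, and L2 row normalization) and leaves out stop-word removal. The function names `tokenize` and `tfidf` and the sample genre strings are hypothetical.

```python
import math
import re
from collections import Counter

# Tokens are runs of two or more word characters, so punctuation splits terms.
TOKEN = re.compile(r"\b\w\w+\b")


def tokenize(text):
    return TOKEN.findall(text.lower())


def tfidf(docs):
    """Return (vocabulary, rows) with smoothed IDF and L2-normalized rows."""
    tokenized = [tokenize(doc) for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(tok for toks in tokenized for tok in set(toks))
    idf = {tok: math.log((1 + n_docs) / (1 + doc_freq[tok])) + 1 for tok in vocab}
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        row = [counts[tok] * idf[tok] for tok in vocab]
        norm = math.sqrt(sum(v * v for v in row)) or 1.0
        rows.append([v / norm for v in row])
    return vocab, rows


genres = ["Action,Sci-Fi", "Action,Drama", "Drama"]
vocab, matrix = tfidf(genres)
print(vocab)
```

Note that the hyphen in a genre such as `Sci-Fi` splits it into two tokens under this tokenization, so the number of columns need not equal the number of distinct genre labels.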


## Step 3: Train Nearest Neighbors Model

To find genre-similar movies, we use k-Nearest Neighbors (k-NN) with cosine similarity on the TF-IDF matrix.
- Cosine similarity measures how similar two vectors are based on the angle between them, independent of their lengths.

### What it Does
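Fits `NearestNeighbors` with cosine distance and retrieves the 50 nearest neighbors of every movie. Because the query set is the training matrix itself, each movie typically appears in its own neighbor list at distance 0.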

### Variables
- `nn_model`: NearestNeighbors model.
- `distances`, `indices`: Results of the nearest neighbor lookup.

In [12]:
print("[INFO] Fitting NearestNeighbors model...")
nn_model = NearestNeighbors(n_neighbors=50, metric="cosine", algorithm="brute")
nn_model.fit(tfidf_matrix)

print("[INFO] Computing all neighbors...")
distances, indices = nn_model.kneighbors(tfidf_matrix)

[INFO] Fitting NearestNeighbors model...
[INFO] Computing all neighbors...
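
The brute-force lookup can be illustrated with a few lines of plain Python. This is a sketch under simplifying assumptions, not scikit-learn's implementation: `cosine_distance` and `k_neighbors` are hypothetical helpers, the vectors are made up, and ties are broken by row position.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity; 0 means the vectors point in the same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def k_neighbors(vectors, query, k):
    """Compare the query against every stored vector and keep the k closest rows."""
    scored = sorted((cosine_distance(query, vec), row) for row, vec in enumerate(vectors))
    return [row for _, row in scored[:k]]


vectors = [
    [1.0, 0.0, 0.0],
    [0.8, 0.6, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
neighbors = k_neighbors(vectors, vectors[0], k=3)
print(neighbors)

# Querying with a stored vector returns that row first, so drop it
# when only other items are wanted.
others = [row for row in neighbors if row != 0]
```

When the neighbor lists are later used for recommendations, the query movie itself usually needs to be filtered out in the same way.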


## Step 4: Save Content-Based Models

Saving models avoids retraining and allows integration into APIs or apps for real-time use.

### What it Does
Saves the fitted TF-IDF vectorizer and the precomputed neighbor index array to the `models/` directory. The `NearestNeighbors` model itself is not saved.

### Variables
- Saved files: `models/tfidf_vectorizer.pkl`, `models/nearest_neighbors_indices.pkl`

In [18]:
os.makedirs("models", exist_ok=True)
joblib.dump(tfidf_vectorizer, "models/tfidf_vectorizer.pkl")
print("[SAVED] TF-IDF vectorizer saved.")
joblib.dump(indices, "models/nearest_neighbors_indices.pkl")
print("[SAVED] Nearest Neighbors indices saved.")

[SAVED] TF-IDF vectorizer saved.
[SAVED] Nearest Neighbors indices saved.
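
One way the saved index array might be consumed later is a simple lookup from a title to its neighbors' titles. The sketch below uses in-memory lists in place of the loaded files; `similar_titles`, the titles, and the index rows are all hypothetical.

```python
def similar_titles(title, titles, neighbor_indices, top_n=3):
    """Return up to top_n neighbor titles, skipping the query movie itself."""
    row = titles.index(title)  # positional row, matching the TF-IDF matrix order
    return [titles[j] for j in neighbor_indices[row] if j != row][:top_n]


# Hypothetical stand-ins for the title column and the saved indices array.
titles = ["Alpha", "Bravo", "Charlie", "Delta"]
neighbor_indices = [
    [0, 1, 2],
    [1, 0, 3],
    [2, 0, 3],
    [3, 1, 2],
]
result = similar_titles("Bravo", titles, neighbor_indices, top_n=2)
print(result)
```

The stored values are row positions in the TF-IDF matrix, not DataFrame index labels. If any rows were dropped during cleaning without resetting the index, a lookup against `movies_df` would need positional access such as `iloc`.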


## Step 5: Simulate Users for SVD

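SVD is a collaborative filtering algorithm that requires user-item interactions. The dataset contains none, so we simulate 1000 users by assigning each movie to the user whose ID is the row index modulo 1000. Each simulated rating is the movie's `averageRating`, so the resulting model demonstrates the training pipeline rather than real user preferences.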

### What it Does
Adds a `user_id` column, loads the (user, item, rating) columns into a Surprise `Dataset`, and builds a trainset from all rows.

### Variables
- `reader`, `data`, `trainset`: Inputs for SVD training.

In [23]:
print("[INFO] Preparing data for SVD...")
movies_df["user_id"] = movies_df.index % 1000
reader = Reader(rating_scale=(0, 10))
data = Dataset.load_from_df(movies_df[["user_id", "primaryTitle", "averageRating"]], reader)
trainset = data.build_full_trainset()

[INFO] Preparing data for SVD...
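
The effect of the modulo assignment is easy to check in isolation. The sketch below assumes a contiguous index from 0 to 99,999, matching the row count printed in Step 1; a DataFrame index with gaps left by filtering would give a slightly uneven split.

```python
from collections import Counter

N_USERS = 1000
n_rows = 100_000  # matches the row count printed in Step 1

# Assumes a contiguous 0..n_rows-1 index; a filtered index may have gaps.
user_ids = [row_index % N_USERS for row_index in range(n_rows)]
ratings_per_user = Counter(user_ids)

print(len(ratings_per_user), min(ratings_per_user.values()), max(ratings_per_user.values()))
```

Under that assumption every simulated user receives exactly 100 ratings, and rows 5, 1005, 2005, and so on all belong to user 5.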


## Step 6: Train and Save SVD Model

SVD factorizes the user-movie rating matrix into latent factors, which are then combined to predict ratings for user-movie pairs that were not observed.

### What it Does
Trains the SVD model with 50 latent factors for 10 epochs and saves it to `models/svd_model.pkl` for later prediction.

### Variables
- `svd_model`: Trained model.
- File saved: `models/svd_model.pkl`

In [26]:
print("[INFO] Training SVD model...")
svd_model = SVD(n_factors=50, n_epochs=10)
svd_model.fit(trainset)
joblib.dump(svd_model, "models/svd_model.pkl")
print("[SAVED] SVD model saved.")

[INFO] Training SVD model...
[SAVED] SVD model saved.
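
For intuition about what the training call does, here is a pure-Python sketch of biased matrix factorization fitted by stochastic gradient descent, which to my understanding is the approach Surprise's `SVD` takes. It is not Surprise's code: the function `train_biased_mf`, the toy ratings, and the hyperparameter values are assumptions chosen for illustration.

```python
import math
import random


def train_biased_mf(ratings, n_factors=2, n_epochs=200, lr=0.005, reg=0.02, seed=0):
    """Fit r_hat = mu + b_u + b_i + p_u . q_i by stochastic gradient descent."""
    rng = random.Random(seed)
    users = sorted({u for u, _, _ in ratings})
    items = sorted({i for _, i, _ in ratings})
    mu = sum(r for _, _, r in ratings) / len(ratings)  # global mean rating
    b_u = {u: 0.0 for u in users}
    b_i = {i: 0.0 for i in items}
    p = {u: [rng.gauss(0.0, 0.1) for _ in range(n_factors)] for u in users}
    q = {i: [rng.gauss(0.0, 0.1) for _ in range(n_factors)] for i in items}

    def predict(u, i):
        return mu + b_u[u] + b_i[i] + sum(a * b for a, b in zip(p[u], q[i]))

    for _ in range(n_epochs):
        for u, i, r in ratings:
            err = r - predict(u, i)
            # Move each parameter along the error gradient, with L2 shrinkage.
            b_u[u] += lr * (err - reg * b_u[u])
            b_i[i] += lr * (err - reg * b_i[i])
            for f in range(n_factors):
                p_uf, q_if = p[u][f], q[i][f]
                p[u][f] += lr * (err * q_if - reg * p_uf)
                q[i][f] += lr * (err * p_uf - reg * q_if)
    return mu, predict


def rmse(ratings, predict):
    return math.sqrt(sum((r - predict(u, i)) ** 2 for u, i, r in ratings) / len(ratings))


# Toy (user, item, rating) triples on a 0-10 scale.
ratings = [
    (0, "A", 8.0), (0, "B", 6.0),
    (1, "A", 9.0), (1, "C", 4.0),
    (2, "B", 7.0), (2, "C", 5.0),
]
mu, predict = train_biased_mf(ratings)
baseline_rmse = rmse(ratings, lambda u, i: mu)
trained_rmse = rmse(ratings, predict)
print(f"{baseline_rmse:.3f} -> {trained_rmse:.3f}")
```

Comparing the fitted model against the global-mean baseline on the training triples is only a sanity check that learning happened; it says nothing about how well the model generalizes.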


In [28]:
elapsed = time.time() - start_time
print(f"All models trained and saved in {elapsed:.2f} seconds")

All models trained and saved in 249.83 seconds
