# Movie Recommendation System — Combined Notebook

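This notebook covers exploratory data analysis (EDA), user-based and item-based collaborative filtering, matrix factorization (SVD), evaluation, and instructions for saving the model and running a demo. Run the cells in order; later cells depend on variables defined in earlier ones.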

## 0. Setup & Notes

- Place the MovieLens 100k dataset under `data/ml-100k/` (required files: `u.data`, `u.item`).
- Install requirements: `pip install -r requirements.txt` (file included in project zip).

In [None]:

# 0.1 Imports
import os
import math
import pickle
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from collections import defaultdict
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from surprise import Dataset, Reader, SVD, accuracy
from surprise.model_selection import train_test_split as svd_train_test_split

# Display settings
pd.set_option('display.max_columns', 50)
pd.set_option('display.width', 120)


## 1. Load Data

Load `u.data` and `u.item`.

In [None]:

DATA_DIR = 'data/ml-100k'  # adjust if needed
if not os.path.exists(DATA_DIR):
    print("Warning: data/ml-100k not found. Please download MovieLens 100k and place it under data/ml-100k/")
cols = ['user_id','movie_id','rating','timestamp']
ratings = pd.read_csv(os.path.join(DATA_DIR,'u.data'), sep='\t', names=cols, encoding='latin-1')
movie_cols = ['movie_id','title','release_date','video_release_date','IMDb_URL',
              'unknown','Action','Adventure','Animation','Children','Comedy','Crime','Documentary',
              'Drama','Fantasy','Film-Noir','Horror','Musical','Mystery','Romance','Sci-Fi',
              'Thriller','War','Western']
movies = pd.read_csv(os.path.join(DATA_DIR,'u.item'), sep='|', names=movie_cols, encoding='latin-1', header=None)
print('Ratings rows:', len(ratings))
ratings.head()


## 2. Exploratory Data Analysis (EDA)

Simple statistics and plots: rating distribution, users, movies, sparsity, popular movies.

In [None]:

# Basic stats
num_ratings = len(ratings)
num_users = ratings['user_id'].nunique()
num_movies = ratings['movie_id'].nunique()
sparsity = 1.0 - num_ratings / (num_users * num_movies)
print(f"Ratings: {num_ratings}, Users: {num_users}, Movies: {num_movies}, Sparsity: {sparsity:.4f}")

# Rating distribution
plt.figure(figsize=(6,4))
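# One bin centered on each integer rating 1..5 (bins=5 would put edges at 1.8, 2.6, ...)
ratings['rating'].hist(bins=np.arange(0.5, 6.0, 1.0))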
plt.title('Rating distribution')
plt.xlabel('Rating')
plt.show()

# Top 10 most-rated movies
pop = ratings.groupby('movie_id').size().reset_index(name='count').sort_values('count', ascending=False)
top10 = pop.head(10).merge(movies[['movie_id','title']], on='movie_id')
top10[['title','count']]
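
As a sanity check on the sparsity formula used in this cell, here is the same arithmetic with the counts published for MovieLens 100k (100,000 ratings, 943 users, 1,682 movies). The notebook computes these counts from `u.data`; they are hard-coded here only to keep the sketch self-contained.

```python
# Published MovieLens 100k counts, hard-coded for illustration only.
num_ratings, num_users, num_movies = 100_000, 943, 1_682

# Density is the fraction of the user x movie grid that holds a rating.
density = num_ratings / (num_users * num_movies)
sparsity = 1.0 - density

print(f"density={density:.4f}, sparsity={sparsity:.4f}")
```

Roughly 6% of the possible user-movie pairs carry a rating, which is why the neighborhood methods below have to handle many empty entries.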


## 3. Prepare training and test sets

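We randomly hold out 20% of the ratings so that evaluation uses user-item pairs the models have not seen. In the pivoted matrix, missing entries are filled with 0, which the collaborative-filtering code below treats as "not rated".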

In [None]:

train_df, test_df = train_test_split(ratings, test_size=0.2, random_state=42)
print('Train size:', len(train_df), 'Test size:', len(test_df))
# Pivot (for algorithms that require full matrix)
train_matrix = train_df.pivot(index='user_id', columns='movie_id', values='rating').fillna(0)
train_matrix.shape
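
Because the split is random over rows, a test pair can involve a user or movie that never appears in the training rows. The predictors in sections 4 and 5 return the global mean in that case. A minimal sketch of how such pairs could be counted, using made-up `(user_id, movie_id)` tuples rather than the real dataframes:

```python
# Toy (user_id, movie_id) pairs; the real notebook would read these from
# train_df and test_df.
train = [(1, 10), (1, 11), (2, 10), (3, 12)]
test = [(1, 12), (2, 13), (4, 10)]

train_users = {u for u, _ in train}
train_movies = {m for _, m in train}

# A test pair is "cold" if either side was never seen during training.
cold = [(u, m) for u, m in test if u not in train_users or m not in train_movies]

print(f"{len(cold)} of {len(test)} test pairs are cold:", cold)
```

A high share of cold pairs would pull every model's error toward that of the global-mean baseline.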


## 4. User-based Collaborative Filtering

Compute cosine similarity between users and predict using k nearest neighbors.

In [None]:

# 4.1 Compute user-user cosine similarity
user_ids = train_matrix.index.tolist()
user_matrix = train_matrix.values  # rows: users, cols: movies
user_sim = cosine_similarity(user_matrix)
print('User similarity matrix shape:', user_sim.shape)

# 4.2 Helper functions
user_id_to_index = {uid: i for i, uid in enumerate(user_ids)}
index_to_user_id = {i: uid for uid, i in user_id_to_index.items()}

# Precompute lookups once instead of on every prediction
global_mean = train_df['rating'].mean()
movie_id_to_col = {mid: j for j, mid in enumerate(train_matrix.columns)}

def predict_user_based(user_id, movie_id, k=20):
    # Fall back to the global mean for users or movies unseen in training
    if user_id not in user_id_to_index:
        return global_mean
    if movie_id not in movie_id_to_col:
        return global_mean
    uidx = user_id_to_index[user_id]
    movie_col_idx = movie_id_to_col[movie_id]
    sims = user_sim[uidx]
    other_ratings = user_matrix[:, movie_col_idx]
    pairs = [(sims[i], other_ratings[i]) for i in range(len(sims)) if other_ratings[i] > 0 and i != uidx]
    if not pairs:
        return global_mean
    pairs.sort(key=lambda x: x[0], reverse=True)
    topk = pairs[:k]
    num = sum(sim * rating for sim, rating in topk)
    den = sum(abs(sim) for sim, _ in topk)
    if den == 0:
        return global_mean
    return num / den
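
To make the prediction rule concrete, the sketch below applies the same similarity-weighted average to a made-up 3x3 rating matrix. The matrix and the `cosine` helper are illustrative assumptions; the cell above uses `sklearn`'s `cosine_similarity` on the full training matrix.

```python
import numpy as np

# Toy ratings: rows are users, columns are movies, 0 means "not rated".
R = np.array([
    [5.0, 3.0, 0.0],   # target user, rating for movie 2 is unknown
    [4.0, 3.0, 4.0],
    [1.0, 5.0, 2.0],
])

def cosine(a, b):
    """Cosine similarity between two rating vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

target, movie = 0, 2

# Neighbors are the other users who actually rated the movie.
neighbors = [u for u in range(len(R)) if u != target and R[u, movie] > 0]
sims = np.array([cosine(R[target], R[u]) for u in neighbors])

# Prediction: neighbor ratings weighted by similarity to the target user.
prediction = float(sims @ R[neighbors, movie] / np.abs(sims).sum())
print(f"similarities={sims.round(3)}, prediction={prediction:.2f}")
```

User 1 rates like the target user and so gets the larger weight, which pulls the prediction toward that user's rating of 4. The ratings are not mean-centered, matching the cell above.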


### 4.3 Evaluate user-based CF on a sample of the test set (for speed)

In [None]:

def evaluate_preds(preds, trues):
    preds = np.array(preds)
    trues = np.array(trues)
    rmse = np.sqrt(np.mean((preds - trues)**2))
    mask = trues != 0
    mape = np.mean(np.abs((preds[mask] - trues[mask]) / trues[mask])) * 100
    return rmse, mape

sample_test = test_df.sample(n=2000, random_state=42) if len(test_df) > 2000 else test_df
preds, trues = [], []
for _, row in sample_test.iterrows():
    p = predict_user_based(int(row['user_id']), int(row['movie_id']), k=30)
    preds.append(p)
    trues.append(row['rating'])
rmse_u, mape_u = evaluate_preds(preds, trues)
print('User-based CF RMSE:', rmse_u, 'MAPE:', mape_u)
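
A small worked example of the two metrics, using three made-up predictions, shows how they weight errors differently:

```python
import math

# Hypothetical predictions and true ratings, for illustration only.
preds = [3.5, 4.0, 2.0]
trues = [4.0, 4.0, 1.0]

errors = [p - t for p, t in zip(preds, trues)]

# RMSE: square root of the mean squared error, in rating units.
rmse = math.sqrt(sum(e * e for e in errors) / len(errors))

# MAPE: mean absolute error relative to the true rating, in percent.
mape = 100 * sum(abs(e) / t for e, t in zip(errors, trues)) / len(errors)

print(f"RMSE={rmse:.3f}, MAPE={mape:.1f}%")
```

The one-star miss on a true rating of 1 contributes a 100% error to MAPE, while the half-star miss on a true rating of 4 contributes only 12.5%. MAPE therefore penalizes errors on low ratings more heavily than RMSE does.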


## 5. Item-based Collaborative Filtering

Build item vectors and use cosine similarity between items.

In [None]:

# Item matrix (movies x users)
item_matrix = train_matrix.T
item_ids = item_matrix.index.tolist()
item_matrix_vals = item_matrix.values
item_sim = cosine_similarity(item_matrix_vals)
item_id_to_index = {iid: i for i, iid in enumerate(item_ids)}

def predict_item_based(user_id, movie_id, k=20):
    global_mean = train_df['rating'].mean()
    if user_id not in train_matrix.index or movie_id not in item_id_to_index:
        return global_mean
    # Ratings of this user as an array aligned with item_ids / item_sim
    user_ratings = train_matrix.loc[user_id].values
    item_idx = item_id_to_index[movie_id]
    sims = item_sim[item_idx]
    # Items this user rated in training, excluding the target movie
    rated = np.flatnonzero(user_ratings > 0)
    rated = rated[rated != item_idx]
    if rated.size == 0:
        return global_mean
    # Keep the k rated items most similar to the target movie
    topk = rated[np.argsort(sims[rated])[::-1][:k]]
    num = float(np.dot(sims[topk], user_ratings[topk]))
    den = float(np.abs(sims[topk]).sum())
    if den == 0:
        return global_mean
    return num / den

# Evaluate item-based
preds, trues = [], []
for _, row in sample_test.iterrows():
    p = predict_item_based(int(row['user_id']), int(row['movie_id']), k=30)
    preds.append(p)
    trues.append(row['rating'])
rmse_i, mape_i = evaluate_preds(preds, trues)
print('Item-based CF RMSE:', rmse_i, 'MAPE:', mape_i)
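
The item-based variant swaps the roles of rows and columns: similarities are computed between movies, and the neighbors are the movies the target user has already rated. A minimal sketch on a made-up matrix (all values are illustrative assumptions):

```python
import numpy as np

# Toy ratings: rows are users, columns are movies, 0 means "not rated".
R = np.array([
    [5.0, 4.0, 0.0],   # target user, rating for movie 2 is unknown
    [4.0, 5.0, 1.0],
    [1.0, 2.0, 5.0],
])

items = R.T  # rows are now movies, columns are users
norms = np.linalg.norm(items, axis=1, keepdims=True)
item_sim = (items @ items.T) / (norms @ norms.T)  # movie-movie cosine

user, movie = 0, 2

# Neighbors are the other movies this user has rated.
rated = [j for j in range(R.shape[1]) if j != movie and R[user, j] > 0]
weights = item_sim[movie, rated]
prediction = float(weights @ R[user, rated] / np.abs(weights).sum())

print(f"weights={weights.round(3)}, prediction={prediction:.2f}")
```

With non-negative weights the result is a weighted average of the user's own ratings, so it always lies between the lowest and highest rating that user has given to the neighbor movies.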


## 6. Matrix Factorization — SVD (Surprise)

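Use the Surprise library's SVD algorithm as a matrix factorization baseline. `Dataset.load_from_file` keeps raw user and item ids as strings, and this cell draws its own 80/20 split, independent of `train_df` and `test_df` from section 3, so the SVD scores are not computed on the same test rows as the CF scores above.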

In [None]:

# Prepare Surprise dataset
reader = Reader(line_format='user item rating timestamp', sep='\t')
data = Dataset.load_from_file(os.path.join(DATA_DIR,'u.data'), reader=reader)
trainset, testset = svd_train_test_split(data, test_size=0.2, random_state=42)
algo = SVD(n_factors=50, n_epochs=20, lr_all=0.005, reg_all=0.02, random_state=42)
print('Training SVD (this may take a moment)...')
algo.fit(trainset)
predictions = algo.test(testset)
rmse_svd = accuracy.rmse(predictions, verbose=True)
# compute MAPE
y_true = np.array([pred.r_ui for pred in predictions])
y_pred = np.array([pred.est for pred in predictions])
mape_svd = np.mean(np.abs((y_pred - y_true) / y_true)) * 100
print('SVD MAPE:', mape_svd)
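
Surprise's `SVD` predicts a rating as the global mean plus a user bias, an item bias, and the dot product of a user and an item factor vector, with parameters learned by stochastic gradient descent. The sketch below is a toy re-implementation of that idea in NumPy, not Surprise's code. The six ratings, the two factors, and the 200 epochs are made up for illustration; `lr` and `reg` mirror `lr_all` and `reg_all` from the cell above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy (user, item, rating) triples, for illustration only.
ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0),
           (1, 2, 1.0), (2, 1, 2.0), (2, 2, 5.0)]
n_users, n_items, n_factors = 3, 3, 2
lr, reg = 0.005, 0.02

mu = float(np.mean([r for _, _, r in ratings]))  # global mean
bu, bi = np.zeros(n_users), np.zeros(n_items)    # user and item biases
P = rng.normal(0, 0.1, (n_users, n_factors))     # user factors
Q = rng.normal(0, 0.1, (n_items, n_factors))     # item factors

def predict(u, i):
    return mu + bu[u] + bi[i] + Q[i] @ P[u]

def train_rmse():
    return float(np.sqrt(np.mean([(r - predict(u, i)) ** 2 for u, i, r in ratings])))

rmse_before = train_rmse()
for _ in range(200):
    for u, i, r in ratings:
        err = r - predict(u, i)
        # Regularized SGD step on each parameter touched by this rating.
        bu[u] += lr * (err - reg * bu[u])
        bi[i] += lr * (err - reg * bi[i])
        P[u], Q[i] = (P[u] + lr * (err * Q[i] - reg * P[u]),
                      Q[i] + lr * (err * P[u] - reg * Q[i]))
rmse_after = train_rmse()

print(f"training RMSE before={rmse_before:.3f}, after={rmse_after:.3f}")
```

The error reported here is on the training triples only, so it says nothing about generalization; the cell above measures that on a held-out test set.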


## 7. Top-N Recommendations (SVD)

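Rank the movies a target user has not rated by predicted rating and return the top N. Seen movies are read from `train_df`, whose split differs from the Surprise trainset, so some recommended titles may have been rated by the user in the SVD training data.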

In [None]:

def get_unseen_items(train_df, user_id):
    seen = set(train_df[train_df.user_id == user_id].movie_id)
    all_items = set(train_df.movie_id.unique())
    return list(all_items - seen)

def recommend_svd(algo, train_df, user_id, movies_df, topn=10):
    unseen = get_unseen_items(train_df, user_id)
    preds = []
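    for iid in unseen:
        # Raw ids are strings in this Surprise model because the data was loaded
        # from a file; integer ids would be treated as unknown users and items.
        est = algo.predict(str(user_id), str(iid)).est
        preds.append((iid, est))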
    preds.sort(key=lambda x: x[1], reverse=True)
    res = []
    for mid, score in preds[:topn]:
        title = movies_df[movies_df.movie_id == mid]['title'].values[0]
        res.append((mid, title, score))
    return res

# Example recommendations for user 1
print('Top 10 recommendations for user 1:')
recos = recommend_svd(algo, train_df, 1, movies, topn=10)
for mid, title, score in recos:
    print(f"{title} (movie_id={mid}) — predicted {score:.2f}")
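
The selection step itself is independent of the model: score every unseen item, then keep the N best. A full sort is fine for the roughly 1,700 movies here. For larger catalogs, `heapq.nlargest` avoids sorting the whole list. The sketch below uses a hypothetical `top_n` helper and made-up scores in place of model predictions.

```python
import heapq

def top_n(score, candidates, n=10):
    """Return the n candidates with the highest score, best first."""
    return heapq.nlargest(n, candidates, key=score)

# Hypothetical data: item ids, the items a user has seen, and predicted
# ratings for the rest (stand-ins for model estimates).
all_items = {1, 2, 3, 4, 5}
seen = {1, 3}
fake_scores = {2: 3.9, 4: 4.6, 5: 4.1}

recs = top_n(fake_scores.get, all_items - seen, n=2)
print(recs)
```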


## 8. Ranking Metrics: Precision@K & Recall@K

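Evaluate top-K recommendation quality by treating ratings >= 4 as relevant. Precision is computed over all k slots, so a user with fewer than k test ratings cannot reach a precision of 1.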

In [None]:

def precision_recall_at_k(predictions, k=10, threshold=4.0):
    user_preds = defaultdict(list)
    for uid, iid, true_r, est in predictions:
        user_preds[uid].append((iid, est, true_r))
    precisions = []
    recalls = []
    for uid, items in user_preds.items():
        items.sort(key=lambda x: x[1], reverse=True)
        topk = items[:k]
        n_rel = sum(1 for _, _, true_r in items if true_r >= threshold)
        n_rel_k = sum(1 for _, _, true_r in topk if true_r >= threshold)
        precisions.append(n_rel_k / k if k > 0 else 0)
        recalls.append(n_rel_k / n_rel if n_rel > 0 else 0)
    return np.mean(precisions), np.mean(recalls)

# Build a simple predictions list from Surprise predictions
surprise_preds = [(pred.uid, pred.iid, pred.r_ui, pred.est) for pred in predictions]
prec, rec = precision_recall_at_k(surprise_preds, k=10, threshold=4.0)
print('Precision@10:', prec, 'Recall@10:', rec)
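
A worked example for a single hypothetical user with five test items and k=3 shows how the two numbers are derived:

```python
# (item, estimated rating, true rating) for one made-up user.
items = [("a", 4.8, 5.0), ("b", 4.5, 3.0), ("c", 4.1, 4.0),
         ("d", 3.0, 4.0), ("e", 2.5, 2.0)]
k, threshold = 3, 4.0

# Rank by estimated rating and keep the k highest.
ranked = sorted(items, key=lambda x: x[1], reverse=True)
topk = ranked[:k]

n_rel = sum(true_r >= threshold for _, _, true_r in ranked)   # relevant overall
n_rel_k = sum(true_r >= threshold for _, _, true_r in topk)   # relevant in top k

precision = n_rel_k / k
recall = n_rel_k / n_rel
print(f"precision@{k}={precision:.2f}, recall@{k}={recall:.2f}")
```

Items a, c and d are relevant. The top 3 by estimate are a, b and c, so two of the three recommended items are relevant and two of the three relevant items are recovered.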


## 9. Visualizing item latent space (PCA)

Project item factors to 2D to visualize clusters.

In [None]:

# Extract item factors from the Surprise SVD model. `algo.qi` is the item-factor
# matrix; its rows are indexed by Surprise inner item ids, not raw MovieLens ids.
try:
    fitted = algo.trainset
    inner_ids = list(fitted.all_items())
    # Raw ids (kept for labeling points later), via the public Trainset API
    raw_ids = [int(fitted.to_raw_iid(inner)) for inner in inner_ids]
    item_latents = np.asarray(algo.qi)[inner_ids]
    pca = PCA(n_components=2)
    coords = pca.fit_transform(item_latents)
    plt.figure(figsize=(8,6))
    plt.scatter(coords[:,0], coords[:,1], alpha=0.6, s=20)
    plt.title('Item latent space (PCA projection)')
    plt.xlabel('PC1'); plt.ylabel('PC2')
    plt.show()
except Exception as e:
    print('Could not extract item latents:', e)
    print('This can happen if the SVD model has not been fitted yet; run section 6 first.')
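
For intuition about the projection, PCA to two dimensions amounts to centering the data and projecting it onto the two directions of largest variance, which the singular value decomposition provides. The sketch below does this in plain NumPy on random data standing in for the item factors. It should agree with `sklearn`'s `PCA(n_components=2)` up to the sign of each axis, and is meant as an illustration rather than a replacement.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))  # stand-in for the item-factor matrix

# Center each column, then take the SVD of the centered data.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

# Rows of Vt are principal directions; project onto the first two.
coords = Xc @ Vt[:2].T

# Share of total variance captured by each of the two components.
explained = (S[:2] ** 2) / (S ** 2).sum()
print(coords.shape, explained.round(3))
```

A low explained share means the 2D scatter hides most of the structure in the latent space, so apparent clusters should be read with caution.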


## 10. Save SVD model

Save the trained SVD to `models/svd_model.pkl` so the Streamlit app can load it.

In [None]:

os.makedirs('models', exist_ok=True)
with open('models/svd_model.pkl', 'wb') as f:
    pickle.dump(algo, f)
print('Saved SVD model to models/svd_model.pkl')
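
The loading side is the mirror image of this cell. The sketch below round-trips a stand-in object through pickle; `save_model` and `load_model` are hypothetical helper names, and the dictionary is a placeholder for the trained `SVD` object.

```python
import os
import pickle
import tempfile

def save_model(model, path):
    """Pickle a model to path, creating the parent directory if needed."""
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "wb") as f:
        pickle.dump(model, f)

def load_model(path):
    """Load a pickled model. Only call this on files you trust."""
    with open(path, "rb") as f:
        return pickle.load(f)

# Round trip in a temporary directory with a placeholder object.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "models", "svd_model.pkl")
    save_model({"n_factors": 50}, path)
    restored = load_model(path)

print(restored)
```

Unpickling can execute arbitrary code, so load only files you created yourself. Loading the real model also requires the `surprise` package to be importable in the environment that reads the file.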


## 11. Streamlit demo (instructions)

The Streamlit demo `app/streamlit_app.py` in the repo loads `models/svd_model.pkl` and serves Top-N recommendations. Run:

```bash
streamlit run app/streamlit_app.py
```

If you wish, you can containerize the app with Docker or deploy on Streamlit Community Cloud.

## 12. Next steps & improvements

- Hyperparameter tuning (Surprise GridSearchCV)
- Model implicit feedback (e.g. views or clicks) with an ALS-based recommender
- Hybrid recommender combining content (genres) and collaborative signals
- Improve cold-start handling (content-based fallback)
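
As a rough illustration of the content-based fallback, the genre flags in `u.item` can serve as item vectors: build a profile from the movies a user liked and rank the remaining movies by cosine similarity to it. The genre rows and the liked list below are made up, and the approach is one possible sketch, not a tuned method.

```python
import numpy as np

# Hypothetical one-hot genre rows, columns: [Action, Comedy, Drama].
genres = np.array([
    [1, 0, 0],   # movie 0
    [1, 1, 0],   # movie 1
    [0, 0, 1],   # movie 2
    [0, 1, 0],   # movie 3
], dtype=float)

liked = [0]  # movies the user rated highly

# User profile: average genre vector of the liked movies.
profile = genres[liked].mean(axis=0)

# Cosine similarity between every movie and the profile.
norms = np.linalg.norm(genres, axis=1) * np.linalg.norm(profile)
scores = genres @ profile / norms

scores[liked] = -np.inf  # never recommend what the user already rated
best = int(np.argmax(scores))
print(f"recommended movie: {best}")
```

Because this needs only genre metadata, it can score a movie that has no ratings yet, which the collaborative models above cannot do.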

---

Run the cells sequentially. If a cell fails because of missing data or packages, install the package named in the error message or place the dataset under `data/ml-100k/`.