<a href="https://colab.research.google.com/github/NIC-DE/MOOVIE-RECOMMENDER/blob/main/Movie_Recommender_CV_Clean.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Movie Recommender (Collaborative Filtering) — MovieLens 100K

This notebook optionally downloads **MovieLens 100K** from Kaggle, trains a **Surprise SVD** model, evaluates it with RMSE on a held-out test split, generates **top‑N recommendations** for a user, and **exports** the trained model with `pickle` for later use.

> Tip: In Colab, run cells top‑to‑bottom after a runtime restart.


In [1]:
# --- 1) Install dependencies (run once per runtime) ---
# scikit-surprise is compiled against NumPy 1.x on Colab, so we pin NumPy < 2.
!pip -q install "numpy<2" pandas scikit-surprise kaggle


## Kaggle API setup (only if you want auto-download)

If you already have the dataset folder in `/content/data/ml-100k/`, you can skip this step.

1) On Kaggle: **Account → Create New Token**  
2) Upload the downloaded `kaggle.json` into Colab (left sidebar → Files → Upload).
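
Before installing the token, it can help to confirm that the uploaded file really is a Kaggle token. The sketch below assumes the legacy `kaggle.json` layout (a JSON object with `username` and `key` string fields); `check_kaggle_token` is a hypothetical helper name, not part of the Kaggle CLI.

```python
import json
from pathlib import Path


def check_kaggle_token(path):
    """Return True if `path` holds a JSON object with non-empty 'username' and 'key'.

    Assumption: the legacy kaggle.json layout. Hypothetical helper, not a Kaggle API.
    """
    try:
        token = json.loads(Path(path).read_text(encoding="utf-8"))
    except (OSError, ValueError):
        return False  # missing/unreadable file, or not valid JSON
    if not isinstance(token, dict):
        return False
    # Both fields must be present and be non-empty strings.
    return all(isinstance(token.get(k), str) and token[k] for k in ("username", "key"))
```

In Colab this could be called as `check_kaggle_token("/content/kaggle.json")` before the credentials cell runs.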

In [2]:
# --- 2) Configure Kaggle credentials (expects kaggle.json uploaded to /content) ---
from pathlib import Path
import os

# Some browsers rename it to "kaggle (1).json" etc. We'll auto-detect.
candidates = list(Path("/content").glob("kaggle*.json"))
assert candidates, (
    "I can't see kaggle.json in /content. Upload it in the Files panel first."
)

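# If several copies were uploaded, use the most recently modified one.
src = max(candidates, key=lambda p: p.stat().st_mtime)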
kaggle_dir = Path("/root/.kaggle")
kaggle_dir.mkdir(parents=True, exist_ok=True)

dst = kaggle_dir / "kaggle.json"
dst.write_bytes(src.read_bytes())
os.chmod(dst, 0o600)

print("✅ Kaggle token installed at:", dst)


✅ Kaggle token installed at: /root/.kaggle/kaggle.json


In [3]:
# (Optional) Mount Google Drive so the exported model can be copied there later.
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [4]:
# --- 3) Download MovieLens 100K from Kaggle (creates /content/data/ml-100k/) ---
# Re-running is harmless; skip this cell if the folder already exists.
from pathlib import Path

DATA_DIR = Path("/content/data")
DATA_DIR.mkdir(parents=True, exist_ok=True)

# Kaggle dataset identifier, in "owner/dataset-slug" form
DATASET = "prajitdatta/movielens-100k-dataset"

!kaggle datasets download -d {DATASET} -p {DATA_DIR} --unzip

# Verify expected folder
ml_path = DATA_DIR / "ml-100k"
assert ml_path.exists(), f"Expected folder not found: {ml_path}"
print("✅ Dataset folder:", ml_path)


Dataset URL: https://www.kaggle.com/datasets/prajitdatta/movielens-100k-dataset
License(s): CC0-1.0
Downloading movielens-100k-dataset.zip to /content/data
  0% 0.00/4.77M [00:00<?, ?B/s]
100% 4.77M/4.77M [00:00<00:00, 1.11GB/s]
✅ Dataset folder: /content/data/ml-100k


## Load data

MovieLens 100K main files:
- `u.data` : user ID, item ID (movie ID), rating, timestamp (tab-separated, no header row)
- `u.item` : movie metadata (title, genres, etc.; pipe-separated, Latin-1 encoded)
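
To make the `u.data` layout concrete, here is a minimal standard-library sketch that parses two sample rows (the same values that appear in the ratings preview). `parse_ratings` and the `rated_at` field are illustrative names, and the timestamp is treated as Unix seconds in UTC.

```python
import csv
import io
from datetime import datetime, timezone

# Two sample rows in the u.data layout: user_id, item_id, rating, timestamp.
SAMPLE = "196\t242\t3\t881250949\n186\t302\t3\t891717742\n"


def parse_ratings(text):
    """Parse tab-separated rating rows into a list of dicts."""
    rows = []
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    for user_id, item_id, rating, ts in reader:
        rows.append({
            "user_id": int(user_id),
            "item_id": int(item_id),
            "rating": int(rating),
            # Assumption: timestamps are Unix seconds, interpreted as UTC.
            "rated_at": datetime.fromtimestamp(int(ts), tz=timezone.utc),
        })
    return rows


rows = parse_ratings(SAMPLE)
print(rows[0]["user_id"], rows[0]["item_id"], rows[0]["rating"], rows[0]["rated_at"].date())
```

The notebook itself uses `pd.read_csv` for this; the sketch only shows what each column holds.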

In [5]:
# --- 4) Load ratings + movie titles ---
import pandas as pd
from pathlib import Path

ml_path = Path("/content/data/ml-100k")

# Ratings
ratings = pd.read_csv(
    ml_path / "u.data",
    sep="\t",
    names=["user_id", "item_id", "rating", "timestamp"],
)

# Movies (latin-1 encoding in MovieLens 100K)
movies = pd.read_csv(
    ml_path / "u.item",
    sep="|",
    encoding="latin-1",
    header=None,
    usecols=[0, 1],
    names=["item_id", "title"],
)

print("ratings:", ratings.shape)
display(ratings.head())

print("movies:", movies.shape)
display(movies.head())


ratings: (100000, 4)


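|   | user_id | item_id | rating | timestamp |
|---|---------|---------|--------|-----------|
| 0 | 196 | 242 | 3 | 881250949 |
| 1 | 186 | 302 | 3 | 891717742 |
| 2 | 22 | 377 | 1 | 878887116 |
| 3 | 244 | 51 | 2 | 880606923 |
| 4 | 166 | 346 | 1 | 886397596 |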


movies: (1682, 2)


|   | item_id | title |
|---|---------|-------|
| 0 | 1 | Toy Story (1995) |
| 1 | 2 | GoldenEye (1995) |
| 2 | 3 | Four Rooms (1995) |
| 3 | 4 | Get Shorty (1995) |
| 4 | 5 | Copycat (1995) |


In [6]:
# --- 5) Quick sanity checks ---
print("Unique users:", ratings["user_id"].nunique())
print("Unique items:", ratings["item_id"].nunique())
print("Rating scale:", ratings["rating"].min(), "→", ratings["rating"].max())


Unique users: 943
Unique items: 1682
Rating scale: 1 → 5
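
The counts above imply a very sparse user-item matrix, which is the setting collaborative filtering is designed for. A quick arithmetic sketch, with the three counts copied from the outputs above:

```python
n_users, n_items, n_ratings = 943, 1682, 100_000

# Fraction of the user-item matrix that has an observed rating.
density = n_ratings / (n_users * n_items)
sparsity = 1.0 - density

print(f"density:  {density:.2%}")
print(f"sparsity: {sparsity:.2%}")
print(f"average ratings per user: {n_ratings / n_users:.1f}")
```

Roughly 6.3% of all possible (user, movie) pairs have a rating; the model's job is to estimate the rest.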


## Train a recommender (Surprise SVD)

We train SVD on 80% of the ratings and report RMSE on the held-out 20%.
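
For intuition, Surprise's SVD in its default (biased) mode estimates a rating as the global mean plus a user bias, an item bias, and the dot product of the user and item factor vectors, then clips the result to the rating scale. The sketch below mirrors that rule in plain Python; `predict_rating` and the toy numbers are illustrative, not values from the trained model.

```python
def predict_rating(global_mean, user_bias, item_bias, user_factors, item_factors,
                   rating_scale=(1, 5)):
    """Biased matrix-factorization estimate: mu + b_u + b_i + p_u . q_i, clipped."""
    dot = sum(p * q for p, q in zip(user_factors, item_factors))
    est = global_mean + user_bias + item_bias + dot
    low, high = rating_scale
    return min(high, max(low, est))


# Toy values, invented for illustration only.
est = predict_rating(3.5, 0.2, -0.1, [0.5, -0.3], [0.4, 0.1])
print(round(est, 2))
```

Training learns the biases and factor vectors by stochastic gradient descent; `n_factors` in the next cell sets the length of those vectors.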

In [7]:
# --- 6) Train / test split + model training ---
from surprise import Dataset, Reader, SVD
from surprise.model_selection import train_test_split
from surprise import accuracy

# Surprise needs a Reader that declares the rating scale
reader = Reader(rating_scale=(1, 5))

# Surprise Dataset from a (user, item, rating) dataframe
data = Dataset.load_from_df(ratings[["user_id", "item_id", "rating"]], reader)

# Split
trainset, testset = train_test_split(data, test_size=0.2, random_state=42)

# Model: SVD (matrix factorization)
model = SVD(
    n_factors=100,     # latent dimensions
    n_epochs=20,       # training epochs
    lr_all=0.005,      # learning rate
    reg_all=0.02,      # regularization
    random_state=42,
)

model.fit(trainset)

# Evaluate
predictions = model.test(testset)
rmse = accuracy.rmse(predictions, verbose=True)
print("✅ RMSE:", rmse)


RMSE: 0.9352
✅ RMSE: 0.935171451026933
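
RMSE is the square root of the mean squared difference between true and predicted ratings, so a value near 0.94 means predictions are typically off by a little under one star. A minimal pure-Python sketch of the metric, using invented toy pairs:

```python
import math


def rmse(pairs):
    """Root mean squared error over (true_rating, estimated_rating) pairs."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("need at least one prediction")
    return math.sqrt(sum((true - est) ** 2 for true, est in pairs) / len(pairs))


# Toy predictions, invented for illustration.
toy = [(4, 3.5), (2, 2.5), (5, 4.0)]
print(round(rmse(toy), 4))
```

Each Surprise prediction carries `r_ui` (true rating) and `est` (estimate), so `rmse((p.r_ui, p.est) for p in predictions)` should agree with `accuracy.rmse(predictions)`.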


## Top‑N recommendations for a user

Logic:
1) List all movies in the catalog.
2) Remove the movies the user has already rated.
3) Predict a rating for each remaining movie.
4) Return the top‑N by predicted rating, with titles.

In [8]:
# --- 7) Recommendation helper ---
import numpy as np

def recommend_movies(model, ratings_df, movies_df, user_id: int, n: int = 10):
    """Return top-N movie recommendations for a given user_id."""
    all_items = movies_df["item_id"].unique()

    # Movies already rated by the user
    rated_items = ratings_df.loc[ratings_df["user_id"] == user_id, "item_id"].unique()

    # Candidate movies: not yet rated
    candidates = np.setdiff1d(all_items, rated_items)

    # Predict scores for each candidate
    preds = []
    for item_id in candidates:
        est = model.predict(user_id, int(item_id)).est  # predicted rating
        preds.append((int(item_id), est))

    # Sort by predicted rating
    preds.sort(key=lambda x: x[1], reverse=True)
    top = preds[:n]

    # Attach titles
    top_df = pd.DataFrame(top, columns=["item_id", "pred_rating"])
    top_df = top_df.merge(movies_df, on="item_id", how="left")

    return top_df[["item_id", "title", "pred_rating"]]


In [9]:
# --- 8) Demo recommendations ---
USER_ID = 1
topn = recommend_movies(model, ratings, movies, user_id=USER_ID, n=10)

print(f"Top recommendations for user {USER_ID}:")
display(topn)


Top recommendations for user 1:


|   | item_id | title | pred_rating |
|---|---------|-------|-------------|
| 0 | 474 | Dr. Strangelove or: How I Learned to Stop Worr... | 4.93723 |
| 1 | 357 | One Flew Over the Cuckoo's Nest (1975) | 4.936512 |
| 2 | 603 | Rear Window (1954) | 4.925143 |
| 3 | 318 | Schindler's List (1993) | 4.690363 |
| 4 | 513 | Third Man, The (1949) | 4.67677 |
| 5 | 427 | To Kill a Mockingbird (1962) | 4.669902 |
| 6 | 483 | Casablanca (1942) | 4.639861 |
| 7 | 657 | Manchurian Candidate, The (1962) | 4.553324 |
| 8 | 408 | Close Shave, A (1995) | 4.52726 |
| 9 | 302 | L.A. Confidential (1997) | 4.520707 |
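
The helper above sorts every candidate and then slices. For a larger catalog, one alternative is to keep only the best `n` with `heapq.nlargest`. This is a sketch of that variant, not the notebook's method; `top_n` and the toy scores are illustrative.

```python
import heapq


def top_n(candidates, score, n=10):
    """Return the n (item_id, score) pairs with the highest score, best first."""
    scored = ((item_id, score(item_id)) for item_id in candidates)
    return heapq.nlargest(n, scored, key=lambda pair: pair[1])


# Stand-in scorer, invented for illustration; in the notebook this role would
# be played by something like `lambda i: model.predict(user_id, i).est`.
toy_scores = {10: 3.1, 20: 4.8, 30: 2.0, 40: 4.2}
print(top_n(toy_scores, toy_scores.get, n=2))
```

With about 1,700 movies the full sort is already fast, so this mainly matters for much larger item sets.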


## Export the model

This saves a `pickle` file in `/content/` so you can:
- download it from Colab
- or copy it to Google Drive

In [10]:
# --- 9) Export trained model (pickle) ---
import pickle
from pathlib import Path

OUT_PATH = Path("/content/movie_recommender_svd.pkl")

with open(OUT_PATH, "wb") as f:
    pickle.dump(model, f)

print("✅ Saved model to:", OUT_PATH)


✅ Saved model to: /content/movie_recommender_svd.pkl
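
To use the export later, the file can be read back with `pickle.load`. A minimal sketch follows; `load_model` is a hypothetical helper name, and it assumes `scikit-surprise` is installed in the loading environment (ideally the same version used for training).

```python
import pickle
from pathlib import Path


def load_model(path):
    """Load a pickled object from `path`.

    Only unpickle files you created or trust: unpickling can execute arbitrary code.
    """
    with Path(path).open("rb") as f:
        return pickle.load(f)
```

For example, `model = load_model("/content/movie_recommender_svd.pkl")` followed by `model.predict(1, 474).est` should return an estimated rating for user 1 and item 474.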


In [None]:
# (Optional) Copy export to Google Drive
# from google.colab import drive
# drive.mount('/content/drive')
# !cp /content/movie_recommender_svd.pkl /content/drive/MyDrive/
