# Movie Recommender Pipeline
**MALDASC Project - Alexander Sallmann**

## Setup: paths and helper function

Set the paths (adjust them if your project layout differs) and define a helper that runs the pipeline scripts.

In [2]:
from pathlib import Path
import sys
import subprocess

# Project root (current working directory)
ROOT = Path.cwd()
SCRIPTS = ROOT / "scripts"

ML_DIR = ROOT / "data" / "ml-32m"
TMDB_IDS = ROOT / "data" / "tmdb_ids_to_scrape.txt"
TMDB_DB = ROOT / "data" / "tmdb.sqlite"
INTERACTIONS = ROOT / "data" / "dataset" / "interactions.csv"
MOVIE_FEATURES = ROOT / "feature_store" / "movie_features.joblib"
MODEL_PATH = ROOT / "models" / "ridge_rating_model.joblib"

def run(cmd):
    """Run a command, show stdout + stderr, and raise a clear error if it fails."""
    print("$", " ".join(str(c) for c in cmd))
    result = subprocess.run(cmd, capture_output=True, text=True)

    if result.stdout:
        print("STDOUT:\n", result.stdout)
    if result.stderr:
        print("STDERR:\n", result.stderr)

    if result.returncode != 0:
        raise RuntimeError(f"Command failed with exit code {result.returncode}")


## Step 1 – Select movies to scrape (`make_tmdb_id_list.py`)

This script reads **MovieLens `ratings.csv` and `links.csv`** and:

- counts how many ratings each `movieId` has,
- keeps the **TOP N** most-rated movies,
- writes their `tmdbId` (one per line) to a text file (`tmdb_ids_to_scrape.txt`).

**Inputs**:
- `--ml-dir`: directory containing `ratings.csv` and `links.csv`.
- `--top-n`: how many movies to keep, sorted by number of ratings.
- `--out`: output file whose lines are TMDB movie IDs.
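
The selection logic described above can be sketched as follows. This is a minimal illustration, not the actual `make_tmdb_id_list.py`: the function name `top_n_tmdb_ids` and the sample rows are made up, and skipping movies without a `tmdbId` is an assumption about how missing links might be handled.

```python
import csv
import io
from collections import Counter

# Tiny in-memory stand-ins for ratings.csv and links.csv (made-up rows).
RATINGS_CSV = """userId,movieId,rating,timestamp
1,10,4.0,1000
2,10,3.5,1001
3,10,5.0,1002
1,20,2.0,1003
2,20,4.5,1004
1,30,3.0,1005
"""
LINKS_CSV = """movieId,imdbId,tmdbId
10,0111161,278
20,0068646,238
30,0000001,
"""

def top_n_tmdb_ids(ratings_file, links_file, top_n):
    """Return the tmdbIds of the top_n most-rated movies, most-rated first."""
    counts = Counter(row["movieId"] for row in csv.DictReader(ratings_file))
    # Map movieId -> tmdbId, skipping links with an empty tmdbId (assumption).
    tmdb_by_movie = {
        row["movieId"]: row["tmdbId"]
        for row in csv.DictReader(links_file)
        if row["tmdbId"]
    }
    ids = []
    for movie_id, _ in counts.most_common():
        if movie_id in tmdb_by_movie:
            ids.append(tmdb_by_movie[movie_id])
        if len(ids) == top_n:
            break
    return ids

tmdb_ids = top_n_tmdb_ids(io.StringIO(RATINGS_CSV), io.StringIO(LINKS_CSV), top_n=2)
print("\n".join(tmdb_ids))  # one TMDB ID per line, as in the output file
```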

In [3]:
TOP_N = 10000

TMDB_IDS.parent.mkdir(parents=True, exist_ok=True)
run([
    sys.executable,
    str(SCRIPTS / "make_tmdb_id_list.py"),
    "--ml-dir", str(ML_DIR),
    "--top-n", str(TOP_N),
    "--out", str(TMDB_IDS),
])

# Runtime <10 seconds


$ c:\HSLU\MALDASC\MovieRatingPredictor\venv\Scripts\python.exe c:\HSLU\MALDASC\MovieRatingPredictor\scripts\make_tmdb_id_list.py --ml-dir c:\HSLU\MALDASC\MovieRatingPredictor\data\ml-32m --top-n 10000 --out c:\HSLU\MALDASC\MovieRatingPredictor\data\tmdb_ids_to_scrape.txt


## Step 2 – Scrape movies from TMDB (`tmdb_scraper.py`)

This script calls the **TMDB API** for each ID in `tmdb_ids_to_scrape.txt` and stores
the results in `tmdb.sqlite`.

It maintains the following tables:
- `movies` (title, runtime, popularity, etc.)
- `genres` and `movie_genres`
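
To make the table layout more concrete, here is a hypothetical schema built in an in-memory SQLite database. Only the three table names come from the description above; every column name, constraint, and sample row is an assumption for illustration and may differ from what `tmdb_scraper.py` actually creates.

```python
import sqlite3

# Hypothetical schema: only the table names come from the notebook text;
# every column name below is an assumption made for illustration.
SCHEMA = """
CREATE TABLE IF NOT EXISTS movies (
    id INTEGER PRIMARY KEY,
    tmdb_id INTEGER UNIQUE NOT NULL,
    title TEXT,
    runtime INTEGER,
    popularity REAL
);
CREATE TABLE IF NOT EXISTS genres (
    id INTEGER PRIMARY KEY,
    name TEXT UNIQUE NOT NULL
);
CREATE TABLE IF NOT EXISTS movie_genres (
    movie_id INTEGER NOT NULL REFERENCES movies(id),
    genre_id INTEGER NOT NULL REFERENCES genres(id),
    PRIMARY KEY (movie_id, genre_id)
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

# Insert one made-up movie and link it to two genres.
conn.execute(
    "INSERT INTO movies (tmdb_id, title, runtime, popularity) VALUES (?, ?, ?, ?)",
    (1, "Example Movie", 100, 1.5),
)
conn.executemany("INSERT INTO genres (name) VALUES (?)", [("Drama",), ("Crime",)])
conn.execute(
    "INSERT INTO movie_genres (movie_id, genre_id) "
    "SELECT m.id, g.id FROM movies m, genres g WHERE m.tmdb_id = 1"
)

# Join back to one row per movie with its genres as a single string.
row = conn.execute(
    """
    SELECT m.title, GROUP_CONCAT(g.name, ' ') AS genres
    FROM movies m
    JOIN movie_genres mg ON mg.movie_id = m.id
    JOIN genres g ON g.id = mg.genre_id
    GROUP BY m.id
    """
).fetchone()
print(row)
```

A many-to-many link table such as `movie_genres` keeps each genre name stored once while allowing a movie to carry several genres.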

**Inputs**:
- `--ids-file`: text file with one TMDB movie ID per line.
- `--db`: path to the SQLite database to create/update.

**Environment**:
- `TMDB_API_KEY` must be set (e.g. in a `.env` file or OS env vars).
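
A script that depends on this variable would typically fail fast when it is missing. The helper below is a sketch of that check; `get_tmdb_api_key` is a hypothetical name, and loading a `.env` file is not shown here.

```python
import os

def get_tmdb_api_key(env=os.environ):
    """Return the TMDB API key, or raise a descriptive error if it is missing."""
    key = env.get("TMDB_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "TMDB_API_KEY is not set. Export it in your shell or add it to a .env file."
        )
    return key

# Demonstration with explicit dictionaries instead of the real environment.
print(get_tmdb_api_key({"TMDB_API_KEY": "dummy-key"}))
try:
    get_tmdb_api_key({})
except RuntimeError as exc:
    print("Error:", exc)
```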

In [None]:
TMDB_DB.parent.mkdir(parents=True, exist_ok=True)
run([
    sys.executable,
    str(SCRIPTS / "tmdb_scraper.py"),
    "--ids-file", str(TMDB_IDS),
    "--db", str(TMDB_DB),
])

# Runtime was <141 minutes

$ c:\HSLU\MALDASC\MovieRatingPredictor\venv\Scripts\python.exe c:\HSLU\MALDASC\MovieRatingPredictor\scripts\tmdb_scraper.py --ids-file c:\HSLU\MALDASC\MovieRatingPredictor\data\tmdb_ids_to_scrape.txt --db c:\HSLU\MALDASC\MovieRatingPredictor\data\tmdb.sqlite


## Step 3 – Build movie features (`build_movie_features.py`)

This script converts the raw movie metadata in `tmdb.sqlite` into a set of features
suitable for ML.

**Text features** (TF–IDF):
- genres (e.g. `Action`, `Drama`, `Comedy`)

**Numeric features**:
- `runtime`
- `vote_avg` (TMDB average rating)
- `log1p_vote_count` (log of number of votes)
- `popularity`
- `inv_recency` (newer movies have higher values)
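
The numeric features could be derived roughly as shown below. The exact definitions used by `build_movie_features.py` are not given in this notebook, so treat this as a sketch: the function name, the `reference_year` parameter, and the `1 / (1 + age)` formula for `inv_recency` are assumptions chosen only so that newer movies receive higher values.

```python
import math

def numeric_features(runtime, vote_avg, vote_count, popularity, release_year,
                     reference_year=2025):
    """Build the five numeric features for one movie (illustrative formulas)."""
    age = max(reference_year - release_year, 0)  # age in years, never negative
    return {
        "runtime": float(runtime),
        "vote_avg": float(vote_avg),
        # log1p(x) = log(1 + x) compresses large vote counts and is defined at 0.
        "log1p_vote_count": math.log1p(vote_count),
        "popularity": float(popularity),
        # Assumed form: 1.0 for a movie released this year, approaching 0 with age.
        "inv_recency": 1.0 / (1.0 + age),
    }

new_movie = numeric_features(120, 7.5, 25_000, 40.0, release_year=2020)
old_movie = numeric_features(95, 8.1, 1_200, 5.0, release_year=1950)
print(new_movie["inv_recency"] > old_movie["inv_recency"])  # True
```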

**Inputs**:
- `--db`: path to `tmdb.sqlite`.
- `--max-features`: max number of genre tokens to keep (TF–IDF vocab size).
- `--min-df`: minimum document frequency for TF–IDF terms.
- `--out`: output `.joblib` file containing movies, features and metadata.
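
The effect of `--max-features` and `--min-df` on the genre vocabulary can be illustrated with a simplified, pure-Python sketch. It is not the script's implementation: `build_vocabulary` is a hypothetical helper, and lowercasing plus whitespace tokenization is an assumption. That assumption would, however, be consistent with tokens such as `genre:science`, `genre:tv`, and `genre:movie` appearing in the Step 6 output, since multi-word genres are split into separate tokens.

```python
from collections import Counter

def build_vocabulary(documents, max_features, min_df):
    """Return the sorted vocabulary after min_df and max_features pruning."""
    tokenized = [doc.lower().split() for doc in documents]
    # Document frequency: in how many documents does each token occur?
    doc_freq = Counter(tok for toks in tokenized for tok in set(toks))
    # Term frequency: how often does each token occur across the whole corpus?
    term_freq = Counter(tok for toks in tokenized for tok in toks)
    kept = [tok for tok in doc_freq if doc_freq[tok] >= min_df]
    # Keep the max_features most frequent terms; break ties alphabetically.
    kept.sort(key=lambda tok: (-term_freq[tok], tok))
    return sorted(kept[:max_features])

# Made-up genre strings, one "document" per movie.
genre_docs = [
    "Action Thriller",
    "Drama",
    "Science Fiction Action",
    "TV Movie Drama",
    "Drama Western",
]

print(build_vocabulary(genre_docs, max_features=50, min_df=1))
print(build_vocabulary(genre_docs, max_features=50, min_df=2))  # ['action', 'drama']
print(build_vocabulary(genre_docs, max_features=1, min_df=1))   # ['drama']
```

With only a few dozen distinct genre tokens, `--max-features 50` and `--min-df 1` effectively keep the full vocabulary.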


In [None]:
MAX_FEATURES = 50
MIN_DF = 1

MOVIE_FEATURES.parent.mkdir(parents=True, exist_ok=True)
run([
    sys.executable,
    str(SCRIPTS / "build_movie_features.py"),
    "--db", str(TMDB_DB),
    "--max-features", str(MAX_FEATURES),
    "--min-df", str(MIN_DF),
    "--out", str(MOVIE_FEATURES),
])

# Runtime <10 seconds


$ c:\HSLU\MALDASC\MovieRatingPredictor\venv\Scripts\python.exe c:\HSLU\MALDASC\MovieRatingPredictor\scripts\build_movie_features.py --db c:\HSLU\MALDASC\MovieRatingPredictor\data\tmdb.sqlite --max-features 50 --min-df 1 --out c:\HSLU\MALDASC\MovieRatingPredictor\feature_store\movie_features.joblib


## Step 4 – Build interactions (`movielens_to_interactions.py`)

This script links MovieLens ratings to the movies in `tmdb.sqlite`.
It produces a table of **interactions** with columns:

- `userId`
- `movie_rowid` (primary key in `tmdb.sqlite`)
- `tmdb_id`
- `rating` (1–5)
- `timestamp`
- `title`

**Inputs**:
- `--ml-dir`: MovieLens directory (with `ratings.csv`, `links.csv`).
- `--sqlite`: path to `tmdb.sqlite`.
- `--min-user-ratings`: drop users with fewer ratings than this.
- `--min-movie-ratings`: drop movies with fewer ratings than this.
- `--out`: output `.csv` or `.parquet` file.

You can use the `min-*` parameters to filter out very sparse users or movies.
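
As an illustration of what these thresholds do, the sketch below filters a handful of made-up `(userId, movie, rating)` tuples. The helper name `filter_sparse` and the order of the two filters (movies first, then users, one pass each) are assumptions; the actual script may apply them differently.

```python
from collections import Counter

def filter_sparse(interactions, min_user_ratings, min_movie_ratings):
    """Drop movies, then users, that have too few ratings (one pass each)."""
    movie_counts = Counter(movie for _, movie, _ in interactions)
    kept = [row for row in interactions if movie_counts[row[1]] >= min_movie_ratings]
    # Count users only on the rows that survived the movie filter.
    user_counts = Counter(user for user, _, _ in kept)
    return [row for row in kept if user_counts[row[0]] >= min_user_ratings]

# Made-up interactions: (userId, movie, rating).
interactions = [
    (1, "A", 4.0), (1, "B", 3.0),
    (2, "A", 5.0), (2, "B", 2.5),
    (3, "A", 4.5), (3, "C", 1.0),
]

filtered = filter_sparse(interactions, min_user_ratings=2, min_movie_ratings=2)
print(filtered)
```

Here movie `C` is removed because it has a single rating, which in turn leaves user `3` with only one rating, so that user is removed as well.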

In [None]:
MIN_USER_RATINGS = 5
MIN_MOVIE_RATINGS = 5

INTERACTIONS.parent.mkdir(parents=True, exist_ok=True)
run([
    sys.executable,
    str(SCRIPTS / "movielens_to_interactions.py"),
    "--ml-dir", str(ML_DIR),
    "--sqlite", str(TMDB_DB),
    "--min-user-ratings", str(MIN_USER_RATINGS),
    "--min-movie-ratings", str(MIN_MOVIE_RATINGS),
    "--out", str(INTERACTIONS),
])

# Runtime <2 minutes


$ c:\HSLU\MALDASC\MovieRatingPredictor\venv\Scripts\python.exe c:\HSLU\MALDASC\MovieRatingPredictor\scripts\movielens_to_interactions.py --ml-dir c:\HSLU\MALDASC\MovieRatingPredictor\data\ml-32m --sqlite c:\HSLU\MALDASC\MovieRatingPredictor\data\tmdb.sqlite --min-user-ratings 5 --min-movie-ratings 5 --out c:\HSLU\MALDASC\MovieRatingPredictor\data\dataset\interactions.csv


## Step 5 – Train rating prediction model (`train_model.py`)

This script trains a **Ridge regression** model that predicts a user's rating (1–5)
from the following features:

- movie TF–IDF features (genres),
- movie numeric features (runtime, vote_avg, popularity, etc.),
- the user's mean rating (per-user bias feature).
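
The per-user bias feature and the way the three feature groups combine into one row can be sketched as follows. The helper names and sample values are hypothetical, and the real `train_model.py` may build its design matrix differently.

```python
from collections import defaultdict

def user_mean_ratings(interactions):
    """Map each userId to the mean of that user's ratings."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for user_id, rating in interactions:
        totals[user_id] += rating
        counts[user_id] += 1
    return {user_id: totals[user_id] / counts[user_id] for user_id in totals}

def design_row(genre_tfidf, numeric, user_mean):
    """Concatenate the three feature groups into one flat feature row."""
    return list(genre_tfidf) + list(numeric) + [user_mean]

# Made-up (userId, rating) pairs and feature values.
ratings = [(1, 4.0), (1, 3.0), (2, 5.0)]
means = user_mean_ratings(ratings)
row = design_row([0.0, 1.0], [0.5, 7.2], means[1])
print(means)  # {1: 3.5, 2: 5.0}
print(row)    # [0.0, 1.0, 0.5, 7.2, 3.5]
```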

**Inputs**:
- `--interactions`: interactions file from Step 4.
- `--features`: movie features from Step 3.
- `--model-out`: where to save the trained model.
- `--test-size`: fraction of data used as test set.
- `--alpha`: Ridge regularization strength.
- `--max-samples`: max number of interactions to use (for speed).
  - set to `-1` to use **all** interactions.

At the end, the script prints MAE and RMSE on a held-out test set.
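
For reference, the two reported metrics are computed as shown in this small stdlib-only sketch. The sample ratings are made up.

```python
import math

def mae(y_true, y_pred):
    """Mean absolute error: average size of the prediction errors."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root mean squared error: penalizes large errors more strongly than MAE."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

y_true = [4.0, 3.0, 5.0, 2.0]
y_pred = [3.5, 3.0, 4.0, 3.0]
print(f"MAE : {mae(y_true, y_pred):.4f}")   # MAE : 0.6250
print(f"RMSE: {rmse(y_true, y_pred):.4f}")  # RMSE: 0.7500
```

RMSE is never smaller than MAE on the same predictions; a large gap between the two indicates that a few predictions are far off.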

In [17]:
TEST_SIZE = 0.2
ALPHA = 1.0
MAX_SAMPLES = 200_000

MODEL_PATH.parent.mkdir(parents=True, exist_ok=True)
run([
    sys.executable,
    str(SCRIPTS / "train_model.py"),
    "--interactions", str(INTERACTIONS),
    "--features", str(MOVIE_FEATURES),
    "--model-out", str(MODEL_PATH),
    "--test-size", str(TEST_SIZE),
    "--alpha", str(ALPHA),
    "--max-samples", str(MAX_SAMPLES),
])

# Runtime for 200k samples was <20 seconds

$ c:\HSLU\MALDASC\MovieRatingPredictor\venv\Scripts\python.exe c:\HSLU\MALDASC\MovieRatingPredictor\scripts\train_model.py --interactions c:\HSLU\MALDASC\MovieRatingPredictor\data\dataset\interactions.csv --features c:\HSLU\MALDASC\MovieRatingPredictor\feature_store\movie_features.joblib --model-out c:\HSLU\MALDASC\MovieRatingPredictor\models\ridge_rating_model.joblib --test-size 0.2 --alpha 1.0 --max-samples 200000
STDOUT:
 Loading interactions from c:\HSLU\MALDASC\MovieRatingPredictor\data\dataset\interactions.csv ...
Total interactions: 30,883,379
Subsampling to 200,000 interactions (from 30,883,379) ...
Using 200,000 interactions.
Loading movie features from c:\HSLU\MALDASC\MovieRatingPredictor\feature_store\movie_features.joblib ...
Building design matrix...
Feature matrix shape: (200000, 27), target shape: (200000,)
Training samples: 160,000, test samples: 40,000
Training Ridge regression (alpha=1.0) ...

Evaluation on held-out test set:
  MAE : 0.4735
  RMSE: 0.6622

✓ Saved t

## Step 6 – Inspect model coefficients (`explain_model.py`) *(optional)*

This script inspects which features push predicted ratings **up** or **down**.
It:

- reconstructs the full feature name list,
- prints the **top positive** and **top negative** coefficients.
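
Ranking coefficients by sign can be sketched as below. `top_coefficients` is a hypothetical helper, not the code in `explain_model.py`; the five sample coefficients are copied from the output of this step.

```python
def top_coefficients(names, coefs, k):
    """Return the k largest positive and k most negative (coef, name) pairs."""
    pairs = sorted(zip(coefs, names), reverse=True)  # largest coefficient first
    positive = [(c, n) for c, n in pairs if c > 0][:k]
    negative = [(c, n) for c, n in reversed(pairs) if c < 0][:k]
    return positive, negative

names = ["user_mean_rating", "num:vote_avg", "genre:family",
         "num:log1p_vote_count", "num:runtime"]
coefs = [0.9306, 0.3271, -0.0952, -0.0540, -0.0007]

positive, negative = top_coefficients(names, coefs, k=2)
for coef, name in positive + negative:
    print(f"{coef:+.4f}   {name}")
```

Coefficient magnitudes are only comparable across features that share a similar scale, so the ranking should be read with the feature preprocessing in mind.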


In [None]:
run([
    sys.executable,
    str(SCRIPTS / "explain_model.py"),
    "--model", str(MODEL_PATH),
    "--features", str(MOVIE_FEATURES),
])

$ c:\HSLU\MALDASC\MovieRatingPredictor\venv\Scripts\python.exe c:\HSLU\MALDASC\MovieRatingPredictor\scripts\explain_model.py --model c:\HSLU\MALDASC\MovieRatingPredictor\models\ridge_rating_model.joblib --features c:\HSLU\MALDASC\MovieRatingPredictor\feature_store\movie_features.joblib
STDOUT:
 
=== Top Positive Features (push rating higher) ===
+0.9306   user_mean_rating
+0.3271   num:vote_avg
+0.0447   genre:thriller
+0.0340   genre:mystery
+0.0219   genre:history
+0.0213   genre:action
+0.0211   genre:drama
+0.0186   genre:animation
+0.0170   genre:comedy
+0.0137   genre:war
+0.0134   genre:crime
+0.0116   genre:western
+0.0093   num:inv_recency
+0.0074   genre:romance
+0.0064   genre:science

=== Top Negative Features (push rating lower) ===
-0.0952   genre:family
-0.0540   num:log1p_vote_count
-0.0518   genre:horror
-0.0478   genre:music
-0.0099   genre:documentary
-0.0039   genre:fantasy
-0.0026   num:popularity
-0.0023   genre:movie
-0.0023   genre:tv
-0.0007   num:runtime

Done

## Step 7 – Recommend movies for a new user (`recommend_for_new_user.py`)

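This step is executed in-process instead of through the `run` helper: the script prompts for user input, and `run` captures the script's output and only prints it after the process has finished.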
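
Calling a script's `main()` from a notebook works by temporarily replacing `sys.argv`, which is where command-line parsers look for arguments. The sketch below assumes the script parses its flags with `argparse`; the `main` shown here is a stand-in, not the real `recommend_for_new_user.main`.

```python
import argparse
import sys

def main():
    """Stand-in for a script entry point that parses sys.argv with argparse."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--num-rate", type=int, default=20)
    parser.add_argument("--num-recs", type=int, default=10)
    return parser.parse_args()  # reads sys.argv[1:] by default

original_argv = sys.argv
try:
    # The first element plays the role of the program name.
    sys.argv = ["recommend_for_new_user.py", "--num-rate", "5", "--num-recs", "3"]
    args = main()
finally:
    sys.argv = original_argv  # restore so later cells are unaffected

print(args.num_rate, args.num_recs)  # 5 3
```

Restoring `sys.argv` afterwards is optional in a throwaway notebook, but it avoids surprising other code that inspects the arguments later.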

In [7]:
import sys
from pathlib import Path
import importlib
import scripts.recommend_for_new_user as r4u

importlib.reload(r4u)

sys.argv = [
    "recommend_for_new_user.py",
    "--model", str(MODEL_PATH),
    "--features", str(MOVIE_FEATURES),
    "--num-rate", "20",
    "--num-recs", "10",
    "--min-ratings", "20",
]

r4u.main()

Loading model and movie features...
Loaded 9928 movies.


=== Step 1: Please rate some movies ===
Enter a rating from 1.0 to 5.0.
Press ENTER to skip a movie, or type 'q' to finish early.

Invalid input. Please enter a number (e.g. 3.5), ENTER to skip, or 'q' to quit.
Invalid input. Please enter a number (e.g. 3.5), ENTER to skip, or 'q' to quit.
Please enter a number between 1.0 and 5.0.
Please enter a number between 1.0 and 5.0.

Thanks! You rated 20 movies.

Estimated user_mean_rating: 4.185
Predicting ratings for all movies...

=== Top Recommendations for You ===
 1. Gigantic (A Tale of Two Johns)  (predicted rating: 4.78)
 2. World of Tomorrow Episode Two: The Burden of Other People's Thoughts  (predicted rating: 4.72)
 3. Twin Peaks  (predicted rating: 4.71)
 4. David Attenborough: A Life on Our Planet  (predicted rating: 4.71)
 5. Duck Amuck  (predicted rating: 4.70)
 6. Margaret Cho: I'm the One That I Want  (predicted rating: 4.69)
 7. Dave Chappelle: Killin' Them Softly  (pre
