# Movie Recommender Pipeline
**MALDASC Project - Alexander Sallmann**

## Setup: paths and helper function

Set the paths (adjust them if your project layout differs) and define a helper that runs the pipeline scripts.

In [2]:
from pathlib import Path
import sys
import subprocess

# Project root (current working directory)
ROOT = Path.cwd()
SCRIPTS = ROOT / "scripts"

ML_DIR = ROOT / "data" / "ml-latest"
TMDB_IDS = ROOT / "data" / "tmdb_ids_to_scrape.txt"
TMDB_DB = ROOT / "data" / "tmdb.sqlite"
INTERACTIONS = ROOT / "data" / "dataset" / "interactions.csv"
MOVIE_FEATURES = ROOT / "feature_store" / "movie_features.joblib"
MODEL_PATH = ROOT / "models" / "ridge_rating_model.joblib"

def run(cmd):
    """Run a command, show stdout + stderr, and raise a clear error if it fails."""
    print("$", " ".join(str(c) for c in cmd))
    result = subprocess.run(cmd, capture_output=True, text=True)

    if result.stdout:
        print("STDOUT:\n", result.stdout)
    if result.stderr:
        print("STDERR:\n", result.stderr)

    if result.returncode != 0:
        raise RuntimeError(f"Command failed with exit code {result.returncode}")


## Step 1 – Select movies to scrape (`make_tmdb_id_list.py`)

This script reads **MovieLens `ratings.csv` and `links.csv`** and:

- counts how many ratings each `movieId` has,
- keeps the **TOP N** most-rated movies,
- writes their `tmdbId` (one per line) to a text file (`tmdb_ids_to_scrape.txt`).

**Inputs**:
- `--ml-dir`: directory containing `ratings.csv` and `links.csv`.
- `--top-n`: how many movies to keep, sorted by number of ratings.
- `--out`: output file whose lines are TMDB movie IDs.
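
The selection logic described above can be sketched in a few lines. This is a minimal illustration, not the code of `make_tmdb_id_list.py`: the function name `top_n_tmdb_ids` is hypothetical, and skipping movies without a TMDB link before the top-N cut is an assumption.

```python
from collections import Counter

def top_n_tmdb_ids(ratings, links, top_n):
    """Return the tmdbId of the top_n most-rated movies, most-rated first.

    ratings: iterable of dicts with a "movieId" key (rows of ratings.csv)
    links:   iterable of dicts with "movieId" and "tmdbId" keys (rows of links.csv)
    """
    # Count how many ratings each movieId received
    counts = Counter(row["movieId"] for row in ratings)
    # Assumption: movies without a TMDB link are skipped
    movie_to_tmdb = {row["movieId"]: row["tmdbId"] for row in links if row["tmdbId"]}
    ranked = [movie_to_tmdb[m] for m, _ in counts.most_common() if m in movie_to_tmdb]
    return ranked[:top_n]

# Tiny made-up sample; real rows would come from csv.DictReader
ratings = [{"movieId": "1"}, {"movieId": "1"}, {"movieId": "2"},
           {"movieId": "3"}, {"movieId": "3"}, {"movieId": "3"}]
links = [{"movieId": "1", "tmdbId": "862"},
         {"movieId": "2", "tmdbId": ""},
         {"movieId": "3", "tmdbId": "949"}]

top_ids = top_n_tmdb_ids(ratings, links, top_n=2)
print("\n".join(top_ids))  # one TMDB ID per line, as in the output file
```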

In [4]:
TOP_N = 10000

TMDB_IDS.parent.mkdir(parents=True, exist_ok=True)
run([
    sys.executable,
    str(SCRIPTS / "make_tmdb_id_list.py"),
    "--ml-dir", str(ML_DIR),
    "--top-n", str(TOP_N),
    "--out", str(TMDB_IDS),
])

# Runtime <10 seconds


$ c:\HSLU\MALDASC (Machine Learning and Data Science)\MovieRatingPredictor\venv\Scripts\python.exe c:\HSLU\MALDASC (Machine Learning and Data Science)\MovieRatingPredictor\scripts\make_tmdb_id_list.py --ml-dir c:\HSLU\MALDASC (Machine Learning and Data Science)\MovieRatingPredictor\data\ml-latest --top-n 10000 --out c:\HSLU\MALDASC (Machine Learning and Data Science)\MovieRatingPredictor\data\tmdb_ids_to_scrape.txt
STDOUT:
 ✓ wrote 10000 TMDB IDs to c:\HSLU\MALDASC (Machine Learning and Data Science)\MovieRatingPredictor\data\tmdb_ids_to_scrape.txt



## Step 2 – Scrape movies from TMDB (`tmdb_scraper.py`)

This script calls the **TMDB API** for each ID in `tmdb_ids_to_scrape.txt` and stores
the results in `tmdb.sqlite`.

It maintains the following tables:
- `movies` (title, runtime, popularity, etc.)
- `genres` and `movie_genres`

**Inputs**:
- `--ids-file`: text file with one TMDB movie ID per line.
- `--db`: path to the SQLite database to create/update.

**Environment**:
- `TMDB_API_KEY` must be set (e.g. in a `.env` file or OS env vars).
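
To make the table layout concrete, here is a minimal sketch of how the three tables could be created and filled with `sqlite3`. The column names, the `SCHEMA` string and the `store_movie` helper are assumptions for illustration; the actual schema used by `tmdb_scraper.py` is not shown in this notebook. No API call is made here, and the sample movie is made up.

```python
import sqlite3

# Hypothetical schema: one row per movie, one per genre, plus a link table
SCHEMA = """
CREATE TABLE IF NOT EXISTS movies (
    id INTEGER PRIMARY KEY,
    tmdb_id INTEGER UNIQUE NOT NULL,
    title TEXT,
    runtime INTEGER,
    popularity REAL
);
CREATE TABLE IF NOT EXISTS genres (
    id INTEGER PRIMARY KEY,
    name TEXT UNIQUE NOT NULL
);
CREATE TABLE IF NOT EXISTS movie_genres (
    movie_id INTEGER NOT NULL REFERENCES movies(id),
    genre_id INTEGER NOT NULL REFERENCES genres(id),
    PRIMARY KEY (movie_id, genre_id)
);
"""

def store_movie(conn, movie):
    """Insert or update one movie dict and link it to its genres."""
    conn.execute(
        "INSERT INTO movies (tmdb_id, title, runtime, popularity) VALUES (?, ?, ?, ?) "
        "ON CONFLICT(tmdb_id) DO UPDATE SET title=excluded.title, "
        "runtime=excluded.runtime, popularity=excluded.popularity",
        (movie["tmdb_id"], movie["title"], movie["runtime"], movie["popularity"]),
    )
    movie_id = conn.execute(
        "SELECT id FROM movies WHERE tmdb_id = ?", (movie["tmdb_id"],)
    ).fetchone()[0]
    for name in movie["genres"]:
        conn.execute("INSERT OR IGNORE INTO genres (name) VALUES (?)", (name,))
        genre_id = conn.execute(
            "SELECT id FROM genres WHERE name = ?", (name,)
        ).fetchone()[0]
        conn.execute(
            "INSERT OR IGNORE INTO movie_genres (movie_id, genre_id) VALUES (?, ?)",
            (movie_id, genre_id),
        )

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
sample = {"tmdb_id": 1, "title": "Example Movie", "runtime": 100,
          "popularity": 1.0, "genres": ["Drama", "Crime"]}
store_movie(conn, sample)
store_movie(conn, sample)  # a second call updates the row instead of duplicating it
movie_count = conn.execute("SELECT COUNT(*) FROM movies").fetchone()[0]
link_count = conn.execute("SELECT COUNT(*) FROM movie_genres").fetchone()[0]
print(movie_count, link_count)
```

Writing with an upsert keeps a rerun of the scraper from creating duplicate rows, which matters for a job that runs for hours and may be restarted.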

In [None]:
TMDB_DB.parent.mkdir(parents=True, exist_ok=True)
run([
    sys.executable,
    str(SCRIPTS / "tmdb_scraper.py"),
    "--ids-file", str(TMDB_IDS),
    "--db", str(TMDB_DB),
])

# Runtime was <141 minutes

$ c:\HSLU\MALDASC (Machine Learning and Data Science)\MovieRatingPredictor\venv\Scripts\python.exe c:\HSLU\MALDASC (Machine Learning and Data Science)\MovieRatingPredictor\scripts\tmdb_scraper.py --ids-file c:\HSLU\MALDASC (Machine Learning and Data Science)\MovieRatingPredictor\data\tmdb_ids_to_scrape.txt --db c:\HSLU\MALDASC (Machine Learning and Data Science)\MovieRatingPredictor\data\tmdb.sqlite


## Step 3 – Build movie features (`build_movie_features.py`)

This script converts the raw movie metadata in `tmdb.sqlite` into a set of features
suitable for ML.

**Text features** (TF–IDF):
- genres (e.g. `Action`, `Drama`, `Comedy`)

**Numeric features**:
- `runtime`
- `vote_avg` (TMDB average rating)
- `log1p_vote_count` (`log(1 + vote_count)`, which compresses the long tail of vote counts and is defined for zero votes)
- `popularity`
- `inv_recency` (newer movies have higher values)
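
As an illustration of the numeric features listed above, the sketch below computes them for a single movie. The function `numeric_features` and its arguments are hypothetical, and the `inv_recency` formula `1 / (1 + age in years)` is an assumption: the notebook only states that newer movies get higher values.

```python
import math

def numeric_features(runtime, vote_avg, vote_count, popularity, release_year, ref_year):
    """Build the five numeric features for one movie (illustrative only)."""
    age_years = max(ref_year - release_year, 0)
    return {
        "runtime": float(runtime),
        "vote_avg": float(vote_avg),
        "log1p_vote_count": math.log1p(vote_count),  # log(1 + votes)
        "popularity": float(popularity),
        "inv_recency": 1.0 / (1.0 + age_years),  # assumed formula, newer -> higher
    }

# Made-up example values
old = numeric_features(120, 7.5, 0, 3.2, release_year=1990, ref_year=2020)
new = numeric_features(95, 6.8, 999, 8.1, release_year=2019, ref_year=2020)
print(old["inv_recency"] < new["inv_recency"])
```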

**Inputs**:
- `--db`: path to `tmdb.sqlite`.
- `--max-features`: max number of genre tokens to keep (TF–IDF vocab size).
- `--min-df`: minimum document frequency for TF–IDF terms.
- `--out`: output `.joblib` file containing movies, features and metadata.
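
The effect of `--min-df` and `--max-features` on the genre vocabulary can be sketched without any ML library. The parameter names resemble those of scikit-learn's `TfidfVectorizer`, but the notebook does not say which implementation `build_movie_features.py` uses, so `select_vocabulary` below is a hypothetical stand-in that only mimics the filtering step.

```python
from collections import Counter

def select_vocabulary(docs, max_features, min_df):
    """Pick the genre tokens to keep.

    docs: list of token lists, one list of genre tokens per movie.
    """
    doc_freq = Counter()   # in how many movies a token appears
    term_freq = Counter()  # how often a token appears overall
    for tokens in docs:
        term_freq.update(tokens)
        doc_freq.update(set(tokens))
    # Drop tokens that appear in fewer than min_df movies
    candidates = [t for t in term_freq if doc_freq[t] >= min_df]
    # Keep the most frequent tokens; ties are broken alphabetically
    candidates.sort(key=lambda t: (-term_freq[t], t))
    return sorted(candidates[:max_features])

# Made-up genre lists
docs = [["action", "drama"], ["drama"], ["comedy", "drama"], ["action"]]
vocab_all = select_vocabulary(docs, max_features=50, min_df=1)
vocab_small = select_vocabulary(docs, max_features=2, min_df=2)
print(vocab_all, vocab_small)
```

With a small, closed set of genre names, a cap of 50 and `min_df` of 1 would typically keep every genre; the cap only matters for larger vocabularies.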


In [2]:
MAX_FEATURES = 50
MIN_DF = 1

MOVIE_FEATURES.parent.mkdir(parents=True, exist_ok=True)
run([
    sys.executable,
    str(SCRIPTS / "build_movie_features.py"),
    "--db", str(TMDB_DB),
    "--max-features", str(MAX_FEATURES),
    "--min-df", str(MIN_DF),
    "--out", str(MOVIE_FEATURES),
])

# Runtime <10 seconds


$ c:\HSLU\MALDASC (Machine Learning and Data Science)\MovieRatingPredictor\venv\Scripts\python.exe c:\HSLU\MALDASC (Machine Learning and Data Science)\MovieRatingPredictor\scripts\build_movie_features.py --db c:\HSLU\MALDASC (Machine Learning and Data Science)\MovieRatingPredictor\data\tmdb.sqlite --max-features 50 --min-df 1 --out c:\HSLU\MALDASC (Machine Learning and Data Science)\MovieRatingPredictor\feature_store\movie_features.joblib
STDOUT:
 ✓ saved c:\HSLU\MALDASC (Machine Learning and Data Science)\MovieRatingPredictor\feature_store\movie_features.joblib (9928 movies x 21 genre-features, 5 numeric features)



## Step 4 – Build interactions (`movielens_to_interactions.py`)

This script links MovieLens ratings to the movies in `tmdb.sqlite`.
It produces a table of **interactions** with columns:

- `userId`
- `movie_rowid` (primary key in `tmdb.sqlite`)
- `tmdb_id`
- `rating` (MovieLens stars; raw `ratings.csv` values range from 0.5 to 5.0 in half-star steps)
- `timestamp`
- `title`

**Inputs**:
- `--ml-dir`: MovieLens directory (with `ratings.csv`, `links.csv`).
- `--sqlite`: path to `tmdb.sqlite`.
- `--min-user-ratings`: drop users with fewer ratings than this.
- `--min-movie-ratings`: drop movies with fewer ratings than this.
- `--out`: output `.csv` or `.parquet` file.

You can use the `min-*` parameters to filter out very sparse users or movies.
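
A minimal sketch of this kind of sparsity filter is shown below. The function `filter_sparse` is hypothetical, and it applies both thresholds in a single pass based on the original counts; whether `movielens_to_interactions.py` filters once or repeats until both conditions hold is not stated here.

```python
from collections import Counter

def filter_sparse(interactions, min_user_ratings, min_movie_ratings):
    """Keep rows whose user and movie both meet the minimum rating counts.

    interactions: list of (user_id, movie_id, rating) tuples.
    """
    user_counts = Counter(u for u, _, _ in interactions)
    movie_counts = Counter(m for _, m, _ in interactions)
    return [
        (u, m, r) for u, m, r in interactions
        if user_counts[u] >= min_user_ratings and movie_counts[m] >= min_movie_ratings
    ]

# Made-up interactions: u3 and m3 each have only one rating
interactions = [
    ("u1", "m1", 4.0), ("u1", "m2", 3.0),
    ("u2", "m1", 5.0), ("u2", "m2", 2.0),
    ("u3", "m3", 1.0),
]
kept = filter_sparse(interactions, min_user_ratings=2, min_movie_ratings=2)
print(len(kept))
```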

In [3]:
MIN_USER_RATINGS = 5
MIN_MOVIE_RATINGS = 5

INTERACTIONS.parent.mkdir(parents=True, exist_ok=True)
run([
    sys.executable,
    str(SCRIPTS / "movielens_to_interactions.py"),
    "--ml-dir", str(ML_DIR),
    "--sqlite", str(TMDB_DB),
    "--min-user-ratings", str(MIN_USER_RATINGS),
    "--min-movie-ratings", str(MIN_MOVIE_RATINGS),
    "--out", str(INTERACTIONS),
])

# Runtime <2 minutes


$ c:\HSLU\MALDASC (Machine Learning and Data Science)\MovieRatingPredictor\venv\Scripts\python.exe c:\HSLU\MALDASC (Machine Learning and Data Science)\MovieRatingPredictor\scripts\movielens_to_interactions.py --ml-dir c:\HSLU\MALDASC (Machine Learning and Data Science)\MovieRatingPredictor\data\ml-latest --sqlite c:\HSLU\MALDASC (Machine Learning and Data Science)\MovieRatingPredictor\data\tmdb.sqlite --min-user-ratings 5 --min-movie-ratings 5 --out c:\HSLU\MALDASC (Machine Learning and Data Science)\MovieRatingPredictor\data\dataset\interactions.csv
STDOUT:
 ✓ wrote 32,646,972 interactions for 306,763 users and 9,928 movies → c:\HSLU\MALDASC (Machine Learning and Data Science)\MovieRatingPredictor\data\dataset\interactions.csv



## Step 5 – Train rating prediction model (`train_model.py`)

This script trains a **Ridge regression** model that predicts a user's rating (1–5)
from the following features:

- movie TF–IDF features (genres),
- movie numeric features (runtime, vote_avg, popularity, etc.),
- the user's mean rating (per-user bias feature).
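
The per-user bias feature can be sketched as follows. This is an illustration under assumptions: the helper `user_mean_ratings` is hypothetical, computing the means on training rows only is a common way to avoid leaking test ratings, and falling back to the global mean for unseen users is one possible choice, not necessarily what `train_model.py` does.

```python
from collections import defaultdict

def user_mean_ratings(train_rows):
    """Return {user_id: mean rating} from (user_id, rating) pairs."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for user_id, rating in train_rows:
        totals[user_id] += rating
        counts[user_id] += 1
    return {u: totals[u] / counts[u] for u in totals}

# Made-up training rows
train_rows = [("u1", 4.0), ("u1", 2.0), ("u2", 5.0)]
means = user_mean_ratings(train_rows)
global_mean = sum(r for _, r in train_rows) / len(train_rows)

# Assumption: users absent from the training rows get the global mean
feature_for_new_user = means.get("u3", global_mean)
print(means, feature_for_new_user)
```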

**Inputs**:
- `--interactions`: interactions file from Step 4.
- `--features`: movie features from Step 3.
- `--model-out`: where to save the trained model.
- `--test-size`: fraction of data used as test set.
- `--alpha`: Ridge regularization strength.
- `--max-samples`: max number of interactions to use (for speed).
  - set to `-1` to use **all** interactions.
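
For reference, the textbook ridge regression objective that a regularization strength such as `--alpha` controls is shown below. The symbols are assumptions for illustration ($X$ is the feature matrix, $y$ the vector of ratings, $w$ the coefficients); how `train_model.py` handles the intercept or feature scaling is not shown in this notebook.

$$
\min_{w} \; \lVert y - Xw \rVert_2^2 + \alpha \, \lVert w \rVert_2^2
$$

A larger $\alpha$ shrinks the coefficients more strongly toward zero; $\alpha = 0$ reduces the problem to ordinary least squares.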

At the end, the script prints MAE and RMSE on a held-out test set.
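
For readers unfamiliar with the two metrics, here is a small pure-Python sketch of how MAE and RMSE are defined. The helper names `mae` and `rmse` and the sample values are made up for illustration; the script itself may compute the metrics with a library.

```python
import math

def mae(y_true, y_pred):
    """Mean absolute error: average size of the prediction error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root mean squared error: penalizes large errors more than MAE."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

# Made-up true and predicted ratings
y_true = [4.0, 3.0, 5.0, 2.0]
y_pred = [3.5, 3.0, 4.0, 3.0]

mae_value = mae(y_true, y_pred)
rmse_value = rmse(y_true, y_pred)
print(f"MAE={mae_value:.4f}  RMSE={rmse_value:.4f}")
```

RMSE is never smaller than MAE on the same predictions; a large gap between the two indicates a few large errors.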

In [3]:
TEST_SIZE = 0.2
ALPHA = 1.0
MAX_SAMPLES = 200_000

MODEL_PATH.parent.mkdir(parents=True, exist_ok=True)
run([
    sys.executable,
    str(SCRIPTS / "train_model.py"),
    "--interactions", str(INTERACTIONS),
    "--features", str(MOVIE_FEATURES),
    "--model-out", str(MODEL_PATH),
    "--test-size", str(TEST_SIZE),
    "--alpha", str(ALPHA),
    "--max-samples", str(MAX_SAMPLES),
])

# Runtime for 200k samples was <20 seconds

$ c:\HSLU\MALDASC (Machine Learning and Data Science)\MovieRatingPredictor\venv\Scripts\python.exe c:\HSLU\MALDASC (Machine Learning and Data Science)\MovieRatingPredictor\scripts\train_model.py --interactions c:\HSLU\MALDASC (Machine Learning and Data Science)\MovieRatingPredictor\data\dataset\interactions.csv --features c:\HSLU\MALDASC (Machine Learning and Data Science)\MovieRatingPredictor\feature_store\movie_features.joblib --model-out c:\HSLU\MALDASC (Machine Learning and Data Science)\MovieRatingPredictor\models\ridge_rating_model.joblib --test-size 0.2 --alpha 1.0 --max-samples 200000
STDOUT:
 Loading interactions from c:\HSLU\MALDASC (Machine Learning and Data Science)\MovieRatingPredictor\data\dataset\interactions.csv ...
Total interactions: 32,646,972
Subsampling to 200,000 interactions (from 32,646,972) ...
Using 200,000 interactions.
Loading movie features from c:\HSLU\MALDASC (Machine Learning and Data Science)\MovieRatingPredictor\feature_store\movie_features.joblib ...


## Step 6 – Inspect model coefficients (`explain_model.py`) *(optional)*

This script shows which features push predicted ratings **up** or **down**.
It:

- reconstructs the full feature name list,
- prints the **top positive** and **top negative** coefficients.
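
The ranking step can be sketched as below. The function `top_coefficients` and the feature names are hypothetical; the sketch assumes the feature names and the model's coefficient vector are already aligned in the same order.

```python
def top_coefficients(feature_names, coefs, k):
    """Return the k largest positive and k most negative (coef, name) pairs."""
    pairs = sorted(zip(coefs, feature_names), reverse=True)
    positive = [(c, n) for c, n in pairs if c > 0][:k]
    negative = [(c, n) for c, n in reversed(pairs) if c < 0][:k]
    return positive, negative

# Made-up names and coefficients
names = ["genre:a", "genre:b", "num:x", "num:y"]
coefs = [0.2, -0.5, 0.05, -0.1]

positive, negative = top_coefficients(names, coefs, k=2)
for c, n in positive + negative:
    print(f"{c:+.4f}   {n}")
```

Coefficient sizes are only directly comparable when the features are on similar scales, so a ranking like this is best read as a rough guide.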


In [4]:
run([
    sys.executable,
    str(SCRIPTS / "explain_model.py"),
    "--model", str(MODEL_PATH),
    "--features", str(MOVIE_FEATURES),
])

$ c:\HSLU\MALDASC (Machine Learning and Data Science)\MovieRatingPredictor\venv\Scripts\python.exe c:\HSLU\MALDASC (Machine Learning and Data Science)\MovieRatingPredictor\scripts\explain_model.py --model c:\HSLU\MALDASC (Machine Learning and Data Science)\MovieRatingPredictor\models\ridge_rating_model.joblib --features c:\HSLU\MALDASC (Machine Learning and Data Science)\MovieRatingPredictor\feature_store\movie_features.joblib
STDOUT:
 
=== Top Positive Features (push rating higher) ===
+0.9571   user_mean_rating
+0.3004   num:vote_avg
+0.0593   user_movie_sim
+0.0375   genre:history
+0.0285   genre:western
+0.0250   genre:mystery
+0.0208   genre:animation
+0.0203   genre:documentary
+0.0145   genre:thriller
+0.0043   genre:comedy

=== Top Negative Features (push rating lower) ===
-0.0862   genre:family
-0.0629   genre:horror
-0.0557   genre:music
-0.0390   num:log1p_vote_count
-0.0370   num:inv_recency
-0.0289   genre:drama
-0.0281   genre:crime
-0.0235   genre:adventure
-0.0212   gen

## Step 7 – Recommend movies for a new user (`recommend_for_new_user.py`)

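This step calls the script's `main()` directly instead of using the `run` helper: the script prompts for user input, and `run` captures all output until the process exits, so the prompts would not be shown.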
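
Calling a script's `main()` in-process works because `argparse` reads `sys.argv` by default. One caveat is that overwriting `sys.argv` changes it for the rest of the notebook session. The sketch below shows a way to restore it afterwards; `patched_argv` and `demo_main` are hypothetical helpers, with `demo_main` standing in for a script's `main()`.

```python
import argparse
import sys
from contextlib import contextmanager

@contextmanager
def patched_argv(argv):
    """Temporarily replace sys.argv and restore it on exit."""
    original = sys.argv
    sys.argv = list(argv)
    try:
        yield
    finally:
        sys.argv = original

def demo_main():
    """Stand-in for a script's main(); parses sys.argv like a CLI would."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--num-recs", type=int, default=10)
    return parser.parse_args()

before = list(sys.argv)
with patched_argv(["demo.py", "--num-recs", "5"]):
    args = demo_main()
print(args.num_recs)
```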

In [14]:
import sys
import importlib
import scripts.recommend_for_new_user as r4u

importlib.reload(r4u)

sys.argv = [
    "recommend_for_new_user.py",
    "--model", str(MODEL_PATH),
    "--features", str(MOVIE_FEATURES),
    "--num-rate", "15",
    "--num-recs", "10",
    "--min-ratings", "20",
    "--profiles-dir", str(ROOT / "user_profiles"),
]

r4u.main()

Loading model and movie features...
Loaded 9928 movies.

Existing profiles:
  1. alex
  2. x
  n. Create new profile



Using existing profile 'alex' with 67 ratings.

Current profile 'alex' has 67 ratings/preferences.


=== Step 1: Please rate some movies ===
For each movie, respond with:
  - 1–5   : rating if you've seen it
  - i     : interested (but not seen)  → treated as ~4.0
  - n     : not interested             → treated as ~2.0
  - ENTER : skip
  - q     : quit rating


Title: Joker
Overview: During the 1980s, a failed stand-up comedian is driven insane and turns to a life of crime and chaos in Gotham City while becoming an infamous psychopathic crime figure.
TMDB page: https://www.themoviedb.org/movie/475557


Title: The Shawshank Redemption
Overview: Imprisoned in the 1940s for the double murder of his wife and her lover, upstanding banker Andy Dufresne begins a new life at the Shawshank prison, where he puts his accounting skills to work for an amoral warden. During his long stretch in prison, Dufresne comes to be admired by the other inmates -- including an older prisoner named Red -- for 
