# Movie Recommender Pipeline
**MALDASC Project - Alexander Sallmann**

## Setup: paths and helper function

Set the paths (adjust them if your project layout differs) and define a helper to run the scripts.

In [1]:
from pathlib import Path
import sys
import subprocess

# Project root (current working directory)
ROOT = Path.cwd()
SCRIPTS = ROOT / "scripts"

ML_DIR = ROOT / "data" / "ml-latest"
TMDB_IDS = ROOT / "data" / "tmdb_ids_to_scrape.txt"
TMDB_DB = ROOT / "data" / "tmdb.sqlite"
INTERACTIONS = ROOT / "data" / "dataset" / "interactions.csv"
MOVIE_FEATURES = ROOT / "feature_store" / "movie_features.joblib"
MODEL_PATH = ROOT / "models" / "ridge_rating_model.joblib"

def run(cmd):
    """Run a command, show stdout + stderr, and raise a clear error if it fails."""
    print("$", " ".join(str(c) for c in cmd))
    result = subprocess.run(cmd, capture_output=True, text=True)

    if result.stdout:
        print("STDOUT:\n", result.stdout)
    if result.stderr:
        print("STDERR:\n", result.stderr)

    if result.returncode != 0:
        raise RuntimeError(f"Command failed with exit code {result.returncode}")


## Step 1 – Build movie features (`build_movie_features.py`)

This script converts the raw MovieLens data in `data/ml-latest` into a set of features
suitable for ML.

**Text features** (TF–IDF):
- genres (e.g. `Action`, `Drama`, `Comedy`) and user tags

**Numeric features**:
- `runtime`
- `vote_avg` (TMDB average rating)
- `log1p_vote_count` (log of number of votes)
- `popularity`
- `inv_recency` (newer movies have higher values)

**Inputs**:
- `--ml-dir`: MovieLens directory (the flag passed in the cell below).
- `--max-features`: max number of genre tokens to keep (TF–IDF vocab size).
- `--min-df`: minimum document frequency for TF–IDF terms.
- `--out`: output `.joblib` file containing movies, features and metadata.
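
As a rough illustration of how `--min-df` and `--max-features` shape the vocabulary, here is a minimal pure-Python sketch. The flag names mirror the `min_df` and `max_features` parameters of scikit-learn's `TfidfVectorizer`, but whether the script uses that class is an assumption. The function `build_vocabulary` and the toy documents are hypothetical.

```python
from collections import Counter


def build_vocabulary(docs, min_df=1, max_features=None):
    """Drop terms seen in fewer than min_df documents, then keep the
    max_features most frequent survivors (hypothetical helper)."""
    tokenized = [doc.lower().split() for doc in docs]
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Corpus-wide term count, used to rank the surviving terms.
    tf = Counter(term for tokens in tokenized for term in tokens)
    kept = [term for term in df if df[term] >= min_df]
    # Most frequent first; ties are broken alphabetically.
    kept.sort(key=lambda term: (-tf[term], term))
    if max_features is not None:
        kept = kept[:max_features]
    return sorted(kept)


# Toy corpus: one "document" per movie (made-up data).
docs = [
    "action thriller",
    "action comedy",
    "drama",
    "action drama",
    "comedy western",
]
vocab = build_vocabulary(docs, min_df=2, max_features=2)
print(vocab)
```

With `min_df=2`, the terms `thriller` and `western` are dropped because each occurs in only one document. `max_features=2` then keeps the two most frequent remaining terms.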


In [23]:
MAX_FEATURES = 2000
MIN_DF = 20

MOVIE_FEATURES.parent.mkdir(parents=True, exist_ok=True)
run([
    sys.executable,
    str(SCRIPTS / "build_movie_features.py"),
    "--ml-dir", str(ML_DIR),
    "--max-features", str(MAX_FEATURES),
    "--min-df", str(MIN_DF),
    "--out", str(MOVIE_FEATURES),
])

# Runtime <10 seconds

$ c:\HSLU\MALDASC (Machine Learning and Data Science)\MovieRatingPredictor\venv\Scripts\python.exe c:\HSLU\MALDASC (Machine Learning and Data Science)\MovieRatingPredictor\scripts\build_movie_features.py --ml-dir c:\HSLU\MALDASC (Machine Learning and Data Science)\MovieRatingPredictor\data\ml-latest --max-features 2000 --min-df 20 --out c:\HSLU\MALDASC (Machine Learning and Data Science)\MovieRatingPredictor\feature_store\movie_features.joblib
STDOUT:
 Loading MovieLens from c:\HSLU\MALDASC (Machine Learning and Data Science)\MovieRatingPredictor\data\ml-latest ...
Loaded 86,537 movies, 33,832,162 ratings, 2,328,315 tags.
Building numeric features...
Building text corpus (genres + tags) and TF-IDF features...
✓ saved c:\HSLU\MALDASC (Machine Learning and Data Science)\MovieRatingPredictor\feature_store\movie_features.joblib (86537 movies x 2000 text-features, 3 numeric features)



## Step 2 – Build interactions (`movielens_to_interactions.py`)

This script links MovieLens ratings to the movies in `tmdb.sqlite`.
It produces a table of **interactions** with columns:

- `userId`
- `movie_rowid` (primary key in `tmdb.sqlite`)
- `tmdb_id`
- `rating` (1–5)
- `timestamp`
- `title`

**Inputs**:
- `--ml-dir`: MovieLens directory (with `ratings.csv`, `links.csv`).
- `--sqlite`: path to `tmdb.sqlite`.
- `--min-user-ratings`: drop users with fewer ratings than this.
- `--min-movie-ratings`: drop movies with fewer ratings than this.
- `--out`: output `.csv` or `.parquet` file.

You can use the `min-*` parameters to filter out very sparse users or movies.
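
The effect of the `min-*` thresholds can be sketched as follows. This is a simplified illustration, not the script's actual code: the helper `filter_sparse` and the sample tuples are hypothetical, and the counts are computed once on the unfiltered data (whether the script re-counts after filtering is not known here).

```python
from collections import Counter


def filter_sparse(ratings, min_user_ratings, min_movie_ratings):
    """Keep only ratings whose user and movie both meet the minimum
    rating counts (counts taken on the unfiltered input)."""
    per_user = Counter(user for user, _, _ in ratings)
    per_movie = Counter(movie for _, movie, _ in ratings)
    return [
        (user, movie, rating)
        for user, movie, rating in ratings
        if per_user[user] >= min_user_ratings
        and per_movie[movie] >= min_movie_ratings
    ]


# (userId, movieId, rating) tuples, made up for the example.
ratings = [
    (1, 10, 4.0),
    (1, 11, 3.5),
    (1, 12, 5.0),
    (2, 10, 2.0),
    (2, 11, 4.5),
    (3, 10, 3.0),
]
kept = filter_sparse(ratings, min_user_ratings=2, min_movie_ratings=2)
print(len(kept), "of", len(ratings), "ratings kept")
```

Here user 3 (one rating) and movie 12 (one rating) fall below the thresholds, so their rows are removed.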

In [24]:
MIN_USER_RATINGS = 5
MIN_MOVIE_RATINGS = 5

INTERACTIONS.parent.mkdir(parents=True, exist_ok=True)
run([
    sys.executable,
    str(SCRIPTS / "movielens_to_interactions.py"),
    "--ml-dir", str(ML_DIR),
    "--min-user-ratings", str(MIN_USER_RATINGS),
    "--min-movie-ratings", str(MIN_MOVIE_RATINGS),
    "--out", str(INTERACTIONS),
])

# Runtime <2 minutes

$ c:\HSLU\MALDASC (Machine Learning and Data Science)\MovieRatingPredictor\venv\Scripts\python.exe c:\HSLU\MALDASC (Machine Learning and Data Science)\MovieRatingPredictor\scripts\movielens_to_interactions.py --ml-dir c:\HSLU\MALDASC (Machine Learning and Data Science)\MovieRatingPredictor\data\ml-latest --min-user-ratings 5 --min-movie-ratings 5 --out c:\HSLU\MALDASC (Machine Learning and Data Science)\MovieRatingPredictor\data\dataset\interactions.csv
STDOUT:
 ✓ wrote 33,703,215 interactions for 307,412 users and 43,855 movies → c:\HSLU\MALDASC (Machine Learning and Data Science)\MovieRatingPredictor\data\dataset\interactions.csv



## Step 3 – Train rating prediction model (`train_model.py`)

This script trains a **Ridge regression** model that predicts a user's rating (1–5)
from the following features:

- movie TF–IDF features (genres and tags),
- movie numeric features (runtime, vote_avg, popularity, etc.),
- the user's mean rating (per-user bias feature).
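
The per-user bias feature from the list above can be illustrated with a small sketch. The helper `user_mean_ratings` and the sample data are hypothetical. To avoid leaking test information, such means would normally be computed on the training split only; how the script handles this is not shown here.

```python
from collections import defaultdict


def user_mean_ratings(interactions):
    """Return {user_id: mean rating} for (user_id, rating) pairs."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for user, rating in interactions:
        totals[user] += rating
        counts[user] += 1
    return {user: totals[user] / counts[user] for user in totals}


# Made-up (user_id, rating) pairs.
interactions = [("u1", 4.0), ("u1", 5.0), ("u2", 2.0), ("u2", 3.0), ("u2", 1.0)]
means = user_mean_ratings(interactions)
print(means)
```

A generous rater such as `u1` gets a high mean, which lets a linear model shift all of that user's predictions upward.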

**Inputs**:
- `--interactions`: interactions file from Step 2.
- `--features`: movie features from Step 1.
- `--model-out`: where to save the trained model.
- `--test-size`: fraction of data used as test set.
- `--alpha`: Ridge regularization strength.
- `--max-samples`: max number of interactions to use.
  - set to `-1` to use **all** interactions.

At the end, the script prints MAE and RMSE on a held-out test set.
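
For reference, the two metrics can be computed as in the following sketch. The function names and sample values are illustrative only and are not taken from `train_model.py`.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error: average size of the prediction error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error: penalizes large errors more than MAE."""
    squared = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return math.sqrt(squared / len(y_true))


# Made-up true and predicted ratings.
y_true = [4.0, 3.0, 5.0, 2.0]
y_pred = [3.5, 3.0, 4.0, 3.0]
print(f"MAE  = {mae(y_true, y_pred):.3f}")
print(f"RMSE = {rmse(y_true, y_pred):.3f}")
```

RMSE is always at least as large as MAE. A large gap between the two indicates a few predictions that are far off.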

In [2]:
TEST_SIZE = 0.2
ALPHA = 1.0
MAX_SAMPLES = 200_000

MODEL_PATH.parent.mkdir(parents=True, exist_ok=True)
run([
    sys.executable,
    str(SCRIPTS / "train_model.py"),
    "--interactions", str(INTERACTIONS),
    "--features", str(MOVIE_FEATURES),
    "--model-out", str(MODEL_PATH),
    "--test-size", str(TEST_SIZE),
    "--alpha", str(ALPHA),
    "--max-samples", str(MAX_SAMPLES),
])

# Runtime for 200k samples was <20 seconds

$ c:\HSLU\MALDASC (Machine Learning and Data Science)\MovieRatingPredictor\venv\Scripts\python.exe c:\HSLU\MALDASC (Machine Learning and Data Science)\MovieRatingPredictor\scripts\train_model.py --interactions c:\HSLU\MALDASC (Machine Learning and Data Science)\MovieRatingPredictor\data\dataset\interactions.csv --features c:\HSLU\MALDASC (Machine Learning and Data Science)\MovieRatingPredictor\feature_store\movie_features.joblib --model-out c:\HSLU\MALDASC (Machine Learning and Data Science)\MovieRatingPredictor\models\ridge_rating_model.joblib --test-size 0.2 --alpha 1.0 --max-samples 200000
STDOUT:
 Loading interactions from c:\HSLU\MALDASC (Machine Learning and Data Science)\MovieRatingPredictor\data\dataset\interactions.csv ...
Total interactions: 33,703,215
Subsampling to 200,000 interactions (from 33,703,215) ...
Using 200,000 interactions.
Loading movie features from c:\HSLU\MALDASC (Machine Learning and Data Science)\MovieRatingPredictor\feature_store\movie_features.joblib ...


## Step 4 – Inspect model coefficients (`explain_model.py`)

This script shows which features push predicted ratings **up** or **down**.
It:

- reconstructs the full feature name list,
- prints the **top positive** and **top negative** coefficients.
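
The ranking step can be sketched as below, assuming the feature names and the model coefficients are available as two parallel lists. The helper `top_coefficients` and the values are made up for illustration and do not come from the trained model.

```python
def top_coefficients(names, coefs, k=3):
    """Return the k largest positive and the k most negative coefficients
    as (coefficient, name) pairs."""
    pairs = list(zip(coefs, names))
    positive = [p for p in sorted(pairs, reverse=True)[:k] if p[0] > 0]
    negative = [p for p in sorted(pairs)[:k] if p[0] < 0]
    return positive, negative


# Illustrative names and coefficients (not real model output).
names = ["user_mean_rating", "genre:drama", "genre:horror", "num:runtime"]
coefs = [0.96, 0.12, -0.30, -0.05]

positive, negative = top_coefficients(names, coefs, k=2)
for coef, name in positive + negative:
    print(f"{coef:+.4f}   {name}")
```

Because Ridge is a linear model, each coefficient is the change in predicted rating per unit change of that feature, so coefficients are only comparable across features that share a scale.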


In [19]:
run([
    sys.executable,
    str(SCRIPTS / "explain_model.py"),
    "--model", str(MODEL_PATH),
    "--features", str(MOVIE_FEATURES),
])

$ c:\HSLU\MALDASC (Machine Learning and Data Science)\MovieRatingPredictor\venv\Scripts\python.exe c:\HSLU\MALDASC (Machine Learning and Data Science)\MovieRatingPredictor\scripts\explain_model.py --model c:\HSLU\MALDASC (Machine Learning and Data Science)\MovieRatingPredictor\models\ridge_rating_model.joblib --features c:\HSLU\MALDASC (Machine Learning and Data Science)\MovieRatingPredictor\feature_store\movie_features.joblib
STDOUT:
 
=== Top Positive Features (push rating higher) ===
+0.9619   user_mean_rating
+0.6884   genre:differences
+0.6401   genre:shirt
+0.6218   genre:laser
+0.6046   genre:bag
+0.6022   genre:locked
+0.5657   genre:narration
+0.5394   genre:slap
+0.5392   genre:striking
+0.5383   genre:magnolia
+0.5189   genre:spacecraft
+0.5017   genre:massacre
+0.4801   genre:awful
+0.4669   genre:riding
+0.4621   num:global_mean_rating

=== Top Negative Features (push rating lower) ===
-0.6499   genre:06
-0.6068   genre:guitar
-0.6034   genre:production
-0.6027   genre:sho

## Step 5 – Recommend movies for a new user

This step is executed directly in the notebook instead of through the `run` helper, because it needs interactive user input.

In [16]:
from pathlib import Path
import importlib

import ipywidgets as widgets
from IPython.display import display, HTML

import scripts.recommend_for_new_user as r4u

importlib.reload(r4u)


def run_notebook_ui(
    model_path: Path,
    features_path: Path,
    profiles_dir: Path,
    profile_name: str | None = None,
    num_candidates: int = 30,
    min_ratings: int = 20,
    num_recs: int = 10,
):
    """
    Jupyter UI:
      - load/select profile
      - Phase 1: rate diverse popular movies
      - Phase 2: rate personalized recommendations (with predicted rating)
      - then compute final recommendations and save profile
    """
    # 1) Load model + features
    print("Loading model and movie features...")
    model, movies, X_tfidf, num = r4u.load_model_and_features(model_path, features_path)
    print(f"Loaded {len(movies)} movies.\n")

    # 2) Choose / load profile (still CLI-style)
    profile_name, user_ratings, profile_path = r4u.choose_profile(
        profiles_dir=profiles_dir,
        profile_name_arg=profile_name,
    )
    print(f"Current profile '{profile_name}' has {len(user_ratings)} ratings/preferences.\n")

    # 3) Candidate pool for phase 1
    rating_pool = r4u.select_candidate_movies(
        movies,
        X_tfidf,
        per_genre=5,
        top_genres=8,
        extra_random=20,
        min_year=1990,
    )

    if user_ratings:
        already = set(user_ratings.keys())
        rating_pool = rating_pool[~rating_pool["movie_rowid"].isin(already)]

    rating_pool = rating_pool.head(num_candidates).reset_index(drop=True)

    if rating_pool.empty:
        print("No candidate movies available to rate.")
        return

    # placeholders for phase 2
    rec_pool = None  # will be set after phase 1

    # 4) Widgets
    output = widgets.Output()

    # star buttons
    btn_1 = widgets.Button(description="1 ⭐", tooltip="1 - hated it")
    btn_2 = widgets.Button(description="2 ⭐", tooltip="2 - not good")
    btn_3 = widgets.Button(description="3 ⭐", tooltip="3 - okay")
    btn_4 = widgets.Button(description="4 ⭐", tooltip="4 - liked it")
    btn_5 = widgets.Button(description="5 ⭐", tooltip="5 - loved it")

    # interest buttons
    btn_i = widgets.Button(description="Interested (~4⭐)", button_style="info")
    btn_n = widgets.Button(description="Not interested (~2⭐)", button_style="warning")

    # slider for decimal ratings
    rating_slider = widgets.FloatSlider(
        value=4.0,
        min=1.0,
        max=5.0,
        step=0.1,
        description="Rating",
        readout_format=".1f",
        continuous_update=False,
    )
    btn_slider_apply = widgets.Button(description="Use slider rating")

    # control buttons
    btn_skip = widgets.Button(description="Skip")
    btn_done = widgets.Button(description="Next phase / Finish", button_style="success")

    buttons_row_1 = widgets.HBox([btn_1, btn_2, btn_3, btn_4, btn_5])
    buttons_row_2 = widgets.HBox([btn_i, btn_n, btn_skip, btn_done])
    slider_row = widgets.HBox([rating_slider, btn_slider_apply])

    ui_box = widgets.VBox([output, slider_row, buttons_row_1, buttons_row_2])

    # 5) state
    state = {
        "phase": "rating",     # "rating" -> "recs" -> "done"
        "rating_i": 0,
        "recs_i": 0,
    }

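    def get_tmdb_details(row):
        """Fetch TMDB details for a row; return None if the id is missing or the lookup fails."""
        tmdb_id = row.get("tmdb_id")
        # NaN is the only value that is not equal to itself
        if tmdb_id is None or tmdb_id != tmdb_id:
            return None
        try:
            return r4u.fetch_tmdb_details(int(tmdb_id))
        except Exception:
            # best effort: the UI still works without poster and overview
            return None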

    def current_pool_and_index():
        if state["phase"] == "rating":
            return rating_pool, state["rating_i"]
        elif state["phase"] == "recs":
            return rec_pool, state["recs_i"]
        else:
            return None, None

    def render_current_movie():
        pool, i = current_pool_and_index()
        with output:
            output.clear_output(wait=True)

            if pool is None:
                print("No movies to show.")
                return

            if i is None or i >= len(pool):
                if state["phase"] == "rating":
                    print("No more movies to rate in Phase 1. You can click 'Next phase / Finish'.")
                elif state["phase"] == "recs":
                    print("No more recommended movies to rate. Click 'Next phase / Finish' to see final recommendations.")
                return

            row = pool.iloc[i]
            movie_id = int(row["movie_rowid"])
            base_title = str(row.get("title", "Unknown title"))

            details = get_tmdb_details(row)
            display_title = base_title
            overview = ""
            year = None
            poster_url = None

            if details is not None:
                if details.get("title"):
                    display_title = details["title"]
                overview = details.get("overview") or ""
                rd = details.get("release_date") or ""
                if len(rd) >= 4:
                    year = rd[:4]
                poster_path = details.get("poster_path")
                if poster_path:
                    poster_url = f"https://image.tmdb.org/t/p/w342{poster_path}"

            # Header
            phase_label = "Phase 1: rate popular movies" if state["phase"] == "rating" else "Phase 2: refine on recommendations"
            print(f"{phase_label}")
            print(f"[{i+1}/{len(pool)}]  movie_rowid={movie_id}")
            print(f"Title: {display_title}")
            if year:
                print(f"Year : {year}")

            # If in recs phase, show predicted rating
            if state["phase"] == "recs" and "pred_rating" in row:
                print(f"Predicted rating: {row['pred_rating']:.2f}")

            # Poster + overview
            html_parts = []
            if poster_url:
                html_parts.append(
                    f'<img src="{poster_url}" '
                    f'style="max-height:300px;margin-right:15px;border-radius:8px;">'
                )

            if overview:
                ov = overview.strip()
                if len(ov) > 600:
                    ov = ov[:600] + "..."
                html_parts.append(
                    f"<div style='max-width:500px;'><b>Overview</b><br>{ov}</div>"
                )

            if html_parts:
                html = (
                    "<div style='display:flex;flex-direction:row;"
                    "align-items:flex-start;gap:15px;margin-top:10px;'>"
                    + "".join(html_parts)
                    + "</div>"
                )
                display(HTML(html))

            tmdb_id = row.get("tmdb_id")
            # skip the link when the id is missing (None) or NaN (NaN != NaN)
            if tmdb_id is not None and tmdb_id == tmdb_id:
                display(HTML(
                    f"<p>TMDB: "
                    f"<a href='https://www.themoviedb.org/movie/{int(tmdb_id)}' target='_blank'>"
                    f"Open in TMDB</a></p>"
                ))

            print("\nUse the star buttons, interest buttons, or the slider to rate. Or 'Skip' to move on.")

    def add_rating_value(rating: float):
        pool, i = current_pool_and_index()
        if pool is None or i is None or i >= len(pool):
            return

        row = pool.iloc[i]
        movie_id = int(row["movie_rowid"])
        user_ratings[movie_id] = float(rating)

        # advance index in the current phase
        if state["phase"] == "rating":
            state["rating_i"] += 1
        elif state["phase"] == "recs":
            state["recs_i"] += 1

        render_current_movie()

    def skip_movie(_):
        if state["phase"] == "rating":
            state["rating_i"] += 1
        elif state["phase"] == "recs":
            state["recs_i"] += 1
        render_current_movie()

    def apply_slider(_):
        add_rating_value(rating_slider.value)

    def go_next_phase_or_finish(_):
        nonlocal rec_pool
        with output:
            output.clear_output(wait=True)
            total_ratings = len(user_ratings)
            print(f"Profile '{profile_name}' currently has {total_ratings} ratings/preferences.\n")

        # If we're still in phase 1, move to phase 2 (recs)
        if state["phase"] == "rating":
            if total_ratings < min_ratings:
                with output:
                    print(
                        f"You currently have only {total_ratings} ratings/preferences.\n"
                        f"We recommend at least {min_ratings} to get good recommendations,\n"
                        "but we'll compute some initial recommendations anyway.\n"
                    )

            # compute initial recommendations and set up rec_pool
            top_recs = r4u.recommend_for_user(
                model=model,
                movies=movies,
                X_tfidf=X_tfidf,
                num=num,
                user_ratings=user_ratings,
                num_recs=num_recs * 3,   # show more in refinement phase
                min_year=1990,
            )
            rec_pool = top_recs.reset_index(drop=True)

            state["phase"] = "recs"
            state["recs_i"] = 0
            render_current_movie()
            return

        # If we're in phase 2, finish: compute final recommendations
        if state["phase"] == "recs":
            with output:
                output.clear_output(wait=True)
                total_ratings = len(user_ratings)
                print(f"Profile '{profile_name}' now has {total_ratings} total ratings/preferences.\n")
                print("Computing final recommendations based on all your feedback...\n")

                top_final = r4u.recommend_for_user(
                    model=model,
                    movies=movies,
                    X_tfidf=X_tfidf,
                    num=num,
                    user_ratings=user_ratings,
                    num_recs=num_recs,
                    min_year=1990,
                )

                print("\n=== Final Top Recommendations for You ===")
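                # enumerate for the rank: the DataFrame index is not guaranteed to be 0..n-1
                for rank, (_, row) in enumerate(top_final.iterrows(), start=1):
                    title = row.get("title", "Unknown title")
                    pred = row["pred_rating"]
                    line = f"{rank:2d}. {title}  (predicted rating: {pred:.2f})"
                    tmdb_id = row.get("tmdb_id")
                    # only append the id when it is present and not NaN
                    if tmdb_id is not None and tmdb_id == tmdb_id: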
                        line += f"  [TMDB id: {int(tmdb_id)}]"
                    print(line)

                r4u.save_user_ratings(profile_path, user_ratings)
                print(f"\nSaved {len(user_ratings)} ratings/preferences to profile '{profile_name}'.\n")

            state["phase"] = "done"

    # Button callbacks
    btn_1.on_click(lambda b: add_rating_value(1.0))
    btn_2.on_click(lambda b: add_rating_value(2.0))
    btn_3.on_click(lambda b: add_rating_value(3.0))
    btn_4.on_click(lambda b: add_rating_value(4.0))
    btn_5.on_click(lambda b: add_rating_value(5.0))
    btn_i.on_click(lambda b: add_rating_value(4.0))  # interested
    btn_n.on_click(lambda b: add_rating_value(2.0))  # not interested
    btn_skip.on_click(skip_movie)
    btn_slider_apply.on_click(apply_slider)
    btn_done.on_click(go_next_phase_or_finish)

    # Display UI and show first movie of phase 1
    display(ui_box)
    render_current_movie()

In [17]:
NUM_CANDIDATES = 30
MIN_RATINGS = 20
NUM_RECS = 10

run_notebook_ui(
    model_path=MODEL_PATH,
    features_path=MOVIE_FEATURES,
    profiles_dir=ROOT / "user_profiles",
    num_candidates=NUM_CANDIDATES,
    min_ratings=MIN_RATINGS,
    num_recs=NUM_RECS,
)

Loading model and movie features...
Loaded 86537 movies.

Existing profiles:
  1. alex
  n. Create new profile

Using existing profile 'alex' with 47 ratings.

Current profile 'alex' has 47 ratings/preferences.



VBox(children=(Output(), HBox(children=(FloatSlider(value=4.0, continuous_update=False, description='Rating', …