A production-style movie recommendation and ranking system that combines offline ML training with a low-latency Go service and a lightweight Next.js demo UI.
This is a personal portfolio project designed to mirror real-world MLE systems: data pipelines, model training, and online ranking at inference time.
- User mode: enter a MovieLens user_id and get ranked recommendations based on that user’s historical ratings.
- Movie mode: search by title, pick a movie, and get similar recommendations.
- Uses MovieLens for ratings data and TMDB for metadata/posters.
- Ingestion + validation of MovieLens CSVs
- TMDB enrichment for metadata (genres, popularity, posters, etc.)
- Feature tables for users and movies
- Model training with LightGBM (offline)
- Online service returns ranked results + lightweight explanations
- Frontend UI calls /search and /rank and renders cards
Note: the Go service currently uses a lightweight heuristic score from the feature tables. The LightGBM model is trained offline and saved to disk; wiring model inference into Go is a future step.
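To illustrate what a heuristic score over feature tables can look like, here is a minimal Python sketch. It is not the Go service's actual logic: the field names (`genres`, `popularity`, `genre_affinity`), the weights, and the popularity threshold are assumptions chosen for the example. Only the style of the reason strings (`genre_match:thriller`, `high_popularity`) is taken from the sample /rank response in this README.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class MovieFeatures:
    movie_id: int
    genres: list[str]
    popularity: float  # assumed to be pre-normalized to [0, 1]


@dataclass
class Scored:
    movie_id: int
    score: float
    reasons: list[str] = field(default_factory=list)


def heuristic_rank(genre_affinity, candidates, k=25, w_genre=0.7, w_pop=0.3):
    """Rank candidates by a weighted mix of genre affinity and popularity.

    genre_affinity maps a lowercase genre name to an affinity in [0, 1]
    (a hypothetical user feature). The weights are illustrative, not tuned.
    """
    scored = []
    for movie in candidates:
        # The best-matching genre drives the genre component of the score.
        best_genre, best_aff = None, 0.0
        for genre in movie.genres:
            aff = genre_affinity.get(genre.lower(), 0.0)
            if aff > best_aff:
                best_genre, best_aff = genre.lower(), aff
        score = w_genre * best_aff + w_pop * movie.popularity
        reasons = []
        if best_genre is not None:
            reasons.append(f"genre_match:{best_genre}")
        if movie.popularity >= 0.8:  # assumed threshold
            reasons.append("high_popularity")
        scored.append(Scored(movie.movie_id, round(score, 4), reasons))
    # Highest score first, truncated to the requested k.
    scored.sort(key=lambda s: s.score, reverse=True)
    return scored[:k]
```

Keeping the reasons next to the score means the explanation always reflects the same signals that produced the ranking.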
MovieLens Ratings + TMDB Metadata
|
v
Offline ML Pipeline (Python)
- Feature engineering
- Model training (LightGBM)
- Offline evaluation
- Model export
|
v
Online Ranking Service (Go)
- Candidate generation
- Feature fetching
- Ranking + response
|
v
Frontend Demo (Next.js)
- Search title or enter user_id
- Call /search and /rank
- Render movie cards
- ML pipeline: Python, LightGBM
- Online service: Go
- Frontend: Next.js + Tailwind
- Data: MovieLens + TMDB API
Prereqs: Python 3, Go 1.21+, Node 18+.
Create + activate virtual env (macOS/zsh):
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
Makefile shortcuts:
make help
Set your TMDB key (optional but recommended):
export TMDB_API_KEY=YOUR_KEY_HERE
python ml/scripts/ingest_movielens.py --raw-dir ml/data/raw --out-dir ml/data/processed
python ml/scripts/enrich_tmdb.py --processed-dir ml/data/processed --out ml/data/processed/tmdb_enriched.csv
python ml/scripts/build_features.py --processed-dir ml/data/processed --tmdb-csv ml/data/processed/tmdb_enriched.csv --out-dir ml/data/processed/features
python ml/scripts/build_training_dataset.py --processed-dir ml/data/processed --features-dir ml/data/processed/features --out-dir ml/data/processed/training
python ml/scripts/train_lightgbm.py --training-dir ml/data/processed/training --out-dir ml/models
python ml/scripts/export_service_data.py --features-dir ml/data/processed/features --out-dir service/data
uvicorn model_service.app:app --host 0.0.0.0 --port 8090
cd service
MODEL_API_BASE=http://localhost:8090 go run ./cmd/server
If you skip the model service, run:
cd service
go run ./cmd/server
cd frontend
npm run dev
POST /rank
User-based request:
{
  "user_id": 123,
  "k": 25
}
Movie-based request:
{
  "movie_id": 550,
  "k": 25
}
Response:
{
  "user_id": 123,
  "results": [
    {
      "movie_id": 550,
      "score": 0.91,
      "title": "Fight Club",
      "poster_url": "https://image.tmdb.org/t/p/w342/....jpg",
      "reasons": ["genre_match:thriller", "high_popularity"]
    }
  ],
  "latency_ms": 47
}
GET /search?q=matrix&limit=10
Response:
[
  { "movie_id": 2571, "title": "Matrix, The (1999)" },
  { "movie_id": 2572, "title": "Matrix Reloaded, The (2003)" }
]
GET /movie/{movie_id}
Response:
{
  "movie_id": 550,
  "title": "Fight Club",
  "release_year": 1999,
  "genres": ["Drama", "Thriller"],
  "tmdb_vote_avg": 8.4,
  "tmdb_popularity": 62.1,
  "poster_url": "https://image.tmdb.org/t/p/w342/....jpg",
  "overview": "..."
}
- MovieLens (25M or 32M): ratings, tags, timestamps
- TMDB API: genres, popularity, vote averages, release year, runtime, posters
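The POST /rank request and response shapes documented earlier can be exercised from a small Python client. The sketch below is illustrative only: the helper names (`build_rank_request`, `top_titles`, `post_rank`) are hypothetical, the "exactly one of user_id or movie_id" rule and the `k > 0` check are assumptions inferred from the two request examples, and the base URL is whatever address the Go service listens on (the latency benchmark command in this README uses http://localhost:8080).

```python
import json
import urllib.request


def build_rank_request(user_id=None, movie_id=None, k=25):
    """Return the JSON body for POST /rank as bytes.

    Assumes exactly one of user_id / movie_id is set, matching the two
    documented request shapes.
    """
    if (user_id is None) == (movie_id is None):
        raise ValueError("provide exactly one of user_id or movie_id")
    if k <= 0:
        raise ValueError("k must be positive")
    if user_id is not None:
        payload = {"user_id": user_id, "k": k}
    else:
        payload = {"movie_id": movie_id, "k": k}
    return json.dumps(payload).encode("utf-8")


def top_titles(response_body, n=5):
    """Extract (title, score) pairs from a /rank response body."""
    payload = json.loads(response_body)
    return [(r["title"], r["score"]) for r in payload.get("results", [])[:n]]


def post_rank(base_url, body, timeout=5.0):
    """Send the request to a running service and return the raw response bytes."""
    req = urllib.request.Request(
        base_url.rstrip("/") + "/rank",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.read()
```

With the service running locally, `top_titles(post_rank("http://localhost:8080", build_rank_request(user_id=123)))` would list the highest-ranked titles for that user.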
Demo complete and working locally. Model inference integration in Go is optional future work; the current service ranks using a heuristic over feature tables.
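The quality commands in this README report NDCG@10. For readers unfamiliar with the metric, here is a minimal Python sketch of NDCG@k for a single ranked list. It uses the common exponential-gain form (2^rel - 1) with a log2 position discount, which is an assumption: the repository's evaluation scripts may use linear gain or a library implementation instead.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k relevances, in ranked order."""
    return sum(
        (2 ** rel - 1) / math.log2(position + 2)
        for position, rel in enumerate(relevances[:k])
    )


def ndcg_at_k(relevances, k=10):
    """DCG of the given ordering divided by the DCG of the ideal ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    if ideal == 0:
        return 0.0  # no relevant items, so the metric is defined as 0 here
    return dcg_at_k(relevances, k) / ideal
```

A list already sorted by relevance scores 1.0, and pushing a relevant item further down the list lowers the score. A dataset-level number is typically the mean of this value across users.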
Offline quality (NDCG@10, model vs baseline):
python ml/scripts/evaluate_model.py \
  --training-dir ml/data/processed/training \
  --model-dir ml/models \
  --ndcg-k 10
Online-style quality (model vs heuristic):
python ml/scripts/compare_heuristic_vs_model.py \
  --training-dir ml/data/processed/training \
  --model-dir ml/models \
  --ndcg-k 10
Scale (dataset and feature table sizes):
python ml/scripts/report_dataset_stats.py \
  --processed-dir ml/data/processed \
  --features-dir ml/data/processed/features
Latency (p50/p95/p99):
python service/scripts/benchmark_latency.py --base-url http://localhost:8080 --requests 200 --k 25
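As a reference for reading the p50/p95/p99 figures, the following Python sketch computes them from a list of per-request latencies using the nearest-rank method. This is an assumption about how percentiles could be derived, not a description of `benchmark_latency.py`, which may interpolate or use a library instead. The function names are hypothetical.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% of values <= it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    # Multiply before dividing so whole-number ranks stay exact.
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(latencies_ms):
    """Return the three percentiles reported by the latency benchmark."""
    return {f"p{p}": percentile(latencies_ms, p) for p in (50, 95, 99)}
```

With 200 requests, as in the command above, p99 is the 198th fastest response, so two slow outliers can leave it unchanged while a third would move it.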
