# **Netflix Show Clustering with K-Means**  

**Hierarchy**: Genre ➜ Rating + Duration ➜ Duration

**Data source**: TMDB “discover” endpoint (watch-provider id = 8 → Netflix)  
**Tech Stack**: Python • Pandas • Scikit-learn • Seaborn • Matplotlib  


**Install Libraries**

In [None]:
!pip install requests tqdm python-dotenv numpy pandas scikit-learn seaborn matplotlib plotly

**1 Setup & imports**

In [39]:
import os, requests, math, json
from pathlib import Path
from tqdm.notebook import tqdm

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px

from sklearn.preprocessing import MultiLabelBinarizer, OneHotEncoder, StandardScaler
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

np.random.seed(42)
plt.rcParams["figure.dpi"] = 110     # sharper plots


# ---------- DATA ACQUISITION ----------

**2 Pull UK Netflix movies via TMDB API**

In [None]:
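from dotenv import load_dotenv

load_dotenv()          # pick up TMDB_API_KEY from a local .env file, if one exists
TMDB_KEY = os.getenv("TMDB_API_KEY")
if not TMDB_KEY:
    raise RuntimeError("Set TMDB_API_KEY in the environment or in a .env file")
BASE_URL = "https://api.themoviedb.org/3"
PROVIDER  = 8          # Netflix
REGION    = "GB"       # United Kingdom (ISO 3166-1 alpha-2)
PAGES     = 285        # upper bound on discover pages; the loop stops early at total_pages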

In [43]:
def tmdb(endpoint: str, **params):
    """Tiny TMDB GET wrapper"""
    params["api_key"] = TMDB_KEY
    r = requests.get(f"{BASE_URL}{endpoint}", params=params, timeout=30)
    r.raise_for_status()
    return r.json()
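
The wrapper above raises on any HTTP error, so a single transient failure would abort a long fetch loop. One possible mitigation is a small retry helper with exponential backoff. The sketch below is an assumption, not part of the original notebook: `with_retries` and the `flaky` stand-in are hypothetical names, and the broad `except Exception` would normally be narrowed to `requests.RequestException`.

```python
import time


def with_retries(fetch, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch() until it succeeds or attempts run out (hypothetical helper)."""
    for attempt in range(attempts):
        try:
            return fetch()
        except Exception:  # assumption: narrow this to requests.RequestException in real use
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)  # wait 1s, 2s, 4s, ... between tries


# Stand-in for a flaky network call: fails twice, then succeeds.
calls = {"n": 0}


def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("temporary failure")
    return {"ok": True}


delays = []
result = with_retries(flaky, sleep=delays.append)  # record the delays instead of sleeping
```

In the notebook this could wrap a real call, for example `with_retries(lambda: tmdb("/discover/movie", page=1))`.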

Discover movie IDs available on UK Netflix

In [44]:
ids = []
for page in tqdm(range(1, PAGES + 1), desc="Discover pages"):
    data = tmdb(
        "/discover/movie",
        watch_region=REGION,
        with_watch_providers=PROVIDER,
        page=page,
        sort_by="popularity.desc",
    )
    ids.extend([m["id"] for m in data["results"]])
    if page >= data["total_pages"]:
        break

print(f"Collected {len(ids):,} titles")

Collected 5,655 titles
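
The stopping rule in the discover loop can be checked offline. The sketch below reproduces the same pagination logic against a fake page fetcher; `collect_ids` and `fake_fetch` are hypothetical names introduced for illustration, and the payload only mimics the `results` and `total_pages` fields used above.

```python
def collect_ids(fetch_page, max_pages):
    """Gather result ids page by page, stopping at the last page the API reports."""
    ids = []
    for page in range(1, max_pages + 1):
        data = fetch_page(page)
        ids.extend(item["id"] for item in data["results"])
        if page >= data["total_pages"]:
            break  # no more pages, even if max_pages is larger
    return ids


# Hypothetical stand-in for tmdb("/discover/movie", ...): 3 pages, 2 ids each.
requested = []


def fake_fetch(page):
    requested.append(page)
    return {
        "results": [{"id": page * 10}, {"id": page * 10 + 1}],
        "total_pages": 3,
    }


sample_ids = collect_ids(fake_fetch, max_pages=10)
print(sample_ids)  # ids from pages 1 to 3 only
print(requested)   # pages actually requested
```

Although `max_pages` is 10, only three pages are requested, which is why the page constant works as an upper bound rather than an exact count.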


Fetch title-level details (runtime, genres, rating certificate)

In [None]:
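def extract_certification(release_dates, country="GB"):
    """Return the first non-empty certification for `country`, or None."""
    # TMDB keys each block by ISO 3166-1 alpha-2 code, so the UK is "GB", not "UK"
    for block in release_dates.get("results", []):
        if block["iso_3166_1"] == country:
            for entry in block["release_dates"]:
                if entry["certification"]:
                    return entry["certification"]
    return None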

records = []
for mid in tqdm(ids, desc="Movie details"):
    d = tmdb(f"/movie/{mid}", append_to_response="release_dates")
    records.append(
        {
            "title":        d["title"],
            "genres":       [g["name"] for g in d["genres"]],
            "rating":       extract_certification(d["release_dates"]),
            "duration_min": d["runtime"] or np.nan,
        }
    )
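
To make the certification lookup concrete, the sketch below runs the same traversal on a hand-written payload shaped like TMDB's `release_dates` response. The sample data is illustrative only, and `first_certification` is a hypothetical standalone copy of the helper that takes the country code explicitly. Note that the lookup matches ISO 3166-1 alpha-2 codes, so the United Kingdom is `"GB"`; a lookup for `"UK"` finds nothing.

```python
def first_certification(release_dates, country):
    """Return the first non-empty certification for `country`, or None."""
    for block in release_dates.get("results", []):
        if block["iso_3166_1"] == country:
            for entry in block["release_dates"]:
                if entry["certification"]:
                    return entry["certification"]
    return None


# Illustrative payload (made up for this example, not real TMDB output).
sample = {
    "results": [
        {"iso_3166_1": "US", "release_dates": [{"certification": "R"}]},
        {
            "iso_3166_1": "GB",
            "release_dates": [{"certification": ""}, {"certification": "15"}],
        },
    ]
}

gb = first_certification(sample, "GB")  # skips the empty entry, returns "15"
us = first_certification(sample, "US")  # returns "R"
uk = first_certification(sample, "UK")  # no block uses "UK", so this is None
print(gb, us, uk)
```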

In [None]:
df = pd.DataFrame(records).dropna(subset=["genres", "duration_min"])
print(f"Netflix UK catalogue pulled: {len(df):,} titles")
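
One caveat worth checking: `dropna` treats an empty list as a valid value, so titles with no genres survive the filter above. The sketch below shows this on a three-row toy frame (made-up data) and one possible way to drop such rows; whether to drop them is a modelling choice, not something the original notebook does.

```python
import numpy as np
import pandas as pd

sample = pd.DataFrame(
    {
        "title": ["A", "B", "C"],
        "genres": [["Drama"], [], ["Comedy", "Music"]],
        "duration_min": [100.0, 60.0, np.nan],
    }
)

# dropna removes only C (NaN runtime); B keeps its empty genre list
after_dropna = sample.dropna(subset=["genres", "duration_min"])

# Optional extra step: keep rows that have at least one genre
has_genre = after_dropna["genres"].map(len) > 0
cleaned = after_dropna[has_genre]

print(list(after_dropna["title"]))
print(list(cleaned["title"]))
```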

In [None]:
df

|     | title | genres | rating | duration_min |
|-----|-------|--------|--------|--------------|
| 0   | The Old Guard 2 | [Action, Fantasy] | R | 107.0 |
| 1   | KPop Demon Hunters | [Animation, Fantasy, Action, Comedy, Music] | PG | 96.0 |
| 2   | Squid Game: Fireplace | [] | G | 60.0 |
| 3   | STRAW | [Thriller, Drama, Crime] | R | 105.0 |
| 4   | Squid Game in Conversation | [Documentary] | R | 33.0 |
| ... | ... | ... | ... | ... |
| 795 | The Stronghold | [Thriller, Action, Crime] | NR | 105.0 |
| 796 | Bhool Bhulaiyaa 3 | [Horror, Comedy] | NR | 158.0 |
| 797 | The 8th Night | [Mystery, Thriller, Horror] | NR | 115.0 |
| 798 | Maestro | [Drama, Romance, Music] | R | 129.0 |


# ---------- FEATURE ENGINEERING ----------