# **Netflix Show Clustering with K-Means**  
Group similar shows by genre, rating, and duration.

**Tech Stack**: Python • Pandas • Scikit-learn • Seaborn • Matplotlib  
**Dataset**: Netflix Titles, a snapshot of films and TV shows available on Netflix.

**Install Libraries**

In [None]:
!pip install requests tqdm python-dotenv numpy pandas scikit-learn seaborn matplotlib

**1 Setup & imports**

In [2]:
import os, requests, time, math, json
from pathlib import Path
from tqdm.notebook import tqdm

import pandas as pd
import numpy as np
from sklearn.preprocessing import MultiLabelBinarizer, OneHotEncoder, StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.decomposition import PCA

import seaborn as sns
import matplotlib.pyplot as plt

from dotenv import load_dotenv
load_dotenv()

np.random.seed(42)

# ---------- DATA ACQUISITION ----------

**2A TMDB option**

Helper to call TMDB

In [3]:
TMDB_KEY = os.getenv("TMDB_API_KEY") or "PASTE_KEY_HERE"
BASE = "https://api.themoviedb.org/3"

def tmdb_get(endpoint: str, **params):
    """Thin wrapper around TMDB GET requests; returns the parsed JSON body."""
    params["api_key"] = TMDB_KEY          # authenticate via query parameter
    url = f"{BASE}{endpoint}"
    r = requests.get(url, params=params, timeout=30)
    r.raise_for_status()                  # raise on 4xx/5xx responses
    return r.json()
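
The enrichment step later in this notebook issues one request per title, so a single transient failure would abort the whole loop. The sketch below shows one way to retry a call with exponential backoff. `call_with_retries` and its parameters are hypothetical names introduced here for illustration; they are not part of the notebook or of `requests`.

```python
import time

def call_with_retries(fn, *args, retries=3, backoff=1.0,
                      exceptions=(Exception,), sleep=time.sleep, **kwargs):
    """Call fn(*args, **kwargs), retrying on `exceptions` with exponential backoff.

    Hypothetical helper: waits backoff * 2**attempt seconds between attempts
    and re-raises the last error once `retries` retries are exhausted.
    """
    for attempt in range(retries + 1):
        try:
            return fn(*args, **kwargs)
        except exceptions:
            if attempt == retries:
                raise                      # out of retries: surface the error
            sleep(backoff * 2 ** attempt)  # 1x, 2x, 4x ... the base delay
```

With the helper above, a call could be wrapped as `call_with_retries(tmdb_get, "/discover/movie", page=1, exceptions=(requests.RequestException,))`. Narrowing `exceptions` this way avoids retrying on programming errors.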

Discover Netflix movies in a region (most popular first, capped at `MAX_PAGES` pages)

In [4]:
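REGION       = "GB"   # country code passed to TMDB as watch_region
MAX_PAGES    = 40     # upper bound on discover pages to fetch
provider_id  = 8      # TMDB watch-provider ID for Netflix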

records = []

for page in tqdm(range(1, MAX_PAGES+1), desc="Fetching discover pages"):
    data = tmdb_get(
        "/discover/movie",
        with_watch_providers=provider_id,
        watch_region=REGION,
        sort_by="popularity.desc",
        page=page
    )
    records.extend(data["results"])
    if page >= data["total_pages"]:
        break

print(f"Collected {len(records):,} titles")


Collected 800 titles
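
Because the results are sorted by popularity and fetched page by page, it is possible (an assumption, not something observed in the output above) that a title shifts between pages mid-crawl and is collected twice. A minimal sketch of an order-preserving de-duplication on the `id` field follows. `dedupe_by_id` is a hypothetical helper and the sample records are made up for illustration.

```python
def dedupe_by_id(items):
    """Keep the first occurrence of each `id`, preserving the original order."""
    seen = set()
    unique = []
    for item in items:
        if item["id"] not in seen:
            seen.add(item["id"])
            unique.append(item)
    return unique

# Made-up records for illustration only (not real TMDB results).
sample = [
    {"id": 1, "title": "A"},
    {"id": 2, "title": "B"},
    {"id": 1, "title": "A"},   # duplicate of the first record
]
print(len(dedupe_by_id(sample)))  # 2
```

In the notebook this could be applied as `records = dedupe_by_id(records)` before the enrichment loop, which would also avoid fetching details for the same movie twice.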


Enrich with runtime & US parental rating

In [5]:
def extract_certification(release_dates, country="US"):
    """Return the first non-empty certification for `country`, or None."""
    for block in release_dates.get("results", []):
        if block["iso_3166_1"] == country:
            for entry in block["release_dates"]:
                if entry["certification"]:   # skip releases with an empty certification
                    return entry["certification"]
    return None

details = []
for rec in tqdm(records, desc="Enriching"):
    mid = rec["id"]
    d = tmdb_get(f"/movie/{mid}", append_to_response="release_dates")
    details.append(
        {
            "title":        d["title"],
            "genres":       [g["name"] for g in d["genres"]],
            "rating":       extract_certification(d["release_dates"]) or "NR",  # "NR" when no US certification is found
            "duration_min": d["runtime"] or np.nan,  # missing or zero runtime becomes NaN
        }
    )

df = pd.DataFrame(details)


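
To make the certification lookup concrete, the sketch below runs `extract_certification` on a small hand-written payload. The payload only mimics the keys the function reads (`results`, `iso_3166_1`, `release_dates`, `certification`); it is not real TMDB data, and the function is repeated here only so the example runs on its own.

```python
def extract_certification(release_dates, country="US"):
    """Return the first non-empty certification for `country`, or None."""
    for block in release_dates.get("results", []):
        if block["iso_3166_1"] == country:
            for entry in block["release_dates"]:
                if entry["certification"]:
                    return entry["certification"]
    return None

# Hand-written sample, NOT real TMDB data.
sample_release_dates = {
    "results": [
        {"iso_3166_1": "GB", "release_dates": [{"certification": "15"}]},
        {"iso_3166_1": "US", "release_dates": [{"certification": ""},
                                               {"certification": "R"}]},
    ]
}

us_rating = extract_certification(sample_release_dates)        # "R": the empty entry is skipped
gb_rating = extract_certification(sample_release_dates, "GB")  # "15"
fr_rating = extract_certification(sample_release_dates, "FR")  # None: no block for FR
empty     = extract_certification({})                          # None: no "results" key
```

A `None` result is what the enrichment loop turns into `"NR"` via the `or "NR"` fallback.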

In [6]:
df

| | title | genres | rating | duration_min |
|---|---|---|---|---|
| 0 | The Old Guard 2 | [Action, Fantasy] | R | 107.0 |
| 1 | KPop Demon Hunters | [Animation, Fantasy, Action, Comedy, Music] | PG | 96.0 |
| 2 | Squid Game: Fireplace | [] | G | 60.0 |
| 3 | STRAW | [Thriller, Drama, Crime] | R | 105.0 |
| 4 | Squid Game in Conversation | [Documentary] | R | 33.0 |
| ... | ... | ... | ... | ... |
| 795 | The Stronghold | [Thriller, Action, Crime] | NR | 105.0 |
| 796 | Bhool Bhulaiyaa 3 | [Horror, Comedy] | NR | 158.0 |
| 797 | The 8th Night | [Mystery, Thriller, Horror] | NR | 115.0 |
| 798 | Maestro | [Drama, Romance, Music] | R | 129.0 |
