Semantic Cluster Labeling is a required step that comes before supervised classification with predefined categories.
It is an unsupervised learning step.
This step will (a) describe each cluster with its top keywords and top genres,
(b) give each cluster a short human-readable label, and
(c) save tidy outputs for your slides and for the next step.

Loads your clustered data (data/shows_with_umap_kmeans.parquet) and the review info (data/shows_for_review.csv)
Merges them on id to get name, genres, ai_summary, u1/u2, cluster
Builds TF-IDF over all summaries and computes the top unigrams/bigrams per cluster
Extracts the top genres per cluster (from TVMaze)
Creates a cluster_label from the top three keywords (e.g., "crime, investigation, police")

Saves:
data/shows_with_cluster_labels.parquet (same rows, plus cluster_label)
data/cluster_profiles.csv (one row per cluster, with keywords, genres, examples)
reports/cluster_profiles.md (nice markdown for slides/notes)

📌 How to position this in your pipeline
Fetch & clean → raw_tvmaze.jsonl
Summarize → shows.parquet
Merge (keep original + AI summary) → shows_merged.parquet + shows_for_review.csv
Embed → vectors/summaries.npy + vectors/summaries_index.parquet
UMAP + K-Means → shows_with_umap_kmeans.parquet + plot
👉 Semantic Cluster Labeling (this step, mandatory) →
shows_with_cluster_labels.parquet, cluster_profiles.csv, cluster_profiles.md
(Later) Supervised classification → compare to your cluster labels / categories
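
Because this step reads two files produced by earlier steps, a quick pre-flight check can fail early with a readable message. The following is a minimal sketch, assuming the notebook runs from the directory that contains data/; the helper name missing_inputs is hypothetical and not part of the original pipeline.

```python
from pathlib import Path

# Inputs this step reads, as named in the pipeline list above.
REQUIRED_INPUTS = [
    "shows_with_umap_kmeans.parquet",  # written by the UMAP + K-Means step
    "shows_for_review.csv",            # written by the merge step
]

def missing_inputs(data_dir="data"):
    """Return the required input files that do not exist under data_dir."""
    base = Path(data_dir)
    return [name for name in REQUIRED_INPUTS if not (base / name).is_file()]

missing = missing_inputs()
if missing:
    print("Missing inputs, run the earlier steps first:", ", ".join(missing))
else:
    print("All inputs found.")
```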

🎯 Why this is important (and belongs before supervised classification)
Gives you human-interpretable descriptions of unsupervised groups
Helps you choose/adjust your predefined categories (ground them in the data)
Lets you sanity-check and explain your clusters in slides


In [1]:
import pandas as pd
import numpy as np
from pathlib import Path
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import silhouette_score
import ast

In [2]:
DATA_DIR = Path("data")
REPORTS = Path("reports"); REPORTS.mkdir(parents=True, exist_ok=True)

# ---- Load clustered coords/labels (from your UMAP + KMeans step)
clustered = pd.read_parquet(DATA_DIR / "shows_with_umap_kmeans.parquet")   # id, name, u1, u2, cluster

# ---- Load review info (names, genres, both summaries)
review = pd.read_csv(DATA_DIR / "shows_for_review.csv")  # id, name, genres, original_summary, ai_summary

# ---- Merge
df = pd.merge(
    review,
    clustered[["id", "u1", "u2", "cluster"]],
    on="id",
    how="left"
)

# ---- Ensure 'genres' is a list (CSV often stores it as a string)
def to_list_maybe(x):
    if isinstance(x, list):
        return x
    if isinstance(x, str):
        x = x.strip()
        # Try to parse JSON-like list
        if x.startswith("[") and x.endswith("]"):
            try:
                return list(ast.literal_eval(x))
            except Exception:
                pass
        # Fallback: split on commas
        return [t.strip() for t in x.split(",")] if x else []
    return []

df["genres"] = df["genres"].apply(to_list_maybe)

# ---- Basic sanity
assert "ai_summary" in df.columns, "ai_summary column missing. Make sure you merged after make_summaries."
assert "cluster" in df.columns, "cluster column missing. Run UMAP + KMeans first."
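
The merge above uses how="left", so any show in the review file that is absent from the clustered file ends up with a missing cluster, and pandas groupby skips such rows by default. The sketch below shows one way to surface those rows with the indicator option of merge. The two small frames are invented stand-ins for the real inputs.

```python
import pandas as pd

# Toy stand-ins for the two inputs; the real frames come from the files loaded above.
review = pd.DataFrame({"id": [1, 2, 3], "name": ["A", "B", "C"]})
clustered = pd.DataFrame({"id": [1, 2], "cluster": [0, 1]})

# indicator=True adds a "_merge" column recording where each row was found.
checked = review.merge(clustered, on="id", how="left", indicator=True)
unmatched = checked.loc[checked["_merge"] == "left_only", "id"].tolist()

print("Shows without a cluster:", unmatched)
print("cluster dtype after merge:", checked["cluster"].dtype)
```

If the unmatched list is non-empty, the cluster column also becomes floating point, which is why the later cells cast it back with int(c).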


In [3]:

# ---- Build a TF-IDF model over ALL summaries, then summarize per cluster
texts_all = df["ai_summary"].fillna("").astype(str).values
vectorizer = TfidfVectorizer(
    max_features=5000,      # big enough to get signal, small enough to be fast
    ngram_range=(1,2),      # unigrams + bigrams
    min_df=2,               # ignore rare terms
    max_df=0.8,             # drop very common terms
    stop_words="english"
)
X_tfidf = vectorizer.fit_transform(texts_all)
terms = np.array(vectorizer.get_feature_names_out())

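def top_keywords_for_cluster(rows_idx, topk=10):
    # average TF-IDF within the cluster, then take top-k features
    # rows_idx must be row positions in X_tfidf (same order as df)
    if len(rows_idx) == 0:
        return []
    # np.asarray(...).ravel() flattens the 1 x n_terms mean, whether it comes back as a matrix or an array
    mean_tfidf = np.asarray(X_tfidf[rows_idx].mean(axis=0)).ravel()
    top_idx = np.argsort(-mean_tfidf)[:topk]
    return terms[top_idx].tolist()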
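
To make the "average, then take top-k" idea concrete, here is a small worked example on a dense matrix. The terms and weights are invented for illustration and do not come from the show data; the function name top_terms is hypothetical.

```python
import numpy as np

# Hypothetical 4-document x 3-term TF-IDF matrix (values invented for illustration).
terms = np.array(["crime", "family", "police"])
X = np.array([
    [0.9, 0.0, 0.4],   # doc 0, cluster A
    [0.7, 0.1, 0.6],   # doc 1, cluster A
    [0.0, 0.8, 0.0],   # doc 2, cluster B
    [0.1, 0.9, 0.0],   # doc 3, cluster B
])

def top_terms(X, rows, terms, topk=2):
    """Average the selected rows' weights per term, then return the top-k terms."""
    mean_weights = X[rows].mean(axis=0)
    order = np.argsort(-mean_weights)[:topk]   # negate to sort descending
    return terms[order].tolist()

print(top_terms(X, [0, 1], terms))  # cluster A
print(top_terms(X, [2, 3], terms))  # cluster B
```

For cluster A the per-term means are 0.8, 0.05 and 0.5, so "crime" and "police" rank first.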


In [4]:

# ---- Top genres per cluster
def top_genres_for_cluster(sub, k=3):
    flat = [g for lst in sub["genres"].dropna() for g in (lst if isinstance(lst, list) else [])]
    if not flat:
        return []
    return pd.Series(flat).value_counts().head(k).index.tolist()



In [7]:
# ---------------------------------------------------------
# SUMMARIZE EACH CLUSTER
# ---------------------------------------------------------
profiles = []
for c, sub in df.groupby("cluster", sort=True):
    # df has a default RangeIndex after the merge, so index labels double as row positions in X_tfidf
    row_idx = sub.index.values
    keywords = top_keywords_for_cluster(row_idx, topk=8)
    top_genres = top_genres_for_cluster(sub, k=3)
    examples = sub["name"].head(5).tolist()
    label = ", ".join(keywords[:3]) if keywords else f"Cluster {c}"

    profiles.append({
        "cluster": int(c),
        "num_shows": int(len(sub)),
        "cluster_label": label,
        "top_keywords": ", ".join(keywords),
        "top_genres": ", ".join(top_genres),
        "example_shows": ", ".join(examples),
    })

profiles_df = pd.DataFrame(profiles).sort_values("cluster").reset_index(drop=True)

# ---- Show on screen
print("📋 Cluster profiles preview:")
print(profiles_df.head(10).to_string(index=False), "\n")

# ---------------------------------------------------------
# ADD LABELS BACK TO FULL DATA
# ---------------------------------------------------------
labels_map = dict(zip(profiles_df["cluster"], profiles_df["cluster_label"]))
df["cluster_label"] = df["cluster"].map(labels_map)

print("✅ Sample of merged data with cluster labels:")
print(df[["id", "name", "cluster", "cluster_label"]].head(10), "\n")



📋 Cluster profiles preview:
 cluster  num_shows                cluster_label                                                         top_keywords                     top_genres                                                         example_shows
       0         40           family, love, life     family, love, life, downs, ups downs, ups, challenges, heartfelt         Comedy, Romance, Drama Glee, Californication, Last Man Standing, Nashville, Red Band Society
       1         49     supernatural, dark, town    supernatural, dark, town, world, secrets, forces, navigates, life        Drama, Horror, Thriller                     Under the Dome, Bitten, Revenge, Grimm, Lost Girl
       2         43         crime, justice, team           crime, justice, team, high, stakes, world, navigate, drama           Drama, Crime, Action       Person of Interest, True Detective, Homeland, Gotham, Continuum
       3         31 diverse, humanity, survivors    diverse, humanity, survivors, new, group,
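
The preview shows overlapping unigrams and bigrams in the keyword lists (for example "downs", "ups downs", "ups"), because the vectorizer uses ngram_range=(1,2). One possible tidy-up, sketched below, keeps terms in rank order and skips any term that reuses a word already kept. The helper name dedupe_ngrams is hypothetical, and this rule is only one of several reasonable choices.

```python
def dedupe_ngrams(ranked_terms):
    """Keep terms in rank order, skipping any that reuse a word already kept."""
    seen_words = set()
    kept = []
    for term in ranked_terms:
        words = term.split()
        if any(w in seen_words for w in words):
            continue  # overlaps with a higher-ranked term
        kept.append(term)
        seen_words.update(words)
    return kept

# Keywords for cluster 0, copied from the preview above.
keywords = ["family", "love", "life", "downs", "ups downs", "ups", "challenges", "heartfelt"]
print(dedupe_ngrams(keywords))
```

In the clusters shown, the first three keywords never overlap, so this would mainly tidy the top_keywords column and leave cluster_label unchanged.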

In [6]:
# ---- Save enriched per-show data (for later classification and analysis)
out_parquet = DATA_DIR / "shows_with_cluster_labels.parquet"
df.to_parquet(out_parquet, index=False)
print("✅ Saved:", out_parquet.resolve())

# ---- Save compact cluster profiles
out_csv = DATA_DIR / "cluster_profiles.csv"
profiles_df.to_csv(out_csv, index=False, encoding="utf-8")
print("✅ Saved:", out_csv.resolve())

# ---- Also save a slide-ready Markdown summary
md_path = REPORTS / "cluster_profiles.md"
with open(md_path, "w", encoding="utf-8") as f:
    f.write("# Cluster Profiles\n\n")
    for _, r in profiles_df.iterrows():
        f.write(f"## Cluster {r['cluster']} - {r['cluster_label']}  \n")
        f.write(f"- **#Shows:** {r['num_shows']}\n")
        if r["top_genres"]:
            f.write(f"- **Top Genres:** {r['top_genres']}\n")
        f.write(f"- **Top Keywords:** {r['top_keywords']}\n")
        f.write(f"- **Example Shows:** {r['example_shows']}\n\n")
print("📝 Saved:", md_path.resolve())

✅ Saved: C:\Users\brethm01\tv-nlp\src\data\shows_with_cluster_labels.parquet
✅ Saved: C:\Users\brethm01\tv-nlp\src\data\cluster_profiles.csv
📝 Saved: C:\Users\brethm01\tv-nlp\src\reports\cluster_profiles.md
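
The profile columns are stored as ", "-joined strings, so a later step that needs lists has to split them again. Below is a small sketch of reading one row back with the standard csv module. The sample row is copied from the preview above, and split_field is a hypothetical helper.

```python
import csv
import io

# One row in the shape written to cluster_profiles.csv (values copied from the preview above).
sample_csv = io.StringIO(
    "cluster,num_shows,cluster_label,top_keywords,top_genres,example_shows\n"
    '2,43,"crime, justice, team","crime, justice, team, high, stakes, world, navigate, drama",'
    '"Drama, Crime, Action","Person of Interest, True Detective, Homeland, Gotham, Continuum"\n'
)

def split_field(value):
    """Turn a ', '-joined string back into a list, ignoring empty pieces."""
    return [part.strip() for part in value.split(",") if part.strip()]

rows = list(csv.DictReader(sample_csv))
profile = rows[0]
keywords = split_field(profile["top_keywords"])
genres = split_field(profile["top_genres"])
print(int(profile["cluster"]), keywords[:3], genres)
```

Splitting on commas is only safe when the items themselves contain none; a show title with a comma in it would be split in two, so example_shows is best treated as display text.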


In [8]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from pathlib import Path

# Load clustered data (after merging and KMeans)
df = pd.read_parquet("data/shows_with_umap_kmeans.parquet")   # id, name, u1, u2, cluster
review = pd.read_csv("data/shows_for_review.csv")             # id, name, ai_summary
merged = pd.merge(review, df[["id", "cluster"]], on="id", how="left")

# Clean missing summaries
merged["ai_summary"] = merged["ai_summary"].fillna("")

print(f"✅ Data loaded: {len(merged)} shows across {merged['cluster'].nunique()} clusters\n")

# --- Build TF-IDF model ---
vectorizer = TfidfVectorizer(
    max_features=5000,
    ngram_range=(1, 2),     # unigrams + bigrams
    stop_words="english",
    min_df=2,
    max_df=0.8
)

X_tfidf = vectorizer.fit_transform(merged["ai_summary"])
terms = np.array(vectorizer.get_feature_names_out())

# --- Compute top keywords per cluster ---
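def top_keywords_for_cluster(cluster_id, topk=10):
    # merged keeps a default RangeIndex, so index labels match row positions in X_tfidf
    rows = merged.index[merged["cluster"] == cluster_id].tolist()
    if not rows:
        return []
    # np.asarray(...).ravel() flattens the 1 x n_terms mean, whether it comes back as a matrix or an array
    mean_tfidf = np.asarray(X_tfidf[rows].mean(axis=0)).ravel()
    top_idx = np.argsort(-mean_tfidf)[:topk]
    return terms[top_idx].tolist()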

results = []
for c in sorted(merged["cluster"].dropna().unique()):
    keywords = top_keywords_for_cluster(int(c), topk=10)
    results.append({"cluster": int(c), "top_keywords": ", ".join(keywords)})

keywords_df = pd.DataFrame(results)

# --- Show preview ---
print("📋 Top TF-IDF keywords per cluster:\n")
print(keywords_df.to_string(index=False))

# --- Save for analysis / slides ---
out_path = Path("data/top_keywords_per_cluster.csv")
keywords_df.to_csv(out_path, index=False, encoding="utf-8")
print(f"\n✅ Saved: {out_path.resolve()}")

✅ Data loaded: 485 shows across 5 clusters

📋 Top TF-IDF keywords per cluster:

 cluster                                                                           top_keywords
       0      family, love, life, downs, ups downs, ups, challenges, heartfelt, quirky, moments
       1      supernatural, dark, town, world, secrets, forces, navigates, life, past, identity
       2           crime, justice, team, high, stakes, world, navigate, drama, new, high stakes
       3 diverse, humanity, survivors, new, group, face, challenges, earth, thrilling, navigate
       4     personal, world, challenges, family, navigates, life, explores, city, drama, sharp

✅ Saved: C:\Users\brethm01\tv-nlp\src\data\top_keywords_per_cluster.csv
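
Several terms in the table above recur across clusters ("world" in clusters 1, 2 and 4; "challenges" in 0, 3 and 4), because a plain within-cluster mean rewards terms that are common everywhere. One possible refinement, which is not what the cells above do, is to rank terms by the mean weight inside the cluster minus the mean weight outside it. The sketch below uses an invented dense matrix, and distinctive_terms is a hypothetical name.

```python
import numpy as np

# Hypothetical TF-IDF weights: 4 documents x 3 terms (values invented for illustration).
terms = np.array(["world", "crime", "family"])
X = np.array([
    [0.6, 0.5, 0.0],   # cluster 0
    [0.6, 0.4, 0.0],   # cluster 0
    [0.6, 0.0, 0.5],   # cluster 1
    [0.6, 0.0, 0.4],   # cluster 1
])
clusters = np.array([0, 0, 1, 1])

def distinctive_terms(X, clusters, cluster_id, terms, topk=1):
    """Rank terms by mean weight inside the cluster minus mean weight outside it."""
    inside = clusters == cluster_id
    score = X[inside].mean(axis=0) - X[~inside].mean(axis=0)
    order = np.argsort(-score)[:topk]
    return terms[order].tolist()

# Plain within-cluster mean picks the term that is common everywhere.
plain_top = terms[np.argmax(X[clusters == 0].mean(axis=0))]
print("plain mean, cluster 0:", plain_top)
print("contrast,   cluster 0:", distinctive_terms(X, clusters, 0, terms))
print("contrast,   cluster 1:", distinctive_terms(X, clusters, 1, terms))
```

With the sparse X_tfidf from this notebook, each mean would first need flattening, for example with np.asarray(...).ravel(), before subtracting.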


In [11]:

#from wordcloud import WordCloud
#import matplotlib.pyplot as plt

#for _, row in keywords_df.iterrows():
#    plt.figure(figsize=(5, 5))
#    wc = WordCloud(width=600, height=400, background_color="white").generate(row["top_keywords"])
#    plt.imshow(wc, interpolation="bilinear")
#    plt.axis("off")
#    plt.title(f"Cluster {row['cluster']} - Top Keywords")
#    plt.tight_layout()
#    plt.show()