# Unsupervised Theme Discovery for Supply-Chain Sustainability Texts


This notebook helps you discover data-driven themes in your corpus (e.g., 700 story + response pairs).
It follows a **map → cluster → label** pattern using text embeddings and clustering, then surfaces
representative quotes and exports results for further analysis.

**What you'll get**
- Cleaned corpus (merging story and response text per case)
- Vector embeddings (Sentence-Transformers or OpenAI embeddings)
- Dimensionality reduction (UMAP) for visualization
- Clustering (HDBSCAN by default; k-means as a fallback)
- Automatic keyword-based labels per cluster (c-TF-IDF)
- Representative quotes/examples per cluster
- CSV exports you can use downstream

> Tip: Treat this as *broad discovery*; you'll later do schema-based extraction for your RQs.


## 0) Prerequisites

Install these packages in your environment (e.g., with `pip install ...`):

```
pip install pandas numpy scikit-learn umap-learn hdbscan sentence-transformers nltk tqdm
# Optional (for OpenAI-based embeddings or labels)
pip install openai
```


## 1) Load your data

Expected input: a CSV or JSONL with at least these columns:
- `Companies`, `Company Sectors`, `Company Headquarters`, `Countries`, `Backdate`
- `story_text` and `response_text`

You can also adapt the loader if your schema differs.


In [6]:
# === Config ===
INPUT_PATH = "20250925_1051_bhrrc_scraper_output.json"  # or "your_cases.jsonl"
INPUT_FORMAT = "json"           # "csv" or "jsonl"
TEXT_COLUMNS = ["story_text", "response_text"]  # adapt if needed
ID_COL = None  # if you have a unique ID column, put its name here
DATE_COL = "Backdate"  # optional; we'll try to parse
SAVE_PREFIX = "unsup_themes"

import pandas as pd
import numpy as np
import os, re, math
from datetime import datetime

def load_data(path, fmt="json"):
    if fmt == "csv":
        df = pd.read_csv(path)
    elif fmt == "json":
        df = pd.read_json(path, lines=True)
    else:
        raise ValueError("Unsupported format")
    return df

os.chdir("C:/Users/bscherrer/Documents/snf-project3")

df = load_data(INPUT_PATH, INPUT_FORMAT)
print("Loaded rows:", len(df))
df.head(3)


Loaded rows: 243


Unnamed: 0,Companies,Company Sectors,Company Headquarters,Countries,Regions,Response Sectors,Backdate,Title,Responded To,Tags,Responded,Authors,URL,Link to Company Page,Story,Response,story_text,response_text
0,Shell plc,"Hydrogen|Oil, gas & coal",GB,IE,,"Oil, gas & coal",02.06.2009,Shell response re Corrib gas protest,Ireland: Archbishop Desmond Tutu raises concer...,Protests|Violence,Yes,"Terry Nolan, Managing Director, Shell E&P Ireland",https://www.business-humanrights.org/en/latest...,https://www.business-humanrights.org/en/compan...,,,Ireland: Archbishop Desmond Tutu raises concer...,We wish to state categorically that there was ...
1,Shell plc,"Hydrogen|Oil, gas & coal",GB,NG,,"Oil, gas & coal",08.07.2009,Shell response re climate change report,NGO report on Shell's impact on climate change...,"Clean, Healthy & Sustainable Environment|Clima...",Yes,Shell,https://www.business-humanrights.org/en/latest...,https://www.business-humanrights.org/en/compan...,,,NGO report on Shell's impact on climate change...,The report makes assumptions about investment ...
2,Shell plc,"Hydrogen|Oil, gas & coal",GB,NG,,"Oil, gas & coal",08.11.2006,Shell response to Urgent Action statement by M...,Movement for the Survival of the Ogoni People ...,"Clean, Healthy & Sustainable Environment|Secur...",Yes,Shell,https://www.business-humanrights.org/en/latest...,https://www.business-humanrights.org/en/compan...,,,Movement for the Survival of the Ogoni People ...,"For many years, MOSOP has claimed that SPDC [S..."


## 2) Clean & prepare the corpus

- Merge story & response into one `text` field per case (or keep both; we’ll default to a combined text).
- Light normalization (whitespace, unicode). Keep punctuation and case (can help for names).

We also add a short `doc_id` for traceability in later exports.


In [None]:
import unicodedata

def normalize_text(s):
    if not isinstance(s, str):
        return ""
    s = unicodedata.normalize("NFKC", s)
    s = re.sub(r"\s+", " ", s).strip()
    return s

# Build a combined text field
def combine_text(row, cols):
    parts = [normalize_text(row.get(c, "")) for c in cols if c in row]
    return " \n\n ".join([p for p in parts if p])

df["text"] = df.apply(lambda r: combine_text(r, TEXT_COLUMNS), axis=1)

# Create a doc_id for traceability
if ID_COL and ID_COL in df.columns:
    df["doc_id"] = df[ID_COL].astype(str)
else:
    df["doc_id"] = [f"doc_{i:04d}" for i in range(len(df))]

# Parse Backdate if present (day.month.year common format)
def parse_backdate(x):
    if pd.isna(x):
        return pd.NaT
    x = str(x).strip()
    for fmt in ("%d.%m.%Y", "%Y-%m-%d", "%d/%m/%Y", "%m/%d/%Y"):
        try:
            return datetime.strptime(x, fmt).date()
        except:
            pass
    return pd.NaT

if "Backdate" in df.columns:
    df["date"] = df["Backdate"].apply(parse_backdate)
else:
    df["date"] = pd.NaT

# Filter empty texts
df = df[df["text"].str.len() > 0].reset_index(drop=True)
print("After cleaning, rows:", len(df))
df[["doc_id", "Companies", "Company Sectors", "Countries", "text", "date"]].head(3)

After cleaning, rows: 199


"Ireland: Archbishop Desmond Tutu raises concerns over alleged assault on a protestor during protest against Shell's Corrib project - Business & Human Rights Resource Centre \n\n We wish to state categorically that there was no physical attack of any kind on Mr Corduff by anyone while he was present on our construction site. The security staff employed by the Corrib Gas Partners have been fully briefed on the ethical and behavioural standards they are expected to meet when engaging with community m"

## 3) Create embeddings

Two options:
1. **Sentence-Transformers** (default, local): e.g., `all-MiniLM-L6-v2` (fast) or `all-mpnet-base-v2` (higher quality).
2. **OpenAI embeddings** (optional): requires API key; can be helpful if you already use their stack.

We'll implement Sentence-Transformers first, with a toggle to switch.


In [14]:
# === Embeddings (Sentence-Transformers by default) ===
USE_OPENAI = False  # set True to use OpenAI embeddings instead
SENTENCE_MODEL_NAME = "sentence-transformers/all-MiniLM-L6-v2"  # or 'all-mpnet-base-v2'

embeddings = None

if not USE_OPENAI:
    from sentence_transformers import SentenceTransformer
    from tqdm import tqdm

    model = SentenceTransformer(SENTENCE_MODEL_NAME)
    texts = df["text"].tolist()
    embeddings = model.encode(texts, show_progress_bar=True, convert_to_numpy=True, normalize_embeddings=True)
else:
    # OpenAI path (optional). Requires OPENAI_API_KEY in env.
    # from openai import OpenAI
    # client = OpenAI()
    # def get_openai_embeds(batch):
    #     resp = client.embeddings.create(
    #         model="text-embedding-3-large",
    #         input=batch
    #     )
    #     return np.array([d.embedding for d in resp.data], dtype="float32")
    # # Batch and call API here...
    raise NotImplementedError("Set USE_OPENAI=False or implement your OpenAI embedding code.")
    
print("Embedding shape:", None if embeddings is None else embeddings.shape)


  from .autonotebook import tqdm as notebook_tqdm
To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development
Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`
Batches: 100%|██████████| 7/7 [00:06<00:00,  1.06it/s]

Embedding shape: (199, 384)





## 4) Dimensionality reduction (UMAP)

UMAP to:
- 2D for plotting
- ~15D (optional) for clustering stability

HDBSCAN can work directly on the original embedding, but many practitioners cluster on a lower-d space for robustness.


In [None]:
# === UMAP dimensionality reduction ===
from umap import UMAP

# 2D for visualization
umap_2d = UMAP(n_components=2, random_state=42)
X_2d = umap_2d.fit_transform(embeddings)

# Optional: reduced space for clustering (often 10-15 dims)
umap_hd = UMAP(n_components=15, random_state=42)
X_hd = umap_hd.fit_transform(embeddings)

df["umap_x"] = X_2d[:,0]
df["umap_y"] = X_2d[:,1]

print("UMAP shapes:", X_2d.shape, X_hd.shape)


TypeError: 'module' object is not callable

## 5) Clustering

- **HDBSCAN** (density-based, auto-detects number of clusters, handles noise)
- Fallback: **k-means** (requires `n_clusters` choice)

We save `cluster_id` per document. HDBSCAN may assign `-1` to noise.


In [None]:
# === Clustering (HDBSCAN with fallback to k-means) ===
cluster_labels = None
algo = None

try:
    import hdbscan
    clusterer = hdbscan.HDBSCAN(min_cluster_size=15, min_samples=None, metric='euclidean')
    cluster_labels = clusterer.fit_predict(X_hd)  # or embeddings directly
    algo = "hdbscan"
    print("HDBSCAN clusters (incl. noise=-1):", len(set(cluster_labels)))
except Exception as e:
    print("HDBSCAN unavailable or failed:", e)
    # Fallback to k-means with a heuristic k
    from sklearn.cluster import KMeans
    k = max(5, int(math.sqrt(len(df)) // 2))  # rough heuristic, tune as needed
    kmeans = KMeans(n_clusters=k, random_state=42, n_init="auto")
    cluster_labels = kmeans.fit_predict(X_hd)
    algo = "kmeans"
    print("KMeans clusters:", k)

df["cluster_id"] = cluster_labels
df["is_noise"] = (cluster_labels == -1) if algo == "hdbscan" else False
df["cluster_id"].value_counts().sort_index().head(20)


## 6) Topic labels via keywords (c-TF-IDF)

We compute per-cluster "class-based TF-IDF" keywords and short labels. This approximates BERTopic's labeling step.
Later, you may refine labels with an LLM (optional cell below).


In [None]:
# === c-TF-IDF keyword extraction per cluster ===
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

def c_tf_idf(corpus, m, ngram_range=(1,2), min_df=2):
    # corpus: list of concatenated documents per class
    vectorizer = CountVectorizer(ngram_range=ngram_range, min_df=min_df, stop_words='english')
    X = vectorizer.fit_transform(corpus)
    transformer = TfidfTransformer(norm=None, use_idf=True, smooth_idf=True, sublinear_tf=False)
    tf_idf = transformer.fit_transform(X)
    # scale by class lengths (m = total documents)
    tf_idf = tf_idf / m
    return tf_idf, vectorizer

def top_terms_per_class(tf_idf, vectorizer, topk=15):
    terms = np.array(vectorizer.get_feature_names_out())
    tops = []
    for row in tf_idf:
        idx = np.argsort(row.toarray()[0])[::-1][:topk]
        tops.append(terms[idx].tolist())
    return tops

# Build class documents: concatenate texts per cluster
clusters = sorted(df["cluster_id"].dropna().unique().tolist())
cluster_texts = []
cluster_sizes = []

for c in clusters:
    texts_c = df.loc[df["cluster_id"]==c, "text"].astype(str).tolist()
    cluster_sizes.append(len(texts_c))
    cluster_texts.append(" \n ".join(texts_c) if texts_c else "")

tfidf_mat, vect = c_tf_idf(cluster_texts, m=len(df), ngram_range=(1,2), min_df=2)
keywords_per_cluster = top_terms_per_class(tfidf_mat, vect, topk=15)

cluster_labels_map = {}
for c, kws in zip(clusters, keywords_per_cluster):
    # Create a simple label from top 3 keywords
    label = ", ".join(kws[:3]) if kws else "misc"
    cluster_labels_map[c] = {"label": label, "keywords": kws}

# Attach labels
df["cluster_label"] = df["cluster_id"].map(lambda c: cluster_labels_map.get(c, {}).get("label", "misc"))
df[["doc_id","cluster_id","cluster_label"]].head(10)


## 7) Representative quotes / examples

For each cluster, surface a few short snippets from the most central documents (centroid-based for k-means;
for HDBSCAN, we pick documents with high membership probability if available, else nearest to cluster centroid in embedding space).


In [None]:
# === Representative snippets per cluster ===
from sklearn.metrics import pairwise_distances
import numpy as np

def get_representatives(emb_matrix, ids, n=5):
    # pick n docs closest to centroid
    if len(ids) == 0:
        return []
    idx = np.array(ids)
    centroid = emb_matrix[idx].mean(axis=0, keepdims=True)
    dists = pairwise_distances(emb_matrix[idx], centroid, metric="euclidean").ravel()
    order = np.argsort(dists)
    return idx[order][:n].tolist()

def shorten(text, n=240):
    s = re.sub(r"\s+", " ", text).strip()
    return (s[:n] + "…") if len(s) > n else s

cluster_summaries = []
for c in clusters:
    doc_idx = df.index[df["cluster_id"]==c].tolist()
    rep_idx = get_representatives(embeddings, doc_idx, n=5)
    reps = []
    for ridx in rep_idx:
        row = df.iloc[ridx]
        # use part of response_text if available, else combined text
        base = row.get("response_text", "") or row["text"]
        reps.append({
            "doc_id": row["doc_id"],
            "snippet": shorten(base, 280)
        })
    cluster_summaries.append({
        "cluster_id": int(c),
        "size": int(len(doc_idx)),
        "label": cluster_labels_map.get(c, {}).get("label", "misc"),
        "keywords": cluster_labels_map.get(c, {}).get("keywords", []),
        "representatives": reps
    })

# Preview a couple of clusters
cluster_summaries[:2]


## 8) Visualization & exports

- 2D scatter plot of UMAP with cluster IDs
- CSV export: `themes_assignments.csv` and `cluster_summaries.csv` including keywords and representative quotes


In [None]:
# === Simple 2D scatter plot ===
import matplotlib.pyplot as plt

fig = plt.figure(figsize=(8,6))
plt.scatter(df["umap_x"], df["umap_y"], s=6)
plt.title("UMAP projection (all docs)")
plt.xlabel("UMAP-1")
plt.ylabel("UMAP-2")
plt.show()

# Export results
assign_cols = ["doc_id","Companies","Company Sectors","Company Headquarters","Countries","date","cluster_id","cluster_label","umap_x","umap_y"]
assign_cols = [c for c in assign_cols if c in df.columns]
assign_df = df[assign_cols].copy()
assign_path = f"{SAVE_PREFIX}_assignments.csv"
assign_df.to_csv(assign_path, index=False)

# Summaries export
import json
summ_path = f"{SAVE_PREFIX}_cluster_summaries.json"
with open(summ_path, "w", encoding="utf-8") as f:
    json.dump(cluster_summaries, f, ensure_ascii=False, indent=2)

assign_path, summ_path


## 9) (Optional) LLM-assisted labeling

If you want nicer human-readable labels, call an LLM with each cluster's top keywords and 2–3 example snippets.
This cell is optional and requires an API key (commented by default).


In [None]:
# === Optional: LLM-assisted label refinement (commented) ===
# This cell sends top keywords + representative snippets to an LLM to produce cleaner labels.
# Requires OPENAI_API_KEY in your environment.
#
# from openai import OpenAI
# client = OpenAI()
#
# def refine_label(keywords, reps):
#     prompt = f"""
#     You are labeling a text cluster from corporate sustainability responses.
#     Here are the top keywords: {', '.join(keywords)}
#     Here are representative snippets:
#     {chr(10).join('- ' + r['snippet'] for r in reps)}
#     Suggest a short, human-readable label (max 6 words) and a one-sentence description.
#     Respond as JSON with keys: label, description.
#     """
#     resp = client.chat.completions.create(
#         model="gpt-4o-mini",
#         messages=[{"role":"user","content":prompt}],
#         temperature=0.2
#     )
#     import json
#     return json.loads(resp.choices[0].message.content)
#
# refined = {}
# for cs in cluster_summaries:
#     refined[cs["cluster_id"]] = refine_label(cs["keywords"], cs["representatives"])
#
# refined
