# Customer Complaints Log Mining

## Project Overview
**Objective:** Automate the analysis of unstructured exception logs using GenAI and semantic clustering

**Data Source:** [Filtered CFPB Consumer Complaint Database](https://www.consumerfinance.gov/data-research/consumer-complaints/search/api/v1/?date_received_max=2026-02-14&date_received_min=2011-12-01&field=all&format=csv&has_narrative=true&no_aggs=true&product=Credit%20card%20or%20prepaid%20card&product=Money%20transfer%2C%20virtual%20currency%2C%20or%20money%20service&product=Checking%20or%20savings%20account&size=387654)

**Target Products:** Credit Cards and Money Transfers

**Approach:** End-to-end pipeline combining LLM extraction, semantic vectorisation, and unsupervised clustering

---

### Phases
1. **Data Ingestion & Strategic Sampling**: Filter and sample high-signal complaint narratives
2. **GenAI Extraction**: Extract structured exception data (failure type, ISO codes, severity)
3. **Vectorisation & Clustering**: Identify recurring error patterns
4. **Executive Dashboard**: Visualise and summarise systemic issues

## Setup: Import Core Libraries

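Import the libraries used across the notebook: data handling and progress bars (Phase 1), LLM access and JSON handling (Phase 2), and clustering, dimensionality reduction, and plotting (Phase 3).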

In [None]:
# Phase 1
import pandas as pd
import numpy as np
import re
from tqdm import tqdm
# Phase 2
import json
import os
import time
from dotenv import load_dotenv
from groq import Groq
# Phase 3
from sklearn.cluster import KMeans, HDBSCAN
from sklearn.metrics import (
    silhouette_score,
    pairwise_distances_argmin_min,
    davies_bouldin_score,
    calinski_harabasz_score,
)
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.manifold import TSNE
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots


## Raw Data

Load the filtered CFPB Consumer Complaint Database export (also available on GitHub under the data section). The CSV contains consumer complaint narratives, which this project treats as a stand-in for real-world payment exception logs.

In [None]:
#Load data
df = pd.read_csv("<Complaints data>")
df.head(5)

|  | Date received | Product | Sub-product | Issue | Sub-issue | Consumer complaint narrative | Company public response | Company | State | ZIP code | Tags | Consumer consent provided? | Submitted via | Date sent to company | Company response to consumer | Timely response? | Consumer disputed? | Complaint ID |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 03/09/20 | Credit card or prepaid card | General-purpose credit card or charge card | Fees or interest | Problem with fees | On XX/XX/2020 I booked a hotel in XXXX that re... |  | JPMORGAN CHASE & CO. | CT | 06854 |  | Consent provided | Web | 03/09/20 | Closed with explanation | Yes |  | 3559709 |
| 1 | 03/18/20 | Credit card or prepaid card | General-purpose credit card or charge card | Fees or interest | Problem with fees | I have had a Barclays/Uber credit card for nea... | Company has responded to the consumer and the ... | BARCLAYS BANK DELAWARE | CA | 94596 |  | Consent provided | Web | 03/18/20 | Closed with explanation | Yes |  | 3571498 |
| 2 | 01/15/18 | Money transfer, virtual currency, or money ser... | Virtual currency | Other service problem |  | Unable to log in as 2FA code was lost. Filed i... |  | Coinbase, Inc. |  | XXXXX |  | Consent provided | Web | 01/15/18 | Closed with explanation | Yes |  | 2782700 |
| 3 | 01/04/20 | Credit card or prepaid card | Government benefit card | Trouble using the card | Trouble getting information about the card | I received my new card for my child support an... | Company has responded to the consumer and the ... | BANK OF AMERICA, NATIONAL ASSOCIATION | AZ | 85379 | Servicemember | Consent provided | Web | 01/04/20 | Closed with explanation | Yes |  | 3485072 |
| 4 | 11/30/25 | Checking or savings account | Checking account | Managing an account | Problem making or receiving payments | My XXXX access on my SoFi checking account has... |  | SOFI TECHNOLOGIES, INC. | NJ | 07052 |  | Consent provided | Web | 11/30/25 | Closed with explanation | Yes |  | 18037906 |


---

## Phase 1: Data Ingestion & Sampling

### Objectives
- Filter records where consumer complaint narrative is not null
- Focus on Credit Card and Money Transfer products
- Remove low-signal narratives (too short or over-redacted)
- Perform stratified sampling to create a manageable dataset
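
The proportional stratified sampling step can be illustrated without pandas to show how per-product quotas are derived. This is a minimal sketch, not the notebook's implementation: `allocate_quotas` and the example counts are hypothetical, and the rounding rule (at least one row per stratum, capped at the stratum size) is assumed to mirror the sampling cell further down.

```python
def allocate_quotas(counts: dict[str, int], target_n: int) -> dict[str, int]:
    """Split target_n across strata in proportion to their sizes."""
    total = sum(counts.values())
    quotas = {}
    for name, size in counts.items():
        share = size / total
        # At least one row per stratum, never more than the stratum holds.
        quotas[name] = min(size, max(1, int(round(share * target_n))))
    return quotas


# Hypothetical product counts, for illustration only.
example_counts = {"Credit card or prepaid card": 750, "Money transfer": 250}
print(allocate_quotas(example_counts, 100))
```

Because each quota is rounded independently, the quotas may sum to slightly more or less than `target_n`.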

### Signal Quality Heuristics
- Minimum narrative length: 200 characters
- Maximum redaction ratio: 35% (to filter heavily XXXX-masked text)
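
The two thresholds above can be expressed as a small predicate. The sketch below is a stricter variant rather than the notebook's exact heuristic: it assumes redaction appears as runs of two or more uppercase `X` characters, so ordinary words such as "tax" are not counted, and `redaction_ratio` and `is_high_signal` are hypothetical helper names.

```python
import re

# Assumption: CFPB masking shows up as runs of uppercase X (e.g. "XXXX", "XX/XX").
REDACTION_RUN = re.compile(r"X{2,}")


def redaction_ratio(text: str) -> float:
    """Fraction of characters that belong to an XX... redaction run."""
    if not text:
        return 1.0  # treat empty text as fully redacted
    masked = sum(len(m.group(0)) for m in REDACTION_RUN.finditer(text))
    return masked / len(text)


def is_high_signal(text: str, min_len: int = 200, max_ratio: float = 0.35) -> bool:
    """Apply the minimum-length and maximum-redaction thresholds."""
    return len(text) >= min_len and redaction_ratio(text) <= max_ratio


print(redaction_ratio("On XX/XX/2020 I paid XXXX"))
```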

In [None]:
# Phase 1: Filter + strategic sampling

narrative_col = "Consumer complaint narrative"
product_col = "Product"

if narrative_col not in df.columns or product_col not in df.columns:
    raise ValueError(f"Expected columns missing. Have: {list(df.columns)[:10]}...")

# "card" and "transfer" also cover "credit card" and "money transfer"
product_mask = df[product_col].str.contains(
    r"transfer|card",
    case=False,
    na=False,
    regex=True,
)

narrative_mask = df[narrative_col].notna()

filtered = df.loc[product_mask & narrative_mask].copy()

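# Heuristic for low-signal narratives
min_len = 200
max_redaction_ratio = 0.35

# Counts every "X"/"x" character as redacted, so ordinary words containing
# "x" (e.g. "tax") slightly inflate the ratio
x_pattern = re.compile(r"[Xx]")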

lengths = filtered[narrative_col].str.len().fillna(0)
redacted_chars = filtered[narrative_col].fillna("").apply(lambda t: len(x_pattern.findall(t)))
redaction_ratio = redacted_chars / lengths.replace(0, np.nan)
redaction_ratio = redaction_ratio.fillna(1.0)

signal_mask = (lengths >= min_len) & (redaction_ratio <= max_redaction_ratio)
filtered = filtered.loc[signal_mask].copy()

print(f"Filtered rows: {len(filtered):,}")
filtered[[product_col, narrative_col]].head(3)

Filtered rows: 204,598


|  | Product | Consumer complaint narrative |
|---|---------|------------------------------|
| 0 | Credit card or prepaid card | On XX/XX/2020 I booked a hotel in XXXX that re... |
| 1 | Credit card or prepaid card | I have had a Barclays/Uber credit card for nea... |
| 2 | Money transfer, virtual currency, or money ser... | Unable to log in as 2FA code was lost. Filed i... |


In [None]:
# Stratified sampling: 2k-5k rows by default
min_n = 2000
max_n = 5000

# Pilot override: a smaller sample keeps free-tier LLM runs fast and cheap
# (set to None to use the 2k-5k range)
pilot_n = 500

if pilot_n is not None:
    target_n = int(np.clip(pilot_n, 1, max_n))
elif len(filtered) < min_n:
    target_n = len(filtered)
else:
    target_n = int(np.clip(len(filtered), min_n, max_n))

# Proportional stratified sample by Product
product_counts = filtered[product_col].value_counts()
product_props = product_counts / product_counts.sum()

sample_parts = []
rs = 42
for product, prop in product_props.items():
    n_i = max(1, int(round(prop * target_n)))
    subset = filtered.loc[filtered[product_col] == product]
    n_i = min(n_i, len(subset))
    sample_parts.append(subset.sample(n=n_i, random_state=rs))

sampled = pd.concat(sample_parts).sample(frac=1.0, random_state=rs).reset_index(drop=True)
print(f"Sampled rows: {len(sampled):,}")

# Track a stable id for saving progress
id_candidates = [c for c in df.columns if c.lower().replace(" ", "") in {"complaintid", "complaint_id"}]
record_id_col = id_candidates[0] if id_candidates else df.columns[0]
print(f"Using record id column: {record_id_col}")

sampled[[record_id_col, product_col, narrative_col]].head(3)

Sampled rows: 500
Using record id column: Complaint ID


|  | Complaint ID | Product | Consumer complaint narrative |
|---|--------------|---------|------------------------------|
| 0 | 2713392 | Credit card or prepaid card | I have two credit lines with Citi bank. One ba... |
| 1 | 5011508 | Money transfer, virtual currency, or money ser... | I attempted to purchase a XXXX XXXX through XX... |
| 2 | 4190905 | Credit card or prepaid card | Last XXXX my Wayfair store card was converted ... |


---

## Phase 2: GenAI Extraction Pipeline

### Structured Exception Schema
Extract the following fields from each complaint narrative using an LLM served through the Groq API (`llama-3.1-8b-instant`):

| Field | Description |
|-------|-------------|
| **failure_type** | Liquidity, Authentication, Tech Error, Fraud, Compliance, Dispute, Other |
| **iso_20022_proxy** | Simulated ISO 20022 reason codes (AC01, AM04, MS03, etc.) |
| **root_cause** | Concise semantic summary of the issue |
| **severity_index** | Integer 1-10 based on described impact |
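
LLM output does not always respect a schema, so it can help to validate each parsed record before saving it. The following is a minimal sketch under stated assumptions: `validate_record` is a hypothetical helper, the allowed failure types are taken from the table above, and invalid values are coerced (unknown types to "Other", unparseable severities to 1) rather than rejected.

```python
FAILURE_TYPES = {
    "Liquidity", "Authentication", "Tech Error",
    "Fraud", "Compliance", "Dispute", "Other",
}


def validate_record(record: dict) -> dict:
    """Return a copy of record with schema fields coerced to valid values."""
    cleaned = dict(record)

    # Unknown, missing, or non-string categories collapse to "Other".
    failure_type = cleaned.get("failure_type")
    if not isinstance(failure_type, str) or failure_type not in FAILURE_TYPES:
        cleaned["failure_type"] = "Other"

    # Severity must be an integer in [1, 10]; unparseable values become 1
    # (an assumed conservative default).
    try:
        severity = int(float(cleaned.get("severity_index")))
    except (TypeError, ValueError, OverflowError):
        severity = 1
    cleaned["severity_index"] = min(10, max(1, severity))

    cleaned["root_cause"] = str(cleaned.get("root_cause") or "").strip()
    return cleaned


print(validate_record({"failure_type": "fraud?", "severity_index": "12"}))
```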

### Safety Features
- Progress saving: Each result is appended to `llm_extracted_logs.jsonl` as soon as its call returns
- Resume capability: Records whose IDs already appear in that file (including ones logged as errors) are skipped on re-run
- Error handling: Failures are logged to the same file without halting the loop
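
The append-and-resume pattern can be sketched with the standard library alone. This is an illustrative variant, not the notebook's exact logic: `load_done_ids` and `append_record` are hypothetical helpers, and this version deliberately ignores lines that carry an `error` key so that failed records would be retried on the next run.

```python
import json
import os


def load_done_ids(path: str) -> set:
    """Collect record_ids of successfully processed lines in a JSONL file."""
    done = set()
    if not os.path.exists(path):
        return done
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            try:
                payload = json.loads(line)
            except json.JSONDecodeError:
                continue  # skip blank or partially written lines
            if isinstance(payload, dict) and "error" not in payload:
                done.add(payload.get("record_id"))
    return done


def append_record(path: str, payload: dict) -> None:
    """Append one JSON object per line so progress survives interruptions."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(payload) + "\n")
```

A processing loop would call `load_done_ids` once at start-up, skip any record whose ID is in the returned set, and call `append_record` after each LLM response.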

In [None]:
# Phase 2: LLM extraction to structured JSON

load_dotenv()
# Store GROQ_API_KEY in a local .env file in the same folder as this notebook
groq_key = os.getenv("GROQ_API_KEY")
if not groq_key:
    raise ValueError("GROQ_API_KEY not found. Please set it as an environment variable.")

client = Groq(api_key=groq_key)

output_path = "llm_extracted_logs.jsonl"

existing_ids = set()
if os.path.exists(output_path):
    with open(output_path, "r", encoding="utf-8") as f:
        for line in f:
            try:
                payload = json.loads(line)
                existing_ids.add(payload.get("record_id"))
            except json.JSONDecodeError:
                continue

system_prompt = (
    "You are a payment-ops analyst extracting structured exception data from noisy logs. "
    "Return only valid JSON matching the schema."
)

schema_hint = {
    "failure_type": "Liquidity | Authentication | Tech Error | Fraud | Compliance | Dispute | Other",
    "iso_20022_proxy": "AC01 | AM04 | MS03 | FF01 | RC01 | SL01 | ZZ99",
    "root_cause": "short semantic summary",
    "severity_index": "1-10 integer"
}

user_prompt_template = (
    "Extract a technical exception record from the narrative. "
    "If details are missing, choose 'Other' and a conservative severity. "
    "Map to a simulated ISO 20022 reason code. "
    "Return JSON only.\n\n"
    "Schema hint: {schema}\n\n"
    "Narrative:\n{narrative}"
)

json_pattern = re.compile(r"\{.*\}", re.DOTALL)

def parse_json(text: str) -> dict:
    match = json_pattern.search(text)
    if not match:
        raise ValueError("No JSON object found")
    return json.loads(match.group(0))

results = []
max_llm_calls = 500
processed_calls = 0

for _, row in tqdm(sampled.iterrows(), total=len(sampled)):
    if processed_calls >= max_llm_calls:
        break

    record_id = row[record_id_col]
    if record_id in existing_ids:
        continue

    narrative = str(row[narrative_col])

    try:
        response = client.chat.completions.create(
            model="llama-3.1-8b-instant",
            temperature=0.1,
            messages=[
                {"role": "system", "content": system_prompt},
                {
                    "role": "user",
                    "content": user_prompt_template.format(
                        schema=json.dumps(schema_hint),
                        narrative=narrative,
                    ),
                },
            ],
        )

        raw_text = response.choices[0].message.content
        parsed = parse_json(raw_text)
        parsed["record_id"] = record_id
        parsed["product"] = row[product_col]
        results.append(parsed)
        processed_calls += 1

        with open(output_path, "a", encoding="utf-8") as f:
            f.write(json.dumps(parsed, ensure_ascii=True) + "\n")

    except Exception as e:
        error_payload = {
            "record_id": record_id,
            "error": str(e),
        }
        with open(output_path, "a", encoding="utf-8") as f:
            f.write(json.dumps(error_payload, ensure_ascii=True) + "\n")
        time.sleep(0.5)

print(f"New records processed: {len(results):,}")

# Load all processed records (including previous runs)
processed = []
with open(output_path, "r", encoding="utf-8") as f:
    for line in f:
        try:
            payload = json.loads(line)
            if "root_cause" in payload:
                processed.append(payload)
        except json.JSONDecodeError:
            continue

extracted_df = pd.DataFrame(processed)
extracted_df.head(3)

100%|██████████| 500/500 [39:10<00:00,  4.70s/it]  

New records processed: 492





|  | failure_type | iso_20022_proxy | root_cause | severity_index | record_id | product |
|---|--------------|-----------------|------------|----------------|-----------|---------|
| 0 | Compliance | RC01 | Account closure due to suspected identity veri... | 8 | 7150434 | Credit card or prepaid card |
| 1 | Tech Error | MS03 | Delayed alert sending and miscommunication bet... | 8 | 2883204 | Money transfer, virtual currency, or money ser... |
| 2 | Compliance | RC01 | Company failed to provide terms and conditions... | 8 | 3898508 | Credit card or prepaid card |


---

## Phase 3: Vectorisation & Clustering

### Approach
1. Vectorise the extracted `root_cause` field using:
   - Primary: Sentence Transformers (`all-MiniLM-L6-v2`) for semantic embeddings
   - Fallback: TF-IDF for lightweight text representation

2. Clustering: Fit K-Means first, choosing the cluster count with the highest silhouette score, then fit HDBSCAN for comparison.

3. Auto-labelling: Name each cluster after the `root_cause` text that lies closest to the cluster centroid

### Goal
Identify recurring exception patterns that represent systemic operational issues, rather than isolated incidents.
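
One way to name a cluster is to pick the member whose vector lies closest to the cluster centroid. The sketch below shows that idea in plain Python with toy 2-D vectors; `closest_to_centroid` and the sample texts are hypothetical, and a real pipeline would typically rely on NumPy or scikit-learn for the same computation.

```python
import math


def closest_to_centroid(vectors: list[list[float]]) -> int:
    """Return the index of the vector nearest (Euclidean) to the mean vector."""
    dims = len(vectors[0])
    centroid = [sum(v[d] for v in vectors) / len(vectors) for d in range(dims)]
    distances = [math.dist(v, centroid) for v in vectors]
    return distances.index(min(distances))


# Toy data, for illustration only: three texts and made-up 2-D vectors.
texts = ["late fee charged twice", "fee dispute ignored", "card declined abroad"]
vectors = [[0.9, 0.1], [0.7, 0.3], [0.1, 0.9]]
print(texts[closest_to_centroid(vectors)])
```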

In [None]:
# Phase 3: Vectorisation + clustering
root_causes = extracted_df["root_cause"].fillna("").astype(str)

use_embeddings = True
try:
    from sentence_transformers import SentenceTransformer
    embedder = SentenceTransformer("all-MiniLM-L6-v2")
    X = embedder.encode(root_causes.tolist(), show_progress_bar=True)
except Exception as e:
    print(f"Embedding fallback to TF-IDF due to: {e}")
    use_embeddings = False

if not use_embeddings:
    tfidf = TfidfVectorizer(max_features=2000, ngram_range=(1, 2), stop_words="english")
    X = tfidf.fit_transform(root_causes)

# Choose k via silhouette, capture inertia for elbow curve
k_min, k_max = 3, 10
best_k, best_score = None, -1
k_values, inertias, sil_scores = [], [], []

for k in range(k_min, k_max + 1):
    km = KMeans(n_clusters=k, random_state=42, n_init="auto")
    labels = km.fit_predict(X)
    score = silhouette_score(X, labels)

    k_values.append(k)
    inertias.append(km.inertia_)
    sil_scores.append(score)

    if score > best_score:
        best_k, best_score = k, score

# Hyperparameter tuning visualisation
fig = make_subplots(specs=[[{"secondary_y": True}]])

fig.add_trace(
    go.Scatter(x=k_values, y=inertias, mode="lines+markers", name="Inertia", line=dict(color="blue")),
    secondary_y=False,
)
fig.add_trace(
    go.Scatter(x=k_values, y=sil_scores, mode="lines+markers", name="Silhouette Score", line=dict(color="red")),
    secondary_y=True,
)
fig.add_vline(x=best_k, line_dash="dash", line_color="grey")

fig.update_layout(
    title="K-Means Tuning: Inertia and Silhouette Score",
    xaxis_title="Number of Clusters (k)",
    legend=dict(orientation="h", yanchor="bottom", y=1.02, xanchor="right", x=1),
)
fig.update_yaxes(title_text="Inertia", secondary_y=False)
fig.update_yaxes(title_text="Silhouette Score", secondary_y=True)
fig.show()

kmeans = KMeans(n_clusters=best_k, random_state=42, n_init="auto")
cluster_labels = kmeans.fit_predict(X)

extracted_df["cluster_id"] = cluster_labels

vector_matrix = X.toarray() if hasattr(X, "toarray") else np.asarray(X)
extracted_df["vector"] = list(vector_matrix)


def get_cluster_names(df, cluster_col, vector_col, text_col="root_cause"):
    names = {}
    for cluster_id, group in df.groupby(cluster_col):
        vectors = np.vstack(group[vector_col].to_list())
        centroid = vectors.mean(axis=0, keepdims=True)
        closest_idx, _ = pairwise_distances_argmin_min(centroid, vectors, metric="euclidean")
        rep_text = str(group.iloc[closest_idx[0]][text_col]).strip()
        names[cluster_id] = f"Cluster {cluster_id}: {rep_text}"
    return names


cluster_names = get_cluster_names(extracted_df, "cluster_id", "vector", "root_cause")
extracted_df["cluster_label"] = extracted_df["cluster_id"].map(cluster_names)
extracted_df[["cluster_label", "root_cause"]].head(5)

Embedding fallback to TF-IDF due to: [WinError 1114] A dynamic link library (DLL) initialization routine failed. Error loading "c:\Users\beckk\AppData\Local\Programs\Python\Python312\Lib\site-packages\torch\lib\c10.dll" or one of its dependencies.


|  | cluster_label | root_cause |
|---|---------------|------------|
| 0 | Cluster 6: Unauthorized access to account info... | Account closure due to suspected identity veri... |
| 1 | Cluster 4: Beneficiary bank failed to provide ... | Delayed alert sending and miscommunication bet... |
| 2 | Cluster 4: Beneficiary bank failed to provide ... | Company failed to provide terms and conditions... |
| 3 | Cluster 8: Unfair dispute resolution process a... | Unfair dispute resolution process and failure ... |
| 4 | Cluster 3: Repeated late fees and interest cha... | Banco Popular's inconsistent application of th... |


In [None]:
# HDBSCAN clustering with validation and visualisation
vector_matrix = np.vstack(extracted_df["vector"].to_list())

# Standardise embeddings before density clustering
scaled_vectors = StandardScaler().fit_transform(vector_matrix)

# Dimensionality reduction for clustering stability
n_components = min(50, scaled_vectors.shape[1], max(2, scaled_vectors.shape[0] - 1))
pca_cluster = PCA(n_components=n_components, random_state=42)
cluster_vectors = pca_cluster.fit_transform(scaled_vectors)
extracted_df["cluster_vector"] = list(cluster_vectors)


def optimise_hdbscan(embeddings):
    results = []
    for min_cluster_size in [5, 10, 15, 20]:
        for min_samples in [1, 5, 10]:
            model = HDBSCAN(
                min_cluster_size=min_cluster_size,
                min_samples=min_samples,
                metric="euclidean",
                store_centers="centroid",
            )
            labels = model.fit_predict(embeddings)
            noise_ratio = float((labels == -1).mean())

            valid_mask = labels != -1
            valid_labels = labels[valid_mask]
            score = np.nan
            if len(np.unique(valid_labels)) > 1:
                score = silhouette_score(embeddings[valid_mask], valid_labels)

            results.append(
                {
                    "min_cluster_size": min_cluster_size,
                    "min_samples": min_samples,
                    "silhouette": score,
                    "noise_ratio": noise_ratio,
                }
            )

    results_df = pd.DataFrame(results).sort_values("silhouette", ascending=False)

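    # Prefer configurations that leave less than 30% of points as noise;
    # if none qualify, fall back to the best silhouette regardless of noise
    filtered = results_df[(results_df["noise_ratio"] < 0.30) & results_df["silhouette"].notna()]
    if len(filtered) == 0:
        best_row = results_df.iloc[0]
    else:
        best_row = filtered.iloc[0]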

    best_params = {
        "min_cluster_size": int(best_row["min_cluster_size"]),
        "min_samples": int(best_row["min_samples"]),
    }
    return best_params, results_df


best_params, grid_results = optimise_hdbscan(cluster_vectors)
print("HDBSCAN grid search (top 5):")
print(grid_results.head(5).to_string(index=False))
print(f"Selected params: {best_params}")

hdbscan = HDBSCAN(
    min_cluster_size=best_params["min_cluster_size"],
    min_samples=best_params["min_samples"],
    metric="euclidean",
    store_centers="centroid",
)

hdb_labels = hdbscan.fit_predict(cluster_vectors)

# Set the active clustering model to HDBSCAN
extracted_df["cluster_id"] = hdb_labels
extracted_df["cluster_model"] = "hdbscan"


def get_smart_cluster_names(df, cluster_col, vector_col, text_col="root_cause"):
    names = {}
    for cluster_id, group in df.groupby(cluster_col):
        if cluster_id == -1:
            continue
        vectors = np.vstack(group[vector_col].to_list())
        centroid = vectors.mean(axis=0, keepdims=True)
        closest_idx, _ = pairwise_distances_argmin_min(centroid, vectors, metric="euclidean")
        rep_text = str(group.iloc[closest_idx[0]][text_col]).strip()
        names[cluster_id] = f"Cluster {cluster_id}: {rep_text}"
    return names


cluster_names = get_smart_cluster_names(extracted_df, "cluster_id", "cluster_vector", "root_cause")
extracted_df["cluster_label"] = extracted_df["cluster_id"].map(cluster_names).fillna("Noise/Outliers")

# Validation metrics (excluding noise)
valid_mask = extracted_df["cluster_id"] != -1
valid_vectors = cluster_vectors[valid_mask]
valid_labels = extracted_df.loc[valid_mask, "cluster_id"].to_numpy()

if len(np.unique(valid_labels)) > 1:
    sil = silhouette_score(valid_vectors, valid_labels)
    dbi = davies_bouldin_score(valid_vectors, valid_labels)
    chi = calinski_harabasz_score(valid_vectors, valid_labels)
    print(f"Silhouette Score (no noise): {sil:.4f}")
    print(f"Davies-Bouldin Index (no noise): {dbi:.4f}")
    print(f"Calinski-Harabasz Index (no noise): {chi:.4f}")
else:
    print("Not enough clusters for validation metrics after noise removal.")

# t-SNE projection for visual validation
tsne = TSNE(n_components=2, perplexity=30, random_state=42, init="pca")
tsne_result = tsne.fit_transform(cluster_vectors)

extracted_df["tsne_x"] = tsne_result[:, 0]
extracted_df["tsne_y"] = tsne_result[:, 1]

plot_df = extracted_df[["tsne_x", "tsne_y", "cluster_id", "cluster_label", "root_cause", "severity_index"]].copy()
plot_df["cluster_label_plot"] = plot_df["cluster_id"].astype(str)
plot_df.loc[plot_df["cluster_id"] == -1, "cluster_label_plot"] = "Noise"
plot_df["point_type"] = plot_df["cluster_id"].apply(lambda v: "Noise" if v == -1 else "Cluster")

fig = px.scatter(
    plot_df,
    x="tsne_x",
    y="tsne_y",
    color="cluster_label_plot",
    symbol="point_type",
    title="Semantic Error Clusters (t-SNE Projection)",
    hover_data={
        "cluster_label": True,
        "root_cause": True,
        "severity_index": True,
        "cluster_id": False,
        "cluster_label_plot": False,
        "point_type": False,
    },
    color_discrete_map={"Noise": "#9e9e9e"},
)
fig.update_traces(marker=dict(size=6))
fig.update_layout(legend_title_text="Cluster")
fig.show()

extracted_df[["cluster_id", "cluster_label", "root_cause"]].head(5)

HDBSCAN grid search (top 5):
 min_cluster_size  min_samples  silhouette  noise_ratio
                5            1    0.803687     0.807547
                5           10    0.644653     0.601887
               10           10    0.644653     0.601887
               15           10    0.644653     0.601887
               10            5    0.627143     0.583019
Selected params: {'min_cluster_size': 5, 'min_samples': 1}
Silhouette Score (no noise): 0.8037
Davies-Bouldin Index (no noise): 0.4340
Calinski-Harabasz Index (no noise): 569.9270


|  | cluster_id | cluster_label | root_cause |
|---|------------|---------------|------------|
| 0 | -1 | Noise/Outliers | Account closure due to suspected identity veri... |
| 1 | -1 | Noise/Outliers | Delayed alert sending and miscommunication bet... |
| 2 | -1 | Noise/Outliers | Company failed to provide terms and conditions... |
| 3 | 2 | Cluster 2: Unfair dispute resolution process a... | Unfair dispute resolution process and failure ... |
| 4 | -1 | Noise/Outliers | Banco Popular's inconsistent application of th... |


---

## Phase 4: Dashboard & Narrative

### Visualisation
Generate an interactive Treemap showing:
- Cluster density: Size represents the number of exceptions in each cluster
- Failure type breakdown: Hierarchical view by cluster and failure category

### Executive Summary
Provide a non-technical summary for stakeholders highlighting:
1. The top 3 systemic issues identified through clustering
2. Why semantic intelligence outperforms traditional keyword monitoring
3. Actionable insights for operational improvements
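
When a density-based model such as HDBSCAN marks points as noise (label `-1`), a plain count ranking can put the noise bucket at the top of a "top issues" list. A minimal sketch of ranking clusters while leaving noise out follows; `top_clusters` and the example labels are hypothetical and not taken from the notebook's data.

```python
from collections import Counter

NOISE_LABEL = "Noise/Outliers"


def top_clusters(labels: list[str], n: int = 3) -> list[tuple[str, int]]:
    """Return the n most frequent cluster labels, ignoring the noise bucket."""
    counts = Counter(label for label in labels if label != NOISE_LABEL)
    return counts.most_common(n)


# Hypothetical labels, for illustration only.
example = ["Noise/Outliers"] * 5 + ["Cluster 0: fees"] * 3 + ["Cluster 1: fraud"] * 2
print(top_clusters(example))
```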

In [31]:
# Phase 4: Executive dashboard + narrative
import plotly.express as px

cluster_counts = (
    extracted_df.groupby(["cluster_label", "failure_type"], dropna=False)
    .size()
    .reset_index(name="count")
)

fig = px.treemap(
    cluster_counts,
    path=["cluster_label", "failure_type"],
    values="count",
    color="count",
    color_continuous_scale="Blues",
)
fig.update_layout(margin=dict(t=40, l=10, r=10, b=10))
fig.show()

# Executive summary for a non-technical stakeholder
cluster_summary = (
    extracted_df.groupby("cluster_label")
    .size()
    .reset_index(name="count")
    .sort_values("count", ascending=False)
)

top3 = cluster_summary.head(3).to_dict("records")

summary_lines = [
    "Executive Summary:",
    "1) The logs show three dominant systemic issues driving customer friction:",
]

for i, item in enumerate(top3, start=1):
    summary_lines.append(f"   {i}. {item['cluster_label']} (n={item['count']})")

summary_lines += [
    "2) These clusters represent recurring root-cause themes rather than keyword matches,",
    "   allowing us to quantify the true impact across products and isolate the biggest failure modes.",
    "3) Compared with traditional keyword monitoring, this semantic approach reduces noise",
    "   from redacted or inconsistent logs and reveals consistent operational patterns.",
]

print("\n".join(summary_lines))

Executive Summary:
1) The logs show three dominant systemic issues driving customer friction:
   1. Noise/Outliers (n=428)
   2. Cluster 2: Unfair dispute resolution process and failure to comply with error resolution requirements (n=26)
   3. Cluster 1: Unfair dispute resolution process and failure to prevent and address fraud (n=16)
2) These clusters represent recurring root-cause themes rather than keyword matches,
   allowing us to quantify the true impact across products and isolate the biggest failure modes.
3) Compared with traditional keyword monitoring, this semantic approach reduces noise
   from redacted or inconsistent logs and reveals consistent operational patterns.


In [33]:
# Prioritised fix list based on severity and volume
prioritised = extracted_df.loc[
    (extracted_df["cluster_id"] != -1) & (extracted_df.get("cluster_model", "hdbscan") == "hdbscan")
].copy()
prioritised["severity_index"] = pd.to_numeric(prioritised["severity_index"], errors="coerce")

def most_common(series):
    series = series.dropna()
    return series.value_counts().idxmax() if not series.empty else "Unknown"

cluster_stats = (
    prioritised.groupby(["cluster_id", "cluster_label", "product"], dropna=False)
    .agg(
        avg_severity=("severity_index", "mean"),
        volume=("cluster_id", "count"),
        failure_type=("failure_type", most_common),
        iso_20022_proxy=("iso_20022_proxy", most_common),
    )
    .reset_index()
)

max_volume = cluster_stats["volume"].max() or 1
cluster_stats["volume_score"] = cluster_stats["volume"] / max_volume
cluster_stats["priority_score"] = (cluster_stats["avg_severity"] * 0.7) + (cluster_stats["volume_score"] * 3)

ranked = cluster_stats.sort_values(["priority_score", "avg_severity", "volume"], ascending=False)

print("Prioritised Fix List (Incident Report):")
for _, row in ranked.head(5).iterrows():
    cluster_id = row["cluster_id"]
    cluster_label = row["cluster_label"]
    product = row["product"]
    avg_sev = row["avg_severity"]
    volume = int(row["volume"])
    failure_type = row["failure_type"]
    iso_code = row["iso_20022_proxy"]

    severity_text = "NA" if pd.isna(avg_sev) else f"{avg_sev:.1f}"
    issue_text = cluster_label.split(":", 1)[-1].strip()

    print(f" [Cluster {cluster_id}] Issue: \"{issue_text}\"")
    print(f" Impact: Severity {severity_text} | Volume {volume}")
    print(f" Type: {failure_type} ({iso_code})")
    print(f" ACTION: Investigate logic in {product}")
    print("")

Prioritised Fix List (Incident Report):
 [Cluster 2] Issue: "Unfair dispute resolution process and failure to comply with error resolution requirements"
 Impact: Severity 8.0 | Volume 26
 Type: Compliance (RC01)
 ACTION: Investigate logic in Money transfer, virtual currency, or money service

 [Cluster 1] Issue: "Unfair dispute resolution process and failure to prevent and address fraud"
 Impact: Severity 8.0 | Volume 16
 Type: Compliance (RC01)
 ACTION: Investigate logic in Money transfer, virtual currency, or money service

 [Cluster 4] Issue: "Inadequate dispute resolution and lack of accountability"
 Impact: Severity 8.0 | Volume 14
 Type: Dispute (SL01)
 ACTION: Investigate logic in Money transfer, virtual currency, or money service

 [Cluster 5] Issue: "Lack of accountability and transparency in dispute resolution process"
 Impact: Severity 8.0 | Volume 9
 Type: Dispute (SL01)
 ACTION: Investigate logic in Money transfer, virtual currency, or money service

 [Cluster 7] Issue: "U

---

## Key Findings & Next Steps

### What We've Accomplished
- Filtered the dataset, then sampled 500 high-signal complaint narratives
- Extracted structured exception data using an LLM (failure types, ISO codes, severity)
- Identified recurring error patterns through semantic clustering
- Generated actionable insights for operational improvements

### Recommendations
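Use the prioritised fix list above to focus on the highest-impact clusters, balancing severity with volume. Start with clusters showing both high severity and high volume. Treat the ranking as provisional: the selected HDBSCAN configuration labelled about 81% of records as noise (noise ratio 0.81), so the named clusters cover only a minority of the sample.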
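
The ranking blends severity and relative volume. As a worked sketch of the weighting used in the prioritisation cell (0.7 × average severity plus 3 × volume relative to the largest group), with hypothetical cluster numbers and a hypothetical `priority_score` helper:

```python
def priority_score(avg_severity: float, volume: int, max_volume: int) -> float:
    """0.7 * severity (1-10 scale) plus up to 3 points for relative volume."""
    volume_score = volume / max_volume if max_volume else 0.0
    return 0.7 * avg_severity + 3 * volume_score


# Hypothetical clusters: (name, average severity, volume).
clusters = [("A", 8.0, 20), ("B", 9.0, 5), ("C", 6.0, 10)]
max_volume = max(v for _, _, v in clusters)
ranked = sorted(
    clusters,
    key=lambda c: priority_score(c[1], c[2], max_volume),
    reverse=True,
)
for name, sev, vol in ranked:
    print(name, round(priority_score(sev, vol, max_volume), 2))
```

With these weights severity contributes up to 7 points and volume at most 3, so a rare but severe cluster can outrank a frequent, milder one.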

### Business Value
This semantic-intelligence approach enables:
- Proactive issue detection: Identify systemic problems before they escalate
- Resource optimisation: Focus engineering efforts on the most impactful issues  
- Operational resilience: Reduce customer friction through data-driven root-cause analysis

### Scaling Considerations
To expand this prototype:
1. Increase sample size to 2,000-5,000 records (requires paid LLM tier)
2. Integrate real-time log streaming for continuous monitoring
3. Deploy predictive models to forecast exception trends
4. Build automated alerting for emerging cluster anomalies