# Clustering

I ran a quick test to see whether embedding responses and clustering them is a viable approach. I abandoned it quickly: the idea is interesting, but getting it to work well would take far more work and research than I want to put into it right now.

**Basic Approach:**
- Embed responses.
- Run a clustering algorithm to group similar responses.
- Convert groups of responses into a theme.
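
The three steps above can be sketched as one small function. This is only an illustration of the data flow, not the code used in this notebook: `responses_to_themes`, the `toy_*` helpers, and the keyword "embedding" are hypothetical stand-ins, and the `-1` noise label is assumed to follow the HDBSCAN convention.

```python
from collections import defaultdict
from typing import Callable, Sequence

NOISE = -1  # assumed convention (as in HDBSCAN): label -1 means "no cluster"


def responses_to_themes(
    responses: Sequence[str],
    embed: Callable[[Sequence[str]], Sequence[Sequence[float]]],
    cluster: Callable[[Sequence[Sequence[float]]], Sequence[int]],
    summarize: Callable[[Sequence[str]], str],
) -> dict[int, str]:
    vectors = embed(responses)   # step 1: embed responses
    labels = cluster(vectors)    # step 2: group similar responses
    groups: dict[int, list[str]] = defaultdict(list)
    for text, label in zip(responses, labels):
        if label != NOISE:       # unclustered responses get no theme
            groups[label].append(text)
    # step 3: turn each group of responses into a theme
    return {label: summarize(texts) for label, texts in groups.items()}


# Toy stand-ins so the sketch runs end to end. A real run would call an
# embedding model, a clustering algorithm, and an LLM in their place.
def toy_embed(texts):
    return [[float("price" in t.lower()), float("staff" in t.lower())] for t in texts]


def toy_cluster(vectors):
    # One cluster per keyword; responses matching neither keyword are noise.
    return [0 if v[0] else 1 if v[1] else NOISE for v in vectors]


def toy_summarize(texts):
    return f"{len(texts)} response(s), e.g. {texts[0]!r}"


themes = responses_to_themes(
    ["Prices are too high", "Staff were rude", "Lower the price", "Hello"],
    toy_embed, toy_cluster, toy_summarize,
)
print(themes)
```

Passing the three stages in as callables keeps each one swappable, which matters here because the clustering step is the part most likely to change.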

**Improvements:**
A few things that might improve the results:
- Splitting responses that cover multiple topics (or test it on codes).
- Trying a different clustering algorithm (K-means is fast, and the paper suggested it, but I'm not convinced it's the best for this use case).
- Dimensionality reduction.
- Using a vector database.
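
As a rough illustration of the dimensionality-reduction idea from the list above, the sketch below projects embeddings onto their top principal components using plain NumPy. The notebook did not try this: `pca_reduce`, the choice of 10 components, and the random stand-in data are all assumptions, and a library implementation such as scikit-learn's `PCA` would be the more usual choice.

```python
import numpy as np


def pca_reduce(embeddings: np.ndarray, n_components: int = 10) -> np.ndarray:
    """Project row-vector embeddings onto their top principal components."""
    centered = embeddings - embeddings.mean(axis=0)
    # SVD of the centered data: the rows of vt are the principal directions,
    # ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T


rng = np.random.default_rng(0)
fake_embeddings = rng.normal(size=(253, 64))  # stand-in for real embeddings
reduced = pca_reduce(fake_embeddings, n_components=10)
print(reduced.shape)  # (253, 10)
```

The reduced matrix can then be passed to the clustering step in place of the full embeddings.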

On top of that come all the other changes needed to make this work with large datasets, at which point re-architecting would be the better approach.

**Resources:**
- [Text Clustering with Large Language Model Embeddings](https://arxiv.org/pdf/2403.15112)

In [91]:
import requests
import polars as pl
from functools import partial
from pathlib import Path
from IPython.display import Markdown

from thematic_analysis.utils.cleaning import clean_answers
from thematic_analysis.llm import get_llm_client, get_embeddings
from thematic_analysis.utils.embeddings import embed_column

In [92]:
df = pl.read_csv(Path("../data/wallmart.csv"))

question = df.columns[2]
df = df.select(["Session ID", question]).rename({
    "Session ID" : "id",
    question : "answer"
})
df = clean_answers(df)

df.head()

| id (str) | answer (str) |
| --- | --- |
| "94f2d4c3-b513-411c-b505-a11290…" | "The customer service that your…" |
| "1797c6f2-c501-44b7-b549-a33c29…" | "I would complain about how mor…" |
| "20e1a746-8311-450b-ba54-62d15f…" | "Why do you quit taking cash at…" |
| "234eb679-ffb8-4e5e-a45a-e50a7c…" | "You don't just go into Walmart…" |
| "511293b3-42e1-4712-a3a9-8838e0…" | "East Waco needs a Walmart neig…" |


In [93]:
df = await embed_column(df, "answer")

In [94]:
df.head()

| id (str) | answer (str) | answer_embedding (list[f64]) |
| --- | --- | --- |
| "94f2d4c3-b513-411c-b505-a11290…" | "The customer service that your…" | [0.034766, -0.001419, … 0.009692] |
| "1797c6f2-c501-44b7-b549-a33c29…" | "I would complain about how mor…" | [-0.04273, 0.035613, … 0.033576] |
| "20e1a746-8311-450b-ba54-62d15f…" | "Why do you quit taking cash at…" | [0.012912, -0.024059, … 0.003538] |
| "234eb679-ffb8-4e5e-a45a-e50a7c…" | "You don't just go into Walmart…" | [-0.006762, 0.030485, … 0.011957] |
| "511293b3-42e1-4712-a3a9-8838e0…" | "East Waco needs a Walmart neig…" | [-0.051505, 0.001391, … 0.018833] |


In [95]:
from sklearn.cluster import HDBSCAN

In [96]:
import numpy as np

embeddings = df["answer_embedding"].to_numpy()
embeddings = np.vstack(embeddings).astype(np.float64)

In [97]:
embeddings.shape[0]

253

In [110]:
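# Heuristic: ~5% of the responses (at least 2), doubled; 24 for 253 responses.
min_cluster_size = max(2, int(0.05 * embeddings.shape[0])) * 2
# Note: the value above is not passed in; the clusterer uses a hard-coded 5.
clusterer = HDBSCAN(min_cluster_size=5, metric="cosine", cluster_selection_epsilon=0.0)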

In [111]:
min_cluster_size

24
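
The clusterer above is configured with `metric="cosine"`. One alternative that is sometimes used (an assumption here, not something this notebook tried) is to L2-normalize the embeddings and cluster with the default Euclidean metric. For unit vectors, squared Euclidean distance equals twice the cosine distance, so both metrics rank neighbours the same way. The sketch below checks that identity on random stand-in vectors.

```python
import numpy as np

rng = np.random.default_rng(0)
vectors = rng.normal(size=(5, 8))  # stand-in for real embeddings

# L2-normalize each row so every vector has unit length.
unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)

# Pairwise cosine distance and pairwise squared Euclidean distance.
cosine_distance = 1.0 - unit @ unit.T
squared_euclidean = ((unit[:, None, :] - unit[None, :, :]) ** 2).sum(axis=-1)

# For unit vectors: ||a - b||^2 = 2 * (1 - cos(a, b))
identity_holds = np.allclose(squared_euclidean, 2.0 * cosine_distance)
print(identity_holds)  # True
```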

In [112]:
cluster_labels = clusterer.fit_predict(embeddings)

In [113]:
cluster_labels

array([-1, -1, -1, -1, -1,  0,  1, -1, -1, -1, -1, -1, -1,  0, -1, -1,  0,
       -1, -1, -1, -1, -1, -1, -1,  0, -1, -1, -1, -1, -1, -1,  0, -1, -1,
       -1,  1,  0,  1,  0, -1, -1,  0, -1, -1, -1, -1, -1,  1, -1, -1,  0,
        0, -1, -1, -1,  0, -1,  0, -1,  0, -1, -1, -1, -1, -1,  0,  0,  0,
        0,  0, -1, -1, -1, -1, -1, -1,  0, -1, -1, -1, -1, -1, -1, -1,  0,
       -1, -1, -1,  1, -1,  0, -1,  0, -1, -1, -1,  0, -1,  0, -1, -1, -1,
       -1, -1,  1, -1,  0, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
        0, -1,  0,  0, -1, -1, -1, -1, -1,  0, -1, -1, -1, -1, -1, -1, -1,
       -1,  0,  0, -1, -1,  0, -1,  0,  0,  1, -1, -1, -1,  0, -1, -1,  0,
       -1, -1, -1,  0, -1, -1, -1,  0, -1, -1,  0,  0, -1, -1,  0, -1,  0,
       -1,  0, -1, -1, -1, -1,  0, -1,  0, -1, -1, -1, -1, -1, -1, -1, -1,
       -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,  0, -1, -1, -1, -1, -1, -1,
       -1, -1, -1, -1, -1,  1, -1, -1, -1,  0,  0, -1,  0, -1, -1, -1,  0,
       -1, -1,  0,  0,  0

In [114]:
df = df.with_columns(
    pl.Series(name="cluster", values=cluster_labels)
)

In [115]:
df.group_by("cluster").agg(
    pl.col("answer").count()
)

| cluster (i64) | answer (u32) |
| --- | --- |
| 0 | 61 |
| 1 | 9 |
| -1 | 183 |
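
To put the counts above in proportion, the short calculation below works out how much of the data ended up as noise (label `-1`). The counts are copied from the output above. Treating the noise share as a rough quality signal is an assumption made for illustration, not a method from the notebook.

```python
# Cluster sizes copied from the group_by output; -1 is HDBSCAN's noise label.
counts = {0: 61, 1: 9, -1: 183}

total = sum(counts.values())
noise_share = counts[-1] / total
clustered_share = 1.0 - noise_share

print(f"total responses: {total}")       # total responses: 253
print(f"noise: {noise_share:.1%}")       # noise: 72.3%
print(f"clustered: {clustered_share:.1%}")  # clustered: 27.7%
```

With roughly 72% of responses left unclustered and most of the rest in a single cluster, the grouping is too coarse to turn into themes without further tuning.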
