
# Embeddings + KMeans Topic Mining


## Step 0: Install & Imports

In [1]:
# !pip install sentence-transformers scikit-learn pandas
import random

import numpy as np
import pandas as pd

# Pretrained Transformer encoder that converts sentences/documents into embeddings (vectors that capture meaning).
from sentence_transformers import SentenceTransformer

# K-Means clustering algorithm.
from sklearn.cluster import KMeans

# Used later to pull the top keywords per cluster.
from sklearn.feature_extraction.text import CountVectorizer

# Seed Python's and NumPy's global RNGs; KMeans itself is seeded via random_state in Step 3.
random.seed(42)
np.random.seed(42)

## Step 1: Load the CSV

In [3]:
csv_path = "https://github.com/giridhar276/genai/raw/refs/heads/main/topicmodelling/topics_100.csv"
df = pd.read_csv(csv_path)
df.head()

Unnamed: 0,id,text
0,79,IPv6 users cannot reach the upload endpoint
1,4,Login form shows captcha error even for first ...
2,55,Audit logs missing entries for sensitive actions
3,3,SSO login loops back to the sign-in page repea...
4,72,Corporate proxy strips authorization headers
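
Step 2 casts the text column with `astype(str)`, which would turn any missing value into the literal string "nan" and embed it like a real ticket. A minimal sketch of a pre-encoding cleanup follows, using a small hypothetical `toy` frame rather than the real CSV; whether `topics_100.csv` actually contains missing or duplicate rows is not known from the output above.

```python
import pandas as pd

# Hypothetical stand-in for df: one missing text, one duplicate, one blank string.
toy = pd.DataFrame(
    {"id": [1, 2, 3, 4], "text": ["Login fails", None, "Login fails", "  "]}
)

# Drop missing texts first so they never become the string "nan".
clean = toy.dropna(subset=["text"])
# Drop texts that are empty after trimming whitespace.
clean = clean[clean["text"].str.strip() != ""]
# Keep only the first occurrence of each distinct text.
clean = clean.drop_duplicates(subset="text").reset_index(drop=True)

print(clean)
```

The same three calls could be applied to `df` before Step 2 if the data turns out to need it.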


## Step 2: Encode

In [4]:

# Load a light, fast sentence-embedding model.
# It maps each text to a 384-dimensional vector.

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

# batch_size=64: encode in mini-batches (speed/memory trade-off).
# show_progress_bar=True: show visual feedback while encoding.
# normalize_embeddings=True: L2-normalize the vectors so that cosine similarity equals the dot product;
#   this often stabilizes clustering.

emb = model.encode(
    df["text"].astype(str).tolist(),
    batch_size=64,
    show_progress_bar=True,
    normalize_embeddings=True,
)
emb.shape


(100, 384)
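
Because the embeddings are L2-normalized, the dot product of two rows equals their cosine similarity, and the squared Euclidean distance that KMeans minimizes becomes `2 - 2 * cosine`. The sketch below checks both identities on random unit vectors; `raw` and `unit` are illustrative stand-ins for `emb`, not the notebook's actual embeddings.

```python
import numpy as np

# Stand-in for `emb`: random 384-dimensional vectors, L2-normalized row by row.
rng = np.random.default_rng(0)
raw = rng.normal(size=(5, 384))
unit = raw / np.linalg.norm(raw, axis=1, keepdims=True)

a, b = unit[0], unit[1]
dot = float(a @ b)
cosine = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
sq_dist = float(np.sum((a - b) ** 2))

# For unit vectors: cosine == dot, and ||a - b||^2 == 2 - 2 * cosine.
print(np.isclose(dot, cosine), np.isclose(sq_dist, 2 - 2 * cosine))
```

This is why Euclidean KMeans on normalized embeddings ranks neighbors the same way cosine similarity does.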

## Step 3: KMeans

In [5]:
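k = 10  # number of clusters (topics) to extract
# n_init="auto" requires scikit-learn >= 1.2; random_state=42 makes the clustering reproducible.
kmeans = KMeans(n_clusters=k, n_init="auto", random_state=42)
# Fit on the embeddings and get one cluster label per document.
labels = kmeans.fit_predict(emb)
df["cluster"] = labels
df.head()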

Unnamed: 0,id,text,cluster
0,79,IPv6 users cannot reach the upload endpoint,4
1,4,Login form shows captcha error even for first ...,5
2,55,Audit logs missing entries for sensitive actions,0
3,3,SSO login loops back to the sign-in page repea...,5
4,72,Corporate proxy strips authorization headers,0
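
The value `k = 10` is fixed by hand in this step. One common way to sanity-check such a choice is to compare silhouette scores over a range of candidate values. The sketch below does this on synthetic blobs, not on the notebook's embeddings; the data, the candidate range, and the names `scores` and `best_k` are assumptions for illustration only.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for `emb`: three well-separated 2-D blobs.
X, _ = make_blobs(
    n_samples=150,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=0.5,
    random_state=42,
)

# Fit KMeans for each candidate k and record the mean silhouette score.
scores = {}
for candidate_k in range(2, 7):
    km = KMeans(n_clusters=candidate_k, n_init=10, random_state=42)
    scores[candidate_k] = silhouette_score(X, km.fit_predict(X))

# The candidate with the highest score is the best-separated clustering.
best_k = max(scores, key=scores.get)
print(best_k, round(scores[best_k], 3))
```

On real sentence embeddings the scores tend to be much lower and the peak less sharp, so the result is a hint rather than a definitive answer.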


## Step 4: Cluster Keywords

In [10]:
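def keywords_for_cluster(texts, topn=12):
    # Count unigrams and bigrams, ignoring English stop words.
    vec = CountVectorizer(stop_words="english", max_features=5000, ngram_range=(1, 2), min_df=1)
    X = vec.fit_transform(texts)
    # Total occurrences of each term across the cluster's documents.
    freqs = np.asarray(X.sum(axis=0)).ravel()
    terms = vec.get_feature_names_out()
    # Indices of the topn most frequent terms, highest count first.
    idx = freqs.argsort()[-topn:][::-1]
    return [terms[i] for i in idx]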

for c in range(k):
    docs_c = df.loc[df["cluster"] == c, "text"].tolist()
    if len(docs_c) == 0:
        print(f"Cluster {c}: (no docs)")
    else:
        print(f"Cluster {c}: " + ", ".join(keywords_for_cluster(docs_c, topn=12)))

Cluster 0: api, responses, testing, strips, sensitive actions, sensitive, secret rotation, secret, rotation breaks, rotation, usage, returns 429
Cluster 1: login, push, users, token, mode, logout, does, dashboard, tokens rotated, tokens, trigger alert, update
Cluster 2: file, csv, xlsx returns, uploads, spikes exporting, spikes, returns corrupted, returns, reports, percent big, percent, past invoices
Cluster 3: date, wrong, yearly plan, wrong timezone, yearly, wrong eu, vat yearly, value switching, vat, unexpectedly, timezone, tax calculation
Cluster 4: webhooks, vpn, webhook, retries, fails, webhooks verify, webhooks accept, websocket, wifi captive, webhook callbacks, webhook retries, vpn blocks
Cluster 5: login, links, error, password, cause, user, working, user session, workspaces, working login, unlock fails, unlock
Cluster 6: report, fails, fields, custom, vanish, zapier, zapier action, trying, total includes, total, tooltip covers, tooltip
Cluster 7: integration, codes, card, wea
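
Frequency-based keywords can be hard to read on small clusters, so it may help to also look at the documents closest to each centroid. The sketch below is one possible way to do that; `representative_indices` is a hypothetical helper, not part of the notebook, and the toy arrays stand in for `emb`, `labels`, and the fitted centroids.

```python
import numpy as np

def representative_indices(embeddings, labels, centers, topn=3):
    """Return, per cluster, the row indices of the points closest to its centroid."""
    result = {}
    for c in range(len(centers)):
        members = np.flatnonzero(labels == c)
        if members.size == 0:
            result[c] = []
            continue
        # Euclidean distance from each member to the cluster's centroid.
        dists = np.linalg.norm(embeddings[members] - centers[c], axis=1)
        result[c] = members[np.argsort(dists)[:topn]].tolist()
    return result

# Toy data (hypothetical): two clusters in 2-D.
toy_emb = np.array([[0.0, 0.0], [0.1, 0.0], [1.0, 0.0], [5.0, 5.0], [5.2, 5.0]])
toy_labels = np.array([0, 0, 0, 1, 1])
toy_centers = np.array([[0.1, 0.0], [5.0, 5.0]])

reps = representative_indices(toy_emb, toy_labels, toy_centers, topn=2)
print(reps)
```

In the notebook this could be called as `representative_indices(emb, labels, kmeans.cluster_centers_)`, with the returned indices passed to `df["text"].iloc[...]` to read the matching tickets.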

## Step 5: Save

In [11]:
df.to_csv("embeddings_kmeans_topics_assigned.csv", index=False)
df.head()

Unnamed: 0,id,text,cluster
0,79,IPv6 users cannot reach the upload endpoint,4
1,4,Login form shows captcha error even for first ...,5
2,55,Audit logs missing entries for sensitive actions,0
3,3,SSO login loops back to the sign-in page repea...,5
4,72,Corporate proxy strips authorization headers,0
