# Notebook 5: KMeans Topic Modeling

## 1. Introduction

### 1.1 Objective
Discover latent thematic domains in Kiswahili transcriptions using unsupervised clustering (KMeans on TF-IDF features).

### 1.2 Mathematical Foundation

**KMeans Objective:**
$$\min_{C} \sum_{i=1}^{k} \sum_{x \in C_i} ||x - \mu_i||^2$$

Where:
- $k$: Number of clusters
- $C_i$: Set of points assigned to cluster $i$
- $\mu_i$: Centroid (mean) of cluster $i$

**Silhouette Score:**
$$s(i) = \frac{b(i) - a(i)}{\max(a(i), b(i))}$$

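Where:
- $a(i)$: Mean distance from sample $i$ to the other samples in its own cluster
- $b(i)$: Mean distance from sample $i$ to the samples in the nearest other cluster

$s(i)$ ranges from $-1$ to $1$; values near $1$ indicate a sample that sits well inside its cluster.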
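
To make the two formulas above concrete, here is a minimal sketch that recomputes the KMeans objective and the per-sample silhouette by hand and compares them with scikit-learn. The six 2-D points and the `*_toy` names are illustrative assumptions, not part of the project data.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples

# Toy 2-D data: two well-separated groups (illustrative values only)
X_toy = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                  [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]])

km_toy = KMeans(n_clusters=2, random_state=42, n_init=10).fit(X_toy)
labels_toy = km_toy.labels_

# KMeans objective: sum of squared distances to the assigned centroid
manual_inertia = sum(
    np.sum((X_toy[labels_toy == c] - km_toy.cluster_centers_[c]) ** 2)
    for c in range(2)
)

def silhouette_point(i):
    """Compute s(i) = (b - a) / max(a, b) for sample i."""
    d = np.linalg.norm(X_toy - X_toy[i], axis=1)
    own = labels_toy == labels_toy[i]
    # a(i): mean distance to the other members of the same cluster
    a = d[own].sum() / (own.sum() - 1)
    # b(i): smallest mean distance to the members of any other cluster
    b = min(d[labels_toy == c].mean()
            for c in set(labels_toy) if c != labels_toy[i])
    return (b - a) / max(a, b)

manual_sil = np.array([silhouette_point(i) for i in range(len(X_toy))])

print(np.isclose(manual_inertia, km_toy.inertia_))
print(np.allclose(manual_sil, silhouette_samples(X_toy, labels_toy)))
```

Both checks should agree with scikit-learn's `inertia_` and `silhouette_samples`, which use Euclidean distance by default.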

---

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score
import umap  # imported but not used below; requires the umap-learn package

import warnings
warnings.filterwarnings('ignore')

SEED = 42
np.random.seed(SEED)

## 2. Load Data

In [None]:
PROJECT_ROOT = Path.cwd().parent
DATA_DIR = PROJECT_ROOT / 'data'

df = pd.read_csv(DATA_DIR / 'train.csv')
df = df.dropna(subset=['sentence']).head(2000)
print(f"Working with {len(df)} samples")

## 3. Text Preprocessing

In [None]:
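# Minimal Kiswahili stopword list (common function words; not exhaustive)
swahili_stopwords = ['na', 'ya', 'wa', 'ni', 'kwa', 'la', 'za', 'katika', 'au', 'kama']

def preprocess_text(text):
    """Lowercase the text and strip leading/trailing whitespace."""
    text = text.lower().strip()
    return text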

df['sentence_clean'] = df['sentence'].apply(preprocess_text)
print("Text preprocessing complete.")

## 4. TF-IDF Vectorization

In [None]:
vectorizer = TfidfVectorizer(
    max_features=500,
    stop_words=swahili_stopwords,
    ngram_range=(1, 2),
    min_df=2
)

X_tfidf = vectorizer.fit_transform(df['sentence_clean'])
print(f"TF-IDF matrix shape: {X_tfidf.shape}")

## 5. Elbow Method for Optimal K

In [None]:
inertias = []
K_range = range(2, 11)

for k in K_range:
    kmeans = KMeans(n_clusters=k, random_state=SEED, n_init=10)
    kmeans.fit(X_tfidf)
    inertias.append(kmeans.inertia_)

plt.figure(figsize=(10, 6))
plt.plot(K_range, inertias, 'bo-')
plt.xlabel('Number of Clusters (k)')
plt.ylabel('Inertia')
plt.title('Elbow Method for Optimal K')
plt.grid(True)
plt.show()
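
Reading the elbow off a plot is subjective. One possible heuristic, sketched here under the assumption that the curve is convex and reasonably smooth, is to pick the k where the second difference of the inertia curve is largest. The function name and the example inertia values are hypothetical and are not results from this notebook.

```python
import numpy as np

def elbow_by_second_difference(ks, inertia_values):
    """Return the k at which the inertia curve bends most sharply.

    Heuristic only: uses the discrete second difference, which is
    defined for interior points of the curve (not the first or last k).
    """
    ks = list(ks)
    values = np.asarray(inertia_values, dtype=float)
    second_diff = np.diff(values, n=2)  # length is len(ks) - 2
    # second_diff[j] is centred on ks[j + 1]
    return ks[int(np.argmax(second_diff)) + 1]

# Hypothetical inertia values for k = 2..7
example_inertias = [100.0, 60.0, 30.0, 25.0, 22.0, 20.0]
print(elbow_by_second_difference(range(2, 8), example_inertias))
```

For these example values the largest bend is at k = 4, where the drop in inertia flattens out. On noisy curves the heuristic can be unstable, so it is best treated as a cross-check on the plot.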

## 6. Silhouette Analysis

In [None]:
silhouette_scores = []

for k in K_range:
    kmeans = KMeans(n_clusters=k, random_state=SEED, n_init=10)
    labels = kmeans.fit_predict(X_tfidf)
    score = silhouette_score(X_tfidf, labels)
    silhouette_scores.append(score)

plt.figure(figsize=(10, 6))
plt.plot(K_range, silhouette_scores, 'ro-')
plt.xlabel('Number of Clusters (k)')
plt.ylabel('Silhouette Score')
plt.title('Silhouette Analysis')
plt.grid(True)
plt.show()

optimal_k = K_range[np.argmax(silhouette_scores)]
print(f"Optimal K: {optimal_k}")
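
The mean silhouette score can hide one poorly separated cluster behind several good ones. A minimal sketch of a per-cluster breakdown using `silhouette_samples` follows; it runs on synthetic data from `make_blobs` (an assumption for illustration) rather than on `X_tfidf`.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples, silhouette_score

# Synthetic stand-in for the TF-IDF matrix (illustration only)
X_demo, _ = make_blobs(n_samples=300, centers=3, cluster_std=1.0,
                       random_state=42)
labels_demo = KMeans(n_clusters=3, random_state=42,
                     n_init=10).fit_predict(X_demo)

# One silhouette value per sample, then averaged within each cluster
sample_scores = silhouette_samples(X_demo, labels_demo)
per_cluster = {
    int(c): float(sample_scores[labels_demo == c].mean())
    for c in np.unique(labels_demo)
}

overall = silhouette_score(X_demo, labels_demo)
for c, score in per_cluster.items():
    print(f"Cluster {c}: mean silhouette = {score:.3f}")
print(f"Overall: {overall:.3f}")
```

The overall score is the mean of the per-sample values, so a cluster with a markedly lower mean than the others is a candidate for closer inspection.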

## 7. Final KMeans Clustering

In [None]:
kmeans_final = KMeans(n_clusters=optimal_k, random_state=SEED, n_init=10)
df['cluster'] = kmeans_final.fit_predict(X_tfidf)

print(f"Cluster distribution:\n{df['cluster'].value_counts().sort_index()}")

## 8. Dimensionality Reduction with PCA

In [None]:
pca = PCA(n_components=2, random_state=SEED)
X_pca = pca.fit_transform(X_tfidf.toarray())

plt.figure(figsize=(12, 8))
scatter = plt.scatter(X_pca[:, 0], X_pca[:, 1], c=df['cluster'], cmap='viridis', alpha=0.6)
plt.colorbar(scatter)
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.title('KMeans Clusters (PCA Projection)')
plt.grid(True)
plt.show()
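
The cell above densifies the TF-IDF matrix with `.toarray()`, which is fine for 2000 rows and 500 features but grows quickly with larger vocabularies. One alternative, sketched here as an assumption rather than a drop-in replacement, is `TruncatedSVD`, which accepts sparse input directly. Unlike PCA it does not center the data, so the projection will differ somewhat. The random sparse matrix stands in for `X_tfidf`.

```python
from scipy import sparse
from sklearn.decomposition import TruncatedSVD

# Random sparse matrix as a stand-in for the TF-IDF matrix
X_sparse_demo = sparse.random(200, 50, density=0.1, format="csr",
                              random_state=42)

# TruncatedSVD works on sparse input without converting to dense
svd_demo = TruncatedSVD(n_components=2, random_state=42)
X_2d_demo = svd_demo.fit_transform(X_sparse_demo)

print(X_2d_demo.shape)
print(f"Explained variance ratio: {svd_demo.explained_variance_ratio_}")
```

Checking `explained_variance_ratio_` is also useful for the PCA projection: if two components explain only a small share of the variance, overlapping clusters in the scatter plot do not necessarily mean the clusters overlap in the full feature space.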

## 9. Top Terms per Cluster

In [None]:
feature_names = vectorizer.get_feature_names_out()
centroids = kmeans_final.cluster_centers_

print("Top 10 terms per cluster:\n")
for i, centroid in enumerate(centroids):
    top_indices = centroid.argsort()[-10:][::-1]
    top_terms = [feature_names[idx] for idx in top_indices]
    print(f"Cluster {i}: {', '.join(top_terms)}")
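
The ranking step above relies on `argsort`, which sorts in ascending order, so the last indices are the heaviest terms. A tiny sketch with made-up weights (the `demo_*` names and values are assumptions) shows the indexing:

```python
import numpy as np

# Made-up vocabulary and centroid weights for a single cluster
demo_terms = np.array(["alpha", "beta", "gamma", "delta"])
demo_centroid = np.array([0.1, 0.7, 0.0, 0.4])

# argsort is ascending: take the last 2 indices, then reverse them
demo_top_idx = demo_centroid.argsort()[-2:][::-1]
demo_top_terms = demo_terms[demo_top_idx]

print(list(demo_top_terms))
```

Because each centroid is the mean TF-IDF vector of its cluster, a high weight means the term is both frequent within the cluster and relatively distinctive across the corpus.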

## 10. Save Results

In [None]:
df.to_csv(DATA_DIR / 'clustered_data.csv', index=False)
print("Clustered data saved.")

## 11. Conclusion

### Key Findings:
1. ✅ Selected the number of clusters by maximizing the silhouette score, with the elbow plot as a visual check
2. ✅ Grouped Kiswahili sentences into candidate thematic domains
3. ✅ Visualized clusters in a 2D PCA projection
4. ✅ Extracted the top 10 TF-IDF terms per cluster as topic keywords

### Implications:
- Topic diversity across clusters is one indicator of how well a sentiment model trained on this data may generalize
- Cluster-specific models may improve performance

### Next Steps:
Proceed to **Notebook 6**: Model Optimization (Quantization & Distillation)