# Lab 7: Hierarchical Agglomerative Text Clustering (HAC / AGNES)

This notebook implements **Hierarchical Agglomerative Clustering (HAC)** on a Twitter text dataset, following the lab instructions:

1. Load the tweets
2. Preprocess text (tokenization, lemmatization, stopword removal, etc.)
3. Compute the term–document matrix (TF‑IDF)
4. Compute distance matrix
5. Perform Hierarchical Agglomerative Clustering
6. Draw the dendrogram
7. Interpret clustering
8. Compare clusters with existing class labels


In [ ]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import silhouette_score, adjusted_rand_score, confusion_matrix
from sklearn.preprocessing import LabelEncoder

from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.spatial.distance import pdist

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import re

nltk.download('punkt')
nltk.download('punkt_tab')  # needed by word_tokenize in newer NLTK releases
nltk.download('stopwords')
nltk.download('wordnet')


## 1. Load the Tweets Dataset

👉 **Note:** Replace `tweets_3class.csv` with your actual file name / path.
The dataset is expected to have at least:
- a text column (e.g. `text` or `tweet`)
- a label column (e.g. `label` or `class`)


In [ ]:
# TODO: change file name if needed
df = pd.read_csv('tweets_3class.csv')  # <-- replace with your dataset filename

print(df.head())
print("\nColumns:", df.columns.tolist())


Set the names of the **text column** and **label column** here so that the rest of the notebook works with your dataset.

In [ ]:
# Set these to match your dataset
TEXT_COL = 'text'   # e.g. 'text', 'tweet'
LABEL_COL = 'label' # e.g. 'label', 'class'

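df = df.dropna(subset=[TEXT_COL, LABEL_COL]).reset_index(drop=True)  # drop rows with missing text or label
texts = df[TEXT_COL].astype(str).values
labels_raw = df[LABEL_COL].astype(str).values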

print("Number of samples:", len(texts))
print("Unique labels:", np.unique(labels_raw))


## 2. NLP Pre‑processing

- Lower‑casing
- Removing URLs, mentions, hashtags, digits, punctuation
- Tokenization
- Stopword removal
- Lemmatization


In [ ]:
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def clean_tweet(text: str) -> str:
    # lower case
    text = text.lower()
    # remove urls
    text = re.sub(r'http\S+|www\S+', ' ', text)
    # remove mentions and hashtags
    text = re.sub(r'[@#]\w+', ' ', text)
    # keep only letters
    text = re.sub(r'[^a-z\s]', ' ', text)
    # tokenize
    tokens = nltk.word_tokenize(text)
    # remove stopwords and short tokens, then lemmatize
    tokens = [lemmatizer.lemmatize(tok) for tok in tokens
              if tok not in stop_words and len(tok) > 2]
    return ' '.join(tokens)

clean_texts = [clean_tweet(t) for t in texts]
df['clean_text'] = clean_texts
df[['clean_text', LABEL_COL]].head()
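
The regex steps of the cleaning can be checked in isolation on a made-up tweet. The sketch below is a simplified stand-in for `clean_tweet`: it skips tokenization, stopword removal, and lemmatization so that it runs without any NLTK data. The helper name `strip_tweet_noise` and the sample tweet are assumptions introduced for illustration.

```python
import re

def strip_tweet_noise(text: str) -> str:
    """Regex-only part of the cleaning (hypothetical helper, not the full clean_tweet)."""
    text = text.lower()
    text = re.sub(r'http\S+|www\S+', ' ', text)   # remove URLs
    text = re.sub(r'[@#]\w+', ' ', text)          # remove mentions and hashtags (whole word)
    text = re.sub(r'[^a-z\s]', ' ', text)         # keep only letters and whitespace
    return ' '.join(text.split())                 # collapse repeated whitespace

sample = "Loving the NEW phone!!! @techguru #gadgets https://example.com 100%"
cleaned = strip_tweet_noise(sample)
print(cleaned)
```

Note that the pattern `[@#]\w+` drops the whole hashtag word, not just the `#` sign, so topical hashtags do not reach the TF-IDF step.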


## 3. Term‑Document Matrix (TF‑IDF)


In [ ]:
vectorizer = TfidfVectorizer(max_features=5000)
X_tfidf = vectorizer.fit_transform(clean_texts)
X_tfidf.shape


## 4. Distance Matrix (Cosine Distance)

We compute pairwise cosine distances on the TF‑IDF vectors.

In [ ]:
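# condensed distance matrix required for linkage/dendrogram
distance_matrix = pdist(X_tfidf.toarray(), metric='cosine')
# empty tweets (all-zero TF-IDF rows) give NaN distances; treat them as maximally distant
distance_matrix = np.nan_to_num(distance_matrix, nan=1.0)
distance_matrix.shape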
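
`pdist` returns a *condensed* distance vector rather than a square matrix: for `n` samples it holds the `n * (n - 1) / 2` upper-triangle entries. The sketch below uses four made-up 2-D vectors (an assumption for illustration) to show the relation between the two forms.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Four toy vectors; rows 0 and 3 point in the same direction
toy = np.array([[1.0, 0.0],
                [0.0, 1.0],
                [1.0, 1.0],
                [2.0, 0.0]])

condensed = pdist(toy, metric='cosine')   # length n*(n-1)/2 = 6
square = squareform(condensed)            # full symmetric 4 x 4 matrix

print("Condensed shape:", condensed.shape)
print("Square shape:", square.shape)
print(np.round(square, 3))
```

Cosine distance ignores vector length, so rows 0 and 3 are at distance 0, while the orthogonal rows 0 and 1 are at distance 1.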


## 5. Perform Hierarchical Agglomerative Clustering

We start with each tweet as its own cluster and iteratively merge the two closest clusters
using a chosen **linkage** criterion (e.g. `average`, `complete`, `single`, `ward`).


In [ ]:
linkage_method = 'average'  # try 'single' or 'complete' too; 'ward' assumes Euclidean distances, so do not use it with this cosine matrix
Z = linkage(distance_matrix, method=linkage_method)
Z[:5]
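
Each row of the linkage matrix describes one merge: the two cluster ids that were joined, the distance between them, and the size of the new cluster. Original samples have ids `0 .. n-1`, and the cluster created by merge `i` gets id `n + i`. A small sketch on five made-up 1-D points (an assumption for illustration) makes this easier to read than the tweet data:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Five toy points on a line; ids 0..4
points = np.array([[0.0], [1.0], [5.0], [7.0], [20.0]])

toy_Z = linkage(points, method='average', metric='euclidean')

for left, right, dist, size in toy_Z:
    print(f"merge {int(left)} + {int(right)} at distance {dist:.2f} "
          f"-> new cluster of size {int(size)}")
```

The merge distances increase from row to row, which is what the dendrogram's vertical axis shows.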


## 6. Draw the Dendrogram


In [ ]:
plt.figure(figsize=(12, 5))
dendrogram(Z, truncate_mode='level', p=5)
plt.title(f'Dendrogram (linkage={linkage_method})')
plt.xlabel('Sample index or (cluster size)')
plt.ylabel('Distance')
plt.tight_layout()
plt.show()
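
One possible way to compare linkage methods beyond eyeballing the dendrogram is the cophenetic correlation, which measures how well the dendrogram heights preserve the original pairwise distances (closer to 1 is better). The sketch below runs on random data that is purely illustrative; `toy_X`, `toy_dist`, and `scores` are names introduced here, not part of the lab.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
toy_X = rng.random((30, 5))                 # 30 random non-negative vectors
toy_dist = pdist(toy_X, metric='cosine')    # condensed cosine distances

scores = {}
for method in ('single', 'complete', 'average'):
    toy_Z = linkage(toy_dist, method=method)
    coph_corr, _ = cophenet(toy_Z, toy_dist)   # correlation, cophenetic distances
    scores[method] = coph_corr

for method, value in scores.items():
    print(f"{method:>8}: cophenetic correlation = {value:.3f}")
```

In the notebook itself, the same call would be `cophenet(Z, distance_matrix)` with the variables defined above.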


## 7. Cut the Dendrogram into k Clusters and Interpret

We now choose a number of clusters `k` (e.g. 3 classes as mentioned in the lab) and
obtain cluster labels (`fcluster` numbers them starting from 1). We then compare them to the true labels in the dataset.


In [ ]:
from scipy.cluster.hierarchy import fcluster

k = 3  # number of target clusters (change if needed)
cluster_labels = fcluster(Z, k, criterion='maxclust')

df['cluster'] = cluster_labels
df[[TEXT_COL, 'clean_text', LABEL_COL, 'cluster']].head()
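
Before interpreting the clusters, it is worth checking how many clusters were actually produced and how large they are; with `criterion='maxclust'`, `k` is an upper bound, and average linkage on text data can yield one dominant cluster plus a few tiny ones. A minimal sketch on toy points (assumed data, not the tweets):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

points = np.array([[0.0], [1.0], [5.0], [7.0], [20.0]])
toy_Z = linkage(points, method='average')

# Ask for at most 3 flat clusters
toy_labels = fcluster(toy_Z, 3, criterion='maxclust')

ids, sizes = np.unique(toy_labels, return_counts=True)
print("Labels per point:", toy_labels)
for cluster_id, size in zip(ids, sizes):
    print(f"cluster {cluster_id}: {size} point(s)")
```

On the tweet data, the equivalent check is `df['cluster'].value_counts()`.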


## 8. Compare Clusters with Existing Class Labels

We use:
- Confusion matrix
- Adjusted Rand Index (ARI)
- Silhouette score (internal clustering quality)


In [ ]:
# Encode true labels as integers for comparison
le = LabelEncoder()
y_true = le.fit_transform(labels_raw)
y_cluster = cluster_labels - 1  # fcluster labels start at 1; shift to 0-based so they share the label range of y_true

cm = confusion_matrix(y_true, y_cluster)
ari = adjusted_rand_score(y_true, y_cluster)
sil = silhouette_score(X_tfidf, y_cluster, metric='cosine')

print('Confusion Matrix (rows=true, cols=cluster):')
print(cm)
print('\nAdjusted Rand Index (higher better, 0=random):', ari)
print('Silhouette score (higher better):', sil)

sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
plt.xlabel('Cluster label')
plt.ylabel('True label')
plt.title('Confusion Matrix: True vs Cluster Labels')
plt.show()
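
Cluster ids are arbitrary: cluster 1 does not have to correspond to the first class, so the diagonal of the matrix is not an accuracy. The Adjusted Rand Index ignores the naming entirely, and a labelled contingency table is often easier to read than a raw confusion matrix. The sketch below uses made-up labels (an assumption for illustration) where the partition is perfect but the ids are permuted.

```python
import pandas as pd
from sklearn.metrics import adjusted_rand_score

true = ['sports', 'sports', 'politics', 'politics', 'tech', 'tech']
clusters = [3, 3, 1, 1, 2, 2]   # same grouping, arbitrary ids

ari = adjusted_rand_score(true, clusters)
print("ARI:", ari)

# Rows = true class, columns = cluster id
table = pd.crosstab(pd.Series(true, name='true'),
                    pd.Series(clusters, name='cluster'))
print(table)
```

In the notebook, `pd.crosstab(df[LABEL_COL], df['cluster'])` gives the same kind of table with the original class names as row labels.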


## 9. Simple Cluster Interpretation

To interpret each cluster, we can look at a few example tweets from each cluster
and also show the top TF‑IDF terms for that cluster.


In [ ]:
feature_names = np.array(vectorizer.get_feature_names_out())

for c in range(1, k+1):
    print(f"\n=== Cluster {c} ===")
    idx = np.where(cluster_labels == c)[0]
    print(f"Number of tweets: {len(idx)}")
    
    # Show a few example tweets
    for t in df[TEXT_COL].iloc[idx[:3]]:
        print("-", t)
    
    # Compute mean TF-IDF for this cluster and get top terms
    mean_tfidf = np.asarray(X_tfidf[idx].mean(axis=0)).ravel()
    top_idx = mean_tfidf.argsort()[::-1][:10]
    print("Top terms:", feature_names[top_idx])
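
Alongside the example tweets and top terms, a quick numeric summary can help: the majority class in each cluster and the overall purity (the share of tweets that belong to their cluster's majority class). This is a minimal sketch on a made-up table; the column names `label` and `cluster` and the data are assumptions for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    'label':   ['a', 'a', 'b', 'b', 'b', 'c', 'c', 'a'],
    'cluster': [1,   1,   1,   2,   2,   3,   3,   3],
})

# Counts of each class inside each cluster
counts = pd.crosstab(toy['cluster'], toy['label'])

majority = counts.idxmax(axis=1)                    # most frequent class per cluster
purity = counts.max(axis=1).sum() / len(toy)        # fraction in the majority class

print(counts)
print("\nMajority class per cluster:")
print(majority)
print("\nPurity:", purity)
```

Purity always increases as `k` grows, so it is best read together with the ARI rather than on its own.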
