# 🗣️ Week 08 Lab – Topic Clustering using Transformer Embeddings

**LSE DS205 – Advanced Data Manipulation (2024/25)**

Alex Soldatkin

<div style="font-family: system-ui; padding: 20px 30px 20px 20px; background-color: #FFFFFF; border-left: 8px solid #0C79A2; border-radius: 8px; box-shadow: 0 4px 12px rgba(0, 0, 0, 0.1);max-width:700px">

**Today's Session**
- 📅 Tuesday, 11 March 2025

**Prerequisites**  

You should be comfortable with the following concepts before starting this lab:

- ✅ Understanding of the concepts and packages introduced in 🗣️ **Week 08 Lecture**
- ✅ Familiarity with transformer models and embeddings from previous labs
- ✅ Basic knowledge of clustering algorithms (K-means, DBSCAN)

</div>

## 1. Introduction

In this notebook, we'll explore how to use transformer-based embeddings to perform topic clustering on a collection of climate policy documents. Specifically, we'll investigate:

1. Can we use embeddings to identify distinct topics within the documents?
2. How do different clustering algorithms (K-means and DBSCAN) perform on these embeddings?
3. What do the resulting clusters represent?

We'll work with a small sample of documents (about 10) to make the analysis manageable.

In [1]:
import os
import time
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from tqdm.notebook import tqdm, trange

# Import LetsPlot for visualisation
from lets_plot import *
LetsPlot.setup_html()

# Imports from the 🤗 HuggingFace transformers library
from transformers import AutoTokenizer, AutoModel, AutoModelForMaskedLM

# Import our utility functions
from utils import get_embeddings
from clustering_utils import (
    load_ndc_doc_strings, perform_kmeans_clustering, perform_dbscan_clustering,
    find_optimal_k, extract_top_terms_per_cluster, reduce_dimensions_for_visualization,
    get_cluster_summary
)

## 2. Loading the ClimateBERT Model

We'll use the ClimateBERT model to generate embeddings for our documents, as it has been further pre-trained on climate-related text.

In [2]:
climate_model_name = "climatebert/distilroberta-base-climate-f"
models_save_dir = "./local_models"

# Path to the local model
local_model_path = os.path.join(models_save_dir, climate_model_name)

# Check if model exists locally, if not download it
if not os.path.exists(local_model_path):
    print("Downloading ClimateBERT model...")
    os.makedirs(os.path.dirname(local_model_path), exist_ok=True)
    # The model is public, so no token is needed (`use_auth_token` is deprecated in recent transformers releases)
    tokenizer = AutoTokenizer.from_pretrained(climate_model_name)
    model = AutoModelForMaskedLM.from_pretrained(climate_model_name)
    
    # Save it to a local_models folder
    tokenizer.save_pretrained(local_model_path)
    model.save_pretrained(local_model_path)
else:
    print("Loading ClimateBERT model from local directory...")

Loading ClimateBERT model from local directory...


In [3]:
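# Load tokenizer and model
# AutoModel loads the encoder without the masked-LM head, which is what we need for embeddings.
# The masked-LM checkpoint has no pooler layer, so the "newly initialized" pooler warning
# below is expected; it only matters if the pooler output is used.
climate_tokenizer = AutoTokenizer.from_pretrained(local_model_path)
climate_model = AutoModel.from_pretrained(local_model_path)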

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Some weights of RobertaModel were not initialized from the model checkpoint at ./local_models/climatebert/distilroberta-base-climate-f and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


## 3. Loading a Sample of Climate Policy Documents

We'll load a small sample of about 10 documents from the robust preprocessing folder.

In [4]:
# Load just 10 documents from the robust preprocessing folder
docs_df = load_ndc_doc_strings("data/ndc-docs-robust", filter_english=True, max_docs=10)

print(f"Loaded {len(docs_df)} documents")
display(docs_df.head())

Loading NDC documents from data/ndc-docs-robust
Loaded 10 documents


|   | file | doc |
|---|---|---|
| 0 | Ukraine NDC_July 31-Ukraine.txt | Updated Nationally Determined Contribution of ... |
| 1 | Brunei Darussalams NDC 2020-Brunei Darussalam.txt | Brunei Darussalam Nationally Determined Contri... |
| 2 | INDC_AFG_20150927_FINAL-Afghanistan.txt | ISLAMIC REPUBLIC OF AFGHANISTAN\nIntended Nati... |
| 3 | Trinidad and Tobago Final INDC-Trinidad and To... | TRINIDAD AND TOBAGO\nINTENDED NATIONALLY DETER... |
| 4 | FirstNDC-Eng-Syrian Arab Republic-Syrian Arab ... | Syrian Arab Republic\nNationally Determined Co... |


Let's take a quick look at which countries are represented in our sample:

In [5]:
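# Extract country names from filenames (the country is the part after the last hyphen)
# regex=False makes '.txt' a literal match, so '.' is not treated as a wildcard on older pandas versions
docs_df['country'] = docs_df['file'].str.split('-').str[-1].str.replace('.txt', '', regex=False)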
docs_df[['country', 'file']]

|   | country | file |
|---|---|---|
| 0 | Ukraine | Ukraine NDC_July 31-Ukraine.txt |
| 1 | Brunei Darussalam | Brunei Darussalams NDC 2020-Brunei Darussalam.txt |
| 2 | Afghanistan | INDC_AFG_20150927_FINAL-Afghanistan.txt |
| 3 | Trinidad and Tobago | Trinidad and Tobago Final INDC-Trinidad and To... |
| 4 | Syrian Arab Republic | FirstNDC-Eng-Syrian Arab Republic-Syrian Arab ... |
| 5 | Kazakhstan | 12updated NDC KAZ_Gov Decree313_19042023_en_co... |
| 6 | Ghana | Ghanas Updated Nationally Determined Contribut... |
| 7 | Botswana | BOTSWANA-Botswana.txt |
| 8 | United Kingdom of Great Britain and Northern I... | UK NDC ICTU 2022-United Kingdom of Great Brita... |
| 9 | Thailand | Thailand 2nd Updated NDC-Thailand.txt |


## 4. Generating Document Embeddings

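Now we'll generate embeddings for our documents using ClimateBERT. Each document is reduced to a single 768-dimensional vector computed from its first 512 tokens, so any text beyond that limit does not contribute to the embedding.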

In [6]:
# Generate embeddings using the first 512 tokens of each document
print("Generating embeddings...")
embeddings = get_embeddings(docs_df['doc'].tolist(), climate_model, climate_tokenizer, max_length=512)

print(f"Generated embeddings with shape: {embeddings.shape}")

Generating embeddings...
Generated embeddings with shape: (10, 768)
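
The `get_embeddings` helper lives in `utils` and is not shown in this notebook. One common way to collapse per-token hidden states into a single document vector is attention-masked mean pooling. The sketch below illustrates that idea on dummy arrays; it assumes (the source does not confirm this) that the helper does something similar. The name `mean_pool` and the toy shapes are made up for illustration.

```python
import numpy as np

def mean_pool(token_embeddings: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Average token vectors per document, ignoring padding positions.

    token_embeddings: (n_docs, seq_len, hidden_size)
    attention_mask:   (n_docs, seq_len), 1 for real tokens and 0 for padding
    """
    mask = attention_mask[:, :, None].astype(token_embeddings.dtype)
    summed = (token_embeddings * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)  # avoid division by zero
    return summed / counts

# Dummy data: 2 documents, 4 token positions, hidden size 3
rng = np.random.default_rng(0)
tokens = rng.normal(size=(2, 4, 3))
mask = np.array([[1, 1, 1, 1],
                 [1, 1, 0, 0]])  # the second document has two padding tokens

doc_vectors = mean_pool(tokens, mask)
print(doc_vectors.shape)  # (2, 3): one vector per document
```

With the real model the hidden size would be 768, which matches the `(10, 768)` shape printed above.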


## 5. Finding the Optimal Number of Clusters

Before applying K-means clustering, let's determine the optimal number of clusters using the silhouette score.

In [7]:
# Find optimal number of clusters
max_k = min(9, len(docs_df) - 1)  # Can't have more clusters than documents - 1
optimal_k, best_score, all_scores = find_optimal_k(embeddings, max_k=max_k)

print(f"Optimal number of clusters: {optimal_k} with silhouette score: {best_score:.3f}")

# Visualize silhouette scores
scores_df = pd.DataFrame(all_scores, columns=['k', 'silhouette_score'])

p = ggplot(scores_df, aes(x='k', y='silhouette_score')) + \
    geom_line(color='#0C79A2', size=1.5) + \
    geom_point(color='#0C79A2', size=3) + \
    geom_point(data=scores_df[scores_df['k'] == optimal_k], color='red', size=5) + \
    labs(title="Silhouette Scores for Different Numbers of Clusters", 
         x="Number of Clusters (k)", 
         y="Silhouette Score") + \
    theme_minimal()

p




Optimal number of clusters: 5 with silhouette score: 0.152
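
To make the score above easier to read, here is a minimal NumPy sketch of the silhouette coefficient. For each point it compares `a`, the mean distance to the other members of its own cluster, with `b`, the mean distance to the nearest other cluster, and computes `(b - a) / max(a, b)`. Values close to 1 indicate compact, well-separated clusters, while values near 0 indicate overlapping ones. The function name `silhouette_score_np` is hypothetical, Euclidean distance is an assumption, and `find_optimal_k` may well compute the score differently.

```python
import numpy as np

def silhouette_score_np(X: np.ndarray, labels: np.ndarray) -> float:
    """Mean silhouette coefficient using Euclidean distances (needs >= 2 clusters)."""
    labels = np.asarray(labels)
    # Pairwise Euclidean distance matrix, shape (n, n)
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    unique = np.unique(labels)
    scores = np.zeros(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        n_same = same.sum()
        if n_same == 1:
            continue  # singleton clusters score 0 by convention
        a = dists[i, same].sum() / (n_same - 1)  # the self-distance is 0
        b = min(dists[i, labels == other].mean()
                for other in unique if other != labels[i])
        scores[i] = (b - a) / max(a, b)
    return float(scores.mean())

# Two tight, well-separated toy clusters
X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]])
good = silhouette_score_np(X, np.array([0, 0, 1, 1]))  # natural split
bad = silhouette_score_np(X, np.array([0, 1, 0, 1]))   # split across the blobs
print(f"good split: {good:.3f}, bad split: {bad:.3f}")
```

Read against this scale, a best score of 0.152 suggests the five clusters are only weakly separated, which is not surprising with just 10 documents.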


## 6. Applying K-means Clustering

Now let's apply K-means clustering with the optimal number of clusters.

In [8]:
# Perform K-means clustering with the optimal number of clusters
kmeans_labels, cluster_centers, kmeans_silhouette = perform_kmeans_clustering(embeddings, n_clusters=optimal_k)

# Add cluster labels to the DataFrame
docs_df['kmeans_cluster'] = kmeans_labels

# Display clustering results
print(f"K-means clustering with {optimal_k} clusters. Silhouette score: {kmeans_silhouette:.3f}")
display(docs_df[['country', 'file', 'kmeans_cluster']])

K-means clustering with 5 clusters. Silhouette score: 0.152


|   | country | file | kmeans_cluster |
|---|---|---|---|
| 0 | Ukraine | Ukraine NDC_July 31-Ukraine.txt | 2 |
| 1 | Brunei Darussalam | Brunei Darussalams NDC 2020-Brunei Darussalam.txt | 1 |
| 2 | Afghanistan | INDC_AFG_20150927_FINAL-Afghanistan.txt | 1 |
| 3 | Trinidad and Tobago | Trinidad and Tobago Final INDC-Trinidad and To... | 0 |
| 4 | Syrian Arab Republic | FirstNDC-Eng-Syrian Arab Republic-Syrian Arab ... | 2 |
| 5 | Kazakhstan | 12updated NDC KAZ_Gov Decree313_19042023_en_co... | 3 |
| 6 | Ghana | Ghanas Updated Nationally Determined Contribut... | 4 |
| 7 | Botswana | BOTSWANA-Botswana.txt | 0 |
| 8 | United Kingdom of Great Britain and Northern I... | UK NDC ICTU 2022-United Kingdom of Great Brita... | 4 |
| 9 | Thailand | Thailand 2nd Updated NDC-Thailand.txt | 1 |
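
The `perform_kmeans_clustering` helper is defined in `clustering_utils` and is not shown here. As a reminder of what K-means does under the hood, the following is a bare-bones sketch of Lloyd's algorithm in NumPy: assign each point to its nearest centre, recompute the centres, and repeat until they stop moving. The function `kmeans_np` and the toy data are illustrative only, and the real helper probably relies on a library implementation with better initialisation.

```python
import numpy as np

def kmeans_np(X: np.ndarray, k: int, n_iter: int = 100, seed: int = 0):
    """Plain Lloyd's algorithm: assign points to the nearest centre, then update centres."""
    rng = np.random.default_rng(seed)
    # Initialise centres with k distinct data points chosen at random
    centres = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Distance from every point to every centre, shape (n, k)
        dists = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=-1)
        labels = dists.argmin(axis=1)
        # Recompute each centre; keep the old one if a cluster ends up empty
        new_centres = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centres[j]
            for j in range(k)
        ])
        if np.allclose(new_centres, centres):
            break  # converged
        centres = new_centres
    return labels, centres

# Two obvious blobs in 2D
X = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.3],
              [5.0, 5.0], [5.1, 4.8], [4.9, 5.2]])
labels, centres = kmeans_np(X, k=2)
print(labels)
```

Cluster numbers are arbitrary: the IDs 0 to 4 in the table above only say which documents were grouped together, not how the groups are ordered.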


## 7. Applying DBSCAN Clustering

Let's also try DBSCAN, which doesn't require specifying the number of clusters in advance.

In [9]:
# Perform DBSCAN clustering
# Start with conservative parameters
dbscan_labels, dbscan_silhouette = perform_dbscan_clustering(embeddings, eps=0.2, min_samples=2)

# Add cluster labels to the DataFrame
docs_df['dbscan_cluster'] = dbscan_labels

# Count clusters (excluding noise points labeled as -1)
n_clusters = len(set(dbscan_labels)) - (1 if -1 in dbscan_labels else 0)

print(f"DBSCAN clustering found {n_clusters} clusters. Silhouette score: {dbscan_silhouette:.3f}")
print(f"Number of noise points: {sum(dbscan_labels == -1)}")
display(docs_df[['country', 'file', 'dbscan_cluster']])

DBSCAN clustering found 0 clusters. Silhouette score: 0.000
Number of noise points: 10


|   | country | file | dbscan_cluster |
|---|---|---|---|
| 0 | Ukraine | Ukraine NDC_July 31-Ukraine.txt | -1 |
| 1 | Brunei Darussalam | Brunei Darussalams NDC 2020-Brunei Darussalam.txt | -1 |
| 2 | Afghanistan | INDC_AFG_20150927_FINAL-Afghanistan.txt | -1 |
| 3 | Trinidad and Tobago | Trinidad and Tobago Final INDC-Trinidad and To... | -1 |
| 4 | Syrian Arab Republic | FirstNDC-Eng-Syrian Arab Republic-Syrian Arab ... | -1 |
| 5 | Kazakhstan | 12updated NDC KAZ_Gov Decree313_19042023_en_co... | -1 |
| 6 | Ghana | Ghanas Updated Nationally Determined Contribut... | -1 |
| 7 | Botswana | BOTSWANA-Botswana.txt | -1 |
| 8 | United Kingdom of Great Britain and Northern I... | UK NDC ICTU 2022-United Kingdom of Great Brita... | -1 |
| 9 | Thailand | Thailand 2nd Updated NDC-Thailand.txt | -1 |


## 8. Analyzing the Clusters

Let's analyze what each cluster represents by examining the top terms in each cluster.

In [10]:
# Extract top terms for K-means clusters
kmeans_cluster_terms = extract_top_terms_per_cluster(docs_df, kmeans_labels, n_terms=15)

# Display the top terms for each K-means cluster
print("Top terms in each K-means cluster:")
for cluster_id, terms in kmeans_cluster_terms.items():
    cluster_docs = docs_df[docs_df['kmeans_cluster'] == cluster_id]['country'].tolist()
    print(f"\n🔹 Cluster {cluster_id} (Countries: {', '.join(cluster_docs)})")
    print("  " + ", ".join([f"{term[0]} ({term[1]})" for term in terms]))

Top terms in each K-means cluster:

🔹 Cluster 0 (Countries: Trinidad and Tobago, Botswana)
  emissions (44), climate (39), trinidad (32), tobago (32), change (30), sectors (23), national (23), reduction (22), mitigation (18), economic (16), development (15), botswana (15), adaptation (14), based (13), policy (13)

🔹 Cluster 1 (Countries: Brunei Darussalam, Afghanistan, Thailand)
  climate (177), change (113), national (93), brunei (81), darussalam (71), emissions (71), development (64), adaptation (62), afghanistan (61), energy (59), ghg (55), management (51), ndc (49), support (42), mitigation (41)

🔹 Cluster 2 (Countries: Ukraine, Syrian Arab Republic)
  ukraine (89), emissions (69), energy (66), climate (57), ghg (57), national (54), syrian (49), sector (43), economic (41), including (38), ndc (38), change (36), agreement (35), republic (34), environmental (32)

🔹 Cluster 3 (Countries: Kazakhstan)
  kazakhstan (100), climate (70), republic (61), adaptation (53), change (48), impleme

Now let's do the same for DBSCAN clusters, if any were found:

In [11]:
# Extract top terms for DBSCAN clusters (excluding noise points)
if n_clusters > 0:
    dbscan_cluster_terms = extract_top_terms_per_cluster(docs_df, dbscan_labels, n_terms=15)
    
    # Display the top terms for each DBSCAN cluster
    print("Top terms in each DBSCAN cluster:")
    for cluster_id, terms in dbscan_cluster_terms.items():
        cluster_docs = docs_df[docs_df['dbscan_cluster'] == cluster_id]['country'].tolist()
        print(f"\n🔹 Cluster {cluster_id} (Countries: {', '.join(cluster_docs)})")
        print("  " + ", ".join([f"{term[0]} ({term[1]})" for term in terms]))
else:
    print("No valid DBSCAN clusters found with current parameters.")

No valid DBSCAN clusters found with current parameters.
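
The term lists in this section come from `extract_top_terms_per_cluster`, whose implementation is not shown. A minimal sketch of the underlying idea, assuming simple word counts with stop-word removal, could look like the following. The function name, the tiny stop-word list and the example sentences are all made up for illustration.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list; a real one would be much longer
STOP_WORDS = {"the", "of", "and", "to", "in", "a", "is", "for", "on", "will"}

def top_terms_per_cluster(docs, labels, n_terms=3):
    """Return {cluster_id: [(term, count), ...]} based on raw word counts."""
    counts = {}
    for doc, label in zip(docs, labels):
        if label == -1:
            continue  # skip DBSCAN noise points
        tokens = re.findall(r"[a-z]+", doc.lower())
        counts.setdefault(label, Counter()).update(
            t for t in tokens if t not in STOP_WORDS and len(t) > 2
        )
    return {label: counter.most_common(n_terms) for label, counter in counts.items()}

docs = [
    "Emissions reduction targets for the energy sector",
    "Energy emissions and emissions trading",
    "Adaptation of agriculture to drought; adaptation finance",
]
labels = [0, 0, 1]
result = top_terms_per_cluster(docs, labels)
print(result)
```

Raw counts like these favour whatever words are repeated most often, which is why country names such as "trinidad" and "ukraine" rank so highly in the K-means output above. Those terms describe who wrote the document rather than what topic it covers.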


## 9. Visualizing the Clusters

Let's visualize our embeddings and clusters in 2D space using PCA for dimensionality reduction.

In [12]:
# Reduce the dimensionality of the embeddings to 2D for visualization
reduced_embeddings, explained_variance = reduce_dimensions_for_visualization(embeddings)

print(f"Explained variance by the first two principal components: {explained_variance[0]:.3f}, {explained_variance[1]:.3f}")
print(f"Total explained variance: {sum(explained_variance):.3f}")

Explained variance by the first two principal components: 0.271, 0.192
Total explained variance: 0.463
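
For reference, PCA and its explained-variance ratios can be computed directly from a singular value decomposition of the centred data. The sketch below does this in NumPy on random stand-in data; `pca_2d` is a hypothetical name, and `reduce_dimensions_for_visualization` may be implemented differently.

```python
import numpy as np

def pca_2d(X: np.ndarray):
    """Project X onto its first two principal components via SVD."""
    X_centred = X - X.mean(axis=0)
    # Rows of Vt are the principal directions; S holds the singular values
    U, S, Vt = np.linalg.svd(X_centred, full_matrices=False)
    explained_ratio = (S ** 2) / np.sum(S ** 2)
    coords = X_centred @ Vt[:2].T
    return coords, explained_ratio[:2]

rng = np.random.default_rng(42)
X = rng.normal(size=(10, 768))  # stand-in for the (10, 768) embedding matrix
coords, ratio = pca_2d(X)
print(coords.shape, f"{ratio.sum():.3f}")
```

A total of 0.463 means that more than half of the variance in the embeddings is not visible in the 2D plots, so points that look close on screen can still be far apart in the original 768-dimensional space.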


In [13]:
# Create a DataFrame for plotting with LetsPlot
viz_df = pd.DataFrame({
    'x': reduced_embeddings[:, 0],
    'y': reduced_embeddings[:, 1],
    'kmeans_cluster': [f'Cluster {c}' for c in kmeans_labels],
    'dbscan_cluster': [f'Cluster {c}' if c != -1 else 'Noise' for c in dbscan_labels],
    'country': docs_df['country'],
    'filename': docs_df['file']
})

# Visualize K-means clusters
kmeans_plot = ggplot(viz_df, aes(x='x', y='y', color='kmeans_cluster')) + \
    geom_point(size=4) + \
    geom_text(aes(label='country'), nudge_y=0.01, size=10) + \
    labs(title="K-means Clustering of Climate Policy Documents", 
         subtitle=f"Based on ClimateBERT embeddings (k={optimal_k})",
         x=f"PC1 ({explained_variance[0]:.1%} variance)", 
         y=f"PC2 ({explained_variance[1]:.1%} variance)") + \
    theme_minimal()

kmeans_plot

In [14]:
# Visualize DBSCAN clusters
dbscan_plot = ggplot(viz_df, aes(x='x', y='y', color='dbscan_cluster')) + \
    geom_point(size=4) + \
    geom_text(aes(label='country'), nudge_y=0.01, size=10) + \
    labs(title="DBSCAN Clustering of Climate Policy Documents", 
         subtitle=f"Based on ClimateBERT embeddings (eps=0.2, min_samples=2)",
         x=f"PC1 ({explained_variance[0]:.1%} variance)", 
         y=f"PC2 ({explained_variance[1]:.1%} variance)") + \
    theme_minimal()

dbscan_plot

## 10. Fine-tuning DBSCAN Parameters

DBSCAN is often sensitive to its parameters. Let's try different `eps` values to see if we can improve the clustering results.

In [23]:
# Try different eps values for DBSCAN
eps_values = [0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4, 0.45, 0.5]
dbscan_results = []

for eps in eps_values:
    labels, silhouette = perform_dbscan_clustering(embeddings, eps=eps, min_samples=2)
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    noise_points = sum(labels == -1)
    dbscan_results.append({
        'eps': eps,
        'n_clusters': n_clusters,
        'noise_points': noise_points,
        'silhouette': silhouette
    })

dbscan_tuning_df = pd.DataFrame(dbscan_results)
display(dbscan_tuning_df)

# Visualize the results
# Reshape to long format so that colour maps to the metric name, not to its numeric value
tuning_long = dbscan_tuning_df.melt(id_vars='eps', value_vars=['n_clusters', 'noise_points'],
                                    var_name='metric', value_name='count')

p = ggplot(tuning_long, aes(x='eps', y='count', color='metric')) + \
    geom_line(size=1.5) + \
    geom_point(size=3) + \
    scale_color_manual(values=['blue', 'red']) + \
    labs(title="Effect of eps Parameter on DBSCAN Clustering", 
         x="eps", 
         y="Count",
         color="") + \
    theme_minimal()

p

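|   | eps | n_clusters | noise_points | silhouette |
|---|---|---|---|---|
| 0 | 0.1 | 0 | 10 | 0 |
| 1 | 0.15 | 0 | 10 | 0 |
| 2 | 0.2 | 0 | 10 | 0 |
| 3 | 0.25 | 0 | 10 | 0 |
| 4 | 0.3 | 0 | 10 | 0 |
| 5 | 0.35 | 0 | 10 | 0 |
| 6 | 0.4 | 0 | 10 | 0 |
| 7 | 0.45 | 0 | 10 | 0 |
| 8 | 0.5 | 0 | 10 | 0 |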
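
Every `eps` value in the grid leaves all 10 documents as noise. With `min_samples=2`, a point can only start a cluster if at least one other point lies within `eps` of it, so `eps` has to be at least as large as the smallest nearest-neighbour distance in the data. A quick way to find a sensible range is to inspect those distances directly. The sketch below assumes Euclidean or cosine distance, since the metric used inside `perform_dbscan_clustering` is not shown, and uses random stand-in data in place of the real `embeddings`.

```python
import numpy as np

def nearest_neighbour_distances(X: np.ndarray, metric: str = "euclidean") -> np.ndarray:
    """Distance from each row of X to its closest other row."""
    if metric == "cosine":
        unit = X / np.linalg.norm(X, axis=1, keepdims=True)
        dists = 1.0 - unit @ unit.T
    else:
        dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(dists, np.inf)  # ignore each point's distance to itself
    return dists.min(axis=1)

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 768))  # stand-in; replace with the real `embeddings`

for metric in ("euclidean", "cosine"):
    nn = nearest_neighbour_distances(X, metric)
    print(f"{metric}: min={nn.min():.3f}, median={np.median(nn):.3f}, max={nn.max():.3f}")
```

If the smallest nearest-neighbour distance of the real embeddings is above 0.5, no value in the grid above can produce a cluster, and the grid would need to be centred on the observed distances instead.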


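Next we select the best `eps` value by silhouette score and re-run DBSCAN. Because no value in the grid produced a cluster in this run, the selection falls through to its last fallback and simply picks the first `eps` value (0.1):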

In [17]:
# Find optimal eps value (with silhouette > 0)
valid_results = dbscan_tuning_df[dbscan_tuning_df['silhouette'] > 0]
if not valid_results.empty:
    best_eps = valid_results.loc[valid_results['silhouette'].idxmax(), 'eps']
else:
    # If no valid silhouette, choose eps that gives some reasonable clustering
    reasonable_results = dbscan_tuning_df[(dbscan_tuning_df['n_clusters'] > 0) & 
                                          (dbscan_tuning_df['noise_points'] < len(docs_df)/2)]
    if not reasonable_results.empty:
        best_eps = reasonable_results.iloc[0]['eps']
    else:
        best_eps = dbscan_tuning_df.iloc[0]['eps']  # Fallback to the first eps value

print(f"Selected best eps value: {best_eps}")

# Re-run DBSCAN with optimal parameters
optimal_dbscan_labels, optimal_dbscan_silhouette = perform_dbscan_clustering(embeddings, eps=best_eps, min_samples=2)

# Count clusters (excluding noise points labeled as -1)
optimal_n_clusters = len(set(optimal_dbscan_labels)) - (1 if -1 in optimal_dbscan_labels else 0)
print(f"Optimized DBSCAN clustering found {optimal_n_clusters} clusters. Silhouette score: {optimal_dbscan_silhouette:.3f}")
print(f"Number of noise points: {sum(optimal_dbscan_labels == -1)}")

# Update the DataFrame
docs_df['optimal_dbscan_cluster'] = optimal_dbscan_labels
display(docs_df[['country', 'file', 'optimal_dbscan_cluster']])

Selected best eps value: 0.1
Optimized DBSCAN clustering found 0 clusters. Silhouette score: 0.000
Number of noise points: 10


|   | country | file | optimal_dbscan_cluster |
|---|---|---|---|
| 0 | Ukraine | Ukraine NDC_July 31-Ukraine.txt | -1 |
| 1 | Brunei Darussalam | Brunei Darussalams NDC 2020-Brunei Darussalam.txt | -1 |
| 2 | Afghanistan | INDC_AFG_20150927_FINAL-Afghanistan.txt | -1 |
| 3 | Trinidad and Tobago | Trinidad and Tobago Final INDC-Trinidad and To... | -1 |
| 4 | Syrian Arab Republic | FirstNDC-Eng-Syrian Arab Republic-Syrian Arab ... | -1 |
| 5 | Kazakhstan | 12updated NDC KAZ_Gov Decree313_19042023_en_co... | -1 |
| 6 | Ghana | Ghanas Updated Nationally Determined Contribut... | -1 |
| 7 | Botswana | BOTSWANA-Botswana.txt | -1 |
| 8 | United Kingdom of Great Britain and Northern I... | UK NDC ICTU 2022-United Kingdom of Great Brita... | -1 |
| 9 | Thailand | Thailand 2nd Updated NDC-Thailand.txt | -1 |


Let's visualise the optimised DBSCAN clusters:

In [18]:
# Update visualization DataFrame
viz_df['optimal_dbscan_cluster'] = [f'Cluster {c}' if c != -1 else 'Noise' for c in optimal_dbscan_labels]

# Visualize optimized DBSCAN clusters
optimal_dbscan_plot = ggplot(viz_df, aes(x='x', y='y', color='optimal_dbscan_cluster')) + \
    geom_point(size=4) + \
    geom_text(aes(label='country'), nudge_y=0.01, size=10) + \
    labs(title="Optimized DBSCAN Clustering of Climate Policy Documents", 
         subtitle=f"Based on ClimateBERT embeddings (eps={best_eps}, min_samples=2)",
         x=f"PC1 ({explained_variance[0]:.1%} variance)", 
         y=f"PC2 ({explained_variance[1]:.1%} variance)") + \
    theme_minimal()

optimal_dbscan_plot

Let's extract the top terms for the optimized DBSCAN clusters:

In [19]:
# Extract top terms for optimized DBSCAN clusters
if optimal_n_clusters > 0:
    optimal_dbscan_cluster_terms = extract_top_terms_per_cluster(docs_df, optimal_dbscan_labels, n_terms=15)
    
    # Display the top terms for each optimized DBSCAN cluster
    print("Top terms in each optimized DBSCAN cluster:")
    for cluster_id, terms in optimal_dbscan_cluster_terms.items():
        cluster_docs = docs_df[docs_df['optimal_dbscan_cluster'] == cluster_id]['country'].tolist()
        print(f"\n🔹 Cluster {cluster_id} (Countries: {', '.join(cluster_docs)})")
        print("  " + ", ".join([f"{term[0]} ({term[1]})" for term in terms]))
else:
    print("No valid optimized DBSCAN clusters found with current parameters.")

No valid optimized DBSCAN clusters found with current parameters.


## 11. Interpreting the Clusters

Now let's put it all together to interpret what distinct topics are present in our climate policy documents based on the clustering results.

In [20]:
# Create a summary of clusters
cluster_summary = pd.DataFrame({
    'Clustering Method': ['K-means'] * optimal_k + ['DBSCAN'] * optimal_n_clusters,
    'Cluster': [f'Cluster {i}' for i in range(optimal_k)] + 
              [f'Cluster {i}' for i in sorted(set(optimal_dbscan_labels)) if i != -1],
    'Countries': [', '.join(docs_df[docs_df['kmeans_cluster'] == i]['country'].tolist()) for i in range(optimal_k)] +
                [', '.join(docs_df[docs_df['optimal_dbscan_cluster'] == i]['country'].tolist()) 
                 for i in sorted(set(optimal_dbscan_labels)) if i != -1],
    'Top Terms': [', '.join([term[0] for term in terms[:5]]) for terms in kmeans_cluster_terms.values()] +
                [', '.join([term[0] for term in optimal_dbscan_cluster_terms.get(i, [])[:5]]) 
                 for i in sorted(set(optimal_dbscan_labels)) if i != -1]
})

display(cluster_summary)

|   | Clustering Method | Cluster | Countries | Top Terms |
|---|---|---|---|---|
| 0 | K-means | Cluster 0 | Trinidad and Tobago, Botswana | emissions, climate, trinidad, tobago, change |
| 1 | K-means | Cluster 1 | Brunei Darussalam, Afghanistan, Thailand | climate, change, national, brunei, darussalam |
| 2 | K-means | Cluster 2 | Ukraine, Syrian Arab Republic | ukraine, emissions, energy, climate, ghg |
| 3 | K-means | Cluster 3 | Kazakhstan | kazakhstan, climate, republic, adaptation, change |
| 4 | K-means | Cluster 4 | Ghana, United Kingdom of Great Britain and Nor... | determined, nationally, contribution, climate,... |
