# 🗣️ Week 08 Lab – Chunk-Level Clustering with Transformer Embeddings

**LSE DS205 – Advanced Data Manipulation (2024/25)**

Alex Soldatkin

<div style="font-family: system-ui; padding: 20px 30px 20px 20px; background-color: #FFFFFF; border-left: 8px solid #0C79A2; border-radius: 8px; box-shadow: 0 4px 12px rgba(0, 0, 0, 0.1);max-width:700px">

**Today's Session**
- 📅 Tuesday, 11 March 2025

**Prerequisites**  

You should be comfortable with the following concepts before starting this lab:

- ✅ Understanding of the concepts and packages introduced in 🗣️ **Week 08 Lecture**
- ✅ Familiarity with transformer models and embeddings from previous labs
- ✅ Basic knowledge of document chunking and clustering algorithms

</div>

## 1. Introduction

In this notebook, we'll explore how to perform clustering at the chunk level rather than the document level. Specifically, we will:

1. Load a single climate policy document
2. Split it into chunks using the same strategy as in NB02 (line-by-line)
3. Generate embeddings for each chunk using ClimateBERT
4. Apply clustering techniques to identify topic clusters within the document
5. Visualise the results using interactive plots

This approach allows us to identify distinct topics and sections within a single document.

In [1]:
import os
import time
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from tqdm.notebook import tqdm, trange

# Import LetsPlot for visualisation
from lets_plot import *

LetsPlot.setup_html()

# Imports from the 🤗 HuggingFace transformers library
from transformers import AutoTokenizer, AutoModel, AutoModelForMaskedLM

# Import our utility functions
from utils import get_embeddings
from clustering_utils import (
    load_ndc_doc_strings,
    perform_kmeans_clustering,
    perform_dbscan_clustering,
    find_optimal_k,
    extract_top_terms_per_cluster,
    reduce_dimensions_for_visualization,
)
from chunk_utils import (
    prepare_document_chunks,
    create_interactive_plot,
    create_section_flow_visualization,
)

## 2. Loading the ClimateBERT Model

We'll use ClimateBERT to generate embeddings for our document chunks. The checkpoint we load, `climatebert/distilroberta-base-climate-f`, is a DistilRoBERTa model that was further pre-trained on climate-related text.

In [2]:
climate_model_name = "climatebert/distilroberta-base-climate-f"
models_save_dir = "./local_models"

# Path to the local model
local_model_path = os.path.join(models_save_dir, climate_model_name)

# Check if model exists locally, if not download it
if not os.path.exists(local_model_path):
    print("Downloading ClimateBERT model...")
    os.makedirs(os.path.dirname(local_model_path), exist_ok=True)
    tokenizer = AutoTokenizer.from_pretrained(climate_model_name, use_auth_token=False)
    model = AutoModelForMaskedLM.from_pretrained(
        climate_model_name, use_auth_token=False
    )

    # Save it to a local_models folder
    tokenizer.save_pretrained(local_model_path)
    model.save_pretrained(local_model_path)
else:
    print("Loading ClimateBERT model from local directory...")

Loading ClimateBERT model from local directory...


In [3]:
# Load tokenizer and model
climate_tokenizer = AutoTokenizer.from_pretrained(local_model_path)
climate_model = AutoModel.from_pretrained(
    local_model_path
)  # Using AutoModel for embeddings

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Some weights of RobertaModel were not initialized from the model checkpoint at ./local_models/climatebert/distilroberta-base-climate-f and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


## 3. Loading a Single Document and Chunking

We'll select a single document to analyse in depth, which lets us see how topics vary within it.

In [4]:
# Load documents from the robust preprocessing folder
docs_df = load_ndc_doc_strings("data/ndc-docs-robust", filter_english=True)

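# Let's see what documents are available
# The country is taken as the text after the last hyphen in the file name, so
# hyphenated names are truncated (e.g. "Guinea-Bissau" becomes "Bissau")
docs_df["country"] = (
    docs_df["file"].str.split("-").str[-1].str.replace(".txt", "", regex=False)
)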
print(f"Loaded {len(docs_df)} documents")
# Display a few examples
display(docs_df[["country", "file"]].sample(10))

Loading NDC documents from data/ndc-docs-robust
Loaded 126 documents


| | country | file |
|---|---|---|
| 105 | Bissau | NDC-Guinea Bissau-12102021.Final-Guinea-Bissau... |
| 2 | Afghanistan | INDC_AFG_20150927_FINAL-Afghanistan.txt |
| 56 | Mongolia | First Submission of Mongolias NDC-Mongolia.txt |
| 47 | Barbados | 2021 Barbados NDC update - 21 July 2021-Barbad... |
| 78 | Türkiye | TÜRKİYE_UPDATED 1st NDC_EN-Türkiye.txt |
| 118 | Tonga | Tongas Second NDC-Tonga.txt |
| 15 | United Kingdom of Great Britain and Northern I... | UKs 2035 NDC ICTU-United Kingdom of Great Brit... |
| 42 | Zambia | Final Zambia_Revised and Updated_NDC_2021_-Zam... |
| 71 | Grenada | GrenadaSecondNDC2020 - 01-12-20-Grenada.txt |
| 8 | United Kingdom of Great Britain and Northern I... | UK NDC ICTU 2022-United Kingdom of Great Brita... |


In [5]:
# Let's select the UK document for detailed analysis
uk_document = (
    "UK NDC ICTU 2022-United Kingdom of Great Britain and Northern Ireland.txt"
)

# Extract the UK document from the dataframe
selected_doc = docs_df[docs_df["file"] == uk_document].iloc[0]
print(f"Selected document: {selected_doc['file']}")
print(f"Document length: {len(selected_doc['doc'])} characters")

Selected document: UK NDC ICTU 2022-United Kingdom of Great Britain and Northern Ireland.txt
Document length: 88480 characters


Now we'll chunk the document using the same strategy as in NB02, splitting it into individual lines:

In [6]:
# Prepare document chunks using the utility function
chunks_df = prepare_document_chunks(selected_doc["doc"], selected_doc["file"])

print(f"Created {len(chunks_df)} chunks from the document")
display(chunks_df.head(10))

Created 431 chunks from the document


| | text | line_number | file |
|---|---|---|---|
| 0 | United Kingdom of Great Britain and Northern I... | 1 | UK NDC ICTU 2022-United Kingdom of Great Brita... |
| 1 | Presented to Parliament by the Secretary of St... | 2 | UK NDC ICTU 2022-United Kingdom of Great Brita... |
| 2 | Updated: September 2022 | 3 | UK NDC ICTU 2022-United Kingdom of Great Brita... |
| 3 | CP 744 | 4 | UK NDC ICTU 2022-United Kingdom of Great Brita... |
| 4 | United Kingdom of Great Britain and Northern I... | 5 | UK NDC ICTU 2022-United Kingdom of Great Brita... |
| 5 | Presented to Parliament by the Secretary of St... | 6 | UK NDC ICTU 2022-United Kingdom of Great Brita... |
| 6 | Updated: September 2022 | 7 | UK NDC ICTU 2022-United Kingdom of Great Brita... |
| 7 | © Crown copyright 2022 | 8 | UK NDC ICTU 2022-United Kingdom of Great Brita... |
| 8 | This publication is licensed under the terms o... | 9 | UK NDC ICTU 2022-United Kingdom of Great Brita... |
| 9 | Where we have identified any third party copyr... | 10 | UK NDC ICTU 2022-United Kingdom of Great Brita... |
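
The `prepare_document_chunks` helper lives in `chunk_utils` and is not shown in this notebook. As a rough illustration of line-by-line chunking, the sketch below uses a hypothetical function, `split_into_line_chunks`, which assumes that blank lines are skipped and that line numbers start at 1; the real helper may behave differently.

```python
def split_into_line_chunks(doc_text, file_name):
    """Split a document into one chunk per non-empty line (hypothetical helper)."""
    chunks = []
    for line_number, line in enumerate(doc_text.splitlines(), start=1):
        text = line.strip()
        if text:  # skip blank lines but keep the original line number
            chunks.append(
                {"text": text, "line_number": line_number, "file": file_name}
            )
    return chunks


sample = "Title line\n\nFirst paragraph.\nSecond paragraph."
for chunk in split_into_line_chunks(sample, "example.txt"):
    print(chunk["line_number"], chunk["text"])
```

A list of dictionaries like this can be passed straight to `pd.DataFrame(...)` to obtain the `text`, `line_number` and `file` columns shown in the table above.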


## 4. Generating Embeddings for Document Chunks

Now we'll generate embeddings for each chunk using ClimateBERT. This will allow us to create a vector representation of each line in the document.

In [7]:
# Generate embeddings for the chunks
print("Generating embeddings for document chunks...")

# Filter out very short chunks that may not be meaningful
filtered_chunks_df = chunks_df[chunks_df["text"].str.len() > 10].reset_index(drop=True)
print(f"Using {len(filtered_chunks_df)} chunks after filtering very short lines")

# Process embeddings in batches to avoid memory issues
batch_size = 20
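# Pre-allocate using the model's hidden size (768 here) instead of hard-coding it
all_embeddings = np.zeros((len(filtered_chunks_df), climate_model.config.hidden_size))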

for i in trange(0, len(filtered_chunks_df), batch_size):
    batch_end = min(i + batch_size, len(filtered_chunks_df))
    batch_texts = filtered_chunks_df["text"].iloc[i:batch_end].tolist()
    batch_embeddings = get_embeddings(batch_texts, climate_model, climate_tokenizer)
    all_embeddings[i:batch_end] = batch_embeddings

print(f"Generated embeddings with shape: {all_embeddings.shape}")

Generating embeddings for document chunks...
Using 379 chunks after filtering very short lines


Generated embeddings with shape: (379, 768)
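
The `get_embeddings` helper comes from `utils` and its pooling strategy is not shown here. One common way to turn a transformer's per-token outputs into a single vector per chunk is masked mean pooling: average the token vectors while ignoring padding. The sketch below illustrates that idea on dummy NumPy arrays standing in for model outputs; `mean_pool` is a hypothetical name, and this is an assumption about the general technique, not a description of the course helper.

```python
import numpy as np


def mean_pool(token_embeddings, attention_mask):
    """Average token vectors per text, ignoring padded positions.

    token_embeddings: (batch, seq_len, hidden) array
    attention_mask:   (batch, seq_len) array of 1s (real tokens) and 0s (padding)
    """
    mask = attention_mask[:, :, None].astype(float)  # (batch, seq_len, 1)
    summed = (token_embeddings * mask).sum(axis=1)   # (batch, hidden)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)   # avoid division by zero
    return summed / counts


# Dummy stand-in for a model output: 2 texts, 4 token slots, hidden size 3
rng = np.random.default_rng(0)
tokens = rng.normal(size=(2, 4, 3))
mask = np.array([[1, 1, 1, 0], [1, 1, 0, 0]])  # the second text has two padding slots

pooled = mean_pool(tokens, mask)
print(pooled.shape)
```

Whatever pooling is used, the result is one fixed-length vector per chunk, which is why `all_embeddings` has one row per chunk and one column per hidden dimension.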


## 5. Finding the Optimal Number of Clusters

Before applying K-means clustering, let's determine the optimal number of clusters using the silhouette score.

In [8]:
# Find optimal number of clusters
max_k = min(15, len(filtered_chunks_df) - 1)  # Can't have more clusters than chunks - 1
optimal_k, best_score, all_scores = find_optimal_k(all_embeddings, max_k=max_k)

print(
    f"Optimal number of clusters: {optimal_k} with silhouette score: {best_score:.3f}"
)

# Visualize silhouette scores
scores_df = pd.DataFrame(all_scores, columns=["k", "silhouette_score"])

p = (
    ggplot(scores_df, aes(x="k", y="silhouette_score"))
    + geom_line(color="#0C79A2", size=1.5)
    + geom_point(color="#0C79A2", size=3)
    + geom_point(data=scores_df[scores_df["k"] == optimal_k], color="red", size=5)
    + labs(
        title="Silhouette Scores for Different Numbers of Clusters",
        x="Number of Clusters (k)",
        y="Silhouette Score",
    )
    + theme_minimal()
)

p


Optimal number of clusters: 15 with silhouette score: 0.178
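
To make the silhouette score less of a black box, here is a minimal NumPy sketch of the metric itself. For each point, `a` is the mean distance to the other members of its own cluster, `b` is the smallest mean distance to any other cluster, and the score is `(b - a) / max(a, b)`. Values near 1 indicate compact, well-separated clusters, while values near 0 (such as the 0.178 reported above) indicate heavily overlapping ones. The function names are hypothetical, Euclidean distance is an assumption, and `find_optimal_k` may compute the score differently.

```python
import numpy as np


def pairwise_distances(X):
    """Euclidean distance matrix computed from the Gram matrix."""
    sq = (X**2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * X @ X.T
    np.fill_diagonal(d2, 0.0)
    return np.sqrt(np.clip(d2, 0.0, None))


def silhouette_score_numpy(X, labels):
    """Mean silhouette over all points (needs at least two clusters)."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    dists = pairwise_distances(X)
    clusters = np.unique(labels)
    scores = np.zeros(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        n_same = same.sum()
        if n_same == 1:
            continue  # singleton clusters get a score of 0 by convention
        # a: mean distance to the other members of the point's own cluster
        a = dists[i, same].sum() / (n_same - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(dists[i, labels == c].mean() for c in clusters if c != labels[i])
        denom = max(a, b)
        scores[i] = 0.0 if denom == 0 else (b - a) / denom
    return scores.mean()


X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]])
good = silhouette_score_numpy(X, [0, 0, 1, 1])
bad = silhouette_score_numpy(X, [0, 1, 0, 1])
print(f"well-separated labelling: {good:.3f}, mixed labelling: {bad:.3f}")
```

The first labelling groups nearby points together and scores close to 1; the second mixes the two groups and scores below 0.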


## 6. Applying K-means Clustering to Document Chunks

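Now let's apply K-means with the selected number of clusters. Note that the selected value, k = 15, equals `max_k`, the upper limit of the search, so the silhouette score may still be rising beyond the range we tested; treat it as the best k within that range rather than a global optimum.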

In [9]:
# Perform K-means clustering with the optimal number of clusters
kmeans_labels, cluster_centers, kmeans_silhouette = perform_kmeans_clustering(
    all_embeddings, n_clusters=optimal_k
)

# Add cluster labels to the DataFrame
filtered_chunks_df["kmeans_cluster"] = kmeans_labels

# Display clustering results
print(
    f"K-means clustering with {optimal_k} clusters. Silhouette score: {kmeans_silhouette:.3f}"
)

# Check the distribution of clusters
cluster_counts = filtered_chunks_df["kmeans_cluster"].value_counts().sort_index()
display(pd.DataFrame({"Cluster": cluster_counts.index, "Count": cluster_counts.values}))

K-means clustering with 15 clusters. Silhouette score: 0.178


| | Cluster | Count |
|---|---|---|
| 0 | 0 | 49 |
| 1 | 1 | 23 |
| 2 | 2 | 40 |
| 3 | 3 | 67 |
| 4 | 4 | 19 |
| 5 | 5 | 31 |
| 6 | 6 | 24 |
| 7 | 7 | 18 |
| 8 | 8 | 9 |
| 9 | 9 | 14 |


Let's extract the top terms for each cluster to understand what they represent:

In [10]:
# Extract top terms for K-means clusters
kmeans_cluster_terms = extract_top_terms_per_cluster(
    filtered_chunks_df.rename(columns={"text": "doc"}), kmeans_labels, n_terms=10
)

# Display the top terms for each K-means cluster
print("Top terms in each K-means cluster:")
for cluster_id, terms in kmeans_cluster_terms.items():
    print(
        f"\n🔹 Cluster {cluster_id} (Count: {sum(filtered_chunks_df['kmeans_cluster'] == cluster_id)})"
    )
    print("  " + ", ".join([f"{term[0]} ({term[1]})" for term in terms]))

    # Show a few examples of text chunks from this cluster
    cluster_examples = (
        filtered_chunks_df[filtered_chunks_df["kmeans_cluster"] == cluster_id]["text"]
        .sample(min(3, sum(filtered_chunks_df["kmeans_cluster"] == cluster_id)))
        .tolist()
    )
    print("\n  Example chunks:")
    for i, example in enumerate(cluster_examples):
        print(
            f"  {i+1}. {example[:100]}..."
            if len(example) > 100
            else f"  {i+1}. {example}"
        )

Top terms in each K-means cluster:

🔹 Cluster 0 (Count: 49)
  nationally (48), determined (48), contribution (48), updated (45), september (45), united (3), kingdom (3), great (3), britain (3), northern (3)

  Example chunks:
  1. UK’s Nationally Determined Contribution – updated September 2022
  2. UK’s Nationally Determined Contribution – updated September 2022
  3. UK’s Nationally Determined Contribution – updated September 2022

🔹 Cluster 1 (Count: 23)
  climate (17), gender (15), women (11), energy (11), action (8), clean (8), countries (8), international (7), people (7), equal (7)

  Example chunks:
  1. the Generation Equality Forum’s Action Coalition on Feminist Action for Climate Justice54, the UK wi...
  2. The UK also supports gender balance in physics and computing to increase Science, Technology, Engine...
  3. Signed by over 30 major donor countries under the UK’s COP Presidency, the International Just Transi...

🔹 Cluster 2 (Count: 40)
  ndc (22), emissions (20), climate
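
The numbers in brackets above look like raw term counts per cluster. `extract_top_terms_per_cluster` is defined in `clustering_utils` and is not shown, so the following is only a sketch of how such counts could be produced with the standard library. The function name `top_terms_per_cluster`, the tokenisation rule and the tiny stop-word list are all assumptions made for illustration.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption, not the list used in the lab)
STOP_WORDS = {"the", "of", "and", "to", "in", "a", "s", "for", "on", "is"}


def top_terms_per_cluster(texts, labels, n_terms=3):
    """Return {cluster_id: [(term, count), ...]} using raw term counts."""
    counters = {}
    for text, label in zip(texts, labels):
        tokens = re.findall(r"[a-z]+", text.lower())
        counters.setdefault(label, Counter()).update(
            t for t in tokens if t not in STOP_WORDS
        )
    return {label: counter.most_common(n_terms) for label, counter in counters.items()}


texts = [
    "Gender and climate action",
    "Climate action for women",
    "Emissions target for 2030",
    "Emissions reduction target",
]
terms = top_terms_per_cluster(texts, labels=[0, 0, 1, 1], n_terms=2)
for cluster_id, cluster_terms in terms.items():
    print(cluster_id, ", ".join(f"{term} ({count})" for term, count in cluster_terms))
```

Raw counts favour repeated boilerplate, which is consistent with Cluster 0 above being dominated by the repeated page header.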

## 7. Applying DBSCAN Clustering to Document Chunks

Let's also try DBSCAN, which doesn't require specifying the number of clusters in advance.

In [11]:
# Try different eps values for DBSCAN
eps_values = [0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4, 0.45, 0.5]
dbscan_results = []

for eps in eps_values:
    labels, silhouette = perform_dbscan_clustering(
        all_embeddings, eps=eps, min_samples=3
    )
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    noise_points = sum(labels == -1)
    dbscan_results.append(
        {
            "eps": eps,
            "n_clusters": n_clusters,
            "noise_points": noise_points,
            "silhouette": silhouette,
        }
    )

dbscan_tuning_df = pd.DataFrame(dbscan_results)
display(dbscan_tuning_df)

# Visualize the results
p = (
    ggplot(dbscan_tuning_df, aes(x="eps"))
    + geom_line(aes(y="n_clusters"), color="blue", size=1.5)
    + geom_point(aes(y="n_clusters"), color="blue", size=3)
    + geom_line(aes(y="noise_points"), color="red", size=1.5)
    + geom_point(aes(y="noise_points"), color="red", size=3)
    + labs(x="eps", y="Count")
    + ggtitle(
        "Effect of eps Parameter on DBSCAN Clustering",
        "Blue=Number of Clusters, Red=Noise Points",
    )
    + theme_minimal()
)

p

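| | eps | n_clusters | noise_points | silhouette |
|---|---|---|---|---|
| 0 | 0.1 | 4 | 318 | 0.996981 |
| 1 | 0.15 | 4 | 318 | 0.996981 |
| 2 | 0.2 | 4 | 318 | 0.996981 |
| 3 | 0.25 | 4 | 318 | 0.996981 |
| 4 | 0.3 | 4 | 318 | 0.996981 |
| 5 | 0.35 | 4 | 318 | 0.996981 |
| 6 | 0.4 | 4 | 318 | 0.996981 |
| 7 | 0.45 | 4 | 318 | 0.996981 |
| 8 | 0.5 | 4 | 318 | 0.996981 |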
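
Every `eps` value in the grid gives the same result. One plausible reading (an assumption, since `perform_dbscan_clustering` is not shown) is that all of these values are smaller than the typical distance between distinct chunks, so only near-identical lines, such as the repeated page header, end up in clusters. A common heuristic for choosing a sensible `eps` range is to look at the sorted distance from each point to its k-th nearest neighbour and pick a value near the "elbow" of that curve. The sketch below, with the hypothetical helper `k_distance`, computes that curve with NumPy on synthetic data, assuming Euclidean distance.

```python
import numpy as np


def k_distance(X, k=3):
    """Sorted Euclidean distance from each point to its k-th nearest neighbour."""
    X = np.asarray(X, dtype=float)
    sq = (X**2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * X @ X.T
    dists = np.sqrt(np.clip(d2, 0.0, None))
    # After sorting each row, column 0 is the point itself (distance ~0)
    kth = np.sort(dists, axis=1)[:, k]
    return np.sort(kth)


# Synthetic data: two tight blobs in 5 dimensions
rng = np.random.default_rng(42)
X = np.vstack(
    [rng.normal(0, 0.1, size=(20, 5)), rng.normal(5, 0.1, size=(20, 5))]
)
kd = k_distance(X, k=3)
print(f"median 3-NN distance: {np.median(kd):.3f}, max: {kd.max():.3f}")
```

Running `k_distance(all_embeddings, k=3)` on the lab data would show the scale of neighbour distances in the embedding space, and the `eps` grid could then be chosen to span that range rather than 0.1 to 0.5.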


In [12]:
# Select the best eps value that gives a reasonable number of clusters
# and not too many noise points
best_eps_row = dbscan_tuning_df[
    (dbscan_tuning_df["n_clusters"] > 2)
    & (dbscan_tuning_df["noise_points"] < len(filtered_chunks_df) * 0.3)
]

if len(best_eps_row) > 0:
    # Choose the one with the highest silhouette score
    best_eps = best_eps_row.loc[best_eps_row["silhouette"].idxmax(), "eps"]
else:
    # Fallback: no eps met both criteria, so take the eps that gives the most
    # clusters (idxmax returns the first one when several are tied)
    best_eps = dbscan_tuning_df.loc[dbscan_tuning_df["n_clusters"].idxmax(), "eps"]

print(f"Selected best eps value: {best_eps}")

# Apply DBSCAN with the selected eps value
dbscan_labels, dbscan_silhouette = perform_dbscan_clustering(
    all_embeddings, eps=best_eps, min_samples=3
)

# Add cluster labels to the DataFrame
filtered_chunks_df["dbscan_cluster"] = dbscan_labels

# Count clusters (excluding noise points labeled as -1)
n_clusters = len(set(dbscan_labels)) - (1 if -1 in dbscan_labels else 0)
print(
    f"DBSCAN clustering found {n_clusters} clusters. Silhouette score: {dbscan_silhouette:.3f}"
)
print(f"Number of noise points: {sum(dbscan_labels == -1)} out of {len(dbscan_labels)}")

# Check the distribution of clusters
if n_clusters > 0:
    cluster_counts = filtered_chunks_df["dbscan_cluster"].value_counts().sort_index()
    display(
        pd.DataFrame({"Cluster": cluster_counts.index, "Count": cluster_counts.values})
    )

Selected best eps value: 0.1
DBSCAN clustering found 4 clusters. Silhouette score: 0.997
Number of noise points: 318 out of 379


| | Cluster | Count |
|---|---|---|
| 0 | -1 | 318 |
| 1 | 0 | 3 |
| 2 | 1 | 45 |
| 3 | 2 | 10 |
| 4 | 3 | 3 |


## 8. Visualising Document Chunks in 2D Space

Now we'll visualise the document chunks in 2D space using PCA, coloured by their cluster assignments.

In [13]:
# Reduce dimensions for visualization
reduced_embeddings, explained_variance = reduce_dimensions_for_visualization(
    all_embeddings
)

print(
    f"Explained variance by the first two principal components: {explained_variance[0]:.3f}, {explained_variance[1]:.3f}"
)
print(f"Total explained variance: {sum(explained_variance):.3f}")

Explained variance by the first two principal components: 0.145, 0.104
Total explained variance: 0.249
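
A total explained variance of 0.249 means that roughly three quarters of the variance in the embeddings is not visible in the 2D plot, so clusters that overlap on screen may still be separated in the full embedding space. To show where these numbers come from, here is a minimal NumPy sketch of PCA via singular value decomposition. The function name `pca_2d` is hypothetical, and `reduce_dimensions_for_visualization` may be implemented differently.

```python
import numpy as np


def pca_2d(X):
    """Project X onto its first two principal components via SVD."""
    X = np.asarray(X, dtype=float)
    centred = X - X.mean(axis=0)
    # Rows of vt are the principal directions; s holds the singular values
    u, s, vt = np.linalg.svd(centred, full_matrices=False)
    explained_ratio = (s**2) / (s**2).sum()
    return centred @ vt[:2].T, explained_ratio[:2]


# Synthetic data: 100 points in 10 dimensions with unequal spread per dimension
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 10)) * np.linspace(3.0, 0.5, 10)

coords, ratio = pca_2d(X)
print(coords.shape, f"total explained variance: {ratio.sum():.3f}")
```

The two ratios are the share of total variance captured by each plotted axis, which is what the cell above prints for the chunk embeddings.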


In [14]:
# Create a DataFrame for plotting
viz_df = pd.DataFrame(
    {
        "x": reduced_embeddings[:, 0],
        "y": reduced_embeddings[:, 1],
        "kmeans_cluster": [
            f"Cluster {c}" for c in filtered_chunks_df["kmeans_cluster"]
        ],
        "dbscan_cluster": [
            f"Cluster {c}" if c != -1 else "Noise"
            for c in filtered_chunks_df["dbscan_cluster"]
        ],
        "line_number": filtered_chunks_df["line_number"],
        "text": filtered_chunks_df["text"],
    }
)

# Create an interactive plot for K-means clusters
kmeans_plot = create_interactive_plot(
    viz_df,
    color_column="kmeans_cluster",
    title=f"K-means Clustering of Document Chunks (k={optimal_k})",
    explained_variance=explained_variance,
)

kmeans_plot

In [15]:
# Create an interactive plot for DBSCAN clusters
if n_clusters > 0:
    dbscan_plot = create_interactive_plot(
        viz_df,
        color_column="dbscan_cluster",
        title=f"DBSCAN Clustering of Document Chunks (eps={best_eps})",
        explained_variance=explained_variance,
    )

    # A bare expression inside an if-block is not rendered, so display it explicitly
    display(dbscan_plot)
else:
    print("DBSCAN did not find any meaningful clusters with the selected parameters.")

## 9. Analysing Chunks by Position in Document

Let's also visualise how the clusters are distributed throughout the document by plotting each chunk's cluster against its position in the original text.

In [19]:
# Add normalized position (0-1) to represent where in the document the chunk appears
filtered_chunks_df["position"] = (
    filtered_chunks_df["line_number"] / filtered_chunks_df["line_number"].max()
)

# Create a DataFrame for plotting document position vs. clusters
position_df = pd.DataFrame(
    {
        "position": filtered_chunks_df["position"],
        "kmeans_cluster": filtered_chunks_df["kmeans_cluster"],
        "line_number": filtered_chunks_df["line_number"],
        "text": filtered_chunks_df["text"],
    }
)

# Plot K-means clusters against document position
position_plot = (
    ggplot(
        position_df,
        aes(x="position", y="kmeans_cluster", color="kmeans_cluster"),
    )
    + geom_point(size=4, alpha=0.7)
    + labs(
        title="Distribution of Clusters Throughout the Document",
        x="Relative Position in Document (0-1)",
        y="K-means Cluster",
    )
    + scale_x_continuous(
        labels=["Start", "25%", "50%", "75%", "End"], breaks=[0, 0.25, 0.5, 0.75, 1]
    )
    + theme_minimal()
    + theme(legend_position="none")
)

position_plot

## 10. Document Structure Analysis

Finally, let's analyse how the document is structured by looking at how topics (clusters) flow through it. We treat each run of consecutive chunks with the same cluster label as one section.

In [17]:
# Group consecutive chunks with the same cluster to identify document sections
filtered_chunks_df["section_change"] = (
    filtered_chunks_df["kmeans_cluster"]
    != filtered_chunks_df["kmeans_cluster"].shift(1)
).astype(int)
filtered_chunks_df["section_id"] = filtered_chunks_df["section_change"].cumsum()

# Get section information
sections = (
    filtered_chunks_df.groupby("section_id")
    .agg(
        {
            "kmeans_cluster": "first",
            "line_number": ["min", "max"],
            "text": lambda x: "\n".join(x.tolist()[:3])
            + ("\n..." if len(x) > 3 else ""),
            "position": "mean",
        }
    )
    .reset_index()
)

# Rename columns for clarity
sections.columns = [
    "section_id",
    "cluster",
    "start_line",
    "end_line",
    "sample_text",
    "avg_position",
]
sections["length"] = sections["end_line"] - sections["start_line"] + 1

# Sort by position in document
sections = sections.sort_values("start_line")

# Display the sections
print(f"Found {len(sections)} document sections based on topic clustering\n")
for i, row in sections.iterrows():
    print(
        f"Section {i+1} (Cluster {row['cluster']}, Lines {row['start_line']}-{row['end_line']}, {row['length']} lines)"
    )
    # Get the top terms for this cluster
    if row["cluster"] in kmeans_cluster_terms:
        top_terms = ", ".join(
            [term[0] for term in kmeans_cluster_terms[row["cluster"]][:5]]
        )
        print(f"Top terms: {top_terms}")
    print(f"Sample text:\n{row['sample_text']}\n")

Found 288 document sections based on topic clustering

Section 1 (Cluster 0, Lines 1-1, 1 lines)
Top terms: nationally, determined, contribution, updated, september
Sample text:
United Kingdom of Great Britain and Northern Ireland’s Nationally Determined Contribution

Section 2 (Cluster 3, Lines 2-3, 2 lines)
Top terms: applicable, energy, information, reference, local
Sample text:
Presented to Parliament by the Secretary of State for Business, Energy, and Industrial Strategy by Command of His Majesty
Updated: September 2022

Section 3 (Cluster 0, Lines 5-5, 1 lines)
Top terms: nationally, determined, contribution, updated, september
Sample text:
United Kingdom of Great Britain and Northern Ireland’s Nationally Determined Contribution

Section 4 (Cluster 3, Lines 6-8, 3 lines)
Top terms: applicable, energy, information, reference, local
Sample text:
Presented to Parliament by the Secretary of State for Business, Energy, and Industrial Strategy by Command of His Majesty
Updated: Septemb
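
With 288 sections for 379 chunks, the average section is only about 1.3 chunks long, so the cluster label changes almost every line. The section logic in the cell above (a change flag followed by a cumulative sum) is a run-length grouping. The same idea can be shown on a toy label sequence using only the standard library; `label_runs` is a hypothetical helper written for this illustration.

```python
from itertools import groupby


def label_runs(labels):
    """Collapse consecutive equal labels into (label, start_index, length) runs."""
    runs = []
    position = 0
    for label, group in groupby(labels):
        length = len(list(group))
        runs.append((label, position, length))
        position += length
    return runs


runs = label_runs([0, 3, 3, 0, 3, 3, 3])
print(runs)
# [(0, 0, 1), (3, 1, 2), (0, 3, 1), (3, 4, 3)]
```

Each tuple corresponds to one "section": the cluster label, where the run starts, and how many consecutive chunks it covers.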

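Let's visualise the document structure as a flow of sections. The cell below is left commented out; its `colors_dict` only defines colours for clusters 0 to 7, so it may need extending to cover all 15 clusters before it can be run: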

In [29]:
# # Create a data frame for the section flow visualization
# section_flow_df = sections[
#     ["section_id", "cluster", "start_line", "end_line", "length"]
# ].copy()
# section_flow_df["start_pos"] = (
#     section_flow_df["start_line"] / filtered_chunks_df["line_number"].max()
# )
# section_flow_df["end_pos"] = (
#     section_flow_df["end_line"] / filtered_chunks_df["line_number"].max()
# )
# section_flow_df["length"] = (
#     section_flow_df["end_line"] - section_flow_df["start_line"] + 1
# )

# # Create the section flow visualization
# section_flow_plot = create_section_flow_visualization(
#     section_flow_df,
#     colors_dict={0: "#0C79A2", 1: "#FFA500", 2: "#228B22", 3: "#FF6347", 4: "#8A2BE2", 5: "#5F9EA0", 6: "#D2691E", 7: "#6495ED"}
# )

# section_flow_plot

Let us create a visualisation of the chunk clusters with layered tooltips, using lets_plot, to see how the clusters are distributed across the document.

In [None]:
# Cluster-by-position plot with layered tooltips (line number and chunk text)

# Create a DataFrame for plotting document position vs. clusters

position_df = pd.DataFrame(
    {
        "position": filtered_chunks_df["position"],
        "kmeans_cluster": filtered_chunks_df["kmeans_cluster"],
        "line_number": filtered_chunks_df["line_number"],
        "text": filtered_chunks_df["text"],
    }
)

# Plot K-means clusters against document position

position_plot = (
    ggplot(
        position_df,
        aes(x="position", y="kmeans_cluster", color="kmeans_cluster"),
    )
    + geom_point(
        size=4,
        alpha=0.7,
        tooltips=layer_tooltips().line("line: @line_number").line("@text"),
    )
    + labs(
        title="Distribution of Clusters Throughout the Document",
        x="Relative Position in Document (0-1)",
        y="K-means Cluster",
    )
    + scale_x_continuous(
        labels=["Start", "25%", "50%", "75%", "End"], breaks=[0, 0.25, 0.5, 0.75, 1]
    )
    + theme_minimal()
    + theme(legend_position="none")
)

position_plot

In [None]:
# Same plot, but for the DBSCAN clusters (noise points are labelled -1)

position_df = pd.DataFrame(
    {
        "position": filtered_chunks_df["position"],
        "dbscan_cluster": filtered_chunks_df["dbscan_cluster"],
        "line_number": filtered_chunks_df["line_number"],
        "text": filtered_chunks_df["text"],
    }
)

# Plot DBSCAN clusters against document position

position_plot = (
    ggplot(
        position_df,
        aes(x="position", y="dbscan_cluster", color="dbscan_cluster"),
    )
    + geom_point(
        size=4,
        alpha=0.7,
        tooltips=layer_tooltips().line("line: @line_number").line("@text"),
    )
    + labs(
        title="Distribution of DBSCAN Clusters Throughout the Document",
        x="Relative Position in Document (0-1)",
        y="DBSCAN Cluster",
    )
    + scale_x_continuous(
        labels=["Start", "25%", "50%", "75%", "End"], breaks=[0, 0.25, 0.5, 0.75, 1]
    )
    + theme_minimal()
    + theme(legend_position="none")
)

position_plot