[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/dbamman/anlp25/blob/main/8.clustering/Clustering_Sentence_Embeddings.ipynb)


This notebook explores using SentenceBERT to generate representations of sequences (sentences, documents) and then clustering those representations with K-means.

In [1]:
!pip install sentence-transformers



In [3]:
# Get movie summaries and book titles to cluster
!wget https://raw.githubusercontent.com/dbamman/anlp25/main/data/plot_summaries.txt
!wget https://raw.githubusercontent.com/dbamman/anlp25/main/data/loc/dev.tsv -O book_titles.txt

--2025-10-14 23:42:30--  https://raw.githubusercontent.com/dbamman/anlp25/main/data/plot_summaries.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 75934033 (72M) [text/plain]
Saving to: ‘plot_summaries.txt’


2025-10-14 23:42:31 (486 MB/s) - ‘plot_summaries.txt’ saved [75934033/75934033]

--2025-10-14 23:42:31--  https://raw.githubusercontent.com/dbamman/anlp25/main/data/loc/dev.tsv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 197482 (193K) [text/plain]
Saving to: ‘book_titles.txt’


2025-10-14 23:42:31 (85.9 MB/s) - ‘book_

In [2]:
from math import sqrt

import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans
from tqdm import tqdm


In [4]:
def read_data(filename):
    data = []
    with open(filename, encoding="utf-8") as file:
        for line in file:
            cols = line.rstrip().split("\t")
            idd = cols[0]
            summary = cols[1]
            data.append((idd, summary))
    return data

In [5]:
movies = read_data("plot_summaries.txt")
book_titles = read_data("book_titles.txt")

Load the sentence embedding model.

In [6]:
sentence_model = SentenceTransformer('sentence-transformers/all-distilroberta-v1')

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



Let's try embedding a sentence. What is the shape of the embedding?

In [7]:
embedding = sentence_model.encode("this is a sentence")
print(embedding.shape)

(768,)


In [8]:
def cosine_similarity(one, two):
    # Dot product divided by the product of the two vector lengths
    return np.dot(one, two) / (sqrt(np.dot(one, one)) * sqrt(np.dot(two, two)))
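
As a quick sanity check of the cosine similarity formula, the following sketch (not part of the original notebook) computes it with `np.linalg.norm`, which is equivalent to the `sqrt(np.dot(...))` form, on small hand-made vectors. The function name `cosine_sim` and the vectors are made up for illustration.

```python
import numpy as np


def cosine_sim(one, two):
    """Cosine similarity: dot product divided by the product of the vector norms."""
    return np.dot(one, two) / (np.linalg.norm(one) * np.linalg.norm(two))


# Toy vectors (illustrative only, not real sentence embeddings)
a = np.array([1.0, 0.0])
b = np.array([2.0, 0.0])   # same direction as a, different length
c = np.array([0.0, 3.0])   # orthogonal to a
d = np.array([-1.0, 0.0])  # opposite direction to a

print(cosine_sim(a, b))  # same direction: 1.0
print(cosine_sim(a, c))  # orthogonal: 0.0
print(cosine_sim(a, d))  # opposite direction: -1.0
```

Because the score depends only on direction, `a` and `b` count as identical even though their lengths differ.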

In [9]:
def get_embeddings(data, model):
    X = []

    # Get sentence embeddings for each doc
    for idx, doc in tqdm(data):
        embedding = model.encode(doc)
        X.append(embedding)

    return np.array(X)
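
Encoding one document per call is simple but can be slow on larger collections. A possible alternative, sketched below, passes the whole list of texts to `encode` in a single call and lets the library batch them. `get_embeddings_batched` is a hypothetical helper name and `batch_size=32` is an arbitrary choice. The sketch assumes `model` is a `SentenceTransformer`, or any object whose `encode` accepts a list of strings plus `batch_size` and `show_progress_bar` arguments and returns one vector per input.

```python
import numpy as np


def get_embeddings_batched(data, model, batch_size=32):
    """Encode all documents in one call instead of one call per document.

    data: list of (id, text) tuples, as returned by read_data.
    model: assumed to expose encode(list_of_str, batch_size=..., show_progress_bar=...).
    """
    docs = [doc for _, doc in data]
    embeddings = model.encode(docs, batch_size=batch_size, show_progress_bar=True)
    # Convert to an array; expected shape is (num_docs, embedding_dim)
    return np.asarray(embeddings)
```

In this notebook it could be called as `get_embeddings_batched(movies[:100], sentence_model)`, and the result should have the same shape as the output of `get_embeddings`.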

In [10]:
def run_all(data, model, num_clusters=10):

    embeddings = get_embeddings(data, model)

    # Run K-means
    kmeans = KMeans(n_clusters=num_clusters, random_state=0).fit(embeddings)

    # Group documents by their assigned cluster label
    clusters = {}
    for idx, label in enumerate(kmeans.labels_):
        if label not in clusters:
            clusters[label] = []
        clusters[label].append((idx, data[idx][1]))

    # For each cluster, print the 5 documents closest to the cluster center
    for label in clusters:
        sims = {}
        cluster_center = kmeans.cluster_centers_[label]
        for idx, doc in clusters[label]:
            sim = cosine_similarity(cluster_center, embeddings[idx])
            sims[idx] = sim
        for k, v in sorted(sims.items(), key=lambda item: item[1], reverse=True)[:5]:
            print(k,"%.3f" % v, data[k][1])

        print()
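
The ranking step inside `run_all` can also be tried in isolation. The sketch below uses made-up 2-D vectors, labels, and a centroid in place of real embeddings and K-means output, and `top_n_closest` is a hypothetical helper, not part of the notebook. It ranks the members of one cluster by cosine similarity to that cluster's center.

```python
import numpy as np


def top_n_closest(embeddings, labels, center, label, n=5):
    """Return (index, similarity) pairs for the n members of cluster `label`
    with the highest cosine similarity to `center`."""
    member_idx = np.where(labels == label)[0]
    members = embeddings[member_idx]
    # Cosine similarity of every member to the center, computed in one shot
    sims = members @ center / (np.linalg.norm(members, axis=1) * np.linalg.norm(center))
    order = np.argsort(-sims)[:n]  # positions of the largest similarities first
    return [(int(member_idx[i]), float(sims[i])) for i in order]


# Toy data (illustrative only): four points, two clusters
embeddings = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 1.0], [3.0, 0.1]])
labels = np.array([0, 0, 1, 0])
center_0 = np.array([1.0, 0.0])  # pretend centroid of cluster 0

for idx, sim in top_n_closest(embeddings, labels, center_0, label=0, n=2):
    print(idx, "%.3f" % sim)
```

Note that K-means assigns points by Euclidean distance, while this ranking uses cosine similarity, so the top-ranked documents are the ones that point in the same direction as the center rather than necessarily the ones nearest to it in Euclidean terms.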


# Book titles

In [13]:
run_all(book_titles[:1000], sentence_model, num_clusters=5)

100%|██████████| 1000/1000 [00:06<00:00, 152.78it/s]

535 0.659 The history and power of writing / None
504 0.610 Information science / None
625 0.607 Historic Newfoundland, None
216 0.606 The language laboratory, None
312 0.604 English traditional customs / None

457 0.575 Seize the day : seven steps to achieving the extraordinary in an ordinary world /
87 0.533 Environmental, health & safety manager's handbook / None
334 0.528 Smoke : a global history of smoking /
831 0.479 Vest-pocket handbook of engineering. None
85 0.472 Essential public health : theory and practice /

135 0.641 Commentary with case-law on the Central civil services (classification, control, and appeal) rules, 1965, and allied service matters, None
177 0.626 Public hearing before Assembly Judiciary, Law, and Public Safety Committee--Assembly and Senate bill nos. A-16/S-1211, A-1191, A-1341, A-1413, A-2466, A-2467, A-2492, A-2514, A-2957, and S-1208 (legislation dealing with the Right to Die) : November 15, 1990, Room 418, State House Annex, Trenton, New Jersey /
107 




# Movie summaries

In [12]:
run_all(movies[:100], sentence_model, num_clusters=10)

100%|██████████| 100/100 [00:01<00:00, 68.33it/s]

96 0.872 The film centres on a group of classmates who attendeded the same Chemistry class in their final year of college. Among them, Murali ([[Narain  is a singer and P. Sukumaran  is a firebrand leader of the left-winged students union. Sukumaran's rival, Satheesan , leads the opposite faction of student politics and is aided by his sidekick, Vasu . Pius , a rich and spoiled brat of parents settled in the Gulf, is the campus Romeo and Sukumaran's best friend. Thara Kurup  is the daughter of the Member of the state Legislative Assembly  from Muvattupuzha; she is a danseuse who regularly wins awards for the college with her performances. It was Murali's dream to have a ten-year reunion, but he mysteriously dies before the reunion can be organised. His parents, Professor Iyer  and Lakshmi Teacher , both teachers at the college, decide to fulfill their departed son's dream and bring his classmates together for a reunion. Sukumaran is now a diamond dealer based in Mumbai and is a divorce




**Q1**: Play around with this method, varying the number of movies clustered and the number of clusters. How would you rate the coherence and interpretability of these clusters? Try to label some of the clusters, and discuss their overall coherence with your neighbors.
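
One way to complement manual inspection when varying the number of clusters is to compute a numeric score such as the silhouette coefficient. The sketch below is an illustration under stated assumptions, not part of the exercise: it uses synthetic blobs from `make_blobs` in place of sentence embeddings, and the range of `k` values is arbitrary. With real data, `X` would be the array returned by `get_embeddings`.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for an (n_docs, dim) embedding matrix: 3 blobs
X, _ = make_blobs(n_samples=150, centers=3, cluster_std=0.5, random_state=0)

scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    # Silhouette ranges from -1 to 1; higher means tighter, better-separated clusters
    scores[k] = silhouette_score(X, labels)
    print(k, "%.3f" % scores[k])
```

A high score only says the clusters are geometrically compact and well separated. It does not say they are interpretable, so reading the top documents in each cluster is still needed.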