<a href="https://colab.research.google.com/github/SaibalPatraDS/Hands-on-LLM/blob/main/Text_Clustering_using_Sentence_Transformers.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Text Clustering using Sentence Transformers

---

What we have: **Documents**

---

What we want to do: **Build Text Clusters**

---

Steps:
1. Group semantically similar documents into clusters [`Topic Clustering`]
2. Name the resulting clusters [`Topic Modelling`]
   

### Text Clustering Pipeline

1. Create embeddings of the documents with an embedding model
2. Reduce the dimensionality of the embeddings with a `Dimensionality Reduction Technique` (PCA❌/UMAP✅)
3. Cluster the dimensionality-reduced embeddings with a clustering algorithm

In [1]:
## installing the necessary packages
# %%capture
!pip install bertopic datasets datamapplot


In [2]:
## loading the data
from datasets import load_dataset

## ArXiv Dataset
arxiv_data = load_dataset(
    "MaartenGr/arxiv_nlp"
)
arxiv_data
## extracting the titles and abstracts from the train split
titles = arxiv_data["train"]["Titles"]
abstracts = arxiv_data["train"]["Abstracts"]


In [3]:
## checking the data
titles[0], abstracts[0]

('Introduction to Arabic Speech Recognition Using CMUSphinx System',
 '  In this paper Arabic was investigated from the speech recognition problem\npoint of view. We propose a novel approach to build an Arabic Automated Speech\nRecognition System (ASR). This system is based on the open source CMU Sphinx-4,\nfrom the Carnegie Mellon University. CMU Sphinx is a large-vocabulary;\nspeaker-independent, continuous speech recognition system based on discrete\nHidden Markov Models (HMMs). We build a model using utilities from the\nOpenSource CMU Sphinx. We will demonstrate the possible adaptability of this\nsystem to Arabic voice recognition.\n')

## Common Pipeline for `Text Clustering`

### Creating Embeddings of the Data

In [5]:
## Embeddings of the Documents
from sentence_transformers import SentenceTransformer

## create Embeddings of the Text Data
embedding_model = SentenceTransformer("thenlper/gte-small")
embeddings = embedding_model.encode(
    abstracts,
    show_progress_bar=True
)


In [6]:
## checking the shape of Embeddings
embeddings.shape

(44949, 384)
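
Each abstract is now represented by a 384-dimensional vector, and abstracts on related topics should point in similar directions. The sketch below illustrates the idea with cosine similarity, the same metric passed to UMAP in this notebook. The three-dimensional vectors and their names are hypothetical stand-ins chosen for readability, not real model outputs.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy "embeddings" (the real ones have 384 dimensions)
doc_speech_1 = [0.9, 0.1, 0.0]
doc_speech_2 = [0.8, 0.2, 0.1]
doc_parsing = [0.0, 0.2, 0.9]

sim_related = cosine_similarity(doc_speech_1, doc_speech_2)
sim_unrelated = cosine_similarity(doc_speech_1, doc_parsing)
print(f"related: {sim_related:.3f}, unrelated: {sim_unrelated:.3f}")
```

With the real data, the same function could be applied to two rows of `embeddings`, for example `cosine_similarity(embeddings[0], embeddings[1])`, although a vectorised NumPy version would be faster.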

### Reducing the Dimensionality of Embeddings

In [9]:
## reducing the Dimensionality of the Embeddings
from umap import UMAP

## defining the umap model
umap_model = UMAP(
    n_components=5,
    min_dist=0.0,
    metric='cosine',
    random_state=42
)
## fitting the model on the embeddings and projecting them to 5 dimensions
reduced_embeddings = umap_model.fit_transform(embeddings)


In [10]:
## checking the shape of the Reduced Embeddings
reduced_embeddings.shape, embeddings.shape

((44949, 5), (44949, 384))

### Clustering the Embeddings

In [14]:
## clustering the reduced embeddings
from hdbscan import HDBSCAN

## defining the model and fitting it on the reduced embeddings
model_hdbscan = HDBSCAN(
    min_cluster_size=50,
    metric='euclidean',
    cluster_selection_method='eom'
).fit(reduced_embeddings)

## cluster labels (HDBSCAN assigns -1 to points it treats as noise)
cluster_labels = model_hdbscan.labels_
## number of distinct labels (includes the noise label -1 if any outliers were found)
len(set(cluster_labels))

150
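
HDBSCAN assigns the label `-1` to points it treats as noise, so `len(set(cluster_labels))` counts that label as well whenever outliers exist. The sketch below separates the two counts and lists the largest clusters. `toy_labels` is a hypothetical stand-in for `cluster_labels`.

```python
from collections import Counter

# Hypothetical label array; in the notebook this would be cluster_labels.tolist()
toy_labels = [0, 0, 1, -1, 1, 1, -1, 2, 0, 1]

label_counts = Counter(toy_labels)

# Number of points labelled as noise (-1)
n_noise = label_counts.get(-1, 0)

# Number of real clusters, excluding the noise label if present
n_clusters = len(label_counts) - (1 if -1 in label_counts else 0)

# The two largest real clusters as (label, size) pairs
largest = [(label, size) for label, size in label_counts.most_common() if label != -1][:2]

print(f"clusters: {n_clusters}, noise points: {n_noise}, largest: {largest}")
```

The share of noise points is worth checking alongside the cluster count, since a large `min_cluster_size` can push many documents into the `-1` group.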

### Inspecting the Clusters

In [22]:
## Manually Inspecting the Clusters
import numpy as np

## Inspecting one of the Clusters
# cluster = 0 ## Sign Language
# cluster = 100  ## Semantic Parsing
cluster = 19 ## Formality of Writing
for index in np.where(cluster_labels==cluster)[0][:3]:
  print(abstracts[index][:300] + "..." + "\n")


  Formality is one of the most important dimensions of writing style variation.
In this study we conducted an inter-rater reliability experiment for assessing
sentence formality on a five-point Likert scale, and obtained good agreement
results as well as different rating distributions for different ...

  This paper focuses on style transfer on the basis of non-parallel text. This
is an instance of a broad family of problems including machine translation,
decipherment, and sentiment modification. The key challenge is to separate the
content from other aspects such as style. We assume a shared laten...

  This paper presents a Semantic Attribute Modulation (SAM) for language
modeling and style variation. The semantic attribute modulation includes
various document attributes, such as titles, authors, and document categories.
We consider two types of attributes, (title attributes and category
attribu...



## Conclusion:

The pipeline has three steps:
1. Creating embeddings of the text data
2. Reducing the dimensionality of the embeddings
3. Clustering the reduced embeddings

Manual inspection suggests that the clusters are coherent: the abstracts sampled from a cluster share a common topic.

For example, cluster 0 is about `Sign Language`, and cluster 100 is about `Semantic Parsing`.
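
The topic names quoted in this notebook were chosen by reading a few abstracts per cluster. As a rough automated first pass at the naming step, the sketch below lists the most frequent non-stopword tokens per cluster. This is plain word counting, not the c-TF-IDF weighting that BERTopic uses, and `top_keywords`, `STOPWORDS`, `toy_docs` and `toy_labels` are hypothetical names and data introduced only for illustration.

```python
import re
from collections import Counter, defaultdict

# Small illustrative stopword list (an assumption; real lists are much longer)
STOPWORDS = {"a", "an", "and", "for", "in", "is", "of", "on", "the", "this", "to", "we", "with"}

def top_keywords(docs, labels, n_words=3):
    """Return the n_words most frequent non-stopword tokens per cluster, skipping noise (-1)."""
    counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        if label == -1:
            continue  # noise points do not belong to any cluster
        tokens = re.findall(r"[a-z]+", doc.lower())
        counts[label].update(t for t in tokens if t not in STOPWORDS)
    return {label: [word for word, _ in c.most_common(n_words)] for label, c in counts.items()}

# Hypothetical toy documents and labels standing in for abstracts and cluster_labels
toy_docs = [
    "Sign language recognition with neural networks",
    "Translation of sign language video",
    "Semantic parsing for question answering",
    "Neural semantic parsing of queries",
    "An unrelated outlier document",
]
toy_labels = [0, 0, 1, 1, -1]

keywords = top_keywords(toy_docs, toy_labels, n_words=2)
print(keywords)
```

On the real data the call would be `top_keywords(abstracts, cluster_labels.tolist())`. Raw frequencies tend to surface words common to every cluster (such as "paper" or "model"), which is the weakness that per-cluster weighting schemes are meant to address.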