<a href="https://colab.research.google.com/github/Arjun9271/Hands_on_llms/blob/main/chapter_6.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# From text clustering to topic modelling

- Embedding Generation
- Dimensionality Reduction
- Clustering
- TF-IDF Computation




In [1]:
pip install bertopic datasets


## Dataset Loading

In [2]:
# Load data from huggingface
from datasets import load_dataset
dataset = load_dataset("maartengr/arxiv_nlp")["train"]

# Extract metadata
abstracts = dataset["Abstracts"]
titles = dataset["Titles"]


## Embedding Generation:

In [3]:
from sentence_transformers import SentenceTransformer

# Create an embedding for each abstract
embedding_model = SentenceTransformer('thenlper/gte-small')
embeddings = embedding_model.encode(abstracts, show_progress_bar=True)


## Dimensionality Reduction:

In [4]:
from umap import UMAP

# We reduce the input embeddings from 384 dimensions to 5 dimensions
umap_model = UMAP(
    n_components=5, min_dist=0.0, metric='cosine', random_state=42
)

reduced_embeddings = umap_model.fit_transform(embeddings)



## Clustering:

In [5]:
from hdbscan import HDBSCAN

# We fit the model and extract the clusters
hdbscan_model = HDBSCAN(
    min_cluster_size=50, metric='euclidean', cluster_selection_method='eom'
).fit(reduced_embeddings)
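Before handing the clusters to BERTopic, it can help to check how many clusters were found and how many documents were left as outliers. The sketch below assumes the labels are a sequence of integers in which `-1` marks an outlier, which is how HDBSCAN exposes them through `labels_`. The toy `labels` list and the helper name `summarize_labels` are made up for illustration.

```python
from collections import Counter

# Hypothetical labels for ten documents. In the notebook these would come
# from hdbscan_model.labels_ (one integer per document, -1 = outlier).
labels = [0, 0, 1, -1, 1, 0, -1, 2, 2, 0]


def summarize_labels(labels):
    """Return (number of clusters, number of outliers, size of each cluster)."""
    counts = Counter(int(label) for label in labels)
    n_outliers = counts.pop(-1, 0)  # remove the outlier bucket, if present
    return len(counts), n_outliers, dict(counts)


n_clusters, n_outliers, sizes = summarize_labels(labels)
print(f"{n_clusters} clusters, {n_outliers} outliers")
print(sizes)
```

A large outlier count is expected with HDBSCAN, since it does not force every point into a cluster.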



## BERTopic Integration:

In [6]:
from bertopic import BERTopic

# Train our model with our previously defined models
topic_model = BERTopic(
    embedding_model=embedding_model,
    umap_model=umap_model,
    hdbscan_model=hdbscan_model,
    verbose=True
).fit(abstracts, embeddings)

2025-01-27 12:37:26,626 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2025-01-27 12:38:32,422 - BERTopic - Dimensionality - Completed ✓
2025-01-27 12:38:32,427 - BERTopic - Cluster - Start clustering the reduced embeddings
2025-01-27 12:38:36,154 - BERTopic - Cluster - Completed ✓
2025-01-27 12:38:36,178 - BERTopic - Representation - Extracting topics from clusters using representation models.
2025-01-27 12:38:40,334 - BERTopic - Representation - Completed ✓


Now, let's start exploring the topics that we got by running the code above.

In [7]:
topic_model.get_topic_info()

| | Topic | Count | Name | Representation | Representative_Docs |
|---|---|---|---|---|---|
| 0 | -1 | 14551 | -1_the_of_and_to | [the, of, and, to, in, we, language, for, that... | [ Knowledge-enhanced pre-trained models for l... |
| 1 | 0 | 2224 | 0_question_qa_questions_answer | [question, qa, questions, answer, answering, a... | [ In open question answering (QA), the answer... |
| 2 | 1 | 2050 | 1_speech_asr_recognition_end | [speech, asr, recognition, end, acoustic, spea... | [ Voice Assistants such as Alexa, Siri, and G... |
| 3 | 2 | 1520 | 2_medical_clinical_biomedical_patient | [medical, clinical, biomedical, patient, healt... | [ Biomedical Named Entity Recognition (NER) i... |
| 4 | 3 | 964 | 3_translation_nmt_machine_bleu | [translation, nmt, machine, bleu, neural, engl... | [ Neural Machine Translation (NMT) models ach... |
| ... | ... | ... | ... | ... | ... |
| 145 | 144 | 53 | 144_gans_gan_adversarial_generation | [gans, gan, adversarial, generation, generativ... | [ Text generation is of particular interest i... |
| 146 | 145 | 52 | 145_backdoor_attacks_attack_triggers | [backdoor, attacks, attack, triggers, poisoned... | [ The prompt-based learning paradigm, which b... |
| 147 | 146 | 51 | 146_counseling_mental_therapy_health | [counseling, mental, therapy, health, psychoth... | [ Mental health care poses an increasingly se... |
| 148 | 147 | 51 | 147_multimodal_modality_fusion_sentiment | [multimodal, modality, fusion, sentiment, moda... | [ Multimodal machine learning is a core resea... |


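With this configuration the model generated 148 topics (numbered 0 to 147), plus topic -1, which collects the documents that were not assigned to any cluster (outliers). To get the top 10 keywords per topic as well as their c-TF-IDF weights, we can use the `get_topic()` function: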

In [8]:
topic_model.get_topic(11)

[('image', 0.03405212783603451),
 ('visual', 0.02469688961457895),
 ('vision', 0.017452755178315595),
 ('multimodal', 0.015693725767634257),
 ('captioning', 0.015434693152222316),
 ('captions', 0.0142024492412433),
 ('images', 0.014154038162453258),
 ('modal', 0.01374044042790457),
 ('caption', 0.012018158613556193),
 ('language', 0.00868122228613332)]
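The weights above come from class-based TF-IDF (c-TF-IDF), which treats all documents in a cluster as a single document and scores each term by how frequent it is in that cluster relative to all clusters. The following is a minimal sketch of the formula described for BERTopic, W(t, c) = tf(t, c) · log(1 + A / f(t)), assuming plain whitespace tokenization and raw counts. BERTopic itself uses a vectorizer and normalization, so the numbers will not match its output. The function name `c_tf_idf` and the toy clusters are made up for illustration.

```python
import math
from collections import Counter


def c_tf_idf(docs_per_topic):
    """Compute c-TF-IDF weights for a dict {topic_id: [documents]}.

    W(t, c) = tf(t, c) * log(1 + A / f(t)), where tf(t, c) is the count of
    term t in class c, f(t) is its count across all classes, and A is the
    average number of words per class.
    """
    # Join each cluster into one "class document" and count its terms.
    term_counts = {
        topic: Counter(" ".join(docs).lower().split())
        for topic, docs in docs_per_topic.items()
    }
    # Count every term across all classes.
    total_counts = Counter()
    for counts in term_counts.values():
        total_counts.update(counts)
    avg_words = sum(total_counts.values()) / len(term_counts)

    return {
        topic: {
            term: tf * math.log(1 + avg_words / total_counts[term])
            for term, tf in counts.items()
        }
        for topic, counts in term_counts.items()
    }


# Hypothetical toy clusters, not taken from the arXiv dataset.
toy_clusters = {
    0: ["image captioning with visual features", "visual question image"],
    1: ["speech recognition with acoustic features", "speech models"],
}
weights = c_tf_idf(toy_clusters)
top_terms = sorted(weights[0].items(), key=lambda kv: kv[1], reverse=True)[:3]
print(top_terms)
```

Terms shared by both clusters, such as `with`, receive a lower weight than terms of the same frequency that occur in only one cluster.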

We can use the `find_topics()` function to search for topics based on a search term. Let’s search for a topic about topic modeling:

In [9]:
topic_model.find_topics("topic modeling")

([22, -1, 2, 45, 30],
 [0.95497566, 0.911328, 0.9087179, 0.90749735, 0.90506303])
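The two lists are topic IDs and their similarity scores. Conceptually, this kind of search embeds the search term and ranks the topics by the cosine similarity between that embedding and each topic's embedding. The sketch below illustrates the ranking step only. The three-dimensional vectors and the helper names `cosine_similarity` and `rank_topics` are made up for illustration and are not BERTopic internals.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_topics(query_vec, topic_vecs, top_n=5):
    """Return (topic_ids, similarities), sorted by decreasing similarity."""
    scored = sorted(
        ((cosine_similarity(query_vec, vec), topic)
         for topic, vec in topic_vecs.items()),
        reverse=True,
    )[:top_n]
    return [topic for _, topic in scored], [score for score, _ in scored]


# Hypothetical topic embeddings and query embedding, for illustration only.
topic_vecs = {22: [0.9, 0.1, 0.0], 2: [0.5, 0.5, 0.1], 45: [0.0, 0.2, 0.9]}
query_vec = [1.0, 0.1, 0.0]

topics, sims = rank_topics(query_vec, topic_vecs, top_n=2)
print(topics, sims)
```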

→ `find_topics()` returns topic 22 as the closest match, with a relatively high similarity (0.95) to our search term. If we then inspect the topic, we can see that it is indeed a topic about topic modeling:

In [10]:
topic_model.get_topic(22)

[('topic', 0.06945617465332493),
 ('topics', 0.036023179433915774),
 ('lda', 0.01661929568243725),
 ('latent', 0.013983559515945656),
 ('document', 0.012634641415039357),
 ('documents', 0.012620262980522618),
 ('modeling', 0.012530857445661283),
 ('dirichlet', 0.010302387745128845),
 ('word', 0.008892079799744539),
 ('allocation', 0.008109142901468164)]

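That seems like a topic that is, in part, characterized by the classic LDA technique. We can also check which topic an individual paper was assigned to by looking up its title. Here we look up "Attention Is All You Need" and then inspect the topic it was assigned to: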

In [11]:
topic_model.topics_[titles.index('Attention Is All You Need')]

3

In [12]:
topic_model.get_topic(3)

[('translation', 0.03429182232722864),
 ('nmt', 0.025346606350990838),
 ('machine', 0.015848741912224046),
 ('bleu', 0.010867090169219404),
 ('neural', 0.01054737228778969),
 ('english', 0.009838526409623948),
 ('parallel', 0.008734132824474031),
 ('resource', 0.00862154389742474),
 ('source', 0.008135500741401963),
 ('languages', 0.00796618912294559)]
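The title lookup above relies on `list.index`, which raises a `ValueError` when the title is not in the dataset (for example because of a different spelling or capitalization). A small wrapper can make this easier to handle. This is a sketch with toy data, and the helper name `topic_for_title` is hypothetical; in the notebook it would be called with `titles` and `topic_model.topics_`.

```python
def topic_for_title(title, titles, topics):
    """Return the topic of the first document with this exact title, or None."""
    try:
        return topics[titles.index(title)]
    except ValueError:
        # The title does not occur in the list.
        return None


# Hypothetical titles and topic assignments, for illustration only.
toy_titles = ["Paper A", "Paper B", "Paper C"]
toy_topics = [3, -1, 22]

print(topic_for_title("Paper C", toy_titles, toy_topics))
print(topic_for_title("Unknown Paper", toy_titles, toy_topics))
```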

### Outputs:

- **Clusters:** Keywords assigned to each cluster.
- **Search Feature:** Retrieve clusters associated with specific terms.

## Visualization

### Document-Level Visualization

- Provides a graphical representation of clusters and their corresponding keywords.
- **Examples:**
    - **Orange Cluster:** `Questions` (e.g., Q&A systems).
    - **Red Cluster:** `Medical` (e.g., clinical and biomedical research).
    - **Green Cluster:** `Visual` (e.g., image processing and multimodal tasks).

In [13]:
# Visualize topics and documents
fig = topic_model.visualize_documents(
    titles,
    reduced_embeddings=reduced_embeddings,
    width=1200,
    hide_annotations=True
)

# Update fonts of legend for easier visualization
fig.update_layout(font=dict(size=16))

### Hierarchical Clustering

- Illustrates inter-cluster relationships.
- **Example:**
    - Cluster 11 (`Image, Visual`) aligns closely with clusters 122 and 107, under the broader category of `Emotion, Recommendation`.

In [14]:
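# A notebook cell only displays the value of its last expression,
# so we call .show() on each figure to render all three.

# Visualize barchart with ranked keywords
topic_model.visualize_barchart().show()

# Visualize relationships between topics
topic_model.visualize_heatmap(n_clusters=30).show()

# Visualize the potential hierarchical structure of topics
topic_model.visualize_hierarchy().show()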