<a href="https://colab.research.google.com/github/SaibalPatraDS/Hands-on-LLM/blob/main/Topic_Modelling_using_BERTopic.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Topic Modelling using `BERTopic`

In [None]:
!pip install bertopic datasets

### Steps to Follow:

1. Embeddings
2. Dimensionality Reduction
3. Clustering
4. CountVectorizer
5. c-TF-IDF
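
Steps 4 and 5 never appear as separate cells in this notebook, because BERTopic runs them internally once the clusters exist. As a rough illustration only, here is a minimal pure-Python sketch of the idea: merge each cluster into one "class document", count its terms, then weight each term by its normalized frequency times `log(1 + A / f_t)`, where `A` is the average word count per class and `f_t` is the term's frequency across all classes. The toy `clusters` data and the helper name `c_tf_idf` are hypothetical, and BERTopic's own implementation may differ in its details.

```python
import math
from collections import Counter

# Hypothetical toy clusters: each list holds the documents assigned to one topic.
clusters = {
    0: ["speech recognition with acoustic models", "end to end speech recognition"],
    1: ["question answering over text", "answering questions with retrieval"],
}

# Step 4 (bag of words): merge each cluster into one class document and count terms.
class_counts = {
    topic: Counter(" ".join(docs).lower().split())
    for topic, docs in clusters.items()
}

# Term frequency across all classes, and the average number of words per class.
total_counts = Counter()
for counts in class_counts.values():
    total_counts.update(counts)
avg_words_per_class = sum(total_counts.values()) / len(class_counts)


def c_tf_idf(topic):
    """Step 5: score each term as normalized tf * log(1 + A / f_t)."""
    counts = class_counts[topic]
    n_words = sum(counts.values())
    return {
        term: (freq / n_words) * math.log(1 + avg_words_per_class / total_counts[term])
        for term, freq in counts.items()
    }


scores_topic_0 = c_tf_idf(0)
top_terms = sorted(scores_topic_0.items(), key=lambda kv: kv[1], reverse=True)[:3]
print(top_terms)
```

Terms that appear in several clusters, such as "with" here, are down-weighted relative to terms concentrated in a single cluster, which is why cluster-specific keywords rise to the top of a topic's representation.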

In [None]:
## Load the arXiv NLP abstracts dataset
from datasets import load_dataset

arxiv_nlp = load_dataset(
    "MaartenGr/arxiv_nlp"
)
## Inspect the dataset structure
arxiv_nlp

DatasetDict({
    train: Dataset({
        features: ['Titles', 'Abstracts', 'Years', 'Categories'],
        num_rows: 44949
    })
})

In [None]:
## Keep only the columns we need: abstracts for modelling, titles as metadata
abstracts = arxiv_nlp['train']['Abstracts']
titles = arxiv_nlp['train']['Titles']

### 1. Embeddings

In [None]:
## Load the sentence-transformers embedding model
from sentence_transformers import SentenceTransformer

model = SentenceTransformer(
    "thenlper/gte-small"
)
## Create one embedding per abstract
embeddings = model.encode(abstracts, show_progress_bar=True)

### 2. Dimensionality Reduction

In [None]:
## Load the dimensionality reduction model
from umap import UMAP

## Configure UMAP to project the embeddings into a 5-dimensional space
umap = UMAP(
    n_components=5,
    min_dist=0.0,
    metric='cosine',
    random_state=42
)
## Fit the model and reduce the embeddings
umap_embeddings = umap.fit_transform(embeddings)
## Compare the shapes before and after reduction
embeddings.shape, umap_embeddings.shape

((44949, 384), (44949, 5))

### 3. Clustering

In [None]:
## Load the clustering model
from hdbscan import HDBSCAN
## Fit HDBSCAN on the reduced embeddings
hdbscan_model = HDBSCAN(
    min_cluster_size=50,
    metric='euclidean',
    cluster_selection_method='eom'
).fit(umap_embeddings)

### 4. BERTopic Modelling

In [None]:
from bertopic import BERTopic

## Train BERTopic with the previously built models and precomputed embeddings
bertopic_model = BERTopic(
    embedding_model=model,
    umap_model=umap,
    hdbscan_model=hdbscan_model,
    verbose=True
).fit(abstracts, embeddings)

2024-12-11 18:10:44,975 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2024-12-11 18:11:47,896 - BERTopic - Dimensionality - Completed ✓
2024-12-11 18:11:47,899 - BERTopic - Cluster - Start clustering the reduced embeddings
2024-12-11 18:11:53,654 - BERTopic - Cluster - Completed ✓
2024-12-11 18:11:53,676 - BERTopic - Representation - Extracting topics from clusters using representation models.
2024-12-11 18:12:00,555 - BERTopic - Representation - Completed ✓


### Exploring the Topics Generated by `BERTopic`

In [None]:
## getting the topic info
bertopic_model.get_topic_info()

|  | Topic | Count | Name | Representation | Representative_Docs |
|---|---|---|---|---|---|
| 0 | -1 | 14462 | -1_the_of_and_to | [the, of, and, to, in, we, for, that, language... | [ Cross-lingual text classification aims at t... |
| 1 | 0 | 2241 | 0_question_questions_qa_answer | [question, questions, qa, answer, answering, a... | [ Question generation (QG) attempts to solve ... |
| 2 | 1 | 2098 | 1_speech_asr_recognition_end | [speech, asr, recognition, end, acoustic, audi... | [ End-to-end models have achieved impressive ... |
| 3 | 2 | 903 | 2_image_visual_multimodal_images | [image, visual, multimodal, images, vision, mo... | [ In this paper we propose a model to learn m... |
| 4 | 3 | 887 | 3_summarization_summaries_summary_abstractive | [summarization, summaries, summary, abstractiv... | [ We present a novel divide-and-conquer metho... |
| ... | ... | ... | ... | ... | ... |
| 148 | 147 | 54 | 147_counseling_mental_therapy_health | [counseling, mental, therapy, health, psychoth... | [ Mental health care poses an increasingly se... |
| 149 | 148 | 53 | 148_chatgpt_its_openai_has | [chatgpt, its, openai, has, it, tasks, capabil... | [ Over the last few years, large language mod... |
| 150 | 149 | 52 | 149_mixed_code_sentiment_mixing | [mixed, code, sentiment, mixing, english, anal... | [ In today's interconnected and multilingual ... |
| 151 | 150 | 51 | 150_diffusion_generation_autoregressive_text | [diffusion, generation, autoregressive, text, ... | [ Diffusion models have achieved great succes... |
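
Topic `-1` is the outlier group: documents that HDBSCAN did not assign to any cluster. Its keywords are mostly stop words because it mixes unrelated abstracts. A quick back-of-the-envelope check, using only the counts printed above, shows how large that group is relative to the whole dataset. The variable names here are illustrative.

```python
# Counts copied from the outputs above.
n_documents = 44949  # rows in the train split
n_outliers = 14462   # documents assigned to topic -1

# Share of abstracts that were not placed in any topic.
outlier_share = n_outliers / n_documents
print(f"{outlier_share:.1%} of abstracts were left unassigned")
```

Roughly a third of the abstracts end up as outliers with these settings, which is worth keeping in mind when interpreting the topic counts.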


In [None]:
## get the top 10 keywords (with their c-TF-IDF weights) for a topic using get_topic()
bertopic_model.get_topic(1)

[('speech', 0.029008261668635314),
 ('asr', 0.01953174996310042),
 ('recognition', 0.013885176883441992),
 ('end', 0.010564109790927942),
 ('acoustic', 0.00981251409399381),
 ('audio', 0.006891787503395377),
 ('speaker', 0.006848250444683579),
 ('error', 0.006612295041159941),
 ('wer', 0.006588686454780454),
 ('the', 0.006447719139235802)]

In [None]:
## use find_topics() to search for the topics most similar to a query term
bertopic_model.find_topics("Speech Recognition")

([1, 127, 109, 144, 119],
 [0.92427385, 0.891863, 0.8895864, 0.8864057, 0.8859229])
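
The two lists returned above are topic ids and their similarity scores, ordered from best to worst match. Conceptually, the query is embedded with the same embedding model and compared against one vector per topic. The following is a minimal sketch of that ranking step using cosine similarity; the 3-dimensional vectors, the `topic_embeddings` dictionary, and the helper `cosine_similarity` are made up for illustration and are not BERTopic's internals.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical topic vectors keyed by topic id (real embeddings have 384 dimensions here).
topic_embeddings = {
    1: [0.9, 0.1, 0.0],
    109: [0.7, 0.3, 0.1],
    3: [0.0, 0.2, 0.9],
}
# Hypothetical embedding of the search query.
query_embedding = [1.0, 0.1, 0.0]

# Score every topic against the query and sort from most to least similar.
ranked = sorted(
    ((topic, cosine_similarity(query_embedding, vector))
     for topic, vector in topic_embeddings.items()),
    key=lambda pair: pair[1],
    reverse=True,
)
topics = [topic for topic, _ in ranked]
scores = [score for _, score in ranked]
print(topics, scores)
```

Because the scores measure closeness in embedding space, several related topics can score almost as high as the best match, as the narrow spread in the real output suggests.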

In [None]:
## cross-check the best match (topic 1) by inspecting its keywords
bertopic_model.get_topic(1)

[('speech', 0.029008261668635314),
 ('asr', 0.01953174996310042),
 ('recognition', 0.013885176883441992),
 ('end', 0.010564109790927942),
 ('acoustic', 0.00981251409399381),
 ('audio', 0.006891787503395377),
 ('speaker', 0.006848250444683579),
 ('error', 0.006612295041159941),
 ('wer', 0.006588686454780454),
 ('the', 0.006447719139235802)]

In [None]:
## Look up which topic a specific paper was assigned to, using its title
## (a result other than 1 means the paper sits in a different, related topic)
bertopic_model.topics_[titles.index("Arabic Speech Recognition System using CMU-Sphinx4")]

109

### Visualization of the Documents Representation

In [None]:
## Visualize the topics and Documents
fig = bertopic_model.visualize_documents(
    titles,
    reduced_embeddings = umap_embeddings,
    width = 1200,
    hide_annotations = True
)

## update fonts of legend for easier visualization
fig.update_layout(font = dict(size=16))

### Hierarchical Clustering

In [None]:
## Visualize bar charts with ranked keywords
## (.show() is called explicitly because a notebook cell only displays its last expression)
bertopic_model.visualize_barchart().show()

## Visualize relationships between topics
bertopic_model.visualize_heatmap(n_clusters=50).show()

## Visualize the potential hierarchical structure of topics
bertopic_model.visualize_hierarchy()