
# **Tutorial** - Topic Modeling with BERTopic
(last updated 01-09-2022)

In this tutorial we will explore how to use BERTopic to create topics from a collection of cleaned disaster-related tweets. The most frequent use cases and methods are discussed, together with important parameters to keep an eye on.


## BERTopic
BERTopic is a topic modeling technique that leverages 🤗 transformers and a custom class-based TF-IDF to create dense clusters allowing for easily interpretable topics whilst keeping important words in the topic descriptions.

<br>

<img src="https://raw.githubusercontent.com/MaartenGr/BERTopic/master/images/logo.png" width="40%">
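To make the "class-based TF-IDF" idea more concrete, here is a minimal sketch that treats all documents of a cluster as one bag of words and scores each term per cluster. The weighting used, term frequency times `log(1 + A / f_t)` with `A` the average number of words per class and `f_t` the term's frequency across all classes, reflects our understanding of c-TF-IDF and is an assumption here. The helper `class_tfidf` and the toy documents are hypothetical, and BERTopic's own implementation differs in tokenization and normalization.

```python
import math
from collections import Counter


def class_tfidf(docs_per_class):
    """Toy c-TF-IDF: returns {class_label: {term: score}}."""
    # Treat every class as a single bag of words.
    counts = {
        label: Counter(" ".join(texts).lower().split())
        for label, texts in docs_per_class.items()
    }

    # Frequency of each term across all classes.
    total = Counter()
    for class_counts in counts.values():
        total.update(class_counts)

    # Average number of words per class (the "A" in the assumed formula).
    avg_words = sum(total.values()) / len(counts)

    scores = {}
    for label, class_counts in counts.items():
        n_words = sum(class_counts.values())
        scores[label] = {
            term: (tf / n_words) * math.log(1 + avg_words / total[term])
            for term, tf in class_counts.items()
        }
    return scores


# Hypothetical toy clusters, not taken from the dataset.
toy = {
    "flood": [
        "heavy rain causes flash flooding",
        "flooding in myanmar after heavy rain",
        "more flooding expected",
    ],
    "fire": [
        "forest fire near the town",
        "fire in the woods after heavy wind",
    ],
}

scores = class_tfidf(toy)
top_flood = max(scores["flood"], key=scores["flood"].get)
print(top_flood)
```

In this toy data, "heavy" and "rain" occur equally often in the flood cluster, but "heavy" also appears in the fire cluster and therefore receives a lower score. That down-weighting of shared words is what keeps topic descriptions distinctive.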

In [None]:
%%capture
!pip install bertopic

In [None]:
import numpy as np
import pandas as pd
from IPython.display import display
from tqdm import tqdm
from collections import Counter
import ast

import matplotlib.pyplot as plt
import matplotlib.mlab as mlab
import seaborn as sb

from sklearn.feature_extraction.text import CountVectorizer
from textblob import TextBlob
import scipy.stats as stats


from bokeh.plotting import figure, output_file, show
from bokeh.models import Label
from bokeh.io import output_notebook
output_notebook()

%matplotlib inline

In [None]:
docs = pd.read_csv('cleaned_disaster_tweets.csv', encoding='latin1', parse_dates=[0], infer_datetime_format=True)




In [None]:
docs=docs['text'].tolist()

In [None]:
docs

[' happened a terrible car crash',
 'our deeds are the reason of this  may allah forgive us all',
 'heard about  is different cities, stay safe everyone.',
 'there is a forest fire at spot pond, geese are fleeing across the street, i cannot save them all',
 'forest fire near la ronge sask. canada',
 "all residents asked to 'shelter in place' are being notified by officers. no other evacuation or shelter in place orders are expected",
 '13,000 people receive  evacuation orders in california ',
 ' got sent this photo from ruby  as smoke from  pours into a school ',
 ' update => california hwy. 20 closed in both directions due to lake county fire -  ',
 'apocalypse lighting.  ',
 '  heavy rain causes flash flooding of streets in manitou, colorado springs areas',
 'typhoon soudelor kills 28 in china and taiwan',
 "we're shaking...it's an earthquake",
 "i'm on top of the hill and i can see a fire in the woods...",
 "there's an emergency evacuation happening now in the building across the st

# **Topic Modeling**

In this example, we will go through the main components of BERTopic and the steps necessary to create a strong topic model.




## Training

We start by instantiating BERTopic. We set language to `english` since our documents are in the English language. If you would like to use a multi-lingual model, please use `language="multilingual"` instead.

We will also calculate the topic probabilities. However, this can slow down BERTopic significantly with large amounts of data (>100_000 documents). It is advised to turn this off if you want to speed up the model.


In [None]:
from bertopic import BERTopic

topic_model = BERTopic(language="english", calculate_probabilities=True, verbose=True)
topics, probs = topic_model.fit_transform(docs)

2024-05-16 04:03:55,160 - BERTopic - Embedding - Transforming documents to embeddings.



2024-05-16 04:04:11,111 - BERTopic - Embedding - Completed ✓
2024-05-16 04:04:11,112 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2024-05-16 04:04:47,451 - BERTopic - Dimensionality - Completed ✓
2024-05-16 04:04:47,453 - BERTopic - Cluster - Start clustering the reduced embeddings
2024-05-16 04:05:24,934 - BERTopic - Cluster - Completed ✓
2024-05-16 04:05:24,949 - BERTopic - Representation - Extracting topics from clusters using representation models.
2024-05-16 04:05:25,379 - BERTopic - Representation - Completed ✓


**NOTE**: Use `language="multilingual"` to select a model that supports 50+ languages.

## Extracting Topics
After fitting our model, we can start by looking at the results. Typically, we look at the most frequent topics first as they best represent the collection of documents.

In [None]:
freq = topic_model.get_topic_info(); freq.head(5)

| | Topic | Count | Name | Representation | Representative_Docs |
|---|---|---|---|---|---|
| 0 | -1 | 3204 | -1_the_and_to_my | [the, and, to, my, is, you, was, it, in, of] | [the prophet (peace be upon him) said 'save yo... |
| 1 | 0 | 124 | 0_flooding_floods_flood_myanmar | [flooding, floods, flood, myanmar, rains, flas... | [donate to help myanmar flooding victims , flo... |
| 2 | 1 | 103 | 1_electrocuted_electrocute_charger_myself | [electrocuted, electrocute, charger, myself, h... | [i hope i get electrocuted today at work, woma... |
| 3 | 2 | 93 | 2_mosque_saudi_suicide_bomber | [mosque, saudi, suicide, bomber, kills, 15, se... | [ suicide bomber kills 15 in saudi security si... |
| 4 | 3 | 93 | 3_screaming_screamed_screams_im | [screaming, screamed, screams, im, externally,... | [\nim screaming, im screaming, im screaming] |


-1 refers to all outliers and should typically be ignored. Next, let's take a look at a frequent topic that was generated:

In [None]:
topic_model.get_topic(0)  # Select the most frequent topic

[('flooding', 0.052752460922185065),
 ('floods', 0.052320550903539),
 ('flood', 0.03906399100495942),
 ('myanmar', 0.03847370572495518),
 ('rains', 0.023848595194023797),
 ('flash', 0.021739113152640138),
 ('heavy', 0.018212810821029376),
 ('monsoon', 0.01752985826260455),
 ('relief', 0.016694016635816655),
 ('county', 0.014129698874133205)]

**NOTE**: BERTopic is stochastic, which means that the topics might differ across runs. This is mostly due to the stochastic nature of UMAP.
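If reproducible topics matter, one option is to hand BERTopic a UMAP instance with a fixed `random_state`. The sketch below assumes the `umap-learn` package that is installed alongside BERTopic. The helper name `build_reproducible_model` and the seed value are hypothetical, and the UMAP settings mirror what we understand to be BERTopic's defaults, so check them against your installed version.

```python
def build_reproducible_model(seed=42):
    """Return a BERTopic model whose UMAP step uses a fixed random seed."""
    # Imported here so that defining the helper does not load the heavy packages.
    from bertopic import BERTopic
    from umap import UMAP

    # Assumed to match BERTopic's default UMAP settings, plus a fixed seed.
    umap_model = UMAP(
        n_neighbors=15,
        n_components=5,
        min_dist=0.0,
        metric="cosine",
        random_state=seed,
    )
    return BERTopic(
        umap_model=umap_model,
        language="english",
        calculate_probabilities=True,
        verbose=True,
    )


# Usage (not executed here):
# topic_model = build_reproducible_model()
# topics, probs = topic_model.fit_transform(docs)
```

Fixing the seed trades speed for repeatability, since UMAP runs single-threaded when `random_state` is set.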

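### Attributes
After fitting, the model exposes several attributes. For example, `topics_` holds the topic assigned to each document: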

In [None]:
topic_model.topics_[:10]

[-1, 156, 79, -1, -1, 137, -1, 31, 40, 28]
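A quick way to get a feel for these assignments is to count how many documents ended up in the outlier topic. The snippet below is a small sketch that works on the ten values printed above. In practice you would pass `topic_model.topics_` instead of the hard-coded `sample_topics` list, which is only there to keep the example self-contained.

```python
from collections import Counter

# First ten topic assignments, copied from the output above.
sample_topics = [-1, 156, 79, -1, -1, 137, -1, 31, 40, 28]

counts = Counter(sample_topics)
outlier_share = counts[-1] / len(sample_topics)

# Keep only documents that were assigned to a real topic.
assigned = [t for t in sample_topics if t != -1]

print(f"outliers: {counts[-1]} of {len(sample_topics)} ({outlier_share:.0%})")
# prints: outliers: 4 of 10 (40%)
```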

## Visualize Topics
After having trained our `BERTopic` model, we can iteratively go through perhaps a hundred topics to get a good
understanding of the topics that were extracted. However, that takes quite some time and lacks a global representation.
Instead, we can visualize the topics that were generated in a way very similar to
[LDAvis](https://github.com/cpsievert/LDAvis):

In [None]:
topic_model.visualize_topics()

## Visualize Topic Hierarchy

The topics that were created can be hierarchically reduced. To understand the potential hierarchical structure of the topics, we can use `scipy.cluster.hierarchy` to create clusters and visualize how they relate to one another. This can help with selecting an appropriate `nr_topics` when reducing the number of topics that you have created.

In [None]:
topic_model.visualize_hierarchy(top_n_topics=50)

## Visualize Terms

We can visualize the selected terms for a few topics by creating bar charts out of the c-TF-IDF scores for each topic representation. Insights can be gained from the relative c-TF-IDF scores between and within topics. Moreover, you can easily compare topic representations to each other.

In [None]:
topic_model.visualize_barchart(top_n_topics=8)

## Visualize Topic Similarity
Having generated topic embeddings, through both c-TF-IDF and embeddings, we can create a similarity matrix by simply applying cosine similarity to those topic embeddings. The result is a matrix indicating how similar certain topics are to each other.

In [None]:
topic_model.visualize_heatmap(n_clusters=20, width=1000, height=1000)
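For intuition, the similarity matrix behind such a heatmap can be sketched in a few lines of plain Python. The three-dimensional `topic_vectors` below are made-up stand-ins for real topic embeddings, and the labels are hypothetical. Only the cosine-similarity computation itself is the point.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical topic embeddings (real ones have hundreds of dimensions).
topic_vectors = {
    "flood": [0.9, 0.1, 0.0],
    "rain": [0.8, 0.3, 0.1],
    "bombing": [0.0, 0.2, 0.9],
}

labels = list(topic_vectors)
matrix = [
    [cosine(topic_vectors[a], topic_vectors[b]) for b in labels]
    for a in labels
]

for label, row in zip(labels, matrix):
    print(label, [round(value, 2) for value in row])
```

Each row compares one topic with all others. The diagonal is 1 because every topic is identical to itself, and related topics such as "flood" and "rain" score much higher than unrelated ones.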

## Visualize Term Score Decline
Topics are represented by a number of words, starting with the most representative word. Each word has a c-TF-IDF score: the higher the score, the more representative the word is of the topic. Since the topic words are sorted by their c-TF-IDF score, the scores slowly decline with each word that is added. At some point, adding words to the topic representation only marginally increases the total c-TF-IDF score and no longer benefits the representation.

To visualize this effect, we can plot the c-TF-IDF scores for each topic against the term rank of each word. In other words, the position of each word (its term rank), where the word with the highest c-TF-IDF score has a rank of 1, is put on the x-axis, and the c-TF-IDF scores are put on the y-axis. The result is a visualization that shows the decline of the c-TF-IDF score as words are added to the topic representation. It allows you, using the elbow method, to select the best number of words in a topic.


In [None]:
topic_model.visualize_term_rank()
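The same decline can be inspected numerically. As a rough sketch, the snippet below takes the c-TF-IDF scores of topic 0 shown earlier (rounded to four decimals) and looks for the largest drop between consecutive ranks. Treating the largest single drop as the elbow is a crude heuristic chosen for illustration, not something BERTopic does for you.

```python
# c-TF-IDF scores of topic 0 from the earlier output, rounded to four decimals.
scores = [0.0528, 0.0523, 0.0391, 0.0385, 0.0238,
          0.0217, 0.0182, 0.0175, 0.0167, 0.0141]

# Drop in score when moving from rank i+1 to rank i+2.
drops = [scores[i] - scores[i + 1] for i in range(len(scores) - 1)]

# Index of the largest drop; keep every word before it.
largest_drop = max(range(len(drops)), key=drops.__getitem__)
n_words = largest_drop + 1

print(f"largest drop after rank {n_words}: {drops[largest_drop]:.4f}")
```

For this topic the biggest drop comes after the fourth word ("myanmar"), which suggests that the first four words carry most of the topic's meaning.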

# **Topic Representation**
After having created the topic model, you might not be satisfied with some of the parameters you have chosen. Fortunately, BERTopic allows you to update the topics after they have been created.

This allows for fine-tuning the model to your specifications and wishes.

## Update Topics
When you have trained a model and viewed the topics and the words that represent them,
you might not be satisfied with the representation. Perhaps you forgot to remove
stopwords or you want to try out a different `n_gram_range`. We can use the function `update_topics` to update
the topic representation with new parameters for `c-TF-IDF`:


In [None]:
topic_model.update_topics(docs, n_gram_range=(1, 2))

In [None]:
topic_model.get_topic(0)   # We select the topic that we viewed before

[('floods', 0.03391227061608461),
 ('flooding', 0.033645782234383775),
 ('flood', 0.025420046623872514),
 ('myanmar', 0.02295819380650914),
 ('rains', 0.014121903431111443),
 ('flash', 0.013246748741624379),
 ('in myanmar', 0.012726447641140557),
 ('floods in', 0.011690338392335273),
 ('heavy', 0.011056343816059064),
 ('monsoon', 0.010097626823556182)]
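Entries such as `'in myanmar'` and `'floods in'` appear because `n_gram_range=(1, 2)` lets the vectorizer consider both single words and pairs of adjacent words. The hypothetical helper `ngrams` below is a simplified illustration of that idea using whitespace tokenization. The real tokenization is done by scikit-learn's `CountVectorizer` and is more involved.

```python
def ngrams(text, n_gram_range=(1, 2)):
    """Return all word n-grams of `text` for every n in the inclusive range."""
    tokens = text.lower().split()
    low, high = n_gram_range
    return [
        " ".join(tokens[i:i + n])
        for n in range(low, high + 1)
        for i in range(len(tokens) - n + 1)
    ]


terms = ngrams("floods in myanmar")
print(terms)
# prints: ['floods', 'in', 'myanmar', 'floods in', 'in myanmar']
```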

## Topic Reduction
We can also reduce the number of topics after having trained a BERTopic model. The advantage of doing so
is that you can decide the number of topics after knowing how many were actually created. It is difficult to
predict before training your model how many topics are in your documents and how many will be extracted.
Instead, we can decide afterwards how many topics seem realistic:





In [None]:
topic_model.reduce_topics(docs, nr_topics=60)

2024-05-16 04:09:04,377 - BERTopic - Topic reduction - Reducing number of topics
2024-05-16 04:09:05,074 - BERTopic - Topic reduction - Reduced number of topics from 244 to 60


<bertopic._bertopic.BERTopic at 0x7a067f4bf670>

In [None]:
# Access the newly updated topics with:
print(topic_model.topics_)

[-1, 6, 4, -1, -1, 4, -1, 3, 3, 2, 2, 2, -1, -1, 4, -1, 29, -1, -1, 2, 14, -1, -1, -1, 9, -1, 0, 39, -1, 0, 2, 56, 0, 2, -1, 0, 0, 28, 0, -1, 0, 28, 56, 0, 19, -1, -1, -1, 9, 5, 9, 27, 9, 9, -1, 9, -1, 9, -1, -1, 0, 9, -1, 9, 9, 1, -1, 9, -1, 9, 0, 5, -1, 3, 3, -1, 4, 3, -1, -1, 3, -1, -1, -1, 9, 0, -1, -1, 0, 9, 3, 3, 9, -1, -1, -1, 0, 0, 0, 0, -1, 3, -1, -1, 0, -1, -1, 0, -1, 0, -1, 0, 0, 0, -1, 41, 4, -1, -1, -1, 0, 0, -1, 0, -1, 0, -1, 0, 0, -1, 0, 0, -1, 0, 0, 0, 0, 0, -1, -1, -1, 0, 4, -1, -1, 1, 6, -1, 45, -1, 45, 45, 45, -1, 45, 45, 45, -1, 6, -1, 45, 45, 45, 45, 45, 45, -1, 45, 45, 8, 45, 45, 45, -1, 11, -1, 45, 45, -1, 45, 0, -1, -1, 6, 0, 45, 6, -1, 9, 45, 45, 45, 45, -1, -1, 6, 0, 0, 0, 0, 0, 0, -1, 26, 0, 0, 0, -1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, -1, 0, 0, 26, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 23, 0, 0, 0, 56, 0, 0, 0, 0, 0, 0, 0, 0, -1, 0, 4, 0, 0, 4, 4, 0, -1, 4, 31, -1, -1, 4, 4, 0, 0, 4, 4, -1, 0, 4, 18, 0, 0, -1, -1, 4, 4, 1, 0, 0, 4, 4, 4, 4, 31, 0, 4, 0, 4, 0, 0, 0, 4,

# **Search Topics**
After having trained our model, we can use `find_topics` to search for topics that are similar
to an input search term. Here, we search for topics that closely relate to the
search term "flooding". Then, we pick one of the returned topics and check its representation:

In [None]:
similar_topics, similarity = topic_model.find_topics("flooding", top_n=5); similar_topics

[2, 7, 13, 48, 4]

In [None]:
topic_model.get_topic(7)

[('famine', 0.04355488100071994),
 ('tsunami', 0.03683306416344358),
 ('food', 0.03510116984641643),
 ('disaster', 0.031732390954855465),
 ('natural disaster', 0.02931981147700066),
 ('natural', 0.02856028914538427),
 ('crematoria', 0.022347809734704994),
 ('food crematoria', 0.022347809734704994),
 ('outrage', 0.02172710159317605),
 ('amid crisis', 0.021323728777598524)]

# **Model serialization**
The model and its internal settings can easily be saved. Note that the documents and embeddings will not be saved. However, UMAP and HDBSCAN will be saved.

In [None]:
# Save model
topic_model.save("my_model")

In [None]:
# Load model
my_model = BERTopic.load("my_model")

# **Embedding Models**
The parameter `embedding_model` takes in a string pointing to a sentence-transformers model, a SentenceTransformer, or a Flair DocumentEmbedding model.

## Sentence-Transformers
You can select any sentence-transformers model and pass it to BERTopic via `embedding_model`:



In [None]:
topic_model = BERTopic(embedding_model="xlm-r-bert-base-nli-stsb-mean-tokens")

Or select a SentenceTransformer model with your own parameters:


In [None]:
from sentence_transformers import SentenceTransformer

sentence_model = SentenceTransformer("distilbert-base-nli-mean-tokens", device="cpu")
topic_model = BERTopic(embedding_model=sentence_model, verbose=True)

Click [here](https://www.sbert.net/docs/pretrained_models.html) for a list of supported sentence transformers models.  
