Reference: https://github.com/MaartenGr/BERTopic/blob/master/notebooks/BERTopic.ipynb

# BERTopic

BERTopic is a topic modeling technique that uses Hugging Face transformers and c-TF-IDF to create dense clusters, producing easily interpretable topics while keeping important words in the topic descriptions.

BERTopic supports guided, (semi-) supervised, hierarchical, and dynamic topic modeling. It even supports visualizations similar to LDAvis!

In [19]:
# ! pip install bertopic

In [18]:
from bertopic import BERTopic

In [7]:
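def read_txt_as_lst(filename):
    """Read a text file and return its lines as a list of stripped strings."""
    with open(filename, encoding="utf-8") as f:
        # Iterating over the file yields one line at a time
        return [line.strip() for line in f]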

In [8]:
# Text file with the 3,000 most commonly used English words, one word per line
filename = '/home/mintu/MedCat/Bertopic/top_3000_common_words.txt'
word_list = read_txt_as_lst(filename)

In [12]:
model = BERTopic(language="english")
topics, probs = model.fit_transform(word_list)

In [13]:
model.get_topic_freq()

    Topic  Count
0      -1   1148
1       0    106
2       1     84
3       2     79
4       3     73
..    ...    ...
59     58     12
60     59     12
61     60     12
62     61     11


In [14]:
model.get_topic(40)

[('baseball', 0.21389708898389212),
 ('basket', 0.21389708898389212),
 ('swing', 0.21389708898389212),
 ('sport', 0.21389708898389212),
 ('soccer', 0.21389708898389212),
 ('quarterback', 0.21389708898389212),
 ('pitch', 0.21389708898389212),
 ('olympic', 0.21389708898389212),
 ('league', 0.21389708898389212),
 ('golf', 0.21389708898389212)]
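
Every word in topic 40 has exactly the same score. This is consistent with how c-TF-IDF behaves when each document is a single word: each term then occurs once in its topic and once overall, so the score depends only on the topic's size. The sketch below is a simplified, standard-library version of the idea, not BERTopic's own implementation. It assumes the weight is the term frequency within a topic (normalised by the topic's word count) times log(1 + A / f_t), where A is the average number of words per topic and f_t is the term's frequency across all topics. The function name `c_tf_idf` and the toy data are illustrative only.

```python
import math
from collections import Counter

def c_tf_idf(docs_per_class):
    """Simplified c-TF-IDF: normalised per-class term frequency times log(1 + A / f_t)."""
    # Count terms per class, treating all documents of a class as one long document.
    counts = {
        label: Counter(word for doc in docs for word in doc.lower().split())
        for label, docs in docs_per_class.items()
    }
    # f_t: frequency of each term across all classes.
    total = Counter()
    for class_counts in counts.values():
        total.update(class_counts)
    # A: average number of words per class.
    avg_words = sum(total.values()) / len(counts)
    scores = {}
    for label, class_counts in counts.items():
        n_words = sum(class_counts.values())
        scores[label] = {
            term: (tf / n_words) * math.log(1 + avg_words / total[term])
            for term, tf in class_counts.items()
        }
    return scores

# Toy data (hypothetical grouping): every document is a single, unique word.
toy_classes = {
    "sports": ["baseball", "golf", "soccer"],
    "vehicles": ["car", "bike", "wheel", "motor", "auto"],
}
toy_scores = c_tf_idf(toy_classes)
for label, term_scores in toy_scores.items():
    print(label, {term: round(score, 4) for term, score in term_scores.items()})
```

Within each toy topic all terms receive the same weight, and the smaller topic gets the larger value, which mirrors the pattern in the output above. Feeding BERTopic longer documents (sentences or paragraphs) rather than single words would give c-TF-IDF real frequency differences to work with.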

In [16]:
similar_topics, similarity = model.find_topics("car", top_n=5)
similar_topics

[11, 56, 9, 26, 22]

In [17]:
model.get_topic(11)

[('adventure', 0.09167018099309662),
 ('visitor', 0.09167018099309662),
 ('auto', 0.09167018099309662),
 ('bike', 0.09167018099309662),
 ('pole', 0.09167018099309662),
 ('cycle', 0.09167018099309662),
 ('camp', 0.09167018099309662),
 ('car', 0.09167018099309662),
 ('motor', 0.09167018099309662),
 ('wheel', 0.09167018099309662)]
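
To my understanding, `find_topics` embeds the search term with the same embedding model and ranks topics by the cosine similarity between that vector and each topic's embedding. The following is a minimal standard-library sketch of that ranking step under that assumption. The helper names `cosine` and `rank_topics` and the 3-dimensional vectors are made up for illustration; real embeddings come from the model and have hundreds of dimensions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_topics(query_vec, topic_vecs, top_n=5):
    """Return (topic_ids, similarities) for the top_n closest topics, best first."""
    scored = sorted(
        ((cosine(query_vec, vec), topic_id) for topic_id, vec in topic_vecs.items()),
        reverse=True,
    )[:top_n]
    return [topic_id for _, topic_id in scored], [sim for sim, _ in scored]

# Hypothetical topic embeddings keyed by topic id.
toy_topic_vecs = {
    0: [0.9, 0.1, 0.0],
    1: [0.1, 0.9, 0.1],
    2: [0.7, 0.3, 0.2],
}
toy_query = [1.0, 0.2, 0.0]  # hypothetical embedding of the search term
toy_ids, toy_sims = rank_topics(toy_query, toy_topic_vecs, top_n=2)
print(toy_ids, [round(sim, 3) for sim in toy_sims])
```

The second value returned by `find_topics` (`similarity` in the cell above) plays the role of `toy_sims` here, so printing it next to `similar_topics` shows how strong each match actually is.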

## Embedding Model

Any sentence-transformers model can be used in place of the default embedding model: pass its name to BERTopic through the embedding_model parameter.

In [22]:
# Text file with a list of common diseases (English names), one per line
filename = '/home/mintu/MedCat/Bertopic/top_diseases.txt'
disease_list = read_txt_as_lst(filename)

In [24]:
disease_list[:5]

['Abdominal aortic aneurysm',
 'Acne',
 'Acute cholecystitis',
 'Acute lymphoblastic leukaemia',
 'Acute lymphoblastic leukaemia: Children']

In [25]:
st_model = BERTopic(embedding_model="emilyalsentzer/Bio_ClinicalBERT")

In [26]:
topics, probs = st_model.fit_transform(disease_list)

No sentence-transformers model found with name /home/mintu/.cache/torch/sentence_transformers/emilyalsentzer_Bio_ClinicalBERT. Creating a new one with MEAN pooling.
Some weights of the model checkpoint at /home/mintu/.cache/torch/sentence_transformers/emilyalsentzer_Bio_ClinicalBERT were not used when initializing BertModel: ['cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.bias', 'cls.predictions.decoder.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSeq
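
The first log line says that no ready-made sentence-transformers model was found for this checkpoint, so one was created "with MEAN pooling". Mean pooling turns the per-token vectors produced by the transformer into a single vector per document by averaging them, skipping padding positions. Below is a minimal standard-library sketch of that operation. The function name `mean_pool` and the 2-dimensional toy vectors are illustrative assumptions, not the library's code.

```python
def mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose mask entry is 1; padding (mask 0) is ignored."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]

# Hypothetical token vectors: three real tokens followed by one padding position.
toy_tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [0.0, 0.0]]
toy_mask = [1, 1, 1, 0]
sentence_vec = mean_pool(toy_tokens, toy_mask)
print(sentence_vec)  # [3.0, 4.0]
```

Because Bio_ClinicalBERT was not trained to produce sentence-level embeddings, the averaged vectors may cluster less cleanly than those from a model trained for sentence similarity, so the resulting topics are worth inspecting.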

In [33]:
similar_topics, similarity = st_model.find_topics("cancer", top_n=5)
similar_topics

[0, -1, 3, 1, 2]

In [34]:
st_model.get_topic(0)

[('cancer', 0.09279619294212287),
 ('and', 0.06204131417004102),
 ('disease', 0.04807155689161357),
 ('tumours', 0.03771344463826473),
 ('adults', 0.03707932801591224),
 ('infection', 0.03438893138844579),
 ('disorder', 0.03438893138844579),
 ('syndrome', 0.03178090728526833),
 ('in', 0.030859726463352402),
 ('fever', 0.030859726463352402)]
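
Topic 0 contains the stop words "and" and "in". One common way to keep such words out of topic descriptions is to pass a `CountVectorizer` with `stop_words="english"` as `vectorizer_model`. The sketch below assumes `bertopic` and `scikit-learn` are installed; `build_disease_topic_model` is a hypothetical helper name, and the imports sit inside the function so that defining it does not load the heavy dependencies.

```python
def build_disease_topic_model():
    """Build a BERTopic model whose topic words exclude English stop words."""
    # Imported lazily so that defining this helper stays cheap.
    from bertopic import BERTopic
    from sklearn.feature_extraction.text import CountVectorizer

    # The vectorizer only affects the topic representations (c-TF-IDF),
    # not the document embeddings or the clustering.
    vectorizer = CountVectorizer(stop_words="english")
    return BERTopic(
        embedding_model="emilyalsentzer/Bio_ClinicalBERT",
        vectorizer_model=vectorizer,
    )

# Usage, assuming disease_list has been loaded as in the cells above:
# st_model = build_disease_topic_model()
# topics, probs = st_model.fit_transform(disease_list)
```

Since stop words are removed only when the topic words are computed, the documents are still embedded in full.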