<a href="https://colab.research.google.com/github/Priyabrata017/Topic-modelling-using-BERTopic/blob/main/Medical_transcription_BERTopic.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Tutorial** - Topic Modeling with BERTopic
(last updated 01-09-2022)

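In this tutorial we explore how to use BERTopic to create topics from a collection of documents. The original tutorial was written for the well-known 20 Newsgroups dataset; this notebook applies the same steps to the medical transcription samples in `mtsamples.csv`. The most frequent use cases and methods are discussed, together with important parameters to look out for.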


## BERTopic
BERTopic is a topic modeling technique that leverages 🤗 transformers and a custom class-based TF-IDF to create dense clusters allowing for easily interpretable topics whilst keeping important words in the topic descriptions. 

<br>

<img src="https://raw.githubusercontent.com/MaartenGr/BERTopic/master/images/logo.png" width="40%">

# Enabling the GPU

First, you'll need to enable GPUs for the notebook:

- Navigate to Edit→Notebook Settings
- Select GPU from the Hardware Accelerator drop-down

[Reference](https://colab.research.google.com/notebooks/gpu.ipynb)

# **Installing BERTopic**

We start by installing BERTopic from PyPI:

In [4]:
%%capture
!pip install bertopic pandas

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
import os
os.chdir("drive/My Drive/Datasets")

In [3]:
!ls

mtsamples.csv


In [7]:
import pandas as pd


In [8]:
df=pd.read_csv("mtsamples.csv")
print(df.columns)

Index(['Unnamed: 0', 'description', 'medical_specialty', 'sample_name',
       'transcription', 'keywords'],
      dtype='object')


In [9]:
print(df.head())

   Unnamed: 0                                        description  \
0           0   A 23-year-old white female presents with comp...   
1           1           Consult for laparoscopic gastric bypass.   
2           2           Consult for laparoscopic gastric bypass.   
3           3                             2-D M-Mode. Doppler.     
4           4                                 2-D Echocardiogram   

             medical_specialty                                sample_name  \
0         Allergy / Immunology                         Allergic Rhinitis    
1                   Bariatrics   Laparoscopic Gastric Bypass Consult - 2    
2                   Bariatrics   Laparoscopic Gastric Bypass Consult - 1    
3   Cardiovascular / Pulmonary                    2-D Echocardiogram - 1    
4   Cardiovascular / Pulmonary                    2-D Echocardiogram - 2    

                                       transcription  \
0  SUBJECTIVE:,  This 23-year-old white female pr...   
1  PAST MEDICAL 

In [16]:
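import re
# Strip URLs and lowercase the raw transcription text.
df['cleaned'] = df.apply(lambda row: re.sub(r"http\S+", "", str(row.transcription)).lower(), axis=1)
# Drop tokens that start with "@".
df['cleaned'] = df.apply(lambda row: " ".join(filter(lambda x: x[0] != "@", row.cleaned.split())), axis=1)
# Replace every run of non-letter characters with a single space (this also removes digits).
df['cleaned'] = df.apply(lambda row: " ".join(re.sub("[^a-zA-Z]+", " ", row.cleaned).split()), axis=1)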
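To see what the three cleaning steps in the cell above do, here is a minimal sketch that applies the same regular expressions to a single string using only the standard library. The helper name `clean_text` and the sample sentence are illustrative assumptions and are not part of the notebook.

```python
import re

def clean_text(text):
    """Apply the same three cleaning steps to one string (illustrative helper)."""
    # 1. Remove URLs and lowercase.
    text = re.sub(r"http\S+", "", str(text)).lower()
    # 2. Drop tokens that start with "@".
    text = " ".join(tok for tok in text.split() if tok[0] != "@")
    # 3. Replace runs of non-letters with one space and normalise whitespace.
    return " ".join(re.sub("[^a-zA-Z]+", " ", text).split())

# Made-up sample input
sample = "SUBJECTIVE:, This 23-year-old @user see http://x.y/z female"
print(clean_text(sample))
```

Because step 3 keeps letters only, numbers such as ages and measurements disappear from the cleaned text, which is worth keeping in mind for clinical documents.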

In [17]:
df.columns

Index(['Unnamed: 0', 'description', 'medical_specialty', 'sample_name',
       'transcription', 'keywords', 'cleaned'],
      dtype='object')

In [18]:
df['cleaned'].iloc[10]

'preoperative diagnosis morbid obesity postoperative diagnosis morbid obesity procedure laparoscopic roux en y gastric bypass antecolic antegastric with mm eea anastamosis esophagogastroduodenoscopy anesthesia general with endotracheal intubation indications for procedure this is a year old male who has been overweight for many years and has tried multiple different weight loss diets and programs the patient has now begun to have comorbidities related to the obesity the patient has attended our bariatric seminar and met with our dietician and psychologist the patient has read through our comprehensive handout and understands the risks and benefits of bypass surgery as evidenced by the signing of our consent form procedure in detail the risks and benefits were explained to the patient consent was obtained the patient was taken to the operating room and placed supine on the operating room table general anesthesia was administered with endotracheal intubation a foley catheter was placed f

## Restart the Notebook
Installing BERTopic updates some packages that were already loaded. To use the updated versions correctly, we should now restart the notebook.

From the Menu:

Runtime → Restart Runtime

# Data
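The original tutorial uses the popular 20 Newsgroups dataset, which contains roughly 18,000 newsgroup posts; its loader is kept below, commented out, for reference. Here we use the cleaned medical transcriptions in `df.cleaned` instead.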

In [None]:
# from sklearn.datasets import fetch_20newsgroups
# docs = fetch_20newsgroups(subset='all',  remove=('headers', 'footers', 'quotes'))['data']

In [19]:
len(df.cleaned)

4999

# **Topic Modeling**

In this example, we will go through the main components of BERTopic and the steps necessary to create a strong topic model. 




## Training

We start by instantiating BERTopic. We set language to `english` since our documents are in the English language. If you would like to use a multi-lingual model, please use `language="multilingual"` instead. 

We will also calculate the topic probabilities. However, this can slow down BERTopic significantly with large amounts of data (>100,000 documents). Turn it off if you want to speed up the model.


In [20]:
from bertopic import BERTopic

topic_model = BERTopic(language="english", calculate_probabilities=True, verbose=True)
topics, probs = topic_model.fit_transform(df.cleaned.to_list())

2023-01-28 18:13:53,093 - BERTopic - Transformed documents to Embeddings
2023-01-28 18:14:21,021 - BERTopic - Reduced dimensionality
2023-01-28 18:14:26,937 - BERTopic - Clustered reduced embeddings


**NOTE**: Use `language="multilingual"` to select a model that supports 50+ languages.

## Extracting Topics
After fitting our model, we can start by looking at the results. Typically, we look at the most frequent topics first as they best represent the collection of documents. 

In [21]:
freq = topic_model.get_topic_info(); freq.head(5)

|   | Topic | Count | Name |
|---|-------|-------|------|
| 0 | -1 | 532 | -1_is_of_the_and |
| 1 | 0 | 116 | 0_artery_coronary_left_circumflex |
| 2 | 1 | 70 | 1_she_that_her_he |
| 3 | 2 | 69 | 2_gallbladder_duct_cystic_bile |
| 4 | 3 | 61 | 3_stomach_scope_esophagus_hiatal |


In [22]:
print(freq.Count.sum(),freq.Topic.count())

4999 192
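As a quick sanity check, the counts printed above can be turned into an outlier share and an average topic size. The sketch below hard-codes the three numbers from the outputs (4999 documents, 192 rows in the topic table, 532 documents in topic -1); the variable names are illustrative, and it assumes the 192 rows include the outlier row.

```python
n_docs = 4999        # freq.Count.sum()
n_rows = 192         # freq.Topic.count(), assumed to include the -1 row
n_outliers = 532     # Count for topic -1

outlier_share = n_outliers / n_docs
n_topics = n_rows - 1                         # topics excluding the outlier row
avg_topic_size = (n_docs - n_outliers) / n_topics

print(f"outliers: {outlier_share:.1%}, topics: {n_topics}, avg size: {avg_topic_size:.1f}")
```

With roughly 190 topics for about 5,000 documents, many topics are small, which is one reason to consider reducing the number of topics later on.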


In [41]:
print(topic_model.topic_sizes_)
print(type(topics))
print(len(topics))
print(topics[:10])
print(freq['Count'][0])

{-1: 532, 0: 116, 1: 70, 2: 69, 3: 61, 4: 60, 5: 55, 6: 51, 7: 48, 8: 48, 9: 46, 10: 45, 11: 44, 12: 42, 13: 41, 14: 41, 15: 39, 16: 39, 17: 39, 18: 38, 19: 37, 20: 37, 21: 37, 22: 36, 23: 36, 24: 35, 25: 34, 27: 34, 26: 34, 28: 33, 29: 33, 30: 33, 31: 32, 32: 32, 33: 32, 34: 32, 39: 31, 41: 31, 40: 31, 37: 31, 38: 31, 36: 31, 35: 31, 42: 30, 43: 28, 44: 28, 45: 28, 46: 28, 47: 27, 50: 26, 51: 26, 48: 26, 49: 26, 53: 25, 54: 25, 55: 25, 56: 25, 52: 25, 61: 24, 63: 24, 62: 24, 60: 24, 59: 24, 58: 24, 57: 24, 69: 23, 73: 23, 72: 23, 71: 23, 70: 23, 65: 23, 68: 23, 67: 23, 66: 23, 64: 23, 80: 22, 84: 22, 83: 22, 82: 22, 81: 22, 74: 22, 79: 22, 78: 22, 77: 22, 76: 22, 75: 22, 93: 21, 92: 21, 91: 21, 90: 21, 89: 21, 88: 21, 87: 21, 86: 21, 85: 21, 101: 20, 98: 20, 100: 20, 99: 20, 95: 20, 97: 20, 94: 20, 96: 20, 102: 19, 103: 19, 104: 19, 105: 19, 106: 19, 114: 18, 119: 18, 118: 18, 117: 18, 116: 18, 115: 18, 112: 18, 113: 18, 111: 18, 110: 18, 109: 18, 108: 18, 107: 18, 125: 17, 129: 17, 1

In [51]:
res=[]
for i in range(len(df.cleaned)):
  res.append(topic_model.topic_labels_.get(topics[i],'not_found'))

# df['topics']=topic_model.topic_labels_[topic]
# df['prob'] =
# df['topics_dict'] = pd.Series(topic_model.topic_labels_)

In [62]:
# Note: `row` is a copy, so the assignment below is never written back to `df`
# (the df.head() output that follows has no `new_c` column).
for i, row in df.iterrows():
  print(i)
  print('\n')
  row['new_c']=i

In [63]:
df.head()

Unnamed: 0.1,Unnamed: 0,description,medical_specialty,sample_name,transcription,keywords,cleaned,topics_dict,top,topics_c,topics_c1
0,0,A 23-year-old white female presents with comp...,Allergy / Immunology,Allergic Rhinitis,"SUBJECTIVE:, This 23-year-old white female pr...","allergy / immunology, allergic rhinitis, aller...",subjective this year old white female presents...,0_artery_coronary_left_circumflex,a,81_mom_congestion_clear_has,81_mom_congestion_clear_has
1,1,Consult for laparoscopic gastric bypass.,Bariatrics,Laparoscopic Gastric Bypass Consult - 2,"PAST MEDICAL HISTORY:, He has difficulty climb...","bariatrics, laparoscopic gastric bypass, weigh...",past medical history he has difficulty climbin...,1_she_that_her_he,a,42_he_swelling_rash_reaction,42_he_swelling_rash_reaction
2,2,Consult for laparoscopic gastric bypass.,Bariatrics,Laparoscopic Gastric Bypass Consult - 1,"HISTORY OF PRESENT ILLNESS: , I have seen ABC ...","bariatrics, laparoscopic gastric bypass, heart...",history of present illness i have seen abc tod...,2_gallbladder_duct_cystic_bile,a,135_he_his_him_heart,135_he_his_him_heart
3,3,2-D M-Mode. Doppler.,Cardiovascular / Pulmonary,2-D Echocardiogram - 1,"2-D M-MODE: , ,1. Left atrial enlargement wit...","cardiovascular / pulmonary, 2-d m-mode, dopple...",d m mode left atrial enlargement with left atr...,3_stomach_scope_esophagus_hiatal,a,76_valve_mitral_tricuspid_aortic,76_valve_mitral_tricuspid_aortic
4,4,2-D Echocardiogram,Cardiovascular / Pulmonary,2-D Echocardiogram - 2,1. The left ventricular cavity size and wall ...,"cardiovascular / pulmonary, 2-d, doppler, echo...",the left ventricular cavity size and wall thic...,4_revealed_speech_brain_on,a,76_valve_mitral_tricuspid_aortic,76_valve_mitral_tricuspid_aortic


In [52]:
df['topics_c']=res

In [54]:
df['topics_c1']= [topic_model.topic_labels_.get(topics[i],'not_found') for i in range(len(df.cleaned))]

In [55]:
df.head()

Unnamed: 0.1,Unnamed: 0,description,medical_specialty,sample_name,transcription,keywords,cleaned,topics_dict,top,topics_c,topics_c1
0,0,A 23-year-old white female presents with comp...,Allergy / Immunology,Allergic Rhinitis,"SUBJECTIVE:, This 23-year-old white female pr...","allergy / immunology, allergic rhinitis, aller...",subjective this year old white female presents...,0_artery_coronary_left_circumflex,a,81_mom_congestion_clear_has,81_mom_congestion_clear_has
1,1,Consult for laparoscopic gastric bypass.,Bariatrics,Laparoscopic Gastric Bypass Consult - 2,"PAST MEDICAL HISTORY:, He has difficulty climb...","bariatrics, laparoscopic gastric bypass, weigh...",past medical history he has difficulty climbin...,1_she_that_her_he,a,42_he_swelling_rash_reaction,42_he_swelling_rash_reaction
2,2,Consult for laparoscopic gastric bypass.,Bariatrics,Laparoscopic Gastric Bypass Consult - 1,"HISTORY OF PRESENT ILLNESS: , I have seen ABC ...","bariatrics, laparoscopic gastric bypass, heart...",history of present illness i have seen abc tod...,2_gallbladder_duct_cystic_bile,a,135_he_his_him_heart,135_he_his_him_heart
3,3,2-D M-Mode. Doppler.,Cardiovascular / Pulmonary,2-D Echocardiogram - 1,"2-D M-MODE: , ,1. Left atrial enlargement wit...","cardiovascular / pulmonary, 2-d m-mode, dopple...",d m mode left atrial enlargement with left atr...,3_stomach_scope_esophagus_hiatal,a,76_valve_mitral_tricuspid_aortic,76_valve_mitral_tricuspid_aortic
4,4,2-D Echocardiogram,Cardiovascular / Pulmonary,2-D Echocardiogram - 2,1. The left ventricular cavity size and wall ...,"cardiovascular / pulmonary, 2-d, doppler, echo...",the left ventricular cavity size and wall thic...,4_revealed_speech_brain_on,a,76_valve_mitral_tricuspid_aortic,76_valve_mitral_tricuspid_aortic


In [47]:
df['top'] ='a'
print(df.head())

   Unnamed: 0                                        description  \
0           0   A 23-year-old white female presents with comp...   
1           1           Consult for laparoscopic gastric bypass.   
2           2           Consult for laparoscopic gastric bypass.   
3           3                             2-D M-Mode. Doppler.     
4           4                                 2-D Echocardiogram   

             medical_specialty                                sample_name  \
0         Allergy / Immunology                         Allergic Rhinitis    
1                   Bariatrics   Laparoscopic Gastric Bypass Consult - 2    
2                   Bariatrics   Laparoscopic Gastric Bypass Consult - 1    
3   Cardiovascular / Pulmonary                    2-D Echocardiogram - 1    
4   Cardiovascular / Pulmonary                    2-D Echocardiogram - 2    

                                       transcription  \
0  SUBJECTIVE:,  This 23-year-old white female pr...   
1  PAST MEDICAL 

In [42]:
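# Caution: the Series is aligned on topic ID, so row i gets the label of topic i, not of its own topic.
df['topics_dict'] = pd.Series(topic_model.topic_labels_)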

In [43]:
df['topics_dict']

0       0_artery_coronary_left_circumflex
1                       1_she_that_her_he
2          2_gallbladder_duct_cystic_bile
3        3_stomach_scope_esophagus_hiatal
4              4_revealed_speech_brain_on
                      ...                
4994                                  NaN
4995                                  NaN
4996                                  NaN
4997                                  NaN
4998                                  NaN
Name: topics_dict, Length: 4999, dtype: object
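The `NaN` values above come from index alignment: the labels are keyed by topic ID, while the DataFrame rows are keyed by document position. The following sketch mimics both behaviours with plain Python containers. Only `topics[0] == 81` and the label strings are taken from the notebook outputs; the remaining topic assignments are made up for illustration.

```python
labels = {
    -1: "-1_is_of_the_and",
    0: "0_artery_coronary_left_circumflex",
    1: "1_she_that_her_he",
    2: "2_gallbladder_duct_cystic_bile",
    81: "81_mom_congestion_clear_has",
}
topics = [81, 0, -1]  # per-document topic IDs (entries after the first are hypothetical)

# Position-aligned assignment: document i receives the label of topic i.
by_position = [labels.get(i) for i in range(len(topics))]

# Per-document lookup: document i receives the label of its own topic.
by_topic = [labels.get(t, "not_found") for t in topics]

print(by_position)
print(by_topic)
```

The second form corresponds to the list comprehension used for `topics_c1` in this notebook.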

In [39]:
topic_model.topic_labels_[topics[0]]

'81_mom_congestion_clear_has'

In [33]:
len(topic_model.probabilities_), type(topic_model.probabilities_),type(topic_model.topic_labels_)

(4999, numpy.ndarray, dict)

In [34]:
topic_model.topic_labels_

{-1: '-1_is_of_the_and',
 0: '0_artery_coronary_left_circumflex',
 1: '1_she_that_her_he',
 2: '2_gallbladder_duct_cystic_bile',
 3: '3_stomach_scope_esophagus_hiatal',
 4: '4_revealed_speech_brain_on',
 5: '5_nasal_septum_cartilage_septal',
 6: '6_rotator_cuff_tendon_glenoid',
 7: '7_orbital_tumor_matter_brain',
 8: '8_stress_perfusion_resting_myocardial',
 9: '9_vertigo_ble_nystagmus_on',
 10: '10_prostate_bladder_seeds_seminal',
 11: '11_tonsil_adenoid_tonsillar_tonsils',
 12: '12_pelvis_ct_pancreas_liver',
 13: '13_cough_no_mother_nose',
 14: '14_fat_lid_eyelid_upper',
 15: '15_her_she_mg_regarding',
 16: '16_eye_chamber_lens_viscoelastic',
 17: '17_no_arthritis_reveals_rheumatoid',
 18: '18_vein_saphenous_prolene_artery',
 19: '19_lobe_bronchoscope_secretions_bronchoscopy',
 20: '20_chest_pleural_tube_effusion',
 21: '21_we_disc_pulposus_l4',
 22: '22_mom_no_wellchild_mother',
 23: '23_knee_medial_meniscus_chondromalacia',
 24: '24_he_no_denies_or',
 25: '25_seizures_reported_seiz

Topic -1 contains all outliers and should typically be ignored. Next, let's take a look at a frequent topic that was generated:

In [23]:
topic_model.get_topic(0)  # Select the most frequent topic

[('artery', 0.028634860191230474),
 ('coronary', 0.027339793875751656),
 ('left', 0.015313393421539822),
 ('circumflex', 0.014684411894694637),
 ('catheter', 0.013962602217124268),
 ('femoral', 0.013701179263637465),
 ('descending', 0.013609125489770422),
 ('vessel', 0.013437092507900499),
 ('6french', 0.013152288122603046),
 ('stenosis', 0.012376585994135979)]

**NOTE**: BERTopic is stochastic, which means that the topics might differ across runs. This is mostly due to the stochastic nature of UMAP.

## Attributes

There are a number of attributes that you can access after having trained your BERTopic model:


| Attribute | Description |
|------------------------|---------------------------------------------------------------------------------------------|
| topics_               | The topics that are generated for each document after training or updating the topic model. |
| probabilities_ | The probabilities that are generated for each document if HDBSCAN is used. |
| topic_sizes_           | The size of each topic                                                                      |
| topic_mapper_          | A class for tracking topics and their mappings anytime they are merged/reduced.             |
| topic_representations_ | The top *n* terms per topic and their respective c-TF-IDF values.                             |
| c_tf_idf_              | The topic-term matrix as calculated through c-TF-IDF.                                       |
| topic_labels_          | The default labels for each topic.                                                          |
| custom_labels_         | Custom labels for each topic as generated through `.set_topic_labels`.                                                               |
| topic_embeddings_      | The embeddings for each topic if `embedding_model` was used.                                                              |
| representative_docs_   | The representative documents for each topic if HDBSCAN is used.                                                |

For example, to access the predicted topics for the first 10 documents, we simply run the following:

In [None]:
topic_model.topics_[:10]

[0, 4, 18, 35, 77, -1, -1, 0, 0, -1]

# **Visualization**
There are several visualization options available in BERTopic, namely the visualization of topics, probabilities and topics over time. Topic modeling is, to a certain extent, quite subjective. Visualizations help understand the topics that were created. 

## Visualize Topics
After having trained our `BERTopic` model, we can iteratively go through perhaps a hundred topics to get a good 
understanding of the topics that were extracted. However, that takes quite some time and lacks a global representation. 
Instead, we can visualize the topics that were generated in a way very similar to 
[LDAvis](https://github.com/cpsievert/LDAvis):

In [24]:
topic_model.visualize_topics()

## Visualize Topic Probabilities

The variable `probabilities` that is returned from `transform()` or `fit_transform()` can 
be used to understand how confident BERTopic is that certain topics can be found in a document. 

To visualize the distributions, we simply call:

In [None]:
topic_model.visualize_distribution(probs[200], min_probability=0.015)
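To make the role of `min_probability` concrete, the sketch below filters one document's probability vector by a threshold. It assumes that position `i` in the vector corresponds to topic `i` and that the threshold is applied as "greater than or equal"; the helper name `topics_above` and the probability values are made up for illustration.

```python
def topics_above(probabilities, min_probability=0.015):
    """Return (topic_id, probability) pairs at or above the threshold."""
    return [
        (topic_id, p)
        for topic_id, p in enumerate(probabilities)
        if p >= min_probability
    ]

# Hypothetical probabilities for one document over five topics
doc_probs = [0.40, 0.01, 0.02, 0.005, 0.30]
print(topics_above(doc_probs))
```

Raising the threshold hides low-confidence topics and keeps the chart readable when a model has many topics.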

## Visualize Topic Hierarchy

The topics that were created can be hierarchically reduced. In order to understand the potential hierarchical structure of the topics, we can use `scipy.cluster.hierarchy` to create clusters and visualize how they relate to one another. This might help with selecting an appropriate `nr_topics` when reducing the number of topics that you have created.

In [25]:
topic_model.visualize_hierarchy(top_n_topics=50)

## Visualize Terms

We can visualize the selected terms for a few topics by creating bar charts out of the c-TF-IDF scores for each topic representation. Insights can be gained from the relative c-TF-IDF scores between and within topics. Moreover, you can easily compare topic representations to each other.

In [26]:
topic_model.visualize_barchart(top_n_topics=5)
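The scores shown in the bar charts come from the class-based TF-IDF. As a rough illustration, the sketch below assumes the weighting described in the BERTopic documentation: the term frequency within a topic (normalised by the topic's word count) multiplied by `log(1 + A / f)`, where `A` is the average number of words per topic and `f` is the term's frequency across all topics. The function name `c_tf_idf` and the toy documents are assumptions, and this is not the library's implementation.

```python
import math
from collections import Counter

def c_tf_idf(class_docs):
    """class_docs maps a topic ID to one string holding all of that topic's text."""
    counts = {c: Counter(text.split()) for c, text in class_docs.items()}
    total = Counter()
    for counter in counts.values():
        total.update(counter)
    avg_words = sum(total.values()) / len(counts)  # A: average words per topic
    scores = {}
    for c, counter in counts.items():
        n_words = sum(counter.values())
        scores[c] = {
            term: (freq / n_words) * math.log(1 + avg_words / total[term])
            for term, freq in counter.items()
        }
    return scores

# Toy topics (made up): each string stands for all documents of one topic
toy = {
    0: "artery coronary artery stenosis left",
    1: "gallbladder duct cystic duct left",
}
scores = c_tf_idf(toy)
top_term = max(scores[0], key=scores[0].get)
print(top_term, round(scores[0][top_term], 3))
```

In this toy example "left" appears in both topics and therefore scores lower than "coronary", even though both occur once in topic 0.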

## Visualize Topic Similarity
Having generated topic embeddings, through both c-TF-IDF and embeddings, we can create a similarity matrix by simply applying cosine similarities through those topic embeddings. The result will be a matrix indicating how similar certain topics are to each other.

In [27]:
topic_model.visualize_heatmap(n_clusters=20, width=1000, height=1000)
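The similarity matrix behind the heatmap can be sketched with plain Python. The example below computes pairwise cosine similarities for three toy vectors; the vectors and function names are illustrative assumptions, and real topic embeddings have far more dimensions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def similarity_matrix(vectors):
    """Pairwise cosine similarities; entry [i][j] compares vector i with vector j."""
    return [[cosine(u, v) for v in vectors] for u in vectors]

# Toy "topic embeddings" (made up)
toy_embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
sim = similarity_matrix(toy_embeddings)
for row in sim:
    print([round(x, 3) for x in row])
```

The matrix is symmetric with ones on the diagonal, so the heatmap carries information only in its off-diagonal cells.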

## Visualize Term Score Decline
Topics are represented by a number of words, starting with the most representative word. Each word has a c-TF-IDF score: the higher the score, the more representative the word is of the topic. Since the topic words are sorted by their c-TF-IDF score, the scores slowly decline with each word that is added. At some point, adding words to the topic representation only marginally increases the total c-TF-IDF score and would not be beneficial for its representation.

To visualize this effect, we can plot the c-TF-IDF scores for each topic by the term rank of each word. The x-axis shows the position of each word (term rank), where the word with the highest c-TF-IDF score has rank 1, and the y-axis shows the c-TF-IDF scores. The result is a visualization that shows the decline of the c-TF-IDF score as words are added to the topic representation. It allows you, using the elbow method, to select the best number of words in a topic.


In [28]:
topic_model.visualize_term_rank()
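A crude numeric stand-in for reading the elbow off the plot is to look for the largest drop between consecutive ranks. The sketch below does this for the ten scores of topic 0 printed earlier by `get_topic(0)`, rounded to four decimals. The "largest drop" rule is a simplifying assumption and not part of BERTopic.

```python
# c-TF-IDF scores of topic 0 from get_topic(0), rounded to four decimals
scores = [0.0286, 0.0273, 0.0153, 0.0147, 0.0140,
          0.0137, 0.0136, 0.0134, 0.0132, 0.0124]

# Drop in score when moving from rank k to rank k + 1 (ranks start at 1)
drops = [round(a - b, 4) for a, b in zip(scores, scores[1:])]

# Crude elbow: the rank after which the score falls the most
elbow_rank = max(range(len(drops)), key=drops.__getitem__) + 1
print(drops)
print(elbow_rank)
```

For this topic the steepest decline comes right after the second word ("artery", "coronary"), and the remaining words contribute nearly equal scores.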

# **Topic Representation**
After having created the topic model, you might not be satisfied with some of the parameters you have chosen. Fortunately, BERTopic allows you to update the topics after they have been created. 

This allows for fine-tuning the model to your specifications and wishes. 

## Update Topics
When you have trained a model and viewed the topics and the words that represent them,
you might not be satisfied with the representation. Perhaps you forgot to remove
stopwords or you want to try out a different `n_gram_range`. We can use the function `update_topics` to update 
the topic representation with new parameters for `c-TF-IDF`: 


In [None]:
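docs = df.cleaned.to_list()  # the documents the model was fitted on; `docs` is otherwise undefined here
topic_model.update_topics(docs, n_gram_range=(1, 2))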

In [None]:
topic_model.get_topic(0)   # Select the topic that we viewed before

[('game', 0.006654595795415211),
 ('team', 0.0056771893307381895),
 ('he', 0.005431101599582429),
 ('games', 0.004475287942644683),
 ('the', 0.004171588186919083),
 ('was', 0.0038796410199552055),
 ('players', 0.0038317533223677478),
 ('season', 0.003788213789693645),
 ('in', 0.0037472379725803605),
 ('hockey', 0.0037024333747987296)]

## Topic Reduction
We can also reduce the number of topics after having trained a BERTopic model. The advantage of doing so
is that you can decide the number of topics after knowing how many were actually created. It is difficult to
predict before training how many topics are in your documents and how many will be extracted.
Instead, we can decide afterwards how many topics seem realistic:





In [None]:
topic_model.reduce_topics(docs, nr_topics=60)

2023-01-28 17:26:18,588 - BERTopic - Reduced number of topics from 209 to 61


<bertopic._bertopic.BERTopic at 0x7f14b18ef700>

In [None]:
# Access the newly updated topics with:
print(topic_model.topics_)

[0, 3, 12, 36, -1, -1, -1, 0, 0, -1, -1, -1, -1, -1, -1, 12, -1, 5, -1, 6, 9, 15, -1, 6, 0, -1, 5, 3, 3, -1, 9, 5, 23, 0, -1, 11, -1, -1, 6, -1, -1, 41, 57, -1, 0, -1, -1, 7, 1, -1, -1, 9, 5, -1, -1, 6, -1, -1, 8, -1, 0, -1, -1, -1, -1, -1, 0, -1, -1, -1, -1, -1, 42, -1, -1, -1, 0, 24, -1, 0, 59, -1, -1, 57, 33, -1, -1, 20, -1, -1, 0, 2, 48, 3, 12, -1, -1, -1, 5, -1, -1, -1, -1, 4, 2, 0, -1, -1, -1, -1, 11, -1, 10, 15, 12, -1, -1, -1, 0, -1, -1, -1, -1, 21, -1, -1, -1, 2, -1, -1, -1, -1, -1, 0, 36, 2, 35, 1, -1, 16, -1, 11, -1, -1, -1, -1, 11, 14, 0, 7, 6, 6, 36, -1, -1, -1, 57, 4, 10, 3, -1, 2, 21, -1, -1, -1, -1, -1, -1, 26, 32, -1, -1, -1, 22, -1, 0, -1, 6, 0, -1, 0, -1, -1, -1, 2, -1, -1, 12, -1, -1, -1, 2, 18, -1, -1, 12, -1, -1, 14, -1, -1, -1, -1, -1, -1, 22, -1, 52, 18, 6, 46, -1, 3, 6, 5, -1, -1, 55, 2, 0, 30, -1, 9, 1, -1, 1, 6, 51, 0, 6, -1, -1, 39, 0, -1, -1, 0, 0, -1, -1, -1, 26, 13, -1, 7, 2, -1, -1, 28, -1, -1, 39, 3, -1, -1, 21, -1, 31, -1, -1, -1, -1, -1, 3, -1, 3, -1,
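A printed list of topic IDs is hard to read, so a frequency count is often more useful. The sketch below counts the first ten entries of the output above with `collections.Counter`; the variable names are illustrative.

```python
from collections import Counter

# First ten entries of the topics_ output above
sample = [0, 3, 12, 36, -1, -1, -1, 0, 0, -1]

sizes = Counter(sample)
n_outliers = sizes[-1]                          # documents assigned to topic -1
n_topics = len([t for t in sizes if t != -1])   # distinct non-outlier topics

print(sizes.most_common())
print(n_outliers, n_topics)
```

Applied to the full list, the same count yields the per-topic sizes, with the outlier topic -1 included.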

# **Search Topics**
After having trained our model, we can use `find_topics` to search for topics that are similar
to an input search term. Here, we search for topics that closely relate to the
search term "vehicle". Then, we extract the most similar topic and check the results:

In [None]:
similar_topics, similarity = topic_model.find_topics("vehicle", top_n=5); similar_topics

[9, 19, 17, 31, 47]
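Conceptually, this kind of search ranks topics by how close their embeddings are to the embedding of the search term. The sketch below illustrates the ranking step with toy two-dimensional vectors and cosine similarity; the function name `rank_topics`, the vectors, and the assumption that cosine similarity is the ranking measure are illustrative and do not reproduce the library's code.

```python
import math

def rank_topics(query_vec, topic_vecs, top_n=5):
    """Return (topic_ids, similarities) of the top_n topics closest to the query."""
    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

    scored = sorted(
        ((cosine(query_vec, vec), topic_id) for topic_id, vec in topic_vecs.items()),
        reverse=True,
    )[:top_n]
    return [topic_id for _, topic_id in scored], [score for score, _ in scored]

# Toy embeddings (made up): topic ID -> vector
toy_topics = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [0.9, 0.1]}
ids, sims = rank_topics([1.0, 0.0], toy_topics, top_n=2)
print(ids, [round(s, 3) for s in sims])
```

The first returned ID is the best match, which is the one to inspect next.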

In [None]:
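# Use the top match rather than a hard-coded ID; get_topic returns False for an ID that does not exist.
topic_model.get_topic(similar_topics[0])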

# **Model serialization**
The model and its internal settings can easily be saved. Note that the documents and embeddings will not be saved. However, UMAP and HDBSCAN will be saved. 

In [None]:
# Save model
topic_model.save("my_model")	


Changing the sparsity structure of a csr_matrix is expensive. lil_matrix is more efficient.



In [None]:
# Load model
my_model = BERTopic.load("my_model")	

# **Embedding Models**
The parameter `embedding_model` takes in a string pointing to a sentence-transformers model, a SentenceTransformer, or a Flair DocumentEmbedding model.

## Sentence-Transformers
You can select any model from sentence-transformers here and pass it through BERTopic with `embedding_model`:



In [None]:
topic_model = BERTopic(embedding_model="xlm-r-bert-base-nli-stsb-mean-tokens")

Or select a SentenceTransformer model with your own parameters:


In [None]:
from sentence_transformers import SentenceTransformer

sentence_model = SentenceTransformer("distilbert-base-nli-mean-tokens", device="cpu")
topic_model = BERTopic(embedding_model=sentence_model, verbose=True)

Click [here](https://www.sbert.net/docs/pretrained_models.html) for a list of supported sentence transformers models.  
