<a href="https://colab.research.google.com/github/Data4Wellness/BERTopic/blob/master/Hilmi_BERTopic.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

 
## BERTopic
BERTopic is a topic modeling technique that leverages transformers and a custom class-based TF-IDF to create dense clusters allowing for easily interpretable topics whilst keeping important words in the topic descriptions. 


# Enabling the GPU
First, you'll need to enable GPUs for the notebook:

- Navigate to Edit→Notebook Settings
- Select GPU from the Hardware Accelerator drop-down

[Reference](https://colab.research.google.com/notebooks/gpu.ipynb)
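
To confirm that the runtime actually has a GPU attached, a quick check along these lines can help. This is a minimal sketch that assumes the NVIDIA `nvidia-smi` tool is on the path whenever a GPU is present (usually the case on Colab); `gpu_available` is a hypothetical helper name, not part of BERTopic or Colab.

```python
import shutil
import subprocess


def gpu_available() -> bool:
    """Return True if `nvidia-smi` exists and exits successfully."""
    exe = shutil.which("nvidia-smi")
    if exe is None:
        return False  # tool not found, so most likely no NVIDIA GPU is attached
    try:
        result = subprocess.run([exe], capture_output=True, text=True, timeout=30)
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0


print("GPU detected" if gpu_available() else "No GPU detected")
```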

# **Installing BERTopic**

We start by installing BERTopic from PyPI:

In [1]:
!pip install bertopic
!pip install nltk



## Restart the Notebook
Installing BERTopic updates some packages that were already loaded, so we need to restart the notebook before they can be used correctly.

From the Menu:

Runtime → Restart Runtime

# Data
I will use the Kaggle Glassdoor Employee Review Dataset, which contains roughly 33000 positive and 30000 negative reviews.

In [2]:
from google.colab import drive
drive.mount("/content/gdrive")

Mounted at /content/gdrive


In [3]:
import pandas as pd
train_df = pd.read_csv('/content/gdrive/My Drive/BERTopic_glassdoor/train.csv')

In [4]:
pos_docs = [review for review in train_df['positives']]
pos_docs[:5]

['People are smart and friendly',
 '1) Food, food, food. 15+ cafes on main campus (MTV) alone. Mini-kitchens, snacks, drinks, free breakfast/lunch/dinner, all day, errr\'day.  2) Benefits/perks. Free 24:7 gym access (on MTV campus). Free (self service) laundry (washer/dryer) available. Bowling alley. Volley ball pit. Custom-built and exclusive employee use only outdoor sport park (MTV). Free health/fitness assessments. Dog-friendly. Etc. etc. etc.  3) Compensation. In ~2010 or 2011, Google updated its compensation packages so that they were more competitive.  4) For the size of the organization (30K+), it has remained relatively innovative, nimble, and fast-paced and open with communication but, that is definitely changing (for the worse).  5) With so many departments, focus areas, and products, *in theory*, you should have plenty of opportunity to grow your career (horizontally or vertically). In practice, not true.  6) You get to work with some of the brightest, most innovative and h

In [5]:
len(pos_docs)

30336

In [6]:
import nltk
nltk.download('stopwords')
nltk.download('punkt')
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
print(stopwords.words('english'))
stop_words = set(stopwords.words('english'))

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'ag

In [7]:
def clean_stopwords(sentence):
  # Split the review into tokens, then drop English stop words (case-insensitive)
  word_tokens = word_tokenize(sentence)
  filtered_sentence = [w for w in word_tokens if w.lower() not in stop_words]
  # Join the remaining tokens back into a single string
  return ' '.join(filtered_sentence)

In [8]:
pos_docs_clean = [clean_stopwords(sentence) for sentence in pos_docs]
pos_docs_clean[:5]

['People smart friendly',
 "1 ) Food , food , food . 15+ cafes main campus ( MTV ) alone . Mini-kitchens , snacks , drinks , free breakfast/lunch/dinner , day , errr'day . 2 ) Benefits/perks . Free 24:7 gym access ( MTV campus ) . Free ( self service ) laundry ( washer/dryer ) available . Bowling alley . Volley ball pit . Custom-built exclusive employee use outdoor sport park ( MTV ) . Free health/fitness assessments . Dog-friendly . Etc . etc . etc . 3 ) Compensation . ~2010 2011 , Google updated compensation packages competitive . 4 ) size organization ( 30K+ ) , remained relatively innovative , nimble , fast-paced open communication , definitely changing ( worse ) . 5 ) many departments , focus areas , products , *in theory* , plenty opportunity grow career ( horizontally vertically ) . practice , true . 6 ) get work brightest , innovative hard-working/diligent minds industry . 's `` con '' , ( see ) .",
 "* 're software engineer , 're among kings hill Google . 's engineer-driven co

In [9]:
len(pos_docs_clean)

30336
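
The cleaned output above still contains standalone punctuation and contraction fragments such as `'s` and `'re`. If those turn out to be a problem, a stricter token filter could look roughly like the sketch below. It is self-contained for illustration only: `STOP_WORDS` is a tiny stand-in for NLTK's English list, the input is assumed to be already tokenized, and `keep_token` and `clean_tokens` are hypothetical helper names.

```python
import re

# Tiny stand-in for NLTK's English stop-word list, so the sketch runs on its own.
STOP_WORDS = {"are", "and", "the", "is", "it"}

# A token is kept only if it contains at least one letter or digit.
HAS_ALNUM = re.compile(r"[A-Za-z0-9]")


def keep_token(token: str) -> bool:
    """Drop stop words, pure punctuation, and contraction fragments like 's or n't."""
    if token.lower() in STOP_WORDS:
        return False
    if token.startswith("'") or token.lower() == "n't":
        return False
    return bool(HAS_ALNUM.search(token))


def clean_tokens(tokens):
    """Join the surviving tokens back into a single string."""
    return " ".join(t for t in tokens if keep_token(t))


tokens = ["People", "are", "smart", "and", "friendly", ".", "It", "'s", "great", "!"]
print(clean_tokens(tokens))  # People smart friendly great
```

Whether this extra filtering helps is a judgment call, since the embeddings are computed on the cleaned text and removing too much can strip useful context.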

# **Topic Modeling**

We will go through the main components of BERTopic and the steps necessary to create a strong topic model. 




## Training

We start by instantiating BERTopic. We set language to `english` since our documents are in the English language. If you would like to use a multi-lingual model, please use `language="multilingual"` instead. 

We will also calculate the topic probabilities. However, this can slow down BERTopic significantly on large datasets (>100_000 documents), so it is advisable to turn it off if you want to speed up the model. 


In [10]:
from bertopic import BERTopic

topic_model = BERTopic(language="english",
                       top_n_words=15,
                       n_gram_range=(1, 2),
                       min_topic_size=30,
                       nr_topics='auto',
                       calculate_probabilities=True, 
                       verbose=True)
topics, probs = topic_model.fit_transform(pos_docs_clean)

2021-08-02 02:24:13,548 - BERTopic - Transformed documents to Embeddings





2021-08-02 02:24:55,641 - BERTopic - Reduced dimensionality with UMAP
2021-08-02 02:25:18,167 - BERTopic - Clustered UMAP embeddings with HDBSCAN
2021-08-02 02:25:27,710 - BERTopic - Reduced number of topics from 120 to 69


## Extracting Topics
After fitting our model, we can start by looking at the results. Typically, we look at the most frequent topics first as they best represent the collection of documents. 

In [11]:
freq = topic_model.get_topic_info()
freq

| | Topic | Count | Name |
|---|---|---|---|
| 0 | -1 | 11983 | -1_smart_environment_opportunities_pay |
| 1 | 0 | 2393 | 0_amazon_apple_working_employees |
| 2 | 1 | 1202 | 1_team_teams_great team_team great |
| 3 | 2 | 1173 | 2_learn_technology_learning_cutting edge |
| 4 | 3 | 1082 | 3_microsoft_software_microsoft great_working m... |
| ... | ... | ... | ... |
| 64 | 63 | 39 | 63_diversity_diversity work_work diversity_div... |
| 65 | 64 | 39 | 64_smart people_company smart_people company_g... |
| 66 | 65 | 36 | 65_solve_problems_problems solve_interesting p... |
| 67 | 66 | 35 | 66_shift_shifts_night shift_hour shifts |


`-1` refers to all outliers and should typically be ignored. Next, let's take a look at one of the frequent topics that were generated:

In [12]:
topic_model.get_topic(0)  # Select the most frequent topic

[('amazon', 0.025349582300605685),
 ('apple', 0.02104518097199698),
 ('working', 0.006237041576259761),
 ('employees', 0.005483538023302047),
 ('nt', 0.005240492488007998),
 ('customer', 0.005055908323604561),
 ('make', 0.004720758062657242),
 ('job', 0.004558217732306532),
 ('working apple', 0.004544142427626432),
 ('always', 0.004473779642419827),
 ('customers', 0.004322257419905556),
 ('company', 0.004228831661091897),
 ('apple products', 0.004127444760637589),
 ('working amazon', 0.004074478483798206),
 ('part', 0.0040404287041674635)]

In [13]:
topic_model.get_topic(1)

[('team', 0.04571927591767117),
 ('teams', 0.03469095795302661),
 ('great team', 0.020126547257301244),
 ('team great', 0.010665687029080414),
 ('team work', 0.010637824430103643),
 ('good team', 0.009487999196297942),
 ('teammates', 0.008224659791208865),
 ('team members', 0.007670258627455578),
 ('teamwork', 0.006908714224615448),
 ('projects', 0.006718949447773446),
 ('team environment', 0.006606031362118422),
 ('environment', 0.006383055833453011),
 ('team good', 0.0062837390232654885),
 ('teams good', 0.005719284778164232),
 ('teams great', 0.004860395802974622)]

In [14]:
topic_model.get_topic(2)

[('learn', 0.0297567369326339),
 ('technology', 0.027079729982007042),
 ('learning', 0.022519446139429546),
 ('cutting edge', 0.015823729776352796),
 ('technologies', 0.015249119765251977),
 ('innovation', 0.013179695536923362),
 ('learn lot', 0.01279227657245076),
 ('smart', 0.011341732664270732),
 ('smart people', 0.011228965333637562),
 ('tech', 0.010896395116708585),
 ('edge technology', 0.010523078981258377),
 ('place learn', 0.009028332808253494),
 ('learn new', 0.0073267592595787825),
 ('great learning', 0.00699621492989919),
 ('good learning', 0.006806450496027125)]

**NOTE**: BERTopic is stochastic, which means that the topics might differ across runs. This is mostly due to the stochastic nature of UMAP.
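
Because the outlier topic `-1` is usually ignored, it is worth checking how much of the corpus it absorbs. The sketch below counts that share from a plain list of topic assignments; `outlier_share` is a hypothetical helper, and the toy list stands in for the `topics` returned by `fit_transform()`.

```python
from collections import Counter


def outlier_share(topics):
    """Return the fraction of documents assigned to the outlier topic -1."""
    if not topics:
        return 0.0
    counts = Counter(topics)
    return counts[-1] / len(topics)


# Toy assignments standing in for the `topics` list returned by fit_transform().
toy_topics = [-1, 0, 0, 1, -1, 2, 0, -1]
print(f"{outlier_share(toy_topics):.1%} of documents are outliers")  # 37.5% of documents are outliers
```

In the run shown above, 11983 of the 30336 documents (about 39.5%) were assigned to `-1`.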

# **Visualization**
There are several visualization options available in BERTopic, namely the visualization of topics, probabilities and topics over time. Topic modeling is, to a certain extent, quite subjective. Visualizations help understand the topics that were created. 

## Visualize Topics
After training our `BERTopic` model, we could go through perhaps a hundred topics one by one to get a good 
understanding of the topics that were extracted. However, that takes quite some time and lacks a global representation. 
Instead, we can visualize the generated topics in a way very similar to 
[LDAvis](https://github.com/cpsievert/LDAvis):

In [15]:
topic_model.visualize_topics()

## Visualize Topic Probabilities

The probabilities returned from `transform()` or `fit_transform()` (stored in `probs` above) can 
be used to understand how confident BERTopic is that certain topics can be found in a document. 

To visualize the distributions, we simply call:

In [16]:
topic_model.visualize_distribution(probs[0], min_probability=0.02)
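
Conceptually, `min_probability` hides topics whose probability for the document falls below the threshold. The sketch below mimics that filtering on a plain list. It is an illustration only, not BERTopic's implementation: `top_topics` is a hypothetical helper, the probabilities are made up, and treating the threshold as inclusive is an assumption.

```python
def top_topics(doc_probs, min_probability=0.02):
    """Return (topic_id, probability) pairs at or above the threshold, highest first."""
    pairs = [(topic_id, p) for topic_id, p in enumerate(doc_probs) if p >= min_probability]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


# Made-up probabilities for one document over five topics.
toy_probs = [0.01, 0.40, 0.03, 0.00, 0.25]
print(top_topics(toy_probs))  # [(1, 0.4), (4, 0.25), (2, 0.03)]
```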

## Visualize Topic Hierarchy

The topics that were created can be hierarchically reduced. To understand the potential hierarchical structure of the topics, we can use `scipy.cluster.hierarchy` to create clusters and visualize how they relate to one another. This might help in selecting an appropriate `nr_topics` when reducing the number of topics that you have created.

In [20]:
topic_model.visualize_hierarchy(top_n_topics=69)

In [21]:
topic_model.visualize_hierarchy(top_n_topics=30)

## Visualize Terms

We can visualize the selected terms for a few topics by creating bar charts out of the c-TF-IDF scores for each topic representation. Insights can be gained from the relative c-TF-IDF scores between and within topics. Moreover, you can easily compare topic representations to each other.

In [22]:
topic_model.visualize_barchart(top_n_topics=15)
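
To make the c-TF-IDF scores behind these bar charts more concrete, here is a small pure-Python sketch. It assumes one published formulation, `tf(t, c) * log(1 + A / f(t))`, where `tf` is the term frequency within a class, `A` the average number of words per class, and `f(t)` the frequency of the term across all classes. The exact weighting used by the installed BERTopic version may differ, and `c_tf_idf` is a hypothetical helper name.

```python
import math
from collections import Counter


def c_tf_idf(class_docs):
    """Score terms per class with tf(t, c) * log(1 + A / f(t)).

    class_docs maps a class (topic) id to the tokens of all documents
    in that class, concatenated into one "class document".
    """
    term_counts = {c: Counter(tokens) for c, tokens in class_docs.items()}
    total_freq = Counter()
    for counts in term_counts.values():
        total_freq.update(counts)
    avg_words = sum(total_freq.values()) / len(class_docs)  # A: average words per class

    scores = {}
    for c, counts in term_counts.items():
        n_words = sum(counts.values())
        scores[c] = {
            term: (count / n_words) * math.log(1 + avg_words / total_freq[term])
            for term, count in counts.items()
        }
    return scores


# Toy class documents; real ones would come from joining all reviews per topic.
toy = {
    0: "team team great team work".split(),
    1: "learn technology great learn".split(),
}
for topic_id, term_scores in c_tf_idf(toy).items():
    best = max(term_scores, key=term_scores.get)
    print(topic_id, best)
```

Note how `great`, which appears in both toy classes, scores lower than the class-specific terms: words shared across classes are down-weighted.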

## Visualize Topic Similarity
Having generated topic embeddings, through both c-TF-IDF and embeddings, we can create a similarity matrix by computing the cosine similarity between every pair of topic embeddings. The result is a matrix indicating how similar the topics are to each other.

In [23]:
topic_model.visualize_heatmap(n_clusters=20, width=1000, height=1000)
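
The similarity matrix behind this heatmap can be sketched in a few lines. The example below computes pairwise cosine similarities for made-up three-dimensional vectors that stand in for topic embeddings; `cosine_similarity_matrix` is a hypothetical helper, not a BERTopic function.

```python
import math


def cosine_similarity_matrix(vectors):
    """Return the pairwise cosine similarity matrix for equal-length vectors."""
    norms = [math.sqrt(sum(x * x for x in v)) for v in vectors]
    matrix = []
    for i, u in enumerate(vectors):
        row = []
        for j, v in enumerate(vectors):
            dot = sum(a * b for a, b in zip(u, v))
            denom = norms[i] * norms[j]
            # Guard against zero vectors, whose similarity is undefined.
            row.append(dot / denom if denom else 0.0)
        matrix.append(row)
    return matrix


# Made-up 3-dimensional "topic embeddings".
toy_embeddings = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0], [0.0, 0.0, 2.0]]
for row in cosine_similarity_matrix(toy_embeddings):
    print([round(x, 3) for x in row])
```

The matrix is symmetric with ones on the diagonal; orthogonal vectors, such as the first and third here, have a similarity of zero.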

## Visualize Term Score Decline
Topics are represented by a number of words, starting with the most representative word. Each word has a c-TF-IDF score: the higher the score, the more representative the word is of the topic. Since the topic words are sorted by their c-TF-IDF score, the scores slowly decline with each word that is added. At some point, adding words to the topic representation only marginally increases the total c-TF-IDF score and does not benefit the representation.

To visualize this effect, we can plot the c-TF-IDF scores for each topic by the term rank of each word. The x-axis shows the position of each word (its term rank), where the word with the highest c-TF-IDF score has rank 1, and the y-axis shows the c-TF-IDF scores. The result is a visualization of how the c-TF-IDF score declines as words are added to the topic representation. It lets you use the elbow method to select the best number of words in a topic.


In [24]:
topic_model.visualize_term_rank()
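
The elbow can also be estimated numerically instead of by eye. The sketch below uses one simple heuristic among several: it picks the rank whose score lies furthest below the straight line joining the first and last score. `elbow_rank` is a hypothetical helper and the scores are made up.

```python
def elbow_rank(scores):
    """Return the 1-based rank with the largest gap below the straight line
    joining the first and last score (a simple elbow heuristic)."""
    n = len(scores)
    if n < 3:
        return n
    first, last = scores[0], scores[-1]
    best_rank, best_gap = 1, 0.0
    for i, score in enumerate(scores):
        # Height of the line from the first to the last point at position i.
        chord = first + (last - first) * i / (n - 1)
        gap = chord - score
        if gap > best_gap:
            best_rank, best_gap = i + 1, gap
    return best_rank


# Made-up c-TF-IDF scores, sorted from the best to the worst term.
toy_scores = [0.050, 0.020, 0.010, 0.008, 0.007, 0.006]
print(elbow_rank(toy_scores))  # 3
```

For these toy scores the curve flattens after the third word, so keeping three words per topic would be a reasonable starting point.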

# **Topic Representation**
After having created the topic model, you might not be satisfied with some of the parameters you have chosen. Fortunately, BERTopic allows you to update the topics after they have been created. 

This allows for fine-tuning the model to your specifications and wishes. 

## Update Topics
When you have trained a model and viewed the topics and the words that represent them,
you might not be satisfied with the representation. Perhaps you forgot to remove
stopwords or you want to try out a different `n_gram_range`. We can use the function `update_topics` to update 
the topic representation with new parameters for `c-TF-IDF`: 


In [25]:
topic_model.update_topics(pos_docs_clean, topics, n_gram_range=(1, 3))

In [26]:
topic_model.get_topic(0) #The most frequent topic is about Amazon

[('amazon', 0.020718999347162665),
 ('apple', 0.016684284141934645),
 ('working', 0.005419892553384345),
 ('company', 0.004575293245689572),
 ('employees', 0.004438056600403737),
 ('customer', 0.003892189203172921),
 ('job', 0.00375597692786629),
 ('make', 0.0037087119856386093),
 ('retail', 0.0035621107871997636),
 ('always', 0.003499894287425272),
 ('lot', 0.003482448111654291),
 ('experience', 0.0033820754649375503),
 ('working apple', 0.0033070550049802386),
 ('customers', 0.003293725979774991),
 ('well', 0.003112837626149108)]

In [27]:
topic_model.get_topic(1)  # The second most frequent topic seems to be about teams and teamwork

[('team', 0.037632281605950994),
 ('teams', 0.02672178253878484),
 ('great team', 0.014782503837503383),
 ('team work', 0.007743087508673815),
 ('team great', 0.007725993568612873),
 ('good team', 0.006869503202239288),
 ('teammates', 0.005929157689289428),
 ('members', 0.005837492113478688),
 ('team members', 0.005550627313082832),
 ('projects', 0.005246344241947346),
 ('good', 0.005208186921490718),
 ('teamwork', 0.004980492459003119),
 ('team environment', 0.004743027606672083),
 ('team good', 0.004528158957700133),
 ('teams good', 0.00408942187767159)]

## Topic Reduction
We can also reduce the number of topics after having trained a BERTopic model. The advantage of doing so 
is that you can decide on the number of topics after knowing how many were actually created. It is difficult to 
predict before training your model how many topics are in your documents and how many will be extracted. 
Instead, we can decide afterwards how many topics seem realistic:





In [28]:
new_topics, new_probs = topic_model.reduce_topics(pos_docs_clean, topics, probs, nr_topics=30)

2021-08-02 02:34:33,137 - BERTopic - Reduced number of topics from 69 to 31


In [29]:
topic_model.visualize_topics()

In [30]:
topic_model.visualize_distribution(probs[1], min_probability=0.005)

In [31]:
topic_model.visualize_hierarchy(top_n_topics=31)

In [32]:
topic_model.visualize_barchart(top_n_topics=10)

In [33]:
topic_model.visualize_heatmap(n_clusters=20, width=1000, height=1000)

In [34]:
topic_model.visualize_term_rank()

# **Search Topics**
After having trained our model, we can use `find_topics` to search for topics that are similar 
to an input search term. Here, we search for topics that closely relate to the 
search term "management". Then, we extract the most similar topics and check the results: 

In [35]:
similar_topics, similarity = topic_model.find_topics("management", top_n=5)
similar_topics

[10, 3, 26, 15, 4]

In [36]:
topic_model.get_topic(10)

[('leadership', 0.04882687021618198),
 ('leadership principles', 0.015712875747383073),
 ('freedom', 0.014027363544502014),
 ('company', 0.012731540831753032),
 ('ownership', 0.010581091174048592),
 ('autonomy', 0.009399557035055133),
 ('great leadership', 0.008601862674013036),
 ('ceo', 0.008436248361676477),
 ('management', 0.008034856265700966),
 ('strong', 0.007349595582617356),
 ('benefits', 0.007209302494412388),
 ('team', 0.0061434787209351084),
 ('employees', 0.005728504727240912),
 ('managers', 0.005534511609870708),
 ('leaders', 0.004910149838611127)]

In [37]:
topic_model.get_topic(3)

[('microsoft', 0.04791339811348853),
 ('software', 0.013793895878953407),
 ('company', 0.010043808127899715),
 ('work', 0.008674319952738766),
 ('benefits', 0.007784837920566532),
 ('working', 0.0070722315108859846),
 ('microsoft great', 0.0068809560452056506),
 ('cloud', 0.006328292556227155),
 ('employees', 0.005516336365126923),
 ('working microsoft', 0.0054830521328258),
 ('technology', 0.005159173181450926),
 ('career', 0.0051129457473592455),
 ('smart', 0.004883151765706373),
 ('opportunities', 0.004759077723464276),
 ('industry', 0.0047316048754270324)]

In [38]:
topic_model.get_topic(26)

[('company', 0.05344951242870507),
 ('growing', 0.03047459910850824),
 ('growth', 0.03005956343170202),
 ('growing company', 0.015906870612776382),
 ('opportunity growth', 0.011212608987759904),
 ('within company', 0.010885157998005492),
 ('company growing', 0.010660222972494854),
 ('business', 0.010608721365366345),
 ('company lots', 0.01031761297069784),
 ('big company', 0.010230984682707653),
 ('large company', 0.009959737677343063),
 ('industry', 0.00918238490662474),
 ('company growth', 0.008435223668243565),
 ('grow company', 0.007884680270805237),
 ('company provides', 0.00743341432099657)]

# **Model serialization**
The model and its internal settings can easily be saved. Note that the documents and embeddings will not be saved. However, UMAP and HDBSCAN will be saved. 

In [23]:
# Save model
#topic_model.save("my_model")	

In [24]:
# Load model
#my_model = BERTopic.load("my_model")	

# **Embedding Models**
The parameter `embedding_model` takes in a string pointing to a sentence-transformers model, a SentenceTransformer, or a Flair DocumentEmbedding model.

## Sentence-Transformers
You can select any sentence-transformers model and pass it to BERTopic through `embedding_model`:



In [25]:
topic_model = BERTopic(embedding_model="xlm-r-bert-base-nli-stsb-mean-tokens")

Or select a SentenceTransformer model with your own parameters:


In [26]:
from sentence_transformers import SentenceTransformer

sentence_model = SentenceTransformer("distilbert-base-nli-mean-tokens", device="cpu")
topic_model = BERTopic(embedding_model=sentence_model, verbose=True)

Click [here](https://www.sbert.net/docs/pretrained_models.html) for a list of supported sentence transformers models.  
