# **Tutorial** - Topic Modeling with BERTopic
(last updated 08-06-2021)

In this tutorial we explore how to use BERTopic to create topics from a dataset of tweets about working from home. We cover the most frequent use cases and methods, together with the important parameters to look out for.


## BERTopic
BERTopic is a topic modeling technique that leverages 🤗 transformers and a custom class-based TF-IDF to create dense clusters allowing for easily interpretable topics whilst keeping important words in the topic descriptions. 

<br>

<img src="https://raw.githubusercontent.com/MaartenGr/BERTopic/master/images/logo.png" width="40%">

# Enabling the GPU

First, you'll need to enable GPUs for the notebook:

- Navigate to Edit→Notebook Settings
- select GPU from the Hardware Accelerator drop-down

[Reference](https://colab.research.google.com/notebooks/gpu.ipynb)

# **Installing BERTopic**

We start by installing BERTopic from PyPI:

In [None]:
%%capture
!pip install bertopic

!pip install kaleido



#now this works:


In [None]:
import pandas as pd
from bertopic import BERTopic
import kaleido #required
kaleido.__version__ #0.2.1
import plotly
plotly.__version__ #5.5.0
import plotly.graph_objects as go

## Restart the Notebook
Installing BERTopic updates some packages that were already loaded. To use the updated versions correctly, we should now restart the notebook.

From the Menu:

Runtime → Restart Runtime

# Data


In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:

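meta = pd.read_csv('/content/drive/MyDrive/6000k/data.csv')
# Count rows whose column 5 holds a float (pandas reads missing text as float NaN)
# and collect the positions of all other rows
count = 0
index = []
for i in range(len(meta)):
    if type(meta.iloc[i, 5]) == float:
        count += 1
    else:
        index.append(i)
# Keep columns 1, 2 and 6 of the retained rows and rebuild a clean 0..n-1 index column
documents = meta.iloc[index, [1, 2, 6]]
documents = documents.reset_index()
documents["index"] = documents.index.values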


In [None]:
location = "Global"
# "Global" keeps every document; any other value keeps only rows from that location
if location != "Global":
    documents = documents[documents.location == location]
documents.shape

(8331, 4)

In [None]:
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [None]:
documents.head()

| | index | date | content | location |
|---|---|---|---|---|
| 0 | 0 | 2020-02-14 03:04:33+00:00 | ★💰 WORK FROM HOME💰 ★\n I'm Looking For Stay at... | New York |
| 1 | 1 | 2020-02-04 23:21:24+00:00 | LIMITED TIME! Sign Up for FREE\nEnds March 31.... | New York |
| 2 | 2 | 2020-03-31 21:21:25+00:00 | Catch our own Tiffany Joy Murchison @Ms_Tiffan... | New York |
| 3 | 3 | 2020-03-31 21:18:46+00:00 | clocking out is near - make sure to end of you... | New York |
| 4 | 4 | 2020-03-31 20:04:07+00:00 | 😎 she’s so cool... #lea #leainny #stayhome #go... | New York |


In [None]:
import re
documents.content = documents.apply(lambda row: re.sub(r"http\S+", "", row.content).lower(), 1)
documents.content = documents.apply(lambda row: " ".join(filter(lambda x:x[0]!="@", row.content.split())), 1)
#documents.content = documents.apply(lambda row: " ".join(re.sub("[^a-zA-Z0-9]+", " ", row.content).split()), 1)
stop = set(stopwords.words("english"))
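# Compare the whole token against the stopword set; testing only x[0] would drop
# every word that starts with a single-letter stopword such as "a", "i", "s" or "t"
documents.content = documents.apply(lambda row: " ".join(filter(lambda x: x not in stop, row.content.split())), 1)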
# documents.content = documents.apply(lambda row: " ".join([item for item in row.content.split() if item ]),1)
print(documents.head())
tweets = documents.content.to_list()
timestamps = documents.date.to_list()
topic_model = BERTopic(verbose=True)
topics, probs = topic_model.fit_transform(tweets)

   index                       date  \
0      0  2020-02-14 03:04:33+00:00   
1      1  2020-02-04 23:21:24+00:00   
2      2  2020-03-31 21:21:25+00:00   
3      3  2020-03-31 21:18:46+00:00   
4      4  2020-03-31 20:04:07+00:00   

                                             content  location  
0  ★💰 work from home💰 ★ looking for home &amp; li...  New York  
1  limited up for free ends 31. ★💰 work from home...  New York  
2  catch joy virtually (and literally) hanging (o...  New York  
3  clocking near - end right. how link. 📸 #wfh #s...  New York  
4  😎 cool... #lea #leainny #stayhome #goldenretri...  New York  



2022-05-23 09:36:06,585 - BERTopic - Transformed documents to Embeddings
2022-05-23 09:37:05,386 - BERTopic - Reduced dimensionality
2022-05-23 09:37:05,718 - BERTopic - Clustered reduced embeddings
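The cleaning steps in the cell above (strip URLs, lowercase, drop @mentions, drop stopwords) can be sketched as one small function that is easy to test on a single tweet. This is a minimal illustration, not the notebook's exact pipeline: the helper name `clean_tweet`, the example tweet, and the tiny stopword set are all assumptions made for the example (the notebook itself uses NLTK's English stopword list).

```python
import re

# Tiny illustrative stopword set (an assumption; the notebook uses NLTK's English list).
STOPWORDS = {"i", "a", "the", "to", "for", "and", "is", "am", "at", "from"}

URL_PATTERN = re.compile(r"http\S+")


def clean_tweet(text, stopwords=STOPWORDS):
    """Lowercase a tweet and drop URLs, @mentions and stopwords."""
    text = URL_PATTERN.sub("", text).lower()
    tokens = text.split()
    # Mentions are identified by their first character.
    tokens = [tok for tok in tokens if not tok.startswith("@")]
    # Stopwords are identified by comparing the whole token.
    tokens = [tok for tok in tokens if tok not in stopwords]
    return " ".join(tokens)


example = "Thanks @bob I am working from home today https://t.co/abc #wfh"
cleaned = clean_tweet(example)
print(cleaned)
```

Hashtags are kept on purpose here, since in this dataset they carry much of the topical signal.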


In [None]:
topics_over_time = topic_model.topics_over_time(tweets, topics, timestamps, nr_bins=20)

20it [00:02,  9.45it/s]
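To get a feel for what `nr_bins` does, the sketch below groups documents into equal-width time bins and counts how many documents of each topic fall into each bin. It only illustrates the binning idea: the function name `bin_topics_over_time` and the toy data are hypothetical, and BERTopic's own binning and per-bin topic representations may differ in detail.

```python
from collections import Counter
from datetime import datetime


def bin_topics_over_time(topics, timestamps, nr_bins):
    """Count documents per (bin, topic) using equal-width time bins."""
    start, end = min(timestamps), max(timestamps)
    span = (end - start).total_seconds()
    counts = Counter()
    for topic, ts in zip(topics, timestamps):
        if span == 0:
            bin_index = 0  # all documents share one timestamp
        else:
            fraction = (ts - start).total_seconds() / span
            # The latest timestamp would land in bin nr_bins, so clamp it.
            bin_index = min(int(fraction * nr_bins), nr_bins - 1)
        counts[(bin_index, topic)] += 1
    return counts


# Hypothetical toy data: four documents, two topics, two bins.
toy_timestamps = [
    datetime(2020, 2, 1),
    datetime(2020, 2, 15),
    datetime(2020, 3, 1),
    datetime(2020, 3, 31),
]
toy_topics = [0, 0, 1, 0]
counts = bin_topics_over_time(toy_topics, toy_timestamps, nr_bins=2)
print(counts)
```

More bins give a finer timeline but fewer documents per bin, which makes the per-bin topic frequencies noisier.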


In [None]:
a = topic_model.visualize_topics_over_time(topics_over_time, top_n_topics=10)

a.update_layout(
    autosize=False,
    title=("Topics Over Time"+" ("+location+")"),
)
a.show()
a.write_image("/content/drive/MyDrive/6000k/nostopword/topics_over_time_"+location+'.png',format='png')

In [None]:
from gensim.test.utils import common_corpus, common_dictionary
from gensim.models.coherencemodel import CoherenceModel

len(common_dictionary)

12

# Evaluate

In [None]:
a = len(topic_model.topic_names)
ll = []
for i in range(0,a-1):
    l = []
    topictmp = topic_model.get_topic(i)
    for j in topictmp:
      l.append(j[0])
    ll.append(l)
    
    # print(l)
   

# print(ll)

import gensim.corpora as corpora
from gensim.models.coherencemodel import CoherenceModel

# Preprocess documents
cleaned_docs = topic_model._preprocess_text(tweets)

# Extract vectorizer and tokenizer from BERTopic
vectorizer = topic_model.vectorizer_model
tokenizer = vectorizer.build_tokenizer()

# Extract features for Topic Coherence evaluation
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
words = vectorizer.get_feature_names_out()
tokens = [tokenizer(doc) for doc in cleaned_docs]
dictionary = corpora.Dictionary(tokens)
corpus = [dictionary.doc2bow(token) for token in tokens]
topic_words = ll

# Evaluate
coherence_model = CoherenceModel(topics=topic_words, 
                                 texts=tokens, 
                                 corpus=corpus,
                                 dictionary=dictionary, 
                                 coherence='u_mass')
coherence = coherence_model.get_coherence()
print(coherence)

-11.759588473339246
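To make the `u_mass` score less of a black box, here is a simplified, single-topic version of the UMass idea: for every pair of topic words, compare how often the two words appear in the same document with how often the higher-ranked word appears at all. The function `umass_coherence`, the `+1` smoothing and the toy documents are assumptions for illustration; gensim's implementation differs in smoothing and aggregation, so the numbers will not match the output above.

```python
import math
from itertools import combinations


def umass_coherence(topic_words, tokenized_docs, eps=1.0):
    """Simplified UMass coherence for one topic (words ordered most important first)."""
    doc_sets = [set(doc) for doc in tokenized_docs]

    def doc_freq(*words):
        # Number of documents that contain all of the given words.
        return sum(1 for doc in doc_sets if all(w in doc for w in words))

    scores = []
    for earlier, later in combinations(topic_words, 2):
        denom = doc_freq(earlier)
        if denom == 0:
            continue  # skip words that never occur in the corpus
        scores.append(math.log((doc_freq(earlier, later) + eps) / denom))
    return sum(scores) / len(scores) if scores else float("nan")


# Hypothetical toy corpus.
toy_docs = [
    ["wfh", "cat", "desk"],
    ["wfh", "cat"],
    ["wfh", "lunch"],
    ["lunch", "bread"],
]
coherent = umass_coherence(["wfh", "cat"], toy_docs)
incoherent = umass_coherence(["cat", "bread"], toy_docs)
print(coherent, incoherent)
```

Scores of this kind are usually zero or negative, and values closer to zero indicate topic words that co-occur more often in the same documents.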


## Human Judgement

In [None]:
topic_model.get_representative_docs()

{0: ['con being ‘arctic’ #workingfromhome',
  'what #workingfromhome 🤣',
  '#workingfromhome for ☀️'],
 1: ['when boss (cat) upset with #workingfromhome',
  'cat was riveted #wfh (he’s really enjoying wfh)',
  'else have wfh cat problem? #catproblems #wfh'],
 2: ['great #workingfromhome can what bake!⠀ baked banana bread 🍌 ⠀ better looks. 😊⠀ .⠀ .⠀ .⠀ .⠀ #baker #bakingfromscratch…',
  'perks #wfh fresh from veggie patch. kale, rocket, parsley, wonky carrots. #gardeningtwitter',
  '#workingfromhome &amp; got no prepare #lunch? 🏘 #spicevillage have got covered. enjoy 10% #order! \u2060#takeaway &amp; #delivery 🛎 𝐒𝐨𝐮𝐭𝐡𝐚𝐥𝐥 𝟎𝟐𝟎𝟖 𝟓𝟕𝟒𝟒 𝟒𝟕𝟓 𝐂𝐫𝐨𝐲𝐝𝐨𝐧 𝟎𝟐𝟎 𝟑𝟗𝟎𝟖 𝟕𝟎𝟕𝟏 𝐒𝐨𝐮𝐭𝐡𝐞𝐧𝐝 𝟎𝟏𝟕 𝟎𝟐𝟑𝟒 𝟎𝟗𝟕𝟎'],
 3: ['perspective from helpfully lays benefits culture. but comes financial cost, employers employees. #wfh',
  '46% respondents have place #wfh. juniors live conditions &amp; policies wfh becomes norm needs be',
  '#wfh business while #stayhome we part flatten curve.'],
 4: ['nostalgia comes us uncertainty &am

# **Topic Modeling**

In this example, we will go through the main components of BERTopic and the steps necessary to create a strong topic model. 




## Extracting Topics
After fitting our model, we can start by looking at the results. Typically, we look at the most frequent topics first as they best represent the collection of documents. 

In [None]:
freq = topic_model.get_topic_info();
freq.to_csv("/content/drive/MyDrive/6000k/nostopword/freq_"+location+".csv")
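As a rough cross-check on the frequency table, topic sizes can also be counted directly from the `topics` list returned by `fit_transform`. The sketch below uses a hypothetical toy list and a hypothetical helper `topic_frequencies`; it treats `-1` as the outlier label and reports the share of outlier documents separately.

```python
from collections import Counter


def topic_frequencies(topics):
    """Return (topic, count) pairs sorted by count, plus the share of outliers (-1)."""
    counts = Counter(topics)
    outliers = counts.pop(-1, 0)  # remove the outlier label from the ranking
    ranked = counts.most_common()
    outlier_share = outliers / len(topics) if topics else 0.0
    return ranked, outlier_share


# Hypothetical topic assignments for eight documents.
toy_topics = [0, 0, -1, 1, 0, 2, -1, 1]
ranked, outlier_share = topic_frequencies(toy_topics)
print(ranked, outlier_share)
```

A large outlier share is worth noting before interpreting the topics, because those documents contribute to none of them.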

Topic -1 refers to all outliers and should typically be ignored. Next, let's take a look at one of the frequent topics that were generated:

In [None]:
topic_model.get_topic(0)  # Select the most frequent topic

[('workingfromhome', 0.029300925531527175),
 ('love', 0.01511942184608348),
 ('great', 0.011829098594606346),
 ('back', 0.010313766264378096),
 ('but', 0.009609868614902088),
 ('work', 0.009599957510897353),
 ('been', 0.009562019243632395),
 ('not', 0.009053650692543947),
 ('for', 0.008801260952392349),
 ('know', 0.008669840271592144)]

**NOTE**: BERTopic is stochastic, which means that the topics might differ across runs. This is mostly due to the stochastic nature of UMAP.
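If reproducible topics matter more than speed, one common approach is to pass BERTopic a UMAP model with a fixed `random_state`. The sketch below is a hedged example: the helper name `build_reproducible_model` is hypothetical, and the UMAP settings are meant to mirror BERTopic's usual defaults, which is an assumption you may want to check against your installed version.

```python
def build_reproducible_model(seed=42):
    """Build a BERTopic model whose UMAP step uses a fixed random seed."""
    # Imported lazily so the helper can be defined before the heavy libraries load.
    from bertopic import BERTopic
    from umap import UMAP

    umap_model = UMAP(
        n_neighbors=15,      # assumed to match BERTopic's default
        n_components=5,      # assumed to match BERTopic's default
        min_dist=0.0,
        metric="cosine",
        random_state=seed,   # fixes the stochastic part of UMAP
    )
    return BERTopic(umap_model=umap_model, verbose=True)
```

Usage would look like `topic_model = build_reproducible_model()` followed by `topics, probs = topic_model.fit_transform(tweets)`. Fixing the seed typically makes UMAP slower, because it limits how much of the work can run in parallel.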

# **Visualization**
There are several visualization options available in BERTopic, namely the visualization of topics, probabilities and topics over time. Topic modeling is, to a certain extent, quite subjective. Visualizations help understand the topics that were created. 

## Visualize Terms

We can visualize the selected terms for a few topics by creating bar charts out of the c-TF-IDF scores for each topic representation. Insights can be gained from the relative c-TF-IDF scores between and within topics. Moreover, you can easily compare topic representations to each other.

In [None]:
b = topic_model.visualize_barchart(top_n_topics=12)

b.update_layout(
    autosize=False,
    title=("Topics Word Scores"+" ("+location+")"),
)
b.show()
b.write_image("/content/drive/MyDrive/6000k/nostopword/"+"topic_word_scores_"+location+".png")
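The scores shown in these bar charts come from c-TF-IDF, which treats all documents of a topic as one long document and weighs each term by how specific it is to that topic. The sketch below assumes the formulation "term frequency within the class, times log(1 + A / f)", where A is the average number of words per class and f is the term's frequency across all classes. The function `c_tf_idf` and the toy classes are illustrative, and the library's exact computation may differ between versions.

```python
import math
from collections import Counter


def c_tf_idf(class_tokens):
    """Score terms per class; class_tokens maps class id -> list of tokens."""
    term_counts = {c: Counter(tokens) for c, tokens in class_tokens.items()}

    # Frequency of each term across all classes.
    total_per_term = Counter()
    for counts in term_counts.values():
        total_per_term.update(counts)

    # Average number of words per class.
    avg_words = sum(len(t) for t in class_tokens.values()) / len(class_tokens)

    scores = {}
    for c, counts in term_counts.items():
        n_words = sum(counts.values())
        scores[c] = {
            term: (freq / n_words) * math.log(1 + avg_words / total_per_term[term])
            for term, freq in counts.items()
        }
    return scores


# Hypothetical toy classes: two topics with four tokens each.
toy_classes = {
    0: "cat wfh cat desk".split(),
    1: "lunch wfh bread lunch".split(),
}
scores = c_tf_idf(toy_classes)
top_word_topic_0 = max(scores[0], key=scores[0].get)
print(top_word_topic_0, scores[0])
```

In the toy data, "wfh" appears in both classes and therefore scores lower than class-specific words such as "cat" or "lunch", which is the behaviour that keeps topic descriptions distinctive.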

# **Search Topics**
After having trained our model, we can use `find_topics` to search for topics that are similar 
to an input search term. Here, we search for topics that closely relate to the 
search term "food". Then, we extract the most similar topic and check the results: 

In [None]:
similar_topics, similarity = topic_model.find_topics("food", top_n=5); similar_topics

[2, 8, 6, 56, 13]

In [None]:
topic_model.get_topic(2)

[('lunch', 0.024133588349627234),
 ('breakfast', 0.012646975841215647),
 ('and', 0.012146086839072165),
 ('wfh', 0.011439243673503848),
 ('with', 0.010594776200100733),
 ('for', 0.010390998875604934),
 ('of', 0.009669050194786857),
 ('food', 0.009576746430649097),
 ('the', 0.00922208200266935),
 ('to', 0.008955122393540241)]
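Conceptually, a search like this can be thought of as embedding the search term and ranking topics by cosine similarity to it. The sketch below shows only that ranking step on hypothetical two-dimensional vectors. The function names and the toy embeddings are assumptions, and this is not a description of BERTopic's internals.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def find_similar_topics(query_vec, topic_vecs, top_n=5):
    """Return the top_n topic ids and their similarities, best first."""
    scored = [(topic, cosine_similarity(query_vec, vec)) for topic, vec in topic_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    top = scored[:top_n]
    return [topic for topic, _ in top], [score for _, score in top]


# Hypothetical topic embeddings and query embedding.
toy_topic_vecs = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [1.0, 1.0]}
similar, sims = find_similar_topics([0.9, 0.1], toy_topic_vecs, top_n=2)
print(similar, sims)
```

The second return value is useful for deciding whether the best match is actually close or merely the least distant topic.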

# **Model serialization**
The model and its internal settings can easily be saved. Note that the documents and embeddings will not be saved. However, UMAP and HDBSCAN will be saved. 

In [None]:
# Save model
topic_model.save("/content/drive/MyDrive/6000k/my_model")	

In [None]:
# Load model
my_model = BERTopic.load("/content/drive/MyDrive/6000k/my_model")	


# **Embedding Models**
The parameter `embedding_model` takes in a string pointing to a sentence-transformers model, a SentenceTransformer, or a Flair DocumentEmbedding model.

## Sentence-Transformers
You can select any model from sentence-transformers and pass it to BERTopic via `embedding_model`:



In [None]:
topic_model = BERTopic(embedding_model="xlm-r-bert-base-nli-stsb-mean-tokens")

Or select a SentenceTransformer model with your own parameters:


In [None]:
from sentence_transformers import SentenceTransformer

sentence_model = SentenceTransformer("distilbert-base-nli-mean-tokens", device="cpu")
topic_model = BERTopic(embedding_model=sentence_model, verbose=True)

Click [here](https://www.sbert.net/docs/pretrained_models.html) for a list of supported sentence transformers models.  
