# Chapter 18: Topic Modelling (BERTopic)
### Roman Egger
##### <italic>Salzburg University of Applied Sciences - Department: Innovation and Management in Tourism</italic>

---

In this Jupyter Notebook, we will do a complete topic modelling walkthrough with an Airbnb dataset using BERTopic.

The dataset from which we will extract topics was crawled by the author and contains 2890 descriptions of Airbnb Experiences from the following European cities: Amsterdam, Athens, Berlin, Brussels, Copenhagen, Helsinki, London, Madrid, Oslo, Paris, Prague, Rome, Stockholm, Vienna and Warsaw. 
Open the dataset (csv) [here](data/Airbnb_total.csv)

<img src="data/paris.jpg">

---
[See an example of such an Airbnb Experience](https://www.airbnb.com/experiences/356769?currentTab=experience_tab&federatedSearchId=9297b301-0091-433b-899d-0bcda11332a9&searchId=&sectionId=704c8a0a-1f93-4442-b6b5-44b52d817c5b&source=p2)


---
### We will go through the following steps:
* #### Data Preparation & Preprocessing

<hr>

Acknowledgement:<br>
This notebook is based on the [BERTopic Project by Maarten Grootendorst](https://maartengr.github.io/BERTopic/)
<br>
[Related Medium-Post](https://towardsdatascience.com/interactive-topic-modeling-with-bertopic-1ea55e7d73d8)



In [1]:
#!pip install bertopic[visualization]

# **Prepare data**

In [2]:
# Let's import the modules needed and load the Airbnb dataset.
from bertopic import BERTopic
import pandas as pd
import os 
import umap
from nltk.corpus import stopwords
import spacy
nlp = spacy.load('en_core_web_sm')
from  plotting_utils import *
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
import matplotlib as mpl

docs = pd.read_csv(r'./data/Airbnb_total.csv', sep=";", encoding="utf-8")

In [3]:
docs=docs.dropna(subset=["Todo"]).reset_index(drop=True)

In [4]:
docs = docs.drop(columns=['ID', 'ID.1'])

In [5]:
# Lower case
docs['prep'] = docs['Todo'].str.lower()
# Remove square brackets and text in square brackets
# (regex=True is required: recent pandas versions treat the pattern as a literal string by default)
regex = r"\[.*?\]"
docs['prep'] = docs['prep'].str.replace(regex, '', regex=True)
# Remove punctuation
regex = r'[^\w\s]'
docs['prep'] = docs['prep'].str.replace(regex, '', regex=True)
# Remove words containing numbers
regex = r"([A-Za-z]+[\d@]+[\w@]*|[\d@]+[A-Za-z]+[\w@]*)"
docs['prep'] = docs['prep'].str.replace(regex, '', regex=True)
# Remove stopwords and words shorter than three characters
stop = set(stopwords.words('english'))
docs['prep'] = docs['prep'].apply(lambda x: ' '.join([word for word in x.split() if word not in stop and len(word) > 2]))

# Lemmatize each description with spaCy
def lemmatizer(text):        
    sent = []
    doc = nlp(text)
    for word in doc:
        sent.append(word.lemma_)
    return " ".join(sent)
docs["lemmatized"] =  docs.apply(lambda x: lemmatizer(x['prep']), axis=1)
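
To see what the cleaning steps above do to a single description, here is a minimal standard-library sketch that applies the same regular expressions in the same order. The `STOP_WORDS` set and the sample sentence are made up for illustration; the notebook itself uses NLTK's full English stop-word list and adds spaCy lemmatization afterwards.

```python
import re

# Tiny illustrative stop-word list (assumption); the notebook uses NLTK's list.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "we", "will", "you", "for", "at"}


def clean(text):
    """Apply the notebook's cleaning steps, in the same order, to one string."""
    text = text.lower()
    text = re.sub(r"\[.*?\]", "", text)  # drop [bracketed] passages
    text = re.sub(r"[^\w\s]", "", text)  # drop punctuation
    # Drop tokens that mix letters and digits, e.g. "2nd" or "3hour"
    text = re.sub(r"([A-Za-z]+[\d@]+[\w@]*|[\d@]+[A-Za-z]+[\w@]*)", "", text)
    # Drop stop words and tokens of one or two characters
    return " ".join(w for w in text.split() if w not in STOP_WORDS and len(w) > 2)


sample = "We will meet at the 2nd bridge [see map] and enjoy a 3-hour boat tour!"
print(clean(sample))  # → meet bridge enjoy boat tour
```

Note that the order of the steps matters: because punctuation is removed first, "3-hour" becomes "3hour" and is then dropped entirely by the letters-and-digits rule.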

In [6]:
docs.head()

Unnamed: 0,City,Todo,prep,lemmatized
0,Amsterdam,First of all we want to thank you all for choo...,first want thank choosing experience proud ann...,first want thank choose experience proud annou...
1,Amsterdam,We will have an exclusive Morning boat tour th...,exclusive morning boat tour amsterdam canals c...,exclusive morning boat tour amsterdam canals c...
2,Amsterdam,*PLEASE NOTE THIS IS A FREE TOUR CONCEPT*\n(1 ...,please note free tour concept euro ensure spot...,please note free tour concept euro ensure spot...
3,Amsterdam,"For our Winter Warmer Premium Cruise, we invit...",winter warmer premium cruise invite hour allin...,winter warm premium cruise invite hour allincl...
4,Amsterdam,"I am a social media photographer, and I would ...",social media photographer would love take tour...,social medium photographer would love take tou...


# **Create Topics**
English is the default setting. However, BERTopic also supports multilingual corpora covering more than 50 languages. In that case, change "english" to "multilingual".

In [7]:
model = BERTopic(language="english")
topics, probs = model.fit_transform(docs["lemmatized"])

TypeError: __init__() got an unexpected keyword argument 'low_memory'

In [None]:
# Run this to see which languages are supported
#from bertopic import languages
#print(languages)

In [8]:
model.get_topics()

NameError: name 'model' is not defined

We can then extract the most frequent topics:

In [None]:
model.get_topic_freq().head(10)

Topic -1 collects all outliers and should typically be ignored. Next, let's take a look at the most frequent topic that was generated:

In [None]:
model.get_topic(0)[:10]
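
The frequency table and the outlier convention can be illustrated without a trained model. The sketch below uses a hypothetical list of topic assignments (one entry per document, with -1 marking outliers) and counts them with the standard library; it only mimics the idea behind the frequency table and is not how BERTopic computes it internally.

```python
from collections import Counter

# Hypothetical topic assignments, one per document (-1 marks outliers).
toy_topics = [0, 2, -1, 0, 1, 0, -1, 2, 0, 1]

freq = Counter(toy_topics)

# Rank topics by size, leaving the outlier bucket aside.
ranked = [(topic, count) for topic, count in freq.most_common() if topic != -1]
print(ranked)

# Share of documents that were not assigned to any topic.
outlier_share = freq[-1] / len(toy_topics)
print(f"outliers: {outlier_share:.0%}")
```

A high outlier share is worth checking before interpreting the remaining topics, since those documents do not contribute to any topic description.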

In [None]:
model.find_topics("experience")

Note that the model is stochastic, which means that the topics might differ across runs. 

For a full list of supported languages, print `languages` as shown in the commented-out cell above.

# **Embedding model**
You can select any model from `sentence-transformers` and use it instead of the preselected models by passing its name to  
BERTopic via `embedding_model`:

In [None]:
st_model = BERTopic(embedding_model="xlm-r-bert-base-nli-stsb-mean-tokens")

Click [here](https://www.sbert.net/docs/pretrained_models.html) for a list of supported sentence transformers models.  


# **Visualize Topics**
After having trained our `BERTopic` model, we can iteratively go through perhaps a hundred topics to get a good 
understanding of the topics that were extracted. However, that takes quite some time and lacks a global representation. 
Instead, we can visualize the topics that were generated in a way very similar to 
[LDAvis](https://github.com/cpsievert/LDAvis):

In [None]:
model.visualize_topics()

Please display a list of the available cities here, so that the topics and the visualization can subsequently be shown for a specific city (e.g. Warsaw).


# **Visualize documents - use dropdown to switch between cities**

In [None]:
# Update data frame with topic id
docs['BERT_Topic'] = topics

In [None]:
# Update data frame with topic keywords
tdict =  model.get_topics()
docs['BERT_Topic_Keywords'] = docs['BERT_Topic'].apply(lambda x: [i[0] for i in tdict[x]])

In [None]:
# Get document vectors (tfidf)
docs["splited"] = docs["Todo"].map(lambda x: x.split())
text_string = [' '.join(d) for d in docs['splited'].tolist()]
n_features=10000
tfidf_vectorizer = TfidfVectorizer(max_df=0.95, min_df=2, max_features=n_features, ngram_range=(1,2), stop_words='english')
tfidf = tfidf_vectorizer.fit_transform(text_string)


In [None]:
# Add topic_string column to conform with the plotting function
docs['topic_string'] = topics
# UMAP embedding
umap_embr = umap.UMAP(n_neighbors=10, metric='cosine', min_dist=0.1, init='random', random_state=42)
embedding = umap_embr.fit_transform(tfidf.toarray())
embedding = pd.DataFrame(embedding, columns=['x','y'])
docs = pd.concat([docs, embedding], axis=1)

# Visualize with custom function
plot_main(docs, num_topics=np.unique(topics), save_name='results/BERT_topics.html', model= 'BERT')

In [None]:
# # Get the topic most similar to each search term (city name)
# countries = docs.City.unique()
# country_topics = {country: model.find_topics(country)[0][0] for country in countries}
# country_topics

In [None]:
# # Plot only these topics that refer to cities
# from sklearn.preprocessing import MinMaxScaler
# import umap
# import numpy as np

# topic_list = sorted(list(country_topics.values()))
# topic_list = np.unique(topic_list)
# frequencies = [model.topic_sizes[topic] for topic in topic_list]
# words = [" | ".join([word[0] for word in model.get_topic(topic)[:5]]) for topic in topic_list]

# # # Embed c-TF-IDF into 2D
# embeddings = MinMaxScaler().fit_transform(model.c_tf_idf.toarray())
# embeddings = umap.UMAP(n_neighbors=2, n_components=2, metric='hellinger').fit_transform(embeddings)

# #Filter embeddings
# #topic_list_unique = np.unique(topic_list)
# mask = [True if i in topic_list else False for i in range(len(embeddings)) ]
# embeddings = embeddings[mask, :]

# # Visualize 
# df = pd.DataFrame({"x": embeddings[:, 0], "y": embeddings[:, 1],
#                    "Topic": topic_list, "Words": words, "Size": frequencies})
# model._plotly_topic_visualization(df, topic_list)

# **Wordcloud AU**

In [None]:
docs

In [None]:
from wordcloud import WordCloud, STOPWORDS
from matplotlib import pyplot as plt 

au = docs[docs.City=='Vienna']

mpl.rcParams['figure.figsize']=(12.0,12.0)  
mpl.rcParams['font.size']=12            
mpl.rcParams['savefig.dpi']=100             
mpl.rcParams['figure.subplot.bottom']=.1 
stopwords = set(STOPWORDS)

wordcloud = WordCloud(
                          background_color='white',
                          stopwords=stopwords,
                          max_words=500,
                          max_font_size=40, 
                          random_state=42
                         ).generate(' '.join(au['lemmatized']))  # join all texts; str(Series) would only give a truncated repr

print(wordcloud)
fig = plt.figure(1)
plt.imshow(wordcloud)
plt.axis('off')
plt.show();

# **Visualize Topic Probabilities**

The probabilities returned by `transform()` or `fit_transform()` (stored in `probs` above) can 
be used to understand how confident BERTopic is that certain topics can be found in a document. 

To visualize the distributions, we simply call:

In [None]:
model.visualize_distribution(probs[1])
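
To read such a distribution without a plot, we can rank the topics of one document by probability. The sketch below assumes a hypothetical probability vector in which entry `i` belongs to topic `i`; the values, the helper name `top_topics` and the `min_prob` threshold are made up for illustration.

```python
# Hypothetical probability vector for one document: entry i is the
# probability assigned to topic i (values are invented for illustration).
doc_probs = [0.05, 0.62, 0.08, 0.20, 0.01]


def top_topics(prob_vector, k=3, min_prob=0.05):
    """Return the k most probable (topic, probability) pairs above a threshold."""
    ranked = sorted(enumerate(prob_vector), key=lambda pair: pair[1], reverse=True)
    return [(topic, p) for topic, p in ranked[:k] if p >= min_prob]


print(top_topics(doc_probs))  # → [(1, 0.62), (3, 0.2), (2, 0.08)]
```

A document with one dominant entry, as here, is a clear member of that topic; a flat vector suggests the document sits between several topics.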

# **Topic Reduction**
Finally, we can also reduce the number of topics after having trained a BERTopic model. The advantage of doing so 
is that you can decide on the number of topics after seeing how many were actually created. It is difficult to 
predict before training your model how many topics your documents contain and how many will be extracted. 
Instead, we can decide afterwards how many topics seem realistic:





In [None]:
new_topics, new_probs = model.reduce_topics(docs['lemmatized'].tolist(), topics, probs, nr_topics=10)
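
To build some intuition for what reducing topics means, here is a deliberately simplified sketch: it repeatedly merges the smallest topic into the topic whose word counts are most similar (cosine similarity) until the requested number of topics remains. This is an illustration under our own assumptions, not necessarily what `reduce_topics` does internally; the function `reduce_topics_sketch` and the toy word counts are hypothetical.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def reduce_topics_sketch(topic_words, topic_sizes, nr_topics):
    """Greedily merge the smallest topic into its most similar neighbour."""
    words = {t: Counter(counts) for t, counts in topic_words.items()}
    sizes = dict(topic_sizes)
    while len(words) > nr_topics:
        smallest = min(sizes, key=sizes.get)
        target = max((t for t in words if t != smallest),
                     key=lambda t: cosine(words[smallest], words[t]))
        words[target] += words.pop(smallest)   # pool the word counts
        sizes[target] += sizes.pop(smallest)   # pool the document counts
    return words, sizes


# Toy topics: word counts per topic and number of documents per topic.
toy_words = {
    0: {"boat": 9, "canal": 7, "cruise": 5},
    1: {"wine": 8, "tasting": 6, "cheese": 3},
    2: {"canal": 3, "cruise": 2, "sunset": 2},
    3: {"cheese": 2, "wine": 2, "market": 1},
}
toy_sizes = {0: 40, 1: 30, 2: 8, 3: 5}

merged_words, merged_sizes = reduce_topics_sketch(toy_words, toy_sizes, nr_topics=2)
print(merged_sizes)  # → {0: 48, 1: 35}
```

In this toy run the small "sunset cruise" topic is absorbed by the boat topic and the small "market" topic by the wine topic, which is the kind of outcome we hope for when asking for fewer topics.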

# **Plot after reduction**

In [None]:
docs_new = docs.copy()
# Update data frame with new topic id
docs_new['BERT_Topic'] = new_topics
docs_new['topic_string'] = new_topics
# Update data frame with new topic keywords
tdict =  model.get_topics()
docs_new['BERT_Topic_Keywords'] = docs_new['BERT_Topic'].apply(lambda x: [i[0] for i in tdict[x]])


# Visualize with custom function
plot_main(docs_new, num_topics=np.unique(new_topics), save_name='results/BERT_topics_reduced.html', model= 'BERT')


The reasoning for putting `docs`, `topics`, and `probs` as parameters is that these values are not saved within 
BERTopic on purpose. If you were to have a million documents, it seems very inefficient to save those in BERTopic 
instead of a dedicated database.  

# **Topic Representation**
When you have trained a model and viewed the topics and the words that represent them,
you might not be satisfied with the representation. Perhaps you forgot to remove
stop_words or you want to try out a different n_gram_range. We can use the function `update_topics` to update 
the topic representation with new parameters for `c-TF-IDF`: 


In [None]:
model.update_topics(docs['lemmatized'].tolist(), topics, n_gram_range=(1, 3), stop_words="english")
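
Why do stop words and n-gram settings change the topic descriptions? The words shown for a topic are ranked by a class-based TF-IDF score. The sketch below assumes the weighting `tf * log(1 + A / f)` described in the BERTopic documentation, where `tf` is a term's count within a topic, `A` the average number of words per topic and `f` the term's count across all topics. It omits any normalization, and the toy corpus is invented.

```python
import math
from collections import Counter

# Toy corpus: all documents of a topic are joined into one "class document".
class_docs = {
    0: "boat canal cruise boat tour canal".split(),
    1: "wine tasting cheese wine tour".split(),
}


def c_tf_idf(class_docs):
    """Simplified class-based TF-IDF: tf * log(1 + avg_words / total_count)."""
    tf = {c: Counter(tokens) for c, tokens in class_docs.items()}
    total = Counter()
    for counts in tf.values():
        total.update(counts)
    avg_words = sum(len(tokens) for tokens in class_docs.values()) / len(class_docs)
    return {
        c: {term: n * math.log(1 + avg_words / total[term]) for term, n in counts.items()}
        for c, counts in tf.items()
    }


scores = c_tf_idf(class_docs)
for c, weights in scores.items():
    top = sorted(weights, key=weights.get, reverse=True)[:3]
    print(c, top)
```

The word "tour" occurs in both toy topics and therefore receives the lowest weight in each of them. Frequent, uninformative words behave the same way, which is why removing stop words or allowing longer n-grams can noticeably change the top words per topic.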

In [None]:
model.get_topic_freq().head(35)

In [None]:
model.get_topic(-1)

In [None]:
model.visualize_distribution(probs[0])

# **Search Topics**
After having trained our model, we can use `find_topics` to search for topics that are similar 
to an input search term. Here, we search for topics that closely relate to the 
search term "holiday". Then, we extract the most similar topic and check the results: 

In [None]:
similar_topics, similarity = model.find_topics("holiday", top_n=5); similar_topics

In [None]:
model.get_topic(42)
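
Conceptually, such a search compares an embedding of the search term with an embedding of each topic and returns the closest ones. The sketch below illustrates this with cosine similarity on hypothetical 3-dimensional vectors; real embeddings come from the embedding model and have hundreds of dimensions, and `find_topics_sketch` is our own illustrative helper, not part of BERTopic.

```python
import math

# Hypothetical topic embeddings and query embedding (values are invented).
topic_embeddings = {
    0: [0.9, 0.1, 0.0],
    1: [0.1, 0.8, 0.3],
    2: [0.2, 0.2, 0.9],
}
query_embedding = [0.15, 0.75, 0.35]


def cosine(u, v):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


def find_topics_sketch(query, topics, top_n=2):
    """Return the top_n topic ids and their similarities to the query."""
    scored = sorted(((cosine(query, vec), t) for t, vec in topics.items()), reverse=True)
    return [t for _, t in scored[:top_n]], [s for s, _ in scored[:top_n]]


similar, scores_found = find_topics_sketch(query_embedding, topic_embeddings)
print(similar)  # → [1, 2]
```

The returned similarities help to judge whether the best match is actually close to the search term or merely the least distant topic.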

# **Model serialization**
The model and its internal settings can easily be saved. Note that the documents and embeddings will not be saved. However, UMAP and HDBSCAN will be saved. 

In [None]:
# Save model
#model.save("my_model")	

In [None]:
# Load model
#my_model = BERTopic.load("my_model")	