# Topic Modeling with BERTopic
BERTopic is a topic modeling tool that clusters documents based on their embeddings and describes each cluster with a class-based TF-IDF (c-TF-IDF). It generates a set of topics, the top words in each topic, and the probability that each text in a corpus belongs to each topic. It can also generate visualizations of the relationships between topics.


This notebook uses BERTopic for unsupervised topic modeling in order to explore the sci-fi corpus. BERTopic can be customized to support the following types of topic modeling:
* Guided: seeded topics manually set by the researcher
* (Semi)-supervised: modeling guided by document labels
* Hierarchical: topic similarity and rankings calculated, subtopics generated
* Dynamic: differentiates topic clustering based on doc timestamps
* Online: modeling updated incrementally from small batches of texts 
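
The class-based TF-IDF mentioned above can be illustrated with a minimal sketch. It assumes the formula given in the BERTopic documentation (term frequency within a topic, multiplied by `log(1 + A / f_x)`, where `A` is the average number of words per topic and `f_x` is the frequency of the word across all topics). The function name `c_tf_idf` and the toy documents are made up for illustration, and the library's own implementation may differ in details such as normalization.

```python
import math
from collections import Counter

def c_tf_idf(docs_by_topic):
    """Score each word per topic with a class-based TF-IDF (illustrative sketch)."""
    # Treat all documents assigned to a topic as one long "class document".
    term_counts = {
        topic: Counter(" ".join(docs).split())
        for topic, docs in docs_by_topic.items()
    }
    # f_x: frequency of each word across all topics.
    total_counts = Counter()
    for counts in term_counts.values():
        total_counts.update(counts)
    # A: average number of words per topic.
    avg_words = sum(total_counts.values()) / len(term_counts)

    return {
        topic: {
            word: tf * math.log(1 + avg_words / total_counts[word])
            for word, tf in counts.items()
        }
        for topic, counts in term_counts.items()
    }

# Hypothetical toy corpus: topic id -> documents assigned to that topic.
toy_topics = {
    0: ["ship crew ship orbit", "ship planet crew"],
    1: ["robot mind robot", "robot crew mind"],
}
scores = c_tf_idf(toy_topics)
top_words = {
    topic: sorted(words, key=words.get, reverse=True)[:2]
    for topic, words in scores.items()
}
print(top_words)  # {0: ['ship', 'crew'], 1: ['robot', 'mind']}
```

Words that are frequent in one topic but rare elsewhere score highest, which is why they end up as that topic's top words.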

Adapted from:

https://github.com/MaartenGr/BERTopic/blob/master/notebooks/BERTopic.ipynb

https://colab.research.google.com/drive/1FieRA9fLdkQEGDIMYl0I3MCjSUKVF8C-?usp=sharing#scrollTo=y_eHBI1jSb6i

## Install Packages

In [None]:
#!pip install bertopic
#!conda install pandas
#!conda install nltk
import os
import pandas as pd

import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

## Get and Clean Data

In [None]:
#Get current working directory 
path = os.getcwd()
print(path)

#Change working directory (os.chdir returns None, so its result is not assigned)
os.chdir("/home/dssadmin/Desktop/SF_Analysis/Data")

#Load dataframe
df = pd.read_csv('chapter_chunks_agg_output.csv')

#Drop first column (unnamed)
df = df.iloc[: , 1:]

df

In [None]:
#Lowercasing, punctuation and stopword removal
#Lowercase all words
df['Text'] = df['Text'].str.lower()

#Remove punctuation without inserting a space (periods, hyphens and apostrophes are kept)
df['Text'] = df['Text'].str.replace(r'[^\w\-\.\'\s]+', '', regex = True)

#Replace the remaining periods with a space (to prevent incorrect compounds)
df['Text'] = df['Text'].str.replace(r'[^\w\-\'\s]+', ' ', regex = True)

#Remove stopwords (NLTK's English list, imported in the install cell above)
stop_words = set(stopwords.words("english"))
df['Text'] = df['Text'].apply(lambda x: ' '.join(word for word in x.split() if word not in stop_words))

#Check output
df.head()
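
To see what the cleaning steps above do to a single string, here is a small standalone sketch that uses the same two regular expressions with the standard `re` module instead of pandas. The `STOP_WORDS` set is a tiny stand-in for NLTK's English list, and the `clean` helper and sample sentence are made up for illustration.

```python
import re

# Tiny stand-in for NLTK's English stopword list (an assumption for this sketch).
STOP_WORDS = {"the", "again"}

def clean(text):
    text = text.lower()
    # Step 1: delete punctuation, keeping periods, hyphens and apostrophes.
    text = re.sub(r"[^\w\-\.\'\s]+", "", text)
    # Step 2: turn the remaining periods into spaces so that
    # "failed.again" does not collapse into "failedagain".
    text = re.sub(r"[^\w\-\'\s]+", " ", text)
    # Step 3: drop stopwords; split() also collapses repeated whitespace.
    return " ".join(word for word in text.split() if word not in STOP_WORDS)

sample = "The ship's warp-drive failed.Again, silence!"
print(clean(sample))  # ship's warp-drive failed silence
```

Handling periods in a separate second pass is what keeps words on either side of a missing space apart.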


In [None]:
#Convert the Text column to a list of documents
text = df.Text.to_list()
text

## Create and Run BERTopic Model

The model (`topic_model`) can be defined with multiple parameters, including:
* `language`: language of the embedding model used (default: "english")
* `embedding_model`: the sentence-transformers model used to create the embeddings; if not set, a default model is used ([list of available models](https://www.sbert.net/docs/pretrained_models.html))
* `nr_topics`: set to reduce the number of topics; specify a number of topics OR use "auto" to merge similar topics automatically (e.g. topics with similarity > 0.9; the exact criterion depends on the BERTopic version)
* `calculate_probabilities`: whether to calculate the probability of each document belonging to each of the possible topics (set to True or False)
* `verbose`: set to True to show progress messages while the model is being fitted

Once the model is defined, fit it to the corpus prepared above using `fit_transform` and get topics and probabilities. 

In [None]:
from bertopic import BERTopic

topic_model = BERTopic(language="english", nr_topics = 'auto', calculate_probabilities=True, verbose=True)
topics, probs = topic_model.fit_transform(text)

In [None]:
#Get the 10 most frequent topics (-1 topic refers to all outliers, ignore it)
freq = topic_model.get_topic_info(); freq.head(10)

In [None]:
#Get all topics and download as csv
topic_model_df = topic_model.get_topic_info()

topic_model_df.to_csv('topic_model_df.csv', index=False)

In [None]:
# Select a specific topic
topic_model.get_topic(3)  

In [None]:
#Get mos

In [None]:
#Get predicted topics for the first 10 documents in corpus
topic_model.topics_[:10]

In [None]:
#Get top topic for every text in corpus and append to a dataframe
topic_list = topic_model.topics_[:]
top_topics_df = df.copy()
top_topics_df['Top_Topic'] = topic_list

#Remove docs whose top topic is -1 (outlier)
top_topics_df = top_topics_df[top_topics_df.Top_Topic != -1]

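#Sort by top topic (reassigned rather than sorted in place, which can trigger a SettingWithCopyWarning on a filtered dataframe)
top_topics_df = top_topics_df.sort_values(by=['Top_Topic'])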
top_topics_df.head(20)

In [None]:
#Add topic descriptions to dataframe
#Map each topic number to its name, then look up the name of each document's top topic
topic_names = dict(zip(topic_model_df.Topic, topic_model_df.Name))

top_topics_df['Topic_Description'] = top_topics_df['Top_Topic'].map(topic_names)
top_topics_df

In [None]:
#Download top topics df to csv
top_topics_df.to_csv('top_BERT_topics_df.csv', index=False)

## Visualizations

In [None]:
#Visualize distance between topics
topic_model.visualize_topics()

In [None]:
#Visualize the topic probability distribution for a single document (here, the first one)
topic_model.visualize_distribution(probs[0], min_probability=0.015)

In [None]:
#Visualize hierarchical structure of topics
topic_model.visualize_hierarchy(top_n_topics=60)

In [None]:
#Visualize top terms in selected topics
topic_model.visualize_barchart(top_n_topics=10)

In [None]:
#Create matrix to indicate similarity between topics
topic_model.visualize_heatmap(n_clusters=20, width=1000, height=1000)

In [None]:
#Visualize the decline of the c-TF-IDF score as more words are added to the topic representation.
#Using the elbow method, this lets you select the best number of words per topic.
topic_model.visualize_term_rank()
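
The elbow method mentioned above can also be approximated numerically. The sketch below is one common heuristic, not something BERTopic provides: it rescales the curve to the unit square and picks the rank that lies furthest below the straight line joining the first and last points. The `elbow_rank` helper and the scores are hypothetical and are not taken from the model.

```python
def elbow_rank(scores):
    """Return the 1-based term rank at the elbow of a decreasing score curve."""
    n = len(scores)
    if n < 3:
        return n
    top, bottom = scores[0], scores[-1]
    span = top - bottom
    if span == 0:
        return 1  # flat curve: no elbow to find
    best_rank, best_gap = 1, 0.0
    for i, score in enumerate(scores):
        x = i / (n - 1)               # rank rescaled to [0, 1]
        y = (score - bottom) / span   # score rescaled to [0, 1]
        gap = 1.0 - x - y             # how far the point lies below the line y = 1 - x
        if gap > best_gap:
            best_rank, best_gap = i + 1, gap
    return best_rank

# Hypothetical c-TF-IDF scores for the top 10 terms of one topic, sorted descending.
toy_scores = [0.30, 0.18, 0.10, 0.08, 0.07, 0.065, 0.06, 0.058, 0.056, 0.055]
print(elbow_rank(toy_scores))  # 3
```

For these toy scores the curve flattens after the third term, so three words would describe the topic about as well as ten.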

## Search Topics for Terms

In [None]:
#Search for topics that are similar to an input search_term
similar_topics, similarity = topic_model.find_topics("space", top_n=5); similar_topics
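
Conceptually, a search like the one above compares an embedding of the search term with an embedding of each topic and returns the closest matches. The sketch below shows that ranking step with cosine similarity on hand-made 3-dimensional vectors. The vectors and the `rank_topics` helper are made up for illustration; in BERTopic the embeddings come from the embedding model.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def rank_topics(query_vec, topic_vecs, top_n=5):
    """Return the top_n topic ids most similar to the query, with their similarities."""
    sims = {topic: cosine(query_vec, vec) for topic, vec in topic_vecs.items()}
    ranked = sorted(sims, key=sims.get, reverse=True)[:top_n]
    return ranked, [sims[topic] for topic in ranked]

# Hypothetical topic embeddings and query embedding (toy values).
toy_topic_vecs = {
    0: [0.9, 0.1, 0.0],
    1: [0.1, 0.9, 0.1],
    2: [0.7, 0.2, 0.6],
}
toy_query = [1.0, 0.0, 0.2]

similar, similarity = rank_topics(toy_query, toy_topic_vecs, top_n=2)
print(similar)  # [0, 2]
```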

In [None]:
#Look at the terms in one of the similar topics
topic_model.get_topic(18)

In [None]:
#Get all the texts whose top topic is that similar topic
top_topics_df.loc[top_topics_df['Top_Topic'] == 18]

## Update the BERTopic Model

Two common ways to update the topic model are changing the n-gram range (the default is single words, but you can also include bigrams, trigrams, etc.) and reducing the number of topics.

In [None]:
#Update topics based on ngram counts
topic_model.update_topics(text, n_gram_range=(1, 2))
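
As a reminder of what an n-gram range means, the sketch below lists the terms that a range of `(1, 2)` produces for a short token sequence: every single word plus every pair of adjacent words. The `ngrams` helper is made up for illustration; in BERTopic the extraction is handled by the vectorizer, not by code like this.

```python
def ngrams(tokens, n_gram_range=(1, 2)):
    """List all n-grams of the given sizes, shortest first."""
    low, high = n_gram_range
    grams = []
    for n in range(low, high + 1):
        # Slide a window of n tokens across the sequence.
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

tokens = "deep space station".split()
print(ngrams(tokens))
# ['deep', 'space', 'station', 'deep space', 'space station']
```

With bigrams included, a phrase such as "space station" can appear in a topic representation as a single term.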

In [None]:
#Look at the topics again
freq = topic_model.get_topic_info(); freq.head(10)

In [None]:
topic_model.get_topic(3)   # Select the topic viewed earlier to compare its updated representation

In [None]:
#Reduce number of topics
topic_model.reduce_topics(text, nr_topics=20)

In [None]:
freq = topic_model.get_topic_info(); freq.head(10)

## Additional Sources
Word Embeddings: https://www.shanelynn.ie/get-busy-with-word-embeddings-introduction/

BERTopic Intro: https://towardsdatascience.com/meet-bertopic-berts-cousin-for-advanced-topic-modeling-ea5bf0b7faa3

More about BERTopic: https://towardsdatascience.com/dynamic-topic-modeling-with-bertopic-e5857e29f872