<a href="https://colab.research.google.com/github/mkane968/extracted-features-1/blob/main/notebooks/4_Topic_Modeling_with_BERTopic.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Topic Modeling with BERTopic
BERTopic is a topic modeling tool which creates topic clusters based on word embeddings and a class-based TF-IDF. It generates a set of topics, the top words in each topic, and the likelihood of each text in a corpus belonging to each topic. Visualizations can also be generated based on the relationships between topics.  
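
To make the class-based TF-IDF step more concrete, here is a minimal sketch in plain Python. It assumes the weighting described in the BERTopic documentation: all documents in a cluster are joined into one "class", and each term t in class c is scored as tf(t, c) × log(1 + A / f(t)), where A is the average number of words per class and f(t) is the term's frequency across all classes. The toy clusters and the helper name `c_tf_idf` are invented for illustration; BERTopic's own implementation works on sparse matrices and also normalizes the term frequencies.

```python
import math
from collections import Counter

def c_tf_idf(classes):
    """Score terms per class as tf(t, c) * log(1 + A / f(t))."""
    # classes maps a topic label to the list of tokens from all its documents
    counts = {label: Counter(tokens) for label, tokens in classes.items()}
    # f(t): frequency of each term summed over every class
    total = Counter()
    for class_counts in counts.values():
        total.update(class_counts)
    # A: average number of tokens per class
    avg_words = sum(len(tokens) for tokens in classes.values()) / len(classes)
    return {
        label: {t: tf * math.log(1 + avg_words / total[t])
                for t, tf in class_counts.items()}
        for label, class_counts in counts.items()
    }

# Hypothetical token lists for two clusters
toy_classes = {
    "sea": ["ship", "sea", "sea", "captain", "the"],
    "desert": ["sand", "dune", "sand", "the", "the"],
}
scores = c_tf_idf(toy_classes)
top_sea = max(scores["sea"], key=scores["sea"].get)
top_desert = max(scores["desert"], key=scores["desert"].get)
print(top_sea, top_desert)
```

A word such as "the" that is frequent in every class gets a small logarithmic factor, so it ranks below words that are concentrated in one cluster.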


This notebook uses BERTopic for unsupervised topic modeling in order to explore the sci-fi corpus. BERTopic can be customized to support the following types of topic modeling:
* Guided: seeded topics manually set by the researcher
* (Semi)-supervised: modeling guided by document labels
* Hierarchical: topic similarity and rankings calculated, subtopics generated
* Dynamic: differentiates topic clustering based on doc timestamps
* Online: modeling updated incrementally from small batches of texts 

Adapted from:

https://github.com/MaartenGr/BERTopic/blob/master/notebooks/BERTopic.ipynb

https://colab.research.google.com/drive/1FieRA9fLdkQEGDIMYl0I3MCjSUKVF8C-?usp=sharing#scrollTo=y_eHBI1jSb6i

## Install Packages

In [None]:
#!pip install bertopic
#!pip install --upgrade bertopic
#!conda install pandas
#!conda install nltk
import os
import pandas as pd

import nltk
#nltk.download('stopwords')
from nltk.corpus import stopwords

import re

#Get dictionary of English words to keep 
from nltk.corpus import words
#nltk.download('words')
#nltk.download('wordnet')
from nltk import WordNetLemmatizer

#Import BERTopic
from bertopic import BERTopic

## Get Data

In [None]:
#Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

In [None]:
#Select all files to upload
from google.colab import files

uploaded = files.upload()

In [None]:
#Load dataframe
df = pd.read_csv('adv_clean_bow_ch_chunks.csv')

df

In [None]:
#Change data type to string
df['English_Text'] = df['English_Text'].astype(str)

#Append data to list
text = df.English_Text.to_list()
text

## Create and Run BERTopic Model

The model (`topic_model`) can be defined based on multiple parameters, including: 
* language: language of the embedding model used (default is English)
* embedding_model: sentence-transformers model used to create the document embeddings; defaults to a pre-set model, and [here's a list of all available models](https://www.sbert.net/docs/pretrained_models.html)
* nr_topics: set to reduce the number of topics; can specify a specific number of topics OR set to "auto" to let BERTopic merge similar topics automatically (older releases merged topics with similarity > 0.9)
* calculate_probabilities: when True, calculates the probability of each document belonging to each of the possible topics, not only its assigned topic
* vectorizer_model: the `CountVectorizer` used to build the topic representations; here it removes stopwords after the embeddings are created
* verbose: set to True to show progress messages while the model is fitted

Once the model is defined, fit it to the corpus prepared above using `fit_transform` to get the topic assignments and probabilities. 
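
As far as the BERTopic documentation describes it, with `calculate_probabilities=True` the probabilities returned by `fit_transform` form a matrix with one row per document and one column per topic. A minimal sketch of reading such a matrix follows; `toy_probs` is a made-up array standing in for real model output, and the 0.1 threshold is an arbitrary value chosen for illustration.

```python
import numpy as np

# Hypothetical stand-in for `probs`: 3 documents x 4 topics
toy_probs = np.array([
    [0.70, 0.10, 0.05, 0.05],
    [0.02, 0.03, 0.01, 0.04],
    [0.10, 0.15, 0.60, 0.05],
])

best_topic = toy_probs.argmax(axis=1)  # column index of the most likely topic
best_prob = toy_probs.max(axis=1)      # probability of that topic

# Treat documents with no topic above the threshold as unassigned (-1)
threshold = 0.1
assigned = np.where(best_prob >= threshold, best_topic, -1)
print(assigned.tolist())
```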

In [None]:
#Set environment variable to false to avoid error
#os.environ["TOKENIZERS_PARALLELISM"] = "false"

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

#Set vectorizer model to remove stopwords after embeddings have been created
vectorizer_model = CountVectorizer(stop_words="english")

#Create new topic model
topic_model = BERTopic(language="english", nr_topics = 'auto', vectorizer_model=vectorizer_model, calculate_probabilities=True, verbose=True, top_n_words=20)

In [None]:
#Run topic model on texts
topics, probs = topic_model.fit_transform(text)

In [None]:
#Get the 10 most frequent topics (-1 topic refers to all outliers, ignore it)
freq = topic_model.get_topic_info(); freq.head(10)

In [None]:
#Investigate top n words in a specific topic
topic_model.get_topic(3)  

In [None]:
#Get predicted topics for the first 10 documents in corpus
topic_model.topics_[:10]

In [None]:
#Create a dataframe which has info about top topic in each document
topics_df = topic_model.get_document_info(text)
topics_df

In [None]:
#Add document names to dataframe

#Create dataframe with texts and titles
texts_df = df[['Book + Chunk','English_Text']].copy()
texts_df.rename(columns={'Book + Chunk':'Title','English_Text':'Document'}, inplace=True)
texts_df

#Merge with above dataset on Document
top_BERTopic_per_doc = texts_df.merge(topics_df, how='right',on='Document')
top_BERTopic_per_doc

#Sort by topic (or title)
top_BERTopic_per_doc.sort_values(by=['Topic'], inplace=True)
top_BERTopic_per_doc

In [None]:
#Download CSV with document and topic information
top_BERTopic_per_doc.to_csv('adv_clean_agg_ch_chunks_BERTopic_top_topic_per_doc_info.csv', index=False)

files.download('adv_clean_agg_ch_chunks_BERTopic_top_topic_per_doc_info.csv')

In [None]:
#Make CSV with just topic number and top words 
BERTopic_topic_metadata_df = top_BERTopic_per_doc[['Topic','Name','Top_n_words']].copy()
BERTopic_topic_metadata_df = BERTopic_topic_metadata_df.drop_duplicates()
BERTopic_topic_metadata_df = BERTopic_topic_metadata_df.reset_index(drop=True)
BERTopic_topic_metadata_df

In [None]:
#Download as CSV
BERTopic_topic_metadata_df.to_csv('adv_clean_agg_ch_chunks_BERTopic_topic_metadata.csv', index=False)

files.download('adv_clean_agg_ch_chunks_BERTopic_topic_metadata.csv')

In [None]:
#Add topic names (determined manually) to a new dataframe
topic_names = pd.read_csv('NAMED_adv_clean_agg_ch_chunks_BERTopic_topics.csv')
topic_names = topic_names.drop(columns='Top_n_words')
topic_names.head()

In [None]:
#Add topic name to dataframe 
named_topics_per_doc = top_BERTopic_per_doc.copy()

named_topics_per_doc['Topic'] = top_BERTopic_per_doc.Topic.map(topic_names.set_index('Topic')['Name'])
named_topics_per_doc

In [None]:
#Sort texts by topic
named_topics_per_doc.sort_values(by=['Topic'], inplace=True)
named_topics_per_doc

In [None]:
#Download CSV with named topics
named_topics_per_doc.to_csv('adv_clean_agg_ch_chunks_named_BERTopics_per_doc.csv', index=False)

files.download('adv_clean_agg_ch_chunks_named_BERTopics_per_doc.csv')

### Topic Span Over Time

In [None]:
#Keep only the title and topic columns
#(document text, probability and representative-document flag would interfere with duplicate handling; name is redundant)
counted_topics = named_topics_per_doc[['Title','Topic']].copy()

#Sort texts by title
counted_topics.sort_values(by=['Title'], inplace=True)
counted_topics

In [None]:
#Count number of times each topic appears in each text
from collections import Counter

df1 = counted_topics['Topic'].apply(lambda x: pd.Series(Counter(x.split(',')))).fillna(0).astype(int)

counted_topics = counted_topics.join(df1.add_suffix(' Count'))
counted_topics

In [None]:
#Save counted topics df to csv
counted_topics.to_csv('adv_clean_agg_ch_chunks_counted_topics.csv', index=False)

In [None]:
#Make new dataframe to track topic prevalence over the years
yearly_topics = counted_topics.copy()

#Split title on year heading and remove following text
test = yearly_topics['Title'].str.split("_", expand = True)

yearly_topics['Title'] = test[0]
yearly_topics.rename(columns={"Title": "Year"}, inplace=True)
yearly_topics

In [None]:
yearly_topics = yearly_topics.groupby(['Year']).sum()
yearly_topics = yearly_topics.reset_index()
yearly_topics.head()
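
The counting and grouping steps above can also be expressed in a single call with `pd.crosstab`. The sketch below uses a small invented table; the title pattern `Year_AUTHOR_Book_Chapter...` is an assumption inferred from how titles are split in this notebook, and the author and book names are placeholders.

```python
import pandas as pd

# Hypothetical chunk-level table: one row per text chunk
toy = pd.DataFrame({
    "Title": ["1962_AUTHORA_BookA_Chapter1", "1962_AUTHORA_BookA_Chapter2",
              "1969_AUTHORB_BookB_Chapter1", "1969_AUTHORB_BookB_Chapter2"],
    "Topic": ["Desert_Landscape_Exploration", "Desert_Landscape_Exploration",
              "Forest_Landscape", "Sea_Travel"],
})

# Year is the text before the first underscore
toy["Year"] = toy["Title"].str.split("_").str[0]

# Rows = years, columns = topics, cells = number of chunks with that topic
yearly = pd.crosstab(toy["Year"], toy["Topic"])
print(yearly)
```

Topics that never occur in a given year show up as 0, so no separate `fillna` step is needed.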

In [None]:
#Select topics of interest
interest_topics = yearly_topics[['Year','Air_Pollution Count', 'Disease_Outbreak Count', 'Car_Driving_Mechanics Count', 'Sea_Travel Count', 'Desert_Landscape_Exploration Count','Undersea_Reef_Species Count','Forest_Landscape Count']]

In [None]:
#Import seaborn for graphing and melt dataframe to prepare for plot (will shift topic counts into column)
import seaborn as sns

dfm = interest_topics.melt('Year', var_name='Topics', value_name='vals')
dfm

In [None]:
import matplotlib.pyplot as plt

#Set figure size before plotting so that it applies to this figure
plt.figure(figsize=(25, 5))

#Plot usage of topics over time
ax = sns.pointplot(x="Year", y="vals", hue='Topics', data=dfm)
plt.xticks(rotation=40, ha="right")
plt.title("Topic Usage Over Time")
plt.show()


## Word Clouds Per Topic of Interest

In [None]:
#Import word cloud
from wordcloud import WordCloud
import matplotlib.pyplot as plt

#Remove custom words (stopwords and names not previously filtered out)
custom_stop_words = ['peter', 'rand', 'mick', 'wa', 'ha', 'mike']

def create_wordcloud(model, topic):
    text = {word: value for word, value in model.get_topic(topic) if word not in custom_stop_words}
    w = WordCloud(background_color="white", max_words=1000)
    w.generate_from_frequencies(text)
    plt.imshow(w, interpolation="bilinear")
    plt.axis("off")
    plt.show()

# Show wordcloud
print('Topic: Air_Pollution')
create_wordcloud(topic_model, topic=61)

In [None]:
print('Topic: Disease_Outbreak')
create_wordcloud(topic_model, topic=17)

In [None]:
print('Topic: Desert_Landscape_Exploration')
create_wordcloud(topic_model, topic=27)

In [None]:
print('Topic: Undersea_Reef_Species')
create_wordcloud(topic_model, topic=49)

In [None]:
print('Topic: Car_Driving_Mechanics')
create_wordcloud(topic_model, topic=6)

## Retrieve All Books Containing a Specific Topic

In [None]:
#create new dataframe
topic_contents = named_topics_per_doc[['Title','Topic']].copy()

#Split title on chapter heading and remove following text
test = topic_contents['Title'].str.split("_Chapter", expand = True)

topic_contents['Title'] = test[0]

topic_contents

In [None]:
#Keep only books which have specific topic
topic_contents = topic_contents.loc[topic_contents['Topic'] == 'Undersea_Reef_Species']

#Sort by title
topic_contents = topic_contents.sort_values(by='Title',ascending=True)

#Drop duplicates
topic_contents = topic_contents.drop_duplicates(subset=["Title"], keep='first')
topic_contents

## Compare Topic Usage Between Two Authors

In [None]:
#Create new df for topics per author analysis
topics_per_author = counted_topics.copy()
topics_per_author

In [None]:
#Titles are underscore-separated; keep only the second field (author name)
topics_per_author['Title'] = topics_per_author["Title"].str.split("_").str[1]

topics_per_author.rename(columns={"Title": "Author"}, inplace=True)
topics_per_author
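
Positional splitting silently returns the wrong field if a title does not follow the expected pattern. A hypothetical helper like the one below (not part of the original workflow) makes the assumed `Year_AUTHOR_rest` layout explicit and fails loudly on anything else; the example title is invented.

```python
def parse_title(title):
    """Split 'Year_AUTHOR_rest' into its first two fields plus the remainder."""
    parts = title.split("_", 2)  # at most three pieces
    if len(parts) < 3 or not parts[0].isdigit():
        raise ValueError(f"unexpected title format: {title!r}")
    year, author, rest = parts
    return {"year": int(year), "author": author, "rest": rest}

info = parse_title("1962_AUTHORA_BookTitle_Chapter1_Chunk2")
print(info)
```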

In [None]:
#Topic counts were already added to counted_topics, so the copy made above carries them over
#(joining them a second time would fail on overlapping column names).
#Drop the text Topic column so that only the numeric counts are summed per author
topics_per_author = topics_per_author.drop(columns='Topic')
topics_per_author

In [None]:
topics_per_author = topics_per_author.groupby(['Author']).sum()
topics_per_author = topics_per_author.reset_index()
topics_per_author.head(20)

In [None]:
#Choose authors of interest
select_authors = topics_per_author[topics_per_author.Author.str.contains('ALDISS|LEGUIN')] 
select_authors

In [None]:
import seaborn as sns

dfm2 = select_authors.melt('Author', var_name='Topics', value_name='vals')

#Only keep rows where topic value is not 0
dfm2 = dfm2[dfm2.vals != 0]

#Remove outlier rows and other extraneous rows
dfm2 = dfm2[~dfm2["Topics"].str.contains("Outliers|Book_Program|UNCLEAR")]


dfm2

In [None]:
#Create bar plots based on topic counts per author
fig, ax = plt.subplots(1, 2, figsize=(15, 5), sharey=True)
for i, Author in enumerate(dfm2.Author.unique()):
    _df = dfm2[dfm2.Author==Author].sort_values(by='Topics')
    g = sns.barplot(
        ax=ax[i],
        data=_df,
        x='Author', y='vals',
        hue='Topics'
    )
    ax[i].set(
        title=f'Topic Usage By {Author}'
    )


### Get Most Representative Documents Per Topic

In [None]:
#Make copy of dataframe
most_rep = named_topics_per_doc[['Title','Topic','Representative_document']].copy()

#change data type to string
most_rep['Representative_document'] = most_rep['Representative_document'].astype(str)

most_rep

In [None]:
#Keep only most representative documents
most_rep = most_rep.loc[most_rep['Representative_document'] == 'True']

In [None]:
#Split title on chapter heading and remove following text
test = most_rep['Title'].str.split("_Chapter", expand = True)
most_rep['Title'] = test[0]

#Sort by title
most_rep = most_rep.sort_values(by='Title',ascending=True)

#Drop duplicates
most_rep = most_rep.drop_duplicates(subset=["Title"], keep='first')

most_rep

In [None]:
#Get all documents representative of a certain topic
most_rep.loc[most_rep['Topic'] == 'Desert_Landscape_Exploration']


## Visualizations

In [None]:
#Visualize distance between topics
topic_model.visualize_topics()

In [None]:
#Visualize the probability of each topic for a specific document (here, the first one)
topic_model.visualize_distribution(probs[0], min_probability=0.015)

In [None]:
#Visualize hierarchical structure of topics
topic_model.visualize_hierarchy(top_n_topics=60)

In [None]:
#Visualize top terms in selected topics
topic_model.visualize_barchart(top_n_topics=10)

In [None]:
#Create matrix to indicate similarity between topics
topic_model.visualize_heatmap(n_clusters=20, width=1000, height=1000)

In [None]:
#Visualize the decline of the c-TF-IDF score as words are added to the topic representation.
#Using the elbow method, this helps you select the best number of words per topic.
topic_model.visualize_term_rank()

## Search Topics for Terms

In [None]:
#Search for topics that are similar to an input search_term
similar_topics, similarity = topic_model.find_topics("space", top_n=5); similar_topics

In [None]:
#Look at the terms in one of the similar topics
topic_model.get_topic(43)
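
As far as the BERTopic documentation describes it, `find_topics` embeds the search term with the model's embedding model and ranks topics by cosine similarity to their embeddings. A minimal sketch of that ranking with invented three-dimensional vectors (real embeddings have hundreds of dimensions, and every value below is a placeholder):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical topic embeddings keyed by topic number
toy_topic_vecs = {0: [1.0, 0.0, 0.0], 1: [0.6, 0.8, 0.0], 2: [0.0, 0.0, 1.0]}
query_vec = [0.8, 0.6, 0.0]  # hypothetical embedding of the search term

# Rank topic numbers from most to least similar and keep the best two
ranked = sorted(toy_topic_vecs,
                key=lambda t: cosine(query_vec, toy_topic_vecs[t]),
                reverse=True)
top_two = ranked[:2]
print(top_two)
```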

## Update the BERTopic Model

Two common ways to update the topic model are changing the n-gram range (the default is single words, but you can also include bigrams, trigrams, etc.) and setting the number of topics.
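
With `n_gram_range=(1, 2)`, candidate topic terms include both single words and adjacent word pairs. The helper below is a hypothetical plain-Python illustration of that expansion, not the vectorizer BERTopic actually uses:

```python
def ngrams(tokens, n_min, n_max):
    """Return all contiguous n-grams with n_min <= n <= n_max, joined by spaces."""
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

terms = ngrams(["deep", "sea", "reef"], 1, 2)
print(terms)
```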

In [None]:
#Update topics based on ngram counts
topic_model.update_topics(text, n_gram_range=(1, 2))

In [None]:
#Look at the topics again (refresh freq so it reflects the updated model)
freq = topic_model.get_topic_info(); freq.head(10)

In [None]:
topic_model.get_topic(3)   # Same topic that we viewed before the update

In [None]:
#Reduce number of topics
topic_model.reduce_topics(text, nr_topics=20)

In [None]:
freq = topic_model.get_topic_info(); freq.head(10)

## Compare LDA and BERT Topics for Each Text

In [None]:
uploaded = files.upload()

In [None]:
#Load dataframe
lda_df = pd.read_csv('lda_df.csv')

lda_df

In [None]:
#Rename columns to avoid confusion
lda_df = lda_df.rename(columns={"Dominant_Topic": "Dominant_LDA_Topic", "Topic_Perc_Contrib": "LDA_Topic_Perc_Contrib", "Keywords": "LDA_Topic_Keywords"})
lda_df

In [None]:
#Make copy of bertopic df
bertopic_df = named_topics_per_doc.copy()

#Rename columns to avoid confusion
bertopic_df = bertopic_df.rename(columns={"Topic": "BERTopic_Topic", "Top_n_words": "BERTopic_Topic_Keywords", "Probability": "BERTopic_Probability", "Representative_document": "BERTopic_Representative_document"})
bertopic_df

In [None]:
#Merge dfs
lda_and_bertopic_df = pd.merge(lda_df, bertopic_df, on="Title")

#Remove duplicates
lda_and_bertopic_df = lda_and_bertopic_df.drop_duplicates()

lda_and_bertopic_df
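
If either table holds more than one row per title, a merge on `Title` multiplies rows. The sketch below, using invented frames, shows how the `validate` argument of `pd.merge` can flag that situation before it goes unnoticed; the column values are placeholders.

```python
import pandas as pd

left = pd.DataFrame({"Title": ["a", "b"], "Dominant_LDA_Topic": [3, 7]})
right = pd.DataFrame({
    "Title": ["a", "b", "b"],
    "BERTopic_Topic": ["Sea_Travel", "Forest_Landscape", "Forest_Landscape"],
})

# validate="one_to_one" raises MergeError when a key is repeated on either side
try:
    pd.merge(left, right, on="Title", validate="one_to_one")
    duplicate_keys = False
except pd.errors.MergeError:
    duplicate_keys = True

# Dropping exact duplicate rows first makes the keys unique in this toy case
merged = pd.merge(left, right.drop_duplicates(), on="Title", validate="one_to_one")
print(duplicate_keys, len(merged))
```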

In [None]:
#Download combined dataframe
lda_and_bertopic_df.to_csv('lda_and_bertopic_df.csv', index=False)

files.download('lda_and_bertopic_df.csv')

## Additional Sources
Word Embeddings: https://www.shanelynn.ie/get-busy-with-word-embeddings-introduction/

BERTopic Intro: https://towardsdatascience.com/meet-bertopic-berts-cousin-for-advanced-topic-modeling-ea5bf0b7faa3

More about BERTopic: 
https://towardsdatascience.com/dynamic-topic-modeling-with-bertopic-e5857e29f872