# Topic Modeling with BERTopic

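In this notebook, I show how to use BERTopic for topic modeling on a dataset of short texts. BERTopic is a topic modeling technique that embeds documents with a transformer model, clusters the embeddings into dense groups, and describes each group with its most representative words (class-based TF-IDF), which yields easily interpretable topics.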

The steps we will take are as follows:
1. Load the data
2. Preprocess the data (optional; skipped in this notebook)
3. Create topics with BERTopic
4. Visualize the topics
5. Assign topics to new documents

*Note: This notebook is based on the [official BERTopic documentation](https://github.com/MaartenGr/BERTopic)*

1. [Loading the data](#1)

Data Source: [Emotions](https://www.kaggle.com/datasets/nelgiriyewithana/emotions)

Data description:
The "Emotions" dataset is a collection of English Twitter messages annotated with six basic emotions: anger, fear, joy, love, sadness, and surprise. It is a useful resource for analyzing the emotions expressed in short-form text on social media.

Each entry in this dataset consists of a text segment representing a Twitter message and a label indicating the predominant emotion conveyed. The labels map to six categories:

0. sadness
1. joy
2. love
3. anger
4. fear
5. surprise

The dataset is suitable for sentiment analysis, emotion classification, and general text mining on social media text.
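
The integer labels can be turned back into emotion names with a small lookup table. The sketch below only restates the mapping listed above; `LABEL_NAMES` and `label_to_name` are illustrative names, not part of the dataset or of BERTopic.

```python
# Mapping taken from the data description above (label id -> emotion name).
LABEL_NAMES = {
    0: "sadness",
    1: "joy",
    2: "love",
    3: "anger",
    4: "fear",
    5: "surprise",
}


def label_to_name(label: int) -> str:
    """Return the emotion name for a label id, or raise for an unknown id."""
    try:
        return LABEL_NAMES[label]
    except KeyError:
        raise ValueError(f"unknown emotion label: {label!r}") from None


print(label_to_name(4))  # fear
```

Once the data is loaded into a DataFrame, the same dictionary can be passed to pandas, for example `emotions_data['label'].map(LABEL_NAMES)`.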

In [1]:
import pandas as pd

In [2]:
emotions_data = pd.read_csv('data/text.csv', index_col=0)
emotions_data.head()

| | text | label |
|---|---|---|
| 0 | i just feel really helpless and heavy hearted | 4 |
| 1 | ive enjoyed being able to slouch about relax a... | 0 |
| 2 | i gave up my internship with the dmrg and am f... | 4 |
| 3 | i dont know i feel so lost | 0 |
| 4 | i am a kindergarten teacher and i am thoroughl... | 4 |


In [3]:
# number of tweets in the dataset
print('Number of tweets in the dataset:', emotions_data.shape[0])

Number of tweets in the dataset: 416809


In [4]:
# lengths of tweets, measured in whitespace-separated words
emotions_data['length'] = emotions_data['text'].apply(lambda x: len(x.split()))

# max length of a tweet
print('Max length of a tweet:', emotions_data['length'].max())

# min length of a tweet
print('Min length of a tweet:', emotions_data['length'].min())

# average length of a tweet
print('Average length of a tweet:', emotions_data['length'].mean().round(0))

Max length of a tweet: 178
Min length of a tweet: 1
Average length of a tweet: 19.0


2. [Preprocessing the data](#2)

In [5]:
# possible preprocessing steps
# import re
# import nltk
# from nltk.corpus import stopwords
# from nltk.stem import WordNetLemmatizer

# # Download NLTK resources
# nltk.download('stopwords')
# nltk.download('wordnet')

# lemmatizer = WordNetLemmatizer()
# stop_words = set(stopwords.words('english'))

# def preprocess(text):
#     text = re.sub(r'\W', ' ', text.lower())
#     words = text.split()
#     words = [lemmatizer.lemmatize(word) for word in words if word not in stop_words]
#     return ' '.join(words)

# emotions_data['text'] = emotions_data['text'].apply(preprocess)

# not necessary for bertopic
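
If some cleanup is wanted anyway, a lighter alternative keeps stop words and word forms intact, since the embedding model uses them as context. The sketch below is an assumption about what such a step could look like, not something the notebook applies: the patterns and the name `light_clean` are hypothetical, and the rows shown earlier already look lowercased and free of punctuation.

```python
import re

# Hypothetical patterns; adjust them to whatever noise the raw text contains.
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")
MENTION_PATTERN = re.compile(r"@\w+")
WHITESPACE_PATTERN = re.compile(r"\s+")


def light_clean(text: str) -> str:
    """Remove URLs and @mentions, then collapse runs of whitespace."""
    text = URL_PATTERN.sub(" ", text)
    text = MENTION_PATTERN.sub(" ", text)
    return WHITESPACE_PATTERN.sub(" ", text).strip()


print(light_clean("i feel   great  @someone see https://example.com/page"))
```

Stop words can instead be kept out of the topic descriptions at the representation step, for example by passing a scikit-learn `CountVectorizer(stop_words="english")` as `vectorizer_model` when creating the BERTopic model.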

3. [Creating topics with BERTopic](#3)

In [6]:
from bertopic import BERTopic

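# "multilingual" selects a multilingual embedding model; for English-only text, language="english" (the default) also works.
# calculate_probabilities=True returns per-topic probabilities but can be slow on a dataset this large.
topic_model = BERTopic(language="multilingual", calculate_probabilities=True, verbose=True)
topics, probs = topic_model.fit_transform(emotions_data['text'].tolist())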

2024-06-29 17:49:30,147 - BERTopic - Embedding - Transforming documents to embeddings.
Batches: 100%|██████████| 13026/13026 [2:59:05<00:00,  1.21it/s]     
2024-06-29 20:50:14,386 - BERTopic - Embedding - Completed ✓
2024-06-29 20:50:14,386 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2024-06-29 20:59:35,610 - BERTopic - Dimensionality - Completed ✓
2024-06-29 20:59:35,635 - BERTopic - Cluster - Start clustering the reduced embeddings
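
The progress bar in the log can be sanity-checked with a little arithmetic. The sketch below assumes the embedding step encodes documents in batches of 32, which is the sentence-transformers default; the notebook does not state the batch size, so treat that value as an assumption. The helper names are illustrative.

```python
import math

N_DOCS = 416_809                           # number of tweets reported earlier
BATCH_SIZE = 32                            # assumption: sentence-transformers default
ELAPSED_SECONDS = 2 * 3600 + 59 * 60 + 5   # 2:59:05 from the progress bar


def n_batches(n_docs: int, batch_size: int = BATCH_SIZE) -> int:
    """Number of batches needed to encode n_docs documents."""
    return math.ceil(n_docs / batch_size)


def estimated_seconds(n_docs: int, docs_per_second: float) -> float:
    """Rough embedding time for n_docs at a measured throughput."""
    return n_docs / docs_per_second


batches = n_batches(N_DOCS)
docs_per_second = N_DOCS / ELAPSED_SECONDS

print("Batches:", batches)
print("Documents per second:", round(docs_per_second, 1))
print("Estimated minutes for 50,000 documents:",
      round(estimated_seconds(50_000, docs_per_second) / 60))
```

Under that assumption the batch count matches the 13026 shown in the log, and the measured throughput gives a quick estimate of how long a smaller sample would take to embed on the same hardware.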


In [None]:
freq = topic_model.get_topic_info()
display(freq.head(5))

topic_model.get_topic(0)  # Topic 0 is the largest topic; topic -1 (if present) holds outliers
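
BERTopic assigns documents that do not fit any cluster to topic `-1`, so it helps to check how much of the corpus ended up there. A minimal sketch that uses only the `topics` list returned by `fit_transform`; the example list is made up for illustration and the helper names are hypothetical.

```python
from collections import Counter


def topic_sizes(topics):
    """Count documents per topic id, most frequent first."""
    return Counter(topics).most_common()


def outlier_share(topics, outlier_id=-1):
    """Fraction of documents assigned to the outlier topic."""
    if len(topics) == 0:
        return 0.0
    return sum(1 for t in topics if t == outlier_id) / len(topics)


# Made-up assignments standing in for the `topics` returned by fit_transform.
example_topics = [0, 0, -1, 1, 0, -1, 2, 1]

print(topic_sizes(example_topics))
print(outlier_share(example_topics))
```

A large outlier share usually means the clustering settings deserve a second look before the topics are interpreted.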

4. [Visualizing the topics](#4)

In [None]:
topic_model.visualize_topics()

In [None]:
topic_model.visualize_hierarchy(top_n_topics=50)

In [None]:
topic_model.visualize_barchart(top_n_topics=10)
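
The `visualize_*` methods return Plotly figures, which render inline in a notebook. To keep them outside the notebook, they can be written to standalone HTML files. A minimal sketch, assuming a fitted `topic_model`; `save_figures`, the output directory, and the file names are hypothetical.

```python
from pathlib import Path


def save_figures(topic_model, out_dir="figures"):
    """Write the three visualizations from this section to HTML files.

    Expects a fitted BERTopic model; returns the list of written paths.
    """
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)

    figures = {
        "topics": topic_model.visualize_topics(),
        "hierarchy": topic_model.visualize_hierarchy(top_n_topics=50),
        "barchart": topic_model.visualize_barchart(top_n_topics=10),
    }

    written = []
    for name, fig in figures.items():
        target = out_path / f"{name}.html"
        fig.write_html(str(target))  # Plotly's Figure.write_html
        written.append(target)
    return written
```

After fitting, `save_figures(topic_model)` would write `topics.html`, `hierarchy.html`, and `barchart.html` into the `figures` directory.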

5. [Assigning topics to new documents](#5)

In [None]:
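test_text = "I feel so happy today"
# transform returns (topics, probabilities); keep both so the result is not lost
new_topics, new_probs = topic_model.transform([test_text])
print('Assigned topic:', new_topics[0])

# top words of every topic, for looking up the assigned topic id
topic_model.get_topics()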
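
Because the model was created with `calculate_probabilities=True`, `transform` also returns one probability per topic for each document. The sketch below picks the most likely topics from one such row. It assumes that position `i` in the row corresponds to topic `i`; the row itself is made up for illustration and `top_k_topics` is a hypothetical helper.

```python
def top_k_topics(prob_row, k=3):
    """Return the k (topic_id, probability) pairs with the highest probability.

    Assumes position i in prob_row holds the probability of topic i.
    """
    ranked = sorted(enumerate(prob_row), key=lambda pair: pair[1], reverse=True)
    return [(topic_id, float(p)) for topic_id, p in ranked[:k]]


# Made-up probabilities for a single document over five topics.
example_row = [0.05, 0.60, 0.10, 0.20, 0.05]

print(top_k_topics(example_row, k=2))  # [(1, 0.6), (3, 0.2)]
```

Each returned topic id can then be looked up with `topic_model.get_topic(topic_id)` to see its top words.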