# Topic Modeling with BERTopic
This notebook explores topic modeling using the *BERTopic* library on conversational data, such as interview transcripts. The workflow is tailored to handle challenges inherent to conversational datasets, like shifting contexts and overlapping themes, ensuring meaningful and coherent topic extraction.

The process includes:

- **Document Preparation:** Preprocessing and segmenting transcripts to optimize context preservation.
- **Topic Modeling Pipeline:** Combining embedding generation, dimensionality reduction, clustering, and topic representation for flexible and efficient topic extraction.
- **Topic Assignment:** Exploring multiple topic assignments per document to reduce noise and improve analytical accuracy.

In [24]:
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [25]:
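# Explicit imports for names used below (otherwise supplied only via the star import)
import os
import pandas as pd

from utils.analysis_helpers import *
from bertopic import BERTopic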

# Defining and Preprocessing Documents for Topic Modeling

In this section, we focus on defining and preparing the dataset for topic modeling. Conversational data, such as interviews, requires specific preprocessing and representation strategies to ensure meaningful and coherent results. The main steps include:

- Loading and filtering the dataset.
- Preprocessing the text to clean and standardize the content.
- Splitting the data into turns to preserve the local context.
- Preparing the text for embedding generation and topic modeling.

In [26]:
df_all = pd.read_csv("../Dataset/meditation_interviews/transcripts_merged.csv")

print(f"Unique conditions before filtering: {df_all['Condition'].unique()}")
print(f"Number of interviews before filtering: {df_all['File Name'].nunique()}")
# *0*: No "real" interview (e.g., setup phase, small talk). We filter these out.
df_all = df_all[df_all["Condition"] != '0']
print(f"Unique conditions after filtering: {df_all['Condition'].unique()}")
print(f"Number of interviews (File Name) after filtering: {df_all['File Name'].nunique()}")

Unique conditions before filtering: ['1' 'C' 'I' '0']
Number of interviews before filtering: 82
Unique conditions after filtering: ['1' 'C' 'I']
Number of interviews (File Name) after filtering: 75


## Preprocessing the Data

- **Speaker and Experiment Filtering**  
The dataset can be filtered to focus on specific speakers (Participants or Interviewers) and experiments.

- **Text Cleaning Steps**
    - **Lowercasing:** Text was converted to lowercase.
    - **Lemmatization:** Words were reduced to their base forms.
    - **Stop-word Removal:** Both generic and custom stop words were removed to produce meaningful topics.

Rows with empty or non-informative content after preprocessing are removed to ensure high-quality input for modeling.
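
The `preprocess_text` helper used in the next cell is not shown in this notebook. As a rough illustration of the three cleaning steps, the sketch below uses a tiny hand-written stop-word set and a toy lemma lookup. Both are stand-ins chosen for this example (a real pipeline would rely on a full stop-word list and a proper lemmatizer), and the name `toy_preprocess` is hypothetical.

```python
import re

# Toy stand-ins (assumptions): a real pipeline would use a full stop-word list
# and a proper lemmatizer instead of these hand-written lookups.
GENERIC_STOPWORDS = {"i", "it", "was", "a", "the", "and", "of", "to", "my", "that", "so"}
TOY_LEMMAS = {"felt": "feel", "relaxed": "relax", "eyes": "eye", "closed": "close"}


def toy_preprocess(text, extra_stopwords=(), retain_stopwords=()):
    """Lowercase, lemmatize (via the toy lookup), and drop stop words."""
    stopwords = (GENERIC_STOPWORDS | set(extra_stopwords)) - set(retain_stopwords)
    tokens = re.findall(r"[a-z']+", text.lower())          # lowercase + tokenize
    lemmas = [TOY_LEMMAS.get(tok, tok) for tok in tokens]  # reduce to base forms
    return " ".join(tok for tok in lemmas if tok not in stopwords)


cleaned = toy_preprocess(
    "Yeah, I felt really relaxed and my eyes closed.",
    extra_stopwords=["yeah", "really", "feel"],
)
print(cleaned)
```

Stop words are filtered after lemmatization here, so a custom entry such as "feel" also removes "felt". A turn that consists only of stop words comes back as an empty string, which is why empty rows are dropped afterwards.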

In [27]:
df = df_all.copy()
# Keep only one speaker's speech: "Participant" (used here) or "Interviewer";
# comment the line out to keep both
df = df[df["Speaker"] == "Participant"]

# Optionally restrict the analysis to a subset of experiments
#df = df[df["Experiment"] != "Compassion"]

# Preprocess the content
extra_stopwords = [
    # Filler Words: Common conversational placeholders without thematic value
    "yeah", "okay", "yes", "mean", "oh", "ah", "like", "kind","kinda", "course", "way",
    # Vague/Ambiguous Words: Frequent but thematically irrelevant in conversations
    "think", "know", "really", "bit", "feel", "thing", "sort", "maybe", "little", "actually",
    "sure", "exactly", "tell", "ask", "people",
    # Broad terms or context-specific words overshadowing subtler themes
    "question", "sorry", "time", "first", "second", "later", "experience", "end", "meditation"
]
df['preprocessed_content'] = df['Content'].apply(lambda x: preprocess_text(x, extra_stopwords=extra_stopwords, retain_stopwords=["yourself", "myself"]))

# Remove rows with empty content or content that's only punctuation after preprocessing
df = df[df['preprocessed_content'].str.strip().str.len() > 0]

## Splitting Data into Turns
To address the challenges of conversational data, the text is split into *turns*:

- A *turn* represents one speaker's uninterrupted speech until another speaker begins.
- This approach preserves local context and avoids blending contributions from multiple speakers.
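
The dataset used here already ships with a `turn_index` column, and how it was derived is not shown. One way to compute such an index, sketched below on hypothetical utterance rows, is to group consecutive rows by speaker and number the groups.

```python
from itertools import groupby

# Hypothetical utterance rows (speaker, text) from a single interview, in order.
rows = [
    ("Interviewer", "How was it?"),
    ("Participant", "It was calm."),
    ("Participant", "I almost fell asleep."),
    ("Interviewer", "Anything else?"),
    ("Participant", "The forest was nice."),
]

# Consecutive rows by the same speaker form one turn; the text is joined per turn.
turns = [
    (turn_index, speaker, " ".join(text for _, text in group))
    for turn_index, (speaker, group) in enumerate(groupby(rows, key=lambda r: r[0]))
]

for turn in turns:
    print(turn)
```

With speakers alternating, keeping only one speaker leaves every other turn index, which matches the gaps visible in the `turn_index` column of the grouped output.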

In [28]:
# Merge the rows belonging to each turn, per interview (File Name)
df = df.groupby(['File Name','turn_index']).agg({ 
    'Content': ' '.join,  # Combine raw text
    'preprocessed_content': ' '.join,  # Combine preprocessed text
    'Experiment': 'first',
    'Condition': 'first', 
    'Id': 'first',  
    'Speaker': 'first',   
}).reset_index()
df.head(2)

Unnamed: 0,File Name,turn_index,Content,preprocessed_content,Experiment,Condition,Id,Speaker
0,ID 05,1,"So, that was very, let's say, unexpected and s...",let unexpected surprising moment realize disco...,OBE1,1,5,Participant
1,ID 05,3,"It was a little bit like, okay, well, so it's ...",watch outside special lot emotion explain moment,OBE1,1,5,Participant


The preprocessed text is converted into documents for embedding generation. The choice of document representation significantly impacts the quality and coherence of the topics generated:

- **Preprocessed Content:** Cleaned and standardized text, which is ideal for improving topic coherence and ensuring embeddings capture meaningful content.  
    - *Conversational datasets* often involve shifting topics, filler words, and inconsistent sentence structures. For smaller datasets, preprocessing **before embedding generation** can yield better results by reducing noise and improving focus on meaningful terms.

- **Original Content:** Raw, unprocessed text, retaining the original language structure. This is useful for preserving context but can produce noisier results, especially for **smaller** datasets and/or **conversational** data.  
    - *Standard Approach:* Preprocessing (e.g., stop-word removal, lemmatization) is typically performed **after embeddings are generated** to preserve the raw context during vectorization. This ensures embeddings reflect the text's full semantic structure before clustering.

I recommend testing both strategies on your dataset to determine which works best for your case. For this project, preprocessing **before embedding generation** gave better results in our testing, in line with the literature.

In [29]:
df["Index"] = df.index
docs = list(df.preprocessed_content)
#docs = list(df.Content)
print(len(docs))

668


# Defining the Topic Modeling Pipeline
This section outlines the pipeline for topic modeling, which includes embedding generation, dimensionality reduction, clustering, topic representation, and visualization. The pipeline is designed to efficiently process the dataset while allowing flexibility for fine-tuning the quality and number of topics.

- **Embedding Model:** We use ``all-mpnet-base-v2`` from SentenceTransformers to generate high-quality embeddings. ``all-MiniLM-L6-v2`` is a faster alternative with slightly lower accuracy.
- **Dimensionality Reduction:** UMAP reduces the high-dimensional embedding space to a lower-dimensional space to enhance clustering performance.
- **Clustering:** HDBSCAN is used for clustering embeddings. It dynamically determines the number of clusters (topics) and identifies outliers.
    - The parameter ``min_cluster_size`` sets the minimum number of documents per topic, which makes it the main lever for tuning the number of topics generated; this is especially important for smaller datasets.
- **Topic Representation and Vectorization (Optional):** When working with raw, unprocessed text, additional steps can refine topic quality and reduce noise.
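
Because every cluster must contain at least `min_cluster_size` documents, the document count gives a loose ceiling on how many topics can emerge. The sketch below works this out for the 668 turns in this notebook. The candidate values other than 8 are arbitrary examples, and the actual number of topics is typically lower because some documents end up as outliers.

```python
n_docs = 668  # number of turns (documents) reported earlier in this notebook

# Upper bound on the number of clusters: each needs at least min_cluster_size documents.
upper_bounds = {m: n_docs // m for m in (5, 8, 15, 30)}

for min_cluster_size, max_topics in upper_bounds.items():
    print(f"min_cluster_size={min_cluster_size:>2} -> at most {max_topics} topics")
```

This is only a sanity check for picking a starting value; the useful range still has to be found by fitting the model and inspecting the topics.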

In [30]:
from sentence_transformers import SentenceTransformer

# Pre-calculate embeddings
embedding_model = SentenceTransformer("all-mpnet-base-v2") # Better but slower: all-mpnet-base-v2 || Trade-off: all-MiniLM-L6-v2
embeddings = embedding_model.encode(docs, show_progress_bar=True)

Batches:   0%|          | 0/21 [00:00<?, ?it/s]

In [31]:
# Tune the number of topics generated:
# increasing this value reduces the number of topics.
# Scale it to the total number of documents (here, the number of turns).
min_cluster_size = 8

In [32]:
from umap import UMAP
from hdbscan import HDBSCAN

# Dimensionality reduction model
umap_model = UMAP(n_neighbors=15, n_components=8, min_dist=0.0, metric='cosine', random_state=42)

# Clustering model
hdbscan_model = HDBSCAN(min_cluster_size=min_cluster_size, metric='euclidean', cluster_selection_method='eom')

# Optional: Topic representation for raw content
#from bertopic.representation import KeyBERTInspired, MaximalMarginalRelevance
#representation_model = [KeyBERTInspired(), MaximalMarginalRelevance(diversity=.5)]

# Optional: Vectorizer model for raw content
#from sklearn.feature_extraction.text import CountVectorizer
#stops_words = preprocess_text("sample", return_stopwords=True, extra_stopwords=extra_stopwords, retain_stopwords=["yourself", "myself"])
#vectorizer_model = CountVectorizer(stop_words=stops_words)

topic_model = BERTopic(
    embedding_model=embedding_model,  # Embedding generation
    umap_model=umap_model,            # Dimensionality reduction
    hdbscan_model=hdbscan_model,      # Clustering
    #vectorizer_model=vectorizer_model,
    #representation_model=representation_model,
    verbose=True,
)

In [33]:
topics, ini_probs = topic_model.fit_transform(docs, embeddings=embeddings)
num_topics = len(topic_model.get_topics()) 
num_topics

2025-01-25 20:05:38,071 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2025-01-25 20:05:39,835 - BERTopic - Dimensionality - Completed ✓
2025-01-25 20:05:39,838 - BERTopic - Cluster - Start clustering the reduced embeddings
2025-01-25 20:05:39,907 - BERTopic - Cluster - Completed ✓
2025-01-25 20:05:39,917 - BERTopic - Representation - Extracting topics from clusters using representation models.
2025-01-25 20:05:39,975 - BERTopic - Representation - Completed ✓


20

In [34]:
os.makedirs("outputs/topics", exist_ok=True)
topic_model.get_topic_info().to_csv("outputs/topics/topic_names_info.csv",index=False)
topic_model.get_topic_info()

Unnamed: 0,Topic,Count,Name,Representation,Representative_Docs
0,-1,206,-1_myself_look_body_real,"[myself, look, body, real, point, different, f...",[calm relax big natural comfortable environmen...
1,0,47,0_interesting_easy_nice_fine,"[interesting, easy, nice, fine, fun, absolutel...","[interesting, easy nice, nice interesting]"
2,1,46,1_focus_leg_eye_distract,"[focus, leg, eye, distract, try, easy, concent...",[focus myself focus instruction eye closed hea...
3,2,46,2_relax_fall_sleep_asleep,"[relax, fall, sleep, asleep, calm, relaxed, sl...","[calm want sleep, relaxed afraid fall sleep go..."
4,3,44,3_reality_body_different_yourself,"[reality, body, different, yourself, room, out...",[understand differently today want life illusi...
5,4,37,4_body_heavy_light_come,"[body, heavy, light, come, half, phrase, outsi...","[body, body, body]"
6,5,29,5_color_yellow_normal_difference,"[color, yellow, normal, difference, change, di...","[yellow color blue color, notice change color,..."
7,6,23,6_touch_delay_scene_image,"[touch, delay, scene, image, catch, body, touc...",[touch image touch touch difference touch imag...
8,7,23,7_virtual_body_vr_myself,"[virtual, body, vr, myself, hologram, actual, ...",[virtual body mind virtual body look myself ob...
9,8,23,8_forest_rock_tree_adventure,"[forest, rock, tree, adventure, indonesia, riv...",[nice forest love forest good surprise calm re...
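
In the table above, topic -1 (the outlier group) holds 206 of the 668 turns, roughly 31%. A quick way to track this share while tuning the pipeline is to count the hard assignments directly. The sketch below uses a hypothetical `topics` list in the same format as the one returned by `fit_transform` (one topic id per document, with -1 marking outliers).

```python
from collections import Counter

# Hypothetical hard assignments: one topic id per document, -1 marks outliers.
topics = [-1, 0, 0, 1, -1, 2, 1, -1, 0, 2]

counts = Counter(topics)
n_outliers = counts.get(-1, 0)
n_real_topics = len(counts) - (1 if -1 in counts else 0)  # exclude the outlier group
outlier_share = n_outliers / len(topics)

print(f"{n_real_topics} topics, {outlier_share:.0%} of documents are outliers")
```

A high outlier share is one motivation for assigning several topics per document instead of relying on the single hard label.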


In [35]:
topic_model.visualize_barchart(top_n_topics=16)

In [36]:
topic_model.visualize_topics()

In [37]:
topic_model.visualize_documents(docs, embeddings=embeddings)

In [38]:
#topics_per_class = topic_model.topics_per_class(docs, classes=df.Id)
#topic_model.visualize_topics_per_class(topics_per_class)

In [39]:
# hierarchical_topics = topic_model.hierarchical_topics(docs)
# topic_model.visualize_hierarchy(hierarchical_topics=hierarchical_topics)

In [40]:
df['one_topic'] = topics
# Note: despite its name, topic_name_to_id maps topic id -> topic name
topic_name_to_id = dict(zip(topic_model.get_topic_info().Topic, topic_model.get_topic_info().Name))
df['one_topic_name'] = df['one_topic'].map(topic_name_to_id)

# Topic Distribution: Assigning Multiple Topics Per Document

In some cases, documents may align with multiple topics rather than a single one. Assigning multiple topics to a document allows for:

1. **Reducing Outliers:** Fewer documents are classified as noise because of marginal probabilities.
2. **Improved Accuracy:** Acknowledging topic overlaps mitigates false positives and negatives.
3. **Detailed Analysis:** When a specific topic is of interest, documents are less likely to be incorrectly excluded.
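
As a minimal illustration of the idea, the sketch below applies a probability threshold to a hypothetical per-document topic distribution. The values in `toy_distr` and the threshold are made up for this example; rows are documents and columns are topics 0 to 2.

```python
# Hypothetical per-document topic distributions (rows: documents, columns: topics 0..2).
toy_distr = [
    [0.70, 0.05, 0.00],  # clearly topic 0
    [0.30, 0.28, 0.00],  # overlaps topics 0 and 1
    [0.10, 0.05, 0.08],  # nothing stands out
]
threshold = 0.25

# Keep every topic whose probability reaches the threshold.
assigned = [
    [topic_id for topic_id, prob in enumerate(row) if prob >= threshold]
    for row in toy_distr
]
print(assigned)  # [[0], [0, 1], []]
```

The second document keeps both of its topics, while the third keeps none; where the threshold sits decides how many documents fall into each of those cases.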

In [41]:
topic_distr, topic_token_distr = topic_model.approximate_distribution(
    docs, window=5, calculate_tokens=True
)

100%|██████████| 1/1 [00:00<00:00, 13.18it/s]


In [42]:
import tqdm
import numpy as np
import plotly.express as px

tmp_dfs = []

# iterating through different threshold levels
for thr in tqdm.tqdm(np.arange(0, 0.35, 0.001)):
    # number of topics with probability >= threshold for each document
    tmp_df = pd.DataFrame({'num_topics': (np.asarray(topic_distr) >= thr).sum(axis=1)})
    tmp_df['num_docs'] = 1
    
    tmp_df['num_topics_group'] = tmp_df['num_topics']\
        .map(lambda x: str(x) if x < 5 else '5+')
    
    # aggregating stats
    tmp_df_aggr = tmp_df.groupby('num_topics_group', as_index = False).num_docs.sum()
    tmp_df_aggr['threshold'] = thr
    
    tmp_dfs.append(tmp_df_aggr)

num_topics_stats_df = pd.concat(tmp_dfs).pivot(index = 'threshold', 
                              values = 'num_docs',
                              columns = 'num_topics_group').fillna(0)

num_topics_stats_df = num_topics_stats_df.apply(lambda x: 100.*x/num_topics_stats_df.sum(axis = 1))

100%|██████████| 350/350 [00:01<00:00, 183.07it/s]


### Iterative Threshold Testing
In this approach, multiple thresholds are tested to understand how the distribution of topics changes across documents. This allows for:

- **Proportion Analysis:** Examining the number of topics assigned to documents at various confidence levels.
- **Outlier Reduction:** Identifying a threshold that minimizes the number of documents classified as outliers (no assigned topics).
- **Avoiding Overlap:** Ensuring that documents are not assigned an excessive number of topics, which could dilute interpretability.
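
The trade-off between these goals can be seen on a small scale. The sketch below sweeps three thresholds over a hypothetical topic distribution and reports, for each one, the share of documents left without a topic and the share with three or more topics. The data, the helper name `summarize`, and the cut-off of three topics are assumptions made for illustration.

```python
# Hypothetical per-document topic distributions (rows: documents, columns: topics).
toy_distr = [
    [0.70, 0.05, 0.00],
    [0.30, 0.28, 0.00],
    [0.10, 0.05, 0.08],
    [0.22, 0.21, 0.20],
]


def summarize(distr, threshold):
    """Return (share of docs with no topic, share of docs with 3+ topics)."""
    n_topics = [sum(p >= threshold for p in row) for row in distr]
    no_topic = sum(n == 0 for n in n_topics) / len(distr)
    many_topics = sum(n >= 3 for n in n_topics) / len(distr)
    return no_topic, many_topics


summary = {thr: summarize(toy_distr, thr) for thr in (0.05, 0.15, 0.25)}
for thr, (no_topic, many) in summary.items():
    print(f"threshold={thr:.2f}: {no_topic:.0%} without a topic, {many:.0%} with 3+ topics")
```

Raising the threshold shrinks the group with many topics but grows the group with none, which is the same pattern the area chart shows for the full dataset.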

In [43]:
colormap = px.colors.sequential.YlGnBu
px.area(num_topics_stats_df, 
       title = 'Distribution of number of topics',
       labels = {'num_topics_group': 'number of topics',
                'value': 'share of documents, %'},
       color_discrete_map = {
          '0': colormap[0],
          '1': colormap[3],
          '2': colormap[4],
          '3': colormap[5],
          '4': colormap[6],
          '5+': colormap[7]
      })

**Recommendation:** Select a threshold that minimizes outliers while avoiding excessive topic overlap for most documents.

In [44]:
threshold = 0.25

# Keep every topic whose probability is >= threshold for each document
df['multiple_topics'] = [
    [topic_id for topic_id, prob in enumerate(doc_topic_distr) if prob >= threshold]
    for doc_topic_distr in topic_distr
]

df["multiple_topics_name"] = df["multiple_topics"].map(lambda x: [topic_name_to_id.get(i, "No topic") for i in x])

**The assigned topics for each document are saved** to `outputs/topics/df_topic.csv`, enabling further analysis in the [Analysis Topics notebook](analysis_topics.ipynb).

In [45]:
df.to_csv("outputs/topics/df_topic.csv", index = False)
df.head()

Unnamed: 0,File Name,turn_index,Content,preprocessed_content,Experiment,Condition,Id,Speaker,Index,one_topic,one_topic_name,multiple_topics,multiple_topics_name
0,ID 05,1,"So, that was very, let's say, unexpected and s...",let unexpected surprising moment realize disco...,OBE1,1,5,Participant,0,3,3_reality_body_different_yourself,[3],[3_reality_body_different_yourself]
1,ID 05,3,"It was a little bit like, okay, well, so it's ...",watch outside special lot emotion explain moment,OBE1,1,5,Participant,1,-1,-1_myself_look_body_real,"[3, 17]","[3_reality_body_different_yourself, 17_room_ce..."
2,ID 05,5,So I'm not sure I've got all the perfect descr...,perfect description,OBE1,1,5,Participant,2,0,0_interesting_easy_nice_fine,[0],[0_interesting_easy_nice_fine]
3,ID 05,7,"The thing that I didn't really understand, but...",understand understand body touching basically ...,OBE1,1,5,Participant,3,6,6_touch_delay_scene_image,"[0, 6]","[0_interesting_easy_nice_fine, 6_touch_delay_s..."
4,ID 05,9,So I felt like a time lag in the last one of w...,lag body,OBE1,1,5,Participant,4,4,4_body_heavy_light_come,[4],[4_body_heavy_light_come]
