### BERTopic Analysis | Computer-Assisted Content Analysis

This notebook documents the third phase of our mixed-methods analysis, where we used Computer-Assisted Content Analysis (CACA) with BERTopic to support and refine our thematic categories. Specifically, BERTopic was applied to surface latent discourse patterns, expand our keyword sets, and validate or revise the preliminary themes identified during qualitative immersion, open coding, and axial coding.
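
One way to picture the keyword-expansion step is to compare a topic's top words against a hand-coded theme list and keep only the words not already covered. The sketch below is a minimal illustration under stated assumptions, not the procedure used in this project: `theme_keywords`, `STOPWORDS`, and `candidate_keywords` are hypothetical names, and the tiny stopword set stands in for a fuller list.

```python
# Hypothetical hand-coded keywords for one preliminary theme (illustrative only).
theme_keywords = {"alpha", "beta"}

# Tiny stand-in stopword set; a real analysis would use a fuller list.
STOPWORDS = {"is", "the", "an", "they", "to", "and", "of", "that"}


def candidate_keywords(topic_words, known_keywords, stopwords=STOPWORDS):
    """Return topic words that are neither stopwords nor already coded."""
    return [
        w for w in topic_words
        if w.lower() not in stopwords and w.lower() not in known_keywords
    ]


# Top words as a topic model might report them for a single topic.
topic_words = ["alpha", "beta", "betas", "alphas", "is", "the", "an", "they"]
new_terms = candidate_keywords(topic_words, theme_keywords)
print(new_terms)
```

Candidates surfaced this way would still need a qualitative read of the representative documents before being added to a codebook.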

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from gensim.corpora.dictionary import Dictionary
from bertopic import BERTopic
from hdbscan import HDBSCAN
import glob

In [4]:
csv_files = glob.glob("money_links_df*.csv") # find all the CSVs produced by the scraping step

# Read and concatenate all CSVs
df = pd.concat((pd.read_csv(f) for f in csv_files), ignore_index=True)

df = df[~df['text'].str.contains(r'\[deleted\]|\[\-\-removed\-\-]', case=False, na=False)].copy() # drop removed or deleted text
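
To check what this filter catches, the same pattern can be run over a few toy rows with the standard `re` module. This is a small sketch, assuming the scrape may also contain a plain `[removed]` placeholder; the sample rows and the `is_placeholder` helper are hypothetical. It suggests the pattern matches `[deleted]` and `[--removed--]` in any case, but not `[removed]`, and that missing values pass through (mirroring `na=False`).

```python
import re

# Same pattern as the filter above, compiled case-insensitively
# to mirror str.contains(..., case=False).
PLACEHOLDER = re.compile(r"\[deleted\]|\[\-\-removed\-\-]", re.IGNORECASE)

# Toy rows; None stands in for a missing (NaN) text value.
rows = ["[deleted]", "Do you even lift?", "[--removed--]", "[Deleted]", "[removed]", None]


def is_placeholder(text):
    """True if the text matches the pattern; missing text counts as no match."""
    return text is not None and PLACEHOLDER.search(text) is not None


kept = [r for r in rows if not is_placeholder(r)]
print(kept)
```

If plain `[removed]` rows do occur in the data, the pattern would need an extra alternative to drop them.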

In [5]:
docs = df['text'].dropna().astype(str).tolist() # drop rows with missing text and convert the rest to a list of strings

In [6]:
# Create a custom HDBSCAN model. In HDBSCAN itself, min_cluster_size defaults to 5, but that took hours to run, so I set it manually to 100.
hdbscan_model = HDBSCAN(min_cluster_size=100, min_samples=15, metric='euclidean', prediction_data=True)

# Pass the custom clustering model into BERTopic
topic_model = BERTopic(
    language="english",
    calculate_probabilities=True,
    verbose=True,
    hdbscan_model=hdbscan_model
)


topics, probs = topic_model.fit_transform(docs)

2025-05-14 11:36:26,346 - BERTopic - Embedding - Transforming documents to embeddings.
Batches: 100%|█████████████████████████████████████████████████████████████████████| 3719/3719 [18:15<00:00,  3.40it/s]
2025-05-14 11:54:48,759 - BERTopic - Embedding - Completed ✓
2025-05-14 11:54:48,760 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2025-05-14 11:56:14,701 - BERTopic - Dimensionality - Completed ✓
2025-05-14 11:56:14,702 - BERTopic - Cluster - Start clustering the reduced embeddings
2025-05-14 11:59:28,472 - BERTopic - Cluster - Completed ✓
2025-05-14 11:59:28,527 - BERTopic - Representation - Fine-tuning topics using representation models.
2025-05-14 11:59:36,525 - BERTopic - Representation - Completed ✓
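
The timestamps in this log give a rough per-stage runtime. The sketch below copies each stage's start and end time from the log lines above and subtracts them. The stage labels are just the names printed in the log, and the `stages` list is typed in by hand for illustration.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S,%f"

# (stage, start, end) timestamps copied from the log above.
stages = [
    ("Embedding", "2025-05-14 11:36:26,346", "2025-05-14 11:54:48,759"),
    ("Dimensionality", "2025-05-14 11:54:48,760", "2025-05-14 11:56:14,701"),
    ("Cluster", "2025-05-14 11:56:14,702", "2025-05-14 11:59:28,472"),
    ("Representation", "2025-05-14 11:59:28,527", "2025-05-14 11:59:36,525"),
]

# Elapsed seconds per stage.
durations = {
    name: (datetime.strptime(end, FMT) - datetime.strptime(start, FMT)).total_seconds()
    for name, start, end in stages
}

for name, seconds in durations.items():
    print(f"{name:<15} {seconds:8.1f} s")
```

In this run the embedding step accounts for most of the wall-clock time, with clustering a distant second.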


In [12]:
topic_model.get_topic_info().head(40)

Unnamed: 0,Topic,Count,Name,Representation,Representative_Docs
0,-1,64237,-1_to_the_and_you,"[to, the, and, you, of, that, is, in, it, for]",[It’s been some time since my last post or hav...
1,0,2530,0_alpha_beta_betas_alphas,"[alpha, beta, betas, alphas, is, the, an, they...",[A beta can in certain situations choose to do...
2,1,2114,1_pill_red_blue_the,"[pill, red, blue, the, of, is, and, to, that, we]","[This, this and fucking this. It's time for th..."
3,2,1659,2_she_her_me_was,"[she, her, me, was, shes, and, girl, with, to,...",[Before I make this post I know I'm down bad/a...
4,3,1621,3_dad_my_mom_father,"[dad, my, mom, father, mother, kids, parents, ...",[The haters love to argue that it was simply t...
5,4,1584,4_job_degree_business_work,"[job, degree, business, work, college, you, co...",[After seeing the recent shit storm after GLOs...
6,5,1540,5_life_you_your_yourself,"[life, you, your, yourself, to, that, it, is, ...",[Usually when someone seeks to change somethin...
7,6,1335,6_age_old_older_year,"[age, old, older, year, 30, younger, 20s, youn...",[The same way you are meeting girls in your ag...
8,7,1040,7_lift_lifting_weights_weight,"[lift, lifting, weights, weight, reps, bench, ...","[Do you even lift? , Do you lift? , Lift. 1]"
9,8,844,8_marriage_married_divorce_marry,"[marriage, married, divorce, marry, get, contr...","[Why do you want to get married? , To get out ..."
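
Topic -1 is BERTopic's outlier bucket, and in the table above it holds 64,237 documents, far more than any named topic. A quick way to track this is to compute each topic's share of all assignments. The sketch below is a minimal illustration on a toy list; `topic_shares` and `toy_topics` are hypothetical names, and in the notebook the input would be the `topics` list returned by `fit_transform`.

```python
from collections import Counter


def topic_shares(topics):
    """Map each topic id to its share of all documents."""
    counts = Counter(topics)
    total = len(topics)
    return {topic: count / total for topic, count in counts.items()}


# Toy assignment list standing in for the `topics` returned by fit_transform.
toy_topics = [-1, -1, -1, 0, 0, 1, -1, 2, 0, -1]
shares = topic_shares(toy_topics)
print(f"outlier share: {shares[-1]:.0%}")
```

A large outlier share is one signal that the clustering settings may be worth revisiting before topics are mapped onto thematic categories.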


In [13]:
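# Save the topic table; the .csv extension makes the file type explicit and index=False avoids writing an unnamed index column
topic_model.get_topic_info().to_csv('red_topics.csv', index=False)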
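
When the exported table is read back later, list-valued columns such as `Representation` come back as plain strings, because pandas writes a list cell as its Python string form. The sketch below shows one way to recover the lists with `ast.literal_eval`, using only the standard library. The inline `csv_text` is a toy stand-in for the exported file and contains a single row taken from the table above.

```python
import ast
import csv
import io

# Toy stand-in for the exported file: the list column is stored as a string.
csv_text = """Topic,Count,Name,Representation
7,1040,7_lift_lifting_weights_weight,"['lift', 'lifting', 'weights', 'weight']"
"""

rows = list(csv.DictReader(io.StringIO(csv_text)))
for row in rows:
    # literal_eval safely parses the "['a', 'b']" string back into a list.
    row["Representation"] = ast.literal_eval(row["Representation"])
    row["Count"] = int(row["Count"])

print(rows[0]["Representation"])
```

`ast.literal_eval` only evaluates Python literals, which makes it a safer choice than `eval` for this kind of round trip.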