# Topic Modeling and Narrative Analysis

This notebook uses BERTopic to analyze how narratives in YouTube comments evolve over time. We will:

1. Load the processed comments data
2. Prepare the texts for topic modeling
3. Train a BERTopic model
4. Analyze topic evolution over time
5. Visualize narrative changes

## Setup and Dependencies

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [2]:
! pip install bertopic



In [3]:
# Import required libraries
import os
import pandas as pd
import numpy as np
from bertopic import BERTopic
from datetime import datetime
import plotly.express as px
import plotly.graph_objects as go
from tqdm.notebook import tqdm
from sklearn.feature_extraction.text import CountVectorizer

# For preprocessing
import re
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
nltk.download('punkt')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

## 1. Load and Prepare Data

First, we'll load our processed comments and prepare them for topic modeling.

In [4]:

df_comments = pd.read_csv("/content/drive/MyDrive/Youtube Data/processed_comments.csv")

print("Length of comments before preprocessing:", len(df_comments["translated_text"]))

df_comments['published_at'] = pd.to_datetime(df_comments['published_at'])

# Prepare texts for topic modeling (use translated text if available)
texts = df_comments['translated_text'].fillna(df_comments['comment_text']).values
timestamps = df_comments['published_at'].values

# Basic text cleaning function
def clean_text(text):
    # Remove URLs
    text = re.sub(r'http\S+|www\S+|https\S+', '', text, flags=re.MULTILINE)
    # Remove special characters
    text = re.sub(r'[^\w\s]', '', text)
    # Remove numbers
    text = re.sub(r'\d+', '', text)
    # Convert to lowercase
    text = text.lower().strip()
    return text

# Clean the texts (str() guards against non-string values;
# note that it turns a missing value into the literal string "nan")
print("Cleaning texts...")
cleaned_texts = [clean_text(str(text)) for text in tqdm(texts)]

# Remove texts that are empty after cleaning and keep timestamps aligned
valid_indices = [i for i, text in enumerate(cleaned_texts) if len(text.strip()) > 0]
cleaned_texts = [cleaned_texts[i] for i in valid_indices]
timestamps = timestamps[valid_indices]

print(f"Total number of valid comments after preprocessing: {len(cleaned_texts)}")

Length of comments before preprocessing: 399994
Cleaning texts...

Total number of valid comments after preprocessing: 387462
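
Preprocessing drops 12,532 comments (399,994 − 387,462) that are empty after cleaning. To see what the cleaning steps do to a single comment, the sketch below repeats `clean_text` from the cell above on a few strings. The sample comments are hypothetical and are not taken from the dataset.

```python
import re

def clean_text(text):
    # Same steps as the notebook cell above
    text = re.sub(r'http\S+|www\S+|https\S+', '', text, flags=re.MULTILINE)  # URLs
    text = re.sub(r'[^\w\s]', '', text)  # punctuation, emoji, other symbols
    text = re.sub(r'\d+', '', text)      # digits
    return text.lower().strip()

# Hypothetical sample comments (not from the dataset)
samples = [
    "Check https://example.com NOW!!! 123",
    "👍👍👍",
    "¡Feliz cumpleaños!",
]

cleaned = [clean_text(s) for s in samples]
for raw, out in zip(samples, cleaned):
    print(repr(raw), "->", repr(out))
```

Three things stand out. An emoji-only comment becomes an empty string and is therefore removed by the `valid_indices` filter. Inner whitespace is not collapsed, so a removed URL leaves a double space behind. Accented and non-Latin letters survive because `\w` is Unicode-aware in Python 3.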


## 2. Train BERTopic Model

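Now we'll train the BERTopic model and extract topics from our comments. The model is fitted on the first 20,000 cleaned comments rather than on all 387,462, and the topics are reduced to 10, a number that includes the outlier topic -1.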

In [5]:
# Load the fine-tuned multilingual sentence-transformer model
from sentence_transformers import SentenceTransformer

# Data directory, also used for saving the topic model
data_dir = '/content/drive/MyDrive/Youtube Data/'

# Assumes the fine-tuned model is saved at the path below.
# Note: BERTopic only uses this model if it is passed as
# embedding_model=multilingual_model. The next cell does not pass it,
# and its printed configuration shows embedding_model=None.
multilingual_model = SentenceTransformer('/content/drive/MyDrive/Youtube Data/best_model/', local_files_only=True)


Some weights of XLMRobertaModel were not initialized from the model checkpoint at /content/drive/MyDrive/Youtube Data/best_model/ and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [6]:
# Initialize and train the BERTopic model
# (SentenceTransformer and data_dir are already defined in the previous cell)

# Count unigrams to trigrams and drop English stop words only;
# stop words from other languages stay in the vocabulary
vectorizer_model = CountVectorizer(ngram_range=(1,3), stop_words="english")


topic_model = BERTopic(
    language="multilingual",
    verbose=True,
    calculate_probabilities=True,
    nr_topics=10,
    vectorizer_model=vectorizer_model
)

print(topic_model)

# Fit the model on the first 20,000 cleaned comments and get the topics
topics, probs = topic_model.fit_transform(cleaned_texts[:20000])

print("Fitting done")

# Save the model
topic_model.save(os.path.join(data_dir, "topic_model"))

# Print topic information
print("\nTop 10 topics:")
display(topic_model.get_topic_info())




2025-11-13 06:50:58,283 - BERTopic - Embedding - Transforming documents to embeddings.


BERTopic(calculate_probabilities=True, ctfidf_model=ClassTfidfTransformer(...), embedding_model=None, hdbscan_model=HDBSCAN(...), language=multilingual, low_memory=False, min_topic_size=10, n_gram_range=(1, 1), nr_topics=10, representation_model=None, seed_topic_list=None, top_n_words=10, umap_model=UMAP(...), vectorizer_model=CountVectorizer(...), verbose=True, zeroshot_min_similarity=0.7, zeroshot_topic_list=None)


2025-11-13 06:51:11,904 - BERTopic - Embedding - Completed ✓
2025-11-13 06:51:11,904 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2025-11-13 06:51:37,480 - BERTopic - Dimensionality - Completed ✓
2025-11-13 06:51:37,482 - BERTopic - Cluster - Start clustering the reduced embeddings
2025-11-13 06:52:50,258 - BERTopic - Cluster - Completed ✓
2025-11-13 06:52:50,259 - BERTopic - Representation - Extracting topics using c-TF-IDF for topic reduction.
2025-11-13 06:52:52,013 - BERTopic - Representation - Completed ✓
2025-11-13 06:52:52,019 - BERTopic - Topic reduction - Reducing number of topics
2025-11-13 06:52:52,060 - BERTopic - Representation - Fine-tuning topics using representation models.
2025-11-13 06:52:53,657 - BERTopic - Representation - Completed ✓
2025-11-13 06:52:53,667 - BERTopic - Topic reduction - Reduced number of topics from 315 to 10


Fitting done

Top 10 topics:


| Topic | Count | Name | Representation | Representative_Docs |
| --- | --- | --- | --- | --- |
| -1 | 7255 | -1_song_like_love_just | [song, like, love, just, people, dont, im, que... | [happy birthday, happy birthday, like] |
| 0 | 9601 | 0_happy_birthday_song_happy birthday | [happy, birthday, song, happy birthday, like, ... | [happy birthday, happy birthday, happy birthday] |
| 1 | 1616 | 1_trump_people_president_country | [trump, people, president, country, like, dont... | [for the life of me i cant understand why the ... |
| 2 | 748 | 2_shes_like_teacher_woman | [shes, like, teacher, woman, just, girl, women... | [truly admirable we are the character we proje... |
| 3 | 274 | 3_earth_million_views_years | [earth, million, views, years, million years, ... | [views, it was am i was carrying a telescope ... |
| 4 | 264 | 4_police_cops_brain_law | [police, cops, brain, law, bitcoin, officers, ... | [tengo que corregir que la billetera de donaci... |
| 5 | 89 | 5_house_eviction_jeffree_evicted | [house, eviction, jeffree, evicted, rent, home... | [i went through the same thing my boys were yo... |
| 6 | 63 | 6_bone_fabric_thugs_bone thugs | [bone, fabric, thugs, bone thugs, self healing... | [bone bone be ne bone bone, bone bone bone bon... |
| 7 | 50 | 7_car_speed_trailer_slow | [car, speed, trailer, slow, auto, ich, bei, ca... | [recht hast du sowas erzeugt bei jedem autofan... |
| 8 | 40 | 8_button_google_thumbnail_doodle | [button, google, thumbnail, doodle, replay but... | [wait for hit the like button, if you search t... |
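
The counts in the table add up to the 20,000 fitted comments, so each topic's share can be worked out directly. The sketch below does that arithmetic with the counts copied from the table above; the variable names are illustrative only.

```python
# Topic counts copied from the topic table above (topic id -> count)
topic_counts = {-1: 7255, 0: 9601, 1: 1616, 2: 748, 3: 274,
                4: 264, 5: 89, 6: 63, 7: 50, 8: 40}

total = sum(topic_counts.values())

# Share of each topic as a percentage of all fitted comments
shares = {topic: 100 * count / total for topic, count in topic_counts.items()}

for topic, share in shares.items():
    print(f"Topic {topic:>2}: {share:5.1f}%")

outlier_share = shares[-1]
top_two_share = shares[-1] + shares[0]
```

About 36% of the comments fall into the outlier topic -1 and about 48% into topic 0 ("happy birthday"), which leaves under 16% for the remaining eight topics. That imbalance is worth keeping in mind when reading the time-series plots later on.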


In [7]:
# Compute topic evolution over time (same 20,000-comment subset used for fitting)
topics_over_time = topic_model.topics_over_time(
    cleaned_texts[:20000],
    timestamps[:20000],
    global_tuning=True,
    evolution_tuning=True,
    nr_bins=50
)

49it [00:47,  1.04it/s]
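
`nr_bins=50` asks BERTopic to group the timestamps into 50 bins before it computes a topic representation per bin. The following is a minimal sketch of equal-width binning, assuming that is how the bins are formed; BERTopic's own implementation may treat bin edges differently. The function name and the sample dates are hypothetical.

```python
from datetime import datetime

def assign_equal_width_bins(timestamps, nr_bins):
    """Map each timestamp to the index of an equal-width bin (0 .. nr_bins - 1)."""
    start, end = min(timestamps), max(timestamps)
    width = (end - start) / nr_bins  # a timedelta
    bins = []
    for ts in timestamps:
        # timedelta / timedelta yields a float; guard against a zero-width range
        idx = int((ts - start) / width) if width else 0
        # The latest timestamp falls exactly on the upper edge; keep it in the last bin
        bins.append(min(idx, nr_bins - 1))
    return bins

# Hypothetical timestamps spanning ten days
sample_timestamps = [
    datetime(2020, 1, 1),
    datetime(2020, 1, 2),
    datetime(2020, 1, 6),
    datetime(2020, 1, 11),
]

bins = assign_equal_width_bins(sample_timestamps, nr_bins=5)
non_empty_bins = len(set(bins))
print(bins, non_empty_bins)
```

In this toy example only three of the five bins receive any timestamps. Unevenly spread comment dates can leave bins empty in the same way, so the number of bins actually processed may be smaller than `nr_bins`.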


## 3. Visualize Topic Evolution

Let's create visualizations to track how narratives evolve over time.

In [8]:
# Plot topic evolution over time
fig = topic_model.visualize_topics_over_time(topics_over_time)
fig.update_layout(xaxis_title="Time", yaxis_title="Topic Frequency")
fig.show()
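
Because a single topic dominates the fitted sample, raw frequencies can hide changes in the smaller topics. One option is to convert each bin's counts into shares of that bin. The sketch below shows the idea in plain Python on hypothetical rows shaped like the `Topic`, `Frequency`, and `Timestamp` columns that `topics_over_time` is assumed to return; the numbers are made up for illustration.

```python
from collections import defaultdict

# Hypothetical rows (not real results) with Topic / Frequency / Timestamp fields
rows = [
    {"Topic": 0, "Frequency": 80, "Timestamp": "2020-01"},
    {"Topic": 1, "Frequency": 20, "Timestamp": "2020-01"},
    {"Topic": 0, "Frequency": 30, "Timestamp": "2020-02"},
    {"Topic": 1, "Frequency": 30, "Timestamp": "2020-02"},
]

# Total number of comments per time bin
totals = defaultdict(int)
for row in rows:
    totals[row["Timestamp"]] += row["Frequency"]

# Each topic's share of its time bin
shares = [
    {**row, "Share": row["Frequency"] / totals[row["Timestamp"]]}
    for row in rows
]

for row in shares:
    print(row["Timestamp"], row["Topic"], f"{row['Share']:.0%}")
```

Here topic 1 grows from 20% to 50% of the comments between the two bins, even though its raw count rises only from 20 to 30.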



In [9]:
# Create the topic hierarchy visualization
# (a dendrogram: one axis lists the topics, the other shows the distance at which they merge)
fig_hierarchy = topic_model.visualize_hierarchy()
fig_hierarchy.update_layout(xaxis_title="Distance between topics", yaxis_title="Topic", title="Topic Hierarchy (Dendrogram)")
fig_hierarchy.show()



In [10]:
# Create the intertopic distance map (topics projected to two dimensions)
fig_similarity = topic_model.visualize_topics()
fig_similarity.update_layout(xaxis_title="UMAP Dimension 1", yaxis_title="UMAP Dimension 2", title="Intertopic Distance Map")
fig_similarity.show()


In [11]:

# Print the most representative documents for the first five rows of the topic table
# (this includes the outlier topic -1)
for topic in topic_model.get_topic_info().head(5)['Topic']:
    print(f"\nTop 3 documents for Topic {topic}:")
    docs = topic_model.get_representative_docs(topic)
    for doc in docs[:3]:
        print(f"- {doc[:200]}...")


Top 3 documents for Topic -1:
- happy birthday...
- happy birthday...
- like...

Top 3 documents for Topic 0:
- happy birthday...
- happy birthday...
- happy birthday...

Top 3 documents for Topic 1:
- for the life of me i cant understand why the left is so hell bent on having the last word if you went to any biden voter even after all of this most likely they would take it to their grave he was a g...
- im looking back this is august  president trump is the best president weve had in so long look at biden and his administration trying to destroy america as fast as they can lindsey graham you are eati...
- this freaking guy i cant i really cant with this traitor this nevertrumper changed parties bc he didnt like who got the nod for the republican nominee and then the presidency i mean wow absolutely zer...

Top 3 documents for Topic 2:
- truly admirable we are the character we project outwards confidence shows in women just like it does for men and feeling capable and competent is wha