# BERTopic

Inspired by: [Tutorial - BERTopic Best Practices](https://colab.research.google.com/drive/1BoQ_vakEVtojsd2x_U6-_x52OOuqruj2?usp=sharing)


Created by: Oksana Kalytenko

In [2]:
%%capture
!pip install bertopic

Restart the kernel/session after the installation finishes.

In [3]:
import pandas as pd

from sentence_transformers import SentenceTransformer
from umap import UMAP
from hdbscan import HDBSCAN
from sklearn.feature_extraction.text import CountVectorizer
from bertopic.representation import KeyBERTInspired, MaximalMarginalRelevance, PartOfSpeech
from bertopic import BERTopic


import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import csv



In [5]:
# Load the CSV file into a DataFrame
dataset = pd.read_csv("/data/cleaned_webis_tldr_subreddit_explainlikeimfive.csv",  sep = ';')

# DataFrame
dataset.head()

Unnamed: 0,author,body,normalizedBody,subreddit,subreddit_id,id,content,summary
0,Cypriotmenace,Think of it like mailing pages of a book to di...,Think of it like mailing pages of a book to di...,explainlikeimfive,t5_2sokd,c6dydfx,think of it like mailing pages of a book to di...,"always look for the highest seeded torrents, a..."
1,senatorskeletor,"""Redistribution"" is short for ""redistribution ...","""Redistribution"" is short for ""redistribution ...",explainlikeimfive,t5_2sokd,c6whsmv,"redistribution"" is short for ""redistribution o...",1) using the tax system to take money from the...
2,callumgg,"The Chinese system isn't exactly transparent, ...","The Chinese system isn't exactly transparent, ...",explainlikeimfive,t5_2sokd,c6y9grw,"the chinese system isn't exactly transparent, ...",2700 delegates and representatives from all ov...
3,mcanerin,Here is an analogy I've used before and might ...,Here is an analogy I've used before and might ...,explainlikeimfive,t5_2sokd,c6yj68l,here is an analogy i've used before and might ...,"the communist party isn't a political party, i..."
4,neo45,"This is a complicated question, but I think it...","This is a complicated question, but I think it...",explainlikeimfive,t5_2sokd,c7fuozw,"this is a complicated question, but i think it...","there's lots of good actors out there, but ver..."


In [6]:
# Extract the comment texts to train on
contents = dataset["content"]
contents[0]

"think of it like mailing pages of a book to different people who own photocopiers. if you have the full set, a, b and c, and then you give three people one of those pages after you've photocopied them, there are now 2 people out of 4 who have a, b or c. then, everyone photocopies what they have and redistributes it, which means that you don't have to send out 3 copies of the same file, only 2, and the others contribute one. \n same works with torrenting; the more people there are that have a piece of the file, the easier it is to get hold of, and the less each person has to contribute. as soon as someone has a piece, their connection uploads it to torrent programs that want it, and in return they can get pieces from other people. essentially, the more people involved, the faster the file spreads, because everyone can start copying and sending off the parts they already have."

In [7]:
# Pre-calculate embeddings
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = embedding_model.encode(contents, show_progress_bar=True)


In [8]:
umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric='cosine', random_state=42)

## **Controlling Number of Topics**
The `nr_topics` parameter sets the number of topics, but it works by merging topics **after** they have been created, so it is best suited to reducing the result to a fixed number of topics.

It is therefore advisable to control the number of topics through the cluster model, which is HDBSCAN by default. Its `min_cluster_size` parameter indirectly controls how many topics are created.

A higher `min_cluster_size` will generate fewer topics and a lower `min_cluster_size` will generate more topics.
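
When comparing several `min_cluster_size` values, it can help to count the topics and the share of outliers for each run. The following is a minimal sketch, assuming `topics` is the list of topic ids returned by `fit_transform`; the helper name `summarize_topics` and the two toy lists are made up for illustration and do not come from this dataset.

```python
from collections import Counter


def summarize_topics(topics):
    """Return (number of topics, share of outlier documents) for one run."""
    counts = Counter(topics)
    n_outliers = counts.pop(-1, 0)  # HDBSCAN labels outliers as -1
    n_topics = len(counts)
    outlier_share = n_outliers / len(topics) if len(topics) else 0.0
    return n_topics, outlier_share


# Made-up assignments for two hypothetical runs over the same 10 documents
run_small_min_cluster_size = [0, 0, 1, 1, 2, 2, 3, 3, -1, -1]
run_large_min_cluster_size = [0, 0, 0, 0, 1, 1, 1, -1, -1, -1]

print(summarize_topics(run_small_min_cluster_size))  # (4, 0.2)
print(summarize_topics(run_large_min_cluster_size))  # (2, 0.3)
```

Running this after each fit gives a quick way to see whether a change in `min_cluster_size` moved the topic count in the expected direction.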



In [9]:
hdbscan_model = HDBSCAN(min_cluster_size=150, metric='euclidean', cluster_selection_method='eom', prediction_data=True)

In [10]:
vectorizer_model = CountVectorizer(stop_words="english", min_df=2, ngram_range=(1, 2))

#### BERTopic representation models:

* KeyBERTInspired
  * A method inspired by how KeyBERT works
* PartOfSpeech
  * Uses spaCy's POS tagging to extract words
* MaximalMarginalRelevance
  * Diversifies the topic words
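
To illustrate what "diversify" means for MaximalMarginalRelevance, here is a simplified pure-Python sketch of the general MMR idea: each candidate word is scored by its relevance to the topic minus its similarity to words already selected. This is not BERTopic's implementation; the function names `cosine` and `mmr_select` and the 2-D vectors are assumptions made up for this example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def mmr_select(topic_vec, word_vecs, diversity=0.3, top_n=2):
    """Greedily pick words that are relevant to the topic but not redundant."""
    words = list(word_vecs)
    relevance = {w: cosine(word_vecs[w], topic_vec) for w in words}
    selected = [max(words, key=relevance.get)]  # start with the most relevant word
    candidates = [w for w in words if w not in selected]
    while candidates and len(selected) < top_n:
        def score(w):
            # Penalize similarity to the closest already-selected word
            redundancy = max(cosine(word_vecs[w], word_vecs[s]) for s in selected)
            return (1 - diversity) * relevance[w] - diversity * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


# Made-up 2-D "embeddings": "cash" is nearly a duplicate of "money"
topic_vec = (1.0, 0.0)
word_vecs = {"money": (1.0, 0.0), "cash": (0.98, 0.2), "bank": (0.6, 0.8)}

print(mmr_select(topic_vec, word_vecs, diversity=0.0))  # ['money', 'cash']
print(mmr_select(topic_vec, word_vecs, diversity=0.7))  # ['money', 'bank']
```

With `diversity=0.0` the near-duplicate "cash" is picked second; with a high diversity the less similar "bank" wins instead.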



In [11]:
# KeyBERT
keybert_model = KeyBERTInspired()

# Part-of-Speech
pos_model = PartOfSpeech("en_core_web_sm")

# MMR
mmr_model = MaximalMarginalRelevance(diversity=0.3)

# All representation models
representation_model = {
    "KeyBERT": keybert_model,
    "MMR": mmr_model,
    "POS": pos_model
}

In [12]:
topic_model = BERTopic(

  # Pipeline models
  embedding_model=embedding_model,
  umap_model=umap_model,
  hdbscan_model=hdbscan_model,
  vectorizer_model=vectorizer_model,
  representation_model=representation_model,

  # Hyperparameters
  top_n_words=10,
  verbose=True
)

topics, probs = topic_model.fit_transform(contents, embeddings)

2024-07-11 17:27:31,379 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2024-07-11 17:28:24,582 - BERTopic - Dimensionality - Completed ✓
2024-07-11 17:28:24,585 - BERTopic - Cluster - Start clustering the reduced embeddings
2024-07-11 17:28:29,369 - BERTopic - Cluster - Completed ✓
2024-07-11 17:28:29,384 - BERTopic - Representation - Extracting topics from clusters using representation models.
2024-07-11 17:28:59,226 - BERTopic - Representation - Completed ✓


In [13]:
topic_model.get_topic_info()

Unnamed: 0,Topic,Count,Name,Representation,KeyBERT,MMR,POS,Representative_Docs
0,-1,9467,-1_people_like_just_time,"[people, like, just, time, don, make, way, thi...","[life, energy, world, food, blood, oil, just, ...","[people, like, just, time, don, make, way, thi...","[people, time, way, things, lot, different, go...","[i tried to cover a little of everything here,..."
1,0,2848,0_money_people_pay_bank,"[money, people, pay, bank, debt, buy, value, c...","[spending, cash, economy, money, financial, ec...","[money, people, pay, bank, debt, buy, value, c...","[money, people, bank, debt, value, company, ec...","[i tried to make this a simple explanation, bu..."
2,1,1457,1_species_humans_women_men,"[species, humans, women, men, sex, like, sexua...","[species, reproduction, neanderthals, evolutio...","[species, humans, women, men, sex, like, sexua...","[species, humans, women, men, sex, sexual, ani...",[an overview of [sexual selection]( will help ...
3,2,1312,2_speed_light_universe_space,"[speed, light, universe, space, time, gravity,...","[relativity, speed light, black hole, physics,...","[speed, light, universe, space, time, gravity,...","[speed, light, universe, space, time, gravity,...","[the immediate conclusion you draw is ""somehow..."
4,3,1039,3_game_computer_windows_games,"[game, computer, windows, games, code, program...","[hardware, computers, computer, processor, os,...","[game, computer, windows, games, code, program...","[game, computer, windows, games, code, program...",[you can see a computer as a person paper work...
5,4,718,4_fat_food_body_eat,"[fat, food, body, eat, calories, sugar, foods,...","[calorie, diet, calories, carbs, eating, fat, ...","[fat, food, body, eat, calories, sugar, foods,...","[fat, food, body, calories, sugar, foods, diet...",[the sugary foods you mentioned are said to ha...
6,5,704,5_reddit_people_subreddit_post,"[reddit, people, subreddit, post, like, just, ...","[4chan, reddit, redditors, posts, community, s...","[reddit, people, subreddit, post, like, just, ...","[reddit, people, subreddit, post, anonymous, c...",[let. me. tell you. about srs. \n as reddit g...
7,6,603,6_internet_network_phone_service,"[internet, network, phone, service, data, isp,...","[net neutrality, internet, isp, broadband, net...","[internet, network, phone, service, data, isp,...","[internet, network, phone, service, data, isp,...","[this is my first eli5, so i'll try to keep it..."
8,7,580,7_music_sound_song_notes,"[music, sound, song, notes, hear, sounds, like...","[music, musical, melody, vibrations, musicians...","[music, sound, song, notes, hear, sounds, like...","[music, sound, song, notes, note, ear, frequen...",[to find the key for much of today's popular m...
9,8,517,8_party_vote_government_parties,"[party, vote, government, parties, president, ...","[political parties, parties, elections, candid...","[party, vote, government, parties, president, ...","[party, vote, government, parties, president, ...",[the video is great for explaining duverger's ...


## **Outlier Reduction**
By default, HDBSCAN labels some documents as outliers, which helps keep the topic representations accurate. However, you might want to assign every single document to a topic. We can use `.reduce_outliers` to map some or all outliers to a topic:

In [15]:
# Reduce outliers using the pre-calculated embeddings
new_topics = topic_model.reduce_outliers(contents, topics, strategy="embeddings", embeddings=embeddings)
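
The idea behind an embedding-based outlier strategy can be sketched in plain Python: each outlier is moved to the topic whose average embedding is most similar to it. This is a simplified illustration under stated assumptions, not BERTopic's code; `reassign_outliers`, `centroid`, the `threshold` default, and the 2-D vectors are all made up for the example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def centroid(vectors):
    """Component-wise mean of a list of vectors."""
    return [sum(component) / len(vectors) for component in zip(*vectors)]


def reassign_outliers(doc_vecs, topics, threshold=0.0):
    """Move each outlier (-1) to the topic with the most similar centroid."""
    by_topic = {}
    for vec, topic in zip(doc_vecs, topics):
        if topic != -1:
            by_topic.setdefault(topic, []).append(vec)
    centroids = {topic: centroid(vecs) for topic, vecs in by_topic.items()}

    reassigned = []
    for vec, topic in zip(doc_vecs, topics):
        if topic == -1 and centroids:
            best = max(centroids, key=lambda t: cosine(vec, centroids[t]))
            # Keep the document as an outlier if even the best match is too weak
            if cosine(vec, centroids[best]) >= threshold:
                topic = best
        reassigned.append(topic)
    return reassigned


# Made-up 2-D "embeddings" for five documents; the last two start as outliers
doc_vecs = [(1.0, 0.1), (0.9, 0.0), (0.1, 1.0), (0.8, 0.3), (0.2, 0.9)]
toy_topics = [0, 0, 1, -1, -1]

print(reassign_outliers(doc_vecs, toy_topics))  # [0, 0, 1, 0, 1]
```

Raising the hypothetical `threshold` leaves weakly matching documents as outliers instead of forcing them into a topic.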

In [16]:
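# Note: update_topics falls back to a default CountVectorizer unless the custom
# models are passed again, which is why stop words dominate the Representation
# column in the output below. Passing vectorizer_model=vectorizer_model (and
# representation_model=representation_model) keeps the original settings.
topic_model.update_topics(contents, topics=new_topics)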



In [18]:
topic_model.get_topic_info()

Unnamed: 0,Topic,Count,Name,Representation,KeyBERT,MMR,POS,Representative_Docs
0,0,3448,0_the_to_and_of,"[the, to, and, of, money, you, that, in, is, for]","[spending, cash, economy, money, financial, ec...","[money, people, pay, bank, debt, buy, value, c...","[money, people, bank, debt, value, company, ec...","[i tried to make this a simple explanation, bu..."
1,1,1791,1_to_of_and_the,"[to, of, and, the, that, is, are, in, it, have]","[species, reproduction, neanderthals, evolutio...","[species, humans, women, men, sex, like, sexua...","[species, humans, women, men, sex, sexual, ani...",[an overview of [sexual selection]( will help ...
2,2,1648,2_the_is_of_it,"[the, is, of, it, to, that, you, and, in, we]","[relativity, speed light, black hole, physics,...","[speed, light, universe, space, time, gravity,...","[speed, light, universe, space, time, gravity,...","[the immediate conclusion you draw is ""somehow..."
3,3,1406,3_the_to_it_and,"[the, to, it, and, you, is, of, that, for, on]","[hardware, computers, computer, processor, os,...","[game, computer, windows, games, code, program...","[game, computer, windows, games, code, program...",[you can see a computer as a person paper work...
4,4,1204,4_and_the_to_you,"[and, the, to, you, food, your, of, it, is, in]","[calorie, diet, calories, carbs, eating, fat, ...","[fat, food, body, eat, calories, sugar, foods,...","[fat, food, body, calories, sugar, foods, diet...",[the sugary foods you mentioned are said to ha...
5,5,966,5_to_the_it_and,"[to, the, it, and, you, of, that, reddit, is, ...","[4chan, reddit, redditors, posts, community, s...","[reddit, people, subreddit, post, like, just, ...","[reddit, people, subreddit, post, anonymous, c...",[let. me. tell you. about srs. \n as reddit g...
6,6,1128,6_to_the_internet_you,"[to, the, internet, you, and, it, is, that, of...","[net neutrality, internet, isp, broadband, net...","[internet, network, phone, service, data, isp,...","[internet, network, phone, service, data, isp,...","[this is my first eli5, so i'll try to keep it..."
7,7,746,7_music_sound_the_to,"[music, sound, the, to, of, and, is, you, it, ...","[music, musical, melody, vibrations, musicians...","[music, sound, song, notes, hear, sounds, like...","[music, sound, song, notes, note, ear, frequen...",[to find the key for much of today's popular m...
8,8,774,8_the_to_of_and,"[the, to, of, and, in, party, that, is, they, ...","[political parties, parties, elections, candid...","[party, vote, government, parties, president, ...","[party, vote, government, parties, president, ...",[the video is great for explaining duverger's ...
9,9,993,9_the_to_and_that,"[the, to, and, that, you, of, in, is, it, they]","[arrest, enforcement, prosecution, law enforce...","[police, officer, evidence, court, law, cops, ...","[police, officer, evidence, court, law, cops, ...",[what people are thinking will happen next: \n...


## (Custom) Labels

In [19]:
# Label the topics yourself
topic_model.set_topic_labels({-1: "General",
                              0: "Economic and Financial Systems",
                              1: "Genetics and Human Evolution",
                              2: "Physics and Cosmology",
                              3: "Gaming and Software Development",
                              4: "Nutrition and Dietary Health",
                              5: "Community Posting",
                              6: "Telecommunications Infrastructure",
                              7: "Sound and Music",
                              8: "Political Parties and Elections",
                              9: "Law Enforcement",
                              10: "Cognitive Neuroscience",
                              11: "Religion",
                              12: "Healthcare Systems",
                              13: "Linguistic",
                              14: "Color Perception and Visual Physiology",
                              15: "Electrical Engineering and Power Systems",
                              16: "Football Teams and Soccer Leagues",
                              17: "Warfare and Diplomatic Relations",
                              18: "Thermodynamics and Fluid Dynamics",
                              19: "Human Anatomy and Physiological Health",
                              20: "Video Frames and Camera Technology",
                              21: "Sleep Physiology and Brain Activity",
                              22: "Cultural Diversity and Ethnic Backgrounds",
                              23: "Firearms and Public Safety",
                              24: "Mathematical Concepts and Number Theory",
                              25: "Middle Eastern Affairs and Religious Extremism",
                              26: "Cinema and Filmmaking",
                              27: "Tobacco and Smoking Behavior",
                              28: "Political Ideologies and Economic Systems",
                              29: "European Nations and Global Resources",
                              30: "Immunology and Infectious Diseases"
                              })

## **Visualize Topics**

In [20]:
# Generate the intertopic distance map, save it, and display it
fig = topic_model.visualize_topics(custom_labels=True)
fig.write_html("topics_visualization.html")
fig.show()

In [21]:
# Generate the topic hierarchy, save it, and display it
fig = topic_model.visualize_hierarchy(custom_labels=True)
fig.write_html("topics_hierarchy.html")
fig.show()

## **Visualize Documents**


Visualizing documents in 2-dimensional space helps in understanding the underlying structure of the documents and topics.

In [23]:
# Reduce dimensionality of embeddings
reduced_embeddings = UMAP(n_neighbors=10, n_components=2, min_dist=0.0, metric='cosine').fit_transform(embeddings)

In [24]:
# Hide the annotations for a clearer overview of the topics
fig = topic_model.visualize_documents(contents, hide_document_hover=True, reduced_embeddings=reduced_embeddings, custom_labels=True, hide_annotations=True)
fig.write_html("topics_visualization_doc.html")

In [25]:
# Map topics

topic_labels = {0: "Economic and Financial Systems",
                1: "Genetics and Human Evolution",
                2: "Physics and Cosmology",
                3: "Gaming and Software Development",
                4: "Nutrition and Dietary Health",
                5: "Community Posting",
                6: "Telecommunications Infrastructure",
                7: "Sound and Music",
                8: "Political Parties and Elections",
                9: "Law Enforcement",
                10: "Cognitive Neuroscience",
                11: "Religion",
                12: "Healthcare Systems",
                13: "Linguistic",
                14: "Color Perception and Visual Physiology",
                15: "Electrical Engineering and Power Systems",
                16: "Football Teams and Soccer Leagues",
                17: "Warfare and Diplomatic Relations",
                18: "Thermodynamics and Fluid Dynamics",
                19: "Human Anatomy and Physiological Health",
                20: "Video Frames and Camera Technology",
                21: "Sleep Physiology and Brain Activity",
                22: "Cultural Diversity and Ethnic Backgrounds",
                23: "Firearms and Public Safety",
                24: "Mathematical Concepts and Number Theory",
                25: "Middle Eastern Affairs and Religious Extremism",
                26: "Cinema and Filmmaking",
                27: "Tobacco and Smoking Behavior",
                28: "Political Ideologies and Economic Systems",
                29: "European Nations and Global Resources",
                30: "Immunology and Infectious Diseases"
                }
dataset['Topic'] = [topic_labels.get(topic, "Unknown") for topic in new_topics]

In [26]:
# Save as CSV
dataset.to_csv('data_topics.csv', index=False, sep=";", encoding='utf-8', quoting=csv.QUOTE_NONNUMERIC)

# Save as JSON
dataset.to_json('data_topics.json', orient='records', lines=True)

In [27]:
# Count the occurrences of each topic
topic_counts = dataset['Topic'].value_counts()

# Calculate the ratio of each topic
topic_ratios = topic_counts / topic_counts.sum()

In [28]:
# Convert to DataFrame for Plotly
plot_data = topic_ratios.reset_index()
plot_data.columns = ['Topic', 'Ratio']

# Sort topics by their share, largest first
plot_data = plot_data.sort_values(by='Ratio', ascending=False).reset_index(drop=True)

# Define a pastel color palette without white
colors = px.colors.qualitative.Pastel + px.colors.qualitative.Pastel2 + px.colors.qualitative.Pastel1 + px.colors.qualitative.Set3

# Create the pie chart with Plotly
fig = px.pie(plot_data, names='Topic', values='Ratio', title='Topic Distribution',
             color_discrete_sequence=colors)

# Update the layout for better presentation
fig.update_traces(textposition='inside', textinfo='percent')
fig.update_layout(
    legend_title_text='Topics',
    legend=dict(
        orientation="v",
        yanchor="top",
        y=-5,
        xanchor="left",
        x=0,
        traceorder="normal"
    ),
    width=800,  # Make the pie chart bigger
    height=1200

)
# Save the plot as an HTML file
fig.write_html("topic_distribution.html")

# Display the plot
fig.show()

## Manual inspections

In [29]:
Eco_Fin = dataset[dataset["Topic"]=="Economic and Financial Systems"]["content"].reset_index(drop=True)
Eco_Fin.iloc[0]

'first of all, you have a misunderstanding of deflation. deflation in the original, rational sense of that word as a harmful economic phenomenon and a symptom of depressions, means a  decrease in the supply of money . this can only happen in a fractional reserve banking system, since when such a bank goes under, checks written upon its reserves loose all value and money is effectively destroyed. \n the increase in the value of money, in contrast, when it is caused by an  increase in the supply of goods , and not a decrease in the supply of money, is not harmful. in fact, it is a great benefit to everyone in the economy. as long as improves one\'s own productive ability in proportion with the rise in the productivity of the economy as a whole, one\'s position is better off because, while one now earns $5 an hour instead of $10 an hour, this $5 now buys  more  goods than the $10 did, because of the increase in wealth. \n neither is such increase in the value of money (caused by increase 

In [30]:
Eco_Fin.iloc[15]

"here's how i do math as a small business owner: \n most of my income comes from selling the products i manufacture. in a month i may make and sell 400 units, bringing in revenues of $10000, or $25/unit. i have one guy helping me with manufacturing, and i pay him $10/hour for 20 hours per week, so $200/week or about $800/month. that works out to about $2/unit going towards his pay. \n if i suddenly have to pay him $15/hour, then it'll cost me about $1200/month or about $3/unit. that gives him a 50% raise. if i want to recoup those costs, i would have to raise the cost per unit by 4% ($25 to $26). \n so, as i see it, if the minimum wage went up, my product would cost 4% more (or i make a little less per unit), but a huge number of people suddenly have the increased buying power to buy it, so my sales volume would increase more than enough to offset my added costs. \n it should be noted that i'm in no way an economist, so i may be missing something in all of this... in my reality though,

In [31]:
Eco_Fin.iloc[150]

"think in terms of cookies. if you are in a gym with a bunch of other kids and there are hundreds of cookies, you aren't going to lose sleep about losing a cookie and you would not exactly feel cheated if not everybody had the same amount but all people were still taken care of. now think there are 30 cookies and 25 kids. you're going to be pissed if someone gets their hands on multiple cookies because they now hold more value and you would certainly be willing to offer your services for a cookie in return. when everybody has a lot of cookies to spare, their value depreciates just as currency does, but when there is a limited supply of cookies they maintain a reasonable valuation and thus economical power."

In [32]:
dataset[dataset["Topic"]=="Genetics and Human Evolution"]["content"][32]

"two animals cannot interbreed if the genetic information of the animals is too different. \neither the sex cells don't join together properly and it ends there.\nor they join but the dna is an incoherent mess and so no growth process really get going.\nor they join, grow a little, but then after that the right systems aren't in place to facilitate more growth. \nit's a sliding scale of how far the animal gets along the path to life with the possibility of some pretty hideously deformed beasts if the genetics are not quite similar enough.\nvery similar species can interbreed (horses and donkeys)\nand some species are so varied that some breeds cannot interbreed (think domestic dogs)"

In [33]:
dataset[dataset["Topic"]=="Physics and Cosmology"]["content"][50]

'cafe standards, dot safety regulations, competition, popularity and general styling trends all have a huge effect. there are companies that make replica\'s and "kit cars" of classic designs but as far as a global automotive manufacturer it simply isn\'t possible anymore. \n \n cafe standards require companies to up fuel efficiency. unfortunately classic cars are not very aerodynamic and create much more drag than modern vehicles, greatly impacting fuel efficiency. \n \n dot safety standards have changed not only how well a car adsorbs the energy from an impact, but also the styling due to pedestrian safety. that is why you no longer see raised hood ornaments or sharp protruding edges on the fronts of cars. they actually have to be safe for hitting people at low speeds. also bumpers have a low speed impact rating (less than 5mph) that they can absorb without damage. \n \n \n so performance requirements, government regulations and popular styling all greatly dictate the design. in the f


"Economic and Financial Systems" is the largest topic in our subreddit. So many people struggle with understanding complex financial concepts and economic principles, seeking simplified explanations to make informed decisions in their personal and professional lives.
The next most popular topics are "Genetics and Human Evolution", "Physics and Cosmology", "Gaming and Software Development", and "Nutrition and Dietary Health".

Some topics may overlap, such as "Mathematical Concepts and Number Theory" and "Gaming and Software Development". The "Genetics and Human Evolution" topic appears to contain many posts on sexuality and women's rights, so it could perhaps be split into two topics. More manual inspection is needed.
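
One rough way to check a suspected overlap is to compare the keyword lists of two topics, for example with Jaccard similarity. The sketch below is only an illustration: the helper name `jaccard` and both keyword lists are made up and are not taken from the fitted model.

```python
def jaccard(words_a, words_b):
    """Share of keywords two topics have in common (0 = none, 1 = identical)."""
    a, b = set(words_a), set(words_b)
    return len(a & b) / len(a | b) if a | b else 0.0


# Made-up keyword lists for two hypothetical topics
topic_a_words = ["code", "program", "number", "computer", "game"]
topic_b_words = ["number", "math", "prime", "proof", "code"]

print(jaccard(topic_a_words, topic_b_words))  # 0.25
```

In the notebook, real keyword lists could be built from `topic_model.get_topic(topic_id)`, which returns `(word, score)` pairs for a topic. A high value would only be a hint that two topics deserve a closer manual look.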

Next steps:
*   Validate and adjust topics manually where necessary, based on domain knowledge and understanding of the dataset.
*   Adjust parameters (`nr_topics`, and `min_cluster_size` on the custom HDBSCAN model)

