# How to do cluster analysis with BERTopic

This notebook contains code for a clustering analysis using BERTopic.

In [111]:
# First, we import the packages that we will use.

from bertopic import BERTopic
from sentence_transformers import SentenceTransformer
import pandas as pd
import plotly.express as px
import numpy as np
import re
import random
from tqdm import tqdm
import hdbscan
from umap import UMAP

# We also set a random state to make the results reproducible
RANDOM_STATE = 1111
np.random.seed(RANDOM_STATE)

### Step 1: Load data
As a first step, we load the data and, if needed, clean it. BERTopic uses language models like BERT to produce the embeddings, and these models rely on the full sentence context, so we do not remove stop words or apply similar preprocessing here. It can still be useful to remove HTML or other noise. In this example, we only load the data and make a list of the documents.
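If the texts do contain HTML, a light cleaning step could look like the sketch below. It is not part of the original notebook: `clean_text` is a hypothetical helper and the example string is made up. It only strips tags, decodes HTML entities, and collapses whitespace, and it leaves stop words untouched.

```python
import html
import re


def clean_text(text: str) -> str:
    """Strip HTML tags and collapse whitespace; stop words are kept on purpose."""
    text = re.sub(r"<[^>]+>", " ", text)  # replace anything that looks like a tag
    text = html.unescape(text)            # decode entities such as &amp;
    text = re.sub(r"\s+", " ", text)      # collapse runs of whitespace and newlines
    return text.strip()


# Made-up example string, not taken from the dataset
example = "<p>The project tests   new &amp; cheaper fuel cells.</p>\n<br>"
cleaned = clean_text(example)
print(cleaned)
```

In the loading cell, this could be applied where the optional cleaning is marked, for instance with `df['text'] = df['text'].apply(clean_text)` before the list of documents is built.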

In [103]:
# For the example, we load a subset of the digital assistant dataset
df = pd.read_excel("data/Projects_all_data.xlsx", engine='openpyxl')
df = df[df['Slut (år)'].isin(range(2015, 2023))]  # keep projects ending in 2015-2022

df['Beskrivelse'] = df['Beskrivelse'].fillna('')
df['Intro'] = df['Intro'].fillna('')
df['text'] = df['Projekttitel'] + ' \n ' + df['Intro'] + ' \n ' + df['Beskrivelse']
df = df.reset_index(drop=True)

print(df.shape)
# Optional: Add text cleaning here 

docs = df['text'].tolist() # <- this is the column that contains the text

(717, 35)


### Step 2: Create embeddings

Secondly, we create embeddings from the text in the documents. Here, we use a pre-trained multilingual model called `paraphrase-multilingual-MiniLM-L12-v2`, which is both small and fast.

Note that the second cell that encodes the documents into embeddings takes some time to run. Therefore, we save the final embeddings in a numpy array, and later we can also load them in quickly without having to encode them again. 
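The save-then-load pattern described above can also be wrapped in one helper, sketched here. `load_or_encode` is a hypothetical name, not something from the notebook or from sentence-transformers, and the sketch assumes the cache path ends in `.npy`. Note that it does not detect changes in the documents, so the cache file has to be deleted manually if the data changes.

```python
import os

import numpy as np


def load_or_encode(docs, encode_fn, cache_path):
    """Return cached embeddings if cache_path exists; otherwise encode and save them."""
    if os.path.exists(cache_path):
        return np.load(cache_path)
    embeds = np.asarray(encode_fn(docs))
    # Create the target folder if needed before saving
    os.makedirs(os.path.dirname(cache_path) or ".", exist_ok=True)
    np.save(cache_path, embeds)
    return embeds


# Possible usage with the objects defined in this notebook (not executed here):
# embeds = load_or_encode(
#     docs, sentence_model.encode, f"data/embeddings/{embedding_model_name}.npy"
# )
```

With this helper, the slow encoding only runs the first time; later runs read the array from disk.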

In [16]:
embedding_model_name = "paraphrase-multilingual-MiniLM-L12-v2"
sentence_model = SentenceTransformer(embedding_model_name)

In [17]:
# NOTE: Only run this cell once - it takes some time to run...
embeds = sentence_model.encode(docs, show_progress_bar=True)
print(embeds.shape)

np.save(f"data/embeddings/{embedding_model_name}", embeds)

Batches: 100%|██████████| 23/23 [00:13<00:00,  1.76it/s]

(717, 384)





In [17]:
# Later, we can load the previously encoded embeddings
embeds = np.load(f"data/embeddings/{embedding_model_name}.npy")
print(embeds.shape)

(717, 384)


### Step 3: Run and tune topic models

Now it is time to run and tune the parameters of the clustering model. BERTopic uses UMAP to reduce the dimensionality of the embeddings, and then HDBSCAN to cluster the documents.

By default, BERTopic describes each cluster with a class-based variation of TF-IDF (c-TF-IDF), computed on the word counts from a default CountVectorizer. Here we pass our own CountVectorizer instead, so the cluster names are built from both single words and bi-grams, and a list of Danish stop words is excluded.
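To give an intuition for c-TF-IDF, here is a simplified pure-Python sketch of the idea. It is not BERTopic's implementation. All documents in a cluster are treated as one long document, and each word is scored as its relative frequency in the cluster times log(1 + A / f), where A is the average number of words per cluster and f is the word's total count across all clusters. The function name and the toy tokens are made up for illustration.

```python
import math
from collections import Counter


def c_tf_idf(cluster_tokens):
    """cluster_tokens maps a cluster id to the list of all tokens in that cluster."""
    counts = {c: Counter(tokens) for c, tokens in cluster_tokens.items()}

    # f: total count of each word across all clusters
    total_per_word = Counter()
    for counter in counts.values():
        total_per_word.update(counter)

    # A: average number of words per cluster
    avg_words = sum(len(t) for t in cluster_tokens.values()) / len(cluster_tokens)

    scores = {}
    for c, counter in counts.items():
        n_words = sum(counter.values())
        scores[c] = {
            word: (k / n_words) * math.log(1 + avg_words / total_per_word[word])
            for word, k in counter.items()
        }
    return scores


# Toy clusters (illustrative tokens only)
toy = {
    0: ["wind", "wind", "turbine", "energy"],
    1: ["fuel", "cell", "cell", "energy"],
}
scores = c_tf_idf(toy)
top_words = {c: max(s, key=s.get) for c, s in scores.items()}
print(top_words)
```

Words that are frequent in one cluster but rare elsewhere score highest, while a word like "energy" that appears in both clusters is pushed down. This is why the highest-scoring words work as cluster names.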

The preferred way to tune the parameters is to start with the default values as written here. Then you can visualize the result and change one parameter at a time. Note that there are several models in play, and they can be very sensitive, so begin with small changes. The results usually also vary a lot between random states, so you might want to try a couple of different random states to validate your results.

You can read about the different parameters here: 
 - BERTopic: https://maartengr.github.io/BERTopic/getting_started/parameter%20tuning/parametertuning.html
 - UMAP: https://umap-learn.readthedocs.io/en/latest/parameters.html 
 - HDBSCAN: https://hdbscan.readthedocs.io/en/latest/parameter_selection.html 
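One simple way to compare the results of two random states is to measure how often the two clusterings agree on whether a pair of documents belongs together. The sketch below computes this share (the Rand index) in plain Python. It is not part of the original notebook, the two label lists are made up, and it treats the noise label `-1` as just another cluster, which is a simplification.

```python
from itertools import combinations


def rand_index(labels_a, labels_b):
    """Share of document pairs on which two clusterings agree (same vs. different cluster)."""
    if len(labels_a) != len(labels_b):
        raise ValueError("Both label lists must cover the same documents")
    agree = 0
    total = 0
    for i, j in combinations(range(len(labels_a)), 2):
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        agree += int(same_a == same_b)
        total += 1
    return agree / total if total else 1.0


# Made-up topic assignments from two hypothetical runs with different random states
topics_seed_1 = [0, 0, 1, 1, -1, 2]
topics_seed_2 = [1, 1, 0, 0, 2, 2]
score = rand_index(topics_seed_1, topics_seed_2)
print(round(score, 3))
```

The topic numbers themselves do not need to match between runs, only the grouping. A value close to 1 suggests that the clustering is stable across random states. If scikit-learn is available, `sklearn.metrics.adjusted_rand_score` gives a chance-corrected version of the same idea.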

In [118]:
# Here the CountVectorizer is initialized. This is used for naming the clusters afterwards.
# You probably don't need to tune any parameters here. 
with open('data/danish_stopwords.txt', 'r') as f:
    danish_stop_words = f.readlines()
danish_stop_words = [d.replace('\n', '') for d in danish_stop_words] + ['projekt','projektet']

from sklearn.feature_extraction.text import CountVectorizer
vectorizer_model = CountVectorizer(stop_words=danish_stop_words, ngram_range=(1, 2))


# *** Change parameters here ***
best_hp_dict = {
    "METRIC": "euclidean", # <- distance metric. We use the same in both UMAP and HDBSCAN

    # UMAP parameters
    "N_COMPONENTS": 2, # <- how many dimensions should we reduce to - for easy visualisation, we might want to choose 2
    # "N_NEIGHBORS": 8, 
    # "MIN_DIST": 0.1,

    # HDBSCAN parameters
    "HDBSCAN_MIN_CLUSTER_SIZE": 8, # <- the smallest size grouping that we want to consider a cluster
    # "MIN_SAMPLES": 1, # <- how conservative do we want the clustering to be
    # "CLUSTER_SELECTION_EPSILON": 0.4, # <- tradeoff between DBSCAN and HDBSCAN; clusters closer than this distance are merged

    # BERTopic parameters (MIN_TOPIC_SIZE has no effect when we pass our own HDBSCAN model)
    # "TOP_N_WORDS": xx,
    # "MIN_TOPIC_SIZE": xx,
}



# Here, we define the models with the parameters from best_hp_dict

umap_model = UMAP(
    n_components=best_hp_dict['N_COMPONENTS'], 
    metric=best_hp_dict['METRIC'],
    # n_neighbors=best_hp_dict['N_NEIGHBORS'],
    # min_dist=best_hp_dict['MIN_DIST'],
    random_state=RANDOM_STATE,
    low_memory=False
)

hdbscan_model = hdbscan.HDBSCAN(
    min_cluster_size=best_hp_dict["HDBSCAN_MIN_CLUSTER_SIZE"],  # use the value set in best_hp_dict
    # min_samples=best_hp_dict["MIN_SAMPLES"],
    # cluster_selection_epsilon=best_hp_dict["CLUSTER_SELECTION_EPSILON"],
    metric=best_hp_dict['METRIC'],
    cluster_selection_method = 'eom',
    gen_min_span_tree=True,
    prediction_data=True
)

topic_model = BERTopic(
    language="multilingual",
    vectorizer_model=vectorizer_model,
    # top_n_words=best_hp_dict["TOP_N_WORDS"],
    # min_topic_size=best_hp_dict["MIN_TOPIC_SIZE"],
    umap_model=umap_model,
    hdbscan_model=hdbscan_model,
)

topics, probs = topic_model.fit_transform(docs, embeds) 

The difficult question is then: when are we confident in the model (and its parameters)?
Our experience is that it requires going back and forth between visualizing the UMAP projection colored by cluster and reconsidering the parameters, until the clustering seems to make sense.

To guide the interpretation of the quality of the clusters, BERTopic has a number of nice and quick visualizations, one of which is shown below.

In Step 4, we make our own visualization that can also be used for further analysis, e.g. by comparing the linguistic clusters/embeddings with other background data.
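Alongside the visual inspection, a quick numeric summary of the `topics` list can help when comparing parameter settings. The sketch below is not part of the original notebook. `summarize_topics` is a hypothetical helper that counts the clusters and reports how many documents ended up in the noise cluster `-1`, using a made-up list of topic assignments.

```python
from collections import Counter


def summarize_topics(topics):
    """Summarize a list of topic assignments where -1 marks uncategorized documents."""
    n_docs = len(topics)
    sizes = Counter(topics)
    n_noise = sizes.pop(-1, 0)  # remove the noise cluster from the size table
    return {
        "n_clusters": len(sizes),
        "noise_share": n_noise / n_docs if n_docs else 0.0,
        "largest": sizes.most_common(3),  # the three biggest clusters
    }


# Made-up topic assignments for illustration
summary = summarize_topics([0, 0, 1, -1, -1, 2, 0, 1])
print(summary)
```

A very high noise share or a single dominating cluster is often a sign that the UMAP or HDBSCAN parameters need another look.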

In [19]:
# ## Visualize the documents, their position and their predicted clusters
# fig = topic_model.visualize_documents(docs, embeddings=embeds)
# fig.show()
# fig.write_html("outputs/visualize_documents.html")

### Step 4: Visualization

When we are satisfied with our topic model, we want to visualize and analyze the clustering. Here, we first add information about which document belongs to which topic to the original dataframe. Then, we can plot or analyze how the clustering relates to other information in the data.

In [34]:
len(topics)

717

In [201]:
# Make dict to map topic number to labels made by our CountVectorizer
label_dict = {}
for label in topic_model.generate_topic_labels():
    k,v = label.split('_', 1)
    label_dict[k] = v
    
# Add topics, probabilities, and topic names to original dataframe
df['topics'] = topics
df['probs'] = probs
df['topics_name'] = df['topics'].apply(lambda x: label_dict[str(x)])
df['topics_name'] = df['topics_name'].apply(lambda x: x.replace('_',' - '))
df['topics_name'] = df.apply(lambda row: 'Ikke Kategoriseret' if row['topics'] == -1 else row['topics_name'], axis=1)

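def get_mio(x):
    # Sum the amounts (in mio) found in a comma-separated string
    if not isinstance(x, str):
        return x  # NaN or already numeric
    parts = x.split(',')
    # Raw string with an escaped decimal point; \d+ also matches amounts of 10 mio or more
    amounts = [re.findall(r'\d+\.\d\d(?= mio)', part)[0] for part in parts]
    return sum(float(a) for a in amounts)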

df['Subsidy_total'] = df['Subsidy'].apply(get_mio)
df['Auto_financing_total'] = df['Auto financing'].apply(get_mio)
df['Total_financing'] = df['Subsidy_total'] + df['Auto_financing_total']
df['Fokusområder EUDP'] = df['Fokusområder EUDP'].fillna('Ingen data')

def line_breaker(s: str) -> str:
    chunk_size = max(1, len(s) // 6)  # avoid a zero step in range() for very short strings
    s_chunks = [ s[i:i+chunk_size] for i in range(0, len(s), chunk_size) ]
    return '<br>'.join(s_chunks)

df['text_br'] = df['text'].apply(line_breaker)

# If we chose n_components = 2 in the UMAP model, we can add the 2D coordinates directly to the original df.
# If we chose more than 2 components, we would need to run another UMAP to reduce the embeddings to 2D for visualization.
umap_model = topic_model.umap_model
temp = pd.DataFrame(umap_model.embedding_, columns=["x", "y"])
df_out = pd.concat([df,temp], axis=1)


# Optional: We might want to save the final dataframe - so we don't need to run all the code if we want to do further analysis
df_out.to_csv('data/data_with_topics.csv')

In [58]:
pd.options.display.max_columns = 99
df_out = pd.read_csv('data/data_with_topics.csv')
df_out.shape

(717, 42)
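With the topics attached to the dataframe, one way to relate the clusters to background data is a cross-tabulation. The sketch below uses a small made-up dataframe rather than the real data. The column `focus_area` and all values are hypothetical stand-ins for columns such as `Fokusområder EUDP`.

```python
import pandas as pd

# Made-up stand-in for df_out with a topic name and one background variable
toy = pd.DataFrame({
    "topics_name": ["wind", "wind", "fuel cells", "fuel cells", "Ikke Kategoriseret"],
    "focus_area": ["Wind", "Wind", "Hydrogen", "Wind", "Hydrogen"],
})

# Share of each focus area within each topic (every row sums to 1)
table = pd.crosstab(toy["topics_name"], toy["focus_area"], normalize="index")
print(table)
```

On the real data, the same call with `df_out['topics_name']` and one of the background columns shows whether a linguistic cluster lines up with an existing category or cuts across several of them.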

In [204]:
# Map each topic name to a color; the noise cluster (-1) gets a light grey
color_dict = {}
palette = px.colors.qualitative.Dark24
noise_label = 'Ikke Kategoriseret'  # name given to topic -1 above
for i, topic_name in enumerate(df_out['topics_name'].unique()):
    if topic_name == noise_label:
        color_dict[topic_name] = "#ededed"
    else:
        # Cycle through the palette in case there are more topics than colors
        color_dict[topic_name] = palette[i % len(palette)]

fig = px.scatter(df_out, 
    x='x', 
    y='y', 
    color='topics_name', 
    hover_data=['text_br','Bevillingsår','Fælles overordnet teknologiområde','Ansvarlig virksomhed','Fokusområder EUDP'], 
    width=1000, 
    height=700,
    color_discrete_map=color_dict,
    template='plotly_white', 
)
fig.update_traces(marker={'size': 5})
fig.update_layout(
    showlegend=False,
    xaxis_title=None,
    yaxis_title=None
)


for t in list(df_out['topics_name'].unique()):
    if t != 'Ikke Kategoriseret':
        temp = df_out[df_out['topics_name'] == t]

        fig.add_annotation( # add a text label at the cluster centroid (no arrow)
            text=t, 
            x=temp['x'].mean(), 
            y=temp['y'].mean(), 
            showarrow=False,
        )

fig.show()
# fig.write_html("outputs/cluster_colored_by_topics.html")  # <- export as interactive html
# fig.write_image("outputs/cluster_colored_by_topics.svg")  # <- export as static svg

In [121]:
temp = df_out.groupby('topics_name').count()['ID'].reset_index().sort_values('ID', ascending=False)
temp = temp.rename(
    columns={
    'ID': 'Antal',
    'topics_name': 'Navn på tema'
    }
)

fig = px.bar(
    temp,
    x = 'Antal',
    y = 'Navn på tema',
    color='Navn på tema',
    width=1000, 
    height=700,
    color_discrete_map=color_dict,
    template='plotly_white', 
)
fig.update_layout(
    showlegend=False,
    xaxis_title=None,
    yaxis_title=None
)

fig.show()

In [124]:
temp = df_out.groupby(['topics_name','Slut (år)']).count()['ID'].reset_index().sort_values('ID', ascending=False)

temp = temp.rename(
    columns={
    'ID': 'Antal',
    'topics_name': 'Navn på tema',
    }
).sort_values(['Slut (år)','Navn på tema'])

fig = px.line(
    temp,
    x = 'Slut (år)',
    y = 'Antal',
    color='Navn på tema',
    width=1000, 
    height=700,
    markers=True,
    color_discrete_map=color_dict,
    template='plotly_white', 
)
fig.update_layout(
    showlegend=True,
    xaxis_title=None,
    yaxis_title=None,
    yaxis_range=[0,35],
    xaxis_range=[2015,2022]
)

fig.show()

In [181]:
# temp = df_out[['topics_name','Subsidy_total','Total_financing','text_br','Bevillingsår','Fælles overordnet teknologiområde','Ansvarlig virksomhed','Fokusområder EUDP']]
# temp = temp.rename(columns={'Subsidy_total': 'financing_subsidy', 'Total_financing': 'financing_total'})
# temp = pd.melt(
#     temp, 
#     id_vars=['topics_name','text_br','Bevillingsår','Fælles overordnet teknologiområde','Ansvarlig virksomhed','Fokusområder EUDP'],
#     value_vars=['financing_subsidy','financing_total']
# )
# temp = temp.rename(columns={
#     'variable': 'Tilskud / Finansiering i alt',
#     'value': 'Beløb',
#     'topics_name': 'Navn på emne'
# })

In [200]:
temp = df_out.rename(columns={
    'topics_name': 'Navn på emne',
    'Subsidy_total': 'Tilskud'
})

fig = px.strip(
    temp,
    y='Navn på emne',
    x='Tilskud',
    color='Navn på emne',
    stripmode='overlay',
    hover_data=['text_br','Bevillingsår','Fælles overordnet teknologiområde','Ansvarlig virksomhed','Fokusområder EUDP'], 
    width=1000, 
    height=700,
    color_discrete_map=color_dict,
    template='plotly_white', 
)
fig.update_layout(
    showlegend=False,
    xaxis_title='Tilskud (mio. kr)',
    yaxis_title=None
)

fig.show()

In [188]:
temp = df_out.rename(columns={
    'topics_name': 'Navn på emne',
    'Total_financing': 'Samlet beløb'
})

fig = px.strip(
    temp,
    y='Navn på emne',
    x='Samlet beløb',
    color='Navn på emne',
    stripmode='overlay',
    hover_data=['text_br','Bevillingsår','Fælles overordnet teknologiområde','Ansvarlig virksomhed','Fokusområder EUDP'], 
    width=1000, 
    height=700,
    color_discrete_map=color_dict,
    template='plotly_white', 
)
fig.update_layout(
    showlegend=False,
    xaxis_title='Samlet beløb (mio. kr)',
    yaxis_title=None
)

fig.show()

Unnamed: 0,Ansvarlig virksomhed,Antal
0,Danmarks Tekniske Universitet (DTU),118
1,Teknologisk Institut,64
2,Aalborg Universitet (Fredrik Bajers Vej),24
3,Ballard Europe,16
4,DANSK GASTEKNISK CENTER A/S,15
...,...,...
269,Grenaa Motorfabrik,1
270,BABCOCK & WILCOX VØLUND A/S,1
271,GlycoSpot,1
272,Banke Electrotrans ApS,1
