# Navigate to folder with data

The following code connects Google Drive to the Colab environment.

After running it, Colab asks for permission to access the Drive.

Once granted, the Drive is accessible at the path /content/drive.

This is necessary if the dataset is stored in Google Drive rather than directly uploaded to Colab. The data in this case is stored in a folder called "topic_modelling" in the Google Drive.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

import os

print(os.listdir())

Mounted at /content/drive
['.config', 'drive', 'sample_data']


In [None]:
%cd /content/drive/MyDrive/topic_modelling

/content/drive/MyDrive/topic_modelling
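
If the mount failed or the folder name differs, the `%cd` command above fails with an error that is easy to miss. The following is a minimal sketch of an explicit check; `require_dir` is a hypothetical helper name, not part of Colab or the `os` module.

```python
import os

def require_dir(path):
    """Return the absolute path if it is an existing directory, else raise."""
    if not os.path.isdir(path):
        raise FileNotFoundError(f"Folder not found: {path}")
    return os.path.abspath(path)

# In Colab, after mounting the Drive, one could call:
# require_dir('/content/drive/MyDrive/topic_modelling')
```

Raising an exception stops the notebook early instead of letting later cells fail on a missing file.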


# Prepare data

For this example notebook, the normalized French text of the Memoirs of Countess Luise Charlotte von Schwerin has been prepared. The folder "topic_modelling" contains the text as a .txt file in which the paragraphs are separated by "///". The text consists of around 700 paragraphs. The following code prepares and cleans this text for topic modelling. The code:
- Reads the text file (schwerin_text_paragraphs.txt).
- Splits the text wherever /// appears (it acts as a paragraph separator).
- Cleans each chunk: removes line breaks and special non-breaking spaces, collapses multiple spaces into one, strips leading/trailing whitespace.
- Stores all cleaned chunks in a list called paragraphs.
- Prints the list so you can see each cleaned paragraph.

Preparing the dataset will also be the topic of this afternoon's session.

In [None]:
import re  # Importing the regular expressions library

file_path = './schwerin_text_paragraphs.txt'

# Open the file and read the content
with open(file_path, 'r', encoding='utf-8') as file:
    text = file.read()

# Split the text by '///'
chunks = text.split('///')

# Clean each chunk by stripping whitespace and replacing special characters
paragraphs = []
for chunk in chunks:
    # Replace newline and non-breaking space characters
    clean_text = chunk.replace('\n', ' ').replace('\xa0', ' ')
    # Use regular expression to replace multiple spaces with a single space
    clean_text = re.sub(r'\s+', ' ', clean_text).strip()
    # Append the cleaned text to the list
    paragraphs.append(clean_text)

print(paragraphs)


['Fürst Khevenhüller 1896', '« Histoire De la Vie de madame la comtesse de Schwerin', 'écrite par elle-même à ses enfants suivant les ordres de son directeur à Cologne »', 'Première partie', 'L’histoire de ma vie est si remarquable et si remplie d’événements, tant pour leur singularité que par le bruit que les plus considérables ont fait dans le monde, que ma curiosité m’a portée en premier lieu d’en rassembler le cours, et secondement pour qu’un jour mes chers enfants puissent être instruits par moi-même des suites malheureuses (selon le monde) que produit ordinairement une fortune éblouissante à une jeune personne élevée dans tout ce qui peut plaire à son ambition et à ses sens, suites dis-je d’autant plus dangereuses qu’elles sont imperceptibles à ceux dont l’amour-propre ferme les yeux sur leur conduite, et dont ils ne sont détrompés que par une longue expérience, d’autant plus rude à être exercée que peu de personnes ont le courage d’entreprendre le grand ouvrage de la connaissanc
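
The output shows that short title lines such as 'Première partie' become documents of their own, and a trailing separator would produce an empty string. The sketch below wraps the same cleaning steps in a function and additionally drops empty chunks. `clean_chunks` is a hypothetical helper name, and dropping empty chunks is an assumption; the cell above keeps them.

```python
import re

def clean_chunks(text, separator='///'):
    """Split text on the separator and normalize whitespace in each chunk."""
    cleaned = []
    for chunk in text.split(separator):
        # For str patterns, \s also matches newlines and non-breaking spaces
        chunk = re.sub(r'\s+', ' ', chunk).strip()
        if chunk:  # assumption: skip chunks that are empty after cleaning
            cleaned.append(chunk)
    return cleaned

sample = "Première partie///L’histoire\nde ma\xa0vie ///  "
print(clean_chunks(sample))
```

Checking `len(clean_chunks(text))` against the expected number of paragraphs is a quick way to confirm that the separator was applied as intended.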

# Installing BERTopic

This line installs BERTopic inside your Colab environment so you can use it for topic modeling.

In [None]:
!pip install bertopic

Collecting bertopic
  Downloading bertopic-0.17.3-py3-none-any.whl.metadata (24 kB)
Downloading bertopic-0.17.3-py3-none-any.whl (153 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 153.0/153.0 kB 3.4 MB/s eta 0:00:00
Installing collected packages: bertopic
Successfully installed bertopic-0.17.3


This line imports the BERTopic class from the bertopic library.


In [None]:
from bertopic import BERTopic




# Create topics (Quick Start)

Running topic_model = BERTopic(verbose=True, language="french") creates a new BERTopic model instance. The verbose=True option makes the model print progress messages during training, which helps you follow steps such as embedding, dimensionality reduction, clustering, and topic extraction. Setting language="french" tells BERTopic to use a multilingual sentence-embedding model instead of its English default; it does not add French-specific preprocessing such as stopword removal. This initialization does not yet analyze your data. It only prepares a model that you can later fit to your text.

In [None]:
topic_model = BERTopic(verbose=True, language="french")

This line takes the prepared French paragraphs, learns topics from them, assigns each paragraph to a topic, and returns both the topic labels (topics) and the confidence scores (probs).

In [None]:
topics, probs = topic_model.fit_transform(paragraphs)

2025-09-01 11:07:40,344 - BERTopic - Embedding - Transforming documents to embeddings.


modules.json:   0%|          | 0.00/229 [00:00<?, ?B/s]

config_sentence_transformers.json:   0%|          | 0.00/122 [00:00<?, ?B/s]

README.md: 0.00B [00:00, ?B/s]

sentence_bert_config.json:   0%|          | 0.00/53.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/645 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/471M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/480 [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/9.08M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/239 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/190 [00:00<?, ?B/s]

Batches:   0%|          | 0/22 [00:00<?, ?it/s]

2025-09-01 11:09:01,312 - BERTopic - Embedding - Completed ✓
2025-09-01 11:09:01,315 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2025-09-01 11:09:12,874 - BERTopic - Dimensionality - Completed ✓
2025-09-01 11:09:12,875 - BERTopic - Cluster - Start clustering the reduced embeddings
2025-09-01 11:09:12,909 - BERTopic - Cluster - Completed ✓
2025-09-01 11:09:12,922 - BERTopic - Representation - Fine-tuning topics using representation models.
2025-09-01 11:09:13,199 - BERTopic - Representation - Completed ✓


- Embeddings: BERTopic starts by transforming the input documents into numerical representations. Although this can be achieved in many ways, sentence-transformers is typically used because it captures the semantic similarity between documents well.
- Dimensionality Reduction: Embeddings are often high-dimensional, which makes clustering difficult. Therefore, their dimensionality is reduced first.
- Clustering: The reduced embeddings are then grouped into clusters of similar embeddings, from which the topics are extracted.
- Representation: Human-readable topic labels are generated by identifying the most representative words of each cluster.
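
To make the last step more concrete: BERTopic ranks the words of each topic with a class-based TF-IDF (c-TF-IDF). The sketch below is a simplified pure-Python version, not the library's implementation. It assumes whitespace tokenization and the weighting tf * log(1 + A / f), where tf is the relative frequency of a word within a topic, f its frequency across all topics, and A the average number of words per topic. The function name `ctfidf` and the toy sentences are made up for illustration.

```python
import math
from collections import Counter

def ctfidf(docs_by_topic):
    """Score words per topic: tf(word, topic) * log(1 + A / f(word))."""
    # Treat all documents of one topic as a single bag of words
    counts = {topic: Counter(" ".join(docs).lower().split())
              for topic, docs in docs_by_topic.items()}
    total = Counter()
    for topic_counts in counts.values():
        total.update(topic_counts)
    avg_words = sum(total.values()) / len(counts)  # A: average words per topic
    return {
        topic: {word: (n / sum(topic_counts.values()))
                      * math.log(1 + avg_words / total[word])
                for word, n in topic_counts.items()}
        for topic, topic_counts in counts.items()
    }

toy = {
    0: ["la fièvre et le médecin", "le médecin et la fièvre"],
    1: ["le roi et la cour", "la cour et le roi"],
}
scores = ctfidf(toy)
top = sorted(scores[0], key=scores[0].get, reverse=True)[:2]
print(top)
```

Words that occur in both toy topics ("la", "le", "et") receive a lower score than words specific to one topic, which is why frequent function words only dominate a topic when they are not filtered out or down-weighted.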

## Save and load the model

In [None]:
topic_model.save("./schwerin_paragraphs.mm")



In [None]:
topic_model = BERTopic.load("./schwerin_paragraphs.mm")

## Investigate the topics

topic_model.get_topic_info() returns a Pandas DataFrame that summarizes all discovered topics in your model.

Each row represents one topic and contains metadata about it.

 - Topic → the topic ID (an integer). A value of -1 marks outliers (documents that did not fit into any topic).
 - Count → how many documents were assigned to that topic.
 - Name → a short label for the topic, built from the topic ID and its top keywords.
 - Representation → the key terms (top words) that define the topic.
 - Representative_Docs → the most characteristic texts for each discovered topic.

In [None]:
topic_model.get_topic_info()

Unnamed: 0,Topic,Count,Name,Representation,Representative_Docs
0,-1,259,-1_de_je_et_le,"[de, je, et, le, que, la, il, me, qu, en]",[La Wergeau monta sur le soir dans ma chambre....
1,0,146,0_je_de_que_et,"[je, de, que, et, me, la, le, il, vous, mon]",[Je me levai pour la première fois le jour de ...
2,1,49,1_de_et_que_je,"[de, et, que, je, me, la, le, ma, les, mon]",[Nous arrivâmes qu’il était nuit et n’ayant pe...
3,2,48,2_de_la_je_et,"[de, la, je, et, que, me, le, comtesse, elle, qu]","[Le soir avant, la comtesse de Strattmann m’éc..."
4,3,40,3_de_je_que_le,"[de, je, que, le, et, la, comte, il, en, schwe...",[Enfin le comte de Virmond me dit qu’il nommer...
5,4,28,4_de_il_que_et,"[de, il, que, et, le, je, qu, me, la, pour]",[L’arrêt du prisonnier faisait toujours beauco...
6,5,27,5_de_et_mon_que,"[de, et, mon, que, il, père, le, je, ma, me]",[Le lendemain nous partîmes pour Berlin. Je fu...
7,6,26,6_de_je_que_il,"[de, je, que, il, le, me, la, et, comte, qu]","[J’étais plus morte que vive à ce compliment, ..."
8,7,22,7_je_de_que_et,"[je, de, que, et, le, elle, qu, me, la, il]",[La Wergeau assura que si le comte de Schwerin...
9,8,18,8_de_et_je_que,"[de, et, je, que, le, me, il, la, qu, ne]",[J’étais seule alors chez moi avec mes deux Ch...
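
The Count column can also be reproduced from the `topics` list returned by `fit_transform`, which is a quick way to see how large the outlier group is (259 paragraphs in the table above). A minimal sketch, using a short made-up list in place of the real `topics`; `topic_sizes` is a hypothetical helper name.

```python
from collections import Counter

def topic_sizes(topics):
    """Return (counts per topic ID, share of outliers labelled -1)."""
    counts = Counter(topics)
    outlier_share = counts.get(-1, 0) / len(topics) if topics else 0.0
    return counts, outlier_share

# Made-up assignments for illustration; in the notebook pass `topics` instead
example_topics = [-1, 0, 0, 1, -1, 0, 2, -1]
counts, share = topic_sizes(example_topics)
print(counts.most_common())      # largest groups first
print(f"Outliers: {share:.1%}")
```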


Calling topic_model.get_topic(8) asks BERTopic to return the keywords that define topic number 8. The output is a list of tuples, where:

- The first element is a word strongly associated with the topic.

- The second element is the word’s importance weight (a float, often shown as np.float64).

In [None]:
topic_model.get_topic(8)

[('fièvre', np.float64(0.03130257945088305)),
 ('médecin', np.float64(0.02788126838891301)),
 ('lit', np.float64(0.024938447044508146)),
 ('peu', np.float64(0.021729053684913065)),
 ('larmes', np.float64(0.020483938456216585)),
 ('quand', np.float64(0.02046101105433914)),
 ('temps', np.float64(0.020292110257655243)),
 ('faire', np.float64(0.020066146439553224)),
 ('fort', np.float64(0.01906686272484892)),
 ('trouver', np.float64(0.018618880498813797))]
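
Because the result is a plain list of (word, weight) tuples, it is easy to post-process. The sketch below turns the first entries of the output above into a compact label. `format_topic` is a hypothetical helper, and the weights are rounded copies of the values shown.

```python
def format_topic(words_with_weights, n=3):
    """Join the n highest-weighted words into a short label."""
    ranked = sorted(words_with_weights, key=lambda pair: pair[1], reverse=True)
    return ", ".join(f"{word} ({weight:.3f})" for word, weight in ranked[:n])

# Rounded values copied from the get_topic(8) output above
topic_8 = [("fièvre", 0.0313), ("médecin", 0.0279), ("lit", 0.0249), ("peu", 0.0217)]
print(format_topic(topic_8))
```

In the notebook, the same function could be called as `format_topic(topic_model.get_topic(8))`, since `np.float64` values format like ordinary floats.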

## Finetuning the model

The topics contain many words that don't carry significant meaning. Therefore, we fine-tune the topic representations with a stopword list. This code loads a custom list of stopwords from a text file and uses it in CountVectorizer to filter out common or unimportant words. By removing these stopwords, the model focuses on more meaningful terms in the text. The updated vectorizer is then passed to the topic model via update_topics, which recomputes the topic keywords without re-clustering the documents, so the topics are described with cleaner, more relevant words.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

# Define the stopwords
#stop_words = ["et", "in", "est", "ad", "qui", "autem", "non", "de", "eius", "ut", "quae", "cum", "si", "eum"]

# Read one stopword per line; UTF-8 keeps accented French words intact
with open('./stopwords_edited.txt', 'r', encoding='utf-8') as file:
    stop_words = [word.strip() for word in file if word.strip()]

vectorizer_model = CountVectorizer(stop_words=stop_words)
topic_model.update_topics(paragraphs, vectorizer_model=vectorizer_model)
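
Stopword files often contain blank lines, duplicates, or capitalized entries, while `CountVectorizer` lowercases tokens by default before filtering, so a capitalized stopword would never match. A hedged sketch of a more defensive loader follows. `load_stopwords` is a hypothetical helper, and the one-word-per-line layout of `stopwords_edited.txt` is an assumption based on the cell above.

```python
def load_stopwords(path):
    """Read one stopword per line; lowercase, drop blanks and duplicates."""
    words = []
    with open(path, 'r', encoding='utf-8') as file:
        for line in file:
            word = line.strip().lower()
            if word and word not in words:
                words.append(word)
    return words

# Usage in the notebook (file name taken from the cell above):
# stop_words = load_stopwords('./stopwords_edited.txt')
# vectorizer_model = CountVectorizer(stop_words=stop_words)
```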

## Investigate the topics again

In [None]:
topic_model.get_topic_info()

Unnamed: 0,Topic,Count,Name,Representation,Representative_Docs
0,-1,259,-1_fort_faire_jour_tous,"[fort, faire, jour, tous, temps, monde, très, ...",[La Wergeau monta sur le soir dans ma chambre....
1,0,146,0_dieu_grand_monde_jour,"[dieu, grand, monde, jour, catholique, dire, f...",[Je me levai pour la première fois le jour de ...
2,1,49,1_mère_belle_tous_fort,"[mère, belle, tous, fort, maison, tante, père,...",[Nous arrivâmes qu’il était nuit et n’ayant pe...
3,2,48,2_comtesse_strattmann_vienne_madame,"[comtesse, strattmann, vienne, madame, prince,...","[Le soir avant, la comtesse de Strattmann m’éc..."
4,3,40,3_berlin_vienne_maison_tous,"[berlin, vienne, maison, tous, cour, temps, af...",[Enfin le comte de Virmond me dit qu’il nommer...
5,4,28,4_roi_pays_rien_point,"[roi, pays, rien, point, empereur, fait, ex, m...",[L’arrêt du prisonnier faisait toujours beauco...
6,5,27,5_père_belle_mère_tante,"[père, belle, mère, tante, oncle, monde, abord...",[Le lendemain nous partîmes pour Berlin. Je fu...
7,6,26,6_faire_dönhoff_dire_air,"[faire, dönhoff, dire, air, rien, comtesse, te...","[J’étais plus morte que vive à ce compliment, ..."
8,7,22,7_wergeau_enfants_dieu_faire,"[wergeau, enfants, dieu, faire, point, enfin, ...",[La Wergeau assura que si le comte de Schwerin...
9,8,18,8_fièvre_médecin_lit_peu,"[fièvre, médecin, lit, peu, larmes, quand, tem...",[J’étais seule alors chez moi avec mes deux Ch...


In [None]:
topic_model.get_topic(6)

[('berlin', np.float64(0.12049614810462303)),
 ('père', np.float64(0.014493937897551159)),
 ('retour', np.float64(0.01442894126547829)),
 ('retournâmes', np.float64(0.012651324604878613)),
 ('fille', np.float64(0.012623490203802984)),
 ('parmi', np.float64(0.011703138126051482)),
 ('campagne', np.float64(0.011570190533555888)),
 ('bruit', np.float64(0.011473530990542354)),
 ('trouverais', np.float64(0.010899826024381741)),
 ('berlinet', np.float64(0.010436534466118936))]

The method get_document_info() shows how each document (in our case each paragraph) is related to the topics discovered by your topic model.

 - Document → The original text snippet (or start of it) from your dataset.
 - Topic → The numerical label of the assigned topic. A value of -1 means the paragraph was not strongly assigned to any topic (often considered an “outlier” or background).
 - Name → The default name of the topic, usually formed from its most representative words (e.g., fort_faire_dieu_temps).
 - CustomName → A customizable label for the topic, often derived from the top keywords but editable by the user.
 - Representation → The list of keywords most strongly associated with the topic.
 - Representative_Docs → Example document(s) that best represent this topic.
 - Top_n_words → The top keywords again, but formatted differently for quick viewing.
 - Probability → How strongly this paragraph belongs to the given topic. A value close to 1 means a strong match; 0 means a weak match or none, as for the outliers in topic -1.
 - Representative_document → True if this paragraph is selected as one of the most representative documents for the topic, False otherwise.

In [None]:
topic_model.get_document_info(paragraphs)

Unnamed: 0,Document,Topic,Name,CustomName,Representation,Representative_Docs,Top_n_words,Probability,Representative_document
0,Fürst Khevenhüller 1896,-1,-1_fort_faire_dieu_temps,fort - faire - dieu - temps - jour - tous - mo...,"[fort, faire, dieu, temps, jour, tous, monde, ...",[Quatre joursLe troisième ou quatrième jour ap...,fort - faire - dieu - temps - jour - tous - mo...,0.000000,False
1,« Histoire De la Vie de madame la comtesse de ...,-1,-1_fort_faire_dieu_temps,fort - faire - dieu - temps - jour - tous - mo...,"[fort, faire, dieu, temps, jour, tous, monde, ...",[Quatre joursLe troisième ou quatrième jour ap...,fort - faire - dieu - temps - jour - tous - mo...,0.000000,False
2,écrite par elle-même à ses enfants suivant les...,-1,-1_fort_faire_dieu_temps,fort - faire - dieu - temps - jour - tous - mo...,"[fort, faire, dieu, temps, jour, tous, monde, ...",[Quatre joursLe troisième ou quatrième jour ap...,fort - faire - dieu - temps - jour - tous - mo...,0.000000,False
3,Première partie,-1,-1_fort_faire_dieu_temps,fort - faire - dieu - temps - jour - tous - mo...,"[fort, faire, dieu, temps, jour, tous, monde, ...",[Quatre joursLe troisième ou quatrième jour ap...,fort - faire - dieu - temps - jour - tous - mo...,0.000000,False
4,L’histoire de ma vie est si remarquable et si ...,0,0_dieu_jamais_être_amour,dieu - jamais - être - amour - point - toutes ...,"[dieu, jamais, être, amour, point, toutes, mon...",[J’ai écrit cette période avec plus de circons...,dieu - jamais - être - amour - point - toutes ...,0.842525,False
...,...,...,...,...,...,...,...,...,...
675,"Jusque là, tous mes confesseurs avaient été de...",-1,-1_fort_faire_dieu_temps,fort - faire - dieu - temps - jour - tous - mo...,"[fort, faire, dieu, temps, jour, tous, monde, ...",[Quatre joursLe troisième ou quatrième jour ap...,fort - faire - dieu - temps - jour - tous - mo...,0.000000,False
676,J’ai écrit cette période avec plus de circonst...,0,0_dieu_jamais_être_amour,dieu - jamais - être - amour - point - toutes ...,"[dieu, jamais, être, amour, point, toutes, mon...",[J’ai écrit cette période avec plus de circons...,dieu - jamais - être - amour - point - toutes ...,0.842525,True
677,Je crois qu’avec cette période je pourrai fini...,0,0_dieu_jamais_être_amour,dieu - jamais - être - amour - point - toutes ...,"[dieu, jamais, être, amour, point, toutes, mon...",[J’ai écrit cette période avec plus de circons...,dieu - jamais - être - amour - point - toutes ...,0.768957,False
678,Fini à Cologne ce 7 janvier 1731 .,-1,-1_fort_faire_dieu_temps,fort - faire - dieu - temps - jour - tous - mo...,"[fort, faire, dieu, temps, jour, tous, monde, ...",[Quatre joursLe troisième ou quatrième jour ap...,fort - faire - dieu - temps - jour - tous - mo...,0.000000,False
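
The same document-to-topic mapping is available without a DataFrame, because the `topics` list returned by `fit_transform` is aligned with `paragraphs` by position. A minimal sketch with made-up data; `group_by_topic` is a hypothetical helper name.

```python
from collections import defaultdict

def group_by_topic(documents, topics):
    """Map each topic ID to the documents assigned to it, in original order."""
    if len(documents) != len(topics):
        raise ValueError("documents and topics must have the same length")
    groups = defaultdict(list)
    for doc, topic in zip(documents, topics):
        groups[topic].append(doc)
    return dict(groups)

# Made-up example; in the notebook use `paragraphs` and `topics`
docs_example = ["Première partie", "Le médecin vint", "La fièvre revint"]
groups = group_by_topic(docs_example, [-1, 8, 8])
print(groups[8])
```

Reading all paragraphs of one topic side by side is often the fastest way to judge whether the keywords describe the topic well.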


## Visualize Topics

This visualization shows the top keywords for each discovered topic in the topic model.

In [None]:
topic_model.visualize_barchart(width=280, height=330, top_n_topics=20, n_words=10)

# Create topics with sentences as documents

In [None]:
path = "schwerin_sentences_school_cleaned.txt"
with open(path, "r", encoding="utf-8") as f:
    docs = [l.strip() for l in f if l.strip()]

print(f"Loaded {len(docs)} docs.")

Loaded 9855 docs.


In [None]:
topic_model_sentences = BERTopic(verbose=True, language="french")

topics, probs = topic_model_sentences.fit_transform(docs)

2025-09-08 05:07:21,939 - BERTopic - Embedding - Transforming documents to embeddings.


modules.json:   0%|          | 0.00/229 [00:00<?, ?B/s]

config_sentence_transformers.json:   0%|          | 0.00/122 [00:00<?, ?B/s]

README.md: 0.00B [00:00, ?B/s]

sentence_bert_config.json:   0%|          | 0.00/53.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/645 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/471M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/480 [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/9.08M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/239 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/190 [00:00<?, ?B/s]

Batches:   0%|          | 0/308 [00:00<?, ?it/s]

2025-09-08 05:13:51,466 - BERTopic - Embedding - Completed ✓
2025-09-08 05:13:51,471 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2025-09-08 05:14:40,867 - BERTopic - Dimensionality - Completed ✓
2025-09-08 05:14:40,870 - BERTopic - Cluster - Start clustering the reduced embeddings
2025-09-08 05:14:41,432 - BERTopic - Cluster - Completed ✓
2025-09-08 05:14:41,456 - BERTopic - Representation - Fine-tuning topics using representation models.
2025-09-08 05:14:41,801 - BERTopic - Representation - Completed ✓


In [None]:
from sklearn.feature_extraction.text import CountVectorizer

# Define the stopwords
#stop_words = ["et", "in", "est", "ad", "qui", "autem", "non", "de", "eius", "ut", "quae", "cum", "si", "eum"]

# Read one stopword per line; UTF-8 keeps accented French words intact
with open('./stopwords_edited.txt', 'r', encoding='utf-8') as file:
    stop_words = [word.strip() for word in file if word.strip()]

vectorizer_model = CountVectorizer(stop_words=stop_words)
topic_model_sentences.update_topics(docs, vectorizer_model=vectorizer_model)

In [None]:
topic_model_sentences.get_topic_info()

Unnamed: 0,Topic,Count,Name,Representation,Representative_Docs
0,-1,3618,-1_faire_tous_monde_dire,"[faire, tous, monde, dire, toute, rien, grand,...","[Ainsi que, ma confession, quoique bonne, ne f..."
1,0,1066,0_madame_lettre_roi_faire,"[madame, lettre, roi, faire, maison, jour, cou...",[Le comte de Schwerin avait marqué au comte de...
2,1,509,1_dieu_providence_divine_divin,"[dieu, providence, divine, divin, doux, loué, ...","[Dieu soit loué de ma maladie, voilà une bonne..."
3,2,379,2_prince_roi_reine_empereur,"[prince, roi, reine, empereur, impératrice, pr...",[De là je fus chez madame la Princesse Electri...
4,3,377,3_dame_femme_air_assura,"[dame, femme, air, assura, peut, voulut, fort,...",[Elle s’excusa extrêmement de ce qu’elle ne tr...
...,...,...,...,...,...
99,98,11,98_virmond_virmont_promener_couleur,"[virmond, virmont, promener, couleur, conseill...","[Dès que le comte de Virmond arriva, je me mis..."
100,99,11,99_daniel_auprès_eloignez_enlevée,"[daniel, auprès, eloignez, enlevée, cetave, ma...","[Je n’avais que Daniel à qui me confier., Je d..."
101,100,10,100_fils_eau_touché_exaucâtes,"[fils, eau, touché, exaucâtes, assoupie, brisa...","[Je voulus me reposer la nuit ensuite, mais à ..."
102,101,10,101_mort_mourut_séjour_arrosant,"[mort, mourut, séjour, arrosant, consolaient, ...","[« Oui, lui répétai-je en l’arrosant de mes la..."


## Visualize

In [None]:
topic_model_sentences.visualize_barchart(width=280, height=330, top_n_topics=115, n_words=10)


## Hierarchical Topics

In [None]:
hierarchical_topics = topic_model_sentences.hierarchical_topics(docs)

100%|██████████| 110/110 [00:00<00:00, 219.25it/s]


In [None]:
topic_model_sentences.visualize_hierarchy(hierarchical_topics=hierarchical_topics)

In [None]:
tree = topic_model_sentences.get_topic_tree(hierarchical_topics)
print(tree)

.
├─catholique_catholiques_religion_catholicité_oui
│    ├─■──catholique_catholiques_invoquer_professeurs_octobreou ── Topic: 89
│    └─catholique_catholiques_religion_catholicité_citant
│         ├─■──catholique_catholiques_religion_citant_catholicité ── Topic: 27
│         └─■──catholique_catholiques_adorons_oui_authentique ── Topic: 48
└─comtesse_vienne_berlin_dieu_strattmann
     ├─berlin_vienne_larmes_cour_père
     │    ├─tante_père_frère_lettre_oncle
     │    │    ├─père_frère_lettre_wesel_lecture
     │    │    │    ├─lettre_lecture_lu_écrivis_lettres
     │    │    │    │    ├─■──lecture_lettre_lu_livre_article ── Topic: 35
     │    │    │    │    └─■──lettre_lettres_écrivis_frère_écrivit ── Topic: 30
     │    │    │    └─père_frère_wesel_beau_famille
     │    │    │         ├─frère_père_beau_confessai_pénétration
     │    │    │         │    ├─■──frère_satisfactions_efforçait_retourné_bonnes ── Topic: 94
     │    │    │         │    └─■──père_frère_beau_confessai_pénétr