# BERTopic Exploration of the Stimmungs- und Lageberichte Database Files
**Author:** Christopher Thomas Goodwin

**Creation Date:** 2024.04.10

**Summary:** Uses BERTopic modelling to explore the data in the NSHWE Stimmungs- und Lageberichte files.

In [7]:
import platform
from bertopic import BERTopic
import json
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
import torch

# Get stop words
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')

# Check if GPU acceleration is available and call appropriate libraries
import GPUtil

if len(GPUtil.getAvailable()) > 0:
    from cuml.cluster import HDBSCAN
    #from cuml.manifold import UMAP # GPU-based version of UMAP
    from umap import UMAP # use CPU-based version of UMAP which is better for noisy or duplicate data
    print("GPU engaged.")

    print(f"CUDA available: {torch.cuda.is_available()}")
    print(f"Current GPU device index: {torch.cuda.current_device()}")
    print(f"Current GPU device name: {torch.cuda.get_device_name(torch.cuda.current_device())}")
    print(f"Pytorch Cuda version: {torch.version.cuda}")
else:
    from umap import UMAP
    from hdbscan import HDBSCAN
    print("No GPU engaged.")

GPU engaged.
CUDA available: True
Current GPU device index: 0
Current GPU device name: NVIDIA GeForce RTX 5070 Ti
Pytorch Cuda version: 12.9


[nltk_data] Downloading package stopwords to
[nltk_data]     /home/cgoodwin/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
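
The cell above chooses the clustering backend from GPU presence alone, but a visible GPU does not guarantee that `cuml` is installed. The following is a minimal sketch, assuming one wants to check importability before importing; `pick_backend` is a hypothetical helper and not part of the notebook.

```python
from importlib.util import find_spec


def pick_backend(candidates):
    """Return the first top-level module name in `candidates` that is importable, else None."""
    for name in candidates:
        try:
            if find_spec(name) is not None:
                return name
        except (ImportError, ValueError):
            # find_spec can raise for malformed or partially installed packages
            continue
    return None


# Prefer the GPU implementation and fall back to the CPU one.
hdbscan_backend = pick_backend(["cuml", "hdbscan"])
print(f"HDBSCAN backend: {hdbscan_backend}")
```

The returned name could then drive the conditional imports, so a missing package leads to the CPU path instead of an `ImportError`.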


In [8]:
# Check which platform user is on and set the data path accordingly
print(f"The Operating System is {platform.system()}")

if platform.system() == "Linux":
    path = "/home/cgoodwin/Documents/Programming/TextMiningNaziIdeology/data/json/stimmungs_data_sentences.json"
elif platform.system() == "Darwin":
    path = "/Users/cgoodwin/Programming Projects/TextMiningNaziIdeology/data/json/stimmungs_data_sentences.json"
else:
    path = "C:\\Users\\Christopher Goodwin\\Documents\\Programming Projects\\TextMiningNaziIdeology\\data\\json\\stimmungs_data_sentences.json"
    
with open(path, "r", encoding="utf-8") as f:
    # The JSON loads as a dictionary keyed by the strings "0", "1", ..., one key per entry
    files = json.load(f)

# Keep only the textual data: the report of each entry, in index order
reports = [files[str(i)]["report"] for i in range(len(files))]

print("File loaded.")

The Operating System is Linux
File loaded.
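
The loader relies on the JSON being keyed by consecutive string indices, each entry carrying a `"report"` field. The sketch below reproduces that shape with an invented two-entry sample (the sentences are placeholders, not data from the project) and reads the keys in numeric order, which avoids a `KeyError` if an index is ever missing.

```python
import json

# Invented sample in the shape the loader expects: {"0": {"report": ...}, "1": {...}}
sample = json.dumps(
    {
        "0": {"report": "Erster Satz."},
        "1": {"report": "Zweiter Satz."},
    },
    ensure_ascii=False,
)

files = json.loads(sample)

# Sort the keys numerically so the order does not depend on how the file was written.
reports = [files[key]["report"] for key in sorted(files, key=int)]
print(reports)
```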


In [9]:
# set up vectorizer for German stopwords
german_stop_words = stopwords.words('german')
additional_stop_words = ["volk", "volksgemeinschaft", "1939", "1940", "1941", "1942", "1943", "1944", "1945", "deutsch", "bevölkerung", "ii", "iii", "iv", "v", "vi", "einzelmeldungen", "volksgenossen", "sei", "seien", "worden", "meldungen", "deutsche", "deutschen", "wegen", "wurde", "gif", "pro", "kg", "minusbox", "images", "rm"]

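# Also treat every integer from 0 to 1945 as a stop word (this already covers the years listed above)
additional_stop_words.extend(str(i) for i in range(0, 1946))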

german_stop_words.extend(additional_stop_words)

vectorizer_model = CountVectorizer(stop_words=german_stop_words)
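
`CountVectorizer` lowercases the text and, with its default `token_pattern`, keeps only tokens of two or more word characters, so single-character stop words such as "v" or "5" never match anything. The sketch below imitates that behaviour with the standard `re` module; it is an approximation for illustration, not scikit-learn's implementation, and it uses only a small subset of the stop words defined above.

```python
import re

# Default token pattern used by scikit-learn's CountVectorizer.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def tokenize(text, stop_words):
    """Lowercase, split with the default pattern, then drop stop words."""
    stop = set(stop_words)
    return [tok for tok in TOKEN_PATTERN.findall(text.lower()) if tok not in stop]


# Illustrative subset of the notebook's stop words plus the integers 0 to 1945.
stop_words = ["volk", "deutsche", "v"] + [str(i) for i in range(0, 1946)]

tokens = tokenize("V. Deutsche Kirche 1940, 5 Bauern", stop_words)
print(tokens)
```

Because matching happens after lowercasing, the stop words themselves need to be lowercase, as they are in the list above.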

In [10]:
# Adjust UMAP and HDBSCAN parameters
umap_model = UMAP(n_components=5, n_neighbors=10, min_dist=0.2)
hdbscan_model = HDBSCAN(min_samples=5, min_cluster_size=5, prediction_data=True)

# Initialize BERTopic with adjusted models
topic_model = BERTopic(umap_model=umap_model, hdbscan_model=hdbscan_model, embedding_model="paraphrase-multilingual-MiniLM-L12-v2", language="multilingual", vectorizer_model=vectorizer_model, verbose=True, nr_topics=15, top_n_words=10)

In [11]:
topics, probs = topic_model.fit_transform(reports)

2025-07-19 11:36:26,769 - BERTopic - Embedding - Transforming documents to embeddings.
Batches: 100%|██████████| 3910/3910 [00:28<00:00, 139.48it/s]
2025-07-19 11:36:57,543 - BERTopic - Embedding - Completed ✓
2025-07-19 11:36:57,544 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2025-07-19 11:37:31,136 - BERTopic - Dimensionality - Completed ✓
2025-07-19 11:37:31,147 - BERTopic - Cluster - Start clustering the reduced embeddings
2025-07-19 11:37:33,154 - BERTopic - Cluster - Completed ✓
2025-07-19 11:37:33,154 - BERTopic - Representation - Extracting topics using c-TF-IDF for topic reduction.
2025-07-19 11:37:34,361 - BERTopic - Representation - Completed ✓
2025-07-19 11:37:34,362 - BERTopic - Topic reduction - Reducing number of topics
2025-07-19 11:37:34,442 - BERTopic - Representation - Fine-tuning topics using representation models.
2025-07-19 11:37:35,247 - BERTopic - Representation - Completed ✓
2025-07-19 11:37:35,255 - BERTopic - Topic reduction -
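
The progress bar in the log gives a rough size for the corpus. This is a small worked calculation, assuming the embedding step used a batch size of 32 (the default of sentence-transformers' `encode`); that batch size is an assumption and does not appear in the log.

```python
batches = 3910    # from the "Batches" progress bar in the log above
batch_size = 32   # assumption: default encode batch size, not shown in the log

# The last batch may be partially filled, so the count is only bounded.
max_docs = batches * batch_size
min_docs = (batches - 1) * batch_size + 1
print(f"Between {min_docs:,} and {max_docs:,} sentences were embedded.")
```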

In [12]:
topic_model.get_topic_info()

| | Topic | Count | Name | Representation | Representative_Docs |
|---|---|---|---|---|---|
| 0 | -1 | 69725 | -1_hätten_besonders_mehr_immer | [hätten, besonders, mehr, immer, zeit, wurden,... | [V. Wirtschaft Gefährdung der Frühjahrsbestel... |
| 1 | 0 | 22127 | 0_kirche_immer_mehr_besonders | [kirche, immer, mehr, besonders, katholischen,... | [Kirche gehandelt hätten“., Besonders in den k... |
| 2 | 1 | 11021 | 1_frankfurt_tschechen_juden_sowjets | [frankfurt, tschechen, juden, sowjets, tschech... | [B. Wien, Stuttgart, Karlsruhe, Braunschweig, ... |
| 3 | 2 | 9804 | 2_heißt_zt_rundfunk_stimmen | [heißt, zt, rundfunk, stimmen, bez, vo, kath, ... | [So heißt es z., So heißt es z., So heißt es z.] |
| 4 | 3 | 5623 | 3_bauern_kartoffeln_mehr_hätten | [bauern, kartoffeln, mehr, hätten, obst, milch... | [Auch die Versorgung mit Kartoffeln würde imme... |
| 5 | 4 | 3384 | 4_frauen_lehrer_wohnungen_schulen | [frauen, lehrer, wohnungen, schulen, kinder, h... | [Besonders die jüngeren Frauen erreichten – wi... |
| 6 | 5 | 1472 | 5_sendung_musik_filme_künstler | [sendung, musik, filme, künstler, kunst, sendu... | [In der Sendung „Noten und Anekdoten“ habe bes... |
| 7 | 6 | 826 | 6_feiern_gut_veranstaltungen_feier | [feiern, gut, veranstaltungen, feier, beteilig... | [Zurückgeführt wird die geringe Beteiligung in... |
| 8 | 7 | 700 | 7_000_stundenlohn_umsatz_etwa | [000, stundenlohn, umsatz, etwa, einkommen, zt... | [\| 573.000 \| 407.000 \| 66.000 \| 60.000 ... |
| 9 | 8 | 199 | 8_weihnachten_weihnachtsfest_weihnachtszeit_we... | [weihnachten, weihnachtsfest, weihnachtszeit, ... | [Der Wunsch, alle Familienangehörigen Weihnach... |
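
Topic -1 is BERTopic's outlier bucket, and it dominates this output. The short calculation below uses only the ten counts listed in the table above; any topics beyond those rows are not included, so the share is relative to the listed rows.

```python
# Counts copied from the ten rows listed above (topic id -> number of sentences).
counts = {-1: 69725, 0: 22127, 1: 11021, 2: 9804, 3: 5623,
          4: 3384, 5: 1472, 6: 826, 7: 700, 8: 199}

listed_total = sum(counts.values())
outlier_share = counts[-1] / listed_total
print(f"{counts[-1]:,} of {listed_total:,} listed sentences are outliers "
      f"({outlier_share:.1%}).")
```

If that share is too high for the analysis, BERTopic's `reduce_outliers` method is one option for reassigning outlier documents to existing topics after fitting.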


In [13]:
topic_model.get_topic(0)

[('kirche', np.float64(0.017444743999783414)),
 ('immer', np.float64(0.013086768301109959)),
 ('mehr', np.float64(0.011883179989136404)),
 ('besonders', np.float64(0.010746610603440222)),
 ('katholischen', np.float64(0.009844580393254268)),
 ('vielfach', np.float64(0.009703435079817791)),
 ('hätten', np.float64(0.009577731805176697)),
 ('italien', np.float64(0.008937926461518234)),
 ('zeit', np.float64(0.00879477540551662)),
 ('könne', np.float64(0.00850744917601348))]
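
The `Name` column in the topic table appears to be the topic id followed by the four highest-scoring words. The sketch below rebuilds the name of topic 0 from the first five pairs of the output above (scores rounded for brevity); the construction rule is an assumption inferred from that output.

```python
# (word, score) pairs copied from the get_topic(0) output above, scores rounded.
topic_id = 0
topic_words = [
    ("kirche", 0.0174), ("immer", 0.0131), ("mehr", 0.0119),
    ("besonders", 0.0107), ("katholischen", 0.0098),
]

# Assumption: the default name is the topic id plus the four highest-scoring words.
ranked = sorted(topic_words, key=lambda pair: pair[1], reverse=True)
name = "_".join([str(topic_id)] + [word for word, _ in ranked[:4]])
print(name)
```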

In [14]:
topic_model.visualize_barchart()

In [None]:
# Valid serialization values are "safetensors", "pytorch" and "pickle" (no leading dot)
topic_model.save("my_model", serialization="safetensors")

# Generative Labeling

In [15]:
import requests

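def query_ollama(prompt, model="gemma3:12b", temperature=0.1):
    """Send a prompt to a local Ollama server and return the generated text."""
    url = "http://localhost:11434/api/generate"
    response = requests.post(url, json={
        "model": model,
        "prompt": prompt,
        # Ollama reads sampling parameters such as temperature from "options"
        "options": {"temperature": temperature},
        "stream": False
    })
    response.raise_for_status()  # surface HTTP errors instead of a KeyError below

    return response.json()['response'].strip()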

top_topic_ids = topic_model.get_topic_info().head(10)['Topic'].tolist()

topic_keywords = {topic_id: topic_model.get_topic(topic_id) for topic_id in top_topic_ids}

custom_labels = {}


for topic_id, keywords in topic_keywords.items():
    words = ', '.join([word for word, _ in keywords])
    prompt = f"Given these keywords: {words}, generate a short, descriptive topic label that summarizes the theme. All of the topics come from the period 1939 to 1945 and are related to the Sicherheitsdienst in Nazi Germany. They are the ones who wrote the reports."
    label = query_ollama(prompt)
    custom_labels[topic_id] = label
    print(f"Topic {topic_id}: {label}")
    
# Register the generated labels by topic id. Topic ids start at -1 (outliers),
# so indexing a label list by topic id would put every label in the wrong slot.
topic_model.set_topic_labels(custom_labels)

# custom_labels=True makes the plot use the labels registered above
topic_model.visualize_topics(custom_labels=True)

Topic -1: Here's a short, descriptive topic label summarizing the theme based on the keywords and context:

**SD Reporting & Analysis (1939-1945)**

Here's why this works:

*   **SD Reporting:** Directly references the Sicherheitsdienst and their primary function.
*   **Analysis:** The keywords suggest a process of evaluation ("hätten," "könne," "mehr") related to information.
*   **(1939-1945):**  Specifies the timeframe.



The keywords like "bereits" (already), "immer" (always), "zeit" (time) and "teil" (part) point to ongoing observations and documentation - core to SD reporting. "Jedoch" (however) indicates critical assessment and potential contradictions within the reports. "Besonders" highlights areas of focus.
Topic 0: Here are a few topic label options based on the keywords and context (SD reports, 1939-1945), ranging in specificity:

**Option 1 (Most Concise):**

*   **Catholic Church Surveillance (1939-1945)**

**Option 2 (Slightly more descriptive):**

*   **SD Reports on C
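
The replies above wrap the actual label in explanation and option headings, which makes them unwieldy as plot labels. One possible post-processing step is sketched below; `extract_label` is a hypothetical helper, and the heuristic (take the first bold span that does not end in a colon) is an assumption based on the two replies shown, not a general parser.

```python
import re

BOLD = re.compile(r"\*\*(.+?)\*\*")


def extract_label(response):
    """Return the first bold span that is not a heading such as '**Option 1:**'."""
    for candidate in BOLD.findall(response):
        candidate = candidate.strip()
        if candidate and not candidate.endswith(":"):
            return candidate
    # Fall back to the first non-empty line when nothing usable is bolded.
    for line in response.splitlines():
        if line.strip():
            return line.strip()
    return ""


# Shortened version of the Topic 0 reply shown above.
reply = (
    "Here are a few topic label options:\n\n"
    "**Option 1 (Most Concise):**\n\n"
    "*   **Catholic Church Surveillance (1939-1945)**\n"
)
print(extract_label(reply))
```

Applying such a function to each reply before storing it in `custom_labels` would keep the labels short; asking the model in the prompt to return only the label is an alternative.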

# Apply TF-IDF to Model

In [None]:
tfidf_vectorizer = TfidfVectorizer(min_df=5, stop_words=german_stop_words)
embeddings = tfidf_vectorizer.fit_transform(reports)

tfidf_model = BERTopic(nr_topics=75)
# fit() returns the model itself; fit_transform() returns the topics and probabilities
tfidf_topics, tfidf_probs = tfidf_model.fit_transform(reports, embeddings)

In [None]:
tfidf_model.get_topic_info()
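
The `min_df=5` setting above drops every term that occurs in fewer than five documents before the TF-IDF matrix is built. The sketch below shows the same document-frequency filter on an invented toy corpus with `min_df=2`; `vocabulary_with_min_df` is a hypothetical helper that mirrors the idea, not scikit-learn's code.

```python
from collections import Counter


def vocabulary_with_min_df(tokenized_docs, min_df):
    """Keep terms that occur in at least `min_df` different documents."""
    doc_freq = Counter()
    for tokens in tokenized_docs:
        doc_freq.update(set(tokens))  # count each term once per document
    return sorted(term for term, df in doc_freq.items() if df >= min_df)


# Invented toy corpus; the notebook passes the report sentences instead.
docs = [
    ["kirche", "bauern"],
    ["kirche", "kirche", "rundfunk"],
    ["bauern", "kirche"],
]
print(vocabulary_with_min_df(docs, min_df=2))
```

Repeated occurrences within one document do not raise the document frequency, which is why "rundfunk" is dropped here even though the corpus is tiny.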