## 3.5 Topic Modeling

To better understand the differences in communication of politicians on Twitter and in the Bundestag, we perform a topic modeling. For this we test three different approaches, before we choose the best performing as our final model. We apply hyperparameter tuning if applicable but omit classic train test split validation. We are gonna analyse the validity of the topic model in the Results section.

In [1]:
import pickle
from pprint import pprint
from imp import reload
from operator import itemgetter
from datetime import datetime
import os
os.environ["TOKENIZERS_PARALLELISM"] = "false"

import numpy as np
import pandas as pd

import gensim
import gensim.corpora as corpora
from gensim.utils import simple_preprocess
from gensim.models import CoherenceModel, LdaMulticore
from gensim.models.nmf import Nmf

import pyLDAvis
import pyLDAvis.gensim_models
pyLDAvis.enable_notebook()

from bertopic import BERTopic

from tqdm.notebook import tqdm
tqdm.pandas()

  if LooseVersion(ipywidgets.__version__) >= LooseVersion("7.0.0"):
  if LooseVersion(ipywidgets.__version__) >= LooseVersion("7.0.0"):
  if LooseVersion(ipywidgets.__version__) >= LooseVersion("7.0.0"):
  if LooseVersion(ipywidgets.__version__) >= LooseVersion("7.0.0"):


### 3.5.1 Latent Dirichlet Allocation

Latent Dirichlet Allocation (LDA) constitutes a state of the art approach [Citaiton needed https://journalofbigdata.springeropen.com/track/pdf/10.1186/s40537-019-0255-7.pdf)] for topic modeling. LDA is a unsupervised machine learning technique that uses generative statistical models to extract topics from a collection of documents [citation needed]. Based on the underlying model it assigns a probability distribution over the vocabulary of the documents to topics that can be used for the topic detection. We will base our choice of the optimal hyperparameter combination on the coherence of the resulting topic model. This decision is based on the discussion [here](http://topicmodels.info/ckling/tmt/part4.pdf) and [here](https://dl.acm.org/doi/abs/10.1145/2684822.2685324). 

#### Define hyperparameters for optimization.

We optimize the hyperparameters of the LDA model based on a grid search with the variables topic number (k), the a-priory belief of document-topic distribution (alpha) and the the a-priory  belief of topic-word distribution (eta) [citation of https://radimrehurek.com/gensim/models/ldamodel.html]. This hyperparemter optimization is loosely based on this [article](https://towardsdatascience.com/evaluate-topic-model-in-python-latent-dirichlet-allocation-lda-7d57484bb5d0).

In [2]:
# Topics range
min_topics = 10
max_topics = 150
step_size = 10
topics_range = range(min_topics, max_topics, step_size)

In [3]:
# Alpha parameter
alpha = list(np.arange(0.01, 1, 0.3)) #
alpha.append('symmetric')
alpha.append('asymmetric')

In [4]:
# Beta parameter
beta = list(np.arange(0.01, 1, 0.3))
beta.append('symmetric')

In [5]:
# Function for calculating coherence values of specific hyperparamter combinations
def compute_lda_coherence_values(corpus, text, id2word, k, a, b):
    lda_model = LdaMulticore(corpus=corpus,
                             id2word=id2word,
                             num_topics=k, 
                             random_state=42,
                             alpha=a,
                             eta=b)
    coherence_model_lda = CoherenceModel(model=lda_model, texts=text, dictionary=id2word, coherence='c_v')
    return coherence_model_lda.get_coherence()

In [6]:
# Function for executing a hyperparameter optimization
def hyperparameter_lda(data_preprocessed, title, topics_range, alpha, beta):
    id2word = corpora.Dictionary(data_preprocessed.text_preprocessed.to_list())
    # These hyperparameter could also be trialed in an extend scope
    id2word.filter_extremes(no_below=10, no_above=0.1)
    texts = data_preprocessed.text_preprocessed.to_list()
    corpus = [id2word.doc2bow(text) for text in texts]
    model_results = {'Topics': [],
                     'Alpha': [],
                     'Beta': [],
                     'Coherence': []
                    }
    grid = {}
    grid['Validation_Set'] = {}
    for k in tqdm(topics_range):
        print("Number of topics:" + str(k))
        for a in tqdm(alpha):
            print("Alpha value:" + str(a))
            for b in tqdm(beta):
                print("Beta value:" + str(b))
                cv = compute_lda_coherence_values(corpus=corpus,text = texts,
                                              id2word=id2word, k=k, a=a, b=b)
                model_results['Topics'].append(k)
                model_results['Alpha'].append(a)
                model_results['Beta'].append(b)
                model_results['Coherence'].append(cv)
    results_df = pd.DataFrame(model_results)
    results_df.to_csv('../data/processed/lda_tuning_results_' + title + '.csv', index=False)
    return results_df

#### 3.5.1.1 Hyperparameter optimization LDA for tweets

In [7]:
# Load data
tweets_processed_lda = pickle.load(open( "../data/processed/tweets_processed.p", "rb" ))

In [8]:
# Hyperparameter optimization
# Uncomment if you want to repeat the hyperparameter optimization
# hyperparameter_lda_tweets = hyperparameter_lda(tweets_processed_lda, "tweets", topics_range, alpha, beta)

In [9]:
# Save hyperparameter
# Uncomment if you want to repeat the hyperparameter optimization
# hyperparameter_lda_tweets.to_csv('../data/processed/lda_tuning_results_tweets.csv', index = False)

#### 3.5.1.2 Calculate best model LDA for tweets

In [10]:
# Load data
lda_tuning_results_tweets = pd.read_csv('../data/processed/lda_tuning_results_tweets.csv')

In [11]:
# Prepare corpus
id2word_tweets_lda = corpora.Dictionary(tweets_processed_lda.text_preprocessed.to_list())
id2word_tweets_lda.filter_extremes(no_below=5, no_above=0.1)
texts_tweets_lda = tweets_processed_lda.text_preprocessed.to_list()
corpus_tweets_lda = [id2word_tweets_lda.doc2bow(text) for text in texts_tweets_lda]

In [12]:
k_optimal_lda_tweets = int(lda_tuning_results_tweets.sort_values("Coherence", ascending = False).reset_index(drop = True).Topics[0])
try:
    a_optimal_lda_tweets = float(lda_tuning_results_tweets.sort_values("Coherence", ascending = False).reset_index(drop = True).Alpha[0])
except ValueError:
    a_optimal_lda_tweets = lda_tuning_results_tweets.sort_values("Coherence", ascending = False).reset_index(drop = True).Alpha[0]
try:
    b_optimal_lda_tweets = float(lda_tuning_results_tweets.sort_values("Coherence", ascending = False).reset_index(drop = True).Beta[0])
except ValueError:
    b_optimal_lda_tweets = lda_tuning_results_tweets.sort_values("Coherence", ascending = False).reset_index(drop = True).Beta[0]

In [13]:
# Train model
lda_model_tweets = LdaMulticore(corpus=corpus_tweets_lda,
                                 id2word=id2word_tweets_lda,
                                 num_topics=k_optimal_lda_tweets,
                                 random_state=42,
                                 alpha=a_optimal_lda_tweets,
                                 eta=b_optimal_lda_tweets)

In [14]:
# Calculate final coherence value
coherence_model_lda_tweets = CoherenceModel(model=lda_model_tweets, texts=texts_tweets_lda, dictionary=id2word_tweets_lda, coherence='c_v')
coherence_lda_tweets = coherence_model_lda_tweets.get_coherence()
print("The final model coherence of the LDA for Tweets is: " + str(round(coherence_lda_tweets,2)))

The final model coherence of the LDA for Tweets is: 0.54


In [15]:
# Visually inspect result
lda_vis_tweets = pyLDAvis.gensim_models.prepare(lda_model_tweets, corpus_tweets_lda, id2word_tweets_lda)
lda_vis_tweets

#### 3.5.1.3 Hyperparameter optimization LDA for speeches

In [16]:
# Load data
speeches_processed_lda = pickle.load(open( "../data/processed/speeches_processed.p", "rb" ))

In [17]:
# Hyperparameter optimization
# Uncomment if you want to repeat the hyperparameter optimization
# hyperparameter_lda_speeches = hyperparameter_lda(speeches_processed_lda, "tweets", topics_range, alpha, beta)

In [18]:
# Save hyperparameter
# Uncomment if you want to repeat the hyperparameter optimization
# hyperparameter_lda_speeches.to_csv('../data/processed/lda_tuning_results_speeches.csv', index = False)

#### 3.5.1.4 Calculate best model LDA for speeches

In [19]:
# Load data
lda_tuning_results_speeches = pd.read_csv('../data/processed/lda_tuning_results_speeches.csv')

In [20]:
# Prepare corpus
id2word_speeches_lda = corpora.Dictionary(tweets_processed_lda.text_preprocessed.to_list())
id2word_speeches_lda.filter_extremes(no_below=5, no_above=0.1)
texts_speeches_lda = tweets_processed_lda.text_preprocessed.to_list()
corpus_speeches_lda = [id2word_speeches_lda.doc2bow(text) for text in texts_speeches_lda]

In [21]:
k_optimal_lda_speeches = int(lda_tuning_results_speeches.sort_values("Coherence", ascending = False).reset_index(drop = True).Topics[0])
try:
    a_optimal_lda_speeches = float(lda_tuning_results_speeches.sort_values("Coherence", ascending = False).reset_index(drop = True).Alpha[0])
except ValueError:
    a_optimal_lda_speeches = lda_tuning_results_speeches.sort_values("Coherence", ascending = False).reset_index(drop = True).Alpha[0]
try:
    b_optimal_lda_speeches = float(lda_tuning_results_speeches.sort_values("Coherence", ascending = False).reset_index(drop = True).Beta[0])
except ValueError:
    b_optimal_lda_speeches = lda_tuning_results_speeches.sort_values("Coherence", ascending = False).reset_index(drop = True).Beta[0]

In [22]:
# Train model
lda_model_speeches = LdaMulticore(corpus=corpus_speeches_lda,
                                 id2word=id2word_speeches_lda,
                                 num_topics=k_optimal_lda_speeches,
                                 random_state=42,
                                 alpha=a_optimal_lda_speeches,
                                 eta=b_optimal_lda_speeches)

In [23]:
# Calculate final coherence value
coherence_model_lda_speeches = CoherenceModel(model=lda_model_speeches, texts=texts_speeches_lda, dictionary=id2word_speeches_lda,
                                                    coherence='c_v')
coherence_lda_speeches = coherence_model_lda_speeches.get_coherence()
print("The final model coherence of the LDA for Speeches is: " + str(round(coherence_lda_speeches,2)))

The final model coherence of the LDA for Speeches is: 0.37


In [24]:
# Visually inspect result
lda_vis_speeches = pyLDAvis.gensim_models.prepare(lda_model_speeches, corpus_speeches_lda, id2word_speeches_lda)
lda_vis_speeches

### 3.5.2 Non Negative Matrix Factorization

Another approach for topic modeling we are testing is Non Negative Matrix Factorization (NNMF). This technique is another unsupervised machine learning method, that factorizes a matrix into two matrices, that give a less complex representation of the the original matrix. In our case we use it for creating a document-term matrix, that help to identify topics of the considered documents.

#### Define hyperparameters for optimization.

We optimize the hyperparameters of the NNMF model based on a grid search with the variables topic number (k).

In [25]:
# Function for calculating coherence values of specific hyperparamter combinations
def compute_nnmf_coherence_values(corpus, text, id2word, k):
    nmf_model = Nmf(
        corpus=corpus,
        id2word=id2word,
        num_topics=k,
        random_state=42
    )
    coherence_model_lda = CoherenceModel(model=nmf_model, texts=text, dictionary=id2word, coherence='c_v')
    return coherence_model_lda.get_coherence()

In [26]:
# Function for executing a hyperparameter optimization
def hyperparameter_nnmf(data_preprocessed, title, topics_range):
    id2word = corpora.Dictionary(data_preprocessed.text_preprocessed.to_list())
    id2word.filter_extremes(no_below=10, no_above=0.1)
    texts = data_preprocessed.text_preprocessed.to_list()
    corpus = [id2word.doc2bow(text) for text in texts]
    model_results = {'Topics': [],
                     'Coherence': []
                    }
    grid = {}
    grid['Validation_Set'] = {}
    for k in tqdm(topics_range):
        print("Number of topics:" + str(k))
        cv = compute_nnmf_coherence_values(corpus=corpus,text = texts,
                                      id2word=id2word, k=k)
        model_results['Topics'].append(k)
        model_results['Coherence'].append(cv)
    results_df = pd.DataFrame(model_results)
    results_df.to_csv('../data/processed/nnmf_tuning_results_' + title + '.csv', index=False)
    return results_df

#### 3.5.2.1 Hyperparameter optimization NNMF for tweets

In [27]:
# Load data
tweets_processed_nnmf = pickle.load(open("../data/processed/tweets_processed.p", "rb" ))

In [28]:
# Hyperparameter optimization
# Uncomment if you want to repeat the hyperparameter optimization
# hyperparameter_nnmf_tweets = hyperparameter_nnmf(tweets_processed_nnmf, "tweets", topics_range)

In [29]:
# Save hyperparameter
# Uncomment if you want to repeat the hyperparameter optimization
# hyperparameter_nnmf_tweets.to_csv('../data/processed/nnmf_tuning_results_tweets.csv', index = False)

#### 3.5.2.2 Calculate best model NNMF for tweets

In [30]:
# Load data
tweets_processed_nnmf = pickle.load(open( "../data/processed/tweets_processed.p", "rb" ))
nnmf_tuning_results_tweets = pd.read_csv('../data/processed/nnmf_tuning_results_tweets.csv')

In [31]:
# Prepare corpus
id2word_tweets_nnmf = corpora.Dictionary(tweets_processed_nnmf.text_preprocessed.to_list())
id2word_tweets_nnmf.filter_extremes(no_below=5, no_above=0.1)
texts_tweets_nnmf = tweets_processed_nnmf.text_preprocessed.to_list()
corpus_tweets_nnmf = [id2word_tweets_nnmf.doc2bow(text) for text in texts_tweets_nnmf]

In [32]:
k_optimal_nnmf_tweets = int(lda_tuning_results_tweets.sort_values("Coherence", ascending = False).reset_index(drop = True).Topics[0])

In [33]:
# Train model
nnmf_model_tweets = Nmf(corpus=corpus_tweets_nnmf,
                                 id2word=id2word_tweets_nnmf,
                                 num_topics=k_optimal_nnmf_tweets,
                                 random_state=42)

In [34]:
# Calculate final coherence value
coherence_model_nnmf_tweets = CoherenceModel(model=nnmf_model_tweets, texts=texts_tweets_nnmf, dictionary=id2word_tweets_nnmf,
                                                    coherence='c_v')
coherence_nnmf_tweets = coherence_model_nnmf_tweets.get_coherence()
print("The final model coherence of the NNMF for tweets is: " + str(round(coherence_nnmf_tweets,2)))

The final model coherence of the NNMF for tweets is: 0.33


In [35]:
# Visually inspect result
nnmf_model_tweets.show_topics()

[(33,
  '0.121*"deutsch" + 0.055*"treffen" + 0.017*"erklären" + 0.014*"absolut" + 0.009*"schutz" + 0.008*"kennen" + 0.005*"unternehmen" + 0.005*"geschichte" + 0.004*"person" + 0.004*"fragen"'),
 (32,
  '0.302*"mal" + 0.009*"schauen" + 0.008*"lesen" + 0.006*"schau" + 0.004*"sagen" + 0.004*"rein" + 0.004*"reden" + 0.004*"ne" + 0.003*"fragen" + 0.003*"eigentlich"'),
 (84,
  '0.081*"handeln" + 0.075*"herr" + 0.048*"verantwortung" + 0.023*"tragen" + 0.020*"grund" + 0.013*"sache" + 0.011*"übernehmen" + 0.009*"sichern" + 0.007*"treffen" + 0.006*"alt"'),
 (99,
  '0.066*"schritt" + 0.064*"gesetz" + 0.037*"polizei" + 0.036*"politische" + 0.013*"stimmen" + 0.011*"richtung" + 0.011*"ändern" + 0.008*"mitarbeiter" + 0.008*"richtige" + 0.007*"konzept"'),
 (19,
  '0.311*"deutschland" + 0.005*"frankreich" + 0.005*"million" + 0.003*"verantwortung" + 0.003*"jude" + 0.003*"zahl" + 0.003*"groß" + 0.003*"weltweit" + 0.003*"jüdisch" + 0.003*"armut"'),
 (34,
  '0.193*"diskutieren" + 0.014*"welt" + 0.013*"wähl

#### 3.5.2.3 Hyperparameter optimization NNMF for speeches

In [36]:
# Load data
speeches_processed_nnmf = pickle.load(open( "../data/processed/speeches_processed.p", "rb" ))

In [37]:
# Hyperparameter optimization
# Uncomment if you want to repeat the hyperparameter optimization
# hyperparameter_nnmf_speeches = hyperparameter_lda(speeches_processed_nnmf, "tweets", topics_range, alpha, beta)

In [38]:
# Save hyperparameter
# Uncomment if you want to repeat the hyperparameter optimization
# hyperparameter_nnmf_speeches.to_csv('../data/processed/nnmf_tuning_results_speeches.csv', index = False)

#### 3.5.2.4 Calculate best model NNMF for speeches

In [39]:
# Load data
speeches_processed_nnmf = pickle.load(open( "../data/processed/speeches_processed.p", "rb" ))
nnmf_tuning_results_speeches = pd.read_csv('../data/processed/nnmf_tuning_results_speeches.csv')

In [40]:
# Prepare corpus
id2word_speeches_nnmf = corpora.Dictionary(tweets_processed_nnmf.text_preprocessed.to_list())
id2word_speeches_nnmf.filter_extremes(no_below=5, no_above=0.1)
texts_speeches_nnmf = tweets_processed_nnmf.text_preprocessed.to_list()
corpus_speeches_nnmf = [id2word_speeches_nnmf.doc2bow(text) for text in texts_speeches_nnmf]

In [41]:
k_optimal_nnmf_speeches = int(lda_tuning_results_speeches.sort_values("Coherence", ascending = False).reset_index(drop = True).Topics[0])

In [42]:
# Train model
nnmf_model_speeches = Nmf(corpus=corpus_speeches_nnmf,
                                 id2word=id2word_speeches_nnmf,
                                 num_topics=k_optimal_nnmf_speeches,
                                 random_state=42)

In [43]:
# Calculate final coherence value
coherence_model_nnmf_speeches = CoherenceModel(model=nnmf_model_speeches, texts=texts_speeches_nnmf, dictionary=id2word_speeches_nnmf,
                                                    coherence='c_v')
coherence_nnmf_speeches = coherence_model_nnmf_speeches.get_coherence()
print("The final model coherence of the NNMF for Speeches is: " + str(round(coherence_nnmf_speeches,2)))

The final model coherence of the NNMF for Speeches is: 0.33


In [44]:
# Visually inspect result
nnmf_model_speeches.show_topics()

[(1,
  '0.046*"schnellen" + 0.043*"fall" + 0.040*"unternehmen" + 0.034*"sichern" + 0.013*"diskussion" + 0.012*"tweet" + 0.012*"klar" + 0.010*"falsch" + 0.008*"dringen" + 0.008*"herr"'),
 (9,
  '0.283*"mal" + 0.008*"lesen" + 0.008*"schauen" + 0.006*"schau" + 0.005*"tweet" + 0.004*"rein" + 0.004*"reden" + 0.004*"ne" + 0.003*"eigentlich" + 0.003*"letzt"'),
 (75,
  '0.088*"schön" + 0.060*"neu" + 0.023*"gesellschaft" + 0.023*"schritt" + 0.018*"bekommen" + 0.013*"kämpfen" + 0.010*"wünsche" + 0.008*"bauen" + 0.007*"alt" + 0.007*"richtung"'),
 (5,
  '0.121*"stellen" + 0.064*"handeln" + 0.053*"denken" + 0.016*"antrag" + 0.006*"wettbewerb" + 0.005*"fair" + 0.005*"falsch" + 0.005*"verfügung" + 0.004*"dar" + 0.004*"entgegen"'),
 (36,
  '0.086*"vorschlag" + 0.058*"liegen" + 0.048*"hoch" + 0.030*"helfen" + 0.026*"bringen" + 0.015*"gesellschaft" + 0.010*"zahl" + 0.007*"einkommen" + 0.007*"wert" + 0.007*"steuer"'),
 (22,
  '0.246*"europa" + 0.009*"sozial" + 0.007*"friede" + 0.006*"stellen" + 0.006*"gr

### 3.5.3 Bertopic

The last model we apply is [BERTopic](https://doi.org/10.5281/zenodo.4381785), which employs BERT transformers model and c-TF-IDF for creating topic models. This model architecture is pretty new and there is not much existing research on the topic. However first results seem promising.

We do not perform hyperparameter optimization for the BERTopic models, as we are having only limited computational power and the runtime of BERTopic is high.

In [45]:
def calculate_coherence_bert(topic_model, docs, topics):
    cleaned_docs = topic_model._preprocess_text(docs)

    # Extract vectorizer and tokenizer from BERTopic
    vectorizer = topic_model.vectorizer_model
    tokenizer = vectorizer.build_tokenizer()

    # Extract features for Topic Coherence evaluation
    words = vectorizer.get_feature_names()
    tokens = [tokenizer(doc) for doc in cleaned_docs]
    dictionary = corpora.Dictionary(tokens)
    corpus = [dictionary.doc2bow(token) for token in tokens]
    topic_words = [[words for words, _ in topic_model.get_topic(topic)] 
                   for topic in range(len(set(topics))-1)]

    # Evaluate
    coherence_model = CoherenceModel(topics=topic_words, 
                                     texts=tokens, 
                                     corpus=corpus,
                                     dictionary=dictionary, 
                                     coherence='c_v')
    coherence = coherence_model.get_coherence()
    return coherence

In [46]:
def assign_topic(topic_id, topic_model):
    return topic_model.get_topic_info(topic_id).Name.values[0]

#### 3.5.3.1 Compute BERTopic model Tweets

In [47]:
# Load data
tweets_processed_bert = pickle.load(open( "../data/processed/tweets_processed.p", "rb" ))
docs_tweets_bert = tweets_processed_bert.text_preprocessed_sentence.tolist()

In [48]:
# Prepare topic model 
topic_model_tweets = BERTopic(language="german", nr_topics="auto", calculate_probabilities = True, verbose = True)

In [49]:
# Compute Bertopic model
# Uncomment if you want to retrain the network
# start_time_bert_tweets = datetime.now()
# topics_tweets_bert, probs_tweets_bert = topic_model_tweets.fit_transform(docs_tweets_bert)
# end_time_bert_tweets = datetime.now()
# print('Duration: {}'.format(end_time_bert_tweets - start_time_bert_tweets))

In [50]:
# Calculate coherence
# Uncomment if you want to retrain the network
# coherence_bert_tweets = calculate_coherence_bert(topic_model_tweets,docs_tweets_bert, topics_tweets_bert)
# coherence_bert_tweets

In [51]:
# Visualise results
# Uncomment if you want to retrain the network
# topic_model_tweets.visualize_topics()

Based on first analyses we saw that there are too many topics, so we reduce the number of topics with the inherent reduction logic.

In [52]:
# Reduce topics
# Uncomment if you want to retrain the network
# topics_tweets_bert_reduced, probs_tweets_bert_reduced = topic_model_tweets.reduce_topics(docs_tweets_bert,
#                                                                                         topics_tweets_bert,
#                                                                                         probs_tweets_bert,
#                                                                                         nr_topics=25)

In [53]:
# Load model
# Comment out if you retrain the model
with open('../data/processed/topics_tweets_bert.pickle', 'rb') as handle:
    topics_tweets_bert_reduced = pickle.load(handle)
topic_model_tweets = BERTopic.load("../models/bertopic_tweets")

In [54]:
# Visualise results
topic_model_tweets.visualize_topics()

OMP: Info #271: omp_set_nested routine deprecated, please use omp_set_max_active_levels instead.



distutils Version classes are deprecated. Use packaging.version instead.


distutils Version classes are deprecated. Use packaging.version instead.



In [55]:
# Calculate coherence reduced
coherence_bert_tweets_reduced = calculate_coherence_bert(topic_model_tweets, docs_tweets_bert, 
                                                         topics_tweets_bert_reduced)
print("The final model coherence of the BERTopic for Tweets is: " + str(round(coherence_bert_tweets_reduced,2)))

The final model coherence of the BERTopic for Tweets is: 0.51


In [56]:
# Assign results to dataframe
tweets_processed_bert["topic_id"] = topics_tweets_bert_reduced
tweets_processed_bert["topic"] = tweets_processed_bert.topic_id.progress_apply(assign_topic,                                                                                    topic_model = topic_model_tweets)

  0%|          | 0/148312 [00:00<?, ?it/s]

In [57]:
# Save model and results
# Uncomment if you want to retrain the network
# topic_model_tweets.save("../models/bertopic_tweets")
# with open( "../data/processed/tweets_processed_bert.pickle", "wb" ) as handle:
#    pickle.dump(tweets_processed_bert, handle, protocol=pickle.HIGHEST_PROTOCOL)
# with open('../data/processed/probabilities_tweets_bert.pickle', 'wb') as handle:
#     pickle.dump(probs_tweets_bert_reduced, handle, protocol=pickle.HIGHEST_PROTOCOL)
# with open('../data/processed/topics_tweets_bert.pickle', 'wb') as handle:
#    pickle.dump(topics_tweets_bert_reduced, handle, protocol=pickle.HIGHEST_PROTOCOL)

#### 3.5.3.2 Compute BERTopic model Speeches

In [58]:
# Load data
speeches_processed_bert = pickle.load(open( "../data/processed/speeches_processed.p", "rb" ))
docs_speeches_bert = speeches_processed_bert.text_preprocessed_infrequent_sentence.tolist()

In [59]:
# Prepare topic model 
topic_model_speeches = BERTopic(language="german", nr_topics="auto", calculate_probabilities = True, 
                                verbose = True)

In [60]:
# Compute Bertopic mode
# Uncomment if you want to retrain the network
# start_time_bert_speeches = datetime.now()
# topics_speeches_bert, probs_speeches_bert = topic_model_speeches.fit_transform(docs_speeches_bert)
# end_time_bert_speeches = datetime.now()
# print('Duration: {}'.format(end_time_bert_speeches - start_time_bert_speeches))

In [61]:
# Load model
# Comment out if you retrain the model
with open('../data/processed/topics_speeches_bert.pickle', 'rb') as handle:
    topics_speeches_bert = pickle.load(handle)
topic_model_speeches = BERTopic.load("../models/bertopic_speeches")

In [62]:
# Visualise results
# Uncomment if you want to retrain the network
topic_model_speeches.visualize_topics()


distutils Version classes are deprecated. Use packaging.version instead.


distutils Version classes are deprecated. Use packaging.version instead.



In [63]:
# Calculate coherence reduced
coherence_bert_speeches = calculate_coherence_bert(topic_model_speeches, docs_speeches_bert, 
                                                         topics_speeches_bert)
print("The final model coherence of the BERTopic for speeches is: " + str(round(coherence_bert_speeches,2)))

The final model coherence of the BERTopic for speeches is: 0.6


In [64]:
# Assign results to dataframe
# Uncomment if you want to retrain the network
# speeches_processed_bert["topic_id"] = topics_speeches_bert
# speeches_processed_bert["topic"] = speeches_processed_bert.topic_id.progress_apply(assign_topic, 
#                                                                                   topic_model = topic_model_speeches)

In [65]:
# Save model and results
# Uncomment if you want to retrain the network
# topic_model_speeches.save("../models/bertopic_speeches")
# with open( "../data/processed/speeches_processed_bert.pickle", "wb" ) as handle:
#    pickle.dump(speeches_processed_bert, handle, protocol=pickle.HIGHEST_PROTOCOL)
# with open('../data/processed/probabilities_speeches_bert.pickle', 'wb') as handle:
#    pickle.dump(probs_speeches_bert, handle, protocol=pickle.HIGHEST_PROTOCOL)
# with open('../data/processed/topics_speeches_bert.pickle', 'wb') as handle:
#    pickle.dump(topics_speeches_bert, handle, protocol=pickle.HIGHEST_PROTOCOL)

### 3.5.4 Model selection

For the final model selection we evaluate the model based on the coherence and the visual inspection of the model results.

In [67]:
print("The final model coherence of the LDA for tweets is: " + str(round(coherence_lda_tweets,2)))
print("The final model coherence of the NNMF for tweets is: " + str(round(coherence_nnmf_tweets,2)))
print("The final model coherence of the BERTopic for tweets is: " + str(round(coherence_bert_tweets_reduced,2)))
print("The final model coherence of the LDA for speeches is: " + str(round(coherence_lda_speeches,2)))
print("The final model coherence of the NNMF for speeches is: " + str(round(coherence_nnmf_speeches,2)))
print("The final model coherence of the BERTopic for speeches is: " + str(round(coherence_bert_speeches,2)))

The final model coherence of the LDA for tweets is: 0.54
The final model coherence of the NNMF for tweets is: 0.33
The final model coherence of the BERTopic for tweets is: 0.51
The final model coherence of the LDA for speeches is: 0.37
The final model coherence of the NNMF for speeches is: 0.33
The final model coherence of the BERTopic for speeches is: 0.6


 NNMF models did not perform very well in terms of coherence, while the LDA model only showed good coherence values for the tweets dataset. BERTopic could perform well for both datasets in term of cohersion. Based on the visual inspection we saw very good results for BERTopic and medium results for the other two model types. Based on these criteria we decide for the BERTopic model for both dataset to create the final topic model. In the next section we will analyse the results of BERTopic and validate the the selected models based on word and topic intrusion metrics.