# Comparing the communication of German politicians across Twitter and plenary speeches using topic modelling and sentiment analysis

# 1. Introduction (Jakob)

Using Twitter to study politicians' communication is an established approach [citation](https://doi.org/10.1108/AJIM-09-2013-0083) that gained wide publicity through the prominent tweets of Donald J. Trump, the 45th president of the United States of America [citation](https://doi.org/10.1080/15295036.2016.1266686). To better understand the medium, we compare the content and style of politicians' communication on Twitter with their speeches in the German Bundestag.

Understanding how politicians communicate on Twitter and other social media is becoming increasingly important, given the significant influence of their content on society at large [citation](https://doi.org/10.1080/15205436.2019.1614196), [citation](https://doi.org/10.1145/3414752.3414787), [citation](https://doi.org/10.1007/978-3-642-23333-3_3). Besides decentralizing the reciprocal transfer of information between politicians and citizens [citation](https://doi.org/10.1007/978-3-642-23333-3_3), social media also bring growing problems with manipulation [citation](https://doi.org/10.5210/fm.v25i11.11431) and fake news [citation](https://doi.org/10.1075/jlp.21027.wri). An improved understanding of the medium can help to identify harmful practices and to interpret content and style in their context.

This work aims to improve the understanding of politicians' communication patterns on Twitter by comparing the content and sentiment of their tweets with their plenary speeches. We carry out this analysis for prominent German politicians of the 19th Bundestag in the period from 2017 to 2021. For this, we defined the following research questions:

* **RQ 1.1** What are the main topics of the tweets of prominent politicians of the six parties in the German Bundestag in the period of the 19th Bundestag?

* **RQ 1.2** What are the main topics of the speeches of prominent politicians of the six parties in the German Bundestag in the period of the 19th Bundestag?

* **RQ 1.3** How do the main topics of tweets and speeches of prominent politicians of the six parties in the German Bundestag differ in the period of the 19th Bundestag?

Our approach uses data scraped directly from Twitter and plenary speeches obtained from [Open Discourse](https://github.com/open-discourse/open-discourse) to build topic models and sentiment analyses for the politicians' tweets and speeches. For this, we set up a pipeline in Python that preprocesses the data for modelling. For topic modelling, we try several models, namely Latent Dirichlet Allocation, Non-Negative Matrix Factorization and BERTopic, before choosing the best-performing one. For the sentiment analysis, we use an unsupervised dictionary-based approach that we test with two different sentiment dictionaries. We validate our results with current state-of-the-art evaluation methods.

The remainder of this work is structured into four sections: Literature Review, Methodology, Results and Discussion. The literature review analyzes existing research approaches and showcases our contributions. The methodology section comprises the preprocessing and modelling for our analyses, presented as commented code complemented by explanations. In the results section, we analyze and validate the results obtained from the methodology part. Finally, we discuss the results and give an outlook on further work.

# 2. Literature Review (Stjepan)

**Needs to be added** 

# 3. Methodology

# 3.1 Technical setup (Jakob)

We present the results of our work in a Jupyter Notebook that contains the commented code together with additional explanations, analysis and evaluation. The project is written in Python and relies on various existing packages. All results can be reproduced with the provided complementary files. For this, we recommend setting up a conda environment with Python 3.8 and installing all packages from the provided requirements.txt file. In addition, a Docker environment is needed in section 3.3 if you want to reproduce the data collection; separate instructions are given in that section.

In [None]:
# Uncomment when you set up the environment for the first time
# ! pip install -r requirements.txt
# ! python -m spacy download de_core_news_sm

After installing the packages one can import them with the following lines of code.

In [None]:
# Import packages

# Import basic Python packages
import os
import re
import pickle
import random
from pprint import pprint
from importlib import reload  # the "imp" module is deprecated
import warnings
from operator import itemgetter
from datetime import datetime
from collections import Counter
from functools import partial

# Import util packages
from tqdm.notebook import tqdm

# Import data processing packages
import numpy as np
import pandas as pd
import psycopg2

# Import visualisation packages
import matplotlib.pyplot as plt
from wordcloud import WordCloud

# Import natural language processing packages
import spacy
from spacy.language import Language
from spacy_langdetect import LanguageDetector
from spacy.tokens.doc import Doc
from spacy.vocab import Vocab

# Import topic modeling packages
import gensim
import gensim.corpora as corpora
from gensim.utils import simple_preprocess
from gensim.models import CoherenceModel, LdaMulticore
from gensim.models.nmf import Nmf
import pyLDAvis
import pyLDAvis.gensim_models
from bertopic import BERTopic

# Import metrics packages
from sklearn.metrics import cohen_kappa_score

# Import interface widgets
import ipywidgets as widgets
from ipywidgets import IntProgress
from IPython.display import clear_output

# Set options
pd.options.mode.chained_assignment = None
warnings.filterwarnings("ignore", category=DeprecationWarning) 
os.environ["TOKENIZERS_PARALLELISM"] = "false"
tqdm.pandas()
pyLDAvis.enable_notebook()

After installing all dependencies and loading the packages, you can run the whole notebook. Parts with long execution times are commented out, and their results are imported separately instead. To reproduce the full analysis, uncomment these parts and comment out the import of the stored results. The necessary steps are described in the code. We do not recommend this unless necessary, as some models have runtimes of more than eight hours.

# 3.2 Scrape Twitter data (Stjepan)

**Needs to be added** 

# 3.3 Retrieve plenary proceedings data (Jakob)

In this section we retrieve the plenary protocol data of the 19th German Bundestag. As the data is publicly available, it can be downloaded from the official [website](https://www.bundestag.de/services/opendata). Currently, the format of the files is not very convenient for automatic analysis, which is why the researchers from [Open Discourse](https://opendiscourse.de) published a preprocessed version of the plenary protocols. We use their data, as setting up our own preprocessing pipeline would be very time-intensive and out of scope for this work. We use their [Docker container](https://github.com/open-discourse/open-discourse) to quickly set up the database. We then query the needed data and export it to a CSV file.

## 3.3.1 Setup local database

To use the database, we set up the Docker container from [Open Discourse](https://open-discourse.github.io/open-discourse-documentation/1.0.0/run-the-database-locally.html#use-the-database). Before this, we have to download and set up Docker according to these [instructions](https://www.docker.com/products/docker-desktop). After this step, we can launch Docker and proceed.

In [None]:
# Define connection details
con_details = {
    "host": "localhost",
    "database": "next",
    "user": "postgres",
    "password": "postgres",
    "port": "5432"}

In [None]:
# Navigate to Docker container
# Uncomment if you want to set up the Docker container
# Note: os.system starts a subshell, so this "cd" does not change the notebook's working directory
# os.system("cd ..")

In [None]:
# Login to Github for Docker access
# Uncomment if you want to set up the Docker container
# os.system("docker login docker.pkg.github.com")

In [None]:
# (Only on the first run) download the Docker container
# Uncomment if you want to set up the Docker container
# os.system("docker pull docker.pkg.github.com/open-discourse/open-discourse/database:latest")

In [None]:
# Start and run the database in the Docker container
# Uncomment if you want to set up the Docker container
# os.system("docker run --env POSTGRES_USER=postgres --env POSTGRES_DB=postgres --env POSTGRES_PASSWORD=postgres -p 5432:5432 -d docker.pkg.github.com/open-discourse/open-discourse/database")

## 3.3.2 Retrieve plenary proceedings from database

After setting up the PostgreSQL database, we can now query the required data.

In [None]:
# Define query
query = """SELECT * from open_discourse.speeches WHERE electoral_term = 19"""

In [None]:
# Create connection
# Uncomment if you want to query the database
# con = psycopg2.connect(**con_details) # If this fails, repeat execution of the cell.
# cur = con.cursor()

In [None]:
# Execute query
# Uncomment if you want to query the database
# cur.execute(query)
# rows = cur.fetchall()

In [None]:
# Transform results in dataframe
# Uncomment if you made a new query to the database
# speeches_retrieved = pd.DataFrame(rows)
# speeches_retrieved.columns = ["id", "session", "electoral_term", "first_name", "last_name", "politician_id", "text",
#                       "fraction_id", "document_url", "position_short", "position_long", "date", 
#                       "search_speech_content"]

In [None]:
# Export results to csv
# Uncomment if you made a new query to the database
# speeches_retrieved.to_csv("../data/raw/speeches_retrieved.csv", index=False)

We save the retrieved data as a CSV file and can use it now for further processing. 
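
The column names in the cell above are hard-coded and have to match the table layout exactly. A minimal sketch of an alternative, assuming only the Python DB-API: the names can be read from `cursor.description`. The sketch uses an in-memory `sqlite3` database with a hypothetical, heavily reduced `speeches` table as a stand-in for the PostgreSQL database, so it runs without Docker. With `psycopg2` the same attribute should be available, since both drivers follow the DB-API, but this was not checked against the Open Discourse database.

```python
import sqlite3

# Stand-in database: a hypothetical, heavily reduced "speeches" table
con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE speeches (id INTEGER, electoral_term INTEGER, speech_content TEXT)")
cur.executemany(
    "INSERT INTO speeches VALUES (?, ?, ?)",
    [(1, 18, "Rede A"), (2, 19, "Rede B"), (3, 19, "Rede C")],
)

# Parameterised query instead of a value written into the SQL string
cur.execute("SELECT * FROM speeches WHERE electoral_term = ? ORDER BY id", (19,))
rows = cur.fetchall()

# Each entry of cursor.description starts with the column name
column_names = [column[0] for column in cur.description]
records = [dict(zip(column_names, row)) for row in rows]
con.close()

print(column_names)
print(records)
```

With the names at hand, `pd.DataFrame(rows, columns=column_names)` would build the data frame in one step.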

# 3.3 Data Exploration (Jakob)

## 3.3.1 Tweets exploration (Stjepan)

#### Import data

In [None]:
# Load tweets data
tweets_scraped = pd.read_csv("../data/raw/tweets_scraped.csv", low_memory=False)

#### Check data

In [None]:
tweets_scraped.head()

In [None]:
tweets_scraped.tail()

In [None]:
tweets_scraped.info()

In [None]:
tweets_scraped.describe()

#### Drop missing data

We can drop all records with missing data, as we cannot use these records for our analysis.

In [None]:
# Drop missing data
tweets_scraped.dropna(inplace = True)

#### Clean names

For better comparability, we harmonize the names in the tweets and speeches data.

In [None]:
# Create twitter username to real name dictionary
usernames_to_fullname = {'rbrinkhaus': 'Ralph Brinkhaus', 'groehe': 'Hermann Gröhe', 
                         'NadineSchoen': 'Nadine Schön', 'n_roettgen': 'Norbert Röttgen',
                         'peteraltmaier': 'Peter Altmaier', 'jensspahn': 'Jens Spahn', 
                         'MatthiasHauer': 'Matthias Hauer', 'c_lindner': 'Christian Lindner',
                         'MarcoBuschmann': 'Marco Buschmann', 'starkwatzinger': 'Bettina Stark-Watzinger',
                         'Lambsdorff': 'Alexander Graf Lambsdorff', 'johannesvogel': 'Johannes Vogel',
                         'KonstantinKuhle': 'Konstantin Kuhle', 'MAStrackZi': 'Marie-Agnes Strack-Zimmermann',
                         'larsklingbeil': 'Lars Klingbeil', 'EskenSaskia': 'Saskia Esken',
                         'hubertus_heil': 'Hubertus Heil', 'HeikoMaas': 'Heiko Maas',
                         'MartinSchulz': 'Martin Schulz', 'KarambaDiaby': 'Karamba Diaby',
                         'Karl_Lauterbach': 'Karl Lauterbach', 'SteffiLemke': 'Steffi Lemke',
                         'cem_oezdemir': 'Cem Özdemir', 'GoeringEckardt': 'Katrin Göring-Eckardt',
                         'KonstantinNotz': 'Konstantin von Notz', '6': 'Konstantin von Notz',
                         'BriHasselmann': 'Britta Haßelmann', 'svenlehmann': 'Sven Lehmann',
                         'ABaerbock': 'Annalena Baerbock', 'ABaerbockArchiv': 'Annalena Baerbock',
                         'SWagenknecht': 'Sahra Wagenknecht', 'b_riexinger': 'Bernd Riexinger',
                         'NiemaMovassat': 'Niema Movassat', 'jankortemdb': 'Jan Korte',
                         'DietmarBartsch': 'Dietmar Bartsch', 'GregorGysi': 'Gregor Gysi',
                         'SevimDagdelen': 'Sevim Dağdelen', 'Alice_Weidel': 'Alice Weidel',
                         'Beatrix_vStorch': 'Beatrix von Storch', 'JoanaCotar': 'Joana Cotar',
                         'StBrandner': 'Stephan Brandner', 'Tino_Chrupalla': 'Tino Chrupalla',
                         'GtzFrmming': 'Götz Frömming', '3': 'Götz Frömming', 'Leif_Erik_Holm': 'Leif-Erik Holm'}

In [None]:
# Add full name
tweets_scraped["full_name"] = tweets_scraped.username.replace(usernames_to_fullname)

#### Check time data

In [None]:
# Add normalized date
tweets_scraped["date"] = pd.to_datetime(tweets_scraped["datetime"], format = "%Y-%m-%d").dt.date

In [None]:
tweets_scraped.date.min()

In [None]:
tweets_scraped.date.max()

In [None]:
# Tweet number per time
tweets_scraped.groupby('date')['tweet_id'].size().plot()

We can now drop all tweets that fall outside the period covered by the speeches dataset.

In [None]:
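# Drop unneeded data (keep 24 October 2017 to 7 May 2021; explicit dates avoid day/month ambiguity)
start_date, end_date = datetime(2017, 10, 24).date(), datetime(2021, 5, 7).date()
tweets_subset = tweets_scraped[(tweets_scraped.date >= start_date) & (tweets_scraped.date <= end_date)].copy()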
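
The cut-off dates of the electoral term are usually written in German day-first notation. A minimal sketch, using only the standard library, of parsing such strings with an explicit format so that day and month cannot be swapped. The helper name `parse_german_date` and the sample dates are illustrative and not part of the notebook.

```python
from datetime import date, datetime

def parse_german_date(text):
    """Parse a day-first date string such as '07.05.2021'."""
    return datetime.strptime(text, "%d.%m.%Y").date()

start_date = parse_german_date("24.10.2017")
end_date = parse_german_date("07.05.2021")

# Hypothetical sample dates standing in for the tweet dates
sample_dates = [date(2017, 10, 23), date(2019, 3, 1), date(2021, 5, 7), date(2021, 7, 5)]
in_term = [d for d in sample_dates if start_date <= d <= end_date]

print(start_date, end_date)
print(in_term)
```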

#### Check party distribution

When checking the distribution of tweets per party, we can see differences, but we do not expect them to significantly alter our results.

In [None]:
# Tweets per party
tweets_subset.groupby("party").size()

#### Check politician distribution

We see significant differences in the number of tweets per politician, ranging from 658 to 29665. We have to consider this in our work.

In [None]:
# Tweets per politician
tweets_scraped.groupby('full_name')['tweet_id'].size().sort_values().plot(kind='bar')

We see a strongly increasing trend in tweets per day. This is caused by two new parties entering the Bundestag in 2017.

#### Check text

We check the texts of the tweets with a word cloud. We can infer the need for data preprocessing from a first analysis of the visualisation. 

In [None]:
# Create a word cloud
long_string_tweets = ' '.join(tweets_scraped["text"].tolist())
wordcloud_tweets = WordCloud(background_color="white", max_words=5000, contour_width=3, contour_color='steelblue')
wordcloud_tweets.generate(long_string_tweets)
wordcloud_tweets.to_image()

In [None]:
# Create a counter object
counter_tweets = Counter(long_string_tweets.split())

In [None]:
# Check the most common words
counter_tweets.most_common(10)

We can identify the need for a stopword removal.
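
A small sketch of why raw counts call for stop word removal: on a toy sentence, function words dominate the ranking until they are filtered out. The sentence and the tiny stop word set are made up for illustration and are not the spaCy list used later.

```python
from collections import Counter

# Toy text and a tiny, hypothetical stop word set
toy_text = "die Regierung und die Opposition und die Presse diskutieren die Rente"
toy_stopwords = {"die", "und"}

tokens = toy_text.lower().split()
raw_counts = Counter(tokens)
filtered_counts = Counter(token for token in tokens if token not in toy_stopwords)

print(raw_counts.most_common(2))      # function words lead the raw ranking
print(filtered_counts.most_common(5)) # content words remain after filtering
```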

#### Drop unneeded columns

In [None]:
# Drop unneeded columns
tweets_subset.drop(['datetime', 'tweet_id', 'username','name', 'reply_count'], axis = 1, inplace = True)

#### Export data

In [None]:
tweets_subset.to_csv("../data/interim/tweets_explored.csv", index = False)

## 3.3.2 Explore speeches of politicians (Jakob)

In a first analysis step we get an overview of the retrieved data and do first simple preprocessing tasks. 

### 3.3.2.1 Import data

In [None]:
# Load speeches data
# Comment out if you retrieve the data from scratch
speeches_retrieved = pd.read_csv("../data/raw/speeches_retrieved.csv", low_memory=False)

### 3.3.2.2 Check data

We use standard steps of data exploration to get an overview of the retrieved data including datatypes and missing values.

In [None]:
speeches_retrieved.head()

In [None]:
speeches_retrieved.tail()

In [None]:
speeches_retrieved.info()

Based on this first overview, we identify several variables that we need to examine more closely to understand the data quality and to prepare the first processing steps.

### 3.3.2.3 Drop missing data

We can drop all records with missing speech content, as we cannot use these records for our analysis.

In [None]:
# Drop missing data
speeches_retrieved.dropna(subset = ["text"], inplace = True)

### 3.3.2.4 Clean names

For better comparability, we harmonize the politicians' names in the tweets and speeches data.

In [None]:
# Add full name of politicians
speeches_retrieved["full_name"] = speeches_retrieved["first_name"] + " " + speeches_retrieved["last_name"]

In [None]:
# Subset to the selected politicians
speeches_subset = speeches_retrieved[speeches_retrieved.full_name.isin(tweets_subset.full_name.unique())]

In [None]:
speeches_subset.groupby('full_name')['id'].size().sort_values()

In [None]:
# Speeches per politician
speeches_subset.groupby('full_name')['id'].size().sort_values().plot(kind='bar')

The number of speeches per politician differs considerably, ranging from 5 to 252. We have to consider this in the interpretation of our results.

### 3.3.2.5 Check time data

To analyse topics over time, we need the dates in a pandas date format. Additionally, we check the time span covered by the retrieved speeches.

In [None]:
# Add normalized date
speeches_subset["date"] = pd.to_datetime(speeches_subset["date"], format = "%Y-%m-%d").dt.date

In [None]:
# Find first day with speeches
speeches_subset.date.min()

In [None]:
# Find last day with speeches
speeches_subset.date.max()

In [None]:
# Speech number per time
speeches_subset.groupby('date')['id'].size().plot()

We see some patterns in the time series; however, there are no significant gaps in the observed time frame.

### 3.3.2.6 Check party distribution

To check the distribution of speeches per party, we assign the party of the speaker to each speech.

In [None]:
fullname_to_party = {'Ralph Brinkhaus': 'CDU', 'Hermann Gröhe': 'CDU', 'Nadine Schön': 'CDU', 
                     'Norbert Röttgen': 'CDU', 'Peter Altmaier': 'CDU', 'Jens Spahn': 'CDU', 
                     'Matthias Hauer': 'CDU', 'Christian Lindner': 'FDP', 'Marco Buschmann': 'FDP',
                     'Bettina Stark-Watzinger': 'FDP', 'Alexander Graf Lambsdorff': 'FDP', 'Johannes Vogel': 'FDP',
                     'Konstantin Kuhle': 'FDP', 'Marie-Agnes Strack-Zimmermann': 'FDP', 'Lars Klingbeil': 'SPD',
                     'Saskia Esken': 'SPD', 'Hubertus Heil': 'SPD', 'Heiko Maas': 'SPD', 'Martin Schulz': 'SPD', 
                     'Karamba Diaby': 'SPD', 'Karl Lauterbach': 'SPD', 'Steffi Lemke': 'Grüne',
                     'Cem Özdemir': 'Grüne', 'Katrin Göring-Eckardt': 'Grüne', 'Konstantin von Notz': 'Grüne',
                     'Britta Haßelmann': 'Grüne', 'Sven Lehmann': 'Grüne', 'Annalena Baerbock': 'Grüne',
                     'Sahra Wagenknecht': 'Linke', 'Bernd Riexinger': 'Linke', 'Niema Movassat': 'Linke', 
                     'Jan Korte': 'Linke', 'Dietmar Bartsch': 'Linke', 'Gregor Gysi': 'Linke', 
                     'Sevim Dağdelen': 'Linke', 'Alice Weidel': 'AFD', 'Beatrix von Storch': 'AFD', 
                     'Joana Cotar': 'AFD', 'Stephan Brandner': 'AFD', 'Tino Chrupalla': 'AFD',
                     'Götz Frömming': 'AFD', 'Leif-Erik Holm': 'AFD'}

In [None]:
speeches_subset["party"] = speeches_subset.full_name.replace(fullname_to_party)

In [None]:
# Speeches per party
speeches_subset.groupby("party").size()

When checking the distribution of speeches per party, we can see differences, but we do not expect them to significantly alter our results.

### 3.3.2.7 Check text

We check the texts of the speeches with a word cloud. We can infer the need for data preprocessing from a first analysis of the visualisation.

In [None]:
# Create a word cloud
long_string_speeches = ' '.join(speeches_subset["text"].tolist())
wordcloud_speeches = WordCloud(background_color="white", max_words=5000, contour_width=3, 
                               contour_color='steelblue')
wordcloud_speeches.generate(long_string_speeches)
wordcloud_speeches.to_image()

In [None]:
# Create a counter object
speeches_counter = Counter(long_string_speeches.split())

In [None]:
# Check the most common words
speeches_counter.most_common(10)

There is a clear need for extensive stopword removal, to reduce noise in the topic and sentiment analysis.

### 3.3.2.8 Drop unneeded columns

In [None]:
# Drop unneeded columns
speeches_subset.drop(['id', 'session', 'electoral_term', 'first_name', 'last_name', 'politician_id',
                      'fraction_id', 'document_url', 'position_short', 'position_long', 'search_speech_content'],
                     axis = 1, inplace = True)

### 3.3.2.9 Export data

In [None]:
speeches_subset.to_csv("../data/interim/speeches_explored.csv", index = False)

This section explored the speeches dataset and checked the data quality of several important variables. The data quality is satisfactory, except for a highly skewed distribution of speeches per politician.

# 3.4 Data preprocessing

## 3.4.1 Prepare spacy pipelines (Jakob)

In the last section we identified the need for extensive preprocessing. We build a flexible spaCy pipeline structure in which preprocessing steps can easily be added or removed. We base our pipelines on the pretrained [spaCy pipeline](https://spacy.io/models/de) for German documents.

In [None]:
@Language.component("Remove non alphabetic words")
def remove_non_alpha(doc):
    return [token for token in doc if token.is_alpha]

We identified the need to remove non-German text, as it reduces the quality of our topic and sentiment models. For this, we use a language detector and an additional component that removes only the sentences written in other languages.

In [None]:
@Language.factory("Detect languages")
def create_language_detector(nlp, name):
    return LanguageDetector(language_detection_function=None)

In [None]:
@Language.component("Keep only German documents")
def remove_non_german(doc):
    res = [sent for sent in doc.sents if sent._.language["language"] == "de"]
    if res:
        return [token for sent in res for token in sent]
    else:
        return Doc(Vocab([]), words=[], spaces=[])

In [None]:
@Language.component("Remove stopwords")
def remove_stopwords(doc): 
    return [token for token in doc if not token.is_stop]

We lemmatize the remaining tokens so that inflected forms of a word are mapped to one base form while its meaning is preserved.

In [None]:
@Language.component("Lemmatize text")
def lemmatize_text(doc):
    return [token.lemma_ for token in doc]

In [None]:
@Language.component("Lowercase Text")
def lowercase(doc):
    return [token.lower() for token in doc]

In [None]:
emoji_codes = re.compile("["
                         u"\U0001F600-\U0001F64F"
                         u"\U0001F300-\U0001F5FF"
                         u"\U0001F680-\U0001F6FF"
                         u"\U0001F1E0-\U0001F1FF"
                         u"\U00002500-\U00002BEF"
                         u"\U00002702-\U000027B0"
                         u"\U00002702-\U000027B0"
                         u"\U000024C2-\U0001F251"
                         u"\U0001f926-\U0001f937"
                         u"\U00010000-\U0010ffff"
                         u"\u2640-\u2642"
                         u"\u2600-\u2B55"
                         u"\u200d"
                         u"\u23cf"
                         u"\u23e9"
                         u"\u231a"
                         u"\ufe0f"
                         u"\u3030"
                         "]+", re.UNICODE)

@Language.component("Remove emojis")
def remove_emojis(doc):
    doc = [token.text for token in doc if not re.match(emoji_codes, token.text)]
    doc = ' '.join(doc)
    return nlp_twitter.make_doc(doc)
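
Note that `re.match` only tests the start of a string, so a token that begins with letters and ends with an emoji would pass the filter above if the tokenizer has not split the emoji off. A minimal sketch of an alternative that works on plain strings and strips emoji characters wherever they occur. The reduced pattern covers only a few common Unicode blocks and is an illustrative assumption, not a complete emoji definition; `strip_emojis` is a hypothetical helper name.

```python
import re

# Reduced, illustrative pattern: emoticons, pictographs, transport symbols, flags
emoji_pattern = re.compile(
    "["
    "\U0001F600-\U0001F64F"
    "\U0001F300-\U0001F5FF"
    "\U0001F680-\U0001F6FF"
    "\U0001F1E0-\U0001F1FF"
    "]+"
)

def strip_emojis(text):
    """Replace emoji characters anywhere in the text and tidy up whitespace."""
    without_emojis = emoji_pattern.sub(" ", text)
    return " ".join(without_emojis.split())

sample = "Guten Morgen \U0001F600 Berlin\U0001F680"
print(strip_emojis(sample))

# match() looks at the start only, search() scans the whole string
starts_with_emoji = emoji_pattern.match("Berlin\U0001F680") is not None
contains_emoji = emoji_pattern.search("Berlin\U0001F680") is not None
print(starts_with_emoji, contains_emoji)
```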

In [None]:
@Language.component("Remove URLs")
def remove_urls(doc):
    doc = [token.text for token in doc if not token.like_url]
    doc = ' '.join(doc)
    return nlp_twitter.make_doc(doc)

In [None]:
@Language.component("Remove mentions")
def remove_mentions(doc):
    doc = [token.text for token in doc if not re.match("@.*", token.text)]
    doc = ' '.join(doc)
    return nlp_twitter.make_doc(doc)

In [None]:
@Language.component("Remove stopwords and punctuation")
def remove_stopwords_and_punctuation(doc):
    # Separate function name so that the earlier remove_stopwords is not overwritten
    doc = [token.text for token in doc if not token.is_stop and not token.is_punct]
    return doc
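
The order of these steps matters, because each one works on the output of the previous one. A plain-Python sketch of such a filter chain on whitespace tokens, without spaCy: the stop word set is a made-up placeholder, the function names are illustrative, and lemmatization is left out because it needs a language model.

```python
# Hypothetical stop word set; the notebook uses spaCy's German list instead
TOY_STOPWORDS = {"die", "der", "und", "ist", "für"}

def keep_alphabetic(tokens):
    """Drop tokens containing digits, punctuation or symbols."""
    return [token for token in tokens if token.isalpha()]

def drop_stopwords(tokens):
    """Drop stop words, compared in lower case."""
    return [token for token in tokens if token.lower() not in TOY_STOPWORDS]

def to_lowercase(tokens):
    """Lowercase every remaining token."""
    return [token.lower() for token in tokens]

def preprocess(text, steps=(keep_alphabetic, drop_stopwords, to_lowercase)):
    """Split on whitespace and apply the steps in the given order."""
    tokens = text.split()
    for step in steps:
        tokens = step(tokens)
    return tokens

result = preprocess("Die Rente ist sicher und 2021 wichtig für Berlin")
print(result)
```

Passing a different `steps` tuple adds or removes a stage, which mirrors how components are added to or left out of the pipelines below.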

## 3.4.2 Topic modeling preprocessing (Jakob)

We do not need all pretrained pipeline elements and therefore exclude them. In the next step, we add the previously defined components that we need.

In [None]:
# Exclude not needed pipeline elements
pipeline_exclude = ['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'ner', 'morphologizer']

### 3.4.2.1 Tweets

In this subsection we define a pipeline for the preprocessing of the Twitter data and execute the pipeline.

In [None]:
# Import data
tweets_explored = pd.read_csv("../data/interim/tweets_explored.csv")

In [None]:
# Create spacy pipeline
nlp_tweets = spacy.load('de_core_news_sm', exclude=pipeline_exclude)
nlp_tweets.Defaults.stop_words |= {"amp", "rt"}

# Add needed pipeline components
nlp_tweets.add_pipe("sentencizer", last=True)
nlp_tweets.add_pipe("Detect languages", name='Detect languages', last=True)
nlp_tweets.add_pipe("Keep only German documents", name='Keep only German documents', last=True)
nlp_tweets.add_pipe("Remove non alphabetic words", name="Remove non alphabetic words", last=True)
nlp_tweets.add_pipe("Remove stopwords", name="Remove stopwords", last=True)
nlp_tweets.add_pipe("Lemmatize text", name="Lemmatize text", last=True)
nlp_tweets.add_pipe("Lowercase Text", name="Lowercase Text", last=True)

In [None]:
# Apply pipeline to text
# Uncomment if you want to update the preprocessing of the data 
# tweets_explored["text_preprocessed"] = tweets_explored.text.progress_apply(nlp_tweets)
# This takes approximately one hour

In [None]:
# Add sentence structure
# Uncomment if you want to update the preprocessing of the data 
# tweets_explored["text_preprocessed_sentence"] = tweets_explored["text_preprocessed"].progress_apply(
#    lambda x: " ".join(x))

In [None]:
# Subset needed data
# Uncomment if you want to update the preprocessing of the data 
# tweets_preprocessed = tweets_explored[["full_name", "date", "party", "text", "text_preprocessed",
#                                       "text_preprocessed_sentence", 'retweet_count', 'like_count']]

In [None]:
# Drop empty texts
# Uncomment if you want to update the preprocessing of the data
# tweets_preprocessed.replace('', np.nan, inplace=True)
# tweets_preprocessed.dropna(inplace=True)
# tweets_preprocessed.reset_index(drop = True, inplace = True)

In [None]:
# Save data as pickle file
# Uncomment if you want to update the preprocessing of the data
# pickle.dump(tweets_preprocessed, open("../data/processed/tweets_processed.p", "wb"))

We now can use the resulting file to train a topic model for the tweets dataset.

### 3.4.2.2 Speeches

In this subsection we define a pipeline for the preprocessing of the speeches data and execute the pipeline.

In [None]:
# Import data
speeches_explored = pd.read_csv("../data/interim/speeches_explored.csv")

In [None]:
# Create spacy pipeline
nlp_speeches = spacy.load('de_core_news_sm', exclude=pipeline_exclude)

# Add needed pipeline components
nlp_speeches.add_pipe('sentencizer', last=True)
nlp_speeches.add_pipe("Detect languages", name='Detect languages', last=True)
nlp_speeches.add_pipe("Keep only German documents", name='Keep only German documents', last=True)
nlp_speeches.add_pipe("Remove non alphabetic words", name="Remove non alphabetic words", last=True)
nlp_speeches.add_pipe("Remove stopwords", name="Remove stopwords", last=True)
nlp_speeches.add_pipe("Lemmatize text", name="Lemmatize text", last=True)
nlp_speeches.add_pipe("Lowercase Text", name="Lowercase Text", last=True)

In [None]:
# Apply pipeline to text
# Uncomment if you want to update the preprocessing of the data
# speeches_explored["text_preprocessed"] = speeches_explored.text.progress_apply(nlp_speeches)

In [None]:
# Add sentence structure
# Uncomment if you want to update the preprocessing of the data
# speeches_explored["text_preprocessed_sentence"] = speeches_explored["text_preprocessed"].progress_apply(
#    lambda x: " ".join(x))

In [None]:
# Subset needed data
# Uncomment if you want to update the preprocessing of the data
# speeches_preprocessed = speeches_explored[["full_name", "date", "party", "text",
#                                           "text_preprocessed", "text_preprocessed_sentence"]]

We identified the need to additionally remove frequent words for topic modelling. Many words stem from greeting phrases (Sehr, geehrte, Frauen, Herren) that have no semantic relevance for our analyses but, due to their frequency, impair the model quality.

In [None]:
# Define function for removing frequent words
def remove_frequent_words(words_list, most_frequent_words):
    return [word for word in words_list if word not in most_frequent_words]

In [None]:
# Additional preprocessing for BERTopic model
# Uncomment if you want to update the preprocessing of the data
# long_string_speeches = ' '.join(speeches_preprocessed.text_preprocessed_sentence.tolist())
# counter_speeches = Counter(long_string_speeches.split())
# most_frequent_words = []
# for item in counter_speeches.most_common(200):
#    most_frequent_words.append(item[0])

In [None]:
# Add columns with preprocessed text and removed frequent words
# Uncomment if you want to update the preprocessing of the data
# speeches_preprocessed["text_preprocessed_infrequent"] = speeches_preprocessed.text_preprocessed.progress_apply(remove_frequent_words,most_frequent_words = most_frequent_words)
# speeches_preprocessed["text_preprocessed_infrequent_sentence"] = speeches_preprocessed["text_preprocessed_infrequent"].progress_apply(lambda x: " ".join(x))
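
How the frequent-word filter behaves can be tried on a toy corpus. The sketch below is self-contained, uses made-up documents and a cut-off of two words instead of 200, and stores the frequent words in a set so that each membership test is fast.

```python
from collections import Counter

def remove_frequent_words(words_list, most_frequent_words):
    return [word for word in words_list if word not in most_frequent_words]

# Made-up, already preprocessed documents
toy_docs = [
    ["geehrte", "herren", "rente", "steigen"],
    ["geehrte", "herren", "klima", "schützen"],
    ["geehrte", "herren", "geehrte", "steuer"],
]

# Count words over the whole corpus and keep the two most frequent ones
counter = Counter(word for doc in toy_docs for word in doc)
most_frequent = {word for word, _ in counter.most_common(2)}
cleaned = [remove_frequent_words(doc, most_frequent) for doc in toy_docs]

print(most_frequent)
print(cleaned)
```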

In [None]:
# Drop empty texts
# Uncomment if you want to update the preprocessing of the data
# speeches_preprocessed.replace('', np.nan, inplace=True)
# speeches_preprocessed.dropna(inplace=True)
# speeches_preprocessed.reset_index(drop = True, inplace = True)

In [None]:
# Save data as pickle file
# Uncomment if you want to update the preprocessing of the data
# pickle.dump(speeches_preprocessed, open("../data/processed/speeches_processed.p", "wb"))

We now can use the resulting file to train a topic model for the speeches dataset.

## 3.4.3 Sentiment analysis preprocessing (Stjepan)

### 3.4.3.1 Tweets

### 3.4.3.2 Speeches

# 3.5 Topic Modeling (Jakob)

To better understand the differences between politicians' communication on Twitter and in the Bundestag, we perform topic modelling. We test three different approaches before choosing the best-performing one as our final model. We apply hyperparameter tuning where applicable but omit a classic train-test split validation. We analyse the validity of the topic model in the Results section.

## 3.5.1 Latent Dirichlet Allocation

Latent Dirichlet Allocation (LDA) is a state-of-the-art approach [citation](https://doi.org/10.1186/s40537-019-0255-7) for topic modelling. LDA is an unsupervised machine learning technique that uses a generative statistical model to extract topics from a collection of documents [citation](https://dl.acm.org/doi/10.5555/944919.944937). The model represents each topic as a probability distribution over the vocabulary and each document as a mixture of topics, which can be used for topic detection. We base our choice of the optimal hyperparameter combination on the coherence of the resulting topic model. This decision follows the discussions [here](http://topicmodels.info/ckling/tmt/part4.pdf) and [here](https://dl.acm.org/doi/abs/10.1145/2684822.2685324).

### 3.5.1.1 Define hyperparameters for optimization

We optimize the hyperparameters of the LDA model with a grid search over the number of topics (k), the a priori belief about the document-topic distribution (alpha) and the a priori belief about the topic-word distribution (eta, called beta in the code) [citation](https://radimrehurek.com/gensim/models/ldamodel.html). This hyperparameter optimization is loosely based on this [article](https://towardsdatascience.com/evaluate-topic-model-in-python-latent-dirichlet-allocation-lda-7d57484bb5d0).

In [None]:
# Topics range
min_topics = 10
max_topics = 150
step_size = 10
topics_range = range(min_topics, max_topics, step_size)

In [None]:
# Alpha parameter
alpha = list(np.arange(0.01, 1, 0.3))
alpha.append('symmetric')
alpha.append('asymmetric')

In [None]:
# Beta parameter
beta = list(np.arange(0.01, 1, 0.3))
beta.append('symmetric')

In [None]:
# Function for calculating the coherence value of a specific hyperparameter combination
def compute_lda_coherence_values(corpus, text, id2word, k, a, b):
    lda_model = LdaMulticore(corpus=corpus,
                             id2word=id2word,
                             num_topics=k, 
                             random_state=42,
                             alpha=a,
                             eta=b)
    coherence_model_lda = CoherenceModel(model=lda_model, texts=text, dictionary=id2word, coherence='c_v')
    return coherence_model_lda.get_coherence()

In [None]:
# Function for executing a hyperparameter optimization
def hyperparameter_lda(data_preprocessed, title, topics_range, alpha, beta):
    id2word = corpora.Dictionary(data_preprocessed.text_preprocessed.to_list())
    # These filter thresholds could also be tuned in an extended search
    id2word.filter_extremes(no_below=10, no_above=0.1)
    texts = data_preprocessed.text_preprocessed.to_list()
    corpus = [id2word.doc2bow(text) for text in texts]
    model_results = {'Topics': [],
                     'Alpha': [],
                     'Beta': [],
                     'Coherence': []
                    }
    grid = {}
    grid['Validation_Set'] = {}
    for k in tqdm(topics_range):
        print("Number of topics:" + str(k))
        for a in tqdm(alpha):
            print("Alpha value:" + str(a))
            for b in tqdm(beta):
                print("Beta value:" + str(b))
                cv = compute_lda_coherence_values(corpus=corpus,text = texts,
                                              id2word=id2word, k=k, a=a, b=b)
                model_results['Topics'].append(k)
                model_results['Alpha'].append(a)
                model_results['Beta'].append(b)
                model_results['Coherence'].append(cv)
    results_df = pd.DataFrame(model_results)
    results_df.to_csv('../data/processed/lda_tuning_results_' + title + '.csv', index=False)
    return results_df

### 3.5.1.1 Hyperparameter optimization LDA for tweets

In [None]:
# Load data
tweets_processed_lda = pickle.load(open("../data/processed/tweets_processed.p", "rb"))

In [None]:
# Hyperparameter optimization
# Uncomment if you want to repeat the hyperparameter optimization
# hyperparameter_lda_tweets = hyperparameter_lda(tweets_processed_lda, "tweets", topics_range, alpha, beta)
# Takes approximately one hour of runtime

In [None]:
# Save hyperparameter
# Uncomment if you want to repeat the hyperparameter optimization
# hyperparameter_lda_tweets.to_csv('../data/processed/lda_tuning_results_tweets.csv', index = False)

### 3.5.1.2 Calculate best model LDA for tweets

Based on the hyperparameter optimisation from the last subsection, we compute the best LDA model for the tweets dataset.

In [None]:
# Load data
lda_tuning_results_tweets = pd.read_csv('../data/processed/lda_tuning_results_tweets.csv')

In [None]:
# Prepare corpus
id2word_tweets_lda = corpora.Dictionary(tweets_processed_lda.text_preprocessed.to_list())
id2word_tweets_lda.filter_extremes(no_below=5, no_above=0.1)
texts_tweets_lda = tweets_processed_lda.text_preprocessed.to_list()
corpus_tweets_lda = [id2word_tweets_lda.doc2bow(text) for text in texts_tweets_lda]

In [None]:
# Retrieve optimal hyperparameter
k_optimal_lda_tweets = int(lda_tuning_results_tweets.sort_values("Coherence", ascending = False).reset_index(drop = True).Topics[0])
try:
    a_optimal_lda_tweets = float(lda_tuning_results_tweets.sort_values("Coherence", ascending = False).reset_index(drop = True).Alpha[0])
except ValueError:
    a_optimal_lda_tweets = lda_tuning_results_tweets.sort_values("Coherence", ascending = False).reset_index(drop = True).Alpha[0]
try:
    b_optimal_lda_tweets = float(lda_tuning_results_tweets.sort_values("Coherence", ascending = False).reset_index(drop = True).Beta[0])
except ValueError:
    b_optimal_lda_tweets = lda_tuning_results_tweets.sort_values("Coherence", ascending = False).reset_index(drop = True).Beta[0]
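
The retrieval above has to handle priors that are stored either as numbers or as keywords such as `symmetric`. As a hedged alternative, the same logic can be written with a small helper. The sketch below uses only the standard library; `parse_prior`, `best_hyperparameters` and the example values are hypothetical and not part of our results.

```python
import csv
import io


def parse_prior(value):
    """Return a numeric prior as float and keep keywords such as 'symmetric' as strings."""
    try:
        return float(value)
    except ValueError:
        return value


def best_hyperparameters(csv_text):
    """Pick the row with the highest coherence from a tuning result table."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    best = max(rows, key=lambda row: float(row["Coherence"]))
    return int(best["Topics"]), parse_prior(best["Alpha"]), parse_prior(best["Beta"])


# Invented example table with the same columns as the tuning results
example_csv = """Topics,Alpha,Beta,Coherence
10,0.01,symmetric,0.41
20,asymmetric,0.31,0.47
30,0.61,0.61,0.44
"""
best_k, best_alpha, best_beta = best_hyperparameters(example_csv)
print(best_k, best_alpha, best_beta)
```

The helper selects the best row once and converts each value exactly once, which avoids repeating the sort for every hyperparameter.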

In [None]:
# Train model
lda_model_tweets = LdaMulticore(corpus=corpus_tweets_lda,
                                 id2word=id2word_tweets_lda,
                                 num_topics=k_optimal_lda_tweets,
                                 random_state=42,
                                 alpha=a_optimal_lda_tweets,
                                 eta=b_optimal_lda_tweets)

In [None]:
# Calculate final coherence value
coherence_model_lda_tweets = CoherenceModel(model=lda_model_tweets, texts=texts_tweets_lda, dictionary=id2word_tweets_lda, coherence='c_v')
coherence_lda_tweets = coherence_model_lda_tweets.get_coherence()
print("The final model coherence of the LDA for Tweets is: " + str(round(coherence_lda_tweets,2)))

In [None]:
# Visually inspect result
lda_vis_tweets = pyLDAvis.gensim_models.prepare(lda_model_tweets, corpus_tweets_lda, id2word_tweets_lda)
lda_vis_tweets

We use the coherence value and the topic visualisation to evaluate the model. The model has a good coherence score, but the visual inspection shows topics that are not easily interpretable. Based on this we cannot infer a high model quality.

### 3.5.1.3 Hyperparameter optimization LDA for speeches

In [None]:
# Load data
speeches_processed_lda = pickle.load(open( "../data/processed/speeches_processed.p", "rb" ))

In [None]:
# Hyperparameter optimization
# Uncomment if you want to repeat the hyperparameter optimization
# hyperparameter_lda_speeches = hyperparameter_lda(speeches_processed_lda, "speeches", topics_range, alpha, beta)

In [None]:
# Save hyperparameter
# Uncomment if you want to repeat the hyperparameter optimization
# hyperparameter_lda_speeches.to_csv('../data/processed/lda_tuning_results_speeches.csv', index = False)

### 3.5.1.4 Calculate best model LDA for speeches

Based on the hyperparameter optimisation from the last subsection, we compute the best LDA model for the speeches dataset.

In [None]:
# Load data
lda_tuning_results_speeches = pd.read_csv('../data/processed/lda_tuning_results_speeches.csv')

In [None]:
# Prepare corpus
id2word_speeches_lda = corpora.Dictionary(speeches_processed_lda.text_preprocessed.to_list())
id2word_speeches_lda.filter_extremes(no_below=5, no_above=0.1)
texts_speeches_lda = speeches_processed_lda.text_preprocessed.to_list()
corpus_speeches_lda = [id2word_speeches_lda.doc2bow(text) for text in texts_speeches_lda]

In [None]:
# Retrieve optimal hyperparameter
k_optimal_lda_speeches = int(lda_tuning_results_speeches.sort_values("Coherence", ascending = False).reset_index(drop = True).Topics[0])
try:
    a_optimal_lda_speeches = float(lda_tuning_results_speeches.sort_values("Coherence", ascending = False).reset_index(drop = True).Alpha[0])
except ValueError:
    a_optimal_lda_speeches = lda_tuning_results_speeches.sort_values("Coherence", ascending = False).reset_index(drop = True).Alpha[0]
try:
    b_optimal_lda_speeches = float(lda_tuning_results_speeches.sort_values("Coherence", ascending = False).reset_index(drop = True).Beta[0])
except ValueError:
    b_optimal_lda_speeches = lda_tuning_results_speeches.sort_values("Coherence", ascending = False).reset_index(drop = True).Beta[0]

In [None]:
# Train model
lda_model_speeches = LdaMulticore(corpus=corpus_speeches_lda,
                                 id2word=id2word_speeches_lda,
                                 num_topics=k_optimal_lda_speeches,
                                 random_state=42,
                                 alpha=a_optimal_lda_speeches,
                                 eta=b_optimal_lda_speeches)

In [None]:
# Calculate final coherence value
coherence_model_lda_speeches = CoherenceModel(model=lda_model_speeches, texts=texts_speeches_lda, dictionary=id2word_speeches_lda,
                                                    coherence='c_v')
coherence_lda_speeches = coherence_model_lda_speeches.get_coherence()
print("The final model coherence of the LDA for Speeches is: " + str(round(coherence_lda_speeches,2)))

In [None]:
# Visually inspect result
lda_vis_speeches = pyLDAvis.gensim_models.prepare(lda_model_speeches, corpus_speeches_lda, id2word_speeches_lda)
lda_vis_speeches

Neither the coherence score nor the visual inspection indicates a high model quality.

## 3.5.2 Non Negative Matrix Factorization

Another approach for topic modeling that we test is Non-Negative Matrix Factorization (NNMF). This technique is another unsupervised machine learning method that factorizes a non-negative matrix into two smaller non-negative matrices, which together give a less complex representation of the original matrix [citation](https://doi.org/10.1109/TKDE.2012.51). In our case we factorize the document-term matrix into a document-topic matrix and a topic-term matrix, which helps to identify the topics of the considered documents.
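
The factorization idea can be sketched in a few lines. The example below approximates a small document-term matrix V by the product of a document-topic matrix W and a topic-term matrix H, using the classic multiplicative update rules. It is a pure-Python illustration under simplifying assumptions, not the online algorithm implemented in gensim's `Nmf`; all names and the toy matrix are invented for this sketch.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def frobenius_error(v, w, h):
    """Frobenius norm of V - WH."""
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2
               for i in range(len(v)) for j in range(len(v[0]))) ** 0.5


def nmf(v, k, n_iter=200, seed=42, eps=1e-9):
    """Factorize V (docs x terms) into W (docs x k) and H (k x terms)."""
    rng = random.Random(seed)
    n_docs, n_terms = len(v), len(v[0])
    w = [[rng.random() + eps for _ in range(k)] for _ in range(n_docs)]
    h = [[rng.random() + eps for _ in range(n_terms)] for _ in range(k)]
    for _ in range(n_iter):
        # H <- H * (W^T V) / (W^T W H)
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[a][j] * num[a][j] / (den[a][j] + eps) for j in range(n_terms)]
             for a in range(k)]
        # W <- W * (V H^T) / (W H H^T)
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][a] * num[i][a] / (den[i][a] + eps) for a in range(k)]
             for i in range(n_docs)]
    return w, h


# Invented document-term counts: two documents per underlying topic
V = [[2, 1, 0, 0], [4, 2, 0, 0], [0, 0, 1, 3], [0, 0, 2, 6]]
error_start = frobenius_error(V, *nmf(V, k=2, n_iter=0))
W, H = nmf(V, k=2)
error_end = frobenius_error(V, W, H)
print(error_start, error_end)
```

Each row of H can be read as a topic (a weighting over terms) and each row of W as the topic mixture of a document. The non-negativity constraint is what makes these weights interpretable.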

#### Define hyperparameters for optimization.

We optimize the NNMF model with a grid search over a single hyperparameter, the number of topics (k).

In [None]:
# Function for calculating the coherence value of a specific hyperparameter combination
def compute_nnmf_coherence_values(corpus, text, id2word, k):
    nmf_model = Nmf(
        corpus=corpus,
        id2word=id2word,
        num_topics=k,
        random_state=42
    )
    coherence_model_nnmf = CoherenceModel(model=nmf_model, texts=text, dictionary=id2word, coherence='c_v')
    return coherence_model_nnmf.get_coherence()

In [None]:
# Function for executing a hyperparameter optimization
def hyperparameter_nnmf(data_preprocessed, title, topics_range):
    id2word = corpora.Dictionary(data_preprocessed.text_preprocessed.to_list())
    id2word.filter_extremes(no_below=10, no_above=0.1)
    texts = data_preprocessed.text_preprocessed.to_list()
    corpus = [id2word.doc2bow(text) for text in texts]
    model_results = {'Topics': [],
                     'Coherence': []
                    }
    grid = {}
    grid['Validation_Set'] = {}
    for k in tqdm(topics_range):
        print("Number of topics:" + str(k))
        cv = compute_nnmf_coherence_values(corpus=corpus,text = texts,
                                      id2word=id2word, k=k)
        model_results['Topics'].append(k)
        model_results['Coherence'].append(cv)
    results_df = pd.DataFrame(model_results)
    results_df.to_csv('../data/processed/nnmf_tuning_results_' + title + '.csv', index=False)
    return results_df

### 3.5.2.1 Hyperparameter optimization NNMF for tweets

In [None]:
# Load data
tweets_processed_nnmf = pickle.load(open("../data/processed/tweets_processed.p", "rb" ))

In [None]:
# Hyperparameter optimization
# Uncomment if you want to repeat the hyperparameter optimization
# hyperparameter_nnmf_tweets = hyperparameter_nnmf(tweets_processed_nnmf, "tweets", topics_range)

In [None]:
# Save hyperparameter
# Uncomment if you want to repeat the hyperparameter optimization
# hyperparameter_nnmf_tweets.to_csv('../data/processed/nnmf_tuning_results_tweets.csv', index = False)

### 3.5.2.2 Calculate best model NNMF for tweets

Based on the hyperparameter optimisation from the last subsection, we compute the best NNMF model for the tweets dataset.

In [None]:
# Load data
tweets_processed_nnmf = pickle.load(open( "../data/processed/tweets_processed.p", "rb" ))
nnmf_tuning_results_tweets = pd.read_csv('../data/processed/nnmf_tuning_results_tweets.csv')

In [None]:
# Prepare corpus
id2word_tweets_nnmf = corpora.Dictionary(tweets_processed_nnmf.text_preprocessed.to_list())
id2word_tweets_nnmf.filter_extremes(no_below=5, no_above=0.1)
texts_tweets_nnmf = tweets_processed_nnmf.text_preprocessed.to_list()
corpus_tweets_nnmf = [id2word_tweets_nnmf.doc2bow(text) for text in texts_tweets_nnmf]

In [None]:
# Retrieve optimal hyperparameter
k_optimal_nnmf_tweets = int(nnmf_tuning_results_tweets.sort_values("Coherence", ascending = False).reset_index(drop = True).Topics[0])

In [None]:
# Train model
nnmf_model_tweets = Nmf(corpus=corpus_tweets_nnmf,
                                 id2word=id2word_tweets_nnmf,
                                 num_topics=k_optimal_nnmf_tweets,
                                 random_state=42)

In [None]:
# Calculate final coherence value
coherence_model_nnmf_tweets = CoherenceModel(model=nnmf_model_tweets, texts=texts_tweets_nnmf, dictionary=id2word_tweets_nnmf,
                                                    coherence='c_v')
coherence_nnmf_tweets = coherence_model_nnmf_tweets.get_coherence()
print("The final model coherence of the NNMF for tweets is: " + str(round(coherence_nnmf_tweets,2)))

In [None]:
# Visually inspect result
nnmf_model_tweets.show_topics()

When analysing the top words of each topic, we cannot identify comprehensible subjects. Combined with the low coherence score, we conclude a low model quality.

### 3.5.2.3 Hyperparameter optimization NNMF for speeches

In [None]:
# Load data
speeches_processed_nnmf = pickle.load(open( "../data/processed/speeches_processed.p", "rb" ))

In [None]:
# Hyperparameter optimization
# Uncomment if you want to repeat the hyperparameter optimization
# hyperparameter_nnmf_speeches = hyperparameter_nnmf(speeches_processed_nnmf, "speeches", topics_range)

In [None]:
# Save hyperparameter
# Uncomment if you want to repeat the hyperparameter optimization
# hyperparameter_nnmf_speeches.to_csv('../data/processed/nnmf_tuning_results_speeches.csv', index = False)

### 3.5.2.4 Calculate best model NNMF for speeches

Based on the hyperparameter optimisation from the last subsection, we compute the best NNMF model for the speeches dataset.

In [None]:
# Load data
speeches_processed_nnmf = pickle.load(open( "../data/processed/speeches_processed.p", "rb" ))
nnmf_tuning_results_speeches = pd.read_csv('../data/processed/nnmf_tuning_results_speeches.csv')

In [None]:
# Prepare corpus
id2word_speeches_nnmf = corpora.Dictionary(speeches_processed_nnmf.text_preprocessed.to_list())
id2word_speeches_nnmf.filter_extremes(no_below=5, no_above=0.1)
texts_speeches_nnmf = speeches_processed_nnmf.text_preprocessed.to_list()
corpus_speeches_nnmf = [id2word_speeches_nnmf.doc2bow(text) for text in texts_speeches_nnmf]

In [None]:
# Retrieve optimal hyperparameter
k_optimal_nnmf_speeches = int(nnmf_tuning_results_speeches.sort_values("Coherence", ascending = False).reset_index(drop = True).Topics[0])

In [None]:
# Train model
nnmf_model_speeches = Nmf(corpus=corpus_speeches_nnmf,
                                 id2word=id2word_speeches_nnmf,
                                 num_topics=k_optimal_nnmf_speeches,
                                 random_state=42)

In [None]:
# Calculate final coherence value
coherence_model_nnmf_speeches = CoherenceModel(model=nnmf_model_speeches, texts=texts_speeches_nnmf, dictionary=id2word_speeches_nnmf,
                                                    coherence='c_v')
coherence_nnmf_speeches = coherence_model_nnmf_speeches.get_coherence()
print("The final model coherence of the NNMF for Speeches is: " + str(round(coherence_nnmf_speeches,2)))

In [None]:
# Visually inspect result
nnmf_model_speeches.show_topics()

The coherence of the model is rather low, and the resulting topics show no consistent theme, resulting in a low model usability.

## 3.5.3 BERTopic

The last model we apply is [BERTopic](https://doi.org/10.5281/zenodo.4381785), which uses BERT-based transformer models to create topic models. BERTopic embeds the documents with a pretrained transformer model, reduces the dimensionality of the embeddings with UMAP, clusters them with HDBSCAN, and describes each cluster with a class-based TF-IDF (c-TF-IDF) representation that can be refined with Maximal Marginal Relevance. This architecture is fairly new and there is little existing research on it; however, first results seem promising. Based on this architecture, the model can identify relevant topics in the text and cluster them by semantic similarity. Because the architecture is quite complex, training BERTopic takes a long time. We do not perform hyperparameter optimization for the BERTopic models, as we have only limited computational power.
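
To illustrate how a cluster of documents is turned into a list of descriptive words, the following sketch computes a simplified class-based TF-IDF score. It follows the formula described for BERTopic (term frequency within a cluster, weighted by the logarithm of one plus the average number of words per cluster divided by the overall frequency of the term), but it is not the library's implementation. The function `c_tf_idf` and the toy clusters are assumptions for illustration.

```python
import math
from collections import Counter


def c_tf_idf(class_tokens):
    """Score terms per cluster: normalised term frequency times log(1 + A / f_t)."""
    class_counts = {c: Counter(tokens) for c, tokens in class_tokens.items()}
    total_counts = Counter()
    for counts in class_counts.values():
        total_counts.update(counts)
    # A: average number of words per cluster
    avg_words_per_class = sum(total_counts.values()) / len(class_counts)
    scores = {}
    for c, counts in class_counts.items():
        n_words = sum(counts.values())
        scores[c] = {
            term: (freq / n_words) * math.log(1 + avg_words_per_class / total_counts[term])
            for term, freq in counts.items()
        }
    return scores


# Invented toy clusters: all tokens of the documents assigned to each cluster
clusters = {
    0: ["migration", "asyl", "grenze", "migration", "heute"],
    1: ["klima", "energie", "klima", "heute", "heute"],
}
scores = c_tf_idf(clusters)
top_terms = {c: max(s, key=s.get) for c, s in scores.items()}
print(top_terms)
```

A word such as "heute" that is frequent in several clusters is weighted down, whereas words that are concentrated in one cluster rise to the top of its description.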

In [None]:
def calculate_coherence_bert(topic_model, docs, topics):
    cleaned_docs = topic_model._preprocess_text(docs)

    # Extract vectorizer and tokenizer from BERTopic
    vectorizer = topic_model.vectorizer_model
    tokenizer = vectorizer.build_tokenizer()

    # Tokenize the documents for the topic coherence evaluation
    tokens = [tokenizer(doc) for doc in cleaned_docs]
    dictionary = corpora.Dictionary(tokens)
    corpus = [dictionary.doc2bow(token) for token in tokens]
    # Top words per topic; the -1 skips the outlier topic (id -1), which is assumed to be present
    topic_words = [[word for word, _ in topic_model.get_topic(topic)]
                   for topic in range(len(set(topics))-1)]

    # Evaluate
    coherence_model = CoherenceModel(topics=topic_words, 
                                     texts=tokens, 
                                     corpus=corpus,
                                     dictionary=dictionary, 
                                     coherence='c_v')
    coherence = coherence_model.get_coherence()
    return coherence

In [None]:
def assign_topic(topic_id, topic_model):
    return topic_model.get_topic_info(topic_id).Name.values[0]

### 3.5.3.1 Compute BERTopic model Tweets

In [None]:
# Load data
tweets_processed_bert = pickle.load(open( "../data/processed/tweets_processed.p", "rb" ))
docs_tweets_bert = tweets_processed_bert.text_preprocessed_sentence.tolist()

In [None]:
# Prepare topic model 
topic_model_tweets = BERTopic(language="german", nr_topics="auto", calculate_probabilities = True, verbose = True)

In [None]:
# Compute Bertopic model
# Uncomment if you want to retrain the network
# start_time_bert_tweets = datetime.now()
# topics_tweets_bert, probs_tweets_bert = topic_model_tweets.fit_transform(docs_tweets_bert)
# end_time_bert_tweets = datetime.now()
# print('Duration: {}'.format(end_time_bert_tweets - start_time_bert_tweets))
# Takes approximately eight hours of runtime

In [None]:
# Calculate coherence
# Uncomment if you want to retrain the network
# coherence_bert_tweets = calculate_coherence_bert(topic_model_tweets,docs_tweets_bert, topics_tweets_bert)
# coherence_bert_tweets

In [None]:
# Visualise results
# Uncomment if you want to retrain the network
# topic_model_tweets.visualize_topics()

Based on first analyses we saw that there are too many topics, so we reduce the number of topics with the inherent reduction logic.

In [None]:
# Reduce topics
# Uncomment if you want to retrain the network
# topics_tweets_bert_reduced, probs_tweets_bert_reduced = topic_model_tweets.reduce_topics(docs_tweets_bert,
#                                                                                         topics_tweets_bert,
#                                                                                         probs_tweets_bert,
#                                                                                         nr_topics=25)

In [None]:
# Load model
# Comment out if you retrain the model
with open('../data/processed/topics_tweets_bert.pickle', 'rb') as handle:
    topics_tweets_bert_reduced = pickle.load(handle)
topic_model_tweets = BERTopic.load("../models/bertopic_tweets")

In [None]:
# Calculate coherence reduced
coherence_bert_tweets_reduced = calculate_coherence_bert(topic_model_tweets, docs_tweets_bert, 
                                                         topics_tweets_bert_reduced)
print("The final model coherence of the BERTopic for Tweets is: " + str(round(coherence_bert_tweets_reduced,2)))

In [None]:
# Visualise results
topic_model_tweets.visualize_topics()

The coherence of the model is at a satisfactory level and the identified topics are interpretable for a human observer. We can infer a high model quality.

In [None]:
# Assign results to dataframe
tweets_processed_bert["topic_id"] = topics_tweets_bert_reduced
tweets_processed_bert["topic"] = tweets_processed_bert.topic_id.progress_apply(assign_topic, topic_model=topic_model_tweets)

In [None]:
# Save model and results
# Uncomment if you want to retrain the network
# topic_model_tweets.save("../models/bertopic_tweets")
# with open( "../data/processed/tweets_processed_bert.pickle", "wb" ) as handle:
#    pickle.dump(tweets_processed_bert, handle, protocol=pickle.HIGHEST_PROTOCOL)
# with open('../data/processed/probabilities_tweets_bert.pickle', 'wb') as handle:
#     pickle.dump(probs_tweets_bert_reduced, handle, protocol=pickle.HIGHEST_PROTOCOL)
# with open('../data/processed/topics_tweets_bert.pickle', 'wb') as handle:
#    pickle.dump(topics_tweets_bert_reduced, handle, protocol=pickle.HIGHEST_PROTOCOL)

### 3.5.3.2 Compute BERTopic model Speeches

In [None]:
# Load data
speeches_processed_bert = pickle.load(open( "../data/processed/speeches_processed.p", "rb" ))
docs_speeches_bert = speeches_processed_bert.text_preprocessed_infrequent_sentence.tolist()

In [None]:
# Prepare topic model 
topic_model_speeches = BERTopic(language="german", nr_topics="auto", calculate_probabilities = True, 
                                verbose = True)

In [None]:
# Compute BERTopic model
# Uncomment if you want to retrain the network
# start_time_bert_speeches = datetime.now()
# topics_speeches_bert, probs_speeches_bert = topic_model_speeches.fit_transform(docs_speeches_bert)
# end_time_bert_speeches = datetime.now()
# print('Duration: {}'.format(end_time_bert_speeches - start_time_bert_speeches))

In [None]:
# Load model
# Comment out if you retrain the model
with open('../data/processed/topics_speeches_bert.pickle', 'rb') as handle:
    topics_speeches_bert = pickle.load(handle)
topic_model_speeches = BERTopic.load("../models/bertopic_speeches")

In [None]:
# Calculate coherence
coherence_bert_speeches = calculate_coherence_bert(topic_model_speeches, docs_speeches_bert, 
                                                         topics_speeches_bert)
print("The final model coherence of the BERTopic for speeches is: " + str(round(coherence_bert_speeches,2)))

In [None]:
# Visualise results
topic_model_speeches.visualize_topics()

The model has a comparatively high coherence and interpretable topics. Therefore we conclude a high model quality.

In [None]:
# Assign results to dataframe
# Uncomment if you want to retrain the network
# speeches_processed_bert["topic_id"] = topics_speeches_bert
# speeches_processed_bert["topic"] = speeches_processed_bert.topic_id.progress_apply(assign_topic, 
#                                                                                   topic_model = topic_model_speeches)

In [None]:
# Save model and results
# Uncomment if you want to retrain the network
# topic_model_speeches.save("../models/bertopic_speeches")
# with open( "../data/processed/speeches_processed_bert.pickle", "wb" ) as handle:
#    pickle.dump(speeches_processed_bert, handle, protocol=pickle.HIGHEST_PROTOCOL)
# with open('../data/processed/probabilities_speeches_bert.pickle', 'wb') as handle:
#    pickle.dump(probs_speeches_bert, handle, protocol=pickle.HIGHEST_PROTOCOL)
# with open('../data/processed/topics_speeches_bert.pickle', 'wb') as handle:
#    pickle.dump(topics_speeches_bert, handle, protocol=pickle.HIGHEST_PROTOCOL)

## 3.5.4 Model selection

For the final model selection we evaluate the models based on their coherence and the visual inspection of their results.

In [None]:
print("The final model coherence of the LDA for tweets is: " + str(round(coherence_lda_tweets,2)))
print("The final model coherence of the NNMF for tweets is: " + str(round(coherence_nnmf_tweets,2)))
print("The final model coherence of the BERTopic for tweets is: " + str(round(coherence_bert_tweets_reduced,2)))
print("The final model coherence of the LDA for speeches is: " + str(round(coherence_lda_speeches,2)))
print("The final model coherence of the NNMF for speeches is: " + str(round(coherence_nnmf_speeches,2)))
print("The final model coherence of the BERTopic for speeches is: " + str(round(coherence_bert_speeches,2)))

The NNMF models did not perform very well in terms of coherence, while the LDA model only showed good coherence values for the tweets dataset. BERTopic performed well for both datasets in terms of coherence. Based on the visual inspection we saw very good results for BERTopic and medium results for the other two model types. Based on these criteria we choose the BERTopic model for both datasets to create the final topic model. In the next section we will analyse the results of BERTopic and validate the selected models based on word and topic intrusion metrics.
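
As a rough illustration of what a word intrusion test looks like, the sketch below builds a single task: the top words of one topic are mixed with one "intruder" word taken from another topic, and a human rater has to spot the intruder. The helper `build_word_intrusion_task` and the toy topics are hypothetical and only serve to explain the idea; they are not the validation procedure used later.

```python
import random


def build_word_intrusion_task(topics, topic_id, n_top=5, seed=42):
    """Return the shuffled words shown to a rater and the hidden intruder word."""
    rng = random.Random(seed)
    top_words = topics[topic_id][:n_top]
    # Candidate intruders: top words of all other topics that are not in this topic
    candidates = [word
                  for other_id, words in topics.items() if other_id != topic_id
                  for word in words[:n_top] if word not in top_words]
    intruder = rng.choice(candidates)
    shown = top_words + [intruder]
    rng.shuffle(shown)
    return shown, intruder


# Invented toy topics (topic id -> top words)
toy_topics = {
    0: ["migration", "asyl", "grenze", "fluechtlinge", "familiennachzug"],
    1: ["klima", "energie", "co2", "kohle", "wind"],
}
shown_words, intruder_word = build_word_intrusion_task(toy_topics, topic_id=0)
print(shown_words, intruder_word)
```

If raters reliably identify the intruder, the topic is coherent from a human perspective. If they cannot, the top words of the topic do not form a recognisable theme.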

# 4 Results

# 4.1 Topic modelling results

Based on the model selection and creation in section 3.5 we will now analyse the results to answer our first three research questions:

* **RQ 1.1** What are the main topics of tweets of prominent politicians of the six parties in the German parliament in the period of the 19th Bundestag?

* **RQ 1.2** What are the main topics of speeches of prominent politicians of the six parties in the German parliament in the period of the 19th Bundestag?

* **RQ 1.3** How do the main topics of tweets and speeches of prominent politicians of the six parties in the German parliament differ in the period of the 19th Bundestag?

For this we visualise the results and take a closer look at several topics. We cannot provide an exhaustive interpretation of all topics of our models, as this would be out of scope for this work. We still provide the code for an exhaustive analysis, so that interested readers can run the analysis on their own.

## 4.1.1 Analyse tweets model

To answer the first research question, we use the trained BERTopic model for tweets from the last section. We load the pretrained model and the resulting data. If the model is retrained, one can skip this step.

In [None]:
# Load data
tweets_processed_bert = pickle.load(open( "../data/processed/tweets_processed_bert.pickle", "rb" ))
docs_tweets = tweets_processed_bert.text_preprocessed_sentence.tolist()
with open('../data/processed/probabilities_tweets_bert.pickle', 'rb') as handle:
    probs_tweets = pickle.load(handle)
with open('../data/processed/topics_tweets_bert.pickle', 'rb') as handle:
    topics_tweets = pickle.load(handle)

In [None]:
# Load model
topic_model_tweets = BERTopic.load("../models/bertopic_tweets")

### 4.1.1.1 Overview of topics

We reduced the model to 100 topics, which increased the coherence but also created one large, unspecific topic that yields no insights. Avoiding this would require hyperparameter optimization over the number of topics and the preprocessing, which is out of scope given our computational restrictions. We therefore focus on the topics that can be interpreted.

In [None]:
topic_model_tweets.get_topic_info().head(10)

In [None]:
topic_model_tweets.get_topic_info().tail(10)

In [None]:
topic_model_tweets.visualize_barchart(topics=None, top_n_topics=25, n_words=5, width=250, height=250) 

Even though we already identified some weaknesses in the model, we can see that there are many interesting topics that we can use for further analysis.

### 4.1.1.2 Visualise topic correlation

Another important quality indicator is the similarity of the topics. If we have very similar topics, they will not be very selective and can lead to skewed topic distributions.
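
The similarity idea can be made concrete with a small sketch. It assumes that every topic is represented by a numeric vector (for example an embedding or a term-weight vector) and that similarity is measured as cosine similarity. The vectors and function names below are invented for illustration and are not taken from our model.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def similarity_matrix(topic_vectors):
    """Pairwise cosine similarity between all topic vectors."""
    return [[cosine_similarity(u, v) for v in topic_vectors] for u in topic_vectors]


# Invented topic vectors: topics 0 and 1 point in nearly the same direction
toy_topic_vectors = [
    [1.0, 0.9, 0.0],
    [0.9, 1.0, 0.1],
    [0.0, 0.1, 1.0],
]
sim = similarity_matrix(toy_topic_vectors)
for row in sim:
    print([round(value, 2) for value in row])
```

Pairs with a similarity close to 1 are candidates for topics that overlap and are therefore hard to separate, which is the pattern to look for in the heatmap.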

In [None]:
topic_model_tweets.visualize_heatmap(top_n_topics=100, width=800, height=800)

We can see strong similarity between some topics. Interestingly, these are topics that seem to be not well defined and do not show high intra-topic coherence from a human perspective. With more sophisticated preprocessing and hyperparameter optimization we should be able to handle this problem. In our further analysis we focus on topics that are not highly correlated with other topics and that seem to have a coherent meaning, to avoid distorted topics.

### 4.1.1.3 Visualise topic hierarchy

To analyse the topic clusters of the resulting BERTopic model, we use the model's inherent hierarchical clustering to identify significant clusters, which we then analyse in more detail.

In [None]:
# Visualise topic hierarchy
topic_model_tweets.visualize_hierarchy(top_n_topics=100) 

In [None]:
# Visualise topic distance map
topic_model_tweets.visualize_topics(top_n_topics=100)

Based on the clustering and our evaluation we identified twelve larger topic clusters. We will analyse three of the clusters in detail, while the other clusters are briefly described and code for a deeper analysis is provided. We only selected clusters that contain topics with political and societal relevance. This limitation excludes topics that only comprise interpersonal relationship building and small talk. There are many more topics and clusters that we could cover, but this is out of scope for this work.

### 4.1.1.4 Analyse topics

In [None]:
# Prepare time based visualisation
tweets_topics_over_time = topic_model_tweets.topics_over_time(docs_tweets, topics_tweets, 
                                                       pd.to_datetime(tweets_processed_bert.date).dt.strftime('%Y-%m'), 
                                                       nr_bins=None, datetime_format=None, evolution_tuning=True,
                                                       global_tuning=True) 

#### 4.1.1.4.1 Cluster migration

The first cluster covers migration and contains the topics 15, 19, 53 and 59. It includes the subjects migration, asylum, refugees and family reunification. We take a closer look at this cluster to get a better understanding of the subject area.

In [None]:
# Define cluster
cluster_1_migration = [15, 19, 53, 59]

In [None]:
# Visualise topic hierarchy
topic_model_tweets.visualize_hierarchy(top_n_topics=100, topics = cluster_1_migration) 

In [None]:
# Analyse the cluster over time
topic_model_tweets.visualize_topics_over_time(tweets_topics_over_time, topics=cluster_1_migration)

The frequency of tweets concerning migration and asylum peaks around the second half of 2018 and decreases from then on. This peak coincides with the discussion about the [global compact for migration](https://refugeesmigrants.un.org/migration-compact) of the United Nations and other debates about migration and asylum in this time span.

In [None]:
# See the party distribution of the cluster
tweets_cluster_1_migration =  tweets_processed_bert[tweets_processed_bert.topic_id.isin(cluster_1_migration)]
print(tweets_cluster_1_migration.groupby("party").size().sort_values(ascending = False))
print("\n")
print(tweets_processed_bert.groupby("party").size().sort_values(ascending = False))

The AfD tweets considerably more about migration and asylum than the other parties, relative to its overall tweet frequency. The remaining distribution of tweets seems to be proportional to the overall number of tweets of each party.
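
Comparing the two printed tables by eye is error-prone, so the comparison can also be expressed as a relative share per party, that is, the number of tweets in the cluster divided by the total number of tweets of that party. The sketch below shows the computation with the standard library only; the function name and the toy data with placeholder party labels are invented for illustration and do not reflect our results.

```python
from collections import Counter


def relative_cluster_share(cluster_parties, all_parties):
    """Share of each party's tweets that fall into the cluster."""
    cluster_counts = Counter(cluster_parties)
    total_counts = Counter(all_parties)
    return {party: cluster_counts[party] / total
            for party, total in total_counts.items()}


# Invented toy data: one party label per tweet
all_tweets_party = ["A"] * 100 + ["B"] * 400
cluster_tweets_party = ["A"] * 20 + ["B"] * 20
shares = relative_cluster_share(cluster_tweets_party, all_tweets_party)
print(shares)
```

In this toy example both parties contribute the same absolute number of tweets to the cluster, but party A devotes a four times larger share of its tweets to it. The same normalisation could be applied to `tweets_cluster_1_migration` and `tweets_processed_bert`.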

In [None]:
# See the most prominent politicians of the cluster
tweets_cluster_1_migration.groupby("full_name").size().sort_values(ascending = False).head(10)

The distribution of the politicians seems to correlate with the identified distribution of the parties. An interesting next step could be to investigate the sentiment of the different politicians and parties for the topic.

#### 4.1.1.4.2 Cluster media

The topics 0, 44, 78, 79, 88 and 98 form the cluster media. The cluster comprises subjects such as social media, the press and other communication media.

In [None]:
cluster_2_media = [0, 44, 78, 79, 88, 98]

In [None]:
# Analyse the cluster over time
# Uncomment if you want to analyse the cluster
# topic_model_tweets.visualize_topics_over_time(tweets_topics_over_time, topics=cluster_2_media)

In [None]:
# See the party distribution of the cluster
# Uncomment if you want to analyse the cluster
# tweets_cluster_2_media =  tweets_processed_bert[tweets_processed_bert.topic_id.isin(cluster_2_media)]
# print(tweets_cluster_2_media.groupby("party").size().sort_values(ascending = False))
# print("\n")
# print(tweets_processed_bert.groupby("party").size().sort_values(ascending = False))

In [None]:
# See the most prominent politicians of the cluster
# Uncomment if you want to analyse the cluster
# tweets_cluster_2_media.groupby("full_name").size().sort_values(ascending = False).head(10)

#### 4.1.1.4.3 Cluster extremism and religion

The next cluster comprises the topics 36, 47, 73 and 76, which deal with the subjects extremism and religion.

In [None]:
cluster_3_extremism_religion  = [36, 47, 73, 76]

In [None]:
# Analyse the cluster over time
# Uncomment if you want to analyse the cluster
# topic_model_tweets.visualize_topics_over_time(tweets_topics_over_time, topics=cluster_3_extremism_religion)

In [None]:
# See the party distribution of the cluster
# Uncomment if you want to analyse the cluster
# tweets_cluster_3_extremism_religion =  tweets_processed_bert[tweets_processed_bert.topic_id.isin(cluster_3_extremism_religion)]
# print(tweets_cluster_3_extremism_religion.groupby("party").size().sort_values(ascending = False))
# print("\n")
# print(tweets_processed_bert.groupby("party").size().sort_values(ascending = False))

In [None]:
# See the most prominent politicians of the cluster
# Uncomment if you want to analyse the cluster
# tweets_cluster_3_extremism_religion.groupby("full_name").size().sort_values(ascending = False).head(10)

#### 4.1.1.4.4 Cluster foreign politics and armed conflicts

The fourth cluster combines the topics 22, 32, 41, 56, 89 and 93. Main issues of this cluster are armed conflicts and defense topics.

In [None]:
cluster_4_foreign_politics_armed_conflicts = [22, 32, 41, 56, 89, 93]

In [None]:
# Analyse the cluster over time
# Uncomment if you want to analyse the cluster
# topic_model_tweets.visualize_topics_over_time(tweets_topics_over_time, topics=cluster_4_foreign_politics_armed_conflicts)

In [None]:
# See the party distribution of the cluster
# Uncomment if you want to analyse the cluster
# tweets_cluster_4_foreign_politics_armed_conflicts =  tweets_processed_bert[tweets_processed_bert.topic_id.isin(cluster_4_foreign_politics_armed_conflicts)]
# print(tweets_cluster_4_foreign_politics_armed_conflicts.groupby("party").size().sort_values(ascending = False))
# print("\n")
# print(tweets_processed_bert.groupby("party").size().sort_values(ascending = False))

In [None]:
# See the most prominent politicians of the cluster
# Uncomment if you want to analyse the cluster
# tweets_cluster_4_foreign_politics_armed_conflicts.groupby("full_name").size().sort_values(ascending = False).head(10)

#### 4.1.1.4.5 Cluster discrimination

Another prominent topic area is discrimination and racism, which we combined in the fifth cluster with the topics 13, 23, 37, 40 and 72.

In [None]:
cluster_5_discrimination = [13, 23, 37, 40, 72]

In [None]:
# Analyse the cluster over time
# Uncomment if you want to analyse the cluster
# topic_model_tweets.visualize_topics_over_time(tweets_topics_over_time, topics=cluster_5_discrimination)

In [None]:
# See the party distribution of the cluster
# Uncomment if you want to analyse the cluster
# tweets_cluster_5_discrimination =  tweets_processed_bert[tweets_processed_bert.topic_id.isin(cluster_5_discrimination)]
# print(tweets_cluster_5_discrimination.groupby("party").size().sort_values(ascending = False))
# print("\n")
# print(tweets_processed_bert.groupby("party").size().sort_values(ascending = False))

In [None]:
# See the most prominent politicians of the cluster
# Uncomment if you want to analyse the cluster
# tweets_cluster_5_discrimination.groupby("full_name").size().sort_values(ascending = False).head(10)

#### 4.1.1.4.6 Cluster Covid-19

The Covid-19 cluster comprises the topics 8, 9, 29, 54, 65, 71 and 90, which cover all topics around the global pandemic. We analyse this cluster in more detail.

In [None]:
cluster_6_covid = [8, 9, 29, 54, 65, 71, 90]

In [None]:
# Analyse the cluster over time
topic_model_tweets.visualize_topics_over_time(tweets_topics_over_time, topics=cluster_6_covid)

The time series of the cluster can easily be related to the development of the worldwide pandemic situation. The frequency of tweets is higher in times of high infection numbers and restrictions, and lower in summer when the situation is more relaxed.

In [None]:
# See the party distribution of the cluster
tweets_cluster_6_covid =  tweets_processed_bert[tweets_processed_bert.topic_id.isin(cluster_6_covid)]
print(tweets_cluster_6_covid.groupby("party").size().sort_values(ascending = False))
print("\n")
print(tweets_processed_bert.groupby("party").size().sort_values(ascending = False))

The SPD has a much higher number of tweets in this cluster than the other parties. This can be explained by the number of tweets of the prominent SPD politician Karl Lauterbach, as we can see in the next code cell. It could be interesting to analyse his tweets, television and other media appearances in more depth to better understand his political career.

In [None]:
# See the most prominent politicians of the cluster
tweets_cluster_6_covid.groupby("full_name").size().sort_values(ascending = False).head(10)
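
To check how strongly a single account drives the count of a party, one could recompute the party distribution without the most active author of the cluster. The sketch below uses invented data: the names in the toy DataFrame and the helper `party_counts_without_top_author` are hypothetical, and only the column names `party` and `full_name` follow the notebook.

```python
import pandas as pd

# Toy stand-in for a cluster subset such as tweets_cluster_6_covid (values invented)
toy_cluster_tweets = pd.DataFrame({
    "party": ["X", "X", "X", "X", "X", "Y", "Y", "Y"],
    "full_name": ["P1", "P1", "P1", "P1", "P2", "P3", "P3", "P4"],
})


def party_counts_without_top_author(df: pd.DataFrame) -> pd.Series:
    """Party counts after dropping the single most frequent author."""
    top_author = df["full_name"].value_counts().idxmax()
    remaining = df[df["full_name"] != top_author]
    return remaining.groupby("party").size().sort_values(ascending=False)


counts_all = toy_cluster_tweets.groupby("party").size().sort_values(ascending=False)
counts_without_top = party_counts_without_top_author(toy_cluster_tweets)
print(counts_all)
print(counts_without_top)
```

If the ranking of the parties changes once the top author is removed, as it does in this toy example, the party-level result mostly reflects one person's activity.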

#### 4.1.1.4.7 Cluster democratic structures

The topics 16, 24, 27, 34, 38 and 74 in this cluster seem to focus on general parliamentary and democratic structures.

In [None]:
cluster_7_democratic_structure = [16, 24, 27, 34, 38, 74]

In [None]:
# Analyse the cluster over time
# Uncomment if you want to analyse the cluster
# topic_model_tweets.visualize_topics_over_time(tweets_topics_over_time, topics=cluster_7_democratic_structure)

In [None]:
# See the party distribution of the cluster
# Uncomment if you want to analyse the cluster
# tweets_cluster_7_democratic_structure =  tweets_processed_bert[tweets_processed_bert.topic_id.isin(cluster_7_democratic_structure)]
# print(tweets_cluster_7_democratic_structure.groupby("party").size().sort_values(ascending = False))
# print("\n")
# print(tweets_processed_bert.groupby("party").size().sort_values(ascending = False))

In [None]:
# See the most prominent politicians of the cluster
# Uncomment if you want to analyse the cluster
# tweets_cluster_7_democratic_structure.groupby("full_name").size().sort_values(ascending = False).head(10)

#### 4.1.1.4.8 Cluster Germany and EU

Cluster 8 comprises the topics 5, 6, 35, 51 and 70, which focus on Europe, the EU and Germany.

In [None]:
cluster_8_germany_in_europe = [5, 6, 35, 51, 70]

In [None]:
# Analyse the cluster over time
# Uncomment if you want to analyse the cluster
# topic_model_tweets.visualize_topics_over_time(tweets_topics_over_time, topics=cluster_8_germany_in_europe)

In [None]:
# See the party distribution of the cluster
# Uncomment if you want to analyse the cluster
# tweets_cluster_8_germany_in_europe =  tweets_processed_bert[tweets_processed_bert.topic_id.isin(cluster_8_germany_in_europe)]
# print(tweets_cluster_8_germany_in_europe.groupby("party").size().sort_values(ascending = False))
# print("\n")
# print(tweets_processed_bert.groupby("party").size().sort_values(ascending = False))

In [None]:
# See the most prominent politicians of the cluster
# Uncomment if you want to analyse the cluster
# tweets_cluster_8_germany_in_europe.groupby("full_name").size().sort_values(ascending = False).head(10)

#### 4.1.1.4.9 Cluster finance

Another cluster consists of the topics 14, 39, 67 and 91, which cover topics around finance.

In [None]:
cluster_9_finance = [14, 39, 67, 91]

In [None]:
# Analyse the cluster over time
# Uncomment if you want to analyse the cluster
# topic_model_tweets.visualize_topics_over_time(tweets_topics_over_time, topics=cluster_9_finance)

In [None]:
# See the party distribution of the cluster
# Uncomment if you want to analyse the cluster
# tweets_cluster_9_finance =  tweets_processed_bert[tweets_processed_bert.topic_id.isin(cluster_9_finance)]
# print(tweets_cluster_9_finance.groupby("party").size().sort_values(ascending = False))
# print("\n")
# print(tweets_processed_bert.groupby("party").size().sort_values(ascending = False))

In [None]:
# See the most prominent politicians of the cluster
# Uncomment if you want to analyse the cluster
# tweets_cluster_9_finance.groupby("full_name").size().sort_values(ascending = False).head(10)

#### 4.1.1.4.10 Cluster police and safety

The cluster police and safety comprises the three topics 7, 83 and 93 and covers the issues police and safety.

In [None]:
cluster_10_police_safety = [7, 83, 93]

In [None]:
# Analyse the cluster over time
# Uncomment if you want to analyse the cluster
# topic_model_tweets.visualize_topics_over_time(tweets_topics_over_time, topics=cluster_10_police_safety)

In [None]:
# See the party distribution of the cluster
# Uncomment if you want to analyse the cluster
# tweets_cluster_10_police_safety =  tweets_processed_bert[tweets_processed_bert.topic_id.isin(cluster_10_police_safety)]
# print(tweets_cluster_10_police_safety.groupby("party").size().sort_values(ascending = False))
# print("\n")
# print(tweets_processed_bert.groupby("party").size().sort_values(ascending = False))

In [None]:
# See the most prominent politicians of the cluster
# Uncomment if you want to analyse the cluster
# tweets_cluster_10_police_safety.groupby("full_name").size().sort_values(ascending = False).head(10)

#### 4.1.1.4.11 Cluster climate

Another cluster of interest consists of the topics 12 and 99 and covers the area climate and nature. We will analyse the area in more detail.

In [None]:
cluster_11_climate = [12, 99]

In [None]:
# Analyse the cluster over time
topic_model_tweets.visualize_topics_over_time(tweets_topics_over_time, topics=cluster_11_climate)

An interesting trend in the data is the sharp increase in the frequency of tweets until the beginning of the Covid-19 pandemic. After the beginning of the pandemic, the topic lost importance in the tweet behaviour of the politicians.
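
The trend described above could be quantified by comparing the average number of tweets per month before and after a cutoff date. This is only a sketch under assumptions: the toy data are invented, the cutoff of March 2020 is chosen purely for illustration, and a `date` column is assumed for the tweets DataFrame (the notebook shows this column only for the speeches).

```python
import pandas as pd

# Toy stand-in for the climate cluster subset (dates are invented)
toy_cluster = pd.DataFrame({
    "date": pd.to_datetime([
        "2019-11-03", "2019-11-20", "2019-12-01", "2019-12-15",
        "2020-01-10", "2020-01-11", "2020-04-02", "2020-05-09",
    ]),
})
cutoff = pd.Timestamp("2020-03-01").to_period("M")  # assumed start of the pandemic period

# Number of tweets per calendar month.
# Note: months without any tweet are absent here and are not counted as zero.
monthly = toy_cluster.groupby(toy_cluster["date"].dt.to_period("M")).size()

before = monthly[monthly.index < cutoff]
after = monthly[monthly.index >= cutoff]
print(before.mean(), after.mean())
```

For a real analysis one would probably also normalise by the total number of tweets per month, since overall Twitter activity may change over time.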

In [None]:
# See the party distribution of the cluster
tweets_cluster_11_climate =  tweets_processed_bert[tweets_processed_bert.topic_id.isin(cluster_11_climate)]
print(tweets_cluster_11_climate.groupby("party").size().sort_values(ascending = False))
print("\n")
print(tweets_processed_bert.groupby("party").size().sort_values(ascending = False))

The party Die Grünen has the highest frequency of tweets concerning environmental topics. This is in line with the political agenda of the party.

In [None]:
# See the most prominent politicians of the cluster
tweets_cluster_11_climate.groupby("full_name").size().sort_values(ascending = False).head(10)

When analysing the list of politicians who tweet about this cluster with a high frequency, we see many politicians of the party Die Grünen as well as other politicians who generally tweet with a high frequency.

#### 4.1.1.4.12 Cluster infrastructure

The last cluster contains the topics 1, 18 and 87 and covers digital and analogue infrastructure.

In [None]:
cluster_12_infrastructure = [18, 87, 1]

In [None]:
# Analyse the cluster over time
# Uncomment if you want to analyse the cluster
# topic_model_tweets.visualize_topics_over_time(tweets_topics_over_time, topics=cluster_12_infrastructure)

In [None]:
# See the party distribution of the cluster
# Uncomment if you want to analyse the cluster
# tweets_cluster_12_infrastructure =  tweets_processed_bert[tweets_processed_bert.topic_id.isin(cluster_12_infrastructure)]
# print(tweets_cluster_12_infrastructure.groupby("party").size().sort_values(ascending = False))
# print("\n")
# print(tweets_processed_bert.groupby("party").size().sort_values(ascending = False))

In [None]:
# See the most prominent politicians of the cluster
# Uncomment if you want to analyse the cluster
# tweets_cluster_12_infrastructure.groupby("full_name").size().sort_values(ascending = False).head(10)

### 4.1.1.5 Summary

In this section we summarize the results concerning the initial research question:

**What are the main topics of tweets of prominent politicians of the six parties in the German Bundestag in the period of the 19th Bundestag?**

We trained a BERTopic model to give us an overview of the topics, which are presented in sections 4.1.1.1 - 4.1.1.3. Based on the identified topics and the inherent clustering of the model, we defined 12 overarching clusters of subjects, which are presented in section 4.1.1.4. These clusters can now be used for further analysis. To answer the research question, we identified the following main topics of tweets of the selected politicians:

* Migration
* Media
* Extremism and religion
* Foreign politics and armed conflicts
* Discrimination
* Covid-19
* Democratic structures
* Europe, EU and Germany
* Finance
* Police and safety
* Climate
* Infrastructure
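
Because the clusters were assigned by hand, a quick consistency check can confirm whether any topic id ends up in more than one cluster. The sketch below uses two of the cluster lists defined above (cluster 4 and cluster 10); the dictionary keys and the helper name `find_shared_topics` are assumptions for illustration.

```python
from collections import defaultdict

# Two of the cluster definitions from section 4.1.1.4
clusters = {
    "foreign_politics_armed_conflicts": [22, 32, 41, 56, 89, 93],
    "police_safety": [7, 83, 93],
}


def find_shared_topics(clusters: dict) -> dict:
    """Map each topic id that occurs in several clusters to those cluster names."""
    membership = defaultdict(list)
    for name, topic_ids in clusters.items():
        for topic_id in topic_ids:
            membership[topic_id].append(name)
    return {t: names for t, names in membership.items() if len(names) > 1}


shared = find_shared_topics(clusters)
print(shared)
```

For these two lists the check reports topic 93, which is listed in both clusters, so tweets of that topic are counted in both when the clusters are analysed separately.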

We did not include topics that result from interpersonal relationship building or small talk. We took a deeper look at the clusters migration, Covid-19 and climate. The code for a deeper analysis of the other clusters is provided and can be used by the interested reader. Based on this analysis we will compare the results with the topics of the speeches in parliament in section 4.1.3.

## 4.1.2 Analyse speeches model

To answer the second research question, we proceed in the same way as for the first research question.

In [None]:
# Load data
with open("../data/processed/speeches_processed_bert.pickle", "rb") as handle:
    speeches_processed_bert = pickle.load(handle)
docs_speeches = speeches_processed_bert.text_preprocessed_sentence.tolist()
with open('../data/processed/probabilities_speeches_bert.pickle', 'rb') as handle:
    probs_speeches = pickle.load(handle)
with open('../data/processed/topics_speeches_bert.pickle', 'rb') as handle:
    topics_speeches = pickle.load(handle)

### 4.1.2.1 Overview of topics

We already saw in the modelling section that we identified fewer topics for the speeches dataset. This corresponds to the considerably smaller number of documents in the dataset. We identified 25 topics in the modelling stage, which we now analyse in more detail.

In [None]:
# Load model
topic_model_speeches = BERTopic.load("../models/bertopic_speeches")

In [None]:
# Show topic infos
topic_model_speeches.get_topic_info()

In [None]:
topic_model_speeches.visualize_barchart(topics=None, top_n_topics=25, n_words=5, width=250, height=250) 

### 4.1.2.2 Visualise topic correlation

To get a better understanding of the quality of our topic model, we analyse the similarity of the identified topics.

In [None]:
# Visualise correlation
topic_model_speeches.visualize_heatmap(top_n_topics=25)

The first two topics have notable similarity scores with various other topics. This could skew our results and has to be kept in mind when interpreting them.
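
To quantify which topics are similar to many others, one could compute pairwise cosine similarities between topic vectors and count, per topic, how many other topics exceed a threshold. The following sketch uses a small invented matrix; how the topic vectors are obtained from the model is left out here, and the threshold of 0.8 is an arbitrary assumption.

```python
import numpy as np

# Hypothetical topic vectors, one row per topic (values invented)
topic_vectors = np.array([
    [1.0, 1.0, 0.0],
    [1.0, 0.9, 0.1],
    [0.0, 0.1, 1.0],
    [0.9, 1.0, 0.2],
])

# Cosine similarity: normalise rows to unit length, then take dot products
unit = topic_vectors / np.linalg.norm(topic_vectors, axis=1, keepdims=True)
similarity = unit @ unit.T

threshold = 0.8  # arbitrary cut-off for "similar"
np.fill_diagonal(similarity, 0.0)  # ignore self-similarity
n_similar = (similarity > threshold).sum(axis=1)
print(n_similar)
```

Topics with a high count in `n_similar` are candidates for merging or for a more cautious interpretation.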

### 4.1.2.3 Visualise topic hierarchy

To analyse the topic clusters of the resulting BERTopic model, we use the inherent clustering of the model.

In [None]:
# Visualise clustering
topic_model_speeches.visualize_hierarchy(orientation='left', top_n_topics=25, width=1000, height=600) 

In [None]:
# Visualise topic distance
topic_model_speeches.visualize_topics(topics=None, top_n_topics=None, width=650, height=650)

### 4.1.2.4 Analyse topics

We identified 12 clusters of topics, which we now analyse in more detail.

In [None]:
# Prepare time based visualisation
speeches_topics_over_time = topic_model_speeches.topics_over_time(docs_speeches, topics_speeches, pd.to_datetime(speeches_processed_bert.date).dt.strftime('%Y-%m'), 
                                                       nr_bins=None, datetime_format=None, evolution_tuning=True,
                                                       global_tuning=True) 

#### 4.1.2.4.1 Cluster Europe

The first cluster is based on the topics 21 and 23 and deals with Europe and the EU.

In [None]:
cluster_1_europe = [21, 23]

In [None]:
# Analyse the cluster over time
# Uncomment if you want to analyse the cluster
# topic_model_speeches.visualize_topics_over_time(speeches_topics_over_time, topics=cluster_1_europe)

In [None]:
# See the party distribution of the cluster
# Uncomment if you want to analyse the cluster
# speeches_cluster_1_europe = speeches_processed_bert[speeches_processed_bert.topic_id.isin(cluster_1_europe)]
# print(speeches_cluster_1_europe.groupby("party").size().sort_values(ascending = False))
# print("\n")
# print(speeches_processed_bert.groupby("party").size().sort_values(ascending = False))

In [None]:
# See the most prominent politicians of the cluster
# Uncomment if you want to analyse the cluster
# speeches_cluster_1_europe.groupby("full_name").size().sort_values(ascending = False).head(10)

#### 4.1.2.4.2 Cluster democratic structures

The second cluster comprises only topic 5, which covers democratic structures.

In [None]:
cluster_2_democratic = [5]

In [None]:
# Analyse the cluster over time
# Uncomment if you want to analyse the cluster
# topic_model_speeches.visualize_topics_over_time(speeches_topics_over_time, topics=cluster_2_democratic)

In [None]:
# See the party distribution of the cluster
# Uncomment if you want to analyse the cluster
# speeches_cluster_2_democratic = speeches_processed_bert[speeches_processed_bert.topic_id.isin(cluster_2_democratic)]
# print(speeches_cluster_2_democratic.groupby("party").size().sort_values(ascending = False))
# print("\n")
# print(speeches_processed_bert.groupby("party").size().sort_values(ascending = False))

In [None]:
# See the most prominent politicians of the cluster
# Uncomment if you want to analyse the cluster
# speeches_cluster_2_democratic.groupby("full_name").size().sort_values(ascending = False).head(10)

#### 4.1.2.4.3 Cluster Covid-19

The third cluster contains the topics 2 and 18 concerning health and the Covid-19 pandemic. We will analyse the prevalence of the topics over time and per party.

In [None]:
cluster_3_covid = [18, 2]

In [None]:
# Analyse the cluster over time
topic_model_speeches.visualize_topics_over_time(speeches_topics_over_time, topics=cluster_3_covid)

We can identify two peaks of the subject that mirror the development of the pandemic situation. We already saw this trend in the analysis of the tweets.

In [None]:
# See the party distribution of the cluster
speeches_cluster_3_covid = speeches_processed_bert[speeches_processed_bert.topic_id.isin(cluster_3_covid)]
print(speeches_cluster_3_covid.groupby("party").size().sort_values(ascending = False))
print("\n")
print(speeches_processed_bert.groupby("party").size().sort_values(ascending = False))

There are no obvious patterns in the distribution of the speeches per party.

In [None]:
# See the most prominent politicians of the cluster
speeches_cluster_3_covid.groupby("full_name").size().sort_values(ascending = False).head(10)

When analysing the top speakers, we see a surprising pattern: neither Jens Spahn nor Karl Lauterbach is in the list of top speakers.

#### 4.1.2.4.4 Cluster foreign politics

The largest cluster combines seven topics (6, 7, 11, 12, 15, 19, 20) concerning foreign politics.

In [None]:
cluster_4_foreign_politics = [6, 7, 11, 12, 15, 19, 20]

In [None]:
# Analyse the cluster over time
# Uncomment if you want to analyse the cluster
# topic_model_speeches.visualize_topics_over_time(speeches_topics_over_time, topics=cluster_4_foreign_politics)

In [None]:
# See the party distribution of the cluster
# Uncomment if you want to analyse the cluster
# speeches_cluster_4_foreign_politics = speeches_processed_bert[speeches_processed_bert.topic_id.isin(cluster_4_foreign_politics)]
# print(speeches_cluster_4_foreign_politics.groupby("party").size().sort_values(ascending = False))
# print("\n")
# print(speeches_processed_bert.groupby("party").size().sort_values(ascending = False))

In [None]:
# See the most prominent politicians of the cluster
# Uncomment if you want to analyse the cluster
# speeches_cluster_4_foreign_politics.groupby("full_name").size().sort_values(ascending = False).head(10)

#### 4.1.2.4.5 Cluster occupations

The next cluster contains the topics 8 and 9 and deals with the subject occupations.

In [None]:
cluster_5_occupation = [8, 9]

In [None]:
# Analyse the cluster over time
# Uncomment if you want to analyse the cluster
# topic_model_speeches.visualize_topics_over_time(speeches_topics_over_time, topics=cluster_5_occupation)

In [None]:
# See the party distribution of the cluster
# Uncomment if you want to analyse the cluster
# speeches_cluster_5_occupation = speeches_processed_bert[speeches_processed_bert.topic_id.isin(cluster_5_occupation)]
# print(speeches_cluster_5_occupation.groupby("party").size().sort_values(ascending = False))
# print("\n")
# print(speeches_processed_bert.groupby("party").size().sort_values(ascending = False))

In [None]:
# See the most prominent politicians of the cluster
# Uncomment if you want to analyse the cluster
# speeches_cluster_5_occupation.groupby("full_name").size().sort_values(ascending = False).head(10)

#### 4.1.2.4.6 Cluster discrimination

The sixth cluster consists only of topic 17, which treats the issue discrimination.

In [None]:
cluster_6_discrimination = [17]

In [None]:
# Analyse the cluster over time
# Uncomment if you want to analyse the cluster
# topic_model_speeches.visualize_topics_over_time(speeches_topics_over_time, topics=cluster_6_discrimination)

In [None]:
# See the party distribution of the cluster
# Uncomment if you want to analyse the cluster
# speeches_cluster_6_discrimination = speeches_processed_bert[speeches_processed_bert.topic_id.isin(cluster_6_discrimination)]
# print(speeches_cluster_6_discrimination.groupby("party").size().sort_values(ascending = False))
# print("\n")
# print(speeches_processed_bert.groupby("party").size().sort_values(ascending = False))

In [None]:
# See the most prominent politicians of the cluster
# Uncomment if you want to analyse the cluster
# speeches_cluster_6_discrimination.groupby("full_name").size().sort_values(ascending = False).head(10)

#### 4.1.2.4.7 Cluster police and safety

Another cluster comprises only one topic (14) and deals with police and safety.

In [None]:
cluster_7_police_safety = [14]

In [None]:
# Analyse the cluster over time
# Uncomment if you want to analyse the cluster
# topic_model_speeches.visualize_topics_over_time(speeches_topics_over_time, topics=cluster_7_police_safety)

In [None]:
# See the party distribution of the cluster
# Uncomment if you want to analyse the cluster
# speeches_cluster_7_police_safety = speeches_processed_bert[speeches_processed_bert.topic_id.isin(cluster_7_police_safety)]
# print(speeches_cluster_7_police_safety.groupby("party").size().sort_values(ascending = False))
# print("\n")
# print(speeches_processed_bert.groupby("party").size().sort_values(ascending = False))

In [None]:
# See the most prominent politicians of the cluster
# Uncomment if you want to analyse the cluster
# speeches_cluster_7_police_safety.groupby("full_name").size().sort_values(ascending = False).head(10)

#### 4.1.2.4.8 Cluster climate

Cluster eight (topic 1) includes speeches about climate change and protection. To get an overview of the topic, we analyse it in more detail.

In [None]:
cluster_8_climate = [1]

In [None]:
# Analyse the cluster over time
topic_model_speeches.visualize_topics_over_time(speeches_topics_over_time, topics=cluster_8_climate)

The topic peaks around the end of 2019, 2020 and 2021. The peaks mainly concern renewable energy.

In [None]:
# See the party distribution of the cluster
speeches_cluster_8_climate = speeches_processed_bert[speeches_processed_bert.topic_id.isin(cluster_8_climate)]
print(speeches_cluster_8_climate.groupby("party").size().sort_values(ascending = False))
print("\n")
print(speeches_processed_bert.groupby("party").size().sort_values(ascending = False))

There is a strong difference in the number of speeches covering the subject. The parties Die Grünen and CDU cover this topic in their speeches more than the other parties, controlling for their general frequency of speeches. Most of the speeches of the CDU are held by Peter Altmaier, as we see in the next code snippet.

In [None]:
# See the most prominent politicians of the cluster
speeches_cluster_8_climate.groupby("full_name").size().sort_values(ascending = False).head(10)

We observe many politicians of the party Die Grünen and the CDU politician Peter Altmaier. He was the federal minister for economic affairs and energy, which explains his top position in the overview.

#### 4.1.2.4.9 Cluster digitalisation

The ninth cluster consists of topic 4, which covers digitalisation.

In [None]:
cluster_9_digitalisation = [4]

In [None]:
# Analyse the cluster over time
# Uncomment if you want to analyse the cluster
# topic_model_speeches.visualize_topics_over_time(speeches_topics_over_time, topics=cluster_9_digitalisation)

In [None]:
# See the party distribution of the cluster
# Uncomment if you want to analyse the cluster
# speeches_cluster_9_digitalisation = speeches_processed_bert[speeches_processed_bert.topic_id.isin(cluster_9_digitalisation)]
# print(speeches_cluster_9_digitalisation.groupby("party").size().sort_values(ascending = False))
# print("\n")
# print(speeches_processed_bert.groupby("party").size().sort_values(ascending = False))

In [None]:
# See the most prominent politicians of the cluster
# Uncomment if you want to analyse the cluster
# speeches_cluster_9_digitalisation.groupby("full_name").size().sort_values(ascending = False).head(10)

#### 4.1.2.4.10 Cluster health

The subject health is present in the topics 3 and 22.

In [None]:
cluster_10_health = [3,22]

In [None]:
# Analyse the cluster over time
# Uncomment if you want to analyse the cluster
# topic_model_speeches.visualize_topics_over_time(speeches_topics_over_time, topics=cluster_10_health)

In [None]:
# See the party distribution of the cluster
# Uncomment if you want to analyse the cluster
# speeches_cluster_10_health = speeches_processed_bert[speeches_processed_bert.topic_id.isin(cluster_10_health)]
# print(speeches_cluster_10_health.groupby("party").size().sort_values(ascending = False))
# print("\n")
# print(speeches_processed_bert.groupby("party").size().sort_values(ascending = False))

In [None]:
# See the most prominent politicians of the cluster
# Uncomment if you want to analyse the cluster
# speeches_cluster_10_health.groupby("full_name").size().sort_values(ascending = False).head(10)

#### 4.1.2.4.11 Cluster extremism and religion

Similar to the clusters of the tweets, we have a cluster (topics 13 and 16) dealing with extremism and religion.

In [None]:
cluster_11_extremism_religion = [13, 16]

In [None]:
# Analyse the cluster over time
# Uncomment if you want to analyse the cluster
# topic_model_speeches.visualize_topics_over_time(speeches_topics_over_time, topics=cluster_11_extremism_religion)

In [None]:
# See the party distribution of the cluster
# Uncomment if you want to analyse the cluster
# speeches_cluster_11_extremism_religion = speeches_processed_bert[speeches_processed_bert.topic_id.isin(cluster_11_extremism_religion)]
# print(speeches_cluster_11_extremism_religion.groupby("party").size().sort_values(ascending = False))
# print("\n")
# print(speeches_processed_bert.groupby("party").size().sort_values(ascending = False))

In [None]:
# See the most prominent politicians of the cluster
# Uncomment if you want to analyse the cluster
# speeches_cluster_11_extremism_religion.groupby("full_name").size().sort_values(ascending = False).head(10)

#### 4.1.2.4.12 Cluster migration

The last cluster with topic 10 is about migration.

In [None]:
cluster_12_migration = [10]

In [None]:
# Analyse the cluster over time
topic_model_speeches.visualize_topics_over_time(speeches_topics_over_time, topics=cluster_12_migration)

In [None]:
# See the party distribution of the cluster
speeches_cluster_12_migration = speeches_processed_bert[speeches_processed_bert.topic_id.isin(cluster_12_migration)]
print(speeches_cluster_12_migration.groupby("party").size().sort_values(ascending = False))
print("\n")
print(speeches_processed_bert.groupby("party").size().sort_values(ascending = False))

There are not many speeches about migration, but most of them are held by FDP and AFD. We saw a similar trend in the tweets, but the comparatively high number of tweets of the AFD does not transfer to the number of speeches held.

In [None]:
# See the most prominent politicians of the cluster
speeches_cluster_12_migration.groupby("full_name").size().sort_values(ascending = False).head(10)

### 4.1.2.5 Summary

We use the results of the last subsections to answer the second research question:

**What are the main topics of speeches of prominent politicians of the six parties in the German Bundestag in the period of the 19th Bundestag?**

We trained a BERTopic model to give us an overview of the topics, which are presented in sections 4.1.2.1 - 4.1.2.3. Based on the identified topics and the inherent clustering of the model, we defined 12 overarching clusters of subjects, which are presented in section 4.1.2.4. These clusters can now be used for further analysis. To answer the second research question, we identified the following main topics of speeches of the selected politicians:

* Europe
* Democratic structures
* Covid-19
* Foreign politics
* Occupation
* Discrimination
* Police and safety
* Climate
* Digitalisation
* Health
* Extremism and religion
* Migration

We took a deeper look at the clusters migration, Covid-19 and climate. The code for a deeper analysis of the other clusters is provided and can be used by the interested reader. Based on this analysis we will compare the topics of the speeches with the topics of the tweets in section 4.1.3.
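
The per-cluster cells above repeat the same steps: filter by topic ids, then count by party and by politician. A small helper could bundle them. This is only a sketch; the function name `summarise_cluster` and the toy data are assumptions, while the column names `topic_id`, `party` and `full_name` follow the notebook.

```python
import pandas as pd


def summarise_cluster(df: pd.DataFrame, topic_ids: list, top_n: int = 10):
    """Return party counts and the top_n most frequent authors for a cluster."""
    cluster_df = df[df["topic_id"].isin(topic_ids)]
    by_party = cluster_df.groupby("party").size().sort_values(ascending=False)
    by_person = (
        cluster_df.groupby("full_name").size().sort_values(ascending=False).head(top_n)
    )
    return by_party, by_person


# Toy stand-in for speeches_processed_bert (values invented)
toy_speeches = pd.DataFrame({
    "topic_id": [1, 1, 1, 4, 10],
    "party": ["A", "A", "B", "B", "A"],
    "full_name": ["P1", "P1", "P2", "P2", "P3"],
})
by_party, by_person = summarise_cluster(toy_speeches, [1])
print(by_party)
print(by_person)
```

With the objects of this notebook, the helper could be called, for example, as `summarise_cluster(speeches_processed_bert, cluster_4_foreign_politics)`.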

## 4.1.3 Compare topics of tweets and speeches

Based on the results of the last two subsections, we now compare the content of tweets and speeches of the German politicians to answer the third research question:

**How do the main topics of tweets and speeches of prominent politicians of the six parties in the German Bundestag differ in the period of the 19th Bundestag?**

For this we compare the overall topics of the two media and the topic distributions broken down by party. When comparing the topics of tweets and speeches, we take into account the inherent differences between the two media.

### 4.1.3.1 Topics in tweets and speeches

There was a considerable difference in the number of topics we identified for the tweets and for the speeches. One part of this difference can be explained by the many times larger number of tweets compared to speeches, while another part can be explained by the nature of tweets and speeches. The number of topics is quite high for tweets, which can be explained by Twitter posing a lower barrier to communication: politicians express opinions on more subjects than they are willing to talk about in the Bundestag. This could also be interpreted as a sign that politicians are willing to express opinions on Twitter about topics in which they are not experts, while they are more selective in their speeches in the Bundestag. Additionally, politicians use Twitter to announce various events and to build connections to voters and other people of public interest.

When analysing the overall clusters of the topics of speeches and tweets, we found a high number of matches. There are no obvious significant differences in the topics of the two media. However, the relative focus on the topics differs between the media.

In [None]:
# Visualise top topics for tweets
tweets_processed_bert.groupby("topic").size().sort_values(ascending = False).iloc[[1, 3, 5, 6, 7, 8, 9, 10, 12, 13]]

The first obvious difference in the top topics is the presence of many non-relevant topics in the model for tweets. Therefore we select the subset of the most prominent relevant topics for the tweets model.

In [None]:
# Visualise top topics for speeches
speeches_processed_bert.groupby("topic").size().sort_values(ascending = False)[1:11]

One can see striking differences in the top topics between the two media. The topics digitalisation, climate, occupation and the COVID-19 pandemic are present in both media. While topics concerning foreign policy and armed conflict appear with high frequency in speeches, we do not see them among the top topics of tweets. The topic EU, Europe and Euro has a high presence in the tweets of the politicians but appears less often in the speeches dataset. This is most likely caused by the European Parliament election in 2019 and the preceding election campaigns. This analysis uses only an excerpt of the topics and therefore has limited validity. However, it still provides a first overview of the differences in the most prominent topics per medium. A deeper analysis is possible but out of scope for this work.
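
The comparison of top topics described above could also be quantified as a set overlap. The sketch below is only an illustration: `top_k_overlap` is a hypothetical helper, the label lists and their ordering are made up from the topic names mentioned in the text, and it assumes the topics of both models have been mapped to shared labels by hand.

```python
def top_k_overlap(ranked_a, ranked_b, k=10):
    """Compare the k highest-ranked topic labels of two media (hypothetical helper)."""
    top_a, top_b = set(ranked_a[:k]), set(ranked_b[:k])
    shared = top_a & top_b
    union = top_a | top_b
    # Jaccard index: size of the intersection divided by size of the union
    jaccard = len(shared) / len(union) if union else 0.0
    return shared, top_a - top_b, top_b - top_a, jaccard

# Illustrative label lists, ordered arbitrarily; not the actual model output
tweet_top = ["digitalisation", "climate", "occupation", "covid", "eu"]
speech_top = ["digitalisation", "climate", "occupation", "covid", "foreign policy"]

shared, only_tweets, only_speeches, jaccard = top_k_overlap(tweet_top, speech_top, k=5)
print("shared:", sorted(shared))
print("only in tweets:", sorted(only_tweets))
print("only in speeches:", sorted(only_speeches))
print("Jaccard index:", round(jaccard, 2))
```

Such a measure ignores how prominent a topic is within the top-k list, so it complements rather than replaces the inspection of the topic counts.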

### 4.1.3.2 Topics of AFD

When comparing the most prominent topics of politicians of the AFD, we can again identify differences in the topic distribution.

In [None]:
# Visualise top topics for tweets
tweets_processed_bert[tweets_processed_bert.party == "AFD"].groupby("topic").size().sort_values(ascending = False)[1:11]

In [None]:
# Visualise top topics for speeches
speeches_processed_bert[speeches_processed_bert.party == "AFD"].groupby("topic").size().sort_values(ascending = False)[1:11]

The largest topics in the Twitter presence of AFD politicians are police, migration, refugees and Muslims. This is in strong contrast to the topics studying, climate, constitution and digitalisation, which are the focus of most of their speeches in the Bundestag.

### 4.1.3.3 Topics of CDU

The most striking difference between the tweets and speeches of the CDU is the focus on climate and energy in the speeches, which is not present in the top topics of the tweets. Another interesting observation is the missing representation of the topic foreign policy in the top tweet topics.

In [None]:
# Visualise top topics for tweets
tweets_processed_bert[tweets_processed_bert.party == "CDU"].groupby("topic").size().sort_values(ascending = False)[1:11]

In [None]:
# Visualise top topics for speeches
speeches_processed_bert[speeches_processed_bert.party == "CDU"].groupby("topic").size().sort_values(ascending = False)[1:11]

### 4.1.3.4 Topics of FDP

The pattern of a missing representation of the topics foreign policy and armed conflict repeats when analysing the tweets of the FDP.

In [None]:
# Visualise top topics for tweets
tweets_processed_bert[tweets_processed_bert.party == "FDP"].groupby("topic").size().sort_values(ascending = False)[1:11]

In [None]:
# Visualise top topics for speeches
speeches_processed_bert[speeches_processed_bert.party == "FDP"].groupby("topic").size().sort_values(ascending = False)[1:11]

### 4.1.3.5 Topics of Grüne

The coherence between the topics of the tweets and the topics of the speeches is comparatively high for the party Die Grünen. One subject from the speeches that is not highly represented in the tweets is digitalisation.

In [None]:
# Visualise top topics for tweets
tweets_processed_bert[tweets_processed_bert.party == "Grüne"].groupby("topic").size().sort_values(ascending = False)[1:11]

In [None]:
# Visualise top topics for speeches
speeches_processed_bert[speeches_processed_bert.party == "Grüne"].groupby("topic").size().sort_values(ascending = False)[1:11]

### 4.1.3.6 Topics of Linke

The party Die Linke also has many overlapping topics in both media. However, the topic occupation and police, which is quite prevalent in the tweets, is barely represented in the speeches.

In [None]:
# Visualise top topics for tweets
tweets_processed_bert[tweets_processed_bert.party == "Linke"].groupby("topic").size().sort_values(ascending = False)[1:11]

In [None]:
# Visualise top topics for speeches
speeches_processed_bert[speeches_processed_bert.party == "Linke"].groupby("topic").size().sort_values(ascending = False)[1:11]

### 4.1.3.7 Topics of SPD

The SPD politicians give speeches about similar topics as they tweet about, with the common difference that they do not often discuss foreign affairs on Twitter.

In [None]:
# Visualise top topics for tweets
tweets_processed_bert[tweets_processed_bert.party == "SPD"].groupby("topic").size().sort_values(ascending = False)[1:11]

In [None]:
# Visualise top topics for speeches
speeches_processed_bert[speeches_processed_bert.party == "SPD"].groupby("topic").size().sort_values(ascending = False)[1:11]

### 4.1.3.8 Summary

Based on the previous subsections, we answer the third research question:

**How do the main topics of tweets and speeches of prominent politicians of the six parties in the German Bundestag differ in the period of the 19th Bundestag?**

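There are many similarities between the communication in the two media: the overall topic clusters largely match, and topics such as digitalisation, climate, occupation and the COVID-19 pandemic are prominent in both. The main differences lie in the relative focus. Foreign policy and armed conflict are prominent in speeches but largely absent from the top tweet topics, both overall and for several parties (CDU, FDP and SPD), whereas the topic EU, Europe and Euro is more prominent in tweets. For the AFD, the top topics of tweets and speeches contrast strongly.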

# 4.2 Topic model validation

To validate the results of the previous section, we use word and topic intrusion tests based on [Reading Tea Leaves: How Humans Interpret Topic Models](https://proceedings.neurips.cc/paper/2009/file/f92586a25bb3145facd64ab20fd554ff-Paper.pdf). We implement an annotation interface and evaluate the labels assigned by the two authors.

## 4.2.1 Word intrusion

Word intrusion measures the coherence of topics. For this, we show annotators five high-probability keywords of a particular topic and an intruder keyword from another topic and ask them to identify the intruder keyword. The model precision, as measured by the word intrusion score, is then defined as the number of times the intruder keyword was chosen divided by the number of topics shown.
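
As a minimal sketch of this definition, the score can be computed from two parallel lists: the true intruder word per shown topic and the word the annotator picked. The helper name `word_intrusion_precision` and the toy words are assumptions made for illustration only.

```python
def word_intrusion_precision(intruder_words, chosen_words):
    """Share of shown topics for which the annotator picked the true intruder."""
    if len(intruder_words) != len(chosen_words):
        raise ValueError("Both lists need one entry per shown topic")
    if not intruder_words:
        raise ValueError("At least one annotated topic is required")
    hits = sum(intruder == chosen for intruder, chosen in zip(intruder_words, chosen_words))
    return hits / len(intruder_words)

# Toy annotations: the intruder was found for three of four topics
true_intruders = ["auto", "schule", "steuer", "wald"]
annotator_choices = ["auto", "schule", "rente", "wald"]

precision = word_intrusion_precision(true_intruders, annotator_choices)
print("Word intrusion score:", precision)
```

A score close to 1 means the intruder is easy to spot, which suggests that the remaining keywords form a coherent topic.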

### 4.2.1.1 Define functions

Before we can execute the word intrusion task, we need to define a set of helper functions. We create a simple interface for this task that runs in the notebook cells.

In [None]:
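# Draw a random topic id that differs from the topic passed as `index`
import random

def choose_random_document(index, number_documents):
    # Topic ids run from -1 (outliers) to number_documents - 2, so the upper bound is exclusive
    rand_document = random.randrange(-1, number_documents - 1)
    while rand_document == index:
        rand_document = random.randrange(-1, number_documents - 1)
    return rand_document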

In [None]:
# Function for creating a word intrusion dataset
def create_word_intrusion_dataset(topic_model):
    number_documents = len(topic_model.get_topics())
    records_list = []
    for i in range(number_documents): 
        word_list = []
        for j in range(5):
            word_list.append(topic_model.get_topic(i-1)[j][0])
        intruder_word = topic_model.get_topic(choose_random_document(i-1, number_documents))[0][0]
        intruder_position = random.randrange(6)  # any of the six slots, including the last two
        word_list.insert(intruder_position, intruder_word)
        word_list.append(intruder_word)
        word_list.append(intruder_position)
        records_list.append(word_list)
    word_intrusion_df = pd.DataFrame.from_records(records_list)
    word_intrusion_df.columns = ["word_0", "word_1", "word_2", "word_3", "word_4", "word_5", 
                                 "intruder_word", "intruder_index"]
    return word_intrusion_df

In [None]:
# A function that divides the word intrusion dataset into separate sets for the annotators
def generate_annotator_set(df, number_label, number_iaa, name_1, name_2):
    length = df.shape[0]
    if 2*number_label + number_iaa > length:
        raise ValueError("Too many labels for the size of the dataframe")
    # Shuffle the rows, then flag consecutive blocks: annotator 1 only, shared IAA rows, annotator 2 only
    df_shuffled = df.sample(frac=1).reset_index(drop=True)
    df_shuffled[name_1] = [1] * (number_label+number_iaa) + [0] * (length-number_label-number_iaa)
    df_shuffled[name_2] = [0] * (number_label) + [1] * (number_label+number_iaa) + [0] * (length-2*number_label-number_iaa)
    df_shuffled["iaa_flag"] = [0] * number_label + [1] * number_iaa + [0] * (length-number_label-number_iaa)
    # Rows with wis_label == 1 count towards the intrusion score; the shared IAA rows are excluded
    df_shuffled["wis_label"] = [1] * number_label + [0] * number_iaa + [1] * (length-number_label-number_iaa)
    return df_shuffled
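
To illustrate how the flags above partition the shuffled rows, the following sketch reproduces only the row ranges, without pandas. The helper `annotator_row_ranges` is hypothetical, and the dataset length of 120 is a made-up example value; the label counts 45 and 11 are the ones used for the tweets word intrusion task.

```python
def annotator_row_ranges(length, number_label, number_iaa):
    """Return the row ranges labelled by annotator 1 only, by both, and by annotator 2 only."""
    if 2 * number_label + number_iaa > length:
        raise ValueError("Too many labels for the size of the dataframe")
    only_first = range(0, number_label)
    shared = range(number_label, number_label + number_iaa)
    only_second = range(number_label + number_iaa, 2 * number_label + number_iaa)
    return only_first, shared, only_second

# 45 individual labels per annotator plus 11 shared rows for inter-annotator agreement
only_first, shared, only_second = annotator_row_ranges(120, 45, 11)
rows_per_annotator = len(only_first) + len(shared)
print("Rows per annotator:", rows_per_annotator)
print("Shared rows:", shared.start, "to", shared.stop - 1)
print("Rows used only for the score:", len(only_first) + len(only_second))
```

Each annotator therefore labels the shared block plus an individual block, and only the individual blocks enter the intrusion score.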

In [None]:
# A function that offers an interface in Jupyter notebook for the word intrusion task
def word_intrusion_test(word_df, name, medium):
    intrusion_df = word_df[word_df[name] == 1].reset_index(drop = True)
    
    max_count = intrusion_df.shape[0]
    global i
    i = 0
    
    button_0 = widgets.Button(description = intrusion_df.word_0[i])
    button_1 = widgets.Button(description = intrusion_df.word_1[i])
    button_2 = widgets.Button(description = intrusion_df.word_2[i])
    button_3 = widgets.Button(description = intrusion_df.word_3[i])
    button_4 = widgets.Button(description = intrusion_df.word_4[i])
    button_5 = widgets.Button(description = intrusion_df.word_5[i])


    chosen_words = []
    chosen_positions= []

    display("Word Intrusion Test")

    f = IntProgress(min=0, max=max_count)
    display(f)

    display(button_0)
    display(button_1)
    display(button_2)
    display(button_3)
    display(button_4)
    display(button_5)


    def btn_eventhandler(position, obj):
        global i 
        i += 1
        
        
        clear_output(wait=True)
        
        display("Word Intrusion Test")
        display(f)
        f.value += 1
        
        choosen_text = obj.description
        chosen_words.append(choosen_text)
        
        chosen_positions.append(position)
        
        if i < max_count:

            button_0 = widgets.Button(description = intrusion_df.word_0[i])
            button_1 = widgets.Button(description = intrusion_df.word_1[i])
            button_2 = widgets.Button(description = intrusion_df.word_2[i])
            button_3 = widgets.Button(description = intrusion_df.word_3[i])
            button_4 = widgets.Button(description = intrusion_df.word_4[i])
            button_5 = widgets.Button(description = intrusion_df.word_5[i])
            
            display(button_0)
            display(button_1)
            display(button_2)
            display(button_3)
            display(button_4)
            display(button_5)
            
            button_0.on_click(partial(btn_eventhandler,0))
            button_1.on_click(partial(btn_eventhandler,1))
            button_2.on_click(partial(btn_eventhandler,2))
            button_3.on_click(partial(btn_eventhandler,3))
            button_4.on_click(partial(btn_eventhandler,4))
            button_5.on_click(partial(btn_eventhandler,5))
        else:
            print ("Thanks " + name + " you finished all the work!")
            intrusion_df["chosen_word"] = chosen_words
            intrusion_df["chosen_position"] = chosen_positions
            intrusion_df.to_csv("../data/processed/word_intrusion_test_" + name + "_" + medium + ".csv", index = False)



    button_0.on_click(partial(btn_eventhandler,0))
    button_1.on_click(partial(btn_eventhandler,1))
    button_2.on_click(partial(btn_eventhandler,2))
    button_3.on_click(partial(btn_eventhandler,3))
    button_4.on_click(partial(btn_eventhandler,4))
    button_5.on_click(partial(btn_eventhandler,5))
    
    return intrusion_df

In [None]:
# Calculate the word intrusion score for the two annotator sets
def calculate_word_intrusion(name_1, name_2, medium):
    df_word_intrusion_1 = pd.read_csv("../data/processed/word_intrusion_test_" + name_1 + "_" + medium + ".csv")
    df_word_intrusion_2 = pd.read_csv("../data/processed/word_intrusion_test_" + name_2 + "_" + medium + ".csv")
    # Inter-annotator agreement on the rows that both annotators labelled
    iaa_values_1 = df_word_intrusion_1[df_word_intrusion_1.iaa_flag == 1].chosen_position.values
    iaa_values_2 = df_word_intrusion_2[df_word_intrusion_2.iaa_flag == 1].chosen_position.values
    kappa = cohen_kappa_score(iaa_values_1, iaa_values_2)
    # DataFrame.append was removed in pandas 2.0, so we concatenate instead
    df_word_intrusion = pd.concat([df_word_intrusion_1, df_word_intrusion_2], ignore_index=True)
    # Copy the filtered rows so that adding a column does not write to a slice
    df_word = df_word_intrusion[df_word_intrusion["wis_label"] == 1].copy()
    df_word["intruder_chosen"] = df_word["intruder_word"] == df_word["chosen_word"]
    return df_word["intruder_chosen"].mean(), kappa

### 4.2.1.2 Validation of tweets topic model

Based on the above defined functions, we are going to execute the word intrusion task for the tweets BERTopic model. The annotation is done by the two authors.

In [None]:
# Load model
topic_model_tweets = BERTopic.load("../models/bertopic_tweets")

In [None]:
# Create candidate dataset
word_intrusion_dataset_tweets = create_word_intrusion_dataset(topic_model_tweets)

In [None]:
# Create label dataset for two annotators
word_intrusion_dataset_tweets_label = generate_annotator_set(word_intrusion_dataset_tweets, 45, 11, "Jakob",
                                                             "Stjepan")

In [None]:
# Execute annotation for first candidate
# Uncomment if annotation is repeated
# df_word_intrusion_jakob_tweets = word_intrusion_test(word_intrusion_dataset_tweets_label, "Jakob", "Tweets")

In [None]:
# Execute annotation for second candidate
# Uncomment if annotation is repeated
# df_word_intrusion_stjepan_tweets = word_intrusion_test(word_intrusion_dataset_tweets_label, "Stjepan", "Tweets")

In [None]:
# Calculate intrusion score and Cohen's kappa
word_intrusion_score_tweets, word_kappa_tweets = calculate_word_intrusion("Jakob", "Stjepan", "Tweets")

In [None]:
# Cohen's kappa
print("Cohen's kappa is: " + str(round(word_kappa_tweets,2)))

Our inter-annotator agreement is at a satisfactory level, which indicates good consensus between our annotations.
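
For readers unfamiliar with the measure, Cohen's kappa corrects the observed agreement between two annotators for the agreement expected by chance. The following standard-library sketch shows the calculation on toy data; it is not the implementation used above (we call `cohen_kappa_score`), and the function name `cohen_kappa` and the example positions are assumptions for illustration.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two equally long lists of categorical labels."""
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("Both annotators need the same, non-zero number of labels")
    n = len(labels_a)
    # Observed agreement: share of items with identical labels
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the annotators' label frequencies, summed over labels
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return float("nan")  # kappa is undefined if both annotators always use one identical label
    return (observed - expected) / (1 - expected)

# Toy example: chosen button positions of two annotators on six shared items
positions_a = [0, 2, 5, 1, 3, 3]
positions_b = [0, 2, 4, 1, 3, 0]

kappa = cohen_kappa(positions_a, positions_b)
print("Cohen's kappa:", round(kappa, 2))
```

In this toy example the annotators agree on four of six items, and the chance-corrected agreement is lower than the raw agreement of 0.67.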

In [None]:
# Intrusion score
print("The word intrusion score is: " + str(round(word_intrusion_score_tweets,2)))

We see a good intrusion score, as many of the intruder words were detected. These results could be improved by addressing the identified limitations of our model.

### 4.2.1.3 Validation of speeches topic model

In [None]:
# Load model
topic_model_speeches = BERTopic.load("../models/bertopic_speeches")

In [None]:
# Create candidate dataset
word_intrusion_dataset_speeches = create_word_intrusion_dataset(topic_model_speeches)

In [None]:
# Create label dataset for two annotators
word_intrusion_dataset_speeches_label = generate_annotator_set(word_intrusion_dataset_speeches, 10, 5, "Jakob",
                                                             "Stjepan")

In [None]:
# Execute annotation for first candidate
# df_word_intrusion_jakob_speeches = word_intrusion_test(word_intrusion_dataset_speeches_label, "Jakob", "Speeches")

In [None]:
# Execute annotation for second candidate
# df_word_intrusion_stjepan_speeches = word_intrusion_test(word_intrusion_dataset_speeches_label, "Stjepan", "Speeches")

In [None]:
# Calculate intrusion score and Cohen's kappa
word_intrusion_score_speeches, word_kappa_speeches = calculate_word_intrusion("Jakob", "Stjepan", "Speeches")

In [None]:
# Cohen's kappa
print("Cohen's kappa is: " + str(round(word_kappa_speeches,2)))

Our inter-annotator agreement is at a satisfactory level, which indicates good consensus between our annotations.

In [None]:
# Intrusion score
print("The word intrusion score is: " + str(round(word_intrusion_score_speeches,2)))

We see a good intrusion score, as many of the intruder words were detected. These results could be improved by addressing the identified limitations of our model.

## 4.2.2 Topic Intrusion

By measuring the topic intrusion score, we test whether the model's probability distribution over topics for a document matches human assessment. For this, we show an excerpt of the document, the three topics with the highest probability for this document and a random lower-probability topic as the intruder. To calculate the topic intrusion score, we take the mean of the differences between the log probability of the true intruder topic and the log probability of the topic selected by the annotator.
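
The score described above can be sketched in a few lines. This is a simplified illustration with made-up probabilities and a hypothetical helper name `topic_log_odds`; it assumes all probabilities are strictly positive, since `math.log` raises an error for zero.

```python
import math

def topic_log_odds(intruder_probabilities, chosen_probabilities):
    """Mean of log(p_intruder) - log(p_chosen) over all annotated documents."""
    if len(intruder_probabilities) != len(chosen_probabilities):
        raise ValueError("Both lists need one entry per annotated document")
    scores = [
        math.log(p_intruder) - math.log(p_chosen)
        for p_intruder, p_chosen in zip(intruder_probabilities, chosen_probabilities)
    ]
    return sum(scores) / len(scores)

# Toy example: the intruder is found for the first document (difference 0),
# while a ten times more probable topic is picked for the second document
intruder_probabilities = [0.01, 0.02]
chosen_probabilities = [0.01, 0.2]

score = topic_log_odds(intruder_probabilities, chosen_probabilities)
print("Topic intrusion score:", round(score, 2))
```

Whenever the annotator picks the true intruder, the difference is zero, so a score closer to zero indicates better agreement between the model's topic probabilities and the human judgement.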

In [None]:
# Create a function that combines key words into a single string
def create_topic_string(topic_info):
    word_list = []
    for i in range(8):
        word_list.append(topic_info[i][0])
    return ", ".join(word_list)

In [None]:
# Create a function that prepares the topic intrusion dataset
def create_topic_intrusion_dataset(data, topic_model, topic_probabilities, test_number = 100):
    number_documents = data.shape[0]
    if number_documents < test_number:
        raise ValueError("You can only choose as many tests as there are documents!")
    number_topics = len(topic_model.get_topics())
    records_list = []
    for i in range(test_number):
        # Rank all topics for document i by descending probability
        ranked_topics = sorted(zip(topic_probabilities[i].tolist(), list(range(number_topics))), reverse=True)
        high_probability_documents = ranked_topics[:3]
        low_probability_documents = ranked_topics[3:]
        topic_list = [create_topic_string(topic_model.get_topic(topic_index))
                      for _, topic_index in high_probability_documents]
        probability_list = [probability for probability, _ in high_probability_documents]
        # Draw the intruder at random from the topics outside the top three
        intruder_document = random.choice(low_probability_documents)
        intruder_topic = create_topic_string(topic_model.get_topic(intruder_document[1]))
        # Insert the intruder topic and its probability at the same random slot
        intruder_position = random.randrange(4)
        topic_list.insert(intruder_position, intruder_topic)
        probability_list.insert(intruder_position, intruder_document[0])
        # Positional access keeps the text aligned with row i of topic_probabilities
        records_list.append(topic_list + probability_list +
                            [intruder_topic, intruder_document[0], intruder_position, data["text"].iloc[i]])
    df = pd.DataFrame.from_records(records_list)
    df.columns = ["topic_0", "topic_1", "topic_2", "topic_3","probability_topic_0","probability_topic_1",
                  "probability_topic_2","probability_topic_3", "intruder_topic", "intruder_topic_probability",
                  "intruder_index", "text"]
    return df

In [None]:
# Create a function that generate the interface for the topic intrusion test
def topic_intrusion_test(intrusion_df, name, medium):
    intrusion_df = intrusion_df[intrusion_df[name] == 1].reset_index(drop = True)
    
    max_count = intrusion_df.shape[0]
    global i
    i = 0
    
    layout = widgets.Layout(width='auto')

    button_0 = widgets.Button(description = intrusion_df.topic_0[i], layout = layout)
    button_1 = widgets.Button(description = intrusion_df.topic_1[i], layout = layout)
    button_2 = widgets.Button(description = intrusion_df.topic_2[i], layout = layout)
    button_3 = widgets.Button(description = intrusion_df.topic_3[i], layout = layout)
    
    chosen_elements = []
    chosen_positions = []
    chosen_probabilities = []

    display("Topic Intrusion Test")

    f = IntProgress(min=0, max=max_count)
    display(f)
    
    if len(intrusion_df.text[i]) < 1100:
        display(intrusion_df.text[i][0:1100])
    else :
        display(intrusion_df.text[i][100:1100])

    display(button_0)
    display(button_1)
    display(button_2)
    display(button_3)


    def btn_eventhandler(position, column, obj):
        
        global i
        
        clear_output(wait=True)
        
        display("Topic Intrusion Test")
        display(f)
        f.value += 1
                
        choosen_text = obj.description
        chosen_elements.append(choosen_text)
        chosen_positions.append(position)
        chosen_probabilities.append(intrusion_df[column][i])
        
        i += 1
        
        if i < max_count:

            button_0 = widgets.Button(description = intrusion_df.topic_0[i], layout = layout)
            button_1 = widgets.Button(description = intrusion_df.topic_1[i], layout = layout)
            button_2 = widgets.Button(description = intrusion_df.topic_2[i], layout = layout)
            button_3 = widgets.Button(description = intrusion_df.topic_3[i], layout = layout)
            
            if len(intrusion_df.text[i]) < 1100:
                display(intrusion_df.text[i][0:1100])
            else :
                display(intrusion_df.text[i][100:1100])
            
            display(button_0)
            display(button_1)
            display(button_2)
            display(button_3)
            
            button_0.on_click(partial(btn_eventhandler,0,"probability_topic_0"))
            button_1.on_click(partial(btn_eventhandler,1,"probability_topic_1"))
            button_2.on_click(partial(btn_eventhandler,2,"probability_topic_2"))
            button_3.on_click(partial(btn_eventhandler,3,"probability_topic_3"))
        else:
            print ("Thanks " + name + " you finished all the work!")
            intrusion_df["chosen_topic"] = chosen_elements
            intrusion_df["chosen_position"] = chosen_positions
            intrusion_df["chosen_topic_probability"] = chosen_probabilities
            intrusion_df.to_csv("../data/processed/topic_intrusion_test_" + name + "_" + medium + ".csv", index = False)



    button_0.on_click(partial(btn_eventhandler,0,"probability_topic_0"))
    button_1.on_click(partial(btn_eventhandler,1,"probability_topic_1"))
    button_2.on_click(partial(btn_eventhandler,2,"probability_topic_2"))
    button_3.on_click(partial(btn_eventhandler,3,"probability_topic_3"))
    
    return intrusion_df

In [None]:
# Create a function to calculate the topic intrusion score
def calculate_topic_intrusion(name_1, name_2, medium):
    df_topic_intrusion_1 = pd.read_csv("../data/processed/topic_intrusion_test_" + name_1 + "_" + medium + ".csv")
    df_topic_intrusion_2 = pd.read_csv("../data/processed/topic_intrusion_test_" + name_2 + "_" + medium + ".csv")
    # Inter-annotator agreement on the rows that both annotators labelled
    iaa_values_1 = df_topic_intrusion_1[df_topic_intrusion_1.iaa_flag == 1].chosen_position.values
    iaa_values_2 = df_topic_intrusion_2[df_topic_intrusion_2.iaa_flag == 1].chosen_position.values
    kappa = cohen_kappa_score(iaa_values_1, iaa_values_2)
    # DataFrame.append was removed in pandas 2.0, so we concatenate instead
    df_topic_intrusion = pd.concat([df_topic_intrusion_1, df_topic_intrusion_2], ignore_index=True)
    # Copy the filtered rows so that adding a column does not write to a slice
    df_topic = df_topic_intrusion[df_topic_intrusion["wis_label"] == 1].copy()
    df_topic["intruder_score"] = np.log(df_topic["intruder_topic_probability"]) - np.log(df_topic["chosen_topic_probability"])
    return df_topic["intruder_score"].mean(), kappa

### 4.2.2.1 Validation of tweets topic model

In the first step, we calculate the validation score for the tweets BERTopic model.

In [None]:
# Load data
with open( "../data/processed/tweets_processed_bert.pickle", "rb" ) as handle:
    tweets_processed_bert = pickle.load(handle)
with open('../data/processed/probabilities_tweets_bert.pickle', 'rb') as handle:
    topic_probabilities_tweets = pickle.load(handle)

In [None]:
# Load model
# topic_model_tweets = BERTopic.load("../models/bertopic_tweets")

In [None]:
# Create candidate dataset
topic_intrusion_dataset_tweets = create_topic_intrusion_dataset(tweets_processed_bert, topic_model_tweets,
                                                               topic_probabilities_tweets, test_number = 100)

In [None]:
# Create label dataset for two annotators
topic_intrusion_dataset_tweets_label = generate_annotator_set(topic_intrusion_dataset_tweets, 40, 10, "Jakob",
                                                             "Stjepan")

In [None]:
# Execute annotation for first candidate
# df_topic_intrusion_jakob_tweets = topic_intrusion_test(topic_intrusion_dataset_tweets_label, "Jakob", "Tweets")

In [None]:
# Execute annotation for second candidate
# df_topic_intrusion_stjepan_tweets = topic_intrusion_test(topic_intrusion_dataset_tweets_label, "Stjepan", "Tweets")

In [None]:
# Calculate intrusion score and Cohen's kappa
topic_intrusion_score_tweets, topic_kappa_tweets = calculate_topic_intrusion("Jakob", "Stjepan", "Tweets")

In [None]:
# Cohen's kappa
print("Cohen's kappa is: " + str(round(topic_kappa_tweets,2)))

Our inter-annotator agreement is at a satisfactory level, which indicates good consensus between our annotations.

In [None]:
# Intrusion score
print("The topic intrusion score is: " + str(round(topic_intrusion_score_tweets,2)))

It is difficult to evaluate the resulting topic intrusion score objectively. However, compared with the results reported in the article, we can infer that this score is at least satisfactory and supports the validity of our model.

### 4.2.2.2 Validation of speeches topic model

In [None]:
# Load data
with open( "../data/processed/speeches_processed_bert.pickle", "rb" ) as handle:
    speeches_processed_bert = pickle.load(handle).reset_index(drop = True)
with open('../data/processed/probabilities_speeches_bert.pickle', 'rb') as handle:
    topic_probabilities_speeches = pickle.load(handle)

In [None]:
# Load model
# topic_model_speeches = BERTopic.load("../models/bertopic_speeches")

In [None]:
# Create candidate dataset
topic_intrusion_dataset_speeches = create_topic_intrusion_dataset(speeches_processed_bert, topic_model_speeches,
                                                               topic_probabilities_speeches, test_number = 100)

In [None]:
# Create label dataset for two annotators
topic_intrusion_dataset_speeches_label = generate_annotator_set(topic_intrusion_dataset_speeches, 40, 10, "Jakob",
                                                             "Stjepan")

In [None]:
# Execute annotation for first candidate
# df_topic_intrusion_jakob_speeches = topic_intrusion_test(topic_intrusion_dataset_speeches_label, "Jakob", "Speeches")

In [None]:
# Execute annotation for second candidate
# df_topic_intrusion_stjepan_speeches = topic_intrusion_test(topic_intrusion_dataset_speeches_label, "Stjepan", "Speeches")

In [None]:
# Calculate intrusion score and Cohen's kappa
topic_intrusion_score_speeches, topic_kappa_speeches = calculate_topic_intrusion("Jakob", "Stjepan", "Speeches")

In [None]:
# Cohen's kappa
print("Cohen's kappa is: " + str(round(topic_kappa_speeches,2)))

The inter-annotator agreement on this task is rather low. We expected this, because the speeches are generally quite long and it is difficult to infer the right topics from a short excerpt.

In [None]:
# Intrusion score
print("The topic intrusion score is: " + str(round(topic_intrusion_score_speeches,2)))

It is difficult to evaluate the resulting topic intrusion score objectively. However, compared with the results reported in the article, we can infer that this score is at least satisfactory and supports the validity of our model.

## 4.2.3 Conclusion

Based on the topic and word intrusion measures evaluated in this section, we can infer a satisfactory validity of our models. There are several possibilities for improvement, and we identified several limitations in the results section; however, the models still offer noteworthy insights.

# 4.3 Result analysis sentiment analysis (Stjepan)

**Needs to be added** 

# 4.4 Validation sentiment analysis (Stjepan)

**Needs to be added** 

# 5 Discussion

**Needs to be added** 

# 5.1 Discussion topic modelling (Jakob)

We discuss

* Model results
* Data quality 
* Model quality
* Model validity

* Limitations of our approach
* Validity of the results
* Next possible steps

# 5.2 Discussion sentiment analysis (Stjepan)

**Needs to be added** 

# 6. Bibliography (Stjepan)

**Needs to be added** 