In [None]:
# 4 Results

# 4.1 Topic modelling results (Jakob)

Based on the model selection and creation in section 3.5, we will now analyse the results to answer our first three research questions:

* **RQ 1.1** What are the main topics of tweets of prominent politicians of the six parties in the German Parliament in the period of the 19th Bundestag?

* **RQ 1.2** What are the main topics of speeches of prominent politicians of the six parties in the German Parliament in the period of the 19th Bundestag?

* **RQ 1.3** How do the main topics of tweets and speeches of prominent politicians of the six parties in the German Parliament differ in the period of the 19th Bundestag?

For this, we visualise the results and take a closer look at several topics. An exhaustive interpretation of all topics of our models would be out of scope for this work. We still provide the code for an exhaustive analysis so that interested readers can run the analysis on their own.

## 4.1.1 Analyse tweets model

We use the trained BERTopic model for tweets from the last section to answer the first research question. We load the pre-trained model and the resulting data. If the model is retrained, one can skip this step.

In [None]:
# Load data (imports can be skipped if already loaded earlier in the notebook)
import pickle
import pandas as pd
from bertopic import BERTopic

with open("../data/processed/tweets_processed_bert.pickle", "rb") as handle:
    tweets_processed_bert = pickle.load(handle)
docs_tweets = tweets_processed_bert.text_preprocessed_sentence.tolist()
with open("../data/processed/probabilities_tweets_bert.pickle", "rb") as handle:
    probs_tweets = pickle.load(handle)
with open("../data/processed/topics_tweets_bert.pickle", "rb") as handle:
    topics_tweets = pickle.load(handle)

In [None]:
# Load model
topic_model_tweets = BERTopic.load("../models/bertopic_tweets")

### 4.1.1.1 Overview of topics

We reduced the model to 100 topics. This increased the coherence but also created one large, unspecific topic that yields no insights. To avoid this problem, we would need to optimise the number of topics and the preprocessing as hyperparameters, which is out of scope because of computational restrictions. We will now focus on topics that can be interpreted.
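To quantify how dominant the large topic is, one can compute its share of all documents. The following is a minimal sketch on hypothetical counts; it assumes a data frame with the columns `Topic` and `Count`, as we expect from `get_topic_info()`, and the helper name `largest_topic_share` is our own.

```python
import pandas as pd

def largest_topic_share(topic_info):
    """Return (topic id, share of all documents) for the largest topic."""
    largest = topic_info.sort_values("Count", ascending=False).iloc[0]
    return int(largest["Topic"]), float(largest["Count"] / topic_info["Count"].sum())

# Hypothetical counts, not taken from the trained model
toy_info = pd.DataFrame({"Topic": [-1, 0, 1], "Count": [50, 30, 20]})
topic_id, share = largest_topic_share(toy_info)
print(topic_id, round(share, 2))
```

With the real model, `toy_info` would be replaced by `topic_model_tweets.get_topic_info()`.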

In [None]:
topic_model_tweets.get_topic_info().head(10)

In [None]:
topic_model_tweets.get_topic_info().tail(10)

In [None]:
topic_model_tweets.visualize_barchart(topics=None, top_n_topics=25, n_words=5, width=250, height=250)

Even though we have already identified some weaknesses in the model, we can see that there are many exciting topics that we can use for further analysis.

### 4.1.1.2 Visualise topic correlation

Another critical quality indicator is the similarity of the topics. If we have very similar topics, they will not be very selective and lead to skewed topic distributions.

In [None]:
topic_model_tweets.visualize_heatmap(top_n_topics=100, width=800, height=800)

We can see a substantial similarity between some topics. Interestingly, these are topics that do not seem to be well defined and do not show high internal coherence from a human perspective. More sophisticated preprocessing and hyperparameter optimisation could mitigate this problem. To avoid distorted topics, our further analysis focuses on topics that are not highly correlated with other topics and that seem to have a coherent meaning.
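The visual inspection of the heatmap can be complemented by listing the topic pairs whose similarity exceeds a threshold. The sketch below uses cosine similarity on a small hypothetical matrix of topic vectors; the threshold of 0.9 and the helper name `similar_topic_pairs` are assumptions for illustration and not part of our pipeline.

```python
import numpy as np

def similar_topic_pairs(topic_vectors, threshold=0.9):
    """Return (i, j, similarity) for all topic pairs at or above the threshold."""
    vectors = np.asarray(topic_vectors, dtype=float)
    # Normalise rows so that the dot product equals the cosine similarity
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sim = unit @ unit.T
    # Only inspect the upper triangle to skip self-pairs and duplicates
    rows, cols = np.triu_indices(len(vectors), k=1)
    return [(int(i), int(j), float(sim[i, j]))
            for i, j in zip(rows, cols) if sim[i, j] >= threshold]

# Hypothetical topic vectors, not taken from the trained model
toy_vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
pairs = similar_topic_pairs(toy_vectors, threshold=0.9)
print(pairs)
```

Topics that appear in many such pairs are candidates for exclusion or merging.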

### 4.1.1.3 Visualise topic hierarchy

To analyse the topic clusters of the resulting BERTopic model, we use the inherent clustering of the model to identify significant clusters that we then analyse in more detail.

In [None]:
# Visualise topic hierarchy
topic_model_tweets.visualize_hierarchy(top_n_topics=100)

In [None]:
# Visualise topic distance map
topic_model_tweets.visualize_topics(top_n_topics=100)

We identified twelve larger topic clusters based on the clustering and our evaluation. We will analyse three of the clusters in detail, while the other clusters are described briefly and the code for a more in-depth analysis is provided. We only selected clusters that contain topics with political and societal relevance. This limitation excludes topics that only comprise interpersonal relationship building and small talk. There are many more topics and clusters that we could cover, but this is out of scope for this work.

### 4.1.1.4 Analyse topics

In [None]:
# Prepare time based visualisation
tweets_topics_over_time = topic_model_tweets.topics_over_time(docs_tweets, topics_tweets,
                                                       pd.to_datetime(tweets_processed_bert.date).dt.strftime('%Y-%m'),
                                                       nr_bins=None, datetime_format=None, evolution_tuning=True,
                                                       global_tuning=True)

#### 4.1.1.4.1 Cluster migration

The first cluster covers migration and contains topics 15, 19, 53 and 59. It includes the subjects migration, asylum, refugees and family reunions. We analyse this cluster in detail to better understand the subject area.

In [None]:
# Define cluster
cluster_1_migration = [15, 19, 53, 59]

In [None]:
# Visualise topic hierarchy
topic_model_tweets.visualize_hierarchy(top_n_topics=100, topics = cluster_1_migration)

In [None]:
# Analyse the cluster over time
topic_model_tweets.visualize_topics_over_time(tweets_topics_over_time, topics=cluster_1_migration)

The frequency of tweets on migration and asylum peaked in the second half of 2018 and decreased afterwards. This peak coincides with the discussion about the [global compact for migration](https://refugeesmigrants.un.org/migration-compact) of the United Nations and other debates about immigration and asylum in this period.
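The peak can also be located programmatically by summing the frequencies of all topics in a cluster per time bin. The sketch below runs on a small hypothetical frame and assumes the columns `Topic`, `Frequency` and `Timestamp`, as we expect from `topics_over_time`; the helper name `cluster_frequency_by_month` is our own.

```python
import pandas as pd

def cluster_frequency_by_month(topics_over_time, cluster_topics):
    """Sum the frequency of all topics in the cluster per timestamp."""
    in_cluster = topics_over_time[topics_over_time["Topic"].isin(cluster_topics)]
    return in_cluster.groupby("Timestamp")["Frequency"].sum().sort_index()

# Hypothetical frequencies, not taken from the trained model
toy_over_time = pd.DataFrame({
    "Topic": [15, 19, 15, 19, 2],
    "Frequency": [10, 5, 30, 20, 99],
    "Timestamp": ["2018-06", "2018-06", "2018-11", "2018-11", "2018-11"],
})
monthly = cluster_frequency_by_month(toy_over_time, [15, 19])
peak_month = monthly.idxmax()
print(peak_month, monthly.max())
```

Topic 2 is ignored because it is not part of the cluster, so the peak reflects only the cluster's own topics.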

In [None]:
# See the party distribution of the cluster
tweets_cluster_1_migration =  tweets_processed_bert[tweets_processed_bert.topic_id.isin(cluster_1_migration)]
print(tweets_cluster_1_migration.groupby("party").size().sort_values(ascending = False))
print("\n")
print(tweets_processed_bert.groupby("party").size().sort_values(ascending = False))

The AfD tweets noticeably more about migration and asylum than the other parties, relative to its overall number of tweets. The distribution over the remaining parties seems to be proportional to their overall number of tweets.
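Comparing the two printed tables by eye is error-prone. A minimal sketch of how the cluster counts could be normalised by the overall number of tweets per party is shown here; it uses hypothetical data with the column names `party` and `topic_id` from our data frame, and the helper name `party_share_in_cluster` is our own.

```python
import pandas as pd

def party_share_in_cluster(df, cluster_topics):
    """Share of each party's documents that fall into the given cluster."""
    total = df.groupby("party").size()
    in_cluster = df[df["topic_id"].isin(cluster_topics)].groupby("party").size()
    # Parties without any document in the cluster get a share of 0
    share = in_cluster.reindex(total.index, fill_value=0) / total
    return share.sort_values(ascending=False)

# Hypothetical data: party A has 4 tweets, party B has 10
toy_tweets = pd.DataFrame({
    "party": ["A"] * 4 + ["B"] * 10,
    "topic_id": [15, 19, 1, 2] + [15] + [1] * 9,
})
shares = party_share_in_cluster(toy_tweets, [15, 19, 53, 59])
print(shares)
```

In this toy example party A has fewer cluster-relevant tweets in relation to party B's total volume, yet a much higher share, which is the kind of effect the raw counts can hide.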

In [None]:
# See the most prominent politicians of the cluster
tweets_cluster_1_migration.groupby("full_name").size().sort_values(ascending = False).head(10)

The distribution of the politicians seems to correlate with the identified distribution of the parties. An interesting next step could be to investigate the sentiment of the different politicians and parties for the topic.

#### 4.1.1.4.2 Cluster media

The topics 0, 44, 78, 79, 88 and 98 form the cluster media. The cluster comprises social media, press, and other communication media subjects.

In [None]:
cluster_2_media = [0, 44, 78, 79, 88, 98]

In [None]:
# Analyse the cluster over time
# Uncomment if one wants to analyse the cluster
# topic_model_tweets.visualize_topics_over_time(tweets_topics_over_time, topics=cluster_2_media)

In [None]:
# See the party distribution of the cluster
# Uncomment if one wants to analyse the cluster
# tweets_cluster_2_media =  tweets_processed_bert[tweets_processed_bert.topic_id.isin(cluster_2_media)]
# print(tweets_cluster_2_media.groupby("party").size().sort_values(ascending = False))
# print("\n")
# print(tweets_processed_bert.groupby("party").size().sort_values(ascending = False))

In [None]:
# See the most prominent politicians of the cluster
# Uncomment if one wants to analyse the cluster
# tweets_cluster_2_media.groupby("full_name").size().sort_values(ascending = False).head(10)

#### 4.1.1.4.3 Cluster extremism and religion

The next cluster comprises topics 36, 47, 73 and 76, which deal with extremism and religion.

In [None]:
cluster_3_extremism_religion  = [36, 47, 73, 76]

In [None]:
# Analyse the cluster over time
# Uncomment if one wants to analyse the cluster
# topic_model_tweets.visualize_topics_over_time(tweets_topics_over_time, topics=cluster_3_extremism_religion)

In [None]:
# See the party distribution of the cluster
# Uncomment if one wants to analyse the cluster
# tweets_cluster_3_extremism_religion =  tweets_processed_bert[tweets_processed_bert.topic_id.isin(cluster_3_extremism_religion)]
# print(tweets_cluster_3_extremism_religion.groupby("party").size().sort_values(ascending = False))
# print("\n")
# print(tweets_processed_bert.groupby("party").size().sort_values(ascending = False))

In [None]:
# See the most prominent politicians of the cluster
# Uncomment if one wants to analyse the cluster
# tweets_cluster_3_extremism_religion.groupby("full_name").size().sort_values(ascending = False).head(10)

#### 4.1.1.4.4 Cluster foreign politics and armed conflicts

The fourth cluster combines the topics 22, 32, 41, 56, 89 and 93. The main issues of this cluster are armed conflicts and defence topics.

In [None]:
cluster_4_foreign_politics_armed_conflicts = [22, 32, 41, 56, 89, 93]

In [None]:
# Analyse the cluster over time
# Uncomment if one wants to analyse the cluster
# topic_model_tweets.visualize_topics_over_time(tweets_topics_over_time, topics=cluster_4_foreign_politics_armed_conflicts)

In [None]:
# See the party distribution of the cluster
# Uncomment if one wants to analyse the cluster
# tweets_cluster_4_foreign_politics_armed_conflicts =  tweets_processed_bert[tweets_processed_bert.topic_id.isin(cluster_4_foreign_politics_armed_conflicts)]
# print(tweets_cluster_4_foreign_politics_armed_conflicts.groupby("party").size().sort_values(ascending = False))
# print("\n")
# print(tweets_processed_bert.groupby("party").size().sort_values(ascending = False))

In [None]:
# See the most prominent politicians of the cluster
# Uncomment if one wants to analyse the cluster
# tweets_cluster_4_foreign_politics_armed_conflicts.groupby("full_name").size().sort_values(ascending = False).head(10)

#### 4.1.1.4.5 Cluster discrimination

Another prominent topic area is discrimination and racism, which we combined in the fifth cluster with topics 13, 23, 37, 40 and 72.

In [None]:
cluster_5_discrimination = [13, 23, 37, 40, 72]

In [None]:
# Analyse the cluster over time
# Uncomment if one wants to analyse the cluster
# topic_model_tweets.visualize_topics_over_time(tweets_topics_over_time, topics=cluster_5_discrimination)

In [None]:
# See the party distribution of the cluster
# Uncomment if one wants to analyse the cluster
# tweets_cluster_5_discrimination =  tweets_processed_bert[tweets_processed_bert.topic_id.isin(cluster_5_discrimination)]
# print(tweets_cluster_5_discrimination.groupby("party").size().sort_values(ascending = False))
# print("\n")
# print(tweets_processed_bert.groupby("party").size().sort_values(ascending = False))

In [None]:
# See the most prominent politicians of the cluster
# Uncomment if one wants to analyse the cluster
# tweets_cluster_5_discrimination.groupby("full_name").size().sort_values(ascending = False).head(10)

#### 4.1.1.4.6 Cluster Covid-19

The Covid-19 cluster comprises topics 8, 9, 29, 54, 65, 71 and 90. We analyse this cluster in more detail.

In [None]:
cluster_6_covid = [8, 9, 29, 54, 65, 71, 90]

In [None]:
# Analyse the cluster over time
topic_model_tweets.visualize_topics_over_time(tweets_topics_over_time, topics=cluster_6_covid)

The time series of the cluster can be easily related to the development of the worldwide pandemic situation. We have a higher frequency of tweets in times of high infections and restrictions and fewer tweets in summer when the situation is more relaxed.

In [None]:
# See the party distribution of the cluster
tweets_cluster_6_covid =  tweets_processed_bert[tweets_processed_bert.topic_id.isin(cluster_6_covid)]
print(tweets_cluster_6_covid.groupby("party").size().sort_values(ascending = False))
print("\n")
print(tweets_processed_bert.groupby("party").size().sort_values(ascending = False))

The SPD has a much higher number of tweets in this cluster than the other parties. This difference can be explained by the number of tweets of the prominent SPD politician Karl Lauterbach, as the following code cell shows. A deeper analysis of his tweets, television and other media appearances could help to better understand his political career.
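How strongly a single politician drives a party's count can be expressed as that person's share of the party's tweets in the cluster. The sketch below uses hypothetical names and counts together with the column names `party`, `full_name` and `topic_id` from our data frame; the helper name `top_author_share` is our own.

```python
import pandas as pd

def top_author_share(df, cluster_topics, party):
    """Return the most active author of a party in a cluster and their share."""
    mask = df["topic_id"].isin(cluster_topics) & (df["party"] == party)
    counts = df[mask].groupby("full_name").size().sort_values(ascending=False)
    if counts.empty:
        return None, 0.0
    return counts.index[0], float(counts.iloc[0] / counts.sum())

# Hypothetical data, not taken from our dataset
toy_authors = pd.DataFrame({
    "party": ["X"] * 8 + ["Y"] * 2,
    "full_name": ["Person A"] * 6 + ["Person B"] * 2 + ["Person C"] * 2,
    "topic_id": [8] * 10,
})
name, share = top_author_share(toy_authors, [8, 9], "X")
print(name, share)
```

A share close to 1 indicates that the party's prominence in the cluster is largely an effect of one account.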

In [None]:
# See the most prominent politicians of the cluster
tweets_cluster_6_covid.groupby("full_name").size().sort_values(ascending = False).head(10)

#### 4.1.1.4.7 Cluster democratic structures

Topics 16, 24, 27, 34, 38 and 74 focus on general parliamentary and democratic structures.

In [None]:
cluster_7_democratic_structure = [16, 24, 27, 34, 38, 74]

In [None]:
# Analyse the cluster over time
# Uncomment if one wants to analyse the cluster
# topic_model_tweets.visualize_topics_over_time(tweets_topics_over_time, topics=cluster_7_democratic_structure)

In [None]:
# See the party distribution of the cluster
# Uncomment if one wants to analyse the cluster
# tweets_cluster_7_democratic_structure =  tweets_processed_bert[tweets_processed_bert.topic_id.isin(cluster_7_democratic_structure)]
# print(tweets_cluster_7_democratic_structure.groupby("party").size().sort_values(ascending = False))
# print("\n")
# print(tweets_processed_bert.groupby("party").size().sort_values(ascending = False))

In [None]:
# See the most prominent politicians of the cluster
# Uncomment if one wants to analyse the cluster
# tweets_cluster_7_democratic_structure.groupby("full_name").size().sort_values(ascending = False).head(10)

#### 4.1.1.4.8 Cluster Germany and EU

Cluster 8 comprises topics 5, 6, 35, 51 and 70, focusing on Europe, the EU and Germany.

In [None]:
cluster_8_germany_in_europe = [5, 6, 35, 51, 70]

In [None]:
# Analyse the cluster over time
# Uncomment if one wants to analyse the cluster
# topic_model_tweets.visualize_topics_over_time(tweets_topics_over_time, topics=cluster_8_germany_in_europe)

In [None]:
# See the party distribution of the cluster
# Uncomment if one wants to analyse the cluster
# tweets_cluster_8_germany_in_europe =  tweets_processed_bert[tweets_processed_bert.topic_id.isin(cluster_8_germany_in_europe)]
# print(tweets_cluster_8_germany_in_europe.groupby("party").size().sort_values(ascending = False))
# print("\n")
# print(tweets_processed_bert.groupby("party").size().sort_values(ascending = False))

In [None]:
# See the most prominent politicians of the cluster
# Uncomment if one wants to analyse the cluster
# tweets_cluster_8_germany_in_europe.groupby("full_name").size().sort_values(ascending = False).head(10)

#### 4.1.1.4.9 Cluster finance

Another cluster consists of topics 14, 39, 67 and 91, which cover finance.

In [None]:
cluster_9_finance = [14, 39, 67, 91]

In [None]:
# Analyse the cluster over time
# Uncomment if one wants to analyse the cluster
# topic_model_tweets.visualize_topics_over_time(tweets_topics_over_time, topics=cluster_9_finance)

In [None]:
# See the party distribution of the cluster
# Uncomment if one wants to analyse the cluster
# tweets_cluster_9_finance =  tweets_processed_bert[tweets_processed_bert.topic_id.isin(cluster_9_finance)]
# print(tweets_cluster_9_finance.groupby("party").size().sort_values(ascending = False))
# print("\n")
# print(tweets_processed_bert.groupby("party").size().sort_values(ascending = False))

In [None]:
# See the most prominent politicians of the cluster
# Uncomment if one wants to analyse the cluster
# tweets_cluster_9_finance.groupby("full_name").size().sort_values(ascending = False).head(10)

#### 4.1.1.4.10 Cluster police and safety

The cluster police and safety comprises three topics (7, 83 and 93) that cover the issues of police and safety.

In [None]:
cluster_10_police_safety = [7, 83, 93]

In [None]:
# Analyse the cluster over time
# Uncomment if one wants to analyse the cluster
# topic_model_tweets.visualize_topics_over_time(tweets_topics_over_time, topics=cluster_10_police_safety)

In [None]:
# See the party distribution of the cluster
# Uncomment if one wants to analyse the cluster
# tweets_cluster_10_police_safety =  tweets_processed_bert[tweets_processed_bert.topic_id.isin(cluster_10_police_safety)]
# print(tweets_cluster_10_police_safety.groupby("party").size().sort_values(ascending = False))
# print("\n")
# print(tweets_processed_bert.groupby("party").size().sort_values(ascending = False))

In [None]:
# See the most prominent politicians of the cluster
# Uncomment if one wants to analyse the cluster
# tweets_cluster_10_police_safety.groupby("full_name").size().sort_values(ascending = False).head(10)

#### 4.1.1.4.11 Cluster climate

Another cluster of interest consists of topics 12 and 99 and covers climate and nature. We will analyse the area in more detail.

In [None]:
cluster_11_climate = [12, 99]

In [None]:
# Analyse the cluster over time
topic_model_tweets.visualize_topics_over_time(tweets_topics_over_time, topics=cluster_11_climate)

An interesting trend in the data is a sharply increasing frequency of tweets until the beginning of the Covid-19 pandemic. After the pandemic's beginning, the topic lost importance in the tweeting behaviour of the politicians.

In [None]:
# See the party distribution of the cluster
tweets_cluster_11_climate =  tweets_processed_bert[tweets_processed_bert.topic_id.isin(cluster_11_climate)]
print(tweets_cluster_11_climate.groupby("party").size().sort_values(ascending = False))
print("\n")
print(tweets_processed_bert.groupby("party").size().sort_values(ascending = False))

The party Die Grünen has the highest frequency of tweets concerning environmental topics. This observation is in line with the political agenda of the party.

In [None]:
# See the most prominent politicians of the cluster
tweets_cluster_11_climate.groupby("full_name").size().sort_values(ascending = False).head(10)

The list of politicians who tweet about this subject most frequently contains many politicians of the party Die Grünen as well as other politicians who generally tweet with a high frequency.

#### 4.1.1.4.12 Cluster infrastructure

The last cluster containing topics 1, 18 and 87 covers digital and analogue infrastructure.

In [None]:
cluster_12_infrastructure = [1, 18, 87]

In [None]:
# Analyse the cluster over time
# Uncomment if one wants to analyse the cluster
# topic_model_tweets.visualize_topics_over_time(tweets_topics_over_time, topics=cluster_12_infrastructure)

In [None]:
# See the party distribution of the cluster
# Uncomment if one wants to analyse the cluster
# tweets_cluster_12_infrastructure =  tweets_processed_bert[tweets_processed_bert.topic_id.isin(cluster_12_infrastructure)]
# print(tweets_cluster_12_infrastructure.groupby("party").size().sort_values(ascending = False))
# print("\n")
# print(tweets_processed_bert.groupby("party").size().sort_values(ascending = False))

In [None]:
# See the most prominent politicians of the cluster
# Uncomment if one wants to analyse the cluster
# tweets_cluster_12_infrastructure.groupby("full_name").size().sort_values(ascending = False).head(10)

### 4.1.1.5 Summary

In this section, we summarise the results concerning the first research question:

**What are the main topics of tweets of prominent politicians of the six parties in the German Parliament in the period of the 19th Bundestag?**

We trained a BERTopic model to give us an overview of the topics, presented in sections 4.1.1.1 - 4.1.1.3. Based on the identified topics and the inherent clustering of the model, we defined 12 overarching clusters of subjects that are presented in section 4.1.1.4. We did not include topics resulting from interpersonal relationship building or small talk. The clusters can now be used for further analysis. To answer the research question, we identified the following main topics of tweets of the selected politicians:

* Migration
* Media
* Extremism and religion
* Foreign politics and armed conflicts
* Discrimination
* Covid-19
* Democratic structures
* Europe, EU and Germany
* Finance
* Police and safety
* Climate
* Infrastructure

We analysed the clusters migration, Covid-19 and climate in detail. The code for a deeper analysis of the other clusters is provided and can be used by interested readers. Based on this analysis, we will compare the results with the topics of the speeches in parliament in section 4.1.3.
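Because the clusters were assembled manually, a short consistency check can help before they are reused. The following sketch collects three of the cluster lists from section 4.1.1.4 in a dictionary and reports topics that are assigned to more than one cluster; the dictionary layout and the helper name `overlapping_topics` are our own illustration and not part of the original pipeline.

```python
from collections import defaultdict

# Three of the cluster definitions from section 4.1.1.4
clusters = {
    "migration": [15, 19, 53, 59],
    "foreign_politics_armed_conflicts": [22, 32, 41, 56, 89, 93],
    "police_safety": [7, 83, 93],
}

def overlapping_topics(clusters):
    """Map each topic id that occurs in several clusters to those clusters."""
    membership = defaultdict(list)
    for name, topic_ids in clusters.items():
        for topic_id in topic_ids:
            membership[topic_id].append(name)
    return {t: names for t, names in membership.items() if len(names) > 1}

overlap = overlapping_topics(clusters)
print(overlap)
```

For the three lists shown here, the check reports topic 93 in two clusters. Whether such an overlap is intended should be decided before counts are compared across clusters, since shared topics are counted twice.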

## 4.1.2 Analyse speeches model

To answer the second research question, we proceed in the same way as for the first research question.

In [None]:
# Load data
speeches_processed_bert = pickle.load(open( "../data/processed/speeches_processed_bert.pickle", "rb" ))
docs_speeches = speeches_processed_bert.text_preprocessed_sentence.tolist()
with open('../data/processed/probabilities_speeches_bert.pickle', 'rb') as handle:
    probs_speeches = pickle.load(handle)
with open('../data/processed/topics_speeches_bert.pickle', 'rb') as handle:
    topics_speeches = pickle.load(handle)

### 4.1.2.1 Overview of topics

We already saw in the modelling section that we identified fewer topics for the speeches dataset. This is consistent with the much smaller number of documents in the dataset. We identified 25 topics in the modelling stage that we now analyse in more detail.

In [None]:
# Load model
topic_model_speeches = BERTopic.load("../models/bertopic_speeches")

In [None]:
# Show topic infos
topic_model_speeches.get_topic_info()

In [None]:
topic_model_speeches.visualize_barchart(topics=None, top_n_topics=25, n_words=5, width=250, height=250)

### 4.1.2.2 Visualise topic correlation

To better understand the quality of our topic model, we analyse the similarity of the identified topics.

In [None]:
# Visualise correlation
topic_model_speeches.visualize_heatmap(top_n_topics=25)

The first two topics are similar to various other topics. This could skew our results and has to be kept in mind when interpreting them.

### 4.1.2.3 Visualise topic hierarchy

To analyse the topic cluster of the resulting BERTopic model, we will use the inherent clustering of the model.

In [None]:
# Visualise clustering
topic_model_speeches.visualize_hierarchy(orientation='left', top_n_topics=25, width=1000, height=600)

In [None]:
# Visualise topic distance
topic_model_speeches.visualize_topics(topics=None, top_n_topics=None, width=650, height=650)

### 4.1.2.4 Analyse topics

We identified 12 clusters and topics that we now analyse in more detail.

In [None]:
# Prepare time based visualisation
speeches_topics_over_time = topic_model_speeches.topics_over_time(docs_speeches, topics_speeches, pd.to_datetime(speeches_processed_bert.date).dt.strftime('%Y-%m'),
                                                       nr_bins=None, datetime_format=None, evolution_tuning=True,
                                                       global_tuning=True)

#### 4.1.2.4.1 Cluster Europe

The first cluster is based on topics 21 and 23 and deals with Europe and the EU.

In [None]:
cluster_1_europe = [21, 23]

In [None]:
# Analyse the cluster over time
# Uncomment if one wants to analyse the cluster
# topic_model_speeches.visualize_topics_over_time(speeches_topics_over_time, topics=cluster_1_europe)

In [None]:
# See the party distribution of the cluster
# Uncomment if one wants to analyse the cluster
# speeches_cluster_1_europe = speeches_processed_bert[speeches_processed_bert.topic_id.isin(cluster_1_europe)]
# print(speeches_cluster_1_europe.groupby("party").size().sort_values(ascending = False))
# print("\n")
# print(speeches_processed_bert.groupby("party").size().sort_values(ascending = False))

In [None]:
# See the most prominent politicians of the cluster
# Uncomment if one wants to analyse the cluster
# speeches_cluster_1_europe.groupby("full_name").size().sort_values(ascending = False).head(10)

#### 4.1.2.4.2 Cluster democratic structures

The second cluster comprises a single topic (5) on democratic structures.

In [None]:
cluster_2_democratic = [5]

In [None]:
# Analyse the cluster over time
# Uncomment if one wants to analyse the cluster
# topic_model_speeches.visualize_topics_over_time(speeches_topics_over_time, topics=cluster_2_democratic)

In [None]:
# See the party distribution of the cluster
# Uncomment if one wants to analyse the cluster
# speeches_cluster_2_democratic = speeches_processed_bert[speeches_processed_bert.topic_id.isin(cluster_2_democratic)]
# print(speeches_cluster_2_democratic.groupby("party").size().sort_values(ascending = False))
# print("\n")
# print(speeches_processed_bert.groupby("party").size().sort_values(ascending = False))

In [None]:
# See the most prominent politicians of the cluster
# Uncomment if one wants to analyse the cluster
# speeches_cluster_2_democratic.groupby("full_name").size().sort_values(ascending = False).head(10)

#### 4.1.2.4.3 Cluster Covid-19

The third cluster contains topics 2 and 18 concerning health and the Covid-19 pandemic. We will analyse the prevalence of the topics over time and per party.

In [None]:
cluster_3_covid = [18, 2]

In [None]:
# Analyse the cluster over time
topic_model_speeches.visualize_topics_over_time(speeches_topics_over_time, topics=cluster_3_covid)

We can identify two peaks of the subject that mirror the development of the pandemic situation. We already saw this trend in the analysis of the tweets.

In [None]:
# See the party distribution of the cluster
speeches_cluster_3_covid = speeches_processed_bert[speeches_processed_bert.topic_id.isin(cluster_3_covid)]
print(speeches_cluster_3_covid.groupby("party").size().sort_values(ascending = False))
print("\n")
print(speeches_processed_bert.groupby("party").size().sort_values(ascending = False))

There are no apparent patterns in the distribution of the speeches per party.

In [None]:
# See the most prominent politicians of the cluster
speeches_cluster_3_covid.groupby("full_name").size().sort_values(ascending = False).head(10)

Surprisingly, neither Jens Spahn nor Karl Lauterbach appears in the list of the persons with the most speeches on this topic.

#### 4.1.2.4.4 Cluster foreign politics

The largest cluster combines seven topics (6, 7, 11, 12, 15, 19, 20) concerning foreign politics.

In [None]:
cluster_4_foreign_politics = [6, 7, 11, 12, 15, 19, 20]

In [None]:
# Analyse the cluster over time
# Uncomment if one wants to analyse the cluster
# topic_model_speeches.visualize_topics_over_time(speeches_topics_over_time, topics=cluster_4_foreign_politics)

In [None]:
# See the party distribution of the cluster
# Uncomment if one wants to analyse the cluster
# speeches_cluster_4_foreign_politics = speeches_processed_bert[speeches_processed_bert.topic_id.isin(cluster_4_foreign_politics)]
# print(speeches_cluster_4_foreign_politics.groupby("party").size().sort_values(ascending = False))
# print("\n")
# print(speeches_processed_bert.groupby("party").size().sort_values(ascending = False))

In [None]:
# See the most prominent politicians of the cluster
# Uncomment if one wants to analyse the cluster
# speeches_cluster_4_foreign_politics.groupby("full_name").size().sort_values(ascending = False).head(10)

#### 4.1.2.4.5 Cluster occupations

The next cluster contains topics 8 and 9 and deals with the subject occupations.

In [None]:
cluster_5_occupation = [8, 9]

In [None]:
# Analyse the cluster over time
# Uncomment if one wants to analyse the cluster
# topic_model_speeches.visualize_topics_over_time(speeches_topics_over_time, topics=cluster_5_occupation)

In [None]:
# See the party distribution of the cluster
# Uncomment if one wants to analyse the cluster
# speeches_cluster_5_occupation = speeches_processed_bert[speeches_processed_bert.topic_id.isin(cluster_5_occupation)]
# print(speeches_cluster_5_occupation.groupby("party").size().sort_values(ascending = False))
# print("\n")
# print(speeches_processed_bert.groupby("party").size().sort_values(ascending = False))

In [None]:
# See the most prominent politicians of the cluster
# Uncomment if one wants to analyse the cluster
# speeches_cluster_5_occupation.groupby("full_name").size().sort_values(ascending = False).head(10)

#### 4.1.2.4.6 Cluster discrimination

The sixth cluster consists only of topic 17, which addresses the issue of discrimination.

In [None]:
cluster_6_discrimination = [17]

In [None]:
# Analyse the cluster over time
# Uncomment if one wants to analyse the cluster
# topic_model_speeches.visualize_topics_over_time(speeches_topics_over_time, topics=cluster_6_discrimination)

In [None]:
# See the party distribution of the cluster
# Uncomment if one wants to analyse the cluster
# speeches_cluster_6_discrimination = speeches_processed_bert[speeches_processed_bert.topic_id.isin(cluster_6_discrimination)]
# print(speeches_cluster_6_discrimination.groupby("party").size().sort_values(ascending = False))
# print("\n")
# print(speeches_processed_bert.groupby("party").size().sort_values(ascending = False))

In [None]:
# See the most prominent politicians of the cluster
# Uncomment if one wants to analyse the cluster
# speeches_cluster_6_discrimination.groupby("full_name").size().sort_values(ascending = False).head(10)

#### 4.1.2.4.7 Cluster police and safety

Another cluster comprises only one topic (14) and deals with police and safety.

In [None]:
cluster_7_police_safety = [14]

In [None]:
# Analyse the cluster over time
# Uncomment if one wants to analyse the cluster
# topic_model_speeches.visualize_topics_over_time(speeches_topics_over_time, topics=cluster_7_police_safety)

In [None]:
# See the party distribution of the cluster
# Uncomment if one wants to analyse the cluster
# speeches_cluster_7_police_safety = speeches_processed_bert[speeches_processed_bert.topic_id.isin(cluster_7_police_safety)]
# print(speeches_cluster_7_police_safety.groupby("party").size().sort_values(ascending = False))
# print("\n")
# print(speeches_processed_bert.groupby("party").size().sort_values(ascending = False))

In [None]:
# See the most prominent politicians of the cluster
# Uncomment if one wants to analyse the cluster
# speeches_cluster_7_police_safety.groupby("full_name").size().sort_values(ascending = False).head(10)

#### 4.1.2.4.8 Cluster climate

Cluster eight (topic 1) includes speeches about climate change and protection. To get an overview of the topic, we analyse it in more detail.

In [None]:
cluster_8_climate = [1]

In [None]:
# Analyse the cluster over time
topic_model_speeches.visualize_topics_over_time(speeches_topics_over_time, topics=cluster_8_climate)

The topic peaks around the end of 2019, 2020 and 2021. The peaks were dominated by renewable energy subjects.

In [None]:
# See the party distribution of the cluster
speeches_cluster_8_climate = speeches_processed_bert[speeches_processed_bert.topic_id.isin(cluster_8_climate)]
print(speeches_cluster_8_climate.groupby("party").size().sort_values(ascending = False))
print("\n")
print(speeches_processed_bert.groupby("party").size().sort_values(ascending = False))

There is a substantial difference between the parties in the number of speeches covering the subject. The parties Die Grünen and CDU cover this topic in their speeches more than the other parties, relative to their overall number of speeches. Most of the CDU speeches were given by Peter Altmaier, as the following code cell shows.

In [None]:
# See the most prominent politicians of the cluster
speeches_cluster_8_climate.groupby("full_name").size().sort_values(ascending = False).head(10)

We observe many politicians of the party Die Grünen and the CDU politician Peter Altmaier. He was the Federal Minister for Economic Affairs and Energy, which explains his top position in the overview.

#### 4.1.2.4.9 Cluster digitalisation

The ninth cluster consists of topic 4, covering digitalisation.

In [None]:
cluster_9_digitalisation = [4]

In [None]:
# Analyse the cluster over time
# Uncomment if one wants to analyse the cluster
# topic_model_speeches.visualize_topics_over_time(speeches_topics_over_time, topics=cluster_9_digitalisation)

In [None]:
# See the party distribution of the cluster
# Uncomment if one wants to analyse the cluster
# speeches_cluster_9_digitalisation = speeches_processed_bert[speeches_processed_bert.topic_id.isin(cluster_9_digitalisation)]
# print(speeches_cluster_9_digitalisation.groupby("party").size().sort_values(ascending = False))
# print("\n")
# print(speeches_processed_bert.groupby("party").size().sort_values(ascending = False))

In [None]:
# See the most prominent politicians of the cluster
# Uncomment if one wants to analyse the cluster
# speeches_cluster_9_digitalisation.groupby("full_name").size().sort_values(ascending = False).head(10)

#### 4.1.2.4.10 Cluster health

The subject of health is present in topics 3 and 22.

In [None]:
cluster_10_health = [3,22]

In [None]:
# Analyse the cluster over time
# Uncomment if one wants to analyse the cluster
# topic_model_speeches.visualize_topics_over_time(speeches_topics_over_time, topics=cluster_10_health)

In [None]:
# See the party distribution of the cluster
# Uncomment if one wants to analyse the cluster
# speeches_cluster_10_health = speeches_processed_bert[speeches_processed_bert.topic_id.isin(cluster_10_health)]
# print(speeches_cluster_10_health.groupby("party").size().sort_values(ascending = False))
# print("\n")
# print(speeches_processed_bert.groupby("party").size().sort_values(ascending = False))

In [None]:
# See the most prominent politicians of the cluster
# Uncomment if one wants to analyse the cluster
# speeches_cluster_10_health.groupby("full_name").size().sort_values(ascending = False).head(10)

#### 4.1.2.4.11 Cluster extremism and religion

Similar to the cluster of the tweets, we have a cluster (topics 13 and 16) dealing with extremism and religion.

In [None]:
cluster_11_extremism_religion = [13, 16]

In [None]:
# Analyse the cluster over time
# Uncomment if one wants to analyse the cluster
# topic_model_speeches.visualize_topics_over_time(speeches_topics_over_time, topics=cluster_11_extremism_religion)

In [None]:
# See the party distribution of the cluster
# Uncomment if one wants to analyse the cluster
# speeches_cluster_11_extremism_religion = speeches_processed_bert[speeches_processed_bert.topic_id.isin(cluster_11_extremism_religion)]
# print(speeches_cluster_11_extremism_religion.groupby("party").size().sort_values(ascending = False))
# print("\n")
# print(speeches_processed_bert.groupby("party").size().sort_values(ascending = False))

In [None]:
# See the most prominent politicians of the cluster
# Uncomment if one wants to analyse the cluster
# speeches_cluster_11_extremism_religion.groupby("full_name").size().sort_values(ascending = False).head(10)

#### 4.1.2.4.12 Cluster migration

The last cluster with topic 10 is about migration.

In [None]:
cluster_12_migration = [10]

In [None]:
# Analyse the cluster over time
topic_model_speeches.visualize_topics_over_time(speeches_topics_over_time, topics=cluster_12_migration)

In [None]:
# See the party distribution of the cluster
speeches_cluster_12_migration = speeches_processed_bert[speeches_processed_bert.topic_id.isin(cluster_12_migration)]
print(speeches_cluster_12_migration.groupby("party").size().sort_values(ascending = False))
print("\n")
print(speeches_processed_bert.groupby("party").size().sort_values(ascending = False))

There are not many speeches about migration, but most of them are held by FDP and AFD. We saw a similar trend in the tweets, but the comparatively high number of AFD tweets does not carry over to the number of speeches held.

In [None]:
# See the most prominent politicians of the cluster
speeches_cluster_12_migration.groupby("full_name").size().sort_values(ascending = False).head(10)

### 4.1.2.5 Summary

We use the results of the last subsections to answer the second research question:

**What are the main topics of speeches of prominent politicians of the six parties in the German Bundestag in the period of the 19th Bundestag?**

We trained a BERTopic model to get an overview of the topics, presented in sections 4.1.2.1 - 4.1.2.3. Based on the identified topics and the clustering inherent in the model, we defined 12 overarching clusters of subjects, presented in section 4.1.2.4. These clusters can be used for further analysis. To answer the second research question, we identified the following main topics of speeches of the selected politicians:

* Europe
* Democratic structures
* Covid-19
* Foreign politics
* Occupation
* Discrimination
* Police and safety
* Climate
* Digitalisation
* Health
* Extremism and religion
* Migration

We did a deep dive into the clusters migration, Covid-19 and climate. The code for deeper analysis of the other clusters is provided and can be used by the interested reader.

## 4.1.3 Compare topics of tweets and speeches

Based on the results of the last two subsections, we now compare the content of tweets and speeches of the German politicians to answer the third research question:

**How do the main topics of tweets and speeches of prominent politicians of the six parties in the German Bundestag differ in the period of the 19th Bundestag?**

For this, we compare the general topics of the two media and the topic distribution broken down by party. When discussing tweets and speeches, we consider the inherent differences between the two media.

### 4.1.3.1 Topics in tweets and speeches

There was a considerable difference in the number of topics we identified for the tweets and speeches. One part of this difference can be explained by the number of tweets, which is many times higher than the number of speeches, while the nature of the two media explains another part. The number of topics is relatively high for tweets, which can be explained by Twitter posing a lower barrier to communication: politicians express opinions on more subjects than they are willing to talk about in the Bundestag. This observation could also be interpreted as a sign that politicians are willing to express opinions on Twitter about topics they are not experts in, while they are more selective in their speeches in the Bundestag. Additionally, politicians use Twitter to announce various events and build connections to voters and other people of public interest.

When analysing the overall clusters of speeches and tweets, we found a high number of matches. There are no apparent major differences in the topics of the two media. However, the relative focus on the topics differs between the media.

In [None]:
# Visualise top topics for tweets
tweets_processed_bert.groupby("topic").size().sort_values(ascending = False)[[1, 3, 5, 6, 7, 8, 9, 10, 12, 13]]

The first noticeable difference in the top topics is the presence of many non-relevant topics in the model for tweets. Therefore, we select the subset of the most prominent relevant topics for the tweets model.

In [None]:
# Visualise top topics for speeches
speeches_processed_bert.groupby("topic").size().sort_values(ascending = False)[1:11]

One can see striking differences in the top topics between the two media. Digitalisation, climate, occupation and the Covid-19 pandemic are present in both media. While the topics concerning foreign politics and armed conflict appear with high frequency in speeches, we do not see them in the top topics of tweets. The topics EU, Europe and Euro have a high presence in the politicians' tweets, but appear less often in the speeches dataset. This pattern is most likely caused by the European Parliament election in 2019 and the preceding election campaigns. This analysis only uses an excerpt of the topics and therefore has limited validity. However, it still provides a first overview of the differences in the most prominent topics per medium. One could go into deeper analysis, but this is out of scope for this work.
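The comparison of the two top-topic lists can also be expressed as a simple set overlap. The sketch below is only an illustration: the helper name `top_topic_overlap` is hypothetical, and the topic labels are simplified stand-ins based on the discussion above, not the actual topic names produced by the models. It reports the shared topics, the topics unique to each medium and the Jaccard similarity of the two lists.

```python
def top_topic_overlap(top_tweets, top_speeches):
    """Compare two lists of top topics and return shared and unique entries."""
    tweets, speeches = set(top_tweets), set(top_speeches)
    shared = tweets & speeches
    # Jaccard similarity: size of the intersection divided by size of the union
    jaccard = len(shared) / len(tweets | speeches)
    return {
        "shared": sorted(shared),
        "only_tweets": sorted(tweets - speeches),
        "only_speeches": sorted(speeches - tweets),
        "jaccard": jaccard,
    }


# Simplified, hypothetical labels for illustration only
example_tweets = ["digitalisation", "climate", "occupation", "covid", "europe"]
example_speeches = ["digitalisation", "climate", "occupation", "covid", "foreign politics"]
overlap = top_topic_overlap(example_tweets, example_speeches)
print(overlap)
```

Such a measure only compares topic labels and ignores topic frequencies, so it shares the limited validity of the excerpt-based comparison.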

### 4.1.3.2 Topics of AFD

When comparing the most prominent topics of politicians of the party AFD, we can again identify differences in the topic distribution.

In [None]:
# Visualise top topics for tweets
tweets_processed_bert[tweets_processed_bert.party == "AFD"].groupby("topic").size().sort_values(ascending = False)[1:11]

In [None]:
# Visualise top topics for speeches
speeches_processed_bert[speeches_processed_bert.party == "AFD"].groupby("topic").size().sort_values(ascending = False)[1:11]

The most prominent topics of the Twitter presence of AFD politicians are police, migration, refugees and Muslims. This observation is in strong contrast to the topics climate, constitution and digitalisation, which most of their speeches in the Bundestag focus on.

### 4.1.3.3 Topics of CDU

The most striking difference between the tweets and speeches for the CDU is the focus on climate and energy in the speeches, which is not present in the top topics of the tweets. Another interesting observation is the missing representation of foreign politics in the top tweet topics.

In [None]:
# Visualise top topics for tweets
tweets_processed_bert[tweets_processed_bert.party == "CDU"].groupby("topic").size().sort_values(ascending = False)[1:11]

In [None]:
# Visualise top topics for speeches
speeches_processed_bert[speeches_processed_bert.party == "CDU"].groupby("topic").size().sort_values(ascending = False)[1:11]

### 4.1.3.4 Topics of FDP

The pattern of a missing representation of foreign politics and armed conflicts repeats when analysing the tweets of the FDP.

In [None]:
# Visualise top topics for tweets
tweets_processed_bert[tweets_processed_bert.party == "FDP"].groupby("topic").size().sort_values(ascending = False)[1:11]

In [None]:
# Visualise top topics for speeches
speeches_processed_bert[speeches_processed_bert.party == "FDP"].groupby("topic").size().sort_values(ascending = False)[1:11]

### 4.1.3.5 Topics of Grüne

The agreement between the topics of tweets and speeches is comparatively high for the party Die Grünen. Digitalisation is one subject from the speeches that is not highly represented in the tweets.

In [None]:
# Visualise top topics for tweets
tweets_processed_bert[tweets_processed_bert.party == "Grüne"].groupby("topic").size().sort_values(ascending = False)[1:11]

In [None]:
# Visualise top topics for speeches
speeches_processed_bert[speeches_processed_bert.party == "Grüne"].groupby("topic").size().sort_values(ascending = False)[1:11]

### 4.1.3.6 Topics of Linke

The party Die Linke also has many overlapping topics in both media. Nevertheless, the topics of occupation and police, which are quite prevalent in the tweets, are barely represented in the speeches.

In [None]:
# Visualise top topics for tweets
tweets_processed_bert[tweets_processed_bert.party == "Linke"].groupby("topic").size().sort_values(ascending = False)[1:11]

In [None]:
# Visualise top topics for speeches
speeches_processed_bert[speeches_processed_bert.party == "Linke"].groupby("topic").size().sort_values(ascending = False)[1:11]

### 4.1.3.7 Topics of SPD

The SPD politicians hold speeches about similar topics as they tweet about, with the recurring difference that they rarely discuss foreign affairs on Twitter.

In [None]:
# Visualise top topics for tweets
tweets_processed_bert[tweets_processed_bert.party == "SPD"].groupby("topic").size().sort_values(ascending = False)[1:11]

In [None]:
# Visualise top topics for speeches
speeches_processed_bert[speeches_processed_bert.party == "SPD"].groupby("topic").size().sort_values(ascending = False)[1:11]

### 4.1.3.8 Summary

Based on the last subsection, we answer the third research question.

**How do the main topics of tweets and speeches of prominent politicians of the six parties in the German Bundestag differ in the period of the 19th Bundestag?**

There are many similarities between the communication topics on Twitter and in the German Bundestag. However, we could see some differences between the media and analysed the topic differences per party. To answer the research question: we found considerably more topics for tweets than for speeches, and there was a marked difference in the communication about foreign politics and armed conflicts. While these topics were discussed in the plenum, they did not appear among the popular topics of tweets. We explained the difference in the number of topics by the different numbers of documents and by the characteristics of social media compared to plenary speeches. Another difference is the focus of the topics. While the general topics are similar, the prominent subjects of speeches and tweets differ considerably depending on the politician and the party.

# 4.2 Topic model validation

To validate the previous section's results, we use word and topic intrusion tests ([Chang, Boyd-Graber, Gerrish, Wang, & Blei, 2009](https://proceedings.neurips.cc/paper/2009/file/f92586a25bb3145facd64ab20fd554ff-Paper.pdf)). We implement an interface for both tests and evaluate the models based on annotations by the two authors.

## 4.2.1 Word intrusion

Word intrusion measures the coherence of topics. For this, we show annotators five high probability keywords of a particular topic and an intruder keyword from another topic and give them the task to identify the intruder keyword. The model precision as measured by the word intrusion score is then defined as the number of times the intruder keyword was chosen divided by the number of topics shown.
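As a small worked example of this definition, the sketch below computes the score from two parallel lists: the true intruder per topic and the word the annotator picked. The helper name `word_intrusion_score` and the example words are hypothetical and serve only to illustrate the calculation.

```python
def word_intrusion_score(intruder_words, chosen_words):
    """Share of topics for which the annotator picked the true intruder."""
    if len(intruder_words) != len(chosen_words):
        raise ValueError("Both lists need one entry per shown topic")
    hits = sum(intruder == chosen
               for intruder, chosen in zip(intruder_words, chosen_words))
    return hits / len(intruder_words)


# Hypothetical annotation result for four topics (illustration only)
true_intruders = ["klima", "polizei", "europa", "rente"]
annotator_choices = ["klima", "polizei", "steuer", "rente"]
score = word_intrusion_score(true_intruders, annotator_choices)
print(score)
```

In this example the annotator finds three of four intruders, so the score is 0.75. A score close to 1 indicates coherent topics, because the intruder clearly stands out from the remaining keywords.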

### 4.2.1.1 Define functions

Before executing the word intrusion task, we need to define a set of helper functions. We create a simple interface so that the task can be executed in the notebook cells.

In [None]:
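# Choose a random topic id that differs from the given one
def choose_random_document(index, number_documents):
    # Topic ids run from -1 (outlier topic) to number_documents - 2.
    # randrange excludes its upper bound, so we pass number_documents - 1
    # to make the last topic eligible as well.
    rand_document = random.randrange(-1, number_documents - 1)
    if rand_document != index:
        return rand_document
    else:
        return choose_random_document(index, number_documents)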

In [None]:
# Function for creating a word intrusion dataset
def create_word_intrusion_dataset(topic_model):
    number_documents = len(topic_model.get_topics())
    records_list = []
    for i in range(number_documents):
        word_list = []
        for j in range(5):
            word_list.append(topic_model.get_topic(i-1)[j][0])
        intruder_word = topic_model.get_topic(choose_random_document(i-1, number_documents))[0][0]
        # Allow all six slots (0-5), so the intruder can also appear at the end
        intruder_position = random.randrange(len(word_list) + 1)
        word_list.insert(intruder_position, intruder_word)
        word_list.append(intruder_word)
        word_list.append(intruder_position)
        records_list.append(word_list)
    word_intrusion_df = pd.DataFrame.from_records(records_list)
    word_intrusion_df.columns = ["word_0", "word_1", "word_2", "word_3", "word_4", "word_5",
                                 "intruder_word", "intruder_index"]
    return word_intrusion_df

In [None]:
# A function that divides the word intrusion dataset into separate sets for the annotators
def generate_annotator_set(df, number_label, number_iaa, name_1, name_2):
    length = df.shape[0]
    if 2*number_label + number_iaa > length:
        # Stop here: the flag lists below would not match the dataframe length
        raise ValueError("Too many labels for the size of the dataframe")
    df_shuffled = df.sample(frac=1).reset_index(drop=True)
    # Annotator 1 labels the first number_label rows plus the shared rows
    df_shuffled[name_1] = [1] * (number_label+number_iaa) + [0] * (length-number_label-number_iaa)
    # Annotator 2 labels the shared rows plus the next number_label rows
    df_shuffled[name_2] = [0] * (number_label) + [1] * (number_label+number_iaa) + [0] * (length-2*number_label-number_iaa)
    # Rows labelled by both annotators, used for the inter-annotator agreement
    df_shuffled["iaa_flag"] = [0] * number_label + [1] * number_iaa + [0] * (length-number_label-number_iaa)
    # Rows that count towards the intrusion score (shared rows are excluded)
    df_shuffled["wis_label"] = [1] * number_label + [0] * number_iaa + [1] * (length-number_label-number_iaa)
    return df_shuffled
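To make the row layout produced by `generate_annotator_set` easier to follow, the sketch below rebuilds only the four flag columns as plain lists for a small example. The helper name `annotator_flags` and the example sizes are hypothetical, and the sketch rejects invalid sizes with a `ValueError`. It is meant as an illustration of the split, not as a replacement for the function above.

```python
def annotator_flags(length, number_label, number_iaa):
    """Rebuild the flag columns of the annotator split as plain lists."""
    if 2 * number_label + number_iaa > length:
        raise ValueError("Too many labels for the number of rows")
    shared_end = number_label + number_iaa
    return {
        # Annotator 1: own rows first, then the shared rows
        "annotator_1": [1] * shared_end + [0] * (length - shared_end),
        # Annotator 2: shared rows first, then own rows
        "annotator_2": [0] * number_label + [1] * shared_end
                       + [0] * (length - 2 * number_label - number_iaa),
        # Shared rows, labelled by both annotators
        "iaa_flag": [0] * number_label + [1] * number_iaa
                    + [0] * (length - shared_end),
        # Rows that count towards the intrusion score
        "wis_label": [1] * number_label + [0] * number_iaa
                     + [1] * (length - shared_end),
    }


flags = annotator_flags(length=10, number_label=3, number_iaa=2)
for column, values in flags.items():
    print(column, values)
```

In this example, annotator 1 labels rows 0 to 4 and annotator 2 labels rows 3 to 7. Rows 3 and 4 are seen by both and are used only for the inter-annotator agreement, while the remaining labelled rows enter the intrusion score.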

In [None]:
# A function that offers an interface in Jupyter notebook for the word intrusion task
def word_intrusion_test(word_df, name, medium):
    intrusion_df = word_df[word_df[name] == 1].reset_index(drop = True)

    max_count = intrusion_df.shape[0]
    global i
    i = 0

    button_0 = widgets.Button(description = intrusion_df.word_0[i])
    button_1 = widgets.Button(description = intrusion_df.word_1[i])
    button_2 = widgets.Button(description = intrusion_df.word_2[i])
    button_3 = widgets.Button(description = intrusion_df.word_3[i])
    button_4 = widgets.Button(description = intrusion_df.word_4[i])
    button_5 = widgets.Button(description = intrusion_df.word_5[i])


    chosen_words = []
    chosen_positions= []

    display("Word Intrusion Test")

    f = IntProgress(min=0, max=max_count)
    display(f)

    display(button_0)
    display(button_1)
    display(button_2)
    display(button_3)
    display(button_4)
    display(button_5)


    def btn_eventhandler(position, obj):
        global i
        i += 1


        clear_output(wait=True)

        display("Word Intrusion Test")
        display(f)
        f.value += 1

        chosen_text = obj.description
        chosen_words.append(chosen_text)

        chosen_positions.append(position)

        if i < max_count:

            button_0 = widgets.Button(description = intrusion_df.word_0[i])
            button_1 = widgets.Button(description = intrusion_df.word_1[i])
            button_2 = widgets.Button(description = intrusion_df.word_2[i])
            button_3 = widgets.Button(description = intrusion_df.word_3[i])
            button_4 = widgets.Button(description = intrusion_df.word_4[i])
            button_5 = widgets.Button(description = intrusion_df.word_5[i])

            display(button_0)
            display(button_1)
            display(button_2)
            display(button_3)
            display(button_4)
            display(button_5)

            button_0.on_click(partial(btn_eventhandler,0))
            button_1.on_click(partial(btn_eventhandler,1))
            button_2.on_click(partial(btn_eventhandler,2))
            button_3.on_click(partial(btn_eventhandler,3))
            button_4.on_click(partial(btn_eventhandler,4))
            button_5.on_click(partial(btn_eventhandler,5))
        else:
            print ("Thanks " + name + " you finished all the work!")
            intrusion_df["chosen_word"] = chosen_words
            intrusion_df["chosen_position"] = chosen_positions
            intrusion_df.to_csv("../data/processed/word_intrusion_test_" + name + "_" + medium + ".csv", index = False)



    button_0.on_click(partial(btn_eventhandler,0))
    button_1.on_click(partial(btn_eventhandler,1))
    button_2.on_click(partial(btn_eventhandler,2))
    button_3.on_click(partial(btn_eventhandler,3))
    button_4.on_click(partial(btn_eventhandler,4))
    button_5.on_click(partial(btn_eventhandler,5))

    return intrusion_df

In [None]:
# Calculate the word intrusion score for the two annotator sets
def calculate_word_intrusion(name_1, name_2, medium):
    df_word_intrusion_1 = pd.read_csv("../data/processed/word_intrusion_test_" + name_1 + "_" + medium + ".csv")
    df_word_intrusion_2 = pd.read_csv("../data/processed/word_intrusion_test_" + name_2 + "_" + medium + ".csv")
    # Inter-annotator agreement on the rows labelled by both annotators
    iaa_values_1 = df_word_intrusion_1[df_word_intrusion_1.iaa_flag == 1].chosen_position.values
    iaa_values_2 = df_word_intrusion_2[df_word_intrusion_2.iaa_flag == 1].chosen_position.values
    kappa = cohen_kappa_score(iaa_values_1, iaa_values_2)
    # DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
    df_word_intrusion = pd.concat([df_word_intrusion_1, df_word_intrusion_2])
    # Copy the filtered rows to avoid a SettingWithCopyWarning on assignment
    df_word = df_word_intrusion[df_word_intrusion["wis_label"] == 1].copy()
    df_word["intruder_chosen"] = df_word["intruder_word"] == df_word["chosen_word"]
    return df_word["intruder_chosen"].mean(), kappa

### 4.2.1.2 Validation of tweets topic model

We execute the word intrusion task for the tweets BERTopic model based on the above-defined functions. The two authors do the annotation.

In [None]:
# Load model
topic_model_tweets = BERTopic.load("../models/bertopic_tweets")

In [None]:
# Create candidate dataset
word_intrusion_dataset_tweets = create_word_intrusion_dataset(topic_model_tweets)

In [None]:
# Create label dataset for two annotators
word_intrusion_dataset_tweets_label = generate_annotator_set(word_intrusion_dataset_tweets, 45, 11, "Jakob",
                                                             "Stjepan")

In [None]:
# Execute annotation for first candidate
# Uncomment if annotation is repeated
# df_word_intrusion_jakob_tweets = word_intrusion_test(word_intrusion_dataset_tweets_label, "Jakob", "Tweets")

In [None]:
# Execute annotation for second candidate
# Uncomment if annotation is repeated
# df_word_intrusion_stjepan_tweets = word_intrusion_test(word_intrusion_dataset_tweets_label, "Stjepan", "Tweets")

In [None]:
# Calculate intrusion score and Cohen's kappa
word_intrusion_score_tweets, word_kappa_tweets = calculate_word_intrusion("Jakob", "Stjepan", "Tweets")

In [None]:
# Cohen's kappa
print("Cohen's kappa is: " + str(round(word_kappa_tweets,2)))

Our inter-annotator agreement is satisfactory and shows a good consensus of our annotations.
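For readers unfamiliar with Cohen's kappa, the following sketch shows how the unweighted statistic is derived from two label sequences: the observed agreement is corrected by the agreement expected by chance. The notebook itself relies on `cohen_kappa_score`. The helper `cohen_kappa` and the example positions below are hypothetical and only illustrate the calculation.

```python
from collections import Counter


def cohen_kappa(labels_1, labels_2):
    """Unweighted Cohen's kappa for two equally long label sequences."""
    if len(labels_1) != len(labels_2) or not labels_1:
        raise ValueError("Both annotators need the same, non-zero number of labels")
    n = len(labels_1)
    # Share of items on which both annotators agree
    observed = sum(a == b for a, b in zip(labels_1, labels_2)) / n
    # Agreement expected by chance, based on each annotator's label frequencies
    counts_1, counts_2 = Counter(labels_1), Counter(labels_2)
    expected = sum(counts_1[label] * counts_2[label] for label in counts_1) / n ** 2
    if expected == 1:
        # Kappa is undefined if both annotators always use the same single label
        return float("nan")
    return (observed - expected) / (1 - expected)


# Hypothetical chosen positions of two annotators on six shared items
positions_1 = [0, 1, 2, 2, 3, 5]
positions_2 = [0, 1, 2, 4, 3, 5]
kappa_example = cohen_kappa(positions_1, positions_2)
print(round(kappa_example, 2))
```

In this example the annotators agree on five of six items and the chance agreement is 1/6, which gives a kappa of 0.8.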

In [None]:
# Intrusion score
print("The word intrusion score is: " + str(round(word_intrusion_score_tweets,2)))

We see a good intrusion score as many of the intruder words were detected. These results could be improved by fixing the identified limitations of our model.

### 4.2.1.3 Validation of speeches topic model

In [None]:
# Load model
topic_model_speeches = BERTopic.load("../models/bertopic_speeches")

In [None]:
# Create candidate dataset
word_intrusion_dataset_speeches = create_word_intrusion_dataset(topic_model_speeches)

In [None]:
# Create label dataset for two annotators
word_intrusion_dataset_speeches_label = generate_annotator_set(word_intrusion_dataset_speeches, 10, 5, "Jakob",
                                                             "Stjepan")

In [None]:
# Execute annotation for first candidate
# df_word_intrusion_jakob_speeches = word_intrusion_test(word_intrusion_dataset_speeches_label, "Jakob", "Speeches")

In [None]:
# Execute annotation for second candidate
# df_word_intrusion_stjepan_speeches = word_intrusion_test(word_intrusion_dataset_speeches_label, "Stjepan", "Speeches")

In [None]:
# Calculate intrusion score and Cohen's kappa
word_intrusion_score_speeches, word_kappa_speeches = calculate_word_intrusion("Jakob", "Stjepan", "Speeches")

In [None]:
# Cohen's kappa
print("Cohen's kappa is: " + str(round(word_kappa_speeches,2)))

Our inter-annotator agreement is satisfactory and shows a good consensus of our annotations.

In [None]:
# Intrusion score
print("The word intrusion score is: " + str(round(word_intrusion_score_speeches,2)))

We see a good intrusion score as many of the intruder words were detected. These results could be improved by fixing the identified limitations of our model.

## 4.2.2 Topic intrusion

By measuring the topic intrusion score, we test whether the algorithm's probability distribution of topics for the documents matches the human assessment. We show an excerpt of the document, the three topics with the highest probability for this document and a random low-probability topic, the intruder. To calculate the topic intrusion score, we take the mean of the differences between the log probabilities of the true intruder topic and the topic selected by the annotator.
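The score can be illustrated with a short sketch. For each document, we take the log probability of the true intruder topic minus the log probability of the topic the annotator selected, and then average over all documents. The helper name `topic_intrusion_score` and the example probabilities are hypothetical. The sketch assumes that all probabilities are strictly positive, as the logarithm of zero is undefined.

```python
import math


def topic_intrusion_score(intruder_probabilities, chosen_probabilities):
    """Mean log odds of the true intruder versus the selected topic."""
    if len(intruder_probabilities) != len(chosen_probabilities):
        raise ValueError("Both lists need one entry per document")
    log_odds = [math.log(p_intruder) - math.log(p_chosen)
                for p_intruder, p_chosen
                in zip(intruder_probabilities, chosen_probabilities)]
    return sum(log_odds) / len(log_odds)


# Hypothetical probabilities for two documents (illustration only):
# in the first the annotator finds the intruder, in the second a likely topic is picked
example_intruder = [0.01, 0.02]
example_chosen = [0.01, 0.2]
example_score = topic_intrusion_score(example_intruder, example_chosen)
print(round(example_score, 3))
```

A document contributes 0 if the annotator finds the intruder and a negative value if a more probable topic is selected instead. The closer the mean is to 0, the better the model's topic probabilities match the human judgement.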

In [None]:
# Create a function that combines key words into a single string
def create_topic_string(topic_info):
    word_list = []
    for i in range(8):
        word_list.append(topic_info[i][0])
    return ", ".join(word_list)

In [None]:
# Create a function that prepares the topic intrusion dataset
def create_topic_intrusion_dataset(data, topic_model, topic_probabilities, test_number = 100):
    number_documents = data.shape[0]
    if number_documents < test_number:
        raise ValueError("test_number must not exceed the number of documents")
    number_topics = len(topic_model.get_topics())
    records_list = []
    for i in range(test_number):
        topic_list = []
        high_probability_documents = sorted(zip(topic_probabilities[i].tolist(), list(range(number_topics))), reverse=True)[:3]
        low_probability_documents = sorted(zip(topic_probabilities[i].tolist(), list(range(number_topics))), reverse=True)[3:]
        for j in range(3):
            topic_index = high_probability_documents[j][1]
            topic_list.append(create_topic_string(topic_model.get_topic(topic_index)))
        intruder_document = low_probability_documents[random.randrange(number_topics-4)]
        intruder_topic = create_topic_string(topic_model.get_topic(intruder_document[1]))
        intruder_position = random.randrange(4)
        topic_list.insert(intruder_position, intruder_topic)
        for k in range(3):
            topic_list.append(high_probability_documents[k][0])
        topic_list.insert(intruder_position + 4, intruder_document[0])
        topic_list.append(intruder_topic)
        topic_list.append(intruder_document[0])
        topic_list.append(intruder_position)
        topic_list.append(data["text"][i])
        records_list.append(topic_list)
    df = pd.DataFrame.from_records(records_list)
    df.columns = ["topic_0", "topic_1", "topic_2", "topic_3","probability_topic_0","probability_topic_1",
                  "probability_topic_2","probability_topic_3", "intruder_topic", "intruder_topic_probability",
                  "intruder_index", "text"]
    return df

In [None]:
# Create a function that generates the interface for the topic intrusion test
def topic_intrusion_test(intrusion_df, name, medium):
    intrusion_df = intrusion_df[intrusion_df[name] == 1].reset_index(drop = True)

    max_count = intrusion_df.shape[0]
    global i
    i = 0

    layout = widgets.Layout(width='auto')

    button_0 = widgets.Button(description = intrusion_df.topic_0[i], layout = layout)
    button_1 = widgets.Button(description = intrusion_df.topic_1[i], layout = layout)
    button_2 = widgets.Button(description = intrusion_df.topic_2[i], layout = layout)
    button_3 = widgets.Button(description = intrusion_df.topic_3[i], layout = layout)

    chosen_elements = []
    chosen_positions = []
    chosen_probabilities = []

    display("Topic Intrusion Test")

    f = IntProgress(min=0, max=max_count)
    display(f)

    if len(intrusion_df.text[i]) < 1100:
        display(intrusion_df.text[i][0:1100])
    else :
        display(intrusion_df.text[i][100:1100])

    display(button_0)
    display(button_1)
    display(button_2)
    display(button_3)


    def btn_eventhandler(position, column, obj):

        global i

        clear_output(wait=True)

        display("Topic Intrusion Test")
        display(f)
        f.value += 1

        chosen_text = obj.description
        chosen_elements.append(chosen_text)
        chosen_positions.append(position)
        chosen_probabilities.append(intrusion_df[column][i])

        i += 1

        if i < max_count:

            button_0 = widgets.Button(description = intrusion_df.topic_0[i], layout = layout)
            button_1 = widgets.Button(description = intrusion_df.topic_1[i], layout = layout)
            button_2 = widgets.Button(description = intrusion_df.topic_2[i], layout = layout)
            button_3 = widgets.Button(description = intrusion_df.topic_3[i], layout = layout)

            if len(intrusion_df.text[i]) < 1100:
                display(intrusion_df.text[i][0:1100])
            else :
                display(intrusion_df.text[i][100:1100])

            display(button_0)
            display(button_1)
            display(button_2)
            display(button_3)

            button_0.on_click(partial(btn_eventhandler,0,"probability_topic_0"))
            button_1.on_click(partial(btn_eventhandler,1,"probability_topic_1"))
            button_2.on_click(partial(btn_eventhandler,2,"probability_topic_2"))
            button_3.on_click(partial(btn_eventhandler,3,"probability_topic_3"))
        else:
            print ("Thanks " + name + " you finished all the work!")
            intrusion_df["chosen_topic"] = chosen_elements
            intrusion_df["chosen_position"] = chosen_positions
            intrusion_df["chosen_topic_probability"] = chosen_probabilities
            intrusion_df.to_csv("../data/processed/topic_intrusion_test_" + name + "_" + medium + ".csv", index = False)



    button_0.on_click(partial(btn_eventhandler,0,"probability_topic_0"))
    button_1.on_click(partial(btn_eventhandler,1,"probability_topic_1"))
    button_2.on_click(partial(btn_eventhandler,2,"probability_topic_2"))
    button_3.on_click(partial(btn_eventhandler,3,"probability_topic_3"))

    return intrusion_df

In [None]:
# Create a function to calculate the topic intrusion score
def calculate_topic_intrusion(name_1, name_2, medium):
    df_topic_intrusion_1 = pd.read_csv("../data/processed/topic_intrusion_test_" + name_1 + "_" + medium + ".csv")
    df_topic_intrusion_2 = pd.read_csv("../data/processed/topic_intrusion_test_" + name_2 + "_" + medium + ".csv")
    # Inter-annotator agreement on the rows labelled by both annotators
    iaa_values_1 = df_topic_intrusion_1[df_topic_intrusion_1.iaa_flag == 1].chosen_position.values
    iaa_values_2 = df_topic_intrusion_2[df_topic_intrusion_2.iaa_flag == 1].chosen_position.values
    kappa = cohen_kappa_score(iaa_values_1, iaa_values_2)
    # DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
    df_topic_intrusion = pd.concat([df_topic_intrusion_1, df_topic_intrusion_2])
    # Copy the filtered rows to avoid a SettingWithCopyWarning on assignment
    df_topic = df_topic_intrusion[df_topic_intrusion["wis_label"] == 1].copy()
    # Log odds of the true intruder versus the chosen topic (0 if the intruder was found)
    df_topic["intruder_score"] = np.log(df_topic["intruder_topic_probability"]) - np.log(df_topic["chosen_topic_probability"])
    return df_topic["intruder_score"].mean(), kappa

### 4.2.2.1 Validation of tweets topic model

In the first step, we calculate the validation score for the tweets BERTopic model.

In [None]:
# Load data
with open( "../data/processed/tweets_processed_bert.pickle", "rb" ) as handle:
    tweets_processed_bert = pickle.load(handle)
with open('../data/processed/probabilities_tweets_bert.pickle', 'rb') as handle:
    topic_probabilities_tweets = pickle.load(handle)

In [None]:
# Load model
# topic_model_tweets = BERTopic.load("../models/bertopic_tweets")

In [None]:
# Create candidate dataset
topic_intrusion_dataset_tweets = create_topic_intrusion_dataset(tweets_processed_bert, topic_model_tweets,
                                                               topic_probabilities_tweets, test_number = 100)

In [None]:
# Create label dataset for two annotators
topic_intrusion_dataset_tweets_label = generate_annotator_set(topic_intrusion_dataset_tweets, 40, 10, "Jakob",
                                                             "Stjepan")

In [None]:
# Execute annotation for first candidate
# df_topic_intrusion_jakob_tweets = topic_intrusion_test(topic_intrusion_dataset_tweets_label, "Jakob", "Tweets")

In [None]:
# Execute annotation for second candidate
# df_topic_intrusion_stjepan_tweets = topic_intrusion_test(topic_intrusion_dataset_tweets_label, "Stjepan", "Tweets")

In [None]:
# Calculate intrusion score and Cohen's kappa
topic_intrusion_score_tweets, topic_kappa_tweets = calculate_topic_intrusion("Jakob", "Stjepan", "Tweets")

In [None]:
# Cohen's kappa
print("Cohen's kappa is: " + str(round(topic_kappa_tweets,2)))

Our inter-annotator agreement is satisfactory and shows a good consensus of our annotations.

In [None]:
# Intrusion score
print("The topic intrusion score is: " + str(round(topic_intrusion_score_tweets,2)))

It is difficult to evaluate the resulting topic intrusion score objectively. However, compared with the results reported by Chang et al. (2009), we can infer that this score is at least satisfactory and supports the validity of our model.

### 4.2.2.2 Validation of speeches topic model

In [None]:
# Load data
with open( "../data/processed/speeches_processed_bert.pickle", "rb" ) as handle:
    speeches_processed_bert = pickle.load(handle).reset_index(drop = True)
with open('../data/processed/probabilities_speeches_bert.pickle', 'rb') as handle:
    topic_probabilities_speeches = pickle.load(handle)

In [None]:
# Load model
# topic_model_speeches = BERTopic.load("../models/bertopic_speeches")

In [None]:
# Create candidate dataset
topic_intrusion_dataset_speeches = create_topic_intrusion_dataset(speeches_processed_bert, topic_model_speeches,
                                                               topic_probabilities_speeches, test_number = 100)

In [None]:
# Create label dataset for two annotators
topic_intrusion_dataset_speeches_label = generate_annotator_set(topic_intrusion_dataset_speeches, 40, 10, "Jakob",
                                                             "Stjepan")

In [None]:
# Execute annotation for first candidate
# df_topic_intrusion_jakob_speeches = topic_intrusion_test(topic_intrusion_dataset_speeches_label, "Jakob", "Speeches")

In [None]:
# Execute annotation for second candidate
# df_topic_intrusion_stjepan_speeches = topic_intrusion_test(topic_intrusion_dataset_speeches_label, "Stjepan", "Speeches")

In [None]:
# Calculate intrusion score and Cohen's kappa
topic_intrusion_score_speeches, topic_kappa_speeches = calculate_topic_intrusion("Jakob", "Stjepan", "Speeches")

In [None]:
# Cohen's kappa
print("Cohen's kappa is: " + str(round(topic_kappa_speeches, 2)))

The inter-annotator agreement on this task is relatively low. We expected this: the speeches are generally quite long, so it was challenging to infer the suitable topics from a short excerpt.

In [None]:
# Intrusion score
print("The topic intrusion score is: " + str(round(topic_intrusion_score_speeches, 2)))

It is difficult to evaluate the resulting topic intrusion score objectively. Nevertheless, compared with the results from the article, we can infer that this score is at least satisfactory and validates our model.

### 4.2.3 Conclusion

Based on the topic and word intrusion measures evaluated in this section, we can infer a satisfactory validity of our models. There are various possibilities for improvement, and we identified several limitations in the Results section. However, the models still offer noticeable insights and serve as a basis for further work.

# 4.3 Result analysis sentiment analysis (Stjepan)

**Needs to be added**

# 4.4 Validation sentiment analysis (Stjepan)

**Needs to be added**

# 5 Discussion

# 5.1 Discussion topic modelling (Jakob)

This section discusses the strengths and weaknesses of the procedure and of the topic modelling results. For this, we review the research design, the data, the preprocessing pipeline, the model and the results based on our previous work.

Our research design was based on comparing the content of politicians' tweets with their speeches in the plenary. Given the large amount of data, automated content analysis is needed to tackle this problem. We decided to use both well-known and experimental topic modelling algorithms, which we discuss in more detail below. Our setup in Python allowed us to process amounts of data that we could not have analysed manually. We therefore conclude that our general research design was sound, apart from limitations in the data, preprocessing and modelling that we discuss in the following paragraphs.

We restricted the data collection to the tweets and speeches of each party's seven most prominent politicians in the 19th Bundestag. Another limitation was the time range of the data preprocessed by the open-discourse team. As they plan to publish up-to-date data regularly, one could retrieve the new data to update the results. Because our pipeline can process and model the data efficiently, one could repeat the research setup with a larger pool of politicians and for more legislative periods of the Bundestag. Given the short timespan we chose, we only saw a snapshot of the topics and of the differences between the two media, which a more extensive setup could broaden. The selection of politicians from the six parties is a source of bias. While we tried to choose a representative sample, the choice of politicians will have skewed the results, as different politicians have different fields of expertise, and not all of them were represented in our subset. Another issue is that the number of speeches and tweets differs significantly between politicians. This problem cannot be solved trivially and must be kept in mind when analysing the data at the level of individual politicians.

The preprocessing is flexible and extendable, enabling different pipelines for different tasks. While the pipeline is based on best practices, irrelevant words with high frequency remained a problem. With more sophisticated preprocessing, one could identify context-dependent stop words and improve the topic model.
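One simple heuristic for finding context-dependent stop word candidates is document frequency: words that occur in a large share of all documents rarely help to separate topics. The sketch below illustrates this idea. It is not part of our pipeline; the function name `candidate_stop_words`, the threshold and the toy documents are hypothetical.

```python
from collections import Counter


def candidate_stop_words(tokenised_docs, min_doc_share=0.5):
    """Return words that occur in at least `min_doc_share` of all documents."""
    n_docs = len(tokenised_docs)
    if n_docs == 0:
        return []
    # Count each word once per document (document frequency, not term frequency)
    doc_freq = Counter(word for doc in tokenised_docs for word in set(doc))
    return sorted(word for word, df in doc_freq.items() if df / n_docs >= min_doc_share)


# Hypothetical tokenised speeches
docs = [
    ["herr", "kollegen", "klimaschutz", "energie"],
    ["herr", "kollegen", "rente", "pflege"],
    ["herr", "antrag", "digitalisierung", "schule"],
    ["frau", "kollegen", "haushalt", "steuer"],
]
stop_word_candidates = candidate_stop_words(docs, min_doc_share=0.75)
print(stop_word_candidates)
```

The resulting candidates should still be reviewed manually before they are added to the stop word list, since a frequent word can be meaningful in one medium and noise in the other.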

For the topic modelling, we tested three approaches. LDA and NNMF are well-known and widely used models ([Asmussen & Møller, 2019](https://doi.org/10.1186/s40537-019-0255-7)) that are known to produce good results. The third approach, BERTopic, is relatively new but showed promising results in previous work and was also the best-performing model in our case. The selection of modelling approaches and the comparison of different models are already sound. However, we lacked the computational capacity for extensive hyperparameter optimisation of the BERTopic approach, which would likely improve the model quality. One should also introduce a train/validation/test split so that the results transfer better to new data when the setup is updated.
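A hold-out split of the documents could look like the following minimal sketch. It uses only the standard library; the function name `train_test_split_docs`, the split ratio and the example documents are hypothetical, and a validation set could be split off the training documents in the same way.

```python
import random


def train_test_split_docs(docs, test_share=0.2, seed=42):
    """Shuffle documents reproducibly and split them into train and test lists."""
    if not 0 < test_share < 1:
        raise ValueError("test_share must be between 0 and 1.")
    shuffled = list(docs)  # copy, so the input list is left untouched
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_share))
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical documents standing in for preprocessed tweets or speeches
example_docs = ["document " + str(i) for i in range(10)]
train_docs, test_docs = train_test_split_docs(example_docs, test_share=0.2)
print(len(train_docs), len(test_docs))
```

The topic model would then be fitted on the training documents only, and the held-out documents would be assigned to topics afterwards, for example with BERTopic's `transform` method.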

We implemented a simple version of the topic and word intrusion test to validate the model. The results of this test showed satisfactory validity of our results. One could improve the validation with more annotators and a larger dataset for validation. This extension was, however, out of scope for the two authors.

The analysis of the modelling results was relatively short as the main contribution of this work lies in creating the end-to-end pipeline, including data collection, preprocessing, and modelling. The results could be interpreted in more detail and connected with existing research results in a subsequent project. Especially the comparison of the two media and the deep dive into the parties would have needed more attention.

We identified several next steps in this discussion that a follow-up research project could address. Besides extending the data collection to a longer period and more politicians, it would be essential to improve the preprocessing and the detection of context-dependent stop words. Based on this, one could create an improved model with expanded hyperparameter optimisation. This setup would provide a solid foundation for a sophisticated analysis of the topics and a comparison of both media.

# 5.2 Discussion sentiment analysis (Stjepan)

**Needs to be added**

# 6 Bibliography (Stjepan)

Asmussen, C.B. & Møller, C. (2019). *Smart literature review: a practical topic modelling approach to exploratory literature review*. Journal of Big Data, 6(93). https://doi.org/10.1186/s40537-019-0255-7

Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). *Latent dirichlet allocation*. Journal of Machine Learning Research, 3, 993-1022.

Chang, J., Boyd-Graber, J., Gerrish, S., Wang, C., & Blei, D. M. (2009). *Reading tea leaves: how humans interpret topic models*. In Proceedings of the 22nd International Conference on Neural Information Processing Systems. Curran Associates Inc., Red Hook, NY, USA, 288–296.

Dietz, L. (2016). *Topic Model Evaluation: How much does it help?* [PowerPoint slides]. Topic Model Tutorial at WebSci2016, University of Mannheim. http://topicmodels.info/ckling/tmt/part4.pdf

Effing, R., van Hillegersberg, J., & Huibers, T. (2011). *Social Media and Political Participation: Are Facebook, Twitter and YouTube Democratizing Our Political Systems?*. In: Tambouris E., Macintosh A., de Bruijn H. (eds) Electronic Participation. ePart 2011. Lecture Notes in Computer Science, vol 6847. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-23333-3_3

Fatema, S., Yanbin, L., & Fugui, D. (2020). *Impact of Social Media on Politician/Citizens Relationship*. In The 11th International Conference on E-business, Management and Economics. Association for Computing Machinery, New York, NY, USA, 109–114. https://doi.org/10.1145/3414752.3414787

Ferrara, E., Chang, H., Chen, E., Muric, G., & Patel, J. (2020). *Characterizing social media manipulation in the 2020 U.S. presidential election*. First Monday, 25(11). https://doi.org/10.5210/fm.v25i11.11431

Grootendorst, M., & Reimers, N. (2021). MaartenGr/BERTopic: v0.9.4 (v0.9.4). Zenodo. https://doi.org/10.5281/zenodo.5779238

Kapadia, S. (2019, August 19). *Evaluate Topic Models: Latent Dirichlet Allocation (LDA)*. towardsdatascience.com. https://towardsdatascience.com/evaluate-topic-model-in-python-latent-dirichlet-allocation-lda-7d57484bb5d0

McLaughlin, B., & Macafee, T. (2019). *Becoming a Presidential Candidate: Social Media Following and Politician Identification*. Mass Communication and Society, 22(5), 584-603. https://doi.org/10.1080/15205436.2019.1614196

Ott, B. L. (2017). *The age of Twitter: Donald J. Trump and the politics of debasement*. Critical Studies in Media Communication, 34(1), 59-68. https://doi.org/10.1080/15295036.2016.1266686

Rehurek, R., & Sojka, P. (2010). *Software framework for topic modelling with large corpora*. In Proceedings of the LREC 2010 workshop on new challenges for NLP frameworks.

Richter, F., Koch, P., Franke, O., Kraus, J., Kuruc, F., Thiem, A., Högerl, J., Heine, S., & Schöps, K. (2020). Open Discourse. https://github.com/open-discourse/open-discourse

Röder, M., Both, A., & Hinneburg, A. (2015). *Exploring the Space of Topic Coherence Measures*. In Proceedings of the Eighth ACM International Conference on Web Search and Data Mining. Association for Computing Machinery, New York, NY, USA, 399–408. https://doi.org/10.1145/2684822.2685324

Wang, Y. X., & Zhang, Y. J. (2013). *Nonnegative Matrix Factorization: A Comprehensive Review*. IEEE Transactions on Knowledge and Data Engineering, 25(6), 1336-1353. https://doi.org/10.1109/TKDE.2012.51

Wright, S. (2021). *Beyond ‘fake news’? A longitudinal analysis of how Australian politicians attack and criticise the media on Twitter*. Journal of Language and Politics, 20(5), 719-740. https://doi.org/10.1075/jlp.21027.wri

Zimmer, M., & Proferes, N. J. (2014). *A topology of Twitter research: disciplines, methods, and ethics*. Aslib Journal of Information Management, 66(3), 250-261. https://doi.org/10.1108/AJIM-09-2013-0083