# Comparing the communication of German politicians across Twitter and plenary speeches using topic modelling and sentiment analysis

# 1. Introduction (Jakob)

Studying Twitter to understand politicians' communication is a well-established research approach ([Zimmer & Proferes, 2014](https://doi.org/10.1108/AJIM-09-2013-0083)) that gained much publicity through the prominent tweets of the 45th president of the United States of America, Donald J. Trump ([Ott, 2016](https://doi.org/10.1080/15295036.2016.1266686)). To better understand the medium, we compare the content and style of politicians' communication on Twitter with their speeches in the German Bundestag.

The importance of understanding how politicians communicate on Twitter and other social media is steadily increasing, given the significant influence of their content on societies at large ([McLaughlin & Macafee, 2019](https://doi.org/10.1080/15205436.2019.1614196), [Fatema, Yanbin, & Fugui, 2020](https://doi.org/10.1145/3414752.3414787), [van Hillegersberg & Huibers, 2011](https://doi.org/10.1007/978-3-642-23333-3_3)). Besides decentralising the reciprocal transfer of information between politicians and citizens ([van Hillegersberg & Huibers, 2011](https://doi.org/10.1007/978-3-642-23333-3_3)), social media also bring growing problems with manipulation ([Ferrara, Chang, Chen, Muric, & Patel, 2020](https://doi.org/10.5210/fm.v25i11.11431)) and fake news ([Wright, 2021](https://doi.org/10.1075/jlp.21027.wri)). An improved understanding of the medium can help identify harmful practices and interpret content and style in a context-dependent way.

The presented work aims to help increase the understanding of the communication patterns of politicians on Twitter by comparing the content and sentiment of their tweets to their plenary speeches. We execute this analysis for prominent German politicians of the 19th Bundestag in the time range from 2017 to 2021. For this, we defined the following six research questions:

* **RQ 1.1** What are the main topics of tweets of prominent politicians of the six parties in the German Bundestag in the period of the 19th Bundestag?

* **RQ 1.2** What are the main topics of speeches of prominent politicians of the six parties in the German Bundestag in the period of the 19th Bundestag?

* **RQ 1.3** How do the main topics of tweets and speeches of prominent politicians of the six parties in the German Bundestag differ in the period of the 19th Bundestag?

* **RQ 2.1** What are the differences in sentiment between politicians' speeches in the Bundestag and their Twitter posts with regard to their parties in the period of the 19th Bundestag?

* **RQ 2.2** Are there significant differences in the sentiment of speeches and posts between male and female politicians of the Bundestag in the period of the 19th Bundestag?

* **RQ 2.3** How are the sentiments of prominent politicians of the six parties in the German Bundestag developing over time in the period of the 19th Bundestag?

Our approach uses data scraped directly from Twitter and plenary speeches obtained from Open Discourse ([Richter, Koch, Franke, Kraus, Kuruc, Thiem, Högerl, Heine, & Schöps, 2020](https://github.com/open-discourse/open-discourse)) to create topic models and sentiment analyses for the tweets and speeches of the politicians. For this, we set up a pipeline in Python that preprocesses the data for the modelling part. Before choosing the best-performing model, we try several models for topic modelling, including Latent Dirichlet Allocation, Non-Negative Matrix Factorisation and BERTopic. For the sentiment analysis, we use an unsupervised dictionary-based approach that we test with two different sentiment dictionaries. We validate our results with current state-of-the-art evaluation methods.

The remainder of this work is structured into four sections: Literature Review, Methodology, Results and Discussion. The literature review analyses existing research approaches and showcases our contributions. The subsequent section comprises the preprocessing and modelling for our analyses, presented as commented code complemented by explanations. Based on the output of the methodology part, we analyse and validate the results in the fourth section. Finally, we discuss the obtained results and give an outlook on further work in the last section.

# 2. Literature Review (Stjepan)

In preparation for our research, we first look at similar projects and papers that have already been published. As the field of automated media content analysis has grown rapidly in recent years, it is no wonder that many studies have been conducted and new findings shared. For our specific task, we looked at topic modeling and sentiment analysis work on German texts, especially in the context of politics and social media.

As Fatema et al. (Fatema, S., Yanbin, L., & Fugui, D., 2020) discuss in their paper, social media nowadays plays an important role for politicians who want to communicate with citizens and gain popularity. As with every marketing approach, it is vital for politicians to achieve reach and exposure for their topics and opinions. Therefore, platforms like Twitter serve politicians as channels to address a specific part of their electorate in order to keep them informed and to promote themselves and their parties. It is also desirable for them to be able to engage with citizens in discussions and to have a platform for sharing views outside the boundaries of the Bundestag. For our research, this means that we have to examine what politicians present on social media, how they present it, and how this compares to their appearances in the Bundestag.

As the first method for analyzing the politicians, we apply topic modeling to the corpora. A very promising initiative is Open Discourse (Richter, F., Koch, P., Franke, O., Kraus, J., Kuruc, F., Thiem, A., Högerl, J., Heine, S., & Schöps, K., 2020), which analyzes the texts of the speeches in the Bundestag and makes the insights more accessible to the public. In their work, they combine data science methods with political research to gain information from the vast amount of data available in the form of protocols. As the Bundestag is at the center of German politics and this information is available to everybody, the initiative contributes to the political education and information of the German population. In particular, topic modeling in the form of LDA topic models can be found in their research. <br>
In the paper by Heiberger and Koss (Heiberger, R. H., & Koss, C., 2017), the LDA model is likewise applied to identify topics and trends in the German Bundestag. The goal of their analysis was to obtain an automated method for identifying topics and their use. They showed how this method can be applied successfully to gain insights from a vast corpus that cannot be coded by hand. Nevertheless, they stress the need for manual coding or validation with human coding, as the automated approach cannot guarantee valid results on its own. We want to go beyond this approach and additionally apply Non-Negative Matrix Factorisation (NMF) and BERTopic models to try different and newer approaches to automated topic modeling. We also consider social media data since, as mentioned before, addressing citizens over social media is becoming more vital for politicians.

In a second part, we look at the sentiments of politicians in the German Bundestag. Haselmayer and Jenny (Haselmayer, M., & Jenny, M., 2016) stress in their paper that sentiment is highly important for public opinion and political campaigns. Therefore, the analysis of sentiment and the exploration of the factors that determine it are of great interest to us. As they warn in their paper, we have to account for language biases and the limitations of the methods applied. An advantage of automated methods lies in their scalability and in the fact that they can be further enhanced with human coding. In their paper, they created a dictionary for negative sentiment with the help of crowdcoding, an approach that we unfortunately could not apply within our project scope but which seems promising. One takeaway is that we have to be cautious with general-purpose dictionaries, as they may not cover important topics of sentiment or may be biased. <br>
In the paper by Lommatzsch et al. (Lommatzsch, A., Bütow, F., Ploch, D., & Albayrak, S., 2017), we also see an approach to create better coverage in German sentiment analysis, also using two corpora. They also stress the difficulty of the task, as sentiment is very context-dependent. In their work, they apply machine learning models to annotated corpora to create sentiment models. Furthermore, they use a SentiWS-based classifier, of which we apply only the dictionary because our corpus is unlabeled. As the creation of new corpora for the German language was not our prime target, we focus on the application of dictionary approaches to answer our research questions.
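
To make the dictionary approach more concrete, the following minimal sketch scores a tokenised text by averaging the polarity weights of the words found in a sentiment dictionary. The lexicon entries, their weights and the function name `dictionary_sentiment` are hypothetical placeholders for illustration; they are not taken from SentiWS or from our actual pipeline.

```python
# Minimal sketch of dictionary-based sentiment scoring.
# TOY_LEXICON and its weights are invented for illustration only.
TOY_LEXICON = {"gut": 0.5, "erfolg": 0.6, "schlecht": -0.7, "krise": -0.4}


def dictionary_sentiment(tokens, lexicon):
    """Return the mean polarity of all tokens found in the lexicon (0.0 if none match)."""
    weights = [lexicon[token] for token in tokens if token in lexicon]
    return sum(weights) / len(weights) if weights else 0.0


example_tokens = ["die", "krise", "ist", "schlecht", "für", "alle"]
score = dictionary_sentiment(example_tokens, TOY_LEXICON)
print(round(score, 2))
```

Averaging over matched words only keeps the score independent of text length, which matters when comparing short tweets with long speeches; other aggregation choices are equally possible.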

Judging from the available research, we have entered a field with many options, not all of which we can possibly try out. Nevertheless, we can profit from the experience and best practices suggested by those papers and hope to achieve interesting results with our own approaches.


# 3. Methodology

# 3.1 Technical setup (Jakob)

We present the results of our work in a Jupyter Notebook that contains the commented code together with additional explanations, analysis and evaluation. The project was programmed in Python using various existing packages. All results can be reproduced with the provided accompanying files. For this, we recommend setting up a conda environment using Python 3.8. One can then install all the packages using the provided requirements.txt file. Besides these packages, one needs to set up a Docker environment, as described in section 3.3, to reproduce the data collection. That section contains separate instructions.

In [None]:
# Uncomment when setting up the environment for the first time
# ! pip install -r requirements.txt
# ! python -m spacy download de_core_news_sm

After installing the packages, one can import them with the following lines of code.

In [None]:
# Import packages

# Import basic Python packages
import os
import re
import pickle
import random
import warnings
from collections import Counter
from functools import partial

# Import util packages
from tqdm.notebook import tqdm

# Import data processing packages
import numpy as np
import pandas as pd
import psycopg2

# Import visualisation packages
import matplotlib.pyplot as plt
from wordcloud import WordCloud
from scipy.ndimage import gaussian_filter1d  # the scipy.ndimage.filters namespace is deprecated
import matplotlib.dates as mdates

# Import natural language processing packages
import spacy
from spacy.language import Language
from spacy_langdetect import LanguageDetector
from spacy.tokens.doc import Doc
from spacy.vocab import Vocab

# Import topic modelling packages
import gensim
import gensim.corpora as corpora
from gensim.models import CoherenceModel, LdaMulticore
from gensim.models.nmf import Nmf
import pyLDAvis
import pyLDAvis.gensim_models
from bertopic import BERTopic

# Import metrics packages
from sklearn.metrics import cohen_kappa_score, f1_score, accuracy_score, precision_score, recall_score

# Import interface widgets
import ipywidgets as widgets
from ipywidgets import IntProgress
from IPython.display import clear_output, display

# Set package options
pd.options.mode.chained_assignment = None
tqdm.pandas()
pyLDAvis.enable_notebook()
warnings.filterwarnings("ignore", category=DeprecationWarning)
os.environ["TOKENIZERS_PARALLELISM"] = "false"

After loading the packages and installing all dependencies, one can run the entire code. Parts that require longer execution time are commented out, and the results of the code part are imported separately. If one wants to reproduce the analysis, it is required to uncomment the parts and comment out the importing of the results. The necessary steps for this are always described in the code. We do not recommend this step if not necessary as some models have runtimes over eight hours.
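
The pattern of computing a long-running result once and importing it afterwards can also be expressed as a small caching helper. The following is only a sketch under assumptions: the helper name `load_or_compute`, the file name and the stand-in computation are hypothetical, and the notebook itself uses commented-out cells rather than this helper.

```python
import os
import pickle
import tempfile


def load_or_compute(path, compute_fn):
    """Load a pickled result from `path` if it exists, otherwise compute and store it."""
    if os.path.exists(path):
        with open(path, "rb") as file:
            return pickle.load(file)
    result = compute_fn()
    with open(path, "wb") as file:
        pickle.dump(result, file)
    return result


# Cheap stand-in for an expensive model fit; `calls` records how often it runs
calls = []


def expensive_step():
    calls.append(1)
    return {"n_topics": 10}


cache_path = os.path.join(tempfile.mkdtemp(), "result.pkl")
first = load_or_compute(cache_path, expensive_step)   # computes and stores the result
second = load_or_compute(cache_path, expensive_step)  # loads the stored result from disk
```

With such a helper, deleting the cached file is enough to force a fresh computation, so no cells need to be commented in or out.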

# 3.2 Scrape Twitter data (Stjepan)


### 3.2.1 Import packages

As we want to analyze Twitter data and speeches from German politicians, one of our first tasks was to obtain the tweets. Because we chose the 19th Bundestag as our time span, we had to collect data from politicians of the six parties represented in the Bundestag. <br>
For our purpose, we decided to take seven of the more active and most followed politicians from each respective political party. After deciding which politicians' accounts to scrape, we implemented a scraper. We chose snscrape, a scraper for social network services. It allowed us to scrape all tweets ever posted by our selected politicians without problems or restrictions. Apart from Twitter, this scraper can be used for other social networks such as Facebook or Instagram, but for our project Twitter was sufficient.

### 3.2.2 Define variables and functions

After installation, this package allowed us to scrape tweets from the desired accounts and gave us many options for the shape of the retrieved data. Besides the account's username and the tweet text, it also gave us the time of the post, the displayed Twitter name of the account, and the number of replies, retweets and likes. <br> We created a dictionary that maps the usernames of the Twitter accounts to their political party. Afterwards, we defined a function that creates a dataframe of the tweets scraped with snscrape, given the Twitter username and party. The scrape returns only the original tweets and replies of the politician and no retweets, which helped in the analysis, as we only obtained content created by the desired target.

In [None]:
twitter_name_party_dic = {'rbrinkhaus': 'CDU', 'groehe': 'CDU', 'NadineSchoen': 'CDU', 'n_roettgen': 'CDU',
                          'peteraltmaier': 'CDU', 'jensspahn': 'CDU', 'MatthiasHauer': 'CDU', 'c_lindner': 'FDP',
                          'MarcoBuschmann': 'FDP', 'starkwatzinger': 'FDP', 'Lambsdorff': 'FDP',
                          'johannesvogel': 'FDP', 'KonstantinKuhle': 'FDP', 'MAStrackZi': 'FDP',
                          'larsklingbeil': 'SPD', 'EskenSaskia': 'SPD', 'hubertus_heil': 'SPD', 'HeikoMaas': 'SPD',
                          'MartinSchulz': 'SPD', 'KarambaDiaby': 'SPD', 'Karl_Lauterbach': 'SPD',
                          'SteffiLemke': 'Grüne', 'cem_oezdemir': 'Grüne', 'GoeringEckardt': 'Grüne',
                          'KonstantinNotz': 'Grüne', 'BriHasselmann': 'Grüne', 'svenlehmann': 'Grüne',
                          'ABaerbock': 'Grüne', 'SWagenknecht': 'Linke', 'b_riexinger': 'Linke',
                          'NiemaMovassat': 'Linke', 'jankortemdb': 'Linke', 'DietmarBartsch': 'Linke',
                          'GregorGysi': 'Linke', 'SevimDagdelen': 'Linke', 'Alice_Weidel': 'AFD',
                          'Beatrix_vStorch': 'AFD', 'JoanaCotar': 'AFD', 'StBrandner': 'AFD',
                          'Tino_Chrupalla': 'AFD', 'GtzFrmming': 'AFD', 'Leif_Erik_Holm': 'AFD',
                          'ABaerbockArchiv': 'Grüne'
                          }

In [None]:
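# Import the Twitter module of snscrape (only needed for the data collection)
import snscrape.modules.twitter as sntwitter


def retrieve_user_tweets(twitter_name, party):
    """Scrape all tweets of one account and return them as a dataframe."""
    tweets_list = []
    # Search operator that restricts the results to tweets from this account
    search_string = "from:" + twitter_name
    for tweet in tqdm(sntwitter.TwitterSearchScraper(search_string).get_items()):
        tweets_list.append([tweet.date, tweet.id, tweet.content, tweet.user.username, tweet.user.displayname,
                            tweet.replyCount, tweet.retweetCount, tweet.likeCount])
    tweets_df = pd.DataFrame(tweets_list, columns=['datetime', 'tweet_id', 'text', 'username', 'name',
                                                   'reply_count', 'retweet_count', 'like_count'])
    # Attach the party so that all dataframes can later be combined into one corpus
    tweets_df["party"] = party
    return tweets_df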

### 3.2.3 Scrape twitter data

We loop over the dictionary, apply our function to each account and collect the resulting dataframes. Each dataframe contains the date, id, text, username, Twitter name, reply count, retweet count and like count of each individual tweet, plus the party. We then concatenate those dataframes into one big corpus containing all the tweets. With this corpus of around 300,000 tweets, we could start preprocessing and analyzing the tweets.

In [None]:
# twitter_name_tweets_list = []
# for key in twitter_name_party_dic:
#     print(key)
#     twitter_name_tweets_list.append(retrieve_user_tweets(key, twitter_name_party_dic[key]))
# tweets = pd.concat(twitter_name_tweets_list, ignore_index=True)
# tweets.to_csv("../data/raw/tweets_scraped.csv", index = False)
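
Since the scraping loop is commented out, the concatenation step can be illustrated on toy data. The sketch below is not part of the original pipeline; the two small dataframes and their values are invented stand-ins for the per-account dataframes returned by the scraping function.

```python
import pandas as pd

# Toy stand-ins for the dataframes returned per account (all values are invented)
frame_a = pd.DataFrame({"text": ["tweet 1", "tweet 2"], "username": ["account_a"] * 2})
frame_a["party"] = "Party A"
frame_b = pd.DataFrame({"text": ["tweet 3"], "username": ["account_b"]})
frame_b["party"] = "Party B"

# ignore_index=True renumbers the rows so the combined corpus has a unique 0..n-1 index
corpus = pd.concat([frame_a, frame_b], ignore_index=True)
print(corpus.groupby("party").size())
```

Without `ignore_index=True`, every per-account dataframe would keep its own index starting at zero, and the combined corpus would contain duplicate index labels.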


# 3.3 Retrieve plenary proceedings data (Jakob)

In this section, we retrieve the plenary protocol data of the 19th German Bundestag. As the data are publicly available, they can be downloaded from the official [website](https://www.bundestag.de/services/opendata). Currently, the format of the files is not very convenient for automatic analysis, which is why the researchers from [Open Discourse](https://opendiscourse.de) published a preprocessed version of the plenary protocols ([Richter et al., 2020](https://github.com/open-discourse/open-discourse)). We use their data, as setting up our own preprocessing pipeline would be very time-intensive and out of scope for this work. We use their provided [Docker container](https://github.com/open-discourse/open-discourse) to set up the database. We then query the needed data and export them to a CSV file.

## 3.3.1 Set up local database

To use the database, we set up the Docker container from [open discourse](https://open-discourse.github.io/open-discourse-documentation/1.0.0/run-the-database-locally.html#use-the-database). Before this, we have to download and set up Docker according to these [instructions](https://www.docker.com/products/docker-desktop). After this step, one can launch Docker and proceed.

In [None]:
# Define connection details
con_details = {
    "host": "localhost",
    "database": "next",
    "user": "postgres",
    "password": "postgres",
    "port": "5432"}

In [None]:
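# Navigate to Docker container
# Uncomment if one wants to set up the Docker container
# Note: os.system runs the command in a subshell, so this call does not change
# the working directory of the notebook itself
# os.system("cd ..")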

In [None]:
# Login to Github for Docker access
# Uncomment if one wants to set up the Docker container
# os.system("docker login docker.pkg.github.com")

In [None]:
# (Only on the first run) download the Docker container
# Uncomment if one wants to set up the Docker container
# os.system("docker pull docker.pkg.github.com/open-discourse/open-discourse/database:latest")

In [None]:
# Start and run the database in the Docker container
# Uncomment if one wants to set up the Docker container
# os.system("docker run --env POSTGRES_USER=postgres --env POSTGRES_DB=postgres --env POSTGRES_PASSWORD=postgres -p 5432:5432 -d docker.pkg.github.com/open-discourse/open-discourse/database")

## 3.3.2 Retrieve plenary proceedings from database

After setting up the PostgreSQL database, we can now query the required data.

In [None]:
# Define query
query = """SELECT * from open_discourse.speeches WHERE electoral_term = 19"""

In [None]:
# Create connection
# Uncomment if one wants to query the database
# con = psycopg2.connect(**con_details) # If this fails, repeat execution of the cell.
# cur = con.cursor()

In [None]:
# Execute query
# Uncomment if one wants to query the database
# cur.execute(query)
# rows = cur.fetchall()

In [None]:
# Transform results in dataframe
# Uncomment if one made a new query to the database
# speeches_retrieved = pd.DataFrame(rows)
# speeches_retrieved.columns = ["id", "session", "electoral_term", "first_name", "last_name", "politician_id", "text",
#                       "fraction_id", "document_url", "position_short", "position_long", "date",
#                       "search_speech_content"]

In [None]:
# Export results to csv
# Uncomment if one made a new query to the database
# speeches_retrieved.to_csv("../data/raw/speeches_retrieved.csv", index=False)

We save the retrieved data as a CSV file and can use it now for further processing.
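
As an alternative to naming the columns by hand, pandas can read a query result directly into a dataframe and take the column names from the result set. The sketch below is an illustration only: it uses an in-memory SQLite database as a stand-in for the PostgreSQL container, and the simplified table layout is hypothetical. pandas documents SQLAlchemy connectables and sqlite3 connections as supported inputs; the behaviour with a raw psycopg2 connection was not tested here.

```python
import sqlite3

import pandas as pd

# In-memory SQLite database as a stand-in for the PostgreSQL container
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE speeches (id INTEGER, electoral_term INTEGER, text TEXT)")
con.executemany(
    "INSERT INTO speeches VALUES (?, ?, ?)",
    [(1, 18, "alt"), (2, 19, "neu"), (3, 19, "auch neu")],
)

# read_sql_query returns a dataframe whose columns are named after the query result
speeches_demo = pd.read_sql_query("SELECT * FROM speeches WHERE electoral_term = 19", con)
con.close()
print(speeches_demo.columns.tolist())
```

Taking the column names from the database avoids a silent mismatch if the table schema changes between versions of the data.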

# 3.3 Data Exploration

## 3.3.1 Tweets exploration (Stjepan)

#### Import data

In [None]:
# Load tweets data
tweets_scraped = pd.read_csv("../data/raw/tweets_scraped.csv", low_memory=False)

#### Check data

In [None]:
tweets_scraped.head()

In [None]:
tweets_scraped.tail()

In [None]:
tweets_scraped.info()

In [None]:
tweets_scraped.describe()

#### Drop missing data

We can drop all records with missing data, as we cannot use these records for our analysis.

In [None]:
# Drop missing data
tweets_scraped.dropna(inplace = True)

#### Clean names

We harmonise the names in the tweets and speeches data for better comparability.

In [None]:
# Create twitter username to real name dictionary
usernames_to_fullname = {'rbrinkhaus': 'Ralph Brinkhaus', 'groehe': 'Hermann Gröhe',
                         'NadineSchoen': 'Nadine Schön', 'n_roettgen': 'Norbert Röttgen',
                         'peteraltmaier': 'Peter Altmaier', 'jensspahn': 'Jens Spahn',
                         'MatthiasHauer': 'Matthias Hauer', 'c_lindner': 'Christian Lindner',
                         'MarcoBuschmann': 'Marco Buschmann', 'starkwatzinger': 'Bettina Stark-Watzinger',
                         'Lambsdorff': 'Alexander Graf Lambsdorff', 'johannesvogel': 'Johannes Vogel',
                         'KonstantinKuhle': 'Konstantin Kuhle', 'MAStrackZi': 'Marie-Agnes Strack-Zimmermann',
                         'larsklingbeil': 'Lars Klingbeil', 'EskenSaskia': 'Saskia Esken',
                         'hubertus_heil': 'Hubertus Heil', 'HeikoMaas': 'Heiko Maas',
                         'MartinSchulz': 'Martin Schulz', 'KarambaDiaby': 'Karamba Diaby',
                         'Karl_Lauterbach': 'Karl Lauterbach', 'SteffiLemke': 'Steffi Lemke',
                         'cem_oezdemir': 'Cem Özdemir', 'GoeringEckardt': 'Katrin Göring-Eckardt',
                         'KonstantinNotz': 'Konstantin von Notz', '6': 'Konstantin von Notz',
                         'BriHasselmann': 'Britta Haßelmann', 'svenlehmann': 'Sven Lehmann',
                         'ABaerbock': 'Annalena Baerbock', 'ABaerbockArchiv': 'Annalena Baerbock',
                         'SWagenknecht': 'Sahra Wagenknecht', 'b_riexinger': 'Bernd Riexinger',
                         'NiemaMovassat': 'Niema Movassat', 'jankortemdb': 'Jan Korte',
                         'DietmarBartsch': 'Dietmar Bartsch', 'GregorGysi': 'Gregor Gysi',
                         'SevimDagdelen': 'Sevim Dağdelen', 'Alice_Weidel': 'Alice Weidel',
                         'Beatrix_vStorch': 'Beatrix von Storch', 'JoanaCotar': 'Joana Cotar',
                         'StBrandner': 'Stephan Brandner', 'Tino_Chrupalla': 'Tino Chrupalla',
                         'GtzFrmming': 'Götz Frömming', '3': 'Götz Frömming', 'Leif_Erik_Holm': 'Leif-Erik Holm'}

In [None]:
# Add full name
tweets_scraped["full_name"] = tweets_scraped.username.replace(usernames_to_fullname)
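
Because `Series.replace` leaves values without a dictionary entry unchanged, a username that is missing from the mapping would silently stay in the `full_name` column. The following sketch shows one way to check for such cases; the mapping and the usernames are invented for illustration.

```python
import pandas as pd

# Toy mapping and data; the names are invented for illustration
demo_mapping = {"user_a": "Anna Alpha", "user_b": "Bernd Beta"}
demo = pd.DataFrame({"username": ["user_a", "user_b", "user_c"]})

# Series.replace keeps values without a dictionary entry unchanged
demo["full_name"] = demo["username"].replace(demo_mapping)

# Usernames that are not covered by the mapping
unmapped = sorted(set(demo["username"]) - set(demo_mapping))
print(unmapped)
```

An empty `unmapped` list indicates that every account in the data was assigned a full name.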

#### Check time data

In [None]:
# Add normalized date (the scraped timestamps also contain a time and a UTC offset, so no date-only format is passed)
tweets_scraped["date"] = pd.to_datetime(tweets_scraped["datetime"], utc=True).dt.date

In [None]:
tweets_scraped.date.min()

In [None]:
tweets_scraped.date.max()

In [None]:
# Tweet number per time
tweets_scraped.groupby('date')['tweet_id'].size().plot()

We can now drop all data not represented in the speeches dataset.

In [None]:
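# Drop unneeded data; ISO dates avoid day-first/month-first ambiguity (assumes 24 Oct 2017 to 7 May 2021 was intended)
start_date, end_date = pd.Timestamp("2017-10-24").date(), pd.Timestamp("2021-05-07").date()
tweets_subset = tweets_scraped[(tweets_scraped.date >= start_date) & (tweets_scraped.date <= end_date)].copy()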
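
Dates written as 24.10.2017 can be read day-first or month-first depending on the parser's settings. The sketch below illustrates, on invented toy data, how an explicit format string removes this ambiguity before filtering; the variable names are hypothetical and the example is not part of the original pipeline.

```python
import datetime as dt

import pandas as pd

# Toy data with plain datetime.date objects, as produced by `.dt.date`
demo = pd.DataFrame({
    "text": ["too early", "in range", "too late"],
    "date": [dt.date(2017, 1, 1), dt.date(2019, 6, 1), dt.date(2021, 9, 1)],
})

# An explicit format removes any doubt about which number is the day
start = pd.to_datetime("24.10.2017", format="%d.%m.%Y").date()
end = pd.to_datetime("07.05.2021", format="%d.%m.%Y").date()

# Series.between is inclusive on both ends by default
in_range = demo[demo["date"].between(start, end)]
print(in_range["text"].tolist())
```

Converting the bounds to `datetime.date` keeps both sides of the comparison the same type as the values in the `date` column.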

#### Check party distribution

When checking the distribution of tweets per party, we can see differences, but we do not expect them to alter our results significantly.

In [None]:
# Tweets per party
tweets_subset.groupby("party").size()

#### Check politician distribution

We see significant differences in the number of tweets per politician, ranging from 658 to 29,665. We have to consider this in our work.

In [None]:
# Tweets per politician
tweets_scraped.groupby('full_name')['tweet_id'].size().sort_values().plot(kind='bar')

In the time series of tweets per day plotted earlier, we see an increasing trend. We attribute this trend to two new parties entering the Bundestag in 2017.

#### Check text

We check the texts of the tweets with a word cloud. We can infer the need for data preprocessing from a first visualisation analysis.

In [None]:
# Create a word cloud
long_string_tweets = ' '.join(tweets_scraped["text"].tolist())
wordcloud_tweets = WordCloud(background_color="white", max_words=5000, contour_width=3, contour_color='steelblue')
wordcloud_tweets.generate(long_string_tweets)
wordcloud_tweets.to_image()

In [None]:
# Create a counter object
counter_tweets = Counter(long_string_tweets.split())

In [None]:
# Check the most common words
counter_tweets.most_common(10)

The most common words are mostly function words, which shows the need for stopword removal.
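
The effect of stopword removal on the word counts can be illustrated with a short example. The sentence and the mini stop word list below are invented for illustration; real stop word lists, such as the one shipped with spaCy, are much longer.

```python
from collections import Counter

# Hypothetical mini stop word list; real lists are much longer
demo_stopwords = {"die", "der", "und", "ist", "für"}
demo_text = "Die Rente ist sicher und die Rente ist wichtig für die Zukunft"

# Lowercase and split on whitespace, as a very simple tokenisation
tokens = demo_text.lower().split()
raw_counts = Counter(tokens)
filtered_counts = Counter(token for token in tokens if token not in demo_stopwords)

print(raw_counts.most_common(1))
print(filtered_counts.most_common(1))
```

Before filtering, the article "die" is the most frequent token; after filtering, the content word "rente" takes its place.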

#### Drop unneeded columns

In [None]:
# Drop unneeded columns
tweets_subset.drop(['datetime', 'tweet_id', 'username','name', 'reply_count'], axis = 1, inplace = True)

#### Export data

In [None]:
tweets_subset.to_csv("../data/interim/tweets_explored.csv", index = False)

## 3.3.2 Explore speeches of politicians (Jakob)

In this first analysis step, we get an overview of the retrieved data and perform the first simple preprocessing tasks.

### 3.3.2.1 Import data

In [None]:
# Load speeches data
# Comment out if one retrieves the data from scratch
speeches_retrieved = pd.read_csv("../data/raw/speeches_retrieved.csv", low_memory=False)

### 3.3.2.2 Check data

We use standard steps of data exploration to get an overview of the retrieved data, including data types and missing values.

In [None]:
speeches_retrieved.head()

In [None]:
speeches_retrieved.tail()

In [None]:
speeches_retrieved.info()

Based on the first overview of the data, we can identify different variables that we have to deep dive into to understand the data quality better and prepare the first processing steps.

### 3.3.2.3 Drop missing data

We can drop all records with missing speech content, as we cannot use these records for our analysis.

In [None]:
# Drop missing data
speeches_retrieved.dropna(subset = ["text"], inplace = True)

### 3.3.2.4 Clean names

We harmonise the politicians' names in the tweets and speeches data for better comparability.

In [None]:
# Add full name of politicians
speeches_retrieved["full_name"] = speeches_retrieved["first_name"] + " " + speeches_retrieved["last_name"]

In [None]:
# Subset to the selected politicians
speeches_subset = speeches_retrieved[speeches_retrieved.full_name.isin(tweets_subset.full_name.unique())]

In [None]:
# Speeches per politician
speeches_subset.groupby('full_name')['id'].size().sort_values().plot(kind='bar')

There are significant differences between the number of speeches per politician ranging from 252 to 5. We have to consider this in the interpretation of our results.

### 3.3.2.5 Check time data

To analyse the topics over time, we need the dates in a pandas-compatible date format. Additionally, we check the time range covered by the retrieved speeches.

In [None]:
# Add normalized date
speeches_subset["date"] = pd.to_datetime(speeches_subset["date"], format = "%Y-%m-%d").dt.date

In [None]:
# Find first day with speeches
speeches_subset.date.min()

In [None]:
# Find last day with speeches
speeches_subset.date.max()

In [None]:
# Speech number per time
speeches_subset.groupby('date')['id'].size().plot()

We see some patterns in the time series. However, there are no significant gaps in the observed time frame.

### 3.3.2.6 Check party distribution

To check the distribution of speeches per party, we assign the speaker's party to each speech.

In [None]:
# Dictionary assigning parties to the politicians
fullname_to_party = {'Ralph Brinkhaus': 'CDU', 'Hermann Gröhe': 'CDU', 'Nadine Schön': 'CDU',
                     'Norbert Röttgen': 'CDU', 'Peter Altmaier': 'CDU', 'Jens Spahn': 'CDU',
                     'Matthias Hauer': 'CDU', 'Christian Lindner': 'FDP', 'Marco Buschmann': 'FDP',
                     'Bettina Stark-Watzinger': 'FDP', 'Alexander Graf Lambsdorff': 'FDP', 'Johannes Vogel': 'FDP',
                     'Konstantin Kuhle': 'FDP', 'Marie-Agnes Strack-Zimmermann': 'FDP', 'Lars Klingbeil': 'SPD',
                     'Saskia Esken': 'SPD', 'Hubertus Heil': 'SPD', 'Heiko Maas': 'SPD', 'Martin Schulz': 'SPD',
                     'Karamba Diaby': 'SPD', 'Karl Lauterbach': 'SPD', 'Steffi Lemke': 'Grüne',
                     'Cem Özdemir': 'Grüne', 'Katrin Göring-Eckardt': 'Grüne', 'Konstantin von Notz': 'Grüne',
                     'Britta Haßelmann': 'Grüne', 'Sven Lehmann': 'Grüne', 'Annalena Baerbock': 'Grüne',
                     'Sahra Wagenknecht': 'Linke', 'Bernd Riexinger': 'Linke', 'Niema Movassat': 'Linke',
                     'Jan Korte': 'Linke', 'Dietmar Bartsch': 'Linke', 'Gregor Gysi': 'Linke',
                     'Sevim Dağdelen': 'Linke', 'Alice Weidel': 'AFD', 'Beatrix von Storch': 'AFD',
                     'Joana Cotar': 'AFD', 'Stephan Brandner': 'AFD', 'Tino Chrupalla': 'AFD',
                     'Götz Frömming': 'AFD', 'Leif-Erik Holm': 'AFD'}

In [None]:
# Assign party
speeches_subset["party"] = speeches_subset.full_name.replace(fullname_to_party)

In [None]:
# Speeches per party
speeches_subset.groupby("party").size()

When checking the distribution of speeches per party, we see differences, but we do not expect them to alter our results significantly.

### 3.3.2.7 Check text

We check the texts of the speeches with a word cloud. We can infer the need for data preprocessing from a first visualisation analysis.

In [None]:
# Create a word cloud
long_string_speeches = ' '.join(speeches_subset["text"].tolist())
wordcloud_speeches = WordCloud(background_color="white", max_words=5000, contour_width=3,
                               contour_color='steelblue')
wordcloud_speeches.generate(long_string_speeches)
wordcloud_speeches.to_image()

In [None]:
# Create a counter object
speeches_counter = Counter(long_string_speeches.split())

In [None]:
# Check the most common words
speeches_counter.most_common(10)

There is a clear need for extensive stopword removal to reduce noise in the topic and sentiment analysis.

### 3.3.2.8 Drop unneeded columns

In [None]:
# Drop unneeded columns
speeches_subset.drop(['id', 'session', 'electoral_term', 'first_name', 'last_name', 'politician_id',
                      'fraction_id', 'document_url', 'position_short', 'position_long', 'search_speech_content'],
                     axis = 1, inplace = True)

### 3.3.2.9 Export data

In [None]:
speeches_subset.to_csv("../data/interim/speeches_explored.csv", index = False)

This section explored the speeches dataset and controlled the data quality of different essential variables. The data quality is satisfactory, except for a highly skewed distribution of politicians' speeches.

# 3.4 Data preprocessing

## 3.4.1 Prepare spacy pipelines (Jakob)

In the last section, we identified the need for extensive preprocessing. We build a flexible spaCy pipeline structure that lets us quickly add or remove different preprocessing steps. We base our model on the pretrained [spacy pipeline](https://spacy.io/models/de).

In [None]:
@Language.component("Remove non alphabetic words")
def remove_non_alpha(doc):
    return [token for token in doc if token.is_alpha]

We identified the need to remove non-German text, as it reduces the quality of our topic and sentiment models. We use a language detector and an additional component that keeps only the sentences detected as German.

In [None]:
@Language.factory("Detect languages")
def create_language_detector(nlp, name):
    return LanguageDetector(language_detection_function=None)

In [None]:
@Language.component("Keep only German documents")
def remove_non_german(doc):
    res = [sent for sent in doc.sents if sent._.language["language"] == "de"]
    if res:
        return [token for sent in res for token in sent]
    else:
        return Doc(Vocab([]), words=[], spaces=[])

In [None]:
@Language.component("Remove stopwords")
def remove_stopwords(doc):
    return [token for token in doc if not token.is_stop]

We lemmatise the remaining tokens so that different inflected forms of a word are mapped to the same base form while its semantic meaning is kept.

In [None]:
@Language.component("Lemmatize text")
def lemmatize_text(doc):
    return [token.lemma_ for token in doc]

In [None]:
@Language.component("Lowercase Text")
def lowercase(doc):
    return [token.lower() for token in doc]
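
The idea of a pipeline whose steps can be added or removed freely can be sketched without spaCy. The following minimal example is an illustration only: the step functions, the tiny stopword set and the whitespace tokenisation are hypothetical stand-ins for the spaCy components defined above, and lemmatisation and language detection are left out.

```python
# Hypothetical mini stopword list; the notebook uses spaCy's German stopwords instead
STOPWORDS = {"und", "die", "der", "rt", "amp"}


def keep_alpha(tokens):
    # Drop tokens that contain digits or punctuation
    return [t for t in tokens if t.isalpha()]


def drop_stopwords(tokens):
    # Compare case-insensitively against the stopword list
    return [t for t in tokens if t.lower() not in STOPWORDS]


def lowercase(tokens):
    return [t.lower() for t in tokens]


def run_pipeline(text, steps):
    # Naive whitespace tokenisation, then apply each step in order
    tokens = text.split()
    for step in steps:
        tokens = step(tokens)
    return tokens


# Steps can be reordered, added or removed by editing this list
steps = [keep_alpha, drop_stopwords, lowercase]
result = run_pipeline("RT Die Energiewende und 2021 Klimaschutz", steps)
print(result)
```

Because every step takes and returns a list of tokens, the order of the list alone defines the preprocessing, which mirrors how the components are appended with `add_pipe` in the following cells.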

## 3.4.2 Topic modeling preprocessing (Jakob)

We do not need all elements of the pretrained pipeline and therefore exclude them when loading the model. In the next step, we add the previously defined custom components.

In [None]:
# Exclude not needed pipeline elements
pipeline_exclude = ['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'ner', 'morphologizer']

### 3.4.2.1 Tweets

This subsection defines a pipeline to preprocess the Twitter data and execute the pipeline.

In [None]:
# Import data
tweets_explored = pd.read_csv("../data/interim/tweets_explored.csv")

In [None]:
# Create spacy pipeline
nlp_tweets = spacy.load('de_core_news_sm', exclude=pipeline_exclude)
nlp_tweets.Defaults.stop_words |= {"amp", "rt"}

# Add needed pipeline components
nlp_tweets.add_pipe("sentencizer", last=True)
nlp_tweets.add_pipe("Detect languages", name='Detect languages', last=True)
nlp_tweets.add_pipe("Keep only German documents", name='Keep only German documents', last=True)
nlp_tweets.add_pipe("Remove non alphabetic words", name="Remove non alphabetic words", last=True)
nlp_tweets.add_pipe("Remove stopwords", name="Remove stopwords", last=True)
nlp_tweets.add_pipe("Lemmatize text", name="Lemmatize text", last=True)
nlp_tweets.add_pipe("Lowercase Text", name="Lowercase Text", last=True)

In [None]:
# Apply pipeline to text
# Uncomment if one wants to update the preprocessing of the data
# tweets_explored["text_preprocessed"] = tweets_explored.text.progress_apply(nlp_tweets)
# This takes approximately one hour

In [None]:
# Add sentence structure
# Uncomment if one wants to update the preprocessing of the data
# tweets_explored["text_preprocessed_sentence"] = tweets_explored["text_preprocessed"].progress_apply(
#    lambda x: " ".join(x))

In [None]:
# Subset needed data
# Uncomment if one wants to update the preprocessing of the data
# tweets_preprocessed = tweets_explored[["full_name", "date", "party", "text", "text_preprocessed",
#                                       "text_preprocessed_sentence", 'retweet_count', 'like_count']]

In [None]:
# Drop empty texts
# Uncomment if one wants to update the preprocessing of the data
# tweets_preprocessed.replace('', np.nan, inplace=True)
# tweets_preprocessed.dropna(inplace=True)
# tweets_preprocessed.reset_index(drop = True, inplace = True)

In [None]:
# Save data as pickle file
# Uncomment if one wants to update the preprocessing of the data
# pickle.dump(tweets_preprocessed, open("../data/processed/tweets_processed.p", "wb"))

We now can use the resulting file to train a topic model for the tweets dataset.

### 3.4.2.2 Speeches

This subsection defines a pipeline for preprocessing the speeches data and executing the pipeline.

In [None]:
# Import data
speeches_explored = pd.read_csv("../data/interim/speeches_explored.csv")

In [None]:
# Create spacy pipeline
nlp_speeches = spacy.load('de_core_news_sm', exclude=pipeline_exclude)

# Add needed pipeline components
nlp_speeches.add_pipe('sentencizer', last=True)
nlp_speeches.add_pipe("Detect languages", name='Detect languages', last=True)
nlp_speeches.add_pipe("Keep only German documents", name='Keep only German documents', last=True)
nlp_speeches.add_pipe("Remove non alphabetic words", name="Remove non alphabetic words", last=True)
nlp_speeches.add_pipe("Remove stopwords", name="Remove stopwords", last=True)
nlp_speeches.add_pipe("Lemmatize text", name="Lemmatize text", last=True)
nlp_speeches.add_pipe("Lowercase Text", name="Lowercase Text", last=True)

In [None]:
# Apply pipeline to text
# Uncomment if one wants to update the preprocessing of the data
# speeches_explored["text_preprocessed"] = speeches_explored.text.progress_apply(nlp_speeches)

In [None]:
# Add sentence structure
# Uncomment if one wants to update the preprocessing of the data
# speeches_explored["text_preprocessed_sentence"] = speeches_explored["text_preprocessed"].progress_apply(
#    lambda x: " ".join(x))

In [None]:
# Subset needed data
# Uncomment if one wants to update the preprocessing of the data
# speeches_preprocessed = speeches_explored[["full_name", "date", "party", "text",
#                                           "text_preprocessed", "text_preprocessed_sentence"]]

We identified the need to remove frequent words for topic modelling. Many words come from greeting phrases (Sehr, geehrte, Frauen, Herren) that do not have semantic relevance for our analyses but interfere with the model quality based on their frequency.

In [None]:
# Define function for removing frequent words
def remove_frequent_words(words_list, most_frequent_words):
    return [word for word in words_list if word not in most_frequent_words]
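
A minimal sketch of the frequent-word removal, assuming token lists as input and a made-up three-speech corpus. The helper `most_frequent` is hypothetical; the notebook derives the 200 most frequent words from the joined sentence strings instead.

```python
from collections import Counter


def most_frequent(token_lists, n):
    # Count every token over the whole corpus and keep the n most common ones
    counts = Counter(tok for toks in token_lists for tok in toks)
    return {word for word, _ in counts.most_common(n)}


def remove_frequent_words(words_list, most_frequent_words):
    return [word for word in words_list if word not in most_frequent_words]


# Made-up example speeches, already tokenised and lowercased
speeches = [
    ["geehrt", "herr", "präsident", "klimaschutz"],
    ["geehrt", "dame", "herr", "rente"],
    ["geehrt", "herr", "digitalisierung"],
]

frequent = most_frequent(speeches, 2)
cleaned = [remove_frequent_words(s, frequent) for s in speeches]
print(frequent)
print(cleaned)
```

The greeting words occur in every speech and are removed, while the topical words remain.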

In [None]:
# Additional preprocessing for BERTopic model
# Uncomment if one wants to update the preprocessing of the data
# long_string_speeches = ' '.join(speeches_preprocessed.text_preprocessed_sentence.tolist())
# counter_speeches = Counter(long_string_speeches.split())
# most_frequent_words = []
# for item in counter_speeches.most_common(200):
#    most_frequent_words.append(item[0])

In [None]:
# Add columns with preprocessed text and removed frequent words
# Uncomment if one wants to update the preprocessing of the data
# speeches_preprocessed["text_preprocessed_infrequent"] = speeches_preprocessed.text_preprocessed.progress_apply(remove_frequent_words,most_frequent_words = most_frequent_words)
# speeches_preprocessed["text_preprocessed_infrequent_sentence"] = speeches_preprocessed["text_preprocessed_infrequent"].progress_apply(lambda x: " ".join(x))

In [None]:
# Drop empty texts
# Uncomment if one wants to update the preprocessing of the data
# speeches_preprocessed.replace('', np.nan, inplace=True)
# speeches_preprocessed.dropna(inplace=True)
# speeches_preprocessed.reset_index(drop = True, inplace = True)

In [None]:
# Save data as pickle file
# Uncomment if one wants to update the preprocessing of the data
# pickle.dump(speeches_preprocessed, open("../data/processed/speeches_processed.p", "wb"))

We now can use the resulting file to train a topic model for the speeches dataset.

## 3.4.3 Sentiment analysis preprocessing (Stjepan)

**Needs to be added**

# 3.5 Topic Modeling (Jakob)

We perform topic modelling to better understand the differences in politicians' communication on Twitter and in the Bundestag. For this, we test three different approaches before choosing the best-performing one as our final model. We apply hyperparameter tuning where applicable but omit a classic train-test split validation. We analyse the validity of the topic model in the Results section.

## 3.5.1 Latent Dirichlet Allocation

Latent Dirichlet Allocation (LDA) constitutes a state of the art approach ([Asmussen & Møller, 2019](https://doi.org/10.1186/s40537-019-0255-7)) for topic modelling. LDA is an unsupervised machine learning technique that uses generative statistical models to extract topics from a collection of documents ([Blei, Ng, & Jordan, 2003](https://dl.acm.org/doi/10.5555/944919.944937)). The underlying model assigns a probability distribution over the vocabulary of the documents to topics that can be used for topic detection. We will base our choice of the optimal hyperparameter combination on the coherence of the resulting topic model. This decision is based on the discussion in [Dietz (2016)](http://topicmodels.info/ckling/tmt/part4.pdf) and [Röder, Both, & Hinneburg (2015)](https://dl.acm.org/doi/abs/10.1145/2684822.2685324).
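
To give an intuition for what a coherence score rewards, the following sketch computes a simplified, document-level NPMI coherence for the top words of a topic. This is an assumption-laden illustration and not the `c_v` measure used in this notebook, which additionally relies on a sliding window and cosine similarity of word vectors. The function name `npmi_coherence` and the example documents are made up.

```python
import math
from itertools import combinations


def npmi_coherence(topic_words, documents):
    # Average normalised pointwise mutual information over all word pairs
    doc_sets = [set(doc) for doc in documents]
    n_docs = len(doc_sets)

    def prob(*words):
        # Fraction of documents that contain all given words
        hits = sum(1 for d in doc_sets if all(w in d for w in words))
        return hits / n_docs

    scores = []
    for w1, w2 in combinations(topic_words, 2):
        p1, p2, p12 = prob(w1), prob(w2), prob(w1, w2)
        if p12 == 0.0:
            scores.append(-1.0)  # the words never co-occur
        elif p12 == 1.0:
            scores.append(0.0)   # both words occur everywhere, no information
        else:
            scores.append(math.log(p12 / (p1 * p2)) / -math.log(p12))
    return sum(scores) / len(scores)


# Made-up tokenised documents
documents = [
    ["klima", "energie", "wind"],
    ["klima", "energie", "kohle"],
    ["rente", "pflege", "alter"],
    ["rente", "pflege", "beitrag"],
]

coherent = npmi_coherence(["klima", "energie"], documents)
mixed = npmi_coherence(["klima", "rente"], documents)
print(coherent, mixed)
```

Words that tend to appear in the same documents yield a score close to 1, while words that never appear together yield -1.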

#### Define hyperparameters for optimization.

We optimise the hyperparameters of the LDA model with a grid search over the number of topics (k), the a priori belief about the document-topic distribution (alpha) and the a priori belief about the topic-word distribution (eta) ([Rehurek & Sojka, 2010](https://radimrehurek.com/gensim/models/ldamodel.html)). This hyperparameter optimisation is loosely based on [Kapadia (2019)](https://towardsdatascience.com/evaluate-topic-model-in-python-latent-dirichlet-allocation-lda-7d57484bb5d0).

In [None]:
# Topics range
min_topics = 10
max_topics = 150
step_size = 10
topics_range = range(min_topics, max_topics, step_size)

In [None]:
# Alpha parameter
alpha = list(np.arange(0.01, 1, 0.3))
alpha.append('symmetric')
alpha.append('asymmetric')

In [None]:
# Beta parameter
beta = list(np.arange(0.01, 1, 0.3))
beta.append('symmetric')

In [None]:
# Function for calculating coherence values of specific hyperparameter combinations
def compute_lda_coherence_values(corpus, text, id2word, k, a, b):
    lda_model = LdaMulticore(corpus=corpus,
                             id2word=id2word,
                             num_topics=k,
                             random_state=42,
                             alpha=a,
                             eta=b)
    coherence_model_lda = CoherenceModel(model=lda_model, texts=text, dictionary=id2word, coherence='c_v')
    return coherence_model_lda.get_coherence()

In [None]:
# Function for executing a hyperparameter optimization
def hyperparameter_lda(data_preprocessed, title, topics_range, alpha, beta):
    id2word = corpora.Dictionary(data_preprocessed.text_preprocessed.to_list())
    # These filter thresholds could also be tuned in an extended search
    id2word.filter_extremes(no_below=10, no_above=0.1)
    texts = data_preprocessed.text_preprocessed.to_list()
    corpus = [id2word.doc2bow(text) for text in texts]
    model_results = {'Topics': [],
                     'Alpha': [],
                     'Beta': [],
                     'Coherence': []
                    }
    for k in tqdm(topics_range):
        print("Number of topics:" + str(k))
        for a in tqdm(alpha):
            print("Alpha value:" + str(a))
            for b in tqdm(beta):
                print("Beta value:" + str(b))
                cv = compute_lda_coherence_values(corpus=corpus,text = texts,
                                              id2word=id2word, k=k, a=a, b=b)
                model_results['Topics'].append(k)
                model_results['Alpha'].append(a)
                model_results['Beta'].append(b)
                model_results['Coherence'].append(cv)
    results_df = pd.DataFrame(model_results)
    results_df.to_csv('../data/processed/lda_tuning_results_' + title + '.csv', index=False)
    return results_df
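
The vocabulary filtering inside the function can be illustrated with a small pure-Python sketch. It mimics the documented meaning of `no_below` (minimum number of documents) and `no_above` (maximum fraction of documents) but is a simplification: gensim's `filter_extremes` also caps the vocabulary size with `keep_n` and works on token ids. The function `filter_vocabulary` and the example documents are hypothetical.

```python
from collections import Counter


def filter_vocabulary(token_lists, no_below, no_above):
    # Document frequency: in how many documents does each token occur?
    n_docs = len(token_lists)
    doc_freq = Counter(tok for toks in token_lists for tok in set(toks))
    return {
        tok
        for tok, df in doc_freq.items()
        if df >= no_below and df / n_docs <= no_above
    }


# Made-up tokenised documents
docs = [
    ["klima", "und"],
    ["klima", "und"],
    ["rente", "und"],
    ["und", "selten"],
]

vocab = filter_vocabulary(docs, no_below=2, no_above=0.5)
print(vocab)
```

Rare tokens fall below `no_below`, and tokens that occur in almost every document exceed `no_above`, so only the mid-frequency vocabulary is passed to the model.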

### 3.5.1.1 Hyperparameter optimization LDA for tweets

In [None]:
# Load data
tweets_processed_lda = pickle.load(open("../data/processed/tweets_processed.p", "rb"))

In [None]:
# Hyperparameter optimization
# Uncomment if one wants to repeat the hyperparameter optimization
# hyperparameter_lda_tweets = hyperparameter_lda(tweets_processed_lda, "tweets", topics_range, alpha, beta)
# Takes approximately one hour of runtime

In [None]:
# Save hyperparameter
# Uncomment if one wants to repeat the hyperparameter optimization
# hyperparameter_lda_tweets.to_csv('../data/processed/lda_tuning_results_tweets.csv', index = False)

### 3.5.1.2 Calculate best model LDA for tweets

We compute the best LDA model for the tweets dataset based on the hyperparameter optimisation from the last subsection.

In [None]:
# Load data
lda_tuning_results_tweets = pd.read_csv('../data/processed/lda_tuning_results_tweets.csv')

In [None]:
# Prepare corpus
id2word_tweets_lda = corpora.Dictionary(tweets_processed_lda.text_preprocessed.to_list())
id2word_tweets_lda.filter_extremes(no_below=5, no_above=0.1)
texts_tweets_lda = tweets_processed_lda.text_preprocessed.to_list()
corpus_tweets_lda = [id2word_tweets_lda.doc2bow(text) for text in texts_tweets_lda]

In [None]:
# Retrieve optimal hyperparameter
k_optimal_lda_tweets = int(lda_tuning_results_tweets.sort_values("Coherence", ascending = False).reset_index(drop = True).Topics[0])
try:
    a_optimal_lda_tweets = float(lda_tuning_results_tweets.sort_values("Coherence", ascending = False).reset_index(drop = True).Alpha[0])
except ValueError:
    a_optimal_lda_tweets = lda_tuning_results_tweets.sort_values("Coherence", ascending = False).reset_index(drop = True).Alpha[0]
try:
    b_optimal_lda_tweets = float(lda_tuning_results_tweets.sort_values("Coherence", ascending = False).reset_index(drop = True).Beta[0])
except ValueError:
    b_optimal_lda_tweets = lda_tuning_results_tweets.sort_values("Coherence", ascending = False).reset_index(drop = True).Beta[0]
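
The `try`/`except` blocks above are needed because the `Alpha` and `Beta` columns mix numbers and keywords such as `symmetric`. The following standard-library sketch shows the same selection logic on a made-up tuning table; the helper `parse_prior` and all values in the table are hypothetical.

```python
import csv
import io


def parse_prior(value):
    # Priors are stored as text: convert numbers, keep keywords unchanged
    try:
        return float(value)
    except ValueError:
        return value


# Made-up tuning results in the same column layout as the saved CSV file
tuning_csv = io.StringIO(
    "Topics,Alpha,Beta,Coherence\n"
    "10,0.01,0.31,0.41\n"
    "20,symmetric,0.61,0.47\n"
    "30,asymmetric,symmetric,0.44\n"
)

rows = list(csv.DictReader(tuning_csv))
# Select the row with the highest coherence
best = max(rows, key=lambda row: float(row["Coherence"]))

k_optimal = int(best["Topics"])
a_optimal = parse_prior(best["Alpha"])
b_optimal = parse_prior(best["Beta"])
print(k_optimal, a_optimal, b_optimal)
```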

In [None]:
# Train model
lda_model_tweets = LdaMulticore(corpus=corpus_tweets_lda,
                                 id2word=id2word_tweets_lda,
                                 num_topics=k_optimal_lda_tweets,
                                 random_state=42,
                                 alpha=a_optimal_lda_tweets,
                                 eta=b_optimal_lda_tweets)

In [None]:
# Calculate final coherence value
coherence_model_lda_tweets = CoherenceModel(model=lda_model_tweets, texts=texts_tweets_lda, dictionary=id2word_tweets_lda, coherence='c_v')
coherence_lda_tweets = coherence_model_lda_tweets.get_coherence()
print("The final model coherence of the LDA for Tweets is: " + str(round(coherence_lda_tweets,2)))

In [None]:
# Visually inspect result
lda_vis_tweets = pyLDAvis.gensim_models.prepare(lda_model_tweets, corpus_tweets_lda, id2word_tweets_lda)
lda_vis_tweets

We use the coherence value and topic visualisation to evaluate the model. The model has a good coherence score, but the visual inspection shows topics that are not easily interpretable. Based on this, we cannot infer a high model quality.

### 3.5.1.3 Hyperparameter optimization LDA for speeches

In [None]:
# Load data
speeches_processed_lda = pickle.load(open( "../data/processed/speeches_processed.p", "rb" ))

In [None]:
# Hyperparameter optimization
# Uncomment if one wants to repeat the hyperparameter optimization
# hyperparameter_lda_speeches = hyperparameter_lda(speeches_processed_lda, "speeches", topics_range, alpha, beta)

In [None]:
# Save hyperparameter
# Uncomment if one wants to repeat the hyperparameter optimization
# hyperparameter_lda_speeches.to_csv('../data/processed/lda_tuning_results_speeches.csv', index = False)

### 3.5.1.4 Calculate best model LDA for speeches

We compute the best LDA model for the speeches dataset based on the hyperparameter optimisation from the last subsection.

In [None]:
# Load data
lda_tuning_results_speeches = pd.read_csv('../data/processed/lda_tuning_results_speeches.csv')

In [None]:
# Prepare corpus (built from the speeches data, not the tweets data)
id2word_speeches_lda = corpora.Dictionary(speeches_processed_lda.text_preprocessed.to_list())
id2word_speeches_lda.filter_extremes(no_below=5, no_above=0.1)
texts_speeches_lda = speeches_processed_lda.text_preprocessed.to_list()
corpus_speeches_lda = [id2word_speeches_lda.doc2bow(text) for text in texts_speeches_lda]

In [None]:
# Retrieve optimal hyperparameter
k_optimal_lda_speeches = int(lda_tuning_results_speeches.sort_values("Coherence", ascending = False).reset_index(drop = True).Topics[0])
try:
    a_optimal_lda_speeches = float(lda_tuning_results_speeches.sort_values("Coherence", ascending = False).reset_index(drop = True).Alpha[0])
except ValueError:
    a_optimal_lda_speeches = lda_tuning_results_speeches.sort_values("Coherence", ascending = False).reset_index(drop = True).Alpha[0]
try:
    b_optimal_lda_speeches = float(lda_tuning_results_speeches.sort_values("Coherence", ascending = False).reset_index(drop = True).Beta[0])
except ValueError:
    b_optimal_lda_speeches = lda_tuning_results_speeches.sort_values("Coherence", ascending = False).reset_index(drop = True).Beta[0]

In [None]:
# Train model
lda_model_speeches = LdaMulticore(corpus=corpus_speeches_lda,
                                 id2word=id2word_speeches_lda,
                                 num_topics=k_optimal_lda_speeches,
                                 random_state=42,
                                 alpha=a_optimal_lda_speeches,
                                 eta=b_optimal_lda_speeches)

In [None]:
# Calculate final coherence value
coherence_model_lda_speeches = CoherenceModel(model=lda_model_speeches, texts=texts_speeches_lda, dictionary=id2word_speeches_lda,
                                                    coherence='c_v')
coherence_lda_speeches = coherence_model_lda_speeches.get_coherence()
print("The final model coherence of the LDA for Speeches is: " + str(round(coherence_lda_speeches,2)))

In [None]:
# Visually inspect result
lda_vis_speeches = pyLDAvis.gensim_models.prepare(lda_model_speeches, corpus_speeches_lda, id2word_speeches_lda)
lda_vis_speeches

Neither the coherence score nor the visual inspection indicates a high model quality.

## 3.5.2 Non-Negative Matrix Factorisation

Another approach for topic modelling we test is Non-Negative Matrix Factorisation (NNMF). This technique is another unsupervised machine learning method that factorises a non-negative matrix into two smaller non-negative matrices that give a less complex representation of the original data ([Wang & Zhang, 2012](https://doi.org/10.1109/TKDE.2012.51)). In our case, we factorise the document-term matrix into a document-topic and a topic-term matrix, which helps identify the topics of the considered documents.
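
The factorisation idea can be sketched in plain Python with the classic multiplicative update rules of Lee and Seung. This is an illustration under assumptions: gensim's `Nmf` uses its own online algorithm, and the small matrix `V`, the function `nmf` and its parameters are made up for this example.

```python
import random


def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def frobenius_error(V, W, H):
    # Frobenius norm of the reconstruction error V - WH
    WH = matmul(W, H)
    return sum((v - x) ** 2 for rv, rx in zip(V, WH) for v, x in zip(rv, rx)) ** 0.5


def nmf(V, k, n_iter=200, seed=42, eps=1e-9):
    rng = random.Random(seed)
    n_rows, n_cols = len(V), len(V[0])
    # Strictly positive random initialisation
    W = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n_rows)]
    H = [[rng.random() + 0.1 for _ in range(n_cols)] for _ in range(k)]
    for _ in range(n_iter):
        # H <- H * (W^T V) / (W^T W H), element-wise
        Wt = transpose(W)
        num, den = matmul(Wt, V), matmul(matmul(Wt, W), H)
        H = [[h * a / (b + eps) for h, a, b in zip(rh, ra, rb)]
             for rh, ra, rb in zip(H, num, den)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        Ht = transpose(H)
        num, den = matmul(V, Ht), matmul(W, matmul(H, Ht))
        W = [[w * a / (b + eps) for w, a, b in zip(rw, ra, rb)]
             for rw, ra, rb in zip(W, num, den)]
    return W, H


# Made-up document-term counts: two documents per underlying theme
V = [[2, 1, 0, 0],
     [1, 2, 0, 0],
     [0, 0, 2, 1],
     [0, 0, 1, 2]]

W, H = nmf(V, k=2)  # W: document-topic weights, H: topic-term weights
error = frobenius_error(V, W, H)
print(round(error, 3))
```

The rows of `H` can be read as topics (weights over terms) and the rows of `W` as the topic mixture of each document.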

#### Define hyperparameters for optimization.

We optimise the NNMF model with a grid search over the number of topics (k), which is the only hyperparameter we vary.

In [None]:
# Function for calculating coherence values of specific hyperparameter combinations
def compute_nnmf_coherence_values(corpus, text, id2word, k):
    nmf_model = Nmf(
        corpus=corpus,
        id2word=id2word,
        num_topics=k,
        random_state=42
    )
    coherence_model_nnmf = CoherenceModel(model=nmf_model, texts=text, dictionary=id2word, coherence='c_v')
    return coherence_model_nnmf.get_coherence()

In [None]:
# Function for executing a hyperparameter optimization
def hyperparameter_nnmf(data_preprocessed, title, topics_range):
    id2word = corpora.Dictionary(data_preprocessed.text_preprocessed.to_list())
    id2word.filter_extremes(no_below=10, no_above=0.1)
    texts = data_preprocessed.text_preprocessed.to_list()
    corpus = [id2word.doc2bow(text) for text in texts]
    model_results = {'Topics': [],
                     'Coherence': []
                    }
    for k in tqdm(topics_range):
        print("Number of topics:" + str(k))
        cv = compute_nnmf_coherence_values(corpus=corpus,text = texts,
                                      id2word=id2word, k=k)
        model_results['Topics'].append(k)
        model_results['Coherence'].append(cv)
    results_df = pd.DataFrame(model_results)
    results_df.to_csv('../data/processed/nnmf_tuning_results_' + title + '.csv', index=False)
    return results_df

### 3.5.2.1 Hyperparameter optimization NNMF for tweets

In [None]:
# Load data
tweets_processed_nnmf = pickle.load(open("../data/processed/tweets_processed.p", "rb" ))

In [None]:
# Hyperparameter optimization
# Uncomment if one wants to repeat the hyperparameter optimization
# hyperparameter_nnmf_tweets = hyperparameter_nnmf(tweets_processed_nnmf, "tweets", topics_range)

In [None]:
# Save hyperparameter
# Uncomment if one wants to repeat the hyperparameter optimization
# hyperparameter_nnmf_tweets.to_csv('../data/processed/nnmf_tuning_results_tweets.csv', index = False)

### 3.5.2.2 Calculate best model NNMF for tweets

We compute the best NNMF model for the tweets dataset based on the hyperparameter optimisation from the last subsection.

In [None]:
# Load data
tweets_processed_nnmf = pickle.load(open( "../data/processed/tweets_processed.p", "rb" ))
nnmf_tuning_results_tweets = pd.read_csv('../data/processed/nnmf_tuning_results_tweets.csv')

In [None]:
# Prepare corpus
id2word_tweets_nnmf = corpora.Dictionary(tweets_processed_nnmf.text_preprocessed.to_list())
id2word_tweets_nnmf.filter_extremes(no_below=5, no_above=0.1)
texts_tweets_nnmf = tweets_processed_nnmf.text_preprocessed.to_list()
corpus_tweets_nnmf = [id2word_tweets_nnmf.doc2bow(text) for text in texts_tweets_nnmf]

In [None]:
# Retrieve optimal hyperparameter from the NNMF tuning results
k_optimal_nnmf_tweets = int(nnmf_tuning_results_tweets.sort_values("Coherence", ascending = False).reset_index(drop = True).Topics[0])

In [None]:
# Train model
nnmf_model_tweets = Nmf(corpus=corpus_tweets_nnmf,
                                 id2word=id2word_tweets_nnmf,
                                 num_topics=k_optimal_nnmf_tweets,
                                 random_state=42)

In [None]:
# Calculate final coherence value
coherence_model_nnmf_tweets = CoherenceModel(model=nnmf_model_tweets, texts=texts_tweets_nnmf, dictionary=id2word_tweets_nnmf,
                                                    coherence='c_v')
coherence_nnmf_tweets = coherence_model_nnmf_tweets.get_coherence()
print("The final model coherence of the NNMF for tweets is: " + str(round(coherence_nnmf_tweets,2)))

In [None]:
# Visually inspect result
nnmf_model_tweets.show_topics()

When analysing the top topic words, we cannot identify comprehensible subjects. Combined with the low coherence score, we can conclude a low model quality.

### 3.5.2.3 Hyperparameter optimization NNMF for speeches

In [None]:
# Load data
speeches_processed_nnmf = pickle.load(open( "../data/processed/speeches_processed.p", "rb" ))

In [None]:
# Hyperparameter optimization
# Uncomment if one wants to repeat the hyperparameter optimization
# hyperparameter_nnmf_speeches = hyperparameter_nnmf(speeches_processed_nnmf, "speeches", topics_range)

In [None]:
# Save hyperparameter
# Uncomment if one wants to repeat the hyperparameter optimization
# hyperparameter_nnmf_speeches.to_csv('../data/processed/nnmf_tuning_results_speeches.csv', index = False)

### 3.5.2.4 Calculate best model NNMF for speeches

We compute the best NNMF model for the speeches dataset based on the hyperparameter optimisation from the last subsection.

In [None]:
# Load data
speeches_processed_nnmf = pickle.load(open( "../data/processed/speeches_processed.p", "rb" ))
nnmf_tuning_results_speeches = pd.read_csv('../data/processed/nnmf_tuning_results_speeches.csv')

In [None]:
# Prepare corpus (built from the speeches data, not the tweets data)
id2word_speeches_nnmf = corpora.Dictionary(speeches_processed_nnmf.text_preprocessed.to_list())
id2word_speeches_nnmf.filter_extremes(no_below=5, no_above=0.1)
texts_speeches_nnmf = speeches_processed_nnmf.text_preprocessed.to_list()
corpus_speeches_nnmf = [id2word_speeches_nnmf.doc2bow(text) for text in texts_speeches_nnmf]

In [None]:
# Retrieve optimal hyperparameter from the NNMF tuning results
k_optimal_nnmf_speeches = int(nnmf_tuning_results_speeches.sort_values("Coherence", ascending = False).reset_index(drop = True).Topics[0])

In [None]:
# Train model
nnmf_model_speeches = Nmf(corpus=corpus_speeches_nnmf,
                                 id2word=id2word_speeches_nnmf,
                                 num_topics=k_optimal_nnmf_speeches,
                                 random_state=42)

In [None]:
# Calculate final coherence value
coherence_model_nnmf_speeches = CoherenceModel(model=nnmf_model_speeches, texts=texts_speeches_nnmf, dictionary=id2word_speeches_nnmf,
                                                    coherence='c_v')
coherence_nnmf_speeches = coherence_model_nnmf_speeches.get_coherence()
print("The final model coherence of the NNMF for Speeches is: " + str(round(coherence_nnmf_speeches,2)))

In [None]:
# Visually inspect result
nnmf_model_speeches.show_topics()

The coherence of the model is relatively low, and the resulting topics show no consistent theme, resulting in low model usability.

## 3.5.3 BERTopic

The last model we apply is BERTopic ([Grootendorst & Reimers, 2021](https://doi.org/10.5281/zenodo.4381785)), which builds topic models on top of BERT-based transformer embeddings. BERTopic embeds the documents with a pre-trained BERT model, reduces the dimensionality of the embeddings with UMAP, clusters them with HDBSCAN, and describes each cluster with a class-based TF-IDF (c-TF-IDF) representation, refined by Maximal Marginal Relevance selection. This model architecture is relatively new, and there is not much existing research on it. However, first results seem promising. Based on the architecture, the model can identify relevant topics in the text and cluster them according to semantic similarity. The architecture is quite complex, and therefore the runtime of training BERTopic is high. We do not perform hyperparameter optimisation for the BERTopic models, as we have only limited computational power.
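
The c-TF-IDF step can be illustrated with a small sketch. It assumes the weighting described in the BERTopic documentation, where the weight of term t in class c is tf(t, c) · log(1 + A / tf(t)), with A being the average number of words per class and tf(t) the frequency of t over all classes. The library works on sparse matrices and applies additional normalisation, so the function `c_tf_idf` and the two made-up clusters below are illustrative only.

```python
import math
from collections import Counter


def c_tf_idf(class_tokens):
    # Term frequencies per class (all documents of a cluster are joined)
    tf_per_class = {c: Counter(tokens) for c, tokens in class_tokens.items()}
    # Term frequencies over all classes
    tf_total = Counter()
    for counts in tf_per_class.values():
        tf_total.update(counts)
    # Average number of words per class
    avg_words = sum(len(t) for t in class_tokens.values()) / len(class_tokens)
    return {
        c: {term: tf * math.log(1 + avg_words / tf_total[term])
            for term, tf in counts.items()}
        for c, counts in tf_per_class.items()
    }


# Made-up clusters of preprocessed tokens
clusters = {
    0: ["klima", "klima", "energie", "deutschland"],
    1: ["rente", "rente", "pflege", "deutschland"],
}

weights = c_tf_idf(clusters)
# Terms per cluster, sorted from most to least characteristic
top_terms = {c: sorted(w, key=w.get, reverse=True) for c, w in weights.items()}
print(top_terms)
```

Words that are frequent inside one cluster but rare elsewhere rank highest, while a word shared by all clusters, such as "deutschland" here, is ranked last.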

In [None]:
def calculate_coherence_bert(topic_model, docs, topics):
    cleaned_docs = topic_model._preprocess_text(docs)

    # Extract vectorizer and tokenizer from BERTopic
    vectorizer = topic_model.vectorizer_model
    tokenizer = vectorizer.build_tokenizer()

    # Extract features for Topic Coherence evaluation
    tokens = [tokenizer(doc) for doc in cleaned_docs]
    dictionary = corpora.Dictionary(tokens)
    corpus = [dictionary.doc2bow(token) for token in tokens]
    topic_words = [[words for words, _ in topic_model.get_topic(topic)]
                   for topic in range(len(set(topics))-1)]

    # Evaluate
    coherence_model = CoherenceModel(topics=topic_words,
                                     texts=tokens,
                                     corpus=corpus,
                                     dictionary=dictionary,
                                     coherence='c_v')
    coherence = coherence_model.get_coherence()
    return coherence

In [None]:
def assign_topic(topic_id, topic_model):
    return topic_model.get_topic_info(topic_id).Name.values[0]

### 3.5.3.1 Compute BERTopic model Tweets

In [None]:
# Load data
tweets_processed_bert = pickle.load(open( "../data/processed/tweets_processed.p", "rb" ))
docs_tweets_bert = tweets_processed_bert.text_preprocessed_sentence.tolist()

In [None]:
# Prepare topic model
topic_model_tweets = BERTopic(language="german", nr_topics="auto", calculate_probabilities = True, verbose = True)

In [None]:
# Compute BERTopic model
# Uncomment if one wants to retrain the network
# start_time_bert_tweets = datetime.now()
# topics_tweets_bert, probs_tweets_bert = topic_model_tweets.fit_transform(docs_tweets_bert)
# end_time_bert_tweets = datetime.now()
# print('Duration: {}'.format(end_time_bert_tweets - start_time_bert_tweets))
# Takes approximately eight hours of runtime

In [None]:
# Calculate coherence
# Uncomment if one wants to retrain the network
# coherence_bert_tweets = calculate_coherence_bert(topic_model_tweets,docs_tweets_bert, topics_tweets_bert)
# coherence_bert_tweets

In [None]:
# Visualise results
# Uncomment if one wants to retrain the network
# topic_model_tweets.visualize_topics()

The first analysis produced too many topics, so we reduced their number with BERTopic's built-in topic reduction.

In [None]:
# Reduce topics
# Uncomment if one wants to retrain the network
# topics_tweets_bert_reduced, probs_tweets_bert_reduced = topic_model_tweets.reduce_topics(docs_tweets_bert,
#                                                                                         topics_tweets_bert,
#                                                                                         probs_tweets_bert,
#                                                                                         nr_topics=25)

In [None]:
# Load model
# Comment out if one retrains the model
with open('../data/processed/topics_tweets_bert.pickle', 'rb') as handle:
    topics_tweets_bert_reduced = pickle.load(handle)
topic_model_tweets = BERTopic.load("../models/bertopic_tweets")

In [None]:
# Calculate coherence reduced
coherence_bert_tweets_reduced = calculate_coherence_bert(topic_model_tweets, docs_tweets_bert,
                                                         topics_tweets_bert_reduced)
print("The final model coherence of the BERTopic for Tweets is: " + str(round(coherence_bert_tweets_reduced,2)))

In [None]:
# Visualise results
topic_model_tweets.visualize_topics()

The coherence of the model is on a satisfactory level, and the identified topics are interpretable for a human observer. We can infer a high model quality.

In [None]:
# Assign results to dataframe
tweets_processed_bert["topic_id"] = topics_tweets_bert_reduced
tweets_processed_bert["topic"] = tweets_processed_bert.topic_id.progress_apply(assign_topic,
                                                                               topic_model = topic_model_tweets)

In [None]:
# Save model and results
# Uncomment if one wants to retrain the network
# topic_model_tweets.save("../models/bertopic_tweets")
# with open( "../data/processed/tweets_processed_bert.pickle", "wb" ) as handle:
#    pickle.dump(tweets_processed_bert, handle, protocol=pickle.HIGHEST_PROTOCOL)
# with open('../data/processed/probabilities_tweets_bert.pickle', 'wb') as handle:
#     pickle.dump(probs_tweets_bert_reduced, handle, protocol=pickle.HIGHEST_PROTOCOL)
# with open('../data/processed/topics_tweets_bert.pickle', 'wb') as handle:
#    pickle.dump(topics_tweets_bert_reduced, handle, protocol=pickle.HIGHEST_PROTOCOL)

### 3.5.3.2 Compute BERTopic model Speeches

In [None]:
# Load data
speeches_processed_bert = pickle.load(open( "../data/processed/speeches_processed.p", "rb" ))
docs_speeches_bert = speeches_processed_bert.text_preprocessed_infrequent_sentence.tolist()

In [None]:
# Prepare topic model
topic_model_speeches = BERTopic(language="german", nr_topics="auto", calculate_probabilities = True,
                                verbose = True)

In [None]:
# Compute BERTopic model
# Uncomment if one wants to retrain the network
# start_time_bert_speeches = datetime.now()
# topics_speeches_bert, probs_speeches_bert = topic_model_speeches.fit_transform(docs_speeches_bert)
# end_time_bert_speeches = datetime.now()
# print('Duration: {}'.format(end_time_bert_speeches - start_time_bert_speeches))

In [None]:
# Load model
# Comment out if one retrains the model
with open('../data/processed/topics_speeches_bert.pickle', 'rb') as handle:
    topics_speeches_bert = pickle.load(handle)
topic_model_speeches = BERTopic.load("../models/bertopic_speeches")

In [None]:
# Calculate coherence
coherence_bert_speeches = calculate_coherence_bert(topic_model_speeches, docs_speeches_bert,
                                                         topics_speeches_bert)
print("The final model coherence of the BERTopic for speeches is: " + str(round(coherence_bert_speeches,2)))

In [None]:
# Visualise results
topic_model_speeches.visualize_topics()

The model has a comparatively high coherence and also interpretable topics. Therefore we conclude a high model quality.

In [None]:
# Assign results to dataframe
# Uncomment if one wants to retrain the network
# speeches_processed_bert["topic_id"] = topics_speeches_bert
# speeches_processed_bert["topic"] = speeches_processed_bert.topic_id.progress_apply(assign_topic,
#                                                                                   topic_model = topic_model_speeches)

In [None]:
# Save model and results
# Uncomment if one wants to retrain the network
# topic_model_speeches.save("../models/bertopic_speeches")
# with open( "../data/processed/speeches_processed_bert.pickle", "wb" ) as handle:
#    pickle.dump(speeches_processed_bert, handle, protocol=pickle.HIGHEST_PROTOCOL)
# with open('../data/processed/probabilities_speeches_bert.pickle', 'wb') as handle:
#    pickle.dump(probs_speeches_bert, handle, protocol=pickle.HIGHEST_PROTOCOL)
# with open('../data/processed/topics_speeches_bert.pickle', 'wb') as handle:
#    pickle.dump(topics_speeches_bert, handle, protocol=pickle.HIGHEST_PROTOCOL)

## 3.5.4 Model selection

For the final model selection, we evaluate the models based on their coherence and the visual inspection of the results.

In [None]:
print("The final model coherence of the LDA for tweets is: " + str(round(coherence_lda_tweets,2)))
print("The final model coherence of the NNMF for tweets is: " + str(round(coherence_nnmf_tweets,2)))
print("The final model coherence of the BERTopic for tweets is: " + str(round(coherence_bert_tweets_reduced,2)))
print("The final model coherence of the LDA for speeches is: " + str(round(coherence_lda_speeches,2)))
print("The final model coherence of the NNMF for speeches is: " + str(round(coherence_nnmf_speeches,2)))
print("The final model coherence of the BERTopic for speeches is: " + str(round(coherence_bert_speeches,2)))

NNMF models did not perform well in terms of coherence, while the LDA model only showed good coherence values for the tweets dataset. BERTopic performed well on both datasets as measured by coherence. Based on the visual inspection, we saw excellent results for BERTopic and medium results for the other two model types. Based on these criteria, we decided to use the BERTopic model for both datasets to create the final topic model. In the next section, we will analyse the results of BERTopic and validate the selected models based on word and topic intrusion metrics.

# 3.6 Sentiment Analysis

With the sentiment analysis, we examine how the sentiment of politicians from different parties varies between social media and the Bundestag as audiences, and we compare the sentiment used by female and male politicians. As we use Python as our programming language, we start by importing some commonly used packages. After loading our preprocessed corpus, we are ready to analyze the data.

In [None]:
#import packages

import pandas as pd
from textblob_de import TextBlobDE as TextBlob
import numpy as np
from tqdm.notebook import tqdm

import re
import pickle
pd.options.mode.chained_assignment = None  # default='warn' based on false positives
import spacy
from spacy.language import Language
from spacy_langdetect import LanguageDetector
from spacy.tokens.doc import Doc
from spacy.vocab import Vocab
from spacy_sentiws import spaCySentiWS


tqdm.pandas()

# Load the raw and the preprocessed data
unpre_data_twitter = pd.read_csv("../data/raw/tweets_explored.csv")
unpre_data_speeches = pd.read_csv("../data/raw/speeches_explored.csv")
# Use context managers so the pickle files are closed after loading
with open('../data/processed/tweets_processed.p', 'rb') as handle:
    pre_data_twitter = pickle.load(handle)
with open('../data/processed/speeches_processed.p', 'rb') as handle:
    pre_data_speeches = pickle.load(handle)
pre_data_twitter.head()


## 3.6.1 Sentiment Analysis with TextBlob

For our first approach to sentiment analysis, we use the package TextBlob, which can be used for preprocessing textual data and provides an API for natural language processing tasks such as sentiment analysis. As our corpus is in German, we use the German version TextBlobDE, which has fewer functionalities than its English counterpart but is sufficient for our first sentiment approach. For sentiment analysis, it returns the polarity of a given sentence, where a polarity of -1 means very negative and 1 very positive. The scores are generated with a dictionary approach using a polarity lexicon for German by Clematide and Klenner (Clematide & Klenner, 2010).
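
To illustrate the general idea of a dictionary approach, the following minimal sketch scores a sentence by averaging the polarity weights of the words it finds in a lexicon. The lexicon entries, their weights, and the function name `toy_polarity` are invented for illustration and are not taken from the Clematide and Klenner lexicon. The scoring inside TextBlobDE may differ in detail.

```python
# Toy polarity lexicon: words and weights are hypothetical, for illustration only.
TOY_LEXICON = {"gut": 0.7, "stark": 0.5, "schlecht": -0.7, "katastrophal": -1.0}


def toy_polarity(sentence, lexicon=TOY_LEXICON):
    """Average the lexicon weights of all known words; 0.0 if none is known."""
    words = sentence.lower().split()
    scores = [lexicon[word] for word in words if word in lexicon]
    return sum(scores) / len(scores) if scores else 0.0


print(toy_polarity("Das ist ein gut gemachtes Gesetz"))
print(toy_polarity("Die Lage ist schlecht und katastrophal"))
print(toy_polarity("Der Ausschuss tagt morgen"))
```

A sentence without any lexicon word receives a polarity of 0 in this sketch, which corresponds to the neutral category we count later.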

### 3.6.1.1 Sentiment Analysis for Twitter Data

First, we start with the analysis of the Twitter data. As we want to look at the politicians in our corpus individually, we define a for loop that goes through each politician. To apply TextBlob, we take the preprocessed tweets in sentence format. After applying TextBlob, we use the sentiment attribute to obtain the polarity score of each tweet. We ignore the second output, subjectivity, as it has no meaning in the German version of the package. We then calculate the mean polarity for each politician. Furthermore, we count the number of positive, negative, and neutral tweets for every politician, without accounting for how positive or negative they are.

In [None]:
#loop through all the politicians we want to analyze
# data=[]
# for name in tqdm(['Ralph Brinkhaus','Hermann Gröhe', 'Nadine Schön' ,'Norbert Röttgen' , 'Peter Altmaier' , 'Jens Spahn' , 'Matthias Hauer',
#             'Christian Lindner' , 'Marco Buschmann' , 'Bettina Stark-Watzinger', 'Alexander Graf Lambsdorff' , 'Johannes Vogel' , 'Konstantin Kuhle' , 'Marie-Agnes Strack-Zimmermann',
#             'Lars Klingbeil' , 'Saskia Esken' , 'Hubertus Heil' , 'Heiko Maas' , 'Martin Schulz' , 'Karamba Diaby' , 'Karl Lauterbach',
#             'Steffi Lemke' , 'Cem Özdemir' , 'Katrin Göring-Eckardt' , 'Konstantin von Notz' , 'Britta Haßelmann' , 'Sven Lehmann' , 'Annalena Baerbock',
#             'Sahra Wagenknecht' , 'Bernd Riexinger' , 'Niema Movassat' , 'Jan Korte' , 'Dietmar Bartsch' , 'Gregor Gysi' , 'Sevim Dağdelen',
#             'Alice Weidel' , 'Beatrix von Storch' , 'Joana Cotar' , 'Stephan Brandner' , 'Tino Chrupalla' , 'Götz Frömming' , 'Leif-Erik Holm']):
#     #get tweets from the specific politician
#     tweets_analyzing =pre_data_twitter.loc[pre_data_twitter['full_name']==name]
#     #create sentiment scores
#     blobs=tweets_analyzing['text_preprocessed_sentence'].apply(TextBlob)
#     sentiment=[]
#     for blob in blobs:
#         sentiment.append(blob.sentiment)
#     #get the polarity scores
#     polarity=[]
#     for egg in sentiment:
#         polarity.append(egg.polarity)
#     #get the mean of the scores
#     p_mean = np.mean(polarity)
#     #get the number of positive, neutral and negative tweets
#     positive_p=0
#     neutral_p=0
#     negative_p=0
#     for item_p in polarity:
#         if item_p>0:
#             positive_p += 1
#         elif item_p<0:
#             negative_p += 1
#         else:
#             neutral_p += 1
#     #set up list to secure the values generated
#     data.append([name,p_mean,positive_p,neutral_p,negative_p])

We end up with a data frame containing the mean polarity and the tweet counts for every politician, which gives us a first overview of the sentiment of their social media presence.

In [None]:
#set up dataframe with all values and save it into a csv file
# dataf = pd.DataFrame(data, columns=['Name','Polarity_mean','Num_pos_tweets','Num_neutral_tweets','Num_neg_tweets'])
# dataf.to_csv('../data/processed/sentiment_scores_twitter_01.csv')
# dataf.head()
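
The per-politician aggregation (mean polarity plus counts of positive, neutral, and negative texts) can also be written as a small helper. This is a minimal sketch using only the standard library; the function name `summarise_polarity` and the example scores are hypothetical and not part of our pipeline.

```python
from statistics import mean


def summarise_polarity(polarities):
    """Return mean polarity and the counts of positive, neutral and negative scores."""
    positive = sum(1 for p in polarities if p > 0)
    negative = sum(1 for p in polarities if p < 0)
    neutral = len(polarities) - positive - negative
    return mean(polarities), positive, neutral, negative


# Hypothetical polarity scores for one politician's tweets.
example_scores = [0.5, 0.0, -1.0, 1.0, 0.0]
print(summarise_polarity(example_scores))
```

Note that `statistics.mean` raises an error for an empty list, so a politician without any tweets would need separate handling.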

We can now expand our data frame with a column containing the polarity score generated by TextBlob by applying the code from our for loop to the whole corpus and appending the generated scores.

In [None]:
#create a polarity column for our dataset
# blobs=pre_data_twitter['text_preprocessed_sentence'].progress_apply(TextBlob)
# sentiment=[]
# for blob in blobs:
#     sentiment.append(blob.sentiment)
# #get the scores
# polarity=[]
# for egg in sentiment:
#     polarity.append(egg.polarity)
# pre_data_twitter['polarity_textblob'] = polarity


In [None]:
# pickle.dump(pre_data_twitter, open("../data/processed/tweets_processed.p", "wb"))
#
# pre_data_twitter.head()


### 3.6.1.2 Sentiment Analysis Bundestag Speeches

Next are the Bundestag speeches of the same politicians we analyzed in the previous step. Here, we take our preprocessed speeches and apply TextBlob in the same fashion as for the tweets, again looping through the politicians individually.

In [None]:
#loop through all the politicians we want to analyze
# data=[]
# for name in tqdm(['Ralph Brinkhaus','Hermann Gröhe', 'Nadine Schön' ,'Norbert Röttgen' , 'Peter Altmaier' , 'Jens Spahn' , 'Matthias Hauer',
#             'Christian Lindner' , 'Marco Buschmann' , 'Bettina Stark-Watzinger', 'Alexander Graf Lambsdorff' , 'Johannes Vogel' , 'Konstantin Kuhle' , 'Marie-Agnes Strack-Zimmermann',
#             'Lars Klingbeil' , 'Saskia Esken' , 'Hubertus Heil' , 'Heiko Maas' , 'Martin Schulz' , 'Karamba Diaby' , 'Karl Lauterbach',
#             'Steffi Lemke' , 'Cem Özdemir' , 'Katrin Göring-Eckardt' , 'Konstantin von Notz' , 'Britta Haßelmann' , 'Sven Lehmann' , 'Annalena Baerbock',
#             'Sahra Wagenknecht' , 'Bernd Riexinger' , 'Niema Movassat' , 'Jan Korte' , 'Dietmar Bartsch' , 'Gregor Gysi' , 'Sevim Dağdelen',
#             'Alice Weidel' , 'Beatrix von Storch' , 'Joana Cotar' , 'Stephan Brandner' , 'Tino Chrupalla' , 'Götz Frömming' , 'Leif-Erik Holm']):
#     #get speeches from the specific politician
#     speeches_analyzing =pre_data_speeches.loc[pre_data_speeches['full_name']==name]
#     #create sentiment scores
#     blobs=speeches_analyzing['text_preprocessed_sentence'].apply(TextBlob)
#     sentiment=[]
#     for blob in blobs:
#         sentiment.append(blob.sentiment)
#     #get the polarity scores
#     polarity=[]
#     for egg in sentiment:
#         polarity.append(egg.polarity)
#     #get the mean of the polarity values
#     p_mean = np.mean(polarity)
#     #get the number of positive, neutral and negative speeches
#     positive_p=0
#     neutral_p=0
#     negative_p=0
#     for item_p in polarity:
#         if item_p>0:
#             positive_p += 1
#         elif item_p<0:
#             negative_p += 1
#         else:
#             neutral_p += 1
#     #set up list to secure the values generated
#     data.append([name,p_mean,positive_p,neutral_p,negative_p])

Again, we end up with a list containing the mean sentiment scores and the counts of positive, negative, and neutral speeches, which we transform into a data frame that we can analyze further.

In [None]:
#set up dataframe with all values
# dataf = pd.DataFrame(data, columns=['Name','Polarity_mean','Num_pos_speeches','Num_neutral_speeches','Num_neg_speeches'])
# dataf.to_csv('../data/processed/sentiment_scores_speeches_01.csv')
# dataf.head()

Here, we also add a column with the sentiment scores to the speeches data frame.

In [None]:
# blobs=pre_data_speeches['text_preprocessed_sentence'].apply(TextBlob)
# sentiment=[]
# for blob in blobs:
#     sentiment.append(blob.sentiment)
# #get the scores
# polarity=[]
# for egg in sentiment:
#     polarity.append(egg.polarity)
# pre_data_speeches['polarity_textblob'] = polarity


In [None]:
# pickle.dump(pre_data_speeches, open("../data/processed/speeches_processed.p", "wb"))
# pre_data_speeches.head()

## 3.6.2 Sentiment Analysis with SentiWS

As a second approach to sentiment analysis, we tried SentiWS, a frequently used German sentiment dictionary. It assigns each word a polarity weight between -1 and 1 and contains over 3000 base words and over 30000 word forms. It covers not only adjectives and adverbs but also nouns and verbs. For the implementation, we use an extension for the spaCy pipeline used in preprocessing. With this extension, spaCySentiWS, we can add the dictionary lookup directly to the preprocessing pipeline. We therefore write a new preprocessing pipeline, slightly changed from the original one, to obtain the sentiment scores of the tokens in a sentence.

In [None]:
#insert pipeline to add sentiws preprocessing

In [None]:
pre_data_twitter= pd.read_csv("../data/raw/tweets_explored.csv")
pre_data_speeches= pd.read_csv("../data/raw/speeches_explored.csv")

In [None]:
@Language.component("Remove non alphabetic words")
def remove_non_alpha(doc):
    return [token for token in doc if token.is_alpha]

In [None]:
@Language.factory("Detect languages")
def create_language_detector(nlp, name):
    return LanguageDetector(language_detection_function=None)

In [None]:
@Language.factory("Sentiment Appplication")
def create_sentiment_dictionary(nlp, name):
    return spaCySentiWS(sentiws_path = "../data/raw/Sentiment/")

In [None]:
@Language.component("Keep only German documents")
def remove_non_german(doc):
    res = [sent for sent in doc.sents if sent._.language["language"] == "de"]
    if res:
        return [token for sent in res for token in sent]
    else:
        return Doc(Vocab([]), words=[], spaces=[])

In [None]:
@Language.component("Remove stopwords")
def remove_stopwords(doc):
    return [token for token in doc if not token.is_stop]

In [None]:
@Language.component("Lemmatize text")
def lemmatize_text(doc):
    return [token.lemma_ for token in doc]

In [None]:
@Language.component("Lowercase Text")
def lowercase(doc):
    return [token.lower() for token in doc]

In [None]:
emoji_codes = re.compile("["
                         u"\U0001F600-\U0001F64F"
                         u"\U0001F300-\U0001F5FF"
                         u"\U0001F680-\U0001F6FF"
                         u"\U0001F1E0-\U0001F1FF"
                         u"\U00002500-\U00002BEF"
                         u"\U00002702-\U000027B0"
                         u"\U000024C2-\U0001F251"
                         u"\U0001f926-\U0001f937"
                         u"\U00010000-\U0010ffff"
                         u"\u2640-\u2642"
                         u"\u2600-\u2B55"
                         u"\u200d"
                         u"\u23cf"
                         u"\u23e9"
                         u"\u231a"
                         u"\ufe0f"
                         u"\u3030"
                         "]+", re.UNICODE)

@Language.component("Remove emojis")
def remove_emojis(doc):
    doc = [token.text for token in doc if not emoji_codes.match(token.text)]
    doc = ' '.join(doc)
    return nlp_twitter.make_doc(doc)
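
The token filter used in the emoji component can be tried out without spaCy. The following sketch uses a reduced, hypothetical character class and whitespace tokenisation instead of the spaCy tokenizer, so it only approximates the component above; the names `EMOJI_PATTERN` and `drop_emoji_tokens` are assumptions for this example.

```python
import re

# Reduced, hypothetical character class covering a few common emoji blocks.
EMOJI_PATTERN = re.compile(
    "["
    "\U0001F600-\U0001F64F"  # emoticons
    "\U0001F300-\U0001F5FF"  # symbols and pictographs
    "\U0001F680-\U0001F6FF"  # transport and map symbols
    "\u2600-\u26FF"          # miscellaneous symbols
    "]+"
)


def drop_emoji_tokens(text):
    """Split on whitespace and drop tokens that start with an emoji."""
    tokens = text.split()
    return " ".join(tok for tok in tokens if not EMOJI_PATTERN.match(tok))


print(drop_emoji_tokens("Guten Morgen 😀 Berlin 🚀"))
```

Because `match` only checks the beginning of a token, a token such as "Berlin😀" is kept in this sketch; whether that matters depends on how the tokenizer splits emojis from words.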

In [None]:
@Language.component("Remove URLs")
def remove_urls(doc):
    doc = [token.text for token in doc if not token.like_url]
    doc = ' '.join(doc)
    return nlp_twitter.make_doc(doc)

In [None]:
@Language.component("Remove mentions")
def remove_mentions(doc):
    doc = [token.text for token in doc if not re.match("@.*", token.text)]
    doc = ' '.join(doc)
    return nlp_twitter.make_doc(doc)

In [None]:
@Language.component("Remove stopwords and punctuation")
def remove_stopwords_and_punctuation(doc):
    doc = [token.text for token in doc if not token.is_stop and not token.is_punct]
    return doc

In [None]:
# Create spacy pipeline
nlp_tweets_sentiws = spacy.load('de_core_news_sm')
nlp_tweets_sentiws.Defaults.stop_words |= {"amp", "rt"}

# The add_pipe function appends our functions to the default pipeline.
nlp_tweets_sentiws.add_pipe("sentencizer", last=True)
nlp_tweets_sentiws.add_pipe("Detect languages", name='Detect languages', last=True)
nlp_tweets_sentiws.add_pipe("Keep only German documents", name='Keep only German documents', last=True)
nlp_tweets_sentiws.add_pipe("Remove non alphabetic words", name="Remove non alphabetic words", last=True)
nlp_tweets_sentiws.add_pipe("Remove stopwords", name="Remove stopwords", last=True)
# nlp_tweets_sentiws.add_pipe("Lemmatize text", name="Lemmatize text", last=True)
# nlp_tweets_sentiws.add_pipe("Lowercase Text", name="Lowercase Text", last=True)
nlp_tweets_sentiws.add_pipe("Sentiment Appplication", name="Sentiment Appplication", last=True)

### 3.6.2.1 Sentiment Analysis for Twitter Data

First, we look at our Twitter data again. As with the TextBlob analysis, we want to go through all the individual politicians and therefore create a loop. In contrast to the first approach, we use the raw data here, as we want to apply our new pipeline to the dataset. After applying the pipeline with the sentiment functionality, we go through the processed tweets and take the sentiment score of each token. Next, we average the token scores for each tweet and then average the tweet scores for each politician. Again, we also count the number of positive, negative, and neutral tweets.

In [None]:
#Apply the sentiment analysis to the Twitter accounts of the politicians
# data=[]
# for name in tqdm(['Ralph Brinkhaus','Hermann Gröhe', 'Nadine Schön' ,'Norbert Röttgen' , 'Peter Altmaier' , 'Jens Spahn' , 'Matthias Hauer',
#             'Christian Lindner' , 'Marco Buschmann' , 'Bettina Stark-Watzinger', 'Alexander Graf Lambsdorff' , 'Johannes Vogel' , 'Konstantin Kuhle' , 'Marie-Agnes Strack-Zimmermann',
#             'Lars Klingbeil' , 'Saskia Esken' , 'Hubertus Heil' , 'Heiko Maas' , 'Martin Schulz' , 'Karamba Diaby' , 'Karl Lauterbach',
#             'Steffi Lemke' , 'Cem Özdemir' , 'Katrin Göring-Eckardt' , 'Konstantin von Notz' , 'Britta Haßelmann' , 'Sven Lehmann' , 'Annalena Baerbock',
#             'Sahra Wagenknecht' , 'Bernd Riexinger' , 'Niema Movassat' , 'Jan Korte' , 'Dietmar Bartsch' , 'Gregor Gysi' , 'Sevim Dağdelen',
#             'Alice Weidel' , 'Beatrix von Storch' , 'Joana Cotar' , 'Stephan Brandner' , 'Tino Chrupalla' , 'Götz Frömming' , 'Leif-Erik Holm']):
#     #get tweets from the specific politician
#     tweets_analyzing = pre_data_twitter.loc[pre_data_twitter['full_name']==name]
#     tweets_analyzing1 = tweets_analyzing.text.progress_apply(nlp_tweets_sentiws)
#     #get the sentiment of the tweets
#     politician_sum=[]
#     for sentence in tweets_analyzing1:
#         sentence_sum=[]
#         for token in sentence:
#             if token._.sentiws == None:
#                 a=0
#             elif token._.sentiws == 'nan':
#                 a=0
#             else:
#                 sentence_sum.append(token._.sentiws)
#         sentence_score=np.nanmean(sentence_sum)
#         politician_sum.append(sentence_score)
#     politician_score=np.nanmean(politician_sum)
#     #get the number of positive, neutral and negative tweets
#     positive_p=0
#     neutral_p=0
#     negative_p=0
#     for item_p in politician_sum:
#         if item_p>0:
#             positive_p += 1
#         elif item_p<0:
#             negative_p += 1
#         elif item_p == 'nan':
#             neutral_p += 1
#         else:
#             neutral_p += 1
#     #set up list to secure the values generated
#     data.append([name,politician_score,positive_p,neutral_p,negative_p])

We transform the list into a dataframe that we can again analyze further.

In [None]:
#set up dataframe with all values
# dataf = pd.DataFrame(data, columns=['Name','Polarity_mean','Num_pos_tweets','Num_neutral_tweets','Num_neg_tweets'])
# dataf.to_csv('../data/processed/sentiment_scores_tweets_sentiws_01.csv')
# dataf.head()
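
The two-level averaging (tokens to tweet, tweets to politician) with missing token scores can be sketched in plain Python. This is a simplified illustration, not our pipeline code: `mean_or_nan` stands in for `np.nanmean`, `None` marks a token without a dictionary entry, and all scores are hypothetical.

```python
import math


def mean_or_nan(values):
    """Mean of the non-missing values; NaN if no value is available."""
    valid = [v for v in values if v is not None and not math.isnan(v)]
    return sum(valid) / len(valid) if valid else math.nan


def classify(score):
    """Label a tweet score; tweets without any scored token count as neutral."""
    if math.isnan(score) or score == 0:
        return "neutral"
    return "positive" if score > 0 else "negative"


# Hypothetical token scores for three tweets (None = token not in the dictionary).
tweets = [[0.5, None, 0.25], [None, None], [-0.5, None]]
tweet_scores = [mean_or_nan(tokens) for tokens in tweets]
politician_score = mean_or_nan(tweet_scores)
labels = [classify(score) for score in tweet_scores]
print(tweet_scores, politician_score, labels)
```

In this sketch, a tweet without any scored token yields NaN, is excluded from the politician mean, and is counted as neutral, which mirrors the behaviour of the loop above.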

### 3.6.2.2 Sentiment Analysis for Bundestag Speeches

Again we want to have a look at the Bundestag speeches and see how the SentiWS dictionary classifies them in terms of sentiment. We use the same procedure as with the tweets before to calculate the scores and the counts.

In [None]:
#Apply the sentiment analysis to the speeches of the politicians
# data=[]
# for name in tqdm(['Ralph Brinkhaus','Hermann Gröhe', 'Nadine Schön' ,'Norbert Röttgen' , 'Peter Altmaier' , 'Jens Spahn' , 'Matthias Hauer',
#             'Christian Lindner' , 'Marco Buschmann' , 'Bettina Stark-Watzinger', 'Alexander Graf Lambsdorff' , 'Johannes Vogel' , 'Konstantin Kuhle' , 'Marie-Agnes Strack-Zimmermann',
#             'Lars Klingbeil' , 'Saskia Esken' , 'Hubertus Heil' , 'Heiko Maas' , 'Martin Schulz' , 'Karamba Diaby' , 'Karl Lauterbach',
#             'Steffi Lemke' , 'Cem Özdemir' , 'Katrin Göring-Eckardt' , 'Konstantin von Notz' , 'Britta Haßelmann' , 'Sven Lehmann' , 'Annalena Baerbock',
#             'Sahra Wagenknecht' , 'Bernd Riexinger' , 'Niema Movassat' , 'Jan Korte' , 'Dietmar Bartsch' , 'Gregor Gysi' , 'Sevim Dağdelen',
#             'Alice Weidel' , 'Beatrix von Storch' , 'Joana Cotar' , 'Stephan Brandner' , 'Tino Chrupalla' , 'Götz Frömming' , 'Leif-Erik Holm']):
#     #get speeches from the specific politician
#     speeches_analyzing = pre_data_speeches.loc[pre_data_speeches['full_name']==name]
#     speeches_analyzing1 = speeches_analyzing.text.progress_apply(nlp_tweets_sentiws)
#     #get the sentiment of the speeches
#     politician_sum=[]
#     for sentence in speeches_analyzing1:
#         sentence_sum=[]
#         for token in sentence:
#             if token._.sentiws == None:
#                 a=0
#             elif token._.sentiws == 'nan':
#                 a=0
#             else:
#                 sentence_sum.append(token._.sentiws)
#         sentence_score=np.nanmean(sentence_sum)
#         politician_sum.append(sentence_score)
#     politician_score=np.nanmean(politician_sum)
#     #get the number of positive, neutral and negative speeches
#     positive_p=0
#     neutral_p=0
#     negative_p=0
#     for item_p in politician_sum:
#         if item_p>0:
#             positive_p += 1
#         elif item_p<0:
#             negative_p += 1
#         elif item_p == 'nan':
#             neutral_p += 1
#         else:
#             neutral_p += 1
#     #set up list to secure the values generated
#     data.append([name,politician_score,positive_p,neutral_p,negative_p])

Afterwards, we create a data frame from the data for analysis.

In [None]:
#set up dataframe with all values
# dataf = pd.DataFrame(data, columns=['Name','Polarity_mean','Num_pos_speeches','Num_neutral_speeches','Num_neg_speeches'])
# dataf.to_csv('../data/processed/sentiment_scores_speeches_sentiws_01.csv')
# dataf.head()

As the polarity scores from the SentiWS dictionary are smaller in absolute value and therefore appear less informative, we decided to conduct the further in-depth sentiment analysis with the TextBlob model. The smaller SentiWS values could result from our aggregation loop, as the mean is not robust to neutral tweets, which pull it towards zero. Another possible explanation is that there are no large outliers among the tweet or speech sentiments, as polarity only ranges from -1 to 1.
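
A small worked example shows how neutral items can pull a mean towards zero. The numbers are hypothetical and only illustrate the arithmetic; they are not taken from our data.

```python
from statistics import mean

# Hypothetical tweet polarities: two clearly polar tweets and eight neutral ones.
polar_tweets = [0.75, -0.25]
all_tweets = polar_tweets + [0.0] * 8

mean_polar = mean(polar_tweets)  # mean over the polar tweets only
mean_all = mean(all_tweets)      # same tweets diluted by the neutral ones

print(mean_polar, mean_all)
```

The more neutral tweets enter the average, the closer the mean moves to zero, even though the polar tweets themselves are unchanged.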