# "The 'Alternative für Deutschland' and their usage of parliamentary speeches on Social Media"

## Information about the research project

This project is the master's thesis of Moritz Stockmar. It analyzes how the 'Alternative für Deutschland' (AfD) uses parliamentary speeches on social media. The main objective is to learn which speeches (and which parts of them) the AfD uses on TikTok and YouTube Shorts, and what differentiates them from the population of all of the AfD's parliamentary speeches.

To achieve this goal the following steps are taken:

1. Collecting the data: The speeches of the AfD are collected from the plenary protocols, which can be found on the official website of the Bundestag. The TikTok and YouTube Shorts videos are collected from official social media accounts of the AfD.

2. Preprocessing the data: A corpus of AfD speeches during the 20th legislative period of the Bundestag is built. The short videos are transcribed and matched to the corpus entries.

3. Analyzing the data: After aligning the uploaded speeches with the official parliamentary protocols, the speeches are analyzed using a variety of methods, such as topic modeling, sentiment analysis, syntactic analysis and more.

4. Visualizing the data: The results are visualized to be presented in the master's thesis.

The steps above are also reflected in the structure of this Markdown document. Refer to the respective sections for more detailed information.
The data was (will be) collected by the author and is available on request. The goal is for this document to be self-contained so that it can be used to reproduce the results of the thesis.


## 0. Setting global options and loading the required libraries
The following cells load the required libraries and set some global variables. They must be run before any other section.

In [524]:
# Import necessary libraries
import pandas as pd
import numpy as np
import spacy
import bundestag_api
import os
import re
import whisper
import subprocess
import ast
import unicodedata
import difflib

from spacy import displacy
from time import strftime, localtime

# Download the required models

# Download and load the spaCy model
!python -m spacy download de_core_news_md
nlp = spacy.load("de_core_news_md")
# @TODO: Trying it with a bigger model


# Whisper: The Turbo Model is used. It is 1.5 GB big in storage and uses roughly 6 GB of RAM. 
# https://huggingface.co/openai/whisper-large-v3-turbo
model = whisper.load_model("turbo")
# Hugging Face Models


"""
@TODO: The following code builds the required file structure for the project.
"""


Collecting de-core-news-md==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/de_core_news_md-3.8.0/de_core_news_md-3.8.0-py3-none-any.whl (44.4 MB)
✔ Download and installation successful
You can now load the package via spacy.load('de_core_news_md')



In [430]:
# Setting global variables
BUNDESTAG_API = "I9FKdCn.hbfefNWCY336dL6x62vfwNKpoN2RZ1gp21"
period_start = '2021-10-01'
period_end = '2025-03-24'

path_yt_faction = "data/raw_data/videos/YouTube/AfD-Fraktion Bundestag (Parliamentary Faction)/"
path_yt_party = "data/raw_data/videos/YouTube/AfD-TV (Party)/"
path_tt = "data/raw_data/videos/TikTok/"
paths_to_folders = [path_yt_faction, path_yt_party, path_tt]

parties = ["CDU/CSU", "SPD", "AfD", "FDP", "BÜNDNIS 90/DIE GRÜNEN", "Die Linke", "BSW", "fraktionslos"]
länder = ["Baden-Württemberg", "Bayern", "Berlin", "Brandenburg", "Bremen", "Hamburg", "Hessen", "Mecklenburg-Vorpommern", "Niedersachsen", "Nordrhein-Westfalen", "Rheinland-Pfalz", "Saarland", "Sachsen", "Sachsen-Anhalt", "Schleswig-Holstein", "Thüringen"]
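
The TODO in the import cell mentions code that builds the required file structure. A minimal sketch of what that could look like, assuming the folder names used in the path variables above plus `data/preprocessed_data`; the helper `ensure_directories` is hypothetical and not part of the notebook:

```python
import os
import tempfile

# Hypothetical helper, not part of the original notebook.
def ensure_directories(base_dir, relative_dirs):
    """Create every directory in relative_dirs below base_dir and return the full paths."""
    created = []
    for relative_dir in relative_dirs:
        full_path = os.path.join(base_dir, relative_dir)
        os.makedirs(full_path, exist_ok=True)  # no error if the folder already exists
        created.append(full_path)
    return created

# Folder names taken from the paths used in this notebook.
PROJECT_DIRS = [
    "data/raw_data/videos/YouTube/AfD-Fraktion Bundestag (Parliamentary Faction)",
    "data/raw_data/videos/YouTube/AfD-TV (Party)",
    "data/raw_data/videos/TikTok",
    "data/preprocessed_data",
]

# Demonstrated in a temporary directory so that nothing in the project is touched.
with tempfile.TemporaryDirectory() as tmp:
    for path in ensure_directories(tmp, PROJECT_DIRS):
        print(path)
```

In the project itself the call would presumably be `ensure_directories(".", PROJECT_DIRS)`, run once before the data collection.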


## 1. Collecting the Data
The plenary protocols are collected using the [official API](https://dip.bundestag.de/%C3%BCber-dip/hilfe/api) of the German Bundestag, accessed through the [Python wrapper](https://github.com/jschibberges/Bundestag-API) by jschibberges. The short videos were downloaded with the tool [yt-dlp](https://github.com/yt-dlp/yt-dlp). The download is not shown here, but the video database can be accessed on request.

### 1.1 Downloading the speeches

In [431]:
""" 
Collecting the protocols from the Bundestag API from the first session of the 20th legislative period 
to the date of the election of the 21st legislative period. 

This codeblock takes some time to run as it collects the data from the Bundestag API. There are 212 protocols to download. 
Only do this once, as the data is saved to a CSV file for long term storage and the dataframe is pickled for shorter term storage.

@TODO: Add the sessions after the election of the 21st legislative period to the data collection
"""
bta = bundestag_api.btaConnection(apikey = BUNDESTAG_API)
protocols = bta.search_plenaryprotocol(date_start = period_start, date_end = period_end, institution = 'BT', num = 400, fulltext = True)
protocols_df  = pd.DataFrame(protocols)

# Save the dataframe
protocols_df.to_csv("data/raw_data/protocols.csv")
protocols_df.to_pickle("data/raw_data/protocols.pkl")


### 1.2 Downloading the MdB Information

In [None]:
# Download persons from the Bundestag API. It is not completely clear how search_person works, therefore we download all persons and filter them later.
bta = bundestag_api.btaConnection(apikey = BUNDESTAG_API)
persons = bta.search_person(updated_since = period_start + "T00:00:00", num = 3000)
persons_df = pd.DataFrame(persons)

# Filter the persons_df for AfD speakers, these can only be MdBs as the AfD does not have any ministers or MdBRs (members of the Bundesrat)
# There are a few too many persons, which can be explained by people leaving the faction (and now being crossbenchers) or leaving parliament during the legislative period
filtered_afd_mdbs_df = persons_df[(persons_df['person_roles'].str.contains('AfD', na = False))| 
                                (persons_df['titel'].str.contains('AfD', na = False))]

# Filter the persons_df for non AfD speakers, these can be MdBs, Ministers and MdBRs 
# There is no need to filter for party other than not being in the AfD
filtered_non_afd_speekers_df = persons_df[~((persons_df['person_roles'].str.contains('AfD', na = False))| 
                                (persons_df['titel'].str.contains('AfD', na = False)))]

# Filter out everyone who did not speak in the 20th legislative period
filtered_non_afd_speekers_df = filtered_non_afd_speekers_df.loc[
    # Simple lookup for the wahlperiode == 20 (This applies to everyone speaking for the first time in the 20th legislative period)
    (filtered_non_afd_speekers_df['wahlperiode'] == 20) |
    # Complicated lookup for everyone else (This applies to everyone who spoke before the 20th legislative period and in the 20th) 
    (filtered_non_afd_speekers_df['person_roles'].apply(
        lambda roles: isinstance(roles, list) and
        any(20 in role.get('wahlperiode_nummer', []) for role in roles if isinstance(role, dict))
    ))
]

# Save the dataframe
filtered_afd_mdbs_df.to_csv("data/raw_data/afd_mdbs.csv")
filtered_afd_mdbs_df.to_pickle("data/raw_data/afd_mdbs.pkl")

filtered_non_afd_speekers_df.to_csv("data/raw_data/non_afd_speakers.csv")
filtered_non_afd_speekers_df.to_pickle("data/raw_data/non_afd_speakers.pkl")
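
The nested lookup on `person_roles` in the cell above is easier to check in isolation. The following sketch pulls it into a small function; the name `has_role_in_period` and the example records are made up for illustration, only the key `wahlperiode_nummer` is taken from the code above.

```python
def has_role_in_period(roles, period=20):
    """Return True if any role dict lists `period` under 'wahlperiode_nummer'."""
    if not isinstance(roles, list):
        return False  # e.g. NaN for persons without additional roles
    return any(
        period in (role.get("wahlperiode_nummer") or [])
        for role in roles
        if isinstance(role, dict)
    )

# Made-up records for illustration only.
examples = {
    "roles in period 19 and 20": [{"wahlperiode_nummer": [19, 20]}],
    "roles only in period 19": [{"wahlperiode_nummer": [19]}],
    "no roles recorded": float("nan"),
}
for label, roles in examples.items():
    print(label, "->", has_role_in_period(roles))
```

A function like this could be passed directly to `.apply` in place of the lambda, which keeps the filter expression shorter.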

## 2. Preprocessing the data

In the following, the textual and the audio-visual data are preprocessed.

1. The textual data (the protocols) is processed into an annotated corpus consisting of all speeches by AfD MdBs (members of the Bundestag). To this end, all speeches of AfD MdBs first have to be extracted (in plain text) from the protocols. The next step is the linguistic preprocessing performed by spaCy, which makes the data usable for the subsequent syntactic and semantic analyses.

2. The audio-visual data (the uploaded short videos) has to be transcribed.

3. The transcriptions are matched with their corresponding speeches from the first step.

### 2.1 Preprocessing the Text

#### 2.1.1 Extracting the AfD Speeches from the protocols

The following code block extracts the AfD speeches from the protocols and stores them in a single dataframe.

Attention: This block takes roughly 30s to complete on an M2 Mac.
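
The extraction can be pictured as a small state machine: a speech starts at a line that contains the search string of an AfD speaker and ends at the next line that names another speaker and ends with a colon. The following is a simplified sketch of that idea on a made-up protocol. The function `extract_speeches` and all names in the example are hypothetical, and the agenda items and the later cleanup are left out.

```python
def extract_speeches(lines, start_markers, end_markers):
    """Collect the text between an AfD speaker line and the next speaker line."""
    speeches = []
    speaker, parts = None, []
    for line in lines:
        if not line:
            continue  # skip empty lines
        is_end = (
            any(marker in line for marker in end_markers)
            and line.endswith(":")
            and not line.startswith("[")
        )
        if speaker is not None and is_end:
            speeches.append({"speaker": speaker, "text": " ".join(parts)})
            speaker, parts = None, []
        if speaker is not None:
            parts.append(line)
        elif any(marker in line for marker in start_markers) and "Frage" not in line:
            speaker = line.replace(" (AfD):", "")
    return speeches

# Made-up protocol excerpt with invented names.
lines = [
    "Präsidentin Erika Beispiel:",
    "Das Wort hat Max Mustermann.",
    "Max Mustermann (AfD):",
    "Sehr geehrte Frau Präsidentin!",
    "",
    "Meine Damen und Herren!",
    "Präsidentin Erika Beispiel:",
    "Vielen Dank.",
]
start_markers = ["Max Mustermann (AfD):"]
end_markers = ["Erika Beispiel"]

for speech in extract_speeches(lines, start_markers, end_markers):
    print(speech)
```

Note that a speech that is still open when the protocol ends is dropped in this sketch, because it is only stored once an end line is found.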

In [497]:
"""
The following code block reads the protocols from the CSV file / pickled file and loads them into a pandas dataframe for further processing
if there is no dataframe already created.
"""
if os.path.exists("data/raw_data/protocols.pkl"):
    protocols_df = pd.read_pickle("data/raw_data/protocols.pkl")
elif os.path.exists("data/raw_data/protocols.csv"):
    protocols_df = pd.read_csv("data/raw_data/protocols.csv")
else:
    print("No data found. Please run the data collection code block first.")

"""The afd_mdbs_searchstrings are the strings signaling the start of every AfD Speech in the protocols by being of the form
'[o: titles] [first name] [o: infix] [last name] (AfD):' -> things in brackets are individual, things with o: are optional
This searchstring can be used as is to find the start of the speech in the protocol """
afd_mdbs_searchstrings = filtered_afd_mdbs_df.apply(lambda row: row['titel'].replace(', MdB', '').replace('Dr. ', '').replace(', AfD', ' (AfD)') + ":", axis = 1).to_list()

"""The non_afd_searchstring is part of the string signaling the start of every non AfD speech in the protocols by being of the form
'[o: titles] [first name] [o: infix] [last name]'. The party or affiliation to the federal government or Bundesrat is not included, because that would lead to many edge cases
This searchstring can NOT be used as is to find the start of the speech in the protocol """
non_afd_searchstings = filtered_non_afd_speekers_df.apply(lambda row: row['titel'].split(',')[0], axis = 1).to_list()


"""
Helper method to check if a line is the start of an AfD speech. This is done by checking if the line contains any of the searchstrings
and if the line does not contain the word "Frage" which indicates that the line is a purely written question and not a speech. 
These are found at the end of protocols if the time was too short for the government to answer each question orally.
"""
def is_start_of_AfDSpeech(line, searchstrings) -> bool:
    return any(searchstring in line for searchstring in searchstrings) and "Frage" not in line

"""
Helper method to check if a line is the end of an AfD speech. This is done by checking if the line contains any of the searchstrings
and if the line ends with a colon as otherwise references to people would be falsely identified as the end of a speech. Furthermore
the line should not start with a bracket as this indicates comments from the audience.
"""
def is_end_of_AfDSpeech(line, searchstrings) -> bool:
    return any(searchstring in line for searchstring in searchstrings) and line.endswith(':') and not line.startswith('[')

"""
Helper method to find the session (as in X. Sitzung der 20. Wahlperiode) and the date of the session in the protocol.
@Input: protocol: The protocol as a list of strings (lines). You only need the first 5
@Output: A tuple of the form (session, session_date)
"""
""" def find_session_and_date(protocol_lines) -> tuple:
    session = protocol_lines[0].split('/')[-1] # The session is after the last '/' in the first line
    session_date = protocol_lines[4].split(',')[-1].replace(' den ', '') # The date is after the last ', den ' in the fifth line
    return (session, session_date) """


"""
Helper method to find the start of an agenda item
"""
def is_start_of_agenda_item(line) -> bool:
    line = line.lower()
    # The agenda item is always after something like "ich rufe" or "wir kommen" and contains either "Tagesordnungspunkt" or "Zusatzpunkt"
    # Logical structure: (expression for calling a new agenda item) and (numerical expression for the agenda item)
    return ("rufe" in line or "kommen" in line or "komme" in line or bool(re.search('setzen (.*) fort', line))) and (bool(re.search('(tagesordnungspunkte?|zusatzpunkte?) [0-9]?[0-9]?', line)))


"""
Main method to extract the speeches from a protocol. 
@Input: protocol: The protocol as a string
        afd_mdbs_searchstrings: The searchstrings to identify the start of an AfD speech
        non_afd_searchstrings: The searchstrings to identify the end of an AfD speech
@Output: A Dataframe of the form {'speaker': speaker, 'text': text, 'session': session, 'session_date': session_date, 'agenda_item': agenda_item}

The method works by iterating over the normalized lines of the protocol and checking if a line is the start of an AfD speech. If it is, the method
starts to collect the text of the speech until it finds the end of the speech. 

Afterwards a first structural cleanup is done. This means that speeches that are split due to interventions by the presiding officer (especially because of time) are concatenated.

@TODO: agenda_items is not working correctly yet. It is sometimes empty because of the hard coded way of finding the agenda item.
"""
def extract_speeches_from_protocol(protocol, session, date, afd_mdbs_searchstrings, non_afd_searchstings) -> pd.DataFrame:
    protocol_lines = protocol.split('\n')

    # First cleanup: Text normalization
    normailzed_protocol_lines = [unicodedata.normalize("NFKC", line) for line in protocol_lines]

    speeches = []
    speech_text = ""
    in_speech = False
    speaker = ""
    agenda_item = ""

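    for line_index, line in enumerate(normailzed_protocol_lines):
        if line == "": # Skip empty lines altogether
            continue
        if is_start_of_agenda_item(line):
            # Skip empty lines between the announcement ("Ich rufe ...") and the name of the agenda item.
            # enumerate() gives the position of the current line; list.index() would return the first identical line instead.
            next_index = line_index + 1
            while next_index < len(normailzed_protocol_lines) and normailzed_protocol_lines[next_index] == "":
                next_index += 1
            if next_index < len(normailzed_protocol_lines):
                agenda_item = normailzed_protocol_lines[next_index]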
        if in_speech and is_end_of_AfDSpeech(line, non_afd_searchstings):
            speeches.append({
                'speaker': speaker, 
                'text': speech_text, 
                'session': session, 
                'session_date': date, 
                'agenda_item': agenda_item}) # Append the speech (dict) to the list of speeches
            speech_text, speaker = "", "" # Reset the variables
            in_speech = False
        if in_speech:
            speech_text += (" " if speech_text else "") + line # Separate lines with a space so that words are not glued together
        if not in_speech and is_start_of_AfDSpeech(line, afd_mdbs_searchstrings):
            in_speech = True
            speaker = line.replace(" (AfD):", "")

    # Structural cleanup: Concatenate speeches that are split due to interventions by the presiding officer (mostly because of time)
    cleaned_speeches = []
    for speech in speeches:
        try:
            if speech['text'].startswith('–'): # Sometimes '–' is used to indicate that a speech is split
                cleaned_speeches[-1]['text'] += speech['text']
                # str.replace returns a new string, so the result has to be assigned back
                cleaned_speeches[-1]['text'] = cleaned_speeches[-1]['text'].replace('––', ' ')
            elif len(speech['text'].split(" ")) < 40: # Heuristic: the rest of a speech after an intervention because of time
                # is usually shorter than 40 words, while a new speech would certainly be longer
                cleaned_speeches[-1]['text'] += " " + speech['text']
            else:
                cleaned_speeches.append(speech)
        except IndexError: # There is no previous speech to append to
            cleaned_speeches.append(speech)
    return pd.DataFrame(cleaned_speeches)

#Function call and data storage, takes roughly 30 seconds to run
afd_speeches = pd.DataFrame()
for _, row in protocols_df.iterrows():
    # Extract the AfD speeches of this protocol as a dataframe
    extracted_speeches = extract_speeches_from_protocol(
        protocol = row['text'], 
        session = row['dokumentnummer'].split('/')[1],
        date = row['datum'], 
        afd_mdbs_searchstrings = afd_mdbs_searchstrings, 
        non_afd_searchstings = non_afd_searchstings)

    # Append the speeches of this protocol to the overall dataframe
    afd_speeches = pd.concat([afd_speeches, extracted_speeches], ignore_index = True)

afd_speeches.to_csv("data/preprocessed_data/afd_speeches.csv")
afd_speeches.to_pickle("data/preprocessed_data/afd_speeches.pkl")
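
The structural cleanup inside `extract_speeches_from_protocol` relies on two signals: a fragment that starts with '–', or a fragment shorter than 40 words, is treated as the continuation of the previous speech. A simplified, standalone sketch of that heuristic on made-up data follows; the function `merge_split_speeches` is hypothetical, and unlike the notebook it always joins with a single space.

```python
def merge_split_speeches(speeches, min_words=40):
    """Append short or '–'-prefixed fragments to the preceding speech."""
    merged = []
    for speech in speeches:
        text = speech["text"]
        is_continuation = text.startswith("–") or len(text.split()) < min_words
        if merged and is_continuation:
            merged[-1]["text"] += " " + text
        else:
            merged.append(dict(speech))  # copy so that the input is left untouched
    return merged

# Made-up speeches: a long one, its interrupted ending, and a second long one.
speeches = [
    {"speaker": "A", "text": " ".join(["wort"] * 50)},
    {"speaker": "A", "text": "– Ich komme zum Schluss."},
    {"speaker": "B", "text": " ".join(["wort"] * 45)},
]
merged = merge_split_speeches(speeches)
print(len(merged))
print(merged[0]["text"][-24:])
```

The threshold of 40 words is the value used in the notebook; it is a heuristic and short genuine speeches would be merged by it as well.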

#### 2.1.2 Building the annotated corpus of AfD speeches

In [None]:
"""
The following code block reads the speeches of the AfD MdBs from the CSV file / pickled file and loads them into a pandas dataframe for further processing
if there is no dataframe already created.
"""
if  afd_speeches.empty:
    if os.path.exists("data/preprocessed_data/afd_speeches.pkl"):
        afd_speeches = pd.read_pickle("data/preprocessed_data/afd_speeches.pkl")
    elif os.path.exists("data/preprocessed_data/afd_speeches.csv"):
        afd_speeches = pd.read_csv("data/preprocessed_data/afd_speeches.csv")
    else: 
        print("No data found. Please run the data collection code block first.")
print("Data loaded.")


"""
The following code block is used to clean up the speeches further.
Cleaned up things:
 - Interjections by the listeners
"""

def further_cleanup(speeches: pd.DataFrame) -> pd.DataFrame:
    cleaned_speeches = speeches.copy()
    # Remove interjections by the listeners, which are indicated by parentheses in the protocol. As far as I can tell,
    # this is the only place where parentheses are used in a speech.
    cleaned_speeches['text'] = cleaned_speeches['text'].apply(lambda x: re.sub(r'\(.*?\)', '', x))
    return cleaned_speeches

afd_speeches_cleaned = further_cleanup(afd_speeches)

"""
Some helper methods to simplify the .apply statements in the following code block
"""
def get_token(doc):
    return [token.text for token in doc]
def get_lemma(doc):
    return [token.lemma_ for token in doc]
def get_pos(doc):
    return [(token.pos_, token.tag_) for token in doc]

# Preprocessing the speeches with the spaCy pipeline (tokenization, lemmatization, POS tagging, etc. pp.)
# As you can guess, this takes some time to run (with an Apple M2 the pipeline itself (building the doc) took 1:45 minutes)

def spacy_pipeline(cleaned_speeches: pd.DataFrame) -> pd.DataFrame:
    if 'doc' not in cleaned_speeches.columns:
        cleaned_speeches['doc'] = cleaned_speeches['text'].apply(nlp)
    print("Pipeline finished.")
    if 'tokens' not in cleaned_speeches.columns:
        cleaned_speeches['tokens'] = cleaned_speeches['doc'].apply(get_token)
    if 'lemmas' not in cleaned_speeches.columns:
        cleaned_speeches['lemmas'] = cleaned_speeches['doc'].apply(get_lemma)
    if 'pos' not in cleaned_speeches.columns:
        cleaned_speeches['pos'] = cleaned_speeches['doc'].apply(get_pos)
    print("Added columns for Tokens, Lemmata, and POS.")
    return cleaned_speeches

afd_speeches_cleaned = spacy_pipeline(afd_speeches_cleaned)
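
To see what the interjection cleanup in `further_cleanup` does, the regular expression can be tried on a made-up sentence. The whitespace normalization in the last step is an addition for illustration and not part of the notebook.

```python
import re

# Made-up example text with two interjections in parentheses.
text = "Das ist falsch. (Beifall bei der AfD) Wir fordern mehr. (Zuruf von der SPD: Nein!)"

# Same pattern as in further_cleanup: non-greedy match from '(' to the next ')'.
without_interjections = re.sub(r'\(.*?\)', '', text)

# Addition for illustration: collapse the double spaces left behind by the removal.
normalized = re.sub(r'\s+', ' ', without_interjections).strip()

print(repr(without_interjections))
print(repr(normalized))
```

Because `.` does not match line breaks and the match is non-greedy, interjections that span several lines or contain nested parentheses would not be removed completely by this pattern.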


### 2.2 Preprocessing the videos (transcribing)

The following code blocks preprocess the uploaded videos. The preprocessing includes the following steps:
1. Extracting only the audio from the videos using ffmpeg
2. Transcribing the audio using OpenAI's Whisper
3. Building a dataframe with the transcribed text and further information about the corresponding videos
4. Saving the dataframe to a CSV file and pickling it for further processing
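
For step 1, the ffmpeg call can also be assembled as a list of arguments instead of one shell string, which avoids quoting problems with file names that contain spaces or quotes. A minimal sketch, using the same ffmpeg options as the notebook (`-vn`, `-ar 44100`, `-ac 2`, `-b:a 192k`); the function name `build_ffmpeg_command` and the example file name are hypothetical.

```python
from pathlib import Path

def build_ffmpeg_command(video_path):
    """Return the ffmpeg argument list that extracts the audio track as mp3."""
    video = Path(video_path)
    audio = video.with_suffix(".mp3")  # same folder and name, new extension
    return [
        "ffmpeg",
        "-i", str(video),   # input video
        "-vn",              # drop the video stream
        "-ar", "44100",     # audio sampling rate in Hz
        "-ac", "2",         # two audio channels
        "-b:a", "192k",     # audio bitrate
        str(audio),
    ]

# Hypothetical file name with a space, which needs no manual quoting here.
command = build_ffmpeg_command("data/raw_data/videos/TikTok/clip 1.webm")
print(command)
```

Such a list could then be passed to `subprocess.run(command, check=True)` without `shell = True`.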

In [None]:
"""
The following code block extracts the audio from the videos in the raw data folder and saves them as mp3 files in the same folder.
This is done to make the audio files accessible for the whisper model.

This operation should also be done only once, as it takes some time to extract the audio from the videos.
"""
path = "data/raw_data/videos/YouTube/"
for j in os.listdir(path):
    for i in os.listdir(path + j):
        if i.endswith(".webm") or i.endswith(".mp4"):
            # Extract the audio from the video
            command = "ffmpeg -i {} -vn -ar 44100 -ac 2 -b:a 192k {}".format('"'+path+j+'/'+i+'"', ('"'+path+j+'/'+i+'"').replace(".mp4", ".mp3").replace(".webm", ".mp3"))
            subprocess.call(command, shell = True)

path = "data/raw_data/videos/TikTok/"
for i in os.listdir(path):
    if i.endswith(".webm") or i.endswith(".mp4"):
        # Extract the audio from the video
        command = "ffmpeg -i {} -vn -ar 44100 -ac 2 -b:a 192k {}".format('"'+path+i+'"', ('"'+path+i+'"').replace(".mp4", ".mp3").replace(".webm", ".mp3"))
        subprocess.call(command, shell = True)

In [None]:
"""
The following code block transcribes the audio files in the raw data folder and saves them as text files in the same folder.

!!!This operation should also be done only once, as it takes much time to transcribe the audio files.!!!
"""
def transcribe_audio_files(paths_to_folders):
    for path_to_folder in paths_to_folders:
        for i in os.listdir(path_to_folder):
            if i.endswith(".mp3"):
                # Checks if the file has already been transcribed
                if not os.path.exists(path_to_folder+i.replace(".mp3", ".txt")):
                # Transcribe the audio file
                    result = model.transcribe(audio = path_to_folder+i, language = "de")
                    with open(path_to_folder+i.replace(".mp3", ".txt"), "w") as f:
                        f.write(str(result))
        print("Transcription of audio files in {} completed.".format(path_to_folder))

transcribe_audio_files(paths_to_folders)

"""
The following method builds a dataframe from the transcriptions in the txt files in the raw data folder.

@TODO: Reproducibility of the code block. The video files have to be reloaded into the folder, because ffmpeg changes their "last modified" date (at least I think so)
"""

def build_transcriptions_df(paths_to_folders):
    row_list = []
    for path_to_folder in paths_to_folders:
        for i in os.listdir(path_to_folder):
            if i.endswith(".txt"):
                with open(path_to_folder+i, "r") as f:
                    data = f.read()
                # Dict in string -> dict and extract the transcription 
                transcription = ast.literal_eval(data)['text']
                # Extract the source from the path
                source = path_to_folder.split('/')[-2]
                # Get the time of the video file. It is the time of the last modification of the video file. You have to try different file endings.
                # @TODO: Program a preprocessing stage where every file is changed to mp4
                try:
                    time = strftime('%Y-%m-%d %H:%M:%S', localtime(os.path.getmtime((path_to_folder+i).replace(".txt", ".webm"))))
                except OSError: # There is no .webm file with this name, try .mp4 instead
                    try:
                        time = strftime('%Y-%m-%d %H:%M:%S', localtime(os.path.getmtime((path_to_folder+i).replace(".txt", ".mp4"))))
                    except OSError:
                        time = "unknown"
                # Put the transcription and source in the dataframe
                row_list.append({'text': transcription, 'source': source, 'time': time})
    return row_list

transcriptions_df = pd.DataFrame(build_transcriptions_df(paths_to_folders))

transcriptions_df.to_csv("data/preprocessed_data/transcriptions.csv")
transcriptions_df.to_pickle("data/preprocessed_data/transcriptions.pkl")
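
The transcription files contain the string representation of the dictionary returned by `model.transcribe`, which is why they are read back with `ast.literal_eval`. The following sketch shows that round trip on a made-up stand-in for a result. The JSON variant at the end is only a possible alternative and an assumption, not the format used in this project.

```python
import ast
import json

# Made-up stand-in for the dictionary returned by model.transcribe(); a real result typically has more keys.
result = {"text": " Sehr geehrte Damen und Herren.", "language": "de"}

# What the notebook stores: the Python repr of the dictionary ...
stored_repr = str(result)
# ... which ast.literal_eval turns back into a dictionary without executing any code.
restored = ast.literal_eval(stored_repr)
print(restored["text"].strip())

# Possible alternative (an assumption): JSON, where ensure_ascii=False keeps umlauts readable.
stored_json = json.dumps(result, ensure_ascii=False)
print(json.loads(stored_json)["text"].strip())
```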

### 2.3 Matching the preprocessed data
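
A short video usually contains only an excerpt of a speech, so a match has to reward transcriptions that are fully contained in a much longer text. The following is a minimal sketch of such a lemma-based comparison, assuming `difflib.SequenceMatcher` as the similarity measure; the function `best_match`, the coverage score and the example lemmas are illustrative assumptions and not necessarily the method used in the thesis.

```python
import difflib

def best_match(transcript_lemmas, speeches_lemmas):
    """Return (index, score) of the speech that covers most of the transcript lemmas."""
    best_index, best_score = None, 0.0
    for index, speech_lemmas in enumerate(speeches_lemmas):
        matcher = difflib.SequenceMatcher(None, transcript_lemmas, speech_lemmas, autojunk=False)
        # Number of transcript lemmas that lie inside matching blocks of the speech
        matched = sum(block.size for block in matcher.get_matching_blocks())
        score = matched / len(transcript_lemmas) if transcript_lemmas else 0.0
        if score > best_score:
            best_index, best_score = index, score
    return best_index, best_score

# Made-up lemma lists for two speeches and one short transcript.
speeches_lemmas = [
    ["wir", "fordern", "sicher", "grenze", "für", "deutschland"],
    ["der", "haushalt", "sein", "ein", "katastrophe", "für", "das", "land"],
]
transcript_lemmas = ["haushalt", "sein", "ein", "katastrophe"]

print(best_match(transcript_lemmas, speeches_lemmas))
```

The score is the share of the transcript that is found in the speech rather than `SequenceMatcher.ratio()`, because the ratio would penalize the length difference between a short clip and a full speech. Comparing every transcription with every speech this way can be slow on long texts.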

In [None]:
"""
This method aligns the transcriptions with the speeches from the protocols.
It does this by comparing the lemmata of the speeches with the transcriptions and finding the most similar transcription. 
The lemmata are used because they are more robust to spelling errors and other noise in the text. All in all the transcriptions are pretty good
so this may not be strictly necessary.

@Input:
    transcriptions_df: The dataframe with the transcriptions
    afd_speeches_cleaned: The dataframe with the speeches from the protocols
@Output:
    The protocol dataframe (2nd argument) with the transcriptions added as a column
"""
def text_alignment(transcriptions_df: pd.DataFrame, afd_speeches_cleaned: pd.DataFrame) -> pd.DataFrame:
    # Align the transcriptions with the speeches from the protocols