# "The 'Alternative für Deutschland' and their usage of parliamentary speeches on Social Media"

## Information about the research project

In this project, the Master's thesis of Moritz Stockmar, the use of parliamentary speeches by the 'Alternative für Deutschland' (AfD) on social media is analyzed. The main objective is to learn which speeches (and which parts of them) the AfD uses on TikTok and YouTube Shorts, and what differentiates them from the population of all of the AfD's parliamentary speeches.

To achieve this goal the following steps are taken:

1. Collecting the data: The speeches of the AfD are collected from the plenary protocols, which can be found on the official website of the Bundestag. The TikTok and YouTube Shorts videos are collected from official social media accounts of the AfD.

2. Preprocessing the data: A corpus of AfD speeches during the 20th legislative period of the Bundestag is built. The short videos are transcribed and matched to the corpus entries.

3. Analyzing the data: After aligning the uploaded speeches with the official parliamentary protocols, the speeches are analyzed using a variety of methods, such as topic modeling, sentiment analysis and syntactic analysis, among others.

4. Visualising the data: The results are visualized for presentation in the Master's thesis.

The steps above are also reflected in the structure of this document. Refer to the respective sections for more detailed information.
The data was (or will be) collected by the author and is available on request. The goal is for this document to be self-contained so that it can be used to reproduce the results of the thesis.


## 0. Setting global options and loading the required libraries
The following cells load the required libraries and set some global variables. They must be run before any other section.

In [2]:
# Import necessary libraries
import pandas as pd
import numpy as np
import spacy
import bundestag_api
import os
import re
import whisper
import subprocess
import ast
import unicodedata
import difflib

from spacy import displacy
from time import strftime, localtime

# Download the required models

# Download and load the spaCy model
!python -m spacy download de_core_news_md
nlp = spacy.load("de_core_news_md")
# @TODO: Trying it with a bigger model


# Whisper: The turbo model is used. It takes about 1.5 GB of storage and uses roughly 6 GB of RAM.
# https://huggingface.co/openai/whisper-large-v3-turbo
model = whisper.load_model("turbo")
# Hugging Face Models


"""
@TODO: The following code builds the required file structure for the project.
"""


Collecting de-core-news-md==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/de_core_news_md-3.8.0/de_core_news_md-3.8.0-py3-none-any.whl (44.4 MB)
✔ Download and installation successful
You can now load the package via spacy.load('de_core_news_md')



In [12]:
# Setting global variables
BUNDESTAG_API = "I9FKdCn.hbfefNWCY336dL6x62vfwNKpoN2RZ1gp21"
period_start = '2021-10-01'
period_end = '2025-03-24'

path_yt_faction = "data/raw_data/videos/YouTube/AfD-Fraktion Bundestag (Parliamentary Faction)/"
path_yt_party = "data/raw_data/videos/YouTube/AfD-TV (Party)/"
path_tt = "data/raw_data/videos/TikTok/"
path_insta_faction = "data/raw_data/videos/Instagram/InstaFaction/"
path_insta_party = "data/raw_data/videos/Instagram/InstaParty/"
paths_to_folders = [path_yt_faction, path_yt_party, path_tt, path_insta_faction, path_insta_party]

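# Note the comma after "BSW": without it, Python silently concatenates the two adjacent strings to "BSWfraktionslos"
parties = ["CDU/CSU", "SPD", "AfD", "FDP", "BÜNDNIS 90/DIE GRÜNEN", "Die Linke", "BSW", "fraktionslos"]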
länder = ["Baden-Württemberg", "Bayern", "Berlin", "Brandenburg", "Bremen", "Hamburg", "Hessen", "Mecklenburg-Vorpommern", "Niedersachsen", "Nordrhein-Westfalen", "Rheinland-Pfalz", "Saarland", "Sachsen", "Sachsen-Anhalt", "Schleswig-Holstein", "Thüringen"]
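
The file structure mentioned in the @TODO above could be created along the following lines. This is only a minimal sketch: the folder names are taken from the paths used in this notebook, while `base_dir` is a hypothetical temporary directory so that the example does not touch the real project tree.

```python
import os
import tempfile

# Hypothetical base directory (assumption); in the project this would be the repository root
base_dir = tempfile.mkdtemp()

# Folder names taken from the paths used in this notebook
project_folders = [
    "data/raw_data/videos/TikTok",
    "data/raw_data/videos/Instagram/InstaFaction",
    "data/raw_data/videos/Instagram/InstaParty",
    "data/preprocessed_data",
]

created = []
for folder in project_folders:
    full_path = os.path.join(base_dir, folder)
    # exist_ok=True makes the call safe to repeat on every notebook run
    os.makedirs(full_path, exist_ok=True)
    created.append(full_path)

print(all(os.path.isdir(path) for path in created))
```

Because `os.makedirs` creates intermediate folders as well, listing only the deepest folders is sufficient.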


## 1. Collecting the Data
The plenary protocols are collected using the [official API](https://dip.bundestag.de/%C3%BCber-dip/hilfe/api) of the German Bundestag, accessed through the [Python wrapper](https://github.com/jschibberges/Bundestag-API) by jschibberges. The short videos were downloaded with the tool [yt-dlp](https://github.com/yt-dlp/yt-dlp). The download is not shown here, but the video database can be accessed on request.

### 1.1 Downloading the speeches

In [156]:
""" 
Collecting the protocols from the Bundestag API from the first session of the 20th legislative period 
to the date of the election of the 21st legislative period. 

This codeblock takes some time to run as it collects the data from the Bundestag API. There are 214 protocols to download. 
Only do this once, as the data is saved to a CSV file for long term storage and the dataframe is pickled for shorter term storage.

@TODO: Add the sessions after the election of the 21st legislative period to the data collection
"""
bta = bundestag_api.btaConnection(apikey = BUNDESTAG_API)
protocols = bta.search_plenaryprotocol(date_start = period_start, date_end = period_end, institution = 'BT', num = 400, fulltext = True)
protocols_df  = pd.DataFrame(protocols)

# Save the dataframe
protocols_df.to_csv("data/raw_data/protocols.csv")
protocols_df.to_pickle("data/raw_data/protocols.pkl")

### 1.2 Downloading the MdB Information

In [102]:
# Download persons from the Bundestag API. It is not completely clear how search_person works, therefore all persons are downloaded and filtered later.
bta = bundestag_api.btaConnection(apikey = BUNDESTAG_API)
persons = bta.search_person(updated_since = period_start + "T00:00:00", num = 3000)
persons_df = pd.DataFrame(persons)

# Filter the persons_df for AfD speakers, these can only be MdBs as the AfD does not have any ministers or MdBRs (members of the Bundesrat)
# There are a few too many persons, which can be explained by people leaving the faction (and now being non-attached members) or leaving parliament during the legislative period
filtered_afd_mdbs_df = persons_df[(persons_df['person_roles'].str.contains('AfD', na = False))| 
                                (persons_df['titel'].str.contains('AfD', na = False))]

# Filter the persons_df for non AfD speakers, these can be MdBs, Ministers and MdBRs 
# There is no need to filter for party other than not being in the AfD
filtered_non_afd_speekers_df = persons_df[~((persons_df['person_roles'].str.contains('AfD', na = False))| 
                                (persons_df['titel'].str.contains('AfD', na = False)))]

# Filter out everyone who did not speak in the 20th legislative period
filtered_non_afd_speekers_df = filtered_non_afd_speekers_df.loc[
    # Simple lookup for the wahlperiode == 20 (This applies to everyone speaking for the first time in the 20th legislative period)
    (filtered_non_afd_speekers_df['wahlperiode'] == 20) |
    # Complicated lookup for everyone else (This applies to everyone who spoke before the 20th legislative period and in the 20th) 
    (filtered_non_afd_speekers_df['person_roles'].apply(
        lambda roles: isinstance(roles, list) and
        any(20 in role.get('wahlperiode_nummer', []) for role in roles if isinstance(role, dict))
    ))
]

# Save the dataframe
filtered_afd_mdbs_df.to_csv("data/raw_data/afd_mdbs.csv")
filtered_afd_mdbs_df.to_pickle("data/raw_data/afd_mdbs.pkl")

filtered_non_afd_speekers_df.to_csv("data/raw_data/non_afd_speakers.csv")
filtered_non_afd_speekers_df.to_pickle("data/raw_data/non_afd_speakers.pkl")
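
The period filter above can be illustrated on plain dictionaries. The sketch below uses the field names from the code above (`wahlperiode`, `person_roles`, `wahlperiode_nummer`); the three records are invented toy data and the helper name `spoke_in_period` is hypothetical.

```python
def spoke_in_period(person: dict, period: int = 20) -> bool:
    """Return True if the record is tied to the given legislative period."""
    # Simple case: the top-level field already names the period
    if person.get("wahlperiode") == period:
        return True
    # Otherwise look through the roles, which may be missing or malformed
    roles = person.get("person_roles")
    return isinstance(roles, list) and any(
        period in role.get("wahlperiode_nummer", [])
        for role in roles
        if isinstance(role, dict)
    )

# Invented toy records; the real API response contains many more fields
persons = [
    {"titel": "A", "wahlperiode": 20, "person_roles": None},
    {"titel": "B", "wahlperiode": 19, "person_roles": [{"wahlperiode_nummer": [19, 20]}]},
    {"titel": "C", "wahlperiode": 19, "person_roles": [{"wahlperiode_nummer": [18, 19]}]},
]

kept = [person["titel"] for person in persons if spoke_in_period(person)]
print(kept)  # ['A', 'B']
```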

## 2. Preprocessing the data

In the following the preprocessing of the textual and the audio-visual data is performed. 

1. The textual data (the protocols) is processed into an annotated corpus consisting of all speeches by AfD MdBs (members of the Bundestag). To this end, all speeches of AfD MdBs first have to be extracted (in plain text) from the protocols. The next step is the linguistic preprocessing performed by spaCy, which makes the data usable for the upcoming syntactic and semantic analyses.

2. The audio-visual data (the uploaded short videos) has to be transcribed.

3. The transcriptions are matched with their corresponding speeches from the first step. 

### 2.1 Preprocessing the Text

#### 2.1.1 Extracting the AfD Speeches from the protocols

The following code block extracts the AfD speeches from the protocols and stores them in a single dataframe.

Attention: This block takes roughly 30s to complete on an M2 Mac.

In [None]:
"""
The following code block reads the protocols from the CSV file / pickled file and loads them into a pandas dataframe for further processing
if there is no dataframe already created.
"""
if os.path.exists("data/raw_data/protocols.pkl"):
    protocols_df = pd.read_pickle("data/raw_data/protocols.pkl")
elif os.path.exists("data/raw_data/protocols.csv"):
    protocols_df = pd.read_csv("data/raw_data/protocols.csv")
else:
    print("No data found. Please run the data collection code block first.")

"""The afd_mdbs_searchstrings are the strings signaling the start of every AfD Speech in the protocols by being of the form
'[o: titles] [first name] [o: infix] [last name] (AfD):' -> things in brackets are individual, things with o: are optional
This searchstring can be used as is to find the start of the speech in the protocol """
afd_mdbs_searchstrings = filtered_afd_mdbs_df.apply(lambda row: row['titel'].replace(', MdB', '').replace('Dr. ', '').replace(', AfD', ' (AfD)') + ":", axis = 1).to_list()

"""The non_afd_searchstring is part of the string signaling the start of every non AfD speech in the protocols by being of the form
'[o: titles] [first name] [o: infix] [last name]'. The party or the affiliation to the federal government or Bundesrat is not included, because that would lead to many edge cases.
This searchstring can NOT be used as is to find the start of the speech in the protocol """
non_afd_searchstings = filtered_non_afd_speekers_df.apply(lambda row: row['titel'].split(',')[0], axis = 1).to_list()


"""
Helper method to check if a line is the start of an AfD speech. This is done by checking if the line contains any of the searchstrings
and if the line does not contain the word "Frage" which indicates that the line is a purely written question and not a speech. 
These are found at the end of protocols if the time was too short for the government to answer each question orally.
"""
def is_start_of_AfDSpeech(line, searchstrings) -> bool:
    return any(searchstring in line for searchstring in searchstrings) and not "Frage" in line

"""
Second helper method to check if a line is the start of an AfD speech. It handles the case where the speaker announcement in the protocol
is split into two lines. It is only checked for "(AfD):" and the last name of the speaker. 

There are two known special cases: In the first case the line is split after the first name and the second case is that the line is split after the last name.
"""
def is_start_of_AfDSpeech_split_name(line, next_line, afd_mdbs_searchstrings) -> bool:
    if "(AfD):" in line and not "Frage" in line and any(nachname in line for nachname in filtered_afd_mdbs_df['nachname'].to_list()):
        return True
    if "(AfD):" in next_line and not "Frage" in next_line and any(nachname in line for nachname in filtered_afd_mdbs_df['nachname'].to_list()) and not is_start_of_AfDSpeech(next_line, afd_mdbs_searchstrings):
        return True
    return False

"""
Helper method to check if a line is the end of an AfD speech. This is done by checking if the line contains any of the searchstrings
and if the line ends with a colon as otherwise references to people would be falsely identified as the end of a speech. Furthermore
the line should not start with a bracket as this indicates comments from the audience.
"""
def is_end_of_AfDSpeech(line, searchstrings) -> bool:
    return any(searchstring in line for searchstring in searchstrings) and line[-1] == ':' and line[0] != '['

"""
Helper method to find the session (as in X. Sitzung der 20. Wahlperiode) and the date of the session in the protocol.
@Input: protocol: The protocol as a list of strings (lines). You only need the first 5
@Output: A tuple of the form (session, session_date)
"""
""" def find_session_and_date(protocol_lines) -> tuple:
    session = protocol_lines[0].split('/')[-1] # The session is after the last '/' in the first line
    session_date = protocol_lines[4].split(',')[-1].replace(' den ', '') # The date is after the last ', den ' in the fifth line
    return (session, session_date) """


"""
Helper method to find the start of an agenda item 
"""
def is_start_of_agenda_item(line) -> bool:
    line = line.lower()
    # The agenda item is always after something like "ich rufe" or "wir kommen" and contains either "Tagesordnungspunkt" or "Zusatzpunkt"
    # Logical structure: (expression for calling a new agenda item) and (numerical expression for the agenda item)
    return ("rufe" in line or "kommen" in line or "komme" in line or bool(re.search('setzen (.*) fort', line))) and (bool(re.search('(tagesordnungspunkte?|zusatzpunkte?) [0-9]?[0-9]?', line)))


"""
Main method to extract the speeches from a protocol. 
@Input: protocol: The protocol as a string
        afd_mdbs_searchstrings: The searchstrings to identify the start of an AfD speech
        non_afd_searchstrings: The searchstrings to identify the end of an AfD speech
@Output: A Dataframe of the form {'speaker': speaker, 'text': text, 'session': session, 'session_date': session_date, 'agenda_item': agenda_item}

The method works by iterating over the normalized lines of the protocol and checking if a line is the start of an AfD speech. If it is, the method
starts to collect the text of the speech until it finds the end of the speech. 

Afterwards a first structural cleanup is done. This means that speeches that are split due to interventions by the presiding officer (especially because of time) are concatenated.

@TODO: agenda_items is not working correctly yet. It is sometimes empty because of the hard coded way of finding the agenda item.
"""
def extract_speeches_from_protocol(protocol, session, date, afd_mdbs_searchstrings, non_afd_searchstings) -> pd.DataFrame:
    protocol_lines = protocol.split('\n')

    # First cleanup: Text normalization
    normailzed_protocol_lines = [unicodedata.normalize("NFKC", line) for line in protocol_lines]

    speeches = []
    speech_text = ""
    in_speech = False
    speaker = ""
    agenda_item = ""

    for index, line in enumerate(normailzed_protocol_lines):
        found = False
        if line == "": # Skip empty lines altogether
            continue
        if is_start_of_agenda_item(line):
            # Skip empty lines between the announcement ("Ich rufe ...") and the name of the agenda item.
            # The loop index is used instead of list.index(line), which would return the first identical line of the protocol.
            i = 1
            while index + i < len(normailzed_protocol_lines) and normailzed_protocol_lines[index + i] == "":
                i += 1
            if index + i < len(normailzed_protocol_lines):
                agenda_item = normailzed_protocol_lines[index + i]
        if in_speech and is_end_of_AfDSpeech(line, non_afd_searchstings):
            speeches.append({
                'speaker': speaker, 
                'text': speech_text, 
                'session': session, 
                'session_date': date, 
                'agenda_item': agenda_item}) # Append the speech (dict) to the list of speeches
            speech_text, speaker = "", "" # Reset the variables
            in_speech = False
        if in_speech:
            speech_text += line
        if not in_speech and is_start_of_AfDSpeech(line, afd_mdbs_searchstrings):
            in_speech = True
            speaker = line.replace(" (AfD):", "")
            found = True
        # This is a special case for the name of the speaker being split over two lines and the next_line containing the last name or just '(AfD):'
        if index < len(normailzed_protocol_lines) - 1:
            if not in_speech and not found and is_start_of_AfDSpeech_split_name(line, normailzed_protocol_lines[index + 1], afd_mdbs_searchstrings):
                print("Found split name")
                in_speech = True
                speaker = line.split(" (AfD):")[0]

    # Structural cleanup: concatenate speeches that were split by interventions of the presiding officer (usually because of time)
    cleaned_speeches = []
    for speech in speeches:
        try:
            if speech['text'].startswith('–'): # Sometimes '–' is used to indicate that a speech is split
                cleaned_speeches[-1]['text'] += speech['text']
                # str.replace returns a new string, so the result has to be assigned
                cleaned_speeches[-1]['text'] = cleaned_speeches[-1]['text'].replace('––', ' ')
            elif len(speech['text'].split(" ")) < 40: # This is a heuristic to determine if the speech is split because of an intervention because of time, 
                # as the rest of a speech after an intervention is usually shorter than 40 words and a new speech certainly would be longer
                cleaned_speeches[-1]['text'] += " " + speech['text']
            else:
                cleaned_speeches.append(speech)
        except IndexError: # There is no previous speech to attach the fragment to
            cleaned_speeches.append(speech)
    return pd.DataFrame(cleaned_speeches)

#Function call and data storage, takes roughly 30 seconds to run
afd_speeches = pd.DataFrame()
for _, row in protocols_df.iterrows():
    # Extract the AfD speeches of one protocol as a dataframe
    extracted_speeches = extract_speeches_from_protocol(
        protocol = row['text'], 
        session = row['dokumentnummer'].split('/')[1],
        date = row['datum'], 
        afd_mdbs_searchstrings = afd_mdbs_searchstrings, 
        non_afd_searchstings = non_afd_searchstings)

    # Append the extracted speeches to the overall dataframe
    afd_speeches = pd.concat([afd_speeches, extracted_speeches], ignore_index = True)

afd_speeches.to_csv("data/preprocessed_data/afd_speeches.csv")
afd_speeches.to_pickle("data/preprocessed_data/afd_speeches.pkl")
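
The core idea of the extraction (a small state machine that opens a speech at an AfD speaker line and closes it at the next announcement of another speaker) can be tried on a toy protocol. This is a simplified sketch, not the function above: the names and lines are invented, the split-name special case and the agenda items are left out, and paragraphs are joined with a space.

```python
import unicodedata

def extract_afd_speeches(lines, afd_markers, other_speakers):
    """Collect the text between an AfD speaker line and the next other speaker line."""
    speeches, current, speaker = [], [], None
    for raw in lines:
        # NFKC turns e.g. non-breaking spaces into ordinary spaces
        line = unicodedata.normalize("NFKC", raw).strip()
        if not line:
            continue
        # A line naming another speaker and ending with ':' closes the speech
        if speaker and line.endswith(":") and any(s in line for s in other_speakers):
            speeches.append({"speaker": speaker, "text": " ".join(current)})
            current, speaker = [], None
        elif speaker:
            current.append(line)
        elif any(m in line for m in afd_markers) and "Frage" not in line:
            speaker = line.replace(" (AfD):", "")
    return speeches

# Invented toy protocol; the third line contains a non-breaking space
toy_protocol = [
    "Präsidentin Erika Beispiel:",
    "Das Wort hat Max Mustermann.",
    "Max\u00a0Mustermann (AfD):",
    "Frau Präsidentin! Erster Absatz.",
    "",
    "Zweiter Absatz.",
    "Präsidentin Erika Beispiel:",
    "Vielen Dank.",
]

result = extract_afd_speeches(toy_protocol, ["Max Mustermann (AfD):"], ["Erika Beispiel"])
print(result)
```

Without the normalization step the speaker line would not match its search string, which is why the normalization comes before any matching.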

In [162]:
print(protocols_df.iloc[204]['text'])


Plenarprotokoll 20/10

Deutscher Bundestag
Stenografischer Bericht

10. Sitzung

Berlin, Mittwoch, den 12. Januar 2022



Inhalt:

Gedenken an den Präsidenten des Europäischen Parlaments, David Maria Sassoli



Begrüßung der neuen Abgeordneten Clara Bünger



Verstärkte Schutzmaßnahmen zur Eindämmung der Verbreitung des Coronavirus im Deutschen Bundestag




Dr. Bernd Baumann (AfD) (zur Geschäftsordnung)




Katja Mast (SPD) (zur Geschäftsordnung)




Thorsten Frei (CDU/CSU) (zur Geschäftsordnung)



Erweiterung und Abwicklung der Tagesordnung



Absetzung des Tagesordnungspunktes 4



Feststellung der Tagesordnung



Tagesordnungspunkt 1:

Befragung der Bundesregierung



Olaf Scholz, Bundeskanzler




Thorsten Frei (CDU/CSU)




Olaf Scholz, Bundeskanzler




Thorsten Frei (CDU/CSU)




Olaf Scholz, Bundeskanzler




Bernd Westphal (SPD)




Olaf Scholz, Bundeskanzler




Bernd Westphal (SPD)




Olaf Scholz, Bundeskanzler




Tino Chrupalla (AfD)




Olaf Scholz, Bundeskanzler




T

#### 2.1.2 Building the annotated corpus of AfD speeches

In [None]:
"""
The following code block reads the speeches of the AfD MdBs from the CSV file / pickled file and loads them into a pandas dataframe for further processing
if there is no dataframe already created.
"""

if os.path.exists("data/preprocessed_data/afd_speeches.pkl"):
    afd_speeches = pd.read_pickle("data/preprocessed_data/afd_speeches.pkl")
    print("Data loaded.")
elif os.path.exists("data/preprocessed_data/afd_speeches.csv"):
    afd_speeches = pd.read_csv("data/preprocessed_data/afd_speeches.csv")
    print("Data loaded.")
else: 
    print("No data found. Please run the data collection code block first.")


"""
The following code block is used to clean up the speeches further.
Things that are cleaned up:
 - Interjections by the listeners
 - The (AfD): from the text, they appear in the special cases from the start of speech recognition
 - The speech by Alice Weidel on the 11.02.2025 is completely butchered and has to be reconnected
"""

def further_cleanup(speeches: pd.DataFrame) -> pd.DataFrame:
    cleaned_speeches = speeches.copy()
    # Remove interjections by the listeners, these are indicated by brackets in the protocol. As far as I can tell,
    # this is the only place where brackets are used in a speech.
    cleaned_speeches['text'] = cleaned_speeches['text'].apply(lambda x: re.sub(r'\(.*?\)', '', x))
    # Remove the (AfD): from the text, they appear in the special cases from the start of speech recognition
    cleaned_speeches['text'] = cleaned_speeches['text'].apply(lambda x: x.replace('(AfD):', '')) 
    # Reconnect the speech by Alice Weidel on the 11.02.2025 (rows 13 to 15, assuming the default integer index).
    # .loc writes into the dataframe; the chained form .iloc[13]['text'] += ... only changes a temporary copy.
    cleaned_speeches.loc[13, 'text'] = " ".join(cleaned_speeches.loc[13:15, 'text'])
    cleaned_speeches.drop([14, 15], inplace = True)
    print("Cleaned up the speeches.")
    return cleaned_speeches



afd_speeches_cleaned = further_cleanup(afd_speeches)

"""
Some helper methods to simplify the .apply statements in the following code block
"""
def get_token(doc):
    return [token.text for token in doc]
def get_lemma(doc):
    return [token.lemma_ for token in doc]
def get_pos(doc):
    return [(token.pos_, token.tag_) for token in doc]

# Preprocessing the speeches with the spaCy pipeline (tokenization, lemmatization, POS tagging, etc. pp.)
# This takes some time to run (with an Apple M2 the pipeline itself (building the doc) took around 2 minutes)

def spacy_pipeline(cleaned_speeches: pd.DataFrame) -> pd.DataFrame:
    if not 'doc' in cleaned_speeches.columns:
       cleaned_speeches['doc'] = cleaned_speeches['text'].apply(nlp)
    print("Pipeline finished.")
    if not 'tokens' in cleaned_speeches.columns:
        cleaned_speeches['tokens'] = cleaned_speeches['doc'].apply(get_token)
    if not 'lemmas' in cleaned_speeches.columns:
        cleaned_speeches['lemmas'] = cleaned_speeches['doc'].apply(get_lemma)
    if not 'pos' in cleaned_speeches.columns:
        cleaned_speeches['pos'] = cleaned_speeches['doc'].apply(get_pos)
    print("Added columns for Tokens, Lemmata, and POS.")
    return cleaned_speeches

afd_speeches_cleaned = spacy_pipeline(afd_speeches_cleaned)


Data loaded.
Cleaned up the speeches.
Pipeline finished.
Added columns for Tokens, Lemmata, and POS.
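
The removal of interjections can be checked in isolation. The following sketch applies the same non-greedy pattern as `further_cleanup` to an invented sentence; collapsing the leftover double spaces is an addition of this example and not part of the code above.

```python
import re

def strip_interjections(text: str) -> str:
    """Remove bracketed interjections and tidy up the remaining whitespace."""
    # Non-greedy match: each bracket pair is removed on its own
    without_brackets = re.sub(r"\(.*?\)", "", text)
    # Extra step of this sketch: collapse the gaps left behind
    return re.sub(r"\s{2,}", " ", without_brackets).strip()

sample = "Das ist falsch. (Beifall bei der AfD) Wir lehnen das ab. (Zuruf von der SPD: Unsinn!)"
print(strip_interjections(sample))  # Das ist falsch. Wir lehnen das ab.
```

One caveat: the pattern does not handle nested brackets, so an interjection that itself contains a bracketed remark would leave a fragment behind.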


### 2.2 Preprocessing the videos (transcribing)

The following code blocks preprocess the uploaded videos. The preprocessing includes the following steps:
1. Extracting only the audio from the videos using ffmpeg
2. Transcribing the audio using OpenAI's Whisper
3. Building a dataframe with the transcribed text and further information about the corresponding videos
4. Saving the dataframe to a CSV file and pickling it for further processing
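
Step 1 can be sketched without running ffmpeg by only building the command. The flags are the ones used in this notebook; the helper name `build_ffmpeg_command` and the example filename are hypothetical.

```python
from pathlib import Path

def build_ffmpeg_command(video_path: str) -> list:
    """Build the ffmpeg call that extracts the audio track of a video as mp3."""
    source = Path(video_path)
    # Only the file ending changes; the folder and the file name stay the same
    target = source.with_suffix(".mp3")
    return [
        "ffmpeg", "-i", str(source),
        "-vn",            # drop the video stream
        "-ar", "44100",   # audio sampling rate in Hz
        "-ac", "2",       # two audio channels
        "-b:a", "192k",   # audio bitrate
        str(target),
    ]

command = build_ffmpeg_command("data/raw_data/videos/TikTok/example clip.webm")
print(command)
```

A list like this can be handed to `subprocess.run(command, check=True)` without `shell=True`, so spaces or quotes in file names need no manual escaping.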

In [None]:
"""
The following code block extracts the audio from the videos in the raw data folder and saves them as mp3 files in the same folder.
This is done to make the audio files accessible for the whisper model.

This operation should also be done only once, as it takes some time to extract the audio from the videos.
"""

"""
Takes a path to a folder and extracts the audio from the videos in the folder and saves them as mp3 files in the same folder.

@Input: path: The path to the folder with the videos. The videos should be in the formats .mp4 and .webm. 
There can be one layer of subfolders in the folder (party and faction)
"""
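def make_mp4(path: str):
    # Note: despite its name, this function writes .mp3 audio files
    for i in os.listdir(path):
        if i.endswith(".webm") or i.endswith(".mp4"):
            source = os.path.join(path, i)
            target = os.path.splitext(source)[0] + ".mp3"
            # Extract the audio from the video. Passing the arguments as a list (without a shell)
            # avoids quoting problems with spaces or quotes in file names.
            subprocess.run(["ffmpeg", "-i", source, "-vn", "-ar", "44100", "-ac", "2", "-b:a", "192k", target])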

make_mp4("data/raw_data/videos/Instagram/Faction/")
make_mp4("data/raw_data/videos/Instagram/Party/")
#make_mp4("data/raw_data/videos/TikTok/")
#make_mp4("data/raw_data/videos/YouTube/")

In [None]:
"""
The following code block transcribes the audio files in the raw data folder and saves them as text files in the same folder.

!!!This operation should also be done only once, as it takes much time to transcribe the audio files.!!!
"""
def transcribe_audio_files(paths_to_folders):
    for path_to_folder in paths_to_folders:
        for i in os.listdir(path_to_folder):
            if i.endswith(".mp3"):
                # Checks if the file has already been transcribed
                if not os.path.exists(path_to_folder+i.replace(".mp3", ".txt")):
                # Transcribe the audio file
                    result = model.transcribe(audio = path_to_folder+i, language = "de")
                    with open(path_to_folder+i.replace(".mp3", ".txt"), "w") as f:
                        f.write(str(result))
        print("Transcription of audio files in {} completed.".format(path_to_folder))

transcribe_audio_files(paths_to_folders)

"""
The following method builds the rows of a dataframe from the transcriptions in the .txt files in the raw data folder.

@TODO: Reproducibility of the code block. The video files have to be reloaded into the folder, because ffmpeg changes their "last modified" date (at least I think so)
"""

def build_transcriptions_df(paths_to_folders):
    row_list = []
    for path_to_folder in paths_to_folders:
        for i in os.listdir(path_to_folder):
            if i.endswith(".txt"):
                with open(path_to_folder+i, "r") as f:
                    data = f.read()
                # Dict in string -> dict and extract the transcription 
                transcription = ast.literal_eval(data)['text']
                # Extract the source from the path
                source = path_to_folder.split('/')[-2]
                # Get the time of the video. A date in the file name is preferred; otherwise the time of the last
                # modification of the video file is used. The different file endings have to be tried.
                # @TODO: Program a preprocessing stage where every file is changed to mp4
                date_match = re.search('202[1-5]-[0-1][0-9]-[0-3][0-9]', i)
                if date_match:
                    time = date_match.group(0)
                else:
                    time = pd.NaT # Stays NaT if no video file is found (the string "unknown" would make pd.to_datetime fail)
                    for ending in (".webm", ".mp4"):
                        try:
                            time = strftime('%Y-%m-%d %H:%M:%S', localtime(os.path.getmtime((path_to_folder+i).replace(".txt", ending))))
                            break
                        except OSError:
                            continue
                # Put the transcription and source in the dataframe; dates that cannot be parsed become NaT
                row_list.append({'text': transcription, 'source': source, 'time': pd.to_datetime(time, errors = 'coerce')})
    return row_list

all_transcriptions_df = pd.DataFrame(build_transcriptions_df(paths_to_folders))

# Filter the dataframe so that only videos uploaded during the 20th legislative period are included (25.10.2021 - 24.03.2025)
# There still might be a handful of videos that are from the 19th legislative period
transcriptions_df = all_transcriptions_df.loc[all_transcriptions_df['time'] > pd.to_datetime('2021-10-25')]
transcriptions_df = transcriptions_df.loc[transcriptions_df['time'] < pd.to_datetime('2025-03-24')]

transcriptions_df.to_csv("data/preprocessed_data/transcriptions.csv")
transcriptions_df.to_pickle("data/preprocessed_data/transcriptions.pkl")
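
The date handling above can be illustrated with the standard library alone. The sketch reuses the regular expression and the period boundaries from this cell; the file names are invented and the helper name `date_from_filename` is hypothetical.

```python
import re
from datetime import datetime
from typing import Optional

DATE_PATTERN = re.compile(r"202[1-5]-[01][0-9]-[0-3][0-9]")

def date_from_filename(filename: str) -> Optional[datetime]:
    """Return the date embedded in a file name, or None if there is no valid one."""
    match = DATE_PATTERN.search(filename)
    if match is None:
        return None
    try:
        return datetime.strptime(match.group(0), "%Y-%m-%d")
    except ValueError:
        # The pattern also accepts impossible dates such as month 13
        return None

period_start = datetime(2021, 10, 25)
period_end = datetime(2025, 3, 24)

# Invented file names
files = ["2023-05-17_rede.txt", "2019-01-01_alt.txt", "clip_ohne_datum.txt"]
in_period = []
for name in files:
    date = date_from_filename(name)
    if date is not None and period_start < date < period_end:
        in_period.append(name)

print(in_period)  # ['2023-05-17_rede.txt']
```

Files without a usable date are dropped here; in the notebook the modification time of the video file serves as a fallback before that happens.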



In [None]:
"""
In the following the duplicates in the transcription dataframe are removed.
There are duplicates because some videos were downloaded from both YouTube and TikTok
"""
from rapidfuzz import fuzz

if os.path.exists("data/preprocessed_data/transcriptions.pkl"):
    transcriptions_df = pd.read_pickle("data/preprocessed_data/transcriptions.pkl")
elif os.path.exists("data/preprocessed_data/transcriptions.csv"):
    transcriptions_df = pd.read_csv("data/preprocessed_data/transcriptions.csv")
else:
    print("No data found. Please run the data collection code block first.")

def remove_duplicates(dup_df, text_column, source_column, similarity_treshold = 80) -> tuple:
    # Remove leading and trailing whitespaces
    dup_df['text'] = dup_df['text'].apply(str.strip)
    # Lemmatize the text for better comparison
    dup_df['doc'] = dup_df['text'].apply(nlp)
    dup_df['lemma'] = dup_df['doc'].apply(get_lemma)

    # First remove the perfect duplicates with the base method of pandas
    #dup_df = dup_df.drop_duplicates(subset = 'lemma')
    dup_df = dup_df.reset_index(drop = True) # reset_index returns a new dataframe, so the result has to be assigned

    # Then remove near duplicates with the fuzzy matching algorithm
    seen_rows = []
    duplicates = []

    dup_df['dup_num'] = 0
    i = 1

    for _, row in dup_df.iterrows():
        for seen_row in seen_rows:
            if fuzz.ratio(row[text_column], seen_row[text_column]) > similarity_treshold:
                row['dup_num'] = i
                seen_row['dup_num'] = i
                duplicates.append(row)
                i += 1
                break
        seen_rows.append(row)
    
    # Convert the lists of rows to dataframes
    duplicates_df =  pd.DataFrame(duplicates)
    dup_df = pd.DataFrame(seen_rows)
    
    # Add the sources according to dup_num
    grouped_sources = duplicates_df.groupby('dup_num')['source'].apply(lambda x: ','.join(x)).to_dict()
    for dup_num, sources in grouped_sources.items():
        dup_df.loc[dup_df['dup_num'] == dup_num, 'source'] += f",{sources}"


    no_dup = dup_df.drop(duplicates_df.index)
    return no_dup, duplicates_df

transcriptions_no_dup_df, duplicates_df = remove_duplicates(transcriptions_df, 'text', 'source')

In [None]:
def remove_duplicates(dup_df, text_column, source_column, similarity_treshold=80) -> tuple:
    # Remove leading and trailing whitespaces
    dup_df['text'] = dup_df['text'].apply(str.strip)
    # Lemmatize the text for better comparison
    dup_df['doc'] = dup_df['text'].apply(nlp)
    dup_df['lemma'] = dup_df['doc'].apply(get_lemma)

    # Reset index for consistency
    dup_df.reset_index(drop=True, inplace=True)

    # Initialize variables for tracking duplicates
    seen_rows = []
    duplicates = []
    dup_df['dup_num'] = '0'  # Initialize a column for duplicate group numbers
    current_dup_num = 1  # Start numbering duplicate groups from 1

    # Iterate through rows to find duplicates
    for _, row in dup_df.iterrows():
        matched = False
        for seen_row in seen_rows:
            if fuzz.ratio(row[text_column], seen_row[text_column]) > similarity_treshold:
                # Mark the current row as a duplicate by appending the group number of the matched row to its leading '0'
                row['dup_num'] += ',' + seen_row['dup_num']
                duplicates.append(row)
                matched = True
                break
        if not matched:
            # If no match is found, assign a new dup_num
            row['dup_num'] = str(current_dup_num)
            current_dup_num += 1
        seen_rows.append(row)

    # Convert the duplicates list to a DataFrame
    duplicates_df = pd.DataFrame(duplicates)
    dup_df = pd.DataFrame(seen_rows)

    dup_df['further times'] = ""

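    leading_zero_rows = dup_df[dup_df['dup_num'].str.startswith('0')]
    # Copy so that the updates below write to an independent dataframe (avoids pandas' SettingWithCopyWarning)
    non_zero_rows = dup_df[~dup_df['dup_num'].str.startswith('0')].copy()

    # Iterate through rows with leading 0 and update the corresponding non-zero row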
    for _, row in leading_zero_rows.iterrows():
        # Remove the leading 0 from dup_num
        corrected_dup_num = row['dup_num'].lstrip('0,')

        # Find the corresponding row in non_zero_rows
        if corrected_dup_num in non_zero_rows['dup_num'].values:
            # Concatenate the source of the leading 0 row to the non-zero row
            non_zero_rows.loc[non_zero_rows['dup_num'] == corrected_dup_num, source_column] += f",{row[source_column]}"
            non_zero_rows.loc[non_zero_rows['dup_num'] == corrected_dup_num, 'further times'] += f",{row['time']}"

    return non_zero_rows, leading_zero_rows 

# Remove duplicates and update the source column
transcriptions_no_dup_df, duplicates_df = remove_duplicates(transcriptions_df, 'text', 'source')
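
The grouping of near duplicates can be sketched with the standard library. Note the assumptions: `difflib.SequenceMatcher` is used here as a stand-in for rapidfuzz, its ratio runs from 0 to 1 instead of 0 to 100, and each text is compared only with the first member of every group, which is a simplification of the function above. The texts are invented.

```python
from difflib import SequenceMatcher

def group_near_duplicates(texts, threshold=0.8):
    """Group indices of texts whose similarity to a group's first member exceeds the threshold."""
    groups = []  # each group is a list of indices; the first index is the representative
    for index, text in enumerate(texts):
        for group in groups:
            representative = texts[group[0]]
            similarity = SequenceMatcher(None, text, representative, autojunk=False).ratio()
            if similarity > threshold:
                group.append(index)
                break
        else:
            # No existing group is similar enough, so the text starts a new one
            groups.append([index])
    return groups

# Invented transcriptions: the first two differ only in the final punctuation mark
texts = [
    "Die Regierung hat versagt und muss zurücktreten.",
    "Die Regierung hat versagt und muss zurücktreten!",
    "Wir fordern eine andere Energiepolitik.",
]

groups = group_near_duplicates(texts)
print(groups)  # [[0, 1], [2]]
```

Keeping whole groups instead of dropping rows right away makes it easy to merge the sources and upload times of all members into the representative row afterwards.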

### 2.3 Matching the preprocessed data
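
The matching step relies on a partial similarity score, because an uploaded clip usually covers only an excerpt of a speech. As a rough illustration of what such a score measures, the following sketch slides the excerpt over the longer text and keeps the best window. It uses `difflib` from the standard library as a stand-in; it is not how `rapidfuzz` computes `partial_ratio`, and the helper name `partial_similarity` as well as the toy strings are invented for this example.

```python
from difflib import SequenceMatcher


def partial_similarity(excerpt: str, full_text: str, step: int = 1) -> float:
    """Best similarity (0-100) between `excerpt` and any equally long window of `full_text`."""
    if not excerpt or not full_text:
        return 0.0
    if len(excerpt) >= len(full_text):
        # Nothing to slide over: compare the two strings directly
        return 100 * SequenceMatcher(None, excerpt, full_text).ratio()
    best = 0.0
    for start in range(0, len(full_text) - len(excerpt) + 1, step):
        window = full_text[start:start + len(excerpt)]
        best = max(best, SequenceMatcher(None, window, excerpt).ratio())
    return 100 * best


# Toy strings (invented, not taken from the corpus)
speech = ("Frau Präsidentin! Meine sehr geehrten Damen und Herren! "
          "Die Energiepreise steigen seit Monaten, und die Regierung schaut zu.")
clip = "Die Energiepreise steigen seit Monaten"
unrelated = "Der Haushalt wird morgen in zweiter Lesung beraten"

score_clip = partial_similarity(clip, speech)
score_unrelated = partial_similarity(unrelated, speech)
print(score_clip, score_unrelated)
```

An excerpt that occurs verbatim in the longer text reaches the maximum score, while unrelated text stays below it. This is why a threshold on a partial score can find the speech a clip was cut from.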

In [191]:
"""
This function aligns the transcriptions with the speeches from the protocols.
It does this by comparing the lemmata of the speeches with the transcriptions and taking, for each transcription, the first speech whose similarity score exceeds the threshold.
The lemmata are used because they are more robust to spelling errors and other noise in the text. All in all, the transcriptions are pretty good,
so this may not be strictly necessary.

@Input:
    speeches_df: The dataframe with the speeches from the protocols
    transcriptions_df: The dataframe with the transcriptions
    treshold: The minimum similarity score (0-100) for a match
@Output:
    A dataframe with the transcriptions and the aligned speeches

Attention! This code block takes some time to run, as it has to compare every transcription with the speeches (> 3500).
The fuzzy matching algorithm is also not the fastest, because it must deal with the fact that most uploaded videos
cover only a part of a speech.
"""
from rapidfuzz import fuzz

def align_speeches_with_transcriptions(speeches_df, transcriptions_df, treshold = 70):
    aligned_data = []
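    # Both dataframes are iterated in the order in which they are passed in.
    # (Calling sort_values without assigning its result has no effect, so the former calls were dropped.
    # The order is kept as is, because the manual cleanup below refers to rows by the resulting index.)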
    num = 0

    # The outer loop has to run over the transcriptions, because sometimes multiple videos show different parts of one speech
    for _, transcription_row in transcriptions_df.iterrows():
        # The excerpt does not start at character 0: when the whole speech was uploaded, the opening
        # "Frau Präsidentin! Meine sehr geehrten Damen und Herren!" skews the results if the similarity threshold is lowered too much
        search_text = transcription_row['text'][150:400]
        found_speech, found_speech_date, speaker = None, None, None
        transcription_text = transcription_row['text']
        transcription_date = pd.to_datetime(transcription_row['time'])
        for _, speech_row in speeches_df.iterrows():
            speech_text = speech_row['text']
            speech_date = pd.to_datetime(speech_row['session_date'])
            # Only consider speeches given on or before the upload date; this also speeds up the search a bit
            if speech_date <= transcription_date:
                similarity_score = fuzz.partial_ratio(speech_text, search_text, score_cutoff = treshold)
                if similarity_score > treshold:
                    found_speech = speech_text
                    found_speech_date = speech_date
                    speaker = speech_row['speaker']
                    num += 1
                    print('Found number {}!'.format(num))
                    break
        aligned_data.append({
            'speech_text': found_speech,
            'session_date': found_speech_date,
            'transcription_text': transcription_text,
            'transcription_date': transcription_date,
            'speaker': speaker,
            'uploaded_source': transcription_row['source'],
        })
    return pd.DataFrame(aligned_data)

aligned_df = align_speeches_with_transcriptions(afd_speeches_cleaned, transcriptions_no_dup_df)

aligned_df.to_csv("data/preprocessed_data/aligned_df.csv")
aligned_df.to_pickle("data/preprocessed_data/aligned_df.pkl")



Found number 1!
Found number 2!
Found number 3!
...
Found number 59!
[remaining output truncated]

In [192]:
"""
Cleanup: In the following code block, the aligned dataframe is cleaned up by hand.

This means that obvious misses are corrected. Every step is documented in the code block.
"""

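if os.path.exists("data/preprocessed_data/aligned_df.pkl"):
    aligned_df_temp = pd.read_pickle("data/preprocessed_data/aligned_df.pkl")
elif os.path.exists("data/preprocessed_data/aligned_df.csv"):
    # The CSV was written including its index, so read the first column back as the index
    # to keep the row labels that the manual corrections below rely on
    aligned_df_temp = pd.read_csv("data/preprocessed_data/aligned_df.csv", index_col = 0, parse_dates = ['session_date', 'transcription_date'])
else:
    # Stop here instead of failing later with a NameError
    raise FileNotFoundError("No data found. Please run the alignment code block first.")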


# aligned_df = pd.read_pickle("data/preprocessed_data/aligned_df.pkl")
#aligned_df = pd.read_csv("data/preprocessed_data/aligned_df.csv")
aligned_df_temp.drop(index = 17, inplace = True) # Not an AfD MdB (es unglaublich, was für ein Angriff )
aligned_df_temp.drop(index = 218, inplace = True) # This is not a speech by an AfD MdB but an announcement from the speaker (2014, 698 wurden abgegeben...)
aligned_df_temp.drop(index = 333, inplace = True) # This is not a speech by an AfD MdB but an announcement from the speaker (Abgegebene Stimmkarten 683)
aligned_df_temp.drop(index = 267, inplace = True) # This is not a speech by an AfD MdB (Diese Landtagswahlen in Thüringen)
aligned_df_temp.drop(index = 314, inplace = True) # This is not even a speech in the Bundestag (Die Ukraine ist nicht das 17. Bundesland)
aligned_df_temp.drop(index = 152, inplace = True) # This speech is not in protocols_df as they forgot to upload a complete version (even if they wrote they would) (Wenn wir mitbekommen, wie in den )
aligned_df_temp.drop(index = 382, inplace = True) # This is a debate between an AfD MdB and a minister of state which does not fit the following algorithms (Die fünf wichtigsten und aktuell)
aligned_df_temp.drop(index = 420, inplace = True) # This is a debate between an AfD MdB and a minister of state which does not fit the following algorithms (Wer hätte gedacht, dass steigende)
aligned_df_temp.drop(index = 92, inplace = True) # No longer an AfD MdB (Die Abschaffung des Netzwerkdurchsetzungsgesetzes)
aligned_df_temp.drop(index = 416, inplace = True) # No longer an AfD MdB (Frau Präsidentin, werte Kollegen! Bevor ich)
aligned_df_temp.drop(index = 184, inplace = True) # Not an AfD Speaker

# This video is more of a compilation: (Die Fahne Deutschlands)
aligned_df_temp.at[193,'speech_text'] = afd_speeches_cleaned.iloc[1013]['text']
aligned_df_temp.at[193,'session_date'] = afd_speeches_cleaned.iloc[1013]['session_date']
aligned_df_temp.at[193, 'speaker'] = afd_speeches_cleaned.iloc[1013]['speaker']
aligned_df_temp.at[193,'transcription_text'] = aligned_df_temp.at[193,'transcription_text'].split('Wir halten')[0]

# This video is more of a compilation: (Herr Reichert, Sie haben es ja)
aligned_df_temp.at[172,'speech_text'] = afd_speeches_cleaned.iloc[1107]['text']
aligned_df_temp.at[172,'session_date'] = afd_speeches_cleaned.iloc[1107]['session_date']
aligned_df_temp.at[172,'speaker'] = afd_speeches_cleaned.iloc[1107]['speaker']
aligned_df_temp.at[172,'transcription_text'] = aligned_df_temp.at[172,'transcription_text'].split('Laut einer')[1]

# This video is a discussion between the AfD MdB and the speaker (Weil Sie die Wirklichkeit nicht akzeptieren)
aligned_df_temp.at[265,'speech_text'] = afd_speeches_cleaned.iloc[1350]['text']
aligned_df_temp.at[265,'session_date'] = afd_speeches_cleaned.iloc[1350]['session_date']
aligned_df_temp.at[265,'speaker'] = afd_speeches_cleaned.iloc[1350]['speaker']
aligned_df_temp.at[265,'transcription_text'] = aligned_df_temp.at[265,'transcription_text'].split('Ihr Offenbarungsverbot')[1]
# This video as well (Vielen Dank, Frau Präsidentin)
aligned_df_temp.at[411,'speech_text'] = afd_speeches_cleaned.iloc[1350]['text']
aligned_df_temp.at[411,'session_date'] = afd_speeches_cleaned.iloc[1350]['session_date']
aligned_df_temp.at[411,'speaker'] = afd_speeches_cleaned.iloc[1350]['speaker']
aligned_df_temp.at[411,'transcription_text'] = aligned_df_temp.at[411,'transcription_text'].split('Die Beschlussfähigkeit')[0]
# In this video the audio is not only the speech (Herr Habeck, der bei der Nationalhymne)
aligned_df_temp.at[281,'speech_text'] = afd_speeches_cleaned.iloc[1818]['text'] 
aligned_df_temp.at[281,'session_date'] = afd_speeches_cleaned.iloc[1818]['session_date']
aligned_df_temp.at[281,'speaker'] = afd_speeches_cleaned.iloc[1818]['speaker']
aligned_df_temp.at[281,'transcription_text'] = aligned_df_temp.at[281,'transcription_text'].split('Untertitelung')[0]
# This is more of a discussion between the speaker and the visitors (Wir von der AfD erkennen diesen Völkermord) -> Sichert
aligned_df_temp.at[227,'speech_text'] = afd_speeches_cleaned.iloc[2458]['text'] 
aligned_df_temp.at[227,'session_date'] = afd_speeches_cleaned.iloc[2458]['session_date']
aligned_df_temp.at[227,'speaker'] = afd_speeches_cleaned.iloc[2458]['speaker']
aligned_df_temp.at[227,'transcription_text'] = aligned_df_temp.at[227,'transcription_text'].split('Ich weise')[0]

# Misses where no match was found even though the speech is in the protocols (normally because the transcription is not close enough to the protocol, for example when many special characters are involved)
## (Im Juni 23 bringt die Union)
aligned_df_temp.at[123,'speech_text'] = afd_speeches_cleaned.iloc[1526]['text']
aligned_df_temp.at[123,'session_date'] = afd_speeches_cleaned.iloc[1526]['session_date']
aligned_df_temp.at[123,'speaker'] = afd_speeches_cleaned.iloc[1526]['speaker']   
## (Ja, dass Ihnen das nicht gefällt) -> Frohnmeier
aligned_df_temp.at[72,'speech_text'] = afd_speeches_cleaned.iloc[2015]['text']
aligned_df_temp.at[72,'session_date'] = afd_speeches_cleaned.iloc[2015]['session_date']
aligned_df_temp.at[72,'speaker'] = afd_speeches_cleaned.iloc[2015]['speaker']  
## Eigennutz, Klüngelei, Vetternwirtschaft -> Brandner
aligned_df_temp.at[41,'speech_text'] = afd_speeches_cleaned.iloc[2091]['text']
aligned_df_temp.at[41,'session_date'] = afd_speeches_cleaned.iloc[2091]['session_date'] 
aligned_df_temp.at[41,'speaker'] = afd_speeches_cleaned.iloc[2091]['speaker']
## (Hinter dem Terror der letzten Generation) -> Brandner
aligned_df_temp.at[270,'speech_text'] = afd_speeches_cleaned.iloc[2679]['text']
aligned_df_temp.at[270,'session_date'] = afd_speeches_cleaned.iloc[2679]['session_date']
aligned_df_temp.at[270,'speaker'] = afd_speeches_cleaned.iloc[2679]['speaker']
## (Übermorgen ist 1. Advent) -> Boehringer
aligned_df_temp.at[440,'speech_text'] = afd_speeches_cleaned.iloc[2612]['text']
aligned_df_temp.at[440,'session_date'] = afd_speeches_cleaned.iloc[2612]['session_date']
aligned_df_temp.at[440,'speaker'] = afd_speeches_cleaned.iloc[2612]['speaker']
## (Was vielerorts und auch hier von) -> Brandner
aligned_df_temp.at[364,'speech_text'] = afd_speeches_cleaned.iloc[2679]['text']
aligned_df_temp.at[364,'session_date'] = afd_speeches_cleaned.iloc[2679]['session_date']
aligned_df_temp.at[364,'speaker'] = afd_speeches_cleaned.iloc[2679]['speaker']
aligned_df_temp.at[161,'speech_text'] = afd_speeches_cleaned.iloc[2679]['text']
aligned_df_temp.at[161,'session_date'] = afd_speeches_cleaned.iloc[2679]['session_date']
aligned_df_temp.at[161,'speaker'] = afd_speeches_cleaned.iloc[2679]['speaker']
## (Die Bürger haben jedes Recht, gegen) -> Weidel
aligned_df_temp.at[142,'speech_text'] = afd_speeches_cleaned.iloc[3010]['text']
aligned_df_temp.at[142,'session_date'] = afd_speeches_cleaned.iloc[3010]['session_date'] 
aligned_df_temp.at[142,'speaker'] = afd_speeches_cleaned.iloc[3010]['speaker']
## (Wenn Migranten uns Deutsche) -> Baumann
aligned_df_temp.at[195,'speech_text'] = afd_speeches_cleaned.iloc[3072]['text']
aligned_df_temp.at[195,'session_date'] = afd_speeches_cleaned.iloc[3072]['session_date'] 
aligned_df_temp.at[195,'speaker'] = afd_speeches_cleaned.iloc[3072]['speaker']
## (230.000 Stromsperren) -> Espendiller
aligned_df_temp.at[429,'speech_text'] = afd_speeches_cleaned.iloc[3676]['text']
aligned_df_temp.at[429,'session_date'] = afd_speeches_cleaned.iloc[3676]['session_date'] 
aligned_df_temp.at[429,'speaker'] = afd_speeches_cleaned.iloc[3676]['speaker']
## Die ganze Welt setzt auf die nahezu -> Korte
aligned_df_temp.at[95,'speech_text'] = afd_speeches_cleaned.iloc[3691]['text']
aligned_df_temp.at[95,'session_date'] = afd_speeches_cleaned.iloc[3691]['session_date'] 
aligned_df_temp.at[95,'speaker'] = afd_speeches_cleaned.iloc[3691]['speaker']
## Es ist unerträglich was in dieser Fraktion -> Brandner
aligned_df_temp.at[91,'speech_text'] = afd_speeches_cleaned.iloc[3697]['text']
aligned_df_temp.at[91,'session_date'] = afd_speeches_cleaned.iloc[3697]['session_date']
aligned_df_temp.at[91,'speaker'] = afd_speeches_cleaned.iloc[3697]['speaker']
## Ich weiß nicht, was Sie hier eigentlich machen -> Weidel
aligned_df_temp.at[315,'speech_text'] = afd_speeches_cleaned.iloc[14]['text']
aligned_df_temp.at[315,'session_date'] = afd_speeches_cleaned.iloc[14]['session_date'] 
aligned_df_temp.at[315,'speaker'] = afd_speeches_cleaned.iloc[14]['speaker']

# Misses where I don't understand why the speech wasn't recognized
## (Vielen Dank, Frau Präsidentin) -> Brandner
aligned_df_temp.at[411,'speech_text'] = "Vielen Dank, Frau Präsidentin. – § 45 Absatz 1 der Geschäftsordnung sieht vor, dass der Bundestag beschlussfähig ist, wenn mehr als die Hälfte seiner Mitglieder anwesend ist. Der Bundestag hat zurzeit 736 Mitglieder, mehr als die Hälfte wären 369 Mitglieder. Ich habe gerade mal durchgezählt: Es dürften ungefähr 250 bis 300 Mitglieder fehlen, um die Beschlussfähigkeit des Deutschen Bundestags herzustellen. Deshalb bezweifle ich nach § 45 Absatz 2 der Geschäftsordnung für die Fraktion der Alternative für Deutschland die Beschlussfähigkeit des Bundestages. Ich weiß, dass sich das Präsidium dazu gleich verständigen wird. Ich behalte mir für den Fall, dass das Präsidium sich einig sein sollte, dass die Beschlussfähigkeit gegeben ist, vor, eine namentliche Abstimmung zu beantragen. Vielen Dank."
aligned_df_temp.at[411,'session_date'] = pd.to_datetime('2023-07-07')
aligned_df_temp.at[411,'speaker'] = "Stephan Brandner"

## (Wie ist das vereinbar mit dem Gebot) -> Storch
aligned_df_temp.at[6,'speech_text'] = "Wie ist das vereinbar mit dem Gebot des Bundesverfassungsgerichtes, dass das Geschlecht eindeutig und dauerhaft sein muss – mit Blick auf das Selbstbestimmungsgesetz, das jetzt alles völlig chaotisiert?"
aligned_df_temp.at[6,'session_date'] = pd.to_datetime('2023-06-14')
aligned_df_temp.at[6,'speaker'] = "Beatrix von Storch"

## (Da die Grünen das Problem) -> Bleck
speech_bleck = """ Werte Frau Präsidentin! Werte Kolleginnen und Kollegen! Werter Herr Kollege Träger, ich bin überhaupt nicht verwundert, vor allem, weil ich weiß, dass Sie wider besseres Wissen sprechen. Sie wissen ganz genau, dass Ihre undemokratischen Maßnahmen dazu führen, dass nicht alle AfD-Abgeordneten teilnehmen können, die gerne teilnehmen würden.
In der Europäischen Union ist die Bundesregierung wieder einmal der Geisterfahrer. Sie lehnt die Aufnahme der Kernenergie in die Taxonomie ab, und das, obwohl sie sowohl CO2-arm als auch grundlastfähig ist. Sie ist also die Antwort auf die Frage, wie man Klimaschutz und Versorgungssicherheit in Einklang bringen kann.
Da die Grünen das Problem der erneuerbaren Energien bei der Versorgungssicherheit nicht verstehen und sich Vizepräsidentin Katrin Göring-Eckardt Poesie im Deutschen Bundestag wünscht, versuche ich, dieses Problem in einfacher Sprache poetisch am Beispiel der Windkraft mit Wilhelm Busch zu erklären:
Aus der Mühle schaut der Müller, Der so gerne mahlen will. Stiller wird der Wind und stiller, Und die Mühle stehet still.
So gehts immer, wie ich finde, Rief der Müller voller Zorn. Hat man Korn, so fehlts am Winde. Hat man Wind, so fehlt das Korn.
Ja, werte Kolleginnen und Kollegen, im Unterschied zum 21. Jahrhundert wusste man im 19. Jahrhundert, dass auf die Windkraft nicht ohne Weiteres Verlass ist.
In Deutschland interessiert sich die Bundesregierung zwar für Klimatreiber, nicht aber für Preistreiber. Die Folge: 4 Prozent Inflation und explodierende Strom- und Gaspreise. Mit der EEG-Umlage und der CO2-Abgabe werden die Bürger gnadenlos abkassiert. Die Regierungen von Polen und Tschechien wollen ihre Bürger mit einer Aufhebung oder Senkung der Mehrwert- bzw. Umsatzsteuer auf Strom und Gas entlasten. Und die Bundesregierung? Bundeslandwirtschaftsminister Cem Özdemir sinniert währenddessen über höhere Lebensmittelpreise. Herzlichen Glückwunsch! Ihre Politik gegen die globale Erwärmung ist eine Politik der sozialen Kälte.
Darüber hinaus positioniert sich die Bundesregierung im Spannungsfeld zwischen Klimaschutz und Artenschutz völlig einseitig. Der Ausbau der erneuerbaren Energien soll im öffentlichen Interesse sein und der öffentlichen Sicherheit dienen. Habeck nennt das: die Energiewende mit Artenschutz versöhnen. Er verwechselt offenbar „versöhnen“ und „versündigen“.
Fakt ist: Windkraftanlagen töten jährlich Hunderttausende Vögel und Fledermäuse. Eine Fläche, die etwa dreimal so groß wie das Saarland ist, wollen Sie mit Windkraftanlagen verspargeln. Damit opfern insbesondere die grünen Klimaapostel den Artenschutz auf dem Altar der Energiewende.
Und Widerspruch aus dem Bundesumweltministerium gibt es nicht.
Die Pläne der Bundesregierung zum Ausbau der erneuerbaren Energien sind gefährlich. Windkraftanlagen sollen näher an Häuser gebaut werden. Dadurch werden Bewohner durch Infraschall stärker gesundheitlich belastet. Windkraftanlagen sollen auch näher an Drehfunkfeuer gebaut werden. Dadurch werden die Signale zur Orientierung von Flugzeugen stärker gestört.
Sie, werte Kolleginnen und Kollegen, ignorieren das öffentliche Interesse. Sie sind ein Sicherheitsrisiko.
Es ist unfassbar, mit welcher Dreistigkeit Sie die Wirklichkeit verdrehen.
Doch die größte Enttäuschung der Ampelkoalition – von Ihnen habe ich nichts anderes erwartet – ist tatsächlich die FDP. Früher forderte sie unter anderem die Abschaffung der EEG-Umlage, ein Verbot des Baus von Windkraftanlagen in Wäldern und eine technologieoffene Förderung. Das hat sie ja mit gutem Grund gefordert. Heute ist davon aber nichts mehr übrig.
Für die FDP waren diese Forderungen bei der Bildung der Ampelkoalition Verhandlungsmasse, die man jeweils für vier Ministersitze und Ministerwagen bereitwillig aufgegeben hat.
Damit nimmt sich die Ampelkoalition tatsächlich eine Ampel zum Vorbild: Bei einer Ampel sieht man häufig Rot und Grün, und bei Gelb hält sowieso niemand.
Vielen Dank."""
aligned_df_temp.at[10,'speech_text'] = speech_bleck
aligned_df_temp.at[271, 'speech_text'] = speech_bleck
aligned_df_temp.at[10,'session_date'] = pd.to_datetime('2022-01-12')
aligned_df_temp.at[271,'session_date'] = pd.to_datetime('2022-01-12')
aligned_df_temp.at[10,'speaker'] = "Andreas Bleck"
aligned_df_temp.at[271,'speaker'] = "Andreas Bleck"

# The following speeches were uploaded during the 20th legislative period, but are from the 19th
aligned_df_temp.drop(index = 432, inplace = True) # Von rund 4.000 bislang
aligned_df_temp.drop(index = 347, inplace = True) # Was unsere Regierung -> Corta
aligned_df_temp.drop(index = 385, inplace = True) # Ihr Schwarz-Weiß-Denken -> Chrupalla
aligned_df_temp.drop(index = 408, inplace = True) # Nun wollen Sie sich -> Brandner
aligned_df_temp.drop(index = 436, inplace = True) # Nach dem Vorbild des schädlichen -> Curio
aligned_df_temp.drop(index = 439, inplace = True) # Aus unserer Sicht wäre es besser -> Frömming
aligned_df_temp.drop(index = 425, inplace = True) # Denn es geht uns als Parlamentarier -> Huber
aligned_df_temp.drop(index = 392, inplace = True) # Wir müssen den Irrweg so schnell -> Sichert
aligned_df_temp.drop(index = 360, inplace = True) 
aligned_df_temp.drop(index = 383, inplace = True)
aligned_df_temp.drop(index = 379, inplace = True) 
aligned_df_temp.drop(index = 370, inplace = True) # Natürlich kann man alle Transaktionen
aligned_df_temp.drop(index = 371, inplace = True) # Dann muss man sich schon fragen,
aligned_df_temp.drop(index = 428, inplace = True) 


finished_alignment_df = aligned_df_temp
finished_alignment_df.to_pickle("data/preprocessed_data/finished_alignment_df.pkl")
finished_alignment_df.to_csv("data/preprocessed_data/finished_alignment_df.csv")
print("Finished alignment and cleanup.")

Finished alignment and cleanup.
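
The manual corrections above repeat the same three assignments (`speech_text`, `session_date`, `speaker`) for every corrected row. As a minimal sketch, such a correction could be wrapped in one helper. The function name `assign_speech` and the toy dataframes are hypothetical and only illustrate the pattern; the column names are the ones used in the cleanup cell.

```python
import pandas as pd


def assign_speech(aligned: pd.DataFrame, speeches: pd.DataFrame,
                  aligned_idx: int, speech_pos: int) -> None:
    """Copy text, date and speaker of the speech at position `speech_pos`
    into the row labelled `aligned_idx` of `aligned` (modified in place)."""
    speech = speeches.iloc[speech_pos]
    aligned.at[aligned_idx, "speech_text"] = speech["text"]
    aligned.at[aligned_idx, "session_date"] = speech["session_date"]
    aligned.at[aligned_idx, "speaker"] = speech["speaker"]


# Toy data standing in for afd_speeches_cleaned and aligned_df_temp
speeches = pd.DataFrame({
    "text": ["speech A", "speech B"],
    "session_date": pd.to_datetime(["2022-01-12", "2023-07-07"]),
    "speaker": ["Speaker A", "Speaker B"],
})
aligned = pd.DataFrame({
    "speech_text": [None, None],
    "session_date": [pd.NaT, pd.NaT],
    "speaker": [None, None],
})

# Fill row 1 of the aligned frame with the speech at position 0
assign_speech(aligned, speeches, aligned_idx=1, speech_pos=0)
print(aligned)
```

With such a helper, each manual correction would shrink to one line plus its explanatory comment, which makes a mistyped row index easier to spot.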


## 3. Data Analysis

In the following, the data is analyzed according to the theoretically derived hypotheses. The main dataframes used for this are `finished_alignment_df` and `afd_speeches_cleaned`.

### 3.1 Getting to know the data set

The following analyses do not test any hypotheses; they only visualize some interesting aspects of the data set, including:
- How long are the videos (in words)?
- How are the uploads distributed over time and platform?
- How much time passes between a speech being given and its upload?
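
The three questions above can be sketched with a few pandas operations. The example runs on a toy dataframe that reuses the column names of the aligned dataframe (`transcription_text`, `session_date`, `transcription_date`). The `platform` column is an assumption made for this sketch; in the real data it would have to be derived from the upload source.

```python
import pandas as pd

# Toy data with the column names of the aligned dataframe; "platform" is hypothetical
toy = pd.DataFrame({
    "transcription_text": ["eins zwei drei", "eins zwei drei vier fünf"],
    "session_date": pd.to_datetime(["2023-06-14", "2023-07-07"]),
    "transcription_date": pd.to_datetime(["2023-06-16", "2023-07-07"]),
    "platform": ["TikTok", "YouTube Shorts"],
})

# Video length in words (whitespace-separated tokens of the transcription)
toy["n_words"] = toy["transcription_text"].str.split().str.len()

# Days between the speech and its upload
toy["upload_delay_days"] = (toy["transcription_date"] - toy["session_date"]).dt.days

# Number of uploads per month and platform
uploads_per_month = toy.groupby(
    [toy["transcription_date"].dt.to_period("M"), "platform"]
).size()

print(toy[["n_words", "upload_delay_days"]])
print(uploads_per_month)
```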

In [None]:
#@TODO

### 3.2 Meta-Data of the speeches

The following analyses test these hypotheses:
- <b> 1.1 </b> Speeches given during an agenda item that is not a debate about a bill are overrepresented.
- <b> 1.2 </b> Speeches given by a prominent figure are overrepresented.
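
One simple way to make "overrepresented" concrete is to compare the share of a category among the uploaded speeches with its share among all speeches. The sketch below computes this ratio for toy data; a value above 1 means the category is more frequent among the uploads than in the population. The function name and the category labels are placeholders, and the ratio is only a descriptive measure, not a significance test.

```python
from collections import Counter


def representation_ratio(uploaded, population):
    """Share of each category among `uploaded` divided by its share in `population`."""
    if not uploaded or not population:
        raise ValueError("Both samples must be non-empty.")
    up_counts = Counter(uploaded)
    pop_counts = Counter(population)
    return {
        category: (up_counts[category] / len(uploaded)) / (count / len(population))
        for category, count in pop_counts.items()
    }


# Toy data: agenda-item type of every speech and of the uploaded speeches (labels are hypothetical)
population = ["bill_debate"] * 6 + ["other"] * 4
uploaded = ["bill_debate"] * 1 + ["other"] * 3

ratios = representation_ratio(uploaded, population)
print(ratios)
```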

In [None]:
# @TODO

### 3.3 Thematics and Semantics

The following analyses test these hypotheses:

- <b> 2.1 </b> Speeches on topics that were salient at the time are overrepresented.
- <b> 2.2 </b> Speeches on topics from the programmatic core of the AfD's ideology are overrepresented.
- <b> 2.3 </b> Patterns from the standard repertoire of right-wing populism are overrepresented.
    - <b> 2.3.1 </b> Blaming plays a big role
    - <b> 2.3.2 </b> Crisis rhetoric plays a big role
    - <b> 2.3.3 </b> "Us vs. them" arguments play a big role
    - <b> 2.3.4 </b> "Common sense" arguments play a big role
    - <b> 2.3.5 </b> "The people" play a big role
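
One possible operationalization of the sub-hypotheses under 2.3 is a dictionary-based count: for each pattern, count how often marker terms occur per 1,000 tokens. The sketch below shows the mechanics only. The term lists are placeholders invented for this example, not a validated dictionary, and exact token matching misses inflected forms, so lemmatized text would be the better input.

```python
import re

# Placeholder term lists (hypothetical, for illustration only)
MARKERS = {
    "crisis": {"krise", "katastrophe", "chaos"},
    "people": {"volk", "bürger"},
}


def marker_rates(text, markers=MARKERS):
    """Occurrences of each marker group per 1,000 tokens of `text`."""
    tokens = re.findall(r"\w+", text.lower())
    if not tokens:
        return {name: 0.0 for name in markers}
    return {
        name: 1000 * sum(token in terms for token in tokens) / len(tokens)
        for name, terms in markers.items()
    }


# Toy sentence (invented, not taken from the corpus)
rates = marker_rates("Die Krise trifft die Bürger. Das Volk zahlt.")
print(rates)
```

Comparing these rates between the uploaded speeches and all speeches would then follow the same logic as the other hypotheses.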

In [None]:
# @TODO

### 3.4 Emotions

The following analyses test these hypotheses:

- <b> 3.1 </b> The uploaded speeches have a more negative sentiment on average.
- <b> 3.2 </b> The uploaded speeches contain more hate speech.
- <b> 3.3 </b> The uploaded speeches contain more direct addresses to the citizens / the people.
- <b> 3.4 </b> The uploaded speeches contain more direct addresses to the government / the other parties.
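
Hypotheses of the form "higher or lower on average" can be checked with a permutation test, which needs no distributional assumptions. The sketch below tests whether the mean score of the uploaded speeches is lower than that of the remaining speeches. The scores are made-up numbers; how a sentiment score per speech is obtained is outside the scope of this sketch.

```python
import random
from statistics import mean


def permutation_p_value(uploaded, others, n_permutations=2000, seed=0):
    """One-sided permutation test for mean(uploaded) < mean(others).

    Returns the observed difference in means and the estimated p-value."""
    rng = random.Random(seed)
    observed = mean(uploaded) - mean(others)
    pooled = list(uploaded) + list(others)
    k = len(uploaded)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        # Difference in means under a random relabelling of the speeches
        if mean(pooled[:k]) - mean(pooled[k:]) <= observed:
            hits += 1
    # Add-one correction so that the estimate is never exactly zero
    return observed, (hits + 1) / (n_permutations + 1)


# Made-up sentiment scores per speech (negative = more negative sentiment)
uploaded_scores = [-0.6, -0.4, -0.5, -0.7]
other_scores = [0.1, 0.0, 0.2, -0.1, 0.3, 0.1]

observed_diff, p_value = permutation_p_value(uploaded_scores, other_scores)
print(observed_diff, p_value)
```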

In [None]:
# @TODO

### 3.5 Syntax

The following analyses test these hypotheses:

- <b> 4.1 </b> The uploaded speeches are less complex.
    - <b> 4.1.1 </b> They use simpler sentence structures.
    - <b> 4.1.2 </b> They use fewer borrowed words.
- <b> 4.2 </b> More active sentence structures are used in the uploaded speeches.
- <b> 4.3 </b> There are more imperatives and calls to action in the uploaded speeches.
- <b> 4.4 </b> There are more rhetorical questions in the uploaded speeches.
- <b> 4.5 </b> The uploaded speeches contain more repetitions.
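
As a first approximation, some of these hypotheses can be approached with surface proxies such as the mean sentence length and the share of sentences ending in a question mark. The sketch below uses a naive regular expression for sentence splitting, which breaks on abbreviations and ordinals like "20."; a proper sentence segmenter would be needed for real data. A question mark also does not tell whether a question is rhetorical, so the share is only an upper bound.

```python
import re


def syntax_proxies(text):
    """Naive surface measures: sentence count, mean sentence length in words,
    and share of sentences that end with a question mark."""
    sentences = [s.strip() for s in re.findall(r"[^.!?]+[.!?]", text)]
    if not sentences:
        return {"n_sentences": 0, "mean_sentence_length": 0.0, "question_share": 0.0}
    lengths = [len(sentence.split()) for sentence in sentences]
    return {
        "n_sentences": len(sentences),
        "mean_sentence_length": sum(lengths) / len(sentences),
        "question_share": sum(s.endswith("?") for s in sentences) / len(sentences),
    }


# Toy text (invented, not taken from the corpus)
proxies = syntax_proxies("Wer bezahlt das? Die Bürger bezahlen das. Wer sonst?")
print(proxies)
```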

In [None]:
# @TODO