## 3.3 Data Exploration

Based on the data retrieved in the last two sections, we explore the tweets and speeches of the politicians.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from wordcloud import WordCloud
from collections import Counter
from tqdm.notebook import tqdm

pd.options.mode.chained_assignment = None  # default='warn'

tqdm.pandas()

### 3.3.1 Tweets exploration

#### Import data

In [None]:
# Load tweets data
tweets_scraped = pd.read_csv("../data/raw/tweets_scraped.csv", low_memory=False)

#### Check data

In [None]:
tweets_scraped.head()

In [None]:
tweets_scraped.tail()

In [None]:
tweets_scraped.info()

In [None]:
tweets_scraped.describe()

#### Drop missing data

We can drop all records with missing data, as we cannot use these records for our analysis.

In [None]:
# Drop missing data
tweets_scraped.dropna(inplace = True)
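
Before dropping records, it can help to see how many values are missing per column, so that the number of discarded rows is known. The following is a minimal sketch on a toy frame; the column names `username` and `text` are borrowed from the notebook, while the values and the names `toy_tweets` and `toy_clean` are made up for illustration.

```python
import pandas as pd

# Toy stand-in for tweets_scraped; the values are invented for illustration
toy_tweets = pd.DataFrame({
    "username": ["c_lindner", "jensspahn", None, "GregorGysi"],
    "text": ["Guten Morgen", None, "Hallo", "Danke"],
})

# Count missing values per column before dropping anything
missing_per_column = toy_tweets.isna().sum()

# Drop incomplete records and report how many rows were lost
rows_before = len(toy_tweets)
toy_clean = toy_tweets.dropna()
rows_dropped = rows_before - len(toy_clean)

print(missing_per_column)
print(f"Dropped {rows_dropped} of {rows_before} rows")
```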

#### Clean names

For better comparability, we harmonize the names in the tweets and speeches data.

In [None]:
# Create twitter username to real name dictionary
usernames_to_fullname = {'rbrinkhaus': 'Ralph Brinkhaus', 'groehe': 'Hermann Gröhe', 
                         'NadineSchoen': 'Nadine Schön', 'n_roettgen': 'Norbert Röttgen',
                         'peteraltmaier': 'Peter Altmaier', 'jensspahn': 'Jens Spahn', 
                         'MatthiasHauer': 'Matthias Hauer', 'c_lindner': 'Christian Lindner',
                         'MarcoBuschmann': 'Marco Buschmann', 'starkwatzinger': 'Bettina Stark-Watzinger',
                         'Lambsdorff': 'Alexander Graf Lambsdorff', 'johannesvogel': 'Johannes Vogel',
                         'KonstantinKuhle': 'Konstantin Kuhle', 'MAStrackZi': 'Marie-Agnes Strack-Zimmermann',
                         'larsklingbeil': 'Lars Klingbeil', 'EskenSaskia': 'Saskia Esken',
                         'hubertus_heil': 'Hubertus Heil', 'HeikoMaas': 'Heiko Maas',
                         'MartinSchulz': 'Martin Schulz', 'KarambaDiaby': 'Karamba Diaby',
                         'Karl_Lauterbach': 'Karl Lauterbach', 'SteffiLemke': 'Steffi Lemke',
                         'cem_oezdemir': 'Cem Özdemir', 'GoeringEckardt': 'Katrin Göring-Eckardt',
                         'KonstantinNotz': 'Konstantin von Notz', '6': 'Konstantin von Notz',
                         'BriHasselmann': 'Britta Haßelmann', 'svenlehmann': 'Sven Lehmann',
                         'ABaerbock': 'Annalena Baerbock', 'ABaerbockArchiv': 'Annalena Baerbock',
                         'SWagenknecht': 'Sahra Wagenknecht', 'b_riexinger': 'Bernd Riexinger',
                         'NiemaMovassat': 'Niema Movassat', 'jankortemdb': 'Jan Korte',
                         'DietmarBartsch': 'Dietmar Bartsch', 'GregorGysi': 'Gregor Gysi',
                         'SevimDagdelen': 'Sevim Dağdelen', 'Alice_Weidel': 'Alice Weidel',
                         'Beatrix_vStorch': 'Beatrix von Storch', 'JoanaCotar': 'Joana Cotar',
                         'StBrandner': 'Stephan Brandner', 'Tino_Chrupalla': 'Tino Chrupalla',
                         'GtzFrmming': 'Götz Frömming', '3': 'Götz Frömming', 'Leif_Erik_Holm': 'Leif-Erik Holm'}

In [None]:
# Add full name
tweets_scraped["full_name"] = tweets_scraped.username.replace(usernames_to_fullname)
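
Because `Series.replace` leaves values without a dictionary entry unchanged, an unmapped username would silently end up in `full_name`. One possible check, sketched here under the assumption that every username should have an entry, is to run `Series.map` on the same dictionary and look for missing results. The mapping excerpt is taken from the dictionary above; `unknown_handle` is a made-up username.

```python
import pandas as pd

# Small excerpt of the mapping; 'unknown_handle' is a made-up username
toy_mapping = {"c_lindner": "Christian Lindner", "jensspahn": "Jens Spahn"}
toy_usernames = pd.Series(["c_lindner", "jensspahn", "unknown_handle"])

# Series.replace keeps values that have no entry in the mapping
replaced = toy_usernames.replace(toy_mapping)

# Series.map turns them into NaN instead, which makes gaps easy to find
mapped = toy_usernames.map(toy_mapping)
unmapped_usernames = toy_usernames[mapped.isna()].unique().tolist()

print(unmapped_usernames)
```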

#### Check time data

In [None]:
# Add normalized date
tweets_scraped["date"] = pd.to_datetime(tweets_scraped["datetime"], format = "%Y-%m-%d").dt.date

In [None]:
tweets_scraped.date.min()

In [None]:
tweets_scraped.date.max()

In [None]:
# Tweet number per time
tweets_scraped.groupby('date')['tweet_id'].size().plot()
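
Daily counts tend to be noisy, so a coarser aggregation can make a trend easier to read. The sketch below counts tweets per week with `resample`; it assumes the dates are stored as pandas datetimes (not `datetime.date` objects), and the toy dates and IDs are invented.

```python
import pandas as pd

# Toy data: six made-up tweets spread over three calendar weeks
toy = pd.DataFrame({
    "date": pd.to_datetime([
        "2020-01-06", "2020-01-07", "2020-01-08",
        "2020-01-14", "2020-01-20", "2020-01-21",
    ]),
    "tweet_id": [1, 2, 3, 4, 5, 6],
})

# Count tweets per week (weeks end on Sunday with the default "W" rule)
weekly_counts = toy.set_index("date")["tweet_id"].resample("W").count()

print(weekly_counts)
```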

We can now drop all tweets that fall outside the period covered by the speeches dataset.

In [None]:
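# Drop unneeded data: keep 24 Oct 2017 through 7 May 2021.
# ISO date strings avoid the month-first parsing of "07.05.2021";
# .date() matches the datetime.date objects stored in the date column.
start_date = pd.Timestamp("2017-10-24").date()
end_date = pd.Timestamp("2021-05-07").date()
tweets_subset = tweets_scraped[(tweets_scraped.date >= start_date) & (tweets_scraped.date <= end_date)].copy()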
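
Date strings in day.month.year notation such as "07.05.2021" are ambiguous, because pandas reads ambiguous strings month-first by default. A minimal sketch of an unambiguous alternative is to pass an explicit `format` and compare against `datetime.date` objects, as stored in the `date` column. The toy dates are made up for illustration.

```python
import pandas as pd

# Parse day.month.year strings with an explicit format
start_date = pd.to_datetime("24.10.2017", format="%d.%m.%Y").date()
end_date = pd.to_datetime("07.05.2021", format="%d.%m.%Y").date()

# Toy dates (invented), stored as datetime.date objects like the date column
toy = pd.DataFrame({
    "date": pd.to_datetime(
        ["2017-10-23", "2019-01-15", "2021-05-07", "2021-07-05"]
    ).date
})

# Keep rows inside the inclusive range
in_range = toy[(toy["date"] >= start_date) & (toy["date"] <= end_date)]

print(in_range)
```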

#### Check party distribution

The number of tweets differs between parties, but not by enough to significantly alter our results.

In [None]:
# Tweets per party
tweets_subset.groupby("party").size()

#### Check politician distribution

We see significant differences in the number of tweets per politician, ranging from 658 to 29665. We have to consider this in our work.

In [None]:
# Tweets per politician
tweets_scraped.groupby('full_name')['tweet_id'].size().sort_values().plot(kind='bar')
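
To put a number on this imbalance, one option is to look at each politician's share of all tweets and at the ratio between the most and least active account. The sketch below uses invented counts and placeholder names; it only illustrates the calculation and is not derived from the actual data.

```python
import pandas as pd

# Toy data with placeholder names and invented tweet counts
toy = pd.DataFrame({
    "full_name": ["Politician A"] * 6 + ["Politician B"] * 3 + ["Politician C"]
})

# Absolute and relative number of tweets per politician
counts = toy["full_name"].value_counts()
shares = toy["full_name"].value_counts(normalize=True)

# Ratio between the most and the least active politician
imbalance_ratio = counts.max() / counts.min()

print(shares)
print(imbalance_ratio)
```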

The plot of tweets per day shows a strongly increasing trend. This is caused by two new parties entering the Bundestag in 2017.

#### Check text

We check the texts of the tweets with a word cloud. We can infer the need for data preprocessing from a first analysis of the visualisation. 

In [None]:
# Create a word cloud
long_string_tweets = ' '.join(tweets_scraped["text"].tolist())
wordcloud_tweets = WordCloud(background_color="white", max_words=5000, contour_width=3, contour_color='steelblue')
wordcloud_tweets.generate(long_string_tweets)
wordcloud_tweets.to_image()

In [None]:
# Create a counter object
counter_tweets = Counter(long_string_tweets.split())

In [None]:
# Check the most common words
counter_tweets.most_common(10)

The most common words show that stopword removal is needed.
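
The effect of stopword removal on the word counts can be sketched with a `Counter` and a small stopword set. The sentence and the three-word stopword list below are made up for illustration; a real preprocessing step would presumably use a complete German stopword list.

```python
from collections import Counter

# Made-up example sentence and a tiny hypothetical stopword list
toy_text = "die Regierung und die Opposition und der Bundestag"
toy_stopwords = {"die", "der", "und"}

# Lowercase and split on whitespace, as in the counter above
tokens = toy_text.lower().split()

# Count all tokens, then only those that are not stopwords
counter_all = Counter(tokens)
counter_filtered = Counter(t for t in tokens if t not in toy_stopwords)

print(counter_all.most_common(2))
print(counter_filtered.most_common(3))
```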

#### Drop unneeded columns

In [None]:
# Drop unneeded columns
tweets_subset.drop(['datetime', 'tweet_id', 'username','name', 'reply_count'], axis = 1, inplace = True)

#### Export data

In [None]:
tweets_subset.to_csv("../data/interim/tweets_explored.csv", index = False)

### 3.3.2 Explore speeches of politicians

#### Import data

In [None]:
# Load speeches data
speeches_retrieved = pd.read_csv("../data/raw/speeches_retrieved.csv", low_memory=False)

#### Check data

In [None]:
speeches_retrieved.head()

In [None]:
speeches_retrieved.tail()

In [None]:
speeches_retrieved.info()

In [None]:
speeches_retrieved.describe()

#### Drop missing data

We can drop all records with missing speech content, as we cannot use these records for our analysis.

In [None]:
# Drop missing data
speeches_retrieved.dropna(subset = ["text"], inplace = True)

#### Clean names

For better comparability, we harmonize the names in the tweets and speeches data.

In [None]:
# Add full name of politicians
speeches_retrieved["full_name"] = speeches_retrieved["first_name"] + " " + speeches_retrieved["last_name"]

In [None]:
# Subset to the selected politicians
# .copy() gives an independent frame, so later column assignments are unambiguous
speeches_subset = speeches_retrieved[speeches_retrieved.full_name.isin(tweets_subset.full_name.unique())].copy()
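
Filtering with `isin` only keeps exact name matches, so a politician whose name is spelled differently in the speeches data would drop out unnoticed. A simple check, sketched here with toy data, is to compare the two sets of names. The spelling "Alexander Lambsdorff" is a hypothetical mismatch and is not taken from the dataset.

```python
import pandas as pd

# Names selected from the tweets data (excerpt)
toy_selected = pd.Series(
    ["Jens Spahn", "Gregor Gysi", "Alexander Graf Lambsdorff"]
)

# Toy speeches frame; "Alexander Lambsdorff" is a hypothetical spelling
toy_speeches = pd.DataFrame({
    "full_name": [
        "Jens Spahn", "Gregor Gysi", "Alexander Lambsdorff", "Jens Spahn"
    ]
})

# Selected politicians without any exactly matching speech record
missing_names = sorted(set(toy_selected) - set(toy_speeches["full_name"]))

print(missing_names)
```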

#### Check time data

In [None]:
# Add normalized date
speeches_subset["date"] = pd.to_datetime(speeches_subset["date"], format = "%Y-%m-%d").dt.date

In [None]:
speeches_subset.date.min()

In [None]:
speeches_subset.date.max()

In [None]:
# Speech number per time
speeches_subset.groupby('date')['id'].size().plot()

#### Check party distribution

The number of speeches differs between parties, but not by enough to significantly alter our results.

In [None]:
fullname_to_party = {'Ralph Brinkhaus': 'CDU', 'Hermann Gröhe': 'CDU', 'Nadine Schön': 'CDU', 
                     'Norbert Röttgen': 'CDU', 'Peter Altmaier': 'CDU', 'Jens Spahn': 'CDU', 
                     'Matthias Hauer': 'CDU', 'Christian Lindner': 'FDP', 'Marco Buschmann': 'FDP',
                     'Bettina Stark-Watzinger': 'FDP', 'Alexander Graf Lambsdorff': 'FDP', 'Johannes Vogel': 'FDP',
                     'Konstantin Kuhle': 'FDP', 'Marie-Agnes Strack-Zimmermann': 'FDP', 'Lars Klingbeil': 'SPD',
                     'Saskia Esken': 'SPD', 'Hubertus Heil': 'SPD', 'Heiko Maas': 'SPD', 'Martin Schulz': 'SPD', 
                     'Karamba Diaby': 'SPD', 'Karl Lauterbach': 'SPD', 'Steffi Lemke': 'Grüne',
                     'Cem Özdemir': 'Grüne', 'Katrin Göring-Eckardt': 'Grüne', 'Konstantin von Notz': 'Grüne',
                     'Britta Haßelmann': 'Grüne', 'Sven Lehmann': 'Grüne', 'Annalena Baerbock': 'Grüne',
                     'Sahra Wagenknecht': 'Linke', 'Bernd Riexinger': 'Linke', 'Niema Movassat': 'Linke', 
                     'Jan Korte': 'Linke', 'Dietmar Bartsch': 'Linke', 'Gregor Gysi': 'Linke', 
                     'Sevim Dağdelen': 'Linke', 'Alice Weidel': 'AFD', 'Beatrix von Storch': 'AFD', 
                     'Joana Cotar': 'AFD', 'Stephan Brandner': 'AFD', 'Tino Chrupalla': 'AFD',
                     'Götz Frömming': 'AFD', 'Leif-Erik Holm': 'AFD'}

In [None]:
speeches_subset["party"] = speeches_subset.full_name.replace(fullname_to_party)

In [None]:
# Speeches per party
speeches_subset.groupby("party").size()

#### Check politician distribution

We see significant differences in the number of speeches per politician, ranging from 15 to 368. We have to consider this in our work.

In [None]:
# Speeches per politician
speeches_subset.groupby('full_name')['id'].size().sort_values().plot(kind='bar')

#### Check text

We check the texts of the speeches with a word cloud. We can infer the need for data preprocessing from a first analysis of the visualisation.

In [None]:
# Create a word cloud
long_string_speeches = ' '.join(speeches_subset["text"].tolist())
wordcloud_speeches = WordCloud(background_color="white", max_words=5000, contour_width=3, 
                               contour_color='steelblue')
wordcloud_speeches.generate(long_string_speeches)
wordcloud_speeches.to_image()

In [None]:
# Create a counter object
speeches_counter = Counter(long_string_speeches.split())

In [None]:
# Check the most common words
speeches_counter.most_common(10)

The most common words show that stopword removal is needed.

#### Drop unneeded columns

In [None]:
# Drop unneeded columns
speeches_subset.drop(['id', 'session', 'electoral_term', 'first_name', 'last_name', 'politician_id',
                      'fraction_id', 'document_url', 'position_short', 'position_long', 'search_speech_content'],
                     axis = 1, inplace = True)

#### Export data

In [None]:
speeches_subset.to_csv("../data/interim/speeches_explored.csv", index = False)