# Downloading data from Twitter

Note that the functions used for downloading are imported from our scripts 'tweepy_helpers' and 'got3_helpers'.

In [4]:
import os
import pandas as pd
import pickle
import tweepy_helpers as th
import got3_helpers as got3

In [5]:
# set up working directory
os.path.abspath(os.getcwd()) # initial working directory (should be equal to source file directory if using Jupyter Notebook)
os.chdir('../../data/web_scraping') # change to directory where all data files are stored
# check working directory
os.path.abspath(os.getcwd())

'/Users/patrickschulze/Desktop/Consulting/Bundestag-MP-Analyse/data/web_scraping'

## 1. Data Import

In [6]:
# import Bundestag data
with open('abg_df.pickle', 'rb') as handle:
    bt_data = pickle.load(handle)


In [7]:
bt_data

Unnamed: 0,Name,Partei,Wahlart,Bundesland,Wahlkreis,Wahlkreis-Nr.,Ausschuesse,Soziale Medien,Biografie,Twitter
0,"Abercron, Dr. Michael von",CDU/CSU,Direkt gewählt,Schleswig-Holstein,Pinneberg,7,{'Ordentliches Mitglied': ['Ausschuss für Ernä...,{'von-abercron.de/': 'http://www.von-abercron....,Geboren am 17. November 1952 in Ehlers...,mvabercron
1,"Achelwilm, Doris",Die Linke,Gewählt über Landesliste,Bremen,,,{'Ordentliches Mitglied': ['Ausschuss für Fami...,{'doris-achelwilm.de': 'http://www.doris-achel...,Geboren am 30. November 1976 in Thuine...,DorisAchelwilm
2,"Aggelidis, Grigorios",FDP,Gewählt über Landesliste,Niedersachsen,Hannover-Land I,43,{'Ordentliches Mitglied': ['Kuratorium der Bun...,{'grigorios-aggelidis.de': 'http://www.grigori...,Geboren am 19. August 1965 in Hannover...,aggelidis_fdp
3,"Akbulut, Gökay",Die Linke,Gewählt über Landesliste,Baden-Württemberg,Mannheim,275,"{'Ordentliches Mitglied': ['Schriftführer/in',...",{'goekay-akbulut.de': 'https://goekay-akbulut....,Geboren 1982 in Pinarbasi/ Türkei; ledig.Juni ...,akbulutgokay
4,"Albani, Stephan",CDU/CSU,Gewählt über Landesliste,Niedersachsen,Oldenburg – Ammerland,27,{'Ordentliches Mitglied': ['Ausschuss für Bild...,{'stephan-albani.de': 'http://www.stephan-alba...,Geboren am 3. Juni 1968 in Göttingen; verheira...,
...,...,...,...,...,...,...,...,...,...,...
725,"Zierke, Stefan",SPD,Gewählt über Landesliste,Brandenburg,Uckermark – Barnim I,57,{'Parlamentarischer Staatssekretär bei der Bun...,{'stefan-zierke.de': 'http://www.stefan-zierke...,Geboren am 5. Dezember 1970 in Prenzlau (Brand...,zierke
726,"Zimmer, Prof. Dr. Matthias",CDU/CSU,Direkt gewählt,Hessen,Frankfurt am Main I,182,{'Obmann': ['Ausschuss für Arbeit und Soziales...,{'matthias-zimmer.de': 'http://www.matthias-zi...,Geboren am 3. Mai 1961 in Marburg/Lahn; verhei...,matthiaszimmer
727,"Zimmermann, Dr. Jens",SPD,Gewählt über Landesliste,Hessen,Odenwald,187,"{'Obmann': ['Ausschuss Digitale Agenda'], 'Ord...",{'jens-zimmermann.org': 'http://www.jens-zimme...,Geboren am 9. September 1981 in Groß-U...,JensZimmermann1
728,"Zimmermann, Pia",Die Linke,Gewählt über Landesliste,Niedersachsen,Helmstedt – Wolfsburg,51,{'Ordentliches Mitglied': ['Ausschuss für Gesu...,{'pia-zimmermann.de': 'http://www.pia-zimmerma...,Geboren am 17. September 1956 in Braunschweig;...,


In [8]:
# select name and username for each member and store in table called twitter_account
twitter_account = bt_data[['Name', 'Twitter']].rename(
    columns = {'Name': 'name', 'Twitter': 'username'})

In [9]:
twitter_account

Unnamed: 0,name,username
0,"Abercron, Dr. Michael von",mvabercron
1,"Achelwilm, Doris",DorisAchelwilm
2,"Aggelidis, Grigorios",aggelidis_fdp
3,"Akbulut, Gökay",akbulutgokay
4,"Albani, Stephan",
...,...,...
725,"Zierke, Stefan",zierke
726,"Zimmer, Prof. Dr. Matthias",matthiaszimmer
727,"Zimmermann, Dr. Jens",JensZimmermann1
728,"Zimmermann, Pia",


In total we have 730 parliamentarians. However, not all of them have a Twitter account.

In [10]:
# drop usernames that are nan or empty (i.e. parliamentarians with no account)
usr_nan = twitter_account.username.isna()
usr_empty = twitter_account.username == ''
twitter_account = twitter_account[~(usr_nan | usr_empty)]
twitter_account.reset_index(drop = True, inplace = True)

In [11]:
twitter_account

Unnamed: 0,name,username
0,"Abercron, Dr. Michael von",mvabercron
1,"Achelwilm, Doris",DorisAchelwilm
2,"Aggelidis, Grigorios",aggelidis_fdp
3,"Akbulut, Gökay",akbulutgokay
4,"Alt, Renata",RenataAlt_MdB
...,...,...
508,"Zdebel, Hubertus",ZdebelHubertus
509,"Ziemiak, Paul",PaulZiemiak
510,"Zierke, Stefan",zierke
511,"Zimmer, Prof. Dr. Matthias",matthiaszimmer


Of the 730 parliamentarians, we were able to obtain a Twitter account for 513.
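
The filter needs both checks because a missing username can be stored either as NaN or as an empty string. A minimal sketch on made-up data (the names and usernames below are invented for illustration and are not taken from the Bundestag table):

```python
import pandas as pd

# Toy table: one NaN username and one empty-string username.
toy = pd.DataFrame({
    "name": ["Member A", "Member B", "Member C", "Member D"],
    "username": ["account_a", float("nan"), "", "account_d"],
})

# Keep only rows whose username is neither NaN nor an empty string.
has_account = toy["username"].notna() & (toy["username"] != "")
filtered = toy[has_account].reset_index(drop=True)

print(f"{len(filtered)} of {len(toy)} members have a username")
```

Checking only `isna()` would keep "Member C", and checking only for the empty string would keep "Member B".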

## 2. Download with GetOldTweets3

GetOldTweets3 is an unofficial Python module that can be used to scrape tweets and other information from Twitter. While the official Twitter API, which we access through Tweepy, returns at most the 3200 most recent tweets per user, GetOldTweets3 has no such limit on the number of tweets that can be downloaded for a given user.
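
The helper `got3.download_tweets_got3` itself is not shown in this notebook. The sketch below is only a guess at how such a helper could be structured. It assumes that the helper builds a GetOldTweets3 `TweetCriteria` query and copies the tweet attributes into rows. The names `FIELDS`, `fetch_with_got3`, `download_tweets_sketch` and `fake_fetch` are hypothetical. The field list is taken from the columns of the result tables in this notebook.

```python
from types import SimpleNamespace

# Column names as they appear in the result tables of this notebook; we assume
# they mirror the attributes of the tweet objects returned by GetOldTweets3.
FIELDS = ["username", "to", "text", "retweets", "favorites", "replies", "id",
          "permalink", "author_id", "date", "formatted_date", "hashtags",
          "mentions", "geo", "urls"]


def fetch_with_got3(username, since, until):
    """Query Twitter through GetOldTweets3 (needs the package and network access)."""
    import GetOldTweets3 as got  # imported lazily so the offline demo below still runs

    criteria = (got.manager.TweetCriteria()
                .setUsername(username)
                .setSince(since)
                .setUntil(until))
    return got.manager.TweetManager.getTweets(criteria)


def download_tweets_sketch(username, since, until, fetch=fetch_with_got3):
    """Hypothetical stand-in for got3.download_tweets_got3: one dict per tweet."""
    print("Downloading for", username)
    tweets = fetch(username, since, until)
    # Attributes that a tweet object does not carry are filled with None.
    return [{field: getattr(tweet, field, None) for field in FIELDS}
            for tweet in tweets]


# Offline demo with a fake fetcher, so no request is sent to Twitter.
def fake_fetch(username, since, until):
    return [SimpleNamespace(username=username, text="hello", retweets=0)]


rows = download_tweets_sketch("some_user", "2020-04-05", "2020-04-08",
                              fetch=fake_fetch)
print(rows[0]["text"])
```

Passing the list of dicts to `pd.DataFrame(rows)` would give a table with one row per tweet and one column per field.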

In [12]:
# download tweets using GetOldTweets3 for specified time period
# in this demo only for first 40 MPs...
frames = [got3.download_tweets_got3(username, since = "2020-04-05", until = "2020-04-08")
          for username in twitter_account['username'].iloc[:40]]
# concatenate once instead of growing the DataFrame inside the loop
res_got3 = pd.concat(frames)

Downloading for mvabercron
Downloading for DorisAchelwilm


In [13]:
# add 'name' column (download only uses 'username' as input)
res_got3 = twitter_account.merge(res_got3, on = 'username')
# display results
res_got3

Unnamed: 0,name,username,to,text,retweets,favorites,replies,id,permalink,author_id,date,formatted_date,hashtags,mentions,geo,urls
0,"Achelwilm, Doris",DorisAchelwilm,,#Weltgesundheitstag 2020: Gesundheit ist kein ...,8,41,4,1247472932466688000,https://twitter.com/DorisAchelwilm/status/1247...,4819478705,2020-04-07 10:34:58+00:00,Tue Apr 07 10:34:58 +0000 2020,#Weltgesundheitstag,,,
1,"Achelwilm, Doris",DorisAchelwilm,ndaktuell,Queerpolitik hat unter #Corona keinen leichten...,15,30,4,1247448540533710848,https://twitter.com/DorisAchelwilm/status/1247...,4819478705,2020-04-07 08:58:03+00:00,Tue Apr 07 08:58:03 +0000 2020,#Corona #Hatespeech #LGBT #Trump #Orban,,,https://twitter.com/ndaktuell/status/124723009...
2,"Achelwilm, Doris",DorisAchelwilm,_hexiklexi,Dieser Tarifabschluss wird aus Pflegeversicher...,0,0,1,1247210632635703298,https://twitter.com/DorisAchelwilm/status/1247...,4819478705,2020-04-06 17:12:41+00:00,Mon Apr 06 17:12:41 +0000 2020,,,,
3,"Achelwilm, Doris",DorisAchelwilm,redheadhb2,"Gut, dass Du es sagst. Nach der langen Ausbild...",0,2,0,1247207997664878592,https://twitter.com/DorisAchelwilm/status/1247...,4819478705,2020-04-06 17:02:13+00:00,Mon Apr 06 17:02:13 +0000 2020,,,,
4,"Achelwilm, Doris",DorisAchelwilm,,Geht doch: #Altenpflegekräfte bekommen im Juli...,4,21,4,1247203823438958594,https://twitter.com/DorisAchelwilm/status/1247...,4819478705,2020-04-06 16:45:38+00:00,Mon Apr 06 16:45:38 +0000 2020,#Altenpflegekr #Krankenhauspersonal #Systemrel...,,,https://www.verdi.de/presse/pressemitteilungen...
5,"Achelwilm, Doris",DorisAchelwilm,,"Damit mediale Infrastruktur, die auf (massiv w...",2,6,1,1246759944877150210,https://twitter.com/DorisAchelwilm/status/1246...,4819478705,2020-04-05 11:21:49+00:00,Sun Apr 05 11:21:49 +0000 2020,#Journalist,@rbbinforadio,,https://www.inforadio.de/programm/schema/sendu...


We can check that it is indeed possible to download more than 3200 tweets per user:

In [10]:
res_trump = got3.download_tweets_got3('realDonaldTrump', since = "2018-09-24", until = "2020-04-08")

Downloading for realDonaldTrump


In [11]:
res_trump.shape

(6731, 15)

However, a small number of tweets appear to be missing, and some rows are empty. Furthermore, retweets cannot be downloaded using GetOldTweets3.
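
Empty rows can be flagged before any further analysis. The following is a minimal sketch on made-up data. It assumes that an "empty" row is one in which every content column is missing, which may not match how the empty rows actually look in the scraped data.

```python
import pandas as pd

# Toy download result: the second row carries no tweet content at all.
toy = pd.DataFrame({
    "username": ["user_a", "user_a", "user_a"],
    "text": ["first tweet", None, "third tweet"],
    "permalink": ["https://example.org/1", None, "https://example.org/3"],
})

# Assumption: a row counts as empty if all content columns are missing.
content_cols = ["text", "permalink"]
is_empty = toy[content_cols].isna().all(axis=1)

print("empty rows:", int(is_empty.sum()))
cleaned = toy[~is_empty].reset_index(drop=True)
```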

## 3. Download with Tweepy

With Tweepy we can circumvent these shortcomings: retweets can be downloaded and no information is missing, since Tweepy is a Python client for the official Twitter API. However, as mentioned, there is a limit of 3200 tweets per user.

In [15]:
# download most recent tweets using tweepy (at most 3200 tweets per user)
# again, for demonstration purposes we only download for the first 40 MPs
frames = [th.download_tweets_tweepy(username)
          for username in twitter_account['username'].iloc[:40]]
res_tweepy = pd.concat(frames)
# add column 'name'
res_tweepy = twitter_account.merge(res_tweepy, on = 'username')

Note that we wrote the function 'download_tweets_tweepy' in the script 'tweepy_helpers.py' in order to generate and select columns that are most important for our purposes:

In [22]:
res_tweepy.columns

Index(['name', 'username', 'available', 'created_at', 'full_text',
       'in_reply_to_user_id_str', 'is_quote_status', 'retweet_count',
       'favorite_count', 'favorited', 'retweeted_status', 'is_retweet',
       'retweet_full_text', 'followers_count', 'location', 'hashtags'],
      dtype='object')

For instance, we generated a column 'available'; if this field is False, then tweets for the respective person cannot be downloaded using tweepy (e.g., because the person has not tweeted anything yet):

In [23]:
res_tweepy['available']

0       True
1       True
2       True
3       True
4       True
        ... 
3980    True
3981    True
3982    True
3983    True
3984    True
Name: available, Length: 3985, dtype: bool
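
How 'available' is computed inside 'tweepy_helpers.py' is not shown here. One plausible way to produce such a flag is sketched below. The function `rows_with_availability` and its `fetch` argument are hypothetical. The sketch assumes that an account counts as unavailable both when the download raises an error and when it returns no tweets.

```python
def rows_with_availability(username, fetch):
    """Return one dict per tweet, or a single placeholder row if nothing is available."""
    try:
        tweets = list(fetch(username))
    except Exception as error:  # e.g. protected, suspended or unknown account
        print(f"Could not download for {username}: {error}")
        tweets = []
    if not tweets:
        # Keep the user in the result so the missing data stays visible.
        return [{"username": username, "available": False}]
    return [{"username": username, "available": True, **tweet} for tweet in tweets]


# Fake fetchers standing in for the real download, so the demo runs offline.
def fetch_nothing(username):
    return []


def fetch_one(username):
    return [{"full_text": "hello"}]


print(rows_with_availability("silent_user", fetch_nothing))
print(rows_with_availability("active_user", fetch_one))
```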

Other columns we have created: 'is_retweet', 'retweet_full_text' and 'hashtags'.

With 'is_retweet' we can easily check whether a tweet is a retweet (note that we have used this field to exclude all retweets from our further analysis):

In [26]:
res_tweepy['is_retweet']

0       False
1       False
2       False
3       False
4       False
        ...  
3980    False
3981    False
3982    False
3983    False
3984    False
Name: is_retweet, Length: 3985, dtype: bool

Using 'retweet_full_text' it is possible to extract the text of a retweet. Furthermore, 'hashtags' returns all hashtags included in a tweet as a list.
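
The three derived columns can be illustrated on the raw tweet dictionaries of the Twitter API. This is a sketch, not the code in 'tweepy_helpers.py'. The function `derive_columns` is hypothetical, and the example tweets are invented. It assumes tweets were requested in extended mode, so that the text is stored under 'full_text' and a retweet carries the original tweet under 'retweeted_status'.

```python
def derive_columns(status):
    """Derive the three custom columns from a tweet dict (Twitter API v1.1 layout)."""
    retweeted = status.get("retweeted_status")
    hashtags = [tag["text"] for tag in status.get("entities", {}).get("hashtags", [])]
    return {
        "is_retweet": retweeted is not None,
        "retweet_full_text": retweeted.get("full_text") if retweeted else None,
        "hashtags": hashtags,
    }


# Invented example tweets: an original tweet and a retweet of it.
original = {
    "full_text": "Budget debate today #Bundestag",
    "entities": {"hashtags": [{"text": "Bundestag"}]},
}
retweet = {
    "full_text": "RT @someone: Budget debate today",
    "entities": {"hashtags": []},
    "retweeted_status": original,
}

print(derive_columns(original))
print(derive_columns(retweet))
```

Filtering on `is_retweet` then removes retweets, while 'retweet_full_text' keeps the untruncated text of the retweeted tweet.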

The remaining columns are standard fields of the tweet and user objects returned by the Twitter API and were not created by us; they are described in the Twitter API documentation.