# Downloading data from Twitter

Note that the functions used for preprocessing and downloading are imported from our script 'twitter_helpers'.

In [1]:
import pandas as pd
import pickle
import twitter_helpers as th

# set local working directory
# import os
# os.chdir('/Users/patrickschulze/Desktop/Consulting/Bundestag-MP-Analyse/')

## 1. Data Import and Preprocessing

In [2]:
# import Bundestag data
with open('abg_df.pickle', 'rb') as handle:
    bt_data = pickle.load(handle)


In [3]:
bt_data

| | Name | Partei | Wahlart | Bundesland | Wahlkreis | Ausschuesse | Soziale Medien | Biografie | Twitter | Twitter_right |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Abercron, Dr. Michael von | CDU/CSU | Direkt gewählt | Schleswig-Holstein | Wahlkreis 007: Pinneberg | {'Ordentliches Mitglied': ['Ausschuss für Ernä... | {'von-abercron.de/': 'http://www.von-abercron.... | Geboren am 17. November 1952 in Ehlers... | mvabercron | https://twitter.com/mvabercron |
| 1 | Achelwilm, Doris | Die Linke | Gewählt über Landesliste | Bremen | n.a. | {'Ordentliches Mitglied': ['Ausschuss für Fami... | {'doris-achelwilm.de': 'http://www.doris-achel... | Geboren am 30. November 1976 in Thuine... | doris_achelwilm | https://twitter.com/doris_achelwilm |
| 2 | Aggelidis, Grigorios | FDP | Gewählt über Landesliste | Niedersachsen | Wahlkreis 043: Hannover-Land I | {'Ordentliches Mitglied': ['Kuratorium der Bun... | {'grigorios-aggelidis.de': 'http://www.grigori... | Geboren am 19. August 1965 in Hannover... | Aggelidis_FDP | http://www.twitter.com/Aggelidis_FDP |
| 3 | Akbulut, Gökay | Die Linke | Gewählt über Landesliste | Baden-Württemberg | Wahlkreis 275: Mannheim | {'Ordentliches Mitglied': ['Schriftführer/in',... | {'goekay-akbulut.de': 'https://goekay-akbulut.... | Geboren 1982 in Pinarbasi/ Türkei; ledig.Juni ... | akbulutgokay | https://twitter.com/akbulutgokay |
| 4 | Albani, Stephan | CDU/CSU | Gewählt über Landesliste | Niedersachsen | Wahlkreis 027: Oldenburg – Ammerland | {'Ordentliches Mitglied': ['Ausschuss für Bild... | {'stephan-albani.de': 'http://www.stephan-alba... | Geboren am 3. Juni 1968 in Göttingen; verheira... | | |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 725 | Zierke, Stefan | SPD | Gewählt über Landesliste | Brandenburg | Wahlkreis 057: Uckermark – Barnim I | {'Parlamentarischer Staatssekretär bei der Bun... | {'stefan-zierke.de': 'http://www.stefan-zierke... | Geboren am 5. Dezember 1970 in Prenzlau (Brand... | zierke | http://twitter.com/zierke |
| 726 | Zimmer, Prof. Dr. Matthias | CDU/CSU | Direkt gewählt | Hessen | Wahlkreis 182: Frankfurt am Main I | {'Obmann': ['Ausschuss für Arbeit und Soziales... | {'matthias-zimmer.de': 'http://www.matthias-zi... | Geboren am 3. Mai 1961 in Marburg/Lahn; verhei... | matthiaszimmer | https://twitter.com/matthiaszimmer |
| 727 | Zimmermann, Dr. Jens | SPD | Gewählt über Landesliste | Hessen | Wahlkreis 187: Odenwald | {'Obmann': ['Ausschuss Digitale Agenda'], 'Ord... | {'jens-zimmermann.org': 'http://www.jens-zimme... | Geboren am 9. September 1981 in Groß-U... | JensZimmermann1 | https://twitter.com/JensZimmermann1 |
| 728 | Zimmermann, Pia | Die Linke | Gewählt über Landesliste | Niedersachsen | Wahlkreis 051: Helmstedt – Wolfsburg | {'Ordentliches Mitglied': ['Ausschuss für Gesu... | {'pia-zimmermann.de': 'http://www.pia-zimmerma... | Geboren am 17. September 1956 in Braunschweig;... | | |


In [5]:
# select name and username for each member and store in table twitter_account
twitter_account = bt_data[['Name', 'Twitter']].rename(
    columns = {'Name': 'name', 'Twitter': 'username'})

In [6]:
twitter_account

| | name | username |
|---|---|---|
| 0 | Abercron, Dr. Michael von | mvabercron |
| 1 | Achelwilm, Doris | doris_achelwilm |
| 2 | Aggelidis, Grigorios | Aggelidis_FDP |
| 3 | Akbulut, Gökay | akbulutgokay |
| 4 | Albani, Stephan | |
| ... | ... | ... |
| 725 | Zierke, Stefan | zierke |
| 726 | Zimmer, Prof. Dr. Matthias | matthiaszimmer |
| 727 | Zimmermann, Dr. Jens | JensZimmermann1 |
| 728 | Zimmermann, Pia | |
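
Some members (for example rows 4 and 728) have no Twitter username. A minimal sketch of filtering out such rows before looping over accounts, assuming that missing usernames are stored as NaN/None or as empty strings. The toy table and the function name `accounts_with_username` are illustrative and not part of 'twitter_helpers':

```python
import pandas as pd

# toy stand-in for twitter_account (same column names as above)
toy_accounts = pd.DataFrame({
    "name": ["Abercron, Dr. Michael von", "Albani, Stephan", "Zierke, Stefan"],
    "username": ["mvabercron", None, "zierke"],
})

def accounts_with_username(accounts):
    """Return only the rows that have a non-empty 'username'."""
    usernames = accounts["username"]
    has_name = usernames.notna() & (usernames.astype(str).str.strip() != "")
    return accounts.loc[has_name].reset_index(drop=True)

valid_accounts = accounts_with_username(toy_accounts)
print(valid_accounts["username"].tolist())
```

Applied to the real table, this would avoid passing a missing username to the download functions.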


## 2. Download with GetOldTweets3

GetOldTweets3 is an unofficial Python module that scrapes tweets and related information from Twitter. The official Twitter API, which we access through the Python library Tweepy, returns at most the 3200 most recent tweets per user, whereas GetOldTweets3 imposes no such limit for a given user.
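
The helper 'th.download_tweets_got3' is not shown in this notebook. The following is only a rough sketch of how such a function could be built on GetOldTweets3, assuming the package is installed and Twitter is reachable. The names `GOT3_FIELDS`, `tweet_to_row` and `download_tweets_got3_sketch` are hypothetical; the field list mirrors the columns of the result table in this section.

```python
from types import SimpleNamespace

# fields that appear as columns in the GetOldTweets3 result table of this notebook
GOT3_FIELDS = ["username", "to", "text", "retweets", "favorites", "replies", "id",
               "permalink", "author_id", "date", "formatted_date", "hashtags",
               "mentions", "geo", "urls"]

def tweet_to_row(tweet):
    """Convert one tweet object into a plain dict (missing attributes become None)."""
    return {field: getattr(tweet, field, None) for field in GOT3_FIELDS}

def download_tweets_got3_sketch(username, since, until):
    """Hypothetical stand-in for th.download_tweets_got3 (needs GetOldTweets3 and network)."""
    import GetOldTweets3 as got  # imported lazily so tweet_to_row works without the package
    criteria = (got.manager.TweetCriteria()
                .setUsername(username)
                .setSince(since)
                .setUntil(until))
    tweets = got.manager.TweetManager.getTweets(criteria)
    return [tweet_to_row(tweet) for tweet in tweets]

# offline check of the conversion step with a fake tweet object
fake_tweet = SimpleNamespace(username="mvabercron", text="Hello", retweets=0)
row = tweet_to_row(fake_tweet)
print(row["username"], row["geo"])
```

The list of dicts returned by the sketch could be passed to `pd.DataFrame` to obtain one row per tweet.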

In [9]:
# download tweets using GetOldTweets3 for specified time period
res_got3 = pd.concat([th.download_tweets_got3(username, since = "2017-09-24", until = "2020-04-08")
                      for username in twitter_account.iloc[0:3, 1]])

Downloading for mvabercron
Downloading for doris_achelwilm
Downloading for Aggelidis_FDP


In [10]:
# add 'name' column (download only uses 'username' as input)
res_got3 = twitter_account.merge(res_got3, on = 'username')
# display results
res_got3

| | name | username | to | text | retweets | favorites | replies | id | permalink | author_id | date | formatted_date | hashtags | mentions | geo | urls |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Abercron, Dr. Michael von | mvabercron | | Uni fällt aus? Keine Angst, eine Pause im Lehr... | 0 | 0 | 1 | 1245689849627258881 | https://twitter.com/mvabercron/status/12456898... | 862747349277450240 | 2020-04-02 12:29:38+00:00 | Thu Apr 02 12:29:38 +0000 2020 | #BAf #Corona | | | https://cducsu.cc/3alyQN7 |
| 1 | Abercron, Dr. Michael von | mvabercron | | Alle Unternehmen können vom #Corona-Sonderprog... | 2 | 1 | 1 | 1245431118968631297 | https://twitter.com/mvabercron/status/12454311... | 862747349277450240 | 2020-04-01 19:21:32+00:00 | Wed Apr 01 19:21:32 +0000 2020 | #Corona #wirhandeln | | | https://cducsu.cc/3alyQN7 |
| 2 | Abercron, Dr. Michael von | mvabercron | | Wir kämpfen um jeden Job – durch Ausweitung d.... | 0 | 0 | 0 | 1244942420841725954 | https://twitter.com/mvabercron/status/12449424... | 862747349277450240 | 2020-03-31 10:59:37+00:00 | Tue Mar 31 10:59:37 +0000 2020 | #wirhandeln | | | https://cducsu.cc/3alyQN7 |
| 3 | Abercron, Dr. Michael von | mvabercron | | Schnelle #Corona-Hilfe für Familien: Für den K... | 0 | 0 | 0 | 1244646897714855939 | https://twitter.com/mvabercron/status/12446468... | 862747349277450240 | 2020-03-30 15:25:19+00:00 | Mon Mar 30 15:25:19 +0000 2020 | #Corona #wirhandeln | | | https://bit.ly/2Utk4OL |
| 4 | Abercron, Dr. Michael von | mvabercron | | Der #Bundestag passt das #Infektionsschutzgese... | 0 | 2 | 0 | 1242874119139590147 | https://twitter.com/mvabercron/status/12428741... | 862747349277450240 | 2020-03-25 18:00:56+00:00 | Wed Mar 25 18:00:56 +0000 2020 | #Bundestag #Infektionsschutzgesetz #Corona | | | |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 126 | Abercron, Dr. Michael von | mvabercron | MdB_Geburtstag | Das wäre uns fast durchgegangen! Herzlichen Gl... | 0 | 1 | 0 | 931228736292302848 | https://twitter.com/mvabercron/status/93122873... | 862747349277450240 | 2017-11-16 18:33:26+00:00 | Thu Nov 16 18:33:26 +0000 2017 | | | | |
| 127 | Abercron, Dr. Michael von | mvabercron | | Vorerst keinen starren Zeitplan bei der Sicher... | 0 | 2 | 0 | 931223604259442688 | https://twitter.com/mvabercron/status/93122360... | 862747349277450240 | 2017-11-16 18:13:03+00:00 | Thu Nov 16 18:13:03 +0000 2017 | | | | |
| 128 | Abercron, Dr. Michael von | mvabercron | Quadrateule | Dass MvA direkt ist, hat nichts mit der aktuel... | 0 | 0 | 0 | 912831519785472001 | https://twitter.com/mvabercron/status/91283151... | 862747349277450240 | 2017-09-27 00:09:28+00:00 | Wed Sep 27 00:09:28 +0000 2017 | | | | |
| 129 | Abercron, Dr. Michael von | mvabercron | | Trotz aller Verdienste, Volker Kauder hatte me... | 0 | 5 | 1 | 912721248769146882 | https://twitter.com/mvabercron/status/91272124... | 862747349277450240 | 2017-09-26 16:51:17+00:00 | Tue Sep 26 16:51:17 +0000 2017 | | @cducsubt | | |


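To check whether more than 3200 tweets per user can be downloaded, we run the download for a high-volume account. Note that the run shown here returned 2100 rows, so this particular output does not exceed the limit: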

In [7]:
res = th.download_tweets_got3('realDonaldTrump', since = "2018-09-24", until = "2020-04-08")

Downloading for realDonaldTrump


In [8]:
res.shape

(2100, 15)

However, some tweets appear to be missing, although this occurs very rarely, and some rows are empty. Furthermore, GetOldTweets3 cannot download retweets.

## 3. Download with Tweepy

Tweepy avoids these shortcomings: it is a Python library that accesses the official Twitter API, so retweets can be downloaded and no information is missing. However, as mentioned, the API returns at most 3200 tweets per user.
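
The helper 'th.download_tweets_tweepy' is not shown here either. A rough sketch of how it might work with Tweepy 3.x is given below. The function names and the credential parameters are hypothetical, and the assumption that each status is flattened with `vars()` is only a guess based on the '_api' and '_json' columns that appear in the result further down.

```python
from types import SimpleNamespace

def statuses_to_records(statuses):
    """Turn status objects into dicts of their attributes (one dict per tweet)."""
    return [dict(vars(status)) for status in statuses]

def download_tweets_tweepy_sketch(username, consumer_key, consumer_secret,
                                  access_token, access_token_secret):
    """Hypothetical stand-in for th.download_tweets_tweepy (needs Tweepy 3.x and API keys)."""
    import tweepy  # imported lazily so statuses_to_records works without the package
    auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_token, access_token_secret)
    api = tweepy.API(auth, wait_on_rate_limit=True)
    # tweet_mode='extended' yields 'full_text' instead of a truncated 'text'
    cursor = tweepy.Cursor(api.user_timeline, screen_name=username,
                           tweet_mode="extended", count=200)
    return statuses_to_records(cursor.items())

# offline check of the flattening step with fake status objects
fake_statuses = [SimpleNamespace(id=1, full_text="first"),
                 SimpleNamespace(id=2, full_text="second")]
records = statuses_to_records(fake_statuses)
print(records)
```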

In [13]:
# download most recent tweets using tweepy (at most 3200 tweets per user)
res_tweepy = pd.DataFrame()
for username in twitter_account.iloc[0:3, 1]:
    res_tweepy = pd.concat([res_tweepy, th.download_tweets_tweepy(username)])
# again, add column 'name'
res_tweepy = twitter_account.merge(res_tweepy, on = 'username')

Downloading for mvabercron
Downloading for doris_achelwilm


TweepError: [{'code': 34, 'message': 'Sorry, that page does not exist.'}]
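
An error for a single account aborts the whole loop. One possible way to keep going is to catch the error per account and record the failure. The sketch below uses a fake download function and a made-up username so that it runs offline; `download_all` and `fake_download` are hypothetical names. With Tweepy 3.x one would catch `tweepy.TweepError` rather than the generic `Exception`.

```python
def download_all(usernames, download_fn):
    """Call download_fn per username; collect results and failures separately."""
    results, failed = {}, {}
    for username in usernames:
        try:
            results[username] = download_fn(username)
        except Exception as err:  # with Tweepy 3.x: except tweepy.TweepError as err
            failed[username] = str(err)
    return results, failed

def fake_download(username):
    """Hypothetical downloader that fails for one made-up account."""
    if username == "no_such_account":
        raise RuntimeError("Sorry, that page does not exist.")
    return [f"tweet by {username}"]

results, failed = download_all(["mvabercron", "no_such_account", "zierke"], fake_download)
print(sorted(results), failed)
```

The accounts listed in `failed` can then be inspected manually, for example to check whether the username stored in the table is outdated.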

In [21]:
res_tweepy = th.download_tweets_tweepy('mvabercron')
# again, add column 'name'
res_tweepy = twitter_account.merge(res_tweepy, on = 'username')

Downloading for mvabercron


In [22]:
res_tweepy.columns

Index(['name', 'username', '_api', '_json', 'created_at', 'id', 'id_str',
       'full_text', 'truncated', 'display_text_range', 'entities',
       'extended_entities', 'source', 'source_url', 'in_reply_to_status_id',
       'in_reply_to_status_id_str', 'in_reply_to_user_id',
       'in_reply_to_user_id_str', 'in_reply_to_screen_name', 'author', 'user',
       'geo', 'coordinates', 'place', 'contributors', 'is_quote_status',
       'retweet_count', 'favorite_count', 'favorited', 'retweeted',
       'possibly_sensitive', 'lang', 'retweeted_status', 'quoted_status_id',
       'quoted_status_id_str', 'quoted_status_permalink', 'quoted_status'],
      dtype='object')

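Columns that might be most important for us (the positional selection in the first cell below picks up unintended columns such as '_api' and 'truncated'; the second cell selects the intended ones):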

In [23]:
res_tweepy.iloc[:,[0,1,2,5,8,10,11,18,26,27,28,29,30,33]].columns

Index(['name', 'username', '_api', 'id', 'truncated', 'entities',
       'extended_entities', 'in_reply_to_screen_name', 'retweet_count',
       'favorite_count', 'favorited', 'retweeted', 'possibly_sensitive',
       'quoted_status_id'],
      dtype='object')

In [26]:
res_tweepy.iloc[:,[0,1,4,7,9,10,17,25,26,27,28,29,32]]

| | name | username | created_at | full_text | display_text_range | entities | in_reply_to_user_id_str | is_quote_status | retweet_count | favorite_count | favorited | retweeted | retweeted_status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Abercron, Dr. Michael von | mvabercron | 2020-04-10 14:10:03 | Dieser Karfreitag wird anders sein als in den ... | [0, 204] | {'hashtags': [], 'symbols': [], 'user_mentions... | | False | 1 | 3 | False | False | |
| 1 | Abercron, Dr. Michael von | mvabercron | 2020-04-02 12:29:38 | Uni fällt aus? Keine Angst, eine Pause im Lehr... | [0, 275] | {'hashtags': [{'text': 'BAföG', 'indices': [82... | | False | 0 | 0 | False | False | |
| 2 | Abercron, Dr. Michael von | mvabercron | 2020-04-01 19:21:32 | Alle Unternehmen können vom #Corona-Sonderprog... | [0, 199] | {'hashtags': [{'text': 'Corona', 'indices': [2... | | False | 2 | 1 | False | False | |
| 3 | Abercron, Dr. Michael von | mvabercron | 2020-03-31 10:59:46 | RT @cducsubt: .@gitta_connemann und @mvabercro... | [0, 140] | {'hashtags': [{'text': 'Corona', 'indices': [5... | | False | 4 | 0 | False | False | `Status(_api=<tweepy.api.API object at 0x11e1fb...` |
| 4 | Abercron, Dr. Michael von | mvabercron | 2020-03-31 10:59:37 | Wir kämpfen um jeden Job – durch Ausweitung d.... | [0, 280] | {'hashtags': [{'text': 'wirhandeln', 'indices'... | | False | 0 | 0 | False | False | |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 182 | Abercron, Dr. Michael von | mvabercron | 2017-11-16 18:33:26 | @MdB_Geburtstag @goekayakbulut Das wäre uns fa... | [31, 107] | {'hashtags': [], 'symbols': [], 'user_mentions... | 2481661214 | False | 0 | 1 | False | False | |
| 183 | Abercron, Dr. Michael von | mvabercron | 2017-11-16 18:13:03 | Vorerst keinen starren Zeitplan bei der Sicher... | [0, 251] | {'hashtags': [], 'symbols': [], 'user_mentions... | | False | 0 | 2 | False | False | |
| 184 | Abercron, Dr. Michael von | mvabercron | 2017-09-27 00:09:28 | @Quadrateule Dass MvA direkt ist, hat nichts m... | [13, 100] | {'hashtags': [], 'symbols': [], 'user_mentions... | 811931923656351744 | False | 0 | 0 | False | False | |
| 185 | Abercron, Dr. Michael von | mvabercron | 2017-09-26 16:51:17 | Trotz aller Verdienste, Volker Kauder hatte me... | [0, 135] | {'hashtags': [], 'symbols': [], 'user_mentions... | | False | 0 | 5 | False | False | |


If 'in_reply_to_user_id_str' is not None, the tweet is a reply, and the value is the ID of the user being replied to. If 'is_quote_status' is True, the tweet is a quote tweet, i.e. a retweet with an added comment.
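
These rules, together with the 'retweeted_status' column, can be combined into a single label per tweet. The following sketch assumes that empty cells are either None or NaN. The function names and the order of the checks (retweet first, then quote, then reply) are illustrative choices, not something prescribed by Twitter or Tweepy.

```python
import math

def is_missing(value):
    """True for None and float NaN, the two ways an empty cell may be represented."""
    return value is None or (isinstance(value, float) and math.isnan(value))

def tweet_kind(row):
    """Classify a tweet as 'retweet', 'quote', 'reply' or 'original'."""
    if not is_missing(row.get("retweeted_status")):
        return "retweet"
    quote_flag = row.get("is_quote_status")
    if not is_missing(quote_flag) and bool(quote_flag):
        return "quote"
    if not is_missing(row.get("in_reply_to_user_id_str")):
        return "reply"
    return "original"

# toy rows that mirror the relevant columns of the table above
reply_row = {"in_reply_to_user_id_str": "2481661214", "is_quote_status": False,
             "retweeted_status": float("nan")}
plain_row = {"in_reply_to_user_id_str": None, "is_quote_status": False,
             "retweeted_status": float("nan")}
print(tweet_kind(reply_row), tweet_kind(plain_row))
```

Because the function only relies on `.get`, it accepts plain dicts as well as DataFrame rows, e.g. via `res_tweepy.apply(tweet_kind, axis = 1)`.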

### Author and User

In [44]:
res_tweepy.columns.get_loc('author')

19

In [45]:
res_tweepy.columns.get_loc('user')

20

In [56]:
res_tweepy.iloc[3,19]

User(_api=<tweepy.api.API object at 0x11e1fbb10>, _json={'id': 862747349277450240, 'id_str': '862747349277450240', 'name': 'Dr. Michael von Abercron MdB', 'screen_name': 'mvabercron', 'location': 'Pinneberg, Deutschland', 'description': 'Direkt gewählter Bundestagsabgeordneter aus dem Wahlkreis Pinneberg | Es schreiben Michael von Abercron und sein Team', 'url': 'https://t.co/5Qqm51N9U8', 'entities': {'url': {'urls': [{'url': 'https://t.co/5Qqm51N9U8', 'expanded_url': 'http://www.von-abercron.de', 'display_url': 'von-abercron.de', 'indices': [0, 23]}]}, 'description': {'urls': []}}, 'protected': False, 'followers_count': 752, 'friends_count': 674, 'listed_count': 46, 'created_at': 'Thu May 11 19:12:51 +0000 2017', 'favourites_count': 162, 'utc_offset': None, 'time_zone': None, 'geo_enabled': False, 'verified': True, 'statuses_count': 187, 'lang': None, 'contributors_enabled': False, 'is_translator': False, 'is_translation_enabled': False, 'profile_background_color': '000000', 'profile_

In [57]:
res_tweepy.iloc[3,20]

User(_api=<tweepy.api.API object at 0x11e1fbb10>, _json={'id': 862747349277450240, 'id_str': '862747349277450240', 'name': 'Dr. Michael von Abercron MdB', 'screen_name': 'mvabercron', 'location': 'Pinneberg, Deutschland', 'description': 'Direkt gewählter Bundestagsabgeordneter aus dem Wahlkreis Pinneberg | Es schreiben Michael von Abercron und sein Team', 'url': 'https://t.co/5Qqm51N9U8', 'entities': {'url': {'urls': [{'url': 'https://t.co/5Qqm51N9U8', 'expanded_url': 'http://www.von-abercron.de', 'display_url': 'von-abercron.de', 'indices': [0, 23]}]}, 'description': {'urls': []}}, 'protected': False, 'followers_count': 752, 'friends_count': 674, 'listed_count': 46, 'created_at': 'Thu May 11 19:12:51 +0000 2017', 'favourites_count': 162, 'utc_offset': None, 'time_zone': None, 'geo_enabled': False, 'verified': True, 'statuses_count': 187, 'lang': None, 'contributors_enabled': False, 'is_translator': False, 'is_translation_enabled': False, 'profile_background_color': '000000', 'profile_

We see that 'author' and 'user' are essentially the same. According to Stack Overflow, 'user' is deprecated, so we should use 'author' (if needed).

The following fields of the 'author' object might be interesting for us:

In [60]:
res_tweepy.iloc[3,19].description

'Direkt gewählter Bundestagsabgeordneter aus dem Wahlkreis Pinneberg | Es schreiben Michael von Abercron und sein Team'

In [61]:
res_tweepy.iloc[3,19].location

'Pinneberg, Deutschland'

The number of followers this account currently has:

In [62]:
res_tweepy.iloc[3,19].followers_count

752

The number of users this account is following (AKA their “followings”):

In [63]:
res_tweepy.iloc[3,19].friends_count

674

The number of public lists that this user is a member of:

In [64]:
res_tweepy.iloc[3,19].listed_count

46

The number of Tweets this user has liked in the account’s lifetime:

In [65]:
res_tweepy.iloc[3,19].favourites_count

162
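
To use these values in an analysis, they could be collected into one dict per tweet and later expanded into columns. A minimal sketch, in which a `SimpleNamespace` stands in for the Tweepy 'author' object; the names `AUTHOR_FIELDS` and `author_summary` are hypothetical:

```python
from types import SimpleNamespace

# the author attributes inspected above
AUTHOR_FIELDS = ["description", "location", "followers_count",
                 "friends_count", "listed_count", "favourites_count"]

def author_summary(author):
    """Collect selected attributes of an author object (missing ones become None)."""
    return {field: getattr(author, field, None) for field in AUTHOR_FIELDS}

# stand-in for a Tweepy User object, using the values shown above
fake_author = SimpleNamespace(location="Pinneberg, Deutschland", followers_count=752,
                              friends_count=674, listed_count=46, favourites_count=162)
summary = author_summary(fake_author)
print(summary)
```

Note that these counts describe the account at download time, not at the time the tweet was posted.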

### Retweets

The column 'retweeted_status' shows whether a tweet is a retweet or an original tweet. The text of retweets is truncated to 140 characters:

In [49]:
res_tweepy[['display_text_range','retweeted_status']]

| | display_text_range | retweeted_status |
|---|---|---|
| 0 | [0, 204] | |
| 1 | [0, 275] | |
| 2 | [0, 199] | |
| 3 | [0, 140] | `Status(_api=<tweepy.api.API object at 0x11e1fb...` |
| 4 | [0, 280] | |
| ... | ... | ... |
| 182 | [31, 107] | |
| 183 | [0, 251] | |
| 184 | [13, 100] | |
| 185 | [0, 135] | |


We can retrieve the full text of a retweet from the attribute 'full_text' of the Tweepy object stored in column 'retweeted_status' (this column is NaN unless the tweet is a retweet):

In [19]:
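# access by column label rather than by position, which shifts when columns change
res_tweepy['retweeted_status'].iloc[3].full_text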

'.@gitta_connemann und @mvabercron erklären: #Corona-Soforthilfen auch für Höfe, Forstbetriebe und landwirtschaftlichen Gartenbau  https://t.co/CRADyyuX4D'
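
To obtain untruncated text for every row, one option is a small helper that falls back to the tweet's own 'full_text' when the tweet is not a retweet. This is a sketch under the assumption that non-retweets hold None or NaN in 'retweeted_status'; `complete_text` is a hypothetical name. Note that for retweets the result no longer contains the leading "RT @user:" prefix.

```python
import math
from types import SimpleNamespace

def complete_text(full_text, retweeted_status):
    """Return the original tweet's text for retweets, otherwise full_text itself."""
    missing = retweeted_status is None or (
        isinstance(retweeted_status, float) and math.isnan(retweeted_status))
    return full_text if missing else retweeted_status.full_text

# stand-in for the Status object stored in 'retweeted_status'
original = SimpleNamespace(full_text="The complete text of the original tweet")
print(complete_text("RT @someone: The complete text of the ori...", original))
print(complete_text("A regular tweet", float("nan")))
```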

### Additional Information - 'entities'

For each tweet, column 'entities' contains a dict with additional information, such as the hashtags, mentioned users and URLs contained in the tweet.

In [20]:
# value of 'entities' for the 4th downloaded tweet (selected by column label)
res_tweepy['entities'].iloc[3]

{'hashtags': [{'text': 'Corona', 'indices': [58, 65]}],
 'symbols': [],
 'user_mentions': [{'screen_name': 'cducsubt',
   'name': 'CDU/CSU',
   'id': 46085533,
   'id_str': '46085533',
   'indices': [3, 12]},
  {'screen_name': 'gitta_connemann',
   'name': 'Gitta Connemann',
   'id': 1125751445205262336,
   'id_str': '1125751445205262336',
   'indices': [15, 31]},
  {'screen_name': 'mvabercron',
   'name': 'Dr. Michael von Abercron MdB',
   'id': 862747349277450240,
   'id_str': '862747349277450240',
   'indices': [36, 47]}],
 'urls': []}

In [21]:
# access first hashtag of this tweet (in this case the only hashtag)
res_tweepy['entities'].iloc[3]['hashtags'][0]['text']

'Corona'

In [22]:
# obtain display name of second user that is mentioned (the username is stored in 'screen_name')
res_tweepy['entities'].iloc[3]['user_mentions'][1]['name']

'Gitta Connemann'
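
Rather than indexing individual elements, all hashtags and mentioned usernames of a tweet can be extracted at once. A minimal sketch, using a shortened copy of the 'entities' dict shown above; `hashtags_and_mentions` is a hypothetical helper name:

```python
def hashtags_and_mentions(entities):
    """Return (hashtag texts, mentioned screen names) from an 'entities' dict."""
    hashtags = [tag["text"] for tag in entities.get("hashtags", [])]
    mentions = [user["screen_name"] for user in entities.get("user_mentions", [])]
    return hashtags, mentions

# shortened copy of the 'entities' value displayed above
example_entities = {
    "hashtags": [{"text": "Corona", "indices": [58, 65]}],
    "symbols": [],
    "user_mentions": [
        {"screen_name": "cducsubt", "name": "CDU/CSU"},
        {"screen_name": "gitta_connemann", "name": "Gitta Connemann"},
        {"screen_name": "mvabercron", "name": "Dr. Michael von Abercron MdB"},
    ],
    "urls": [],
}
hashtags, mentions = hashtags_and_mentions(example_entities)
print(hashtags, mentions)
```

Applied to the whole column, e.g. with `res_tweepy['entities'].apply(hashtags_and_mentions)`, this would yield one pair of lists per tweet.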