# Downloading data from Twitter

Note that the functions used for preprocessing and downloading are imported from our script 'twitter_helpers'.

In [1]:
import pandas as pd
import pickle
import twitter_helpers as th

# set local working directory
# import os
# os.chdir('/Users/patrickschulze/Desktop/Consulting/Bundestag-MP-Analyse/')

## 1. Data Import and Preprocessing

In [2]:
# import Bundestag data
with open('abg_df.pickle', 'rb') as handle:
    bt_data = pickle.load(handle)


In [3]:
# extract name and url for each member
url = bt_data['Soziale Medien'].apply(th.get_twitter_url)
twitter_account = pd.concat([bt_data['Name'], url], axis=1,
                            keys=['name', 'url'])

In [4]:
# convert twitter url to username for each member
twitter_usernames = twitter_account['url'].apply(th.get_twitter_username)
twitter_usernames.rename("username", inplace = True)
twitter_account = pd.concat([twitter_account, twitter_usernames], axis = 1)
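
The helper functions themselves are not shown in this notebook. As a rough illustration of the URL-to-username step, the following sketch assumes profile URLs of the form `https://twitter.com/<username>`. The function `username_from_url` is hypothetical and is not the actual `th.get_twitter_username`, which may handle more cases.

```python
from urllib.parse import urlparse


def username_from_url(url):
    """Hypothetical stand-in for th.get_twitter_username (real helper not shown)."""
    if not isinstance(url, str) or not url.strip():
        return None  # e.g. NaN for members without a Twitter link
    url = url.strip()
    # tolerate URLs given without a scheme, e.g. "twitter.com/name"
    parsed = urlparse(url if "//" in url else "//" + url)
    host = parsed.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    if host not in ("twitter.com", "mobile.twitter.com"):
        return None  # not a Twitter profile link
    # the first non-empty path segment is taken as the username
    segments = [segment for segment in parsed.path.split("/") if segment]
    if not segments:
        return None
    return segments[0].lstrip("@")


print(username_from_url("https://twitter.com/mvabercron"))
```

Applied to the data frame, this would be used like the real helper, e.g. `twitter_account['url'].apply(username_from_url)`.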

## 2. Download with GetOldTweets3

GetOldTweets3 is an unofficial Python module that scrapes tweets and other information from Twitter. The official Twitter API, which we access through the Python library Tweepy, returns at most the 3200 most recent tweets per user, whereas GetOldTweets3 imposes no such limit on the number of tweets per user.
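
For orientation, here is a minimal sketch of how a download with GetOldTweets3 could be wrapped. It is an assumption about what a helper like `th.download_tweets_got3` might do, not its actual code; the names `download_tweets_got3_sketch` and `tweets_to_records` are hypothetical. The package is imported only inside the download function, so the conversion step can be tried without it. Because GetOldTweets3 depends on Twitter's web search, it may no longer work against the current site.

```python
from types import SimpleNamespace


def tweets_to_records(tweets):
    """Turn tweet objects into plain dicts (attribute name -> value), one per tweet."""
    return [dict(vars(tweet)) for tweet in tweets]


def download_tweets_got3_sketch(username, since, until):
    """Hypothetical stand-in for th.download_tweets_got3; requires GetOldTweets3."""
    import GetOldTweets3 as got  # lazy import: only needed for a real download

    print("Downloading for", username)
    criteria = (got.manager.TweetCriteria()
                .setUsername(username)
                .setSince(since)    # lower bound, "YYYY-MM-DD"
                .setUntil(until))   # upper bound, "YYYY-MM-DD"
    tweets = got.manager.TweetManager.getTweets(criteria)
    return tweets_to_records(tweets)


# Offline illustration with a made-up tweet object (not real data).
fake_tweets = [SimpleNamespace(username="someone", text="hello", retweets=0)]
print(tweets_to_records(fake_tweets))
```

The resulting list of dicts can be passed to `pd.DataFrame(...)` to obtain one row per tweet.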

In [5]:
# download tweets using GetOldTweets3 for specified time period
res_got3 = pd.DataFrame()
for username in twitter_account.iloc[0:3, 2]:
    res_got3 = pd.concat([res_got3, th.download_tweets_got3(
        username, since="2017-09-24", until="2020-04-08")])

Downloading for mvabercron
Downloading for DorisAchelwilm
Downloading for aggelidis_fdp


In [6]:
# add 'Name' column (download only uses 'username' as input)
res_got3 = twitter_account.merge(res_got3, on = 'username')
# display results
res_got3

| | username | to | text | retweets | favorites | replies | id | permalink | author_id | date | formatted_date | hashtags | mentions | geo | urls |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | mvabercron | | Uni fällt aus? Keine Angst, eine Pause im Lehr... | 0 | 0 | 1 | 1245689849627258881 | https://twitter.com/mvabercron/status/12456898... | 862747349277450240 | 2020-04-02 12:29:38+00:00 | Thu Apr 02 12:29:38 +0000 2020 | #BAf #Corona | | | https://cducsu.cc/3alyQN7 |
| 1 | mvabercron | | Alle Unternehmen können vom #Corona-Sonderprog... | 2 | 1 | 1 | 1245431118968631297 | https://twitter.com/mvabercron/status/12454311... | 862747349277450240 | 2020-04-01 19:21:32+00:00 | Wed Apr 01 19:21:32 +0000 2020 | #Corona #wirhandeln | | | https://cducsu.cc/3alyQN7 |
| 2 | mvabercron | | Wir kämpfen um jeden Job – durch Ausweitung d.... | 0 | 0 | 0 | 1244942420841725954 | https://twitter.com/mvabercron/status/12449424... | 862747349277450240 | 2020-03-31 10:59:37+00:00 | Tue Mar 31 10:59:37 +0000 2020 | #wirhandeln | | | https://cducsu.cc/3alyQN7 |
| 3 | mvabercron | | Schnelle #Corona-Hilfe für Familien: Für den K... | 0 | 0 | 0 | 1244646897714855939 | https://twitter.com/mvabercron/status/12446468... | 862747349277450240 | 2020-03-30 15:25:19+00:00 | Mon Mar 30 15:25:19 +0000 2020 | #Corona #wirhandeln | | | https://bit.ly/2Utk4OL |
| 4 | mvabercron | | Der #Bundestag passt das #Infektionsschutzgese... | 0 | 2 | 0 | 1242874119139590147 | https://twitter.com/mvabercron/status/12428741... | 862747349277450240 | 2020-03-25 18:00:56+00:00 | Wed Mar 25 18:00:56 +0000 2020 | #Bundestag #Infektionsschutzgesetz #Corona | | | |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 388 | aggelidis_fdp | | @BAEKaktuell möchte durch eine zentrale Liste ... | 0 | 1 | 0 | 987282872863215616 | https://twitter.com/aggelidis_fdp/status/98728... | 976107205849206784 | 2018-04-20 10:52:34+00:00 | Fri Apr 20 10:52:34 +0000 2018 | #Schwangerschaftsabruch #219a | @BAEKaktuell | | |
| 389 | aggelidis_fdp | | Erste Studie zeigt #baukindergeld entfalten ka... | 3 | 9 | 1 | 986951112774307840 | https://twitter.com/aggelidis_fdp/status/98695... | 976107205849206784 | 2018-04-19 12:54:16+00:00 | Thu Apr 19 12:54:16 +0000 2018 | #baukindergeld | | | |
| 390 | aggelidis_fdp | | Der Vorschlag von @hubertus_heil läuft an Lebe... | 0 | 2 | 0 | 986618695735750662 | https://twitter.com/aggelidis_fdp/status/98661... | 976107205849206784 | 2018-04-18 14:53:22+00:00 | Wed Apr 18 14:53:22 +0000 2018 | #Teilzeit #br | @hubertus_heil | | |
| 391 | aggelidis_fdp | | Fast noch wichtiger als die Reden sind die mot... | 1 | 2 | 0 | 985533985576144896 | https://twitter.com/aggelidis_fdp/status/98553... | 976107205849206784 | 2018-04-15 15:03:07+00:00 | Sun Apr 15 15:03:07 +0000 2018 | #fdplptnds #fdp_nds | | | |


We can check that it is indeed possible to download more than 3200 tweets per user:

In [7]:
res = th.download_tweets_got3('realDonaldTrump',since = "2018-09-24", until = "2020-04-08")

Downloading for realDonaldTrump


In [12]:
res.shape

(6737, 15)

However, in rare cases some tweets appear to be missing, and some rows are empty. Furthermore, GetOldTweets3 cannot download retweets.
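
One way to locate such empty rows is sketched below, assuming that "empty" means a missing or whitespace-only value in the `text` column. The function `find_empty_rows` and the small demo frame are illustrative only and are not part of the downloaded data.

```python
import pandas as pd


def find_empty_rows(tweets, text_column="text"):
    """Return the rows whose text is missing or consists only of whitespace."""
    text = tweets[text_column].fillna("").astype(str)
    return tweets[text.str.strip() == ""]


# Made-up demo data (not real tweets): two of the three rows carry no text.
demo = pd.DataFrame({
    "username": ["a", "a", "b"],
    "text": ["hello", None, "   "],
})
empty = find_empty_rows(demo)
print(len(empty), "of", len(demo), "rows have no text")
```

On the real results this would be called as `find_empty_rows(res)`. Tweets that are missing altogether cannot be found this way, since they leave no row behind.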

## 3. Download with Tweepy

With Tweepy we can avoid these shortcomings: retweets are included and no information is missing, since Tweepy accesses the official Twitter API. However, as mentioned, the API returns at most the 3200 most recent tweets per user.
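
The following sketch shows how a timeline could be collected page by page. It is an assumption about the general approach and not the code of `th.download_tweets_tweepy`. The function `fetch_user_timeline` is hypothetical; `api` is assumed to be an authenticated `tweepy.API` instance, and `tweet_mode="extended"` is assumed because the results contain a `full_text` column. The `FakeAPI` class only imitates `user_timeline` so that the paging logic can be run without credentials.

```python
def fetch_user_timeline(api, username, page_size=200):
    """Collect a user's timeline by paging backwards with max_id."""
    statuses = []
    max_id = None
    while True:
        params = {"screen_name": username, "count": page_size,
                  "tweet_mode": "extended"}
        if max_id is not None:
            params["max_id"] = max_id
        page = api.user_timeline(**params)
        if not page:
            break  # no older tweets are returned (for the real API: limit reached)
        statuses.extend(page)
        # continue strictly below the oldest tweet seen so far
        max_id = min(status.id for status in page) - 1
    return statuses


class FakeStatus:
    def __init__(self, id):
        self.id = id


class FakeAPI:
    """Offline imitation of the paging behaviour, for illustration only."""

    def __init__(self, n_tweets):
        self.ids = list(range(n_tweets, 0, -1))  # newest first

    def user_timeline(self, screen_name, count, tweet_mode, max_id=None):
        ids = [i for i in self.ids if max_id is None or i <= max_id]
        return [FakeStatus(i) for i in ids[:count]]


result = fetch_user_timeline(FakeAPI(450), "someone")
print(len(result))
```

With real credentials, `api` would be created with Tweepy's OAuth handler and `tweepy.API`. The collected status objects could then be turned into rows with `pd.DataFrame([vars(s) for s in statuses])`, which would be consistent with the `_api` and `_json` columns in the output, although the actual helper may differ.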

In [8]:
# download most recent tweets using tweepy (at most 3200 tweets per user)
res_tweepy = pd.DataFrame()
for username in twitter_account.iloc[0:3, 2]:
    res_tweepy = pd.concat([res_tweepy, th.download_tweets_tweepy(username)])
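# Note: the 'name' column is not added here. res_tweepy has no 'username' column
# to merge on (see its column list below), and merging res_got3 a second time
# would only duplicate its 'name' and 'url' columns.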

Downloading for mvabercron
Downloading for DorisAchelwilm
Downloading for aggelidis_fdp


In [9]:
res_tweepy.columns

Index(['_api', '_json', 'created_at', 'id', 'id_str', 'full_text', 'truncated',
       'display_text_range', 'entities', 'extended_entities', 'source',
       'source_url', 'in_reply_to_status_id', 'in_reply_to_status_id_str',
       'in_reply_to_user_id', 'in_reply_to_user_id_str',
       'in_reply_to_screen_name', 'author', 'user', 'geo', 'coordinates',
       'place', 'contributors', 'is_quote_status', 'retweet_count',
       'favorite_count', 'favorited', 'retweeted', 'possibly_sensitive',
       'lang', 'retweeted_status', 'quoted_status_id', 'quoted_status_id_str',
       'quoted_status_permalink', 'quoted_status', 'withheld_in_countries'],
      dtype='object')

Columns that might be most important for us:

In [23]:
res_tweepy.iloc[:,[2,5,7,8,15,23,24,25,26,27,30]].columns

Index(['created_at', 'full_text', 'display_text_range', 'entities',
       'in_reply_to_user_id_str', 'is_quote_status', 'retweet_count',
       'favorite_count', 'favorited', 'retweeted', 'retweeted_status'],
      dtype='object')

In [24]:
res_tweepy.iloc[:,[2,5,7,8,15,23,24,25,26,27,30]]

| | created_at | full_text | display_text_range | entities | in_reply_to_user_id_str | is_quote_status | retweet_count | favorite_count | favorited | retweeted | retweeted_status |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2020-04-10 14:10:03 | Dieser Karfreitag wird anders sein als in den ... | [0, 204] | `{'hashtags': [], 'symbols': [], 'user_mentions...` | | False | 1 | 3 | False | False | |
| 1 | 2020-04-02 12:29:38 | Uni fällt aus? Keine Angst, eine Pause im Lehr... | [0, 275] | `{'hashtags': [{'text': 'BAföG', 'indices': [82...` | | False | 0 | 0 | False | False | |
| 2 | 2020-04-01 19:21:32 | Alle Unternehmen können vom #Corona-Sonderprog... | [0, 199] | `{'hashtags': [{'text': 'Corona', 'indices': [2...` | | False | 2 | 1 | False | False | |
| 3 | 2020-03-31 10:59:46 | RT @cducsubt: .@gitta_connemann und @mvabercro... | [0, 140] | `{'hashtags': [{'text': 'Corona', 'indices': [5...` | | False | 4 | 0 | False | False | `Status(_api=<tweepy.api.API object at 0x122902...` |
| 4 | 2020-03-31 10:59:37 | Wir kämpfen um jeden Job – durch Ausweitung d.... | [0, 280] | `{'hashtags': [{'text': 'wirhandeln', 'indices'...` | | False | 0 | 0 | False | False | |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 511 | 2018-04-20 10:52:34 | @BAEKaktuell möchte durch eine zentrale Liste ... | [0, 265] | `{'hashtags': [{'text': 'Schwangerschaftsabruch...` | 243143022 | False | 0 | 1 | False | False | |
| 512 | 2018-04-19 12:54:16 | Erste Studie zeigt #baukindergeld entfalten ka... | [0, 212] | `{'hashtags': [{'text': 'baukindergeld', 'indic...` | | False | 3 | 9 | False | False | |
| 513 | 2018-04-18 14:53:22 | Der Vorschlag von @hubertus_heil läuft an Lebe... | [0, 277] | `{'hashtags': [{'text': 'Teilzeit', 'indices': ...` | | False | 0 | 2 | False | False | |
| 514 | 2018-04-15 15:03:07 | Fast noch wichtiger als die Reden sind die mot... | [0, 185] | `{'hashtags': [{'text': 'fdplptnds', 'indices':...` | | False | 1 | 2 | False | False | |


If 'in_reply_to_user_id_str' is not 'None', the tweet is a reply to another user. If 'is_quote_status' is 'True', the tweet is a quote tweet, i.e. it embeds another tweet together with a comment of its own.
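
These rules can be combined into a small classifier, sketched here under stated assumptions. The function `tweet_kind` is hypothetical and also uses column 'retweeted_status', which is described in the next subsection. The order of the checks (retweet before quote before reply) is a design choice of this sketch, not something prescribed by the API.

```python
import math


def is_missing(value):
    """True for None and for NaN, the two ways a missing value can show up."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def tweet_kind(row):
    """Classify a tweet as 'retweet', 'quote', 'reply' or 'original'.

    `row` is anything with a dict-like .get, e.g. a dict or a DataFrame row.
    """
    if not is_missing(row.get("retweeted_status")):
        return "retweet"
    quote_flag = row.get("is_quote_status")
    if not is_missing(quote_flag) and bool(quote_flag):
        return "quote"
    if not is_missing(row.get("in_reply_to_user_id_str")):
        return "reply"
    return "original"


example = {"retweeted_status": float("nan"), "is_quote_status": False,
           "in_reply_to_user_id_str": "243143022"}
print(tweet_kind(example))
```

On the data frame this could be applied row by row, e.g. `res_tweepy.apply(tweet_kind, axis=1)`.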

### Retweets

Column 'retweeted_status' shows whether a tweet is a retweet or a new tweet: it holds a tweepy Status object for retweets and is empty (NaN) otherwise. The text of a retweet is truncated to 140 characters:

In [14]:
res_tweepy[['display_text_range','retweeted_status']]

| | display_text_range | retweeted_status |
|---|---|---|
| 0 | [0, 204] | |
| 1 | [0, 275] | |
| 2 | [0, 199] | |
| 3 | [0, 140] | `Status(_api=<tweepy.api.API object at 0x122902...` |
| 4 | [0, 280] | |
| ... | ... | ... |
| 511 | [0, 265] | |
| 512 | [0, 212] | |
| 513 | [0, 277] | |
| 514 | [0, 185] | |


For a retweet, i.e. when the value in column 'retweeted_status' is not NaN, we can retrieve the full text from the attribute 'full_text' of the tweepy Status object stored in that column:

In [20]:
res_tweepy.iloc[3,30].full_text

'.@gitta_connemann und @mvabercron erklären: #Corona-Soforthilfen auch für Höfe, Forstbetriebe und landwirtschaftlichen Gartenbau  https://t.co/CRADyyuX4D'
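
To apply this to every row at once, a helper along the following lines could be used. It is a sketch: `complete_text` is a hypothetical name, and the stand-in object below merely imitates a Status object with a `full_text` attribute. Note that the text taken from the original tweet does not carry the "RT @user:" prefix.

```python
import math
from types import SimpleNamespace


def complete_text(full_text, retweeted_status):
    """Return the untruncated text: the original tweet's text for retweets."""
    no_retweet = retweeted_status is None or (
        isinstance(retweeted_status, float) and math.isnan(retweeted_status))
    if no_retweet:
        return full_text
    return retweeted_status.full_text


# Stand-in for a tweepy Status object (illustration only).
original = SimpleNamespace(full_text="the complete text of the original tweet")
print(complete_text("RT @someone: the complete text of the ori...", original))
print(complete_text("a regular tweet", float("nan")))
```

For the data frame, the two columns can be combined with `[complete_text(t, r) for t, r in zip(res_tweepy['full_text'], res_tweepy['retweeted_status'])]`.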

### Additional Information - 'entities'

For each tweet, column 'entities' contains a dict with additional information, such as the hashtags, users, and URLs mentioned in the tweet.

In [22]:
# value of 'entities' for the 4th downloaded tweet
res_tweepy.iloc[3,8]

{'hashtags': [{'text': 'Corona', 'indices': [58, 65]}],
 'symbols': [],
 'user_mentions': [{'screen_name': 'cducsubt',
   'name': 'CDU/CSU',
   'id': 46085533,
   'id_str': '46085533',
   'indices': [3, 12]},
  {'screen_name': 'gitta_connemann',
   'name': 'Gitta Connemann',
   'id': 1125751445205262336,
   'id_str': '1125751445205262336',
   'indices': [15, 31]},
  {'screen_name': 'mvabercron',
   'name': 'Dr. Michael von Abercron MdB',
   'id': 862747349277450240,
   'id_str': '862747349277450240',
   'indices': [36, 47]}],
 'urls': []}

In [33]:
# Access first hashtag of this tweet (in this case the only hashtag)
res_tweepy.iloc[3,8]['hashtags'][0]['text']

'Corona'

In [36]:
# obtain display name of the second mentioned user (key 'screen_name' holds the username)
res_tweepy.iloc[3,8]['user_mentions'][1]['name']

'Gitta Connemann'
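
Instead of indexing into the dict by hand, the relevant parts can be pulled out in one step. The sketch below uses a hypothetical function `summarize_entities` and a shortened copy of the 'entities' value shown above. For URLs it assumes the usual 'expanded_url' and 'url' keys of the Twitter API.

```python
def summarize_entities(entities):
    """Reduce an 'entities' dict to lists of hashtags, usernames and URLs."""
    return {
        "hashtags": [tag["text"] for tag in entities.get("hashtags", [])],
        "mentions": [user["screen_name"]
                     for user in entities.get("user_mentions", [])],
        "urls": [link.get("expanded_url", link.get("url"))
                 for link in entities.get("urls", [])],
    }


# Shortened copy of the 'entities' value of the 4th downloaded tweet.
entities = {
    "hashtags": [{"text": "Corona", "indices": [58, 65]}],
    "symbols": [],
    "user_mentions": [
        {"screen_name": "cducsubt", "name": "CDU/CSU"},
        {"screen_name": "gitta_connemann", "name": "Gitta Connemann"},
        {"screen_name": "mvabercron", "name": "Dr. Michael von Abercron MdB"},
    ],
    "urls": [],
}
summary = summarize_entities(entities)
print(summary)
```

Applied to the whole column, `res_tweepy['entities'].apply(summarize_entities)` would give one such summary per tweet.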