# Data Collection Jupyter Script

The first step is loading your API credentials. This can be done with the following steps:

Open a command prompt or PowerShell window and navigate to a directory set up for data collection.

Set your bearer token. In the command prompt: `set BEARER_TOKEN=your_bearer_token` (no spaces around `=`). In PowerShell: `$env:BEARER_TOKEN = "your_bearer_token"`.

Open Jupyter Notebook from the same window: `jupyter notebook`

The next step is to import libraries. 

In [None]:
import pandas as pd
import requests
import os
import json
import time
import datetime

The following libraries colour and style JSON data, which makes API responses easier to read.

In [None]:
from pygments import highlight
from pygments.formatters.terminal256 import Terminal256Formatter
from pygments.lexers.web import JsonLexer
from pygments.style import Style
from pygments.token import Token, String, Name

class MyStyle(Style):
    styles = {
        Token.String:     'ansimagenta',
        Name:             'ansibrightblue'
    }

This builds the request headers used to authenticate with Twitter.

Note that one can directly paste in the Twitter bearer token here. However, code should not be shared with this token included. It is best practice to save the token in an environment variable.

In [None]:
def connect_to_twitter():
    # bearer_token = "your_bearer_token"
    bearer_token = os.environ.get("BEARER_TOKEN")
    return {"Authorization": "Bearer {}".format(bearer_token)}

In [None]:
headers = connect_to_twitter()
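
If `BEARER_TOKEN` is not set in the window that launched Jupyter, `os.environ.get` returns `None` and the header quietly becomes `Bearer None`. A minimal sketch of a stricter variant that fails early instead is shown here; `build_auth_headers` and its `env` argument are hypothetical names used for illustration and are not part of the original script.

```python
import os


def build_auth_headers(env=None):
    """Return the Authorization header, or raise if no token is configured.

    Hypothetical helper: `env` defaults to the real process environment
    and can be replaced by any mapping, for example in a test.
    """
    env = os.environ if env is None else env
    token = env.get("BEARER_TOKEN")
    if not token:
        raise RuntimeError("BEARER_TOKEN is not set in this environment")
    return {"Authorization": "Bearer {}".format(token)}


# Example with an explicit mapping instead of the real environment
example_headers = build_auth_headers({"BEARER_TOKEN": "abc123"})
```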

This allows access to the API. The next step is to build the request for the endpoint to use and the parameters to pass. 

This API connection uses the Twitter API v2 and pulls tweets that match the keywords and fall between a start date and an end date.

It is slow for large collections for a number of reasons. After every request (up to 500 tweets) it pauses for 5 seconds to stay under the rate limits of one request per second and 300 requests every 15 minutes. In addition, after 900 requests it pauses for 15 minutes. This is conservative and the pauses could be shortened, but it is better to be a little conservative than to hit a rate limit error. Information on rate limits is available here: https://developer.twitter.com/en/docs/twitter-api/rate-limits

Note that the tweets arrive in reverse chronological order within each day, so I build a list of tweets per day and iterate through it in reverse. This is less efficient than writing each block of 500 tweets as it arrives: with thousands of tweets in a day, the whole list, plus a list of users to look up usernames, must be held in memory. This could be sped up, but it is not slow enough to justify the effort.
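
To make the pacing concrete, the pauses described above can be turned into a rough run-time estimate. This is a sketch under stated assumptions: it counts only the sleeps (network time is ignored), and `estimate_collection_seconds` and its parameter names are hypothetical, chosen to mirror the values used in this notebook.

```python
import math


def estimate_collection_seconds(n_tweets, page_size=500, pause_s=5,
                                long_pause_every=900, long_pause_s=900):
    """Rough lower bound on run time from the pauses alone."""
    n_requests = math.ceil(n_tweets / page_size)
    n_long_pauses = n_requests // long_pause_every
    return n_requests * pause_s + n_long_pauses * long_pause_s


# Requests that fit in a 15-minute window with a 5-second pause
requests_per_window = (15 * 60) // 5

# One million tweets: 2000 requests and 2 long pauses
seconds_for_million = estimate_collection_seconds(1_000_000)
```

With these values, `requests_per_window` is 180, comfortably under 300, and one million tweets need at least 11800 seconds of pauses, a little over three hours.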


In [None]:
def connect_to_endpoint(url, headers, params, next_token = None):
    params['next_token'] = next_token   
    response = requests.request("GET", url, headers = headers, params = params)
    #print("Endpoint Response Code: " + str(response.status_code))
    if response.status_code != 200:
        raise Exception(response.status_code, response.text)
    return response.json()
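
Because `connect_to_endpoint` raises on any status other than 200, a single transient error stops the whole collection. One possible mitigation is a small retry wrapper. The sketch below is illustrative only: `call_with_retries` is a hypothetical helper, the API call is replaced by a stand-in function, and retrying every exception is a simplification (an authentication error, for instance, will not fix itself).

```python
import time


def call_with_retries(func, max_attempts=3, wait_s=30, sleep=time.sleep):
    """Call `func()` until it succeeds or `max_attempts` is reached.

    Hypothetical helper: `func` is any zero-argument callable, and `sleep`
    can be swapped out in tests to avoid real waiting.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except Exception:
            if attempt == max_attempts:
                raise
            sleep(wait_s * attempt)  # wait a little longer after each failure


# Stand-in for the API call: fails twice, then succeeds
calls = {"n": 0}


def flaky_request():
    calls["n"] += 1
    if calls["n"] < 3:
        raise Exception(503, "Service Unavailable")
    return {"meta": {"result_count": 0}}


waits = []
result = call_with_retries(flaky_request, sleep=waits.append)
```

In the notebook this might be used as `call_with_retries(lambda: connect_to_endpoint(url, headers, params, next_token))`.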

### User Inputs

Comment out the code blocks for every storm except the relevant one (select the cell contents and press Ctrl + /).

Set `storm_name` to the desired storm name.

A folder named `Tweets_` followed by the value of `storm_name` is created. This is where the data is saved in JSON format.

Edit `keyword` to whatever words you wish to collect. Edit the `other` parameter similarly to describe other criteria for collection. Information on filtering tweets is available in the API docs, for example https://developer.twitter.com/en/docs/twitter-api/tweets/filtered-stream/integrate/build-a-rule.

Adjust `year`, `month`, `start_day` and `end_day` as desired.
You can collect a single day or a number of days at once (within one month).
As the volume of tweets in a single day can be very large, I would advise searching for one day of data at a time: an occasional "Service not available" error can ruin a whole query after a few million tweets have been collected, and those tweets still count against the tweet cap. Furthermore, if the volume of data is particularly large on some days (e.g. March 1st 2018, the middle of storm Emma), it may be best to divide the day into two or more data collections. This can be done by changing the hours.
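
Dividing a day into several collections amounts to generating a start and end timestamp for each window. A minimal sketch follows; `split_day` is a hypothetical helper, and it assumes the API treats `end_time` as exclusive so that adjacent windows can share a boundary (worth confirming in the API reference before relying on it).

```python
from datetime import datetime, timedelta


def split_day(year, month, day, n_parts):
    """Split one UTC day into `n_parts` equal (start_time, end_time) windows."""
    fmt = "%Y-%m-%dT%H:%M:%S.000Z"
    day_start = datetime(year, month, day)
    step = timedelta(days=1) / n_parts
    windows = []
    for i in range(n_parts):
        start = day_start + i * step
        end = day_start + (i + 1) * step
        windows.append((start.strftime(fmt), end.strftime(fmt)))
    return windows


# Two half-day windows for 1 March 2018
windows = split_day(2018, 3, 1, 2)
```

Each pair could then be passed as `start_time` and `end_time` in the request parameters. Because tweets arrive newest first, the windows would need to be written to the output file in chronological order.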

In [None]:
#Ophelia
# storm_name = 'Emma'
storm_name = 'Ophelia'
if not os.path.exists("Tweets_"+storm_name):
    os.mkdir("Tweets_"+storm_name)

#What you want to search for - keywords.
#Note this brings up meteireann quote tweets, replies, mentions also.
# keyword ="(Storm OR meteireann OR Emma OR wind OR gale OR windstorm OR hurricane OR rain OR raining OR rainy OR rainstorms OR rainstorm OR hail OR hailstones OR hailstorm OR hailing OR hale OR snow OR blizzard OR snowstorm)"
keyword ="(Storm OR meteireann OR Ophelia OR Ofelia OR Opelia OR Opehlia OR stormophelia OR Opheliaireland OR wind OR gale OR windstorm OR hurricane)"

#Also restricting search to tweets Twitter detects to be written in English and not retweets. 
other = " lang:en -is:retweet" 

#Dates to search between
# year =  2018
# month = 3 # 2 or 3 depending on day
# start_day = end_day = 5 # (25-28)U(1,5)
year =  2017
month = 10 
start_day = end_day = 11 # (Change this to collect data between 11-19)
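
Long keyword strings are easy to mistype, so the query can also be assembled from a list of terms. The sketch below is one way to do that; `build_query` is a hypothetical helper, and the default `max_length` of 1024 characters is an assumption about the query-length limit, which should be checked against the limit for your product track.

```python
def build_query(terms, extra="", max_length=1024):
    """Join search terms with OR and append extra operators.

    `max_length` is an assumed query-length limit, not a value taken
    from the original script.
    """
    query = "(" + " OR ".join(terms) + ")"
    if extra:
        query += " " + extra.strip()
    if len(query) > max_length:
        raise ValueError(
            "query is {} characters, limit is {}".format(len(query), max_length)
        )
    return query


example_query = build_query(["Storm", "Ophelia", "wind"], " lang:en -is:retweet")
```

Here `example_query` is `(Storm OR Ophelia OR wind) lang:en -is:retweet`, the same shape as `keyword + other` in the cell above.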

The academic product track allows users to request up to 500 tweets per page (there is a maximum of 100 for the standard product track).

In [None]:
max_results = 500

### Loop

In [None]:
hour = 0

#Zero-pad month and days; zfill also works if this cell is run a second time
month = str(month).zfill(2)
start_day_str = str(start_day).zfill(2)
end_day_str = str(end_day).zfill(2)

start = str(year) + "-" + month + "-" + start_day_str + "T00:00:00.000Z"

#Output file name, e.g. Tweets_Ophelia/tweets_Ophelia_2017_10_11.json
output = open('Tweets_' + storm_name + '/tweets_' + storm_name + '_' + start.split('T')[0].replace('-', '_') + '.json', 'w', encoding='utf-8')

#Change to the endpoint you want to collect data from

#academic product track allows full archive search
url = "https://api.twitter.com/2/tweets/search/all"
#academic product track allows recent search
#url = "https://api.twitter.com/2/tweets/search/recent"

The `url` variable holds the endpoint we want to access. The query parameters of the endpoint allow customisation of the request.
The API reference page for the full-archive search endpoint query parameters is here: https://developer.twitter.com/en/docs/twitter-api/tweets/search/api-reference/get-tweets-search-all.

In [None]:
#Counters used by the collection loop below

#Tweet counter
count = 0

#Request counter
n_requests = 0

my_dict={}

For background, see this guide to collecting tweets from the Twitter API v2 for academic research: https://towardsdatascience.com/an-extensive-guide-to-collecting-tweets-from-twitter-api-v2-for-academic-research-using-python-3-518fcb71df2a

Note that if you want to split the search into several searches per day (say, for busy periods), this code will have to be altered slightly.

In [None]:
# Iterate through the dates

for date in range(start_day,end_day+1):
    #Zero-pad day and hour so the timestamps are valid (e.g. T00, not T0)
    date = str(date).zfill(2)
    start_date= str(year) +"-"+str(month).zfill(2)+"-"+date+"T"+str(hour).zfill(2)+":00:00.000Z"
    end_date= str(year) +"-"+ str(month).zfill(2)+"-"+date+"T"+str(hour+23).zfill(2)+":59:59.000Z"
    
    print(start_date.split('T')[0])
    
    params = {'query': keyword + other,
                'start_time': start_date,
                'end_time': end_date,
                'max_results': max_results,
                'expansions': 'author_id,in_reply_to_user_id,geo.place_id,referenced_tweets.id',
                'tweet.fields': 'id,text,author_id,in_reply_to_user_id,geo,conversation_id,created_at,lang,public_metrics,referenced_tweets,reply_settings,source,entities',
                'user.fields': 'id,name,username,location,created_at,description,public_metrics,verified',
                'place.fields': 'full_name,id,country,country_code,geo,name,place_type,contained_within',
                'next_token': {}}

    ####
    #loop through pages of the search

    #Sentinel values: 1 means no page has been requested yet, 0 means there are no more pages
    next_token = 1

    #as the API gets from the last time to first time, create a list and reverse it later
    tweets, users, places, ref_tweets = [], [], [], []

    #iterate through next tokens
    while next_token != 0:

        #count requests made
        n_requests += 1
        
        #This is going to have up to 500 tweets and a next_token if more
        json_response = connect_to_endpoint(url, headers, params) if next_token == 1 else  connect_to_endpoint(url, headers, params, next_token) 

        #Get next_token if there is one
        if 'next_token' not in json_response['meta']:
            next_token = 0
        else:
            next_token = json_response['meta']['next_token']

        #Get number of tweets
        count += json_response['meta']['result_count']
        print('Tweets collected:',count, '- Requests:', n_requests)

        #We could write here but due to the backwards order I save it to a list instead
        #Use .get because 'data' and 'includes' are missing when a page has no results
        tweets += json_response.get('data', [])
        includes = json_response.get('includes', {})
        users += includes.get('users', [])
        places += includes.get('places', [])
        ref_tweets += includes.get('tweets', [])
            
        #Pause to not hit rate limit
        time.sleep(5) 

        #Every 900 requests you have to wait 15 minutes
        if n_requests %900 == 0:
            time.sleep(900)

    ##############
    #Write to file
    # Dictionaries:
    # USER STUFF
    usernames = {u['id']:u['username'] for u in users}
    user_location = {u['id']:u['location'] for u in users if 'location' in u}
    n_followers = {u['id']:u['public_metrics']['followers_count'] for u in users}
    n_following = {u['id']:u['public_metrics']['following_count'] for u in users}
    user_description = {u['id']:u['description'] for u in users}
    user_entities = {u['id']:u['entities'] for u in users if 'entities' in u}
    user_created_at = {u['id']:u['created_at'] for u in users}
    
    # LOCATION STUFF
    country = {p['id']:p['country_code'] for p in places if 'country_code' in p}
    place_full_name = {p['id']:p['full_name'] for p in places if 'full_name' in p}
    place_name = {p['id']:p['name'] for p in places if 'name' in p}
    places_geo = {p['id']:p['geo'] for p in places if 'geo' in p}# geo in data.includes.places
    place_type = {p['id']:p['place_type'] for p in places if 'place_type' in p}
    
    # REFERENCED_TWEET_STUFF
    referenced_tweets_text = {t['id']:t['text'] for t in ref_tweets if 'text' in t}
    referenced_tweets_author_id = {t['id']:t['author_id'] for t in ref_tweets if 'text' in t}
    referenced_tweets_conversation_id = {t['id']:t['conversation_id'] for t in ref_tweets if 'text' in t}
    referenced_tweets_created_at = {t['id']:t['created_at'] for t in ref_tweets if 'text' in t}
    referenced_tweets_lang = {t['id']:t['lang'] for t in ref_tweets if 'text' in t}
    
    #loop through tweets in reverse order (NOTE: if you change the hour you'll have order issues here)
    for tweet in reversed(tweets):

        #Get user ID
        uid = tweet['author_id']
        
        tweet_GPS = tweet['geo']['place_id'] if 'geo' in tweet and 'place_id' in tweet['geo'] else ''
        
        #Get lists of hashtags and mentions
        hasht = []
        ment = []
        if 'entities' in tweet:
            if 'hashtags' in tweet['entities']:
                for u in tweet['entities']['hashtags']:
                    hasht.append(u['tag'])
            if 'mentions' in tweet['entities']:
                for u in tweet['entities']['mentions']:
                    ment.append(u['username'])

        #Check if retweet
        RT = 'n'
        if 'referenced_tweets' in tweet and tweet['text'][:2] == 'RT':
            RT = 'y'
        referenced_tweet_id = tweet['referenced_tweets'][0]['id'] if 'referenced_tweets' in tweet else ''
        
        my_dict_line = {}
        my_dict_line['tweet_id']=tweet.get('id')
        my_dict_line['conversation_id']=tweet.get('conversation_id')
        my_dict_line['author_id']=tweet.get('author_id')
        my_dict_line['created_at']=tweet.get('created_at')
        my_dict_line['text']=tweet.get('text')
        my_dict_line['lang']=tweet.get('lang')
        my_dict_line['retweet']=RT
        
        my_dict_line['username']=usernames[uid]
        my_dict_line['user_location']=user_location[uid] if uid in user_location else ''
        my_dict_line['n_followers']=n_followers[uid] if uid in n_followers else ''
        my_dict_line['n_following']=n_following[uid] if uid in n_following else ''
        my_dict_line['user_description']=user_description[uid] if uid in user_description else ''
        my_dict_line['user_entities']=user_entities[uid] if uid in user_entities else ''
        my_dict_line['user_created_at']=user_created_at[uid] if uid in user_created_at else ''
        
        my_dict_line['geo']=tweet.get('geo') 
        #tweet_GPS is '' when the tweet has no place, so each lookup falls back to ''
        my_dict_line['country']=country.get(tweet_GPS, '')
        my_dict_line['place_full_name']=place_full_name.get(tweet_GPS, '')
        my_dict_line['place_name']=place_name.get(tweet_GPS, '')
        my_dict_line['places_geo']=places_geo.get(tweet_GPS, '')
        my_dict_line['place_type']=place_type.get(tweet_GPS, '')
        
        my_dict_line['public_metrics']=tweet.get('public_metrics')
        my_dict_line['entities']=tweet.get('entities')
        my_dict_line['in_reply_to_user_id']=tweet.get('in_reply_to_user_id')
        my_dict_line['referenced_tweets']=tweet.get('referenced_tweets')
        
        my_dict_line['referenced_tweet_id']=referenced_tweet_id
        my_dict_line['referenced_tweets_text']=referenced_tweets_text[referenced_tweet_id] if 'referenced_tweets' in tweet and referenced_tweet_id in referenced_tweets_text else ''
        my_dict_line['referenced_tweets_author_id']=referenced_tweets_author_id[referenced_tweet_id] if 'referenced_tweets' in tweet and referenced_tweet_id in referenced_tweets_author_id else ''
        my_dict_line['referenced_tweets_conversation_id']=referenced_tweets_conversation_id[referenced_tweet_id] if 'referenced_tweets' in tweet and referenced_tweet_id in referenced_tweets_conversation_id else ''
        my_dict_line['referenced_tweets_created_at']=referenced_tweets_created_at[referenced_tweet_id] if 'referenced_tweets' in tweet and referenced_tweet_id in referenced_tweets_created_at else ''
        my_dict_line['referenced_tweets_lang']=referenced_tweets_lang[referenced_tweet_id] if 'referenced_tweets' in tweet and referenced_tweet_id in referenced_tweets_lang else ''
        
        output.write(json.dumps(my_dict_line) + '\n')
    #End tweet loop
    #############

#Close output file
output.close()
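
The pagination logic in the loop above can be isolated and tried without touching the API. The following is a minimal sketch, not the method used in this notebook: `collect_pages` and `fetch_page` are hypothetical names, `fetch_page` stands in for `connect_to_endpoint`, and the fake pages only imitate the shape of a search response.

```python
def collect_pages(fetch_page):
    """Follow next_token links until the response no longer contains one.

    `fetch_page(next_token)` is a hypothetical callable that returns a
    dict shaped like a search response.
    """
    tweets = []
    next_token = None
    while True:
        page = fetch_page(next_token)
        tweets += page.get("data", [])  # tolerate pages without a 'data' key
        next_token = page.get("meta", {}).get("next_token")
        if next_token is None:
            break
    # Pages arrive newest first, so reverse once at the end
    return list(reversed(tweets))


# Two fake pages, newest first, linked by a token
fake_pages = {
    None: {"data": [{"id": "4"}, {"id": "3"}], "meta": {"next_token": "abc"}},
    "abc": {"data": [{"id": "2"}, {"id": "1"}], "meta": {}},
}
ordered_ids = [t["id"] for t in collect_pages(fake_pages.get)]
```

Using `None` rather than the values 0 and 1 to mark the first and last page keeps the token variable either a real token or empty.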

Check the number of requests made:

In [None]:
n_requests

This is what the response to the last request looks like:

In [None]:
json_response_col = highlight(
    json.dumps(json_response, sort_keys=True, indent=4),
    lexer=JsonLexer(),
    formatter=Terminal256Formatter(style=MyStyle),
)
print(json_response_col)
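
The output file written by the loop holds one JSON object per line, so it can be read back line by line. A small sketch using only the standard library follows; `load_tweets` is a hypothetical helper, and the demo file is a temporary stand-in for a real output file.

```python
import json
import os
import tempfile


def load_tweets(path):
    """Read a JSON-lines file (one tweet per line) into a list of dicts."""
    with open(path, encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


# Round trip with a small temporary file in the same format as the output
demo_rows = [{"tweet_id": "1", "text": "windy"}, {"tweet_id": "2", "text": "calm"}]
demo_path = os.path.join(tempfile.mkdtemp(), "tweets_demo.json")
with open(demo_path, "w", encoding="utf-8") as handle:
    for row in demo_rows:
        handle.write(json.dumps(row) + "\n")

loaded = load_tweets(demo_path)
```

Since `pandas` is already imported at the top of the notebook, `pd.read_json(path, lines=True)` is an alternative that returns a DataFrame directly.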