# Get Tweets
## Tools
We set up a Twitter developer account and attempted to use the Tweepy tool to extract tweets. Our method was influenced by 
https://www.earthdatascience.org/courses/use-data-open-source-python/intro-to-apis/twitter-data-in-python/

We did extract data but we encountered problems when trying to access historic data.

## Problems encountered

Unfortunately, the standard Twitter search API used here only returns tweets from roughly the most recent week, irrespective of the start date provided. This means we have to run the extract weekly (to get the prior week's tweets) in order to build up a reasonable body of tweets. This is suboptimal and restricts our ability to go back and query earlier periods.
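
To make the effect of the one-week limit explicit, the requested start date can be clamped to the earliest date the search is likely to cover. This is a minimal sketch rather than part of the original extract: `effective_start_date` and `SEARCH_WINDOW_DAYS` are hypothetical names, and the 7-day figure is an assumption based on the behaviour described above.

```python
import datetime

# Assumption: the standard search endpoint reaches back about 7 days.
SEARCH_WINDOW_DAYS = 7


def effective_start_date(requested_start, run_date, window_days=SEARCH_WINDOW_DAYS):
    """Return the earliest date the search can realistically cover."""
    earliest_available = run_date - datetime.timedelta(days=window_days)
    # A requested date older than the window is silently replaced by the window start.
    return max(requested_start, earliest_available)


run_date = datetime.date(2021, 7, 30)
print(effective_start_date(datetime.date(2019, 7, 1), run_date))  # 2021-07-23
```

Logging this date with each weekly run makes it easier to see which period every saved file actually covers.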

In addition, running search queries on Twitter takes a long time and the connection can time out. For example, Appendix A shows the messages I received when a search using the term "sadiq AND khan" timed out. This term had previously taken 3 hours to complete and returned approximately 12,000 rows. However, it has since consistently failed due to timeouts. I am not sure whether this is happenstance or whether Twitter has throttled my ability to download large volumes of data, given that I had recently downloaded this data. This may be of interest to other researchers using developer accounts.
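
One common way to soften intermittent connection failures is to retry the failing call with an increasing delay. The sketch below is illustrative only and was not used for the results in this notebook: `call_with_retries` is a hypothetical helper, and the delays are arbitrary. It catches the built-in `ConnectionError` and `TimeoutError` by default; with the Tweepy version used here, where failures surface as `tweepy.TweepError`, the `retry_on` argument would need to be set accordingly.

```python
import time


def call_with_retries(func, max_attempts=4, base_delay=5.0, sleep=time.sleep,
                      retry_on=(ConnectionError, TimeoutError)):
    """Call func(), retrying with exponential backoff on connection problems."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except retry_on as error:
            if attempt == max_attempts:
                raise  # give up and let the caller see the last error
            delay = base_delay * 2 ** (attempt - 1)  # 5s, 10s, 20s, ...
            print('attempt {} failed ({}); retrying in {:.0f}s'.format(attempt, error, delay))
            sleep(delay)
```

A lookup could then be wrapped as `call_with_retries(lambda: api.get_user(user_id_str), retry_on=(tweepy.TweepError,))`, bearing in mind that this would also retry errors that will never succeed, such as a suspended user.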

## Twitter research account
I therefore applied to Twitter's Academic Research track https://developer.twitter.com/en/products/twitter-api/academic-research because it allows researchers to access historic tweets, and in higher volumes. My requests were rejected, and I discuss this in further detail in Appendix B. I discuss it because researching Twitter data without academic access is difficult, and yet getting academic access without a presence on a university website does not seem possible.

## Tweepy Code References:
- retweet and favourite counts, better dataframe creator using Tweepy - https://towardsdatascience.com/how-to-build-a-dataset-from-twitter-using-python-tweepy-861bdbc16fa5
- Getting user and location - https://stackoverflow.com/questions/50366489/how-to-get-twitter-users-screen-name-or-userid-from-a-specific-geolocations
- Cleaning tweet text and finding out if retweet - https://stackoverflow.com/questions/50052330/tweepy-check-if-a-tweet-is-a-retweet
- geocoordinates - https://stackoverflow.com/questions/46044445/not-able-to-scrape-geo-coordinate-with-tweets-lat-lon
- avoiding twitter api rate limit - https://stackoverflow.com/questions/21308762/avoid-twitter-api-limitation-with-tweepy
- keeping authentication details secret - https://www.digitalocean.com/community/tutorials/how-to-create-a-twitterbot-with-python-3-and-the-tweepy-library

In [1]:
import os
import tweepy
import datetime
import pandas as pd

## 1. Get Twitter Data
There are two ways to load the Twitter data:
- 1.1. use the Tweepy API (but this can take hours)
- 1.2. load the previously saved Twitter data 

### 1.1. Load data from Twitter 
#### 1.1.1 Twitter credentials file
I don't want to make my Twitter credentials public, so they are loaded from a credentials file that is not uploaded to GitHub.

To replicate this code, create a 'credentials.py' file with the following lines (using your own credential details):

`
consumer_key = 'your_consumer_key'
consumer_secret = 'your_consumer_secret'
access_token = 'your_access_token'
access_token_secret = 'your_access_token_secret'`
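
An alternative to a credentials file is to read the same four values from environment variables. The sketch below is an assumption-laden illustration, not what this notebook does: `load_credentials`, `REQUIRED_KEYS` and the `TWITTER_` prefix are hypothetical names chosen for the example.

```python
import os

# Hypothetical naming scheme: TWITTER_CONSUMER_KEY, TWITTER_CONSUMER_SECRET, ...
REQUIRED_KEYS = ("consumer_key", "consumer_secret", "access_token", "access_token_secret")


def load_credentials(env=os.environ, prefix="TWITTER_"):
    """Read the four credentials from environment variables."""
    credentials = {}
    missing = []
    for key in REQUIRED_KEYS:
        variable = prefix + key.upper()
        value = env.get(variable)
        if value:
            credentials[key] = value
        else:
            missing.append(variable)
    if missing:
        # Fail early with a list of everything that still needs to be set.
        raise KeyError("missing environment variables: " + ", ".join(missing))
    return credentials
```

The returned dictionary can then supply the arguments that the credentials file would otherwise provide, for example `creds["consumer_key"]`.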

In [2]:
from credentials import *

auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)

####  1.1.2 Set date parameters
Using search words, I want to get all tweets between today's date and a start date of July 1, 2019
- The start date is just before the day the Mayor made his speech and today's date is used so I can collect as many tweets as possible

In [3]:
date_from = datetime.date(2019, 7, 1) # this doesn't actually work as twitter only goes back one week
today = datetime.datetime.today().strftime('%Y%m%d')

outputfile_str = "./DataSources/TwitterData/raw_tweets_" + today + ".csv"
print(outputfile_str)

date_from, today

./DataSources/TwitterData/raw_tweets_20210730.csv


(datetime.date(2019, 7, 1), '20210730')

#### 1.1.3 Get tweets using a cursor
- First we define our function to load tweets
- Create search terms to query Twitter - return results as a list of dictionary items
- Concatenate all returned results and then use this to create a pandas dataframe
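
Because several search terms overlap, the same tweet can be returned more than once. The sketch below shows one way the per-term result lists could be flattened and de-duplicated before building the dataframe. It is an illustration only: `flatten_and_deduplicate` is a hypothetical helper, and the notebook itself keeps every returned row, so whether to drop duplicates is a judgment call.

```python
from itertools import chain


def flatten_and_deduplicate(results_per_term, key="tweet_id"):
    """Merge per-search-term result lists, keeping the first row seen for each tweet."""
    unique_rows = {}
    for row in chain.from_iterable(results_per_term):
        # setdefault leaves the first occurrence in place and ignores later repeats
        unique_rows.setdefault(row[key], row)
    return list(unique_rows.values())


results = [
    [{"tweet_id": 1, "retrieved_using_search_term": "London AND crime"}],
    [{"tweet_id": 1, "retrieved_using_search_term": "London AND stabbing"},
     {"tweet_id": 2, "retrieved_using_search_term": "London AND stabbing"}],
]
rows = flatten_and_deduplicate(results)
print(len(rows))  # 2
```

The resulting list of dictionaries can be passed directly to `pd.DataFrame(rows)`.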

##### 1.1.3.1 define get_tweets
To process quote tweets we borrowed code from https://blog.f-secure.com/processing-quote-tweets-with-twitter-api/. Because it didn't work well for us, we then used json.dumps and json.loads to serialise the quoted tweet's JSON to a string and load it back into a dictionary that is easy to interrogate for the quoted user.
- error processing purloined from https://stackoverflow.com/questions/27351207/gracefully-handle-errors-and-exceptions-for-user-timeline-method-in-tweepy
    - error code 50 means there isn't a user for this user id
    - error code 63 means this user id refers to a user who has been suspended from Twitter
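
The two error codes above can be separated from genuine connection problems with a small lookup. This is a sketch under assumptions: `KNOWN_USER_ERRORS`, `describe_api_error` and `is_skippable` are hypothetical names, and it assumes the numeric code has already been read from the exception (in the Tweepy 3.x series this is believed to be available as the `api_code` attribute of `TweepError`, which is `None` for connection failures).

```python
# Codes taken from the error messages seen in this notebook's output.
KNOWN_USER_ERRORS = {
    50: "user not found",
    63: "user suspended",
}


def describe_api_error(api_code):
    """Return a short description for a Twitter API error code."""
    return KNOWN_USER_ERRORS.get(api_code, "unexpected error")


def is_skippable(api_code):
    """True when the error only affects one user and the tweet can be skipped."""
    return api_code in KNOWN_USER_ERRORS


print(describe_api_error(63))  # user suspended
print(is_skippable(None))      # False
```

Distinguishing the two cases makes it possible to skip tweets from missing or suspended users quietly while still reporting, or retrying, connection errors.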

In [4]:
import time
import json

def get_tweets(search_words, my_api, today): 
    tic = time.perf_counter()
    tweets = tweepy.Cursor(my_api.search,
                       q=search_words,
                       lang="en",
                       since=date_from).items()
    
    output = []
    for tweet in tweets:
               
        try:
            tweet_id = tweet.id
            text = tweet.text
            tweet_date = tweet.created_at
            user_id_str = tweet.user.id_str
            screen_name = tweet.user.screen_name
            user_name = tweet.user.name
            user_id = my_api.get_user(user_id_str)  # full user object, used for the location below
        
            in_reply_to_user_screen_name = ""
            quote_tweet_screen_name = ""
            
            if tweet.in_reply_to_user_id is not None: 
                in_reply_to_user_id = tweet.in_reply_to_user_id 
                in_reply_to_user_screen_name = my_api.get_user(in_reply_to_user_id).screen_name
             
            if hasattr(tweet, 'quoted_status'): 
                quote_tweet = tweet.quoted_status            
                quote_tweet_str = json.dumps(quote_tweet._json) # dumps json component into a string
                quote_tweet_dict = json.loads(quote_tweet_str) # loads the string into a dictionary                       
                quote_tweet_id = quote_tweet_dict["user"]["id"]
                quote_tweet_screen_name = my_api.get_user(quote_tweet_id).screen_name
                                  
            user_location = user_id.location
            user_coordinates = tweet.coordinates
            favourite_count = tweet.favorite_count
            retweet_count = tweet.retweet_count
                
            line = {'tweet_id' : tweet_id,
                'tweet_date' : tweet_date,
                'tweeter_id' : user_id_str,
                'tweeter_user_name' : user_name,
                'tweeter_screen_name' : screen_name,
                'tweeter_location' : user_location,
                'tweeter_coordinates' : user_coordinates,
                'message_text' : text,
                'in_reply_to_user_screen_name' : in_reply_to_user_screen_name,      
                'quote_tweet_screen_name' : quote_tweet_screen_name,
                'favourite_count' : favourite_count, 
                'retweet_count' : retweet_count,
                'extract_run_date' : today,
                'retrieved_using_search_term' : search_words}
            output.append(line)
        
        except tweepy.TweepError as e:
            print('\n **************** error ***************')
            print(e)
            print('\n ********* end of error text **********')
               
    toc = time.perf_counter()
    time_taken = toc - tic
    
    print('Time taken to process search term : {} , was {:.2f}'.format(search_words, time_taken))
    
    return output
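
Each `get_user` call in the function above is a separate request to `/1.1/users/show.json` (the endpoint visible in the Appendix A traceback), which adds to the run time and to the rate-limit waits. The sketch below shows how the same details might be read from the tweet payload itself. It is not the method used in this notebook: `names_from_payload` is a hypothetical helper, and it assumes the v1.1 payload (the dictionary Tweepy keeps in `tweet._json`) carries `in_reply_to_screen_name`, `quoted_status.user.screen_name` and `user.location`, which should be checked against real data before relying on it.

```python
def names_from_payload(tweet_json):
    """Pull reply, quote and location details out of one tweet payload (a dict)."""
    # Missing or null sections fall back to empty dictionaries.
    quoted_user = (tweet_json.get("quoted_status") or {}).get("user") or {}
    author = tweet_json.get("user") or {}
    return {
        "in_reply_to_user_screen_name": tweet_json.get("in_reply_to_screen_name") or "",
        "quote_tweet_screen_name": quoted_user.get("screen_name", ""),
        "tweeter_location": author.get("location", ""),
    }


# Made-up payload for illustration only.
sample = {
    "in_reply_to_screen_name": None,
    "user": {"screen_name": "example_user", "location": "London"},
    "quoted_status": {"user": {"id": 123, "screen_name": "quoted_user"}},
}
print(names_from_payload(sample))
```

If the embedded fields prove reliable, this would remove up to three extra requests per tweet.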

##### 1.1.3.2 create list of search terms and iteratively get tweets using these terms

In [5]:
search_terms = ["London AND knife AND crime",
                "London AND knifecrime",
                "Khan AND knife AND crime",
                "Khan AND knifecrime",
                "London AND violent AND crime",
                "youth AND violent AND crime",
                "youth AND crime AND London",  # comma was missing, which merged this term with the next one
                "youth AND knife AND crime",
                "london AND youthcrime",
                "#knifecrime AND #khan",
                "#knifecrime AND #london",
                "#violence AND #khan",
                "#london AND #youthcrime",
                "London AND crime",
                "London AND stabbing"]

# The following term was queried on 30/07/2021 to get wider context on what's being tweeted
#search_terms = ["sadiq AND khan"]

all_tweets = []

tic = time.perf_counter()

for search_term in search_terms:
    current_tweets = get_tweets(search_term, api, today)
    all_tweets.append(current_tweets)

toc = time.perf_counter()
time_taken = toc - tic
    
print('Time taken to process ALL search terms : {} , was {:.2f}'.format(search_terms, time_taken))

Rate limit reached. Sleeping for: 678
Rate limit reached. Sleeping for: 682
Rate limit reached. Sleeping for: 674
Rate limit reached. Sleeping for: 670
Rate limit reached. Sleeping for: 660
Rate limit reached. Sleeping for: 648
Rate limit reached. Sleeping for: 670
Rate limit reached. Sleeping for: 638



 **************** error ***************
Failed to send request: ('Connection aborted.', ConnectionResetError(10054, 'An existing connection was forcibly closed by the remote host', None, 10054, None))

 ********* end of error text **********

 **************** error ***************
[{'code': 63, 'message': 'User has been suspended.'}]

 ********* end of error text **********


Rate limit reached. Sleeping for: 587



 **************** error ***************
[{'code': 50, 'message': 'User not found.'}]

 ********* end of error text **********


Rate limit reached. Sleeping for: 580



 **************** error ***************
Failed to send request: ('Connection aborted.', ConnectionResetError(10054, 'An existing connection was forcibly closed by the remote host', None, 10054, None))

 ********* end of error text **********


Rate limit reached. Sleeping for: 671



 **************** error ***************
[{'code': 50, 'message': 'User not found.'}]

 ********* end of error text **********


Rate limit reached. Sleeping for: 672



 **************** error ***************
Failed to send request: ('Connection aborted.', ConnectionResetError(10054, 'An existing connection was forcibly closed by the remote host', None, 10054, None))

 ********* end of error text **********


Rate limit reached. Sleeping for: 669



 **************** error ***************
[{'code': 50, 'message': 'User not found.'}]

 ********* end of error text **********
Time taken to process search term : sadiq AND khan , was 47091.57
Time taken to process ALL search terms : ['sadiq AND khan'] , was 47091.65


#### 1.1.4 create all_tweets_df dataframe
First, check how many items were downloaded.

In [6]:
all_tweets_df = pd.DataFrame(columns=['tweet_id', 
                                      'tweet_date', 
                                      'tweeter_id', 
                                      'tweeter_user_name', 
                                      'tweeter_screen_name', 
                                      'tweeter_location',
                                      'tweeter_coordinates',
                                      'message_text',
                                      'in_reply_to_user_screen_name', 
                                      'quote_tweet_screen_name',
                                      'favourite_count',
                                      'retweet_count',
                                      'extract_run_date',
                                      'retrieved_using_search_term'])

for these_tweets in all_tweets:
    print('number of tweets in current list = {}'.format(len(these_tweets)))

    df_tweets = pd.DataFrame(these_tweets)
    all_tweets_df = pd.concat([all_tweets_df, df_tweets], ignore_index=True)  # DataFrame.append was removed in pandas 2.0

print(all_tweets_df.shape)

all_tweets_df.to_csv(outputfile_str, index=False)

all_tweets_df.head()

number of tweets in current list = 11607
(11607, 14)


Unnamed: 0,tweet_id,tweet_date,tweeter_id,tweeter_user_name,tweeter_screen_name,tweeter_location,tweeter_coordinates,message_text,in_reply_to_user_screen_name,quote_tweet_screen_name,favourite_count,retweet_count,extract_run_date,retrieved_using_search_term
0,1421147981890347008,2021-07-30 16:37:37,739449695957884928,Tommy Brexit تومي بركزت,Brexit4us,"London, England",,"RT @LTHlondon: Gridlock, gridlock and more GRI...",,,0,1,20210730,sadiq AND khan
1,1421147788851793928,2021-07-30 16:36:51,215217791,David,WeeksyD,South Wales,,RT @torysleazeUK: https://t.co/UcoFsOGkwb🏴‍☠️ ...,,,0,12,20210730,sadiq AND khan
2,1421147566780059655,2021-07-30 16:35:58,1132230860,Annie Artist 💙,Artyannie,South Lanarkshire Scotland,,RT @GoodLawProject: BREAKING: New emails revea...,,,0,4285,20210730,sadiq AND khan
3,1421146682201411587,2021-07-30 16:32:27,19291485,LTH🇬🇧london,LTHlondon,"London, England",,"Gridlock, gridlock and more GRIDLOCK! For as f...",,,1,1,20210730,sadiq AND khan
4,1421146614907932675,2021-07-30 16:32:11,442906117,Alistair James Baker,Albaker1984,London,,RT @LBC: 'I pay £15 a day to drive 160 yards.'...,,,0,11,20210730,sadiq AND khan


### 1.2. Load previously saved Twitter data
- Need to change the file name passed to 'load_file_name' if we want a prior dataset
- Will eventually concatenate all the files
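
The planned concatenation of the weekly files could look something like the sketch below. It is only an outline of the idea: `combine_extracts` is a hypothetical helper, it assumes every extract follows the `raw_tweets_YYYYMMDD.csv` naming used above, and it uses the standard library `csv` module to stay self-contained. The pandas equivalent would be `pd.concat` over the files followed by `drop_duplicates(subset='tweet_id')`.

```python
import csv
import glob
import os


def combine_extracts(folder, pattern="raw_tweets_*.csv", key="tweet_id"):
    """Read every matching CSV in folder and keep one row per tweet id."""
    combined = {}
    # Sorting by file name also sorts by date because of the YYYYMMDD suffix.
    for path in sorted(glob.glob(os.path.join(folder, pattern))):
        with open(path, newline="", encoding="utf-8") as handle:
            for row in csv.DictReader(handle):
                # Keep the row from the earliest extract that contains the tweet.
                combined.setdefault(row[key], row)
    return list(combined.values())
```

Keeping the earliest row is an arbitrary choice; keeping the latest instead would preserve the most recent favourite and retweet counts.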

In [7]:
load_file_name = outputfile_str

all_tweets_df_new = pd.read_csv(load_file_name)
print(all_tweets_df_new.shape)
all_tweets_df_new.head()

(11607, 14)


Unnamed: 0,tweet_id,tweet_date,tweeter_id,tweeter_user_name,tweeter_screen_name,tweeter_location,tweeter_coordinates,message_text,in_reply_to_user_screen_name,quote_tweet_screen_name,favourite_count,retweet_count,extract_run_date,retrieved_using_search_term
0,1421147981890347008,2021-07-30 16:37:37,739449695957884928,Tommy Brexit تومي بركزت,Brexit4us,"London, England",,"RT @LTHlondon: Gridlock, gridlock and more GRI...",,,0,1,20210730,sadiq AND khan
1,1421147788851793928,2021-07-30 16:36:51,215217791,David,WeeksyD,South Wales,,RT @torysleazeUK: https://t.co/UcoFsOGkwb🏴‍☠️ ...,,,0,12,20210730,sadiq AND khan
2,1421147566780059655,2021-07-30 16:35:58,1132230860,Annie Artist 💙,Artyannie,South Lanarkshire Scotland,,RT @GoodLawProject: BREAKING: New emails revea...,,,0,4285,20210730,sadiq AND khan
3,1421146682201411587,2021-07-30 16:32:27,19291485,LTH🇬🇧london,LTHlondon,"London, England",,"Gridlock, gridlock and more GRIDLOCK! For as f...",,,1,1,20210730,sadiq AND khan
4,1421146614907932675,2021-07-30 16:32:11,442906117,Alistair James Baker,Albaker1984,London,,RT @LBC: 'I pay £15 a day to drive 160 yards.'...,,,0,11,20210730,sadiq AND khan


# Appendix A - Twitter Timeout error
The following error message was received when extracting data using the search term "sadiq AND khan".

### error message
TimeoutError                              Traceback (most recent call last)
~\anaconda3\lib\site-packages\urllib3\connection.py in _new_conn(self)
    158         try:
--> 159             conn = connection.create_connection(
    160                 (self._dns_host, self.port), self.timeout, **extra_kw

~\anaconda3\lib\site-packages\urllib3\util\connection.py in create_connection(address, timeout, source_address, socket_options)
     83     if err is not None:
---> 84         raise err
     85 

~\anaconda3\lib\site-packages\urllib3\util\connection.py in create_connection(address, timeout, source_address, socket_options)
     73                 sock.bind(source_address)
---> 74             sock.connect(sa)
     75             return sock

TimeoutError: [WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond

During handling of the above exception, another exception occurred:

NewConnectionError                        Traceback (most recent call last)
~\anaconda3\lib\site-packages\urllib3\connectionpool.py in urlopen(self, method, url, body, headers, retries, redirect, assert_same_host, timeout, pool_timeout, release_conn, chunked, body_pos, **response_kw)
    669             # Make the request on the httplib connection object.
--> 670             httplib_response = self._make_request(
    671                 conn,

~\anaconda3\lib\site-packages\urllib3\connectionpool.py in _make_request(self, conn, method, url, timeout, chunked, **httplib_request_kw)
    380         try:
--> 381             self._validate_conn(conn)
    382         except (SocketTimeout, BaseSSLError) as e:

~\anaconda3\lib\site-packages\urllib3\connectionpool.py in _validate_conn(self, conn)
    975         if not getattr(conn, "sock", None):  # AppEngine might not have  `.sock`
--> 976             conn.connect()
    977 

~\anaconda3\lib\site-packages\urllib3\connection.py in connect(self)
    307         # Add certificate verification
--> 308         conn = self._new_conn()
    309         hostname = self.host

~\anaconda3\lib\site-packages\urllib3\connection.py in _new_conn(self)
    170         except SocketError as e:
--> 171             raise NewConnectionError(
    172                 self, "Failed to establish a new connection: %s" % e

NewConnectionError: <urllib3.connection.HTTPSConnection object at 0x00000183218E7B50>: Failed to establish a new connection: [WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond

During handling of the above exception, another exception occurred:

MaxRetryError                             Traceback (most recent call last)
~\anaconda3\lib\site-packages\requests\adapters.py in send(self, request, stream, timeout, verify, cert, proxies)
    438             if not chunked:
--> 439                 resp = conn.urlopen(
    440                     method=request.method,

~\anaconda3\lib\site-packages\urllib3\connectionpool.py in urlopen(self, method, url, body, headers, retries, redirect, assert_same_host, timeout, pool_timeout, release_conn, chunked, body_pos, **response_kw)
    723 
--> 724             retries = retries.increment(
    725                 method, url, error=e, _pool=self, _stacktrace=sys.exc_info()[2]

~\anaconda3\lib\site-packages\urllib3\util\retry.py in increment(self, method, url, response, error, _pool, _stacktrace)
    438         if new_retry.is_exhausted():
--> 439             raise MaxRetryError(_pool, url, error or ResponseError(cause))
    440 

MaxRetryError: HTTPSConnectionPool(host='api.twitter.com', port=443): Max retries exceeded with url: /1.1/users/show.json?id=862885892 (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x00000183218E7B50>: Failed to establish a new connection: [WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond'))

During handling of the above exception, another exception occurred:

ConnectionError                           Traceback (most recent call last)
~\anaconda3\lib\site-packages\tweepy\binder.py in execute(self)
    183                 try:
--> 184                     resp = self.session.request(self.method,
    185                                                 full_url,

~\anaconda3\lib\site-packages\requests\sessions.py in request(self, method, url, params, data, headers, cookies, files, auth, timeout, allow_redirects, proxies, hooks, stream, verify, cert, json)
    529         send_kwargs.update(settings)
--> 530         resp = self.send(prep, **send_kwargs)
    531 

~\anaconda3\lib\site-packages\requests\sessions.py in send(self, request, **kwargs)
    642         # Send the request
--> 643         r = adapter.send(request, **kwargs)
    644 

~\anaconda3\lib\site-packages\requests\adapters.py in send(self, request, stream, timeout, verify, cert, proxies)
    515 
--> 516             raise ConnectionError(e, request=request)
    517 

ConnectionError: HTTPSConnectionPool(host='api.twitter.com', port=443): Max retries exceeded with url: /1.1/users/show.json?id=862885892 (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x00000183218E7B50>: Failed to establish a new connection: [WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond'))

During handling of the above exception, another exception occurred:

TweepError                                Traceback (most recent call last)
<ipython-input-7-c24d3d2b3d48> in <module>
     18 
     19 for search_term in search_terms:
---> 20     current_tweets = get_tweets(search_term, api, today)
     21     all_tweets.append(current_tweets)
     22 

<ipython-input-4-92b12659b409> in get_tweets(search_words, my_api, today)
     16         screen_name = tweet.user.screen_name
     17         user_name = tweet.user.name
---> 18         user_id = api.get_user(user_id_str)
     19         user_location = user_id.location
     20         user_coordinates = tweet.coordinates

~\anaconda3\lib\site-packages\tweepy\binder.py in _call(*args, **kwargs)
    251                 return method
    252             else:
--> 253                 return method.execute()
    254         finally:
    255             method.session.close()

~\anaconda3\lib\site-packages\tweepy\binder.py in execute(self)
    190                                                 proxies=self.api.proxy)
    191                 except Exception as e:
--> 192                     six.reraise(TweepError, TweepError('Failed to send request: %s' % e), sys.exc_info()[2])
    193 
    194                 rem_calls = resp.headers.get('x-rate-limit-remaining')

~\anaconda3\lib\site-packages\six.py in reraise(tp, value, tb)
    700                 value = tp()
    701             if value.__traceback__ is not tb:
--> 702                 raise value.with_traceback(tb)
    703             raise value
    704         finally:

~\anaconda3\lib\site-packages\tweepy\binder.py in execute(self)
    182                 # Execute request
    183                 try:
--> 184                     resp = self.session.request(self.method,
    185                                                 full_url,
    186                                                 data=self.post_data,

~\anaconda3\lib\site-packages\requests\sessions.py in request(self, method, url, params, data, headers, cookies, files, auth, timeout, allow_redirects, proxies, hooks, stream, verify, cert, json)
    528         }
    529         send_kwargs.update(settings)
--> 530         resp = self.send(prep, **send_kwargs)
    531 
    532         return resp

~\anaconda3\lib\site-packages\requests\sessions.py in send(self, request, **kwargs)
    641 
    642         # Send the request
--> 643         r = adapter.send(request, **kwargs)
    644 
    645         # Total elapsed time of the request (approximately)

~\anaconda3\lib\site-packages\requests\adapters.py in send(self, request, stream, timeout, verify, cert, proxies)
    514                 raise SSLError(e, request=request)
    515 
--> 516             raise ConnectionError(e, request=request)
    517 
    518         except ClosedPoolError as e:

TweepError: Failed to send request: HTTPSConnectionPool(host='api.twitter.com', port=443): Max retries exceeded with url: /1.1/users/show.json?id=862885892 (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x00000183218E7B50>: Failed to establish a new connection: [WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond'))

# Appendix B - Twitter academic research access
As discussed, I applied for the research track as this offers access to historical Twitter data and also the ability to download higher volumes of tweets. Unfortunately, my access requests were rejected.

My original request failed, with Twitter responding that I did not meet their use case for a research account. I believe this is because they wanted to be able to reference my name via an official university website, for example within a student directory. However, City University do not make student directories publicly available (for entirely understandable reasons), which suggests the research account isn't readily available to students and is aimed more at researchers and faculty members who can be found via the university website. I then reapplied using my City email address, having set up a Twitter account linked to this email address. I am waiting for a response (25/07/2021).
- I received a response on 25/07/2021 saying my request did not "qualify for academic access to the Twitter API". Twitter do not give specific reasons, so it is not possible to tell whether they refuse access when they cannot identify students in a university directory or whether this specific research falls outside what they consider acceptable research (although they did say it qualified for regular developer access, which suggests the research topic was OK).

In any case, I go into detail on the application process because not having access to historic tweets significantly impacts our ability to perform the desired research, and the process to get access is time-consuming and opaque, with no right of appeal. It would discourage me from doing further academic research into Twitter data.