# Introduction to Twitter Scraping for Researchers

This workbook provides some simple, working examples for researchers who would like to collect information from Twitter.  While Twitter provides its own tools and libraries for this, they are a little too granular and possibly unfamiliar to many in the research community.  For this reason this workbook uses a Python library built by a third party that greatly streamlines the process of collecting tweets.

The Python library used is called TwitterAPI (no space) and it can be found at https://github.com/geduldig/TwitterAPI.  Most of the code throughout this workbook is drawn directly from the examples on these pages.

What this workbook does not do, and what most researchers will likely need to add to their project, is show how to put the scraped tweets into a database.  If researchers are interested in writing to a database then it should be straightforward to consult the documentation for connecting that database to Python and then substitute the appropriate commands for the sections of this workbook that write to files.

On the assumption that you have a developer account and a Python environment that can run this notebook let's get started by installing the TwitterAPI library.

In [None]:
!pip install TwitterAPI

With the TwitterAPI library installed on the system we should be able to open it for use throughout this workbook with the following command:

In [None]:
from TwitterAPI import TwitterAPI

[Note that almost every piece of code in the remainder of this workbook assumes that the cell above has already been run.  If you open the workbook and try to run another cell first, you will likely receive an error.  Simply run the cell above and then rerun the cell you wanted.]

## Authentication

Use of this workbook, and the Twitter Developer API in general, requires a developer account from Twitter.  In the early days of Twitter anyone with a regular Twitter account who requested a developer account would simply be given one.  Twitter now screens requests for developer accounts, a process that can delay getting started by many days.  If you had a developer account previously and created applications (apps) that used the Twitter application programming interfaces (APIs) then you may still be able to use these apps to do some work, but it is possible that their ability to access the Twitter archive has been reduced and, if so, that you'll need to apply for a new developer account to correct this.  A developer account may be requested from https://developer.twitter.com/en/apply/user.

As the [TwitterAPI Documentation](https://geduldig.github.io/TwitterAPI/authentication.html) points out: _Twitter supports both user and application authentication, called oAuth 1 and oAuth 2, respectively. User authentication gives you access to all API endpoints, basically read and write persmission. It is also required in order to using the Streaming API. Application authentication gives you access to just the read portion of the API – so, no creating or destroying tweets. Application authentication, however, has elevated rate limits._  
We will use oAuth1 throughout this workbook even though we'll only be reading tweets since it can be used in more situations (in particular when we try to read from the streaming API).  If it is necessary to read Twitter API endpoints (other than the streaming endpoint) at a faster rate than this workbook initially provides then consider switching to oAuth2.

Both authentication methods will require you to collect some information about keys and tokens and paste it into the appropriate section of the cell below.  This key and token information is generated when you create a profile for an app on the Twitter Developer site.  App profiles can be created at https://developer.twitter.com/en/apps.  That same page will hold a list of all the profiles that you have created and clicking on the "Details" button for each app will bring you to a summary page.  There will be a link/tab near the top of the page called "Keys and Tokens" and clicking this will bring you to the page with the key and token information.
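
Because these keys and tokens grant access to your account, you may prefer not to leave them pasted in a notebook that gets shared.  One option, sketched below, is to read them from environment variables.  The variable names and the helper `load_credentials` are made up for this example and are not part of TwitterAPI; the demonstration uses a stand-in dictionary rather than your real environment.

```python
import os

# Hypothetical variable names; use whatever names you set in your own environment
REQUIRED = ('TWITTER_API_KEY', 'TWITTER_API_KEY_SECRET',
            'TWITTER_ACCESS_TOKEN', 'TWITTER_ACCESS_TOKEN_SECRET')

def load_credentials(env=os.environ):
    """Return the four oAuth1 values from `env`, or raise if any are missing."""
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise KeyError('Missing environment variables: ' + ', '.join(missing))
    return tuple(env[name] for name in REQUIRED)

# Demonstration with a stand-in dictionary instead of the real environment
fake_env = {name: 'placeholder-' + str(i) for i, name in enumerate(REQUIRED)}
creds = load_credentials(fake_env)
print(len(creds))
```

With real variables set, the four values returned by `load_credentials()` could be passed to `TwitterAPI(...)` in place of the pasted strings.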

!!! IMPORTANT !!!

If the code in the cells below fails it is likely because you need to put in your own authentication details.

!!! IMPORTANT !!!

### oAuth1 (User Identification)
Paste in the required key and token information from the Twitter Developer site into the cell below and then run it in order to use this workbook.

In [None]:
# Replace each placeholder with the value from your app's "Keys and Tokens" page
api_key = 'YOUR_API_KEY'
api_key_secret = 'YOUR_API_KEY_SECRET'
access_token = 'YOUR_ACCESS_TOKEN'
access_token_secret = 'YOUR_ACCESS_TOKEN_SECRET'

api = TwitterAPI(api_key, 
                 api_key_secret, 
                 access_token, 
                 access_token_secret)

api.auth

If successful the output of the cell above should look something like:

    <requests_oauthlib.oauth1_auth.OAuth1 at 0x107b8bba8>

### oAuth2 (App Identification)
!!!WARNING!!! 

Pasting in the required key and token information from the Twitter Developer site into the cell below and then running it will prevent you from using the streaming endpoint, which requires oAuth1.  If you choose to try oAuth2 in the streaming example and receive the following error

    TwitterRequestError: Twitter request failed (401)

then simply complete the code cell in the oAuth1 section and run it.

!!!WARNING!!! 

In [None]:
# Replace each placeholder with the value from your app's "Keys and Tokens" page
api_key = 'YOUR_API_KEY'
api_key_secret = 'YOUR_API_KEY_SECRET'

api = TwitterAPI(api_key,
                 api_key_secret,
                 auth_type='oAuth2')

api.auth

If successful the output of the cell above should look something like:

    <TwitterAPI.BearerAuth.BearerAuth at 0x107b9acc0>

## Looking at Tweets
Given that most people refer to tweets as bits of text that are usually 140 characters or less it is generally surprising to discover that this is only the proverbial "tip of the iceberg" in terms of what a tweet really is.  In this section we'll see exactly what a tweet is, how to improve looking at the full content, and then how to grab the portions that we want (usually the "text").

To make this easy we'll only request a single tweet by its ID number.  Every tweet has its own unique ID and can be requested if that ID is known.  We request the tweet with ID# 210462857140252672 and then print the response object.

In [None]:
r = api.request('statuses/show/:%d' % 210462857140252672)
print(r)

The output of running the cell above will be something like `<TwitterAPI.TwitterAPI.TwitterResponse object at 0x107b9af28>`, which isn't quite what we are looking for.  This `TwitterResponse object` is a bundle of information related to the request including the status code returned (`r.status_code`), how much of your quota is left (`r.get_quota()`), the response headers (`r.headers`), etc.  What we want is the "text" portion of this response (`r.text`).

In [None]:
r.text

That's a lot more than 140 characters!

Exactly what is there is hard to determine though given the formatting.  We can do better.

The format that this content is in is JavaScript Object Notation (JSON), which is really a nested list of properties.  Python doesn't know this is JSON though, so we need to tell it.  We do this in the next cell by importing the `json` library, converting `r.text` to JSON using the load string method ( `.loads()` ), and then outputting that JSON with formatting using the output string method ( `.dumps()` ) with some options added for readability.

In [None]:
import json
parsed_r = json.loads(r.text)
print(json.dumps(parsed_r, indent=2, sort_keys=True))

This is much nicer to read, especially since the various components have been alphabetized.  Having the response text as JSON also allows us to easily access each subcomponent.  We show this in the next cell by printing the text (sometimes called the "body" of the tweet), the ID# of the tweet, and the screen name of the user.

In [None]:
print("Tweet Body: ",parsed_r['text'])
print("Tweet ID: ",parsed_r['id'])
print("Screen Name: ",parsed_r['user']['screen_name'])
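
To experiment with this kind of access without spending any of your quota, the same steps can be tried on a hand-written JSON string.  The sample below is made up and heavily trimmed (a real tweet has many more fields).  It also shows `.get()`, which is one way to handle fields that a given tweet may not include.

```python
import json

# Made-up, heavily trimmed stand-in for r.text
sample_text = '{"id": 123, "text": "Hello world", "user": {"screen_name": "example_user"}}'

tweet = json.loads(sample_text)

# Nested properties are reached by chaining keys
screen_name = tweet['user']['screen_name']

# .get() returns a default instead of raising KeyError when a field is absent
place = tweet.get('place', 'no place given')

print(screen_name, '|', place)
```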

Rather than parse the output to JSON every time, we can combine the Twitter Response Object's `.get_iterator()` method with a for-loop to do this directly.  It's less work overall and is cleaner.

In [None]:
r = api.request('statuses/show/:%d' % 210462857140252672)
for item in r.get_iterator():
    print("Tweet Body: ",item['text'])
    print("Tweet ID: ",item['id'])
    print("Screen Name: ",item['user']['screen_name'])

## Streaming
There are two approaches to collecting information from Twitter: grabbing tweets as they are published and searching through the archive of past tweets.  The first approach is known as "streaming" and we'll look at how to use it now.  It is important to note up front that the results returned using this method are incomplete: you will _not_ necessarily capture every single tweet that you intend to this way.  Still, you can get a lot of tweets in a very short time and likely enough to get you started.

Streaming amounts to applying a set of filters to the stream of all tweets and capturing what is matched by those filters.  For this first example we'll simply print the body of each tweet out on the screen.  There are lots of opinions about Donald Trump right now so we'll use 'trump' as the term we are tracking.

!!! IMPORTANT !!!

The cell below will run indefinitely.  You'll want to stop it at some point so be prepared to click the stop button at the top of this workbook once you have some tweets in the output cell.

!!! IMPORTANT !!!

In [None]:
TRACK_TERM = 'trump'

r = api.request('statuses/filter', {'track': TRACK_TERM})

for item in r.get_iterator():
    print(item['text'] if 'text' in item else item)
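
If you would rather not rely on the stop button, one option is to cap the number of items read from the iterator with `itertools.islice` from the standard library.  The sketch below uses a made-up generator, `fake_stream`, in place of `r.get_iterator()` so that it runs without a connection; `MAX_ITEMS` is likewise a hypothetical name.

```python
from itertools import islice

def fake_stream():
    """Stand-in for r.get_iterator(): yields made-up tweets forever."""
    n = 0
    while True:
        n += 1
        yield {'id': n, 'text': 'made-up tweet %d' % n}

MAX_ITEMS = 5  # hypothetical cap on how many items to read

collected = []
for item in islice(fake_stream(), MAX_ITEMS):
    collected.append(item['text'] if 'text' in item else item)

print(collected)
```

In the streaming cell the same idea would read `for item in islice(r.get_iterator(), MAX_ITEMS):`.  Note that this only stops the loop from reading further; how quickly the underlying connection is released is not something this sketch controls.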

# Write to File

As satisfying as it may be to have an endless stream of tweets scroll in front of you, there isn't much value in it unless you are able to capture the tweets for analysis in the future.  The ideal way to do this is to write the tweets to a database because then you will have the search features of the database at your disposal.  However, as mentioned at the start of this workbook, selecting and setting up a database goes beyond what will be covered here.  (If you're not sure where to start, [MongoDB](https://www.mongodb.com/) is a good choice given that its internal format is very similar to the JSON that tweets come in.)
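
For a rough idea of what writing to a database might involve, the sketch below stores a tweet's ID and text in SQLite.  SQLite is used here only because its driver, `sqlite3`, ships with Python; it is not the database suggested above.  The table name `tweets`, the helper `save_tweet`, and the sample tweet are all made up for this example.

```python
import sqlite3

def save_tweet(conn, item):
    """Insert one tweet (a dict with 'id' and 'text') into the tweets table."""
    conn.execute(
        "INSERT OR IGNORE INTO tweets (id, text) VALUES (?, ?)",
        (item['id'], item['text']),
    )
    conn.commit()

# An in-memory database keeps the example self-contained;
# pass a filename instead to keep the data on disk.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE IF NOT EXISTS tweets (id INTEGER PRIMARY KEY, text TEXT)")

# Made-up tweet standing in for an item returned by the API
sample = {'id': 1, 'text': 'Pizza for lunch\nagain'}
save_tweet(conn, sample)
save_tweet(conn, sample)  # a repeated ID is ignored rather than stored twice

rows = conn.execute("SELECT id, text FROM tweets").fetchall()
print(rows)
```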

Here we will simply write the text and ID number to a file.  We do this using `with` because this method of opening a file ensures that it is closed properly if the program halts unexpectedly, something that is inevitable at this point because the only way we have to stop the code is to interrupt it.  The tweet ID is cast to a string using the `str` function because the `.write()` method requires a single string as input.  Note as well the addition of the line break (`\n`) to ensure that each tweet body starts on a new line.

In [None]:
TRACK_TERM = 'trump'

r = api.request('statuses/filter', {'track': TRACK_TERM})

with open("streamTweets.csv","a") as outfile:
    for item in r.get_iterator():
        if 'text' in item:
            # A tweet: write the body and the ID, separated by a comma
            line = item['text'] + ',' + str(item['id'])
            print(line)
            outfile.write(line + '\n')
        else:
            # Not a tweet (e.g. a notice from Twitter): show it but don't write it
            print(item)

Looking at the output (there should be a file called "streamTweets.csv" in the same directory as this notebook) we can see that the body of each tweet is printed on its own line followed by a comma which is followed by the tweet ID... mostly.  Scrolling through the list will reveal that there are tweets that span multiple lines and blank spaces.  Why is this?  The body of some tweets includes line break characters (` \n `).

If we compare what was printed to the screen to what was written in the file we'll see the `\n`'s on the screen translated to blank lines in the file.

While there is an argument to be made that removing these characters makes no difference to the content of the tweet the counter argument is that line breaks are important punctuation and should be kept with the original tweet.  We should keep the line breaks.

A popular way to do this would be to use a regular expression and replace each `\n` that occurs with a `\\\\n` so that each `\` is appropriately escaped as it is passed from the variable to the file.  The problem with this method is that there are other characters that might appear as well (either in tweets or elsewhere) and while we could write a regular expression to do the substitution in each case Python offers a better way: the [representation function](https://docs.python.org/3.5/library/functions.html#repr).  To do this we pass the line variable to the function `repr` as we write it to the output file, as in the example below.

(Remember to stop the cell after a few seconds.)

In [None]:
TRACK_TERM = 'myocardial infarction'

r = api.request('statuses/filter', {'track': TRACK_TERM})

with open("sentimentTest.csv","a") as outfile:
    for item in r.get_iterator():
        if 'text' in item:
            line = item['text'] + ',' + str(item['id'])
            print(line)
            # repr() escapes line breaks so each tweet stays on one line
            outfile.write(repr(line) + '\n')
        else:
            # Not a tweet (e.g. a notice from Twitter): show it but don't write it
            print(item)

Checking sentimentTest.csv shows that this approach is working well.  We won't write to an output file in every example in the rest of this workbook, but keep in mind that you can use the same methods in all the examples that follow.
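
To see what `repr` does without connecting to Twitter, the short sketch below applies it to a made-up tweet body that contains a line break and an apostrophe.  It also shows one way to recover the original string later, using `ast.literal_eval` from the standard library; that step is an addition for illustration and not something the cells above rely on.

```python
import ast

# Made-up tweet body containing a line break and an apostrophe
body = "Heart attack symptoms:\nDon't ignore chest pain"
line = body + ',' + str(42)

stored = repr(line)                   # what gets written to the file
restored = ast.literal_eval(stored)   # what an analysis script can read back

print(stored)
print(restored == line)
```

The stored form keeps the whole tweet on one line, with the line break written as the two characters `\n`, and reading it back yields the original text unchanged.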

# Standard Search

The [Standard Search API](https://developer.twitter.com/en/docs/tweets/search/api-reference/get-search-tweets) allows for searching in the past 7 days.  It is rate limited to 180 requests per 15 minutes using oAuth1 and 450 requests per 15 minutes using oAuth2.  It is also "not exhaustive", meaning that the full body of tweets matching the search criteria within the window is unlikely to be returned (unless, perhaps, that body of tweets is very small).
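
To put these limits in perspective, the arithmetic below works out how often a request can be sent and the most tweets a single 15-minute window could yield, given that one standard search request returns at most 100 tweets.  The helper name `budget` is made up, and the ceiling is an upper bound only since searches often return fewer tweets than requested.

```python
WINDOW_SECONDS = 15 * 60   # length of the rate-limit window
MAX_PER_REQUEST = 100      # most tweets one standard search request can return

def budget(requests_per_window):
    """Return (seconds between requests, most tweets per window)."""
    spacing = WINDOW_SECONDS / requests_per_window
    ceiling = requests_per_window * MAX_PER_REQUEST
    return spacing, ceiling

oauth1 = budget(180)
oauth2 = budget(450)
print(oauth1)  # (5.0, 18000)
print(oauth2)  # (2.0, 45000)
```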

We'll shift away from the politically hot topic of Donald Trump to the topic of pizza for these next examples.

Note the use of the `.get_quota()` method on the response object in order to see how much of our quota remains.

In [None]:
SEARCH_TERM = 'pizza pie'

r = api.request('search/tweets', {'q': SEARCH_TERM})

for item in r.get_iterator():
    print(item['text'] if 'text' in item else item)

print('\nQUOTA: %s' % r.get_quota())

Seems to be working but there are not many tweets being returned.  We can increase this by specifying the `count` parameter.  We'll also add a simple counter just to see what we are actually getting.  

The set of all the parameters that can be invoked is available [HERE](https://developer.twitter.com/en/docs/tweets/search/api-reference/get-search-tweets).

In [None]:
SEARCH_TERM = '#pizza'
COUNT = 100 

r = api.request('search/tweets', {'q': SEARCH_TERM, 'count': COUNT})

a = 1
for item in r.get_iterator():
    print(a)
    print(item['text'] if 'text' in item else item)
    a=a+1

print('\nQUOTA: %s' % r.get_quota())

So, not 100 tweets but certainly more than before.  To go back further we'll need to look at paging.

# Paging 
Twitter returns results in chunks that are called "pages".  In the example above we are seeing just the first page of results, and the `count` parameter sets the maximum number of results to return per page.  If you want to go back further then it becomes necessary to send multiple requests to the API in succession, with each one asking for the next page.  While this can be implemented "by hand", TwitterAPI makes this much easier by providing a paging class called `TwitterPager` (you can read more about it [HERE](https://geduldig.github.io/TwitterAPI/paging.html)) that does all of the heavy lifting for you by tracking what page to ask for, ensuring the request rate is not too high, and generally managing the connection.  It is invoked in an almost identical way to everything you have seen so far in this workbook.

Unlike the previous examples where we were printing out the body of each tweet here we will print out only the date the tweet was created and the ID.  This is done simply because it makes it easier to track what is happening.

It is important to note that as with the streaming API endpoint you will need to stop the code from running.  While you will eventually hit the 7-day limit it is unlikely that you want to wait that long. 

In [None]:
from TwitterAPI import TwitterPager

SEARCH_TERM = 'pizza'
COUNT = 100

pager = TwitterPager(api, 'search/tweets', {'q': SEARCH_TERM, 'count': COUNT})

for item in pager.get_iterator():
    #print(item['text'] if 'text' in item else item)
    print((item['created_at'], item['id']) if 'text' in item else item)
    

You'll note that as the code runs the dates and the tweet IDs both roll backwards as the search tool moves from the present (tweets become accessible via the search APIs about 30 seconds after they are created) to the past.
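
To make the idea of paging "by hand" concrete, the sketch below simulates it against a made-up list of tweet IDs instead of the real API.  `ARCHIVE`, `fake_search`, and `page_through` are invented for this illustration.  The one detail borrowed from the real search endpoint is the `max_id` parameter, which asks for tweets with IDs at or below the given value.

```python
# Made-up archive of tweet IDs, newest first
ARCHIVE = list(range(1000, 990, -1))   # 1000, 999, ..., 991

def fake_search(count, max_id=None):
    """Stand-in for one search request: newest `count` IDs at or below max_id."""
    matches = [i for i in ARCHIVE if max_id is None or i <= max_id]
    return matches[:count]

def page_through(count):
    """Collect every ID by requesting one page after another."""
    collected, max_id = [], None
    while True:
        page = fake_search(count, max_id)
        if not page:
            break                      # nothing older is left
        collected.extend(page)
        max_id = page[-1] - 1          # next page starts just below the oldest ID seen
    return collected

ids = page_through(count=4)
print(ids)
```

Each pass requests the next page starting just below the oldest ID seen so far, which is why the IDs in the output only ever decrease.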

The TwitterAPI documentation provides some [advanced ways to add fault tolerance](https://geduldig.github.io/TwitterAPI/faulttolerance.html).  These are fairly sophisticated and involve checking status codes and the like.  While these are valuable, it is inevitable that you will end up halting your program for one reason or another and will need it to restart scraping where it left off.

The code below does this by first checking whether there is an object called `item` that has a value keyed to 'id', which will be the case if an earlier run of the cell was interrupted.  If there is, it passes one less than this ID to TwitterPager as `max_id` so that all new tweets collected will be earlier than it (`max_id` is inclusive, so subtracting one avoids collecting the same tweet twice).  If there is not, no `max_id` is set and results start from the present.

This code will work as long as the notebook stays open, no matter how often the cell is interrupted.  If you close the notebook and reopen it then you'll need to pass in the ID value from the last line of the output file to restart in the correct location.

In [None]:
from TwitterAPI import TwitterPager

SEARCH_TERM = 'pizza'
COUNT = 100

params = {'q': SEARCH_TERM, 'count': COUNT}

# If an earlier run left a tweet in `item`, resume just below its ID
try:
    params['max_id'] = item['id'] - 1
except (NameError, KeyError, TypeError):
    pass  # no previous tweet available: start from the present

pager = TwitterPager(api, 'search/tweets', params)

with open("restartTweetsTest.csv","a") as outfile:
    for item in pager.get_iterator():
        if 'text' in item:
            line = item['text'] + ',' + str(item['id'])
            print(line)
            outfile.write(repr(line) + '\n')
        else:
            # Not a tweet (e.g. an error message): show it but don't write it
            print(item)
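
For the case where the notebook has been closed, the sketch below reads the ID back from the last line of an output file.  It assumes each row was written as `repr(text + ',' + id)` followed by a line break, as in the cell above.  The helper `last_tweet_id` is made up for this example, and the demonstration writes two invented rows to a temporary file rather than touching your real output.

```python
import ast
import os
import tempfile

def last_tweet_id(path):
    """Return the tweet ID on the last non-empty line of `path`, or None."""
    last = None
    with open(path) as infile:
        for row in infile:
            if row.strip():
                last = row.strip()
    if last is None:
        return None
    line = ast.literal_eval(last)          # undo repr()
    return int(line.rsplit(',', 1)[1])     # the ID follows the final comma

# Demonstration on a temporary file with two made-up rows
path = os.path.join(tempfile.mkdtemp(), 'restartTweetsTest.csv')
with open(path, 'a') as outfile:
    outfile.write(repr('first, with a comma\nand a break,1002') + '\n')
    outfile.write(repr("second one's here,1001") + '\n')

resume_from = last_tweet_id(path)
print(resume_from)
```

The returned value could then stand in for `item['id']` when restarting the search in a fresh notebook session.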

# Premium Access

Access beyond stream filtering and searching imperfectly through the past 7 days requires some extra steps beyond simply making an app.

1. Setting up a dev environment.  Within the Twitter developer site click on your name in the top right corner.  From the menu select "Dev environments".  Follow the interface to create the environments that you would like and associate an app with each.
2. Note the name of each dev environment because it will go into one of the variables called `LABEL`, below.  I named my 30-Day development environment "30DayTesting" and my full archive development environment "fullArchiveTesting".  So in the 30 Day example I set `LABEL` to "30DayTesting" and in the Full Archive example I set `LABEL` to "fullArchiveTesting".


## 30 Day

In [None]:
from TwitterAPI import TwitterAPI

SEARCH_TERM = 'pizza'
PRODUCT = '30day'
LABEL = '30DayTesting'

r = api.request('tweets/search/%s/:%s' % (PRODUCT, LABEL), 
                {'query':SEARCH_TERM})

for item in r:
    print(item['text'] if 'text' in item else item)

## Full Archive

In [None]:
from TwitterAPI import TwitterAPI

SEARCH_TERM = 'pizza'
PRODUCT = 'fullarchive'
LABEL = 'fullArchiveTesting'

r = api.request('tweets/search/%s/:%s' % (PRODUCT, LABEL), 
                {'query':SEARCH_TERM})

for item in r:
    print(item['text'] if 'text' in item else item)
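
Both premium examples above send only a `query`.  To the best of my understanding of Twitter's premium search documentation, these endpoints also accept `fromDate` and `toDate` parameters, given in UTC at minute resolution in the form YYYYMMDDHHmm.  The sketch below builds such a parameter dictionary; the helper `premium_timestamp` and the dates are made up, and you should confirm the parameter names against the current documentation before relying on them.

```python
from datetime import datetime

def premium_timestamp(moment):
    """Format a datetime as YYYYMMDDHHmm (minute resolution)."""
    return moment.strftime('%Y%m%d%H%M')

# Made-up one-week window
params = {
    'query': 'pizza',
    'fromDate': premium_timestamp(datetime(2019, 3, 1, 0, 0)),
    'toDate': premium_timestamp(datetime(2019, 3, 8, 0, 0)),
}
print(params)
```

A dictionary like this would take the place of `{'query':SEARCH_TERM}` in the requests above.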