# An extensive guide to collecting Tweets using the newly released Academic Research Twitter API package and Python 3

![header-photo-twitter-api-v2](https://github.com/AndrewEdward37/Twitter_API_V2_Tutorial/blob/main/images/header-image.png?raw=true)
Background photo credit: [NASA](https://unsplash.com/photos/Q1p7bh3SHj8)

## Table of Contents:

1. [Introduction](#1.-Introduction)
2. [Pre-requisites to Start](#2.-Pre-requisites-to-Start)
3. [Bearer Token](#3.-Bearer-Token)
4. [Create Headers](#4.-Create-Headers)
5. [Create URL](#5.-Create-URL)
6. [Connect to Endpoint](#6.-Connect-to-Endpoint)
7. [Putting it all together](#7.-Putting-it-all-Together)
8. [Save Results to CSV](#8.-Save-Results-to-CSV)
9. [Looping Through Requests](#9.-Looping-Through-Requests)

# 1. Introduction

At the end of 2020, Twitter introduced a new Twitter API built from the ground up. Twitter API v2 offers more data you can pull and analyze, new endpoints, and many new features.

With the introduction of that new API, Twitter also introduced a new powerful free product for academics: [The Academic Research product track](https://developer.twitter.com/en/solutions/academic-research).

The track grants free access to full-archive search and other v2 endpoints, with a volume cap of 10,000,000 tweets per month!
To find out whether you qualify for the track, check this [link](https://developer.twitter.com/en/solutions/academic-research/application-info).

I have recently worked on a data analysis research project at Carnegie Mellon University using the capabilities that this track offers, and the power it gives you as a researcher is mind-blowing!

Yet since v2 of the API is fairly new, fewer resources exist to help if you run into issues while collecting data for your research.

So, in this article, I will walk through the process step by step: setting up, accessing endpoints, and saving the collected tweets in CSV format for later analysis.

This article will be using an endpoint that is available only for the Academic Research track ([Full-archive Search](https://developer.twitter.com/en/docs/twitter-api/tweets/search/api-reference/get-tweets-search-all) endpoint), but almost everything in this guide can be applied to any other endpoint that is available for all developer accounts.

If you do not have access to the Full-archive Search endpoint, you can still follow this tutorial using the [Recent Search](https://developer.twitter.com/en/docs/twitter-api/tweets/search/api-reference/get-tweets-search-recent) endpoint.

# 2. Pre-requisites to Start

First, we are going to import some essential libraries for this guide.

In [30]:
# For sending GET requests to the API
import requests

# For saving access tokens and for file management when creating and adding to the dataset
import os

# For dealing with json responses we receive from the API
import json

# For displaying the data after
import pandas as pd

# For saving the response data in CSV format
import csv

# For parsing the dates received from twitter in readable formats
import datetime
import dateutil.parser
import unicodedata

#To add wait time between requests
import time

To be able to send your first request to the Twitter API, you need to have a developer account. If you don't have one yet, you can apply for one [here](https://developer.twitter.com/en/apply-for-access)! (Don't worry, it's free, and you just need to provide some information about the research you plan on pursuing.)

Got an approved developer account? Fantastic!

All you need to do is create a project and connect an App through the developer portal and we are set to go!

1. Go to the [developer portal dashboard](https://developer.twitter.com/en/portal/dashboard)
2. Sign in with your developer account
3. Create a new Project, give it a name, a use-case based on the goal you want to achieve, and a description.

![New Project page](https://raw.githubusercontent.com/AndrewEdward37/Twitter_API_V2_Tutorial/main/images/new-project.png?token=AHQLCWF4J4IOFAMWKOQJGJTAU26HC)

4. Assuming this is your first time, choose ‘create a new App instead’ and give your App a name in order to create a new App.

![New App last step](https://raw.githubusercontent.com/AndrewEdward37/Twitter_API_V2_Tutorial/main/images/name-app.png)

If everything is successful, you should see this page containing your keys and tokens. We will use one of them to access the API.

![keys and token](https://raw.githubusercontent.com/AndrewEdward37/Twitter_API_V2_Tutorial/main/images/tokens-and-keys.png)

# 3. Bearer Token

If you have reached this step, CONGRATULATIONS! You are eligible to send your first request to the API :))

First we will create an auth() function that will have the "Bearer Token" from the app we just created.

Since this Bearer Token is sensitive information, you should not be sharing it with anyone at all. If you are working with a team you don't want anyone to have access to it.

So, we will save the token in an "environment variable".

There are many ways to do this, you can check out these two options:

- [Environment Variables .env with Python](https://www.youtube.com/watch?v=ecshCQU6X2U)
- [Making Use of Environment Variables in Python](https://www.nylas.com/blog/making-use-of-environment-variables-in-python/)

In this article, we will simply run these two lines in our code to set a "TOKEN" environment variable:

```
    import os
    os.environ['TOKEN'] = '<ADD_BEARER_TOKEN>'
```
Just replace <ADD_BEARER_TOKEN> with your bearer token and, after you run the two lines, delete them from your code so the token is not saved with it.

If you get an error from this approach, try any of the links listed above.

Now, we will create our auth() function, which retrieves the token from the environment.

In [31]:
def auth():
    return os.getenv('TOKEN')
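
One thing to keep in mind: if the "TOKEN" variable was never set, os.getenv() quietly returns None, and the problem only shows up later as an authorization error from the API. Below is a minimal sketch of a stricter variant that fails early. The name `auth_strict` and its `var_name` parameter are hypothetical and only for illustration; the rest of this article keeps using auth().

```python
import os


def auth_strict(var_name="TOKEN"):
    """Return the bearer token stored in the given environment variable.

    Hypothetical helper for illustration: raises early instead of
    returning None when the variable is missing or empty.
    """
    token = os.getenv(var_name)
    if not token:
        raise RuntimeError(
            f"Environment variable {var_name!r} is not set; "
            "set it to your bearer token first."
        )
    return token
```

Calling `auth_strict()` behaves like auth() when the variable exists, and raises a RuntimeError with a readable message when it does not.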

# 4. Create Headers

Next, we will define a function that takes our bearer token and returns the authorization headers we will use to access the API.

In [32]:
def create_headers(bearer_token):
    headers = {"Authorization": "Bearer {}".format(bearer_token)}
    return headers

# 5. Create URL

Now that we can access the API, we will build the request for the endpoint we are going to use and the parameters we want to pass.

In [None]:
def create_url(keyword, start_date, end_date, max_results = 10):
    
    search_url = "https://api.twitter.com/2/tweets/search/all" #Change to the endpoint you want to collect data from

    #change params based on the endpoint you are using
    query_params = {'query': keyword,
                    'start_time': start_date,
                    'end_time': end_date,
                    'max_results': max_results,
                    'expansions': 'author_id,in_reply_to_user_id,geo.place_id',
                    'tweet.fields': 'id,text,author_id,in_reply_to_user_id,geo,conversation_id,created_at,lang,public_metrics,referenced_tweets,reply_settings,source',
                    'user.fields': 'id,name,username,created_at,description,public_metrics,verified',
                    'place.fields': 'full_name,id,country,country_code,geo,name,place_type',
                    'next_token': {}}
    return (search_url, query_params)

The function defined above contains two pieces:

### A. *search_url:* Which is the link of the endpoint we want to access.

Twitter's API has a lot of different endpoints. Here is the list of endpoints available with early access at the time of writing:

![endpoints](https://raw.githubusercontent.com/AndrewEdward37/Twitter_API_V2_Tutorial/main/images/early-access-endpoints.png)

You can also find the full list and more information about each of the endpoints in this [link](https://developer.twitter.com/en/docs/twitter-api/early-access).

For this article, since it is targeted towards Academic Researchers who are possibly trying to benefit from Twitter's new product, we will be using the **full-archive search endpoint**.


### B. *query_params:* The parameters that the endpoint offers and we can use to customize the request we want to send.

Each endpoint has different parameters that we can pass to it, and of course Twitter has an API-reference for each of them in its documentation!

For example for the full-archive search endpoint that we are using for this article, you can find the list of query parameters here in its [API Reference page](https://developer.twitter.com/en/docs/twitter-api/tweets/search/api-reference/get-tweets-search-all) under the "Query parameters" section.

![Query parameters screenshot](https://raw.githubusercontent.com/AndrewEdward37/Twitter_API_V2_Tutorial/main/images/query_parameters.png)

We can decompose the query parameters above into three sections:

![parameters-breakdown](https://github.com/AndrewEdward37/Twitter_API_V2_Tutorial/blob/main/images/parameters-breakdown.png?raw=true)

1. The first 4 parameters are ones we are controlling

```
'query':        keyword,
'start_time':   start_date,
'end_time':     end_date,
'max_results':  max_results,
```

2. The next 4 parameters instruct the endpoint to return optional information that it does not return by default.

```
'expansions':   'author_id,in_reply_to_user_id,geo.place_id',
'tweet.fields': 'id,text,author_id,in_reply_to_user_id,geo,conversation_id,created_at,lang,public_metrics,referenced_tweets,reply_settings,source',
'user.fields':  'id,name,username,created_at,description,public_metrics,verified',
'place.fields': 'full_name,id,country,country_code,geo,name,place_type',
```

3. Lastly, the 'next_token' parameter is used to get the next 'page' of results. Its value is taken directly from the API's previous response, which includes a token whenever more results exist than fit in a single request. We will talk more about this later in the article.

```
'next_token': {}
```
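
To get a feel for how these parameters end up in the request, here is a small sketch that previews the final URL using only the standard library. It uses a reduced parameter set for readability, and it assumes that parameters whose value is None are simply left out, which is how the requests library treats None values passed through `params`. The variable names are only for illustration.

```python
from urllib.parse import urlencode

# Same endpoint as in create_url above
search_url = "https://api.twitter.com/2/tweets/search/all"

# A reduced version of query_params, for readability
query_params = {
    "query": "xbox lang:en",
    "start_time": "2021-03-01T00:00:00.000Z",
    "max_results": 10,
    "next_token": None,  # no next page requested yet
}

# Assumption: parameters whose value is None are left out of the URL,
# mirroring how the requests library handles None values in `params`.
sent_params = {k: v for k, v in query_params.items() if v is not None}

# urlencode percent-encodes special characters such as ":" and spaces
preview_url = search_url + "?" + urlencode(sent_params)
print(preview_url)
```

Printing the URL like this can be a quick way to check a query before spending part of your monthly tweet cap on it.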

**NOTE:** It is important to note that since some of these are optional parameters, they might not exist for all tweets.

For example, in 'expansions', the 'geo.place_id' will only return a result if a location is attached to the tweet retrieved. This will be important to remember when we save the results later on.
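
A small sketch of what "might not exist" means in practice, using two made-up tweets. The helper name `get_place_id` is hypothetical; the point is only that dict.get() lets us supply a default instead of hitting a KeyError.

```python
# Two made-up tweets: only the first one has a place attached
tweets = [
    {"id": "1", "text": "hello", "geo": {"place_id": "abc123"}},
    {"id": "2", "text": "world"},
]


def get_place_id(tweet, default=""):
    """Return the tweet's place_id, or `default` when no place is attached."""
    return tweet.get("geo", {}).get("place_id", default)


for tweet in tweets:
    # The second tweet has no 'geo' key, so the default is used
    print(tweet["id"], repr(get_place_id(tweet)))
```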

Now that we know what the create_url function does, a couple of important notes:

- *Required parameters:*

    In the case of the full-archive search endpoint, the *query* parameter is the only parameter that is **required** to make a request. Always make sure to look at the documentation for the endpoint you are using to confirm which parameters HAVE to exist so you do not face issues.


- *Query Parameter:*

    The *query* parameter is where you put the keyword(s) you want to search for. Queries can be as simple as searching for tweets containing the word "xbox" or as complex as "(xbox europe) OR (xbox usa)" which will return tweets that contain the words xbox AND europe or xbox AND usa.
    
    Also, a *query* can be customized using *search operators*. There are so many options that help you narrow your search results. We will hopefully discuss operators more in depth in another article. For now, you can find the full list of operators for building queries [here](https://developer.twitter.com/en/docs/twitter-api/tweets/search/integrate/build-a-query).
    
    Example of a simple query with an operator: "xbox lang:en"
    

- *Timestamps:* 

    The *end_time* and *start_time* format that Twitter uses for timestamps is *YYYY-MM-DDTHH:mm:ssZ (ISO 8601/RFC 3339)*. So make sure to convert the dates you want to this format. If you are unsure about how to, this is a nice [timestamp converter](https://www.timestamp-converter.com/) that will definitely help.
    

- *Results volume:* 

    The number of search results returned per request is currently limited to between 10 and 500.
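
For the timestamps mentioned in the notes above, you do not necessarily need an online converter. Here is a minimal sketch using Python's datetime module. The helper name `to_twitter_timestamp` is hypothetical, and it assumes that datetimes without timezone information are already in UTC.

```python
from datetime import datetime, timezone


def to_twitter_timestamp(dt):
    """Format a datetime as YYYY-MM-DDTHH:mm:ssZ in UTC.

    Naive datetimes (no tzinfo) are assumed to already be in UTC.
    """
    if dt.tzinfo is None:
        dt = dt.replace(tzinfo=timezone.utc)
    return dt.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


start_time = to_twitter_timestamp(datetime(2021, 3, 1))
end_time = to_twitter_timestamp(datetime(2021, 3, 31))
print(start_time, end_time)
```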

Now you might be asking, how can I get more than 500 results then? **That is where *next_token* and pagination come into play!**

The answer is simple: if more results exist for your query, Twitter will return a unique *next_token* that you can use in your next request to get the next batch of results.

If you want to retrieve all the tweets that exist for your query, you just keep sending requests using the new *next_token* you receive every time, until no *next_token* exists, signaling that you have retrieved all the tweets!
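
The idea can be sketched without touching the API at all. In the example below, `fetch_page` is a stand-in for "send one request and return the parsed JSON", and `FAKE_PAGES` is made-up data that imitates three pages of results. Both names are hypothetical and only there to show the shape of the loop.

```python
def collect_all_pages(fetch_page):
    """Call fetch_page(next_token) until the response has no next_token.

    fetch_page is a stand-in for a function that sends one request and
    returns the parsed JSON response as a dictionary.
    """
    tweets = []
    next_token = None  # the first request is sent without a token
    while True:
        response = fetch_page(next_token)
        tweets.extend(response.get("data", []))
        next_token = response.get("meta", {}).get("next_token")
        if next_token is None:
            break  # no more pages
    return tweets


# Fake API with three pages, used only to demonstrate the loop
FAKE_PAGES = {
    None: {"data": [{"id": "1"}, {"id": "2"}], "meta": {"next_token": "t1"}},
    "t1": {"data": [{"id": "3"}], "meta": {"next_token": "t2"}},
    "t2": {"data": [{"id": "4"}], "meta": {}},
}

all_tweets = collect_all_pages(lambda token: FAKE_PAGES[token])
print([t["id"] for t in all_tweets])
```

The loop stops on the third page because its "meta" dictionary no longer contains a next_token.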


Hopefully you are not feeling too confused! But don't worry, when we run all of the functions we just created, it will be clear!

# 6. Connect to Endpoint

Now that we have the url, headers, and parameters we want, we will create a function that will put all of this together and connect to the endpoint.

The function below will send the "GET" request and if everything is correct (response code 200), it will return the response in "json" format.

**Note:** *next_token* is set to "None" by default since we only care about it if it exists.

In [56]:
def connect_to_endpoint(url, headers, params, next_token = None):
    params['next_token'] = next_token   #params object received from create_url function
    response = requests.request("GET", url, headers = headers, params = params)
    print("Endpoint Response Code: " + str(response.status_code))
    if response.status_code != 200:
        raise Exception(response.status_code, response.text)
    return response.json()
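
If you send too many requests in a short window, the API answers with status code 429 instead of 200, and the function above raises an exception. Here is a sketch of how one might work out how long to wait before retrying. It assumes the response carries an `x-rate-limit-reset` header holding the reset time as a Unix timestamp in seconds, so please check the rate-limit documentation for your endpoint. The helper name `seconds_until_reset` is hypothetical.

```python
import time


def seconds_until_reset(response_headers, now=None):
    """Return how many seconds to wait before retrying a rate-limited request.

    Assumes an 'x-rate-limit-reset' header with a Unix timestamp (seconds).
    Falls back to 0 when the header is missing.
    """
    if now is None:
        now = time.time()
    reset_at = response_headers.get("x-rate-limit-reset")
    if reset_at is None:
        return 0
    # Never return a negative wait if the reset time has already passed
    return max(0, int(reset_at) - int(now))


# Made-up header values for illustration
print(seconds_until_reset({"x-rate-limit-reset": "1617235200"}, now=1617235140))
```

With a real response, you would pass `response.headers` and then call time.sleep() with the returned value before sending the request again.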

# 7. Putting it all Together

Now that we have all the functions we need, let's test putting them all together to create our first request!

In the next cell, we will set up our inputs:

- bearer_token and headers from the API.
- We will look for tweets in English that contain the word "xbox".
- We will look for tweets between the 1st and the 31st of March, 2021.
- We want only a maximum of 10 tweets returned.

In [35]:
#Inputs for the request
bearer_token = auth()
headers = create_headers(bearer_token)
keyword = "xbox lang:en"
start_time = "2021-03-01T00:00:00.000Z"
end_time = "2021-03-31T00:00:00.000Z"
max_results = 10

Now we will create the url and get the response from the API.

The response returned from the Twitter API is returned in JavaScript Object Notation "JSON" format.
To work with it and break down the response we get, we will use Python's built-in JSON encoder and decoder, which we imported earlier. You can find more information about the library [here](https://docs.python.org/3/library/json.html).

If the code of the returned response is *200*, then the request was successful.

In [36]:
url = create_url(keyword, start_time,end_time, max_results)
json_response = connect_to_endpoint(url[0], headers, url[1])

Endpoint Response Code: 200


In [1]:
print(json.dumps(json_response, indent=4, sort_keys=True))

Now let's break down the returned JSON response. It is read as a Python dictionary whose keys contain either data or more dictionaries. The top two keys are:

### A. Data:

A list of dictionaries, each dictionary represents the data for a tweet.

Example of how to retrieve the time the first tweet was created:

```
    json_response['data'][0]['created_at']
```

### B. Meta:

A dictionary of attributes about the request we sent. We usually only care about two keys in this dictionary: *next_token* and *result_count*.

![meta-data-response-breakdown](https://github.com/AndrewEdward37/Twitter_API_V2_Tutorial/blob/main/images/meta-data-response-breakdown.png?raw=true)

Example of how to retrieve the result_count (the next_token is retrieved the same way, using the 'next_token' key):

```
    json_response['meta']['result_count']
```
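
To make the structure concrete, here is a self-contained sketch using a trimmed, made-up response with the same top-level shape (a real response contains many more fields). It also shows why .get() is handy for next_token, which is absent on the last page of results.

```python
# A trimmed, made-up response with the same top-level shape
sample_response = {
    "data": [
        {"id": "1", "created_at": "2021-03-30T23:59:59.000Z", "text": "first"},
        {"id": "2", "created_at": "2021-03-30T23:59:58.000Z", "text": "second"},
    ],
    "meta": {"result_count": 2, "next_token": "abc"},
}

first_created_at = sample_response["data"][0]["created_at"]
result_count = sample_response["meta"]["result_count"]
# next_token is missing on the last page, so .get() avoids a KeyError
next_token = sample_response["meta"].get("next_token")

print(first_created_at, result_count, next_token)
```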


Now we have two options for saving the results, depending on how we want to work with the data: we can keep the same JSON format we received, or convert to CSV.
To save the results in JSON, these two lines of code are enough:

In [38]:
with open('data.json', 'w') as f:
    json.dump(json_response, f)

# 8. Save Results to CSV

You might be asking yourself, why do we want to save the results in CSV format? The short answer is, compared to a JSON object, CSVs are a widely used format that can easily be imported into an Excel spreadsheet, database, or data visualization software.


Now, to save the results in a table format for CSV, there are two approaches, a simple approach and a more custom approach.

Well...

**If there is a simple approach through a Python library, why do we need the custom approach?**

The answer is that the custom function lets us break down the dictionaries embedded in some of the returned results and flatten them into separate columns, making our analysis task easier.

For example, the public metrics key:
```
"public_metrics": {
                "like_count": 0,
                "quote_count": 0,
                "reply_count": 0,
                "retweet_count": 0
            },
```
The key returns another dictionary. The simple approach will save this whole dictionary under one CSV column, while in the custom approach we can break each metric into its own column before saving the data to CSV.
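
To illustrate what "breaking a dictionary into columns" means, here is a generic sketch that flattens nested dictionaries into a single level of keys. This is not the function used in the rest of this article, just a small standalone example; the helper name `flatten` and the prefixed column names it produces are my own choices.

```python
def flatten(record, parent_key="", sep="_"):
    """Flatten nested dictionaries into a single level of keys."""
    flat = {}
    for key, value in record.items():
        # Prefix nested keys with the name of their parent key
        new_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten(value, new_key, sep))
        else:
            flat[new_key] = value
    return flat


# A made-up tweet with the nested public_metrics dictionary
tweet = {
    "id": "1",
    "public_metrics": {"like_count": 0, "quote_count": 0,
                       "reply_count": 0, "retweet_count": 0},
}
print(flatten(tweet))
```

Each metric now has its own key (for example `public_metrics_like_count`), which maps naturally onto one CSV column per metric.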

## A. The simple approach:

For this approach we will be using the Pandas package

In [None]:
df = pd.DataFrame(json_response['data'])
df.to_csv('data.csv')

This was taken from [this blogpost](https://developer.twitter.com/en/docs/tutorials/five-ways-to-convert-a-json-object-to-csv) from the Twitter Dev team on this topic, which I tried and it works really well with simple queries.

## B. The custom approach:

First, we will create a CSV file with our desired column headers. We do this separately from the actual function so that it does not interfere with looping over requests later on.

In [39]:
# Create file
csvFile = open("data.csv", "a", newline="", encoding='utf-8')
csvWriter = csv.writer(csvFile)

#Create headers for the data you want to save; in this example, we only want to save these columns in our dataset
csvWriter.writerow(['author id', 'created_at', 'geo', 'id','lang', 'like_count', 'quote_count', 'reply_count','retweet_count','source','tweet'])
csvFile.close()

Then, we will create our *append_to_csv* function, which takes the response and the desired filename and appends all the data we collected to the CSV file.

In [60]:
def append_to_csv(json_response, fileName):

    #A counter variable
    counter = 0

    #Open OR create the target CSV file
    csvFile = open(fileName, "a", newline="", encoding='utf-8')
    csvWriter = csv.writer(csvFile)

    #Loop through each tweet
    for tweet in json_response['data']:
        
        # We will create a variable for each since some of the keys might not exist for some tweets
        # So we will account for that

        # 1. Author ID
        author_id = tweet['author_id']

        # 2. Time created
        created_at = dateutil.parser.parse(tweet['created_at'])

        # 3. Geolocation (a 'geo' object can exist without a 'place_id')
        if 'geo' in tweet and 'place_id' in tweet['geo']:
            geo = tweet['geo']['place_id']
        else:
            geo = " "

        # 4. Tweet ID
        tweet_id = tweet['id']

        # 5. Language
        lang = tweet['lang']

        # 6. Tweet metrics
        retweet_count = tweet['public_metrics']['retweet_count']
        reply_count = tweet['public_metrics']['reply_count']
        like_count = tweet['public_metrics']['like_count']
        quote_count = tweet['public_metrics']['quote_count']

        # 7. source (may be missing from some responses, so fall back to a blank)
        source = tweet.get('source', " ")

        # 8. Tweet text
        text = tweet['text']
        
        # Assemble all data in a list
        res = [author_id, created_at, geo, tweet_id, lang, like_count, quote_count, reply_count, retweet_count, source, text]
        
        # Append the result to the CSV file
        csvWriter.writerow(res)
        counter += 1

    # When done, close the CSV file
    csvFile.close()

    # Print the number of tweets for this iteration
    print("# of Tweets added from this response: ", counter) 

Now if we run our append_to_csv() function on our last call, we should have a file that contains 10 tweets (or fewer, depending on your query).

In [41]:
append_to_csv(json_response, "data.csv")

# of Tweets added from this response:  10
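
As a side note, the same idea can be written with csv.DictWriter and a `with`-friendly file object, which keeps the column names and the values together. The sketch below is an alternative for illustration, not a replacement for append_to_csv(): the names `COLUMNS`, `tweet_to_row`, and `write_tweets` are hypothetical, it uses a reduced set of columns, and it writes to an in-memory buffer instead of a real file.

```python
import csv
import io

COLUMNS = ["id", "author_id", "lang", "like_count", "text"]


def tweet_to_row(tweet):
    """Pick the columns we want; missing keys become empty strings."""
    metrics = tweet.get("public_metrics", {})
    return {
        "id": tweet.get("id", ""),
        "author_id": tweet.get("author_id", ""),
        "lang": tweet.get("lang", ""),
        "like_count": metrics.get("like_count", ""),
        "text": tweet.get("text", ""),
    }


def write_tweets(tweets, file_obj):
    """Write a header row followed by one row per tweet."""
    writer = csv.DictWriter(file_obj, fieldnames=COLUMNS)
    writer.writeheader()
    for tweet in tweets:
        writer.writerow(tweet_to_row(tweet))


# Made-up tweets; an in-memory buffer stands in for a real file
sample = [
    {"id": "1", "author_id": "10", "lang": "en",
     "public_metrics": {"like_count": 3}, "text": "hello, world"},
    {"id": "2", "text": "no metrics here"},
]
buffer = io.StringIO()
write_tweets(sample, buffer)
print(buffer.getvalue())
```

Note how the csv module quotes the text that contains a comma, so tweets with commas in them do not break the columns.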


# 9. Looping Through Requests

Good job! We have sent our first request and saved the first response from it.

Now, what if we want to save more responses, either to go beyond the first 500 results that a single request can return or to automate collecting Tweets over a specific period of time? For that, we will use loops and the next_token values we receive from Twitter.

Let's think about this case:

We want to collect tweets from 2020 that contain the word "COVID-19" to analyse people's sentiment when tweeting about the virus. Millions of such tweets probably exist, and we can only collect 10 million tweets per month.

If we just send a request to collect tweets between the 1st of January 2020 and the 31st of December 2020, we will hit our cap very quickly without having a good distribution from all 12 months.

So instead, we can set a limit on the number of tweets to collect per month, so that once we reach that limit for one month, we move on to the next.

The code below does exactly that! It is composed of two loops:

1. For-loop that goes over the months/weeks/days we want to cover (Depending on how it is set)

2. While-loop that controls the maximum number of tweets we want to collect per time period.

In [61]:
#Inputs for tweets
bearer_token = auth()
headers = create_headers(bearer_token)
keyword = "xbox lang:en"
start_list =    ['2021-01-01T00:00:00.000Z',
                 '2021-02-01T00:00:00.000Z',
                 '2021-03-01T00:00:00.000Z']

end_list =      ['2021-01-31T00:00:00.000Z',
                 '2021-02-28T00:00:00.000Z',
                 '2021-03-31T00:00:00.000Z']
max_results = 500

#Total number of tweets we collected from the loop
total_tweets = 0

# Create file
csvFile = open("data.csv", "a", newline="", encoding='utf-8')
csvWriter = csv.writer(csvFile)

#Create headers for the data you want to save; in this example, we only want to save these columns in our dataset
csvWriter.writerow(['author id', 'created_at', 'geo', 'id','lang', 'like_count', 'quote_count', 'reply_count','retweet_count','source','tweet'])
csvFile.close()

for i in range(0,len(start_list)):

    # Inputs
    count = 0 # Counting tweets per time period
    max_count = 100 # Max tweets per time period
    flag = True
    next_token = None
    
    # Check if flag is true
    while flag:
        # Check if max_count reached
        if count >= max_count:
            break
        print("-------------------")
        print("Token: ", next_token)
        url = create_url(keyword, start_list[i],end_list[i], max_results)
        json_response = connect_to_endpoint(url[0], headers, url[1], next_token)
        result_count = json_response['meta']['result_count']

        if 'next_token' in json_response['meta']:
            # Save the token to use for next call
            next_token = json_response['meta']['next_token']
            print("Next Token: ", next_token)
            if result_count is not None and result_count > 0 and next_token is not None:
                print("Start Date: ", start_list[i])
                append_to_csv(json_response, "data.csv")
                count += result_count
                total_tweets += result_count
                print("Total # of Tweets added: ", total_tweets)
                print("-------------------")
                time.sleep(5)                
        # If no next token exists
        else:
            if result_count is not None and result_count > 0:
                print("-------------------")
                print("Start Date: ", start_list[i])
                append_to_csv(json_response, "data.csv")
                count += result_count
                total_tweets += result_count
                print("Total # of Tweets added: ", total_tweets)
                print("-------------------")
                time.sleep(5)
            
            #Since this is the final request, turn flag to false to move to the next time period.
            flag = False
            next_token = None
        time.sleep(5)
print("Total number of results: ", total_tweets)

-------------------
Token:  None
Endpoint Response Code: 200
Next Token:  b26v89c19zqg8o3fosktkma06ysgm4wgynfrumz0sw5tp
Start Date:  2021-01-01T00:00:00.000Z
# of Tweets added from this response:  490
Total # of Tweets added:  490
-------------------
-------------------
Token:  None
Endpoint Response Code: 200
Next Token:  b26v89c19zqg8o3fosntcmiv0wh735qw6efvq5lhzawe5
Start Date:  2021-02-01T00:00:00.000Z
# of Tweets added from this response:  487
Total # of Tweets added:  977
-------------------
-------------------
Token:  None
Endpoint Response Code: 200
Next Token:  b26v89c19zqg8o3fosqtycrbq6s54n8yu7qrwyeeez6v1
Start Date:  2021-03-01T00:00:00.000Z
# of Tweets added from this response:  494
Total # of Tweets added:  1471
-------------------
Total number of results:  1471
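
The start_list and end_list above were typed by hand, which gets tedious for longer periods. Here is a sketch of how the two lists could be generated instead. The helper name `month_windows` is hypothetical. Note one deliberate difference from the hand-written lists: each window here ends at the first instant of the following month, on the assumption that end_time acts as an upper bound, so that the last day of each month is fully covered.

```python
from datetime import datetime


def month_windows(year, first_month, last_month):
    """Return (start_list, end_list) covering whole calendar months.

    Each window ends at the first instant of the following month, so
    the last day of every month is fully included.
    """
    starts, ends = [], []
    for month in range(first_month, last_month + 1):
        start = datetime(year, month, 1)
        # Roll over to January of the next year after December
        end = datetime(year + 1, 1, 1) if month == 12 else datetime(year, month + 1, 1)
        starts.append(start.strftime("%Y-%m-%dT%H:%M:%S.000Z"))
        ends.append(end.strftime("%Y-%m-%dT%H:%M:%S.000Z"))
    return starts, ends


start_list, end_list = month_windows(2021, 1, 3)
print(start_list)
print(end_list)
```

The two lists can then be used in the for-loop exactly like the hand-written ones.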


## Summary:

In this article, we have gone through an extensive step-by-step process for collecting Tweets from Twitter API v2 for Academic Research using Python.

We have covered the pre-requisites required, authentication, creating requests and sending requests to the search/all endpoint, and finally saving responses in different formats.

If you liked this, feel free to share it with your friends and colleagues over Twitter and LinkedIn!

Feel free to connect with me on LinkedIn or follow me on Twitter!