# An Extensive Guide to Collecting Tweets from Twitter API v2 for Academic Research Using Python

## List:

1. Introduction
2. Pre-requisites to start
3. Bearer Token
4. Create Headers
5. Create URL
6. Connect to Endpoint
7. Save results to CSV
8. Tweet collection, explained

## 1. Introduction

At the end of 2020, Twitter introduced a new Twitter API built from the ground up. Twitter API v2 comes with new endpoints and more features and data that you can pull and analyze.

One of the most common reasons developers use the Twitter API is to listen to and analyze the conversation happening on Twitter.

In this article, we will go through the process step by step, from setting up and accessing endpoints to saving the collected tweets in CSV format for later analysis.


## 2. Pre-requisites

First, we are going to import some essential libraries for this guide:

In [None]:
# For sending GET requests to the API
import requests


# For file management when creating and adding to the dataset
import os

# For dealing with json responses we receive from the API
import json

# For displaying the data after
import pandas as pd

# For saving the response data in CSV format
import csv

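# For parsing the dates received from Twitter into readable formats
import datetime
import dateutil.parser

# For normalizing Unicode text before writing it out
import unicodedata

# For pausing between requests, if needed
import time

# For cleaning tweet text with regular expressions
import re

def remove_url(txt):
    # Strips URLs and every character that is not a letter, digit, space, or tab,
    # then collapses runs of whitespace. The raw string avoids invalid-escape warnings.
    return " ".join(re.sub(r"([^0-9A-Za-z \t])|(\w+:\/\/\S+)", "", txt).split())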

To be able to send your first request to the Twitter API, you need to have a developer account. If you don't have one yet, you can apply for one [here](https://developer.twitter.com/en/apply-for-access)! (Don't worry, it's free and you just need to provide some information.)

Got an approved developer account? Fantastic!

All you need to do is create a project and connect an App through the developer portal and we are set to go!

1. Go to the [developer portal dashboard](https://developer.twitter.com/en/portal/dashboard)
2. Sign in with your developer account
3. Create a new Project, give it a name, a use-case based on the goal you want to achieve, and a description.

![New Project page](https://raw.githubusercontent.com/AndrewEdward37/Twitter_API_V2_Tutorial/main/images/new-project.png?token=AHQLCWF4J4IOFAMWKOQJGJTAU26HC)

4. Assuming this is your first time, choose ‘create a new App instead’ and give your App a name in order to create a new App.

![New App last step](https://raw.githubusercontent.com/AndrewEdward37/Twitter_API_V2_Tutorial/main/images/name-app.png)

If everything is successful, you should see this page containing your keys and tokens. We will use one of these to access the API.

![keys and token](https://raw.githubusercontent.com/AndrewEdward37/Twitter_API_V2_Tutorial/main/images/tokens-and-keys.png)

## 3. Bearer Token

If you have reached this step, CONGRATULATIONS! You can send your first request to the API :))

First we will create an auth() function that returns the "Bearer Token" from the app we just created.

Just replace the <ADD_BEARER_TOKEN> with your bearer token.

In [None]:
# Never publish or commit your real token
def auth():
    return '<ADD_BEARER_TOKEN>'
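
Hardcoding a token in a notebook makes it easy to publish it by accident. One common alternative, sketched below, reads the token from an environment variable. The variable name `TWITTER_BEARER_TOKEN` and the function name `auth_from_env` are assumptions made for this illustration, not something Twitter requires.

```python
import os


def auth_from_env(var_name="TWITTER_BEARER_TOKEN"):
    """Return the bearer token stored in an environment variable.

    The default variable name is hypothetical; use whatever name you exported.
    """
    token = os.environ.get(var_name)
    if not token:
        # Fail early with a clear message instead of sending an empty token
        raise RuntimeError(f"Environment variable {var_name} is not set")
    return token
```

With this approach the token is set once outside the notebook (for example, exported in the shell that launches it), and the notebook itself never contains the secret.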

## 4. Create Headers

Next we will define a function that takes our bearer token and returns the headers we will use to authorize our requests to the API.

In [None]:
def create_headers(bearer_token):
    headers = {"Authorization": "Bearer {}".format(bearer_token)}
    return headers

## 5. Create URL

Now that we can access the API, we will build the request for the endpoint we will use and the parameters we want to pass.
The function defined below contains two pieces:

### A. *search_url:* Which is the link of the endpoint we want to access.

Twitter's API has a lot of different endpoints. Here is the list of endpoints available with early access at the time of writing this article:

![endpoints](https://raw.githubusercontent.com/AndrewEdward37/Twitter_API_V2_Tutorial/main/images/early-access-endpoints.png)

You can also find the full list and more information about each of them at this [link](https://developer.twitter.com/en/docs/twitter-api/early-access).

For this article, since it is targeted towards Academic Researchers who are possibly trying to benefit from Twitter's new product, we will be using the **full-archive search endpoint**.


### B. *query_params:* The parameters that the endpoint offers and we can use to customize the request we want to send.

Each endpoint has different parameters that we can pass to it, and of course Twitter has an API-reference for each of them!

For example for the full-archive search endpoint that we are using for this article, you can find the list of Query parameters here in its [API Reference page](https://developer.twitter.com/en/docs/twitter-api/tweets/search/api-reference/get-tweets-search-all) under the "Query parameters" section.

![Query parameters screenshot](https://raw.githubusercontent.com/AndrewEdward37/Twitter_API_V2_Tutorial/main/images/query_parameters.png)

We can decompose the query below into three sections:

1. The first 4 parameters are ones we are controlling

```
'query':        keyword,
'start_time':   start_date,
'end_time':     end_date,
'max_results':  max_results,
```

2. The next 4 parameters instruct the endpoint to return optional information that it won't return by default.

```
'expansions':   'author_id,in_reply_to_user_id,geo.place_id',
'tweet.fields': 'id,text,author_id,in_reply_to_user_id,geo,conversation_id,created_at,lang,public_metrics,referenced_tweets,reply_settings,source',
'user.fields':  'id,name,username,created_at,description,public_metrics,verified',
'place.fields': 'full_name,id,country,country_code,geo,name,place_type',
```

3. Lastly, the 'next_token' parameter is used to get the next 'page' of results. Its value is pulled directly from the response provided by the API and should not be modified. We will talk more about this later in this article.

```
'next_token': {}
```

**NOTE:** Since some of these parameters are optional, the fields they request might not exist for all tweets.

For example, in 'expansions', the 'geo.place_id' will only return a result if a location is attached to the tweet retrieved. This will be important to remember when we save the results later on.

In [None]:
def create_url(keyword, start_date, end_date, max_results = 10):
    
    search_url = "https://api.twitter.com/2/tweets/search/all" #Change to the endpoint you want to collect data from

    #change params based on the endpoint you are using
    query_params = {'query': keyword,
                    'start_time': start_date,
                    'end_time': end_date,
                    'max_results': max_results,
                    'expansions': 'author_id,in_reply_to_user_id,geo.place_id',
                    'tweet.fields': 'id,text,author_id,in_reply_to_user_id,geo,conversation_id,created_at,lang,public_metrics,referenced_tweets,reply_settings,source',
                    'user.fields': 'id,name,username,created_at,description,public_metrics,verified',
                    'place.fields': 'full_name,id,country,country_code,geo,name,place_type',
                    'next_token': {}}
    return (search_url, query_params)
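
To see what these parameters look like once they are attached to the endpoint, the sketch below encodes a reduced parameter dictionary into a query string. It uses `urllib.parse.urlencode` from the standard library purely for illustration; `requests` performs its own encoding when the real request is sent. The keyword and dates are made-up sample values.

```python
from urllib.parse import urlencode

search_url = "https://api.twitter.com/2/tweets/search/all"

# Sample values chosen only for this illustration
query_params = {
    "query": "xbox lang:en",
    "start_time": "2021-03-01T00:00:00Z",
    "end_time": "2021-03-31T00:00:00Z",
    "max_results": 10,
}

# urlencode percent-encodes reserved characters such as ':' and turns spaces into '+'
full_url = search_url + "?" + urlencode(query_params)
print(full_url)
```

Printing the URL this way can help when debugging a request that returns an unexpected error, since it shows exactly which parameters were sent.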

Now that we know what the create_url function does, a couple of important notes:

- *Required parameters:*

    In the case of the full-archive search endpoint, the *query* parameter is the only parameter that is **required** to make a request. Always make sure to look at the documentation for the endpoint you are using to confirm which parameters HAVE to exist so you do not face issues.

- *Query Parameter:*

    The *query* parameter is where you put the keyword(s) you want to search for. Queries can be as simple as searching for tweets containing the word "xbox" or as complex as "(xbox europe) OR (xbox usa)" which will return tweets that contain the words xbox AND europe or xbox AND usa.
    
    Also, a *query* can be customized using *search operators*. There are so many options that help you narrow your search results. We will hopefully discuss operators more in depth in another article. For now, you can find the full list of operators for building queries [here](https://developer.twitter.com/en/docs/twitter-api/tweets/search/integrate/build-a-query).
    
    Example of a simple query with an operator: "xbox lang:en"

- *Timestamps:* 

    The format that Twitter uses for the *start_time* and *end_time* timestamps is *YYYY-MM-DDTHH:mm:ssZ (ISO 8601/RFC 3339)*, so make sure to convert the dates you want to this format. If you are unsure how, this [timestamp converter](https://www.timestamp-converter.com/) can help.

- *Results volume:* 

    The number of search results returned by a single request is currently limited to between 10 and 500.
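
As an alternative to an online converter, the timestamp format described above can be produced directly in Python. The helper below is a minimal sketch; the name `to_twitter_timestamp` is hypothetical, and it assumes that a datetime without timezone information is already in UTC.

```python
from datetime import datetime, timezone


def to_twitter_timestamp(dt):
    """Format a datetime as YYYY-MM-DDTHH:mm:ssZ in UTC.

    Naive datetimes are assumed to already be in UTC (an assumption of this sketch).
    """
    if dt.tzinfo is None:
        dt = dt.replace(tzinfo=timezone.utc)
    # Convert aware datetimes to UTC before formatting with the trailing 'Z'
    return dt.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


# Sample dates chosen only for this illustration
start_date = to_twitter_timestamp(datetime(2021, 3, 1))
end_date = to_twitter_timestamp(datetime(2021, 3, 31, 23, 59, 59))
print(start_date, end_date)
```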

Now you might be asking, how can I get more than 500 results then? **That is where *next_token* and pagination come into play!**

The answer is simple: If more results exist for your query, Twitter will return a unique *next_token* that you can use in your next request, and it will give you the next set of results.

If you want to retrieve all the tweets that exist for your query, you just keep sending requests using the new *next_token* you receive every time, until no *next_token* exists, signaling that you have retrieved all the tweets!
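
The loop described above can be sketched as follows. In the v2 response, the token is found under the `meta` object. To keep the example runnable offline, it uses a fake two-page response instead of a real request; `collect_all_pages`, `fetch_page`, and `FAKE_PAGES` are hypothetical names introduced only for this illustration.

```python
def collect_all_pages(fetch_page):
    """Call fetch_page(next_token) repeatedly until no next_token is returned.

    fetch_page is a hypothetical callable that takes a token (None for the
    first page) and returns the parsed JSON response as a dict.
    """
    tweets = []
    next_token = None
    while True:
        page = fetch_page(next_token)
        # 'data' can be missing when a page holds no tweets
        tweets.extend(page.get("data", []))
        next_token = page.get("meta", {}).get("next_token")
        if next_token is None:
            break
    return tweets


# Fake two-page response used only to demonstrate the loop offline
FAKE_PAGES = {
    None: {"data": [{"id": "1"}, {"id": "2"}],
           "meta": {"result_count": 2, "next_token": "abc"}},
    "abc": {"data": [{"id": "3"}],
            "meta": {"result_count": 1}},
}

all_tweets = collect_all_pages(lambda token: FAKE_PAGES[token])
print(len(all_tweets))
```

With the real API, `fetch_page` would wrap the function that sends the request and pass the token along. Pausing between calls is usually needed as well, to stay within the rate limits of the endpoint.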


Hopefully you are not feeling too confused! But don't worry, when we run all of the functions we just created, it will be clear!

## 6. Connect to Endpoint

Now that we have the url, headers, and parameters we want, we will create a function that will put all of this together and connect to the endpoint.

The function below will send the "GET" request and if everything is correct (response code 200), it will return the response in "json" format.

**Note:** *next_token* is set to None by default since we only care about it if it exists.

In [None]:
def connect_to_endpoint(url, headers, params, next_token = None):
    params['next_token'] = next_token   #params object received from create_url function
    response = requests.request("GET", url, headers = headers, params = params)
    print(response.status_code)
    if response.status_code != 200:
        raise Exception(response.status_code, response.text)
    return response.json()
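
Because the request asks for the `author_id` expansion, the user objects arrive in a separate `includes.users` list rather than inside each tweet. The sketch below joins the two using a made-up response in the v2 layout. The IDs and usernames are sample data, and `index_users` is a hypothetical helper.

```python
def index_users(json_response):
    """Map each user ID in includes.users to its user object."""
    users = json_response.get("includes", {}).get("users", [])
    return {user["id"]: user for user in users}


# Made-up response that mimics the layout of a v2 search response
sample_response = {
    "data": [
        {"id": "101", "author_id": "7", "text": "first tweet"},
        {"id": "102", "author_id": "8", "text": "second tweet"},
    ],
    "includes": {"users": [{"id": "7", "username": "alice"}]},
    "meta": {"result_count": 2},
}

users_by_id = index_users(sample_response)
for tweet in sample_response["data"]:
    # Guard against an author that is not present in includes
    author = users_by_id.get(tweet["author_id"], {})
    print(tweet["id"], author.get("username"))
```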

## 7. Save results to CSV

In [None]:
def save_to_csv(json_response, fileName):

    #Saving the tweets in a csv file:
    csvFile = open(fileName, "a", newline="", encoding='utf-8-sig')
    csvWriter = csv.writer(csvFile)

    #Counter to find how many tweets were retrieved
    counter = 0

    #Loop through each tweet ('data' is absent when a page has no results)
    for tweet in json_response.get('data', []):
        
        # 1) Time 
        time_created = dateutil.parser.parse(tweet['created_at'])
        
        # 2) Tweet ID (kept as str; bytes would be written to the CSV as b'...')
        tweet_id = tweet['id']

        # 3) Tweet text
        # text = remove_url(tweet['text'])
        text = tweet['text']

        # 4) Author ID
        author_id = tweet['author_id']

        # 5) Place ID, if a geo location exists
        if ('geo' in tweet):   
            geo = tweet['geo'].get('place_id', " ")
        else:
            geo = " "
        
        # 6) Conversation ID
        conversation_id = tweet['conversation_id']

        # 7) ID of the user being replied to, if the tweet is a reply
        if ('in_reply_to_user_id' in tweet):   
            in_reply = tweet['in_reply_to_user_id']
        else:
            in_reply = " "
        
        # 8) Tweet metrics
        retweet_count = tweet['public_metrics']['retweet_count']
        reply_count = tweet['public_metrics']['reply_count']
        like_count = tweet['public_metrics']['like_count']
        quote_count = tweet['public_metrics']['quote_count']

        # 9) Language of tweet
        language = tweet['lang']

        # Assemble all data in a list
        res = [time_created, tweet_id, author_id, text, geo, language, conversation_id,in_reply,retweet_count,reply_count,like_count,quote_count]

        # Look for context Annotations
        if ('context_annotations' in tweet):
            
            # Only look at the first three
            min_len = min(3, len(tweet['context_annotations']))

            #loop through the context content
            for i in range(min_len):
                domain_desc = tweet['context_annotations'][i]['domain']['description']
                domain_id = tweet['context_annotations'][i]['domain']['id']
                domain_name = tweet['context_annotations'][i]['domain']['name']
                entity_id = tweet['context_annotations'][i]['entity']['id']
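                # Drop non-ASCII characters, then decode back to str so the CSV does not contain b'...'
                entity_name = unicodedata.normalize('NFKD', tweet['context_annotations'][i]['entity']['name']).encode('ascii','ignore').decode('ascii')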
                
                # Add the new context info to the row of data
                res +=[domain_id, domain_name, domain_desc, entity_id, entity_name]
        
        # Append the result to the CSV file
        csvWriter.writerow(res)
        counter += 1

    # When done, close the CSV file
    csvFile.close()

    # Print the number of tweets for this iteration
    print("Tweets in API call: ", counter) 
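
Since save_to_csv opens the file in append mode and writes only data rows, the file needs a header row before the first call if you want named columns. A minimal sketch of creating it is shown below. The helper name `create_csv_with_header` and the column names are assumptions for this illustration; the names simply follow the order of the `res` list.

```python
import csv
import os

# Column names follow the order of the `res` list; the names themselves are an assumption
COLUMNS = ["created_at", "tweet_id", "author_id", "text", "geo", "language",
           "conversation_id", "in_reply_to_user_id", "retweet_count",
           "reply_count", "like_count", "quote_count"]


def create_csv_with_header(file_name, columns=COLUMNS):
    """Create file_name with a header row unless it already has content.

    Returns True if the header was written, False if the file was left alone.
    """
    if os.path.exists(file_name) and os.path.getsize(file_name) > 0:
        return False
    with open(file_name, "w", newline="", encoding="utf-8-sig") as csv_file:
        csv.writer(csv_file).writerow(columns)
    return True
```

Calling `create_csv_with_header("data.csv")` once before collecting keeps the header from being repeated on later runs. Rows for tweets with context annotations carry up to 15 extra values (5 for each of the first 3 annotations), which these headers do not cover.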