*Hello again!* 👋

This notebook is the <u>second</u> part of a **tutorial** on how to  **collect data from Twitter API v2 using Python** 🤓

In this notebook, we make use of the **recent search** endpoint to collect Twitter data on heat pumps and gas boilers from the last 7 days.

### Importing packages and loading credentials
We start by importing the necessary packages to run the code.

In [1]:
import requests
import json
import time
import random
import os
import pandas as pd

We import our *bearer_token* which we previously defined as an environment variable. This way you do not have to expose your credentials in your code.

In [2]:
bearer_token = os.environ.get("BEARER_TOKEN")

### Preparing our API request
We will use the recent search endpoint to collect our first set of tweets. To do that we need to define the endpoint URL, the rules clarifying the data we want to collect and other query parameters such as fields to include and maximum number of results.

In [3]:
endpoint_url = "https://api.twitter.com/2/tweets/search/recent"

We define the following two rules:
- tweets matching one of the expressions "heat pump"/"heat pumps", written in english, which are not retweets;
- tweets matching one of the expressions "gas boiler"/"gas boilers", written in english, which are not retweets.

In [4]:
rules = [
    {"value": '("heat pump" OR "heat pumps") -is:retweet lang:en', "tag": "heat_pump"},
    {"value": '("gas boiler" OR "gas boilers") -is:retweet lang:en', "tag": "gas_boiler"},
]

We create a dictionary with query parameters, where we pass the following fields:
- **tweet.fields**: fields in the tweet object for which we want to collect information, in this example: the tweet unique identifier, the tweet text, the identifier of the user posting the tweet and the date/time the tweet was created;
- **user.fields**: fields in the user object for which we want to collect information, in this example: the user unique identifier, name, username, date/time the user created their account, description, user defined location and whether the user is verified or not;
- **expansions**: expansion query parameter with info relating to the user. We need to add this in order to receive user data in our response object.
- **max_results**: the maximum number of tweets to be retrieved per request to the API, in this case 100 (which is also the maximum allowed).

Unlike our previous example, here we do not define the query rules straight away.

In [5]:
query_parameters = {
    "tweet.fields": "id,text,author_id,created_at",
    "user.fields": "id,name,username,created_at,description,location,verified",
    "expansions": "author_id",
    "max_results": 100,
}

### Authentication
Authentication is done by bearer token.

In [6]:
def request_headers(bearer_token: str) -> dict:
    """
    Set up the request headers. 
    Returns a dictionary summarising the bearer token authentication details.
    """
    return {"Authorization": "Bearer {}".format(bearer_token)}

In [7]:
headers = request_headers(bearer_token)

### Connecting to endpoint and taking a look at the data
We connect to the endpoint and retrieve our first page of data to see what changed in comparison to the previous notebook.

In [8]:
def connect_to_endpoint(endpoint_url: str, headers: dict, parameters: dict) -> json:
    """
    Connects to the endpoint and requests data.
    Returns a json with Twitter data if a 200 status code is yielded.
    Programme stops if there is a problem with the request and sleeps
    if there is a temporary problem accessing the endpoint.
    """
    response = requests.request(
        "GET", url=endpoint_url, headers=headers, params=parameters
    )
    response_status_code = response.status_code
    if response_status_code != 200:
        if response_status_code >= 400 and response_status_code < 500:
            raise Exception(
                "Cannot get data, the program will stop!\nHTTP {}: {}".format(
                    response_status_code, response.text
                )
            )
        
        sleep_seconds = random.randint(5, 60)
        print(
            "Cannot get data, your program will sleep for {} seconds...\nHTTP {}: {}".format(
                sleep_seconds, response_status_code, response.text
            )
        )
        time.sleep(sleep_seconds)
        return connect_to_endpoint(endpoint_url, headers, parameters)
    return response.json()

Let us retrieve the first page of tweets for our first rule:

In [9]:
query_parameters["query"] = rules[0]["value"]
json_response = connect_to_endpoint(endpoint_url, headers, query_parameters)

Now the json_response dictionary contains 3 keys: *data*, *includes* and *meta*. The only difference from the previous example is the *includes* field.

In [10]:
json_response.keys()

dict_keys(['data', 'includes', 'meta'])

json_response["includes"] is also a dictionary and it contains one key, "users", because we are now also collecting user information. If other information such as places/location information was also being collected, then we would have another key in our json_response["includes"] dictionary.

In [11]:
json_response["includes"].keys()

dict_keys(['users'])

This is what each user dictionary looks like:

In [12]:
json_response["includes"]["users"][0]

{'id': '1335131',
 'created_at': '2007-03-17T04:29:45.000Z',
 'location': 'Seattle, WA',
 'username': 'aaronbrethorst',
 'verified': False,
 'description': 'extremely hardcore middle-aged software bozo.',
 'name': '@aaronbrethorst@mastodon.social \uea00'}

### Collecting tweets from the past 7 days

We define a functions to process twitter data and we start the data collection process.

In [13]:
def process_twitter_data(
    json_response: json,
    query_tag: str,
    tweets_data: pd.DataFrame,
    users_data: pd.DataFrame,
) -> tuple[pd.DataFrame]:
    """
    Adds new tweet/user information to the table of
    tweets/users and saves dataframes as pickle files,
    if data is avaiable.
    
    Returns the tweets and users updated dataframes.
    """
    if "data" in json_response.keys():
        new = pd.DataFrame(json_response["data"])
        tweets_data = pd.concat([tweets_data, new])
        tweets_data.reset_index(drop=True, inplace=True)
        tweets_data.to_pickle("tweets_" + query_tag + ".pkl")

        if "users" in json_response["includes"].keys():
            new = pd.DataFrame(json_response["includes"]["users"])
            users_data = pd.concat([users_data, new])
            users_data.drop_duplicates("id", inplace=True)
            users_data.reset_index(drop=True, inplace=True)
            users_data.to_pickle("users_" + query_tag + ".pkl")

    return tweets_data, users_data

Now that we know what the data looks like, let's start our data collection process!

**The data collection process:**
- We define empty dataframes where we will store information about tweets and users;
- The for loop allows you to go through all your rules;
- We update the query parameters query field according to the rule in question;
- We connect to the endpoint as in the previous example and process the data, using the process_twitter_data() function;
- Then the program sleeps for 5 seconds. This is necessary not to surpass the rate limit. For this specific endpoint and Essential access level, the rate limit is 180 requests/15 minutes per user, which translates into 1 request every 5 seconds so we need to wait for at least 5 seconds before we make another request.
- If json_response["meta"] has a next_token (the pagination token) field then it means that we have not reached the final page of tweets, so we add it as a query parameter and collect more tweets;
- We repeat the process until  json_response["meta"] no longer contains  next_token field.

In [14]:
tweets_data = pd.DataFrame()
users_data = pd.DataFrame()

for i in range(len(rules)):
    query_parameters["query"] = rules[i]["value"]
    query_tag = rules[i]["tag"]

    json_response = connect_to_endpoint(endpoint_url, headers, query_parameters)
    tweets_data, users_data = process_twitter_data(
        json_response, query_tag, tweets_data, users_data
    )

    time.sleep(5)

    while "next_token" in json_response["meta"]:
        query_parameters["next_token"] = json_response["meta"]["next_token"]

        json_response = connect_to_endpoint(endpoint_url, headers, query_parameters)
        tweets_data, users_data = process_twitter_data(
            json_response, query_tag, tweets_data, users_data
        )

        time.sleep(5)

### Exercise: Take some time to look at the data we just collected

In [15]:
tweets_hp = pd.read_pickle("tweets_heat_pump.pkl")
tweets_gb = pd.read_pickle("tweets_gas_boiler.pkl")

users_hp = pd.read_pickle("users_heat_pump.pkl")
users_gb = pd.read_pickle("users_gas_boiler.pkl")

In [16]:
tweets_hp

Unnamed: 0,id,edit_history_tweet_ids,author_id,created_at,text
0,1612970711546724353,[1612970711546724353],1335131,2023-01-11T00:32:39.000Z,@loganb Are there any heat pump systems that w...
1,1612970521783574529,[1612970521783574529],1561751361359847424,2023-01-11T00:31:54.000Z,@DoombergT Can someone give her a physics less...
2,1612970310239752196,[1612970310239752196],1565388607791108097,2023-01-11T00:31:04.000Z,Heat Pumps Massively Improve EV Range in Cold ...
3,1612969767672979456,[1612969767672979456],209980898,2023-01-11T00:28:54.000Z,"In the long term, switching from gas boilers t..."
4,1612969563993550850,[1612969563993550850],2873595955,2023-01-11T00:28:06.000Z,@cdrsalamander Yes ban natural gas in northern...
...,...,...,...,...,...
4463,1610436840456069120,[1610436840456069120],23367384,2023-01-04T00:43:57.000Z,French startup unveils new residential thermo-...
4464,1610436690635636738,[1610436690635636738],3153011014,2023-01-04T00:43:22.000Z,French startup unveils new residential thermo-...
4465,1610436380081164292,[1610436380081164292],1373034961758945281,2023-01-04T00:42:08.000Z,Quad Zone High SEER Ductless Mini Split Air Co...
4466,1610436101402918912,[1610436101402918912],1176560598567407616,2023-01-04T00:41:01.000Z,French startup unveils new residential thermo-...


In [17]:
tweets_gb

Unnamed: 0,id,edit_history_tweet_ids,author_id,created_at,text
0,1612970711546724353,[1612970711546724353],1335131,2023-01-11T00:32:39.000Z,@loganb Are there any heat pump systems that w...
1,1612970521783574529,[1612970521783574529],1561751361359847424,2023-01-11T00:31:54.000Z,@DoombergT Can someone give her a physics less...
2,1612970310239752196,[1612970310239752196],1565388607791108097,2023-01-11T00:31:04.000Z,Heat Pumps Massively Improve EV Range in Cold ...
3,1612969767672979456,[1612969767672979456],209980898,2023-01-11T00:28:54.000Z,"In the long term, switching from gas boilers t..."
4,1612969563993550850,[1612969563993550850],2873595955,2023-01-11T00:28:06.000Z,@cdrsalamander Yes ban natural gas in northern...
...,...,...,...,...,...
4466,1610436101402918912,[1610436101402918912],1176560598567407616,2023-01-04T00:41:01.000Z,French startup unveils new residential thermo-...
4467,1610435925963579395,[1610435925963579395],1181481640532406272,2023-01-04T00:40:19.000Z,French startup unveils new residential thermo-...
4468,1610486623019442176,[1610486623019442176],395416760,2023-01-04T04:01:46.000Z,"Total cost here came to about $2,000. Multiple..."
4469,1610473114005114882,[1610473114005114882],23439751,2023-01-04T03:08:06.000Z,#BOXT_Gas_Boilers is a leading UK #gas_install...


In [18]:
users_hp

Unnamed: 0,description,location,name,verified,username,id,created_at
0,extremely hardcore middle-aged software bozo.,"Seattle, WA",@aaronbrethorst@mastodon.social ,False,aaronbrethorst,1335131,2007-03-17T04:29:45.000Z
1,,,Mr. Smith,False,MrSmith_009,1561751361359847424,2022-08-22T16:25:32.000Z
2,The CVEC is a partnership of community-based o...,"Fresno, CA",Clean Vehicle Empowerment Collaborative,False,EVempowerment,1565388607791108097,2022-09-01T17:18:37.000Z
3,Deputy Director for Sustainable Future @nesta_...,Bristol,Andrew Sissons,False,ACJSissons,209980898,2010-10-30T08:57:51.000Z
4,,mmi,Tom Callender,False,Tom_Callender,2873595955,2014-10-23T15:30:21.000Z
...,...,...,...,...,...,...,...
3055,When the blast wave hit\nThe impact burned pai...,The Lookout,2xMvr,False,2xMvr,1264366185153323008,2020-05-24T01:22:48.000Z
3056,Technology Expert - Advisory Board Member - Co...,"California, Riverside",Keith Nelson,False,knelsonvsi,23367384,2009-03-08T23:01:16.000Z
3057,Husband | Engineer | Traveler | Creator | Thin...,"Irvine, CA",Terry Ferguson,False,terrypferguson,3153011014,2015-04-09T23:07:01.000Z
3058,Hacker News bot that tweets any story that bre...,,HackerNewsTop10,False,HackerNewsTop10,1176560598567407616,2019-09-24T18:14:54.000Z


In [19]:
users_gb

Unnamed: 0,description,location,name,verified,username,id,created_at
0,extremely hardcore middle-aged software bozo.,"Seattle, WA",@aaronbrethorst@mastodon.social ,False,aaronbrethorst,1335131,2007-03-17T04:29:45.000Z
1,,,Mr. Smith,False,MrSmith_009,1561751361359847424,2022-08-22T16:25:32.000Z
2,The CVEC is a partnership of community-based o...,"Fresno, CA",Clean Vehicle Empowerment Collaborative,False,EVempowerment,1565388607791108097,2022-09-01T17:18:37.000Z
3,Deputy Director for Sustainable Future @nesta_...,Bristol,Andrew Sissons,False,ACJSissons,209980898,2010-10-30T08:57:51.000Z
4,,mmi,Tom Callender,False,Tom_Callender,2873595955,2014-10-23T15:30:21.000Z
...,...,...,...,...,...,...,...
3058,Hacker News bot that tweets any story that bre...,,HackerNewsTop10,False,HackerNewsTop10,1176560598567407616,2019-09-24T18:14:54.000Z
3059,An average #radonc with extraordinary interest...,,Abhishek Puri,False,radoncnotes,1181481640532406272,2019-10-08T08:09:39.000Z
3060,Building wealth one property at a time. Showin...,Tools for your money journey:,The Frugal Mogul 🏡,False,TheFrugalMogul,395416760,2011-10-21T16:39:31.000Z
3061,"We list and promote 👜 #Fashion, 🥑 #Food, 💍 #Je...",United Kingdom,PrimeSiteUK - Business Listing & Promotion,False,PrimeSiteUK,23439751,2009-03-09T14:18:51.000Z
