# Getting Tweets from Twitter
For data acquisition, we use the Twitter API (academic track). Keep in mind that the Twitter API changes frequently, especially since the latest change of management. We have already had difficulty retrieving the same data with the same requests. For example, our latest trials no longer return geodata for POI tweets.

This notebook presents our workflow for communicating with the API. The following notebooks, however, use the data acquired around September 2022.

### Loading modules
The base code for communicating with the Twitter API is get_tweets.py, a comprehensive module that was written for another project.

In [1]:
from random import sample
import backend_codes.get_tweets as gt

The get_tweets module is object-oriented: you create an instance of the getter and set the respective variables. After that you can use different commands to pull tweets from the API, e.g. get_tweets(), where you specify the number of pages you want to pull.

### Setting parameters

Start time and end time in ISO 8601 format (UTC):

In [2]:
start_time="2020-04-06T00:00:00.000Z"
end_time="2022-07-31T23:59:59.000Z"
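If the time window is computed rather than typed by hand, the same strings can be produced from `datetime` objects. This is a minimal sketch; the helper name `to_api_timestamp` is hypothetical and not part of get_tweets.

```python
from datetime import datetime, timezone

def to_api_timestamp(moment: datetime) -> str:
    """Format an aware datetime as UTC ISO 8601 with milliseconds and a trailing Z."""
    utc = moment.astimezone(timezone.utc)
    # isoformat() ends in "+00:00" for UTC; the notebook's strings use "Z" instead.
    return utc.isoformat(timespec="milliseconds").replace("+00:00", "Z")

# Hypothetical helper applied to the same window as in the cell above.
start_time = to_api_timestamp(datetime(2020, 4, 6, tzinfo=timezone.utc))
end_time = to_api_timestamp(datetime(2022, 7, 31, 23, 59, 59, tzinfo=timezone.utc))
print(start_time, end_time)
```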

Define the bounding box of Rio's city boundaries. A single bounding box exceeds Twitter's limit (max. ~40 km edge length), so we work with two bounding boxes.

In [3]:
bbox=[(-43.79682, -23.08779, -43.44485, -22.74568),
        (-43.44485, -23.08779, -43.09288, -22.74568)]
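The two boxes above are the two halves of one box around Rio, split along the longitude axis. The sketch below shows one way such a split and a rough edge-length check could be done. Both function names are hypothetical, and the distance estimate assumes a spherical Earth with about 111.32 km per degree of latitude.

```python
import math

def split_bbox(west, south, east, north, n_cols):
    """Split a (west, south, east, north) box into n_cols equal-width columns."""
    step = (east - west) / n_cols
    # Round to 5 decimals to match the precision used in the notebook.
    edges = [round(west + i * step, 5) for i in range(n_cols + 1)]
    return [(edges[i], south, edges[i + 1], north) for i in range(n_cols)]

def edge_lengths_km(west, south, east, north):
    """Approximate (width, height) in km; only reasonable for small boxes."""
    km_per_degree = 111.32  # assumed rough length of one degree of latitude
    mid_lat = math.radians((south + north) / 2)
    width = (east - west) * km_per_degree * math.cos(mid_lat)
    height = (north - south) * km_per_degree
    return width, height

rio = (-43.79682, -23.08779, -43.09288, -22.74568)
tiles = split_bbox(*rio, n_cols=2)
for tile in tiles:
    print(tile, [round(v, 1) for v in edge_lengths_km(*tile)])
```

Under these assumptions the full box is wider than 40 km, while each half stays below that limit in both directions.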

### More details on geolocation:
Nowadays, hardly any tweets have real coordinates attached (a few still do, but those coordinates are of little use since they only have a precision of two decimal places). When users share their location, they do so at different administrative levels. Each place, no matter its type, has a unique place ID in the Twitter database:
- POIs: For example, the local gym or coffee shop. These are given as exact coordinates provided by Twitter.
- Neighborhoods: For example, the Centro district in Rio. In this case Twitter gives a bounding box of the district and the district name.
- City: The user just attaches the city they sent the tweet from. This is irrelevant when analysing urban movements inside the city.
- Even higher administrative units (state, country, etc.), which are also irrelevant.

Most users who send a geolocated tweet just share the city they sent it from (e.g. Rio de Janeiro, Duque de Caxias, etc.). Since we do not need these tweets and do not want to strain our tweet limits, we identify the place IDs of Rio itself and of all cities that intersect Rio's bounding box. This still leaves neighborhood-level tweets sent from adjacent cities, but these are not too many and can be filtered out later.

Put the place IDs in a list to combine them. Prefixing an ID with -- excludes that place instead of restricting the search to it. Passing False as the second argument joins the IDs with AND logic, so all of these cities are excluded.

In [4]:
place_id=(['--97bcdfca1a2dca59', '--41bf05f3b26396e4', '--596a82c8c53236bd', 
            '--4029837e46e8e369', '--1c88433143d383e1', '--59373f0a295160e4', 
            '--7343e9c57b3427b5', '--3b5c5c9c62f7c538',
            '--abbb45debbf38127', '--b1a5f3bbff698d24', '--d1fc0c973adbff22'], False)
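To make the convention concrete, the sketch below shows how such a tuple could be turned into a query clause. It is not the actual get_tweets implementation: `build_place_clause` is a hypothetical helper, written only to be consistent with the query summary this notebook prints further down.

```python
def build_place_clause(place_ids, use_or):
    """Turn place IDs into a parenthesised query clause.

    IDs prefixed with "--" become negated operators (-place:ID);
    all other IDs become plain place:ID operators.
    """
    terms = []
    for pid in place_ids:
        if pid.startswith("--"):
            terms.append("-place:" + pid[2:])
        else:
            terms.append("place:" + pid)
    # In the search syntax, a space between operators means AND.
    joiner = " OR " if use_or else " "
    return "(" + joiner.join(terms) + ")"

# Two of the IDs from the cell above, with AND logic (second element False).
ids, use_or = (["--97bcdfca1a2dca59", "--41bf05f3b26396e4"], False)
clause = build_place_clause(ids, use_or)
print(clause)
```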

### Initialize a getter object and set parameters
has_geo has to be True since we only want geolocated tweets. We do not want retweets since they do not produce new location data. max_results_per_page is the number of tweets we get per request sent to Twitter; in our case this should be the maximum (500) to save on the request limits.

The function returns a short summary of the current request setup.

In [5]:
getter = gt.FullArchiveV2()

getter.set_parameters(start_time=start_time, end_time=end_time, bbox=bbox, has_geo=True, get_retweets=False, \
                      place_id=place_id, max_results_per_page=500)

Your Query:

place.fields: full_name,geo,id,name,place_type

query: (bounding_box:[-43.79682 -23.08779 -43.44485 -22.74568] OR bounding_box:[-43.44485 -23.08779 -43.09288 -22.74568]) (-place:97bcdfca1a2dca59 -place:41bf05f3b26396e4 -place:596a82c8c53236bd -place:4029837e46e8e369 -place:1c88433143d383e1 -place:59373f0a295160e4 -place:7343e9c57b3427b5 -place:3b5c5c9c62f7c538 -place:abbb45debbf38127 -place:b1a5f3bbff698d24 -place:d1fc0c973adbff22) -is:retweet has:geo

tweet.fields: created_at,text,public_metrics,referenced_tweets,geo,lang,conversation_id

expansions: author_id,geo.place_id

max_results: 500

start_time: 2020-04-06T00:00:00.000Z

end_time: 2022-07-31T23:59:59.000Z

Your Query is 400 characters long
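The reported length matters because the API caps the length of the query string. The sketch below reassembles the query shown above from its parts and checks its length. The helper names are hypothetical, and the 1024-character cap is an assumption about the academic track at the time, so adjust it to your access level.

```python
MAX_QUERY_LENGTH = 1024  # assumed academic-track limit, not taken from get_tweets

def bbox_clause(bboxes):
    """OR-combine several (west, south, east, north) boxes into one clause."""
    parts = ["bounding_box:[{} {} {} {}]".format(*box) for box in bboxes]
    return "(" + " OR ".join(parts) + ")"

def assemble_query(bboxes, excluded_places):
    """Join the bounding boxes, place exclusions and flags with AND (spaces)."""
    exclusions = "(" + " ".join("-place:" + pid for pid in excluded_places) + ")"
    return " ".join([bbox_clause(bboxes), exclusions, "-is:retweet", "has:geo"])

bboxes = [(-43.79682, -23.08779, -43.44485, -22.74568),
          (-43.44485, -23.08779, -43.09288, -22.74568)]
excluded = ["97bcdfca1a2dca59", "41bf05f3b26396e4", "596a82c8c53236bd",
            "4029837e46e8e369", "1c88433143d383e1", "59373f0a295160e4",
            "7343e9c57b3427b5", "3b5c5c9c62f7c538", "abbb45debbf38127",
            "b1a5f3bbff698d24", "d1fc0c973adbff22"]

query = assemble_query(bboxes, excluded)
print(len(query), len(query) <= MAX_QUERY_LENGTH)
```

Each additional bounding box or excluded place adds to the length, so a check like this is useful before extending the setup to a larger area.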


### Running the tweet machine
get_tweets retrieves up to number_of_pages * max_results_per_page tweets. If more tweets are available, we can call get_more_tweets, which sends further requests to Twitter. Twitter sends the place and user information separately from the tweets themselves to avoid redundancy; the getter automatically attaches the places to the tweets. The original response can be accessed via getter.full_response. A request does not always return 500 tweets, since tweets may, for example, have been deleted or be inaccessible for another reason.
Examples below:

If we want to get all available tweets in this timeframe, we have to set a very high number of pages. Since this is a demonstration notebook, we do not pull the tweets here so as not to reveal the API token. The tweets are loaded in the next step.

In [6]:
# tweets = getter.get_tweets(number_of_pages=1, save_inbetween=True)
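Since the real request cannot be run here, the paging logic can be illustrated with a stand-in. The sketch below assumes each response carries the token for the following page under `meta.next_token`, as Twitter API v2 search responses do. `collect_pages`, `fake_fetch` and the fake pages are hypothetical and do not reflect how get_tweets is implemented.

```python
def collect_pages(fetch_page, number_of_pages):
    """Request up to number_of_pages pages, following meta.next_token."""
    tweets, token = [], None
    for _ in range(number_of_pages):
        response = fetch_page(token)
        # A page may hold fewer tweets than requested, or none at all.
        tweets.extend(response.get("data", []))
        token = response.get("meta", {}).get("next_token")
        if token is None:  # no further pages available
            break
    return tweets, token

# Hypothetical stand-in for a real request: three tiny pages of fake tweets.
FAKE_PAGES = {
    None: {"data": [{"id": "1"}, {"id": "2"}], "meta": {"next_token": "a"}},
    "a": {"data": [{"id": "3"}], "meta": {"next_token": "b"}},
    "b": {"data": [{"id": "4"}], "meta": {}},
}

def fake_fetch(token):
    return FAKE_PAGES[token]

tweets, next_token = collect_pages(fake_fetch, number_of_pages=2)
print(len(tweets), next_token)
```

A token that is still set after the loop means more tweets are available, which is the situation where a follow-up call such as get_more_tweets would be needed.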

### Working with the full response

The full response contains three fields: data (the tweets), errors (hopefully empty), and includes. includes contains users (all users with their information and IDs, to be joined later with their respective tweets) and places (all places, such as neighborhoods and POIs, with their information and IDs, to be joined later with the tweets).
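The join described above can be sketched as a pair of dictionary lookups. This assumes the raw response shape, in which a tweet references its place via `geo.place_id` and its author via `author_id`. The function name and the sample response are hypothetical, and the getter's own attachment logic may differ.

```python
def attach_includes(full_response):
    """Return copies of the tweets with their place and user records attached."""
    includes = full_response.get("includes", {})
    # Index places and users by ID so each tweet needs only one lookup.
    places = {place["id"]: place for place in includes.get("places", [])}
    users = {user["id"]: user for user in includes.get("users", [])}
    joined = []
    for tweet in full_response.get("data", []):
        place_id = tweet.get("geo", {}).get("place_id")
        joined.append({**tweet,
                       "place": places.get(place_id),
                       "user": users.get(tweet.get("author_id"))})
    return joined

# Hypothetical miniature response, for illustration only.
example_response = {
    "data": [{"id": "t1", "author_id": "u1", "geo": {"place_id": "p1"}},
             {"id": "t2", "author_id": "u2", "geo": {"place_id": "missing"}}],
    "includes": {
        "users": [{"id": "u1", "username": "example_user"}],
        "places": [{"id": "p1", "name": "Centro", "place_type": "neighborhood"}],
    },
}

joined = attach_includes(example_response)
print(joined[0]["place"]["name"], joined[1]["place"])
```

Tweets whose place or user is missing from includes get None instead of raising an error, which keeps incomplete responses usable.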

Normally, the full response would be accessed like this:

In [7]:
full_response = getter.full_response

But we can also load the full response with all of our data from the repository.

In [8]:
full_response = gt.load_tweets(r"data\tweets\retrieved_tweets.txt")
tweets = full_response['data']

In [9]:
for p in sample(tweets, 10):
    print(p['geo']['place_type'])

poi
poi
poi
poi
poi
poi
neighborhood
poi
poi
poi
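A random sample only hints at the distribution. Counting the place types over all tweets, and dropping anything coarser than neighborhood level, could look like the sketch below. It assumes each tweet already carries `geo.place_type`, as in the sample above; the helper names and the sample tweets are hypothetical.

```python
from collections import Counter

def count_place_types(tweets):
    """Count how many tweets fall into each place type."""
    return Counter(t["geo"]["place_type"] for t in tweets)

def keep_fine_grained(tweets, wanted=("poi", "neighborhood")):
    """Keep only tweets located at POI or neighborhood level."""
    return [t for t in tweets if t["geo"]["place_type"] in wanted]

# Hypothetical tweets, for illustration only.
sample_tweets = [
    {"id": "1", "geo": {"place_type": "poi"}},
    {"id": "2", "geo": {"place_type": "neighborhood"}},
    {"id": "3", "geo": {"place_type": "city"}},
    {"id": "4", "geo": {"place_type": "poi"}},
]

counts = count_place_types(sample_tweets)
fine = keep_fine_grained(sample_tweets)
print(counts.most_common(), len(fine))
```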


### The full response structure:

In [10]:
print("Data (all the tweets)")
print("    ", [key for key in full_response['data'][7].keys()])
print("\nIncludes")
print("    Users\n         [...]")
print("    Places (all the places)")
print("        ", [key for key in full_response['includes']['places'][5].keys()])

Data (all the tweets)
     ['public_metrics', 'author_id', 'text', 'lang', 'id', 'conversation_id', 'created_at', 'geo']

Includes
    Users
         [...]
    Places (all the places)
         ['place_type', 'name', 'id', 'full_name', 'geo']


### How to go from here?
All the places in the full response data will now be assigned to a neighborhood. This is not so simple because we can no longer rely on accurate location coordinates. Joining the places to their respective barrios (neighborhoods) will be described in the next notebook.
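As a rough preview of what working with places rather than coordinates involves, the sketch below reduces a place to the centre of its bounding box. It is not the method used in the next notebook. It assumes the place's geo field holds a `bbox` list in [west, south, east, north] order; the function name and the sample place are hypothetical.

```python
def bbox_centroid(place):
    """Return the (lon, lat) centre of a place's bounding box."""
    west, south, east, north = place["geo"]["bbox"]
    return (west + east) / 2, (south + north) / 2

# Hypothetical place record, for illustration only.
example_place = {
    "id": "p1",
    "name": "Example district",
    "place_type": "neighborhood",
    "geo": {"type": "Feature", "bbox": [-43.2, -22.95, -43.1, -22.85]},
}

lon, lat = bbox_centroid(example_place)
print(round(lon, 3), round(lat, 3))
```

A centroid discards the extent of the box, so a neighborhood-level place that spans several barrios would be assigned to only one of them under this simplification.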