# Twitter

<center><img src="png/twitter_logo.png" width = 150/></center>

The third API we are going to explore is the Twitter API. On the one hand it will be the easiest one to access; on the other hand it will be the hardest one, because unlike Wikipedia and Reddit it requires not only a Twitter account but also a Developer Account. Moreover, in the version we are going to use, it has very strict limits on how much data we can actually get. However, enough with the downsides. The good thing is that Twitter itself provides a great resource, a very comprehensive [tutorial](https://github.com/twitterdev/getting-started-with-the-twitter-api-v2-for-academic-research). This notebook contains a small extract from it, so if you feel you need more information or something is unclear, I recommend consulting the tutorial.

## What is Twitter?

I assume that most of you know what Twitter is, what Tweets look like, and what kind of interactions are possible there. Just in case, below is a very brief definition from the above-mentioned tutorial.

> Twitter is a platform that is used by people across the world to exchange thoughts, ideas and information with one another, using Tweets. Each Tweet consists of up to 280 characters and may include media such as links, images and videos. In the context of research, Twitter data refers to the public information that is provided via Twitter’s application programming interface (API). The API supports various endpoints such as recent search, filtered stream etc. that let developers and researchers connect to the API and request Twitter data.

In general, if you have ever seen a tweet, you probably saw it in a format like the one in the picture below. However, under the hood there is much more information (metadata) that characterizes every single tweet. With luck, we can get some of this information using the Twitter API, for example:

 * Tweet text
 * Tweet ID (that uniquely identifies a Tweet)
 * The time at which the Tweet was created
 * Public metrics associated with the Tweet such as number of retweets, number of likes etc.
 * Public user information such as username, user ID, user bio, profile image url etc.
 * Tweet Annotations - some Tweets are annotated based on the topic that they are about and the named entities present in them, e.g. the COVID-19 stream.

Complete information about the data we can get from each tweet can be found in the documentation under the following [link](https://developer.twitter.com/en/docs/twitter-api/data-dictionary/object-model/tweet).

## Developer Account

As I mentioned at the beginning, to access the Twitter API we need to set up a [Developer Account](https://developer.twitter.com/en) (which requires having a regular account first). Setting it up requires answering a few fairly easy questions and confirming your email. By default you will create a Developer Account with Essential Access to the Twitter API. For the purpose of this class it will be enough, but if you plan on doing real research it is worth applying for Academic Research Access.

After creating an account you should be able to log into the [Developer Platform](https://developer.twitter.com/en) and create a project. In general, projects serve to organize your access to the Twitter API. Each project can contain multiple Apps (in our case it will be only one App), which are used to generate credentials for authentication. In other words, you need to create a project and then an App so that Twitter can recognize that it is you who is trying to access the API. Therefore, you are never anonymous when you are getting data from Twitter. More or less, they know what you are doing. [Let that sink in](https://edition.cnn.com/videos/business/2022/10/28/late-night-elon-musk-sink-pun-twitter-orig-cprog-fj.cnn-business).

When creating the App and the project you simply need to answer a few simple questions. The answers should be straightforward, until you get to a screen that looks like this.

<center><img src="png/keys.png" /></center>

This is the most important moment, because those are your credentials (authorization details that you need to store somewhere safe). In this particular case, although it is not the best idea ever, just copy and paste them into the following chunk under the relevant names. You should, however, try not to share them with anyone. That is because, as I mentioned before, they serve to identify you (they are more or less your ID for Twitter), so if someone misuses them it will be on your account. You should never share them in public repositories. Probably the best practice is to store them as environment variables, but that is far beyond this class. Therefore, for now you will store them in this notebook (you can always access them on the Twitter Developer Platform).
 
  

In [37]:
## Define authorization keys
API_KEY = ''
API_SECRET_KEY = ''
BEARER_TOKEN = ''
## If an environment variable named 'Bearer_Token' is set, use it instead;
## otherwise keep the value pasted above
import os
BEARER_TOKEN = os.getenv('Bearer_Token', BEARER_TOKEN)
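
If you ever want to try the environment-variable route, here is a minimal sketch of how it could look. The helper name `read_credential` and the variable name `DEMO_BEARER_TOKEN` are made up for this illustration, and the value is a placeholder rather than a real token. The only real API used is `os.environ` from the standard library.

```python
import os


def read_credential(name):
    """Return the value of the environment variable `name`.

    Raises a RuntimeError with a readable message when the variable is
    missing or empty, so a forgotten token is noticed early.
    """
    value = os.environ.get(name, "")
    if not value:
        raise RuntimeError(f"Environment variable {name!r} is not set")
    return value


# Demo only: normally the variable is set outside the notebook (for example
# in the shell), so the secret never appears in the notebook itself.
os.environ["DEMO_BEARER_TOKEN"] = "not-a-real-token"
demo_token = read_credential("DEMO_BEARER_TOKEN")
```

The point of failing loudly is that an empty token would otherwise only show up later as a confusing authentication error.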


The newly created account gives us access to the so-called standard product track. In general, it allows us to:

* Search for Tweets from the last 7 days by specifying queries using supported operators (more on building queries in later sections)
* Stream Tweets in real-time as they are happening by specifying rules to filter for Tweets that you are interested in.
* Get Tweets from a user’s timeline (up to 3200 most recent Tweets)
* Build the full Tweet objects from a Tweet ID, or a set of Tweet IDs
* Look up follower relationships

As you can see, there is a limit on how far back we can go. Moreover, there is a total limit of 500,000 tweets we can get per month. This restriction, however, does not apply to streaming tweets: we can stream as many tweets as we want. But it is important to acknowledge that the Stream Tweets endpoint gives access to only a sample of tweets (around 1% of all tweets).

### Academic Research Product Track

It is possible and rather straightforward to get access to the whole archive of tweets, dating back to 2006 (using the full-archive search endpoint). You just need to apply for it, and the benefits include:

* Ability to get historical Tweets from the entire archive of public conversation on Twitter, dating back to 2006 (using the full-archive search endpoint)
* Higher monthly Tweet volume cap of 10 million Tweets per month
* More advanced filter options to return relevant data, including a longer query length, support for more concurrent rules (for filtered stream endpoint), and additional operators that are only supported in this product track (more on this later)

As Twitter states it: "The Academic Research product track is reserved for those conducting professional academic research who have a specific research purpose with Twitter data." In order to get access to the academic research product track, these are the requirements:

* You are a graduate student, doctoral candidate, post-doc, faculty, or research-focused employee at an academic institution or university.
* You have a clearly defined research objective, and you have specific plans for how you intend to use, analyze, and share Twitter data from your research.
* You will use this product track for non-commercial purposes.

From what I know, it is not very hard to get such access to the Twitter API; however, it requires writing quite a few sentences about the purpose.

## Endpoints

The Twitter API provides different endpoints to get Tweets, based on your use case. It is important to know which endpoint you should use in order to get the right data. For example, if you want to get historical Tweets, you have the choice of using the [recent search endpoint](https://developer.twitter.com/en/docs/twitter-api/tweets/search/introduction) (if the Tweets are from the last 7 days) or the [full-archive search endpoint](https://developer.twitter.com/en/docs/twitter-api/tweets/search/quick-start/full-archive-search) (if the Tweets are older than that). You cannot get this historical data using a streaming endpoint such as the [filtered stream endpoint](https://developer.twitter.com/en/docs/twitter-api/tweets/filtered-stream/introduction), because that endpoint only provides Tweets in real time, as they happen. Similarly, if you want to build your Tweet dataset from a list of Tweet IDs, you can use the [Tweet lookup endpoint](https://developer.twitter.com/en/docs/twitter-api/tweets/lookup/introduction). A good summary of the most popular endpoints, with the questions they can help to answer, can be found [here](https://github.com/twitterdev/getting-started-with-the-twitter-api-v2-for-academic-research/blob/main/modules/3-deciding-which-endpoints-to-use.md).
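
The choice between the two search endpoints boils down to simple date arithmetic. Below is a small sketch of that rule; the function `choose_search_endpoint` and the labels it returns are my own naming for illustration, not part of the Twitter API or of any library. It only encodes the 7-day rule described above and does not cover the streaming or lookup endpoints.

```python
import datetime

## Recent search covers roughly the last 7 days (see the text above)
RECENT_WINDOW = datetime.timedelta(days=7)


def choose_search_endpoint(start_time, now=None):
    """Return 'recent' if start_time is within the last 7 days,
    otherwise 'full-archive'. Both labels are illustrative only."""
    if now is None:
        now = datetime.datetime.now(datetime.timezone.utc)
    if start_time > now:
        raise ValueError("start_time lies in the future")
    return "recent" if now - start_time <= RECENT_WINDOW else "full-archive"


## A fixed "now" keeps the example reproducible
now = datetime.datetime(2022, 11, 3, tzinfo=datetime.timezone.utc)
two_days_ago = datetime.datetime(2022, 11, 1, tzinfo=datetime.timezone.utc)
a_year_ago = datetime.datetime(2021, 11, 1, tzinfo=datetime.timezone.utc)

print(choose_search_endpoint(two_days_ago, now))  # recent
print(choose_search_endpoint(a_year_ago, now))    # full-archive
```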

In [84]:
from twarc import Twarc2, expansions
import datetime
import json
import os

## Replace your bearer token below
client = Twarc2(bearer_token=BEARER_TOKEN)


## Specify the start time in UTC for the time period you want Tweets from
## It must be within last 7 days
start_time = datetime.datetime(2022, 11, 1, 0, 0, 0, 0, datetime.timezone.utc)

## Specify the end time in UTC for the time period you want Tweets from
end_time = datetime.datetime(2022, 11, 2, 0, 0, 0, 0, datetime.timezone.utc)

## This is where we specify our query 
query = "from:elonmusk -is:retweet"

## The search_recent method calls the recent search endpoint to get Tweets based on the query, start and end times
search_results = client.search_recent(query=query, start_time=start_time, end_time=end_time, max_results=100)

Let's unpack what we got. We accessed the Twitter API using the `Twarc2` client, so we would probably expect some indication that the request succeeded. A status code? At least something similar to Wikipedia's response, right?

In [85]:
## We would expect a status code or at least a list of JSONs?
search_results

<generator object Twarc2._search at 0x120344900>

Unfortunately, it is neither. It is a generator object (as it says above), that is, the output of a special kind of function called a generator. In a nutshell, a generator is a function that returns a lazy iterator: its values are produced only when it is iterated over. It might be confusing (and for me it still is), but we don't have to worry about it too much right now. The good news is that we can use a `for` loop to extract the values of a generator object.
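
To make the laziness a bit more tangible, here is a toy generator that has nothing to do with Twitter. The names `count_up_to` and `produced` are invented for this sketch. The list `produced` records the moment each value is actually computed, so we can see that merely calling the function computes nothing.

```python
## Records when each value is actually computed
produced = []


def count_up_to(n):
    """Yield 1, 2, ..., n one value at a time."""
    for i in range(1, n + 1):
        produced.append(i)
        yield i


numbers = count_up_to(3)
produced_before = list(produced)        # still empty: nothing has run yet

first = next(numbers)                   # computes only the first value
produced_after_first = list(produced)   # now contains just 1

## A loop (here inside a list comprehension) pulls the remaining values
rest = [value for value in numbers]

print(produced_before, first, produced_after_first, rest)
```

The same kind of loop works for the generator we got from Twitter.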

In [86]:
## Collect the elements of the generator into
## a list called results
results = [ item for item in search_results ]

On the other hand, the bad news is that since a generator produces its content while it is iterated over, we can iterate over it only once: generators are single use.

In [87]:
## Let's try to re-use our generator to extract its
## values once again.
[ item for item in search_results ]

[]

As I mentioned before, we should not be very surprised that we got an empty list, knowing what a generator is. Good thing we assigned the values of the generator to the name `results`. Let's examine it.

In [89]:
## Let's see what we got from Twitter
print(type(results))
print(len(results))
results

<class 'list'>
1


[{'data': [{'source': 'Twitter for iPhone',
    'conversation_id': '1587498907336118274',
    'possibly_sensitive': False,
    'reply_settings': 'everyone',
    'context_annotations': [{'domain': {'id': '46',
       'name': 'Business Taxonomy',
       'description': 'Categories within Brand Verticals that narrow down the scope of Brands'},
      'entity': {'id': '1557696848252391426',
       'name': 'Financial Services Business',
       'description': 'Brands, companies, advertisers and every non-person handle with the profit intent related to Banks, Credit cards, Insurance, Investments, Stocks '}},
     {'domain': {'id': '46',
       'name': 'Business Taxonomy',
       'description': 'Categories within Brand Verticals that narrow down the scope of Brands'},
      'entity': {'id': '1557696940178935808',
       'name': 'Gaming Business',
       'description': 'Brands, companies, advertisers and every non-person handle with the profit intent related to offline and online games such as ga

In [91]:
results[0].keys()

dict_keys(['data', 'includes', 'meta', '__twarc'])

In [99]:
results[0]['meta']

{'newest_id': '1587588008756318211',
 'oldest_id': '1587250787775832067',
 'result_count': 30}

It is a list of length 1, because the generator yields whole pages of results and all 30 tweets fit into a single page. As we can see, under index 0 we have a JSON object (a Python dictionary) that contains four keys. While we will probably be most interested in `data`, the others might also draw our attention, for example `meta` and `includes`. They both contain some additional information about the tweets (`includes` -- information about objects attached to the tweets, such as media, and `meta` -- simply meta information about the tweets we gathered). Therefore, both fields might be of some interest to us. We could try to add them to the relevant tweets in the `data` field, but fortunately we don't have to do it manually. We can use the `expansions.flatten()` function. It will do the dirty work for us.

In [102]:
results = expansions.flatten(results[0])
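
To get a feeling for what that dirty work involves, here is a heavily simplified sketch of doing one such merge by hand. The `page` dictionary is made-up toy data, and I assume that `includes` carries a `users` list whose `id` values match the tweets' `author_id`. This is not how `expansions.flatten()` is implemented; the real function handles many more kinds of attached objects.

```python
## Toy page in the same overall shape as one element of the results list
page = {
    "data": [
        {"id": "1", "author_id": "10", "text": "first tweet"},
        {"id": "2", "author_id": "11", "text": "second tweet"},
    ],
    "includes": {
        "users": [
            {"id": "10", "username": "alice"},
            {"id": "11", "username": "bob"},
        ]
    },
    "meta": {"result_count": 2},
}

## Index the attached users by their ID for quick lookup
users_by_id = {user["id"]: user for user in page["includes"]["users"]}

## Copy each tweet and attach the matching user under the key 'author'
merged = [
    {**tweet, "author": users_by_id.get(tweet["author_id"])}
    for tweet in page["data"]
]

print(merged[0]["author"]["username"])  # alice
```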

Now that we know the structure of what we get from the Twitter API, we can do all the above-mentioned steps in a single line using a list comprehension.

In [105]:
## Let's do everything in one line
results_lc = [ tweet  for item in search_results for tweet in expansions.flatten(item) ]
results_lc

[]

Hmm, why did it not work? It did not work because we tried to iterate over `search_results`, which is a generator object that we had already exhausted. Generators can be iterated over only once.
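
The way out is to run the one-liner on a fresh generator. Since a new call to the API would need valid credentials, the sketch below uses two stand-ins that I made up: `fake_search()` plays the role of `client.search_recent(...)` and `flatten_page()` plays the role of `expansions.flatten()`. Neither is part of `twarc`; they only mimic the shape of the data.

```python
def fake_search():
    """Hypothetical stand-in for the search call: yields pages of tweets."""
    yield {"data": [{"id": "1"}, {"id": "2"}]}
    yield {"data": [{"id": "3"}]}


def flatten_page(page):
    """Hypothetical, much simplified stand-in for expansions.flatten()."""
    return page["data"]


search = fake_search()

## First pass: works, and uses the generator up
tweets = [tweet for page in search for tweet in flatten_page(page)]

## Second pass over the same generator: nothing is left
again = [tweet for page in search for tweet in flatten_page(page)]

## A new call gives a fresh generator, so the one-liner works again
fresh = [tweet for page in fake_search() for tweet in flatten_page(page)]

print(len(tweets), len(again), len(fresh))  # 3 0 3
```

With the real client this would mean calling `client.search_recent(...)` again, which also counts against the monthly limit, so storing the results in a list is usually the better habit.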

OK, let's now see what kind of data we get when we examine a single tweet. It will probably not be much of a surprise if I tell you that it is a dictionary. Therefore, let's look at its keys.

In [108]:
results[0].keys()

dict_keys(['source', 'conversation_id', 'possibly_sensitive', 'reply_settings', 'context_annotations', 'lang', 'edit_history_tweet_ids', 'id', 'entities', 'text', 'edit_controls', 'in_reply_to_user_id', 'created_at', 'public_metrics', 'referenced_tweets', 'author_id', 'author', 'in_reply_to_user', '__twarc'])

In [118]:
results[0]['in_reply_to_user']

{'location': 'Skyrim',
 'description': 'Rata Noruega.  Me gustan los gatos obesos.',
 'profile_image_url': 'https://pbs.twimg.com/profile_images/1570271576683053056/-szDrDzd_normal.jpg',
 'username': 'Rubiu5',
 'entities': {'url': {'urls': [{'start': 0,
     'end': 23,
     'url': 'https://t.co/GuhPC0QiTt',
     'expanded_url': 'http://www.youtube.com/elrubiusOMG',
     'display_url': 'youtube.com/elrubiusOMG'}]}},
 'pinned_tweet_id': '1046131118385352704',
 'created_at': '2011-10-25T21:37:48.000Z',
 'name': 'elrubius',
 'protected': False,
 'verified': True,
 'id': '398306220',
 'public_metrics': {'followers_count': 20204662,
  'following_count': 931,
  'tweet_count': 25693,
  'listed_count': 8372},
 'url': 'https://t.co/GuhPC0QiTt',
 'pinned_tweet': {}}

Let's unpack the fields a bit and see what we can expect them to contain.

* `source` (string) -- it indicates the software used to create a tweet, e.g. Twitter for iPhone.
* `conversation_id` (string) -- the Tweet ID of the original Tweet of the conversation (which includes direct replies, replies of replies).
* `possibly_sensitive` (boolean) -- this field indicates content may be recognized as sensitive. The Tweet author can select within their own account preferences and choose “Mark media you tweet as having material that may be sensitive” so each Tweet created after has this flag set. This may also be judged and labeled by an internal Twitter support agent.
* `reply_settings` (string) -- shows you who can reply to a given Tweet. Fields returned are "everyone", "mentioned_users", and "followers".
* `context_annotations` (list) -- contains context annotations for the Tweet. This is a bit of magic.
* `lang` (string) -- Language of the Tweet, if detected by Twitter.
* `edit_history_tweet_ids` (list) -- unique identifiers indicating all versions of a Tweet. For Tweets with no edits, there will be one ID. For Tweets with an edit history, there will be multiple IDs, arranged in ascending order reflecting the order of edits. The most recent version is the last position of the array.
* `id` (string) -- the unique identifier of the requested Tweet.
* `entities` (dict) -- entities that have been parsed out of the text of the tweet.
* `text` (string) -- the content of the tweet (encoded in UTF-8).
* `edit_controls` (dict) -- when present, this indicates how much longer the Tweet can be edited and the number of remaining edits. Tweets are only editable for the first 30 minutes after creation and can be edited up to five times.
* `in_reply_to_user_id` (string) -- if the represented tweet is a reply, this field will contain the original Tweet’s author ID. This will not necessarily always be the user directly mentioned in the Tweet.
* `created_at` (string) -- creation time of the Tweet (in the format YYYY-MM-DDThh:mm:ss.sssZ, e.g. 2022-11-01T23:30:51.000Z).
* `public_metrics` (dict) -- public engagement metrics for the Tweet at the time of the request (retweet count, like count, quote count, reply count).
* `referenced_tweets` (list) -- list of Tweets this Tweet refers to. For example, if the parent Tweet is a Retweet, a Retweet with comment (also known as Quoted Tweet) or a Reply, it will include the related Tweet referenced to by its parent.
* `author_id` (string) -- the unique identifier of the User who posted this tweet.
* `author` (dict) -- information about the author of the tweet.
* `in_reply_to_user` (dict) -- information about the user to which this tweet replies.
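
As a small illustration of working with these fields, the sketch below reads two of them from a made-up tweet dictionary. The numbers are invented, and the key names inside `public_metrics` are written the way I expect the API to return them, so please check them against your own output.

```python
import datetime

## Made-up tweet containing only the fields used below
tweet = {
    "id": "123",
    "created_at": "2022-11-01T23:30:51.000Z",
    "public_metrics": {
        "retweet_count": 5,
        "reply_count": 2,
        "like_count": 40,
        "quote_count": 1,
    },
}

## created_at is a string; parse it and mark it explicitly as UTC
created = datetime.datetime.strptime(
    tweet["created_at"], "%Y-%m-%dT%H:%M:%S.%fZ"
).replace(tzinfo=datetime.timezone.utc)

## public_metrics is a nested dictionary, so we need two lookups
likes = tweet["public_metrics"]["like_count"]

print(created.isoformat(), likes)  # 2022-11-01T23:30:51+00:00 40
```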

### Exercise

Now that we know what tweets look like, let's examine the tweets we collected. Please extract from the collected tweets the tweet with the largest number of likes and return its text (maybe it will be something funny...).




In [120]:
## YOUR CODE

{'likes': 1120672, 'text': 'Halloween with my Mom https://t.co/xOAgNeeiNN'}
