# Using APIs

## 1. Using the Twitter API

In [3]:
import tweepy
import pickle
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm

### 1.1: Pulling all tweets based on a search query with the v1.1 API

With the Twitter API we can access most of Twitter’s functionality from within Python (that means both reading **and** writing Tweets, or finding out about users and trends). The package of choice is *Tweepy*, which deals with some of the messy details.

To access the Twitter API, you need to be authenticated. Hence, every request has to come with authentication information. To get this information in the first place, we need to generate our own credentials with a Developer Account:

1. Go to the <a href=https://developer.twitter.com/en>Twitter Developer Site</a> and apply for a Developer Account (you will need a Twitter account for this). If you want to use the v1.1 APIs, you will also need to apply for elevated access (the v2.0 API can be used with just essential access).
2. Create an application (e.g., "My_first_application"). Credentials and limits are per application, not per account.
3. Once you have created your application, copy your consumer API key and secret, as well as your app access key and secret, into the Python code below (see also https://developer.twitter.com/en/docs/basics/authentication/overview/oauth). These are needed for the v1.1 APIs. For the v2.0 API, the "bearer token" suffices.

You can directly add your data as a string like this:
```
CONSUMER_API_KEY = 'COPY STRING HERE'
CONSUMER_API_SECRET = 'COPY STRING HERE'
ACCESS_KEY = 'COPY STRING HERE'
ACCESS_SECRET = 'COPY STRING HERE'
```

So that I can share my code without everyone using my credentials (which would probably lead to me being blocked by Twitter), I'm instead reading the data from a csv here:

In [4]:
api_access = pd.read_csv('API_access.csv',delimiter=';')
CONSUMER_API_KEY = api_access[api_access['api'] == 'twitter_consumer_api_key']['key'].tolist()[0]
CONSUMER_API_SECRET = api_access[api_access['api'] == 'twitter_consumer_api_secret']['key'].tolist()[0]
ACCESS_KEY = api_access[api_access['api'] == 'twitter_access_key']['key'].tolist()[0]
ACCESS_SECRET = api_access[api_access['api'] == 'twitter_access_secret']['key'].tolist()[0]
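
The lookups above imply a semicolon-separated file with an `api` column and a `key` column. Here is a minimal sketch of that layout, using an in-memory stand-in with placeholder values. The column names come from the code above; the helper `load_credentials` and the row contents are made up for illustration. It uses the standard-library `csv` module instead of pandas, so that each credential becomes a simple dictionary lookup.

```python
import csv
import io

# Stand-in for 'API_access.csv'; the values are placeholders, not real credentials.
EXAMPLE_CSV = """api;key
twitter_consumer_api_key;COPY STRING HERE
twitter_consumer_api_secret;COPY STRING HERE
twitter_access_key;COPY STRING HERE
twitter_access_secret;COPY STRING HERE
"""

def load_credentials(handle):
    """Return a {api_name: key} dictionary from a semicolon-separated file."""
    reader = csv.DictReader(handle, delimiter=';')
    return {row['api']: row['key'] for row in reader}

credentials = load_credentials(io.StringIO(EXAMPLE_CSV))
print(sorted(credentials))
```

With a real file, the same helper could be called as `load_credentials(open('API_access.csv', newline=''))`.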


We are also not allowed to request too many Tweets at the same time. There are per-day limits, as well as "rate limits" for 15-minute blocks. If you exceed your limits, you **will** get blocked for some time. For detailed information on the limits, check out https://developer.twitter.com/en/docs/rate-limits.
In many cases, we can use the functionality of Tweepy to automatically delay calls in order to wait on the rate limit - but be aware that this doesn't always work, and we may need to manually add timeouts.
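
A manual timeout could look roughly like the sketch below: retry a call a few times and wait longer after each failure. The helper `call_with_backoff` and the failing function `flaky` are hypothetical and only simulate a rate limit; no request is sent to Twitter.

```python
import time

def call_with_backoff(func, retries=3, base_delay=1.0, retry_on=(Exception,), sleep=time.sleep):
    """Call func(); on failure wait base_delay, 2*base_delay, ... and try again."""
    for attempt in range(retries + 1):
        try:
            return func()
        except retry_on:
            if attempt == retries:
                raise  # out of retries: let the caller see the error
            sleep(base_delay * 2 ** attempt)

# Demonstration with a stand-in function that fails twice before succeeding.
calls = {'n': 0}

def flaky():
    calls['n'] += 1
    if calls['n'] < 3:
        raise RuntimeError("rate limit hit (simulated)")
    return "ok"

waits = []  # record the delays instead of actually sleeping
result = call_with_backoff(flaky, base_delay=0.5, sleep=waits.append)
print(result, waits)
```

With a real API, `retry_on` would ideally be narrowed to the exception that the library raises when the rate limit is hit, so that other errors are not silently retried.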

We are now ready to create our verified interface (automatically waiting on our rate limit as necessary):

In [None]:
auth = tweepy.OAuthHandler(CONSUMER_API_KEY, CONSUMER_API_SECRET)
auth.set_access_token(ACCESS_KEY, ACCESS_SECRET)
api = tweepy.API(auth, wait_on_rate_limit = True)

Let's download some tweets! 
We actually have different "endpoints" to choose from (each is essentially its own API). An overview of the v1.1 APIs can be found here: https://docs.tweepy.org/en/stable/api.html

We will use the standard search endpoint. Note that this endpoint only allows you to download tweets based on general queries from the past week. If you want to download older tweets, you will need to download the tweets of a particular account (see below), or use the 30-day endpoint, for example.

Let's search for tweets with the hashtag `"#redbull"`:

In [None]:
tweets = api.search_tweets(q='#redbull',lang='en')

You can find details about the tweet objects at https://developer.twitter.com/en/docs/twitter-api/v1/data-dictionary/object-model/tweet.

In [None]:
for tweet in tweets:
    print("Created at: " + str(tweet.created_at))
    print("User: " + tweet.user.screen_name)
    print("Followers: " + str(tweet.user.followers_count))
    print("Content: " + tweet.text)
    print("---------------------\n")

Are these all the tweets? What do you think?

In [None]:
len(tweets)

No! A simple search will only return a small number of tweets, similar to the first page of the search results on a website. Instead, we need to paginate all the results. For the v1.1 API, we can use the `Cursor` class of `tweepy`. The documentation is here: https://docs.tweepy.org/en/stable/v1_pagination.html. `Cursor` allows us to control how many items we want using `.items()`. If we want all that can be found through the relevant endpoint, we just leave out the number here.

In [None]:
for tweet in tweepy.Cursor(api.search_tweets,q="#redbull",lang="en").items(5):
    print("Created at: " + str(tweet.created_at))
    print("User: " + tweet.user.screen_name)
    print("Followers: " + str(tweet.user.followers_count))
    print("Content: " + tweet.text)
    print("---------------------\n")
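
To make the idea of pagination more concrete, here is a small sketch that does not touch the API at all. The "server" `fetch_page` is a made-up stand-in that hands out results three at a time together with a token for the next page; the generator `iterate_items` hides the page boundaries in a way that is loosely comparable to `.items()`. It is not how `tweepy` is implemented internally.

```python
from itertools import islice

# Stand-in for a server that returns results in pages of 3 plus a "next" token.
FAKE_RESULTS = [f"tweet {i}" for i in range(8)]

def fetch_page(token=0, page_size=3):
    """Return one page of results and the token for the next page (None at the end)."""
    page = FAKE_RESULTS[token:token + page_size]
    next_token = token + page_size if token + page_size < len(FAKE_RESULTS) else None
    return page, next_token

def iterate_items():
    """Yield items one by one, requesting further pages only when needed."""
    token = 0
    while token is not None:
        page, token = fetch_page(token)
        yield from page

first_five = list(islice(iterate_items(), 5))   # comparable to .items(5)
everything = list(iterate_items())              # comparable to .items()
print(first_five)
print(len(everything))
```

Because the generator is lazy, asking for only five items stops after the second page, which is why limiting the number of items also saves requests.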

When requesting tweets in this manner, the API truncates anything beyond 140 characters. This means that even if we search for tweets with #redbull, the text we receive may not contain the hashtag. However, we can add the parameter `tweet_mode='extended'` to our `tweepy.Cursor()` call. In this case, the returned tweets no longer have a `.text` attribute, but a `.full_text` attribute.

Alternatively, we can "hydrate" tweets at any time using just their ID (i.e., request the full tweet again). You can therefore share your data as a list of tweet IDs only.

In [None]:
for tweet in tweepy.Cursor(api.search_tweets,q="#redbull",lang="en", tweet_mode='extended').items(5):
    print("Created at: " + str(tweet.created_at))
    print("User: " + tweet.user.screen_name)
    print("Followers: " + str(tweet.user.followers_count))
    print("Content: " + tweet.full_text) # Note: when looking at extended tweets, there is no attribute `.text`
    print("---------------------\n")
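
If the same code should handle both standard and extended tweets, a small helper can pick whichever attribute is present. The sketch below uses made-up stand-in objects instead of real tweets, and `tweet_content` is a hypothetical helper name.

```python
from types import SimpleNamespace

def tweet_content(tweet):
    """Return the untruncated text if available, otherwise the (possibly cut) text."""
    full = getattr(tweet, 'full_text', None)
    return full if full is not None else tweet.text

# Stand-in objects that mimic the two shapes of tweet described above.
standard = SimpleNamespace(text="Great race today...")
extended = SimpleNamespace(full_text="Great race today at the track! #redbull")

for tweet in (standard, extended):
    print(tweet_content(tweet))
```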

### 1.2: Pulling tweets based on a search query with the v2.0 API

As mentioned above, we can access the v2.0 API without elevated access and by simply using a bearer token. `tweepy` also supports this API, making our life easier.

In [2]:
BEARER_TOKEN = api_access[api_access['api'] == 'twitter_bearer_token']['key'].tolist()[0]


With v2.0 we cannot use the `API` class anymore. Instead, we use the `Client` class, which is documented here: https://docs.tweepy.org/en/stable/client.html. The site also gives an overview of the different endpoints that you can access, similar to the documentation for the `API` class used in v1.1.

In [None]:
client = tweepy.Client(bearer_token=BEARER_TOKEN)

We will use the `search_recent_tweets` endpoint. Bear in mind that the `search_all_tweets` endpoint is only accessible with research access.

When you search for tweets with the client, you need to specify a query, just as before. However, in v2.0, not all tweet information is delivered by default. Hence, we request additional fields using `tweet_fields`; more information can be found here: https://docs.tweepy.org/en/stable/expansions_and_fields.html#tweet-fields-parameter.

Moreover, we can only get up to 100 tweets per search (and can ask for fewer using `max_results`). We will see below how to get more results.

In [None]:
tweets = client.search_recent_tweets(query="#redbull -is:retweet",
                                     tweet_fields=["created_at","lang"],
                                     max_results=10)

The key information is now within the `tweets.data` list:

In [None]:
tweets.data[0]

Note that each element is an object, and this object has the attributes that we specified in `tweet_fields` (it also has the `text`, which is not shortened in this version of the API).

In [None]:
for tweet in tweets.data:
    print("Created at: " + str(tweet.created_at))
    print("Language: " + str(tweet.lang))
    print("Content: " + tweet.text)
    print("---------------------\n")

However, a lot of data is missing. For example, the tweet doesn't come with all the user information as it did in v1.1 (the next cell will raise an error):

In [None]:
tweets.data[0].user

Instead, tweets can return user-information as a "child object", but only if we request this. For this, we use the `expansions` parameter, requesting `'author_id'`. A list of expansions can be found here: https://developer.twitter.com/en/docs/twitter-api/expansions.

Note that when we use the expansion `'author_id'`, we only get basic information about the user. We can extend what information we get about the user by specifying the `user_fields` parameter.

In [None]:
tweets = client.search_recent_tweets(query="#redbull -is:retweet",
                                     tweet_fields=["created_at","lang"],
                                     expansions=['author_id'],
                                     user_fields=["description","profile_image_url"],
                                     max_results=10)

The user information is not stored within `tweets.data`, but instead within `tweets.includes` (which itself is a dictionary that can contain things related to any expansion - the key `'users'` will give us the user information as a list). Note that the whole object model is described here: https://developer.twitter.com/en/docs/twitter-api/data-dictionary/object-model/tweet.

In [None]:
tweets.includes['users']

The objects in our `'users'` list always contain some basic information:

In [None]:
tweets.includes['users'][0].name

For each of these users, we can also get all the information that we requested through the `user_fields`. For example, the `profile_image_url`:

In [None]:
tweets.includes['users'][0].profile_image_url

There is one important aspect to keep in mind: the API returns each user only once, no matter how many of the tweets in the result they posted. Hence, `tweets.includes['users']` might be a shorter list than `tweets.data` (namely, whenever one or more users posted multiple tweets).

****
So far, so good. The problem is that we only get 100 tweets. How can we get more? This is where the `Paginator` class comes in: https://docs.tweepy.org/en/stable/v2_pagination.html. This allows us to essentially do multiple requests and ensures that tweets are processed in order. Actually, this is very similar to the `Cursor` class from v1.1.

In [None]:
pages = tweepy.Paginator(client.search_recent_tweets,
                              query="#redbull -is:retweet",
                              tweet_fields=["created_at","lang"],
                              expansions=['author_id'],
                              user_fields=["description","profile_image_url"],
                              max_results=100)

We get back a sequence of "pages", each containing up to 100 tweets within its `data`. Basically, each page is like one call to our `client` object. If we also want information on the user, we have to be a bit careful and match the `author_id` field of the tweet with the `id` field of the user:

In [None]:
for page in pages:
    page_users = {user.id: user for user in page.includes['users']} # We create a dictionary indexed by the user id to easily retrieve the full user object of each tweet
    for tweet in page.data:
        print("Created at: " + str(tweet.created_at))
        print("Language: " + str(tweet.lang))
        print("Content: " + tweet.text)
        print("User name: " + page_users[tweet.author_id].name)
        print("---------------------\n")
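
If we want to keep the results instead of printing them, we can collect one dictionary per tweet. The sketch below runs on made-up stand-in pages rather than real API responses. It also guards against a page without results, under the assumption that such a page has `data=None` and no `'users'` entry; that behaviour is worth checking against the `tweepy` documentation.

```python
from types import SimpleNamespace as NS

# Stand-in pages mimicking the structure described above (all values are made up).
pages_example = [
    NS(data=[NS(text="first", author_id=1), NS(text="second", author_id=1),
             NS(text="third", author_id=2)],
       includes={'users': [NS(id=1, name="Alice"), NS(id=2, name="Bob")]}),
    NS(data=None, includes={}),  # a page without results (assumed shape)
]

rows = []
for page in pages_example:
    users = {user.id: user for user in page.includes.get('users', [])}
    for tweet in page.data or []:          # treat a missing data list as empty
        author = users.get(tweet.author_id)
        rows.append({'text': tweet.text,
                     'user_name': author.name if author else None})

print(rows)
```

A list of dictionaries like `rows` can be handed directly to `pd.DataFrame(rows)` for further analysis.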

### 1.3: Finding followers in v1.1

We now want to learn more about the people (and company accounts) that follow Red Bull (as well as about whom they follow other than Red Bull). Let's start with finding some of Red Bull's followers:

In [None]:
followers_rb = []
for follower in tweepy.Cursor(api.get_followers,screen_name="redbull").items(5):
    followers_rb.append(follower)

A company like Red Bull has a very large number of followers, and we would run into problems (not least with the rate limits) trying to get all of them at once.

Note that followers are saved as "User" objects, with their very own attributes, found here: https://developer.twitter.com/en/docs/twitter-api/v1/data-dictionary/object-model/user. The Twitter handle is stored in the `screen_name` attribute.

In [None]:
follower = followers_rb[0]
follower

In [None]:
follower.screen_name

Can we get the other accounts that this person follows? (Twitter calls those "friends".) Sometimes, this information is set to private, so we cannot see whom the person is following. Hence, we need to do some exception handling:

In [None]:
try:
    for user in tweepy.Cursor(api.get_friends, screen_name=follower.screen_name).items(10):
        print(user.screen_name)
except tweepy.TweepyException:  # raised, e.g., when the account is protected
    print("Follower " + follower.screen_name + " does not provide access to their friends.")

**Exercise**

Let's combine this for several of Red Bull's followers. For each of the first 5 followers, let's get up to 10 of the accounts that they follow. Can you add each of the (up to) 10 friends of the 5 followers to a list?

In [None]:
all_followers_friends = []








Once done, print out the screen names of the accounts in the combined list:

It's easy to imagine how we could create a network of accounts, right?
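
As a rough idea of what such a network could look like in code, the sketch below turns "who follows whom" into a list of directed edges and counts how many of the sampled followers follow each account. The screen names and the dictionary `friends_of` are made up; in practice they would come from the API calls above.

```python
from collections import defaultdict

# Made-up stand-in data: follower screen name -> screen names of their friends.
friends_of = {
    'follower_a': ['redbull', 'f1', 'nasa'],
    'follower_b': ['redbull', 'f1'],
    'follower_c': ['redbull'],
}

# Directed edge list: (who follows, who is followed).
edges = [(follower, friend)
         for follower, friends in friends_of.items()
         for friend in friends]

# How many of the sampled followers follow each account?
followed_by = defaultdict(set)
for follower, friend in edges:
    followed_by[friend].add(follower)

popularity = sorted(((len(f), name) for name, f in followed_by.items()), reverse=True)
print(popularity)
```

An edge list of this form is also what most network libraries expect as input.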

### 1.4: Finding followers in v2.0

As with the search for tweets, we have to change things up a little bit in order to get information about followers and friends through v2.0. As before, we can use the `Client` class for up to 100 results, and the `Paginator` if we need more. We will stick to the `Client` class for this part, since we only need the followers for Red Bull. The endpoint of choice is `get_users_followers`. However, it requires a user ID as input, so we first have to find Red Bull's user ID. For this, we can make use of the `get_user` endpoint, which returns a user object, specified here: https://developer.twitter.com/en/docs/twitter-api/data-dictionary/object-model/user.

In [None]:
rb = client.get_user(username="redbull")
rb_id = rb.data.id
print(rb_id)

We can now get the followers. Note that we will only get the base data with the following request (everything marked as "default" in the User object documentation). If we want more information, we can again use the `user_fields` parameter.

In [None]:
followers_rb = client.get_users_followers(id=rb_id, max_results = 5)

In [None]:
for follower in followers_rb.data:
    print(follower.name)

The "friends" are now found through the `get_users_following` endpoint, which works in a very similar manner (it will also return a user object):

In [None]:
friends = client.get_users_following(id=followers_rb.data[0].id, max_results = 5)
for friend in friends.data:
    print(friend.name)

**Exercise**

Can you again combine the two pieces of code above to create a list containing 10 friends of each of the first 5 followers of Red Bull?

In [None]:
all_followers_friends = []









Once done, print out the screen names of the accounts in the combined list:

### 1.5: Finding a specific user's tweets (over time) in v1.1

We can also take a look at all the Tweets of a specific account. When looking at an account's Tweets, we do not have to worry about date limits (there are some limitations, however).

To search an account's Tweets, we can pass either the `screen_name` or the `user_id` parameter:

In [None]:
for tweet in tweepy.Cursor(api.user_timeline,user_id=rb_id).items(5):
    print(tweet.text)

### 1.6: Finding a specific user's tweets (over time) in v2.0

The relevant endpoint in v2.0 is `get_users_tweets`, which is documented here: https://developer.twitter.com/en/docs/twitter-api/tweets/timelines/api-reference/get-users-id-tweets. It allows us to retrieve up to 3,200 tweets using the `Paginator` (or 100 using the `Client` alone).

In [None]:
tweets = client.get_users_tweets(id=rb_id,
                                 tweet_fields=["created_at","lang"],
                                 max_results=5)

In [None]:
for tweet in tweets.data:
    print(tweet.text)

### 1.7: Back to our engagement measures

Let's try to enrich our `racingdf` using Tweet data.

We can only collect tweets by hashtag for the past week. Hence, I have prepared the tweets for the last year (see Moodle). They are stored as a `pickle` file - a format that allows us to save arbitrary Python objects directly to disk, outside of our program. When we load the pickle file, we get back exactly the objects we saved into it. Since I saved a list of tweets, the return value from `pickle.load(file)` will be a list of tweets.

In [None]:
with open('red_bull_tweets.txt', 'rb') as file:
    tweets = pickle.load(file)
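
To see the round trip in isolation, here is a minimal sketch with made-up stand-in objects instead of real tweets. It uses `pickle.dumps` and `pickle.loads`, which work on bytes in memory in the same way that `pickle.dump` and `pickle.load` work on files.

```python
import pickle
from types import SimpleNamespace

# Stand-in "tweets"; any picklable Python objects work the same way.
original = [SimpleNamespace(text="Lights out!", retweet_count=3),
            SimpleNamespace(text="What a finish", retweet_count=10)]

blob = pickle.dumps(original)     # what pickle.dump() would write to a file
restored = pickle.loads(blob)     # what pickle.load() would read back

print(type(restored), len(restored))
print(restored[1].text)
```

Keep in mind that loading a pickle file can execute code, so only open pickle files from sources you trust.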

In [None]:
len(tweets)

Note: this contains tweets (without retweets) for all the race days in the `red_bull_race_results.csv` file.

Briefly recall / recreate our dataset `racingdf`:

In [None]:
racingdf = pd.read_csv('red_bull_race_results.csv')

# Date formatting
racingdf['date'] = pd.to_datetime(racingdf['date'], format="%d.%m.%y")

# "out"-indicator (adjusted on March 20)
racingdf['perez_out'] = racingdf['perez'].isna().astype(int)
racingdf['verstappen_out'] = racingdf['verstappen'].isna().astype(int)
racingdf.loc[racingdf['date'] == '2022-03-20','perez_out'] = 1
racingdf.loc[racingdf['date'] == '2022-03-20','verstappen_out'] = 1

# Impute the placement for "out" days (using the worst observed placement)
racingdf['perez'] = racingdf['perez'].fillna(racingdf['perez'].max())
racingdf['verstappen'] = racingdf['verstappen'].fillna(racingdf['verstappen'].max())

racingdf

Let's take the first of these race dates and try to assign the corresponding tweets. Note that we need to convert the Twitter-specific time format into the same time format we've been using for our dates in the dataframe:

In [None]:
racedate = racingdf['date'].iloc[0]
print("Date: " +  str(racedate))
raceday_tweets = [tweet for tweet in tweets if pd.to_datetime(tweet.created_at.date()) == racedate]
print("Total tweets: " + str(len(raceday_tweets)))

Let's take a look at one of those tweets:

In [None]:
raceday_tweets[11].text

The name "Verstappen" appears in here. We can, of course, check this automatically with Python (note that we use `.lower()` to avoid issues when comparing different capitalization):

In [None]:
'verstappen' in raceday_tweets[11].text.lower()

We can combine the above code to create three new "engagement" columns: a total count of tweets on race day, a count of tweets talking about Perez, and a count of tweets talking about Verstappen, all relative to the count of tweets in the week around the race. We first create the three columns, initialized to zero:

In [None]:
# Floats, since the columns will hold ratios later on
racingdf['tweets_total'] = 0.0
racingdf['tweets_perez'] = 0.0
racingdf['tweets_verstappen'] = 0.0

In order to create our relative metric, we again go through each race date and search for the relevant tweets in the week around it (three days before to three days after). To be more accurate when attributing tweets to Perez/Verstappen, we search for both last and first names.

Keep in mind that this is only an initial proxy measure for engagement. Also note that the code will run for a bit (the runtime could be much improved - do you know how? We are not doing that here, to make it clearer what the code is actually doing).

In [None]:
tweet_dates = [pd.to_datetime(tweet.created_at.date()) for tweet in tweets] # We get a list of all tweet dates, so we don't have to recalculate them later
for racedate in racingdf['date']:
    perez_count_day = 0
    perez_count_week = 0
    verstappen_count_day = 0
    verstappen_count_week = 0
    total_count_day = 0
    total_count_week = 0
    for i in range(len(tweets)):
        tweet_text = tweets[i].text.lower()
        if tweet_dates[i] == racedate:
            total_count_day += 1
            if 'sergio' in tweet_text or 'perez' in tweet_text:
                perez_count_day += 1
            if 'max' in tweet_text or 'verstappen' in tweet_text:
                verstappen_count_day += 1
        if abs((tweet_dates[i] - racedate).days) <= 3:
            total_count_week += 1
            if 'sergio' in tweet_text or 'perez' in tweet_text:
                perez_count_week += 1
            if 'max' in tweet_text or 'verstappen' in tweet_text:
                verstappen_count_week += 1
    # The final measures are defined as tweet count on race day divided by average daily tweet count in the week around the race
    racingdf.loc[racingdf['date'] == racedate,"tweets_total"] = total_count_day / (total_count_week / 7)
    racingdf.loc[racingdf['date'] == racedate,"tweets_perez"] = perez_count_day / (perez_count_week / 7)
    racingdf.loc[racingdf['date'] == racedate,"tweets_verstappen"] = verstappen_count_day / (verstappen_count_week / 7)
racingdf
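
One possible answer to the runtime question, as a sketch on made-up stand-in data: instead of scanning all tweets again for every race, we count the tweets per day once and then only look up the seven days around each race. The helper `relative_engagement` is a hypothetical name, and returning `nan` for an empty week is a choice made here, not something the code above does.

```python
from collections import Counter
from datetime import date, timedelta

# Made-up stand-in data: (date of tweet, lowercase text).
tweets_example = [
    (date(2022, 3, 20), "max wins"), (date(2022, 3, 20), "what a race"),
    (date(2022, 3, 21), "perez on the podium"), (date(2022, 3, 18), "practice day"),
]
race_dates = [date(2022, 3, 20)]

# One pass over the tweets: count per day instead of re-scanning for every race.
total_per_day = Counter(day for day, _ in tweets_example)
verstappen_per_day = Counter(day for day, text in tweets_example
                             if 'max' in text or 'verstappen' in text)

def relative_engagement(per_day, race_day, window=3):
    """Race-day count divided by the average daily count within +/- window days."""
    week = sum(per_day[race_day + timedelta(days=d)] for d in range(-window, window + 1))
    if week == 0:
        return float('nan')        # no tweets at all in the window
    return per_day[race_day] / (week / (2 * window + 1))

for race_day in race_dates:
    print(race_day, relative_engagement(total_per_day, race_day),
          relative_engagement(verstappen_per_day, race_day))
```

The work now grows with the number of tweets plus the number of races, rather than with their product.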

### Discussion point: Can you interpret these results? Why is the engagement around Verstappen so large on May 22, July 31, and October 9?

We can, of course, do more analysis here. A good starting point is always to use visualization. For example:

In [None]:
sns.scatterplot(x = 'verstappen',
             y = 'tweets_verstappen',
             data = racingdf,
             hue="verstappen_out")
plt.show()

We can get a more concrete picture by using regression analysis:

In [None]:
X = racingdf[['verstappen','perez','verstappen_out','perez_out']]
Y = racingdf.tweets_total
X = sm.add_constant(X)
lm = sm.OLS(Y,X).fit()
print(lm.summary())

In [None]:
Y = racingdf.tweets_perez
lm = sm.OLS(Y,X).fit()
print(lm.summary())

In [None]:
Y = racingdf.tweets_verstappen
lm = sm.OLS(Y,X).fit()
print(lm.summary())




### 1.8 (Exercise): Finding out more about the people talking about Verstappen

**Finding the right tweets and users**

Start by finding all the tweets in which the word `'verstappen'` appears, making sure to eliminate any capitalization issues. Put those tweets into a list `verstappen_tweets`.

In [None]:
verstappen_tweets = []




Next, find out how many of the `verstappen_tweets` each user wrote.

One possible approach is to create a dictionary of tweet lists, loop through the tweets, and update the dictionary as follows: if the `.user.screen_name` attribute has not appeared before, create a new entry in the dictionary, with the attribute as the key and a new list containing the current tweet as the value. If the `.user.screen_name` attribute has appeared before, simply append the current tweet to the corresponding list.

In [None]:
user_dict = {}






Next, create a list of `active_tweeters` and a list of `inactive_tweeters`. The former list should contain the users with more than one tweet within `verstappen_tweets`.

Note: it will be useful later on if you store the `.user`-objects, not just the `.user.screen_name` attribute.

In [None]:
active_tweeters = []
inactive_tweeters = []






How many `active_tweeters` are there? How many `inactive_tweeters`?

**Counting followers**

Next, we will take a look at the followers of our different tweeters. For the active (resp. inactive) tweeters, display a histogram showing the number of followers. The relevant user-attribute is `.followers_count`.

In [None]:
followers_active = []
followers_inactive = []





The extremely skewed nature of the number of followers makes it difficult to see anything or make comparisons. When we have heavily skewed data, we usually use the logarithm instead. Hence, repeat the plotting exercise with the `np.log()` of the `.followers_count`. Keep in mind that some may have 0 followers, so add a 1 to avoid errors. That is, use `np.log(original_value + 1)`.
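
To get a feeling for what the transformation does, here is a small sketch on made-up follower counts. It uses `math.log` from the standard library; `np.log` applies the same function to whole arrays at once.

```python
import math

# Made-up follower counts with the kind of skew described above.
follower_counts = [0, 3, 12, 150, 2_000, 1_500_000]

log_counts = [math.log(count + 1) for count in follower_counts]

for count, value in zip(follower_counts, log_counts):
    print(f"{count:>9} -> {value:.2f}")
```

The ordering of the accounts is preserved, but the distance between a small and a huge account shrinks from millions to a handful of units, which makes a histogram readable.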

In [None]:
followers_active = []
followers_inactive = []





Do you see any differences?

**Analyzing highly influential followers**

Next, we will take a look at the `active_tweeters` with more than $10,000$ followers. Create a new list, `selected_accounts`, and store the relevant user-objects within the list.

In [None]:
selected_accounts = [user for user in active_tweeters if user.followers_count > 10000]
for user in selected_accounts:
    print("User " + user.screen_name + " writes about Verstappen and has " + str(user.followers_count) + " followers")

To understand the type of highly influential followers better, we take a look at the tweets of the `selected_accounts`. In particular, we explore the hashtags that they use.

1. Create a list of hashtags.
2. Loop through the `selected_accounts`.
3. For each user, find the last 5 (complete) tweets they wrote (using `tweepy.Cursor(api.user_timeline,user_id=user.id, tweet_mode='extended').items(5)` in v1.1 or `client.get_users_tweets(id=user.id,max_results=5)` in v2.0).
4. Within each tweet, collect the list of hashtags (using `.entities['hashtags']` - in the case of v2.0, we have to specifically request the `entities` sub-object) and append these to our overall list.

In [None]:
all_hashtags = []





If you print out the list of hashtags, you'll see that each hashtag is a dictionary with two keys:
1. `'text'`: this gives the actual hashtag
2. `'indices'`: this gives the position of the hashtag within the tweet

Go through the list of hashtags and store only the actual hashtag using the key `'text'`.

Finally, put the hashtags into a data frame and count how many times each one appears (using the `.groupby()` function of your newly created data frame). Sort the data frame by the number of occurrences.

What types of influential accounts do you think actively post about Red Bull?

Similar to hashtags, we can also find out which users are being mentioned in tweets (the @'s), using `.entities['user_mentions']`.

### 1.9 (Exercise): A brief look at text analysis in Python


**Text analysis**

We will next use some basic text analytics tools to find out more about what people have to say.
We can start with the most used words. This gives a sense of how people are perceiving Red Bull. We can easily split tweets into words using `.split()`:

In [None]:
tweet = tweets[0]
tweet.text.split()

Let's use this to generate a complete list of (lowercase) words (without hashtags). Remember, to get lowercase, you can apply `.lower()` to a string. You can get rid of hashtags using `.replace('#','')`.

Can you find out which are the 15 most frequently occurring words? There are many ways to do this, but the one with the least code is to create a dataframe, then apply `.groupby('group_column')['group_column'].count().reset_index(name='count')` to the dataframe, and finally sort it by the `'count'` column, using `.sort_values(by='count',ascending=False).head(number to display)`.
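
As one of the "many ways", here is a sketch that swaps the dataframe for `collections.Counter` from the standard library. The example texts are made up; in the notebook, they would come from `tweet.text`.

```python
from collections import Counter

# Made-up stand-in texts; in the notebook these would come from tweet.text.
texts = ["Max wins again! #RedBull", "What a race for #redbull and Max",
         "Perez second, Max first"]

words = []
for text in texts:
    # Lowercase, drop the '#' and split on whitespace
    words.extend(text.lower().replace('#', '').split())

word_counts = Counter(words)
print(word_counts.most_common(3))
```

Note that punctuation stays attached to the words with this simple approach (e.g., `'again!'`), which is one source of the junk in the resulting table.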

There is a lot of junk here. One first attempt to clean up this table is to remove all English stopwords (the most common words like "the" and "a"). Many libraries can do this for us, such as `nltk`. But if we haven't used `nltk` before, we need to download the stopword library first.

In [None]:
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer
nltk.download('stopwords')

The English stopwords shouldn't be all too surprising:

In [None]:
nltk.corpus.stopwords.words('english')

We can now remove all the (English) stopwords from the data frame. For this, you want to keep only those rows of your dataframe where the word is not in the list of stopwords. That is, you can use `df[~(df['word_column'].isin(list of stopwords))]`. Afterwards, use the code from before to sort the table and display it.
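
The same filtering logic, sketched without a dataframe: the stopword set below is a tiny hand-written stand-in (not the `nltk` list), and the counts are made up.

```python
# Tiny hand-written stopword set for illustration; nltk's English list is much longer.
example_stopwords = {'the', 'a', 'and', 'is', 'for', 'what'}

counts = {'max': 3, 'the': 5, 'redbull': 2, 'a': 4, 'race': 1}

# Keep a word only if it is NOT a stopword (the dictionary analogue of ~isin).
filtered = {word: n for word, n in counts.items() if word not in example_stopwords}

for word, n in sorted(filtered.items(), key=lambda item: item[1], reverse=True):
    print(word, n)
```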

*****

**Interlude: Sentiment analysis introduction**

Beyond counting words, there are fantastic tools out there to analyze sentiments. Usually, we need to start by training a sentiment analyzer. Luckily, `nltk` comes with a built-in, pre-trained sentiment analyzer (VADER), purpose-built for analyzing short text on social media (convenient, right?). To use it for the first time, we first have to download its lexicon:

In [None]:
nltk.download('vader_lexicon')

Let's see how it works by handing it a short sentence:

In [None]:
sia = SentimentIntensityAnalyzer()
sia.polarity_scores("My outlook on life is fantastic!")

The negative, neutral, and positive scores are self-explanatory and numbers here are between $0$ and $1$, with the total adding up to 1. The compound score follows a somewhat complex arithmetic, but it's easy to understand how to use it: it's between $-1$ and $1$, anything $>0$ signifies a positive sentiment, and anything $<0$ signifies a negative sentiment.
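
Turning the compound score into a label can be sketched as follows. The helper `label_sentiment` is hypothetical, and the score dictionaries are made up in the shape that `polarity_scores` returns. The default threshold of 0 follows the rule above; a threshold of 0.05 is a commonly used alternative that treats scores very close to zero as neutral.

```python
def label_sentiment(scores, threshold=0.0):
    """Map a dictionary with a 'compound' entry to a coarse label."""
    compound = scores['compound']
    if compound > threshold:
        return 'positive'
    if compound < -threshold:
        return 'negative'
    return 'neutral'

# Made-up score dictionaries in the shape returned by polarity_scores().
examples = [
    {'neg': 0.0, 'neu': 0.4, 'pos': 0.6, 'compound': 0.6},
    {'neg': 0.5, 'neu': 0.5, 'pos': 0.0, 'compound': -0.3},
    {'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0},
]
print([label_sentiment(s) for s in examples])
```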

Can it tell us something about cliché optimists and pessimists?

In [None]:
sia.polarity_scores("The glass is half full")

In [None]:
sia.polarity_scores("The glass is half empty")

What do you think? Does it make sense to rate the first sentence as neutral and the second one as negative?

****

Now let's apply this simple sentiment analysis to our tweets. Take any tweet and compute the "compound" score for that tweet (hint: when you apply `sia.polarity_scores`, it returns a dictionary of scores).

Next, iterate through all tweets, find the compound score of each tweet, and then display a histogram of the compound scores.

Of course, this is just an initial look at sentiment analysis. You will see some more of this later in the module and in future modules.

## 2. Connecting manually to an API

We will see here how to connect to an API without the help of a wrapper package such as `tweepy`. We will use the example of Twitter, but this should give you an idea about requesting data from APIs more generally - it is essentially like requesting a website! Hence, we will use `requests`. If you want to learn more, I recommend this blogpost to get started: https://www.dataquest.io/blog/python-api-tutorial/

In [None]:
import requests
import json
import pandas as pd

On the Twitter Developer Platform under https://developer.twitter.com/en/docs/api-reference-index#Twitter, you can find the different API "endpoints" that Twitter provides (essentially, there is a different API depending on what data you want). We will be using here version 2.0 and searching for (recent) tweets. Hence, we follow the link https://developer.twitter.com/en/docs/twitter-api/tweets/search/api-reference/get-tweets-search-recent. Here, we find an "Endpoint URL", which provides the relevant data: 'https://api.twitter.com/2/tweets/search/recent' (essentially, the location of the API application on Twitter's server). We can send a request to this site, but we won't get much of a response:

In [None]:
requests.get("https://api.twitter.com/2/tweets/search/recent")

Why? Because we haven't authenticated ourselves (response 401 indicates that we are not authorized). As we are using version 2.0, we can just use the Bearer Token:

In [None]:
BEARER_TOKEN = api_access[api_access['api'] == 'twitter_bearer_token']['key'].tolist()[0]

We use the token to form a "header", which tells the server who we are. https://developer.twitter.com/apitools/api?endpoint=%2F2%2Ftweets%2Fsearch%2Frecent&method=get shows us how to build requests (it doesn't have code for Python as of now, but we can easily make sense of the curl code):

In [None]:
headers = {
    "Authorization": "Bearer {}".format(BEARER_TOKEN),
}

We also specify parameters: this is what we are looking for! The `query` entry corresponds to the query parameter we used in the `tweepy.Client`. At https://developer.twitter.com/en/docs/twitter-api/tweets/search/api-reference/get-tweets-search-recent, you can find all the possible parameters.

In [None]:
parameters={
    'query': '#redbull -is:retweet',
    'tweet.fields' : "created_at,lang",
    'max_results' : 10
}
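
Behind the scenes, such parameters end up in the URL as a query string. The sketch below builds a comparable URL by hand with the standard library, purely to show what gets sent; `requests` does this for us, and its exact encoding may differ in small details.

```python
from urllib.parse import urlencode

endpoint = "https://api.twitter.com/2/tweets/search/recent"
example_parameters = {
    'query': '#redbull -is:retweet',
    'tweet.fields': 'created_at,lang',
    'max_results': 10,
}

# Special characters such as '#', ':' and ',' are percent-encoded.
full_url = endpoint + "?" + urlencode(example_parameters)
print(full_url)
```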

We can now retry our request with the header and the search parameters:

In [None]:
response = requests.get("https://api.twitter.com/2/tweets/search/recent",
                        headers = headers,
                        params = parameters)

In [None]:
response

This time, we get a better outcome: 200 indicates that the request was accepted by the server and we get a "normal" response. Of course, we can now print this out. We use the fact that the API returns information in JSON format (this is the case for most modern APIs; older APIs tend to return XML).

In [None]:
response_data = response.json()
response_data

This looks familiar, right? In fact, we can look at the typical tweet attributes we already learned about (just that we now use dictionary notation instead of attributes):

In [None]:
response_data['data'][0]['text']
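
To work with more than one tweet at a time, the nested response can be flattened into a list of simple dictionaries. The following is a sketch under assumptions: the helper name `tweets_to_rows` is hypothetical, the sample payload is made up to mimic the structure above, and the guard for a missing `data` key assumes that the key is absent when a search has no matches.

```python
def tweets_to_rows(payload, fields=("id", "text", "created_at", "lang")):
    """Flatten a search response into a list of flat dicts, one per tweet."""
    rows = []
    # .get() guards against a missing 'data' key (assumed to occur when
    # the search returns no matches)
    for tweet in payload.get("data", []):
        # Fields missing from a tweet become None instead of raising KeyError
        rows.append({field: tweet.get(field) for field in fields})
    return rows


sample_payload = {  # made-up values for illustration
    "data": [
        {"id": "1", "text": "first example tweet",
         "created_at": "2023-01-01T12:00:00.000Z", "lang": "en"},
        {"id": "2", "text": "second example tweet", "lang": "de"},
    ],
    "meta": {"result_count": 2},
}

rows = tweets_to_rows(sample_payload)
print(rows)
```

Passing the result to `pd.DataFrame(rows)` then gives one row per tweet.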

## 3. The Reddit API

Similar to Twitter, Reddit provides an API that allows us to access a lot of data directly from Python. While we could use `praw` as a wrapper for our requests, we will access the API manually for training purposes.

Note that Reddit is a bit more liberal with the allowable number of requests than Twitter. You can make up to 60 requests per minute (with a single request returning up to 100 posts). More information can be found <a href='https://github.com/reddit-archive/reddit/wiki/API'>here</a>.
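
A limit of 60 requests per minute works out to one request per second on average. One way to stay below it is to wait between calls. The class below is a minimal sketch, not something Reddit or `requests` provides; the name `Throttle` and its parameters are hypothetical. The clock and sleep functions can be swapped out, which makes the behaviour easy to test without actually waiting.

```python
import time


class Throttle:
    """Space out calls so that at most `max_per_minute` happen per minute."""

    def __init__(self, max_per_minute=60, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 60.0 / max_per_minute
        self._clock = clock
        self._sleep = sleep
        self._last_call = None

    def wait(self):
        """Block until the minimum interval since the previous call has passed."""
        now = self._clock()
        if self._last_call is not None:
            remaining = self.min_interval - (now - self._last_call)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last_call = now


throttle = Throttle(max_per_minute=60)
throttle.wait()  # the first call returns immediately
```

Calling `throttle.wait()` right before each request keeps consecutive requests at least one second apart.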

To access the Reddit API, we again need to authenticate ourselves:

1. Go to the <a href=https://www.reddit.com/prefs/apps>Reddit Apps Site</a> and "create another app" (you will need a Reddit account for this).
2. Create an application (e.g., "dtvc_bot"). The best option for our purposes is "script".
3. Read the <a href=https://docs.google.com/a/reddit.com/forms/d/e/1FAIpQLSezNdDNK1-P8mspSbmtC2r86Ee9ZRbC66u929cG2GX0T9UMyw/viewform>terms and conditions</a> and register.
4. Once your application has been created, you can copy its personal use script and secret token into the code below. You will also need your Reddit username and password to make this work.

As before, I'm reading all credentials from a CSV file, but feel free to enter your data directly as strings.

In [None]:
api_access = pd.read_csv('API_access.csv',delimiter=';')
PERSONAL_USE_SCRIPT = api_access[api_access['api'] == 'reddit_personal_use_script']['key'].tolist()[0]
SECRET_TOKEN = api_access[api_access['api'] == 'reddit_secret_token']['key'].tolist()[0]
USERNAME = api_access[api_access['api'] == 'reddit_username']['key'].tolist()[0]
PASSWORD = api_access[api_access['api'] == 'reddit_password']['key'].tolist()[0]
USER_AGENT = api_access[api_access['api'] == 'reddit_user_agent']['key'].tolist()[0] # This should be descriptive, such as 'testscript by u/<Username>'

In [None]:
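# Request an access token using the "password" grant type
data = {'grant_type': 'password', 'username': USERNAME, 'password': PASSWORD}
# Identify our script to Reddit with a descriptive User-Agent
headers = {'User-Agent': USER_AGENT}
# The application's personal use script and secret go into HTTP basic authentication
auth = requests.auth.HTTPBasicAuth(PERSONAL_USE_SCRIPT, SECRET_TOKEN)
r = requests.post('https://www.reddit.com/api/v1/access_token',
                  data=data,
                  headers=headers,
                  auth=auth)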

In [None]:
response = r.json()
print(response)
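
The token request can fail, for instance with wrong credentials, and the JSON then does not contain an `access_token`. A minimal sketch of a defensive check follows; the helper name `extract_access_token` and the sample payload are hypothetical, and which error fields Reddit returns in such a case is not assumed here.

```python
def extract_access_token(payload):
    """Return the access token from a token response, or raise ValueError."""
    token = payload.get("access_token")
    if not token:
        # Show what the server sent back instead of failing later with a KeyError
        raise ValueError(f"No access token in response: {payload}")
    return token


# Made-up payload for illustration; real tokens are long random strings
ok_payload = {"access_token": "PLACEHOLDER", "token_type": "bearer"}
print(extract_access_token(ok_payload))
```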

In [None]:
# Send the access token in the Authorization header of all further requests
token = 'bearer ' + response['access_token']
headers = {'Authorization': token, 'User-Agent': USER_AGENT}
# Search for subreddits (not posts) that match the query
params = {'q': 'redbull', 'limit': 5, 'sort': 'relevance'}
r = requests.get('https://oauth.reddit.com/subreddits/search', headers=headers, params=params)
print(r.json())

Unfortunately, the structure of the response is a bit messier than what we saw with Twitter, so you will have to explore the documentation: https://www.reddit.com/dev/api. Here, the individual results are nested under `data` and then `children`:

In [None]:
for subreddit in r.json()['data']['children']:
    print(subreddit['data']['display_name'])
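
When exploring an unfamiliar response like this one, it can help to print only its skeleton (keys and value types) rather than the full content. The helper below is a sketch with a hypothetical name, `outline`; the sample dictionary is made up and merely imitates the nesting seen above.

```python
def outline(obj, max_depth=3, _depth=0):
    """Return indented lines describing the keys and types in a nested structure."""
    lines = []
    indent = "  " * _depth
    if isinstance(obj, dict):
        for key, value in obj.items():
            lines.append(f"{indent}{key}: {type(value).__name__}")
            if _depth + 1 < max_depth:
                lines.extend(outline(value, max_depth, _depth + 1))
    elif isinstance(obj, list) and obj:
        # Inspect only the first element, assuming the others look alike
        lines.append(f"{indent}[0]: {type(obj[0]).__name__}")
        if _depth + 1 < max_depth:
            lines.extend(outline(obj[0], max_depth, _depth + 1))
    return lines


sample = {  # made-up structure for illustration
    "kind": "Listing",
    "data": {
        "after": None,
        "children": [{"kind": "t5", "data": {"display_name": "example"}}],
    },
}
print("\n".join(outline(sample)))
```

Calling `outline(r.json())` on a real response gives a quick overview of where the interesting fields live.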

Let's now try to find posts across all subreddits. To do so, we need to understand the Reddit URL structure: "/r/" followed by a name selects a specific subreddit, while "all" in place of the name is a placeholder that covers all subreddits. The final part, "/new", indicates the sorting. There are different ways to sort:
* hot
* controversial
* gilded
* new
* rising
* top
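
The URL structure described above can be captured in a small helper. This is a sketch: the function name `listing_url` is hypothetical, the list of sort options is simply the one given above, and "formula1" is only an example subreddit name.

```python
BASE_URL = "https://oauth.reddit.com"
SORTS = ("hot", "controversial", "gilded", "new", "rising", "top")


def listing_url(subreddit="all", sort="new"):
    """Build a listing URL such as https://oauth.reddit.com/r/all/new."""
    if sort not in SORTS:
        raise ValueError(f"Unknown sort {sort!r}; expected one of {SORTS}")
    return f"{BASE_URL}/r/{subreddit}/{sort}"


print(listing_url("all", "new"))
print(listing_url("formula1", "top"))
```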

In [None]:
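# Note: '/r/all/new' is a listing endpoint, so 'q' and 'sort' are most likely ignored here;
# a keyword search over posts would go through '/r/all/search' instead
params = {'q': 'redbull', 'limit': 5, 'sort': 'relevance'}
r = requests.get('https://oauth.reddit.com/r/all/new', headers=headers, params=params)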

In [None]:
for submission in r.json()['data']['children']:
    print('---------------')
    print('Subreddit: ' + submission['data']['subreddit'])
    print('Title: ' + submission['data']['title'])
    # 'author_fullname' may be missing (e.g., for deleted accounts), so use .get()
    print('Name: ' + submission['data'].get('author_fullname', '[unknown]'))
    print('upvote_ratio: ' + str(submission['data']['upvote_ratio']))

Since we can only get up to 100 items at a time, we have to be a bit more creative when requesting more. For this purpose, most of the listing endpoints accept `before` and `after` parameters and return matching values in each response, which allows you to chain requests (and create your own paginator). Things are a bit easier if you use `praw`, however.
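
A hand-written paginator along these lines could look as follows. This is a minimal sketch under assumptions: the function name `paginate` and its arguments are hypothetical, and `fetch_page` is a function you supply that performs the actual request and returns the decoded JSON of one listing. It relies on each listing carrying an `after` value under `data` that can be passed to the next request.

```python
def paginate(fetch_page, max_items=250, page_size=100):
    """Collect items across pages by following the listing's `after` value.

    `fetch_page(params)` is a caller-supplied function that performs the
    request and returns the decoded JSON of one listing.
    """
    items = []
    after = None
    while len(items) < max_items:
        params = {"limit": min(page_size, max_items - len(items))}
        if after is not None:
            params["after"] = after
        listing = fetch_page(params)["data"]
        items.extend(listing["children"])
        after = listing.get("after")
        if after is None or not listing["children"]:
            break  # no further pages available
    return items[:max_items]


# Hypothetical usage with the objects defined earlier in this notebook:
# posts = paginate(lambda p: requests.get('https://oauth.reddit.com/r/all/new',
#                                         headers=headers, params=p).json())
```

Keeping the request itself outside the function means the same loop works for any listing endpoint and can be tried out with canned data before real requests are spent on it.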