# SC207 Social Data Science
# APIs - Gathering Twitter Data
<img src="https://github.com/Minyall/sc207_materials/blob/master/images/tweepy.jpg?raw=true" align="right" width="300">


- API = Application Programming Interface
- A standardised way to retrieve data from platforms.
- Many platforms have an API, and they all work in broadly similar ways.
- Today we will use the package `tweepy` to retrieve data from the Twitter API

[Tweepy Documentation](http://docs.tweepy.org/en/stable/)

# Installing Tweepy
Tweepy is a library that helps us interact with Twitter using Python. Unfortunately it is not installed by default, so we need to install it ourselves. Most of the time you can install new Python libraries using `pip`, Python's package installer, which downloads them from the online [Python Package Index](https://pypi.org/).

Jupyter Lab makes installing with `pip` fairly simple.

You only need to run this command once. After it has been run tweepy will be installed on your system and won't need reinstalling every time.

In [None]:
!pip install tweepy --upgrade

### Imports

In [None]:
import tweepy
import pandas as pd

# Prepping your credentials storage
Generally you want to avoid storing sensitive information, such as API keys, within your code that you may share with others. Whilst there are many solutions to this, a simple one is to store the credentials in a different file which your code can use later.
1. Open up the file navigation pane to the left if it's not already open.
2. Ensure you are in the folder containing this notebook.
3. Right click in some empty space and select 'New File'.
4. Rename the file to 'credentials.py', removing the .txt extension completely. You now have a Python file.
5. Open the file in the editor, create one new variable as below, and then save the file.

```

BEARER_TOKEN = 'PASTE YOUR BEARER TOKEN HERE'

```
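
Another of the many possible solutions, sketched here as an alternative rather than a replacement for `credentials.py`, is to read the token from an environment variable. The variable name `TWITTER_BEARER_TOKEN` and the helper `load_bearer_token` are assumptions made for this example; they are not part of Tweepy.

```python
import os

def load_bearer_token(env_var='TWITTER_BEARER_TOKEN'):
    """Hypothetical helper: fetch the token from an environment variable."""
    token = os.environ.get(env_var)
    if not token:
        raise RuntimeError(f'Set the {env_var} environment variable first')
    return token

# For demonstration only: normally the variable is set outside of your code
os.environ['TWITTER_BEARER_TOKEN'] = 'PASTE YOUR BEARER TOKEN HERE'
print(load_bearer_token())
```

Either way the aim is the same: the token never appears in the notebook you share.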

In [None]:
# Here is how we use the credentials from our separate file, in this notebook.

from credentials import BEARER_TOKEN

### Connect the API
We create a new client object and feed it our bearer token.

In [None]:
client = tweepy.Client(bearer_token=BEARER_TOKEN)

# 2. Gathering Data - Search
Search is one of the simpler ways you can interact with the API.
- Search returns a list of tweet objects matching your query
- Every request returns up to 100 tweets
- You can make 450 requests in a 15 minute window.
- A maximum of 45,000 tweets every 15 minutes.
- Each request counts against your quota, no matter how many Tweets it returns.
- Top limit of 500,000 Tweets per month (so be conservative until you know what you want!)
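
The arithmetic behind these limits can be checked with a quick sketch. The figures are the ones listed above; the variable names are only for illustration.

```python
# Figures taken from the list above; the names are illustrative only
tweets_per_request = 100
requests_per_window = 450
window_minutes = 15
monthly_cap = 500_000

# Maximum tweets in one 15 minute window
tweets_per_window = tweets_per_request * requests_per_window
print(tweets_per_window)  # 45000

# Number of full windows at the maximum rate before the monthly cap is reached
full_windows = monthly_cap // tweets_per_window
print(full_windows)  # 11

# Roughly how many hours of flat-out collecting would use up the monthly cap
hours_to_cap = monthly_cap / tweets_per_window * window_minutes / 60
print(round(hours_to_cap, 1))  # 2.8
```

In other words, collecting at full speed could use up a month's allowance in under three hours, which is why it pays to be conservative.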

### What you receive
It is important to be clear what Twitter is providing you when you ask for data.
>The Twitter's standard search API (search/tweets) allows simple queries against the indices of recent or popular Tweets and behaves similarly to, but not exactly like the Search UI feature available in Twitter mobile or web clients. The Twitter Search API searches against a sampling of recent Tweets published in the past 7 days. Before digging in, it’s important to know that the standard search API is focused on relevance and not completeness. This means that some Tweets and users may be missing from search results.
[Twitter API Documentation: Standard Search](https://developer.twitter.com/en/docs/tweets/search/overview/standard)

- Already sampled based on 'relevance'.
- Max. 7 days old.
- NOT complete.

### Making a Single Request
Let's make a single request for something that will have a lot of results.
- `query=` : a string to search for. You can also use [operators](https://developer.twitter.com/en/docs/tweets/search/guides/standard-operators) to make complex queries.

Some common operators you can use include...
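- `AND` logic: in the v2 search used here, a space between two terms means both must appear, e.g. `society politics`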
- `OR`
- Using "" to search for a phrase e.g. `"Elon Musk"`
- Using a (-) dash as a NOT argument. E.g. To get tweets mentioning Elon Musk but not Tesla we would use `"Elon Musk" -Tesla`
- `is:` to specify the type of tweet. For example, `is:retweet` will only return retweets, while `-is:retweet` will only return non-retweets.

[You can view all the argument options in the Tweepy Documentation](http://docs.tweepy.org/en/latest/api.html#search-methods)
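
Because a query is just a string, the operators above can be combined with ordinary string handling. The sketch below uses a hypothetical helper, `build_query`, which is not part of Tweepy; it only assembles the text that would be passed as `query=`.

```python
def build_query(phrase, exclude=(), include_retweets=True):
    """Hypothetical helper: assemble a search query string from its parts."""
    parts = [f'"{phrase}"']            # quotes search for an exact phrase
    for word in exclude:
        parts.append(f'-{word}')       # a leading dash works as NOT
    if not include_retweets:
        parts.append('-is:retweet')    # keep only non-retweets
    return ' '.join(parts)

query = build_query('Elon Musk', exclude=['Tesla'], include_retweets=False)
print(query)  # "Elon Musk" -Tesla -is:retweet
```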

In [None]:
single_response = client.search_recent_tweets(query='society', max_results=10)

In [None]:
single_response

In [None]:
tweets = single_response.data

In [None]:
len(tweets)

In [None]:
# Let's examine just one tweet object

single_tweet = tweets[0]

single_tweet

### Types of Data available for a single Tweet
Tweets from the API can contain a range of different relevant data including...
- Date/Time posted
- The text of the tweet (default)
- Edit history of the tweet
- Public Metrics such as counted likes, retweets.
- Keywords Twitter have identified that might describe the topic, types of people, specific public figures etc known as "Context Annotations".
- Conversation ID - If it is part of a larger thread, the ID associated with the overall conversation
- Entities such as hashtags, urls, mentioned users
- An indicator of the Tweet's language
- The Source of the Tweet, i.e. the website, the iOS app, the Android app etc.
... and more. See the full documentation on [Tweet Fields](https://docs.tweepy.org/en/stable/expansions_and_fields.html)

In [None]:
# Let's select a few fields to examine their content

query = 'society -is:retweet'
fields = ['created_at','author_id','referenced_tweets','conversation_id','public_metrics','lang','source','context_annotations','entities']
response = client.search_recent_tweets(query,tweet_fields=fields, max_results=10)

In [None]:
tweets = response.data
single_tweet = tweets[0]

In [None]:
# If we check the type of our single_tweet we can see it is a tweepy Tweet object.
# When Tweepy received the response from Twitter, it wrapped it up into a useful object for us.
type(single_tweet)

In [None]:
single_tweet.text

In [None]:
single_tweet.public_metrics

In [None]:
single_tweet.lang

In [None]:
single_tweet.created_at

In [None]:
single_tweet.entities

In [None]:
single_tweet.context_annotations

In [None]:
# If we wrap our list of tweet objects in a Pandas DataFrame, Pandas interprets each object as a row, and the different attributes as columns

# More on this later...

pd.DataFrame(tweets)

## Tweet Expansions
Whilst fields provide additional attributes to our Tweets, expansions 'expand' other objects referenced by the tweet objects. For example...
- Data about the tweet author
- Data about any referenced Tweets such as retweets, replies etc.
- Data about users that are mentioned in tweets
- Data about users that are mentioned in referenced tweets
- Data about attached media items

For a full coverage of all possible expansions see the [Expansions documentation](https://developer.twitter.com/en/docs/twitter-api/expansions)

Below we're going to call the API again and ask for the `author_id` expansion to get information about the tweet authors. As well as our regular fields, we can request additional fields about the user.
The [full list of User fields is available via the documentation](https://developer.twitter.com/en/docs/twitter-api/data-dictionary/object-model/user)

In [None]:
query = 'society is:retweet'
fields = ['created_at','author_id','referenced_tweets','conversation_id','public_metrics','lang','source','context_annotations','entities']

expansions = ['author_id', # This ensures we get info about users that wrote the main tweets we collect
              'referenced_tweets.id.author_id'] # This one provides info about authors
                                                # of referenced tweets, and the tweets themselves


user_fields = ['created_at','public_metrics']



response = client.search_recent_tweets(query,
                                       tweet_fields=fields, expansions=expansions,
                                       user_fields=user_fields,max_results=10)
tweets = response.data


In [None]:
users = response.includes['users']
single_user = users[0]

In [None]:
single_user

In [None]:
dict(single_user)

In [None]:
pd.DataFrame(users)

In [None]:
response.includes['tweets'] # These are the referenced tweets

In [None]:
pd.DataFrame(response.includes['tweets'])
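
The users and referenced tweets in `response.includes` arrive separately from the main tweets; what ties them together is the id. As a minimal sketch of that linking step, the example below uses made-up plain dictionaries standing in for the tweet and user objects, so it runs without calling the API.

```python
# Made-up records standing in for response.data and response.includes['users']
tweets = [
    {'id': 1, 'author_id': 10, 'text': 'first tweet'},
    {'id': 2, 'author_id': 11, 'text': 'second tweet'},
    {'id': 3, 'author_id': 10, 'text': 'third tweet'},
]
users = [
    {'id': 10, 'username': 'alice'},
    {'id': 11, 'username': 'bert'},
]

# Build a lookup table from user id to user record
users_by_id = {user['id']: user for user in users}

# Each tweet's author_id points at exactly one user
for tweet in tweets:
    author = users_by_id[tweet['author_id']]
    print(tweet['text'], 'was written by', author['username'])
```

Later in this notebook the same link is made for a whole table at once with a Pandas merge.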

In [None]:
# Using the Paginator is similar to our original single_response method.
# we first create our Paginator, providing it the client method we want to use,
# and any of the arguments we want to be used by that method.

query = 'society is:retweet'

fields = ['created_at','author_id',
          'referenced_tweets','conversation_id',
          'public_metrics','lang','source',
          'context_annotations','entities']

expansions = ['author_id']
user_fields = ['created_at','public_metrics']

paginator = tweepy.Paginator(client.search_recent_tweets, query=query, tweet_fields=fields,
                             user_fields=user_fields, expansions=expansions, max_results=10, limit=2)

In [None]:
paginator
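
Conceptually, pagination means making one request, noting the token that points at the next page, and repeating until the page limit is reached or no pages remain. The sketch below imitates that idea with a made-up `fake_search` function and a simple generator. It is an illustration of the concept under those assumptions, not Tweepy's actual implementation.

```python
def fake_search(next_token=None):
    """Hypothetical stand-in for one API request: returns a page of results
    and the token needed to ask for the following page."""
    pages = {
        None: {'data': [1, 2, 3], 'next_token': 'page2'},
        'page2': {'data': [4, 5, 6], 'next_token': 'page3'},
        'page3': {'data': [7], 'next_token': None},
    }
    return pages[next_token]

def paginate(fetch, limit):
    """Yield up to `limit` pages, following each page's next_token."""
    token = None
    for _ in range(limit):
        page = fetch(token)
        yield page
        token = page['next_token']
        if token is None:  # no more pages available
            break

collected = []
for page in paginate(fake_search, limit=2):
    collected.extend(page['data'])

print(collected)  # [1, 2, 3, 4, 5, 6]
```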

In [None]:
# Here we create the Paginator again, this time also requesting the
# 'referenced_tweets.id.author_id' expansion, so that each page also includes
# the referenced tweets and their authors.

query = 'society is:retweet'

fields = ['created_at','author_id',
          'referenced_tweets','conversation_id',
          'public_metrics','lang','source',
          'context_annotations','entities']

expansions = ['author_id','referenced_tweets.id.author_id']
user_fields = ['created_at','public_metrics']

paginator = tweepy.Paginator(client.search_recent_tweets, query=query, tweet_fields=fields,
                             user_fields=user_fields, expansions=expansions, max_results=10, limit=2)

In [None]:
# The most explicit way - using a for loop


# This list will capture our primary tweets returned by the query, plus any referenced tweets
tweet_data = []

# This list will capture user data about the authors of those tweets
user_data = []

for response in paginator:
    main_tweets = response.data
    # .get() avoids a KeyError if a page happens to include no referenced tweets or users
    referenced_tweets = response.includes.get('tweets', [])
    users = response.includes.get('users', [])

    tweet_data.extend(main_tweets)
    tweet_data.extend(referenced_tweets)
    user_data.extend(users)



In [None]:
# We can examine the list to see how many results we have.
# With 10 results per page and 2 pages we expect up to 20 primary tweets, plus the referenced tweets.
len(tweet_data)

In [None]:

tweet_df = pd.DataFrame(tweet_data)
tweet_df

In [None]:
# It's a good idea to drop any duplicates, as there is no guarantee that our referenced tweets aren't also tweets we already collected in our main response.
# Each tweet has a unique id number, so we'll drop any duplicates on the id column
tweet_df = tweet_df.drop_duplicates(subset=['id'])
tweet_df

In [None]:
# A quick example

df_a = pd.DataFrame({'name':['Alice','Bert','Chris','Danielle'],'age':[22,25,28,32]})
df_a

In [None]:
user_df = pd.DataFrame(user_data).drop_duplicates(subset=['id'])
user_df

In [None]:
# we can check that all the author ids in our tweets have corresponding user data like so
all(tweet_df['author_id'].isin(user_df['id']))

In [None]:
df_b = pd.DataFrame({'name':['Bert','Chris','Danielle','Fliss'],'job':['Baker','Chef','Doctor','Firefighter']})
df_b

In [None]:
# df_a is on the left, df_b is on the right

df_a.merge(df_b, how='left', left_on='name', right_on='name')

In [None]:
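# Keep one row per user, otherwise the merge below would duplicate tweet rows
user_df = pd.DataFrame(user_data).drop_duplicates(subset=['id'])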

In [None]:
# our user df and tweet df have similar column names, which will cause confusion
print(tweet_df.columns)
print(user_df.columns)

In [None]:
# adding a prefix to the user columns makes it clear which columns describe the user
user_df = user_df.add_prefix('user_')
print(tweet_df.columns)
print(user_df.columns)

In [None]:
tweet_df.info()

In [None]:
tweet_df = tweet_df.merge(user_df, how='left', left_on='author_id', right_on='user_id')
tweet_df

In [None]:
# If we examine one of our columns...

tweet_df['entities']

The values in the entities column aren't strings, they're dictionaries...

In [None]:
# Here is the first row's value in the 'entities' column
tweet_df.loc[0, 'entities']

In [None]:
# The type of the value is dict - dictionary.
type(tweet_df.loc[0, 'entities'])

In [None]:
# and parts of it can be accessed like a dictionary
tweet_df.loc[0, 'entities']['mentions']

If we were to save this DataFrame as a .csv file, it would have to turn those dictionaries into strings, because CSV files don't understand Python objects. When we reloaded the data from the CSV, our entities column would be a column of weird messy strings.

### How do we solve this?

- A JSON file is particularly good at dealing with highly *nested* structures such as a DataFrame with a column containing lists, which themselves contain dictionaries that contain dictionaries(!). We can save our DataFrame to disk as a JSON file using the Pandas DataFrame `.to_json()` method.
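
To see the difference without touching the real dataset, here is a small sketch using only Python's standard `csv` and `json` modules. The single row is made up to resemble a tweet with nested entities.

```python
import csv
import io
import json

# A made-up row with a nested 'entities' value
row = {'id': 1, 'entities': {'mentions': [{'username': 'alice'}]}}

# CSV round trip: the nested dictionary is flattened into text
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=['id', 'entities'])
writer.writeheader()
writer.writerow(row)
buffer.seek(0)
from_csv = next(csv.DictReader(buffer))
print(type(from_csv['entities']))   # <class 'str'>

# JSON round trip: the nested structure survives
from_json = json.loads(json.dumps(row))
print(type(from_json['entities']))  # <class 'dict'>
print(from_json['entities']['mentions'][0]['username'])  # alice
```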

In [None]:
tweet_df.to_json('my_tweets.json')

In [None]:
pd.read_json('my_tweets.json')

# Extending your Collection

If you want to gather data across a longer period, such as sampling across a week, you may want to pull from the Twitter API once a day. How do we do this without duplicating our data, and how do we easily just add the new data to our dataset, rather than creating a new one each time?
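
The core idea is to load whatever is already saved, add only records whose id has not been seen before, and save the result back. Here is a minimal sketch of that pattern using plain JSON and made-up records; `update_dataset` is a hypothetical helper, and the sketch writes to a temporary folder so it does not touch your real files.

```python
import json
import tempfile
from pathlib import Path

def update_dataset(path, new_records):
    """Hypothetical helper: add new records to the JSON file at `path`,
    skipping any whose id is already stored."""
    existing = json.loads(path.read_text()) if path.exists() else []
    seen_ids = {record['id'] for record in existing}
    for record in new_records:
        if record['id'] not in seen_ids:
            existing.append(record)
            seen_ids.add(record['id'])
    path.write_text(json.dumps(existing))
    return existing

with tempfile.TemporaryDirectory() as folder:
    dataset_path = Path(folder) / 'toy_dataset.json'
    day_one = update_dataset(dataset_path, [{'id': 1}, {'id': 2}])
    day_two = update_dataset(dataset_path, [{'id': 2}, {'id': 3}])

print(len(day_one), len(day_two))  # 2 3
```

The second run only adds the record with id 3, because id 2 was already in the file.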

In [None]:
import pandas as pd
import tweepy
from pathlib import Path

from credentials import BEARER_TOKEN


my_data_filename = Path('my_twitter_dataset.json')
query = '"Elon Musk" -is:retweet'
fields = ['created_at','author_id','conversation_id','referenced_tweets',
          'public_metrics','lang','source','context_annotations','entities']
expansions = ['author_id']
user_fields = ['created_at','public_metrics']

total_items = 10000
items_per_call = 100
n_pages = total_items // items_per_call  # integer division gives a whole number of pages

# Create API
client = tweepy.Client(BEARER_TOKEN)

# First load in your data if you have it, otherwise create a new DataFrame

if my_data_filename.exists():
    df = pd.read_json(my_data_filename)
    n_records = len(df)

    # if there is data, find the largest id in your dataset; this will be the most recent tweet
    max_id = df['id'].max()

else:
    df = pd.DataFrame()
    # set max_id to None because on the first run we don't need to provide an id to limit results
    max_id = None
    n_records = 0

# Pull results from the Twitter API

paginator = tweepy.Paginator(client.search_recent_tweets, query=query, tweet_fields=fields,
                             user_fields=user_fields,
                             expansions=expansions,
                             max_results=items_per_call,
                             limit=n_pages,
                             # since_id=max_id, # ensures to return only Tweets published after your most recently collected tweets
                             )

# Get primary data...
tweet_results = []
user_results = []

for response in paginator:
    tweets = response.data
    if response.data is None:
        break
    users = response.includes['users']

    tweet_results.extend(tweets)
    user_results.extend(users)

temporary_df = pd.DataFrame(tweet_results).drop_duplicates(subset=['id'])
user_df = pd.DataFrame(user_results).add_prefix('user_')
temporary_df = temporary_df.merge(user_df, how='left', left_on='author_id', right_on='user_id')

# Append the new data onto the end of the loaded data (or the empty dataframe if this is the first run)
df = pd.concat([df, temporary_df])

# Check the dataset for any duplicates by dropping any rows with duplicate ids
df = df.drop_duplicates('id',ignore_index=True)

# Save back to disk
df.to_json(my_data_filename)

print(f'Dataset has {len(df)} entries, an increase of {len(df) - n_records}')

In [None]:
df.info()

In [None]:
import pandas as pd
import tweepy
from pathlib import Path

from credentials import BEARER_TOKEN


my_data_filename = Path('my_tweets.json')
query = 'society'
fields = ['created_at','author_id','conversation_id','referenced_tweets',
          'public_metrics','lang','source','context_annotations','entities']
expansions = ['author_id','referenced_tweets.id.author_id']
user_fields = ['created_at','public_metrics']

total_items = 10000
items_per_call = 100
n_pages = total_items // items_per_call  # integer division gives a whole number of pages

# Create API
client = tweepy.Client(BEARER_TOKEN)

# First load in your data if you have it, otherwise create a new DataFrame

if my_data_filename.exists():
    df = pd.read_json(my_data_filename)
    n_records = len(df)

    # if there is data, find the largest id in your dataset; this will be the most recent tweet
    max_id = df['id'].max()

else:
    df = pd.DataFrame()
    # set max_id to None because on the first run we don't need to provide an id to limit results
    max_id = None
    n_records = 0

# Pull results from the Twitter API

paginator = tweepy.Paginator(client.search_recent_tweets, query=query, tweet_fields=fields,
                             user_fields=user_fields,
                             expansions=expansions,
                             max_results=items_per_call,
                             limit=n_pages,
                             # since_id=max_id, # ensures to return only Tweets published after your most recently collected tweets
                             )

# Get primary data...
tweet_results = []
user_results = []

for response in paginator:
    tweets = response.data
    if response.data is None:
        break
    users = response.includes['users']
    # .get() avoids a KeyError if a page happens to include no referenced tweets
    referenced_tweets = response.includes.get('tweets', [])
    tweet_results.extend(tweets)
    tweet_results.extend(referenced_tweets)
    user_results.extend(users)

temporary_df = pd.DataFrame(tweet_results).drop_duplicates(subset=['id'])
user_df = pd.DataFrame(user_results).drop_duplicates(subset=['id']).add_prefix('user_')
temporary_df = temporary_df.merge(user_df, how='left', left_on='author_id', right_on='user_id')

# Append the new data onto the end of the loaded data (or the empty dataframe if this is the first run)
df = pd.concat([df, temporary_df])

# Check the dataset for any duplicates by dropping any rows with duplicate ids
df = df.drop_duplicates('id',ignore_index=True)

# Save back to disk
df.to_json(my_data_filename)

print(f'Dataset has {len(df)} entries, an increase of {len(df) - n_records}')

In [None]:
df.info()