# SC207 - Session 6
# APIs - Gathering Twitter Data
<img src="https://github.com/Minyall/sc207_materials/blob/master/images/tweepy.jpg?raw=true" align="right" width="300">


- API = Application Programming Interface
- A standardised way to retrieve data from platforms.
- Many platforms have an API and they all work relatively similarly
- Today we will use the package `tweepy` to retrieve data from the Twitter API

[Tweepy Documentation](http://docs.tweepy.org/en/stable/)

# Installing Tweepy
Tweepy is a library that helps us interact with Twitter using Python. It is not installed by default, so we need to install it ourselves. Most Python libraries can be installed with `pip`, the package installer for Python, which downloads them from the online [Python Package Index](https://pypi.org/) (PyPI).

Jupyter Lab makes installing from PIP fairly simple.

You only need to run this command once. After it has been run tweepy will be installed on your system and won't need reinstalling every time.

In [None]:
! pip install tweepy

### Imports

Today we will be using Tweepy and Pandas to retrieve, store and explore data.

In [None]:
import tweepy
import pandas as pd

# This function is here just to make the class go smoothly!
def find_first_retweet(list_of_tweets):
    for tweet in list_of_tweets:
        if 'retweeted_status' in tweet._json:
            return tweet
        
def find_first_regular_tweet(list_of_tweets):
    for tweet in list_of_tweets:
        if 'retweeted_status' not in tweet._json:
            return tweet

# Prepping your credentials storage
Generally you want to avoid storing sensitive information, such as API keys, within your code that you may share with others. Whilst there are many solutions to this, a simple one is to store the credentials in a different file which your code can use later.
1. Open up the file navigation pane to the left if it's not already open.
2. Right click in some empty space and select 'New File'.
3. Rename the file to 'credentials.py' removing the .txt extension completely. You now have a Python file.
4. Open the file in the editor, create two new variables as below, and then save the file.

```
API_KEY = ''
API_SECRET = ''

```

We'll come back to this file in a minute.
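
To make the idea concrete, here is a minimal sketch of how Python picks up values from a separate file. It writes a throwaway module to a temporary folder purely for demonstration. The module name `demo_credentials` and the placeholder values are assumptions; in the notebook you would simply import from the `credentials.py` you created by hand.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Create a throwaway folder holding a stand-in for credentials.py.
# 'demo_credentials' and the placeholder values are hypothetical.
temp_dir = tempfile.mkdtemp()
Path(temp_dir, "demo_credentials.py").write_text(
    "API_KEY = 'example-key'\nAPI_SECRET = 'example-secret'\n"
)

# Python can only import the file if its folder is on the import path.
sys.path.insert(0, temp_dir)
importlib.invalidate_caches()
demo_credentials = importlib.import_module("demo_credentials")

API_KEY = demo_credentials.API_KEY
API_SECRET = demo_credentials.API_SECRET

# Print only the lengths so the secrets never appear in the notebook output.
print(len(API_KEY), len(API_SECRET))
```

Because the keys live in their own file, the notebook itself can be shared without exposing them, as long as the credentials file is not shared alongside it.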

# 1. Authorising and Connecting the API
`Tweepy` streamlines this process into three simple stages.

### a) Identify your Access Tokens
APIs require authorisation tokens to identify who is using the API and to manage API usage by a single account holder.
- Go to https://developer.twitter.com
- Sign in with your Twitter account details
- You may have to navigate back to the Twitter developer page if you get redirected to normal Twitter.
- Once signed in select 'Projects & Apps' and then 'Overview' from the left-side menu.
- Under Standalone Apps click '+ Create App'.
- Give your app a unique name. We suggest your Essex username.
- Now copy and paste the API Key, and the API Key Secret into the strings in your credentials.py file you created earlier, and save the file. Make sure you are happy with the file before leaving the keys & tokens page in your browser. 


In [None]:
# Here is how we use the credentials from our separate file in this notebook.
from credentials import API_KEY, API_SECRET


### b) Create an Authorisation Object
We create a special authorisation handler to store our keys.

In [None]:
auth = tweepy.AppAuthHandler(API_KEY, API_SECRET)

### c) Connect the API
We create a new `API` object and feed it our authorisation handler.

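We also set an additional argument...
- `wait_on_rate_limit` sets the API to wait if you have maxed out your number of queries, and then resume when the limit is lifted.
- Older Tweepy releases (3.x) also accepted `wait_on_rate_limit_notify`, which made Tweepy inform you of the wait occurring. That argument was removed in Tweepy 4.0, the series that provides the `search_tweets` method used later in this notebook. Current versions log a warning instead, so we leave it out.

In [None]:
api = tweepy.API(auth, wait_on_rate_limit=True)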

# 2. Gathering Data - Search
Search is one of the simpler ways you can interact with the API.
- Search returns a list of tweet objects matching your query
- Every request returns up to 100 tweets
- You can make 450 requests in a 15 minute window.
- A maximum of 45,000 tweets every 15 minutes.
- Each request counts against your quota, no matter how many Tweets it returns.
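
The arithmetic behind these limits can be sketched as below. The helper names are hypothetical, and the figures are best-case estimates: a request that returns fewer than 100 tweets still uses up one of the 450.

```python
import math

TWEETS_PER_REQUEST = 100   # maximum tweets returned by one search request
REQUESTS_PER_WINDOW = 450  # requests allowed in each 15 minute window


def requests_needed(n_tweets):
    """Smallest number of requests that could return n_tweets."""
    return math.ceil(n_tweets / TWEETS_PER_REQUEST)


def windows_needed(n_tweets):
    """Number of 15 minute windows needed to make that many requests."""
    return math.ceil(requests_needed(n_tweets) / REQUESTS_PER_WINDOW)


max_tweets_per_window = TWEETS_PER_REQUEST * REQUESTS_PER_WINDOW

print(max_tweets_per_window)
print(requests_needed(1000), windows_needed(1000))
print(requests_needed(100_000), windows_needed(100_000))
```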

### What you receive
It is important to be clear what Twitter is providing you when you ask for data.
>The Twitter's standard search API (search/tweets) allows simple queries against the indices of recent or popular Tweets and behaves similarly to, but not exactly like the Search UI feature available in Twitter mobile or web clients. The Twitter Search API searches against a sampling of recent Tweets published in the past 7 days. Before digging in, it’s important to know that the standard search API is focused on relevance and not completeness. This means that some Tweets and users may be missing from search results.
[Twitter API Documentation: Standard Search](https://developer.twitter.com/en/docs/tweets/search/overview/standard)

- Already sampled based on 'relevance'.
- Max. 7 days old.
- NOT complete.

### Making a Single Request
Let's make a single request for something that will have a lot of results.

- Tweepy has a range of 'arguments' built into the search function.
- `q=` query: a string to search for. You can also use [operators](https://developer.twitter.com/en/docs/tweets/search/guides/standard-operators) to make complex queries.
- `result_type=`: set this to either `mixed`, `recent` or `popular`. Note again that Twitter is to some extent pre-sampling for you.
    - `popular` - prioritise popular tweets
    - `recent`  - prioritise recent tweets
    - `mixed` **DEFAULT** - include both popular and recent
- `tweet_mode=` Set this to 'extended' to ensure you get the full text of a tweet, otherwise it will be cut off after 140 characters (Tweets can now be 280 characters). If your project doesn't care about tweet text then don't bother including the argument.
- `count=` max tweets per request. Defaults to 15, can be set up to 100.
- `since_id=` Each tweet has a unique ID. If you provide a tweet's ID number here, it will only return tweets posted AFTER that tweet was posted.
- `max_id=` As above, but limits the API to returning tweets posted BEFORE the tweet provided.

[You can view all the argument options in the Tweepy Documentation](http://docs.tweepy.org/en/latest/api.html#search-methods)
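
The effect of `since_id` and `max_id` can be illustrated without touching the API. The sketch below filters a hypothetical list of toy tweets the way those two arguments are described above. It relies on tweet IDs increasing over time, so a larger ID means a more recent tweet. It also assumes that `max_id` is inclusive (IDs less than or equal to the one given), which is how Twitter's standard search documentation describes it. The function `filter_by_id` is illustrative only and is not part of Tweepy.

```python
# Hypothetical tweets, newest first. Real tweet IDs are much larger numbers.
toy_tweets = [{"id": 105}, {"id": 104}, {"id": 103}, {"id": 102}, {"id": 101}]


def filter_by_id(tweets, since_id=None, max_id=None):
    """Keep the ids of tweets newer than since_id and no newer than max_id."""
    kept = []
    for tweet in tweets:
        if since_id is not None and tweet["id"] <= since_id:
            continue  # posted at or before the since_id tweet
        if max_id is not None and tweet["id"] > max_id:
            continue  # posted after the max_id tweet
        kept.append(tweet["id"])
    return kept


newer = filter_by_id(toy_tweets, since_id=103)
older = filter_by_id(toy_tweets, max_id=103)
print(newer)
print(older)
```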

In [None]:
single_response = api.search_tweets(q='brexit', tweet_mode='extended', result_type='mixed', count=100)

In [None]:
# let's check the number of results we got
len(single_response)


In [None]:
# let's examine just one tweet object

single_tweet = single_response[0]

single_tweet

In [None]:
# You get a LOT of data in one single Tweet of a single response, but it's also a bit unwieldy. 
# Luckily we can access a nice structured version of this with the ._json attribute attached to each tweet object

# we'll use the ._json attribute in later sessions...



### Types of Data in a single Tweet object
Tweets from the API contain data such as...
- Time posted
- The text of the tweet
- Full details on the User who posted.
- Details of any media embedded in the tweet
- Details of any hashtags, user mentions and URLs

In [None]:
# If we check the type of our single_tweet we can see it is a tweepy Status object.
# When Tweepy received the response from Twitter, it wrapped it up into a useful object for us.

type(single_tweet)

In [None]:
# You can access any of these items individually as they are set as attributes of the Status class...



In [None]:
# a clean way to see all the relevant attributes is to ask for the json keys...

single_tweet._json.keys()

In [None]:
# You can also use Jupyter to help you by using the code completion suggestions
# type single_tweet. and then hit Tab on your keyboard to see your options.

# single_tweet.

In [None]:
# The values of some items will themselves be other objects, with their own attributes...

type(single_tweet.user)

In [None]:
single_tweet.user

In [None]:
single_tweet.user._json

In [None]:
# We can access these subvalues by just chaining our attribute requests

single_tweet.user.screen_name

If a tweet is a retweet it will also contain another tweet object with all the information on the original tweet.
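
The nesting can be pictured with a small stand-in. The sketch below uses `SimpleNamespace` objects as hypothetical substitutes for Tweepy `Status` objects, and assumes that the `retweeted_status` attribute only exists on retweets, mirroring the `'retweeted_status' in tweet._json` check used by the helper functions defined earlier. The function `describe` and all the values are invented for illustration.

```python
from types import SimpleNamespace

# Hypothetical stand-ins for Tweepy Status objects.
original = SimpleNamespace(
    full_text="Original tweet text",
    user=SimpleNamespace(screen_name="original_author"),
)
retweet = SimpleNamespace(
    full_text="RT @original_author: Original tweet text",
    user=SimpleNamespace(screen_name="retweeter"),
    retweeted_status=original,  # the original tweet nested inside the retweet
)


def describe(tweet):
    """Return (who posted this tweet, who wrote the original or None)."""
    if hasattr(tweet, "retweeted_status"):
        return tweet.user.screen_name, tweet.retweeted_status.user.screen_name
    return tweet.user.screen_name, None


print(describe(retweet))
print(describe(original))
```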

In [None]:
# let's make sure we are all looking at a retweet by using our handy function - this may be the tweet you were looking at already!

single_tweet_with_RT = find_first_retweet(single_response)
print(single_tweet_with_RT)

In [None]:
#.... and therefore we can also access the details of that tweet

print(single_tweet_with_RT.created_at)
print(single_tweet_with_RT.full_text)
print(single_tweet_with_RT.user.screen_name)

print('*'*100)

print(single_tweet_with_RT.retweeted_status.created_at)
print(single_tweet_with_RT.retweeted_status.full_text)
print(single_tweet_with_RT.retweeted_status.user.screen_name)



### Making Multiple Requests
 - With a single request we can retrieve 100 tweets
 - What if we want to maximise our data access and make multiple requests?
 - We could make a second request and then join the lists of results together...
 - However Twitter doesn't know what tweets we already retrieved in the first request, so we might get the same ones again.
 - Enter Tweepy's `Cursor` object.
 - The `Cursor` will keep track of where we are in the results stream, handle any api limits and blocks, and keep producing results until it reaches the set limit.


In [None]:
import tweepy
import pandas as pd
from credentials import API_KEY, API_SECRET

auth = tweepy.AppAuthHandler(API_KEY, API_SECRET)
api = tweepy.API(auth, wait_on_rate_limit=True)



In [None]:
# This is how we originally made a request for data from the API

old_approach = api.search_tweets(q='brexit', tweet_mode='extended', result_type='mixed', count=100)

In [None]:
# Using the cursor is similar to our original single_response method.
# we first create our custom cursor, providing it the api method we want to use,
# and any of the arguments we want to be used by that method.



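our_cursor = tweepy.Cursor(api.search_tweets, q='brexit', tweet_mode='extended', result_type='mixed', count=100)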

## Cursors
Cursor objects don't DO anything alone. They are almost like a set of instructions, but the instructions aren't being acted out until we do two things....
1. Specify whether we want our results as `items` or `pages`.
2. Iterate over the cursor

#### 1. Items / Pages

Cursors can return either a list of individual result items, or result pages depending on what is best. 
- Pages returns you a stream of response objects, each containing the maximum number of tweets per request.
- Items returns you a stream of tweets, essentially joining together the results of the responses.

We set whether we are using pages or items using a method attached to the cursor. The number we pass to the cursor defines the limit, of either pages or items. These arguments would return the same number of tweets, presuming we set our count to 100.

`our_cursor.pages(2)`

`our_cursor.items(200)`
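
To see why two pages of 100 and 200 items amount to the same tweets, here is a rough simulation. It is not Tweepy's implementation: `fake_search`, `iter_pages`, `pages` and `items` are hypothetical stand-ins working on made-up data. It assumes paging is done by asking each new request for tweets older than the last one seen, which is one way to avoid receiving the same tweets twice.

```python
from itertools import islice

# Hypothetical stand-in for Twitter: 1,000 fake tweets, newest (largest id) first.
FAKE_TWEETS = [{"id": tweet_id} for tweet_id in range(1000, 0, -1)]


def fake_search(count=100, max_id=None):
    """Return up to `count` fake tweets whose id is at most `max_id`."""
    matching = [t for t in FAKE_TWEETS if max_id is None or t["id"] <= max_id]
    return matching[:count]


def iter_pages(count=100):
    """Yield one list of tweets per request, never repeating a tweet."""
    max_id = None
    while True:
        page = fake_search(count=count, max_id=max_id)
        if not page:
            return
        yield page
        # Ask the next request only for tweets older than the last one seen.
        max_id = page[-1]["id"] - 1


def pages(limit, count=100):
    """A stream of at most `limit` pages."""
    return islice(iter_pages(count), limit)


def items(limit, count=100):
    """A stream of at most `limit` individual tweets, joined across pages."""
    all_items = (tweet for page in iter_pages(count) for tweet in page)
    return islice(all_items, limit)


tweets_from_pages = [tweet for page in pages(2) for tweet in page]
tweets_from_items = list(items(200))
print(len(tweets_from_pages), len(tweets_from_items))
```

Nothing is requested when `pages(2)` or `items(200)` is called. The fake requests only happen once the stream is iterated over, which is the behaviour described for cursors above.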

#### 2. Iterating over the Cursor

For our purposes asking for `.items()` is sufficient. Now we need to iterate over it.

In [None]:
# The most explicit way - using a for loop



# 3. Managing Tweet Data
- It's all well and good having this data and printing out pieces of it, but how do we...
- Structure it...
- Store it...
- and Explore it?

In [None]:
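# First let's get a fresh set of results with a new cursor. Limit to 500 items.
results = []
fresh_cursor = tweepy.Cursor(api.search_tweets, q='brexit', tweet_mode='extended', count=100)
for tweet in fresh_cursor.items(500):
    results.append(tweet)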

Ideally we'd like this data now in a Pandas DataFrame so we can work with it. Let's just try and put it in and see what happens...

In [None]:
df = pd.DataFrame(results)
df

Ok....partial success.
Pandas doesn't understand these `Status` objects we're trying to load into it. 
Whilst often people will load data into Pandas using .csv files, Pandas can create dataframes from python data structures such as lists and dictionaries.

In [None]:
flintstones_data = [ {'name':'Fred', 'age':30}, {'name':'Wilma', 'age':27}, {'name':'Barney', 'age':32}, {'name':'Betty', 'age':26}  ]

toy_df = pd.DataFrame(flintstones_data)
toy_df

So we need to somehow convert all of our `status` objects into some sort of Python data structure like our Flintstones data....

#### Luckily for us....
The `._json` attribute attached to each `Status` holds the tweet's data as a dictionary.

In [None]:
single_tweet = results[0]
print(type(single_tweet._json))
single_tweet._json

In [None]:
# Let's first create a new list of the transformed Status objects
json_results = [status._json for status in results]

    

In [None]:
# Now try...

df = pd.DataFrame(json_results)
df.head()

If we check, we can see that the columns in the DataFrame match the names of the attributes in our status objects. Each column represents one attribute, and each row represents a single Tweet/Status.

In [None]:
results[0]._json.keys()

In [None]:
df.info()

# 4. Saving Tweet Data

We can finally save our data to disk if we like. In this case we're going to save to something called a `pickle` file. Why?

In [None]:
# If we examine one of our columns...

df.entities

The values in the entities column aren't strings, they're dictionaries...

In [None]:
# Here is the first row's value in the 'entities' column
df.loc[0, 'entities']

In [None]:
# The type of the value is dict - dictionary.
type(df.loc[0, 'entities'])

In [None]:
# and parts of it can be accessed like a dictionary
df.loc[0, 'entities']['user_mentions']

<img src="https://github.com/Minyall/sc207_materials/blob/master/images/pickle.jpg?raw=true" align="right" height="200">
If we were to save this DataFrame as a .csv file, it would have to turn those dictionaries into strings, because CSV files can only hold text, not Python objects. When we reloaded the data from the CSV, our entities column would be a column of messy strings rather than dictionaries.

### How do we solve this?

# PICKLES!
- A pickle file is a saved version of a python object. So long as it is saved and loaded with the same version of Pandas, it will retain all the data exactly in the state it is in now.
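
The difference between the two formats can be seen on a single made-up row. As a simplification, this sketch uses Python's built-in `csv` and `pickle` modules rather than Pandas, on the assumption that the same principle carries over to saving a whole DataFrame. The row contents are hypothetical.

```python
import csv
import io
import pickle

# A hypothetical row shaped like one tweet, with a nested dictionary.
row = {"id": 1, "entities": {"hashtags": ["brexit"], "user_mentions": []}}

# Round trip through CSV text: every value comes back as a string.
csv_buffer = io.StringIO(newline="")
writer = csv.DictWriter(csv_buffer, fieldnames=["id", "entities"])
writer.writeheader()
writer.writerow(row)
csv_buffer.seek(0)
row_from_csv = next(csv.DictReader(csv_buffer))

# Round trip through pickle: the nested dictionary survives intact.
row_from_pickle = pickle.loads(pickle.dumps(row))

print(type(row_from_csv["entities"]))
print(type(row_from_pickle["entities"]))
```

After the CSV round trip you can no longer write `row_from_csv["entities"]["hashtags"]`, because the value is now text that merely looks like a dictionary.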

How do we complete this highly complex procedure?....

In [None]:
# Pickle it!
df.to_pickle('my_tweet_df.pkl')

# Extending your Collection

If you want to gather data across a longer period, such as sampling across a week, you may want to pull from the Twitter API once a day. How do we do this without duplicating our data, and how do we easily just add the new data to our dataset, rather than creating a new one each time?
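
The logic involved can be sketched with plain Python before seeing it applied to DataFrames. The toy tweets and variable names here are hypothetical. The idea is that the largest ID already saved tells us where the next collection should start, and keeping one entry per ID removes any overlap.

```python
# Hypothetical tweets already saved from an earlier run, and a newly collected batch.
saved = [{"id": 101, "text": "first"}, {"id": 102, "text": "second"}]
new_batch = [{"id": 102, "text": "second"}, {"id": 103, "text": "third"}]

# The largest saved id belongs to the most recent tweet we already hold.
# On the very first run nothing is saved, so there is no limit (None).
since_id = max((tweet["id"] for tweet in saved), default=None)

# Keep one copy of each id. Tweets that were already saved win over repeats.
by_id = {tweet["id"]: tweet for tweet in saved}
for tweet in new_batch:
    by_id.setdefault(tweet["id"], tweet)

combined = [by_id[tweet_id] for tweet_id in sorted(by_id)]
print(since_id, [tweet["id"] for tweet in combined])
```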

In [None]:
from pathlib import Path

my_data_filename = Path('twitter_data.pkl')
query = 'brexit'
n_items = 1000


# First load in your data if you have it, otherwise create a new DataFrame

if my_data_filename.exists():
    df = pd.read_pickle(my_data_filename)
    
    # if there is data check to find the largest id in your dataset, this will be the most recent
    max_id = df['id'].max()
else:
    df = pd.DataFrame()
    # set max_id to None because on the first run we don't need to provide an id to limit results
    max_id = None
    
# Pull results from the Twitter API

results = []
our_cursor = tweepy.Cursor(api.search_tweets, q=query, count=100, tweet_mode='extended', since_id=max_id)

for item in our_cursor.items(n_items):
    results.append(item._json)

# Load this batch of data into a DataFrame
    
current_data = pd.DataFrame(results)

# Append the new data onto the end of the loaded data (or the empty dataframe if this is the first run)
# DataFrame.append was removed in pandas 2.0, so we use pd.concat instead
df = pd.concat([df, current_data], ignore_index=True)

# Check the dataset for any duplicates by dropping any rows with duplicate ids
df = df.drop_duplicates('id')

# Save back to disk
df.to_pickle(my_data_filename)

print(len(df))