# Lab 5 - APIs

APIs, or Application Programming Interfaces, are a way of accessing data from a business or government organization with code rather than by hand. Many, if not most, social media companies have some kind of API that is made available to third-party developers and researchers. Some of these work better for researchers than others.

Today we're going to explore two different types of APIs, one from Statistics Canada and another from Twitter.

You can think of an API request as what happens when you access a website with a web browser. You type in the URL for a page or click a link, and then the server sends back what you asked for. Something similar happens when you make an API request. You are accessing a particular URL which will return a set of requested information.

In both of the examples we are working with in this lab, the information is returned in JSON, or **J**ava**S**cript **O**bject **N**otation. This is usually pronounced like "JAY-SAAN". JSON is structured a lot like Python objects, with dictionaries and lists being the main objects. For instance, a tweet is sort of structured like this:

In [None]:
tweet_json = """
{
    "id": 12345,
    "created_at": "2016-11-01",
    "text": "I am really into #python programming. #utm #winning",
    "user": {
        "id": 2345,
        "screen_name": "alexhanna",
        "name": "Alex Hanna"
    },
    "entities": {
        "hashtags": [
            {
                "name": "#python"
            },
            {
                "name": "#utm"
            },
            {
                "name": "#winning"
            }
        ]
    }
}"""

To parse it into something Python can use, we use the <code>json</code> module. Once there, we can access parts of it like a dictionary.

In [None]:
import json
tweet_obj = json.loads(tweet_json)

print(tweet_obj['user']['name'])
print(tweet_obj['text'])
for ht in tweet_obj['entities']['hashtags']:
    print(ht['name'])
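
To make the link between JSON and Python objects concrete, here is a minimal sketch of how <code>json.loads</code> maps JSON types onto Python types. The sample string is made up for illustration; the <code>verified</code> and <code>place</code> fields in particular are hypothetical and only there to show how <code>false</code> and <code>null</code> come through.

```python
import json

# A made-up JSON string (not real Twitter output) covering the common JSON types.
sample_json = (
    '{"id": 12345, "hashtags": ["#python", "#utm"], '
    '"user": {"name": "Alex Hanna"}, "verified": false, "place": null}'
)

sample_obj = json.loads(sample_json)

# Show which Python type each top-level value was parsed into.
for key, value in sample_obj.items():
    print(key, type(value).__name__)

# json.dumps goes the other way, from Python objects back to a JSON string.
round_trip = json.loads(json.dumps(sample_obj))
```

JSON objects become dictionaries, arrays become lists, and <code>false</code> and <code>null</code> become <code>False</code> and <code>None</code>.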

## Accessing APIs with <code>urllib</code>

In [None]:
import urllib.request
import json

url = "http://open.canada.ca/data/en/api/3/action/package_search?q=spending"

res = urllib.request.urlopen(url)

In [None]:
json_str = res.read()

In [None]:
json_obj = json.loads(json_str.decode())

In [None]:
json_obj

In [None]:
results = json_obj['result']['results']
print(len(results))

In [None]:
for result in results:
    title = result['title']
    print(title)
    print()
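
The URL above has the search term typed directly into the query string. As a rough sketch, the same kind of URL can be built from a dictionary with <code>urllib.parse.urlencode</code>, which also takes care of escaping spaces and special characters. The search term and the <code>rows</code> parameter here are assumptions for illustration, so check the API's documentation for the parameters it actually accepts. Nothing is sent over the network in this example.

```python
from urllib.parse import urlencode

# Same endpoint as above, without the query string.
base_url = "http://open.canada.ca/data/en/api/3/action/package_search"

# Query parameters as a dictionary. "rows" is an assumed parameter here;
# consult the API documentation before relying on it.
params = {"q": "government spending", "rows": 5}

# urlencode escapes the values and joins them with "&".
query_url = base_url + "?" + urlencode(params)
print(query_url)
```

The resulting string can be passed to <code>urllib.request.urlopen</code> in the same way as the hand-written URL.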

## Accessing the Twitter API through <code>tweepy</code>

First we need to install a package called <code>tweepy</code>. Instead of handling all the URL requests by hand, <code>tweepy</code> does this all behind the scenes. We can install a new package from within Jupyter Notebook like so. The leading <code>!</code> tells Jupyter to run the rest of the line as a shell command, as if you had typed it on the Windows, Mac, or UNIX command line.

In [None]:
!pip install tweepy --prefix=packages

After that, we need to add the folder containing the newly installed <code>tweepy</code> module to our library path. The path is the list of folders where Python looks for libraries when you import them. Because we don't have permission to permanently install new packages, we instead tell Python to look in the packages folder where we told <code>pip</code> to put the new module.

In [None]:
import os
import sys
path = '/packages/Lib/site-packages'
## If you are running this on Mac, comment out the previous line 
## and uncomment the line below
##path = '/packages/lib/python3.5/site-packages'
sys.path.insert(0, os.getcwd() + path)
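
To see what changing the path actually does, here is a small self-contained sketch that does not involve <code>tweepy</code> at all. It writes a throwaway module into a temporary folder, adds that folder to <code>sys.path</code>, and then imports the module. The name <code>demo_module</code> is made up for this illustration.

```python
import importlib
import os
import sys
import tempfile

# Create a temporary folder holding a tiny one-line module.
demo_dir = tempfile.mkdtemp()
with open(os.path.join(demo_dir, "demo_module.py"), "w") as fh:
    fh.write("GREETING = 'hello from demo_module'\n")

# Put the folder at the front of the search path so Python checks it first.
sys.path.insert(0, demo_dir)
importlib.invalidate_caches()

# The import succeeds only because demo_dir is now on sys.path.
demo_module = importlib.import_module("demo_module")
print(demo_module.GREETING)
```

The cell above does the same thing, with the packages folder in place of the temporary one.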

In [None]:
from slistener import SListener
import json
import time
import tweepy
import sys

In [None]:
## authentication
access_token = ""
access_token_secret = ""
consumer_key = ""
consumer_secret = ""

auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
api  = tweepy.API(auth)

After that, we can start tracking things from Twitter's stream. This is only a small sample of what we can get from Twitter. To get full access, including historical data, you need to pay a company.

For now, we'll pick some fairly obvious keywords related to the US election which are sure to show up on Twitter. There are other [things we can track](https://dev.twitter.com/streaming/reference/post/statuses/filter) here, but for now we'll just track keywords.

In [None]:
## set up words to track
track = ['trump', 'clinton']

listen = SListener(api, 'election2016')
stream = tweepy.Stream(auth, listen)

print("Streaming started...")

stream.filter(track = track)

Finally, we want to convert this all to a DataFrame. The problem is that with the nested structure of a tweet, it's kind of difficult to put it in a rectangular format. For that, we'll do a little bit of "flattening" of the data. This entails going through all the files, then going through all the tweets in those files and copying the nested fields we care about into their own keys at the top level of each tweet, so that each one becomes a column. In particular, let's get the screen names of users, the text and users of retweets, and the text and users of quoted tweets.

In [None]:
import glob
import pandas as pd
import numpy as np

tweets = []
files  = list(glob.iglob('election2016*.json'))
for f in files:
    ## open each file in a with-block so it is closed once it has been read
    with open(f, 'r', encoding = 'utf-8') as fh:
        tweets_json = fh.read().split("\n")

    ## remove empty lines
    tweets_json = list(filter(len, tweets_json))

    ## parse each tweet
    for tweet in tweets_json:
        try:
            tweet_obj = json.loads(tweet)

            ## flatten the file to include quoted status and retweeted status info
            if 'quoted_status' in tweet_obj:
                tweet_obj['quoted_status-text'] = tweet_obj['quoted_status']['text'] 
                tweet_obj['quoted_status-user-screen_name'] = tweet_obj['quoted_status']['user']['screen_name']

            if 'retweeted_status' in tweet_obj:
                tweet_obj['retweeted_status-user-screen_name'] = tweet_obj['retweeted_status']['user']['screen_name']
                tweet_obj['retweeted_status-text'] = tweet_obj['retweeted_status']['text']

            tweet_obj['user-screen_name'] = tweet_obj['user']['screen_name']

            tweets.append(tweet_obj)
        except (ValueError, KeyError):
            ## skip lines that are not valid JSON or lack the expected fields
            pass

## create pandas DataFrame for further analysis
df_tweet = pd.DataFrame(tweets)
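
The loop above flattens a few chosen fields by hand. As a rough sketch of the general idea, a small recursive function can flatten every nested dictionary in the same way, joining keys with <code>-</code> to match the column names used above. The function name <code>flatten</code> and the sample tweet are made up for this illustration, and lists are left as they are.

```python
def flatten(obj, parent_key="", sep="-"):
    """Return a one-level dict whose keys are the joined paths into obj."""
    flat = {}
    for key, value in obj.items():
        new_key = parent_key + sep + key if parent_key else key
        if isinstance(value, dict):
            # Recurse into nested dictionaries, carrying the key prefix along.
            flat.update(flatten(value, new_key, sep))
        else:
            # Non-dict values (strings, numbers, lists) are kept as they are.
            flat[new_key] = value
    return flat


# A made-up tweet with the same nesting as the real ones.
sample_tweet = {
    "text": "hello",
    "user": {"screen_name": "alexhanna"},
    "retweeted_status": {
        "text": "original",
        "user": {"screen_name": "someone"},
    },
}

flat_tweet = flatten(sample_tweet)
for key in sorted(flat_tweet):
    print(key, "=", flat_tweet[key])
```

Flattening everything produces many more columns than picking fields by hand, which may or may not be what you want for a given analysis.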