# Introduction to APIs
In this workbook, we'll play around with some APIs and get introduced to Pandas tangentially. 

APIs seem confusing and mysterious, but once you learn the lingo and play with 2 or 3 you'll see they're not that complicated. 

In [1]:
# I like to put all my imports at the top here.
from IPython.display import YouTubeVideo
import requests
import pandas as pd
from pprint import pprint

## What can we use web-based APIs for?

You can use web-based APIs for a variety of things, including:

* Making requests to use a particular service (e.g. embedding Google Maps on your website)
* Making requests for data (e.g. using the <a href="http://developer.edmunds.com/">Edmunds API</a>).
* Performing an action (e.g. posting to Twitter using the <a href = "https://dev.twitter.com/rest/public">Twitter API</a>).

Most APIs have an authentication system to regulate access so in most cases you'll need to register and get an API key. In the lecture you heard a bunch of information on APIs. If you're trying to play catch up, want to hear it from someone else, or just like watching videos, here's another short introduction:

In [2]:
from IPython.display import YouTubeVideo
YouTubeVideo("7YcW25PHnAA",width=800,height=600)
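Since most APIs want a key, here's a minimal sketch of one common way to keep that key out of your notebook: read it from an environment variable. The variable name `MY_API_KEY` and the helper `get_api_key` are made up for illustration; use whatever name suits the API you're working with.

```python
import os

def get_api_key(name="MY_API_KEY", env=None):
    """Return the API key stored under `name`, or complain loudly if it's missing.

    `name` and this helper are hypothetical; `env` defaults to the real
    environment but can be any dict-like object, which makes it easy to try out.
    """
    env = os.environ if env is None else env
    key = env.get(name)
    if not key:
        raise RuntimeError(f"Set the {name} environment variable first")
    return key

# Try it with a stand-in environment so nothing real is needed.
demo_env = {"MY_API_KEY": "abc123"}
values = {"apiKey": get_api_key(env=demo_env), "rows": 15}
print(values)
```

The upside is that you can share the notebook without sharing the key.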

## The structure of API requests

Every API request has the same basic structure:

* URL: The location of the API. Can also contain parameters relating to the request, such as the API key.
* Method: What we want the API to do:
    * GET - Retrieves a resource
    * POST - Creates a new resource
    * PUT - Updates (replaces) an existing resource
    * DELETE - Deletes a resource
* Headers: Metadata about the request, such as the content type or authentication details.
* Body: The content of the API request, such as the data sent with a POST or PUT.
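To make the four parts concrete, here's a small sketch that builds a request object without sending it. It uses the standard library's `urllib.request.Request` so it runs with no extra installs; the URL and payload are hypothetical and only there to have something to look at.

```python
import json
from urllib.request import Request

# Hypothetical endpoint and payload, used only to show the four parts.
url = "https://api.example.com/trains"
payload = {"stationCode": "EUS", "platform": "5"}

req = Request(
    url,                                            # URL: where the request goes
    data=json.dumps(payload).encode("utf-8"),       # Body: bytes sent to the server
    headers={"Content-Type": "application/json"},   # Headers: metadata about the request
    method="POST",                                  # Method: what we want done
)

# Nothing has been sent yet; we're only inspecting the request object.
print(req.get_method())
print(req.full_url)
print(req.header_items())
print(req.data)
```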

You need to be aware of something called the _API Endpoint_, which is the URL where our API request will end up: the specific address at which a particular resource or service can be reached. APIs can have multiple endpoints for different resources.

We'll also need to be aware of something called a _Query String_. This is some text that is appended to the API endpoint location and allows us to set parameters for our API request.
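Here's a quick sketch of how an endpoint and a query string fit together, using only the standard library. The endpoint is the train API used later in this notebook; the key is a placeholder, not a real one.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

endpoint = "http://darwin.hacktrain.com/api/train/EUS"   # endpoint for Euston Station
params = {"apiKey": "YOUR-API-KEY", "rows": 15}           # placeholder key

# The query string is just key=value pairs joined with "&" ...
query_string = urlencode(params)

# ... and it gets attached to the endpoint after a "?".
full_url = endpoint + "?" + query_string
print(full_url)

# Going the other way: pull a URL apart again.
parts = urlsplit(full_url)
print(parts.path)              # the endpoint's path
print(parse_qs(parts.query))   # the parameters, as a dict of lists
```

Requests does this gluing for you when you pass a dictionary to `params=`, which is what we'll do below.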

## API Responses

Once our API request has been made, we'll receive a response with the following parts:

* Headers: Metadata about the response, such as the content type.
* Body: The content returned by the API, often formatted as JSON.
* Status code: A three-digit code that indicates the status of our request. Some common codes are as follows:

    **200** OK (successful)<br/>
    **300** Multiple Choices<br/>
    **301** Moved Permanently<br/>
    **302** Found<br/>
    **304** Not Modified<br/>
    **307** Temporary Redirect<br/>
    **400** Bad Request<br/>
    **401** Unauthorized<br/>
    **403** Forbidden<br/>
    **404** Not Found<br/>
    **410** Gone<br/>
    **500** Internal Server Error<br/>
    **501** Not Implemented<br/>
    **503** Service Unavailable<br/>
    **550** Permission denied (non-standard; not defined by the HTTP specification)<br/>

Here's a full <a href = "https://www.smartlabsoftware.com/ref/http-status-codes.htm">list</a>.

The two most common are 200 (where your request has been successful) and 404 (where the server can't find anything at the URL you asked for).
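If you'd rather not memorize the list, Python's standard library knows the standard codes. The sketch below looks up the reason phrase and uses the first digit to say what family a code belongs to. The helper name `describe_status` is mine, and the lookup only covers codes that Python's `http` module lists.

```python
from http import HTTPStatus

def describe_status(code):
    """Return (reason phrase, broad class) for a status code."""
    classes = {1: "informational", 2: "success", 3: "redirection",
               4: "client error", 5: "server error"}
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:
        # Not one of the codes Python's http module knows about.
        phrase = "unknown to Python's http module"
    return phrase, classes.get(code // 100, "unknown")

for code in (200, 301, 404, 503, 550):
    print(code, *describe_status(code))
```

The first digit is the handy part: 2xx means it worked, 4xx means you probably did something wrong, and 5xx means the server did.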

I found a fun example online accessing data from London public transportation. Let's look through it.

In [2]:
num_rows = 15
url   = 'http://darwin.hacktrain.com/api/train/'                       # The Location of the API
values= {'apiKey':'b05cc6d2-7704-4350-a44f-062b59ba39c5',
         'rows':num_rows}  # A Dictionary for our API key and limiting the rows to num_rows
stat  = 'EUS'                                                          # The API parameter for Euston Station

Requests describes itself as an elegant and simple HTTP library for Python, built for human beings. It's included as part of the Anaconda distribution, so that import at the top should have worked. Requests knows how to take the `url` and `values` and glue them together into the API request.

In [3]:
r = requests.get(url+stat,params=values)                               # Makes the request and assigns it to the object "r".

ConnectionError: HTTPConnectionPool(host='darwin.hacktrain.com', port=80): Max retries exceeded with url: /api/train/EUS?apiKey=b05cc6d2-7704-4350-a44f-062b59ba39c5&rows=15 (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x000001760258EFD0>: Failed to establish a new connection: [WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond',))
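The traceback above is what happens when the server never answers. Here's a minimal sketch of one way to fail more gracefully: set a timeout and catch the error instead of letting the cell blow up. To keep it runnable without a live server or extra installs, it uses the standard library's `urllib` rather than Requests, and the `opener` argument plus the two stand-in openers are hypothetical hooks I added for the demo. With Requests, the same idea would be to pass `timeout=` to `requests.get` and catch `requests.exceptions.RequestException`.

```python
import io
import json
import socket
import urllib.error
import urllib.request

def fetch_json(url, timeout=10, opener=urllib.request.urlopen):
    """Return the decoded JSON body, or None if the request fails."""
    try:
        with opener(url, timeout=timeout) as response:
            return json.loads(response.read().decode("utf-8"))
    except urllib.error.HTTPError as err:    # the server answered with 4xx/5xx
        print("HTTP error:", err.code, err.reason)
    except urllib.error.URLError as err:     # no answer at all (DNS, refused, ...)
        print("Connection problem:", err.reason)
    except socket.timeout:                   # connected, but the reply took too long
        print("Timed out after", timeout, "seconds")
    return None

def offline_opener(url, timeout=None):
    """Stand-in for urlopen that simulates an unreachable host."""
    raise urllib.error.URLError("host did not respond")

def canned_opener(url, timeout=None):
    """Stand-in for urlopen that returns a fixed JSON body."""
    return io.BytesIO(b'[{"platform": "5"}]')

print(fetch_json("http://example.invalid/api", opener=offline_opener))
print(fetch_json("http://example.invalid/api", opener=canned_opener))
```

Returning `None` on failure is just one choice; in a longer script you might prefer to retry or re-raise.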

In [None]:
print(r.request)           # The request we sent; printing it shows the method (GET)
print(r.status_code)       # The 3 digit API response code
print(r.reason)            # The text reason phrase that goes with the status code
print(r.headers)           # The headers associated with the API response
print(r.cookies)           # Shows any cookies related to the API response
print(r.url)               # The final URL of the request, including the query string

In [11]:
# comment and uncomment these to see the response in raw text and in json. 
# I like using pprint for most objects. I've got both here.

#print(r.text)              # The content (body) of the response in text format
#print(r.json())           # The content (body) of the response in JSON format
pprint(r.json())           # Alternate printing of the above. 

[{'destinations': [{'stationCode': 'SRA', 'stationName': 'Stratford (London)'}],
  'estimatedArrivalTime': '22:32',
  'estimatedDepartureTime': None,
  'operatorCode': 'LO',
  'operatorName': 'London Overground',
  'origins': [{'stationCode': 'RMD', 'stationName': 'Richmond'}],
  'platform': '1',
  'scheduledArrivalTime': '22:24',
  'scheduledDepartureTime': None,
  'serviceId': 'V6fA0vRDHGE9gq7Pnh76Iw=='},
 {'destinations': [{'stationCode': 'RMD', 'stationName': 'Richmond'}],
  'estimatedArrivalTime': None,
  'estimatedDepartureTime': '22:36',
  'operatorCode': 'LO',
  'operatorName': 'London Overground',
  'origins': [{'stationCode': 'SRA', 'stationName': 'Stratford (London)'}],
  'platform': '1',
  'scheduledArrivalTime': None,
  'scheduledDepartureTime': '22:30',
  'serviceId': '2LA15xIiEwiIFYEzXh8deg=='},
 {'destinations': [{'stationCode': 'LST',
                    'stationName': 'London Liverpool Street'}],
  'estimatedArrivalTime': 'On time',
  'estimatedDepartureTime': 'On tim

Pandas makes JSON much easier to import and to read:

In [7]:
df = pd.DataFrame(r.json())
df.head(5)

| | destinations | estimatedArrivalTime | estimatedDepartureTime | operatorCode | operatorName | origins | platform | scheduledArrivalTime | scheduledDepartureTime | serviceId |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | [{'stationCode': 'EUS', 'stationName': 'London... | 22:32 | | VT | Virgin Trains | [{'stationCode': 'MAN', 'stationName': 'Manche... | 5 | 22:28 | | /B8yx4lCM4KpuCH/u94vyg== |
| 1 | [{'stationCode': 'EUS', 'stationName': 'London... | On time | | VT | Virgin Trains | [{'stationCode': 'EDB', 'stationName': 'Edinbu... | 3 | 22:43 | | 4W7r70YfoarjlIetalobsA== |
| 2 | [{'stationCode': 'MKC', 'stationName': 'Milton... | | On time | LM | London Midland | [{'stationCode': 'EUS', 'stationName': 'London... | 8 | | 22:44 | U3kn7t4wzSCdRoE7VvD98A== |
| 3 | [{'stationCode': 'EUS', 'stationName': 'London... | On time | | LO | London Overground | [{'stationCode': 'WFJ', 'stationName': 'Watfor... | 9 | 22:52 | | BdqSe7nfyL1VJrSaVRXW+Q== |
| 4 | [{'stationCode': 'EUS', 'stationName': 'London... | On time | | LM | London Midland | [{'stationCode': 'BHM', 'stationName': 'Birmin... | 8 | 22:52 | | cO+gaGfS1QtGm4VmpXrTdw== |


The `destinations` and `origins` columns hold nested fields (a list of dictionaries in each cell), which makes this format sub-optimal. Let's clean it up with a loop:

In [8]:
df['Origin Station'] = ''  # These are blank columns in a pandas data frame.
df['Destination Station'] = ''  # In R you'd type something like `df$destination <- ""`

# Wanna see what's going on in the code? Put 
# print statements everywhere! I've done it for you
# this time, but now you should know the trick.
#print(df.head(5))

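for x in df.index.values:           # A loop over the data frame to clean origin / destination variables
    # Use .loc[row, column] to write into the frame; chained indexing like df[col][x] = ...
    # can end up writing to a temporary copy. We'll explain more about what's going on here later.
    df.loc[x, 'Origin Station'] = df.loc[x, 'origins'][0]['stationName']
    df.loc[x, 'Destination Station'] = df.loc[x, 'destinations'][0]['stationName']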

#print(df.head(5))

df.drop(['origins','destinations','platform','operatorCode'],axis=1,inplace=True) # get rid of some columns.

#print(df.head(5))

df = df[['Origin Station','Destination Station',
         'scheduledArrivalTime','estimatedArrivalTime',
         'scheduledDepartureTime','estimatedDepartureTime',
         'operatorName','serviceId']]

#print(df.head(5))
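If the API isn't answering, you can still practice the clean-up step. Here's a sketch that does the same un-nesting in plain Python on two records copied (and trimmed to a few fields) from the JSON output shown earlier. The `flatten` helper is my own name for the step, not something from a library.

```python
# Two records trimmed from the JSON response printed earlier in this notebook.
sample = [
    {"destinations": [{"stationCode": "SRA", "stationName": "Stratford (London)"}],
     "origins": [{"stationCode": "RMD", "stationName": "Richmond"}],
     "scheduledArrivalTime": "22:24", "estimatedArrivalTime": "22:32",
     "operatorName": "London Overground"},
    {"destinations": [{"stationCode": "RMD", "stationName": "Richmond"}],
     "origins": [{"stationCode": "SRA", "stationName": "Stratford (London)"}],
     "scheduledDepartureTime": "22:30", "estimatedDepartureTime": "22:36",
     "operatorName": "London Overground"},
]

def flatten(record):
    """Copy a record, replacing the nested station lists with plain names."""
    flat = {k: v for k, v in record.items()
            if k not in ("origins", "destinations")}
    # Each nested field is a list of dicts; we take the first station listed.
    flat["Origin Station"] = record["origins"][0]["stationName"]
    flat["Destination Station"] = record["destinations"][0]["stationName"]
    return flat

rows = [flatten(rec) for rec in sample]
for row in rows:
    print(row["Origin Station"], "->", row["Destination Station"])
```

A list of flat dictionaries like `rows` can be handed straight to `pd.DataFrame(rows)`.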

In some sense, that's kind of it. We called an API, got some data, and processed it. Kudos. Let's push a little further by pulling information for lots of London stations:

In [12]:
# List of station codes to iterate through:
stat_list  = ['VIC','WAT','PAD','LST','CLJ','KGX','EUS','LBG','SRA']

# Creating a blank dataframe to which we can append our results:
df = pd.DataFrame(columns=['Query Code','Origin Station','Destination Station','scheduledArrivalTime','estimatedArrivalTime',
                           'scheduledDepartureTime','estimatedDepartureTime','operatorName','serviceId'])

# For loop to make the  request and clean the results:
for stat in stat_list:
    r = requests.get(url+stat,params=values)         # Makes the request
    df_stat = pd.DataFrame(r.json())                 # Creates a dataframe based upon the response data

    df_stat['Origin Station'] = ''
    df_stat['Destination Station'] = ''
    df_stat['Query Code'] = stat

    # same tricks as before to get inside the dataframe and clean it.
    for x in df_stat.index.values:
        # .loc[row, column] writes into the frame itself rather than a temporary copy
        df_stat.loc[x, 'Origin Station'] = df_stat.loc[x, 'origins'][0]['stationName']
        df_stat.loc[x, 'Destination Station'] = df_stat.loc[x, 'destinations'][0]['stationName']

    df_stat.drop(['origins','destinations','platform','operatorCode'],axis=1,inplace=True)
    df_stat = df_stat[['Query Code','Origin Station','Destination Station','scheduledArrivalTime','estimatedArrivalTime','scheduledDepartureTime','estimatedDepartureTime','operatorName','serviceId']]
    df = pd.concat([df,df_stat], ignore_index=True)  # ignore_index renumbers rows so the index doesn't repeat per station
    
df.head()

| | Query Code | Origin Station | Destination Station | scheduledArrivalTime | estimatedArrivalTime | scheduledDepartureTime | estimatedDepartureTime | operatorName | serviceId |
|---|---|---|---|---|---|---|---|---|---|
| 0 | VIC | London Victoria | Brighton | | | 23:07 | 23:09 | Southern | ocaY9h2PiD2/LEoqtWjn0A== |
| 1 | VIC | Ashford International | London Victoria | 23:07 | On time | | | Southeastern | 8enFhAokSdrlvl/69ww1mg== |
| 2 | VIC | London Victoria | Dartford | | | 23:09 | On time | Southeastern | +Bo/DSixnOuXC4wmUJh4sw== |
| 3 | VIC | East Grinstead | London Victoria | 23:09 | 23:18 | | | Southern | BIHt4PCAi7RGITrjc+2yag== |
| 4 | VIC | London Victoria | Orpington | | | 23:10 | On time | Southeastern | /14UCxPrrgQtD0wJnui6Pw== |


## More reading on APIs
<a href = "https://zapier.com/learn/apis/chapter-2-protocols/">Great Zapier introduction to APIs</a><br/>
<a href = "http://docs.python-requests.org/en/master/user/quickstart/">Requests API reference</a><br/>
<a href = "http://blog.smartbear.com/apis/understanding-soap-and-rest-basics/">REST vs SOAP</a><br/>
<a href = "https://dev.socrata.com/docs/endpoints.html">API Endpoints</a><br/>
<a href = "https://www.smartlabsoftware.com/ref/http-status-codes.htm">HTTP Status Codes</a><br/>

# _Data Science from Scratch_ API Code
The code below reproduces some of the examples from Chapter 9 of Joel Grus's book.

In [4]:
from dateutil.parser import parse
import math, random, csv, json
from collections import Counter

# Joel data
endpoint = "https://api.github.com/users/joelgrus/repos"
repos = json.loads(requests.get(endpoint).text)

dates = [parse(repo["created_at"]) for repo in repos]
month_counts = Counter(date.month for date in dates)
weekday_counts = Counter(date.weekday() for date in dates)

Print those variables that were just created to see what's in them. `Counter` is insanely useful, as is `parse`.

In [5]:
print("GitHub API")
#print("dates", dates) # commented out because it's pretty long
print("month_counts", month_counts)
print("weekday_count", weekday_counts)

last_5_repositories = sorted(repos,
                             key=lambda r: r["created_at"],
                             reverse=True)[:5]

print("last five languages", [repo["language"] 
                              for repo in last_5_repositories])

GitHub API
month_counts Counter({7: 8, 5: 4, 11: 4, 1: 3, 8: 2, 2: 2, 10: 2, 12: 2, 9: 1, 6: 1, 4: 1})
weekday_count Counter({2: 6, 1: 5, 6: 5, 3: 5, 4: 4, 0: 3, 5: 2})
last five languages [None, 'Python', 'Jupyter Notebook', 'JavaScript', 'HTML']
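To see what `Counter` is doing with those dates, here's a tiny offline sketch. The timestamps are made up (they only mimic the format of GitHub's `created_at` field), and I'm using the standard library's `datetime.strptime` in place of `dateutil`'s `parse` so it runs without any extra packages.

```python
from collections import Counter
from datetime import datetime

# Made-up timestamps in the same shape as GitHub's "created_at" values.
created = ["2013-07-05T02:02:28Z", "2013-07-19T15:30:00Z",
           "2014-01-06T09:00:00Z", "2014-07-08T12:00:00Z"]

sample_dates = [datetime.strptime(s, "%Y-%m-%dT%H:%M:%SZ") for s in created]

# Counter tallies how many times each value shows up.
sample_month_counts = Counter(d.month for d in sample_dates)
sample_weekday_counts = Counter(d.weekday() for d in sample_dates)  # Monday is 0

print(sample_month_counts)                  # three in July (7), one in January (1)
print(sample_weekday_counts)                # two Fridays (4), one Monday (0), one Tuesday (1)
print(sample_month_counts.most_common(1))   # the busiest month and its count
```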


Below is Joel's code on working with Twitter. We probably won't cover this in class today.

In [None]:
from twython import Twython

# fill these in if you want to use the code
CONSUMER_KEY = ""
CONSUMER_SECRET = ""
ACCESS_TOKEN = ""
ACCESS_TOKEN_SECRET = ""

def call_twitter_search_api():

    twitter = Twython(CONSUMER_KEY, CONSUMER_SECRET)

    # search for tweets containing the phrase "data science"
    for status in twitter.search(q='"data science"')["statuses"]:
        # These are already str in Python 3; calling .encode('utf-8') would print them as b'...'
        user = status["user"]["screen_name"]
        text = status["text"]
        print(user, ":", text)

from twython import TwythonStreamer

# appending data to a global variable is pretty poor form
# but it makes the example much simpler
tweets = [] 

class MyStreamer(TwythonStreamer):
    """our own subclass of TwythonStreamer that specifies
    how to interact with the stream"""

    def on_success(self, data):
        """what do we do when twitter sends us data?
        here data will be a Python object representing a tweet"""

        # only want to collect English-language tweets
        # (.get avoids a KeyError if a message has no 'lang' field)
        if data.get('lang') == 'en':
            tweets.append(data)

        # stop when we've collected enough
        if len(tweets) >= 1000:
            self.disconnect()

    def on_error(self, status_code, data):
        print(status_code, data)
        self.disconnect()

def call_twitter_streaming_api():
    stream = MyStreamer(CONSUMER_KEY, CONSUMER_SECRET, 
                        ACCESS_TOKEN, ACCESS_TOKEN_SECRET)

    # starts consuming public statuses that contain the keyword 'data'
    stream.statuses.filter(track='data')
    