<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Practice Using APIs

_Instructor: Aymeric Flaisler_

---

In this lab we will practice using some popular APIs to retrieve and store data.

In [11]:
# Imports at the top
import json
import re
import urllib

import numpy as np
import pandas as pd
import requests
import matplotlib.pyplot as plt
%matplotlib inline

## Exercise 1: get data from Sheetsu

---

[Sheetsu](https://sheetsu.com/) is an online service that allows you to access any Google spreadsheet from an API. This can be a very powerful way to share a dataset with colleagues as well as to create a mini centralized data storage that is simpler to edit than a database.

A Google Spreadsheet with Wine data can be found [here](https://docs.google.com/spreadsheets/d/17CnKCqQTnyHgEljbQE5yK5DNuuqnssZ90qAi8XhM1e8/edit?usp=sharing).

It can be accessed through the Sheetsu API at this endpoint: https://sheetsu.com/apis/v1.0su/4ce9c3459aad

**Questions:**

1. Use the requests library to access the document. Inspect the response text. What kind of data is it?
- Check the status code of the response object. What code is it?
- Use the appropriate libraries and read functions to read the response into a pandas DataFrame.
- Once you've imported the data into a DataFrame, check the value of the 5th row: what's the price?

In [12]:
# You can either post or get info from this API
api_base_url = 'https://sheetsu.com/apis/v1.0su/4ce9c3459aad'

In [13]:
# What kind of data is this returning?
api_response = requests.get(api_base_url)

In [4]:
api_response

<Response [200]>

In [5]:
api_response.text[:100]

'[{"Color":"W","Region":"Portugal","Country":"Portugal","Vintage":"2013","Vinyard":"Vinho Verde","Nam'

In [38]:
# 1. It is a json string

In [39]:
api_response.text

'[{"Color":"W","Region":"Portugal","Country":"Portugal","Vintage":"2013","Vinyard":"Vinho Verde","Name":"","Grape":"","Consumed In":"2015","Score":"4","Price":""},{"Color":"W","Region":"France","Country":"France","Vintage":"2013","Vinyard":"Peyruchet","Name":"","Grape":"","Consumed In":"2015","Score":"3","Price":"17.8"},{"Color":"W","Region":"Oregon","Country":"Oregon","Vintage":"2013","Vinyard":"Abacela","Name":"","Grape":"","Consumed In":"2015","Score":"3","Price":"20"},{"Color":"W","Region":"Spain","Country":"Spain","Vintage":"2012","Vinyard":"Ochoa","Name":"","Grape":"chardonay","Consumed In":"2015","Score":"2.5","Price":"7"},{"Color":"R","Region":"","Country":"US","Vintage":"2012","Vinyard":"Heartland","Name":"Spice Trader","Grape":"chiraz, cab","Consumed In":"2015","Score":"3","Price":"6"},{"Color":"R","Region":"California","Country":"US","Vintage":"2012","Vinyard":"Crow Canyon","Name":"","Grape":"cab","Consumed In":"2015","Score":"3.5","Price":"13"},{"Color":"R","Region":"Oregon

In [8]:
response = json.loads(api_response.text)

In [9]:
type(response)

list

In [10]:
response[0]

{'Color': 'W',
 'Consumed In': '2015',
 'Country': 'Portugal',
 'Grape': '',
 'Name': '',
 'Price': '',
 'Region': 'Portugal',
 'Score': '4',
 'Vintage': '2013',
 'Vinyard': 'Vinho Verde'}

In [11]:
api_response.status_code

200

In [12]:
# 2. response code is 200
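
A status code is easier to interpret alongside its standard reason phrase. The following is a minimal sketch using the standard library's `http.HTTPStatus`; the helper names `describe_status` and `is_success` are illustrative and are not part of `requests`.

```python
from http import HTTPStatus


def describe_status(code):
    """Return a readable label such as '200 OK' for an HTTP status code."""
    status = HTTPStatus(code)
    return f"{status.value} {status.phrase}"


def is_success(code):
    """Return True for any 2xx status code."""
    return 200 <= code < 300


print(describe_status(200))  # 200 OK
print(describe_status(201))  # 201 Created
print(is_success(404))       # False
```

With a live response object, `api_response.ok` or `api_response.raise_for_status()` offers a similar check directly in `requests`.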

In [14]:
# `response` is the parsed list from the json.loads cell above; run that cell first
wine_df = pd.DataFrame(response)
wine_df.head()

NameError: name 'response' is not defined

In [15]:
pd.read_json(api_base_url)

Unnamed: 0,Color,Consumed In,Country,Grape,Name,Price,Region,Score,Vintage,Vinyard
0,W,2015,Portugal,,,,Portugal,4.0,2013,Vinho Verde
1,W,2015,France,,,17.8,France,3.0,2013,Peyruchet
2,W,2015,Oregon,,,20.0,Oregon,3.0,2013,Abacela
3,W,2015,Spain,chardonay,,7.0,Spain,2.5,2012,Ochoa
4,R,2015,US,"chiraz, cab",Spice Trader,6.0,,3.0,2012,Heartland
5,R,2015,US,cab,,13.0,California,3.5,2012,Crow Canyon
6,R,2015,US,,#14,21.0,Oregon,2.5,2013,Abacela
7,R,2015,France,"merlot, cab",,12.0,Bordeaux,3.5,2012,David Beaulieu
8,R,2015,France,"merlot, cab",,11.99,Medoc,3.5,2011,Chantemerle
9,R,2015,US,merlot,,13.0,Washington,4.0,2011,Hyatt


In [16]:
# alternatively:
wine_df = pd.read_json(api_response.text)
wine_df.head(2)

Unnamed: 0,Color,Consumed In,Country,Grape,Name,Price,Region,Score,Vintage,Vinyard
0,W,2015,Portugal,,,,Portugal,4,2013,Vinho Verde
1,W,2015,France,,,17.8,France,3,2013,Peyruchet
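
Newer pandas releases warn when a literal JSON string is passed to `pd.read_json`, so wrapping the text in `io.StringIO` is a safer habit. The sketch below uses two records copied from the response text shown earlier rather than a live request. It also shows one way to turn the `Price` strings into numbers, which is an optional extra step and not part of the original solution.

```python
from io import StringIO

import pandas as pd

# Two records copied from the response text shown earlier
sample_text = (
    '[{"Color":"W","Region":"Portugal","Country":"Portugal","Vintage":"2013",'
    '"Vinyard":"Vinho Verde","Name":"","Grape":"","Consumed In":"2015",'
    '"Score":"4","Price":""},'
    '{"Color":"W","Region":"France","Country":"France","Vintage":"2013",'
    '"Vinyard":"Peyruchet","Name":"","Grape":"","Consumed In":"2015",'
    '"Score":"3","Price":"17.8"}]'
)

# Wrap the JSON text in a file-like object before handing it to pandas
sample_df = pd.read_json(StringIO(sample_text))

# Empty strings cannot be parsed as numbers, so they become NaN
prices = pd.to_numeric(sample_df["Price"], errors="coerce")
print(prices)
```

With the live data, the same pattern would be `pd.read_json(StringIO(api_response.text))`.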


In [17]:
wine_df.iloc[4, :]
# the price is 6 for the 5th row.

Color                     R
Consumed In            2015
Country                  US
Grape           chiraz, cab
Name           Spice Trader
Price                     6
Region                     
Score                     3
Vintage                2012
Vinyard           Heartland
Name: 4, dtype: object

## Exercise 2: post data to Sheetsu

---

We've learned how to read data, but it'd be great if we could also write data. For this we will need to use a _POST_ request.

1. Use the post command to add the `post_data` to the spreadsheet.
- What status did you get? How can you check that you actually added the data correctly?
- In this exercise, your classmates are adding data to the same spreadsheet. What happens because of this? Is it a problem? How could you mitigate it?

In [18]:
post_data = {
'Grape' : ''
, 'Name' : "Aymeric's wine"
, 'Color' : 'R'
, 'Country' : 'US'
, 'Region' : 'Sonoma'
, 'Vinyard' : ''
, 'Score' : '10'
, 'Consumed In' : '2015'
, 'Vintage' : '1973'
, 'Price' : '200'
}

In [10]:
requests.post(api_base_url, data=post_data)

<Response [201]>

In [19]:
# 2. To check, we could send a GET request and inspect the last row.

In [20]:
# 3. The spreadsheet will fill up with duplicate rows. One way to mitigate
# this is through permissions; another is to write to a specific
# position, so that the row is overwritten each time.
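
Another possible mitigation is to fetch the current rows and skip the POST when an identical row is already present. The sketch below works on plain lists of dictionaries, so it runs without touching the spreadsheet. `row_exists` is a hypothetical helper, and the first sample row is a subset of the fields shown earlier.

```python
def row_exists(existing_rows, new_row):
    """Return True if some existing row matches every field of new_row."""
    return any(
        all(str(row.get(key, "")) == str(value) for key, value in new_row.items())
        for row in existing_rows
    )


# Rows as they might come back from a GET request (subset of fields)
existing_rows = [
    {"Name": "Spice Trader", "Vinyard": "Heartland", "Vintage": "2012", "Price": "6"},
]
new_row = {"Name": "Aymeric's wine", "Vinyard": "", "Vintage": "1973", "Price": "200"}

print(row_exists(existing_rows, new_row))  # False: safe to post

existing_rows.append(new_row)              # simulate the row having been added
print(row_exists(existing_rows, new_row))  # True: posting again would duplicate it
```

This check-then-post approach is not atomic: two classmates can both see "no duplicate" and then both post, so it reduces duplicates without fully preventing them.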

## Exercise 3: IMDB TV Shows

---

Sometimes an API doesn't provide all the information we would like to get and we need to be creative.

Here we will use a combination of scraping and API calls to find the ratings and networks of famous television shows.

### 3.A Get the top TV Shows

The Internet Movie Database contains data about movies and TV shows. Unfortunately it does not have a public API.

The page http://www.imdb.com/chart/toptv/?ref_=nv_tp_tv250_2 contains the list of the top 250 TV shows of all time. Retrieve the page using the requests library, then parse the HTML to obtain a list of the `movie_ids` for these shows. You can parse it with regular expressions or with a library like `BeautifulSoup`.

> **Hint:** movie_ids look like this: `tt2582802`
> _Everything after "/title/" and before "/?"_

In [19]:
def get_top_250():
    response = requests.get('http://www.imdb.com/chart/toptv/?ref_=nv_tp_tv250_2')
    html = response.text
    # capture whatever sits between "/title/" and the next "/" in each link
    entries = re.findall("<a href.*?/title/(.*?)/", html)
    # deduplicate the matches and return them as a list
    return list(set(entries))

In [20]:
entries = get_top_250()

In [21]:
len(entries)

251

In [22]:
entries[0]

'tt2017109'
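
The pattern `(.*?)` accepts any text after `/title/`, so a link without an ID can contribute a stray match, which is one possible reason for getting 251 entries instead of 250. A stricter sketch that only accepts `tt` followed by digits is shown here. The HTML snippet is a made-up stand-in for the real page, and `extract_imdb_ids` is an illustrative name.

```python
import re


def extract_imdb_ids(html):
    """Return unique IMDb IDs (tt + digits) in order of first appearance."""
    ids = re.findall(r"/title/(tt\d+)/", html)
    # dict.fromkeys removes duplicates while keeping the original order
    return list(dict.fromkeys(ids))


# Made-up markup: link text and ref parameters are placeholders
sample_html = '''
<a href="/title/tt0944947/?ref_=example_1">Game of Thrones</a>
<a href="/title/tt0944947/?ref_=example_2"><img src="poster.jpg"></a>
<a href="/title/tt2017109/?ref_=example_3">Placeholder title</a>
<a href="/title/?ref_=example_home">Titles</a>
'''

print(extract_imdb_ids(sample_html))  # ['tt0944947', 'tt2017109']
```

Unlike `set`, this keeps the chart order, which matters if the ranking is needed later.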

### 3.B Get data on the top shows

Although the Internet Movie Database does not have a public API, an open API exists at http://www.tvmaze.com/api.

Use this API to retrieve information about each of the 250 TV shows you have extracted in the previous step.
1. Check the documentation of TVmaze's API to learn how to request show data by ID.
- Define a function that returns a Python object with select information for a given ID.
    - Show name
    - Rating (avg)
    - Genre(s)
    - Network name
    - Premiere date
    - Status
> Tip: the JSON object can easily be converted into a Python dictionary.

- Store the gathered information in a Pandas Dataframe.


As the target information is in JSON format, you will need `json.loads(res.text)` (or the equivalent `res.json()`) to parse it.

Here's an example of the information and how we can interact with it.

In [23]:
# example url
res=requests.get('http://api.tvmaze.com/lookup/shows?imdb=tt0944947')

# status code
print(res.status_code)

200
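
The lookup URL can also be assembled from its parts instead of being written out by hand. This is a small sketch using `urllib.parse.urlencode` from the standard library; `TVMAZE_LOOKUP_URL` and `build_lookup_url` are names chosen here for illustration.

```python
from urllib.parse import urlencode

TVMAZE_LOOKUP_URL = "http://api.tvmaze.com/lookup/shows"


def build_lookup_url(imdb_id):
    """Build the lookup URL for one IMDb ID, escaping the query string."""
    return TVMAZE_LOOKUP_URL + "?" + urlencode({"imdb": imdb_id})


print(build_lookup_url("tt0944947"))
# http://api.tvmaze.com/lookup/shows?imdb=tt0944947
```

`requests` can do the same thing itself: `requests.get(TVMAZE_LOOKUP_URL, params={'imdb': 'tt0944947'})` builds the query string from the dictionary.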


In [24]:
# just the contents of the name element
# print json.loads(res.text).get('name')
res.json()['name']

# entire contents
# print json.loads(res.text)

'Game of Thrones'

In [26]:
def get_entry(entries):
    shows_df = pd.DataFrame(
        columns=['show_name', 'rating_avg', 'genres', 'network', 'premiere_date', 'status'])
    for entry in entries:
        res = requests.get('http://api.tvmaze.com/lookup/shows?imdb=' + entry)
        if res.status_code == 200:
            # parse only successful responses; an error body may not be valid JSON
            jsont = json.loads(res.text)
            try:
                status = jsont.get('status')
            except AttributeError:
                status = 'NA'
            try:
                rating = jsont.get('rating').get('average')
            except AttributeError:
                rating = 'NA'

            try:
                network = jsont.get('network').get('name')
            except AttributeError:
                network = 'NA'

            try:
                title = jsont.get('name')
            except AttributeError:
                title = 'NA'

            try:
                premier = jsont.get('premiered')
            except AttributeError:
                premier = 'NA'

            try:
                genres = jsont.get('genres')
            except AttributeError:
                genres = 'NA'
            shows_df.loc[len(shows_df)] = [title, rating,
                                           genres, network, premier, status]
    return shows_df

In [27]:
# Function to pull information from the API, converting the JSON response into a Python dictionary
def get_entry(entries):
    shows_df= pd.DataFrame( columns = ['show_name', 'rating_avg', 'genres', 'network', 'premiere_date', 'status'])
    for entry in entries:
        res=requests.get('http://api.tvmaze.com/lookup/shows?imdb='+entry)
        if res.status_code == 200:
            results = json.loads(res.text)

            # KeyError covers a missing key; TypeError covers indexing into a null value
            try:
                status = results['status']
            except (KeyError, TypeError):
                status = 'NA'
            try:
                rating = results['rating']['average']
            except (KeyError, TypeError):
                rating = 'NA'
            try:
                network = results['network']['name']
            except (KeyError, TypeError):
                network = 'NA'
            try:
                title = results['name']
            except (KeyError, TypeError):
                title = 'NA'
            try:
                genres = results['genres']
            except (KeyError, TypeError):
                genres = 'NA'
            try:
                premier = results['premiered']
            except (KeyError, TypeError):
                premier = 'NA'
            shows_df.loc[len(shows_df)] = [title, rating, genres, network, premier, status]
    return shows_df

In [30]:
# In both functions we look up specific elements. If an element is missing (for
# example, a null `network`), the lookup raises an exception, hence the try/except blocks.
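
The repeated try/except blocks can be avoided by falling back to an empty dictionary whenever a nested element is missing or null. The sketch below is one way to do that, not the lab's reference solution. `extract_show_fields` is a hypothetical helper and `sample_show` is invented data shaped like the fields used above.

```python
def extract_show_fields(show):
    """Pick the fields of interest from one show dictionary.

    Missing or null nested elements yield None instead of raising.
    """
    rating = show.get("rating") or {}
    network = show.get("network") or {}
    return {
        "show_name": show.get("name"),
        "rating_avg": rating.get("average"),
        "genres": show.get("genres") or [],
        "network": network.get("name"),
        "premiere_date": show.get("premiered"),
        "status": show.get("status"),
    }


# Invented record with a null network and a null average rating
sample_show = {
    "name": "Example Show",
    "rating": {"average": None},
    "genres": ["Drama"],
    "network": None,
    "premiered": "2015-04-26",
    "status": "Running",
}

print(extract_show_fields(sample_show))
```

Returning `None` rather than the string `'NA'` lets pandas treat the value as missing, so a numeric column such as `rating_avg` can stay numeric.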

In [31]:
shows_df = get_entry(entries)

In [32]:
shows_df.head()

Unnamed: 0,show_name,rating_avg,genres,network,premiere_date,status
0,Chef's Table,8.8,[Food],,2015-04-26,Running
1,Samurai Champloo,8.0,"[Comedy, Action, Adventure, Anime]",Fuji TV,2004-05-20,Ended
2,South Park,8.7,[Comedy],Comedy Central,1997-08-13,Running
3,Better Call Saul,8.7,"[Drama, Crime, Legal]",AMC,2015-02-08,Running
4,Westworld,8.8,"[Drama, Science-Fiction, Western]",HBO,2016-10-02,Running


In [33]:
shows_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 150 entries, 0 to 149
Data columns (total 6 columns):
show_name        150 non-null object
rating_avg       139 non-null float64
genres           150 non-null object
network          150 non-null object
premiere_date    150 non-null object
status           150 non-null object
dtypes: float64(1), object(5)
memory usage: 8.2+ KB
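
The summary lists 150 rows although 251 IDs were requested. Only responses with status 200 produce a row, so any other status silently drops a show; this could happen for an ID the API cannot match or, if the server throttles requests, for a rate-limit response. The sketch below retries throttled requests, assuming throttling is signalled with HTTP 429 (Too Many Requests). The `fetch` callable and `fetch_with_retry` are hypothetical, and a fake fetcher stands in for the network so the example runs offline.

```python
import time


def fetch_with_retry(fetch, url, max_attempts=3, wait_seconds=1.0, sleep=time.sleep):
    """Call fetch(url) until the status is not 429 or attempts run out.

    fetch must return a (status_code, body) tuple. The pause grows
    linearly with the attempt number.
    """
    for attempt in range(1, max_attempts + 1):
        status_code, body = fetch(url)
        if status_code != 429:
            return status_code, body
        if attempt < max_attempts:
            sleep(wait_seconds * attempt)
    return status_code, body


# Fake fetcher: throttled on the first call, successful on the second
responses = iter([(429, ""), (200, '{"name": "Game of Thrones"}')])


def fake_fetch(url):
    return next(responses)


waits = []  # record the pauses instead of actually sleeping
result = fetch_with_retry(
    fake_fetch,
    "http://api.tvmaze.com/lookup/shows?imdb=tt0944947",
    sleep=waits.append,
)
print(result)  # (200, '{"name": "Game of Thrones"}')
print(waits)   # [1.0]
```

In the notebook, `fetch` could be a small wrapper around `requests.get` that returns `(res.status_code, res.text)`.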
