# Lecture 5 - Data Acquisition, Web Scraping and Web APIs *

# Table of Contents
* [Lecture 5 - Data Acquisition, Web Scraping and Web APIs *](#Lecture-5---Data-Acquisition,-Web-Scraping-and-Web-APIs-*)
	* &nbsp;
		* [Content](#Content)
		* [Learning Outcomes](#Learning-Outcomes)
* [Data Acquisition](#Data-Acquisition)
* [1. Web scraping](#1.-Web-scraping)
	* [HTML](#HTML)
		* &nbsp;
			* [What is HTML?](#What-is-HTML?)
	* [Intro to Web Scraping](#Intro-to-Web-Scraping)
		* [--- WARNING ---](#----WARNING----)
	* [2. Web APIs](#2.-Web-APIs)
		* [REST](#REST)
		* [JSON](#JSON)
		* [Forming an API query](#Forming-an-API-query)
	* [Proprietary API Wrapper Modules](#Proprietary-API-Wrapper-Modules)
	* [API Repositories and Market Places](#API-Repositories-and-Market-Places)


---
* Some material on web scraping and usage of APIs adapted from Kevin Markham's data science courses at https://github.com/justmarkham

### Content

1. Data gathering via web scraping
2. HTML basics
3. Data gathering via web APIs
4. JSON file format

### Learning Outcomes

At the end of this lecture, you should be able to:

* list the different dynamic sources of data
* explain what HTML is and its basic structure
* make HTTP requests using Python
* traverse the HTML document tree
* perform web scraping at an introductory level
* describe and process the JSON file format
* perform rudimentary data acquisition using Web APIs



---

# Data Acquisition

So far, we have looked at how we can acquire data from pre-prepared Excel and text files in the CSV format. We also saw how we can use the pandas clipboard facility to paste data and build data frames.

We also experienced that much of the data does not come in tidy formats that are prepared and ready for data analysis. For this we learned a number of techniques that help us to wrangle and tidy our data into shape. 

Now we are going to look at two additional sources of data that are dynamic and will require the combination of all the techniques we learned previously, such as wrangling, merging, aggregation, as well as some new skills. 

It is becoming common these days that data is acquired from multiple sources and merged into a single dataset. The data sources that are increasingly becoming the backbone of many analytics and information systems are web based.

This section considers how data can be read (scraped) from web pages (HTML documents), and how data can be retrieved from web servers using their application programming interfaces (APIs).

# 1. Web scraping

Often when we need to acquire data, web pages are a great resource to turn to. Many websites make data available on their web pages for viewing in a browser, but do not make it conveniently downloadable in an easily machine-readable format such as JSON, CSV, or XML. Because of this, we sometimes need to employ web scraping techniques.

The term "web scraping" refers to using an application or script to process HTML pages in order to extract the data embedded in them for further manipulation.

Web scraping applications in effect simulate a person viewing a website with a browser.

Our task then becomes writing scripts that can traverse the structure of HTML documents and locate the particular piece of data we need.

## HTML

#### What is HTML?

HTML is a markup language (not a programming language) for describing web documents (web pages).

    HTML stands for Hyper Text Markup Language
    A markup language is a set of markup tags
    HTML documents are described by HTML tags
    Each HTML tag describes different document content

HTML pages consist of elements. Elements are marked up by tags, and tags may carry attributes that give additional information about the element, such as an identifier or how the content should be rendered by web browsers. The initial declaration (`<!DOCTYPE html>`) specifies the type of the document so that browsers render the content correctly.
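To make the terms element, tag and attribute concrete, here is a minimal sketch that lists every start tag and its attributes in a tiny hand-written snippet. It uses the `html.parser` module from Python's standard library rather than BeautifulSoup, and is written in Python 3 syntax. The snippet and the class name `TagLister` are made up for illustration.

```python
from html.parser import HTMLParser

class TagLister(HTMLParser):
    """Collects a (tag, attributes) pair for every start tag encountered."""
    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) tuples
        self.found.append((tag, dict(attrs)))

# One 'p' element containing one 'a' element; each tag has a single attribute
snippet = '<p id="intro">Hello <a href="http://example.com">world</a></p>'

lister = TagLister()
lister.feed(snippet)
for tag, attrs in lister.found:
    print(tag, attrs)
```

Each printed line pairs a tag name with a dictionary of its attributes, which is the same information we later use to locate elements.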

Please refer to http://www.w3schools.com/html/html_intro.asp for an introduction to HTML.

The examples below will show how we can perform web scraping on HTML pages using a Python package called `BeautifulSoup`. 

BeautifulSoup is an HTML/XML parser for Python that can turn markup text into a parse tree, that can then be traversed more easily.

In [1]:
from IPython.core.display import HTML
HTML("<iframe src=http://www.crummy.com/software/BeautifulSoup/bs4/doc/ width=1100 height=500></iframe>")

BeautifulSoup provides a simplified, idiomatic way of navigating, searching, and modifying the parse tree generated from HTML and XML.

More info on BeautifulSoup http://www.crummy.com/software/BeautifulSoup/bs3/documentation.html

Good examples of how this is done can be found at http://www.gregreda.com/2013/03/03/web-scraping-101-with-python/ and http://blog.miguelgrinberg.com/post/easy-web-scraping-with-python

## Intro to Web Scraping

We are going to begin with a toy example, using the simple HTML page created below:

In [2]:
# imports
import requests                 # How Python gets the webpages
from bs4 import BeautifulSoup   # Creates structured, searchable object
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import json

In [3]:
# First, let's read the toy webpage as a string - this is what happens initially when you scrape any webpage
html_doc = """
<!doctype html>
<html lang="en">
<head>
  <title>Teo's Webpage</title>
</head>

<body>
  <h1>Teo's Webpage</h1>
  <p id="intro">My name is Teo.  I find web scraping interesting.</p>
  <p id="background">I live in Auckland and completed my PhD at Massey University in Computer Science, while studying the field of machine learning.</p>
  <p id="current">I currently work as a lecturer in Information Technology.</p>
  
  <h3>My Interests</h3>
  <ul>
      <li id="my favorite">Data Science and Machine Learning</li>
      <li class="hobby">Tennis</li>
      <li class="hobby">Reading</li>
      <li class="hobby">Travelling</li>
      <li class="hobby">Running</li>
  </ul>
</body>
</html>
"""
type(html_doc)

str

In [67]:
# Beautiful soup allows us to create structure from the html elements, and to traverse it
page = BeautifulSoup(html_doc, "lxml")
print type(page)
page

<class 'bs4.BeautifulSoup'>


<!DOCTYPE html>\n<html lang="en">\n<head>\n<title>Teo's Webpage</title>\n</head>\n<body>\n<h1>Teo's Webpage</h1>\n<p id="intro">My name is Teo.  I find web scraping interesting.</p>\n<p id="background">I live in Auckland and completed my PhD at Massey University in Computer Science, while studying the field of machine learning.</p>\n<p id="current">I currently work as a lecturer in Information Technology.</p>\n<h3>My Interests</h3>\n<ul>\n<li id="my favorite">Data Science and Machine Learning</li>\n<li class="hobby">Tennis</li>\n<li class="hobby">Reading</li>\n<li class="hobby">Travelling</li>\n<li class="hobby">Running</li>\n</ul>\n</body>\n</html>\n

In [6]:
# The most useful methods in a Beautiful Soup object are "find" and "findAll".
# "find" takes several parameters, the most important are "name" and "attrs".
# name will help us find the type of an element
# Let's target "name".
page.find(name='body') # Finds the 'body' tag and everything inside of it.

<body>\n<h1>Teo's Webpage</h1>\n<p id="intro">My name is Teo.  I find web scraping interesting.</p>\n<p id="background">I live in Auckland and completed my PhD at Massey University in Computer Science, while studying the field of machine learning.</p>\n<p id="current">I currently work as a lecturer in Information Technology.</p>\n<h3>My Interests</h3>\n<ul>\n<li id="my favorite">Data Science and Machine Learning</li>\n<li class="hobby">Tennis</li>\n<li class="hobby">Reading</li>\n<li class="hobby">Travelling</li>\n<li class="hobby">Running</li>\n</ul>\n</body>

In [7]:
body = page.find(name='body')
type(body) #element.Tag

bs4.element.Tag

The above result tells us that 'body' element was found in the HTML page, and it tells us what object type it is. When the find fails, then this is what we get:

In [8]:
body = page.find(name='bodyyy')
type(body) # NoneType: 'find' returns None when nothing matches

NoneType

We can see the contents of the 'body' element below:

In [9]:
body = page.find(name='body')
body.contents

[u'\n',
 <h1>Teo's Webpage</h1>,
 u'\n',
 <p id="intro">My name is Teo.  I find web scraping interesting.</p>,
 u'\n',
 <p id="background">I live in Auckland and completed my PhD at Massey University in Computer Science, while studying the field of machine learning.</p>,
 u'\n',
 <p id="current">I currently work as a lecturer in Information Technology.</p>,
 u'\n',
 <h3>My Interests</h3>,
 u'\n',
 <ul>\n<li id="my favorite">Data Science and Machine Learning</li>\n<li class="hobby">Tennis</li>\n<li class="hobby">Reading</li>\n<li class="hobby">Travelling</li>\n<li class="hobby">Running</li>\n</ul>,
 u'\n']

We can recursively search for other elements inside the returned result as well:

In [10]:
h1 = body.find(name='h1') # Find the 'h1' element inside of the 'body' tag
print h1

<h1>Teo's Webpage</h1>


In [11]:
print h1.text

Teo's Webpage


Notice how we can access the entire element or just the content. 

Now let's find the 'p' elements:

In [12]:
p = page.find(name='p')
# This only finds one.  This is where 'findAll' comes in.
print p

<p id="intro">My name is Teo.  I find web scraping interesting.</p>


We can also do a search of all instances of an element:

In [13]:
all_p = page.findAll(name='p')
print all_p
print type(all_p) # Result sets are a lot like Python lists

[<p id="intro">My name is Teo.  I find web scraping interesting.</p>, <p id="background">I live in Auckland and completed my PhD at Massey University in Computer Science, while studying the field of machine learning.</p>, <p id="current">I currently work as a lecturer in Information Technology.</p>]
<class 'bs4.element.ResultSet'>


Access specific element with index:

In [14]:
print all_p[0]
print all_p[1]

<p id="intro">My name is Teo.  I find web scraping interesting.</p>
<p id="background">I live in Auckland and completed my PhD at Massey University in Computer Science, while studying the field of machine learning.</p>


In [15]:
# Iterable like a list
for one_p in all_p:
    print one_p.text # Print text

My name is Teo.  I find web scraping interesting.
I live in Auckland and completed my PhD at Massey University in Computer Science, while studying the field of machine learning.
I currently work as a lecturer in Information Technology.


Access specific attribute of a tag:

In [16]:
print all_p[0] # Specific element

<p id="intro">My name is Teo.  I find web scraping interesting.</p>


In [17]:
print all_p[0]['id'] # Specific attribute value of a specific element

intro


Now let's look at 'attrs'. Beautiful soup also allows us to locate elements with specific attributes:

In [18]:
print page.find(name='p', attrs={"id":"intro"})

<p id="intro">My name is Teo.  I find web scraping interesting.</p>


In [19]:
print page.find(name='p', attrs={"id":"background"})

<p id="background">I live in Auckland and completed my PhD at Massey University in Computer Science, while studying the field of machine learning.</p>


In [20]:
result = page.find(name='p', attrs={"id":"current"})
result.text

u'I currently work as a lecturer in Information Technology.'

We can also search for all instances of an element that have a given class name:

In [21]:
print page.findAll("li", "hobby")

[<li class="hobby">Tennis</li>, <li class="hobby">Reading</li>, <li class="hobby">Travelling</li>, <li class="hobby">Running</li>]


**Exercise:** Extract the 'h3' element from Teo's webpage.

In [69]:
h3 = page.find(name='h3')
# 'find' returns the first matching element; the page contains only one 'h3'.
print h3

<h3>My Interests</h3>


**Exercise:** Extract Teo's hobbies from the html_doc.  Print out the text of the hobby. 

In [70]:
hobbies = page.find(name='ul')
# 'find' returns the first 'ul' element, which contains all of the list items.
print hobbies

<ul>
<li id="my favorite">Data Science and Machine Learning</li>
<li class="hobby">Tennis</li>
<li class="hobby">Reading</li>
<li class="hobby">Travelling</li>
<li class="hobby">Running</li>
</ul>


**Exercise:** Extract Teo's hobby that has the id "my favorite".
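
One possible way to approach the last two exercises is sketched below, under a few assumptions: only the list from the toy page is repeated so that the example is self-contained, the names `toy_list` and `toy_page` are introduced here for illustration, Python 3 syntax is used, and the built-in `"html.parser"` is used in place of `"lxml"` so that no extra parser needs to be installed.

```python
from bs4 import BeautifulSoup

# The list from the toy page, repeated to keep the example self-contained
toy_list = """
<ul>
    <li id="my favorite">Data Science and Machine Learning</li>
    <li class="hobby">Tennis</li>
    <li class="hobby">Reading</li>
    <li class="hobby">Travelling</li>
    <li class="hobby">Running</li>
</ul>
"""
toy_page = BeautifulSoup(toy_list, "html.parser")

# Text of every 'li' element with the class 'hobby'
hobby_texts = [li.text for li in toy_page.find_all("li", "hobby")]
print(hobby_texts)

# The single 'li' element whose id is "my favorite"
favourite = toy_page.find("li", attrs={"id": "my favorite"})
print(favourite.text)
```
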

In order to illustrate HTML web scraping on a real-world site, we will look at a website that lists the up-to-date gold price found on http://www.gold.org, and which is refreshed every minute. 

We will attempt to read the asking price of gold from the HTML document.

In [22]:
from IPython.core.display import HTML
HTML("<iframe src=http://www.gold.org width=1100 height=500></iframe>")

The price we are interested in is found in the "ASK" row under the "Spot Price" section. 

In order to find where the price is situated in the HTML document, we must look at the document's source code. By right clicking on a page in a browser, an option should be displayed allowing you to view the source.

We must inspect the source so that we can find the element that houses this value. We can then use Python's BeautifulSoup package to **read and traverse through the HTML element tree** in order to extract the data that we want.

There are three basic steps to scraping a single page:

    1. Get (request) the page
    2. Parse the page content (read and interpret the document structure)
    3. Search through the content for the data of interest


Below is the example of a script that will access and display the latest gold price being traded:


In [73]:
#we first need to make some extra imports
import json
from time import sleep
from datetime import datetime

#you might need to set the proxies if you are doing this from Massey's domain
#if the below does not work, then try this: "http://get-proxy.massey.ac.nz/"
massey_proxies = {
  "http": "http://alb-cache1.massey.ac.nz/",
 "https": "http://alb-cache1.massey.ac.nz/",
}


#massey_proxies = ""

**STEP 1: GET** Access the page and read it into the beautiful soup object

In [74]:
url = "http://gold.org"
response = requests.get(url, proxies=massey_proxies)
response

<Response [200]>

### --- WARNING --- 

ALWAYS FIRST MAKE SURE THAT THE RESPONSE IS 200 - OTHERWISE YOU MIGHT HAVE AN ERROR, IN WHICH CASE YOU'D BE BEST TO STOP AND NOT TRY TO PROCESS THE DOCUMENT, SINCE THERE WILL BE NOTHING TO PROCESS
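
The check described in the warning can be sketched as a small guard function. This is only an illustration: the function name `content_if_ok` is made up, Python 3 syntax is used, and `SimpleNamespace` objects stand in for real `requests` responses so that the example runs without network access. The guard relies only on the `status_code` and `content` attributes, which `requests` responses also provide.

```python
from types import SimpleNamespace

def content_if_ok(response):
    """Return response.content when the status code is 200, otherwise None."""
    if response.status_code != 200:
        print("Request failed with status code", response.status_code)
        return None
    return response.content

# Stand-in objects with the two attributes used above (no network needed)
good = SimpleNamespace(status_code=200, content=b"<html></html>")
bad = SimpleNamespace(status_code=404, content=b"Not Found")

print(content_if_ok(good))
print(content_if_ok(bad))
```

With a real response, calling `response.raise_for_status()` is an alternative: it raises an exception for 4xx and 5xx status codes instead of returning `None`.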

In [75]:
page = response.content

In [76]:
page[:10000]

'<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML+RDFa 1.0//EN" "http://www.w3.org/MarkUp/DTD/xhtml-rdfa-1.dtd">\n<!--[if (IE 6) & (!IEMobile)]> <html xmlns="http://www.w3.org/1999/xhtml" version="XHTML+RDFa 1.0" class="no-js lt-ie10 lt-ie9 lt-ie8 lt-ie7 ie6" lang="en"  xmlns:wb="http://open.weibo.com/wb" xml:lang="en" dir="ltr"\n  xmlns:fb="http://ogp.me/ns/fb#"\n  xmlns:og="http://ogp.me/ns#"\n  xmlns:article="http://ogp.me/ns/article#"\n  xmlns:book="http://ogp.me/ns/book#"\n  xmlns:profile="http://ogp.me/ns/profile#"\n  xmlns:video="http://ogp.me/ns/video#"> <![endif]-->\n<!--[if (IE 7) & (!IEMobile)]> <html xmlns="http://www.w3.org/1999/xhtml" version="XHTML+RDFa 1.0" class="no-js lt-ie10 lt-ie9 lt-ie8 ie7" lang="en"  xmlns:wb="http://open.weibo.com/wb" xml:lang="en" dir="ltr"\n  xmlns:fb="http://ogp.me/ns/fb#"\n  xmlns:og="http://ogp.me/ns#"\n  xmlns:article="http://ogp.me/ns/article#"\n  xmlns:book="http://ogp.me/ns/book#"\n  xmlns:profile="http://ogp.me/ns/profile#"\n  xmlns:video="htt

**STEP 2: PARSE** Create a BeautifulSoup object that reads and parses the HTML page into a format that we can search and traverse.

In [77]:
scraping = BeautifulSoup(page, "lxml") 

In [78]:
scraping

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML+RDFa 1.0//EN" "http://www.w3.org/MarkUp/DTD/xhtml-rdfa-1.dtd">\n<!--[if (IE 6) & (!IEMobile)]> <html xmlns="http://www.w3.org/1999/xhtml" version="XHTML+RDFa 1.0" class="no-js lt-ie10 lt-ie9 lt-ie8 lt-ie7 ie6" lang="en"  xmlns:wb="http://open.weibo.com/wb" xml:lang="en" dir="ltr"\n  xmlns:fb="http://ogp.me/ns/fb#"\n  xmlns:og="http://ogp.me/ns#"\n  xmlns:article="http://ogp.me/ns/article#"\n  xmlns:book="http://ogp.me/ns/book#"\n  xmlns:profile="http://ogp.me/ns/profile#"\n  xmlns:video="http://ogp.me/ns/video#"> <![endif]--><!--[if (IE 7) & (!IEMobile)]> <html xmlns="http://www.w3.org/1999/xhtml" version="XHTML+RDFa 1.0" class="no-js lt-ie10 lt-ie9 lt-ie8 ie7" lang="en"  xmlns:wb="http://open.weibo.com/wb" xml:lang="en" dir="ltr"\n  xmlns:fb="http://ogp.me/ns/fb#"\n  xmlns:og="http://ogp.me/ns#"\n  xmlns:article="http://ogp.me/ns/article#"\n  xmlns:book="http://ogp.me/ns/book#"\n  xmlns:profile="http://ogp.me/ns/profile#"\n  xmlns:video="http:/

Now we can search for a given tag, id or class name.

**STEP 3: SEARCH** Search through the page for 'dd' type tags with the class name 'value':

In [29]:
element = scraping.find("dd", attrs={"class" : "value"})
element

<dd class="value">1,345.35</dd>

Once we have found the tag we want, we extract the contents of it by calling .contents and optionally convert it into a float.

In [30]:
print float(str(element.contents[0]).replace(',', ''))

1345.35


As it turns out, there are multiple elements in the document with this combination of tag and class name.

If we re-run the search from before and ask for all results to be returned that match our criteria, this is what we get:

In [31]:
element = scraping.find_all("dd", attrs={"class" : "value"})
element

[<dd class="value">1,345.35</dd>,
 <dd class="value">1,345.15</dd>,
 <dd class="value">1,344.95</dd>,
 <dd class="value">1,345.35</dd>,
 <dd class="value">1,345.15</dd>,
 <dd class="value">1,344.95</dd>]

Our previous scrape worked because the value of interest was the first one. But say we would now like to scrape the mid price at the top of the page.

We would first need to go back to the webpage and extract the section which houses the target element, which should then make it easier to extract the actual element we are after. In our case, we will extract the 'div' section with the class 'asset mid' (there may, however, be a shortcut to the example below).

In [32]:
element2 = scraping.find("div", attrs={"class" : "asset mid"})

In [33]:
print element2

<div class="asset mid">
<p class="heading">Mid</p>
<dl>
<dt class="accessibility">Value</dt>
<dd class="value">1,345.15</dd>
<dt class="accessibility">Variation</dt>
<dd class="variation minus">
<span class="icon minus" role="presentation"></span>
</dd>
</dl>
</div>


In [34]:
element2.find("dd", "value").contents

[u'1,345.15']

**Exercise:** Scrape the bid price from the web page and convert it into a float.

In [87]:
element3 = scraping.find("div", attrs={"class" : "asset bid"})
print element3
bid = element3.find("dd", "value").contents
print bid[0]

print float(str(bid[0]).replace(',', ''))

<div class="asset bid">
<p class="heading">Bid</p>
<dl>
<dt class="accessibility">Value</dt>
<dd class="value">1,344.95</dd>
<dt class="accessibility">Variation</dt>
<dd class="variation minus">
<span class="icon minus" role="presentation"></span>
</dd>
</dl>
</div>
1,344.95
1344.95


Below is an example of how we might write a script that continually extracts data from a page every 1-2 seconds:

In [35]:
def GetGoldPrice():
    url = "http://gold.org"
    response = requests.get(url, proxies=massey_proxies)
    page = response.content
    #create a BeautifulSoup object that reads in the HTML page
    #(naming the parser explicitly avoids BeautifulSoup's "no parser specified" warning)
    scraping = BeautifulSoup(page, "lxml")
    #search through the page for 'dd' type tags with the class name 'value'
    element = scraping.find("dd", "value")
    #access the contents inside the tags
    price = element.contents[0].string
    return price

for x in range(0,10):
    time_now = datetime.now().strftime("%I:%M:%S%p")
    print("{0}, Gold price is: {1} \n ".format(time_now, GetGoldPrice()))
    sleep(0.01)


03:05:39PM, Gold price is: 1,345.35 
 
03:05:40PM, Gold price is: 1,345.35 
 
03:05:42PM, Gold price is: 1,345.35 
 
03:05:44PM, Gold price is: 1,345.35 
 
03:05:46PM, Gold price is: 1,345.35 
 
03:05:47PM, Gold price is: 1,345.35 
 
03:05:52PM, Gold price is: 1,345.35 
 
03:05:54PM, Gold price is: 1,345.35 
 
03:05:56PM, Gold price is: 1,345.35 
 
03:05:57PM, Gold price is: 1,345.35 
 


**Exercise**: Extract the current FTSE 100 stock market index from the Google Finance page http://www.google.com/finance

In [115]:
url = "http://www.google.com/finance"
response = requests.get(url, proxies=massey_proxies)
response

<Response [200]>

In [116]:
page = response.content
page[:10000]

'<!DOCTYPE html><html><head><script>(function(){(function(){function e(a){this.t={};this.tick=function(a,c,b){var d=void 0!=b?b:(new Date).getTime();this.t[a]=[d,c];if(void 0==b)try{window.console.timeStamp("CSI/"+a)}catch(e){}};this.tick("start",null,a)}var a,d;window.performance&&(d=(a=window.performance.timing)&&a.responseStart);var f=0<d?new e(d):new e;window.jstiming={Timer:e,load:f};if(a){var c=a.navigationStart;0<c&&d>=c&&(window.jstiming.srt=d-c)}if(a){var b=window.jstiming.load;0<c&&d>=c&&(b.tick("_wtsrt",void 0,c),b.tick("wtsrt_","_wtsrt",\nd),b.tick("tbsd_","wtsrt_"))}try{a=null,window.chrome&&window.chrome.csi&&(a=Math.floor(window.chrome.csi().pageT),b&&0<c&&(b.tick("_tbnd",void 0,window.chrome.csi().startE),b.tick("tbnd_","_tbnd",c))),null==a&&window.gtbExternal&&(a=window.gtbExternal.pageT()),null==a&&window.external&&(a=window.external.pageT,b&&0<c&&(b.tick("_tbnd",void 0,window.external.startE),b.tick("tbnd_","_tbnd",c))),a&&(window.jstiming.pt=a)}catch(g){}})();})();\

In [117]:
scraping = BeautifulSoup(page, "lxml") 
scraping

<!DOCTYPE html>\n<html><head><script>(function(){(function(){function e(a){this.t={};this.tick=function(a,c,b){var d=void 0!=b?b:(new Date).getTime();this.t[a]=[d,c];if(void 0==b)try{window.console.timeStamp("CSI/"+a)}catch(e){}};this.tick("start",null,a)}var a,d;window.performance&&(d=(a=window.performance.timing)&&a.responseStart);var f=0<d?new e(d):new e;window.jstiming={Timer:e,load:f};if(a){var c=a.navigationStart;0<c&&d>=c&&(window.jstiming.srt=d-c)}if(a){var b=window.jstiming.load;0<c&&d>=c&&(b.tick("_wtsrt",void 0,c),b.tick("wtsrt_","_wtsrt",\nd),b.tick("tbsd_","wtsrt_"))}try{a=null,window.chrome&&window.chrome.csi&&(a=Math.floor(window.chrome.csi().pageT),b&&0<c&&(b.tick("_tbnd",void 0,window.chrome.csi().startE),b.tick("tbnd_","_tbnd",c))),null==a&&window.gtbExternal&&(a=window.gtbExternal.pageT()),null==a&&window.external&&(a=window.external.pageT,b&&0<c&&(b.tick("_tbnd",void 0,window.external.startE),b.tick("tbnd_","_tbnd",c))),a&&(window.jstiming.pt=a)}catch(g){}})();})();

In [120]:
element_FTSE100 = scraping.find(attrs={"id":"ref_12590587_1"})
print element_FTSE100

None


In [103]:
element_FTSE100.text

AttributeError: 'NoneType' object has no attribute 'text'
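
The `AttributeError` above arises because `find` returned `None`, and `None` has no `.text` attribute. One possible cause, offered only as a guess, is that the ids in the `quotes` table printed later in this section end in the letter `l` (for example `ref_7521596_l`), whereas the search above ends in the digit `1`. Whatever the cause, it is safer to test the result before using it. The sketch below uses a made-up one-cell table and a hypothetical helper name `text_or_none`, in Python 3 syntax, with the built-in `"html.parser"`.

```python
from bs4 import BeautifulSoup

# Made-up markup, for illustration only
html = '<table><tr><td id="price">6,893.92</td></tr></table>'
soup = BeautifulSoup(html, "html.parser")

def text_or_none(soup, element_id):
    """Return the text of the element with the given id, or None if absent."""
    element = soup.find(attrs={"id": element_id})
    if element is None:
        return None
    return element.text

print(text_or_none(soup, "price"))       # id exists
print(text_or_none(soup, "no_such_id"))  # id does not exist
```
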

We can also read in entire HTML tables into dataframe objects:

In [94]:
scraping_html_table = BeautifulSoup(response.content, "lxml")

In [95]:
scraping_html_table_FTSE100 = scraping_html_table.find_all("table", "quotes")
scraping_html_table_FTSE100

[<table class="quotes" width="100%"><tbody><tr>\n<td class="symbol"><a href="/finance?q=SHA:000001&amp;ei=Cd2zV9nlGtec0QSqmongCg">Shanghai\n</a></td><td class="price"><span id="ref_7521596_l">3,104.09</span>\n</td><td class="change"><span class="chr" id="ref_7521596_c">-5.95</span> <span class="chr" id="ref_7521596_cp">(-0.19%)</span></td></tr><tr>\n<td class="symbol"><a href="/finance?q=INDEXNIKKEI:NI225&amp;ei=Cd2zV9nlGtec0QSqmongCg">Nikkei 225\n</a></td><td class="price"><span id="ref_15513676_l">16,679.89</span>\n</td><td class="change"><span class="chg" id="ref_15513676_c">+83.38</span> <span class="chg" id="ref_15513676_cp">(0.50%)</span></td></tr><tr>\n<td class="symbol"><a href="/finance?q=INDEXHANGSENG:HSI&amp;ei=Cd2zV9nlGtec0QSqmongCg">Hang Seng Index\n</a></td><td class="price"><span id="ref_13414271_l">22,964.85</span>\n</td><td class="change"><span class="chg" id="ref_13414271_c">+54.01</span> <span class="chg" id="ref_13414271_cp">(0.24%)</span></td></tr><tr>\n<td class="

In [96]:
df = pd.read_html(str(scraping_html_table_FTSE100))
df[0]

Unnamed: 0,0,1,2
0,Shanghai\n,"3,104.09\n",-5.95 (-0.19%)
1,Nikkei 225\n,"16,679.89\n",+83.38 (0.50%)
2,Hang Seng Index\n,"22,964.85\n",+54.01 (0.24%)
3,TSEC\n,"9,085.14\n",-25.22 (-0.28%)
4,FTSE 100\n,"6,893.92\n",-47.27 (-0.68%)
5,EURO STOXX 50\n,"3,016.19\n",-30.46 (-1.00%)
6,CAC 40\n,"4,460.44\n",-37.42 (-0.83%)
7,S&P TSX\n,"14,703.44\n",-73.58 (-0.50%)
8,S&P/ASX 200\n,"5,527.30\n",-4.70 (-0.08%)
9,BSE Sensex\n,"28,061.79\n",-2.82 (-0.01%)


**Exercise**: Extract the second table from the wikipedia page (https://en.wikipedia.org/wiki/Richter_magnitude_scale) on earthquake magnitudes giving the approximate energy equivalents in terms of TNT explosive force:

In [109]:
url = "https://en.wikipedia.org/wiki/Richter_magnitude_scale"
response = requests.get(url, proxies=massey_proxies)
response

<Response [200]>

In [111]:
page = response.content
scraping = BeautifulSoup(page, "lxml")
# find all tables with the class name 'wikitable'
scraping_html_table_mag = scraping.find_all("table", "wikitable")
scraping_html_table_mag

[<table class="wikitable">\n<tr>\n<th>Magnitude</th>\n<th>Description</th>\n<th><a href="/wiki/Mercalli_intensity_scale" title="Mercalli intensity scale">Mercalli intensity</a></th>\n<th>Average earthquake effects</th>\n<th>Average frequency of occurrence (estimated)</th>\n</tr>\n<tr>\n<td>1.0\u20131.9</td>\n<td><a href="/wiki/Microearthquake" title="Microearthquake">Micro</a></td>\n<td>I</td>\n<td>Microearthquakes, not felt, or felt rarely. Recorded by seismographs.<sup class="reference" id="cite_ref-16"><a href="#cite_note-16">[16]</a></sup></td>\n<td>Continual/several million per year</td>\n</tr>\n<tr>\n<td>2.0\u20132.9</td>\n<td rowspan="2">Minor</td>\n<td>I to II</td>\n<td>Felt slightly by some people. No damage to buildings.</td>\n<td>Over one million per year</td>\n</tr>\n<tr>\n<td>3.0\u20133.9</td>\n<td>III to IV</td>\n<td>Often felt by people, but very rarely causes damage. Shaking of indoor objects can be noticeable.</td>\n<td>Over 100,000 per year</td>\n</tr>\n<tr>\n<td>4.0\

In [114]:
df = pd.read_html(str(scraping_html_table_mag))
df[1]

Unnamed: 0,0,1,2,3
0,Approximate magnitude,Approximate TNT equivalent for\nseismic energy...,Joule equivalent,Example
1,0.0,15 g,63 kJ,
2,0.2,30 g,130 kJ,Large hand grenade
3,1.5,2.7\xa0kg,11 MJ,Seismic impact of typical small construction b...
4,2.1,21\xa0kg,89 MJ,West fertilizer plant explosion[22]
5,3.0,480\xa0kg,2.0 GJ,"Oklahoma City bombing, 1995"
6,3.5,2.7 metric tons,11 GJ,"PEPCON fuel plant explosion, Henderson, Nevada..."
7,3.87,9.5 metric tons,40 GJ,"Explosion at Chernobyl nuclear power plant, 1986"
8,3.91,11 metric tons,46 GJ,Massive Ordnance Air Blast bomb
9,6.0,15 kilotons,63 TJ,Approximate yield of the Little Boy atomic bom...


## 2. Web APIs

Web servers serve out web pages in the HTML format as they are requested by users. Web servers are also capable of providing data that is not formatted in HTML. 

These web servers provide public (and private) APIs through which users can interact, construct queries that the web servers understand, and receive data from them. 

Depending on who owns them, web servers will have different APIs. They usually provide developer help pages that demonstrate how they work and how queries can be constructed using HTTP which the servers understand.

Many websites have public APIs providing data feeds via JSON or some other common format. We will consider only **JSON**, as it is becoming a standard and is, conveniently, virtually identical to Python's dictionaries in its syntax.

Increasingly, though, accessing these APIs requires registering for API keys, which act as **credentials**. Some of them are free and simply require that an account be created with a given website, while others must be purchased and have limits on the amount of data that can be pulled.

There are a number of ways to access these APIs. **REST** is becoming the most common mechanism. 

### REST

**REST is a lightweight mechanism built on top of the HTTP protocol** which enables applications to exchange data with servers.

A combination of HTTP requests, together with valid REST queries can easily be constructed from Python. One easy-to-use method is through the `requests` package (http://docs.python-requests.org).

Previously, using Web Services and SOAP would result in queries like:

Using REST, such clumsy queries can be transformed into simple HTTP requests of a format (1) like:

Or alternatively, passing arguments using format (2) as follows:

There are slight differences in what you can expect from the two formats. Format 1 (path segment parameter) will return a 404 error when the parameter value does not correspond to an existing resource. 

Format 2 uses optional parameters. Instead of an error, this format will return an empty list when the parameter is not found in the query result.

Example (this no longer works: the Echo Nest API was taken over by Spotify this year and has been deprecated):
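
The difference between the two formats can be illustrated by building both kinds of URL. This is a sketch only: the server `api.example.com`, the `artists` resource and the helper names are hypothetical, and the code uses Python 3's `urllib.parse`. No request is sent.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

BASE = "https://api.example.com/v1"  # hypothetical server, for illustration only

def path_style(artist_id):
    # Format 1: the identifier is a segment of the URL path
    return "{0}/artists/{1}".format(BASE, artist_id)

def query_style(**params):
    # Format 2: parameters are appended after '?' as key=value pairs
    return "{0}/artists?{1}".format(BASE, urlencode(params))

print(path_style(42))

url = query_style(name="Bob Dylan", results=1)
print(url)
# Decoding the query string recovers the parameters that were passed
print(parse_qs(urlsplit(url).query))
```

`urlencode` takes care of escaping characters such as spaces that are not allowed to appear literally in a URL.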

In [40]:
#echonest api - http://developer.echonest.com/index.html
url = "http://developer.echonest.com/api/v4/artist/reviews?api_key=YB4F9B7ZLS2YMOGUG&id=ARH6W4X1187B99274F&format=json&results=1&start=0"
response = requests.get(url, proxies=massey_proxies)

#we want HTTP Response 200
response

<Response [403]>

In [41]:
response_json = response.content
json.loads(response_json)

{u'response': {u'status': {u'code': 2,
   u'message': u'2|API key not allowed: "YB4F9B7ZLS2YMOGUG" is not allowed to call this method',
   u'version': u'4.2'}}}

### JSON

JSON (short for JavaScript Object Notation) has become one of the standard formats
for sending data by HTTP request between web servers and browsers and other applications. 

It is a much more flexible data format than a tabular text form like CSV. 

Here is an example:

In [122]:
# Triple-quoted strings can span multiple lines and contain unescaped double quotes.
obj = """
{"name": "Massey University",
"campuses_NZ": ["Albany", "Palmerston North", "Wellington"],
"campuses_international": null,
"colleges": [{"name": "Sciences", "degrees": 10, "majors": 30},
{"name": "Business", "degrees": 8, "majors": 25}]
}
"""
obj
#type(obj)


'\n{"name": "Massey University",\n"campuses_NZ": ["Albany", "Palmerston North", "Wellington"],\n"campuses_international": null,\n"colleges": [{"name": "Sciences", "degrees": 10, "majors": 30},\n{"name": "Business", "degrees": 8, "majors": 25}]\n}\n'

JSON is very nearly valid Python code with the exception of its null value `null` and
some other nuances (such as disallowing trailing commas at the end of lists). The basic
types are objects (dicts), arrays (lists), strings, numbers, booleans, and nulls. 

**All of the keys in an object must be strings**. There are several Python libraries for reading and
writing JSON data. We will use `json` here as it is built into the Python standard library. 
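
The differences from Python syntax mentioned above can be checked directly with the standard library's `json` module. The small inputs below are made up for illustration, and the code uses Python 3 syntax.

```python
import json

# JSON null becomes Python None
null_value = json.loads("null")
print(null_value is None)

# JSON booleans become Python True and False
print(json.loads("[true, false]"))

# A trailing comma is valid in a Python list but is rejected by the JSON parser
try:
    json.loads("[1, 2, 3,]")
    trailing_comma_rejected = False
except ValueError as err:
    trailing_comma_rejected = True
    print("Rejected:", err)

# Keys must be strings, so an integer key is converted when serialising
int_key_json = json.dumps({1: "one"})
print(int_key_json)
```
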

To convert (deserialize) a JSON string from above to an equivalent Python object (`dict`), use `json.loads`:

In [43]:
result = json.loads(obj)
result

{u'campuses_NZ': [u'Albany', u'Palmerston North', u'Wellington'],
 u'campuses_international': None,
 u'colleges': [{u'degrees': 10, u'majors': 30, u'name': u'Sciences'},
  {u'degrees': 8, u'majors': 25, u'name': u'Business'}],
 u'name': u'Massey University'}

`json.dumps` on the other hand converts a Python object back to JSON:

In [44]:
as_json = json.dumps(result)
as_json

'{"campuses_international": null, "campuses_NZ": ["Albany", "Palmerston North", "Wellington"], "colleges": [{"majors": 30, "degrees": 10, "name": "Sciences"}, {"majors": 25, "degrees": 8, "name": "Business"}], "name": "Massey University"}'

How you convert a JSON object or list of objects to a DataFrame or some other data
structure for analysis will be up to you. Conveniently, you can pass a list of JSON objects
to the DataFrame constructor and select a subset of the data fields:

In [45]:
massey_colleges = pd.DataFrame(result['colleges'], columns=['name', 'degrees'])
massey_colleges

Unnamed: 0,name,degrees
0,Sciences,10
1,Business,8


We can convert a data frame back to a JSON string with the following:

In [46]:
massey_colleges.to_json()

'{"name":{"0":"Sciences","1":"Business"},"degrees":{"0":10,"1":8}}'
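By default `to_json()` groups the values by column, mapping each column name to an `{index: value}` object, as the output above shows. If one record per row is more convenient, pandas offers the `orient='records'` argument to `to_json`. The sketch below instead shows, using only the standard library, how the column-oriented string can be rearranged into row records after the fact.

```python
import json

column_oriented = '{"name":{"0":"Sciences","1":"Business"},"degrees":{"0":10,"1":8}}'
columns = json.loads(column_oriented)

# Collect every row label used by any column and order them numerically.
labels = sorted(set(label for col in columns.values() for label in col), key=int)

# Build one dictionary per row label; a missing cell becomes None.
records = [
    dict((col_name, col.get(label)) for col_name, col in columns.items())
    for label in labels
]

print(records)
```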

### Forming an API query

Yahoo makes a weather forecast API available at https://developer.yahoo.com/weather/, with documentation on its usage at https://developer.yahoo.com/weather/documentation.html

It is important to study the documentation for each particular API as they are likely to be very different.

The following query retrieves the current wind conditions in Auckland:

In [47]:
import urllib2, urllib, json

baseurl = "https://query.yahooapis.com/v1/public/yql?"
# auckland is id 2348079
yql_query = "select wind from weather.forecast where woeid=2348079"
yql_url = baseurl + urllib.urlencode({'q':yql_query}) + "&format=json"
result = urllib2.urlopen(yql_url).read()
data = json.loads(result)

data

{u'query': {u'count': 1,
  u'created': u'2016-08-17T03:09:04Z',
  u'lang': u'en-US',
  u'results': {u'channel': {u'wind': {u'chill': u'57',
     u'direction': u'203',
     u'speed': u'7'}}}}}

In [48]:
print data['query']['results']

{u'channel': {u'wind': {u'direction': u'203', u'speed': u'7', u'chill': u'57'}}}
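To see what `urlencode` actually does to the query, the URL can be built and inspected without contacting the server. This is a small sketch: the helper name `build_yql_url` is made up for this example, and the import block is only there so that the code runs under both Python 2 and Python 3.

```python
try:
    # Python 2 locations
    from urllib import urlencode
    from urlparse import urlparse, parse_qs
except ImportError:
    # Python 3 locations
    from urllib.parse import urlencode, urlparse, parse_qs

BASE_URL = "https://query.yahooapis.com/v1/public/yql?"


def build_yql_url(yql_query):
    """Return the request URL for a YQL query, asking for JSON output.

    Hypothetical helper for illustration; it only builds a string.
    """
    # A list of pairs keeps the parameters in a fixed order.
    return BASE_URL + urlencode([("q", yql_query), ("format", "json")])


url = build_yql_url("select wind from weather.forecast where woeid=2348079")
print(url)

# Decoding the query string recovers the original parameters.
params = parse_qs(urlparse(url).query)
print(params["q"][0])
```

Spaces in the query become `+` and the `=` inside it becomes `%3D`, which keeps the query text from being confused with the `key=value` separators of the URL itself.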


The code below queries the current weather conditions in Auckland:

In [49]:
yql_query = "select item.condition from weather.forecast where woeid =2348079 and u='c' "
yql_url = baseurl + urllib.urlencode({'q':yql_query}) + "&format=json"
result = urllib2.urlopen(yql_url).read()
data = json.loads(result)

data

{u'query': {u'count': 1,
  u'created': u'2016-08-17T03:09:09Z',
  u'lang': u'en-US',
  u'results': {u'channel': {u'item': {u'condition': {u'code': u'34',
      u'date': u'Wed, 17 Aug 2016 02:00 PM NZST',
      u'temp': u'13',
      u'text': u'Mostly Sunny'}}}}}}
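API responses like this one are deeply nested, and the values arrive as strings. The sketch below works on a hand-copied portion of the response shown above, so it needs no network access; the helper `dig` is a made-up name for this example. It walks down a chain of keys and returns `None` instead of raising an error when a level is missing.

```python
# Hand-copied portion of the response shown above.
data = {
    "query": {
        "count": 1,
        "results": {
            "channel": {
                "item": {
                    "condition": {
                        "code": "34",
                        "date": "Wed, 17 Aug 2016 02:00 PM NZST",
                        "temp": "13",
                        "text": "Mostly Sunny",
                    }
                }
            }
        },
    }
}


def dig(mapping, *keys):
    """Follow keys into nested dicts; return None if any level is missing."""
    current = mapping
    for key in keys:
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


condition = dig(data, "query", "results", "channel", "item", "condition")
temperature = int(condition["temp"])   # numeric values arrive as strings
print("{0} degrees, {1}".format(temperature, condition["text"]))

# This response has no 'wind' entry, so the lookup yields None rather than a KeyError.
print(dig(data, "query", "results", "channel", "wind"))
```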

**Exercise:** Find out the ID for Wellington and search for the current weather conditions there:

In [124]:
import urllib2, urllib, json

baseurl = "https://query.yahooapis.com/v1/public/yql?"
# wellington is id 2351310
yql_query = "select wind from weather.forecast where woeid=2351310"
yql_url = baseurl + urllib.urlencode({'q':yql_query}) + "&format=json"
result = urllib2.urlopen(yql_url).read()
data = json.loads(result)

data

{u'query': {u'count': 1,
  u'created': u'2016-08-17T04:34:15Z',
  u'lang': u'en-US',
  u'results': {u'channel': {u'wind': {u'chill': u'48',
     u'direction': u'0',
     u'speed': u'18'}}}}}

## Proprietary API Wrapper Modules

Well-established companies sometimes publish modules in various programming languages that wrap their REST APIs and provide an easier interface for communicating with their servers.

Spotify is an example of a company whose API can be accessed through a Python module, `spotipy`. Some of their APIs are free-access and some require an account to be created with them first. Premium content can only be pulled from their servers using a paid Premium account.

In [50]:
!pip install spotipy

Collecting spotipy
  Downloading spotipy-2.3.8.tar.gz
Building wheels for collected packages: spotipy
  Running setup.py bdist_wheel for spotipy: started
  Running setup.py bdist_wheel for spotipy: finished with status 'done'
  Stored in directory: C:\Users\OEM\AppData\Local\pip\Cache\wheels\2b\5b\e8\05820c08321dafd920a5bfa63362476536dd77c98c49e169de
Successfully built spotipy
Installing collected packages: spotipy
Successfully installed spotipy-2.3.8


Example of a search query for Madonna:

In [51]:
import spotipy

sp = spotipy.Spotify()
results = sp.search(q='madonna', limit=20)
for i, t in enumerate(results['tracks']['items']):
    print(' ', i, t['name'])

(' ', 0, u'Madonna')
(' ', 1, u'Like A Prayer')
(' ', 2, u"Bitch I'm Madonna")
(' ', 3, u'Material Girl')
(' ', 4, u'Vogue')
(' ', 5, u'4 Minutes - feat. Justin Timberlake And Timbaland')
(' ', 6, u'Lady Madonna - Remastered 2015')
(' ', 7, u'Like A Virgin')
(' ', 8, u'Into The Groove')
(' ', 9, u"Bitch I'm Madonna")
(' ', 10, u'Like A Virgin')
(' ', 11, u'La Isla Bonita')
(' ', 12, u'Holiday')
(' ', 13, u'Like A Prayer')
(' ', 14, u'Holiday')
(' ', 15, u'Borderline')
(' ', 16, u"Bitch I'm Madonna")
(' ', 17, u'Crazy For You')
(' ', 18, u'Lucky Star')
(' ', 19, u'Hung Up')


Below is example code, adapted from https://github.com/plamere/spotipy/blob/master/examples/artist_albums.py, showing how to query Spotify and show the albums for a given artist.

In [52]:
def get_artist(name):
    """Return the first artist matching `name`, or None if there is no match."""
    results = sp.search(q='artist:' + name, type='artist')
    items = results['artists']['items']
    if len(items) > 0:
        return items[0]
    else:
        return None

def show_artist_albums(artist):
    """Print each distinct album name for `artist`, sorted alphabetically."""
    albums = []
    results = sp.artist_albums(artist['id'], album_type='album')
    albums.extend(results['items'])
    # Results are paginated: keep requesting the next page until none is left.
    while results['next']:
        results = sp.next(results)
        albums.extend(results['items'])
    seen = set()  # album names already printed, to avoid duplicates
    albums.sort(key=lambda album: album['name'].lower())
    for album in albums:
        name = album['name']
        if name not in seen:
            print(' ' + name)
            seen.add(name)

In [126]:
artist = get_artist('Bieber')
artist

{u'external_urls': {u'spotify': u'https://open.spotify.com/artist/1uNFoZAHBGtllmzznpCI3s'},
 u'followers': {u'href': None, u'total': 6136975},
 u'genres': [u'teen pop'],
 u'href': u'https://api.spotify.com/v1/artists/1uNFoZAHBGtllmzznpCI3s',
 u'id': u'1uNFoZAHBGtllmzznpCI3s',
 u'images': [{u'height': 1000,
   u'url': u'https://i.scdn.co/image/5c3cf2ee3494e2da71dcf26303202ec491b26213',
   u'width': 1000},
  {u'height': 640,
   u'url': u'https://i.scdn.co/image/2e451efa87b706098553583cffac821b7ebac450',
   u'width': 640},
  {u'height': 200,
   u'url': u'https://i.scdn.co/image/ca283ddea2afc65c15d802d45ee3d3fd255ab4e2',
   u'width': 200},
  {u'height': 64,
   u'url': u'https://i.scdn.co/image/ce41eb4beaad8d07fd55a68aba16b27e341a2e4f',
   u'width': 64}],
 u'name': u'Justin Bieber',
 u'popularity': 94,
 u'type': u'artist',
 u'uri': u'spotify:artist:1uNFoZAHBGtllmzznpCI3s'}

In [54]:
show_artist_albums(artist)

 Believe
 Believe (Deluxe Edition)
 Believe Acoustic
 Journals
 My World
 My World (Canada Version - All BP's)
 My World (France Version)
 My World 2.0
 My Worlds
 My Worlds (International Version)
 My Worlds (Oz Version)
 My Worlds - The Collection (International Package)
 My Worlds - The Collection (Oz Package)
 My Worlds Acoustic
 Never Say Never - The Remixes
 Purpose (Deluxe)
 Under The Mistletoe
 Under The Mistletoe (Deluxe Edition)
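The album listing relies on a pattern that is common to many web APIs: results come back one page at a time, and each page says whether another one follows. The sketch below imitates that pattern without any network access. `FakePagedClient` is a hypothetical stand-in invented for this example and is not part of spotipy; only the loop in `collect_unique_names` mirrors the logic of `show_artist_albums`.

```python
class FakePagedClient(object):
    """Hypothetical stand-in for an API client that serves results in pages."""

    def __init__(self, names, page_size):
        self.names = names
        self.page_size = page_size

    def page(self, offset=0):
        """Return one page of items plus the offset of the next page (or None)."""
        chunk = self.names[offset:offset + self.page_size]
        next_offset = offset + self.page_size
        has_more = next_offset < len(self.names)
        return {"items": [{"name": n} for n in chunk],
                "next": next_offset if has_more else None}

    def next(self, results):
        """Fetch the page that follows `results`."""
        return self.page(results["next"])


def collect_unique_names(client):
    """Gather every page, then return the distinct names in alphabetical order."""
    albums = []
    results = client.page()
    albums.extend(results["items"])
    while results["next"]:               # keep going until no next page is reported
        results = client.next(results)
        albums.extend(results["items"])

    seen = set()
    unique = []
    for album in sorted(albums, key=lambda a: a["name"].lower()):
        if album["name"] not in seen:
            seen.add(album["name"])
            unique.append(album["name"])
    return unique


client = FakePagedClient(
    ["My World", "Believe", "Journals", "Believe", "Purpose (Deluxe)"],
    page_size=2,
)
print(collect_unique_names(client))
```

Collecting all pages before removing duplicates matters here, because the same name can appear on different pages.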


**Exercise:** Search for Taylor Swift and find out her current popularity rating as well as the number of followers she has:

In [127]:
artist.keys()  # call the method (with parentheses) to list the available fields

## API Repositories and Market Places

A large number of other API repositories can be found under these links:

http://www.publicapis.com/

http://www.programmableweb.com/apis/directory

Mashape (https://www.mashape.com/) is a cloud API marketplace where developers can find APIs to integrate into their projects, and where API providers can distribute and monetize their existing APIs.

Accessing their APIs usually requires at least creating an account, and some web sites charge fees for access to their data. There are different ways of communicating with API servers. Mashape has created a Python library called *unirest* that can simplify accessing their data. It can be installed with the following command:

In [55]:
!pip install unirest

Collecting unirest
  Downloading Unirest-1.1.7.tar.gz
Collecting poster>=0.8.1 (from unirest)
  Downloading poster-0.8.1.tar.gz
Building wheels for collected packages: unirest, poster
  Running setup.py bdist_wheel for unirest: started
  Running setup.py bdist_wheel for unirest: finished with status 'done'
  Stored in directory: C:\Users\OEM\AppData\Local\pip\Cache\wheels\67\ed\96\d2b57abe9692255f18d20de5e2b2e15f0ea4f34fa027baef32
  Running setup.py bdist_wheel for poster: started
  Running setup.py bdist_wheel for poster: finished with status 'done'
  Stored in directory: C:\Users\OEM\AppData\Local\pip\Cache\wheels\7f\50\85\e015e7056e73b6dac4653f1d27cee339c5adfa1b34c47bab9a
Successfully built unirest poster
Installing collected packages: poster, unirest
Successfully installed poster-0.8.1 unirest-1.1.7


In [56]:
import unirest

One of the free APIs listed in this marketplace is Bitcoin Exchange Rates, which lists exchange rates between major currencies and bitcoin as well as exchange rates between the major currencies themselves.

https://www.mashape.com/montanaflynn/bitcoin-exchange-rates#

Below is an example of how to construct a query for the buying price of one bitcoin, with the result returned in USD.

In [57]:
response = unirest.get("https://montanaflynn-bitcoin-exchange-rate.p.mashape.com/prices/buy?qty=1",
  headers={
    "X-Mashape-Key": "2BTWnoXPgrmshykB91haA2hod3UYp1FDVvyjsnjK3EfNKw5329",
    "Accept": "text/plain"
  }
)

response.body

{u'amount': u'589.66',
 u'btc': {u'amount': u'1.00000000', u'currency': u'BTC'},
 u'currency': u'USD',
 u'fees': [{u'coinbase': {u'amount': u'5.69', u'currency': u'USD'}},
  {u'bank': {u'amount': u'0.15', u'currency': u'USD'}}],
 u'subtotal': {u'amount': u'583.82', u'currency': u'USD'},
 u'total': {u'amount': u'589.66', u'currency': u'USD'}}

In [58]:
type(response.body)

dict

Notice that the response body is a familiar Python dictionary, from which we can easily extract our data.
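One detail worth noting in the response is that every amount is a string, so it has to be converted before any arithmetic. The sketch below uses a hand-copied version of the response shown above and the standard library's `Decimal` type, which is one reasonable choice for money because it avoids binary floating-point rounding. It adds up the fees and checks that they account for the gap between the subtotal and the total.

```python
from decimal import Decimal

# Hand-copied from the response body shown above.
body = {
    "amount": "589.66",
    "btc": {"amount": "1.00000000", "currency": "BTC"},
    "currency": "USD",
    "fees": [
        {"coinbase": {"amount": "5.69", "currency": "USD"}},
        {"bank": {"amount": "0.15", "currency": "USD"}},
    ],
    "subtotal": {"amount": "583.82", "currency": "USD"},
    "total": {"amount": "589.66", "currency": "USD"},
}

# Each entry in 'fees' is a one-key dictionary: {fee_name: {amount, currency}}.
fee_total = sum(
    (Decimal(detail["amount"]) for fee in body["fees"] for detail in fee.values()),
    Decimal("0"),
)
subtotal = Decimal(body["subtotal"]["amount"])
total = Decimal(body["total"]["amount"])

print(fee_total)
print(subtotal + fee_total == total)
```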

**Exercise:** Extract the total cost of 1 bitcoin from the response.

**Exercise:** Execute a query for the cost of 15 bitcoins and extract the total price from the dictionary.

**Exercise:** Search through the https://www.mashape.com/montanaflynn/bitcoin-exchange-rates# webpage and find out how to construct a query to extract from their API the sell price for a single bitcoin. Execute this and extract the price.

Below is an example of a query for extracting the current exchange rates between the major currencies:

In [59]:
response = unirest.get("https://montanaflynn-bitcoin-exchange-rate.p.mashape.com/currencies/exchange_rates",
  headers={
    "X-Mashape-Key": "QgrDeDPRdFmshQBsi3cDAvZvD6Ykp1AxBj4jsn1po92UN8XxKx",
    "Accept": "text/plain"
  }
)

response.body

{u'usd_to_byr': u'20026.25',
 u'ils_to_usd': u'0.264273',
 u'hrk_to_btc': u'0.000259',
 u'btc_to_eek': u'8091.215314',
 u'kmf_to_eth': u'0.000203',
 u'inr_to_eth': u'0.001342',
 u'jep_to_btc': u'0.002248',
 u'nok_to_usd': u'0.12201',
 u'btc_to_irr': u'17459927.52',
 u'ltc_to_uzs': u'8970.0',
 u'sos_to_eth': u'0.000154',
 u'bam_to_ltc': u'0.19174',
 u'irr_to_usd': u'0.000033',
 u'usd_to_hrk': u'6.64218',
 u'usd_to_nzd': u'1.371975',
 u'mur_to_ltc': u'0.009606',
 u'usd_to_tjs': u'7.86775',
 u'btc_to_sdg': u'3528.001758',
 u'gmd_to_ltc': u'0.00778',
 u'usd_to_uah': u'24.97498',
 u'try_to_eth': u'0.030575',
 u'cop_to_usd': u'0.000345',
 u'xof_to_ltc': u'0.00057',
 u'btc_to_lvl': u'362.413415',
 u'php_to_usd': u'0.021591',
 u'ltc_to_rub': u'191.76543',
 u'mro_to_ltc': u'0.000938',
 u'eth_to_try': u'32.70625',
 u'usd_to_rsd': u'109.48834',
 u'btc_to_cdf': u'570768.6369',
 u'mga_to_btc': u'0.000001',
 u'btc_to_zwl': u'187216.722078',
 u'gel_to_ltc': u'0.143241',
 u'eth_to_xcd': u'30.094073',
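The response is one flat dictionary whose keys follow the pattern `<from>_to_<to>`, with the rates given as strings. A minimal sketch of how such keys might be filtered is shown below, using a handful of entries copied from the output above. The helper `rates_from` is a made-up name for this example.

```python
# A few entries copied from the response above.
rates = {
    "usd_to_nzd": "1.371975",
    "usd_to_hrk": "6.64218",
    "nok_to_usd": "0.12201",
    "btc_to_sdg": "3528.001758",
    "ltc_to_rub": "191.76543",
}


def rates_from(base, raw_rates):
    """Return {target_currency: rate} for every '<base>_to_<target>' entry."""
    prefix = base.lower() + "_to_"
    selected = {}
    for key, value in raw_rates.items():
        if key.startswith(prefix):
            target = key[len(prefix):].upper()
            selected[target] = float(value)   # rates arrive as strings
    return selected


usd_rates = rates_from("USD", rates)
print(sorted(usd_rates.items()))

# Converting 100 USD to NZD with the rate from the response.
nzd_amount = 100 * usd_rates["NZD"]
print(round(nzd_amount, 2))
```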



Markit (http://www.markit.com/Company/About-Markit) is a provider of financial information services.

Below is an example of how the current stock price of Apple can be queried through their API:


In [60]:
url = "http://dev.markitondemand.com/Api/v2/Quote/json?symbol=AAPL"
response = requests.get(url, proxies=massey_proxies)

response

<Response [200]>

In [61]:
markit_dict = json.loads(response.content)
markit_dict

{u'Change': -0.0900000000000034,
 u'ChangePercent': -0.0822067957617861,
 u'ChangePercentYTD': 3.92361770853125,
 u'ChangeYTD': 105.26,
 u'High': 110.23,
 u'LastPrice': 109.39,
 u'Low': 109.21,
 u'MSDate': 42598.6659722222,
 u'MarketCap': 589441779770L,
 u'Name': u'Apple Inc',
 u'Open': 109.67,
 u'Status': u'SUCCESS',
 u'Symbol': u'AAPL',
 u'Timestamp': u'Tue Aug 16 15:59:00 UTC-04:00 2016',
 u'Volume': 1957567}
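The `MSDate` field is not explained in the response itself. It appears to be a spreadsheet-style serial date, that is, the number of days since 30 December 1899 with the time of day as the fractional part. That reading is an assumption, but it can be checked against the `Timestamp` field of the same response, as the sketch below does with values copied from the output above.

```python
from datetime import datetime, timedelta

# Assumption: MSDate counts days since 1899-12-30 (spreadsheet serial date).
SERIAL_EPOCH = datetime(1899, 12, 30)


def from_serial_date(serial):
    """Convert a serial day count to a datetime, rounded to the nearest second."""
    seconds = round(serial * 86400)      # 86400 seconds per day
    return SERIAL_EPOCH + timedelta(seconds=seconds)


# Values copied from the response shown above.
quote = {
    "LastPrice": 109.39,
    "MSDate": 42598.6659722222,
    "Timestamp": "Tue Aug 16 15:59:00 UTC-04:00 2016",
}

quote_time = from_serial_date(quote["MSDate"])
print(quote_time)
print(quote["Timestamp"])
```

The converted value agrees with the local date and time in `Timestamp`, which supports the assumption. Note that the serial number carries no time zone information.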

**Exercise:** Look through their API documentation at http://dev.markitondemand.com/#doc_lookup and construct a query.

Another popular API provider is Apigee: https://apigee.com/providers

In [62]:
from IPython.core.display import HTML
HTML("<iframe src=https://apigee.com/providers width=1100 height=500></iframe>")
