# Tutorial 5: 
## Part A: Web Scraping 
This part covers scraping data from the web. Web data may be exposed as HTML pages, XML documents, or through APIs. Web scraping is the practice of using libraries to sift through a web page and gather the data you need in the format most useful to you, while preserving the structure of the data.

There are several ways to extract information from the web. Using an API is probably the best way to extract data from a website. Almost all large websites, such as Twitter, Facebook, Google, and StackOverflow, provide APIs to access their data in a more structured manner. If you can get what you need through an API, it is almost always the preferred approach over web scraping. However, not all websites provide an API. In those cases, we need to scrape the HTML of the website to fetch the information.

We first explain how to extract information from HTML pages. 

***

## HTML:

Websites are written in HTML, which means that each web page is a structured document. HTML is easy to read in the browser, but the data in it is not directly downloadable in a machine-readable format.

### 1. urllib2 and lxml:
Python has many libraries for reading and writing data in the ubiquitous HTML format. lxml (http://lxml.de) is one that has consistently strong performance in parsing very large files. lxml has multiple programmer interfaces.


So, how can we deal with HTML data?
#### 1. First, locate the website (URL) of the data:
For example, check the URL https://www.google.com/finance#stockscreener for Google Finance live feeds.
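
Note that everything after `#` in a URL is a fragment, which the browser handles itself and does not send to the server. The following is a minimal sketch, assuming Python 3's standard `urllib.parse` (the cells in this tutorial are Python 2) and an illustrative variable name `finance_url`, showing which part of the example URL a scraper actually requests:

```python
from urllib.parse import urldefrag

finance_url = 'https://www.google.com/finance#stockscreener'

# urldefrag splits a URL into the part sent to the server and the fragment
requested_url, fragment = urldefrag(finance_url)

print(requested_url)  # https://www.google.com/finance
print(fragment)       # stockscreener
```

A page that relies on the fragment to choose what to display (typically through JavaScript) may therefore return different HTML to a script than what you see in the browser.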

#### 2. Install Libraries required for web scraping:

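**urllib2:** a Python 2 standard-library module for fetching URLs. It defines functions and classes to help with URL actions (basic and digest authentication, redirections, cookies, etc.). Because it ships with Python 2, it does not need to be installed with pip. In Python 3, the same functionality lives in urllib.request.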
        
**lxml:** provides a very simple and powerful API for parsing XML and HTML. It supports one-step parsing as well as step-by-step parsing using an event-driven API. 
        
        pip install lxml


For more detail, refer to the documentation pages:
http://lxml.de
https://docs.python.org/2/library/urllib2.html

#### 3. Get the page contents:


In [12]:
# import libraries
from lxml.html import parse
from urllib2 import urlopen

# specify the URL
financeURL = 'https://www.google.com/finance#stockscreener'

# query the website and store the response in the variable 'page'
page = urlopen(financeURL)

# parse the page variable with lxml
data_parsed = parse(page)

# get the root element of this tree
data = data_parsed.getroot()
print type(data), type(data_parsed)

<class 'lxml.html.HtmlElement'> <type 'lxml.etree._ElementTree'>


#### 4. Understand the page structure:

To be able to extract information from an HTML page, you need to understand the elements and attributes of any HTML page. Details about this can be found in: <a href="http://www.w3schools.com/html/default.asp"> HTML Tutorial</a>. 

To start, we need to look at the HTML that displays the different categories. In Chrome or Firefox, highlight "Readers' Poll Winners", right-click, and select Inspect Element, as in the following screenshots:
<img src="ss1.png">

<img src="ss2.png">



### You could dump the raw html:

In [13]:
import lxml.html
from lxml import etree
# read the raw contents of the page
raw_html = urlopen('https://www.google.com/finance#stockscreener').read()
# lxml.html.fromstring implicitly expects bytes as input
tree = lxml.html.fromstring(raw_html)
# convert the tree back into a string;
# the string now contains the whole HTML file
html = lxml.html.tostring(tree)
html


'<html><head><script>(function(){(function(){function e(a){this.t={};this.tick=function(a,c,b){var d=void 0!=b?b:(new Date).getTime();this.t[a]=[d,c];if(void 0==b)try{window.console.timeStamp("CSI/"+a)}catch(e){}};this.tick("start",null,a)}var a,d;window.performance&&(d=(a=window.performance.timing)&&a.responseStart);var f=0<d?new e(d):new e;window.jstiming={Timer:e,load:f};if(a){var c=a.navigationStart;0<c&&d>=c&&(window.jstiming.srt=d-c)}if(a){var b=window.jstiming.load;0<c&&d>=c&&(b.tick("_wtsrt",void 0,c),b.tick("wtsrt_","_wtsrt",\nd),b.tick("tbsd_","wtsrt_"))}try{a=null,window.chrome&&window.chrome.csi&&(a=Math.floor(window.chrome.csi().pageT),b&&0<c&&(b.tick("_tbnd",void 0,window.chrome.csi().startE),b.tick("tbnd_","_tbnd",c))),null==a&&window.gtbExternal&&(a=window.gtbExternal.pageT()),null==a&&window.external&&(a=window.external.pageT,b&&0<c&&(b.tick("_tbnd",void 0,window.external.startE),b.tick("tbnd_","_tbnd",c))),a&&(window.jstiming.pt=a)}catch(g){}})();})();\n</script><titl

In [14]:
# Or print the tree in a prettier way using the tree structure
print(etree.tostring(tree, encoding='unicode', pretty_print=True))

<html>
  <head>
    <script>(function(){(function(){function e(a){this.t={};this.tick=function(a,c,b){var d=void 0!=b?b:(new Date).getTime();this.t[a]=[d,c];if(void 0==b)try{window.console.timeStamp("CSI/"+a)}catch(e){}};this.tick("start",null,a)}var a,d;window.performance&amp;&amp;(d=(a=window.performance.timing)&amp;&amp;a.responseStart);var f=0&lt;d?new e(d):new e;window.jstiming={Timer:e,load:f};if(a){var c=a.navigationStart;0&lt;c&amp;&amp;d&gt;=c&amp;&amp;(window.jstiming.srt=d-c)}if(a){var b=window.jstiming.load;0&lt;c&amp;&amp;d&gt;=c&amp;&amp;(b.tick("_wtsrt",void 0,c),b.tick("wtsrt_","_wtsrt",
d),b.tick("tbsd_","wtsrt_"))}try{a=null,window.chrome&amp;&amp;window.chrome.csi&amp;&amp;(a=Math.floor(window.chrome.csi().pageT),b&amp;&amp;0&lt;c&amp;&amp;(b.tick("_tbnd",void 0,window.chrome.csi().startE),b.tick("tbnd_","_tbnd",c))),null==a&amp;&amp;window.gtbExternal&amp;&amp;(a=window.gtbExternal.pageT()),null==a&amp;&amp;window.external&amp;&amp;(a=window.external.pageT,b&amp;&am

HTML links are defined with the `<a>` tag, for example `<a href="http://www.test.com">This is a link for test.com</a>`.
To find them manually, search the raw HTML for `'<a '` (note the trailing space).


### Task: Extract all links in HTML page: 
In our example, the 'data' object contains the root of the parsed document. To access an element in the page, we extract it by its index or name. For example, the following code gets all the links in the data. 

In [15]:
links = data.findall('.//a') # find all 'a' which are '<a ' anchors, i.e. href links
linkURLs = []
for link in links:
    linkURLs.append(link.get('href'))

### A more compact way of implementing the same code is as follows:

In [16]:
linkURLs = [link.get('href') for link in data.findall('.//a')] # look for all the 'hrefs' in 'a' anchors
# e.g. from above: <a href="http://www.w3schools.com/html/default.asp"> HTML</a>

### What type is linkURLs?

In [17]:
type(linkURLs)

list

### Let's have a look at the list items:

In [18]:
linkURLs[0:10] # look at the first few

['https://www.google.com.au/webhp?tab=ew',
 'http://www.google.com.au/imghp?hl=en&tab=ei',
 'https://maps.google.com.au/maps?hl=en&tab=el',
 'https://play.google.com/?hl=en&tab=e8',
 'https://www.youtube.com/?tab=e1',
 'https://news.google.com.au/nwshp?hl=en&tab=en',
 'https://mail.google.com/mail/?tab=em',
 'https://drive.google.com/?tab=eo',
 'https://www.google.com.au/intl/en/options/',
 'http://www.google.com/support/finance?hl=en']

### These are the links not the text, how can we get the text instead?

We do the same to get the link text instead of the link URL; in this case we use the text_content method instead of get.

In [19]:
linkText=[link.text_content() for link in data.findall('.//a')]
linkText[0:10]

['Search',
 'Images',
 'Maps',
 'Play',
 'YouTube',
 'News',
 'Gmail',
 'Drive',
 u'More \xbb',
 'Help']

And you can see they match the linkURLs above (e.g. 'maps.google.com' & 'Maps')
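
To see the two lists side by side without fetching a page, here is a small sketch that runs the same `findall('.//a')` query on a made-up, well-formed snippet. It assumes Python 3 and uses the standard library's `xml.etree.ElementTree` as a stand-in for lxml: both offer `findall` and `get`, and `''.join(link.itertext())` plays the role of lxml's `text_content()`. The snippet and the names `sample_html` and `link_pairs` are illustrative only.

```python
import xml.etree.ElementTree as ET

# A small, well-formed page invented for this example
sample_html = """
<html><body>
  <a href="https://www.youtube.com/?tab=e1"><b>You</b>Tube</a>
  <a href="https://drive.google.com/?tab=eo">Drive</a>
  <a name="top">No href here</a>
</body></html>
""".strip()

root = ET.fromstring(sample_html)

link_pairs = []
for link in root.findall('.//a'):
    href = link.get('href')          # None when the attribute is missing
    text = ''.join(link.itertext())  # text of the element and its children
    if href is not None:
        link_pairs.append((text, href))

for text, href in link_pairs:
    print(text, '->', href)
```

Anchors without an `href` attribute return `None` from `get`, so building both values in the same loop and filtering them keeps each text matched to its URL.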

### 2. BeautifulSoup

http://www.crummy.com/software/BeautifulSoup/

An easier way to parse HTML is to use the BeautifulSoup library. We first request the page, then pass its contents to BeautifulSoup to parse. Once we have the parsed page object, we can use its attributes and methods. The following example asks BeautifulSoup to find all 'a' tags (links) on the page. First, you will need to install the libraries. The code below imports from bs4, which is provided by the beautifulsoup4 package.

    pip install beautifulsoup4
    pip install requests

In [20]:
# import the Beautiful Soup class used to parse the data returned from the website
from bs4 import BeautifulSoup
import requests

# request the page, then parse its HTML and store it in Beautiful Soup format
page = requests.get("http://www.rba.gov.au/statistics/cash-rate/")
bs = BeautifulSoup(page.content, "lxml")
print bs.prettify()

<!DOCTYPE html>
<html>
 <head>
  <script>
   (function(){(function(){function e(a){this.t={};this.tick=function(a,c,b){var d=void 0!=b?b:(new Date).getTime();this.t[a]=[d,c];if(void 0==b)try{window.console.timeStamp("CSI/"+a)}catch(e){}};this.tick("start",null,a)}var a,d;window.performance&&(d=(a=window.performance.timing)&&a.responseStart);var f=0<d?new e(d):new e;window.jstiming={Timer:e,load:f};if(a){var c=a.navigationStart;0<c&&d>=c&&(window.jstiming.srt=d-c)}if(a){var b=window.jstiming.load;0<c&&d>=c&&(b.tick("_wtsrt",void 0,c),b.tick("wtsrt_","_wtsrt",
d),b.tick("tbsd_","wtsrt_"))}try{a=null,window.chrome&&window.chrome.csi&&(a=Math.floor(window.chrome.csi().pageT),b&&0<c&&(b.tick("_tbnd",void 0,window.chrome.csi().startE),b.tick("tbnd_","_tbnd",c))),null==a&&window.gtbExternal&&(a=window.gtbExternal.pageT()),null==a&&window.external&&(a=window.external.pageT,b&&0<c&&(b.tick("_tbnd",void 0,window.external.startE),b.tick("tbnd_","_tbnd",c))),a&&(window.jstiming.pt=a)}catch(g){}})(

### Extract the links using BeautifulSoup

In [21]:
links= bs.find_all('a')
links[1:10]

[<a class="gb1" href="http://www.google.com.au/imghp?hl=en&amp;tab=ei">Images</a>,
 <a class="gb1" href="https://maps.google.com.au/maps?hl=en&amp;tab=el">Maps</a>,
 <a class="gb1" href="https://play.google.com/?hl=en&amp;tab=e8">Play</a>,
 <a class="gb1" href="https://www.youtube.com/?tab=e1">YouTube</a>,
 <a class="gb1" href="https://news.google.com.au/nwshp?hl=en&amp;tab=en">News</a>,
 <a class="gb1" href="https://mail.google.com/mail/?tab=em">Gmail</a>,
 <a class="gb1" href="https://drive.google.com/?tab=eo">Drive</a>,
 <a class="gb1" href="https://www.google.com.au/intl/en/options/" style="text-decoration:none"><u>More</u> »</a>,
 <a class="gb4" href="http://www.google.com/support/finance?hl=en">Help</a>]

### Discussion: 

Tables are also very important elements to parse in HTML. Similarly, we can use the findall method to find all tables in the page. Try to fetch the important tables in the page and convert the data inside them into a DataFrame.


### Solution:

In [22]:
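# grab the cell contents of one row of one table
def fetch_row_in_table(tables, table_id, row_id):
    contents = []
    # return an empty list if the table index is out of range
    if table_id >= len(tables):
        return contents
    rows = tables[table_id].findall('.//tr')
    # return an empty list if the row index is out of range
    if row_id >= len(rows):
        return contents
    for cell in rows[row_id].findall('.//td'):
        # collect the text of each child element inside the cell
        contents += [val.text_content() for val in cell]
    return contents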

In [23]:
tables = data.findall('.//table')
fetch_row_in_table(tables,1,1)

['Nikkei 225\n', '16,908.52', '+21.12', '(0.13%)']

In [24]:
# As above, search for '<table ' in the raw HTML, then for 'Nikkei' (Japanese stock market).
# It turns out Nikkei is not the first row; the first row is 'Shanghai'.
fetch_row_in_table(tables,1,0)

['Shanghai\n', '3,079.38', '-6.11', '(-0.20%)']
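
The values returned above are all strings. Before loading them into a DataFrame, it usually helps to convert them to numbers. The helper below is a hypothetical sketch for Python 3 (the name `parse_index_row` and the field names are our own, not part of lxml or pandas) that cleans the two rows shown above:

```python
def parse_index_row(row):
    """Convert one scraped row of strings into a dict of typed values."""
    name, price, change, pct = row
    return {
        'name': name.strip(),                    # drop the trailing newline
        'price': float(price.replace(',', '')),  # '16,908.52' -> 16908.52
        'change': float(change),                 # '+21.12' -> 21.12
        'change_pct': float(pct.strip('()%')),   # '(0.13%)' -> 0.13
    }

rows = [
    ['Shanghai\n', '3,079.38', '-6.11', '(-0.20%)'],
    ['Nikkei 225\n', '16,908.52', '+21.12', '(0.13%)'],
]
records = [parse_index_row(row) for row in rows]
for record in records:
    print(record)
```

A list of dicts like `records` can be passed directly to `pandas.DataFrame` to get one column per key.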

### Discussion
Pandas makes our life easier with many useful functions.
Try pandas.read_html.


In [25]:
import pandas as pd
# read_html returns a list of DataFrames, one per table found in the page
all_tables = pd.read_html('https://www.google.com/finance#stockscreener')

*** 
## API: 
In addition to HTML format, data is commonly found on the web through public APIs. We use the 'requests' package (http://docs.python-requests.org) to call APIs using Python. In the following example, we call a public API for collecting weather data. 


**You need to sign up for a free account to get your unique API key to use in the following code. Register at** http://api.openweathermap.org

In [27]:

# Now we use requests to call the API and retrieve our data
import requests
url = 'http://api.openweathermap.org/data/2.5/forecast/city?id=524901&APPID=8b54baa39e4e6fa4ae7baea09320d63c'  # write your APPID here
response= requests.get(url)
response

<Response [200]>

The response object holds the result of the GET request. A successful request has a status code of 200. We need to parse the response body as JSON to extract the information.

In [28]:
#Check the HTTP status code https://en.wikipedia.org/wiki/List_of_HTTP_status_codes
print response.status_code

200
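
Status codes are grouped by their first digit: 2xx means success, 4xx a problem with the request, and 5xx a problem on the server. As an illustration, the hypothetical helper `describe_status` below uses Python 3's standard `http.HTTPStatus` to turn a code into a readable label. It is only a sketch and not part of requests.

```python
from http import HTTPStatus

def describe_status(code):
    """Return a short label such as '200 OK (success)' for an HTTP status code."""
    families = {1: 'informational', 2: 'success', 3: 'redirection',
                4: 'client error', 5: 'server error'}
    phrase = HTTPStatus(code).phrase  # standard reason phrase for the code
    return '{} {} ({})'.format(code, phrase, families[code // 100])

for code in (200, 401, 404, 500):
    print(describe_status(code))
```

With requests, calling `response.raise_for_status()` raises an exception for 4xx and 5xx codes, which is a convenient check before parsing the body.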


In [29]:
# response.content is the raw response body (a 'str' in Python 2)
print type(response.content)

<type 'str'>


In [30]:
# response.json() parses the JSON content into Python objects (here, a dict)
data = response.json()
print type(data)

<type 'dict'>
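
Calling `response.json()` is roughly equivalent to decoding the raw body and passing it to the standard `json` module. The sketch below (Python 3) uses a short, hand-written body in place of a real response, so the name `raw_body` and its contents are illustrative; only the `dt` and `temp` values are taken from this tutorial's output.

```python
import json

# Stand-in for response.content: the raw bytes of a JSON document
raw_body = b'{"cnt": 1, "list": [{"dt": 1472806800, "main": {"temp": 292.55}}]}'

# Decode the bytes to text, then parse the text into Python objects
parsed = json.loads(raw_body.decode('utf-8'))

print(type(parsed).__name__)              # dict
print(sorted(parsed.keys()))              # ['cnt', 'list']
print(parsed['list'][0]['main']['temp'])  # 292.55
```

JSON objects become dicts and JSON arrays become lists, which is why the parsed result can be indexed with keys and positions.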


In [31]:
data.keys()

[u'city', u'message', u'list', u'cod', u'cnt']

The keys explain the structure of the fetched data. Try displaying values for each element. In this example, the weather information exists in the 'list'. 

In [32]:
data['list'][10]

{u'clouds': {u'all': 64},
 u'dt': 1472806800,
 u'dt_txt': u'2016-09-02 09:00:00',
 u'main': {u'grnd_level': 1007.4,
  u'humidity': 74,
  u'pressure': 1007.4,
  u'sea_level': 1026.82,
  u'temp': 292.55,
  u'temp_kf': -0.99,
  u'temp_max': 293.532,
  u'temp_min': 292.55},
 u'rain': {},
 u'sys': {u'pod': u'd'},
 u'weather': [{u'description': u'broken clouds',
   u'icon': u'04d',
   u'id': 803,
   u'main': u'Clouds'}],
 u'wind': {u'deg': 269.516, u'speed': 3.04}}
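
Two fields in this record are worth decoding. `dt` is a Unix timestamp (seconds since 1970-01-01 UTC), and OpenWeatherMap reports temperatures in Kelvin unless other units are requested. Here is a short sketch for Python 3, using the values shown above (the variable names are our own):

```python
from datetime import datetime, timezone

dt = 1472806800       # 'dt' from the record above
temp_kelvin = 292.55  # 'main' -> 'temp' from the record above

# Convert the Unix timestamp to a UTC datetime
when = datetime.fromtimestamp(dt, tz=timezone.utc)
print(when.strftime('%Y-%m-%d %H:%M:%S'))  # 2016-09-02 09:00:00, matches 'dt_txt'

# Convert Kelvin to degrees Celsius
temp_celsius = round(temp_kelvin - 273.15, 2)
print(temp_celsius)                        # 19.4
```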

The next step is to create a DataFrame with the weather information, as demonstrated below. You can select a subset of columns to display, or display the entire data set.

In [33]:
from pandas import DataFrame
# data with the default column headers
weather_table_all= DataFrame(data['list'])
weather_table_all

| | clouds | dt | dt_txt | main | rain | sys | weather | wind |
|---|---|---|---|---|---|---|---|---|
| 0 | {u'all': 0} | 1472698800 | 2016-09-01 03:00:00 | {u'temp_kf': -2.08, u'temp': 280.61, u'grnd_le... | {} | {u'pod': u'd'} | [{u'main': u'Clear', u'id': 800, u'icon': u'01... | {u'speed': 3.42, u'deg': 318.006} |
| 1 | {u'all': 0} | 1472709600 | 2016-09-01 06:00:00 | {u'temp_kf': -1.97, u'temp': 284.85, u'grnd_le... | {} | {u'pod': u'd'} | [{u'main': u'Clear', u'id': 800, u'icon': u'01... | {u'speed': 3.75, u'deg': 315.003} |
| 2 | {u'all': 8} | 1472720400 | 2016-09-01 09:00:00 | {u'temp_kf': -1.86, u'temp': 287.64, u'grnd_le... | {} | {u'pod': u'd'} | [{u'main': u'Clear', u'id': 800, u'icon': u'02... | {u'speed': 5.01, u'deg': 322} |
| 3 | {u'all': 0} | 1472731200 | 2016-09-01 12:00:00 | {u'temp_kf': -1.75, u'temp': 289.06, u'grnd_le... | {} | {u'pod': u'd'} | [{u'main': u'Clear', u'id': 800, u'icon': u'01... | {u'speed': 4.82, u'deg': 321.002} |
| 4 | {u'all': 0} | 1472742000 | 2016-09-01 15:00:00 | {u'temp_kf': -1.64, u'temp': 288.46, u'grnd_le... | {} | {u'pod': u'd'} | [{u'main': u'Clear', u'id': 800, u'icon': u'01... | {u'speed': 3.57, u'deg': 320.501} |
| 5 | {u'all': 12} | 1472752800 | 2016-09-01 18:00:00 | {u'temp_kf': -1.53, u'temp': 283.49, u'grnd_le... | {} | {u'pod': u'n'} | [{u'main': u'Clouds', u'id': 801, u'icon': u'0... | {u'speed': 1.75, u'deg': 304.502} |
| 6 | {u'all': 80} | 1472763600 | 2016-09-01 21:00:00 | {u'temp_kf': -1.42, u'temp': 282.98, u'grnd_le... | {} | {u'pod': u'n'} | [{u'main': u'Clouds', u'id': 803, u'icon': u'0... | {u'speed': 1.15, u'deg': 264.503} |
| 7 | {u'all': 56} | 1472774400 | 2016-09-02 00:00:00 | {u'temp_kf': -1.31, u'temp': 284.15, u'grnd_le... | {} | {u'pod': u'n'} | [{u'main': u'Clouds', u'id': 803, u'icon': u'0... | {u'speed': 1.69, u'deg': 227.001} |
| 8 | {u'all': 80} | 1472785200 | 2016-09-02 03:00:00 | {u'temp_kf': -1.2, u'temp': 284.07, u'grnd_lev... | {u'3h': 0.02} | {u'pod': u'd'} | [{u'main': u'Rain', u'id': 500, u'icon': u'10d... | {u'speed': 1.47, u'deg': 233.501} |
| 9 | {u'all': 12} | 1472796000 | 2016-09-02 06:00:00 | {u'temp_kf': -1.1, u'temp': 289.56, u'grnd_lev... | {u'3h': 0.025} | {u'pod': u'd'} | [{u'main': u'Rain', u'id': 500, u'icon': u'10d... | {u'speed': 1.96, u'deg': 258} |


In [34]:
#Select data to display, looks like main has the information we need
weather_table_sel= DataFrame(data['list'],columns=['main'])
weather_table_sel

| | main |
|---|---|
| 0 | {u'temp_kf': -2.08, u'temp': 280.61, u'grnd_le... |
| 1 | {u'temp_kf': -1.97, u'temp': 284.85, u'grnd_le... |
| 2 | {u'temp_kf': -1.86, u'temp': 287.64, u'grnd_le... |
| 3 | {u'temp_kf': -1.75, u'temp': 289.06, u'grnd_le... |
| 4 | {u'temp_kf': -1.64, u'temp': 288.46, u'grnd_le... |
| 5 | {u'temp_kf': -1.53, u'temp': 283.49, u'grnd_le... |
| 6 | {u'temp_kf': -1.42, u'temp': 282.98, u'grnd_le... |
| 7 | {u'temp_kf': -1.31, u'temp': 284.15, u'grnd_le... |
| 8 | {u'temp_kf': -1.2, u'temp': 284.07, u'grnd_lev... |
| 9 | {u'temp_kf': -1.1, u'temp': 289.56, u'grnd_lev... |


### Discussion: 

Further parsing is still required to get the table (DataFrame) into a flat shape. Now it's your turn: parse the weather data to generate a table with the following information (each in its own column): temp, humidity, rain duration, wind speed, wind degree.
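
As one possible starting point, the sketch below flattens a single forecast record into a plain dict with one key per requested column. It is written for Python 3, the helper name `flatten_record` is hypothetical, and it assumes the rain value sits under the `'3h'` key as in the rows shown earlier (an empty `rain` dict yields `None`). The sample record is abridged from the output of `data['list'][10]`.

```python
def flatten_record(record):
    """Pull the nested fields of one forecast entry into a flat dict."""
    return {
        'dt_txt': record.get('dt_txt'),
        'temp': record['main']['temp'],
        'humidity': record['main']['humidity'],
        'rain_3h': record.get('rain', {}).get('3h'),  # None when no rain is reported
        'wind_speed': record['wind']['speed'],
        'wind_deg': record['wind']['deg'],
    }

# Abridged copy of the record displayed earlier in this tutorial
sample = {
    'dt_txt': '2016-09-02 09:00:00',
    'main': {'temp': 292.55, 'humidity': 74},
    'rain': {},
    'wind': {'deg': 269.516, 'speed': 3.04},
}

flat = flatten_record(sample)
print(flat)
```

Applying the helper to every entry, for example `[flatten_record(r) for r in data['list']]`, gives a list of flat dicts that `DataFrame` can turn into a table with one column per key.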