### ST445 Managing and Visualizing Data
#  Using data from the Internet
### Week 4 Lecture, MT 2017 - Kenneth Benoit, Akitaka Matsuo

## Plan for today

- Web-scraping

## Web scraping 

#### What is it?
"Web scraping (web harvesting or web data extraction) is data scraping used for extracting data from websites" [Wikipedia: Web Scraping](https://en.wikipedia.org/wiki/Web_scraping)


![Web Scraping](https://upload.wikimedia.org/wikipedia/commons/d/da/Scrapp.jpg)






## Web-scraping steps
1. Get contents from the web
2. Extract information
3. Reshape and save the information as data

## Get contents from the web

- First of all, you need to know where the information is
- Examples:
    - Government's administrative data
    - Newspaper websites
- The data format
    - web-pages (in html)
    - data files in various formats (csv, spss, stata)
    - document files (MS-Word, pdf)
    - API (e.g. JSON)
    - pictures

     


## Get to know the target website

1. Open the website, learn how it's structured
2. "View page source" and "Inspect"
    - [Example 0](http://www.r-datacollection.com/materials/ch-2-html/fortunes.html)
    - [Example 1](http://www.r-datacollection.com/materials/ch-6-ajax/fortunes/fortunes1.html)
    - [Example 2](http://www.r-datacollection.com/materials/ch-6-ajax/fortunes/fortunes2.html)
    - [Example 3](http://www.r-datacollection.com/materials/ch-6-ajax/fortunes/fortunes3.html)

These examples look similar (especially Ex 0 and Ex 2), but their static contents differ, so what a normal scraper can see may differ as well.
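
A minimal sketch of the difference, using two made-up page sources (they are assumptions for illustration, not copies of the example pages). The first carries the name in its static HTML; the second only holds an empty placeholder that a script would fill in inside the browser. The regular expression is a quick check here, not a substitute for a real HTML parser.

```python
import re

# Hypothetical page sources, written for illustration only.
STATIC_PAGE = """
<html><body>
  <div><h1>Robert Gentleman</h1></div>
</body></html>
"""

DYNAMIC_PAGE = """
<html><body>
  <div id="content"></div>
  <script src="fortunes.js"></script>
</body></html>
"""


def headings_in_source(source):
    """Return the text of every <h1> found in the raw page source."""
    return re.findall(r"<h1>(.*?)</h1>", source)


print(headings_in_source(STATIC_PAGE))   # the name is part of the static content
print(headings_in_source(DYNAMIC_PAGE))  # empty: a browser would add it later
```

A scraper that only downloads the page source sees what `headings_in_source` sees, which is why "View page source" and "Inspect" can disagree.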

     


## Get webpage contents

- Suppose you know that what you want is in the static contents of the webpage (i.e. something you can find in "View page source")
- Then the steps are
    1. Get the page contents
    2. Parse the contents
    3. Extract and format the contents
     


## Get webpage contents in Python    


```python
from urllib.request import urlopen

# Open the URL and print the raw response body.
html = urlopen("http://www.r-datacollection.com/materials/ch-2-html/fortunes.html")
print(html.read())  # read() returns bytes, hence the b'...' prefix below
## b'<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML//EN">\n<html> <head>\n<title>Collected R wisdoms</title>\n</head>\n\n<body>\n<div id="R Inventor" lang="english" date="June/2003">\n  <h1>Robert Gentleman</h1>\n  <p><i>\'What we have is nice, but we need something very different\'</i></p>\n  <p><b>Source: </b>Statistical Computing 2003, Reisensburg</p>\n</div>\n\n<div lang="english" date="October/2011">\n  <h1>Rolf Turner</h1>\n  <p><i>\'R is wonderful, but it cannot work magic\'</i> <br><emph>answering a request for automatic generation of \'data from a known mean and 95% CI\'</emph></p>\n  <p><b>Source: </b><a href="https://stat.ethz.ch/mailman/listinfo/r-help">R-help</a></p>\n</div>\n\n<address><a href="http://www.rdatacollectionbook.com"><i>The book homepage</i><a/></address>\n\n</body> </html>\n'
```
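
The raw output above is a `bytes` object, which is awkward to read and to search. One possible way to get ordinary text is sketched below. The helper name `fetch_text` and its parameters are assumptions made for this example, not part of `urllib`.

```python
from urllib.request import urlopen


def fetch_text(url, timeout=10, fallback_encoding="utf-8"):
    """Download a URL and return its body as str instead of bytes."""
    with urlopen(url, timeout=timeout) as response:
        raw = response.read()  # bytes, as in the output above
        # Use the charset declared in the response headers, if there is one.
        encoding = response.headers.get_content_charset() or fallback_encoding
    # Replace undecodable bytes rather than failing on a bad character.
    return raw.decode(encoding, errors="replace")


# Example call (needs network access):
# page = fetch_text("http://www.r-datacollection.com/materials/ch-2-html/fortunes.html")
# print(page[:200])
```

The `timeout` keeps a scraper from hanging forever on a slow server, and the `with` block closes the connection once the body has been read.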


## Get webpage contents in R    

```r
url <- "http://www.r-datacollection.com/materials/html/fortunes.html"
fortunes <- readLines(con = url)
cat(fortunes)
## <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML//EN"> <html> <head> <title>Collected R wisdoms</title> </head>  <body> <div id="R Inventor" lang="english" date="June/2003">   <h1>Robert Gentleman</h1>   <p><i>'What we have is nice, but we need something very different'</i></p>   <p><b>Source: </b>Statistical Computing 2003, Reisensburg </div>  <div lang=english date="October/2011">   <h1>Rolf Turner</h1>   <p><i>'R is wonderful, but it cannot work magic'</i> <br><emph>answering a request for automatic generation of 'data from a known mean and 95% CI'</emph></p>   <p><b>Source: </b><a href="https://stat.ethz.ch/mailman/listinfo/r-help">R-help</a></p> </div>  <address><a href="www.r-datacollectionbook.com"><i>The book homepage</i><a/></address>  </body> </html>
```


## Parse html

### Typical html structure
![html tree](https://www.w3schools.com/js/pic_htmltree.gif)

HTML parsers analyze the structure of HTML and make it ready for extracting information

## HTML Parsing

The next step is to parse the HTML content.

### A very simple example

```python
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("http://www.r-datacollection.com/materials/ch-2-html/fortunes.html")
bsObj = BeautifulSoup(html, "html.parser")
nameList = bsObj.find_all("h1")  # extract all "h1" tags
for name in nameList:  # print the text content of each "h1" tag
    print(name.get_text())
## Robert Gentleman
## Rolf Turner
```
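
The same parsed object can also feed the last scraping step, reshaping the information and saving it as data. The sketch below is one way to do it, assuming `bs4` is installed. It works on an abridged copy of the fortunes page so that it runs offline, and the column names (`author`, `date`, `quote`) are choices made for this example.

```python
import csv
import io

from bs4 import BeautifulSoup

# Abridged copy of the fortunes page shown earlier, so the sketch runs offline.
HTML = """
<html><body>
<div id="R Inventor" lang="english" date="June/2003">
  <h1>Robert Gentleman</h1>
  <p><i>'What we have is nice, but we need something very different'</i></p>
  <p><b>Source: </b>Statistical Computing 2003, Reisensburg</p>
</div>
<div lang="english" date="October/2011">
  <h1>Rolf Turner</h1>
  <p><i>'R is wonderful, but it cannot work magic'</i></p>
  <p><b>Source: </b>R-help</p>
</div>
</body></html>
"""

soup = BeautifulSoup(HTML, "html.parser")

# One dictionary per <div>: attributes and nested text become columns.
rows = []
for div in soup.find_all("div"):
    rows.append({
        "author": div.find("h1").get_text(strip=True),
        "date": div.get("date"),
        "quote": div.find("i").get_text(strip=True),
    })

# Write the records as CSV (to an in-memory buffer here; a file works the same way).
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["author", "date", "quote"])
writer.writeheader()
writer.writerows(rows)
print(buffer.getvalue())
```

To save to disk instead, replace the buffer with `open("fortunes.csv", "w", newline="")`.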


## XPath

You may want to navigate through the HTML structure to get a particular piece of information.

#### Example
- Select the text of an `<i>` tag inside a `<p>` tag
- Select based on the class value (this can be achieved with BeautifulSoup, though)

Use `etree` in `lxml`


```python
from lxml import etree

parser = etree.HTMLParser()
tree = etree.parse("http://www.r-datacollection.com/materials/ch-2-html/fortunes.html", parser)
h1nodes = tree.xpath('.//div/h1')  # find the h1 in div
for nod in h1nodes:
    print(nod.text)
## Robert Gentleman
## Rolf Turner
```

```python
h1nodes_oct2011 = tree.xpath('.//div[@date="October/2011"]/h1')  # find the h1 in div with specific date value
for nod in h1nodes_oct2011:
    print(nod.text)
## Rolf Turner
```
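
The first item in the example list, the text of an `<i>` tag inside a `<p>` tag, can be selected in the same way. The sketch below assumes `lxml` is installed and parses an abridged copy of the page from a string, so it does not need network access.

```python
from lxml import etree

# Abridged copy of the fortunes page, so the sketch runs offline.
HTML = """
<html><body>
<div id="R Inventor" lang="english" date="June/2003">
  <h1>Robert Gentleman</h1>
  <p><i>'What we have is nice, but we need something very different'</i></p>
</div>
<div lang="english" date="October/2011">
  <h1>Rolf Turner</h1>
  <p><i>'R is wonderful, but it cannot work magic'</i></p>
</div>
</body></html>
"""

root = etree.fromstring(HTML, etree.HTMLParser())

# Text of every <i> that is a direct child of a <p>.
quotes = root.xpath(".//p/i/text()")

# Value of the date attribute on every <div>.
dates = root.xpath(".//div/@date")

for date, quote in zip(dates, quotes):
    print(date, quote)
```

Ending an XPath expression with `text()` or `@attribute` returns strings directly, so there is no need to loop over nodes and read `.text` afterwards.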


## R web scraping toolbox

-  Get contents
    - `RCurl`
    - `httr`
-  Parse and extract information 
    - parsing and analyzing markup language:
        - `XML`
        - `xml2`
    - content extraction with matching 
        - (base R)
        - `stringr`
        - `stringi`


## Python web scraping toolbox

-  Get contents
    - `urllib`
    - `http.client` (`httplib` in Python 2)
    - `requests`
-  Parse and extract information 
    - parsing and analyzing markup language:
        - `bs4` (`BeautifulSoup`)
        - `lxml`
    - content extraction with matching 
        - `re`
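
As a small illustration of content extraction with matching, the sketch below uses `re` on strings taken from the fortunes page (the `date` attribute values and one source line). The function names `split_date` and `find_years` are made up for this example.

```python
import re

# "Month/Year" as used in the date attributes of the fortunes page.
DATE_PATTERN = re.compile(r"(?P<month>[A-Za-z]+)/(?P<year>\d{4})")


def split_date(value):
    """Split 'June/2003' into ('June', 2003); return None if it does not match."""
    match = DATE_PATTERN.fullmatch(value)
    if match is None:
        return None
    return match.group("month"), int(match.group("year"))


def find_years(text):
    """Return every four-digit year starting with 19 or 20 found in the text."""
    return re.findall(r"\b(?:19|20)\d{2}\b", text)


print(split_date("June/2003"))
print(split_date("October/2011"))
print(find_years("Statistical Computing 2003, Reisensburg"))
```

Pattern matching like this works best on short strings that a parser has already extracted, rather than on a whole HTML page.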


## Selenium

- Standard tools for web scraping (e.g. `httr` in R or `urllib` in Python) may not work in some cases
- Reasons:
    - "Some websites don’t like to be webscraped. In these cases you may need to disguise your webscraping bot as a human being. Selenium is just the tool for that." [webscraping with Selenium](http://thiagomarzagao.com/2013/11/12/webscraping-with-selenium-part-1/)
    - The information is in non-static contents 
- Solution:
    - Use `selenium`, an automated testing suite for web applications
    - Manipulate an actual web browser (e.g. Chrome, Firefox) using Selenium drivers

With Selenium, you should be able to get whatever you can get with your browser (theoretically speaking...)
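
A minimal sketch of the idea, assuming the `selenium` package, Firefox, and its driver (geckodriver) are installed. The function name `fetch_rendered_html` is made up for this example, and pages that load content slowly may need explicit waits that are not shown here.

```python
def fetch_rendered_html(url):
    """Load a page in a real browser and return the HTML the browser ends up with."""
    # Imported inside the function so this file still loads without Selenium.
    from selenium import webdriver

    driver = webdriver.Firefox()  # assumption: Firefox and geckodriver are available
    try:
        driver.get(url)            # the browser runs the page's scripts
        return driver.page_source  # HTML as it currently stands in the browser
    finally:
        driver.quit()              # always close the browser window


# Example call (opens a browser window and needs network access):
# html = fetch_rendered_html(
#     "http://www.r-datacollection.com/materials/ch-6-ajax/fortunes/fortunes2.html"
# )
```

The returned string can then be handed to `BeautifulSoup` or `lxml` just like the static pages shown earlier.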


## Caveats

#### Web scraping is not always (and sometimes never) welcomed by site owners

#### Why?
- excessive traffic 
- influence on their revenues

You can be warned, blocked, and even sued. 

#### So, what to do?
1. Read the terms and conditions (T&C) carefully
2. Check `robots.txt` (cf. http://www.robotstxt.org/)
3. Get permission if possible
4. Be nice
    - place short breaks between fetches
    - scrape during off-peak hours
    - avoid scraping excessive amounts of material
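
Two of these points, checking `robots.txt` and placing short breaks between fetches, can be sketched with the standard library. The `robots.txt` content, the `example.com` URLs, and the helper names `load_rules` and `polite_crawl` are assumptions for illustration; the fetch function is a stand-in so that the sketch runs offline.

```python
import time
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real one is served at <site>/robots.txt.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def load_rules(robots_txt):
    """Build a parser from robots.txt text that is already in memory."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return rules


def polite_crawl(urls, rules, fetch, delay_seconds=1.0, agent="*"):
    """Fetch only the allowed URLs, pausing between consecutive requests."""
    pages = {}
    for url in urls:
        if not rules.can_fetch(agent, url):
            continue  # the site asks robots to stay away from this path
        if pages:
            time.sleep(delay_seconds)  # short break between fetches
        pages[url] = fetch(url)
    return pages


rules = load_rules(ROBOTS_TXT)
urls = [
    "http://example.com/public/page1.html",
    "http://example.com/private/data.html",
    "http://example.com/public/page2.html",
]
# A stand-in fetch function keeps the sketch offline; swap in a real downloader.
pages = polite_crawl(urls, rules, fetch=lambda url: "<html>...</html>", delay_seconds=0.1)
print(list(pages))
```

For a live site, `RobotFileParser` can also download the file itself via `set_url()` followed by `read()`.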



## Further reading

#### Python
*  [Automate the Boring Stuff with Python](https://automatetheboringstuff.com/)
*  [Web Scraping with Python](http://shop.oreilly.com/product/0636920034391.do)

#### R
*  [Automated Data Collection with R](http://www.r-datacollection.com/)


## Coming soon

* **Lab**: Simple web-scraping exercise
* **Next week**: API