# Transforming the Web into Data (with Python)

# Agenda

- conceptual introduction to web scraping
- tools for non-programmers
- tools for python programmers
- code tour
- break
- *scrape from scratch* exercise

# whoami

- [Matt Burton (@mcburton)](http://twitter.com/mcburton)
- PhD University of Michigan School of Information
- Visiting Assistant Professor / Postdoc
- [School of Information Science](http://www.ischool.pitt.edu/) / [University Library System](http://library.pitt.edu/) Digital Scholarship Services
- love scraping the web
- long time programmer & web developer
- cut my teeth writing translators/scrapers for [Zotero](http://www.zotero.org/)

# why scrape the web

- there is a lot of human activity on the web, which produces
- new and unique data/traces, that can lead to
- insight & understanding for data science, the social sciences, and the humanities.
- Ground Truthiness - *remember the web is only a particular representation of human behavior*
- You can also scrape for [fun & profit](https://blog.hartleybrody.com/guide-to-web-scraping/) 💰

# so what is scraping the web?

Web scraping (web harvesting or web data extraction) is a computer software technique of **extracting information from websites**. Usually, such software programs **simulate human exploration** of the World Wide Web by either implementing low-level Hypertext Transfer Protocol (HTTP), or embedding a fully-fledged web browser, such as Internet Explorer or Mozilla Firefox. - [Wikipedia](http://en.wikipedia.org/wiki/Web_scraping)

# conceptual introduction to web scraping

- there are roughly three steps - *results may vary*
    1. fetching resources - *asking a computer "hey, can you send me `http://google.com`?"*
    2. parsing documents - *creating a **machine readable** representation of a web page*
    3. extracting data - *pulling out just the information of interest*

![Scraping for fun & profit](scraping.png)

# fetching resources


- [Hyper Text Transfer Protocol](http://en.wikipedia.org/wiki/Hypertext_Transfer_Protocol) (HTTP)
    - fundamentally about **requests** & **responses**
    - the language of the web
    - four common request methods: **GET, POST, PUT, DELETE**
    - **URLs** point to **resources**
- verbs & nouns
    - request methods are the verbs
    - resources are the nouns
    - URLs are the proper nouns
- stateless
    - doesn't have a good memory
    - [sessions](http://en.wikipedia.org/wiki/Session_%28computer_science%29#HTTP_session_token) - *how HTTP servers remember "state"*
    - [cookies](http://en.wikipedia.org/wiki/HTTP_cookie) - *the token passed in HTTP requests & responses*
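
To make the verbs & nouns picture a bit more concrete, here is a minimal sketch using only the Python 3 standard library. It builds requests without sending them, so nothing touches the network. The `example.com` URLs, the `User-Agent` string, and the `sessionid` cookie are all made up for illustration.

```python
from http.cookies import SimpleCookie
from urllib.parse import urlsplit
from urllib.request import Request

# Build (but do not send) a GET request. URL and header are illustrative.
req = Request(
    "http://example.com/articles?page=2",
    headers={"User-Agent": "my-scraper/0.1"},
    method="GET",
)

parts = urlsplit(req.full_url)
print(req.get_method())  # the verb
print(parts.netloc)      # the server that would receive the request
print(parts.path)        # the resource (noun) being asked for
print(parts.query)       # extra parameters sent along

# HTTP is stateless, so a server hands out a token in a response header
# and the client sends it back on later requests.
set_cookie_header = "sessionid=abc123; Path=/; HttpOnly"  # hypothetical value
cookie = SimpleCookie()
cookie.load(set_cookie_header)
session_token = cookie["sessionid"].value

# Attach the token to a follow-up request so the server "remembers" us.
follow_up = Request("http://example.com/account")
follow_up.add_header("Cookie", f"sessionid={session_token}")
```

In practice, HTTP libraries usually manage cookies for you; the point here is only that a "session" is a token travelling back and forth in headers.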

# fetching resources

- web pages - *made for humans*
    - [HyperText Markup Language](http://en.wikipedia.org/wiki/HTML)(HTML) - *defines document structure*
    - [Cascading Style Sheets](http://en.wikipedia.org/wiki/Cascading_Style_Sheets)(CSS) - *makes web pages pretty*
    - [JavaScript](http://en.wikipedia.org/wiki/JavaScript) - *makes web pages interactive*
    - so many more standards... [W3C](http://www.w3.org/standards/)
- APIs - *made for machines*
    - [Application Programming Interface](http://en.wikipedia.org/wiki/Application_programming_interface) - *fancy name for how computer machines connect with each other*
    - how to get data from the *social web* (i.e. Twitter, Facebook, etc.)
    - related, but distinct from web scraping (more structured, access control)
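
The difference between "made for humans" and "made for machines" is easiest to see side by side. In this sketch both the JSON payload and the HTML snippet are invented (they are not the output of any real API or site), and only the Python 3 standard library is used.

```python
import json
from html.parser import HTMLParser

# The same fact, as an API might return it and as a web page might show it.
api_response = '{"user": "example_user", "followers": 1234}'
web_page = '<div class="profile"><span class="count">1,234</span> followers</div>'

# API: one call turns the response into a ready-made data structure.
record = json.loads(api_response)
followers_from_api = record["followers"]


# Web page: find the right tag, collect its text, then clean it up.
class CountFinder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_count = False
        self.text = ""

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        if tag == "span" and ("class", "count") in attrs:
            self.in_count = True

    def handle_endtag(self, tag):
        if tag == "span":
            self.in_count = False

    def handle_data(self, data):
        if self.in_count:
            self.text += data


finder = CountFinder()
finder.feed(web_page)
followers_from_html = int(finder.text.replace(",", ""))

print(followers_from_api, followers_from_html)
```

Both routes end at the same number, but the HTML route depends on the page layout staying the same.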

    

# parsing documents

- HTML documents are composed of elements or **tags**
    - the `<html>` tag is the **root** of the tree
- the [HTML specification defines a bunch of tags](http://www.w3schools.com/tags/)
    - `<p>this is a paragraph tag with text <em>inside</em> of it</p>`
    - `<a href="http://pitt.edu">This is an anchor tag, basically a link</a>`
    - not enough time to review all of them
- parsing transforms the barf into a tree of elements
    - also called the [Document Object Model](http://en.wikipedia.org/wiki/Document_Object_Model) or DOM

```html
<!DOCTYPE html>
<html>
  <head>
    <title>A basic webpage</title>
  </head>
  <body>
    <h1>My First Heading</h1>
    <p>My first paragraph.</p>
    <ul>
      <li>First item in an unordered list</li>
      <li>Second item in an unordered list</li>
    </ul>
    <div class="stuff">
      <p>Another paragraph separated by a div element.</p>
    </div>
    <table></table>
  </body>
</html>
```
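
As a rough illustration of "turning markup into a tree", the sketch below walks a simplified, fully closed version of a page like the one above and records how deeply each element is nested. It uses Python 3's built-in `html.parser`, which only reports tags as it meets them; the depth bookkeeping is my own and assumes every non-void tag is properly closed. Libraries such as Beautiful Soup or lxml build the real tree for you.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag (a partial list, enough for this demo).
VOID_TAGS = {"meta", "link", "br", "img", "hr", "input"}


class OutlineParser(HTMLParser):
    """Record each element together with its depth in the tree."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.outline = []  # list of (depth, tag) pairs

    def handle_starttag(self, tag, attrs):
        self.outline.append((self.depth, tag))
        if tag not in VOID_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS:
            self.depth -= 1


html = """<html>
  <head><title>A basic webpage</title></head>
  <body>
    <h1>My First Heading</h1>
    <p>My first paragraph.</p>
    <div class="stuff"><p>Another paragraph.</p></div>
  </body>
</html>"""

parser = OutlineParser()
parser.feed(html)
for depth, tag in parser.outline:
    print("  " * depth + tag)  # indentation shows the nesting
```

The indented printout is the tree: `html` at the root, `head` and `body` as its children, and so on down to the `p` inside the `div`.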

! curl -s http://pitt.edu | head -n30

<!DOCTYPE html>
<!--[if IEMobile 7]><html class="iem7"  lang="en" dir="ltr"><![endif]-->
<!--[if lte IE 6]><html class="lt-ie9 lt-ie8 lt-ie7"  lang="en" dir="ltr"><![endif]-->
<!--[if (IE 7)&(!IEMobile)]><html class="lt-ie9 lt-ie8"  lang="en" dir="ltr"><![endif]-->
<!--[if IE 8]><html class="lt-ie9"  lang="en" dir="ltr"><![endif]-->
<!--[if (gte IE 9)|(gt IEMobile 7)]><!--><html  lang="en" dir="ltr" prefix="content: http://purl.org/rss/1.0/modules/content/ dc: http://purl.org/dc/terms/ foaf: http://xmlns.com/foaf/0.1/ og: http://ogp.me/ns# rdfs: http://www.w3.org/2000/01/rdf-schema# sioc: http://rdfs.org/sioc/ns# sioct: http://rdfs.org/sioc/types# skos: http://www.w3.org/2004/02/skos/core# xsd: http://www.w3.org/2001/XMLSchema#"><!--<![endif]-->

<head profile="http://www.w3.org/1999/xhtml/vocab">
  <meta charset="utf-8" />
<link rel="shortcut icon" href="http://www.pitt.edu/sites/default/files/pitt_favicon_0.ico" type="image/vnd.microsoft.icon" />
<link rel="shortlink" href=

# extracting data

- ok, now we are going to get *really* technical
- pull information out of the tree and *push it somewhere else*

![push somewhere else](http://i3.kym-cdn.com/entries/icons/original/000/003/356/test.gif)

# extracting data

- how?
    - copy & paste
    - automated scripts
- if you have a lot of data, copy & paste probably won't work for you
- if the data are on multiple pages, you will need to **crawl** with a **spider**
- **web crawlers** extract the links from a web page, fetch those pages, extract links, fetch, extract, fetch...
- scripts and tools help automate this process
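
The fetch, extract, fetch loop can be sketched in a few lines of Python 3. To keep it runnable offline, the "website" here is a made-up dictionary (`FAKE_SITE`) and `crawl` accepts any function that maps a URL to HTML; the function names and the `max_pages` limit are my own choices, not part of any library.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect the href of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def extract_links(html, base_url):
    """Return absolute URLs for all links found in `html`."""
    parser = LinkExtractor()
    parser.feed(html)
    return [urljoin(base_url, href) for href in parser.links]


def crawl(start_url, fetch, max_pages=10):
    """Breadth-first crawl; `fetch(url)` returns HTML or None."""
    seen = {start_url}
    queue = deque([start_url])
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:  # page missing or fetch failed
            continue
        visited.append(url)
        for link in extract_links(html, url):
            if link not in seen:  # never queue the same page twice
                seen.add(link)
                queue.append(link)
    return visited


# A tiny made-up "website" so the sketch runs without a network.
FAKE_SITE = {
    "http://example.com/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.com/a": '<a href="/b">B</a> <a href="/">home</a>',
    "http://example.com/b": '<a href="/c">C</a>',
}

pages = crawl("http://example.com/", FAKE_SITE.get)
print(pages)
```

The `seen` set is what keeps the spider from going in circles, and `max_pages` keeps it from wandering off across the whole web.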

# extracting data

- first step: *where are the data* in the HTML tree?
- right-click & select "inspect element" - *works in Firefox and Chrome, and in Safari if developer mode is enabled*
    - THE MATRIX 


# extracting data

- *selection* is the key
- many different ways to select HTML tags
    - library APIs - *python code for navigating & searching parsed HTML documents*
    - [CSS selectors](http://www.w3schools.com/cssref/css_selectors.asp) - *used by Cascading Style Sheets, good to know* 
    - [XPATH](http://en.wikipedia.org/wiki/XPath) - *a query language for XML documents, works for HTML too!*
    - REGEX? - [*DON'T EVER USE REGEX TO PARSE HTML!*](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) 
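
Here is a small sketch of selection in practice, using the library-API and XPath styles with Python 3's built-in `xml.etree.ElementTree`. Two caveats: ElementTree supports only a subset of XPath, and it requires well-formed XML, so the page below is a tidy made-up example. Messy real-world HTML needs a forgiving parser such as Beautiful Soup or lxml.

```python
import xml.etree.ElementTree as ET

# A tidy, well-formed page (made up for this example).
page = """<html>
  <body>
    <h1>My First Heading</h1>
    <ul>
      <li>First item</li>
      <li>Second item</li>
    </ul>
    <div class="stuff">
      <p>Another paragraph with a <a href="http://pitt.edu">link</a>.</p>
    </div>
  </body>
</html>"""

root = ET.fromstring(page)

# Library API style: follow a path of tag names down from the root.
heading = root.find("body/h1").text

# XPath: every <li> anywhere in the tree.
# (The equivalent CSS selector would be `li`.)
items = [li.text for li in root.findall(".//li")]

# XPath with an attribute test: <p> directly inside <div class="stuff">.
# (The equivalent CSS selector would be `div.stuff > p`.)
paragraph = root.find(".//div[@class='stuff']/p")
paragraph_text = "".join(paragraph.itertext())  # includes text of nested tags

# Attributes are read with .get().
link_target = root.find(".//a").get("href")

print(heading, items, paragraph_text, link_target, sep="\n")
```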

# basic workflow

1. fetch pages
2. extract data
3. extract links
4. fetch more pages
5. ...
6. profit?

# DATA CLEANING!

# challenges in web scraping

- logins, paywalls, and access control
    - these are not impossible; tools support HTTP sessions & cookies
    - throttling - *fine line between scraping & denial of service attack*
    - THE LAW - *read the terms of service, copyright? FAIR USE!* 
- dynamic websites
    - javascript - *hard to scrape because the DOM changes*
    - AJAX or XMLHttpRequest - *pages can asynchronously fetch data & update themselves*
- the document vs. application centric web
    - scraping gmail?
    - APIs help, if they exist
- mobile web / apps????
    -  ¯\\_(ツ)_/¯
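
On the throttling point, one possible way to stay on the polite side of that fine line is to honor a site's `robots.txt` and pause between requests. The sketch below uses Python 3's `urllib.robotparser` (the `crawl_delay` method needs Python 3.6 or later). The `robots.txt` content, the bot name, and the `Throttle` class are all made up for illustration, and none of this is a substitute for reading the terms of service.

```python
import time
from urllib.robotparser import RobotFileParser

# A made-up robots.txt; a real one is served at http://<site>/robots.txt.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

robots = RobotFileParser()
robots.parse(ROBOTS_TXT.splitlines())

USER_AGENT = "my-scraper"  # hypothetical name for our bot


def allowed(url):
    """True if robots.txt permits our bot to fetch `url`."""
    return robots.can_fetch(USER_AGENT, url)


class Throttle:
    """Keep consecutive requests at least `delay` seconds apart."""

    def __init__(self, delay):
        self.delay = delay
        self.last_request = None

    def wait(self):
        if self.last_request is not None:
            remaining = self.delay - (time.monotonic() - self.last_request)
            if remaining > 0:
                time.sleep(remaining)
        self.last_request = time.monotonic()


# Use the site's requested delay, falling back to 1 second if none is given.
delay = robots.crawl_delay(USER_AGENT) or 1
throttle = Throttle(delay)

print(allowed("http://example.com/index.html"))
print(allowed("http://example.com/private/data.html"))
```

In a crawl loop you would call `throttle.wait()` right before each fetch and skip any URL for which `allowed(url)` is false.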

# in a perfect world....


# ...we'd scrape *web archives* not the open web

![](web-research.png)

![](web-archived-research.png)

# tools for non-programmers

- [Kimono](https://www.kimonolabs.com) - *really slick and easy to use*
- [Import.io](http://import.io) - *less slick than kimono, but fewer knobs*
- [Wget](http://en.wikipedia.org/wiki/Wget) - *Swiss army knife of web scraping tools, command line*
- [HTTrack](http://www.httrack.com/) - *windows tool for copying websites, GUI*
- [ScraperWiki](https://scraperwiki.com/) - *a service that costs money, if you have a grant...*
- [Scraper Plugin](https://chrome.google.com/webstore/detail/scraper/mbigbapnjcgaffohmbkdlecaccepngjd) – *chrome plugin instead of a service, looks pretty easy to use*
- [Diffbot](http://diffbot.com/) – *more advanced extraction, really nice guys, costs money but they support research if you ask them*
- **EMAIL** – *sometimes it doesn't hurt to ask!*

# kimono example

- first visit https://www.kimonolabs.com/
- click "Get started, click to install" (NOTE: must be using the Chrome web browser)
- once it is installed, visit http://news.ycombinator.com
- click on the kimono plugin button to activate the data extraction tool
- demo time!

# tools for python programmers

## fetching

- [`urllib2`](https://docs.python.org/2/library/urllib2.html) - *batteries included*
- [Requests: HTTP for Humans](http://docs.python-requests.org/en/latest/) - *much nicer than urllib2*
- [Scrapy](http://scrapy.org/) - *a complete framework for web crawling and data extraction*
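
For a taste of the "batteries included" option, here is a minimal fetch sketch. It assumes Python 3, where the functionality of `urllib2` lives in `urllib.request`. The `fetch` helper and its default `User-Agent` string are my own inventions. The demo call uses a `data:` URL, which carries its own content, so the example runs without network access.

```python
from urllib.request import Request, urlopen


def fetch(url, user_agent="my-scraper/0.1", timeout=10):
    """Return the body of `url` as text, decoded with the declared charset."""
    request = Request(url, headers={"User-Agent": user_agent})
    with urlopen(request, timeout=timeout) as response:
        # Fall back to UTF-8 when the server does not declare a charset.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


# Offline demo: the "page" is embedded (percent-encoded) in the URL itself.
# For a real page you would call something like fetch("http://pitt.edu").
html = fetch("data:text/html;charset=utf-8,%3Ch1%3EHello%3C%2Fh1%3E")
print(html)
```

With the third-party Requests library, the rough equivalent is `requests.get(url).text`.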

## parsing & extracting

- [Beautiful Soup](http://www.crummy.com/software/BeautifulSoup/) - *the most popular library*
- [Soupy](http://soupy.readthedocs.org/en/stable/) - *a wrapper around BS to make life easier*
- [Scrapely](https://github.com/scrapy/scrapely) - *another tool for extracting structured data from web pages*
- [lxml](http://lxml.de/) - *a bit lower level, supports XPATH, which I prefer*

## data management

- [Pandas Data Analysis Library](http://pandas.pydata.org/) - *think R dataframes*
- [`sqlite`](https://docs.python.org/2/library/sqlite3.html) - *nice lightweight relational database*
- [`json`](https://docs.python.org/2/library/json.html) or [`csv`](https://docs.python.org/2/library/csv.html?highlight=csv) - *serializing data to a file*
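
Once the data are extracted, saving them takes only a few lines with the standard library. This sketch assumes Python 3; the `records` list is made-up data standing in for whatever your extraction step produced, and it writes to in-memory buffers and an in-memory database so that it leaves no files behind.

```python
import csv
import io
import json
import sqlite3

# Made-up records standing in for scraped data.
records = [
    {"title": "First post", "points": 10},
    {"title": "Second post", "points": 25},
]

# JSON: keeps nested structure, easy to load back into Python.
json_text = json.dumps(records, indent=2)

# CSV: flat rows, opens in any spreadsheet.
# (Use open("posts.csv", "w", newline="") instead of StringIO for a real file.)
csv_buffer = io.StringIO()
writer = csv.DictWriter(csv_buffer, fieldnames=["title", "points"])
writer.writeheader()
writer.writerows(records)
csv_text = csv_buffer.getvalue()

# SQLite: a real relational database in a single file (here, in memory).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts (title TEXT, points INTEGER)")
conn.executemany("INSERT INTO posts VALUES (:title, :points)", records)
conn.commit()
top_title = conn.execute(
    "SELECT title FROM posts ORDER BY points DESC LIMIT 1"
).fetchone()[0]
row_count = conn.execute("SELECT COUNT(*) FROM posts").fetchone()[0]
conn.close()

print(csv_text)
print(top_title, row_count)
```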

# Where to go next?


## Learning Python

- [Python for Informatics](http://pythonlearn.com/book.php) - *A book as well as [online course materials](http://pythonlearn.com/) and even a [MOOC on Coursera](https://www.coursera.org/course/pythonlearn). Everything is FREE and OPEN SOURCE.*
- [Learn Python the Hard Way](http://learnpythonthehardway.org/) - *Another book & companion website that is NOT FREE.*
- [Codecademy Python Track](http://www.codecademy.com/tracks/python) - *A set of online interactive tutorials for learning python.*
- [Google](http://google.com) - *Seriously, there are a million resources online for learning Python. Try a few of them out and see which ones work best for you.*



## Web Scraping

- [The Ultimate Guide to Web Scraping](https://blog.hartleybrody.com/guide-to-web-scraping/) - *A short book that provides a conceptual introduction to web scraping*
    - ["I Don't Need No Stinking API"](https://blog.hartleybrody.com/web-scraping/) - *A popular blog post written by the author of the aforementioned book.*
- [Mining the Social Web, 2nd Edition](http://shop.oreilly.com/product/0636920030195.do) - *An excellent book for more advanced programmers who are interested in collecting and analyzing data from social websites like Twitter, Facebook, and Github.*
- [Web Scraping with Python](http://shop.oreilly.com/product/0636920034391.do) - *A new book coming summer of 2015 that appears to cover the more technical aspects of scraping the web with python.*
- [Google](http://google.com) - *Again, seriously, there are a million tutorials on the web. Some are more technical than others.*