# Web Scraping with Python

March 31, 2017

## Outline

 - Introduction
 - Data sources
 - Types of available data 
 - RESTful APIs
 - Scraping
 - Crawling
 - Useful tools
 - Integration

## Collecting Data from the Web 
 
Your mission is to get the data you need for the job you need to do.

However:
- Data is designed for operations, not analysis.
- Data used in analysis usually needs to be denormalized.
- There can be many gatekeepers.

So what makes a good data source?

As data scientists, we rely heavily on structure and patterns, not only in the content of our data, but in its history and provenance. In general, good data sources have a determinable structure, where different pieces of content are organized according to some schema and can be extracted systematically via the application of some logic to that schema. If there is no common structure or schema between documents, it becomes difficult to discern any patterns for extracting the information we want, which often results in either no data retrieved at all or significant cleaning required to correct what the ingestion process got wrong.

### Publicly Available Datasets

 - [Amazon S3 Cloud Public Datasets](https://aws.amazon.com/datasets/)
 - [UCI Machine Learning Repository](http://archive.ics.uci.edu/ml/)
 - [Awesome Public Datasets](https://github.com/caesar0301/awesome-public-datasets)
 - [More Datasets](http://rs.io/100-interesting-data-sets-for-statistics/)
 - [Kaggle](https://www.kaggle.com/)
 - [Data.gov](https://www.data.gov/)
 - [Sunlight Foundation](https://sunlightfoundation.com/)

Strategy: look for academic data sets used in research on the techniques you’re interested in - these may lead you to initial data or other primary sources.

Also, we’re in DC - Data.gov is a very important resource for data collection and aggregation - with APIs that are constantly being updated with new data. More importantly, Federal agencies in this area are desperate for community data work and visualizations - there are reverse pitches and more to get data scientists involved. Also keep in mind that Data.gov is just a start - the Federal Reserve Board has massive amounts of data, but cannot participate in Data.gov.

**In the end, the best data is always the data you gather yourself.**

## Features of Data in the Wild

Data comes from a variety of sources, in formats designed for the producer's needs and not necessarily yours. Once you have it stored locally, you can wrangle it to fit your needs and load it into a database.

### Common Data Formats

 - CSV: stores tabular data in plain text where each row is a record and the values are delimited by commas.
 - JSON: a data-interchange format that is easy for humans to read and write and for machines to parse and generate.
 - XML: a markup language designed to carry data, with a focus on what the data is.
 - HTML: a markup language designed to display data, with a focus on how the data looks.
 
### Serialization

 - Converting structured data into a format that can be shared, stored, or updated.
 - The original structure can be restored.
 - Compact formats can reduce the size of the data so that it takes up less disk space when stored or less bandwidth when shared.
 - .write()
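
As a minimal sketch of the round trip described above, the standard-library `json` module can serialize a Python structure to text and restore it. The `record` dictionary is a made-up example, not data from this workshop.

```python
import json

# Hypothetical record, used only for illustration.
record = {"title": "Web Scraping with Python", "tags": ["scraping", "crawling"], "pages": 12}

# Serialize: convert the in-memory structure to a string that can be stored or sent.
serialized = json.dumps(record)

# Deserialize: restore the original structure from the string.
restored = json.loads(serialized)

# Compact separators drop optional whitespace, which shrinks the output slightly.
compact = json.dumps(record, separators=(",", ":"))
```

`json.dump()` and `json.load()` do the same thing against an open file object, which is where the `.write()` and `.read()` calls come in.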

### Parsing

 - Processing input into meaningful structures to extract information.
 - Examples:
     - A student parses a sentence into subject, verb, and object.
     - A compiler parses source code.
     - A CSV parser reads a stream according to rules (comma delimiters, quoting, etc) to extract the data in each line of a file.
 - .read()
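
The CSV case can be sketched with the standard-library `csv` module. The sample text below is invented for illustration; it includes a quoted field with an embedded comma to show why splitting on commas by hand is not enough.

```python
import csv
import io

# Hypothetical CSV text; the quoted field contains a comma that must not split the value.
raw = 'name,city\n"Smith, Jane",Washington\nLee,Baltimore\n'

# DictReader applies the CSV rules (delimiters, quoting) and yields one mapping per row,
# keyed by the column names in the header line.
rows = list(csv.DictReader(io.StringIO(raw)))

for row in rows:
    print(row["name"], "lives in", row["city"])
```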


## APIs

Computer scientists use the term API broadly, but these days it most often refers to Web APIs, which makes this essentially a data ingestion topic.

    “In the simplest terms, APIs are sets of requirements that govern how one application can talk to another. APIs aren't at all new; whenever you use a desktop or laptop, APIs are what make it possible to move information between programs.

    These days, APIs are especially important because they dictate how developers can create new apps that tap into big Web services—social networks like Facebook or Pinterest, for instance, or utilities like Google Maps or Dropbox. The developer of a game app, for instance, can use the Dropbox API to let users store their saved games in the Dropbox cloud instead of working out some other cloud-storage option from scratch.

    Viewed more broadly, though, APIs make possible a sprawling array of Web-service "mashups," in which developers mix and match APIs from the likes of Google or Facebook or Twitter to create entirely new apps and services. In many ways, the widespread availability of APIs for major services is what's made the modern Web experience possible.”

http://readwrite.com/2013/09/19/api-defined

### Examples

 - Twitter
 - Amazon
 - Soundcloud
 - Goodreads
 - Weather Underground
 - Wordnik

## REST

REST is a simple way to organize interactions between independent systems.

REST allows you to interact with minimal overhead with clients as diverse as mobile phones and other websites. In theory, REST is not tied to the web, but it's almost always implemented as such, and was inspired by HTTP. As a result, REST can be used wherever HTTP can.

So what is HTTP?

## HTTP Basics

 - HyperText Transfer Protocol
 - Foundation of data communication on the web
 - Send request, receive response
 - HTTP Request Methods
     - GET
     - HEAD
     - POST
     - PUT
     - DELETE
 - User Agent String - browser, OS, and other system info.
 - HTTP Status Codes
     - 1xx - Informational
     - 2xx - Success
     - 3xx - Redirection
     - 4xx - Client Error
     - 5xx - Server Error
 - TLS - successor to SSL; a protocol for secure communications.
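
To make the pieces above concrete, here is a small sketch using only the standard library. It builds a request object without sending it and classifies status codes by their first digit. The URL, the `my-scraper/0.1` agent string, and the `status_class` helper are placeholders invented for this example.

```python
from http import HTTPStatus
from urllib.request import Request

def status_class(code):
    """Map an HTTP status code to the class names listed above."""
    classes = {
        1: "Informational",
        2: "Success",
        3: "Redirection",
        4: "Client Error",
        5: "Server Error",
    }
    return classes.get(code // 100, "Unknown")

# Build (but do not send) a GET request with an explicit User-Agent header.
req = Request(
    "https://example.com/",
    headers={"User-Agent": "my-scraper/0.1"},
    method="GET",
)

# urllib stores header names capitalized as "User-agent".
print(req.get_method(), req.get_header("User-agent"))

# HTTPStatus gives each standard code a name and a reason phrase.
print(HTTPStatus.NOT_FOUND.value, HTTPStatus.NOT_FOUND.phrase, status_class(404))
```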

## Scraping and Crawling

Two of the most popular ways of ingesting data from the internet are web scraping and web crawling. Scraping (done by scrapers) refers to the automated extraction of specific information from a web page. This information is often a page's text content, but it may also include the headers, the date the page was published, what links are present on the page, or any other specific information the page contains. 

Crawling (done by crawlers or spiders) involves the traversal of a website's link network, while saving or indexing all the pages in that network. 

Scraping is done with an explicit purpose of extracting specific information from a page, while crawling is done in order to obtain information about link networks within and between websites. 

It is possible to both crawl a website and scrape each of the pages, but only if we know what specific content we want from each page and have information about its structure in advance.

### What is Web Scraping?

 - Automated extraction of specific information from a web page. 
 - Often a page's text content, but it may also include: 
     - Headers
     - Date the page was published
     - Links present on the page
     - Any other specific information the page contains
 - Objective: extracting specific information from a page
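
A minimal sketch of that kind of extraction, using the standard-library `html.parser` module, might look like the following. The HTML string and the `LinkAndTitleParser` class are invented for illustration; a real scraper would feed in the body of an HTTP response and would likely use a dedicated parsing library.

```python
from html.parser import HTMLParser

class LinkAndTitleParser(HTMLParser):
    """Collect the page title and every href found in <a> tags."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Text may arrive in several chunks, so accumulate it.
        if self._in_title:
            self.title += data

# Hypothetical page content.
html = """
<html><head><title>Example Page</title></head>
<body><h1>Hello</h1>
<a href="/about">About</a> <a href="https://example.com/">Home</a>
</body></html>
"""

parser = LinkAndTitleParser()
parser.feed(html)
print(parser.title, parser.links)
```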

#### Challenges of Web Scraping

 - Need to determine what information you want
 - Need custom scraper for each site
 - Different pages have different structure
 - Page structure/content changes periodically
 - Javascript can make scraping difficult
 - Potential legal issues


### What is Web Crawling?

 - Traversal of a website's link network
 - Saving or indexing all the pages in that network
 - Obtain information about link networks within and between websites.


#### Challenges of Web Crawling

 - Need to know the site structure in advance
 - Determining depth of crawl
 - Latency/bandwidth variations
 - Site mirrors and duplicate pages
 - Spider/crawler traps
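
Two of these challenges, crawl depth and duplicate pages, can be illustrated with a breadth-first traversal over a link graph. In this sketch the `SITE` dictionary stands in for a real website (each key is a page, each value the links found on it), and `crawl` is a hypothetical helper, not part of any library.

```python
from collections import deque

# Hypothetical site: each page maps to the links found on it.
SITE = {
    "/": ["/a", "/b"],
    "/a": ["/b", "/c"],
    "/b": ["/"],  # links back to the home page, forming a cycle
    "/c": ["/d"],
    "/d": [],
}

def crawl(start, get_links, max_depth):
    """Breadth-first traversal that visits each page once, up to max_depth links from start."""
    visited = {start}
    order = []
    queue = deque([(start, 0)])
    while queue:
        page, depth = queue.popleft()
        order.append(page)
        if depth == max_depth:
            continue  # do not follow links beyond the depth limit
        for link in get_links(page):
            if link not in visited:  # skip pages already seen (duplicates, cycles)
                visited.add(link)
                queue.append((link, depth + 1))
    return order

pages = crawl("/", lambda page: SITE.get(page, []), max_depth=2)
print(pages)
```

The `visited` set is what keeps the cycle between `/` and `/b` from looping forever, and the depth limit is a simple guard against crawler traps that generate endless links.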

### From Crawling to Scraping

 - Different Objectives
     - Scraping - extracting specific information from a page.
     - Crawling - obtain information about link networks within and between websites.

 - Possible to crawl a site and scrape pages.
 - Need to know specific content we want from each page.
 - Need to have information about site structure in advance.


## Tools

### Requests

Elegant, simple HTTP library for Python

How it works:
 - Make a request to a web page (get, post, put, etc.)
 - Receive a response from server
 - Read content of server response
     - Headers
     - Cookies
     - Content
     - Etc.
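
The steps above can be sketched as follows, assuming the third-party `requests` package is installed. The `fetch` helper, the URL, the query parameter, and the User-Agent value are all placeholders for illustration.

```python
import requests

def fetch(url, timeout=10):
    """Send a GET request and return the parts of the response listed above."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # raise an exception for 4xx/5xx responses
    return {
        "status": response.status_code,
        "headers": dict(response.headers),
        "cookies": response.cookies.get_dict(),
        "content": response.text,
    }

# Building a request without sending it shows how Requests encodes
# query parameters and headers before anything goes over the network.
prepared = requests.Request(
    "GET",
    "https://example.com/search",
    params={"q": "python"},
    headers={"User-Agent": "my-scraper/0.1"},
).prepare()

print(prepared.method, prepared.url)
```

Calling `fetch("https://example.com/")` performs a real network request, so it is left out of the snippet itself.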


### Scrapy

Open source framework for crawling websites and extracting structured data. 

 - Spiders - define how a certain site (or group of sites) will be scraped.
 - Selectors - select certain parts of the HTML document.
 - Items - objects that serve as simple containers used to collect the scraped data. 
 - Scrapy Shell - debug scraping code quickly without having to run spider. 
 - Pipelines, extractors, and more!


*For more advanced crawling and scraping, it may be worth looking into the following tools.*

* Selenium - a browser automation tool with Python bindings that allows you to simulate user interaction with a website.
* Apache Nutch - a highly extensible and scalable open source web crawler.

### Databases and Database Tools

**WORM (Write Once, Read Many) Storage**

 - PostgreSQL
 - Postgres.app
 - pgAdmin
 - SQLite - lightweight, self-contained SQL database engine.
 - psycopg2
 - Postico
 - Postman
 - JetBrains Database Navigator 
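
As a rough sketch of the write-once, read-many idea, the standard-library `sqlite3` module can store each raw page the first time it is fetched and ignore later writes for the same URL. The table layout, the `save_page` helper, and the sample pages are assumptions made for this example.

```python
import sqlite3

# An in-memory database keeps the sketch self-contained; a file path would persist the data.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE pages (url TEXT PRIMARY KEY, fetched_at TEXT NOT NULL, html TEXT NOT NULL)"
)

def save_page(conn, url, fetched_at, html):
    """Store a raw page once; a later write for the same URL is ignored."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR IGNORE INTO pages (url, fetched_at, html) VALUES (?, ?, ?)",
            (url, fetched_at, html),
        )

# Hypothetical scraped content; the second call does not overwrite the first.
save_page(conn, "https://example.com/", "2017-03-31T12:00:00", "<html>first</html>")
save_page(conn, "https://example.com/", "2017-03-31T13:00:00", "<html>second</html>")

stored = conn.execute(
    "SELECT html FROM pages WHERE url = ?", ("https://example.com/",)
).fetchone()[0]
count = conn.execute("SELECT COUNT(*) FROM pages").fetchone()[0]
print(count, stored)
```

Keeping the raw HTML untouched means parsing mistakes can be fixed later by re-reading the stored pages, without fetching them again.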


## Being a Good Citizen

 - robots.txt files - tell you what the site does and does not allow from crawlers.
 - Rate limiting - limiting the frequency at which you ping a website.
 - Too much traffic too quickly may bring down a smaller website.
 - Larger websites may block your IP address.
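
Both habits can be sketched with the standard library: `urllib.robotparser` reads the site's rules, and a pause between requests provides simple rate limiting. The robots.txt content, the `my-scraper` agent name, the one-second fallback delay, and the `polite_urls` helper are all invented for this example.

```python
import time
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; normally this is downloaded from the site's /robots.txt.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

robots = RobotFileParser()
robots.parse(ROBOTS_TXT.splitlines())

def polite_urls(urls, user_agent="my-scraper", sleep=time.sleep):
    """Yield only the URLs robots.txt allows, pausing between them."""
    # Use the site's requested delay, falling back to 1 second if none is given.
    delay = robots.crawl_delay(user_agent) or 1
    first = True
    for url in urls:
        if not robots.can_fetch(user_agent, url):
            continue  # skip anything the site asks crawlers to avoid
        if not first:
            sleep(delay)  # rate limiting: wait before the next request
        first = False
        yield url
```

A scraper would loop over `polite_urls(candidate_urls)` and fetch each URL it yields. The `sleep` parameter exists only so the pause can be swapped out when experimenting.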
