# Web Scraping with Python 

**An Introduction to Data Ingestion from the Internet &middot; March 31, 2017**

## Outline 

1. Introduction 
2. Basic Workflow 
3. Basic Page Fetch 
4. HTTP Basics 
5. Parsing Data 
6. HTML Basics 
7. CSS Selectors 
8. Data Extraction 
9. Data Storage 
10. Scraping Basics

## Introduction 

Two of the most popular ways of ingesting data from the internet are web scraping and web crawling. Scraping (done by scrapers) refers to the automated extraction of specific information from a web page. This information is often a page's text content, but it may also include the headers, the date the page was published, what links are present on the page, or any other specific information the page contains. 

Crawling (done by crawlers or spiders) involves the traversal of a website's link network, while saving or indexing all the pages in that network. 

Scraping is done with the explicit purpose of extracting specific information from a page, while crawling is done to obtain information about link networks within and between websites. 

It is possible to both crawl a website and scrape each of the pages, but only if we know what specific content we want from each page and have information about its structure in advance.

![Scraping vs. Crawling](images/scraping_v_crawling.png)

### What is Web Scraping?

 - Automated extraction of specific information from a web page. 
 - Often a page's text content, but it may also include: 
     - Headers
     - Date the page was published
     - Links present on the page
     - Any other specific information the page contains
 - Objective: extracting specific information from a page
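
As a minimal sketch of what "extracting specific information" can look like, the example below pulls the page title and the links out of an HTML string using only Python's standard-library `html.parser`. The `PageScraper` class, the `scrape_html` function, and the sample page are illustrative assumptions, not part of the original material; a real project would more likely use a dedicated parsing library.

```python
from html.parser import HTMLParser


class PageScraper(HTMLParser):
    """Collects the <title> text and every link (href) found on a page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Only text that appears inside <title>...</title> is kept.
        if self._in_title:
            self.title += data


def scrape_html(html):
    """Return only the fields we decided we want from one page's HTML."""
    parser = PageScraper()
    parser.feed(html)
    parser.close()
    return {"title": parser.title.strip(), "links": parser.links}


# Hypothetical page content, used only for illustration.
SAMPLE_HTML = """
<html>
  <head><title>Example Article</title></head>
  <body>
    <h1>Example Article</h1>
    <a href="/about">About</a>
    <a href="/archive/2017">Archive</a>
  </body>
</html>
"""

result = scrape_html(SAMPLE_HTML)
print(result)
```

Note that the scraper ignores everything except the two fields it was written for, which is why each site usually needs its own extraction logic.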

#### Challenges of Web Scraping

 - Need to determine what information you want
 - Need custom scraper for each site
 - Different pages have different structure
 - Page structure/content changes periodically
 - JavaScript can make scraping difficult
 - Potential legal issues


### What is Web Crawling?

 - Traversal of a website's link network
 - Saving or indexing all the pages in that network
 - Obtaining information about link networks within and between websites


#### Challenges of Web Crawling

 - Need to know the site structure in advance
 - Determining depth of crawl
 - Latency/bandwidth variations
 - Site mirrors and duplicate pages
 - Spider/crawler traps
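
Two of these challenges, crawl depth and duplicate pages, can be illustrated with a short sketch. The example below does a breadth-first traversal of a link network, stops following links at a maximum depth, and records each URL only once. The `SITE` dictionary stands in for a real website and the `crawl` function is a hypothetical helper; a real crawler would fetch each page and extract its links instead of looking them up.

```python
from collections import deque

# Hypothetical site: each URL maps to the links found on that page.
SITE = {
    "/": ["/a", "/b"],
    "/a": ["/b", "/c"],
    "/b": ["/", "/a"],  # links back to earlier pages, forming a cycle
    "/c": ["/d"],
    "/d": [],
}


def crawl(start, get_links, max_depth):
    """Breadth-first traversal returning {url: depth} for every page reached."""
    seen = {start: 0}
    queue = deque([start])
    while queue:
        url = queue.popleft()
        depth = seen[url]
        if depth == max_depth:
            continue  # do not follow links beyond the depth limit
        for link in get_links(url):
            if link not in seen:  # skip pages that were already visited
                seen[link] = depth + 1
                queue.append(link)
    return seen


pages = crawl("/", lambda url: SITE.get(url, []), max_depth=2)
print(pages)
```

The `seen` dictionary is what keeps the cycle between `/`, `/a`, and `/b` from looping forever, and the depth limit is what keeps `/d` out of the result. Neither is a complete defense against crawler traps, which can generate unlimited distinct URLs.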

### From Crawling to Scraping

 - Different Objectives
     - Scraping: extracting specific information from a page.
     - Crawling: obtaining information about link networks within and between websites.

 - Possible to crawl a site and scrape pages.
 - Need to know the specific content we want from each page.
 - Need to have information about site structure in advance.


## Basic Workflow 

Create a function that takes a URL as input and returns data. 

```python
def scrape(url):
    data = None  # placeholder until the steps below are implemented
    # Perform a web request
    # Perform data extraction
    # Handle or raise exceptions
    return data
```

This simple function is then applied to every page of a site (or of several sites) and can be parallelized for better performance. Visually:

![Basic Workflow](images/workflow.png)
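
The workflow above could be filled in roughly as follows. This is a sketch under stated assumptions: it uses only the standard library (`urllib.request`, `html.parser`, `concurrent.futures`), it extracts just the page title, and the names `TitleParser`, `fetch`, `scrape_all`, and `fake_fetch` are hypothetical helpers introduced for illustration. The `fetcher` parameter exists so the example can be run offline against fake pages.

```python
from concurrent.futures import ThreadPoolExecutor
from html.parser import HTMLParser
from urllib.request import urlopen


class TitleParser(HTMLParser):
    """Extracts the text of the <title> element."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def fetch(url, timeout=10):
    """Perform the web request and return the decoded page body."""
    with urlopen(url, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def scrape(url, fetcher=fetch):
    """Fetch one page and return its extracted data, or an error record."""
    try:
        html = fetcher(url)
    except (OSError, ValueError) as exc:  # network failures, malformed URLs
        return {"url": url, "error": str(exc)}
    parser = TitleParser()
    parser.feed(html)
    return {"url": url, "title": parser.title.strip()}


def scrape_all(urls, fetcher=fetch, workers=4):
    """Apply scrape() to many URLs in parallel, preserving input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda u: scrape(u, fetcher), urls))


# Offline demonstration with hypothetical pages, so no network is needed.
FAKE_PAGES = {
    "http://example.com/1": "<title>First</title>",
    "http://example.com/2": "<title>Second</title>",
}


def fake_fetch(url):
    if url not in FAKE_PAGES:
        raise OSError("page not found: " + url)
    return FAKE_PAGES[url]


urls = list(FAKE_PAGES) + ["http://example.com/missing"]
results = scrape_all(urls, fetcher=fake_fetch)
for row in results:
    print(row)
```

Threads are a reasonable choice here because the work is dominated by waiting on network responses. Returning an error record rather than raising is one possible design; it lets a single failed page be logged without stopping the rest of the batch.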