# Web Scraping with Python 

**An Introduction to Data Ingestion from the Internet &middot; March 31, 2017**

## Outline 

1. Introduction 
2. Basic Workflow 
3. Basic Page Fetch 
4. HTTP Basics 
5. Parsing Data 
6. HTML Basics  
7. Data Extraction 
8. Data Storage 
9. Scraping Basics

In [65]:
import os 
import bs4 
import requests 

from readability import readability 

## Introduction 

Two of the most popular ways of ingesting data from the internet are web scraping and web crawling. Scraping (done by scrapers) refers to the automated extraction of specific information from a web page. This information is often a page's text content, but it may also include the headers, the date the page was published, what links are present on the page, or any other specific information the page contains. 

Crawling (done by crawlers or spiders) involves the traversal of a website's link network, while saving or indexing all the pages in that network. 

Scraping is done with an explicit purpose of extracting specific information from a page, while crawling is done in order to obtain information about link networks within and between websites. 

It is possible to both crawl a website and scrape each of the pages, but only if we know what specific content we want from each page and have information about its structure in advance.

![Scraping vs. Crawling](images/scraping_v_crawling.png)

### What is Web Scraping?

 - Automated extraction of specific information from a web page. 
 - Often a page's text content, but it may also include: 
     - Headers
     - Date the page was published
     - Links present on the page
     - Any other specific information the page contains
 - Objective: extracting specific information from a page

#### Challenges of Web Scraping

 - Need to determine what information you want
 - Need custom scraper for each site
 - Different pages have different structure
 - Page structure/content changes periodically
 - JavaScript can make scraping difficult
 - Potential legal issues


### What is Web Crawling?

 - Traversal of a website's link network
 - Saving or indexing all the pages in that network
 - Obtain information about link networks within and between websites.


#### Challenges of Web Crawling

 - Need to know the site structure in advance
 - Determining depth of crawl
 - Latency/bandwidth variations
 - Site mirrors and duplicate pages
 - Spider/crawler traps

### From Crawling to Scraping

 - Different Objectives
     - Scraping - extracting specific information from a page.
     - Crawling - obtain information about link networks within and between websites.

 - Possible to crawl a site and scrape pages.
 - Need to know specific content we want from each page.
 - Need to have information about site structure in advance.
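
As a rough illustration of the two objectives, the sketch below handles one page both ways: it pulls out a specific piece of content (the title, which is scraping) and collects the outgoing links that a crawler would visit next. The page content and URL are made up so the example runs offline, and the standard library's `html.parser` is used only to keep the sketch dependency-free.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class PageParser(HTMLParser):
    """Collects the <title> text (scraping) and all link targets (crawling)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.title = ""
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page's own URL
                self.links.append(urljoin(self.base_url, href))

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


# Hypothetical page content and URL, used so the example runs offline
html = """
<html><head><title>Example Page</title></head>
<body><a href="/about">About</a> <a href="http://example.org/">Other</a></body></html>
"""

parser = PageParser("http://example.com/index.html")
parser.feed(html)

print(parser.title)  # the specific content a scraper is after
print(parser.links)  # the pages a crawler would visit next
```

A crawler would repeat this step for every URL in `parser.links`, keeping track of the pages it has already visited.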


## Basic Workflow 

Create a function that takes as input a url and returns data. 

In [2]:
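def scrape(url):
    # Perform a web request; raise an exception on an HTTP error status
    response = requests.get(url)
    response.raise_for_status()
    # Perform data extraction (placeholder: return the raw page text)
    data = response.text
    return data 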

This simple function is then operationalized across an entire site or sites and can be parallelized for better performance. Visually:

![Basic Workflow](images/workflow.png)
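
One possible way to run such a function over many URLs in parallel is a thread pool from the standard library. This is only a sketch: `scrape_all` and `fake_scrape` are hypothetical names, and `fake_scrape` stands in for a real scraper so that the example runs without network access.

```python
from concurrent.futures import ThreadPoolExecutor


def scrape_all(scrape, urls, max_workers=4):
    """Apply a scrape(url) function to many URLs using a thread pool.

    Returns a dict of url -> data for successes and a dict of
    url -> exception for failures, so one bad page does not stop the run.
    """
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {url: pool.submit(scrape, url) for url in urls}
        for url, future in futures.items():
            try:
                results[url] = future.result()
            except Exception as exc:
                errors[url] = exc
    return results, errors


# Stand-in for a real scraper so the sketch runs offline
def fake_scrape(url):
    if "bad" in url:
        raise ValueError("could not fetch {}".format(url))
    return len(url)


results, errors = scrape_all(fake_scrape, ["http://a.example", "http://bad.example"])
print(results)
print(errors)
```

Threads suit this workload because each worker spends most of its time waiting on the network. When pointing this at a real site, keep `max_workers` small to avoid overloading the server.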

## Basic Page Fetch 

Python comes with an HTTP library, but it is far easier to use [requests.py](http://docs.python-requests.org/), an elegant, simple HTTP library for Python. 

How it works:

- Make a request to a web page (get, post, put, etc.)
- Receive a response from server
- Read content of server response:
    - Headers
    - Cookies
    - Content
    - Etc.

A basic example of this is to fetch data from [News API](https://newsapi.org/). The function below takes a [source](https://newsapi.org/sources) as input, constructs a URL, performs the fetch, parses the JSON, and returns the headlines for the top news stories. 

In [8]:
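import os
import requests

def topnews(source='the-washington-post'):
    # Query parameters, including the API key read from the environment
    params = {
        "source": source, 
        "sortBy": "top",
        "apiKey": os.environ.get('NEWS_API_KEY'),
    }
    url = "https://newsapi.org/v1/articles"
    req = requests.get(url, params=params)
    req.raise_for_status()  # raise an exception on a 4XX or 5XX response
    
    # Parse the JSON body and yield the title of each article
    for article in req.json()['articles']:
        yield article['title']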

In [9]:
for title in topnews(): print(title)

Three White House officials tied to files shared with House intelligence chairman
Flynn offers to cooperate with congressional probe in exchange for immunity
Trump struggles against some of the forces that helped get him elected
Secretary of State Rex Tillerson spends his first weeks isolated from an anxious bureaucracy
Disabled, or just desperate? Rural Americans turn to disability as jobs dry up


A lot happened there. First we created some parameters to compose a URL for a web request, including fetching authentication keys from the environment, then executed the request. The server returned a response, which we checked to make sure no exceptions occurred. Finally, we parsed the JSON data and extracted the title of each article returned. 

This is a pretty simple mechanism, but of course operationalizing such a simple function can often grow to much larger code bases. 

## HTTP Basics

HTTP or _HyperText Transfer Protocol_ is a protocol that implements a client-server relationship wherein a client makes a request of a server for some resource (usually HTML, which is hypertext) and the server responds with a _document_. First described by Tim Berners-Lee, it is the foundation of the web:

 - HyperText Transfer Protocol
 - Foundation of data communication on the web
 - Send request, receive response

HTTP is based around the idea of _linked documents_ (hypertext) being edited and served over the Internet. An HTTP request performs some _method_ against a _resource_, identified by a _uniform resource locator_ (URL). 

### The Anatomy of a URL

Because HTTP is meant to work with linked documents, URLs are the heart of requests. Let's take a closer look at one:

![URL Anatomy](images/url_anatomy.png)

We can use `urllib` to parse URLs: 

In [46]:
from urllib.parse import urlparse

urlparse("https://newsapi.org/v1/articles?source=techcrunch&sortBy=top")

ParseResult(scheme='https', netloc='newsapi.org', path='/v1/articles', params='', query='source=techcrunch&sortBy=top', fragment='')

We can also _compose_ URLs using the parse methods of `urllib` so that we can programmatically create URLs on demand with completely different components, paths, queries, etc. 

In [25]:
from urllib.parse import urlunparse, ParseResult

urlunparse(ParseResult(
    scheme="http", netloc="newsapi.org", path="the-wall-street-journal-api",
    params="", query="", fragment="top-headlines",
))

'http://newsapi.org/the-wall-street-journal-api#top-headlines'

In [26]:
from urllib.parse import urljoin

urljoin("http://newsapi.org", "v1/articles")

'http://newsapi.org/v1/articles'

In [28]:
from urllib.parse import urlencode

urlencode({"source": "the-wall-street-journal", "sortBy": "latest"})

'source=the-wall-street-journal&sortBy=latest'
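
These pieces can be combined into a small helper. The sketch below assumes a hypothetical `build_url` function (not part of `urllib`) that fills in a path and an encoded query string, then parses the query back out with `parse_qs` to show the round trip.

```python
from urllib.parse import parse_qs, urlencode, urlparse, urlunparse


def build_url(base, path, **query):
    """Compose a URL from a base, a path, and query parameters."""
    parts = urlparse(base)
    # ParseResult is a namedtuple, so _replace returns a modified copy
    return urlunparse(parts._replace(path=path, query=urlencode(query)))


url = build_url("https://newsapi.org", "/v1/articles", source="techcrunch", sortBy="top")
print(url)

# Round trip: parse the query string back into a dictionary of lists
print(parse_qs(urlparse(url).query))
```

`parse_qs` returns lists because a query string may repeat a key.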

Once we have a URL, we can start making HTTP requests. A typical view of the client/server relationship:

![HTTP REST Protocol](images/restful.png)

Rather than documents on disk, interaction with a database is more common in modern web applications.

### HTTP Request 

The HTTP request is broken into two parts:

1. Headers 
2. Body 

The **headers** contain meta-information about the request that can change the behavior of the server. Common headers include:

- **User-Agent**: the name of the software (browser, OS, system) that is conducting the request 
- **Authorization**: credentials for secure access to a web resource 
- **Accept**: the media types (e.g. `text/html` or `application/json`) that the client is willing to receive in the response 

The request header must also identify the **path** of the resource requested (specified in the URL) along with a **method** of interaction. Common HTTP methods are:

- **GET**: retrieve a document at the URL, possibly modified by query/parameters 
- **HEAD**: only return the headers of the response 
- **POST**: create a new document at the given path whose contents are the data sent in the body 
- **PUT**: update a document at the given location with the contents sent in the body 
- **DELETE**: delete the document at the given location

The **body** contains arbitrary data, but is usually only used in `POST` and `PUT` operations. 

HTTP/1.1 messages are plain text, so we can see the protocol in action using `curl`: 
 
```
$ curl -v -X HEAD https://www.washingtonpost.com/news/speaking-of-science/
*   Trying 192.33.31.166...
* Connected to www.washingtonpost.com (192.33.31.166) port 443 (#0)
* TLS 1.2 connection using TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256
* Server certificate: www.washingtonpost.com
* Server certificate: Entrust Certification Authority - L1M
* Server certificate: Entrust Root Certification Authority - G2
> HEAD /news/speaking-of-science/ HTTP/1.1
> Host: www.washingtonpost.com
> User-Agent: curl/7.43.0
> Accept: */*
```

Note that `curl` has added some headers for us. We can also access the properties of a request using `requests.py`:

In [34]:
response = requests.get("https://www.washingtonpost.com/news/speaking-of-science/")

In [38]:
response.request

<PreparedRequest [GET]>

In [64]:
response.request.headers

{'Accept': '*/*', 'Accept-Encoding': 'gzip, deflate', 'User-Agent': 'Data Science Scraper', 'Connection': 'keep-alive'}

In [40]:
print(response.request.body)

None


In [41]:
response.request.method

'GET'

Modify headers by passing a dictionary to the request with the `headers` keyword argument; modify the body by passing `data` to the request:

In [42]:
headers = {'User-Agent': 'Data Science Scraper'}
response = requests.get("https://www.washingtonpost.com/news/speaking-of-science/", headers=headers)
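
To see how `headers` and `data` end up in the request without contacting a server, one option is to build a `requests.Request` and call `prepare()` on it. The endpoint below is hypothetical and nothing is sent over the network.

```python
import requests

# Build the request without sending it, so nothing touches the network
request = requests.Request(
    "POST",
    "https://example.com/search",  # hypothetical endpoint
    headers={"User-Agent": "Data Science Scraper"},
    data={"q": "web scraping"},
)
prepared = request.prepare()

print(prepared.method)
print(prepared.headers["User-Agent"])
print(prepared.headers["Content-Type"])  # set automatically for form data
print(prepared.body)
```

Passing a dictionary as `data` form-encodes the body, which is why a `Content-Type` header appears even though none was supplied.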

### HTTP Response 

The HTTP response also contains **headers** and a **body**. The body usually contains the document that you requested (or nothing if you performed a `HEAD` request) and the headers describe meta information about the result. From our `curl` above, here is the response:

```
< HTTP/1.1 200 OK
< Content-Type: text/html;charset=UTF-8
< Connection: keep-alive
< Age: 0
< Access-Control-Allow-Origin: *
< Cache-Control: s-maxage=120
< X-Backend: http://pagebuilder-app.wpprivate.com
< Date: Fri, 31 Mar 2017 11:23:01 GMT
< PB-PID: front-blog
< PB-RID: r0MWV4izGFUcfq
< Server: nginx
< X-Served-By: pb
< X-Instart-Request-ID: 16161893799503592096:VNQ01-NPPRY42:1490959381:165
< X-Instart-Debug-Header: auth_status:200, origin:origin-web.washingtonpost.com, cache key modifier:, num_auth_cookies:6
< Set-Cookie: de=;Expires=Sunday, 31-March-2019 11:23:00 GMT; path=/; domain=.washingtonpost.com
< Set-Cookie: client_region=1;Expires=Friday, 31-March-2017 11:33:00 GMT; path=/; domain=.washingtonpost.com
< Set-Cookie: X-WP-Split=X;Expires=Thursday, 01-January-1970 00:00:00 GMT; path=/; domain=.washingtonpost.com
< Set-Cookie: devicetype=0;Expires=Sunday, 30-April-2017 21:52:00 GMT; path=/; domain=.washingtonpost.com
< Set-Cookie: osfam=0;Expires=Sunday, 30-April-2017 21:52:00 GMT; path=/; domain=.washingtonpost.com
< Set-Cookie: rpld1=23:38.898689|24:-77.033203|0:verizon.net|20:usa|21:dc|22:washington|;Expires=Friday, 31-March-2017 12:23:00 GMT; path=/; domain=.washingtonpost.com
< Content-Security-Policy: upgrade-insecure-requests
```

There are a couple of important things to note here. The first line specifies the *status code* and *status message* of the response. HTTP status codes are three-digit numbers that can be interpreted as follows:

#### HTTP Status Codes 

 - **1xx** - Informational
 - **2xx** - Success
 - **3xx** - Redirection
 - **4xx** - Client Error
 - **5xx** - Server Error
 
A complete list can be found here: [HTTP Statuses](https://httpstatuses.com/). 

A response of `200 OK` means that the request was successful. Other common status codes include:

- **404 Not Found**: the requested path does not exist on the server 
- **500 Internal Server Error**: something went very wrong on the server 
- **301 Moved Permanently**: the resource has moved to a different URL 
- **403 Forbidden**: the server understood the request but refuses to authorize it (a **401 Unauthorized** response is what indicates that authentication is required) 
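
The status classes and standard reason phrases can also be looked up in code. This sketch uses the standard library's `http.HTTPStatus`; the `describe_status` helper and the `STATUS_CLASSES` table are hypothetical names introduced for illustration.

```python
from http import HTTPStatus

# First digit of the status code -> class of response
STATUS_CLASSES = {
    1: "Informational",
    2: "Success",
    3: "Redirection",
    4: "Client Error",
    5: "Server Error",
}


def describe_status(code):
    """Return the class of a status code and its standard reason phrase."""
    status_class = STATUS_CLASSES.get(code // 100, "Unknown")
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:
        # Not a registered status code
        phrase = "Unknown"
    return status_class, phrase


for code in (200, 301, 403, 404, 500):
    print(code, describe_status(code))
```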
     
We can inspect the status code, headers, etc. from the response as follows:

In [67]:
response.status_code

200

In [71]:
response.reason

'OK'

In [74]:
## Raise an exception if the status is 4XX or 5XX
response.raise_for_status()

In [66]:
for item in response.headers.items():
    print("{}: {}".format(*item))

Content-Type: text/html;charset=UTF-8
Transfer-Encoding: chunked
Connection: keep-alive
X-Served-By: pb
Server: nginx
PB-RID: r0MWV4izGFUcfq
PB-PID: front-blog
Date: Fri, 31 Mar 2017 11:37:27 GMT
Content-Encoding: gzip
X-Backend: http://pagebuilder-app.wpprivate.com
Cache-Control: s-maxage=120
Access-Control-Allow-Origin: *
Age: 0
X-Instart-Request-ID: 15998957679212256336:VNQ01-NPPRY09:1490960247:165
X-Instart-Debug-Header: auth_status:200, origin:origin-web.washingtonpost.com, cache key modifier:0, num_auth_cookies:6
Set-Cookie: de=;Expires=Sunday, 31-March-2019 11:37:27 GMT; path=/; domain=.washingtonpost.com, client_region=1;Expires=Friday, 31-March-2017 11:47:27 GMT; path=/; domain=.washingtonpost.com, X-WP-Split=X;Expires=Thursday, 01-January-1970 00:00:00 GMT; path=/; domain=.washingtonpost.com, devicetype=0;Expires=Sunday, 30-April-2017 22:06:27 GMT; path=/; domain=.washingtonpost.com, osfam=0;Expires=Sunday, 30-April-2017 22:06:27 GMT; path=/; domain=.washingtonpost.com, rpld1

Finally, note that the header:

```
Content-Type: text/html;charset=UTF-8
```

specifies exactly the kind of data that we received, and its encoding. This allows us to decode the body of the message correctly. The `requests.py` package automatically decodes the body to a string for you, though you can also get access directly to the bytes content as follows:

In [79]:
## Raw bytes (not decoded) 
type(response.content)

bytes

In [80]:
## Decoded text 
type(response.text)

str

In [82]:
response.encoding

'UTF-8'

The bytes content is useful if you're downloading files, images, etc. For the most part, however, we'll use the decoded text to read our content. 
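
The relationship between the two can be sketched by hand: read the charset from a `Content-Type` value and use it to decode the raw bytes. This illustrates the idea and is not how `requests.py` is implemented; `charset_from_content_type` is a hypothetical helper and the payload is made up.

```python
from email.message import Message


def charset_from_content_type(content_type, default="utf-8"):
    """Extract the charset parameter from a Content-Type header value."""
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset returns None when no charset parameter is present
    return msg.get_content_charset() or default


raw = "café".encode("utf-8")  # the bytes as they would arrive on the wire
charset = charset_from_content_type("text/html;charset=UTF-8")
text = raw.decode(charset)

print(type(raw), len(raw))    # five bytes: the accented letter takes two
print(type(text), len(text))  # four characters
```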

### HTTP Security 

Finally, a few notes on HTTP security, which may affect your scraping methods. 

- **TLS**: Transport Layer Security encrypts the connection between the client and server. 
- **Basic Authentication**: sends a username and password with the request; note that the credentials are effectively plain text if not using TLS. 

Note that security is not the same as authentication, but be aware that both of these mechanisms exist. 

Auth with requests: 

```python
response = requests.get(url, auth=("username", "password"))
```

To use TLS, simply specify `https` in the URL. **Always use TLS where possible**. 
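
To see why Basic Authentication depends on TLS, it helps to look at what is actually sent. In the Basic scheme the `Authorization` header carries the base64 encoding of `username:password`, which anyone who can read the request can reverse. The `basic_auth_header` helper below is a hypothetical name used for illustration.

```python
import base64


def basic_auth_header(username, password):
    """Build the value of the Authorization header for Basic Authentication."""
    credentials = "{}:{}".format(username, password).encode("utf-8")
    return "Basic " + base64.b64encode(credentials).decode("ascii")


header = basic_auth_header("username", "password")
print(header)

# base64 is an encoding, not encryption: it can be reversed without a key
decoded = base64.b64decode(header.split(" ", 1)[1]).decode("utf-8")
print(decoded)
```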

## Parsing Data 


Data comes from a variety of sources in a format that was intended for the producer, not necessarily the format you require. Once you have it stored locally, you can wrangle it to your needs and load it into a database.

### Common Data Formats

 - **CSV**: stores tabular data in plain text where each row is a record and the values are delimited by commas.
 - **JSON**: a data-interchange format that is easy for humans to read and write and for machines to parse and generate.
 - **XML**: a markup language designed to carry data, with a focus on what the data is.
 - **HTML**: a markup language designed to display data, with a focus on how the data looks.
 
### Serialization

 - Converting structured data into a format that can be shared, stored, or updated.
 - The original structure can be restored.
 - Compact formats reduce the disk space needed for storage and the bandwidth needed for sharing.
 - In code, this is the `.write()` side of I/O.

### Parsing

 - Processing input into meaningful structures to extract information.
 - Examples:
     - A student parses a sentence into subject, verb, and object.
     - A compiler parses source code.
     - A CSV parser reads a stream according to rules (comma delimiters, quoting, etc) to extract the data in each line of a file.
 - In code, this is the `.read()` side of I/O.
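
As a small sketch of both directions, the example below serializes a list of records to JSON and CSV text, then parses that text back into Python structures with the standard library. The records are made up for illustration.

```python
import csv
import io
import json

# Made-up records; the second title contains a comma on purpose
records = [
    {"title": "First headline", "author": "A. Writer"},
    {"title": "Second, with a comma", "author": "B. Writer"},
]

# Serialization: structured data -> text
as_json = json.dumps(records)

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["title", "author"])
writer.writeheader()
writer.writerows(records)
as_csv = buffer.getvalue()

# Parsing: text -> structured data
from_json = json.loads(as_json)
from_csv = [dict(row) for row in csv.DictReader(io.StringIO(as_csv))]

print(as_csv)
print(from_json == records, from_csv == records)
```

The CSV writer quotes the title that contains a comma, and the reader follows the same quoting rule to recover it. The round trip is exact here only because every value is a string; CSV does not preserve types such as numbers or dates.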
 
So let's take a look at the response from our news API:

In [83]:
params = {
    "source": "associated-press", 
    "sortBy": "top",
    "apiKey": os.environ.get('NEWS_API_KEY'),
}

url = "https://newsapi.org/v1/articles"
req = requests.get(url, params=params)
req.raise_for_status() 

print(req.text)

{"status":"ok","source":"associated-press","sortBy":"top","articles":[{"author":"CHAD DAY","title":"Michael Flynn in talks with Congress, wary of prosecution","description":"WASHINGTON (AP) — Former National Security Adviser Michael Flynn is in discussions with the House and Senate intelligence committees on receiving immunity from \"unfair prosecution\" in exchange for agreeing to be questioned…","url":"http://bigstory.ap.org/article/58df241057844d46b35a8f0229811f00/michael-flynn-talks-congress-wary-prosecution","urlToImage":"http://binaryapi.ap.org/efa0cd3fa6b34163976415ae2bd21ebc/460x.jpg","publishedAt":"2017-03-31T10:18:03Z"},{"author":"JULIE PACE","title":"Trump faces questions of interference in investigations","description":"WASHINGTON (AP) — President Donald Trump is facing new questions about political interference in the investigations into Russian election meddling following reports that White House officials secretly funneled material…","url":"http://bigstory.ap.org/article

This is a text representation of a JSON object. JSON can be easily parsed into a Python dictionary using the standard library `json` package in order to quickly retrieve values such as the title of each article. JSON is a **data interchange** format (as are CSV and XML), meaning it is machine readable, and it is so common that requests.py comes with a JSON parser:

In [84]:
req.json()

{'articles': [{'author': 'CHAD DAY',
   'description': 'WASHINGTON (AP) — Former National Security Adviser Michael Flynn is in discussions with the House and Senate intelligence committees on receiving immunity from "unfair prosecution" in exchange for agreeing to be questioned…',
   'publishedAt': '2017-03-31T10:18:03Z',
   'title': 'Michael Flynn in talks with Congress, wary of prosecution',
   'url': 'http://bigstory.ap.org/article/58df241057844d46b35a8f0229811f00/michael-flynn-talks-congress-wary-prosecution',
   'urlToImage': 'http://binaryapi.ap.org/efa0cd3fa6b34163976415ae2bd21ebc/460x.jpg'},
  {'author': 'JULIE PACE',
   'description': 'WASHINGTON (AP) — President Donald Trump is facing new questions about political interference in the investigations into Russian election meddling following reports that White House officials secretly funneled material…',
   'publishedAt': '2017-03-31T10:22:32Z',
   'title': 'Trump faces questions of interference in investigations',
   'url': 'h

The response is JSON because NewsAPI.org is a **Web API**, an application programming interface that is specifically meant for programmatic data ingestion. 

### APIs

Computer scientists have long used the term API for any programming interface, but today it most often refers to Web APIs, which makes APIs essentially a data ingestion topic.

> “In the simplest terms, APIs are sets of requirements that govern how one application can talk to another. APIs aren't at all new; whenever you use a desktop or laptop, APIs are what make it possible to move information between programs.

> These days, APIs are especially important because they dictate how developers can create new apps that tap into big Web services—social networks like Facebook or Pinterest, for instance, or utilities like Google Maps or Dropbox. The developer of a game app, for instance, can use the Dropbox API to let users store their saved games in the Dropbox cloud instead of working out some other cloud-storage option from scratch.

> Viewed more broadly, though, APIs make possible a sprawling array of Web-service "mashups," in which developers use mix and match APIs from the likes of Google or Facebook or Twitter to create entirely new apps and services. In many ways, the widespread availability of APIs for major services is what's made the modern Web experience possible.”

http://readwrite.com/2013/09/19/api-defined

#### Examples

 - Twitter
 - Amazon
 - Soundcloud
 - Goodreads
 - Weather Underground
 - Wordnik
 - RSS feeds 
 
#### Always Use an API Whenever Possible 

Typically this means registering for an API key, and including that with all your requests. 

It is an agreement between you and the data provider in order to create an *economic* relationship that is *mutually beneficial*. 
 
#### REST

Most APIs are "RESTful". REST is a simple way to organize interactions between independent systems. 

REST allows you to interact with minimal overhead with clients as diverse as mobile phones and other websites. In theory, REST is not tied to the web, but it's almost always implemented as such, and was inspired by HTTP. As a result, REST can be used wherever HTTP can.

REST basically specifies the relationship between API endpoints and how they're defined as URLs, and the effective use of HTTP verbs for fetching, but the details are a topic for another discussion.

## HTML Basics 

Unfortunately, not all data is available as an API (and the term scraping usually refers to handling content that's not available as an API). Take for example the [Billboard Year-End Hot 100 singles of 1960](https://en.wikipedia.org/wiki/Billboard_Year-End_Hot_100_singles_of_1960):

In [85]:
resp = requests.get("https://en.wikipedia.org/wiki/Billboard_Year-End_Hot_100_singles_of_1960")
print(resp.text)

<!DOCTYPE html>
<html class="client-nojs" lang="en" dir="ltr">
<head>
<meta charset="UTF-8"/>
<title>Billboard Year-End Hot 100 singles of 1960 - Wikipedia</title>
<script>document.documentElement.className = document.documentElement.className.replace( /(^|\s)client-nojs(\s|$)/, "$1client-js$2" );</script>
<script>(window.RLQ=window.RLQ||[]).push(function(){mw.config.set({"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":false,"wgNamespaceNumber":0,"wgPageName":"Billboard_Year-End_Hot_100_singles_of_1960","wgTitle":"Billboard Year-End Hot 100 singles of 1960","wgCurRevisionId":772382213,"wgRevisionId":772382213,"wgArticleId":30448712,"wgIsArticle":true,"wgIsRedirect":false,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["1960 record charts","Billboard charts"],"wgBreakFrames":false,"wgPageContentLanguage":"en","wgPageContentModel":"wikitext","wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","

The response is **HTML**: HyperText Markup Language (the thing HTTP is meant to send around). 

Consider that a web page is usually defined by the following elements:

- **HTML**: defines the structure of the document 
- **CSS**: defines the style and layout of the document 
- **JavaScript**: defines user interaction with the document 
- **Assets**: images and other elements embedded into the page 

The browser fetches all the assets, then puts together the HTML and CSS to *render* a document, then executes JavaScript *events*, creating a complete experience that usually cannot be mimicked programmatically.

HTML is interesting because although it is *meant for human readability* it also contains structural elements defined with HTML tags. This makes it *semi-structured* data, in that there are parsing opportunities that do not require NLP, like the table on the Wikipedia page. 

HTML does, however, pose challenges for parsing since it is meant for human consumption.

HTML (like an HTTP request) contains both a head (metadata for the document) and a body (content). HTML is made up of a specific set of XML-like tags, a minimal structure of which is as follows:

```
<!DOCTYPE html>
<html>
<head>
    <title>Title of the document</title>
</head>

<body>
The content of the document......
</body>

</html>
```

The document is usually composed of many other tags, a complete listing of which can be found here: [W3Schools HTML tags](https://www.w3schools.com/tags/). A few interesting ones:

- `<p>`: a paragraph 
- `<div>`: a block of content 
- `<span>`: an inline set of content 
- `<a>`: an anchor or hyperlink 
- `<table>`: a data table 
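
To make the nesting of tags concrete, the sketch below prints an indented outline of a small document. It uses the standard library's `html.parser` so that it runs without extra packages, and it assumes well-formed markup: an unclosed tag such as a bare `<br>` would throw off the depth counter. `OutlineParser` is a hypothetical helper, not a library class.

```python
from html.parser import HTMLParser


class OutlineParser(HTMLParser):
    """Records each opening tag, indented by its nesting depth."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.outline = []

    def handle_starttag(self, tag, attrs):
        self.outline.append("  " * self.depth + tag)
        self.depth += 1

    def handle_endtag(self, tag):
        self.depth -= 1


html = """<!DOCTYPE html>
<html>
<head>
    <title>Title of the document</title>
</head>
<body>
    <p>A paragraph with <a href="/wiki/Example">a link</a>.</p>
</body>
</html>"""

parser = OutlineParser()
parser.feed(html)
print("\n".join(parser.outline))
```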

If you're only interested in the text of a document, the [readability library](https://pypi.python.org/pypi/readability-lxml) uses the structure to discover the core content, excluding navigation, ads, links, and other elements of a web page:

In [99]:
from bs4 import BeautifulSoup
from readability.readability import Document 

def fetch_text(url):
    resp = requests.get(url)
    resp.raise_for_status()
    
    doc = Document(resp.content)
    title = doc.short_title() 
    soup = BeautifulSoup(doc.summary(), 'lxml')
    article = soup.text
    
    return title, article 

In [100]:
title, article = fetch_text("https://www.washingtonpost.com/news/speaking-of-science/wp/2017/03/30/nasa-astronauts-lose-key-piece-of-iss-shield-and-now-its-floating-free-in-space/")
print(title)
print()
print(article)

NASA astronauts lose key piece of ISS shield, and now it’s floating free in space

     Spacewalking astronaughts Shane Kimbrough and Peggy Whitson went out on the International Space Station's 199th spacewalk on March 30, to carry out routine maintenance. A piece of important cloth shielding to guard against micrometeoroids floated away while the astronauts were at work. (Reuters)   NASA astronauts on a spacewalk Thursday accidentally lost a fabric shield needed for the International Space Station — a minor setback in what was otherwise a record-setting mission for one of the crew members. Astronauts Peggy Whitson and Shane Kimbrough were working on an area of the space station where a docking port had been disconnected and moved last week. They were in the process of using four large cloth panels to cover the access point where the docking port had been when one of the fabric shields suddenly drifted away and floated off into space. There was audible frustration in Whitson's voice as

## Data Extraction

In order to extract more specific data, we'll use [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) with the [lxml](http://lxml.de/) parser to quickly parse the HTML document into a data structure that we can navigate. BeautifulSoup creates the following objects from the page:

- **Tag**: corresponds to an HTML or XML tag in the original document 
- **NavigableString**: corresponds to the content in the tag 

A `BeautifulSoup` object is composed of tags, which are in turn composed of other tags and navigable strings (and can also include HTML comments). Create a soup by passing the HTML string (content) along with the name of the parser:

In [101]:
req = requests.get("https://en.wikipedia.org/wiki/Billboard_Year-End_Hot_100_singles_of_1960")
soup = BeautifulSoup(req.text, 'lxml')

Tags can be accessed directly from the soup using "dot notation". Keep in mind that tags are nested, and you need to chain dots to get to specific elements from the soup. Moreover, if there is more than one of the kind of tag you're requesting, then dot notation will return the first one. 

Once fetched, tags have `name` and `string` attributes that give access to the tag's name and its string content, and you can also access the attributes of the tag using dictionary-like syntax. To get the text content with the inner tags stripped out, use the tag's `text` attribute. 

In [103]:
soup.title

<title>Billboard Year-End Hot 100 singles of 1960 - Wikipedia</title>

In [104]:
soup.title.name

'title'

In [105]:
soup.title.string

'Billboard Year-End Hot 100 singles of 1960 - Wikipedia'

In [106]:
soup.h1

<h1 class="firstHeading" id="firstHeading" lang="en"><i>Billboard</i> Year-End Hot 100 singles of 1960</h1>

In [108]:
soup.h1.string # returns None because of nested tags 

In [109]:
soup.h1.text

'Billboard Year-End Hot 100 singles of 1960'

In [110]:
soup.h1['class']

['firstHeading']

### Navigation 

You can navigate up and down the document using the `parent` and `children` attributes of tags. We can also get to siblings using the `next_sibling` and `previous_sibling` attributes. 

In [112]:
soup.body.table

<table class="wikitable sortable" style="text-align: center">
<tr>
<th scope="col" style="background:#dde;">№</th>
<th scope="col" style="background:#dde;">Title</th>
<th scope="col" style="background:#dde;">Artist(s)</th>
</tr>
<tr>
<td>1</td>
<td>"<a href="/wiki/Theme_from_A_Summer_Place" title="Theme from A Summer Place">Theme from A Summer Place</a>"</td>
<td><a href="/wiki/Percy_Faith" title="Percy Faith">Percy Faith</a></td>
</tr>
<tr>
<td>2</td>
<td>"<a href="/wiki/He%27ll_Have_to_Go" title="He'll Have to Go">He'll Have to Go</a>"</td>
<td><a href="/wiki/Jim_Reeves" title="Jim Reeves">Jim Reeves</a></td>
</tr>
<tr>
<td>3</td>
<td>"<a href="/wiki/Cathy%27s_Clown" title="Cathy's Clown">Cathy's Clown</a>"</td>
<td><a href="/wiki/The_Everly_Brothers" title="The Everly Brothers">The Everly Brothers</a></td>
</tr>
<tr>
<td>4</td>
<td>"<a href="/wiki/Running_Bear" title="Running Bear">Running Bear</a>"</td>
<td><a href="/wiki/Johnny_Preston" title="Johnny Preston">Johnny Preston</a></t

In [113]:
soup.body.table.parent

<div class="mw-content-ltr" dir="ltr" id="mw-content-text" lang="en"><p>This is a list of <i><a href="/wiki/Billboard_(magazine)" title="Billboard (magazine)">Billboard</a></i> magazine's Top <b><a href="/wiki/Billboard_Hot_100" title="Billboard Hot 100">Hot 100</a></b> songs of <a href="/wiki/1960_in_music" title="1960 in music">1960</a>.<sup class="reference" id="cite_ref-1"><a href="#cite_note-1">[1]</a></sup></p>
<table class="wikitable sortable" style="text-align: center">
<tr>
<th scope="col" style="background:#dde;">№</th>
<th scope="col" style="background:#dde;">Title</th>
<th scope="col" style="background:#dde;">Artist(s)</th>
</tr>
<tr>
<td>1</td>
<td>"<a href="/wiki/Theme_from_A_Summer_Place" title="Theme from A Summer Place">Theme from A Summer Place</a>"</td>
<td><a href="/wiki/Percy_Faith" title="Percy Faith">Percy Faith</a></td>
</tr>
<tr>
<td>2</td>
<td>"<a href="/wiki/He%27ll_Have_to_Go" title="He'll Have to Go">He'll Have to Go</a>"</td>
<td><a href="/wiki/Jim_Reeves"

In [121]:
for child in soup.body.table.children:
    if child.name is not None:
        print(child)
        break

<tr>
<th scope="col" style="background:#dde;">№</th>
<th scope="col" style="background:#dde;">Title</th>
<th scope="col" style="background:#dde;">Artist(s)</th>
</tr>


In [122]:
soup.body.table.tr

<tr>
<th scope="col" style="background:#dde;">№</th>
<th scope="col" style="background:#dde;">Title</th>
<th scope="col" style="background:#dde;">Artist(s)</th>
</tr>

In [126]:
soup.body.table.tr.next_sibling.next_sibling

<tr>
<td>1</td>
<td>"<a href="/wiki/Theme_from_A_Summer_Place" title="Theme from A Summer Place">Theme from A Summer Place</a>"</td>
<td><a href="/wiki/Percy_Faith" title="Percy Faith">Percy Faith</a></td>
</tr>
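
The cell above calls `next_sibling` twice because the first sibling of a row is the newline text between the two `<tr>` tags, not the next row. The sketch below shows this on a small made-up table and uses `find_next_sibling`, which skips over strings. It uses Python's built-in `html.parser` backend rather than `lxml` only to keep the example free of extra dependencies.

```python
from bs4 import BeautifulSoup

# Small made-up table in the same shape as the Wikipedia one
html = """<table>
<tr><th>No.</th><th>Title</th></tr>
<tr><td>1</td><td>Theme from A Summer Place</td></tr>
<tr><td>2</td><td>He'll Have to Go</td></tr>
</table>"""

soup = BeautifulSoup(html, "html.parser")
header = soup.table.tr

# The immediate sibling is the newline between the two rows, not a tag
print(repr(header.next_sibling))

# find_next_sibling skips strings and returns the next matching tag
first_row = header.find_next_sibling("tr")
print([td.get_text() for td in first_row.find_all("td")])

# find_next_siblings returns every following row
print(len(header.find_next_siblings("tr")))
```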

### Retrieving Specific Elements 

BeautifulSoup supports three methods to retrieve specific elements:

- `find(filter)`: get the first descendant of the tag that matches the filter 
- `find_all(filter)`: get all descendants of the tag that match the filter 
- `select(css)`: use a css selector to get all descendants of the tag 

The first two methods `find` and `find_all` are usually applied to a specific tag, whereas the `select` method is generally applied to the entire soup, though it can be applied to a specific tag. 

The filter argument can be one of the following:

- a string specifying the name of the tag to find 
- a regular expression to match against the name of the tag 
- a list of tag strings 
- True to match everything possible 
- a function that takes as input a tag and returns True for match or False for not 
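
Each kind of filter can be tried on a small document. The markup below is made up, and the `has_href` function is a hypothetical example of a function filter.

```python
import re

from bs4 import BeautifulSoup

html = """<body>
<h1>Chart</h1>
<h2>Notes</h2>
<p class="intro">Top singles.</p>
<a href="/wiki/Percy_Faith">Percy Faith</a>
<a id="top">Back to top</a>
</body>"""
soup = BeautifulSoup(html, "html.parser")

by_name = soup.find_all("a")                    # a string: tag name
by_regex = soup.find_all(re.compile(r"^h\d$"))  # a regex on the tag name
by_list = soup.find_all(["h1", "p"])            # a list of tag names
everything = soup.find_all(True)                # True: every tag


def has_href(tag):
    """Function filter: match only tags that carry an href attribute."""
    return tag.has_attr("href")


by_function = soup.find_all(has_href)

print(len(by_name), len(by_regex), len(by_list), len(everything), len(by_function))
```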

You can also pass attribute information, but it's usually easier to use a selector:

In [127]:
len(soup.find_all('table'))

2

In [128]:
table = soup.find('table')
len(table.find_all('tr'))

101

### CSS Selectors

So we had a problem above: there are two tables, but we only want to retrieve data from the Billboard list. We could manually specify which one we want, but if we're doing this programmatically on a bunch of pages, we need to be more efficient. 

HTML elements can have attributes, such as:

```html
<a href="https://www.google.com" title="search">Google</a>
```

There are two primary attributes that are beneficial to select on:

- **class**: usually defines a collection of related tags 
- **id**: usually identifies a single, unique tag 

CSS (Cascading Style Sheets) is used to format and style the display of HTML documents. Because a page's styling depends on them, CSS selectors are usually a reliable mechanism for extracting exactly the data needed.

> Because web developers use CSS selectors, it is often easiest to use them for data extraction 

To select by class, use the `".class"` notation and to select by ID use `"#id"` notation as follows: 

In [129]:
len(soup.select(".wikitable"))

1

In [130]:
soup.select("#mw-content-text")

[<div class="mw-content-ltr" dir="ltr" id="mw-content-text" lang="en"><p>This is a list of <i><a href="/wiki/Billboard_(magazine)" title="Billboard (magazine)">Billboard</a></i> magazine's Top <b><a href="/wiki/Billboard_Hot_100" title="Billboard Hot 100">Hot 100</a></b> songs of <a href="/wiki/1960_in_music" title="1960 in music">1960</a>.<sup class="reference" id="cite_ref-1"><a href="#cite_note-1">[1]</a></sup></p>
 <table class="wikitable sortable" style="text-align: center">
 <tr>
 <th scope="col" style="background:#dde;">№</th>
 <th scope="col" style="background:#dde;">Title</th>
 <th scope="col" style="background:#dde;">Artist(s)</th>
 </tr>
 <tr>
 <td>1</td>
 <td>"<a href="/wiki/Theme_from_A_Summer_Place" title="Theme from A Summer Place">Theme from A Summer Place</a>"</td>
 <td><a href="/wiki/Percy_Faith" title="Percy Faith">Percy Faith</a></td>
 </tr>
 <tr>
 <td>2</td>
 <td>"<a href="/wiki/He%27ll_Have_to_Go" title="He'll Have to Go">He'll Have to Go</a>"</td>
 <td><a href="/

CSS selectors can get even more complex, targeting particular attributes, tags, nested structures, first or last elements, groups of elements, etc. Frankly, they are a very easy way to navigate HTML content. 
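As a rough illustration of those more complex selectors, the sketch below runs a few of them against a small hand-written snippet. The second table row is hypothetical, and the pseudo-class and attribute selectors assume a reasonably recent BeautifulSoup (4.7 or later, which delegates selectors to the `soupsieve` package).

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup loosely modeled on the Wikipedia table.
html = """
<div id="mw-content-text">
<table class="wikitable sortable">
<tr><th>No.</th><th>Title</th><th>Artist(s)</th></tr>
<tr><td>1</td><td><a href="/wiki/Theme_from_A_Summer_Place">Theme from A Summer Place</a></td><td><a href="/wiki/Percy_Faith">Percy Faith</a></td></tr>
<tr><td>2</td><td><a href="/wiki/Example_Song">Example Song</a></td><td><a href="/wiki/Example_Artist">Example Artist</a></td></tr>
</table>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Nested structure: a wikitable anywhere inside the element with this id.
tables = soup.select("#mw-content-text table.wikitable")

# Multiple classes on the same tag.
sortable = soup.select("table.wikitable.sortable")

# Attribute selector: anchors whose href starts with /wiki/.
wiki_links = soup.select('a[href^="/wiki/"]')

# Child combinator plus positional pseudo-class: the second cell of each row.
titles = [td.get_text() for td in soup.select("tr > td:nth-of-type(2)")]

# Group of elements: header cells and data cells together.
cells = soup.select("th, td")

# select_one returns only the first match (or None if nothing matches).
first_link = soup.select_one("td a")

print(titles)
print(first_link["href"])
```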

### Extracting Data 

Create a function that takes as input a year, and returns all the Billboard top 100 singles of the year. 

In [165]:
baseurl = "https://en.wikipedia.org/wiki/Billboard_Year-End_Hot_100_singles_of_{}"

def top_100_singles(year=1960):
    # Fetch the page 
    url = baseurl.format(year)
    req = requests.get(url)
    req.raise_for_status()
    
    # Parse the page
    soup = bs4.BeautifulSoup(req.text, "html.parser")
    
    # Get the table
    table = soup.select("#mw-content-text table")[0]
    
    # Parse each row and yield the data required 
    for tr in table.find_all('tr'):
        # Ensure that this is not the header or an empty row
        if len(tr.find_all('td')) < 2:
            continue 
        
        # Handle 1984 vs 1960 table 
        if tr.th:
            notd = tr.th
            titletd, artisttd = tr.find_all('td')
        else:
            notd, titletd, artisttd = tr.find_all('td')
        
        # Create the row 
        row = {
            "year": year, 
            "number": int(notd.text),
            "title": {
                "name": titletd.text,
                "link": "",
            }, 
            "artist": {
                "name": artisttd.text, 
                "link": "", 
            }, 
        }
        
        # Add the links 
        if titletd.a is not None:
            row["title"]["link"] = titletd.a['href']
        
        if artisttd.a is not None:
            row["artist"]["link"] = artisttd.a['href']
        
        # Yield the row 
        yield row 

In [167]:
for single in top_100_singles(1984):
    print(single)

{'title': {'link': '/wiki/When_Doves_Cry', 'name': '"When Doves Cry"'}, 'number': 1, 'artist': {'link': '/wiki/Prince_(musician)', 'name': 'Prince'}, 'year': 1984}
{'title': {'link': '/wiki/What%27s_Love_Got_to_Do_with_It_(song)', 'name': '"What\'s Love Got to Do with It"'}, 'number': 2, 'artist': {'link': '/wiki/Tina_Turner', 'name': 'Tina Turner'}, 'year': 1984}
{'title': {'link': '/wiki/Say_Say_Say', 'name': '"Say Say Say"'}, 'number': 3, 'artist': {'link': '/wiki/Paul_McCartney', 'name': 'Paul McCartney and Michael Jackson'}, 'year': 1984}
{'title': {'link': '/wiki/Footloose_(song)', 'name': '"Footloose"'}, 'number': 4, 'artist': {'link': '/wiki/Kenny_Loggins', 'name': 'Kenny Loggins'}, 'year': 1984}
{'title': {'link': '/wiki/Against_All_Odds_(Take_a_Look_at_Me_Now)', 'name': '"Against All Odds (Take a Look at Me Now)"'}, 'number': 5, 'artist': {'link': '/wiki/Phil_Collins', 'name': 'Phil Collins'}, 'year': 1984}
{'title': {'link': '/wiki/Jump_(Van_Halen_song)', 'name': '"Jump"'}, 

## Data Storage 

Now that we're extracting data, we should consider how to store it on disk, the final part of web scraping. Since we're yielding *nested* dictionaries that map directly onto JSON, the simplest approach is to write the data to a JSON file whose name includes the year. 

In [174]:
import os 
import json 

def fetch_and_store_json(year=1960, path="data"):
    """
    Specify a year and a directory on disk to write the JSON data out to. 
    """
    singles = list(top_100_singles(year)) 
    if len(singles) == 0:
        raise ValueError("No singles retrieved for year {}".format(year))
    
    outpath = os.path.join(path, "billboard_singles_{}.json".format(year))
    with open(outpath, 'w') as f:
        json.dump(singles, f, indent=2)

In [175]:
fetch_and_store_json(1984)

It is also straightforward to [_denormalize_](https://en.wikipedia.org/wiki/Denormalization) our data and write it to a CSV file:

In [176]:
import csv 

def fetch_and_store_csv(year=1960, path="data"):
    """
    Specify a year and a directory on disk to write the CSV data out to. 
    """
    singles = list(top_100_singles(year)) 
    if len(singles) == 0:
        raise ValueError("No singles retrieved for year {}".format(year))
    
    outpath = os.path.join(path, "billboard_singles_{}.csv".format(year))
    # newline='' lets the csv module control line endings (avoids blank rows on Windows)
    with open(outpath, 'w', newline='') as f:
        writer = csv.writer(f) 
        writer.writerow(["number", "year", "title", "link", "artist", "artist link"])
        
        for song in singles:
            writer.writerow([
                song["number"],
                song["year"],
                song["title"]["name"],
                song["title"]["link"],
                song["artist"]["name"],
                song["artist"]["link"],
            ])

In [177]:
fetch_and_store_csv(1992)

### Databases

Writing data to disk like this does not give us the ability to query it very easily; instead it may be better to write to a database. Here is a list of databases commonly used with Python for data ingestion:

- Sqlite3 (standard library)
- PostgreSQL (with psycopg2)
- MongoDB (with pymongo) 

Sqlite3 is a local embedded database that requires no server and is quick to get started with, whereas PostgreSQL is a relational database server better suited to larger datasets and concurrent access. MongoDB is a document store, so it is well suited to storing unstructured or semi-structured data. 

Other tools include:

 - Postgres App
 - Pgadmin
 - Postico
 - Postman
 - JetBrains Database Navigator 

In [178]:
import sqlite3 
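As a minimal sketch of what writing the scraped rows to Sqlite3 might look like, the code below flattens the nested dictionaries into one table. The `singles` table name, its column names, and the `store_singles` helper are assumptions made for illustration; the two sample rows are copied from the 1984 output shown earlier, and an in-memory database is used so nothing is written to disk.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS singles (
    year        INTEGER NOT NULL,
    number      INTEGER NOT NULL,
    title       TEXT NOT NULL,
    title_link  TEXT,
    artist      TEXT NOT NULL,
    artist_link TEXT,
    PRIMARY KEY (year, number)
)
"""

def store_singles(conn, singles):
    """Insert rows shaped like those yielded by top_100_singles."""
    # Using the connection as a context manager commits on success.
    with conn:
        conn.execute(SCHEMA)
        conn.executemany(
            "INSERT OR REPLACE INTO singles VALUES (?, ?, ?, ?, ?, ?)",
            [
                (
                    s["year"], s["number"],
                    s["title"]["name"], s["title"]["link"],
                    s["artist"]["name"], s["artist"]["link"],
                )
                for s in singles
            ],
        )

sample = [
    {"year": 1984, "number": 1,
     "title": {"name": '"When Doves Cry"', "link": "/wiki/When_Doves_Cry"},
     "artist": {"name": "Prince", "link": "/wiki/Prince_(musician)"}},
    {"year": 1984, "number": 4,
     "title": {"name": '"Footloose"', "link": "/wiki/Footloose_(song)"},
     "artist": {"name": "Kenny Loggins", "link": "/wiki/Kenny_Loggins"}},
]

conn = sqlite3.connect(":memory:")  # use a file path to persist the data
store_singles(conn, sample)
store_singles(conn, sample)  # re-ingesting replaces rows instead of duplicating

rows = conn.execute(
    "SELECT number, title, artist FROM singles WHERE year = ? ORDER BY number",
    (1984,),
).fetchall()
print(rows)
```

The composite primary key on `(year, number)` combined with `INSERT OR REPLACE` is one way to make repeated ingestion of the same year safe; other designs are equally reasonable.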

### Serialization 

Deciding on data serialization, i.e. the form that the data takes, can be important in web scraping. Here is a quick guide and flow chart of the possibilities:

![Data Serialization](images/serialization.png) 

As you can see there is a close relationship between data formats and data storage. 

### WORM Storage 

Note that the data extraction methodology is necessarily _destructive_ in that it applies irreversible transformations to the data. To prevent re-ingestion or to accommodate historical ingestion and monitoring, most data scraping techniques use a **WORM** store to save data in as raw a form as possible. 

> WORM: Write Once Read Many 

![WORM Storage](images/worm.png)

WORM storage is now usually called a "data lake". 
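One simple way to approximate a WORM store on a local filesystem is to save each raw response under a name derived from its content hash and refuse to overwrite existing files. The sketch below assumes that layout; the `save_raw` helper, the directory structure, and the example URL are hypothetical, and in practice `content` would be something like `requests.get(url).content`.

```python
import hashlib
import os
import tempfile

def save_raw(content, url, root):
    """Write raw page bytes exactly once and return the file path."""
    # One directory per URL, one file per distinct version of the content.
    urlkey = hashlib.sha256(url.encode("utf-8")).hexdigest()[:16]
    digest = hashlib.sha256(content).hexdigest()

    dirpath = os.path.join(root, urlkey)
    os.makedirs(dirpath, exist_ok=True)
    path = os.path.join(dirpath, digest + ".html")

    try:
        # Mode "x" raises FileExistsError rather than overwriting.
        with open(path, "xb") as f:
            f.write(content)
    except FileExistsError:
        pass  # identical content was already stored; leave it untouched
    return path

root = tempfile.mkdtemp()  # stand-in for a permanent raw-data directory
url = "https://example.com/page"

first = save_raw(b"<html>v1</html>", url, root)
again = save_raw(b"<html>v1</html>", url, root)    # no new file is written
changed = save_raw(b"<html>v2</html>", url, root)  # stored alongside v1

print(first == again, first != changed)
```

Because extraction always reads from these raw files, the parsing code can be changed and re-run later without fetching the pages again.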

## Scraping Basics

### Scrapy

Open source framework for crawling websites and extracting structured data. 

 - Spiders - define how a certain site (or group of sites) will be scraped.
 - Selectors - select certain parts of the HTML document.
 - Items - objects that serve as simple containers used to collect the scraped data. 
 - Scrapy Shell - debug scraping code quickly without having to run spider. 
 - Pipelines, extractors, and more!


*For more advanced crawling and scraping, it may be worth looking into the following tools.*

* Selenium - a browser automation tool with Python bindings that allows you to simulate user interaction with a website.
* Apache Nutch - a highly extensible and scalable open source web crawler.

In [179]:
import scrapy

class CSSSpider(scrapy.Spider):
    name = "css-spider"
    start_urls = [
        'http://quotes.toscrape.com/',
    ]

    def parse(self, response):
        for quote in response.css("div.quote"):
            yield {
                'text': quote.css("span.text::text").extract_first(),
                'author': quote.css("small.author::text").extract_first(),
                'tags': quote.css("div.tags > a.tag::text").extract()
            }

        next_page_url = response.css("li.next > a::attr(href)").extract_first()
        if next_page_url is not None:
            yield scrapy.Request(response.urljoin(next_page_url))

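ImportError: No module named 'scrapy'

The error above means that Scrapy was not installed in the environment this notebook was run in; install it (for example with `pip install scrapy`) before running the spider.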

### Being a Good Citizen

 - `robots.txt` files - tell you what the site does and does not allow from crawlers.
 - Rate limiting - limiting the frequency at which you send requests to a website.
 - Too much traffic too quickly may bring down a smaller website.
 - Larger websites may block your IP address.
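The first two points can be sketched with the standard library alone. The example below parses a hypothetical `robots.txt` from a string (a real scraper would load it with `set_url()` and `read()`), skips disallowed paths, and spaces out requests with a small `RateLimiter` class. The agent name, URLs, and the `RateLimiter` class are assumptions for illustration, and the actual page fetch is left as a comment.

```python
import time
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for an example site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 1
"""

AGENT = "example-scraper"  # hypothetical user agent name

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


class RateLimiter:
    """Block so that successive wait() calls are at least `delay` seconds apart."""

    def __init__(self, delay):
        self.delay = delay
        self.last = None

    def wait(self):
        if self.last is not None:
            remaining = self.delay - (time.monotonic() - self.last)
            if remaining > 0:
                time.sleep(remaining)
        self.last = time.monotonic()


# Use the site's requested crawl delay if present, otherwise one second.
limiter = RateLimiter(parser.crawl_delay(AGENT) or 1.0)

urls = [
    "https://example.com/page/1",
    "https://example.com/private/data",
    "https://example.com/page/2",
]

fetched = []
for url in urls:
    if not parser.can_fetch(AGENT, url):
        continue  # the site asked crawlers not to visit this path
    limiter.wait()
    fetched.append(url)  # a real scraper would call requests.get(url) here

print(fetched)
```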

## Conclusion 

Your mission is to get the data you need to do the job you need to do.

However:
- Data is designed for operations, not analysis.
- Data used in analysis usually needs to be denormalized.
- There can be many gatekeepers.

So what makes a good data source?

As data scientists, we rely heavily on structure and patterns, not only in the content of our data, but in its history and provenance. In general, good data sources have a determinable structure, where different pieces of content are organized according to some schema and can be extracted systematically via the application of some logic to that schema. If there is no common structure or schema between documents, it becomes difficult to discern any patterns for extracting the information we want, which often results in either no data retrieved at all or significant cleaning required to correct what the ingestion process got wrong.

### Publicly Available Datasets

 - [Amazon S3 Cloud Public Datasets](https://aws.amazon.com/datasets/)
 - [UCI Machine Learning Repository](http://archive.ics.uci.edu/ml/)
 - [Awesome Public Datasets](https://github.com/caesar0301/awesome-public-datasets)
 - [More Datasets](http://rs.io/100-interesting-data-sets-for-statistics/)
 - [Kaggle](https://www.kaggle.com/)
 - [Data.gov](https://www.data.gov/)
 - [Sunlight Foundation](https://sunlightfoundation.com/)

Strategy: look for academic data sets that implement techniques that you’re interested in - these may lead you to initial data or other primary sources.

Also, we’re in DC - Data.gov is a very important resource for data collection and aggregation - with APIs that are constantly being updated with new data. More importantly, Federal agencies in this area are desperate for community data work and visualizations - there are reverse pitches and more to get data scientists involved. Also keep in mind that Data.gov is just a start - the Federal Reserve Board has massive amounts of data, but cannot participate in Data.gov.

**In the end, the best data is always the data you gather yourself.**