# Web Scraping Intro

Ideally the data we want is already cleanly formatted for us, but often it isn't.

Usual order of preference:
1. Direct access to formatted data (database, CSV, etc.)
2. API call to get data
3. Scrape the data
4. Other sources...

# HTML - HyperText Markup Language

Essentially tells an application (usually a web browser) what the content is and how it is structured, so the application can display (**render**) it.

Backbone of almost everything you see on the web.

## Syntax

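Looks a lot like XML (nested tags in angle brackets), although HTML is its own, more forgiving language rather than strict XML.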

### Tags

Tags like `<p>` say what a piece of content is.
So `<p>This is a paragraph</p>` says that the content between the opening and closing tags should be interpreted as a paragraph (exactly how a paragraph is displayed is up to the browser).
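
To see how a program reads a tag, here is a minimal sketch using Beautiful Soup (the library used later in these notes). The variable names are only illustrative.

```python
from bs4 import BeautifulSoup

# Parse the one-line example with Python's built-in HTML parser.
snippet = BeautifulSoup("<p>This is a paragraph</p>", "html.parser")

paragraph = snippet.p              # the first <p> tag in the snippet
tag_name = paragraph.name          # the tag's type, here "p"
tag_text = paragraph.get_text()    # the content between <p> and </p>

print(tag_name, "->", tag_text)
```

The tag name tells us what kind of content we have, and `get_text()` gives us the content itself.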

### Attributes

Attributes go inside the opening tag and change how the element behaves or carry extra information about it.

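For example, `<a href="https://www.google.com">Click here for Google!</a>` tells the `<a>` (hyperlink) tag that the content should go to "https://www.google.com" when clicked. The full URL matters: a bare `href="google.com"` would be treated as a path relative to the current page. Here's what it looks like when rendered:

> <a href="https://www.google.com">Click here for Google!</a>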
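
When scraping, attributes are often the data we actually want (for example, the link target). A minimal sketch, assuming a link similar to the one above and using Beautiful Soup from later in these notes:

```python
from bs4 import BeautifulSoup

link_html = '<a href="https://www.google.com">Click here for Google!</a>'
link = BeautifulSoup(link_html, "html.parser").a

href = link["href"]          # dictionary-style access; raises KeyError if absent
title = link.get("title")    # .get() returns None when the attribute is absent
all_attrs = link.attrs       # every attribute on the tag, as a dict

print(href, title, all_attrs)
```

Using `.get()` is usually the safer choice on real pages, where not every tag carries every attribute.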

### Comments

Comments are notes in the source that won't be rendered. They are still part of the page source, though, so a scraper can see them.

Format:
> `<!-- This is a comment and won't get rendered!-->`

> ```
> <!-- This is a multiple
> line comment
> 
> 
> fancy
> -->
> ```
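
Even though comments are not rendered, they are still in the page source. A small sketch of pulling them out with Beautiful Soup's `Comment` type (the sample HTML is made up for illustration):

```python
from bs4 import BeautifulSoup, Comment

comment_html = """
<p>Visible text</p>
<!-- This is a comment and won't get rendered! -->
"""
comment_soup = BeautifulSoup(comment_html, "html.parser")

# Comments are stored as a special kind of string node, so filter on the type.
comments = comment_soup.find_all(string=lambda text: isinstance(text, Comment))
comment_texts = [c.strip() for c in comments]

print(comment_texts)
```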

## HTML Document Structure (DOM)

Always start with the declaration that says the document is HTML5:
```
<!DOCTYPE html>
```

Note that browsers will usually still render a page without the declaration, but they may fall back to a legacy "quirks" mode.

Overall structure is the `<html>` element, and within it the `<head>` and `<body>`.

* Head: metadata and other extras about the HTML document
* Body: holds all the content of the page

Example:
```
<!DOCTYPE html>
<html>
    <head>
        <!-- metadata about the HTML document as a whole -->

    </head>

    <body>
        <!-- content of our page will be here! -->

    </body>
</html>
```
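
The same structure is what a parser hands back to us. A minimal sketch, assuming a tiny page with a hypothetical title and one paragraph, parsed with Beautiful Soup:

```python
from bs4 import BeautifulSoup

doc_html = """<!DOCTYPE html>
<html>
    <head>
        <title>My page</title>
    </head>
    <body>
        <p>Hello!</p>
    </body>
</html>"""
doc = BeautifulSoup(doc_html, "html.parser")

page_title = doc.head.title.get_text()   # text of <title> inside <head>
body_text = doc.body.p.get_text()        # text of the first <p> inside <body>

# Direct children of <html> (tags only, not whitespace)
top_level = [child.name for child in doc.html.find_all(recursive=False)]

print(page_title, body_text, top_level)
```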

## Common HTML Elements

* Paragraphs: `<p>paragraph stuff...</p>`
* Headers: `<h1>Main header</h1>`, `<h2>Sub-header</h2>`, `<h3>Sub-sub-header</h3>`, and so on
* Images: `<img src="path_to_image" alt="Description shown if the image can't load; used for accessibility">`
* Lists 
  + Unordered:
  ```
  <ul>
    <li>Coffee</li>
    <li>Vinyl Records</li>
    <li>Pickling</li>
  </ul>
  ```
  + Ordered:
  ```
    <ol>
        <li>DiFara Pizza</li>
        <li>Lucali's</li>
        <li>Sal and Carmine's</li>
        <li>Juliana's</li>
        <li>Joe's</li>
    </ol>
  ```
* Tables
```
<table>
  <tr>
    <th>header 0</th>
    <th>header 1</th> 
    <th>header 2</th>
  </tr>
  <tr>
    <td>data_item0 1st col</td>
    <td>data_item0 2nd col</td> 
    <td>data_item0 3rd col</td>
  </tr>
  <tr>
    <td>data_item1 1st col</td>
    <td>data_item1 2nd col</td> 
    <td>data_item1 3rd col</td>
  </tr>
</table>
```

<table>
  <tr>
    <th>header 0</th>
    <th>header 1</th> 
    <th>header 2</th>
  </tr>
  <tr>
    <td>data_item0 1st col</td>
    <td>data_item0 2nd col</td> 
    <td>data_item0 3rd col</td>
  </tr>
  <tr>
    <td>data_item1 1st col</td>
    <td>data_item1 2nd col</td> 
    <td>data_item1 3rd col</td>
  </tr>
</table>
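
Tables are a common scraping target because they already hold structured data. One possible way to turn a table like the one above into a list of dictionaries (a sketch with a shortened, made-up table; real tables may have merged cells or missing headers that need extra handling):

```python
from bs4 import BeautifulSoup

table_html = """
<table>
  <tr><th>header 0</th><th>header 1</th></tr>
  <tr><td>a0</td><td>a1</td></tr>
  <tr><td>b0</td><td>b1</td></tr>
</table>
"""
table = BeautifulSoup(table_html, "html.parser").table

rows = table.find_all("tr")

# The first row holds the column names in <th> cells.
headers = [th.get_text(strip=True) for th in rows[0].find_all("th")]

# Every later row holds data in <td> cells; pair each cell with its header.
records = [
    dict(zip(headers, (td.get_text(strip=True) for td in row.find_all("td"))))
    for row in rows[1:]
]

print(records)
```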

## HTML 5

HTML5 added more specific (semantic) alternatives to the generic `<div>` tag, along with other new elements:

* `<header>`
* `<nav>`
* `<footer>`
* `<main>`
* `<aside>`
* `<article>`
* `<section>`
* `<canvas>`
* And more!
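
These tags make scraping easier, because the tag name itself says which part of the page we are in. A small sketch on a made-up page (all names and paths here are illustrative):

```python
from bs4 import BeautifulSoup

semantic_html = """
<body>
  <header><h1>Site name</h1></header>
  <nav><a href="/home">Home</a> <a href="/about">About</a></nav>
  <main>
    <article><h2>Post title</h2><p>Post text.</p></article>
  </main>
  <footer>Contact us</footer>
</body>
"""
semantic = BeautifulSoup(semantic_html, "html.parser")

# Only the navigation links, ignoring links anywhere else on the page
nav_links = [a["href"] for a in semantic.nav.find_all("a")]

# The heading of the first article
article_title = semantic.find("article").h2.get_text()

print(nav_links, article_title)
```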

# CSS

CSS (Cascading Style Sheets) allows for customizing further, usually the visuals themselves (colors, fonts, layout).

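- We can select elements by element type
    + using `p`, `h2`, `a`, etc. (the tag name, without the angle brackets)
- class
    + using `.name-of-class`
- id
    + using `#name-of-id`
- value of an attribute
    + using `[attribute="value"]`
- relative position to other elements
    + using combinators such as `ul li` (descendant) or `h2 + p` (next sibling)

A rule is a selector followed by the properties to change: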

```
h1 {
  color: blue;
}
```
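
The same selector syntax is useful for scraping: Beautiful Soup's `select()` accepts CSS selectors. A sketch covering each kind of selector listed above, on a made-up snippet:

```python
from bs4 import BeautifulSoup

css_html = """
<div id="intro" class="container">
  <h2>Title</h2>
  <p class="lead">First paragraph</p>
  <p>Second paragraph</p>
  <a href="https://example.com" target="_blank">External link</a>
</div>
"""
css_demo = BeautifulSoup(css_html, "html.parser")

by_type = css_demo.select("p")                     # element type
by_class = css_demo.select(".lead")                # class
by_id = css_demo.select("#intro")                  # id
by_attr = css_demo.select('a[target="_blank"]')    # attribute value
by_position = css_demo.select("h2 + p")            # <p> right after an <h2>

print(len(by_type), by_class[0].get_text(), by_position[0].get_text())
```

`select()` always returns a list; `select_one()` returns only the first match (or `None`).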

# Let's Do Some Web Scraping

## The DOM - Document Object Model

A way for us to think of the document structure as a tree of nested elements, which lets us navigate and parse it.

Think of siblings, children/descendants, and parents.
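
These family relationships map directly onto Beautiful Soup attributes and methods. A minimal sketch on a three-item list (written on one line so there are no whitespace nodes between the items):

```python
from bs4 import BeautifulSoup

tree_html = "<ul><li>Coffee</li><li>Vinyl Records</li><li>Pickling</li></ul>"
tree = BeautifulSoup(tree_html, "html.parser")

first_item = tree.li                       # the first <li>

parent_name = first_item.parent.name       # the parent of the <li> is the <ul>
children = [child.get_text() for child in tree.ul.children]
next_siblings = [sib.get_text() for sib in first_item.find_next_siblings("li")]

print(parent_name, children, next_siblings)
```

On real pages, `.children` also yields the whitespace between tags, so `find_all(recursive=False)` is often more convenient when only tags are wanted.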

## Scrape the DOM with Beautiful Soup

- Relatively fast execution in the parser
- Uses natural, Pythonic calls
- Handles character encodings (how the site's text is stored as bytes)

In [None]:
# Note the import looks different
from bs4 import BeautifulSoup
# Requests (we've seen this already) gets the webpage
import requests
# Regular expressions make it easier to parse text
import re

### Get the page

In [None]:
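def getSoup(url):
    # Download the page, then parse its HTML with Python's built-in parser.
    # Returns both the raw response and the parsed soup.
    webpage = requests.get(url)
    return webpage, BeautifulSoup(webpage.content, 'html.parser')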
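
Requests to real sites can hang or fail, so a slightly more defensive variant may be useful. This is only a sketch: the name `get_soup_checked`, the 10-second default, and the `fetch` parameter (which lets a stub replace `requests.get` for testing) are assumptions, not part of the original notebook.

```python
import requests
from bs4 import BeautifulSoup

def get_soup_checked(url, timeout=10, fetch=requests.get):
    """Fetch a page and parse it, failing loudly on HTTP errors.

    `fetch` defaults to requests.get; it is a hypothetical hook so the
    function can be tried out without touching the network.
    """
    response = fetch(url, timeout=timeout)   # give up after `timeout` seconds
    response.raise_for_status()              # raise on 4xx/5xx status codes
    return response, BeautifulSoup(response.content, "html.parser")

# Example call (commented out so nothing is downloaded on import):
# page, soup = get_soup_checked("https://www.androidpolice.com/")
```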

In [None]:
page, soup = getSoup("https://www.androidpolice.com/")

In [None]:
# Uncomment to see the parsed HTML with one tag per line and indentation
# print(soup.prettify())

### Parse the page

In [None]:
images = soup.find_all('img')
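
`find_all` returns a list of tags; usually the next step is to pull an attribute out of each one. A sketch on a made-up snippet (not the live page), showing how to cope with images that have no `src`:

```python
from bs4 import BeautifulSoup

img_html = """
<img src="/static/logo.png" alt="Site logo">
<img src="/static/banner.jpg">
<img alt="Image with no source">
"""
img_soup = BeautifulSoup(img_html, "html.parser")

# .get() returns None instead of raising when the attribute is missing
sources = [img.get("src") for img in img_soup.find_all("img")]

# src=True keeps only the tags that actually have a src attribute
with_src = [img["src"] for img in img_soup.find_all("img", src=True)]

print(sources)
print(with_src)
```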

In [None]:
div = soup.find_all('div')
len(div)

In [None]:
div[:10]

In [None]:
# find_next_siblings() is the current name for the older findNextSiblings()
div[4].find_next_siblings()

In [None]:
soup.find('div', {"class": "container"})
# soup.find_all('div', {"class": "container"})
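
The difference between `find` and `find_all` is easiest to see on a small example. A sketch on a made-up snippet, using the `class_` keyword as an alternative to the attribute dictionary above:

```python
from bs4 import BeautifulSoup

class_html = """
<div class="container main">First</div>
<div class="sidebar">Second</div>
<div class="container">Third</div>
"""
class_soup = BeautifulSoup(class_html, "html.parser")

# find() returns the first matching tag, or None when nothing matches
first_match = class_soup.find("div", class_="container")
no_match = class_soup.find("div", class_="missing")

# find_all() returns every matching tag as a list (possibly empty)
all_matches = class_soup.find_all("div", class_="container")

print(first_match.get_text(), [d.get_text() for d in all_matches], no_match)
```

A tag with several classes (like `container main`) matches a search for any one of them.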