# Web Scraping


## Objectives

1. Understand the motivation for web scraping:
    * What does a web data pipeline look like?
    * How should we store data from the web?
2. Know the high-level differences between NoSQL and SQL.


<div style="text-align: center"><h3>The Reality of Scraping</h3><img src="images/scraping_meme.png" style="width: 600px"></div>

## Why do we scrape the web?

* Realistically, the data that you want to study won't always be available to you in the form of a curated data set.
* We often need to go to the internet to find interesting data:
    * From an existing company
    * Text for NLP
    * Images
    <div style="text-align: center"><h3>Web Data Pipeline</h3><img src="images/web_data_pipeline.png" style="width: 600px"></div>

## Storing data from the web

* We have already seen one way to store data: SQL (an RDBMS).
    * Why wouldn't SQL necessarily be the best tool for storing data that we retrieve from the web?
        * Data are messy!
* Enter NoSQL, which stands for **N**ot **o**nly **SQL**. MongoDB is a flavor of NoSQL, like PostgreSQL is a flavor of SQL.
    * A NoSQL paradigm may be preferable to SQL because it is **schemaless**.
    * Great for **storing unstructured data**, as we may find on the web!
    * MongoDB is a document-oriented DBMS:
      <div style="text-align: center"><h3>Centered around "Documents"</h3><img src="images/document_based_storage.png" style="width: 600px"></div>

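As a minimal sketch of what "schemaless" means in practice, the records below are plain Python dictionaries standing in for documents. The field names and values are hypothetical, and no MongoDB server is involved. Each document carries only the fields that were found on its source page, which a fixed SQL table would not accept without `NULL` columns or a schema change.

```python
import json

# Hypothetical scraped records: each "document" has its own set of fields.
documents = [
    {"title": "Free concert", "venue": "City Park", "price": 0},
    {"title": "Museum night", "tags": ["art", "late"], "price": 5},
    {"title": "Book swap"},  # only a title was found on this page
]

# The set of fields differs from document to document.
for doc in documents:
    print(sorted(doc.keys()))

# Documents serialize naturally to JSON, which resembles the
# document format that MongoDB stores.
print(json.dumps(documents[1]))
```
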
## SQL vs. Mongo

* SQL - want to prevent redundancy in data by having **tables with unique information and relations** between them (normalized data).
    * Creates a **framework for querying** with joins.
    * Makes it easier to update the database. We only ever have to **change information in a single place**.
    * This can result in **"simple" queries being slower, but more complex queries are often faster**.
* Mongo - **document-based storage system**. Does not enforce normalized data. Can have data **redundancies in documents** (denormalized data).
    * **No joins** in the SQL sense; related data is usually embedded in the document itself.
    * A change to the database generally results in needing to **change many documents**.
    * Since there is redundancy in the documents, **simple queries are generally faster, but complex queries are often slower**.
    

| Concept   | SQL          | Mongo          |
|-----------|--------------|----------------|
| Schema    | Yes => Joins | No => No Joins |
| Storage   | Table        | Collection     |
| Record    | Row          | Document       |
| Attribute | Column       | Field          |

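The trade-off in the comparison above can be sketched with the standard library alone. This is an illustration under stated assumptions, not how either database works internally: an in-memory SQLite database plays the role of SQL, a list of dictionaries stands in for a Mongo collection, and all table names, field names and values are hypothetical.

```python
import sqlite3

# Normalized (SQL): each fact lives in exactly one table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE venues (id INTEGER PRIMARY KEY, name TEXT, city TEXT);
    CREATE TABLE events (id INTEGER PRIMARY KEY, title TEXT, venue_id INTEGER);
    INSERT INTO venues VALUES (1, 'City Park', 'San Francisco');
    INSERT INTO events VALUES (1, 'Free concert', 1), (2, 'Yoga', 1);
""")

# Reading an event together with its venue requires a join...
join_sql = """
    SELECT e.title, v.name, v.city
    FROM events AS e JOIN venues AS v ON e.venue_id = v.id
    ORDER BY e.id
"""
sql_rows = conn.execute(join_sql).fetchall()

# ...but renaming the venue is a single-row update.
conn.execute("UPDATE venues SET name = 'Central Park' WHERE id = 1")
sql_rows_after = conn.execute(join_sql).fetchall()

# Denormalized (document style): the venue is copied into every event.
event_docs = [
    {"title": "Free concert", "venue": {"name": "City Park", "city": "San Francisco"}},
    {"title": "Yoga", "venue": {"name": "City Park", "city": "San Francisco"}},
]

# Reading needs no join, but the same rename must touch every document.
changed = 0
for doc in event_docs:
    if doc["venue"]["name"] == "City Park":
        doc["venue"]["name"] = "Central Park"
        changed += 1

print(sql_rows_after)
print(changed)
```
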
# Scraping from a Web Page with Python

Scraping a website comes down to making a **request from Python and parsing the HTML** that is returned from each page. For each of these tasks we have a Python library: **`requests` and `bs4`** (BeautifulSoup), respectively.

### Getting Info from a Web Page

Now that we have easy access to the HTML for a web page, we need **some way to pull the desired content from it**. Luckily there is already a system in place to do this. With a **combination of HTML tags and CSS selectors** we can identify the information on an HTML page that we wish to retrieve and grab it with [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#searching-the-tree).

In [1]:
html = '''<!DOCTYPE html>
<html>
<head>
<title>The title of this web page</title>
</head>
<body>
<h1>My Photos</h1>
<div class='intro'>
<p>These are some photos of my trips.</p>
<img src="me.png">
</div>

<h3>Italy</h3>
<div class='country'>
<img src="venice1.png" alt="Venice"> <br />
<img src="venice2.png" alt="Venice"> <br />
<img src="rome.png" alt="Roma">
</div>

<h3>Germany</h3>
<div class='country'>
<img src="berlin.png" alt="Berlin">
</div>
</body>
</html>
'''

In [2]:
from bs4 import BeautifulSoup

# we create a soup object with the html:
soup = BeautifulSoup(html, 'html.parser')

In [3]:
# now we can query it
soup.title

<title>The title of this web page</title>

In [4]:
soup.title.string

'The title of this web page'

In [5]:
soup.h1

<h1>My Photos</h1>

In [6]:
soup.h3

<h3>Italy</h3>

In [7]:
soup.find('h3')

<h3>Italy</h3>

In [8]:
soup.find_all('h3')

[<h3>Italy</h3>, <h3>Germany</h3>]

In [22]:
soup.find_all('h3')[1].string

'Germany'

In [13]:
soup.find_all('div', class_='country')

[<div class="country">
 <img alt="Venice" src="venice1.png"/> <br/>
 <img alt="Venice" src="venice2.png"/> <br/>
 <img alt="Roma" src="rome.png"/>
 </div>,
 <div class="country">
 <img alt="Berlin" src="berlin.png"/>
 </div>]

In [19]:
soup.find_all('img', alt='Venice')

[<img alt="Venice" src="venice1.png"/>, <img alt="Venice" src="venice2.png"/>]

In [19]:
soup.find('div', class_='country').find_previous_siblings('h3')

[<h3>Italy</h3>]

<div style="color:white">
for div in soup.find_all('div', class_='country'):
    h3 = div.find_previous_siblings('h3')[0]
    country = h3.string
    print(country)

for div in soup.find_all('div', class_='country'):
    h3 = div.find_previous_siblings('h3')[0]
    country = h3.string
    for img in div.find_all('img'):
        image = img.get('src')
        print('Country: {}: image: {}'.format(country, image))
</div>

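The lookups above all go through `find` and `find_all`. The CSS selectors mentioned earlier are available through `select` and `select_one`; the sketch below applies them to a shortened copy of the example page. The selectors are one reasonable choice, not the only one.

```python
from bs4 import BeautifulSoup

# A shortened copy of the example page used above.
html = '''
<h3>Italy</h3>
<div class='country'>
<img src="venice1.png" alt="Venice">
<img src="rome.png" alt="Roma">
</div>
<h3>Germany</h3>
<div class='country'>
<img src="berlin.png" alt="Berlin">
</div>
'''
soup = BeautifulSoup(html, 'html.parser')

# 'div.country img' matches every <img> nested inside <div class="country">.
sources = [img['src'] for img in soup.select('div.country img')]
print(sources)

# An attribute selector narrows the match to images with a given alt text.
venice = soup.select('img[alt="Venice"]')
print(len(venice))

# select_one returns the first match (or None) instead of a list.
first_heading = soup.select_one('h3')
print(first_heading.string)
```
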
## Getting a Web Page

### Requests Library

The [requests](http://docs.python-requests.org/en/latest/index.html) library is designed to simplify the process of making **HTTP requests within Python**. The interface is mind-bogglingly simple. Call the function for the request method you want (this will mostly be `requests.get`) with the URL and any optional parameters you'd like passed along with the request. The `Response` object it returns makes the results of the request available via attributes and methods.

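A minimal sketch of that workflow follows. The helper names `build_request` and `fetch_html`, the `example.com` URL and the `q` parameter are all hypothetical and chosen only for illustration. The request is prepared without being sent, so the block runs offline; `fetch_html` shows the usual pattern of checking the status before reading `.text`.

```python
import requests


def build_request(url, params=None):
    """Prepare (but do not send) a GET request so the final URL can be inspected."""
    return requests.Request('GET', url, params=params).prepare()


def fetch_html(url, params=None, timeout=10):
    """Send a GET request and return the HTML body, raising on 4xx/5xx responses."""
    response = requests.get(url, params=params, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError for error status codes
    return response.text


# Hypothetical URL and query parameter, used only to show how params are encoded.
prepared = build_request('http://example.com/search', params={'q': 'music'})
print(prepared.method)
print(prepared.url)
```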
In [6]:
from bs4 import BeautifulSoup
import requests
fun_cheap = 'http://sf.funcheap.com'
r = requests.get(fun_cheap + '/2018/06/25/')

In [7]:
r.text[:1000] # First 1000 characters of the HTML

'<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "https://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">\n<html xmlns="https://www.w3.org/1999/xhtml" lang="en-US" xmlns:fb="https://www.facebook.com/2008/fbml" xmlns:addthis="https://www.addthis.com/help/api-spec" >\n\n<head profile="https://gmpg.org/xfn/11">\n\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />\n\n\n<title>Events for June 25, 2018 - Funcheap</title>\n\n<meta name="generator" content="WordPress" /> <!-- leave this for stats -->\n\n<link rel="stylesheet" href="https://cdn.funcheap.com/wp-content/themes/arthemia-premium/style.css?v=1.8.23" type="text/css" media="screen" />\n<link rel="stylesheet" href="https://cdn.funcheap.com/wp-content/themes/arthemia-premium/madmenu.css?v=1.1" type="text/css" media="screen" />\n<!--[if IE 6]>\n    <style type="text/css">\n    body {\n        behavior:url("https://cdn.funcheap.com/wp-content/themes/arthemia-premium/scripts/csshover2.htc");\n    }\n  

### Now that we have the web page, we can parse it with BeautifulSoup:

In [8]:
soup = BeautifulSoup(r.text, 'html.parser')

#### Get the heading of the page using the 'h2' tag with class 'title':

In [9]:
soup.select('h2.title')[0].string

'Events for  June 25, 2018'

In [10]:
title = soup.find_all('h2', class_='title')[0]

title

<h2 class="title">Events for  June 25, 2018</h2>

In [11]:
good_clear_float = title.next_sibling.next_sibling

# good_clear_float

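Calling `.next_sibling` twice is needed whenever whitespace sits between two tags, because that whitespace is itself a text node in the parsed tree. The sketch below shows this on a hypothetical fragment (not the Funcheap markup; the `clearfloat` class name is an assumption) and compares it with `find_next_sibling`, which skips text nodes.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: a heading followed by a <div>, separated by a newline.
html = '<h2 class="title">Events</h2>\n<div class="clearfloat"><a href="/a">A</a></div>'
soup = BeautifulSoup(html, 'html.parser')
title = soup.find('h2', class_='title')

# The immediate sibling is the newline text node, not the <div>.
print(repr(title.next_sibling))

# Stepping twice reaches the <div>...
via_two_steps = title.next_sibling.next_sibling

# ...and find_next_sibling skips text nodes and goes straight to the next tag.
via_search = title.find_next_sibling('div')
print(via_two_steps is via_search)
```
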
#### Save all the URLs under the 'a' tag:

In [12]:
urls = []
for tag in good_clear_float.find_all('a', rel=True):
    href = tag.attrs['href']
    urls.append(href)

In [13]:
urls

[]
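
Passing `rel=True` to `find_all` matches only tags that have a `rel` attribute, so the loop above collects nothing when none of the links under the selected element carry one. The sketch below contrasts `rel=True` with `href=True` on a hypothetical fragment; the markup and paths are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: only the second link carries a rel attribute.
html = '''
<div>
<a href="/events/one">One</a>
<a href="/events/two" rel="bookmark">Two</a>
<a name="top">No link here</a>
</div>
'''
soup = BeautifulSoup(html, 'html.parser')

# rel=True keeps only <a> tags that have a rel attribute.
with_rel = [a.get('href') for a in soup.find_all('a', rel=True)]

# href=True keeps every <a> tag that actually has an href.
with_href = [a['href'] for a in soup.find_all('a', href=True)]

print(with_rel)
print(with_href)
```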