# Web Scraping


## Objectives

1. Understand the motivation for web scraping:
    * What does a web data pipeline look like?
    * How should we store data from the web?
2. Know the high-level differences between NoSQL and SQL.


<div style="text-align: center"><h3>The Reality of Scraping</h3><img src="images/scraping_meme.png" style="width: 600px"></div>

## Why do we scrape the web?

* Realistically, the data that you want to study won't always be available in the form of a curated data set.
* We often need to go to the internet to find interesting data:
    * From an existing company
    * Text for NLP
    * Images
    <div style="text-align: center"><h3>Web Data Pipeline</h3><img src="images/web_data_pipeline.png" style="width: 600px"></div>

## Storing data from the web

* We have seen how to store data -> SQL (RDBMS).
    * Why wouldn't SQL necessarily be the best tool for storing data that we retrieve from the web?
        * Data are messy!
* Enter NoSQL. Stands for **N**ot **o**nly **SQL**. MongoDB is a flavor of NoSQL, like PostgreSQL is a flavor of SQL.
    * A NoSQL paradigm may be preferable to SQL because it is **schemaless**.
    * Great for **storing unstructured data**, as we may find on the web!
    * MongoDB is a document-oriented DBMS:
      <div style="text-align: center"><h3>Centered around "Documents"</h3><img src="images/document_based_storage.png" style="width: 600px"></div>

## SQL vs. Mongo

* SQL - aims to prevent redundancy in data by having **tables with unique information and relations** between them (normalized data).
    * Creates a **framework for querying** with joins.
    * Makes it easier to update the database. You only ever have to **change information in a single place**.
    * This can result in **"simple" queries being slower, but more complex queries are often faster**.
* Mongo - **document-based storage system**. Does not enforce normalized data. Can have data **redundancies in documents** (denormalized data).
    * **No joins** in the relational sense (newer versions offer a join-like `$lookup` stage in the aggregation pipeline).
    * A change to the database generally results in needing to **change many documents**.
    * Since there is redundancy in the documents, **simple queries are generally faster, but complex queries are often slower**.
    

|         | SQL          | Mongo          |
|---------|--------------|----------------|
| Schema  | Yes => Joins | No => No Joins |
| Storage | Table        | Collection     |
|         | Row          | Document       |
|         | Column       | Field          |
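To make the trade-off above concrete, here is a minimal sketch that uses only the standard library: `sqlite3` stands in for a relational database and plain Python dictionaries stand in for Mongo documents. The `cities` table, the `city` field, and the city names are hypothetical and chosen purely for illustration.

```python
import sqlite3

# --- Relational (normalized): each fact lives in exactly one row ---
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE cities (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE users  (id INTEGER PRIMARY KEY, name TEXT,
                         city_id INTEGER REFERENCES cities(id));
    INSERT INTO cities VALUES (1, 'Oakland');
    INSERT INTO users  VALUES (1, 'Jon', 1), (2, 'Ashley', 1);
""")

# Renaming the city is a single-row update ...
conn.execute("UPDATE cities SET name = 'Berkeley' WHERE id = 1")

# ... but reading a user's city requires a join.
sql_rows = conn.execute("""
    SELECT users.name, cities.name
    FROM users JOIN cities ON users.city_id = cities.id
    ORDER BY users.id
""").fetchall()
conn.close()

# --- Document style (denormalized): the city is repeated in every document ---
documents = [
    {"name": "Jon", "city": "Oakland"},
    {"name": "Ashley", "city": "Oakland", "car": "Civic"},  # an extra field is fine
]

# Reading a user's city is a plain lookup with no join,
# but the same rename has to touch every document that embeds the city.
changed = 0
for doc in documents:
    if doc["city"] == "Oakland":
        doc["city"] = "Berkeley"
        changed += 1

print(sql_rows)
print(changed, documents)
```

The relational version changes one row and pays for it with a join at read time; the document version reads without a join and pays for it with one write per affected document.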

# Scraping from a Web Page with Python

Scraping a web site basically comes down to making a **request from Python and parsing through the HTML** that is returned from each page. For each of these tasks we have a Python library, **`requests` and `bs4`**, respectively.

### Getting Info from a Web Page

Once we have the HTML for a web page, we need **some way to pull the desired content from it**. Luckily there is already a system in place to do this. With a **combination of HTML tags and CSS selectors** we can identify the information on an HTML page that we wish to retrieve and grab it with [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#searching-the-tree).

In [None]:
html = '''<!DOCTYPE html>
<html>
<head>
<title>The title of this web page</title>
</head>
<body>
<h1>My Photos</h1>
<div class='intro'>
<p>These are some photos of my trips.</p>
<img src="me.png">
</div>

<h3>Italy</h3>
<div class='country'>
<img src="venice1.png" alt="Venice"> <br />
<img src="venice2.png" alt="Venice"> <br />
<img src="rome.png" alt="Roma">
</div>

<h3>Germany</h3>
<div class='country'>
<img src="berlin.png" alt="Berlin">
</div>
</body>
</html>
'''

In [7]:
from bs4 import BeautifulSoup

# we create a soup object with the html:
soup = BeautifulSoup(html, 'html.parser')

In [10]:
# now we can query it
soup.title

<title>The title of this web page</title>

In [11]:
soup.title.string

'The title of this web page'

In [12]:
soup.h1

<h1>My Photos</h1>

In [13]:
soup.h3

<h3>Italy</h3>

In [14]:
soup.find('h3')

<h3>Italy</h3>

In [15]:
soup.find_all('h3')

[<h3>Italy</h3>, <h3>Germany</h3>]

In [16]:
soup.find_all('h3')[1].string

'Germany'

In [17]:
soup.find_all('div', class_='country')

[<div class="country">
 <img alt="Venice" src="venice1.png"/> <br/>
 <img alt="Venice" src="venice2.png"/> <br/>
 <img alt="Roma" src="rome.png"/>
 </div>, <div class="country">
 <img alt="Berlin" src="berlin.png"/>
 </div>]

In [18]:
soup.find_all('img', alt='Venice')

[<img alt="Venice" src="venice1.png"/>, <img alt="Venice" src="venice2.png"/>]

In [19]:
soup.find('div', class_='country').find_previous_siblings('h3')

[<h3>Italy</h3>]
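The lookups above all use `find` and `find_all`. As a rough sketch of the CSS-selector route, the same kind of queries can be written with BeautifulSoup's `select` and `select_one`. The HTML below is a trimmed copy of the sample page so that the block runs on its own.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the sample page used above
html = '''
<body>
<div class='intro'><p>These are some photos of my trips.</p></div>
<h3>Italy</h3>
<div class='country'>
<img src="venice1.png" alt="Venice">
<img src="rome.png" alt="Roma">
</div>
<h3>Germany</h3>
<div class='country'>
<img src="berlin.png" alt="Berlin">
</div>
</body>
'''
soup = BeautifulSoup(html, 'html.parser')

# tag.class selector: every <div> whose class is "country"
country_divs = soup.select('div.country')

# descendant selector: every <img> anywhere inside such a div
image_sources = [img['src'] for img in soup.select('div.country img')]

# attribute selector: <img> tags whose alt attribute equals "Venice"
venice = soup.select('img[alt="Venice"]')

# select_one returns the first match (or None) instead of a list
intro_text = soup.select_one('div.intro p').get_text()

print(len(country_divs), image_sources, len(venice), intro_text)
```

Whether to use `find_all` or `select` is mostly a matter of taste; selectors tend to be shorter when the target is described by nesting, as in `div.country img`.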

### If I wanted to get a list of all of the countries visited, how would I do it?

In [None]:
# A:

<div style="color:white">
for div in soup.find_all('div', class_='country'):
    h3 = div.find_previous_siblings('h3')[0]
    country = h3.string
    print(country)

for div in soup.find_all('div', class_='country'):
    h3 = div.find_previous_siblings('h3')[0]
    country = h3.string
    for img in div.find_all('img'):
        image = img.get('src')
        print('Country: {}: image: {}'.format(country, image))
</div>

## Getting Info from a Live Web Page

### Requests Library

The [requests](http://docs.python-requests.org/en/latest/index.html) library is designed to simplify the process of making **HTTP requests within Python**. The interface is very simple: call the function named after the HTTP method you need (most often `requests.get`) with the URL and any optional parameters you'd like passed along with the request. The call returns a response object that makes the results of the request available via attributes and methods.

In [21]:
import requests
fun_cheap = 'http://sf.funcheap.com'
# Request the events page for a single day
r = requests.get(fun_cheap + '/2018/06/25/')

In [22]:
r.text[:1000] # First 1000 characters of the HTML

'<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "https://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">\n<html xmlns="https://www.w3.org/1999/xhtml" lang="en-US" xmlns:fb="http://www.facebook.com/2008/fbml" xmlns:og="http://opengraphprotocol.org/schema/">\n\n<head profile="https://gmpg.org/xfn/11">\n<script src="//cdn.optimizely.com/js/195632799.js"></script>\n\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />\n\n\n<title>Events for June 25, 2018 Archives - FunCheapSF.com</title>\n\n<meta name="generator" content="WordPress" /> <!-- leave this for stats -->\n\n<link rel="stylesheet" href="https://cdn.funcheap.com/wp-content/themes/arthemia-premium/style.css?v=1.8.9" type="text/css" media="screen" />\n<link rel="stylesheet" href="https://cdn.funcheap.com/wp-content/themes/arthemia-premium/madmenu.css?v=1.1" type="text/css" media="screen" />\n<!--[if IE 6]>\n    <style type="text/css">\n    body {\n        behavior:url("https://cdn.funcheap.com/wp

### Now that we have the web page, we can parse it with BeautifulSoup:

In [23]:
soup = BeautifulSoup(r.text, 'html.parser')

#### Get the page heading using the CSS selector 'h2.title' (an 'h2' tag with class 'title'):

In [24]:
soup.select('h2.title')[0].string

'Events for  June 25, 2018'

In [25]:
title = soup.find_all('h2', class_='title')[0]

title

<h2 class="title">Events for  June 25, 2018</h2>

In [27]:
good_clear_float = title.next_sibling.next_sibling

# good_clear_float

#### Save all the URLs found in the 'a' tags:

In [28]:
urls = []
for tag in good_clear_float.find_all('a', rel=True):
    href = tag.attrs['href']
    urls.append(href)

In [29]:
urls

['https://sf.funcheap.com/67th-beginning-korean-war-commemoration-presidio/',
 'https://sf.funcheap.com/67th-beginning-korean-war-commemoration-presidio/',
 'https://sf.funcheap.com/live-lagunitas-white-buffalo-petaluma/',
 'https://sf.funcheap.com/live-lagunitas-white-buffalo-petaluma/',
 'https://sf.funcheap.com/popping-the-%e2%80%8bscience-%e2%80%8bbubble-conversation-w-cal-scientists-berkeley-3/',
 'https://sf.funcheap.com/popping-the-%e2%80%8bscience-%e2%80%8bbubble-conversation-w-cal-scientists-berkeley-3/',
 'https://sf.funcheap.com/nerd-nite-east-bay-geeky-lectures-in-a-bar-oakland-54/',
 'https://sf.funcheap.com/case-of-the-mondays-game-night-free-comedy-milk-bar-3/',
 'https://sf.funcheap.com/premier-scottish-indiepop-band-belle-sebastian-fox-theater/',
 'https://sf.funcheap.com/free-standup-comedy-night-pop-food-oakland-19/',
 'https://sf.funcheap.com/monday-night-comedy-free-cake-blondies-3/',
 'https://sf.funcheap.com/laughgasm-monday-comedy-at-the-rite-spot-sf-23/']
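Several URLs in the list above appear twice, presumably because an event is linked from more than one `a` tag. One simple way to drop the repeats while keeping the original order is `dict.fromkeys`, sketched here on a shortened copy of the list:

```python
# Shortened copy of the scraped list, including one repeated URL
urls = [
    'https://sf.funcheap.com/live-lagunitas-white-buffalo-petaluma/',
    'https://sf.funcheap.com/live-lagunitas-white-buffalo-petaluma/',
    'https://sf.funcheap.com/nerd-nite-east-bay-geeky-lectures-in-a-bar-oakland-54/',
    'https://sf.funcheap.com/monday-night-comedy-free-cake-blondies-3/',
]

# Dictionary keys are unique and keep insertion order,
# so this removes repeats without reordering the URLs.
unique_urls = list(dict.fromkeys(urls))

print(len(urls), len(unique_urls))
```

A `set` would also remove the repeats, but it does not preserve the order in which the events appeared on the page.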

### Checkout (in pairs): Explain the fundamental difference between MongoDB and SQLite. Why would you use one over the other? What is the trade-off?

## Mongo and API Scraping

Many APIs give you a choice of format for the data they return. **Choosing JSON makes life easier, since we will frequently use Mongo for storage** during our scraping endeavors, and Mongo plays very well with JSON.

Interacting with Mongo from Python is done with **PyMongo**, the other **Mongo client** that we talked about earlier. It is designed to have an interface similar to that of the Mongo shell. This ends up being fairly intuitive since both **Python and JavaScript are object-oriented languages** and therefore store and refer to things in a similar manner.

Remember, in MongoDB the vocabulary is slightly different from that of a regular RDBMS:
- collection <=> table
- document <=> row
- field <=> column
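To illustrate why JSON and Mongo fit together, the sketch below parses a JSON string with the standard `json` module. The record is a trimmed copy of one entry from the UK police API response shown later in this notebook. The result is a list of dictionaries, where each dictionary corresponds to a document and each key to a field.

```python
import json

# One trimmed record from the police API response, as raw JSON text
raw = '''[{"category": "bicycle-theft",
           "location_type": null,
           "location": null,
           "context": "",
           "outcome_status": {"category": "Investigation complete; no suspect identified",
                              "date": "2018-03"},
           "id": 62982250,
           "location_subtype": "",
           "month": "2018-02"}]'''

records = json.loads(raw)   # JSON array  -> Python list
first = records[0]          # JSON object -> Python dict (one "document")

# JSON null becomes None, and nested objects become nested dicts,
# which Mongo can store as embedded documents.
print(type(records).__name__, type(first).__name__)
print(first['location'], first['outcome_status']['date'])
```

A list of dictionaries like `records` is the shape that PyMongo's `insert_many` accepts, so no flattening into rows and columns is needed.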



### Install and Run MongoDB with Homebrew
- Open the Terminal app and run `brew update`.
- After updating Homebrew, run `brew install mongodb`. (On recent Homebrew versions this formula may no longer be available; MongoDB is then installed from the `mongodb/brew` tap as `mongodb-community`.)
- After downloading Mongo, create the “db” directory. This is where the Mongo data files will live. You can create the directory in the default location by running `mkdir -p /data/db`.
- Make sure that the `/data/db` directory has the right permissions by running the following in the terminal:

```
sudo chmod 0755 /data/db
# Enter your password when prompted
```
If you get an error because the folder does not exist, enter:

```
sudo mkdir -p /data/db
```
- Run the Mongo daemon: in one of your terminal windows, run `sudo mongod`. This should start the Mongo server. (In multitasking operating systems, a daemon is a program that runs as a background process, also called a service, rather than under the direct control of an interactive user.)
- Run the Mongo shell: with the Mongo daemon running in one terminal, type `mongo` in another terminal window. This runs the Mongo shell, an application for accessing data in MongoDB.
- You can now check the databases by typing `show dbs` in the Mongo shell.
- To exit the Mongo shell, run `quit()` or press ctrl-c.
- To stop the Mongo daemon, press ctrl-c.

# Mongo Shell Demo Code

## Using Mongo - General Commands for Inspecting Mongo

```javascript
help                        // List top level mongo commands

db.help()                   // List database level mongo commands

db.<collection name>.help() // List collection level mongo commands.

show dbs                    // Get list of databases on your system

use <database name>         // Switch to the named database; if it does not exist, it is created once you first store data in it

show collections            // Get list of collections within the database that you're currently using
```

## Inserting

Once you're using a database you refer to it with the name **db**. Collections within databases are accessible through dot notation.

```javascript
db.users.insert({ name: 'Jon', age: '45', friends: [ 'Henry', 'Ashley']})

db.getCollectionNames()  // Another way to get the names of collections in current database

db.users.insert({ name: 'Ashley', age: '37', friends: [ 'Jon', 'Henry']})
db.users.insert({ name: 'Frank', age: '17', friends: [ 'Billy'], car : 'Civic'})

db.users.find()

    { "_id" : ObjectId("573a39"), "name" : "Jon", "age" : "45", "friends" : [ "Henry", "Ashley" ] }
    { "_id" : ObjectId("573a3a"), "name" : "Ashley", "age" : "37", "friends" : [ "Jon", "Henry" ] }
    { "_id" : ObjectId("573a3b"), "name" : "Frank", "age" : "17", "friends" : [ "Billy" ], "car" : "Civic" }
```

Things to note:
* The three documents that we inserted into the collection above didn't all have the same fields.
* Mongo creates an `_id` field for each document if one isn't provided.

## Querying

```javascript
db.users.find({ name: 'Jon'})                       // find by single field

db.users.find({ car: { $exists : true } })          // find by presence of field

db.users.find({ friends: 'Henry' })                 // find by value in array

db.users.find({}, { name: true })                   // field selection (only return name)
```

A quick way to figure out how to write a Mongo query is to think about how you would do it in SQL and check out a resource like this Mongo endorsed [conversion guide](https://docs.mongodb.com/manual/reference/sql-comparison/#create-and-alter), or use something like a [query translator](http://www.querymongo.com/).
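The shell queries above carry over to PyMongo with only small changes: keys become quoted strings and `true` becomes `True`. The following sketch wraps them in a hypothetical helper, `example_queries`, that accepts any PyMongo collection; the helper name and the keys of the returned dictionary are illustrative only.

```python
def example_queries(collection):
    """Run the four shell queries above through a PyMongo-style collection.

    `collection` is expected to behave like pymongo.collection.Collection,
    whose find(filter, projection) returns an iterable of documents.
    """
    return {
        # find by single field
        'by_field': list(collection.find({'name': 'Jon'})),
        # find by presence of a field
        'has_car': list(collection.find({'car': {'$exists': True}})),
        # find by value in an array
        'friend_of_henry': list(collection.find({'friends': 'Henry'})),
        # field selection: empty filter, projection keeps only name (plus _id)
        'names_only': list(collection.find({}, {'name': True})),
    }
```

With a Mongo server running, this could be called as `example_queries(MongoClient().test.users)`, assuming the `users` collection from the shell demo lives in a database named `test`.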

## Updating

```javascript
db.users.update({name: "Jon"}, { $set: {friends: ["Phil"]}})            // replaces friends array

db.users.update({name: "Jon"}, { $push: {friends: "Susie"}})            // adds to friends array

db.users.update({name: "Stevie"}, { $push: {friends: "Nicks"}}, true)   // upsert

db.users.update({}, { $set: { activated : false } }, false, true)       // multiple updates
```
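In PyMongo the shell's positional `upsert` and `multi` flags are expressed differently: `update_one` changes the first matching document, `update_many` changes all of them, and upserting is requested with the `upsert=True` keyword. A sketch of the four updates above, wrapped in a hypothetical helper named `example_updates`:

```python
def example_updates(collection):
    """Apply the four shell updates above through a PyMongo-style collection."""
    # replaces the friends array
    collection.update_one({'name': 'Jon'}, {'$set': {'friends': ['Phil']}})

    # adds to the friends array
    collection.update_one({'name': 'Jon'}, {'$push': {'friends': 'Susie'}})

    # upsert: insert a new document if no document matches the filter
    collection.update_one({'name': 'Stevie'},
                          {'$push': {'friends': 'Nicks'}},
                          upsert=True)

    # multiple updates: the empty filter matches every document
    collection.update_many({}, {'$set': {'activated': False}})
```

As in the shell, `$set` and `$push` only touch the named fields; the rest of each document is left as it was.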

### Let's use Mongo with Python:

Install the modules pymongo and tqdm, by running the following commands in the terminal:
```
pip install pymongo
pip install tqdm
```

In [30]:
from pymongo import MongoClient
from tqdm import tqdm

client = MongoClient()
db = client.uk_police2
collection = db.all_crime

### We are going to use some of the UK police data through their public API:

https://data.police.uk/docs/method/crimes-no-location/

In [31]:
other_request = requests.get('https://data.police.uk/api/crimes-no-location?category=all-crime&force=leicestershire&date=2018-02')

In [32]:
other_request.json()

[{'category': 'bicycle-theft',
  'location_type': None,
  'location': None,
  'context': '',
  'outcome_status': {'category': 'Investigation complete; no suspect identified',
   'date': '2018-03'},
  'persistent_id': 'db31a3c713ed240ed0ec9229abdd122fe5ef795ad7d2e36e9a456f739ef4715d',
  'id': 62982250,
  'location_subtype': '',
  'month': '2018-02'},
 {'category': 'burglary',
  'location_type': None,
  'location': None,
  'context': '',
  'outcome_status': {'category': 'Investigation complete; no suspect identified',
   'date': '2018-02'},
  'persistent_id': 'd8d71e117157ec7df5977ec0b1df808c673c0b58cecef3e6c62c5ce9cc1bb8ff',
  'id': 62978776,
  'location_subtype': '',
  'month': '2018-02'},
 {'category': 'burglary',
  'location_type': None,
  'location': None,
  'context': '',
  'outcome_status': {'category': 'Unable to prosecute suspect',
   'date': '2018-03'},
  'persistent_id': '9a6b6df740fbacce82a40aee5de726119dd408cad36e7de712fe301b6d58efbd',
  'id': 62982759,
  'location_subtype':

### Let's grab the data and insert it into our MongoDB database

In [34]:
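# Possible way to grab data for a range of months and years.
# Note: range() excludes its stop value, so range(2017, 2018) covers only 2017 and
# range(1, 12) covers January through November; use range(1, 13) to include December.
for year in range(2017, 2018):
    for month in tqdm(range(1, 12)):
        # Zero-pad the month to match the YYYY-MM form used in the request above
        date = '{}-{:02d}'.format(year, month)
        r = requests.get('https://data.police.uk/api/crimes-no-location?category=all-crime&force=leicestershire&date={}'.format(date))
        r.raise_for_status()  # stop on an HTTP error instead of inserting an error payload
        collection.insert_many(r.json())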

100%|██████████| 11/11 [00:17<00:00,  1.57s/it]
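The date handling in the loop can also be separated from the network calls, which makes it easy to check before any request is sent. The sketch below builds zero-padded `YYYY-MM` strings and the matching query URLs with the standard library only; `month_strings` and `build_url` are hypothetical helper names, and the parameter names are the ones used in the request above.

```python
from urllib.parse import urlencode

BASE_URL = 'https://data.police.uk/api/crimes-no-location'


def month_strings(start_year, end_year):
    """Return 'YYYY-MM' strings for every month from start_year through end_year."""
    return ['{}-{:02d}'.format(year, month)
            for year in range(start_year, end_year + 1)   # + 1: range excludes its stop
            for month in range(1, 13)]                    # 13: include December


def build_url(date, force='leicestershire', category='all-crime'):
    """Build the query URL for one month of crimes without a location."""
    query = urlencode({'category': category, 'force': force, 'date': date})
    return '{}?{}'.format(BASE_URL, query)


dates = month_strings(2017, 2017)
print(len(dates), dates[0], dates[-1])
print(build_url(dates[0]))
```

Each URL can then be passed to `requests.get`, or the dictionary can be handed to `requests.get(BASE_URL, params=...)`, which performs the same encoding.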


In [35]:
collection.insert_many(other_request.json())

<pymongo.results.InsertManyResult at 0x10ddc5e08>

In [36]:
from pprint import pprint

for item in collection.find({ 'category' : 'burglary' }):
    pprint(item)

{'_id': ObjectId('5b562e7e2a5c647591332bc3'),
 'category': 'burglary',
 'context': '',
 'id': 54725401,
 'location': None,
 'location_subtype': '',
 'location_type': None,
 'month': '2017-02',
 'outcome_status': {'category': 'Investigation complete; no suspect identified',
                    'date': '2017-02'},
 'persistent_id': '2f54b149a18892c1d19b2cb971e60af2daee051c563b57e7b0d4e3163e476180'}
{'_id': ObjectId('5b562e7e2a5c647591332bc4'),
 'category': 'burglary',
 'context': '',
 'id': 54732498,
 'location': None,
 'location_subtype': '',
 'location_type': None,
 'month': '2017-02',
 'outcome_status': {'category': 'Investigation complete; no suspect identified',
                    'date': '2017-02'},
 'persistent_id': 'b9c512d3c6408aa45276a50fcd66c119117ca23c5e07e700886eebc988a4e42a'}
{'_id': ObjectId('5b562e7f2a5c647591332bc6'),
 'category': 'burglary',
 'context': '',
 'id': 56862854,
 'location': None,
 'location_subtype': '',
 'location_type': None,
 'month': '2017-03',
 'outco

In [37]:
# Remember to close the connection
client.close()