# Web Scraping
by Chris North, Virginia Tech

Topics: Fetch, Parse, and Crawl web pages with Requests and BeautifulSoup


## Data on the web

* What data on the web might be useful for analysis?
    * https://www.yelp.com/search?find_desc=Restaurants&find_loc=Blacksburg%2C+VA+24060&ns=1&sortby=rating
    * https://www.amazon.com/Hutzler-571-Banana-Slicer/product-reviews/B0047E0EII 
    * https://twitter.com/search?f=tweets&q=march%20madness
    * ?
* Vast opportunity space for data science empowerment
    * https://www.softwarefindr.com/how-many-websites-are-there/    
![Websites](https://saasscout.com/wp-content/uploads/Total-number-of-Websites.png)


## How web publishing works

* Web publishing process 
    * **Data** &rarr; (server side scripts) &rarr; HTML &rarr; (browser) &rarr; Visual page
![Web data](http://i2.sitepoint.com/graphics/1733_first_principles.thumb.jpg)

* HTML5 DOM: Document Object Model
    * https://en.wikipedia.org/wiki/Document_Object_Model
    * https://www.w3schools.com/js/js_htmldom.asp
    * Browser developer menu, view page source, inspector
        * designed to specify the document structure and page layout
        * tags
![DOM](https://upload.wikimedia.org/wikipedia/commons/5/5a/DOM-model.svg)


## How to get the data?

* Download
    * https://www.kaggle.com
* Web DB APIs
    * REST https://en.wikipedia.org/wiki/Representational_state_transfer
    * Twitter API  
        * https://developer.twitter.com/en/docs/tweets/search/api-reference/get-search-tweets
        * http://socialmedia-class.org/twittertutorial.html
        * Twint: https://github.com/twintproject/twint
    * Spotify API:  https://developer.spotify.com/documentation/web-api/
* **Scraping**
    * but web pages are designed for human consumption, not for machine parsing


## How web scraping works

* Web scraping process:  Fetch, Parse, Crawl
    * URL &rarr; ([Fetch](#1.-Fetch)) &rarr; HTML &rarr; ([Parse](#2.-Parse)) &rarr; **Data** &rarr; ([Crawl](#3.-Crawl)) &rarr; URL &rarr; ...
* Need to algorithmically parse HTML DOM structure
    * Search the DOM for unique tags
    * Navigate the DOM tree structure
* Common parsing patterns of page&rarr;data  
    * Page ==> Data row : e.g. product page
    * Page ==> k data rows : e.g. paginated search result list
    * Page ==> Data column :  e.g. index page, list
* Cautions:
    * Getting blocked by the server
    * Human activity or cyber attack?
    * Retrieved page might not look the same as in browser, due to JavaScript
* Ethical concerns?
    * Does the provider allow scraping?
    * Are there commercial restrictions?
    * Intellectual property?
    * Will rapid crawling harm the site (e.g., act like a denial-of-service attack)?
    * Privacy issues?  Personal identifying information?
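
One way to start answering "Does the provider allow scraping?" is to read the site's `robots.txt`. The sketch below uses the standard-library `urllib.robotparser`. The robots.txt content, the `example.com` URLs, and the agent name `my-course-scraper` are all made up for illustration. A real check would call `set_url(...)` and `read()` against the live site, and `robots.txt` does not replace reading the site's terms of service.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the sketch runs offline.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

AGENT = "my-course-scraper"  # assumed name for our scraper


def make_parser(robots_text):
    """Build a parser from robots.txt text instead of fetching it."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


parser = make_parser(ROBOTS_TXT)

# Ask whether our agent may fetch each URL, and how long to wait between requests.
allowed = parser.can_fetch(AGENT, "https://example.com/search?q=pizza")
blocked = parser.can_fetch(AGENT, "https://example.com/private/data")
delay = parser.crawl_delay(AGENT)

print(allowed, blocked, delay)
```

The returned crawl delay (when the site declares one) is a reasonable lower bound for the pause between requests in the Crawl step.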



### 1. Fetch

Python "Requests" library
* http://requests.readthedocs.io/en/latest/user/quickstart/
* Browser user agent:  https://www.whatismybrowser.com/detect/what-is-my-user-agent
* Other libraries:
    * Selenium:  https://pypi.python.org/pypi/selenium
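
A minimal sketch of a more defensive fetch, assuming it is acceptable to identify the scraper with a custom `User-Agent`. The header value, the helper names `build_request` and `fetch`, and the 10-second timeout are illustrative choices, not part of the original notes. `build_request` only prepares the request so the final URL can be inspected without touching the network.

```python
import requests

# Assumed, illustrative identification string; replace with your own contact details.
HEADERS = {"User-Agent": "course-scraper/0.1 (contact: you@example.com)"}


def build_request(url, params):
    """Prepare (but do not send) a GET request so the encoded URL can be checked."""
    return requests.Request("GET", url, params=params, headers=HEADERS).prepare()


def fetch(url, params=None, timeout=10):
    """Send the request and raise on HTTP errors rather than parsing an error page."""
    response = requests.get(url, params=params, headers=HEADERS, timeout=timeout)
    response.raise_for_status()
    return response


# Let Requests encode the query string instead of hand-writing %2C and +.
prepared = build_request(
    "https://www.yelp.com/search",
    {
        "find_desc": "Restaurants",
        "find_loc": "Blacksburg, VA 24060",
        "ns": 1,
        "sortby": "rating",
    },
)
print(prepared.url)
```

Calling `fetch(prepared.url)` would then perform the actual download. Without a timeout, a stalled server can hang the notebook indefinitely.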

Example: Yelp reviews of restaurants near Blacksburg
https://www.yelp.com/search?find_desc=Restaurants&find_loc=Blacksburg%2C+VA+24060&ns=1&sortby=rating


In [1]:
### Example:  Yelp reviews
import requests


In [2]:
url = "https://www.yelp.com/search?find_desc=Restaurants&find_loc=Blacksburg%2C+VA+24060&ns=1&sortby=rating"
page = requests.get(url)

In [3]:
page.reason

'OK'

In [4]:
page.text

'<!DOCTYPE html><html lang="en-US" prefix="og: http://ogp.me/ns#" style="margin: 0;padding: 0; border: 0; font-size: 100%; font: inherit; vertical-align: baseline;"><head><script>document.documentElement.className=document.documentElement.className.replace(/\x08no-js\x08/,"js");</script><meta http-equiv="Content-Type" content="text/html; charset=UTF-8" /><meta http-equiv="Content-Language" content="en-US" /><meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no"><link rel="mask-icon" sizes="any" href="https://s3-media0.fl.yelpcdn.com/assets/srv0/yelp_large_assets/b2bb2fb0ec9c/assets/img/logos/yelp_burst.svg" content="#FF1A1A"><link rel="shortcut icon" href="https://s3-media0.fl.yelpcdn.com/assets/srv0/yelp_large_assets/b05852393ae5/assets/img/logos/favicon.ico"><script> window.ga=window.ga||function(){(ga.q=ga.q||[]).push(arguments)};ga.l=+new Date;window.ygaPageStartTime=new Date().getTime();</script><script>\n            window.yelp = window.yelp || {};\

In [5]:
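# Save a local copy; the with-block closes the file, and write() returns the character count
with open('page.html', 'w', encoding='utf-8') as f:
    n_chars = f.write(page.text)
n_chars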

405099

### 2. Parse

Python "BeautifulSoup4" library
* https://www.crummy.com/software/BeautifulSoup/
* https://www.crummy.com/software/BeautifulSoup/bs4/doc/
* "html5lib", and "lxml" also useful for parsing

HTML5 DOM structure:
* https://www.w3.org/TR/WD-DOM/introduction.html
* https://html.spec.whatwg.org (section 4: The elements of HTML)
* Browser's Developer Tools, Inspect Element

Two strategies to parse HTML DOM:
1. Search for unique tag attributes (class, id, …)
2. Navigate DOM tree structure, top down


In [6]:
### Put page content into BeautifulSoup for easy parsing
import bs4


In [7]:
soup = bs4.BeautifulSoup(page.text, 'html5lib')

#### 2.1 Search

Use the browser's "Inspect Element" feature to examine the objects you want to extract.
Look for unique identifiers, then search for those unique identifiers in the soup:

`tag = soup.find('tag', {dictionary of identifiers})`

`tags = soup.find_all('tag', {dictionary of identifiers})`
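
To see the difference between `find` and `find_all` without depending on a live page, here is a small sketch on a made-up HTML fragment. The class names `biz-name`, `count`, and `ad` are hypothetical, and the built-in `html.parser` is used only to avoid an extra dependency.

```python
import bs4

# Hypothetical, simplified HTML standing in for a search-result page.
HTML = """
<ul>
  <li><a class="biz-name" href="/biz/cafe-one">Cafe One</a> <span class="count">12</span></li>
  <li><a class="biz-name" href="/biz/diner-two">Diner Two</a> <span class="count">34</span></li>
  <li><a class="ad" href="/ads/promo">Sponsored</a></li>
</ul>
"""

soup_demo = bs4.BeautifulSoup(HTML, "html.parser")

# find() returns the first matching tag (or None if nothing matches).
first = soup_demo.find("a", {"class": "biz-name"})

# find_all() returns a list of every matching tag; the "ad" link is excluded.
all_names = soup_demo.find_all("a", {"class": "biz-name"})

print(first.text)
print([a.text for a in all_names])
print([a["href"] for a in all_names])
```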

In [8]:
## Example: Find the restaurant names

tags = soup.find_all('a', {"class": "css-1m051bw", "rel": "noopener"})
tags

[<a class="css-1m051bw" href="/biz/taqueria-el-paso-christiansburg?osq=Restaurants" name="Taqueria el Paso" rel="noopener" target="_blank">Taqueria el Paso</a>,
 <a class="css-1m051bw" href="/biz/the-blacksburg-tavern-blacksburg?osq=Restaurants" name="The Blacksburg Tavern" rel="noopener" target="_blank">The Blacksburg Tavern</a>,
 <a class="css-1m051bw" href="/biz/cabo-fish-taco-blacksburg?osq=Restaurants" name="Cabo Fish Taco" rel="noopener" target="_blank">Cabo Fish Taco</a>,
 <a class="css-1m051bw" href="/biz/gaucho-brazilian-grille-blacksburg-2?osq=Restaurants" name="Gaucho Brazilian Grille" rel="noopener" target="_blank">Gaucho Brazilian Grille</a>,
 <a class="css-1m051bw" href="/biz/blacksburg-wine-lab-blacksburg?osq=Restaurants" name="Blacksburg Wine Lab" rel="noopener" target="_blank">Blacksburg Wine Lab</a>,
 <a class="css-1m051bw" href="/biz/spicity-blacksburg?osq=Restaurants" name="Spicity" rel="noopener" target="_blank">Spicity</a>,
 <a class="css-1m051bw" href="/biz/joes-

In [9]:
tags[0].text

'Taqueria el Paso'

In [10]:
tags[0]['name']

'Taqueria el Paso'

In [11]:
names = [t.text for t in tags]
names

['Taqueria el Paso',
 'The Blacksburg Tavern',
 'Cabo Fish Taco',
 'Gaucho Brazilian Grille',
 'Blacksburg Wine Lab',
 'Spicity',
 'Joe’s Diner',
 'Our Daily Bread Bakery and Bistro',
 'Avellinos',
 'Benny Marzano’s']

#### 2.2 Navigate

What if we didn't have unique class names to search for?

Use the browser's DOM viewer to explore the hierarchy, working from the top down to the desired target element, then recreate that navigation pattern in code. Alternatively, find the closest enclosing tag that can be searched for uniquely, then navigate down from there.
Navigate a tag's children using:

`childtag = tag.contents[childnumber]`
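
The navigation idea can be sketched on a small, made-up fragment (the `card` and `biz-name` class names are hypothetical). The HTML is written without whitespace between tags on purpose: whitespace between tags is parsed as text nodes, which also count as entries in `.contents` and shift the child indices.

```python
import bs4

# Hypothetical card: the name and the review count live in sibling <div>s.
HTML = (
    '<div class="card">'
    '<div><h3><a class="biz-name" href="/biz/cafe-one">Cafe One</a></h3></div>'
    '<div><span>12</span></div>'
    '</div>'
)

soup_demo = bs4.BeautifulSoup(HTML, "html.parser")
name_tag = soup_demo.find("a", {"class": "biz-name"})

# Climb up: a -> h3 -> inner div -> div.card
card = name_tag.parent.parent.parent

# Step down: the second child of the card is the div holding the count.
count_text = card.contents[1].text

# Alternative to counting .parent hops: search upward for the enclosing card.
same_card = name_tag.find_parent("div", {"class": "card"})

print(count_text, same_card is card)
```

Chains of `.parent` are brittle when the site changes its layout, so `find_parent` on a recognizable ancestor is often easier to maintain when such an ancestor exists.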

In [12]:
## Example:  Navigate from the restaurant name to the restaurant review counts
parent = tags[0].parent.parent.parent.parent.parent.parent

In [13]:
parent.contents[1]

<div class="border-color--default__09f24__NPAKY"><div aria-hidden="false" class="display--inline-block__09f24__fEDiJ border-color--default__09f24__NPAKY"><div class="border-color--default__09f24__NPAKY"><div class="display--inline-block__09f24__fEDiJ margin-t0-5__09f24__gboxT border-color--default__09f24__NPAKY"><div class="display--inline-block__09f24__fEDiJ margin-r1__09f24__rN_ga border-color--default__09f24__NPAKY"><span class="display--inline__09f24__c6N_k border-color--default__09f24__NPAKY"><div aria-label="5 star rating" class="five-stars__09f24__mBKym five-stars--regular__09f24__DgBNj display--inline-block__09f24__fEDiJ border-color--default__09f24__NPAKY" role="img"><div class="star__09f24__YoVH5 star--regular__09f24__IopbI display--inline-block__09f24__fEDiJ border-color--default__09f24__NPAKY"><svg height="20" viewBox="0 0 20 20" width="20"><path d="M0 4C0 1.79086 1.79086 0 4 0H10V20H4C1.79086 20 0 18.2091 0 16V4Z" fill="rgba(251,67,60,1)" opacity="1"></path><path d="M20 4C

In [14]:
parent.contents[1].text

'75'

In [21]:
rcs = [int((t.parent.parent.parent.parent.parent.parent).contents[1].text) for t in tags]
rcs

[75, 151, 560, 163, 53, 134, 100, 221, 110, 152]

In [16]:
## Example:  Navigate from the restaurant name to the restaurant star rating
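
The star-rating cell is left open in these notes. One possible approach, assuming the rating is exposed in an `aria-label` such as the `5 star rating` seen in the `parent.contents[1]` output above, is to search the enclosing container for the `role="img"` element and parse the number out of the label. The HTML fragment and the helper name `star_rating` below are hypothetical.

```python
import re

import bs4

# Hypothetical fragment modeled on the aria-label in the earlier output.
HTML = (
    '<div class="card">'
    '<div aria-label="4.5 star rating" role="img"></div>'
    '<span>75</span>'
    '</div>'
)
card = bs4.BeautifulSoup(HTML, "html.parser").find("div", {"class": "card"})


def star_rating(container):
    """Return the rating as a float, or None if no rating label is found."""
    tag = container.find("div", {"role": "img"})
    if tag is None:
        return None
    match = re.search(r"(\d+(?:\.\d+)?) star rating", tag.get("aria-label", ""))
    return float(match.group(1)) if match else None


rating = star_rating(card)
print(rating)
```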

#### 2.3 Scraping multiple data fields into a DataFrame

In [22]:
import pandas

pandas.DataFrame(zip(names, rcs))

|   | 0 | 1 |
|---|---|---|
| 0 | Taqueria el Paso | 75 |
| 1 | The Blacksburg Tavern | 151 |
| 2 | Cabo Fish Taco | 560 |
| 3 | Gaucho Brazilian Grille | 163 |
| 4 | Blacksburg Wine Lab | 53 |
| 5 | Spicity | 134 |
| 6 | Joe’s Diner | 100 |
| 7 | Our Daily Bread Bakery and Bistro | 221 |
| 8 | Avellinos | 110 |
| 9 | Benny Marzano’s | 152 |
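
Zipping separately scraped lists works only while every record has every field. If one restaurant lacks a review count, the lists fall out of alignment. A sketch of an alternative, assuming each record sits inside its own container element: extract all fields per container into a dict, then build the DataFrame with named columns. The HTML and the class names `card`, `biz-name`, and `count` are hypothetical.

```python
import bs4
import pandas

# Hypothetical result cards; the second one has no review count.
HTML = (
    '<div class="card"><a class="biz-name">Cafe One</a><span class="count">12</span></div>'
    '<div class="card"><a class="biz-name">Diner Two</a></div>'
    '<div class="card"><a class="biz-name">Bistro Three</a><span class="count">56</span></div>'
)
soup_demo = bs4.BeautifulSoup(HTML, "html.parser")

rows = []
for card in soup_demo.find_all("div", {"class": "card"}):
    name = card.find("a", {"class": "biz-name"})
    count = card.find("span", {"class": "count"})
    rows.append({
        "name": name.text,
        # Keep the row and record a missing value instead of shifting later rows.
        "review_count": int(count.text) if count is not None else None,
    })

df = pandas.DataFrame(rows)
print(df)
```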


### 3. Crawl

* Scrape many pages
* use parsed hyperlinks on an index page to scrape more pages
* or use URL query string
* common patterns:
    * results list, n rows on 1 page
    * paginated result lists, k rows per page, n/k pages
    * index page + n detail pages, 1 row per page


In [23]:
## Example: build the URL for one page of the paginated results (offset start=40)
url = "https://www.yelp.com/search?find_desc=Restaurants&find_loc=Blacksburg%2C+VA+24060&ns=1&sortby=rating&start="
urli = url + str(40)
urli

'https://www.yelp.com/search?find_desc=Restaurants&find_loc=Blacksburg%2C+VA+24060&ns=1&sortby=rating&start=40'

In [24]:
## Exercise: crawl paginated lists: restaurants 1-10, 11-20, 21-30, ...

pages = [requests.get(url + str(i*10)) for i in range(5)]
pages

[<Response [200]>,
 <Response [200]>,
 <Response [200]>,
 <Response [200]>,
 <Response [200]>]

In [25]:
pages[0].text

'<!DOCTYPE html><html lang="en-US" prefix="og: http://ogp.me/ns#" style="margin: 0;padding: 0; border: 0; font-size: 100%; font: inherit; vertical-align: baseline;"><head><script>document.documentElement.className=document.documentElement.className.replace(/\x08no-js\x08/,"js");</script><meta http-equiv="Content-Type" content="text/html; charset=UTF-8" /><meta http-equiv="Content-Language" content="en-US" /><meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no"><link rel="mask-icon" sizes="any" href="https://s3-media0.fl.yelpcdn.com/assets/srv0/yelp_large_assets/b2bb2fb0ec9c/assets/img/logos/yelp_burst.svg" content="#FF1A1A"><link rel="shortcut icon" href="https://s3-media0.fl.yelpcdn.com/assets/srv0/yelp_large_assets/b05852393ae5/assets/img/logos/favicon.ico"><script> window.ga=window.ga||function(){(ga.q=ga.q||[]).push(arguments)};ga.l=+new Date;window.ygaPageStartTime=new Date().getTime();</script><script>\n            window.yelp = window.yelp || {};\

In [26]:
soups = [bs4.BeautifulSoup(page.text, 'html5lib') for page in pages]
soups

[<!DOCTYPE html>
 <html lang="en-US" prefix="og: http://ogp.me/ns#" style="margin: 0;padding: 0; border: 0; font-size: 100%; font: inherit; vertical-align: baseline;"><head><script>document.documentElement.className=document.documentElement.className.replace(/no-js/,"js");</script><meta content="text/html; charset=utf-8" http-equiv="Content-Type"/><meta content="en-US" http-equiv="Content-Language"/><meta content="width=device-width, initial-scale=1, shrink-to-fit=no" name="viewport"/><link content="#FF1A1A" href="https://s3-media0.fl.yelpcdn.com/assets/srv0/yelp_large_assets/b2bb2fb0ec9c/assets/img/logos/yelp_burst.svg" rel="mask-icon" sizes="any"/><link href="https://s3-media0.fl.yelpcdn.com/assets/srv0/yelp_large_assets/b05852393ae5/assets/img/logos/favicon.ico" rel="shortcut icon"/><script> window.ga=window.ga||function(){(ga.q=ga.q||[]).push(arguments)};ga.l=+new Date;window.ygaPageStartTime=new Date().getTime();</script><script>
             window.yelp = window.yelp || {};
   

In [27]:
tags = [soup.find_all('a', {"class": "css-1m051bw", "rel": "noopener"}) for soup in soups]
tags

[[<a class="css-1m051bw" href="/biz/taqueria-el-paso-christiansburg?osq=Restaurants" name="Taqueria el Paso" rel="noopener" target="_blank">Taqueria el Paso</a>,
  <a class="css-1m051bw" href="/biz/the-blacksburg-tavern-blacksburg?osq=Restaurants" name="The Blacksburg Tavern" rel="noopener" target="_blank">The Blacksburg Tavern</a>,
  <a class="css-1m051bw" href="/biz/cabo-fish-taco-blacksburg?osq=Restaurants" name="Cabo Fish Taco" rel="noopener" target="_blank">Cabo Fish Taco</a>,
  <a class="css-1m051bw" href="/biz/gaucho-brazilian-grille-blacksburg-2?osq=Restaurants" name="Gaucho Brazilian Grille" rel="noopener" target="_blank">Gaucho Brazilian Grille</a>,
  <a class="css-1m051bw" href="/biz/blacksburg-wine-lab-blacksburg?osq=Restaurants" name="Blacksburg Wine Lab" rel="noopener" target="_blank">Blacksburg Wine Lab</a>,
  <a class="css-1m051bw" href="/biz/spicity-blacksburg?osq=Restaurants" name="Spicity" rel="noopener" target="_blank">Spicity</a>,
  <a class="css-1m051bw" href="/bi
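
For the "index page + n detail pages" pattern, the `href` values scraped above are relative (`/biz/...`), so they must be joined with the site's base URL before they can be fetched. A minimal sketch using the standard-library `urljoin` follows. The `href` strings are copied from the output above, but their grouping into two pages is only illustrative.

```python
from urllib.parse import urljoin

BASE = "https://www.yelp.com"

# hrefs as they might be collected, one inner list per results page
# (the split across pages here is an assumption for illustration).
hrefs_by_page = [
    [
        "/biz/taqueria-el-paso-christiansburg?osq=Restaurants",
        "/biz/cabo-fish-taco-blacksburg?osq=Restaurants",
    ],
    [
        "/biz/spicity-blacksburg?osq=Restaurants",
    ],
]

# Flatten the list of lists and convert each relative link to an absolute URL.
detail_urls = [
    urljoin(BASE, href)
    for page_hrefs in hrefs_by_page
    for href in page_hrefs
]

print(detail_urls)
```

With real tags, the inner lists would come from something like `[t["href"] for t in page_tags]` for each page's `find_all` result.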

## Exercise
Scrape march madness rankings
https://www.ncaa.com/rankings/basketball-men/d1/ncaa-mens-basketball-net-rankings

see QAC notes...