<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Web Scraping: Solutions

_Author: Dave Yerrington (SF)_

---

## Learning Objectives
- Revisit how to locate elements on a webpage
- Acquire unstructured data from the internet using Beautiful Soup
- Discuss limitations associated with the simple `requests` and `urllib` libraries
- Introduce Selenium as a solution, and implement a scraper using Selenium

## Lesson Guide

- [Introduction](#intro)
- [Building a web scraper](#building-scraper)
- [Retrieving data from the HTML page](#retrieving-data)
    - [Retrieving the restaurant names](#retrieving-names)
    - [Challenge: Retrieving the restaurant locations](#retrieving-locations)
    - [Retrieving the restaurant prices](#retrieving-prices)
    - [Retrieving the restaurant number of bookings](#retrieving-bookings)


- [Introducing Selenium](#selenium)
    - [Running JavaScript before scraping](#selenium-js)
    - [Using regex to only get digits](#selenium-regex)
    - [Challenge: Use Pandas to create a DataFrame of bookings](#challenge-pandas)
    - [Auto-typing using Selenium](#selenium-typing)


- [Summary](#summary)

<a id="intro"></a>
## Introduction

In this codealong lesson, we'll build a web scraper using requests and Beautiful Soup. We will also explore how to use Selenium, a tool that automates a real browser.

We'll begin by scraping OpenTable's DC listings. We're interested in knowing the restaurant's **name, location, and price.**

OpenTable provides all of this information on this given page: http://www.opentable.com/washington-dc-restaurant-listings

Let's inspect the elements of this page to ensure we can find each piece of information we're interested in.

---

<a id="building-scraper"></a>
## Building a web scraper

Now, let's build a web scraper for OpenTable using requests and Beautiful Soup:

In [1]:
# import our necessary first packages
from bs4 import BeautifulSoup
import requests

In [2]:
# set the url we want to visit
url = "http://www.opentable.com/washington-dc-restaurant-listings"

# visit that url, and grab the html of said page
html = requests.get(url)

At this point, what is in html?

In [3]:
# .text returns the response content as a Unicode string
html.text[:500]

'           <!DOCTYPE html><html lang="en"><head><meta charset="utf-8"/><meta http-equiv="X-UA-Compatible" content="IE=9; IE=8; IE=7; IE=EDGE"/> <title>Restaurant Reservation Availability</title>    <meta  name="robots" content="noindex,nofollow" > </meta>     <link rel="shortcut icon" href="//components.otstatic.com/components/favicon/1.0.6/favicon/favicon.ico" type="image/x-icon"/><link rel="icon" href="//components.otstatic.com/components/favicon/1.0.6/favicon/favicon-16.png" sizes="16x16"/><l'

We need to convert this HTML text into a soup object so we can parse it using Python and Beautiful Soup (BS4).

In [4]:
# convert this into a soup object
soup = BeautifulSoup(html.text, 'html.parser')
soup

 <!DOCTYPE html>
<html lang="en"><head><meta charset="utf-8"/><meta content="IE=9; IE=8; IE=7; IE=EDGE" http-equiv="X-UA-Compatible"/> <title>Restaurant Reservation Availability</title> <meta content="noindex,nofollow" name="robots"> </meta> <link href="//components.otstatic.com/components/favicon/1.0.6/favicon/favicon.ico" rel="shortcut icon" type="image/x-icon"/><link href="//components.otstatic.com/components/favicon/1.0.6/favicon/favicon-16.png" rel="icon" sizes="16x16"/><link href="//components.otstatic.com/components/favicon/1.0.6/favicon/favicon-32.png" rel="icon" sizes="32x32"/><link href="//components.otstatic.com/components/favicon/1.0.6/favicon/favicon-48.png" rel="icon" sizes="48x48"/><link href="//components.otstatic.com/components/favicon/1.0.6/favicon/favicon-64.png" rel="icon" sizes="64x64"/><link href="//components.otstatic.com/components/favicon/1.0.6/favicon/favicon-128.png" rel="icon" sizes="128x128"/><link href="//components.otstatic.com/components/favicon/1.0.6/fa

<a id="retrieving-data"></a>
### Retrieving data from the HTML page

### OpenTable
Let's first find each restaurant name listed on the page we've loaded. How do we find where the restaurant name sits on the page? (Hint: We need to know where in the **HTML** the restaurant element is housed.) To find the HTML that renders the restaurant name, we can use Google Chrome's Inspect tool:

> http://www.opentable.com/washington-dc-restaurant-listings

> 1. Visit the URL above. 

> 2. Right-click on an element you are interested in, then choose Inspect (in Chrome). 

> 3. This will open the Developer Tools and show the HTML used to render the selected page element. 

> Throughout this lesson, we will use this method to find tags associated with elements of the page we want to scrape.

See if you can find the restaurant name on the page. Keep in mind there are many restaurants loaded on the page.

In [5]:
# print the restaurant names
soup.find_all(name='span', attrs={'class':'rest-row-name-text'})

[<span class="rest-row-name-text">1331 Metz</span>,
 <span class="rest-row-name-text">345 D'Amore</span>,
 <span class="rest-row-name-text">Est Klocko</span>,
 <span class="rest-row-name-text">Eius Courts</span>,
 <span class="rest-row-name-text">Velit Collier</span>,
 <span class="rest-row-name-text">Akeems</span>,
 <span class="rest-row-name-text">Curts</span>,
 <span class="rest-row-name-text">Volkman</span>,
 <span class="rest-row-name-text">Et</span>,
 <span class="rest-row-name-text">Laudantium Gerhold</span>,
 <span class="rest-row-name-text">Veum Highway</span>,
 <span class="rest-row-name-text">849 Stoltenberg</span>,
 <span class="rest-row-name-text">Court</span>,
 <span class="rest-row-name-text">1123 Bernhard</span>,
 <span class="rest-row-name-text">A Expressway</span>,
 <span class="rest-row-name-text">Jacobs</span>,
 <span class="rest-row-name-text">Viaduct</span>,
 <span class="rest-row-name-text">Explicabo Jones</span>,
 <span class="rest-row-name-text">Occaecati Berni

It is important to always keep in mind the data types that were returned. Note this is a `list`, and we know that immediately by observing the outer square brackets and commas separating each tag.

Next, **note the elements of the list are `Tag` objects, not strings**. (If they were strings, they would be surrounded by quotes.) The Beautiful Soup authors chose to display a `Tag` object visually as a text representation of the tag and its contents. However, **being an object, it has many methods that we can call on it**.
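
To make the difference between a `Tag` and a string concrete, here is a minimal sketch that parses a single tag copied from the output above (rather than the live page) and calls a few of the attributes every `Tag` offers. The variable names `snippet` and `tag` are only for illustration.

```python
from bs4 import BeautifulSoup

# One tag copied from the output above, parsed on its own.
snippet = '<span class="rest-row-name-text">1331 Metz</span>'
tag = BeautifulSoup(snippet, 'html.parser').find('span')

print(type(tag).__name__)  # the Python class of the object: Tag
print(tag.name)            # the HTML tag name: span
print(tag['class'])        # attribute lookup; class can hold several values, so it is a list
print(tag.text)            # only the text inside the tag: 1331 Metz
```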

<a id="retrieving-names"></a>
#### Retrieving the restaurant names

Now that we've found a list of tags containing the restaurant names, let's think about how we can loop through them one by one. In the following cell, we'll print out the name (and **only** the clean name, not the rest of the HTML) of each restaurant.

In [6]:
# for each element you find, print out the restaurant name
for entry in soup.find_all(name='span', attrs={'class':'rest-row-name-text'}):
    print(entry.text)

1331 Metz
345 D'Amore
Est Klocko
Eius Courts
Velit Collier
Akeems
Curts
Volkman
Et
Laudantium Gerhold
Veum Highway
849 Stoltenberg
Court
1123 Bernhard
A Expressway
Jacobs
Viaduct
Explicabo Jones
Occaecati Bernier
131 Rohan
Quia Orchard
Grady Views
Beatae Mission
393 Miller
Placeat Price
Hermann Shoal
Agloe Bar & Grill
Views
Estate
Jaron Turnpike
Lilys
Points
Pats
Row
Sheridan Bradtke
Doyle
Walter
Auer
865 Kihn
Nam Barrows
519 Jones
Optio Points
White
Optio Trail
Esmeralda Adams
Burg
891 Block
Grant Estates
Wilmas
Ullam Cassin
Herman
Pasquales
Andrew Kiehn
Earum Wisozk
Ward Coves
Dolorum Murray
Leanns
Tod Trail
Iure Jones
Omnis Coves
Reynolds Glens
Dolor Kling
Dolorum Junction
Kemmer
Aut
330 Steuber
Facilis Gerhold
Amet
Hoppe
Katheryns
Koch Islands
White
Forge
Waynes
Corwin Spur
Sequi
Crossroad
Sint Fields
Haven
Ea Curve
704 Skiles
Ut Macejkovic
Mollitia Camp
Dolore Reinger
Riley Rapid
Officiis Road
Est Borer
Enim Spinka
Springs
Sunt
Eum
Hill
Lakes
Road
999 Rempel
Aminas
Nigels
Et Exten

Great!
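
If we want to keep the names instead of only printing them, one option is to collect them in a list. The sketch below assumes a small hand-written sample (`sample_html`, with extra whitespace added around one name) in place of the live page. It uses Beautiful Soup's CSS selector method `select()` as an alternative to `find_all()`, and `get_text(strip=True)` to trim surrounding whitespace.

```python
from bs4 import BeautifulSoup

# Hand-written sample in the shape of the tags we found above.
sample_html = """
<div><span class="rest-row-name-text">1331 Metz</span></div>
<div><span class="rest-row-name-text">345 D'Amore</span></div>
<div><span class="rest-row-name-text"> Est Klocko </span></div>
"""
sample_soup = BeautifulSoup(sample_html, 'html.parser')

# 'span.rest-row-name-text' means: <span> tags whose class includes rest-row-name-text
names = [tag.get_text(strip=True) for tag in sample_soup.select('span.rest-row-name-text')]
print(names)
```

A list like `names` is easier to reuse later, for example when pairing each name with its location or price.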

<a id="retrieving-locations"></a>
#### Challenge: Retrieving the restaurant locations

Can you repeat that process for finding the location? For example, barmini by Jose Andres is in the location listed as "Penn Quarter" in our search results.

In [8]:
# first, see if you can identify the location for all elements -- print it out
soup.find_all('span', {'class':'rest-row-meta--location'})

[<span class="rest-row-meta--location rest-row-meta-text sfx1388addContent">South Frederique</span>,
 <span class="rest-row-meta--location rest-row-meta-text sfx1388addContent">Elainaborough</span>,
 <span class="rest-row-meta--location rest-row-meta-text sfx1388addContent">Port Letitia</span>,
 <span class="rest-row-meta--location rest-row-meta-text sfx1388addContent">East Roxanestad</span>,
 <span class="rest-row-meta--location rest-row-meta-text sfx1388addContent">Schummville</span>,
 <span class="rest-row-meta--location rest-row-meta-text sfx1388addContent">Doloresburgh</span>,
 <span class="rest-row-meta--location rest-row-meta-text sfx1388addContent">Jaskolskitown</span>,
 <span class="rest-row-meta--location rest-row-meta-text sfx1388addContent">Lake Crawfordburgh</span>,
 <span class="rest-row-meta--location rest-row-meta-text sfx1388addContent">North Dorris</span>,
 <span class="rest-row-meta--location rest-row-meta-text sfx1388addContent">Schaefertown</span>,
 <span class="re

In [9]:
# now print out EACH location for the restaurants
for entry in soup.find_all('span', {'class':'rest-row-meta--location'}):
    print(entry.text)

South Frederique
Elainaborough
Port Letitia
East Roxanestad
Schummville
Doloresburgh
Jaskolskitown
Lake Crawfordburgh
North Dorris
Schaefertown
Martaville
Robertshaven
New Jackside
Port Alayna
Port Viviennebury
Romagueraville
Demetrisshire
East Kali
South Louisaville
Kodymouth
Cheyennemouth
Lake Diego
Lake Lavinia
Pollichside
Kentonchester
Enidside
Hilpertland
Lionelport
West Nettie
Alessiaburgh
Christiansenville
McLaughlinville
Simonisfurt
Cletaview
West Lemuel
South Jaquelinside
Modestoshire
Corteztown
East Terrellmouth
Lake Fatimashire
Wymanmouth
North Cynthiamouth
New Ewell
Russelbury
Tylerbury
North Jazminland
East Elyse
Durwardside
Runolfssonton
Leuschkechester
Barrowsborough
West Loisville
West Hermannhaven
Lake Nickmouth
Runolfssonfort
Herzogville
West Jess
South Francisport
Lake Joannychester
South Brad
East Manleybury
New Kylie
South Brenna
Kochfurt
New Trever
New Gushaven
New Alanborough
Zulastad
North Tracyfurt
Port Edna
Binsview
South Dorotheatown
East Bethelberg
Bustermou

<a id="retrieving-prices"></a>
#### Retrieving the restaurant prices

Ok, we've figured out the restaurant name and location. Now we need to grab the price (number of dollar signs on a scale of one to four) for each restaurant. We'll follow the same process.

In [10]:
# print out all prices
soup.find_all('div', {'class':'rest-row-pricing'})

[<div class="rest-row-pricing"> <i class="pricing--the-price">  $    $    $    $  </i> <span class="pricing--not-the-price"> </span></div>,
 <div class="rest-row-pricing"> <i class="pricing--the-price">  $    $    $    $  </i> <span class="pricing--not-the-price"> </span></div>,
 <div class="rest-row-pricing"> <i class="pricing--the-price">  $    $    $    </i> <span class="pricing--not-the-price">  $        </span></div>,
 <div class="rest-row-pricing"> <i class="pricing--the-price">  $    $      </i> <span class="pricing--not-the-price">  $    $      </span></div>,
 <div class="rest-row-pricing"> <i class="pricing--the-price">  $    $      </i> <span class="pricing--not-the-price">  $    $      </span></div>,
 <div class="rest-row-pricing"> <i class="pricing--the-price">  $    $      </i> <span class="pricing--not-the-price">  $    $      </span></div>,
 <div class="rest-row-pricing"> <i class="pricing--the-price">  $    $    $    </i> <span class="pricing--not-the-price">  $        

In [17]:
# print the number of dollar signs for EACH restaurant
# the HTML is trickier to strip away here. Hint: try a nested find

for entry in soup.find_all('div', {'class':'rest-row-pricing'}):
    price = entry.find('i').text
    print(price.count('$'), "  -->  ", price)

4   -->     $    $    $    $  
4   -->     $    $    $    $  
3   -->     $    $    $    
2   -->     $    $      
2   -->     $    $      
2   -->     $    $      
3   -->     $    $    $    
4   -->     $    $    $    $  
4   -->     $    $    $    $  
3   -->     $    $    $    
4   -->     $    $    $    $  
4   -->     $    $    $    $  
3   -->     $    $    $    
3   -->     $    $    $    
2   -->     $    $      
2   -->     $    $      
4   -->     $    $    $    $  
4   -->     $    $    $    $  
2   -->     $    $      
4   -->     $    $    $    $  
2   -->     $    $      
4   -->     $    $    $    $  
2   -->     $    $      
2   -->     $    $      
2   -->     $    $      
4   -->     $    $    $    $  
4   -->     $    $    $    $  
2   -->     $    $      
4   -->     $    $    $    $  
3   -->     $    $    $    
4   -->     $    $    $    $  
4   -->     $    $    $    $  
2   -->     $    $      
3   -->     $    $    $    
4   -->     $    $    $    $  
4   --> 

That looks great, but what if I wanted just the number of dollar signs per restaurant? Can you figure out a way to simply print out the number of dollar signs per restaurant listed?
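
One possible answer is to keep only the result of `count('$')` and drop the raw text. The sketch below runs on three pricing blocks copied from the output above (stored in a hypothetical `sample_html` string) so that it works without a network connection. On the real page you would loop over `soup` instead of `sample_soup`.

```python
from bs4 import BeautifulSoup

# Three pricing blocks in the shape shown in the output above.
sample_html = """
<div class="rest-row-pricing"> <i class="pricing--the-price">  $    $    $    $  </i> <span class="pricing--not-the-price"> </span></div>
<div class="rest-row-pricing"> <i class="pricing--the-price">  $    $    $    </i> <span class="pricing--not-the-price">  $        </span></div>
<div class="rest-row-pricing"> <i class="pricing--the-price">  $    $      </i> <span class="pricing--not-the-price">  $    $      </span></div>
"""
sample_soup = BeautifulSoup(sample_html, 'html.parser')

dollar_counts = []
for entry in sample_soup.find_all('div', {'class': 'rest-row-pricing'}):
    # Only the <i> tag holds the real price; the <span> holds the greyed-out remainder.
    price_tag = entry.find('i', {'class': 'pricing--the-price'})
    dollar_counts.append(price_tag.text.count('$'))

print(dollar_counts)  # [4, 3, 2]
```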

### Craigslist

In [18]:
html =  requests.get('https://santabarbara.craigslist.org/d/apts-housing-for-rent/search/apa')
soup_cl = BeautifulSoup(html.text, 'html.parser')

In [15]:
results = soup_cl.find_all('p', attrs={'class': 'result-info'})
listings = []
for result in results:
    result_dict = {}
    result_dict['title'] = result.find('a').text
    result_dict['price'] = result.find('span', attrs={'class': 'result-price'}).text
    listings.append(result_dict)

listings

[{'price': '$2660',
  'title': 'The Perfect Pool to Jump right into! Arrive Los Carneros!'},
 {'price': '$1500', 'title': 'Urgently need a home'},
 {'price': '$3550', 'title': 'THIS PLACE WILL MAKE YOU FEEL RIGHT AT HOME!'},
 {'price': '$3295', 'title': 'Everything You Need. All Right Here.'},
 {'price': '$1900', 'title': "Beautiful 1920's Farmhouse for rent"},
 {'price': '$2100', 'title': 'Gorgeous Upper State Apartment'},
 {'price': '$3100', 'title': 'Modern, Stylish, Close to West Beach'},
 {'price': '$2395',
  'title': '2 Bedroom, 1.5 Bath - Townhouse Apt. - Montecito Gardens Apts'},
 {'price': '$2760',
  'title': '2 bedroom 2 bath - Immediate Move in available!'},
 {'price': '$2100', 'title': 'Gorgeous Upper State Apartment'},
 {'price': '$1125', 'title': 'Looking for a roommate!'},
 {'price': '$1250', 'title': '2Bd/1Ba apartment in Lompoc'},
 {'price': '$2660',
  'title': 'Last day to receive up to 1 month free! Call now!'},
 {'price': '$3750', 'title': '5 Bd/3Ba+2 Car garage Dup
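
The prices above are still strings, and the loop would raise an `AttributeError` on any listing that has no price tag, because `find()` returns `None` when nothing matches. Here is a minimal sketch of a more defensive version. It assumes a hand-written `sample_html` (the third listing is hypothetical) and a helper named `parse_price` that is not part of the original lesson.

```python
from bs4 import BeautifulSoup

# Two listings taken from the output above, plus one hypothetical listing with no price.
sample_html = """
<p class="result-info"><a>Gorgeous Upper State Apartment</a><span class="result-price">$2100</span></p>
<p class="result-info"><a>Looking for a roommate!</a><span class="result-price">$1125</span></p>
<p class="result-info"><a>Listing without a price</a></p>
"""


def parse_price(text):
    """Turn a string like '$2100' (or '$2,100') into the integer 2100."""
    return int(text.replace('$', '').replace(',', '').strip())


sample_soup = BeautifulSoup(sample_html, 'html.parser')
listings = []
for result in sample_soup.find_all('p', attrs={'class': 'result-info'}):
    price_tag = result.find('span', attrs={'class': 'result-price'})
    listings.append({
        'title': result.find('a').text,
        # Guard against a missing price tag before reading .text
        'price': parse_price(price_tag.text) if price_tag is not None else None,
    })

print(listings)
```

Storing the price as an integer (or `None`) makes it straightforward to sort or average the listings later.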

<a id="selenium"></a>
## Introducing Selenium

Selenium is a browser automation tool. It drives a real browser (optionally in headless mode, with no visible window), so pages render their JavaScript just as they would for a human visitor.

To install Selenium, use one of the following:
- **Anaconda:** `conda install -c conda-forge selenium`
- **pip:** `pip install selenium`

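You will also need a driver for your browser. The cells below use Chrome, which requires ChromeDriver; if you prefer Firefox, install GeckoDriver (this assumes you are using Homebrew for Mac):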

- ```brew install geckodriver```

In [19]:
from selenium import webdriver

Selenium requires us to choose a browser to drive. I'm going to opt for Chrome. http://selenium-python.readthedocs.io/faq.html

**STOP - What is going to happen when I run the next cell?**

In [22]:
# create a Chrome driver
driver = webdriver.Chrome()

Pretty crazy, right? Let's close that driver.

In [23]:
# close it
driver.close()

In [24]:
# let's boot it up, and visit a URL of our choice
driver = webdriver.Chrome()
driver.get("http://www.python.org")

Awesome. Now we're getting somewhere: programmatically controlling our browser like a human.

Let's return to our problem at hand. We need to visit the OpenTable listing for DC. Once there, we need to get the html to load. In the next cell, prove you can programmatically visit the page.

In [25]:
# visit our OpenTable page
driver = webdriver.Chrome()
driver.get("http://www.opentable.com/washington-dc-restaurant-listings")

# always good to check we've got the page we think we do
assert "OpenTable" in driver.title

In [26]:
driver.title

'Washington, D.C. Area Restaurants List | OpenTable'

In [27]:
driver.close()

<a id="selenium-js"></a>
### Running JavaScript before scraping

The number of bookings is rendered by JavaScript after the page loads, so it is missing from the HTML that `requests` retrieves. There are a few things we can do to resolve this JavaScript problem. In this case, I'll request the page, wait one second, and then grab the source HTML. Because the page is loaded in a real browser, the JavaScript runs and its output becomes part of the page source.

In [29]:
from time import sleep

In [30]:
# visit our relevant page
driver = webdriver.Chrome()
driver.get("http://www.opentable.com/washington-dc-restaurant-listings")

# wait one second
sleep(1)

#grab the page source
html = driver.page_source
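
A fixed `sleep(1)` can be too short on a slow connection and wasteful on a fast one. As an alternative, Selenium offers explicit waits that poll until an element appears. The function below is a sketch, not part of the original lesson: `fetch_rendered_html` and its parameters are hypothetical names, and it assumes Chrome and its driver are available.

```python
def fetch_rendered_html(url, css_selector, timeout=10):
    """Load url in Chrome and return the page source once css_selector appears."""
    # Imported inside the function so the sketch can be defined without launching a browser.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Chrome()
    try:
        driver.get(url)
        # Poll until at least one matching element exists; raises TimeoutException otherwise.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, css_selector))
        )
        return driver.page_source
    finally:
        # quit() ends the whole browser session, even if the wait timed out.
        driver.quit()
```

A call might look like `fetch_rendered_html("http://www.opentable.com/washington-dc-restaurant-listings", "div.booking")`. The function shuts down its own browser, so it needs no separate `driver.close()`.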

**Pop Quiz:** What do we need to do with this HTML?

In [31]:
# BeautifulSoup it!
html = BeautifulSoup(html, "lxml")

How do we locate bookings on the page?

In [34]:
# print out the number of bookings for all restaurants
html.find_all('div', {'class':'booking'})

[<div class="booking"><span class="tadpole"></span> Booked 11 times today</div>,
 <div class="booking"><span class="tadpole"></span> Booked 20 times today</div>,
 <div class="booking"><span class="tadpole"></span> Booked 58 times today</div>,
 <div class="booking"><span class="tadpole"></span> Booked 34 times today</div>,
 <div class="booking"><span class="tadpole"></span> Booked 12 times today</div>,
 <div class="booking"><span class="tadpole"></span> Booked 14 times today</div>,
 <div class="booking"><span class="tadpole"></span> Booked 8 times today</div>,
 <div class="booking"><span class="tadpole"></span> Booked 11 times today</div>,
 <div class="booking"><span class="tadpole"></span> Booked 13 times today</div>,
 <div class="booking"><span class="tadpole"></span> Booked 15 times today</div>,
 <div class="booking"><span class="tadpole"></span> Booked 15 times today</div>,
 <div class="booking"><span class="tadpole"></span> Booked 11 times today</div>,
 <div class="booking"><span c

In [35]:
# grabbing only the text content
for entry in html.find_all('div', {'class':'booking'}):
    print(entry.text)

 Booked 11 times today
 Booked 20 times today
 Booked 58 times today
 Booked 34 times today
 Booked 12 times today
 Booked 14 times today
 Booked 8 times today
 Booked 11 times today
 Booked 13 times today
 Booked 15 times today
 Booked 15 times today
 Booked 11 times today
 Booked 13 times today
 Booked 45 times today
 Booked 6 times today
 Booked 15 times today
 Booked 12 times today
 Booked 8 times today
 Booked 14 times today
 Booked 5 times today
 Booked 12 times today
 Booked 90 times today
 Booked 36 times today
 Booked 9 times today
 Booked 69 times today
 Booked 16 times today
 Booked 7 times today
 Booked 1 times today
 Booked 3 times today
 Booked 23 times today
 Booked 17 times today
 Booked 1 times today
 Booked 7 times today
 Booked 73 times today
 Booked 255 times today
 Booked 3 times today
 Booked 10 times today
 Booked 1 times today
 Booked 13 times today
 Booked 15 times today
 Booked 13 times today
 Booked 3 times today
 Booked 2 times today
 Booked 27 times today
 

In [36]:
driver.close()

We've succeeded!

<a id="selenium-regex"></a>
### Using regex to only get digits

But we can clean this up a little bit. We're going to use regular expressions (regex) to grab only the digits from each piece of text.

The best way to get good at regex is to, well, just keep trying and testing: http://pythex.org/

In [37]:
import re

Given we haven't covered regex, I'll show you how to use the `search` function to find the first run of digits in a string.

In [38]:
# for each entry, grab the text
for booking in html.find_all('div', {'class':'booking'}):
    # match the first run of one or more digits
    match = re.search(r'\d+', booking.text)
    
    if match:
        # print if found
        print(match.group())
    else:
        # otherwise pass
        pass

11
20
58
34
12
14
8
11
13
15
15
11
13
45
6
15
12
8
14
5
12
90
36
9
69
16
7
1
3
23
17
1
7
73
255
3
10
1
13
15
13
3
2
27
6
5
13
8
6
9
2
29
13
22
25
1
13
16
62
16
39
16
68
15
11
18
8
8
12
6
17
10
3
5
5
18
8
8
2
42
19
4
9
1
21
1
3
14
6
11
5
2
3
5
3
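
The matches printed above are still strings. If we want to do arithmetic on the bookings, we can convert each match with `int()` and collect the results. This sketch uses a few booking strings copied from the earlier output, plus one hypothetical entry without digits, stored in an illustrative `booking_texts` list.

```python
import re

# Sample strings in the shape of booking.text; the last one is hypothetical.
booking_texts = [
    ' Booked 11 times today',
    ' Booked 20 times today',
    ' Booked 255 times today',
    ' No bookings yet',
]

counts = []
for text in booking_texts:
    match = re.search(r'\d+', text)
    if match:
        # match.group() is a string such as '11'; int() makes it a number
        counts.append(int(match.group()))

print(counts)       # [11, 20, 255]
print(sum(counts))  # 286
```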


The Selenium examples above (and many others) are available in the Selenium docs: http://selenium-python.readthedocs.io/getting-started.html

What is especially important is exploring functionality like locating elements: http://selenium-python.readthedocs.io/locating-elements.html#locating-elements

FAQ:
http://selenium-python.readthedocs.io/faq.html

<a id="summary"></a>
## Summary

In this lesson, we used the Beautiful Soup library to locate elements on a website and then scrape their text. We also used Selenium to drive a browser so that JavaScript runs before we retrieve the page contents.