<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Webscraping OpenTable with Selenium: Guided Lab

_Authors: Joseph Nelson (DC)_

---

> *Note: this lab is intended to be instructor-guided.*


In today's codealong lab, we will build a scraper using `requests` and BeautifulSoup. We will then remedy some of the pitfalls of automated scraping by using Selenium, a tool that drives a real web browser programmatically.

You will be scraping OpenTable's DC listings. We're interested in knowing the restaurant's **name, location, price, and how many people booked it today.**

OpenTable provides all of this information on this given page: http://www.opentable.com/washington-dc-restaurant-listings.

### 1. Inspect the elements of this page to ensure we can find each piece of information we're interested in.

### 2. Use `requests` and `BeautifulSoup` to read the contents of the HTML.

In [11]:
from bs4 import BeautifulSoup
import requests

In [2]:
# set the url we want to visit
url = "http://www.opentable.com/washington-dc-restaurant-listings"

# visit that url, and grab the html of said page
html = requests.get(url).text

### 3. Print out the HTML (only print a fraction of it). What is in it?

In [3]:
len(html)

539772

In [4]:
html[0:1000]

'           <!DOCTYPE html><html lang="en"><head><meta charset="utf-8"/><meta http-equiv="X-UA-Compatible" content="IE=9; IE=8; IE=7; IE=EDGE"/> <title>Restaurant Reservation Availability</title>    <meta  name="robots" content="noindex" > </meta><link  rel="canonical" href="https://www.opentable.com/washington-dc-restaurant-listings" > </link>      <link rel="shortcut icon" href="//components.otstatic.com/components/favicon/1.0.5/favicon/favicon.ico" type="image/x-icon"/><link rel="icon" href="//components.otstatic.com/components/favicon/1.0.5/favicon/favicon-16.png" sizes="16x16"/><link rel="icon" href="//components.otstatic.com/components/favicon/1.0.5/favicon/favicon-32.png" sizes="32x32"/><link rel="icon" href="//components.otstatic.com/components/favicon/1.0.5/favicon/favicon-48.png" sizes="48x48"/><link rel="icon" href="//components.otstatic.com/components/favicon/1.0.5/favicon/favicon-64.png" sizes="64x64"/><link rel="icon" href="//components.otstatic.com/components/favicon/1.0

In [5]:
# This is the raw HTML of the page.

### 4. Use Beautiful Soup to convert the raw HTML into a soup object.

In [6]:
# we need to convert this into a soup object
soup = BeautifulSoup(html, 'html.parser')

### 5. Extract the name of each restaurant.

Let's first find each restaurant name listed on the page we've loaded. How do we find the page location of the restaurant? 

> *Hint: we need to know where in the **html** the restaurant element is housed.*

**5.A See if you can find the restaurant name on the page. Keep in mind there are many restaurants loaded on the page.**

In [7]:
# print the restaurant names
print(soup.find_all('span', {'class': 'rest-row-name-text'})[0:20])

[<span class="rest-row-name-text">Willms</span>, <span class="rest-row-name-text">Napoleon Gutmann</span>, <span class="rest-row-name-text">Officia</span>, <span class="rest-row-name-text">908 Rolfson</span>, <span class="rest-row-name-text">Nels Zulauf</span>, <span class="rest-row-name-text">Tristin Graham</span>, <span class="rest-row-name-text">Douglas</span>, <span class="rest-row-name-text">1184 Greenfelder</span>, <span class="rest-row-name-text">Stravenue</span>, <span class="rest-row-name-text">924 Kreiger</span>, <span class="rest-row-name-text">Ipsa</span>, <span class="rest-row-name-text">Possimus Stravenue</span>, <span class="rest-row-name-text">Dolor</span>, <span class="rest-row-name-text">Spinka Plains</span>, <span class="rest-row-name-text">Fugit</span>, <span class="rest-row-name-text">Austen Tremblay</span>, <span class="rest-row-name-text">Dolorum</span>, <span class="rest-row-name-text">Reichert Creek</span>, <span class="rest-row-name-text">Halvorson</span>, <sp

**5.B Create a list of _only_ the restaurant names (no tags).**


In [8]:
r_names = []

# for each element you find, store the restaurant name (text only, no tags)
for entry in soup.find_all('span', {'class': 'rest-row-name-text'}):
    r_names.append(entry.text)

In [9]:
r_names[0:20]

['Willms',
 'Napoleon Gutmann',
 'Officia',
 '908 Rolfson',
 'Nels Zulauf',
 'Tristin Graham',
 'Douglas',
 '1184 Greenfelder',
 'Stravenue',
 '924 Kreiger',
 'Ipsa',
 'Possimus Stravenue',
 'Dolor',
 'Spinka Plains',
 'Fugit',
 'Austen Tremblay',
 'Dolorum',
 'Reichert Creek',
 'Halvorson',
 '921 Schuster']

### 6. Repeat this process but for location.

For example, barmini by Jose Andres is in the location listed as "Penn Quarter" in our search results.

In [10]:
# first, see if you can identify the location for all elements -- print it out
print(soup.find_all('span', {'class': 'rest-row-meta--location rest-row-meta-text'})[0:5])

[<span class="rest-row-meta--location rest-row-meta-text">Wilfordborough</span>, <span class="rest-row-meta--location rest-row-meta-text">Annabellestad</span>, <span class="rest-row-meta--location rest-row-meta-text">West Lauriannestad</span>, <span class="rest-row-meta--location rest-row-meta-text">West Isacmouth</span>, <span class="rest-row-meta--location rest-row-meta-text">Darestad</span>]


In [11]:
r_loc = []
for entry in soup.find_all('span', {'class': 'rest-row-meta--location rest-row-meta-text'}):
    r_loc.append(entry.text)
    
r_loc[0:10]

['Wilfordborough',
 'Annabellestad',
 'West Lauriannestad',
 'West Isacmouth',
 'Darestad',
 'Jacobstown',
 'East Cale',
 'New Shemar',
 'Jeffton',
 'Ankundingport']

### 7. Get the price for each restaurant.

The price is the number of dollar signs, on a scale of one to four, for each restaurant. We'll follow the same process.

In [12]:
# print out all prices
print(soup.find_all('div', {'class': 'rest-row-pricing'})[0:5])

[<div class="rest-row-pricing"> <i class="pricing--the-price">  $    $    $    </i> <span class="pricing--not-the-price">  $        </span></div>, <div class="rest-row-pricing"> <i class="pricing--the-price">  $    $      </i> <span class="pricing--not-the-price">  $    $      </span></div>, <div class="rest-row-pricing"> <i class="pricing--the-price">  $    $    $    </i> <span class="pricing--not-the-price">  $        </span></div>, <div class="rest-row-pricing"> <i class="pricing--the-price">  $    $    $    $  </i> <span class="pricing--not-the-price"> </span></div>, <div class="rest-row-pricing"> <i class="pricing--the-price">  $    $    $    $  </i> <span class="pricing--not-the-price"> </span></div>]


In [13]:
r_dollars = []

# get EACH number of dollar signs per restaurant
# this one is trickier to eliminate the html. Hint: try a nested find
for entry in soup.find_all('div', {'class': 'rest-row-pricing'}):
    r_dollars.append(entry.find('i').text)
    
r_dollars[0:10]

['  $    $    $    ',
 '  $    $      ',
 '  $    $    $    ',
 '  $    $    $    $  ',
 '  $    $    $    $  ',
 '  $    $      ',
 '  $    $    $    ',
 '  $    $      ',
 '  $    $    $    ',
 '  $    $    $    ']

**7.B Convert the dollar sign strings to a count of the number of dollar signs.**

Can you figure out a way to simply print out the number of dollar signs per restaurant listed?

In [14]:
r_dollar_count = []

for entry in soup.find_all('div', {'class': 'rest-row-pricing'}):
    price = entry.find('i').text
    r_dollar_count.append(price.count('$'))
    
r_dollar_count[0:10]

[3, 2, 3, 4, 4, 2, 3, 2, 3, 3]

### 8. Can you find the number of times a restaurant was booked?

In the next cell, print out a sample of objects that contain the number of times the restaurant was booked.

> *Note: if you can't, why do you think this is happening?*

In [15]:
# print out all objects that contain the number of times the restaurant was booked
print(soup.find_all('span', {'class': 'tadpole'})[0:20])

[]


That's weird -- an empty list. Did we look for the wrong element? What's going on here? Discuss.

How can we debug this? Any ideas?

In [16]:
# let's first try printing out the first 30 'span' elements
for entry in soup.find_all('span')[0:30]:
    print(entry)

<span class="menu-list-link-meta">4705</span>
<span class="menu-list-link-meta">901</span>
<span class="menu-list-link-meta">2354</span>
<span class="menu-list-link-meta">1450</span>
<span itemprop="name">Home</span>
<span itemprop="item"><span itemprop="name">United States</span></span>
<span itemprop="name">United States</span>
<span itemprop="name">Washington, D.C. Area</span>
<span class="show-filter-text">Show filters</span>
<span class="hide-filter-text">Hide filters</span>
<span class="sort-dropdown__option-text">Best Match</span>
<span class="sort-dropdown__option-text">A-Z</span>
<span class="sort-dropdown__option-text">Highest Rated</span>
<span class="pref-label">List</span>
<span class="pref-label">Map</span>
<span class="rest-row-name-text">Willms</span>
<span class="recommended-container"><span class="thumbs-up-icon"></span> <span class="recommended-small">230%</span> <span class="recommended-text">230% Recommend</span></span>
<span class="thumbs-up-icon"></span>
<span cl

In [17]:
# We can't find the booking count in the soup: it is added to the page by JavaScript, which requests does not run.
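
To make the failure concrete, here is a minimal sketch using only the standard library. The `STATIC_HTML` snippet and the `ClassCounter` helper are hypothetical illustrations, not OpenTable's actual markup: the booking element exists only as instructions inside a `<script>`, so a parser that does not execute JavaScript never sees it.

```python
from html.parser import HTMLParser

# Hypothetical, simplified page: the booking <div> is created only when a
# browser runs the script. The raw HTML a plain HTTP request receives
# contains the script text, not the element.
STATIC_HTML = """
<div class="result">
  <span class="rest-row-name-text">Example Bistro</span>
  <script>
    var el = document.createElement('div');
    el.className = 'booking';
    el.textContent = 'Booked 12 times today';
    document.querySelector('.result').appendChild(el);
  </script>
</div>
"""


class ClassCounter(HTMLParser):
    """Count start tags whose class attribute contains a given class name."""

    def __init__(self, class_name):
        super().__init__()
        self.class_name = class_name
        self.count = 0

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if self.class_name in classes:
            self.count += 1


def count_class(html, class_name):
    """Parse html (without running any JavaScript) and count matching tags."""
    parser = ClassCounter(class_name)
    parser.feed(html)
    parser.close()
    return parser.count


print(count_class(STATIC_HTML, "rest-row-name-text"))  # element is in the raw HTML
print(count_class(STATIC_HTML, "booking"))             # element exists only after JS runs
```

The name is found because it is written directly in the markup; the booking element is not, because nothing has executed the script that would create it.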

## Enter Selenium

---

Selenium is a browser automation tool. It drives a real browser (which can optionally run "headless", i.e. without a visible window), so it enables us to mock human browsing behavior -- even waiting for JavaScript elements to load.

If you do not already have Selenium installed, you can do so via pip. Simply: `pip install selenium`

In [1]:
# import
from selenium import webdriver

Selenium requires us to choose a browser to drive. I'm going to opt for Chrome (via chromedriver), but Firefox is also a very common choice. See the FAQ: http://selenium-python.readthedocs.io/faq.html

### 9. What is going to happen when I run the next cell?

The chromedriver (for mac) is already contained in this repository.

For other operating systems, download the latest appropriate driver from [here](https://sites.google.com/a/chromium.org/chromedriver/downloads) and run the following commands from the folder containing the chromedriver you downloaded:

`$ sudo mv chromedriver /usr/bin/chromedriver`  
`$ sudo chown root:root /usr/bin/chromedriver`  
`$ sudo chmod +x /usr/bin/chromedriver`

In [2]:
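# create a driver called driver
# On a Mac, use the chromedriver bundled with this repository:
driver = webdriver.Chrome(executable_path="../chromedriver/chromedriver")
# Otherwise, follow the directions above and use the command below:
# driver = webdriver.Chrome(executable_path="/usr/bin/chromedriver")
# To use Firefox instead:
# driver = webdriver.Firefox()
# Note: `executable_path` is the older (Selenium 3) style; Selenium 4 passes
# the driver path via a Service object instead.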

Pretty crazy, right? Let's close that driver. 

In case you're wondering, this should have opened up a new browser window. Check all of your desktop displays if you didn't see it pop up.

In [3]:
# close it
driver.close()

### 10. Use the driver to visit `www.python.org`

In [4]:
# let's boot it up, and visit a URL of our choice
driver = webdriver.Chrome(executable_path="../chromedriver/chromedriver")
#driver = webdriver.Chrome(executable_path="/usr/bin/chromedriver")
driver.get("http://www.python.org")

Awesome. Now we're getting somewhere: programmatically controlling our browser like a human.

### 11. Visit the OpenTable page using the driver

Let's return to our problem at hand. We need to visit the OpenTable listing for DC. Once there, we need to get the html to load. 

In the next cell, prove you can programmatically visit the page.

In [5]:
# visit our OpenTable page
driver = webdriver.Chrome(executable_path="../chromedriver/chromedriver")
#driver = webdriver.Chrome(executable_path="/usr/bin/chromedriver")
driver.get("http://www.opentable.com/washington-dc-restaurant-listings")

# always good to check we've got the page we think we do
assert "OpenTable" in driver.title

In [6]:
driver.close()

In [7]:
# calling close() a second time fails: the browser session is already gone
driver.close()

WebDriverException: Message: no such session
  (Driver info: chromedriver=2.28.455517 (2c6d2707d8ea850c862f04ac066724273981e88f),platform=Mac OS X 10.13.2 x86_64)


### 12. Resolve the javascript issue using the driver and find the bookings.

What we can do in this case is:
1. Request that the page load
2. Wait one second
3. Grab the source HTML from the page

Because the page is loaded in a real browser, its JavaScript runs and the elements it creates become part of the page source. I can then grab that page source.
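
A fixed one-second sleep is a guess: it may be too short on a slow connection and wasteful on a fast one. One common alternative is to poll until the content we need appears. The sketch below is a hypothetical helper (`wait_until` is not part of Selenium) demonstrated against a stand-in for `driver.page_source`, so it runs without a browser. Selenium itself ships `WebDriverWait` in `selenium.webdriver.support.ui` for the same purpose.

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds pass.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll_interval)


# Stand-in for driver.page_source (assumption: no real browser here).
# The booking markup only "appears" on the third check.
checks = {"n": 0}


def fake_page_source():
    checks["n"] += 1
    if checks["n"] >= 3:
        return '<div class="booking">Booked 12 times today</div>'
    return "<div></div>"


def bookings_loaded():
    """Return the page source once it contains a booking element, else None."""
    src = fake_page_source()
    return src if 'class="booking"' in src else None


source = wait_until(bookings_loaded, timeout=5, poll_interval=0.01)
print(checks["n"], source)
```

With a real driver, `bookings_loaded` would read `driver.page_source` instead of the stand-in.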

**Once you have the HTML with the JavaScript rendered, repeat the process above to find the bookings.**

In [8]:
# import sleep
from time import sleep

In [9]:
# visit our relevant page
driver = webdriver.Chrome(executable_path="../chromedriver/chromedriver")
#driver = webdriver.Chrome(executable_path="/usr/bin/chromedriver")
driver.get("http://www.opentable.com/washington-dc-restaurant-listings")

# wait one second
sleep(1)

#grab the page source
html = driver.page_source

In [12]:
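# BeautifulSoup it!
# note: this rebinds `html` from the raw string to a soup object;
# the 'lxml' parser requires the lxml package ('html.parser' also works)
html = BeautifulSoup(html, 'lxml')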

In [13]:
# Now, let's return to our earlier problem: how do we locate bookings on the page?

In [14]:
# print out the number of bookings for the first 10 restaurants
print(html.find_all('div', {'class':'booking'})[0:10])

[<div class="booking"><span class="tadpole"></span>Booked 432 times today</div>, <div class="booking"><span class="tadpole"></span>Booked 174 times today</div>, <div class="booking"><span class="tadpole"></span>Booked 161 times today</div>, <div class="booking"><span class="tadpole"></span>Booked 48 times today</div>, <div class="booking"><span class="tadpole"></span>Booked 128 times today</div>, <div class="booking"><span class="tadpole"></span>Booked 103 times today</div>, <div class="booking"><span class="tadpole"></span>Booked 51 times today</div>, <div class="booking"><span class="tadpole"></span>Booked 41 times today</div>, <div class="booking"><span class="tadpole"></span>Booked 52 times today</div>, <div class="booking"><span class="tadpole"></span>Booked 28 times today</div>]


In [15]:
r_bookings = []
for booking in html.find_all('div', {'class':'booking'}):
    r_bookings.append(booking.text)
    
r_bookings[0:15]

['Booked 432 times today',
 'Booked 174 times today',
 'Booked 161 times today',
 'Booked 48 times today',
 'Booked 128 times today',
 'Booked 103 times today',
 'Booked 51 times today',
 'Booked 41 times today',
 'Booked 52 times today',
 'Booked 28 times today',
 'Booked 60 times today',
 'Booked 31 times today',
 'Booked 50 times today',
 'Booked 30 times today',
 'Booked 21 times today']

In [29]:
# We've succeeded!

# But we can clean this up a little bit. 
# We're going to use regular expressions (regex) to grab only the 
# digits in each piece of text.

# The best way to get good at regex is to, well, just keep trying and testing: http://pythex.org/

In [30]:
# import regex
import re

In [31]:
# Given we haven't covered regex, I'll show you how to use the search function to match any given digit.
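
As a small warm-up, here is a sketch of how `re.search` behaves on a few sample strings. The helper name `first_int` and the third sample string are made up for illustration. The pattern `\d+` matches the first run of one or more digits, and `re.search` returns `None` when there is no match, so we check for that before calling `.group()`.

```python
import re


def first_int(text):
    """Return the first run of digits in text as an int, or None if there is none."""
    match = re.search(r"\d+", text)  # raw string so the backslash reaches the regex engine
    return int(match.group()) if match else None


samples = ["Booked 432 times today", "Booked 1 times today", "No bookings yet"]

for text in samples:
    print(repr(text), "->", first_int(text))
```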

In [32]:
r_bookings_num = []

# for each entry, grab the text
for booking in html.find_all('div', {'class':'booking'}):
    # match the first run of digits (raw string avoids an invalid-escape warning)
    match = re.search(r'\d+', booking.text)
    
    if match:
        # append if found
        r_bookings_num.append(int(match.group()))
    else:
        # otherwise 0
        r_bookings_num.append(0)
        
r_bookings_num[0:15]

[1, 426, 197, 141, 52, 152, 96, 49, 46, 44, 26, 65, 28, 57, 34]

### 13. Can we get all of the items we want from the page in a single `find_all`?

To be most efficient, we want to loop over the entries on the page only once. That means we need to find the element that houses all of the other elements (name, location, price, bookings). Where on the page is each entry located?

In [33]:
# collect all entries (one container div per restaurant)
entries = html.find_all('div', {'class':'result content-section-list-row cf with-times'})

### 14. Does every single entry have each element we want?

In [34]:
# I did this previously. I know for a fact that not every entry has a 
# number of recent bookings. That's probably exactly why OpenTable renders 
# this with JavaScript: they want to continuously update the booking count 
# with the most recent value.

In [35]:
# what happens when a booking is not available?
# print out some booking entries, using the identification code we wrote above
for entry in html.find_all('div', {'class':'result content-section-list-row cf with-times'})[0:50]:
    print(entry.find('div', {'class':'booking'}))

<div class="booking"><span class="tadpole"></span>Booked 1 times today</div>
None
None
None
None
None
None
None
None
None
None
None
None
None
<div class="booking"><span class="tadpole"></span>Booked 426 times today</div>
<div class="booking"><span class="tadpole"></span>Booked 197 times today</div>
<div class="booking"><span class="tadpole"></span>Booked 141 times today</div>
<div class="booking"><span class="tadpole"></span>Booked 52 times today</div>
<div class="booking"><span class="tadpole"></span>Booked 152 times today</div>
<div class="booking"><span class="tadpole"></span>Booked 96 times today</div>
<div class="booking"><span class="tadpole"></span>Booked 49 times today</div>
<div class="booking"><span class="tadpole"></span>Booked 46 times today</div>
<div class="booking"><span class="tadpole"></span>Booked 44 times today</div>
<div class="booking"><span class="tadpole"></span>Booked 26 times today</div>
<div class="booking"><span class="tadpole"></span>Booked 65 times today</d

### 15. Use Python exceptions to handle cases when bookings aren't found.

When a booking is not found, store `'ZERO'`.

In [36]:
# if we find the element we want, we store its text. Otherwise, we store 'ZERO'
entries = []

for entry in html.find_all('div', {'class':'result content-section-list-row cf with-times'}):
    try:
        entries.append(entry.find('div', {'class':'booking'}).text)
    except AttributeError:
        # find() returned None, so there is no .text to read
        entries.append('ZERO')
        
print(entries.count('ZERO'))

15
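
The exception arises because BeautifulSoup's `find` returns `None` when nothing matches, and `None` has no `.text` attribute. An alternative to catching the exception is to test for `None` explicitly. The sketch below uses a hypothetical helper, `text_or_default`, and a `SimpleNamespace` stand-in for a found element so that it runs without a parsed page.

```python
from types import SimpleNamespace


def text_or_default(node, default="ZERO"):
    """Return node.text, or `default` when the lookup returned None."""
    return default if node is None else node.text


# Stand-ins for what entry.find(...) might return (assumption: any object
# with a .text attribute behaves like a found element here).
found = SimpleNamespace(text="Booked 426 times today")
missing = None

print(text_or_default(found))
print(text_or_default(missing))
```

Either style works; the explicit check avoids accidentally hiding unrelated errors inside a broad `except`.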


### 16. Putting it all together in a dataframe.

**Loop through each entry. For each entry:**
1. Grab the relevant information we want (name, location, price, bookings). 
2. Produce a dataframe with the columns "name","location","price","bookings" that contains the 100 entries we would like.

In [37]:
# I'm going to create my empty df first
import pandas as pd

dc_eats = pd.DataFrame(columns=["name","location","price","bookings"])

In [38]:
# loop through each entry
for entry in html.find_all('div', {'class':'result content-section-list-row cf with-times'}):
    # grab the name
    name = entry.find('span', {'class': 'rest-row-name-text'}).text
    # grab the location
    location = entry.find('span', {'class': 'rest-row-meta--location rest-row-meta-text'}).text
    # grab the price
    price = entry.find('div', {'class': 'rest-row-pricing'}).find('i').text.count('$')
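    # try to find the number of bookings; default to 'NA' when missing,
    # so a value from the previous entry is never reused
    bookings = 'NA'
    try:
        temp = entry.find('div', {'class':'booking'}).text
        match = re.search(r'\d+', temp)
        if match:
            bookings = match.group()
    except AttributeError:
        # no booking element in this entry
        pass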
    
    # add to df
    dc_eats.loc[len(dc_eats)]=[name, location, price, bookings]

In [39]:
# check out our work
dc_eats.head()

|   | name | location | price | bookings |
|---|---|---|---|---|
| 0 | Ruffino's - Arlington | Arlington | 2 | 1.0 |
| 1 | Joe's Place Pizza and Pasta | Arlington | 2 | |
| 2 | Peter Chang - Arlington | Palisades Northwest | 4 | |
| 3 | Hunan Village Restaurant | Arlington | 2 | |
| 4 | Fairfax Company Pub | Palisades Northwest | 2 | |


### 17. [Bonus] Sending keys over the driver.

We can send keys to the page using the driver. Below is a demonstration of how to search the page using the Selenium driver.

In [40]:
# we can send keys as well

from selenium.webdriver.common.keys import Keys

In [41]:
# open the driver
driver = webdriver.Chrome(executable_path="../chromedriver/chromedriver")
#driver = webdriver.Chrome(executable_path="/usr/bin/chromedriver")

# visit Python
driver.get("http://www.python.org")

# verify we're in the right place
assert "Python" in driver.title

In [42]:
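# find the search box (the input element named "q")
# note: newer Selenium 4 releases drop this helper in favor of
# driver.find_element(By.NAME, "q"), with By imported from selenium.webdriver.common.by
elem = driver.find_element_by_name("q")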

# clear it
elem.clear()

# type in pycon
elem.send_keys("pycon")

In [43]:
# send those keys
elem.send_keys(Keys.RETURN)

# check that the search returned results
assert "No results found." not in driver.page_source

In [44]:
# close
driver.close()

In [45]:
# all at once:
driver = webdriver.Chrome(executable_path="../chromedriver/chromedriver")
#driver = webdriver.Chrome(executable_path="/usr/bin/chromedriver")
driver.get("http://www.python.org")
assert "Python" in driver.title
elem = driver.find_element_by_name("q")
elem.clear()
elem.send_keys("pycon")
elem.send_keys(Keys.RETURN)
assert "No results found." not in driver.page_source
driver.close()

## Additional resources

---

The above example (and many others) is available in the Selenium docs: http://selenium-python.readthedocs.io/getting-started.html

What is especially important is exploring functionality like locating elements: http://selenium-python.readthedocs.io/locating-elements.html#locating-elements

FAQ:
http://selenium-python.readthedocs.io/faq.html