---

Here, we will build a web scraper for OpenTable's DC listings. For each restaurant, we want to know its name, location, price, and how many times it was booked today. OpenTable provides all of this information on its listings page: http://www.opentable.com/washington-dc-restaurant-listings. Let's inspect the elements of this page to ensure we can find each piece of information we're interested in, and begin by importing the libraries we need.

In [1]:
from selenium import webdriver
from bs4 import BeautifulSoup
from time import sleep
import pandas as pd
import requests
import re

In [2]:
# Let's set the url we want to visit #
url = 'http://www.opentable.com/washington-dc-restaurant-listings'

# Let's visit that url and grab the html #
html = requests.get(url)

In [3]:
# Let's check what's in the html (.text returns the response body decoded as a Unicode string) #
html.text[:500]

'           <!DOCTYPE html><html lang="en"><head><meta charset="utf-8"/><meta http-equiv="X-UA-Compatible" content="IE=9; IE=8; IE=7; IE=EDGE"/> <title>Restaurant Reservation Availability</title>    <meta  name="robots" content="noindex,nofollow" > </meta>     <link rel="shortcut icon" href="//components.otstatic.com/components/favicon/1.0.6/favicon/favicon.ico" type="image/x-icon"/><link rel="icon" href="//components.otstatic.com/components/favicon/1.0.6/favicon/favicon-16.png" sizes="16x16"/><l'

In [4]:
# Let's convert into a soup object so we can parse it #
soup = BeautifulSoup(html.text, 'html.parser')

Note: We will use the web browser's inspect tool to find the tags associated with the elements of the page we want to scrape.

Now that we have a soup object, we can begin to retrieve data from the HTML page.

In [5]:
# Let's print the restaurant names #
soup.find_all(name='span', attrs={'class':'rest-row-name-text'})

[<span class="rest-row-name-text">Consequatur Zboncak</span>,
 <span class="rest-row-name-text">Crooks</span>,
 <span class="rest-row-name-text">Dolorem Waelchi</span>,
 <span class="rest-row-name-text">Huel Corner</span>,
 <span class="rest-row-name-text">Et Street</span>,
 <span class="rest-row-name-text">1499 Schultz</span>,
 <span class="rest-row-name-text">Accusantium Trace</span>,
 <span class="rest-row-name-text">Harum</span>,
 <span class="rest-row-name-text">Nemo</span>,
 <span class="rest-row-name-text">Brody Hirthe</span>,
 <span class="rest-row-name-text">Mohammads</span>,
 <span class="rest-row-name-text">Rerum</span>,
 <span class="rest-row-name-text">Square</span>,
 <span class="rest-row-name-text">Doyle</span>,
 <span class="rest-row-name-text">Bernita Price</span>,
 <span class="rest-row-name-text">Walk</span>,
 <span class="rest-row-name-text">Vel</span>,
 <span class="rest-row-name-text">Willms</span>,
 <span class="rest-row-name-text">Consequatur Fadel</span>,
 <spa

In [6]:
# Let's print out the restaurant names for each element we find #
for entry in soup.find_all(name='span', attrs={'class':'rest-row-name-text'}):
    print(entry.text)

Consequatur Zboncak
Crooks
Dolorem Waelchi
Huel Corner
Et Street
1499 Schultz
Accusantium Trace
Harum
Nemo
Brody Hirthe
Mohammads
Rerum
Square
Doyle
Bernita Price
Walk
Vel
Willms
Consequatur Fadel
965 Hills
561 Sipes
Genevieves
1484 Kirlin
Commodi Mountain
Hand
Greenholt
Molestias Renner
Beier
Leonors
Officia Course
Vel Torphy
Consequatur Lights
Cassin
Stream
Veniam McClure
Demetriss
Parkways
Voluptatibus Islands
Cove
Spinka Gateway
O'Connell
Windler
Parisian
572 Gutmann
Ducimus
Halle Bergstrom
Claudine Schinner
103 Lebsack
Walks
105 Moen
Estelle Rau
Briana Cummings
Baby Trace
Eos Mall
Emies
Bruce Plains
Aspernatur Avenue
Ara Corners
Dolores Pike
Elaina Isle
Bashirian
Wilderman
Sapiente Cronin
Aut McKenzie
Roberts Prairie
Ipsam Leuschke
Brooks
Seths
Mayert
Nihil Armstrong
Dolorem
Molestiae Jast
Sandras
Officia Stream
Dallins
Agloe Bar & Grill
Iusto
Minima Prairie
Addisons
Port
Brown Walks
Voluptate
Christianas
Adipisci Freeway
Nisi
Lorines
Edna D'Amore
Daugherty Ways
Row
Jerde Common
E

In [7]:
# Let's print the restaurant locations #
soup.find_all(name='span', attrs={'class':'rest-row-meta--location'})

[<span class="rest-row-meta--location rest-row-meta-text sfx1388addContent">Lake Rasheedside</span>,
 <span class="rest-row-meta--location rest-row-meta-text sfx1388addContent">Jerdefurt</span>,
 <span class="rest-row-meta--location rest-row-meta-text sfx1388addContent">West Annabelmouth</span>,
 <span class="rest-row-meta--location rest-row-meta-text sfx1388addContent">Kossport</span>,
 <span class="rest-row-meta--location rest-row-meta-text sfx1388addContent">Babychester</span>,
 <span class="rest-row-meta--location rest-row-meta-text sfx1388addContent">Jonathonstad</span>,
 <span class="rest-row-meta--location rest-row-meta-text sfx1388addContent">Everettfurt</span>,
 <span class="rest-row-meta--location rest-row-meta-text sfx1388addContent">Robbieville</span>,
 <span class="rest-row-meta--location rest-row-meta-text sfx1388addContent">East Ledabury</span>,
 <span class="rest-row-meta--location rest-row-meta-text sfx1388addContent">West Lemuelport</span>,
 <span class="rest-row-meta

In [8]:
# Let's print out the restaurant locations for each element we find #
for entry in soup.find_all('span', {'class':'rest-row-meta--location'}):
    print(entry.text)

Lake Rasheedside
Jerdefurt
West Annabelmouth
Kossport
Babychester
Jonathonstad
Everettfurt
Robbieville
East Ledabury
West Lemuelport
Barrowstown
North Adonisborough
West Ellisshire
Johnsonbury
Port Alessandro
New Ericashire
Wilmastad
Jessycaville
Tatumland
Jabarichester
East Al
New Maymie
New Beulahfurt
Bricemouth
New Uriah
Powlowskiport
Elinorfurt
North Lazaroville
East Margeshire
New Cathyside
West Jana
Loycestad
Port Sage
Port Geovany
Dereckborough
Port Devon
Lake Ofelia
New Terry
West Kristian
Littelmouth
Jayhaven
Port Brett
Batztown
Rosinaview
Octaviamouth
Alizemouth
Port Edythe
New Tyson
Jenkinsport
West Justineville
Donnellyhaven
Durganborough
West Larryland
East Norwood
Brownview
Baumbachland
Sawaynburgh
Andreanetown
Lorenamouth
Wolfhaven
New Hector
Johnstonshire
Fletchershire
Nettieview
Murphyhaven
Johnsburgh
North Marcelo
Dorianton
South Gracie
Hoppebury
Riceville
Kemmerborough
Ornmouth
West Marco
New Brody
Bogisichport
Lindville
Rodriguezstad
Gudrunhaven
Mayertmouth
West Rav

In [9]:
# Let's print the restaurant prices #
soup.find_all('div', {'class':'rest-row-pricing'})

[<div class="rest-row-pricing"> <i class="pricing--the-price">  $    $    $    </i> <span class="pricing--not-the-price">  $        </span></div>,
 <div class="rest-row-pricing"> <i class="pricing--the-price">  $    $    $    </i> <span class="pricing--not-the-price">  $        </span></div>,
 <div class="rest-row-pricing"> <i class="pricing--the-price">  $    $    $    $  </i> <span class="pricing--not-the-price"> </span></div>,
 <div class="rest-row-pricing"> <i class="pricing--the-price">  $    $    $    </i> <span class="pricing--not-the-price">  $        </span></div>,
 <div class="rest-row-pricing"> <i class="pricing--the-price">  $    $    $    </i> <span class="pricing--not-the-price">  $        </span></div>,
 <div class="rest-row-pricing"> <i class="pricing--the-price">  $    $    $    $  </i> <span class="pricing--not-the-price"> </span></div>,
 <div class="rest-row-pricing"> <i class="pricing--the-price">  $    $    $    $  </i> <span class="pricing--not-the-price"> </span>

In [10]:
# Let's print out the restaurant prices (dollar signs) for each element we find #
for entry in soup.find_all('div', {'class':'rest-row-pricing'}):
    print(entry.find('i').text)

  $    $    $    
  $    $    $    
  $    $    $    $  
  $    $    $    
  $    $    $    
  $    $    $    $  
  $    $    $    $  
  $    $      
  $    $    $    $  
  $    $    $    
  $    $    $    
  $    $      
  $    $    $    
  $    $    $    
  $    $    $    $  
  $    $      
  $    $    $    
  $    $    $    $  
  $    $    $    
  $    $    $    
  $    $      
  $    $    $    
  $    $    $    $  
  $    $    $    
  $    $    $    $  
  $    $    $    $  
  $    $    $    
  $    $      
  $    $      
  $    $    $    
  $    $    $    
  $    $      
  $    $      
  $    $      
  $    $    $    $  
  $    $      
  $    $      
  $    $    $    
  $    $    $    $  
  $    $      
  $    $      
  $    $    $    
  $    $      
  $    $      
  $    $      
  $    $      
  $    $      
  $    $    $    $  
  $    $    $    $  
  $    $    $    
  $    $    $    $  
  $    $    $    
  $    $      
  $    $    $    
  $    $    $    $  
  $    $    $    
  $ 

In [11]:
# Let's try to print the number of dollar signs per restaurant #
for entry in soup.find_all('div', {'class':'rest-row-pricing'}):
    price = entry.find('i').text
    print('Number of $:',price.count('$'))

Number of $: 3
Number of $: 3
Number of $: 4
Number of $: 3
Number of $: 3
Number of $: 4
Number of $: 4
Number of $: 2
Number of $: 4
Number of $: 3
Number of $: 3
Number of $: 2
Number of $: 3
Number of $: 3
Number of $: 4
Number of $: 2
Number of $: 3
Number of $: 4
Number of $: 3
Number of $: 3
Number of $: 2
Number of $: 3
Number of $: 4
Number of $: 3
Number of $: 4
Number of $: 4
Number of $: 3
Number of $: 2
Number of $: 2
Number of $: 3
Number of $: 3
Number of $: 2
Number of $: 2
Number of $: 2
Number of $: 4
Number of $: 2
Number of $: 2
Number of $: 3
Number of $: 4
Number of $: 2
Number of $: 2
Number of $: 3
Number of $: 2
Number of $: 2
Number of $: 2
Number of $: 2
Number of $: 2
Number of $: 4
Number of $: 4
Number of $: 3
Number of $: 4
Number of $: 3
Number of $: 2
Number of $: 3
Number of $: 4
Number of $: 3
Number of $: 4
Number of $: 4
Number of $: 2
Number of $: 4
Number of $: 4
Number of $: 2
Number of $: 2
Number of $: 3
Number of $: 4
Number of $: 3
Number of 
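
Printing the counts is fine for a quick look, but for later analysis we will probably want them stored as numbers. Here is a minimal sketch, assuming the price strings look like the ones printed above. The helper name `price_level` and the sample strings are illustrative and not part of the original notebook.

```python
from collections import Counter


def price_level(price_text):
    """Count the dollar signs in a pricing string, ignoring the padding around them."""
    return price_text.count('$')


# Sample strings copied from the first few rows of output above.
samples = [
    '  $    $    $    ',
    '  $    $    $    ',
    '  $    $    $    $  ',
    '  $    $    $    ',
    '  $    $    $    ',
    '  $    $    $    $  ',
    '  $    $    $    $  ',
    '  $    $      ',
]

levels = [price_level(text) for text in samples]
print(levels)                 # [3, 3, 4, 3, 3, 4, 4, 2]

# Tally how many restaurants fall into each price level.
distribution = Counter(levels)
print(distribution[3], distribution[4], distribution[2])   # 4 3 1
```

In the notebook itself, `samples` would be replaced by the `entry.find('i').text` values from the loop above.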

In [12]:
# Let's print the number of times each restaurant was booked #
soup.find_all('div', {'class':'booking'})

[]

It seems we can't find the number of bookings for each restaurant. This is likely because the number of bookings is dynamic data that JavaScript inserts after the page loads (as opposed to the restaurant's name, location, and price, which are static data present in the initial HTML). Since requests only downloads the HTML and never executes JavaScript, we need the JavaScript to run before we scrape. There are a few ways to do this. Here, we'll load the page in a browser, wait one second, and then grab the rendered source HTML from the page.
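
A quick way to tell static data from dynamic data is to search the raw response text for the class names we care about. The sketch below is only a rough check, since it uses plain substring matching. The helper name `missing_markers` and the sample HTML are made up for illustration. In the notebook, the first argument would be the text of the response we downloaded with requests.

```python
def missing_markers(raw_html, markers):
    """Return the markers that do not appear anywhere in the raw HTML text."""
    return [marker for marker in markers if marker not in raw_html]


# Tiny stand-in for the downloaded page: names and prices are in the static
# HTML, but nothing about bookings is.
sample_html = (
    '<span class="rest-row-name-text">Crooks</span>'
    '<div class="rest-row-pricing"><i class="pricing--the-price"> $ $ </i></div>'
)

print(missing_markers(sample_html, ['rest-row-name-text', 'rest-row-pricing', 'booking']))
# ['booking']
```

Any class name reported as missing is a candidate for content that only appears after JavaScript runs.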

Let's continue with Selenium, a browser automation tool that drives a real browser (here, Firefox) so that JavaScript renders just as it would for a human visitor. The site should treat our visit like any other browser session, and the content rendered by JavaScript should then be part of the page source.

In [13]:
# Let's visit our relevant page #
driver = webdriver.Firefox()
driver.get('http://www.opentable.com/washington-dc-restaurant-listings')

# Let's wait one second #
sleep(1)

# Let's grab the page source #
html = driver.page_source
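
A fixed one-second sleep can be too short on a slow connection and wasteful on a fast one. A common alternative is to poll until the content we need shows up, giving up after a timeout. The sketch below shows that idea with a stand-in condition so it runs without a browser. The helper `wait_until` and the fake condition are illustrative assumptions, not part of the original notebook.

```python
import time


def wait_until(condition, timeout=10.0, interval=0.5):
    """Call condition() until it returns a truthy value or `timeout` seconds pass.

    Returns the truthy value, or None if the timeout expires first.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            return None
        time.sleep(interval)


# Stand-in condition that only succeeds on its third call, imitating content
# that appears a little while after the page starts loading.
calls = {'count': 0}


def fake_bookings():
    calls['count'] += 1
    return ['Booked 1 times today'] if calls['count'] >= 3 else []


print(wait_until(fake_bookings, timeout=2.0, interval=0.01))
# ['Booked 1 times today']
```

With Selenium, the condition could be something like `lambda: driver.find_elements(By.CSS_SELECTOR, 'div.booking')`, with `By` imported from `selenium.webdriver.common.by`. Selenium also ships its own explicit-wait helper, `WebDriverWait`, which is built on the same idea.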

In [14]:
# Let's convert into a soup object so we can parse it #
html = BeautifulSoup(html, "lxml")

In [15]:
# Let's print the number of times each restaurant was booked again #
html.find_all('div', {'class':'booking'})

[<div class="booking"><span class="tadpole"></span> Booked 1 times today</div>,
 <div class="booking"><span class="tadpole"></span> Booked 8 times today</div>]

In [16]:
# Let's print out the number of times each restaurant was booked today #
for entry in html.find_all('div', {'class':'booking'}):
    print(entry.text)

 Booked 1 times today
 Booked 8 times today


In [17]:
# Let's shut down our driver (quit() closes every window and ends the browser session) #
driver.quit()

Note: This notebook was created during the Covid-19 pandemic, which explains the small number of bookings seen above.

Let's clean this up a little bit. We're going to use regular expressions (regex) to grab only the digits from each booking's text.

In [18]:
# Let's grab the text for each entry #
for booking in html.find_all('div', {'class':'booking'}):
    
    # Let's find the first run of digits in the text #
    match = re.search(r'\d+', booking.text)
    
    # Let's print the digits only if we found a match #
    if match:
        print(match.group())

1
8
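
The matched digits are still strings. If we want to do arithmetic on the booking counts later, we can convert them to integers and decide explicitly what to return when no digits are found. Here is a minimal sketch using the two strings printed earlier plus an empty one. The helper name `booking_count` is illustrative, and returning `None` for a missing count is an assumption about how we would want to treat restaurants without booking data.

```python
import re


def booking_count(text):
    """Return the first run of digits in `text` as an int, or None if there is none."""
    match = re.search(r'\d+', text)
    return int(match.group()) if match else None


samples = [' Booked 1 times today', ' Booked 8 times today', '']
print([booking_count(text) for text in samples])   # [1, 8, None]
```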
