---

Here, we will build a web scraper for OpenTable's DC listings. We're interested in each restaurant's name, location, price, and how many people booked it today. OpenTable provides all of this information on its listings page: http://www.opentable.com/washington-dc-restaurant-listings. Let's inspect the elements of this page to ensure we can find each piece of information we're interested in, beginning with importing the libraries we need.

In [1]:
import matplotlib.pyplot as plt
from selenium import webdriver
from bs4 import BeautifulSoup
from time import sleep
import pandas as pd
import requests
import urllib
import json
import re

%config InlineBackend.figure_format = 'retina'
%matplotlib inline

In [2]:
# Let's set the url we want to visit #
url = 'http://www.opentable.com/washington-dc-restaurant-listings'

# Let's visit that url and grab the html #
html = requests.get(url)

In [3]:
# Let's check what's in the html (.text returns the response body decoded to a Unicode string) #
html.text[:500]

'           <!DOCTYPE html><html lang="en"><head><meta charset="utf-8"/><meta http-equiv="X-UA-Compatible" content="IE=9; IE=8; IE=7; IE=EDGE"/> <title>Restaurant Reservation Availability</title>    <meta  name="robots" content="noindex,nofollow" > </meta>     <link rel="shortcut icon" href="//components.otstatic.com/components/favicon/1.0.6/favicon/favicon.ico" type="image/x-icon"/><link rel="icon" href="//components.otstatic.com/components/favicon/1.0.6/favicon/favicon-16.png" sizes="16x16"/><l'

In [4]:
# Let's convert into a soup object so we can parse it #
soup = BeautifulSoup(html.text, 'html.parser')

Note: We will utilize the web browser inspect tool to find the tags associated with elements of the page we want to scrape.

Now that we have a soup object, we can begin to retrieve data from the HTML page.

In [5]:
# Let's print the restaurant names #
soup.find_all(name='span', attrs={'class':'rest-row-name-text'})

[<span class="rest-row-name-text">Manor</span>,
 <span class="rest-row-name-text">Inventore Lodge</span>,
 <span class="rest-row-name-text">Quia</span>,
 <span class="rest-row-name-text">Architecto</span>,
 <span class="rest-row-name-text">Minervas</span>,
 <span class="rest-row-name-text">Voluptas Circle</span>,
 <span class="rest-row-name-text">At Homenick</span>,
 <span class="rest-row-name-text">Strosin</span>,
 <span class="rest-row-name-text">Efrens</span>,
 <span class="rest-row-name-text">559 Hickle</span>,
 <span class="rest-row-name-text">Weber Brooks</span>,
 <span class="rest-row-name-text">Est</span>,
 <span class="rest-row-name-text">Agloe Bar &amp; Grill</span>,
 <span class="rest-row-name-text">Impedit</span>,
 <span class="rest-row-name-text">Andreanne Homenick</span>,
 <span class="rest-row-name-text">Lilians</span>,
 <span class="rest-row-name-text">Emard Run</span>,
 <span class="rest-row-name-text">Dooley Ports</span>,
 <span class="rest-row-name-text">Hand Mill</s

In [6]:
# Let's print out the restaurant names for each element we find #
for entry in soup.find_all(name='span', attrs={'class':'rest-row-name-text'}):
    print(entry.text)

Manor
Inventore Lodge
Quia
Architecto
Minervas
Voluptas Circle
At Homenick
Strosin
Efrens
559 Hickle
Weber Brooks
Est
Agloe Bar & Grill
Impedit
Andreanne Homenick
Lilians
Emard Run
Dooley Ports
Hand Mill
Corkery Junctions
Sint Summit
West
Ryan
Mias
Sunt Crossing
1131 Corwin
742 Kihn
Emmet Witting
Mountain
Margarete Hill
898 Kub
Boyle
Eum Haley
Carroll Alley
Voluptas Islands
Becker
Elva Howell
951 Cassin
Romaguera Track
Ernser River
Angelas
Possimus
Boyds
Gerlach
Aut
491 O'Connell
Keyshawns
Emmerich
Freddy Cliff
Markus Mews
Ratke Turnpike
Enim Ortiz
Cleo Heidenreich
Mayert
Odio Beier
Harbors
Lora Abernathy
Dorotheas
Kuphal
Will Roads
Aperiam Prairie
Incidunt
Christiansen
Presleys
Voluptatum
Desmond Morissette
1097 Rutherford
Stravenue
Harvey
875 Cartwright
Roads
Heaven Larkin
Quo
1146 Hand
Alfredo Road
Zelmas
Ethelyns
Ipsam Spring
Eum Trail
Quia Lind
Macejkovic
Ondricka
Schoen Turnpike
Gregorys
Hollies
Trace
Stream
Mayert
Verlas
Garrick Gutkowski
326 Champlin
42 Russel
Abby Wall
Rerum R

In [7]:
# Let's print the restaurant locations #
soup.find_all(name='span', attrs={'class':'rest-row-meta--location'})

[<span class="rest-row-meta--location rest-row-meta-text sfx1388addContent">North Kennyport</span>,
 <span class="rest-row-meta--location rest-row-meta-text sfx1388addContent">East Raegan</span>,
 <span class="rest-row-meta--location rest-row-meta-text sfx1388addContent">Port Lenny</span>,
 <span class="rest-row-meta--location rest-row-meta-text sfx1388addContent">Howellshire</span>,
 <span class="rest-row-meta--location rest-row-meta-text sfx1388addContent">Crookston</span>,
 <span class="rest-row-meta--location rest-row-meta-text sfx1388addContent">New Elzachester</span>,
 <span class="rest-row-meta--location rest-row-meta-text sfx1388addContent">Oswaldoton</span>,
 <span class="rest-row-meta--location rest-row-meta-text sfx1388addContent">Nolaland</span>,
 <span class="rest-row-meta--location rest-row-meta-text sfx1388addContent">South Zachery</span>,
 <span class="rest-row-meta--location rest-row-meta-text sfx1388addContent">North Natashaport</span>,
 <span class="rest-row-meta--lo

In [8]:
# Let's print out the restaurant locations for each element we find #
for entry in soup.find_all('span', {'class':'rest-row-meta--location'}):
    print(entry.text)

North Kennyport
East Raegan
Port Lenny
Howellshire
Crookston
New Elzachester
Oswaldoton
Nolaland
South Zachery
North Natashaport
Deshawnborough
South Aaronshire
Cathyton
North Shemar
New Isom
New Genoveva
North Lydia
South Garret
Myaside
North Marquesborough
South Jacquestown
Lake Rustyland
Simview
Port Kendra
Lake Alessiashire
New Zeldaside
West Elinoreland
Bettyeburgh
Hirtheland
North Katrine
Port Alex
New Lonnyhaven
Brodybury
Destanyport
East Fanniemouth
Kemmertown
Sipeston
East Maynardfort
Lake Brendan
Lefflerton
Felixmouth
Loniebury
Macshire
Caspermouth
West Pierremouth
Connfort
Abdielland
Rathport
East Marcfurt
Burdettehaven
Marleyberg
McLaughlinville
Port Jaceyport
Port Karina
Port Lucio
Port Emilianomouth
Theronville
Sipesview
Ashleeport
Dawsonfort
Lake Rita
Erwinton
Turcotteville
Williamsonfort
Grahamshire
Ortizport
Camronville
Lake Rafaelton
North Natmouth
Lake Robertaport
Elmirabury
Darebury
Cathymouth
Port Orphaberg
Fisherbury
New Stanford
Legrosbury
Lake Kennedyberg
East S

In [9]:
# Let's print the restaurant prices #
soup.find_all('div', {'class':'rest-row-pricing'})

[<div class="rest-row-pricing"> <i class="pricing--the-price">  $    $    $    $  </i> <span class="pricing--not-the-price"> </span></div>,
 <div class="rest-row-pricing"> <i class="pricing--the-price">  $    $    $    $  </i> <span class="pricing--not-the-price"> </span></div>,
 <div class="rest-row-pricing"> <i class="pricing--the-price">  $    $      </i> <span class="pricing--not-the-price">  $    $      </span></div>,
 <div class="rest-row-pricing"> <i class="pricing--the-price">  $    $      </i> <span class="pricing--not-the-price">  $    $      </span></div>,
 <div class="rest-row-pricing"> <i class="pricing--the-price">  $    $      </i> <span class="pricing--not-the-price">  $    $      </span></div>,
 <div class="rest-row-pricing"> <i class="pricing--the-price">  $    $    $    $  </i> <span class="pricing--not-the-price"> </span></div>,
 <div class="rest-row-pricing"> <i class="pricing--the-price">  $    $    $    </i> <span class="pricing--not-the-price">  $        </span>

In [10]:
# Let's print out the restaurant prices (dollar signs) for each element we find #
for entry in soup.find_all('div', {'class':'rest-row-pricing'}):
    print(entry.find('i').text)

  $    $    $    $  
  $    $    $    $  
  $    $      
  $    $      
  $    $      
  $    $    $    $  
  $    $    $    
  $    $      
  $    $      
  $    $    $    
  $    $    $    
  $    $    $    $  
  $    $      
  $    $    $    $  
  $    $    $    $  
  $    $    $    
  $    $    $    $  
  $    $    $    
  $    $    $    $  
  $    $    $    $  
  $    $      
  $    $    $    
  $    $    $    $  
  $    $    $    
  $    $    $    
  $    $      
  $    $    $    
  $    $      
  $    $      
  $    $      
  $    $    $    $  
  $    $    $    $  
  $    $    $    
  $    $    $    $  
  $    $    $    
  $    $    $    
  $    $    $    
  $    $    $    $  
  $    $    $    $  
  $    $    $    $  
  $    $    $    $  
  $    $    $    
  $    $      
  $    $      
  $    $    $    $  
  $    $      
  $    $      
  $    $    $    
  $    $    $    
  $    $      
  $    $    $    $  
  $    $    $    $  
  $    $      
  $    $    $    
  $    $      
  $ 

In [11]:
# Let's try to print the number of dollars signs per restaurant #
for entry in soup.find_all('div', {'class':'rest-row-pricing'}):
    price = entry.find('i').text
    print('Number of $:',price.count('$'))

Number of $: 4
Number of $: 4
Number of $: 2
Number of $: 2
Number of $: 2
Number of $: 4
Number of $: 3
Number of $: 2
Number of $: 2
Number of $: 3
Number of $: 3
Number of $: 4
Number of $: 2
Number of $: 4
Number of $: 4
Number of $: 3
Number of $: 4
Number of $: 3
Number of $: 4
Number of $: 4
Number of $: 2
Number of $: 3
Number of $: 4
Number of $: 3
Number of $: 3
Number of $: 2
Number of $: 3
Number of $: 2
Number of $: 2
Number of $: 2
Number of $: 4
Number of $: 4
Number of $: 3
Number of $: 4
Number of $: 3
Number of $: 3
Number of $: 3
Number of $: 4
Number of $: 4
Number of $: 4
Number of $: 4
Number of $: 3
Number of $: 2
Number of $: 2
Number of $: 4
Number of $: 2
Number of $: 2
Number of $: 3
Number of $: 3
Number of $: 2
Number of $: 4
Number of $: 4
Number of $: 2
Number of $: 3
Number of $: 2
Number of $: 4
Number of $: 3
Number of $: 3
Number of $: 3
Number of $: 4
Number of $: 3
Number of $: 3
Number of $: 4
Number of $: 4
Number of $: 2
Number of $: 2
Number of 
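
The counting step above can be wrapped in a small helper so that each restaurant's price tier becomes an integer. The following is a minimal sketch; the function name `price_tier` is an illustrative assumption rather than part of the original notebook, and the sample strings are copied from the pricing output shown earlier.

```python
def price_tier(price_text):
    """Return the number of dollar signs in a pricing string."""
    # str.count only counts '$' characters, so the padding whitespace is harmless
    return price_text.count('$')


# Sample strings in the same shape as the scraped <i> text above
samples = ['  $    $    $    $  ', '  $    $      ', '  $    $    $    ']
tiers = [price_tier(text) for text in samples]
print(tiers)
```

Storing the tier as an integer makes it straightforward to sort or average later.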

In [12]:
# Let's print the number of times each restaurant was booked #
soup.find_all('div', {'class':'booking'})

[]

It seems we can't find the number of bookings for each restaurant. This is likely because the booking counts are dynamic data, inserted by JavaScript after the page loads (as opposed to the restaurant's name, location, and price, which are static data present in the initial HTML). Thus, the JavaScript must run before we scrape. There are a few ways to handle this. Here, we'll load the page in a real browser, wait one second, and then grab the source HTML from the page.

Let's continue with Selenium (a browser automation tool that drives a real browser, so JavaScript renders just as it would for a human visitor). The page should believe we're visiting from a live browser client, and the JavaScript-rendered content should become part of the page source.

In [13]:
# Let's visit our relevant page #
driver = webdriver.Firefox()
driver.get('http://www.opentable.com/washington-dc-restaurant-listings')

# Let's wait one second #
sleep(1)

# Let's grab the page source #
html = driver.page_source
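
A fixed one-second pause can be too short on a slow connection and wasteful on a fast one. One alternative is to poll until the content appears, up to a timeout. The sketch below shows the idea using only the standard library. `wait_until` is a hypothetical helper (not a Selenium function), and the fake `page_ready` check stands in for a real test such as whether `class="booking"` appears in `driver.page_source`. Selenium also ships its own `WebDriverWait` class for this purpose.

```python
import time


def wait_until(predicate, timeout=10.0, interval=0.25):
    """Call predicate() repeatedly until it returns True or timeout seconds pass.

    Returns True if the predicate succeeded, False if the timeout expired.
    """
    deadline = time.monotonic() + timeout
    while True:
        if predicate():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


# Stand-in for a real check such as: 'class="booking"' in driver.page_source
attempts = {'count': 0}


def page_ready():
    attempts['count'] += 1
    # Pretend the content appears on the third check
    return attempts['count'] >= 3


print(wait_until(page_ready, timeout=2.0, interval=0.01))
```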

In [14]:
# Let's convert into a soup object so we can parse it #
html = BeautifulSoup(html, "lxml")

In [15]:
# Let's print the number of times each restaurant was booked again #
html.find_all('div', {'class':'booking'})

[<div class="booking"><span class="tadpole"></span> Booked 6 times today</div>,
 <div class="booking"><span class="tadpole"></span> Booked 1 times today</div>,
 <div class="booking"><span class="tadpole"></span> Booked 1 times today</div>]

In [16]:
# Let's print out the number of times each restaurant was booked today #
for entry in html.find_all('div', {'class':'booking'}):
    print(entry.text)

 Booked 6 times today
 Booked 1 times today
 Booked 1 times today


In [17]:
# Let's close our driver #
driver.close()

Note: This notebook was created during the Covid-19 pandemic, which explains the small number of bookings seen above.

Let's clean this up a little. We're going to use regular expressions (regex) to grab only the digits in each booking's text.

In [18]:
# Let's go through the text for each booking #
for booking in html.find_all('div', {'class':'booking'}):
    
    # Let's match the first run of digits #
    match = re.search(r'\d+', booking.text)
    
    if match:
        print(match.group())

6
1
1
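
If the counts are needed for later analysis, it may help to convert them to integers and to handle text that contains no number. A minimal sketch follows, assuming the booking text has the form seen above; `booking_count` is an illustrative name, and the last sample string is made up to show the fallback.

```python
import re


def booking_count(text):
    """Return the first run of digits in text as an int, or None if there is none."""
    match = re.search(r'\d+', text)
    return int(match.group()) if match else None


samples = [' Booked 6 times today', ' Booked 1 times today', 'No bookings shown']
counts = [booking_count(text) for text in samples]
print(counts)
```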


Sometimes an API doesn't provide all the information we would like to get. Let's continue by using a combination of scraping and API calls to find the ratings and networks of famous television shows. The Internet Movie Database contains data about movies and TV shows. Unfortunately, it does not have a public API. However, the webpage http://www.imdb.com/chart/toptv/?ref_=nv_tp_tv250_2 contains a list of the top 250 TV shows of all time.

In [19]:
# Let's create a function to obtain the list of the top 250 results #
def get_top_250():
    response = requests.get('http://www.imdb.com/chart/toptv/?ref_=nv_tp_tv250_2')
    html = response.text
    
    # Let's find everything after /title/ up to the next forward slash in each a href element #
    entries = re.findall(r'<a href.*?/title/(.*?)/', html)
    
    # Let's remove duplicates and return the results as a list (note that a set does not keep the chart order) #
    return list(set(entries))

In [20]:
# Let's call the function and check the length of the list #
entries = get_top_250()
len(entries)

251

In [21]:
# Let's check out why we get an extra entry in our list #
entries

['tt0281491',
 'tt2306299',
 'tt0278238',
 'tt4288182',
 'tt0111958',
 'tt6108262',
 'tt6111552',
 'tt7920978',
 'tt2861424',
 'tt1641384',
 'tt0988818',
 'tt5897304',
 'tt0118421',
 'tt2395695',
 'tt5249462',
 'tt2802850',
 'tt3895150',
 'tt0075537',
 'tt0121220',
 'tt0436992',
 'tt2356777',
 'tt5421602',
 'tt0229889',
 'tt0988824',
 'tt7259746',
 'tt12004706',
 'tt0081846',
 'tt1534360',
 'tt0472954',
 'tt2243973',
 'tt0092455',
 'tt1489428',
 'tt0080306',
 'tt7366338',
 'tt2442560',
 'tt2560140',
 'tt0994314',
 'tt0094517',
 'tt0088484',
 'tt6077448',
 'tt2571774',
 'tt0141842',
 'tt2100976',
 'tt0264235',
 'tt1758429',
 'tt0094525',
 'tt4742876',
 'tt0121955',
 'tt1124373',
 'tt0090509',
 'tt0118273',
 'tt1586680',
 'tt5189670',
 'tt2303687',
 'tt0367279',
 'tt0096657',
 'tt5753856',
 'tt9432978',
 'tt0421357',
 'tt4574334',
 'tt0268093',
 'tt0290978',
 'tt0193676',
 'tt0487831',
 'tt0096639',
 'tt1474684',
 'tt9398466',
 'tt0380136',
 'tt7660850',
 'tt0877057',
 'tt7278862',
 'tt0

In [22]:
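# Let's find the index of the entry that isn't an IMDb ID (IDs start with 'tt') and drop it from the list #
# next() raises StopIteration if every entry is a valid ID, so we never drop a good entry by mistake #
nn = next(i for i, entry in enumerate(entries) if not entry.startswith('tt'))

entries.pop(nn)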

'?count=100&amp;groups=oscar_best_picture_winners&amp;sort=year%2Cdesc&amp;ref_=nv_ch_osc" tabindex="-1" aria-disabled="false"><span class="ipc-list-item__text" role="presentation">Best Picture Winners<'

In [23]:
# Let's check the length of the list again #
len(entries)

250
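
The stray entry appears because the pattern captures anything between `/title/` and the next forward slash, including links that are not show pages. One way to avoid the cleanup step is to require the captured text to look like an ID (`tt` followed by digits). The sketch below runs both patterns on a small hand-written HTML sample; the sample markup is an assumption made for illustration and is much simpler than the real page.

```python
import re

# Hand-written sample: two show links, one repeated link, and one non-show link
sample_html = (
    '<a href="/title/tt0281491/?ref_=chart">Show A</a>'
    '<a href="/title/tt2306299/">Show B</a>'
    '<a href="/title/tt0281491/">Show A</a>'
    '<a href="/search/title/?count=100">Best Picture Winners</a>'
)

# Pattern from the function above: it also captures text from the non-show link
loose = re.findall(r'<a href.*?/title/(.*?)/', sample_html)

# Stricter pattern: only captures 'tt' followed by digits
strict = re.findall(r'/title/(tt\d+)/', sample_html)

# dict.fromkeys removes duplicates while keeping first-seen order
unique_ids = list(dict.fromkeys(strict))

print(loose)
print(unique_ids)
```

Unlike `list(set(...))`, the `dict.fromkeys` step keeps the IDs in the order they appear on the page.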

Although the Internet Movie Database does not have a public API, an open API exists at http://www.tvmaze.com/api. Let's use this API to retrieve information about each of the 250 TV shows we have just extracted.

In [24]:
# Let's create a function to pull information from the API using .get() lookups on the JSON response and store it in a DataFrame #
shows_df1 = pd.DataFrame(columns = ['show_name', 'rating_avg', 'genres', 'network', 'premiere_date', 'status'])

def get_entry(entry):
    res=requests.get('http://api.tvmaze.com/lookup/shows?imdb='+entry)
    
    if res.status_code == 200:
        try:
            status = json.loads(res.text).get('status')
        except AttributeError:
            status = 'NA'
        try: 
            rating = json.loads(res.text).get('rating').get('average')
        except AttributeError:
            rating = 'NA'
            
        try:
            network = json.loads(res.text).get('network').get('name')
        except AttributeError:
            network = 'NA'
            
        try:
            title = json.loads(res.text).get('name')
        except AttributeError:
            title = 'NA'
            
        try:
            premier = json.loads(res.text).get('premiered')
        except AttributeError:
            premier = 'NA'
            
        try:
            genres = json.loads(res.text).get('genres')
        except AttributeError:
            genres = 'NA'

       
        shows_df1.loc[len(shows_df1)] = [title, rating, genres, network, premier, status]

In [25]:
# Let's call the above function #
for entry in entries:
    get_entry(entry)
    
shows_df1

Unnamed: 0,show_name,rating_avg,genres,network,premiere_date,status
0,Still Game,8.8,[Comedy],BBC Scotland,2002-09-06,Ended
1,Vikings,8.7,"[Drama, Action, History]",History,2013-03-03,Running
2,Samurai Jack,8.7,"[Action, Adventure]",Adult Swim,2001-08-10,Ended
3,Atlanta,7.4,"[Drama, Comedy, Music]",FX,2016-09-06,Running
4,Father Ted,8.1,[Comedy],Channel 4,1995-04-21,Ended
5,Senke nad Balkanom,,"[Drama, Crime, Thriller]",RTS1,2017-10-22,Running
6,Şahsiyet,,"[Drama, Crime, Thriller]",,2018-03-17,Ended
7,Rick and Morty,9.1,"[Comedy, Adventure, Science-Fiction]",Adult Swim,2013-12-02,Running
8,Young Justice,8.5,"[Action, Adventure, Science-Fiction]",,2010-11-26,Running
9,Gintama,8.7,"[Comedy, Action, Anime]",TV Tokyo,2006-04-04,To Be Determined


In [26]:
# Let's check for null values present, as well as data types for each column #
shows_df1.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 227 entries, 0 to 226
Data columns (total 6 columns):
show_name        227 non-null object
rating_avg       202 non-null float64
genres           227 non-null object
network          227 non-null object
premiere_date    227 non-null object
status           227 non-null object
dtypes: float64(1), object(5)
memory usage: 12.4+ KB


In [27]:
# Let's create a function to pull information from the API by converting the JSON into a Python dictionary and indexing it directly #
shows_df2 = pd.DataFrame(columns = ['show_name', 'rating_avg', 'genres', 'network', 'premiere_date', 'status'])

def get_entry(entry):
    res=requests.get('http://api.tvmaze.com/lookup/shows?imdb='+entry)
    if res.status_code == 200:
        results = json.loads(res.text)
        
        try:    
            status = results['status']
        except TypeError:
            status = 'NA'   
        try:
            rating = results['rating']['average']
        except TypeError:
            rating = 'NA'
        try:
            network = results['network']['name']
        except TypeError:
            network = 'NA'
        try:   
            title = results['name']
        except TypeError:
            title = 'NA'
        try:   
            genres = results['genres']
        except TypeError:
            genres = 'NA'
        try:   
            premier = results['premiered']
        except TypeError:
            premier = 'NA'
        
        shows_df2.loc[len(shows_df2)] = [title, rating, genres, network, premier, status]

In [28]:
# Let's call the above function #
for entry in entries:
    get_entry(entry)
    
shows_df2

Unnamed: 0,show_name,rating_avg,genres,network,premiere_date,status
0,Still Game,8.8,[Comedy],BBC Scotland,2002-09-06,Ended
1,Vikings,8.7,"[Drama, Action, History]",History,2013-03-03,Running
2,Samurai Jack,8.7,"[Action, Adventure]",Adult Swim,2001-08-10,Ended
3,Atlanta,7.4,"[Drama, Comedy, Music]",FX,2016-09-06,Running
4,Father Ted,8.1,[Comedy],Channel 4,1995-04-21,Ended
5,Senke nad Balkanom,,"[Drama, Crime, Thriller]",RTS1,2017-10-22,Running
6,Şahsiyet,,"[Drama, Crime, Thriller]",,2018-03-17,Ended
7,Rick and Morty,9.1,"[Comedy, Adventure, Science-Fiction]",Adult Swim,2013-12-02,Running
8,Young Justice,8.5,"[Action, Adventure, Science-Fiction]",,2010-11-26,Running
9,Gintama,8.7,"[Comedy, Action, Anime]",TV Tokyo,2006-04-04,To Be Determined


In [29]:
# Let's check for null values present, as well as data types for each column #
shows_df2.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 227 entries, 0 to 226
Data columns (total 6 columns):
show_name        227 non-null object
rating_avg       202 non-null float64
genres           227 non-null object
network          227 non-null object
premiere_date    227 non-null object
status           227 non-null object
dtypes: float64(1), object(5)
memory usage: 12.4+ KB
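
Both versions of `get_entry` repeat a try/except block for every field. A small helper that walks nested keys and falls back to a default can shorten this. The sketch below is one possible approach, not the notebook's method: `dig` is a hypothetical helper, and the JSON string is a made-up payload containing only the fields used above, not a real API response.

```python
import json


def dig(data, *keys, default='NA'):
    """Follow keys through nested dicts; return default if any step is missing or None."""
    current = data
    for key in keys:
        if not isinstance(current, dict):
            return default
        current = current.get(key)
    return default if current is None else current


# Made-up payload for illustration (not a real API response)
payload = json.loads(
    '{"name": "Example Show", "status": "Ended", "genres": ["Comedy"],'
    ' "rating": {"average": 8.8}, "network": null, "premiered": "2002-09-06"}'
)

row = [
    dig(payload, 'name'),
    dig(payload, 'rating', 'average'),
    dig(payload, 'genres'),
    dig(payload, 'network', 'name'),
    dig(payload, 'premiered'),
    dig(payload, 'status'),
]
print(row)
```

Passing `default=None` for the rating would keep that column numeric instead of mixing numbers with the string 'NA'.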
