# Webscraping 2.0
---

# ![logo](https://ga-dash.s3.amazonaws.com/production/assets/logo-9f88ae6c9c3871690e33280fcf557f33.png) 
## Learning Objectives
- Revisit how to locate elements on a page
- Discuss limitations associated with simple requests and urllib libraries
- Introduce Selenium as a solution, and implement a scraper using selenium


## Scraping 

http://www.opentable.com/washington-dc-restaurant-listings

When scraping websites, not all elements are static: some are rendered by JavaScript after the page loads, so a plain request for the html never sees them. Today we introduce a solution, plus more scraping practice.


## Resources

- Find elements [Selenium docs](http://selenium-python.readthedocs.io/locating-elements.html#locating-elements)
- Using Selenium to enter website information [demo](http://thiagomarzagao.com/2013/11/12/webscraping-with-selenium-part-1/)
- Python regex tester [here](https://regex101.com/)



In today's codealong, I'll walk through how to build a scraper using urllib and BeautifulSoup. We'll run into the problems with this approach that we discussed in the lesson readme, and we'll remedy them using Selenium, a tool that automates a real web browser.

For starters, we're going to be scraping OpenTable's DC listings. We're interested in knowing each restaurant's **name, location, price, and how many people booked it today.**

OpenTable provides all of this information on this given page: http://www.opentable.com/washington-dc-restaurant-listings

Let's inspect the elements of this page to ensure we can find each of the bits of information in which we're interested.
>Class proceeds to do exactly that ^

We'll then build our scraper:

In [1]:
# import our necessary first packages
from bs4 import BeautifulSoup
import urllib.request

In [2]:
# set the url we want to visit
url = "http://www.opentable.com/washington-dc-restaurant-listings"

# visit that url, and grab the html of said page
html = urllib.request.urlopen(url).read()

Pause.

At this point, what is in html?

In [3]:
html

b'          <!DOCTYPE html><html lang="en"><head><meta charset="utf-8"/><meta http-equiv="X-UA-Compatible" content="IE=9; IE=8; IE=7; IE=EDGE"/> <title>Washington, D.C. Area Restaurants List | OpenTable</title>  <meta  name="description" content="Find Washington, D.C. Area restaurants. Search by location, cuisine, or price to refine restaurant results in the Washington, D.C. Area area." > </meta>  <meta  name="robots" content="noindex" > </meta>    <link rel="shortcut icon" href="//components.otstatic.com/components/favicon/1.0.4/favicon/favicon.ico" type="image/x-icon"/><link rel="icon" href="//components.otstatic.com/components/favicon/1.0.4/favicon/favicon-16.png" sizes="16x16"/><link rel="icon" href="//components.otstatic.com/components/favicon/1.0.4/favicon/favicon-32.png" sizes="32x32"/><link rel="icon" href="//components.otstatic.com/components/favicon/1.0.4/favicon/favicon-48.png" sizes="48x48"/><link rel="icon" href="//components.otstatic.com/components/favicon/1.0.4/favicon/f
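The `b'...'` prefix in the output above shows that `urlopen(url).read()` hands back raw `bytes` rather than a `str`. Here is a minimal sketch of the difference, using a short hypothetical byte string as a stand-in for the real response (BeautifulSoup accepts either form, so decoding is optional for our purposes):

```python
# a short stand-in for the bytes returned by urlopen(url).read() (hypothetical sample)
raw = b'<html><head><title>Washington, D.C. Area Restaurants List | OpenTable</title></head></html>'

print(type(raw))    # <class 'bytes'>

# decode the bytes into text; the page above declares charset="utf-8"
text = raw.decode('utf-8')

print(type(text))   # <class 'str'>
print('OpenTable' in text)
```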

This is a mess and will be a pain to deal with. Let's clean it up with Beautiful Soup.

In [6]:
# we need to convert this into a soup object
soup = BeautifulSoup(html, 'lxml')

Any questions so far?

*"lxml is the most feature-rich and easy-to-use library for processing XML and HTML in the Python language." More information [here](http://lxml.de/)*

Let's first find each restaurant name listed on the page we've loaded. How do we find the page location of the restaurant? (Psst: we need to know where in the **html** the restaurant element is housed)


**Let's Check out:** http://www.opentable.com/washington-dc-restaurant-listings

See if you can find the restaurant name on the page. Keep in mind there are many restaurants loaded on the page.

In [7]:
# print the restaurant names with the element tags

soup.findAll(name='span', class_='rest-row-name-text')


[<span class="rest-row-name-text">Ruffino's - Arlington</span>,
 <span class="rest-row-name-text">Joe's Place Pizza and Pasta</span>,
 <span class="rest-row-name-text">Rasika West End</span>,
 <span class="rest-row-name-text">Blue Duck Tavern</span>,
 <span class="rest-row-name-text">Farmers Fishers Bakers</span>,
 <span class="rest-row-name-text">Founding Farmers - Tysons</span>,
 <span class="rest-row-name-text">Founding Farmers - DC</span>,
 <span class="rest-row-name-text">Filomena Ristorante</span>,
 <span class="rest-row-name-text">District Commons</span>,
 <span class="rest-row-name-text">Ambar - Arlington</span>,
 <span class="rest-row-name-text">Eddie V's - Tysons Corner</span>,
 <span class="rest-row-name-text">Clyde's of Georgetown</span>,
 <span class="rest-row-name-text">Chez Billy Sud</span>,
 <span class="rest-row-name-text">Fogo de Chao Brazilian Steakhouse - Tyson's</span>,
 <span class="rest-row-name-text">Seasons 52 - Tysons Corner</span>,
 <span class="rest-row-name

Great! However, this is not clean.

We want to print out only the restaurant name.

BeautifulSoup has two functions to help us do exactly that. What are they?

In [20]:
# for each element you find, print out ONLY the restaurant names
[name.text for name in soup.findAll(name='span', class_='rest-row-name-text')]


["Ruffino's - Arlington",
 "Joe's Place Pizza and Pasta",
 'Rasika West End',
 'Blue Duck Tavern',
 'Farmers Fishers Bakers',
 'Founding Farmers - Tysons',
 'Founding Farmers - DC',
 'Filomena Ristorante',
 'District Commons',
 'Ambar - Arlington',
 "Eddie V's - Tysons Corner",
 "Clyde's of Georgetown",
 'Chez Billy Sud',
 "Fogo de Chao Brazilian Steakhouse - Tyson's",
 'Seasons 52 - Tysons Corner',
 'Sugar Factory - Pentagon Mall',
 'J. Gilbert’s – Wood Fired Steaks & Seafood - McLean',
 'CIRCA at Foggy Bottom',
 'Boqueria - DC',
 'Il Canale',
 'Tupelo Honey - Arlington',
 'Nostos Restaurant',
 '2941 Restaurant',
 'Et Voila',
 'Barrel & Bushel',
 'Texas Jacks Barbecue',
 'The Capital Grille - Tysons Corner',
 'BlackSalt',
 'Lyon Hall',
 'Palette 22',
 "Chef Geoff's Tysons",
 'Barley Mac',
 'The Palm Tysons Corner',
 'Chima Steakhouse - Washington',
 'Copperwood Tavern',
 'Grillfish DC',
 "Tony & Joe's Seafood Place",
 'La Chaumiere',
 'CIRCA at Clarendon',
 'Sequoia',
 'Medium Rare - 

**Important Caveats**: 

What happens if I try and clean the name this way?

In [19]:
# Why does .text work here, when soup.findAll(...).text raises an error? Tip: check print(type(x)).

name = soup.find('span', class_='rest-row-name-text')
print(name.text)

Ruffino's - Arlington
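To make the caveat concrete, here is a small sketch on a hypothetical two-restaurant snippet (not the live page). It assumes `bs4` is installed and uses Python's built-in `html.parser` so it does not depend on `lxml`. `find` returns a single `Tag`, while `find_all` returns a list-like `ResultSet`, which is why `.text` only works on the former.

```python
from bs4 import BeautifulSoup

# a tiny hypothetical page that reuses the class name from the cells above
sample = """
<span class="rest-row-name-text">Rasika West End</span>
<span class="rest-row-name-text">Blue Duck Tavern</span>
"""
soup_demo = BeautifulSoup(sample, "html.parser")

first = soup_demo.find("span", class_="rest-row-name-text")
every = soup_demo.find_all("span", class_="rest-row-name-text")

print(type(first).__name__)   # a single Tag, so .text works
print(type(every).__name__)   # a ResultSet, which behaves like a list of Tags
print(first.text)
print([tag.text for tag in every])

# asking the whole ResultSet for .text fails; ask each Tag instead
try:
    every.text
except AttributeError as err:
    print("AttributeError:", err)
```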


Any questions so far?

**Can you repeat that process for finding the location?**

In [24]:
# first, see if you can identify the location for all elements -- print it out

soup.findAll('span', class_='rest-row-meta--location rest-row-meta-text')

[<span class="rest-row-meta--location rest-row-meta-text">Arlington</span>,
 <span class="rest-row-meta--location rest-row-meta-text">Arlington</span>,
 <span class="rest-row-meta--location rest-row-meta-text">West End</span>,
 <span class="rest-row-meta--location rest-row-meta-text">West End</span>,
 <span class="rest-row-meta--location rest-row-meta-text">Georgetown</span>,
 <span class="rest-row-meta--location rest-row-meta-text">Tysons Corner / McLean</span>,
 <span class="rest-row-meta--location rest-row-meta-text">Foggy Bottom</span>,
 <span class="rest-row-meta--location rest-row-meta-text">Georgetown</span>,
 <span class="rest-row-meta--location rest-row-meta-text">Foggy Bottom</span>,
 <span class="rest-row-meta--location rest-row-meta-text">Arlington</span>,
 <span class="rest-row-meta--location rest-row-meta-text">Tysons Corner / McLean</span>,
 <span class="rest-row-meta--location rest-row-meta-text">Georgetown</span>,
 <span class="rest-row-meta--location rest-row-meta-tex

In [25]:
# now print out a clean version of the location for each restaurant

for location in soup.findAll('span', 
                             class_='rest-row-meta--location rest-row-meta-text'):
    print(location.text)
    

Arlington
Arlington
West End
West End
Georgetown
Tysons Corner / McLean
Foggy Bottom
Georgetown
Foggy Bottom
Arlington
Tysons Corner / McLean
Georgetown
Georgetown
Tysons Corner / McLean
Tysons Corner / McLean
Arlington
McLean
Foggy Bottom
Dupont Circle
Georgetown
Arlington
Vienna
Falls Church
Palisades Northwest
Tysons Corner / McLean
Arlington
Tysons Corner / McLean
Palisades Northwest
Arlington
Arlington
Tysons Corner / McLean
Arlington
Tysons Corner / McLean
Vienna
Arlington
Dupont Circle
Georgetown
Georgetown
Arlington
Georgetown
Cleveland Park
Dupont Circle
Georgetown
Dupont Circle
Alexandria
Dupont Circle
Friendship Heights
Cleveland Park
Bethesda / Chevy Chase
Friendship Heights
Georgetown
Chevy Chase
Tysons Corner / McLean
West End
Georgetown
Georgetown
Arlington
Dupont Circle
Bethesda / Chevy Chase
Downtown
Georgetown
Arlington
Dupont Circle
Georgetown
Arlington
Arlington
Arlington
Georgetown
Cleveland Park
Foggy Bottom
Tysons Corner / McLean
Foggy Bottom
Arlington
Georgetown

Ok, we've figured out the restaurant name and location. 

Now we need to grab the price (number of dollar signs on a scale of one to four) for each restaurant. 

**Let's use the same process**

In [46]:
# count the dollar signs for each restaurant: strip the spaces from the <i> text and take its length

for pricing in soup.findAll('div', class_='rest-row-pricing'):
    print(len(pricing.find('i').text.replace(' ', '')))


2
2
3
3
2
2
2
3
2
2
3
2
2
4
2
2
3
2
3
2
2
2
2
2
2
2
4
3
2
2
2
2
4
4
2
2
2
2
2
2
2
4
2
2
2
2
2
2
2
2
2
4
3
3
3
2
3
2
4
3
2
2
3
4
2
3
2
2
2
4
2
2
2
3
3
2
4
3
2
2
2
2
3
4
3
3
3
4
2
2
2
2
2
2
2
2
2
4
2
2


Great! Now print out ONLY the dollar signs. 

This one is trickier to eliminate the html. **Hint:** try a nested find

Spend some time on your own or with somebody next to you to figure this out. 

**Check:** How come I did not need to use ".findAll()"?

What if I wanted just the number of dollar signs per restaurant? 

Can you figure out a way to simply print out the number of dollar signs per restaurant listed? There are a few different ways to do this. 

In [43]:
# print the number of dollars signs per restaurant
fou
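As a rough illustration of "a few different ways", the sketch below counts dollar signs in a hypothetical string standing in for the text of one restaurant's `<i>` element. The exact spacing on the live page is an assumption.

```python
# hypothetical text of the <i> element for a two-dollar-sign restaurant
price_text = "  $    $      "

# option 1: remove the spaces and measure what is left
by_length = len(price_text.replace(' ', ''))

# option 2: count the dollar signs directly
by_count = price_text.count('$')

# option 3: split on whitespace and count the pieces
by_split = len(price_text.split())

print(by_length, by_count, by_split)
```

Counting with `.count('$')` is the least fragile of the three, since it ignores any other characters (tabs, newlines) that might appear in the text.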

Phew, nice work. Only one more column left. 


The only thing left to find is the number of times each restaurant was booked. 
In the next cell, print out all objects that contain the number of times the restaurant was booked.

Refer back to the site [here](http://www.opentable.com/washington-dc-restaurant-listings)

In [None]:
# print out all objects that contain the number of times the restaurant was booked



**What?!... empty set!**

This is weird. Did we find the wrong element? What's going on here? Discuss.

How can we debug this? Any ideas?

In [47]:
# let's first try printing out all 'div' class objects
for entry in soup.find_all('div'):
    print(entry.text)

 
  if (/PhantomJS/.test(window.navigator.userAgent)) {
    window.__TestGaCalls = [];
    window.MapTrackGA = function(dataPoint) {
      window.ga = function(_) {
        window.__TestGaCalls.push(arguments);
      }
      var data = {};
      data[dataPoint] = '1';
      ga('gtm1.send', 'event', 'map_event', 'map_event', data);
    }
  }
  else {
    window.MapTrackGA = function(dataPoint) {
      if (typeof ga === 'function') {
        var data = {};
        data[dataPoint] = '1';
        ga('gtm1.send', 'event', 'map_event', 'map_event', data);
      }
    }
  }
 .icon-font{font-family:icons;speak:none;font-style:normal;font-weight:400;font-variant:normal;text-transform:none;line-height:1;-webkit-font-smoothing:antialiased;-moz-osx-font-smoothing:grayscale}.breadcrumb li.icon-visible a:before{font-family:icons;speak:none;font-style:normal;font-weight:400;font-variant:normal;text-transform:none;line-height:1;-webkit-font-smoothing:antialiased;-moz-osx-font-smoothing:grayscale}.brea

     CIRCA at Clarendon       (290) 290 reviews   100% 100% Recommend     $    $         $    $        American Arlington   Hurry, we only have 4 timeslots left!      6:30 pm      6:45 pm           7:45 pm      8:00 pm       
     CIRCA at Clarendon       (290) 290 reviews   100% 100% Recommend     $    $         $    $        American Arlington   Hurry, we only have 4 timeslots left!      6:30 pm      6:45 pm           7:45 pm      8:00 pm      
 
  CIRCA at Clarendon       (290) 290 reviews   100% 100% Recommend     $    $         $    $        American Arlington   Hurry, we only have 4 timeslots left!      6:30 pm      6:45 pm           7:45 pm      8:00 pm      
  CIRCA at Clarendon  
    (290) 290 reviews   100% 100% Recommend     $    $         $    $       
   (290) 290 reviews   100% 100% Recommend  
  (290) 290 reviews 



  $    $         $    $      
 American Arlington
  Hurry, we only have 4 timeslots left! 
 Hurry, we only have 4 timeslots left!
    6:30 pm      6:45 pm  

**Ahhh....** I still don't see it! What do we do?

[Let's go back to the source](http://www.opentable.com/washington-dc-restaurant-listings)

In [None]:
# We know it is there. Now let's look for it in our entire soup object: display it and search the output (command+f) for "booked"
soup

**What do you notice? Why is this happening?**

## Break

## Enter Selenium

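Selenium is a browser automation tool: it drives a real browser, which can optionally run headless (that is, without a visible window). That means it enables us to mock human browsing behavior, including waiting for JavaScript elements to load.

If you do not already have Selenium installed, you can install it with conda: `conda install -c conda-forge selenium` (or with pip: `pip install selenium`)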

In [48]:
# import
from selenium import webdriver

Selenium requires us to choose a browser for it to drive. Chrome and Firefox are both very common choices. 

Go here: http://selenium-python.readthedocs.io/faq.html

In [None]:
# STOP
# what is going to happen when I run the next cell?

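Replace `joeklein` with your computer's user name, and adjust the path if you saved `chromedriver` somewhere else. Note that the `executable_path` argument works in Selenium 3; Selenium 4 deprecated it, and later 4.x releases removed it in favor of passing a `Service` object from `selenium.webdriver.chrome.service`.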

In [53]:
# create a driver
driver = webdriver.Chrome(executable_path="/Users/joeklein/Downloads/chromedriver")

Pretty cool, right? 

REMEMBER to close your driver once you are done with it.

In [50]:
# close it
driver.close()
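One way to make sure the browser is always shut down, even when a step fails partway through, is to wrap the work in `try`/`finally`. The sketch below uses a hypothetical helper name, `fetch_page_source`, and expects an already-created driver like the one above. It calls `driver.quit()`, which ends the whole browser session, whereas `driver.close()` closes only the current window.

```python
def fetch_page_source(driver, url):
    """Load url in an existing Selenium driver and return the page's html.

    The driver is shut down afterwards, whether or not loading succeeded.
    """
    try:
        driver.get(url)
        return driver.page_source
    finally:
        # runs on success and on error alike, so no browser is left open
        driver.quit()

# usage (assumes a driver created as in the cell above):
# html = fetch_page_source(driver, "http://www.python.org")
```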

In [54]:
# let's boot it up, and visit a URL of our choice
driver = webdriver.Chrome(executable_path="/Users/joeklein/Downloads/chromedriver")
driver.get("http://www.python.org")

Awesome. Now we're getting somewhere: programmatically controlling our browser like a human.

In [55]:
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

What else can we do?

In [56]:
driver.back()

In [57]:
driver.forward()

In [58]:
driver.save_screenshot('save.png')

True

In [59]:
driver.close()

Alright, now let's return to our problem at hand. We need to visit the OpenTable listing for DC. 

Once there, we need to get the html to load.

In [60]:
# visit our OpenTable page
driver = webdriver.Chrome(executable_path="/Users/joeklein/Downloads/chromedriver")
driver.get("http://www.opentable.com/washington-dc-restaurant-listings")

It is always good to check that we've got the page we think we do.


In [61]:
driver.title

'Washington, D.C. Area Restaurants List | OpenTable'

In [64]:
# So we can check that we have the right page by using an assert statement

assert "OpenTable" in driver.title

Now, to resolve our JavaScript problem, there are a few things we can do. 

What I'll do in this case is request that the page load, then grab the source html from the page. 

**Because Selenium loads the page in a real browser, the page's JavaScript should run, and the content it generates should become part of the page source. I can then grab the page source.**

In [65]:
# visit our relevant page
driver = webdriver.Chrome(executable_path="/Users/joeklein/Downloads/chromedriver")
driver.get("http://www.opentable.com/washington-dc-restaurant-listings")

#grab the page source
html = driver.page_source
print(html)

<!DOCTYPE html><html xmlns="http://www.w3.org/1999/xhtml" lang="en" class="wf-sourcesanspro-n4-active wf-sourcesanspro-n6-active wf-active"><head><script src="https://rules.quantcount.com/rules-p-MnD7K1EKp7dQs.js"></script><script src="//bat.bing.com/bat.js" async=""></script><script type="text/javascript" async="" src="//opentbl.netmng.com/?aid=3109&amp;siclientid=&amp;p1=&amp;p2=search&amp;p3="></script><script type="text/javascript" src="//s.thebrighttag.com/tag?site=d0upCKz&amp;H=-3e9he69&amp;referrer=https%3A%2F%2Fwww.opentable.com%2Fwashington-dc-restaurant-listings&amp;mode=v2&amp;cf=4834639%2C5241253%2C5479134%2C5701738"></script><script src="https://connect.facebook.net/signals/config/725308910857169?v=2.7.25" async=""></script><script type="text/javascript" src="https://connect.facebook.net/en_US/fbevents.js"></script><script type="text/javascript" src="https://www.googleadservices.com/pagead/conversion_async.js"></script><script src="https://secure.quantserve.com/quant.js" a

Pop quiz: what do we need to do with this html?

In [66]:
html = BeautifulSoup(html, 'lxml')

In [67]:
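# note: without parentheses this displays the bound method rather than calling it; html.prettify() returns the indented string
html.prettify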

<bound method Tag.prettify of <!DOCTYPE html>
<html class="wf-sourcesanspro-n4-active wf-sourcesanspro-n6-active wf-active" lang="en" xmlns="http://www.w3.org/1999/xhtml"><head><script src="https://rules.quantcount.com/rules-p-MnD7K1EKp7dQs.js"></script><script async="" src="//bat.bing.com/bat.js"></script><script async="" src="//opentbl.netmng.com/?aid=3109&amp;siclientid=&amp;p1=&amp;p2=search&amp;p3=" type="text/javascript"></script><script src="//s.thebrighttag.com/tag?site=d0upCKz&amp;H=-3e9he69&amp;referrer=https%3A%2F%2Fwww.opentable.com%2Fwashington-dc-restaurant-listings&amp;mode=v2&amp;cf=4834639%2C5241253%2C5479134%2C5701738" type="text/javascript"></script><script async="" src="https://connect.facebook.net/signals/config/725308910857169?v=2.7.25"></script><script src="https://connect.facebook.net/en_US/fbevents.js" type="text/javascript"></script><script src="https://www.googleadservices.com/pagead/conversion_async.js" type="text/javascript"></script><script async="" src="h

Now, let's return to our earlier problem: how do we locate bookings on the page?

In [69]:
# print out the number of bookings for all restaurants

# I will use dictionary format this time. 

print(html.find_all('div', {'class': 'booking'}))

[<div class="booking"><span class="tadpole"></span>Booked 1 times today</div>, <div class="booking"><span class="tadpole"></span>Booked 199 times today</div>, <div class="booking"><span class="tadpole"></span>Booked 152 times today</div>, <div class="booking"><span class="tadpole"></span>Booked 271 times today</div>, <div class="booking"><span class="tadpole"></span>Booked 261 times today</div>, <div class="booking"><span class="tadpole"></span>Booked 585 times today</div>, <div class="booking"><span class="tadpole"></span>Booked 203 times today</div>, <div class="booking"><span class="tadpole"></span>Booked 93 times today</div>, <div class="booking"><span class="tadpole"></span>Booked 81 times today</div>, <div class="booking"><span class="tadpole"></span>Booked 53 times today</div>, <div class="booking"><span class="tadpole"></span>Booked 71 times today</div>, <div class="booking"><span class="tadpole"></span>Booked 66 times today</div>, <div class="booking"><span class="tadpole"></s

In [72]:
# now print out each booking for the listings using a loop
for booking in html.find_all('div', {'class':'booking'}):
    print(booking.text)

Booked 1 times today
Booked 199 times today
Booked 152 times today
Booked 271 times today
Booked 261 times today
Booked 585 times today
Booked 203 times today
Booked 93 times today
Booked 81 times today
Booked 53 times today
Booked 71 times today
Booked 66 times today
Booked 20 times today
Booked 28 times today
Booked 63 times today
Booked 43 times today
Booked 64 times today
Booked 76 times today
Booked 45 times today
Booked 42 times today
Booked 40 times today
Booked 34 times today
Booked 35 times today
Booked 27 times today
Booked 9 times today
Booked 38 times today
Booked 42 times today
Booked 29 times today
Booked 4 times today
Booked 31 times today
Booked 14 times today
Booked 31 times today
Booked 41 times today
Booked 15 times today
Booked 23 times today
Booked 11 times today
Booked 55 times today
Booked 20 times today
Booked 56 times today
Booked 15 times today
Booked 47 times today
Booked 18 times today
Booked 30 times today
Booked 31 times today
Booked 21 times today
Booked 

Let's grab just the text of each of these entries.

In [None]:
# do the same as above, but grabbing only the text content
for booking in html.find_all('div', {'class':'booking'}):
    print(booking.text)

We've succeeded!

But we can clean this up a little bit. We're going to use regular expressions (regex) to grab only the digits in each text entry.

The best way to get good at regex is to, well, just keep trying and testing: https://regex101.com/

In [84]:
# import regex
import re

regex = r"\d+"

match = []
for book in html.findAll('div', class_='booking'):
    matches = re.search(regex, book.text)
    match.append(matches)
#     matches = re.search(regex, book.text)
#     print(matches.group())



Given we haven't covered regex, I'll show you how to use the search function to match any given digit.

In [83]:
# for each entry, grab the text

# r"\d" matches a single digit
regex = r"\d"

for booking in html.find_all('div', {'class':'booking'}):
    # search returns only the first match, so this finds the first digit
    matches = re.search(regex, booking.text)
    # print if found
    if matches:
        print(matches.group())
    # otherwise pass
    else:
        pass

1
1
1
2
2
5
2
9
8
5
7
6
2
2
6
4
6
7
4
4
4
3
3
2
9
3
4
2
4
3
1
3
4
1
2
1
5
2
5
1
4
1
3
3
2
6
2
1
1
2
3
3
3
2
1
2
3
2
2
1
1
5
5
5
3
9
1
2
2
5
2
1
1
2
1
7
5
3
3
9
3
1
1
1
3
1
3
1
2
1
7
1
5
7
1
1
7
9
2
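The output above shows only the first digit of each count, because `r"\d"` matches exactly one digit and `re.search` stops at the first match. A short sketch of the difference, using one of the booking strings from earlier:

```python
import re

text = "Booked 199 times today"

# a single digit: search stops at the first one it finds
one_digit = re.search(r"\d", text).group()

# one or more consecutive digits: the whole number
whole_number = re.search(r"\d+", text).group()

# findall returns every separate match of the pattern
each_digit = re.findall(r"\d", text)

print(one_digit)          # 1
print(whole_number)       # 199
print(each_digit)         # ['1', '9', '9']
print(int(whole_number))  # 199, now as an integer
```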


Before we demonstrate all the other amazing things about browser automation, let's finish up collecting the data we want from this current example. 

Do you suppose the html parsing we wrote above will still work on the page source we've grabbed from our Selenium-driven browser?

To be most efficient, we want to do only a single loop over the entries on the page. That means we want to find the element that all of the other elements (name, location, price, bookings) are housed within. Where on the page is each entry located?

In [89]:
# print out all entries
for content in html.find_all('div', {'class':'result content-section-list-row cf with-times'}):
    print(content)

<div class="result content-section-list-row cf with-times" data-id="0" data-index="1" data-lat="38.8974400" data-lon="-77.1242130" data-offers="" data-rid="66715"><div class="rest-row with-image"> <div class="rest-row-image"> <a href="/ruffinos-arlington" onclick="OT.BestAnalytics.logRestaurantVisit(66715)" target="_blank"><img alt="photo of ruffino's - arlington restaurant" class="lazy rest-image loaded" data-src="//resizer.otstatic.com/v2/profiles/legacy/66715.jpg" src="//resizer.otstatic.com/v2/profiles/legacy/66715.jpg"/></a></div> <div class="rest-row-info"><div class="rest-row-header"> <a class="rest-row-name rest-name " href="/ruffinos-arlington" onclick="OT.BestAnalytics.logRestaurantVisit(66715)" target="_blank"> <span class="rest-row-name-text">Ruffino's - Arlington</span> </a> </div> <div class=" flex-row-justify"> <div class="rest-row-review"> <div class="star-rating review-container"><div class="star-wrapper small"><div class="all-stars"></div><div class="all-stars filled"

Look over the page. Does every single entry have each element we're seeking?
> I did this previously. I know for a fact that not every entry has a number of recent bookings. That's probably exactly why OpenTable loads this with JavaScript: they want to continuously update the number of bookings with the most up-to-date value.

In [87]:
# what happens when a booking is not available?
# print out each booking entry, using the identification code we wrote above
for content in html.find_all('div', {'class':'result content-section-list-row cf with-times'}):
    print(content.find('div', {'class':'booking'}))

<div class="booking"><span class="tadpole"></span>Booked 1 times today</div>
None
<div class="booking"><span class="tadpole"></span>Booked 199 times today</div>
<div class="booking"><span class="tadpole"></span>Booked 152 times today</div>
<div class="booking"><span class="tadpole"></span>Booked 271 times today</div>
<div class="booking"><span class="tadpole"></span>Booked 261 times today</div>
<div class="booking"><span class="tadpole"></span>Booked 585 times today</div>
<div class="booking"><span class="tadpole"></span>Booked 203 times today</div>
<div class="booking"><span class="tadpole"></span>Booked 93 times today</div>
<div class="booking"><span class="tadpole"></span>Booked 81 times today</div>
<div class="booking"><span class="tadpole"></span>Booked 53 times today</div>
<div class="booking"><span class="tadpole"></span>Booked 71 times today</div>
<div class="booking"><span class="tadpole"></span>Booked 66 times today</div>
<div class="booking"><span class="tadpole"></span>Book

In [90]:
for content in html.find_all('div', {'class':'result content-section-list-row cf with-times'}):
    print(content.find('div', {'class':'booking'}).text)

Booked 1 times today


AttributeError: 'NoneType' object has no attribute 'text'

**UH OH....** What happened?

What do you notice takes the place of the booking when it is not found?

Thus, we will use exceptions. Here's a demo:

In [91]:
# if we find the element we want, we print it. Otherwise, we print 0
for entry in html.find_all('div', {'class':'result content-section-list-row cf with-times'}):
    try:
        print(entry.find('div', {'class':'booking'}).text)
    # .find() returned None, and None has no .text attribute
    except AttributeError:
        print(0)

Booked 1 times today
0
Booked 199 times today
Booked 152 times today
Booked 271 times today
Booked 261 times today
Booked 585 times today
Booked 203 times today
Booked 93 times today
Booked 81 times today
Booked 53 times today
Booked 71 times today
Booked 66 times today
Booked 20 times today
Booked 28 times today
Booked 63 times today
Booked 43 times today
Booked 64 times today
Booked 76 times today
Booked 45 times today
Booked 42 times today
Booked 40 times today
Booked 34 times today
Booked 35 times today
Booked 27 times today
Booked 9 times today
Booked 38 times today
Booked 42 times today
Booked 29 times today
Booked 4 times today
Booked 31 times today
Booked 14 times today
Booked 31 times today
Booked 41 times today
Booked 15 times today
Booked 23 times today
Booked 11 times today
Booked 55 times today
Booked 20 times today
Booked 56 times today
Booked 15 times today
Booked 47 times today
Booked 18 times today
Booked 30 times today
Booked 31 times today
Booked 21 times today
Booke
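An alternative to catching the exception is to check for `None` explicitly before touching `.text`. The sketch below uses a hypothetical helper, `bookings_from`, and simple stand-in objects in place of real BeautifulSoup tags, so it only illustrates the control flow:

```python
import re
from types import SimpleNamespace

def bookings_from(tag):
    """Return the booking count as an int, or 0 when the tag is missing."""
    if tag is None:   # .find() returns None when nothing matches
        return 0
    match = re.search(r"\d+", tag.text)
    return int(match.group()) if match else 0

# stand-ins for what entry.find('div', {'class': 'booking'}) might return
found = SimpleNamespace(text="Booked 199 times today")
missing = None

print(bookings_from(found))    # 199
print(bookings_from(missing))  # 0
```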

From previously completing this, I know all other elements WILL be returned. That means we do not have to create exceptions for them.


Loop through each entry. For each entry, grab the relevant information we want (name, location, price, bookings). Produce a dataframe with the columns "name","location","price","bookings" that contains the 100 entries we would like.

In [97]:
# I'm going to create my empty df first
import pandas as pd
dc_eats = pd.DataFrame(columns=["name","location","price","bookings"])

**Check:** What is my for-loop doing?

In [98]:

# loop through each entry
for entry in html.find_all('div', {'class':'result content-section-list-row cf with-times'}):
    # grab the name
    name = entry.find('span', {'class': 'rest-row-name-text'}).text
    # grab the location
    location = entry.find('span', {'class': 'rest-row-meta--location rest-row-meta-text'}).text
    # grab the price (count the dollar signs)
    try:
        price = entry.find('i').text.count('$')
    except AttributeError:
        price = "Didn't find price"
    # default to 0 so a missing booking never reuses the previous restaurant's value
    bookings = 0
    # try to find the number of bookings
    try:
        temp = entry.find('div', {'class':'booking'}).text
        match = re.search(r'\d+', temp)
        if match:
            bookings = int(match.group())
    except AttributeError:
        # no booking element for this entry, keep the default
        pass
    # add to df
    dc_eats.loc[len(dc_eats)]=[name, location, price, bookings]
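The loop above appends to the dataframe one row at a time. Another common pattern is to collect one dict per entry and build the table at the end. The sketch below runs on a hypothetical two-entry snippet that mimics the class names seen on the page (the real markup is more deeply nested). It assumes `bs4` is installed and uses the built-in `html.parser`.

```python
import re
from bs4 import BeautifulSoup

# hypothetical, simplified markup; the second entry has no booking element
sample = """
<div class="result content-section-list-row cf with-times">
  <span class="rest-row-name-text">Ruffino's - Arlington</span>
  <span class="rest-row-meta--location rest-row-meta-text">Arlington</span>
  <div class="rest-row-pricing"><i>  $    $      </i></div>
  <div class="booking"><span class="tadpole"></span>Booked 1 times today</div>
</div>
<div class="result content-section-list-row cf with-times">
  <span class="rest-row-name-text">Joe's Place Pizza and Pasta</span>
  <span class="rest-row-meta--location rest-row-meta-text">Arlington</span>
  <div class="rest-row-pricing"><i>  $    $      </i></div>
</div>
"""
page = BeautifulSoup(sample, "html.parser")

rows = []
for entry in page.find_all("div", class_="result"):
    booking = entry.find("div", class_="booking")
    match = re.search(r"\d+", booking.text) if booking else None
    rows.append({
        "name": entry.find("span", class_="rest-row-name-text").text,
        "location": entry.find("span", class_="rest-row-meta--location").text,
        "price": entry.find("i").text.count("$"),
        "bookings": int(match.group()) if match else 0,
    })

for row in rows:
    print(row)
```

A list of dicts like `rows` can be passed straight to `pd.DataFrame(rows)`, which avoids growing the dataframe inside the loop.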

In [99]:
# check out our work
dc_eats

|   | name | location | price | bookings |
|---|---|---|---|---|
| 0 | Ruffino's - Arlington | Arlington | 2 | 1 |
| 1 | Joe's Place Pizza and Pasta | Arlington | 2 | 0 |
| 2 | Rasika West End | West End | 3 | 199 |
| 3 | Blue Duck Tavern | West End | 3 | 152 |
| 4 | Farmers Fishers Bakers | Georgetown | 2 | 271 |
| 5 | Founding Farmers - Tysons | Tysons Corner / McLean | 2 | 261 |
| 6 | Founding Farmers - DC | Foggy Bottom | 2 | 585 |
| 7 | Filomena Ristorante | Georgetown | 3 | 203 |
| 8 | District Commons | Foggy Bottom | 2 | 93 |
| 9 | Ambar - Arlington | Arlington | 2 | 81 |

Awesome! We succeeded.

Now, let's explore some of the other functionality of a webdriver. We've barely scratched the surface.

In [None]:
# we can send keys as well
# import
from selenium.webdriver.common.keys import Keys

In [None]:
# open Chrome
driver = webdriver.Chrome(executable_path="/Users/joeklein/Downloads/chromedriver")

In [None]:
# visit Python
driver.get("http://www.python.org")
# verify we're in the right place
assert "Python" in driver.title

In [None]:
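# find the search box (find_element_by_name was removed in Selenium 4; By locators work in both old and new versions)
from selenium.webdriver.common.by import By
elem = driver.find_element(By.NAME, "q")
# clear it
elem.clear()
# type in pycon
elem.send_keys("pycon")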


In [None]:
# send those keys
elem.send_keys(Keys.RETURN)
# check that the search returned results
assert "No results found." not in driver.page_source

In [None]:
# close
driver.close()

In [None]:
# all at once:
from selenium.webdriver.common.by import By
driver = webdriver.Chrome(executable_path="/Users/joeklein/Downloads/chromedriver")
driver.get("http://www.python.org")
assert "Python" in driver.title
# find_element_by_name was removed in Selenium 4; By locators work in both old and new versions
elem = driver.find_element(By.NAME, "q")
elem.clear()
elem.send_keys("pycon")
elem.send_keys(Keys.RETURN)
assert "No results found." not in driver.page_source
driver.close()

The above example (and many others) is available in the Selenium docs: http://selenium-python.readthedocs.io/getting-started.html

What is especially important is exploring functionality like locating elements: http://selenium-python.readthedocs.io/locating-elements.html#locating-elements

FAQ:
http://selenium-python.readthedocs.io/faq.html