<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Web Scraping

_Author: Joseph Nelson (DC)_

---

## Before Class

### Install Selenium

Students will need to install Selenium using one of the following:
- **Anaconda:** `conda install -c conda-forge selenium`
- **pip:** `pip install selenium`

## Learning Objectives
- Revisit how to locate elements on a webpage
- Acquire unstructured data from the internet using Beautiful Soup
- Discuss the limitations of plain HTTP libraries such as requests and urllib
- Introduce Selenium as a solution, and implement a scraper using Selenium

## Lesson Guide

- [Introduction](#intro)
- [Building a web scraper](#building-scraper)
- [Retrieving data from the HTML page](#retrieving-data)
    - [Retrieving the restaurant names](#retrieving-names)
    - [Challenge: Retrieving the restaurant locations](#retrieving-locations)
    - [Retrieving the restaurant prices](#retrieving-prices)
    - [Retrieving the restaurant number of bookings](#retrieving-bookings)


- [Introducing Selenium](#selenium)
    - [Running JavaScript before scraping](#selenium-js)
    - [Using regex to only get digits](#selenium-regex)
    - [Challenge: Use Pandas to create a DataFrame of bookings](#challenge-pandas)
    - [Auto-typing using Selenium](#selenium-typing)


- [Summary](#summary)

<a id="intro"></a>
## Introduction

In this codealong lesson, we'll build a web scraper using requests and Beautiful Soup. We will also explore how to use Selenium, a tool that automates a real web browser.

We'll begin by scraping OpenTable's DC listings. We're interested in knowing the restaurant's **name, location, price, and how many people booked it today.**

OpenTable provides all of this information on this page: http://www.opentable.com/washington-dc-restaurant-listings

Let's inspect the elements of this page to ensure we can find each piece of information we're interested in.

---

<a id="building-scraper"></a>
## Building a web scraper

Now, let's build a web scraper for OpenTable using requests and Beautiful Soup:

In [1]:
# import our necessary first packages
from bs4 import BeautifulSoup
import requests

In [2]:
# set the url we want to visit
url = "http://www.opentable.com/washington-dc-restaurant-listings"

# visit that url, and grab the html of said page
html = requests.get(url)

At this point, what is in html?

In [4]:
# .text returns the response content as a Unicode string
html.text[:1000]

'           <!DOCTYPE html><html lang="en"><head><meta charset="utf-8"/><meta http-equiv="X-UA-Compatible" content="IE=9; IE=8; IE=7; IE=EDGE"/> <title>Restaurant Reservation Availability</title>    <meta  name="robots" content="noindex" > </meta><link  rel="canonical" href="https://www.opentable.com/washington-dc-restaurant-listings" > </link>      <link rel="shortcut icon" href="//components.otstatic.com/components/favicon/1.0.5/favicon/favicon.ico" type="image/x-icon"/><link rel="icon" href="//components.otstatic.com/components/favicon/1.0.5/favicon/favicon-16.png" sizes="16x16"/><link rel="icon" href="//components.otstatic.com/components/favicon/1.0.5/favicon/favicon-32.png" sizes="32x32"/><link rel="icon" href="//components.otstatic.com/components/favicon/1.0.5/favicon/favicon-48.png" sizes="48x48"/><link rel="icon" href="//components.otstatic.com/components/favicon/1.0.5/favicon/favicon-64.png" sizes="64x64"/><link rel="icon" href="//components.otstatic.com/components/favicon/1.0

We will need to convert this HTML text into a soup object so we can parse it using Python and Beautiful Soup (BS4).

In [9]:
# convert this into a soup object
soup = BeautifulSoup(html.text, 'html.parser')

<a id="retrieving-data"></a>
### Retrieving data from the HTML page

Let's first find each restaurant name listed on the page we've loaded. How do we locate a restaurant name within the page? (Hint: We need to know where in the **HTML** the restaurant element is housed.) To find the HTML that renders the restaurant name, we can use Google Chrome's Inspect tool:

> http://www.opentable.com/washington-dc-restaurant-listings

> 1. Visit the URL above. 

> 2. Right-click on an element you are interested in, then choose Inspect (in Chrome). 

> 3. This will open the Developer Tools and show the HTML used to render the selected page element. 

> Throughout this lesson, we will use this method to find tags associated with elements of the page we want to scrape.

See if you can find the restaurant name on the page. Keep in mind there are many restaurants loaded on the page.

In [10]:
# print the restaurant names
soup.find_all(name='span', attrs={'class':'rest-row-name-text'})

[<span class="rest-row-name-text">Path</span>,
 <span class="rest-row-name-text">487 Larkin</span>,
 <span class="rest-row-name-text">Fay</span>,
 <span class="rest-row-name-text">481 Rutherford</span>,
 <span class="rest-row-name-text">Curve</span>,
 <span class="rest-row-name-text">Aut</span>,
 <span class="rest-row-name-text">Est</span>,
 <span class="rest-row-name-text">292 Kris</span>,
 <span class="rest-row-name-text">Eveniet</span>,
 <span class="rest-row-name-text">Winstons</span>,
 <span class="rest-row-name-text">Aliquid Light</span>,
 <span class="rest-row-name-text">Vista</span>,
 <span class="rest-row-name-text">In</span>,
 <span class="rest-row-name-text">Muller Point</span>,
 <span class="rest-row-name-text">Assumenda Motorway</span>,
 <span class="rest-row-name-text">Vallies</span>,
 <span class="rest-row-name-text">Iure Falls</span>,
 <span class="rest-row-name-text">Netties</span>,
 <span class="rest-row-name-text">Newtons</span>,
 <span class="rest-row-name-text">114

In [17]:
str(soup.find_all(name='span', attrs={'class':'rest-row-name-text'})[0])

'<span class="rest-row-name-text">Path</span>'

It is important to always keep in mind the data types that were returned. Note this is a `list`, and we know that immediately by observing the outer square brackets and commas separating each tag.

Next, note the elements of the list are `Tag` objects, not strings. (If they were strings, they would be surrounded by quotes.) The Beautiful Soup authors chose to display a `Tag` object visually as a text representation of the tag and its contents. However, being an object, it has many attributes and methods that we can use. For example, next we will use the `.text` attribute to return the tag's text content as a Python string.
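
To make the difference between a `Tag` and a string concrete, here is a minimal sketch on a small hand-written HTML snippet. The snippet is an assumption for illustration (it only reuses the class name seen above and is not OpenTable's real markup), so it runs without visiting the site.

```python
from bs4 import BeautifulSoup

# Hand-written snippet for illustration; it reuses the class name seen above.
sample_html = """
<div>
  <span class="rest-row-name-text">Path</span>
  <span class="rest-row-name-text">487 Larkin</span>
</div>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")
tags = sample_soup.find_all(name="span", attrs={"class": "rest-row-name-text"})

# find_all returns a list-like collection of Tag objects, not strings
first = tags[0]
print(type(first).__name__)  # the class name of the object
print(str(first))            # the tag rendered as HTML text
print(first.text)            # only the text inside the tag

# a list comprehension collects the clean text of every tag
names = [tag.text for tag in tags]
```

Calling `str()` on a tag gives back the HTML, while `.text` drops the markup and keeps only the contents.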

<a id="retrieving-names"></a>
#### Retrieving the restaurant names

Now that we've found a list of tags containing the restaurant names, let's think about how we can loop through them one by one. In the following cell, we'll print out the name (and **only** the clean name, not the rest of the HTML) of each restaurant.

In [22]:
# for each element you find, print out the restaurant name
for el in soup.find_all(name='span', attrs={'class':'rest-row-name-text'}):
    print(el.text)

Path
487 Larkin
Fay
481 Rutherford
Curve
Aut
Est
292 Kris
Eveniet
Winstons
Aliquid Light
Vista
In
Muller Point
Assumenda Motorway
Vallies
Iure Falls
Netties
Newtons
1144 Marquardt
894 Gutkowski
Eaque
Sterlings
Grimes
Emilys
Osborne Gottlieb
Laudantium
Hipolitos
Nemo Expressway
183 Fadel
Joesphs
Elroy Purdy
O'Connell
Quos
Debitis Aufderhar
Torphy
Necessitatibus Fay
Lynch
Delectus Reilly
Jesss
Dolor Summit
Mueller
Quaerat
Erdman
Ea
Simonis
Izaiahs
Quasi Blick
Lindgren
Rosemary Shoals
Blanditiis Turnpike
1086 Buckridge
Macs
Lempi Cape
Wuckert
Et
Facilis Russel
Conn
Aut Simonis
Collier
Itaque
Denesik Spurs
Odio
Ab Kunde
Hilton Locks
Hortense Mountain
Pacocha Stream
Dibbert
Camryn Heights
Agloe Bar & Grill
Ducimus
496 Effertz
Vero
Charles Cassin
660 Kuhlman
23 Heaney
Vero Pfannerstill
1258 O'Hara
Et Cruickshank
Labore Summit
Dooley
Quaerat Tunnel
Parker
Cum
Sipes
Madisyn Hickle
Dayana Upton
Aut Stracke
Devonte Wuckert
Facere Frami
Parks
Natus
Schneider
Friesen Plains
Keshauns
Rylee Kling
Es

Great!

<a id="retrieving-locations"></a>
#### Challenge: Retrieving the restaurant locations

Can you repeat that process for finding the location? For example, barmini by Jose Andres is in the location listed as "Penn Quarter" in our search results.

In [None]:
# first, see if you can identify the location for all elements -- print it out

In [23]:
# now print out EACH location for the restaurants
# for each element you find, print out the restaurant location
for el in soup.find_all(name='span', attrs={'class':'rest-row-meta--location'}):
    print(el.text)

Watsonburgh
Dooleyside
Syblebury
Marcellusland
Rodriguezport
Kuhicview
Olsonstad
New Bernadinehaven
East Tito
South Jimmie
West Mariellemouth
East Ethaburgh
Handland
Brandonville
Ottiliemouth
Halborough
East Neha
New Josianne
North Freddie
Marksside
Maritzaborough
West Nedraview
Port Anabelport
Laurianneberg
Josianeborough
North Rachellemouth
Port Roberta
Lueilwitzland
Adrianmouth
South Olafville
North Stanford
Annebury
Port Jaquelinstad
Felicityview
Sethfurt
Lailaton
South Reagan
New Cheyanne
Port Wilburn
Lake Jeanieview
New Sarai
West Markshire
South Louisaland
Lake Lindsey
Grantfurt
Dickinsontown
Harrismouth
Emiliaville
Wittingbury
South Samanta
East Robertochester
New Meta
South Nataliaborough
East Greenburgh
West Minnie
Port Zelmatown
East Kylee
East Jorgeborough
Evelineton
Wardville
New Herminia
East Simoneborough
Port Zolatown
Ciarahaven
Uptonborough
Destineyville
South Yessenia
Bashirianfort
Borermouth
Beattyport
Hobartborough
Murrayburgh
Port Hallieshire
Eleanoraport
New Jacey

<a id="retrieving-prices"></a>
#### Retrieving the restaurant prices

Ok, we've figured out the restaurant name and location. Now we need to grab the price (number of dollar signs on a scale of one to four) for each restaurant. We'll follow the same process.

In [24]:
# print out all prices
soup.find_all('div', {'class':'rest-row-pricing'})

[<div class="rest-row-pricing"> <i class="pricing--the-price">  $    $    $    $  </i> <span class="pricing--not-the-price"> </span></div>,
 <div class="rest-row-pricing"> <i class="pricing--the-price">  $    $      </i> <span class="pricing--not-the-price">  $    $      </span></div>,
 <div class="rest-row-pricing"> <i class="pricing--the-price">  $    $    $    $  </i> <span class="pricing--not-the-price"> </span></div>,
 <div class="rest-row-pricing"> <i class="pricing--the-price">  $    $    $    $  </i> <span class="pricing--not-the-price"> </span></div>,
 <div class="rest-row-pricing"> <i class="pricing--the-price">  $    $      </i> <span class="pricing--not-the-price">  $    $      </span></div>,
 <div class="rest-row-pricing"> <i class="pricing--the-price">  $    $      </i> <span class="pricing--not-the-price">  $    $      </span></div>,
 <div class="rest-row-pricing"> <i class="pricing--the-price">  $    $    $    $  </i> <span class="pricing--not-the-price"> </span></div>,

In [None]:
# print out EACH number of dollar signs per restaurant
# this one is trickier to eliminate the html. Hint: try a nested find
for entry in soup.find_all('div', {'class':'rest-row-pricing'}):
    print(entry.find('i').text)

That looks great, but what if I wanted just the number of dollar signs per restaurant? Can you figure out a way to simply print out the number of dollar signs per restaurant listed?

In [25]:
# print the number of dollar signs per restaurant
for entry in soup.find_all('div', {'class':'rest-row-pricing'}):
    print(entry.find('i').text.count('$'))

4
2
4
4
2
2
4
3
4
4
2
3
2
3
3
4
4
2
4
4
2
3
4
3
3
4
3
2
3
2
2
2
2
4
2
2
2
2
3
3
3
2
2
2
4
2
2
2
4
3
3
2
4
2
3
3
3
3
3
4
4
4
4
2
3
4
3
3
4
2
2
4
3
3
4
2
4
3
2
2
3
4
2
3
2
4
2
2
3
4
4
4
4
4
4
4
2
3
2
3


Phew, nice work. 

<a id="retrieving-bookings"></a>
#### Retrieving the restaurant number of bookings

One more, right? We only need to find the number of times a restaurant was booked. In the next cell, print out all objects that contain the number of times the restaurant was booked.

In [26]:
# print out all objects that contain the number of times the restaurant was booked
soup.find_all('div', {'class':'booking'})

[]

That's weird -- an empty list. Did we find the wrong element? What's going on here? Discuss.

How can we debug this? Any ideas?

In [None]:
# let's first try printing out all 'div' objects
#  NOTE: This is too many objects to store in this notebook!
#        So, the code below is commented out; uncomment it to run.

# for entry in soup.find_all('div'):
#     print(entry)

I still don't see it. Let's search our entire soup object:

In [None]:
# print out soup, then use Cmd+F (or Ctrl+F) to search for "booked ".
#   Uncomment the below to run.

# soup

What do you notice? Why is this happening?

<a id="selenium"></a>
## Introducing Selenium

Selenium is a browser automation tool. It drives a real browser, so JavaScript on the page is rendered just as it would be for a human visitor.

In [27]:
# import the webdriver
# (install first with: conda install -c conda-forge selenium)
from selenium import webdriver

In [None]:
# STOP
# what is going to happen when I run the next cell?

In [30]:
# create a driver that will execute in Google Chrome
driver = webdriver.Chrome(executable_path='./chromedriver/chromedriver')

Pretty crazy, right? Let's close that driver.

In [33]:
# close it
driver.close()

In [32]:
# let's boot it up, and visit a URL of our choice
driver = webdriver.Chrome(executable_path='./chromedriver/chromedriver')
driver.get("http://www.python.org")

Awesome. Now we're getting somewhere: programmatically controlling our browser like a human.

Let's return to our problem at hand. We need to visit the OpenTable listing for DC. Once there, we need to get the html to load. In the next cell, prove you can programmatically visit the page.

In [34]:
# visit our OpenTable page
driver = webdriver.Chrome(executable_path='./chromedriver/chromedriver')
driver.get("http://www.opentable.com/washington-dc-restaurant-listings")

# always good to check we've got the page we think we do
assert "OpenTable" in driver.title

In [35]:
driver.title

'Washington, D.C. Area Restaurants List | OpenTable'

In [36]:
driver.close()

<a id="selenium-js"></a>
### Running JavaScript before scraping

Now, to resolve our JavaScript problem, there are a few things we can do. In this case, I'll request the page, wait one second, and then grab the source HTML. Because the page is loaded in a real browser, its JavaScript runs, and the content it renders becomes part of the page source that I can then grab.

In [47]:
# the driver was already closed in an earlier cell, so closing it again raises an error
driver.close()

WebDriverException: Message: no such session
  (Driver info: chromedriver=2.41.578706 (5f725d1b4f0a4acbf5259df887244095596231db),platform=Mac OS X 10.14.0 x86_64)


In [37]:
# import sleep
from time import sleep

In [48]:
# visit our relevant page
driver = webdriver.Chrome(executable_path='./chromedriver/chromedriver')
driver.get("http://www.opentable.com/washington-dc-restaurant-listings")

# wait one second
sleep(1)

#grab the page source
html = driver.page_source

In [49]:
html[0:500]

'<!DOCTYPE html><html xmlns="http://www.w3.org/1999/xhtml" lang="en"><head><script type="text/javascript" src="//static.ads-twitter.com/uwt.js"></script><script src="https://connect.facebook.net/signals/config/725308910857169?v=2.8.34&amp;r=stable" async=""></script><script type="text/javascript" src="https://connect.facebook.net/en_US/fbevents.js"></script><script src="//bat.bing.com/bat.js" async=""></script><script type="text/javascript" async="" src="https://www.googleadservices.com/pagead/co'

**Pop Quiz:** What do we need to do with this HTML?

In [50]:
# BeautifulSoup it!
html = BeautifulSoup(html, "lxml")

Now, let's return to our earlier problem: How do we locate bookings on the page?

In [52]:
# print out the number of booking elements found across all restaurants
len(html.find_all('div', {'class':'booking'}))

98

In [53]:
len(soup.find_all(name='span', attrs={'class':'rest-row-meta--location'}))

100

In [54]:
# now print out each booking for the listings using a loop
for entry in html.find_all('div', {'class':'booking'}):
    print(entry)

<div class="booking"><span class="tadpole"></span>Booked 415 times today</div>
<div class="booking"><span class="tadpole"></span>Booked 172 times today</div>
<div class="booking"><span class="tadpole"></span>Booked 267 times today</div>
<div class="booking"><span class="tadpole"></span>Booked 72 times today</div>
<div class="booking"><span class="tadpole"></span>Booked 124 times today</div>
<div class="booking"><span class="tadpole"></span>Booked 38 times today</div>
<div class="booking"><span class="tadpole"></span>Booked 101 times today</div>
<div class="booking"><span class="tadpole"></span>Booked 40 times today</div>
<div class="booking"><span class="tadpole"></span>Booked 67 times today</div>
<div class="booking"><span class="tadpole"></span>Booked 53 times today</div>
<div class="booking"><span class="tadpole"></span>Booked 36 times today</div>
<div class="booking"><span class="tadpole"></span>Booked 51 times today</div>
<div class="booking"><span class="tadpole"></span>Booked 33

Let's grab just the text of each of these entries.

In [None]:
# do the same as above, but grabbing only the text content
for entry in html.find_all('div', {'class':'booking'}):
    print(entry.text)

In [55]:
driver.close()

We've succeeded!

<a id="selenium-regex"></a>
### Using regex to only get digits

But we can clean this up a little bit. We're going to use regular expressions (regex) to grab only the digits in each piece of text.

The best way to get good at regex is to, well, just keep trying and testing: http://pythex.org/

In [57]:
# import regex
import re

Given we haven't covered regex, I'll show you how to use the `search` function to match a run of digits.
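
As a warm-up on plain strings, the sketch below shows that `re.search` returns a match object when the pattern is found and `None` otherwise. The second string is made up for illustration.

```python
import re

# \d+ matches one or more consecutive digits; the r prefix keeps the backslash literal
pattern = r"\d+"

with_digits = re.search(pattern, "Booked 415 times today")
without_digits = re.search(pattern, "No bookings yet")  # made-up string

print(with_digits.group())  # the matched text, as a string
print(without_digits)       # None when nothing matches

# convert to an integer only when a match exists
count = int(with_digits.group()) if with_digits else None
```

Because a failed search returns `None`, the result should always be checked before calling `.group()` on it.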

In [58]:
# for each entry, grab the text
for booking in html.find_all('div', {'class':'booking'}):
    # match the first run of digits (raw string so that \d is passed to the regex engine as-is)
    match = re.search(r'\d+', booking.text)
    
    if match:
        # print if found
        print(match.group())
    else:
        # otherwise pass
        pass

415
172
267
72
124
38
101
40
67
53
36
51
33
40
47
59
38
54
33
47
54
42
38
40
27
27
33
66
109
77
8
73
10
50
5
26
16
34
46
77
22
15
33
51
32
35
78
7
22
32
32
1
3
17
10
24
10
15
38
21
31
20
17
9
30
4
27
23
8
27
26
66
18
76
37
11
20
13
9
3
1
26
40
9
11
24
10
20
15
10
12
34
22
6
5
22
16
13


Before we demonstrate the other things browser automation can do, let's finish collecting the data we want from this example. Do you suppose the HTML parsing we wrote above will still work on the page source we grabbed through Selenium?

To be most efficient, we want only a single loop over the entries on the page. That means we need to find the element that houses all of the other elements (name, location, price, bookings). Where on the page is each entry located?

In [None]:
# print out all entries
#   NOTE: Has many entries. Uncomment the below code to run it!

# soup.find_all('div', {'class':'result content-section-list-row cf with-times'})

Look over the page. Does every single entry have each element we're seeking?
> I did this previously, and I know for a fact that not every entry has a number of recent bookings. That's probably exactly why OpenTable renders this with JavaScript: they want to continuously update the number of bookings with the most current value.

In [61]:
# What happens when a booking is not available?
# Print out each booking entry, using the identification code we wrote above

for entry in html.find_all('div', {'class':'result content-section-list-row cf with-times'}):
    print(entry.find('div', {'class':'booking'}))

None
None
<div class="booking"><span class="tadpole"></span>Booked 415 times today</div>
<div class="booking"><span class="tadpole"></span>Booked 172 times today</div>
<div class="booking"><span class="tadpole"></span>Booked 267 times today</div>
<div class="booking"><span class="tadpole"></span>Booked 72 times today</div>
<div class="booking"><span class="tadpole"></span>Booked 124 times today</div>
<div class="booking"><span class="tadpole"></span>Booked 38 times today</div>
<div class="booking"><span class="tadpole"></span>Booked 101 times today</div>
<div class="booking"><span class="tadpole"></span>Booked 40 times today</div>
<div class="booking"><span class="tadpole"></span>Booked 67 times today</div>
<div class="booking"><span class="tadpole"></span>Booked 53 times today</div>
<div class="booking"><span class="tadpole"></span>Booked 36 times today</div>
<div class="booking"><span class="tadpole"></span>Booked 51 times today</div>
<div class="booking"><span class="tadpole"></span

In [62]:
# We want to only retrieve the text of the bookings.
# But what would happen if we just naively print the text of each node?
for entry in html.find_all('div', {'class':'result content-section-list-row cf with-times'}):
    print(entry.find('div', {'class':'booking'}).text)   # try adding .text

AttributeError: 'NoneType' object has no attribute 'text'

What do you notice takes the place of the tag when a booking is not found?

We could use exception handling (`try`/`except` blocks) to resolve this. However, exceptions should only be used to handle rare or unexpected errors -- never for normal program flow.

In this case, we expect that some entries will have no bookings. So, we can just use an `if` statement that tests whether a booking `div` is present; if not, display `'ZERO'`. Here's a demo:

In [65]:
# if we find the element we want, we print it. Otherwise, we print 'ZERO'
booking_col = []
for entry in html.find_all('div', {'class':'result content-section-list-row cf with-times'}):
    booking_tag = entry.find('div', {'class':'booking'})
    
    if booking_tag:
        print(booking_tag.text)
    else:
        print('ZERO')

ZERO
ZERO
Booked 415 times today
Booked 172 times today
Booked 267 times today
Booked 72 times today
Booked 124 times today
Booked 38 times today
Booked 101 times today
Booked 40 times today
Booked 67 times today
Booked 53 times today
Booked 36 times today
Booked 51 times today
Booked 33 times today
Booked 40 times today
Booked 47 times today
Booked 59 times today
Booked 38 times today
Booked 54 times today
Booked 33 times today
Booked 47 times today
Booked 54 times today
Booked 42 times today
Booked 38 times today
Booked 40 times today
Booked 27 times today
Booked 27 times today
Booked 33 times today
Booked 66 times today
Booked 109 times today
Booked 77 times today
Booked 8 times today
Booked 73 times today
Booked 10 times today
Booked 50 times today
Booked 5 times today
Booked 26 times today
Booked 16 times today
Booked 34 times today
Booked 46 times today
Booked 77 times today
Booked 22 times today
Booked 15 times today
Booked 33 times today
Booked 51 times today
Booked 32 times to

Having worked through this page before, we know that every other element (name, location, price) is always present. This means only the bookings need this kind of check.

<a id="challenge-pandas"></a>
### Challenge: Use Pandas to create a DataFrame of bookings

However, the onus is on you to now put all the pieces together.

Loop through each entry. For each entry, grab the relevant information we want (name, location, price, bookings). Produce a dataframe with the columns "name","location","price","bookings" that contains the 100 entries we would like.

In [66]:
import pandas as pd

In [67]:
# I'm going to create my empty df first
dc_eats = pd.DataFrame(columns=["name","location","price","bookings"])

In [None]:
# visit our relevant page
driver = webdriver.Chrome(executable_path='./chromedriver/chromedriver')
driver.get("http://www.opentable.com/washington-dc-restaurant-listings")

# wait one second
sleep(1)

#grab the page source
html = driver.page_source

In [101]:
# Populate the DataFrame using Selenium and BeautifulSoup.
# If an entry has no booking count, record 'NA' for it.
from selenium.webdriver.common.by import By

driver = webdriver.Chrome(executable_path='./chromedriver/chromedriver')
driver.get("http://www.opentable.com/washington-dc-restaurant-listings")

rows = []

# iterate over 10 pages
for page in range(10):
    html = BeautifulSoup(driver.page_source, 'lxml')
    for entry in html.find_all('div', {'class':'result content-section-list-row cf with-times'}):
        # grab the name
        name = entry.find('span', {'class':'rest-row-name-text'}).text

        # grab the location
        location = entry.find('span', {'class':'rest-row-meta--location rest-row-meta-text'}).text

        # grab the price (number of dollar signs)
        price = entry.find('div', {'class':'rest-row-pricing'}).find('i').text.count('$')

        # try to find the number of bookings
        bookings = 'NA'
        booking_tag = entry.find('div', {'class':'booking'})
        if booking_tag:
            match = re.search(r'\d+', booking_tag.text)
            if match:
                bookings = match.group()

        rows.append({'name': name, 'location': location, 'price': price, 'bookings': bookings})

    # go to the next page: find the "Next" link and click it
    elem = driver.find_element(By.LINK_TEXT, 'Next')
    elem.click()
    sleep(2)
driver.close()

# DataFrame.append was removed in pandas 2.0, so build the frame from the collected rows
dc_eats = pd.DataFrame(rows, columns=["name", "location", "price", "bookings"])

In [105]:
# check out our work
dc_eats[100:200]

| | name | location | price | bookings |
|---|---|---|---|---|
| 100 | Ruffino's - Arlington | Arlington | 2 | |
| 101 | Joe's Place Pizza and Pasta | Arlington | 2 | |
| 102 | Founding Farmers - DC | Foggy Bottom | 2 | 415 |
| 103 | Filomena Ristorante | Georgetown | 3 | 169 |
| 104 | Farmers Fishers Bakers | Georgetown | 2 | 264 |
| 105 | Ambar - Arlington | Arlington | 2 | 72 |
| 106 | Rasika West End | West End | 3 | 126 |
| 107 | Gyu-Kaku - Arlington | Arlington | 2 | 42 |
| 108 | Blue Duck Tavern | West End | 3 | 102 |
| 109 | BlackSalt | Palisades Northwest | 3 | 43 |

Awesome! We succeeded.
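
The parsing logic can also be exercised without a browser, which is handy when the live page changes or is unavailable. The sketch below runs the same name, location, price, and bookings extraction on a hand-written snippet. The markup, the placeholder restaurant values, and the helper name `parse_entries` are assumptions for illustration; only the class names come from this lesson.

```python
import re

import pandas as pd
from bs4 import BeautifulSoup

# Hand-written markup that imitates the class names used in this lesson.
sample_html = """
<div class="result content-section-list-row cf with-times">
  <span class="rest-row-name-text">Restaurant A</span>
  <span class="rest-row-meta--location rest-row-meta-text">Neighborhood A</span>
  <div class="rest-row-pricing"><i class="pricing--the-price"> $ $ $ $ </i></div>
  <div class="booking"><span class="tadpole"></span>Booked 415 times today</div>
</div>
<div class="result content-section-list-row cf with-times">
  <span class="rest-row-name-text">Restaurant B</span>
  <span class="rest-row-meta--location rest-row-meta-text">Neighborhood B</span>
  <div class="rest-row-pricing"><i class="pricing--the-price"> $ $ </i></div>
</div>
"""


def parse_entries(page_html):
    """Hypothetical helper: return one dict per restaurant entry."""
    rows = []
    page = BeautifulSoup(page_html, "html.parser")
    entry_class = "result content-section-list-row cf with-times"
    for entry in page.find_all("div", {"class": entry_class}):
        # the booking div may be missing, so check before reading its text
        booking_tag = entry.find("div", {"class": "booking"})
        match = re.search(r"\d+", booking_tag.text) if booking_tag else None
        rows.append({
            "name": entry.find("span", {"class": "rest-row-name-text"}).text,
            "location": entry.find("span", {"class": "rest-row-meta--location"}).text,
            "price": entry.find("div", {"class": "rest-row-pricing"}).find("i").text.count("$"),
            "bookings": int(match.group()) if match else None,
        })
    return rows


rows = parse_entries(sample_html)
sample_df = pd.DataFrame(rows, columns=["name", "location", "price", "bookings"])
print(sample_df)
```

Here a missing booking is stored as `None`, which pandas shows as `NaN`, rather than the string `'NA'`; that keeps the column numeric, at the cost of the counts becoming floats.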

<a id="selenium-typing"></a>
### Auto-typing using Selenium

Now, let's explore some of the other functionality of a webdriver. We've barely scratched the surface.

In [7]:
# we can send keys as well

from selenium.webdriver.common.keys import Keys
from selenium import webdriver

In [8]:
# open Chrome
driver = webdriver.Chrome(executable_path='./chromedriver/chromedriver')

# visit Python
driver.get("http://www.python.org")

# verify we're in the right place
assert "Python" in driver.title

Let's try automatically typing `pycon` in the search box and hitting the return key:

In [9]:
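# find_element(By.NAME, ...) replaces find_element_by_name, which newer Selenium releases removed
from selenium.webdriver.common.by import By

# find the search box
elem = driver.find_element(By.NAME, "q")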

# clear it
elem.clear()

# type in pycon
elem.send_keys("pycon")

# press Return to submit the search
elem.send_keys(Keys.RETURN)

In [10]:
# close
driver.close()

In [None]:
# all at once:
from selenium.webdriver.common.by import By

driver = webdriver.Chrome(executable_path='./chromedriver/chromedriver')
driver.get("http://www.python.org")
assert "Python" in driver.title

elem = driver.find_element(By.NAME, "q")
elem.clear()
elem.send_keys("pycon")
elem.send_keys(Keys.RETURN)
# assert "No results found." not in driver.page_source
# driver.close()

The above example (and many others) is available in the Selenium docs: http://selenium-python.readthedocs.io/getting-started.html

What is especially important is exploring functionality like locating elements: http://selenium-python.readthedocs.io/locating-elements.html#locating-elements

FAQ:
http://selenium-python.readthedocs.io/faq.html

<a id="summary"></a>
## Summary

In this lesson, we used the Beautiful Soup library to locate elements on a web page and then scrape their text. We also used Selenium to drive a browser so that JavaScript runs before we retrieve the page contents.