<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Webscraping with Selenium

---


## Enter Selenium

---

Selenium is a browser automation tool: it drives a real browser, optionally in headless mode (with no visible window). That lets us mimic human browsing behavior, including waiting for JavaScript-rendered elements to load.

If you do not already have Selenium installed, you can install it with pip: `pip install selenium`

In [1]:
# import
from selenium import webdriver

Selenium requires us to choose a browser to drive. This notebook uses Chrome (via chromedriver), but Firefox is also a very common choice. See the FAQ: http://selenium-python.readthedocs.io/faq.html

### 1. What is going to happen when I run the next cell?

The chromedriver binary is provided in the `chromedriver` folder, so there is no need to download another.

In [2]:
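# create a driver called driver
# note: newer Selenium 4 releases no longer accept `executable_path`; there, pass
# service=Service("./chromedriver/chromedriver") from selenium.webdriver.chrome.service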
driver = webdriver.Chrome(executable_path="./chromedriver/chromedriver")

Pretty crazy, right? In case you're wondering, this should have opened a new browser window. Check all of your desktop displays if you didn't see it pop up.

Let's close that driver.

In [3]:
# close it
driver.close()

### 2. Use the driver to visit `www.python.org`

In [4]:
# A:
driver = webdriver.Chrome(executable_path="./chromedriver/chromedriver")
driver.get('http://www.python.org')

### 3. Visit the OpenTable page using the driver

Let's return to the problem at hand. We need to visit the OpenTable listing for DC and, once there, get the HTML to load.

In the next cell, prove you can programmatically visit the page.

In [5]:
# A:
driver.get('http://www.opentable.com/washington-dc-restaurant-listings')

### 4. Resolve the JavaScript issue using the driver and find the bookings.

What we can do in this case is:

1. Request that the page load
2. Wait one second
3. Grab the source HTML from the page

Because the page should believe I'm visiting from a live browser client, the JavaScript should run and its output should become part of the page source. I can then grab the page source.

**Once you have the HTML with the JavaScript rendered, repeat the processes above to find the bookings.**
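A fixed one-second sleep can be too short on a slow connection and wasteful on a fast one. As a minimal sketch of the alternative, the helper below polls a condition until it is met or a timeout expires. The name `wait_for` and its parameters are made up for this illustration; Selenium also ships its own `WebDriverWait` class in `selenium.webdriver.support.ui` for the same purpose.

```python
import time


def wait_for(condition, timeout=10.0, poll_interval=0.5):
    """Call `condition` repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds pass.
    (Hypothetical helper for illustration; not part of Selenium.)
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(
                "condition not met within {} seconds".format(timeout)
            )
        time.sleep(poll_interval)


# Hypothetical usage with the notebook's driver (not executed here):
# html = wait_for(
#     lambda: driver.page_source if 'booking' in driver.page_source else None
# )
```

The lambda in the commented usage returns the page source only once the word `booking` appears in it, so the loop stops as soon as the rendered content is likely to be present.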

In [6]:
# import sleep
from time import sleep
from bs4 import BeautifulSoup

In [7]:
# A:
driver = webdriver.Chrome(executable_path="./chromedriver/chromedriver")
driver.get('http://www.opentable.com/washington-dc-restaurant-listings')
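sleep(1)  # give the JavaScript a moment to render
# name the parser explicitly to avoid bs4's "no parser specified" warning
soup = BeautifulSoup(driver.page_source, 'html.parser')
for i in soup.find_all('div', {'class': 'booking'}):
    print(i.text)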

Booked 445 times today
Booked 120 times today
Booked 275 times today
Booked 85 times today
Booked 118 times today
Booked 67 times today
Booked 86 times today
Booked 54 times today
Booked 109 times today
Booked 125 times today
Booked 33 times today
Booked 54 times today
Booked 38 times today
Booked 59 times today
Booked 65 times today
Booked 74 times today
Booked 52 times today
Booked 49 times today
Booked 47 times today
Booked 62 times today
Booked 44 times today
Booked 58 times today
Booked 41 times today
Booked 70 times today
Booked 68 times today
Booked 40 times today
Booked 56 times today
Booked 58 times today
Booked 98 times today
Booked 112 times today
Booked 83 times today
Booked 22 times today
Booked 48 times today
Booked 28 times today
Booked 22 times today
Booked 32 times today
Booked 85 times today
Booked 87 times today
Booked 39 times today
Booked 77 times today
Booked 46 times today
Booked 16 times today
Booked 67 times today
Booked 30 times today
Booked 65 times today
Boo

### 5. Can we get all of the items we want from the page in a single `find_all`?

To be most efficient, we want to do only a single loop over the entries on the page. That means we need to find the element that houses all of the other elements (name, location, price, bookings). Where on the page is each entry located?

In [27]:
# A:
lst1=['rest-row-name-text']
lst2=['rest-row-meta--location rest-row-meta-text']
lst3=['rest-row-pricing']
lst4=['booking']
lst=lst1+lst2+lst3+lst4
data_dict={lst1[0]:[],lst2[0]:[],lst3[0]:[],lst4[0]:[]}
for i in lst:
    for n in soup.find_all(attrs={'class':'{}'.format(i)}):
        if i == lst3[0]:
            data_dict[i].append(n.find('i').text.count('$'))
        else:
            data_dict[i].append(n.text)
        #print(n.text)
data_dict

{'rest-row-name-text': ["Ruffino's - Arlington",
  "Joe's Place Pizza and Pasta",
  'Founding Farmers - DC',
  'Filomena Ristorante',
  'Farmers Fishers Bakers',
  'Ambar - Arlington',
  'Rasika West End',
  'Gyu-Kaku - Arlington',
  'Blue Duck Tavern',
  'BlackSalt',
  'Tupelo Honey - Arlington',
  'Il Canale',
  'Bistro Aracosia',
  "Ray's The Steaks",
  'Green Pig Bistro',
  'Et Voila',
  'Nobu DC',
  '1789 Restaurant',
  'Lyon Hall',
  'Chez Billy Sud',
  'The Liberty Tavern',
  'Bourbon Steak - Four Seasons Washington DC',
  'Iron Gate',
  'Café Milano',
  'Sequoia',
  'Boqueria - Dupont',
  'Kapnos Taverna Arlington',
  'CIRCA at Clarendon',
  'Medium Rare - Arlington',
  'CIRCA at Foggy Bottom',
  "Clyde's of Georgetown",
  'J. Gilbert’s – Wood Fired Steaks & Seafood - McLean',
  'District Commons',
  'SER',
  'Pisco y Nazca Ceviche Gastrobar - Washington D.C.',
  'The Melting Pot - Arlington VA',
  'Cava Mezze - Clarendon',
  "Millie's",
  'BLT Steak DC',
  'Tabard Inn',
  'AGO

### 6. Does every single entry have each element we want?

In [97]:
# A:
# No: bookings has some missing data (some entries have no booking element)

### 7. Use Python exceptions to handle cases when bookings aren't found.

When a booking is not found, store `'ZERO'`.
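One way to sketch the idea is to loop once over each entry's container and look up the fields inside it, catching the `AttributeError` that results when `find` returns `None` for a missing booking. The markup below is a hypothetical, simplified stand-in that reuses the class names from this notebook; the live OpenTable page is more deeply nested and may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup (assumption): two entries, the second without bookings.
sample_html = """
<div class="result content-section-list-row cf with-times">
  <span class="rest-row-name-text">Example Bistro</span>
  <span class="rest-row-meta--location rest-row-meta-text">Georgetown</span>
  <div class="rest-row-pricing"><i>$$$</i></div>
  <div class="booking">Booked 12 times today</div>
</div>
<div class="result content-section-list-row cf with-times">
  <span class="rest-row-name-text">Example Diner</span>
  <span class="rest-row-meta--location rest-row-meta-text">Arlington</span>
  <div class="rest-row-pricing"><i>$$</i></div>
</div>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

rows = []
# One pass over the entry containers; every field is looked up inside its entry.
for entry in sample_soup.find_all("div", class_="result"):
    try:
        bookings = entry.find("div", class_="booking").text
    except AttributeError:
        # find() returned None, so this entry has no booking element
        bookings = "ZERO"
    rows.append({
        "name": entry.find("span", class_="rest-row-name-text").text,
        "location": entry.find("span", class_="rest-row-meta--location").text,
        "price": entry.find("div", class_="rest-row-pricing").find("i").text.count("$"),
        "bookings": bookings,
    })

for row in rows:
    print(row)
```

Because every field comes from the same container, a missing booking cannot shift the remaining values out of alignment with their restaurant.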

In [52]:
# A:
lst1=['rest-row-name-text']
lst2=['rest-row-meta--location rest-row-meta-text']
lst3=['rest-row-pricing']
lst4=['result content-section-list-row cf with-times']
lst=lst1+lst2+lst3+lst4
nam=['span','span','div','div']
data_dict={lst1[0]:[],lst2[0]:[],lst3[0]:[],lst4[0]:[]}
for i,k in zip(lst,nam):
    for n in soup.find_all(name=k,attrs={'class':'{}'.format(i)}):
        if i == lst3[0]:
            data_dict[i].append(n.find('i').text.count('$'))
        elif i==lst4[0]:
            try:
                data_dict[i].append(n.find('div',attrs={'class':'booking'}).text)
            except AttributeError:
                # find() returned None: this entry has no booking element
                data_dict[i].append('ZERO')
        else:
            data_dict[i].append(n.text)
        #print(n.text)
data_dict

{'rest-row-name-text': ["Ruffino's - Arlington",
  "Joe's Place Pizza and Pasta",
  'Founding Farmers - DC',
  'Filomena Ristorante',
  'Farmers Fishers Bakers',
  'Ambar - Arlington',
  'Rasika West End',
  'Gyu-Kaku - Arlington',
  'Blue Duck Tavern',
  'BlackSalt',
  'Tupelo Honey - Arlington',
  'Il Canale',
  'Bistro Aracosia',
  "Ray's The Steaks",
  'Green Pig Bistro',
  'Et Voila',
  'Nobu DC',
  '1789 Restaurant',
  'Lyon Hall',
  'Chez Billy Sud',
  'The Liberty Tavern',
  'Bourbon Steak - Four Seasons Washington DC',
  'Iron Gate',
  'Café Milano',
  'Sequoia',
  'Boqueria - Dupont',
  'Kapnos Taverna Arlington',
  'CIRCA at Clarendon',
  'Medium Rare - Arlington',
  'CIRCA at Foggy Bottom',
  "Clyde's of Georgetown",
  'J. Gilbert’s – Wood Fired Steaks & Seafood - McLean',
  'District Commons',
  'SER',
  'Pisco y Nazca Ceviche Gastrobar - Washington D.C.',
  'The Melting Pot - Arlington VA',
  'Cava Mezze - Clarendon',
  "Millie's",
  'BLT Steak DC',
  'Tabard Inn',
  'AGO

### 8. Putting it all together in a dataframe.

**Loop through each entry. For each entry:**
1. Grab the relevant information we want (name, location, price, bookings). 
2. Produce a dataframe with the columns "name","location","price","bookings" that contains the 100 entries we would like.

In [55]:
# A:
import pandas as pd
df=pd.DataFrame(data_dict)
df.columns=['name','location','price','bookings']
df

| | name | location | price | bookings |
|---|---|---|---|---|
| 0 | Ruffino's - Arlington | Arlington | 2 | ZERO |
| 1 | Joe's Place Pizza and Pasta | Arlington | 2 | ZERO |
| 2 | Founding Farmers - DC | Foggy Bottom | 2 | Booked 445 times today |
| 3 | Filomena Ristorante | Georgetown | 3 | Booked 120 times today |
| 4 | Farmers Fishers Bakers | Georgetown | 2 | Booked 275 times today |
| 5 | Ambar - Arlington | Arlington | 2 | Booked 85 times today |
| 6 | Rasika West End | West End | 3 | Booked 118 times today |
| 7 | Gyu-Kaku - Arlington | Arlington | 2 | Booked 67 times today |
| 8 | Blue Duck Tavern | West End | 3 | Booked 86 times today |
| 9 | BlackSalt | Palisades Northwest | 3 | Booked 54 times today |

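The bookings column holds text such as `Booked 445 times today` or the `'ZERO'` placeholder, which is awkward to sort or sum. A minimal sketch of converting it to an integer follows; the helper name `booking_count` is made up for this illustration.

```python
import re


def booking_count(text):
    """Return the number in a string like 'Booked 445 times today'.

    The 'ZERO' placeholder (or any text without digits) maps to 0.
    """
    match = re.search(r"\d+", text)
    return int(match.group()) if match else 0


print(booking_count("Booked 445 times today"))  # 445
print(booking_count("ZERO"))                    # 0
```

With the dataframe built above, something like `df['bookings'].apply(booking_count)` should then yield a numeric column.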

### 9. [Bonus] Sending keys over the driver.

We can send keys to the page using the driver. Below is a demonstration of how to search the page using the Selenium driver.

In [84]:
# we can send keys as well
from selenium.webdriver.common.keys import Keys
from selenium import webdriver

In [82]:
# open the driver
driver = webdriver.Chrome(executable_path="./chromedriver/chromedriver")

# visit Python
driver.get("http://www.python.org")

# verify we're in the right place
assert "Python" in driver.title

In [86]:
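# find the search box
# (find_element_by_name was removed in newer Selenium releases; By.NAME works in old and new)
from selenium.webdriver.common.by import By
elem = driver.find_element(By.NAME, "q")
# clear it
elem.clear()
# type in pycon
elem.send_keys("pycon")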


In [55]:
# send those keys
elem.send_keys(Keys.RETURN)

# verify the search returned results
assert "No results found." not in driver.page_source

In [87]:
driver.close()

In [57]:
# all at once:
from selenium.webdriver.common.by import By
driver = webdriver.Chrome(executable_path="./chromedriver/chromedriver")
driver.get("http://www.python.org")
assert "Python" in driver.title
elem = driver.find_element(By.NAME, "q")
elem.clear()
elem.send_keys("pycon")
elem.send_keys(Keys.RETURN)
assert "No results found." not in driver.page_source
driver.close()

## Additional resources

---

The above example (and many others) is available in the Selenium docs: http://selenium-python.readthedocs.io/getting-started.html

What is especially important is exploring functionality like locating elements: http://selenium-python.readthedocs.io/locating-elements.html#locating-elements

FAQ:
http://selenium-python.readthedocs.io/faq.html