<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Webscraping OpenTable with Selenium: Guided Lab


---

> *Note: this lab is intended to be instructor guided.*


In today's codealong lab, we will build a scraper using `requests` and [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/). We will remedy some of the pitfalls of automated scraping by using Selenium, a browser automation tool that drives a real browser (which can also run "headless", without a visible window).

You will be scraping OpenTable's DC listings. For each restaurant, we want the **name, location, price, cuisine, rating, and number of reviews.**

OpenTable provides all of this information on one page: [OpenTable DC listings](https://www.opentable.com/washington-dc-restaurant-listings)

## 1. Inspect the elements of this page to ensure we can find each piece of information we're interested in.

## 2. Use `requests` and `BeautifulSoup` to read the contents of the HTML.

In [11]:
import pandas as pd
import numpy as np
import requests
from bs4 import BeautifulSoup
from selenium import webdriver
from time import sleep

url = 'https://www.opentable.com/washington-dc-restaurant-listings'

In [12]:
# request the page at the url we set above
res = requests.get(url)

## 3. Use Beautiful Soup to convert the raw HTML into a soup object.

In [13]:
soup = BeautifulSoup(res.content, 'lxml')
soup

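> *Note: if this cell raises `FeatureNotFound: Couldn't find a tree builder with the features you requested: lxml`, install the parser with `pip install lxml`, or pass `'html.parser'` instead of `'lxml'` to use Python's built-in parser.*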

## 4. Extract the name of each restaurant.

Let's first find each restaurant name listed on the page we've loaded. How do we find where the restaurant name sits on the page?

> *Hint: we need to know where in the **html** the restaurant element is housed.*
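
As a minimal sketch of what "where in the html" means, the snippet below parses a tiny hand-written fragment and pulls out the text of every matching `span`. The fragment and restaurant names are made up for illustration (it only borrows the `rest-row-name-text` class used later in this lab), and it uses Python's built-in `html.parser` so it runs without `lxml`.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for a listings page; the restaurant names are made up.
html = """
<div class="rest-row-info">
  <span class="rest-row-name-text">Example Bistro</span>
</div>
<div class="rest-row-info">
  <span class="rest-row-name-text">Sample Grill</span>
</div>
"""

demo_soup = BeautifulSoup(html, "html.parser")

# find_all returns every tag whose name and class match
demo_names = [tag.text for tag in demo_soup.find_all("span", {"class": "rest-row-name-text"})]
print(demo_names)  # ['Example Bistro', 'Sample Grill']
```

The same pattern applies to the real page once you know which tag and class hold the name.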

### 4.A See if you can find the restaurant name on the page. Keep in mind there are many restaurants loaded on the page.

In [17]:
# print(soup.prettify())

for i in soup.find_all('span', {'class': 'rest-row-name-text'})[:10]:
    print(i.text)

Hesters
1243 Steuber
234 Halvorson
Jeffry Dibbert
Katlyn Jast
Littel
Qui
Libero
Shayne Cliff
Consequatur Place


### 4.B See any issues here?

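**ANSWER**: Not enough restaurants, and the names are wrong (are these even real?). They definitely don't match what the browser shows, which suggests the real listings are filled in by JavaScript after the initial HTML arrives.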

## 5. Enter Selenium: resolve the JavaScript issue using the driver.

What we can do in this case is:
1. Request that the page load
2. Wait some amount of time
3. Grab the source html from the page 

Because Selenium drives a real browser, the page behaves as if a person were visiting it: its JavaScript runs, and the rendered content becomes part of the page source. We can then grab that page source.

**Once you have the HTML with the javascript rendered, repeat the processes above.**
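
A fixed `sleep` is the simplest way to "wait some amount of time", but it either wastes time or is too short on a slow connection. The sketch below shows the general idea of polling until a condition holds. `wait_until` is a hypothetical helper written for this lab, not part of Selenium; Selenium ships its own `WebDriverWait` class in `selenium.webdriver.support.ui` for the same purpose.

```python
import time

def wait_until(condition, timeout=10.0, poll=0.5):
    """Call condition() repeatedly until it returns a truthy value or timeout seconds pass.

    Returns the truthy value, or None if the timeout is reached first.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            return None
        time.sleep(poll)

# Stand-in for "has the page rendered yet?": succeeds on the third check.
checks = {"count": 0}

def rendered():
    checks["count"] += 1
    return checks["count"] >= 3

print(wait_until(rendered, timeout=2.0, poll=0.01))  # True
```

With a live driver, the condition could be something like `lambda: 'rest-row-name-text' in driver.page_source`, assuming that class only appears once the listings have rendered.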

In [18]:
from selenium import webdriver

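# macOS: path to the bundled chromedriver
# (newer Selenium 4 releases no longer accept executable_path; there, pass
#  service=Service(path) from selenium.webdriver.chrome.service instead)
driver = webdriver.Chrome(executable_path="./chromedriver/macos/chromedriver")

# Windows: comment out the line above and uncomment the line below
# driver = webdriver.Chrome(executable_path="./chromedriver/windows/chromedriver.exe")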

In [19]:
driver.get(url)
sleep(5)  # give the page's JavaScript time to render

soup = BeautifulSoup(driver.page_source, 'lxml')

## 6. Repeat the process above, and let's grab location as well 

In [20]:
soup = BeautifulSoup(driver.page_source, 'lxml')

names = []
for name in soup.find_all('span',{'class':'rest-row-name-text'})[:5]:
    names.append(name.text)

locs = []
for location in soup.find_all('span', {'class': 'rest-row-meta--location rest-row-meta-text sfx1388addContent'})[:5]:
    locs.append(location.text)

for name, location in zip(names, locs):
    print(f'{name} in {location}.')

Ruffino's - Arlington in Arlington.
BlackSalt in Palisades Northwest.
Ambar - Arlington in Arlington.
Et Voila in Palisades Northwest.
SER in Arlington.


## 7. Get the price (dollar signs) for each restaurant.

The price is the number of dollar signs, on a scale of one to four, for each restaurant. We'll follow the same process.
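
Turning the dollar signs into a number is just a matter of counting characters. A minimal sketch, using made-up strings rather than real page text (`price_level` is a hypothetical helper name):

```python
def price_level(text):
    """Count the dollar signs in a price string, ignoring spaces and other characters."""
    return text.count("$")

print(price_level("  $    $    "))  # 2
print(price_level("$$$$"))          # 4
```

Counting `$` directly is a little more forgiving than stripping spaces and taking the length, since stray newlines or other characters do not change the result.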

In [21]:
# A:
for d in soup.find_all('i', {'class': 'pricing--the-price'})[:5]:
    print(len(d.text.replace(' ','')))

2
3
2
3
2


## 8. Get the name, location, price, cuisine, rating, and reviews for each restaurant.

Let's go through this together a bit.

In [22]:
soup.find_all('div', {'class': 'star-rating-score'})[0].attrs['aria-label']

'4.1 stars out of 5'
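
The rating lives inside a label string like the one above, so it has to be pulled out and converted to a number. One way to do that, sketched here with a regular expression (`parse_rating` is a hypothetical helper, and the label format is assumed to always start with the score):

```python
import re

def parse_rating(label):
    """Return the leading number in a label such as '4.1 stars out of 5', or None."""
    match = re.match(r"\s*(\d+(?:\.\d+)?)", label or "")
    return float(match.group(1)) if match else None

print(parse_rating("4.1 stars out of 5"))  # 4.1
print(parse_rating(""))                    # None
```

Returning `None` for a missing or unexpected label makes it easy to spot restaurants without a rating later on.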

In [23]:
# get the name
l_name = []
for i in soup.find_all('span', {'class': 'rest-row-name-text'}):
    l_name.append(i.text)

# get the location
l_loc = []
for i in soup.find_all('span',{'class':'rest-row-meta--location rest-row-meta-text sfx1388addContent'}):
    l_loc.append(i.text)

# get the pricing
l_pr = []
for d in soup.find_all('i', {'class': 'pricing--the-price'}):
    l_pr.append(len(d.text.replace(' ','')))

# get the cuisine type
l_cuisine = []
for c in soup.find_all('span', {'class': 'rest-row-meta--cuisine rest-row-meta-text sfx1388addContent'}):
    l_cuisine.append(c.text)
    
# get the rating out of 5
l_rating = []
for i in soup.find_all('div', {'class': 'star-rating-score'}):
    l_rating.append(float(i.attrs['aria-label'].split(' ')[0]))
    
# get the number of reviews
l_reviews = []
for i in soup.find_all('a',{'class':'review-link'}):
    l_reviews.append(int(i.find('span').text.strip('()')))

# store it all in a dictionary of lists and make it into a dataframe
info_dict = {'name': l_name,
             'location': l_loc,
             'price': l_pr,
             'cuisine': l_cuisine,
             'rating': l_rating,
             'reviews': l_reviews,
            }

pd.DataFrame(info_dict).head()

| | name | location | price | cuisine | rating | reviews |
|---|---|---|---|---|---|---|
| 0 | Ruffino's - Arlington | Arlington | 2 | Italian | 4.1 | 131 |
| 1 | BlackSalt | Palisades Northwest | 3 | Seafood | 4.7 | 6009 |
| 2 | Ambar - Arlington | Arlington | 2 | Tapas / Small Plates | 4.6 | 2057 |
| 3 | Et Voila | Palisades Northwest | 3 | French | 4.7 | 1464 |
| 4 | SER | Arlington | 2 | Spanish | 4.4 | 1053 |
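
Building one list per field quietly assumes that every restaurant has every field. If a single restaurant lacks a rating, that list ends up one item short: `pd.DataFrame` raises a `ValueError` for lists of unequal length, and any earlier `zip` would silently pair values with the wrong restaurant. A small sanity check, sketched with made-up data (`column_lengths` and `is_aligned` are hypothetical helpers):

```python
def column_lengths(columns):
    """Map each column name to the length of its list."""
    return {name: len(values) for name, values in columns.items()}

def is_aligned(columns):
    """True when every column has the same number of entries."""
    return len(set(column_lengths(columns).values())) <= 1

# Made-up data: one restaurant has no rating, so 'rating' is one short.
demo = {
    "name": ["Example Bistro", "Sample Grill", "Test Kitchen"],
    "rating": [4.1, 4.7],
}
print(column_lengths(demo))  # {'name': 3, 'rating': 2}
print(is_aligned(demo))      # False
```

Running a check like this on the dictionary of lists before building the dataframe shows which field is missing values.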


## 9. Can we get all of the items we want from the page in a single `find_all`?

To be most efficient, we want a single loop over the entries on the page. That means finding the element that houses all of the other elements (name, location, price, cuisine, rating, reviews). Where on the page is each entry located?
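
The idea can be sketched on a tiny hand-written fragment: find each container once, then look inside it for the pieces. The markup below is made up (it only borrows class names from this lab), and the second entry deliberately has no rating to show what `find` returns when an element is absent.

```python
from bs4 import BeautifulSoup

# Hand-written fragment; the second entry deliberately has no rating.
html = """
<div class="rest-row-info">
  <span class="rest-row-name-text">Example Bistro</span>
  <div class="star-rating-score" aria-label="4.1 stars out of 5"></div>
</div>
<div class="rest-row-info">
  <span class="rest-row-name-text">Sample Grill</span>
</div>
"""

demo_soup = BeautifulSoup(html, "html.parser")

demo_rows = []
for row in demo_soup.find_all("div", {"class": "rest-row-info"}):
    name = row.find("span", {"class": "rest-row-name-text"}).text.strip()
    score = row.find("div", {"class": "star-rating-score"})  # None when absent
    rating = float(score["aria-label"].split(" ")[0]) if score is not None else None
    demo_rows.append((name, rating))

print(demo_rows)  # [('Example Bistro', 4.1), ('Sample Grill', None)]
```

Because every value comes from the same container, a missing rating stays attached to the right restaurant instead of shifting the rest of the column.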

In [26]:
# A:
print(soup.find_all('div', {'class': 'rest-row-info'})[0].prettify())

<div class="rest-row-info">
 <div class="rest-row-header-container">
  <div class="rest-row-header">
   <a class="rest-row-name rest-name" href="https://www.opentable.com/ruffinos-arlington?corrid=92a8f9e4-e782-41b8-8df4-51e861d1aac6" onclick="OT.BestAnalytics.logRestaurantVisit(66715)" target="_blank">
    <span class="rest-row-name-text">
     Ruffino's - Arlington
    </span>
   </a>
  </div>
 </div>
 <div class="rest-row-meta rest-row-meta-grid flex-row-justify">
  <div class="flex-row-justify">
   <div class="rest-row-review">
    <div class="star-rating review-container xsmall-medium">
     <div aria-label="4.1 stars out of 5" class="star-rating-score" role="img">
      <span aria-hidden="true" class="star">
       <img src="//media.otstatic.com/search-result-node/images/compressed/star-full.svg"/>
      </span>
      <span aria-hidden="true" class="star">
       <img src="//media.otstatic.com/search-result-node/images/compressed/star-full.svg"/>
      </span>
      <span aria-hi

In [27]:
# initialize dataframe
df = pd.DataFrame(columns=['name', 'location', 'price', 'cuisine','rating','reviews'])
len(df)

0

In [28]:
df = pd.DataFrame(columns=['name', 'location', 'price', 'cuisine','rating','reviews'])

# one big for loop!
for row in soup.find_all('div', {'class': 'rest-row-info'}):
    name = row.find('span', {'class':'rest-row-name-text'}).text
    loc = row.find('span',{'class':'rest-row-meta--location rest-row-meta-text sfx1388addContent'}).text
    price = row.find('i', {'class':'pricing--the-price'}).text.count('$')
    cuisine = row.find('span', {'class':'rest-row-meta--cuisine rest-row-meta-text sfx1388addContent'}).text
    # rating and reviews may be missing, in which case find() returns None
    try:
        rating = float(row.find('div',{'class':'star-rating-score'}).attrs['aria-label'].split(' ')[0])
    except (AttributeError, KeyError, ValueError):
        rating = 0
    try:
        reviews = int(row.find('a',{'class':'review-link'}).find('span').text.strip('()'))
    except (AttributeError, ValueError):
        reviews = 0
    df.loc[len(df)] = [name, loc, price, cuisine, rating, reviews]
    

df.head()

| | name | location | price | cuisine | rating | reviews |
|---|---|---|---|---|---|---|
| 0 | Ruffino's - Arlington | Arlington | 2 | Italian | 4.1 | 131 |
| 1 | BlackSalt | Palisades Northwest | 3 | Seafood | 4.7 | 6009 |
| 2 | Ambar - Arlington | Arlington | 2 | Tapas / Small Plates | 4.6 | 2057 |
| 3 | Et Voila | Palisades Northwest | 3 | French | 4.7 | 1464 |
| 4 | SER | Arlington | 2 | Spanish | 4.4 | 1053 |


## 10. Does every single entry have each element we want?

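**ANSWER**: On this page, it sure looks like it. The `try`/`except` blocks above still guard against entries without a rating or reviews, which would otherwise raise an error.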

## BONUS

## 11. Putting it all together in a function.

**Loop through each entry. For each entry:**
1. Grab the relevant information we want (name, location, price, cuisine, rating, reviews).
2. Add it as a row to a dataframe with the columns "name", "location", "price", "cuisine", "rating", "reviews", so that the dataframe ends up containing the 100 entries we would like.

In [29]:
def restaurant_grabber(soup, df):
    """Append one row per restaurant found in soup to df (modifies df in place)."""
    for row in soup.find_all('div', {'class': 'rest-row-info'}):
        name = row.find('span', {'class':'rest-row-name-text'}).text
        loc = row.find('span',{'class':'rest-row-meta--location rest-row-meta-text sfx1388addContent'}).text
        price = row.find('i', {'class':'pricing--the-price'}).text.count('$')
        cuisine = row.find('span', {'class':'rest-row-meta--cuisine rest-row-meta-text sfx1388addContent'}).text
        # rating and reviews may be missing, in which case find() returns None
        try:
            rating = float(row.find('div',{'class':'star-rating-score'}).attrs['aria-label'].split(' ')[0])
        except (AttributeError, KeyError, ValueError):
            rating = 0
        try:
            reviews = int(row.find('a',{'class':'review-link'}).find('span').text.strip('()'))
        except (AttributeError, ValueError):
            reviews = 0
        df.loc[len(df)] = [name, loc, price, cuisine, rating, reviews]
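
Appending with `df.loc[len(df)]` grows the dataframe one row at a time and changes the dataframe that was passed in. An alternative worth knowing, sketched below, is to collect each row as a dictionary in a plain list and build the dataframe once at the end. `COLUMNS` and `rows_to_records` are hypothetical names introduced for this sketch.

```python
COLUMNS = ["name", "location", "price", "cuisine", "rating", "reviews"]

def rows_to_records(rows):
    """Turn a list of 6-item rows into a list of dicts keyed by column name."""
    return [dict(zip(COLUMNS, row)) for row in rows]

# Two rows copied from the output shown earlier in this lab.
demo_rows = [
    ["Ruffino's - Arlington", "Arlington", 2, "Italian", 4.1, 131],
    ["BlackSalt", "Palisades Northwest", 3, "Seafood", 4.7, 6009],
]
records = rows_to_records(demo_rows)
print(records[0]["name"], records[1]["reviews"])  # Ruffino's - Arlington 6009
```

Passing the finished list to `pd.DataFrame(records)` then builds the whole dataframe in one call.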

## 12. Use selenium to loop through at least 5 pages and grab that information as well. 

In [30]:
from selenium.webdriver.common.by import By

df_chi = pd.DataFrame(columns=['name', 'location', 'price', 'cuisine','rating','reviews'])
url = 'https://www.opentable.com/chicago-restaurant-listings'
driver = webdriver.Chrome(executable_path="./chromedriver/macos/chromedriver")
driver.get(url)


for _ in range(6):
    sleep(5)  # wait for the current page to render
    soup_object = BeautifulSoup(driver.page_source, 'lxml')
    restaurant_grabber(soup_object, df_chi)
    # find_element(By.LINK_TEXT, ...) replaces find_element_by_link_text, which Selenium 4 removed
    next_button = driver.find_element(By.LINK_TEXT, 'Next')
    next_button.click()

driver.close()

df_chi.head()

| | name | location | price | cuisine | rating | reviews |
|---|---|---|---|---|---|---|
| 0 | Boka | Lincoln Park | 3 | Contemporary American | 4.8 | 5089 |
| 1 | Aba | West Loop | 3 | Mediterranean | 4.8 | 1404 |
| 2 | Café Ba-Ba-Reeba | Lincoln Park | 2 | Tapas / Small Plates | 4.8 | 6435 |
| 3 | Summer House Santa Monica | Lincoln Park | 2 | American | 4.7 | 4013 |
| 4 | Uncle Julio's - Chicago | Lincoln Park | 2 | Mexican | 4.4 | 645 |
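
The loop above clicks "Next" even after the last page it scrapes, and it would raise an error if the button were ever missing. The sketch below shows one way to structure the same loop so that it stops cleanly. It is written against plain callables (`get_source`, `go_next`, `parse` are hypothetical names) and uses a fake three-page "site" so it runs without a browser.

```python
def scrape_pages(get_source, go_next, parse, max_pages=5):
    """Parse up to max_pages pages, stopping early when go_next() returns False.

    get_source() returns the current page's HTML, parse(html) returns a list of
    rows, and go_next() advances to the next page and reports whether it could.
    """
    rows = []
    for page in range(max_pages):
        rows.extend(parse(get_source()))
        # Skip the click after the last page we want
        if page == max_pages - 1 or not go_next():
            break
    return rows

# Fake three-page "site" so the sketch runs without a browser.
pages = ["a,b", "c", "d,e"]
state = {"i": 0}

def get_source():
    return pages[state["i"]]

def go_next():
    if state["i"] + 1 >= len(pages):
        return False
    state["i"] += 1
    return True

print(scrape_pages(get_source, go_next, lambda html: html.split(","), max_pages=5))
# ['a', 'b', 'c', 'd', 'e']
```

With Selenium, `get_source` could return `driver.page_source`, and `go_next` could click the "Next" link and return `False` when Selenium raises `NoSuchElementException` (from `selenium.common.exceptions`).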


In [16]:
df_chi[df_chi.duplicated()]

| | name | location | price | cuisine | rating | reviews |
|---|---|---|---|---|---|---|


In [17]:
df_chi.shape

(600, 6)

## Examples of Other Neat-o Selenium Stuff

In [18]:
# you can also use selenium to make clicks!
from selenium.webdriver.common.by import By

url = 'https://www.opentable.com/chicago-restaurant-listings'
driver = webdriver.Chrome(executable_path="./chromedriver/macos/chromedriver")
driver.get(url)
driver.implicitly_wait(2)
# find_element(By.XPATH, ...) replaces find_element_by_xpath, which Selenium 4 removed
price4 = driver.find_element(By.XPATH, '//*[@id="PriceBands-filter-items"]/ul/li[3]/label')

price4.click()
sleep(5)
driver.close()

In [31]:
# can also type things into a search bar

from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from time import sleep

driver = webdriver.Chrome(executable_path="./chromedriver/macos/chromedriver")
driver.get("http://www.python.org")
sleep(5)
assert "Python" in driver.title
# find_element(By.NAME, ...) replaces find_element_by_name, which Selenium 4 removed
elem = driver.find_element(By.NAME, "q")
elem.clear()
elem.send_keys("pycon")
sleep(5)
elem.send_keys(Keys.RETURN)
sleep(5)
assert "No results found." not in driver.page_source
sleep(5)
driver.close()

## Additional resources

---

The above example (and many others) is available in the Selenium docs: http://selenium-python.readthedocs.io/getting-started.html

The section on locating elements is especially worth exploring: http://selenium-python.readthedocs.io/locating-elements.html#locating-elements

FAQ:
http://selenium-python.readthedocs.io/faq.html