## Web Scraping with Selenium and XPath, From Start to Finish

Copyright 2021 Tingting Duan https://github.com/Tingting0618

---
#### Agenda:
Section 1. The problem/scenario<br>
Section 2. The solution<br>
Section 3. Code breakdown<br>
Section 4. References<br>

#### Also Note:
Please check a website's robots.txt file before scraping and please respect all websites' scraping rules (aka terms and conditions). Happy ethical hacking! 
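
As a minimal sketch of that check, Python's standard-library `urllib.robotparser` can evaluate robots.txt rules. The rules, URLs, and the agent name `my-scraper` below are made-up examples, parsed from a string so the snippet runs offline. For a real site you would point the parser at the site's own robots.txt with `set_url()` and `read()`.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used only for illustration
sample_rules = """\
User-agent: *
Disallow: /private/
""".splitlines()

parser = RobotFileParser()
parser.parse(sample_rules)

# can_fetch(useragent, url) reports whether the rules allow the request
allowed_public = parser.can_fetch("my-scraper", "https://example.com/hotels/")
allowed_private = parser.can_fetch("my-scraper", "https://example.com/private/x")
print(allowed_public, allowed_private)
```

A robots.txt check covers only part of a site's rules, so the terms and conditions still need to be read separately.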

---

### Section 1. The problem/scenario: 

- Assume we work for a hotel management company, and our job is to set prices for a portfolio of 1,000 hotels. 

- Our strategy is to always price our properties 5% lower than comparable hotels (our competitors), because a room night is a perishable product: if we don't sell it, we lose it. 

- **The goal** is to find out our competitors' prices for the next 180 days and set our prices accordingly.

- The challenge: how do we do this automatically?
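
The 5% rule itself is simple arithmetic. Here is a minimal sketch, assuming competitor prices have already been converted to plain numbers. The function name `undercut_price` and the rounding to two decimals are illustrative choices, not part of the original workflow.

```python
from decimal import Decimal, ROUND_HALF_UP

def undercut_price(competitor_price, discount_pct=Decimal("5")):
    """Return a price that is discount_pct percent below the competitor's."""
    factor = (Decimal("100") - discount_pct) / Decimal("100")
    price = Decimal(str(competitor_price)) * factor
    # Round to cents; the rounding mode is an assumption
    return price.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)

our_price = undercut_price("120.00")
print(our_price)
```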

<a id="imported-data"></a>

### Section 2. The solution: 

In [None]:
## import the modules
from selenium import webdriver
from selenium.webdriver.common.action_chains import ActionChains
from lxml import etree
import pandas as pd
import time

In [None]:
## import portfolio hotels and print the first 3 rows
sample_hotels = pd.read_csv('data/sample_data.csv', encoding='cp1252')
sample_hotels.head(3)

<a id="chrome-browser"></a>

In [None]:
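## load a Chrome browser that the Python code can control
## note: `executable_path` and `find_element_by_xpath` are Selenium 3 syntax;
## Selenium 4 uses Service(...) and driver.find_element(By.XPATH, ...) instead
options = webdriver.ChromeOptions()
## C:\Program Files\Google\Chrome ...find this Chrome folder on your machine
options.binary_location = "./Chrome/Application/chrome.exe"
driver = webdriver.Chrome(executable_path=r'./chromedriver', options=options)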

In [None]:
## initiate a few empty lists so that we can append scraped results
input_hotel_names=[]
scraped_hotel_names = []
nightly_prices = []
hotel_latlngs=[]
hotel_destinations=[]

In [None]:
## use a "for loop" to scrape each competitor hotel; [:2] limits this run to the first two hotels
for i in sample_hotels['Input'][:2]:
    try:
        url = 'https://www.booking.com/'
        driver.get(url)
        time.sleep(3)

        input_box=driver.find_element_by_xpath('//input[@class="c-autocomplete__input sb-searchbox__input sb-destination__input"]')
        input_box.clear()
        input_box.send_keys(i)
        time.sleep(3)

        ## open the date picker, then click the check-in and check-out dates
        driver.find_element_by_xpath('//div[@class="xp__input-group xp__date-time"]').click()
        driver.find_element_by_xpath('//td[@data-date="2021-10-25"]').click()
        time.sleep(1)
        driver.find_element_by_xpath('//td[@data-date="2021-10-26"]').click()
        time.sleep(5)

        ## submit the search
        driver.find_element_by_xpath('//div[@class="sb-searchbox-submit-col -submit-button "]').click()

        ## give the results page time to load before capturing its HTML
        time.sleep(3)
        html = etree.HTML(driver.page_source)
        
        hotel_name = html.xpath('//a[@class="js-sr-hotel-link hotel_name_link url"]/span[1]/text()')
        nightly_price_closure = html.xpath('//div[@class="bui-text bui-text--variant-small_1"]/text()')
        hotel_latlng = html.xpath('//a[@class="bui-link"]/@data-coords')
        hotel_destination = html.xpath('//a[@class="bui-link"]/text()[1]')       
        input_hotel_names.append(i)
        
        if len(hotel_name) < 1:  ## no match found
            scraped_hotel_names.append('blank')
        else:
            scraped_hotel_names.append(hotel_name[0])
            
        try:      
            if len(nightly_price_closure) < 1:
                nightly_price = html.xpath('//div[@class="prco-inline-block-maker-helper"]/span[1]/text()')[0]
                if len(nightly_price)<=1:
                    nightly_prices.append('blank')
                else:
                    nightly_prices.append(nightly_price)
            else:
                nightly_prices.append(nightly_price_closure[0])
        except:
            nightly_prices.append('blank')
    
        
        if len(hotel_latlng) < 1:  ## no match found
            hotel_latlngs.append('blank')
        else:
            hotel_latlngs.append(hotel_latlng[0])
            
    
        if len(hotel_destination) < 1:  ## no match found
            hotel_destinations.append('blank')
        else:
            hotel_destinations.append(hotel_destination[0])

        time.sleep(3)
    
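        ## back up the results so far; "% 1" is always 0, so this runs after every hotel
        ## (a larger modulus, e.g. "% 50", would back up less often)
        if len(scraped_hotel_names) % 1 == 0: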
            pd.DataFrame(
                {'input_hotel_names': input_hotel_names,
                 'scraped_hotel_names': scraped_hotel_names,
                 'nightly_price': nightly_prices,
                 'hotel_latlng' :hotel_latlngs,
                 'hotel_destination': hotel_destinations
                }).to_csv("{}_bak.csv".format('hotel_prices'), index = False)

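    except Exception as err:
        ## report the failure and move on to the next hotel
        print('Failed to scrape {}: {}'.format(i, err))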
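
The repeated "take the first match or fall back to 'blank'" checks in the loop above could be collapsed into one helper. This is a sketch under stated assumptions: `first_or_blank` and `build_row` are hypothetical names, the sample values are invented, and the standard-library `csv` module stands in for pandas so the snippet runs on its own.

```python
import csv
import io

def first_or_blank(results, blank="blank"):
    """Return the first XPath match, or a placeholder when the list is empty."""
    return results[0] if results else blank

def build_row(input_name, names, prices, latlngs, destinations):
    """Bundle one hotel's scraped fields into a single record."""
    return {
        "input_hotel_names": input_name,
        "scraped_hotel_names": first_or_blank(names),
        "nightly_price": first_or_blank(prices),
        "hotel_latlng": first_or_blank(latlngs),
        "hotel_destination": first_or_blank(destinations),
    }

# Invented sample values standing in for html.xpath(...) results
rows = [build_row("Hotel A", ["Hotel A Downtown"], ["120"], [], ["Frankfurt"])]

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(rows[0]))
writer.writeheader()
writer.writerows(rows)
print(buffer.getvalue())
```

Keeping one record per hotel, rather than five parallel lists, means a failure partway through a hotel cannot leave the columns with different lengths.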

<a id="export"></a>
#### Save the scraped data to a CSV file

In [None]:
pd.DataFrame({'input_hotel_names': input_hotel_names,
              'scraped_hotel_names': scraped_hotel_names,
              'nightly_price': nightly_prices,
              'hotel_latlng' :hotel_latlngs,
              'hotel_destination': hotel_destinations
             }).to_csv('hotel_prices.csv', encoding='utf8', index = False)
hotel_prices = pd.read_csv('hotel_prices.csv')

In [None]:
hotel_prices.tail(5)

<a id="qa"></a>
#### Quality Check: 
- Problem: We need to make sure the hotel we scraped is the hotel we were searching for. 
- Solution: Calculate the string similarity between the input and scraped hotel names.

In [None]:
## import the module
from fuzzywuzzy import fuzz

In [None]:
## calculate hotel name string similarity between input vs scraped hotels
hotel_prices['ratio']=hotel_prices.apply(lambda x: 
                     fuzz.token_sort_ratio(x['input_hotel_names'], 
                                           x['scraped_hotel_names']), axis=1)
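
To see what a token-sort comparison does, here is a rough standard-library approximation. It is not fuzzywuzzy's implementation: `token_sort_similarity` is a hypothetical name, and `difflib.SequenceMatcher` is swapped in for the library's own scoring, so the numbers may differ from `fuzz.token_sort_ratio`. The hotel names are example inputs.

```python
import re
from difflib import SequenceMatcher

def token_sort_similarity(a, b):
    """Score 0-100 after lowercasing, splitting into tokens, and sorting them."""
    def normalise(text):
        tokens = re.findall(r"[a-z0-9]+", text.lower())
        return " ".join(sorted(tokens))
    ratio = SequenceMatcher(None, normalise(a), normalise(b)).ratio()
    return round(100 * ratio)

same_hotel = token_sort_similarity("Crowne Plaza Frankfurt", "Frankfurt Crowne Plaza")
different_hotel = token_sort_similarity("Crowne Plaza Frankfurt", "Hilton Tokyo Bay")
print(same_hotel, different_hotel)
```

Sorting the tokens first makes the score insensitive to word order, which matters because a site may list the city before or after the brand name.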

In [None]:
## export the score as a csv file
hotel_prices.to_csv('hotel_prices_QA_score.csv')

In [None]:
hotel_prices.head(3)

### Section 3. Code breakdown

In [None]:
hotel= 'Crowne Plaza Frankfurt Congress Hotel, Frankfurt'

In [None]:
hotel 

In [None]:
options = webdriver.ChromeOptions()
options.binary_location = "./Chrome/Application/chrome.exe"
driver = webdriver.Chrome(executable_path=r'./chromedriver',options=options)

In [None]:
url = 'https://www.booking.com/'

In [None]:
driver.get(url)

<a id="pattern"></a>

In [None]:
## code pattern: driver.find_element_by_xpath('//html_tag_name[@class=""]')
input_box=driver.find_element_by_xpath('//input[@class="c-autocomplete__input sb-searchbox__input sb-destination__input"]')

In [None]:
input_box.clear()

In [None]:
input_box.send_keys(hotel)

In [None]:
## code pattern: driver.find_element_by_xpath('//html_tag_name[@class=""]')
driver.find_element_by_xpath('//span[@class="sb-date-field__icon sb-date-field__icon-btn bk-svg-wrapper calendar-restructure-sb"]').click()

In [None]:
## code pattern: driver.find_element_by_xpath('//html_tag_name[@class=""]')
driver.find_element_by_xpath('//td[@data-date="2021-10-25"]').click()
driver.find_element_by_xpath('//td[@data-date="2021-10-26"]').click()
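
The two dates above are hard-coded. To cover the 180-day horizon from Section 1, the date strings could be generated instead. This sketch uses only the standard library. `stay_dates` and `date_cell_xpath` are hypothetical helper names, one-night stays are an assumption, and the XPath template simply reuses the `data-date` pattern from the cell above.

```python
from datetime import date, timedelta

def stay_dates(start, days_ahead=180):
    """Yield (check_in, check_out) ISO strings for consecutive one-night stays."""
    for offset in range(days_ahead):
        check_in = start + timedelta(days=offset)
        check_out = check_in + timedelta(days=1)
        yield check_in.isoformat(), check_out.isoformat()

def date_cell_xpath(iso_date):
    """Build the XPath for a calendar cell with the given data-date value."""
    return '//td[@data-date="{}"]'.format(iso_date)

pairs = list(stay_dates(date(2021, 10, 25)))
first_check_in, first_check_out = pairs[0]
print(len(pairs), date_cell_xpath(first_check_in), date_cell_xpath(first_check_out))
```

Each pair could then replace the fixed dates inside the scraping loop. Whether the calendar widget shows a given month without extra clicks is something to verify on the live page.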

In [None]:
## code pattern: driver.find_element_by_xpath('//html_tag_name[@class=""]')
search_button=driver.find_element_by_xpath('//button[@class="sb-searchbox__button "]')
search_button.click()

<a id="page"></a>

In [None]:
## scrape the entire page
html = etree.HTML(driver.page_source) 

<a id="info"></a>

In [None]:
## extract all hotel names in this page
hotel_name = html.xpath('//h3/a[@class="js-sr-hotel-link hotel_name_link url"]/span[1]/text()')

In [None]:
hotel_name

In [None]:
## we only need the first returned hotel
hotel_name[0]
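
The `//tag[@class="..."]` pattern can also be tried offline on a small snippet. This sketch swaps in the standard-library `xml.etree.ElementTree`, which supports only a subset of XPath (no `text()`, so `.text` is read from each element instead), in place of lxml. The markup and class name are invented for illustration and are not the real page's HTML.

```python
import xml.etree.ElementTree as ET

# Invented, well-formed markup that imitates a search results list
sample_page = """
<html><body>
  <h3><a class="hotel_name_link" href="/h1"><span>Hotel Alpha</span><span>extra</span></a></h3>
  <h3><a class="hotel_name_link" href="/h2"><span>Hotel Beta</span></a></h3>
  <h3><a class="other" href="/x"><span>Not a hotel</span></a></h3>
</body></html>
""".strip()

root = ET.fromstring(sample_page)

# Same shape as the pattern above: tag, class filter, then the first <span> child
spans = root.findall('.//h3/a[@class="hotel_name_link"]/span[1]')
names = [span.text for span in spans]
print(names)
print(names[0])
```

As in the notebook, the query returns every match on the page, and indexing with `[0]` keeps only the first one.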

### Section 4. References

Documentation for Selenium and XPath: 
- https://selenium-python.readthedocs.io/
- https://selenium-python.readthedocs.io/locating-elements.html

# In conclusion we ...
- **[Imported Modules and Sample Data](#imported-data)**
- **[Opened a Python-Controlled Chrome Browser](#chrome-browser)**
- **[Located a Specific Element on the Webpage](#pattern)**
- **[Downloaded the Entire Page](#page)**
- **[Extracted Specific Information](#info)**
- **[Quality Checked Our Results by Comparing String Similarity](#qa)**
- **[Saved Scraped Data Into a CSV File](#export)**