## Selenium for scraping and navigating content-rich websites

Selenium is a comprehensive library that automates the scraping of content-rich web pages. It drives the actual web browser installed on your machine to open web pages. This unlocks a whole set of capabilities that are unavailable to libraries that do not run a browser (and brings some problems too). When should you use Selenium? When the website you want to scrape is dynamic and relies on JavaScript to generate (and hide) content.
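To see why JavaScript-generated content matters, here is a minimal sketch that uses only the standard library. The page in `RAW_HTML` and the class `DivTextCollector` are hypothetical and made up for illustration. A plain HTML parser reads the markup but never runs the script, so the value the script would write into the page is simply not there.

```python
from html.parser import HTMLParser

# Hypothetical page: the price is written into the <div> by JavaScript at load time.
RAW_HTML = """
<html><body>
  <div id="price"></div>
  <script>document.getElementById("price").textContent = "36,017.00";</script>
</body></html>
"""


class DivTextCollector(HTMLParser):
    """Collect the text found inside the <div id="price"> element."""

    def __init__(self):
        super().__init__()
        self.inside_price = False
        self.text = []

    def handle_starttag(self, tag, attrs):
        if tag == "div" and dict(attrs).get("id") == "price":
            self.inside_price = True

    def handle_endtag(self, tag):
        if tag == "div":
            self.inside_price = False

    def handle_data(self, data):
        # Only keep text that sits between <div id="price"> and </div>
        if self.inside_price:
            self.text.append(data)


collector = DivTextCollector()
collector.feed(RAW_HTML)
static_price = "".join(collector.text).strip()
print(repr(static_price))  # prints '' because the script was never executed
```

A real browser would execute the script and fill in the `<div>`, which is the situation where a browser-driving tool such as Selenium becomes useful.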

In the first part of this lecture we will see how to instantiate the basic objects in the library. Next, we will show some basic examples of how to interact with dynamic web pages.

The full Selenium documentation can be found at https://selenium-python.readthedocs.io/index.html  

### Google Chrome and packages installation
In this tutorial we will rely on the Google Chrome web browser. Please install it beforehand.

In [1]:
# Install required packages using pip package manager in the current Jupyter kernel
import sys
!{sys.executable} -m pip install selenium
!{sys.executable} -m pip install chromedriver-autoinstaller



If you wish to use Mozilla Firefox, please use the following code instead (and remember to use the Firefox variants in the snippets that follow):

In [2]:
# Install required packages using pip package manager in the current Jupyter kernel
#import sys
#!{sys.executable} -m pip install selenium
#!{sys.executable} -m pip install geckodriver-autoinstaller #For Firefox

## 1. Load a page
In this case study we will learn how to open a web browser, load a page and perform a search using the embedded search box.

The website we want to scrape is the [Python official website](http://www.python.org).

Chrome:

In [2]:
# Import the Selenium web driver
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import chromedriver_autoinstaller

# Check if the current version of chromedriver exists
# and if it doesn't exist, download it automatically,
# then add chromedriver to path
chromedriver_autoinstaller.install()  

# Open the browser window
browser = webdriver.Chrome()

# Input destination
browser.get("http://www.python.org")
# Check status
if "Python" in browser.title:
    print('Page Loaded')

Page Loaded


Firefox:

In [4]:
# Import the Selenium web driver
#from selenium import webdriver
#from selenium.webdriver.common.keys import Keys
#import geckodriver_autoinstaller

# Check if the current version of geckodriver exists
# and if it doesn't exist, download it automatically,
# then add gecko driver to path
#geckodriver_autoinstaller.install()  

# Open the browser window
#browser = webdriver.Firefox()

# Input destination
#browser.get("http://www.python.org")
# Check status
#if "Python" in browser.title:
#    print('Page Loaded')

## 2. Perform a search 
Let's now see how to identify the search box on the [Python official website](http://www.python.org).

In [5]:
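# Locator strategies (By.NAME, By.ID, ...) used by find_element
from selenium.webdriver.common.by import By

# Open the browser window
browser = webdriver.Chrome()
#browser = webdriver.Firefox()

# Input destination
browser.get("http://www.python.org")
# Check status
if "Python" in browser.title:
    print('Page Loaded')

# Locate the search box by its name attribute
# (the find_element_by_* helpers were removed in Selenium 4.3)
elem = browser.find_element(By.NAME, "q")
# Empty the box, type the query and submit it with the Return key
elem.clear()
elem.send_keys("list")
elem.send_keys(Keys.RETURN)
if "No results found." in browser.page_source:
    print('No results')

# Close the browser window
browser.close()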

Page Loaded


## 3. Tracing Bitcoin
We will now learn how to navigate around a website and locate information. The website we are going to scrape is [TradeBlock](https://tradeblock.com), an informative, dynamic, JavaScript-rich site for live cryptocurrency exchange rates.

Selenium provides various strategies to locate elements in a web page. Each call takes a locator strategy from By (from selenium.webdriver.common.by import By); the older find_element_by_* helpers were removed in Selenium 4.3. Among others, you can use the following calls to locate elements in a page:

- browser.find_element(By.CSS_SELECTOR, 'selector')  # Return a single element matching a CSS selector
- browser.find_elements(By.CSS_SELECTOR, 'selector')  # Return multiple elements matching a CSS selector
- browser.find_element(By.XPATH, 'selector')  # Return a single element matching an XPath selector
- browser.find_elements(By.XPATH, 'selector')  # Return multiple elements matching an XPath selector
- browser.find_element(By.ID, 'id')  # Return a single element matching an ID
- browser.find_elements(By.TAG_NAME, 'tag')  # Return multiple elements matching an HTML tag
- browser.find_element(By.LINK_TEXT, 'Link')  # Return a single element matching the exact link text

You can find the full list at https://selenium-python.readthedocs.io/locating-elements.html
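XPath expressions can be tried offline before pointing them at a live page. The sketch below is not Selenium: it uses the standard library's `xml.etree.ElementTree`, which supports only a subset of XPath, on a hypothetical, well-formed fragment shaped like a table of exchanges (the first column and the row values are assumptions for illustration). It shows the difference between fetching a single element and fetching all matching elements.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed fragment shaped like a table of exchanges.
SNIPPET = """
<table class="exchanges">
  <tr><td>1</td><td>Coinbase Pro</td><td>USD</td><td>36,017.00</td></tr>
  <tr><td>2</td><td>Kraken</td><td>EUR</td><td>30,272.70</td></tr>
</table>
""".strip()

root = ET.fromstring(SNIPPET)  # root is the <table> element

# Single element: second cell of the first row (XPath positions start at 1)
first_name = root.find("./tr[1]/td[2]").text

# Multiple elements: the fourth cell of every row
all_prices = [cell.text for cell in root.findall("./tr/td[4]")]

print(first_name)   # Coinbase Pro
print(all_prices)   # ['36,017.00', '30,272.70']
```

In Selenium the same expressions would be evaluated by the browser against the live page, where the full XPath language is available.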

In [7]:
# Locator strategies (By.LINK_TEXT, By.CSS_SELECTOR, ...) used by find_element
from selenium.webdriver.common.by import By

# Open the browser window
browser = webdriver.Chrome()
#browser = webdriver.Firefox()

# Navigate to a website address
browser.get('https://tradeblock.com')

# Maximise the window, otherwise it displays the mobile site
browser.maximize_window()

# Find the "Markets" link and click it
browser.find_element(By.LINK_TEXT, 'Markets').click()

# Allow up to 5 seconds for elements to appear in the lookups below
# (this sets a timeout for find_element; it does not pause the script)
browser.implicitly_wait(5)

# Print the name, the currency and the current price of the first exchange
print(browser.find_element(By.CSS_SELECTOR, '.exchanges tr:first-child td:nth-child(2)').text) # name
print(browser.find_element(By.CSS_SELECTOR, '.exchanges tr:first-child td:nth-child(3)').text) # currency
print(browser.find_element(By.CSS_SELECTOR, '.exchanges tr:first-child td:nth-child(4)').text) # price

# browser.close()

Coinbase Pro
USD
36,017.00
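The wait in the cell above works by retrying a lookup until it succeeds or a timeout expires. The idea can be sketched as a generic polling helper. The function `wait_until` and the stand-in condition `rows_after_third_poll` are hypothetical names written for this illustration; Selenium ships its own `WebDriverWait` class in `selenium.webdriver.support.ui` for the same purpose.

```python
import time


def wait_until(condition, timeout=5.0, poll_interval=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Raises TimeoutError if no truthy value is returned within `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll_interval)


# Stand-in for a page that needs a few polls before its rows appear
attempts = []


def rows_after_third_poll():
    attempts.append(1)
    return ["row"] if len(attempts) >= 3 else []


rows = wait_until(rows_after_third_poll, timeout=2.0, poll_interval=0.01)
print(rows, len(attempts))  # ['row'] 3
```

With a live browser the condition could be a lookup that returns an empty list until the table is rendered, for example `lambda: browser.find_elements(By.CSS_SELECTOR, '.exchanges tr')`, assuming `By` has been imported from `selenium.webdriver.common.by`.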


## Challenge: 
1. Try to fine-tune the code to harvest all the prices on the page (hint: how should you structure your code?), store the results in local storage, and compile them in a machine-readable format.

In [10]:
# Locator strategies (By.LINK_TEXT, By.CSS_SELECTOR, ...) used by find_element(s)
from selenium.webdriver.common.by import By

# Open the browser window
browser = webdriver.Chrome()
#browser = webdriver.Firefox()

# Navigate to a website address
browser.get('https://tradeblock.com')

# Maximise the window, otherwise it displays the mobile site
browser.maximize_window()

# Find the "Markets" link and click it
browser.find_element(By.LINK_TEXT, 'Markets').click()

# Allow up to 5 seconds for elements to appear in the lookups below
browser.implicitly_wait(5)

# Collect the name, the currency and the current price of every exchange
name = [item.text for item in browser.find_elements(By.CSS_SELECTOR, '.exchanges td:nth-child(2)')] # name
currency = [item.text for item in browser.find_elements(By.CSS_SELECTOR, '.exchanges td:nth-child(3)')] # currency
prices = [item.text for item in browser.find_elements(By.CSS_SELECTOR, '.exchanges td:nth-child(4)')] # price

browser.close()

# Print one line per exchange
for exchange, curr, price in zip(name, currency, prices):
    print(exchange, curr, price)

Coinbase Pro USD 35,963.41
LMAX Digital USD 36,012.00
Bitfinex USD 35,981.00
Kraken USD 35,998.90
Bitstamp USD 35,995.06
Gemini USD 35,963.16
Binance.US USD 35,974.36
ErisX USD 35,942.40
OKCoin USD 36,028.34
itBit USD 35,993.25
Bittrex USD 36,007.63
Kraken EUR 30,272.70
LMAX Digital EUR 30,237.85
LMAX Digital JPY 3,981,933.00
Kraken JPY 3,978,098.00
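For the storage part of the challenge, one possible approach is to convert the price strings to numbers and write one CSV row per exchange. This is a minimal sketch, not the only solution: the helper names `parse_price` and `save_quotes`, the column names and the file name `btc_prices.csv` are all assumptions, and the three sample rows are copied from the output above.

```python
import csv
import os
import tempfile
from datetime import datetime, timezone


def parse_price(text):
    """Convert a price string such as '36,017.00' to a float."""
    return float(text.replace(",", ""))


def save_quotes(names, currencies, prices, path):
    """Write one CSV row per exchange, stamped with the collection time (UTC)."""
    collected_at = datetime.now(timezone.utc).isoformat()
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["exchange", "currency", "price", "collected_at"])
        for name, currency, price in zip(names, currencies, prices):
            writer.writerow([name, currency, parse_price(price), collected_at])
    return path


# Sample values copied from the output above; in the notebook these would be
# the name, currency and prices lists collected with Selenium.
sample_names = ["Coinbase Pro", "Kraken", "LMAX Digital"]
sample_currencies = ["USD", "EUR", "JPY"]
sample_prices = ["35,963.41", "30,272.70", "3,981,933.00"]

# Hypothetical output location: a file in the system's temporary directory
output_path = os.path.join(tempfile.gettempdir(), "btc_prices.csv")
saved_path = save_quotes(sample_names, sample_currencies, sample_prices, output_path)

print(parse_price("36,017.00"))  # 36017.0
print(saved_path)
```

Storing the collection time next to each price makes it possible to append later runs and compare prices over time.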


## Readings in the Library


In [11]:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import Select

# Open the browser window
browser = webdriver.Chrome()
#browser = webdriver.Firefox()

url = 'https://sys01.lib.hkbu.edu.hk/course_reserve/course.php'
#url = 'https://lib-linux2.hkbu.edu.hk/course_reserve/course.php'
browser.get(url)

# Maximise the window, otherwise it displays the mobile site
browser.maximize_window()

# Use the selection box to change the number of readings displayed
select = Select(browser.find_element(By.XPATH, '//*[@id="course_tb_length"]/label/select'))

# Select an option by its value attribute
#select.select_by_value('-1')
select.select_by_value('50')
#select.select_by_value('100')

# Click the link inside the element with id "course_tb_next"
browser.find_element(By.XPATH, '//*[@id="course_tb_next"]/a').click()

#browser.quit()

# Acknowledgements

- The code in this notebook is inspired by various sources, including the [official Selenium documentation](https://selenium-python.readthedocs.io/getting-started.html) and Dr. Xinzhi Zhang's Jupyter notebooks.
- The code is for educational purposes only and is released under the CC1.0.