# <center>Web Scraping II -- Dynamic Web Page Scraping with Selenium </center>

References:
- http://selenium-python.readthedocs.io/getting-started.html
- https://www.scrapingbee.com/blog/selenium-python/


## 1. Why Selenium
- So far, we have learned how to scrape **static** HTML pages using **Requests + BeautifulSoup**
- However, if a page relies on **JavaScript or AJAX** to build its content, this combination does not work
  - Elements in a web page may be loaded **asynchronously**
     * `requests.get(url)` returns only the initial HTML
     * you may need to wait for a while until the web content is fully loaded
  - You need to **interact with the page** to get some content loaded, e.g.
     * scroll down to load more
     * click a button like "more"
     * fill a form
- Example: https://www.quora.com/topic/Machine-Learning
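
The limitation can be reproduced without any network access. The sketch below parses a made-up page (not Quora's real markup) with the standard library's `html.parser`; the list is filled in by a script, and a parser that never runs JavaScript sees only the empty container. The page content and the `ListItemCollector` class are illustrative assumptions.

```python
from html.parser import HTMLParser

# A hypothetical page: the <ul> is empty in the HTML the server sends;
# a browser would fill it in by running the script.
INITIAL_HTML = """
<html><body>
  <ul id="questions"></ul>
  <script>
    var li = document.createElement("li");
    li.textContent = "What is a neural network?";
    document.getElementById("questions").appendChild(li);
  </script>
</body></html>
"""


class ListItemCollector(HTMLParser):
    """Collect the text of every <li> element present in the raw HTML."""

    def __init__(self):
        super().__init__()
        self.in_li = False
        self.items = []

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self.in_li = True
            self.items.append("")

    def handle_endtag(self, tag):
        if tag == "li":
            self.in_li = False

    def handle_data(self, data):
        if self.in_li:
            self.items[-1] += data


collector = ListItemCollector()
collector.feed(INITIAL_HTML)
print(collector.items)  # no <li> exists until the script runs in a browser
```

A real browser driven by Selenium executes the script first, so the same query run against the rendered page would find the list item.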

In [None]:
# Exercise 1.1. Scrape quora page using requests+beautifulsoup

# import requests package
import requests

# import BeautifulSoup from package bs4 (i.e. beautifulsoup4)
from bs4 import BeautifulSoup

page = requests.get("https://www.quora.com/topic/Machine-Learning")    # send a get request to the web page

if page.status_code==200:

    soup = BeautifulSoup(page.content, 'html.parser')

    # get all questions
    questions=soup.select("span.q-box.qu-userSelect--text")

    for i, q in enumerate(questions):
        print(i, q.get_text())
        print("\n")

# Note: nothing is returned. Do you know why?

## 2. Selenium WebDriver
- Selenium WebDriver is one of the most popular tools for **Web UI Automation**
- It uses the WebDriver protocol to control a web browser, like Chrome, Firefox or Safari.
- Selenium is really useful when you have to perform actions on a website, such as:
  - clicking on buttons
  - filling in forms
  - scrolling
  - taking a screenshot
  - executing JavaScript code
- Installation:
  - Install Selenium package:
    - pip install selenium
  - Download a webdriver based on your browser: https://www.selenium.dev/documentation/en/getting_started_with_webdriver/third_party_drivers_and_plugins/. `Be sure to download the latest version!`
  - Place the webdriver (unzip it if the download is zipped) in a folder, e.g. a sub-folder called `driver` under the current working folder. When you create the Selenium driver, point the `executable_path` parameter to the webdriver file in that folder, i.e. `driver = webdriver.Firefox(executable_path='driver/geckodriver')`
  - Here we use **Firefox**
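
A missing or misplaced webdriver file is a common cause of start-up errors, so it can help to resolve the path before creating the driver. The following is a minimal sketch using only the standard library; the function name `resolve_driver` and its defaults are assumptions made for illustration.

```python
import shutil
from pathlib import Path


def resolve_driver(name="geckodriver", folder="driver"):
    """Return the path to the webdriver executable, or raise a clear error.

    Looks in `folder` first (with and without a Windows ".exe" suffix),
    then falls back to whatever is on the system PATH.
    """
    for candidate in (Path(folder) / name, Path(folder) / (name + ".exe")):
        if candidate.is_file():
            return str(candidate)
    on_path = shutil.which(name)
    if on_path:
        return on_path
    raise FileNotFoundError(
        f"{name} not found in '{folder}' or on PATH; download it first"
    )
```

The result could then be passed on, e.g. `executable_path = resolve_driver()`.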

## 3. Use of Selenium WebDriver

### 3.1. **Navigating** (similar to beautifulsoup, but using different syntax)
  * navigate to a link
  * find elements by id, name, xpath, CSS selectors
    * check this for detailed syntax: https://www.selenium.dev/documentation/en/getting_started_with_webdriver/locating_elements/
  
|    | requests/BeautifulSoup | Selenium WebDriver |
| -- |:------------------      |:-----------|
| Navigate to a link |   `requests.get(url)`           | `driver.get(url)`    |
| find elements  | `soup.select()` | `driver.find_element_by_id()`<br> `driver.find_element_by_tag_name()` <br> `driver.find_element_by_css_selector()` <br> ...|
| get attributes of <br>element (say `p`) | `p.attrs` <br>    `p["class"]` | `p.get_attribute("class")` |
| get tag name | `p.name` | `p.tag_name` |
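
Besides the `find_element_by_*` helpers, WebDriver also offers a generic `find_elements(by, value)` call, and as far as I know recent Selenium releases keep only that generic form. The sketch below shows the calling pattern. So that it runs without a browser, it uses a stand-in driver; `FakeDriver`, `FakeElement` and `texts_by_css` are hypothetical names, and the strategy strings mirror the values of Selenium's `By` constants.

```python
# Locator strategy strings: the values behind Selenium's By.CSS_SELECTOR,
# By.ID and By.TAG_NAME constants.
CSS_SELECTOR = "css selector"
ID = "id"
TAG_NAME = "tag name"


def texts_by_css(driver, selector):
    """Return the text of every element matching a CSS selector."""
    return [el.text for el in driver.find_elements(CSS_SELECTOR, selector)]


# --- stand-ins so the sketch runs without a browser (hypothetical) ---
class FakeElement:
    def __init__(self, text):
        self.text = text


class FakeDriver:
    def __init__(self, matches):
        self.matches = matches  # maps (strategy, selector) -> list of texts

    def find_elements(self, by, value):
        return [FakeElement(t) for t in self.matches.get((by, value), [])]


fake = FakeDriver({(CSS_SELECTOR, "span.q-box"): ["Question 1", "Question 2"]})
found = texts_by_css(fake, "span.q-box")
missing = texts_by_css(fake, "span.missing")
print(found)
print(missing)  # find_elements returns an empty list when nothing matches
```

With a real driver, `texts_by_css(driver, "span.q-box.qu-userSelect--text")` would play the role of `soup.select(...)` followed by `get_text()`.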


In [None]:
# Exercise 3.1.1 Scrape using Selenium

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions

# for Firefox browser, do the following
# (1) find the path where you save the webdriver
# executable_path = 'driver/geckodriver'
executable_path = r'c:\driver\geckodriver.exe'  # raw string keeps the backslashes literal

# (2) initialize the driver
driver = webdriver.Firefox(executable_path=executable_path)

# Safari ships with its own webdriver, so no download is needed
# Make sure you enable "Develop -> Allow Remote Automation"
#driver = webdriver.Safari()

# send a request
driver.get('https://www.quora.com/topic/Machine-Learning')

# you should see a Firefox window open

In [None]:
# Exercise 3.1.2. Select truncated text using Selenium

# get all questions using css selector
questions=driver.\
   find_elements_by_css_selector("span.q-box.qu-userSelect--text")

for i, q in enumerate(questions):
    print(i, q.text)
    print("\n")

# close the webdriver. The firefox window closes
driver.quit()

### 3.2. Simulating user actions in a web browser

  - click a button
    * e.g. submit_button.click()
  - fill a form
    * e.g. text_box.send_keys("enter some text")
  - scroll page down or up
    * e.g. body.send_keys(Keys.PAGE_DOWN)
  - move between windows and frames
    * e.g. driver.switch_to.frame("frameName")
  ...
  - For details see https://selenium-python.readthedocs.io/navigating.html
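
For pages that load more content on scroll, a fixed number of scrolls is a guess. One alternative, sketched below under the assumption that the page height stops growing once everything is loaded, is to scroll until `document.body.scrollHeight` no longer changes. The function name `scroll_until_stable` and its parameters are illustrative; `driver` only needs an `execute_script` method.

```python
import time


def scroll_until_stable(driver, pause=2.0, max_scrolls=20):
    """Scroll to the bottom repeatedly until the page stops growing.

    Returns the number of scrolls performed.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    for n in range(1, max_scrolls + 1):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give lazily loaded content time to arrive
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            return n  # nothing new was loaded by the last scroll
        last_height = new_height
    return max_scrolls
```

With a real driver, calling `scroll_until_stable(driver)` before locating the questions would replace a hard-coded loop count. The `max_scrolls` cap matters on feeds that never end.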

In [None]:
# 3.2.1 Simulate "click"
# Click "more" link to get full answer

driver = webdriver.Firefox(executable_path=executable_path)

#driver = webdriver.Safari()

driver.get('https://www.quora.com/topic/Machine-Learning')

driver.implicitly_wait(10)  # set implicit wait



In [None]:

# locate a "more" link by css selector
more_link=driver.\
find_element_by_css_selector("div.q-text.qu-cursor--pointer.qt_read_more")

# click the link element
more_link.click()

# Check firefox browser to see an expanded answer

In [None]:
driver.quit()

In [None]:
# Scroll down to load more questions
import time

driver = webdriver.Firefox(executable_path=executable_path)

#driver = webdriver.Safari()
driver.get('https://www.quora.com/topic/Machine-Learning')

# scroll down 5 times
for i in range(5):
    driver.execute_script("window.scrollTo(0, \
    document.body.scrollHeight);")

    # wait for the content to be loaded
    time.sleep(2)

questions=driver.\
   find_elements_by_css_selector("span.q-box.qu-userSelect--text")

for i, q in enumerate(questions):
    print(i, q.text)
    print("\n")


0 What is a neural network in layman’s terms?


1 I've recently completed my PhD in statistics. I would like to get a second PhD in machine learning. Am I in the correct way? What should I do?


2 What do people think of the TensorFlow sucks article?


3 Is a single layered ReLu network still a universal approximator?


4 Why is gradient descent so effective in machine learning?


5 What's something about machine learning that only a professional would know?


6 What are must-do's for a PhD student in Machine learning?


7 What do you think of Noam Chomsky's view on modern AI?


8 If an artificial neural net has 100 billion nodes, could it become as intelligent as a human?


9 What is an intuitive explanation of Convolutional Neural Networks?


10 What is something about the field of data science that only a professional would know?


11 What is an intuitive explanation of singular value decomposition (SVD)?


12 What are kernels in machine learning and SVM and why do we need them?



### 3.3. Wait
  - Because of the use of AJAX technologies, web elements often load at different time intervals.
  - This makes locating elements difficult.
    - if an element has not been loaded yet, a locating function raises a `NoSuchElementException`.
  - Two types of waits
    - `implicit`: when WebDriver tries to locate an element that is not yet available, it does not raise `NoSuchElementException` immediately. Instead, it keeps polling the page for up to the configured time and raises the exception only if the element is still missing when the time runs out.
      * An implicit wait is set once at the driver level and applies to every locating call
    - `explicit`: WebDriver waits for a specific condition to occur (e.g. an element being present) before proceeding with execution
      * An explicit wait is set individually for each element you need to wait for

Assume you're looking for a question `How should you start a career in Machine Learning?`, but you're not sure if this question has been loaded into the page
- Case 1: If this question is not in the page, you get an error immediately
- Case 2: If it takes time to load the question, use an implicit wait to give it some time
- Case 3: You can keep scrolling down until the question has been loaded or the maximum number of tries is reached
    - Use a `try ... except` block to handle the exception more elegantly
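
Conceptually, an explicit wait is a polling loop: check a condition, sleep briefly, and give up after a deadline. The sketch below illustrates that idea in plain Python. It is not Selenium's implementation, and `wait_until` with its `timeout` and `poll` parameters is a hypothetical helper.

```python
import time


def wait_until(condition, timeout=5.0, poll=0.5):
    """Call `condition()` repeatedly until it returns a truthy value.

    Raises TimeoutError if `timeout` seconds pass first. Exceptions raised
    by `condition` are treated as "not ready yet".
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            result = condition()
            if result:
                return result
        except Exception:
            pass  # e.g. the element has not been found yet
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} s")
        time.sleep(poll)
```

In Selenium the same role is played by `WebDriverWait(driver, 5).until(...)`, which raises a `TimeoutException` when the condition is not met in time.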

In [None]:
# If the element is not there, you'll see an error immediately

#driver = webdriver.Safari()
driver = webdriver.Firefox(executable_path=executable_path)
driver.get('https://www.quora.com/topic/Machine-Learning')

q = driver.find_element_by_css_selector('a[href="https://www.quora.com/Are-you-worried-about-the-possible-effects-of-artificial-intelligence"]')

print(q.text)
driver.quit()

NoSuchElementException: Message: Unable to locate element: a[href="https://www.quora.com/Are-you-worried-about-the-possible-effects-of-artificial-intelligence"]


In [None]:
# If the element is not there, you'll see an error after the waiting

driver = webdriver.Firefox(executable_path=executable_path)
driver.get('https://www.quora.com/topic/Machine-Learning')

driver.implicitly_wait(10)

q = driver.find_element_by_css_selector('a[href="https://www.quora.com/Are-you-worried-about-the-possible-effects-of-artificial-intelligence"]')

print(q.text)
driver.quit()

In [None]:
# If the element is not there,
# you keep scrolling down until the element is there
# or scrolling for max number of times
# try except can catch the error

driver = webdriver.Firefox(executable_path=executable_path)
driver.get('https://www.quora.com/topic/Machine-Learning')

found = False
max_try = 10  # max number of scroll-downs
cnt = 0

while not found and cnt < max_try:

    try:
        # Wait up to 5 seconds for the element to appear;
        # a TimeoutException is raised if it is still missing after that.

        q = WebDriverWait(driver,5).until(\
                    expected_conditions.\
                    presence_of_element_located((By.CSS_SELECTOR, \
                    'a[href="https://www.quora.com/What-are-the-dark-sides-of-a-career-in-AI-machine-learning"]')))

        found = True
        print(q.text)

    except Exception:     # item not there yet (the wait raised a TimeoutException)

        driver.execute_script("window.scrollTo(0, \
        document.body.scrollHeight);")

        cnt += 1

driver.quit()

# Stock Analysis

In [None]:
# Scrape the table of ETFs from stockanalysis.com
# Open the ETF list page

driver = webdriver.Firefox(executable_path=executable_path)

driver.get('https://stockanalysis.com/etf/')

driver.implicitly_wait(10)  # set implicit wait



In [None]:
# get all table cells using css selector
data=driver.find_elements_by_css_selector("td.svelte-1l0crez")

for i, q in enumerate(data[:10]):
    print(i, q.text)
print('...')
for i, q in enumerate(data[-10:]):
    print(i, q.text)


0 AAA
1 AAF First Priority CLO Bond ETF
2 Fixed Income
3 7.38M
4 AAAU
5 Goldman Sachs Physical Gold ETF
6 Commodity
7 476.15M
8 AADR
9 AdvisorShares Dorsey Wright ADR ETF
...
0 Equity
1 24.29B
2 DGRS
3 WisdomTree U.S. SmallCap Quality Dividend Growth Fund
4 Equity
5 248.92M
6 DGRW
7 WisdomTree U.S. Quality Dividend Growth Fund
8 Equity
9 7.93B


In [None]:
# get all table cells using css selector and group them into rows of 4 columns
payload=driver.find_elements_by_css_selector("td.svelte-1l0crez")
data=[]
for idx in range(0,len(payload),4):
    data.append([d.text for d in payload[idx:idx+4]])



In [None]:
import pandas as pd
pd.DataFrame(data)

|  | 0 | 1 | 2 | 3 |
| --- |:--- |:--- |:--- |:--- |
| 0 | AAA | AAF First Priority CLO Bond ETF | Fixed Income | 7.38M |
| 1 | AAAU | Goldman Sachs Physical Gold ETF | Commodity | 476.15M |
| 2 | AADR | AdvisorShares Dorsey Wright ADR ETF | Equity | 28.79M |
| 3 | AAPB | GraniteShares 1.75x Long AAPL Daily ETF | Equity | 790.40K |
| 4 | AAPD | Direxion Daily AAPL Bear 1X Shares | Equity | 23.38M |
| ... | ... | ... | ... | ... |
| 495 | DGL | Invesco DB Gold Fund | Commodity | 85.83M |
| 496 | DGRE | WisdomTree Emerging Markets Quality Dividend G... | Equity | 86.62M |
| 497 | DGRO | iShares Core Dividend Growth ETF | Equity | 24.29B |
| 498 | DGRS | WisdomTree U.S. SmallCap Quality Dividend Grow... | Equity | 248.92M |
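
The reshaping step above relies on every table row contributing exactly four cells. A small helper can make that assumption explicit and fail loudly when the selector matches something unexpected. This is a sketch with a hypothetical function name, `to_rows`, applied to plain strings rather than live web elements.

```python
def to_rows(cells, width=4):
    """Group a flat list of cell texts into rows of `width` columns."""
    if len(cells) % width != 0:
        raise ValueError(
            f"{len(cells)} cells cannot be split into rows of {width}"
        )
    return [cells[i:i + width] for i in range(0, len(cells), width)]


cells = [
    "AAA", "AAF First Priority CLO Bond ETF", "Fixed Income", "7.38M",
    "AAAU", "Goldman Sachs Physical Gold ETF", "Commodity", "476.15M",
]
rows = to_rows(cells)
print(rows)
```

With Selenium, the input would be `[d.text for d in payload]`.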


In [None]:
#<button  class="relative inline-flex items-center whitespace-nowrap rounded-md border border-gray-300 bg-white py-1.5 text-xs font-medium text-gray-700 hover:bg-gray-50 xs:py-2 bp:text-sm px-2 xs:pl-1 xs:pr-1.5 sm:pl-3 sm:pr-1 disabled:bg-gray-50"><span class="hidden sm:inline">Next</span>

# locate the "Next" button by css selector
more_links=driver.find_elements_by_css_selector("button.relative.inline-flex")

more_link=None
for link in more_links:
    print(link.text)
    if link.text=='Next':
        more_link=link

# click the link element
more_link.click()

# If the click succeeds, the Firefox window shows the next page of the table

Previous
Next


ElementClickInterceptedException: Message: Element <button class="relative inline-flex items-center whitespace-nowrap rounded-md border border-gray-300 bg-white py-1.5 text-xs font-medium text-gray-700 hover:bg-gray-50 xs:py-2 bp:text-sm px-2 xs:pl-1 xs:pr-1.5 sm:pl-3 sm:pr-1 disabled:bg-gray-50"> is not clickable at point (853,862) because another element <iframe id="google_ads_iframe_/5206,21976450666/stockanalysis/etfs_3" name="google_ads_iframe_/5206,21976450666/stockanalysis/etfs_3"> obscures it
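
When another element covers the button, one option is to bring the button to the middle of the viewport and retry, and to fall back to a JavaScript-triggered click if that still fails. The sketch below assumes a `driver` with `execute_script(script, *args)` and an element with `click()`. The helper name `click_when_obscured` is hypothetical, and it catches a broad `Exception` only to stay free of imports; with Selenium you would catch `ElementClickInterceptedException`.

```python
def click_when_obscured(driver, element):
    """Click an element that may be covered by another one.

    Returns which attempt succeeded: "native", "after-scroll" or "javascript".
    """
    try:
        element.click()
        return "native"
    except Exception:  # with Selenium: ElementClickInterceptedException
        pass

    # Centre the element in the viewport, away from banners at the edges.
    driver.execute_script(
        "arguments[0].scrollIntoView({block: 'center'});", element
    )
    try:
        element.click()
        return "after-scroll"
    except Exception:
        # Last resort: dispatch the click from JavaScript, which ignores
        # whatever is drawn on top of the element.
        driver.execute_script("arguments[0].click();", element)
        return "javascript"
```

A JavaScript click bypasses the visibility checks a real user would be subject to, so it is best kept as the final fallback.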


2344


In [None]:
import time
# try to click "Next" up to 5 times, scrolling to the bottom after each failure
for i in range(5):
    # locate the "Next" button by css selector
    more_links=driver.find_elements_by_css_selector("button.relative.inline-flex")

    more_link=None
    for link in more_links:
        print(link.text)
        if link.text=='Next':
            more_link=link
    try:
        # click the button
        more_link.click()
        print('Click success')
        break
    except Exception:   # e.g. the click was intercepted by an overlapping ad
        print('Failed to click')

        driver.execute_script("window.scrollTo(0,document.body.scrollHeight);")
        time.sleep(1) # wait for the content to be loaded
        height = driver.execute_script("return document.body.scrollHeight")
        print(height)


Previous
Next
Failed to click
2344
Previous
Next
Failed to click
2344
Previous
Next
Failed to click
2344
Previous
Next
Failed to click
2344
Previous
Next
Failed to click
2344


In [None]:
import time

# start from the top of the page and record the page height
driver.execute_script("window.scrollTo(0,-5);")
height = driver.execute_script("return document.body.scrollHeight")
YOffset=driver.execute_script("return window.pageYOffset;")
print(f'YOffset {YOffset} height {height}')

# after each failed click, scroll to a target that starts at the bottom
# and moves up 100 pixels per attempt (down to 80% of the page height)
for i in range(int(height),int(height*0.8),-100):
    # locate the "Next" button by css selector
    more_links=driver.find_elements_by_css_selector("button.relative.inline-flex")

    more_link=None
    for link in more_links:
        print(link.text)
        if link.text=='Next':
            more_link=link
    try:
        # click the button
        more_link.click()
        print('Click success')
        break
    except Exception:   # e.g. the click was intercepted by an overlapping ad
        print('Failed to click')

        driver.execute_script(f"window.scrollTo(0,{i});")
        # driver.execute_script("window.scrollBy(0,1);")
        time.sleep(1) # wait for the content to be loaded
        height = driver.execute_script("return document.body.scrollHeight")
        YOffset=driver.execute_script("return window.pageYOffset;")
        print(f'YOffset {YOffset} height {height}')


0 21613
Previous
Next
Failed to click
20731.5 21613
Previous
Next
Click success


## From the top

In [None]:
import time

driver = webdriver.Firefox(executable_path=executable_path)

driver.get('https://stockanalysis.com/etf/')

driver.implicitly_wait(2)  # set implicit wait


data=[]

for page in range(10):

    # get all data using css selector
    payload=driver.find_elements_by_css_selector("td.svelte-1l0crez")
    for idx in range(0,len(payload),4):
        data.append([d.text for d in payload[idx:idx+4]])

    # click "Next" to move to the following page
    height = driver.execute_script("return document.body.scrollHeight")
    YOffset=driver.execute_script("return window.pageYOffset;")
    print(f'YOffset {YOffset} height {height}')

    # after each failed click, scroll to a target that starts at the bottom
    # and moves up 100 pixels per attempt (down to 80% of the page height)
    for i in range(int(height),int(height*0.8),-100):
        # locate the "Next" button by css selector
        more_links=driver.find_elements_by_css_selector("button.relative.inline-flex")

        more_link=None
        for link in more_links:
            print(link.text)
            if link.text=='Next':
                more_link=link
        try:
            # click the button
            more_link.click()
            print('Click success')
            break
        except Exception:   # e.g. the click was intercepted by an overlapping ad
            print('Failed to click')

            driver.execute_script(f"window.scrollTo(0,{i});")
            # driver.execute_script("window.scrollBy(0,1);")
            time.sleep(1) # wait for the content to be loaded
            height = driver.execute_script("return document.body.scrollHeight")
            YOffset=driver.execute_script("return window.pageYOffset;")
            print(f'YOffset {YOffset} height {height}')



YOffset 0 height 21613
Previous
Next
Failed to click
YOffset 20731.5 height 21773
Previous
Next
Click success
YOffset 20731.5 height 21773
Previous
Next
Click success
YOffset 20731.5 height 21773
Previous
Next
Click success
YOffset 20731.5 height 21773
Previous
Next
Click success
YOffset 20731.5 height 21773
Previous
Next
Click success
YOffset 20731.5 height 21773
Previous
Next
Click success
YOffset 1462.5 height 2344
Previous
Next
Failed to click
YOffset 1462.5 height 2344
Previous
Next
Failed to click
YOffset 1302.5 height 2344
Previous
Next
Failed to click
YOffset 1462.5 height 2344
Previous
Next
Failed to click
YOffset 1462.5 height 2344
Previous
Next
Failed to click
YOffset 1462.5 height 2344
YOffset 1462.5 height 2344
Previous
Next
Failed to click
YOffset 1462.5 height 2344
Previous
Next
Failed to click
YOffset 1462.5 height 2344
Previous
Next
Failed to click
YOffset 1462.5 height 2344
Previous
Next
Failed to click
YOffset 1462.5 height 2344
Previous
Next
Failed to click
YOffset 
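
The log above shows that once a click fails, the outer loop still re-reads the same page, which is where duplicate rows come from. One possible way to structure the pagination so that it stops as soon as no progress is made is sketched below. The names `scrape_pages`, `read_rows` and `go_next` are hypothetical: `read_rows()` is assumed to return the rows on the current page and `go_next()` to return `True` when the click went through.

```python
def scrape_pages(read_rows, go_next, max_pages=10):
    """Collect rows page by page.

    Stops early when go_next() reports failure, or when a page repeats the
    previous one (the click had no visible effect).
    """
    all_rows, previous = [], None
    for _ in range(max_pages):
        rows = read_rows()
        if rows == previous:
            break  # same content as before: we did not actually move on
        all_rows.extend(rows)
        previous = rows
        if not go_next():
            break  # "Next" could not be clicked
    return all_rows
```

Comparing whole pages is a simple heuristic. It assumes two consecutive pages never hold identical rows, which is reasonable for a table sorted by ticker.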

In [None]:
alletfs=pd.DataFrame(data)
alletfs.columns=['Ticker','Name','Class','AUM']

In [None]:
# pages that were read more than once produce duplicate rows; keep the first row per ticker
alletfs=alletfs.groupby('Ticker').head(1)

In [None]:
alletfs

|  | Ticker | Name | Class | AUM |
| --- |:--- |:--- |:--- |:--- |
| 0 | AAA | AAF First Priority CLO Bond ETF | Fixed Income | 7.38M |
| 1 | AAAU | Goldman Sachs Physical Gold ETF | Commodity | 476.15M |
| 2 | AADR | AdvisorShares Dorsey Wright ADR ETF | Equity | 28.79M |
| 3 | AAPB | GraniteShares 1.75x Long AAPL Daily ETF | Equity | 790.40K |
| 4 | AAPD | Direxion Daily AAPL Bear 1X Shares | Equity | 23.38M |
| ... | ... | ... | ... | ... |
| 3009 | ZIG | Acquirers Fund | Equity | 45.02M |
| 3010 | ZROZ | PIMCO 25 Plus Year Zero Coupon U.S. Treasury I... | Fixed Income | 822.73M |
| 3011 | ZSB | USCF Sustainable Battery Metals Strategy Fund | Commodity | 2.65M |
| 3012 | ZSL | ProShares UltraShort Silver | Commodity | 27.79M |
