# <center>Web Scraping II -- Dynamic Web Page Scraping with Selenium </center>

References:
- http://selenium-python.readthedocs.io/getting-started.html
- https://www.scrapingbee.com/blog/selenium-python/
- https://www.selenium.dev/documentation/webdriver/


## 1. Why Selenium
- When do **Requests + BeautifulSoup** not work?
  - Elements in a web page are loaded **asynchronously**
     * `requests.get(url)` returns only the initial content
     * you may need to wait a while for the web content to load fully
  - You need to **interact with the page** to get some content loaded, e.g.
     * scroll down to load more
     * click a button like "more"
     * close a pop-up ad window
     * fill in a form
- Example: https://www.quora.com/topic/Machine-Learning
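
To make the first point concrete, here is a minimal sketch using only the standard library. The HTML string is a made-up stand-in for the kind of initial response a JavaScript-driven page returns (it is not Quora's actual markup), and `TagCounter` is a hypothetical helper written for this illustration.

```python
from html.parser import HTMLParser

# Hypothetical initial HTML of a JavaScript-driven page: an empty container
# plus a script that would fill it in later, inside a real browser.
INITIAL_HTML = """
<html>
  <head><script src="app.js"></script></head>
  <body><div id="root"></div></body>
</html>
"""


class TagCounter(HTMLParser):
    """Count how often each start tag occurs in a piece of HTML."""

    def __init__(self):
        super().__init__()
        self.counts = {}

    def handle_starttag(self, tag, attrs):
        self.counts[tag] = self.counts.get(tag, 0) + 1


counter = TagCounter()
counter.feed(INITIAL_HTML)

# The script is present, but the content elements do not exist yet.
print("script tags:", counter.counts.get("script", 0))
print("span tags:", counter.counts.get("span", 0))
```

A parser that only sees this initial HTML has nothing to select, because the content is created later by the script running in a browser.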

In [9]:
# Exercise 1.1. Scrape the Quora page using requests + BeautifulSoup

# import requests package
import requests    
import time

# import BeautifulSoup from package bs4 (i.e. beautifulsoup4)
from bs4 import BeautifulSoup   

page = requests.get("https://www.quora.com/topic/Machine-Learning")    # send a get request to the web page


if page.status_code==200:      

    soup = BeautifulSoup(page.content, 'html.parser')
    print(soup.prettify())

    
    # get all questions
    questions=soup.select("span.q-box.qu-userSelect--text")
    
    for i, q in enumerate(questions):
        print(i, q.get_text())
        #print("\n")

# Note: nothing is returned. Do you know why?

<!DOCTYPE html>
<html dir="ltr" lang="en" style="padding: 0; margin: 0;">
 <head prefix="og: http://ogp.me/ns#">
  <meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
  <link as="script" href="https://qsc.cf2.quoracdn.net/-4-ans_frontend-relay-27-8966a075a796ee0b.webpack" rel="preload"/>
  <link as="script" href="https://qsc.cf2.quoracdn.net/-4-ans_frontend-relay-vendor-27-124786ae6e218fb7.webpack" rel="preload"/>
  <link as="script" href="https://qsc.cf2.quoracdn.net/-4-ans_frontend-relay-common-27-df360b11a21c70d6.webpack" rel="preload"/>
  <link as="script" href="https://qsc.cf2.quoracdn.net/-4-ans_frontend-relay-page-TopicPageLoadable-27-e52706f581c4cd31.webpack" rel="preload"/>
  <link as="script" href="https://qsc.cf2.quoracdn.net/-4-ans_frontend-relay-component-Multifeed-27-1d49dc004bc5f8ce.webpack" rel="preload"/>
  <link as="script" href="https://qsc.cf2.quoracdn.net/-4-ans_frontend-relay-component-TopicTab-ReadWrite-27-cfe4b781a063d06b.webpack" rel="preload"/

## 2. Selenium WebDriver
- Selenium WebDriver is one of the most popular tools for **Web UI Automation**
- It uses the WebDriver protocol to control a web browser, such as Chrome, Firefox, or Safari.
- Selenium is really useful when you have to perform actions on a website, such as:
  - clicking buttons
  - filling in forms
  - scrolling
  - taking a screenshot
  - executing JavaScript code
- Installation:
  - Install the Selenium package:
    - pip install selenium
  - Download a webdriver for your browser: https://www.selenium.dev/documentation/en/getting_started_with_webdriver/third_party_drivers_and_plugins/. `Be sure to download the latest version that matches your browser's version!`
  - Place the webdriver (unzip it if the download is zipped) in a folder, e.g. a sub-folder called `driver` under the current working folder. When you create the driver in Selenium, point the `executable_path` parameter to the webdriver file in that folder (in Selenium 4, this path is passed through a `Service` object).
  - Mac users: see https://stackoverflow.com/questions/43528944/python-browser-with-mac-error-chromedriver-executable-needs-to-be-in-path
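
If Selenium complains that the driver "executable needs to be in PATH", it can help to check where (or whether) the driver file can be found. The following is a minimal sketch using only the standard library. `locate_driver` is a hypothetical helper, the `driver` folder name follows the convention suggested above, and on Windows the file name would normally need an `.exe` suffix.

```python
import os
import shutil
from pathlib import Path


def locate_driver(name, folder="driver"):
    """Return the path of a webdriver executable, or None if it is not found.

    Looks in a local folder first, then falls back to the system PATH.
    """
    candidate = Path(folder) / name
    if candidate.is_file() and os.access(candidate, os.X_OK):
        return str(candidate.resolve())
    # shutil.which searches the directories listed in the PATH variable
    return shutil.which(name)


# Example: prints a path if geckodriver is available, otherwise None
print(locate_driver("geckodriver"))
```

With Selenium 4, a path found this way could be handed to the browser-specific `Service` class, e.g. `Service(executable_path=path)`.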

## 3. Use of Selenium WebDriver

### 3.1. **Navigating** (similar to BeautifulSoup, but with different syntax)
  * navigate to a link
  * find elements by id, name, XPath, or CSS selector
    * check this for detailed syntax: https://www.selenium.dev/documentation/webdriver/locating_elements/
  
|    | requests/BeautifulSoup | Selenium WebDriver |
| -- |:------------------      |:-----------|
| Navigate to a link |   `requests.get(url)`           | `driver.get(url)`    |
| find elements  | `soup.select()` | `driver.find_element(By.ID, ...)`<br> `driver.find_element(By.TAG_NAME, ...)` <br> `driver.find_elements(By.CSS_SELECTOR, ...)` <br> ...|
| get attributes of <br>element (say `p`) | `p.attrs` <br>    `p["class"]` | `p.get_attribute("class")` |
| get tag name | `p.name` | `p.tag_name` |
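
The Selenium column of the table can be combined into a small helper. This is only a sketch: `summarize` and `summarize_all` are hypothetical names, and the functions assume they receive a Selenium driver (or any object with a `find_elements(by, value)` method) and elements exposing `tag_name`, `text`, and `get_attribute()`.

```python
def summarize(element):
    """Collect the tag name, class attribute, and visible text of one element."""
    return {
        "tag": element.tag_name,
        "class": element.get_attribute("class"),
        "text": element.text,
    }


def summarize_all(driver, by, value):
    """Locate every element matching the locator and summarize each one."""
    return [summarize(el) for el in driver.find_elements(by, value)]


# Possible usage with a live driver (not executed here):
# from selenium.webdriver.common.by import By
# rows = summarize_all(driver, By.CSS_SELECTOR, "span.q-box")
```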
 

In [10]:
# Exercise 3.1.1 Scrape using Selenium

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions
from selenium.webdriver.edge.service import Service as EdgeService
from webdriver_manager.microsoft import EdgeChromiumDriverManager

# For Firefox or Chrome, do the following instead:
# (1) find the path where you saved the webdriver
# executable_path = 'driver/geckodriver'
# (2) initialize the driver with that path, e.g. for Firefox:
# from selenium.webdriver.firefox.service import Service as FirefoxService
# driver = webdriver.Firefox(service=FirefoxService(executable_path=executable_path))
# Safari has built-in WebDriver support:
# make sure you enable "Develop -> Allow Remote Automation"
# driver = webdriver.Safari()

driver = webdriver.Edge(service=EdgeService(EdgeChromiumDriverManager().install()))


# send a request
driver.get('https://www.quora.com/topic/Machine-Learning')

# you should see a window open

In [11]:
# Exercise 3.1.2. Select truncated text using Selenium

# get all questions using a CSS selector
questions = driver.find_elements(
    By.CSS_SELECTOR, "span.q-box.qu-userSelect--text")
    

for i, q in enumerate(questions):
    print(i, q.text)
    print("\n")
    
# close the webdriver. 
driver.quit()

### 3.2. Simulating user actions in a web browser

  - click a button
    * e.g. `submit_button.click()`
  - fill in a form
    * e.g. `text_box.send_keys("enter some text")`
  - scroll the page down or up
    * e.g. `body.send_keys(Keys.PAGE_DOWN)`
  - move between windows and frames
    * e.g. `driver.switch_to.frame("frameName")`
  - ...
  - For details see https://selenium-python.readthedocs.io/navigating.html
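
Several of these actions are often chained. The sketch below strings together locating a text box, typing into it, and submitting its form. `fill_and_submit` is a hypothetical helper, and it assumes the located element is a field inside an HTML form so that `submit()` applies.

```python
def fill_and_submit(driver, by, value, text):
    """Type `text` into the located form field and submit the enclosing form."""
    box = driver.find_element(by, value)
    box.clear()           # remove any pre-filled text
    box.send_keys(text)   # type the text as a user would
    box.submit()          # submit the form the field belongs to
    return box


# Possible usage with a live driver (not executed here; the locator is made up):
# from selenium.webdriver.common.by import By
# fill_and_submit(driver, By.NAME, "q", "machine learning")
```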

In [13]:
# 3.2.1 Simulate "click"
# Click a "more" link to get the full answer

# webdriver.Edge() requires msedgedriver to be on PATH; otherwise pass
# service=EdgeService(EdgeChromiumDriverManager().install()) as in Exercise 3.1.1
driver = webdriver.Edge()
driver.get('https://www.quora.com/topic/Machine-Learning')

driver.implicitly_wait(10)  # set implicit wait

# locate a "more" link by CSS selector
more_link = driver.find_element(
    By.CSS_SELECTOR, "div.q-text.qu-cursor--pointer.qt_read_more")

# click the element to expand the answer
more_link.click()

# Check the browser to see an expanded answer


WebDriverException: Message: 'msedgedriver' executable needs to be in PATH. Please download from https://developer.microsoft.com/en-us/microsoft-edge/tools/webdriver/


In [None]:
driver.quit()

In [None]:
# Scroll down to load more questions
import time

driver = webdriver.Chrome()
driver.get('https://www.quora.com/topic/Machine-Learning')

# scroll down 5 times
for i in range(5):
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

    # wait for the content to be loaded
    time.sleep(2)

questions = driver.find_elements(
    By.CSS_SELECTOR, "span.q-box.qu-userSelect--text")

for i, q in enumerate(questions):
    print(i, q.text)
    print("\n")
    

0 What do you mean by a neural network?


1 Why is there a sudden craze of programmers with little math background jumping onto machine learning which requires a much different skill set than traditional programming?


2 What is the creepiest thing any AI has done so far?


3 What is the creepiest thing any AI has done so far?


4 Are you worried about the possible effects of artificial intelligence?


5 Do you believe human art and design is about to crumble because of the introduction of artificial intelligence?


6 What are the dangers of using machine learning libraries without any understanding?


7 Does being able to talk to a robot mean it thinks?


8 What is an intuitive explanation of singular value decomposition (SVD)?


9 How does deep learning work and how is it different from normal neural networks applied with SVM? How does one go about starting to understand them (papers/blogs/articles)?


10 What are some relatively unknown but powerful data science / machine learning t

In [None]:
driver.quit()
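
Scrolling a fixed number of times can stop too early or keep going after nothing new appears. A common alternative, sketched here under the assumption that the page grows taller whenever new content loads, is to scroll until the page height stops changing. `scroll_to_bottom` is a hypothetical helper; `pause` and `max_scrolls` are illustrative defaults.

```python
import time


def scroll_to_bottom(driver, pause=2.0, max_scrolls=20):
    """Scroll down until the page height stops growing.

    Returns the number of scrolls performed.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    for n in range(max_scrolls):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give newly requested content time to load
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            return n + 1   # this scroll loaded nothing new
        last_height = new_height
    return max_scrolls
```

The `max_scrolls` cap matters for pages with effectively endless feeds, where the height may never stop growing.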

### 3.3. Wait
  - Because of AJAX technologies, web elements often load at different times.
  - This makes locating elements difficult.
    - if an element has not been loaded yet, a locating function raises a `NoSuchElementException`.
  - Two types of waits
    - `implicit`: When WebDriver tries to locate an element that is not yet available, it does not throw a `NoSuchElementException` immediately. Instead, it keeps trying for a set amount of time and throws the error only if the element is still unavailable when that time runs out.
      * An implicit wait is set at the driver level and applies to <font color='blue'>any locating function</font>
    - `explicit`: WebDriver waits for a certain condition to occur before proceeding with execution
      * An explicit wait is set for <font color='blue'>each locating function</font> individually
    - check out https://www.selenium.dev/documentation/webdriver/waits/

Assume you're looking for a question, but you're not sure whether it has been loaded into the page
- Case 1: Without any wait, you get an error immediately if the question is not in the page
- Case 2: If the question takes time to load, use an implicit wait to wait for some time
- Case 3: You can keep scrolling down until the question has been loaded or the maximum number of tries is reached
    - Use a `try ... except` block to handle the exception more elegantly
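
The idea behind an explicit wait can be illustrated without a browser: call a condition repeatedly until it succeeds or a time limit passes. The sketch below is a simplified illustration of that polling idea, not Selenium's implementation. `wait_until` is a hypothetical helper, and it raises Python's built-in `TimeoutError` where Selenium would raise its own `TimeoutException`.

```python
import time


def wait_until(condition, timeout=5.0, poll=0.5):
    """Call `condition()` repeatedly until it returns a truthy value.

    Raises TimeoutError if the condition is still falsy after `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll)  # pause before checking again


# Example: a condition that only succeeds on the third check
attempts = []

def loaded_on_third_check():
    attempts.append(1)
    return "content" if len(attempts) >= 3 else None

print(wait_until(loaded_on_third_check, timeout=2.0, poll=0.01))
```

The condition's return value is passed back to the caller, which is how an explicit wait can hand over the located element once it appears.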

In [None]:
# Without a wait, you'll see an error immediately if the element is not there

#driver = webdriver.Safari()
driver = webdriver.Chrome()
driver.get('https://www.quora.com/topic/Machine-Learning')

q = driver.find_element(
    By.CSS_SELECTOR,
    'a[href="https://www.quora.com/Whats-the-coolest-thing-that-AI-has-achieved-so-far"]')

print(q.text)
driver.quit()

NoSuchElementException: Message: no such element: Unable to locate element: {"method":"css selector","selector":"a[href="https://www.quora.com/Whats-the-coolest-thing-that-AI-has-achieved-so-far"]"}
  (Session info: chrome=105.0.5195.125)


In [None]:
driver.quit()

In [None]:
# With an implicit wait, you'll see the error only after the waiting time
# if the element is still not there

driver = webdriver.Chrome()
driver.get('https://www.quora.com/topic/Machine-Learning')

driver.implicitly_wait(10)

q = driver.find_element(
    By.CSS_SELECTOR,
    'a[href="https://www.quora.com/Whats-the-coolest-thing-that-AI-has-achieved-so-far"]')
print(q.text)
driver.quit()

NoSuchElementException: Message: no such element: Unable to locate element: {"method":"css selector","selector":"a[href="https://www.quora.com/Whats-the-coolest-thing-that-AI-has-achieved-so-far"]"}
  (Session info: chrome=105.0.5195.125)


In [None]:
driver.quit()

In [None]:
# If the element is not there yet,
# keep scrolling down until the element appears
# or the maximum number of scrolls is reached.
# try ... except catches the timeout error
# https://selenium-python.readthedocs.io/waits.html#explicit-waits
from selenium.common.exceptions import TimeoutException

driver = webdriver.Chrome()
driver.get('https://www.quora.com/topic/Machine-Learning')

# CSS selector of the question link we are looking for
target = 'a[href="https://www.quora.com/Whats-the-coolest-thing-that-AI-has-achieved-so-far"]'

found = False
max_try = 10  # max number of scroll-downs
cnt = 0

while not found and cnt < max_try:

    try:
        # Wait up to 5 seconds for the element to be present in the DOM;
        # a TimeoutException is raised if it does not appear in time
        q = WebDriverWait(driver, 5).until(
            expected_conditions.presence_of_element_located(
                (By.CSS_SELECTOR, target)))

        found = True
        print(q.text)

    except TimeoutException:     # item not there yet

        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

        cnt += 1

driver.quit()

What's the coolest thing that AI has achieved so far?
