# <center>Web Scraping II -- Dynamic Web Page Scraping with Selenium </center>

References:
- http://selenium-python.readthedocs.io/getting-started.html
- https://medium.com/the-andela-way/introduction-to-web-scraping-using-selenium-7ec377a8cf72


## 1. Why Selenium
- So far, we have learned how to scrape **static** HTML pages using **Requests + BeautifulSoup**
- However, if a page relies on **JavaScript or AJAX** to build its content, this combination does not work
  - Elements in a web page may be loaded **asynchronously**
     * requests.get(url) returns only the initial content
     * you may need to wait a while for the web content to load fully
  - You need to **interact with the page** to get some content loaded, e.g.
     * scroll down to load more
     * click a button like "more"
     * fill a form
- Example: 'https://www.quora.com/topic/Machine-Learning'
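
To see this limitation without any network traffic, the sketch below parses a small, made-up HTML document the way a static scraper would. The page content, the `question` class name, and the `QuestionCounter` helper are assumptions for illustration, and the standard-library `html.parser` stands in for BeautifulSoup so the example has no dependencies. One question is present in the initial HTML; a second would be added by the script only when a browser runs it.

```python
from html.parser import HTMLParser

# Hypothetical page: one question is server-rendered,
# a second one is appended by JavaScript after the page loads.
INITIAL_HTML = """
<html><body>
  <div id="feed">
    <p class="question">What is machine learning?</p>
  </div>
  <script>
    var p = document.createElement("p");
    p.className = "question";
    p.textContent = "Does data science need statistics?";
    document.getElementById("feed").appendChild(p);
  </script>
</body></html>
"""


class QuestionCounter(HTMLParser):
    """Counts <p class="question"> start tags in the markup it is fed."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        if tag == "p" and dict(attrs).get("class") == "question":
            self.count += 1


counter = QuestionCounter()
counter.feed(INITIAL_HTML)   # the script is never executed, only read as text
print(counter.count)         # prints 1
```

A static parser sees only the server-rendered question. The second one exists only after a browser executes the script, which is the gap Selenium fills.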

In [None]:
# Exercise 1.1. Scrape a Quora page using requests+beautifulsoup

# import requests package
import requests                   

# import BeautifulSoup from package bs4 (i.e. beautifulsoup4)
from bs4 import BeautifulSoup   

page = requests.get("https://www.quora.com/topic/Machine-Learning")    # send a get request to the web page

if page.status_code==200:      

    soup = BeautifulSoup(page.content, 'html.parser')
    
    # get all questions
    questions=soup.select("a.question_link span.ui_qtext_rendered_qtext")
    
    for i, q in enumerate(questions):
        print(i, q.get_text())
        print("\n")
    
# how many questions are returned? 
# If you scroll down, more questions are loaded in the browser
# but these questions can't be captured 

## 2. Selenium WebDriver
- Selenium WebDriver is one of the most popular tools for **Web UI Automation**
- Installation:
  - Install Selenium package: 
    - pip install selenium
  - Download a webdriver based on your browser
    - Chrome:	https://sites.google.com/a/chromium.org/chromedriver/downloads
    - Firefox:	https://github.com/mozilla/geckodriver/releases
    - Safari:	https://webkit.org/blog/6900/webdriver-support-in-safari-10/
  - Here we use **Firefox**
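
Before launching a browser, it can help to confirm that a driver executable is actually reachable. The sketch below is one way to check, assuming the driver was placed on the system PATH rather than in a local folder. `find_drivers` is a hypothetical helper name; `geckodriver` and `chromedriver` are the usual executable names for Firefox and Chrome.

```python
import shutil


def find_drivers(names=("geckodriver", "chromedriver")):
    """Return {executable name: full path, or None if it is not on PATH}."""
    return {name: shutil.which(name) for name in names}


for name, path in find_drivers().items():
    print(f"{name}: {path or 'not found on PATH'}")
```

If a driver is reported as missing, either add its folder to PATH or keep passing the explicit file location when creating the webdriver.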

## 3. Use of Selenium WebDriver

### 3.1. **Navigating** (similar to beautifulsoup, but using different syntax)
  * navigate to a link
  * find elements by id, name, xpath, CSS selectors
    * check this for detailed syntax: https://seleniumhq.github.io/selenium/docs/api/py/webdriver_remote/selenium.webdriver.remote.webelement.html
  
|    | requests/BeautifulSoup | Selenium WebDriver |
| -- |:------------------      |:-----------|
| Navigate to a link |   requests.get(url)           | driver.get(url)    |
| find elements  | soup.find_all() <br> soup.select() | driver.find_element(By.ID, ...)<br> driver.find_element(By.TAG_NAME, ...) <br> driver.find_elements(By.CSS_SELECTOR, ...) <br> ...|
| get attributes of <br>element (say *p*) | p.attrs <br>    p["class"] | p.get_attribute("class") |
| get tag name | p.name | p.tag_name |
 

In [None]:
# Exercise 3.1.1 Scrape using Selenium

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions 

# Path where you save the webdriver 
executable_path = 'driver/geckodriver'

# initialize the webdriver for the Firefox browser
driver = webdriver.Firefox(executable_path=executable_path)

# send a request
driver.get('https://www.quora.com/topic/Machine-Learning')

# you should see a Firefox window open

In [None]:
# Exercise 3.1.2. Select truncated text using Selenium

# get all questions using css selector
questions = driver.find_elements(
    By.CSS_SELECTOR,
    "a.question_link span.ui_qtext_rendered_qtext")
    
for i, q in enumerate(questions):
    print(i, q.text)
    print("\n")
    
# close the webdriver. The firefox window closes
driver.quit()

# Note that only questions in the current screen are captured

### 3.2. Simulating user actions in a web browser

  - click a button
    * e.g. submit_button.click()
  - fill a form
    * e.g. text_box.send_keys("enter some text")
  - scroll page down or up
    * e.g. body.send_keys(Keys.PAGE_DOWN)
  - move between windows and frames
    * e.g. driver.switch_to.frame("frameName")
  ...
  - For details see https://selenium-python.readthedocs.io/navigating.html
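
The click and form-filling actions listed above can be combined into one small helper. This is a minimal sketch: `fill_and_submit` is a hypothetical name, and it assumes the two elements have already been located. It relies only on the `clear()`, `send_keys()` and `click()` methods that Selenium web elements provide.

```python
def fill_and_submit(text_box, submit_button, text):
    """Type `text` into a text box, then click the submit button."""
    text_box.clear()           # remove any pre-filled value
    text_box.send_keys(text)   # type the text as a user would
    submit_button.click()      # submit the form


# Usage with a real page (the locators below are hypothetical):
# box = driver.find_element(By.NAME, "q")
# button = driver.find_element(By.CSS_SELECTOR, "button[type='submit']")
# fill_and_submit(box, button, "machine learning")
```

Keeping the locating step outside the helper means the same function works regardless of how the elements were found.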

In [None]:
# 3.2.1 Simulate "click"
# Click "more" link to get full answer

driver = webdriver.Firefox(executable_path=executable_path)


# wait up to 20 seconds whenever an element is being located
driver.implicitly_wait(20)

driver.get('https://www.quora.com/topic/Machine-Learning')

# locate a "more" link by css selector
more_link = driver.find_element(
    By.CSS_SELECTOR, "a.ui_qtext_more_link")

# click the link element
more_link.click()

# Check firefox browser to see an expanded answer

#driver.quit()

### 3.3. Wait
  - Because of the use of AJAX technologies, web elements often load at different time intervals. 
  - This makes locating elements difficult. 
    - if an element is not loaded, a locating function will raise a NoSuchElementException.
  - Two types of waits 
    - **implicit**: When the WebDriver tries to locate an element that is not yet available, it does not raise NoSuchElementException immediately; instead it keeps trying for a set amount of time. If the element is still unavailable when that time runs out, the exception is raised. 
      * An implicit wait is set at the driver level and applies to every locating function
    - **explicit**: WebDriver waits for a certain condition to occur before proceeding further with execution
      * An explicit wait is set separately for each locating call 
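
Conceptually, an explicit wait is a polling loop: check a condition, pause, and repeat until the condition holds or the time runs out. The sketch below illustrates that idea in plain Python. It is not Selenium's implementation; `wait_until` is a hypothetical helper, and the built-in `TimeoutError` stands in for Selenium's TimeoutException. The 0.5 second default pause mirrors the default polling interval of WebDriverWait.

```python
import time


def wait_until(condition, timeout=10.0, poll=0.5):
    """Call condition() every `poll` seconds until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds have passed.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll)


# Simulate an element that only becomes available on the third check
attempts = []

def ready_on_third_try():
    attempts.append(1)
    return "loaded" if len(attempts) >= 3 else None

print(wait_until(ready_on_third_try, timeout=2, poll=0.01))  # prints loaded
```

An implicit wait applies this kind of retry to every element lookup, while an explicit wait lets you choose the condition and timeout for one specific lookup.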

In [None]:
# 3.3.1 Implicit Wait

from selenium.webdriver.common.keys import Keys

driver = webdriver.Firefox(executable_path=executable_path)

# set implicit wait for 10 seconds
# Whenever the webdriver locates an element,
# it waits up to 10 seconds for the element to appear

driver.implicitly_wait(10)

# send a request
driver.get('https://www.quora.com/topic/Machine-Learning')

# find the body element so we can do page down on body
body = driver.find_element(By.CSS_SELECTOR, 'body')

# Simulate page down
body.send_keys(Keys.PAGE_DOWN)

# Note: without the implicit wait set above, the lookup below
# may fail immediately if the element has not loaded yet.
# With it, the webdriver keeps trying for up to 10 seconds
# and raises an error only if the element is still missing.

q = driver.find_element(By.CSS_SELECTOR, 'a[href="/Does-data-science-need-statistics"]')
print(q.text)


#driver.quit()

In [None]:
# 3.3.2 Explicit Wait

driver = webdriver.Firefox(executable_path=executable_path)

# send a request
driver.get('https://www.quora.com/topic/Machine-Learning')

body = driver.find_element(By.CSS_SELECTOR, 'body')
body.send_keys(Keys.PAGE_DOWN)

# WebDriver waits up to 10 seconds
# for the element to become present;
# if the element does not show up within 10 seconds,
# a TimeoutException is raised
q = WebDriverWait(driver, 10).until(\
          expected_conditions.presence_of_element_located(
              (By.CSS_SELECTOR, 'a[href="/Does-data-science-need-statistics"]')))

print(q.text)
driver.quit()

## 4. Example: Pull all Q&As until the end of the page

In [None]:
# Exercise 4.1. Get all Q&A pairs

import time

driver = webdriver.Firefox(executable_path=executable_path)
driver.get('https://www.quora.com/topic/Machine-Learning')

# keep scrolling down to the bottom of the window
# and check the page source after each scroll-down;
# stop once the page source no longer changes


src_updated = driver.page_source
src = ""

while src != src_updated:
    
    # save page source (i.e. html document) before page-down
    src = src_updated
    
    # execute javascript to scroll to the bottom of the window
    # you can also use page-down
    driver.execute_script(
        "window.scrollTo(0, document.body.scrollHeight);")
    
    # sleep to give newly requested content time to load
    time.sleep(.5)
    
    # save page source after page-down
    src_updated = driver.page_source

# list to save Q&A pairs
data=[]

# get all Q&A list using XPATH locator
lists = driver.find_elements(
    By.XPATH, "//div[@class='paged_list_wrapper']/div")

print("total Q&A pairs: ",len(lists))

# loop through each div to get details
for idx,item in enumerate(lists):
    
    # each Q&A pair has a unique ID
    div_id=item.get_attribute("id")
    
    # Locate question by the unique ID 
    question_css="div#"+div_id+" "+"a.question_link span.ui_qtext_rendered_qtext"
    more_link_css="div#"+div_id+" "+"a.ui_qtext_more_link"
    
    # Use exception handling in case something goes wrong
    try:
        # Find the question link by CSS selector
        # This waits up to 10 seconds before throwing a TimeoutException 
        question=WebDriverWait(driver, 10).until(\
                    expected_conditions.\
             presence_of_element_located((By.CSS_SELECTOR, question_css)))
        
        
        # Get "more" link
        # however, for some questions, there is no more link
        # use exception handling to catch such a situation
        try:
            # This waits up to 10 seconds before throwing a TimeoutException 
            # unless it finds the clickable element to return within 10 seconds.
            more_link=WebDriverWait(driver, 10).until(\
                    expected_conditions.element_to_be_clickable((By.CSS_SELECTOR, \
                                                                 more_link_css)))
            
            # click the link
            more_link.click()
            answer_css="div#"+div_id+" "+"div.ui_qtext_expanded span"
        
        except Exception as e: # if "more" link is not found
            
            # get the truncated text by CSS selector
            answer_css="div#"+div_id+" "+"div.answer_body_preview span.ui_qtext_rendered_qtext"
   
        # Wait for the expanded (or truncated) answer text to be present
        answer=WebDriverWait(driver, 10).until(\
                    expected_conditions.presence_of_element_located((By.CSS_SELECTOR, answer_css)))
        
        
        # append the question/answer text pairs
        data.append((question.text,answer.text ))
        
    except Exception as e:
        # report the failing item and continue with the next one
        print("error:", e)
        print(idx, item)
        
            
driver.quit()

print("Total Q&As scraped: ", len(data))
if data:  # data is empty if nothing could be scraped
    print("First Q&A pair\n", data[0])
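
The scroll-until-unchanged loop used in this exercise can also be wrapped in a reusable function. The sketch below is a variation rather than the exercise's exact method: it compares `document.body.scrollHeight` instead of the full page source, and it adds a `max_rounds` cap so that a page which never stops loading cannot loop forever. `scroll_to_bottom` and its parameters are hypothetical names; only `driver.execute_script` is assumed from the webdriver.

```python
import time


def scroll_to_bottom(driver, pause=0.5, max_rounds=50):
    """Scroll down until the page height stops growing.

    Returns the number of scroll rounds performed (at most `max_rounds`).
    """
    last_height = driver.execute_script("return document.body.scrollHeight;")
    for round_no in range(1, max_rounds + 1):
        # jump to the current bottom of the page
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        # give newly requested content time to load
        time.sleep(pause)
        new_height = driver.execute_script("return document.body.scrollHeight;")
        if new_height == last_height:
            return round_no          # nothing new was loaded, so stop
        last_height = new_height
    return max_rounds


# Usage with a real driver:
# rounds = scroll_to_bottom(driver, pause=0.5)
```

Comparing heights is cheaper than comparing the whole page source, but a short `pause` can still end the loop early on a slow connection, so the value may need tuning.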


## 5. More about Selenium WebDriver
- You can still use BeautifulSoup to parse scraped page source, but BeautifulSoup cannot simulate user interactions
- You can also take a screenshot of the Firefox window!
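
Both points can be combined by saving the rendered HTML and a screenshot side by side, so the page can be parsed later without reopening the browser. This is a minimal sketch: `save_snapshot`, `out_dir` and `stem` are hypothetical names, and the function relies only on the webdriver's `page_source` attribute and `save_screenshot()` method.

```python
from pathlib import Path


def save_snapshot(driver, out_dir, stem="page"):
    """Save the rendered HTML and a screenshot of the current page.

    Returns (html_path, png_path); png_path is None if the screenshot failed.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    html_path = out / f"{stem}.html"
    png_path = out / f"{stem}.png"

    # page_source holds the HTML after JavaScript has run
    html_path.write_text(driver.page_source, encoding="utf-8")

    # save_screenshot returns False if the file could not be written
    saved = driver.save_screenshot(str(png_path))
    return html_path, (png_path if saved else None)


# Usage with a real driver:
# html_file, png_file = save_snapshot(driver, "snapshots", stem="quora_ml")
```

The saved HTML file can then be read back and passed to BeautifulSoup in the same way as the live page source.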

In [None]:
# 5.1 Use beautifulsoup to parse html content retrieved from Selenium WebDriver

from bs4 import BeautifulSoup

driver = webdriver.Firefox(executable_path=executable_path)

# send a request
driver.get('https://www.quora.com/topic/Machine-Learning')


soup = BeautifulSoup(driver.page_source, 'html.parser')
    
# get all questions
questions=soup.select("a.question_link span.ui_qtext_rendered_qtext")
    
for i, q in enumerate(questions):
    print(i, q.get_text())
    print("\n")
        

# Take a screenshot
driver.save_screenshot('screenshot.png')

driver.quit()