# Data Extraction with Selenium
In this tutorial, we discuss how to use Selenium to extract data from the web.  Please see https://selenium-python.readthedocs.io for more details.

## Installation
We first install the selenium package.

        pip install selenium

In [None]:
from selenium import webdriver
import time
import os

browser = webdriver.Chrome()

In [None]:
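# Alternative to the Chrome cell above: run only one of the two,
# otherwise the first browser window stays open with no variable pointing to it.
options = webdriver.FirefoxOptions()
browser = webdriver.Firefox(options=options)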

## Browsing a webpage
Once the browser starts, we can tell it to visit a webpage.

In [None]:
url = 'https://www.cp.eng.chula.ac.th'

In [None]:
browser.get(url=url)

In [None]:
html = browser.execute_script("return document.documentElement.outerHTML")
html[:3000]
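
The same HTML should also be available through the driver's `page_source` property, which avoids running JavaScript. The sketch below wraps that in a small helper; the name `page_snippet` and its `limit` parameter are hypothetical and not part of Selenium, and the output may differ in minor details from the `outerHTML` approach above.

```python
def page_snippet(browser, limit=3000):
    """Return the first `limit` characters of the current page's HTML.

    `page_snippet` is a hypothetical helper for this tutorial;
    `page_source` is the WebDriver property holding the page's HTML.
    """
    return browser.page_source[:limit]
```

With the browser from the cells above, `page_snippet(browser)` plays the same role as `html[:3000]`.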

## Interact with a webpage
When the page is loaded, we can interact with all elements in the webpage.  In this example, we will search for a particular keyword using the search box of the website we just opened.  We will have to locate the correct element and then send the proper keys.

In [None]:
from selenium.webdriver.common.by import By

In [None]:
q_element = browser.find_element(By.CSS_SELECTOR, 'input[name=s]')
q_element.clear()
q_element.send_keys('อาจารย์')  # Thai for "lecturer"


In [None]:
q_element.send_keys(u'\ue007')  # '\ue007' is the Enter key (same value as Keys.ENTER)

## Wait for Conditions
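We often need to wait until some condition is met, e.g. the page has loaded, before continuing.  There are several conditions that we can check.  They are listed below under their Java-style names; the Python `expected_conditions` module offers snake_case counterparts for most of them, such as `presence_of_element_located`: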
- alertIsPresent()
- elementSelectionStateToBe()
- elementToBeClickable()
- elementToBeSelected()
- frameToBeAvailableAndSwitchToIt()
- invisibilityOfTheElementLocated()
- invisibilityOfElementWithText()
- presenceOfAllElementsLocatedBy()
- presenceOfElementLocated()
- textToBePresentInElement()
- textToBePresentInElementLocated()
- textToBePresentInElementValue()
- titleIs()
- titleContains()
- visibilityOf()
- visibilityOfAllElements()
- visibilityOfAllElementsLocatedBy()
- visibilityOfElementLocated()

In [None]:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

In [None]:
e = WebDriverWait(browser, 10).until(EC.presence_of_element_located((By.CSS_SELECTOR, "body.search-results")))

In [None]:
print(e)
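
Besides the built-in conditions, `WebDriverWait.until` accepts any callable that takes the driver and returns a truthy value once the wait should end. As a minimal sketch, the function below (the name `document_ready` is our own, not part of Selenium) treats the page as loaded when the browser reports `document.readyState` as `complete`.

```python
def document_ready(driver):
    """Custom wait condition: True once the document has finished loading.

    `document_ready` is a hypothetical helper; it only relies on the
    driver's `execute_script` method.
    """
    state = driver.execute_script("return document.readyState")
    return state == "complete"

# Usage with the objects from the cells above (needs a live browser):
# WebDriverWait(browser, 10).until(document_ready)
```

Note that a `complete` ready state does not guarantee that content added later by JavaScript is present, so waiting for a specific element, as in the cell above, is often the more reliable check.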

## Navigate the webpage
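We can navigate the current webpage, similar to Beautiful Soup.  Selenium can locate elements by ID, name, class name, tag name, link text, CSS selector, or XPath.  Here we use a CSS selector to find all links inside the articles.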

In [None]:
all_links = browser.find_elements(By.CSS_SELECTOR, 'article a')

In [None]:
for link in all_links:
    print('[link text]', link.text)
    print('[link href]', link.get_attribute('href'))
    print('---')
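
Element references usually become stale once the browser navigates to another page, so it can help to copy the text and `href` of each link into plain Python data first. The sketch below does this and skips empty or repeated URLs; `collect_links` is a hypothetical helper name, and it only assumes each element has `.text` and `.get_attribute()` as in the loop above.

```python
def collect_links(elements):
    """Return unique link records as dicts, keeping the page order.

    Links without an href, or with an href already seen, are skipped.
    """
    seen = set()
    records = []
    for el in elements:
        href = el.get_attribute("href")
        if not href or href in seen:
            continue
        seen.add(href)
        records.append({"text": el.text.strip(), "href": href})
    return records
```

Calling `collect_links(all_links)` before clicking anything gives a list that stays usable after the page changes.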

In [None]:
all_links[0].click()
WebDriverWait(browser, 10).until(EC.presence_of_element_located((By.CSS_SELECTOR, "h1.title-post")))
print('page is ready')

## Save contents
We want to save all images shown below the article.  However, this page lazy-loads its images.  We will have to scroll to the bottom first before we can get all images.  Then, we will use urllib to save those images.

In [None]:
from urllib.request import urlretrieve

In [None]:
browser.execute_script("window.scrollTo(0, document.body.scrollHeight);")
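
A single scroll may not be enough when a page keeps loading more content as you reach the bottom. One common approach, sketched below under the assumption that the page height grows whenever new content arrives, is to scroll repeatedly until the height stops changing. The helper name `scroll_to_bottom` and the `pause` and `max_rounds` parameters are our own choices, not part of Selenium.

```python
import time

def scroll_to_bottom(browser, pause=1.0, max_rounds=10):
    """Scroll down until the page height stops growing.

    Returns the last measured page height. `max_rounds` caps the number
    of scrolls so that an endlessly growing page cannot loop forever.
    """
    last_height = browser.execute_script("return document.body.scrollHeight")
    for _ in range(max_rounds):
        browser.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give lazy-loaded content time to arrive
        new_height = browser.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # nothing new was loaded by the last scroll
        last_height = new_height
    return last_height
```

A fixed pause is a simple but crude choice; waiting for a specific element with `WebDriverWait` is an alternative when you know what should appear.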

In [None]:
all_images = browser.find_elements(By.CSS_SELECTOR, 'figure img')

In [None]:
# Save at most the first five images, numbering the files from 1.
for counter, img in enumerate(all_images[:5], start=1):
    src = img.get_attribute('src')
    filename = f'img-{counter}.jpg'
    urlretrieve(src, filename)
    print(f'Saving {src} to {filename}')
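
The loop above names every file `.jpg`, although the images may be in other formats. As a minimal sketch, the helper below takes the extension from the image URL instead and falls back to `.jpg` when the URL has none. `filename_for` and `default_ext` are hypothetical names, and the URLs in the usage line are placeholders.

```python
import os
from urllib.parse import urlparse

def filename_for(src, index, default_ext=".jpg"):
    """Build a local filename that keeps the extension found in the URL."""
    path = urlparse(src).path  # drops the query string and fragment
    ext = os.path.splitext(path)[1].lower()
    return f"img-{index}{ext or default_ext}"
```

For example, `filename_for("https://example.com/photo.PNG?w=300", 1)` gives `img-1.png`, and the result can be passed to `urlretrieve` in place of the hard-coded name.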

## End browsing session

In [None]:
browser.quit()
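
If a cell raises an error before `browser.quit()` runs, the browser window can be left open. In a script, one way to guard against this is a small context manager that always quits the browser on exit. This is a sketch under our own naming (`managed_browser` and `factory` are hypothetical); it only assumes the created object has a `quit()` method.

```python
from contextlib import contextmanager

@contextmanager
def managed_browser(factory):
    """Start a browser via `factory` and always quit it on exit."""
    browser = factory()
    try:
        yield browser
    finally:
        browser.quit()  # runs even if the body raised an exception

# Usage (needs a real driver installed):
# with managed_browser(webdriver.Chrome) as browser:
#     browser.get(url)
```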