# Web Scraping Using Selenium

## Installation
While Selenium supports a number of browser engines, we will use Chrome for the following example, so please make sure you have the following packages installed:

Chrome download page: https://www.google.com/chrome/

A ChromeDriver binary matching your Chrome version: https://chromedriver.chromium.org/downloads

In [2]:
# pip install selenium

Note: you may need to restart the kernel to use updated packages.


Once you have downloaded both Chrome and ChromeDriver and installed the Selenium package, you should be ready to start the browser:

In [1]:
from selenium import webdriver


driver = webdriver.Chrome()
driver.get('https://google.com')

Running the browser from Selenium the way we just did is particularly helpful during development. It allows you to observe exactly what is going on and how the page and the browser behave in the context of your code. Once you are happy with everything, however, it is generally advisable to switch to so-called headless mode in production.

In that mode, Selenium will start Chrome in the "background" without any visual output or windows. Fortunately, enabling headless mode only takes a few flags.

In [2]:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
# Replaces the deprecated "options.headless = True"; the "new" headless mode needs Chrome 109 or later.
options.add_argument("--headless=new")
options.add_argument("--window-size=1920,1200")

driver = webdriver.Chrome(options=options)

We only need to instantiate an *Options* object, add the headless argument to it, and pass it to our WebDriver constructor. Done.
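The same configuration can be wrapped in a small helper so that the window size and the headless switch live in one place. The following is only a minimal sketch and not part of the original notebook: the names `chrome_arguments` and `make_driver` are hypothetical, and it assumes a Chrome build that accepts the `--headless=new` argument (Chrome 109 or later).

```python
def chrome_arguments(headless=True, width=1920, height=1200):
    """Return the Chrome command-line flags for the requested mode."""
    args = [f"--window-size={width},{height}"]
    if headless:
        # Assumption: Chrome 109+; older builds only understand "--headless".
        args.append("--headless=new")
    return args


def make_driver(headless=True):
    """Start Chrome with the flags above (requires Selenium and Chrome)."""
    # Imported here so chrome_arguments() can be inspected without a browser.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    for arg in chrome_arguments(headless):
        options.add_argument(arg)
    return webdriver.Chrome(options=options)
```

With this in place, `make_driver(headless=False)` opens a visible window during development, while `make_driver()` runs without one.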

### WebDriver Page Properties

Building on our headless mode example, let's load the Cal State East Bay "Future Students" page and inspect a few of its properties.

In [4]:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

options = Options()
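# Replaces the deprecated "options.headless = True"; the "new" headless mode needs Chrome 109 or later.
options.add_argument("--headless=new")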
options.add_argument("--window-size=1920,1200")

driver = webdriver.Chrome(options=options)
driver.get("https://www.csueastbay.edu/futurestudents/index.html")
print(driver.page_source)

<html lang="en"><head>
        <meta content="IE=edge" http-equiv="X-UA-Compatible">
        <meta content="text/html; charset=utf-8" http-equiv="Content-Type">
        <meta content="width=device-width, initial-scale=1.0, maximum-scale=1.0, user-scalable=0" name="viewport">
        <meta content="Prospective and future freshmen and transfer students can learn more about Cal State East Bay and what it has to offer, including housing, admission information, financial aid, cost of attendance and more. " name="description">
        <title>Future Students - Cal State East Bay</title>
        <!--BEGIN:GLOBAL-SCRIPTS-HEAD-->
        
        

<link href="https://www.csueastbay.edu/_global/bootstrap/css/bootstrap.min.css" rel="stylesheet"><link href="https://www.csueastbay.edu/_global/bootstrap/css/bootstrap-accessibility.css" rel="stylesheet"><link href="https://www.csueastbay.edu/_global/font-awesome/css/font-awesome.min.css" rel="stylesheet"><link href="https://www.csueastbay.edu/_global

In [5]:
print(driver.title)

Future Students - Cal State East Bay
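The page-level properties used above can be gathered in one call, which is handy for logging what a scraper actually loaded. This is a minimal sketch: `page_summary` is a hypothetical helper name, and it relies only on the `title`, `current_url`, and `page_source` properties of a driver that has already loaded a page.

```python
def page_summary(driver):
    """Collect a few page-level properties from an already loaded page."""
    source = driver.page_source
    return {
        "title": driver.title,
        "url": driver.current_url,
        # The full source can be very long, so only its length is kept here.
        "source_length": len(source),
    }
```

Calling `page_summary(driver)` right after `driver.get(...)` gives a compact record of the page without printing the whole HTML source.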


### Locating Elements

WebDriver provides two main methods for finding elements.

- find_element
- find_elements

They are pretty similar, with the difference that the former looks for one single element, which it returns, whereas the latter will return a list of all found elements. Check https://selenium-python.readthedocs.io/locating-elements.html for more details.
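One practical difference is what happens when nothing matches: `find_element` raises `NoSuchElementException`, whereas `find_elements` simply returns an empty list. The sketch below uses that to build a lookup that never raises; `first_or_none` is a hypothetical helper name, not a Selenium method.

```python
def first_or_none(driver, by, value):
    """Return the first matching element, or None when nothing matches."""
    # find_elements returns an empty list instead of raising an exception.
    matches = driver.find_elements(by, value)
    return matches[0] if matches else None
```

For example, `first_or_none(driver, By.TAG_NAME, 'p')` returns the first paragraph element of the current page, or `None` if the page has no paragraphs.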

In [6]:
driver.find_elements(By.TAG_NAME, 'p')

[<selenium.webdriver.remote.webelement.WebElement (session="e1c691330d78d6cbc83f8f21a1e0ee8a", element="0AE3859AC611DA525ED5F7FB86269C54_element_194")>,
 <selenium.webdriver.remote.webelement.WebElement (session="e1c691330d78d6cbc83f8f21a1e0ee8a", element="0AE3859AC611DA525ED5F7FB86269C54_element_230")>,
 <selenium.webdriver.remote.webelement.WebElement (session="e1c691330d78d6cbc83f8f21a1e0ee8a", element="0AE3859AC611DA525ED5F7FB86269C54_element_1137")>,
 <selenium.webdriver.remote.webelement.WebElement (session="e1c691330d78d6cbc83f8f21a1e0ee8a", element="0AE3859AC611DA525ED5F7FB86269C54_element_1144")>,
 <selenium.webdriver.remote.webelement.WebElement (session="e1c691330d78d6cbc83f8f21a1e0ee8a", element="0AE3859AC611DA525ED5F7FB86269C54_element_1129")>,
 <selenium.webdriver.remote.webelement.WebElement (session="e1c691330d78d6cbc83f8f21a1e0ee8a", element="0AE3859AC611DA525ED5F7FB86269C54_element_1210")>,
 <selenium.webdriver.remote.webelement.WebElement (session="e1c691330d78d6cbc8

### Selenium WebElement
A WebElement is a Selenium object representing an HTML element.

There are many actions that you can perform on those objects. Here are the most useful:

- Accessing the text of the element with the property *element.text*
- Clicking the element with *element.click()*
- Accessing an attribute with *element.get_attribute('class')*
- Sending text to an input with *element.send_keys('mypassword')*
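The two read-only operations from this list, `text` and `get_attribute`, are often combined when turning a list of elements into plain data. Here is a minimal sketch under that assumption; `describe_elements` is a hypothetical helper name, and the `attribute` parameter defaults to `class` only for illustration.

```python
def describe_elements(elements, attribute="class"):
    """Turn WebElement objects into plain dictionaries of text and one attribute."""
    rows = []
    for element in elements:
        rows.append({
            "text": element.text.strip(),
            # get_attribute returns None when the attribute is not set.
            attribute: element.get_attribute(attribute),
        })
    return rows
```

A call such as `describe_elements(driver.find_elements(By.TAG_NAME, 'a'), attribute='href')` would list the visible text and target of every link on the page.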

In [7]:
result = driver.find_element(By.TAG_NAME, 'p')
result_text = result.text
print('element.text: {0}'.format(result_text))

element.text: For more than 60 years, Cal State East Bay has served the Bay Area and beyond, providing access to higher education for a diverse student body and advancing regional engagement. In recent years, the university and its many programs have been included in several state and national rankings. Learn more about joining the Pioneer family. 


## Full example
Here is a full example using the Selenium API methods we just covered.

In [8]:
driver.get("https://news.ycombinator.com/login")

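# send_keys() and click() return None, so there is nothing useful to assign.
# The text input is targeted explicitly because a bare "//input" can match a hidden field first.
driver.find_element(By.XPATH, "//input[@type='text']").send_keys('yourusername')
driver.find_element(By.XPATH, "//input[@type='password']").send_keys('yourpassword')
driver.find_element(By.XPATH, "//input[@value='login']").click()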

Now there is one important thing that is missing here. How do we know if we are logged in?

We could try a couple of things:

- Check for an error message (like "Wrong password")
- Check for one element on the page that is only displayed once logged in.

So, we're going to check for the logout button. The logout button has the ID *logout* (easy)!

In [9]:
from selenium.common.exceptions import NoSuchElementException

try:
    logout_button = driver.find_element(By.ID,"logout")
    print('Successfully logged in')
except NoSuchElementException:
    print('Incorrect login/password')

Incorrect login/password


### Taking screenshots

The beauty of browser-based approaches like Selenium is that we get not only the data and the DOM tree but also, because it is a real browser, a fully rendered page. That makes screenshots possible as well, and Selenium comes fully prepared here.

In [10]:
driver.save_screenshot('screenshot.png')

True

Note that a few things can still go wrong or need tweaking when you take a screenshot with Selenium:

- First, you have to make sure that the window size is set correctly.
- Then, you need to make sure that every asynchronous HTTP call made by the frontend JavaScript code has finished, and that the page is fully rendered.
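The first point can be handled by fixing the window size right before the screenshot is taken. The following is a minimal sketch: `capture` is a hypothetical helper name, and the default size simply mirrors the `--window-size` value used earlier.

```python
def capture(driver, path, width=1920, height=1200):
    """Resize the browser window, then save a screenshot to the given path."""
    driver.set_window_size(width, height)
    # save_screenshot returns False when the file could not be written.
    saved = driver.save_screenshot(path)
    if not saved:
        raise OSError(f"could not write screenshot to {path}")
    return path
```

This does not address the second point: the page may still be loading when `capture(driver, 'screenshot.png')` is called, which is what the waiting techniques in the next section are for.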

### Waiting for an element to be present
Dealing with a website that uses lots of JavaScript to render its content can be tricky. 

That means we can't just send a request and immediately scrape the data; we may have to wait until JavaScript has completed its work. There are typically two ways to approach that:

- Use *time.sleep()* before taking the screenshot.
- Employ a *WebDriverWait* object.

In [11]:
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get("https://www.espn.com/")
try:
    element = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.NAME, 'title'))
    )
finally:
    driver.quit()

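This will wait until an element whose *name* attribute is *title* is present in the DOM, or until the timeout of ten seconds has been reached, in which case a *TimeoutException* is raised. Either way, the *finally* block closes the browser.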
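To make the idea behind an explicit wait more concrete, here is a simplified, pure-Python sketch of a polling loop. It is not Selenium's actual implementation: `wait_until` is a hypothetical function, it raises the built-in `TimeoutError` rather than Selenium's `TimeoutException`, and unlike `WebDriverWait` it does not swallow `NoSuchElementException` while polling. The 0.5 second default mirrors `WebDriverWait`'s default polling interval.

```python
import time


def wait_until(condition, timeout=10.0, poll_frequency=0.5):
    """Call condition() repeatedly until it returns a truthy value or time runs out."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            # Like WebDriverWait.until, hand back whatever the condition produced.
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll_frequency)
```

Compared with a fixed `time.sleep()`, a loop like this returns as soon as the condition holds, so a fast page is not penalized and a slow page still gets the full timeout.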