# Lecture 5.7 - Web Scraping with Selenium

In [3]:
import requests
from bs4 import BeautifulSoup

s = requests.Session()
r = s.get('https://wsu-datascience.github.io/binomial_simulation/index.html')
binom_sim = BeautifulSoup(r.content, 'html.parser')

In [4]:
binom_sim

<!DOCTYPE html>

<html lang="en">
<head>
<title>Binomial Simulation</title>
</head>
<link crossorigin="anonymous" href="https://cdnjs.cloudflare.com/ajax/libs/KaTeX/0.7.1/katex.min.css" integrity="sha384-wITovz90syo1dJWVh32uuETPVEtGigN07tkttEqPv+uR2SE/mbQcG7ATL28aI9H0" rel="stylesheet"/>
<style>
    /* LaTeX display environment will effect the LaTeX characters but not the layout on the page */
    span.katex-display {
      display: inherit; /* You may comment this out if you want the default behavior */
    }
  </style>
<script crossorigin="anonymous" integrity="sha384-/y1Nn9+QQAipbNQWU65krzJralCnuOasHncUFXGkdwntGeSvQicrYkiUBwsgUqc1" src="https://cdnjs.cloudflare.com/ajax/libs/KaTeX/0.7.1/katex.min.js"></script>
<script crossorigin="anonymous" integrity="sha384-dq1/gEHSxPZQ7DdrM82ID4YVol9BYyU7GbWlIwnwyPzotpoc57wDw/guX8EaYGPx" src="https://cdnjs.cloudflare.com/ajax/libs/KaTeX/0.7.1/contrib/auto-render.min.js"></script>
<!-- These scripts link to the Vega-Lite runtime -->
<script src="h

## <font color="red"> Exercise 1 </font>

Load the [binomial simulation app](https://wsu-datascience.github.io/binomial_simulation/index.html) in a browser and inspect some elements.  Verify that our usual technique (`requests` plus `BeautifulSoup`, as in the cells above) fails to capture the elements you find on the rendered page.
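One way to see why this happens, sketched with the standard library only: the HTML that `requests` returns is exactly what the server sent, and none of its scripts are executed. The page below is a made-up stand-in; the ids `vis` and `run_button` are hypothetical and are not taken from the binomial simulation app.

```python
from html.parser import HTMLParser


class IdCollector(HTMLParser):
    """Collect the id attribute of every tag in the static HTML."""

    def __init__(self):
        super().__init__()
        self.ids = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "id" and value:
                self.ids.add(value)


def static_ids(html):
    """Return the set of ids that a non-browser parser can see."""
    parser = IdCollector()
    parser.feed(html)
    return parser.ids


# Hypothetical page: the button only exists after a browser runs the script.
STATIC_HTML = """
<html><body>
<div id="vis"></div>
<script>
  var b = document.createElement("button");
  b.id = "run_button";
  document.getElementById("vis").appendChild(b);
</script>
</body></html>
"""

ids = static_ids(STATIC_HTML)
print("vis" in ids)         # the placeholder div is in the static HTML
print("run_button" in ids)  # the script-created button is not
```

A browser would run the script and add the button to the page, which is the gap that `selenium` fills.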

## Why use `selenium`

`selenium` drives a real browser, so the page's JavaScript runs and we can interact with the page by

1. Clicking on elements, including ones that JavaScript adds to the page
2. Filling in forms
3. Waiting for elements to load
4. Taking screenshots of the current state

## Installation -- Installing the browser driver.

First, we need to install a driver that allows Python to control your browser.  For Google Chrome this is the [Chrome Driver](https://chromedriver.chromium.org/getting-started); note that the local code cells below use Firefox with `geckodriver` instead.  Be sure to install the driver that matches your browser (see the [list of drivers](https://selenium-python.readthedocs.io/installation.html#drivers)).

## Installation -- Installing `selenium`

Next we will use `pip` to install the `selenium` package.

In [2]:
# For local machine
!pip install selenium

Collecting selenium
  Downloading selenium-3.141.0-py2.py3-none-any.whl (904 kB)
Installing collected packages: selenium
Successfully installed selenium-3.141.0


In [2]:
# For running in colab
!pip install kora -q

## Starting a `selenium` session

Note that this will pop up a browser window (Firefox in the local cell below) that is controlled by your Python `driver` object.  **Don't close this window.**

In [7]:
# For running locally (with a pop up browser)
from selenium import webdriver

DRIVER_PATH = '/mnt/c/Users/nm0257ms/Desktop/geckodriver.exe'
url = 'https://duckduckgo.com/'
driver = webdriver.Firefox(executable_path=DRIVER_PATH)
driver.get(url)

In [18]:
# For running in colab
url = 'https://duckduckgo.com/'
from kora.selenium import wd as driver
driver.get(url)

## Note on working in Google Colab

Since Colab doesn't give you the pop-up browser, open the same page in a separate browser window and mimic the steps you perform with the `selenium` driver, so that you can see what the driver sees.
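When there is no visible window, a screenshot is another way to check what the driver is looking at. The helper below is a minimal sketch: `snapshot`, the `screenshots` folder, and the `step_NN.png` naming are assumptions made for illustration, while `save_screenshot` is the driver method that writes the PNG.

```python
from pathlib import Path


def snapshot(driver, step, folder="screenshots"):
    """Save a PNG of the driver's current page and return the file path."""
    Path(folder).mkdir(parents=True, exist_ok=True)
    path = Path(folder) / f"step_{step:02d}.png"
    # save_screenshot returns False when the file could not be written.
    if not driver.save_screenshot(str(path)):
        raise RuntimeError(f"could not save screenshot to {path}")
    return path


# Usage with a live session (not run here):
# snapshot(driver, 1)
```

Calling it after each interaction leaves a numbered trail of images that can be opened from the notebook's file browser.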

In [8]:
driver.page_source



In [9]:
driver.title

'DuckDuckGo — Privacy, simplified.'

In [10]:
driver.current_url

'https://duckduckgo.com/'

## Locating the First Element

You can locate an element by

* Tag name
* Class name
* IDs
* XPath
* CSS selectors

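**Note:** `find_element_by_*` plays the same role as `find` in `bs4`: it returns the first match.  Unlike `find`, it raises `NoSuchElementException` instead of returning `None` when nothing matches.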
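The cells in this lecture were run with `selenium` 3.141.0. Selenium 4 deprecated the `find_element_by_*` methods and removed them in version 4.3 in favor of `find_element(By.ID, ...)`. The shim below is a hedged sketch for notebooks that must run on either version; `find_first` and `STRATEGIES` are hypothetical names, and the strategy strings are written out by hand so the block does not need to import `selenium`.

```python
# Locator strategy strings; these equal selenium's By.ID, By.TAG_NAME, and so on.
STRATEGIES = {
    "id": "id",
    "tag_name": "tag name",
    "class_name": "class name",
    "xpath": "xpath",
    "css_selector": "css selector",
}


def find_first(driver, how, value):
    """Find the first match with whichever API the installed selenium offers."""
    legacy = getattr(driver, f"find_element_by_{how}", None)
    if legacy is not None:
        return legacy(value)  # Selenium 3 style, as used in this lecture
    return driver.find_element(STRATEGIES[how], value)  # Selenium 4 style


# Usage with a live session (not run here):
# p = find_first(driver, "tag_name", "p")
```

In code written only for Selenium 4, importing `By` from `selenium.webdriver.common.by` and calling `driver.find_element(By.TAG_NAME, 'p')` directly is the more usual form.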

In [11]:
p = driver.find_element_by_tag_name('p')
p

<selenium.webdriver.firefox.webelement.FirefoxWebElement (session="f0a15560-a4e4-44eb-a663-f52599cb18cb", element="36ef0985-4305-4f38-b8c7-296d9c309b57")>

In [12]:
faq_button = driver.find_element_by_class_name('faq__button')
faq_button

<selenium.webdriver.firefox.webelement.FirefoxWebElement (session="f0a15560-a4e4-44eb-a663-f52599cb18cb", element="7e7c9e64-dc3f-4f80-b7fe-289092e36e3d")>

In [13]:
error_div = driver.find_element_by_id('error_homepage')
error_div

<selenium.webdriver.firefox.webelement.FirefoxWebElement (session="f0a15560-a4e4-44eb-a663-f52599cb18cb", element="51d7c2f0-ec7a-4ce7-9259-f2323e773f65")>

## Locating all the Elements

Use the plural form (`find_elements_by_*`) to get a list of every matching tag, like `find_all` in `bs4`.

In [14]:
a_tags = driver.find_elements_by_tag_name('a')
a_tags

[<selenium.webdriver.firefox.webelement.FirefoxWebElement (session="f0a15560-a4e4-44eb-a663-f52599cb18cb", element="5ed9f10c-fa0e-4ee1-8d8a-8b076f116115")>,
 <selenium.webdriver.firefox.webelement.FirefoxWebElement (session="f0a15560-a4e4-44eb-a663-f52599cb18cb", element="10952784-a88b-4bd1-bcc4-10f9b7c73f82")>,
 <selenium.webdriver.firefox.webelement.FirefoxWebElement (session="f0a15560-a4e4-44eb-a663-f52599cb18cb", element="3222283e-1cb3-4f6a-a767-66ebac4bb795")>,
 <selenium.webdriver.firefox.webelement.FirefoxWebElement (session="f0a15560-a4e4-44eb-a663-f52599cb18cb", element="df92d08b-50a7-4a81-8f86-5f6593bcb5f8")>,
 <selenium.webdriver.firefox.webelement.FirefoxWebElement (session="f0a15560-a4e4-44eb-a663-f52599cb18cb", element="b1a409e8-4606-433a-9acf-34ed41ceee66")>,
 <selenium.webdriver.firefox.webelement.FirefoxWebElement (session="f0a15560-a4e4-44eb-a663-f52599cb18cb", element="5883ce1a-2b51-4139-a7fe-32226f150ec6")>,
 <selenium.webdriver.firefox.webelement.FirefoxWebElement 

## The `WebElement` object

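A `WebElement` is the `selenium` object representing a single HTML tag on the live page.  It exposes the tag's text and attributes and lets us interact with the tag, for example by clicking it.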

In [15]:
p.text

''

In [16]:
faq_button.click()

In [17]:
faq_button.get_attribute('class')

'faq__button'

In [18]:
[a.get_attribute('href') for a in a_tags]

['https://duckduckgo.com/#',
 'https://duckduckgo.com/app',
 'https://duckduckgo.com/app',
 'https://duckduckgo.com/newsletter',
 'https://duckduckgo.com/newsletter',
 'https://spreadprivacy.com/tag/device-privacy-tips/',
 'https://spreadprivacy.com/tag/device-privacy-tips/',
 'https://duckduckgo.com/spread',
 'https://duckduckgo.com/spread',
 'https://duckduckgo.com/donations',
 'https://duckduckgo.com/donations',
 'https://duckduckgo.com/traffic',
 'https://duckduckgo.com/traffic',
 'https://donttrack.us/',
 'https://donttrack.us/',
 'https://twitter.com/duckduckgo',
 'https://reddit.com/r/duckduckgo',
 'https://spreadprivacy.com/',
 'https://duckduckgo.com/newsletter',
 'https://duckduckgo.com/about',
 'https://duckduckgo.com/bang',
 'https://duckduckgo.com/spread',
 'https://addons.mozilla.org/firefox/downloads/latest/duckduckgo-for-firefox/addon-385621-latest.xpi?src=external-home-top',
 'https://addons.mozilla.org/firefox/addon/duckduckgo-for-firefox/reviews/',
 'https://duckduck
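Many of the links above appear twice in a row. If only the distinct targets matter, a small helper can drop the repeats while keeping the page order. This is a sketch: `unique_in_order` is a hypothetical name, and the sample list is copied from the first few entries of the output above.

```python
def unique_in_order(items):
    """Drop None and repeated items, keeping the first occurrence of each."""
    # dict preserves insertion order, so the page order survives.
    return list(dict.fromkeys(item for item in items if item is not None))


hrefs = [
    'https://duckduckgo.com/#',
    'https://duckduckgo.com/app',
    'https://duckduckgo.com/app',
    'https://duckduckgo.com/newsletter',
    'https://duckduckgo.com/newsletter',
]
print(unique_in_order(hrefs))
```

The `None` filter is there because `get_attribute` returns `None` for an `a` tag that has no `href`.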

In [19]:
! pip install composable



In [20]:
from composable import pipeable
get_attribute = pipeable(lambda attr, tag: tag.get_attribute(attr))

(p
>> get_attribute('class')
)

'showcase__subheading'
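The `>>` syntax can look like magic, so here is a rough sketch of how such an operator can be built in plain Python. This is not `composable`'s actual implementation; `Pipe` and `get_key` are hypothetical names, and a dictionary stands in for a `WebElement`.

```python
class Pipe:
    """Wrap a function so that `data >> pipe(args)` calls func(*args, data)."""

    def __init__(self, func, *bound):
        self.func = func
        self.bound = bound

    def __call__(self, *args):
        # Supplying the leading arguments returns a new Pipe waiting for data.
        return Pipe(self.func, *self.bound, *args)

    def __rrshift__(self, data):
        # Python calls this for `data >> self` when data does not define `>>`.
        return self.func(*self.bound, data)


get_key = Pipe(lambda key, mapping: mapping[key])
tag = {"class": "showcase__subheading"}
print(tag >> get_key("class"))
```

The data always arrives as the last argument, which is why the lambdas in this lecture take the attribute or class name first and the tag or driver second.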

In [21]:
from composable.strict import map

(a_tags
>> map(get_attribute('href'))
)

['https://duckduckgo.com/#',
 'https://duckduckgo.com/app',
 'https://duckduckgo.com/app',
 'https://duckduckgo.com/newsletter',
 'https://duckduckgo.com/newsletter',
 'https://spreadprivacy.com/tag/device-privacy-tips/',
 'https://spreadprivacy.com/tag/device-privacy-tips/',
 'https://duckduckgo.com/spread',
 'https://duckduckgo.com/spread',
 'https://duckduckgo.com/donations',
 'https://duckduckgo.com/donations',
 'https://duckduckgo.com/traffic',
 'https://duckduckgo.com/traffic',
 'https://donttrack.us/',
 'https://donttrack.us/',
 'https://twitter.com/duckduckgo',
 'https://reddit.com/r/duckduckgo',
 'https://spreadprivacy.com/',
 'https://duckduckgo.com/newsletter',
 'https://duckduckgo.com/about',
 'https://duckduckgo.com/bang',
 'https://duckduckgo.com/spread',
 'https://addons.mozilla.org/firefox/downloads/latest/duckduckgo-for-firefox/addon-385621-latest.xpi?src=external-home-top',
 'https://addons.mozilla.org/firefox/addon/duckduckgo-for-firefox/reviews/',
 'https://duckduck

## Anatomy of a Search

<img src="https://github.com/wsu-DSCI330/module_5_lectures/blob/main/img/duckduck_search.png?raw=1" width="400"/>

In [22]:
input_field = driver.find_element_by_id('search_form_input_homepage')
input_button = driver.find_element_by_id('search_button_homepage')

In [23]:
input_field.send_keys('Silas Bergen')

In [24]:
input_button.click()

## IMPORTANT -- The page will change!

In [25]:
input_field = driver.find_element_by_id('search_form_input_homepage')

NoSuchElementException: Message: Unable to locate element: [id="search_form_input_homepage"]
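The error above occurs because the driver is now on the results page, where the home page's search box no longer exists. When it is unclear which page the driver is on, the plural finders are convenient because they return an empty list instead of raising. A minimal sketch follows; `find_or_none` is a hypothetical helper name.

```python
def find_or_none(driver, element_id):
    """Return the first element with this id, or None if the page has none."""
    matches = driver.find_elements_by_id(element_id)
    return matches[0] if matches else None


# Usage with a live session (not run here):
# field = find_or_none(driver, 'search_form_input_homepage')
# if field is None:
#     print('Not on the home page any more; locate the elements again.')
```

Any `WebElement` saved before the page changed should also be treated as stale and located again on the new page.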


## <font color="red"> Exercise 2 </font> 

Use `selenium` to get the links for all of the results shown on the page.

In [26]:
find_elements_by_class_name = pipeable(lambda class_name, driver: driver.find_elements_by_class_name(class_name))
find_element_by_tag_name = pipeable(lambda tag_name, driver: driver.find_element_by_tag_name(tag_name))

(driver
 >> find_elements_by_class_name("results_links_deep")
 >> map(find_element_by_tag_name('a'))
 >> map(get_attribute('href'))
)

['http://driftlessdata.space/',
 'https://www.linkedin.com/in/silas-bergen-3039b785',
 'https://www.ratemyprofessors.com/ShowRatings.jsp?tid=1946369',
 'https://www.instantcheckmate.com/people/silas-bergen/',
 'https://public.tableau.com/profile/silas.bergen#!',
 'https://github.com/silasbergen/',
 'https://myspace.com/287027346',
 'https://mylife.com/silas-bergen/e6039070026648',
 'http://driftlessdata.space/courses/dsci310/midterm/',
 'https://public.tableau.com/profile/silas.bergen#!/vizhome/Uncertainty_0/Dashboard1']
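For comparison, the same pipeline can be written as an ordinary function without `composable`. This is a sketch that assumes the Selenium 3 method names and the `results_links_deep` class used in the cell above; `result_links` is a hypothetical name.

```python
def result_links(driver, result_class="results_links_deep"):
    """Return the href of the first <a> tag inside each search result."""
    links = []
    for result in driver.find_elements_by_class_name(result_class):
        a_tag = result.find_element_by_tag_name('a')
        links.append(a_tag.get_attribute('href'))
    return links


# Usage with a live session (not run here):
# result_links(driver)
```

Each step of the loop corresponds to one `>>` stage: find the result blocks, find the link inside each block, then read its `href`.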

## <font color="red"> Exercise 3 </font> 

Suppose that we want more than one page of results.  Inspect the page and find the "more" button. **Hint:** Right-clicking the button won't work here; instead, inspect a nearby element and then navigate to the button in the inspector.

Once you have found this button, click it at least three times, then get all of the search result links.

In [27]:
driver.find_element_by_class_name('result--more__btn').click()

In [28]:
driver.find_element_by_class_name('result--more__btn').click()

In [29]:
driver.find_element_by_class_name('result--more__btn').click()

In [30]:
(driver
 >> find_elements_by_class_name("results_links_deep")
 >> map(find_element_by_tag_name('a'))
 >> map(get_attribute('href'))
)

['http://driftlessdata.space/',
 'https://www.linkedin.com/in/silas-bergen-3039b785',
 'https://www.ratemyprofessors.com/ShowRatings.jsp?tid=1946369',
 'https://www.instantcheckmate.com/people/silas-bergen/',
 'https://public.tableau.com/profile/silas.bergen#!',
 'https://github.com/silasbergen/',
 'https://myspace.com/287027346',
 'https://mylife.com/silas-bergen/e6039070026648',
 'http://driftlessdata.space/courses/dsci310/midterm/',
 'https://public.tableau.com/profile/silas.bergen#!/vizhome/Uncertainty_0/Dashboard1',
 'https://silasbergen.github.io/DSCI310-F17/Syllabus.html',
 'https://jimmyjhickey.github.io/old-site/silas.html',
 'https://silasbergen.github.io/DSCI310-F17/Midterm.html',
 'https://www.researchgate.net/scientific-contributions/Silas-Bergen-2009213336',
 'https://es-la.facebook.com/public/Silas-Bergen',
 'https://www.goodreads.com/user/show/55807579-silas-bergen',
 'https://www.findagrave.com/memorial/71758675/laverne-silas-dewees',
 'https://www.intelius.com/people-
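The three manual clicks can be wrapped in a loop that waits for new results after each click, since a scripted run has no pause between cells for the page to load. The sketch below uses a hand-rolled polling helper built on the standard library (`selenium` also ships `WebDriverWait` for this purpose); `wait_until` and `click_more` are hypothetical names, and the class names are the ones used in the cells above.

```python
import time


def wait_until(condition, timeout=10.0, poll=0.5):
    """Call condition() until it returns a truthy value or time runs out."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll)


def click_more(driver, times=3, result_class="results_links_deep",
               button_class="result--more__btn", timeout=10.0, poll=0.5):
    """Click the 'more' button `times` times; return the final result count."""
    for _ in range(times):
        before = len(driver.find_elements_by_class_name(result_class))
        driver.find_element_by_class_name(button_class).click()
        # Wait until the page shows more results than before the click.
        wait_until(
            lambda: len(driver.find_elements_by_class_name(result_class)) > before,
            timeout=timeout,
            poll=poll,
        )
    return len(driver.find_elements_by_class_name(result_class))


# Usage with a live session (not run here):
# click_more(driver, times=3)
```

Waiting on the result count, rather than sleeping for a fixed time, keeps the loop fast when the page is quick and fails loudly when the button stops producing results.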

## Headless operation

While the extra window is useful for exploring results, it becomes a nuisance when rerunning common searches.  In that case, we can run the browser in headless mode, which means no window is shown.
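A headless browser has no window to close by hand, so a forgotten `quit()` leaves the process running in the background. One way to guard against that is a small context manager; this is a sketch in which `managed_driver` and the zero-argument factory `make_driver` are hypothetical names.

```python
from contextlib import contextmanager


@contextmanager
def managed_driver(make_driver):
    """Create a driver, hand it to the with-block, and always quit it."""
    driver = make_driver()
    try:
        yield driver
    finally:
        # Runs even if the with-block raises, so no browser is left behind.
        driver.quit()


# Usage with a real browser (not run here), assuming `webdriver`, `options`,
# and DRIVER_PATH are defined as in the cell that follows:
# with managed_driver(lambda: webdriver.Firefox(options=options,
#                                               executable_path=DRIVER_PATH)) as driver:
#     driver.get("https://www.nintendo.com/")
#     print(driver.title)
```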

In [32]:
from selenium import webdriver
from selenium.webdriver.firefox.options import Options

options = Options()
options.headless = True
options.add_argument("--window-size=1920,1200")

driver = webdriver.Firefox(options=options, executable_path=DRIVER_PATH)
driver.get("https://www.nintendo.com/")
print(driver.page_source)
driver.quit()

<html class="js flexbox canvas canvastext webgl no-touch geolocation postmessage no-websqldatabase indexeddb hashchange history draganddrop websockets rgba hsla multiplebgs backgroundsize borderimage borderradius boxshadow textshadow opacity cssanimations csscolumns cssgradients no-cssreflections csstransforms csstransforms3d csstransitions fontface generatedcontent video audio localstorage sessionstorage webworkers applicationcache svg inlinesvg smil svgclippaths no-isios flexboxlegacy supports alps-ua-firefox alps-os-windows" lang="en-US"><head><style class="vjs-styles-defaults">
      .video-js {
        width: 300px;
        height: 150px;
      }

      .vjs-fluid {
        padding-top: 56.25%
      }
    </style>
    <meta charset="UTF-8">
    <title>
    Nintendo - Official Site - Video Game Consoles, Games
</title>
    
    <meta name="description" content="Discover Nintendo Switch, the video game system you can play at home or on the go. Plus, get the latest games and news on 