
# Lecture 5.7 - Web Scraping with Selenium

In [None]:
import requests
from bs4 import BeautifulSoup

s = requests.Session()
r = s.get('https://wsu-datascience.github.io/binomial_simulation/index.html')
binom_sim = BeautifulSoup(r.content, 'html.parser')

In [None]:
binom_sim

<!DOCTYPE html>

<html lang="en">
<head>
<title>Binomial Simulation</title>
</head>
<link crossorigin="anonymous" href="https://cdnjs.cloudflare.com/ajax/libs/KaTeX/0.7.1/katex.min.css" integrity="sha384-wITovz90syo1dJWVh32uuETPVEtGigN07tkttEqPv+uR2SE/mbQcG7ATL28aI9H0" rel="stylesheet"/>
<style>
    /* LaTeX display environment will effect the LaTeX characters but not the layout on the page */
    span.katex-display {
      display: inherit; /* You may comment this out if you want the default behavior */
    }
  </style>
<script crossorigin="anonymous" integrity="sha384-/y1Nn9+QQAipbNQWU65krzJralCnuOasHncUFXGkdwntGeSvQicrYkiUBwsgUqc1" src="https://cdnjs.cloudflare.com/ajax/libs/KaTeX/0.7.1/katex.min.js"></script>
<script crossorigin="anonymous" integrity="sha384-dq1/gEHSxPZQ7DdrM82ID4YVol9BYyU7GbWlIwnwyPzotpoc57wDw/guX8EaYGPx" src="https://cdnjs.cloudflare.com/ajax/libs/KaTeX/0.7.1/contrib/auto-render.min.js"></script>
<!-- These scripts link to the Vega-Lite runtime -->
<script src="h

## <font color="red"> Exercise 1 </font>

Load the [binomial simulation app](https://wsu-datascience.github.io/binomial_simulation/index.html) in a browser and inspect some elements.  Verify that our usual technique (`requests` plus `BeautifulSoup`) failed to load the elements you find on the page.
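
One way to see why the static request comes up short is the small sketch below. It parses a made-up page (the `STATIC_HTML` string is hypothetical, not the real app) with Python's built-in `html.parser`, the same parser that `BeautifulSoup(..., 'html.parser')` relies on. The script in the page would add a button in a real browser, but a static parser never runs it.

```python
from html.parser import HTMLParser

# Hypothetical page: the server sends an empty container plus a script
# that would add a button once a browser actually runs it.
STATIC_HTML = """
<html>
  <body>
    <div id="vis"></div>
    <script>
      var button = document.createElement("button");
      document.getElementById("vis").appendChild(button);
    </script>
  </body>
</html>
"""


class TagCollector(HTMLParser):
    """Record the name of every start tag the parser encounters."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


collector = TagCollector()
collector.feed(STATIC_HTML)
print(collector.tags)  # ['html', 'body', 'div', 'script']
```

The button never shows up in the parsed tags because it only exists after the script executes, which is the gap a browser-driving tool fills.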

## Why use `selenium`

`selenium` drives a real browser, which allows us to interact with the web page by

1. Clicking on elements on the page (including those created by JavaScript)
2. Filling in forms
3. Waiting for elements to load
4. Taking screenshots of the current state

## Installation -- Installing the browser driver

First, we need to install a special driver that allows Python to interact with your browser.  Here we will be using Google Chrome and the [Chrome Driver](https://chromedriver.chromium.org/getting-started).  Be sure to install this driver (or the [one for your favorite browser](https://selenium-python.readthedocs.io/installation.html#drivers)).

## Installation -- Installing `selenium`

Next we will use `pip` to install the `selenium` package.

In [None]:
# For local machine
!pip install selenium


In [None]:
# For running in colab
!pip install kora -q

## Starting a `selenium` session

Note that this will pop up a Chrome window that is controlled by your Python `driver` object.  **Don't close this window.**

In [None]:
# For running locally (with a pop up browser)
from selenium import webdriver

DRIVER_PATH = '/Users/bn8210wy/Downloads/chromedriver'  # replace with the path to your own chromedriver
url = 'https://duckduckgo.com/'
driver = webdriver.Chrome(executable_path=DRIVER_PATH)
driver.get(url)

WebDriverException: ignored

In [None]:
# For running in colab
url = 'https://duckduckgo.com/'
from kora.selenium import wd as driver
driver.get(url)

## Note on working in Google Colab

Since you don't have the pop-up browser in Colab, you will need to open another browser window and mimic the steps you perform with the `selenium` driver.

In [None]:
driver.page_source

'<html lang="en_US" class="has-zcm   is-mobile-header-exp js no-touch opacity csstransforms3d csstransitions svg cssfilters is-not-mobile-device full-urls has-footer"><head><script type="text/javascript" src="s2477.js"></script><meta http-equiv="content-type" content="text/html; charset=utf-8"><title>Silas Bergen at DuckDuckGo</title><link rel="stylesheet" href="/s1936.css" type="text/css"><link rel="stylesheet" href="/r1936.css" type="text/css"><meta name="robots" content="noindex,nofollow"><meta name="referrer" content="origin"><meta name="apple-mobile-web-app-title" content="Silas Bergen"><link rel="preload" href="/font/ProximaNova-Reg-webfont.woff2" as="font" type="font/woff2" crossorigin="anonymous"><link rel="preload" href="/font/ProximaNova-Sbold-webfont.woff2" as="font" type="font/woff2" crossorigin="anonymous"><link rel="shortcut icon" href="/favicon.ico" type="image/x-icon"><link id="icon60" rel="apple-touch-icon" href="/assets/icons/meta/DDG-iOS-icon_60x60.png?v=2"><link id=

In [None]:
driver.title

'Silas Bergen at DuckDuckGo'

In [None]:
driver.current_url

'https://duckduckgo.com/'

## Locating the First Element

You can locate an element by

* Tag name
* Class name
* IDs
* XPath
* CSS selectors

**Note:** This is analogous to `find` in `bs4`: it returns only the first match.
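
Newer Selenium releases (4.x) replace the `find_element_by_*` helpers used in this lecture with a single `find_element(by, value)` method, where `by` is a strategy string such as `By.TAG_NAME`. The sketch below shows the idea under that assumption. The `STRATEGIES` dictionary, the `find_first` and `find_all` helpers, and the `RecordingDriver` stand-in are all hypothetical names introduced here so the example runs without a browser.

```python
# Locator strategy strings; selenium exposes the same values as
# By.ID, By.TAG_NAME, By.CLASS_NAME, By.XPATH, and By.CSS_SELECTOR.
STRATEGIES = {
    "id": "id",
    "tag_name": "tag name",
    "class_name": "class name",
    "xpath": "xpath",
    "css_selector": "css selector",
}


def find_first(driver, how, value):
    """Locate the first match with the generic find_element(by, value) call."""
    return driver.find_element(STRATEGIES[how], value)


def find_all(driver, how, value):
    """Locate every match with the generic find_elements(by, value) call."""
    return driver.find_elements(STRATEGIES[how], value)


class RecordingDriver:
    """Hypothetical stand-in for a real driver: it just echoes the locator."""

    def find_element(self, by, value):
        return (by, value)

    def find_elements(self, by, value):
        return [(by, value)]


demo = RecordingDriver()
print(find_first(demo, "tag_name", "p"))  # ('tag name', 'p')
```

With a real driver, `find_first(driver, "tag_name", "p")` would play the role of `driver.find_element_by_tag_name('p')`.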

In [None]:
p = driver.find_element_by_tag_name('p')
p

<p class="showcase__subheading">Protect your data on every device.</p>

In [None]:
faq_button = driver.find_element_by_class_name('faq__button')
faq_button

<button class="faq__button" aria-expanded="true" aria-disabled="true" aria-controls="faq-answer-0" id="faq-btn-0"><svg width="20" height="21" viewBox="0 0 20 21" fill="none" xmlns="http://www.w3.org/2000/svg"><circle cx="10" cy="10.5" r="10" transform="rotate(-180 10 10.5)" fill="#E5E5E5"></circle><path d="M9.94454 12.8483L13.5355 9.25736C13.7308 9.0621 14.0474 9.0621 14.2426 9.25736C14.4379 9.45262 14.4379 9.7692 14.2426 9.96447L10.3536 13.8536C10.2418 13.9653 10.0903 14.0131 9.94454 13.9969C9.79879 14.0131 9.64729 13.9653 9.53553 13.8536L5.64645 9.96447C5.45118 9.7692 5.45118 9.45262 5.64645 9.25736C5.84171 9.0621 6.15829 9.0621 6.35355 9.25736L9.94454 12.8483Z" fill="#353748"></path></svg></button>

In [None]:
error_div = driver.find_element_by_id('error_homepage')
error_div

<div id="error_homepage"></div>

## Locating all the Elements

Use the plural form (i.e. `find_elements_by_...`) to get every matching tag, like `find_all` in `bs4`.

In [None]:
a_tags = driver.find_elements_by_tag_name('a')
a_tags

[<a tabindex="-1" href="/?t=h_&amp;" class="header__logo-wrap js-header-logo"><span class="header__logo js-logo-ddg">DuckDuckGo</span></a>,
 <a id="search_dropdown" class="search__dropdown" href="javascript:;" tabindex="4"></a>,
 <a class="no-visited js-acp-footer-link" href="/bang">Learn More</a>,
 <a data-zci-link="web" class="zcm__link  js-zci-link  js-zci-link--web  is-active" href="#">All</a>,
 <a data-zci-link="images" class="zcm__link  js-zci-link  js-zci-link--images" href="#">Images</a>,
 <a data-zci-link="videos" class="zcm__link  js-zci-link  js-zci-link--videos" href="#">Videos</a>,
 <a data-zci-link="news" class="zcm__link  js-zci-link  js-zci-link--news" href="#">News</a>,
 <a data-zci-link="maps_expanded" class="zcm__link  js-zci-link  js-zci-link--maps_expanded" href="#">Maps</a>,
 <a class="zcm__link dropdown__button js-dropdown-button ">Settings</a>,
 <a class="header__button--menu  js-side-menu-open" href="#">⇶</a>,
 <a href="/app" class="eighteen js-hl-item" aria-hi

## The `WebElement` object

A `WebElement` is the `selenium` object representing an HTML tag on the live page.

In [None]:
p.text

''

In [None]:
faq_button.click()

StaleElementReferenceException: ignored

In [None]:
faq_button.get_attribute('class')

StaleElementReferenceException: ignored
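
A `WebElement` is a reference to a node in the live page, so a `StaleElementReferenceException` like the ones above usually means the page changed after the element was located. A common remedy is to locate the element again right before using it. The sketch below illustrates that pattern. The helper `with_fresh_element` and the fake classes are hypothetical and stand in for a real driver and for `selenium.common.exceptions.StaleElementReferenceException`.

```python
def with_fresh_element(locate, action, stale_errors, attempts=3):
    """Re-locate the element before each attempt, retrying if it has gone stale."""
    for attempt in range(1, attempts + 1):
        element = locate()  # always fetch a fresh reference
        try:
            return action(element)
        except stale_errors:
            if attempt == attempts:
                raise  # out of retries: surface the error


# --- Hypothetical stand-ins so the sketch runs without a browser ---
class FakeStaleError(Exception):
    """Plays the role of selenium's StaleElementReferenceException."""


class FakeButton:
    def __init__(self, stale):
        self.stale = stale

    def click(self):
        if self.stale:
            raise FakeStaleError("element is no longer attached to the page")
        return "clicked"


# The first lookup returns a stale button, the second a fresh one.
buttons = iter([FakeButton(stale=True), FakeButton(stale=False)])
result = with_fresh_element(lambda: next(buttons), lambda b: b.click(), FakeStaleError)
print(result)  # clicked
```

With a real driver, `locate` could be `lambda: driver.find_element_by_class_name('faq__button')` and `stale_errors` the real exception class.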

In [None]:
[a.get_attribute('href') for a in a_tags]

['https://duckduckgo.com/?t=h_&',
 'javascript:;',
 'https://duckduckgo.com/bang',
 'https://duckduckgo.com/?q=Silas+Bergen&t=h_&ia=web#',
 'https://duckduckgo.com/?q=Silas+Bergen&t=h_&ia=web#',
 'https://duckduckgo.com/?q=Silas+Bergen&t=h_&ia=web#',
 'https://duckduckgo.com/?q=Silas+Bergen&t=h_&ia=web#',
 'https://duckduckgo.com/?q=Silas+Bergen&t=h_&ia=web#',
 None,
 'https://duckduckgo.com/?q=Silas+Bergen&t=h_&ia=web#',
 'https://duckduckgo.com/app',
 'https://duckduckgo.com/app',
 'https://duckduckgo.com/newsletter',
 'https://duckduckgo.com/newsletter',
 'https://spreadprivacy.com/tag/device-privacy-tips/',
 'https://spreadprivacy.com/tag/device-privacy-tips/',
 'https://duckduckgo.com/spread',
 'https://duckduckgo.com/spread',
 'https://duckduckgo.com/donations',
 'https://duckduckgo.com/donations',
 'https://duckduckgo.com/traffic',
 'https://duckduckgo.com/traffic',
 'https://donttrack.us/',
 'https://donttrack.us/',
 'https://twitter.com/duckduckgo',
 'https://reddit.com/r/duck

In [None]:
! pip install composable



In [None]:
from composable import pipeable
get_attribute = pipeable(lambda attr, tag: tag.get_attribute(attr))

(p
>> get_attribute('class')
)

StaleElementReferenceException: ignored
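
To make the `>>` syntax less mysterious, here is a rough sketch of how a pipeable function could be built with Python's reflected right-shift operator. This is not `composable`'s actual implementation. The `Pipeable` class and the `get_item` and `each` helpers are hypothetical names used only for illustration.

```python
class Pipeable:
    """Wrap a function so its last argument can be supplied with >>."""

    def __init__(self, func, *args):
        self.func = func
        self.args = args

    def __call__(self, *args):
        # Supplying the leading arguments returns a new Pipeable
        # that is still waiting for the final argument.
        return Pipeable(self.func, *self.args, *args)

    def __rrshift__(self, value):
        # Python evaluates `value >> self` here when `value` does not
        # handle >> itself; value becomes the final argument.
        return self.func(*self.args, value)


# Look up a key in a mapping (a stand-in for tag.get_attribute(attr)).
get_item = Pipeable(lambda key, mapping: mapping[key])

# Apply a pipeable step to every item in a list (a stand-in for map).
each = Pipeable(lambda step, items: [item >> step for item in items])

print({"href": "/app"} >> get_item("href"))  # /app
print([{"href": "/a"}, {"href": "/b"}] >> each(get_item("href")))  # ['/a', '/b']
```

The data flows left to right, which is why the lecture's pipelines read as "take the driver, find the results, then pull out each link".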

In [None]:
from composable.strict import map

(a_tags
>> map(get_attribute('href'))
)

['https://duckduckgo.com/?t=h_&',
 'javascript:;',
 'https://duckduckgo.com/bang',
 'https://duckduckgo.com/?q=Silas+Bergen&t=h_&ia=web#',
 'https://duckduckgo.com/?q=Silas+Bergen&t=h_&ia=web#',
 'https://duckduckgo.com/?q=Silas+Bergen&t=h_&ia=web#',
 'https://duckduckgo.com/?q=Silas+Bergen&t=h_&ia=web#',
 'https://duckduckgo.com/?q=Silas+Bergen&t=h_&ia=web#',
 None,
 'https://duckduckgo.com/?q=Silas+Bergen&t=h_&ia=web#',
 'https://duckduckgo.com/app',
 'https://duckduckgo.com/app',
 'https://duckduckgo.com/newsletter',
 'https://duckduckgo.com/newsletter',
 'https://spreadprivacy.com/tag/device-privacy-tips/',
 'https://spreadprivacy.com/tag/device-privacy-tips/',
 'https://duckduckgo.com/spread',
 'https://duckduckgo.com/spread',
 'https://duckduckgo.com/donations',
 'https://duckduckgo.com/donations',
 'https://duckduckgo.com/traffic',
 'https://duckduckgo.com/traffic',
 'https://donttrack.us/',
 'https://donttrack.us/',
 'https://twitter.com/duckduckgo',
 'https://reddit.com/r/duck

## Anatomy of a Search

<img src="https://github.com/wsu-DSCI330/module_5_lectures/blob/main/img/duckduck_search.png?raw=1" width="400"/>

In [None]:
input_field = driver.find_element_by_id('search_form_input_homepage')
input_button = driver.find_element_by_id('search_button_homepage')

In [None]:
input_field.send_keys('Silas Bergen')

In [None]:
input_button.click()

## IMPORTANT -- The page will change!

In [None]:
input_field = driver.find_element_by_id('search_form_input_homepage')

NoSuchElementException: ignored
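
Because the browser has moved to a new page, elements from the old page are gone, and elements on the new page may take a moment to appear. Selenium ships `WebDriverWait` (in `selenium.webdriver.support.ui`) for this purpose. The sketch below shows the underlying polling idea using only the standard library. The `wait_until` helper and the fake `results_present` condition are hypothetical.

```python
import time


def wait_until(condition, timeout=10.0, poll=0.5):
    """Call condition() repeatedly until it returns something truthy or time runs out."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll)


# Hypothetical stand-in for a page whose results appear on the third look.
looks = {"count": 0}


def results_present():
    looks["count"] += 1
    return ["result"] if looks["count"] >= 3 else []


found = wait_until(results_present, timeout=2.0, poll=0.01)
print(found)  # ['result']
```

With a real driver, a condition such as `lambda: driver.find_elements_by_class_name('results_links_deep')` fits this helper, since the plural finders return an empty list rather than raising when nothing matches yet.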

## <font color="red"> Exercise 2 </font> 

Use `selenium` to get the links for all of the results shown on the page.

In [None]:
# Pipeable wrappers around the driver's locator methods, so they can be chained with >>.
find_elements_by_class_name = pipeable(lambda class_name, driver: driver.find_elements_by_class_name(class_name))
# Caution: despite its singular name, this helper calls the plural method and returns a list.
find_element_by_class_name = pipeable(lambda class_name, driver: driver.find_elements_by_class_name(class_name))
find_elements_by_id = pipeable(lambda id_, driver: driver.find_elements_by_id(id_))
find_elements_by_tag_name = pipeable(lambda tag_name, driver: driver.find_elements_by_tag_name(tag_name))
find_element_by_tag_name = pipeable(lambda tag_name, driver: driver.find_element_by_tag_name(tag_name))
(driver
 >> find_elements_by_class_name('results_links_deep')
 >> map(find_element_by_tag_name('a'))
 >> map(get_attribute('href'))
)

['http://driftlessdata.space/',
 'https://www.pubfacts.com/author/Silas+Bergen',
 'https://www.facebook.com/silas.bergen.5',
 'https://www.semanticscholar.org/author/Silas-Bergen/48578284',
 'https://www.goodreads.com/user/show/55807579-silas-bergen',
 'https://www.instagram.com/silasbjerregaard/?hl=en',
 'https://en.visitbergen.com/visitor-information/travel-information/skyss-bus-and-bergen-light-rail-p913973',
 'https://en.wikipedia.org/wiki/Silas_Young',
 'https://highlander.fandom.com/wiki/Silas',
 'https://www.imdb.com/title/tt0081929/']

## <font color="red"> Exercise 3 </font> 

Suppose that we want more than one page of results.  Inspect the page and find the button that loads more results. **Hint:** Right-clicking won't work here; you will need to inspect a nearby element and then navigate to the button in the browser's developer tools.

Once you have found this button, click it at least three times, then get all of the search result links.

In [None]:
driver.find_element_by_class_name('result--more__btn').click()

In [None]:
driver.find_element_by_class_name('result--more__btn').click()

In [None]:
driver.find_element_by_class_name('result--more__btn').click()
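
The three identical cells above can also be written as a loop. The sketch below is one possible way to do that. The `click_repeatedly` helper and the `CountingButton` stand-in are hypothetical, and the fixed pause is a crude assumption about how long new results take to load.

```python
import time


def click_repeatedly(locate, times, pause=1.0):
    """Locate a button and click it `times` times, pausing after each click."""
    for _ in range(times):
        locate().click()   # look the button up again before every click
        time.sleep(pause)  # crude pause so new results can load


class CountingButton:
    """Hypothetical stand-in that counts clicks, so the sketch runs without a browser."""

    def __init__(self):
        self.clicks = 0

    def click(self):
        self.clicks += 1


more_button = CountingButton()
click_repeatedly(lambda: more_button, times=3, pause=0)
print(more_button.clicks)  # 3
```

With a real driver, `locate` could be `lambda: driver.find_element_by_class_name('result--more__btn')`. Looking the button up on every pass avoids holding on to a reference that may go stale once the page updates.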

In [None]:
(driver
 >> find_elements_by_class_name('results_links_deep')
 >> map(find_element_by_tag_name('a'))
 >> map(get_attribute('href'))
 )

['http://driftlessdata.space/',
 'https://www.pubfacts.com/author/Silas+Bergen',
 'https://www.facebook.com/silas.bergen.5',
 'https://www.semanticscholar.org/author/Silas-Bergen/48578284',
 'https://www.goodreads.com/user/show/55807579-silas-bergen',
 'https://www.instagram.com/silasbjerregaard/?hl=en',
 'https://en.visitbergen.com/visitor-information/travel-information/skyss-bus-and-bergen-light-rail-p913973',
 'https://en.wikipedia.org/wiki/Silas_Young',
 'https://highlander.fandom.com/wiki/Silas',
 'https://www.imdb.com/title/tt0081929/',
 'https://www.lilsebergen.be/',
 'https://www.wowhead.com/item=156634/silas-vial-of-continuous-curing',
 'https://vimeo.com/16912537',
 'https://www.tripadvisor.de/Tourism-g190502-Bergen_Hordaland_Western_Norway-Vacations.html',
 'https://www.silasdeanepawn.com/',
 'https://www.sats.no/treningssentre/bergen/bergen-sats/',
 'https://feheroes.gamepedia.com/Silas:_Loyal_Knight',
 'https://gezimanya.com/bergen',
 'https://www.dict.cc/?s=bergen',
 'htt

## Headless operation

While the extra window is useful for exploring results, it gets annoying when rerunning common searches.  In this case, we want to run the search in headless mode.

In [None]:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless")  # run Chrome without opening a window
options.add_argument("--window-size=1920,1200")

driver = webdriver.Chrome(options=options, executable_path=DRIVER_PATH)
driver.get("https://www.nintendo.com/")
print(driver.page_source)
driver.quit()

WebDriverException: ignored