# Selenium Bootcamp

## Why Selenium?

Most of the time, scraping methods like RegEx or BeautifulSoup will be fine for dealing with websites. However, some websites handle things a little bit differently. Let's take a look. Run the following two cells:

In [16]:
import os

In [None]:
# "open" is the macOS launcher; use "xdg-open" on Linux or "start" on Windows
os.system("open my_website.html")

### Let's try downloading the site and scraping the data. 

In [17]:
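from urllib.request import urlopen

# Build a file:// URL for the local page (assumes a POSIX-style path)
my_website_url = "file://" + os.getcwd() + "/my_website.html"
# Decode the bytes into text (assuming UTF-8) instead of calling str() on a bytes object
html = urlopen(my_website_url).read().decode("utf-8")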
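
The string concatenation above assumes a POSIX-style path. A more portable sketch, assuming the same `my_website.html` file name, lets `pathlib` build the `file://` URL instead. `local_file_url` is a hypothetical helper name, not something the notebook defines.

```python
from pathlib import Path

def local_file_url(filename):
    """Return a file:// URL for a file in the current working directory."""
    # resolve() makes the path absolute; as_uri() turns it into a URL
    return (Path.cwd() / filename).resolve().as_uri()

my_website_url = local_file_url("my_website.html")
print(my_website_url)
```

The resulting URL can be passed to `urlopen` or, later on, to `driver.get` in the same way as the hand-built string.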

### Seems to work fine. Let's try scraping some data from it.

In [38]:
import re

In [None]:
static_data = re.findall(r'<td class = "static_input">(.+?)<\/td><td class = "static_output">(.+?)<\/td>', html)
static_data

### Looks good! But I think we have some data missing... Not a problem, let's try to scrape it

In [None]:
dynamic_data = re.findall(r'<td class = "dynamic_input">(.+?)<\/td><td class = "dynamic_output">(.+?)<\/td>', html)
dynamic_data

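## What went wrong?

The second search comes back empty because the dynamic rows are not in the HTML file we downloaded. They are added by JavaScript (`my_script.js`) only after the page is loaded in a browser and the button is clicked, and `urlopen` never runs that script.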

## Cases where you might need Selenium
* Data is generated through interaction, e.g. searching, clicking "more", etc.
* Data is loaded through AJAX requests
* The website requires a login of some kind
* Dealing with HTML parsing and regex is just too damn annoying
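
To make the first two cases concrete, here is a small, self-contained sketch. The HTML string is a made-up stand-in for a page like `my_website.html` (its exact markup is an assumption): the static rows are in the markup, while the dynamic rows would only be created by JavaScript running in a browser, so a regex over the downloaded source cannot see them. `scrape_pairs` is a hypothetical helper.

```python
import re

# Hypothetical page source: the dynamic table is empty until a script fills it
raw_html = """
<table id="static_table">
  <tr><td class="static_input">1</td><td class="static_output">2</td></tr>
  <tr><td class="static_input">2</td><td class="static_output">4</td></tr>
</table>
<button id="btnMore">More</button>
<table id="dynamic_table"></table>
<script src="my_script.js"></script>
"""

def scrape_pairs(html, prefix):
    """Return (input, output) pairs from cells classed <prefix>_input / <prefix>_output."""
    pattern = (
        r'<td class="{0}_input">(.+?)</td>'
        r'<td class="{0}_output">(.+?)</td>'
    ).format(prefix)
    return re.findall(pattern, html)

static_rows = scrape_pairs(raw_html, "static")
dynamic_rows = scrape_pairs(raw_html, "dynamic")
print(static_rows)   # rows that exist in the raw markup
print(dynamic_rows)  # empty list: nothing has run the script
```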

## Download Instructions

1. Install Selenium for Python: ```python3 -m pip install selenium```. [Full Instructions](https://selenium-python.readthedocs.io/installation.html)

2. [Install the Chrome WebDriver (ChromeDriver)](https://sites.google.com/a/chromium.org/chromedriver/downloads), choosing the version that matches your installed Chrome.

3. Move the downloaded `chromedriver` file to this folder.
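
Before moving on, it can help to confirm that Python can actually find the driver file from step 3. This is a minimal sketch, assuming the binary is named `chromedriver` (`chromedriver.exe` on Windows). `locate_driver` is a hypothetical helper, not part of Selenium.

```python
import os
import shutil

def locate_driver(name="chromedriver", folder="."):
    """Return a path to the driver binary, or None if it cannot be found."""
    local = os.path.join(folder, name)
    if os.path.isfile(local):
        return os.path.abspath(local)
    # Fall back to searching the directories listed on PATH
    return shutil.which(name)

driver_path = locate_driver()
print(driver_path or "chromedriver not found: check step 3")
```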


### Great! Now let's get started

In [1]:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions

In [43]:
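from selenium.webdriver.chrome.service import Service

# Selenium 4 takes the driver path through a Service object;
# on Selenium 3, use webdriver.Chrome("chromedriver") instead
driver = webdriver.Chrome(service=Service("chromedriver"))

# for Windows users
# driver = webdriver.Chrome(service=Service("chromedriver.exe"))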

### Some helper functions that will be useful later

In [3]:
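def wait_until_present(driver, timeout, *locator):
    # Wait up to `timeout` seconds for an element matching locator, e.g. (By.NAME, "q")
    return WebDriverWait(driver, timeout).until(expected_conditions.presence_of_element_located(locator))

def find_element_by_text(driver, text):
    # Return the first element whose text contains `text` (text must not contain a single quote)
    return driver.find_element(By.XPATH, "//*[contains(text(), '{}')]".format(text))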
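
One caveat with `find_element_by_text`: the text is pasted between single quotes, so a string such as `Today's tasks` produces an invalid XPath. A possible workaround, sketched below with hypothetical helper names (`xpath_literal`, `text_xpath`), builds a safely quoted XPath literal and falls back to `concat()` when the text contains both quote types.

```python
def xpath_literal(text):
    """Quote text as an XPath 1.0 string literal."""
    if "'" not in text:
        return "'{}'".format(text)
    if '"' not in text:
        return '"{}"'.format(text)
    # Both quote types present: join the single-quote-free pieces with concat()
    parts = text.split("'")
    return "concat(" + ", \"'\", ".join("'{}'".format(p) for p in parts) + ")"

def text_xpath(text):
    """Build an XPath matching any element whose text contains `text`."""
    return "//*[contains(text(), {})]".format(xpath_literal(text))

print(text_xpath("Harvard"))
print(text_xpath("Today's tasks"))
```

The resulting string could then be handed to `driver.find_element(By.XPATH, text_xpath(text))`.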

### Let's try navigating to a url

In [4]:
driver.get("http://google.com")

### Not too bad, now let's see if we can interact with the webpage

In [5]:
search_box = wait_until_present(driver, 10, By.NAME, "q")

In [6]:
search_box.send_keys("Harvard" + Keys.ENTER)
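
Conceptually, the explicit wait used for the search box repeatedly checks a condition until it succeeds or a timeout expires. The following is a simplified, browser-free sketch of that idea, not Selenium's actual implementation. `wait_until` and `fake_lookup` are hypothetical names used only for illustration.

```python
import time

def wait_until(condition, timeout=10.0, poll=0.5):
    """Call condition() until it returns a truthy value or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within {} s".format(timeout))
        time.sleep(poll)

# Simulate an element that only "appears" on the third lookup
attempts = {"count": 0}

def fake_lookup():
    attempts["count"] += 1
    return "search box" if attempts["count"] >= 3 else None

found = wait_until(fake_lookup, timeout=2.0, poll=0.01)
print(found, "after", attempts["count"], "attempts")
```

Waiting this way is usually more robust than a fixed `time.sleep`, because the code continues as soon as the element shows up.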

### A quick demo of what this can be useful for

In [14]:
# "r" was the class of Google's result containers when this was written; it may have changed since
result_divs = driver.find_elements(By.CLASS_NAME, "r")
link_elts = [div.find_element(By.TAG_NAME, "a") for div in result_divs if div.text != ""]
links = [(elt.text, elt.get_attribute("href")) for elt in link_elts]
links

[('Harvard University\n\nhttps://www.harvard.edu', 'https://www.harvard.edu/'),
 ('Admissions', 'https://college.harvard.edu/admissions'),
 ('Admissions & Aid', 'https://www.harvard.edu/admissions-aid'),
 ('Harvard College', 'https://college.harvard.edu/'),
 ('Visit Harvard', 'https://www.harvard.edu/on-campus/visit-harvard'),
 ('Harvard University - Wikipedia\n\nhttps://en.wikipedia.org › wiki › Harvard_University',
  'https://en.wikipedia.org/wiki/Harvard_University'),
 ('Harvard Business School\n\nhttps://www.hbs.edu › Pages',
  'https://www.hbs.edu/Pages/default.aspx'),
 ('Incoming Harvard Freshman Deported After Visa Revoked ...\n\nhttps://www.thecrimson.com › article › incoming-freshman-deported',
  'https://www.thecrimson.com/article/2019/8/27/incoming-freshman-deported/'),
 ('Harvard University | History & Facts | Britannica.com\n\nhttps://www.britannica.com › topic › Harvard-University',
  'https://www.britannica.com/topic/Harvard-University')]
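
The link text above bundles the page title with Google's breadcrumb line, separated by a blank line. A small post-processing sketch, using two of the entries shown above as sample data, keeps only the title. `clean_title` is a hypothetical helper.

```python
def clean_title(raw_text):
    """Keep only the first line of a result's link text."""
    return raw_text.split("\n", 1)[0].strip()

# Sample entries copied from the output above
sample_links = [
    ("Harvard University\n\nhttps://www.harvard.edu", "https://www.harvard.edu/"),
    ("Admissions", "https://college.harvard.edu/admissions"),
]

cleaned = [(clean_title(text), href) for text, href in sample_links]
print(cleaned)
```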

### Now let's try to do some basic interaction! Try navigating to my website from earlier!

In [31]:
### SOLUTION ###
driver.get(my_website_url)

### It'd be nice if we could click the button... Let's try to get that button into a variable

In [34]:
### SOLUTION ###
my_button = driver.find_element(By.ID, "btnMore")

### Let's click it!

In [35]:
my_button.click()

### The new data is here! Let's try to scrape it. First, let's find the table.

In [22]:
### SOLUTION ###
dynamic_table = driver.find_element(By.ID, "dynamic_table")

### Now let's get each row of the table

In [27]:
rows = dynamic_table.find_elements(By.TAG_NAME, "tr")
[row.text for row in rows[1:]]  # preview the data rows; rows[0] is assumed to be the header

['1 1', '2 4', '3 9', '4 16', '5 25', '6 36', '7 49', '8 64', '9 81', '10 100']

### Great, now let's get the data within each row

In [29]:
data = []
for row in rows[1:]:
    ### SOLUTION ###
    input_elt = row.find_element(By.CLASS_NAME, "dynamic_input")
    output_elt = row.find_element(By.CLASS_NAME, "dynamic_output")
    data.append((input_elt.text, output_elt.text))
data

1 1
2 4
3 9
4 16
5 25
6 36
7 49
8 64
9 81
10 100


[('1', '1'),
 ('2', '4'),
 ('3', '9'),
 ('4', '16'),
 ('5', '25'),
 ('6', '36'),
 ('7', '49'),
 ('8', '64'),
 ('9', '81'),
 ('10', '100')]
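
The scraped values are strings. If you want to compute with them, a short follow-up step converts each pair to integers. This is sketched on the first few tuples from the output above, and `sample_data` is a stand-in name so it does not overwrite `data`.

```python
# First few rows copied from the output above
sample_data = [("1", "1"), ("2", "4"), ("3", "9"), ("4", "16")]

# Convert every (input, output) pair from text to integers
numeric = [(int(x), int(y)) for x, y in sample_data]

# Check a guess about the table: each output looks like the input squared
looks_squared = all(y == x ** 2 for x, y in numeric)
print(numeric, looks_squared)
```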

### See? Not so bad! But that was kind of long... what if we could combine regex with Selenium?

In [36]:
html = driver.page_source
html

'<html xmlns="http://www.w3.org/1999/xhtml"><head>\n        <title>\n            My Website!\n        </title>\n        <style>\n            table, th, td {\n                border: 1px solid black;\n            }\n            td {\n                padding: 10px;\n            }\n            th {\n                padding: 10px;\n            }\n        </style>\n        <script src="my_script.js"></script>\n    </head>\n\n    <body>\n        <h1>Welcome to my website!</h1>\n        <h3>Look at all of this data that I have</h3>\n        <table id="static_table">\n            <tbody><tr><th>My Input</th><th>My Output</th></tr>\n            <tr><td class="static_input">1</td><td class="static_output">2</td></tr>\n            <tr><td class="static_input">2</td><td class="static_output">4</td></tr>\n            <tr><td class="static_input">3</td><td class="static_output">6</td></tr>\n            <tr><td class="static_input">4</td><td class="static_output">8</td></tr>\n            <tr><td clas

### Now, we can use RegEx!

In [40]:
### SOLUTION ###
data = re.findall(r'<td class="dynamic_input">(\d+?)<\/td><td class="dynamic_output">(\d+?)<\/td>', html)
data

[('1', '1'),
 ('2', '4'),
 ('3', '9'),
 ('4', '16'),
 ('5', '25'),
 ('6', '36'),
 ('7', '49'),
 ('8', '64'),
 ('9', '81'),
 ('10', '100')]
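
Notice that this pattern has no spaces around `=`, unlike the earlier ones: a regex has to match the attribute formatting of whatever string it is run against. A more forgiving sketch, assuming the cells otherwise look the same, allows optional whitespace so one pattern covers both spellings. `cell_pattern` is a hypothetical helper.

```python
import re

def cell_pattern(prefix):
    """Regex for adjacent <prefix>_input / <prefix>_output cells, tolerant of spacing."""
    return re.compile(
        r'<td\s+class\s*=\s*"{0}_input">(\d+)</td>\s*'
        r'<td\s+class\s*=\s*"{0}_output">(\d+)</td>'.format(prefix)
    )

spaced = '<tr><td class = "dynamic_input">3</td><td class = "dynamic_output">9</td></tr>'
compact = '<tr><td class="dynamic_input">3</td><td class="dynamic_output">9</td></tr>'

print(cell_pattern("dynamic").findall(spaced))
print(cell_pattern("dynamic").findall(compact))
```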

### Now for one of the best parts: the ability to navigate. Let's try scraping Canvas.

In [45]:
driver.get("http://canvas.harvard.edu")

### Let's see if we can get the todo titles

In [52]:
### SOLUTION ###
todo_elts = driver.find_elements(By.CLASS_NAME, "todo-details__title")
todo_titles = [elt.text for elt in todo_elts]
todo_titles

['Grade PSET 3', 'Complete HW 4', 'Turn in Homework 4']

### Hmm... doesn't tell us that much. Let's get which classes they are for

In [50]:
todo_class_elts = driver.find_elements(By.CLASS_NAME, "todo-details__context")
todo_classes = [elt.text for elt in todo_class_elts]
todo_classes

['ECON 1011A', 'MATH 116', 'ECON 1126']

In [55]:
todos = list(zip(todo_classes, todo_titles))
todos

[('ECON 1011A', 'Grade PSET 3'),
 ('MATH 116', 'Complete HW 4'),
 ('ECON 1126', 'Turn in Homework 4')]
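
Keep in mind that `zip` pairs items purely by position, so this relies on both lists coming back in the same order and with the same length. Once the pairs exist, one way to organize them is to group the titles by course. This is a sketch using the output above as sample data, with `sample_todos` as a stand-in name.

```python
from collections import defaultdict

# Sample pairs copied from the output above
sample_todos = [
    ("ECON 1011A", "Grade PSET 3"),
    ("MATH 116", "Complete HW 4"),
    ("ECON 1126", "Turn in Homework 4"),
]

# Map each course to the list of its to-do titles
todos_by_class = defaultdict(list)
for course, title in sample_todos:
    todos_by_class[course].append(title)

print(dict(todos_by_class))
```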

### I'll leave it to you as a challenge to see if you can get due dates :)

## A note on ethical scraping
Scraping tools, especially ones like Selenium, can give you a *lot* of power. Make sure you use it responsibly. Don't violate individual privacy, and make sure you check the user agreements of websites before you scrape them. A recent court case ruled that it was legal to scrape LinkedIn, but even then, be careful. A lot of the information that you scrape is still subject to copy