# Selenium Bootcamp

## Why Selenium?

Most of the time, scraping methods like RegEx or BeautifulSoup will be fine for dealing with websites. However, some websites handle things a little bit differently. Let's take a look. Run the following two cells:

In [None]:
import os

In [None]:
# "open" is the macOS command; on Linux use "xdg-open", on Windows use "start"
os.system("open my_website.html")

### Let's try downloading the site and scraping the data. 

In [None]:
from urllib.request import urlopen

# Build a file:// URL for the local page and read it as text (assumes UTF-8)
my_website_url = "file://" + os.getcwd() + "/my_website.html"
html = urlopen(my_website_url).read().decode("utf-8")
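A side note on turning the downloaded bytes into text: calling `str()` on a bytes object returns its printable representation, with a `b'...'` wrapper and escaped newlines, whereas `.decode()` returns the text itself. A small illustration, using a made-up byte string rather than the real page:

```python
# Made-up bytes standing in for what urlopen(...).read() returns
raw = b"<td>1</td>\n"

as_repr = str(raw)             # printable representation of the bytes object
as_text = raw.decode("utf-8")  # the actual text

print(as_repr)
print(as_text)
```

Either form can work with simple patterns, but the decoded text is what you would see in the file.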

### Seems to work fine. Let's try scraping some data from it.

In [None]:
import re

In [None]:
static_data = re.findall(r'<td class = "static_input">(.+?)<\/td><td class = "static_output">(.+?)<\/td>', html)
static_data

### Looks good! But I think we have some data missing... Not a problem, let's try to scrape it

In [None]:
dynamic_data = re.findall(r'<td class = "dynamic_input">(.+?)<\/td><td class = "dynamic_output">(.+?)<\/td>', html)
dynamic_data

## What went wrong?
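A likely explanation, given that the missing rows only show up after a button on the page is clicked: they are added by JavaScript running in the browser, so they never appear in the HTML that `urlopen` reads. The sketch below reproduces the effect with a made-up page. The markup and the `addRows` function are assumptions for illustration, not the contents of `my_website.html`.

```python
import re

# Hypothetical page: one static row in the markup, dynamic rows only
# created by a script when the button is clicked.
sample_html = '''
<table><tr><td class = "static_input">1</td><td class = "static_output">2</td></tr></table>
<button id="btnMore" onclick="addRows()">More</button>
<table id="dynamic_table"></table>
<script>
function addRows() { /* builds the dynamic rows in the browser */ }
</script>
'''

static_rows = re.findall(
    r'<td class = "static_input">(.+?)</td><td class = "static_output">(.+?)</td>',
    sample_html)
dynamic_rows = re.findall(
    r'<td class = "dynamic_input">(.+?)</td><td class = "dynamic_output">(.+?)</td>',
    sample_html)

print(static_rows)   # the static row is in the raw HTML
print(dynamic_rows)  # nothing to find: the script never ran
```

No regex can recover rows that are not in the text yet, so something has to run the page's script first.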

## Cases where you might need Selenium
* Data is generated via interaction e.g. searching, clicking more, etc.
* Data is generated via "ajax" requests
* Website requires login of some kind
* Dealing with the html parsing and regex is just too damn annoying

## Download Instructions

1. Install Selenium for Python: `python3 -m pip install selenium`. [Full Instructions](https://selenium-python.readthedocs.io/installation.html)

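2. [Install the Chrome WebDriver (ChromeDriver)](https://sites.google.com/a/chromium.org/chromedriver/downloads). Pick the version that matches your installed Chrome.

3. Move the downloaded `chromedriver` executable into the same folder as this notebook.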
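If you want to confirm the setup before going further, a quick check along these lines may help. It is a sketch using only the standard library; `check_setup` is a hypothetical helper, and it assumes the driver file is named `chromedriver` (or `chromedriver.exe` on Windows).

```python
import importlib.util
import os
import shutil

def check_setup(folder="."):
    """Report whether selenium is importable and where a chromedriver file is."""
    names = ("chromedriver", "chromedriver.exe")
    local = [n for n in names if os.path.isfile(os.path.join(folder, n))]
    return {
        "selenium_installed": importlib.util.find_spec("selenium") is not None,
        "driver_in_folder": local[0] if local else None,
        "driver_on_path": shutil.which("chromedriver"),
    }

print(check_setup())
```

If `selenium_installed` is `False`, or both driver entries are `None`, revisit the steps above.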


### Great! Now let's get started

In [None]:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions

In [None]:
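from selenium.webdriver.chrome.service import Service

# Selenium 4 takes the driver path through a Service object
driver = webdriver.Chrome(service=Service("./chromedriver"))

# for Windows users
# driver = webdriver.Chrome(service=Service("chromedriver.exe"))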

### Some helper functions that will be useful later

In [None]:
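def wait_until_present(driver, timeout, *locator):
    # locator is a (By.<strategy>, value) pair, e.g. By.NAME, "q"
    return WebDriverWait(driver, timeout).until(expected_conditions.presence_of_element_located(locator))

def find_element_by_text(driver, text):
    # First element whose text contains `text` (fails if `text` has a single quote)
    return driver.find_element(By.XPATH, "//*[contains(text(), '{}')]".format(text))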
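One caveat with `find_element_by_text`: it drops the text into a single-quoted XPath string, so text containing an apostrophe produces an invalid expression. Below is one possible way to build the expression safely. `xpath_literal` and `xpath_containing_text` are hypothetical helpers for illustration, not part of Selenium.

```python
def xpath_literal(text):
    """Quote text for use inside an XPath 1.0 expression."""
    if "'" not in text:
        return "'{}'".format(text)
    if '"' not in text:
        return '"{}"'.format(text)
    # Both quote types present: split on ' and rejoin the pieces with concat()
    parts = text.split("'")
    return "concat(" + ", \"'\", ".join("'{}'".format(p) for p in parts) + ")"

def xpath_containing_text(text):
    """Build the same expression the helper above uses, with safe quoting."""
    return "//*[contains(text(), {})]".format(xpath_literal(text))

print(xpath_containing_text("More"))
print(xpath_containing_text("Harvard's"))
```

The resulting string can be passed to the driver as an XPath locator in place of the hand-formatted one.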

### Let's try navigating to a url

In [None]:
driver.get("http://google.com")

### Not too bad, now let's see if we can interact with the webpage

In [None]:
search_box = wait_until_present(driver, 10, By.NAME, "q")

In [None]:
search_box.send_keys("Harvard" + Keys.ENTER)
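Why wait instead of grabbing the search box right away? The page may still be loading when the next line runs. `WebDriverWait(...).until(...)` checks its condition repeatedly until it succeeds or the timeout passes. The idea can be sketched without Selenium as a plain polling loop; `wait_until` and the toy condition below are hypothetical stand-ins, not the library's implementation.

```python
import time

def wait_until(condition, timeout, poll=0.5):
    """Call condition() until it returns something truthy or timeout seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within {} s".format(timeout))
        time.sleep(poll)

# Toy condition standing in for "is the element on the page yet?"
attempts = []
def ready_on_third_try():
    attempts.append(1)
    return "element" if len(attempts) >= 3 else None

found = wait_until(ready_on_third_try, timeout=2, poll=0.01)
print(found, len(attempts))
```

The return value of the condition is handed back to the caller, which is why `wait_until_present` can return the element itself.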

### A quick demo of what this can be useful for

In [None]:
# The class name "r" reflects Google's markup when this was written and may have changed
result_divs = driver.find_elements(By.CLASS_NAME, "r")
link_elts = [div.find_element(By.TAG_NAME, "a") for div in result_divs if div.text != ""]
links = [(elt.text, elt.get_attribute("href")) for elt in link_elts]
links

### Now let's try to do some basic interaction! Try navigating to my website from earlier!

In [None]:
### SOLUTION ###
driver.get(my_website_url)

### It'd be nice if we could click the button... Let's try to get that button into a variable

In [None]:
### SOLUTION ###
my_button = driver.find_element(By.ID, "btnMore")

### Let's click it!

In [None]:
my_button.click()

### The new data is here! Let's try to scrape it. First, let's find the table.

In [None]:
### SOLUTION ###
dynamic_table = driver.find_element(By.ID, "dynamic_table")

### Now let's get each row of the table

In [None]:
rows = dynamic_table.find_elements(By.TAG_NAME, "tr")

### Great, now let's get the data within each row

In [None]:
data = []
for row in rows[1:]:
    ### SOLUTION ###
    input_elt = row.find_element(By.CLASS_NAME, "dynamic_input")
    output_elt = row.find_element(By.CLASS_NAME, "dynamic_output")
    data.append((input_elt.text, output_elt.text))
data

### See? Not so bad! But that was kind of long... what if we could combine regex with Selenium?

In [None]:
html = driver.page_source
html

### Now, we can use RegEx!

In [None]:
### SOLUTION ###
data = re.findall(r'<td class="dynamic_input">(\d+?)<\/td><td class="dynamic_output">(\d+?)<\/td>', html)
data
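Notice that this pattern drops the spaces around `=` that the earlier patterns had. That is presumably because `page_source` is the browser's own serialization of the page rather than the file as it was typed. If you would rather not depend on that detail, one option is a pattern that tolerates optional whitespace. A sketch with made-up sample rows:

```python
import re

# Tolerate optional whitespace around "=" so one pattern covers both spellings
cell_pattern = re.compile(
    r'<td class\s*=\s*"dynamic_input">(\d+?)</td>'
    r'<td class\s*=\s*"dynamic_output">(\d+?)</td>')

# Made-up rows: one as it might be typed in a file, one as a browser might emit it
as_written = '<tr><td class = "dynamic_input">3</td><td class = "dynamic_output">9</td></tr>'
as_serialized = '<tr><td class="dynamic_input">3</td><td class="dynamic_output">9</td></tr>'

print(cell_pattern.findall(as_written))
print(cell_pattern.findall(as_serialized))
```

Both calls find the same pair, so the pattern works on the raw file and on `driver.page_source` alike.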

### Now for one of the best parts: the ability to navigate. Let's try scraping Canvas.

In [None]:
driver.get("http://canvas.harvard.edu")
# Canvas requires a login: sign in by hand in the browser window before running the next cells

### Let's see if we can get the todo titles

In [None]:
### SOLUTION ###
todo_elts = driver.find_elements(By.CLASS_NAME, "todo-details__title")
todo_titles = [elt.text for elt in todo_elts]
todo_titles

### Hmm... doesn't tell us that much. Let's get which classes they are for

In [None]:
todo_class_elts = driver.find_elements(By.CLASS_NAME, "todo-details__context")
todo_classes = [elt.text for elt in todo_class_elts]
todo_classes

In [None]:
todos = list(zip(todo_classes, todo_titles))
todos
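One thing to watch with `zip`: it pairs items by position and silently stops at the shorter list, so if one todo is missing its class label, every later pair is misaligned. A cautious variant might check the counts first. `pair_todos` is a hypothetical helper and the sample values are made up.

```python
def pair_todos(classes, titles):
    """Pair class names with titles, refusing to guess when the counts differ."""
    if len(classes) != len(titles):
        raise ValueError(
            "{} classes but {} titles".format(len(classes), len(titles)))
    return list(zip(classes, titles))

# Made-up values standing in for the scraped text
sample_classes = ["CS 50", "STAT 110"]
sample_titles = ["Problem Set 1", "Homework 2"]
print(pair_todos(sample_classes, sample_titles))
```

A more robust approach is to find each todo's container element and read both fields from inside it, so the pairing never depends on list order.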

### I'll leave it to you as a challenge to see if you can get due dates :)

## Pros and cons of Selenium
### Pros
* Very powerful, can get a lot done
* Pretty intuitive, like using a browser
* Can download the current HTML representation, not just the initial one

### Cons
* Pretty heavyweight, need to install a lot
* Can get annoying if you need to be fully automated
* Can be substantially slower than RegEx or BeautifulSoup

## A note on ethical scraping
Scraping tools, Selenium especially, can give you a *lot* of power. Make sure you use it responsibly. Don't violate individual privacy, and check a website's user agreement before you scrape it. Recently, a court ruled that it was legal to scrape LinkedIn, but even then, be careful: much of the information you scrape is still subject to copyright law, and who knows how the law around scraping might change in the future.
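One small, concrete habit is to check a site's `robots.txt` before scraping. It is separate from the user agreement and does not replace reading it, but it does state which paths the site asks automated clients to avoid. A sketch using the standard library, with made-up rules and `example.com` as a placeholder:

```python
from urllib.robotparser import RobotFileParser

# Made-up rules; a real check would load the site's own robots.txt instead
sample_rules = """
User-agent: *
Disallow: /private/
""".splitlines()

parser = RobotFileParser()
parser.parse(sample_rules)

print(parser.can_fetch("*", "https://example.com/public/page"))
print(parser.can_fetch("*", "https://example.com/private/page"))
```

For a live site, `parser.set_url(...)` followed by `parser.read()` fetches the real file.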