# Web crawling using Selenium

For small/straightforward page downloads (no login, no page rendering/no dynamic pages), it makes sense to use `requests`.

For pages that require a login, or that are interactive/dynamic, `Selenium` is usually the best option.


In [None]:
# run this cell once, then 'Kernel' - 'Restart'
!pip install selenium

## Selenium

Selenium requires 'drivers' to interact with different browsers. Each browser (Firefox, Chrome, etc.) has its own driver.

The code below assumes that, in the same folder as this notebook, you have a folder 'selenium_files' with the driver for either Firefox or Chrome, as well as 'jquery.js' (week 2):

- chromedriver.exe for Chrome (Download from: [https://chromedriver.chromium.org/](https://chromedriver.chromium.org/))
- geckodriver.exe for Firefox (Download from: [https://github.com/mozilla/geckodriver/releases](https://github.com/mozilla/geckodriver/releases)) - download 'geckodriver-v0.33.0-win32.zip'
- jquery.js (Download from [https://jquery.com/download/](https://jquery.com/download/))
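
A wrong driver path is an easy mistake to make, so it can help to check the folder before starting the browser. The sketch below uses only the standard library. The helper name `find_driver` is hypothetical (it is not part of Selenium), and the folder and file names are the ones assumed in the list above.

```python
from pathlib import Path

def find_driver(folder, candidates=("chromedriver.exe", "geckodriver.exe")):
    """Return the path of the first candidate driver found in `folder`, or None."""
    folder = Path(folder)
    for name in candidates:
        path = folder / name
        if path.is_file():
            return path
    return None

# 'selenium_files' sits next to the notebook, so a relative path is enough
driver_path = find_driver("selenium_files")
if driver_path is None:
    print("no driver found in 'selenium_files'; download one of the files listed above")
else:
    print("found driver:", driver_path.resolve())
```

The returned path can be passed on when creating the driver, instead of typing an absolute path by hand.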


You don't need to have both Chrome and Firefox working; one is enough. There are slight differences between the two for certain tasks, though (like mouse movements).

Further reading/tutorials: [https://www.browserstack.com/guide/category/tutorials](https://www.browserstack.com/guide/category/tutorials)


In the next steps: either use Selenium with Chrome, or with Firefox.

### Selenium using Chrome

In [None]:
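from selenium import webdriver
from selenium.webdriver.chrome.service import Service

# new driver (opens a browser window); edit the path to match your machine
# Selenium 4 takes the driver path through a Service object
# if you move chromedriver.exe to the same folder as the notebook, you can do:
#driver = webdriver.Chrome(service=Service('chromedriver.exe'))
driver = webdriver.Chrome(service=Service(r'C:\git\python-materials-acg7848\selenium\selenium_files\chromedriver.exe'))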

### Selenium using Firefox 

In [None]:
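from selenium import webdriver
from selenium.webdriver.firefox.service import Service

# make sure the path is correct (Selenium 4 takes the driver path through a Service object)
driver = webdriver.Firefox(service=Service(r'C:\git\python-materials-acg7848\selenium\selenium_files\geckodriver.exe'))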

### SessionNotCreatedException

Error message: 
```
Message: Expected browser binary location, but unable to find binary in default location, no 'moz:firefoxOptions.binary' capability provided, and no binary flag set on the command line
```

I ran into the above error and solved it (after some Googling) by telling Selenium where the Firefox binary is, through 'options':

In [None]:
# https://stackoverflow.com/questions/64908154/sessionnotcreatedexception-message-expected-browser-binary-location-but-unabl

from selenium import webdriver
from selenium.webdriver.firefox.options import Options
from selenium.webdriver.firefox.service import Service

# point Selenium at the Firefox executable (edit both paths to match your machine)
options = Options()
options.binary_location = r"C:\Program Files\Mozilla Firefox\firefox.exe"
driver = webdriver.Firefox(options=options, service=Service(r'C:\git\python-materials-acg7848\selenium\selenium_files\geckodriver.exe'))


### Example Google

Either way, we now have a Chrome or Firefox window open. Note that we assigned it to a variable 'driver'.

In [None]:
driver

In [None]:
# tell driver to go to a specific url
driver.get("https://www.google.com")

### inspect

Inspect the file in two ways: open it in your web browser, and open it in your text editor.

In the browser (Chrome or Firefox), press F12 to open the debugger.
You can then right-click any element in the web page and click 'inspect' to see the corresponding HTML.

> Note: if the console isn't open yet, you may need to right-click and 'inspect' twice for the debugger to highlight the correct HTML.
        

### Locating elements

There are many ways to 'locate' elements in an HTML page. Elements can be buttons, input fields, paragraphs, tables, table rows/cells, etc.

Elements can then be read/changed/clicked, etc (depending on the element).

For example, the input field in Google search is named 'q'. (Inspect the element!)

In [None]:
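from selenium.webdriver.common.by import By

# the input element has name="q" (Selenium 4 syntax; formerly find_element_by_name)
elem = driver.find_element(By.NAME, "q")
elem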

In [None]:
# type in search query
elem.send_keys("go gators")

In [None]:
# clear it
elem.clear()

In [None]:
# special characters (enter, tab, etc) have a mapping
from selenium.webdriver.common.keys import Keys
Keys.TAB

In [None]:
# now with an enter
from selenium.webdriver.common.keys import Keys
elem.clear()
elem.send_keys("go gators")
elem.send_keys(Keys.RETURN)

## Selenium locator functions:

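See [https://selenium-python.readthedocs.io/locating-elements.html](https://selenium-python.readthedocs.io/locating-elements.html). The list below uses the Selenium 4 syntax, `find_element(By.<STRATEGY>, value)`, which needs `from selenium.webdriver.common.by import By`. Older tutorials use the equivalent `find_element_by_*` functions (for example `find_element_by_id`), which current Selenium releases no longer provide:
- By id attribute: find_element(By.ID, ...).
- By class name: find_element(By.CLASS_NAME, ...).
- By name attribute: find_element(By.NAME, ...).
- By DOM structure or xpath: find_element(By.XPATH, ...).
- By link text: find_element(By.LINK_TEXT, ...).
- By partial link text: find_element(By.PARTIAL_LINK_TEXT, ...).
- By HTML tag name: find_element(By.TAG_NAME, ...).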
    


In [None]:
# navigate to a page
driver.get("https://www.startengine.com/explore") 

Let's select the search textbox ("Explore Investments").

Right-click and 'inspect':
    
- it is an 'input' element
- it has an id 'mui-2'
- it has several classes (MuiInputBase-input MuiOutlinedInput-input MuiInputBase-inputAdornedStart MuiInputBase-inputAdornedEnd MuiAutocomplete-input MuiAutocomplete-inputFocused css-17o9mmt)
- a placeholder ("Explore Investments")

It has many 'parents' ('higher' elements). Sometimes a parent is easy to locate, and then you can 'find' a child inside that parent. 

In this case the best bet is to go for the id with value 'mui-2' (in principle each id is unique within a page).

In [None]:
from selenium.webdriver.common.by import By

# returns the first element with a matching id
# note: By.CLASS_NAME, by contrast, only works with one class (not multiple classes)
element = driver.find_element(By.ID, "mui-2")

# note that this is a single element
print(element)

In [None]:
# the HTML of an element
element.get_attribute('outerHTML')

In [None]:
# get attributes of it
# the inner HTML of an element ('children')
element.get_attribute('innerHTML')

In [None]:
element.get_attribute('autocapitalize')

In [None]:
# the 'class' of an element
element.get_attribute("class")

In [None]:
# how about setting some text
element.send_keys("easy money")

> Note: when 'Googling' how to use Selenium, be aware that Selenium can also be used from other programming languages, such as Java. Java functions use 'camelCase' (no underscores, but capitalizing words). For example, 'send_keys' is 'sendKeys' in Selenium for Java.

### Selector functions to return multiple items

Each of the selector/locator calls has a plural counterpart that locates/selects multiple elements: `find_elements` instead of `find_element`.

For example: find_element(By.ID, ...) => find_elements(By.ID, ...) (in older Selenium releases: find_element_by_id => find_elements_by_id)

The former returns one element and raises a `NoSuchElementException` if it cannot be found; the latter returns a list with all elements found (an empty list if there are no matches).
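
One pattern that follows from this difference: call the plural function and take the first item, so that a missing element gives `None` instead of an exception. This is a minimal sketch. `first_or_none` is a hypothetical helper (not part of Selenium), and `parent` can be the driver or any element, since both offer `find_elements`.

```python
def first_or_none(parent, by, value):
    """Return the first matching element under `parent`, or None if there is no match.

    `parent` is the driver or an element; `by` is a locator strategy such as By.ID.
    """
    matches = parent.find_elements(by, value)  # empty list when nothing matches
    return matches[0] if matches else None

# usage in the notebook (assumes an open `driver`):
# from selenium.webdriver.common.by import By
# box = first_or_none(driver, By.ID, "mui-2")
# if box is None:
#     print("search box not found")
```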

In [None]:
from selenium.webdriver.common.by import By

# let's get all hyperlinks (element 'a' in HTML)
# elements is a list holding elements
elements = driver.find_elements(By.TAG_NAME, "a")
print('type elements:', type(elements))
len(elements)

In [None]:
# the third link (the first two are lengthy)
elements[2].get_attribute('outerHTML')

In [None]:
# interested in the first link only? then use find_element (not find_elements)
from selenium.webdriver.common.by import By
el = driver.find_element(By.TAG_NAME, "a")
type(el)

In [None]:
# enumerate pairs each item with a counter: it yields (counter, item) tuples
for counter, el in enumerate(["kangaroo", "banana", "kegerator"]):
    print("counter:", counter, "el:", el)

# how about getting the class and link target of the first 5 hyperlinks?
for counter, el in enumerate(elements[0:5]):
    print("link", counter, "class:", el.get_attribute("class"), "links to", el.get_attribute("href"))

## X-path

X-path is a way to 'navigate' through an HTML document.

Right-click 'Sign Up' to 'inspect', then right-click the button and select 'copy' 'XPath', which gives:
    
```
/html/body/div/div/div[1]/header/div/div/div[1]/div/div[2]/div[2]/div[2]/a/button
```

In [None]:
from selenium.webdriver.common.by import By

# locate by xpath (Selenium 4 syntax; formerly find_element_by_xpath)
# note: this xpath may not be 'stable' over time (if the page changes a bit, then this path may change)
s = driver.find_element(By.XPATH, r'/html/body/div/div/div[1]/header/div/div/div[1]/div/div[2]/div[2]/div[2]/a/button')
print(s)
s.click()

The X-path is not very pretty, and may be prone to being outdated if the web page changes just a bit.

Inspect the HTML again; notice how the button may be hard to find (no id, no name, generic classes), but most likely the 5th or so grandparent has a unique class 'topNavRight', and there are no other buttons.

So an alternative approach would be to first select the element with class 'topNavRight', and then within that element select the button.
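
The two-step idea can be tried without a browser. The sketch below uses Python's standard-library `xml.etree.ElementTree` (not Selenium) on a made-up, simplified page; the HTML snippet and its `href` value are assumptions for illustration only, and ElementTree supports just a subset of XPath.

```python
import xml.etree.ElementTree as ET

# made-up, simplified page: the buttons have no id or name of their own
page = ET.fromstring("""
<html><body>
  <div class="content"><button>Invest</button></div>
  <header>
    <div class="topNavRight">
      <div><a href="/signup"><button>Sign Up</button></a></div>
    </div>
  </header>
</body></html>
""")

# step 1: locate the easy-to-find ancestor by its unique class
nav = page.find(".//div[@class='topNavRight']")

# step 2: search for the button inside that element only
btn = nav.find(".//button")
print(btn.text)

# searching the whole page instead returns the other button first
print(page.find(".//button").text)
```

Scoping the search to `nav` keeps the locator short and independent of how many `div` levels sit above it.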

In [None]:
from selenium.webdriver.common.by import By

# navigate back first
driver.get("https://www.startengine.com/explore")
# get the element with class 'topNavRight'
nav = driver.find_element(By.CLASS_NAME, 'topNavRight')
print(nav)

In [None]:
# now, instead of 'driver' (which is the full page), we use 'nav' to select the button
from selenium.webdriver.common.by import By
btn = nav.find_element(By.TAG_NAME, "button")
print(btn)

### This may come in handy: getting the 'parent' of an element

That is, if you have an element and you want to know the element that is one level higher (the parent).

In [None]:
from selenium.webdriver.common.by import By

# identify the parent of a child element with '..' in xpath
# apply the function on the element for which you want to know the 'parent' (for example: btn)
p = btn.find_element(By.XPATH, "..")
print('and the parent of the button is', p)

In [None]:
from selenium.webdriver.common.by import By

# all descendants of p (children, grandchildren, ...); use "./*" for direct children only
p_children = p.find_elements(By.XPATH, ".//*")
p_children
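
The xpath expressions `..` and `.//*` can also be explored offline. This is a small sketch with the standard-library `xml.etree.ElementTree` on a made-up snippet (an assumption, not the real page). One difference from Selenium: ElementTree elements do not know their parent, so the `..` step has to be part of a path that starts higher up in the tree.

```python
import xml.etree.ElementTree as ET

# made-up snippet standing in for part of a page
root = ET.fromstring(
    "<div class='topNavRight'><a href='#'><button><span>Sign Up</span></button></a></div>"
)

# '..' steps up one level: the parent of the button is the <a> element
parent = root.find(".//button/..")
print("parent tag:", parent.tag)

# './/*' matches every descendant, not only the direct children
print("descendants:", [el.tag for el in parent.findall(".//*")])

# './*' matches the direct children only
print("children:", [el.tag for el in parent.findall("./*")])
```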

## Running Javascript in the browser 

In [None]:
# for the browser to 'see' our variables/functions we need to use window (this is the main 'container' variable in a browser)
s = '''
window.message = 'hello'
'''
driver.execute_script( s )

In [None]:
# or
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

In [None]:
s = '''
window.sayMessage = function( name ) { return ('hi ' + name); }
'''
driver.execute_script( s )

Test it in the debugger, in the console type: ```sayMessage('Tarzan')```

In [None]:
# execute_script can also return variables to Python
msg = driver.execute_script( "return sayMessage('there')" )
print(msg)

Next week we will be loading (inserting) a Javascript library (jQuery) that will help us isolate elements (jQuery locators are very compact/flexible):
    
- Navigate to a web page
- Load jQuery from the hard disk (and pass it into the browser through execute_script) 
- Run jQuery code (through execute_script) from Python (jQuery runs in browser working on the page we navigated to)
- jQuery returns the data/text through execute_script
- Save data to disk (Python)
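
As a preview, the five steps could be sketched roughly as follows. This is an assumption about how they fit together, not next week's actual code: `collect_links` is a hypothetical helper, the jQuery expression is just one example query, and `driver` is any open Selenium driver.

```python
import json
from pathlib import Path

def collect_links(driver, url, jquery_path, out_path):
    """Navigate, inject jQuery, run a jQuery query in the browser, save the result."""
    # 1. navigate to the web page
    driver.get(url)

    # 2. load jQuery from the hard disk and pass its source into the browser
    jquery_source = Path(jquery_path).read_text(encoding="utf-8")
    driver.execute_script(jquery_source)

    # 3. and 4. run jQuery code in the page; the return value comes back to Python
    hrefs = driver.execute_script(
        "return jQuery('a').map(function() { return this.href; }).get();"
    )

    # 5. save the data to disk
    Path(out_path).write_text(json.dumps(hrefs, indent=2), encoding="utf-8")
    return hrefs

# usage (assumes an open `driver` and the 'selenium_files' folder described earlier):
# links = collect_links(driver, "https://www.startengine.com/explore",
#                       "selenium_files/jquery.js", "links.json")
```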