In [5]:
import selenium
print(selenium.__version__)

4.1.3


# Selenium
When a web application uses JavaScript to load content, the initial page source does not contain all the data, so Beautiful Soup alone is not enough. We need a framework that can interact with the application by:
1. finding buttons
2. clicking buttons
3. filling out and submitting forms
4. extracting lists of images, links, divs, etc.

### What is Selenium?

> Selenium automates browsers. That's it! What you do with that power is entirely up to you. Primarily, it is for automating web applications for testing purposes, but is certainly not limited to just that. Boring web-based administration tasks can (and should!) also be automated as well.
http://docs.seleniumhq.org



## Controlling the Browser with the `selenium` Module

The `selenium` module lets Python directly control the browser by programmatically clicking links and filling in login information, almost as though a human user were interacting with the page. Selenium allows you to interact with web pages in a much more advanced way than Requests and Beautiful Soup, but because it launches a web browser, it is slower and harder to run in the background if, say, you just need to download some files from the web.


### Starting a Selenium-Controlled Browser

```python
from selenium import webdriver

browser = webdriver.Firefox()
browser.get('http://www.krak.dk')
```

### Update selenium to version 4
Run `pip install -U selenium`. It may be necessary to do this as root inside the container, which you can enter with `docker exec -it -u 0 notebookserver bash`.

In [6]:
# Example: go to www.cphbusiness.dk and find all the available "Erhvervsakademiuddannelser" (academy profession programmes).

from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.firefox.options import Options
from selenium.webdriver.firefox.service import Service
from webdriver_manager.firefox import GeckoDriverManager
from selenium.webdriver.common.by import By


import bs4
import json

url = 'https://www.cphbusiness.dk'
def cphbusiness_interaction():
    #profile = webdriver.FirefoxProfile()
    #profile.set_preference("general.useragent.override", "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:81.0) Gecko/20100101 Firefox/81.0")
    
    # headless is needed here because we do not have a GUI version of firefox
    options = Options()
    options.headless = True
    browser = webdriver.Firefox(service=Service(GeckoDriverManager().install()), options=options)
    
    browser.get(url)
    browser.implicitly_wait(2)
    button = browser.find_element(By.ID,'declineButton')
    button.click()
    
    button = browser.find_element(By.XPATH,'/html/body/header/div[3]/div[4]/div/nav/ul/li[1]/a')
    button.click()
    
    button = browser.find_element('xpath','/html/body/main/div[1]/div/div[1]/div/a') # using xpath string
    button.click()
    
    # Each education tile contains a link; the link text is the name of the education.
    edu_buttons = browser.find_elements('css selector','p.tile__link.tile__link--small a.icon-arrow-after')
    educations = [b.text for b in edu_buttons]
    page_source = browser.page_source
    browser.quit() # close the headless browser so no Firefox process is left running
    return educations, page_source
    
def find_elements(page, selector):
    soup = bs4.BeautifulSoup(page, 'html.parser')
    event_cells = soup.select(selector)
    return event_cells

def print_page(page, filename):
    # Write the page as a JSON string; the file handle no longer shadows the filename parameter.
    with open(filename,'w') as f:
        f.write(json.dumps(page))

        

In [7]:
educations,source = cphbusiness_interaction()
print(educations)


[WDM] - Downloading: 16.2kB [00:00, 8.29MB/s]                   
[WDM] - Downloading: 16.2kB [00:00, 11.1MB/s]                   
[WDM] - Downloading: 100%|██████████| 1.42M/1.42M [00:00<00:00, 10.7MB/s]


['Datamatiker', 'Financial controller', 'Finansøkonom', 'Handelsøkonom', 'Laborant', 'Logistikøkonom', 'Markedsføringsøkonom', 'Miljøteknolog', 'Multimediedesigner', 'Serviceøkonom']


In [8]:
elements = find_elements(source,'a')
print(len(elements))
print(elements[:2])

176
[<a href="https://cookieinformation.com/" target="_blank">Cookie Information</a>, <a class="coi-banner__policy" href="javascript:TogglePage(this, 'coiPage-3');" onkeypress="TogglePage(this, 'coiPage-3')">Læs mere om cookies.</a>]
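
The `find_elements(page, selector)` helper above returns Beautiful Soup tags, so each link target can be read with `.get('href')`. As the second element in the output shows, some `href` values are `javascript:` pseudo-links rather than real URLs. The sketch below is one possible way to keep only navigable links and make them absolute. `absolute_links` is a hypothetical helper name, the sample data is made up, and plain dictionaries stand in for the tags so the example runs without a browser.

```python
from urllib.parse import urljoin


def absolute_links(elements, base_url):
    """Return absolute http(s) URLs for elements that carry a usable href.

    `elements` can be the Tag objects returned by soup.select('a'); anything
    with a dict-style .get('href') works, which is why dicts are used below.
    """
    links = []
    for element in elements:
        href = element.get('href')
        if not href:
            continue  # anchor without a target
        url = urljoin(base_url, href)  # resolves relative paths against base_url
        if url.startswith(('http://', 'https://')):
            links.append(url)  # javascript:, mailto: and similar are dropped
    return links


# Stand-ins for bs4 tags (hypothetical data modelled on the output above).
sample = [
    {'href': 'https://cookieinformation.com/'},
    {'href': "javascript:TogglePage(this, 'coiPage-3');"},
    {'href': '/uddannelser'},
    {},
]
print(absolute_links(sample, 'https://www.cphbusiness.dk'))
```

With the real data, the call would be `absolute_links(find_elements(source, 'a'), url)`.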


### Finding Elements on the Page

WebDriver objects find elements on a page with two methods, `find_element` and `find_elements`, each taking a locator strategy (such as `By.ID`, `By.XPATH` or `By.CSS_SELECTOR`) and a value. `find_element` returns a single `WebElement` object representing the first element on the page that matches the query. `find_elements` returns a list of `WebElement` objects, one for every matching element on the page. The older `find_element_by_*` and `find_elements_by_*` methods are deprecated in Selenium 4 and should be avoided.
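
One practical difference between the two: the singular method raises `NoSuchElementException` when nothing matches, while the plural method returns an empty list. A minimal sketch of a helper built on that behaviour follows. `first_or_none` and the `FakeBrowser` stand-in are hypothetical names, used only so the example runs without launching Firefox; with a real WebDriver you would pass the `browser` object instead.

```python
def first_or_none(browser, by, value):
    """Return the first matching element, or None if nothing matches.

    Uses find_elements (plural), which returns an empty list instead of
    raising when there is no match.
    """
    matches = browser.find_elements(by, value)
    return matches[0] if matches else None


class FakeBrowser:
    """Hypothetical stand-in for a WebDriver, only for demonstration."""

    def __init__(self, elements_by_locator):
        self._elements = elements_by_locator

    def find_elements(self, by, value):
        return self._elements.get((by, value), [])


# 'id' and 'css selector' are the string values behind By.ID and By.CSS_SELECTOR.
fake = FakeBrowser({('id', 'declineButton'): ['<decline button>']})
print(first_or_none(fake, 'id', 'declineButton'))
print(first_or_none(fake, 'css selector', 'div.missing'))
```

A helper like this is convenient for optional elements, such as a cookie banner that is only shown on the first visit.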

### See **[docs](https://selenium-python.readthedocs.io/locating-elements.html)** for more details


### Clicking the Page

`WebElement` objects returned by `find_element` and `find_elements` have a `click()` method that simulates a mouse click on that element. It can be used to follow a link, select a radio button, click a Submit button, or trigger whatever else happens when the element is clicked with the mouse.

```python
    base_url = 'http://www.krak.dk'
    browser = webdriver.Firefox() 
    browser.get(base_url)
    browser.implicitly_wait(3)

    link_to_persons = browser.find_elements(By.LINK_TEXT,'Personer') # returns a list
    link_to_persons[0].click()
```


## Find the CSS selector or XPath of an element
1. Open the URL in a browser (use a private window so that cookies and state from earlier visits do not affect the page).
2. Right-click the element you want to identify.
3. Choose "Inspect Element".
4. Inside the inspector pane, right-click the highlighted node and choose "Copy".
  - From here you can choose between
      1. CSS Selector
      2. CSS Path
      3. XPath
  
![](images/inspector_copy.png)

#### CSS selector
If you choose this, you get something like `button.cqryLz:nth-child(2)`. This may be unique on the page, in which case it can be used with Selenium like this: `browser.find_elements(By.CSS_SELECTOR, selector_text)`

#### CSS Path
If you choose this, you would get something like:
`html.krak body.firstPageBackground div#qc-cmp2-container.qc-cmp2-container div#qc-cmp2-main.qc-cmp2-main div.sc-VigVT.jzbnAW.qc-cmp-cleanslate div#qc-cmp2-ui.sc-bdVaJa.cNgWHs div.qc-cmp2-footer div.qc-cmp2-buttons-desktop button.sc-ifAKCX.cqryLz`, which is the full path through the DOM tree. This will be unique, but it is also hard to read and brittle: any small change to the HTML will break the path. It is therefore not very useful in itself, but it can be shortened to its last part if you can identify the smallest path that is still unique.

#### XPath
If you choose this, you get a path in XPath syntax, with steps separated by `/` and positions in square brackets, for example `/html/body/header/div[3]/div[4]/div/nav/ul/li[1]/a` (the path used in the cphbusiness example above). Note that a space-separated string such as `div#qc-cmp2-ui.sc-bdVaJa.eupTWg div.qc-cmp2-footer.qc-cmp2-footer-overlay.qc-cmp2-footer-scrolled div.qc-cmp2-summary-buttons button.sc-ifAKCX.kkoEyk` is a CSS path, not an XPath, and must be passed with `By.CSS_SELECTOR` rather than `By.XPATH`.

An XPath can also be shortened to the smallest path that is still unique, e.g. by starting from any element with an id: `browser.find_element(By.XPATH, "//div[@id='qc-cmp2-ui']//div[contains(@class, 'qc-cmp2-summary-buttons')]/button")`
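
The idea of shortening a full path to one anchored on an id can be tried without a browser. The sketch below uses the limited XPath subset in Python's standard `xml.etree.ElementTree` on a small, made-up, well-formed fragment (only the `id` and class names are borrowed from the paths above). Real pages are rarely well-formed XML, so this illustrates the path logic only, not how Selenium evaluates XPath.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified fragment; a real page has far more nesting.
FRAGMENT = """
<html>
  <body>
    <div id="qc-cmp2-container">
      <div id="qc-cmp2-ui">
        <div class="qc-cmp2-footer">
          <button>Accept</button>
        </div>
      </div>
    </div>
    <div id="content">
      <button>Search</button>
    </div>
  </body>
</html>
"""

root = ET.fromstring(FRAGMENT)  # root is the <html> element

# Full path from the root: breaks as soon as any ancestor is moved or wrapped.
full_path = root.findall("./body/div/div/div/button")

# Shorter path anchored on an id: survives changes outside that subtree.
anchored = root.findall(".//div[@id='qc-cmp2-ui']//button")

print([b.text for b in full_path])
print([b.text for b in anchored])
```

Both queries select the same button here, but only the anchored one keeps working if, say, an extra wrapper `div` is added around the container. In Selenium the corresponding call would be `browser.find_element(By.XPATH, "//div[@id='qc-cmp2-ui']//button")`.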

### Filling Out and Submitting Forms
Sending keystrokes to text fields on a web page is a matter of finding the `<input>` or `<textarea>` element for that text field and then calling the `send_keys()` method. 


```python
    base_url = 'http://www.krak.dk'
    browser = webdriver.Firefox() # pass options with headless mode enabled to run without the overhead of a GUI
    browser.get(base_url) # load the page before looking for the search field
    browser.implicitly_wait(3)

    search_field = browser.find_element(By.NAME, 'searchQuery') # find_element_by_name is deprecated in Selenium 4
    search_field.send_keys('Møller')
    search_field.submit()
```
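
The form steps can be wrapped in a small function that receives the browser as an argument, which makes the flow easy to try against a stand-in object before pointing it at a real site. This is a sketch under assumptions: `submit_search`, `RecordingElement` and `RecordingBrowser` are hypothetical names, the field name `searchQuery` is taken from the example above and may not match the live site, and `'name'` is the string value behind `By.NAME`.

```python
def submit_search(browser, field_name, query):
    """Type `query` into the input named `field_name` and submit its form."""
    field = browser.find_element('name', field_name)  # same as By.NAME
    field.clear()            # remove any pre-filled text first
    field.send_keys(query)
    field.submit()           # submits the form that contains the field
    return field


class RecordingElement:
    """Hypothetical stand-in for a WebElement that records what happens to it."""

    def __init__(self):
        self.text_value = 'old text'
        self.submitted = False

    def clear(self):
        self.text_value = ''

    def send_keys(self, keys):
        self.text_value += keys

    def submit(self):
        self.submitted = True


class RecordingBrowser:
    """Hypothetical stand-in for a WebDriver exposing a single named field."""

    def __init__(self, field_name):
        self.fields = {field_name: RecordingElement()}

    def find_element(self, by, value):
        assert by == 'name'
        return self.fields[value]


fake = RecordingBrowser('searchQuery')
field = submit_search(fake, 'searchQuery', 'Møller')
print(field.text_value, field.submitted)
```

With a real browser the call would be `submit_search(browser, 'searchQuery', 'Møller')`, made after `browser.get(base_url)` has loaded the page.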

## Automatically Finding Names, addresses and numbers

In the `selenium_krak` module script (in the package `modules`), you will see that it opens a Firefox window, clicks the cookie approval box, enters a search string (*"Møller"*), clicks the link *"Personer"* to search for persons only, and finally prints the HTML source of the page.

## Headless mode in modules
When running Selenium from .py module files in a Docker container, we do not have a GUI. Therefore we run the browser in headless mode, which runs Selenium without graphical output. See the example [here](http://127.0.0.1:8888/edit/modules/selenium_krak.py)

Running without headless mode when no display is available produces this error: `WebDriverException: Message: invalid argument: can't kill an exited process`
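
One way to avoid that error is to decide at runtime whether a display is available. The sketch below assumes a Linux environment in which a GUI session sets the `DISPLAY` or `WAYLAND_DISPLAY` environment variable; `needs_headless` and `make_firefox` are hypothetical helper names. Selenium and webdriver_manager are imported inside `make_firefox` so that the display check can be run on its own.

```python
import os


def needs_headless(env=None):
    """Return True when no X11 or Wayland display is advertised (Linux assumption)."""
    env = os.environ if env is None else env
    return not (env.get('DISPLAY') or env.get('WAYLAND_DISPLAY'))


def make_firefox(env=None):
    """Create a Firefox WebDriver, headless if no display seems to be available."""
    # Imported here so needs_headless() works even where Selenium is not installed.
    from selenium import webdriver
    from selenium.webdriver.firefox.options import Options
    from selenium.webdriver.firefox.service import Service
    from webdriver_manager.firefox import GeckoDriverManager

    options = Options()
    if needs_headless(env):
        options.add_argument('-headless')  # Firefox command-line flag for headless mode
    return webdriver.Firefox(
        service=Service(GeckoDriverManager().install()), options=options
    )


print(needs_headless({}))                  # no display variables set
print(needs_headless({'DISPLAY': ':0'}))   # an X11 display is present
```

Inside the container, `browser = make_firefox()` would then start Firefox headless, while the same module run on a desktop session would open a visible window.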


## Class exercise
Find a web site to interact with and fill out a form to get some information back.  
Examples could be https://www.jobindex.dk/,    
https://google.com or   
https://www.ikea.com/dk/da/

In [9]:
# headless is needed here because we do not have a GUI version of firefox
options = Options()
options.headless = True
browser = webdriver.Firefox(service=Service(GeckoDriverManager().install()), options=options)

browser.get("http://www.krak.dk")

[WDM] - Downloading: 16.2kB [00:00, 8.28MB/s]                   
