# Automate Web Testing with Python and Selenium

Gregory Ciccarelli   
gregory.ciccarelli@cfa.harvard.edu  
2025

<hr>

### What is Selenium? 
Selenium is an open-source, cross-language library for interacting with websites programmatically:  
* Check if elements are present on a page
* Click elements and check for responses
* https://www.selenium.dev/documentation/webdriver/


<hr>

# Start Python

### Open a Python session
1. Open a new terminal
1. `conda activate adass2025`
1. `jupyter notebook`
    1. (optional) command line version: `ipython`


### Jupyter Notebook (*.ipynb) Quick Start
1. A Jupyter notebook is an interactive code and visualization file
    1. https://jupyter.org/
1. Every cell in a notebook can contain "markdown" or executable code
    1. Every cell by default is a 'code' cell.
    1. To turn code to markdown: click in the cell, <Escape> key, <m> key
    1. To turn markdown to code: click in the cell, <Escape> key, <y> key.
1. Shift+Enter, Cmd+Enter, or Ctrl+Enter runs the code in a cell.
    1. Cell output is printed below the cell.
    1. From the GUI, click the "play" arrow button.
1. To run everything, use `Run > Run all cells` in the top file menu (or `Cell > Run all` in older versions).
1. Variables/methods are globally shared across cells within the notebook.
    1. Clear all variables with `Kernel > Restart`.
    1. From the GUI, click the refresh symbol on the notebook toolbar (not the browser toolbar).

2 + 2 # This cell is markdown

In [None]:
2+2 # This cell is code

# Interact with Web pages with Selenium

In [None]:
# Import selenium classes so we can use them
from selenium import webdriver
from selenium.webdriver.common.by import By 

In [None]:
# Browser testing can run as a background (headless) process,
# or in a visible window so the programmatic manipulations can be observed.
# Uncomment the headless argument to put the browser in the background.

# Chrome
options = webdriver.ChromeOptions()
# options.add_argument("--headless")
driver = webdriver.Chrome(options=options)

# Create an instance of Firefox
# ('-headless' option runs as background process)
# options = webdriver.FirefoxOptions()
# options.add_argument("-headless")
# driver = webdriver.Firefox(options=options)



In [None]:
# Navigate to the url
url = "https://cda.cfa.harvard.edu/chaser/"
driver.get(url)

<hr>

# Use selenium to find the search button

### Pre-req Knowledge
This talk provides a bare minimum discussion of basic HTML/CSS web document structure.  
In depth discussions are out of scope.

### Web element HTML Id
1. Ideally each element's id is unique within the page.  
1. Some web pages break this rule:
    1. elements may share ids, or may not have any id at all.
1. Web elements may have other attributes besides "id"  
    1. e.g. "type", "name", "value"

### Identify an element's HTML id
1. Place the mouse cursor over the <search> button in the upper left corner.
1. Right click on the element.
1. Select "Inspect" from the popup menu.
    1. You may need to select "Inspect" twice if the browser tools window was not already open.

In [None]:
# HTML copy and pasted from browser tools
# (Not python code)
<input type="submit" value="Search" id="searchButton" 
name="operation" onclick="return startSearch();">

###  Find element by its html id

In [None]:
html_id = "searchButton"
e = driver.find_element(By.ID, html_id)
print("Found search button with value: " 
      + e.get_attribute("value"))

# Find elements by XPath 

XPath: an XML standard for describing the location of elements  
1. Applies to HTML, and also to VOTable (.vot) files  
1. Query for a single element or all elements matching criteria

### Inspect any element's XPath using browser tools
1. Right click on the element
2. Inspect
3. Go to the browser tool section and right click on the selected element's html
4. Copy > Copy xpath
    1. e.g. `//*[@id="searchButton"]` # search button
    1. e.g. `/html/body/form/table[1]/tbody/tr/td[1]/input` # search button again

### XPath Syntax
1. Start an XPath query with "//" 
    1. The "//" indicates any search depth is acceptable in the html tree
    1. Matches are returned in document order (depth first)
1. Add the html tag of the element to search
    1. e.g. "*" for any element
    1. e.g. "h2" for heading level 2 elements
1. (Optional) Add a "[@attribute='attribute value']" qualifier to filter by attributes.
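
The syntax rules above can be sketched as a small helper that assembles a query string. `build_xpath` is a hypothetical name (not part of Selenium), and the HTML snippet is a trimmed stand-in for the chaser page. The standard library's `xml.etree.ElementTree` supports only a subset of XPath and needs a leading `.`, but that is enough to try simple queries without a browser.

```python
import xml.etree.ElementTree as ET


def build_xpath(tag="*", attribute=None, value=None):
    """Assemble //tag or //tag[@attribute='value'] (hypothetical helper)."""
    query = f"//{tag}"
    if attribute is not None:
        query += f"[@{attribute}='{value}']"
    return query


# Trimmed, well-formed stand-in for the page (an assumption, not the real page)
snippet = """
<html><body>
  <h2>Observation Search</h2>
  <form>
    <input type="submit" value="Search" id="searchButton" name="operation"/>
  </form>
</body></html>
"""
root = ET.fromstring(snippet)

any_depth_h2 = build_xpath("h2")
by_id = build_xpath("*", "id", "searchButton")

# ElementTree only accepts relative paths, hence the "." prefix
headings = root.findall("." + any_depth_h2)
buttons = root.findall("." + by_id)

print(any_depth_h2, "->", [h.text for h in headings])
print(by_id, "->", [b.get("value") for b in buttons])
```

With Selenium the same strings are passed unchanged, e.g. `driver.find_elements(By.XPATH, by_id)`.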

### Find the first h2 element on the page

In [None]:
e = driver.find_element(By.XPATH, "//h2")
print("Found first H2 level heading with text: " 
      + e.text)

In [None]:
# HTML copy and pasted from browser tools
# Not python code
<h2 style="padding: 0px; margin: 0px;">
Observation Search</h2>

In [None]:
# Find elements by text
# Exact match
e_list = driver.find_elements(By.XPATH, 
                "//h2[text()='Observation Search']") 
print(e_list)

In [None]:
# Access first element in the list 
# Python is 0 indexed
print(e_list[0]) 

In [None]:
# Find elements by text
# Text fragment match
e_list = driver.find_elements(By.XPATH, 
        "//td[contains(text(), 'Customize Out')]")  
print("Found element with text: " + e_list[0].text)

In [None]:
# Find elements by attributes
# empty list because no such element exists
e_list = driver.find_elements(By.XPATH, 
            "//a[@title='View Help in new window']")
print(e_list)

In [None]:
# Find anchor (hyperlink) element by its destination url
e_list = driver.find_elements(By.XPATH, 
        "//a[@href='dispatchExternalSite?site=cxcHelpdesk']")    
print(e_list)

# Selenium Lessons Learned
1. `driver.find_element` vs `driver.find_elements`    
    1. `find_element` returns the first element matching the condition.    
    If no element matches, it raises `NoSuchElementException`.  
    1. `find_elements` returns a list of all matching elements.   
    If no element matches, it returns an empty list.
1. Contains/Exact match  
    1. Often there is invisible whitespace surrounding the text inside elements.  
    1. Therefore prefer `contains(text(), ...)` over `text()=` when searching with XPath.
1. Page load times  
    1. Use a "wait until" command, which can wait a fixed number of seconds or until a condition is met.
    1. https://www.selenium.dev/documentation/webdriver/waits/
    1. https://selenium-python.readthedocs.io/waits.html
1. Iframes  
    1. Many of the chaser web pages embed web pages within web pages (iframes).  
    1. You have to switch to the iframe (`driver.switch_to.frame`) before its elements can be found.
1. New windows/tabs  
    1. If a process opens a new window, switch to that window to find the element of interest and then switch back.
1. QUIT YOUR DRIVER so the browser's memory is reclaimed! `driver.quit()`    

### More Selenium Examples

In [None]:
# Submitting text input
from selenium.webdriver.common.keys import Keys

url = "https://cda.harvard.edu/chaser/"
driver.get(url)
e = driver.find_element(By.ID, "target")

e.clear()  # clear input box
e.send_keys("M33")  # enter text
e.send_keys(Keys.RETURN)  # press enter


In [None]:
# Simulate 'click' interactions
url = "https://cda.harvard.edu/chaser/"
driver.get(url)
e = driver.find_element(By.ID, "target")

e.clear()  # clear input box
e.send_keys("M33")  # enter text

# Click the element with id "nameResolver"
e = driver.find_element(By.ID, "nameResolver")
e.click()

In [None]:
# Select with By.CSS_SELECTOR as a faster alternative to By.XPATH
# CSS selectors cannot search by text content and are limited to html docs
url = "https://cda.harvard.edu/chaser/"
driver.get(url)
e_list = driver.find_elements(By.CSS_SELECTOR, "form[name='actionForm']")
print(e_list[0].get_attribute("action"))

In [None]:
driver.quit() # Important! Closing notebook does not quit the testing browser

# Bonus:  Non Selenium Web interaction

Pre-req
`pip install requests lxml`


In [None]:
### Example:  Use Python to perform GET requests and parse the returned content
#
# Query the Chandra footprint server with a conical search region

import os
import lxml.etree as ET
import requests

# Python
url = "https://cxcfps.cfa.harvard.edu/cgi-bin/cda/footprint/get_vo_table.pl?ra=10.684708&dec=41.268750&sr=0.2500000"
r = requests.get(url)


# Alternative non-Python Command line shell: bash/tcsh
print(f"curl '{url}' -o tmp.xml;")
print("xmllint --format tmp.xml > out.xml;")

In [None]:
print(r.content)

In [None]:
# Convert the text into a queryable object
root = ET.fromstring(r.content)

# Note: the xml namespace is needed
# even though the namespace is not explicit in the raw text content
def get_obsid_list(root):
    """Return list of strings of all obsids in vot.
    
    Duplicate obsids are included.
    
    Args:
        root (lxml.etree element): root of the parsed vot, e.g. root = ET.parse("myfile.vot").getroot()
    
    Returns:
        list[str]:  Each element is the string integer id of the obsid.
    
    """
    ns = "{http://www.ivoa.net/xml/VOTable/v1.1}"
    # Join XPath steps with "/" (os.path.join would use a backslash on Windows)
    path = "/".join([
        f".//{ns}TABLE[@name='SIAP_KEYWORDS']",
        f"{ns}DATA",
        f"{ns}TABLEDATA",
        f"{ns}TR",
        f"{ns}TD[1]"])
    return [e.text for e in root.findall(path)]


print(get_obsid_list(root))
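
As an alternative to repeating the full namespace in every step, `findall` accepts a prefix-to-namespace mapping. Below is a minimal sketch using the standard library's `xml.etree.ElementTree` (lxml's `findall` takes the same `namespaces` argument). The prefix `v`, the name `get_obsid_list_ns`, and the sample table with made-up obsids are assumptions for illustration, not output from the footprint server.

```python
import xml.etree.ElementTree as ET

# Map a short prefix (arbitrary choice) to the VOTable namespace
VOT_NS = {"v": "http://www.ivoa.net/xml/VOTable/v1.1"}

# Made-up sample data shaped like the SIAP_KEYWORDS table queried above
SAMPLE_VOT = """
<VOTABLE xmlns="http://www.ivoa.net/xml/VOTable/v1.1">
  <RESOURCE>
    <TABLE name="SIAP_KEYWORDS">
      <DATA>
        <TABLEDATA>
          <TR><TD>1001</TD><TD>row a</TD></TR>
          <TR><TD>1002</TD><TD>row b</TD></TR>
          <TR><TD>1001</TD><TD>row c</TD></TR>
        </TABLEDATA>
      </DATA>
    </TABLE>
  </RESOURCE>
</VOTABLE>
"""

# Same steps as get_obsid_list, written with the short prefix
OBSID_PATH = (
    ".//v:TABLE[@name='SIAP_KEYWORDS']"
    "/v:DATA/v:TABLEDATA/v:TR/v:TD[1]"
)


def get_obsid_list_ns(root):
    """Return the text of the first TD in every row (duplicates included)."""
    return [td.text for td in root.findall(OBSID_PATH, VOT_NS)]


sample_root = ET.fromstring(SAMPLE_VOT)
print(get_obsid_list_ns(sample_root))
```

The sample uses a separate variable, `sample_root`, so it does not overwrite the `root` parsed from the real response.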

# Install FAQs

1. No driver found when launching selenium?
    1. Are you on a recent version of Selenium (e.g. 4.15)? Versions 4.6 and later include Selenium Manager, which downloads a matching driver automatically.  
        `import selenium; print(selenium.__version__)`
    1. To download a driver manually, see  
        https://www.selenium.dev/documentation/webdriver/troubleshooting/errors/driver_location/
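
The version check in the first answer can be sketched as a small helper. `supports_selenium_manager` is a hypothetical name; the sketch assumes that automatic driver download (Selenium Manager) arrived in Selenium 4.6 and that the version string starts with `major.minor`.

```python
def supports_selenium_manager(version):
    """Return True if `version` (e.g. "4.15.2") is 4.6 or newer."""
    major, minor = (int(part) for part in version.split(".")[:2])
    return (major, minor) >= (4, 6)


for candidate in ("3.141.0", "4.5.0", "4.15.2"):
    print(candidate, supports_selenium_manager(candidate))

# With Selenium installed:
# import selenium
# print(supports_selenium_manager(selenium.__version__))
```

Comparing integer tuples avoids the string-comparison trap where "4.15" would sort before "4.6".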