PyCoffee 27/01/2025 - Adrien Masson (amasson@cab.inta-csic.es)

## What is Selenium?

Selenium is a powerful tool for automating web browser interactions. It allows you to control a web browser programmatically, making it ideal for:

### 1. **Web Scraping and Crawling**
- Extract data from websites, especially those that require JavaScript rendering
- Collect information loaded dynamically by JavaScript (which the plain `requests` module cannot see)
- Navigate through pagination, popups, and dynamic content

### 2. **Task Automation**
- Automate repetitive tasks like filling out forms
- Submit data to multiple websites automatically
- Perform administrative tasks that require web interface interaction
- Schedule batch operations that would be tedious to do manually

### 3. **Testing Web Applications**
- Test user interactions and workflows
- Verify that elements appear and behave correctly
- Test JavaScript functionality

### Why Selenium?
Unlike the `requests` module (which only gets static HTML), Selenium controls a real browser. This means:
- JavaScript code executes
- You can interact with dynamic content
- You can simulate real user behavior

## Setup: Installing Selenium and Webdriver Manager

Before we can use Selenium, we need to install two essential packages:

1. **`selenium`** - The main library for controlling web browsers
2. **`webdriver-manager`** - Automatically downloads and manages browser drivers (like ChromeDriver)

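Without a driver manager, you'd need to manually download the ChromeDriver version matching your Chrome browser. `webdriver-manager` does that automatically. Recent Selenium releases (4.6 and later) also bundle Selenium Manager, which fetches the driver on its own, so `webdriver-manager` is mainly useful for older versions or when you want explicit control over the driver.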

Install them with:
```bash
pip install selenium webdriver-manager
```

Or using the requirements file:
```bash
pip install -r requirements.txt
```
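To confirm that both packages are visible from the active environment, a quick check like the following can help. This is a minimal sketch using only the standard library; the helper name `installed_version` is ours, not part of Selenium.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version of `package`, or None if it is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Report the status of the two packages used in this tutorial
for name in ("selenium", "webdriver-manager"):
    print(f"{name}: {installed_version(name) or 'not installed'}")
```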

## Choosing a Browser: Chrome, Firefox, Edge, and Others

Selenium supports multiple browsers. Each has its own webdriver that acts as a "bot" controlling that browser.

| Browser | Webdriver class |
|---------|--------|
| **Chrome** | `webdriver.Chrome()` | 
| **Firefox** | `webdriver.Firefox()` | 
| **Edge** | `webdriver.Edge()` | 
| **Safari** | `webdriver.Safari()` | 

Let's start by importing the necessary libraries. We will work with Chrome for this example:

In [1]:
from selenium import webdriver 
from selenium.webdriver.chrome.options import Options # to add options to the browser
from selenium.webdriver.common.by import By # to find elements on the page
from selenium.webdriver.support.ui import WebDriverWait # to wait until a task is done (e.g. loading a page)

# Optional: For using Firefox with options instead of Chrome, you would import:
# from selenium.webdriver.firefox.options import Options

import time
import random


# Part 1: Web Scraping - ExoClock Website

Let's learn how to scrape data from the ExoClock database (https://www.exoclock.space/database/planets)

This website has a very limited API that won't cover all our needs, which makes it a good use case for Selenium.

## Creating a Chrome Webdriver Instance

Now we'll instantiate an automated Chrome session. This launches a real Chrome browser that we can control with code.

When you call `webdriver.Chrome()`, Selenium:
1. Installs the webdriver if necessary
2. Launches a real Chrome browser instance on your computer
3. Returns a `driver` object that represents this browser session

When creating the driver, we can configure `Options()` to customize how Chrome behaves. Here are the most important ones:
- `--start-maximized`: Opens Chrome in a maximized window (easier for scraping full-page content).
- `--headless=new`: Runs Chrome without a visible window (useful for automation on servers).
- `--disable-gpu`: Disables GPU hardware acceleration (sometimes needed for headless mode).
- `--incognito`: Opens Chrome in incognito/private mode (avoids caching and cookies).
- `--disable-blink-features=AutomationControlled`: Makes automation less detectable by websites.

Check the full [Selenium Documentation](https://selenium-python.readthedocs.io/) for more advanced use cases!
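When the same script has to run both on a laptop and on a server, it can help to build the list of flags in one place. The sketch below is only an illustration: `chrome_arguments` is a hypothetical helper, and the choice to skip `--start-maximized` in headless mode (where there is no visible window) is our assumption.

```python
def chrome_arguments(headless=False, incognito=False, maximized=True):
    """Build the list of Chrome command-line flags for a session."""
    args = []
    if headless:
        # No visible window in headless mode, so maximizing is skipped here
        args += ["--headless=new", "--disable-gpu"]
    elif maximized:
        args.append("--start-maximized")
    if incognito:
        args.append("--incognito")
    return args


print(chrome_arguments(headless=True, incognito=True))

# Applying the flags to a session (requires Chrome to be installed):
# opts = Options()
# for arg in chrome_arguments(headless=True):
#     opts.add_argument(arg)
# driver = webdriver.Chrome(options=opts)
```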

In [3]:
# ============================================================================
# EXAMPLE 1: Chrome Browser (Normal Mode - With Visible Window)
# ============================================================================

# Configure Chrome options
opts = Options()
opts.add_argument("--start-maximized")  # Start with maximized window
opts.add_argument("--disable-blink-features=AutomationControlled")  # Make automation less detectable
# opts.add_argument("--headless=new")  # Run without spawning a browser window

# Create the driver - this LAUNCHES a real Chrome browser
driver = webdriver.Chrome(options=opts)

# Navigate to a website
driver.get("https://www.exoclock.space/database/planets")

# Get the page title and current url
print(f"Page title: {driver.title}")
print(f"Current URL: {driver.current_url}")


Page title: ExoClock - a project to monitor transiting exoplanets
Current URL: https://www.exoclock.space/database/planets


You can then use this `driver` object to:
- Navigate to URLs: `driver.get("https://example.com")`
- Find elements on the page: `driver.find_element(...)`
- Click buttons: `element.click()`
- Type text: `element.send_keys("text")`
- And much more!
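Once the work is done, the session should be ended with `driver.quit()`, otherwise the browser and driver processes may keep running in the background. A minimal sketch of a context manager that guarantees this, even when an error occurs, is shown below. `browser_session` is a hypothetical helper name; it only assumes that the object returned by `make_driver()` has a `quit()` method, as Selenium drivers do.

```python
from contextlib import contextmanager


@contextmanager
def browser_session(make_driver):
    """Create a driver with `make_driver()` and always quit it afterwards."""
    driver = make_driver()
    try:
        yield driver
    finally:
        # quit() closes every window and ends the driver session
        driver.quit()


# Usage with the options defined above:
# with browser_session(lambda: webdriver.Chrome(options=opts)) as driver:
#     driver.get("https://www.exoclock.space/database/planets")
#     print(driver.title)
```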


## Finding Elements on a Web Page

To interact with a website, we first need to locate elements (buttons, text fields, links, etc.).
The easiest way is to right-click, in the browser, the element we want our driver to find and use the **Inspect Tool** (Right-click → Inspect) to reveal its place in the HTML structure:

![alt text](figures/inspect_tool.png)


We can then use the element's properties from the inspect tool to let our driver find it. Here for example using its XPath:

In [4]:
title = driver.find_element(By.XPATH,'/html/body/div[1]/div[2]/div[1]/div/div/h1')
print(title.text)


Ephemerides & Planet information



### Main Methods to Find Elements:

| Method | Syntax | Use Case | Pros | Cons |
|--------|--------|----------|------|------|
| **ID** | `By.ID` | Unique identifiers | Most reliable, fast | Not all elements have IDs |
| **Class** | `By.CLASS_NAME` | CSS class names | Good for styled elements | Can have duplicates |
| **XPath** | `By.XPATH` | Complex selectors | Very flexible | Breaks easily with layout changes |
| **CSS Selector** | `By.CSS_SELECTOR` | CSS selectors | Flexible, standardized | Can break with updates |
| **Tag Name** | `By.TAG_NAME` | HTML tags | Simple | Returns many matches |
| **Name** | `By.NAME` | HTML name attribute | Useful for forms | Not all elements have names |
| **Link Text** | `By.LINK_TEXT` | Text of links | Intuitive | Only for links, brittle |


For reference, this is how the most common locator targets appear in the HTML source:

- **HTML `id` attribute**: `<input id="example_id" ...>`
- **HTML `name` attribute**: `<input name="example_name" ...>`
- **HTML `class` attribute**: `<button class="example_class" ...>`
- **XPath**: a path expression, for complex selections
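Absolute XPaths copied from the Inspect tool (such as `/html/body/div[1]/...`) depend on the exact page layout. A relative XPath anchored on attributes is usually more robust. The sketch below builds such an expression; `xpath_by_attributes` is a hypothetical helper, and it assumes attribute values contain no single quotes.

```python
def xpath_by_attributes(tag, **attrs):
    """Build a relative XPath such as //input[@type='button' and @value='Go']."""
    if not attrs:
        return f"//{tag}"
    conditions = " and ".join(
        f"@{name}='{value}'" for name, value in attrs.items()
    )
    return f"//{tag}[{conditions}]"


# `class` is a reserved word in Python: pass it as **{"class": "item"}
print(xpath_by_attributes("input", type="button", value="Extract All"))

# Usage with a driver:
# driver.find_element(By.XPATH, xpath_by_attributes("input", name="stwvl"))
```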

### Finding Single Elements vs. Multiple Elements

```python
# Find a single element (returns first match or raises exception)
element = driver.find_element(By.ID, "element_id")

# Find multiple elements (returns a list, empty if none found)
elements = driver.find_elements(By.CLASS_NAME, "item")
```

In [5]:
# Example: find all rows in the table
rows_list = driver.find_elements(By.TAG_NAME,'tr') 
print(rows_list[:10]) # a list of WebElement objects

# print the content
print()
for el in rows_list[:10]:
    print(el.text)


[<selenium.webdriver.remote.webelement.WebElement (session="2d62834fd617ae569db54a52e40ba1e2", element="f.02756A724ED67CA41C5E13753F78295D.d.01C0131858A1A8723B84F0B1038AB4A8.e.125")>, <selenium.webdriver.remote.webelement.WebElement (session="2d62834fd617ae569db54a52e40ba1e2", element="f.02756A724ED67CA41C5E13753F78295D.d.01C0131858A1A8723B84F0B1038AB4A8.e.126")>, <selenium.webdriver.remote.webelement.WebElement (session="2d62834fd617ae569db54a52e40ba1e2", element="f.02756A724ED67CA41C5E13753F78295D.d.01C0131858A1A8723B84F0B1038AB4A8.e.127")>, <selenium.webdriver.remote.webelement.WebElement (session="2d62834fd617ae569db54a52e40ba1e2", element="f.02756A724ED67CA41C5E13753F78295D.d.01C0131858A1A8723B84F0B1038AB4A8.e.128")>, <selenium.webdriver.remote.webelement.WebElement (session="2d62834fd617ae569db54a52e40ba1e2", element="f.02756A724ED67CA41C5E13753F78295D.d.01C0131858A1A8723B84F0B1038AB4A8.e.129")>, <selenium.webdriver.remote.webelement.WebElement (session="2d62834fd617ae569db54a52e

We can also narrow down the search using the parent -> child relationship between elements. A 'parent' element can contain several 'child' elements, which we can search for once we have found the 'parent':

In [8]:
# Example: take a row from the table and find all the links in it.
# Rows 0 and 1 contain no links, so index 2 is the first planet row.
second_row = rows_list[2]
url_list = second_row.find_elements(By.TAG_NAME,'a')

print(url_list) # list of child WebElement contained in the parent
print()

for el in url_list:
    print(el.text,el.get_attribute('href'))


[<selenium.webdriver.remote.webelement.WebElement (session="2d62834fd617ae569db54a52e40ba1e2", element="f.02756A724ED67CA41C5E13753F78295D.d.01C0131858A1A8723B84F0B1038AB4A8.e.904")>, <selenium.webdriver.remote.webelement.WebElement (session="2d62834fd617ae569db54a52e40ba1e2", element="f.02756A724ED67CA41C5E13753F78295D.d.01C0131858A1A8723B84F0B1038AB4A8.e.905")>, <selenium.webdriver.remote.webelement.WebElement (session="2d62834fd617ae569db54a52e40ba1e2", element="f.02756A724ED67CA41C5E13753F78295D.d.01C0131858A1A8723B84F0B1038AB4A8.e.1061")>]

55Cnce https://www.exoclock.space/database/planets/55Cnce
HIGH https://www.exoclock.space/database/planets#priorities
Kokori et al. 2025 https://doi.org/10.17605/OSF.IO/WPJTN


We can access the properties of an element in several ways. We have already seen the `text` attribute, which returns the element's text, but more properties are available through the `is_displayed`, `get_attribute` and `value_of_css_property` methods:

In [9]:
# Get the properties of the planet name element in the row selected above
planet_name = second_row.find_element(By.TAG_NAME,'a')
# Get text
print(f"Element text: {planet_name.text}")

# Get an attribute value
href = planet_name.get_attribute("href")
print(f"Link href: {href}")

# Check if element is displayed
is_displayed = planet_name.is_displayed()
print(f"Element is displayed: {is_displayed}")

# Get CSS property value
color = planet_name.value_of_css_property("color")
print(f"Element color: {color}")


Element text: 55Cnce
Link href: https://www.exoclock.space/database/planets/55Cnce
Element is displayed: True
Element color: rgba(0, 123, 255, 1)


Now let's learn how to interact with elements like a real user would. The `click` method clicks an element (a button, a link, etc.).

In [13]:
# Example: Find and click on the planet name:
planet_name.click()


**Important note:** the driver stays attached to the tab it was controlling, even if that tab is closed or we manually bring another tab to the front. Since our click has opened a new tab, we need to tell the driver to switch to it:

In [14]:
# driver.window_handles gives the list of open tabs: switch to the second one
driver.switch_to.window(driver.window_handles[1]) 
# check with current url that we're on the new tab
print(driver.title) # also print the tab's title
print(driver.current_url)


ExoClock - 55Cnce
https://www.exoclock.space/database/planets/55Cnce/
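Indexing `window_handles` works when we know exactly how many tabs are open. A slightly more defensive variant, sketched below, records the handles before the click and then switches to whichever handle is new. `switch_to_new_tab` is a hypothetical helper; it assumes the new tab has already appeared when it is called.

```python
def switch_to_new_tab(driver, handles_before):
    """Switch to the tab that was not in `handles_before`; return its handle."""
    new_handles = [h for h in driver.window_handles if h not in handles_before]
    if not new_handles:
        raise RuntimeError("No new tab was opened")
    driver.switch_to.window(new_handles[0])
    return new_handles[0]


# Usage:
# handles_before = driver.window_handles   # record the tabs before clicking
# planet_name.click()                      # opens a new tab
# switch_to_new_tab(driver, handles_before)
```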


In [15]:
# Now find the element containing the discovery paper link
discovered_by_field = driver.find_element(By.XPATH,'/html/body/div[1]/div[2]/div[1]/div/div/h4')
# grab the discovery paper url:
discovery_paper_link = discovered_by_field.find_element(By.PARTIAL_LINK_TEXT,'et al.')
# check if we got the correct link
print(discovery_paper_link.text)

# click it: this will open a new tab
# discovery_paper_link.click()

# or use driver.get + the url to avoid opening a new tab
url = discovery_paper_link.get_attribute('href')
driver.get(url)


McArthur et al. 2004


Now let's try a slightly more complex example. We will go back to the first tab and get the content of the 'Read Me' notes. We first need to click the 'READ ME' link to make the notes popup appear.

In [16]:
# Close the current active tab and switch to the first one
driver.close()
driver.switch_to.window(driver.window_handles[0])


In [17]:
# Note: some elements must be visible for the driver to access them. For example, the 'notes' element is hidden until we click it:
notes = driver.find_element(By.ID,'notes')
# we can verify that the element is displayed or not:
print('Are "notes" visible:',notes.is_displayed()) # False
print(notes.text) # will print nothing

# We make the element visible by clicking the '*READ ME'
readme_link = driver.find_element(By.XPATH,'/html/body/div[2]/h4[1]/a')
readme_link.click()
notes = driver.find_element(By.ID,'notes')
# wait a bit for the window to pop
time.sleep(0.5)
print('Are "notes" visible:',notes.is_displayed()) # True !
print(notes.text) # will print the content
# wait a bit then close by hitting the 'close' button
time.sleep(0.5)
driver.find_element(By.XPATH,'/html/body/div[2]/div[3]/div/div/div[3]/button').click()


Are "notes" visible: False

Are "notes" visible: True
Notes
The list includes only the observable targets for ARIEL (Edwards et al. 2019).
All T0 values have been transformed from their original published time basis to BJDTDB.
If not clearly stated in the paper, the time basis of the published version is assumed to be UTC.
When the time basis of the published version is TT, it is assumed to be equal to TDB (the maximum difference is 3.4 ms).
In cases of asymmetric uncertainties, the largest one is used.
Close
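The fixed `time.sleep(0.5)` above is a guess: too short and the popup is not there yet, too long and we waste time. Selenium's `WebDriverWait` combined with `expected_conditions` (for example `visibility_of_element_located`) addresses this by polling. The sketch below reproduces that polling idea in plain Python to show what happens under the hood; `wait_until` is our own illustrative helper, not a Selenium function.

```python
import time


def wait_until(condition, timeout=10.0, poll=0.1):
    """Call `condition()` repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError after `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Condition not met within {timeout} s")
        time.sleep(poll)


# Usage with the popup above, in place of time.sleep(0.5):
# readme_link.click()
# wait_until(lambda: driver.find_element(By.ID, "notes").is_displayed(), timeout=5)
```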


### Full Example: get the list of ADS URLs for the discovery paper of each planet

In [18]:
from selenium.common.exceptions import NoSuchElementException

# First we grab the list of links pointing to the individual planet pages
list_of_planet_links = []
for k,row in enumerate(rows_list):
    try:
        planet_url = row.find_element(By.TAG_NAME,'a') # planet link is the first hyperlink in the row
        list_of_planet_links.append(planet_url)
    except NoSuchElementException: # raised when the row contains no hyperlink
        print(f'Url not found for row #{k}')

print(len(list_of_planet_links))
print([el.text for el in list_of_planet_links[:10]])


Url not found for row #0
Url not found for row #1
776
['55Cnce', 'AUMicb', 'AUMicc', 'CoRoT-1b', 'CoRoT-2b', 'CoRoT-3b', 'CoRoT-5b', 'CoRoT-10b', 'CoRoT-11b', 'CoRoT-19b']
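Since `find_elements` returns an empty list instead of raising an exception, the same filtering can be written without `try`/`except`. A minimal sketch, where `find_or_none` is a hypothetical helper that works on any object exposing `find_elements` (a driver or a `WebElement`):

```python
def find_or_none(parent, by, value):
    """Return the first matching child element, or None if there is none."""
    matches = parent.find_elements(by, value)
    return matches[0] if matches else None


# Usage, equivalent to the loop above:
# links = [find_or_none(row, By.TAG_NAME, "a") for row in rows_list]
# list_of_planet_links = [link for link in links if link is not None]
```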


In [19]:
# Second, visit each link and grab the discovery paper url. 
# We will limit ourselves to the first 10 planets and add a small delay to avoid sending aggressive requests to the website

# dictionary to store planet names & discovery papers
result = {}

# get the list of urls to visit
url_list = [element.get_attribute('href') for element in list_of_planet_links[:10]]

for url in url_list[:10]:
    driver.get(url) # use driver.get to avoid opening new tabs
    # get planet name
    planet_name = driver.find_element(By.XPATH,'/html/body/div[1]/div[2]/div[1]/div/div/h1/a[1]/font').text
    # grab the discovery paper url
    disco_field = driver.find_element(By.XPATH,'/html/body/div[1]/div[2]/div[1]/div/div/h4')
    disco_link = disco_field.find_element(By.PARTIAL_LINK_TEXT,'et al.')
    # store result
    result[planet_name] = disco_link.get_attribute('href')
    # wait a random amount of time before proceeding
    time.sleep(random.uniform(0.2,1.0))

print(result)


{'55Cnce': 'https://ui.adsabs.harvard.edu/abs/2004ApJ...614L..81M/abstract', 'AUMicb': 'https://ui.adsabs.harvard.edu/abs/2020Natur.582..497P/abstract', 'AUMicc': 'https://ui.adsabs.harvard.edu/abs/2021A&A...649A.177M/abstract', 'CoRoT-1b': 'https://ui.adsabs.harvard.edu/abs/2008A&A...482L..17B/abstract', 'CoRoT-2b': 'https://ui.adsabs.harvard.edu/abs/2008A&A...482L..21A/abstract', 'CoRoT-3b': 'https://ui.adsabs.harvard.edu/abs/2008A&A...491..889D/abstract', 'CoRoT-5b': 'https://ui.adsabs.harvard.edu/abs/2009A&A...506..281R/abstract', 'CoRoT-10b': 'https://ui.adsabs.harvard.edu/abs/2010A&A...520A..65B/abstract', 'CoRoT-11b': 'https://ui.adsabs.harvard.edu/abs/2010A&A...524A..55G/abstract', 'CoRoT-19b': 'https://ui.adsabs.harvard.edu/abs/2012A&A...537A.136G/abstract'}
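A scraped dictionary is usually saved for later use. The sketch below serializes a `{planet: url}` mapping like `result` into CSV text using only the standard library. The column names and the output file name are our own choices.

```python
import csv
import io


def results_to_csv(result):
    """Serialize a {planet: url} dictionary as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["planet", "discovery_paper_url"])
    for planet, url in result.items():
        writer.writerow([planet, url])
    return buffer.getvalue()


sample = {"55Cnce": "https://ui.adsabs.harvard.edu/abs/2004ApJ...614L..81M/abstract"}
print(results_to_csv(sample))

# To write the real results to disk (hypothetical file name):
# with open("discovery_papers.csv", "w", newline="") as f:
#     f.write(results_to_csv(result))
```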


# Part 2: Form Automation - VALD Database

Now let's look at another kind of task: automating form submission on the VALD database.

VALD (Vienna Atomic Line Database) at https://vald.astro.uu.se/vald.php allows us to request spectral line lists,
but has limitations on spectral range. We will use Selenium to automatically request multiple ranges to cover a wider spectrum.

**You must be logged in to VALD before running this automation!**

## Opening VALD in a Browser

Let's open VALD and use the Inspect tool to find the form elements:

In [20]:
# Open VALD (note: driver.get loads the page in the current tab, it does not open a new window)
driver.get("https://vald.astro.uu.se/vald.php")
print(f"Page title: {driver.title}")


Page title: VALD WWW interface


Once logged in, we can access the form by clicking on 'Extract All'. Just as before, we're going to find the elements to interact with using `find_element`. This time, however, we're looking for text fields and radio buttons. We will use `click` to click the buttons and `send_keys` to fill the text fields.

In [21]:
# click 'Extract All'
btn_extract_all = driver.find_element(
    By.XPATH, 
    "//input[@type='button' and @value='Extract All']"
)
btn_extract_all.click()


In [23]:
# Fill the start wavelength field
start_wave = driver.find_element(By.NAME, "stwvl")
start_wave.send_keys("1000")

# Fill the end wavelength field
end_wave = driver.find_element(By.NAME, "endwvl")
end_wave.send_keys("1040")
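Note that `send_keys` types into the field without erasing what is already there, so running the cell twice on the same form would concatenate the values. A minimal sketch of a safer pattern, using the element's `clear()` method first (`set_field` is a hypothetical helper name):

```python
def set_field(element, value):
    """Replace the content of a text field with `value`."""
    element.clear()                # remove any previous content
    element.send_keys(str(value))  # then type the new value


# Usage:
# set_field(driver.find_element(By.NAME, "stwvl"), 1000)
# set_field(driver.find_element(By.NAME, "endwvl"), 1040)
```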


In [24]:
# Click the radio button to select 'long format'
driver.find_element(
    By.XPATH,
    '/html/body/table/tbody/tr[2]/td[2]/form/table/tbody/tr[5]/td[2]/input'
).click()

# Click the radio button to retrieve data through FTP
driver.find_element(
    By.XPATH,
    '/html/body/table/tbody/tr[2]/td[2]/form/table/tbody/tr[7]/td[2]/input'
).click()

# Tick the box to include HFS splitting
driver.find_element(
    By.XPATH,
    '/html/body/table/tbody/tr[2]/td[2]/form/table/tbody/tr[9]/td[2]/input'
).click()

# Use custom list configuration
driver.find_element(
    By.XPATH,
    '/html/body/table/tbody/tr[2]/td[2]/form/table/tbody/tr[18]/td[2]/input'
).click()


## Example: Automating Multiple VALD Requests

Here's the code from our `auto_request_VALD.ipynb` file with detailed explanations:

In [25]:
import numpy as np

# Define wavelength ranges we want to request
# Since VALD has a 40 Å limit per request, we break the full range into chunks
wvl_start = np.array([1340, 1540, 1580, 1740])
wvl_stop = wvl_start + 40  # Each request covers 40 Å

print("Wavelength ranges to request:")
for start, stop in zip(wvl_start, wvl_stop):
    print(f"  {start} - {stop} Å")


Wavelength ranges to request:
  1340 - 1380 Å
  1540 - 1580 Å
  1580 - 1620 Å
  1740 - 1780 Å
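The ranges above are hand-picked. To cover one continuous interval instead, the chunks can be generated automatically. The sketch below assumes the 40 Å limit mentioned above; `split_range` is a hypothetical helper name.

```python
def split_range(start, stop, width=40):
    """Split [start, stop] into consecutive chunks no wider than `width`."""
    if width <= 0:
        raise ValueError("width must be positive")
    chunks = []
    lower = start
    while lower < stop:
        upper = min(lower + width, stop)  # the last chunk may be shorter
        chunks.append((lower, upper))
        lower = upper
    return chunks


for lower, upper in split_range(1340, 1460):
    print(f"  {lower} - {upper} Å")
```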


## The Automation Loop - Detailed Explanation

Here's what happens in each iteration:

1. **Click "Extract All"** - Reset the form
2. **Wait** - Give the page time to load
3. **Fill start wavelength** - Enter the start value
4. **Fill end wavelength** - Enter the end value
5. **Click checkboxes** - Select desired options
6. **Submit** - Click the request button
7. **Wait** - Let the browser process (very important!)

In [26]:
import time

# Store XPaths in variables for readability
XPATH_EXTRACT_ALL = "//input[@type='button' and @value='Extract All']"
XPATH_CHECKBOX_1 = '/html/body/table/tbody/tr[2]/td[2]/form/table/tbody/tr[5]/td[2]/input'
XPATH_CHECKBOX_2 = '/html/body/table/tbody/tr[2]/td[2]/form/table/tbody/tr[7]/td[2]/input'
XPATH_SUBMIT = '/html/body/table/tbody/tr[2]/td[2]/form/table/tbody/tr[24]/td[1]/input'

def fill_vald_form(driver, wvl_start, wvl_stop):
    """
    Fill and submit the VALD form for a given wavelength range.
    
    Args:
        driver: Selenium WebDriver instance
        wvl_start: Starting wavelength
        wvl_stop: Ending wavelength
    """
    try:
        # Click Extract All
        btn = driver.find_element(By.XPATH, XPATH_EXTRACT_ALL)
        btn.click()
        time.sleep(2)
        
        # Fill wavelength fields
        driver.find_element(By.NAME, "stwvl").send_keys(str(wvl_start))
        driver.find_element(By.NAME, "endwvl").send_keys(str(wvl_stop))
        
        # Select options
        driver.find_element(By.XPATH, XPATH_CHECKBOX_1).click()
        driver.find_element(By.XPATH, XPATH_CHECKBOX_2).click()
        
        # Submit
        time.sleep(1)
        driver.find_element(By.XPATH, XPATH_SUBMIT).click()
        
        return True
    except Exception as e:
        print(f"Error submitting form for range {wvl_start}-{wvl_stop}: {e}")
        return False

# Example usage:
for start, stop in zip(wvl_start, wvl_stop):
    fill_vald_form(driver, start, stop)
    time.sleep(random.uniform(1, 2))  # Random wait between requests

print('All request forms sent!')


All request forms sent!
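The loop above prints the final message even if some submissions failed, because the return value of `fill_vald_form` is ignored. A minimal sketch of a variant that keeps track of failed ranges so they can be retried is shown below. `submit_all` and its `pause` argument are hypothetical names.

```python
def submit_all(submit, ranges, pause=None):
    """Call `submit(start, stop)` for each range; return the ranges that failed."""
    failed = []
    for start, stop in ranges:
        if not submit(start, stop):
            failed.append((start, stop))
        if pause is not None:
            pause()  # e.g. a random delay between two requests
    return failed


# Usage with the function defined above:
# failed = submit_all(
#     lambda a, b: fill_vald_form(driver, a, b),
#     zip(wvl_start, wvl_stop),
#     pause=lambda: time.sleep(random.uniform(1, 2)),
# )
# print(f"{len(failed)} request(s) failed: {failed}")
```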


## Important: Avoiding Bot Detection

When using Selenium for automation, especially with multiple requests, you are likely to be flagged as a bot.
Here are strategies to minimize this:

In [None]:
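import random
import time

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
from webdriver_manager.chrome import ChromeDriverManager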

# Add random delays between requests
def wait_randomly(min_seconds=2, max_seconds=5):
    """
    Wait for a random amount of time to appear more human-like.
    """
    wait_time = random.uniform(min_seconds, max_seconds)
    time.sleep(wait_time)

# Set browser options that look more like a real user
# -> Below options adjust the signature of your session to make it look more 'human'
def create_user_agent_driver():
    """
    Create a driver with a realistic user-agent string.
    """
    opts = Options()
    
    # Add a realistic user-agent
    opts.add_argument(
        'user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
        'AppleWebKit/537.36 (KHTML, like Gecko) '
        'Chrome/91.0.4472.124 Safari/537.36'
    )
    
    # Additional options to appear more human
    opts.add_argument('--disable-blink-features=AutomationControlled')
    opts.add_experimental_option('excludeSwitches', ['enable-automation'])
    opts.add_experimental_option('useAutomationExtension', False)
    
    return webdriver.Chrome(
        service=Service(ChromeDriverManager().install()),
        options=opts
    )

# Use explicit waits instead of sleep when possible 
# -> this will wait for an element to be ready before trying to interact with it
def wait_for_element(driver, by, value, timeout=10):
    """
    Wait for an element to be present before interacting.
    More efficient than blindly sleeping.
    """
    wait = WebDriverWait(driver, timeout)
    element = wait.until(EC.presence_of_element_located((by, value)))
    return element


# Conclusion: When to Use Selenium

## Summary

Selenium is a powerful tool, but it's not always the best choice. It is useful when:
- The website heavily relies on JavaScript to load content
- You need to interact with forms, buttons, and dynamic elements
- You need to simulate real user behavior (clicks, typing, scrolling)
- No API is available
- The website blocks simple HTTP requests

However, for scientific websites there is usually a better solution. Avoid Selenium when:
- An official API is available (**use the API first!**)
- The page is pure static HTML (use the `requests` module instead: it's faster!)
- You need to scrape massive amounts of data (Selenium is too slow)
- `requests` + `BeautifulSoup` can do the same task
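For a static page, no browser is needed at all. The sketch below extracts links from an HTML string with the standard library's `html.parser`, just to illustrate the idea. In practice `requests` + `BeautifulSoup` would be the more common choice, and the sample HTML here is invented.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href attribute of every <a> tag in an HTML document."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


# Invented sample; in practice the HTML would come from requests.get(url).text
html = '<table><tr><td><a href="/planets/55Cnce">55Cnce</a></td></tr></table>'
collector = LinkCollector()
collector.feed(html)
print(collector.links)
```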

**With great power comes great responsibility!**

Your IP may get blocked or restricted if you send requests too frequently, as this could be interpreted as an attack. 
Plus, some websites may detect that you're a bot and prevent Selenium from working.
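One simple way to stay polite is to enforce a minimum interval between page loads. The sketch below is an illustration only: `Throttle` is a hypothetical class, and the 2-second interval in the usage comment is an arbitrary value, not a limit published by any of the websites above.

```python
import time


class Throttle:
    """Enforce a minimum delay between successive calls to wait()."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call, None before the first

    def wait(self):
        """Sleep just long enough to respect the minimum interval."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
        self._last = self._clock()


# Usage: at most one page load every 2 seconds
# throttle = Throttle(2.0)
# for url in url_list:
#     throttle.wait()
#     driver.get(url)
```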

**Maintenance:**

Selenium scripts are **brittle**:
- If the website updates its HTML structure, your XPaths break
- You may need to update element selectors regularly
- Prefer stable selectors (ID, NAME) over XPath when possible
- This is a "monkey automation" approach
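One way to soften this brittleness is to try several locators in order of stability. A minimal sketch, where `find_first` is a hypothetical helper and the fallback locators in the usage comment are illustrative only:

```python
def find_first(driver, locators):
    """Try each (by, value) locator in order; return the first element found."""
    for by, value in locators:
        matches = driver.find_elements(by, value)
        if matches:
            return matches[0]
    raise LookupError(f"No element matched any of: {locators}")


# Usage: prefer the specific locator, fall back to a looser one
# button = find_first(driver, [
#     (By.XPATH, "//input[@type='button' and @value='Extract All']"),
#     (By.XPATH, "//input[@value='Extract All']"),
# ])
```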

![alt text](figures/monkey.png)

Selenium should be your **last resort**, not your first tool ;)

## Additional Resources

### Documentation
- [Selenium Python Docs](https://selenium-python.readthedocs.io/)
- [Selenium Browser Drivers](https://www.selenium.dev/documentation/webdriver/)

### Useful Modules
- `selenium`: Web automation
- `webdriver-manager`: Automatic driver management
- `requests`: HTTP library (better for static pages)
- `BeautifulSoup`: HTML parsing
- `lxml`: Fast XML/HTML parsing

**Happy automating!**