# Scraping with Selenium

Many modern websites rely on JavaScript to load and navigate their content dynamically. However, the usual Python scraping tools (like `requests`) cannot execute JavaScript, so they struggle to retrieve the content of dynamic web pages.

Selenium is a popular solution to this problem. It was originally created to automate tests on websites. It opens your browser _for real_ and lets you simulate human interactions with a website through Python commands.

For example, it is useful when information only becomes accessible after clicking on buttons (which is not possible with `requests` and `beautifulsoup`).
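
To make this limitation concrete, here is a minimal sketch using only the standard library. The page in `STATIC_HTML` is hypothetical: its article list is empty in the markup and only filled in by a script, which is roughly what `requests` would hand you for a dynamic site. `VisibleText` and `visible_text` are illustrative names, not part of any library.

```python
from html.parser import HTMLParser

# Hypothetical page: the list is empty in the HTML and filled in by JavaScript.
STATIC_HTML = """
<html><body>
  <h1>News</h1>
  <ul id="articles"></ul>
  <script>
    document.getElementById("articles").innerHTML = "<li>Breaking story</li>";
  </script>
</body></html>
"""


class VisibleText(HTMLParser):
    """Collect text nodes, skipping whatever sits inside <script> tags."""

    def __init__(self):
        super().__init__()
        self.in_script = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.in_script = True

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_script = False

    def handle_data(self, data):
        # Keep only non-empty text that is not script source code.
        if not self.in_script and data.strip():
            self.chunks.append(data.strip())


def visible_text(html):
    """Return the text a static parser can see, without running any script."""
    parser = VisibleText()
    parser.feed(html)
    return parser.chunks


print(visible_text(STATIC_HTML))
```

Only the heading is found. The story that the script would insert never appears, because nothing executed the script.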

### Install Selenium according to this manual

https://selenium-python.readthedocs.io/installation.html#downloading-python-bindings-for-selenium/bin

*NB: On Linux, place the downloaded `geckodriver` executable in `/home/<YOUR_NAME>/.local/bin/` (or the equivalent directory on your machine).*

If you want an up-to-date, single-file version of this code, you can check the `selenium.py` script made by Robin.

In [1]:
# The selenium.webdriver module provides all the implementations of WebDriver
# Currently supported are Firefox, Chrome, IE and Remote. The `Keys` class provides keys on
# the keyboard such as RETURN, F1, ALT etc.
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By

### Search engine simulation

We will simulate a query on the official Python website by using the search bar.

In [5]:
# Here, we create an instance of the Safari WebDriver
# (use webdriver.Firefox() instead if you installed geckodriver).
driver = webdriver.Safari()

# The driver.get method navigates to the page at the given URL. WebDriver waits until the page is fully
# loaded (i.e. the "onload" event has fired) before returning control to your script.
# Note that if the page makes a lot of AJAX calls while loading, WebDriver may not know
# when it has fully loaded.
driver.get("http://www.python.org")

# The following line asserts that the page title contains the word "Python".
assert "Python" in driver.title

# WebDriver offers a `find_element` method that searches for an element based on its attributes.
# For example, the text input element can be located by its `name` attribute,
# here with the value `q`.
elem = driver.find_element(By.NAME, "q")

# Then we send keys. This is similar to typing on your keyboard.
# Special keys can be sent using the `Keys` class imported above (from selenium.webdriver.common.keys import Keys).
# As a precaution, we first delete any pre-filled text in the input field
# (for example, "Search") so that it does not affect our search results:
elem.clear()
elem.send_keys("pycon")
elem.send_keys(Keys.RETURN)

# After submitting the query, you should get the results page, if there are any results. To check that
# something was found, assert that the page source does not contain the text "No results found.".
assert "No results found." not in driver.page_source
driver.close()
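
The comment about AJAX-heavy pages in the cell above points at a common problem: the element you want may not exist yet when `driver.get` returns. Selenium ships its own explicit-wait helper for this (`WebDriverWait` in `selenium.webdriver.support.ui`), which is what you would normally use. The sketch below is not that helper. It is a hand-rolled polling loop, with a hypothetical `wait_until` function, that only illustrates the idea behind it.

```python
import time


def wait_until(condition, timeout=10.0, poll=0.5):
    """Call condition() repeatedly until it returns something truthy.

    Raises TimeoutError if nothing truthy came back before `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met after {timeout} seconds")
        time.sleep(poll)


# Hypothetical usage with an open driver and the `By` class imported earlier:
# find_elements returns an empty (falsy) list while nothing matches.
# results = wait_until(lambda: driver.find_elements(By.NAME, "q"))
```

Polling `find_elements` rather than `find_element` keeps the sketch free of exception handling, since an empty list simply means "not there yet".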

### Getting the titles of all the articles from the homepage of _The New York Times_

First let's open the homepage of the newspaper's website.


In [92]:
url = "https://www.nytimes.com/"

driver = webdriver.Safari()
driver.get(url)

As you can see, you are greeted by the famous GDPR banner. Let's set the consent cookies so that we can get past it and access the page!

In [None]:
cookie1 = {'name' : 'fides_consent', 'value' : '%7B%22consent%22%3A%7B%7D%2C%22identity%22%3A%7B%22fides_user_device_id%22%3A%22fc660568-3ec2-42ea-8320-35e830640263%22%7D%2C%22fides_meta%22%3A%7B%22version%22%3A%220.9.0%22%2C%22createdAt%22%3A%222025-03-25T09%3A02%3A30.940Z%22%2C%22updatedAt%22%3A%222025-03-25T09%3A03%3A01.645Z%22%2C%22consentMethod%22%3A%22reject%22%7D%2C%22tcf_consent%22%3A%7B%22system_consent_preferences%22%3A%7B%7D%2C%22system_legitimate_interests_preferences%22%3A%7B%7D%7D%2C%22fides_string%22%3A%22CQO03QAQO03QAGXABBENBiFgAAAAAAAAAAAAAAAAAAAA%2C1~%22%2C%22tcf_version_hash%22%3A%2209336ff51657%22%7D', 'domain': '.nytimes.com'}
cookie2 = {'name' : 'nyt-gdpr', 'value' : '1', 'domain': '.nytimes.com'}
cookie3 = {'name' : 'nyt-traceid', 'value' : '00000000000000000bccfe887fcbd446', 'domain': '.nytimes.com'}
driver.add_cookie(cookie1)
driver.add_cookie(cookie2)
driver.add_cookie(cookie3)
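
The long `fides_consent` value above is a URL-encoded JSON object. Here is a short sketch of how such a value can be built and inspected with the standard library. The `payload` below is a shortened, hypothetical stand-in for the real cookie, and `encode_cookie_value` and `decode_cookie_value` are illustrative helper names.

```python
import json
from urllib.parse import quote, unquote


def encode_cookie_value(payload):
    """Serialize a dict to compact JSON, then percent-encode it."""
    compact = json.dumps(payload, separators=(",", ":"))
    return quote(compact, safe="")


def decode_cookie_value(value):
    """Reverse the encoding: percent-decode, then parse the JSON."""
    return json.loads(unquote(value))


# Hypothetical, shortened payload (NOT the real cookie content).
payload = {"consent": {}, "fides_meta": {"consentMethod": "reject"}}

encoded = encode_cookie_value(payload)
print(encoded)
print(decode_cookie_value(encoded)["fides_meta"]["consentMethod"])
```

Decoding the real value the same way should make it easier to see which consent choice the cookie records.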


In [94]:
driver.refresh()

In [95]:
# Find the element with the class 'css-hqisq1' and click its child button with JavaScript
driver.execute_script("""
    var parentElement = document.querySelector('.css-hqisq1');
    if (parentElement) {
        var button = parentElement.querySelector('button');  // Find the button inside this element
        if (button) {
            button.click();  // Click the button
        } else {
            console.log('Button not found');
        }
    } else {
        console.log('Parent element not found');
    }
""")
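
The same click can probably be done without JavaScript, through Selenium's own element lookup. The sketch below relies on `find_elements`, which returns an empty list instead of raising when nothing matches. `click_first` is a hypothetical helper, and the selector in the usage comment simply combines the class and tag used in the script above.

```python
def click_first(driver, by, selector):
    """Click the first element matching the locator.

    Returns True if an element was clicked, False if nothing matched.
    """
    matches = driver.find_elements(by, selector)
    if not matches:
        return False
    matches[0].click()
    return True


# Hypothetical usage with the open driver and the `By` class imported earlier:
# clicked = click_first(driver, By.CSS_SELECTOR, ".css-hqisq1 button")
```

A native click can fail when another element covers the button, which is one reason to fall back on the JavaScript version.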

Now let's get the titles of all the articles using XPath and store them in a list.


In [96]:
article_titles = driver.find_elements(By.XPATH, "//section[@class='story-wrapper']//a//div[@class='css-xdandi']//p")
all_titles = []
for title in article_titles:
    all_titles.append(title.text)

print(all_titles)

['Analysis', 'Is Russia an Adversary or a Future Partner? Trump Aides May Have to Decide.', 'How a Cheap Drone Punctured Chernobyl’s 40,000 Ton Shield', 'Russia and Ukraine’s U.S.-Mediated Talks: What to Know', 'White House Inner Circle Discussed Military Plans in Extraordinary Breach', 'As Trump Policies Worry Scientists, France and Others Put Out a Welcome Mat', 'Washington Bends to Kennedy’s Agenda on Obesity and Healthy Eating', 'As Crises Grip U.S. Colleges, More Students Than Ever Are Set to Enroll', 'Columbia Student Sought by ICE Sues to Prevent Deportation', 'Columbia Faculty Protests as Trump Officials Hail University Concessions', 'She Was Released From Hamas Captivity. Now She’s Campaigning for Her Partner Still in Gaza.', 'Hillel, the Campus Jewish Group, Is Thriving, and Torn by Conflict', 'U.N. to Pull International Workers From Gaza Amid Israeli Strikes', 'A Palestinian Director of ‘No Other Land’ Is Attacked and Detained, Witnesses Say', 'What Makes Sydney’s New Beach 
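
The raw list can contain entries that are not headlines, such as the 'Analysis' label at the start of the output above, and possibly blanks or repeats. Here is a small post-processing sketch; `clean_titles` and its `labels` default are assumptions chosen for illustration.

```python
def clean_titles(raw_titles, labels=("Analysis",)):
    """Strip whitespace, then drop empty strings, known labels and duplicates.

    The original order of the remaining titles is preserved.
    """
    seen = set()
    cleaned = []
    for title in raw_titles:
        text = title.strip()
        if not text or text in labels or text in seen:
            continue
        seen.add(text)
        cleaned.append(text)
    return cleaned


# Small sample built from entries in the output above, plus a blank and a repeat.
sample = [
    "Analysis",
    " Columbia Student Sought by ICE Sues to Prevent Deportation",
    "",
    "Columbia Student Sought by ICE Sues to Prevent Deportation",
]
print(clean_titles(sample))
```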

There we are! Let's close the browser.

In [97]:
driver.close()
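
If an assertion or lookup fails before `driver.close()` runs, the browser window is left open. One way to guard against that, sketched here with a hypothetical `browser` helper, is a context manager that always shuts the driver down. It calls `driver.quit()`, which ends the whole session, rather than `close()`, which only closes the current window.

```python
from contextlib import contextmanager


@contextmanager
def browser(make_driver):
    """Create a driver, hand it to the `with` block, and always quit it afterwards."""
    driver = make_driver()
    try:
        yield driver
    finally:
        # Runs even if the body of the `with` block raised an exception.
        driver.quit()


# Hypothetical usage, assuming `webdriver` is imported as in the first cell:
# with browser(webdriver.Safari) as driver:
#     driver.get("https://www.nytimes.com/")
#     print(driver.title)
```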

### Exercise

1. Use Selenium to open the homepage of your favourite newspaper (not the New York Times, that would be too easy)
2. Close the cookie banner (if it appears)
3. Get the link of the first article of the page and open it
4. Print the title and the content of the article

**Tip:** [Newspaper3k](https://pypi.org/project/newspaper3k/) is a powerful library for scraping articles from newspapers. Have a look at the `fulltext` function.