# Scraping with Selenium

A lot of modern websites rely on JavaScript to load and navigate their content dynamically. However, the usual Python scraping tools (like `requests`) cannot execute JavaScript, so they struggle to retrieve the content of dynamic web pages.

Selenium is a popular solution to this problem. It was originally created to automate website testing. It opens your browser _for real_ and lets you simulate human interactions with a website through Python commands.

For example, it is useful when information only becomes accessible after clicking on a button (which is not possible with `requests` and `beautifulsoup`).
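To make this limitation concrete, here is a minimal sketch that uses only the standard library. The HTML string is made up for illustration and stands in for what a static scraper would download; the `DivTextCollector` class is a hypothetical helper as well. The headline is injected by a script, so a parser that does not run JavaScript finds the container empty.

```python
from html.parser import HTMLParser

# Hypothetical page source: the <div> is only filled in by JavaScript at load time.
RAW_HTML = """
<html><body>
  <div id="headline"></div>
  <script>
    document.getElementById("headline").textContent = "Breaking news";
  </script>
</body></html>
"""


class DivTextCollector(HTMLParser):
    """Collect the text found inside <div id="headline">, without running scripts."""

    def __init__(self):
        super().__init__()
        self.inside_target = False
        self.text_parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "div" and dict(attrs).get("id") == "headline":
            self.inside_target = True

    def handle_endtag(self, tag):
        if tag == "div":
            self.inside_target = False

    def handle_data(self, data):
        # Script text also arrives here, but it is outside the target <div>.
        if self.inside_target:
            self.text_parts.append(data)


parser = DivTextCollector()
parser.feed(RAW_HTML)
headline = "".join(parser.text_parts).strip()
print(repr(headline))  # the container is empty because the script never ran
```

A real browser, which is what Selenium drives, would run the script and show the headline text.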

### Install Selenium according to this manual

https://selenium-python.readthedocs.io/installation.html#downloading-python-bindings-for-selenium/bin

*NB: On Linux, put `geckodriver` (the driver executable you downloaded) in the equivalent of `/home/<YOUR_NAME>/.local/bin/` on your machine.*

In [1]:
# The selenium.webdriver module provides all the implementations of WebDriver
# Currently supported are Firefox, Chrome, IE and Remote. The `Keys` class provides keys on
# the keyboard such as RETURN, F1, ALT etc.
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By

### Search engine simulation

We will simulate a query on the official Python website by using the search bar.

In [2]:
# Here, we create an instance of the Firefox WebDriver.
driver = webdriver.Firefox()

# The driver.get method navigates to the page at the given URL. WebDriver waits until the page is fully
# loaded (i.e. the "onload" event has fired) before returning control to your script.
# Note that if the page makes a lot of AJAX calls while loading, WebDriver may not know
# when it has fully loaded.
driver.get("http://www.python.org")

# The following line is a statement confirming that the title contains the word "Python".
assert "Python" in driver.title

# WebDriver offers a `find_element` method that locates an element by one of its attributes.
# For example, the text input element can be located by its `name` attribute,
# here with the value `q`.
elem = driver.find_element(By.NAME, "q")

# Then we send keys. This is similar to typing on your keyboard.
# Special keys can be sent using the `Keys` class imported above (from selenium.webdriver.common.keys import Keys).
# To be safe, we first clear any pre-filled text in the input field
# (for example, "Search") so that it does not affect our search results:
elem.clear()
elem.send_keys("pycon")
elem.send_keys(Keys.RETURN)

# After submitting the query, you should get the results page, if there are any results. To check that
# something was found, assert that the page source does not contain the text "No results found.".
assert "No results found." not in driver.page_source
driver.close()

### Getting the titles of all the articles on the homepage of _The New York Times_

First let's open the homepage of the newspaper's website.


In [17]:
url = "https://www.nytimes.com/"

driver = webdriver.Firefox()
driver.get(url)

As you can see, the page is covered by the familiar GDPR cookie banner. Let's accept it in order to access the page!

In [5]:
cookie_button = driver.find_element(By.XPATH, "//button[@data-testid='Accept all-btn']")
cookie_button.click()
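
The banner may not appear on every visit, and `find_element` raises an exception when nothing matches. A minimal sketch of a more tolerant approach follows; it relies on `find_elements` returning an empty list when there is no match. The helper name `click_if_present` is hypothetical.

```python
def click_if_present(driver, by, value):
    """Click the first element matching the locator; return False if none exists."""
    matches = driver.find_elements(by, value)  # empty list when nothing matches
    if not matches:
        return False
    matches[0].click()
    return True
```

For the banner above, the call would be `click_if_present(driver, By.XPATH, "//button[@data-testid='Accept all-btn']")`.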

Now let's get the titles of all the articles using XPath and store them in a list.


In [6]:
article_titles = driver.find_elements(By.XPATH, "//section[@class='story-wrapper']//p[@class='indicate-hover css-1a5fuvt'] | //p[@class='indicate-hover css-1gg6cw2']")
# Keep only the visible text of each matched element
all_titles = [title.text for title in article_titles]

all_titles

['For Some Investors, Aging and Empty Office Buildings Aren’t a Bad Thing',
 'Inflation data is coming just before the Fed meeting. Will it be a game changer?',
 'Here’s what to watch as the Federal Reserve meets.',
 'The E.U. is raising tariffs on electric vehicles from China to protect against what officials call unfair competition.',
 'President Biden has grown more resigned and afraid about his son’s future, according to people close to them.',
 'Hunter Biden’s laptop, revealed by The New York Post, came back to haunt him.',
 'Ukraine said it shot down most of a barrage of Russian missiles and drones.',
 'At the G7 summit, President Biden will push for using frozen Russian assets to help Ukraine.',
 'North Dakotans approved an age limit for members of Congress, though the measure is likely to be challenged in court.',
 'Taped remarks at a Supreme Court gala revealed glimpses of Chief Justice John Roberts and Justice Samuel Alito.',
 'Wordle',
 'Strands | BETA',
 'Connections',
 'Co
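
A list collected this way can contain empty strings or repeated entries. Here is a small sketch of a cleanup step, assuming only that each element exposes a `.text` attribute as Selenium's `WebElement` does; the helper name `unique_texts` is made up for this example.

```python
def unique_texts(elements):
    """Return the stripped, non-empty texts of the elements, without duplicates."""
    seen = set()
    texts = []
    for element in elements:
        text = element.text.strip()
        # Skip blanks and anything already collected, keeping the first-seen order
        if text and text not in seen:
            seen.add(text)
            texts.append(text)
    return texts
```

It could be applied as `all_titles = unique_texts(article_titles)`.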

There we are! Now let's close the browser.

In [7]:
driver.close()
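
Note that `driver.close()` closes the current window, while `driver.quit()` ends the whole session. If an exception is raised before the closing cell runs, the browser stays open. One way to guard against that is a context manager, sketched here; `managed_driver` is a hypothetical helper that accepts any zero-argument factory, such as `webdriver.Firefox`.

```python
from contextlib import contextmanager


@contextmanager
def managed_driver(factory):
    """Create a driver with `factory` and always quit it, even after an error."""
    driver = factory()
    try:
        yield driver
    finally:
        driver.quit()
```

Usage would look like `with managed_driver(webdriver.Firefox) as driver: driver.get(url)`.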

### Exercise

1. Use Selenium to open the homepage of your favourite newspaper (not _The New York Times_, that would be too easy)
2. Close the cookie banner (if it appears)
3. Get the link of the first article on the page and open it
4. Print the title and the content of the article

**Tip:** [Newspaper3k](https://pypi.org/project/newspaper3k/) is a powerful library for scraping articles from newspapers. Have a look at its `fulltext` function.

In [54]:
import requests

url = "https://www.sudinfo.be/2098/sections/regions/liege"

driver = webdriver.Firefox()
driver.get(url)

In [55]:
cookie_button = driver.find_element(By.XPATH, "//button[@id='didomi-notice-agree-button']")
cookie_button.click()

In [49]:
# Collect the article links on the page and open the first one by clicking it
article_links = driver.find_elements(By.XPATH, "//a[@class='r-article--link']")
article_links[0].click()

[<selenium.webdriver.remote.webelement.WebElement (session="293921cb-c2b4-4515-a1bf-48ab674dc001", element="d53e3b62-4c37-4d9c-8572-4f904df5eca7")>]
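
Clicking works, but step 3 of the exercise asks for the link itself. Here is a sketch of an alternative that reads the `href` attribute and navigates with `driver.get`; the helper name `open_first_link` is made up for this example.

```python
def open_first_link(driver, links):
    """Read the href of the first link element, open it, and return the URL."""
    if not links:
        raise ValueError("no article links were found on the page")
    href = links[0].get_attribute("href")
    driver.get(href)
    return href
```

It could be called as `open_first_link(driver, driver.find_elements(By.XPATH, "//a[@class='r-article--link']"))`, which also gives you the article URL to print or store.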

In [53]:
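# Read the article title from the first <h1> element
title_h1 = driver.find_element(By.XPATH, "//h1").text
# Collect the text of every <header> element on the article page
content = driver.find_elements(By.XPATH, "//header")
c_text = [element.text for element in content]
c_text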

['Le tunnel de Cointe sera à nouveau fermé tout cet été: la décision du retour aux 80 km/h ne sera pas prise... avant 2025\nLa dernière phase de modernisation de la liaison E25-E40/A602 se tiendra pendant ces congés d’été : le tunnel de Cointe en direction de Bruxelles fera peau neuve, pour continuer à assurer des conditions de sécurité suffisantes, du 30 juin au 26 août. Pas de retour aux 80 km/h avant 2025.']

In [56]:
driver.close()