# Scraping with Selenium

Many modern websites rely on JavaScript to load and navigate their content dynamically. However, the usual Python scraping tools (like `requests`) cannot execute JavaScript, so they struggle to retrieve the content of dynamic web pages.

Selenium is a popular solution to this problem. It was originally created to automate tests on websites. It opens your browser _for real_ and lets you simulate human interactions with a website through Python commands.

For example, it is useful when information only becomes accessible after clicking a button (which is not possible with `requests` and `beautifulsoup`).
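To make the limitation concrete, here is a minimal sketch that uses only the standard library. The HTML string is a made-up stand-in for what an HTTP client such as `requests` would download: the visible text is only filled in by a script, so a parser that does not run JavaScript finds the target element empty. The names `RAW_HTML`, `ContentExtractor` and `static_content` are illustrative assumptions, not part of any library.

```python
from html.parser import HTMLParser

# Hypothetical page source, as a plain HTTP client would receive it.
# The <div> stays empty until a browser runs the script.
RAW_HTML = """
<html><body>
  <div id="content"></div>
  <script>
    document.getElementById("content").textContent = "Loaded by JavaScript";
  </script>
</body></html>
"""


class ContentExtractor(HTMLParser):
    """Collect the text found inside <div id="content">."""

    def __init__(self):
        super().__init__()
        self.inside = False
        self.text = ""

    def handle_starttag(self, tag, attrs):
        if tag == "div" and dict(attrs).get("id") == "content":
            self.inside = True

    def handle_endtag(self, tag):
        if tag == "div":
            self.inside = False

    def handle_data(self, data):
        if self.inside:
            self.text += data


def static_content(html):
    """Return the text of the content <div> without executing any script."""
    parser = ContentExtractor()
    parser.feed(html)
    return parser.text.strip()


# The script never runs, so the <div> is still empty.
print(repr(static_content(RAW_HTML)))  # prints ''
```

A real browser driven by Selenium would execute the script first, so the same element would contain the text.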

### Install Selenium according to this manual

https://selenium-python.readthedocs.io/installation.html#downloading-python-bindings-for-selenium/bin

*NB: On Linux, put `geckodriver` (the downloaded driver executable) in the equivalent of `/home/<YOUR_NAME>/.local/bin/` on your machine.*
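Before going further, it can help to check that the driver is actually reachable. The sketch below assumes that the folder where you placed `geckodriver` is on your `PATH`; the helper name `find_driver` is made up for this example.

```python
import shutil


def find_driver(name="geckodriver"):
    """Return the full path of the driver executable, or None if it is not on PATH."""
    return shutil.which(name)


path = find_driver()
if path is None:
    print("geckodriver was not found on PATH")
else:
    print(f"geckodriver found at {path}")
```

If the first message is printed, check that the folder is listed in your `PATH` and that the file is executable.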

In [23]:
# The selenium.webdriver module provides all the implementations of WebDriver
# Currently supported are Firefox, Chrome, IE and Remote. The `Keys` class provides keys on
# the keyboard such as RETURN, F1, ALT etc.
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By

### Search engine simulation

We will simulate a query on the official Python website by using the search bar.

In [24]:
# Create an instance of the Firefox WebDriver.
driver = webdriver.Firefox()

# The driver.get method navigates to the page at the given URL. WebDriver waits until the page
# has fully loaded (i.e. the "onload" event has fired) before returning control to your script.
# Note that if the page makes a lot of AJAX calls while loading, WebDriver may not know
# when it has completely loaded.
driver.get("http://www.python.org")

# Assert that the page title contains the word "Python".
assert "Python" in driver.title

# WebDriver offers a `find_element` method that locates an element by one of its attributes.
# Here, the text input of the search bar is located by its `name` attribute,
# whose value is `q`.
elem = driver.find_element(By.NAME, "q")

# Then we send keys. This is similar to typing on your keyboard.
# Special keys can be sent using the `Keys` class imported above (from selenium.webdriver.common.keys import Keys).
# To be safe, we first clear any pre-filled text in the input field
# (for example, "Search") so that it does not affect our search results:
elem.clear()
elem.send_keys("pycon")
elem.send_keys(Keys.RETURN)

# After submitting the query, you should get results if there are any. To check that some results
# were found, assert that the page source does not contain the text "No results found.".
assert "No results found." not in driver.page_source
driver.close()

### Getting the titles of all the articles from the homepage of _The New York Times_

First let's open the homepage of the newspaper's website.


In [20]:
url = "https://www.nytimes.com/"

driver = webdriver.Firefox()
driver.get(url)

As you can see, you are greeted by the familiar GDPR cookie banner. Let's accept it in order to access the page!

In [25]:
cookie_button = driver.find_element(By.XPATH, "//button[@data-testid='GDPR-accept']")
cookie_button.click()



Now let's get the titles of all the articles using XPath and store them in a list.


In [26]:
article_titles = driver.find_elements(By.XPATH, "//section[@class='story-wrapper']//h3")
all_titles = []
for title in article_titles:
    all_titles.append(title.text)

all_titles
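The `text` attribute of an element only contains its visible text, so headings that are hidden come back as empty strings. Here is a small sketch of a helper that strips whitespace and drops those empty entries. `collect_titles` and `FakeHeading` are made-up names; `FakeHeading` only stands in for a real element so that the example runs without a browser.

```python
def collect_titles(elements):
    """Return the stripped, non-empty text of each element, in page order."""
    titles = []
    for element in elements:
        text = element.text.strip()
        if text:
            titles.append(text)
    return titles


class FakeHeading:
    """Stand-in for a WebElement; only the `text` attribute is used here."""

    def __init__(self, text):
        self.text = text


headings = [FakeHeading(" First story "), FakeHeading(""), FakeHeading("Second story")]
print(collect_titles(headings))  # prints ['First story', 'Second story']
```

With a live session, you would pass it the list returned by `driver.find_elements(...)`.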



There we are! Now let's close the browser.

In [14]:
driver.close()
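If an exception interrupts your scraping code, the line that closes the browser is never reached and the window stays open. One way to guard against this is to wrap the session in a context manager, sketched below. The name `browser_session` is an assumption; note that it calls `quit()`, which ends the whole WebDriver session, whereas `close()` only closes the current window.

```python
from contextlib import contextmanager


@contextmanager
def browser_session(make_driver):
    """Create a driver, hand it to the caller, and always quit it afterwards."""
    driver = make_driver()
    try:
        yield driver
    finally:
        # Runs even if the code inside the `with` block raised an exception.
        driver.quit()
```

With Selenium, this could be used as `with browser_session(webdriver.Firefox) as driver:` followed by an indented `driver.get(url)`.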

### Exercise

1. Use Selenium to open the homepage of your favourite newspaper (not The New York Times, that would be too easy)
2. Close the cookie banner (if it appears)
3. Get the link of the first article of the page and open it
4. Print the title and the content of the article

**Tip:** [Newspaper3k](https://pypi.org/project/newspaper3k/) is a powerful library for scraping newspaper articles. Have a look at its `fulltext` function.

In [19]:
url = "https://www.lesoir.be/"

driver = webdriver.Firefox()
driver.get(url)


article_titles = driver.find_elements(By.XPATH, "/html/body/r-wrapper")
all_titles = []
for title in article_titles:
    all_titles.append(title.text)

all_titles


["S'identifier\nS'abonner\nA la Une\nCoupe du monde\nOpinions\nPodcasts\nPolitique\nSociété\nMonde\nÉconomie\nSports\nCulture\nMAD\nPlanète\nMa Santé\nLéNA\nRepensons notre quotidien\nLe journal\nPodcasts\nToutes l'actualité du Soir\nA la Une\nUn accord historique sur la biodiversité approuvé à la COP15 de Montréal\nAprès quatre années de négociations difficiles, dix jours et une nuit de marathon diplomatique, plus de 190 États sont parvenus à un accord sous l’égide de la Chine, présidente de la COP15, malgré une opposition de la République démocratique du Congo.\nL’Union se dote d’une taxe carbone à ses frontières\nLe plan climat wallon épargne la voiture, pas le mazout\nAnvers touchée par une cyberattaque: «Rien n’indique que des données personnelles ont été dérobées»\nUn collectif de hackers avait revendiqué la cyberattaque.\nToujours plus de Belges exclus des services bancaires\nLe volumineux rapport sur l’inclusion financière de l’ASBL Financité tire la sonnette d’alarme : l’acces