# Scraping with Selenium

Selenium was originally created to automate tests on websites. Because it drives a real browser, it is very useful when information only becomes accessible after interacting with the page, for example by clicking. A button, for instance, is an element whose target link is very difficult to obtain from the static HTML, so BeautifulSoup alone quickly reaches its limits.
In such cases, use Selenium.

### Load libraries

If you are missing any libraries in the next cell, you'll need to install them before continuing.

In [2]:
import bs4
import requests
from bs4 import BeautifulSoup
import numpy as np
import pandas as pd
import json
import re
import lxml.html
import time
import random
from random import randint
import logging
import collections
from time import gmtime, strftime

from tabulate import tabulate
import os

date = strftime("%Y-%m-%d")

### Install Selenium according to this manual

https://selenium-python.readthedocs.io/installation.html#downloading-python-bindings-for-selenium/bin

*NB: On Linux, place the downloaded `geckodriver` executable in `/home/<YOUR_NAME>/.local/bin/` (or the equivalent directory on your machine), and make sure that directory is on your `PATH`.*
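Before launching Firefox, it can help to confirm that `geckodriver` is actually discoverable. The following is a minimal sketch using only the standard library; the helper name `find_driver` is an assumption made for illustration and is not part of Selenium.

```python
import shutil


def find_driver(name="geckodriver"):
    """Return the full path of the driver executable if it is on PATH, else None."""
    return shutil.which(name)


path = find_driver()
if path is None:
    print("geckodriver not found on PATH; check where you placed it")
else:
    print("geckodriver found at", path)
```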

We will simulate a search on the official Python website.

In [14]:
import selenium

# The selenium.webdriver module provides all the WebDriver implementations.
# Supported implementations include Firefox, Chrome, IE and Remote. The `Keys` class provides
# keyboard keys such as RETURN, F1, ALT, etc. The `By` class lists the locator strategies.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys

# Here, we create an instance of the Firefox WebDriver.
driver = webdriver.Firefox()

# The driver.get method navigates to the page given by the URL. WebDriver waits until the page is fully
# loaded (i.e. the "onload" event has fired) before returning control to your script.
# Note that if the page makes many AJAX calls while loading, WebDriver may not know
# when it has finished loading.
driver.get("http://www.python.org")

# The following line asserts that the title contains the word "Python".
assert "Python" in driver.title

# WebDriver locates elements with `find_element(By.<STRATEGY>, value)`
# (the older `find_element_by_...` helpers were removed in recent Selenium 4 releases).
# For example, the text input can be located by its name attribute
# with `By.NAME`.
elem = driver.find_element(By.NAME, "q")

# Then we send keys. This is similar to typing on your keyboard.
# Special keys can be sent using the `Keys` class imported above.
# To be safe, we first clear any pre-filled text in the input field
# (for example, "Search") so that it does not affect our search results:
elem.clear()
elem.send_keys("pycon")
elem.send_keys(Keys.RETURN)

# After submitting the search, you should get results if there are any. To check that
# some results were found, make an assertion:
assert "No results found." not in driver.page_source
driver.close()
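If one of the assertions in the cell above fails, `driver.close()` is never reached and the browser window stays open. A minimal sketch of one way to guard against that is a context manager that always calls `driver.quit()`. The name `managed_driver` is hypothetical, and the `FakeDriver` class exists only so the sketch runs without a browser.

```python
from contextlib import contextmanager


@contextmanager
def managed_driver(factory):
    """Create a driver with `factory` and always quit it on exit."""
    driver = factory()
    try:
        yield driver
    finally:
        # quit() ends the whole browser session, even after an exception.
        driver.quit()


class FakeDriver:
    """Stand-in used only for demonstration; not a Selenium class."""

    def __init__(self):
        self.closed = False

    def quit(self):
        self.closed = True


try:
    with managed_driver(FakeDriver) as driver:
        raise AssertionError("simulated failed check")
except AssertionError:
    pass

print(driver.closed)  # True
```

With Selenium installed, `with managed_driver(webdriver.Firefox) as driver:` would wrap the search shown above in the same way.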

#### Open the source code of the webpage and check that the search area (field) is called "q".

In [20]:
import requests
from bs4 import BeautifulSoup

url = "http://www.python.org"
r = requests.get(url)
print(url, r.status_code)
soup = BeautifulSoup(r.content, "lxml")
soup

http://www.python.org 200


<!DOCTYPE html>
<!--[if lt IE 7]>   <html class="no-js ie6 lt-ie7 lt-ie8 lt-ie9">   <![endif]--><!--[if IE 7]>      <html class="no-js ie7 lt-ie8 lt-ie9">          <![endif]--><!--[if IE 8]>      <html class="no-js ie8 lt-ie9">                 <![endif]--><!--[if gt IE 8]><!--><html class="no-js" dir="ltr" lang="en"> <!--<![endif]-->
<head>
<meta charset="utf-8"/>
<meta content="IE=edge" http-equiv="X-UA-Compatible"/>
<link href="//ajax.googleapis.com/ajax/libs/jquery/1.8.2/jquery.min.js" rel="prefetch"/>
<link href="//ajax.googleapis.com/ajax/libs/jqueryui/1.12.1/jquery-ui.min.js" rel="prefetch"/>
<meta content="Python.org" name="application-name"/>
<meta content="The official home of the Python Programming Language" name="msapplication-tooltip"/>
<meta content="Python.org" name="apple-mobile-web-app-title"/>
<meta content="yes" name="apple-mobile-web-app-capable"/>
<meta content="black" name="apple-mobile-web-app-status-bar-style"/>
<meta content="width=device-width, initial-scale=1.
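Rather than reading the whole page source by eye, the field names can be listed programmatically. The sketch below swaps BeautifulSoup for the standard library's `html.parser` so that it runs without extra installs, and it parses a small hand-written HTML sample (an assumption, not the real python.org markup). The class name `InputNameCollector` is hypothetical.

```python
from html.parser import HTMLParser


class InputNameCollector(HTMLParser):
    """Collect the `name` attribute of every <input> tag encountered."""

    def __init__(self):
        super().__init__()
        self.names = []

    def handle_starttag(self, tag, attrs):
        if tag == "input":
            name = dict(attrs).get("name")
            if name is not None:
                self.names.append(name)


# Hand-written sample standing in for the real page source.
sample_html = """
<form action="/search/" method="get">
  <input id="id-search-field" name="q" type="search" placeholder="Search">
  <button type="submit">GO</button>
</form>
"""

collector = InputNameCollector()
collector.feed(sample_html)
print(collector.names)  # ['q']
```

With the `soup` object from the cell above, `soup.find("input", attrs={"name": "q"})` performs the equivalent lookup.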

### Getting a phone number from *leboncoin*

In [33]:
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By

url = "https://weneedit.be/electronics/portable-water-chiller-cwup-20-for-ultrafast-laser/"

driver = webdriver.Firefox()
driver.implicitly_wait(30)
driver.get(url)

# XPath kept from the commented-out line of the original cell; it is an assumption
# that it matches the contact button, so adapt it to the page you are scraping.
python_button = driver.find_element(By.XPATH, '//div[@data-reactid="269"]')
python_button.click()

# And then we use BeautifulSoup on the rendered page source
soup = BeautifulSoup(driver.page_source, "lxml")

driver.close()

for elem in soup.find_all("a", attrs={"data-qa-id": "Contact Details"}):
    print(elem.text)

### Starting from *leboncoin*, collect all the available information describing the product being sold. Use `selenium` for the telephone number.

### API (Application Programming Interface)

An API is a set of tools and methods that allow different applications to interact with each other. In the case of a web service, it lets us retrieve data dynamically. By using an API correctly, we can thus obtain, in real time, the changes made on a "parent" site.

As an example, we will retrieve online news from the "L'équipe" website.

Follow the instructions at https://newsapi.org/s/lequipe-api to obtain an API key.

Your API key is: `73bbb95f8ecb49b499113a46481b4af1`


Keys often stop working after a while (e.g. `5 min`, `30 min`, a day, ...),
so don't be alarmed if you get an error message back.

In [18]:
import requests

key = "73bbb95f8ecb49b499113a46481b4af1"
url = "https://newsapi.org/v2/top-headlines?sources=lequipe&apiKey=" + key
response = requests.get(url)

# The response body is JSON; response.json() parses it into a Python dictionary
print(response.json())

{'status': 'ok', 'totalResults': 10, 'articles': [{'source': {'id': 'lequipe', 'name': "L'equipe"}, 'author': "L'EQUIPE", 'title': "Mouhammadou Jaiteh (Bologne) désigné MVP de l'Eurocoupe", 'description': "Mouhammadou Jaiteh, l'intérieur international français de la Virtus Bologne qui dispute la finale de l'Eurocoupe mercredi contre Bursa, a été désigné meilleur joueur de l'Eurocoupe, petite soeur de l'Euroligue.", 'url': 'https://www.lequipe.fr/Basket/Actualites/Mouhammadou-jaiteh-bologne-designe-mvp-de-l-eurocoupe/1332146', 'urlToImage': 'https://medias.lequipe.fr/img-photo-jpg/mouhammadou-jaiteh-a-fait-le-bonheur-de-bologne-cette-saison-a-reau-l-equipe/1500000001640287/0:0,1995:1330-640-427-75/12700.jpg', 'publishedAt': '2022-05-09T11:49:00+00:00', 'content': "Attribuée depuis la saison 2008-2009, la distinction de MVP (meilleur joueur) de la saison en Eurocoupe, petite soeur de l'épreuve reine, l'Euroligue, n'avait jamais été remportée par un joueur franç… [+1686 chars]"}, {'source
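The request URL above is assembled by string concatenation, with the key written directly in the notebook. A minimal sketch of an alternative is to read the key from an environment variable and let `urllib.parse.urlencode` build the query string. The variable name `NEWSAPI_KEY` and the helper `build_headlines_url` are assumptions made for illustration.

```python
import os
from urllib.parse import urlencode

BASE_URL = "https://newsapi.org/v2/top-headlines"


def build_headlines_url(source, api_key):
    """Build the top-headlines URL for one source, escaping the parameters."""
    query = urlencode({"sources": source, "apiKey": api_key})
    return f"{BASE_URL}?{query}"


# Fall back to a placeholder so the sketch runs without a real key.
api_key = os.environ.get("NEWSAPI_KEY", "YOUR_API_KEY")
print(build_headlines_url("lequipe", api_key))
```

Passing the same dictionary as `params=` to `requests.get(BASE_URL, params=...)` achieves the same result.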

In [19]:
dictionnary = response.json()
print(dictionnary.keys())

dict_keys(['status', 'totalResults', 'articles'])
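Since keys can expire, it is worth checking the `status` entry before using the rest of the response. The sketch below assumes that failures are reported in the JSON body with a `status` other than `"ok"`, together with `code` and `message` fields. The success case matches the output shown in this notebook, while the error shape is an assumption to verify against the News API documentation. The function name `extract_articles` is hypothetical.

```python
def extract_articles(payload):
    """Return the list of articles, or raise if the API reported an error."""
    if payload.get("status") != "ok":
        code = payload.get("code", "unknown")
        message = payload.get("message", "no message provided")
        raise RuntimeError(f"API error ({code}): {message}")
    return payload.get("articles", [])


# Hand-written payloads for illustration only.
ok_payload = {"status": "ok", "totalResults": 1, "articles": [{"title": "Example"}]}
error_payload = {"status": "error", "code": "apiKeyInvalid", "message": "Key rejected"}

print(extract_articles(ok_payload))

try:
    extract_articles(error_payload)
except RuntimeError as exc:
    print(exc)
```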


In [20]:
for element in list(dictionnary.keys()):
    print("##############################################")
    print("Key: ", element, "// Values: ", dictionnary[element])

##############################################
Key:  status // Values:  ok
##############################################
Key:  totalResults // Values:  10
##############################################
Key:  articles // Values:  [{'source': {'id': 'lequipe', 'name': "L'equipe"}, 'author': "L'EQUIPE", 'title': "Mouhammadou Jaiteh (Bologne) désigné MVP de l'Eurocoupe", 'description': "Mouhammadou Jaiteh, l'intérieur international français de la Virtus Bologne qui dispute la finale de l'Eurocoupe mercredi contre Bursa, a été désigné meilleur joueur de l'Eurocoupe, petite soeur de l'Euroligue.", 'url': 'https://www.lequipe.fr/Basket/Actualites/Mouhammadou-jaiteh-bologne-designe-mvp-de-l-eurocoupe/1332146', 'urlToImage': 'https://medias.lequipe.fr/img-photo-jpg/mouhammadou-jaiteh-a-fait-le-bonheur-de-bologne-cette-saison-a-reau-l-equipe/1500000001640287/0:0,1995:1330-640-427-75/12700.jpg', 'publishedAt': '2022-05-09T11:49:00+00:00', 'content': "Attribuée depuis la saison 2008-2009, la dist

In [27]:
# Now we have lists inside dictionaries (the data comes from JSON, which maps naturally onto these Python types).
# Let's explore the contents of the "articles" key.

for element in enumerate(dictionnary["articles"]):
    print("###############################################")
    print(element)

###############################################
(0, {'source': {'id': 'lequipe', 'name': "L'equipe"}, 'author': "L'EQUIPE", 'title': "Mouhammadou Jaiteh (Bologne) désigné MVP de l'Eurocoupe", 'description': "Mouhammadou Jaiteh, l'intérieur international français de la Virtus Bologne qui dispute la finale de l'Eurocoupe mercredi contre Bursa, a été désigné meilleur joueur de l'Eurocoupe, petite soeur de l'Euroligue.", 'url': 'https://www.lequipe.fr/Basket/Actualites/Mouhammadou-jaiteh-bologne-designe-mvp-de-l-eurocoupe/1332146', 'urlToImage': 'https://medias.lequipe.fr/img-photo-jpg/mouhammadou-jaiteh-a-fait-le-bonheur-de-bologne-cette-saison-a-reau-l-equipe/1500000001640287/0:0,1995:1330-640-427-75/12700.jpg', 'publishedAt': '2022-05-09T11:49:00+00:00', 'content': "Attribuée depuis la saison 2008-2009, la distinction de MVP (meilleur joueur) de la saison en Eurocoupe, petite soeur de l'épreuve reine, l'Euroligue, n'avait jamais été remportée par un joueur franç… [+1686 chars]"})
######

In [28]:
# Going one level deeper, each article is itself a dictionary!
for element in dictionnary["articles"][0].keys():
    print(" Key : ", element, "Values : ", dictionnary["articles"][0][element])

 Key :  source Values :  {'id': 'lequipe', 'name': "L'equipe"}
 Key :  author Values :  L'EQUIPE
 Key :  title Values :  Mouhammadou Jaiteh (Bologne) désigné MVP de l'Eurocoupe
 Key :  description Values :  Mouhammadou Jaiteh, l'intérieur international français de la Virtus Bologne qui dispute la finale de l'Eurocoupe mercredi contre Bursa, a été désigné meilleur joueur de l'Eurocoupe, petite soeur de l'Euroligue.
 Key :  url Values :  https://www.lequipe.fr/Basket/Actualites/Mouhammadou-jaiteh-bologne-designe-mvp-de-l-eurocoupe/1332146
 Key :  urlToImage Values :  https://medias.lequipe.fr/img-photo-jpg/mouhammadou-jaiteh-a-fait-le-bonheur-de-bologne-cette-saison-a-reau-l-equipe/1500000001640287/0:0,1995:1330-640-427-75/12700.jpg
 Key :  publishedAt Values :  2022-05-09T11:49:00+00:00
 Key :  content Values :  Attribuée depuis la saison 2008-2009, la distinction de MVP (meilleur joueur) de la saison en Eurocoupe, petite soeur de l'épreuve reine, l'Euroligue, n'avait jamais été remport
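Each article is therefore a flat dictionary, except for the nested `source` entry. Below is a minimal sketch of turning such records into CSV rows with the standard library (`pandas` would work as well). The helper names `flatten_article` and `articles_to_csv` are hypothetical, and the sample record is abridged from the output above.

```python
import csv
import io

FIELDS = ["source", "author", "title", "url", "publishedAt"]


def flatten_article(article):
    """Keep the chosen fields and replace the nested source dict by its name."""
    row = {field: article.get(field, "") for field in FIELDS}
    source = article.get("source") or {}
    row["source"] = source.get("name", "")
    return row


def articles_to_csv(articles, handle):
    """Write one CSV row per article to an open text file handle."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS)
    writer.writeheader()
    for article in articles:
        writer.writerow(flatten_article(article))


# Sample record abridged from the notebook output.
sample = [{
    "source": {"id": "lequipe", "name": "L'equipe"},
    "author": "L'EQUIPE",
    "title": "Mouhammadou Jaiteh (Bologne) désigné MVP de l'Eurocoupe",
    "url": "https://www.lequipe.fr/Basket/Actualites/Mouhammadou-jaiteh-bologne-designe-mvp-de-l-eurocoupe/1332146",
    "publishedAt": "2022-05-09T11:49:00+00:00",
}]

buffer = io.StringIO()
articles_to_csv(sample, buffer)
print(buffer.getvalue())
```

To write a real file instead of an in-memory buffer, open it with `open("articles.csv", "w", newline="", encoding="utf-8")` and pass that handle to `articles_to_csv`.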

### Write a script that retrieves the details of the ten latest news items from L'Équipe or another site. Store them in a clean CSV or Excel file.