# Selenium

## What is Selenium?
Selenium is a Web Browser Automation Tool.
Primarily, it is for automating web applications for testing purposes, but is certainly not limited to just that. It allows you to open a browser of your choice & perform tasks as a human being would, such as:

* Clicking buttons
* Entering information in forms
* Searching for specific information on the web pages
* Scrolling
* Taking a screenshot


At the beginning of the project (almost 20 years ago!) it was mostly used for cross-browser, end-to-end testing (acceptance tests).

Today it is still used for testing, but also as a general browser automation platform. And of course, it is used for web scraping!

The Selenium API uses the WebDriver protocol to control a web browser, like Chrome, Firefox or Safari. The browser can run either locally or remotely.



![image.png](attachment:image.png)

## Required installations

### Selenium

In [4]:
# Prerequisites
# Install or upgrade Selenium
# !pip3 install -U selenium

In [5]:
# Install Webdriver Manager for Python

# !pip3 install webdriver-manager

### Browser driver

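You need to install a **browser driver**, chosen according to the browser you use. In my case, I have Chrome, so I installed the Chrome driver. With Selenium 4.6 or later, Selenium Manager can usually download a matching driver automatically, which is why webdriver.Chrome() works below without a driver path. Here are links to the most popular browser drivers: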

* ChromeDriver – WebDriver for Chrome (https://sites.google.com/chromium.org/driver/)

* Microsoft Edge Driver (https://developer.microsoft.com/en-us/microsoft-edge/tools/webdriver/)

* Firefox Driver (https://github.com/mozilla/geckodriver/releases)


In [6]:
from selenium import webdriver
#from selenium.webdriver.chrome.service import Service
#service = Service(r"./chromedriver.exe")
#options = webdriver.ChromeOptions()
driver = webdriver.Chrome()

In [7]:
driver.get('https://google.com')

In [8]:
print(driver.title)
print(driver.current_url)

Google
https://www.google.com/


Chrome can also run in headless mode (without any graphical user interface), which is what you need to run it on a server.

**ChromeOptions (common arguments and methods)** is a separate class in Selenium that helps manage options specific to ChromeDriver. In the Java bindings, ChromeOptions extends MutableCapabilities; it was introduced with Selenium v3.6.0.

**Why it is required:** The ChromeOptions class is used to customize the settings of the Chrome browser. With it we can, for example, disable popup blocking, make Chrome the default browser, disable extensions, open an incognito window, or check the version. By default, Selenium starts a fresh browser session that doesn't have any settings, cookies, or history.

**Frequently Used Methods and Arguments:**

* start-maximized: This argument opens the Chrome browser window maximized.

* setPageLoadStrategy (the page_load_strategy option in Python): This setting controls how long Selenium waits for a page to load and can be used to speed up execution. There are three strategies: Normal, None, and Eager.
    * Normal: Selenium WebDriver waits until the entire page is loaded.
    * None: Selenium WebDriver only waits until the initial page is downloaded.
    * Eager: Selenium WebDriver waits until the initial HTML document has been completely loaded and parsed.
    
* disable-infobars: This argument removes the information bar/notifications from the browser. It has since been deprecated.

* incognito: This argument opens the Chrome browser in incognito mode, so that history and cookies are not kept. Incognito mode deletes this data as soon as we close the web browser.

* version: This argument is used to get the current version of the browser.

* disable-popup-blocking: This switch turns off Chrome's popup blocker. In the Java bindings, the two lines below control it: the first excludes the switch from those ChromeDriver passes by default (so popups stay blocked), the second passes it explicitly:
    * Code: options.setExperimentalOption("excludeSwitches", Arrays.asList("disable-popup-blocking"));
    * Code: options.addArguments("--disable-popup-blocking");
    
examples: https://studysection.com/blog/chromeoptions-class-common-arguments-and-methods/
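The options listed above are described with their Java names. As a rough Python counterpart, the sketch below builds a ChromeOptions object with a few of them, including the headless mode mentioned earlier. The function name `build_chrome_options` and its parameters are made up for this illustration; the calls on the options object (`add_argument`, `page_load_strategy`) come from the Selenium 4 Python bindings.

```python
def build_chrome_options(headless=True, page_load_strategy="eager"):
    """Return ChromeOptions configured with a few of the settings described above.

    Hypothetical helper for illustration, not part of Selenium itself.
    """
    # Imported here so the helper can be defined before Selenium is installed;
    # a regular top-level import works just as well.
    from selenium.webdriver.chrome.options import Options

    options = Options()
    if headless:
        options.add_argument("--headless")         # no graphical user interface
    else:
        options.add_argument("--start-maximized")  # only meaningful with a visible window
    options.add_argument("--incognito")            # do not keep history or cookies
    # One of "normal", "eager" or "none"
    options.page_load_strategy = page_load_strategy
    return options


# Typical use (commented out because it starts a real browser):
# from selenium import webdriver
# driver = webdriver.Chrome(options=build_chrome_options())
```

Keeping the configuration in one function makes it easy to switch between a visible browser while developing and a headless one on a server.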

driver.page_source returns the full HTML code of the page.

Here are two other interesting WebDriver properties:

- driver.title gets the page's title
- driver.current_url gets the current URL (this can be useful when there are redirections on the website and you need the final URL)

In [7]:
# from selenium import webdriver
# from selenium.webdriver.firefox.service import Service
# service = Service(r"geckodriver.exe")
# options = webdriver.FirefoxOptions()
# driver = webdriver.Firefox(service=service, options=options)

In [8]:
# # With Firefox:
# url="http://www.google.fr"
# driver.get(url)

In [9]:
# print(driver.title)
# print(driver.current_url)

### Get and parse the HTML 

**Principal methods of Selenium** 

There are various strategies to locate elements in a page, and you can use the most appropriate one for your case. In Selenium 4, elements are located with two methods that take a By strategy and a value:

* find_element (returns the first matching element)
* find_elements (returns a list of all matching elements)

The available strategies are:

* By.ID
* By.NAME
* By.XPATH
* By.LINK_TEXT
* By.PARTIAL_LINK_TEXT
* By.TAG_NAME
* By.CLASS_NAME
* By.CSS_SELECTOR

Older tutorials use dedicated methods such as find_element_by_id or find_elements_by_xpath. They were deprecated in Selenium 4 and removed in version 4.3, so they no longer work with current releases.

Examples are here: https://selenium-python.readthedocs.io/locating-elements.html

from selenium.webdriver.common.by import By

username = driver.find_element(By.NAME, 'username')

login_form = driver.find_element(By.XPATH, "/html/body/form[1]")

heading1 = driver.find_element(By.TAG_NAME, 'h1')

content = driver.find_element(By.CLASS_NAME, 'content')

content = driver.find_element(By.CSS_SELECTOR, 'p.content')


XPath is a language that uses path expressions to select nodes or sets of nodes in an XML document. The expressions resemble the paths you usually see in a computer file system. The most useful path expressions are:

- nodename selects the nodes with that name
- / selects from the root node
- // selects matching nodes anywhere below the current node, at any depth
- . selects the current node
- .. selects the “parent” of the current node
- @ selects an attribute of a node, such as id or class

For more details: https://www.w3schools.com/xml/xpath_syntax.asp
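To see these expressions in action without opening a browser, here is a small sketch using Python's standard library module xml.etree.ElementTree. Note that ElementTree supports only a subset of XPath (for instance, paths must be relative, hence the leading "."), whereas Selenium passes the full expression to the browser. The document below is made up for the illustration.

```python
import xml.etree.ElementTree as ET

# A tiny, made-up document used only for illustration.
DOC = """
<html>
  <body>
    <form id="login">
      <input name="username"/>
      <input name="password"/>
    </form>
    <p class="content">Hello</p>
  </body>
</html>
"""

root = ET.fromstring(DOC)

# nodename: direct children of the current node with that name
body = root.find("body")

# //: matching nodes at any depth below the current node
inputs = root.findall(".//input")
names = [el.get("name") for el in inputs]

# @: filter on an attribute value
paragraph = root.find(".//p[@class='content']")

# ..: parent of the matched node
form = root.find(".//input[@name='username']/..")

print(body.tag)        # body
print(names)           # ['username', 'password']
print(paragraph.text)  # Hello
print(form.get("id"))  # login
```

With Selenium, the equivalent lookup would be something like driver.find_element(By.XPATH, "//p[@class='content']").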

As usual, the easiest way to locate an element is to open your Chrome dev tools and inspect the element that you need. A handy shortcut is to highlight the element you want with your mouse and then press Ctrl + Shift + C (Cmd + Shift + C on macOS), instead of right-clicking and choosing Inspect each time.

## Initialization steps

<div class="alert alert-success">
The steps to parse a dynamic page using Selenium are:

1- Initialize a driver (a Python object that controls a browser window).
    
2- Direct the driver to the URL we want to scrape.
    
3- Wait for the driver to finish executing the JavaScript and updating the HTML. The driver is typically a Chrome driver, so the page is treated the same way as if you were visiting it in Chrome.
    
4- Use driver.page_source to get the HTML as it appears after JavaScript has rendered it.
    
5- Use a parser on the returned HTML.
    
</div>
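The five steps can be put together in a short sketch like the one below. The function names `fetch_rendered_html` and `first_heading`, the default selector "h1" and the 10-second timeout are assumptions chosen for the illustration; the Selenium and BeautifulSoup calls themselves are the ones used elsewhere in this notebook.

```python
def fetch_rendered_html(url, wait_css="h1", timeout=10):
    """Steps 1 to 4: return the HTML of `url` after JavaScript has rendered it.

    Hypothetical helper; `wait_css` is the element whose visibility signals
    that the page is ready.
    """
    # Imported here so the helper can be defined before Selenium is installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Chrome()                    # 1- initialize a driver
    try:
        driver.get(url)                            # 2- go to the URL
        WebDriverWait(driver, timeout).until(      # 3- wait for the rendering
            EC.visibility_of_element_located((By.CSS_SELECTOR, wait_css))
        )
        return driver.page_source                  # 4- rendered HTML
    finally:
        driver.quit()                              # always close the browser


def first_heading(html):
    """Step 5: parse the HTML and return the text of the first <h1>, or None."""
    from bs4 import BeautifulSoup

    heading = BeautifulSoup(html, "html.parser").find("h1")
    return heading.get_text(strip=True) if heading else None


# Typical use (commented out because it starts a real browser):
# html = fetch_rendered_html("https://www.paruvendu.fr/")
# print(first_heading(html))
```

The try/finally block makes sure the browser is closed even when the wait times out, which avoids leaving stray Chrome windows behind.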

### Initialize a driver (a Python object that controls a browser window)

In [10]:
#1- Initialize a driver (a Python object that controls a browser window)
driver = webdriver.Chrome()

We'll test scraping on the page https://www.paruvendu.fr/: we extract the data, save it in a pandas DataFrame and export it to a CSV file.



### Direct the driver to the URL we want to scrape.


In [11]:
#2- Direct the driver to the URL we want to scrape.
url="https://www.paruvendu.fr/immobilier/annonceimmofo/liste/listeAnnonces?tt=1&at=1&nbp0=99&pa=FR&lo=75"


In [12]:
#3- Wait for the driver to finish executing the javascript, and changing the HTML. The driver is typically a Chrome driver, 
#so the page is treated the same way as if you were visiting it in Chrome.

In [13]:
# 4- Load the page, then use driver.page_source to get the HTML as it appears after JavaScript has rendered it.
driver.get(url)

In [14]:
#print(driver.page_source)

In [17]:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support.expected_conditions import visibility_of_element_located
import pandas as pd
from selenium.webdriver.support import expected_conditions as EC

In [19]:


# # Path to your ChromeDriver
# CHROMEDRIVER_PATH = '/path/to/chromedriver'

# # Set up Selenium WebDriver
# options = Options()
# options.add_argument('--headless')  # Run in headless mode (no GUI)
# service = Service(CHROMEDRIVER_PATH)
# driver = webdriver.Chrome(service=service, options=options)

# Define the URL
url = "https://www.imdb.com/title/tt0108778/episodes/"

# Open the webpage
driver.get(url)

# Wait for the page to load
wait = WebDriverWait(driver, 10)
wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, 'div.sc-ccd6e31b-4.eMYVLm')))

# Initialize lists to hold the data
seasons = []
episodes = []
titles = []
release_dates = []
ratings = []
reviews = []

# Scrape data
for season in range(1, 11):  # There are 10 seasons
    season_url = f"{url}?season={season}"
    driver.get(season_url)
    
    # Compound classes must be joined with dots; these generated class names may change at any time
    wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, 'div.sc-ccd6e31b-1.ggXjkj')))
    
    # Find episodes
    episode_elements = driver.find_elements(By.CSS_SELECTOR, 'div.sc-ccd6e31b-1.ggXjkj')
    
    for ep in episode_elements:
        # Extract episode details
        episode_number = ep.find_element(By.CSS_SELECTOR, 'div.ipc-title__text').text
        title = episode_number.split('∙')[-1].strip()
        episode_number = episode_number.split('∙')[0].strip()
        season_number, episode_number = episode_number.split('.')
        
        release_date = ep.find_element(By.CSS_SELECTOR, 'span.sc-ccd6e31b-10.dYquTu').text.replace(",", "", 1)
        summary = ep.find_element(By.CSS_SELECTOR, 'div.ipc-html-content').text
        
        rating_info = ep.find_element(By.CSS_SELECTOR, 'span.ipc-rating-star').text
        min_rating = float(rating_info.split("/")[0])
        max_rating = float(rating_info.split("/")[1].split("(")[0])
        review_count = rating_info.split("(")[-1][:-1]
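        # "5.2K" means 5,200: scale by the suffix instead of replacing it with zeros
        scale = {"K": 1_000, "M": 1_000_000}.get(review_count[-1], 1)
        review_count = round(float(review_count.rstrip("KM")) * scale)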
        
        # Append data to lists
        seasons.append(season_number)
        episodes.append(episode_number)
        titles.append(title)
        release_dates.append(release_date)
        ratings.append(min_rating)
        reviews.append(review_count)

# Create a DataFrame
df = pd.DataFrame({
    'Season': seasons,
    'Episode': episodes,
    'Title': titles,
    'Release Date': release_dates,
    'Rating': ratings,
    'Number of Reviews': reviews
})

# Save DataFrame to a CSV file
df.to_csv('friends_episodes.csv', index=False)

# Close the browser
driver.quit()


TimeoutException: Message: 
Stacktrace:
	GetHandleVerifier [0x00007FF775BCEEB2+31554]
	(No symbol) [0x00007FF775B47EE9]
	(No symbol) [0x00007FF775A0872A]
	(No symbol) [0x00007FF775A58434]
	(No symbol) [0x00007FF775A5853C]
	(No symbol) [0x00007FF775A9F6A7]
	(No symbol) [0x00007FF775A7D06F]
	(No symbol) [0x00007FF775A9C977]
	(No symbol) [0x00007FF775A7CDD3]
	(No symbol) [0x00007FF775A4A33B]
	(No symbol) [0x00007FF775A4AED1]
	GetHandleVerifier [0x00007FF775ED8B2D+3217341]
	GetHandleVerifier [0x00007FF775F25AF3+3532675]
	GetHandleVerifier [0x00007FF775F1B0F0+3489152]
	GetHandleVerifier [0x00007FF775C7E786+750614]
	(No symbol) [0x00007FF775B5376F]
	(No symbol) [0x00007FF775B4EB24]
	(No symbol) [0x00007FF775B4ECB2]
	(No symbol) [0x00007FF775B3E17F]
	BaseThreadInitThunk [0x00007FFBC4607374+20]
	RtlUserThreadStart [0x00007FFBC58DCC91+33]


In [22]:

# Define the URL
url = "https://www.imdb.com/title/tt0108778/episodes/"

# Open the webpage
driver.get(url)

# Increase timeout duration to 20 seconds
wait = WebDriverWait(driver, 20)

# Wait for the page to load
try:
    # Wait for the episode elements to be present
    wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, 'div.sc-ccd6e31b-1.ggXjkj')))
except Exception as e:
    print(f"Error waiting for page elements: {e}")

# Initialize lists to hold the data
seasons = []
episodes = []
titles = []
release_dates = []
ratings = []
reviews = []

# Scrape data
for season in range(1, 11):  # There are 10 seasons
    season_url = f"{url}?season={season}"
    driver.get(season_url)
    
    try:
        # Wait for the episode elements to be present
        # Compound classes must be joined with dots; these generated class names may change at any time
        wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, 'div.sc-ccd6e31b-1.ggXjkj')))
        
        # Find episodes
        episode_elements = driver.find_elements(By.CSS_SELECTOR, 'div.sc-ccd6e31b-1.ggXjkj')
        
        for ep in episode_elements:
            # Extract episode details
            episode_number = ep.find_element(By.CSS_SELECTOR, 'div.ipc-title__text').text
            title = episode_number.split('∙')[-1].strip()
            episode_number = episode_number.split('∙')[0].strip()
            season_number, episode_number = episode_number.split('.')
            
            release_date = ep.find_element(By.CSS_SELECTOR, 'span.sc-ccd6e31b-10.fVspdm').text.replace(",", "", 1)
            summary = ep.find_element(By.CSS_SELECTOR, 'div.ipc-html-content').text
            
            rating_info = ep.find_element(By.CSS_SELECTOR, 'span.ipc-rating-star').text
            min_rating = float(rating_info.split("/")[0])
            max_rating = float(rating_info.split("/")[1].split("(")[0])
            review_count = rating_info.split("(")[-1][:-1]
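            # "5.2K" means 5,200: scale by the suffix instead of replacing it with zeros
            scale = {"K": 1_000, "M": 1_000_000}.get(review_count[-1], 1)
            review_count = round(float(review_count.rstrip("KM")) * scale)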
            
            # Append data to lists
            seasons.append(season_number)
            episodes.append(episode_number)
            titles.append(title)
            release_dates.append(release_date)
            ratings.append(min_rating)
            reviews.append(review_count)
    except Exception as e:
        print(f"Error processing season {season}: {e}")

# Create a DataFrame
df = pd.DataFrame({
    'Season': seasons,
    'Episode': episodes,
    'Title': titles,
    'Release Date': release_dates,
    'Rating': ratings,
    'Number of Reviews': reviews
})

# Save DataFrame to a CSV file
df.to_csv('friends_episodes.csv', index=False)

# Close the browser
driver.quit()


Error waiting for page elements: Message: 
Stacktrace:
	GetHandleVerifier [0x00007FF775BCEEB2+31554]
	(No symbol) [0x00007FF775B47EE9]
	(No symbol) [0x00007FF775A0872A]
	(No symbol) [0x00007FF775A58434]
	(No symbol) [0x00007FF775A5853C]
	(No symbol) [0x00007FF775A9F6A7]
	(No symbol) [0x00007FF775A7D06F]
	(No symbol) [0x00007FF775A9C977]
	(No symbol) [0x00007FF775A7CDD3]
	(No symbol) [0x00007FF775A4A33B]
	(No symbol) [0x00007FF775A4AED1]
	GetHandleVerifier [0x00007FF775ED8B2D+3217341]
	GetHandleVerifier [0x00007FF775F25AF3+3532675]
	GetHandleVerifier [0x00007FF775F1B0F0+3489152]
	GetHandleVerifier [0x00007FF775C7E786+750614]
	(No symbol) [0x00007FF775B5376F]
	(No symbol) [0x00007FF775B4EB24]
	(No symbol) [0x00007FF775B4ECB2]
	(No symbol) [0x00007FF775B3E17F]
	BaseThreadInitThunk [0x00007FFBC4607374+20]
	RtlUserThreadStart [0x00007FFBC58DCC91+33]



KeyboardInterrupt: 
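The rating parsing used in the loops above can be checked without a browser by moving it into small functions. The sketch below assumes the rating text looks like '8.1/10 (5.2K)', which is inferred from the split calls in the cells above and not verified against the live page; the function names `parse_count` and `parse_rating_info` are made up for the illustration.

```python
def parse_count(text):
    """Convert a compact count such as '5.2K' or '1.1M' to an integer."""
    multipliers = {"K": 1_000, "M": 1_000_000}
    text = text.strip()
    suffix = text[-1].upper()
    if suffix in multipliers:
        # Scale the number; round() absorbs floating-point noise
        return round(float(text[:-1]) * multipliers[suffix])
    return int(text.replace(",", ""))


def parse_rating_info(rating_info):
    """Split a string like '8.1/10 (5.2K)' into (rating, max_rating, review_count)."""
    score, rest = rating_info.split("/", 1)
    max_score, count = rest.split("(", 1)
    return float(score), float(max_score), parse_count(count.rstrip(") \n"))


print(parse_rating_info("8.1/10 (5.2K)"))  # (8.1, 10.0, 5200)
print(parse_count("987"))                  # 987
```

Scaling by the suffix matters: simply replacing "K" with "000" and dropping the dot would turn "5.2K" into 52000 instead of 5200.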

## Example 1


In [32]:
url="https://www.paruvendu.fr/immobilier/annonceimmofo/liste/listeAnnonces?tt=1&at=1&nbp0=99&pa=FR&lo=75"


In [33]:
# Open the ParuVendu page
driver.get(url)
# Wait for the page to load
#wait = WebDriverWait(driver, 20)
#WebDriverWait(driver, 3).until(EC.element_to_be_clickable((By.XPATH,'/html/body/div[1]/div/div[1]/div[3]/button[2]'))).click()
#WebDriverWait(driver, 3).until(EC.element_to_be_clickable((By.XPATH,'//*[@id="batchsdk-ui-alert__buttons_negative"]'))).click()

title = (
    WebDriverWait(driver=driver, timeout=10)
    .until(visibility_of_element_located((By.CSS_SELECTOR, "h1")))
    .text
)
title

'Vente maison appartement - Paris (75)'

In [9]:
# retrieve fully rendered HTML content
content = driver.page_source
#driver.close()
# content


NoSuchWindowException: Message: no such window: target window already closed
from unknown error: web view not found
  (Session info: chrome=126.0.6478.128)
Stacktrace:
	GetHandleVerifier [0x00007FF775BCEEB2+31554]
	(No symbol) [0x00007FF775B47EE9]
	(No symbol) [0x00007FF775A0872A]
	(No symbol) [0x00007FF7759DD995]
	(No symbol) [0x00007FF775A844D7]
	(No symbol) [0x00007FF775A9C051]
	(No symbol) [0x00007FF775A7CDD3]
	(No symbol) [0x00007FF775A4A33B]
	(No symbol) [0x00007FF775A4AED1]
	GetHandleVerifier [0x00007FF775ED8B2D+3217341]
	GetHandleVerifier [0x00007FF775F25AF3+3532675]
	GetHandleVerifier [0x00007FF775F1B0F0+3489152]
	GetHandleVerifier [0x00007FF775C7E786+750614]
	(No symbol) [0x00007FF775B5376F]
	(No symbol) [0x00007FF775B4EB24]
	(No symbol) [0x00007FF775B4ECB2]
	(No symbol) [0x00007FF775B3E17F]
	BaseThreadInitThunk [0x00007FFBC4607374+20]
	RtlUserThreadStart [0x00007FFBC58DCC91+33]


In [54]:
# then continue the scraping with BeautifulSoup
from bs4 import BeautifulSoup
soup = BeautifulSoup(content, "html.parser")
print(soup.find("h1").text)
for elt in soup.find_all('div',{"class":"flex justify-center items-center w-full gap-4 text-lg sm:text-base text-red font-medium border-1 border-red p-1 mb-2 sm:my-2"}):
    print(elt.text.strip())


Vente maison appartement - Paris (75)
956 800 €
345 000 €
849 000 €
220 000 €
235 000 €
1 249 960 €
529 000 €
286 000 €
122 000 €
116 000 €
265 000 €
375 000 €
306 000 €
685 000 €
1 910 000 €
189 807 €
490 000 €
160 000 €
529 624 €
325 000 €
416 000 €
849 000 €
925 000 €
995 000 €
549 000 €
749 000 €
649 000 €
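The prices above are still text. Before saving them they can be converted to numbers, as in the sketch below. The notebook uses pandas elsewhere; this example sticks to the standard library csv module so it runs without extra packages, and it writes to an in-memory buffer instead of a file. The function name `parse_price` and the column name `price_eur` are assumptions made for the illustration.

```python
import csv
import io
import re


def parse_price(text):
    """Turn a price label such as '956 800 €' into an integer number of euros."""
    # Keep only the digits: this drops spaces (including non-breaking ones)
    # and the currency sign.
    digits = re.sub(r"\D", "", text)
    return int(digits) if digits else None


# First three labels from the output above
labels = ["956 800 €", "345 000 €", "849 000 €"]
prices = [parse_price(label) for label in labels]

# Write the values as CSV; use open("prices.csv", "w", newline="") to write to disk instead
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["price_eur"])
writer.writerows([price] for price in prices)

print(prices)  # [956800, 345000, 849000]
```

With pandas, the same list could be stored with pd.DataFrame({"price_eur": prices}).to_csv(...), as done for the episode data earlier.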


## Example 2

In [36]:
url="https://www.airbnb.com/experiences/272085"
 
driver.get(url)  # navigate to URL
# wait for page to load
# by waiting for <h1> element to appear on the page
title = (
    WebDriverWait(driver=driver, timeout=10)
    .until(visibility_of_element_located((By.CSS_SELECTOR, "h1")))
    .text
)
title

'Cérémonie du thé et sushis colorés'

In [56]:
# retrieve fully rendered HTML content
content = driver.page_source
#driver.close()
content


'<html data-is-hyperloop="true" data-hyperloop-version="1" class="scrollbar-gutter js-focus-visible dir native v1oc6b3k vgnbcm1 v1agkal2 vqw89vp vlugpmm g5l85gq" lang="fr" dir="ltr" style="--vh: 5.15px; --vw: 10.36px; --vw-unitless: 1036; --vw-px: 1036px;"><head><meta charset="utf-8"><meta name="locale" content="fr"><meta name="google" content="notranslate"><meta id="csrf-param-meta-tag" name="csrf-param" content="authenticity_token"><meta id="csrf-token-meta-tag" name="csrf-token" content="null"><meta id="english-canonical-url" content=""><meta name="twitter:widgets:csp" content="on"><meta name="mobile-web-app-capable" content="yes"><meta name="apple-mobile-web-app-capable" content="yes"><meta name="application-name" content="Airbnb"><meta name="apple-mobile-web-app-title" content="Airbnb"><meta name="theme-color" content="#ffffff"><meta name="msapplication-navbutton-color" content="#ffffff"><meta name="apple-mobile-web-app-status-bar-style" content="black-translucent"><meta name="msa

In [60]:
# we then could parse it with beautifulsoup
from bs4 import BeautifulSoup
soup = BeautifulSoup(content, "html.parser")
print(soup.find("h1").text)

Cérémonie du thé et sushis colorés


In [64]:
col1 = driver.find_element(By.XPATH, "//*[@id='site-content']/div[1]/div[7]/div/div/div/div/div[3]")
col1.text

"Récemment, la valeur de la nourriture japonaise a été reconsidérée.\nLa cuisine japonaise est très saine et magnifique. Dans ma classe, vous devriez les apprendre facilement et délicieux .\nVous pouvez apprendre à faire trois jolis sushis dans ma classe.\nCe ne sont pas des artisans de sushis traditionnels qui enseignent, mais c'est très originalité. Vous apprécierez la sensation saisonnière de ces sushis. Votre fabrication de sushis… Lire la suite"

## Useful resources:

* https://www.scrapingbee.com/blog/practical-xpath-for-web-scraping/