# <center>Scraping Workshop</center>

<div style="text-align: center;">
    <img src="../images/scrappeur.png" width="600" height="300">
</div>


## Introduction

Let's start by setting out the basics of scraping. 

Scraping means knowing how to read what's behind a site. With a simple right-click and `Inspect`, it's fairly easy to access the HTML code of the page.

<div style="text-align: center;">
    <img src="../images/du site au html.png" width="800" height="400">
</div>

Scraping consists of analysing the source code of a page, with two main applications:

- Locating elements (buttons, for example) and interacting with them to automate repetitive tasks
- Extracting different types of information (which is what we'll be doing in this workshop)

___

Next, you need to choose a scraping tool: ***Selenium*** or ***BeautifulSoup***, which you can use to recover the various elements of a web page.

- Selenium is useful for dynamic web pages where content is generated via JavaScript, requiring user interaction such as clicking, scrolling or text input.

- BeautifulSoup is a Python library used primarily for parsing HTML and XML documents. It is useful for extracting structured data from static web pages.

Generally speaking, we prefer to use Selenium, which allows more actions, but BeautifulSoup remains an important option. In this activity, we present both modules to help you get to grips with them.

___

## Activity 1a: Retrieving a Professor Layton riddle with Selenium

We're going to see how we can simply retrieve the main elements of a wiki page to create a database. To do this, we're going to connect to https://professeur-layton.fandom.com/fr/wiki/En_queue_de_poisson. 

The first step will be to retrieve:
- the title
- the riddle number
- the statement
- the solution
- the resolution

#### Importing Python modules

In [1]:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
import time
import random
import pandas as pd

#### Launch the browser and connect to the riddle

In [2]:
# Creating an instance of Chrome options
chrome_options = Options()

# Setting Chrome options 
chrome_options.add_argument('--disable-search-engine-choice-screen') 
chrome_options.add_argument('--disable-infobars')
# Create a new instance of the Chrome browser with the specified options
driver = webdriver.Chrome(options=chrome_options)

# Enter the URL you want to scrape and load it in the browser:
url = "https://professeur-layton.fandom.com/fr/wiki/En_queue_de_poisson"
driver.get(url)

____

If all has worked well, you should see a Chrome window open at the URL we've provided.

The site uses cookies. We could simply click on ‘ACCEPT ALL’ or ‘REFUSE ALL’ by hand, but we're going to use Selenium to perform one of these actions. The next two code cells are very specific to the site we're trying to scrape, so we won't dwell on them:

In [3]:
cookie = driver.find_element(By.CLASS_NAME ,'NN0_TB_DIsNmMHgJWgT7U')
cookie.click()

Next, we are going to scroll slightly so that the program can locate the various elements of the site (given the rather large advertising video at the top of the web page) :

In [4]:
# Retrieve the total height of the page
total_height = driver.execute_script("return document.body.scrollHeight")

# Calculate the scroll height (20% of the page in this example)
scroll_height = total_height * 0.20

# Scroll down by that amount (20% of the page)
driver.execute_script(f"window.scrollBy(0, {scroll_height});")
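
The scroll step above can be wrapped in a small helper. This is only a sketch: the function name `scroll_by_fraction` is hypothetical, and it assumes nothing more than a `driver` object exposing Selenium's `execute_script` method.

```python
def scroll_by_fraction(driver, fraction: float) -> float:
    """Scroll down by a fraction of the page height and return the offset in pixels."""
    if not 0 <= fraction <= 1:
        raise ValueError("fraction must be between 0 and 1")
    # Total height of the page body, in pixels
    total_height = driver.execute_script("return document.body.scrollHeight")
    offset = total_height * fraction
    # window.scrollBy is relative: it moves down by `offset` pixels from the current position
    driver.execute_script(f"window.scrollBy(0, {offset});")
    return offset
```

With the driver created earlier, `scroll_by_fraction(driver, 0.20)` would reproduce the 20% scroll of the cell above.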

___

We will now look at how to retrieve the information from the riddle page: the title, the riddle number, the statement, the solution and the resolution.

#### Clever recovery of Web elements

Here's a short [HTML cheat sheet](../form_html.md). It's a quick way of learning or recalling the main tags needed to read HTML and identify the different elements of a web page.

This is where we introduce the various Web element selectors : 

- ID = "id"
- NAME = "name"
- XPATH = "xpath"
- LINK_TEXT = "link text"
- PARTIAL_LINK_TEXT = "partial link text"
- TAG_NAME = "tag name"
- CLASS_NAME = "class name"
- CSS_SELECTOR = "css selector"

In [5]:
# Title recovery :
title = driver.find_element(By.ID, 'firstHeading')

# Riddle number retrieval :
num_enigme = driver.find_element(By.XPATH ,'//*[@id="mw-content-text"]/div/div[1]/div[2]/table/tbody/tr[4]/td')

# Retrieve the riddle statement :
# Note: an XPath starting with // searches the whole page, even when called on an element
enonce = driver.find_element(By.ID, "Énoncé")
enigme_enonce = enonce.find_elements(By.XPATH, "//span[@class='mw-headline' and @id='Énoncé']/ancestor::h2/following-sibling::p[following-sibling::h2]")

# Retrieving the answer :
reponse = driver.find_element(By.XPATH, '//*[@id="mw-content-text"]/div/p[8]')

# Retrieving the resolution (the detailed explanation) :
resolution = driver.find_element(By.ID, "Résolution")
resolution_enonce = resolution.find_elements(By.XPATH, "//span[@class='mw-headline' and @id='Résolution']/ancestor::h3/following-sibling::p[following-sibling::h3]")

Let's look at what our variables contain and what type they are:

In [6]:
title

<selenium.webdriver.remote.webelement.WebElement (session="fbb51fe6914db4fc5a4ca44eb79e66c9", element="f.65A58076A2D8875C15F6000527C46229.d.A360C40CD0FDB00724A24AF066159BD4.e.112")>

In [7]:
type(title)

selenium.webdriver.remote.webelement.WebElement

In [8]:
resolution_enonce

[<selenium.webdriver.remote.webelement.WebElement (session="fbb51fe6914db4fc5a4ca44eb79e66c9", element="f.65A58076A2D8875C15F6000527C46229.d.A360C40CD0FDB00724A24AF066159BD4.e.123")>,
 <selenium.webdriver.remote.webelement.WebElement (session="fbb51fe6914db4fc5a4ca44eb79e66c9", element="f.65A58076A2D8875C15F6000527C46229.d.A360C40CD0FDB00724A24AF066159BD4.e.124")>]

In [9]:
type(resolution_enonce)

list

#### Reading Web elements

You will probably have noticed that we have used `find_element` in some cases and `find_elements` in others. 

The choice between these two methods depends on what we want to obtain. 

If we want to extract a single element, such as a title, we'll use `find_element`, because we only want to retrieve one element. 

On the other hand, if we want to retrieve several elements, such as the text tags for a statement or the solution to a puzzle, we'll use `find_elements`, as this allows us to retrieve several elements at once.

This way of retrieving elements has an impact on how they are read. The `.text` attribute gives the visible text of a WebElement.
However, a list of WebElements has no `.text` attribute of its own: we have to read the text of each element in turn, hence the following code:

In [10]:
# Extraction of the WebElement title text
title = title.text

# Extract text from WebElement num_enigme
num_enigme = num_enigme.text

# Extract text from WebElements list enigme_enonce
enigme_enonce = [elem.text for elem in enigme_enonce]
enigme_enonce = "".join(enigme_enonce)

# Extract the text from the WebElement response
reponse = reponse.text

# Extract text from WebElements list resolution_enonce
resolution_enonce = [elem.text for elem in resolution_enonce]
resolution_enonce = "".join(resolution_enonce)
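
The joining step above can be sketched on its own to show the effect of the separator. This is an illustration under assumptions: `join_texts` is a hypothetical helper, and `SimpleNamespace` objects stand in for WebElements (anything with a `.text` attribute behaves the same way here).

```python
from types import SimpleNamespace

def join_texts(elements, sep: str = "\n") -> str:
    """Join the .text of each element, skipping elements whose text is empty."""
    return sep.join(elem.text for elem in elements if elem.text)

# Stand-ins for the WebElements returned by find_elements
paragraphs = [
    SimpleNamespace(text="First paragraph."),
    SimpleNamespace(text=""),
    SimpleNamespace(text="Second paragraph."),
]

glued = join_texts(paragraphs, sep="")   # paragraphs run into each other
by_line = join_texts(paragraphs)         # one paragraph per line
print(glued)
print(by_line)
```

Joining with an empty string, as in the cell above, glues the end of one paragraph to the start of the next; a newline separator keeps them apart.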

#### Storage in a DataFrame

In [11]:
# Gather the extracted fields in a dictionary
data = {
    'title': title,
    'number': num_enigme,
    'enonce': enigme_enonce,
    'solution': reponse,
    'resolution': resolution_enonce
}
# Display the dictionary
print(data)

{'title': 'En queue de poisson', 'number': '053', 'enonce': 'Alors que vous aviez le dos tourné, quelqu\'un a englouti le poisson que vous vous étiez préparé pour le dîner. Trois frères se trouvent à proximité des lieux du crime. Voici ce qu\'ils ont a dire :A : "Oui je l\'ai mangé. C\'était drôlement bon !"\nB : "J\'ai vu A manger le poisson !"\nC : "B et moi n\'y avons pas touché."L\'un d\'entre eux vous ment, mais qui ?', 'solution': 'La réponse est C.', 'resolution': "Le menteur est le frère C. A et C se sont partagés votre dîner.La réponse devient évidente quand on réalise que si A ment, alors B ment obligatoirement. Le même raisonnement a lieu si l'on considère que B ment. La seule réponse possible est donc que C est en train de mentir, ce qui implique que C a également touché au poisson."}
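
To go from this dictionary to a DataFrame, one possible approach is to wrap it in a list so that it becomes a single row. The sketch below uses a shortened, hypothetical `record` dictionary with three of the values printed above, rather than the full `data` variable.

```python
import pandas as pd

# Shortened stand-in for the `data` dictionary built above
record = {
    "title": "En queue de poisson",
    "number": "053",
    "solution": "La réponse est C.",
}

# A list containing one dictionary gives a one-row DataFrame,
# with one column per key
df = pd.DataFrame([record])
print(df.shape)
print(list(df.columns))
```

Passing the dictionary directly (without the list) would fail here, because pandas expects list-like values when a dictionary of columns is given without an index.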


<div style="text-align: center;">
    <img src="../images/learner_scraping.png" width="300" height="300">
</div>

And that's it! We've completed our first scrape!

___

## Activity 1b: Retrieving a Professor Layton riddle with BeautifulSoup

We're going to do the same scraping, but this time with BeautifulSoup.

#### Importing Python modules

In [12]:
# Import the libraries
from bs4 import BeautifulSoup, Tag
from typing import List
import requests
import random
import pandas as pd

#### Launch a query to retrieve all the source code of the URL to be scraped

To connect to the website, we're going to do something different. This time we're going to use the requests module, which sends HTTP requests to retrieve the content of web pages.

In [13]:
url = "https://professeur-layton.fandom.com/fr/wiki/En_queue_de_poisson"
# Send an HTTP GET request
data  = requests.get(url)

The next line is used to check whether the request was successful. The status code 200 indicates success, while other codes (such as 404 or 500) indicate errors :

In [14]:
data.status_code

200
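
As a small illustration of the status codes mentioned above, the standard library's `http.HTTPStatus` can turn a numeric code into its official phrase. The helper name `describe_status` is hypothetical and only meant as a sketch.

```python
from http import HTTPStatus

def describe_status(code: int) -> str:
    """Return the code, its standard phrase and a rough outcome category."""
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:
        # Code not defined in the standard library's list
        phrase = "Unknown status"
    if 200 <= code < 300:
        outcome = "success"
    elif code >= 400:
        outcome = "error"
    else:
        outcome = "informational or redirect"
    return f"{code} {phrase} ({outcome})"

for code in (200, 404, 500):
    print(describe_status(code))
```

With `requests`, calling `raise_for_status()` on the response object is another option: it raises an exception for 4xx and 5xx codes instead of requiring a manual check.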

The request returned status code 200, so we can move on to retrieving the HTML content of the web page:

In [15]:
# Retrieves the content of the response to the GET request in the form of plain text (HTML)
data  = requests.get(url).text
# Creation of a BeautifulSoup object using the html5lib parser, which will interpret the raw HTML contained in data
soup = BeautifulSoup(data,"html5lib")
#soup = BeautifulSoup(data,"html.parser")

`soup` is the resulting BeautifulSoup object, which makes it easy to navigate and manipulate the structure of the HTML document.

#### Display page source code

In [16]:
#print(soup.prettify())

#### Recovering Web elements from soup.prettify

In [17]:
# Extraction of the puzzle title and URL
title = soup.find('meta', attrs={'property': "og:title"}).get("content")
url_enigme = soup.find('meta', attrs={'property': "og:url"}).get("content")

#print(title, "\n")
#print(url_enigme)

# Extraction of the riddle number from the table row that follows the link to "Professeur Layton et l'Étrange Village"
numero = soup.find('a', title="Professeur Layton et l'Étrange Village").find_parent('tr').find_next_sibling('tr').find('td').text.strip()
#print(numero, "\n")

# Function for extracting the text between a given heading and the following heading
def extract_text_between(start_id: str, start_tag: str, stop_tag: str) -> str:
    # Find the starting element from the span ID and its parent tag
    start_element:Tag = soup.find('span', id=start_id).find_parent(start_tag)
    # Initialise an empty list to collect the text
    text_list: List[str] = []
    # Iterate on all the following elements of the same level
    for sibling in start_element.find_next_siblings():
        if sibling.name == stop_tag:  # Stop if the end element is reached
            break
        if sibling.name in ['p', 'ul']:  # Collect text from <p> and <ul> elements
            text_list.append(sibling.get_text())
    # Join the collected text into a single character string
    return "\n".join(text_list)

# Extraction of the statement and resolution
enigme_enonce = extract_text_between('Énoncé', 'h2', 'h2')
reponse = extract_text_between('Résolution', 'h3', 'h3')


#print(enigme_enonce)
#print(reponse)
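
To see the sibling walk in isolation, here is a sketch on a tiny hand-written HTML fragment. The fragment, the variable `sample_soup` and the function name `text_between` are made up for the example; the function follows the same logic as `extract_text_between`, except that the soup is passed as an argument.

```python
from bs4 import BeautifulSoup

def text_between(soup, start_id, start_tag, stop_tag):
    """Collect the text of <p>/<ul> siblings after a heading, up to the next stop_tag."""
    start = soup.find("span", id=start_id).find_parent(start_tag)
    parts = []
    for sibling in start.find_next_siblings():
        if sibling.name == stop_tag:   # next heading reached: stop
            break
        if sibling.name in ("p", "ul"):
            parts.append(sibling.get_text())
    return "\n".join(parts)

# Hand-written fragment mimicking the structure of a wiki section
html = """
<h2><span id="Énoncé">Énoncé</span></h2>
<p>First paragraph.</p>
<div>Ignored: neither p nor ul.</div>
<p>Second paragraph.</p>
<h2><span id="Solution">Solution</span></h2>
<p>Never reached.</p>
"""
sample_soup = BeautifulSoup(html, "html.parser")
result = text_between(sample_soup, "Énoncé", "h2", "h2")
print(result)
```

The `<div>` is skipped because only `p` and `ul` tags are collected, and the loop stops at the second `<h2>`, so the last paragraph is never read.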

More information: 
- The ‘og’ prefix in og:title and og:url refers to the Open Graph Protocol (OGP), a standard used to structure and enrich web page metadata. This protocol was initially developed by Facebook, but is now widely used by various social networking platforms and search engines to better understand and display the information on a web page when it is shared.

`numero = soup.find('a', title="Professeur Layton et l'Étrange Village").find_parent('tr').find_next_sibling('tr').find('td').text.strip()`:
- `soup.find('a', title=...)` : Find the `<a>` tag whose title attribute is equal to "Professeur Layton et l'Étrange Village". This tag is a hyperlink pointing to the page of that game.
- `.find_parent('tr')` : Find the closest enclosing `<tr>` tag (table row) of this `<a>` tag.
- `.find_next_sibling('tr')` : Find the next element at the same level as this `<tr>` tag, which is the next `<tr>` tag. This corresponds to the next row in the table.
- `.find('td')` : Find the first `<td>` cell in this new `<tr>` row, which on this page contains the riddle number.
- `.text.strip()` : Extract the text contained in this `<td>` cell and remove the whitespace at the beginning and end of the text with `strip()`.
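
The navigation chain described above can be tried on a small hand-written page. Everything in the fragment below (titles, link, number) is made up for the illustration, and `soup_demo` is a hypothetical variable name chosen so as not to clash with the notebook's `soup`.

```python
from bs4 import BeautifulSoup

# Hand-written page: an Open Graph meta tag and a two-row table
html = """
<html><head>
<meta property="og:title" content="Sample riddle">
</head><body>
<table>
  <tr><th><a href="/wiki/Game" title="Sample game">Sample game</a></th></tr>
  <tr><td> 053 </td></tr>
</table>
</body></html>
"""
soup_demo = BeautifulSoup(html, "html.parser")

# Open Graph metadata: read the `content` attribute of the matching <meta> tag
og_title = soup_demo.find("meta", attrs={"property": "og:title"}).get("content")

# Step-by-step version of the chained call
link = soup_demo.find("a", title="Sample game")   # the <a> tag with this title
row = link.find_parent("tr")                      # the table row containing the link
next_row = row.find_next_sibling("tr")            # the following table row
number = next_row.find("td").text.strip()         # first cell, surrounding spaces removed

print(og_title, number)
```

Splitting the chain into intermediate variables makes it easier to see which step returns `None` when the page structure differs from what is expected.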

In [18]:
# Gather the extracted fields in a dictionary
data = {
    'title': title,
    'number': numero,
    'description': enigme_enonce,
    'solution': reponse
}
# Display the dictionary
print(data)

{'title': 'En queue de poisson', 'number': '053', 'description': 'Alors que vous aviez le dos tourné, quelqu\'un a englouti le poisson que vous vous étiez préparé pour le dîner. Trois frères se trouvent à proximité des lieux du crime. Voici ce qu\'ils ont a dire\xa0:\n\nA\xa0: "Oui je l\'ai mangé. C\'était drôlement bon\xa0!"\nB\xa0: "J\'ai vu A manger le poisson\xa0!"\nC\xa0: "B et moi n\'y avons pas touché."\n\nL\'un d\'entre eux vous ment, mais qui\xa0?\n', 'solution': "Le menteur est le frère C. A et C se sont partagés votre dîner.\n\nLa réponse devient évidente quand on réalise que si A ment, alors B ment obligatoirement. Le même raisonnement a lieu si l'on considère que B ment. La seule réponse possible est donc que C est en train de mentir, ce qui implique que C a également touché au poisson.\n"}


___

## Activity 2a: Generalising to several puzzles with Selenium

For this activity, we're going to create a more complete table for several puzzles.

This time we want a table containing the following columns: `title`, `num_enigme`, `url`, `image`, `enigme`, `solution` for a handful of riddles.

The first part, which selects the riddles, is provided. We start by going to a page containing links to all the puzzles in the game:

In [19]:
url = "https://professeur-layton.fandom.com/fr/wiki/Cat%C3%A9gorie:%C3%89nigmes"

We connect in ‘headless’ mode, which runs Chrome without opening a visible window:

In [20]:
# Creating an instance of Chrome options
chrome_options = Options()

# Setting Chrome options 
chrome_options.add_argument('--disable-search-engine-choice-screen') 
chrome_options.add_argument('--disable-infobars')
chrome_options.add_argument('--headless=new')
# Create a new instance of the Chrome browser with the specified options
driver = webdriver.Chrome(options=chrome_options)
driver.get(url)

We retrieve all the href links to the puzzles using a CSS selector:

In [21]:
# Retrieve all <a> elements with the ‘category-page__member-link’ class.
elements = driver.find_elements(By.CSS_SELECTOR, "a.category-page__member-link")
# Extract the hrefs of the elements found
hrefs = [element.get_attribute("href") for element in elements]

We randomly extract a list of 3 links (to avoid having code that is too long and overloading the wiki with requests) :

In [22]:
# Randomly select 3 hrefs
hrefs = random.sample(hrefs, 3)
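
If the same selection is needed from one run to the next, one option is to draw the sample from a seeded generator. The sketch below uses made-up `example.org` links and an arbitrary seed of 42; neither comes from the workshop.

```python
import random

# Hypothetical list of links standing in for the scraped hrefs
links = [f"https://example.org/wiki/riddle_{i}" for i in range(10)]

# A dedicated generator with a fixed seed gives the same sample on every run
rng = random.Random(42)

# random.sample raises ValueError if k exceeds the population size,
# so k is capped at the number of available links
picked = rng.sample(links, k=min(3, len(links)))
print(picked)
```

Sampling is done without replacement, so the same link cannot be picked twice.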

In [23]:
hrefs

['https://professeur-layton.fandom.com/fr/wiki/Chamailleries',
 'https://professeur-layton.fandom.com/fr/wiki/Paf_le_chien_!',
 'https://professeur-layton.fandom.com/fr/wiki/La_travers%C3%A9e_(1)']

We will then create a function which, for each link, retrieves the data we are interested in :

In [26]:
from selenium.common.exceptions import NoSuchElementException

def collecte_enigme(href):
    url = href
    
    # Browser configuration (Chrome in this example)
    options = Options()
    options.add_argument("--headless")  # Run Chrome in headless mode
    # Add the flags to `options`, the object actually passed to Chrome below
    options.add_argument('--disable-search-engine-choice-screen')
    options.add_argument('--disable-infobars')
    driver = webdriver.Chrome(options=options)
    
    # Open the web page
    driver.get(url)
    time.sleep(3)
    try:
        cookie = driver.find_element(By.XPATH ,'/html/body/div[7]/div/div/div[2]/div[2]/div[1]')
        cookie.click()
    except :
        pass
    try : 
        # Title recovery :
        title = driver.find_element(By.ID, 'firstHeading')
        title = title.text
    except : 
        title = None
    try:
        time.sleep(4)
        num_enigme = driver.find_element(By.XPATH ,'//*[@id="mw-content-text"]/div/div[1]/div[2]/table/tbody/tr[4]/td')
        num_enigme = num_enigme.text
       
    except :
        num_enigme = None
    try:
        
        # Image recovery
        image_element = driver.find_element(By.CSS_SELECTOR, 'div.floatnone img')
        image = image_element.get_attribute("src")
    
    except:
        image=None
    
    # Retrieve the statement of the riddle :
    enigme_enonce = None
    try : 
        enonce = driver.find_element(By.ID, "Énoncé")
        enigme_enonce = enonce.find_elements(By.XPATH, "//span[@class='mw-headline' and @id='Énoncé']/ancestor::h2/following-sibling::p[following-sibling::h2]")
        enigme_enonce = "".join([elem.text for elem in enigme_enonce])
    except NoSuchElementException:
        enigme_enonce = image
        
    resolution_title = None
    
    try:
        resolution_title = driver.find_element(By.ID, "Résolution")
        resolution = resolution_title.find_elements(By.XPATH, "//span[@class='mw-headline' and @id='Résolution']/ancestor::h3/following-sibling::p[following-sibling::h3]")
        resolution = "\n".join([elem.text for elem in resolution])
    except Exception:
        # No usable "Résolution" section: fall back to the "Solution" section
        resolution = None
        try:
            solution = driver.find_element(By.ID, "Solution")
        except NoSuchElementException:
            solution = None
        if solution is not None:
            try:
                # This XPath starts with //, so the slideshow is searched in the whole page
                slideshow_div = solution.find_element(By.XPATH, '//*[@id="slideshow-0"]/div/div[1]')

                # Locate the specific image within this slideshow
                image_element = slideshow_div.find_element(By.XPATH, './/img[@class="thumbimage"]')
                resolution = image_element.get_attribute('src')
            except NoSuchElementException:
                # No slideshow image: collect the paragraphs that follow the heading
                text_elements = solution.find_elements(By.XPATH, 'following::p')
                resolution = "\n".join([elem.text for elem in text_elements])

    # Close this page's browser so that Chrome processes do not accumulate
    driver.quit()

    # Creating the DataFrame
    return pd.DataFrame([{'title': title, 'num_enigme':num_enigme, 'url': url, 'image': image, 'enigme': enigme_enonce, 'solution': resolution}])


We then use this function to create our DataFrame :

In [25]:
df_final = pd.DataFrame()
for href in hrefs:
    print("Scraping page", href, "in progress...")
    df_final = pd.concat([df_final, collecte_enigme(href)], ignore_index=True)
print("Scraping finished")

Scraping page https://professeur-layton.fandom.com/fr/wiki/Chamailleries in progress...
Scraping page https://professeur-layton.fandom.com/fr/wiki/Paf_le_chien_! in progress...
Scraping page https://professeur-layton.fandom.com/fr/wiki/La_travers%C3%A9e_(1) in progress...
Scraping finished
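
Since the goal is also to avoid overloading the wiki, a short random pause between pages is one simple way to space out the requests. This is a sketch only: `polite_pause` is a hypothetical helper and the delay bounds are arbitrary.

```python
import random
import time

def polite_pause(min_s: float = 1.0, max_s: float = 3.0) -> float:
    """Sleep for a random duration between min_s and max_s seconds and return it."""
    delay = random.uniform(min_s, max_s)
    time.sleep(delay)
    return delay

# Demo with tiny delays so that the example runs quickly
for page in ["page-1", "page-2"]:
    waited = polite_pause(0.01, 0.02)
    print(f"{page}: waited {waited:.3f}s")
```

In the scraping loop above, a call such as `polite_pause()` could be placed after each `collecte_enigme(href)`.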


In [27]:
pd.set_option('display.max_rows', 5)
pd.set_option('display.max_columns', 6)
df_final

| | title | num_enigme | url | image | enigme | solution |
|---|---|---|---|---|---|---|
| 0 | Chamailleries | 27 | https://professeur-layton.fandom.com/fr/wiki/C... | https://static.wikia.nocookie.net/layton/image... | Six frères se réunissent autour d'une table po... | Beau travail !\nPlacez les garçons comme le mo... |
| 1 | Paf le chien ! | 9 | https://professeur-layton.fandom.com/fr/wiki/P... | https://static.wikia.nocookie.net/layton/image... | Les allumettes ci-dessous représentent un chie... | La voiture a aplati le pauvre chien ! Il faut ... |
| 2 | La traversée (1) | 7 | https://professeur-layton.fandom.com/fr/wiki/L... | https://static.wikia.nocookie.net/layton/image... | Vous devez amener les trois loups et les trois... | |


In [28]:
# Close the browser and end the driver session
driver.quit()

## Activity 2b: Generalising to several puzzles with BeautifulSoup

#### Launch a query to retrieve all the source code of the URL to be scraped

In [29]:
url = "https://professeur-layton.fandom.com/fr/wiki/Cat%C3%A9gorie:%C3%89nigmes"
data  = requests.get(url).text
soup = BeautifulSoup(data,"html5lib")

#### Display page source code

In [30]:
#print(soup.prettify())

#### Collect all puzzle links and take only 5 at random

In [31]:
# Find all <a> elements with the "category-page__member-link" class
links = soup.find_all('a', class_='category-page__member-link')

# Extract the href attributes
hrefs = [link.get('href') for link in links]

# Display the hrefs
#print(hrefs)
#print(len(hrefs))

# Randomly select 5 hrefs
hrefs = random.sample(hrefs, 5)

# Display the randomly selected hrefs
#print(hrefs)
#print(len(hrefs))

#### Let's create a function to repeat the data extraction steps

In [32]:
def collecte_enigme(racine, href):
    url = racine+href
    data  = requests.get(url).text
    soup = BeautifulSoup(data,"html.parser")
  
    # Retrieve the title
    title = soup.find('meta', attrs={'property': "og:title"})
    title = title.get("content")

    # Retrieve the URL
    url_enigme = soup.find('meta', attrs={'property': "og:url"})
    url_enigme = url_enigme.get("content")

    # Retrieve the image: heuristic that keeps the longest src value found on the page
    src_tags = soup.find_all(src=True)
    # Extract the src values
    src_urls = [tag['src'] for tag in src_tags]
    longest_src = max(src_urls, key=len) if src_urls else None
    image = longest_src

    # Retrieve the riddle statement
    try:
        enigme_title = soup.find('span', {'class': 'mw-headline', 'id': 'Énoncé'})
        # Find the paragraph following the "Énoncé" section heading
        enigme_paragraph = enigme_title.find_next('p')
        # Extract the text of the paragraph
        enigme = enigme_paragraph.get_text(strip=True)
        
    except AttributeError:
        # Heading or paragraph not found (find returned None): fall back to the image
        enigme = image
        
    # Retrieve the solution
    try :
        
        reponse_title = soup.find('span', {'class': 'mw-headline', 'id': 'Solution'})
        reponse_paragraph = reponse_title.find_next('p')
        reponse = reponse_paragraph.get_text(strip=True)
    except AttributeError:
        # No text solution: fall back to the link of the first image, if any
        a_tag = soup.find('a', class_='image')
        if a_tag:
            reponse = a_tag.get('href')
        else:
            reponse = "The solution could not be loaded correctly"

    # Retrieve the hints ("indices" in French)
    tabs = soup.select('ul.wds-tabs li.wds-tabs__tab a')
    contents = soup.select('div.wds-tab__content')
    # Extract the content of each hint into a list
    indices = []
    for i, tab in enumerate(tabs):
        if i < len(contents):
            content_divs = contents[i].select('div[style*="overflow-y:auto"]')
            if content_divs:
                content = content_divs[0].get_text(strip=True)
                indices.append(content)
        
    # Append the collected data as a dictionary
    data = []
    data.append({'title': title, 'url': url_enigme, 'image': image, 'enigme': enigme, 'indices': indices,'solution': reponse})

    
    df = pd.DataFrame(data)
    return df

#### Using the function and storing it in a DataFrame

In [33]:
racine = "https://professeur-layton.fandom.com"
df_final = pd.DataFrame()
for href in hrefs:
    #print(href)
    df_final = pd.concat([df_final, collecte_enigme(racine,href)], ignore_index=True)
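
The hrefs collected with BeautifulSoup are relative paths, which is why the function prepends `racine`. As a hedged alternative to plain string concatenation, the standard library's `urljoin` handles both relative and already-absolute links; the variable names below are chosen for the example only.

```python
from urllib.parse import urljoin

root = "https://professeur-layton.fandom.com"
relative_href = "/fr/wiki/Chamailleries"
absolute_href = "https://professeur-layton.fandom.com/fr/wiki/Paf_le_chien_!"

# A relative path is resolved against the root
full_relative = urljoin(root, relative_href)
# An absolute URL is returned unchanged, where root + href would duplicate the domain
full_absolute = urljoin(root, absolute_href)

print(full_relative)
print(full_absolute)
```

This matters when mixing the two activities: Selenium's `get_attribute("href")` returned absolute URLs above, whereas BeautifulSoup returns the attribute exactly as written in the HTML.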

#### Visualisation

In [34]:
pd.set_option('display.max_rows', 7)
pd.set_option('display.max_columns', 6)
df_final

| | title | url | image | enigme | indices | solution |
|---|---|---|---|---|---|---|
| 0 | Dédale numérique | https://professeur-layton.fandom.com/fr/wiki/D... | https://static.wikia.nocookie.net/layton/image... | Essayez de sortir de ce labyrinthe ! Commencez... | [Si vous essayez tous les itinéraires possible... | Par ici la sortie. |
| 1 | Tas de feuilles | https://professeur-layton.fandom.com/fr/wiki/T... | https://static.wikia.nocookie.net/layton/image... | Plusieurs feuilles de calque ont été superposé... | [Trois couches ici, quatre couches là...Marque... | La réponse est 5. |
| 2 | Tenir l'affiche | https://professeur-layton.fandom.com/fr/wiki/T... | https://static.wikia.nocookie.net/layton/image... | Alors que Benny s'appliquait à placarder des a... | [Il est plus simple d'éliminer les affiches di... | Il suffit de cocher l'affiche B. |
| 3 | Inéquations ? | https://professeur-layton.fandom.com/fr/wiki/I... | https://static.wikia.nocookie.net/layton/image... | Eh bien, on dirait que quelqu'un a encore écri... | [Au premier coup d'œil, il semble que l'auteur... | La réponse est 1. |
| 4 | Alchimie en folie 01 | https://professeur-layton.fandom.com/fr/wiki/A... | https://static.wikia.nocookie.net/layton/image... | https://static.wikia.nocookie.net/layton/image... | [] | Solution à l'énigme. |


## Conclusion

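We have succeeded in building a DataFrame from a website!
The important thing to remember is that BeautifulSoup is very useful for so-called “open” sites, whose content can be fetched with a plain HTTP request, and for repeating the same extraction over a large number of pages. Selenium remains the better choice when a page needs JavaScript or user interaction before its content appears.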