# <center>Scraping Activity</center>

<div style="text-align: center;">
    <img src="images/scrappeur.png" width="600" height="300">
</div>

___

## Introduction

Let's start with the basics of scraping. Scraping means reading the code that sits behind a website. Right-click any element of a page and choose "Inspect" to see the page's HTML code.

<div style="text-align: center;">
    <img src="images/du site au html.png" width="800" height="300">
</div>

Here's a short [HTML cheat sheet](form_html.md). It's a quick way to learn, or refresh, the main tags you need to read HTML and identify the different elements of a web page.
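
To make the idea of tags and attributes concrete, here is a minimal sketch that lists the tags found in a small HTML snippet. The snippet is invented for illustration, and the standard-library `html.parser` module is used only so that the example runs without any extra installation.

```python
from html.parser import HTMLParser


class TagLister(HTMLParser):
    """Record every opening tag together with its attributes."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        self.tags.append((tag, dict(attrs)))


# Toy HTML, not taken from a real page
snippet = """
<h1 id="firstHeading">La traversée</h1>
<p class="intro">A <a href="/wiki/Example">link</a> inside a paragraph.</p>
<ul><li>First item</li><li>Second item</li></ul>
"""

lister = TagLister()
lister.feed(snippet)
for tag, attrs in lister.tags:
    print(tag, attrs)
```

Each printed line shows a tag name (`h1`, `p`, `a`, ...) and the attributes (`id`, `class`, `href`) that scraping tools rely on to target an element.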

Next, choose a scraping tool: Selenium or BeautifulSoup.

- Selenium is useful for dynamic web pages where content is generated via JavaScript, requiring user interaction such as clicking, scrolling or text input.

- BeautifulSoup is a Python library used primarily for parsing HTML and XML documents. It is useful for extracting structured data from static web pages.

Generally speaking, we prefer Selenium, which allows more actions, but BeautifulSoup remains an important option. In this activity, we present both modules to help you get to grips with them.
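
One rough way to choose between the two tools is to check whether the text you see in the browser is already present in the raw HTML. The sketch below illustrates that heuristic on two invented pages; the function name `is_in_static_html` is hypothetical, and a real check would run on the HTML downloaded from the target site.

```python
def is_in_static_html(raw_html: str, expected_text: str) -> bool:
    """Return True if the expected text is already in the raw HTML."""
    return expected_text.casefold() in raw_html.casefold()


# Toy pages: the first ships its content, the second builds it with JavaScript
static_page = "<html><body><h1>La traversée</h1></body></html>"
dynamic_page = (
    "<html><body><div id='app'></div>"
    "<script src='app.js'></script></body></html>"
)

print(is_in_static_html(static_page, "La traversée"))
print(is_in_static_html(dynamic_page, "La traversée"))
```

If the text is found, a static parser such as BeautifulSoup is probably enough. If it is missing, the content is likely generated in the browser and Selenium is the safer choice.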

___

## Activity 1a: Recover a Professor Layton riddle with Selenium

We're going to see how to retrieve the main elements of a wiki page in order to build a database. To do this, we'll connect to https://professeur-layton.fandom.com/fr/wiki/La_travers%C3%A9e_(1).

The first goal is to recover:
- the title,
- the riddle number,
- the statement,
- the solution.

###### Importing Python modules

In [None]:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.common.by import By
import time
import pandas as pd

###### Launch the browser and connect to the riddle

In [None]:
# Set up the Chrome driver
service = Service(executable_path=ChromeDriverManager().install())
# Add options
chrome_options = Options()
# Disable the property that reveals the browser is controlled by automation
chrome_options.add_argument('--disable-blink-features=AutomationControlled')
# Launch Chrome with the driver and the options
driver = webdriver.Chrome(service=service, options=chrome_options)
# Set the URL we want to scrape
url = "https://professeur-layton.fandom.com/fr/wiki/La_travers%C3%A9e_(1)"
driver.get(url)

###### Clever recovery of Web elements

This is where we introduce the Web element selectors, which Selenium exposes as attributes of the `By` class:

- ID = "id"
- NAME = "name"
- XPATH = "xpath"
- LINK_TEXT = "link text"
- PARTIAL_LINK_TEXT = "partial link text"
- TAG_NAME = "tag name"
- CLASS_NAME = "class name"
- CSS_SELECTOR = "css selector"
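
To get a feel for XPath syntax without opening a browser, here is a small sketch on an invented, well-formed snippet. It uses the standard-library `xml.etree.ElementTree` module, which supports only a subset of XPath (no `following-sibling`, for instance), so it is a practice tool rather than a replacement for Selenium's `By.XPATH`.

```python
import xml.etree.ElementTree as ET

# Toy document that mimics the ids used on a wiki page
page = ET.fromstring(
    """<html><body>
  <h1 id="firstHeading">La traversée</h1>
  <div id="mw-content-text">
    <p>First paragraph</p>
    <p>Second paragraph</p>
  </div>
</body></html>"""
)

# Select by id, whatever the tag: short and robust
heading = page.find(".//*[@id='firstHeading']")

# Select by position under an identified parent: breaks if the layout changes
second_p = page.find(".//div[@id='mw-content-text']/p[2]")

print(heading.text)
print(second_p.text)
```

Expressions anchored on an `id` tend to survive page updates better than long positional paths such as `div/div[1]/div[2]/table/...`.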

In [None]:
# Title recovery :
title = driver.find_element(By.XPATH ,'//*[@id="firstHeading"]')

# Find the riddle number :
num_enigme = driver.find_element(By.XPATH ,'//*[@id="mw-content-text"]/div/div[1]/div[2]/table/tbody/tr[4]/td')

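# Retrieving the statement :
enonce = driver.find_element(By.XPATH ,'//*[@id="Énoncé"]')
# An XPath starting with '//' searches the whole document, so we query the driver directly.
# Note: following-sibling matches every later <p>/<ul> sibling, not only those of this section.
enigme_enonce = driver.find_elements(By.XPATH, '//h2[span[@id="Énoncé"]]/following-sibling::p | //h2[span[@id="Énoncé"]]/following-sibling::ul')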

# Retrieving the answer :
reponse = driver.find_element(By.XPATH, '//*[@id="mw-content-text"]/div/p[8]')

###### Reading Web elements

In [None]:
title = title.text
num_enigme = num_enigme.text
enigme_enonce = [elem.text for elem in enigme_enonce]
enigme_enonce = "\n".join(enigme_enonce)
reponse = reponse.text

###### Storage in a DataFrame

In [None]:
# Build a dictionary of the collected fields
data = {
    'title': title,
    'number': num_enigme,
    'description': enigme_enonce,
    'solution': reponse
}
# Print data
print(data)

# Store the record in a one-row DataFrame
df = pd.DataFrame([data])
df

___

## Activity 1b: Recover a Professor Layton riddle with BeautifulSoup

###### Importing Python modules

In [None]:
# Import the libraries
from bs4 import BeautifulSoup
import random
import requests
import re

###### Send a request to retrieve the full source code of the URL to be scraped

In [None]:
url = "https://professeur-layton.fandom.com/fr/wiki/La_travers%C3%A9e_(1)"
data = requests.get(url).text
soup = BeautifulSoup(data, "html5lib")

###### Display page source code

In [None]:
#print(soup.prettify())

###### Recovering Web elements from soup.prettify

In [None]:
title = soup.find('meta', attrs={'property': "og:title"})
title = title.get("content")
print(title)
print("\n")

url_enigme = soup.find('meta', attrs={'property': "og:url"})
print(url_enigme.get("content"))

# Find the <tr> tag containing "Professeur Layton et l'Étrange Village"
layton_row = soup.find('a', title="Professeur Layton et l'Étrange Village").find_parent('tr')
# Find the next <tr> tag
numero_row = layton_row.find_next_sibling('tr')
# Extract the content of the <td> tag that holds the riddle number
numero = numero_row.find('td').text.strip()
print(numero)

print("\n")
start_element = soup.find('span', id='Énoncé').find_parent('h2')

# Initialize an empty list to collect the text
text_list = []

# Iterate over all next siblings of the start element
for sibling in start_element.find_next_siblings():
    if sibling.name == 'h2':  # Stop if we reach another h2 element
        break
    if sibling.name in ['p', 'ul']:  # Collect text from p and ul elements
        text_list.append(sibling.get_text())

# Join the collected text into a single string
enigme_enonce = "\n".join(text_list)

# Print the resulting string
print(enigme_enonce)

##########

start_element = soup.find('span', id='Résolution').find_parent('h3')

# Initialize an empty list to collect the text
text_list = []

# Iterate over all next siblings of the start element
for sibling in start_element.find_next_siblings():
    if sibling.name in ['h2', 'h3']:  # Stop at the next section heading (h2 or h3)
        break
    if sibling.name in ['p', 'ul']:  # Collect text from p and ul elements
        text_list.append(sibling.get_text())

# Join the collected text into a single string
reponse = " ".join(text_list)
print(reponse)
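
The two loops above follow the same pattern, so they can be factored into one helper. The sketch below is one possible version, tested on invented HTML that only imitates the wiki's layout. The name `section_text` is hypothetical, and it assumes each section title is a `<span id=...>` inside an `<h2>` or `<h3>`.

```python
from bs4 import BeautifulSoup


def section_text(soup, span_id, joiner="\n"):
    """Collect <p>/<ul> text after the heading that holds <span id=span_id>.

    Returns None when the section is missing instead of raising.
    """
    anchor = soup.find("span", id=span_id)
    if anchor is None:
        return None
    heading = anchor.find_parent(["h2", "h3"])
    if heading is None:
        return None
    parts = []
    for sibling in heading.find_next_siblings():
        if sibling.name in ("h2", "h3"):  # next section starts here
            break
        if sibling.name in ("p", "ul"):
            parts.append(sibling.get_text().strip())
    return joiner.join(parts)


# Toy HTML with placeholder text, not the real page
html = """
<div>
<h2><span id="Énoncé">Énoncé</span></h2>
<p>Statement paragraph.</p>
<ul><li>Statement rule</li></ul>
<h2><span id="Solution">Solution</span></h2>
<h3><span id="Résolution">Résolution</span></h3>
<p>Answer paragraph.</p>
</div>
"""
toy_soup = BeautifulSoup(html, "html.parser")

print(section_text(toy_soup, "Énoncé"))
print(section_text(toy_soup, "Résolution", joiner=" "))
print(section_text(toy_soup, "Missing"))
```

Returning `None` for a missing section avoids the `AttributeError` that `soup.find(...).find_parent(...)` raises when `find` returns nothing, which matters once the code runs over many pages.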

In [None]:
# Build a dictionary of the collected fields
data = {
    'title': title,
    'number': numero,
    'description': enigme_enonce,
    'solution': reponse
}

# Print data
print(data)

___

## Activity 2: Your turn!

In this second part, we suggest you do the same thing for several riddles. To help, we give you the code that retrieves all the riddle URLs and selects 5 of them at random, to avoid overloading this small wiki.

###### Send a request to retrieve the full source code of the URL to be scraped

In [None]:
url = "https://professeur-layton.fandom.com/fr/wiki/Cat%C3%A9gorie:%C3%89nigmes"
data = requests.get(url).text
soup = BeautifulSoup(data, "html5lib")

###### Display page source code

In [None]:
#print(soup.prettify())

###### Collect all the puzzle links and take just 5 at random

In [None]:
# Find all <a> elements with the 'category-page__member-link' class.
links = soup.find_all('a', class_='category-page__member-link')

# Extract href attributes
hrefs = [link.get('href') for link in links]

# Display hrefs
#print(hrefs)
#print(len(hrefs))

# Randomly select 5 hrefs
hrefs = random.sample(hrefs, 5)

# Display randomly selected hrefs
#print(hrefs)
#print(len(hrefs))
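
`random.sample` raises a `ValueError` when asked for more items than the list contains, and `link.get('href')` can return `None`. The sketch below shows one defensive way to draw the sample; the helper name `pick_links` and its `seed` parameter are assumptions added for illustration.

```python
import random


def pick_links(hrefs, k=5, seed=None):
    """Return up to k distinct, non-empty hrefs chosen at random."""
    rng = random.Random(seed)  # a seed makes the draw reproducible
    unique = sorted({h for h in hrefs if h})  # drop None, '' and duplicates
    return rng.sample(unique, min(k, len(unique)))


# Toy list with a duplicate and a missing href
print(pick_links(["/a", "/b", None, "/a"], k=5))

# Larger toy list: exactly 5 links come back
many = [f"/wiki/{i}" for i in range(20)]
print(pick_links(many, k=5, seed=0))
```

Passing a fixed `seed` is handy while debugging, because the same five pages are requested on every run.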

###### Let's create a function to repeat the data extraction steps

In [None]:
def collecte_enigme(racine, href):
    url = racine+href
    data  = requests.get(url).text
    soup = BeautifulSoup(data,"html.parser")
  
    # Title recovery
    title = ...
    title = ...

    # Recovering urls
    url_enigme = ...
    url_enigme = ...

    # Retrieving the riddle
    enigme_title = ...
    # Find the paragraph following the title of the 'Statement' section
    enigme_paragraph = ...
    # Extract text from paragraph
    enigme = enigme_paragraph.get_text(strip=True)
        
    
        
    # Retrieving the answer
    reponse_title = ...
    reponse_paragraph = ...
    reponse = reponse_paragraph.get_text(strip=True)
  
    # Append the collected data as a dictionary
    data = []
    data.append({'Title': title, 'url': url_enigme, 'Enigme': enigme, 'Solution': reponse})

    
    df = pd.DataFrame(data)
    return df

###### Using the function and storing the results in a DataFrame

In [None]:
racine = "https://professeur-layton.fandom.com"
df_final = pd.DataFrame()
for href in hrefs:
    #print(href)
    df_final = pd.concat([df_final, collecte_enigme(racine,href)], ignore_index=True)
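
Building the URL with `racine + href` works as long as `racine` has no trailing slash and `href` starts with one. As a hedged alternative, the standard-library `urljoin` handles those variations; the wrapper name `build_url` is hypothetical.

```python
from urllib.parse import urljoin


def build_url(racine, href):
    """Join a site root and a link, whether the link is relative or absolute."""
    return urljoin(racine, href)


racine = "https://professeur-layton.fandom.com"
href = "/fr/wiki/La_travers%C3%A9e_(1)"

print(build_url(racine, href))
# A trailing slash on the root does not produce a double slash
print(build_url(racine + "/", href))
# An absolute link is returned unchanged
print(build_url(racine, "https://example.org/page"))
```

Inside `collecte_enigme`, this would amount to replacing `url = racine+href` with `url = urljoin(racine, href)`.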

###### Visualisation

In [None]:
pd.set_option('display.max_rows', 7)
pd.set_option('display.max_columns', 6)
df_final

___

## Conclusion

We finally managed to build a DataFrame from a website! The key takeaway is that BeautifulSoup is very useful for 'open' sites (static pages that need no interaction) and for repeating the same data collection over many pages.