# Activity 3: Generalising to several puzzles with Selenium

<div style="text-align: center;">
    <img src="../images/scrappeur.png" width="600" height="300">
</div>


___

For this activity, we will build a more complete table covering several puzzles.

This time we want a table with the following columns: `title`, `num_enigme`, `url`, `image`, `enigme`, `solution`, with one row per selected puzzle.

The first part, which selects the puzzles, is provided. We start from a page that links to all the puzzles in the game:

In [None]:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
import time
import random
import pandas as pd

In [19]:
url = "https://professeur-layton.fandom.com/fr/wiki/Cat%C3%A9gorie:%C3%89nigmes"

We launch Chrome in headless mode, that is, without opening a visible browser window:

In [20]:
# Creating an instance of Chrome options
chrome_options = Options()

# Setting Chrome options 
chrome_options.add_argument('--disable-search-engine-choice-screen') 
chrome_options.add_argument('--disable-infobars')
chrome_options.add_argument('--headless=new')
# Create a new instance of the Chrome browser with the specified options
driver = webdriver.Chrome(options=chrome_options)
driver.get(url)

We retrieve the `href` of every puzzle link using a CSS selector:

In [21]:
# Retrieve all <a> elements with the ‘category-page__member-link’ class.
elements = driver.find_elements(By.CSS_SELECTOR, "a.category-page__member-link")
# Extract the hrefs of the elements found
hrefs = [element.get_attribute("href") for element in elements]

In [None]:
# Extract the hrefs again, excluding links to category pages ("Cat%C3%A9gorie" is the URL-encoded form of "Catégorie")
hrefs = [element.get_attribute("href") for element in elements if "Cat%C3%A9gorie" not in element.get_attribute("href")]
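
The filter in the previous cell tests the percent-encoded form of the URL. As a minimal sketch of an alternative, assuming we would rather compare decoded page names, the links can be decoded with the standard library first. The helper name `is_puzzle_link` and the sample list are hypothetical and only serve the illustration.

```python
from urllib.parse import unquote, urlparse

def is_puzzle_link(href):
    """Return True when href points to a regular page rather than a category page."""
    if not href:  # get_attribute("href") can return None
        return False
    # Decode percent-escapes, e.g. 'Cat%C3%A9gorie' -> 'Catégorie'
    path = unquote(urlparse(href).path)
    page_name = path.rsplit("/", 1)[-1]
    return not page_name.startswith("Catégorie:")

# Hypothetical sample, shaped like the links collected from the category page
sample = [
    "https://professeur-layton.fandom.com/fr/wiki/Chamailleries",
    "https://professeur-layton.fandom.com/fr/wiki/Cat%C3%A9gorie:%C3%89nigmes",
    None,
]
puzzle_links = [h for h in sample if is_puzzle_link(h)]
```

Unlike the substring test, this version only rejects links whose page name starts with `Catégorie:`, and it tolerates elements that have no `href` at all.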

We randomly pick 3 links (to keep the run short and to avoid overloading the wiki with requests):

In [22]:
# Randomly select 3 hrefs
hrefs = random.sample(hrefs, 3)

In [23]:
hrefs

['https://professeur-layton.fandom.com/fr/wiki/Chamailleries',
 'https://professeur-layton.fandom.com/fr/wiki/Paf_le_chien_!',
 'https://professeur-layton.fandom.com/fr/wiki/La_travers%C3%A9e_(1)']
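
Because the selection is random, each run of the notebook scrapes different pages, and `random.sample` raises a `ValueError` if it is asked for more links than were found. The sketch below shows one possible way to make the selection repeatable and safe. The helper `pick_links`, the seed value, and the placeholder URLs are assumptions made for the example.

```python
import random

def pick_links(hrefs, k, seed=None):
    """Pick at most k distinct links; a fixed seed makes the choice repeatable."""
    rng = random.Random(seed)   # independent generator, leaves the global state alone
    k = min(k, len(hrefs))      # random.sample raises ValueError if k > len(hrefs)
    return rng.sample(hrefs, k)

# Placeholder URLs standing in for the scraped hrefs
links = [f"https://example.org/wiki/Puzzle_{i}" for i in range(10)]
first = pick_links(links, 3, seed=42)
second = pick_links(links, 3, seed=42)  # same seed, same selection
```

Passing `seed=None` keeps the original behaviour of a different selection on every run.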

We will then create a function which, for each link, retrieves the data we are interested in :

In [26]:
from selenium.common.exceptions import NoSuchElementException

def collecte_enigme(href):
    url = href
    
    # Browser configuration (Chrome in this example)
    options = Options()
    options.add_argument("--headless=new")  # Run Chrome without a visible window
    options.add_argument('--disable-search-engine-choice-screen')
    options.add_argument('--disable-infobars')
    driver = webdriver.Chrome(options=options)
    
    # Open the web page
    driver.get(url)
    time.sleep(3)
    try:
        cookie = driver.find_element(By.XPATH ,'/html/body/div[7]/div/div/div[2]/div[2]/div[1]')
        cookie.click()
    except :
        pass
    try : 
        # Title recovery :
        title = driver.find_element(By.ID, 'firstHeading')
        title = title.text
    except : 
        title = None
    try:
        time.sleep(4)
        num_enigme = driver.find_element(By.XPATH ,'//*[@id="mw-content-text"]/div/div[1]/div[2]/table/tbody/tr[4]/td')
        num_enigme = num_enigme.text
       
    except :
        num_enigme = None
    try:
        
        # Image recovery
        image_element = driver.find_element(By.CSS_SELECTOR, 'div.floatnone img')
        image = image_element.get_attribute("src")
    
    except:
        image=None
    
    # Retrieve the statement of the riddle :
    enigme_enonce = None
    try : 
        enonce = driver.find_element(By.ID, "Énoncé")
        enigme_enonce = enonce.find_elements(By.XPATH, "//span[@class='mw-headline' and @id='Énoncé']/ancestor::h2/following-sibling::p[following-sibling::h2]")
        enigme_enonce = "".join([elem.text for elem in enigme_enonce])
    except NoSuchElementException:
        enigme_enonce = image
        
    resolution_title = None
    
    try:
        resolution_title = driver.find_element(By.ID, "Résolution")
        resolution = resolution_title.find_elements(By.XPATH, "//span[@class='mw-headline' and @id='Résolution']/ancestor::h3/following-sibling::p[following-sibling::h3]")
        resolution = "\n".join([elem.text for elem in resolution])
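    except Exception:
        resolution = None  # Default when no solution can be found
        try:
            solution = driver.find_element(By.ID, "Solution")
        except NoSuchElementException:
            solution = None  # The page has no "Solution" heading
        if solution is not None:
            try:
                # Locate the first slide of the slideshow on the page
                slideshow_div = driver.find_element(By.XPATH, '//*[@id="slideshow-0"]/div/div[1]')
                # Locate the specific image within this slideshow
                image_element = slideshow_div.find_element(By.XPATH, './/img[@class="thumbimage"]')
                resolution = image_element.get_attribute('src')
            except NoSuchElementException:
                # No slideshow image: fall back to the paragraphs after the heading
                text_elements = solution.find_elements(By.XPATH, 'following::p')
                resolution = "\n".join([elem.text for elem in text_elements])

    # Close this page's browser so that Chrome processes do not accumulate
    driver.quit()
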
    # Creating the DataFrame
    return pd.DataFrame([{'title': title, 'num_enigme':num_enigme, 'url': url, 'image': image, 'enigme': enigme_enonce, 'solution': resolution}])
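
The function repeats the same `try`/`except` pattern for each field. One possible way to shorten it, offered as a sketch and not as the method used in this notebook, is a small helper that runs a lookup and returns a default value when the lookup fails. The name `safe` and the `page` dictionary are hypothetical. The dictionary merely stands in for a web page so that the example runs without a browser.

```python
def safe(getter, default=None):
    """Call getter() and return its result, or default if it raises."""
    try:
        return getter()
    except Exception:  # deliberately broad: any lookup failure yields the default
        return default

# Stand-in for a page; with Selenium the getter could be, for example,
# lambda: driver.find_element(By.ID, "firstHeading").text
page = {"title": "Chamailleries"}

title = safe(lambda: page["title"])
num_enigme = safe(lambda: page["num_enigme"])            # missing key, so None
image = safe(lambda: page["image"], default="no image")  # custom default
```

Each field then takes a single line, and the fallback value is visible at the call site.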


We then use this function to create our DataFrame :

In [25]:
df_final = pd.DataFrame()
for href in hrefs:
    print("Scraping page", href, "in progress...")
    df_final = pd.concat([df_final, collecte_enigme(href)], ignore_index=True)
print("Scraping finished")

Scraping page https://professeur-layton.fandom.com/fr/wiki/Chamailleries in progress...
Scraping page https://professeur-layton.fandom.com/fr/wiki/Paf_le_chien_! in progress...
Scraping page https://professeur-layton.fandom.com/fr/wiki/La_travers%C3%A9e_(1) in progress...
Scraping finished


In [27]:
pd.set_option('display.max_rows', 5)
pd.set_option('display.max_columns', 6)
df_final

| | title | num_enigme | url | image | enigme | solution |
|---|---|---|---|---|---|---|
| 0 | Chamailleries | 27 | https://professeur-layton.fandom.com/fr/wiki/C... | https://static.wikia.nocookie.net/layton/image... | Six frères se réunissent autour d'une table po... | Beau travail !\nPlacez les garçons comme le mo... |
| 1 | Paf le chien ! | 9 | https://professeur-layton.fandom.com/fr/wiki/P... | https://static.wikia.nocookie.net/layton/image... | Les allumettes ci-dessous représentent un chie... | La voiture a aplati le pauvre chien ! Il faut ... |
| 2 | La traversée (1) | 7 | https://professeur-layton.fandom.com/fr/wiki/L... | https://static.wikia.nocookie.net/layton/image... | Vous devez amener les trois loups et les trois... | |


In [28]:
# Close the browser opened at the start of the activity
driver.quit()

## Activity 3 BONUS: Generalising to several puzzles with BeautifulSoup

In [None]:
from bs4 import BeautifulSoup, Tag
from typing import List
import requests
import random
import pandas as pd

#### Send a request to retrieve the full source code of the URL to be scraped

In [29]:
url = "https://professeur-layton.fandom.com/fr/wiki/Cat%C3%A9gorie:%C3%89nigmes"
data  = requests.get(url).text
soup = BeautifulSoup(data,"html5lib")

#### Display page source code

In [30]:
#print(soup.prettify())

#### Collect all puzzle links and take only 5 at random

In [31]:
# Find all <a> elements with the "category-page__member-link" class
links = soup.find_all('a', class_='category-page__member-link')

# Extract the href attributes
hrefs = [link.get('href') for link in links]

# Display the hrefs
#print(hrefs)
#print(len(hrefs))

# Randomly select 5 hrefs
hrefs = random.sample(hrefs, 5)

# Display the randomly selected hrefs
#print(hrefs)
#print(len(hrefs))

#### Let's create a function to repeat the data extraction steps

In [32]:
def collecte_enigme(racine, href):
    url = racine+href
    data  = requests.get(url).text
    soup = BeautifulSoup(data,"html.parser")
  
    # Retrieve the title
    title = soup.find('meta', attrs={'property': "og:title"})
    title = title.get("content")

    # Retrieve the URL
    url_enigme = soup.find('meta', attrs={'property': "og:url"})
    url_enigme = url_enigme.get("content")

    # Retrieve the image: heuristic that keeps the longest src found on the page
    src_tags = soup.find_all(src=True)
    # Extract the src values
    src_urls = [tag['src'] for tag in src_tags]
    longest_src = max(src_urls, key=len) if src_urls else None
    image = longest_src

    # Retrieve the puzzle statement
    try:
        enigme_title = soup.find('span', {'class': 'mw-headline', 'id': 'Énoncé'})
        # Find the paragraph that follows the "Énoncé" section heading
        enigme_paragraph = enigme_title.find_next('p')
        # Extract the text of the paragraph
        enigme = enigme_paragraph.get_text(strip=True)
        
    except AttributeError:
        # find() or find_next() returned None: fall back to the image URL
        enigme = image
        
    # Retrieve the answer
    try:
        
        reponse_title = soup.find('span', {'class': 'mw-headline', 'id': 'Solution'})
        reponse_paragraph = reponse_title.find_next('p')
        reponse = reponse_paragraph.get_text(strip=True)
    except AttributeError:
        # No text solution: fall back to the first linked image, if any
        a_tag = soup.find('a', class_='image')
        if a_tag:
            reponse = a_tag.get('href')
        else:
            reponse = "The answer could not be loaded correctly"

    # Retrieve the hints ("indices" in French)
    tabs = soup.select('ul.wds-tabs li.wds-tabs__tab a')
    contents = soup.select('div.wds-tab__content')
    # Extract the content of each hint into a list
    indices = []
    for i, tab in enumerate(tabs):
        if i < len(contents):
            content_divs = contents[i].select('div[style*="overflow-y:auto"]')
            if content_divs:
                content = content_divs[0].get_text(strip=True)
                indices.append(content)
        
    # Append the collected data as a dictionary
    data = []
    data.append({'title': title, 'url': url_enigme, 'image': image, 'enigme': enigme, 'indices': indices,'solution': reponse})

    
    df = pd.DataFrame(data)
    return df
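
The function builds the page address with `racine+href`, which assumes that every `href` is a relative path. A minimal sketch of a more tolerant alternative uses `urljoin` from the standard library, which also handles links that are already absolute. The sample paths below are only illustrations.

```python
from urllib.parse import urljoin

racine = "https://professeur-layton.fandom.com"

# Relative href, as found in the HTML of the category page
relative = urljoin(racine, "/fr/wiki/Chamailleries")

# An href that is already absolute is returned unchanged
absolute = urljoin(racine, "https://professeur-layton.fandom.com/fr/wiki/Chamailleries")
```

Both calls produce the same address, whereas plain string concatenation would duplicate the domain in the second case.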

#### Using the function and storing the results in a DataFrame

In [33]:
racine = "https://professeur-layton.fandom.com"
df_final = pd.DataFrame()
for href in hrefs:
    #print(href)
    df_final = pd.concat([df_final, collecte_enigme(racine,href)], ignore_index=True)
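
Calling `pd.concat` inside the loop copies the growing DataFrame at every iteration. This is harmless for 5 pages, but for a longer list of links a common alternative is to collect the one-row frames in a list and concatenate once at the end. The sketch below illustrates the idea with `fake_collect`, a hypothetical stand-in for `collecte_enigme` that makes no network request.

```python
import pandas as pd

def fake_collect(href):
    """Hypothetical stand-in for collecte_enigme: returns a one-row DataFrame."""
    return pd.DataFrame([{"title": href.rsplit("/", 1)[-1], "url": href}])

hrefs_demo = ["/fr/wiki/Chamailleries", "/fr/wiki/Tas_de_feuilles"]

# Collect the one-row frames in a list, then concatenate once at the end
frames = [fake_collect(h) for h in hrefs_demo]
df_demo = pd.concat(frames, ignore_index=True)
```

The resulting DataFrame has the same content as the one built incrementally, with a clean 0..n-1 index thanks to `ignore_index=True`.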

#### Visualisation

In [34]:
pd.set_option('display.max_rows', 7)
pd.set_option('display.max_columns', 6)
df_final

| | title | url | image | enigme | indices | solution |
|---|---|---|---|---|---|---|
| 0 | Dédale numérique | https://professeur-layton.fandom.com/fr/wiki/D... | https://static.wikia.nocookie.net/layton/image... | Essayez de sortir de ce labyrinthe ! Commencez... | [Si vous essayez tous les itinéraires possible... | Par ici la sortie. |
| 1 | Tas de feuilles | https://professeur-layton.fandom.com/fr/wiki/T... | https://static.wikia.nocookie.net/layton/image... | Plusieurs feuilles de calque ont été superposé... | [Trois couches ici, quatre couches là...Marque... | La réponse est 5. |
| 2 | Tenir l'affiche | https://professeur-layton.fandom.com/fr/wiki/T... | https://static.wikia.nocookie.net/layton/image... | Alors que Benny s'appliquait à placarder des a... | [Il est plus simple d'éliminer les affiches di... | Il suffit de cocher l'affiche B. |
| 3 | Inéquations ? | https://professeur-layton.fandom.com/fr/wiki/I... | https://static.wikia.nocookie.net/layton/image... | Eh bien, on dirait que quelqu'un a encore écri... | [Au premier coup d'œil, il semble que l'auteur... | La réponse est 1. |
| 4 | Alchimie en folie 01 | https://professeur-layton.fandom.com/fr/wiki/A... | https://static.wikia.nocookie.net/layton/image... | https://static.wikia.nocookie.net/layton/image... | [] | Solution à l'énigme. |


# Conclusion

We have finally succeeded in building a DataFrame from a website!
The key point to remember is that BeautifulSoup is well suited to so-called "open" sites (pages whose content can be fetched with a plain HTTP request, without driving a browser) and to repeating the same extraction over a large number of pages.