# Webscraping & Applied ML 
Sarujan DENSON <br>
Yahya EL OUDOUNI <br>
Mohamed Houssem REZGUI <br>
DIA 2

## Project Discovering Paris Gastronomy: Connecting Hotels to Fine Dining Experiences 

# Scraping TheFork

### 1st step: Install Selenium, BeautifulSoup and other libraries to process HTML files

In [1]:
!pip install -U selenium



In [2]:
!pip install lxml



In [3]:
!pip list

Package                       Version
----------------------------- ---------------
absl-py                       2.0.0
aiobotocore                   2.4.2
aiofiles                      22.1.0
aiohttp                       3.8.3
aioitertools                  0.7.1
aiosignal                     1.2.0
aiosqlite                     0.18.0
alabaster                     0.7.12
altair                        5.5.0
anaconda-catalogs             0.2.0
anaconda-client               1.12.0
anaconda-navigator            2.4.2
anaconda-project              0.11.1
anyio                         3.5.0
appdirs                       1.4.4
argon2-cffi                   21.3.0
argon2-cffi-bindings          21.2.0
arrow                         1.2.3
astroid                       2.14.2
astropy                       5.1
asttokens                     2.0.5
astunparse                    1.6.3
async-timeout                 4.0.2
atomicwrites                  1.4.0
attrs                         22.1.0
Automat  

In [4]:
!pip install beautifulsoup4
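
Scanning the full `pip list` output for a handful of packages is tedious. As a hedged alternative, the sketch below queries only the packages this notebook relies on, using the standard-library `importlib.metadata`. The helper name `installed_version` and the tuple of package names are illustrative choices, not part of the original notebook.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package_name):
    """Return the installed version of a package, or None if it is not installed."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


# Distribution names as they are passed to pip in the cells above
for name in ("selenium", "beautifulsoup4", "lxml", "pandas"):
    print(f"{name}: {installed_version(name) or 'not installed'}")
```

A `not installed` line points to the `pip install` cell that still needs to be run.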



### 2nd step: Import the Selenium webdriver and the other tools needed to navigate a website automatically and to scrape and collect the data

In [5]:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

from bs4 import BeautifulSoup
import pandas as pd
import csv
import time
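
`WebDriverWait`, imported above, implements an explicit wait: it polls a condition until the condition holds or a timeout elapses, instead of pausing for a fixed duration as `time.sleep` does. The plain-Python sketch below only illustrates that polling idea; it is not Selenium's implementation. The names `wait_until` and `ready_after_three_calls` are hypothetical, and returning `None` on timeout (rather than raising an exception) is a simplification chosen for this example.

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call `condition` repeatedly until it returns a truthy value or `timeout` seconds pass.

    Returns the truthy value, or None if the timeout is reached first.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            return None
        time.sleep(poll_interval)


# Toy condition that only becomes true on its third call
calls = {"count": 0}


def ready_after_three_calls():
    calls["count"] += 1
    return "ready" if calls["count"] >= 3 else None


print(wait_until(ready_after_three_calls, timeout=2.0, poll_interval=0.01))
```

The benefit over a fixed sleep is that the wait ends as soon as the condition is met, so fast pages are not slowed down and slow pages still get the full timeout.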

### 3rd step: Navigate to the TheFork page that lists restaurants in Paris

In [6]:
driver = webdriver.Edge()
driver.get("https://www.thefork.fr/restaurants/paris-c415144?cc=16770-aa2&gad_source=1&gclid=Cj0KCQiAgJa6BhCOARIsAMiL7V8adddcwz-ff6NvVzD9fDwfI6rvHJHkf5jw3y7zLNlxV3dVTG0P-kEaAoQ7EALw_wcB&p=2")
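
The URL above carries advertising parameters (`cc`, `gad_source`, `gclid`) next to the `p` parameter, which presumably selects the results page. The sketch below shows one way to keep only the parameters of interest with the standard-library `urllib.parse`. It assumes, without having verified it, that the site serves the same listing without the advertising parameters; the helper name `keep_query_params` is hypothetical and the `gclid` value is shortened for readability.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def keep_query_params(url, keep=("p",)):
    """Return `url` with every query parameter removed except those named in `keep`."""
    parts = urlsplit(url)
    kept = [(key, value) for key, value in parse_qsl(parts.query) if key in keep]
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), parts.fragment))


listing_url = (
    "https://www.thefork.fr/restaurants/paris-c415144"
    "?cc=16770-aa2&gad_source=1&gclid=EXAMPLE&p=2"  # gclid shortened for this example
)
print(keep_query_params(listing_url))
# → https://www.thefork.fr/restaurants/paris-c415144?p=2
```

A shorter URL also makes the starting page explicit, which helps when checking which pages the scraping loop actually covers.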

### 4th step: We use Selenium and BeautifulSoup together to extract only the restaurant information from the page

In [7]:
# We use Selenium to select the page elements that hold the restaurant information, using a CSS selector
contact_experience_elements = driver.find_elements(By.CSS_SELECTOR, 'div.css-1rbtt4s.e1thb4we1')

# We use BeautifulSoup to parse the inner HTML of each element selected with Selenium above
for element in contact_experience_elements:
    section = BeautifulSoup(element.get_attribute("innerHTML"), "html.parser")
    print(section) 


<div class="css-ads8kf e1lxtbcs2" data-testid="list-header"><div class="css-q8d1uu euejy81"><button class="euejy80 ejdfy9v0 css-c3s9ii ektx8jp0"><svg aria-hidden="true" class="css-1vyst8h esjta1q0" focusable="false" height="24" mr="s" viewbox="0 0 24 24" width="24" xmlns="http://www.w3.org/2000/svg"><path d="m9 5.758 6-2.4v14.884l-6 2.4V5.759Zm13.086-1.18-6-3c-.038-.017-.08-.012-.12-.024a.7.7 0 0 0-.196-.042c-.058-.001-.111.013-.168.024-.043.009-.088 0-.13.019L8.285 4.429l-5.699-2.85A.752.752 0 0 0 1.5 2.25v16.5c0 .284.161.543.414.67l6 3a.753.753 0 0 0 .614.025l7.187-2.874 5.699 2.85a.753.753 0 0 0 .731-.032.755.755 0 0 0 .355-.64V5.25a.753.753 0 0 0-.414-.67Z" fill-rule="evenodd"></path></svg><span>Voir sur la carte</span></button><img alt="" src="/_next/static/media/map-preview.ea7d1efd.svg"/></div><div class="css-1ynpgmc e1lxtbcs1"><div class="css-p4va31 elkhwc30" style="cursor:pointer"><div class="css-18aid14 elkhwc30"><svg aria-hidden="true" class="css-1vyst8h e1o8s7z70" focusable

### 5th step: Scrape the restaurant data (name of the restaurant, link to its detail page, address, price, type of restaurant, overall rating and number of reviews). In total, we scraped 100 restaurants.

In [8]:
# List to collect data of restaurants
restaurants_data=[]

try:
    # Loop over 4 pages: each page lists 25 restaurants, so we click the "next" button to move from one page to the next
    for page_num in range(4):
        # Scroll down 100 pixels to reach the first restaurant, at the top of the list
        driver.execute_script("window.scrollBy(0, 100);")
        time.sleep(2)

        # Then scroll 310 pixels 24 more times (one step per remaining restaurant) to get the information of all 25 restaurants on the page before clicking "next"
        for _ in range(24):
            driver.execute_script("window.scrollBy(0, 310);")
            time.sleep(2)

        # We get the HTML content of the restaurants (just like before)
        page_source=driver.page_source
        soup=BeautifulSoup(page_source,"html.parser")
        section=soup.find("div",{"data-test":"result-list-restaurants"})

        # We scrape all information about each restaurant
        
        titles=[restaurant.text.strip() for restaurant in section.find_all('a',{"class": "css-r0c0pd"})] # Name of the restaurant
        
        links = [f"https://www.thefork.fr{link['href']}" if 'href' in link.attrs else None    # Link to the specific restaurant
            for link in section.find_all('a', {"class": "css-r0c0pd"})]
        
        addresses=[' '.join(address.get_text(separator=" ").strip().split())   
            for address in section.find_all('p',{"data-test":"search-restaurant-address"})]    # the address of the restaurant
        
        prices=[price.find_all("span")[-1].text.strip()
            for price in section.find_all('p',{"class":"css-zju5h4"})] # Average price
        
        types=[type_.text.strip() for type_ in section.find_all('span',{"class":"css-1a3lcq9"})]  # Type of the restaurant
        
        marks=[mark.text.split()[0] for mark in section.find_all('span',{"class":"css-13xokbo"})]    # General rating of the restaurant
        
        reviews=[review.text.split()[0].replace(" ","")
            for review in section.find_all('span',{"class": "css-vq1r47"})]    # The number of reviews for a restaurant
        
        # Append all data collected to the list
        for i in range(len(titles)):
            data={
                "Title": titles[i] if i<len(titles) else None,
                "Link": links[i] if i<len(links) else None,
                "Address": addresses[i] if i< len(addresses) else None,
                "Price": prices[i] if i<len(prices) else None,
                "Type_of_Restaurant": types[i] if i<len(types) else None,
                "Mark": marks[i] if i<len(marks) else None,
                "Number_of_Reviews": reviews[i] if i<len(reviews) else None}
            restaurants_data.append(data)

        # Click the "next" button to move to the following page and scrape the next 25 restaurants
        try:
            next_button=driver.find_element(By.CSS_SELECTOR,'button[data-testid="pagination-next-button"]')
            next_button.click()
            time.sleep(5)
        except Exception as e:
            print(f"Error while clicking the 'next' button on page {page_num + 1}: {e}")
            break

except Exception as e:
    print(f"An error occurred: {e}")

# We create a DataFrame to store the data
df=pd.DataFrame(restaurants_data)

# Export the data as a CSV file in a local folder
output_path="C:/A5 ESILV/Webscraping & Applied ML/Projet/CSV_files/the_fork/all_restaurants_thefork_paris.csv"
df.to_csv(output_path,index=False,quoting=csv.QUOTE_ALL)

print(f"Scraping finished. Data of {len(restaurants_data)} restaurants saved to '{output_path}'.")

Scraping finished. Data of 100 restaurants saved to 'C:/A5 ESILV/Webscraping & Applied ML/Projet/CSV_files/the_fork/all_restaurants_thefork_paris.csv'.
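
The loop above pairs the seven lists by index. If one restaurant on the page lacks a field, for instance a rating, the corresponding list is one element short: every later value moves up by one row, and the `None` only shows up on the last row. The sketch below is a minimal check, under the assumption that the restaurant names are the reference, that compares list lengths before the rows are built. The helper name `column_length_mismatches` and the sample values are invented for illustration.

```python
def column_length_mismatches(columns, reference="Title"):
    """Return {column_name: length} for every column whose length differs from the reference column."""
    expected = len(columns[reference])
    return {name: len(values) for name, values in columns.items() if len(values) != expected}


# Invented sample: one restaurant has no rating, so "Mark" is one element short
scraped = {
    "Title": ["Restaurant A", "Restaurant B", "Restaurant C"],
    "Price": ["30 €", "45 €", "25 €"],
    "Mark": ["9.1", "8.7"],
}

mismatches = column_length_mismatches(scraped)
if mismatches:
    print(f"Lists not aligned with 'Title' ({len(scraped['Title'])} items): {mismatches}")
```

When a mismatch is reported, a more robust option is to loop over each restaurant card and look up the fields inside that card, so that a missing field yields `None` on the correct row.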


### 6th step: Using the link scraped for each restaurant, we now scrape its detail page (image of the restaurant, ambiance rating, food rating, service rating, description, menu and reviews). After scraping these data, we add them to the data collected in the previous step.

In [13]:
# Configure the Chrome driver
options=webdriver.ChromeOptions()
options.add_argument("--headless")  # Headless mode (no browser window)
options.add_argument("--disable-gpu")
options.add_argument("--no-sandbox")
options.add_argument("--incognito")  # Incognito mode, to limit stored cookies
driver=webdriver.Chrome(options=options)

# Paths of the input and output files
input_path="C:/A5 ESILV/Webscraping & Applied ML/Projet/CSV_files/the_fork/all_restaurants_thefork_paris.csv"
output_path="C:/A5 ESILV/Webscraping & Applied ML/Projet/CSV_files/the_fork/all_restaurants_thefork_paris_detailed.csv"

# Load the input file
df=pd.read_csv(input_path)

# Columns to add to the data collected in the previous step
additional_columns=["Image","Ambiance","Plats","Service","Description","Menu","Avis"]
for col in additional_columns:
    if col not in df.columns:
        df[col]=None

# Helper that waits up to `timeout` seconds (10 by default) for an element to be present; returns None if it never appears
def wait_for_element(driver,by,value,timeout=10):
    
    try:
        return WebDriverWait(driver,timeout).until(EC.presence_of_element_located((by,value)))
    except Exception:
        return None

# We scrape specific information of each restaurant
def scrape_restaurant_details(driver,link):
    
    driver.get(link)
    time.sleep(3) 

    details={"Image": None,"Ambiance": None,"Plats": None,"Service": None,"Description": None,"Menu": None,"Avis": None,}

    try:
        image_element=wait_for_element(driver, By.CSS_SELECTOR, 'img[alt]')  # URL of the restaurant image (first <img> with an alt attribute)
        if image_element:
            details["Image"]=image_element.get_attribute("src") 

        ambiance_element=wait_for_element(driver, By.CSS_SELECTOR, '[data-testid="ambience-value"]')  # Ambiance rating of the restaurant
        if ambiance_element:
            details["Ambiance"]=ambiance_element.text

        plats_element=wait_for_element(driver, By.CSS_SELECTOR, '[data-testid="food-value"]') # Rating on meals
        if plats_element:
            details["Plats"]=plats_element.text

        service_element=wait_for_element(driver,By.CSS_SELECTOR,'[data-testid="service-value"]')  # Rating on service of the restaurant
        if service_element:
            details["Service"]=service_element.text

        description_element=wait_for_element(driver,By.CSS_SELECTOR,'.etzvdt2')  # Description of the restaurant
        if description_element:
            details["Description"]=description_element.text.strip()
            
        try:
            menu_button=wait_for_element(driver,By.CSS_SELECTOR,'button[data-test="navigation-bar-button-restaurantMenus"]') 
            if menu_button:
                menu_button.click()
                time.sleep(3)
                menu_items=driver.find_elements(By.CSS_SELECTOR,'dt.eqhwq454') # The dishes (menu) of the restaurant
                details["Menu"]=', '.join([item.text for item in menu_items])
        except Exception:
            pass

        try:
            reviews_button=wait_for_element(driver,By.CSS_SELECTOR,'button[data-test="navigation-bar-button-restaurantReviews"]') # 10 reviews of the restaurant
            if reviews_button:
                reviews_button.click()
                time.sleep(3)

                all_reviews=[]
                for _ in range(10):  # Bounded number of passes, so the loop always terminates
                    reviews_elements=driver.find_elements(By.CSS_SELECTOR,'div.css-1hdrxx1')
                    # Rebuild the list on each pass: extending it would add the reviews already read a second time
                    all_reviews=[review.text for review in reviews_elements]

                    if len(all_reviews)>=10:
                        break

                    try:
                        more_button=wait_for_element(driver,By.CSS_SELECTOR,'button.css-1wduqg7') # Click the "Lire la suite" button, if present, to display the full text of a review
                        if more_button:
                            more_button.click()
                            time.sleep(2)
                        else:
                            break
                    except Exception:
                        break
                details["Avis"]=' '.join(all_reviews[:10])  # Limit to 10 reviews
        except Exception:
            pass

    except Exception as e:
        print(f"Error while collecting data: {e}")

    return details

# Loop to scrape data of all restaurants
try:
    for index, row in df.iterrows():
        link=row["Link"]
        if pd.isna(link) or pd.notna(df.at[index,"Description"]):  
            continue

        print(f"Processing restaurant {index}: {link}")
        details=scrape_restaurant_details(driver,link)

        # Update the DataFrame row with the scraped details
        for col in additional_columns:
            df.at[index,col]=details[col]

        # Save the whole DataFrame to the output file after each restaurant
        df.to_csv(output_path,index=False,quoting=csv.QUOTE_ALL)

finally:
    driver.quit()

print(f"Restaurant details updated in {output_path}")


Processing restaurant 0: https://www.thefork.fr/restaurant/asahi-r47875#rankedBy=SEARCH_ENGINE
Processing restaurant 1: https://www.thefork.fr/restaurant/le-fil-rouge-cafe-r11104#rankedBy=SEARCH_ENGINE
Processing restaurant 2: https://www.thefork.fr/restaurant/feyrouz-r753214#rankedBy=SEARCH_ENGINE
Processing restaurant 3: https://www.thefork.fr/restaurant/galia-maxim-godigna-r72314#rankedBy=SEARCH_ENGINE
Processing restaurant 4: https://www.thefork.fr/restaurant/la-table-de-colette-r570025#rankedBy=SEARCH_ENGINE
Processing restaurant 5: https://www.thefork.fr/restaurant/elsass-r808576#rankedBy=SEARCH_ENGINE
Processing restaurant 6: https://www.thefork.fr/restaurant/au-bouquet-saint-paul-r815390#rankedBy=SEARCH_ENGINE
Processing restaurant 7: https://www.thefork.fr/restaurant/la-table-des-anges-r394679#rankedBy=SEARCH_ENGINE
Processing restaurant 8: https://www.thefork.fr/restaurant/saveurs-de-tokyo-r364243#rankedBy=SEARCH_ENGINE
Processing restaurant 9: h
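
The loop skips rows whose `Description` is already filled, but `df` is always loaded from `input_path`, where that column does not exist yet, so a restarted run starts again from the first restaurant. The sketch below shows one possible way to resume: load the detailed file when a previous run has already written it. The helper name `choose_resume_file` and the temporary file names are hypothetical and only serve the demonstration.

```python
import tempfile
from pathlib import Path


def choose_resume_file(input_path, output_path):
    """Return output_path if a previous run already wrote it, otherwise input_path."""
    return output_path if Path(output_path).exists() else input_path


# Demonstration with temporary files (hypothetical file names)
with tempfile.TemporaryDirectory() as folder:
    base_csv = Path(folder) / "restaurants.csv"
    detailed_csv = Path(folder) / "restaurants_detailed.csv"
    base_csv.write_text('"Title","Link"\n', encoding="utf-8")

    first_choice = choose_resume_file(base_csv, detailed_csv)   # no previous run yet
    detailed_csv.write_text('"Title","Link","Description"\n', encoding="utf-8")
    second_choice = choose_resume_file(base_csv, detailed_csv)  # a previous run exists

    print(first_choice.name, second_choice.name)
```

In the notebook, this would amount to `df=pd.read_csv(choose_resume_file(input_path,output_path))`. One caveat: a restaurant whose page has no description keeps an empty `Description`, so it would still be retried on every run.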