<h1> <b> Michelin Restaurants in Italy: Web Scraping and Analysis </b> <img src="https://styles.redditmedia.com/t5_2s8x6/styles/communityIcon_ftfh5okvyxj01.png" width=100 style="vertical-align: middle"> </h1>

In today's world, people are more eager than ever to discover culinary experiences that are both unique and unforgettable. The culinary arts have evolved from a necessity to a refined skill, with passionate chefs pushing the boundaries of creativity and flavor. For food lovers, travelers, and students of gastronomy, resources like the Michelin Guide are invaluable. This platform, accessible via the <a href="https://guide.michelin.com/en/it/restaurants" > Michelin Guide website</a>, provides information, reviews, and ratings for restaurants throughout Italy and beyond, recognized for their quality, innovation, and exceptional culinary experiences.

Our team is building a search engine tailored for food lovers, helping users discover and rank Michelin-starred restaurants across Italy based on their own preferences. Our mission is to create an efficient and intuitive tool for exploring the best culinary experiences in Italy.

<h1> <b> Import Libraries </b> <img src="https://preview.redd.it/snoovatar/avatars/nftv2_bmZ0X2VpcDE1NToxMzdfZWI5NTlhNzE1ZGZmZmU2ZjgyZjQ2MDU1MzM5ODJjNDg1OWNiMTRmZV8yMTQ1NzYzNg_rare_46f1cdb1-634f-4c1d-8344-2be06c7880d4-headshot.png?width=256&height=256&crop=smart&auto=webp&s=400ead9440c7a9f06ca4c44953f24c5b765c4aac" width=120 style="vertical-align: middle"> </h1>

In [31]:
from bs4 import BeautifulSoup
import requests
import os
import pandas as pd
import re
import time
from tqdm import tqdm

# 1. Data collection

In [3]:
cnt = requests.get("https://guide.michelin.com/en/it/restaurants")

In [37]:
soup = BeautifulSoup(cnt.content, features="lxml")
soup.prettify()[:1000]

'<!DOCTYPE html>\n<html class="full-screen-mobile" dir="" lang="en-US">\n <head>\n  <meta charset="utf-8"/>\n  <meta content="width=device-width, initial-scale=1.0, user-scalable=0" name="viewport"/>\n  <meta content="" name="author"/>\n  <meta content="#fff" name="theme-color"/>\n  <meta content="MICHELIN Guide" property="og:site_name"/>\n  <meta content="Italy MICHELIN Restaurants – The MICHELIN Guide" itemprop="name"/>\n  <meta content="Italy MICHELIN Restaurants – The MICHELIN Guide" property="og:title"/>\n  <meta content="index, follow" name="robots"/>\n  <meta content="Article" property="og:type"/>\n  <meta content="https://guide.michelin.com/en/it/restaurants" property="og:url"/>\n  <meta content="Starred restaurants, Bib Gourmand and all the Restaurants of The MICHELIN Guide Italy. MICHELIN inspector reviews and insights" name="description"/>\n  <meta content="Starred restaurants, Bib Gourmand and all the Restaurants of The MICHELIN Guide Italy. MICHELIN inspector reviews and i

In [5]:
base_save_path = r"E:\Michelin Restaurant"  # Raw string so the backslash is not read as an escape

## 1.1. Get the list of Michelin restaurants

#### Our task is to collect the **URL** associated with each restaurant

In [9]:
full_base_url = "https://guide.michelin.com"
base_url = "https://guide.michelin.com/en/it/restaurants"
restaurant_links = []
for page_num in tqdm(range(1, 101), desc="Reading pages", unit="page", colour="blue"):
    url = f"{base_url}/page/{page_num}"
    response = requests.get(url)
    soup = BeautifulSoup(response.content, features="lxml")
    # 'link' is the class assumed for restaurant anchors; adjust it if the page markup differs
    links = soup.find_all('a', class_="link")
    # Extract URLs from each link
    for link in links:
        href = link.get('href')
        # Keep only restaurant links; anchors without an href are skipped
        if href and "/restaurant/" in href:
            # Construct the full URL and add it to the list
            restaurant_links.append(full_base_url + href)
    time.sleep(2)  # Pause between requests to avoid overloading the server

Reading pages: 100%|██████████| 100/100 [04:10<00:00,  2.51s/page]
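
The filtering step above can be isolated into a small helper that is easy to check without touching the network. This is only a sketch: the function name `restaurant_urls` and the sample `href` list are illustrative, and the de-duplication is an added precaution in case a listing page repeats a link, not something the original loop does.

```python
from urllib.parse import urljoin


def restaurant_urls(hrefs, site_root="https://guide.michelin.com"):
    """Return absolute restaurant URLs, de-duplicated, in first-seen order."""
    seen = set()
    urls = []
    for href in hrefs:
        # Skip anchors without an href and links that are not restaurant pages
        if not href or "/restaurant/" not in href:
            continue
        absolute = urljoin(site_root, href)
        if absolute not in seen:
            seen.add(absolute)
            urls.append(absolute)
    return urls


# Illustrative input: one restaurant link (twice), a missing href, a pagination link
sample_hrefs = [
    "/en/campania/gragnano/restaurant/o-me-o-il-mare",
    None,
    "/en/it/restaurants/page/2",
    "/en/campania/gragnano/restaurant/o-me-o-il-mare",
]
print(restaurant_urls(sample_hrefs))
```

Inside the scraping loop, the `hrefs` argument would be `[link.get('href') for link in links]`.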


In [12]:
# Let's check the number of restaurants

print(len(restaurant_links))

1982


In [16]:
# Show the first five restaurant links:

print(*restaurant_links[:5],sep="\n")

https://guide.michelin.com/en/campania/gragnano/restaurant/o-me-o-il-mare
https://guide.michelin.com/en/abruzzo/popoli_1845563/restaurant/donevandro
https://guide.michelin.com/en/piemonte/alba/restaurant/ape-vino-e-cucina
https://guide.michelin.com/en/campania/sorrento/restaurant/da-bob-cook-fish
https://guide.michelin.com/en/basilicata/matera/restaurant/da-mo


## 1.2. Crawl Michelin restaurant pages
- ### Download the HTML corresponding to each of the collected URLs.
- ### After collecting each page, immediately save its HTML in a file. 
- ### Organize the downloaded HTML pages into folders. Each folder will contain the HTML of the restaurants from page 1, page 2, ... of the Michelin restaurant list.

In [None]:
base_save_path = r"E:\Michelin Restaurants"  # Raw string so the backslash is not read as an escape

In [17]:
for i in range(0, len(restaurant_links), 20):
    folder_number = (i // 20) + 1
    folder_name = os.path.join(base_save_path, f"page {folder_number}")
    os.makedirs(folder_name, exist_ok=True)
    
    # Extract the group of URLs for this folder
    links_to_download = restaurant_links[i:i + 20]
    
    # Download each URL and save the file
    for idx, url in enumerate(links_to_download):
        filename = os.path.join(folder_name, f"restaurant{i + idx + 1}.html")
        response = requests.get(url)
        
        # Save the restaurant's HTML content to the file
        with open(filename, "w", encoding="utf-8") as file:
            file.write(response.text)

KeyboardInterrupt: 
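
Because the download loop can be interrupted partway through, it helps to be able to restart it without fetching pages that are already on disk. The sketch below is one possible way to do that, not the method used in this notebook. The helper names `pending_downloads` and `download_missing` are hypothetical, and `fetch` stands for any callable that maps a URL to HTML text. The folder and file naming mirrors the loop above.

```python
from pathlib import Path


def pending_downloads(links, base_dir, per_page=20):
    """Yield (url, path) pairs for restaurant pages that are not saved yet."""
    base = Path(base_dir)
    for i, url in enumerate(links):
        # Same layout as the loop above: "page N/restaurantK.html"
        folder = base / f"page {i // per_page + 1}"
        path = folder / f"restaurant{i + 1}.html"
        if not path.exists():
            yield url, path


def download_missing(links, base_dir, fetch, per_page=20):
    """Save every missing page and return how many files were written."""
    saved = 0
    for url, path in pending_downloads(links, base_dir, per_page):
        html = fetch(url)  # `fetch` is supplied by the caller (assumption)
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(html, encoding="utf-8")
        saved += 1
    return saved
```

With `requests`, the call could look like `download_missing(restaurant_links, base_save_path, lambda u: requests.get(u, timeout=30).text)`. Running it a second time only fetches whatever is still missing.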

## 1.3 Parse downloaded pages

### At this point, you should have all the HTML documents for the restaurants of interest, and you can start to extract specific information. The information we want for each restaurant, and its format, is as follows:

- #### **Restaurant Name** (to save as `restaurantName`): string;
- #### **Address** (to save as `address`): string;
- #### **City** (to save as `city`): string;
- #### **Postal Code** (to save as `postalCode`): string;
- #### **Country** (to save as `country`): string;
- #### **Price Range** (to save as `priceRange`): string;
- #### **Cuisine Type** (to save as `cuisineType`): string;
- #### **Description** (to save as `description`): string;
- #### **Facilities and Services** (to save as `facilitiesServices`): list of strings;
- #### **Accepted Credit Cards** (to save as `creditCards`): list of strings;
- #### **Phone Number** (to save as `phoneNumber`): string;
- #### **URL** to the **Restaurant Page** (to save as `website`): string.
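
The field list above can also be written down as a typed record, which makes the expected keys and types explicit. This is an optional sketch: the names `RestaurantRecord` and `empty_record` are illustrative and do not appear elsewhere in this notebook.

```python
from typing import List, Optional, TypedDict


class RestaurantRecord(TypedDict):
    """One row of the dataset, using the field names listed above."""
    restaurantName: Optional[str]
    address: Optional[str]
    city: Optional[str]
    postalCode: Optional[str]
    country: Optional[str]
    priceRange: Optional[str]
    cuisineType: Optional[str]
    description: Optional[str]
    facilitiesServices: List[str]
    creditCards: List[str]
    phoneNumber: Optional[str]
    website: Optional[str]


def empty_record() -> RestaurantRecord:
    """Return a record with every string field unset and both lists empty."""
    return RestaurantRecord(
        restaurantName=None, address=None, city=None, postalCode=None,
        country=None, priceRange=None, cuisineType=None, description=None,
        facilitiesServices=[], creditCards=[], phoneNumber=None, website=None,
    )
```

Each call to `empty_record()` builds fresh lists, so records never share the same `facilitiesServices` or `creditCards` object.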

In [27]:
def extract_restaurant_data(file_path):
    # Load and parse the HTML file
    with open(file_path, "r", encoding="utf-8") as file:
        html_content = file.read()

    soup = BeautifulSoup(html_content, "html.parser")

    # Initialize the restaurant_info dictionary with default values
    restaurant_info = {
        'restaurantName': None,
        'address': None,
        'city': None,
        'postalCode': None,
        'country': None,
        'priceRange': None,
        'cuisineType': None,
        'description': None,
        'facilitiesServices': [],
        'creditCards': [],
        'phoneNumber': None,
        'website': None
    }

    # Extract Restaurant Name
    title_tag = soup.find('h1', class_='data-sheet__title')
    if title_tag:
        restaurant_info['restaurantName'] = title_tag.get_text(strip=True)

    # Extract Address Components
    address_text_tag = soup.find("div", class_="data-sheet__block--text")
    # Guard against pages where the address block is missing
    address_text = address_text_tag.get_text(strip=True) if address_text_tag else ""
    # Remove a locality prefix ("loc. ...") together with everything up to the next comma
    address_text = re.sub(r'\bloc\.\s*[^,]*,', '', address_text)
    if address_text:
        address_parts = [part.strip() for part in address_text.replace('\n', ' ').split(",")]
        restaurant_info["address"] = address_parts[0] if len(address_parts) > 0 else None
        restaurant_info["city"] = address_parts[1] if len(address_parts) > 1 else None
        restaurant_info["postalCode"] = address_parts[2] if len(address_parts) > 2 else None
        restaurant_info["country"] = address_parts[3] if len(address_parts) > 3 else None

    # Extract Price Range and Cuisine Type
    info_divs = soup.find_all("div", class_="data-sheet__block--text")
    if len(info_divs) > 1:
        info_text = info_divs[1].text.strip()
        if '·' in info_text:
            info_parts = info_text.split('·')
            restaurant_info['priceRange'] = info_parts[0].strip()
            if len(info_parts) > 1:
                restaurant_info['cuisineType'] = info_parts[1].strip()

    # Extract Description
    description_tag = soup.select_one('.data-sheet__description')
    if description_tag:
        restaurant_info['description'] = description_tag.get_text(strip=True)

    # Extract Facilities and Services
    facilities_services = [service.get_text(strip=True) for service in soup.select('.restaurant-details__services li')]
    if facilities_services:
        restaurant_info['facilitiesServices'] = facilities_services

    # Extract Accepted Credit Cards as a list of card names
    cards = []
    # Card icons; adjust the selector if the HTML structure differs
    credit_cards = soup.find_all("img", class_="lazy", height="32")
    for img in credit_cards:
        src = img.get('src', '') or img.get('data-src', '')
        if src:  # Skip icons without an image source
            text = src.split("/")[-1]  # File name of the icon
            cards.append(text.split("-")[0].capitalize())  # Name is the part before the first hyphen
    restaurant_info["creditCards"] = cards

    # Extract Phone Number
    phone_span = soup.select_one('.collapse__block-item span')
    if phone_span:
        restaurant_info['phoneNumber'] = phone_span.get_text(strip=True)

    # Extract Website URL
    website_link = soup.select_one('.collapse__block-item.link-item a')
    if website_link:
        restaurant_info['website'] = website_link['href']

    return restaurant_info
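
Two of the string-handling steps in `extract_restaurant_data` can be tried on their own, without any HTML. The sketch below reuses the same regular expression and splitting logic. The helper names `split_address` and `card_name_from_src`, the second address and the icon URL are all illustrative assumptions rather than values taken from the site.

```python
import re


def split_address(address_text):
    """Split 'street, city, postal code, country' into the four address fields."""
    # Drop a locality prefix ("loc. ...") up to and including the next comma
    cleaned = re.sub(r'\bloc\.\s*[^,]*,', '', address_text)
    parts = [part.strip() for part in cleaned.replace("\n", " ").split(",")]
    keys = ("address", "city", "postalCode", "country")
    # Fields with no matching part stay None, as in the function above
    return {key: (parts[i] if i < len(parts) else None) for i, key in enumerate(keys)}


def card_name_from_src(src):
    """Derive a card name from an icon URL: the file name up to the first hyphen."""
    filename = src.split("/")[-1]
    return filename.split("-")[0].capitalize()


print(split_address("Via Roma 45/47, Gragnano, 80054, Italy"))
print(split_address("loc. Esempio, Via Roma 1, Alba, 12051, Italy"))  # made-up address
print(card_name_from_src("https://example.com/icons/mastercard-dark.png"))  # made-up URL
```

The split is purely positional, so an address with fewer than four comma-separated parts leaves the trailing fields as `None`, and one with extra commas shifts the values into the wrong fields.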


In [28]:
def natural_sort_key(s):
    return [int(text) if text.isdigit() else text.lower() for text in re.split(r'(\d+)', s)]
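
The reason for a natural sort key is that plain string sorting puts `page 10` before `page 2`. The short demonstration below repeats the function so that it runs on its own. The folder names are illustrative and follow the `page N` pattern used when saving the HTML files.

```python
import re


def natural_sort_key(s):
    # Split into digit and non-digit runs so numbers compare as integers
    return [int(text) if text.isdigit() else text.lower() for text in re.split(r'(\d+)', s)]


folders = ["page 10", "page 2", "page 1"]

print(sorted(folders))                        # ['page 1', 'page 10', 'page 2']
print(sorted(folders, key=natural_sort_key))  # ['page 1', 'page 2', 'page 10']
```

Sorting this way keeps the rows of the final DataFrame in the same order as the collected links.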

In [29]:
def parse_all_folders(base_folder):

    """ Parse all HTML files in a folder structure and aggregate into a DataFrame."""

    data = []  # List to store all parsed data
    
    # Traverse through each folder and file in the base_folder
    for folder_name in tqdm(sorted(os.listdir(base_folder),key=natural_sort_key),desc="Read folder of each page",unit="page"):
        folder_path = os.path.join(base_folder, folder_name)
        
        # Check if the path is a directory
        if os.path.isdir(folder_path):
            for filename in sorted(os.listdir(folder_path),key=natural_sort_key):
                # Only process HTML files
                if filename.endswith(".html"):
                    file_path = os.path.join(folder_path, filename)
                    # Parse the file and add to data list
                    restaurant_data = extract_restaurant_data(file_path)
                    data.append(restaurant_data)
    
    # Create DataFrame from collected data
    df = pd.DataFrame(data)
    return df

In [32]:
# Usage:
base_save_path = r"E:\Michelin Restaurants"  # Raw string so the backslash is not read as an escape
df = parse_all_folders(base_save_path)

# Save the DataFrame to a CSV file
df.to_csv("all_restaurants_data.csv", index=False)

Read folder of each page: 100%|██████████| 100/100 [03:00<00:00,  1.81s/page]
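
One thing to keep in mind when saving to CSV: list-valued columns such as `facilitiesServices` and `creditCards` are written as text, so they come back as strings when the file is read again. The sketch below shows the effect with the standard `csv` module instead of pandas, so that it runs without extra dependencies. The row is a shortened, illustrative record.

```python
import ast
import csv
import io

row = {"restaurantName": "Donevandro", "creditCards": ["Mastercard", "Visa"]}

# Write one record to an in-memory CSV; the list is stored as its string form
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(row))
writer.writeheader()
writer.writerow(row)

# Read it back: the list column is now plain text
buffer.seek(0)
loaded = next(csv.DictReader(buffer))
print(type(loaded["creditCards"]).__name__)  # str

# ast.literal_eval safely turns the text back into a Python list
cards = ast.literal_eval(loaded["creditCards"])
print(cards)
```

With pandas, the same conversion can be applied while loading, for example through the `converters` argument of `pd.read_csv`, assuming the file is reloaded in a later step.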


In [34]:
df.head()

| | restaurantName | address | city | postalCode | country | priceRange | cuisineType | description | facilitiesServices | creditCards | phoneNumber | website |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | O Me O Il Mare | Via Roma 45/47 | Gragnano | 80054 | Italy | €€€€ | Italian Contemporary, Modern Cuisine | After many years’ experience in Michelin-starr... | [Air conditioning, Interesting wine list, Whee... | [Amex, Dinersclub, Mastercard, Visa] | +39 081 620 0550 | http://omeoilmare.com |
| 1 | Donevandro | via Garibaldi 2 | Popoli | 65026 | Italy | €€ | Contemporary, Seasonal Cuisine | Up until a few years ago, the owner-chef at th... | [Air conditioning] | [Mastercard, Visa] | +39 388 887 6858 | http://www.donevandroristorante.it |
| 2 | Ape Vino e Cucina | Piazza Risorgimento 3 | Alba | 12051 | Italy | €€ | Piedmontese, Contemporary | This attractive restaurant in the heart of Alb... | [Air conditioning, Terrace, Wheelchair access] | [Amex, Dinersclub, Maestrocard, Mastercard, Visa] | +39 0173 363453 | https://www.apewinebar.it/alba/ |
| 3 | Da Bob Cook Fish | largo Parsano vecchio 16 | Sorrento | 80067 | Italy | €€ | Seafood | Working in partnership with the nearby fishmon... | [Air conditioning, Terrace] | [Amex, Dinersclub, Mastercard, Visa] | +39 081 1778 3873 | https://www.dabobcookfish.com/ |
| 4 | DA_MÓ | Via Bruno Buozzi 20 | Matera | 75100 | Italy | €€ | Regional Cuisine, Contemporary | This new, restored restaurant in the upper par... | [Air conditioning, Terrace] | [Amex, Dinersclub, Mastercard, Visa] | +39 0835 686548 | https://www.damoristorante.it/ |


In [35]:
df.tail()


| | restaurantName | address | city | postalCode | country | priceRange | cuisineType | description | facilitiesServices | creditCards | phoneNumber | website |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1978 | Umami | Via Ugo Secondo Partigiano 1 | Badalucco | 18010 | Italy | €€ | Modern Cuisine | A young chef with experience in renowned resta... | [Terrace, Wheelchair access] | [Amex, Mastercard, Visa] | +39 331 338 6005 | https://www.umamirestaurant.it/ |
| 1979 | Visione Restaurant and Living | Strada Nicolini Basso 34 | Barbaresco | 12050 | Italy | €€€ | Contemporary, Piedmontese | At this restaurant, new, young and enthusiasti... | [Air conditioning, Car park] | [Amex, Maestrocard, Mastercard, Visa] | +39 328 134 0218 | https://www.ristorantevisione.it |
| 1980 | Ristorante de LEN | Via Cesare Battisti 66 | Cortina d'Ampezzo | 32043 | Italy | €€ | Regional Cuisine | Just a stone’s throw from the central and very... | [Wheelchair access] | [Amex, Dinersclub, Mastercard, Visa] | +39 0436 4246 | https://hoteldelen.it |
| 1981 | AceroRosso | Via Ruvignan 1 | Vodo di Cadore | 32040 | Italy | €€ | Regional Cuisine | This secluded mountain chalet immersed in verd... | [Car park, Terrace, Wheelchair access] | [Mastercard, Visa] | +39 0435 489653 | https://www.acerorossodolomiti.it |
| 1982 | Café Les Paillotes | piazza Le Laudi 2 | Pescara | 65129 | Italy | €€€ | Modern Cuisine, Seafood | This old acquaintance of the Michelin Guide no... | [Air conditioning, Interesting wine list, Rest... | [Amex, Dinersclub, Mastercard, Visa] | +39 085 61809 | https://www.lespaillotes.it/ |
