Before starting, we import all the libraries we need.

In [2]:
from myFunctions import *
import requests
from bs4 import BeautifulSoup
import re
import os
import csv
import pandas as pd
import numpy as np

# <strong> Data collection

## <strong> 1.1 Get the list of Michelin restaurants

Before scraping all the restaurant URLs, let's first determine the maximum page number. It's simple to find the correct CSS selector for the page list: just inspect the list of pages in your browser and identify the corresponding class or element name.

<p>
    <img title = "list of pages" src="./images/pages_number.png"/>
</p>

In [4]:
response = requests.get('https://guide.michelin.com/en/it/restaurants')
soup = BeautifulSoup(response.content, "html.parser")
page_links = soup.select('ul.pagination li a') #name of the pages list
page_numbers = [int(a.get_text()) for a in page_links if a.get_text().isdigit()]

# Get the maximum page number
total_pages = max(page_numbers) if page_numbers else 0
print(f'There are in total: {total_pages} pages')

There are in total: 102 pages


Now we can build the URL of each page.

In [None]:
pages = ['https://guide.michelin.com/en/it/restaurants'] #Initial page

for i in range(2, total_pages+1): #get all other pages from 2 to total_pages included
    pages.append('https://guide.michelin.com/en/it/restaurants/page/'+str(i))

To get the URLs of all the restaurants, we proceed in the same way: we inspect the page and identify the class of the corresponding element.

<p>
<img title = "Class of a restaurant" src="images/restaurant_link.png"/>
</p>

The restaurant URLs follow a consistent pattern, shown here as a path template, which we can match with a regular expression:

```bash
BASE_URL/en/region/city/restaurant/name_of_restaurant
```
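
To check that the regular expression used in the scraping loop really matches this template, we can try it on a few paths. This is only a small sketch: the first path is taken from a real restaurant page, while the other two are hypothetical examples of links that should be rejected.

```python
import re

# Same pattern as in the scraping loop: /en/<region>/<city>/restaurant/<name>
RESTAURANT_PATH = re.compile(r'^/en/[^/]+/[^/]+/restaurant/[^/]+$')

sample_hrefs = [
    '/en/calabria/lamezia-terme/restaurant/abbruzzino-oltre',         # restaurant page
    '/en/it/restaurants/page/2',                                      # pagination link (rejected)
    '/en/calabria/lamezia-terme/restaurant/abbruzzino-oltre/photos',  # hypothetical deeper path (rejected)
]

# Keep only the paths that match the restaurant template
matches = [href for href in sample_hrefs if RESTAURANT_PATH.match(href)]
print(matches)
```

The `^` and `$` anchors matter here: without them, a longer path that merely contains a restaurant path would also be accepted.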


In [6]:
total_urls = [] #save all urls
base = 'https://guide.michelin.com' #base url to use

In [7]:
pattern = re.compile(r'^/en/[^/]+/[^/]+/restaurant/[^/]+$') # pattern of restaurant paths, compiled once

for p in pages: # loop over all pages
    response = requests.get(p) # get the page
    soup = BeautifulSoup(response.content, "html.parser") # parse the HTML with BeautifulSoup
    links = soup.select('a.link') # select all <a> elements with class 'link'
    restaurant_links = [base+link.get('href') for link in links if pattern.match(link.get('href', ''))] # keep only restaurant links
    total_urls.append(restaurant_links) # one list of URLs per page

Now we save all the URLs to a text file called 'restaurant_urls.txt'.

In [None]:
with open('restaurant_urls.txt', 'w') as f:
    for page_count, urls in enumerate(total_urls, start=1):
        f.write(f'Page {page_count}:\n')  # Add a label for the page number
        for url in urls:  # Write each URL from the current page
            f.write(f'{url}\n')

In [None]:
print(sum([len(u) for u in total_urls])) # how many restaurants we got

2037
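
Since the file mixes `Page N:` labels with URLs, anything that reads it back has to tell the two apart. Below is a minimal sketch of such a reader. The function name `parse_url_lines` and the sample lines are hypothetical, and this is not necessarily how `myFunctions.py` reads the file.

```python
def parse_url_lines(lines):
    """Turn 'Page N:' labels followed by URLs into {page_number: [urls]}."""
    urls_by_page = {}
    current_page = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip empty lines
        if line.startswith('Page ') and line.endswith(':'):
            # A label line starts a new page
            current_page = int(line[len('Page '):-1])
            urls_by_page[current_page] = []
        elif current_page is not None:
            # Any other line is a URL of the current page
            urls_by_page[current_page].append(line)
    return urls_by_page


# Hypothetical content in the same layout as restaurant_urls.txt
sample = [
    'Page 1:\n',
    'https://example.com/en/region/city/restaurant/first\n',
    'https://example.com/en/region/city/restaurant/second\n',
    'Page 2:\n',
    'https://example.com/en/region/city/restaurant/third\n',
]
print(parse_url_lines(sample))
```

With the real file, the same function can be called on the open file object, for example `parse_url_lines(open('restaurant_urls.txt'))`.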


## <strong> 1.2 Crawl Michelin restaurant pages

Now we download the HTML of every URL and save it to disk, with one subfolder per listing page.

In [None]:
save_all_as_html('restaurant_urls.txt') # See actual implementation inside 'myFunctions.py'

In [None]:
count = 0
for root_dir, cur_dir, files in os.walk('restaurants_html'): #Let's check if we got all the files
    count += len(files)
print('file count:', count)

file count: 2034
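
The count is three short of the 2037 URLs collected earlier. One way to investigate is to compare the file names we expect with the ones on disk. The sketch below assumes that each file is named after the last segment of its URL plus `.html` (as in `abbruzzino-oltre.html`) and ignores the per-page subfolders. The helper names and sample URLs are hypothetical.

```python
from collections import Counter


def expected_filename(url):
    """Assumed naming scheme: last URL segment plus '.html'."""
    return url.rstrip('/').rsplit('/', 1)[-1] + '.html'


def audit_downloads(urls, saved_filenames):
    """Return (expected names missing on disk, names expected more than once)."""
    expected = Counter(expected_filename(u) for u in urls)
    saved = set(saved_filenames)
    missing = sorted(name for name in expected if name not in saved)
    duplicated = sorted(name for name, n in expected.items() if n > 1)
    return missing, duplicated


# Hypothetical URLs: two of them end with the same restaurant name
urls = [
    'https://example.com/en/region/city-a/restaurant/alpha',
    'https://example.com/en/region/city-b/restaurant/alpha',
    'https://example.com/en/region/city-a/restaurant/beta',
]
missing, duplicated = audit_downloads(urls, ['alpha.html'])
print('missing:', missing)
print('duplicated:', duplicated)
```

Missing names point to failed downloads, while duplicated names point to files that may have overwritten each other. Which of the two applies to our three files is something only the real data can tell.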


The save_all_as_html function uses multi-threading and runs about 20 threads concurrently. Within the loop for each page, every thread downloads roughly one URL. With this approach, the function downloaded 2,034 of the 2,037 files in under one minute.
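
To give an idea of the technique, here is a minimal sketch of a threaded downloader built on `concurrent.futures`. It is not the actual code of `save_all_as_html`: the name `download_all` and its `fetch` parameter are assumptions made for illustration, and saving to disk is left out.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def download_all(urls, fetch, max_workers=20):
    """Run fetch(url) for every URL on a pool of worker threads.

    Returns (results, failures): results maps url -> fetched content,
    failures maps url -> the exception raised while fetching it.
    """
    results, failures = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Submit one task per URL and remember which future belongs to which URL
        future_to_url = {pool.submit(fetch, url): url for url in urls}
        for future in as_completed(future_to_url):
            url = future_to_url[future]
            try:
                results[url] = future.result()
            except Exception as exc:  # record the failure and keep going
                failures[url] = exc
    return results, failures
```

In practice `fetch` could be something like `lambda url: requests.get(url, timeout=10).content`. Keeping the failures in a dictionary makes it easy to see afterwards which URLs were not downloaded.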

## <strong> 1.3 Parse downloaded pages

The information we want for each restaurant, and its format, is as follows:

    Restaurant Name (to save as restaurantName): string;
    Address (to save as address): string;
    City (to save as city): string;
    Postal Code (to save as postalCode): string;
    Country (to save as country): string;
    Price Range (to save as priceRange): string;
    Cuisine Type (to save as cuisineType): string;
    Description (to save as description): string;
    Facilities and Services (to save as facilitiesServices): list of strings;
    Accepted Credit Cards (to save as creditCards): list of strings;
    Phone Number (to save as phoneNumber): string;
    URL to the Restaurant Page (to save as website): string.


To parse this information, we can inspect one HTML file to see how it is stored, as we did before.<br>
Most of the information can be retrieved from the following JSON script at the end of each HTML file:
```js
<script type="application/ld+json">{"@context":"http://schema.org","address":{"@type":"PostalAddress","streetAddress":"Piazza Salvo d'Acquisto 16","addressLocality":"Lamezia Terme","postalCode":"88046","addressCountry":"ITA","addressRegion":"Calabria"},"name":"Abbruzzino Oltre","image":"https://axwwgrkdco.cloudimg.io/v7/__gmpics3__/f19d37d6b9da437fa06b6f9406645056.jpg?width=1000","@type":"Restaurant","review":{"@type":"Review","datePublished":"2024-09-11T07:32","name":"Abbruzzino Oltre","description":"This restaurant, the new home of young chef Luca Abbruzzino, occupies the first floor of a historic palazzo in the town centre which has recently been converted into a small hotel offering six ...","author":{"@type":"Person","name":"Michelin Inspector"}},"telephone":"+39 0968 188 8038","knowsLanguage":"en-IT","acceptsReservations":"No","servesCuisine":"Contemporary","url":"https://guide.michelin.com/en/calabria/lamezia-terme/restaurant/abbruzzino-oltre","currenciesAccepted":"EUR","paymentAccepted":"American Express credit card, Credit card / Debit card accepted, Mastercard credit card, Visa credit card","award":"Selected: Good cooking","brand":"MICHELIN Guide","hasDriveThroughService":"False","latitude":38.9770969,"longitude":16.3202202,"hasMap":"https://www.google.com/maps/search/?api=1&query=38.9770969%2C16.3202202"}</script>
```
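
As an illustration, the fields that this script exposes directly could be read as in the sketch below. It uses only the standard library (`re` and `json`) rather than BeautifulSoup, the name `parse_ld_json` is hypothetical, and it is not the actual `parse_restaurant` implementation. The sample is a shortened copy of the script above.

```python
import json
import re

# The JSON-LD script tag as it appears in the downloaded pages
LD_JSON = re.compile(
    r'<script type="application/ld\+json">(.*?)</script>', re.DOTALL
)


def parse_ld_json(html):
    """Extract the fields that the JSON-LD block exposes directly."""
    match = LD_JSON.search(html)
    if match is None:
        return {}
    data = json.loads(match.group(1))
    address = data.get('address', {})
    return {
        'restaurantName': data.get('name'),
        'address': address.get('streetAddress'),
        'city': address.get('addressLocality'),
        'postalCode': address.get('postalCode'),
        'country': address.get('addressCountry'),
        'region': address.get('addressRegion'),
        'cuisineType': data.get('servesCuisine'),
        'phoneNumber': data.get('telephone'),
    }


# Shortened copy of the script shown above
sample_html = (
    '<script type="application/ld+json">'
    '{"@context":"http://schema.org","address":{"@type":"PostalAddress",'
    '"streetAddress":"Piazza Salvo d\'Acquisto 16","addressLocality":"Lamezia Terme",'
    '"postalCode":"88046","addressCountry":"ITA","addressRegion":"Calabria"},'
    '"name":"Abbruzzino Oltre","telephone":"+39 0968 188 8038",'
    '"servesCuisine":"Contemporary"}'
    '</script>'
)
info = parse_ld_json(sample_html)
print(info)
```

Note that the `description` in this script is cut off with "...", its `url` field points to the Michelin page rather than to the restaurant's own website, and there is no price range or list of facilities. Those fields have to be read from other parts of the HTML.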

<img src = "images/restaurant_page.png" />

Now we create a parse_restaurant function that, given an HTML file, parses all the information we need and returns it as a dictionary. We also decided to keep the region as an extra column.

In [2]:
info = parse_restaurant('restaurants_html/1/abbruzzino-oltre.html') #Test
show_restaurant_info(info)

This restaurant, the new home of young chef Luca Abbruzzino, occupies the first floor of a historic palazzo in the town centre which has recently been converted into a small hotel offering six exclusive guestrooms. Guests can enjoy an aperitif in the elegant small lounge before making their way to one of the two dining rooms with just five tables in total (all ellipsoid in shape), where the chef’s surprise tasting menu is served. Featuring several small courses, the menu showcases imaginative and skilfully prepared dishes that demonstrate the chef’s love for his native region. We particularly enjoyed the dry-aged angler fish with black sesame (already renowned as one of the chef’s classics), spring onion, sultanas and anchovy sauce; the spaghetti with mullet, orange, turmeric and “nduja” sauce; and the pigeon with peach, tomato and barley. Although wine-paring options are available, you can also ask the talented and enthusiastic maître-sommelier to recommend a bottle or even simply a g

Now we can create a TSV file with the information of all the restaurants.

In [2]:
root = '/home/pavka/ADM/ADM-HW3/restaurants_html'
output= 'restaurant_info.tsv'
save_all_restaurant_info_to_tsv(root, output) #actual implementation in 'myFunctions.py', but it's just a walk and parse of all .html

Data saved to restaurant_info.tsv
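
The "walk and parse" step could look roughly like the sketch below. It is an assumption about the general shape, not the code in `myFunctions.py`: the name `write_restaurant_tsv` is hypothetical, and the parser is passed in as the `parse` argument so that the sketch stays self-contained.

```python
import csv
import os

# Column order of the resulting table
FIELDS = ['restaurantName', 'address', 'city', 'postalCode', 'country',
          'region', 'priceRange', 'cuisineType', 'description',
          'facilitiesServices', 'creditCards', 'phoneNumber', 'website']


def write_restaurant_tsv(root, output, parse):
    """Walk root, call parse(path) on every .html file, write the rows as TSV.

    Returns the number of rows written.
    """
    count = 0
    with open(output, 'w', newline='', encoding='utf-8') as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS, delimiter='\t',
                                extrasaction='ignore')
        writer.writeheader()
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in sorted(filenames):
                if name.endswith('.html'):
                    # parse() is expected to return a dict keyed by FIELDS
                    writer.writerow(parse(os.path.join(dirpath, name)))
                    count += 1
    return count
```

A call such as `write_restaurant_tsv('restaurants_html', 'restaurant_info.tsv', parse_restaurant)` would then produce one row per downloaded page. Using the `csv` module instead of joining strings by hand takes care of quoting when a description contains a tab or a newline.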


In [3]:
df = pd.read_table('restaurant_info.tsv', index_col=0)

In [4]:
df.head(5)

restaurantName,address,city,postalCode,country,region,priceRange,cuisineType,description,facilitiesServices,creditCards,phoneNumber,website
Caffè La Crepa,piazza Matteotti 14,Isola Dovarese,26031,ITA,Lombardy,€€,Lombardian,"Overlooking a picturesque Renaissance square, ...","Interesting wine list, Terrace","Amex, Mastercard, Visa",+39 0375 396161,http://www.caffelacrepa.it
Terramira,piazza della Vittoria 13,Capolona,52010,ITA,Tuscany,€€€€,"Contemporary, Tuscan","Having gained valuable experience elsewhere, t...","Air conditioning, Great view","Amex, Mastercard, Visa",+39 0575 420989,https://terramira.it
Fàula,Località Talloria 1,Cerretto Langhe,12050,ITA,Piedmont,€€€,"Piedmontese, Modern Cuisine",Fàula is the restaurant at the Casa Langa hote...,"Air conditioning, Car park, Garden or park, Gr...","Amex, Mastercard, Visa",+39 0173 520520,https://www.casadilanga.com/it/food-drink/faula/
Il Fenicottero Rosa Gourmet,via Emilia Ponente 23,Faenza,48018,ITA,Emilia-Romagna,€€€,"Contemporary, Seafood",Situated within the Villa Abbondanzi luxury ho...,"Air conditioning, Car park, Garden or park, Te...","Amex, Mastercard, Visa",+39 0546 622672,https://www.villa-abbondanzi.com/ristorante-fe...
Acqua Pazza,via Dietro la Chiesa 3/4,Ponza,4027,ITA,Lazio,€€€,Seafood,Arranged on a series of multi-levelled terrace...,"Air conditioning, Great view, Interesting wine...","Amex, Mastercard, Visa",+39 0771 80643,https://www.acquapazza.com/


In [15]:
df.shape

(2034, 12)
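
One detail in the `head` output deserves attention: Acqua Pazza is shown with postal code 4027, although Italian postal codes have five digits and we wanted this field to be a string. A likely explanation is that pandas inferred the column as a number and dropped the leading zero. The sketch below reproduces the effect on a tiny in-memory table, assuming the TSV itself stores `04027`. Passing `dtype` for that column keeps it as text.

```python
import io

import pandas as pd

# Tiny in-memory TSV; the value 04027 is an assumption about the stored data
tsv = "restaurantName\tcity\tpostalCode\nAcqua Pazza\tPonza\t04027\n"

# Default type inference reads the column as a number
as_number = pd.read_table(io.StringIO(tsv), index_col=0)

# Forcing the column to str keeps the postal code exactly as written
as_text = pd.read_table(io.StringIO(tsv), index_col=0,
                        dtype={'postalCode': str})

print(as_number.loc['Acqua Pazza', 'postalCode'])  # leading zero is lost
print(as_text.loc['Acqua Pazza', 'postalCode'])    # kept as text
```

The same `dtype={'postalCode': str}` argument can be added when reading 'restaurant_info.tsv'.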