Before starting, we import all the libraries that we need.

In [1]:
from MyFunctions.crawler import Crawler
from MyFunctions.parser import Parser
from MyFunctions.preprocessor import Preprocessor
from MyFunctions.searchEngine import SearchEngine, AdvancedSearchEngine
import requests
from bs4 import BeautifulSoup
import re
import os
import csv
import pandas as pd
import numpy as np
import json
from collections import defaultdict
import math
from nltk import word_tokenize
from nltk.stem import PorterStemmer


# <strong> Data collection

## <strong> 1.1 Get the list of Michelin restaurants

Before scraping all the restaurant URLs, let's first determine the maximum page number. It's simple to find the correct CSS selector for the page list: just inspect the list of pages in your browser and identify the corresponding class or element name.

<p>
    <img title = "list of pages" src="./images/pages_number.png"/>
</p>

In [2]:
response = requests.get('https://guide.michelin.com/en/it/restaurants')
soup = BeautifulSoup(response.content, "html.parser")
page_links = soup.select('ul.pagination li a') #name of the pages list
page_numbers = [int(a.get_text()) for a in page_links if a.get_text().isdigit()]

# Get the maximum page number
total_pages = max(page_numbers) if page_numbers else 0
print(f'There are in total: {total_pages} pages')

There are in total: 100 pages


Now we can very easily get the URL of each page

In [3]:
pages = ['https://guide.michelin.com/en/it/restaurants'] #Initial page

for i in range(2, total_pages+1): #get all other pages from 2 to total_pages included
    pages.append('https://guide.michelin.com/en/it/restaurants/page/'+str(i))

Now, to get the URLs of all the restaurants, we proceed in the same way and identify the name of the corresponding class in the webpage.

<p>
<img title = "Class of a restaurant" src="images/restaurant_link.png"/>
</p>

We can see that the restaurant URLs follow a consistent pattern, which the loop further down matches with a regular expression. Schematically, the pattern is:

```bash
BASE_URL/en/region/city/restaurant/name_of_restaurant
```


In [4]:
total_urls = [] #save all urls
base = 'https://guide.michelin.com' #base url to use

In [5]:
pattern = re.compile(r'^/en/[^/]+/[^/]+/restaurant/[^/]+$') # pattern of restaurant links, compiled once

for p in pages: # loop over all pages
    response = requests.get(p) # get the page
    soup = BeautifulSoup(response.content, "html.parser") # parse the HTML with BeautifulSoup
    links = soup.select('a.link') # select all <a> elements with class 'link'
    # keep only the links that match the restaurant pattern and make them absolute
    restaurant_links = [base+link.get('href') for link in links if pattern.match(link.get('href', ''))]
    total_urls.append(restaurant_links) # one list of URLs per page

Now we save all the URLs to a text file called 'restaurant_urls.txt'

In [None]:
with open('dataset/restaurant_urls.txt', 'w') as f:
    # enumerate keeps track of the page number, starting from 1
    for page_count, urls in enumerate(total_urls, start=1):
        f.write(f'{page_count}\n')  # Add a label for the page number
        for url in urls:  # Write each URL from the current page
            f.write(f'{url}\n')

In [11]:
print(sum([len(u) for u in total_urls])) # how many restaurants we got

1983


## <strong> 1.2. Crawl Michelin restaurant pages

Now we download the HTML of every URL and save it in a folder, with one subfolder per results page.

In [None]:
crawler = Crawler()
crawler.save_all_as_html('dataset/restaurant_urls.txt') # See actual implementation inside 'crawler.py'

In [3]:
path = 'restaurants_html'
count = crawler.count_files(path)
print('file count:', count)

file count: 1983


The save_all_as_html function uses multi-threading, running approximately 20 threads concurrently. Within the loop for each page, every thread downloads roughly one URL, so the restaurants listed on a page are fetched in parallel. As a result, the function downloaded all 1983 files in under one minute. We also send randomly chosen headers when accessing the server; see crawler.py for the implementation.
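Since crawler.py is not shown in this notebook, here is a minimal sketch of the approach described above, assuming a thread pool from `concurrent.futures`. The names `download_all`, `fake_fetch`, and `USER_AGENTS` are hypothetical, and the real `save_all_as_html` may be organized differently. The fetcher is passed in as a parameter so the example runs offline.

```python
import random
from concurrent.futures import ThreadPoolExecutor, as_completed

# Hypothetical pool of User-Agent strings; the real list lives in crawler.py.
USER_AGENTS = [
    "Mozilla/5.0 (X11; Linux x86_64)",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)",
]


def download_all(urls, fetch, max_workers=20):
    """Fetch every URL concurrently and return {url: html or None}.

    `fetch(url, headers)` is any callable returning the page HTML, e.g.
    lambda u, h: requests.get(u, headers=h, timeout=10).text
    """
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Submit one task per URL; each task gets a randomly chosen User-Agent.
        futures = {
            pool.submit(fetch, url, {"User-Agent": random.choice(USER_AGENTS)}): url
            for url in urls
        }
        for future in as_completed(futures):
            url = futures[future]
            try:
                results[url] = future.result()
            except Exception:
                # A failed download is recorded as None so it can be retried.
                results[url] = None
    return results


# Offline demonstration with a fake fetcher instead of a real HTTP request.
def fake_fetch(url, headers):
    return f"<html>{url}</html>"


demo_urls = [f"https://example.com/restaurant/{i}" for i in range(5)]
pages_html = download_all(demo_urls, fake_fetch)
print(len(pages_html), "pages downloaded")  # prints "5 pages downloaded"
```

Because the work is dominated by waiting on the network, threads are a reasonable fit here even with Python's global interpreter lock.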

## <strong> 1.3 Parse downloaded pages

The list of the information we desire for each restaurant and their format is as follows:

    Restaurant Name (to save as restaurantName): string;
    Address (to save as address): string;
    City (to save as city): string;
    Postal Code (to save as postalCode): string;
    Country (to save as country): string;
    Price Range (to save as priceRange): string;
    Cuisine Type (to save as cuisineType): string;
    Description (to save as description): string;
    Facilities and Services (to save as facilitiesServices): list of strings;
    Accepted Credit Cards (to save as creditCards): list of strings;
    Phone Number (to save as phoneNumber): string;
    URL to the Restaurant Page (to save as website): string.


To parse this information, we can inspect one HTML file to see how it is stored, as we did before.<br>
Most of the information can be retrieved from the following JSON-LD script at the end of each HTML file:
```js
<script type="application/ld+json">{"@context":"http://schema.org","address":{"@type":"PostalAddress","streetAddress":"Piazza Salvo d'Acquisto 16","addressLocality":"Lamezia Terme","postalCode":"88046","addressCountry":"ITA","addressRegion":"Calabria"},"name":"Abbruzzino Oltre","image":"https://axwwgrkdco.cloudimg.io/v7/__gmpics3__/f19d37d6b9da437fa06b6f9406645056.jpg?width=1000","@type":"Restaurant","review":{"@type":"Review","datePublished":"2024-09-11T07:32","name":"Abbruzzino Oltre","description":"This restaurant, the new home of young chef Luca Abbruzzino, occupies the first floor of a historic palazzo in the town centre which has recently been converted into a small hotel offering six ...","author":{"@type":"Person","name":"Michelin Inspector"}},"telephone":"+39 0968 188 8038","knowsLanguage":"en-IT","acceptsReservations":"No","servesCuisine":"Contemporary","url":"https://guide.michelin.com/en/calabria/lamezia-terme/restaurant/abbruzzino-oltre","currenciesAccepted":"EUR","paymentAccepted":"American Express credit card, Credit card / Debit card accepted, Mastercard credit card, Visa credit card","award":"Selected: Good cooking","brand":"MICHELIN Guide","hasDriveThroughService":"False","latitude":38.9770969,"longitude":16.3202202,"hasMap":"https://www.google.com/maps/search/?api=1&query=38.9770969%2C16.3202202"}</script>
```
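As a rough illustration of how such a block can be read, the sketch below pulls the JSON-LD script out of a page and maps it onto some of the field names listed earlier. It is not the `Parser` class from parser.py: `extract_ld_fields` is a hypothetical helper, the mapping of `paymentAccepted` to `creditCards` is an assumption, and a regular expression is used only to keep the example dependency-free (with BeautifulSoup, `soup.find('script', type='application/ld+json')` would locate the same tag).

```python
import json
import re

# Shortened stand-in for a downloaded page; values come from the example above.
html = """
<html><body>
<script type="application/ld+json">{"@context":"http://schema.org",
"address":{"@type":"PostalAddress","streetAddress":"Piazza Salvo d'Acquisto 16",
"addressLocality":"Lamezia Terme","postalCode":"88046","addressCountry":"ITA",
"addressRegion":"Calabria"},"name":"Abbruzzino Oltre","@type":"Restaurant",
"telephone":"+39 0968 188 8038","servesCuisine":"Contemporary",
"paymentAccepted":"American Express credit card, Visa credit card"}</script>
</body></html>
"""

LD_JSON = re.compile(
    r'<script type="application/ld\+json">(.*?)</script>', re.DOTALL
)


def extract_ld_fields(html):
    """Return a flat dict with the fields stored in the JSON-LD block."""
    match = LD_JSON.search(html)
    if match is None:
        return {}
    data = json.loads(match.group(1))
    address = data.get("address", {})
    payment = data.get("paymentAccepted", "")
    return {
        "restaurantName": data.get("name"),
        "address": address.get("streetAddress"),
        "city": address.get("addressLocality"),
        "postalCode": address.get("postalCode"),
        "country": address.get("addressCountry"),
        "region": address.get("addressRegion"),
        "cuisineType": data.get("servesCuisine"),
        "phoneNumber": data.get("telephone"),
        # paymentAccepted is a comma-separated string; split it into a list.
        "creditCards": [c.strip() for c in payment.split(",") if c.strip()],
    }


info = extract_ld_fields(html)
print(info["restaurantName"], "-", info["city"], info["postalCode"])
# prints "Abbruzzino Oltre - Lamezia Terme 88046"
```

Note that the `description` in the JSON-LD block is truncated with "...", so the full description, the price range, and the facilities presumably have to be read from other elements of the page.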

<img src = "images/restaurant_page.png" />

Now we create a parse_restaurant function that, given an HTML file, parses all the information we need and returns it as a dictionary. We also decided to keep region as an extra column.

In [2]:
parser = Parser()
info = parser.parse_restaurant('restaurants_html/1/la-trattoria-enrico-bartolini.html') #Test
parser.show_restaurant_info(info)

restaurantName: La Trattoria Enrico Bartolini
address: Località Badiola
city: Castiglione della Pescaia
postalCode: 58043
country: ITA
region: Tuscany
priceRange: €€€€
cuisineType: Mediterranean Cuisine, Grills
description: After a majestic picture-postcard approach via a long avenue lined with cypress trees and maritime pines, passing vineyards and Maremma cattle along the way, you finally arrive at this restaurant which serves trattoria-style cuisine full of intense, familiar and reassuring flavours. The decor here is elegant with the occasional rustic touch, while the service is of the highest level yet pleasantly friendly and informal. Welcome to Bartolini’s Maremma restaurant! Here, resident chef Bruno De Moura Cossio offers a choice of dishes with one common denominator, namely charcoal grilling. All the dishes served here have been grilled in some way, so that they have a distinctive barbecued flavour. However, although the chef’s Brazilian origins are obvious in many different 

Now we can create a TSV file with the information of all the restaurants

In [42]:
root = 'restaurants_html'
output= 'dataset/restaurant_info.tsv'
parser.save_all_restaurant_info_to_tsv(root, output) #actual implementation in Parser class

Data saved to dataset/restaurant_info.tsv


In [43]:
df = pd.read_table('dataset/restaurant_info.tsv', index_col=0)

# <strong> Search Engine </strong>

## <strong> 2.0.0. Preprocessing the Text

Before building the search engine, we need to prepare and clean the restaurant descriptions in our dataset. To accomplish this, we created a class named Preprocessor in preprocessor.py. This class leverages the nltk library to process the text in the description column. It removes stopwords and punctuation, converts the text to lowercase, and applies stemming to reduce words to their base forms. This preprocessing step ensures that the descriptions are standardized, making them more suitable for efficient search and retrieval.
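The steps above can be sketched in a few lines. This is not the `Preprocessor` class itself: the stopword set is a tiny stand-in for nltk's list, `preprocess` is a hypothetical helper, and stemming is left as a pluggable function (identity by default) so the example runs without nltk.

```python
import string

# Tiny stand-in stopword list; the notebook's Preprocessor relies on nltk instead.
STOPWORDS = {"a", "an", "and", "the", "of", "in", "is", "this", "with", "to"}


def preprocess(text, stopwords=STOPWORDS, stem=lambda word: word):
    """Lowercase, strip ASCII punctuation, drop stopwords, then stem each token."""
    # Replace every punctuation character with a space before splitting.
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    tokens = text.lower().translate(table).split()
    return " ".join(stem(tok) for tok in tokens if tok not in stopwords)


print(preprocess("This restaurant serves modern cuisine, with a view of the lake."))
# prints "restaurant serves modern cuisine view lake"
```

To get closer to the notebook's behaviour, one could pass `stem=PorterStemmer().stem` using the `PorterStemmer` already imported from `nltk.stem`. The same function should also be applied to user queries, so that query terms and indexed terms are normalized identically.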

In [None]:
df = pd.read_table('dataset/restaurant_info.tsv')
preprocessor = Preprocessor()
df = preprocessor.filter(df)

Now let's save the result, which contains the new description_filtered column alongside the original description (the latter is still displayed in the search results).

In [3]:
df.to_csv('dataset/restaurant_info.tsv', sep="\t")

## <strong> 2.1. Conjunctive Query

In [4]:
df = pd.read_table("dataset/restaurant_info.tsv", index_col=0)

## <strong> 2.1.1. Create Your Index!

Let's create a vocabulary that maps each word to a unique integer (term_id) and save it in a csv file.

In [3]:
all_descriptions = df['description_filtered'].str.cat(sep=' ')
all_descriptions = list(set(all_descriptions.split(" ")))
vocabulary = {word: term_id for term_id, word in enumerate(all_descriptions)}  # avoid shadowing the built-in id

In [4]:
# Save this vocabulary to a file with utf-8 encoding in order to be able to handle all the characters
with open('dataset/vocabulary.csv', 'w', newline='', encoding='utf-8') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(['word', 'term_id'])  # Header
    for word, term_id in vocabulary.items():
        writer.writerow([word, term_id])

Now let's create an inverted index and save it into a file.

In [None]:
vocabulary = pd.read_csv("dataset/vocabulary.csv")
df = pd.read_table("dataset/restaurant_info.tsv", index_col=0)  # file saved after preprocessing
df = df[['restaurantName','description_filtered']]
df['description_filtered'] = df['description_filtered'].str.split(" ")

In [None]:
word_to_restaurants = defaultdict(list)

# Iterate through the restaurant descriptions and update the dictionary
for i, row in df.iterrows():
    for word in row['description_filtered']:
        word_to_restaurants[word].append(row['restaurantName'])

# Add the 'restaurants_containing_word' column to the vocabulary DataFrame
vocabulary['restaurants_containing_word'] = vocabulary['word'].apply(lambda x: word_to_restaurants[x])

In [69]:
vocabulary

Unnamed: 0,word,term_id,restaurants_containing_word
0,mattiacci,0,[La Gioconda]
1,basil,1,"[Il Buco, Aria, Regallo, La Caravella dal 1959..."
2,betray,2,[Albergaccio di Castellina]
3,alway,3,"[Dal Pescatore, Osteria dell'Arco, Marcelin, A..."
4,mascia,4,[San Domenico]
...,...,...,...
7780,discreetli,7780,"[Morelli, Marelet, GioEle]"
7781,peopl,7781,"[Hosteria Giusti, Franceschetta 58, Taverna Ro..."
7782,sebada,7782,[Sa Domu Sarda]
7783,florian,7783,"[Umberto De Martino, Castel fine dining]"


Save the inverted index to a JSON file:

In [72]:
inverted_index = {term_id:rs for term_id, rs in zip(vocabulary['term_id'], vocabulary['restaurants_containing_word'])}
with open('dataset/inverted_index.json', 'w') as jsonfile:
    json.dump(inverted_index, jsonfile)

## <strong> 2.1.2. Execute the Query
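The `SearchEngine` class lives in MyFunctions/searchEngine.py and is not reproduced here. As a sketch of what a conjunctive (AND) query involves, the snippet below maps each query term to its term_id and intersects the corresponding posting lists. The function name and the toy data are hypothetical, and the default `preprocess` only lowercases, whereas the real engine would need the same stemming used to build the vocabulary.

```python
def conjunctive_search(query, vocabulary, inverted_index, preprocess=str.lower):
    """Return the restaurants whose description contains every query term."""
    terms = preprocess(query).split()
    result = None
    for term in terms:
        term_id = vocabulary.get(term)
        if term_id is None:
            return set()  # an unknown term means no document can match
        postings = set(inverted_index[term_id])
        # Intersect with what we have so far (AND semantics).
        result = postings if result is None else result & postings
    return result or set()


# Toy data (hypothetical), shaped like vocabulary.csv and inverted_index.json.
toy_vocabulary = {"modern": 0, "origin": 1, "pasta": 2}
toy_index = {
    0: ["Restaurant A", "Restaurant B", "Restaurant C"],
    1: ["Restaurant B", "Restaurant C", "Restaurant D"],
    2: ["Restaurant E"],
}

print(sorted(conjunctive_search("modern origin", toy_vocabulary, toy_index)))
# prints ['Restaurant B', 'Restaurant C']
```

One practical detail: JSON object keys are always strings, so after `json.load` the term IDs in inverted_index.json come back as `"0"`, `"1"`, and so on, and a lookup would need `str(term_id)`.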

In [2]:
df = "dataset/restaurant_info.tsv"
vocabulary = "dataset/vocabulary.csv"
inverted_index = "dataset/inverted_index.json"

searcher = SearchEngine(df,vocabulary,inverted_index)

In [3]:
ideal_restaurants = searcher.search("modern original")
ideal_restaurants

Unnamed: 0,restaurantName,address,description,website
2,Casa Fantini/Lake Time,piazza Motta,Situated on the lakefront in the attractive to...,https://www.casafantinilaketime.com/it/il-rist...
3,La Taverna di Bacco,"largo Luigi Trafelli, 5",A small restaurant just a stone’s throw from t...,https://www.latavernadibacconettuno.it
8,Novo Osteria,piazza De Cristoforis 30,"Originally a monastery, this building dating b...",https://novo-osteria.it/
9,Emozioni,via Guglielmo Marconi 129,"Situated in the heart of the historic centre, ...",http://www.ristoranteemozioni.com
12,Angiò-Macelleria di Mare,viale Africa 28/h,The phrase “macellaria di mare” translates as ...,https://albertoangiolucci.it/
...,...,...,...,...
1966,Il Grifone,via Ca' Masino 611/a - loc. Varignana,This fine-dining restaurant housed in the 18C ...,https://www.palazzodivarignana.com
1973,La Limonaia,via Mario Ponzio 10/b,This restaurant occupies a large veranda decor...,https://www.lalimonaia.org
1974,Riccio Restaurant,via Molo di Baia 47,"Overlooking the port, this restaurant serves t...",https://riccio.eatbu.com/?lang=it
1976,Sensi,via Pietro Comite 4,All five senses will be satisfied by a visit t...,https://sensiamalfi.it/


# <strong> 2.2. Ranked Search Engine with TF-IDF and Cosine Similarity

## <strong> 2.2.1 Inverted Index with TF-IDF Scores

TODO?

## <strong> 2.2.2. Execute the Ranked Query

<strong> EXAMPLE ON HOW TO CALCULATE COSINE SIMILARITY

In this example, log is the natural logarithm, the corpus is assumed to contain 100 documents, and each term is assumed to appear in exactly one of them. Products are computed from unrounded values and then rounded to three decimals, so multiplying the displayed figures can differ in the last digit.

**Table 1: Query TF-IDF Calculations**

| Word | Word Count | Total Words | TF | IDF Calculation | IDF Value | TF-IDF Score |
|------|------------|-------------|-----|-----------------|-----------|--------------|
| modern | 1 | 3 | 1/3 = 0.333 | log(100/1) | 4.605 | (1/3) × 4.605 = 1.535 |
| seasonal | 1 | 3 | 1/3 = 0.333 | log(100/1) | 4.605 | (1/3) × 4.605 = 1.535 |
| cuisine | 1 | 3 | 1/3 = 0.333 | log(100/1) | 4.605 | (1/3) × 4.605 = 1.535 |

Query Vector = [1.535, 1.535, 1.535]<br>
Query Vector Magnitude = √(1.535² + 1.535² + 1.535²) = 2.659

**Table 2: Document TF-IDF Calculations**

| Word | Word Count | Total Words | TF | IDF Calculation | IDF Value | TF-IDF Score |
|------|------------|-------------|-----|-----------------|-----------|--------------|
| modern | 2 | 74 | 2/74 = 0.027 | log(100/1) | 4.605 | (2/74) × 4.605 = 0.124 |
| seasonal | 1 | 74 | 1/74 = 0.014 | log(100/1) | 4.605 | (1/74) × 4.605 = 0.062 |
| cuisine | 0 | 74 | 0/74 = 0 | log(100/1) | 4.605 | 0 × 4.605 = 0 |

Document Vector = [0.124, 0.062, 0]<br>
Document Vector Magnitude = √(0.124² + 0.062² + 0²) = 0.139

**Cosine Similarity Calculation:**
```
Dot Product = (1.535 × 0.124) + (1.535 × 0.062) + (1.535 × 0)
            ≈ 0.191 + 0.096 + 0
            = 0.287

Cosine Similarity = 0.287 / (2.659 × 0.139)
                  ≈ 0.287 / 0.370
                  ≈ 0.775
```
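The worked example can be checked numerically with a short script. It uses the same definitions as the tables (TF = count / total words, IDF = natural log of N / document frequency, with N = 100 and a document frequency of 1 for each term); the function names are only illustrative. Evaluated in floating point without rounding the intermediate values, the similarity comes out at roughly 0.775, and hand calculations that round at each step can differ from this in the last digits.

```python
import math

N_DOCS = 100  # corpus size assumed in the tables
DOC_FREQ = {"modern": 1, "seasonal": 1, "cuisine": 1}
TERMS = ["modern", "seasonal", "cuisine"]


def tfidf_vector(counts, total_words):
    """TF-IDF with TF = count / total_words and IDF = ln(N / df)."""
    return [
        (counts.get(t, 0) / total_words) * math.log(N_DOCS / DOC_FREQ[t])
        for t in TERMS
    ]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


query_vec = tfidf_vector({"modern": 1, "seasonal": 1, "cuisine": 1}, total_words=3)
doc_vec = tfidf_vector({"modern": 2, "seasonal": 1}, total_words=74)
similarity = cosine(query_vec, doc_vec)
print(round(similarity, 3))  # prints 0.775
```

Because all three terms share the same IDF here, the result depends only on the term counts: the vectors are proportional to [1, 1, 1] and [2, 1, 0], whose cosine is 3/√15.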

### <strong> Pseudo-algorithm: Cosine Similarity for Restaurant Search

#### Steps

1. **Process Query**: 
   - Tokenize, lowercase, and stem the query words.
   - Map terms to unique IDs.

2. **Calculate Query TF-IDF**:
   - Compute term frequency (TF) and inverse document frequency (IDF) for each query term.
   - Multiply TF and IDF to get the query's TF-IDF vector.

3. **Calculate Document TF-IDF**:
   - Retrieve relevant documents using an inverted index.
   - Compute TF-IDF vectors for each document based on query terms.

4. **Compute Cosine Similarity**:
   - Normalize the query and document vectors.
   - Calculate the cosine similarity between the query and each document.

5. **Rank and Return**:
   - Sort documents by similarity score and return the top results.
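The five steps can be sketched end to end as follows. This is an illustration under stated assumptions rather than the code in searchEngine.py: documents are already tokenized and stemmed, TF is count / length, IDF is the natural log of N / document frequency, and the helper names `build_tfidf_index` and `rank` are hypothetical.

```python
import math
from collections import Counter, defaultdict


def build_tfidf_index(docs):
    """Map each term to {doc_id: tf-idf}, where docs is {doc_id: [tokens]}."""
    n_docs = len(docs)
    doc_freq = Counter(term for tokens in docs.values() for term in set(tokens))
    idf = {term: math.log(n_docs / df) for term, df in doc_freq.items()}
    index = defaultdict(dict)
    for doc_id, tokens in docs.items():
        for term, count in Counter(tokens).items():
            index[term][doc_id] = (count / len(tokens)) * idf[term]
    return index, idf


def rank(query_tokens, index, idf, top_k=5):
    """Score documents by cosine similarity restricted to the query terms."""
    counts = Counter(t for t in query_tokens if t in idf)
    total = sum(counts.values())
    if total == 0:
        return []
    q_vec = {t: (c / total) * idf[t] for t, c in counts.items()}
    q_norm = math.sqrt(sum(w * w for w in q_vec.values()))
    # Only documents that share at least one term with the query are scored.
    candidates = {doc_id for t in q_vec for doc_id in index[t]}
    scores = []
    for doc_id in candidates:
        d_vec = {t: index[t].get(doc_id, 0.0) for t in q_vec}
        d_norm = math.sqrt(sum(w * w for w in d_vec.values()))
        dot = sum(q_vec[t] * d_vec[t] for t in q_vec)
        if q_norm and d_norm:
            scores.append((doc_id, dot / (q_norm * d_norm)))
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:top_k]


# Toy corpus (hypothetical), already tokenized and stemmed.
docs = {
    "A": ["modern", "pasta", "lake"],
    "B": ["pasta", "pasta", "tradit"],
    "C": ["seafood", "modern", "terrac"],
}
index, idf = build_tfidf_index(docs)
results = rank(["modern", "pasta"], index, idf)
print([(doc_id, round(score, 3)) for doc_id, score in results])
```

Document A contains both query terms in equal proportion, so it ranks first. Note that when vectors are restricted to the query terms, a one-word query gives every matching document a similarity of exactly 1.0, which may be why the "pasta" query below shows 1.0 in every row.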



In [4]:
query_scores = searcher.get_restaurant_scores("pasta") #Actual implementation inside searcher



In [5]:
query_scores 

Unnamed: 0,restaurantName,address,description,website,similarityScore
6,Trattoria di Via Serra,via Luigi Serra 9/b,It’s well worth heading off the traditional to...,https://www.trattoriadiviaserra.it/,1.0
10,Osteria della Foce,via Eugenio Ruspoli 72r,This simple restaurant stands out for its deli...,http://www.osteriadellafocegenova.it,1.0
15,Roscioli,via dei Giubbonari 21,This restaurant is part of one of the best foo...,https://www.salumeriaroscioli.com/,1.0
18,Trattoria del Cimino dal 1895,via Filippo Nicolai 44,Situated on the hill leading to Palazzo Farnes...,https://trattoriadelcimino.jimdofree.com/,1.0
28,Da Fausto,Località Valle Prati 1,This typical restaurant with a stone façade ha...,https://www.relaisborgodelgallo.com/,1.0


# <strong> 3. Define a New Score!

 <strong> Steps:<br>
    - User Query: The user provides a text query. We’ll retrieve relevant documents using the search engine built in Step 2.1.<br>
    - New Ranking Metric: After retrieving relevant documents, we’ll rank them using a new custom score. Instead of limiting the scoring to only the description field, we can include other attributes like priceRange, facilitiesServices, and cuisineType.<br>
    - You will use a heap data structure (e.g., Python’s heapq library) to maintain the top-k restaurants.
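No implementation of this step appears in the notebook, so the following is only one possible sketch. The weights, the `PRICE_BONUS` table, and the functions `custom_score` and `top_k` are all hypothetical choices made for illustration; the column names match those of restaurant_info.tsv. A min-heap of size k keeps the best k candidates seen so far, so the full list never needs to be sorted.

```python
import heapq

# Hypothetical preference for cheaper restaurants.
PRICE_BONUS = {"€": 1.0, "€€": 0.75, "€€€": 0.5, "€€€€": 0.25}


def custom_score(restaurant, query_terms, wanted_facilities=()):
    """Combine description overlap, cuisine match, price and facilities."""
    description = set(restaurant["description_filtered"].split())
    overlap = len(description & set(query_terms)) / max(len(query_terms), 1)
    cuisine = restaurant["cuisineType"].lower()
    cuisine_hit = any(term in cuisine for term in query_terms)
    facilities = set(restaurant["facilitiesServices"])
    facility_hits = sum(1 for f in wanted_facilities if f in facilities)
    facility_part = facility_hits / len(wanted_facilities) if wanted_facilities else 0.0
    return (
        0.5 * overlap
        + 0.2 * (1.0 if cuisine_hit else 0.0)
        + 0.2 * PRICE_BONUS.get(restaurant["priceRange"], 0.0)
        + 0.1 * facility_part
    )


def top_k(restaurants, query_terms, k=3, wanted_facilities=()):
    """Keep only the k best restaurants in a min-heap while scanning."""
    heap = []  # entries are (score, position, name)
    for position, restaurant in enumerate(restaurants):
        score = custom_score(restaurant, query_terms, wanted_facilities)
        entry = (score, position, restaurant["restaurantName"])
        if len(heap) < k:
            heapq.heappush(heap, entry)
        elif entry > heap[0]:
            # The new entry beats the current worst of the top k: swap it in.
            heapq.heapreplace(heap, entry)
    return [(name, score) for score, _, name in sorted(heap, reverse=True)]


# Toy rows (hypothetical), shaped like the columns of restaurant_info.tsv.
restaurants = [
    {"restaurantName": "A", "description_filtered": "modern pasta lake view",
     "cuisineType": "Italian Contemporary", "priceRange": "€€",
     "facilitiesServices": ["Terrace", "Car park"]},
    {"restaurantName": "B", "description_filtered": "tradit pasta famili",
     "cuisineType": "Country cooking", "priceRange": "€",
     "facilitiesServices": ["Air conditioning"]},
    {"restaurantName": "C", "description_filtered": "seafood terrac",
     "cuisineType": "Seafood", "priceRange": "€€€€",
     "facilitiesServices": ["Terrace"]},
]
best = top_k(restaurants, ["modern", "pasta"], k=2, wanted_facilities=("Terrace",))
print([(name, round(score, 3)) for name, score in best])
# prints [('A', 0.75), ('B', 0.45)]
```

For small result sets, `heapq.nlargest(k, items, key=...)` would give the same ranking in one call; the explicit heap is shown because it makes the top-k maintenance visible.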


In [3]:
df = pd.read_table("dataset/restaurant_info.tsv")

# <strong> 5. BONUS: Advanced Search Engine

## <strong> 5.1 - Specify Search Criteria: Users can specify search terms for the following features (any or all of them):
    - restaurantName
    - city
    - cuisineType


Let's first create the inverted index for each of the features. We can do this easily by writing one function and calling it once for each column we want.


In [4]:
df = pd.read_table("dataset/restaurant_info.tsv")

In [40]:
def createIndex(df, column):
    # Convert all unique words to lowercase and apply stemming
    stemmer = PorterStemmer()
    df[column] = df[column].fillna('na')
    all_words = pd.Series(" ".join(df[column].str.lower()).split()).unique()
    all_words_stemmed = [stemmer.stem(word) for word in all_words]
    
    # Initialize the dictionary for stemmed words
    word_to_restaurants = {word: [] for word in all_words_stemmed}
    
    # Split each row's text, stem each word, and update the index
    for i, row in df.iterrows():
        words = row[column].lower().split()
        for word in words:
            stemmed_word = stemmer.stem(word)
            word_to_restaurants[stemmed_word].append(row['restaurantName'])
    
    # Save to JSON
    path = f'dataset/{column}_index.json'
    with open(path, 'w') as jsonfile:
        json.dump(word_to_restaurants, jsonfile)

In [None]:
createIndex(df, 'restaurantName')
createIndex(df, 'city')
createIndex(df, 'cuisineType')

### <strong> 5.2 - Price Range Filter: Allow users to set a price range (e.g., between € and €€€) to filter the results by affordability.

In [29]:
createIndex(df, 'priceRange')

### <strong> 5.3 - Region Filter: Enable users to specify a list of Italian regions to limit the search to restaurants within those regions.

In [30]:
createIndex(df, 'region')

### <strong> 5.4 - Accepted Credit Cards: Provide an option to filter by accepted credit card types. Users can specify one or more preferred card types (e.g., Visa, MasterCard, Amex).

In [41]:
createIndex(df, 'creditCards')

### <strong> 5.5 - Services and Facilities: Allow users to filter based on specific services and facilities provided by the restaurant. For example, users may look for amenities like Wi-Fi, Terrace, Air Conditioning, or Parking.

In [42]:
createIndex(df, 'facilitiesServices')

## <strong> Now we can just use the AdvancedSearchEngine
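The `AdvancedSearchEngine` class is defined in MyFunctions/searchEngine.py and is not shown here. As a rough sketch of how per-column indexes like the ones written by `createIndex` could be combined, the snippet below intersects the posting lists of every requested field and then applies a price range. The function `advanced_search`, its parameters, and the toy data are hypothetical; for simplicity it handles single-word, unstemmed values only, whereas the real indexes use stemmed keys.

```python
PRICE_LEVELS = ["€", "€€", "€€€", "€€€€"]


def advanced_search(indexes, prices, *, min_price="€", max_price="€€€€", **criteria):
    """Intersect the posting lists of every requested field.

    indexes:  {column: {term: [restaurant names]}}, like the *_index.json files.
    prices:   {restaurant name: price range string}.
    criteria: column=value pairs, e.g. city="catania"; a list value means
              "any of these" (useful for regions or credit cards).
    """
    result = None
    for column, wanted in criteria.items():
        values = wanted if isinstance(wanted, (list, tuple)) else [wanted]
        matches = set()
        for value in values:
            matches |= set(indexes[column].get(value.lower(), []))
        result = matches if result is None else result & matches
    if result is None:  # no text criteria: start from every restaurant
        result = set(prices)
    low, high = PRICE_LEVELS.index(min_price), PRICE_LEVELS.index(max_price)
    return sorted(r for r in result if low <= PRICE_LEVELS.index(prices[r]) <= high)


# Toy data (hypothetical).
toy_indexes = {
    "city": {"catania": ["R1", "R2", "R3"], "palermo": ["R4"]},
    "region": {"sicily": ["R1", "R2", "R3", "R4"], "tuscany": ["R5"]},
}
toy_prices = {"R1": "€€", "R2": "€€€€", "R3": "€", "R4": "€€", "R5": "€€€"}

print(advanced_search(toy_indexes, toy_prices,
                      city="catania", region=["sicily"], max_price="€€"))
# prints ['R1', 'R3']
```

Treating the price range as an ordered scale, rather than as another text index, is what makes a "between € and €€€" filter straightforward.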

In [9]:
advanced = AdvancedSearchEngine()
query = dict(
    city = 'catania',
    region = 'sicily'
)
advanced.search(**query)

Unnamed: 0,restaurantName,address,cuisineType,priceRange,website
12,Angiò-Macelleria di Mare,viale Africa 28/h,Seafood,€€,https://albertoangiolucci.it/
563,Sapio,piazza Antonino Gandolfo 11,"Sicilian, Modern Cuisine",€€€€,https://www.sapiorestaurant.it/
996,Me Cumpari Turiddu,piazza Turi Ferro 36,Sicilian,€,https://www.mecumparituriddu.it/
1054,Ménage,Via Euplio Reina 13,"Sicilian, Contemporary",€€,https://www.menagelounge.it/
1550,Concezione Restaurant,via Giuseppe Verdi 143,"Creative, Sicilian",€€€,https://concezionerestaurant.com/
1779,Materia | Spazio Cucina,via Teatro Massimo 29,"Sicilian, Modern Cuisine",€€,https://www.materiaspaziocucina.it/
1934,Coria,Via Prefettura 21,"Italian Contemporary, Sicilian",€€€,https://www.ristorantecoria.it
