Before starting, we import all the libraries we need.

In [1]:
from MyFunctions.crawler import Crawler
from MyFunctions.parser import Parser
from MyFunctions.searchEngine import SearchEngine, AdvancedSearchEngine
from MyFunctions.preprocessor import Preprocessor
from MyFunctions.enanchedSearch import EnhancedSearchEngine, RestaurantSearchInterface
from MyFunctions.mapVisualizer import RestaurantMapVisualizer
import requests
from bs4 import BeautifulSoup
import re
import csv
import pandas as pd
import json
from collections import defaultdict
from nltk.stem import PorterStemmer
from collections import Counter
import math

# <strong> Data collection

## <strong> 1.1 Get the list of Michelin restaurants

Before scraping all the restaurant URLs, let's first determine the maximum page number. The correct CSS selector for the page list is easy to find: inspect the list of pages in your browser and identify the corresponding class or element name.

<p>
    <img title = "list of pages" src="./images/pages_number.png"/>
</p>

In [2]:
response = requests.get('https://guide.michelin.com/en/it/restaurants')
soup = BeautifulSoup(response.content, "html.parser")
page_links = soup.select('ul.pagination li a') #name of the pages list
page_numbers = [int(a.get_text()) for a in page_links if a.get_text().isdigit()]

# Get the maximum page number
total_pages = max(page_numbers) if page_numbers else 0
print(f'There are in total: {total_pages} pages')

There are in total: 100 pages


Now we can very easily get the URL of each page

In [3]:
pages = ['https://guide.michelin.com/en/it/restaurants'] #Initial page

for i in range(2, total_pages+1): #get all other pages from 2 to total_pages included
    pages.append('https://guide.michelin.com/en/it/restaurants/page/'+str(i))

To get the URLs of all the restaurants, we proceed in the same way: we identify the name of the corresponding class in the webpage.

<p>
<img title = "Class of a restaurant" src="images/restaurant_link.png"/>
</p>

The restaurant URLs follow a consistent path pattern, which the regular expression in the code below matches:

```bash
BASE_URL/en/region/city/restaurant/name_of_restaurant
```


In [4]:
total_urls = [] #save all urls
base = 'https://guide.michelin.com' #base url to use

In [5]:
pattern = re.compile(r'^/en/[^/]+/[^/]+/restaurant/[^/]+$') # pattern of restaurant paths, compiled once

for p in pages: # loop over all pages
    response = requests.get(p) # get the page
    soup = BeautifulSoup(response.content, "html.parser") # parse the HTML with BeautifulSoup
    links = soup.select('a.link') # select every <a> element with class 'link'
    restaurant_links = [base+link.get('href') for link in links if pattern.match(link.get('href', ''))] # keep only the restaurant links
    total_urls.append(restaurant_links) # one list of URLs per page

Now we save all the URLs to a text file called 'restaurant_urls.txt', with a page-number label before each page's URLs.

In [None]:
with open('dataset/restaurant_urls.txt', 'w') as f: 
    page_count = 1  # Initialize the page count
    for urls in total_urls:
        f.write(f'{page_count}\n')  # Add a label for the page number
        for url in urls: # Write each URL from the current page
            f.write(f'{url}\n')  
        
        page_count += 1 # Increment the page count
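Because the file interleaves page labels with URLs, reading it back means telling the two kinds of line apart. The following is a minimal sketch, assuming exactly the layout written above (a line holding only the page number, followed by that page's URLs). `read_urls_by_page` is a hypothetical helper for illustration and is not taken from `crawler.py`.

```python
def read_urls_by_page(lines):
    """Group URLs under the page label that precedes them."""
    pages = {}
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.isdigit():          # a page label such as "1"
            current = int(line)
            pages[current] = []
        elif current is not None:   # a URL belonging to the current page
            pages[current].append(line)
    return pages


# Made-up sample lines in the same layout as restaurant_urls.txt
sample = [
    "1\n",
    "https://example.com/a\n",
    "https://example.com/b\n",
    "2\n",
    "https://example.com/c\n",
]
print(read_urls_by_page(sample))
```

Since the function only iterates over lines, an open file handle can be passed in place of the sample list.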

In [11]:
print(sum([len(u) for u in total_urls])) # how many restaurants we got

1983


## <strong> 1.2. Crawl Michelin restaurant pages

Now we download the HTML of every URL and save it to disk, with one subfolder per listing page.

In [None]:
crawler = Crawler()
crawler.save_all_as_html('dataset/restaurant_urls.txt') # See actual implementation inside 'crawler.py'

In [3]:
path = 'restaurants_html'
count = crawler.count_files(path)
print('file count:', count)

file count: 1983


The `save_all_as_html` function uses multi-threading, running approximately 20 threads concurrently. <br> Within the loop over each page, every thread downloads roughly one URL. <br> As a result, the function downloaded all 1983 files in under one minute.<br> We also sent randomized request headers to the server; the implementation is in `crawler.py`.
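The threading idea can be sketched with the standard library's `ThreadPoolExecutor`. This is only an illustration under stated assumptions, not the code in `crawler.py`: `download_all` and `fake_fetch` are hypothetical names, and the fetch function is passed in so that the example runs offline.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def download_all(urls, fetch, max_workers=20):
    """Fetch every URL concurrently; return ({url: content}, {url: error})."""
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(fetch, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                results[url] = future.result()
            except Exception as exc:  # keep going if one download fails
                errors[url] = exc
    return results, errors


def fake_fetch(url):
    """Stand-in for a real HTTP request so the sketch runs offline."""
    if url.endswith("/broken"):
        raise ValueError("simulated failure")
    return f"<html>{url}</html>"


pages, failed = download_all(
    ["https://example.com/a", "https://example.com/b", "https://example.com/broken"],
    fake_fetch,
)
print(len(pages), "downloaded,", len(failed), "failed")
```

In a real run, `fetch` could be a small function that calls `requests.get(url, headers=...)` and returns the response text. Collecting errors per URL makes it easy to retry only the failed downloads.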

## <strong> 1.3 Parse downloaded pages

The list of the information we desire for each restaurant and their format is as follows:

    Restaurant Name (to save as restaurantName): string;
    Address (to save as address): string;
    City (to save as city): string;
    Postal Code (to save as postalCode): string;
    Country (to save as country): string;
    Price Range (to save as priceRange): string;
    Cuisine Type (to save as cuisineType): string;
    Description (to save as description): string;
    Facilities and Services (to save as facilitiesServices): list of strings;
    Accepted Credit Cards (to save as creditCards): list of strings;
    Phone Number (to save as phoneNumber): string;
    URL to the Restaurant Page (to save as website): string.


To parse this information, we can inspect one HTML file to see how it is stored, as we did before.<br>
Most of the information can be retrieved from the following JSON-LD script at the end of each HTML file:
```js
<script type="application/ld+json">{"@context":"http://schema.org","address":{"@type":"PostalAddress","streetAddress":"Piazza Salvo d'Acquisto 16","addressLocality":"Lamezia Terme","postalCode":"88046","addressCountry":"ITA","addressRegion":"Calabria"},"name":"Abbruzzino Oltre","image":"https://axwwgrkdco.cloudimg.io/v7/__gmpics3__/f19d37d6b9da437fa06b6f9406645056.jpg?width=1000","@type":"Restaurant","review":{"@type":"Review","datePublished":"2024-09-11T07:32","name":"Abbruzzino Oltre","description":"This restaurant, the new home of young chef Luca Abbruzzino, occupies the first floor of a historic palazzo in the town centre which has recently been converted into a small hotel offering six ...","author":{"@type":"Person","name":"Michelin Inspector"}},"telephone":"+39 0968 188 8038","knowsLanguage":"en-IT","acceptsReservations":"No","servesCuisine":"Contemporary","url":"https://guide.michelin.com/en/calabria/lamezia-terme/restaurant/abbruzzino-oltre","currenciesAccepted":"EUR","paymentAccepted":"American Express credit card, Credit card / Debit card accepted, Mastercard credit card, Visa credit card","award":"Selected: Good cooking","brand":"MICHELIN Guide","hasDriveThroughService":"False","latitude":38.9770969,"longitude":16.3202202,"hasMap":"https://www.google.com/maps/search/?api=1&query=38.9770969%2C16.3202202"}</script>
```
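A minimal sketch of reading such a script block, using only the standard library. The helper names `extract_ld_json` and `to_record` are hypothetical and do not come from the `Parser` class, and the sample HTML is a shortened copy of the block above. Fields that are absent from the JSON-LD (such as the price range and the facilities) or truncated there (the description ends in "...") have to be read from other elements of the page.

```python
import json
import re

LD_JSON = re.compile(
    r'<script type="application/ld\+json">(.*?)</script>', re.DOTALL
)


def extract_ld_json(html):
    """Return the first JSON-LD block of the page as a dict, or None."""
    match = LD_JSON.search(html)
    return json.loads(match.group(1)) if match else None


def to_record(data):
    """Map a few JSON-LD keys to the column names listed above."""
    address = data.get("address", {})
    cards = data.get("paymentAccepted", "")
    return {
        "restaurantName": data.get("name"),
        "address": address.get("streetAddress"),
        "city": address.get("addressLocality"),
        "postalCode": address.get("postalCode"),
        "country": address.get("addressCountry"),
        "region": address.get("addressRegion"),
        "cuisineType": data.get("servesCuisine"),
        "phoneNumber": data.get("telephone"),
        "creditCards": [c.strip() for c in cards.split(",") if c.strip()],
    }


html = (
    '<html><body><script type="application/ld+json">'
    '{"address":{"streetAddress":"Piazza Salvo d\'Acquisto 16",'
    '"addressLocality":"Lamezia Terme","postalCode":"88046",'
    '"addressCountry":"ITA","addressRegion":"Calabria"},'
    '"name":"Abbruzzino Oltre","telephone":"+39 0968 188 8038",'
    '"servesCuisine":"Contemporary",'
    '"paymentAccepted":"Mastercard credit card, Visa credit card"}'
    '</script></body></html>'
)
record = to_record(extract_ld_json(html))
print(record["restaurantName"], "-", record["city"])
```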

<img src = "images/restaurant_page.png" />

Now we create a `parse_restaurant` function that, given an HTML file, parses all the information we need and returns it as a dictionary. We also decided to keep the region as an extra column.

In [30]:
parser = Parser()
info = parser.parse_restaurant('restaurants_html/1/la-trattoria-enrico-bartolini.html') #Test
parser.show_restaurant_info(info)

restaurantName: La Trattoria Enrico Bartolini
address: Località Badiola
city: Castiglione della Pescaia
postalCode: 58043
country: ITA
region: Tuscany
priceRange: €€€€
cuisineType: Mediterranean Cuisine, Grills
description: After a majestic picture-postcard approach via a long avenue lined with cypress trees and maritime pines, passing vineyards and Maremma cattle along the way, you finally arrive at this restaurant which serves trattoria-style cuisine full of intense, familiar and reassuring flavours. The decor here is elegant with the occasional rustic touch, while the service is of the highest level yet pleasantly friendly and informal. Welcome to Bartolini’s Maremma restaurant! Here, resident chef Bruno De Moura Cossio offers a choice of dishes with one common denominator, namely charcoal grilling. All the dishes served here have been grilled in some way, so that they have a distinctive barbecued flavour. However, although the chef’s Brazilian origins are obvious in many different 

Now we can create a TSV file with the information for all the restaurants.

In [31]:
root = 'restaurants_html'
output= 'dataset/restaurant_info.tsv'
parser.save_all_restaurant_info_to_tsv(root, output) #actual implementation in Parser class

Data saved to dataset/restaurant_info.tsv


In [32]:
df = pd.read_table('dataset/restaurant_info.tsv', index_col=0)

# <strong> Search Engine </strong>

## <strong> 2.0.0. Preprocessing the Text

Before building the search engine, we need to prepare and clean the restaurant descriptions in our dataset. To accomplish this, we created a class named Preprocessor in preprocessor.py. This class leverages the nltk library to process the text in the description column. It removes stopwords and punctuation, converts the text to lowercase, and applies stemming to reduce words to their base forms. This preprocessing step ensures that the descriptions are standardized, making them more suitable for efficient search and retrieval.

In [33]:
df = pd.read_table('dataset/restaurant_info.tsv')
preprocessor = Preprocessor()
df = preprocessor.filter(df)

[nltk_data] Downloading package stopwords to /home/pavka/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt_tab to /home/pavka/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


Now let's save the results

In [34]:
df.to_csv('dataset/restaurant_info.tsv', sep="\t")

## <strong> 2.1. Conjunctive Query

In [35]:
df = pd.read_table("dataset/restaurant_info.tsv", index_col=0)

## <strong> 2.1.1. Create Your Index!

Let's create a vocabulary that maps each word to a unique integer (term_id) and save it in a csv file.

In [36]:
all_descriptions = df['description_filtered'].str.cat(sep=' ')
all_descriptions = list(set(all_descriptions.split(" ")))
vocabulary = {word:id for id, word in enumerate(all_descriptions)}

In [37]:
# Save this vocabulary to a file with utf-8 encoding in order to be able to handle all the characters
with open('dataset/vocabulary.csv', 'w', newline='', encoding='utf-8') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(['word', 'term_id'])  # Header
    for word, id in vocabulary.items():
        writer.writerow([word, id])

Now let's create an inverted index and save it into a file.

In [38]:
vocabulary = pd.read_csv("dataset/vocabulary.csv")
df = pd.read_table("dataset/restaurant_info_filtered.tsv")
df = df[['restaurantName','description_filtered']]
df['description_filtered'] = df['description_filtered'].str.split(" ")

In [39]:
word_to_restaurants = defaultdict(list)

# Iterate through the restaurant descriptions and update the dictionary
for i, row in df.iterrows():
    for word in row['description_filtered']:
        word_to_restaurants[word].append(row['restaurantName'])

# Add the 'restaurants_containing_word' column to the vocabulary DataFrame
vocabulary['restaurants_containing_word'] = vocabulary['word'].apply(lambda x: word_to_restaurants[x])

In [40]:
vocabulary['num_restaurants'] = vocabulary['restaurants_containing_word'].apply(len)
vocabulary['idf'] = vocabulary['num_restaurants'].apply(
    lambda x: math.log(vocabulary.shape[0] / (x + 1))
)

In [41]:
vocabulary

Unnamed: 0,word,term_id,restaurants_containing_word,num_restaurants,idf
0,elvira,0,[Due Colombe],1,8.266807
1,complet,1,"[Il Grano di Pepe, Roscioli, Trattoria del Cim...",100,4.344834
2,murano,2,"[Andrea Aprea, Miseria e Nobiltà]",2,7.861342
3,horsesho,3,[Antica Osteria il Ronchettino],1,8.266807
4,ecosystem,4,[Locanda La Raia],1,8.266807
...,...,...,...,...,...
7780,conservatori,7780,"[Piano35, Le Cedrare]",2,7.861342
7781,unannounc,7781,[Contraste],1,8.266807
7782,avail,7782,"[Sale Grosso, Novo Osteria, Osteria della Foce...",208,3.617620
7783,pretti,7783,"[Marco Martini Restaurant, Doc, Corte Matilde,...",6,7.014044


In [42]:
vocabulary.to_csv("dataset/vocabulary.csv", index = False)

Finally, we keep only the mapping from `term_id` to restaurants and write it to a JSON file.

In [43]:
inverted_index = {term_id:rs for term_id, rs in zip(vocabulary['term_id'], vocabulary['restaurants_containing_word'])}
with open('dataset/inverted_index.json', 'w') as jsonfile:
    json.dump(inverted_index, jsonfile)
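With an index of this shape, a conjunctive (AND) query reduces to intersecting the posting lists of the query terms. The sketch below is an illustration with a made-up toy vocabulary and index. `conjunctive_search` is a hypothetical function and not the `SearchEngine.search` implementation.

```python
def conjunctive_search(query_terms, vocabulary, inverted_index):
    """Return the documents whose postings contain every query term."""
    postings = []
    for term in query_terms:
        term_id = vocabulary.get(term)
        if term_id is None:      # an unknown term means no document matches
            return set()
        postings.append(set(inverted_index[term_id]))
    if not postings:
        return set()
    # Intersect the smallest sets first to keep intermediate results small
    postings.sort(key=len)
    return set.intersection(*postings)


# Toy data: terms are already lowercased and stemmed
toy_vocabulary = {"pasta": 0, "fresh": 1, "terrac": 2}
toy_index = {
    0: ["Roscioli", "Da Fausto", "Casa Vicina"],
    1: ["Roscioli", "Casa Vicina"],
    2: ["Da Fausto"],
}
print(sorted(conjunctive_search(["pasta", "fresh"], toy_vocabulary, toy_index)))
```

Two details matter when applying this to the saved files: the query has to go through the same preprocessing as the descriptions, and JSON object keys are always strings, so after `json.load` the lookup needs `str(term_id)` rather than the integer.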

## <strong> 2.1.2. Execute the Query

In [5]:
df = "dataset/restaurant_info.tsv"
vocabulary = "dataset/vocabulary.csv"
inverted_index = "dataset/inverted_index.json"

searcher = SearchEngine(df,vocabulary,inverted_index)

In [3]:
ideal_restaurants = searcher.search("pasta")
ideal_restaurants

Unnamed: 0,restaurantName,address,description,website
6,Trattoria di Via Serra,via Luigi Serra 9/b,It’s well worth heading off the traditional to...,https://www.trattoriadiviaserra.it/
10,Osteria della Foce,via Eugenio Ruspoli 72r,This simple restaurant stands out for its deli...,http://www.osteriadellafocegenova.it
15,Roscioli,via dei Giubbonari 21,This restaurant is part of one of the best foo...,https://www.salumeriaroscioli.com/
18,Trattoria del Cimino dal 1895,via Filippo Nicolai 44,Situated on the hill leading to Palazzo Farnes...,https://trattoriadelcimino.jimdofree.com/
28,Da Fausto,Località Valle Prati 1,This typical restaurant with a stone façade ha...,https://www.relaisborgodelgallo.com/
...,...,...,...,...
1913,Osteria dalla Peppa,via Vecchia 8,"Already popular in the late 19C, this inn in t...",https://www.osteriadallapeppa.it
1947,Casa Vicina,via Ermanno Fenoglietti 20/b,Renowned over the years for its traditional Pi...,https://www.casavicina.com/
1956,Trattoria della Fortuna,Via Salaria 57,This interior of this trattoria occupying an a...,http://www.trattoriadellafortuna.it
1961,Osteria Ricanatti,corso Cavour 37,Enjoy delicious updated regional cuisine made ...,http://www.osteriaricanatti.it


# <strong> 2.2. Ranked Search Engine with TF-IDF and Cosine Similarity

## <strong> 2.2.1 Inverted Index with TF-IDF Scores

In [46]:
df = pd.read_table("dataset/restaurant_info.tsv")
df = df[['restaurantName','description_filtered']]

In [47]:
# Tokenize and Calculate Term Frequencies
tf = {}
for _, row in df.iterrows():
    doc_id = row['restaurantName']
    words = row['description_filtered'].split()  # Tokenize 
    word_counts = Counter(words)
    total_terms = len(words)
    tf[doc_id] = {word: count / total_terms for word, count in word_counts.items()}

In [48]:
# Calculate Document Frequency (DF) in order to calculate IDF
freq = Counter()
for word_counts in tf.values():
    for word in word_counts:
        freq[word] += 1

In [49]:
# Calculate IDF
total_docs = len(df)  # Total number of documents
idf = {word: math.log(total_docs/count + 1) for word, count in freq.items()}

In [50]:
# Calculate TF-IDF and build the inverted index
inverted_index_tfidf = defaultdict(dict)
for doc_id, word_counts in tf.items():
    for word, tf_score in word_counts.items():
        tf_idf_score = tf_score * idf[word]
        inverted_index_tfidf[word][doc_id] = tf_idf_score

In [51]:
# Save the new inverted index as a file
inverted_index_tfidf = dict(inverted_index_tfidf)
with open('dataset/inverted_index_tfidf.json', 'w') as jsonfile:
    json.dump(inverted_index_tfidf, jsonfile)


## <strong> 2.2.2. Execute the Ranked Query

<strong> EXAMPLE ON HOW TO CALCULATE COSINE SIMILARITY

**Table 1: Query TF-IDF Calculations**
| Word | Word Count | Total Words | TF | IDF Calculation | IDF Value | TF-IDF Score |
|------|------------|-------------|-----|-----------------|-----------|--------------|
| modern | 1 | 3 | 1/3 = 0.333 | log(100/1) | 4.605 | 0.333 × 4.605 = 1.534 |
| seasonal | 1 | 3 | 1/3 = 0.333 | log(100/1) | 4.605 | 0.333 × 4.605 = 1.534 |
| cuisine | 1 | 3 | 1/3 = 0.333 | log(100/1) | 4.605 | 0.333 × 4.605 = 1.534 |

Query Vector = [1.534, 1.534, 1.534]<br>
Query Vector Magnitude = √(1.534² + 1.534² + 1.534²) = 2.657

**Table 2: Document TF-IDF Calculations**
| Word | Word Count | Total Words | TF | IDF Calculation | IDF Value | TF-IDF Score |
|------|------------|-------------|-----|-----------------|-----------|--------------|
| modern | 2 | 74 | 2/74 = 0.027 | log(100/1) | 4.605 | 0.027 × 4.605 = 0.124 |
| seasonal | 1 | 74 | 1/74 = 0.014 | log(100/1) | 4.605 | 0.014 × 4.605 = 0.064 |
| cuisine | 0 | 74 | 0/74 = 0 | log(100/1) | 4.605 | 0 × 4.605 = 0 |

Document Vector = [0.124, 0.064, 0]<br>
Document Vector Magnitude = √(0.124² + 0.064² + 0²) = 0.139

**Cosine Similarity Calculation:**
```
Dot Product = (1.534 × 0.124) + (1.534 × 0.064) + (1.534 × 0)
            = 0.190 + 0.098 + 0
            = 0.288

Cosine Similarity = 0.288 / (2.657 × 0.139)
                  = 0.288 / 0.369
                  = 0.780
```
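The worked example can be checked with a few lines of Python. This is a small sketch that takes the counts and the IDF from the two tables. Without intermediate rounding the result is about 0.775, slightly below the 0.780 obtained from the rounded table values.

```python
import math

idf = math.log(100 / 1)              # same IDF for all three words
query_tf = [1 / 3, 1 / 3, 1 / 3]     # modern, seasonal, cuisine
doc_tf = [2 / 74, 1 / 74, 0 / 74]

query_vec = [tf * idf for tf in query_tf]
doc_vec = [tf * idf for tf in doc_tf]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


print(round(cosine(query_vec, doc_vec), 3))
```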

<strong> A small fact we noticed:</strong><br> if the query contains only one word, the cosine similarity computed on query-sized vectors is always 1. <br>
For example, if the query is "pasta",<br> the query vector is one-dimensional (just the TF-IDF of "pasta"), and so is the document vector restricted to the query terms. Two 1-D vectors with positive components lie on the same line and point in the same direction, so the angle between them is 0 and cos(0) = 1.

A more formal proof:

### **Cosine Similarity**
The **cosine similarity** between two vectors is defined as:  
$$\cos(\Theta) = \frac{\vec{v_1} \cdot \vec{v_2}}{\|\vec{v_1}\| \cdot \|\vec{v_2}\|}$$  
Where:  
- $\vec{v_1}$: vector representing the **QUERY**.
- $\vec{v_2}$: vector representing the **DOCUMENT**.

We analyze the first of the **two distinct cases** (the second, with vectors spanning all words, is described further below):

---
## If we build vectors of length |query| only
In this case, we calculate the cosine similarity **without normalizing with respect to the total length of the document**. The weight of the term "PASTA" is directly derived from its frequency in the query and the document.  
### **Example**  
- **QUERY** = "PASTA"
- **IDF of "PASTA"** = $X$
- **TF of QUERY**:
  $$\text{TF}_{\text{QUERY}} = \frac{1}{\text{length of query}} = 1$$
- **TF of DOCUMENT** = $Y$
#### **Vectors Representation**
- Query vector:
  $$\vec{v_1} = (X \cdot \text{TF}_{\text{QUERY}}) = (X)$$
- Document vector:
  $$\vec{v_2} = (X \cdot \text{TF}_{\text{DOCUMENT}}) = (XY)$$
#### **Cosine Similarity Calculation**
1. Dot product:
   $$\vec{v_1} \cdot \vec{v_2} = X \cdot XY = X^2 \cdot Y$$
2. Vector magnitudes (with $X > 0$ and $Y > 0$):
   $$\|\vec{v_1}\| = \sqrt{X^2} = X, \qquad \|\vec{v_2}\| = \sqrt{(XY)^2} = XY$$
3. Cosine similarity formula:
   $$\cos(\Theta) = \frac{X^2 \cdot Y}{X \cdot XY}$$
4. Simplification:
   $$\cos(\Theta) = \frac{X^2 \cdot Y}{X^2 \cdot Y} = 1$$
#### **Conclusion**
In this case, the cosine similarity is:
$$\cos(\Theta) = 1$$
This result implies a **perfect alignment** between the query and the document because we are not considering the total length of the document.



### <strong> Pseudo-algorithm: Cosine Similarity for Restaurant Search

## Steps

1. **Process Query**: 
   - Tokenize, lowercase, and stem the query words.
   - Map terms to unique IDs.

2. **Calculate Query TF-IDF**:
   - Compute term frequency (TF) and inverse document frequency (IDF) for each query term.
   - Multiply TF and IDF to get the query's TF-IDF vector.

3. **Calculate Document TF-IDF**:
   - Retrieve relevant documents using an inverted index.
   - Compute TF-IDF vectors for each document based on query terms.

4. **Compute Cosine Similarity**:
   - Normalize the query and document vectors.
   - Calculate the cosine similarity between the query and each document.

5. **Rank and Return**:
   - Sort documents by similarity score and return the top results.



We implemented two versions of cosine similarity:
- the first version takes two vectors of size |query|; it outputs very high scores because the vectors only cover the query terms.
- the second version takes two vectors of size |Total words|; it generally outputs lower scores and reaches 1 only when QUERY == DESCRIPTION.

We decided to use the second version, but the search can just as easily be run with the first one.
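The difference between the two versions can be sketched as follows. The weights are made-up numbers, and `score_query_dims` and `score_all_dims` are hypothetical names rather than the methods of `SearchEngine`. The second function spans the union of query and document terms, which gives the same cosine as a full-vocabulary vector because all other components are zero in both vectors.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def score_query_dims(query_w, doc_w):
    """Version 1: both vectors only span the query terms."""
    terms = sorted(query_w)
    return cosine([query_w[t] for t in terms],
                  [doc_w.get(t, 0.0) for t in terms])


def score_all_dims(query_w, doc_w):
    """Version 2: both vectors span every term of query and document."""
    terms = sorted(set(query_w) | set(doc_w))
    return cosine([query_w.get(t, 0.0) for t in terms],
                  [doc_w.get(t, 0.0) for t in terms])


# Toy TF-IDF weights (made-up numbers, for illustration only)
query = {"situat": 0.5, "outskirt": 0.5}
document = {"situat": 0.2, "outskirt": 0.2, "classic": 0.4, "garden": 0.4}

print(round(score_query_dims(query, document), 3))  # document terms outside the query are ignored
print(round(score_all_dims(query, document), 3))    # the extra document terms lower the score
```

In this toy case the first version returns 1 because the document's weights on the query terms are proportional to the query's, while the second version is penalized by the document's remaining terms.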

In [54]:
query_scores = searcher.get_restaurant_scores1("situat outskirt ") #Actual implementation inside searcher

In [55]:
query_scores 

Unnamed: 0,restaurantName,address,description,website,similarityScore
720,Trattoria la Rosa 1908,"via del Bosco 2, ang. via Facchini","Situated on the outskirts of Ferrara, this rel...",https://www.larosa1908.it/,1.0
21,Il Labirinto,via Corsica 224,"Situated on the outskirts of Brescia, this lon...",http://www.ristoranteillabirinto.it,1.0
411,Il Basilisco,via Bison 34,Situated in a residential district on the outs...,https://www.ristorantebasilisco.com/,1.0
590,Corbezzoli,via Altura 11 bis,Situated in the Relais Bellaria hotel on the o...,http://www.corbezzoli.com,1.0
1070,Boccadoro,via della Resistenza 49,"Situated on the outskirts of Padova, this clas...",https://www.boccadoro.it/,1.0


In [58]:
query_scores2 = searcher.get_restaurant_scores2("situat outskirt") #Actual implementation inside searcher

In [59]:
query_scores2

Unnamed: 0,restaurantName,address,description,website,similarityScore
21,Il Labirinto,via Corsica 224,"Situated on the outskirts of Brescia, this lon...",http://www.ristoranteillabirinto.it,0.3531
1070,Boccadoro,via della Resistenza 49,"Situated on the outskirts of Padova, this clas...",https://www.boccadoro.it/,0.27485
411,Il Basilisco,via Bison 34,Situated in a residential district on the outs...,https://www.ristorantebasilisco.com/,0.248176
590,Corbezzoli,via Altura 11 bis,Situated in the Relais Bellaria hotel on the o...,http://www.corbezzoli.com,0.242091
720,Trattoria la Rosa 1908,"via del Bosco 2, ang. via Facchini","Situated on the outskirts of Ferrara, this rel...",https://www.larosa1908.it/,0.203599


In [60]:
query_scores2_test = searcher.get_restaurant_scores2("Minimalist decor and clean lines characterise this restaurant decorated in dark tones, where the two owner-chefs have both worked in various Michelin-starred restaurants (and others) over the years. Modern cuisine which showcases seasonal ingredients in regional yet refined dishes")
query_scores2_test

Unnamed: 0,restaurantName,address,description,website,similarityScore
36,Retrobottega,via della Stelletta 4,Minimalist decor and clean lines characterise ...,https://www.retro-bottega.com,1.0


# <strong> 3. Define a New Score!

 <strong> Steps:<br>
    - User Query: The user provides a text query. We’ll retrieve relevant documents using the search engine built in Step 2.1.<br>
    - New Ranking Metric: After retrieving relevant documents, we’ll rank them using a new custom score. Instead of limiting the scoring to only the description field, we can include other attributes like priceRange, facilitiesServices, and cuisineType.<br>
    - You will use a heap data structure (e.g., Python’s heapq library) to maintain the top-k restaurants.


### **We create the class `EnhancedSearchEngine` to search for the top-k restaurants using our score**

### **We also create the class `RestaurantSearchInterface` to display an interface (using HTML and ipywidgets package)**

In the `EnhancedSearchEngine` we use the search engine from Step 2.1 and add a custom score based on the query, cuisine, facilities and price range.
We give all four components the same weight, so that each has the same relevance.
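An equal-weight score combined with a heap-based top-k selection could look like the sketch below. Everything in it is an assumption for illustration: the component definitions, the toy restaurants, and the names `custom_score` and `top_k` are hypothetical and may differ from what `EnhancedSearchEngine` actually does.

```python
import heapq


def custom_score(restaurant, query_score, cuisine=None, facilities=(), price=None):
    """Average four equally weighted components, each in [0, 1]."""
    cuisine_match = 1.0 if cuisine and cuisine in restaurant["cuisineType"] else 0.0
    wanted = set(facilities)
    facility_match = (
        len(wanted & set(restaurant["facilitiesServices"])) / len(wanted)
        if wanted else 0.0
    )
    price_match = 1.0 if price and restaurant["priceRange"] == price else 0.0
    return (query_score + cuisine_match + facility_match + price_match) / 4


def top_k(scored, k):
    """Keep the k best (score, name) pairs with a min-heap of size k."""
    heap = []
    for item in scored:
        if len(heap) < k:
            heapq.heappush(heap, item)
        elif item > heap[0]:             # better than the worst kept item
            heapq.heapreplace(heap, item)
    return sorted(heap, reverse=True)


# Toy restaurants (made-up data); query_score=1.0 assumes every
# retrieved document already matched the conjunctive query
restaurants = [
    {"restaurantName": "A", "cuisineType": "Seafood",
     "facilitiesServices": ["Terrace"], "priceRange": "€€€"},
    {"restaurantName": "B", "cuisineType": "Sicilian",
     "facilitiesServices": ["Terrace", "Car park"], "priceRange": "€€"},
    {"restaurantName": "C", "cuisineType": "Sicilian",
     "facilitiesServices": [], "priceRange": "€€"},
]
scored = [
    (custom_score(r, 1.0, cuisine="Sicilian", facilities=["Terrace"], price="€€"),
     r["restaurantName"])
    for r in restaurants
]
print(top_k(scored, 2))
```

Keeping a min-heap of size k means each candidate costs at most O(log k), so the full result list never has to be sorted.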


The `RestaurantSearchInterface` is only an interface that makes searching for a restaurant faster: it lists all the available options for price range and cuisine type, allows selecting more than one facility, and offers a slider to choose how many results to show.

In [2]:
df_path = "dataset/restaurant_info.tsv"

In [3]:
# Function to create and display the interface
def create_search_interface(enhanced_searcher, df_path):
    """
    Create and display the search interface
    
    Parameters:
    enhanced_searcher: EnhancedSearchEngine instance
    df_path: path to the restaurant TSV file
    """
    search_interface = RestaurantSearchInterface(enhanced_searcher, df_path)
    search_interface.display_interface()
    return search_interface

In [6]:
enhanced_searcher = EnhancedSearchEngine(
    original_file=df_path,
    vocabulary_file=vocabulary,
    inverted_index_file=inverted_index
)

# Create and display the search interface
interface = create_search_interface(enhanced_searcher, df_path)

VBox(children=(HTML(value='<h2>Restaurant Search</h2>'), HBox(children=(Text(value='', description='Search:', …

# <strong> 4. Visualizing the Most Relevant Restaurants
Maps can provide users with an easy way to see where restaurants are located. This is especially useful for understanding which regions in Italy have more options.

### Steps for Visualization:

1. **Geocode Locations**: Collect information on unique restaurant locations in Italy (in the format of `City` and `Region`). You can use tools such as Google API, OpenStreetMap, or a pre-defined list to retrieve representative coordinates for each region.
   
2. **Ask a Large Language Model (LLM)**: Alternatively, you can compile a list of unique cities and regions in Italy, formatted as `(City, Region)`, and ask an LLM (e.g., ChatGPT) to provide coordinates for these locations. This can be an efficient way to gather data without using API calls. Just make sure that the retrieved information is correct and helpful.

3. **Map Setup**: Use a mapping library like `plotly` or `folium` to create a visual display of restaurants by region.

4. **Encoding Price Ranges**: Incorporate a visual representation for price ranges:
   - Use color-coding or marker size to represent the restaurant’s price range (`€`, `€€`, `€€€`, `€€€€`).
   - Include a legend for interpreting price levels.

5. **Plot Top-K Restaurants**: Use the custom score from Step 3 to select the top-k restaurants for display.

This map will give users an overview of restaurant options across different regions in Italy, with an indication of cost based on visual cues.

### First we create the class RestaurantMapVisualizer to create and display the list of top-k restaurants
For the visualization we use `plotly` and we use the variables latitude and longitude from `restaurant_info.tsv` to take the coordinates of the restaurant.  
We reuse `EnhancedSearchEngine` to select only the top restaurants.  


To complete the exercise, we used a Mapbox token. However, for privacy reasons, we do not include it on GitHub.  
To obtain a new token, register on [Mapbox's official website](https://www.mapbox.com/).  
For simplicity, we have included a video demonstrating the map's usage in the README file.

In [None]:
# Your Mapbox token
mapbox_token = "" #HERE PUT THE TOKEN

# Function to create and display the map visualizer
visualizer = RestaurantMapVisualizer(enhanced_searcher, df_path, mapbox_token)


# Create and display the map 
visualizer.plot_restaurants(
    top_k=20,
    query="pasta"
)


In [8]:
# Example with more arguments
visualizer.plot_restaurants(
    top_k=20,
    query="modern",
    cuisine = 'Italian Contemporary, Creative',
    facilities = 'Air conditioning',
    price_range = '€€€'
)

# <strong> 5. BONUS: Advanced Search Engine

## <strong> 5.1 - Specify Search Criteria: Users can specify search terms for the following features (any or all of them):
    - restaurantName
    - city
    - cuisineType


Let's first create an inverted index for each of the features. We can do this easily by writing one function and calling it for each column we want.


In [9]:
df = pd.read_table("dataset/restaurant_info.tsv")

In [40]:
def createIndex(df, column):
    # Convert all unique words to lowercase and apply stemming
    stemmer = PorterStemmer()
    df[column] = df[column].fillna('na')
    all_words = pd.Series(" ".join(df[column].str.lower()).split()).unique()
    all_words_stemmed = [stemmer.stem(word) for word in all_words]
    
    # Initialize the dictionary for stemmed words
    word_to_restaurants = {word: [] for word in all_words_stemmed}
    
    # Split each row's text, stem each word, and update the index
    for i, row in df.iterrows():
        words = row[column].lower().split()
        for word in words:
            stemmed_word = stemmer.stem(word)
            word_to_restaurants[stemmed_word].append(row['restaurantName'])
    
    # Save to JSON
    path = f'dataset/{column}_index.json'
    with open(path, 'w') as jsonfile:
        json.dump(word_to_restaurants, jsonfile)

In [None]:
createIndex(df, 'restaurantName')
createIndex(df, 'city')
createIndex(df, 'cuisineType')

### <strong> 5.2 - Price Range Filter: Allow users to set a price range (e.g., between € and €€€) to filter the results by affordability.

In [29]:
createIndex(df, 'priceRange')
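A range such as "between € and €€€" can be expressed by counting the euro signs and comparing the resulting levels. The sketch below is one possible approach with hypothetical helper names (`price_level`, `filter_by_price`). It is not taken from `AdvancedSearchEngine`, and the sample rows reuse three restaurants from the results shown in this notebook.

```python
def price_level(price_range):
    """Convert '€€€' to 3 by counting the euro signs."""
    return price_range.count("€")


def filter_by_price(restaurants, low="€", high="€€€€"):
    """Keep restaurants whose price level lies between low and high, inclusive."""
    lo, hi = price_level(low), price_level(high)
    return [r for r in restaurants if lo <= price_level(r["priceRange"]) <= hi]


sample = [
    {"restaurantName": "Me Cumpari Turiddu", "priceRange": "€"},
    {"restaurantName": "Ménage", "priceRange": "€€"},
    {"restaurantName": "Sapio", "priceRange": "€€€€"},
]
print([r["restaurantName"] for r in filter_by_price(sample, "€", "€€€")])
```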

### <strong> 5.3 - Region Filter: Enable users to specify a list of Italian regions to limit the search to restaurants within those regions.

In [30]:
createIndex(df, 'region')

### <strong> 5.4 - Accepted Credit Cards: Provide an option to filter by accepted credit card types. Users can specify one or more preferred card types (e.g., Visa, MasterCard, Amex).

In [41]:
createIndex(df, 'creditCards')

### <strong> 5.5 - Services and Facilities: Allow users to filter based on specific services and facilities provided by the restaurant. For example, users may look for amenities like Wi-Fi, Terrace, Air Conditioning, or Parking.

In [42]:
createIndex(df, 'facilitiesServices')

## <strong> Now we can simply use the AdvancedSearchEngine

In [9]:
advanced = AdvancedSearchEngine()
query = dict(
    city = 'catania',
    region = 'sicily'
)
advanced.search(**query)

Unnamed: 0,restaurantName,address,cuisineType,priceRange,website
12,Angiò-Macelleria di Mare,viale Africa 28/h,Seafood,€€,https://albertoangiolucci.it/
563,Sapio,piazza Antonino Gandolfo 11,"Sicilian, Modern Cuisine",€€€€,https://www.sapiorestaurant.it/
996,Me Cumpari Turiddu,piazza Turi Ferro 36,Sicilian,€,https://www.mecumparituriddu.it/
1054,Ménage,Via Euplio Reina 13,"Sicilian, Contemporary",€€,https://www.menagelounge.it/
1550,Concezione Restaurant,via Giuseppe Verdi 143,"Creative, Sicilian",€€€,https://concezionerestaurant.com/
1779,Materia | Spazio Cucina,via Teatro Massimo 29,"Sicilian, Modern Cuisine",€€,https://www.materiaspaziocucina.it/
1934,Coria,Via Prefettura 21,"Italian Contemporary, Sicilian",€€€,https://www.ristorantecoria.it
