<h1> 2.0.0 Preprocessing the Text <img src= "https://cdn-icons-png.flaticon.com/256/2857/2857376.png" width=100 style="vertical-align: middle"> </h1>

### Before building the search engine, you must clean and prepare the text in each restaurant’s description. We will:

- #### 1. Remove stopwords ;
- #### 2. Remove punctuation ;
- #### 3. Apply stemming ;
- #### 4. Perform any other necessary cleaning to improve search accuracy.

In [1]:
import pandas as pd
data=pd.read_csv('all_restaurants_data.csv')
data.head()
#data.shape[0]

Unnamed: 0,restaurantName,address,city,postalCode,country,priceRange,cuisineType,description,facilitiesServices,creditCards,phoneNumber,website
0,O Me O Il Mare,Via Roma 45/47,Gragnano,80054,Italy,€€€€,"Italian Contemporary, Modern Cuisine",After many years’ experience in Michelin-starr...,"['Air conditioning', 'Interesting wine list', ...","['Amex', 'Dinersclub', 'Mastercard', 'Visa']",+39 081 620 0550,http://omeoilmare.com
1,Donevandro,via Garibaldi 2,Popoli,65026,Italy,€€,"Contemporary, Seasonal Cuisine","Up until a few years ago, the owner-chef at th...",['Air conditioning'],"['Mastercard', 'Visa']",+39 388 887 6858,http://www.donevandroristorante.it
2,Ape Vino e Cucina,Piazza Risorgimento 3,Alba,12051,Italy,€€,"Piedmontese, Contemporary",This attractive restaurant in the heart of Alb...,"['Air conditioning', 'Terrace', 'Wheelchair ac...","['Amex', 'Dinersclub', 'Maestrocard', 'Masterc...",+39 0173 363453,https://www.apewinebar.it/alba/
3,Da Bob Cook Fish,largo Parsano vecchio 16,Sorrento,80067,Italy,€€,Seafood,Working in partnership with the nearby fishmon...,"['Air conditioning', 'Terrace']","['Amex', 'Dinersclub', 'Mastercard', 'Visa']",+39 081 1778 3873,https://www.dabobcookfish.com/
4,DA_MÓ,Via Bruno Buozzi 20,Matera,75100,Italy,€€,"Regional Cuisine, Contemporary","This new, restored restaurant in the upper par...","['Air conditioning', 'Terrace']","['Amex', 'Dinersclub', 'Mastercard', 'Visa']",+39 0835 686548,https://www.damoristorante.it/


In [2]:
# Mapping price_range for more readability
price_range_labels = {'€': 'Economic',  '€€': 'Affordable','€€€': 'Expensive','€€€€': 'Luxury'}

data['priceRange'] = data['priceRange'].map(price_range_labels)

In [3]:
data.head()

Unnamed: 0,restaurantName,address,city,postalCode,country,priceRange,cuisineType,description,facilitiesServices,creditCards,phoneNumber,website
0,O Me O Il Mare,Via Roma 45/47,Gragnano,80054,Italy,Luxury,"Italian Contemporary, Modern Cuisine",After many years’ experience in Michelin-starr...,"['Air conditioning', 'Interesting wine list', ...","['Amex', 'Dinersclub', 'Mastercard', 'Visa']",+39 081 620 0550,http://omeoilmare.com
1,Donevandro,via Garibaldi 2,Popoli,65026,Italy,Affordable,"Contemporary, Seasonal Cuisine","Up until a few years ago, the owner-chef at th...",['Air conditioning'],"['Mastercard', 'Visa']",+39 388 887 6858,http://www.donevandroristorante.it
2,Ape Vino e Cucina,Piazza Risorgimento 3,Alba,12051,Italy,Affordable,"Piedmontese, Contemporary",This attractive restaurant in the heart of Alb...,"['Air conditioning', 'Terrace', 'Wheelchair ac...","['Amex', 'Dinersclub', 'Maestrocard', 'Masterc...",+39 0173 363453,https://www.apewinebar.it/alba/
3,Da Bob Cook Fish,largo Parsano vecchio 16,Sorrento,80067,Italy,Affordable,Seafood,Working in partnership with the nearby fishmon...,"['Air conditioning', 'Terrace']","['Amex', 'Dinersclub', 'Mastercard', 'Visa']",+39 081 1778 3873,https://www.dabobcookfish.com/
4,DA_MÓ,Via Bruno Buozzi 20,Matera,75100,Italy,Affordable,"Regional Cuisine, Contemporary","This new, restored restaurant in the upper par...","['Air conditioning', 'Terrace']","['Amex', 'Dinersclub', 'Mastercard', 'Visa']",+39 0835 686548,https://www.damoristorante.it/


In [4]:
data["description"].iloc[0].split()

['After',
 'many',
 'years’',
 'experience',
 'in',
 'Michelin-starred',
 'restaurants,',
 'Luigi',
 'Tramontano',
 'and',
 'his',
 'wife',
 'Nicoletta',
 'have',
 'opened',
 'their',
 'first',
 'restaurant',
 'in',
 'the',
 'chef’s',
 'native',
 'Gargnano.',
 'Previously',
 'a',
 'pasta',
 'factory,',
 'the',
 'building',
 'has',
 'been',
 'converted',
 'into',
 'an',
 'elegant,',
 'contemporary-style',
 'restaurant',
 'which',
 'has',
 'nonetheless',
 'retained',
 'its',
 'charming',
 'high',
 'ceilings.',
 'The',
 'cuisine',
 'is',
 'inspired',
 'by',
 'regional',
 'traditions',
 'which',
 'are',
 'reinterpreted',
 'to',
 'create',
 'gourmet',
 'dishes,',
 'all',
 'prepared',
 'with',
 'respect',
 'for',
 'the',
 'ingredients',
 'used',
 'and',
 'a',
 'strong',
 'focus',
 'on',
 'local',
 'produce.']

 we have imported the dataset "all_restaurant_data.csv"

### For the Preprocess Operation of all description we use the `nltk` library

In [5]:
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer, PorterStemmer
from nltk.tokenize import word_tokenize
import re
import nltk
from tqdm import tqdm  # Import tqdm for the progress bar
from nltk.stem.snowball import SnowballStemmer

# Download required NLTK resources
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')

# Initialize stop words, stemmer
stop_words = set(stopwords.words('english'))
stemmer = SnowballStemmer("english")


def text_cleaning(text):
    
    text=text.lower()

    # Remove special characters and symbols (non-alphanumeric characters)
    text = re.sub(r'[^\w\s]', '', text)

    # Remove extra spaces and numbers
    text = re.sub(r'\s+', '  ', text)
    text = re.sub(r'\d+', ' ', text)

    # Tokenize the text (split it into individual words)
    words = word_tokenize(text)

    # Apply filters, remove stopwords, exclude verbs, and apply stemmmer
    processed_words = [
        stemmer.stem(word)  
        for word in words
        if word not in stop_words and nltk.pos_tag([word])[0][1] not in ["VB", "VBD", "VBG", "VBN", "VBP", "VBZ"]  ] # remove all verbs from description


    processed_text = ' '.join(processed_words).strip()

    return processed_text


# Apply the preprocessing function to your data (e.g., data['description'])
tqdm.pandas(desc="Processing descriptions")  # Setup tqdm progress bar
data['description2'] = data['description'].progress_apply(text_cleaning)


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\flavi\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\flavi\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\flavi\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\flavi\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
Processing descriptions: 100%|██████████| 1983/1983 [00:15<00:00, 125.37it/s]


### We have done preprocessing on the restaurant descriptions using the nltk library, following these steps:

* ### Removal of stopwords: We imported English stopwords and kept only words that are not among them ;
* ### Removal of punctuation ;
* ### Application of stemming: We used the PorterStemmer to reduce words to their root ;
* ### Other filters: We converted words to lowercase and accepted only alphabetical words.

At the end these operations were applied to the 'description' column of the dataset, and the result was saved in a new column 'description2', ready for use in the new search engine

In [6]:
# Set display options to show full text in each cell
pd.set_option('display.max_colwidth', None)

# Display the first few rows of 'description2'
print(data[['description2']].iloc[0])

description2    mani year experi restaur luigi tramontano wife nicoletta first restaur chef nativ gargnano previous pasta factori build eleg contemporarystyl restaur nonetheless high ceil cuisin region tradit creat gourmet dish prepar respect ingredi strong focus local produc
Name: 0, dtype: object


In [7]:
#create a set of all noduplicated words in the variable 'description2'
vocabulary=set()
for text in data['description2']:
    vocabulary.update(text.split())

#create a dictionary_word with all noduplicated words as keys and the associated term_id as values
dictionary_word={}
for word,word_id in enumerate(vocabulary,start=0):
    dictionary_word[word]=word_id

In [8]:
#save dictionary_word in a csv file named "vocabulary.csv"
dictionary_word_df=pd.DataFrame(dictionary_word.items(),columns=['word_id','word'])
dictionary_word_df.to_csv('vocabulary.csv',index=False)

In [9]:
#show first rows of vocabulary.csv
vocabulary=pd.read_csv('vocabulary.csv')
vocabulary.head()

Unnamed: 0,word_id,word
0,0,transform
1,1,cera
2,2,tell
3,3,prosdocimi
4,4,class


## 2.1 Conjunctive Query
#### This first version of the search engine narrows the search to the description field of each restaurant. Only restaurants whose descriptions contain all the query words will be returned.

## 2.1.1 Create Your Index!
* ##### Vocabulary File: Create a file called vocabulary.csv that maps each word to a
unique integer (term_id);
* ##### Inverted Index: Build a dictionary_word mapping each term_id to a list of document IDs where that term appears.
{
  "term_id_1": [document_1, document_2, document_4],
  "term_id_2": [document_1, document_3, document_5],
  ...
}

#### **Each document_i represents a unique restaurant.**

### We've created a vocabulary of words from the description2 column. We assigned a unique term_id to each word and saved it to vocabulary.csv. This step is necessary for the construction of the inverted index.

In [10]:
# Initialize the inverted index and dictionary_word for restaurant names
inverted_index = {}
dictionary_word_restaurant = {}

# Create dictionary_word for term to term_id mapping
# Assuming 'dictionary_word' will store words as keys and their corresponding term_ids as values
# Populate the dictionary_word for each unique restaurant name
for document_id, restaurant_name in enumerate(data['restaurantName'], start=0):
    dictionary_word_restaurant[restaurant_name] = document_id  # Map restaurant name to document id

# Create vocabulary from the 'description2' column (unique words)
vocabulary = set()  # Set to store unique words
for desc in data['description2']:
    # Tokenize the description and add unique words to the vocabulary
    vocabulary.update(desc.split())

# Create dictionary_word for terms (words) to term_ids mapping
dictionary_word = {word: term_id for term_id, word in enumerate(vocabulary, start=0)}

# Initialize the inverted index from the 'description2' column
for i in range(len(data)):
    # Extract document_id from dictionary_word_restaurant using the restaurant name
    document_id = dictionary_word_restaurant[data['restaurantName'].iloc[i]]
    words = data['description2'].iloc[i].split()  # Split description2 into words
    
    # For each word in the description2
    for word in words:
        if word in dictionary_word:  # Ensure the word exists in the dictionary_word
            term_id = dictionary_word[word]  # Extract term_id from dictionary_word
            
            # Initialize the inverted index entry for this term_id if not present
            if term_id not in inverted_index:
                inverted_index[term_id] = []
            
            # Add the document_id to the inverted index for this term_id if it's not already present
            if document_id not in inverted_index[term_id]:
                inverted_index[term_id].append(document_id)

# Print the first word entries of the vocabulary (word, term_id pairs)
print("Vocabulary:", list(dictionary_word.items())[:1])

# Show the first word entries of the inverted index for inspection
print("Inverted Index:", list(inverted_index.items())[:1])

# Check if the number of unique term_ids in vocabulary matches the term_ids in the inverted index
if len(vocabulary) == len(inverted_index):
    print("The vocabulary and inverted index match.")
else:
    print("There is a mismatch between vocabulary and inverted index.")


Vocabulary: [('transform', 0)]
Inverted Index: [(1994, [0, 8, 9, 15, 24, 43, 83, 86, 89, 100, 110, 179, 191, 195, 197, 204, 235, 246, 257, 279, 284, 289, 320, 329, 331, 333, 341, 362, 369, 372, 396, 409, 452, 484, 496, 502, 512, 516, 1134, 601, 603, 614, 626, 632, 644, 676, 715, 771, 779, 791, 793, 805, 813, 819, 855, 857, 859, 860, 899, 911, 923, 930, 941, 977, 983, 1004, 1026, 1050, 1091, 1119, 1133, 1207, 1219, 1227, 1241, 1248, 1268, 1276, 1294, 1323, 1353, 1363, 1387, 1427, 1438, 1445, 1447, 1457, 1467, 1475, 1505, 1507, 1518, 1520, 1526, 1527, 1536, 1544, 1600, 1601, 1608, 1615, 1624, 1628, 1647, 1648, 1684, 1734, 1744, 1747, 1766, 1767, 1773, 1788, 1799, 1802, 1808, 1817, 1853, 1880, 1891, 1892, 1904, 1917, 1935, 1936, 1939, 1949, 1955, 1979])]
The vocabulary and inverted index match.


we've created
* a dictionary_word for restaurants, dictionary_word_restaurant, where each key is a unique restaurant name from the restaurantName column, and each value is a unique document_id.
* A dictionary_word inverted_index, which contains term_id values as keys, each linked to a list of document_id values where the words appear.

Then we checked that inverted_index contains all the words from description2 by comparing the length of vocabulary with the number of keys in inverted_index. This confirms that all words are included in the inverted index

In [11]:
import pickle
#saved inverted index in "inverted_index.pkl"
with open("inverted_index.pkl","wb") as f:
    pickle.dump(inverted_index,f)
#show inverted index
with open("inverted_index.pkl","rb") as f:
    loaded_inverted_index=pickle.load(f)

## 2.1.2 Execute the Query

#### **When the user inputs a query, for example, "modern seasonal cuisine", the search engine will:**

In [12]:
def execute_query(query):
    # Preprocess the query: remove punctuation, convert to lowercase
    query =  query.lower().split()

    # Create an empty list to store the word IDs corresponding to the query terms
    query_word_ids = []

    # Iterate over each word in the preprocessed query
    for word in query:
        # If the word exists in the dictionary (which maps words to IDs)
        if word in dictionary_word:
            # Append the corresponding word ID to the query_word_ids list
            query_word_ids.append(dictionary_word[word])

    # Check if we have any valid word IDs in the query
    if not query_word_ids:
        return "No matching words found in the dictionary."

    # Initialize the intersection list to store the document IDs that match the query
    intersection = []

    # If there are multiple query words, find the intersection of document IDs for all query terms
    if len(query_word_ids) >= 1:
        # Start with the set of document IDs corresponding to the first word in the query
        initial_set = set(inverted_index.get(query_word_ids[0], []))
        
        # For each remaining word ID in the query, find the intersection of document IDs
        for word_id in query_word_ids:
            current_set = set(inverted_index.get(word_id, []))
            intersection_set = initial_set & current_set  # Update the intersection
        
        # Convert the final intersection set to a list
        intersection = list(intersection_set)
    else:
        # If there is only one query word, use its document IDs directly
        intersection = list(set(inverted_index.get(query_word_ids[0], [])))

    # Check if the intersection is empty
    if not intersection:
        return "No documents match the query."

    # Initialize the list to store restaurant names that match the document IDs in the intersection
    restaurant_names = []

    # For each document ID in the intersection
    for doc_id in intersection:
        # Check if this document ID corresponds to any restaurant in the dictionary
        for name, rest_id in dictionary_word_restaurant.items():
            if rest_id == doc_id:
                # If the document ID matches, add the restaurant name to the list
                restaurant_names.append(name)

    # Check if any restaurants were found
    if not restaurant_names:
        return "No matching restaurants found."

    # Filter the original data to only include the restaurants in the restaurant_names list
    results = data[data["restaurantName"].isin(restaurant_names)]
    
    # Select only the relevant columns for the final results
    results = results[['restaurantName', 'address', 'description', 'website']]

    # Return the search results as a DataFrame
    return results

# Example usage
query = "Modern seasonal cuisine!"
results = execute_query(query)


In [13]:
results[:5]

Unnamed: 0,restaurantName,address,description,website
1,Donevandro,via Garibaldi 2,"Up until a few years ago, the owner-chef at this restaurant was working as a painter – a fact that is evident from the artistic touch in his cuisine. His recipes are modern and personalised, with careful attention naturally paid to harmonious presentation, while the flavour of his dishes is brought out by ingredients that are skilfully chosen from the Abruzzo inland area. In 2024, the restaurant moved to new, centrally located premises which have an intimate feel and are elegant and minimalist in style.",http://www.donevandroristorante.it
8,La Buca,corso Garibaldi 45,"Choose one of the tables on the outdoor summer terrace (right over the water!) at this restaurant in order to best appreciate its location overlooking the picturesque canal port and its period houses. The indoor dining room is also attractive, with its open-view kitchen and a modern and elegant feel, while the menu focuses almost exclusively on fish, most of which is sourced from local fishermen. Enjoy a wide selection of raw antipasti to start (some simple in their preparation, while others are more elaborate), then choose from an array of beautifully presented, modern dishes, perhaps accompanied by one of the many champagnes included on the wine list.",https://www.labucaristorante.com/
11,Il Ristorante Alain Ducasse Napoli,Via Cristoforo Colombo 45,"Alain Ducasse, one of the great names in contemporary fine dining, has arrived in Naples, opening this restaurant in the former premises of the prestigious Ristorante Il Comandante. Situated on the 9th floor of the Romeo hotel overlooking the port, the restaurant boasts fine views of Vesuvius and the Bay of Naples, especially at sunset when the colours are truly spectacular. Meanwhile, the modern dining room, decorated completely in black, is equally as stunning as its surroundings. Alessandro Lucassino, who was born in 1991 and has years of experience working with his mentor, adds a personal flavour to local recipes: by using short cooking times, he preserves the nutritious qualities and flavours of local fish and vegetables while remaining faithful to Ducasse’s “cuisine de la naturalité” philosophy.",https://theromeocollection.com/en/romeo-napoli/restaurants-bars/il-ristorante-alain-ducasse/
14,Etra,piazza De Ferrari 4,"Etra is an anagram of the Italian word “arte” – an apt name for this restaurant, which is housed in Palazzo Doria De Fornari on the beautiful and famous Piazza Ferrari. This palazzo is one of Genova’s magnificent “Rolli” palazzi (noble palaces once used to house famous visitors to the city). Despite the setting, the dining room is modern in style and adorned with elegant works of contemporary art. Here, chef Davide Cannavino consolidates his reputation with a concise menu of creative dishes: around a dozen meat and fish options inspired by the beauty that surrounds him and showcased on two tasting menus from which individual dishes can be chosen à la carte style.",https://www.etra.art/
17,20Tre,via David Chiossone 20 r,"Situated in the heart of Genoa’s historic centre, this contemporary-style restaurant focuses on just a few dishes, almost all fish-based, presented in a very modern style and in generous portions. Seasonal ingredients and market-fresh produce are the guiding philosophy here.",https://www.ristorante20tregenova.it/




* Processed query words: the function cleans and converts all query words to ensure compatibility with the inverted index
*  Conjunctive Query: an intersection of all document_ids for the query words is performed, so that only restaurants containing all the words in the query are returned
*  Output: the result is a DataFrame with the columns {restaurantName, address, description, website}

We have implemented the conjunctive search engine, and the output provides results for the query 'modern seasonal cuisine'




## 2.2 Ranked Search Engine with TF-IDF and Cosine Similarity
### For the second search engine, given a query, retrieve the top-k restaurants ranked by relevance to the query.

## 2.2.1 Inverted Index with TF-IDF Scores
- ### tfIdf Scores: Calculate TF-IDF scores for each term in each restaurant’s description.
- ### Updated Inverted Index: Build a new inverted index where each entry is a term, and the value is a list of tuples containing document IDs and TF-IDF scores.

In [14]:
def calculate_tf(data, dictionary_word_restaurant):
    """Calculate term frequency (TF) for each document and track document appearances for each word."""
    tf = {}
    dict_word_doc_id = {}

    # Iterate over each document in the dataset
    for idx, row in data.iterrows():
        # Get the document ID from the 'restaurantName'
        document_id = dictionary_word_restaurant.get(row['restaurantName'])

        # Getting the list of words from the preprocessed description
        words = row['description2'].split()

        # Initialize a dictionary for TF scores for this document
        tf[document_id] = {}

        for word in words:
            # Increment the word count in the current document's dictionary
            tf[document_id][word] = tf[document_id].get(word, 0) + 1

            # Track the document ID for the current word
            if word not in dict_word_doc_id:
                dict_word_doc_id[word] = set()
            dict_word_doc_id[word].add(document_id)

        # Normalize TF values by dividing by the total number of words in the document
        num_words = len(words)
        for word in tf[document_id]:
            tf[document_id][word] /= num_words

    return tf, dict_word_doc_id



* ### Calculated TF-IDF scores: the calculate_tf_idf function computed the TF-IDF scores for each word in the descriptions.
* ### Created updated inverted index: this inverted index uses term_id as keys and stores tuples of (document_id, tfidf_score) as values.

This updated inverted index is now ready to be used in the Ranked Search Engine with TF-IDF and Cosine Similarity


In [15]:
def calculate_tf_idf(data, dictionary_word_restaurant):
    """Calcola il TF-IDF per ciascun documento e crea l'indice invertito."""
    # Calcola TF e costruisce l'indice dei documenti per ciascuna parola
    tf, dict_word_doc_id = calculate_tf(data, dictionary_word_restaurant)
    N = len(data)  # Numero totale di documenti
    
    tf_idf = {}  # Dizionario che conterrà i punteggi TF-IDF

    # Calcola IDF e TF-IDF
    for document_id, word_tf in tf.items():
        tf_idf[document_id] = {}
        for word, tf_score in word_tf.items():
            # Calcola IDF
            idf = math.log(N / len(dict_word_doc_id[word]))
            # Calcola TF-IDF come tf * idf
            tf_idf[document_id][word] = tf_score * idf

    return tf_idf, dict_word_doc_id

In [16]:
import math
def create_inverted_index(tf_idf, dictionary_word):
    """Create an inverted index mapping terms to documents and their TF-IDF scores."""
    inverted_index = {}
    
    for document_id, word_tfidf in tf_idf.items():
        for word, tfidf_score in word_tfidf.items():
            # Get the ID for the term from the dictionary
            term_id = dictionary_word.get(word)
            
            if term_id is None:
                raise KeyError(f"Term '{word}' not found in dictionary_word.")
            
            # Add the document and the TF-IDF score to the term's list in the inverted index
            if term_id not in inverted_index:
                inverted_index[term_id] = []
            inverted_index[term_id].append((document_id, tfidf_score))
    
    return inverted_index

# Execute the TF-IDF calculation and create the inverted index
tf_idf, dict_word_doc_id = calculate_tf_idf(data, dictionary_word_restaurant)
inverted_index = create_inverted_index(tf_idf, dictionary_word)

## 2.2.2 Execute the Ranked Query

### Ranking Restaurants using Cosine Similarity

The process involves ranking restaurants based on a search query through the following steps:

#### 1. **Process the Query Terms**: The query terms are extracted and prepared.
#### 2. **Calculate TF-IDF Vectors**: Compute the TF-IDF vectors for both the query and each restaurant document.
#### 3. **Compute Cosine Similarity**: Compare the cosine similarity between the query and each restaurant’s TF-IDF vector.
#### 4. **Return Top-k Results**: Return the top-k restaurants or fewer if there are less than k matches with non-zero similarity.
#### 5. **Result Details**: Each result should include:
   - `Restaurant Name`
   - `Address`
   - `Description`
   - `Website`
   - `Similarity Score (between 0 and 1)`


### Now create restaurant_det_dict dictionary_word as a base to access restaurant details, like restaurant_name, address, description , website etc... This dictionary_word will be used in the next steps

In [17]:
#create dictionary_word 'restaurant_det_dict'
restaurant_det_dict={}

for i in range(len(data)):
    document_id=dictionary_word_restaurant[data['restaurantName'].iloc[i]]  #used dictionary_word_restaurant created previously to obtain `document_id`

    if document_id not in restaurant_det_dict:
        restaurant_det_dict[document_id]={
            "restaurantName": data['restaurantName'].iloc[i],
            "address": data['address'].iloc[i],
            "description": data['description'].iloc[i],
            "website": data['website'].iloc[i],
            "cuisineType": data['cuisineType'].iloc[i],
            "priceRange": data['priceRange'].iloc[i],
            "facilitiesServices": data['facilitiesServices'].iloc[i]    
        }

In [18]:
# Calculate TF-IDF for the query

def calculate_query_tfidf(preprocess_query, dict_word_doc_id, N):
    tf_query = {}
    # Calculate Term Frequency (TF) for each word in the query
    for word in preprocess_query:
        tf_query[word] = tf_query.get(word, 0) + 1

    number_word = len(preprocess_query)

    # Calculate TF-IDF scores for the query
    tfidf_query = {}
    for word, tf in tf_query.items():
        
        # Only include words that exist in our document collection
        if word in dict_word_doc_id:
            normalized_tf = tf / number_word  # Normalize TF by query length
            idf = math.log(N / len(dict_word_doc_id[word]))  # Calculate IDF
            tfidf_query[word] = normalized_tf * idf

    return tfidf_query

# Cosine Similarity calculation
def cosine_similarity(vector1, vector2):
    # Calculate dot product
    numerator = sum(vector1[word] * vector2.get(word, 0) for word in vector1)

    # Calculate Euclidean norms
    norm1 = math.sqrt(sum(value ** 2 for value in vector1.values()))
    norm2 = math.sqrt(sum(value ** 2 for value in vector2.values()))

    # Ensure norms are non-zero to prevent division by zero
    return numerator / (norm1 * norm2) if norm1 and norm2 else 0

# Ranking function to get top k results
def ranking_function(query, tf_idf_data, k, dict_word_doc_id, N, restaurant_det_dict):
    # Calculate TF-IDF for the query
    tfidf_query = calculate_query_tfidf(query, dict_word_doc_id, N)

    # Calculate cosine similarity for each document
    cos_sim = []
    for document_id, tfidf_vector in tf_idf_data.items():
        score = cosine_similarity(tfidf_query, tfidf_vector)
        if score > 0:  # Only consider documents with a positive similarity score
            cos_sim.append((document_id, score))

    # Sort results by similarity score in descending order and take top k
    cos_sim_sorted = sorted(cos_sim, key=lambda x: x[1], reverse=True)[:k]

    # Create the final results with details
    final_results = []
    for document_id, score in cos_sim_sorted:
        restaurant = restaurant_det_dict[document_id]
        final_results.append({
            "restaurantName": restaurant['restaurantName'],
            "address": restaurant['address'],
            "description": restaurant['description'],
            "website": restaurant['website'],
            "Similarity score": round(score, 4)
        })

    return pd.DataFrame(final_results)


In [19]:
# prepare the query
query="modern seasonal cuisine"
preprocess_query=text_cleaning(query)
k=5
#execute Ranking function
results=ranking_function(preprocess_query,tf_idf,k,dict_word_doc_id,len(data),restaurant_det_dict)

In [20]:
# Show the results:
results

Unnamed: 0,restaurantName,address,description,website,Similarity score
0,Matteo Ristorante,piazza Duomo 6,"The soft-toned decor is modern and intimate, while the cuisine also has a contemporary feel. Dishes include meat and fish options, plus numerous types of risottos, many of which feature distinctly modern ingredients. Coffee service and aperitifs in the near Laboratory, at n 10.",https://matteocaffeecucina.it/,0.4037
1,Amistà,via Cedrare 78,"Occupying a corner of the stunning Hotel Byblos, this restaurant is similar in style to the rest of the hotel with its colourful and harmonious blend of the old and the new, all of which is inspired by a real love of art. Echoing the decor, the cuisine served here plays with a combination of the traditional and the modern. The chef adds his own creative touches and interpretations to traditional recipes on three tasting menus: “Fattorie e Poderi” (meat), “Laghi e Mari” (fish), and the vegetarian “Campi e Orti”. Dishes can also be chosen à la carte style. The interesting wine list features more than 1 500 labels, including some rare options.",https://www.ristoranteamista.com/,0.1544
2,Il Cavallo Scosso,via al Duca 23/d,"A young and modern restaurant situated in a residential area about 2km from the centre, where the owner-chef offers three tasting menus:Territorio e Tradizione,Il Cavallo Scossoand the more creativeFuori Gara.",https://www.ilcavalloscosso.it/,0.0744
3,Casa Perbellini 12 Apostoli,vicolo Corticella San Marco 3,"Giancarlo Perbellini returns to his origins in this historic restaurant in his native city, where every chef from Verona (and possibly beyond) would like to work. The updated decor gives the restaurant even more appeal, as does the culinary style and philosophy showcased on three tasting menus: “Io e Silva”, which includes decidedly imaginative dishes (such as cooked and raw shellfish with a dash of soya and peppers) and is dedicated to his wife; “Io e Giorgio”, dedicated to his mentor and former restaurant owner, which features more classic recipes; and, last but not least, the completely vegetarian “L'Essenza”. Special mention must be made of the wine list, which includes an extensive selection of French labels, of which both the sommelier and Giancarlo are huge fans. As a final attraction, make sure you find time to visit the Roman ruins in the basement. We also highly recommend booking the chef's table where you can dine as a couple yet at the same time enjoy the company of the skilled chefs whom you’ll observe working together like the well-practised members of an orchestra – an extraordinary sight!",http://www.casaperbellini.com,0.0724
4,Trattoria Pennestri,via Giovanni Da Empoli 5,"An authentic neighbourhood trattoria with just one difference – it is so good that its fame has travelled far beyond the streets of Ostiense in which it is situated and so it is often crowded (as a result, booking is highly recommended). The atmosphere is simple, informal and attractive, while the cuisine focuses on Roman classics such as pasta with carbonara, cacio e pepe, amatriciana and gricia sauces, alongside a few more creative dishes.",https://trattoriapennestri.it/,0.0684


# 3.0.0 Define a New Score!

### Now, we will define a custom ranking metric to prioritize restaurants based on user queries.

## New Scoring Function:

### Define a scoring function that takes into account various attributes:

- #### `Description Match`: Give weight based on the query similarity to the description (using TF-IDF scores).
- #### `Cuisine Match`: Increase the score for matching cuisine types.
- #### `Facilities and Services`: Give more points for matching facilities/services (e.g., “Terrace,” “Air conditioning”).
- #### `Price Range`: Higher scores could be given to more affordable options based on the user’s choice.

In [21]:
import heapq
# Scoring function that incorporates description, cuisine, services, and price range
def custom_scoring(query,tfidf_query, tfidf_vector,restaurant_det_dict,document_id):
    cuisine_score=0
    servicies_score=0
    restaurant=restaurant_det_dict[document_id]
    # 1. Description Match (Cosine Similarity)
    description_score = cosine_similarity(tfidf_query, tfidf_vector)

    # 2. Cuisine Match (Boost score for matching cuisine type)
    if any(cuisine in query for cuisine in restaurant['cuisineType'].lower().split(',')):
        cuisine_score += 1  # Add a constant boost for matching cuisine type

    # 3. Facilities and Services Match (Add points for matching facilities/services)
    if restaurant['facilitiesServices']:
        if any(facility in query for facility in restaurant['facilitiesServices'].lower().split(',')):
            servicies_score += 1  # Add a boost for matching services

    # 4. Price Range (Match based on user’s budget, if provided)
    if restaurant['priceRange']=="Economic" or restaurant['priceRange']=="Affordable": 
            price_score = 1 
    else: 
        price_score=0.5
        
    return description_score*0.4 + cuisine_score*0.3 + servicies_score*0.2 + price_score*0.1

# Main Ranking Function that uses Heap for Top-k
def ranking_function_with_custom_score(query, tf_idf_data, k, dict_word_doc_id, N, restaurant_det_dict):
    # Preprocess query for TF-IDF calculation
    preprocess_query = query.lower().split()
    tfidf_query = calculate_query_tfidf(preprocess_query, dict_word_doc_id, N)

    # Use a heap to maintain the top-k restaurants based on custom score
    heap = []

    for restaurant_id, tfidf_vector in tf_idf_data.items():
        
        score = custom_scoring(query,tfidf_query, tfidf_vector,restaurant_det_dict,restaurant_id)

        # Maintain the top-k heap (min-heap)
        if len(heap) < k:
            heapq.heappush(heap, (score, restaurant_id))
        else:
            heapq.heappushpop(heap, (score, restaurant_id))

    # Extract the top-k results
    top_k_results = []
    for score, restaurant_id in heap:
        restaurant = restaurant_det_dict[restaurant_id]
        top_k_results.append({
            "restaurantName": restaurant['restaurantName'],
            "address": restaurant['address'],
            "description": restaurant['description'],
            "website": restaurant['website'],
            "Cuisine Type": restaurant.get('cuisineType', ''),
            "Facilities and Services": restaurant.get('facilitiesServices', ''),
            "Price Range": restaurant.get('priceRange', ''),
            "Custom Score": round(score, 4)
        })
    top_k_results = sorted(top_k_results, key=lambda x: x['Custom Score'], reverse=True)
    return pd.DataFrame(top_k_results)

In [22]:
# Example usage and results:
query = "modern seasonal cuisine terrace"
k = 5
results = ranking_function_with_custom_score(query, tf_idf, k, dict_word_doc_id, len(data), restaurant_det_dict)
results

Unnamed: 0,restaurantName,address,description,website,Cuisine Type,Facilities and Services,Price Range,Custom Score
0,Piccolo Lord,corso San Maurizio 69 bis/g,"Professional service in a welcoming, modern restaurant run by a young couple. He works in the kitchen while she (having also worked as a chef in the past) runs the front of house. Delicious Mediterranean cuisine with a seasonal focus.",https://www.ristorantepiccololord.it/,"Mediterranean Cuisine, Seasonal Cuisine",['Air conditioning'],Affordable,0.4484
1,Altriménti,via Monte Bianco 2/a,"An informal and contemporary restaurant which is elegant and yet retains a bistro feel. The cuisine is modern with meat and fish dishes featured on the menu, as well as a good number of vegetarian options. Good wine list.",https://altrimenti.eu/,"Modern Cuisine, Seasonal Cuisine",[],Affordable,0.4447
2,Kirchsteiger,via prevosto Wieser 5,This restaurant has long been one of the most interesting in the area. It has a pleasing combination of traditional and modern cuisine created by the expert chef. The dining rooms have the same blend of old and new with an elegant use of local materials in the decor. Ask for one of the newer guestrooms.,https://www.kirchsteiger.com/it/gastronomia/ristorante,"Classic Cuisine, Seasonal Cuisine","['Car park', 'Garden or park', 'Great view', 'Interesting wine list', 'Terrace']",Affordable,0.4358
3,Mima,via Madonnelle 9,"You’ll be won over by the seasonal Mediterranean cuisine created by the young yet experienced chef at this restaurant. Accommodation is also available in modern guestrooms, plus there’s an enchanting roof garden in which to sip an aperitif while the sun goes down.",http://www.domo20.com/restaurant,"Seasonal Cuisine, Mediterranean Cuisine","['Air conditioning', 'Great view', 'Terrace']",Affordable,0.4351
4,Trattoria 'petito,via Corridoni 14,"Located on the outskirts of Forlì’s charming historic centre, this simple, modern yet attractive restaurant serves eclectic and varied cuisine. More traditional fare on offer includes an excellent selection of hams and grilled meats, as well as delicious tagliatelle al ragù, while the menu also features a few more creative options including a few fish dishes.",https://www.trattoriapetito.it/,"Cuisine from Romagna, Seasonal Cuisine","['Air conditioning', 'Car park', 'Terrace', 'Wheelchair access']",Affordable,0.4322
