# 2.0.0) Preprocessing the Text
Before building the search engine, you must clean and prepare the text in each restaurant’s description. We will:

* Remove stopwords.
* Remove punctuation.
* Apply stemming.
* Perform any other necessary cleaning to improve search accuracy.

For this, we use the nltk library.

In [1]:
import pandas as pd
data=pd.read_csv('/content/all_restaurants_data.csv')
data.head()
#data.shape[0]

Unnamed: 0,restaurantName,address,city,postalCode,country,priceRange,cuisineType,description,facilitiesServices,creditCards,phoneNumber,website
0,O Me O Il Mare,Via Roma 45/47,Gragnano,80054,Italy,€€€€,"Italian Contemporary, Modern Cuisine",After many years’ experience in Michelin-starr...,"['Air conditioning', 'Interesting wine list', ...","['Amex', 'Dinersclub', 'Mastercard', 'Visa']",+39 081 620 0550,http://omeoilmare.com
1,Donevandro,via Garibaldi 2,Popoli,65026,Italy,€€,"Contemporary, Seasonal Cuisine","Up until a few years ago, the owner-chef at th...",['Air conditioning'],"['Mastercard', 'Visa']",+39 388 887 6858,http://www.donevandroristorante.it
2,Ape Vino e Cucina,Piazza Risorgimento 3,Alba,12051,Italy,€€,"Piedmontese, Contemporary",This attractive restaurant in the heart of Alb...,"['Air conditioning', 'Terrace', 'Wheelchair ac...","['Amex', 'Dinersclub', 'Maestrocard', 'Masterc...",+39 0173 363453,https://www.apewinebar.it/alba/
3,Da Bob Cook Fish,largo Parsano vecchio 16,Sorrento,80067,Italy,€€,Seafood,Working in partnership with the nearby fishmon...,"['Air conditioning', 'Terrace']","['Amex', 'Dinersclub', 'Mastercard', 'Visa']",+39 081 1778 3873,https://www.dabobcookfish.com/
4,DA_MÓ,Via Bruno Buozzi 20,Matera,75100,Italy,€€,"Regional Cuisine, Contemporary","This new, restored restaurant in the upper par...","['Air conditioning', 'Terrace']","['Amex', 'Dinersclub', 'Mastercard', 'Visa']",+39 0835 686548,https://www.damoristorante.it/


We have imported the dataset "all_restaurants_data.csv".

In [2]:

!pip install nltk
import nltk
nltk.download('stopwords')
nltk.download('punkt')
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
import string
def preprocessing(descr):
    #lowercase the text and split it into tokens
    words=nltk.word_tokenize(descr.lower())
    #keep only alphabetic tokens that are neither punctuation nor stopwords, and stem each of them
    preprocess_word=[]
    for i in words:
        if i not in string.punctuation and i not in stop_words and i.isalpha():
            preprocess_word.append(p_stemmer.stem(i))
    #join the stemmed words back into a single string
    return ' '.join(preprocess_word)
#stop_words contains the set of English stopwords
stop_words=set(stopwords.words('english'))
#initialize the stemmer in p_stemmer
p_stemmer=PorterStemmer()
#apply the preprocessing function to the 'description' column of the dataset 'data'
data['description2']=data['description'].apply(preprocessing)

#visualize the results
data[['description', 'description2']].head()
data.columns



[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


Index(['restaurantName', 'address', 'city', 'postalCode', 'country',
       'priceRange', 'cuisineType', 'description', 'facilitiesServices',
       'creditCards', 'phoneNumber', 'website', 'description2'],
      dtype='object')

We preprocessed the restaurant descriptions with the nltk library, following these steps:

* Removal of stopwords: We imported the English stopwords and kept only words that are not among them.
* Removal of punctuation: We discarded tokens that are punctuation marks.
* Application of stemming: We used the PorterStemmer to reduce words to their root.
* Other filters: We converted words to lowercase and accepted only alphabetical words.

Finally, these operations were applied to the 'description' column of the dataset, and the result was saved in a new column, 'description2', ready for use in the search engine.
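
The filtering steps above can be sketched on a single sentence. This is a simplified illustration that uses only the standard library: `TOY_STOPWORDS` and `toy_filter` are hypothetical names, the stopword set is a tiny stand-in for the nltk list, the regular expression only approximates `nltk.word_tokenize` (for example, it splits hyphenated words instead of dropping them), and stemming is left out.

```python
import re

# Hypothetical, tiny stopword set used only for this illustration;
# the notebook uses the full English list from nltk.corpus.stopwords.
TOY_STOPWORDS = {"the", "a", "of", "in", "and", "with", "is", "this"}

def toy_filter(text):
    """Lowercase the text, keep runs of ASCII letters, drop toy stopwords (no stemming)."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in TOY_STOPWORDS]

sample = "This restaurant, in the heart of Alba, serves modern cuisine!"
filtered = toy_filter(sample)
print(filtered)
# ['restaurant', 'heart', 'alba', 'serves', 'modern', 'cuisine']
```

In the notebook, each surviving word is additionally passed through the PorterStemmer, which is why the vocabulary contains roots such as "enclav" or "restrict".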

# 2.1 Conjunctive Query
This first version of the search engine narrows the search to the description field of each restaurant. Only restaurants whose descriptions contain all the query words will be returned.

# 2.1.1 Create Your Index!
* Vocabulary File: Create a file called vocabulary.csv that maps each word to a unique integer (term_id).
* Inverted Index: Build a dictionary mapping each term_id to a list of document IDs where that term appears.
{
  "term_id_1": [document_1, document_2, document_4],
  "term_id_2": [document_1, document_3, document_5],
  ...
}
Each document_i represents a unique restaurant.

Hint: Store the inverted index in a separate file to avoid recomputation.
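
As a minimal sketch of these two structures, the toy example below builds a vocabulary and an inverted index from three made-up, already preprocessed descriptions. The names `toy_docs`, `toy_vocab` and `toy_inverted` are hypothetical, and sorting the words before numbering them is an assumption made here so that the term_ids are the same on every run.

```python
# Toy preprocessed descriptions (hypothetical data), keyed by document_id
toy_docs = {
    0: "modern season cuisin",
    1: "modern fish",
    2: "season fish cuisin",
}

# Vocabulary: each distinct word gets a unique term_id (sorted for reproducibility)
all_words = {word for text in toy_docs.values() for word in text.split()}
toy_vocab = {word: term_id for term_id, word in enumerate(sorted(all_words))}

# Inverted index: term_id -> list of document_ids containing that term
toy_inverted = {}
for document_id, text in toy_docs.items():
    for word in set(text.split()):          # set() avoids duplicate document_ids
        toy_inverted.setdefault(toy_vocab[word], []).append(document_id)

print(toy_vocab)
print(toy_inverted[toy_vocab["fish"]])      # documents that contain "fish"
```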

In [3]:
#create a set of all the distinct words in the variable 'description2'
vocabulary=set()
for i in data['description2']:
    vocabulary.update(i.split())
#create a dictionary with the distinct words as keys and the associated term_id as values
dictionary={}
term_id=0
for i in vocabulary:
    dictionary[i]=term_id
    term_id=term_id+1
#save dictionary in a csv file named "vocabulary.csv"
word_termid=[]
for i in dictionary.items():
    word_termid.append(i)
save_dictionary=pd.DataFrame(word_termid,columns=['word','term_id'])
save_dictionary.to_csv('vocabulary.csv',index=False)
#show the last rows of vocabulary.csv
vocabulary=pd.read_csv('/content/vocabulary.csv')
vocabulary.tail()


Unnamed: 0,word,term_id
6944,marko,6944
6945,enclav,6945
6946,barral,6946
6947,return,6947
6948,restrict,6948


We've created a vocabulary of words from the description2 column. We assigned a unique term_id to each word and saved it to vocabulary.csv. This step is necessary for the construction of the inverted index.

In [5]:
import pickle
#create a dictionary for restaurants with the restaurant name as key and a document_id as value
dictionary_restaurant={}
document_id=0
for i in data['restaurantName'].unique():
    dictionary_restaurant[i]=document_id
    document_id=document_id+1
#initialize inverted_index
inverted_index={}
#for each row of data (one restaurant), extract the document_id and the words of description2
for i in range(len(data)):
    document_id=dictionary_restaurant[data['restaurantName'].iloc[i]]  #look up the document_id assigned to this restaurant name
    words=data['description2'].iloc[i].split()  #split description2 into words
    for word in words: #for each word read from description2 (a separate name avoids shadowing the row index i)
        term_id=dictionary[word]  #extract term_id from dictionary[word]
        if term_id not in inverted_index: #if term_id isn't in inverted_index yet
            inverted_index[term_id]=[] #add a new entry
        if document_id not in inverted_index[term_id]: #if document_id isn't in inverted_index[term_id] yet
            inverted_index[term_id].append(document_id) #append document_id to the list of inverted_index[term_id]
# show the vocabulary and the inverted index built in point 2.1.1
print("Vocabulary:",word_termid[:5])
termid_documentids= []
for i in inverted_index.items():
    termid_documentids.append(i)
print("Inverted Index:",termid_documentids[:5])
#check that inverted_index contains all the words of the description2 variable
len(vocabulary)==len(sorted(inverted_index.keys()))


Vocabulary: [('kombucha', 0), ('zabaion', 1), ('read', 2), ('angelo', 3), ('tunnel', 4)]
Inverted Index: [(5259, [0, 8, 9, 15, 24, 43, 83, 86, 89, 100, 110, 179, 191, 195, 197, 204, 235, 245, 256, 278, 283, 288, 319, 328, 330, 332, 340, 361, 368, 371, 395, 408, 451, 483, 495, 501, 511, 515, 299, 598, 600, 611, 623, 628, 640, 672, 710, 763, 771, 783, 785, 797, 805, 811, 845, 847, 849, 850, 889, 901, 913, 919, 930, 966, 972, 993, 1015, 1038, 1077, 1104, 1117, 1189, 1201, 1209, 1222, 1229, 1249, 1257, 1275, 1300, 1329, 1339, 1362, 1401, 1412, 1419, 1421, 1431, 1441, 335, 1477, 1479, 1490, 1492, 1498, 1499, 1508, 1515, 1570, 1571, 1578, 1585, 1594, 1598, 1617, 1618, 1654, 1702, 1712, 1715, 1733, 1734, 1739, 1754, 1765, 1768, 1773, 1782, 269, 1842, 1853, 1854, 1866, 1879, 1897, 74, 1900, 1910, 1915, 1939]), (5535, [0, 1, 3, 11, 12, 43, 46, 70, 73, 86, 100, 101, 103, 104, 107, 179, 185, 191, 196, 212, 221, 229, 235, 288, 293, 312, 313, 328, 348, 358, 364, 394, 403, 410, 417, 433, 453, 455, 4

True

We've created:
* A dictionary for restaurants, dictionary_restaurant, where each key is a unique restaurant name from the restaurantName column, and each value is a unique document_id.
* A dictionary, inverted_index, which contains term_id values as keys, each linked to a list of the document_id values where the word appears.

Then we checked that inverted_index contains all the words from description2 by comparing the length of vocabulary with the number of keys in inverted_index. The two lengths match, so every word in the vocabulary has an entry in the inverted index.
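
One consequence of keying document_ids by name is worth keeping in mind: two rows with the same restaurantName share a single document_id. The sketch below illustrates this on made-up rows (the names `toy_rows`, `name_ids`, `ids_by_name` and `ids_by_row` are hypothetical) and contrasts it with using the row position as the id, which is one possible alternative rather than what the notebook does.

```python
# Hypothetical rows (name, city): two different restaurants share a name
toy_rows = [("Trattoria A", "City X"), ("Osteria B", "City Y"), ("Trattoria A", "City Z")]

# Name-keyed ids, in the style of dictionary_restaurant
name_ids = {}
for name, _city in toy_rows:
    name_ids.setdefault(name, len(name_ids))   # assign the next free id on first sight

ids_by_name = [name_ids[name] for name, _city in toy_rows]
ids_by_row = list(range(len(toy_rows)))        # alternative: one id per row

print(ids_by_name)  # [0, 1, 0] -> rows 0 and 2 collapse into one document
print(ids_by_row)   # [0, 1, 2]
```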

In [6]:
import pickle
#save the inverted index in "inverted_index.pkl"
with open("inverted_index.pkl","wb") as f:
    pickle.dump(inverted_index,f)
#reload the inverted index from the file
with open("inverted_index.pkl","rb") as f:
    loaded_inverted_index=pickle.load(f)
#show the first 5 elements of inverted index
print("inverted index:",list(loaded_inverted_index.items())[:5])

inverted index: [(5259, [0, 8, 9, 15, 24, 43, 83, 86, 89, 100, 110, 179, 191, 195, 197, 204, 235, 245, 256, 278, 283, 288, 319, 328, 330, 332, 340, 361, 368, 371, 395, 408, 451, 483, 495, 501, 511, 515, 299, 598, 600, 611, 623, 628, 640, 672, 710, 763, 771, 783, 785, 797, 805, 811, 845, 847, 849, 850, 889, 901, 913, 919, 930, 966, 972, 993, 1015, 1038, 1077, 1104, 1117, 1189, 1201, 1209, 1222, 1229, 1249, 1257, 1275, 1300, 1329, 1339, 1362, 1401, 1412, 1419, 1421, 1431, 1441, 335, 1477, 1479, 1490, 1492, 1498, 1499, 1508, 1515, 1570, 1571, 1578, 1585, 1594, 1598, 1617, 1618, 1654, 1702, 1712, 1715, 1733, 1734, 1739, 1754, 1765, 1768, 1773, 1782, 269, 1842, 1853, 1854, 1866, 1879, 1897, 74, 1900, 1910, 1915, 1939]), (5535, [0, 1, 3, 11, 12, 43, 46, 70, 73, 86, 100, 101, 103, 104, 107, 179, 185, 191, 196, 212, 221, 229, 235, 288, 293, 312, 313, 328, 348, 358, 364, 394, 403, 410, 417, 433, 453, 455, 478, 483, 501, 507, 511, 524, 537, 543, 547, 581, 585, 588, 598, 600, 605, 607, 624, 644, 

# 2.1.2 Execute the Query
When the user inputs a query, for example, "modern seasonal cuisine", the search engine will:

* Process the query terms.
* Find and return a list of restaurants containing all query words in their description.

The output should include:

* restaurantName
* address
* description
* website

In [11]:
def execute_query(query):
    #preprocess the query exactly like the descriptions, then split it into terms
    preprocess_query=preprocessing(query).split()
    columns=['restaurantName','address','description','website']
    #a conjunctive query cannot match if it is empty or if any of its terms is missing from the vocabulary
    if not preprocess_query or any(word not in dictionary for word in preprocess_query):
        return data.iloc[0:0][columns]  #empty DataFrame with the expected columns
    query_word_ids=[dictionary[word] for word in preprocess_query]
    #intersect the document_id lists of all the query terms
    intersection=set(inverted_index[query_word_ids[0]])
    for term_id in query_word_ids[1:]:
        intersection=intersection&set(inverted_index[term_id])
    #map the common document_ids back to the restaurant names
    restaurant_names1=[name for name,document_id in dictionary_restaurant.items() if document_id in intersection]
    #select only the rows of the restaurants whose names are in restaurant_names1
    results_2=data[data["restaurantName"].isin(restaurant_names1)]
    #keep only the variables 'restaurantName', 'address', 'description', 'website'
    results_2=results_2[columns]

    return results_2

# example suggested from the question
query="modern seasonal cuisine"
results=execute_query(query)
# show the results of search engine on query "modern seasonal cuisine"
print("Search Results:")
print(results)


Search Results:
                       restaurantName                             address  \
30                    Casin del Gamba               via Roccolo Pizzati 1   
84                        San Giorgio           viale Brigate Bisagno 69r   
196             Il Luogo Aimo e Nadia                  via Montecuccoli 6   
202                        Vesta Mare                       viale Roma 41   
302                      Ca' Del Moro                   località Erbin 31   
326                         Contrasto                         via Roma 55   
330                              Saur                via Filippo Turati 8   
417                       San Michele          via Castello di Fagagna 33   
440                         Chichibio             via Guglielmo Marconi 1   
488            Winter Garden Florence                 piazza Ognissanti 1   
637                          La Valle                    via Umberto I 25   
638                         Esplanade                       



* Processed query words: the function preprocesses the query with the same steps used for the descriptions, so that its terms are compatible with the inverted index.
* Conjunctive Query: the document_id lists of all the query words are intersected, so that only restaurants containing all the words in the query are returned.
* Output: the result is a DataFrame with the columns {restaurantName, address, description, website}.

We have implemented the conjunctive search engine, and the output shows the results for the query 'modern seasonal cuisine'.
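
The intersection step can be illustrated on a toy index. This is a sketch under stated assumptions: `toy_vocab`, `toy_index` and `conjunctive` are hypothetical names, the data is made up, and the query terms are assumed to be already preprocessed.

```python
# Toy vocabulary and inverted index (hypothetical data)
toy_vocab = {"cuisin": 0, "fish": 1, "modern": 2, "season": 3}
toy_index = {0: [0, 2], 1: [1, 2], 2: [0, 1], 3: [0, 2]}

def conjunctive(terms):
    """Return the sorted document_ids that contain every term in `terms`."""
    # An empty query, or a term missing from the vocabulary, cannot match any document
    if not terms or any(term not in toy_vocab for term in terms):
        return []
    postings = [set(toy_index[toy_vocab[term]]) for term in terms]
    return sorted(set.intersection(*postings))

print(conjunctive(["modern", "season"]))   # [0]
print(conjunctive(["fish", "cuisin"]))     # [2]
print(conjunctive(["modern", "pizza"]))    # [] because "pizza" is not in the vocabulary
```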




# 2.2 Ranked Search Engine with TF-IDF and Cosine Similarity
For the second search engine, given a query, retrieve the top-k restaurants ranked by relevance to the query.

# 2.2.1 Inverted Index with TF-IDF Scores
* TF-IDF Scores: Calculate TF-IDF scores for each term in each restaurant’s description.
* Updated Inverted Index: Build a new inverted index where each entry is a term, and the value is a list of tuples containing document IDs and TF-IDF scores.

Example format:
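
As a rough illustration of this structure, the toy index below maps each term_id to a list of (document_id, score) tuples. The name `toy_tfidf_index` is hypothetical, and all ids and scores are placeholders chosen for the example, not values computed from the dataset.

```python
# Toy TF-IDF inverted index: every value is a list of (document_id, tfidf_score) tuples.
# All ids and scores are placeholders chosen for illustration only.
toy_tfidf_index = {
    0: [(0, 0.12), (2, 0.05)],   # term_id 0 appears in documents 0 and 2
    1: [(1, 0.30)],              # term_id 1 appears only in document 1
}

for term_id, postings in toy_tfidf_index.items():
    for document_id, score in postings:
        print(term_id, document_id, score)
```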

In [13]:
import math
import pickle

def calculate_tf_idf(data):
    N=len(data)
    dict_word_doc_id={}  # maps each word to the set of document_ids in which it appears

    # calculate normalized tf
    tf={}  # maps each document_id to a dictionary {word: tf value}
    for i in range(len(data)):
        document_id=dictionary_restaurant[data['restaurantName'].iloc[i]]
        words=data['description2'].iloc[i].split()
        tf[document_id]={}  # initialize the dictionary for this document_id
        for word in words:  # for each word in words
            tf[document_id][word]=tf[document_id].get(word, 0) + 1  # increment the raw count of the word (starting from 0)
            if word not in dict_word_doc_id:  # first time the word is seen: initialize its set of document_ids
                dict_word_doc_id[word]=set()
            dict_word_doc_id[word].add(document_id)  # add document_id to the set

        number_word=len(words)
        for word in tf[document_id]:
            tf[document_id][word]=tf[document_id][word]/number_word  # normalize TF by the description length so that longer descriptions are not favored

    #calculate idf and tf-idf
    tf_idf={}  # maps each document_id to a dictionary {word: tf_idf score}
    for document_id,word_tf in tf.items():
        tf_idf[document_id]={}  # initialize the dictionary for this document_id
        for word,tf_score in word_tf.items():  # for each (word, tf) pair in tf[document_id]
            idf=math.log(N/len(dict_word_doc_id[word]))  # idf = log(number of restaurants / number of descriptions that contain the word)
            tf_idf[document_id][word]=tf_score*idf  # tf_idf score = tf * idf

    return tf_idf,dict_word_doc_id

def inverted_index_tfidf(tfidf):
    # create updated inverted index
    updated_inverted_index={}
    for document_id,word_tfidf in tfidf.items():
        for word,tfidf_score in word_tfidf.items():
            term_id=dictionary[word]
            if term_id not in updated_inverted_index:
                updated_inverted_index[term_id]=[]
            updated_inverted_index[term_id].append((document_id,tfidf_score))  # append document_id and tfidf

    return updated_inverted_index

# call the function inverted_index_tfidf and create the updated_inverted_index
tfidf,dict_word_doc_id=calculate_tf_idf(data)
updated_inverted_index=inverted_index_tfidf(tfidf)





* Calculated TF-IDF scores: the calculate_tf_idf function computed the TF-IDF scores for each word in the descriptions.
* Created updated inverted index: this inverted index uses term_id as keys and stores tuples of (document_id, tfidf_score) as values.

This updated inverted index is now ready to be used in the Ranked Search Engine with TF-IDF and Cosine Similarity.
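
To make the arithmetic concrete, here is a small worked example on three made-up documents. It follows the same definitions as calculate_tf_idf (tf = count / description length, idf = natural log of N / document frequency), but `toy_docs`, `toy_df` and `toy_tfidf` are hypothetical names and the data is invented for illustration.

```python
import math

# Toy preprocessed documents (hypothetical data), as lists of stemmed words
toy_docs = {0: ["modern", "cuisin", "modern"], 1: ["fish", "cuisin"], 2: ["fish"]}
N = len(toy_docs)

# Document frequency: in how many documents each word appears
toy_df = {}
for words in toy_docs.values():
    for word in set(words):
        toy_df[word] = toy_df.get(word, 0) + 1

# TF-IDF per document: (count / document length) * ln(N / document frequency)
toy_tfidf = {}
for document_id, words in toy_docs.items():
    toy_tfidf[document_id] = {
        word: (words.count(word) / len(words)) * math.log(N / toy_df[word])
        for word in set(words)
    }

# "modern" in document 0: tf = 2/3, idf = ln(3/1)
print(round(toy_tfidf[0]["modern"], 4))
# "cuisin" in document 0: tf = 1/3, idf = ln(3/2)
print(round(toy_tfidf[0]["cuisin"], 4))
```

With this definition, a word that appears in every document has idf = ln(1) = 0, so it contributes nothing to the ranking.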


# 2.2.2 Execute the Ranked Query
For the ranked search engine:

* Process the query terms.
* Use Cosine Similarity to rank matching restaurants based on the TF-IDF vectors of the query and each document.
* Return the top-k results or all matching restaurants if fewer than k have non-zero similarity.

Each result should include:

* restaurantName
* address
* description
* website
* Similarity score (between 0 and 1)


In [18]:
import pickle

#create dictionary 'restaurant_det_dict'
restaurant_det_dict={}

for i in range(len(data)):
    document_id=dictionary_restaurant[data['restaurantName'].iloc[i]]  #used dictionary_restaurant created previously to obtain `document_id`

    if document_id not in restaurant_det_dict:
        restaurant_det_dict[document_id]={
            "restaurantName": data['restaurantName'].iloc[i],
            "address": data['address'].iloc[i],
            "description": data['description'].iloc[i],
            "website": data['website'].iloc[i]
        }



We've created the restaurant_det_dict dictionary, which maps each document_id to the restaurant details: restaurantName, address, description, and website. This dictionary will be used in the next step.

In [36]:
import math
import numpy as np
import pickle

def calculate_query_tfidf(preprocess_query,dict_word_doc_id,N):
    #calculate TF for query
    tf_query={}
    for word in preprocess_query:
        if word in tf_query:
            tf_query[word]=tf_query[word]+ 1
        else:
            tf_query[word]=1

    number_word=len(preprocess_query)
    for word in tf_query:
        tf_query[word]=tf_query[word]/number_word  # normalize TF for query

    # calculate TF-IDF for query
    tfidf_query={}
    for word,tf in tf_query.items():
        if word in dict_word_doc_id:
            idf=math.log(N/len(dict_word_doc_id[word]))
            tfidf_query[word]=tf * idf
        else:
            tfidf_query[word]=0  # a word that appears in no description has no document frequency, so its score is set to 0

    return tfidf_query

#function to calculate Cosine Similarity
def cosine_similarity(vector1, vector2):
    #numerator: scalar product between vector1 and vector2
    numerator=0
    for word in vector1:
        if word in vector2:
            numerator=numerator+vector1[word]*vector2[word]  #multiply only if the word is present in both vectors

    #norm of vector1: square root of the sum of its squared values
    sum_squares1=0
    for i in vector1.values():
        sum_squares1=sum_squares1+i**2
    norm1=math.sqrt(sum_squares1)
    #norm of vector2: square root of the sum of its squared values
    sum_squares2=0
    for i in vector2.values():
        sum_squares2=sum_squares2+i**2
    norm2=math.sqrt(sum_squares2)

    #a zero vector has no direction: return 0 instead of dividing by zero
    if norm1==0 or norm2==0:
        return 0.0
    return numerator/(norm1*norm2)

# Function to execute the Ranked Query and return the top k results
# 'query' is the list of preprocessed query terms
def ranking_function(query,tf_idf_data,k,dict_word_doc_id,N,restaurant_det_dict):
    tfidf_query=calculate_query_tfidf(query,dict_word_doc_id,N)  # calculate TF-IDF for the query passed as argument
    cos_sim=[]

    for document_id,tfidf_vector in tf_idf_data.items():
        score=cosine_similarity(tfidf_query,tfidf_vector)  # compute the similarity once per document
        if score>0:
            cos_sim.append((document_id,score))
    #sort results by similarity and keep the top k
    cos_sim=sorted(cos_sim,key=lambda x:x[1],reverse=True)[:k]
    #create final results with details
    final_results=[]
    for document_id,score in cos_sim:
        restaurant=restaurant_det_dict[document_id]
        final_results.append({
            "restaurantName":restaurant['restaurantName'],
            "address":restaurant['address'],
            "description":restaurant['description'],
            "website":restaurant['website'],
            "Similarity score":round(score,4)
        })
    return final_results

# prepare the query
query="modern seasonal cuisine"
preprocess_query=preprocessing(query).split()
k=5
#execute Ranking function
results=ranking_function(preprocess_query,tfidf,k,dict_word_doc_id,len(data),restaurant_det_dict)
print(results)

[{'restaurantName': 'Saur', 'address': 'via Filippo Turati 8', 'description': 'In a tiny rural village, this contemporary, almost minimalist-style restaurant serves modern cuisine with an emphasis on seasonal, regional produce.', 'website': 'https://ristorantesaur.it', 'Similarity score': 0.2634}, {'restaurantName': 'La Botte', 'address': 'via Giuseppe Garibaldi 8', 'description': 'A modern and welcoming contemporary bistro situated in the heart of Stresa’s historic centre. Run by an entire family, the restaurant serves modern and imaginative fish and meat dishes where the focus is always on seasonal ingredients. The interesting wine list also includes a selection of wines by the glass.', 'website': 'http://www.trattorialabottestresa.it', 'Similarity score': 0.2266}, {'restaurantName': 'Razzo', 'address': 'via Andrea Doria 17/f', 'description': 'A quiet restaurant with a relaxed, young and modern feel serving contemporary cuisine prepared from seasonal, regional products. Charming roma

* Calculated the TF-IDF for the query, where each word in the processed query is associated with its TF-IDF score.
* Calculated the Cosine Similarity between the query vector and the TF-IDF vector of each restaurant.
* Ranking function: calculates the similarities between the query and each restaurant, and returns the top k results with similarity > 0.
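
The ranking step can be checked by hand on a toy example. The sketch below is a compact variant of the cosine computation over sparse dictionaries, not the notebook's function: `cosine`, `toy_vectors` and `toy_query` are hypothetical names, and the weights are made-up numbers chosen so that the norms are easy to verify.

```python
import math

def cosine(v1, v2):
    """Cosine similarity between two sparse vectors stored as {word: weight} dictionaries."""
    dot = sum(v1[word] * v2[word] for word in v1 if word in v2)
    norm1 = math.sqrt(sum(x * x for x in v1.values()))
    norm2 = math.sqrt(sum(x * x for x in v2.values()))
    # A zero vector has no direction, so its similarity is defined as 0 here
    return dot / (norm1 * norm2) if norm1 and norm2 else 0.0

# Toy document vectors and query vector (hypothetical weights)
toy_vectors = {0: {"modern": 3.0, "cuisin": 4.0}, 1: {"fish": 1.0}, 2: {"modern": 1.0}}
toy_query = {"modern": 1.0}

# Score every document, sort by decreasing similarity, keep the top k with score > 0
k = 2
ranked = sorted(((doc_id, cosine(toy_query, vec)) for doc_id, vec in toy_vectors.items()),
                key=lambda pair: pair[1], reverse=True)
top = [(doc_id, round(score, 4)) for doc_id, score in ranked if score > 0][:k]
print(top)
# [(2, 1.0), (0, 0.6)]
```

Document 2 ranks first because its vector points in exactly the same direction as the query, while document 0 shares only one of its two terms with it (3 / 5 = 0.6). Document 1 has no term in common and is excluded.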