## 2. Search Engine

We have put our custom functions in functions.py and the search engine functions in searchEngine.py. All the libraries needed to run our code are imported at the beginning, so please run the following cell before anything else:

In [44]:
import csv
import json
import os
import re
import time
from collections import defaultdict
from datetime import datetime

import numpy as np
import pandas as pd
import requests
from bs4 import BeautifulSoup
from sklearn.feature_extraction.text import TfidfVectorizer
from tqdm import tqdm

import functions as f
from functions import (convert_currency, extract_fees,
                       get__currency_rates_api, preprocess_text)
from searchEngine import conjunction_search, tfidf_conjunction_search_topk

Before starting with the questions, we loaded all 6000 TSV files into one dataframe using our TSV_to_dataframe function in functions.py:

In [45]:
column_names = [
    "courseName",
    "universityName",
    "facultyName",
    "isItFullTime",
    "description",
    "startDate",
    "fees",
    "modality",
    "duration",
    "city",
    "country",
    "administration",
    "url"
]

folder_name = 'folderTSV'
num_files = 6000

courses_df = f.TSV_to_dataframe(column_names, folder_name, num_files)

In [46]:
courses_df.shape[0]

6000

In [47]:
courses_df.head(3)

Unnamed: 0,courseName,universityName,facultyName,isItFullTime,description,startDate,fees,modality,duration,city,country,administration,url
0,3D Design for Virtual Environments - MSc,Glasgow Caledonian University,School of Engineering and Built Environment,Full time,3D visualisation and animation play a role in ...,September,,MSc,1 year full-time,Glasgow,United Kingdom,On Campus,https://www.findamasters.com/masters-degrees/c...
1,Accounting and Finance - MSc,University of Leeds,Leeds University Business School,Full time,Businesses and governments rely on sound finan...,September,"UK: £18,000 (Total)International: £34,750 (Total)",MSc,1 year full time,Leeds,United Kingdom,On Campus,https://www.findamasters.com/masters-degrees/c...
2,"Accounting, Accountability & Financial Managem...",King’s College London,King’s Business School,Full time,"Our Accounting, Accountability & Financial Man...",September,,MSc,1 year FT,London,United Kingdom,On Campus,https://www.findamasters.com/masters-degrees/c...


### 2.0 Preprocessing

#### 2.0.0) Preprocessing the text

First, you must pre-process all the information collected for each MSc by:

1. Removing stopwords
2. Removing punctuation
3. Stemming
4. Anything else you think is needed

For this purpose, you can use the nltk library.


We have built a function preprocess_text in functions.py that uses the nltk stopwords corpus and the Porter stemmer. Here we apply it to each row of the dataframe and keep the processed text of the description column in a new column, ProcessedDescription:

In [48]:
# Preprocess the 'description' column
courses_df['ProcessedDescription'] = courses_df['description'].apply(preprocess_text)

In [49]:
courses_df.loc[10,['description','ProcessedDescription']]

description             The Analytical Toxicology MSc is a unique stud...
ProcessedDescription    analyt toxicolog msc uniqu studi cours integr ...
Name: 10, dtype: object

#### 2.0.1) Preprocessing the fees column
Moreover, we want the field fees to collect numeric information. As you will see, you scraped textual information for this attribute in the dataset: sketch whatever method you need (using regex, for example, to find currency symbol) to collect information and, in case of multiple information, retrieve only the highest fees. Finally, once you have collected numerical information, you likely will have different currencies: this can be chaotic, so let chatGPT guide you in the choice and deployment of an API to convert this column to a common currency of your choice (it can be USD, EUR or whatever you want). Ultimately, you will have a float column renamed fees (CHOSEN COMMON CURRENCY).

In [50]:
courses_df[courses_df['fees']!=''][['fees']].head(5)

Unnamed: 0,fees
1,"UK: £18,000 (Total)International: £34,750 (Total)"
5,"UK: £13,750 (Total)International: £31,000 (Total)"
7,Tuition fee per year (non-EU/EEA students): 15...
8,Tuition fee per year (non-EU/EEA students): 15...
9,"UK: £12,500 (Total)International: £28,750 (Total)"


First, we extract the numerical value of the fees and keep it in the fees column, and we create a currency column holding the currency symbol so that we know which currency each row's fee is expressed in. When a row lists several fees, we keep only the highest one. All of this happens inside the extract_fees function, which we apply to each row of the dataframe:
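The actual extract_fees lives in functions.py and is not reproduced in this notebook. As a rough idea of what such a helper can look like, here is a minimal sketch; the name extract_fees_sketch, the regex, and the set of recognised symbols are our assumptions for illustration. It only handles amounts written with a leading currency symbol and compares amounts without regard to their currency.

```python
import re

# Hypothetical pattern: a currency symbol followed by a number with optional
# thousands separators and decimals, e.g. "£34,750" or "€ 1500.50"
CURRENCY_PATTERN = re.compile(r"([£€$])\s*(\d[\d,]*(?:\.\d+)?)")


def extract_fees_sketch(text):
    """Return (highest_amount, currency_symbol), or (None, None) if no fee is found."""
    if not isinstance(text, str):
        return None, None
    matches = CURRENCY_PATTERN.findall(text)
    if not matches:
        return None, None
    # Parse each amount, dropping thousands separators, and keep the largest one
    amounts = [(float(number.replace(",", "")), symbol) for symbol, number in matches]
    return max(amounts, key=lambda pair: pair[0])


print(extract_fees_sketch("UK: £18,000 (Total)International: £34,750 (Total)"))
# (34750.0, '£')
```

Returning a tuple is what makes the `.apply(pd.Series)` pattern in the next cell split the result into two columns.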

In [51]:
courses_df[['fees', 'currency']] = courses_df['fees'].apply(extract_fees).apply(pd.Series)

In [52]:
courses_df[courses_df['fees'].notna()][['fees','currency']].head(5)

Unnamed: 0,fees,currency
1,34750.0,£
5,31000.0,£
7,15000.0,€
8,15000.0,€
9,28750.0,£


Now that we have the numeric values of the fees and their currency in the fees and currency columns, we can convert the fees to a common currency. We chose the Euro as the common currency.

Our idea: we use the exchange-rates.com API inside the function get__currency_rates_api to get the exchange rates relative to the Euro as a dictionary, which we keep in exchange_rates. The API expects the currency we want to convert from, so converting row by row would take up to 6000 separate API calls, and our API key has a request limit on the free tier. Instead, we request the rates once with the Euro as the base currency, which gives us, for each currency, how many units of it one Euro buys. In the convert_currency function we then divide the amount by that rate to get the value in Euros. This way a single API request is enough to convert every row of the dataframe. Finally, we keep the result in the fees (EUR) column.
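To make the division step concrete, here is a minimal sketch of the conversion logic. It is not the code in functions.py: the name convert_currency_sketch, the symbol-to-code mapping, and the example rates are made up for illustration, and we assume the rates dictionary is keyed by ISO currency code with the Euro as base.

```python
# Hypothetical mapping from the scraped currency symbols to ISO codes
SYMBOL_TO_CODE = {"£": "GBP", "€": "EUR", "$": "USD"}


def convert_currency_sketch(rates, amount, symbol):
    """Convert `amount` into the base currency of `rates`.

    `rates[code]` is assumed to be the number of units of `code`
    that one unit of the base currency buys.
    """
    if amount is None or amount != amount:  # None or NaN (NaN != NaN)
        return None
    rate = rates.get(SYMBOL_TO_CODE.get(symbol))
    if not rate:  # unknown currency or missing rate
        return None
    # Dividing by "foreign units per base unit" yields base units
    return round(amount / rate, 2)


# Made-up rates for illustration only: 1 EUR buys 0.8 GBP
example_rates = {"EUR": 1.0, "GBP": 0.8}
print(convert_currency_sketch(example_rates, 100.0, "£"))  # 125.0
```

Because the rates are fetched once and passed in as a plain dictionary, the function can be applied row by row without any further network calls.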

In [53]:
common_currency= 'EUR'
exchange_rates=get__currency_rates_api(common_currency)

In [54]:
# Apply the conversion to the 'fees' column in the DataFrame
courses_df['fees (EUR)'] = courses_df.apply(lambda row: convert_currency(exchange_rates, row['fees'], row['currency']), axis=1)


In [55]:
courses_df[courses_df['fees'].notna()][['fees','fees (EUR)','currency']].head(5)

Unnamed: 0,fees,fees (EUR),currency
1,34750.0,39664.42,£
5,31000.0,35384.09,£
7,15000.0,15000.0,€
8,15000.0,15000.0,€
9,28750.0,32815.89,£


Now that the dataframe has been preprocessed, it is a good idea to store it in a single TSV file so that, later on, we can load it directly with pandas instead of re-running all the code above.

In [56]:
tsv_file_path = 'courses_data_processed.tsv'
courses_df.to_csv(tsv_file_path, sep='\t', index=False)

### 2.1. Conjunctive query
For the first version of the search engine, we narrowed our interest to the description of each course. It means that you will evaluate queries only concerning the course's description.

#### 2.1.1) Create your index!
Before building the index,

Create a file named vocabulary, in the format you prefer, that maps each word to an integer (term_id).
Then, the first brick of your homework is to create the Inverted Index.

**Creating Vocabulary and saving it as vocabulary.txt**

In [57]:
# Create a set to store unique terms
unique_terms = set()

# Iterate through the DataFrame to collect unique terms
for index, row in courses_df.iterrows():
    stemmed_words = row["ProcessedDescription"].split()
    unique_terms.update(stemmed_words)

# Create a vocabulary by assigning term_ids to unique terms
vocabulary = {term: idx for idx, term in enumerate(unique_terms, start=1)}

# Save the vocabulary to a text file
with open("vocabulary.txt", "w") as vocab_file:
    for term, term_id in vocabulary.items():
        vocab_file.write(f"{term} {term_id}\n")


**Creating Inverted Index**

In [58]:
from collections import defaultdict
# Initialize an empty inverted index
inverted_index = defaultdict(list)

# Iterate through the DataFrame rows and update the inverted index
for index, row in courses_df.iterrows():
    # Use a set so that each document is listed at most once per term
    stemmed_words = set(row["ProcessedDescription"].split())
    for term in stemmed_words:
        term_id = vocabulary.get(term)  # Get the term_id from the vocabulary
        if term_id is not None:
            inverted_index[term_id].append(index)

# Convert the defaultdict to a regular dictionary
inverted_index = dict(inverted_index)

In [59]:
len(inverted_index)

7307

**Saving the Inverted Index as inverted_index.json**

In [60]:
# Save the inverted index to a JSON file
import json
with open("inverted_index.json", "w") as index_file:
    json.dump(inverted_index, index_file)

#### 2.1.2) Execute the query
Given a query input by the user, for example:

advanced knowledge

**Search Engine 1.0**

Loading the dataframe we had saved in the tsv file already:

In [61]:
courses_df=pd.read_csv('courses_data_processed.tsv',sep='\t')

Loading the vocabulary and inverted_index that we saved before:

In [62]:
import json

# Load vocabulary from "vocabulary.txt"
vocabulary = {}
with open("vocabulary.txt", "r") as vocab_file:
    for line in vocab_file:
        term, term_id = line.strip().split()
        vocabulary[term] = int(term_id)

# Load inverted index from "inverted_index.json"
with open("inverted_index.json", "r") as index_file:
    inverted_index = json.load(index_file)


*In searchEngine.py we have built the function 'conjunction_search'. It takes the dataframe, the vocabulary, the inverted_index and the query as parameters. It processes the query using the nltk library, looks up every query word in the vocabulary and the inverted index, and returns the rows whose description contains all the query words (conjunctive search). At the end, we display result_df with the required columns.*
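The core of a conjunctive lookup is an intersection of posting lists. The sketch below illustrates that step only and is not the code in searchEngine.py: the name conjunctive_lookup and the toy data are our assumptions, and the query terms are assumed to be already preprocessed (stemmed) in the same way as the descriptions.

```python
def conjunctive_lookup(query_terms, vocabulary, inverted_index):
    """Return the sorted ids of the documents that contain every query term."""
    postings = []
    for term in query_terms:
        term_id = vocabulary.get(term)
        if term_id is None:
            return []  # a term outside the vocabulary means no document can match
        # An index loaded from JSON has string keys, so look up by str(term_id)
        postings.append(set(inverted_index.get(str(term_id), [])))
    if not postings:
        return []
    # Start from the shortest posting list to keep the intermediate sets small
    postings.sort(key=len)
    return sorted(set.intersection(*postings))


# Toy data for illustration only
toy_vocabulary = {"advanc": 1, "knowledg": 2, "msc": 3}
toy_index = {"1": [0, 2, 5], "2": [2, 3, 5], "3": [0, 1, 2, 3, 4, 5]}
print(conjunctive_lookup(["advanc", "knowledg"], toy_vocabulary, toy_index))  # [2, 5]
```

The returned ids can then be used to select the matching rows of the dataframe.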

In [63]:
query = "advanced knowledge"

# Search for the query and get the matching DataFrame
result_df = conjunction_search(courses_df, vocabulary, inverted_index, query)

# Print the result DataFrame
result_df[['courseName','universityName','description','url']]


Unnamed: 0,courseName,universityName,description,url
4099,Global Meetings and Events Management MSc / PGDip,University College Birmingham,Become part of an events industry worth an est...,https://www.findamasters.com/masters-degrees/c...
2568,Dance Science MSc,University of Chichester,This suite of MSc programmes is designed for s...,https://www.findamasters.com/masters-degrees/c...
2569,Data Analysis for Business Intelligence - MSc,University of Leicester,The course is designed for students who want t...,https://www.findamasters.com/masters-degrees/c...
10,Analytical Toxicology MSc,King’s College London,The Analytical Toxicology MSc is a unique stud...,https://www.findamasters.com/masters-degrees/c...
522,Accounting and Finance - MSc,University of Sussex,On this MSc you’ll advance your accounting and...,https://www.findamasters.com/masters-degrees/c...
...,...,...,...,...
2540,Cyber Security MSc,Keele University,IT systems are a vital part of every company's...,https://www.findamasters.com/masters-degrees/c...
4590,Intelligent Transport Planning and Engineering...,University of East London,The programme is designed to meet the increasi...,https://www.findamasters.com/masters-degrees/c...
495,(MSc/PGDip/PGCert) - Advanced Clinical Practic...,University of Warwick,Imagine if you had the confidence to react mor...,https://www.findamasters.com/masters-degrees/c...
2037,Clinical Pharmacy - MSc,University of Sunderland,The Clinical Pharmacy MSc has been designed to...,https://www.findamasters.com/masters-degrees/c...


In [64]:
print(result_df.loc[10,'description'])

The Analytical Toxicology MSc is a unique study course that integrates theoretical and practical aspects of analytical science with clinical and forensic toxicology. This course will provide you with a detailed knowledge and comprehensive understanding of advanced analytical toxicology and its applications.


### 2.2. Conjunctive query & Ranking score
For the second search engine, given a query, we want to get the top-k (the choice of k it's up to you!) documents related to the query. In particular:

1. Find all the documents that contain all the words in the query.
2. Sort them by their similarity with the query.
3. Return in output k documents, or all the documents with non-zero similarity with the query when the results are less than k. You must use a heap data structure (you can use Python libraries) for maintaining the top-k documents.

To solve this task, you must use the tfIdf score and the Cosine similarity. The field to consider is still the description. Let's see how.

#### 2.2.1) Create your index

**Now we created the second Inverted Index**

Unlike the first index we created above, this index provides, for each word, the list of documents that contain it together with the corresponding tfIdf score.
We used the TfidfVectorizer class from sklearn.feature_extraction.text so that we could compute the tfIdf scores easily and store them efficiently in a sparse matrix.
In summary, we are going to:
1. Pre-process the text data in the DataFrame, replacing NaN values with empty strings.
2. Apply TF-IDF vectorization to the processed text to create a TF-IDF matrix.
3. Construct an inverted index that maps each term to the ids of the documents containing it, with the corresponding TF-IDF scores.
4. Obtain, at the end, a dictionary representing the inverted index, which provides a structured way to look up documents by term together with their TF-IDF scores.
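For reference, the scores stored in this index depend on the vectorizer settings. Assuming the scikit-learn defaults used here (raw term counts, `smooth_idf=True`, `norm='l2'`), with $n$ the number of documents, $\mathrm{df}(t)$ the number of documents containing term $t$, and $\mathrm{tf}(t,d)$ the count of $t$ in document $d$, the weight of a term in a document is:

```latex
\mathrm{idf}(t) = \ln\frac{1 + n}{1 + \mathrm{df}(t)} + 1,
\qquad
w_{t,d} = \mathrm{tf}(t,d)\,\mathrm{idf}(t),
\qquad
\hat{w}_{t,d} = \frac{w_{t,d}}{\sqrt{\sum_{t'} w_{t',d}^{2}}}
```

The value saved for each (document, term) pair is the normalized weight $\hat{w}_{t,d}$. Since every document vector has unit length, the cosine similarity between two such vectors reduces to their dot product.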

In [65]:
# Replace NaN values with an empty string in the ProcessedDescription column
courses_df["ProcessedDescription"] = courses_df["ProcessedDescription"].fillna("")

# Create a list of processed descriptions
processed_descriptions = courses_df["ProcessedDescription"].tolist()

# Initialize the TfidfVectorizer and build the TF-IDF matrix
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(processed_descriptions)

# Get feature names (terms) and their indices
feature_names = vectorizer.get_feature_names_out()
term_indices = {term: idx for idx, term in enumerate(feature_names)}

# Initialization of an empty inverted index with tf-idf scores
tfidf_inverted_index = defaultdict(list)

# Loop through the non-zero elements of the TF-IDF matrix
for document_id, term_id in zip(*X.nonzero()):
    tfidf_score = X[document_id, term_id]
    term = feature_names[term_id]
    tfidf_inverted_index[term_indices[term]].append((document_id, tfidf_score))

# Creation of a regular dictionary by converting the defaultdict
tfidf_inverted_index = dict(tfidf_inverted_index)


**Saving the second Inverted Index just created as tfidf_inverted_index.json**

In [66]:
import numpy as np

# json cannot serialize NumPy scalars (the document ids returned by
# X.nonzero() are NumPy integers), so we convert them to built-in Python
# types; without this, json.dump raises a TypeError
def convert_to_builtin_type(obj):
    if isinstance(obj, np.integer):
        return int(obj)
    if isinstance(obj, np.floating):
        return float(obj)
    raise TypeError(f"Object of type {type(obj)} is not JSON serializable")

# Convert keys to strings in tfidf_inverted_index
tfidf_inverted_index_str_keys = {str(key): value for key, value in tfidf_inverted_index.items()}

# Save the inverted index to a JSON file with custom serialization
with open("tfidf_inverted_index.json", "w") as index_file:
    json.dump(tfidf_inverted_index_str_keys, index_file, indent=2, default=convert_to_builtin_type)


#### 2.2.2) Execute the query

In this new setting, given a query, you get the proper documents (i.e., those containing all the query's words) and sort them according to their similarity to the query. For this purpose, as the scoring function, we will use the Cosine Similarity concerning the tfIdf representations of the documents.

Given a query input by the user, for example:

advanced knowledge

The search engine is supposed to return a list of documents, ranked by their Cosine Similarity to the query entered in the input.

**Search Engine 2.0**

As before, we load the dataframe that we saved in the TSV file:

In [67]:
courses_df=pd.read_csv('courses_data_processed.tsv',sep='\t')

As before, we also load the vocabulary and the tfidf_inverted_index that we saved earlier:

In [68]:
import json

# Load vocabulary from "vocabulary.txt"
vocabulary = {}
with open("vocabulary.txt", "r") as vocab_file:
    for line in vocab_file:
        term, term_id = line.strip().split()
        vocabulary[term] = int(term_id)

# Load the TF-IDF inverted index from "tfidf_inverted_index.json"
with open("tfidf_inverted_index.json", "r") as index_file:
    tfidf_inverted_index = json.load(index_file)

*In searchEngine.py, alongside 'conjunction_search', we have built the function **'tfidf_conjunction_search_topk'**. It takes the dataframe ('courses_df'), the vocabulary (here the vectorizer's term-to-index mapping, term_indices), the tfidf_inverted_index, the query, the vectorizer, the TF-IDF matrix X and the number k as parameters. It processes the query using the nltk library, identifies the documents that contain all the query terms using the vocabulary and the tfidf_inverted_index and, unlike the previous search engine, ranks them by cosine similarity: only the top-k matching documents, along with their relevant information, are returned in a DataFrame. To maintain the top-k documents we used a **heap data structure**.*
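The heap-based ranking step can be sketched as follows. This is an illustration under our own assumptions, not the code in searchEngine.py: the function names and toy vectors are hypothetical, and documents are represented as plain {term: weight} dictionaries rather than rows of the sparse matrix X. A min-heap of size at most k keeps the best candidates seen so far, so the weakest of them is always at the root and can be replaced cheaply.

```python
import heapq
import math


def cosine_similarity(u, v):
    """Cosine similarity between two sparse vectors stored as {term: weight} dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def top_k_by_similarity(query_vec, doc_vecs, candidate_ids, k):
    """Return up to k (score, doc_id) pairs, best first, using a min-heap."""
    heap = []
    for doc_id in candidate_ids:
        score = cosine_similarity(query_vec, doc_vecs[doc_id])
        if score <= 0.0:
            continue  # only documents with non-zero similarity are kept
        if len(heap) < k:
            heapq.heappush(heap, (score, doc_id))
        elif score > heap[0][0]:
            # The root is the weakest of the current top-k: replace it
            heapq.heapreplace(heap, (score, doc_id))
    return sorted(heap, reverse=True)


# Toy vectors for illustration only
toy_docs = {
    0: {"advanc": 1.0},
    1: {"advanc": 1.0, "knowledg": 1.0},
    2: {"advanc": 1.0, "knowledg": 1.0, "msc": 2.0},
}
toy_query = {"advanc": 1.0, "knowledg": 1.0}
top = top_k_by_similarity(toy_query, toy_docs, [1, 2], k=1)
print(top[0][1])  # 1
```

With n candidate documents this costs O(n log k), instead of the O(n log n) needed to sort all the candidates.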

In [69]:
# Pre-processes data
courses_df["ProcessedDescription"] = courses_df["ProcessedDescription"].fillna("")

# Create a list of processed descriptions
processed_descriptions = courses_df["ProcessedDescription"].tolist()

# Initialize TfidfVectorizer
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(processed_descriptions)

# Set the query
query='advanced knowledge' 

# Set k=10 to have the top-k
k=10

# Call the function
k_results_df = tfidf_conjunction_search_topk(courses_df, term_indices, tfidf_inverted_index, query, vectorizer, X, k)

# Give the desired output
k_results_df[['courseName', 'universityName', 'description', 'url', 'SimilarityScore']]


Unnamed: 0,courseName,universityName,description,url,SimilarityScore
739,Advanced Computing (MSc/MRes),"Birkbeck, University of London",MSc Advanced Computing:If you already work in ...,https://www.findamasters.com/masters-degrees/c...,0.46391
3788,Food and Nutrition Sciences (MSc),Sheffield Hallam University,Gain advanced food industry knowledge and prac...,https://www.findamasters.com/masters-degrees/c...,0.444054
4019,Global Biodiversity Conservation - MSc,University of Sussex,This MSc will give you advanced knowledge and ...,https://www.findamasters.com/masters-degrees/c...,0.426063
809,Advanced Mechanical Engineering - MSc,Cardiff University,This degree programme aims to provide advanced...,https://www.findamasters.com/masters-degrees/c...,0.395947
740,Advanced Computing MSc,King’s College London,Our Advanced Computing MSc provides knowledge ...,https://www.findamasters.com/masters-degrees/c...,0.388661
770,Advanced Healthcare Practice - MSc,Cardiff University,Our MSc Advanced Healthcare Practice programme...,https://www.findamasters.com/masters-degrees/c...,0.3776
737,Advanced Computing - MSc,University of the West of Scotland,Our MSc Advanced Computing course is designed ...,https://www.findamasters.com/masters-degrees/c...,0.365345
916,Advancing Practice - MSc,University of Northampton,Our MSc Advancing Practice awards support the ...,https://www.findamasters.com/masters-degrees/c...,0.355659
685,Advanced Clinical Practice MSc,University of Greenwich,Develop your skills and deepen your knowledge ...,https://www.findamasters.com/masters-degrees/c...,0.350701
815,Advanced Mechanical Engineering - MSc (Eng),University of Leeds,This course offers a broad range of advanced s...,https://www.findamasters.com/masters-degrees/c...,0.340908
