## 3. Define a new score!

Now it's our turn: build a new metric to rank MSc degrees.



Our custom functions are in functions.py, the search engine functions for this question are in searchEngineNew.py, and the search engine from question 2, which we reuse here, is in searchEngine.py. All the libraries needed to run the code are imported at the top, so please run the following cell before anything else:

In [1]:
import csv
import pandas as pd
import json
from searchEngine import conjunction_search
from searchEngineNew import calculate_score
from searchEngineNew import top_k_documents
from searchEngine import tfidf_conjunction_search_topk


Before going on, we have to import our data, which is crucial for the task. We stored the processed data in a .tsv file so that we do not have to pre-process it every time.

In [2]:
courses_df=pd.read_csv('courses_data_processed.tsv',sep='\t')

Along with the data, we import the vocabulary and the first inverted index that we created. We saved both previously so that we do not have to rebuild them every time we run the code.

In [3]:
import json

# Load vocabulary from "vocabulary.txt"
vocabulary = {}
with open("vocabulary.txt", "r") as vocab_file:
    for line in vocab_file:
        term, term_id = line.strip().split()
        vocabulary[term] = int(term_id)

# Load inverted index from "inverted_index.json"
with open("inverted_index.json", "r") as index_file:
    inverted_index = json.load(index_file)

**Now we arrive at the core of the task: defining a new score and sorting the query-related documents according to it.**

- **New score used:** we decided to use a weighted combination of the fields in our dataset. We thought that the 'description' field is not the only one relevant to a query, because a user might want to search for master's degrees based on a city or on the fees payable. Each field therefore contributes to the score according to its weight: the overall score for a document is the weighted sum of the normalized values of the numeric fields and of the binary scores of the string fields, so the weights reflect the importance of each field in the scoring process.

- **Structure to solve the task:** after defining the new score, we sort the documents according to it. To do this we use a heap data structure that keeps track of the top-k documents.

We decided we wanted to have the top-10 documents that best matched the user's query.
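To make the idea concrete, here is a minimal sketch of a weighted score combined with a heap-based top-k selection. It is not the code in searchEngineNew.py: the helper names (`normalize`, `score_document`, `top_k`), the toy courses and the substring-based matching rule are all assumptions made for illustration.

```python
import heapq

def normalize(value, lo, hi):
    """Min-max normalize a numeric value into [0, 1]."""
    return 0.0 if hi == lo else (value - lo) / (hi - lo)

def score_document(doc, query_terms, weights, numeric_ranges):
    """Weighted sum: normalized value for numeric fields, 0/1 match for string fields."""
    score = 0.0
    for field, weight in weights.items():
        value = doc.get(field)
        if value is None:
            continue
        if isinstance(value, (int, float)):
            lo, hi = numeric_ranges[field]
            score += weight * normalize(value, lo, hi)
        else:
            text = str(value).lower()
            # Binary score (assumed rule): 1 if any query term occurs in the field
            score += weight * (1.0 if any(t in text for t in query_terms) else 0.0)
    return score

def top_k(docs, query, weights, k):
    """Return the k best (courseName, score) pairs using a min-heap of size k."""
    query_terms = query.lower().split()
    # Observed min/max of every numeric field, needed for normalization
    numeric_ranges = {}
    for field in weights:
        values = [d[field] for d in docs if isinstance(d.get(field), (int, float))]
        if values:
            numeric_ranges[field] = (min(values), max(values))
    heap = []  # holds (score, position); the smallest score sits at heap[0]
    for pos, doc in enumerate(docs):
        s = score_document(doc, query_terms, weights, numeric_ranges)
        if len(heap) < k:
            heapq.heappush(heap, (s, pos))
        elif s > heap[0][0]:
            heapq.heapreplace(heap, (s, pos))
    return [(docs[pos]["courseName"], s) for s, pos in sorted(heap, reverse=True)]

# Toy data (hypothetical courses, not taken from our dataset)
docs = [
    {"courseName": "Data Science MSc", "city": "London",
     "description": "advanced knowledge of statistics", "fees": 12000},
    {"courseName": "History MA", "city": "Leeds",
     "description": "study of modern europe", "fees": 9000},
    {"courseName": "Robotics MSc", "city": "London",
     "description": "design of autonomous systems", "fees": 15000},
]
weights = {"courseName": 0.2, "city": 0.3, "description": 0.4, "fees": 0.1}

for name, score in top_k(docs, "advanced london", weights, k=2):
    print(name, round(score, 3))
```

Keeping only k entries in the heap means that each document costs at most O(log k) to process, instead of sorting the whole result set.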

In [4]:
# Query given in input from a user
query = input("Enter your query: ")

# Columns of our dataset that we want to use for the score
selected_columns = ['courseName', 'universityName', 'facultyName', 'isItFullTime',
       'description', 'startDate', 'fees', 'modality', 'duration', 'city',
       'country', 'administration', 'url', 'ProcessedDescription', 'currency',
       'fees (EUR)']

# Use the conjunction_search from 2.1 to retrieve the documents that match the query
result_df = conjunction_search(courses_df, vocabulary, inverted_index, query)

# Set the columns that we are using for scoring
documents = result_df[selected_columns].to_dict(orient='records')

# Define weights for scoring that reflect the importance of each field in the scoring process
weights = {
    "courseName": 0.1,
    "universityName": 0.2,
    "facultyName": 0.15,
    "isItFullTime": 0.05,
    "description": 0.2,
    "startDate": 0.08,
    "fees": 0.05,
    "modality": 0.1,
    "duration": 0.1,
    "city": 0.1,
    "country": 0.2,
    "administration": 0.15,
    "url": 0.05,
    'ProcessedDescription':0.1, 
    'currency':0.05,
    'fees (EUR)':0.05
}

# Number of top-k documents you want to retrieve
k = 10

# Retrieve and print the top-k documents using the function in searchEngineNew.py
result = top_k_documents(documents, query, weights, k)
result[['courseName','universityName','description','url', 'MyScore']]

| | courseName | universityName | description | url | MyScore |
|---|---|---|---|---|---|
| 0 | Electronics Engineering | Linköping University | This programme focuses on the design of integr... | https://www.findamasters.com/masters-degrees/c... | 0.524305 |
| 1 | Electric Vehicle Systems MSc | Brunel University London | Our Electric Vehicle Systems MSc degree has be... | https://www.findamasters.com/masters-degrees/c... | 0.404697 |
| 2 | Computational Finance MSc | University College London | The Computational Finance MSc at UCL provides ... | https://www.findamasters.com/masters-degrees/c... | 0.333067 |
| 3 | International Business MSc | University of Leicester | This is for you if you want to enhance your ex... | https://www.findamasters.com/masters-degrees/c... | 0.301789 |
| 4 | Financial Risk Management MSc | University College London | The Financial Risk Management MSc at UCL provi... | https://www.findamasters.com/masters-degrees/c... | 0.246805 |
| 5 | Criminal Justice and Criminology - MSc | University of Leeds | Develop advanced knowledge in the study of cri... | https://www.findamasters.com/masters-degrees/c... | 0.242466 |
| 6 | MSc International Management | University of Nottingham Ningbo China | The MSc International Management programme fur... | https://www.findamasters.com/masters-degrees/c... | 0.236929 |
| 7 | Chemistry - Analysis of Pharmaceutical Compoun... | University College Cork | MSc degree courses are provided in three key a... | https://www.findamasters.com/masters-degrees/c... | 0.231994 |
| 8 | Chemistry - Analytical Chemistry MSc | University College Cork | MSc degree courses are provided in three key a... | https://www.findamasters.com/masters-degrees/c... | 0.231994 |
| 9 | Chemistry - Environmental Analytical Chemistry... | University College Cork | MSc degree courses are provided in three key a... | https://www.findamasters.com/masters-degrees/c... | 0.231994 |


**Let's now compare the results obtained with our new score function against those of the function based on cosine similarity (from question 2.2)**

In question 2.2 we used cosine similarity to sort documents based on their tf-idf. At first glance, before comparing the results, it might seem that this function should simply work better than our new score, because it rests on a well-defined mathematical computation and on tf-idf, which is a standard tool in text mining.
Comparing the results, however, we can make the following observations:
- The function based on cosine similarity processes only the 'description' column of our dataset, so it extracts query-related documents based on the 'description' field alone.
- The function based on our new score processes all the columns of our dataset, so it extracts query-related documents by scanning and weighting each field.

This leads us to conclude that when a user enters a query containing the name of a specific country or university, or a range of fees for a master's degree programme, the function described in 2.2 is less effective than our new score, because the latter parses all the fields in the dataset.
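The difference between the two observations can be illustrated with a small sketch. The two helper functions and the toy course below are hypothetical and only show the matching idea; they are not the functions in searchEngine.py or searchEngineNew.py.

```python
def matches_description_only(doc, query_terms):
    """True if every query term appears in the 'description' field."""
    tokens = doc["description"].lower().split()
    return all(term in tokens for term in query_terms)

def matches_any_field(doc, query_terms, fields):
    """True if every query term appears in at least one of the given fields."""
    tokens = " ".join(str(doc[f]).lower() for f in fields).split()
    return all(term in tokens for term in query_terms)

# Toy course (hypothetical, not taken from our dataset)
course = {
    "courseName": "Finance MSc",
    "city": "London",
    "country": "United Kingdom",
    "description": "Study markets and risk",
}
query_terms = "finance london".split()

print(matches_description_only(course, query_terms))
print(matches_any_field(course, query_terms,
                        ["courseName", "city", "country", "description"]))
```

Here the description never mentions the city or the course title, so a description-only search misses the course, while a search over all the fields finds it.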

On the other hand, cosine similarity is a well-defined metric, and when the same query is given to both score functions, it can be the more precise of the two.
So we can conclude that, given our dataset and the way we processed its columns, the best score function depends on what we are interested in, and therefore on the particular query. We can show this with a single example: **comparing the results of the function that uses cosine similarity and the one that uses our new score on the same query, i.e. 'advanced knowledge'**.

To this end, the output shown in the previous cell was produced by entering 'advanced knowledge' as the input query. In the following cells we load the previously saved 'tfidf_inverted_index.json' file, call the function used in 2.2 to get the top-10 documents, and then analyze the two outputs and make a brief comparison.

In [5]:
# Import 'tfidf_inverted_index.json' for the comparison

import json

# Load the tf-idf inverted index from "tfidf_inverted_index.json"
with open("tfidf_inverted_index.json", "r") as index_file:
    tfidf_inverted_index = json.load(index_file)

In [6]:
# Rerun the function used in 2.2 to get the top-10 query-related documents

from sklearn.feature_extraction.text import TfidfVectorizer

# Pre-processes data
courses_df["ProcessedDescription"] = courses_df["ProcessedDescription"].fillna("")

# Create a list of processed descriptions
processed_descriptions = courses_df["ProcessedDescription"].tolist()

# Initialize TfidfVectorizer
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(processed_descriptions)
feature_names = vectorizer.get_feature_names_out()
term_indices = {term: idx for idx, term in enumerate(feature_names)}

# Set the query
query='advanced knowledge' 

# Set k=10 to have the top-k
k=10

# Call the function
k_results_df = tfidf_conjunction_search_topk(courses_df, term_indices, tfidf_inverted_index, query, vectorizer, X, k)

# Give the desired output
k_results_df[['courseName', 'universityName', 'description', 'url', 'SimilarityScore']]

| | courseName | universityName | description | url | SimilarityScore |
|---|---|---|---|---|---|
| 739 | Advanced Computing (MSc/MRes) | Birkbeck, University of London | MSc Advanced Computing:If you already work in ... | https://www.findamasters.com/masters-degrees/c... | 0.46391 |
| 3788 | Food and Nutrition Sciences (MSc) | Sheffield Hallam University | Gain advanced food industry knowledge and prac... | https://www.findamasters.com/masters-degrees/c... | 0.444054 |
| 4019 | Global Biodiversity Conservation - MSc | University of Sussex | This MSc will give you advanced knowledge and ... | https://www.findamasters.com/masters-degrees/c... | 0.426063 |
| 809 | Advanced Mechanical Engineering - MSc | Cardiff University | This degree programme aims to provide advanced... | https://www.findamasters.com/masters-degrees/c... | 0.395947 |
| 740 | Advanced Computing MSc | King’s College London | Our Advanced Computing MSc provides knowledge ... | https://www.findamasters.com/masters-degrees/c... | 0.388661 |
| 770 | Advanced Healthcare Practice - MSc | Cardiff University | Our MSc Advanced Healthcare Practice programme... | https://www.findamasters.com/masters-degrees/c... | 0.3776 |
| 737 | Advanced Computing - MSc | University of the West of Scotland | Our MSc Advanced Computing course is designed ... | https://www.findamasters.com/masters-degrees/c... | 0.365345 |
| 916 | Advancing Practice - MSc | University of Northampton | Our MSc Advancing Practice awards support the ... | https://www.findamasters.com/masters-degrees/c... | 0.355659 |
| 685 | Advanced Clinical Practice MSc | University of Greenwich | Develop your skills and deepen your knowledge ... | https://www.findamasters.com/masters-degrees/c... | 0.350701 |
| 815 | Advanced Mechanical Engineering - MSc (Eng) | University of Leeds | This course offers a broad range of advanced s... | https://www.findamasters.com/masters-degrees/c... | 0.340908 |


Now that we have both results, we can do a comparative analysis of the TF-IDF and new score methods for retrieval based on the query "advanced knowledge".

Let us briefly recall what the two methods do. The new score method applies a weighted scoring algorithm that considers multiple attributes of the courses. The TF-IDF method, instead, preprocesses the course descriptions, builds a TF-IDF matrix, and then calculates the cosine similarity between the query and each course description in the dataset, so the documents (courses) are ranked by these similarity scores.
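As a reminder of how the TF-IDF ranking works, here is a minimal pure-Python sketch of TF-IDF weighting followed by cosine similarity. It is not the function from 2.2: the helper names and the toy descriptions are assumptions, and the smoothed idf formula used here, ln((1 + n) / (1 + df)) + 1, is only one common variant, so the numbers need not match those produced by our notebook.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build one sparse TF-IDF vector (dict term -> weight) per document."""
    tokenized = [d.lower().split() for d in docs]
    n = len(tokenized)
    # Document frequency: in how many documents each term occurs
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {t: math.log((1 + n) / (1 + c)) + 1 for t, c in df.items()}
    vectors = [{t: tf * idf[t] for t, tf in Counter(tokens).items()}
               for tokens in tokenized]
    return vectors, idf

def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Toy descriptions (hypothetical, not taken from our dataset)
descriptions = [
    "advanced knowledge of computing",
    "history of modern art",
    "advanced clinical practice",
]
vectors, idf = tfidf_vectors(descriptions)

# The query is weighted with the same idf values; unknown terms are ignored
query = "advanced knowledge"
query_vec = {t: tf * idf[t] for t, tf in Counter(query.split()).items() if t in idf}

scores = [cosine(query_vec, v) for v in vectors]
ranking = sorted(range(len(descriptions)), key=lambda i: scores[i], reverse=True)
for i in ranking:
    print(round(scores[i], 3), descriptions[i])
```

The description containing both query terms ranks first, the one containing only "advanced" comes second, and the one sharing no term with the query gets a score of zero.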

What we can notice is that the TF-IDF method, run on the course descriptions, returned a list of courses whose titles and descriptions closely match the query "advanced knowledge": the top-10 results include courses like "Advanced Computing (MSc/MRes)" and "Advanced Mechanical Engineering - MSc", each with its similarity score. Our new score method, on the other hand, retrieved a more diverse set of courses, with scores based on the weighted sum of various attributes. This method gives a broader perspective, taking into account factors beyond the textual similarity between the course description and the query.

Comparing the two, the TF-IDF method showed high relevance in its results: the courses are closely aligned with the query, and the cosine similarity scores, between roughly 0.34 and 0.46, indicate a clear textual match. The new score function, while offering a range of courses, did not match the direct relevance achieved by the TF-IDF method for this specific query.

We asked ourselves how this is possible. A likely explanation is that TF-IDF scores are based solely on textual relevance, which makes the method highly effective for specific keyword-based queries. The other method is a multifaceted approach that allows a more nuanced evaluation of courses across various attributes, which can be an advantage for more complex or multi-dimensional queries that involve not only description text but also city, country, fees and so on.

In conclusion, as we said before the example, our score function is more adaptable: its weights can be adjusted to different criteria, which makes it suitable for a variety of queries and user preferences. The TF-IDF method is better suited to queries where direct textual relevance is paramount. For this specific example (the query "advanced knowledge") we can therefore say that the TF-IDF method outperformed our new score at retrieving courses with direct textual relevance to the query. The strength of our new score lies instead in its adaptability and comprehensive approach, which considers a wider range of course attributes.

So, in the end, the choice between the TF-IDF and new score methods should be guided by the specific needs of the query and the desired breadth of course retrieval.