## 3. Define a new score!

Now it's our turn: build a new metric to rank MSc degrees.



Our custom helper functions are in functions.py, the search engine functions written for this question are in searchEngineNew.py, and the search engine reused from question 2 is in searchEngine.py. All the libraries needed to run our code are imported in the cell below, so please run it before any of the following cells:

In [None]:
import csv
import pandas as pd
import json
from searchEngine import conjunction_search
from searchEngineNew import calculate_score
from searchEngineNew import top_k_documents


Before going on, we have to load our data: it is crucial for the task! We stored the processed data in a .tsv file so that we don't have to pre-process it every time.

In [None]:
courses_df=pd.read_csv('courses_data_processed.tsv',sep='\t')

Along with the data, we load the vocabulary and the first inverted index that we created (we saved both previously so that we don't have to rebuild them every time we run the code).

In [None]:
import json

# Load vocabulary from "vocabulary.txt"
vocabulary = {}
with open("vocabulary.txt", "r") as vocab_file:
    for line in vocab_file:
        term, term_id = line.strip().split()
        vocabulary[term] = int(term_id)

# Load inverted index from "inverted_index.json"
with open("inverted_index.json", "r") as index_file:
    inverted_index = json.load(index_file)

**Now we arrive at the core of the task: defining a new score and sorting the query-related documents according to it.**

- **New score used:** we decided to use a weighted combination of several fields in our dataset. The 'description' field is not the only one relevant to a query: a user might want to search for master's degrees by city or by the fees to be paid. Each field therefore contributes to the score according to its weight, which reflects the importance of that field, and the overall score of a document is the weighted sum of the normalized values of its numeric fields and the binary scores of its string fields.

- **Structure to solve the task:** after defining the new score, we sort the documents according to it. To do this we use a heap data structure that keeps track of the top-k documents.
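The weighted score described above can be sketched roughly as follows. This is only an illustration and not the actual `calculate_score` from searchEngineNew.py: the function name `sketch_score`, the `numeric_max` argument, the "any query term appears in the field" rule for string fields, and the division by the field maximum for numeric fields are all assumptions made for the example.

```python
import math
from numbers import Number


def sketch_score(document, query, weights, numeric_max):
    """Hypothetical weighted score: binary match for strings, normalized value for numbers."""
    query_terms = query.lower().split()
    score = 0.0
    for field, weight in weights.items():
        value = document.get(field)
        # Skip missing values and booleans (bool is a Number subclass in Python)
        if value is None or isinstance(value, bool):
            continue
        if isinstance(value, Number):
            # Assumption: normalize by the largest value of the field, skip NaN
            max_value = numeric_max.get(field, 0)
            if max_value > 0 and not math.isnan(value):
                score += weight * (value / max_value)
        elif isinstance(value, str):
            # Assumption: binary score of 1 if any query term occurs in the field
            text = value.lower()
            if any(term in text for term in query_terms):
                score += weight
    return score


# Made-up document and weights, only to show how the contributions add up
example_doc = {"courseName": "Data Science MSc", "country": "Italy", "fees (EUR)": 5000.0}
example_weights = {"courseName": 0.1, "country": 0.2, "fees (EUR)": 0.05}
example_score = sketch_score(
    example_doc, "data science italy", example_weights, {"fees (EUR)": 10000.0}
)
```

In this made-up example the two string fields match the query and contribute their full weights (0.1 and 0.2), while the fee contributes half of its weight (0.025).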

We decided to return the top-10 documents that best match the user's query.
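The heap-based selection of the best k documents mentioned above can be sketched like this. It is a minimal illustration using Python's standard `heapq` module and is not the actual `top_k_documents` from searchEngineNew.py; the function name `sketch_top_k` and the `(doc_id, score)` input format are assumptions.

```python
import heapq


def sketch_top_k(scored_documents, k):
    """Return the k highest-scoring (doc_id, score) pairs, best first."""
    if k <= 0:
        return []
    heap = []  # min-heap of (score, doc_id); the root is the weakest kept document
    for doc_id, score in scored_documents:
        if len(heap) < k:
            heapq.heappush(heap, (score, doc_id))
        elif score > heap[0][0]:
            # The new document beats the weakest one kept so far: swap them
            heapq.heapreplace(heap, (score, doc_id))
    return [(doc_id, score) for score, doc_id in sorted(heap, reverse=True)]


# Made-up scores, only to show the behaviour
example_scores = [("doc_a", 0.30), ("doc_b", 0.75), ("doc_c", 0.10), ("doc_d", 0.55)]
top_two = sketch_top_k(example_scores, 2)
```

Because the heap never holds more than k entries, each document costs at most one O(log k) operation, which is cheaper than sorting all the matching documents when k is small.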

In [None]:
# Query given in input from a user
query = input("Enter your query: ")

# Columns of our dataset that we want to use for the score
selected_columns = ['courseName', 'universityName', 'facultyName', 'isItFullTime',
       'description', 'startDate', 'fees', 'modality', 'duration', 'city',
       'country', 'administration', 'url', 'ProcessedDescription', 'currency',
       'fees (EUR)']

# Use the conjunction_search from 2.1 to retrieve the documents that match the query
result_df = conjunction_search(courses_df, vocabulary, inverted_index, query)

# Set the columns that we are using for scoring
documents = result_df[selected_columns].to_dict(orient='records')

# Define weights for scoring that reflect the importance of each field in the scoring process
weights = {
    "courseName": 0.1,
    "universityName": 0.2,
    "facultyName": 0.15,
    "isItFullTime": 0.05,
    "description": 0.2,
    "startDate": 0.08,
    "fees": 0.05,
    "modality": 0.1,
    "duration": 0.1,
    "city": 0.1,
    "country": 0.2,
    "administration": 0.15,
    "url": 0.05,
    'ProcessedDescription':0.1, 
    'currency':0.05,
    'fees (EUR)':0.05
}

# Number of top-k documents you want to retrieve
k = 10

# Retrieve and print the top-k documents using the function in searchEngineNew.py
result = top_k_documents(documents, query, weights, k)
result[['courseName','universityName','description','url', 'MyScore']]




**Let's now compare the results obtained with our new score function and with the function based on cosine similarity (from question 2.2).**

In question 2.2 we used cosine similarity to sort the documents based on their tf-idf representation. At first glance, without comparing the results, that function might seem to work better than our new score, because it relies on tf-idf, which is a standard tool in text mining.
Comparing the results, however, we can make the following observations:
- The function based on cosine similarity processes only the 'description' column of our dataset, so it extracts query-related documents based on the 'description' field alone.
- The function based on our new score processes all the columns of our dataset, so it extracts query-related documents by going through and weighing each field.

This leads us to conclude that when a user enters a query containing the name of a specific country or university, or a range of fees for the master's degree program, the function described in 2.2 is less effective than the new score we created, because the latter looks at all the fields in the dataset.

On the other hand, cosine similarity is a well-defined metric, so when the same query is passed to both score functions and it concerns only the text of the description, its ranking is more precise.
We can therefore conclude that, given our dataset and the way we processed its columns, the best score function depends on what we are interested in, and hence on the particular query.
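For reference, the cosine-similarity ranking discussed above can be sketched in plain Python as follows. This is not the code from question 2.2: the helper names, the tf-idf variant (term frequency divided by document length, times the logarithm of the inverse document frequency), and the toy token lists are assumptions made for the example.

```python
import math
from collections import Counter


def tfidf_vectors(tokenized_docs):
    """Build one sparse tf-idf vector (a dict term -> weight) per tokenized document."""
    n_docs = len(tokenized_docs)
    doc_freq = Counter(term for doc in tokenized_docs for term in set(doc))
    vectors = []
    for doc in tokenized_docs:
        counts = Counter(doc)
        vectors.append(
            {t: (c / len(doc)) * math.log(n_docs / doc_freq[t]) for t, c in counts.items()}
        )
    return vectors


def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two sparse vectors; 0.0 if either one is all zeros."""
    dot = sum(w * vec_b.get(t, 0.0) for t, w in vec_a.items())
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy "descriptions", already tokenized (made up for the example)
toy_docs = [
    ["data", "science", "master"],
    ["art", "history", "master"],
    ["data", "engineering"],
]
toy_vectors = tfidf_vectors(toy_docs)
similarity_shared_term = cosine_similarity(toy_vectors[0], toy_vectors[2])
similarity_no_overlap = cosine_similarity(toy_vectors[1], toy_vectors[2])
```

Only the tokens of the description enter these vectors, which is why a query about a country, a university, or fees cannot influence this ranking.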

