# Homework 3 - Master's Degrees from all over!

#### Group 2 <br>

<div style="float: left;">
    <table>
        <tr>
            <th>Student</th>
            <th>GitHub</th>
            <th>Matricola</th>
            <th>E-Mail</th>
        </tr>
        <tr>
            <td>André Leibrant</td>
            <td>JesterProphet</td>
            <td>2085698</td>
            <td>leibrant.2085698@studenti.uniroma1.it</td>
        </tr>
        <tr>
            <td>Gloria Kim</td>
            <td>keemgloria</td>
            <td>1862339</td>
            <td>kim.1862339@studenti.uniroma1.it</td>
        </tr>
    </table>
</div>

#### Import Libraries and Modules

In [3]:
import json
import os
import pickle
import subprocess
from datetime import datetime

import geopandas as gpd
import googlemaps
import numpy as np
import pandas as pd
import requests
from bs4 import BeautifulSoup
from keplergl import KeplerGl
from shapely.geometry import Point

import engine

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\glori\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\glori\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [4]:
pd.set_option("display.max_colwidth", None)

## 1. Data collection
### 1.1. Get the list of master's degree courses

We start with the list of courses to include in your corpus of documents. In particular, we focus on web scraping the [MSc Degrees](https://github.com/Sapienza-University-Rome/ADM/tree/master/2023/Homework_3#:~:text=web%20scrapping%20the-,MSc%20Degrees,-.%20Next%2C%20we%20want). Next, we want you to collect the URL associated with each course in the list. The list is long and split into many pages. Therefore, we ask you to retrieve only the URLs of the courses listed in the first 400 pages (each page has 15 courses, so you will end up with 6000 unique master's degree URLs).

The output of this step is a .txt file in which each line corresponds to one master's URL.

---

To do this, we create a text file `links.txt` containing the course links from every page. We build each page URL by appending the page number of the current iteration to the common prefix `https://www.findamasters.com/masters-degrees/msc-degrees/?PG=`. Each line of the file stores the page number followed by the course URL.

In [11]:
# Number of result pages to scan (15 courses per page)
pages = 400

# This is the main url of the website
main_url = "https://www.findamasters.com"

# This is the core part of each page url
page_url = "https://www.findamasters.com/masters-degrees/msc-degrees/?PG="

# Open the text file in write mode, which replaces any previous version;
# the with-block closes the file even if a request fails
with open("links.txt", "w") as file:

    # Go through every page and collect the urls of all Master programs
    for i in range(1, pages+1):

        # Define url for current page
        url = f"{page_url}{i}"

        # Make a request to the current page
        response = requests.get(url)

        # Parse the HTML from the response
        soup = BeautifulSoup(response.text, "html.parser")

        # Find all course links
        links = soup.find_all(attrs={"class": "courseLink text-dark"})

        # Save all links in the text file
        for link in links:

            # Save the link only if it has a non-empty href
            href = link.get("href")
            if href:
                file.write(f"{i}, {main_url}{href}\n")
            else:
                print(f"Page {i}: course link without href skipped")

PermissionError: [WinError 32] The process cannot access the file because it is being used by another process: 'links.txt'

### 1.2. Crawl master's degree pages
Once you get all the URLs in the first 400 pages of the list, you:

1. Download the HTML corresponding to each of the collected URLs.
2. After you collect a single page, immediately save its HTML in a file. In this way, if your program stops for any reason, you will not lose the data collected up to the stopping point.
3. Organize the downloaded HTML pages into folders. Each folder will contain the HTML of the courses on page 1, page 2, ... of the list of master's programs.

**Tip:** Due to the large number of pages you should download, you can use some methods that can help you shorten the time. If you employed a particular process or approach, kindly describe it.

---

First, we create the folder `pages` inside the project folder (if it doesn't already exist) and, inside it, one subfolder for each of the 400 pages.

In [12]:
# Define how many pages we want to parse
pages = 400

# Create folder pages inside the project
os.makedirs("pages", exist_ok=True)

# Create a folder for each page
for i in range(1, pages+1):
    
    # Fill the page number with leading zeros
    os.makedirs(f"pages/page_{str(i).zfill(3)}", exist_ok=True)

In the next step we created a module `crawler.py` that goes through every URL in the text file `links.txt` and downloads its HTML content. We use the `requests` library together with a list of user agents covering different platforms and web browsers, from which one is selected at random for each request. This is meant to reduce the risk of the website blocking or timing out our requests. We also check whether the course page still exists, i.e. that it does not return the message `FindAMasters Page Not Found`. If it does, we save the text `Page Not Found` in the course's file instead of the HTML. If the website doesn't fully load the content of a URL for any reason, we save that URL in a text file `failed_files.txt`. After going through every URL in `links.txt`, we repeat the procedure for the failed links in `failed_files.txt` until the HTML content of every URL has been downloaded (meaning the file `failed_files.txt` is empty).
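
The download-and-retry logic described above could look roughly like the following sketch. It is not the actual `crawler.py`: the user-agent strings, the injected `fetch` callable, the function names, and the `max_rounds` cap are assumptions made for illustration, and results are kept in memory instead of being written to files.

```python
import random

# Hypothetical user-agent pool; the real list in crawler.py is not shown.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.0 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:120.0) Gecko/20100101 Firefox/120.0",
]

NOT_FOUND_MARKER = "FindAMasters Page Not Found"


def download(url, fetch):
    """Return the text to save for `url`, or None if the download failed.

    `fetch(url, headers)` is an assumed callable that returns the page text,
    e.g. lambda u, h: requests.get(u, headers=h, timeout=30).text
    """
    headers = {"User-Agent": random.choice(USER_AGENTS)}
    try:
        html = fetch(url, headers)
    except Exception:
        return None  # network error: treat as a failed download
    if not html:
        return None  # empty response: treat as a failed download
    if NOT_FOUND_MARKER in html:
        return "Page Not Found"  # course was removed from the website
    return html


def crawl(urls, fetch, max_rounds=5):
    """Download every URL, retrying the failed ones round after round."""
    pages, pending = {}, list(urls)
    for _ in range(max_rounds):
        failed = []
        for url in pending:
            html = download(url, fetch)
            if html is None:
                failed.append(url)
            else:
                pages[url] = html
        pending = failed
        if not pending:
            break
    return pages, pending
```

The sketch caps the number of retry rounds so that a permanently unreachable URL cannot cause an endless loop; the procedure described above instead repeats until no failed links remain.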

To speed up the crawl we used the package `multiprocessing` and ran $n$ parallel processes, where $n$ equals the number of CPU cores of the current system. We also tried routing each process through a different proxy address, but this didn't improve the running time, so we stuck with the solution that uses only `multiprocessing`.
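
As an illustration of this setup, the sketch below splits a list of URLs into one chunk per CPU core and hands each chunk to a worker process. The helper names and the placeholder worker are assumptions; the real worker in `crawler.py` would download and save each page.

```python
import multiprocessing as mp
import os


def split_into_chunks(items, n):
    """Split `items` into `n` contiguous chunks of nearly equal size."""
    size, extra = divmod(len(items), n)
    chunks, start = [], 0
    for i in range(n):
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def crawl_chunk(urls):
    """Placeholder worker: only counts the URLs it was given."""
    return len(urls)


def crawl_in_parallel(urls):
    """Run one worker process per CPU core, each on its own chunk of URLs."""
    n = os.cpu_count() or 1
    with mp.Pool(processes=n) as pool:
        return pool.map(crawl_chunk, split_into_chunks(urls, n))
```

On Windows, `crawl_in_parallel` has to be called from inside an `if __name__ == "__main__":` block, because worker processes are started by re-importing the main module.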

In [13]:
%%time
subprocess.run(["python", "crawler.py"])

CPU times: total: 0 ns
Wall time: 508 ms


CompletedProcess(args=['python', 'crawler.py'], returncode=0)

### 1.3 Parse downloaded pages
At this point, you should have all the HTML documents about the master's degree of interest, and you can start to extract specific information. The list of the information we desire for each course and their format is as follows:

1. Course Name (to save as `courseName`): string;
2. University (to save as `universityName`): string;
3. Faculty (to save as `facultyName`): string
4. Full or Part Time (to save as `isItFullTime`): string;
5. Short Description (to save as `description`): string;
6. Start Date (to save as `startDate`): string;
7. Fees (to save as `fees`): string;
8. Modality (to save as `modality`):string;
9. Duration (to save as `duration`):string;
10. City (to save as `city`): string;
11. Country (to save as `country`): string;
12. Presence or online modality (to save as `administration`): string;
13. Link to the page (to save as `url`): string.

---

We created a module `parser.py` which goes through the HTML content of every downloaded page inside the `pages` folder, retrieves the information of interest for each course, and saves the result in one file per course inside the folder `courses` (the module creates the folder inside the project if it doesn't exist).

In [14]:
%%time
subprocess.run(["python", "parser.py"])

CPU times: total: 0 ns
Wall time: 2.04 s


CompletedProcess(args=['python', 'parser.py'], returncode=0)

## 2. Search Engine
Now, we want to create two different Search Engines that, given as input a query, return the courses that match the query.

### 2.0. Preprocessing
#### 2.0.0) Preprocessing the text
First, you must pre-process all the information collected for each MSc by:

1. Removing stopwords
2. Removing punctuation
3. Stemming
4. Anything else you think is needed

#### 2.0.1) Preprocessing the fees column
Moreover, we want the field `fees` to collect numeric information. As you will see, you scraped textual information for this attribute in the dataset: sketch whatever method you need (using regex, for example, to find currency symbol) to collect information and, in case of multiple information, retrieve only the highest fees. Finally, once you have collected numerical information, you likely will have different currencies: this can be chaotic, so let chatGPT guide you in the choice and deployment of an API to convert this column to a common currency of your choice (it can be USD, EUR or whatever you want). Ultimately, you will have a float column renamed `fees (CHOSEN COMMON CURRENCY)`.

---

We created a module `preprocess.py` which includes a function `preprocess_text` that takes a text as a string, removes stopwords and punctuation, checks that each token contains only alphabetical letters, and applies stemming (this function is also used in the later tasks). To save computation time, we preprocessed the `description` field once and stored the result in the course files as the new column `preprocessed_description`.
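
A minimal, dependency-free sketch of such a preprocessing function is shown below. It is not the code from `preprocess.py`: the tiny stopword set and the naive suffix stripper are stand-ins (the notebook output suggests that the real module relies on NLTK's `stopwords` and `punkt` resources), and returning a space-joined string is an assumption.

```python
import string

# Tiny illustrative stopword set; a full list would be used in practice.
STOPWORDS = {"a", "an", "and", "the", "of", "in", "to", "is", "for", "with", "you", "will"}

# Naive suffix stripping as a stand-in for a real stemmer such as Porter's.
SUFFIXES = ("ing", "ed", "es", "s")

# Translation table that turns every punctuation character into a space.
PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def naive_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess_text(text):
    """Lowercase, remove punctuation, stopwords and non-alphabetic tokens, then stem."""
    tokens = text.lower().translate(PUNCT_TO_SPACE).split()
    kept = [t for t in tokens if t.isalpha() and t not in STOPWORDS]
    return " ".join(naive_stem(t) for t in kept)
```

For example, `preprocess_text("You will develop advanced knowledge of 3 markets!")` drops the stopwords and the number and reduces `advanced` and `markets` to `advanc` and `market`.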

In addition, we preprocessed the `fees` field in the following way:

1. Exclude all predefined cases `no_fees_keywords`
2. Exclude field if it is a link using the function `is_valid_link`
3. Retrieve fee in EUR with the function `get_fee`

The function `get_fee` takes a string and extracts the (maximum) fee in EUR. First, we use a regex to retrieve all numbers inside the `fees` field. We exclude numbers between 1 and 50, which we treat as outliers, as well as numbers that are years. If any number remains, we try to retrieve all currencies using a regex. If we don't find any currency but the field includes the string `UK Fees:`, we treat the retrieved fee as GBP. If we don't find any currency, or if we find several different currencies, we exclude the field because we don't have enough information to retrieve the correct fee. Otherwise, we convert every fee using the predefined exchange rates in `exchange_rates` and choose the maximum fee if there are several.
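
The rules above can be sketched as follows. This is an illustrative reimplementation, not the code from `preprocess.py`: the regular expressions, the year heuristic (a bare four-digit number starting with 19 or 20), the supported currencies, and the exchange rates are all placeholder assumptions, and the earlier filtering steps (`no_fees_keywords`, `is_valid_link`) are left out.

```python
import re

# Placeholder rates to EUR, for illustration only; the real values in
# `exchange_rates` are not shown in the notebook.
exchange_rates = {"EUR": 1.0, "GBP": 1.15, "USD": 0.92}

# Assumed currency markers: symbol or ISO code.
CURRENCY_PATTERNS = {"EUR": r"€|\bEUR\b", "GBP": r"£|\bGBP\b", "USD": r"\$|\bUSD\b"}


def get_fee(text):
    """Return the highest fee in EUR found in `text`, or None if undecidable."""
    fees = []
    # Match numbers such as 9,250 or 12000.50
    for raw in re.findall(r"\d[\d,]*(?:\.\d+)?", text):
        if re.fullmatch(r"(?:19|20)\d{2}", raw):
            continue  # looks like a year, e.g. 2023
        value = float(raw.replace(",", ""))
        if 1 <= value <= 50:
            continue  # treated as an outlier
        fees.append(value)
    if not fees:
        return None

    currencies = {code for code, pat in CURRENCY_PATTERNS.items() if re.search(pat, text)}
    if not currencies and "UK Fees:" in text:
        currencies = {"GBP"}
    if len(currencies) != 1:
        return None  # no currency or several currencies: not enough information

    return max(fees) * exchange_rates[currencies.pop()]
```

One known weakness of this year heuristic is that a fee written as `2000` without a thousands separator is discarded as a year.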

In [15]:
%%time
subprocess.run(["python", "preprocess.py"])

CPU times: total: 0 ns
Wall time: 4.39 s


CompletedProcess(args=['python', 'preprocess.py'], returncode=1)

### 2.1. Conjunctive query
For the first version of the search engine, we narrowed our interest to the `description` of each course. It means that you will evaluate queries only concerning the course's description.

#### 2.1.1) Create your index!
Before building the index, create a file named `vocabulary`, in the format you prefer, that maps each word to an integer (`term_id`).

Then build the inverted index as a dictionary of this format:

```
{
term_id_1:[document_1, document_2, document_4],
term_id_2:[document_1, document_3, document_5, document_6],
...}
```

where `document_i` is the id of a document that contains that specific word.

---

Inside the module `engine` we create the function `create_inverted_index1`, which builds an inverted index based on the `preprocessed_description` field and saves the vocabulary in a pickle file `vocabulary_preprocessed_description_score1.pkl` inside the `vocabularies` folder (the module creates the folder inside the project if it doesn't exist). The function goes through every course file and skips it if the field is empty or doesn't exist. Otherwise it uses the function `add_document` to add each word together with the course id to the vocabulary if the word isn't in it yet, or to append the course id to the word's existing entry.
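
The following sketch shows one way such an index could be built in memory. The signature of `add_document`, the helper `build_inverted_index`, and the split into a separate word-to-`term_id` dictionary are assumptions for illustration; the actual functions in `engine` read the course files and pickle the result.

```python
from collections import defaultdict


def add_document(vocabulary, inverted_index, course_id, text):
    """Register every distinct word of `text` under `course_id`."""
    # dict.fromkeys removes duplicates while keeping the word order.
    for word in dict.fromkeys(text.split()):
        # Assign the next free integer as term_id the first time a word is seen.
        term_id = vocabulary.setdefault(word, len(vocabulary))
        inverted_index[term_id].append(course_id)


def build_inverted_index(documents):
    """Build (vocabulary, inverted_index) from {course_id: preprocessed text}."""
    vocabulary, inverted_index = {}, defaultdict(list)
    for course_id, text in documents.items():
        if not text:
            continue  # skip courses with an empty description
        add_document(vocabulary, inverted_index, course_id, text)
    return vocabulary, dict(inverted_index)


documents = {1: "advanc knowledg financ", 2: "", 3: "advanc food"}
vocabulary, inverted_index = build_inverted_index(documents)
```

Here `vocabulary` maps each word to its `term_id`, and `inverted_index` maps each `term_id` to the list of course ids whose description contains that word.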

In [17]:
%%time
engine.create_inverted_index1()

CPU times: total: 7.22 s
Wall time: 23.7 s


#### 2.1.2) Execute the query
Given a query input by the user, for example:

```
advanced knowledge
```

The Search Engine is supposed to return a list of documents.

**What documents do we want?**<br>
Since we are dealing with conjunctive queries (AND), each returned document should contain all the words in the query. The final output of the query must return, if present, the following information for each of the selected documents:

- `courseName`
- `universityName`
- `description`
- `URL`

---

For this we created the function `conjunctive_query` which takes a query string and returns a pandas dataframe of all courses where every word of the query is inside the course description.
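
At its core, a conjunctive query is an intersection of posting lists. The sketch below illustrates this step on a toy index; `matching_course_ids` and the example data are hypothetical, and unlike `conjunctive_query` it works on already preprocessed terms and returns plain ids rather than a dataframe.

```python
def matching_course_ids(query_terms, vocabulary, inverted_index):
    """Return the sorted ids of the courses that contain every query term."""
    postings = []
    for term in query_terms:
        term_id = vocabulary.get(term)
        if term_id is None:
            return []  # an unknown term means no course can match
        postings.append(set(inverted_index[term_id]))
    if not postings:
        return []
    return sorted(set.intersection(*postings))


# Toy data: word -> term_id and term_id -> course ids
toy_vocabulary = {"advanc": 0, "knowledg": 1, "food": 2}
toy_index = {0: [1, 3, 4], 1: [1, 4], 2: [3]}

result = matching_course_ids(["advanc", "knowledg"], toy_vocabulary, toy_index)
```

In this toy example only courses 1 and 4 contain both terms, so they are the only ids returned.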

In [18]:
query = "advanced knowledge"
engine.conjunctive_query(query).iloc[:, :4]

Unnamed: 0,courseName,universityName,description,url
0,Criminology MSc,London Metropolitan University,"Our Criminology MSc degree will allow you to develop an advanced knowledge of crime and offenders, as well as assess contemporary trends and concepts in criminal justice policy and community safety. You'll explore approaches to crime control within the community and penal institutions to gain the skills required to conduct research within the field of crime and criminal justice. This level of knowledge can prepare you for doctoral study or research posts within the criminal justice arena, but it's also ideal for consolidating your professional experience.",https://www.findamasters.com/masters-degrees/course/criminology-msc/?i149d7553c9597
1,Biotechnology MSc,University of Nottingham,"Our MSc course in Biotechnology is about discovery, innovation and translation of knowledge to application of novel products. The emphasis of the course is about the impact of modern biotechnological tools and approaches to address today's global challenges, from food to therapeutics. You'll learn about fundamental cellular mechanisms, genetic manipulations of biological systems and production processes. You will be trained in high throughput technologies in taught modules such as advanced molecular methods and be equipped with strong research skills throughout the curriculum. Industrial and commercial aspects of biotechnology bring you closer to the current trends and careers in the field. You can specialise in plant, microbial or animal biotechnology:",https://www.findamasters.com/masters-degrees/course/biotechnology-msc/?i338d851c63097
2,Finance - MSc,Newcastle University,"Understand and analyse international financial markets, institutions and strategies of investors. Our Finance MSc delivers advanced knowledge and skills in financial markets and institutions. You'll learn how they function and interact with the real economy. You'll gain knowledge in making optimal decisions in your financial career. The financial services sector has experienced worldwide growth. This has increased the demand for students with specialist skills finance. This course will suit you if you're interested in the following careers: Your studies will include:",https://www.findamasters.com/masters-degrees/course/finance-msc/?i177d3176c9111
3,Civil Engineering - MSc,"University of the West of England, Bristol","From building bridges to maintaining transport networks, civil engineers are behind some of the most significant advancements in the way we interact and function as a society. Engineering in its many forms already makes a significant contribution to the UK's GDP (gross domestic product). With ambitious construction and infrastructure projects in the pipeline and a chronic housing shortage, this is only set to grow. However an unprecedented skills shortage in the sector means that the demand for accomplished civil engineers has reached critical levels. This course is ideal for graduates of an engineering discipline who want to develop their existing knowledge across a range of specialist subject areas.",https://www.findamasters.com/masters-degrees/course/civil-engineering-msc/?i359d5890c53041
4,Master of Public Policy (MPP),London School of Economics and Political Science,"The Master of Public Policy (MPP) is designed for early to mid-career professionals who want to enhance their knowledge and analytical skills to effectively address complex public policy challenges, and advance their career in any policy-relevant sector. The intensive 9-month programme enables students to take a short career break to join an experienced global cohort, gain new perspectives and develop an understanding of the “craft of government”. This integrates theory and analysis, politics, and implementation of policy.",https://www.findamasters.com/masters-degrees/course/master-of-public-policy-mpp/?i150d8864c69630
...,...,...,...,...
443,Economics and Finance,University of Bath,"Combine core economics knowledge and theory with relevant financial topics to develop your skills for a finance-focused career. Our MSc Economics and Finance course is designed to give you the skills to start your career in a financial institution, consultancy, ministry of finance or economics, or a central bank. On this specialist master's, you’ll study advanced economic theory alongside relevant finance topics, some of which are taught by our School of Management. You will:",https://www.findamasters.com/masters-degrees/course/economics-and-finance/?i280d1681c6133
444,Food Science (Food Biotechnology) - MSc,University of Leeds,"Our Food Science (Food Biotechnology) MSc engages with issues at the very forefront of modern food production. Not only will you advance your knowledge of crucial areas within food science, health and sustainability, but you’ll also explore the origins of biotechnology, the legislation and social issues related to biotechnology in food and modern bioanalytical methods in biotechnology and food safety. You’ll study in our School of Food Science and Nutrition at Leeds, which is home to world-leading research that has impacted key areas in the food industry. This research directly informs this MSc programme, meaning you’ll be learning the latest innovations and practices in food science from researchers and academics who work within the School’s research institutes and groups.",https://www.findamasters.com/masters-degrees/course/food-science-food-biotechnology-msc/?i321d3236c2750
445,MSc International Management,University of Nottingham Ningbo China,"The MSc International Management programme furnishes students with advanced knowledge and facilitates the development of professional and interpersonal capabilities relevant to international and cross-cultural management, to prepare them for careers in international organisations and multicultural work settings. All students undertake core and elective modules relating to a wide range of management disciplines. Elective modules offer a wide selection of options that allow students to focus on advanced topics such as supply chain management, entrepreneurship, corporate social responsibility, e-business, or project management, that fit their interests and career aspirations. Apply",https://www.findamasters.com/masters-degrees/course/msc-international-management/?i1211d6419c27340
446,Civil Engineering - MSc,Abertay University,"Our Civil Engineering MSc will give you a comprehensive overview of what it takes to deliver a successful civil engineering project. You’ll learn the advanced capabilities and in-depth knowledge needed to thrive in your civil engineering career and solve work challenges such as environmental sustainability. You’ll develop your research skills by completing a dissertation on a topic you want to explore further, supported by our academic staff. Designed and delivered by a multi-disciplinary team of leading academics, who have over 60 years' experience in training civil engineers. Abertay has close links with industry and a high research rating in Civil and Environmental Engineering within Scotland.",https://www.findamasters.com/masters-degrees/course/civil-engineering-msc/?i278d3471c71093


### 2.2. Conjunctive query & Ranking score
For the second search engine, given a query, we want to get the top-*k* (the choice of *k* is up to you!) documents related to the query. In particular:

- Find all the documents that contain all the words in the query.
- Sort them by their similarity with the query.
- Return in output *k* documents, or all the documents with non-zero similarity with the query when the results are less than *k*. You must use a heap data structure (you can use Python libraries) for maintaining the top-*k* documents.

To solve this task, you must use the *tfIdf* score and the *cosine similarity*. The field to consider is still the `description`. Let's see how.

#### 2.2.1) Inverted index
Your second Inverted Index must be of this format:

```
{
term_id_1:[(document1, tfIdf_{term,document1}), (document2, tfIdf_{term,document2}), (document4, tfIdf_{term,document4}), ...],
term_id_2:[(document1, tfIdf_{term,document1}), (document3, tfIdf_{term,document3}), (document5, tfIdf_{term,document5}), (document6, tfIdf_{term,document6}), ...],
...}
```

Practically, for each word, you want the list of documents in which it is contained and the relative *tfIdf* score.

---

For this we implemented a second function, `create_inverted_index2`, which creates an inverted index for a given column name using the *tfIdf* score and saves it in a pickle file. If the column is not `preprocessed_description`, we preprocess the field first. We then compute the *tfIdf* matrix and build the inverted index from it, keeping for each term only the courses with a score larger than 0. The result is saved in `vocabulary_{column_name}.pkl`.
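
To make the structure of this second index concrete, here is a small sketch that computes the scores by hand. It uses the textbook weighting tf = count / document length and idf = ln(N / df), which is an assumption and may differ from the weighting used in `create_inverted_index2`; for brevity it also keys the index by the term itself rather than by a `term_id`.

```python
import math
from collections import Counter, defaultdict


def build_tfidf_index(documents):
    """Map each term to a list of (doc_id, tfidf) pairs with tfidf > 0."""
    term_counts = {
        doc_id: Counter(text.split()) for doc_id, text in documents.items() if text
    }
    n_docs = len(term_counts)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for counts in term_counts.values() for term in counts)

    index = defaultdict(list)
    for doc_id, counts in term_counts.items():
        total = sum(counts.values())
        for term, count in counts.items():
            tfidf = (count / total) * math.log(n_docs / doc_freq[term])
            if tfidf > 0:  # keep only courses with a positive score
                index[term].append((doc_id, tfidf))
    return dict(index)


tfidf_index = build_tfidf_index(
    {1: "food industri food", 2: "food scienc", 3: "data scienc"}
)
```

With this weighting, a term that occurs in every document gets an idf of 0 and therefore does not appear in the index at all.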

In the following cell we create the inverted index using the *tfIdf* score for the field `preprocessed_description`.

In [19]:
%%time
engine.create_inverted_index2("preprocessed_description")

CPU times: total: 35.9 s
Wall time: 38.3 s


#### 2.2.2) Execute the query
In this new setting, given a query, you get the proper documents (i.e., those containing all the query's words) and sort them according to their similarity to the query. For this purpose, as the scoring function, we will use the *cosine similarity* concerning the *tfIdf* representations of the documents.

Given a query input by the user, for example:

```
advanced knowledge
```

The search engine is supposed to return a list of documents, ranked by their *cosine similarity* to the query entered in the input.

More precisely, the output must contain:

- `courseName`
- `universityName`
- `description`
- `URL`
- The similarity score of the documents with respect to the query (float value between 0 and 1)

---

For this we created a function `retrieve_courses` which takes a query string and a vocabulary and returns a pandas dataframe of the top *k* courses (or all matching courses if no *k* is given) that contain every word of the query according to the given vocabulary, sorted by cosine similarity in descending order. The query string is preprocessed before the courses are retrieved.
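
The ranking step can be illustrated with the sketch below, which scores candidate documents by cosine similarity and keeps the best *k* in a min-heap. The function names and the sparse `{term: weight}` representation are assumptions, and the sketch expects `doc_vecs` to contain only the courses that already passed the conjunctive filter.

```python
import heapq
import math


def cosine_similarity(vec_a, vec_b):
    """Cosine similarity between two sparse vectors given as {term: weight}."""
    dot = sum(w * vec_b.get(t, 0.0) for t, w in vec_a.items())
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k):
    """Return the k (score, doc_id) pairs with the highest similarity."""
    heap = []  # min-heap holding the best k candidates seen so far
    for doc_id, vec in doc_vecs.items():
        score = cosine_similarity(query_vec, vec)
        if score <= 0:
            continue  # only documents with non-zero similarity are returned
        if len(heap) < k:
            heapq.heappush(heap, (score, doc_id))
        else:
            # Push the new candidate and drop the current minimum in one step.
            heapq.heappushpop(heap, (score, doc_id))
    return sorted(heap, reverse=True)
```

Because the heap never holds more than *k* entries, ranking *n* candidates takes O(n log k) time instead of the O(n log n) needed to sort all of them.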

In the following cell we retrieve the 10 courses closest to the given query using the `vocabulary_preprocessed_description.pkl` vocabulary.

In [20]:
# Load inverted index from pickle file
with open("vocabularies/vocabulary_preprocessed_description.pkl", "rb") as file:
    vocabulary = pickle.load(file)

query = "advanced knowledge"
engine.retrieve_courses(query, vocabulary, k=10).iloc[:, [0, 1, 2, 3, 9]]

Unnamed: 0,courseName,universityName,description,url,similarity
0,Advanced Computer Science MSc,University of Liverpool,"This course aims to extend your knowledge gained during undergraduate study with more advanced specialised material reflecting current research at the “cutting-edge” of the discipline. This programme will underpin and enhance your current knowledge and understanding; along with skills that you develop during the programme, will provide you with a strong basis for your future career in the IT industry and towards specialisation in the field of Computer Science related research and development. Designed for graduates of the highest calibre, the MSc in Advanced Computer Science is directed at graduates with a previous Computer Science or IT degree.",https://www.findamasters.com/masters-degrees/course/advanced-computer-science-msc/?i326d913c13227,0.996209
1,Advanced Data Science MSc,Bangor University,"The Advanced Data Science M.Sc. allows graduates from either a CS or Data Science ungraduated programme to broaden and deepen their skills and knowledge of data management, processing and analysis. This path also allows a route to specialise their CS experience allowing careers in business intelligence, big data analytics, or research and development. The programme shares the same commitment to professionalism and ethical exploitation of data and technology. These ideals are critical as we enter Industrial Revolution 4.0 and the ever-increasing utilisation of data to define everyday life.",https://www.findamasters.com/masters-degrees/course/advanced-data-science-msc/?i13d8006c62055,0.996209
2,Astronomy - MSc,University of Sussex,"Explore astronomy and astrophysics at an advanced level, focusing on observational, theoretical or computational astronomy. This course is one of only three full-time, broad-based astronomy MSc courses in the UK. This course is for you if you graduated from a physics- or applied mathematics-based degree with a substantial physics component, and wish to learn how to apply your knowledge to astronomy. Our Astronomy Centre carries out world-leading research in many branches of theoretical and observational astrophysics. Our particular focus is on the early universe, and galaxy formation and evolution.",https://www.findamasters.com/masters-degrees/course/astronomy-msc/?i356d4268c5671,0.993489
3,Civil Engineering - MSc,University of West London,"If you are looking for a Master’s degree in Civil Engineering that provides a pathway for you to become a professional engineer in various fields of contemporary civil engineering, this course is for you. The course will advance your analytical skills, research knowledge and technical understanding and focus on the use of industrially relevant applications and IT skills in the main subject areas of civil engineering including:",https://www.findamasters.com/masters-degrees/course/civil-engineering-msc/?i237d3554c54396,0.993489
4,Computational finance,University of Padua,"Through a highly transdisciplinary study path, the Master's degree in Computational Finance prepares students with advanced quantitative and computational proficiencies combined with a solid knowledge of the financial and insurance markets. This innovative and in-depth programme combines economic, information technology, mathematical and statistical skills needed to fill qualified roles in companies and institutions of financial, insurance and energy fields.",https://www.findamasters.com/masters-degrees/course/computational-finance/?i1283d7798c70418,0.993489
5,Electronics & Photonics Manufacturing - MSc,University of Glasgow,This Masters in Electronics & Photonics Manufacturing introduces you to a broad spectrum of specialist topics in advanced manufacturing and electronics design. These topics involve the fusion of novel nanofabrication and microforming processes with material technologies within a manufacturing context. This programme provides the essential knowledge and practical experience of manufacturing techniques for modern industrial and consumer products and systems.,https://www.findamasters.com/masters-degrees/course/electronics-and-photonics-manufacturing-msc/?i307d4813c54257,0.993489
6,Financial Technology - MSc,University of Glasgow,Fast-moving trends in technology and banking have altered the landscape for those seeking to make an impact in the world of finance. The MSc Financial Technology provides an advanced education in the multiple converging skills and knowledge bases that are required by the organisations of the,https://www.findamasters.com/masters-degrees/course/financial-technology-msc/?i307d7918c56395,0.993489
7,Health and Global Environment - MSc,University of Salford,"The COVID-19 pandemic highlights the challenge in managing the complex forces and interrelationships that shape the health of the environment and the general public. Build the skills to become an advanced practitioner with our MSc Health and Global Environment postgraduate degree. Available full and part-time study, our highly-relevant and topical course is carefully-designed to equip you with the knowledge and skills necessary to engage in planning and decision-making to improve the health of populations at global and national levels.",https://www.findamasters.com/masters-degrees/course/health-and-global-environment-msc/?i347d44c54790,0.993489
8,Computing,University of East London,Do you want to be at the forefront of solving tomorrow’s real-world technology problems and aiding the need for better products and solutions? An MSc Computing and Information Communication Technology will equip you with the key skills and knowledge in order for you to impact the future of technological advances in computer-based systems. This course helps you to further develop your knowledge and skills within the cutting-edge areas of computing and information communication technology. The course has been designed to provide you with a blend of advanced theoretical knowledge and practical skills related to emerging technologies deployed in industry and research.,https://www.findamasters.com/masters-degrees/course/computing/?i298d3331c62098,0.993489
9,Computer Science MSc,University of East London,Do you want to be at the forefront of solving tomorrow’s real-world technology problems and aiding the need for better products and solutions? An MSc in Computer Science will equip you with the key skills and knowledge in order for you to impact the future of technological advances in computer-based systems. This course helps you to further develop your knowledge and skills within the cutting-edge areas of Computer Science. The course has been designed to provide you with a blend of advanced theoretical knowledge and practical skills related to emerging technologies deployed in industry and research. The course is taught and delivered by leading researchers who are actively engaged in this rapidly changing field to ensure that you’re up to date with the latest developments.,https://www.findamasters.com/masters-degrees/course/computer-science-msc/?i298d3331c51036,0.993489


## 3. Define a new score!
Now it's your turn: build a new metric to rank MSc degrees.

Practically:

1. The user will enter a text query. As a starting point, get the query-related documents by exploiting the search engine of Step 2.1.
2. Once you have the documents, you need to sort them according to your new score. In this step, you no longer have to consider only the `description` field of the documents; you can also use the remaining variables in your dataset (or new variables that you create from the existing ones or scrape again from the original web pages). You must use a heap data structure (you can use Python libraries) for maintaining the top-k documents.

**N.B.:** You have to define a scoring function, not a filter!

The output must contain:

- `courseName`
- `universityName`
- `description`
- `URL`
- The **new** similarity score of the documents with respect to the query

Are the results you obtain better than with the previous scoring function? **Explain and compare results**.

In [44]:
with open("vocabularies/vocabulary_preprocessed_description.pkl", "rb") as file:
    vocabulary = pickle.load(file)

query = "food industry"
engine.retrieve_courses(query, vocabulary, k=15).iloc[:, [0, 1, 2, 3, 9]]

Unnamed: 0,courseName,universityName,description,url,similarity
0,Environmental Management for Agriculture - MSc,University of Hertfordshire,"Industry Recognised Accreditation: Accredited by the Institute of Environmental Management and Assessment (IEMA) and the Chartered Institution of Water and Environmental Management (CIWEM). Sector-Specific Course Content: You will explore environmental management issues associated with agriculture, such as crop protection and farm management. Employment Prospects: Graduates work for the Environmental Agency as environmental managers, agricultural consultants. Others work for food and agriculture organisations, in crop protection roles across the UK and overseas.",https://www.findamasters.com/masters-degrees/course/environmental-management-for-agriculture-msc/?i313d1157c28854,0.997915
1,Green Economy and Sustainability (MSc),University of Brescia,"This curriculum aims to prepare MSc (Master of Science) students to face the challenges and opportunities of a time of deep economic change, which requires high-value added, environmentally and socially friendly goods and services. At corporate level, the postgraduates in “Green Economy and Sustainability” will have a suitable profile for the new positions in the area of Corporate Social Responsibility (CSR) / Sustainability. Moreover, they will be fitting candidates for the following industries / activities where sustainability has become a crucial feature: business consulting; food supply chains; public utilities; services, mainly focused on the promotion of local flagships capable of attracting qualified tourist flows; research, and so on. First",https://www.findamasters.com/masters-degrees/course/green-economy-and-sustainability-msc/?i2679d8225c57279,0.997915
2,Horticulture MSc,Writtle University College,"Writtle University College is one of the most famous and well-respected centres for horticultural technology and research. Postgraduate students from the Writtle University College are highly regarded throughout this international industry, and often go on to work on major projects affecting the production, storage and supply of food and fuel crops across the globe. Students studying there will have access to the Research glasshouse, farms and the postharvest unit which undertakes research and trials in conjunction with commercial companies.",https://www.findamasters.com/masters-degrees/course/horticulture-msc/?i389d4350c30147,0.997915
3,International Hospitality Management-MSc,Canterbury Christ Church University,"On this course you will be immersed in the wonderful world of gastronomy and international service excellence, you will study topics including global food and drink design, hotel management, talent development and entrepreneurship. The hospitality industry is worth billions to the economies of different countries and continues to evolve. Your core modules will provide an overall foundation for management in international hospitality and one optional module offers you the chance to study a bespoke area of management. Employability skills and attributes are embedded into each module to develop your professional skills and talents.",https://www.findamasters.com/masters-degrees/course/international-hospitality-management-msc/?i32d2710c71538,0.997915
4,Advanced Chemical Engineering with Formulation - MSc,University of Birmingham,"This Advanced Chemical Engineering with Formulation programme focuses on advanced chemical engineering topics that inform the modern process engineering industry. Our three major research areas – formulation engineering, energy, and healthcare technology – guide the programme. Chemical engineering is dynamic and evolving, and today extends far beyond its roots in oil and gas processing. It provides solutions to problems facing many sectors, such as energy supply and storage, food, fast-moving consumer goods, pharmaceuticals, and healthcare.",https://www.findamasters.com/masters-degrees/course/advanced-chemical-engineering-with-formulation-msc/?i282d4546c59557,0.997915
5,Business for Agri-Food and Rural Enterprise - Business Communication (MSc),Queen’s University Belfast,"The Business for Agri-food and Rural Enterprise programmes, have been designed to develop a conceptual understanding of the principles and processes of change affecting individuals, groups or organisations within the agri-food and rural business sectors, through an active and innovative approach to learning and teaching. Students will complete their courses at the College of Agriculture, Food and Rural Enterprise (CAFRE), Loughry Campus, Cookstown, Co. Tyrone. The MSc presents full-time students with two industry –related opportunities. In Semester 1, to experience teaching practice, and in Semester 2, developing business materials in a managerial or supervisory role.",https://www.findamasters.com/masters-degrees/course/business-for-agri-food-and-rural-enterprise-business-communication-msc/?i195d2126c68452,0.997915
6,"Control, Automation and Artificial Intelligence MSc",Coventry University,"Is your undergraduate degree in engineering, mathematics or science? An accredited MSc in Control, Automation and Artificial Intelligence can help you to develop skills that would be beneficial in almost every engineering field from automotive, aircraft industry, power and energy, automation, process industry including oil and gas, food and drink, pharmaceutical industry and many others. Control engineering is a means of managing and measuring performance of process systems in areas from power plants and nuclear reactors to construction companies and manufacturing. The principles of control also extend in activities as diverse as managing risk in the financial sector to studying climate change within science.",https://www.findamasters.com/masters-degrees/course/control-automation-and-artificial-intelligence-msc/?i49d2694c28594,0.997915
7,"Advanced Chemical Engineering, MSc",University of Greenwich,"Chemical engineering is a rapidly evolving field that impacts areas including energy, food, water, and health. This Master's in Advanced Chemical Engineering focuses on the fundamentals of key chemical and industrial processes and how they are put into practice. You'll encounter the latest technologies available to the process industries and gain exposure to a broad range of crucial operations and optimisation methods.",https://www.findamasters.com/masters-degrees/course/advanced-chemical-engineering-msc/?i309d3273c64985,0.997915
8,Advanced Pharmaceutical Manufacturing MSc,University of Strathclyde,"This MSc Advanced Pharmaceutical Manufacturing is designed to produce highly skilled graduates in continuous manufacturing science and technology to meet the growing demands for expertise in this area. You’ll be trained to take up jobs in the food, chemical and pharmaceutical industries. The course is aligned with the Continuous Manufacturing & Crystallisation (CMAC) centre. It's supported by academic staff from across the University and was informed by CMAC's strategic industry partners such as:",https://www.findamasters.com/masters-degrees/course/advanced-pharmaceutical-manufacturing-msc/?i353d4190c29173,0.967516
9,Air Transport Management MSc,Coventry University,"In today’s globalised world, we have come to rely heavily on civil aviation – not only for international tourism, but also to keep the supply chains of many industry sectors running smoothly. For perishable commodities, such as fresh food or cut flowers, there is no alternative. This MSc course seeks to prepare students to be successful senior managers and leaders in the highly competitive international air transport industry. To help deliver this, the course has been designed to offer two alternative pathways.",https://www.findamasters.com/masters-degrees/course/air-transport-management-msc/?i49d2694c28587,0.96431


In [75]:
from preprocess import preprocess_text

def score_document(query, document):
    """Score a course by query matches in the description plus a course-name bonus."""
    # Preprocess the query and the document description
    preprocessed_query = preprocess_text(query)
    preprocessed_description = preprocess_text(document["description"])

    # Score based on how often the preprocessed query occurs in the description
    keyword_score = preprocessed_description.count(preprocessed_query)

    # Bonus if the course name mentions "food" (hard-coded for the example query below)
    course_score = 0
    if "food" in document["courseName"].lower():
        course_score += 1

    # Total score
    return keyword_score + course_score


query = "food industry"
results = engine.retrieve_courses(query, vocabulary)

# Create a score column based on the scoring function
results["score"] = results.apply(lambda row: score_document(query, row), axis=1)

# Sort the results by score (descending)
sorted_results = results.sort_values(by="score", ascending=False)

# Keep only the display columns; every retrieved course is kept, best first
top_k = sorted_results.iloc[:, [0, 1, 2, 3, 9, 10]]


In [76]:
top_k

Unnamed: 0,courseName,universityName,description,url,similarity,score
55,Master of Science in Food Innovation,Atlantic Technological University,"The Master of Science in Food Innovation is designed to build on knowledge acquired at primary degree level and experience gleaned in the food industry. It will enable students to gain extensive and relevant scientific and operational management knowledge to lead or advance them in a career in the food industry. The focus of the programme will be on the subject areas of innovation management; applied food science; physical and sensory analysis; food safety management; food processing and biotechnology; quality and innovation, research and development.",https://www.findamasters.com/masters-degrees/course/master-of-science-in-food-innovation/?i92d7727c71662,0.964310,3
57,Food Science - MSc,University of Leeds,"The food industry is one of the largest in the world — with leading global corporations seeking qualified scientists who have the acumen to advance their products. Whether it’s enhancing the quality and safety of food products that interests you, or you’re keen to develop brand new products from concept to launch, the extensive skill set you’ll build on our Food Science MSc will open the door to many diverse career opportunities in this ever-evolving field. From challenging current issues in food production to applying scientific concepts to grasp the complex characteristics of food, this programme will broaden your understanding of crucial areas in the food industry.",https://www.findamasters.com/masters-degrees/course/food-science-msc/?i321d3236c2752,0.964310,3
64,Food Technology (online) MSc,Wageningen University & Research,"The online master's specialisation Food Technology focuses on the core of food technology: ingredient functionality, sustainable food process engineering and product design. The online specialisation is part of the master's Food Technology, which is one of the best and most innovative programmes in Europe and worldwide. You will learn how to perform food science research, design food products and improve food production processes. Since the programme includes input from different disciplines: food chemistry, food physics, food microbiology, food process engineering and food quality & design, you will be able to work in different branches of the food industry.",https://www.findamasters.com/masters-degrees/course/food-technology-online-msc/?i883d5909c54129,0.873646,2
36,Food Science MSc,London Metropolitan University,"This degree focuses on food analysis and food microbiology as well as product development and quality control. You'll be taught by members of staff who are active within the Institute of Food Science and Technology, and are regularly involved in the food industry as expert consultants. You'll also learn from our food business development colleagues to gain experience in the industry through work placements. In the most recent Destinations of Leavers from Higher Education (DLHE) survey, 100% of all 2017 graduates from this course were in work or further study within six months.",https://www.findamasters.com/masters-degrees/course/food-science-msc/?i149d7552c45887,0.873646,2
38,Food Science & Technology - MSc/PgD/PgC,Cardiff Metropolitan University,"The food industry in the United Kingdom has developed a world-renowned reputation for the production of exceptional quality, safe, wholesome products. To maintain this position in the global market, it is vital that the workforce is equally competent and highly skilled. The Master’s of Food Science and Technology at Cardiff Met has been designed to provide you with professional training combining comprehensive theoretical and practical knowledge within the fields of food science and food technology. It is ideal for students and professionals seeking to expand their career prospects into a wide range of food manufacturing, commercial, government or research roles in the broad field of food science and food technology.",https://www.findamasters.com/masters-degrees/course/food-science-and-technology-msc-pgd-pgc/?i366d270c16802,0.997915,2
...,...,...,...,...,...,...
26,Advanced Pharmaceutical Manufacturing MSc,University of Strathclyde,"This MSc Advanced Pharmaceutical Manufacturing is designed to produce highly skilled graduates in continuous manufacturing science and technology to meet the growing demands for expertise in this area. You’ll be trained to take up jobs in the food, chemical and pharmaceutical industries. The course is aligned with the Continuous Manufacturing & Crystallisation (CMAC) centre. It's supported by academic staff from across the University and was informed by CMAC's strategic industry partners such as:",https://www.findamasters.com/masters-degrees/course/advanced-pharmaceutical-manufacturing-msc/?i353d4190c29173,0.762085,0
35,Aquatic Veterinary Studies MSc,University of Stirling,"Food from aquatic systems is essential for much of the world’s population. However, with wild catches of seafood declining in many places, aquaculture is playing an increasing role as an alternative source of high-quality, nutritious food - and as an employer. Controlling disease is important to the ongoing success of this industry. This Masters in Aquatic Veterinary Studies provides you with training in the wide range of disciplines and skills you need for the investigation, prevention and control of aquatic animal diseases. You’ll develop an understanding of the biology, husbandry and environment of farmed aquatic species, as well as specialist expertise in aquatic animal diseases.",https://www.findamasters.com/masters-degrees/course/aquatic-veterinary-studies-msc/?i352d8074c7681,0.964310,0
28,"Chemistry, Master Programme (60 credits)",Linnaeus University,"This degree provides an excellent basis for those wishing to pursue PhD studies, or for those aiming for work in the biotechnology, chemical, food or pharmaceutical industries. The degree program commences with a 10 week introductory course on research methodology, which is followed by four or five ten week blocks where you have the choice of studying subjects such as bioanalytical chemistry and food analysis, biophysical chemistry, bioorganic chemistry, biotechnology, environmental chemistry and nanoscience. The degree program is completed with a either a 30 or 20 week research project run within one of the research groups in Kalmar.",https://www.findamasters.com/masters-degrees/course/chemistry-master-programme-60-credits/?i2112d8646c66530,0.846429,0
1,"Advanced Chemical Engineering, MSc",University of Greenwich,"Chemical engineering is a rapidly evolving field that impacts areas including energy, food, water, and health. This Master's in Advanced Chemical Engineering focuses on the fundamentals of key chemical and industrial processes and how they are put into practice. You'll encounter the latest technologies available to the process industries and gain exposure to a broad range of crucial operations and optimisation methods.",https://www.findamasters.com/masters-degrees/course/advanced-chemical-engineering-msc/?i309d3273c64985,0.762085,0
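
The task asks for a heap to maintain the top-k documents, whereas the cell above sorts the whole result set. The following is a minimal sketch of the heap-based variant using only the standard library. The helper `top_k_by_score` and the toy `(doc_id, score)` pairs are illustrative assumptions and are not part of `engine`.

```python
import heapq


def top_k_by_score(scored_docs, k):
    """Return the k highest-scoring (score, doc_id) pairs, best first."""
    heap = []  # min-heap with at most k items; heap[0] is the weakest kept item
    for doc_id, score in scored_docs:
        item = (score, doc_id)
        if len(heap) < k:
            heapq.heappush(heap, item)
        elif item > heap[0]:
            # The new item beats the weakest kept one: swap them in O(log k)
            heapq.heapreplace(heap, item)
    return sorted(heap, reverse=True)


# Toy (doc_id, score) pairs, loosely modelled on the table above
scores = [(55, 3), (57, 3), (64, 2), (36, 2), (26, 0), (35, 0)]
best = top_k_by_score(scores, k=3)
print(best)
```

Keeping a min-heap of size k means each document costs at most O(log k), so the full result set never has to be sorted. Ties are broken by document id here, which is an arbitrary choice.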


## 4. Visualizing the most relevant MSc degrees
Using maps can help people understand how far one university is from another so they can plan their academic careers more adequately. Here, we challenge you to show a map of the courses found with the score defined in point 3. You should be able to identify at least the city and country for each MSc degree. You can find some ideas on how to create maps in Python [here](https://github.com/Sapienza-University-Rome/ADM/tree/master/2023/Homework_3#:~:text=maps%20in%20Python-,here,-and%20here%20but) and [here](https://github.com/Sapienza-University-Rome/ADM/tree/master/2023/Homework_3#:~:text=Python%20here%20and-,here,-but%20you%20will), but you may need further information for a proper visualization, such as coordinates (latitude and longitude). You can retrieve this data using various tools:

1. [Here](https://github.com/Sapienza-University-Rome/ADM/tree/master/2023/Homework_3#:~:text=using%20various%20tools%3A-,Here,-you%20can%20find) you can find a helpful tutorial on how to encode geo-information using the Google API in Python (this tool can also be used in [Google Sheets](https://github.com/Sapienza-University-Rome/ADM/tree/master/2023/Homework_3#:~:text=be%20used%20in-,Google%20Sheets,-)))
2. You can collect a list of unique places in the format (City, Country) and ask chatGPT (or, as usual, any other LLM chatbot) to provide you with a list of corresponding representative coordinates
3. Explore and find the best solution for your case!

Once you have defined your visualization strategy, include a way to encode fees in your charts. The map should show (with a proper legend) the different courses and their associated fees: the user wants a glimpse not only of how far they will need to move but also of how much it will cost them!

---

We decided to use the Google Maps client to retrieve the latitude and longitude given an address. First we retrieve the top 100 courses using our score from point 3. Then we concatenate the columns `universityName`, `city`, and `country` to create our new column `address`. By using this column we retrieve the coordinates of every course.

In [5]:
# Create Google Maps client. The API key is read from the environment instead of being
# hard-coded; the variable name GOOGLE_MAPS_API_KEY is our own choice.
gmaps = googlemaps.Client(key=os.environ["GOOGLE_MAPS_API_KEY"])

query = "data science"

# TODO: switch to the score from section 3

courses = engine.retrieve_courses(query, vocabulary, k=100)[["courseName", "universityName", "city", "country", "fees (€)"]]

courses["address"] = courses["universityName"] + ", " + courses["city"] + ", " + courses["country"]

courses["lat"] = ""
courses["long"] = ""

# Retrieve the latitude and longitude for the given addresses
for i in range(len(courses)):
    geocode_result = gmaps.geocode(courses["address"][i])
    courses.loc[i, "lat"] = geocode_result[0]["geometry"]["location"]["lat"]
    courses.loc[i, "long"] = geocode_result[0]["geometry"]["location"]["lng"]

courses

Unnamed: 0,courseName,universityName,city,country,fees (€),address,lat,long
0,Computer Science MSc (Online),Northumbria University,Newcastle,United Kingdom,10384.52,"Northumbria University, Newcastle, United Kingdom",54.978252,-1.61778
1,Master of Science in Biomolecular Engineering and Health Informatics,The Hong Kong University of Science and Technology,Clear Water Bay,Hong Kong,,"The Hong Kong University of Science and Technology, Clear Water Bay, Hong Kong",22.337485,114.263399
2,MSc in Integrated Immunology,University of Oxford,Oxford,United Kingdom,52054.21,"University of Oxford, Oxford, United Kingdom",51.754816,-1.254367
3,MSc Politics & Data Science,University College Dublin,Dublin,Ireland,,"University College Dublin, Dublin, Ireland",53.309727,-6.22159
4,MSc Social Data Science,University College Dublin,Dublin,Ireland,,"University College Dublin, Dublin, Ireland",53.309727,-6.22159
...,...,...,...,...,...,...,...,...
95,MSc Data Science and Artificial Intelligence,University of Liverpool,Liverpool,United Kingdom,18383.18,"University of Liverpool, Liverpool, United Kingdom",53.408371,-2.991573
96,Computer Science: Applied Data Science,Malmö University,Malmo,Sweden,,"Malmö University, Malmo, Sweden",55.608795,12.994561
97,Data Science (MSc),University of Bath,Bath,United Kingdom,,"University of Bath, Bath, United Kingdom",51.378223,-2.326399
98,Data Science MSc,University of Chester,Chester,United Kingdom,,"University of Chester, Chester, United Kingdom",53.193392,-2.893075
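
Several courses share the same address (for example the two University College Dublin rows above), so the number of geocoding requests can be reduced by looking up each distinct address only once. The sketch below illustrates the idea with a stand-in geocoder. `geocode_unique` and `fake_geocode` are hypothetical helpers for illustration; in the notebook the lookup would be the `gmaps.geocode` call.

```python
def geocode_unique(addresses, geocode):
    """Geocode each distinct address once and return one result per input row."""
    cache = {}
    for address in addresses:
        if address not in cache:
            # The geocoder may return None when nothing was found
            cache[address] = geocode(address)
    return [cache[address] for address in addresses]


# Stand-in for a real geocoding call; it records every lookup it receives
calls = []


def fake_geocode(address):
    calls.append(address)
    known = {"University College Dublin, Dublin, Ireland": (53.309727, -6.22159)}
    return known.get(address)


rows = [
    "University College Dublin, Dublin, Ireland",
    "University College Dublin, Dublin, Ireland",
    "Unknown Place",
]
coords = geocode_unique(rows, fake_geocode)
print(coords, len(calls))
```

Rows whose address could not be resolved come back as `None`, so they can be dropped or handled explicitly before plotting.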


After we retrieve the top *k* courses, we replace every missing value in the field `fees (€)` with 0. In addition, we add a small random offset to the longitude and latitude of every point so that courses at the same university don't overlap.

We use `geopandas` (imported as `gpd`) and `KeplerGl` to visualize our results with the predefined Kepler config file `kepler_config.json`. The result is saved as an interactive Kepler map in `map.html`. Opening the file in a browser shows the whole world map by default, and you can zoom in and out. On the right side, you can enable the legend by clicking the `show legend` button. The legend maps the color of each point to its range of fees, using 10 steps based on the fees in the current data.

In [6]:
# Replace all values where we don't have a fee with 0
courses["fees (€)"] = courses["fees (€)"].fillna(0)

# Add a small zero-mean random offset to every data point so points for the same university don't overlap
geometry = [Point(xy) for xy in zip(courses["long"] + np.random.normal(loc=0.0, scale=0.005, size=len(courses)),
                                    courses["lat"] + np.random.normal(loc=0.0, scale=0.005, size=len(courses)))]

# Open kepler config file
with open("kepler_config.json", "r") as f:
    custom_config = json.load(f)

# Create a geodataframe with the found courses inside the pandas dataframe
gdf = gpd.GeoDataFrame(courses, geometry=geometry)

# Create map with kepler
map_file = KeplerGl(height=600, width=800, config=custom_config)
map_file.add_data(data=gdf, name="Visualizing the most relevant MSc degrees")

# Save file as an interactive html file
map_file.save_to_html(file_name="map.html")

User Guide: https://docs.kepler.gl/docs/keplergl-jupyter
Map saved to map.html!
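
The offset above is applied to every point. As an alternative, only points that actually share coordinates could be moved, with a fixed seed so that the map is reproducible. This is a minimal standard-library sketch under those assumptions; `jitter_duplicates` is a hypothetical helper and is not used in the cell above.

```python
import random
from collections import Counter


def jitter_duplicates(points, max_offset=0.005, seed=0):
    """Shift only points that share coordinates with another point."""
    rng = random.Random(seed)  # fixed seed keeps the result reproducible
    counts = Counter(points)
    jittered = []
    for lat, lng in points:
        if counts[(lat, lng)] > 1:
            lat += rng.uniform(-max_offset, max_offset)
            lng += rng.uniform(-max_offset, max_offset)
        jittered.append((lat, lng))
    return jittered


# Coordinates taken from the table in the previous section
points = [
    (53.309727, -6.22159),   # University College Dublin
    (53.309727, -6.22159),   # University College Dublin (second course)
    (51.754816, -1.254367),  # University of Oxford
]
moved = jitter_duplicates(points)
print(moved)
```

Unique locations stay exactly where the geocoder placed them, while overlapping ones are spread within `max_offset` degrees.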


## 5. BONUS: More complex search engine
For the Bonus part, we want to ask you to build a more sophisticated search engine. Here we want to let users issue more complex queries. The options of this new search engine are:

1. Give the possibility to specify queries for the following features (the user should have the option to issue none or all of them):

- `courseName`
- `universityName`
- `city`

2. Specify a range for the **fees** to retrieve only MSc whose taxation is in that range.
3. Specify a list of **countries** so that the search engine only returns courses taking place in cities within those countries.
4. Filter based on the courses that have already started.
5. Filter based on the presence of online modality.

**Note 1:** You should be aware that you should give the user the possibility <ins>to select any</ins> of the abovementioned options. How should the user use the options? We will accept any manual that you provide to the user.

**Note 2:** As you may have realized from **1st option**, you need to build <ins>inverted indexes</ins> for those values and return all of the documents that have the similarity <ins>more than 0</ins> concerning the given queries. Choose a logical way to aggregate the similarity coming from each of them and explain your idea in detail.

**Note 3:** The options <ins>other than 1st</ins> one can be considered as filtering criteria so the retrieved documents <ins>must respect all</ins> of those filters.

The output must contain the following information about the places:

- `courseName`
- `universityName`
- `URL`

---

First, we create the inverted indexes for the columns `courseName`, `universityName`, and `city` and save the vocabularies in separate pickle files inside the `vocabularies` folder.

In [None]:
%%time
engine.create_inverted_index2("courseName")
engine.create_inverted_index2("universityName")
engine.create_inverted_index2("city")
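
To illustrate what such an index contains, here is a minimal sketch that maps each term of a text column to the documents in which it occurs. It assumes plain lowercasing and whitespace tokenization; the actual `engine.create_inverted_index2` is not shown in this notebook and may preprocess and store the data differently. `build_inverted_index` is a hypothetical name.

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Map each term to the sorted list of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(documents):
        # Assumption: lowercase and split on whitespace instead of full preprocessing
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(doc_ids) for term, doc_ids in index.items()}


cities = ["Rome", "Newcastle", "Rome", "Clear Water Bay"]
index = build_inverted_index(cities)
print(index["rome"])
```

A query term is then resolved with a single dictionary lookup instead of a scan over all courses.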

We created a function `complex_search_engine` that first prompts the user for the parameters that make up the query. Based on this query, the function returns all courses selected by the aggregated similarity across the three inverted indexes and by the filters given in the query parameters.

Query input:

```
Enter Course Name (Press Enter to skip): 
Enter University Name (Press Enter to skip): 
Enter City (Press Enter to skip): 
Enter minimum fees in € (Press Enter to skip): 
Enter maximum fees in € (Press Enter to skip): 
Enter a comma-separated list of countries (Press Enter to skip): 
Filter based on courses that have already started? (y/n): 
Filter based on the presence of online modality? (y/n): 
```
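
The filtering criteria (fees, countries, start date, online modality) must all hold at the same time. The sketch below shows one way to express this as a single predicate. The dictionary keys `fees`, `country`, `start_date`, and `online`, the helper `passes_filters`, and the decision to exclude courses with unknown fees when a fee bound is set are all assumptions for illustration; they do not describe how `engine.complex_search_engine` is implemented.

```python
from datetime import date


def passes_filters(course, min_fee=None, max_fee=None, countries=None,
                   started_only=False, online_only=False, today=None):
    """Return True only if the course satisfies every filter that was set."""
    today = today or date.today()
    fee = course["fees"]
    # Assumption: a course with unknown fees fails any fee-range filter
    if min_fee is not None and (fee is None or fee < min_fee):
        return False
    if max_fee is not None and (fee is None or fee > max_fee):
        return False
    if countries and course["country"] not in countries:
        return False
    if started_only and course["start_date"] > today:
        return False
    if online_only and not course["online"]:
        return False
    return True


courses = [
    {"name": "A", "fees": 10384.52, "country": "United Kingdom",
     "start_date": date(2023, 9, 1), "online": True},
    {"name": "B", "fees": None, "country": "Italy",
     "start_date": date(2024, 2, 1), "online": False},
]
selected = [c["name"] for c in courses
            if passes_filters(c, countries={"Italy", "United Kingdom"},
                              started_only=True, today=date(2023, 11, 1))]
print(selected)
```

Filters that the user skips stay at their defaults and therefore never reject a course.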

We decided to use the arithmetic mean to aggregate the cosine similarities from the three vocabularies because, for us, all three inputs `courseName`, `universityName`, and `city` are equally important. We also considered using the product of all similarities but decided against it: if one cosine similarity is much smaller than the other two, it would dominate the aggregated similarity and could distort our result.
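
The difference between the two aggregations can be seen on a small made-up example. The similarity values below are invented for illustration, and `aggregate` is a hypothetical helper, not a function from `engine`.

```python
from math import prod
from statistics import mean


def aggregate(similarities, how="mean"):
    """Combine per-field cosine similarities into a single score."""
    if how == "mean":
        return mean(similarities)
    if how == "product":
        return prod(similarities)
    raise ValueError(f"unknown aggregation: {how}")


# Invented similarities for courseName, universityName, and city
similarities = [0.9, 0.8, 0.05]
mean_score = aggregate(similarities, "mean")
product_score = aggregate(similarities, "product")
print(round(mean_score, 3), round(product_score, 3))
```

With one weak field, the mean stays moderate while the product collapses towards zero, which is the effect described above.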

In [7]:
engine.complex_search_engine()

Enter Course Name (Press Enter to skip): 
Enter University Name (Press Enter to skip): Rome
Enter City (Press Enter to skip): 
Enter minimum fees in € (Press Enter to skip): 
Enter maximum fees in € (Press Enter to skip): 
Enter a comma-separated list of countries (Press Enter to skip): Italy
Filter based on courses that have already started? (y/n): y
Filter based on the presence of online modality? (y/n): 


Unnamed: 0,courseName,universityName,url
0,Master of Science (MSc) in Management,European School of Economics (Rome),https://www.findamasters.com/masters-degrees/course/master-of-science-msc-in-management/?i3375d8416c60843
1,Master of Science (MSc) in Marketing,European School of Economics (Rome),https://www.findamasters.com/masters-degrees/course/master-of-science-msc-in-marketing/?i3375d8416c60845
3,European Master in Archaeological Materials Science (ARCHMAT),Sapienza University of Rome,https://www.findamasters.com/masters-degrees/course/european-master-in-archaeological-materials-science-archmat/?i818d8733c67465
