Notebook by:
* Lorenzo Pannacci 1948926
* Francesco Proietti 1873188
* Selin Topaloglu 2113300
* Santiago Vessi 1958879

## Startup

In [3]:
######################
# LIBRARIES DOWNLOAD #
######################

install_packages = False
if install_packages:
    %pip install beautifulsoup4 tqdm pandas numpy matplotlib nltk

In [None]:
####################
# LIBRARIES IMPORT #
####################

import requests
from bs4 import BeautifulSoup
from tqdm.notebook import tqdm
import time
import os
import csv

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import re

import nltk
nltk.download('stopwords')
nltk.download('wordnet')
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
import heapq
import numpy as np
from collections import defaultdict
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

## 1. Data collection

For this homework, there is no provided dataset. Instead, you have to build your own. Your search engine will run on text documents. So, here
we detail the procedure to follow for the data collection. We strongly suggest you work on different modules when implementing the required functions. For example, you may have a ```crawler.py``` module, a ```parser.py``` module, and a ```engine.py``` module: this is a good practice that improves readability in reporting and efficiency in deploying the code. Be careful; you are likely dealing with exceptions and other possible issues! 

### 1.1. Get the list of master's degree courses

We start with the list of courses to include in your corpus of documents. In particular, we focus on web scrapping the [MSc Degrees](https://www.findamasters.com/masters-degrees/msc-degrees/). Next, we want you to **collect the URL** associated with each site in the list from the previously collected list.
The list is long and split into many pages. Therefore, we ask you to retrieve only the URLs of the places listed in **the first 400 pages** (each page has 15 courses, so you will end up with 6000 unique master's degree URLs).

The output of this step is a `.txt` file whose single line corresponds to the master's URL.

---

Firstly we observe that we can nagivate trough the different pages using the link `https://www.findamasters.com/masters-degrees/msc-degrees/?PG=n` and changing `n` with the number of the desidered page. We also observe that without waiting between a page load and the other after 30 pages the website thinks we are a bot and don't let us in, to fix this we decided wait some seconds before we open a new page.

In [3]:
data_folder_path = r"data/"
if not os.path.exists(data_folder_path):                    # create main data folder if doesn't already exist
    os.makedirs(data_folder_path)

courses_urls_path = data_folder_path + r"courses_urls.txt"  # file path of the txt file to create
sleep_time = 2                                              # idle time between to requests, to avoid being blocke
to_crawl = False                                            # we check if the file already exists and has the right length, in this case we do not repeat the crawling
n_pages = 400                                               # number of pages to search trough
n_courses = n_pages * 15                                    # total number of courses crawled

# to avoid the site block the crawler considering it a bot we use a user agent taken from a real chrome session
headers = {"user-agent": r"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36"}

if not os.path.exists(courses_urls_path): # if file does not exist we have to crawl
    print("File does not exist! Crawling...")
    to_crawl = True

if to_crawl == False: # if file is incomplete we have to crawl
    with open(courses_urls_path, 'r') as file:
        file_length = len(file.readlines())
        if file_length < n_courses:
            print("File exist but is incomplete! Crawling...")
            to_crawl = True
        else:
            print("File already exist and is complete. Using the previous version.")


# if data is missing go crawl
if to_crawl == True:

    with open("data/courses_urls.txt", 'w') as file: # open file, if already exist creates a new one
        for i in tqdm(range(n_pages)): # cycle trough every page
            url = r"https://www.findamasters.com/masters-degrees/msc-degrees/?PG=" + str(1 + i) # we compose the url

            # get the webpage
            webpage = requests.get(url, headers = headers)
            soup = BeautifulSoup(webpage.text)
            soup.prettify()

            tags = soup.find_all('a', {"class": "courseLink"})  # get the tags we are interested in

            if not tags:
                raise IOError("Crawler has been blocked by the website. Try again with higher idle time.")

            for tag in tags: # for every tag get the course link and append to file
                link = tag["href"]
                file.write(r"https://www.findamasters.com" + link + "\n")

            time.sleep(sleep_time) # wait to avoid getting blocked

    # the file automatically close itself when the "with" section ends, saving the written lines

File already exist and is complete. Using the previous version.


### 1.2. Crawl master's degree pages

Once you get all the URLs in the first 400 pages of the list, you:

1. Download the HTML corresponding to each of the collected URLs.
2. After you collect a single page, immediately save its `HTML` in a file. In this way, if your program stops for any reason, you will not lose the data collected up to the stopping point.
3. Organize the downloaded `HTML` pages into folders. Each folder will contain the `HTML` of the courses on page 1, page 2, ... of the list of master's programs.
   
__Tip__: Due to the large number of pages you should download, you can use some methods that can help you shorten the time. If you employed a particular process or approach, kindly describe it.

---

As before we have to insert a idle time between the loading of two pages to avoid that the website block us. We can found whether we have been blocked by checking if the webpage ha the title "`Just a moment...`". This makes the operations particularly slow, to crawl all pages we have to wait a few hours. To avoid to repeat this procedure more times than necessary we save the html pages and try to open them only if they are not already downloaded on the device.

In [4]:
courses_pages_path = r"data/courses_html_pages/" # path of the folder containing all the subfolders with the html files

# create folder if not exist already
if not os.path.exists(courses_pages_path):
    os.makedirs(courses_pages_path)

# we check if the files already exists, in this case we do not repeat the crawling
to_crawl = False
files_count = 0
for _, _, files in os.walk(courses_pages_path):
    files_count += len(files)

if files_count < n_courses:
    print("Crawling...")
    to_crawl = True
else:
    print("All files already crawled. Using the existing version.")

# if data is missing go crawl
if to_crawl == True:

    # make a folder for every page if not already created
    for i in range(1, 400 + 1):
        folder_path = courses_pages_path + "page_" + str(i)
        if not os.path.exists(folder_path):
            os.makedirs(folder_path)

    # populate folders
    with open(courses_urls_path, 'r') as file_1:
        for i, course_url in tqdm(enumerate(file_1), total = n_courses):
            course_url = course_url.strip('\n')
            course_file_path = courses_pages_path + "page_" + str(1 + i // 15) + "/" + "course_" + str(1 + i % 15) + ".html"

            if not os.path.exists(course_file_path): # if already crawled do not repeat
                # get page
                webpage = requests.get(course_url, headers = headers)
                soup = BeautifulSoup(webpage.text, "html.parser")

                if soup.title.text == r"Just a moment...":
                    raise IOError("Crawler has been blocked by the website. Try again with higher idle time.")

                # write file
                with open(course_file_path, 'w+', encoding = "utf-8") as file_2:
                    # html_page = soup.prettify()
                    # file_2.write(str(html_page))
                    file_2.write(str(soup))
                # the file automatically close itself when the "with" section ends, saving the written lines

                time.sleep(sleep_time) # wait to avoid getting blocked

All files already crawled. Using the existing version.


We observe that some pages as `https://www.findamasters.com/masters-degrees/course/emergency-management-and-resilience-msc/?i373d7361c25450` (page 215, course 3) are missing and gives us a filler webpage. We will have to treat those courses carefully as the only information we can get from those is the link. We can easily identify those kind of pages thanks to their title: "`FindAMasters | 500 Error : Internal Server Error`".

In [None]:
# crawling correctness check

print("Checking the correctness of the crawl operation...")

blocked_pages = 0
unaviable_pages = 0
correct_pages = 0
for root, _, files in tqdm(os.walk(courses_pages_path), total = 401): # checks for files in 400 subfolders and on root folder, thus 401
    for file in files:
        course_file_path = os.path.join(root, file)

        with open(course_file_path, 'r', encoding = "utf-8") as html_file:
            html_content = html_file.read()

        soup = BeautifulSoup(html_content, "html.parser")
        page_title = soup.title.text

        if page_title == r"Just a moment...": # blocked during crawling
            blocked_pages += 1
            os.remove(course_file_path)
        elif page_title == r"FindAMasters | 500 Error : Internal Server Error": # missing on website
            unaviable_pages += 1
        else: # downloaded correctly
            correct_pages += 1

print(blocked_pages, "pages were blocked during crawling and had been removed. If this value is not zero run the crawling again to get the missing pages.")
print(unaviable_pages, "pages are not present on the website anymore.")
print(correct_pages, "pages have been correctly downloaded.")

### 1.3 Parse downloaded pages

At this point, you should have all the HTML documents about the master's degree of interest, and you can start to extract specific information. The list of the information we desire for each course and their format is as follows:

1. Course Name (to save as ```courseName```): string;
2. University (to save as ```universityName```): string;
3. Faculty (to save as ```facultyName```): string
4. Full or Part Time (to save as ```isItFullTime```): string;
5. Short Description (to save as ```description```): string;
6. Start Date (to save as ```startDate```): string;
7. Fees (to save as ```fees```): string;
8. Modality (to save as ```modality```):string;
9. Duration (to save as ```duration```):string;
10. City (to save as ```city```): string;
11. Country (to save as ```country```): string;
12. Presence or online modality (to save as ```administration```): string;
13. Link to the page (to save as ```url```): string.

<div style="overflow-x:auto;">
<table>
<thead>
  <tr>
    <th>index</th>
    <th>courseName</th>
    <th>universityName</th>
    <th>facultyName</th>
    <th>isItFullTime</th>
    <th>description</th>
    <th>startDate</th>
    <th>fees</th>
    <th>modality</th>
    <th>duration</th>
    <th>city</th>
    <th>country</th>
    <th>administration</th>
    <th>url</th>
  </tr>
</thead>
<tbody>
  <tr>
    <td>0</td>
    <td> Accounting and Finance - MSc</td>
    <td>University of Leeds</td>
    <td>Leeds University Business School</td>
    <td>Full time</td>
    <td>Businesses and governments rely on [...].</td>
    <td>September</td>
    <td>UK: £18,000 (Total) International: £34,750 (Total)</td>
    <td>MSc</td>
    <td>1 year full time</td>
    <td>Leeds</td>
    <td>United Kingdom</td>
    <td>On Campus</td>
    <td><a href="https://www.findamasters.com/masters-degrees/course/accounting-and-finance-msc/?i321d3232c3891">Link</a></td>
  </tr>
  <tr>
    <td>1</td>
    <td> Accounting, Accountability & Financial Management MSc</td>
    <td>King’s College London</td>
    <td>King’s Business School</td>
    <td>Full time</td>
    <td>Our Accounting, Accountability & Financial Management MSc course will provide [...].</td>
    <td>September</td>
    <td>Please see the university website for further information on fees for this course.</td>
    <td>MSc</td>
    <td>1 year FT</td>
    <td>London</td>
    <td>United Kingdom</td>
    <td>On Campus</td>
    <td><a href="https://www.findamasters.com/masters-degrees/course/accounting-accountability-and-financial-management-msc/?i132d7816c25522">Link</a></td>
  </tr>
  <tr>
    <td>2</td>
    <td> Accounting, Financial Management and Digital Business - MSc</td>
    <td>University of Reading</td>
    <td>Henley Business School</td>
    <td>Full time</td>
    <td>Embark on a professional accounting career [...].</td>
    <td>September</td>
    <td>Please see the university website for further information on fees for this course.</td>
    <td>MSc</td>
    <td>1 year full time</td>
    <td>Reading</td>
    <td>United Kingdom</td>
    <td>On Campus</td>
    <td><a href="https://www.findamasters.com/masters-degrees/course/accounting-financial-management-and-digital-business-msc/?i345d4286c351">Link</a></td>
  </tr>
  <tr>
    <td>3</td>
    <td> Addictions MSc</td>
    <td>King’s College London</td>
    <td>Institute of Psychiatry, Psychology and Neuroscience</td>
    <td>Full time</td>
    <td>Join us for an online session for prospective [...].</td>
    <td>September</td>
    <td>Please see the university website for further information on fees for this course.</td>
    <td>MSc</td>
    <td>One year FT</td>
    <td>London</td>
    <td>United Kingdom</td>
    <td>On Campus</td>
    <td><a href="https://www.findamasters.com/masters-degrees/course/addictions-msc/?i132d4318c27100">Link</a></td>
  </tr>
  <tr>
    <td>4</td>
    <td> Advanced Chemical Engineering - MSc</td>
    <td>University of Leeds</td>
    <td>School of Chemical and Process Engineering</td>
    <td>Full time</td>
    <td>The Advanced Chemical Engineering MSc at Leeds [...].</td>
    <td>September</td>
    <td>UK: £13,750 (Total) International: £31,000 (Total)</td>
    <td>MSc</td>
    <td>1 year full time</td>
    <td>Leeds</td>
    <td>United Kingdom</td>
    <td>On Campus</td>
  </tr>
  <!-- Add more rows here as needed -->
</tbody>
</table>
</div>


For each master's degree, you create a `course_i.tsv` file of this structure:

```
courseName \t universityName \t  ... \t url
```

If an information is missing, you just leave it as an empty string.

---

We can observe that some informations are "mandatory", that means that every page (that is not a filler page) has them while others could or could not be present. Meanwhile filler pages gives us no information whatsoever, for those we have only the page url. To create a `.tsv` file we can just use the Python `csv` module changing its delimiter with the character `\t`.

In [27]:
tsvs_path = r"data/tsvs/" # path of the folder containing all the .tsv files

# create folder if not exist already
if not os.path.exists(tsvs_path):
    os.makedirs(tsvs_path)

# we check if all the files already exists, in this case we do not repeat the crawling
to_crawl = False
files_count = 0
for _, _, files in os.walk(tsvs_path):
    files_count += len(files)

if files_count < n_courses:
    print("Creating .tsv files...")
    to_crawl = True
else:
    print("All files already created. Using the existing version.")

# if data is missing go crawl
if to_crawl == True:
    with open(courses_urls_path, 'r') as courses_file:
        for i, url in tqdm(enumerate(courses_file), total = n_courses):
            url = url.strip("\n")

            # if file .tsv already exist skip its creation
            tsv_file_path = tsvs_path + "course_" + str(1 + i) + ".tsv"
            if os.path.exists(tsv_file_path):
                continue

            # create path, open and read .html file
            course_file_path = courses_pages_path + "page_" + str(1 + i // 15) + "/" + "course_" + str(1 + i % 15) + ".html"
            with open(course_file_path, 'r', encoding = "utf-8") as html_file:
                html_content = html_file.read()

            soup = BeautifulSoup(html_content, "html.parser")

            # if the page is no avaiable we can't get informations
            if soup.title.text == r"FindAMasters | 500 Error : Internal Server Error":
                courseName = universityName = facultyName = isItFullTime = description = startDate = fees = modality = duration = city = country = administration = ""

            else:
                # get all the required fields

                courseName = soup.find("h1", {"class": "course-header__course-title"}).get_text(strip = True)
                universityName = soup.find("a", {"class": "course-header__institution"}).get_text(strip = True)
                facultyName = soup.find("a", {"class": "course-header__department"}).get_text(strip = True)

                # some entries do not have this field
                extract = soup.find("span", {"class": "key-info__study-type"})
                if extract is None:
                    isItFullTime = ""
                else:
                    isItFullTime = extract.get_text(strip = True)

                description = soup.find("div", {"class": "course-sections__description"}).find("div", {"class": "course-sections__content"}).get_text(strip = True, separator = " ")
                startDate = soup.find("span", {"class": "key-info__start-date"}).get_text(strip = True)

                # some entries do not have this field
                extract = soup.find("div", {"class": "course-sections__fees"})
                if extract is None:
                    fees = ""
                else:
                    fees = extract.find("div", {"class": "course-sections__content"}).get_text(strip = True)

                modality = soup.find("span", {"class": "key-info__qualification"}).get_text(strip = True)
                duration = soup.find("span", {"class": "key-info__duration"}).get_text(strip = True)
                city = soup.find("a", {"class": "course-data__city"}).get_text(strip = True)
                country = soup.find("a", {"class": "course-data__country"}).get_text(strip = True)

                # courses can be 'on_campus', 'online' or both, but this information is stored in different tags
                extract1 = soup.find("a", {"class": "course-data__online"})
                extract2 = soup.find("a", {"class": "course-data__on-campus"})
                if extract1 is None and extract2 is None:
                    administration = ""
                elif extract2 is None:
                    administration = extract1.get_text(strip = True)
                elif extract1 is None:
                    administration = extract2.get_text(strip = True)
                else:
                    administration = extract1.get_text(strip = True) + " & " + extract2.get_text(strip = True)

            data = [["courseName", "universityName", "facultyName", "isItFullTime", "description", "startDate", "fees", "modality", "duration", "city", "country", "administration", "url"],
                    [courseName, universityName, facultyName, isItFullTime, description, startDate, fees, modality, duration, city, country, administration, url]]

            with open(tsv_file_path, 'w+', newline='') as tsv_file:
                writer = csv.writer(tsv_file, delimiter = '\t', lineterminator = '\n')
                writer.writerows(data)

Creating .tsv files...


  0%|          | 0/6000 [00:00<?, ?it/s]

## 2. Search Engine

Now, we want to create two different Search Engines that, given as input a query, return the courses that match the query.

### 2.0 Preprocessing

#### 2.0.0) Preprocessing the text

First, you must pre-process all the information collected for each MSc by:

1. Removing stopwords
2. Removing punctuation
3. Stemming
4. Anything else you think it's needed
   
For this purpose, you can use the [`nltk library](https://www.nltk.org/).

---

The function takes a text string as input and returns a preprocessed version of the text. It utilizes regular expressions to remove all punctuation from the input text. After this punctuation removal, the text is tokenized by splitting it into words. Stopwords, common words that often don't contribute much to the meaning, are then removed from the tokenized text. The function employs stemming using the Porter Stemmer to reduce words to their root form, helping to consolidate similar words. Finally, the preprocessed words are joined back together into a single string, creating the final output.

---

In [16]:
def preprocess_text(txt):
    
    # remove all punctuation
    txt = re.sub(r'[^\w\s]', ' ', txt)

    # tokenize the text
    txt=txt.split()
    
    # remove stopwords
    stop_words = set(stopwords.words('english'))
    txt = [t for t in txt if t not in stop_words]
 
    # stemming
    stemmer = PorterStemmer()
    txt = [stemmer.stem(t) for t in txt]
 
    # Reassemble the text
    txt = ' '.join(txt)

    return txt

#### 2.0.1) Preprocessing the fees column

Moreover, we want the field ```fees``` to collect numeric information. As you will see, you scraped textual information for this attribute in the dataset: sketch whatever method you need (using regex, for example, to find currency symbol) to collect information and, in case of multiple information, retrieve only the highest fees. Finally, once you have collected numerical information, you likely will have different currencies: this can be chaotic, so let chatGPT guide you in the choice and deployment of an API to convert this column to a common currency of your choice (it can be USD, EUR or whatever you want). Ultimately, you will have a ```float``` column renamed ```fees (CHOSEN COMMON CURRENCY)```.

---

This function takes as input a string of text. It starts by finding the currency symbol. The text is previously transformed into its uppercase form to simplify the search. We then search for all possible fees and choose the highest one found. If both are found, we use the Open Exchange Rates API to convert the currency to Euros. Throughout the text, we might find various ways in which the value and currency appear. Sometimes, they will be properly separated by a space bar, while other times both parameters will be joined together. We will consider both of these cases. Also, the value might appear before or after the currency. Finally, for the dollar, pounds, and euros currencies, there are more ways in which they can appear. For this special case, we created a special dictionary that will help translate them into the correct form. So, when we apply the request function to the website, we will have the correct parameters for these currencies.

---

In [280]:
def convert_to_eur(fees):
    # uppercase the text & tokenize the text 
    fees=fees.upper().split()
    # removes all commas and points
    fees=[i.replace(',', '').replace('.', '') for i in fees]

    # dict of possible symbols used for dollar, pounds and euros
    # in case one of these keys appears we need to change them into their respective values 
    # to use the request function
    sym={'£':'GBP','POUNDS':'GBP','$':'USD','DOLLARS':'USD','€':'EUR','EUROS':'EUR'}
    # list of all the possible currencies
    curr=["£","POUNDS","GBP","$","DOLLARS","USD","€","EUROS","EUR","AED","AFN","ALL","AMD","ANG","AOA","ARS","AUD","AWG","AZN","BAM","BBD","BDT","BGN","BHD","BIF","BMD","BND","BOB","BRL","BSD","BTN","BWP","BYN","BZD","CAD","CDF","CHF","CLP","CNY","COP","CRC","CUP","CVE","CZK","DJF","DKK","DOP","DZD","EGP","ERN","ETB","FJD","FKP","FOK","GEL","GGP","GHS","GIP","GMD","GNF","GTQ","GYD","HKD","HNL","HRK","HTG","HUF","IDR","ILS","IMP","INR","IQD","IRR","ISK","JEP","JMD","JOD","JPY","KES","KGS","KHR","KID","KMF","KRW","KWD","KYD","KZT","LAK","LBP","LKR","LRD","LSL","LYD","MAD","MDL","MGA","MKD","MMK","MNT","MOP","MRU","MUR","MVR","MWK","MXN","MYR","MZN","NAD","NGN","NIO","NOK","NPR","NZD","OMR","PAB","PEN","PGK","PHP","PKR","PLN","PYG","QAR","RON","RSD","RUB","RWF","SAR","SBD","SCR","SDG","SEK","SGD","SHP","SLE","SLL","SOS","SRD","SSP","STN","SYP","SZL","THB","TJS","TMT","TND","TOP","TRY","TTD","TVD","TWD","TZS","UAH","UGX","UYU","UZS","VES","VND","VUV","WST","XAF","XCD","XDR","XOF","XPF","YER","ZAR","ZMW","ZWL"]
    
    # find value and currency
    v=[] # list of values
    c=[] # list of currencies
    for i,elem in enumerate(fees):
        if elem.isdigit():
            # check if there is a currency used before the digit
            if i>0 and fees[i-1] in curr:
                v.append(float(elem))
                if (fees[i-1] in sym):
                    c.append(sym[fees[i-1]])
                else:
                    c.append(fees[i-1])
            elif i<len(fees)-1 and fees[i+1] in curr:
                # check if there is a currency used after the digit
                v.append(float(elem))
                if (fees[i+1] in sym): 
                    c.append(sym[fees[i+1]])
                else:
                    c.append(fees[i+1])
        # check if the currency and the value are attached together as a same element
        elif len(elem)>0 and elem[len(elem)-1] in sym:
            val = re.findall(r"\d+",elem[:len(elem)])
            if val:
                v.append(float(val[0]))
                c.append(sym[elem[len(elem)-1]])
        elif len(elem)>0 and elem[0] in sym:
            val = re.findall(r"\d+",elem[1:len(elem)])
            if val:
                v.append(float(val[0]))
                c.append(sym[elem[0]])

    conv="0.0"
    
    if len(c)>0:
        base_url = "https://open.er-api.com/v6/latest"
        API_KEY="6d56fb10262e4f29bef560d4c38fa3f4"
        conv=[]
        # convert all the values into euros 
        for i in range(len(v)):
            params = {'base': c[i], 'apiKey': API_KEY}
            response = requests.get(base_url, params=params)
            rates = response.json().get('rates', {})
            conv.append(np.round(v[i] / rates['EUR'], 2))
        # get the maximum value
        conv=max(conv)
    return conv

In [284]:
for i in range(1, 6001):
    tsv = "course_" + str(i) + ".tsv"
    file_path = os.path.join(tsvs_path, tsv)
    with open(file_path, 'r', encoding='utf-8') as ff:
        lines=ff.readlines()
    
    fields = lines[1].strip().split('\t')
    
    # Ensure that the list has enough elements
    if len(fields) > 6:
        
        # get the fee value
        f=convert_to_eur(fields[6])
    
        # Read the TSV file into a DataFrame
        df = pd.read_csv(file_path, sep='\t')
    
        # add a new column in the eighth position
        df.insert(7, 'fees (EUR)', f)

        ## Save the modified DataFrame back to the same TSV file
        df.to_csv(file_path, sep='\t', index=False)

### 2.1. Conjunctive query

For the first version of the search engine, we narrowed our interest to the __description__ of each course. It means that you will evaluate queries only concerning the course's description.

#### 2.1.1) Create your index!

Before building the index, 
* Create a file named `vocabulary`, in the format you prefer, that maps each word to an integer (`term_id`).

---
To create the vocabulary, we begin by initializing an empty dictionary. For each of the 6000 tsv files, we extract the description field. After preprocessing the text, we iterate through each word, checking if it's already in the vocabulary. If not, we add it and assign a unique ID. Finally, the vocabulary is saved in a txt file.

---

In [258]:
# start an empty dictionary
vocabulary = {}
# start with term_id as 1
term_id = 1

for i in range(1,6001):
    tsv="course_"+str(i)+".tsv"
    file_path= os.path.join(tsvs_path,tsv)
    with open(file_path, 'r', encoding='utf-8') as ff:
        lines=ff.readlines()
    
    fields = lines[1].strip().split('\t')
    # Ensure that the list has enough elements
    if len(fields) > 4:
        
        # preprocess the description
        d=preprocess_text(fields[4])
        # tokenize 
        words=d.split()
        
        for w in words:
            if w not in vocabulary: #add to the vocabulary if the word is not in there yet
                vocabulary[w]=term_id #assigns a unique id
                term_id+=1 # update term_id

# save the vocabulary to a file
vocabulary_file_path = r"data/vocabulary.txt"
with open(vocabulary_file_path, 'w', encoding='utf-8') as vocab:
    for word, term_id in vocabulary.items():
        vocab.write(f"{word}\t{term_id}\n")

Then, the first brick of your homework is to create the Inverted Index. It will be a dictionary in this format:

```
{
term_id_1:[document_1, document_2, document_4],
term_id_2:[document_1, document_3, document_5, document_6],
...}
```
where _document\_i_ is the *id* of a document that contains that specific word.

__Hint:__ Since you do not want to compute the inverted index every time you use the Search Engine, it is worth thinking about storing it in a separate file and loading it in memory when needed.

---
An empty dictionary, inverted_index, is created to store the inverted index, where each *term_ID* points to a list of documents containing that term. For each tsv file, we check if there are more than 4 fields to ensure the presence of a description field. The description field is then preprocessed using the preprocess_text function, and the resulting text is tokenized into words. For each word in the tokenized description:

- Check if the word contains alphabetical characters or numbers.
- If the word is in the vocabulary, update the inverted index:
- - Obtain the term ID for the word from the vocabulary.
- - Check if the term ID is already in the inverted index. If not, create a new entry with the tsv file.
- - If the term ID is already in the inverted index, add the document to the existing list, but only if it's not already present.

After processing all course descriptions, the inverted index is saved to a file (inv_index_file_path) with each line containing a term ID and the list of documents containing that term, separated by a tab.

---

In [259]:
# initialize an empty inverted_index
inverted_index={}

for i in range(1,6001):
    tsv="course_"+str(i)+".tsv"
    file_path= os.path.join(tsvs_path,tsv)
    with open(file_path, 'r', encoding='utf-8') as ff:
        lines=ff.readlines()
    
    fields = lines[1].strip().split('\t')
    # Ensure that the list has enough elements
    if len(fields) > 4:
        # access the description field and tokenize the words,
        d=preprocess_text(fields[4])
        words=d.split()
        for w in words:
            # check if the word contains alphabetical characters or numbers
            if w in vocabulary:
                # update inverted index
                id_term = vocabulary[w]
                if id_term not in inverted_index:
                    inverted_index[id_term]=[tsv]
                else:
                    if(tsv not in inverted_index[id_term]):
                        inverted_index[id_term].append(tsv)


# save the inverted index in a file
inv_index_file_path = r"data/inverted_index.txt"
with open(inv_index_file_path, 'w', encoding='utf-8') as inv_ind:
    for word, term_id in inverted_index.items():
        inv_ind.write(f"{word}\t{term_id}\n")

#### 2.1.2) Execute the query
Given a query input by the user, for example:

```
advanced knowledge
```

The Search Engine is supposed to return a list of documents.

##### What documents do we want?
Since we are dealing with conjunctive queries (AND), each returned document should contain all the words in the query.
The final output of the query must return, if present, the following information for each of the selected documents:

* `courseName`
* `universityName`
* `description`
* `URL`

If everything works well in this step, you can go to the next point and make your Search Engine more complex and better at answering queries.

---
The input query is preprocessed using the preprocess_text function, and the resulting text is tokenized into individual words. The variable *doc* is initialized as an empty set. It will eventually contain all the documents that have the complete query in their description.
The code iterates through each word in the processed query:
- For each word, it checks if the word is in the vocabulary. If so, it retrieves the term ID.
- If the term ID is in the inverted index, it collects the set of documents associated with that term ID.
- If it's the first word in the query, it updates the doc set. Otherwise, it updates a temporary set tmp.
- It then filters out the documents that don't contain the previous word of the query (for words beyond the first).

For each document ID in the final doc set, it reads the corresponding tsv file.
Relevant information from the file, such as course name, university name, description, and URL, is extracted and stored in a list of dictionaries (*result_data*).
The extracted information is used to create a Pandas DataFrame (result_df), which is then returned by the function.


---

In [279]:
def search_engine(query):
    query_words = preprocess_text(query).split()
    # this list will contain all the docs that have the complete query in their description
    doc = set() 
    for i in range(len(query_words)):
        tmp=set() # we will need this to determine if a document contains all the query elements
        w = query_words[i]
        if w in vocabulary:
            term_id = vocabulary[w]
            if term_id in inverted_index:
                t_id=inverted_index[term_id]
                if i==0:
                    doc.update(t_id)
                else:
                    tmp.update(t_id)
       # filters out the documents that don't contain the previous word of the query
        if i>0:
            doc=doc.intersection(doc, tmp)
                    

    # Extract information from matching documents
    result_data = []
    for tsv_id in doc:
        tsv_file = os.path.join(tsvs_path, f"{tsv_id}")
        with open(tsv_file, 'r', encoding='utf-8') as ff:
            lines=ff.readlines()
        fields=lines[1].split("\t")
        result_data.append({
                    'courseName': fields[0],
                    'universityName': fields[1],
                    'description': fields[4],
                    'URL': fields[-1],
        })

    # Create pandas DataFrame
    result_df = pd.DataFrame(result_data)
    return result_df

In [278]:
q="advanced knowledge"
search_engine(q)

['advanced', 'knowledge']


### 2.2) Conjunctive query & Ranking score
For the second search engine, given a query, we want to get the top-k (the choice of k it's up to you!) documents related to the query. In particular:

- Find all the documents that contain all the words in the query.
- Sort them by their similarity with the query.
- Return in output *k* documents, or all the documents with non-zero similarity with the query when the results are less than *k*. You must use a heap data structure (you can use Python libraries) for maintaining the *top-k* documents.

To solve this task, you must use the *tfIdf* score and the *Cosine similarity*. The field to consider is still the description. Let's see how.

#### 2.2.1) Inverted index
Your second Inverted Index must be of this format:

In [None]:
{
term_id_1:[(document1, tfIdf_{term,document1}), (document2, tfIdf_{term,document2}), (document4, tfIdf_{term,document4}), ...],
term_id_2:[(document1, tfIdf_{term,document1}), (document3, tfIdf_{term,document3}), (document5, tfIdf_{term,document5}), (document6, tfIdf_{term,document6}), ...],
...}

Practically, for each word, you want the list of documents in which it is contained and the relative tfIdf score.

Tip: TfIdf values are invariant for the query. Due to this reason, you can precalculate and store them accordingly.



---
The code initializes an inverted index *inv_index_tfid* and dictionaries for term frequency *t_f* and document frequency *d_f*.
The code iterates through a range of document IDs and reads the corresponding TSV files. 
Term Frequency Calculation:
- The code iterates through each word in the list of words extracted from the document description (preprocessed).
- It retrieves the term ID for each word from the vocabulary.
- If the term ID exists it updates the term frequency dictionary (t_f) for the pair (term_id, tsv) by incrementing the count.

Document Frequency (DF) Calculation:
- The code initializes a set to keep track of unique terms encountered in the document.
- It then iterates through the set of unique words in the document.
- For each word, it retrieves the term ID from the vocabulary.
- If the term ID exists, and it hasn't been seen before in the document, it updates the document frequency dictionary for the term ID by incrementing the count. It also adds the term ID to the set of seen words.

Then the TF-IDF score is calculated and rounded to two decimal places. Finally it updates, the inverted index with the term ID and a tuple containg the ID and it's TF-IDF score.
 
---

In [None]:
# initialize the inverted index with tf-idf scores 
inv_index_tfidf = {}
# Initialize dictionaries for term frequency (t_f) and document frequency (d_f)
t_f = {}
d_f = {}

# Step 1: Calculate term frequency (tf) and inverse document frequency (idf)
for i in range(1,6001):
    tsv="course_"+str(i)+".tsv"
    file_path= os.path.join(tsvs_path,tsv)
    with open(file_path, 'r', encoding='utf-8') as ff:
        lines=ff.readlines()
    
    # Extract the fields from the second line
    fields = lines[1].strip().split('\t')
    
    # Ensure that the list has enough elements
    if len(fields) > 4:
        # access the description field and tokenize the words, 
        words=preprocess_text(fields[4]).split()

        # calculate term frequency (tf) for each term in the document
        for w in words:
            term_id = vocabulary.get(w)
            if term_id:
                # if the pair doesn't exist  it is initialized with a count of 0 before incrementing.
                # If the pair already exists, it doesn't override the existing value; 
                # it simply returns the existing value associated with that key.
                t_f.setdefault((term_id, tsv), 0) 
                t_f[(term_id, tsv)] += 1

        # update document frequency for each term
        seen_words = set()
        for word in set(words):
            term_id = vocabulary.get(word)
            if term_id and term_id not in seen_words:
                d_f.setdefault(term_id, 0)
                d_f[term_id] += 1
                seen_words.add(term_id)

# step 2: calculate tf-idf and build the inverted index
for (term_id, doc_id),tf in t_f.items():
    
    # calculate inverse document frequency (idf)
    idf = np.log(6000/ (d_f[term_id] + 1))  
    # calculate tf-idf score
    tfidf = np.round(tf * idf,2)
    
    # update the inverted index with the term_id and the corresponding tuple
    inv_index_tfidf.setdefault(term_id, [])
    inv_index_tfidf[term_id].append((doc_id, tfidf))


In [None]:
# save the inverted index in a file
inv_ind_tfidf_file_path = r"data/inv_index_tfidf.txt"
with open(inv_ind_tfidf_file_path, 'w', encoding='utf-8') as tfidf:
    for word, term_id in inv_index_tfidf.items():
        tfidf.write(f"{word}\t{term_id}\n")

#### 2.2.2) Execute the query
In this new setting, given a query, you get the proper documents (i.e., those containing all the query's words) and sort them according to their similarity to the query. For this purpose, as the scoring function, we will use the Cosine Similarity concerning the tfIdf representations of the documents.

Given a query input by the user, for example:

In [None]:
advanced knowledge

The search engine is supposed to return a list of documents, ranked by their Cosine Similarity to the query entered in the input.

More precisely, the output must contain:

- courseName
- universityName
- description
- URL
- The similarity score of the documents with respect to the query (float value between 0 and 1)

---
The function starts by preprocessing the query and vectorizing it using the TfidfVectorizer function, calculating the TF-IDF score for each term. Following this, it identifies documents that contain all the query words in their description, leveraging the same method as the search_engine function. Afterward, the function computes the cosine similarity for each document. This is achieved using the query vector and the document vector, where the document vector is constructed from the TF-IDF scores in the inverted index file. The function employs a heap data structure to efficiently maintain the top documents based on their cosine similarity scores. In cases where there are fewer than k documents satisfying the query conditions, the function returns all available documents.

---

In [None]:
def top_k_documents(query,k=5):
    # Preprocess and tokenize the query
    query_words = preprocess_text(query)
    
    # vectorize the query
    vectorizer = TfidfVectorizer()
    X = vectorizer.fit_transform([query_words])
    query_vector = X.toarray()[0]

    # tokenize the query
    query_words=query_words.split()
    
    # Find documents that contain all words in the query
    doc = set()
    for i in range(len(query_words)):
        tmp=set() 
        w = query_words[i]
        if w in vocabulary:
            term_id = vocabulary[w]
            if term_id in inverted_index:
                t_id=inverted_index[term_id]
                if i==0:
                    doc.update(t_id)
                else:
                    tmp.update(t_id)
        if i>0:
            doc=doc.intersection(doc, tmp)        
    

    # Calculate cosine similarity for each matching document
    heap = []
    for doc_id in doc:
        doc_vector = {}
        # Aggregate tf-idf scores for each term in the document
        for term_id, tfidf in inv_index_tfidf.items():
            for doc, score in tfidf:
                if doc == doc_id:
                    doc_vector[term_id] = score

        # calculate the cosine similarity
        prod = 0.0
        for i in range(len(query_vector)):
            prod += query_vector[i] * doc_vector[vocabulary[query_words[i]]] 
        norm_doc = np.linalg.norm(np.array(list(doc_vector.values()))) 
        norm_query = np.linalg.norm(query_vector) 
        if norm_doc != 0 and norm_query != 0:
            score = prod / (norm_doc * norm_query)
        
        # Add the document information and similarity score to the heap
        heapq.heappush(heap, (-score, doc_id))

        
        
    # Get the top-k documents 
    result_documents = []
    for i in range(min(k,len(heap))):
        similarity_score, doc_id = heapq.heappop(heap)
        
        tsv_file = os.path.join(tsvs_path, f"{doc_id}")
        with open(tsv_file, 'r', encoding='utf-8') as ff:
            lines=ff.readlines()
        fields=lines[1].split("\t")
        result_documents.append({
                    'courseName': fields[0],
                    'universityName': fields[1],
                    'description': fields[4],
                    'URL': fields[-1],
                    'similarityScore': -similarity_score  # Convert back to positive
        })
        
    # return pandas DataFrame
    return pd.DataFrame(result_documents)


In [None]:
top_k_documents("advanced knowledge")

Unnamed: 0,courseName,universityName,description,URL,similarityScore
0,Advanced Computing MSc,King’s College London,Our Advanced Computing MSc provides knowledge ...,https://www.findamasters.com/masters-degrees/c...,0.324157
1,Advanced Healthcare Practice - MSc,Cardiff University,Why study this course Our MSc Advanced Healthc...,https://www.findamasters.com/masters-degrees/c...,0.313722
2,Advanced Clinical Practice MSc,University of Greenwich,Learn essential strategies and prepare for lea...,https://www.findamasters.com/masters-degrees/c...,0.309663
3,Advancing Practice - MSc,University of Northampton,Our MSc Advancing Practice awards support the ...,https://www.findamasters.com/masters-degrees/c...,0.304063
4,Advanced Clinical Practice - MSc,Canterbury Christ Church University,Gain the knowledge and skills needed to become...,https://www.findamasters.com/masters-degrees/c...,0.30241


# 6. Command Line Question

As done in the previous assignment, we encourage using the command as a feature that Data Scientists must master.

Note: To answer the question in this section, you must strictly use command line tools. We will reject any other method of response. The final script must be placed in CommandLine.sh.

First, take the course_i.tsv files you created in point 1 and merge them using Linux commands (Hint: make sure that the first row containing the column names appears only once).

Now that you have your merged file named merged_courses.tsv, use Linux commands to answer the following questions:
- Which country offers the most Master's Degrees? Which city?
- How many colleges offer Part-Time education?
- Print the percentage of courses in Engineering (the word "Engineer" is contained in the course's name).

__Important note:__ You may work on this question in any environment (AWS, your PC command line, Jupyter notebook, etc.), but the final script must be placed in CommandLine.sh, which must be executable. Please run the script and include a __screenshot__ of the <ins>output</ins> in the notebook for evaluation.

The next cell contains the 'CommandLine.sh' script:

```

#!/bin/bash


#command useful in order to format the output
paint=$(tput rev)
no_paint=$(tput sgr 0)
blue=$(tput setaf 4)
red=$(tput setaf 1)
green=$(tput setaf 2)
yellow=$(tput setaf 3)

#printing formatted title and introduction
echo -e "\n"
echo "$paint$red                      COMMAND LINE QUESTION HW3 AMDM                            $no_paint"
echo "$paint$red  $no_paint                                                                            $paint$red  $no_paint"
echo "$paint$red  $no_paint This bash script merges all the 6000 files .tsv in one and answers to      $paint$red  $no_paint"
echo "$paint$red  $no_paint the three questions by analysing the .tsv file created.                    $paint$red  $no_paint"
echo "$paint$red  $no_paint                                                                            $paint$red  $no_paint"
echo "$paint$red                                                                                $no_paint"
echo -e "\n"
echo "Please wait a few seconds, untill you see the result on standard output, the machine is calculating..."
echo "For a clearly visualization of the output it's recommended to maximize the terminal window..."
echo -e "\n"


#################
#               #
# Merging files #---------------------------------------------------
#               #
#################

#inizialization of the merged_file with the headers
head -n1 course_1.tsv > merged_courses.tsv

#appending rows to the merged_file
for file in course*.tsv
do
    tail -n1 $file >> merged_courses.tsv
done


#################################
#                               #
# Which country and which city? #-----------------------------------
#                               #
#################################

#assegnation of variable useful for calculate the max
max_1=0
max_country=' '

#extracting the countries column 
cut -f11 merged_courses.tsv | sed 1d | sort -u > countries.tsv

#this command says to the for loop to consider the entire line as a variable 
IFS=$'\n'

#for loop along all the countries
for country in $(cat 'countries.tsv')
do
    #l contains the occurrence of the country
    l=$(cut -f11 merged_courses.tsv | grep -i $country | wc -l)
    
    #if statement in order to compare and extract the max
    if [ $l -ge $max_1 ]
    then
	max_1=$l
	max_country=$country
    fi
done

#this part works as the previous
max_2=0
max_city=' '
cut -f10 merged_courses.tsv | sed 1d | sort -u > cities.tsv

IFS=$'\n'
for city in $(cat 'cities.tsv')
do
    c=$(cut -f10 merged_courses.tsv | grep -i $city | wc -l)
    if [ $c -ge $max_2 ]
    then
	max_2=$c
	max_city=$city
    fi
done


####################
#                   #
# Part-time courses #-----------------------------------------------
#                   #
#####################

#extracting the columns of the university name and the time type
cut -f2,4 merged_courses.tsv | sed 1d | grep -i 'part time' | cut -f1 > univ_p-t.tsv

#sorting and deleting the duplicates we can calculate
#the number of university that offers part-time courses
num_univ=$(sort -u univ_p-t.tsv | wc -l)


##########################################
#                                        #
# Calculating the courses in Engineering #--------------------------
#                                        #
##########################################

#extracting the column about the courses' name
cut -f1 merged_courses.tsv | sed 1d > courseName.tsv

#counting the courses with 'Engineer' in their name
x=$(grep -i "Engineer" courseName.tsv | wc -l)

#counting the total courses (not counting the empty lines)
y=$(grep -vc '^$' courseName.tsv)

#calculating the percentage
z=$(echo "scale=3;$x*100.0/$y" | bc)


#removing the temporary files used for the analysis 
rm cities.tsv
rm countries.tsv
rm univ_p-t.tsv
rm courseName.tsv

#printing formatted output question 1
echo "$paint$blue    QUESTION 1: WHICH COUNTRY OFFERS THE MOST MASTER'S DEGREES? WHICH CITY?     $no_paint"
echo "$paint$blue  $no_paint                                                                            $paint$blue  $no_paint"
echo "$paint$blue  $no_paint The country that offers the greater number of Master's Degrees is:         $paint$blue  $no_paint"
echo "$paint$blue  $no_paint "$max_country" with "$max_1" courses.                                          $paint$blue  $no_paint"
echo "$paint$blue  $no_paint The city that offers the greater number of Master's Degree is: "$max_city"      $paint$blue  $no_paint"
echo "$paint$blue  $no_paint with "$max_2" courses                                                          $paint$blue  $no_paint"
echo "$paint$blue  $no_paint                                                                            $paint$blue  $no_paint"
echo "$paint$blue                                                                                $no_paint"

#question 2
echo "$paint$green    QUESTION 2: HOW MANY COLLEGES OFFER PART-TIME EDUCATION?                    $no_paint"
echo "$paint$green  $no_paint                                                                            $paint$green  $no_paint"
echo "$paint$green  $no_paint The number of colleges that offer part-time education is: "$num_univ"              $paint$green  $no_paint"
echo "$paint$green  $no_paint                                                                            $paint$green  $no_paint"
echo "$paint$green                                                                                $no_paint"

#question 3
echo "$paint$yellow    QUESTION 3: PRINT THE PERCENTAGE OF COURSES IN ENGINEERING                  $no_paint"
echo "$paint$yellow  $no_paint                                                                            $paint$yellow  $no_paint"
echo "$paint$yellow  $no_paint The percentage of courses in engineering is: "$z"%                       $paint$yellow  $no_paint"
echo "$paint$yellow  $no_paint                                                                            $paint$yellow  $no_paint"
echo "$paint$yellow                                                                                $no_paint"

echo -e "\n"

```

The screenshot below contains the output of the bash script runned on local PC command line using Ubuntu Linux:

![output_screenshot](CLQ_screen.png)