Notebook by:
* Lorenzo Pannacci 1948926
* INSERT NAME SURNAME AND ID HERE

# Startup

In [3]:
######################
# LIBRARIES DOWNLOAD #
######################

install_packages = False
if install_packages:
    %pip install requests beautifulsoup4 tqdm pandas numpy matplotlib geopy folium earthpy

In [4]:
####################
# LIBRARIES IMPORT #
####################

import requests
from bs4 import BeautifulSoup
from tqdm.notebook import tqdm
import time
import os
import csv
import folium
from geopy.geocoders import Nominatim
from geopy.geocoders import Bing
from geopy.extra.rate_limiter import RateLimiter

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import re

# 1. Data collection

For this homework, there is no provided dataset. Instead, you have to build your own. Your search engine will run on text documents. So, here we detail the procedure to follow for the data collection. We strongly suggest you work on different modules when implementing the required functions. For example, you may have a ```crawler.py``` module, a ```parser.py``` module, and an ```engine.py``` module: this is a good practice that improves readability in reporting and efficiency in deploying the code. Be careful; you are likely dealing with exceptions and other possible issues!

## 1.1. Get the list of master's degree courses

We start with the list of courses to include in your corpus of documents. In particular, we focus on web scraping the [MSc Degrees](https://www.findamasters.com/masters-degrees/msc-degrees/). Next, we want you to **collect the URL** associated with each site in the list from the previously collected list.
The list is long and split into many pages. Therefore, we ask you to retrieve only the URLs of the places listed in **the first 400 pages** (each page has 15 courses, so you will end up with 6000 unique master's degree URLs).

The output of this step is a `.txt` file in which each line corresponds to a master's URL.

---

First, we observe that we can navigate through the different pages using the link `https://www.findamasters.com/masters-degrees/msc-degrees/?PG=n`, replacing `n` with the number of the desired page. We also observe that after 30 pages have been loaded the site treats us as a bot and blocks us; to avoid this, we wait a few seconds before opening each new page.
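
A minimal sketch of the two ideas above: composing the page URL from the page number, and pausing between requests with a pause that grows when a page comes back empty. The names `page_url` and `fetch_with_backoff`, the `fetch` callable and the retry parameters are assumptions made for illustration; they are not part of the crawler used in this notebook.

```python
import time

BASE_URL = "https://www.findamasters.com/masters-degrees/msc-degrees/?PG="

def page_url(n):
    """Return the URL of the n-th result page (pages are numbered from 1)."""
    return BASE_URL + str(n)

def fetch_with_backoff(fetch, url, base_sleep=2.0, max_tries=3, sleep=time.sleep):
    """Call fetch(url) until it returns a non-empty result.

    `fetch` is a hypothetical callable returning the list of course links
    found on the page (empty when the site blocked the request).
    The pause doubles after every failed attempt.
    """
    for attempt in range(max_tries):
        links = fetch(url)
        if links:
            return links
        sleep(base_sleep * 2 ** attempt)  # 2 s, 4 s, 8 s, ...
    raise IOError("Still blocked after " + str(max_tries) + " attempts: " + url)
```

Passing `fetch` and `sleep` as arguments keeps the retry logic separate from the network code, so it can be tried out without contacting the website.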

In [5]:
data_folder_path = r"data/"
if not os.path.exists(data_folder_path):                    # create main data folder if doesn't already exist
    os.makedirs(data_folder_path)

courses_urls_path = data_folder_path + r"courses_urls.txt"  # file path of the txt file to create
sleep_time = 2                                              # idle time between two requests, to avoid being blocked
to_crawl = False                                            # we check if the file already exists and has the right length, in this case we do not repeat the crawling
n_pages = 400                                               # number of pages to search through
n_courses = n_pages * 15                                    # total number of courses crawled

# to prevent the site from blocking the crawler as a bot, we use a user agent taken from a real Chrome session
headers = {"user-agent": r"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36"}

if not os.path.exists(courses_urls_path): # if file does not exist we have to crawl
    print("File does not exist! Crawling...")
    to_crawl = True

if to_crawl == False: # if file is incomplete we have to crawl
    with open(courses_urls_path, 'r') as file:
        file_length = len(file.readlines())
        if file_length < n_courses:
            print("File exist but is incomplete! Crawling...")
            to_crawl = True
        else:
            print("File already exist and is complete. Using the previous version.")


# if data is missing go crawl
if to_crawl == True:

    with open(courses_urls_path, 'w') as file: # open the file, overwriting any previous (incomplete) version
        for i in tqdm(range(n_pages)): # cycle through every page
            url = r"https://www.findamasters.com/masters-degrees/msc-degrees/?PG=" + str(1 + i) # we compose the url

            # get the webpage
            webpage = requests.get(url, headers = headers)
            soup = BeautifulSoup(webpage.text, "html.parser") # name the parser explicitly, as in the other cells

            tags = soup.find_all('a', {"class": "courseLink"})  # get the tags we are interested in

            if not tags:
                raise IOError("Crawler has been blocked by the website. Try again with higher idle time.")

            for tag in tags: # for every tag get the course link and append to file
                link = tag["href"]
                file.write(r"https://www.findamasters.com" + link + "\n")

            time.sleep(sleep_time) # wait to avoid getting blocked

    # the file automatically close itself when the "with" section ends, saving the written lines

File already exist and is complete. Using the previous version.


## 1.2. Crawl master's degree pages

Once you get all the URLs in the first 400 pages of the list, you:

1. Download the HTML corresponding to each of the collected URLs.
2. After you collect a single page, immediately save its `HTML` in a file. In this way, if your program stops for any reason, you will not lose the data collected up to the stopping point.
3. Organize the downloaded `HTML` pages into folders. Each folder will contain the `HTML` of the courses on page 1, page 2, ... of the list of master's programs.
   
__Tip__: Due to the large number of pages you should download, you can use some methods that can help you shorten the time. If you employed a particular process or approach, kindly describe it.

---

As before, we have to insert an idle time between loading two pages to prevent the website from blocking us. We can tell whether we have been blocked by checking if the webpage has the title "`Just a moment...`". This makes the operation particularly slow: crawling all the pages takes a few hours.
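
As a rough check of the "few hours" estimate, and of how the position of a course in the URL list maps to its folder, here is a small sketch. It assumes 15 courses per page and a 2-second pause, as in the cells of this notebook. The helper names are hypothetical, and the download time itself is ignored, so the result is only a lower bound.

```python
def course_location(i, per_page=15):
    """Map the 0-based index of a URL in the list to (page number, course number), both 1-based."""
    return 1 + i // per_page, 1 + i % per_page

def min_crawl_hours(n_courses=6000, sleep_time=2):
    """Lower bound on the crawl duration: only the pauses are counted."""
    return n_courses * sleep_time / 3600
```

With the default values the pauses alone add up to 12000 seconds, a little over three hours.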

In [6]:
courses_pages_path = r"data/courses_html_pages/" # path of the folder containing all the subfolders with the html files

# create folder if not exist already
if not os.path.exists(courses_pages_path):
    os.makedirs(courses_pages_path)

# we check if the files already exists, in this case we do not repeat the crawling
to_crawl = False
files_count = 0
for _, _, files in os.walk(courses_pages_path):
    files_count += len(files)

if files_count < n_courses:
    print("Crawling...")
    to_crawl = True
else:
    print("All files already crawled. Using the existing version.")

# if data is missing go crawl
if to_crawl == True:

    # make a folder for every page if not already created
    for i in range(1, n_pages + 1):
        folder_path = courses_pages_path + "page_" + str(i)
        if not os.path.exists(folder_path):
            os.makedirs(folder_path)

    # populate folders
    with open(courses_urls_path, 'r') as file_1:
        for i, course_url in tqdm(enumerate(file_1), total = n_courses):
            course_url = course_url.strip('\n')
            course_file_path = courses_pages_path + "page_" + str(1 + i // 15) + "/" + "course_" + str(1 + i % 15) + ".html"

            if not os.path.exists(course_file_path): # if already crawled do not repeat
                # get page
                webpage = requests.get(course_url, headers = headers)
                soup = BeautifulSoup(webpage.text, "html.parser")

                if soup.title.text == r"Just a moment...":
                    raise IOError("Crawler has been blocked by the website. Try again with higher idle time.")

                # write file
                with open(course_file_path, 'w+', encoding = "utf-8") as file_2:
                    # html_page = soup.prettify()
                    # file_2.write(str(html_page))
                    file_2.write(str(soup))
                # the file automatically close itself when the "with" section ends, saving the written lines

                time.sleep(sleep_time) # wait to avoid getting blocked


All files already crawled. Using the existing version.


We observe that some pages, such as `https://www.findamasters.com/masters-degrees/course/emergency-management-and-resilience-msc/?i373d7361c25450` (page 215, course 3), are missing and return a filler webpage. We will have to treat those courses carefully, as the only information we can get from them is the link. We can easily identify this kind of page thanks to its title: "`FindAMasters | 500 Error : Internal Server Error`".
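
The two title checks described so far can be gathered in a single helper, sketched below. It uses the standard-library `html.parser` instead of BeautifulSoup, only to stay dependency-free; `classify_page` and the label strings are names chosen for this illustration. Unlike `soup.title.text`, it also copes with pages that have no `<title>` at all.

```python
from html.parser import HTMLParser

BLOCKED_TITLE = "Just a moment..."
MISSING_TITLE = "FindAMasters | 500 Error : Internal Server Error"

class _TitleParser(HTMLParser):
    """Collect the text inside the first <title> element."""
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.done = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title" and not self.done:
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self.in_title:
            self.in_title = False
            self.done = True

    def handle_data(self, data):
        if self.in_title:
            self.title += data

def classify_page(html):
    """Return 'blocked', 'missing' or 'ok' based on the page title."""
    parser = _TitleParser()
    parser.feed(html)
    parser.close()
    title = parser.title.strip()
    if title == BLOCKED_TITLE:
        return "blocked"
    if title == MISSING_TITLE:
        return "missing"
    return "ok"
```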

In [None]:
# crawling correctness check

print("Checking the correctness of the crawl operation...")

blocked_pages = 0
unavailable_pages = 0
correct_pages = 0
for root, _, files in tqdm(os.walk(courses_pages_path), total = n_pages + 1): # checks for files in the 400 subfolders and in the root folder, thus 401
    for file in files:
        course_file_path = os.path.join(root, file)

        with open(course_file_path, 'r', encoding = "utf-8") as html_file:
            html_content = html_file.read()

        soup = BeautifulSoup(html_content, "html.parser")
        page_title = soup.title.text

        if page_title == r"Just a moment...": # blocked during crawling
            blocked_pages += 1
            os.remove(course_file_path) # delete it so that the next crawl downloads it again
        elif page_title == r"FindAMasters | 500 Error : Internal Server Error": # missing on website
            unavailable_pages += 1 # keep the filler page: the parsing step recognizes it by its title
        else: # downloaded correctly
            correct_pages += 1

print(blocked_pages, "pages were blocked during crawling and have been removed. If this value is not zero run the crawling again to get the missing pages.")
print(unavailable_pages, "pages are not present on the website anymore.")
print(correct_pages, "pages have been correctly downloaded.")

## 1.3 Parse downloaded pages

At this point, you should have all the HTML documents about the master's degree of interest, and you can start to extract specific information. The list of the information we desire for each course and their format is as follows:

1. Course Name (to save as ```courseName```): string;
2. University (to save as ```universityName```): string;
3. Faculty (to save as ```facultyName```): string;
4. Full or Part Time (to save as ```isItFullTime```): string;
5. Short Description (to save as ```description```): string;
6. Start Date (to save as ```startDate```): string;
7. Fees (to save as ```fees```): string;
8. Modality (to save as ```modality```): string;
9. Duration (to save as ```duration```): string;
10. City (to save as ```city```): string;
11. Country (to save as ```country```): string;
12. Presence or online modality (to save as ```administration```): string;
13. Link to the page (to save as ```url```): string.

<div style="overflow-x:auto;">
<table>
<thead>
  <tr>
    <th>index</th>
    <th>courseName</th>
    <th>universityName</th>
    <th>facultyName</th>
    <th>isItFullTime</th>
    <th>description</th>
    <th>startDate</th>
    <th>fees</th>
    <th>modality</th>
    <th>duration</th>
    <th>city</th>
    <th>country</th>
    <th>administration</th>
    <th>url</th>
  </tr>
</thead>
<tbody>
  <tr>
    <td>0</td>
    <td> Accounting and Finance - MSc</td>
    <td>University of Leeds</td>
    <td>Leeds University Business School</td>
    <td>Full time</td>
    <td>Businesses and governments rely on [...].</td>
    <td>September</td>
    <td>UK: £18,000 (Total) International: £34,750 (Total)</td>
    <td>MSc</td>
    <td>1 year full time</td>
    <td>Leeds</td>
    <td>United Kingdom</td>
    <td>On Campus</td>
    <td><a href="https://www.findamasters.com/masters-degrees/course/accounting-and-finance-msc/?i321d3232c3891">Link</a></td>
  </tr>
  <tr>
    <td>1</td>
    <td> Accounting, Accountability & Financial Management MSc</td>
    <td>King’s College London</td>
    <td>King’s Business School</td>
    <td>Full time</td>
    <td>Our Accounting, Accountability & Financial Management MSc course will provide [...].</td>
    <td>September</td>
    <td>Please see the university website for further information on fees for this course.</td>
    <td>MSc</td>
    <td>1 year FT</td>
    <td>London</td>
    <td>United Kingdom</td>
    <td>On Campus</td>
    <td><a href="https://www.findamasters.com/masters-degrees/course/accounting-accountability-and-financial-management-msc/?i132d7816c25522">Link</a></td>
  </tr>
  <tr>
    <td>2</td>
    <td> Accounting, Financial Management and Digital Business - MSc</td>
    <td>University of Reading</td>
    <td>Henley Business School</td>
    <td>Full time</td>
    <td>Embark on a professional accounting career [...].</td>
    <td>September</td>
    <td>Please see the university website for further information on fees for this course.</td>
    <td>MSc</td>
    <td>1 year full time</td>
    <td>Reading</td>
    <td>United Kingdom</td>
    <td>On Campus</td>
    <td><a href="https://www.findamasters.com/masters-degrees/course/accounting-financial-management-and-digital-business-msc/?i345d4286c351">Link</a></td>
  </tr>
  <tr>
    <td>3</td>
    <td> Addictions MSc</td>
    <td>King’s College London</td>
    <td>Institute of Psychiatry, Psychology and Neuroscience</td>
    <td>Full time</td>
    <td>Join us for an online session for prospective [...].</td>
    <td>September</td>
    <td>Please see the university website for further information on fees for this course.</td>
    <td>MSc</td>
    <td>One year FT</td>
    <td>London</td>
    <td>United Kingdom</td>
    <td>On Campus</td>
    <td><a href="https://www.findamasters.com/masters-degrees/course/addictions-msc/?i132d4318c27100">Link</a></td>
  </tr>
  <tr>
    <td>4</td>
    <td> Advanced Chemical Engineering - MSc</td>
    <td>University of Leeds</td>
    <td>School of Chemical and Process Engineering</td>
    <td>Full time</td>
    <td>The Advanced Chemical Engineering MSc at Leeds [...].</td>
    <td>September</td>
    <td>UK: £13,750 (Total) International: £31,000 (Total)</td>
    <td>MSc</td>
    <td>1 year full time</td>
    <td>Leeds</td>
    <td>United Kingdom</td>
    <td>On Campus</td>
  </tr>
  <!-- Add more rows here as needed -->
</tbody>
</table>
</div>


For each master's degree, you create a `course_i.tsv` file of this structure:

```
courseName \t universityName \t  ... \t url
```

If a piece of information is missing, you just leave it as an empty string.

---

We can observe that some pieces of information are "mandatory", meaning that every page (that is not a filler page) has them, while others may or may not be present. Filler pages, on the other hand, give us no information at all: for those we only have the page URL. To create a `.tsv` file we can simply use the Python `csv` module, setting its delimiter to the tab character `\t`.
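
As a small illustration of the `csv` approach, the sketch below writes a header and one row with a tab delimiter and reads them back. The shortened column list, the `write_tsv` helper and the `example.org` URL are placeholders for this example only; an in-memory buffer stands in for the `.tsv` file.

```python
import csv
import io

COLUMNS = ["courseName", "universityName", "url"]  # shortened header, for illustration only

def write_tsv(handle, row):
    """Write the header and one data row, tab-separated; missing fields become empty strings."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(COLUMNS)
    writer.writerow([row.get(column, "") for column in COLUMNS])

buffer = io.StringIO()  # stands in for an open .tsv file
write_tsv(buffer, {"courseName": "Addictions MSc", "url": "https://example.org/course"})

# reading it back with the same delimiter recovers the empty field
rows = list(csv.reader(io.StringIO(buffer.getvalue()), delimiter="\t"))
```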

In [7]:
tsvs_path = r"data/tsvs/" # path of the folder containing all the .tsv files

# create folder if not exist already
if not os.path.exists(tsvs_path):
    os.makedirs(tsvs_path)

# we check if all the files already exist, in this case we do not repeat the parsing
to_crawl = False
files_count = 0
for _, _, files in os.walk(tsvs_path):
    files_count += len(files)

if files_count < n_courses:
    print("Creating .tsv files...")
    to_crawl = True
else:
    print("All files already created. Using the existing version.")

# if data is missing go parse
if to_crawl == True:
    with open(courses_urls_path, 'r') as courses_file:
        for i, url in tqdm(enumerate(courses_file), total = n_courses):
            url = url.strip() # remove the trailing newline so that it does not end up in the .tsv file

            # if file .tsv already exist skip its creation
            tsv_file_path = tsvs_path + "course_" + str(1 + i) + ".tsv"
            if os.path.exists(tsv_file_path):
                continue

            # create path, open and read .html file
            course_file_path = courses_pages_path + "page_" + str(1 + i // 15) + "/" + "course_" + str(1 + i % 15) + ".html"
            with open(course_file_path, 'r', encoding = "utf-8") as html_file:
                html_content = html_file.read()

            soup = BeautifulSoup(html_content, "html.parser")

            # if the page is not available we can't get any information
            if soup.title.text == r"FindAMasters | 500 Error : Internal Server Error":
                courseName = universityName = facultyName = isItFullTime = description = startDate = fees = modality = duration = city = country = administration = ""

            else:
                # get all the required fields

                # simplest kind of requests
                # those fields always exist for an available page
                courseName = soup.find("h1", {"class": "course-header__course-title"}).get_text(strip = True)
                universityName = soup.find("a", {"class": "course-header__institution"}).get_text(strip = True)
                facultyName = soup.find("a", {"class": "course-header__department"}).get_text(strip = True)
                description = soup.find("div", {"class": "course-sections__description"}).find("div", {"class": "course-sections__content"}).get_text(strip = True)
                startDate = soup.find("span", {"class": "key-info__start-date"}).get_text(strip = True)
                modality = soup.find("span", {"class": "key-info__qualification"}).get_text(strip = True)
                duration = soup.find("span", {"class": "key-info__duration"}).get_text(strip = True)
                city = soup.find("a", {"class": "course-data__city"}).get_text(strip = True)
                country = soup.find("a", {"class": "course-data__country"}).get_text(strip = True)

                # some entries do not have this field, we have to handle the missing data
                extract = soup.find("span", {"class": "key-info__study-type"})
                if extract is None:
                    isItFullTime = ""
                else:
                    isItFullTime = extract.get_text(strip = True)

                extract = soup.find("div", {"class": "course-sections__fees"})
                if extract is None:
                    fees = ""
                else:
                    fees = extract.find("div", {"class": "course-sections__content"}).get_text(strip = True)

                # courses can be 'on_campus', 'online' or both, but this information is stored in different tags
                # we have to check for both and compose the value
                extract1 = soup.find("a", {"class": "course-data__online"})
                extract2 = soup.find("a", {"class": "course-data__on-campus"})
                if extract1 is None and extract2 is None:
                    administration = ""
                elif extract2 is None:
                    administration = extract1.get_text(strip = True)
                elif extract1 is None:
                    administration = extract2.get_text(strip = True)
                else:
                    administration = extract1.get_text(strip = True) + " & " + extract2.get_text(strip = True)

            # create list containing column names and data of the page
            data = [["courseName", "universityName", "facultyName", "isItFullTime", "description", "startDate", "fees", "modality", "duration", "city", "country", "administration", "url"],
                    [courseName, universityName, facultyName, isItFullTime, description, startDate, fees, modality, duration, city, country, administration, url]]

            # convert list in .tsv file
            with open(tsv_file_path, 'w+', newline='', encoding = "utf-8") as tsv_file:
                writer = csv.writer(tsv_file, delimiter = '\t', lineterminator = '\n')
                writer.writerows(data)

All files already created. Using the existing version.


# 4. Visualizing the most relevant MSc degrees

Using maps can help people understand how far one university is from another so they can plan their academic careers more adequately. Here, we challenge you to show a map of the courses found with the score defined in point 3. You should be able to identify at least the *city* and *country* for each MSc degree. You can find some ideas on how to create maps in Python [here](https://plotly.com/python/maps/) and [here](https://towardsdatascience.com/visualizing-geospatial-data-in-python-e070374fe621) but you will maybe need further information for a proper visualization, like coordinates (latitude and longitude). You can retrieve this data using various tools:

1. [Here](https://medium.com/@manilwagle/geocoding-the-world-using-google-api-and-python-1f6b6fb6ca48) you can find a helpful tutorial on how to encode geo-informations using Google API in Python (this tool can also be used in [Google Sheets](https://handsondataviz.org/geocode.html))
2. You can collect a list of unique places in the format (City, Country) and ask chatGPT (or, as usual, any other LLM chatbot) to provide you with a list of corresponding representative coordinates
3. Explore and find the best solution for your case!
   
Once you defined your visualization strategy, include a way to encode fees in your charts. The map should show (with a proper legend) different courses and associated taxation: the user wants a glimpse not only of how far he will need to move but also of how much it will cost him!

---

Given the courses obtained from the previous points, for every entry in our courses dataset we first have to convert `city`, `country` and `universityName` into latitude and longitude. To do that we use the library `geopy` with `Bing` as geocoder, which requires an API key that is omitted here. To avoid needless repetition of API calls, we save the conversion from address to coordinates in a `.csv` file.
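
The caching idea can be sketched independently of the geocoder. In the example below, `geocode` is any callable that returns a `(latitude, longitude)` pair or `None`; the function names and the `address` column are assumptions made for this illustration and differ from the table layout used in the cells of this notebook.

```python
import csv
import os

def load_cache(path):
    """Read previously saved coordinates into a dict keyed by address."""
    if not os.path.exists(path):
        return {}
    with open(path, newline="", encoding="utf-8") as handle:
        return {row["address"]: (float(row["latitude"]), float(row["longitude"]))
                for row in csv.DictReader(handle)}

def save_cache(path, cache):
    """Write the whole cache back to disk."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["address", "latitude", "longitude"])
        for address, (latitude, longitude) in cache.items():
            writer.writerow([address, latitude, longitude])

def geocode_cached(address, cache, geocode):
    """Return (latitude, longitude), calling `geocode` only for unseen addresses."""
    if address not in cache:
        result = geocode(address)
        if result is None:  # the service found nothing: do not cache a failure
            return None
        cache[address] = result
    return cache[address]
```

Checking for `None` matters because a geocoder may fail to resolve an address, in which case there are no coordinates to store.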

---

**WIP: Real data missing, for now we use all the entries in our dataset.**

In [8]:
# load courses dataset
courses_columns = ["courseName", "universityName", "facultyName", "isItFullTime", "description", "startDate", "fees", "modality", "duration", "city", "country", "administration", "url"]
# read every .tsv file and concatenate them all at once (faster than growing the dataframe inside the loop)
courses_frames = []
for i in tqdm(range(n_courses)):
    file_path = tsvs_path + "course_" + str(1 + i) + ".tsv"
    courses_frames.append(pd.read_csv(file_path, sep = '\t'))
courses_df = pd.concat(courses_frames, ignore_index = True)[courses_columns]

# remove those courses whose page was not available
courses_df = courses_df.dropna(subset = ["courseName"])
courses_df

  0%|          | 0/6000 [00:00<?, ?it/s]

Unnamed: 0,courseName,universityName,facultyName,isItFullTime,description,startDate,fees,modality,duration,city,country,administration,url
0,3D Design for Virtual Environments - MSc,Glasgow Caledonian University,School of Engineering and Built Environment,Full time,3D visualisation and animation play a role in ...,September,Please see the university website for further ...,MSc,1 year full-time,Glasgow,United Kingdom,On Campus,https://www.findamasters.com/masters-degrees/c...
1,Accounting and Finance - MSc,University of Leeds,Leeds University Business School,Full time,Businesses and governments rely on sound finan...,September,"UK: £18,000 (Total)International: £34,750 (Total)",MSc,1 year full time,Leeds,United Kingdom,On Campus,https://www.findamasters.com/masters-degrees/c...
2,"Accounting, Accountability & Financial Managem...",King’s College London,King’s Business School,Full time,"Our Accounting, Accountability & Financial Man...",September,Please see the university website for further ...,MSc,1 year FT,London,United Kingdom,On Campus,https://www.findamasters.com/masters-degrees/c...
3,"Accounting, Financial Management and Digital B...",University of Reading,Henley Business School,Full time,Embark on a professional accounting career wit...,September,Please see the university website for further ...,MSc,1 year full time,Reading,United Kingdom,On Campus,https://www.findamasters.com/masters-degrees/c...
4,Addictions MSc,King’s College London,"Institute of Psychiatry, Psychology and Neuros...",Full time&Part time,Join us for an online session for prospective ...,September,Please see the university website for further ...,MSc,One year FT,London,United Kingdom,On Campus,https://www.findamasters.com/masters-degrees/c...
...,...,...,...,...,...,...,...,...,...,...,...,...,...
5995,Materials Engineering,University of Padua,School of Engineering,Full time,The Master's degree Materials Engineering is a...,October,Our tuition fees will not exceed 2700 euros pe...,MSc,2 years,Padua,Italy,On Campus,https://www.findamasters.com/masters-degrees/c...
5996,Materials Engineering MSc,Swansea University,School of Engineering and Applied Sciences,Full time&Part time,The MSc in Materials Engineering provides you ...,September,Please visit our website for the Materials Eng...,MSc,1 year full-time; 2 years part-time; 3 years p...,Swansea,United Kingdom,On Campus,https://www.findamasters.com/masters-degrees/c...
5997,Materials Engineering MSc by Research,Swansea University,School of Engineering and Applied Sciences,Full time&Part time,Swansea is one of the UK’s leading centres for...,"October, January",Please visit our website for the Materials Eng...,"MSc,Research Only",1 year full-time; 2 years part-time,Swansea,United Kingdom,On Campus,https://www.findamasters.com/masters-degrees/c...
5998,"Materials Engineering with Industry, MSc",Swansea University,School of Engineering and Applied Sciences,Full time,Our MSc in Materials Engineering with Industry...,September,For current fees of the Materials Engineering ...,MSc,2 Years Full Time With a Year In Industry,Swansea,United Kingdom,On Campus,https://www.findamasters.com/masters-degrees/c...


We found out that a course has its city name spelled wrong! We correct it in the following line.

In [9]:
len(courses_df[(courses_df["isItFullTime"] == "Part time") |(courses_df["isItFullTime"] == "Full time&Part time")].groupby("universityName"))

194

To comply with the strict Nominatim Usage Policy (see https://operations.osmfoundation.org/policies/nominatim/) we decided to convert all useful locations to coordinates and store them in a `.csv` file for future use, to avoid unnecessary re-collection of data.
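
The policy sets an absolute maximum of one request per second. Below is a minimal sketch of a wrapper that enforces a minimum delay between consecutive calls. The name `throttled` and its parameters are hypothetical, and the clock and sleep functions are injectable only so that the behaviour can be checked without waiting. `geopy` offers a comparable helper, `RateLimiter`, which takes a `min_delay_seconds` argument.

```python
import time

def throttled(func, min_delay=1.0, clock=time.monotonic, sleep=time.sleep):
    """Wrap `func` so that consecutive calls start at least `min_delay` seconds apart."""
    last_call = [None]  # time of the previous call, kept in a list so the closure can update it

    def wrapper(*args, **kwargs):
        if last_call[0] is not None:
            remaining = min_delay - (clock() - last_call[0])
            if remaining > 0:
                sleep(remaining)  # wait only for the part of the delay that has not elapsed yet
        last_call[0] = clock()
        return func(*args, **kwargs)

    return wrapper
```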

In [87]:
# nominatim
coordinates_table_path = r"data/coordinates_table.csv"

if os.path.exists(coordinates_table_path):
    print("Table already created!")
    coordinates_df =  pd.read_csv(coordinates_table_path)

else:
    print("Creating table...")

    coordinates_df = pd.DataFrame(columns = ["city", "country", "latitude", "longitude"])
    grouped = courses_df.groupby(["city", "country"])

    for group_name, group in tqdm(grouped, total = len(grouped)):
        location_string = group_name[0] + ", " + group_name[1]
        geolocator = Nominatim(user_agent = "DataScienceHomework_application")

        location = geolocator.geocode(location_string, timeout = None)

        new_row = pd.DataFrame({"city": [group_name[0]],
                                "country": [group_name[1]],
                                "latitude": [location.latitude],
                                "longitude": [location.longitude]})
        coordinates_df = pd.concat((coordinates_df, new_row), ignore_index = True)

        time.sleep(sleep_time)

    coordinates_df.to_csv(coordinates_table_path, index = False)

coordinates_df

Table already created!


Unnamed: 0,city,country,latitude,longitude
0,Aachen,Germany,50.776351,6.083862
1,Aalborg,Denmark,57.046263,9.921526
2,Aberdeen,United Kingdom,57.148243,-2.092809
3,Aberystwyth,United Kingdom,52.414311,-4.081685
4,Abu Dhabi,United Arab Emirates,24.453835,54.377401
...,...,...,...,...
214,Wolverhampton,United Kingdom,52.584765,-2.127567
215,Worcester,United Kingdom,52.191185,-2.220658
216,York,United Kingdom,53.965658,-1.074305
217,Zagreb,Croatia,45.813097,15.977279


In [17]:
# bing
coordinates_table_path = r"data/coordinates_table.csv"

# if already created don't waste api calls
if os.path.exists(coordinates_table_path):
    print("Table already created!")
    coordinates_df =  pd.read_csv(coordinates_table_path)

else:
    print("Creating table...")

    coordinates_df = pd.DataFrame(columns = ["universityName", "city", "country", "latitude", "longitude"])
    grouped = courses_df.groupby(["universityName", "city", "country"])

    # we check the coordinates of every university
    for group_name, group in tqdm(grouped, total = len(grouped)):
        location_string = group_name[0] + ", " + group_name[1] + ", " + group_name[2] # compose the address
        geolocator = Bing(api_key = os.environ["BING_MAPS_API_KEY"]) # the API key is read from an environment variable (the variable name is our choice) so it is not stored in the notebook
        location = geolocator.geocode(location_string, timeout = None) # execute the geolocation

        # create new row and insert entry into dataframe
        new_row = pd.DataFrame({"universityName": [group_name[0]],
                                "city": [group_name[1]],
                                "country": [group_name[2]],
                                "latitude": [location.latitude],
                                "longitude": [location.longitude]})
        coordinates_df = pd.concat((coordinates_df, new_row), ignore_index = True)

    coordinates_df.to_csv(coordinates_table_path, index = False)

coordinates_df

Table already created!


Unnamed: 0,universityName,city,country,latitude,longitude
0,AFG College with the University of Aberdeen,Doha,Qatar,25.285633,51.526417
1,Aalborg University,Aalborg,Denmark,57.015285,9.978995
2,Aalborg University Copenhagen Campus,Copenhagen,Denmark,55.698200,12.571623
3,Aalto University,Tampere,Finland,61.498020,23.760311
4,Aberdeen University,Aberdeen,United Kingdom,57.164803,-2.102050
...,...,...,...,...,...
330,Xi’an Jiaotong-Liverpool University,Suzhou,China,31.274799,120.738068
331,York St John University,York,United Kingdom,53.965401,-1.080673
332,York University (Canada),Toronto,Canada,43.773705,-79.503281
333,Zhejiang Gongshan University,Hangzhou,China,30.245920,120.210793


---

After this conversion we create an interactive map using the library `folium` and populate it with markers for every relevant course.


---

In [94]:
# nominatim
# create empty world map centered on Rome
map = folium.Map(location = [41.902782, 12.496366], max_bounds = True, min_zoom = 2)

# we add a single marker per city
for group_tuple, group in courses_df.groupby(["city", "country"]):
    coordinates_row = coordinates_df[(coordinates_df["city"] == group_tuple[0]) & (coordinates_df["country"] == group_tuple[1])]
    description_string = group_tuple[0] + ", " + group_tuple[1]

    folium.Marker(
    location = [coordinates_row["latitude"].item(), coordinates_row["longitude"].item()],
    popup = description_string,
    icon = folium.Icon()
    ).add_to(map)

map

ValueError: can only convert an array of size 1 to a Python scalar

---

what's missing?

* use new fees as float
* make marker change color following the cost
* make legend
* apply on result of point 3
* upload coordinates converter on aws
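
For the first two items of the list above, one possible approach is sketched here: extract a number from the free-text `fees` field and map it to a marker color that can be passed to `folium.Icon(color = ...)`. The function names, the choice of the largest amount and both thresholds are assumptions, not settled design decisions. Currencies are not converted, and any other number in the text (a year, for example) would be mistaken for an amount.

```python
import re

def parse_fee(text):
    """Return the largest amount found in a free-text fee, or None if there is none.

    Taking the largest amount is an assumption: it usually corresponds to
    the international fee when several figures are listed.
    """
    amounts = re.findall(r"\d[\d,]*(?:\.\d+)?", str(text))
    values = [float(amount.replace(",", "")) for amount in amounts]
    return max(values) if values else None

def fee_color(fee, cheap=10000, expensive=25000):
    """Pick a marker color for a fee; the two thresholds are arbitrary placeholders."""
    if fee is None:
        return "gray"
    if fee < cheap:
        return "green"
    if fee < expensive:
        return "orange"
    return "red"
```

Courses whose fee text contains no number get a gray marker, so they stay visible on the map instead of being dropped.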


In [57]:
# bing
# create empty world map centered on Rome
map = folium.Map(location = [41.902782, 12.496366], max_bounds = True, min_zoom = 2)

# we add a single marker per university
for group_tuple, group in courses_df.groupby(["universityName", "city", "country"]):
    # get coordinates of marker
    coordinates_row = coordinates_df[(coordinates_df["universityName"] == group_tuple[0]) & (coordinates_df["city"] == group_tuple[1]) & (coordinates_df["country"] == group_tuple[2])]

    # get address
    description_string = "<b>" + group_tuple[0] + ", " + group_tuple[1] + ", " + group_tuple[2] + "</b>" + "<br>"

    # get courses infos
    for i, course in group.iterrows():
        description_string += "<br>" + "<b>" + "Course name: " + "</b>" + course["courseName"] + "<br>" + "<b>" + "Fees: " + "</b>" + str(course["fees"]) + "<br>"

    # create marker
    iframe = folium.IFrame(description_string)

    popup = folium.Popup(iframe,
                        min_width=500,
                        max_width=500)

    folium.Marker(
                  location = [coordinates_row["latitude"].item(), coordinates_row["longitude"].item()],
                  popup = popup,
                  icon = folium.Icon()
                  ).add_to(map)

# display map
map

In [115]:
# compare the current coordinates with the older Nominatim table
old_coordinates_df = pd.read_csv(r"data/coordinates_table_nominatim.csv")

for _, row1 in coordinates_df.iterrows():
    for _, row2 in old_coordinates_df.iterrows():
        # use the first old row with the same city and country
        if (row1["city"] == row2["city"]) and (row1["country"] == row2["country"]):
            # print the universities whose two positions differ by more than 1 degree
            if (abs(row1["latitude"] - row2["latitude"])) > 1 or (abs(row1["longitude"] - row2["longitude"])) > 1:
                print(row1["universityName"])
            break

Brock University
Harper Adams University
Virginia Polytechnic Institute and State University


# 7. Algorithmic Question 

Leonardo is an intern at a company. He is paid based on the total number of hours he has worked. They agreed __d__ days ago that Leonardo could not work less than $minTime_i$ or more than $maxTime_i$ hours per <ins>i-th</ins> day. Furthermore, he was warned by HR that on his last day at the company, he should provide a detailed report on how many hours he worked <ins>each day</ins> for the previous d days.

Today is the day Leonardo should report to HR, but the problem is that he <ins>didn't</ins> account for how many hours he put in for each day, so he only has the __total sum of the hours__ ($sumHours$) he put in total in these d days. He believes that if he creates a report in which each number $dayHours_i$ corresponds to the __total hours he worked on the i-th day__ while satisfying the HR limitations and the total sum of all $dayHours_i$ equals $sumHours$, he would be fine.

He cannot create such a report independently and requests your assistance. He will give you the number of days $d$, total hours spent $sumHours$, and the HR limitations for each day $i$, and he wants you to assist him in determining whether it is possible to create such a fake report. If that is possible, make such a report. 

**Input**

The first line of input contains two integers __d__, $sumHours$ - the number of days Leonardo worked there and the total number of hours he worked for the company. Each of the following __d__ lines contains two integer numbers $minTime_i$ and $maxTime_i$ - the minimum and maximum hours he can work on the $i_{th}$ day. 

**Output**

If such a report cannot be generated, print 'NO' in one output line. If such a report is possible, print 'YES' in the output and d numbers - the number of hours Leonardo spent each day - in the second line. If more than one solution exists, print any of them. 

__Input 1__
```
2 5
0 1
3 5
```
__Output 1__
```
YES
1 4 
```
---
__Input 2__
```
1 1
5 6
```
__Output 2__
```
NO
```

---

**7.1 Implement a code to solve the above mentioned problem.**

In [6]:
def algorithmic_question(input_string):
    # get first line of input
    input_list = input_string.split("\n")
    d, sumHours = map(int, input_list.pop(0).split(" "))

    # get other lines
    result = True
    minList = np.zeros([d], dtype = int)
    maxList = np.zeros([d], dtype = int)
    minSum = 0
    maxSum = 0
    for i in range(d):
        # named minTime/maxTime to avoid shadowing the built-in min and max
        minTime, maxTime = map(int, input_list[i].split(" "))

        if minTime > maxTime: # impossible constraint, immediate fail
            result = False
            break

        # store values in the pre-allocated arrays
        minList[i] = minTime
        maxList[i] = maxTime

        # update sums
        minSum += minTime
        maxSum += maxTime

    # total constraints satisfaction
    if (sumHours < minSum) or (sumHours > maxSum):
        result = False

    # creating output
    if not result:
        print("NO")
    else:
        print("YES")
        outputList = minList # start from the minimum hours of every day
        outputSum = minSum
        diffs = maxList - minList
        for i in range(d):
            if outputSum + diffs[i] < sumHours:
                # raise this day to its maximum
                outputList[i] += diffs[i]
                outputSum += diffs[i]
            else:
                # assign the remaining hours and stop
                outputList[i] += sumHours - outputSum
                break

        print(*outputList, sep = " ")
 
input1 = "2 5\n0 1\n3 5"
print("Test1:")
algorithmic_question(input1)


input2 = "1 1\n5 6"
print("\nTest2:")
algorithmic_question(input2)

Test1:
YES
1 4

Test2:
NO


**7.2 What is the __time complexity__ (the Big O notation) of your solution? Please provide a <ins>detailed explanation</ins> of how you calculated the time complexity.**

---

We can view the algorithm as two parts: the first checks whether the constraints can be satisfied, and the second builds a possible distribution of the hours throughout the days. Calling `d` the number of days (and therefore the number of constraint lines in the input), the first part is a loop that performs the following operations at each iteration:

* Reading and splitting one input line (constant)
* Checking if the constraint is impossible (constant)
* Storing the values in the pre-allocated arrays (constant)
* Adding the values to the accumulators (constant)

After the loop there is a check on the total constraints of the problem, which is constant too. The setup before the loop (splitting the input string into lines, removing the first line with `pop(0)` and allocating the two arrays) is at most linear in `d`. The worst case is when no single constraint is impossible: every line is read, which brings the cost of the first part to `O(d)`. The second part creates the output of the function. If the constraints checked in the first part are unsatisfiable we just print "NO" (constant time). If the problem is satisfiable we have to build a list of hours that satisfies all the constraints: for every day the value has to be $\ge$ the minimum and $\le$ the maximum of that specific day, and the sum has to be exactly `sumHours`.
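The three conditions on a valid report can be checked mechanically. The snippet below is a minimal sketch, assuming a hypothetical helper `is_valid_report` (it is not part of our solution) and constraints passed as a list of `(min, max)` tuples:

```python
def is_valid_report(day_hours, bounds, sum_hours):
    """Return True if day_hours respects every (min, max) pair and sums to sum_hours."""
    # one value per day is required
    if len(day_hours) != len(bounds):
        return False
    # every day must stay between its minimum and its maximum
    within_bounds = all(low <= hours <= high for hours, (low, high) in zip(day_hours, bounds))
    # the total must match exactly
    return within_bounds and sum(day_hours) == sum_hours


# the report printed for Input 1 satisfies all the conditions
print(is_valid_report([1, 4], [(0, 1), (3, 5)], 5))
```

Such a check is itself linear in `d`, so it could be used to test the output without changing the overall cost.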

To satisfy the last constraint we start with every day set to its minimum and then raise each day to its maximum for as long as the total stays below `sumHours`; at the first day where this would reach or exceed `sumHours` we assign exactly the remaining hours, so that the sum becomes `sumHours`. The setup of the loop consists of some constant-time operations and the creation of the `diffs` array, which contains the difference between max and min for every day; creating that array costs `O(d)`. We then loop through the days and apply this operation. With numpy arrays the indexing is constant, so one iteration of the loop takes constant time. The worst case is when we have to loop through all the days (e.g. `sumHours` is equal to the sum of all the `max`), which brings the cost of the second part to `O(d)`. Printing the `d` values is linear as well.

Since both parts are linear, the cost of the whole algorithm is linear, `O(d)`, with `d` the size of the input.

**7.3 Ask ChatGPT or any other LLM chatbot tool to check your code's time complexity (the Big O notation). Compare your answer to theirs. Do you believe this is correct? If the <ins>two differ</ins>, which one is right? (why?)**

---

We asked ChatGPT 3.5 to evaluate the time complexity of the code using the following prompt:

```
You are asked to evaluate the following algorithm wirtten in python and evaluate its time complexity in Big O notation in the worst-case scenario.
[python code of the algorithm]
```

And we obtained the following response:

```
This algorithm seems to involve iterating through each constraint provided in the input and checking certain conditions. Let's break down the time complexity:

Getting the first line of input and initializing variables takes constant time, denoted as O(1).
Iterating through the input constraints involves going through 'd' iterations, where 'd' represents the number of days. Within each iteration:
Splitting the input and mapping values takes constant time, O(1).
Checking if min > max and updating lists involves constant time operations, O(1).
Adding values to the lists and updating sums takes constant time, O(1).
Checking the total constraints satisfaction also takes constant time, O(1).
The nested loop and operations inside it iterate through 'd' constraints, making the time complexity for this part O(d) in total.

The last part of the algorithm involves another loop through 'd' constraints to determine the output. Within this loop:

The majority of operations inside the loop are constant time operations, O(1).
However, the loop might break prematurely, depending on the conditions, leading to an average case scenario of approximately d/2 iterations.
The overall time complexity of this algorithm in the worst-case scenario is approximately O(d) considering the dominant part is the iteration through 'd' constraints.
```

The response of ChatGPT is very similar to ours, both in the result and (surprisingly) in the form. The LLM seems to have analyzed the code by dividing it into two parts, as we did. We want to highlight that it decided to talk about an "average case scenario" even though we explicitly asked for the "worst-case scenario". In this specific case this has no significant effect, as $d/2 \in O(d)$; however, in the analysis of other algorithms it could lead to a different result, e.g. quicksort has a worst case of $O(n^2)$ while its average case is $O(n \log n)$.
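To make the gap between the two scenarios concrete, the sketch below counts the comparisons against the pivot for a simple quicksort that always picks the first element as pivot. Both the pivot rule and the helper `count_quicksort_comparisons` are assumptions made for illustration and are not part of the homework code. On an already sorted input every partition is maximally unbalanced, which gives the quadratic worst case.

```python
import random


def count_quicksort_comparisons(values):
    """Count element-vs-pivot comparisons of a first-element-pivot quicksort."""
    items = list(values)
    if len(items) <= 1:
        return 0
    pivot, rest = items[0], items[1:]
    smaller, larger = [], []
    for x in rest:
        # one comparison against the pivot per remaining element
        (smaller if x < pivot else larger).append(x)
    return len(rest) + count_quicksort_comparisons(smaller) + count_quicksort_comparisons(larger)


n = 200
sorted_input = list(range(n))
shuffled_input = sorted_input[:]
random.Random(0).shuffle(shuffled_input)  # fixed seed for reproducibility

worst = count_quicksort_comparisons(sorted_input)      # n * (n - 1) / 2 comparisons
typical = count_quicksort_comparisons(shuffled_input)  # far fewer on a shuffled input
print(worst, typical)
```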

**7.4 What do you think of the __optimality__ of your code? Do you believe it is optimal? Can you improve? Please <ins>elaborate</ins> on your response.**

---

We believe the code we have written is asymptotically optimal. Just reading the input once already costs $O(d)$, so an algorithm with a lower time complexity could not read all of the input; however, without reading all of the input it is impossible to know whether the hours assigned to a given day break some constraint.

In the average case a small performance gain might be obtained in the second part if, instead of calculating all the `diffs` at once, we calculated them only when needed, since we probably do not need the later ones. Even this is debatable, as the vectorization of the operation makes it very efficient.
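A minimal sketch of this lazy variant is shown below, written in plain Python and returning the report instead of printing it. The function name `build_report` and the `(min, max)` tuple input are assumptions made for the example; this is not a replacement for the solution above.

```python
def build_report(bounds, sum_hours):
    """Return a list of daily hours, or None when no valid report exists."""
    min_sum = sum(low for low, _ in bounds)
    max_sum = sum(high for _, high in bounds)
    # infeasible if a single day is inconsistent or the total is out of range
    if any(low > high for low, high in bounds) or not (min_sum <= sum_hours <= max_sum):
        return None

    report = [low for low, _ in bounds]  # start from the minimum of every day
    remaining = sum_hours - min_sum      # hours still to distribute
    for i, (low, high) in enumerate(bounds):
        if remaining == 0:
            break                        # later differences are never computed
        extra = min(high - low, remaining)  # difference computed only when needed
        report[i] += extra
        remaining -= extra
    return report


print(build_report([(0, 1), (3, 5)], 5))  # Input 1
print(build_report([(5, 6)], 1))          # Input 2
```

The asymptotic cost stays `O(d)` because the feasibility check still reads every constraint; only the work done in the second loop can shrink.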