Notebook by:
* Lorenzo Pannacci 1948926
* INSERT NAME SURNAME AND ID HERE

# Startup

In [None]:
######################
# LIBRARIES DOWNLOAD #
######################

install_packages = False
if install_packages:
    %pip install requests beautifulsoup4 tqdm pandas numpy matplotlib

In [None]:
####################
# LIBRARIES IMPORT #
####################

import requests
from bs4 import BeautifulSoup
from tqdm.notebook import tqdm
import time
import os
import csv

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import re

# 1. Data collection

For this homework, there is no provided dataset. Instead, you have to build your own. Your search engine will run on text documents, so here we detail the procedure to follow for the data collection. We strongly suggest you work on different modules when implementing the required functions. For example, you may have a ```crawler.py``` module, a ```parser.py``` module, and an ```engine.py``` module: this is a good practice that improves readability in reporting and efficiency in deploying the code. Be careful: you will likely have to deal with exceptions and other possible issues!

## 1.1. Get the list of master's degree courses

We start with the list of courses to include in your corpus of documents. In particular, we focus on web scraping the [MSc Degrees](https://www.findamasters.com/masters-degrees/msc-degrees/) listing. Next, we want you to **collect the URL** associated with each course in that list.
The list is long and split into many pages. Therefore, we ask you to retrieve only the URLs of the courses listed in **the first 400 pages** (each page has 15 courses, so you will end up with 6000 unique master's degree URLs).

The output of this step is a `.txt` file in which each line corresponds to one master's URL.

---

First, we observe that we can navigate through the different pages using the link `https://www.findamasters.com/masters-degrees/msc-degrees/?PG=n`, replacing `n` with the number of the desired page. We also observe that after about 30 pages have been loaded the site treats us as a bot and refuses further requests. To work around this, we wait a few seconds before opening each new page.
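
As a minimal sketch of the URL pattern described above, the listing pages can be enumerated without any network access. The names `BASE_URL`, `page_url` and `iter_page_urls` are hypothetical helpers introduced here for illustration; they are not part of the notebook's code.

```python
from typing import Iterator

# Listing URL taken from the text above; the page number goes in the PG parameter.
BASE_URL = "https://www.findamasters.com/masters-degrees/msc-degrees/"


def page_url(n: int) -> str:
    """Return the URL of the n-th listing page (pages are numbered from 1)."""
    if n < 1:
        raise ValueError("page numbers start at 1")
    return f"{BASE_URL}?PG={n}"


def iter_page_urls(n_pages: int) -> Iterator[str]:
    """Yield the URLs of listing pages 1..n_pages in order."""
    for n in range(1, n_pages + 1):
        yield page_url(n)


urls = list(iter_page_urls(400))
print(urls[0])
print(len(urls))
```

Keeping the URL construction in one small function makes it easy to check the page range before any request is sent.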

In [None]:
data_folder_path = r"data/"
if not os.path.exists(data_folder_path):                    # create main data folder if doesn't already exist
    os.makedirs(data_folder_path)

courses_urls_path = data_folder_path + r"courses_urls.txt"  # file path of the txt file to create
sleep_time = 2                                              # idle time between two requests, to avoid being blocked
to_crawl = False                                            # set to True below if the file is missing or incomplete; otherwise we do not repeat the crawling
n_pages = 400                                               # number of pages to search through
n_courses = n_pages * 15                                    # total number of courses crawled

# to prevent the site from blocking the crawler as a bot, we use a user agent taken from a real Chrome session
headers = {"user-agent": r"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36"}

if not os.path.exists(courses_urls_path): # if file does not exist we have to crawl
    print("File does not exist! Crawling...")
    to_crawl = True

if not to_crawl: # if the file exists but is incomplete we have to crawl
    with open(courses_urls_path, 'r') as file:
        file_length = len(file.readlines())
        if file_length < n_courses:
            print("File exists but is incomplete! Crawling...")
            to_crawl = True
        else:
            print("File already exists and is complete. Using the previous version.")


# if data is missing go crawl
if to_crawl:

    with open(courses_urls_path, 'w') as file: # open the file; if it already exists it is overwritten
        for i in tqdm(range(n_pages)): # cycle through every page
            url = r"https://www.findamasters.com/masters-degrees/msc-degrees/?PG=" + str(1 + i) # we compose the url

            # get the webpage
            webpage = requests.get(url, headers = headers)
            soup = BeautifulSoup(webpage.text, "html.parser") # explicit parser, consistent with the later cells

            tags = soup.find_all('a', {"class": "courseLink"})  # get the tags we are interested in

            if not tags:
                raise IOError("Crawler has been blocked by the website. Try again with higher idle time.")

            for tag in tags: # for every tag get the course link and append to file
                link = tag["href"]
                file.write(r"https://www.findamasters.com" + link + "\n")

            time.sleep(sleep_time) # wait to avoid getting blocked

    # the file closes automatically when the "with" block ends, saving the written lines

## 1.2. Crawl master's degree pages

Once you get all the URLs in the first 400 pages of the list, you:

1. Download the HTML corresponding to each of the collected URLs.
2. After you collect a single page, immediately save its `HTML` in a file. In this way, if your program stops for any reason, you will not lose the data collected up to the stopping point.
3. Organize the downloaded `HTML` pages into folders. Each folder will contain the `HTML` of the courses on page 1, page 2, ... of the list of master's programs.
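
The folder layout in step 3 can be sketched as a pure function from a course's position in the URL list to its file path. This assumes 15 courses per listing page, as stated in section 1.1; `COURSES_PER_PAGE` and `course_html_path` are hypothetical names used only for illustration.

```python
import os

COURSES_PER_PAGE = 15  # each listing page shows 15 courses (see section 1.1)


def course_html_path(index: int, root: str = "data/courses_html_pages") -> str:
    """Map a 0-based position in the URL list to page_<p>/course_<c>.html."""
    page = index // COURSES_PER_PAGE + 1    # 1-based listing page
    course = index % COURSES_PER_PAGE + 1   # 1-based position within that page
    return os.path.join(root, f"page_{page}", f"course_{course}.html")


print(course_html_path(0))
print(course_html_path(15))
```

Because the path depends only on the index, a crawl that stops halfway can be resumed by skipping every index whose file already exists.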
   
__Tip__: Due to the large number of pages you should download, you can use some methods that can help you shorten the time. If you employed a particular process or approach, kindly describe it.

---

As before, we have to insert an idle time between two consecutive page loads to prevent the website from blocking us.
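
The notebook uses a fixed idle time between requests. As a possible alternative (not what the code below does), a blocked request could be retried with an exponentially growing delay. The sketch below is an assumption-laden illustration: `fetch_with_backoff` is a hypothetical helper, and `fetch` stands for any callable that returns the page HTML, or `None` when the site answers with its anti-bot page.

```python
import time
from typing import Callable, Optional


def fetch_with_backoff(
    fetch: Callable[[str], Optional[str]],
    url: str,
    base_delay: float = 2.0,
    max_attempts: int = 4,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call fetch(url) until it returns HTML, doubling the wait after each block."""
    delay = base_delay
    for attempt in range(1, max_attempts + 1):
        html = fetch(url)
        if html is not None:
            return html
        if attempt < max_attempts:
            sleep(delay)   # wait before the next attempt
            delay *= 2     # exponential backoff
    raise IOError(f"Still blocked after {max_attempts} attempts: {url}")
```

Passing `sleep` as a parameter keeps the helper testable without actually waiting; in the notebook one would leave the default `time.sleep`.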

In [None]:
courses_pages_path = r"data/courses_html_pages/" # path of the folder containing all the subfolders with the html files

# create folder if not exist already
if not os.path.exists(courses_pages_path):
    os.makedirs(courses_pages_path)

# we check whether the files already exist; in that case we do not repeat the crawling
to_crawl = False
files_count = 0
for _, _, files in os.walk(courses_pages_path):
    files_count += len(files)

if files_count < n_courses:
    print("Crawling...")
    to_crawl = True
else:
    print("All files already crawled. Using the existing version.")

# if data is missing go crawl
if to_crawl:

    # make a folder for every page if not already created
    for i in range(1, n_pages + 1):
        folder_path = courses_pages_path + "page_" + str(i)
        if not os.path.exists(folder_path):
            os.makedirs(folder_path)

    # populate folders
    with open(courses_urls_path, 'r') as file_1:
        for i, course_url in tqdm(enumerate(file_1), total = n_courses):
            course_url = course_url.strip('\n')
            course_file_path = courses_pages_path + "page_" + str(1 + i // 15) + "/" + "course_" + str(1 + i % 15) + ".html"

            if not os.path.exists(course_file_path): # if already crawled do not repeat
                # get page
                webpage = requests.get(course_url, headers = headers)
                soup = BeautifulSoup(webpage.text, "html.parser")

                # a page without a <title> tag is left to the correctness check below
                if soup.title is not None and soup.title.text == r"Just a moment...":
                    raise IOError("Crawler has been blocked by the website. Try again with higher idle time.")

                # write file
                with open(course_file_path, 'w+', encoding = "utf-8") as file_2:
                    html_page = soup.prettify()
                    file_2.write(str(html_page))
                # the file closes automatically when the "with" block ends, saving the written lines

                time.sleep(sleep_time) # wait to avoid getting blocked


We observe that some pages, such as `https://www.findamasters.com/masters-degrees/course/emergency-management-and-resilience-msc/?i373d7361c25450` (page 215, course 3), are missing from the website and return a filler webpage instead. We will have to treat those courses carefully, as the only information we can get from them is the link.
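
A minimal sketch of how a saved page can be classified from its `<title>`, using only the standard library so that it runs without extra dependencies. The two title strings are the ones observed in this notebook; `TitleParser` and `classify_page` are hypothetical names. Note that files saved with `prettify()` put the title text on its own indented line, so the text is stripped before comparing.

```python
from html.parser import HTMLParser
from typing import List, Optional

BLOCKED_TITLE = "Just a moment..."                                   # anti-bot page
MISSING_TITLE = "FindAMasters | 500 Error : Internal Server Error"   # filler page


class TitleParser(HTMLParser):
    """Collect the text of the first <title> element."""

    def __init__(self) -> None:
        super().__init__()
        self._in_title = False
        self._chunks: List[str] = []
        self.found = False

    def handle_starttag(self, tag, attrs):
        if tag == "title" and not self.found:
            self._in_title = True
            self.found = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self._chunks.append(data)

    @property
    def title(self) -> Optional[str]:
        # Strip the whitespace that prettified files add around the title text.
        return "".join(self._chunks).strip() if self.found else None


def classify_page(html: str) -> str:
    """Return 'blocked', 'unavailable' or 'ok' (also 'ok' when no title is found)."""
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    if parser.title == BLOCKED_TITLE:
        return "blocked"
    if parser.title == MISSING_TITLE:
        return "unavailable"
    return "ok"
```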

In [None]:
# crawling correctness check

print("Checking the correctness of the crawl operation...")

blocked_pages = 0
unavailable_pages = 0
correct_pages = 0
for root, _, files in tqdm(os.walk(courses_pages_path), total = n_pages + 1): # root folder plus one folder per page
    for file in files:
        course_file_path = os.path.join(root, file)

        with open(course_file_path, 'r', encoding = "utf-8") as html_file:
            html_content = html_file.read()

        soup = BeautifulSoup(html_content, "html.parser")
        # the files were saved with prettify(), which surrounds the title text with whitespace, so we strip it
        page_title = soup.title.get_text(strip = True) if soup.title else ""

        if page_title == r"Just a moment...": # blocked during crawling
            blocked_pages += 1
            os.remove(course_file_path)
        elif page_title == r"FindAMasters | 500 Error : Internal Server Error": # missing on website
            unavailable_pages += 1
        else: # page downloaded correctly
            correct_pages += 1

print(blocked_pages, "pages were blocked during crawling and have been removed. If this value is not zero, run the crawling again to get the missing pages.")
print(unavailable_pages, "pages are not present on the website anymore.")
print(correct_pages, "pages have been correctly downloaded.")

## 1.3 Parse downloaded pages

At this point, you should have all the HTML documents about the master's degree of interest, and you can start to extract specific information. The list of the information we desire for each course and their format is as follows:

1. Course Name (to save as ```courseName```): string;
2. University (to save as ```universityName```): string;
3. Faculty (to save as ```facultyName```): string;
4. Full or Part Time (to save as ```isItFullTime```): string;
5. Short Description (to save as ```description```): string;
6. Start Date (to save as ```startDate```): string;
7. Fees (to save as ```fees```): string;
8. Modality (to save as ```modality```): string;
9. Duration (to save as ```duration```): string;
10. City (to save as ```city```): string;
11. Country (to save as ```country```): string;
12. Presence or online modality (to save as ```administration```): string;
13. Link to the page (to save as ```url```): string.

<div style="overflow-x:auto;">
<table>
<thead>
  <tr>
    <th>index</th>
    <th>courseName</th>
    <th>universityName</th>
    <th>facultyName</th>
    <th>isItFullTime</th>
    <th>description</th>
    <th>startDate</th>
    <th>fees</th>
    <th>modality</th>
    <th>duration</th>
    <th>city</th>
    <th>country</th>
    <th>administration</th>
    <th>url</th>
  </tr>
</thead>
<tbody>
  <tr>
    <td>0</td>
    <td> Accounting and Finance - MSc</td>
    <td>University of Leeds</td>
    <td>Leeds University Business School</td>
    <td>Full time</td>
    <td>Businesses and governments rely on [...].</td>
    <td>September</td>
    <td>UK: £18,000 (Total) International: £34,750 (Total)</td>
    <td>MSc</td>
    <td>1 year full time</td>
    <td>Leeds</td>
    <td>United Kingdom</td>
    <td>On Campus</td>
    <td><a href="https://www.findamasters.com/masters-degrees/course/accounting-and-finance-msc/?i321d3232c3891">Link</a></td>
  </tr>
  <tr>
    <td>1</td>
    <td> Accounting, Accountability & Financial Management MSc</td>
    <td>King’s College London</td>
    <td>King’s Business School</td>
    <td>Full time</td>
    <td>Our Accounting, Accountability & Financial Management MSc course will provide [...].</td>
    <td>September</td>
    <td>Please see the university website for further information on fees for this course.</td>
    <td>MSc</td>
    <td>1 year FT</td>
    <td>London</td>
    <td>United Kingdom</td>
    <td>On Campus</td>
    <td><a href="https://www.findamasters.com/masters-degrees/course/accounting-accountability-and-financial-management-msc/?i132d7816c25522">Link</a></td>
  </tr>
  <tr>
    <td>2</td>
    <td> Accounting, Financial Management and Digital Business - MSc</td>
    <td>University of Reading</td>
    <td>Henley Business School</td>
    <td>Full time</td>
    <td>Embark on a professional accounting career [...].</td>
    <td>September</td>
    <td>Please see the university website for further information on fees for this course.</td>
    <td>MSc</td>
    <td>1 year full time</td>
    <td>Reading</td>
    <td>United Kingdom</td>
    <td>On Campus</td>
    <td><a href="https://www.findamasters.com/masters-degrees/course/accounting-financial-management-and-digital-business-msc/?i345d4286c351">Link</a></td>
  </tr>
  <tr>
    <td>3</td>
    <td> Addictions MSc</td>
    <td>King’s College London</td>
    <td>Institute of Psychiatry, Psychology and Neuroscience</td>
    <td>Full time</td>
    <td>Join us for an online session for prospective [...].</td>
    <td>September</td>
    <td>Please see the university website for further information on fees for this course.</td>
    <td>MSc</td>
    <td>One year FT</td>
    <td>London</td>
    <td>United Kingdom</td>
    <td>On Campus</td>
    <td><a href="https://www.findamasters.com/masters-degrees/course/addictions-msc/?i132d4318c27100">Link</a></td>
  </tr>
  <tr>
    <td>4</td>
    <td> Advanced Chemical Engineering - MSc</td>
    <td>University of Leeds</td>
    <td>School of Chemical and Process Engineering</td>
    <td>Full time</td>
    <td>The Advanced Chemical Engineering MSc at Leeds [...].</td>
    <td>September</td>
    <td>UK: £13,750 (Total) International: £31,000 (Total)</td>
    <td>MSc</td>
    <td>1 year full time</td>
    <td>Leeds</td>
    <td>United Kingdom</td>
    <td>On Campus</td>
  </tr>
  <!-- Add more rows here as needed -->
</tbody>
</table>
</div>


For each master's degree, you create a `course_i.tsv` file of this structure:

```
courseName \t universityName \t  ... \t url
```

If a piece of information is missing, you just leave it as an empty string.
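
As a minimal sketch of this file structure, the standard `csv` module can write a header row and one data row separated by tabs, with missing fields left empty. `FIELDS`, `text_or_empty` and `course_to_tsv` are hypothetical names for illustration; the example values come from the table above.

```python
import csv
import io

FIELDS = ["courseName", "universityName", "facultyName", "isItFullTime",
          "description", "startDate", "fees", "modality", "duration",
          "city", "country", "administration", "url"]


def text_or_empty(value) -> str:
    """Return a stripped string, or an empty string when the value is missing."""
    return "" if value is None else str(value).strip()


def course_to_tsv(record: dict) -> str:
    """Serialize one course as a header line plus one tab-separated data line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(FIELDS)
    writer.writerow([text_or_empty(record.get(name)) for name in FIELDS])
    return buffer.getvalue()


example = {
    "courseName": " Addictions MSc",
    "universityName": "King’s College London",
    "url": "https://www.findamasters.com/masters-degrees/course/addictions-msc/?i132d4318c27100",
}
tsv_text = course_to_tsv(example)
rows = list(csv.reader(io.StringIO(tsv_text), delimiter="\t"))
print(len(rows[1]))
```

Going through `csv.writer` rather than joining strings by hand means that a tab or newline inside a description is quoted instead of silently breaking the column layout.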

In [None]:
tsvs_path = r"data/tsvs/" # path of the folder containing all the .tsv files

# create folder if not exist already
if not os.path.exists(tsvs_path):
    os.makedirs(tsvs_path)

# we check whether all the files already exist; in that case we do not repeat the parsing
to_crawl = False
files_count = 0
for _, _, files in os.walk(tsvs_path):
    files_count += len(files)

if files_count < n_courses:
    print("Creating .tsv files...")
    to_crawl = True
else:
    print("All files already created. Using the existing version.")

# if data is missing, parse the pages
if to_crawl:
    with open(courses_urls_path, 'r') as courses_file:
        for i, url in tqdm(enumerate(courses_file), total = n_courses):
            url = url.strip() # drop the trailing newline so it does not end up in the .tsv

            # if the .tsv file already exists, skip its creation
            tsv_file_path = tsvs_path + "course_" + str(1 + i) + ".tsv"
            if os.path.exists(tsv_file_path):
                continue

            # create path, open and read .html file
            course_file_path = courses_pages_path + "page_" + str(1 + i // 15) + "/" + "course_" + str(1 + i % 15) + ".html"
            with open(course_file_path, 'r', encoding = "utf-8") as html_file:
                html_content = html_file.read()

            soup = BeautifulSoup(html_content, "html.parser")

            # if the page is not available we can't get any information
            # (the saved files are prettified, so the title text is stripped before comparing)
            if soup.title.get_text(strip = True) == r"FindAMasters | 500 Error : Internal Server Error":
                courseName = universityName = facultyName = isItFullTime = description = startDate = fees = modality = duration = city = country = administration = ""

            else:
                # get all the required fields

                courseName = soup.find("h1", {"class": "course-header__course-title"}).get_text(strip = True)
                universityName = soup.find("a", {"class": "course-header__institution"}).get_text(strip = True)
                facultyName = soup.find("a", {"class": "course-header__department"}).get_text(strip = True)

                # some entries do not have this field
                extract = soup.find("span", {"class": "key-info__study-type"})
                if extract is None:
                    isItFullTime = ""
                else:
                    isItFullTime = extract.get_text(strip = True)

                description = soup.find("div", {"class": "course-sections__description"}).find("div", {"class": "course-sections__content"}).get_text(strip = True)
                startDate = soup.find("span", {"class": "key-info__start-date"}).get_text(strip = True)

                # some entries do not have this field
                extract = soup.find("div", {"class": "course-sections__fees"})
                if extract is None:
                    fees = ""
                else:
                    fees = extract.find("div", {"class": "course-sections__content"}).get_text(strip = True)

                modality = soup.find("span", {"class": "key-info__qualification"}).get_text(strip = True)
                duration = soup.find("span", {"class": "key-info__duration"}).get_text(strip = True)
                city = soup.find("a", {"class": "course-data__city"}).get_text(strip = True)
                country = soup.find("a", {"class": "course-data__country"}).get_text(strip = True)

                # courses can be 'on_campus', 'online' or both, but this information is stored in different tags
                extract1 = soup.find("a", {"class": "course-data__online"})
                extract2 = soup.find("a", {"class": "course-data__on-campus"})
                if extract1 is None and extract2 is None:
                    administration = ""
                elif extract2 is None:
                    administration = extract1.get_text(strip = True)
                elif extract1 is None:
                    administration = extract2.get_text(strip = True)
                else:
                    administration = extract1.get_text(strip = True) + " & " + extract2.get_text(strip = True)

            data = [["courseName", "universityName", "facultyName", "isItFullTime", "description", "startDate", "fees", "modality", "duration", "city", "country", "administration", "url"],
                    [courseName, universityName, facultyName, isItFullTime, description, startDate, fees, modality, duration, city, country, administration, url]]

            with open(tsv_file_path, 'w+', newline='', encoding = "utf-8") as tsv_file: # same encoding as the saved html files
                writer = csv.writer(tsv_file, delimiter = '\t', lineterminator = '\n')
                writer.writerows(data)