##1. Data collection
For this homework, there is no provided dataset. Instead, you have to build your own. Your search engine will run on text documents, so here we detail the procedure to follow for the data collection. We strongly suggest you work on different modules when implementing the required functions. For example, you may have a crawler.py module, a parser.py module, and an engine.py module: this is good practice that improves readability in reporting and efficiency in deploying the code. Be careful: you are likely to run into exceptions and other possible issues!

###1.1. Get the list of master's degree courses
We start with the list of courses to include in your corpus of documents. In particular, we focus on web scraping the MSc Degrees. We want you to collect the URL associated with each course in the list. The list is long and split into many pages. Therefore, we ask you to retrieve only the URLs of the courses listed in the first 400 pages (each page has 15 courses, so you will end up with 6000 unique master's degree URLs).

The output of this step is a .txt file in which each line corresponds to one master's URL.

In [None]:
import requests
from datetime import datetime
from bs4 import BeautifulSoup
import pandas as pd
from tqdm import tqdm
import time
import re
import csv

In [None]:
master_urls = []  # empty list to which we append the course URLs we find
for i in tqdm(range(1, 401)):
    # Go through the first 400 listing pages and request each one by its URL
    url = 'https://www.findamasters.com/masters-degrees/msc-degrees?page=' + str(i)
    try:
        result = requests.get(url)
        soup = BeautifulSoup(result.text, 'html.parser')

        # Find and extract the URLs of master's degree courses
        course_links = soup.find_all('a', class_='courseLink text-dark')
        for link in course_links:
            course_url = link.get('href')
            if course_url:
                master_urls.append('https://www.findamasters.com' + course_url)
    except requests.RequestException as exc:
        # Report the failed page instead of silently skipping it
        print(f'Page {i} could not be downloaded: {exc}')

    # Wait one second between requests so we do not get banned from the website
    time.sleep(1)

# To conclude, we save all the course URLs in a .txt file, one per line
with open('urls.txt', 'w') as file:
    for course_url in master_urls:
        file.write(course_url + '\n')

100%|██████████| 400/400 [19:45<00:00,  2.96s/it]


*As you can see, we used Beautiful Soup for the web scraping and added a delay between requests to prevent our IP from being blocked by the website. After the scraping, we saved the URLs in a urls.txt file so that this process only has to run once, at the beginning of the project. As there are 400 pages with 15 master's degree URLs each, our .txt file ended up with 6000 unique URLs.*
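
The figure of 6000 unique URLs can be checked rather than assumed. The following is a minimal sketch of such a check; the helper name `deduplicate_urls` and the sample URLs are hypothetical and only serve as illustration.

```python
def deduplicate_urls(urls):
    """Return the URLs with surrounding whitespace removed, blank lines dropped,
    and duplicates removed, keeping the first occurrence of each."""
    seen = set()
    unique = []
    for raw in urls:
        url = raw.strip()
        if url and url not in seen:
            seen.add(url)
            unique.append(url)
    return unique


# Hypothetical sample; in the notebook this would be the master_urls list
sample = [
    'https://www.findamasters.com/a\n',
    'https://www.findamasters.com/b\n',
    'https://www.findamasters.com/a\n',
    '\n',
]
unique_sample = deduplicate_urls(sample)
print(len(sample), 'collected,', len(unique_sample), 'unique')
```

If the two counts differ for the real list, some course was listed on more than one page and the corpus would be smaller than 6000 documents.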

###1.2. Crawl master's degree pages
Once you get all the URLs in the first 400 pages of the list, you:

1. Download the HTML corresponding to each of the collected URLs.
2. After you collect a single page, immediately save its HTML in a file. In this way, if your program stops for any reason, you will not lose the data collected up to the stopping point.
3. Organize the downloaded HTML pages into folders. Each folder will contain the HTML of the courses on page 1, page 2, ... of the list of master's programs.
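
The last point asks for one folder per listing page. A minimal sketch of how such a path could be derived from the position of a URL in urls.txt is shown here. It assumes that every listing page contributed exactly 15 URLs, in order; the function name `html_path`, the root folder name `HTML` and the `page_k` naming are hypothetical choices.

```python
import os

COURSES_PER_PAGE = 15  # from the assignment text: each listing page has 15 courses


def html_path(index, root='HTML'):
    """Map a 1-based course index to <root>/page_<k>/Html-<index>.txt."""
    if index < 1:
        raise ValueError('index is 1-based')
    page = (index - 1) // COURSES_PER_PAGE + 1
    return os.path.join(root, f'page_{page}', f'Html-{index}.txt')


print(html_path(1))     # first course of page 1
print(html_path(16))    # first course of page 2
print(html_path(6000))  # last course of page 400
```

Before writing a file, the page folder would still need to be created, for example with `os.makedirs(os.path.dirname(path), exist_ok=True)`.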

In [None]:
import os

# Read the course URLs collected in step 1.1, one per line
with open('urls.txt') as archive:
    urls = archive.readlines()

# Create the output folder if it doesn't exist
output_folder = 'newFolderHTML'
os.makedirs(output_folder, exist_ok=True)
# Saving the HTML locally means each course page is requested only once (15*400 = 6000 requests in total)

for i in tqdm(range(len(urls))):
    # We only need the URL, not the trailing '\n'
    link = urls[i].strip()
    # Download the HTML of the course page
    html = requests.get(link)
    # Write each page to its own file in the output folder
    output_file = os.path.join(output_folder, f'Html-{i+1}.txt')
    # Mode 'w' (not 'a') so that a rerun overwrites a file instead of appending a second copy
    with open(output_file, 'w', encoding='utf-8') as doc:
        doc.write(html.text)
    time.sleep(1)

100%|██████████| 6000/6000 [2:18:18<00:00,  1.38s/it]


Here we crawl the master's degree course pages using the urls.txt file already in our directory. We created an output folder named 'newFolderHTML', in which we stored the HTML downloaded from each of the 6000 URLs under the names Html-1.txt, Html-2.txt, ..., Html-6000.txt. We will use this folder of HTML files in the next step.
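
Because every page is saved as soon as it is downloaded, an interrupted run can be resumed instead of restarted. A minimal sketch of finding which pages still need to be fetched follows; the helper name `pending_indices` is hypothetical, and the demonstration uses a temporary folder with made-up files rather than the real 'newFolderHTML'.

```python
import os
import tempfile


def pending_indices(folder, total):
    """Return the 1-based indices whose Html-<i>.txt is missing or empty."""
    pending = []
    for i in range(1, total + 1):
        path = os.path.join(folder, f'Html-{i}.txt')
        if not os.path.exists(path) or os.path.getsize(path) == 0:
            pending.append(i)
    return pending


# Demonstration: file 1 is complete, file 2 is blank, file 3 is missing
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, 'Html-1.txt'), 'w', encoding='utf-8') as f:
        f.write('<html>...</html>')
    open(os.path.join(tmp, 'Html-2.txt'), 'w').close()
    demo_pending = pending_indices(tmp, 3)

print(demo_pending)
```

Looping only over the returned indices would download the missing or blank pages without requesting the others again.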

###1.3. Parse downloaded pages
At this point, you should have all the HTML documents about the master's degree of interest, and you can start to extract specific information. The list of the information we desire for each course and their format is as follows:

- Course Name (to save as courseName): string;
- University (to save as universityName): string;
- Faculty (to save as facultyName): string;
- Full or Part Time (to save as isItFullTime): string;
- Short Description (to save as description): string;
- Start Date (to save as startDate): string;
- Fees (to save as fees): string;
- Modality (to save as modality): string;
- Duration (to save as duration): string;
- City (to save as city): string;
- Country (to save as country): string;
- Presence or online modality (to save as administration): string;
- Link to the page (to save as url): string.

For each master's degree, you create a course_i.tsv file of this structure:

courseName \t universityName \t  ... \t url
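
Scraped text can itself contain tabs or line breaks, which would shift columns or add rows in a file of this structure. A minimal sketch of building the two-line content with such characters collapsed into spaces is shown here; the helper names `clean_field` and `to_tsv` and the example record are hypothetical.

```python
import re

COLUMNS = [
    "courseName", "universityName", "facultyName", "isItFullTime",
    "description", "startDate", "fees", "modality", "duration",
    "city", "country", "administration", "url",
]


def clean_field(value):
    """Collapse tabs, newlines and repeated spaces into single spaces."""
    return re.sub(r'\s+', ' ', value).strip()


def to_tsv(record):
    """Build the two-line TSV text (header and data row) for one course.
    Keys missing from the record become empty strings."""
    header = '\t'.join(COLUMNS)
    row = '\t'.join(clean_field(record.get(name, '')) for name in COLUMNS)
    return header + '\n' + row + '\n'


# Made-up record, for illustration only
example = {
    'courseName': 'Example MSc',
    'description': 'First line.\nSecond\tline.',
    'url': 'https://www.findamasters.com/example',
}
text = to_tsv(example)
print(text)
```

Keeping every field on one line with no embedded tabs means each course file can later be read back by splitting on '\t' alone.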

*To parse all 6000 HTML documents in the HTML folder, we define a generic function that extracts the required information (where present) from each HTML file, and then we call it with the HTML folder name, the TSV output folder name and the URLs file.*

*Our function takes three parameters: the HTML folder name, the name of the folder in which to put the .tsv files, and the urls.txt file containing the 6000 URLs. For the i-th URL in urls.txt, the function reads the corresponding Html-i.txt file from the HTML folder and extracts the course, university, faculty, etc. using Beautiful Soup. It then writes a course_i.tsv file to the TSV output folder with two rows: a header row with the column names and a data row with the extracted values.*

In [None]:
import os
from bs4 import BeautifulSoup

def extract_course_info(html_folder, output_folder,  urls_file):
    # Create the output folder if it doesn't exist
    os.makedirs(output_folder, exist_ok=True)
    # Read the URLs from the URLs file (one URL per line);
    # the handle gets its own name so it does not shadow the urls_file parameter
    with open(urls_file, 'r', encoding='utf-8') as url_handle:
        urls = url_handle.read().splitlines()
    # Define column names
    column_names = [
        "courseName",
        "universityName",
        "facultyName",
        "isItFullTime",
        "description",
        "startDate",
        "fees",
        "modality",
        "duration",
        "city",
        "country",
        "administration",
        "url"
    ]
    # Create a header row in the TSV file
    header_row = "\t".join(column_names)

    # Iterate through each HTML file
    for i,url in enumerate(urls, start=1):
        html_filename = os.path.join(html_folder, f'Html-{i}.txt')
        output_filename = os.path.join(output_folder, f'course_{i}.tsv')

        # Initialize variables to store extracted data
        courseName = ""
        universityName = ""
        facultyName = ""
        isItFullTime = ""
        description = ""
        startDate = ""
        fees = ""
        modality = ""
        duration = ""
        city = ""
        country = ""
        administration = ""


        with open(html_filename, 'r', encoding='utf-8') as html_file:
            html_content = html_file.read()

            # Use BeautifulSoup to parse the HTML
            soup = BeautifulSoup(html_content, 'html.parser')

            # Extract the courseName
            courseName_element = soup.find('h1', class_='text-white course-header__course-title')
            if courseName_element:
                courseName = courseName_element.get_text(strip=True)

            # Extract the universityName and facultyName
            inst_dept_element = soup.find('h3', class_='h5 course-header__inst-dept')
            if inst_dept_element:
                institution_element = inst_dept_element.find('a', class_='course-header__institution')
                department_element = inst_dept_element.find('a', class_='course-header__department')

                if institution_element:
                    universityName = institution_element.get_text(strip=True)
                if department_element:
                    facultyName = department_element.get_text(strip=True)

            # Extract additional information
            key_info_elements = soup.find('div', class_='key-info__outer')
            if key_info_elements:
                fulltime_element = key_info_elements.find('span', class_='key-info__study-type')
                startdate_element = key_info_elements.find('span', class_='key-info__start-date')
                modality_element = key_info_elements.find('span', class_='key-info__qualification')
                duration_element = key_info_elements.find('span', class_='key-info__duration')
                # Reuse the elements found above instead of searching the whole page again
                if fulltime_element:
                    isItFullTime = fulltime_element.get_text(strip=True)
                if startdate_element:
                    startDate = startdate_element.get_text(strip=True)
                if modality_element:
                    modality = modality_element.get_text(strip=True)
                if duration_element:
                    duration = duration_element.get_text(strip=True)


            #Extract the geographical information
            course_data_element = soup.find('div', class_='course-data__container col-24 ml-md-n1 p-0 pb-3')
            if course_data_element:
                city_element = course_data_element.find('a', class_='course-data__city')
                country_element = course_data_element.find('a', class_='course-data__country')
                admin_element = course_data_element.find('a', class_='course-data__on-campus')
                if city_element:
                    city = city_element.get_text(strip=True)
                if country_element:
                    country = country_element.get_text(strip=True)
                if admin_element:
                    administration = admin_element.get_text(strip=True)

            # Extract fees information
            fees_element = soup.find('div', class_='course-sections__fees')
            if fees_element:
                fees_paragraph = fees_element.find('p')
                if fees_paragraph:
                    fees_text = fees_paragraph.get_text(strip=True)
                    if "Please see the university website for further information on fees for this course." not in fees_text:
                        fees = fees_text
            # Extract description information
            description_element = soup.find('div', class_='course-sections__description')
            if description_element:
                description_paragraph = description_element.find('p')
                if description_paragraph:
                    description_text = description_paragraph.get_text(strip=True)
                    description = description_text
        # Write the extracted data to a TSV file
        with open(output_filename, 'w', encoding='utf-8') as output_file:
            output_file.write(header_row + "\n")  # Write the header row
            data_row = "\t".join([
                courseName,
                universityName,
                facultyName,
                isItFullTime,
                description,
                startDate,
                fees,
                modality,
                duration,
                city,
                country,
                administration,
                url
            ])
            output_file.write(data_row)

In [None]:
# Specify the folder containing the HTML files and the output folder for TSV files and the urls.txt file
urls_file = 'urls.txt'
html_folder = 'newFolderHTML'
output_folder = 'folderTSV'

# now we call the function to extract course information
extract_course_info(html_folder, output_folder, urls_file)

*After this, all the files course_1.tsv, course_2.tsv, ..., course_6000.tsv are in 'folderTSV'.*
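
A quick sanity check on the generated files can catch rows broken by stray tabs or line breaks in the scraped text. The following is a minimal sketch, assuming each file should hold exactly a header and one data row of 13 tab-separated fields; the helper name `malformed_tsv_files` is hypothetical, and the demonstration runs on a temporary folder with made-up files rather than on 'folderTSV'.

```python
import os
import tempfile

EXPECTED_COLUMNS = 13  # courseName ... url


def malformed_tsv_files(folder):
    """Return the names of .tsv files that are not exactly a header plus one
    data row, each with EXPECTED_COLUMNS tab-separated fields."""
    bad = []
    for name in sorted(os.listdir(folder)):
        if not name.endswith('.tsv'):
            continue
        with open(os.path.join(folder, name), encoding='utf-8') as f:
            lines = f.read().split('\n')
        if lines and lines[-1] == '':
            lines.pop()  # ignore a single trailing newline
        ok = len(lines) == 2 and all(
            len(line.split('\t')) == EXPECTED_COLUMNS for line in lines
        )
        if not ok:
            bad.append(name)
    return bad


# Demonstration: course_1.tsv is well formed, course_2.tsv has a broken data row
with tempfile.TemporaryDirectory() as tmp:
    good_line = '\t'.join(['x'] * EXPECTED_COLUMNS)
    with open(os.path.join(tmp, 'course_1.tsv'), 'w', encoding='utf-8') as f:
        f.write(good_line + '\n' + good_line)
    with open(os.path.join(tmp, 'course_2.tsv'), 'w', encoding='utf-8') as f:
        f.write(good_line + '\n' + 'broken\nrow')
    demo_result = malformed_tsv_files(tmp)

print(demo_result)
```

An empty list for the real output folder would indicate that every course file has the expected shape.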

This is the end of the data collection process. We scraped the first 400 listing pages of the website to get 6000 URLs, crawled those URLs to download the HTML of each course page, and then used Beautiful Soup to extract the information we are interested in, creating 6000 tab-separated value (.tsv) files in a separate folder.

The most time-consuming step was downloading the HTML files for the 6000 URLs. With a waiting time of 1 second it took around 1.5 to 2 hours, and with a waiting time of 2 seconds it took double the time. The problem is that with a wait of only 1 second the website receives too many requests and starts blocking the IP address, so some of the URLs do not load. This initially resulted in blank HTML files being downloaded for those few URLs. With a waiting time of 2 seconds the problem was averted and each of the 6000 HTML files was correct, although we also had to split the urls.txt list into ranges: a first run over 1 to 1000, a second over 1001 to 3000 and a last one over 3001 to 6000.

To speed up the process, I tried multi-threading with a maximum of 50 threads, but it went through the 6000 URLs in 1 minute and only downloaded 100 HTML files. With 6000 pages to fetch, more threads would be needed, but that puts a considerable load on the server and results in IP blocking again. Hence, I dropped that idea and did it sequentially, trading time for accuracy. It took 4 hours to run, but in the end I got all 6000 HTML files correctly.
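
The blank pages described above could also be handled by retrying with a growing pause instead of rerunning whole ranges by hand. This is a minimal sketch of that idea, not the method used in this notebook; the function name `fetch_with_retry`, its parameters and the fake fetcher in the demonstration are hypothetical.

```python
import time


def fetch_with_retry(fetch, url, retries=3, base_delay=2.0, sleep=time.sleep):
    """Call fetch(url) until it returns non-blank text or retries run out.

    The wait doubles after each failed attempt (base_delay, 2*base_delay, ...).
    Returns the text, or None if every attempt came back blank or raised.
    """
    for attempt in range(retries):
        try:
            text = fetch(url)
        except Exception:
            # Treat any error from the fetcher like a blank page and retry
            text = ''
        if text and text.strip():
            return text
        if attempt < retries - 1:
            sleep(base_delay * 2 ** attempt)
    return None


# Demonstration with a fake fetcher: two blank pages, then a real one.
# The waits are recorded in a list instead of actually sleeping.
calls = []


def fake_fetch(url):
    calls.append(url)
    return '' if len(calls) < 3 else '<html>ok</html>'


waits = []
page = fetch_with_retry(fake_fetch, 'https://example.org/course', sleep=waits.append)
print(page, waits)
```

In the crawler, `fetch` could be something like `lambda u: requests.get(u, timeout=10).text`, so that only the URLs that came back blank pay the longer waiting time.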

My conclusion is that web scraping with Beautiful Soup in this way is not very well optimised. Perhaps we could look at an alternative framework that is more optimised and built for this purpose, for example Scrapy.