### 1.1. Get the list of master's degree courses

We start with the list of courses to include in your corpus of documents. In particular, we focus on web scraping the [MSc Degrees](https://www.findamasters.com/masters-degrees/msc-degrees/) listing. Next, we want you to **collect the URL** associated with each course in that listing.
The listing is long and split across many pages. Therefore, we ask you to retrieve only the URLs of the courses listed in **the first 400 pages** (each page has 15 courses, so you will end up with 6000 unique master's degree URLs).

The output of this step is a `.txt` file in which each line contains the URL of one master's course.
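Because the spider in this section opens `masters_urls.txt` in append mode, running the cell more than once would leave repeated lines in the file. A minimal sketch of a clean-up step follows, assuming the file is small enough to read into memory; `dedupe_urls` is a hypothetical helper name, not part of the assignment.

```python
from pathlib import Path


def dedupe_urls(path):
    """Rewrite `path` without blank or duplicate lines, keeping first-seen order.

    Hypothetical helper: returns the list of unique URLs it wrote back.
    """
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    # dict.fromkeys preserves insertion order, so the first occurrence of each URL wins.
    unique = list(dict.fromkeys(line.strip() for line in lines if line.strip()))
    # Write one URL per line, each terminated by a newline.
    Path(path).write_text("".join(url + "\n" for url in unique), encoding="utf-8")
    return unique
```

Calling `len(dedupe_urls('masters_urls.txt'))` after the crawl gives the number of unique URLs, which can be compared with the expected 400 × 15 = 6000.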

In [1]:
import scrapy # Import the Scrapy library
from scrapy.crawler import CrawlerProcess # Import the CrawlerProcess: for running the spider
from scrapy.utils.project import get_project_settings # Import get_project_settings: loads the project settings (Scrapy defaults when run outside a project)

class MastersSpider(scrapy.Spider): # Create a class to define the spider
    name = "masters_spider" # Spider name
    custom_settings = { # Custom settings
        'DOWNLOAD_DELAY': 2, # Specifies a delay of 2 seconds
    }

    def start_requests(self): # Generator that yields one Request per listing page
        base_url = 'https://www.findamasters.com/masters-degrees/msc-degrees/?PG=' # Base URL
        urls = [base_url + str(i) for i in range(1, 401)]  # Generate URLs for the first 400 pages
        for url in urls: # For each URL in the list
            yield scrapy.Request(url=url, callback=self.parse) # Pass it to the parse method

    def parse(self, response): # Define the parse method
        course_links = response.css('a.courseLink.text-dark::attr(href)').getall()  # Extract course links
        course_links = [response.urljoin(link) for link in course_links]  # Resolve relative links against the page URL
        with open('masters_urls.txt', 'a') as f:  # Open the file in append mode
            for link in course_links: # For each link in the list of links
                f.write(link + '\n')  # Write each link on a new line

# Run the spider
process = CrawlerProcess(get_project_settings()) # Initialize the CrawlerProcess
process.crawl(MastersSpider) # Schedule the spider class for crawling
process.start() # Start the crawling process

2023-11-11 01:27:40 [scrapy.utils.log] INFO: Scrapy 2.8.0 started (bot: scrapybot)
2023-11-11 01:27:40 [scrapy.utils.log] INFO: Versions: lxml 4.9.2.0, libxml2 2.10.3, cssselect 1.1.0, parsel 1.6.0, w3lib 1.21.0, Twisted 22.10.0, Python 3.11.4 | packaged by Anaconda, Inc. | (main, Jul  5 2023, 13:38:37) [MSC v.1916 64 bit (AMD64)], pyOpenSSL 23.2.0 (OpenSSL 1.1.1w  11 Sep 2023), cryptography 41.0.2, Platform Windows-10-10.0.22631-SP0
2023-11-11 01:27:40 [scrapy.crawler] INFO: Overridden settings:
{'DOWNLOAD_DELAY': 2}


See the documentation of the 'REQUEST_FINGERPRINTER_IMPLEMENTATION' setting for information on how to handle this deprecation.
  return cls(crawler)

2023-11-11 01:27:40 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2023-11-11 01:27:40 [scrapy.extensions.telnet] INFO: Telnet Password: 75891d888e9ae8a2
2023-11-11 01:27:40 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.tel