## Downloading articles by system technology from CANTEACH

This Python script automates scraping and downloading PDF documents from the CANTEACH website (https://canteach.candu.org/Pages/Home.aspx). It uses Selenium to drive the browser on the dynamically expanded system pages, and requests with BeautifulSoup to fetch and parse the main page. Progress is saved after every download, so an interrupted run can resume without downloading the same files again. The sections below explain each part of the code:

### Setup and Initialization:

The script imports necessary libraries for web scraping, browser automation, and file handling.
- It defines the base URL of the CANTEACH website, the main page URL containing system links, and the local directory where the documents will be saved.
- It ensures that the directory for saving documents exists.

### Setting Up Selenium WebDriver:

The script initializes the Selenium WebDriver for Firefox, with an option to run headless.
- It clears all cookies to avoid issues with large request headers.

### Getting System URLs:

The script sends a request to the main page to get URLs for different systems.
- It parses the HTML content of the page to find links that contain "Forms" in their href attribute, as these are the links to the system pages.
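
The link filter described above can be sketched with the standard library alone. This is a minimal sketch, not the notebook's implementation: it swaps BeautifulSoup for `html.parser.HTMLParser` so it runs without extra packages, and it uses `urllib.parse.urljoin` instead of string concatenation so that an href that is already absolute is not prefixed a second time. The names `SystemLinkParser` and `extract_system_urls`, and the paths in the sample HTML, are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class SystemLinkParser(HTMLParser):
    """Collect (link text, href) pairs for anchors whose href contains a marker."""

    def __init__(self, marker="Forms"):
        super().__init__()
        self.marker = marker
        self.links = []      # finished (text, href) pairs
        self._href = None    # href of the matching anchor currently open
        self._text = []      # text fragments seen inside that anchor

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href and self.marker in href:
                self._href = href
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append(("".join(self._text).strip(), self._href))
            self._href = None


def extract_system_urls(html, base_url):
    """Map each system name to an absolute URL."""
    parser = SystemLinkParser()
    parser.feed(html)
    parser.close()
    return {name: urljoin(base_url, href) for name, href in parser.links}


# Hypothetical sample markup: one relative link, one unrelated link, one absolute link
sample_html = """
<a href="/Content%20Library/Forms/Reactor.aspx">Reactor</a>
<a href="/SitePages/About.aspx">About</a>
<a href="https://canteach.candu.org/Content%20Library/Forms/Moderator.aspx"> Moderator </a>
"""
system_urls = extract_system_urls(sample_html, "https://canteach.candu.org")
for name, url in system_urls.items():
    print(f"{name}: {url}")
```

Only the two links whose href contains "Forms" are kept, and both come out as absolute URLs regardless of how they were written in the page.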

### Scraping Document Links:

The script uses Selenium to navigate each system page and collect PDF document links.
- It opens each system URL, clears cookies again, and waits for the page to load completely.
- It locates elements containing the text 'Title', which, when expanded, reveal the PDF links.
- It scrolls each 'Title' element into view and clicks on it to expand the section, then collects all PDF links from the expanded sections.
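
The check applied to each collected link can be isolated as a pure function, which makes it easy to try out without a browser. The sketch below is an illustration under assumptions: `normalize_pdf_link` is a hypothetical helper, the file names are made up, and unlike a plain suffix check it compares the extension case-insensitively and ignores any query string.

```python
from urllib.parse import urljoin, urlsplit

BASE_URL = "https://canteach.candu.org"


def normalize_pdf_link(href, base_url=BASE_URL, library_marker="Content%20Library"):
    """Return an absolute PDF URL, or None if href is not a library PDF link."""
    if not href or library_marker not in href:
        return None
    absolute = urljoin(base_url + "/", href)
    # Look at the path only, so a query string or fragment cannot hide the extension
    if not urlsplit(absolute).path.lower().endswith(".pdf"):
        return None
    return absolute


# Hypothetical hrefs, as they might be read from the expanded sections
hrefs = [
    "/Content%20Library/20040701.pdf",
    "https://canteach.candu.org/Content%20Library/20040701.pdf",
    "https://canteach.candu.org/Content%20Library/NOTES.PDF",
    "https://canteach.candu.org/SitePages/Home.aspx",
    None,
]

# A set removes the duplicate produced by the relative and absolute forms of the same link
document_urls = {url for url in map(normalize_pdf_link, hrefs) if url}
for url in sorted(document_urls):
    print(url)
```

Because the same link list is re-read after every click, collecting into a set keeps each document URL once.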

### Downloading Documents:

The script downloads each PDF document and saves it to the specified folder.
- It handles exceptions to ensure the script continues running even if a download fails.
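
One way to make the saving step more robust against interruptions is to write to a temporary file and rename it into place, so a half-written PDF never appears under its final name. The following is a sketch of that idea, not what the notebook does: `filename_from_url` and `save_atomically` are hypothetical helpers, and the example URL is made up. The sketch also decodes `%20` in the file name, which is an optional choice.

```python
import os
import tempfile
from urllib.parse import unquote, urlsplit


def filename_from_url(document_url):
    """Derive a readable local file name from the last path segment of a URL."""
    last_segment = urlsplit(document_url).path.rsplit("/", 1)[-1]
    # basename() guards against an encoded separator (%2F) inside the segment
    return os.path.basename(unquote(last_segment))


def save_atomically(content, save_path):
    """Write bytes to a temporary file in the target folder, then rename it."""
    folder = os.path.dirname(save_path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=folder, suffix=".part")
    try:
        with os.fdopen(fd, "wb") as tmp_file:
            tmp_file.write(content)
        os.replace(tmp_path, save_path)  # Atomic when both paths are on the same filesystem
    except BaseException:
        # Clean up the partial file, then let the caller see the error
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


name = filename_from_url("https://canteach.candu.org/Content%20Library/Reactor%20Notes.pdf")
with tempfile.TemporaryDirectory() as folder:
    save_atomically(b"%PDF-1.4 placeholder bytes", os.path.join(folder, name))
    saved_files = os.listdir(folder)
print(name, saved_files)
```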

### Saving and Loading Progress:

- The script saves the progress of downloaded URLs to a file, allowing it to resume from where it left off if interrupted.
- It loads the progress from the file upon restart.
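
The save and load round trip can be illustrated with a small standalone example. As a labeled alternative to the pickle file used in the notebook, this sketch stores progress as JSON, which is human-readable and does not execute code when loaded. The function names, the `progress.json` file name, and the sample URLs are assumptions made for illustration.

```python
import json
import os
import tempfile


def save_progress_json(path, system_urls, downloaded_urls):
    """Write progress to a temporary file first, then rename it over the old one."""
    payload = {
        "system_urls": system_urls,
        "downloaded_urls": sorted(downloaded_urls),  # JSON has no set type
    }
    tmp_path = path + ".tmp"
    with open(tmp_path, "w", encoding="utf-8") as f:
        json.dump(payload, f, indent=2)
    os.replace(tmp_path, path)


def load_progress_json(path):
    """Return (system_urls, downloaded_urls), or (None, empty set) on a first run."""
    if not os.path.exists(path):
        return None, set()
    with open(path, encoding="utf-8") as f:
        payload = json.load(f)
    return payload["system_urls"], set(payload["downloaded_urls"])


with tempfile.TemporaryDirectory() as tmp_dir:
    progress_path = os.path.join(tmp_dir, "progress.json")
    first_run = load_progress_json(progress_path)  # No file yet
    save_progress_json(
        progress_path,
        {"Reactor": "https://canteach.candu.org/reactor"},
        {"https://canteach.candu.org/a.pdf"},
    )
    resumed = load_progress_json(progress_path)

print(first_run)
print(resumed)
```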

### Main Function:

The main function orchestrates the entire process:
- Sets up the Selenium WebDriver.
- Loads progress if it exists, or scrapes the main page to get system URLs.
- Iterates through each system URL, scrapes document links, and downloads the documents.
- Saves progress after each download.
- Handles exceptions to skip problematic URLs and continue with the next.
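
The resume behavior at the heart of this loop can be sketched without a browser or network access. The example below is a simplified illustration, assuming a `download` callable that raises an exception on failure; `download_pending`, `fake_download`, and the example.org URLs are hypothetical. A link is recorded as done only after its download succeeds, so failed links are retried on the next run.

```python
def download_pending(document_links, downloaded_urls, download, on_success=None):
    """Download links not yet recorded; return the links that failed."""
    failed = []
    for link in sorted(document_links):
        if link in downloaded_urls:
            continue  # Already handled in an earlier run
        try:
            download(link)
        except Exception as error:
            print(f"Skipping {link}: {error}")
            failed.append(link)
            continue
        downloaded_urls.add(link)
        if on_success is not None:
            on_success(downloaded_urls)  # For example, persist progress to disk
    return failed


attempted = []


def fake_download(url):
    """Stand-in for a real download; fails for one specific URL."""
    attempted.append(url)
    if url.endswith("broken.pdf"):
        raise IOError("simulated failure")


done = {"https://example.org/a.pdf"}  # Pretend this one was saved by an earlier run
links = {
    "https://example.org/a.pdf",
    "https://example.org/b.pdf",
    "https://example.org/broken.pdf",
}
failed = download_pending(links, done, fake_download)
print("attempted:", attempted)
print("failed:", failed)
```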

# LIBRARIES

In [None]:
import os
import requests
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.firefox.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.action_chains import ActionChains
from webdriver_manager.firefox import GeckoDriverManager
import time
import pickle

In [None]:
# Constants
#   base_url: The base URL of the CANTEACH website.
#   main_page_url: The URL of the main page that lists the systems; system URLs are scraped from it.
#   save_folder: The local directory where the documents will be saved.
#   progress_file: The file that progress is saved to and loaded from.
base_url = "https://canteach.candu.org"
main_page_url = f"{base_url}/SitePages/Publications%20by%20System.aspx"
save_folder = "CANTEACH_Documents"
progress_file = "progress.pkl"
os.makedirs(save_folder, exist_ok=True)

In [None]:
# Function that initializes the Selenium WebDriver for Firefox, with an option to run headless.
def setup_driver():
    """
    Set up the Selenium WebDriver for Firefox.

    Returns:
        WebDriver: Configured Selenium WebDriver for Firefox.
    """
    options = webdriver.FirefoxOptions()
    # options.add_argument("--headless")  # Should be uncommented to run headless
    driver = webdriver.Firefox(service=Service(GeckoDriverManager().install()), options=options)
    driver.delete_all_cookies()
    return driver

In [None]:
# Function to scrape the main page and get system URLs with their names
def get_system_urls(main_page_url):
    """
    Scrape the main page to get system URLs with their names.

    Args:
        main_page_url (str): URL of the main page listing systems.

    Returns:
        dict: A dictionary with system names as keys and their corresponding URLs as values.
    """
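    response = requests.get(main_page_url, timeout=30)
    response.raise_for_status()  # Fail early if the main page could not be fetched
    soup = BeautifulSoup(response.text, 'html.parser')

    system_urls = {}
    for link in soup.find_all('a', href=True):
        href = link['href']
        if "Forms" in href:
            system_name = link.text.strip()
            # Prefix the base URL only for site-relative links
            system_urls[system_name] = href if href.startswith("http") else base_url + href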
    
    return system_urls

In [None]:
# Function to scrape a system page and get document links with Selenium
def get_document_links(driver, system_url):
    """Scrape a system page to get document links using Selenium.

    Args:
        driver (WebDriver): The Selenium WebDriver instance.
        system_url (str): URL of the system page to scrape.

    Returns:
        set: A set of document URLs.
    """
    driver.get(system_url)
    driver.delete_all_cookies()  # Clear cookies again before each request to handle "400 Bad request" HTTP error
    wait = WebDriverWait(driver, 10)
    wait.until(EC.presence_of_all_elements_located((By.TAG_NAME, 'span')))

    document_urls = set()  # Uses a set to avoid duplicate URLs

    print(f"Scraping page: {driver.current_url}")

    # Locate all 'Title' elements on the web-page using XPath
    title_elements = driver.find_elements(By.XPATH, "//a[contains(text(),'Title')]")
    
    print(f"Found {len(title_elements)} elements with 'Title'")

    for title_element in title_elements:
        try:
            # Scrolling the element into view using JavaScript
            driver.execute_script("arguments[0].scrollIntoView();", title_element)
            time.sleep(1)  # Wait for scrolling to complete

            # Using ActionChains to move to the element and click
            ActionChains(driver).move_to_element(title_element).click().perform()
            time.sleep(2)  # Wait for the PDFs to be displayed

            # Finding all PDF links within the expanded section
            pdf_links = driver.find_elements(By.CSS_SELECTOR, 'a.ms-listlink')
            for link in pdf_links:
                href = link.get_attribute('href')
                if href and 'Content%20Library' in href and href.lower().endswith('.pdf'):
                    # Fixing the URL concatenation issue
                    if href.startswith("/"):
                        href = base_url + href
                    document_urls.add(href)  # Adding URL to set
                    print(f"Found PDF link: {href}")
        except Exception as e:
            print(f"Could not click element. Error: {str(e)}")

    return document_urls

In [None]:
# Function to download a document and save it in the appropriate folder
def download_document(document_url, save_folder):
    """Download a document and save it in the appropriate folder.

    Args:
        document_url (str): URL of the document to download.
        save_folder (str): Path to the folder where the document will be saved.
    """
    try:
        response = requests.get(document_url, timeout=60)  # Avoid hanging forever on a stalled connection
        response.raise_for_status()
        document_name = document_url.split('/')[-1]
        
        save_path = os.path.join(save_folder, document_name)
        with open(save_path, 'wb') as file:
            file.write(response.content)
        print(f"Downloaded: {document_name}")
    except requests.exceptions.RequestException as e:
        print(f"Failed to download {document_url}. Error: {e}")


In [None]:
# Function to save progress to a file
def save_progress(system_urls, downloaded_urls):
    """Save progress to a file.

    Args:
        system_urls (dict): Dictionary of system URLs.
        downloaded_urls (set): Set of downloaded document URLs.
    """
    with open(progress_file, 'wb') as f:
        pickle.dump((system_urls, downloaded_urls), f)

In [None]:
# Function to load progress from a file
def load_progress():
    """Load progress from a file.

    Returns:
        tuple: A tuple containing the dictionary of system URLs and the set of downloaded document URLs.
    """
    if os.path.exists(progress_file):
        with open(progress_file, 'rb') as f:
            return pickle.load(f)
    return None, set()

In [None]:
# Main function to orchestrate the process
def main():
    """Main function to orchestrate the process."""
    driver = setup_driver()

    try:
        # Load saved progress if it exists; otherwise scrape the main page for system URLs
        saved_system_urls, downloaded_urls = load_progress()

        if saved_system_urls:
            system_urls = saved_system_urls
        else:
            system_urls = get_system_urls(main_page_url)

        # Print system URLs
        print("System URLs:")
        for system_name, system_url in system_urls.items():
            print(f"{system_name}: {system_url}")

        # Iterate through each system URL to get document links and download them
        for system_name, system_url in system_urls.items():
            # Create system folder
            system_folder = os.path.join(save_folder, system_name.replace(' ', '_'))
            os.makedirs(system_folder, exist_ok=True)

            try:
                document_links = get_document_links(driver, system_url)
            except Exception as e:
                print(f"Failed to scrape documents from {system_url}. Error: {e}")
                continue  # Skip to the next system URL

            # Print document URLs
            print(f"\nDocument URLs for {system_name}:")
            for document_link in document_links:
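                if document_link not in downloaded_urls:
                    print(document_link)
                    download_document(document_link, system_folder)
                    # download_document() swallows request errors, so record the URL
                    # only if the file actually exists on disk
                    saved_path = os.path.join(system_folder, document_link.split('/')[-1])
                    if os.path.exists(saved_path):
                        downloaded_urls.add(document_link)
                        save_progress(system_urls, downloaded_urls)  # Save progress after each download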
    
    except Exception as e:
        print(f"An error occurred: {str(e)}")
    finally:
        driver.quit()

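The system folder name in `main()` is built by replacing spaces only, so a link text that contains a path separator or other special characters could produce an unexpected path. A more defensive variant might look like the sketch below; `safe_folder_name` is a hypothetical helper and the system names are made-up examples.

```python
import re


def safe_folder_name(system_name, fallback="unnamed"):
    """Collapse every run of characters outside [word chars . -] into one underscore."""
    cleaned = re.sub(r"[^\w.-]+", "_", system_name.strip()).strip("_")
    return cleaned or fallback  # Empty link text still yields a usable folder name


for example in ["Heat Transport / Systems", "Fuel & Fuel Handling", "   "]:
    print(repr(example), "->", safe_folder_name(example))
```
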
In [None]:
if __name__ == "__main__":
    main()
