**OBJECTIVE:** Demonstrate that we can access data that is stored behind LSE's login page.

**AUTHOR:** [Kristina Dixon](https://www.github.com/KristinaD1910) (edited by [@jonjoncardoso](https://github.com/jonjoncardoso))

⚙️ **SETUP**

- Ensure you are running with the `chat-lse` conda environment and that you're up to date. See [README.md](../../README.md) if you haven't set up your environment yet.

    On the command line:

    ```bash
    conda activate chat-lse
    pip install -r requirements.txt
    ```

    <span style="color:red">**Note:** If you skip this step, you may get import errors for packages such as `tqdm`.</span>

- On VSCode, select `chat-lse` as the Python interpreter for this notebook and project.

**🔐 LSE Credentials**

You will need to provide your LSE credentials for this.


- If you haven't already, create a `.env` file in the root of this project. If it's your first time doing that, you can copy from the [`.env.sample` file](../../.env.sample). 
- Modify the variables `LSE_USERNAME` and `LSE_PASSWORD` so they contain your LSE credentials (e-mail and password, respectively).

    For example:

    ```bash
    LSE_USERNAME=J.Cardoso-Silva@lse.ac.uk
    LSE_PASSWORD=MySuperSecretPassword
    ```
- Have your Microsoft Authenticator app ready to approve the login request.

**Imports**

In [34]:
import os
import time
import random
import selenium

from dotenv import load_dotenv
from tqdm.notebook import tqdm, trange

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.common.exceptions import NoSuchElementException, StaleElementReferenceException, TimeoutException
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.action_chains import ActionChains

**Constants**

In [35]:
load_dotenv()
LSE_USERNAME = os.getenv('LSE_USERNAME')
LSE_PASSWORD = os.getenv('LSE_PASSWORD')
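
If the `.env` file is missing or a variable name is misspelled, `os.getenv` silently returns `None`, and the problem only surfaces later on the login form. A minimal sketch of a fail-fast check follows. `require_env` is a hypothetical helper name used for illustration; it is not defined elsewhere in this project.

```python
import os


def require_env(name):
    """Return the value of environment variable `name`, or raise if it is unset or empty.

    NOTE: hypothetical helper, not part of this project.
    """
    value = os.getenv(name)
    if not value:
        raise RuntimeError(
            f"{name} is not set. Check the .env file in the root of the project."
        )
    return value
```

In the constants cell this could replace the plain `os.getenv` calls, for example `LSE_USERNAME = require_env('LSE_USERNAME')`, called after `load_dotenv()`.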

**Util Functions**

In [151]:
def safely_get_elements(driver, css_selector, is_single_element=False, retries=3, wait_time=10):
    """
    More than simply return the element(s) based on the given CSS selector, 
    this function ensures the elements are visible on the page before trying to capture them
    and it also retries the operation a few times in case of failure.

    A function generic enough that can be used by any of our future Selenium scripts.

    Args:
        driver (WebDriver): The Selenium WebDriver instance.
        css_selector (str): The CSS selector to locate the elements.
        is_single_element (bool): Flag indicating whether to retrieve a single element or multiple elements.
        retries (int, optional): The number of retries in case of failure. Defaults to 3.
        wait_time (int, optional): The maximum wait time for the elements to be located. Defaults to 10.

    Returns:
        WebElement or list: The located web element(s) based on the given CSS selector. 
        If `is_single_element` is True, a single WebElement is returned. If `is_single_element` is False, a list of WebElements is returned.

    Raises:
        None

    """

    elements = None  # Output variable

    # Expected conditions
    ec_single_element = EC.visibility_of_element_located((By.CSS_SELECTOR, css_selector))
    ec_multiple_elements = EC.visibility_of_all_elements_located((By.CSS_SELECTOR, css_selector))

    for attempt in range(retries):
        try:
            if is_single_element:
                elements = WebDriverWait(driver, wait_time).until(ec_single_element)
            else:
                elements = WebDriverWait(driver, wait_time).until(ec_multiple_elements)
            break
        except (StaleElementReferenceException, TimeoutException) as e:
            if attempt < retries - 1:
                time.sleep(1)
                continue
            else:
                print(f"Failed to get element{'' if is_single_element else 's'}: {e}")
                if is_single_element:
                    return None
                else:
                    return []

    return elements

def safely_locate_element(driver, xpath, wait_time=5):
    try:
        element = WebDriverWait(driver, wait_time).until(EC.element_to_be_clickable((By.XPATH, xpath)))
        return element
    except TimeoutException:
        return None


def safely_click_element(driver, xpath, must_click=False, wait_time=5):
    """
    After ensuring that the element is clickable on the page, clicks on it.

    A function generic enough that can be used by any of our future Selenium scripts.

    Args:
        driver (WebDriver): The Selenium WebDriver instance.
        xpath (str): The XPath to locate the element.
        must_click (bool, optional): Flag indicating whether the element MUST be clicked. If it's a must, then a message is printed if the element cannot be clicked. Defaults to False.
        wait_time (int, optional): The maximum wait time for the element to be clickable. Defaults to 5.

    Returns:
        bool: True if the element was found and clicked, False otherwise.

    Raises:
        None. It simply prints an error message if `must_click` is True and the element cannot be clicked.

    """

    # safely_locate_element() already catches the TimeoutException and returns None on failure
    element = safely_locate_element(driver, xpath, wait_time=wait_time)
    if element:
        element.click()
        return True

    if must_click:
        print(f"Failed to click element: {xpath}")
    return False

# 1. Logging in to LSE (via the LSE Library website)

In [37]:
driver = webdriver.Firefox()
driver.get('https://www.lse.ac.uk/library')

# enter full screen
driver.fullscreen_window()

# Click the login element on the library homepage
WebDriverWait(driver, 5).until(EC.visibility_of_element_located((By.ID, 'Loginto'))).click()

# TODO: This part here could be moved to a `lse_login()` function to be reused in other scripts

# Follow the link on the next page (absolute XPath, so it will break if the page layout changes)
WebDriverWait(driver, 10).until(EC.visibility_of_element_located((By.XPATH, '/html/body/div[3]/div/div/h3[1]/a'))).click()

# Type the username into the Microsoft login form
WebDriverWait(driver, 30).until(EC.visibility_of_element_located((By.NAME, 'loginfmt'))).send_keys(LSE_USERNAME)

# Click 'Next'. `until()` raises a TimeoutException if the button never appears, so no None check is needed
next_button = WebDriverWait(driver, 5).until(EC.visibility_of_element_located((By.ID, 'idSIButton9')))
next_button.click()

# Type the password
WebDriverWait(driver, 5).until(EC.visibility_of_element_located((By.NAME, 'passwd'))).send_keys(LSE_PASSWORD)

# Submit the password
next_button = WebDriverWait(driver, 5).until(EC.visibility_of_element_located((By.ID, 'idSIButton9')))
next_button.click()

# Print the number that the user must enter into the Microsoft Authenticator app
print(WebDriverWait(driver, 10).until(EC.visibility_of_element_located((By.XPATH, '//*[@id="idRichContext_DisplaySign"]'))).text)

# Wait until the screen changes, this can be a variable length of time though
try:
    # Wait until the element is visible
    element = WebDriverWait(driver, 300).until(EC.visibility_of_element_located((By.ID, 'idBtn_Back')))
    # Act on the element as soon as it becomes visible
    element.click()
except TimeoutException:
    print("The element did not appear within the time limit")

# Click the first button in the grid on the page that follows
WebDriverWait(driver, 5).until(EC.visibility_of_element_located((By.CSS_SELECTOR, 'div.grid-item:nth-child(1) > button:nth-child(1)'))).click()

44


# 2. Demo of our custom Selenium functions

**NOTE:** If you just want to run the full Selenium code, skip to Section 3 of this notebook.
 
Now that we are logged in and on the library page, click on the 'Exam Papers' link to access the past exam papers and download them.

**Go to the exam papers page**

In [180]:
WebDriverWait(driver, 10).until(EC.visibility_of_element_located((By.CSS_SELECTOR, 'div[data-main-menu-item="ExamPapers"] > a'))).click()

# Move the driver to the page that lists all the exam papers per department
# NOTE: This works fine, but be careful not to re-run this cell: a second run can leave more than two windows open, in which case the num_windows=2 condition below is never met and the wait raises a TimeoutException after 10 seconds
WebDriverWait(driver, 10).until(EC.number_of_windows_to_be(2))
new_window = driver.window_handles[-1]
driver.switch_to.window(new_window)
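
The wait above depends on there being exactly two windows, which is why re-running the cell is fragile. One possible alternative, sketched here under the assumption that the click opens exactly one new window, is to record `driver.window_handles` before the click and switch to whichever handle is new afterwards. `pick_new_window` is a hypothetical helper name.

```python
def pick_new_window(handles_before, handles_after):
    """Return the single handle in `handles_after` that is not in `handles_before`.

    Hypothetical helper: both arguments are lists of window handles, as returned
    by Selenium's `driver.window_handles`.
    """
    seen = set(handles_before)
    new_handles = [h for h in handles_after if h not in seen]
    if len(new_handles) != 1:
        raise RuntimeError(f"Expected exactly one new window, found {len(new_handles)}")
    return new_handles[0]


# Intended use in the notebook (not executed here, since it needs a live driver):
#   before = driver.window_handles
#   ...click the 'Exam Papers' link...
#   WebDriverWait(driver, 10).until(lambda d: len(d.window_handles) > len(before))
#   driver.switch_to.window(pick_new_window(before, driver.window_handles))
```

Because the check is relative to the handles seen before the click, it does not matter how many windows were already open.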

## 2.1 Replicability notes I: demonstration of the `safely_get_elements()` function

To whoever needs to copy and adapt from this template to parse other pages, here's a demonstration of how to use the `safely_get_elements()` function to get the elements you need once the driver is on the page you want to parse.

The CSS Selector below represents each one of the boxes shown in the sub-collections below:

![image](https://github.com/latentnetworks/vimure/assets/896254/2a05478a-934b-43f0-92dc-7141bea85400)


In [39]:
driver.execute_script("document.body.style.zoom='30%'")
department_boxes = safely_get_elements(
    driver=driver,
    is_single_element=False, # Because we're collecting multiple divs set to False
    css_selector=".margin-bottom-small > prm-gallery-collection",  # The CSS selector for the department boxes shown above
    wait_time=60 # Increase the wait time to 60 seconds
)
# This should return 25 elements
print(f"Number of department boxes collected: {len(department_boxes)}")

Number of department boxes collected: 9


## 2.2 Replicability notes II: demonstration of the `safely_click_element()` function

For this demonstration, let's click on the first box in the collection (Accounting):

In [40]:
curr_box = department_boxes[0]

title = curr_box.find_element(By.TAG_NAME, "h3").text
subject_page_url = curr_box.find_element(By.TAG_NAME, "a").get_attribute("href")

print(f"When we run `curr_box.click()`, we will essentially be sending the driver to the following URL:\n{subject_page_url}")

When we run `curr_box.click()`, we will essentially be sending the driver to the following URL:
https://librarysearch.lse.ac.uk/discovery/collectionDiscovery?vid=44LSE_INST%3A44LSE_VU1&collectionId=81235317140002021&lang=en


Let's follow the first box:

In [41]:
curr_box.click()

Now, on this new page we see a list of PDFs. 

Here's how the `safely_get_elements()` function can be used to collect the exam papers listed in the collection:

In [42]:
containers = safely_get_elements(driver, ".is-grid-view > prm-gallery-item")
print(f"Number of PDFs found: {len(containers)}")

Number of PDFs found: 17


After this, we would be able to download the PDFs.

## 2.3 Replicability notes III: demonstration of the `safely_locate_element()` function

Some of the department pages contain multiple pages of past exam papers that are loaded dynamically. In these cases, we need to check if there is a 'Load more items' button and then repeatedly click on it until we have all the exam papers loaded.

The `safely_locate_element()` function can be used to check if the 'Load more items' button is present.

In [43]:
driver.back() # Go back to the previous page

Get the boxes again:

In [125]:
department_boxes = safely_get_elements(
    driver=driver,
    is_single_element=False, # Because we're collecting multiple divs set to False
    css_selector=".margin-bottom-small > prm-gallery-collection",  # The CSS selector for the department boxes shown above
    wait_time=60 # Increase the wait time to 60 seconds
)

Go to the 'Anthropology' page:

In [45]:
curr_box = department_boxes[1] # I know that the second box leads to a page that matches the criteria above

curr_box.click()

We know for a fact that there is a 'Load more items' button at the bottom of the page. Let's use the `safely_locate_element()` function to confirm it's there:

In [61]:
# Instead of a highly specific absolute XPath, let's write a more generic one that matches any <button> element whose text contains "Load more items"
button_xpath = "//button[contains(text(), 'Load more items')]"

# driver.find_element(By.XPATH, button_xpath)
element = safely_locate_element(driver, button_xpath)
element

<selenium.webdriver.remote.webelement.WebElement (session="a6b3c087-ad42-4792-9011-e3e2d458fdd5", element="a48671e7-da8d-4ea4-a9f8-d80dc97fbbe4")>

We got a `WebElement` so it means the button is there. Now we can click on it.

In [62]:
safely_click_element(driver, button_xpath)

We would continue to do this until we have all the exam papers loaded.
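
The idea of clicking until the button disappears can be sketched as a small loop with a safety cap, so that a button that never goes away cannot hang the notebook. `click_until_gone` and `max_clicks` are hypothetical names; the two callables stand in for `safely_locate_element` and `safely_click_element`.

```python
def click_until_gone(locate, click, max_clicks=100):
    """Call `click()` for as long as `locate()` returns something truthy.

    Hypothetical helper. `locate` should return the button (or None when it is
    no longer on the page) and `click` should click it. Returns the number of
    clicks performed, and stops after `max_clicks` as a safety cap.
    """
    clicks = 0
    while clicks < max_clicks and locate():
        click()
        clicks += 1
    return clicks


# Intended use in the notebook (needs a live driver, so it is not executed here):
#   click_until_gone(
#       locate=lambda: safely_locate_element(driver, button_xpath),
#       click=lambda: safely_click_element(driver, button_xpath),
#   )
```

Passing the locate and click steps in as callables keeps the loop independent of Selenium, which also makes it easy to try out with stand-in functions.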

OK, enough with the demos of our custom functions. Let's go back to the main page and download the exam papers.

In [128]:
driver.back()

# 3. Collect exam papers from all departments

Once we're logged in, we can collect the exam papers from all departments. Re-run the code in Section 1 if you need to log in again.

## 3.1 Functions to collect exam papers

Once in a department page, collect metadata about the exam papers:

In [235]:
def scrape_department_exam_papers(driver, box_url, waiting_modifier=1, verbose=False):
    """
    Scrapes exam papers for a given department.

    Args:
        driver: The Selenium WebDriver instance.
        box_url: The URL of the department's exam paper listing page.
        waiting_modifier: A modifier to adjust the waiting time for element location and clicking. 
                          For example, when set to 2, if a function waits for 10 seconds, it will wait for 20 seconds instead. Defaults to 1.
        verbose: A boolean indicating whether to print verbose output.

    Returns:
        A generator that yields a dictionary containing the scraped information for each exam paper.

    """
    driver.get(box_url)

    # Department name (taken from the collection title)
    department_name = safely_locate_element(driver, "//h1[contains(@class, 'collection-title')]", wait_time=waiting_modifier*10).text
    print(f"Scraping exam papers for department: {department_name}")
    
    # URL with listings of exam papers for this department
    department_listing_url = box_url
    
    # Click load more button repeatedly until all exam papers are loaded
    if verbose:
        print("Loading all exam papers...")
    load_more_button_selector = "//button[contains(text(), 'Load more items')]"
    while safely_locate_element(driver, load_more_button_selector, wait_time=waiting_modifier*3):
        safely_click_element(driver, load_more_button_selector, wait_time=waiting_modifier*3)

    # Retrieve the total number of PDFs in this listing
    if verbose:
        print("Counting the number of exam papers...")
    marker_num_items = 'Items in this collection'
    xpath_num_items = f"//h2[./span[contains(text(), '{marker_num_items}')]]"
    count_pdfs = safely_locate_element(driver, xpath_num_items, wait_time=waiting_modifier*10).text
    count_pdfs = int(
        count_pdfs.split(marker_num_items)[1]
        .strip()
        .replace("(", "")
        .replace(")", "")
    )

    if verbose:
        print(f"Number of exam papers found: {count_pdfs}. Looping through each...")
    for i in trange(count_pdfs):
        if verbose:
            print(f"Scraping exam paper {i+1} of {count_pdfs}...")
        containers = safely_get_elements(driver, ".is-grid-view > prm-gallery-item", wait_time=waiting_modifier*0)
        driver.execute_script("arguments[0].scrollIntoView();", containers[i])
        container = containers[i]

        if verbose:
            print("Clicking on the exam paper...")
        driver.execute_script("document.body.style.zoom='30%'")
        course_name = container.find_element(By.CLASS_NAME, "item-title").text
        # FIXME: Find a more human-readable CSS selector for the type
        type = container.find_element(By.CSS_SELECTOR, "div:nth-child(2) > div:nth-child(2) > span:nth-child(1) > span:nth-child(1)").text

        driver.execute_script("window.scrollBy(0, -200);")  # Scroll up by 200 pixels 
        
        container.click()

        if verbose:
            print("Checking for 'Show More' button...")
        #FIXME: Find a more human-readable XPath for the 'Show More' button
        show_more_selector = "/html/body/primo-explore/div[3]/div/md-dialog/md-dialog-content/sticky-scroll/prm-full-view/div/div/div/div/div[1]/div[4]/div/prm-full-view-service-container/div[2]/div/prm-alma-viewit/prm-alma-viewit-items/button"
        if safely_click_element(driver, show_more_selector, wait_time=waiting_modifier*5):
            time.sleep(1)
        else:
            if verbose:
                print("'Show More' button not visible. Continuing...")
        
        if verbose:
            print("Scraping exam paper details...")
        exams = safely_get_elements(driver, "md-list-item.md-3-line", wait_time=waiting_modifier*5)

        # Scroll to the top of the page: this could be the key to resolving the 'Failed to get elements' issue

        if verbose:
            print("Scrolling to the top of the page...")
        # FIXME: Define a more human-readable CSS selector for the element to scroll to
        driver.execute_script("arguments[0].scrollIntoView();", driver.find_element(By.CSS_SELECTOR,'#action_list > div:nth-child(1) > prm-full-view-service-container:nth-child(1) > div:nth-child(1) > prm-service-header:nth-child(1)'))

        if verbose:
            print("Extracting metadata about each year's exam paper...")
        for n in range(1, len(exams) + 1):
            #FIXME: Find a more human-readable CSS selector for the exam paper
            css_selector = f"md-list-item.md-3-line:nth-child({n}) > div:nth-child(1) > div:nth-child(1) > div:nth-child(1) > div:nth-child(2) > a:nth-child(1)"
            exam = safely_get_elements(driver, css_selector, is_single_element=True, wait_time=waiting_modifier*0)

            if exam:
                driver.execute_script("arguments[0].scrollIntoView();", exam)
                exam_year = exam.text
                pdf_link = exam.get_attribute("href")

                yield {
                    "department_name": department_name,
                    "department_listing_url": department_listing_url,
                    "course_name": course_name,
                    "type": type,
                    "exam_year": exam_year,
                    "pdf_link": pdf_link,
                }

        exit_button = safely_get_elements(driver, "button.md-icon-button:nth-child(4)", is_single_element=True, wait_time=waiting_modifier*5)
        if exit_button:
            exit_button.click()
            time.sleep(2)
        else:
            print("Exit button not found. Returning")
            return
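
The function above extracts the item count by splitting the heading text and stripping the parentheses. A regular expression is one alternative. The sketch below assumes the heading reads like `Items in this collection (17)`, which is inferred from the parsing code rather than confirmed against the live page. `parse_item_count` is a hypothetical name.

```python
import re


def parse_item_count(heading_text, marker="Items in this collection"):
    """Extract the integer in parentheses that follows `marker`.

    Hypothetical helper. Raises ValueError when the heading does not match the
    assumed "<marker> (<number>)" shape, instead of failing inside int().
    """
    match = re.search(re.escape(marker) + r"\s*\((\d+)\)", heading_text)
    if match is None:
        raise ValueError(f"Could not find an item count in: {heading_text!r}")
    return int(match.group(1))
```

An explicit `ValueError` that includes the heading text makes it easier to tell a changed page layout apart from a slow page load.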

Collect the exam papers for all departments:

In [230]:
COLLECTION_URL = "https://librarysearch.lse.ac.uk/discovery/collectionDiscovery?vid=44LSE_INST:44LSE_VU1&collectionId=81235317150002021&lang=en"


def scrape_all_departments(driver, 
                           collection_url=COLLECTION_URL, 
                           waiting_modifier=1, 
                           verbose=False):
    
    driver.get(collection_url)
    driver.execute_script("document.body.style.zoom='100%'")

    department_boxes = safely_get_elements(
        driver=driver,
        css_selector=".margin-bottom-small > prm-gallery-collection",
        wait_time=60,  # Increase the wait time to 60 seconds to ensure all department boxes are loaded
    )
    boxes_urls = [box.find_element(By.TAG_NAME, "a").get_attribute("href") for box in department_boxes]

    return [
        list(
            scrape_department_exam_papers(
                driver,
                box_url,
                waiting_modifier=waiting_modifier,
                verbose=verbose,
            )
        )
        for box_url in boxes_urls
    ]

## 3.2 Run the scraper

Tip: increase the `waiting_modifier` to make the scraper wait longer for the page to load. This is useful if you have a slow internet connection.

In [None]:
scrape_all_departments(driver, waiting_modifier=0.5)
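
`scrape_all_departments` returns one list of records per department, so the result needs flattening before it can be saved as a table. A minimal sketch of writing it to CSV with the standard library follows. `save_records_to_csv` is a hypothetical helper, and the column names are copied from the dictionaries yielded by `scrape_department_exam_papers`.

```python
import csv
from itertools import chain

# Column names copied from the dictionaries yielded by scrape_department_exam_papers
FIELDNAMES = [
    "department_name",
    "department_listing_url",
    "course_name",
    "type",
    "exam_year",
    "pdf_link",
]


def save_records_to_csv(records_per_department, path):
    """Flatten a list of per-department record lists and write them to a CSV file.

    Hypothetical helper. Returns the number of rows written.
    """
    rows = list(chain.from_iterable(records_per_department))
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDNAMES)
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```

For example, `save_records_to_csv(scrape_all_departments(driver), "exam_papers.csv")`, where the file name is only an illustration.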

**REMAINING TODO:**

- [ ] Test that Department of Law is working with the new version of the code
- [ ] Reintroduce the retry mechanism to the scraping functions
- [ ] Finish refactoring: add the missing 'Back' button functionality
- [ ] Finish refactoring: allow scraper to continue from where it left off if it crashes
- [ ] Finish refactoring: send output to a file
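
For the item about continuing after a crash, one possible approach is to keep a small checkpoint file listing the department URLs that have already been scraped, and to skip those on the next run. The sketch below is only an illustration under that assumption: `load_done`, `mark_done` and `pending` are hypothetical names, and the JSON checkpoint format is a choice made here, not something the project already uses.

```python
import json
import os


def load_done(path):
    """Return the set of department URLs already scraped (empty if no checkpoint yet)."""
    if not os.path.exists(path):
        return set()
    with open(path, encoding="utf-8") as f:
        return set(json.load(f))


def mark_done(path, url):
    """Add `url` to the checkpoint file at `path`."""
    done = load_done(path)
    done.add(url)
    with open(path, "w", encoding="utf-8") as f:
        json.dump(sorted(done), f)


def pending(urls, path):
    """Return the URLs from `urls` that are not yet in the checkpoint, in order."""
    done = load_done(path)
    return [u for u in urls if u not in done]
```

In `scrape_all_departments`, the loop could then iterate over `pending(boxes_urls, checkpoint_path)` and call `mark_done` after each department finishes, where `checkpoint_path` is again a hypothetical name.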