# 1. Data Sourcing

To address our research question, namely how undergraduate students at LSE might strategically select courses and degrees that offer relatively easier pathways to high grades, we first need to acquire a range of data from LSE's online resources, most of it publicly available. This initial phase involves identifying relevant data sources, scraping and collecting the data, cleaning the datasets of faulty and irrelevant information, and structuring them for downstream processing and analysis.

The data sourcing process centred on two core datasets: (1) degree programme data, including recommended modules and application statistics, and (2) grade distributions by module, obtained from departmental PDF reports covering five years. These were supplemented by module metadata from the course guide. The datasets were collected through targeted web scraping and manual downloads, given the limitations of unstructured or semi-structured web formats.


### Data of interest
We specifically focused on the following data elements:
1. **Undergraduate Degrees 2024/25:**
    * Degree Names (of all available programmes)
    * Module Information
        * Mandatory Modules (per year)
        * Outside Module Options
    * Extra Information
        * A-level requirements
        * Application stats (volume, intake & acceptance rate)
        * Tuition fees
        * Median salary post-graduation
<br>
<br>
2. **Undergraduate Modules 2024/25:**
    * Module Codes (of all available courses)
    * Grade distributions
        * Grade summary statistics (mean, median, standard deviation, min, max, quartiles)
        * Classified grade distributions (number of 1sts, 2:1s, 2:2s, 3rds, fails)
    * Module Selection Criteria
        * Prerequisites for courses
        * Mutually exclusive courses
    * Extra Information
        * Number of enrolled students
        * Average class sizes
        * Enrolment caps (if applicable)
        * Units of courses
        * Responsible Departments

### Data sources
This data can be found on the following LSE websites:
1. **Degree Information and Application Statistics**
<br> *URL: https://www.lse.ac.uk/Programmes/Search-courses*
<br> (22 pages of degree programmes with individual programme pages containing module, entry, and application data)

2. **Module Grade Distributions**
<br> *URL: https://info.lse.ac.uk/staff/divisions/academic-registrars-division/systems/what-we-do/internal/degree-and-course-results*
<br> (includes departmental PDF files with annual module-grade data from 2019 - 2024)

3. **Course Guide and Module Metadata:** 
<br> *URL: https://www.lse.ac.uk/resources/calendar2024-2025/courseGuides/undergraduate.htm*
<br> (Contains course unit values, prerequisites, exclusions, departments, and descriptions)


## 1.1. Degree Data Scraping

The first phase of data acquisition involved extracting relevant degree programme information from the LSE Degree Search platform. This site lists all available undergraduate programmes for the 2024/25 academic year across 22 paginated results. Each programme contains a detail page with structured data including:

* Programme title and UCAS code
* Recommended modules (year-wise, compulsory and optional)
* A-level entry requirements
* Application statistics (number of applicants, number of offers, and intake)
* Tuition fees (Home and International)
* Career outcomes (median salary post-graduation, if listed)
---

### 1.1.1. Degree webpage URL scraping

We first need to scrape the hyperlinks to all undergraduate degrees listed on the LSE Degree Search platform. To cover all 22 paginated result pages of the degree catalogue, we append the page index to the *Search-courses* URL and repeat the link extraction for every results page. Within this loop, the code uses requests to fetch each page's HTML content and BeautifulSoup to parse it and extract all anchor tags. Only links that correspond to undergraduate programme pages, identified by their URL pattern, are selected and converted into full URLs. These links are stored in the programme_links list for data extraction in later steps.

In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import re

base_url = 'https://www.lse.ac.uk'
courses_url = 'https://www.lse.ac.uk/Programmes/Search-courses?pageIndex='

# Getting links to the pages of all undergraduate programmes
programme_links = []

for page in range(1, 23):
    print(f'Scraping page {page}/22...', end='\r', flush=True)
    url = f"{courses_url}{page}"
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    
    all_links = soup.find_all('a', href=True)
    
    for link in all_links:
        href = link['href']
        if href.startswith('/study-at-lse/undergraduate/'):
            full_url = base_url + href
            programme_links.append(full_url)

print('\n'+ f'Total undergraduate programmes found: {len(programme_links)}')

Scraping page 22/22...
Total undergraduate programmes found: 42
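
The loop above appends every matching anchor, so a programme that is linked more than once on a results page would appear several times in programme_links. Whether this happens on the LSE pages has not been checked here; the following is only a minimal sketch of removing repeats while keeping the original order. The helper name `dedupe_links` and the sample URLs are hypothetical.

```python
def dedupe_links(links):
    """Return the links with duplicates removed, keeping first-seen order."""
    # dict keys are unique and preserve insertion order (Python 3.7+)
    return list(dict.fromkeys(links))


# Hypothetical sample list containing one repeated link
sample_links = [
    'https://www.lse.ac.uk/study-at-lse/undergraduate/ba-geography',
    'https://www.lse.ac.uk/study-at-lse/undergraduate/ba-history',
    'https://www.lse.ac.uk/study-at-lse/undergraduate/ba-geography',
]
unique_links = dedupe_links(sample_links)
print(f'{len(sample_links)} links collected, {len(unique_links)} unique')
```

Comparing the two counts is a quick way to see whether the reported total includes repeats.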


### 1.1.2. Degree data scraping - Function

With the list of programme URLs collected in the previous step, the next step is to extract structured information from each individual degree page. To achieve this, we define a function called scrape_programme_data, which takes a single URL as input and returns a dictionary of the degree’s attributes.

Before writing the function, it was necessary to manually inspect several programme webpages and their underlying HTML structures using developer tools and by examining the parsed output from BeautifulSoup. This allowed us to locate where elements of interest are stored within the HTML tree and identify consistent CSS selectors or tags we could use to extract the relevant data.

The targeted elements are:
* Degree name and department: Retrieved from the page’s *h1 > span* heading.
* Course structure per academic year: Modules are listed in div blocks identified by unique *#year-x* IDs. These were iterated over to capture the course codes year-by-year.
* Entry requirements: A-level requirements are stored under a specific element with the ID *#alevels*, and typically follow a paragraph structure.
* Application statistics: These appear as stylized bullet-point figures under the *"Your Application"* section and include number of applications, intake, and acceptance ratio.
* Cost information: Undergraduate home fees are usually listed in a paragraph block under the fees section and were extracted using a regular expression to capture pound-amounts *(£)*. Due to inconsistencies in the storage of overseas fees for various degrees, we opted to exclude this information from our dataset.
* Median graduate salary: If available, this is shown in the *"Graduate Destinations"* section.

The result of each function call is a dictionary containing all scraped data points, which can be appended to a list or converted into a DataFrame for further cleaning and analysis.

*Note: since LSE webpages occasionally returned temporary server errors (status code 500) during scraping, we added a status code check at the start of the function to skip any pages that could not be accessed successfully. This lets the scraping process continue without interruption.*

*A further important observation: while the module lists are meant to include only the mandatory courses for each year, they sometimes also include optional courses that a degree's webpage lists as official course recommendations. Because the pages store such recommended options inconsistently, we later update these lists manually so that they reflect only the mandatory courses for each degree.*

In [2]:
# Function for extracting data of interest from websites
def scrape_programme_data(url):
    res = requests.get(url)
    if res.status_code != 200: # Needed due to temporary 500 errors occurring when loading websites
        print(f"⚠️ Skipping {url} — status code {res.status_code}")
        return None
    
    soup = BeautifulSoup(res.text, 'html.parser')
    data = {}

    # Degree
    course = soup.select_one('h1 > span').get_text(strip=True)
    data['degree'] = course

    # A-level requirement
    alevel_elem = soup.select_one('#alevels > div > p')
    alevel_text = alevel_elem.get_text(strip=True).split(maxsplit=1)
    data['a_lvl_req'] = alevel_text[0].strip(',')
    if len(alevel_text) > 1: data['a_lvl_extra'] = alevel_text[1]
    else: data['a_lvl_extra'] = None

    # Modules (looping through years 1 to 4; a list stays empty if the year is not on the page)
    for year in range(1, 5):
        modules = soup.select(f'#year-{year} div.code')
        data[f'modules_y{year}'] = [module.get_text(strip=True) for module in modules]
        
    # Applications statistics
    nr_apps = soup.select_one("#your-application__overview .block--applications .stats")
    if nr_apps: data['nr_applications'] = nr_apps.get_text(strip=True)
    else: data['nr_applications'] = None
        
    intake = soup.select_one("#your-application__overview .block--places .stats")
    if intake: data['intake'] = intake.get_text(strip=True)
    else: data['intake'] = None
        
    ratio = soup.select_one("#your-application__overview .block--ratio .stats")
    if ratio: data['ratio'] = ratio.get_text(strip=True)
    else: data['ratio'] = None

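    # Fees (home fee only; None if the fees block or a pound amount is missing)
    home_fee_elem = soup.select_one('#fees-and-funding__home p')
    home_fee_match = re.search(r'£[\d,]+', home_fee_elem.get_text(strip=True)) if home_fee_elem else None
    data['home_fee'] = home_fee_match.group() if home_fee_match else None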

    # Median Salary
    salary = soup.select_one('#graduate-destinations__overview .salary')
    if salary: data['median_salary'] = salary.get_text(strip=True)
    else: data['median_salary'] = None

    return data
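
The function above skips a page as soon as it returns a non-200 status. Since the errors were described as temporary, one alternative would be to retry a few times before giving up. The following is a minimal sketch under that assumption, not the method used in this notebook: `fetch_with_retry`, its parameters, and the waiting scheme are all hypothetical. The HTTP call is passed in as `get` so that the helper can be tried without network access.

```python
import time


def fetch_with_retry(url, get, retries=3, delay=1.0):
    """Call get(url) up to `retries` times.

    Returns the first response whose status code is 200,
    or None if every attempt fails.
    """
    for attempt in range(1, retries + 1):
        response = get(url)
        if response.status_code == 200:
            return response
        if attempt < retries:
            # Waiting a little longer after each failed attempt
            time.sleep(delay * attempt)
    return None
```

Inside scrape_programme_data this could be called as `fetch_with_retry(url, requests.get)`, with a `None` result treated the same way as the current skip.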

### 1.1.3. Degree data scraping - Application

We can now loop through each undergraduate degree URL and apply our previously defined scraping function to extract relevant data. Successfully scraped data is stored in a list, while any pages that failed to load (due to temporary server errors) are skipped and counted. This gives us a complete, structured dataset from the available programme pages.

In [3]:
# Applying function on all websites
degrees_data = []
skipped_urls = 0

for i, url in enumerate(programme_links):
    print(f"Scraping {i+1}/{len(programme_links)}: {url}", end='\r', flush=True)
    info = scrape_programme_data(url)
    if info is None:
        skipped_urls += 1
    else:
        degrees_data.append(info)

print('\nData scraping complete\n')
if skipped_urls >= 1:
    print(f'{skipped_urls} programmes skipped in data extraction, due to website loading error (500).')

Scraping 42/42: https://www.lse.ac.uk/study-at-lse/undergraduate/llb-bachelor-of-laws
Data scraping complete



### 1.1.4. Cleaning & Structuring Data

The scraping loop already skips pages that failed to load, so the list should contain no *None* entries; as a safeguard, we filter them out anyway. We then convert the cleaned list of dictionaries into a pandas DataFrame. This tabular format allows for easier inspection, manipulation, and analysis of the degree data moving forward.

In [4]:
# Cleaning data from non-responsive websites & converting to Dataframe
degrees_data_clean = [d for d in degrees_data if d is not None]
degrees_df = pd.DataFrame(degrees_data_clean)
degrees_df

|  | degree | a_lvl_req | a_lvl_extra | modules_y1 | modules_y2 | modules_y3 | modules_y4 | nr_applications | intake | ratio | home_fee | median_salary |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | BA Anthropology and Law | AAB |  | [LL141, AN100, AN101, LL142, LL108, LL100, LL1... | [AN253, AN379, LL106, LL143, LL200] | [LL276] | [] | 250.0 | 20.0 | 13:1 | £9,535 | £34,500 |
| 1 | BA Geography | AAA |  | [GY100, GY140, GY144, LSE100] | [GY245, GY246, GY212, GY204, GY206, GY207] | [GY350] | [] | 377.0 | 38.0 | 10:1 | £9,535 | £35,000 |
| 2 | BA History | AAA |  | [HY120, LSE100, EH101, HY113, HY116, HY118] | [] | [HY300] | [] | 503.0 | 58.0 | 9:1 | £9,535 | £35,000 |
| 3 | BA Social Anthropology | AAB |  | [AN100, AN101, AN102, LSE100] | [AN286, AN253, AN256, AN273, AN285, AN287, AN2... | [AN397] | [] | 232.0 | 30.0 | 8:1 | £9,535 | £34,500 |
| 4 | BSc Accounting and Finance | AAA | with A in Mathematics | [LSE100, AC105, AC106, ST107, FM101, EC1A3, EC... | [AC205, AC206, FM210, FM211, FM214, FM215, EC2... | [AC331, AC311, FM310, FM311] | [] | 2283.0 | 140.0 | 16:1 | £9,535 | £35,000 |
| 5 | BSc Actuarial Science | A*AA | with an A* in Mathematics | [ST102, MA100, EC1A3, EC1B3, LSE100] | [ST206, ST216, MA221, MA222, ST226, ST227] | [ST302, ST301] | [] | 615.0 | 68.0 | 9:1 | £9,535 | £36,500 |
| 6 | BSc Data Science | A*AA | with an A* in Mathematics | [ST102, MA100, ST101, ST115, LSE100, AC102, AC... | [ST206, ST211, ST207, MA214, MA222, MA102, MA2... | [ST310, ST311, ST312, ST300, ST301, ST302, ST3... | [] | 633.0 | 32.0 | 20:1 | £9,535 | £36,500 |
| 7 | BSc Econometrics and Mathematical Economics | A*AA | with an A* in Mathematics | [EC1P1, MA108, ST109, EC1A1, EC1B1, EC1C1, LSE... | [EC2A1, EC2B1, EC2C1, EH238, FM214, FM215] | [EC319, EC333, EC336, EC337, EC311] | [] |  |  |  | £9,535 | £55,000 |
| 8 | BSc Economic History | AAA | including Economics or History | [EH101, EC1A5, EC1B5, EH102, LSE100] | [EH237] | [EH390] | [] | 267.0 | 29.0 | 9:1 | £9,535 | £35,000 |
| 9 | BSc Economic History and Geography | AAB | including Economics or History | [EH101, GY100, GY140, LSE100, EC1A3, EC1A5, EC... | [GY209, GY210, EH237] | [GY313, GY314, EH308, EH390] | [] | 175.0 | 6.0 | 29:1 | £9,535 | £35,000 |
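
Several of the columns shown above, such as home_fee, median_salary, and ratio, hold formatted strings like '£9,535' or '13:1' rather than numbers. The following is a minimal sketch of how they could be converted for later analysis. The helper names `parse_pounds` and `parse_ratio` are hypothetical, and the sketch assumes the ratio always has the form 'N:1'.

```python
import re


def parse_pounds(text):
    """'£9,535' -> 9535; returns None if no pound amount is found."""
    if not text:
        return None
    match = re.search(r'£(\d[\d,]*)', text)
    return int(match.group(1).replace(',', '')) if match else None


def parse_ratio(text):
    """'13:1' -> 13.0 (applications per place); None if not in 'N:1' form."""
    if not text:
        return None
    match = re.fullmatch(r'\s*(\d+(?:\.\d+)?)\s*:\s*1\s*', text)
    return float(match.group(1)) if match else None


print(parse_pounds('£9,535'), parse_ratio('13:1'))
```

Applied to a DataFrame, these could be used column by column, for example `degrees_df['home_fee'].map(parse_pounds)`.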


### 1.1.5. Exporting the Data

To preserve our cleaned dataset and enable easy reuse in future analysis steps, we save the DataFrame as a CSV file in our project directory. This allows us to avoid re-scraping the web every time we need the data.

In [5]:
# Saving as CSV file
degrees_df.to_csv('data/degrees/programme_data.csv', index=False)
print('Data has been saved as a CSV file')

Data has been saved as a CSV file


This CSV file now forms a foundational part of our analysis, offering essential context for identifying potential patterns in course difficulty, grade distributions, and academic outcomes across different degree programmes.
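
One point to keep in mind when reloading this file: the module columns hold Python lists, and a CSV file stores them as plain text such as "['GY100', 'GY140']". The following is a minimal sketch of turning such a cell back into a list with the standard library's `ast.literal_eval`. The helper name `parse_list_cell` is hypothetical.

```python
import ast


def parse_list_cell(cell):
    """Turn a CSV cell like "['GY100', 'GY140']" back into a Python list."""
    if not isinstance(cell, str) or not cell.strip():
        return []
    value = ast.literal_eval(cell)  # safely evaluates Python literals only
    return list(value) if isinstance(value, (list, tuple)) else []


print(parse_list_cell("['GY100', 'GY140', 'GY144', 'LSE100']"))
```

With pandas, the helper could be passed through the `converters` argument of `pd.read_csv`, for example `converters={'modules_y1': parse_list_cell}`, so that the column is restored while the file is read.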

## 1.2. Module Grade Distribution Scraping

To complement our degree-level data and gain more insight into what makes a course “easy” or high-scoring, we now turn to the course-level grade distributions. These statistics are published annually by each department at LSE as PDF documents, which break down student performance in each undergraduate module. They typically show summary statistics such as mean and median marks, as well as classified results (e.g., 1sts, 2:1s, etc.), for the last five years.

Unfortunately, the LSE webpage where these PDFs are hosted is behind a login portal that requires student credentials. Since it is not publicly accessible and protected against automated scraping, we manually downloaded the full set of course results PDFs across all departments and stored them locally in the *data/modules* folder. This ensures we can still extract and analyze the grade data programmatically.

The goal of this section is to loop through each of these PDFs, parse out the relevant statistics for each undergraduate module, and clean them into a structured format suitable for analysis.

---

### 1.2.1. Identifying Department PDFs
We start by listing all PDF files in our folder, which gives us the set of documents from which the module-level data is extracted.

In [24]:
import pdfplumber
import os
import contextlib
import io

# Identifying all PDFs
pdf_folder = 'data/modules'
pdf_files = [f for f in os.listdir(pdf_folder) if f.endswith('.pdf')]
total_files = len(pdf_files)

### 1.2.2. Extracting text from PDF

Unlike scraping data from HTML websites — where structured elements like tags, classes, and IDs help us pinpoint exactly where data is stored — the structure of PDFs is often less consistent and not inherently designed for data extraction. As such, we needed to use the specialized Python library *pdfplumber*. This allows us to read and parse text content from PDF files while preserving the layout and line structure of the original documents.

The following function reads every page of a given PDF and concatenates all extracted text into a single string, preparing it for further pattern-based filtering and analysis.

*Note: at times the parsing encountered minor layout or font issues, resulting in persistent warning messages even though the output was fine. To keep our output clean and readable, we imported the contextlib and io libraries, which let us suppress standard error messages during the PDF processing step. This workaround lets us extract the text content reliably while ignoring non-critical warnings that would otherwise clutter the output.*

In [8]:
# Function to extract text from PDF using pdfplumber
def extract_pdf_text(pdf_path):
    all_text = ''
    with contextlib.redirect_stderr(io.StringIO()): # Use of AI to avoid warning messages
        with pdfplumber.open(pdf_path) as pdf:
            for page in pdf.pages:
                page_text = page.extract_text()
                if page_text:
                    all_text += page_text + '\n'
    return all_text

### 1.2.3. Identifying Departments

To begin parsing the content of each module PDF, we first needed to identify which department the module data belongs to. To do this, we created a function that scans the extracted PDF text line by line, searching for a pattern that consistently appears across files:
"*Department (XY) course results*".

We used the *re* library, with AI assistance, to write a pattern that looks for a department name followed by a short code in parentheses, so that it matches this format across departments.

We also used the match object's *.group()* method to extract the matching parts of the string (i.e. the department name and code). This method was something we discovered and learned to apply through AI assistance.

The final result is a tuple containing the department code and department name, which we’ll use to label and organize the extracted data correctly.

In [9]:
# Function to extract department name & code
def extract_department_code(text):
    # Defaults returned when no line matches, so the return statement never hits unbound names
    department_code, department_name = None, None

    for line in text.split('\n'):
        match = re.search(r'([A-Za-z]+) \(([A-Za-z]{2,4})\) course results', line) # Use of AI to generate generic re code that identifies string
        if match:
            department_name = match.group(1) # Use of AI to learn about .group() function
            department_code = match.group(2)
            break
    return department_code, department_name
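
To illustrate what the pattern above captures, the following sketch applies it to two header lines. Both lines are hypothetical and only follow the format described in this section. Note that `[A-Za-z]+` matches a single word, so for a department name with more than one word only the last word is captured. The second pattern, `VARIANT`, is an assumption of ours and not part of the original function: it captures everything from the start of the line up to the code.

```python
import re

# Pattern used in extract_department_code above
ORIGINAL = r'([A-Za-z]+) \(([A-Za-z]{2,4})\) course results'
# Hypothetical variant: capture the whole text before the code
VARIANT = r'^(.+?) \(([A-Za-z]{2,4})\) course results'


def match_department(pattern, line):
    """Return (code, name) if the pattern matches the line, else None."""
    match = re.search(pattern, line)
    return (match.group(2), match.group(1)) if match else None


# Hypothetical header lines
print(match_department(ORIGINAL, 'Accounting (AC) course results'))
# → ('AC', 'Accounting')
print(match_department(ORIGINAL, 'Economic History (EH) course results'))
# → ('EH', 'History')
print(match_department(VARIANT, 'Economic History (EH) course results'))
# → ('EH', 'Economic History')
```

The variant would also pick up any other text that precedes the department name on the same line, so it would need to be checked against the actual PDFs before use.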

### 1.2.4. Collecting Grade Distribution Statistics

Next, we focus on parsing the actual Marksummary tables from each PDF. These tables contain statistics on student performance for individual courses and academic years: the mean, median, and standard deviation of marks, the minimum and maximum marks, and percentile values. This data allows us to assess performance patterns across departments and over time and is the key part of our module-level analysis.

**Approach to Extraction**

We extract the data by defining a function that uses a while loop to parse through each PDF line-by-line using an index variable i, looking for the start of tables marked by the consistent header string *'Year marks mean sd'*. Once the header is found, we:
* store the column names and map them to their respective indices
* check for a course code, often found a few lines below the table, formatted like *AB123:Marksummary*
* loop through each row, parsing values and adding them to a structured list mark_data (if the row is complete and aligns with the header)
* keep track of excluded rows in excluded_data, especially those that are either empty or misaligned.

This setup allows us to extract data even when PDFs contain multiple tables or have slightly inconsistent formatting.


**Challenges Encountered and Fixes**

One major issue was incomplete or misaligned table rows. These typically arise when the original table contains blank cells, which are common where a value is 0 (e.g., no student received a fail grade). When the text is extracted with pdfplumber, a blank cell is not read as 0 but skipped altogether. The remaining values in the row shift left, so they no longer line up with the headers, and the row contains fewer entries than the header.

This issue was especially common in the second table (called *'Gradesummary'*) that follows the Marksummary for each course and contains the degree classification frequencies (e.g., the number and percentage of students who received a 1st, 2:1, 2:2, or fail). Despite extensive attempts, drawing on both online research and AI suggestions, we were not able to parse this table reliably, because the position of the missing values within a row is unpredictable.

We therefore decided to exclude the second table from our dataset. This is unfortunate, since classification frequencies offer valuable and arguably more interesting insight, but it was necessary to maintain the integrity of our dataset. The Marksummary statistics, by contrast, are consistently populated (statistical summaries such as mean, median, and standard deviation always yield a number when a course has marks) and hence were extracted successfully in most cases. Keeping only rows whose number of values matches the length of the header guards against misaligned rows entering the dataset.
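
The row-length check described above can be illustrated in isolation. The following is a minimal sketch: the column order in `HEADER` is assumed from the fields parsed in this section, the first sample row uses values that appear in the final dataset, and the other two rows are hypothetical. The helper name `classify_row` is also hypothetical.

```python
# Assumed column order of a Marksummary table
HEADER = 'Year marks mean sd min q10 q25 median IQR q75 q90 q95 max'.split()


def classify_row(line, header=HEADER):
    """Label a table row by the number of whitespace-separated values it has."""
    values = line.split()
    if not values:
        return 'blank line'
    if len(values) == len(header):
        return 'complete'
    if len(values) == 1:
        return 'no data in year'   # only the year label is present
    return 'incomplete data'       # a blank cell shifted the row to the left


full_row = '2019/20 116 76.9 9.2 45.0 65.0 72.8 79.0 10.5 83.2 86.0 87.0 91.0'
empty_year = '2018/19'  # hypothetical row for a year without marks
shifted_row = '2019/20 116 76.9 9.2 45.0 65.0 72.8 79.0 10.5 83.2 86.0 87.0'  # hypothetical, one cell lost

for row in (full_row, empty_year, shifted_row):
    print(classify_row(row))
```

Because a shifted row still contains plausible numbers, the length comparison is the only signal available at this stage; it cannot tell which cell was lost.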

In [10]:
# Function to extract Marksummary tables from text
def extract_marksummary(text, department_code, department_name):
    lines = text.split('\n')
    mark_data = []
    excluded_data = []

    i = 0
    while i < len(lines):
        line = lines[i].strip()

        # Identifying table headers
        if line.startswith('Year marks mean sd'):
            header = line.split()
            pos = {col: idx for idx, col in enumerate(header)}

            # Identifying course code (at bottom of table)
            course = department_code # Setting department code as default
            for k in range(1, 7):
                if i + k < len(lines):
                    match = re.search(r'([A-Z0-9]+):Marksummary', lines[i + k])
                    if match:
                        course = match.group(1)
                        break

            # Moving to first data row (skipping header)
            i += 1
            course_data = []
            skipped_rows = []
            
            while i < len(lines):
                line = lines[i].strip()
                # Break when encountering table title (at bottom of each table)
                if re.match(r'([A-Z0-9]+):Marksummary', line) or line.startswith('MarksbyYear'):
                    break

                # Parsing data
                if line:
                    values = line.split()
                    if len(values) == len(pos): # cleaning data from incomplete and misaligned rows due to missing values
                        course_data.append({
                            'department': department_name,
                            'code': course,
                            'year': values[pos['Year']],
                            'marks': int(values[pos['marks']]),
                            'mean': float(values[pos['mean']]),
                            'sd': float(values[pos['sd']]),
                            'min': float(values[pos['min']]),
                            'q10': float(values[pos['q10']]),
                            'q25': float(values[pos['q25']]),
                            'median': float(values[pos['median']]),
                            'IQR': float(values[pos['IQR']]),
                            'q75': float(values[pos['q75']]),
                            'q90': float(values[pos['q90']]),
                            'q95': float(values[pos['q95']]),
                            'max': float(values[pos['max']])
                        })
                        
                    elif len(values) == 1: # Separating excluded rows into empty rows and incomplete rows
                        skipped_rows.append({'course': course, 'year': values[pos['Year']], 'reason': 'no data in year'})
                        
                    else:
                        skipped_rows.append({'course': course, 'year': values[pos['Year']], 'reason': 'incomplete data'})
                
                            
                i += 1  # Moving to next line

            mark_data.extend(course_data)
            excluded_data.extend(skipped_rows)

        else:
            i += 1  # Moving to next line

    return mark_data, excluded_data

### 1.2.5. PDF Processing & Application

Finally, we need to bring together all previously defined steps and apply them to every PDF in the specified folder with the following function. For each file, the mark summary statistics are parsed from the text, and valid rows are appended to a main data list, while excluded or malformed rows are collected separately. Lastly, both datasets are returned as DataFrames, ready for subsequent processing and analysis.

In [11]:
# Function to scrape all PDFs in the folder
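def process_pdfs(pdf_folder):
    all_data = []
    all_excl_data = []
    # Listing the PDFs inside the function so it does not depend on global variables
    pdf_files = [f for f in os.listdir(pdf_folder) if f.endswith('.pdf')]
    total_files = len(pdf_files)
    for i, pdf_file in enumerate(pdf_files, 1):
        pdf_path = os.path.join(pdf_folder, pdf_file)
        print(f'Processing ({i}/{total_files}): {pdf_file}...', end='\r', flush=True)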

        # Extracting text from the PDF
        text = extract_pdf_text(pdf_path)
        
        # Extracting department code and name
        department_code, department_name = extract_department_code(text)

        # Extracting mark summary data
        mark_data, excluded_data = extract_marksummary(text, department_code, department_name)
        
        # Appending data
        all_data.extend(mark_data)
        all_excl_data.extend(excluded_data)

    # Converting data
    df = pd.DataFrame(all_data)
    df_excl = pd.DataFrame(all_excl_data)
    
    return df, df_excl

### 1.2.6. Sorting Dataframe & Identifying Excluded Rows

Once all PDFs have been processed, the resulting data is sorted by course code and academic year for easier readability and analysis. The DataFrame index is reset to ensure consistency after sorting. In addition, we report how many rows were excluded due to either being completely empty (often corresponding to years before a course was introduced) or misaligned (typically caused by missing values within a table row).

In [12]:
# Scraping all PDFs & sorting data
df, df_excl = process_pdfs(pdf_folder)
df = df.sort_values(by=['code', 'year'], ascending=[True, True])
df.reset_index(drop=True, inplace=True)

empty_rows = len(df_excl[df_excl['reason'] == 'no data in year'])
misaligned_rows = len(df_excl[df_excl['reason'] == 'incomplete data'])

print('\n'+f'Data scraping complete, {len(df)} rows of data extracted.'+'\n')
print(f'{empty_rows} rows deleted due to empty rows for years prior to introduction of new modules.')
print(f'{misaligned_rows} rows deleted due to missing values resulting in misalignment.')

Processing (20/20): DS-results-2023-24-All-Sittings.pdf...
Data scraping complete, 1934 rows of data extracted.

516 rows deleted due to empty rows for years prior to introduction of new modules.
49 rows deleted due to missing values resulting in misalignment.


### 1.2.7. Saving Separated DataFrames

In the final step of the module data scraping process, we distinguish between department-level and individual module-level data based on the length of the course code, creating two separate DataFrames for clearer organization and analysis. Each is then saved as a CSV file in the appropriate directory, ensuring our data is both accessible and structured for the next phase of the project.

In [25]:
# Separating modules and department data
departments_df = df[df['code'].str.len() == 2].reset_index(drop=True)

modules_df = df[df['code'].str.len() > 2].reset_index(drop=True)

# Saving DataFrames to CSV files
modules_df.to_csv("data/modules/marks_summary_modules.csv", index=False)
departments_df.to_csv("data/departments/marks_summary_departments.csv", index=False)

print('DataFrames separated and saved as CSV files')
modules_df

DataFrames separated and saved as CSV files


|  | department | code | year | marks | mean | sd | min | q10 | q25 | median | IQR | q75 | q90 | q95 | max |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Accounting | AC100 | 2019/20 | 116 | 76.9 | 9.2 | 45.0 | 65.0 | 72.8 | 79.0 | 10.5 | 83.2 | 86.0 | 87.0 | 91.0 |
| 1 | Accounting | AC100 | 2020/21 | 145 | 67.4 | 10.4 | 32.0 | 54.0 | 61.0 | 69.0 | 14.0 | 75.0 | 79.0 | 81.0 | 85.0 |
| 2 | Accounting | AC100 | 2021/22 | 114 | 65.8 | 15.6 | 0.0 | 47.3 | 58.0 | 68.0 | 19.0 | 77.0 | 81.7 | 85.7 | 88.0 |
| 3 | Accounting | AC100 | 2022/23 | 112 | 63.9 | 15.7 | 0.0 | 48.0 | 55.0 | 66.0 | 19.2 | 74.2 | 80.0 | 83.4 | 90.0 |
| 4 | Accounting | AC102 | 2019/20 | 524 | 86.8 | 10.3 | 35.0 | 76.0 | 83.0 | 90.0 | 11.0 | 94.0 | 96.0 | 97.0 | 99.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1836 | Statistics | ST330 | 2019/20 | 69 | 72.0 | 11.8 | 33.0 | 56.0 | 64.0 | 74.0 | 17.0 | 81.0 | 86.2 | 89.6 | 92.0 |
| 1837 | Statistics | ST330 | 2020/21 | 70 | 63.1 | 16.4 | 7.0 | 40.9 | 56.2 | 65.5 | 18.8 | 75.0 | 82.0 | 83.5 | 90.0 |
| 1838 | Statistics | ST330 | 2021/22 | 65 | 60.8 | 15.7 | 0.0 | 41.6 | 51.0 | 63.0 | 20.0 | 71.0 | 79.4 | 83.0 | 91.0 |
| 1839 | Statistics | ST330 | 2022/23 | 64 | 56.4 | 20.7 | 0.0 | 28.0 | 43.8 | 56.0 | 27.2 | 71.0 | 82.7 | 85.8 | 89.0 |


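## 1.3. Module Extra Info Scraping

The third source, the 2024/25 undergraduate course guide, supplies module metadata: department, total students, average class size, enrolment caps, unit value, and prerequisites. We first collect the link to every course guide page and then parse the key facts and prerequisites from each page.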


In [15]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np

base_url = 'https://www.lse.ac.uk/resources/calendar2024-2025/courseGuides'
guide_url = f'{base_url}/undergraduate.htm'

In [16]:
# Getting all links to course guides
response = requests.get(guide_url)
soup = BeautifulSoup(response.content, 'html.parser')

# Finding all tables (each course is stored in departments table)
tables = soup.find_all('table')

course_links = []

for table in tables:
    for a_tag in table.find_all('a', href=True):
        href = a_tag['href']
        if href.startswith('../courseGuides/'):
            full_url = base_url + href.split('../courseGuides')[1]
            course_links.append(full_url)

print(f'Found {len(course_links)} course guide links.')

Found 568 course guide links.
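
The cell above builds each full URL by splitting the relative link on '../courseGuides'. As a hedged alternative, the standard library's `urllib.parse.urljoin` resolves relative links against the page they were found on. The relative link below is an assumed example in the style matched by the loop, chosen so that it resolves to a course guide URL of the kind the scraper visits.

```python
from urllib.parse import urljoin

guide_url = 'https://www.lse.ac.uk/resources/calendar2024-2025/courseGuides/undergraduate.htm'

# Assumed example of a relative link as it might appear in the guide's tables
href = '../courseGuides/ST/2024_ST360.htm'

# urljoin resolves the '..' segment relative to the directory of guide_url
full_url = urljoin(guide_url, href)
print(full_url)
# → https://www.lse.ac.uk/resources/calendar2024-2025/courseGuides/ST/2024_ST360.htm
```

This avoids depending on the exact prefix of the relative link, which would matter if some links were written differently.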


In [17]:
def extract_course_data(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')
    
    title = soup.find('title').get_text().split(maxsplit=1)
    code = title[0]
    course = title[1]
    
    data = {'code': code, 'course': course, 'prerequisites' :[]}
    
    key_facts_section = soup.find('div', id='keyFacts-Content')
    items = key_facts_section.find_all('p')
        
    for item in items:
        text = item.get_text(strip=True)
    
        if text.startswith('Department'):
            data['department'] = text.split(':')[1].strip()
        elif text.startswith('Total students'):
            data['total_students'] = text.split(':')[1].strip()
        elif text.startswith('Average class size'):
            data['avg_class_size'] = text.split(':')[1].strip()
        elif text.startswith('Capped'):
            data['capped'] = text.split(':')[1].strip()
        elif text.startswith("Value:"):
            data['units'] = text.split('Value:')[1].strip()
    
    # Prerequisites: collecting course codes mentioned in the prerequisites section
    prereq_div = soup.find('div', id='preRequisites-Content')
    prereqs = set()
    
    if prereq_div:
        text = prereq_div.get_text(separator=" ", strip=True)
        words = text.replace('(', ' ').replace(')', ' ').replace(',', ' ').replace('.', ' ').split()

        for word in words:
            cleaned = word.strip()
            # Keep words that start with 2 letters followed by 3 digits (e.g. ST102)
            if len(cleaned) >= 5:
                prefix = cleaned[:2]
                digits = cleaned[2:5]
                if prefix.isalpha() and digits.isdigit():
                    if cleaned != code:
                        prereqs.add(cleaned)

    data['prerequisites'] = list(prereqs)
        
    return data
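

The word-by-word check in `extract_course_data` could also be expressed as a regular expression. The sketch below is an alternative, not the method used above: it assumes module codes are exactly two uppercase letters followed by three digits, and the helper name `find_module_codes` and the sample sentence are made up for illustration.

```python
import re

# Assumption: module codes are two uppercase letters followed by three digits
MODULE_CODE = re.compile(r'\b[A-Z]{2}\d{3}\b')

def find_module_codes(text, own_code):
    """Return the sorted module codes mentioned in text, excluding own_code."""
    return sorted(set(MODULE_CODE.findall(text)) - {own_code})

# Made-up prerequisite sentence, for illustration only
sample = 'Students must have completed ST102 (or ST107), and ideally MA100. Not available with ST314.'
print(find_module_codes(sample, own_code='ST314'))
```

Unlike the loop above, this returns only the matched code rather than the whole word, so trailing punctuation that the `replace` chain does not cover (such as `;` or `/`) would not leak into the result.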

In [18]:
# Running the scraper
all_course_data = []

for i, url in enumerate(course_links):
    print(f"Scraping {i+1}/{len(course_links)}: {url}", end='\r', flush=True)
    course_data = extract_course_data(url)
    all_course_data.append(course_data)

Scraping 568/568: https://www.lse.ac.uk/resources/calendar2024-2025/courseGuides/ST/2024_ST360.htm
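
The loop above stops at the first page that raises an exception. A minimal sketch of a more forgiving loop is shown below; `scrape_all` and `fake_fetch` are hypothetical names, and `fake_fetch` stands in for `extract_course_data` so that the example runs without network access.

```python
import time

def scrape_all(urls, fetch, delay=0.0):
    """Apply fetch to every URL, collecting results and failures separately."""
    results, failures = [], []
    for url in urls:
        try:
            results.append(fetch(url))
        except Exception as exc:  # keep going, but record what went wrong
            failures.append((url, repr(exc)))
        time.sleep(delay)  # optional pause between requests
    return results, failures

# Stand-in fetcher so the sketch runs offline (assumption, not the real scraper)
def fake_fetch(url):
    if url.endswith('broken.htm'):
        raise ValueError('missing key facts section')
    return {'url': url}

results, failures = scrape_all(['a.htm', 'broken.htm', 'b.htm'], fake_fetch)
print(len(results), len(failures))
```

In the notebook this could be called as `scrape_all(course_links, extract_course_data, delay=0.5)`, where the delay value is an arbitrary choice; the failed URLs can then be inspected or retried separately.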

In [19]:
# Converting to DataFrame
modules_facts = pd.DataFrame(all_course_data)
modules_facts

Unnamed: 0,code,course,prerequisites,department,total_students,avg_class_size,capped,units
0,AC102,Elements of Financial Accounting,[],Accounting,564,15,No,Half Unit
1,AC103,"Elements of Management Accounting, Financial M...",[],Accounting,256,18,No,Half Unit
2,AC105,Introduction to Financial Accounting,[],Accounting,115,39,No,Half Unit
3,AC106,Introduction to Management Accounting,[],Accounting,115,39,No,Half Unit
4,AC205,Intermediate Financial Accounting,"[AC102, AC105]",Accounting,Unavailable,Unavailable,No,Half Unit
...,...,...,...,...,...,...,...,...
563,ST314,Multilevel and Longitudinal Models,"[ST109, ST211, ST102, ST201, ST107]",Statistics,23,21,Yes (30),Half Unit
564,ST326,Financial Statistics,"[ST206, ST202, ST211]",Statistics,65,33,No,Half Unit
565,ST327,Market Research: An Integrated Approach,"[MG202, ST109, MG205, ST102, ST203, ST107]",Statistics,58,15,Yes (60),One Unit
566,ST330,Stochastic and Actuarial Methods in Finance,"[ST206, ST202, ST302]",Statistics,61,31,No,One Unit


In [20]:
# Cleaning Data
modules_facts['units'] = modules_facts['units'].map({'One Unit': 1.0, 'Half Unit': 0.5, 'Non-credit bearing': 0.0})
modules_facts['total_students'] = modules_facts['total_students'].replace('Unavailable', np.nan).astype(float)
modules_facts['avg_class_size'] = modules_facts['avg_class_size'].replace('Unavailable', np.nan).astype(float)
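# Encode 'capped' as False (uncapped) or the integer cap size, e.g. 'Yes (30)' -> 30
modules_facts.loc[modules_facts['capped'].str.startswith('No'), 'capped'] = False
is_capped = modules_facts['capped'] != False
modules_facts.loc[is_capped, 'capped'] = (
    modules_facts.loc[is_capped, 'capped']
    .str.split(' ').str[1]   # 'Yes (30)' -> '(30)'
    .str.strip('()')         # '(30)' -> '30'
    .astype(int)
)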
modules_facts

Unnamed: 0,code,course,prerequisites,department,total_students,avg_class_size,capped,units
0,AC102,Elements of Financial Accounting,[],Accounting,564.0,15.0,False,0.5
1,AC103,"Elements of Management Accounting, Financial M...",[],Accounting,256.0,18.0,False,0.5
2,AC105,Introduction to Financial Accounting,[],Accounting,115.0,39.0,False,0.5
3,AC106,Introduction to Management Accounting,[],Accounting,115.0,39.0,False,0.5
4,AC205,Intermediate Financial Accounting,"[AC102, AC105]",Accounting,,,False,0.5
...,...,...,...,...,...,...,...,...
563,ST314,Multilevel and Longitudinal Models,"[ST109, ST211, ST102, ST201, ST107]",Statistics,23.0,21.0,30,0.5
564,ST326,Financial Statistics,"[ST206, ST202, ST211]",Statistics,65.0,33.0,False,0.5
565,ST327,Market Research: An Integrated Approach,"[MG202, ST109, MG205, ST102, ST203, ST107]",Statistics,58.0,15.0,60,1.0
566,ST330,Stochastic and Actuarial Methods in Finance,"[ST206, ST202, ST302]",Statistics,61.0,31.0,False,1.0
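
The cleaning step above turns strings such as `Yes (30)` and `No` into either an integer cap or `False`. The same rule can be sketched as a small standalone function; `parse_cap` is a hypothetical helper, and it assumes the cap size always appears in parentheses.

```python
import re

def parse_cap(value):
    """Return the cap size as an int, or False when no cap is given."""
    match = re.search(r'\((\d+)\)', value)
    return int(match.group(1)) if match else False

print(parse_cap('Yes (30)'), parse_cap('No'))
```

A function like this could be applied with `modules_facts['capped'].map(parse_cap)` on the raw strings, which keeps the parsing rule in one place and makes it easy to test.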


In [21]:
# Counting non-null values per column among rows with a missing avg_class_size
modules_facts[modules_facts['avg_class_size'].isna()].count()

code              107
course            107
prerequisites     107
department        107
total_students     10
avg_class_size      0
capped            107
units             107
dtype: int64
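
The count above is restricted to rows where `avg_class_size` is missing. A complementary view, sketched here on three rows copied from the cleaned table, is to count missing values per column directly with `isna().sum()`. The variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# Three rows taken from the cleaned modules_facts table shown earlier
sample = pd.DataFrame({
    'code': ['AC102', 'AC205', 'ST314'],
    'total_students': [564.0, np.nan, 23.0],
    'avg_class_size': [15.0, np.nan, 21.0],
})

# Number of missing values in each column
missing_per_column = sample.isna().sum()

# Number of rows with at least one missing value
rows_with_any_missing = int(sample.isna().any(axis=1).sum())

print(missing_per_column)
print(rows_with_any_missing)
```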

In [22]:
# Saving to CSV
modules_facts.to_csv('data/modules/modules_key_facts.csv', index=False)

In [23]:
# Read the file as headerless and name its single column "code"
outside_options = pd.read_csv('data/degrees/ug_outside_options.csv', names=["code"], header=None)
outside_options.set_index("code", inplace=True)

# Index modules_facts by code (restore the column first if this cell is re-run)
if "code" not in modules_facts.columns:
    modules_facts.reset_index(inplace=True)
modules_facts.set_index("code", inplace=True)

# Attach every key-facts column to the outside options, aligned on module code
outside_options = outside_options.join(modules_facts, how='left')

# Has no effect if 'units' was already converted to floats in the cleaning step
outside_options['units'] = outside_options['units'].replace({'Half Unit': 0.5, 'One Unit': 1.0})
outside_options.dropna(how='all', inplace=True)

outside_options

FileNotFoundError: [Errno 2] No such file or directory: 'data/degrees/ug_outside_options.csv'

In [None]:


mutually_exclusive_options = pd.read_csv('data/modules/mutual_exclusive.csv')
mutually_exclusive_outside_options = mutually_exclusive_options[mutually_exclusive_options['Course'].isin(outside_options.index)]

# For each outside option, collect the courses it is mutually exclusive with (empty list if none)
exclusions_by_course = mutually_exclusive_outside_options.groupby('Course')['Mutually Exclusive Courses'].agg(list)
outside_options['mutually_exclusive_courses'] = [
    exclusions_by_course.get(code, []) for code in outside_options.index
]
outside_options.dropna(axis=1, how='all', inplace=True)
cols = outside_options.columns.tolist()

# Find the index of the 'mutually_exclusive_courses' and 'department' columns
mutually_exclusive_index = cols.index('mutually_exclusive_courses')
department_index = cols.index('department')

# Swap the columns
cols[mutually_exclusive_index], cols[department_index] = cols[department_index], cols[mutually_exclusive_index]

# Reorder the DataFrame columns based on the modified column list
outside_options = outside_options[cols]

# Saving to CSV
outside_options.to_csv('data/modules/outside_options.csv', index=True)
outside_options