# Module Information Scraper
This notebook scrapes assessment details from UCD, module by module. From these details we can estimate how vulnerable UCD's assessments are to ChatGPT and similar AI assistants. First, we import the packages we need.

## Imports and Global Variables

In [None]:
import pandas as pd
import numpy as np
import requests
from bs4 import BeautifulSoup
import re
import pathlib as Path
import html5lib
import json

Next we set the path to the datasets we will use. This cell reads a specific file, MODULES.csv, which holds all collected module information for the College of Engineering and Architecture. It could easily be changed to analyze individual schools within the college, or other colleges.

In [None]:
#This is the directory that holds all our input datasets
dir_raw=Path.Path("Datasets")

#Read in the csv that has all of our modules, if desired
moduleCodes= dir_raw / "MODULES.csv"
modules=pd.read_csv(moduleCodes)

#Print the csv with all our modules
modules

The function below reads in module codes. It can read the modules of a single school, the modules listed in an Excel sheet, or, by default, all modules.

In [None]:
def input_Modules(school= None, filename=None):
    #This is the directory that holds all our input datasets
    dir_raw=Path.Path("Datasets")
    
    #This is the dict that holds the filename for each school
    school_filenames={"Civil Engineering":"MODULES_CE.csv", \
                     "Mechanical & Materials Eng": "MODULES_MME.csv", \
                     "Chem & Bioprocess Engineering": "MODULES_CBE.csv", \
                     "Biosystems & Food Engineering": "MODULES_BFE.csv", \
                     "Architecture, Plan & Env Pol": "MODULES_APEP.csv", \
                     "Electrical & Electronic Eng": "MODULES_EEE.csv"}
    
    #If the school is not equal to none, do only modules from the set school
    if school != None:
        #Get the file for the school's modules
        moduleCodes = dir_raw / school_filenames[school]
        
        #Read in the desired module codes into a dataframe
        modules=pd.read_csv(moduleCodes)
        
        print(modules)
        #Return a list of the module codes
        return modules["Unnamed: 0"].tolist(), None
        
    elif filename != None:
        #If the file is an excel sheet, check it out
        if filename.endswith("xlsx"):
            corelist, excelTable=excelListReader(filename)
            
            return corelist, excelTable
        else:
            print("FILENAME ERROR: check filename, make sure it's in Excel (.xlsx) format")
            return None, None
    else:
        #Set the code to look at the csv that has all of our modules, if desired
        moduleCodes= dir_raw / "MODULES.csv"
        
        #Read in the desired module codes into a dataframe
        modules=pd.read_csv(moduleCodes)

        #Return a list of the module codes
        return modules["Code"].tolist(), None

In [None]:
def input_Infohub(school=None):
    #This is the directory that holds all our input datasets
    dir_raw=Path.Path("Datasets")
    
    #This is a list of schools
    schools=["Civil Engineering", \
                     "Mechanical & Materials Eng", \
                     "Chem & Bioprocess Engineering", \
                     "Biosystems & Food Engineering", \
                     "Architecture, Plan & Env Pol", \
                     "Electrical & Electronic Eng"]
    #This is the dict that holds the filename for each school
    school_filenames={"Civil Engineering":"SCE Module places by Term 22 23.csv", \
                     "Mechanical & Materials Eng": "SMME Module places by Term 22 23.csv", \
                     "Chem & Bioprocess Engineering": "SCBE Module places by Term 22 23.csv", \
                     "Biosystems & Food Engineering": "SBFE Module places by Term 22 23.csv", \
                     "Architecture, Plan & Env Pol": "SAPEP Module places by Term 22 23.csv", \
                     "Electrical & Electronic Eng": "SEEE Module places by Term 22 23.csv"}
    
    #If the school is not set, loop through all of them
    if school == None:
        schoolHubs=[]
        for school in schools:
            print("Getting file on the School of %s at %s"  %(school, school_filenames[school]))
            
            #Find the file location
            dir_in= dir_raw / school_filenames[school]
            schoolHubs.append(pd.read_csv(dir_in))
            
    
        #Combine all the module details together
        infohub=pd.concat(schoolHubs)
    #Otherwise just get that one school
    else:
        dir_in= dir_raw / school_filenames[school]
        infohub=pd.read_csv(dir_in)
        
    #Get rid of any modules that end in a letter
    infohub["Module"]=infohub["Module"].apply(lambda x : None if(re.search(r'\d+$', x) == None) else x)
    infohub.dropna(subset=["Module"], inplace=True)
    
    return infohub
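
The final filter in `input_Infohub` keeps only module codes that end in a digit. A minimal sketch of that check in plain Python follows. The module codes in it are made up for illustration and are not taken from the UCD records.

```python
import re

# Hypothetical module codes, for illustration only
codes = ["CVEN10040", "MEEN2005J", "EEEN30110", "ARCT4001A"]

# Keep a code only if it ends in one or more digits,
# which is the same test the lambda in input_Infohub applies
kept = [code for code in codes if re.search(r"\d+$", code) is not None]

print(kept)
```

Codes ending in a letter fail the `\d+$` match and are dropped, which is what the `dropna` call achieves in the function above.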

In [None]:
input_Infohub()

The function below reads in module codes that are stored in an Excel sheet. Its main use at present is reading the standard paths for certain engineering qualifications.

In [None]:
#This function reads in an excel sheet with module codes
def excelListReader(filename, excel_table=True):

    #Get the input file path
    coreCodes= dir_raw / filename

    #Make sure that it is in the desired excel table format
    if excel_table:
        #If it is, read in the excel sheet, and get the values in the "Module" column as a list
        coreModules=pd.read_excel(coreCodes)
        coreList=coreModules["Module"].values.tolist()
        
    else:
        #Report an error if the file is not in excel table format
        #Two values are returned so that callers can still unpack the result
        print("ERROR: not in excel table format")
        return None, None

    #print the module codes that we found, and then return them
    print(coreList)
    return coreList, coreModules

## Scraper Functions and Required Sub Functions

The module descriptor scraper pulls all module descriptor information from the UCD module website. This includes information such as who runs the module, and importantly for our analysis, the number of credits for each module.

In [None]:
#This pulls all module descriptor information from the publicly available UCD module website
def module_descriptor_scraper(url, level=None, school=None):

    #Get the HTML representation of the module page, the page being given by the URL
    request=requests.get(url)
    soup=BeautifulSoup(request.content, 'html.parser')

    #This will hold all items in the description list and associate them with their related element
    descriptor_list={}

    #Get all the elements in the "description list" - the 'dl'
    for element in soup.select('dl'):
        #Taking the "Description Term", dt, and the "Description element", dd, of this list as a pair
        for term, value in zip(element.select('dt'), element.select('dd')):
            #Create a dictionary item with the term and its associated element, to be turned into a series later
            descriptor_list[term.text]=value.text

    
    #Create a Series from the items in the description list
    module_descriptor=pd.Series(descriptor_list)
    #Make sure that the Credits column is numeric - if there is an error when changing to numeric, the value is set as NaN
    module_descriptor["Credits:"]=pd.to_numeric(module_descriptor["Credits:"], errors='coerce')
    
    #This implements filters as desired. If filtered changes to true, it means that the item is filtered out
    filtered=False
    online=False
    
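    #If filters exist, check that the module is not filtered out
    if (level != None):
        #Only the text before the first bracket is compared; a value that is not numeric becomes NaN and is filtered out
        filtered= (pd.to_numeric(module_descriptor["Level:"].split('(')[0].strip(), errors='coerce') != level)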
                   
    #If it wasn't filtered out by level, check if it is filtered out by school
    if (filtered == False) and (school != None):
        filtered = (module_descriptor["School:"] != school)

    #Check if the module is delivered online or not
    if(module_descriptor["Mode of Delivery:"] == "Online"):
        online=True
        
    #Return the module descriptor and whether or not it was filtered out
    return module_descriptor, filtered, online
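
The pairing of `dt` and `dd` elements can be illustrated without a network request. The sketch below is not the notebook's scraper: it swaps BeautifulSoup for the standard library's `html.parser` so that it runs on its own. The HTML snippet, the `DescriptionListParser` class, and the field values are all hypothetical stand-ins for a real module page.

```python
from html.parser import HTMLParser


class DescriptionListParser(HTMLParser):
    """Collects each <dt> term and the <dd> value that follows it (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.pairs = {}       # term text -> value text
        self._tag = None      # the dt/dd tag currently being read, if any
        self._term = None     # the most recent term waiting for its value
        self._buffer = []     # text fragments of the current dt/dd

    def handle_starttag(self, tag, attrs):
        if tag in ("dt", "dd"):
            self._tag = tag
            self._buffer = []

    def handle_data(self, data):
        if self._tag is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag != self._tag:
            return
        text = "".join(self._buffer).strip()
        if tag == "dt":
            self._term = text
        elif self._term is not None:
            self.pairs[self._term] = text
            self._term = None
        self._tag = None


# Hypothetical fragment shaped like a description list
SAMPLE_HTML = """
<dl>
  <dt>Credits:</dt><dd>5</dd>
  <dt>Level:</dt><dd>3 (Degree)</dd>
  <dt>Mode of Delivery:</dt><dd>Online</dd>
</dl>
"""

parser = DescriptionListParser()
parser.feed(SAMPLE_HTML)
descriptor = parser.pairs
print(descriptor)
```

The resulting dict plays the same role as `descriptor_list` above: each term, including its trailing colon, becomes a key that the later code looks up.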

The below code is used to simply assert that the filtering worked, and is more of a sanity check than anything else.

In [None]:
#This asserts that the filter works correctly
def assert_filtered(module_descriptors, level=None, school=None):
    #Combine all descriptors into a dataframe, one row per module
    #(this also works when only a single module was collected)
    all_descriptors=pd.DataFrame(module_descriptors)
    
    #Make sure that IF the level was specified, only one level is allowed
    if level !=None:
        assert (all_descriptors["Level:"].nunique() == 1)
    
    #Make sure that only one school is allowed, IF it was specified
    if school != None:
        assert (all_descriptors["School:"].nunique() == 1)
        
    #Print all the unique schools scraped
    print("\n %s" %all_descriptors["School:"].unique())
    
    #Return the number of unique values - by school and level
    return all_descriptors["School:"].nunique(), all_descriptors["Level:"].nunique()

Below is a helper function. This saves files as desired after their information has been taken from the UCD website. 

In [None]:
def save_module_files(module_assessments, module_descriptors, codeList=None, level=None, school=None, foldername=None):
    #The directory to save outputs to
    dir_output=Path.Path("ModuleInformation")
    dir_output.mkdir(parents=True, exist_ok=True)
    
    subdirectory=""
    #Save the file in its desired format
    if level != None:
        subdirectory+="Level=%d" %(level)
        
    if school != None:
        subdirectory+="_School="+school.replace(" ", "-")
    
    if codeList != None:
        subdirectory+="SelectedModules"
        
    if foldername != None:
        subdirectory=foldername
        
    #if the modules have been filtered, and thus belong in a sub directory, make that directory
    if len(subdirectory) > 0:
        dir_output=dir_output / subdirectory
        dir_output.mkdir(parents=True, exist_ok=True)
        
   
        
    #Save our two module detail files
    with open(dir_output / "assessments.json", 'w') as outfile:
        #A list of per-module dataframes, of any length, must be combined before it can be written as JSON
        if isinstance(module_assessments, list):
            module_assessments=pd.concat(module_assessments, ignore_index=True)
            print("saving to %s" % dir_output)
        outfile.write(module_assessments.to_json())
        
    with open(dir_output / "descriptors.json", 'w') as outfile:
        #A list of per-module series must likewise be combined into a dataframe
        if isinstance(module_descriptors, list):
            module_descriptors=pd.DataFrame(module_descriptors)
            print("saving to %s" % dir_output)
        outfile.write(module_descriptors.to_json())
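
To see which folder a given set of filters maps to, the naming rules in `save_module_files` can be isolated in a small function. This is only a sketch: `build_subdirectory` is a hypothetical helper that mirrors the string-building steps above and does not touch the file system.

```python
def build_subdirectory(codeList=None, level=None, school=None, foldername=None):
    """Return the sub folder name that the filters would produce (hypothetical helper)."""
    subdirectory = ""
    if level is not None:
        subdirectory += "Level=%d" % level
    if school is not None:
        subdirectory += "_School=" + school.replace(" ", "-")
    if codeList is not None:
        subdirectory += "SelectedModules"
    # An explicit folder name overrides everything built so far
    if foldername is not None:
        subdirectory = foldername
    return subdirectory


print(build_subdirectory(level=1))
print(build_subdirectory(level=2, school="Civil Engineering"))
print(build_subdirectory(school="Civil Engineering", foldername="CivilPath"))
```

An empty result means no filter was set, in which case the files are written directly into the top-level output folder.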

Below is a dict that will be used to develop a custom column, "work type", which will be necessary for further analysis later.
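
As a rough illustration of how the lookup is used, the collector later takes the text before the first colon of each assessment description as the assessment type, and then maps that type to a work type. The sketch below uses a three-entry subset of the dict. The assessment descriptions are hypothetical examples, not scraped data.

```python
# A small subset of the work type lookup, for illustration
work_type_subset = {
    "Examination": "In person",
    "Essay": "At home",
    "Lab Report": "Blended",
}

# Hypothetical assessment descriptions in "<type>: <details>" form
descriptions = [
    "Examination: 2 hour end of trimester exam",
    "Essay: 2000 word report",
    "Lab Report: Weekly lab write-ups",
]

results = []
for description in descriptions:
    # The assessment type is the text before the first colon
    assessment_type = description.split(":")[0]
    results.append((assessment_type, work_type_subset[assessment_type]))

print(results)
```

A type that is missing from the dict raises a `KeyError` in this sketch, so any new assessment type would need an entry added to the lookup.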

In [None]:
work_type={"Assignment" :"At home", \
                "Attendance": "In person", \
                "Class Test" : "In person", \
                "Continuous Assessment": "At home", \
               "Essay": "At home", \
                "Examination": "In person", \
                "Fieldwork": "In person", \
                "Group Project": "Blended", \
                "Journal": "Blended",\
               "Lab Report": "Blended", \
                "Multiple Choice Questionnaire": "Blended", \
                "Oral Examination": "In person", \
               "Portfolio" : "Blended",  \
                "Practical Examination": "In person", \
                "Presentation" : "In person", \
                "Project": "At home", \
               "Seminar": "In person", \
               "Studio Examination" : "In person",\
               "Assessments worth <2%": "Unknown"}

The code below collects all module assessment and module descriptor information into two lists. It also creates a "Scaled % of Final Grade" column in the assessment table, which weights each assessment by the number of credits the module carries. A module with the median and standard credit value, 5.0, has assessment weightings that add up to 100%. Modules worth more or less are scaled in proportion: a 10 credit module has assessments that add up to 200%, because it is worth twice as much as a standard module.
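
The scaling can be checked with a short worked example. The sketch below assumes a module with two assessments weighted 60% and 40%. The `scaled_weight` helper and these weights are hypothetical and serve only to show the arithmetic.

```python
def scaled_weight(percent_of_final_grade, credits, standard_credits=5.0):
    """Scale an assessment weighting by the module's credits relative to a 5 credit module."""
    return percent_of_final_grade * (credits / standard_credits)


weights = [60, 40]  # hypothetical "% of Final Grade" values for one module

scaled_5 = [scaled_weight(w, 5) for w in weights]       # standard module
scaled_10 = [scaled_weight(w, 10) for w in weights]     # double-weight module
scaled_2_5 = [scaled_weight(w, 2.5) for w in weights]   # half-weight module

print(scaled_5, sum(scaled_5))
print(scaled_10, sum(scaled_10))
print(scaled_2_5, sum(scaled_2_5))
```

The totals come to 100, 200 and 50 respectively, so summing the scaled column across modules compares assessments on a common per-credit footing.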

Details of error modules are stored for later inspection, to see why the errors occurred. The code continues even if errors occur: after storing these details, it simply proceeds to the next module.

We use two sources of information: the module codes available from the search function, and the module records from the previous academic year. Experimentation showed that the previous academic year's records include some modules that have since been removed, while every module code currently available from the search function appears in them. The module codes are therefore a subset of the previous academic year's records.

### The Collector Function - combines sub functions to collect all available information on desired modules.

In [None]:
#This function allows school and level filters to be applied to the collection
def collector(codeList=None, level=None, school=None, filename=None, foldername=None):
    #This will store module information
    module_assessments=[]
    module_descriptors=[]

    #This will store error module information
    error_modules=[]
    error_module_descriptors=[]

    #Pick where to get the module codes from
    if codeList!=None:
        modulesCodes=codeList
    else:
        modulesCodes, excelTable=input_Modules(school=school, filename=filename)
        
    #Get the previous academic year records
    infohub=input_Infohub(school=school)
        
    #Going through the modules one-by-one    
    for i in modulesCodes:
        #Let the user know we iterated
        print(".",end="")
        
        #Change the URL to finish with the desired module code
        url= "https://hub.ucd.ie/usis/!W_HU_MENU.P_PUBLISH?p_tag=MODULE&MODULE=" + i

        #Get the module descriptor
        descriptor, filtered, online=module_descriptor_scraper(url, level=level, school=school)
        #If the module is in violation of the filters, continue to the next without saving
        if filtered==True:
            continue
            
        #Use pandas to read in the assessment html table. This starts with the word 'Description', 
        #which is how we differentiate it from the other tables on the webpage
        table=pd.read_html(url, match="Description")
        
        #Get the first table, and turn it into a dataframe
        df=pd.DataFrame(table[0])
        #Create the "Assessment Type" column. There are 18 assessment types across UCD
        df["Assessment Type"] = df['Description'].str.split(':').str[0]
        #Add in the module code column to both the assessment dataframe and the descriptor list
        df["Module Code"]=i
        descriptor["Module Code"]=i

        #Try and create a column where the grade is scaled by credits worth, with 5 credits being the normal
        try:
            df["Scaled % of Final Grade"]= df['% of Final Grade'].apply(lambda x: x * (descriptor["Credits:"]/5.0))
            
        #If the scaling didn't work, this is an error module. Save it as an error module and continue
        #Catching Exception rather than everything lets a keyboard interrupt still stop the run
        except Exception:
            print("\nERROR MODULE DETECTED.")
            print("Module may need to be inspected, saving information as an error module and continuing without it")
        
            error_modules.append(df)
            error_module_descriptors.append(descriptor)
            continue
            
        #Add a new column, "Work Type". This is based on the assessment type, and should help inform us of how big the risk
        #from ChatGPT is for each assessment.
        #If the module is delivered online, the work type is always "At home", owing to the inherent exposure of these
        #modules to ChatGPT. Otherwise, set the work type according to the provided dict
        if online:
            df["Work Type"]="At home"
        else:
            #Replace the (short) MCQ with just MCQ for simplicity
            df=df.replace("Multiple Choice Questionnaire (Short)", "Multiple Choice Questionnaire")
            df["Work Type"]=df["Assessment Type"].apply(lambda x: work_type[x])
        
        #Add a few extra columns onto the dataFrame, so that we could make an interactive graph later
        df["Level"]=descriptor["Level:"]
        df["Credits"]=descriptor["Credits:"]
        df["School"]=descriptor["School:"]
        df["Module Coordinator"]=descriptor["Module Coordinator:"]
        
        #Add the records from the previous academic year
        prev_record=infohub[infohub["Module"] == i]
        #If the records from the previous year exist
        if not prev_record.empty:
            df["Semester"]=prev_record["Semester"].iloc[0]
            df["Enrolled Students 22/23"]=prev_record["Total Places - Max"].iloc[0]\
            -prev_record["Total Places - Avail"].iloc[0]
        else:
            df["Semester"]=None
            df["Enrolled Students 22/23"]=None
        
        #Add the stage if we know it
        if filename == None:
            df["Stage"]=None
        else:
            df["Stage"]=excelTable[excelTable["Module"] == i]["Stage"].iloc[0]
            
            
        #Append the module information dataframes to their respective lists
        module_assessments.append(df)
        module_descriptors.append(descriptor)
        
        #Save the individual module files
        save_module_files(df, descriptor, foldername="IndividualModules/%s" %i)
    
    #This asserts that the filters were properly imposed, if imposed at all
    num_schools, num_levels=assert_filtered(module_descriptors, level, school)

    #Inform the user that we have finished
    print("\nFINISHED, SCRAPED DETAILS ON %d MODULES, OVER %d SCHOOLS AND %d LEVELS" \
          %(len(module_assessments), num_schools, num_levels))
    
    #Save the output files
    save_module_files(module_assessments, module_descriptors, codeList=codeList, level=level, school=school, \
                      foldername=foldername)
    
    #Return the desired variables, the list of module assessment and descriptor dataframes, as well as the error dataframes
    return module_assessments, module_descriptors, error_modules, error_module_descriptors

Having defined the collector function, we simply now need to run it. The collector function is a general function. It can use filters in a number of ways as required. Filters can:
- Limit the school
- Limit the level
- Limit to a module list (mainly used for possible engineering paths)
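
The level and school filters can be sketched on plain dicts. This is a simplified stand-in for the check inside `module_descriptor_scraper`: it uses `int` instead of `pd.to_numeric`, and the `is_filtered` helper and the descriptor values are hypothetical. The module list option is not shown, because it replaces the source of module codes rather than filtering them.

```python
def is_filtered(descriptor, level=None, school=None):
    """Return True if the module should be skipped under the given filters (hypothetical helper)."""
    filtered = False
    if level is not None:
        # Compare only the number before the bracket in the level text
        filtered = int(descriptor["Level:"].split("(")[0]) != level
    if not filtered and school is not None:
        filtered = descriptor["School:"] != school
    return filtered


# Hypothetical descriptor for one module
descriptor = {"Level:": "1 (Introductory)", "School:": "Civil Engineering"}

print(is_filtered(descriptor))
print(is_filtered(descriptor, level=1))
print(is_filtered(descriptor, level=2))
print(is_filtered(descriptor, level=1, school="Electrical & Electronic Eng"))
```

With no filters set, nothing is skipped. A module is skipped as soon as either the level or the school does not match.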

It automatically saves the collected and scraped data to files when done. It also collects all "error modules" for later visual inspection. These are modules with some anomaly that prevents them from being included with the rest, and a detection message is printed every time one is found.

Now we will run the collector function.

## Collecting Data
### Collecting Data on All Modules from the College of Engineering and Architecture

In [None]:
#Run the above function in its base form
module_assessments, module_descriptors, ALL_error_modules, ALL_error_module_descriptors=collector()

### Collecting Data by Standard Undergraduate Paths

In [None]:
module_assessments, module_descriptors, error_modules, error_module_descriptors\
=collector(filename="UCD_EngArch_Path_Electronic_UG_Modules.xlsx", foldername="ElectronicPath")

In [None]:
module_assessments, module_descriptors, error_modules, error_module_descriptors\
=collector(filename="UCD_EngArch_Path_Electrical_UG_Modules.xlsx", foldername="ElectricalPath")

In [None]:
module_assessments, module_descriptors, error_modules, error_module_descriptors\
=collector(filename="UCD_EngArch_Path_Architecture_UG_Modules.xlsx", foldername="ArchitecturePath")

In [None]:
module_assessments, module_descriptors, error_modules, error_module_descriptors\
=collector(filename="UCD_EngArch_Path_Mechanical_UG_Modules.xlsx", foldername="MechanicalPath")

In [None]:
module_assessments, module_descriptors, error_modules, error_module_descriptors\
=collector(filename="UCD_EngArch_Path_Biomed-Elec_UG_Modules.xlsx", foldername="BiomedElectricalPath")

In [None]:
module_assessments, module_descriptors, error_modules, error_module_descriptors\
=collector(filename="UCD_EngArch_Path_Biomed-Mech_UG_Modules.xlsx", foldername="BiomedMechanicalPath")

In [None]:
module_assessments, module_descriptors, error_modules, error_module_descriptors\
=collector(filename="UCD_EngArch_Path_Chemical_UG_Modules.xlsx", foldername="ChemicalPath")

In [None]:
module_assessments, module_descriptors, error_modules, error_module_descriptors\
=collector(filename="UCD_EngArch_Path_Civil_UG_Modules.xlsx", foldername="CivilPath")

In [None]:
module_assessments, module_descriptors, error_modules, error_module_descriptors\
=collector(filename="UCD_EngArch_Path_CPEP_UG_Modules.xlsx", \
           foldername="CityPlanningAndEnvironmentalPolicyPath")

In [None]:
module_assessments, module_descriptors, error_modules, error_module_descriptors\
=collector(filename="UCD_EngArch_Path_LandArch_UG_Modules.xlsx", \
           foldername="LandscapeArchitecturePath")

In [None]:
module_assessments, module_descriptors, error_modules, error_module_descriptors\
=collector(filename="UCD_EngArch_Path_StructEngArch_UG_Modules.xlsx", \
           foldername="StructuralEngineerWithArchitecturePath")

### Collecting Data by Standard Integrated Masters Path

In [None]:
module_assessments, module_descriptors, error_modules, error_module_descriptors\
=collector(filename="UCD_EngArch_Path_Biomed-Elec_ME_Modules.xlsx", \
           foldername="BiomedElectronicMastersPath")

In [None]:
module_assessments, module_descriptors, error_modules, error_module_descriptors\
=collector(filename="UCD_EngArch_Path_Biomed-Mech_ME_Modules.xlsx", \
           foldername="BiomedMechanicalMastersPath")

In [None]:
module_assessments, module_descriptors, error_modules, error_module_descriptors\
=collector(filename="UCD_EngArch_Path_Electrical_ME_Modules.xlsx", \
           foldername="ElectricalMastersPath")

In [None]:
module_assessments, module_descriptors, error_modules, error_module_descriptors\
=collector(filename="UCD_EngArch_Path_Electronic_ME_Modules.xlsx", \
           foldername="ElectronicMastersPath")

In [None]:
module_assessments, module_descriptors, error_modules, error_module_descriptors\
=collector(filename="UCD_EngArch_Path_Mechanical_ME_Modules.xlsx", \
           foldername="MechanicalMastersPath")

In [None]:
module_assessments, module_descriptors, error_modules, error_module_descriptors\
=collector(filename="UCD_EngArch_Path_StructEngArch_ME_Modules.xlsx", \
           foldername="StructuralEngineerWithArchitectureMastersPath")

In [None]:
module_assessments, module_descriptors, error_modules, error_module_descriptors\
=collector(filename="UCD_EngArch_Path_Civil_ME_Modules.xlsx", \
           foldername="CivilMastersPath")

### Collecting Data by Level

In [None]:
#Run the collector function to only collect level 1 modules in the college of Engineering and Architecture
module_assessments, module_descriptors, error_modules, error_module_descriptors=collector(level=1)

In [None]:
#Run the collector function to only collect level 2 modules in the college of Engineering and Architecture
module_assessments, module_descriptors, error_modules, error_module_descriptors=collector(level=2)

In [None]:
#Run the collector function to only collect level 3 modules in the college of Engineering and Architecture
module_assessments, module_descriptors, error_modules, error_module_descriptors=collector(level=3)

In [None]:
#Run the collector function to only collect level 4 modules in the college of Engineering and Architecture
module_assessments, module_descriptors, error_modules, error_module_descriptors=collector(level=4)

In [None]:
#Run the collector function to only collect level 5 modules in the college of Engineering and Architecture
module_assessments, module_descriptors, error_modules, error_module_descriptors=collector(level=5)

### Collecting Data by School

In [None]:
module_assessments, module_descriptors, error_modules, error_module_descriptors=collector(school="Mechanical & Materials Eng")

In [None]:
module_assessments, module_descriptors, error_modules, error_module_descriptors=collector(school="Civil Engineering")

In [None]:
module_assessments, module_descriptors, error_modules, error_module_descriptors=collector(school="Chem & Bioprocess Engineering")

In [None]:
module_assessments, module_descriptors, error_modules, error_module_descriptors=collector(school="Biosystems & Food Engineering")

In [None]:
module_assessments, module_descriptors, error_modules, error_module_descriptors=collector(school="Architecture, Plan & Env Pol")

In [None]:
module_assessments, module_descriptors, error_modules, error_module_descriptors=collector(school="Electrical & Electronic Eng")

## Inspecting Error Modules
As stated, the collector identifies any error modules and saves them for later visual inspection. We will quickly inspect these modules to make sure that they are not a sign of a greater issue, and to check whether they should be included after all.

In [None]:
#Inspect the error modules just in case. Simply loop through their assessment and description data and print it for inspection
for i, error in enumerate(zip(ALL_error_modules, ALL_error_module_descriptors)):
    print("********ERROR COUNT %d*********" %i)
    print(error[0])
    print(error[1])
    