In [110]:
import pandas as pd
import numpy as np

Load OULAD data and view column headers

In [111]:
courses = pd.read_csv('data/OULAD/courses.csv')
assessments = pd.read_csv('data/OULAD/assessments.csv')
VLEdata = pd.read_csv('data/OULAD/VLE.csv')
studentAssessments = pd.read_csv('data/OULAD/studentAssessment.csv')
studentInfo = pd.read_csv('data/OULAD/studentInfo.csv')
studentRegistration = pd.read_csv('data/OULAD/studentRegistration.csv')
studentVLE = pd.read_csv('data/OULAD/studentVLE.csv')

Info on variables can be found here: https://analyse.kmi.open.ac.uk/open-dataset <br>
<br>
courses.csv<br>
File contains the list of all available modules and their presentations. The columns are:<br>
- *code_module* - code name of the module, which serves as the identifier.<br>
- *code_presentation* - code name of the presentation. It consists of the year and "B" for the presentation starting in February and "J" for the presentation starting in October.<br>
- *module_presentation_length* - length of the module-presentation in days.<br>

The structure of B and J presentations may differ, so it is good practice to analyse the B and J presentations separately. Nevertheless, for some presentations the corresponding previous B/J presentation does not exist, and therefore the J presentation must be used to inform the B presentation or vice versa. In the dataset this is the case for the CCC, EEE and GGG modules.
<br>


assessments.csv<br>
This file contains information about assessments in module-presentations. Usually, every presentation has a number of assessments followed by the final exam. The file contains the following columns:<br>
- *code_module* - identification code of the module, to which the assessment belongs.<br>
- *code_presentation* - identification code of the presentation, to which the assessment belongs.<br>
- *id_assessment* - identification number of the assessment.<br>
- *assessment_type* - type of assessment. Three types of assessments exist: Tutor Marked Assessment (TMA), Computer Marked Assessment (CMA) and Final Exam (Exam).<br>
- *date* - information about the final submission date of the assessment calculated as the number of days since the start of the module-presentation. The starting date of the presentation has number 0 (zero).<br>
- *weight* - weight of the assessment in %. Typically, Exams are treated separately and have a weight of 100%; the weights of all other assessments sum to 100%.<br>

If the final exam date is missing, the exam takes place at the end of the last presentation week.<br>
<br>
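
The rule for missing exam dates could be applied roughly as follows. This is a minimal sketch on toy data, not part of the original notebook: it assumes that "the end of the last presentation week" can be approximated by *module_presentation_length* from courses.csv, and the toy values and the column name `date_filled` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for courses.csv and assessments.csv (values are illustrative only)
courses_toy = pd.DataFrame({
    'code_module': ['AAA', 'BBB'],
    'code_presentation': ['2013J', '2013J'],
    'module_presentation_length': [268, 240],
})
assessments_toy = pd.DataFrame({
    'code_module': ['AAA', 'AAA', 'BBB'],
    'code_presentation': ['2013J', '2013J', '2013J'],
    'id_assessment': [1, 2, 3],
    'assessment_type': ['TMA', 'Exam', 'Exam'],
    'date': [19.0, np.nan, 230.0],
})

# Attach the presentation length to each assessment (many assessments per presentation)
merged = assessments_toy.merge(
    courses_toy,
    on=['code_module', 'code_presentation'],
    how='left',
    validate='many_to_one',
)

# Assumption: a missing Exam date is replaced by the presentation length in days
exam_date_missing = (merged['assessment_type'] == 'Exam') & merged['date'].isna()
merged['date_filled'] = merged['date'].where(
    ~exam_date_missing, merged['module_presentation_length']
)

print(merged[['id_assessment', 'assessment_type', 'date', 'date_filled']])
```

Keeping the original *date* column alongside the filled one makes it easy to see which exam dates were imputed.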


vle.csv<br>
The csv file contains information about the available materials in the VLE. Typically, these are html pages, pdf files, etc. Students have access to these materials online and their interactions with the materials are recorded.
The vle.csv file contains the following columns:<br>
- *id_site* - an identification number of the material.<br>
- *code_module* - an identification code for module.<br>
- *code_presentation* - the identification code of presentation.<br>
- *activity_type* - the role associated with the module material.<br>
- *week_from* - the week from which the material is planned to be used.<br>
- *week_to* - week until which the material is planned to be used.<br>
<br>


studentInfo.csv<br>
This file contains demographic information about the students together with their results. File contains the following columns:<br>
- *code_module* - an identification code for a module on which the student is registered.<br>
- *code_presentation* - the identification code of the presentation during which the student is registered on the module.<br>
- *id_student* - a unique identification number for the student.<br>
- *gender* - the student's gender.<br>
- *region* - identifies the geographic region, where the student lived while taking the module-presentation.<br>
- *highest_education* - highest student education level on entry to the module presentation.<br>
- *imd_band* - specifies the Index of Multiple Deprivation band of the place where the student lived during the module-presentation.<br>
- *age_band* - band of the student's age.<br>
- *num_of_prev_attempts* - the number of times the student has attempted this module.<br>
- *studied_credits* - the total number of credits for the modules the student is currently studying.<br>
- *disability* - indicates whether the student has declared a disability.<br>
- *final_result* - student's final result in the module-presentation.<br>
<br>


studentRegistration.csv<br>
This file contains information about the time when the student registered for the module presentation. For students who unregistered, the date of unregistration is also recorded. The file contains five columns:<br>
- *code_module* - an identification code for a module.<br>
- *code_presentation* - the identification code of the presentation.<br>
- *id_student* - a unique identification number for the student.<br>
- *date_registration* - the date of the student's registration on the module presentation, given as the number of days relative to the start of the module-presentation (e.g. the negative value -30 means that the student registered for the module presentation 30 days before it started).<br>
- *date_unregistration* - the date of the student's unregistration from the module presentation, given as the number of days relative to the start of the module-presentation. Students who completed the course have this field empty. Students who unregistered have Withdrawn as the value of the *final_result* column in the studentInfo.csv file.<br>
<br>
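
The relationship between *date_unregistration* and *final_result* described above can be checked once the two tables are joined. The sketch below uses toy data rather than the real files; the helper columns `unregistered` and `withdrawn` are hypothetical names introduced only for this check.

```python
import numpy as np
import pandas as pd

keys = ['code_module', 'code_presentation', 'id_student']

# Toy stand-ins for studentInfo.csv and studentRegistration.csv
info_toy = pd.DataFrame({
    'code_module': ['AAA'] * 3,
    'code_presentation': ['2013J'] * 3,
    'id_student': [1, 2, 3],
    'final_result': ['Pass', 'Withdrawn', 'Fail'],
})
registration_toy = pd.DataFrame({
    'code_module': ['AAA'] * 3,
    'code_presentation': ['2013J'] * 3,
    'id_student': [1, 2, 3],
    'date_registration': [-30.0, -10.0, -5.0],
    'date_unregistration': [np.nan, 12.0, np.nan],
})

# One registration row per student per module-presentation is expected
check = info_toy.merge(registration_toy, on=keys, how='inner', validate='one_to_one')

# Flag rows where the two sources disagree about withdrawal
check['unregistered'] = check['date_unregistration'].notna()
check['withdrawn'] = check['final_result'].eq('Withdrawn')
mismatches = check[check['unregistered'] != check['withdrawn']]

print(f"{len(mismatches)} rows where date_unregistration and final_result disagree")
```

On the real data, any rows in `mismatches` would be worth inspecting before relying on either column as a withdrawal indicator.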


studentAssessment.csv<br>
This file contains the results of students' assessments. If the student does not submit the assessment, no result is recorded. Final exam submissions are missing if the result of the assessment is not stored in the system.
This file contains the following columns:<br>
- *id_assessment* - the identification number of the assessment.<br>
- *id_student* - a unique identification number for the student.<br>
- *date_submitted* - the date of student submission, measured as the number of days since the start of the module presentation.<br>
- *is_banked* - a status flag indicating that the assessment result has been transferred from a previous presentation.<br>
- *score* - the student's score in this assessment, in the range from 0 to 100. A score lower than 40 is interpreted as Fail.<br>
<br>
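
The 40-point threshold could be turned into a label along these lines. This is a sketch on toy scores; the column name `score_outcome` and the labels 'Missing' and 'Not fail' are assumptions for illustration, since the dataset description only defines what counts as Fail.

```python
import numpy as np
import pandas as pd

# Toy stand-in for studentAssessment.csv (values are illustrative only)
scores_toy = pd.DataFrame({
    'id_assessment': [1752] * 4,
    'id_student': [1, 2, 3, 4],
    'score': [78.0, 39.0, 40.0, np.nan],
})

# Missing scores are labelled first so they are not silently treated as passes
conditions = [scores_toy['score'].isna(), scores_toy['score'] < 40]
labels = ['Missing', 'Fail']
scores_toy['score_outcome'] = np.select(conditions, labels, default='Not fail')

print(scores_toy)
```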


studentVle.csv<br>
The studentVle.csv file contains information about each student's interactions with the materials in the VLE.
This file contains the following columns:<br>
- *code_module* - an identification code for a module.<br>
- *code_presentation* - the identification code of the module presentation.<br>
- *id_student* - a unique identification number for the student.<br>
- *id_site* - an identification number for the VLE material.<br>
- *date* - the date of student's interaction with the material measured as the number of days since the start of the module-presentation.<br>
- *sum_click* - the number of times the student interacted with the material on that day.<br>

From Kuzilek et al. (2017): <br>
**studentInfo** can be linked to **studentAssessment**, **studentVle** and **studentRegistration** tables using column *id_student*. <br>
**courses** links to the **assessments**, **studentRegistration**, **vle** and **studentInfo** using identifier columns *code_module* and
*code_presentation*. <br>
**assessments** links to **studentAssessment** using *id_assessment*. <br> 
**vle** links to **studentVle** using *id_site*.
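
Before merging on these links, it can help to confirm which columns actually identify a row in each table. The sketch below is a hedged illustration on toy data; `is_unique_key` is a hypothetical helper, not something from the dataset paper or the original notebook.

```python
import pandas as pd

def is_unique_key(frame, key_columns):
    """Return True if no two rows of `frame` share the same values in `key_columns`."""
    return not frame.duplicated(subset=key_columns).any()

# Toy stand-ins for assessments.csv and studentInfo.csv
assessments_toy = pd.DataFrame({
    'code_module': ['AAA', 'AAA'],
    'code_presentation': ['2013J', '2013J'],
    'id_assessment': [1752, 1753],
})
student_info_toy = pd.DataFrame({
    'code_module': ['AAA', 'AAA', 'BBB'],
    'code_presentation': ['2013J', '2013J', '2013J'],
    'id_student': [11391, 28400, 11391],  # student 11391 takes two modules
})

print(is_unique_key(assessments_toy, ['id_assessment']))
print(is_unique_key(student_info_toy, ['id_student']))
print(is_unique_key(student_info_toy, ['code_module', 'code_presentation', 'id_student']))
```

In the toy data, *id_student* alone does not identify a studentInfo row because one student appears in two modules. That is why the merges in this notebook join on *code_module* and *code_presentation* as well.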

In [112]:
print(courses.columns.tolist())
print(assessments.columns.tolist())
print(studentAssessments.columns.tolist())
print(VLEdata.columns.tolist())
print(studentInfo.columns.tolist())
print(studentRegistration.columns.tolist())
print(studentVLE.columns.tolist())

['code_module', 'code_presentation', 'module_presentation_length']
['code_module', 'code_presentation', 'id_assessment', 'assessment_type', 'date', 'weight']
['id_assessment', 'id_student', 'date_submitted', 'is_banked', 'score']
['id_site', 'code_module', 'code_presentation', 'activity_type', 'week_from', 'week_to']
['code_module', 'code_presentation', 'id_student', 'gender', 'region', 'highest_education', 'imd_band', 'age_band', 'num_of_prev_attempts', 'studied_credits', 'disability', 'final_result']
['code_module', 'code_presentation', 'id_student', 'date_registration', 'date_unregistration']
['code_module', 'code_presentation', 'id_student', 'id_site', 'date', 'sum_click']


In [113]:
# Both assessments and studentVLE have a 'date' column, but they are not the same variable:
# in assessments it is the final submission date, in studentVLE the day of the interaction.
# Rename 'date' in assessments to 'due_date'.
assessments = assessments.rename(columns={'date': 'due_date'})

In [114]:
print(courses.shape) 
print(assessments.shape)
print(VLEdata.shape)
print(studentAssessments.shape)
print(studentInfo.shape) 
print(studentRegistration.shape) 
print(studentVLE.shape)

(22, 3)
(206, 6)
(6364, 6)
(173912, 5)
(32593, 12)
(32593, 5)
(10655280, 6)


Methods from Casalino et al. 2024: <br>
(1) In detail, the initial step involved merging the student_info and student_vle tables based on the features code_module, code_presentation, and id_student, resulting in a consolidated table that integrated information from both sources. <br>
(2) Subsequently, this consolidated table was merged with the vle table, utilizing code_module, code_presentation, and id_site as reference points to create a table containing information on activity types. <br>
(3) The assessments and student_assessment tables were merged using the id_assessment feature. <br>
(4) Finally, the two resulting tables were joined using the id_student and date features, thereby creating a comprehensive dataset that encapsulates relevant student interactions and assessment results.

Note that Casalino et al. 2024 do not include studentRegistration information. 

Modified approach <br>
1. Merge studentInfo, studentRegistration, courses into df_studentInfo
    - 1 row per ('code_module','code_presentation','id_student')
    - same number of rows as studentInfo & studentRegistration
2. Merge assessments & studentAssessments into df_assessment
    - 1 row per ('code_module','code_presentation','id_student','id_assessment')
    - same number of rows as studentAssessments
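3. Merge studentVLE & VLEdata into df_vle
    - 1 row per studentVLE record; ('code_module','code_presentation','id_student','id_site') is not unique, since a student can have several rows per id_site (across dates and even within one date)
    - same number of rows as studentVLE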
4. Merge df_studentInfo & df_assessment into df_student_assessment
    - 1 row per ('code_module','code_presentation','id_student','id_assessment')
    - same number of rows as df_assessment
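5. Merge df_student_assessment & df_vle into df
    - 1 row per pairing of a df_vle row with a df_student_assessment row for the same ('code_module','code_presentation','id_student'), so ('code_module','code_presentation','id_student','id_assessment','id_site') can repeat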
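
The row-count expectations listed in the steps above can be enforced at merge time instead of being checked by eye. Below is a minimal sketch for step 2 on toy data, using the `validate` and `indicator` arguments of `pd.merge`; the toy frames and values are illustrative only.

```python
import pandas as pd

# Toy stand-ins for studentAssessments and assessments (after the due_date rename)
student_assessments_toy = pd.DataFrame({
    'id_assessment': [1752, 1752, 1753],
    'id_student': [11391, 28400, 11391],
    'score': [78.0, 70.0, 85.0],
})
assessments_toy = pd.DataFrame({
    'id_assessment': [1752, 1753],
    'code_module': ['AAA', 'AAA'],
    'code_presentation': ['2013J', '2013J'],
    'due_date': [19.0, 54.0],
})

# validate='many_to_one' raises MergeError if id_assessment repeats in assessments_toy;
# indicator=True adds a '_merge' column showing where each row found a match.
merged = pd.merge(
    student_assessments_toy,
    assessments_toy,
    how='left',
    on='id_assessment',
    validate='many_to_one',
    indicator=True,
)

# A left many-to-one merge must keep the row count of the left table
assert len(merged) == len(student_assessments_toy)

unmatched = int((merged['_merge'] == 'left_only').sum())
print(f"{unmatched} submissions without a matching assessment")
```

The same pattern would apply to steps 1, 3 and 4, with `validate='one_to_one'` for the studentInfo and studentRegistration merge.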

In [115]:
# Merge studentInfo & studentRegistration dataframes 
common_columns = list(set(studentInfo.columns) & set(studentRegistration.columns)) # ['code_module','code_presentation','id_student']
df_studentInfo=pd.merge(studentInfo, studentRegistration, how='outer', on=common_columns)

# checks on merge
assert studentInfo.shape[1] + studentRegistration.shape[1] - len(common_columns) == df_studentInfo.shape[1]
assert studentInfo.shape[0] == studentRegistration.shape[0] == df_studentInfo.shape[0]

# Merge courses with above (adds column module_presentation_length)
common_columns = list(set(courses.columns) & set(df_studentInfo.columns)) # ['code_module', 'code_presentation']
df_studentInfo=pd.merge(courses, df_studentInfo, how='outer', on=common_columns)

df_studentInfo.head()

Unnamed: 0,code_module,code_presentation,module_presentation_length,id_student,gender,region,highest_education,imd_band,age_band,num_of_prev_attempts,studied_credits,disability,final_result,date_registration,date_unregistration
0,AAA,2013J,268,11391,M,East Anglian Region,HE Qualification,90-100%,55<=,0,240,N,Pass,-159.0,
1,AAA,2013J,268,28400,F,Scotland,HE Qualification,20-30%,35-55,0,60,N,Pass,-53.0,
2,AAA,2013J,268,30268,F,North Western Region,A Level or Equivalent,30-40%,35-55,0,60,Y,Withdrawn,-92.0,12.0
3,AAA,2013J,268,31604,F,South East Region,A Level or Equivalent,50-60%,35-55,0,60,N,Pass,-52.0,
4,AAA,2013J,268,32885,F,West Midlands Region,Lower Than A Level,50-60%,0-35,0,60,N,Pass,-176.0,


In [116]:
# Merge assessments, studentAssessment dataframes 
common_columns = list(set(assessments.columns) & set(studentAssessments.columns)) # ['id_assessment']
df_assessment=pd.merge(studentAssessments, assessments, how='left', on=common_columns)

df_assessment.head()

Unnamed: 0,id_assessment,id_student,date_submitted,is_banked,score,code_module,code_presentation,assessment_type,due_date,weight
0,1752,11391,18,0,78.0,AAA,2013J,TMA,19.0,10.0
1,1752,28400,22,0,70.0,AAA,2013J,TMA,19.0,10.0
2,1752,31604,17,0,72.0,AAA,2013J,TMA,19.0,10.0
3,1752,32885,26,0,69.0,AAA,2013J,TMA,19.0,10.0
4,1752,38053,19,0,79.0,AAA,2013J,TMA,19.0,10.0


In [117]:
# Merge studentVLE, VLEdata dataframes 
common_columns = list(set(studentVLE.columns) & set(VLEdata.columns)) # ['id_site', 'code_presentation', 'code_module']
df_vle = pd.merge(studentVLE,VLEdata, how='left', on=common_columns)

df_vle.head()

Unnamed: 0,code_module,code_presentation,id_student,id_site,date,sum_click,activity_type,week_from,week_to
0,AAA,2013J,28400,546652,-10,4,forumng,,
1,AAA,2013J,28400,546652,-10,1,forumng,,
2,AAA,2013J,28400,546652,-10,1,forumng,,
3,AAA,2013J,28400,546614,-10,11,homepage,,
4,AAA,2013J,28400,546714,-10,1,oucontent,,


In [None]:
# Merge df_studentInfo, df_assessment dataframes 
common_columns = list(set(df_studentInfo.columns) & set(df_assessment.columns)) # ['code_module', 'id_student', 'code_presentation']
df_student_assessment=pd.merge(df_assessment, df_studentInfo, how='left', on=common_columns)

df_student_assessment.head()

Unnamed: 0,id_assessment,id_student,date_submitted,is_banked,score,code_module,code_presentation,assessment_type,due_date,weight,...,region,highest_education,imd_band,age_band,num_of_prev_attempts,studied_credits,disability,final_result,date_registration,date_unregistration
0,1752,11391,18,0,78.0,AAA,2013J,TMA,19.0,10.0,...,East Anglian Region,HE Qualification,90-100%,55<=,0,240,N,Pass,-159.0,
1,1752,28400,22,0,70.0,AAA,2013J,TMA,19.0,10.0,...,Scotland,HE Qualification,20-30%,35-55,0,60,N,Pass,-53.0,
2,1752,31604,17,0,72.0,AAA,2013J,TMA,19.0,10.0,...,South East Region,A Level or Equivalent,50-60%,35-55,0,60,N,Pass,-52.0,
3,1752,32885,26,0,69.0,AAA,2013J,TMA,19.0,10.0,...,West Midlands Region,Lower Than A Level,50-60%,0-35,0,60,N,Pass,-176.0,
4,1752,38053,19,0,79.0,AAA,2013J,TMA,19.0,10.0,...,Wales,A Level or Equivalent,80-90%,35-55,0,60,N,Pass,-110.0,


In [None]:
# Merge df_student_assessment, df_vle dataframes 
common_columns = list(set(df_student_assessment.columns) & set(df_vle.columns)) # ['code_module', 'id_student', 'code_presentation']
df=pd.merge(df_vle, df_student_assessment, how='left', on=common_columns)
df.head()

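# I ran into a memory issue trying to run this:
# MemoryError: Unable to allocate 7.33 GiB for an array with shape (11, 89457269) and data type float64
# The merge pairs every VLE row with every assessment row of the same student in the same
# module-presentation, so the result would have 89,457,269 rows.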
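
The size of a merge like this can be estimated before running it by multiplying the per-key row counts of the two tables. The following is a sketch on toy data; `estimate_left_merge_rows` is a hypothetical helper written for this note, and it assumes the key columns contain no missing values.

```python
import pandas as pd

def estimate_left_merge_rows(left, right, keys):
    """Row count of pd.merge(left, right, how='left', on=keys) without building it."""
    left_counts = left.groupby(keys).size().reset_index(name='n_left')
    right_counts = right.groupby(keys).size().reset_index(name='n_right')
    counts = left_counts.merge(right_counts, how='left', on=keys)
    # Left keys with no match on the right still contribute their rows once
    counts['n_right'] = counts['n_right'].fillna(1)
    return int((counts['n_left'] * counts['n_right']).sum())

keys = ['code_module', 'code_presentation', 'id_student']

# Toy stand-ins for df_vle and df_student_assessment
vle_toy = pd.DataFrame({
    'code_module': ['AAA'] * 5,
    'code_presentation': ['2013J'] * 5,
    'id_student': [1, 1, 1, 2, 2],
    'id_site': [10, 10, 11, 10, 12],
    'sum_click': [4, 1, 1, 11, 1],
})
assessment_toy = pd.DataFrame({
    'code_module': ['AAA'] * 3,
    'code_presentation': ['2013J'] * 3,
    'id_student': [1, 1, 3],
    'id_assessment': [1752, 1753, 1752],
})

estimated_rows = estimate_left_merge_rows(vle_toy, assessment_toy, keys)
actual_rows = len(pd.merge(vle_toy, assessment_toy, how='left', on=keys))
print(estimated_rows, actual_rows)
```

Running the estimate on the real df_vle and df_student_assessment would show how much each student's VLE rows are multiplied by their assessment rows.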

Brief exploration of options to collapse vle data to avoid memory error

In [None]:
# How many unique id_sites are there for each activity_type?
counts = (
    df_vle
    .groupby('activity_type')['id_site']
    .nunique()
    .reset_index(name='unique_id_sites')
    .sort_values('unique_id_sites', ascending=False)  # largest first, as in the output below
)

counts
# NB: there are several id_sites for each activity_type 

Unnamed: 0,activity_type,unique_id_sites
16,resource,2634
18,subpage,1038
9,oucontent,979
19,url,876
4,forumng,175
14,quiz,127
12,page,102
8,oucollaborate,77
13,questionnaire,61
11,ouwiki,49


In [None]:
# How many unique dates are there for each activity_type, student combo?
counts = (
    df_vle
    .groupby(['id_student','activity_type'])['date']
    .nunique()
    .reset_index(name='unique_dates')
)

counts
# NB: there are several dates for each activity_type per student

Unnamed: 0,id_student,activity_type,unique_dates
0,6516,dataplus,4
1,6516,forumng,78
2,6516,homepage,158
3,6516,oucontent,74
4,6516,resource,13
...,...,...,...
220866,2698588,oucollaborate,5
220867,2698588,oucontent,19
220868,2698588,resource,23
220869,2698588,subpage,22


We could collapse id_site and only consider the activity_type, 
but we would then need to decide how to collapse the other columns:
- 'date' - there are multiple dates for each activity_type, so this needs a decision (thoughts?)
- 'sum_click' - I would suggest sum(sum_click) across all id_site for each activity_type
- 'week_from' & 'week_to' - I'm leaning towards just dropping these, unless we want to engineer a feature that checks whether the student's interaction actually falls within this date range; there are a lot of NA values, though
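
One way the collapse could look is sketched below on toy data. Summing 'sum_click' follows the suggestion above. Summarising 'date' as a count of active days plus the first and last day is only one possible option, and the output column names (`total_clicks`, `active_days`, `first_day`, `last_day`) are made up for illustration.

```python
import pandas as pd

# Toy stand-in for df_vle, modelled on the first rows shown earlier
df_vle_toy = pd.DataFrame({
    'code_module': ['AAA'] * 5,
    'code_presentation': ['2013J'] * 5,
    'id_student': [28400] * 5,
    'id_site': [546652, 546652, 546652, 546614, 546714],
    'date': [-10, -10, -10, -10, -9],
    'sum_click': [4, 1, 1, 11, 1],
    'activity_type': ['forumng', 'forumng', 'forumng', 'homepage', 'oucontent'],
})

group_keys = ['code_module', 'code_presentation', 'id_student', 'activity_type']

# Collapse id_site: one row per student and activity_type within a module-presentation
collapsed = (
    df_vle_toy
    .groupby(group_keys)
    .agg(
        total_clicks=('sum_click', 'sum'),
        active_days=('date', 'nunique'),
        first_day=('date', 'min'),
        last_day=('date', 'max'),
    )
    .reset_index()
)

print(collapsed)
```

Collapsing df_vle like this before the final merge would leave one row per student and activity_type, so each student's assessment rows would be multiplied by far fewer VLE rows.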