<img src="https://www.dbs.ie/images/default-source/logos/dbs-logo-2019-small.png" align = left/>

#  Open University Learning Analytics Dataset Preparation

Capstone Project

Claire Connaughton (10266499)

# Import Relevant Libraries 

In [None]:
import os
import pickle
import pydotplus
import numpy as np
import pandas as pd
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

from plotnine import *
import plotnine
plotnine.options.figure_size = (5.2,3.2)
import seaborn as sns
sns.set()
sns.set_style("white")
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
%%HTML
<style type="text/css">
table.dataframe td, table.dataframe th {
    border: 1px  black solid !important;
  color: black !important;
}
</style>

# Prepare the dataset

The OULA dataset contains 7 separate csv files. The database schema is displayed below. 
Source: https://analyse.kmi.open.ac.uk/open_dataset

![](schema.png)

Each csv file will be loaded and inspected one by one to get an insight into the tables.
Each csv file will be cleaned sequentially before finally being merged into the final dataset. 

# Courses 

This file contains the list of all available modules and their presentations.

In [None]:
# Load courses table

try:
    courses = pd.read_csv('courses.csv')
    print("The 'courses' table has {} samples with {} features each.".format(*courses.shape))
    courses.info()  # info() prints its report and returns None
    display(courses.head())
except FileNotFoundError:
    print("Dataset could not be loaded. Is the dataset missing?")

                Feature Description for Courses

code_module – code name of the module, which serves as the identifier.

code_presentation – code name of the presentation. It consists of the year and “B” for the presentation starting in February and “J” for the presentation starting in October.

module_presentation_length - length of the module-presentation in days.

The structure of B and J presentations may differ, so it is good practice to analyse the B and J presentations separately. Nevertheless, for some presentations the corresponding previous B/J presentation does not exist, and the J presentation must be used to inform the B presentation or vice versa. In the dataset this is the case for the CCC, EEE and GGG modules.
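To analyse B and J presentations separately, the presentation code first has to be split into its year and start code. The following is a minimal sketch of that split, assuming the code format described above (four-digit year followed by "B" or "J"); the `demo` frame and the new column names are illustrative and not part of the dataset.

```python
import pandas as pd

# Illustrative presentation codes only; not rows copied from courses.csv
demo = pd.DataFrame({"code_presentation": ["2013B", "2013J", "2014B", "2014J"]})

# The first four characters are the year, the last character is the start code
demo["year"] = demo["code_presentation"].str[:4].astype(int)
demo["start_code"] = demo["code_presentation"].str[-1]

# B = February start, J = October start (per the feature description)
demo["start_month"] = demo["start_code"].map({"B": "February", "J": "October"})

print(demo)
```

The same three lines could be applied to any table that carries `code_presentation`, which would make it straightforward to filter or group by start code.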

In [None]:
# Highlighting the most common course length
g = sns.countplot(x ='module_presentation_length', 
              data = courses,
              color='grey',
              order = courses.module_presentation_length.value_counts().index);

patch_h = []    
for patch in g.patches:
    reading = patch.get_height()
    patch_h.append(reading)

idx_tallest = np.argmax(patch_h)   

g.patches[idx_tallest].set_facecolor('#a834a8')  
g.set_title('Module Duration (in days)', fontsize = 18)


Most courses last around 8 months each.

In [None]:
# Visualise the breakdown in module length per module type

g = sns.catplot(x='module_presentation_length', col='code_module', col_wrap=4,
                data=courses[courses.code_module.notnull()],
                kind="count", height=3.5, aspect=.8, 
                palette= "tab20")
g.fig.subplots_adjust(top=0.9) 
g.fig.suptitle('Length of Modules', fontsize = 18)

Here we can see that the modules are different lengths for every intake. 

In [None]:
pd.crosstab(courses.module_presentation_length, courses.code_presentation).plot.barh(stacked = True);

This plot further verifies that the course presentation length varies with every year, albeit slightly. This indicates that the course presentation length offers little value to the analysis because it varies with every year and module. Therefore it may need to be discarded.

*****************************************

# Assessments

This file contains information about assessments in module-presentations.
Usually, every presentation has a number of assessments followed by the final exam. 

In [None]:
# Load assessments table

try:
    assessments = pd.read_csv('assessments.csv')
    print("The 'assessments' table has {} samples with {} features each.".format(*assessments.shape))
    assessments.info()  # info() prints its report and returns None
    display(assessments.head())
except FileNotFoundError:
    print("Dataset could not be loaded. Is the dataset missing?")

                Feature Description for Assessments:

code_module – identification code of the module, to which the assessment belongs.

code_presentation - identification code of the presentation, to which the assessment belongs.

id_assessment – identification number of the assessment.

assessment_type – type of assessment. Three types of assessments exist: Tutor Marked Assessment (TMA), Computer Marked Assessment (CMA) and Final Exam (Exam).

date – information about the final submission date of the assessment calculated as the number of days since the start of the module-presentation. The starting date of the presentation has number 0 (zero).

weight - weight of the assessment in %. Typically, Exams are treated separately and have the weight 100%; the sum of all other assessments is 100%.

If the information about the final exam date is missing, it is at the end of the last presentation week.
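The rule for missing exam dates can be sketched as a fill step. This is only an illustration under the assumption that "the end of the last presentation week" can be approximated by `module_presentation_length`; the toy tables below reuse the real column names but contain made-up values.

```python
import numpy as np
import pandas as pd

# Toy tables with the same column names as courses.csv and assessments.csv;
# the module code and all values are made up for illustration
courses_demo = pd.DataFrame({
    "code_module": ["XXX"],
    "code_presentation": ["2014J"],
    "module_presentation_length": [262],
})
assessments_demo = pd.DataFrame({
    "code_module": ["XXX", "XXX"],
    "code_presentation": ["2014J", "2014J"],
    "id_assessment": [1, 2],
    "assessment_type": ["TMA", "Exam"],
    "date": [30.0, np.nan],
})

# Bring the presentation length alongside each assessment
filled = assessments_demo.merge(
    courses_demo, on=["code_module", "code_presentation"], how="left"
)

# Assumption: a missing exam date is replaced by the presentation length
missing_exam = (filled["assessment_type"] == "Exam") & filled["date"].isna()
filled.loc[missing_exam, "date"] = filled.loc[missing_exam, "module_presentation_length"]

print(filled[["id_assessment", "assessment_type", "date"]])
```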

*****************************************

Data Cleaning for Assessments is required.

In [None]:
# Store assessment IDs as object dtype so they are treated as identifiers, not numbers

assessments['id_assessment'] = assessments['id_assessment'].astype(object)

In [None]:
assessments.info()  # info() prints its report itself; wrapping it in print() also prints "None"

In [None]:
# Check weightings of assessment results.
# The weighting of exams is 100%
# The weighting of the sum of assessments is 100%
# Modules with assessments and exams would have a weighting of 200%

# Determine the weightings of each module

assessments\
.groupby(['code_module','code_presentation', 'assessment_type'])\
.agg(weight_by_type = ('weight', 'sum'))

This indicates that the modules have both assessments (100%) and exam (100%) which is why their weighting is 200.

The exceptions are:
    
    Module CCC which has a score of 200 for exams. This suggests 2 exams.
    Module GGG which has a score of 0 for assignments. This suggests no assignments.
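The check for unusual weightings could also be automated rather than read off the grouped table. The sketch below, on made-up rows with hypothetical module codes, sums the weights separately for exams and for all other assessment types and flags any module-presentation where either total differs from 100; the `group`, `totals` and `unusual` names are assumptions for illustration.

```python
import pandas as pd

# Made-up rows with the same column names as assessments.csv
demo = pd.DataFrame({
    "code_module": ["XXX"] * 3 + ["YYY"] * 3,
    "code_presentation": ["2014J"] * 6,
    "assessment_type": ["TMA", "CMA", "Exam", "TMA", "Exam", "Exam"],
    "weight": [60.0, 40.0, 100.0, 0.0, 100.0, 100.0],
})

# Exams are weighted separately from all other assessment types
demo["group"] = demo["assessment_type"].map(
    lambda t: "exam" if t == "Exam" else "coursework"
)

# One row per module-presentation, one column per group
totals = (
    demo.groupby(["code_module", "code_presentation", "group"])["weight"]
    .sum()
    .unstack("group")
)

# Flag anything that does not follow the 100% + 100% pattern
unusual = totals[(totals["coursework"] != 100) | (totals["exam"] != 100)]
print(unusual)
```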

In [None]:
# Check that there are 2 exams in Module CCC
assessments[(assessments['code_module'] == 'CCC') & (assessments['assessment_type'] == 'Exam')][['code_module', 'code_presentation', 'assessment_type']]\
.groupby(['code_module', 'code_presentation'])\
.count()

This confirms that there are two exams in Module CCC.

In [None]:
# Check that there is only 1 exam in Module GGG
assessments[(assessments['code_module'] == 'GGG') & (assessments['assessment_type'] == 'Exam')][['code_module', 'code_presentation', 'assessment_type']]\
.groupby(['code_module', 'code_presentation'])\
.count()

This confirms that there is only one exam in Module GGG.

In [None]:
# Highlighting the assessment types
sns.set_style("white")
g = sns.countplot(x ='assessment_type', 
              data = assessments,
              color='grey',
              order = assessments.assessment_type.value_counts().index);

patch_h = []    
for patch in g.patches:
    reading = patch.get_height()
    patch_h.append(reading)

idx_tallest = np.argmax(patch_h)   

g.patches[idx_tallest].set_facecolor('#a834a8')  
g.set_title('Types of Assessments', fontsize = 18)


TMAs are the most common assessment types. 

In [None]:
# Visualise the assessment breakdown per module

sns.set(style='white')

# Plot
g = sns.catplot(x="code_module", col="assessment_type",
                data=assessments[assessments.assessment_type.notnull()],
                kind="count", height=4, aspect=1.0, palette='tab20')
plt.show()

Every module has an exam and TMAs. AAA and EEE have no CMA assessments. 

******************************************

# Student Results (studentAssessment table)


This file contains the results of students’ assessments. 
If the student does not submit the assessment, no result is recorded. 
Final exam submissions are missing if the assessment result is not stored in the system.


In [None]:
# Load the Results table

try:
    results = pd.read_csv('studentAssessment.csv')
    print("The 'Results' table has {} samples with {} features each.".format(*results.shape))
    results.info()  # info() prints its report and returns None
    display(results.head())
except FileNotFoundError:
    print("Dataset could not be loaded. Is the dataset missing?")


                    Feature Description

id_assessment – the identification number of the assessment.

id_student – a unique identification number for the student.

date_submitted – the date of student submission, measured as the number of days since the start of the module presentation.

is_banked – a status flag indicating that the assessment result has been transferred from a previous presentation.

score – the student’s score in this assessment. The range is from 0 to 100. A score lower than 40 is interpreted as Fail.

In [None]:
# Store id_assessment and id_student as object dtype so they are treated as identifiers, not numbers

results['id_assessment'] = results['id_assessment'].astype(object)
results['id_student'] = results['id_student'].astype(object)

In [None]:
results.info()  # info() prints its report itself; wrapping it in print() also prints "None"

In [None]:
# Check whether the Assessments information is in the Results Table

def compareCols(df1, df2):

    # Show shared columns between dataframes
    # (a) Make lists of columns for each data frame
    df1Columns = df1.columns.values.tolist()
    df2Columns = df2.columns.values.tolist()

    # (b) Find column names that are the same
    sharedCols = set(df1Columns) & set(df2Columns)
    
    print('Shared columns : ', sharedCols, '\n')

    # (c) For every shared column, check
    # whether the values in df1 are also
    # present in df2
    for col in sharedCols:
        x = df1[col].isin(df2[col]).value_counts()
        print('Check if values are present in both dataframes:')
        print(x, '\n')

compareCols(assessments, results)

In [None]:
# Determine what assignments are missing from the results table 

def printDiffValues(df1, df2, col):
    '''
    Show all df1.col values not present in df2.col
    '''
    # Pull out all unique values of the column
    df1_IDs = df1[col].unique()
    df2_IDs = df2[col].unique()

    # Find the values in df1 that are absent from df2
    diff = set(df1_IDs).difference(set(df2_IDs))
    
    # Return the df1 rows that hold those missing values
    missingList = list(diff)
    missingDf = df1[df1[col].isin(missingList)]

    return missingDf

printDiffValues(assessments, results, 'id_assessment')

All assignments missing from the Results table are exams with 100% module weight.
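The same "rows in one table with no match in the other" question can also be answered with a left merge and the `indicator` flag of `DataFrame.merge`. This is offered as an alternative sketch, not as the method used in this notebook; the `left` and `right` frames are toy data.

```python
import pandas as pd

# Toy stand-ins for the assessments and results tables
left = pd.DataFrame({
    "id_assessment": [1, 2, 3],
    "assessment_type": ["TMA", "CMA", "Exam"],
})
right = pd.DataFrame({
    "id_assessment": [1, 1, 2],
    "score": [70, 55, 80],
})

# Keep one row per key on the right so the left rows are not duplicated
merged = left.merge(
    right[["id_assessment"]].drop_duplicates(),
    on="id_assessment",
    how="left",
    indicator=True,  # adds a _merge column: 'both' or 'left_only'
)

# Rows that exist only in the left table
only_left = merged.loc[merged["_merge"] == "left_only"].drop(columns="_merge")
print(only_left)
```

The indicator approach keeps every column of the left table and scales to multi-column keys by passing a list to `on`.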

# Materials (VLE table)

The csv file contains information about the available materials in the VLE. 
Typically these are html pages, pdf files, etc. 
Students have access to these materials online and their interactions with the materials are recorded. 

In [None]:
# Load vle table

try:
    materials = pd.read_csv('vle.csv')
    print("The 'Materials' table has {} samples with {} features each.".format(*materials.shape))
    materials.info()  # info() prints its report and returns None
    display(materials.head())
except FileNotFoundError:
    print("Dataset could not be loaded. Is the dataset missing?")

            Feature Description:
    
id_site – an identification number of the material.

code_module – an identification code for module.

code_presentation - the identification code of presentation.

activity_type – the role associated with the module material.

week_from – the week from which the material is planned to be used.

week_to – week until which the material is planned to be used.

In [None]:
# Store id_site as object dtype so it is treated as an identifier, not a number
materials['id_site'] = materials['id_site'].astype(object)

In [None]:
# Chart the most common VLE activities
sns.set_style("white")
g = sns.countplot(y= "activity_type", 
              data = materials,
              color='grey',
              order = materials.activity_type.value_counts().index);

# Bars are horizontal here, so the count is the bar width, not its height
patch_h = []    
for patch in g.patches:
    reading = patch.get_width()
    patch_h.append(reading)

idx_tallest = np.argmax(patch_h)   

g.patches[idx_tallest].set_facecolor('#a834a8')  
g.set_title('Most Common VLE Activities', fontsize = 18)

Resource, oucontent, subpage and url are the most common types of material on the VLE.

******************************

# StudentInfo Table

This file contains demographic information about the students together with their results.

In [None]:
# Load studentInfo table

try:
    studentInfo = pd.read_csv('studentInfo.csv')
    print("The 'studentInfo' table has {} samples with {} features each.".format(*studentInfo.shape))
    studentInfo.info()  # info() prints its report and returns None
    display(studentInfo.head())
except FileNotFoundError:
    print("Dataset could not be loaded. Is the dataset missing?")

                        Feature Description

code_module – an identification code for a module on which the student is registered.

code_presentation - the identification code of the presentation during which the student is registered on the module.

id_student – a unique identification number for the student.

gender – the student’s gender.

region – identifies the geographic region, where the student lived while taking the module-presentation.

highest_education – highest student education level on entry to the module presentation.

imd_band – specifies the Index of Multiple Deprivation band of the place where the student lived during the module-presentation.

age_band – band of the student’s age.

num_of_prev_attempts – the number of times the student has attempted this module.

studied_credits – the total number of credits for the modules the student is currently studying.

disability – indicates whether the student has declared a disability.

final_result – student’s final result in the module-presentation.

In [None]:
# Store id_student as object dtype so it is treated as an identifier, not a number

studentInfo['id_student'] = studentInfo['id_student'].astype(object)

In [None]:
# Inspect boxplots of the numeric variables

fig, ax = plt.subplots(1, 2, figsize=(15, 5))
studentInfo.boxplot(column=['num_of_prev_attempts'], grid=False,  ax=ax[0], patch_artist=True)
studentInfo.boxplot(column=['studied_credits'],  grid=False,  ax=ax[1], patch_artist=True)
print("Boxplots for numerical variables")

There is evidence of outliers in the studied_credits column. This will need to be cleaned later. It is also clear that the num_of_prev_attempts is an ordinal variable, not a continuous variable.
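One common way to make "outlier" concrete is the 1.5 × IQR rule that the boxplot whiskers are based on. The sketch below applies it to a made-up series of credit values; it is an assumption about how the later cleaning might be approached, not a description of what this notebook does.

```python
import pandas as pd

# Made-up credit values for illustration; not taken from studentInfo.csv
credits = pd.Series([30, 60, 60, 60, 90, 120, 120, 630], name="studied_credits")

# Interquartile range and the usual 1.5 * IQR fences
q1, q3 = credits.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Values outside the fences are candidate outliers
outliers = credits[(credits < lower) | (credits > upper)]
print(f"fences: [{lower}, {upper}]")
print(outliers)
```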

In [None]:
# Change the num_of_prev_attempts to a categorical variable because it could not be visualised using a box plot

studentInfo.num_of_prev_attempts=pd.Categorical(studentInfo.num_of_prev_attempts)

In [None]:
# Display the counts of each category

studentInfo.num_of_prev_attempts.value_counts()

The majority of students were on their first attempt at the module (no previous attempts). The categories should be collapsed further during cleaning.

In [None]:
# Visualise the categorical variables
sns.set_style("white")
fig, ax = plt.subplots(1,3, figsize=(15, 5))
# Code Module
g_1 = sns.countplot(x ='code_module', 
              data = studentInfo,
              ax=ax[0],
              color='grey',
              order = studentInfo.code_module.value_counts().index);

patch_h = []    
for patch in g_1.patches:
    reading = patch.get_height()
    patch_h.append(reading)

idx_tallest = np.argmax(patch_h)   

g_1.patches[idx_tallest].set_facecolor('#a834a8')  
g_1.set_title('Module Codes', fontsize = 18)

# Code Presentation
g_2= sns.countplot(x ='code_presentation', 
              data = studentInfo,
              ax=ax[1],
              color='grey',
              order = studentInfo.code_presentation.value_counts().index);

patch_h = []    
for patch in g_2.patches:
    reading = patch.get_height()
    patch_h.append(reading)

idx_tallest = np.argmax(patch_h)   

g_2.patches[idx_tallest].set_facecolor('#a834a8')  
g_2.set_title('Year of Course', fontsize = 18)

# Gender
g_3= sns.countplot(x ='gender', 
              data = studentInfo,
              ax=ax[2],
              color='grey',
              order = studentInfo.gender.value_counts().index);

patch_h = []    
for patch in g_3.patches:
    reading = patch.get_height()
    patch_h.append(reading)

idx_tallest = np.argmax(patch_h)   

g_3.patches[idx_tallest].set_facecolor('#a834a8')  
g_3.set_title('Gender', fontsize = 18)

print("Count plots for code_module, code_presentation, gender")

There are 7 module codes. These categories should be condensed further during cleaning. The code_presentation could be condensed into two year groups. More males were registered than females.

In [None]:
# Visualise the categorical variables
sns.set_style("white")
fig, ax = plt.subplots(1,3, figsize=(15, 5))
# num_of_prev_attempts
g_1 = sns.countplot(x ='num_of_prev_attempts', 
              data = studentInfo,
              ax=ax[0],
              color='grey',
              order = studentInfo.num_of_prev_attempts.value_counts().index);

patch_h = []    
for patch in g_1.patches:
    reading = patch.get_height()
    patch_h.append(reading)

idx_tallest = np.argmax(patch_h)   

g_1.patches[idx_tallest].set_facecolor('#a834a8')  
g_1.set_title('Number of Previous Attempts', fontsize = 18)

# Disability
g_2= sns.countplot(x ='disability', 
              data = studentInfo,
              ax=ax[1],
              color='grey',
              order = studentInfo.disability.value_counts().index);

patch_h = []    
for patch in g_2.patches:
    reading = patch.get_height()
    patch_h.append(reading)

idx_tallest = np.argmax(patch_h)   

g_2.patches[idx_tallest].set_facecolor('#a834a8')  
g_2.set_title('Disability', fontsize = 18)

# Age_band
g_3= sns.countplot(x ='age_band', 
              data = studentInfo,
              ax=ax[2],
              color='grey',
              order = studentInfo.age_band.value_counts().index);

patch_h = []    
for patch in g_3.patches:
    reading = patch.get_height()
    patch_h.append(reading)

idx_tallest = np.argmax(patch_h)   

g_3.patches[idx_tallest].set_facecolor('#a834a8')  
g_3.set_title('Age Band', fontsize = 18)

print("Count plots for num_of_prev_attempts, disability, age_band")

The vast majority of students were on their first attempt at the module. Very few declared a disability. Most were aged 35 and under.

In [None]:
# Visualise the most common region
g = sns.countplot(y= "region", 
              data = studentInfo,
              color='grey',
              order = studentInfo.region.value_counts().index);

# Bars are horizontal here, so the count is the bar width, not its height
patch_h = []    
for patch in g.patches:
    reading = patch.get_width()
    patch_h.append(reading)

idx_tallest = np.argmax(patch_h)   

g.patches[idx_tallest].set_facecolor('#a834a8')  
g.set_title('Most Common Region', fontsize = 18)
print("Count plot for region")

Scotland was the single region with the most students, but the English regions combined accounted for the most students overall. Ireland had the fewest.

In [None]:
# Visualise the most common education band
g = sns.countplot(y= "highest_education", 
              data = studentInfo,
              color='grey',
              order = studentInfo.highest_education.value_counts().index);

# Bars are horizontal here, so the count is the bar width, not its height
patch_h = []    
for patch in g.patches:
    reading = patch.get_width()
    patch_h.append(reading)

idx_tallest = np.argmax(patch_h)   

g.patches[idx_tallest].set_facecolor('#a834a8')  
g.set_title('Most Common Education Level', fontsize = 18)
print("Count plots for highest_education")

In [None]:
# Visualise the most common imd band
g = sns.countplot(y= "imd_band", 
              data = studentInfo,
              color='grey',
              order = studentInfo.imd_band.value_counts().index);

# Bars are horizontal here, so the count is the bar width, not its height
patch_h = []    
for patch in g.patches:
    reading = patch.get_width()
    patch_h.append(reading)

idx_tallest = np.argmax(patch_h)   

g.patches[idx_tallest].set_facecolor('#a834a8')  
g.set_title('Socio-Economic Status', fontsize = 18)
print("Count plots for imd_band")

More students from lower income groups were registered. There seem to be a few redundant categories in the highest_education column, which will have to be addressed during cleaning. There is not much variation across the imd bands, but there are too many bands, so they should be condensed.

In [None]:
# Visualise the target variable

g = sns.countplot(x ='final_result', 
              data = studentInfo,
              color='grey',
              order = studentInfo.final_result.value_counts().index);

patch_h = []    
for patch in g.patches:
    reading = patch.get_height()
    patch_h.append(reading)

idx_tallest = np.argmax(patch_h)   

g.patches[idx_tallest].set_facecolor('#a834a8')  
g.set_title('Student Outcome', fontsize = 18)


Most students passed but the withdrawal and fail rates are high.

In [None]:
# Determine the overall fail or withdrawal rate

len(studentInfo[(studentInfo['final_result'] == 'Withdrawn') | (studentInfo['final_result'] == 'Fail') ]) / len(studentInfo)

Almost 53% of students either withdrew or failed.

In [None]:
# Determine the overall fail rate

len(studentInfo[(studentInfo['final_result'] == 'Fail') ]) / len(studentInfo)

Almost 23% of the students failed. 

In [None]:
# Determine the overall withdrawal rate

len(studentInfo[(studentInfo['final_result'] == 'Withdrawn') ]) / len(studentInfo)

31% of the students dropped out of their course.
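The three rates above can also be obtained in one step with `value_counts(normalize=True)`, which returns the share of each outcome. The sketch uses a made-up outcome series, so its numbers are not the dataset's rates.

```python
import pandas as pd

# Made-up outcomes for illustration; not taken from studentInfo.csv
outcomes = pd.Series(
    ["Pass", "Pass", "Withdrawn", "Fail", "Distinction", "Withdrawn", "Pass", "Fail"],
    name="final_result",
)

# Share of each outcome as a fraction of all rows
rates = outcomes.value_counts(normalize=True)
print(rates)

# Combined fail-or-withdrawal rate
fail_or_withdrawn = rates.reindex(["Fail", "Withdrawn"]).sum()
print(fail_or_withdrawn)
```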

************************************

# StudentRegistration table

This file contains information about the time when the student registered for the module presentation. 
For students who unregistered the date of unregistration is also recorded.  

In [None]:
# Load studentRegistration table
try:
    studentRegistration = pd.read_csv('studentRegistration.csv')
    print("The 'studentRegistration' table has {} samples with {} features each.".format(*studentRegistration.shape))
    studentRegistration.info()  # info() prints its report and returns None
    display(studentRegistration.head())
except FileNotFoundError:
    print("Dataset could not be loaded. Is the dataset missing?")

                Feature Description

code_module – an identification code for a module.

code_presentation - the identification code of the presentation.

id_student – a unique identification number for the student.

date_registration – the date of student’s registration on the module presentation, this is the number of days measured relative to the start of the module-presentation (e.g. the negative value -30 means that the student registered to module presentation 30 days before it started).

date_unregistration – date of student unregistration from the module presentation, this is the number of days measured relative to the start of the module-presentation. Students who completed the course have this field empty. Students who unregistered have Withdrawn as the value of the final_result column in the studentInfo.csv file.

In [None]:
# Store id_student as object dtype so it is treated as an identifier, not a number

studentRegistration['id_student'] = studentRegistration['id_student'].astype(object)

In [None]:
# Check if all student IDs recorded in the Registration tables are recorded in the Results table

compareCols(studentRegistration, results)

There are 5847 students missing from the Results table.

In [None]:
# Check if there any students from the Student Information table missing from the Results table

compareCols(studentInfo, results)

There are 5847 students recorded in the Student Information table who are missing from the Assessment Results table. Are they the same students?

In [None]:
# Pull out all unique student IDs
df1_IDs = studentRegistration['id_student'].unique()
df2_IDs = studentInfo['id_student'].unique()

# Compare the two lists
# (a) Find which student IDs are in studentRegistration but not in studentInfo
diff = set(df1_IDs).difference(set(df2_IDs))
# (b) Count how many are different
numberDiff = len(diff)

numberDiff

This confirms that they are the same students.

In [None]:
# Check to see their outcome

info_not_in_results = printDiffValues(studentInfo, results, 'id_student')

column = info_not_in_results['final_result']

unique, counts = np.unique(column, return_counts = True)

dict(zip(unique, counts))

Strangely, 2 students with no submissions recorded have passed their modules. Further investigation is required.

In [None]:
# Investigate whether there is a clerical error.
# If unregistration dates for these students are found, it is a clerical error.

info_not_in_results[info_not_in_results['final_result'] == 'Pass']

In [None]:
# Find unregistered date for id_student 1336190

reg_not_in_results = printDiffValues(studentRegistration, results, 'id_student')
reg_not_in_results[reg_not_in_results['id_student'] == 1336190]


In [None]:
# Find unregistered date for id_student 1777834

reg_not_in_results[reg_not_in_results['id_student'] == 1777834]

There are no unregistration dates for these 2 students, indicating that it is not a clerical error.

In [None]:
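# Change the dates into approximate calendar months for easier visualisation.
# The mapping assumes an October start (J presentations); for B presentations,
# which start in February, the labels are relative rather than calendar months.

# Sep Oct Nov Dec Jan Feb Mar Apr May Jun
def date_revision(date):
    # Every date before the start of the presentation is labelled 'Sep'.
    # A missing date (NaN) fails every comparison and falls through to 'Jun'.
    if date <= -1: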
        return 'Sep'
    elif date <= 31:
        return 'Oct'
    elif date <= 61:
        return 'Nov'
    elif date <= 92:
        return 'Dec'
    elif date <= 123:
        return 'Jan'
    elif date <= 151:
        return 'Feb'
    elif date <= 179:
        return 'Mar'
    elif date <= 210:
        return 'Apr'
    elif date <= 240:
        return 'May'
    else:
        return 'Jun'
    
def date_number(date):
    if date == 'Sep':
        return 1
    elif date == 'Oct':
        return 2
    elif date == 'Nov':
        return 3
    elif date == 'Dec':
        return 4
    elif date == 'Jan':
        return 5
    elif date == 'Feb':
        return 6
    elif date == 'Mar':
        return 7
    elif date == 'Apr':
        return 8
    elif date == 'May':
        return 9
    else:
        return 10

studentRegistration['reg_month'] = studentRegistration['date_registration'].apply(date_revision)
studentRegistration['unreg_month'] = studentRegistration['date_unregistration'].apply(date_revision)

In [None]:
# Inspect the date registration

g = sns.countplot(x ='reg_month', 
              data = studentRegistration,
              color='grey',
              order = studentRegistration.reg_month.value_counts().index);

patch_h = []    
for patch in g.patches:
    reading = patch.get_height()
    patch_h.append(reading)

idx_tallest = np.argmax(patch_h)   

g.patches[idx_tallest].set_facecolor('#a834a8')  
g.set_title('Student Registration Date (Month)', fontsize = 18)


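99% of registrations fall into the 'Sep' category, i.e. they took place before the start of the module presentation (date_revision labels every negative date as 'Sep').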

In [None]:
# Inspect the date of unregistration

g = sns.countplot(x ='unreg_month', 
              data = studentRegistration,
              color='grey',
              order = studentRegistration.unreg_month.value_counts().index);

patch_h = []    
for patch in g.patches:
    reading = patch.get_height()
    patch_h.append(reading)

idx_tallest = np.argmax(patch_h)   

g.patches[idx_tallest].set_facecolor('#a834a8')  
g.set_title('Student Unregistration Date (Month)', fontsize = 18)

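69% of records fall into the 'Jun' category. This category mostly holds students who never unregistered, because a missing date_unregistration fails every comparison in date_revision and falls through to 'Jun'. This is consistent with the 31% of students who dropped out of their course. Most dropouts occur in the first term with a steady number every other month of the year. The lowest dropout rate was in May.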
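A possible alternative to the if/elif chain in date_revision is `pd.cut`, which leaves missing dates as NaN instead of assigning them to a month. The sketch below reuses the same day thresholds as date_revision; the `edges`, `labels` and sample `days` are written out here for illustration only.

```python
import numpy as np
import pandas as pd

# Same thresholds as date_revision: each bin is (lower, upper], matching "<="
edges = [-np.inf, -1, 31, 61, 92, 123, 151, 179, 210, 240, np.inf]
labels = ["Sep", "Oct", "Nov", "Dec", "Jan", "Feb", "Mar", "Apr", "May", "Jun"]

# Invented day offsets, including one missing value
days = pd.Series([-30, 0, 31, 32, 200, 300, np.nan])

# Missing days stay missing rather than being labelled 'Jun'
months = pd.cut(days, bins=edges, labels=labels)
print(months)
```

Because the result is an ordered categorical, plots and sorts follow the Sep to Jun order without a separate month-number column.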

In [None]:
# Check to see whether registration months varied across module types

pd.crosstab(studentRegistration['code_module'], studentRegistration['reg_month']).plot.barh(stacked = True);


Only modules BBB, DDD, FFF and GGG had students registering in October, and they were a minority of students.

In [None]:
# Check to see whether unregistration months varied across module types


g= sns.catplot(y= 'unreg_month', col='code_module', col_wrap=4,
                data=studentRegistration[studentRegistration.unreg_month.notnull()],
                kind="count", height=3.5, aspect=.8, 
                palette= "tab20")
g.fig.subplots_adjust(top=0.9) 
g.fig.suptitle('Unregistrations per Module per Month', fontsize = 18)

It seems that AAA has no dropouts and module GGG has very few.

# VLE Interactions (studentVle table)

The studentVle.csv file contains information about each student’s interactions with the materials in the VLE. 

In [None]:
# Load vle_interaction table
try:
    vle_interaction = pd.read_csv('studentVle.csv')
    print("The 'vle_interaction' table has {} samples with {} features each.".format(*vle_interaction.shape))
    vle_interaction.info()  # info() prints its report and returns None
    display(vle_interaction.head())
except FileNotFoundError:
    print("Dataset could not be loaded. Is the dataset missing?")

                Feature Description

code_module – an identification code for a module.

code_presentation - the identification code of the module presentation.

id_student – a unique identification number for the student.

id_site - an identification number for the VLE material.

date – the date of student’s interaction with the material measured as the number of days since the start of the module-presentation.

sum_click – the number of times a student interacts with the material in that day.


In [None]:
# There is evidence of duplication in the data. Check what percentage of rows are duplicated.

print("Percentage of duplicated values in vle_interaction  ", vle_interaction.duplicated().sum() * 100 / len(vle_interaction))

Since a student can click on the same material more than once a day, duplicates will be retained but aggregated into a total clicks column later on.
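As a rough illustration of that aggregation, the sketch below measures the share of duplicated rows in a toy interaction table and then collapses the repeated rows by summing their clicks. The toy `clicks` frame reuses the real column names with invented values, and grouping by student, site and date is an assumption about the level at which the duplicates should be combined.

```python
import pandas as pd

# Toy interaction log: rows 0 and 1 are exact duplicates
clicks = pd.DataFrame({
    "id_student": [1, 1, 1, 2],
    "id_site": [10, 10, 11, 10],
    "date": [0, 0, 0, 3],
    "sum_click": [2, 2, 5, 1],
})

# Percentage of rows that repeat an earlier row exactly
dup_share = clicks.duplicated().mean() * 100
print(dup_share)

# Combine repeated rows instead of dropping them
daily = clicks.groupby(
    ["id_student", "id_site", "date"], as_index=False
)["sum_click"].sum()
print(daily)
```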

In [None]:
# Count the VLE interaction records in each module.
g = sns.countplot(x ='code_module', 
              data = vle_interaction,
              color='grey',
              order = vle_interaction.code_module.value_counts().index);

patch_h = []    
for patch in g.patches:
    reading = patch.get_height()
    patch_h.append(reading)

idx_tallest = np.argmax(patch_h)   

g.patches[idx_tallest].set_facecolor('#a834a8')  
g.set_title('VLE interaction per Module', fontsize = 18)


Modules BBB, DDD, FFF may have a heavier workload because there is more VLE interaction than other modules. 

In [None]:
# Create a column to indicate the average clicks per student

from statistics import mean

mean_click_per_student = vle_interaction\
.groupby(['code_module', 'code_presentation', 'id_student'])\
.agg(AVG_click = ("sum_click", "mean"))\
.reset_index()

mean_click_per_student = mean_click_per_student.round(0)

mean_click_per_student.head(3)

In [None]:
g = sns.countplot(x ='AVG_click', 
              data = mean_click_per_student,
              color='grey',
              order = mean_click_per_student.AVG_click.value_counts().index);

patch_h = []    
for patch in g.patches:
    reading = patch.get_height()
    patch_h.append(reading)

idx_tallest = np.argmax(patch_h)   

g.patches[idx_tallest].set_facecolor('#a834a8') 
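# Label only the eight most frequent values and blank the long tail to avoid clutter
g.set_xticklabels([str(int(v)) if i < 8 else '' for i, v in enumerate(mean_click_per_student.AVG_click.value_counts().index)])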
g.set_title('Average VLE Interaction', fontsize = 18)


The most common average is three clicks per material per day.

In [None]:
# Create a column to indicate the total clicks per student

total_click_per_student = vle_interaction\
.groupby(['code_module', 'code_presentation', 'id_student'])\
.agg(total_click = ("sum_click", "sum"))\
.reset_index()

total_click_per_student.head(3)

In [None]:
# Merge mean_click_per_student and total_click_per_student tables together 

total_click_per_student = pd.merge(total_click_per_student ,  mean_click_per_student , on=['code_module', 'code_presentation', 'id_student'], how='inner')
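The total and the average could also be computed in a single grouped aggregation, which would make the merge of the two per-student tables unnecessary. The sketch below shows this on a toy frame with invented values; it is an alternative, not the approach this notebook takes.

```python
import pandas as pd

# Toy interaction rows with the same column names as studentVle.csv
clicks = pd.DataFrame({
    "code_module": ["XXX"] * 4,
    "code_presentation": ["2014J"] * 4,
    "id_student": [1, 1, 2, 2],
    "sum_click": [2, 4, 1, 1],
})

# Named aggregation: one pass produces both columns
per_student = (
    clicks.groupby(["code_module", "code_presentation", "id_student"])
    .agg(total_click=("sum_click", "sum"), AVG_click=("sum_click", "mean"))
    .reset_index()
)
print(per_student)
```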

In [None]:
# Merge total_click_per_student and vle_interaction tables together

vle_interaction = pd.merge(total_click_per_student , vle_interaction, on=['code_module', 'code_presentation', 'id_student'], how='inner')

In [None]:
# Merge vle_interaction and Vle tables together to get a better idea of student activity

vle_interaction = vle_interaction.merge(materials[['id_site', 'activity_type']], on='id_site', how='left')

In [None]:
# Find the overall activity (total clicks) per activity type

overall_activity = pd.DataFrame(vle_interaction.groupby(['activity_type'])['sum_click'].sum()).reset_index()
overall_activity['percentage'] = round(overall_activity['sum_click'] / overall_activity['sum_click'].sum() * 100,2)

In [None]:
# Visualise the most common activity type

# sort df by sum_click column
overall_activity = overall_activity.sort_values(['sum_click']).reset_index(drop=True)
print (overall_activity)

OU content (`oucontent`), forums (`forumng`), quizzes and the homepage are the most common VLE activities.

In [None]:
g = sns.barplot(x=overall_activity.index, y=overall_activity.sum_click, color='grey')
g.get_yaxis().set_major_formatter(plt.FuncFormatter(lambda x, loc: "{:,}".format(int(x))))
g.set(xlabel="activity_type", ylabel='sum_click')
# Use the activity type names as x tick labels
g.set_xticklabels(overall_activity.activity_type)
for item in g.get_xticklabels(): item.set_rotation(90)
patch_h = []
for patch in g.patches:
    reading = patch.get_height()
    patch_h.append(reading)

idx_tallest = np.argmax(patch_h)   

g.patches[idx_tallest].set_facecolor('#a834a8')  
g.set_title('Most Common VLE Activity', fontsize = 18)
print("Count plots for VLE Activity")

In [None]:
# Change the date to month for visualisation

vle_interaction['month'] = vle_interaction['date'].apply(date_revision)
vle_interaction['month_no'] = vle_interaction['month'].apply(date_number)

In [None]:
# Create a dataframe to store month, month_no and sum_clicks to visualise activity

studentVle_merge_A_df = pd.DataFrame(vle_interaction.groupby(['month','month_no'])['sum_click'].sum())
studentVle_merge_A_df.reset_index(inplace = True)

In [None]:
# Sort by month number

studentVle_merge_A_df = studentVle_merge_A_df.sort_values('month_no')

In [None]:
fig= plt.figure(figsize=(10,6)) 
ax1 = plt.plot( 'month', 'sum_click', data=studentVle_merge_A_df, marker='', color='#003366', linewidth=2)
plt.axvspan('Apr', 'May', color='#cccccc', alpha=0.25)
plt.legend(labels =['All Students'])
plt.ylabel('No. of Clicks', fontsize=10)
plt.title('VLE Engagement Over Time', loc='center',pad=15, fontsize=15);

Activity was highest in October and lowest in June.

In [None]:
# Drop columns which won't provide any extra information after grouping by module presentation per student.
vle_interaction.drop(columns=['id_site', 'date', 'activity_type'], inplace=True)

In [None]:
# Create a dataframe which merges vle_interaction with student info to chart the VLE engagement of
# distinction and failing students

student_outcome = pd.merge(vle_interaction, studentInfo, on=['code_module', 'code_presentation', 'id_student'], how='left')

In [None]:
student_outcome.head(3)

In [None]:
# Create a subset of the required columns

student_outcome = student_outcome[['final_result', 'month', 'month_no', 'sum_click']]
student_outcome =student_outcome.sort_values('month_no')
student_outcome.head(3)

In [None]:
# Contrast distinction vs non-distinction

distinction = student_outcome['final_result'] == 'Distinction'
studentVle_merge_A_df = pd.DataFrame(student_outcome[distinction].groupby(['month','month_no'])['sum_click'].sum())
studentVle_merge_A_df.reset_index(inplace = True)

nodistinction = student_outcome['final_result'] != 'Distinction'
studentVle_merge_NA_df = pd.DataFrame(student_outcome[nodistinction].groupby(['month','month_no'])['sum_click'].sum())
studentVle_merge_NA_df.reset_index(inplace = True)
studentVle_merge_NA_df['sum_click'] = round(studentVle_merge_NA_df['sum_click'] / 16)
studentVle_merge_NA_df


studentVle_merge_A_df = studentVle_merge_A_df.sort_values('month_no')
studentVle_merge_NA_df = studentVle_merge_NA_df.sort_values('month_no')

In [None]:
# Display visualisation

fig= plt.figure(figsize=(10,6)) 
ax1 = plt.plot( 'month', 'sum_click', data=studentVle_merge_A_df, marker='', color='#003366', linewidth=2)
ax2 = plt.plot( 'month', 'sum_click', data=studentVle_merge_NA_df, marker='', color='#cccccc', linewidth=2)
plt.axvspan('Apr', 'May', color='#cccccc', alpha=0.25)
plt.legend(labels =['Distinction','Non-Distinction'])
plt.ylabel('No. of Clicks', fontsize=10)
plt.title('VLE Interaction of Distinction Students', loc='center',pad=15, fontsize=15);
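
The non-distinction series above is divided by a hard-coded 16 before plotting. A minimal sketch of one alternative, assuming the aim is to compare groups of different sizes, divides each group's monthly clicks by the number of distinct students in that group. The toy frame, the `group` column and the `clicks_per_student` column are illustrative only; note that the notebook's `student_outcome` subset does not keep `id_student`, so that column would need to be retained.

```python
import pandas as pd

# Toy stand-in for student_outcome with id_student kept (an assumption)
toy = pd.DataFrame({
    "id_student":   [1, 1, 2, 3, 3, 4],
    "final_result": ["Distinction", "Distinction", "Pass", "Fail", "Fail", "Pass"],
    "month_no":     [1, 2, 1, 1, 2, 2],
    "sum_click":    [10, 20, 4, 2, 6, 8],
})

# Label every row as Distinction or Non-Distinction
toy["group"] = toy["final_result"].where(
    toy["final_result"] == "Distinction", "Non-Distinction")

# Total clicks per group and month
monthly = (toy.groupby(["group", "month_no"])["sum_click"]
              .sum()
              .reset_index())

# Number of distinct students in each group
group_size = toy.groupby("group")["id_student"].nunique()

# Clicks per student: divide by the size of the row's own group
monthly["clicks_per_student"] = monthly["sum_click"] / monthly["group"].map(group_size)
print(monthly)
```

With this scaling both lines are on a per-student basis, so no fixed divisor is needed.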

In [None]:
# Create a 'student_failed' column which indicates whether the student failed the course. 
# '0': did not fail (Distinction or Pass), '1': failed (any other result, including Withdrawn)

student_outcome['student_failed'] = [0 if result in ['Distinction', 'Pass'] else 1 for result in student_outcome['final_result']]

In [None]:
# Contrast Fail vs Not Fail

Fail = student_outcome['student_failed'] == 1
studentVle_merge_A_df = pd.DataFrame(student_outcome[Fail].groupby(['month','month_no'])['sum_click'].sum())
studentVle_merge_A_df.reset_index(inplace = True)

noFail = student_outcome['student_failed'] == 0
studentVle_merge_NA_df = pd.DataFrame(student_outcome[noFail].groupby(['month','month_no'])['sum_click'].sum())
studentVle_merge_NA_df.reset_index(inplace = True)
studentVle_merge_NA_df['sum_click'] = round(studentVle_merge_NA_df['sum_click'] / 16)
studentVle_merge_NA_df

studentVle_merge_A_df = studentVle_merge_A_df.sort_values('month_no')
studentVle_merge_NA_df = studentVle_merge_NA_df.sort_values('month_no')

In [None]:
# Plot the chart

fig= plt.figure(figsize=(10,6)) 
ax1 = plt.plot( 'month', 'sum_click', data=studentVle_merge_A_df, marker='', color='#003366', linewidth=2)
ax2 = plt.plot( 'month', 'sum_click', data=studentVle_merge_NA_df, marker='', color='#cccccc', linewidth=2)
plt.axvspan('Apr', 'May', color='#cccccc', alpha=0.25)
plt.legend(labels =['Failed','Did not Fail'])
plt.ylabel('No. of Clicks', fontsize=10)
plt.title('VLE Engagement of Failing Students', loc='center',pad=15, fontsize=15);

In [None]:
# Create a 'student_Withdrawn' column which indicates whether the student withdrew from the course. 
# '0': did not withdraw, '1': withdrew

student_outcome['student_Withdrawn'] = [1 if result in ['Withdrawn'] else 0  for result in student_outcome['final_result']]

In [None]:
# Contrast Withdrawn vs completed course

Withdrawn = student_outcome['student_Withdrawn'] == 1
studentVle_merge_A_df = pd.DataFrame(student_outcome[Withdrawn].groupby(['month','month_no'])['sum_click'].sum())
studentVle_merge_A_df.reset_index(inplace = True)

noWithdrawn = student_outcome['student_Withdrawn'] == 0
studentVle_merge_NA_df = pd.DataFrame(student_outcome[noWithdrawn].groupby(['month','month_no'])['sum_click'].sum())
studentVle_merge_NA_df.reset_index(inplace = True)
studentVle_merge_NA_df['sum_click'] = round(studentVle_merge_NA_df['sum_click'] / 16)
studentVle_merge_NA_df

studentVle_merge_A_df = studentVle_merge_A_df.sort_values('month_no')
studentVle_merge_NA_df = studentVle_merge_NA_df.sort_values('month_no')

In [None]:
# Plot the chart

fig= plt.figure(figsize=(10,6)) 
ax1 = plt.plot( 'month', 'sum_click', data=studentVle_merge_A_df, marker='', color='#003366', linewidth=2)
ax2 = plt.plot( 'month', 'sum_click', data=studentVle_merge_NA_df, marker='', color='#cccccc', linewidth=2)
plt.axvspan('Apr', 'May', color='#cccccc', alpha=0.25)
plt.legend(labels =['Withdrew','Completed Course'])
plt.ylabel('No. of Clicks', fontsize=10)
plt.title('VLE Engagement of Dropouts', loc='center',pad=15, fontsize=15);

In [None]:
# Drop unnecessary columns 
vle_interaction.drop(columns=['month', 'month_no'], inplace=True)

In [None]:
# Drop the duplicate values because it will overly complicate the grade prediction process if a student is included more than once

vle_interaction = vle_interaction.drop_duplicates(subset='id_student', keep= 'first')
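
Deduplicating on `id_student` alone keeps one row per student across all modules, so a student enrolled in more than one module presentation retains only the first. The sketch below, on illustrative toy data, shows how the two choices of `subset` differ; the `per_student` and `per_enrolment` names are hypothetical.

```python
import pandas as pd

# Toy data: student 1 appears in two module presentations
toy = pd.DataFrame({
    "code_module":       ["AAA", "AAA", "BBB", "BBB"],
    "code_presentation": ["2013J", "2013J", "2014B", "2014B"],
    "id_student":        [1, 1, 1, 2],
    "total_click":       [30, 30, 12, 7],
})

# One row per student, as in the cell above
per_student = toy.drop_duplicates(subset="id_student", keep="first")

# One row per student per module presentation
keys = ["code_module", "code_presentation", "id_student"]
per_enrolment = toy.drop_duplicates(subset=keys, keep="first")

print(len(per_student), len(per_enrolment))
```

Which of the two is appropriate depends on whether the prediction target is a student or an enrolment.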

In [None]:
vle_interaction.info()

*******************************

# CREATING THE FINAL DATASET

# Merging Tables Together

Merge the studentRegistration table with the courses table using an inner join; the result is stored in regCourses.

In [None]:
# Merge with an inner join
regCourses = pd.merge(studentRegistration , courses, on=['code_module', 'code_presentation'], how='inner')


Merge regCourses with the studentInfo table using an inner join.

In [None]:
# Merge with an inner join
regCoursesInfo = pd.merge(regCourses, studentInfo, on=['code_module', 'code_presentation', 'id_student'], how='inner')
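
An inner join drops rows that have no match without reporting it. As a hedged sketch on toy data, the `validate` and `indicator` arguments of `pd.merge` can be used to check a join before relying on it. The `reg` and `crs` frames and their values are illustrative stand-ins, not the notebook's tables.

```python
import pandas as pd

# Toy registrations: student 3 is registered on a course missing from crs
reg = pd.DataFrame({
    "code_module":       ["AAA", "AAA", "BBB"],
    "code_presentation": ["2013J", "2013J", "2014B"],
    "id_student":        [1, 2, 3],
})
crs = pd.DataFrame({
    "code_module":       ["AAA"],
    "code_presentation": ["2013J"],
    "length":            [268],
})

# validate raises MergeError if a course key is duplicated on the right;
# indicator adds a _merge column showing where each row came from
checked = pd.merge(reg, crs, on=["code_module", "code_presentation"],
                   how="left", validate="many_to_one", indicator=True)

# Rows that an inner join would have dropped
unmatched = checked[checked["_merge"] == "left_only"]
print(unmatched)
```

If `unmatched` is empty, the left and inner joins return the same rows.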

Merge assessments and results tables. 

In [None]:
# merge with an inner join
assResults = pd.merge(assessments, results, on=['id_assessment'], how='inner')
# Reorder the columns
assResults = assResults[['id_student', 'code_module', 'code_presentation', 'id_assessment', 'assessment_type', 'date', 'date_submitted', 'weight', 'is_banked', 'score']]

# Creating New Columns

Create a weighted score for each student per module presentation: the sum of weight × score over the recorded assessments, divided by the total recorded weight of the module.

In [None]:
# Make a copy of the dataset so that helper columns are not added to assResults
scores = assResults.copy()

# Count how many exams there are in Results for every module presentation
scores[scores['assessment_type'] == 'Exam'][['code_module', 'code_presentation', 'id_assessment']]\
.groupby(['code_module', 'code_presentation'])\
.nunique()

CCC module only has results for 1 exam when the module should have 2 exams in total.

DDD module has results for the final exam (DDD module should have one exam in total).

In [None]:
### Make helper columns ###
# (a) Add column multiplying weight and score
scores['weight*score'] = scores['weight']*scores['score']
# (b) Aggregate recorded weight*score per student
    # per module presentation
sum_scores = scores\
.groupby(['id_student', 'code_module', 'code_presentation'])\
.agg(weightByScore = ('weight*score', sum))\
.reset_index()
# (c) Calculate total recorded weight of module
# (c.i) Get total weight of modules
total_weight = assessments\
.groupby(['code_module', 'code_presentation'])\
.agg(total_weight = ('weight', sum))\
.reset_index()
# (c.ii) Subtract 100 to account for missing exams
total_weight['total_weight'] = total_weight['total_weight']-100
# (c.iii) Set the total weight of module DDD to 200, since its exam results are recorded
total_weight.loc[(total_weight.code_module == 'DDD'), 'total_weight'] = 200

### Calculate weighted score ###
# (a) Merge sum_scores and total_weight tables
score_weights = pd.merge(sum_scores, total_weight, on=['code_module', 'code_presentation'], how='inner')
# (b) Calculate weighted score
score_weights['weighted_score'] = score_weights['weightByScore'] / score_weights['total_weight']
# (c) Drop helper columns
score_weights.drop(columns=['weightByScore', 'total_weight'], inplace=True)

In [None]:
score_weights.head(7)
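
As a worked illustration of the calculation above, the sketch below uses made-up numbers for one student: two assessments weighted 40 and 60, scored 50 and 80, in a module whose total recorded weight is assumed to be 100. The weighted score is then (40 × 50 + 60 × 80) / 100.

```python
import pandas as pd

# Made-up assessment results for a single student
toy = pd.DataFrame({
    "id_student": [1, 1],
    "weight":     [40, 60],
    "score":      [50, 80],
})
toy["weight*score"] = toy["weight"] * toy["score"]

total_weight = 100  # assumed total recorded weight of the module

# Sum of weight*score per student, divided by the total weight
weighted_score = toy.groupby("id_student")["weight*score"].sum() / total_weight
print(weighted_score.loc[1])  # 68.0
```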

Create late_rate_per_student to indicate what proportion of assessments each student submitted late.

In [None]:
# Calculate the number of days between the submission date and the assessment deadline
lateSubmission = assResults.assign(submission_days=assResults['date_submitted']-assResults['date'])
# Make a column indicating if the submission was late or not 
lateSubmission = lateSubmission.assign(late_submission=lateSubmission['submission_days'] > 0)

# Aggregate per student per module presentation
total_late_per_student = lateSubmission\
.groupby(['id_student', 'code_module', 'code_presentation'])\
.agg(total_late_submission = ('late_submission', sum))\
.reset_index()

# Make a df with total number of all assessments per student per module presentation
total_count_assessments = lateSubmission[['id_student', 'code_module', 'code_presentation', 'id_assessment']]\
.groupby(['id_student', 'code_module', 'code_presentation'])\
.size()\
.reset_index(name='total_assessments')

# Merge df with total late assessments and total count assessments
late_rate_per_student = pd.merge(total_late_per_student, total_count_assessments, on=['id_student', 'code_module', 'code_presentation'], how='inner')
# Make a new column with late submission rate
late_rate_per_student['late_rate'] = late_rate_per_student['total_late_submission'] / late_rate_per_student['total_assessments']


late_rate_per_student

In [None]:
# Treat null values in the late_rate column as 100% late 
# because they did not make any submission

late_rate_per_student['late_rate'] = late_rate_per_student['late_rate'].fillna(1.0)
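
The late rate can also be obtained in one step, because the mean of a boolean column is the share of `True` values. A minimal sketch on made-up dates (the toy frame is illustrative, not taken from the dataset):

```python
import pandas as pd

toy = pd.DataFrame({
    "id_student":     [1, 1, 1, 2, 2],
    "date":           [19, 54, 117, 19, 54],   # deadline (day of the module)
    "date_submitted": [18, 60, 120, 19, 50],   # day the student submitted
})

# A submission is late when it arrives after the deadline
toy["late_submission"] = (toy["date_submitted"] - toy["date"]) > 0

# Mean of a boolean column = proportion of late submissions
late_rate = toy.groupby("id_student")["late_submission"].mean()
print(late_rate)
```

Student 1 is late on two of three assessments and student 2 on none, so the rates are 2/3 and 0.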

Create fail_rate_per_student to indicate what proportion of assessments each student failed (score below 40).

In [None]:
# Mark assessments with a score below 40 as failed
passRate = assResults
passRate = passRate.assign(fail=passRate['score'] < 40)

# Aggregate per student per module presentation
total_fails_per_student = passRate\
.groupby(['id_student', 'code_module', 'code_presentation'])\
.agg(total_fails = ("fail",sum))\
.reset_index()

total_fails_per_student.head()

# Merge df with total fails and total count assessments
fail_rate_per_student = pd.merge(total_fails_per_student, total_count_assessments, on=['id_student', 'code_module', 'code_presentation'], how='inner')
# Make a new column with the fail rate
fail_rate_per_student['fail_rate'] = fail_rate_per_student['total_fails'] / fail_rate_per_student['total_assessments']
# Drop helper columns
fail_rate_per_student.drop(columns=['total_fails', 'total_assessments'], inplace=True)

fail_rate_per_student
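
One detail of the `score < 40` comparison is worth checking if any scores are still missing at this point: a comparison with `NaN` evaluates to `False`, so an assessment without a score is counted as not failed. The sketch below shows the effect on toy data; dropping missing scores is only one possible alternative and is an assumption, not the notebook's choice.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "id_student": [1, 1, 1],
    "score":      [35.0, 70.0, np.nan],
})

# NaN < 40 is False, so the missing score is not counted as a fail
fail_default = (toy["score"] < 40).mean()

# Alternative: leave assessments without a score out of the rate
fail_scored_only = (toy["score"].dropna() < 40).mean()

print(fail_default, fail_scored_only)
```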

# Merge All Tables

Merge the assessment tables (weighted score, late rate and fail rate).

In [None]:
assessments = pd.merge(score_weights, late_rate_per_student, on=['id_student', 'code_module', 'code_presentation'], how='inner')
assessments = pd.merge(assessments, fail_rate_per_student, on=['id_student', 'code_module', 'code_presentation'], how='inner')

assessments.head()

In [None]:
merged = pd.merge(regCoursesInfo, vle_interaction, on=['id_student', 'code_module', 'code_presentation'], how='left')

In [None]:
merged = pd.merge(merged, assessments, on=['id_student', 'code_module', 'code_presentation'], how='left')

In [None]:
data= merged

In [None]:
data.head()

In [None]:
data.info()

In [None]:
# Create a new column called 'procrastination' which flags a student whose average number of
# VLE clicks is below 3.0 or who has at least one late submission.
# Columns are selected by position, so indices 18 and 23 must match the layout shown by data.info().

data['procrastination'] = (data.iloc[:, 18] < 3.0) | (data.iloc[:, 23] > 0.0)

In [None]:
data['procrastination'].value_counts()
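
Selecting columns by name rather than by position makes a rule like this easier to audit, and makes the difference between an "or" rule and an "and" rule explicit. The sketch below assumes the two columns involved are `AVG_click` and `late_rate`; the toy values are made up.

```python
import pandas as pd

toy = pd.DataFrame({
    "id_student": [1, 2, 3, 4],
    "AVG_click":  [2.0, 5.0, 1.0, 4.0],
    "late_rate":  [0.5, 0.0, 0.0, 0.25],
})

low_clicks = toy["AVG_click"] < 3.0   # fewer than 3 clicks on average
any_late = toy["late_rate"] > 0.0     # at least one late submission

either = low_clicks | any_late   # flagged if either condition holds
both = low_clicks & any_late     # flagged only if both conditions hold

print(either.tolist(), both.tolist())
```

On this toy data the "or" rule flags three of the four students, while the "and" rule flags only one.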

In [None]:
data['procrastination'].value_counts().plot.bar()

Most students were not procrastinating. 

In [None]:
print("The final dataset has {} samples with {} features each.".format(*data.shape))

In [None]:
data.info()

*******************************

# Format the dataset and send to CSV

In [None]:
# Reorder the columns so that id_student is listed first

col_list = list(data.columns)
col_list.insert(0,col_list.pop(col_list.index('id_student')))
data = data.loc[:,col_list]

In [None]:
data.head()

In [None]:
# Create new csv file containing the final dataset

data.to_csv('oulad_final.csv', index=False)

# END