## Notebook 2.2 Data Understanding and Preprocessing of Support Tables

For all intents and purposes, this is the second real notebook of the thesis work. In it, we look at the support tables that are part of the original database. These tables hold information about courses and student performance, both of which provide meaningful features for our project.

#### 1. We are already familiar with the logs

Before going further, we should assess the remaining tables present in the database.

Recall that **logs record interactions with the system, and we are looking for ways to determine whether these interactions can help educators identify at-risk students and high-performing students.**

Thus, to make the most of the logs, we will need to perform different segmentations, and it is likely that we will need to perform some filtering.

### To do that, we will take a look at all tables

We will look at all tables and all columns to make a preliminary assessment of the utility of the available elements.
In general, these are support elements that will be used sparingly, as most of the relevant information is present in the logs.

The observation of each table will follow the same chain of commands:

1. info -> to observe the count and datatype of each column,
2. describe -> to return the most notable descriptive statistics of each column,
3. a look at the raw data (at least the visible rows).
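
The chain above can be wrapped in a small helper so that every table is inspected in the same way. This is only a minimal sketch: the function name `overview` and the toy table are illustrative and are not part of the original notebook.

```python
import pandas as pd

def overview(df, n_rows=5):
    """Print dtypes and non-null counts, return (descriptive statistics, first rows)."""
    df.info()                               # count and datatype of each column
    stats = df.describe(include='all').T    # descriptive statistics, one row per column
    return stats, df.head(n_rows)           # raw data: the first visible rows

# toy table standing in for one of the support tables
toy = pd.DataFrame({'id': ['1', '2', '3'], 'grade': [10.0, None, 7.5]})
stats, head = overview(toy)
print(stats)
print(head)
```
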

#### 2. We'll start this notebook by importing all relevant packages and data

All data is stored in the csv files that were exported in the previous notebook. 

To minimize unnecessary steps, as we import these csv files we will immediately apply the following to each dataset:
1. The first unnamed column is removed,
2. All columns that are made entirely of missing values are removed - we have detected some,
3. All numerical columns that are immediately recognized as categorical (or likely to be categorical) are declared as objects - this does not rule out converting other features to objects upon further assessment,
4. All features that display no null values and have a single value are promptly removed as well,
5. Features related to time are converted to the appropriate format - this is ultimately an ad-hoc assessment, but an important one to make.
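
The clean-up steps listed above could also be collected in a single loading function. The sketch below is an assumption about how that might look, not the code used in this notebook: `load_support_table`, its parameters and the toy csv are all hypothetical, and the single-value check in step 4 is automated here even though the notebook performs it by hand.

```python
import io
import pandas as pd

def load_support_table(source, id_cols=(), time_cols=()):
    """Hypothetical loader applying the clean-up steps listed above."""
    # 3. identifier-like numeric columns are read as objects
    df = pd.read_csv(source, dtype={col: object for col in id_cols})
    # 1. drop the unnamed index column written by the previous export
    df = df.drop(columns='Unnamed: 0', errors='ignore')
    # 2. drop columns made entirely of missing values
    df = df.dropna(how='all', axis=1)
    # 4. drop columns with no nulls and a single distinct value
    constant = [c for c in df.columns
                if df[c].notna().all() and df[c].nunique() == 1]
    df = df.drop(columns=constant)
    # 5. convert Unix timestamps (in seconds) to datetimes
    for col in time_cols:
        if col in df.columns:
            df[col] = pd.to_datetime(df[col], unit='s', errors='coerce')
    return df

# made-up csv with an unnamed index, an empty column and a constant column
csv_text = """,id,roleid,empty,constant,timemodified
0,1,5,,0,1404213201
1,2,3,,0,1406352000
"""
toy = load_support_table(io.StringIO(csv_text), id_cols=['id', 'roleid'],
                         time_cols=['timemodified'])
print(toy.dtypes)
```
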

In [1]:
#import libs
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

sns.set()

import warnings
warnings.filterwarnings('ignore')

In [2]:
#other tables with support information
context_table = pd.read_csv('../Data/R_Gonz_data_mdl_context.csv', #context table -> unclear utility
                           dtype = {
                                   'id': object,
                                   'contextlevel': object,
                                   'instanceid': object,
                                   'path': object,
                                   'depth': object,                       
                                   },).drop('Unnamed: 0', axis = 1).dropna(how = 'all', axis = 1)

course_table = pd.read_csv('../Data/R_Gonz_data_mdl_course.csv', #course table -> unclear utility
                           dtype = {
                                   'id': object,
                                   'category': object,
                                   'outcomeid': object,
                                   'summaryformat': object,
                                   'showgrades': object,
                                   'newsitems': object,
                                   'legacyfiles': object,
                                   'marker': object,
                                   'showreports': object,          
                                   'visible': object,
                                   'visibleold': object,
                                   'groupmode': object,
                                   'groupmodeforce': object,
                                   'defaultgroupingid': object,
                                   'lang': object,          
                                   'requested': object,
                                   'enablecompletion': object,
                                   'completionnotify': object,                              
                                   },).drop('Unnamed: 0', axis = 1).dropna(how = 'all', axis = 1) 

course_mod_table = pd.read_csv('../Data/R_Gonz_data_mdl_course_modules.csv', #course module table -> unclear utility
                           dtype = {
                                   'id': object,
                                   'module': object,
                                   'instance': object,
                                   'course': object,
                                   'section': object,
                                   'idnumber': object,
                                   'indent': object,
                                   'visible': object,
                                   'visibleold': object,
                                   'groupmode': object,
                                   'groupingin': object, #no column has this name (the data has groupingid), so this entry has no effect
                                   'groupmembersonly': object,
                                   'completion': object,
                                   'completionview': object,
                                   'showavailability': object,
                                   'showdescription': object,
                                   },).drop('Unnamed: 0', axis = 1).dropna(how = 'all', axis = 1) 

grades_table = pd.read_csv('../Data/R_Gonz_data_mdl_grade_grades.csv',  # grade table -> unclear utility
                           dtype = {
                                   'id': object,
                                   'itemid': object,
                                   'userid': object,
                                   'usermodified': object,
                                   'rawscaleid': object,
                                   'hidden': object,
                                   'feedback': object,
                                   'feedbackformat': object,
                                   'informationformat': object, 
                                   },).drop('Unnamed: 0', axis = 1).dropna(how = 'all', axis = 1)

grade_item_table = pd.read_csv('../Data/R_Gonz_data_mdl_grade_items.csv', # grade_items table -> unclear utility
                           dtype = {
                                   'id': object,
                                   'itemid': object,
                                   'categoryid': object,
                                   'courseid': object,
                                   'idnumber': object,
                                   'iteminstance' : object,
                                   'itemnumber' : object,
                                   'gradetype': object,
                                   'scaleid': object,
                                   'multfactor': object,
                                   'outcomeid': object,                          
                                   },).drop('Unnamed: 0', axis = 1).dropna(how = 'all', axis = 1) 

role_assign_table = pd.read_csv('../Data/R_Gonz_data_mdl_role_assignments.csv', # role assignments table -> unclear utility
                           dtype = {
                                   'id': object,
                                   'roleid': object,
                                   'contextid': object,
                                   'itemid': object,
                                   'userid': object,
                                   'modifierid': object,
                                   },).drop('Unnamed: 0', axis = 1).dropna(how = 'all', axis = 1)

#### First, the role assignment table

The role assignment table records the role assigned to each user.
With it, we can see which roles exist and, ultimately, filter users by role, which this table stores in the column roleid.

In [3]:
role_assign_table.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 219297 entries, 0 to 219296
Data columns (total 9 columns):
 #   Column        Non-Null Count   Dtype  
---  ------        --------------   -----  
 0   id            219297 non-null  object 
 1   roleid        219297 non-null  object 
 2   contextid     219297 non-null  object 
 3   userid        219297 non-null  object 
 4   timemodified  219297 non-null  float64
 5   modifierid    219297 non-null  object 
 6   component     194425 non-null  object 
 7   itemid        219297 non-null  object 
 8   sortorder     219297 non-null  float64
dtypes: float64(2), object(7)
memory usage: 15.1+ MB


In [4]:
role_assign_table.drop([
                    'sortorder',
                    ],
                    axis = 1, inplace = True)

#timemodified seems to be a time feature, so we convert it to datetime
role_assign_table['timemodified'] = pd.to_datetime(role_assign_table['timemodified'], unit = 's', errors = 'coerce')

We see that this database has 30813 unique users and 3 unique roles. As the data that is part of this database deals with information collected throughout one school-year, it is likely that the most represented role is the role of student.

We will find the most common value in roleid and filter the role assignment table, keeping only the rows where that role is present.
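
A minimal sketch, on made-up data, of how the counts of unique users and roles can be obtained and of how the filter on the most common role behaves. That the most common role is the student role remains the assumption stated above; the names `toy_roles`, `toy_students` and `toy_others` are illustrative.

```python
import pandas as pd

# made-up role assignments; role '5' is the most frequent one
toy_roles = pd.DataFrame({
    'userid': ['10', '11', '12', '13', '10'],
    'roleid': ['5', '5', '5', '3', '3'],
})

print(toy_roles['userid'].nunique())       # number of unique users
print(toy_roles['roleid'].value_counts())  # number of rows per role

# mode() returns a Series, so a tie would yield several roles
most_common = toy_roles['roleid'].mode()
is_student = toy_roles['roleid'].isin(most_common)
toy_students = toy_roles[is_student]
toy_others = toy_roles[~is_student]
```

Because `mode()` can return more than one value, it is worth checking that it returns a single role before treating the result as the student population.
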

In [5]:
#get most common role and filter appropriately
student_role = list(role_assign_table['roleid'].mode())

#we will create 2 dataframes - one with the students and another with other members
other_roles_tables = role_assign_table[~(role_assign_table['roleid'].isin(student_role))]
role_assign_table = role_assign_table[role_assign_table['roleid'].isin(student_role)]

#we will also create a list with all users whose role is student
students = role_assign_table['userid'].unique()

In [6]:
role_assign_table.describe(include = 'all', datetime_is_numeric = True).T

Unnamed: 0,count,unique,top,freq,mean,min,25%,50%,75%,max
id,208550,208550.0,2.0,1.0,NaT,NaT,NaT,NaT,NaT,NaT
roleid,208550,1.0,5.0,208550.0,NaT,NaT,NaT,NaT,NaT,NaT
contextid,208550,4215.0,229861.0,24826.0,NaT,NaT,NaT,NaT,NaT,NaT
userid,208550,29062.0,38881.0,33.0,NaT,NaT,NaT,NaT,NaT,NaT
timemodified,208550,,,,2014-09-21 10:15:55.523346944,2014-07-01 11:13:21,2014-08-05 22:34:00,2014-09-03 03:15:18,2014-10-15 18:15:47.750000128,2015-07-30 23:41:19
modifierid,208550,27.0,0.0,208416.0,NaT,NaT,NaT,NaT,NaT,NaT
component,183687,1.0,enrol_database,183687.0,NaT,NaT,NaT,NaT,NaT,NaT
itemid,208550,4208.0,0.0,24863.0,NaT,NaT,NaT,NaT,NaT,NaT


In [7]:
#use this cell to write any additional piece of code that may be required

**Next, we'll consider grades**

Student performance is, in general, measured by the student's grade. So... how do we measure grades?
As all we have is data from Moodle, it is important that we can either find or calculate the targets from Moodle data.

So, as a first step, we'll have to identify which courses have graded assignments and slice the course list for those.
We can deal with calculating our target at a later stage.

In [8]:
grades_table.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 437650 entries, 0 to 437649
Data columns (total 20 columns):
 #   Column             Non-Null Count   Dtype  
---  ------             --------------   -----  
 0   id                 437650 non-null  object 
 1   itemid             437650 non-null  object 
 2   userid             437650 non-null  object 
 3   rawgrade           137820 non-null  float64
 4   rawgrademax        437650 non-null  float64
 5   rawgrademin        437650 non-null  float64
 6   rawscaleid         88518 non-null   object 
 7   usermodified       437650 non-null  object 
 8   finalgrade         236668 non-null  float64
 9   hidden             437650 non-null  object 
 10  locked             437650 non-null  float64
 11  locktime           437650 non-null  float64
 12  exported           437650 non-null  float64
 13  overridden         437650 non-null  float64
 14  excluded           437650 non-null  float64
 15  feedback           20273 non-null   object 
 16  fe

In [9]:
#the informationformat feature has no null values and is a single value feature, so we remove it
grades_table.drop('informationformat', axis = 1, inplace = True)

#timecreated and timemodified seem to be time features, so we convert them to datetime
grades_table['timecreated'] = pd.to_datetime(grades_table['timecreated'], unit = 's', errors = 'coerce')
grades_table['timemodified'] = pd.to_datetime(grades_table['timemodified'], unit = 's', errors = 'coerce')

#experimental - to delete if need be - only keeping students
grades_table = grades_table[grades_table['userid'].isin(students)]

In [10]:
grades_table.describe(include ='all', datetime_is_numeric = True).T

Unnamed: 0,count,unique,top,freq,mean,min,25%,50%,75%,max,std
id,436576.0,436576.0,160262.0,1.0,,,,,,,
itemid,436576.0,12449.0,12096.0,669.0,,,,,,,
userid,436576.0,17847.0,71424.0,170.0,,,,,,,
rawgrade,137747.0,,,,49.95053,-11.73913,4.79899,10.0,60.0,1001.0,120.249099
rawgrademax,436576.0,,,,88.604383,0.0,10.0,100.0,100.0,1001.0,179.80363
rawgrademin,436576.0,,,,0.207278,-1.0,0.0,0.0,0.0,5.0,0.409931
rawscaleid,88426.0,186.0,4.0,32572.0,,,,,,,
usermodified,436576.0,15316.0,0.0,62082.0,,,,,,,
finalgrade,236432.0,,,,43.932876,0.0,2.5,8.07,51.0,1140.8,112.841518
hidden,436576.0,2.0,0.0,416015.0,,,,,,,


In [11]:
grades_table

Unnamed: 0,id,itemid,userid,rawgrade,rawgrademax,rawgrademin,rawscaleid,usermodified,finalgrade,hidden,locked,locktime,exported,overridden,excluded,feedback,feedbackformat,timecreated,timemodified
25,160262.0,24765.0,4.0,,100.0,0.0,,69457.0,0.25000,0.0,0.0,0.0,0.0,0.0,0.0,,0.0,NaT,2015-01-13 11:35:05
26,186592.0,24769.0,4.0,2.0,2.0,0.0,,4.0,2.00000,0.0,0.0,0.0,0.0,0.0,0.0,,0.0,2015-02-10 10:45:00,2015-02-10 10:45:00
27,216725.0,24770.0,4.0,,100.0,0.0,,4.0,,0.0,0.0,0.0,0.0,0.0,0.0,,1.0,2015-02-24 12:00:51,1970-01-01 00:00:00
28,216721.0,24771.0,4.0,,100.0,0.0,,69457.0,5.00000,0.0,0.0,0.0,0.0,0.0,0.0,,0.0,NaT,2015-01-13 11:34:44
29,4.0,24866.0,4.0,,100.0,0.0,,3.0,52.02941,0.0,0.0,0.0,0.0,0.0,0.0,,0.0,NaT,2015-06-01 08:18:59
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
437645,458230.0,34921.0,78663.0,,100.0,0.0,,0.0,4.16667,0.0,0.0,0.0,0.0,0.0,0.0,,0.0,NaT,1970-01-01 00:00:00
437646,458229.0,34922.0,78663.0,2.5,60.0,0.0,,78663.0,2.50000,0.0,0.0,0.0,0.0,0.0,0.0,,0.0,2015-07-26 19:33:35,2015-07-26 19:33:35
437647,458257.0,34921.0,81739.0,,100.0,0.0,,0.0,56.37500,0.0,0.0,0.0,0.0,0.0,0.0,,0.0,NaT,1970-01-01 00:00:00
437648,458256.0,34922.0,81739.0,40.2,60.0,0.0,,81739.0,40.20000,0.0,0.0,0.0,0.0,0.0,0.0,,0.0,2015-07-28 09:15:21,2015-07-28 09:15:22
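
Several timemodified values above read 1970-01-01 00:00:00, which is what a raw value of 0 becomes when converted with unit = 's'. The sketch below, on made-up values, shows this behaviour and one possible way to turn such zeros into missing values before converting. Treating 0 as "no timestamp" is an assumption, and the notebook itself does not apply this step.

```python
import numpy as np
import pandas as pd

# made-up raw timestamps: a real value, a zero and a missing value
raw = pd.Series([1421148905, 0, np.nan])

as_is = pd.to_datetime(raw, unit='s', errors='coerce')
print(as_is)   # the 0 becomes 1970-01-01 00:00:00, the NaN becomes NaT

# assumption: a raw value of 0 means "no timestamp", so it is masked before converting
masked = pd.to_datetime(raw.replace(0, np.nan), unit='s', errors='coerce')
print(masked)
```
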


In [12]:
#use this cell to write any additional piece of code that may be required

**Next, we have the grade_item_table**

The grade_item_table stores information about every gradeable item present in the database.

In [13]:
grade_item_table.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30028 entries, 0 to 30027
Data columns (total 28 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   id               30028 non-null  object 
 1   courseid         30028 non-null  object 
 2   categoryid       23322 non-null  object 
 3   itemname         30028 non-null  object 
 4   itemtype         30028 non-null  object 
 5   itemmodule       21372 non-null  object 
 6   iteminstance     30028 non-null  object 
 7   itemnumber       23322 non-null  object 
 8   idnumber         3454 non-null   object 
 9   calculation      424 non-null    object 
 10  gradetype        30028 non-null  object 
 11  grademax         30028 non-null  float64
 12  grademin         30028 non-null  float64
 13  scaleid          4768 non-null   object 
 14  outcomeid        25260 non-null  object 
 15  gradepass        30028 non-null  float64
 16  multfactor       30028 non-null  object 
 17  plusfactor  

In [14]:
grade_item_table['gradetype'].value_counts()

1.0    23935
2.0     4783
3.0     1205
0.0      105
Name: gradetype, dtype: int64
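
The numeric codes above can be made readable with a small lookup. The labels in this sketch follow Moodle's gradebook constants (GRADE_TYPE_NONE, GRADE_TYPE_VALUE, GRADE_TYPE_SCALE, GRADE_TYPE_TEXT) and should be checked against the Moodle version behind this export. The sketch also assumes that the codes are stored as strings such as '1.0', as the object dtype suggests; `GRADE_TYPES` and `toy_items` are illustrative names.

```python
import pandas as pd

# grade type codes following Moodle's GRADE_TYPE_* constants
GRADE_TYPES = {'0.0': 'none', '1.0': 'value', '2.0': 'scale', '3.0': 'text'}

# made-up grade items, with codes stored as strings
toy_items = pd.DataFrame({'gradetype': ['1.0', '2.0', '1.0', '0.0', '3.0']})
toy_items['gradetype_label'] = toy_items['gradetype'].map(GRADE_TYPES)
print(toy_items['gradetype_label'].value_counts())
```
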

In [15]:
#we remove features that we will not use: itemname, plusfactor, timecreated and timemodified
grade_item_table.drop([
                    'itemname',
                    'plusfactor',
                    'timecreated',
                    'timemodified'
                    ],
                    axis = 1, inplace = True)

grade_item_table.rename(columns = {'id' : 'itemid'}, inplace = True)

In [16]:
grade_item_table.describe(include ='all', datetime_is_numeric = True).T

Unnamed: 0,count,unique,top,freq,mean,std,min,25%,50%,75%,max
itemid,30028.0,30028.0,1.0,1.0,,,,,,,
courseid,30028.0,5551.0,1073.0,146.0,,,,,,,
categoryid,23322.0,3363.0,1191.0,145.0,,,,,,,
itemtype,30028.0,4.0,mod,21372.0,,,,,,,
itemmodule,21372.0,11.0,assign,15773.0,,,,,,,
iteminstance,30028.0,17307.0,0.0,1950.0,,,,,,,
itemnumber,23322.0,1.0,0.0,23322.0,,,,,,,
idnumber,3454.0,2691.0,1,46.0,,,,,,,
calculation,424.0,419.0,=101+102,2.0,,,,,,,
gradetype,30028.0,4.0,1.0,23935.0,,,,,,,


In [17]:
grade_item_table['itemmodule'].value_counts()

assign           15773
quiz              4517
forum              264
scorm              237
questionnaire      214
workshop           195
glossary            91
lesson              47
nanogong            21
data                12
pcast                1
Name: itemmodule, dtype: int64

In [18]:
#use this cell to write any additional piece of code that may be required

#### Only a subset of courses has graded items

As done by the authors of the Riestra-González paper, we will only work with courses that have graded assignments.
The reason for this choice is straightforward - we have no access to the student information system (SIS), which means that our target will be, in some shape or form, related to the graded assignments.
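
As an illustration of this selection, the sketch below keeps only the courses that own at least one graded activity. The data is made up, and restricting the selection to items with itemtype 'mod' is an assumption made for the example rather than the criterion used by the paper.

```python
import pandas as pd

# made-up course list and grade items
toy_courses = pd.DataFrame({'courseid': ['1', '2', '3', '4']})
toy_grade_items = pd.DataFrame({
    'itemid': ['a', 'b', 'c'],
    'courseid': ['1', '1', '3'],
    'itemtype': ['mod', 'course', 'mod'],
})

# courses that own at least one gradeable activity (itemtype 'mod')
is_activity = toy_grade_items['itemtype'] == 'mod'
graded_ids = toy_grade_items.loc[is_activity, 'courseid'].unique()
graded_courses = toy_courses[toy_courses['courseid'].isin(graded_ids)]
print(graded_courses)
```
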

**After looking at the grades tables, it is important to combine the information in these tables with the tables about courses**.

For that, we have access to multiple DataFrames related to the courses themselves.
Behold, the course_table.

In [19]:
course_table.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5732 entries, 0 to 5731
Data columns (total 28 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   id                 5732 non-null   object 
 1   category           5732 non-null   object 
 2   sortorder          5732 non-null   float64
 3   fullname           5732 non-null   object 
 4   shortname          5732 non-null   object 
 5   idnumber           5731 non-null   object 
 6   summary            5732 non-null   object 
 7   summaryformat      5732 non-null   object 
 8   format             5732 non-null   object 
 9   showgrades         5732 non-null   object 
 10  newsitems          5732 non-null   object 
 11  startdate          5732 non-null   float64
 12  marker             5732 non-null   object 
 13  maxbytes           5732 non-null   float64
 14  legacyfiles        5732 non-null   object 
 15  showreports        5732 non-null   object 
 16  visible            5732 

In [20]:
course_table.describe(include ='all', datetime_is_numeric = True).T

Unnamed: 0,count,unique,top,freq,mean,std,min,25%,50%,75%,max
id,5732.0,5732.0,1.0,1.0,,,,,,,
category,5732.0,74.0,1.0,3589.0,,,,,,,
sortorder,5732.0,,,,213321.98866,313262.684429,1.0,11432.75,12865.5,380005.25,850116.0
fullname,5732.0,1.0,nombre,5732.0,,,,,,,
shortname,5732.0,5732.0,Uniovi Virtual,1.0,,,,,,,
idnumber,5731.0,5731.0,"T_1C,A_15473",1.0,,,,,,,
summary,5732.0,1.0,-,5732.0,,,,,,,
summaryformat,5732.0,3.0,1.0,3316.0,,,,,,,
format,5732.0,5.0,topics,5710.0,,,,,,,
showgrades,5732.0,2.0,1.0,5228.0,,,,,,,


In [21]:
#the course_table has multiple single-value features with no NaNs; as with the previously considered features, we will remove them

course_table.drop(['fullname',
                       'summary',
                       'requested',
                       'enablecompletion',
                       'completionnotify'],
                       axis = 1, inplace = True)

#startdate, timecreated, timemodified and cacherev seem to be time features, so we convert them to datetime
course_table['startdate'] = pd.to_datetime(course_table['startdate'], unit = 's', errors = 'coerce')
course_table['timecreated'] = pd.to_datetime(course_table['timecreated'], unit = 's', errors = 'coerce')
course_table['timemodified'] = pd.to_datetime(course_table['timemodified'], unit = 's', errors = 'coerce')
course_table['cacherev'] = pd.to_datetime(course_table['cacherev'], unit = 's', errors = 'coerce') 

In [22]:
course_table

Unnamed: 0,id,category,sortorder,shortname,idnumber,summaryformat,format,showgrades,newsitems,startdate,...,showreports,visible,visibleold,groupmode,groupmodeforce,defaultgroupingid,lang,timecreated,timemodified,cacherev
0,1.0,0.0,1.0,Uniovi Virtual,,0.0,site,1.0,3.0,1970-01-01 00:00:00,...,0.0,1.0,1.0,0.0,0.0,0.0,,2014-06-05 17:08:12,2014-08-25 13:13:00,2015-06-03 12:11:01
1,3.0,1.0,13589.0,"T_1C,A_15473","T_1C,A_15473",1.0,topics,1.0,1.0,1999-11-29 23:00:00,...,0.0,0.0,1.0,0.0,0.0,0.0,,2011-06-06 14:32:36,2011-09-12 12:25:12,2015-03-02 16:33:13
2,4.0,1.0,13588.0,"T_1C,A_15470","T_1C,A_15470",1.0,topics,1.0,1.0,2012-10-02 22:00:00,...,0.0,0.0,1.0,0.0,0.0,0.0,,2011-09-26 09:11:45,2012-10-03 01:45:30,2015-03-02 16:33:13
3,5.0,1.0,13587.0,"T_1C,A_15181","T_1C,A_15181",1.0,topics,1.0,1.0,2014-01-21 23:00:00,...,0.0,0.0,1.0,0.0,0.0,0.0,,2011-06-06 14:32:51,2014-01-22 18:42:58,2015-03-02 16:33:13
4,6.0,61.0,310007.0,"T_S,A_MGENYDIV-1-022","T_S,A_MGENYDIV-1-022",1.0,topics,1.0,1.0,2015-01-31 23:00:00,...,0.0,1.0,1.0,0.0,0.0,0.0,,2011-06-20 10:49:05,2015-02-01 21:32:00,2015-03-02 16:33:13
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5727,5928.0,108.0,790005.0,"T_CL,P_10119,A_6634","T_CL,P_10119,A_6634",0.0,topics,1.0,5.0,1970-01-01 00:00:00,...,0.0,0.0,0.0,0.0,0.0,0.0,,2015-07-01 09:17:43,2015-07-01 09:17:43,2015-07-01 09:17:43
5728,5929.0,108.0,790004.0,"T_CL,P_10119,A_6635","T_CL,P_10119,A_6635",0.0,topics,1.0,5.0,1970-01-01 00:00:00,...,0.0,0.0,0.0,0.0,0.0,0.0,,2015-07-01 09:17:44,2015-07-01 09:17:44,2015-07-06 12:39:03
5729,5930.0,108.0,790003.0,"T_CL,P_10119,A_6624","T_CL,P_10119,A_6624",0.0,topics,1.0,5.0,1970-01-01 00:00:00,...,0.0,0.0,0.0,0.0,0.0,0.0,,2015-07-01 09:17:44,2015-07-01 09:17:44,2015-07-01 09:17:44
5730,5931.0,108.0,790002.0,"T_CL,P_10119,A_6625","T_CL,P_10119,A_6625",0.0,topics,1.0,5.0,1970-01-01 00:00:00,...,0.0,0.0,0.0,0.0,0.0,0.0,,2015-07-01 09:17:45,2015-07-01 09:17:45,2015-07-06 12:39:02


In [23]:
#use this cell to write any additional piece of code that may be required

#### The course module table is also present in other datasets (e.g. The Open Moodle Dataset)

According to it, the course module table describes every activity performed with Moodle. In our case, it records every activity performed in every course.

Here follows a brief overview of this table.

In [24]:
course_mod_table.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 228216 entries, 0 to 228215
Data columns (total 21 columns):
 #   Column              Non-Null Count   Dtype  
---  ------              --------------   -----  
 0   id                  228216 non-null  object 
 1   course              228216 non-null  object 
 2   module              228216 non-null  object 
 3   instance            228216 non-null  object 
 4   section             228216 non-null  object 
 5   idnumber            428 non-null     object 
 6   added               228216 non-null  float64
 7   score               228216 non-null  float64
 8   indent              228216 non-null  object 
 9   visible             228216 non-null  object 
 10  visibleold          228216 non-null  object 
 11  groupmode           228216 non-null  object 
 12  groupingid          228216 non-null  float64
 13  groupmembersonly    228216 non-null  object 
 14  completion          228216 non-null  object 
 15  completionview      228216 non-nul

In [25]:
#the course_mod_table has multiple single-value features with no NaNs; as with the previously considered features, we will remove them

course_mod_table.drop([
                    'groupmembersonly',
                    'completion',
                    'completionview',
                    'showdescription',
                    'completionexpected',
                    'score',
                    ],
                    axis = 1, inplace = True)

#added, availablefrom and availableuntil seem to be time features, so we will appropriately make the conversion to datetime
course_mod_table['added'] = pd.to_datetime(course_mod_table['added'], unit = 's', errors = 'coerce')
course_mod_table['availablefrom'] = pd.to_datetime(course_mod_table['availablefrom'], unit = 's', errors = 'coerce')
course_mod_table['availableuntil'] = pd.to_datetime(course_mod_table['availableuntil'], unit = 's', errors = 'coerce')

#renaming variables that we will use later on for mergers
course_mod_table.rename(columns = {'instance': 'iteminstance', 'course': 'courseid', 'id' : 'assign_id'}, inplace = True)

In [26]:
course_mod_table.describe(include = 'all', datetime_is_numeric = True).T

Unnamed: 0,count,unique,top,freq,mean,min,25%,50%,75%,max,std
assign_id,228216.0,228216.0,1.0,1.0,,,,,,,
courseid,228216.0,5643.0,4460.0,575.0,,,,,,,
module,228216.0,23.0,17.0,144178.0,,,,,,,
iteminstance,228216.0,150172.0,10.0,23.0,,,,,,,
section,228216.0,40573.0,51699.0,268.0,,,,,,,
idnumber,428.0,288.0,2.0,9.0,,,,,,,
added,228216.0,,,,2012-08-19 22:31:16.422490368,2006-02-13 09:09:40,2011-06-01 18:43:16.500000,2013-02-28 00:00:26.500000,2014-04-30 08:43:07.750000128,2015-07-30 14:00:40,
indent,228216.0,34.0,0.0,166823.0,,,,,,,
visible,228216.0,2.0,1.0,173801.0,,,,,,,
visibleold,228216.0,2.0,1.0,199037.0,,,,,,,


In [27]:
course_mod_table

Unnamed: 0,assign_id,courseid,module,iteminstance,section,idnumber,added,indent,visible,visibleold,groupmode,groupingid,availablefrom,availableuntil,showavailability
0,1.0,1.0,25.0,1.0,1.0,,2014-06-06 08:03:57,0.0,1.0,0.0,0.0,0.0,1970-01-01,1970-01-01,0.0
1,2.0,3.0,9.0,1.0,3.0,,2008-07-28 08:33:14,0.0,0.0,1.0,0.0,0.0,1970-01-01,1970-01-01,0.0
2,3.0,3.0,12.0,1.0,3.0,,2012-01-09 11:55:42,0.0,1.0,1.0,0.0,0.0,1970-01-01,1970-01-01,0.0
3,4.0,3.0,17.0,1.0,3.0,,2012-01-25 17:43:03,0.0,1.0,1.0,0.0,0.0,1970-01-01,1970-01-01,0.0
4,5.0,3.0,17.0,2.0,3.0,,2012-01-25 18:00:11,0.0,1.0,1.0,0.0,0.0,1970-01-01,1970-01-01,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
228211,276295.0,4721.0,17.0,179175.0,63006.0,,2015-07-29 18:07:15,0.0,1.0,1.0,0.0,0.0,1970-01-01,1970-01-01,1.0
228212,276298.0,4721.0,20.0,17945.0,63006.0,,2015-07-30 00:28:31,0.0,1.0,1.0,0.0,0.0,1970-01-01,1970-01-01,1.0
228213,276299.0,4721.0,20.0,17946.0,63006.0,,2015-07-30 00:30:30,0.0,1.0,1.0,0.0,0.0,1970-01-01,1970-01-01,1.0
228214,276305.0,4448.0,9.0,9557.0,55307.0,,2015-07-30 10:51:02,0.0,1.0,1.0,0.0,0.0,1970-01-01,1970-01-01,0.0


In [28]:
#use this cell to write any additional piece of code that may be required

The last table to check is the context_table. Its utility is rather unclear at this moment.

#### Context table

In [29]:
context_table.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 351126 entries, 0 to 351125
Data columns (total 5 columns):
 #   Column        Non-Null Count   Dtype 
---  ------        --------------   ----- 
 0   id            351126 non-null  object
 1   contextlevel  351126 non-null  object
 2   instanceid    351126 non-null  object
 3   path          351126 non-null  object
 4   depth         351126 non-null  object
dtypes: object(5)
memory usage: 13.4+ MB


In [30]:
context_table.describe(include = 'all', datetime_is_numeric = True).T

Unnamed: 0,count,unique,top,freq
id,351126,351126,1.0,1
contextlevel,351126,6,70.0,228216
instanceid,351126,240746,119.0,5
path,351126,351126,/1,1
depth,351126,6,4.0,224241


In [31]:
context_table

Unnamed: 0,id,contextlevel,instanceid,path,depth
0,1.0,10.0,0.0,/1,1.0
1,4.0,50.0,1.0,/1/4,2.0
2,6.0,30.0,1.0,/1/6,2.0
3,7.0,30.0,2.0,/1/7,2.0
4,8.0,80.0,1.0,/1/4/8,3.0
...,...,...,...,...,...
351121,400993.0,30.0,81852.0,/1/400993,2.0
351122,400994.0,30.0,81853.0,/1/400994,2.0
351123,400995.0,30.0,81854.0,/1/400995,2.0
351124,400996.0,30.0,81855.0,/1/400996,2.0
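
One way to make this table easier to read is to label the contextlevel codes. The labels in the sketch below follow Moodle's CONTEXT_* constants and are worth verifying against the Moodle version behind this export. The 228216 rows at level 70 match the 228216 rows of the course module table, which is consistent with level 70 denoting modules. The toy rows mirror the first rows shown above; `CONTEXT_LEVELS` and `toy_context` are illustrative names.

```python
import pandas as pd

# context level codes following Moodle's CONTEXT_* constants
CONTEXT_LEVELS = {'10.0': 'system', '30.0': 'user', '40.0': 'course category',
                  '50.0': 'course', '70.0': 'module', '80.0': 'block'}

toy_context = pd.DataFrame({
    'id': ['1.0', '4.0', '6.0', '8.0'],
    'contextlevel': ['10.0', '50.0', '30.0', '80.0'],
    'instanceid': ['0.0', '1.0', '1.0', '1.0'],
})
toy_context['level_name'] = toy_context['contextlevel'].map(CONTEXT_LEVELS)

# rows at course level: under this reading, instanceid is a course id
course_contexts = toy_context[toy_context['level_name'] == 'course']
print(course_contexts)
```

Under this reading, the contextid column of the role assignment table could be linked to courses through the course-level rows of the context table.
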


In [32]:
#use this cell to write any additional piece of code that may be required

#### 3. To business

The information stored in these tables is pivotal for our work with the logs. Setting aside other potential insights that may arise from this data, we are, for the most part, interested in 3 things:

1. Identify the student population - already achieved
2. Compute student performance - our target
3. Get course duration, or find a way to compute it, for the courses that we will take forward.

We have repeatedly stated that we want to, in some capacity, predict student performance. As we do not have access to the final grades, we will need to infer it from graded Moodle assignments. The first, and almost immediate, observation is that we can only use courses that use Moodle in this capacity, which will reduce the number of courses we have to work with.

We will follow the formula adopted by the authors of the Riestra-González paper:

#### Student Performance and Course Duration

The authors obtained student performance and course duration by performing inner joins across multiple tables and filtering on different conditions:

course_mod_table,
grades_table,
grade_item_table

We will replicate their steps and, hopefully, reach support tables that return comparable results. The first step is to remove the rows that are unnecessary for us. We can only construct a solution for items that are graded and for which we have the means to estimate the course duration.

Thus, in the grades_table, we will only keep rows that fulfill the following prerequisite:
1. Have a valid final grade.

The second phase will be to perform inner joins of the different tables:
1. course_mod_table with grade_item_table on iteminstance and courseid
2. grade_item_table.id with grades_table.itemid
3. The merge of the previous 2 merged tables
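
To make the effect of the inner joins concrete, here is a small sketch on made-up tables. Only the column names follow the notebook; the values and the `toy_*` names are invented for illustration.

```python
import pandas as pd

toy_modules = pd.DataFrame({'assign_id': ['m1', 'm2', 'm3'],
                            'courseid': ['c1', 'c1', 'c2'],
                            'iteminstance': ['1', '2', '1']})
toy_items = pd.DataFrame({'itemid': ['i1', 'i2'],
                          'courseid': ['c1', 'c2'],
                          'iteminstance': ['1', '1']})
toy_grades = pd.DataFrame({'itemid': ['i1', 'i1', 'i2', 'i9'],
                           'userid': ['u1', 'u2', 'u1', 'u3'],
                           'finalgrade': [8.0, None, 5.5, 7.0]})

# join 1: modules with grade items; module m2 has no grade item and drops out
modules_items = pd.merge(toy_modules, toy_items,
                         on=['iteminstance', 'courseid'], how='inner')
# join 2: attach grades; the grade for the unknown item i9 drops out
joined = pd.merge(modules_items, toy_grades, on='itemid', how='inner')
# keep only rows with a valid final grade
joined = joined.dropna(subset=['finalgrade'])
print(joined)
```

An inner join silently discards unmatched rows on both sides, so comparing row counts before and after each merge is a cheap way to see how much data each condition removes.
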

In [33]:
# #Step 1, removing all rows that have no interest to us

# grades_table.dropna(subset = ['finalgrade','timecreated', 'timemodified'], inplace = True)

#Step 2: create a temporary table that associates courses and assignments
placeholder_1 = pd.merge(course_mod_table, grade_item_table, on=['iteminstance','courseid'], how='inner')

#Step 3: create a second temporary table that associates grades with assignments
placeholder_2 = pd.merge(placeholder_1, grades_table, on ='itemid', how='inner')

#Step 4: only keep rows with a valid final grade; .copy() avoids writing to a slice of placeholder_2
support_table = placeholder_2.dropna(subset = ['finalgrade']).copy()
#sup_time keeps the later of timecreated and timemodified
#comparisons involving NaT are False, so timemodified is used in that case
support_table['sup_time'] = np.where(support_table['timecreated'] > support_table['timemodified'],
                                support_table['timecreated'], support_table['timemodified'])

#Step 5: only keep graded items, which means positive max grades
support_table = support_table[support_table['rawgrademax'] > 0]
#support_table = support_table[support_table['sup_time'].dt.year >= 2014]

del placeholder_1, placeholder_2
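
The sup_time logic in the cell above can be checked on a few made-up rows. The sketch compares the np.where expression with a row-wise maximum, offered here only as an alternative to consider. The two differ when timemodified is missing, which may never happen in this dataset because a raw value of 0 is converted to 1970-01-01 rather than to NaT.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'timecreated': pd.to_datetime(['2015-02-10 10:45:00', None, '2015-02-24 12:00:51']),
    'timemodified': pd.to_datetime(['2015-02-10 10:45:00', '2015-01-13 11:35:05', None]),
})

# same expression as in the cell above: comparisons with NaT are False,
# so timemodified is taken whenever either value is missing
toy['sup_time_where'] = np.where(toy['timecreated'] > toy['timemodified'],
                                 toy['timecreated'], toy['timemodified'])
# alternative: row-wise maximum, which skips missing values by default
toy['sup_time_max'] = toy[['timecreated', 'timemodified']].max(axis=1)
print(toy)
```
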

**As a final step, we will store the start date of each course - as it will provide us with the means to, further down the line, perform the inference for course duration.**
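
How the duration is eventually inferred is left for later. As one possible reading, the sketch below measures the span between the start date of a course and its last recorded grade. Taking the last grade as the end of the course is an assumption made for illustration, and the data and the names `toy_support` and `course_span` are made up.

```python
import pandas as pd

toy_support = pd.DataFrame({
    'courseid': ['c1', 'c1', 'c2'],
    'startdate': pd.to_datetime(['2014-09-15', '2014-09-15', '2015-02-02']),
    'sup_time': pd.to_datetime(['2014-11-03', '2015-01-20', '2015-05-29']),
})

# assumption: a course ends at the time of its last recorded grade
course_span = toy_support.groupby('courseid').agg(
    startdate=('startdate', 'min'),
    last_grade=('sup_time', 'max'),
)
course_span['duration_days'] = (course_span['last_grade'] - course_span['startdate']).dt.days
print(course_span)
```
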

In [34]:
#only keep rows worth merging - this cell can only be run once
course_table = course_table[course_table['startdate'].dt.year >= 2014].filter(['id', 'startdate']).rename(columns = {'id': 'courseid'})

#perform inner join between support table and courses with grades
support_table = pd.merge(support_table, course_table, on = 'courseid', how = 'inner')

In [35]:
support_table.describe(include = 'all', datetime_is_numeric = True).T

Unnamed: 0,count,unique,top,freq,mean,min,25%,50%,75%,max,std
assign_id,154826.0,5028.0,98626.0,560.0,,,,,,,
courseid,154826.0,788.0,2272.0,3540.0,,,,,,,
module,154826.0,14.0,1.0,90898.0,,,,,,,
iteminstance,154826.0,4738.0,6608.0,560.0,,,,,,,
section,154826.0,2005.0,25647.0,3321.0,,,,,,,
idnumber_x,5383.0,116.0,C03,139.0,,,,,,,
added,154826.0,,,,2013-11-14 23:36:32.200114944,2006-11-29 12:40:45,2013-06-12 12:58:54,2014-01-27 12:23:49,2014-11-19 06:45:40,2015-07-22 21:34:51,
indent,154826.0,9.0,0.0,99469.0,,,,,,,
visible,154826.0,2.0,1.0,132101.0,,,,,,,
visibleold,154826.0,2.0,1.0,136263.0,,,,,,,


We will finish this section by filtering the features to keep and, afterward, exporting the support table to use with the LMS logs.

In [36]:
#only keep the final result
support_table = support_table.filter(['assign_id', 'courseid', 'startdate', 'userid', 'finalgrade', 
                                      'rawgrademax', 'sup_time'])

#save
support_table.to_csv('../Data/R_Gonz_support_table.csv')
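
The unnamed first column removed at the top of this notebook comes from exporting a DataFrame together with its index. The sketch below, on a made-up table, shows the difference that index=False makes. It is only an illustration and not a change applied here.

```python
import io
import pandas as pd

toy = pd.DataFrame({'courseid': ['c1', 'c2'], 'finalgrade': [8.0, 5.5]})

# default export: the index is written as a nameless first column
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
cols_default = pd.read_csv(with_index).columns.tolist()
print(cols_default)

# export without the index: no extra column on reload
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
cols_no_index = pd.read_csv(without_index).columns.tolist()
print(cols_no_index)
```

If the export were switched to index=False, any later step that drops 'Unnamed: 0' from the reloaded file would have to be adjusted as well, since dropping a column that does not exist raises an error by default.
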

#### Done

In future notebooks, we will always work with df_treated.