## Notebook 2.2. Understanding and Preprocessing of Moodle Logs

For all intents and purposes, this is the first notebook that is properly part of the thesis work. In it, we take the original student log file and perform the manipulations needed to obtain a dataset that has the potential to be useful.

#### 1. A small overview of the logs and each column

The logs record interactions with the Moodle LMS:

    - Each interaction with the LMS is recorded sequentially:
        When is the action performed,
        What is the nature of the interaction,
        Where is the actor when the action is performed,
        Who performed the interaction,
        In the context of which course page,
        What is the specific link,
                
    - Each user is uniquely identified by the userID,
    - Each course is uniquely identified by the courseID,
    - Each specific interaction is recorded -> action performed and clicked url, 
    - Each click is timestamped,
    - The actor's IP is recorded,

A brief description of each column follows:

##### id
A sequentially numbered unique identifier of each interaction,

##### time
A float number representation of the timestamp of the event,

##### userid
Unique numerical identifier of user -> be it student, faculty or other,

##### ip
IP address used by the user when interacting with the LMS,

##### course
Unique numerical identifier of a course,

##### cmid
meaning unclear at the moment - to check with other Moodle Sources,

##### action
categorization of nature of the interaction

##### url
link user clicked on

##### info
additional descriptors added by the user
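
As an illustration of the `time` column described above, the sketch below shows how a float timestamp can be turned into a readable date. It assumes the floats are Unix epoch seconds, which is what the `unit = 's'` conversion used later in this notebook also assumes. The frame `toy_logs` and its values are made up for the example.

```python
import pandas as pd

# Toy log rows; the values are invented for illustration only.
toy_logs = pd.DataFrame({
    "id": [1, 2, 3],
    "time": [1401988147.0, 1401988488.0, None],  # assumed: seconds since 1970-01-01
    "userid": [2, 2, 2],
    "action": ["login", "update", "view"],
})

# unit="s" reads the floats as epoch seconds; errors="coerce" turns
# missing or unparseable values into NaT instead of raising an error.
toy_logs["time"] = pd.to_datetime(toy_logs["time"], unit="s", errors="coerce")
print(toy_logs)
```

Rows whose timestamp cannot be parsed end up as `NaT`, so they can be counted with `toy_logs["time"].isna().sum()` before deciding what to do with them.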

#### 2. We'll start this notebook by importing all relevant packages and data

All data is stored in the csv files that were exported in the previous notebook. 

In order to minimize unnecessary steps, we apply the following rules to each dataset as we import these csv files:
1. The first unnamed column is removed,
2. All columns made entirely of missing values are removed - we have detected some,
3. All numerical columns that are immediately recognized as categorical (or likely to be categorical) are declared as categoricals right away - this does not rule out converting other features to objects upon further assessment,
4. All features that display no null values and have a single value are promptly removed as well,
5. No preprocessing of time-related features is performed at this stage, because the features related to time may require further assessment.
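
The removal rules listed above can be sketched on a toy table as follows. `basic_clean` is a hypothetical helper name that does not appear in the original notebook, and the toy columns are invented; treat it as one possible way to express rules 1, 2 and 4, not as the exact code used here.

```python
import numpy as np
import pandas as pd

def basic_clean(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: drop the export index, empty columns and constant columns."""
    # Rule 1: drop the unnamed index column written by a previous to_csv call, if present.
    df = df.drop(columns=["Unnamed: 0"], errors="ignore")
    # Rule 2: drop columns that contain only missing values.
    df = df.dropna(how="all", axis=1)
    # Rule 4: drop columns with no nulls and a single distinct value.
    constant = [c for c in df.columns
                if df[c].notna().all() and df[c].nunique() == 1]
    return df.drop(columns=constant)

toy = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2],
    "userid": ["2", "2", "5"],
    "empty": [np.nan, np.nan, np.nan],
    "constant": ["x", "x", "x"],
    "info": ["a", None, "b"],
})
cleaned = basic_clean(toy)
print(cleaned.columns.tolist())
```

Only `userid` and `info` survive in this toy case: `info` is kept because it contains a missing value and more than one distinct value.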

In [1]:
#import libs
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

sns.set()
import warnings
warnings.filterwarnings('ignore')

In [2]:
#loading student log data 
student_logs = pd.read_csv('../Data/R_Gonz_data_log.csv', 
                           dtype = {
                                   'id': object,
                                   'itemid': object,
                                   'userid': object,
                                   'course': object,
                                   'cmid': object,
                                   },).drop('Unnamed: 0', axis = 1).dropna(how = 'all', axis = 1) #logs

#loading support table
support_table = pd.read_csv('../Data/R_Gonz_support_table.csv', 
                           dtype = {
                                   'assign_id': object,
                                   'courseid': object,
                                   'userid': object,
                                   }, 
                            parse_dates = ['sup_time', 'startdate']).drop('Unnamed: 0', axis = 1).dropna(how = 'all', axis = 1) #support table

#after checking, we note that time and stime refer to the same moment and differ by 1 hour; hence, we only keep the time column
#additionally, we convert time to datetime right away
student_logs['time'] = pd.to_datetime(student_logs['time'], unit = 's', errors = 'coerce')
student_logs.drop('stime', axis = 1, inplace = True)

### Taking a preliminary look at the logs

In [3]:
student_logs.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 47097824 entries, 0 to 47097823
Data columns (total 10 columns):
 #   Column  Dtype         
---  ------  -----         
 0   id      object        
 1   time    datetime64[ns]
 2   userid  object        
 3   ip      object        
 4   course  object        
 5   module  object        
 6   cmid    object        
 7   action  object        
 8   url     object        
 9   info    object        
dtypes: datetime64[ns](1), object(9)
memory usage: 3.5+ GB


In [4]:
student_logs.describe(include ='all', datetime_is_numeric = True).T

Unnamed: 0,count,unique,top,freq,mean,min,25%,50%,75%,max
id,47097824,47097824.0,1.0,1.0,NaT,NaT,NaT,NaT,NaT,NaT
time,47097824,,,,2015-01-20 08:00:31.016559872,2014-06-05 17:09:07,2014-11-10 12:51:08.750000128,2015-01-17 20:12:12,2015-03-27 22:43:11,2015-07-31 03:14:09
userid,47097824,30517.0,0.0,3219653.0,NaT,NaT,NaT,NaT,NaT,NaT
ip,47097824,161783.0,127.0.0.1,30508698.0,NaT,NaT,NaT,NaT,NaT,NaT
course,47097824,5112.0,1.0,17715596.0,NaT,NaT,NaT,NaT,NaT,NaT
module,47097824,39.0,course,17937931.0,NaT,NaT,NaT,NaT,NaT,NaT
cmid,47097824,167235.0,0.0,34846344.0,NaT,NaT,NaT,NaT,NaT,NaT
action,47097824,157.0,view,27239500.0,NaT,NaT,NaT,NaT,NaT,NaT
url,47070765,754343.0,view.php?id=1,6303588.0,NaT,NaT,NaT,NaT,NaT,NaT
info,42907847,693729.0,1,6306585.0,NaT,NaT,NaT,NaT,NaT,NaT


In [5]:
student_logs

Unnamed: 0,id,time,userid,ip,course,module,cmid,action,url,info
0,1.0,2014-06-05 17:09:07,2.0,127.0.0.1,1.0,user,0.0,login,view.php?id=2&course=1,2
1,2.0,2014-06-05 17:14:48,2.0,127.0.0.1,1.0,user,0.0,update,view.php?id=2,
2,3.0,2014-06-05 17:14:48,2.0,127.0.0.1,1.0,user,0.0,update,view.php?id=2,
3,4.0,2014-06-05 17:16:13,2.0,127.0.0.1,1.0,course,0.0,view,view.php?id=1,1
4,5.0,2014-06-06 07:37:19,2.0,127.0.0.1,1.0,user,0.0,login,view.php?id=2&course=1,2
...,...,...,...,...,...,...,...,...,...,...
47097819,47116816.0,2015-07-31 03:00:59,0.0,127.0.0.1,1.0,user,0.0,add,/view.php?id=81854,Cathleen Scheurich
47097820,47116817.0,2015-07-31 03:00:59,0.0,127.0.0.1,1.0,user,0.0,add,/view.php?id=81855,Sara Gil Díez
47097821,47116818.0,2015-07-31 03:00:59,0.0,127.0.0.1,1.0,user,0.0,add,/view.php?id=81856,Eduardo García Bermo
47097822,47116819.0,2015-07-31 03:14:08,0.0,127.0.0.1,635.0,role,0.0,unassign,admin/roles/assign.php?contextid=24578&roleid=5,Estudiante


In [6]:
#use this cell to write any additional piece of code that may be required

### First step: Make it lighter.

One of the first things to do is to consider the set of students and courses we intend to use. We have, from our support table, a list of the courses and students that we are interested in. We'll then use that list of unique student-course pairs to keep only the logs for the courses we are interested in.
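
A minimal sketch of this filtering idea on toy data follows. The frames and their values are invented, and the sketch uses `drop_duplicates` to obtain the unique pairs, which is a simplification chosen for the example and not necessarily the approach taken in the cell below.

```python
import pandas as pd

# Toy support table: one row per student-assignment, so pairs repeat.
toy_support = pd.DataFrame({
    "courseid": ["10", "10", "10", "20"],
    "userid": ["1", "1", "2", "3"],
    "assign_id": ["a1", "a2", "a1", "b1"],
})
# Toy logs: some rows belong to pairs that are not in the support table.
toy_logs = pd.DataFrame({
    "userid": ["1", "1", "2", "3", "4"],
    "course": ["10", "30", "10", "20", "10"],
    "action": ["view", "view", "submit", "view", "view"],
})

# Unique student-course pairs, renamed to match the column name used in the logs.
pairs = (toy_support[["courseid", "userid"]]
         .drop_duplicates()
         .rename(columns={"courseid": "course"}))

# The inner merge keeps only log rows whose (userid, course) pair is of interest.
filtered = toy_logs.merge(pairs, on=["userid", "course"], how="inner")
print(filtered)
```

The log rows for user 1 in course 30 and for user 4 in course 10 are dropped, because neither pair exists in the toy support table.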

In [7]:
#We group the support table by course and student to obtain the unique student-course pairs
student_courses = support_table.groupby([
                                        'courseid',
                                         'userid',
                                        ],
                                        as_index = False).size().rename(columns = {'courseid':'course'})
#then, we perform an inner merge - only keeping the log rows that match one of those student-course pairs
student_logs_actions = pd.merge(student_courses, student_logs, on=[
                                                        'userid',
                                                        'course',
                                                        ], 
                                                        how='inner').drop('size', axis = 1).sort_values(by = 'time')

del student_courses

In [8]:
student_logs_actions.describe(include ='all', datetime_is_numeric = True).T

Unnamed: 0,count,unique,top,freq,mean,min,25%,50%,75%,max
course,12117018,1326.0,4184.0,338054.0,NaT,NaT,NaT,NaT,NaT,NaT
userid,12117018,16606.0,2579.0,275101.0,NaT,NaT,NaT,NaT,NaT,NaT
id,12117018,12117018.0,1673.0,1.0,NaT,NaT,NaT,NaT,NaT,NaT
time,12117018,,,,2015-01-22 15:40:18.737640448,2014-07-01 13:30:43,2014-11-11 08:26:55,2015-01-21 19:13:31.500,2015-03-31 15:58:39.500,2015-07-31 03:00:10
ip,12117018,103074.0,127.0.0.1,7691167.0,NaT,NaT,NaT,NaT,NaT,NaT
module,12117018,32.0,course,5053489.0,NaT,NaT,NaT,NaT,NaT,NaT
cmid,12117018,57320.0,0.0,5862855.0,NaT,NaT,NaT,NaT,NaT,NaT
action,12117018,129.0,view,9698060.0,NaT,NaT,NaT,NaT,NaT,NaT
url,12103143,270043.0,/report/grader/index.php?id=4184,243195.0,NaT,NaT,NaT,NaT,NaT,NaT
info,11714579,86679.0,Ver página de estado de las entregas propios.,1100701.0,NaT,NaT,NaT,NaT,NaT,NaT


In [9]:
student_logs_actions

Unnamed: 0,course,userid,id,time,ip,module,cmid,action,url,info
10367781,5017.0,4.0,1673.0,2014-07-01 13:30:43,127.0.0.1,course,0.0,view,view.php?id=5017,5017
10367782,5017.0,4.0,1674.0,2014-07-01 13:30:46,127.0.0.1,assign,201698.0,view,view.php?id=201698,Ver página de estado de las entregas propios.
10367783,5017.0,4.0,1675.0,2014-07-01 13:30:49,127.0.0.1,assign,201698.0,view submit assignment form,view.php?id=201698,Ver la página propia de entregas a tareas.
10367784,5017.0,4.0,1676.0,2014-07-01 13:31:02,127.0.0.1,assign,201698.0,submit,view.php?id=201698,Estado de la entrega: Borrador (no enviado). <...
10367785,5017.0,4.0,1677.0,2014-07-01 13:31:02,127.0.0.1,assign,201698.0,view,view.php?id=201698,Ver página de estado de las entregas propios.
...,...,...,...,...,...,...,...,...,...,...
2367413,1376.0,20151.0,47116784.0,2015-07-31 03:00:07,127.0.0.1,resource,56623.0,view,view.php?id=56623,36375
2367414,1376.0,20151.0,47116785.0,2015-07-31 03:00:08,127.0.0.1,resource,56621.0,view,view.php?id=56621,36374
2367415,1376.0,20151.0,47116786.0,2015-07-31 03:00:09,127.0.0.1,resource,56620.0,view,view.php?id=56620,36373
2367417,1376.0,20151.0,47116788.0,2015-07-31 03:00:10,127.0.0.1,resource,56618.0,view,view.php?id=56618,36371


**Some preliminary observations of our most common interactions between the students and the systems**

In [10]:
with pd.option_context('display.max_rows', None, 'display.max_columns', None):  # more options can be specified also
    print(student_logs_actions['action'].value_counts())

view                                   9698060
view all                                372028
continue attempt                        339356
update                                  334560
view forum                              216078
view submit assignment form             189644
view discussion                         164932
submit                                  159235
review                                  125375
view summary                            111008
attempt                                  86777
close attempt                            82857
view section                             72537
view confirm submit assignment form      16809
view forums                              15768
submit for grading                       15503
recent                                   14413
view mailbox                              9673
submission statement accepted             9523
grade submission                          6330
view submission grading table             6223
add post     

In [11]:
with pd.option_context('display.max_rows', None, 'display.max_columns', None):  # more options can be specified also
    print(student_logs_actions['module'].value_counts())

course             5053489
resource           2712044
assign             1619953
quiz               1105517
forum               407908
grade               334555
user                186387
folder              168607
url                 157004
page                132217
imscp                30268
glossary             29935
wiki                 29926
book                 27745
label                27021
choice               26494
questionnaire        17165
workshop             16138
jmail                13439
scorm                 8408
bigbluebuttonbn       3859
oublog                3154
data                  2559
calendar              1702
bookmark               435
lesson                 339
pcast                  317
nanogong               224
recordingsbn           164
notes                   27
discussion              16
role                     2
Name: module, dtype: int64


We will create a student-course pivot-table from which we can easily obtain the students attending each particular course.

In the pivot-table, we can find the number of LMS interactions (clicks) performed by a student in the context of a given course. The count of valid entries in each column gives us the number of students attending a given curricular unit.
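
To make the reading of the pivot-table concrete, here is a small sketch with invented data. It uses `DataFrame.count()` to count the valid entries per column, as a shorter stand-in for reading the `count` row of `describe()`.

```python
import pandas as pd

# Toy action log: one row per click, values invented for illustration.
toy_actions = pd.DataFrame({
    "userid": ["1", "1", "2", "2", "3", "4"],
    "course": ["10", "10", "10", "20", "20", "10"],
    "url": ["a", "b", "c", "d", "e", "f"],
})

# Rows are students, columns are courses, cells hold the number of clicks.
# A student with no clicks in a course gets NaN in that cell.
clicks = pd.pivot_table(toy_actions, index="userid", columns="course",
                        values="url", aggfunc="count")

# Non-missing cells per column = number of students seen in each course.
students_per_course = clicks.count().to_dict()
print(clicks)
print(students_per_course)
```

One caveat of counting `url`: a student whose rows in a course all have a missing `url` would show a count of 0, not NaN, and would still be counted as attending.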

In [12]:
# we get to create a pivot-table that associates students and the courses they are attending
student_list = pd.pivot_table(student_logs_actions, index='userid', columns = 'course', values = 'url',
                    aggfunc='count')

# we use the describe command to get the course-level aggregate statistics
# count -> number of students attending, mean is the average number of clicks performed by each student 
student_count = student_list.describe(include = 'all').T.sort_values(by = 'count', ascending = False)['count'].reset_index()

#from here, we can create a dict that associates each course to the number of students attending the course
student_count = student_count.set_index('course').to_dict()['count']

By now, we know, generally:

- all courses that had graded assignments (i.e. whose max assignment grade was not 0),
- all students that were registered in the curricular unit and performed, at the very least, one action in the logs,
- all activity logs performed by students in the context of the curricular units

### Second step. Calculate support information
While our logs are not entirely consistent with the findings of the R. Gonzalez paper, we can now start to dig deeper into the support information to obtain the remaining variables of interest.

**First**: to classify whether different assignments were mandatory or not

The authors of the paper split assignments into mandatory and optional ones. In their view, any assignment whose submission rate (relative to the number of students attending the course) is 40% or under is considered an optional assignment.

We can partially repeat the steps performed for the previous pivot-table and classify each assignment as optional or mandatory.
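
The 40% rule can be illustrated with a small worked example. The submission counts and course sizes below are invented, and the column names `submissions` and `submission_rate` are chosen only for this sketch.

```python
import pandas as pd

# Hypothetical per-assignment submission counts.
toy_assignments = pd.DataFrame({
    "courseid": ["10", "10", "20"],
    "assign_id": ["a1", "a2", "b1"],
    "submissions": [8, 4, 3],
})
# Hypothetical number of students attending each course.
students_per_course = {"10": 10, "20": 5}

toy_assignments["registered_students"] = (
    toy_assignments["courseid"].map(students_per_course))
toy_assignments["submission_rate"] = (
    toy_assignments["submissions"] / toy_assignments["registered_students"])

# Strictly above 40% -> mandatory (1); 40% or under -> optional (0).
toy_assignments["mandatory_status"] = (
    toy_assignments["submission_rate"] > 0.4).astype(int)

mandatory_status = (toy_assignments.set_index("assign_id")["mandatory_status"]
                    .to_dict())
print(mandatory_status)
```

Assignment `a2` sits exactly on the boundary (4 of 10 students), so under the "40% or under" reading it is classified as optional.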

In [13]:
# we create a pivot-table that associates assignments with the courses they belong to
assign_number = pd.pivot_table(support_table.dropna(), index= 'userid', columns = ['courseid', 'assign_id'], values = 'finalgrade',
                    aggfunc='count')

# we use the describe command to get the course-level aggregate statistics
# count -> number of students delivering the assignment, mean is the average number of students delivering the assignment 
assign_number = assign_number.describe(include = 'all').T.sort_values(by = 'count', ascending = False)['count'].reset_index()

#from here, we can create 2 columns: i) one with the number of students attending the course
assign_number['registered_students'] = assign_number['courseid'].map(student_count)

#then, we can calculate the percentage of assignments delivered relative to the number of attending students
assign_number['%_submissions'] = assign_number['count'] / assign_number['registered_students']

#finally, we classify each assignment as mandatory vs non-mandatory (over 40% submission rates)
assign_number['mandatory_status'] = np.where(assign_number['%_submissions'] > 0.4, 1, 0)

#from here, we can create a dict that associates each assignment to its mandatory status
mandatory_status = assign_number.set_index('assign_id').to_dict()['mandatory_status']

del assign_number

We have now assigned the mandatory status to the different assignments. We will not use this knowledge immediately, but we will need it later, as it allows us to perform new computations.

**Second**: Enhance the cleaning of unnecessary assignments and courses. We can now perform 3 distinct and important operations:

1st - identify whether the students made the delivery of the assignment or not - nans vs non nans

2nd - give every nan the classification of 0.

3rd - verify whether any assignments have an average finalgrade of 0 - these are meaningless curricular units for us.
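
These three operations can be sketched on a toy support table as follows. The data is invented, and the sketch fills missing values only in `finalgrade`, which is a choice made for the example.

```python
import numpy as np
import pandas as pd

# Toy support table: NaN in finalgrade means nothing was delivered.
toy_support = pd.DataFrame({
    "courseid": ["10", "10", "10", "10"],
    "assign_id": ["a1", "a1", "a2", "a2"],
    "userid": ["1", "2", "1", "2"],
    "finalgrade": [7.5, np.nan, 0.0, np.nan],
})

# 1st: flag deliveries before the NaNs are overwritten.
toy_support["delivered"] = toy_support["finalgrade"].notna().astype(int)

# 2nd: give every missing grade the value 0.
toy_support["finalgrade"] = toy_support["finalgrade"].fillna(0)

# 3rd: assignments whose average final grade is 0 are candidates for exclusion.
mean_grade = toy_support.groupby(["courseid", "assign_id"])["finalgrade"].mean()
to_exclude = mean_grade[mean_grade == 0].index.tolist()
print(to_exclude)
```

The order matters: the `delivered` flag has to be computed before the fill, otherwise a delivered assignment graded 0 can no longer be told apart from a missing one.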

In [20]:
# check whether the assignment was delivered by the student or not
support_table['delivered'] = np.where(support_table['finalgrade'].isna(), 0, 1)

#now, we fill the nas with 0 (note: fillna is applied to the whole table, not only to finalgrade)
support_table.fillna(0, inplace = True)

#as a final note, we can now verify which assignments/courses we can exclude
#criteria 1: avg finalgrade = 0
#criteria 2 -> if all assignments have average grade 0, the course can be excluded
exclusion_table = support_table.groupby(['courseid', 'assign_id']).agg({
                                                    'userid': 'count',
                                                    'finalgrade' : 'mean',
                                                    'rawgrademax' : 'mean',
                                                    },
                                                    )

In [21]:
exclusion_table.describe(include = 'all', datetime_is_numeric = True).T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
userid,9491.0,32.009693,46.960124,1.0,7.0,15.0,43.0,657.0
finalgrade,9491.0,23.646698,78.413648,0.0,0.0,0.646154,8.661373,1001.0
rawgrademax,9491.0,91.894443,160.428539,0.07,10.0,100.0,100.0,1001.0


In [23]:
exclusion_table

Unnamed: 0_level_0,Unnamed: 1_level_0,userid,finalgrade,rawgrademax
courseid,assign_id,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
1000.0,217046.0,2,0.00,100.0
1000.0,217076.0,2,0.00,100.0
1000.0,217078.0,1,0.00,100.0
1000.0,35982.0,12,32.75,41.0
1000.0,35985.0,12,36.00,41.0
...,...,...,...,...
999.0,35919.0,2,0.00,100.0
999.0,35921.0,3,0.00,100.0
999.0,35924.0,51,0.00,100.0
999.0,35928.0,48,0.00,100.0


In [None]:
a['count']

In [None]:
student_logs_actions.groupby([
                                        'course',
                                        #'userid',
                                        ],
                                        as_index = False).count()

Defining mandatory assignments vs non-mandatory assignments:

The authors of the paper defined mandatory assignments as all assignments with a submission rate of over 40%.

In [None]:
#records to keep for export - eventually
student_list # pivot table with counts of interactions of students for a course
student_count
mandatory_status
student_logs_actions #logs with student actions performed in the context of the course

### Additional Feature Engineering

#### Done

From now on, we will always work with df_treated in future notebooks.