## Notebook 2.2. Understanding and Preprocessing of Moodle Logs

This is the first notebook that forms part of the thesis work proper. In it, we take the original student log file and perform the manipulations needed to obtain a dataset that has the potential to be useful.

#### 1. A small overview of the logs and each column

The logs record interactions with the Moodle LMS:

- Each interaction with the LMS is recorded sequentially:
    - when the action was performed,
    - what the nature of the interaction was,
    - where the actor was when the action was performed,
    - who performed the interaction,
    - in the context of which course page,
    - what the specific link was.
- Each user is uniquely identified by the userID.
- Each course is uniquely identified by the courseID.
- Each specific interaction is recorded: the action performed and the URL clicked.
- Each click is timestamped.
- The actor's IP is recorded.

A brief description of each column follows:

##### id
A sequentially numbered unique identifier of each interaction.

##### time
A floating-point representation of the timestamp of the event, in seconds.

##### userid
Unique numerical identifier of the user (student, faculty or other).

##### ip
IP address used by the user when interacting with the LMS.

##### course
Unique numerical identifier of a course.

##### cmid
Meaning unclear at the moment. In Moodle's log table this most likely refers to the course module ID, but this remains to be checked against other Moodle sources.

##### action
Categorization of the nature of the interaction.

##### url
Link the user clicked on.

##### info
Additional descriptors added by the user.

#### 2. We'll start this notebook by importing all relevant packages and data

All data is stored in the csv files that were exported in the previous notebook. 

To minimize unnecessary steps, as we import these csv files we will immediately handle the following in each dataset:
1. The first unnamed column is removed.
2. All columns that are made up entirely of missing values are removed - we have detected some.
3. All numerical columns that are immediately recognized as categorical (or likely to be categorical) are declared as such on import - this does not rule out that, upon further assessment, other features may also be converted to objects.
4. All features that display no null values and have a single value are promptly removed as well.
5. No preprocessing of time-related features is performed at this stage, because the features related to time may require further assessment.
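The steps above can be sketched on a toy frame. This is only an illustration under stated assumptions: the column names `empty_col` and `constant_col` and the helper `basic_cleanup` are hypothetical, and identifier columns are cast to `object`, mirroring the `dtype` argument used when the csv files are read in this notebook.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for a freshly read csv (all values are made up).
raw = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2, 3],
    "userid": [10, 11, 10, 12],
    "course": [1, 1, 2, 2],
    "empty_col": [np.nan] * 4,     # entirely missing
    "constant_col": ["x"] * 4,     # single value, no nulls
    "info": ["a", None, "b", "c"],
})

def basic_cleanup(df, id_columns):
    """Hypothetical helper applying steps 1 to 4 to a copy of df."""
    out = df.drop(columns="Unnamed: 0", errors="ignore")   # step 1
    out = out.dropna(how="all", axis=1)                     # step 2
    present = [c for c in id_columns if c in out.columns]
    out = out.astype({c: object for c in present})          # step 3
    # step 4: columns without nulls that hold a single distinct value
    constant = [c for c in out.columns
                if out[c].notna().all() and out[c].nunique() == 1]
    return out.drop(columns=constant)

clean = basic_cleanup(raw, id_columns=["userid", "course"])
print(clean.dtypes)
```

Time-related columns are deliberately left untouched here, in line with step 5.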

In [1]:
#import libs
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

sns.set()
import warnings
warnings.filterwarnings('ignore')

In [2]:
#loading student log data 
student_logs = pd.read_csv('../Data/R_Gonz_data_log.csv', 
                           dtype = {
                                   'id': object,
                                   'itemid': object,
                                   'userid': object,
                                   'course': object,
                                   'cmid': object,
                                   },).drop('Unnamed: 0', axis = 1).dropna(how = 'all', axis = 1) #logs

#loading support table
support_table = pd.read_csv('../Data/R_Gonz_support_table.csv', 
                           dtype = {
                                   'assign_id': object,
                                   'courseid': object,
                                   'userid': object,
                                   }, 
                            parse_dates = ['sup_time', 'startdate']).drop('Unnamed: 0', axis = 1).dropna(how = 'all', axis = 1) #support table

#after checking, we note that time and stime refer to the same date and differ by 1 hour, hence, we will only keep the time column
#additionally, we convert time from epoch seconds to datetime right away
student_logs['time'] = pd.to_datetime(student_logs['time'], unit = 's', errors = 'coerce')
student_logs.drop('stime', axis = 1, inplace = True)
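As a small illustration of the conversion used above, the sketch below converts a few epoch values expressed in seconds. The values are hypothetical; the first two were chosen so that they correspond to 2014-06-05 17:09:07 and 2014-06-05 17:14:48.

```python
import numpy as np
import pandas as pd

# Hypothetical epoch values in seconds, plus one missing entry.
epoch_seconds = pd.Series([1401988147.0, 1401988488.0, np.nan])

# unit="s" interprets the numbers as seconds since 1970-01-01 00:00:00;
# the missing entry becomes NaT.
converted = pd.to_datetime(epoch_seconds, unit="s", errors="coerce")
print(converted)
```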

### Taking a preliminary look at the logs

In [3]:
student_logs.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 47097824 entries, 0 to 47097823
Data columns (total 10 columns):
 #   Column  Dtype         
---  ------  -----         
 0   id      object        
 1   time    datetime64[ns]
 2   userid  object        
 3   ip      object        
 4   course  object        
 5   module  object        
 6   cmid    object        
 7   action  object        
 8   url     object        
 9   info    object        
dtypes: datetime64[ns](1), object(9)
memory usage: 3.5+ GB


In [4]:
student_logs.describe(include ='all', datetime_is_numeric = True).T

Unnamed: 0,count,unique,top,freq,mean,min,25%,50%,75%,max
id,47097824,47097824.0,1.0,1.0,NaT,NaT,NaT,NaT,NaT,NaT
time,47097824,,,,2015-01-20 08:00:31.016559872,2014-06-05 17:09:07,2014-11-10 12:51:08.750000128,2015-01-17 20:12:12,2015-03-27 22:43:11,2015-07-31 03:14:09
userid,47097824,30517.0,0.0,3219653.0,NaT,NaT,NaT,NaT,NaT,NaT
ip,47097824,161783.0,127.0.0.1,30508698.0,NaT,NaT,NaT,NaT,NaT,NaT
course,47097824,5112.0,1.0,17715596.0,NaT,NaT,NaT,NaT,NaT,NaT
module,47097824,39.0,course,17937931.0,NaT,NaT,NaT,NaT,NaT,NaT
cmid,47097824,167235.0,0.0,34846344.0,NaT,NaT,NaT,NaT,NaT,NaT
action,47097824,157.0,view,27239500.0,NaT,NaT,NaT,NaT,NaT,NaT
url,47070765,754343.0,view.php?id=1,6303588.0,NaT,NaT,NaT,NaT,NaT,NaT
info,42907847,693729.0,1,6306585.0,NaT,NaT,NaT,NaT,NaT,NaT


In [5]:
student_logs

Unnamed: 0,id,time,userid,ip,course,module,cmid,action,url,info
0,1.0,2014-06-05 17:09:07,2.0,127.0.0.1,1.0,user,0.0,login,view.php?id=2&course=1,2
1,2.0,2014-06-05 17:14:48,2.0,127.0.0.1,1.0,user,0.0,update,view.php?id=2,
2,3.0,2014-06-05 17:14:48,2.0,127.0.0.1,1.0,user,0.0,update,view.php?id=2,
3,4.0,2014-06-05 17:16:13,2.0,127.0.0.1,1.0,course,0.0,view,view.php?id=1,1
4,5.0,2014-06-06 07:37:19,2.0,127.0.0.1,1.0,user,0.0,login,view.php?id=2&course=1,2
...,...,...,...,...,...,...,...,...,...,...
47097819,47116816.0,2015-07-31 03:00:59,0.0,127.0.0.1,1.0,user,0.0,add,/view.php?id=81854,Cathleen Scheurich
47097820,47116817.0,2015-07-31 03:00:59,0.0,127.0.0.1,1.0,user,0.0,add,/view.php?id=81855,Sara Gil Díez
47097821,47116818.0,2015-07-31 03:00:59,0.0,127.0.0.1,1.0,user,0.0,add,/view.php?id=81856,Eduardo García Bermo
47097822,47116819.0,2015-07-31 03:14:08,0.0,127.0.0.1,635.0,role,0.0,unassign,admin/roles/assign.php?contextid=24578&roleid=5,Estudiante


In [6]:
#use this cell to write any additional piece of code that may be required

### First step: Make it lighter.

One of the first things to do is to consider the set of students and courses we intend to use. From our support table, we have a list of the courses and students that we are interested in. We'll then use that list of unique student-course pairs to keep only the logs of those students in those courses.

In [38]:
#We perform a group operation that counts the rows of each unique course-student pair in the support table
student_courses = support_table.groupby([
                                        'courseid',
                                         'userid',
                                        ],
                                        as_index = False).size().rename(columns = {'courseid':'course'}).sort_values(by = 'size', ascending = False)

#then, we perform an inner merge - only keeping the log rows that match one of those course-student pairs
student_logs_actions = pd.merge(student_courses, student_logs, on=[
                                                        'userid',
                                                        'course',
                                                        ], 
                                                        how='inner').drop('size', axis = 1)
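One way to sanity-check this kind of filter is to run it on a toy example and inspect which log rows find a matching pair. The sketch below is not part of the original pipeline: the data are made up, and it uses a left merge with `indicator=True` instead of the inner merge so that unmatched rows stay visible.

```python
import pandas as pd

# Toy stand-ins for the support table and the logs (values are made up).
support = pd.DataFrame({
    "courseid": ["1", "1", "2", "2"],
    "userid":   ["a", "a", "b", "c"],
})
logs = pd.DataFrame({
    "id":     ["1", "2", "3", "4", "5"],
    "userid": ["a", "a", "b", "z", "c"],
    "course": ["1", "9", "2", "2", "2"],
})

# One row per course-student pair, as in the cell above.
pairs = (support.groupby(["courseid", "userid"], as_index=False)
                .size()
                .rename(columns={"courseid": "course"}))

# The _merge column tells us whether each log row matched a pair.
checked = logs.merge(pairs.drop(columns="size"),
                     on=["userid", "course"], how="left", indicator=True)
kept = checked[checked["_merge"] == "both"].drop(columns="_merge")

print(checked["_merge"].value_counts())
print(kept)
```

Rows marked `left_only` are the ones an inner merge would discard, which makes it easy to see how much of the log is dropped and why.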

In [39]:
student_logs_actions.describe(include ='all', datetime_is_numeric = True).T

Unnamed: 0,count,unique,top,freq,mean,min,25%,50%,75%,max
course,8073418,787.0,2059.0,152140.0,NaT,NaT,NaT,NaT,NaT,NaT
userid,8073418,13723.0,61564.0,16028.0,NaT,NaT,NaT,NaT,NaT,NaT
id,8073418,8073418.0,26367149.0,1.0,NaT,NaT,NaT,NaT,NaT,NaT
time,8073418,,,,2015-01-22 11:30:42.195557888,2014-07-01 13:30:43,2014-11-11 07:56:32,2015-01-20 13:41:47.500,2015-04-01 14:28:01.750000128,2015-07-31 03:00:10
ip,8073418,78310.0,127.0.0.1,5090458.0,NaT,NaT,NaT,NaT,NaT,NaT
module,8073418,32.0,course,3332811.0,NaT,NaT,NaT,NaT,NaT,NaT
cmid,8073418,34083.0,0.0,3686259.0,NaT,NaT,NaT,NaT,NaT,NaT
action,8073418,127.0,view,6412568.0,NaT,NaT,NaT,NaT,NaT,NaT
url,8064790,233031.0,view.php?id=2059,62213.0,NaT,NaT,NaT,NaT,NaT,NaT
info,7798298,51751.0,Ver página de estado de las entregas propios.,738590.0,NaT,NaT,NaT,NaT,NaT,NaT


In [40]:
student_logs_actions

Unnamed: 0,course,userid,id,time,ip,module,cmid,action,url,info
0,2272.0,72404.0,26367149.0,2015-02-01 21:28:28,127.0.0.1,course,0.0,view,view.php?id=2272,2272
1,2272.0,72404.0,26367215.0,2015-02-01 21:28:51,127.0.0.1,quiz,98690.0,view,view.php?id=98690,1654
2,2272.0,72404.0,26367328.0,2015-02-01 21:29:21,127.0.0.1,quiz,98706.0,view,view.php?id=98706,1657
3,2272.0,72404.0,26367455.0,2015-02-01 21:30:14,127.0.0.1,quiz,98690.0,attempt,review.php?attempt=49216,1654
4,2272.0,72404.0,26367456.0,2015-02-01 21:30:14,127.0.0.1,quiz,98690.0,continue attempt,review.php?attempt=49216,1654
...,...,...,...,...,...,...,...,...,...,...
8073413,999.0,9203.0,42308220.0,2015-05-16 08:27:22,127.0.0.1,resource,260558.0,view,view.php?id=260558,168571
8073414,999.0,9203.0,42308224.0,2015-05-16 08:27:32,127.0.0.1,course,0.0,view,view.php?id=999,999
8073415,999.0,9203.0,43796806.0,2015-05-25 09:26:31,127.0.0.1,course,0.0,view,view.php?id=999,999
8073416,999.0,9203.0,43796893.0,2015-05-25 09:26:55,127.0.0.1,resource,260558.0,view,view.php?id=260558,168571


**Some preliminary observations of our most common interactions between the students and the systems**

In [41]:
with pd.option_context('display.max_rows', None, 'display.max_columns', None):  # more options can be specified also
    print(student_logs_actions['action'].value_counts())

view                                   6412568
continue attempt                        336654
view all                                254719
view forum                              150884
review                                  124149
view discussion                         115347
view summary                            110710
view submit assignment form             108914
submit                                   91420
attempt                                  86507
close attempt                            82644
view section                             60239
update                                   28473
view confirm submit assignment form      13215
submit for grading                       12244
view forums                              11369
recent                                    9038
submission statement accepted             7774
view mailbox                              6543
add post                                  4245
launch                                    3733
view submissi

In [42]:
with pd.option_context('display.max_rows', None, 'display.max_columns', None):  # more options can be specified also
    print(student_logs_actions['module'].value_counts())

course             3332811
resource           1686578
quiz               1094432
assign             1058987
forum               286326
user                123465
folder              110551
url                  92491
page                 89080
grade                28067
glossary             28047
imscp                21307
book                 18804
label                18135
workshop             15447
wiki                 14998
questionnaire        14036
choice               13342
scorm                 8380
jmail                 8194
oublog                3016
bigbluebuttonbn       2532
data                  2202
calendar               869
bookmark               433
lesson                 339
pcast                  317
recordingsbn           125
nanogong                82
discussion              13
notes                   10
role                     2
Name: module, dtype: int64


At this stage, we only have course/student pairs with final grades and activity.
From these, we will create a student-course pivot table from which we can easily obtain the students attending each particular course.

In the pivot table, each cell holds the number of LMS interactions (clicks) performed by a student in the context of a given course. The count of valid entries in each column gives us the number of students attending a given curricular unit.

In [47]:
student_list = pd.pivot_table(student_logs_actions, index='userid', columns = 'course', values = 'url',
                    aggfunc='count')

student_list

course,1000.0,1002.0,1010.0,1013.0,1020.0,1024.0,1026.0,1027.0,1028.0,1031.0,...,845.0,90.0,918.0,922.0,961.0,984.0,985.0,992.0,993.0,999.0
userid,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
10087.0,,,,,,,,,,,...,,,,,,,,,,
1009.0,,,,,,,,,,,...,,,,,,,,,,
10159.0,,,,,,,,,,,...,,,,,,,,,,
10184.0,,,,,,,,,,,...,,,,,,,,,,
10273.0,,,,,,,,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9895.0,,,,,,,,,,,...,,,,,,,,,,
9915.0,,,,,,,,,,,...,,,,,,,,,,
9925.0,,,,,,,,,,,...,,,,,,,,,,
9964.0,,,,,,,,,,,...,,,,,,,,,,


In [49]:
student_list.info()

<class 'pandas.core.frame.DataFrame'>
Index: 13723 entries, 10087.0 to 9966.0
Columns: 787 entries, 1000.0 to 999.0
dtypes: float64(787)
memory usage: 82.5+ MB


In [51]:
student_list.describe()

course,1000.0,1002.0,1010.0,1013.0,1020.0,1024.0,1026.0,1027.0,1028.0,1031.0,...,845.0,90.0,918.0,922.0,961.0,984.0,985.0,992.0,993.0,999.0
count,12.0,22.0,47.0,68.0,61.0,56.0,112.0,55.0,6.0,13.0,...,53.0,15.0,1.0,2.0,12.0,13.0,20.0,16.0,54.0,56.0
mean,218.833333,231.863636,443.425532,217.426471,356.508197,161.178571,174.928571,768.581818,204.5,362.307692,...,303.773585,336.933333,636.0,416.0,110.666667,323.230769,224.55,418.375,421.12963,498.339286
std,81.337494,97.096072,185.565381,76.658638,258.300898,100.790892,96.185886,317.628171,89.511452,198.145311,...,175.296523,102.017132,,172.534055,43.328205,130.449322,105.023043,157.19367,233.602714,555.000975
min,115.0,71.0,155.0,75.0,135.0,10.0,37.0,240.0,110.0,149.0,...,36.0,185.0,636.0,294.0,50.0,138.0,77.0,202.0,121.0,158.0
25%,143.75,176.75,299.0,161.5,235.0,96.5,115.75,598.5,130.75,262.0,...,189.0,252.0,636.0,355.0,86.25,235.0,144.5,301.25,279.0,302.25
50%,233.0,232.0,408.0,210.0,291.0,149.5,157.0,698.0,191.0,296.0,...,246.0,319.0,636.0,416.0,109.5,310.0,213.0,337.0,349.0,386.5
75%,268.0,263.0,572.5,271.0,374.0,208.0,215.25,843.5,270.0,435.0,...,367.0,405.0,636.0,477.0,128.5,373.0,284.0,559.25,514.75,535.75
max,368.0,553.0,982.0,402.0,1957.0,486.0,790.0,2215.0,328.0,768.0,...,828.0,508.0,636.0,538.0,194.0,549.0,447.0,682.0,1584.0,4301.0
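To make the reading of the pivot table concrete, here is a minimal sketch on made-up data. The frame `toy_logs` is hypothetical. Note that `aggfunc='count'` counts non-null `url` values, so interactions with a missing `url` do not contribute to a cell.

```python
import pandas as pd

# Made-up logs: three students, two courses, one click with a missing url.
toy_logs = pd.DataFrame({
    "userid": ["a", "a", "a", "b", "b", "c"],
    "course": ["1", "1", "2", "1", "1", "2"],
    "url":    ["u1", "u2", "u3", "u4", None, "u5"],
})

toy_pivot = pd.pivot_table(toy_logs, index="userid", columns="course",
                           values="url", aggfunc="count")

# Non-null cells per column: number of students with activity in each course.
students_per_course = toy_pivot.count()
# Sum per column: total number of counted clicks in each course.
clicks_per_course = toy_pivot.sum()

print(toy_pivot)
print(students_per_course)
print(clicks_per_course)
```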


Next, we can perform some exploratory analysis of these logs. At a very simple level, we can look, for each course, at how the number of clicks evolves over the weeks.

In [None]:
student_logs_actions.groupby([
                                        'course',
                                        #'userid',
                                        ],
                                        as_index = False).count()
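The cell above counts rows per course only. A possible way to obtain weekly clicks per course is sketched below on made-up data; it assumes a `time` column already converted to datetime and uses `pd.Grouper` with weekly bins (weeks ending on Sunday, the pandas default for `freq='W'`).

```python
import pandas as pd

# Made-up clicks for two courses.
toy = pd.DataFrame({
    "course": ["1", "1", "1", "2", "2"],
    "time": pd.to_datetime([
        "2015-02-02 10:00", "2015-02-03 11:00", "2015-02-10 09:00",
        "2015-02-02 08:00", "2015-02-17 12:00",
    ]),
})

# Number of clicks per course and per week.
weekly_clicks = (toy.groupby(["course", pd.Grouper(key="time", freq="W")])
                    .size()
                    .rename("clicks")
                    .reset_index())
print(weekly_clicks)
```

The same grouping could include `userid` as an additional key if weekly clicks per student are needed.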

Defining mandatory assignments vs non-mandatory assignments:

The authors of the paper defined mandatory assignments as all assignments with a submission rate of over 40%.
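A minimal sketch of how this rule could be applied is shown below. It rests on assumptions that are not confirmed by the notebook: a table with one row per submission (columns `courseid`, `assign_id`, `userid`, as in the support table) and a known number of students per course. The submission rate is taken here as distinct submitters divided by students in the course, which may differ from the paper's exact definition.

```python
import pandas as pd

# Hypothetical table: one row per (assignment, student) submission.
submissions = pd.DataFrame({
    "courseid":  ["1"] * 6,
    "assign_id": ["A", "A", "A", "B", "C", "C"],
    "userid":    ["u1", "u2", "u3", "u1", "u1", "u2"],
})
# Hypothetical number of students attending each course.
students_per_course = pd.Series({"1": 4}, name="n_students")

# Distinct submitters per assignment.
rates = (submissions.groupby(["courseid", "assign_id"])["userid"]
                    .nunique()
                    .rename("n_submitters")
                    .reset_index())

# Submission rate = submitters / students in the course (an assumption).
rates["submission_rate"] = (
    rates["n_submitters"] / rates["courseid"].map(students_per_course)
)
rates["mandatory"] = rates["submission_rate"] > 0.40
print(rates)
```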

In [None]:
#using regex on all ID columns to keep only the numeric part (raw strings avoid invalid escape warnings)

goliath['ProductFamily_ID'] = goliath['ProductFamily_ID'].str.extract(r'(\d+)', expand=False)
goliath['ProductCategory_ID'] = goliath['ProductCategory_ID'].str.extract(r'(\d+)', expand=False)
goliath['ProductBrand_ID'] = goliath['ProductBrand_ID'].str.extract(r'(\d+)', expand=False)
goliath['ProductName_ID'] = goliath['ProductName_ID'].str.extract(r'(\d+)', expand=False)
goliath['Point-of-Sale_ID'] = goliath['Point-of-Sale_ID'].str.extract(r'(\d+)', expand=False)
goliath['ProductPackSKU_ID'] = goliath['ProductPackSKU_ID'].str.extract(r'(\d+)', expand=False)

In [None]:
#convert dataframe to a dataframe half its size by merging values and units on sku, store and date
values_df = goliath[goliath['Measures']=='Sell-out values']
units_df = goliath[goliath['Measures']=='Sell-out units']


goliath = pd.merge(units_df,values_df[['ProductPackSKU_ID','Point-of-Sale_ID','Date','Value']], on=['ProductPackSKU_ID','Point-of-Sale_ID','Date'],suffixes=('_units', '_price'))
goliath.drop(columns='Measures', inplace = True)

### Additional Feature Engineering

#### Done

From now on, we will always work with df_treated in future notebooks.