## Notebook 2.1. Understanding and Preprocessing of Moodle Logs

For all practical purposes, this is the first notebook that is part of the thesis work proper. Here we take the original student log file and perform the manipulations needed to obtain a dataset that is potentially useful.

#### 1. A small overview of the logs and each column

The logs presented here record interactions with the Moodle LMS:

    - Each interaction with the LMS is recorded sequentially:
        when the action was performed,
        what the nature of the interaction was,
        where the actor was when the action was performed,
        who performed the interaction,
        in the context of which course page,
        what the specific link was.

    - Each user is uniquely identified by the userID.
    - Each course is uniquely identified by the courseID.
    - Each specific interaction is recorded: the action performed and the url clicked.
    - Each click is timestamped.
    - The actor's IP is recorded.

A brief description of each column follows:

##### id
A sequentially numbered unique identifier of each interaction.

##### time
Unclear at the moment. The values (around 1.4e9) are consistent with a Unix epoch timestamp in seconds, i.e. a different representation of the time in stime - to revise.

##### userid
Unique numerical identifier of the user, be it student, faculty or other.

##### ip
IP address used by the user when interacting with the LMS.

##### course
Unique numerical identifier of a course.

##### cmid
Meaning unclear at the moment; possibly the course module id that Moodle assigns to an activity within a course - to check against other Moodle sources.

##### action
Categorization of the nature of the interaction.

##### url
Link the user clicked on.
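
As a rough illustration of what the `url` column carries, the query string can be split into its parameters with the standard library. This is only a sketch: the helper name `parse_log_url` is hypothetical, and the sample values are copied from rows displayed in this notebook.

```python
from urllib.parse import urlsplit, parse_qs


def parse_log_url(url):
    """Split a logged url into its script path and its query parameters."""
    parts = urlsplit(url)
    # parse_qs returns a list of values per key; keep the first one only
    params = {key: values[0] for key, values in parse_qs(parts.query).items()}
    return parts.path, params


# Sample value taken from the first row shown in this notebook
path, params = parse_log_url('view.php?id=2&course=1')
print(path, params)
```

Keeping only the first value of each parameter is a simplification that assumes no parameter is repeated within a single url.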

##### info
Additional descriptors added by the user.

##### stime
Timestamp of the action.
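
One way to probe the relationship between `time` and `stime` is to read `time` as seconds since the Unix epoch and compare it with the matching `stime`. The sketch below assumes, without confirmation, that the smallest `time` value reported by `describe()` further down belongs to the first row displayed in this notebook.

```python
from datetime import datetime, timezone

# Smallest `time` value reported by describe() in this notebook
epoch_seconds = 1401988147

# Assumption: `time` counts seconds since 1970-01-01 00:00:00 UTC
as_utc = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)

# `stime` of the first displayed row, parsed as a naive datetime
stime = datetime.strptime('2014-06-05 18:09:07', '%Y-%m-%d %H:%M:%S')

# Gap between the two representations
offset = stime - as_utc.replace(tzinfo=None)
print(as_utc.isoformat(), offset)
```

Under this reading the two values differ by exactly one hour, which would be consistent with `stime` being a local-time rendering of `time`. This remains a hypothesis until it is checked on more rows.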

#### 2. We'll start this notebook by importing all relevant packages and data

All data is stored in the csv files that were exported in the previous notebook. 

In order to minimize unnecessary steps, as we import these csv files we will immediately, for each dataset:
1. Remove the first unnamed column.
2. Remove all columns that are made up entirely of missing values - we have detected some.
3. Declare as categorical all numerical columns that are immediately recognized as categorical (or likely to be categorical) - this does not prevent other features from being converted to objects upon further assessment.
4. Remove all features that display no null values and hold a single value.
5. Perform no preprocessing of time-related features at this stage, since these features may require further assessment.
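
The steps above can be sketched on a tiny made-up CSV. The column names `empty` and `constant` and all values are invented for illustration. The use of the `category` dtype in step 3 is one possible choice; the loading cell in this notebook uses `dtype = object` instead.

```python
import io

import pandas as pd

# Tiny made-up CSV standing in for an exported file (not real data)
raw = io.StringIO(
    ",id,userid,action,empty,constant\n"
    "0,1,2,login,,x\n"
    "1,2,2,view,,x\n"
    "2,3,5,view,,x\n"
)

df = pd.read_csv(raw, dtype={'id': object, 'userid': object})

# 1. drop the unnamed index column written by the previous export
df = df.drop(columns='Unnamed: 0')

# 2. drop columns made up entirely of missing values
df = df.dropna(how='all', axis=1)

# 3. declare identifier-like columns as categoricals (one possible choice)
df['userid'] = df['userid'].astype('category')

# 4. drop fully populated columns that hold a single value
single_valued = [col for col in df.columns
                 if df[col].notna().all() and df[col].nunique() == 1]
df = df.drop(columns=single_valued)

print(df.dtypes)
```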

In [1]:
#import libs
import pandas as pd
import numpy as np

import warnings
warnings.filterwarnings('ignore')

In [2]:
#loading student log data
student_logs = (
    pd.read_csv('../Data/R_Gonz_data_log.csv',
                dtype = {
                        'id': object,
                        'itemid': object,
                        'userid': object,
                        'course': object,
                        'cmid': object,
                        })
    .drop('Unnamed: 0', axis = 1)   #index column written by the previous export
    .dropna(how = 'all', axis = 1)  #columns made entirely of missing values
) #logs

### Taking a preliminary look at the logs

In [3]:
student_logs.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 47097824 entries, 0 to 47097823
Data columns (total 11 columns):
 #   Column  Dtype  
---  ------  -----  
 0   id      object 
 1   time    float64
 2   userid  object 
 3   ip      object 
 4   course  object 
 5   module  object 
 6   cmid    object 
 7   action  object 
 8   url     object 
 9   info    object 
 10  stime   object 
dtypes: float64(1), object(10)
memory usage: 3.9+ GB
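
The `+` in the memory figure above means that pandas did not measure the contents of the object columns, so the real footprint is larger. A minimal sketch of how this could be measured, and of how low-cardinality columns such as `module` and `action` might be compacted, follows. The data is synthetic and the `category` dtype is a suggestion, not a step this notebook performs.

```python
import pandas as pd

# Synthetic stand-in for two low-cardinality columns (not the real logs)
sample = pd.DataFrame({
    'module': ['course', 'user', 'forum', 'quiz'] * 25_000,
    'action': ['view', 'login', 'update', 'add'] * 25_000,
})

# deep=True also measures the string contents, not only the references
bytes_before = sample.memory_usage(deep=True).sum()

# Categoricals store each distinct label once plus small integer codes
compact = sample.astype({'module': 'category', 'action': 'category'})
bytes_after = compact.memory_usage(deep=True).sum()

print(f'{bytes_before:,} -> {bytes_after:,} bytes')
```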


In [4]:
#datetime_is_numeric was removed in pandas 2.0; it is not needed here since no column has a datetime dtype
student_logs.describe(include = 'all').T

Unnamed: 0,count,unique,top,freq,mean,std,min,25%,50%,75%,max
id,47097824.0,47097824.0,1.0,1.0,,,,,,,
time,47097824.0,,,,1421740830.96801,7176818.752714,1401988147.0,1415623868.75,1421525532.0,1427496191.0,1438312449.0
userid,47097824.0,30517.0,0.0,3219653.0,,,,,,,
ip,47097824.0,161783.0,127.0.0.1,30508698.0,,,,,,,
course,47097824.0,5112.0,1.0,17715596.0,,,,,,,
module,47097824.0,39.0,course,17937931.0,,,,,,,
cmid,47097824.0,167235.0,0.0,34846344.0,,,,,,,
action,47097824.0,157.0,view,27239500.0,,,,,,,
url,47070765.0,754343.0,view.php?id=1,6303588.0,,,,,,,
info,42907847.0,693729.0,1,6306585.0,,,,,,,


In [5]:
student_logs

Unnamed: 0,id,time,userid,ip,course,module,cmid,action,url,info,stime
0,1.0,1.401988e+09,2.0,127.0.0.1,1.0,user,0.0,login,view.php?id=2&course=1,2,2014-06-05 18:09:07
1,2.0,1.401988e+09,2.0,127.0.0.1,1.0,user,0.0,update,view.php?id=2,,2014-06-05 18:14:48
2,3.0,1.401988e+09,2.0,127.0.0.1,1.0,user,0.0,update,view.php?id=2,,2014-06-05 18:14:48
3,4.0,1.401989e+09,2.0,127.0.0.1,1.0,course,0.0,view,view.php?id=1,1,2014-06-05 18:16:13
4,5.0,1.402040e+09,2.0,127.0.0.1,1.0,user,0.0,login,view.php?id=2&course=1,2,2014-06-06 08:37:19
...,...,...,...,...,...,...,...,...,...,...,...
47097819,47116816.0,1.438312e+09,0.0,127.0.0.1,1.0,user,0.0,add,/view.php?id=81854,Cathleen Scheurich,2015-07-31 04:00:59
47097820,47116817.0,1.438312e+09,0.0,127.0.0.1,1.0,user,0.0,add,/view.php?id=81855,Sara Gil Díez,2015-07-31 04:00:59
47097821,47116818.0,1.438312e+09,0.0,127.0.0.1,1.0,user,0.0,add,/view.php?id=81856,Eduardo García Bermo,2015-07-31 04:00:59
47097822,47116819.0,1.438312e+09,0.0,127.0.0.1,635.0,role,0.0,unassign,admin/roles/assign.php?contextid=24578&roleid=5,Estudiante,2015-07-31 04:14:08


In [6]:
#use this cell to write any additional piece of code that may be required

### First step: Make it lighter.

We will start by removing all letters from numerically identifiable categories, that is:
- ProductFamily_ID
- ProductCategory_ID
- ProductBrand_ID
- ProductName_ID
- ProductPackSKU_ID

Note that the same product may have different SKUs!

In [None]:
student_logs

In [None]:
#using regex on all ID columns to remove unnecessary text and keep only the digits
id_columns = ['ProductFamily_ID', 'ProductCategory_ID', 'ProductBrand_ID',
              'ProductName_ID', 'Point-of-Sale_ID', 'ProductPackSKU_ID']

for column in id_columns:
    #raw string so that \d reaches the regex engine unchanged
    goliath[column] = goliath[column].str.extract(r'(\d+)', expand=False)

In [None]:
#convert dataframe to a dataframe half its size by merging values and units on sku, store and date
values_df = goliath[goliath['Measures']=='Sell-out values']
units_df = goliath[goliath['Measures']=='Sell-out units']

merge_keys = ['ProductPackSKU_ID', 'Point-of-Sale_ID', 'Date']

goliath = pd.merge(units_df,
                   values_df[merge_keys + ['Value']],
                   on=merge_keys,
                   suffixes=('_units', '_price'))
goliath.drop(columns='Measures', inplace = True)

### Additional Feature Engineering

#### Done

From now on, all subsequent notebooks will work with df_treated.