## Notebook 2: Data Understanding and Preprocessing

This is the first notebook that forms part of the thesis work proper. In it, we take the original student log file and perform the manipulations needed to obtain a dataset that can be useful for analysis.

#### 1. A small overview of the logs and each column

The logs record interactions with the Moodle LMS:

    - Each interaction with the LMS is recorded sequentially:
        When the action is performed,
        What the nature of the interaction is,
        Where the actor is when the action is performed,
        Who performed the interaction,
        In the context of which course page,
        What the specific link is.

    - Each user is uniquely identified by the userID.
    - Each course is uniquely identified by the courseID.
    - Each specific interaction is recorded: the action performed and the clicked URL.
    - Each click is timestamped.
    - The actor's IP is recorded.

A brief description of each column follows:

##### id
A sequentially numbered unique identifier for each interaction.

##### time
Unclear at the moment; likely a different representation of the time of the action. To be revised.

##### userid
Unique numerical identifier of the user, be it a student, a faculty member or other.

##### ip
IP address used by the user when interacting with the LMS.

##### course
Unique numerical identifier of a course.

##### cmid
Meaning unclear at the moment; to be checked against other Moodle sources.

##### action
Categorization of the nature of the interaction.

##### url
Link the user clicked on.

##### info
Additional descriptors added by the user.

##### stime
Timestamp of the action.
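
Since the meaning of `time` is still open, one hypothesis worth checking is that it stores the same moment as `stime`, expressed as seconds since the Unix epoch. The sketch below tests that idea on a toy frame; the values are made up for illustration, and the real columns may differ (for instance, `stime` could be in local time rather than UTC).

```python
import pandas as pd

# Toy stand-in for the logs; values are invented for illustration only.
toy_logs = pd.DataFrame({
    "time": [1420070400, 1420074000],
    "stime": ["2015-01-01 00:00:00", "2015-01-01 01:00:00"],
})

# Hypothesis (an assumption, not confirmed by the data description):
# `time` holds seconds elapsed since 1970-01-01 00:00:00 UTC.
toy_logs["time_as_datetime"] = pd.to_datetime(toy_logs["time"], unit="s")

# Compare the converted value with the human-readable timestamp.
toy_logs["matches_stime"] = (
    toy_logs["time_as_datetime"] == pd.to_datetime(toy_logs["stime"])
)
print(toy_logs)
```

If most rows of the real logs match, `time` can be treated as redundant with `stime`; a constant offset between the two would instead point to a time zone difference.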

In [1]:
#import libs
import pandas as pd
import numpy as np

import warnings
warnings.filterwarnings('ignore')

In [5]:
#loading student log data 
student_logs = pd.read_csv('../Data/R_Gonz_data_log.csv').drop('Unnamed: 0', axis = 1) #logs

#other tables with support information
context_table = pd.read_csv('../Data/R_Gonz_data_mdl_context.csv').drop('Unnamed: 0', axis = 1) #context table -> unclear utility
course_table = pd.read_csv('../Data/R_Gonz_data_mdl_course.csv').drop('Unnamed: 0', axis = 1) #course table -> unclear utility
course_mod_table = pd.read_csv('../Data/R_Gonz_data_mdl_course_modules.csv').drop('Unnamed: 0', axis = 1) #course module table -> unclear utility
grades_table = pd.read_csv('../Data/R_Gonz_data_mdl_grade_grades.csv').drop('Unnamed: 0', axis = 1) # grade table -> unclear utility
grade_item_table = pd.read_csv('../Data/R_Gonz_data_mdl_grade_items.csv').drop('Unnamed: 0', axis = 1) # grade_items table -> unclear utility
role_ass_table = pd.read_csv('../Data/R_Gonz_data_mdl_role_assignments.csv').drop('Unnamed: 0', axis = 1) # role assignments table -> unclear utility
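
The loading pattern above repeats the same `read_csv(...).drop('Unnamed: 0', axis = 1)` call for every file. A minimal sketch of a helper that does the same thing is shown below. The name `load_table` is hypothetical, and the in-memory CSV only stands in for the real files in `../Data/`.

```python
import io

import pandas as pd


def load_table(source):
    """Read a CSV and drop the leftover index column, if it is present."""
    table = pd.read_csv(source)
    # errors="ignore" keeps the call from failing when the column is absent.
    return table.drop(columns="Unnamed: 0", errors="ignore")


# In-memory stand-in for one of the exported tables (made-up content).
toy_csv = io.StringIO(",id,userid\n0,1,10\n1,2,11\n")
toy_table = load_table(toy_csv)
print(toy_table.columns.tolist())
```

With such a helper, each table would load as `load_table('../Data/R_Gonz_data_log.csv')`, and a file without the extra column would not raise an error.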

#### We are now familiar with the logs

Before going further, we should assess the remaining tables present in the database.

Recall that the logs record interactions with the system, and that we are looking for ways to determine whether these interactions can help educators identify at-risk and high-performing students.

Student performance is, in general, measured by the student's grade.

In [11]:
grades_table.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 437650 entries, 0 to 437649
Data columns (total 21 columns):
 #   Column             Non-Null Count   Dtype  
---  ------             --------------   -----  
 0   id                 437650 non-null  float64
 1   itemid             437650 non-null  float64
 2   userid             437650 non-null  float64
 3   rawgrade           137820 non-null  float64
 4   rawgrademax        437650 non-null  float64
 5   rawgrademin        437650 non-null  float64
 6   rawscaleid         88518 non-null   float64
 7   usermodified       437650 non-null  float64
 8   finalgrade         236668 non-null  float64
 9   hidden             437650 non-null  float64
 10  locked             437650 non-null  float64
 11  locktime           437650 non-null  float64
 12  exported           437650 non-null  float64
 13  overridden         437650 non-null  float64
 14  excluded           437650 non-null  float64
 15  feedback           20273 non-null   object 
 16  fe
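
The `info()` output shows that `rawgrade` and `finalgrade` are populated in only part of the 437650 rows. A quick way to quantify this is the share of missing values per column. The sketch below uses a toy frame with invented values; on the real data, the same two lines would be applied to `grades_table`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for grades_table; values are invented for illustration.
toy_grades = pd.DataFrame({
    "userid": [1.0, 2.0, 3.0, 4.0],
    "rawgrade": [12.0, np.nan, np.nan, 15.0],
    "finalgrade": [12.0, np.nan, 9.0, 15.0],
})

# Fraction of missing values in each column, largest first.
missing_share = toy_grades.isna().mean().sort_values(ascending=False)
print(missing_share)

# Keep only the rows that actually carry a final grade.
graded_only = toy_grades.dropna(subset=["finalgrade"])
print(len(graded_only))
```

Whether rows without a `finalgrade` should be dropped or kept is a modelling decision; the filter here is only one possible choice.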

#### The course module table is present in other datasets, e.g. "The Open Moodle Dataset"

According to that dataset, the course module table describes every activity performed with Moodle. In our case, it records every activity performed in every course.

Here follows a brief overview of this table.

In [14]:
course_mod_table.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 228216 entries, 0 to 228215
Data columns (total 22 columns):
 #   Column                     Non-Null Count   Dtype  
---  ------                     --------------   -----  
 0   id                         228216 non-null  float64
 1   course                     228216 non-null  float64
 2   module                     228216 non-null  float64
 3   instance                   228216 non-null  float64
 4   section                    228216 non-null  float64
 5   idnumber                   428 non-null     object 
 6   added                      228216 non-null  float64
 7   score                      228216 non-null  float64
 8   indent                     228216 non-null  float64
 9   visible                    228216 non-null  float64
 10  visibleold                 228216 non-null  float64
 11  groupmode                  228216 non-null  float64
 12  groupingid                 228216 non-null  float64
 13  groupmembersonly           22
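
The meaning of `cmid` in the logs was left open earlier. One hypothesis, stated here as an assumption rather than a fact about this dataset, is that it refers to the `id` column of the course module table. The sketch below shows how that could be checked by measuring how many distinct `cmid` values appear among the module ids; the frames are toy stand-ins with invented values.

```python
import pandas as pd

# Toy stand-ins for student_logs and course_mod_table (invented values).
toy_logs = pd.DataFrame({"cmid": [101, 102, 102, 999]})
toy_course_mod = pd.DataFrame({"id": [101.0, 102.0, 103.0]})

# The real table stores ids as float64, so cast before comparing.
module_ids = toy_course_mod["id"].astype("int64")

# Share of distinct cmid values that exist as a course module id.
distinct_cmid = pd.Series(toy_logs["cmid"].unique())
share_matched = distinct_cmid.isin(module_ids).mean()
print(f"{share_matched:.2f}")
```

A share close to 1 on the real data would support the hypothesis; a low share would suggest that `cmid` points elsewhere.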

### First step: Make it lighter.

We will start by stripping all letters from the identifier columns, keeping only their numeric part. The columns are:

- ProductFamily_ID
- ProductCategory_ID
- ProductBrand_ID
- ProductName_ID
- ProductPackSKU_ID
- Point-of-Sale_ID

Note that the same product may have different SKUs!

In [6]:
#use a regex to keep only the numeric part of each ID column
id_columns = ['ProductFamily_ID', 'ProductCategory_ID', 'ProductBrand_ID',
              'ProductName_ID', 'Point-of-Sale_ID', 'ProductPackSKU_ID']

for col in id_columns:
    #the raw string avoids the invalid escape sequence warning for \d
    goliath[col] = goliath[col].str.extract(r'(\d+)', expand=False)
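
To make the effect of the regex concrete, here is a small sketch on a toy column. The raw label format (`Family_16`) is an assumption made for illustration; the actual labels in the source file may look different.

```python
import pandas as pd

# Toy column; the "Family_<number>" format is assumed for illustration.
toy = pd.DataFrame({"ProductFamily_ID": ["Family_16", "Family_4"]})

# Keep the first run of digits found in each label.
toy["ProductFamily_ID"] = toy["ProductFamily_ID"].str.extract(r"(\d+)", expand=False)
print(toy["ProductFamily_ID"].tolist())
```

The extracted values are still strings. A label without any digit would become missing, so it is worth counting missing values after this step.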

In [7]:
#halve the dataframe by merging values and units on SKU, store and date
values_df = goliath[goliath['Measures'] == 'Sell-out values']
units_df = goliath[goliath['Measures'] == 'Sell-out units']

#inner join: rows without a matching units/values pair are dropped
merge_keys = ['ProductPackSKU_ID', 'Point-of-Sale_ID', 'Date']
goliath = pd.merge(units_df, values_df[merge_keys + ['Value']], on=merge_keys, suffixes=('_units', '_price'))
goliath.drop(columns='Measures', inplace=True)
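
The merge turns two rows per SKU, store and date (one for units, one for values) into a single row with two value columns. The sketch below reproduces the step on a toy frame, using figures that appear in the output further down; the frame itself is only an illustration.

```python
import pandas as pd

# Toy long-format frame: one row per measure for each SKU, store and date.
toy = pd.DataFrame({
    "ProductPackSKU_ID": ["1970", "1970", "1813", "1813"],
    "Point-of-Sale_ID": ["1", "1", "410", "410"],
    "Date": ["2017-03-04", "2017-03-04", "2016-01-28", "2016-01-28"],
    "Measures": ["Sell-out units", "Sell-out values",
                 "Sell-out units", "Sell-out values"],
    "Value": [2.0, 1540.0, 1.0, 643.0],
})

keys = ["ProductPackSKU_ID", "Point-of-Sale_ID", "Date"]
units = toy[toy["Measures"] == "Sell-out units"]
values = toy[toy["Measures"] == "Sell-out values"]

# Inner join on the keys; the two Value columns receive the suffixes.
merged = pd.merge(units, values[keys + ["Value"]], on=keys,
                  suffixes=("_units", "_price"))
merged = merged.drop(columns="Measures")
print(merged)
```

Because the join is an inner join, any record that has units but no values (or the reverse) disappears silently. Passing `validate="one_to_one"` to `pd.merge` is one way to make duplicated keys raise an error instead of multiplying rows.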

In [8]:
goliath.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 91191598 entries, 0 to 91191597
Data columns (total 9 columns):
 #   Column              Dtype  
---  ------              -----  
 0   ProductFamily_ID    object 
 1   ProductCategory_ID  object 
 2   ProductBrand_ID     object 
 3   ProductName_ID      object 
 4   ProductPackSKU_ID   object 
 5   Point-of-Sale_ID    object 
 6   Date                object 
 7   Value_units         float64
 8   Value_price         float64
dtypes: float64(2), object(7)
memory usage: 6.8+ GB
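
The `info()` output shows that all identifier columns and `Date` are still stored as `object`, which is costly in memory. A possible next step, sketched below on a toy frame, is to convert the identifiers to small integers and the dates to datetimes. This is a suggestion under the assumption that every identifier contains only digits and has no missing values.

```python
import pandas as pd

# Toy frame with the same string-typed columns as the treated data.
toy = pd.DataFrame({
    "ProductPackSKU_ID": ["1970", "1970", "1813"],
    "Point-of-Sale_ID": ["1", "1", "410"],
    "Date": ["2017-03-04", "2016-05-02", "2016-01-28"],
})
before = toy.memory_usage(deep=True).sum()

# Parse the digit strings and pick the smallest unsigned integer type.
for col in ["ProductPackSKU_ID", "Point-of-Sale_ID"]:
    toy[col] = pd.to_numeric(toy[col], downcast="unsigned")

# Store dates as datetimes instead of Python strings.
toy["Date"] = pd.to_datetime(toy["Date"])

after = toy.memory_usage(deep=True).sum()
print(before, after)
```

If a column contains missing values, `pd.to_numeric` falls back to floats, so those would need to be handled first.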


### Additional Feature Engineering

In [9]:
goliath['Unit_Price'] = goliath['Value_price'] / goliath['Value_units']
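
The unit price is the sell-out value divided by the number of units sold. If a record has zero units, this division yields `inf` or `NaN` rather than an error. Whether such records exist in the data is not known here; the sketch below only shows, on invented values, how they could be detected and marked as missing.

```python
import numpy as np
import pandas as pd

# Toy frame; the middle row (zero units) is invented to show the edge case.
toy = pd.DataFrame({
    "Value_units": [2.0, 0.0, 1.0],
    "Value_price": [1540.0, 0.0, 643.0],
})

toy["Unit_Price"] = toy["Value_price"] / toy["Value_units"]

# 0/0 gives NaN and x/0 gives inf; treat both as missing.
toy["Unit_Price"] = toy["Unit_Price"].replace([np.inf, -np.inf], np.nan)

n_undefined = toy["Unit_Price"].isna().sum()
print(n_undefined)
```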

In [10]:
#take the maximum unit price of each product in each store as its general retail price
goliath['Retail_price'] = goliath.groupby(["ProductPackSKU_ID", "Point-of-Sale_ID"])["Unit_Price"].transform('max')


#flag a sale as promotional (Is_Promo = 1) when the unit price is at least 10% below the retail price
goliath['Is_Promo'] = np.where(goliath.Unit_Price <= (goliath.Retail_price * 0.9), 1, 0)
goliath

Unnamed: 0,ProductFamily_ID,ProductCategory_ID,ProductBrand_ID,ProductName_ID,ProductPackSKU_ID,Point-of-Sale_ID,Date,Value_units,Value_price,Unit_Price,Retail_price,Is_Promo
0,16,11,306,649,1970,1,2017-03-04,2.0,1540.0,770.0,810.0,0
1,16,11,306,649,1970,1,2016-05-02,4.0,3080.0,770.0,810.0,0
2,16,11,306,649,1970,1,2016-10-24,2.0,1540.0,770.0,810.0,0
3,16,11,306,649,1970,1,2017-10-13,2.0,1620.0,810.0,810.0,0
4,16,11,306,649,1970,1,2017-10-14,2.0,1620.0,810.0,810.0,0
...,...,...,...,...,...,...,...,...,...,...,...,...
91191593,4,34,279,577,1813,410,2016-01-28,1.0,643.0,643.0,810.0,1
91191594,4,34,279,577,1813,410,2016-04-20,1.0,638.0,638.0,810.0,1
91191595,4,34,279,577,1813,410,2016-04-25,1.0,652.0,652.0,810.0,1
91191596,4,34,279,577,1813,410,2016-04-28,1.0,643.0,643.0,810.0,1
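
The promotion rule can be traced by hand on a small example. In the sketch below, the first two unit prices come from the output above, and the third (700.0) is an invented value added so that one row falls below the 10% threshold.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "ProductPackSKU_ID": ["1970"] * 3,
    "Point-of-Sale_ID": ["1"] * 3,
    "Unit_Price": [770.0, 810.0, 700.0],  # 700.0 is a made-up value
})

# Reference price: the highest unit price seen for the SKU in that store.
toy["Retail_price"] = (
    toy.groupby(["ProductPackSKU_ID", "Point-of-Sale_ID"])["Unit_Price"]
    .transform("max")
)

# Promotion when the unit price is at most 90% of the reference price.
toy["Is_Promo"] = np.where(toy["Unit_Price"] <= toy["Retail_price"] * 0.9, 1, 0)
print(toy)
```

With a reference price of 810, the threshold is 729, so only the 700.0 row is flagged. Using the maximum as the reference makes the rule sensitive to a single unusually high price; a high quantile would be a possible alternative, at the cost of a less direct interpretation.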


In [11]:
#storing as a more manageable CSV to be worked with from now on
goliath.to_csv('../Databases/df_treated.csv')
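
By default, `to_csv` also writes the dataframe index, which comes back as an `Unnamed: 0` column when the file is read again. The sketch below shows the difference on a toy frame written to an in-memory buffer; whether to pass `index=False` in this notebook depends on whether later notebooks expect that column.

```python
import io

import pandas as pd

toy = pd.DataFrame({"Value_units": [2.0, 1.0]})

# Default behaviour: the index is written as an unnamed first column.
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()

# index=False leaves the index out of the file.
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = pd.read_csv(without_index).columns.tolist()

print(cols_with_index)
print(cols_without_index)
```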

#### Done

From now on, we will work with df_treated in all subsequent notebooks.