## Notebook 2: Data Understanding and Preprocessing

This is the first notebook that forms part of the thesis work proper. In it, we take the original student log file and perform the manipulations needed to obtain a dataset that can be useful for the analysis.

#### 1. A small overview of the logs and each column

The logs record interactions with the Moodle LMS:

    - Each interaction with the LMS is recorded sequentially:
        when the action was performed,
        what the nature of the interaction was,
        where the actor was when the action was performed,
        who performed the interaction,
        in the context of which course page,
        which specific link was followed.

    - Each user is uniquely identified by the userID.
    - Each course is uniquely identified by the courseID.
    - Each specific interaction is recorded as the action performed and the URL clicked.
    - Each click is timestamped.
    - The actor's IP address is recorded.
    
Despite knowing some of the 

In [1]:
#import libs
import pandas as pd
import numpy as np

import warnings
warnings.filterwarnings('ignore')

In [4]:
student_logs = pd.read_csv('../Data/R_Gonz_data_log.csv').drop('Unnamed: 0', axis = 1)
student_logs

Unnamed: 0,id,time,userid,ip,course,module,cmid,action,url,info,stime
0,1.0,1.401988e+09,2.0,127.0.0.1,1.0,user,0.0,login,view.php?id=2&course=1,2,2014-06-05 18:09:07
1,2.0,1.401988e+09,2.0,127.0.0.1,1.0,user,0.0,update,view.php?id=2,,2014-06-05 18:14:48
2,3.0,1.401988e+09,2.0,127.0.0.1,1.0,user,0.0,update,view.php?id=2,,2014-06-05 18:14:48
3,4.0,1.401989e+09,2.0,127.0.0.1,1.0,course,0.0,view,view.php?id=1,1,2014-06-05 18:16:13
4,5.0,1.402040e+09,2.0,127.0.0.1,1.0,user,0.0,login,view.php?id=2&course=1,2,2014-06-06 08:37:19
...,...,...,...,...,...,...,...,...,...,...,...
47097819,47116816.0,1.438312e+09,0.0,127.0.0.1,1.0,user,0.0,add,/view.php?id=81854,Cathleen Scheurich,2015-07-31 04:00:59
47097820,47116817.0,1.438312e+09,0.0,127.0.0.1,1.0,user,0.0,add,/view.php?id=81855,Sara Gil Díez,2015-07-31 04:00:59
47097821,47116818.0,1.438312e+09,0.0,127.0.0.1,1.0,user,0.0,add,/view.php?id=81856,Eduardo García Bermo,2015-07-31 04:00:59
47097822,47116819.0,1.438312e+09,0.0,127.0.0.1,635.0,role,0.0,unassign,admin/roles/assign.php?contextid=24578&roleid=5,Estudiante,2015-07-31 04:14:08
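The preview shows `time` as a float that looks like a Unix timestamp, and the ID columns as floats. As a minimal sketch, assuming `time` really holds seconds since the Unix epoch (this is an assumption, the source does not confirm it), the column can be turned into a proper datetime and the IDs into integers. The toy frame below is illustrative and is not the real log.

```python
import pandas as pd

# Toy rows; column names follow the preview above, values are illustrative
toy_logs = pd.DataFrame({
    "userid": [2.0, 2.0],
    "action": ["login", "view"],
    "time": [1401988147.0, 1401988573.0],  # assumed to be Unix epoch seconds
})

# Interpret the floats as seconds since 1970-01-01 UTC
toy_logs["time_utc"] = pd.to_datetime(toy_logs["time"], unit="s", utc=True)

# IDs arrive as floats in the preview; integers are a more natural fit
toy_logs["userid"] = toy_logs["userid"].astype("int64")

print(toy_logs)
```

Whether the result should then be shifted to a local timezone depends on how the server recorded `stime`, which the preview alone does not settle.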


In [3]:
#general info
goliath.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 182342304 entries, 0 to 182342303
Data columns (total 9 columns):
 #   Column              Dtype  
---  ------              -----  
 0   ProductFamily_ID    object 
 1   ProductCategory_ID  object 
 2   ProductBrand_ID     object 
 3   ProductName_ID      object 
 4   ProductPackSKU_ID   object 
 5   Point-of-Sale_ID    object 
 6   Date                object 
 7   Measures            object 
 8   Value               float64
dtypes: float64(1), object(8)
memory usage: 12.2+ GB


In [4]:
goliath.isna().sum()

ProductFamily_ID      0
ProductCategory_ID    0
ProductBrand_ID       0
ProductName_ID        0
ProductPackSKU_ID     0
Point-of-Sale_ID      0
Date                  0
Measures              0
Value                 0
dtype: int64

In [5]:
goliath.describe(include = 'all')

Unnamed: 0,ProductFamily_ID,ProductCategory_ID,ProductBrand_ID,ProductName_ID,ProductPackSKU_ID,Point-of-Sale_ID,Date,Measures,Value
count,182342304,182342304,182342304,182342304,182342304,182342304,182342304,182342304,182342300.0
unique,21,178,1523,2820,8509,410,1401,2,
top,Family_12,Category_178,ProductBrand_1425,ProductName_2609,ProductSKU_3008,POS_282,2018-12-10,Sell-out values,
freq,38915420,126256286,2525774,1802618,975138,975220,204254,91171152,
mean,,,,,,,,,1760.203
std,,,,,,,,,5024.838
min,,,,,,,,,-10.0
25%,,,,,,,,,1.0
50%,,,,,,,,,58.0
75%,,,,,,,,,1654.0


### First step: Make it lighter.

We start by stripping the text prefix from every identifier column, keeping only the number that identifies each category:

- ProductFamily_ID
- ProductCategory_ID
- ProductBrand_ID
- ProductName_ID
- ProductPackSKU_ID
- Point-of-Sale_ID

Note: the same product may have different SKUs.

In [6]:
#use a regex to keep only the numeric part of each ID column
id_columns = ['ProductFamily_ID', 'ProductCategory_ID', 'ProductBrand_ID',
              'ProductName_ID', 'Point-of-Sale_ID', 'ProductPackSKU_ID']

for col in id_columns:
    #raw string, so that \d is passed to the regex engine and not read as a string escape
    goliath[col] = goliath[col].str.extract(r'(\d+)', expand=False)
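After the extraction the ID columns still hold strings (the later `info()` output lists them as `object`), so most of the memory saving only arrives once they are converted to numbers. A minimal sketch of that extra step on toy data follows; converting the IDs to integers is an assumption about what the later notebooks can accept, not something the original cell does.

```python
import pandas as pd

# Toy frame mimicking two of the ID columns; values are illustrative
toy = pd.DataFrame({
    "ProductFamily_ID": ["Family_12", "Family_16", "Family_4"],
    "ProductPackSKU_ID": ["ProductSKU_3008", "ProductSKU_1970", "ProductSKU_1813"],
})

before = toy.memory_usage(deep=True).sum()

for col in toy.columns:
    # Keep the digits, then store them in the smallest unsigned integer type that fits
    digits = toy[col].str.extract(r"(\d+)", expand=False)
    toy[col] = pd.to_numeric(digits, downcast="unsigned")

after = toy.memory_usage(deep=True).sum()

print(toy.dtypes)
print(before, "->", after)
```

On the full dataset the same idea could be combined with parsing `Date` as a datetime, but the actual saving would need to be measured.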

In [7]:
#halve the dataframe by merging values and units on SKU, store and date
values_df = goliath[goliath['Measures']=='Sell-out values']
units_df = goliath[goliath['Measures']=='Sell-out units']


goliath = pd.merge(units_df,values_df[['ProductPackSKU_ID','Point-of-Sale_ID','Date','Value']], on=['ProductPackSKU_ID','Point-of-Sale_ID','Date'],suffixes=('_units', '_price'))
goliath.drop(columns='Measures', inplace = True)

In [8]:
goliath.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 91191598 entries, 0 to 91191597
Data columns (total 9 columns):
 #   Column              Dtype  
---  ------              -----  
 0   ProductFamily_ID    object 
 1   ProductCategory_ID  object 
 2   ProductBrand_ID     object 
 3   ProductName_ID      object 
 4   ProductPackSKU_ID   object 
 5   Point-of-Sale_ID    object 
 6   Date                object 
 7   Value_units         float64
 8   Value_price         float64
dtypes: float64(2), object(7)
memory usage: 6.8+ GB
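The merged frame has 91,191,598 rows, slightly more than the 91,171,152 'Sell-out values' rows reported by `describe`. An inner merge can only grow like this when some (SKU, store, date) combinations appear more than once, so it may be worth checking the keys before merging. The sketch below does so on toy data; summing duplicated rows is one possible policy and is an assumption here, not the author's method.

```python
import pandas as pd

keys = ["ProductPackSKU_ID", "Point-of-Sale_ID", "Date"]

# Toy data: the first two unit rows share the same key on purpose
units_df = pd.DataFrame({
    "ProductPackSKU_ID": ["1970", "1970", "1813"],
    "Point-of-Sale_ID": ["1", "1", "410"],
    "Date": ["2017-03-04", "2017-03-04", "2016-01-28"],
    "Value": [1.0, 1.0, 1.0],
})
values_df = pd.DataFrame({
    "ProductPackSKU_ID": ["1970", "1813"],
    "Point-of-Sale_ID": ["1", "410"],
    "Date": ["2017-03-04", "2016-01-28"],
    "Value": [1540.0, 643.0],
})

# Count rows whose key combination appears more than once
n_dup_units = int(units_df.duplicated(subset=keys, keep=False).sum())
n_dup_values = int(values_df.duplicated(subset=keys, keep=False).sum())
print(n_dup_units, n_dup_values)

# One possible policy (an assumption): sum duplicated rows before merging
units_agg = units_df.groupby(keys, as_index=False)["Value"].sum()
values_agg = values_df.groupby(keys, as_index=False)["Value"].sum()

# validate="one_to_one" makes pandas raise if keys are still duplicated
merged = pd.merge(units_agg, values_agg, on=keys,
                  suffixes=("_units", "_price"), validate="one_to_one")
print(merged)
```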


### Additional Feature Engineering

In [9]:
goliath['Unit_Price'] = goliath['Value_price'] / goliath['Value_units']
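The division above yields `inf` or `NaN` wherever `Value_units` is zero. The outputs shown do not reveal whether such rows exist, so the following is only a hedged sketch of one way to guard against them, on toy data: zero units are masked to `NaN` before dividing.

```python
import numpy as np
import pandas as pd

# Toy rows; the middle one has zero units on purpose
toy = pd.DataFrame({
    "Value_units": [2.0, 0.0, 1.0],
    "Value_price": [1540.0, 0.0, 643.0],
})

# Replace zero units with NaN so the division yields NaN instead of inf
safe_units = toy["Value_units"].where(toy["Value_units"] != 0)
toy["Unit_Price"] = toy["Value_price"] / safe_units

# Number of infinite prices left after the guard
n_inf = int(np.isinf(toy["Unit_Price"]).sum())
print(toy)
print(n_inf)
```

Rows with a missing `Unit_Price` can then be dropped or inspected separately, depending on what they turn out to represent.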

In [10]:
#Take the maximum unit price of each product in each store as its regular retail price
goliath['Retail_price'] = goliath.groupby(["ProductPackSKU_ID", "Point-of-Sale_ID"])["Unit_Price"].transform('max')


#Then create an Is_Promo column: a sale counts as a special offer when its unit price is at least 10% below the retail price
goliath['Is_Promo'] = np.where(goliath.Unit_Price <= (goliath.Retail_price * 0.9), 1, 0)
goliath

Unnamed: 0,ProductFamily_ID,ProductCategory_ID,ProductBrand_ID,ProductName_ID,ProductPackSKU_ID,Point-of-Sale_ID,Date,Value_units,Value_price,Unit_Price,Retail_price,Is_Promo
0,16,11,306,649,1970,1,2017-03-04,2.0,1540.0,770.0,810.0,0
1,16,11,306,649,1970,1,2016-05-02,4.0,3080.0,770.0,810.0,0
2,16,11,306,649,1970,1,2016-10-24,2.0,1540.0,770.0,810.0,0
3,16,11,306,649,1970,1,2017-10-13,2.0,1620.0,810.0,810.0,0
4,16,11,306,649,1970,1,2017-10-14,2.0,1620.0,810.0,810.0,0
...,...,...,...,...,...,...,...,...,...,...,...,...
91191593,4,34,279,577,1813,410,2016-01-28,1.0,643.0,643.0,810.0,1
91191594,4,34,279,577,1813,410,2016-04-20,1.0,638.0,638.0,810.0,1
91191595,4,34,279,577,1813,410,2016-04-25,1.0,652.0,652.0,810.0,1
91191596,4,34,279,577,1813,410,2016-04-28,1.0,643.0,643.0,810.0,1
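To make the promotion rule concrete, the sketch below reruns the same logic on a handful of toy rows with prices taken from the output above. With a retail price of 810.0 the threshold is 729.0, so 770.0 is not flagged while 643.0 is. The 810.0 row for SKU 1813 is hypothetical: it stands in for the elided rows where that maximum occurs. The `Discount` column is added here only for illustration.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "ProductPackSKU_ID": ["1970", "1970", "1813", "1813"],
    "Point-of-Sale_ID": ["1", "1", "410", "410"],
    # Last value is a hypothetical full-price sale for SKU 1813
    "Unit_Price": [770.0, 810.0, 643.0, 810.0],
})

# Highest price seen per product and store, used as the retail price
toy["Retail_price"] = (
    toy.groupby(["ProductPackSKU_ID", "Point-of-Sale_ID"])["Unit_Price"]
    .transform("max")
)

# Relative discount against the retail price (illustrative helper column)
toy["Discount"] = 1 - toy["Unit_Price"] / toy["Retail_price"]

# Flag sales priced at 90% of the retail price or less
toy["Is_Promo"] = np.where(toy["Unit_Price"] <= toy["Retail_price"] * 0.9, 1, 0)

print(toy)
```

Because the retail price is the observed maximum, a single unusually high price for a product in a store would flag most of its other sales as promotions.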


In [11]:
#storing as a more manageable CSV to be worked with from now on
goliath.to_csv('../Databases/df_treated.csv')
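With the default arguments, `to_csv` also writes the row index, which reappears as an `Unnamed: 0` column when the file is read back. The sketch below shows the difference on a toy frame, using an in-memory buffer instead of a real file. Whether to pass `index=False` depends on how later notebooks read the file, which is not shown here.

```python
import io
import pandas as pd

toy = pd.DataFrame({
    "ProductPackSKU_ID": ["1970", "1813"],
    "Unit_Price": [770.0, 643.0],
})

# Default behaviour: the index is written as an unnamed first column
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
cols_default = list(pd.read_csv(with_index).columns)

# index=False: only the data columns are written
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
cols_no_index = list(pd.read_csv(without_index).columns)

print(cols_default)
print(cols_no_index)
```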

#### Done

All subsequent notebooks work with df_treated.