### 1.0 Data Loading

The raw data set contains over 13 million rows.  This project needs only the users who visited the site during the trial period, so the data is filtered down to those rows.  The remaining columns are then converted to smaller data types to reduce memory usage.

1.1   Setup:  Libraries and Data

1.2   Filter Raw Data

1.3   Data Type Handling

1.4   Save Data Set



1.1 Setup:  Libraries and Data

In [None]:
# Install libraries
!pip install numpy
!pip install pandas
!pip install matplotlib
!pip install seaborn
!pip install scikit-learn  # the PyPI package is scikit-learn; it is imported as sklearn


In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
#  Load the raw data
#  Data source:  https://www.kaggle.com/datasets/arashnic/uplift-modeling
df_raw = pd.read_csv('raw_data.csv')
df_raw.head()

In [None]:
df_raw.shape

(13931074, 16)

1.2 Filter Raw Data

Remove rows for users who did not visit the site, then drop the 'visit' column, which is constant after filtering.

In [None]:
# Filter raw data for only users who visited the site; drop 'visit' column
df_filtered = df_raw[df_raw['visit'] == 1].reset_index(drop=True).drop('visit', axis=1)

df_filtered.shape

(653157, 15)
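
Loading every raw row before filtering can strain memory on smaller machines. As an alternative sketch (not what this notebook does), the file could be read in chunks and filtered as it streams. The helper name `load_visitors`, the chunk size, and the tiny inline sample standing in for raw_data.csv are assumptions made for illustration.

```python
import io

import pandas as pd


def load_visitors(source, chunksize=100_000):
    """Read a CSV in chunks and keep only the rows where visit == 1."""
    kept = []
    for chunk in pd.read_csv(source, chunksize=chunksize):
        # Filter each chunk, then drop the now-constant 'visit' column
        kept.append(chunk[chunk["visit"] == 1].drop(columns="visit"))
    return pd.concat(kept, ignore_index=True)


# Made-up three-row sample; a real run would pass the path to the raw file
sample_csv = io.StringIO(
    "f0,treatment,conversion,visit,exposure\n"
    "12.5,1,0,1,0\n"
    "20.1,0,0,0,0\n"
    "24.9,1,1,1,1\n"
)
df_small = load_visitors(sample_csv, chunksize=2)
print(df_small.shape)  # (2, 4)
```

Only the filtered chunks are held in memory at once, so peak usage depends on the chunk size and the share of visitors, not on the full row count.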

In [None]:
# Share of users who visited, computed from the row counts above
visit_ratio = df_filtered.shape[0] / df_raw.shape[0] * 100
print(f'{visit_ratio:.2f}% of the users visited the website during the time of this test, totalling {df_filtered.shape[0]} users.')

4.69% of the users visited the website during the time of this test, totalling 653157 users.


In [None]:
df_filtered.head()

Unnamed: 0,f0,f1,f2,f3,f4,f5,f6,f7,f8,f9,f10,f11,treatment,conversion,exposure
0,12.781566,10.059654,8.21592,1.114982,11.56105,4.115453,-7.011752,4.833815,3.799079,45.054671,5.303177,-0.337358,1.0,1.0,1.0
1,20.729166,10.059654,8.233548,4.679882,12.31011,3.013064,-12.6418,10.076439,3.760462,45.85949,5.988106,-0.168679,1.0,0.0,0.0
2,24.528223,10.059654,8.403907,4.679882,11.56105,4.115453,-3.993764,4.833815,3.844556,26.606156,6.064094,-0.168679,1.0,0.0,0.0
3,12.616365,10.059654,8.301676,4.679882,11.029584,4.115453,0.294443,4.833815,3.829288,23.570168,6.318202,-0.168679,1.0,0.0,0.0
4,13.018571,10.059654,8.301697,-0.41311,10.280525,4.115453,-10.143546,4.833815,3.876391,30.796373,5.300375,-0.168679,1.0,0.0,1.0


1.3 Data Type Handling

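Convert features f0-f11 to float32, and treatment, conversion, and exposure to int32.  This cuts memory in half with negligible loss of precision: float32 carries about 7 significant digits, and the values are below 100 in magnitude with 6 decimals, so roughly 5 decimal places are preserved.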

In [None]:
df_filtered.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 653157 entries, 0 to 653156
Data columns (total 15 columns):
 #   Column      Non-Null Count   Dtype  
---  ------      --------------   -----  
 0   f0          653157 non-null  float64
 1   f1          653157 non-null  float64
 2   f2          653157 non-null  float64
 3   f3          653157 non-null  float64
 4   f4          653157 non-null  float64
 5   f5          653157 non-null  float64
 6   f6          653157 non-null  float64
 7   f7          653157 non-null  float64
 8   f8          653157 non-null  float64
 9   f9          653157 non-null  float64
 10  f10         653157 non-null  float64
 11  f11         653157 non-null  float64
 12  treatment   653157 non-null  float64
 13  conversion  653157 non-null  float64
 14  exposure    653157 non-null  float64
dtypes: float64(15)
memory usage: 74.7 MB


In [None]:
#  Convert features f0-f11 to float32
for col in ['f0', 'f1', 'f2', 'f3', 'f4', 'f5', 'f6', 'f7', 'f8', 'f9', 'f10', 'f11']:
  df_filtered[col] = df_filtered[col].astype('float32')

#  Convert treatment, conversion, and exposure to integers
df_filtered['treatment'] = df_filtered['treatment'].astype('int32')
df_filtered['conversion'] = df_filtered['conversion'].astype('int32')
df_filtered['exposure'] = df_filtered['exposure'].astype('int32')

df_filtered.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 653157 entries, 0 to 653156
Data columns (total 15 columns):
 #   Column      Non-Null Count   Dtype  
---  ------      --------------   -----  
 0   f0          653157 non-null  float32
 1   f1          653157 non-null  float32
 2   f2          653157 non-null  float32
 3   f3          653157 non-null  float32
 4   f4          653157 non-null  float32
 5   f5          653157 non-null  float32
 6   f6          653157 non-null  float32
 7   f7          653157 non-null  float32
 8   f8          653157 non-null  float32
 9   f9          653157 non-null  float32
 10  f10         653157 non-null  float32
 11  f11         653157 non-null  float32
 12  treatment   653157 non-null  int32  
 13  conversion  653157 non-null  int32  
 14  exposure    653157 non-null  int32  
dtypes: float32(12), int32(3)
memory usage: 37.4 MB
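
The effect of the float64 to float32 conversion on precision can be checked directly. The sketch below uses a few feature values copied from the df_filtered.head() output above; the variable names are illustrative, and a fuller check would run over the real columns.

```python
import numpy as np

# A few feature values taken from the head() output shown earlier
values = np.array([45.054671, -12.6418, 3.799079, -0.337358], dtype=np.float64)
as_f32 = values.astype(np.float32)

# Largest absolute difference introduced by rounding to float32
max_abs_err = np.max(np.abs(values - as_f32.astype(np.float64)))
print(f"max absolute rounding error: {max_abs_err:.2e}")

# Bytes used by each array: float32 needs half the storage of float64
print(values.nbytes, as_f32.nbytes)  # 32 16
```

For values of this magnitude the rounding error stays in the sixth decimal place, which is usually acceptable for modeling but is worth confirming if exact equality with the raw file matters.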


1.4 Save Data Set

In [None]:
# Write cleaned data set

df_filtered.to_csv('filtered_data.csv', index=False)
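
CSV files do not store column data types, so a later read of filtered_data.csv would fall back to the float64 and int64 defaults unless the types are supplied again. The sketch below shows one way to pass them back in through the dtype argument of pd.read_csv; the two-row demo frame and the in-memory buffer are stand-ins for the real file, and the name `dtype_map` is illustrative.

```python
import io

import pandas as pd

feature_cols = [f"f{i}" for i in range(12)]
flag_cols = ["treatment", "conversion", "exposure"]

# Column-to-dtype mapping matching the conversion in section 1.3
dtype_map = {col: "float32" for col in feature_cols}
dtype_map.update({col: "int32" for col in flag_cols})

# Made-up two-row frame standing in for df_filtered
data = {col: [0.5, 1.5] for col in feature_cols}
data.update({col: [1, 0] for col in flag_cols})
df_demo = pd.DataFrame(data).astype(dtype_map)

# Round-trip through CSV using an in-memory buffer instead of a file
buffer = io.StringIO()
df_demo.to_csv(buffer, index=False)

buffer.seek(0)
plain = pd.read_csv(buffer)                   # dtypes inferred by pandas
buffer.seek(0)
typed = pd.read_csv(buffer, dtype=dtype_map)  # dtypes restored explicitly

print(plain.dtypes.unique())
print(typed.dtypes.unique())
```

A binary format that records dtypes would avoid the extra mapping, at the cost of an additional dependency.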