# Data Split


- This notebook splits the raw data into train and test sets, and then splits the train set into train and validation sets. It can be rerun to reproduce the splits from the raw data. Fixed random seeds are used to ensure reproducibility.
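
As a minimal illustration of why a fixed seed makes the split reproducible, the sketch below uses scikit-learn's `train_test_split` directly on a toy array. This is an assumption made for illustration: the notebook itself goes through `data_prepare_utils`, whose internals are not shown here.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data standing in for the real rows (illustrative only).
values = np.arange(100)

# Same random_state: the partition is identical on every run.
train_a, test_a = train_test_split(values, test_size=0.20, random_state=42)
train_b, test_b = train_test_split(values, test_size=0.20, random_state=42)
same_seed_matches = np.array_equal(test_a, test_b)

# A different seed will generally produce a different partition.
_, test_c = train_test_split(values, test_size=0.20, random_state=7)
different_seed_matches = np.array_equal(test_a, test_c)

print(same_seed_matches, different_seed_matches)
```

Keeping the seeds in named constants, as this notebook does, makes it easy to see which value produced a given set of files.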


## 1. Train-test split of raw data

In [1]:
import sys
# Add the parent directory to the import path so the utility modules below can be found.
sys.path.append("..")

import general_utils
import data_prepare_utils
import pandas as pd

In [2]:
TARGET = "label"
RAW_RESAMPLED_FILE_PATH = "./data/raw_data_resampled.csv"
TRAIN_FILE_PATH = "./data/train_df.csv"
VALIDATION_FILE_PATH = "./data/validation_df.csv"
TEST_FILE_PATH = "./data/test_df.csv"

TRAIN_TEST_SPLIT_RANDOM_STATE = 42
TEST_SIZE = 0.20
TRAIN_VALIDATION_SPLIT_RANDOM_STATE = 42
VALIDATION_SIZE = 0.20

In [3]:
df = general_utils.read_csv(RAW_RESAMPLED_FILE_PATH)


Read CSV file ../data/raw_data_resampled.csv into DataFrame:
df.head(): 


| Unnamed: 0 | label | uid | task_id | adv_id | creat_type_cd | adv_prim_id | dev_id | inter_type_cd | slot_id | spread_app_id | ... | list_time | device_price | up_life_duration | up_membership_grade | membership_life_duration | consume_purchase | communication_onlinerate | communication_avgonline_30d | indu_name | pt_d |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 2093274 | 3744 | 2227 | 5 | 142 | 36 | 5 | 18 | 80 | ... | 12 | 2 | -1 | -1 | -1 | 2 | 5^6^7^8^9^10^11^12^13^14^15^16^17^18^19^20^21^... | 11 | 42 | 1 |
| 1 | 0 | 1300943 | 2088 | 2314 | 6 | 132 | 60 | 3 | 11 | 78 | ... | 8 | 5 | -1 | -1 | -1 | 2 | 0^1^2^3^4^5^6^7^8^9^10^11^12^13^14^15^16^17^18... | 13 | 47 | 6 |
| 2 | 0 | 1630699 | 3747 | 4461 | 7 | 207 | 17 | 5 | 17 | 13 | ... | 4 | 4 | 20 | 1 | -1 | 2 | 5^6^7^8^9^10^11^12^13^14^15^16^17^18^19^20^21^... | 11 | 17 | 7 |
| 3 | 0 | 1320249 | 1220 | 4477 | 7 | 207 | 17 | 5 | 16 | 13 | ... | 14 | 5 | 20 | 1 | -1 | 2 | 0^1^2^3^4^5^6^7^8^9^10^11^12^13^14^15^16^17^18... | 13 | 17 | 7 |
| 4 | 0 | 1776239 | 3071 | 4591 | 7 | 109 | 29 | 5 | 12 | 86 | ... | 5 | 4 | -1 | -1 | -1 | 2 | 7^8^9^10^11^12^13^14^15^16^17^18^19^20^21^22^23 | 11 | 17 | 6 |


df.shape: (1047678, 36)


In [4]:
df[TARGET].value_counts(normalize=True)

0    0.965508
1    0.034492
Name: label, dtype: float64


- The dataset is heavily imbalanced: about 96.5% of the values in the `'label'` column are zero. With a purely random split, the minority class might not be adequately represented in the training set, the test set, or both, which can lead to models that generalize poorly or are validated unreliably.
- To counteract this, **stratified sampling** is applied, which ensures that the training and test sets each contain approximately the same percentage of samples of each target class as the raw data set.
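
The effect of stratified sampling can be sketched on synthetic data. The example below is only an illustration under stated assumptions: it calls scikit-learn's `train_test_split` with its `stratify` argument, and `demo_df` with its `feature` column is a made-up stand-in with roughly the same positive rate as the real data. It does not show how `data_prepare_utils.split_train_test_df` is implemented.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data: roughly 3.4% positives (illustrative only).
rng = np.random.default_rng(42)
n_rows = 10_000
demo_df = pd.DataFrame({
    "feature": rng.normal(size=n_rows),
    "label": (rng.random(n_rows) < 0.0345).astype(int),
})

# Passing the target column to `stratify` keeps the class ratio
# nearly identical in both parts of the split.
train_df, test_df = train_test_split(
    demo_df,
    test_size=0.20,
    stratify=demo_df["label"],
    random_state=42,
)

overall_rate = demo_df["label"].mean()
train_rate = train_df["label"].mean()
test_rate = test_df["label"].mean()
print(f"overall={overall_rate:.4f} train={train_rate:.4f} test={test_rate:.4f}")
```

Without `stratify`, the positive rate in a small test set can drift noticeably from the overall rate purely by chance, which matters most when the minority class is this rare.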


In [5]:
df = data_prepare_utils.oversample_data(df, oversample_fraction=0.01, random_state=42)


Perform an oversample of 0.01 due to the high imbalance:
oversampled_df.shape: (1047678, 36)


In [6]:
train_cap_x_df, train_y_df, test_cap_x_df, test_y_df = data_prepare_utils.split_train_test_df(
    df,
    target=TARGET,
    stratify=True,
    test_size=TEST_SIZE,
    random_state=TRAIN_TEST_SPLIT_RANDOM_STATE,
)
del df


Split DataFrame into train and test set:
train set:
(838142, 35) (838142, 1)
test set:
(209536, 35) (209536, 1)
target distribution in train set: 
 label
0        0.965507
1        0.034493
dtype: float64
target distribution in test set: 
 label
0        0.96551
1        0.03449
dtype: float64


## 2. Train-validation split

In [7]:
train_cap_x_df, train_y_df, validation_cap_x_df, validation_y_df = data_prepare_utils.split_train_test_df(
    pd.concat([train_cap_x_df, train_y_df], axis=1),
    target=TARGET,
    stratify=True,
    test_size=VALIDATION_SIZE,
    random_state=TRAIN_VALIDATION_SPLIT_RANDOM_STATE,
)


Split DataFrame into train and test set:
train set:
(670513, 35) (670513, 1)
test set:
(167629, 35) (167629, 1)
target distribution in train set: 
 label
0        0.965507
1        0.034493
dtype: float64
target distribution in test set: 
 label
0        0.965507
1        0.034493
dtype: float64
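
Because the validation split is taken from the already reduced train set, `VALIDATION_SIZE = 0.20` corresponds to 16% of the full data set, not 20%. The short check below works through that arithmetic and compares it with the row counts reported in the outputs above. It is a sanity-check sketch, not part of the original pipeline.

```python
TEST_SIZE = 0.20
VALIDATION_SIZE = 0.20

# Fractions of the full data set that end up in each split.
test_fraction = TEST_SIZE
validation_fraction = (1 - TEST_SIZE) * VALIDATION_SIZE
train_fraction = (1 - TEST_SIZE) * (1 - VALIDATION_SIZE)

# Row counts taken from the split outputs in this notebook.
total_rows = 1_047_678
reported = {"train": 670_513, "validation": 167_629, "test": 209_536}

# The three splits should account for every row exactly once.
assert sum(reported.values()) == total_rows

for name, rows in reported.items():
    print(f"{name}: {rows / total_rows:.4f}")
```

The resulting split is therefore approximately 64% train, 16% validation, and 20% test.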


In [8]:
general_utils.save_to_csv(train_cap_x_df, train_y_df, TRAIN_FILE_PATH)
del train_cap_x_df, train_y_df

general_utils.save_to_csv(validation_cap_x_df, validation_y_df, VALIDATION_FILE_PATH)
del validation_cap_x_df, validation_y_df

general_utils.save_to_csv(test_cap_x_df, test_y_df, TEST_FILE_PATH)
del test_cap_x_df, test_y_df


Save DataFrame into csv file:
File saved: ../data/train_df.csv

Save DataFrame into csv file:
File saved: ../data/validation_df.csv

Save DataFrame into csv file:
File saved: ../data/test_df.csv
