## Train/test split of raw data

This notebook splits the raw data into a training set and a test set. It can be rerun to reproduce the split from the raw data; a fixed random seed is used to ensure reproducibility.

Because the data set is large, we recommend downloading the already split train and test data sets with `download_data.ipynb` instead, for better performance.

In [1]:
import sys
sys.path.append("..")

import module.utils.general_utils as general_utils
import module.utils.data_prepare_utils as data_prepare_utils

In [2]:
target = "label"

In [3]:
df = general_utils.read_csv("../data/raw_data.csv", sep="|")


Read CSV file ../data/raw_data.csv into DataFrame:
df.head(): 


Unnamed: 0,label,uid,task_id,adv_id,creat_type_cd,adv_prim_id,dev_id,inter_type_cd,slot_id,spread_app_id,...,list_time,device_price,up_life_duration,up_membership_grade,membership_life_duration,consume_purchase,communication_onlinerate,communication_avgonline_30d,indu_name,pt_d
0,0,1638254,2112,6869,7,207,17,5,11,13,...,4,4,20,-1,-1,2,0^1^2^3^4^5^6^7^8^9^10^11^12^13^14^15^16^17^18...,12,17,1
1,0,1161786,3104,3247,7,183,29,5,17,86,...,4,4,18,-1,-1,2,3^4^5^6^7^8^9^10^11^12^13^14^15^16^17^18^19^20...,12,17,1
2,0,1814783,5890,4183,7,178,17,5,11,70,...,4,5,20,-1,-1,2,0^1^2^3^4^5^6^7^8^9^10^11^12^13^14^15^16^17^18...,11,36,1
3,0,1468996,1993,5405,7,207,17,5,21,13,...,7,3,-1,-1,-1,2,5^6^7^8^9^10^11^12^13^14^15^16^17^18^19^20^21^...,11,17,1
4,0,2164010,5439,4677,2,138,24,5,12,33,...,7,3,-1,-1,-1,2,2^3^4^5^6^7^8^9^10^11^12^13^14^15^16^17^18^19^...,11,20,1


df.shape: (41907133, 36)


In [4]:
df[target].value_counts(normalize=True)

0    0.965507
1    0.034493
Name: label, dtype: float64

The data set is heavily imbalanced: about 96.5% of the values in the `label` column are zero. This creates a risk that the minority class is not adequately represented in the training set, the test set, or both, which can lead to models that generalize poorly or are validated unreliably.

To counteract this, **stratified sampling** is applied, which ensures that the training and test sets each have approximately the same percentage of samples of each target class as the raw data set.
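To make the idea concrete, here is a minimal sketch of stratified sampling using only the Python standard library. It is an illustration of the technique, not the implementation of `data_prepare_utils.split_train_test_df`, whose internals are not shown in this notebook. The function name `stratified_split`, the 20% test fraction, the seed value, and the toy labels are all assumptions made for the example.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.2, seed=42):
    """Return (train_idx, test_idx) so that each class keeps its share in both sets.

    Hypothetical helper for illustration; the seed and fraction are assumed values.
    """
    rng = random.Random(seed)  # fixed seed, so the same split is produced on every run

    # Group row indices by their class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Take the same fraction from every class, so proportions are preserved.
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy labels with an imbalance similar to the raw data (3.5% positives).
labels = [1] * 35 + [0] * 965
train_idx, test_idx = stratified_split(labels)

train_rate = sum(labels[i] for i in train_idx) / len(train_idx)
test_rate = sum(labels[i] for i in test_idx) / len(test_idx)
print(len(train_idx), len(test_idx))
print(round(train_rate, 4), round(test_rate, 4))
```

Splitting each class separately is what keeps the positive rate the same in both sets; a plain random split would only match it on average. On real DataFrames, scikit-learn's `train_test_split` offers the same behavior through its `stratify` and `random_state` parameters.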

In [5]:
train_cap_x_df, train_y_df, test_cap_x_df, test_y_df = data_prepare_utils.split_train_test_df(df, target=target, stratify=True)
del df


Split DataFrame into train and test set:
train set:
(33525706, 35) (33525706, 1)
test set:
(8381427, 35) (8381427, 1)
target distribution in train set: 
 label
0        0.965507
1        0.034493
dtype: float64
target distribution in test set: 
 label
0        0.965507
1        0.034493
dtype: float64
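The shapes printed above can be checked with a few lines of arithmetic. The sketch below uses only the row counts from the outputs in this notebook; the variable names are chosen for illustration. It confirms that no rows were lost in the split and that the test set holds about 20% of the raw data.

```python
# Row counts copied from the outputs above.
n_rows = 41_907_133   # df.shape[0] of the raw data
n_train = 33_525_706  # rows in the train set
n_test = 8_381_427    # rows in the test set

# Every raw row should end up in exactly one of the two sets.
rows_accounted_for = (n_train + n_test == n_rows)

# Share of the raw data that went into the test set.
test_share = n_test / n_rows

print(rows_accounted_for)    # True
print(round(test_share, 4))  # 0.2
```

The feature frames have 35 columns and the target frames have 1, which together account for the 36 columns of the raw DataFrame.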


In [6]:
general_utils.save_to_csv(train_cap_x_df, train_y_df, "../data/train_df.csv")
del train_cap_x_df, train_y_df
general_utils.save_to_csv(test_cap_x_df, test_y_df, "../data/test_df.csv")
del test_cap_x_df, test_y_df


Save DataFrame into csv file:
File saved: ../data/train_df.csv

Save DataFrame into csv file:
File saved: ../data/test_df.csv
