## Train-test split of raw data

This notebook splits the raw data into training and test sets. It can be rerun to reproduce the split from the raw data; a fixed random seed (`random_state=42`) makes the result reproducible.

Because the data set is large, we recommend downloading the already split training and test sets with `download_data.ipynb` rather than regenerating them here, for better performance.

In [23]:
from sklearn.model_selection import train_test_split
import pandas as pd

In [24]:
df = pd.read_csv("../data/raw_data.csv", sep="|")
df.head()

Unnamed: 0,label,uid,task_id,adv_id,creat_type_cd,adv_prim_id,dev_id,inter_type_cd,slot_id,spread_app_id,...,list_time,device_price,up_life_duration,up_membership_grade,membership_life_duration,consume_purchase,communication_onlinerate,communication_avgonline_30d,indu_name,pt_d
0,0,1638254,2112,6869,7,207,17,5,11,13,...,4,4,20,-1,-1,2,0^1^2^3^4^5^6^7^8^9^10^11^12^13^14^15^16^17^18...,12,17,1
1,0,1161786,3104,3247,7,183,29,5,17,86,...,4,4,18,-1,-1,2,3^4^5^6^7^8^9^10^11^12^13^14^15^16^17^18^19^20...,12,17,1
2,0,1814783,5890,4183,7,178,17,5,11,70,...,4,5,20,-1,-1,2,0^1^2^3^4^5^6^7^8^9^10^11^12^13^14^15^16^17^18...,11,36,1
3,0,1468996,1993,5405,7,207,17,5,21,13,...,7,3,-1,-1,-1,2,5^6^7^8^9^10^11^12^13^14^15^16^17^18^19^20^21^...,11,17,1
4,0,2164010,5439,4677,2,138,24,5,12,33,...,7,3,-1,-1,-1,2,2^3^4^5^6^7^8^9^10^11^12^13^14^15^16^17^18^19^...,11,20,1


In [25]:
df.shape

(41907133, 36)

In [26]:
df['label'].value_counts(normalize=True)

0    0.965507
1    0.034493
Name: label, dtype: float64

The data set is heavily imbalanced: about 96.5% of the values in the `'label'` column are zero. With a purely random split, there is a risk that the minority class is not adequately represented in the training set, the test set, or both. This can lead to models that generalize poorly or are validated unreliably.

To counteract this, **stratified sampling** is applied, which ensures that the training and test sets both contain approximately the same percentage of each target class as the raw data set.
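To illustrate what stratification does, here is a minimal pure-Python sketch. It is not scikit-learn's implementation: the helper name `stratified_split` and the toy labels are assumptions made for illustration. It groups row indices by class, shuffles each group with a seeded generator, and moves the same fraction of every group into the test set.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_idx, test_idx) with per-class proportions preserved.

    Illustrative only: a hypothetical helper, not scikit-learn's algorithm.
    """
    rng = random.Random(seed)  # seeded generator makes the split repeatable
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)

    train_idx, test_idx = [], []
    for y in sorted(by_class):
        idx = by_class[y]
        rng.shuffle(idx)
        n_test = round(len(idx) * test_size)  # same fraction from every class
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy data: 950 negatives and 50 positives (5% minority class).
labels = [0] * 950 + [1] * 50
train_idx, test_idx = stratified_split(labels)

train_counts = Counter(labels[i] for i in train_idx)
test_counts = Counter(labels[i] for i in test_idx)
print(train_counts[1] / len(train_idx))  # minority share in the training set
print(test_counts[1] / len(test_idx))    # minority share in the test set
```

On this toy data both sets keep the 5% minority share, and calling the function again with the same seed returns the same indices.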

In [27]:
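# 80/20 split; stratify on 'label' so both sets keep the raw class ratio.
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42, stratify=df['label'])
del df  # free memory: the raw frame is no longer needed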
print(train_df.shape)
print(test_df.shape)

(33525706, 36)
(8381427, 36)


In [28]:
print(train_df['label'].value_counts(normalize=True))
print(test_df['label'].value_counts(normalize=True))

0    0.965507
1    0.034493
Name: label, dtype: float64
0    0.965507
1    0.034493
Name: label, dtype: float64


In [29]:
train_df.to_csv('../data/train_df.csv', index=True, index_label='index')
test_df.to_csv('../data/test_df.csv', index=True, index_label='index')
del train_df
del test_df
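Because the files are written with `index_label='index'`, the original row index can be restored on loading by passing `index_col='index'` to `pd.read_csv`. The sketch below round-trips a small frame through an in-memory buffer instead of the real files; the toy values are invented for illustration.

```python
import io

import pandas as pd

# Toy frame with a non-contiguous index, like the rows left after a shuffled split.
toy = pd.DataFrame({"label": [0, 1, 0], "uid": [11, 22, 33]}, index=[7, 2, 5])

buffer = io.StringIO()
toy.to_csv(buffer, index=True, index_label="index")  # same options as above
buffer.seek(0)

# index_col="index" turns the saved column back into the DataFrame index.
restored = pd.read_csv(buffer, index_col="index")
print(restored.index.tolist())
```

Without `index_col`, the saved index would come back as an ordinary column named `index`.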