## Train-test split of raw data

This notebook splits the raw data into a training set and a test set. It can be rerun to reproduce the split from the raw data; a fixed random seed is used to ensure reproducibility.

Because the data set is large, we recommend downloading the already-split train and test data sets using `download_data.ipynb` rather than regenerating the split here.
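The role of the random seed mentioned above can be checked on a small example. The sketch below uses a hypothetical toy DataFrame (not the raw data) and a hypothetical helper, `split_indices`, to show that calling `train_test_split` twice with the same `random_state` selects the same rows.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 rows, 10% positives (not the raw data set)
toy = pd.DataFrame({"x": range(100), "label": [0] * 90 + [1] * 10})


def split_indices(seed):
    """Return the row indices of the train and test parts for a given seed."""
    train, test = train_test_split(
        toy, test_size=0.2, random_state=seed, stratify=toy["label"]
    )
    return list(train.index), list(test.index)


same_a = split_indices(42)
same_b = split_indices(42)

# Identical seeds select identical rows in identical order
print(same_a == same_b)
```

A different seed would normally select different rows, which is why the seed has to stay fixed for the split to be reproducible.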

In [1]:
from sklearn.model_selection import train_test_split
import pandas as pd

In [2]:
df = pd.read_csv("../data/raw_data.csv", sep="|")
df.head()

In [3]:
df.shape

In [4]:
df['label'].value_counts(normalize=True)

Read CSV file ../data/raw_data.csv into DataFrame: 
df.head: 
<bound method NDFrame.head of           label      uid  task_id  adv_id  creat_type_cd  adv_prim_id  dev_id  \
0             0  1638254     2112    6869              7          207      17   
1             0  1161786     3104    3247              7          183      29   
2             0  1814783     5890    4183              7          178      17   
3             0  1468996     1993    5405              7          207      17   
4             0  2164010     5439    4677              2          138      24   
...         ...      ...      ...     ...            ...          ...     ...   
41907128      0  2154906     5275    5473              7          156      56   
41907129      0  1466996     5952    4158              7          207      17   
41907130      0  1930657     2178    1860              2          142      60   
41907131      0  1550398     1976    6739              7          154      56   
41907132      0  

The data set is heavily imbalanced: more than 95% of the values in the `label` column are zero. With a purely random split, there is a risk that the minority class is not adequately represented in the training set, the test set, or both, which can lead to models that generalize poorly or are validated unreliably.

To counteract this, **stratified sampling** is applied, which ensures that the training and test sets each contain approximately the same proportion of each target class as the raw data set.
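As a minimal sketch of this effect, the example below builds a hypothetical synthetic data set with a positive rate of roughly 3.5% (the column names and sizes are assumptions for illustration only) and compares the positive rate in the full data, the training part, and the test part after a stratified split.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10,000 rows with roughly 3.5% positives
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "feature": rng.normal(size=10_000),
    "label": (rng.random(10_000) < 0.035).astype(int),
})

# Stratify on the label so each part keeps the overall class proportions
toy_train, toy_test = train_test_split(
    toy, test_size=0.2, random_state=42, stratify=toy["label"]
)

# Fraction of positive labels in each part
rates = {
    "full": toy["label"].mean(),
    "train": toy_train["label"].mean(),
    "test": toy_test["label"].mean(),
}
print(rates)
```

The three rates should agree up to the rounding forced by whole rows, whereas an unstratified split leaves the minority-class share in each part to chance.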

In [5]:
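# Stratify on the label so both parts keep the raw class proportions;
# the fixed random_state makes the split reproducible.
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42, stratify=df['label'])
del df  # free the memory held by the full data set
print(train_df.shape)
print(test_df.shape)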

In [6]:
print(train_df['label'].value_counts(normalize=True))
print(test_df['label'].value_counts(normalize=True))

Split DataFrame into train and test set:
train set:
(33525706, 35) (33525706, 1)
test set:
(8381427, 35) (8381427, 1)
target distribution in train set: 
 label
0        0.965507
1        0.034493
dtype: float64
target distribution in test set: 
 label
0        0.965507
1        0.034493
dtype: float64


In [7]:
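# Keep the original row index as a column named 'index' so rows can be traced back to the raw data
train_df.to_csv('../data/train_df.csv', index=True, index_label='index')
test_df.to_csv('../data/test_df.csv', index=True, index_label='index')
# Free memory once both files are written
del train_df
del test_df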

Save DataFrame into csv file: 
File saved: ../data/train_df.csv
Save DataFrame into csv file: 
File saved: ../data/test_df.csv
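Because the files are written with `index_label='index'`, the original row index can be restored on load by passing `index_col='index'` to `pd.read_csv`. Note also that `to_csv` is called without `sep`, so the saved files are comma-separated, unlike the pipe-separated raw file. The sketch below demonstrates the round trip on a hypothetical three-row DataFrame written to a temporary directory; the file name `toy_df.csv` is an assumption for illustration.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical toy frame with a non-contiguous index, as left behind by a shuffled split
toy = pd.DataFrame({"label": [0, 1, 0], "uid": [11, 22, 33]}, index=[5, 2, 9])

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "toy_df.csv"
    # Same saving options as used for train_df and test_df
    toy.to_csv(path, index=True, index_label="index")
    # Default separator is a comma; restore the saved index column as the index
    restored = pd.read_csv(path, index_col="index")

print(restored.index.tolist())  # [5, 2, 9]
```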
