# 2.0.0 - Data Split

### Methodology
This section describes how the dataset was split into training, validation, and testing sets. A time-based split separates the training period from the later testing period. Within the training period, a spatial split holds out a random 10% of customers (`customer_id`) as the validation set, so that no customer appears in both the training and validation sets.

### Conclusion:
- Data Split
    - Training Set: Data from July 2022 to February 2023, used for initial model training.
    - Validation Set: Data from the same period as the training set (July 2022 to February 2023), restricted to held-out customers that do not appear in the training set. Used for tuning model parameters and initial evaluation.
    - Testing Set: Data from March 2023 to April 2023, used to simulate model performance on future, unseen data.


- Data Distribution
    The split resulted in the following distribution of samples:

    - Training Set: 9479 samples with a default rate (bads) of 19.4%.
    - Validation Set: 1053 samples with a default rate of 19.94%.
    - Testing Set: 3845 samples with a default rate of 16.9%.

The training set holds the majority of the split samples (65.93%), followed by the testing set (26.74%) and the validation set (7.32%). Default rates are close in the training and validation sets (19.4% vs. 19.94%), while the testing set has a lower default rate (16.9%).

### 1. Load Data 

In [1]:
import pandas as pd
import numpy as np
from pathlib import Path
import numpy.random as rnd

In [2]:
def time_split_dataset(
    dataset: pd.DataFrame,
    train_start_date: str,
    train_end_date: str,
    holdout_end_date: str,
    time_column: str,
    space_column: str,
    holdout_start_date: str = None,
    split_seed: int = 42,
    space_holdout_percentage: float = 0.1,
) -> tuple:
    """
    Splits a dataset into training, validation, and testing sets based on time and space dimensions.

    The function first segregates the data into time-based training and testing intervals. 
    Within the training interval, it further splits the data into training and validation sets 
    based on a specified percentage of unique values in the 'space_column', ensuring that 
    the validation set is spatially distinct from the training set.

    Parameters:
    - dataset (pd.DataFrame): The complete dataset to split.
    - train_start_date (str): The start date for the training dataset (inclusive).
    - train_end_date (str): The end date for the training dataset (exclusive).
    - holdout_end_date (str): The end date for the testing dataset (exclusive).
    - time_column (str): The column in the dataset that contains the time information.
    - space_column (str): The column in the dataset that represents the spatial information.
    - holdout_start_date (str, optional): The start date for the testing dataset (inclusive). Defaults to train_end_date.
    - split_seed (int, optional): The seed for the random state used in spatial sampling. Defaults to 42.
    - space_holdout_percentage (float, optional): The fraction (between 0 and 1) of the space_column's
      unique values to hold out for validation. Defaults to 0.1.

    Returns:
    - tuple: A tuple containing three pd.DataFrame objects:
        1. train_set: The training dataset.
        2. validation_set: The spatially distinct validation dataset.
        3. test_set: The testing dataset set aside by time.
    """

    state = rnd.RandomState(split_seed)
    holdout_start_date = holdout_start_date if holdout_start_date else train_end_date
    train_set = dataset[
        (dataset[time_column] >= train_start_date)
        & (dataset[time_column] < train_end_date)
    ]
    train_period_space = train_set[space_column].unique()
    test_set = dataset[
        (dataset[time_column] >= holdout_start_date)
        & (dataset[time_column] < holdout_end_date)
    ]
    validation_idx = state.choice(
        a=train_period_space,
        size=int(space_holdout_percentage * len(train_period_space)),
        replace=False,
    )
    validation_set = train_set[train_set[space_column].isin(validation_idx)]
    train_set = train_set[~train_set[space_column].isin(validation_idx)]

    return train_set, validation_set, test_set
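
The two steps inside `time_split_dataset` can be illustrated on a small example. The following is a minimal sketch on made-up data, not the notebook's dataset: the `toy` frame, its `month` column, and the date boundaries are assumptions chosen for illustration. It relies on `YYYY-MM` strings sorting lexicographically in chronological order, which is what makes the string comparisons valid.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 20 customers, one row per customer per month.
toy = pd.DataFrame(
    {
        "customer_id": np.tile(np.arange(20), 4),
        "month": np.repeat(["2022-07", "2022-08", "2022-09", "2022-10"], 20),
    }
)

# Step 1: time split using half-open intervals [start, end).
in_train_period = toy[(toy["month"] >= "2022-07") & (toy["month"] < "2022-09")]
toy_test = toy[(toy["month"] >= "2022-09") & (toy["month"] < "2022-11")]

# Step 2: hold out 10% of the unique customers seen in the training period.
state = np.random.RandomState(42)
customers = in_train_period["customer_id"].unique()
held_out = state.choice(customers, size=int(0.1 * len(customers)), replace=False)

toy_validation = in_train_period[in_train_period["customer_id"].isin(held_out)]
toy_train = in_train_period[~in_train_period["customer_id"].isin(held_out)]

print(len(toy_train), len(toy_validation), len(toy_test))
```

Because the holdout is drawn over unique customers rather than rows, the validation share of rows only approximates the requested fraction when customers have differing numbers of records.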

In [3]:
DATA_PATH = Path.cwd().parent / "data"
MAIN_DATASET_PATH = DATA_PATH / "processed/202404_final_dataset.pickle"


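train_start_date = "2022-07"  # inclusive
train_end_date = "2023-03"  # exclusive: training covers data through 2023-02
holdout_end_date = "2023-05"  # exclusive: testing covers 2023-03 through 2023-04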
time_column = "loan_origination_datetime_month"
space_column = "customer_id"
target = "target"


In [4]:
df = pd.read_pickle(MAIN_DATASET_PATH)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14454 entries, 0 to 14453
Columns: 291 entries, customer_id to credit_reports__debt_due_ratio
dtypes: float64(286), int64(4), object(1)
memory usage: 32.1+ MB


In [5]:
train_df, validation_df, test_df = time_split_dataset(
    df,
    train_start_date=train_start_date,
    train_end_date=train_end_date,
    holdout_end_date=holdout_end_date,
    time_column=time_column,
    space_column=space_column
)
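
Before using the three sets, it can be worth confirming that the split behaves as intended. The sketch below defines a hypothetical helper, `check_split`, which is not part of the original notebook, and runs it on toy frames standing in for `train_df`, `validation_df`, and `test_df`. It also shows that a time split like the one above does not, by itself, remove training-period customers from the test set.

```python
import pandas as pd


def check_split(train, validation, test, time_column, space_column):
    """Hypothetical helper: raise AssertionError if the split leaks."""
    # Spatial check: no entity may appear in both train and validation.
    shared = set(train[space_column]) & set(validation[space_column])
    assert not shared, f"{len(shared)} entities in both train and validation"
    # Temporal check: every test row must come after the training period.
    last_train_period = max(train[time_column].max(), validation[time_column].max())
    assert last_train_period < test[time_column].min(), "test overlaps training period"


# Toy frames standing in for train_df, validation_df and test_df.
toy_train = pd.DataFrame(
    {"customer_id": [1, 2, 2], "month": ["2022-07", "2022-07", "2022-08"]}
)
toy_validation = pd.DataFrame({"customer_id": [3], "month": ["2022-08"]})
toy_test = pd.DataFrame({"customer_id": [1, 4], "month": ["2022-09", "2022-10"]})

check_split(toy_train, toy_validation, toy_test, "month", "customer_id")

# Customers seen in training can still reappear in the test period.
returning = set(toy_train["customer_id"]) & set(toy_test["customer_id"])
print(f"customers in both train and test: {sorted(returning)}")
```

In this notebook the equivalent call would be `check_split(train_df, validation_df, test_df, time_column, space_column)`.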

In [6]:
train_len = len(train_df)
val_len = len(validation_df)
test_len = len(test_df)

train_bads = train_df[target].sum()
val_bads = validation_df[target].sum()
test_bads = test_df[target].sum()

train_min = train_df[time_column].min()
val_min = validation_df[time_column].min()
test_min = test_df[time_column].min()


train_max = train_df[time_column].max()
val_max = validation_df[time_column].max()
test_max = test_df[time_column].max()

train_dist = train_df[target].mean()
val_dist = validation_df[target].mean()
test_dist = test_df[target].mean()

(
    pd.DataFrame(
        {
            "n_samples": [train_len, val_len, test_len],
            "bads": [train_bads, val_bads, test_bads],
            "min": [train_min, val_min, test_min],
            "max": [train_max, val_max, test_max],
            "target_distribution": [train_dist, val_dist, test_dist],
        },
        index=["train", "validation", "test"],
    ).assign(samples_dist=lambda x: x["n_samples"] / sum(x["n_samples"]))
)

| | n_samples | bads | min | max | target_distribution | samples_dist |
|---|---|---|---|---|---|---|
| train | 9479 | 1839 | 2022-07 | 2023-02 | 0.194008 | 0.659317 |
| validation | 1053 | 210 | 2022-07 | 2023-02 | 0.199430 | 0.073242 |
| test | 3845 | 650 | 2023-03 | 2023-04 | 0.169051 | 0.267441 |
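
The derived columns in this output can be recomputed by hand from the counts. The short sketch below hard-codes the `n_samples` and `bads` values shown above; the dictionary names are chosen for illustration. Note that the shares are relative to the 14,377 rows that fall inside the split window, not the 14,454 rows loaded, so 77 rows lie outside the configured date range.

```python
# Counts copied from the split summary above.
n_samples = {"train": 9479, "validation": 1053, "test": 3845}
bads = {"train": 1839, "validation": 210, "test": 650}

total = sum(n_samples.values())

# Share of all split rows in each set, and default rate within each set.
shares = {name: count / total for name, count in n_samples.items()}
default_rates = {name: bads[name] / n_samples[name] for name in n_samples}

for name in n_samples:
    print(f"{name:<10} share={shares[name]:.2%} default_rate={default_rates[name]:.2%}")
```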


In [7]:
# Reuse the YYYYMM prefix of the source file name for the output files.
dataset_date = MAIN_DATASET_PATH.name[:6]
train_df.to_pickle(DATA_PATH / f"processed/{dataset_date}_train_data.pickle")
validation_df.to_pickle(DATA_PATH / f"processed/{dataset_date}_validation_data.pickle")
test_df.to_pickle(DATA_PATH / f"processed/{dataset_date}_test_data.pickle")