# Dataset Loader and Train Sampling Approach

This notebook briefly describes the loading methodology and the training approach. It shows how row information is encoded, why that encoding was chosen, and how data is sampled during training. The overall aim is to improve generalization to in-distribution data and, ideally, to out-of-distribution data.

In [1]:
import sys
import os

# Resolve the project directories relative to this notebook.
script_dir = os.path.dirname(os.path.abspath('loading_test.ipynb'))
parent_directory = os.path.dirname(script_dir)
module_directory = os.path.join(parent_directory, 'module')
utils_directory = os.path.join(parent_directory, 'utils')

# Add each project directory to sys.path once so the imports below resolve.
for directory in (parent_directory, module_directory, utils_directory):
    if directory not in sys.path:
        sys.path.append(directory)

from module.preprocess.load_and_batch import DataBatcher
from utils import config

# How to Load an Already Split Dataset

In [2]:
info = DataBatcher()

# Load the minority (target) file and the selected majority split,
# using the split settings defined in config.
info.load_and_process(
    base_loc=config.DATA_LOCATION,
    minority_loc=config.SPLIT_DATASETS + 'target.csv',
    majority_loc=config.SPLIT_DATASETS + f"dataset_{config.DATASET_CONFIG['split_to_load']}.csv",
    train_test_split=config.DATASET_CONFIG['train_test_split'],
    validation_split=config.DATASET_CONFIG['validation_split'],
    training=config.DATASET_CONFIG['training_stage'],
)

------> Size of categorical columns: 185 <------
------> Size of numeric columns: 35 <------


# Conversion of Row Information to Text Format

This section semantically enriches column data by converting each row to a structured text format. Metadata is stored in `<tag>` elements, which aid preprocessing as detailed in the `number_encoder` notebook. The tags used are `<|AMOUNT|>`, `<|NUM|>`, and `<|DATE|>`.

### Process
- **Data Conversion**: Each table row is converted into a list where each entry contains the text description and values of all associated tags.
- **Metadata Tags**: These are used to differentiate data types during preprocessing, improving model efficiency.
- **Numeric Handling**:
  - Single-digit integers (0 to 9) are not tagged because most tokenizers can encode them directly.
  - Other numerics, such as floats and negative numbers, are tagged and processed differently to optimize their representation.
- **Special Cases**:
  - Blocks with null values are explicitly tagged as "empty" to manage missing data.
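The conversion rules listed above can be sketched roughly as follows. This is a minimal illustration, not the `DataBatcher` implementation: the `row_to_text` function, the `kinds` mapping from column name to value kind, and the example values are all assumptions made for the sketch. Only the tag names and the output phrasing ("is number", "is empty") follow the notebook.

```python
import math

# Tag names come from the notebook; keying them by "kind" is an assumption.
TAGS = {"amount": "<|AMOUNT|>", "number": "<|NUM|>", "date": "<|DATE|>"}


def is_null(value):
    """Treat None and float NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def row_to_text(row, kinds):
    """Convert one row (column -> value) to text plus the tagged values.

    `kinds` (hypothetical) maps a column to "amount", "number", or "date";
    columns absent from it are written out as plain text.
    """
    parts = []
    tagged = {"amount": [], "number": [], "date": []}
    for column, value in row.items():
        kind = kinds.get(column)
        if is_null(value):
            # Missing values keep their slot, so the row stays 1-to-1 with the table.
            parts.append(f"{column} is empty")
        elif kind == "number" and float(value).is_integer() and 0 <= value <= 9:
            # Single-digit integers are written inline, untagged.
            parts.append(f"{column} is number {int(value)}")
        elif kind in TAGS:
            # Everything else numeric or date-like is replaced by its tag
            # and the raw value is kept for a separate encoder.
            parts.append(f"{column} is {TAGS[kind]}")
            tagged[kind].append(value)
        else:
            parts.append(f"{column} is {value}")
    return ", ".join(parts), tagged


# Example values are invented for illustration.
row = {
    "applicationscnt_867L": 1,
    "annuity_780A": 1917.6,
    "avgdbddpdlast24m_3658932P": -3.0,
    "cardtype_51L": None,
}
kinds = {
    "applicationscnt_867L": "number",
    "annuity_780A": "amount",
    "avgdbddpdlast24m_3658932P": "number",
}
text, tagged = row_to_text(row, kinds)
print(text)
```

Keeping the raw values in `tagged` alongside the text is one way to let a downstream number encoder fill in each tag position later.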


In [3]:
batch = info.get_meta_data(
        batch_size=20,
        data_type="train",
        ignore_list=["case_id"],
        output_list=["WEEK_NUM", "target"],
    )
    
print(f"Text: \n{batch.texts[0][:10]}")
print("\n-------------------------------------")
# Single quotes inside the f-string keep this valid on Python versions before 3.12.
print(f"# of amount tags found in the batch: {len(batch.get_data('amount'))}")
print(f"# of number tags found in the batch: {len(batch.get_data('number'))}")
print(f"# of date tags found in the batch: {len(batch.get_data('date'))}")

actualdpdtolerance_344P is number 0, amtinstpaidbefduel24m_4187115A is <|AMOUNT|>, annuity_780A is <|AMOUNT|>, annuitynextmonth_57A is <|AMOUNT|>, applicationcnt_361L is number 0, applications30d_658L is number 0, applicationscnt_1086L is number 0, applicationscnt_464L is number 0, applicationscnt_629L is number 0, applicationscnt_867L is number 1, avgdbddpdlast24m_3658932P is <|NUM|>, avgdbddpdlast3m_4187120P is <|NUM|>, avgdbdtollast24m_4525197P is <|NUM|>, avgdpdtolclosure24_3658938P is number 0, avginstallast24m_3658937A is <|AMOUNT|>, avglnamtstart24m_4525187A is <|AMOUNT|>, avgmaxdpdlast9m_3716943P is number 0, avgoutstandbalancel6m_4187114A is <|AMOUNT|>, avgpmtlast12m_4525200A is <|AMOUNT|>, bankacctype_710L is empty, cardtype_51L is empty, clientscnt12m_3712952L is number 0, clientscnt3m_3712950L is number 0, clientscnt6m_3712949L is number 0, clientscnt_100L is number 0, clientscnt_1022L is number 0, clientscnt_1071L is number 0, clientscnt_1130L is number 0, clientscnt_136L 

## Note
 - Integers from 0 to 9 are encoded directly so that the number encoder can focus on more challenging values.
 - Null blocks are replaced with "empty" rather than a statistical measure (mean, mode, etc.) to maintain the one-to-one mapping between the original table and the text, thereby reducing preprocessing.

# Testing Random Sampling Strategy in Training

### Purpose
Random sampling is employed during the training phase to address specific challenges and improve model performance.

### Reasons for Random Sampling
- **Generalization**: This approach helps the network generalize more effectively by exposing it to a diverse range of input distributions.
- **Imbalanced Data**: Our dataset is imbalanced. Random sampling lets us apply undersampling, oversampling, or the natural distribution of the data in different training epochs to mitigate the effect of the imbalance on the model.
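To make the three options concrete, here is a rough sketch of how an epoch's rows could be drawn under each strategy. It is not the `DataBatcher` code: the function name `sample_epoch_indices`, its parameters, and the target-fraction arithmetic are assumptions chosen for illustration.

```python
import random


def sample_epoch_indices(minority, majority, strategy, target_fraction, rng):
    """Return (indices, minority_fraction) for one epoch.

    strategy (hypothetical names):
      "natural"     keeps every row, so the data's own class ratio is used.
      "undersample" drops majority rows until the target fraction is met.
      "oversample"  repeats minority rows until the target fraction is met.
    """
    if not 0 < target_fraction < 1:
        raise ValueError("target_fraction must be strictly between 0 and 1")
    minority, majority = list(minority), list(majority)
    if strategy == "natural":
        chosen_min, chosen_maj = minority, majority
    elif strategy == "undersample":
        # Majority rows needed so that minority / total == target_fraction.
        n_maj = round(len(minority) * (1 - target_fraction) / target_fraction)
        chosen_min = minority
        chosen_maj = rng.sample(majority, k=min(n_maj, len(majority)))
    elif strategy == "oversample":
        # Minority rows needed, drawn with replacement.
        n_min = round(len(majority) * target_fraction / (1 - target_fraction))
        chosen_min = rng.choices(minority, k=max(n_min, len(minority)))
        chosen_maj = majority
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    indices = chosen_min + chosen_maj
    rng.shuffle(indices)
    return indices, len(chosen_min) / len(indices)


rng = random.Random(0)
minority = list(range(10))       # 10 minority rows
majority = list(range(10, 100))  # 90 majority rows
for strategy in ("natural", "undersample", "oversample"):
    _, fraction = sample_epoch_indices(minority, majority, strategy, 0.25, rng)
    print(strategy, round(fraction, 2))
```

Undersampling shortens the epoch while oversampling lengthens it, so switching between them across epochs also changes how many majority rows the model sees per epoch.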


In [4]:
info.train.new_epoch_prep(force_other=True)
batch = info.get_meta_data(
        batch_size=128,
        data_type="train",
        ignore_list=["case_id"],
        output_list=["WEEK_NUM", "target"],
    )
print(f"Minority percentage: {round(info.train.samp_percentage, 2)}")

Minority percentage: 0.28


In [5]:
info.train.choose_sampling_strat(force_other=True)
batch = info.get_meta_data(
        batch_size=128,
        data_type="train",
        ignore_list=["case_id"],
        output_list=["WEEK_NUM", "target"],
    )
print(f"Minority percentage: {round(info.train.samp_percentage, 2)}")

Minority percentage: 0.24


In [6]:
info.train.new_epoch_prep(force_other=True)
batch = info.get_meta_data(
        batch_size=128,
        data_type="train",
        ignore_list=["case_id"],
        output_list=["WEEK_NUM", "target"],
    )
print(f"Minority percentage: {round(info.train.samp_percentage, 2)}")

Minority percentage: 0.34


## Implementation Details
- **Stratified Batching**: For every batch, we ensure that the target variable is stratified: each batch holds the same proportion of each target class, which makes training outcomes more reliable.
- **Dynamic Sampling Adjustment**: The `choose_sampling_strat` function is central to this setup. It adjusts the sampling strategy and the percentage of samples for each epoch, using feedback from the previous run to set the strategy for the next one, so that sampling adapts as training progresses.
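The two points above can be illustrated with a small sketch. Both functions are hypothetical: `stratified_batches` shows one way to keep a fixed class share in every batch, and `next_minority_fraction` is an invented feedback rule (based on minority recall) standing in for whatever `choose_sampling_strat` actually does, which this notebook does not show.

```python
import random


def stratified_batches(minority, majority, batch_size, minority_fraction, rng):
    """Yield batches that each hold the same share of minority rows."""
    n_min = round(batch_size * minority_fraction)
    n_maj = batch_size - n_min
    if n_min < 1 or n_maj < 1:
        raise ValueError("each batch needs at least one row from each class")
    minority, majority = list(minority), list(majority)
    rng.shuffle(minority)
    rng.shuffle(majority)
    # Stop when either class can no longer fill its share of a batch.
    n_batches = min(len(minority) // n_min, len(majority) // n_maj)
    for b in range(n_batches):
        batch = (minority[b * n_min:(b + 1) * n_min]
                 + majority[b * n_maj:(b + 1) * n_maj])
        rng.shuffle(batch)
        yield batch


def next_minority_fraction(current, minority_recall, target_recall=0.8,
                           step=0.05, low=0.1, high=0.5):
    """Hypothetical feedback rule: show more minority rows when recall is low."""
    if minority_recall < target_recall:
        current += step
    else:
        current -= step
    return min(high, max(low, current))


rng = random.Random(0)
minority = list(range(40))        # 40 minority rows
majority = list(range(100, 300))  # 200 majority rows
batches = list(stratified_batches(minority, majority, 16, 0.25, rng))
print(len(batches), sum(1 for i in batches[0] if i < 100))
```

Clamping the fraction between `low` and `high` is a design choice in this sketch to keep the feedback loop from drifting to an extreme class ratio.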

## Note
These strategies target common issues such as overfitting and bias caused by imbalanced datasets, with the aim of improving the overall efficacy and fairness of the model.