# Data Preprocessing

1. **Data Loading and Preprocessing**:
   - Loads the data from a CSV file and standardizes specific columns (such as loan information, user states, etc.) for model training.
   - Creates a new feature `installment_timestep` based on `loan_id` and `installment`.

2. **Data Splitting**:
   - Splits the data into training (`train`) and testing (`test`) sets based on the `sample` and `group` columns.

3. **Feature and Label Preparation**:
   - For each `loan_id`, extracts features and labels.
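   - Features include `loan_id`, the bank action (`action_num_actual`), user attributes, and the current loan state.
   - Labels are the state variables at the next time step (e.g., `y_installment` holds the next row's `installment`); the last row of each loan reuses its own state.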

4. **Data Saving and Batching**:
   - Saves the processed data as CSV files and writes it in batches, grouped by `loan_id`, to pickle files for later training.

5. **Training and Validation Split**:
   - Randomly selects 10% of the training data as a validation set and the rest as the training set.

The end result is processed data in a format suitable for training: standardized, properly split, and saved to disk.
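
As a rough illustration of step 5, the sketch below holds out 10% of the training loans for validation. It assumes the split is made at the `loan_id` level so that all rows of a loan stay on the same side; the function name `split_train_valid`, the `valid_frac` and `seed` parameters, and the toy frame are hypothetical and not taken from the original pipeline.

```python
import numpy as np
import pandas as pd


def split_train_valid(df, valid_frac=0.1, seed=0):
    """Hold out a random fraction of loans (not rows) as a validation set."""
    loan_ids = df["loan_id"].unique()
    rng = np.random.default_rng(seed)
    # Assumption: at least one loan always goes to validation.
    n_valid = max(1, int(round(len(loan_ids) * valid_frac)))
    valid_ids = rng.choice(loan_ids, size=n_valid, replace=False)
    is_valid = df["loan_id"].isin(valid_ids)
    return df[~is_valid], df[is_valid]


# Toy data: 20 loans with 3 rows each (made-up values).
toy = pd.DataFrame(
    {
        "loan_id": np.repeat(np.arange(20), 3),
        "installment": np.tile([1, 2, 3], 20),
    }
)
train_part, valid_part = split_train_valid(toy)
print(train_part["loan_id"].nunique(), valid_part["loan_id"].nunique())
```

Splitting by loan rather than by row keeps each loan's full repayment sequence in a single set; if the original pipeline samples rows instead, `df.sample(frac=0.1)` would be the closer analogue.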


In [None]:
import pandas as pd
data = pd.read_csv('./Data/20240205fullsample_new.csv')
data.head()

In [None]:
# show data summary
data.describe()

`installment_timestep` numbers the records within each installment of each loan: for every (`loan_id`, `installment`) pair, it counts the rows in the order they appear, starting at 1.

In [None]:
data["installment_timestep"] = data.groupby(["loan_id", "installment"]).cumcount() + 1
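
To make the behavior concrete, here is a small toy example (the values are made up). `cumcount()` numbers the rows inside each `(loan_id, installment)` group starting from 0, so adding 1 gives a 1-based counter that restarts whenever the loan or the installment changes.

```python
import pandas as pd

# Toy frame: loan 1 has two installments with two records each; loan 2 has one.
toy = pd.DataFrame(
    {
        "loan_id": [1, 1, 1, 1, 2, 2],
        "installment": [1, 1, 2, 2, 1, 1],
    }
)
toy["installment_timestep"] = toy.groupby(["loan_id", "installment"]).cumcount() + 1
print(toy["installment_timestep"].tolist())  # [1, 2, 1, 2, 1, 2]
```

The counter follows the row order of the frame, so the data should already be sorted chronologically within each loan before this step.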

In [None]:
data_sim = data.loc[data['sample'] == 'rlsimulator']
data_sim

In [None]:
# Column groups used to build features and labels
loan_id = ['loan_id']
bank_features = ['action_num_actual']
user_features = ['gender',
                 'age',
                 'amount',
                 'num_loan',
                 'duration',
                 'year_ratio',
                 'diff_city',
                 'marriage',
                 'kids',
                 'month_in',
                 'housing',
                 'edu',
                 'motivation']
current_state = ['installment',
                 'installment_timestep',
                 'state_cum_overduelength',
                 'remaining_debt',
                 'state_capital',
                 'state_interests',
                 'state_penalty',
                 ]
other_labels = ['installment_done',
                'loan_done',
                'recovery_rate_weighted']

In [None]:
from tqdm.auto import tqdm


loan_id_list = data_sim["loan_id"].unique().tolist()
# len(loan_id_list)
# loan_id_list
target_state = pd.DataFrame()


col_matching = {
    "installment": "y_installment",
    "installment_timestep": "y_installment_timestep",
    "state_cum_overduelength": "y_state_cum_overduelength",
    "remaining_debt": "y_remaining_debt",
    "state_capital": "y_state_capital",
    "state_interests": "y_state_interests",
    "state_penalty": "y_state_penalty",
}


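# Build next-step targets loan by loan, keeping the original row index so that
# the targets stay aligned with data_sim in the later column-wise concat.
targets = []
for example_id in tqdm(loan_id_list):
    example_data = data_sim.loc[data_sim["loan_id"] == example_id]
    y_train = example_data[current_state].rename(columns=col_matching)

    if y_train.shape[0] > 1:
        # Shift up by one step; the last row repeats its own state.
        shifted = pd.concat([y_train.iloc[1:], y_train.iloc[[-1]]])
        shifted.index = y_train.index
        y_train = shifted
    targets.append(y_train)

if targets:
    # A single concat avoids repeatedly copying a growing frame inside the loop.
    target_state = pd.concat(targets).reindex(data_sim.index)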


target_state
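
For larger samples, the same next-step targets can be sketched without a Python loop. This is an alternative, not the notebook's method: it shifts each state column up by one row within its loan using `groupby(...).shift(-1)` and fills the last row of each loan with that row's own values. It assumes the state columns contain no missing values, and it keeps the original row index. The toy frame and the two-column `state_cols` subset are made up.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "loan_id": [1, 1, 1, 2],
        "installment": [1, 2, 3, 1],
        "remaining_debt": [300.0, 200.0, 100.0, 50.0],
    },
    index=[10, 11, 12, 13],  # non-default index, as after filtering rows
)
state_cols = ["installment", "remaining_debt"]

# Next row's state within each loan; the last row of a loan has no successor (NaN).
next_state = toy.groupby("loan_id", sort=False)[state_cols].shift(-1)
# Repeat the final state for the last row of each loan, then restore the dtypes
# (shifting turns integer columns into floats because of the NaN).
next_state = next_state.fillna(toy[state_cols]).astype(toy[state_cols].dtypes.to_dict())
next_state = next_state.add_prefix("y_")
print(next_state)
```

Because the result shares its index with the input frame, it can be joined column-wise to the features without any reordering.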

In [None]:
# Combine group, features, current state, next-step targets, and other labels column-wise
data_sim_full = pd.concat(
    [
        data_sim[["group"]],
        data_sim[loan_id + bank_features + user_features + current_state],
        target_state,
        data_sim[other_labels],
    ],
    axis=1,
)
data_sim_full
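
Note that `pd.concat(..., axis=1)` aligns rows by index label, not by position. The toy sketch below (made-up values) shows how a frame with a filtered index and a frame with a fresh `0..n-1` index end up misaligned, and how giving both the same index restores a row-for-row join. A check of this kind may be worth running on `data_sim_full`.

```python
import pandas as pd

features = pd.DataFrame({"x": [1, 2, 3]}, index=[5, 6, 7])  # index kept from a filter
targets = pd.DataFrame({"y_x": [2, 3, 3]})                  # fresh 0..2 index

# No index labels match, so the result is padded with NaN rows.
misaligned = pd.concat([features, targets], axis=1)
print(misaligned.shape)  # (6, 2)

# With identical labels on both sides the join is row-for-row.
targets.index = features.index
aligned = pd.concat([features, targets], axis=1)
print(aligned.shape)  # (3, 2)
```

A simple guard is to compare the number of rows before and after the concat and to confirm that the combined frame has no unexpected missing values.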

# Save the data

In [None]:
data_sim_full.to_csv('./Res/simulator_data.csv', index=False)

In [None]:
data_sim_full.to_excel('./Res/simulator_data.xlsx', index=False)