# Data Preprocessing

1. **Data Loading and Preprocessing**:
   - Loads the data from a CSV file and standardizes (z-scores) the state columns, which cover loan state, loan terms, and borrower attributes, for model training.
   - Creates a new feature `installment_timestep`: a 1-based row counter within each (`loan_id`, `installment`) group.

2. **Data Splitting**:
   - Keeps the rows with `sample == 'rlsimulator'` and splits them into training (`train`) and testing (`test`) sets by the `group` column.

3. **Feature and Label Preparation**:
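   - For each `loan_id`, extracts features and labels.
   - Features include `loan_id`, the standardized state columns, and `action_num_actual`, `installment_done`, `loan_done`, and `recovery_rate_weighted`.
   - Labels are the seven time-varying state variables (`installment` through `state_penalty`) at the next time step, stored with a `.1` suffix; the last step of each loan reuses its own state as its label.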

4. **Data Saving and Batching**:
   - Saves the processed data as CSV files and stores the data in multiple batches by `loan_id` into pickle files for later training.

5. **Training and Validation Split**:
   - Randomly selects 10% of the training data as a validation set and the rest as the training set.

The goal of this code is to save the processed data in a format suitable for training: standardized, properly split, and stored on disk.

In [None]:
import pandas as pd
data = pd.read_csv('./Data/20240205fullsample_new.csv')
data.head()

In [None]:
# show data summary
data.describe()

In [None]:
# show column names
data.columns

In [4]:
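# 1-based position of each row within its (loan_id, installment) group;
# the counter follows the current row order of `data`
data['installment_timestep'] = data.groupby(
    ['loan_id', 'installment']).cumcount() + 1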
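
To make the new feature concrete, here is a minimal sketch on a toy frame. The values are made up for illustration; only the column names come from the notebook. It assumes the rows are already in chronological order within each loan, since `cumcount()` numbers rows in the order they appear.

```python
import pandas as pd

# Toy frame (hypothetical values): loan 1 has two rows in installment 1
# and one row in installment 2; loan 2 has two rows in installment 1.
toy = pd.DataFrame({
    'loan_id':     [1, 1, 1, 2, 2],
    'installment': [1, 1, 2, 1, 1],
})

# cumcount() numbers rows from 0 within each group; +1 makes it 1-based.
toy['installment_timestep'] = (
    toy.groupby(['loan_id', 'installment']).cumcount() + 1
)
print(toy['installment_timestep'].tolist())  # [1, 2, 1, 1, 2]
```

The counter restarts whenever either `loan_id` or `installment` changes.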

In [5]:
states = ['installment', 'installment_timestep', 'state_cum_overduelength',
          'remaining_debt', 'state_capital', 'state_interests',
          'state_penalty', 'gender', 'age',
          'amount', 'num_loan', 'duration',
          'year_ratio', 'diff_city', 'marriage',
          'kids', 'month_in', 'housing',
          'edu', 'motivation']

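# z-score every state column; the mean and std are computed on the full
# sample, i.e. before the train/test split below
data[states] = (data[states] - data[states].mean()) / data[states].std()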
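
The cell above computes the mean and standard deviation over the full sample. If leakage of test-set statistics into training is a concern, one common alternative is to fit the statistics on the training rows only and apply them everywhere. The sketch below illustrates that variant on a toy frame; it is not what the notebook does, and the values and the `cols` name are assumptions made for illustration.

```python
import pandas as pd

# Toy frame (hypothetical values); 'group' mirrors the train/test flag
# used in this notebook.
toy = pd.DataFrame({
    'group':  ['train', 'train', 'train', 'test'],
    'amount': [1.0, 2.0, 3.0, 10.0],
})
cols = ['amount']

# Compute mean and std on the training rows only ...
train_mask = toy['group'] == 'train'
mu = toy.loc[train_mask, cols].mean()
sigma = toy.loc[train_mask, cols].std()  # pandas default: sample std (ddof=1)

# ... then apply the same transform to every row.
scaled = toy.copy()
scaled[cols] = (toy[cols] - mu) / sigma
print(scaled['amount'].tolist())  # [-1.0, 0.0, 1.0, 8.0]
```

With either variant, a column with zero variance yields NaN or infinite values after the division, so constant columns are worth checking for beforehand.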

In [None]:
data

In [None]:
data_rlsim = data[data['sample'] == 'rlsimulator']
data_rlsim.shape

In [None]:
train = data.loc[(data['sample'] == 'rlsimulator')
                 & (data['group'] == 'train')]
train.shape

In [None]:
test = data.loc[(data['sample'] == 'rlsimulator')
                & (data['group'] == 'test')]
test.shape

In [10]:
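# Only the test split is processed here. To process both splits, use the two
# commented-out lines instead; note that the loop below keeps its result in a
# single X_df / y_df pair, so later entries overwrite earlier ones.
# data_list = [train, test]
# data_name_list = ['train', 'test']
data_list = [test]
data_name_list = ['test']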

In [None]:
from tqdm.auto import tqdm

# Time-varying state columns; their next-step values are the labels.
label_cols = ['installment', 'installment_timestep', 'state_cum_overduelength',
              'remaining_debt', 'state_capital', 'state_interests', 'state_penalty']
feature_cols = ['loan_id'] + states + ['action_num_actual', 'installment_done',
                                       'loan_done', 'recovery_rate_weighted']

for dt in tqdm(data_list, leave=True):
    X_parts, y_parts = [], []

    # sort=False keeps the loans in order of first appearance
    for _, df1 in tqdm(dt.groupby('loan_id', sort=False), leave=True):
        X_parts.append(df1[feature_cols])

        # Labels are the state at the next time step, suffixed with '.1'
        y_train = df1[label_cols].add_suffix('.1')
        if y_train.shape[0] > 1:
            # Drop the first row and repeat the last one so that X and y
            # keep the same number of rows
            y_train = pd.concat([y_train.iloc[1:], y_train.iloc[[-1]]])
        y_parts.append(y_train)

    # Concatenate once per dataset; growing a DataFrame with pd.concat
    # inside the inner loop copies all previous rows on every iteration
    X_df = pd.concat(X_parts, ignore_index=True)
    y_df = pd.concat(y_parts, ignore_index=True)
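
The label construction in the loop above can also be written without an explicit per-loan loop. The following is a minimal sketch on a toy frame (hypothetical values), assuming the label columns contain no missing values, because `fillna` would otherwise also fill genuine gaps in non-final rows.

```python
import pandas as pd

# Toy frame (hypothetical values): loan 1 has three steps, loan 2 has one.
toy = pd.DataFrame({
    'loan_id':        [1, 1, 1, 2],
    'remaining_debt': [100.0, 60.0, 0.0, 50.0],
})
label_cols = ['remaining_debt']

# Shift each loan's rows up by one so that row t holds the state at t+1.
next_state = toy.groupby('loan_id')[label_cols].shift(-1)

# The last row of each loan has no successor; fall back to its own state,
# which matches the repeated-last-row convention of the loop above.
labels = next_state.fillna(toy[label_cols]).add_suffix('.1')
print(labels['remaining_debt.1'].tolist())  # [60.0, 0.0, 0.0, 50.0]
```

Unlike the loop, this version keeps the rows in their original order rather than grouping them by loan, so the features must be taken in that same order.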

In [None]:
X_df.shape

In [None]:
y_df.shape

In [15]:
import os

# create the output directory if it does not exist yet
save_folder = './Res/Data/Test'
os.makedirs(save_folder, exist_ok=True)

X_df.to_csv(f'{save_folder}/X_df.csv', index=False)
y_df.to_csv(f'{save_folder}/y_df.csv', index=False)
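
Steps 4 and 5 of the overview (pickle batches by `loan_id` and a 10% validation hold-out) could look roughly like the sketch below. Several things here are assumptions rather than facts from the notebook: the hold-out is drawn at the loan level so that each loan's sequence stays intact, and the batch size, file names, and temporary output folder are hypothetical.

```python
import os
import pickle
import tempfile

import numpy as np
import pandas as pd

# Toy training frame (hypothetical values): 10 loans with 2 rows each.
toy = pd.DataFrame({
    'loan_id': np.repeat(np.arange(10), 2),
    'remaining_debt': np.linspace(100.0, 5.0, 20),
})

# Hold out 10% of the loans (not rows) as the validation set.
rng = np.random.default_rng(0)            # fixed seed for reproducibility
loan_ids = toy['loan_id'].unique()
n_val = max(1, int(round(0.1 * len(loan_ids))))
val_ids = rng.choice(loan_ids, size=n_val, replace=False)
is_val = toy['loan_id'].isin(val_ids)
val_part, train_part = toy[is_val], toy[~is_val]

# Write the training loans to pickle files, batch_size loans per file.
batch_size = 4                            # hypothetical batch size
out_dir = tempfile.mkdtemp()              # stand-in for the real output folder
train_ids = train_part['loan_id'].unique()
paths = []
for start in range(0, len(train_ids), batch_size):
    ids = train_ids[start:start + batch_size]
    batch = train_part[train_part['loan_id'].isin(ids)]
    path = os.path.join(out_dir, f'train_batch_{start // batch_size}.pkl')
    with open(path, 'wb') as f:
        pickle.dump(batch, f)
    paths.append(path)

print(len(val_part), len(train_part), len(paths))  # 2 18 3
```

Splitting by loan rather than by row prevents steps of the same loan from appearing in both the training and the validation set.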