# Data processing notebook

### Overview of the features
Feature types:
- categorical:
   - `person_home_ownership` has 4 unique categories
   - `loan_intent` has 6 unique categories
   - `loan_grade` has 7 unique categories
   - `cb_person_default_on_file` has 2 unique categories
   - `loan_status` target column; binary; already encoded
- numerical:
  - `person_age` range: 20-144
  - `person_income` range: 4,000 - 6,000,000
  - `person_emp_length` range: 0 - 123
  - `loan_amnt` range: 500 - 35,000
  - `loan_int_rate` range: 5.42 - 23.22
  - `loan_percent_income` range: 0.00 - 0.83
  - `cb_person_cred_hist_length` range: 2 - 30
  
### Processing steps

Split the data into train and validation sets, holding out 20% for validation (`test_size=0.2`).

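Data processing steps collected from EDA:
- Drop all rows containing NaNs
- Drop age outliers such as 130 and 144 (the code keeps only rows with `person_age` <= 85)
- Drop rows where `person_age` - `person_emp_length` < 14, since they imply employment before the age of 14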

Categorical data processing steps:
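- assuming the ordering `rent -> mortgage -> own`, we encode `person_home_ownership` with ordinal encoding (the code maps `OTHER` to 0, below `RENT`)
- since `cb_person_default_on_file` is binary, a single binary (0/1) column would be enough; note that the code in this notebook lists it in `categorical_one_hot`, so it ends up as two one-hot columns
- the other categorical features are one-hot encoded.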

Numerical data processing steps:
- Convert all features to double (`float64`)
- Apply MinMax Scaler
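
The plan above could also be expressed as a single scikit-learn `ColumnTransformer`. The sketch below is only an illustration on a made-up toy frame with a subset of the columns; it is not the approach used in the rest of this notebook, which applies each step by hand. The name `toy` and all values in it are assumptions.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder, OrdinalEncoder

# Toy rows standing in for the real dataset (values are illustrative only).
toy = pd.DataFrame({
    "person_age": [22, 35, 48],
    "loan_amnt": [1000, 5500, 35000],
    "person_home_ownership": ["RENT", "OWN", "MORTGAGE"],
    "loan_intent": ["PERSONAL", "MEDICAL", "EDUCATION"],
})

preprocess = ColumnTransformer(
    [
        # Numerical columns are scaled to [0, 1] based on the fitted data.
        ("num", MinMaxScaler(), ["person_age", "loan_amnt"]),
        # Ordinal codes follow the listed order: OTHER=0, RENT=1, MORTGAGE=2, OWN=3.
        ("ord", OrdinalEncoder(categories=[["OTHER", "RENT", "MORTGAGE", "OWN"]]),
         ["person_home_ownership"]),
        # Remaining categorical columns become one column per category.
        ("ohe", OneHotEncoder(), ["loan_intent"]),
    ],
    sparse_threshold=0,  # always return a dense array
)

features = preprocess.fit_transform(toy)
print(features.shape)  # 2 scaled + 1 ordinal + 3 one-hot columns
```

Bundling the steps this way keeps the fitted scaler and encoders in one object, at the cost of hiding the intermediate frames that the cells below inspect one by one.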


In [30]:
import os

os.getcwd()

'/Users/maxmartyshov/Desktop/IU/year3/sem2/XAI/Credit-Risk-Analysis-Counterfactual-Explanations/notebooks'

### Load Data

In [1]:
import pandas as pd
import numpy as np

df = pd.read_csv("../data/raw/credit_risk_dataset.csv")

df.head()

Unnamed: 0,person_age,person_income,person_home_ownership,person_emp_length,loan_intent,loan_grade,loan_amnt,loan_int_rate,loan_status,loan_percent_income,cb_person_default_on_file,cb_person_cred_hist_length
0,22,59000,RENT,123.0,PERSONAL,D,35000,16.02,1,0.59,Y,3
1,21,9600,OWN,5.0,EDUCATION,B,1000,11.14,0,0.1,N,2
2,25,9600,MORTGAGE,1.0,MEDICAL,C,5500,12.87,1,0.57,N,3
3,23,65500,RENT,4.0,MEDICAL,C,35000,15.23,1,0.53,N,2
4,24,54400,RENT,8.0,MEDICAL,C,35000,14.27,1,0.55,Y,4


In [3]:
df.describe()

Unnamed: 0,person_age,person_income,person_emp_length,loan_amnt,loan_int_rate,loan_status,loan_percent_income,cb_person_cred_hist_length
count,32581.0,32581.0,31686.0,32581.0,29465.0,32581.0,32581.0,32581.0
mean,27.7346,66074.85,4.789686,9589.371106,11.011695,0.218164,0.170203,5.804211
std,6.348078,61983.12,4.14263,6322.086646,3.240459,0.413006,0.106782,4.055001
min,20.0,4000.0,0.0,500.0,5.42,0.0,0.0,2.0
25%,23.0,38500.0,2.0,5000.0,7.9,0.0,0.09,3.0
50%,26.0,55000.0,4.0,8000.0,10.99,0.0,0.15,4.0
75%,30.0,79200.0,7.0,12200.0,13.47,0.0,0.23,8.0
max,144.0,6000000.0,123.0,35000.0,23.22,1.0,0.83,30.0


In [32]:
categorical_one_hot = ['loan_intent', 'loan_grade', 'cb_person_default_on_file']
categorical_ordinal = ['person_home_ownership']
numerical = ['person_age', 'person_income', 'person_emp_length', 'loan_amnt', 'loan_int_rate', 'loan_percent_income', 'cb_person_cred_hist_length']
target = 'loan_status'

### Dropping operations

In [33]:
df = df[df['person_age'] <= 85]  # drop age outliers (e.g. 144)
df = df[df['person_age'] - df['person_emp_length'] >= 14]  # drop rows implying employment before age 14
df = df.dropna()  # drop rows with any remaining missing values
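
To make the effect of the three filters concrete, here is a minimal sketch on a made-up four-row frame (the frame `toy` is an assumption, not part of the dataset, although its first row mirrors the age 22 / employment length 123 record shown in `df.head()`). It also shows that a row with a missing `person_emp_length` is already removed by the comparison, because a comparison with NaN evaluates to `False`.

```python
import numpy as np
import pandas as pd

# Illustrative rows: impossible employment length, age outlier, missing value, valid row.
toy = pd.DataFrame({
    "person_age": [22, 144, 30, 25],
    "person_emp_length": [123.0, 4.0, np.nan, 5.0],
})

kept = toy[toy["person_age"] <= 85]                                # removes the age 144 row
kept = kept[kept["person_age"] - kept["person_emp_length"] >= 14]  # removes rows 0 and 2
kept = kept.dropna()                                               # nothing left to drop here

print(kept.index.tolist())  # [3]
```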

For future use, we replace `loan_percent_income` with the exact ratio `loan_amnt / person_income` instead of the rounded value stored in the dataset.

In [34]:
df['loan_percent_income'] = df['loan_amnt'] / df['person_income']
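
As a quick check of what the recomputation changes, the sketch below uses the first two rows shown by `df.head()` (loan amounts 35000 and 1000, incomes 59000 and 9600, stored ratios 0.59 and 0.1). The frame name `rows` is an assumption made for this illustration.

```python
import pandas as pd

# First two rows from df.head(), with the rounded ratio stored in the raw file.
rows = pd.DataFrame({
    "loan_amnt": [35000, 1000],
    "person_income": [59000, 9600],
    "loan_percent_income": [0.59, 0.10],
})

exact = rows["loan_amnt"] / rows["person_income"]
print(exact.round(4).tolist())  # [0.5932, 0.1042]

# Difference between the exact ratio and the stored, rounded one.
print((exact - rows["loan_percent_income"]).abs().round(4).tolist())
```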

### Split Data

In [35]:
from sklearn.model_selection import train_test_split

df_train, df_val = train_test_split(df, test_size=0.2, random_state=42, stratify=df['loan_status'])
df_train.shape, df_val.shape

((22905, 12), (5727, 12))
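
The `stratify` argument keeps the share of defaults roughly equal in both splits. A minimal sketch on a made-up frame with 20% positives (the frame `toy` and its column `x` are assumptions) shows the idea:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# 100 illustrative rows, 20 of them with loan_status == 1.
toy = pd.DataFrame({"x": range(100), "loan_status": [1] * 20 + [0] * 80})

train, val = train_test_split(
    toy, test_size=0.2, random_state=42, stratify=toy["loan_status"]
)

# Both splits keep the 20% positive rate of the full frame.
print(len(train), len(val))
print(train["loan_status"].mean(), val["loan_status"].mean())
```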

### Process train data

In [36]:
X_train_numerical = df_train[numerical]
X_train_ordinal = df_train[categorical_ordinal]
X_train_categorical = df_train[categorical_one_hot]
y_train = df_train[target]

In [37]:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
X_train_numerical_scaled = scaler.fit_transform(X_train_numerical)

X_train_numerical_scaled = pd.DataFrame(X_train_numerical_scaled, columns=numerical)
X_train_numerical_scaled.head()

Unnamed: 0,person_age,person_income,person_emp_length,loan_amnt,loan_int_rate,loan_percent_income,cb_person_cred_hist_length
0,0.078125,0.025207,0.073171,0.42029,0.298876,0.347888,0.0
1,0.328125,0.022105,0.097561,0.710145,0.652809,0.655995,0.392857
2,0.171875,0.045191,0.097561,0.333333,0.410674,0.159471,0.107143
3,0.0625,0.052068,0.195122,0.15942,0.17809,0.068656,0.035714
4,0.140625,0.008056,0.097561,0.043478,0.139326,0.124719,0.25


In [38]:
X_train_ordinal['person_home_ownership'].unique()

array(['RENT', 'MORTGAGE', 'OWN', 'OTHER'], dtype=object)

In [39]:
X_train_ordinal['person_home_ownership'].value_counts().get('OTHER', 0)

np.int64(72)

In [40]:
X_train_ordinal = X_train_ordinal['person_home_ownership'].replace({'OTHER': 0, 'RENT':1, 'MORTGAGE': 2, 'OWN': 3})
X_train_ordinal = X_train_ordinal.astype(float)



In [41]:
print("Same length?", X_train_numerical_scaled.shape[0] == X_train_ordinal.shape[0])
X_train_numerical_scaled = X_train_numerical_scaled.reset_index(drop=True)
X_train_ordinal = X_train_ordinal.reset_index(drop=True)
X_train_numerical_scaled_with_ordinal = pd.concat([X_train_numerical_scaled, X_train_ordinal], axis=1)
X_train_numerical_scaled_with_ordinal.head()

Same length? True


Unnamed: 0,person_age,person_income,person_emp_length,loan_amnt,loan_int_rate,loan_percent_income,cb_person_cred_hist_length,person_home_ownership
0,0.078125,0.025207,0.073171,0.42029,0.298876,0.347888,0.0,1.0
1,0.328125,0.022105,0.097561,0.710145,0.652809,0.655995,0.392857,1.0
2,0.171875,0.045191,0.097561,0.333333,0.410674,0.159471,0.107143,2.0
3,0.0625,0.052068,0.195122,0.15942,0.17809,0.068656,0.035714,2.0
4,0.140625,0.008056,0.097561,0.043478,0.139326,0.124719,0.25,1.0


In [42]:
from sklearn.preprocessing import OneHotEncoder

onehot_encoder = OneHotEncoder()
X_train_categorical_encoded = onehot_encoder.fit_transform(X_train_categorical).toarray()
X_train_categorical_encoded = pd.DataFrame(X_train_categorical_encoded, columns=onehot_encoder.get_feature_names_out(categorical_one_hot))

X_train_categorical_encoded.head()

Unnamed: 0,loan_intent_DEBTCONSOLIDATION,loan_intent_EDUCATION,loan_intent_HOMEIMPROVEMENT,loan_intent_MEDICAL,loan_intent_PERSONAL,loan_intent_VENTURE,loan_grade_A,loan_grade_B,loan_grade_C,loan_grade_D,loan_grade_E,loan_grade_F,loan_grade_G,cb_person_default_on_file_N,cb_person_default_on_file_Y
0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
1,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0
2,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0
3,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
4,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
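
The encoder above is created with default settings, so transforming a category it did not see during `fit` raises an error. If some category were present only in the validation split, one possible option is `handle_unknown="ignore"`, which encodes such a value as all zeros. The sketch below uses made-up arrays (`train`, `unseen`) and is not what this notebook does:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Illustrative data: the encoder is fitted on A, B, C and later sees G.
train = np.array([["A"], ["B"], ["C"]], dtype=object)
unseen = np.array([["G"]], dtype=object)

# Default behaviour: unknown categories raise a ValueError at transform time.
strict = OneHotEncoder().fit(train)
try:
    strict.transform(unseen)
    raised = False
except ValueError:
    raised = True

# Alternative: unknown categories are encoded as a row of zeros.
lenient = OneHotEncoder(handle_unknown="ignore").fit(train)
encoded = lenient.transform(unseen).toarray()

print(raised, encoded.tolist())
```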


In [43]:
X_train_numerical_scaled_with_ordinal = X_train_numerical_scaled_with_ordinal.reset_index(drop=True)
X_train_categorical_encoded = X_train_categorical_encoded.reset_index(drop=True)

X_train = pd.concat([X_train_numerical_scaled_with_ordinal, X_train_categorical_encoded], axis=1)
X_train.head()

Unnamed: 0,person_age,person_income,person_emp_length,loan_amnt,loan_int_rate,loan_percent_income,cb_person_cred_hist_length,person_home_ownership,loan_intent_DEBTCONSOLIDATION,loan_intent_EDUCATION,...,loan_intent_VENTURE,loan_grade_A,loan_grade_B,loan_grade_C,loan_grade_D,loan_grade_E,loan_grade_F,loan_grade_G,cb_person_default_on_file_N,cb_person_default_on_file_Y
0,0.078125,0.025207,0.073171,0.42029,0.298876,0.347888,0.0,1.0,0.0,1.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
1,0.328125,0.022105,0.097561,0.710145,0.652809,0.655995,0.392857,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0
2,0.171875,0.045191,0.097561,0.333333,0.410674,0.159471,0.107143,2.0,0.0,0.0,...,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0
3,0.0625,0.052068,0.195122,0.15942,0.17809,0.068656,0.035714,2.0,0.0,1.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
4,0.140625,0.008056,0.097561,0.043478,0.139326,0.124719,0.25,1.0,0.0,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0


In [44]:
print(X_train.shape)

(22905, 23)


In [45]:
y_train = y_train.reset_index(drop=True)
df_train_processed = pd.concat([X_train, y_train], axis=1)
df_train_processed.head()

Unnamed: 0,person_age,person_income,person_emp_length,loan_amnt,loan_int_rate,loan_percent_income,cb_person_cred_hist_length,person_home_ownership,loan_intent_DEBTCONSOLIDATION,loan_intent_EDUCATION,...,loan_grade_A,loan_grade_B,loan_grade_C,loan_grade_D,loan_grade_E,loan_grade_F,loan_grade_G,cb_person_default_on_file_N,cb_person_default_on_file_Y,loan_status
0,0.078125,0.025207,0.073171,0.42029,0.298876,0.347888,0.0,1.0,0.0,1.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0
1,0.328125,0.022105,0.097561,0.710145,0.652809,0.655995,0.392857,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1
2,0.171875,0.045191,0.097561,0.333333,0.410674,0.159471,0.107143,2.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0
3,0.0625,0.052068,0.195122,0.15942,0.17809,0.068656,0.035714,2.0,0.0,1.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0
4,0.140625,0.008056,0.097561,0.043478,0.139326,0.124719,0.25,1.0,0.0,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1


In [46]:
df_train_processed.dtypes

person_age                       float64
person_income                    float64
person_emp_length                float64
loan_amnt                        float64
loan_int_rate                    float64
loan_percent_income              float64
cb_person_cred_hist_length       float64
person_home_ownership            float64
loan_intent_DEBTCONSOLIDATION    float64
loan_intent_EDUCATION            float64
loan_intent_HOMEIMPROVEMENT      float64
loan_intent_MEDICAL              float64
loan_intent_PERSONAL             float64
loan_intent_VENTURE              float64
loan_grade_A                     float64
loan_grade_B                     float64
loan_grade_C                     float64
loan_grade_D                     float64
loan_grade_E                     float64
loan_grade_F                     float64
loan_grade_G                     float64
cb_person_default_on_file_N      float64
cb_person_default_on_file_Y      float64
loan_status                        int64
dtype: object

In [47]:
df_train_processed.to_csv('../data/processed/train.csv')

### Process validation data

In [48]:
X_val_numerical = df_val[numerical]
X_val_ordinal = df_val[categorical_ordinal]
X_val_categorical = df_val[categorical_one_hot]
y_val = df_val[target]

X_val_numerical_scaled = scaler.transform(X_val_numerical)
X_val_numerical_scaled = pd.DataFrame(X_val_numerical_scaled, columns=numerical)
X_val_ordinal = X_val_ordinal['person_home_ownership'].replace({'OTHER': 0, 'RENT':1, 'MORTGAGE': 2, 'OWN': 3})
X_val_ordinal = X_val_ordinal.astype(float)
X_val_numerical_scaled = X_val_numerical_scaled.reset_index(drop=True)
X_val_ordinal = X_val_ordinal.reset_index(drop=True)
X_val_numerical_scaled_with_ordinal = pd.concat([X_val_numerical_scaled, X_val_ordinal], axis=1)
X_val_categorical_encoded = onehot_encoder.transform(X_val_categorical).toarray()
X_val_categorical_encoded = pd.DataFrame(X_val_categorical_encoded, columns=onehot_encoder.get_feature_names_out(categorical_one_hot))
X_val_numerical_scaled_with_ordinal = X_val_numerical_scaled_with_ordinal.reset_index(drop=True)
X_val_categorical_encoded = X_val_categorical_encoded.reset_index(drop=True)
X_val = pd.concat([X_val_numerical_scaled_with_ordinal, X_val_categorical_encoded], axis=1)
y_val = y_val.reset_index(drop=True)
df_val_processed = pd.concat([X_val, y_val], axis=1)
df_val_processed.head()



Unnamed: 0,person_age,person_income,person_emp_length,loan_amnt,loan_int_rate,loan_percent_income,cb_person_cred_hist_length,person_home_ownership,loan_intent_DEBTCONSOLIDATION,loan_intent_EDUCATION,...,loan_grade_A,loan_grade_B,loan_grade_C,loan_grade_D,loan_grade_E,loan_grade_F,loan_grade_G,cb_person_default_on_file_N,cb_person_default_on_file_Y,loan_status
0,0.078125,0.032371,0.219512,0.131884,0.0,0.091472,0.035714,2.0,1.0,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0
1,0.03125,0.018371,0.04878,0.478261,0.418539,0.527643,0.035714,1.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1
2,0.03125,0.021613,0.073171,0.173913,0.59382,0.172898,0.0,2.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,1
3,0.140625,0.034385,0.268293,0.014493,0.252247,0.015766,0.107143,1.0,1.0,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0
4,0.046875,0.049612,0.170732,0.333333,0.333708,0.145661,0.0,2.0,0.0,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0


In [49]:
df_val_processed.to_csv('../data/processed/validation.csv')
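
The validation cell reuses the `scaler` fitted on the training split instead of refitting it. One consequence is that validation values outside the training range are mapped outside `[0, 1]`. A minimal sketch with made-up numbers (the arrays `train` and `val` are assumptions):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Illustrative single-feature data: the training range is 20..40.
train = np.array([[20.0], [30.0], [40.0]])
val = np.array([[25.0], [50.0]])

demo_scaler = MinMaxScaler().fit(train)    # learns min=20, max=40 from train only
val_scaled = demo_scaler.transform(val)    # (x - 20) / (40 - 20)

print(val_scaled.ravel().tolist())  # approximately [0.25, 1.5]
```

The second value exceeds 1 because 50 lies above the training maximum; this is expected when the scaler is fitted on the training split only.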

### Save scaler and encoder for future use

In [50]:
import pickle

with open('../models/scaler/scaler.pkl', 'wb') as f:
    pickle.dump(scaler, f)

with open('../models/encoder/encoder.pkl', 'wb') as f:
    pickle.dump(onehot_encoder, f)
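
Loading the saved objects later mirrors the cell above with `pickle.load`. The sketch below round-trips a small scaler through a temporary directory so that it runs on its own; the temporary path and the toy data are assumptions, while in this project the files would be the ones written above.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Illustrative scaler fitted on the range 0..10.
demo_scaler = MinMaxScaler().fit(np.array([[0.0], [10.0]]))

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "scaler.pkl"
    with open(path, "wb") as f:
        pickle.dump(demo_scaler, f)
    with open(path, "rb") as f:
        restored = pickle.load(f)  # only unpickle files you trust

# The restored scaler applies the same transformation as the original.
restored_value = restored.transform(np.array([[5.0]]))[0, 0]
print(restored_value)
```

Pickled scikit-learn objects are generally meant to be loaded with the same library version that created them.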