**step 2:** prediction model building 

The following imports are done because all previous functions were copied to python script files for easier navigation between notebooks, in this second step we will focus on using the data we cleaned to model predictions of the 'risk_score' parameter.

In [1]:
from src.pipeline import pipeline
from src.preprocess import BeforePipeline
from src.map import data_mapping
from src.feature_engineering import add_engineered_features
from sklearn.dummy import DummyRegressor
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split, learning_curve, ShuffleSplit
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import joblib
import pandas as pd
import numpy as np
import eli5
from eli5.sklearn import PermutationImportance
from src.config import cols_to_drop

In [2]:
data_path = '../data/raw/data_chunck.parquet'

In [3]:
def all_data_prep(n): #a function to quickly get how much data we need 
    df = pd.read_parquet(data_path).sample(n=n, random_state=2)
    print(f'Shape before data prep: {df.shape}')

    bp = BeforePipeline()
    df = bp.all_before_pipeline(df)
    df = add_engineered_features(df)
    df_y = np.log1p(df['risk_score']) #we log transform y to help the model
    df = df.drop(columns='risk_score')

    print(f'shape after data prep: {df.shape}')
    return df,df_y

### Feature Engineering
We will try to make new features that help the model generalize better by using existing ones.
here are all the features we will add:

Variants of debt to in income ratio:
- revol_bal_to_income = revol_bal / annual_inc
- loan_to_income = loan_amnt / annual_inc
- bc_limit_to_income = total_bc_limit / annual_inc

Credit utilization refinements:
- total_utilization = revol_bal / total_rev_hi_lim
- bc_to_total_limit_ratio = total_bc_limit / total_rev_hi_lim

New Metrics:
- tot_cur_bal_to_income = tot_cur_bal / annual_inc

Credit history features:
- recent_account_ratio = acc_open_past_24mths / total_acc

Risk Buckets & Flags:
- high_util_flag = (revol_util > 80).astype(int)
- short_history_flag = (credit_history_length < 24).astype(int)
- many_inquiries_flag = (inq_last_6mths >= 3).astype(int)


## Introduction to model building
Let's get started with model building, as we are dealing with very high dimensionality data, a lot of non linearity and mixed-type features, the optimal choice for us will probably be a tree based model, We will use a Random Forest Regressor, which is an application of bagging to decision trees with the additional “random subspace” trick: we create a large number of trees and each tree is trained on a bootstrap sample of the data chosen randomly with replacement, and at each split only a portion of the features are randomly chosen without replacement, generally $\sqrt{n}$ features for classification or $n/3$ for regression for $n$ features. 

This will save a ton of computing power while reducing the variance of our model.

In fact, if each tree result is a random variable, let's say we have $n$ random variables ${X_i}$ equally distributed but not necessarly independant, our model output $\tilde{X}$ would then be the average of all the results of the different trees:
$\tilde{X} = \frac{1}{n}\sum_{i=1}^{n} {X_i}$ if we let $Var({X_i}) = \sigma^{2}$ for all $i$ and $Cov({X_i},{X_j}) = \rho \sigma^{2}$ for all $i \neq j$ ($\rho$ being the coeificient of correlation),

Then, by a simple variance calculation we have that, $$Var(\tilde{X}) = \rho \sigma^{2} + \frac{1 - \rho}{n}\sigma^{2}$$
As n grow larger, $\frac{1 - \rho}{n}\sigma^{2} \xrightarrow[n \to \infty]{} 0$, we are then left with $Var(\tilde{X}) = \rho \sigma^{2} \leq  \sigma^{2}$ as $\rho \leq 1$.

So, the more trees we have, the more we reduce our model's output variance, this variance reduction is crucial for improving model stability, especially when evaluating the model across different economic scenarios to assess its robustness.

## Additional Note:
- In classification, the Random Forest aggregates by majority voting (which is equivalent to averaging the 0/1 votes and then thresholding) or by averaging the class probabilities. The variance formula applies to the probability estimates, which can then be thresholded to get the class label. This variance reduction in the probability estimates leads to more stable predictions.

In [4]:
#data collection
X,y = all_data_prep(n=200_000)
print("Infinite values after cleaning:", np.isinf(X.select_dtypes(include=[np.number])).sum().sum())
print("NaN values after cleaning:", X.isnull().sum().sum())
#baseline scikit-learns random forest
rf = RandomForestRegressor(random_state=1,n_jobs=-1)
pipe = pipeline(model=rf, ver=True)

#cross validation
scores = cross_val_score(estimator=pipe, X=X, y=y, cv=3, scoring='r2')
print(f'cross validation scores: {scores}\nmean: {scores.mean()}')

Shape before data prep: (200000, 148)
===> clean infinites called


  data['risk_score']= data['loan_status'].map(mapping)


===> data_mapping called
===> num_data_prep called
number of cols to drop: 44
data shape before dropping: (199996, 134)
44 columns dropped successfuly
data shape after dropping: (199996, 108)
===> drop_useless called
===> clean infinites called
shape after data prep: (199996, 118)
Infinite values after cleaning: 0
NaN values after cleaning: 14310
[ColumnTransformer] ......... (1 of 5) Processing scale, total=   0.5s
frequency data encoded successfuly
[ColumnTransformer]  (2 of 5) Processing frequency_cols, total=   0.0s
employement lenght encoded successfuly
[ColumnTransformer]  (3 of 5) Processing employement_lenght, total=   0.0s
earliest cr data encoded successfuly
[ColumnTransformer] ... (4 of 5) Processing account_age, total=   0.0s
[ColumnTransformer] ........... (5 of 5) Processing OHE, total=   0.1s
[Pipeline] .......... (step 1 of 2) Processing features, total=   0.8s
[Pipeline] ............. (step 2 of 2) Processing model, total= 3.8min
frequency data encoded successfuly
empl

In [5]:
X_train, X_val, y_train, y_val = train_test_split(X,y)

pipe.fit(X_train, y_train)
print('model fitted')
X_trans = pipe['features'].fit_transform(X_val)
print('transformed')
perms = PermutationImportance(pipe[1], scoring='r2', n_iter=3, random_state=1).fit(X_trans, y_val)

[ColumnTransformer] ......... (1 of 5) Processing scale, total=   0.6s
frequency data encoded successfuly
[ColumnTransformer]  (2 of 5) Processing frequency_cols, total=   0.0s
employement lenght encoded successfuly
[ColumnTransformer]  (3 of 5) Processing employement_lenght, total=   0.0s
earliest cr data encoded successfuly
[ColumnTransformer] ... (4 of 5) Processing account_age, total=   0.0s
[ColumnTransformer] ........... (5 of 5) Processing OHE, total=   0.1s
[Pipeline] .......... (step 1 of 2) Processing features, total=   0.9s
[Pipeline] ............. (step 2 of 2) Processing model, total= 4.4min
model fitted
[ColumnTransformer] ......... (1 of 5) Processing scale, total=   0.2s
frequency data encoded successfuly
[ColumnTransformer]  (2 of 5) Processing frequency_cols, total=   0.0s
employement lenght encoded successfuly
[ColumnTransformer]  (3 of 5) Processing employement_lenght, total=   0.0s
earliest cr data encoded successfuly
[ColumnTransformer] ... (4 of 5) Processing acc

In [6]:
ohe_cols = pipe[0]['OHE'].get_feature_names_out().tolist()
num_cols = [col for col in X_val.columns if X_val[col].dtype != 'object']

feature_names = num_cols + ['purpose','addr_state'] + ['emp_length'] + ['earliest_cr_line'] + ohe_cols
print(num_cols)
eli5.show_weights(perms, feature_names=feature_names)

['loan_amnt', 'funded_amnt', 'funded_amnt_inv', 'installment', 'annual_inc', 'dti', 'delinq_2yrs', 'fico_range_low', 'fico_range_high', 'inq_last_6mths', 'mths_since_last_delinq', 'mths_since_last_record', 'open_acc', 'pub_rec', 'revol_bal', 'revol_util', 'total_acc', 'out_prncp_inv', 'total_pymnt_inv', 'collections_12_mths_ex_med', 'mths_since_last_major_derog', 'policy_code', 'annual_inc_joint', 'dti_joint', 'tot_coll_amt', 'tot_cur_bal', 'open_acc_6m', 'open_act_il', 'open_il_12m', 'open_il_24m', 'mths_since_rcnt_il', 'total_bal_il', 'il_util', 'open_rv_12m', 'open_rv_24m', 'max_bal_bc', 'all_util', 'total_rev_hi_lim', 'inq_fi', 'total_cu_tl', 'inq_last_12m', 'acc_open_past_24mths', 'avg_cur_bal', 'bc_open_to_buy', 'bc_util', 'mo_sin_old_il_acct', 'mo_sin_old_rev_tl_op', 'mo_sin_rcnt_rev_tl_op', 'mo_sin_rcnt_tl', 'mort_acc', 'mths_since_recent_bc', 'mths_since_recent_bc_dlq', 'mths_since_recent_inq', 'mths_since_recent_revol_delinq', 'num_accts_ever_120_pd', 'num_actv_bc_tl', 'num_a

Weight,Feature
2.6825  ± 0.0234,total_pymnt_inv
2.0445  ± 0.0005,out_prncp_inv
1.1199  ± 0.0082,funded_amnt_inv
0.1232  ± 0.0025,debt_settlement_flag_N
0.0945  ± 0.0011,installment
0.0460  ± 0.0009,funded_amnt
0.0322  ± 0.0008,loan_amnt
0.0258  ± 0.0004,term_ 36 months
0.0038  ± 0.0003,term_ 60 months
0.0036  ± 0.0000,loan_to_income


In [10]:
with open('../notebooks/truc.txt', 'w') as f:        
    for col in X_train.columns:
        f.write(f"'{col}'")