# 🏦 Credit Risk Modeling — Model Training Workflow

This section documents the model training workflow for the Credit Risk Modeling project after feature engineering is completed.

---

## ✅ Objective

Train machine learning models that are **robust to outliers** on the final engineered dataset. Outliers are **not removed**; they are **flagged in `is_outlier_*` columns**.

---

## 📊 Data Preparation

- Dataset includes engineered features and `is_outlier_*` flag columns such as `is_outlier_amt_req_credit_bureau_day` (`1` = outlier, `0` = normal).
- Outliers are retained to preserve data distribution and potential signal.
- **Target Variable:** `TARGET`  
  - `1` → Default  
  - `0` → Non-default
- Features selected based on domain knowledge and feature importance insights.
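
A quick look at the class balance, and at how an outlier flag relates to the target, can be sketched as follows. The tiny DataFrame and its values are made up for illustration; the flag name `is_outlier_Income` mirrors the naming used later in the notebook and is an assumption about the training data.

```python
import pandas as pd

# Toy stand-in for the engineered dataset (values are illustrative only).
df = pd.DataFrame({
    "AMT_INCOME_TOTAL": [202500, 270000, 67500, 135000, 9000000, 121500],
    "is_outlier_Income": [0, 0, 0, 0, 1, 0],
    "TARGET": [1, 0, 0, 0, 1, 0],
})

# Share of each class in the target (1 = default, 0 = non-default).
class_share = df["TARGET"].value_counts(normalize=True)

# Default rate among flagged and unflagged rows.
default_rate_by_flag = df.groupby("is_outlier_Income")["TARGET"].mean()

print(class_share)
print(default_rate_by_flag)
```

If flagged rows default at a noticeably different rate than unflagged rows, the flag itself may carry signal, which is one argument for keeping the outliers.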

---

## ⚙️ Modeling Strategy

### 🧱 Baseline Model

- **Logistic Regression with Regularization (L2 penalty)**  
  Used to establish a benchmark performance.
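
A minimal sketch of such a baseline, assuming scikit-learn and synthetic stand-in data rather than the project's dataset. The `StandardScaler` step is an addition not used in the notebook cells; it is included because regularized linear models are sensitive to feature scale.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in data (roughly 10% positives).
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=10, weights=[0.9, 0.1], random_state=42
)

# LogisticRegression applies an L2 penalty by default;
# C is the inverse of the regularization strength.
baseline = make_pipeline(
    StandardScaler(),
    LogisticRegression(C=1.0, max_iter=1000),
)
baseline.fit(X_demo, y_demo)

# Predicted probability of the positive class (default).
proba = baseline.predict_proba(X_demo)[:, 1]
```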

---

### 🌳 Robust Models Trained

| Model              | Robust to Outliers | Notes                        |
|-------------------|--------------------|------------------------------|
| Decision Tree      | ✅ Yes             | Simple and interpretable     |
| Random Forest      | ✅ Yes             | Ensemble of decision trees   |
| XGBoost            | ✅ Yes             | Gradient boosting framework  |
| LightGBM           | ✅ Yes             | Efficient gradient boosting  |
| CatBoost           | ✅ Yes             | Handles categorical features |
| Ridge / Lasso      | ⚠️ Moderate        | Requires feature scaling     |

> ❌ Models such as plain Logistic Regression, KNN, and SVM (RBF kernel) are generally **sensitive to outliers** and are avoided or handled with care.
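
The robustness of tree-based models comes from the fact that splits depend on the ordering of feature values, not on their magnitude. The sketch below illustrates this on synthetic data using scikit-learn only; XGBoost, LightGBM, and CatBoost are left out to keep the example dependency-free. All names and values are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = make_classification(n_samples=1000, n_features=8, random_state=0)

# Copy the data and push the largest value of feature 0 far out of range.
X_extreme = X_demo.copy()
row = np.argmax(X_extreme[:, 0])
X_extreme[row, 0] += 1e6

tree_a = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_demo, y_demo)
tree_b = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_extreme, y_demo)

# The ordering of values is unchanged, so the extreme point should not
# change how the training rows are partitioned.
same_predictions = np.array_equal(tree_a.predict(X_demo), tree_b.predict(X_extreme))
print("Identical training predictions:", same_predictions)

# A random forest is fitted the same way and exposes predict_proba.
forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_extreme, y_demo)
forest_proba = forest.predict_proba(X_extreme)[:, 1]
```

A linear model fitted on `X_extreme` would not have this property, because a single extreme value can pull its coefficients.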

---

## 🧮 Evaluation Metrics

Models are evaluated using the following metrics:

- **ROC-AUC Score**
- **Precision, Recall, F1-Score**
- **Confusion Matrix**
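
The metrics above can be collected in one helper. This is a sketch, not the notebook's code: the `evaluate` function, its `threshold` parameter, and the toy labels and scores are hypothetical.

```python
import numpy as np
from sklearn.metrics import (
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
    roc_auc_score,
)


def evaluate(y_true, y_score, threshold=0.5):
    """Compute the listed metrics from true labels and predicted scores."""
    # Turn scores into hard labels at the chosen threshold.
    y_pred = (np.asarray(y_score) >= threshold).astype(int)
    return {
        "roc_auc": roc_auc_score(y_true, y_score),
        # zero_division=0 returns 0 without a warning when no positives are predicted.
        "precision": precision_score(y_true, y_pred, zero_division=0),
        "recall": recall_score(y_true, y_pred, zero_division=0),
        "f1": f1_score(y_true, y_pred, zero_division=0),
        "confusion_matrix": confusion_matrix(y_true, y_pred),
    }


y_true = [0, 0, 0, 1, 1, 0]
y_score = [0.1, 0.4, 0.6, 0.8, 0.3, 0.2]
report = evaluate(y_true, y_score)
print(report)
```

ROC-AUC uses the scores directly, while precision, recall, F1, and the confusion matrix depend on the threshold, so the two groups can disagree on an imbalanced dataset.
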

---

## ✅ Next Steps

- Perform hyperparameter tuning (e.g., GridSearchCV, Optuna)
- Compare model performances and rank based on AUC/F1
- Select the best-performing model for deployment or integration into risk scoring systems
- Optionally explore SHAP or LIME for model explainability
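
The tuning step could look like the sketch below, which uses `GridSearchCV` with ROC-AUC scoring. The model, parameter grid, and synthetic data are assumptions chosen to keep the example small; they are not the project's final settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic, imbalanced stand-in data (roughly 10% positives).
X_demo, y_demo = make_classification(
    n_samples=600, n_features=10, weights=[0.9, 0.1], random_state=42
)

# Hypothetical grid: two values per parameter, four candidates in total.
param_grid = {"max_depth": [3, 6], "min_samples_leaf": [1, 20]}

search = GridSearchCV(
    RandomForestClassifier(n_estimators=50, random_state=42),
    param_grid=param_grid,
    scoring="roc_auc",
    # Stratified folds keep the class ratio similar in every fold.
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=42),
)
search.fit(X_demo, y_demo)

print(search.best_params_, round(search.best_score_, 3))
```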

---


### Import Libraries 

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import (
    roc_auc_score,
    f1_score,
    recall_score,
    precision_score,
    confusion_matrix,
)

### Import dataset

In [2]:
train = pd.read_csv('/kaggle/input/credit-risk-dataset/train_final.csv')
test = pd.read_csv('/kaggle/input/credit-risk-dataset/test_final.csv')

In [3]:
test.columns

Index(['SK_ID_CURR', 'NAME_CONTRACT_TYPE', 'CODE_GENDER', 'FLAG_OWN_CAR',
       'FLAG_OWN_REALTY', 'CNT_CHILDREN', 'AMT_INCOME_TOTAL', 'AMT_CREDIT',
       'AMT_ANNUITY', 'AMT_GOODS_PRICE', 'NAME_TYPE_SUITE', 'NAME_INCOME_TYPE',
       'NAME_EDUCATION_TYPE', 'NAME_FAMILY_STATUS', 'NAME_HOUSING_TYPE',
       'REGION_POPULATION_RELATIVE', 'FLAG_MOBIL', 'FLAG_EMP_PHONE',
       'FLAG_WORK_PHONE', 'FLAG_CONT_MOBILE', 'FLAG_PHONE', 'FLAG_EMAIL',
       'OCCUPATION_TYPE', 'CNT_FAM_MEMBERS', 'REGION_RATING_CLIENT',
       'REGION_RATING_CLIENT_W_CITY', 'WEEKDAY_APPR_PROCESS_START',
       'HOUR_APPR_PROCESS_START', 'REG_REGION_NOT_LIVE_REGION',
       'REG_REGION_NOT_WORK_REGION', 'LIVE_REGION_NOT_WORK_REGION',
       'REG_CITY_NOT_LIVE_CITY', 'REG_CITY_NOT_WORK_CITY',
       'LIVE_CITY_NOT_WORK_CITY', 'ORGANIZATION_TYPE', 'EXT_SOURCE_2',
       'EXT_SOURCE_3', 'OBS_30_CNT_SOCIAL_CIRCLE', 'DEF_30_CNT_SOCIAL_CIRCLE',
       'OBS_60_CNT_SOCIAL_CIRCLE', 'DEF_60_CNT_SOCIAL_CIRCLE',
       'FLAG_

In [4]:
train.columns

Index(['SK_ID_CURR', 'TARGET', 'NAME_CONTRACT_TYPE', 'CODE_GENDER',
       'FLAG_OWN_CAR', 'FLAG_OWN_REALTY', 'CNT_CHILDREN', 'AMT_INCOME_TOTAL',
       'AMT_CREDIT', 'AMT_ANNUITY', 'AMT_GOODS_PRICE', 'NAME_TYPE_SUITE',
       'NAME_INCOME_TYPE', 'NAME_EDUCATION_TYPE', 'NAME_FAMILY_STATUS',
       'NAME_HOUSING_TYPE', 'REGION_POPULATION_RELATIVE', 'FLAG_MOBIL',
       'FLAG_EMP_PHONE', 'FLAG_WORK_PHONE', 'FLAG_CONT_MOBILE', 'FLAG_PHONE',
       'FLAG_EMAIL', 'OCCUPATION_TYPE', 'CNT_FAM_MEMBERS',
       'REGION_RATING_CLIENT', 'REGION_RATING_CLIENT_W_CITY',
       'WEEKDAY_APPR_PROCESS_START', 'HOUR_APPR_PROCESS_START',
       'REG_REGION_NOT_LIVE_REGION', 'REG_REGION_NOT_WORK_REGION',
       'LIVE_REGION_NOT_WORK_REGION', 'REG_CITY_NOT_LIVE_CITY',
       'REG_CITY_NOT_WORK_CITY', 'LIVE_CITY_NOT_WORK_CITY',
       'ORGANIZATION_TYPE', 'EXT_SOURCE_2', 'EXT_SOURCE_3',
       'OBS_30_CNT_SOCIAL_CIRCLE', 'DEF_30_CNT_SOCIAL_CIRCLE',
       'OBS_60_CNT_SOCIAL_CIRCLE', 'DEF_60_CNT_SOCIAL_CIRCL

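In [5]:
# Rename AGE_YEAR to AGE_YEARS in train: copy the column here, drop the old one in the next cell
train['AGE_YEARS'] = train['AGE_YEAR']

In [6]:
train.drop('AGE_YEAR', axis=1, inplace=True)

In [7]:
# Rename is_outlier_amt_income_total to is_outlier_Income in test (copy, then drop)
test['is_outlier_Income'] = test['is_outlier_amt_income_total']

In [8]:
test.drop('is_outlier_amt_income_total', axis=1, inplace=True)

In [9]:
# Rename is_outlier_amt_credit to is_outlier_Credit in test (copy, then drop)
test['is_outlier_Credit'] = test['is_outlier_amt_credit']

In [10]:
test.drop('is_outlier_amt_credit', axis=1, inplace=True)

In [11]:
# Every train column except the target
feature_cols = train.columns.drop('TARGET')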

In [12]:
test = test[feature_cols]
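
The copy-and-drop renames above can also be written with `DataFrame.rename`, followed by an explicit check that the test set has every training feature. This is a sketch on toy frames with made-up values; it assumes the only mismatches are the column names shown in the cells above.

```python
import pandas as pd

# Toy frames with the same kind of naming mismatch (values are illustrative).
train_demo = pd.DataFrame({
    "SK_ID_CURR": [1, 2],
    "TARGET": [0, 1],
    "is_outlier_Income": [0, 1],
    "AGE_YEAR": [30.5, 41.2],
})
test_demo = pd.DataFrame({
    "SK_ID_CURR": [3, 4],
    "AGE_YEARS": [25.0, 52.3],
    "is_outlier_amt_income_total": [0, 0],
})

# rename keeps the column in place instead of moving it to the end.
train_demo = train_demo.rename(columns={"AGE_YEAR": "AGE_YEARS"})
test_demo = test_demo.rename(
    columns={"is_outlier_amt_income_total": "is_outlier_Income"}
)

# Fail early if the test set lacks any training feature.
feature_cols = train_demo.columns.drop("TARGET")
missing = feature_cols.difference(test_demo.columns)
assert missing.empty, f"test is missing columns: {list(missing)}"

# Reorder test columns to match train.
test_demo = test_demo[feature_cols]
```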

In [13]:
train.shape

(304319, 81)

In [14]:
train.head()

|   | SK_ID_CURR | TARGET | NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | ... | is_outlier_amt_req_credit_bureau_day | is_outlier_amt_req_credit_bureau_week | is_outlier_amt_req_credit_bureau_mon | is_outlier_amt_req_credit_bureau_qrt | is_outlier_amt_req_credit_bureau_year | YEAR_EMPLOYED | YEAR_REGISTRATION | YEAR_ID_PUBLISH | YEAR_LAST_PHONE_CHANGE | AGE_YEARS |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100002 | 1 | 0 | 0 | 0 | 1 | 0 | 202500 | 406597.5 | 24700.5 | ... | 0 | 0 | 0 | 0 | 0 | 1.745205 | 9.994521 | 5.808219 | 3.106849 | 25.920548 |
| 1 | 100003 | 0 | 0 | 1 | 0 | 0 | 0 | 270000 | 1293502.5 | 35698.5 | ... | 0 | 0 | 0 | 0 | 0 | 3.254795 | 3.249315 | 0.79726 | 2.268493 | 45.931507 |
| 2 | 100004 | 0 | 1 | 0 | 1 | 1 | 0 | 67500 | 135000.0 | 6750.0 | ... | 0 | 0 | 0 | 0 | 0 | 0.616438 | 11.671233 | 6.934247 | 2.232877 | 52.180822 |
| 3 | 100006 | 0 | 0 | 1 | 0 | 1 | 0 | 135000 | 312682.5 | 29686.5 | ... | 0 | 0 | 1 | 0 | 0 | 8.326027 | 26.939726 | 6.676712 | 1.690411 | 52.068493 |
| 4 | 100007 | 0 | 0 | 0 | 0 | 1 | 0 | 121500 | 513000.0 | 21865.5 | ... | 0 | 0 | 0 | 0 | 0 | 8.323288 | 11.810959 | 9.473973 | 3.030137 | 54.608219 |


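In [15]:
# Features are all columns except the target; note that the ID column SK_ID_CURR stays in X
X = train.drop('TARGET', axis=1)
y = train['TARGET']

In [16]:
# Hold out 20% of the rows for validation
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)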
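
With an imbalanced target, a stratified split keeps the default rate the same in both parts. The sketch below shows the `stratify` argument of `train_test_split` on a toy target with 8% positives; it is an optional variation, not what the cell above does, and the data is made up.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy imbalanced data: 80 positives out of 1000 rows.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(1000, 3))
y_demo = np.array([1] * 80 + [0] * 920)

# stratify=y_demo preserves the class ratio in the train and test parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print("train default rate:", y_tr.mean())
print("test default rate:", y_te.mean())
```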

## Training LogisticRegression

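In [17]:
# Default settings: L2 penalty with C=1.0 and max_iter=100
lg_1 = LogisticRegression()

In [18]:
lg_1.fit(X_train, y_train)

In [19]:
# Hard class labels at the default 0.5 probability threshold
lg_1_pred = lg_1.predict(X_test)

In [20]:
# ROC-AUC is computed from the predicted probability of class 1 (default)
lg_1_roc_auc_score = roc_auc_score(y_test, lg_1.predict_proba(X_test)[:, 1])
print(f'Roc_Auc_score of lg_1: {lg_1_roc_auc_score}')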

Roc_Auc_score of lg_1: 0.603058133274806


In [21]:
lg_1_f1_score = f1_score(y_test, lg_1_pred)
print(f'F1_score of lg_1 is: {lg_1_f1_score}')

F1_score of lg_1 is: 0.0


In [22]:
lg_1_recall_score = recall_score(y_test, lg_1_pred)
print(f'recall_score of lg_1 is: {lg_1_recall_score}')

recall_score of lg_1 is: 0.0


In [23]:
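# No positives are predicted, so precision is undefined; scikit-learn reports 0.0 with a warning
lg_1_precision_score = precision_score(y_test, lg_1_pred)
print(f'precision_score of lg_1 is: {lg_1_precision_score}')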

precision_score of lg_1 is: 0.0




In [24]:
# Rows are true classes, columns are predicted classes
lg_1_confusion_matrix = confusion_matrix(y_test, lg_1_pred)
print(f'confusion_matrix of lg_1 is: {lg_1_confusion_matrix}')

confusion_matrix of lg_1 is: [[56023     0]
 [ 4841     0]]
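
The matrix above can be unpacked to see why recall and F1 are 0 while accuracy still looks high. The numbers are taken from the output above; the variable names are illustrative.

```python
import numpy as np

# Confusion matrix reported above: rows are true classes, columns are predictions.
cm = np.array([[56023, 0],
               [4841, 0]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()       # share of rows classified correctly
recall = tp / (tp + fn)               # share of true defaults that were caught
default_rate = (tp + fn) / cm.sum()   # share of true defaults in the hold-out set
predicted_positives = tp + fp         # zero here, so precision is undefined

print(f"accuracy={accuracy:.4f}, recall={recall:.4f}, default_rate={default_rate:.4f}")
print("predicted positives:", predicted_positives)
```

Because the second column is all zeros, the model never predicts a default, so every true default is a false negative.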


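- **By these metrics the logistic regression baseline performs very poorly: the confusion matrix shows that it predicts non-default for every one of the 60,864 hold-out rows, so precision, recall, and F1 are all 0 even though ROC-AUC is about 0.60. Only about 8% of these rows are defaults (4,841), so class imbalance at the default 0.5 threshold and the unscaled features likely contribute alongside the many outliers. Next I will train the machine learning algorithms that are robust to outliers and compare their performance.**
- **Logistic regression is the baseline against which I will compare the other models.**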
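
One way to probe how much of this result comes from class imbalance is to refit the baseline with feature scaling and `class_weight="balanced"` and compare recall. The sketch below does this on synthetic data; it is an assumption-laden illustration, not a result for this dataset.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with roughly 8% positives.
X_demo, y_demo = make_classification(
    n_samples=5000, n_features=10, weights=[0.92, 0.08], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

plain = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
# class_weight="balanced" weights each class inversely to its frequency.
weighted = make_pipeline(
    StandardScaler(),
    LogisticRegression(max_iter=1000, class_weight="balanced"),
)

recall_plain = recall_score(y_te, plain.fit(X_tr, y_tr).predict(X_te))
recall_weighted = recall_score(y_te, weighted.fit(X_tr, y_tr).predict(X_te))

print(f"recall plain={recall_plain:.3f}, balanced={recall_weighted:.3f}")
```

Balanced weights typically raise recall at the cost of precision, so the comparison with the tree-based models should look at ROC-AUC and F1 together.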