# 📊 Credit Risk Analysis

This project predicts credit risk: which customers are likely to repay their loans on time and which are not. The workflow runs from exploratory analysis through preprocessing and modeling to testing on unseen data.

## 🎯 Objectives

We will cover the following concepts:

1. **🔍 Exploratory Data Analysis (EDA)**:
   - Analyze the dataset to uncover patterns, spot anomalies, and check assumptions using statistical summaries and visualizations.

2. **🛠️ Data Preprocessing**:
   - Clean the data, handle missing values, encode categorical variables, normalize data, and split into training and testing sets.

3. **⭐ Feature Importance**:
   - Identify key features influencing `loan_status` to improve model performance and interpretability.

4. **🔽 Dimensionality Reduction**:
   - Use techniques like PCA to reduce the number of features while retaining essential information, speeding up model training, and reducing overfitting.

5. **🤖 Predictive Modeling**:
   - Build various models (logistic regression, decision trees, random forests, gradient boosting) to predict the target variable `loan_status`.

6. **⚙️ Hyperparameter Optimization**:
   - Perform hyperparameter tuning with Optuna to find the best parameters for our models.

7. **🧪 Model Testing**:
   - Evaluate models using metrics like accuracy, precision, recall, F1 score, and AUC-ROC to test performance on unseen data.

## 🎯 Goal

Predict the `loan_status` to determine the likelihood of customers paying their loans on time. This helps financial institutions make informed loan approval decisions and manage risk effectively.


# 📖 Data Dictionary

| Column Name            | Description                                                                                           |
|------------------------|-------------------------------------------------------------------------------------------------------|
| out_prncp_inv          | Remaining outstanding principal for portion of total amount funded by investors                       |
| policy_code            | Publicly available policy_code=1. <br> New products not publicly available policy_code=2                   |
| pub_rec                | Number of derogatory public records                                                                   |
| purpose                | A category provided by the borrower for the loan request.                                             |
| pymnt_plan             | Indicates if a payment plan has been put in place for the loan                                        |
| recoveries             | Post charge off gross recovery                                                                        |
| revol_bal              | Total credit revolving balance                                                                        |
| revol_util             | Revolving line utilization rate, or the amount of credit the borrower is using relative to all available revolving credit. |
| sub_grade              | LC assigned loan subgrade                                                                             |
| term                   | The number of payments on the loan. Values are in months and can be either 36 or 60.                  |
| title                  | The loan title provided by the borrower                                                               |
| total_acc              | The total number of credit lines currently in the borrower's credit file                              |
| total_pymnt            | Payments received to date for total amount funded                                                     |
| total_pymnt_inv        | Payments received to date for portion of total amount funded by investors                             |
| total_rec_int          | Interest received to date                                                                             |
| total_rec_late_fee     | Late fees received to date                                                                            |
| total_rec_prncp        | Principal received to date                                                                            |
| url                    | URL for the LC page with listing data.                                                                |
| verified_status_joint  | Indicates if the co-borrowers' joint income was verified by LC, not verified, or if the income source was verified |
| zip_code               | The first 3 numbers of the zip code provided by the borrower in the loan application.                 |
| open_acc_6m            | Number of open trades in last 6 months                                                                |
| open_il_6m             | Number of currently active installment trades                                                         |
| open_il_12m            | Number of installment accounts opened in past 12 months                                               |
| open_il_24m            | Number of installment accounts opened in past 24 months                                               |
| mths_since_rcnt_il     | Months since most recent installment accounts opened                                                  |
| total_bal_il           | Total current balance of all installment accounts                                                     |
| il_util                | Ratio of total current balance to high credit/credit limit on all install acct                        |
| open_rv_12m            | Number of revolving trades opened in past 12 months                                                   |
| open_rv_24m            | Number of revolving trades opened in past 24 months                                                   |
| max_bal_bc             | Maximum current balance owed on all revolving accounts                                                |
| all_util               | Balance to credit limit on all trades                                                                 |
| total_rev_hi_lim       | Total revolving high credit/credit limit                                                              |
| inq_fi                 | Number of personal finance inquiries                                                                  |
| total_cu_tl            | Number of finance trades                                                                              |
| inq_last_12m           | Number of credit inquiries in past 12 months                                                          |
| acc_now_delinq         | The number of accounts on which the borrower is now delinquent.                                       |
| tot_coll_amt           | Total collection amounts ever owed                                                                    |
| tot_cur_bal            | Total current balance of all accounts                                                                 |

\* Employer Title replaces Employer Name for all loans listed after 9/23/2013.



## Some important features

| Column      | Description                                  |
|-------------|----------------------------------------------|
| loan_amnt   | Amount of money requested by the borrower.   |
| int_rate    | Interest rate of the loan.                   |
| grade       | Loan grade with categories A, B, C, D, E, F, G. |
| annual_inc  | Borrower's annual income.                    |
| purpose     | The primary purpose of borrowing.            |
| installment | Monthly payment owed by the borrower. |
| term        | Duration of the loan until it’s paid off.    |


# 🚀 Importing libraries and getting started

In [1]:
import pandas as pd
import numpy as np

import plotly.express as px
from sklearn import metrics
from sklearn.impute import KNNImputer, SimpleImputer, MissingIndicator

In [2]:
data = pd.read_csv('C:/Users/saran/OneDrive/Documents/GitHub/credit-risk/data/loan.csv', low_memory=False)

In [3]:
print(data.shape)
print(data.shape[0] * data.shape[1])

(887379, 74)
65666046


### We have about 887K rows and 74 columns, which amounts to about 65.7 million data points.

In [4]:
data['loan_status'].unique() # Understanding the target variable

array(['Fully Paid', 'Charged Off', 'Current', 'Default',
       'Late (31-120 days)', 'In Grace Period', 'Late (16-30 days)',
       'Does not meet the credit policy. Status:Fully Paid',
       'Does not meet the credit policy. Status:Charged Off', 'Issued'],
      dtype=object)

### Here is what the terms in the target variable mean
| Term                                               | Meaning                                                                                                                                                 |
|----------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------|
| **Fully Paid**                                     | The borrower has completely repaid the loan.                                                                                                             |
| **Charged Off**                                    | The lender has given up on collecting the loan because the borrower hasn’t paid for a long time. The loan is considered a loss.                          |
| **Current**                                        | The borrower is making payments on time.                                                                                                                 |
| **Default**                                        | The borrower has stopped making payments for a long period, and the loan is in serious trouble.                                                          |
| **Late (31-120 days)**                             | The borrower has missed payments and is behind by 31 to 120 days.                                                                                        |
| **In Grace Period**                                | The borrower missed a payment, but the late fee hasn’t been applied yet because it’s within an allowed period after the due date.                        |
| **Late (16-30 days)**                              | The borrower is behind on payments by 16 to 30 days.                                                                                                     |
| **Does not meet the credit policy. Status:Fully Paid** | The loan didn’t meet the lender's usual criteria but was still issued, and it has been completely repaid.                                                |
| **Does not meet the credit policy. Status:Charged Off** | The loan didn’t meet the lender's usual criteria but was still issued, and it ended in a loss because the borrower didn’t repay.                         |
| **Issued**                                         | The loan has been approved and the money has been given to the borrower.                                                                                 |
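
The ten statuses above have to be collapsed into two classes before a binary classifier can be trained. The notebook has not fixed a mapping at this point, so the grouping below is only an assumption for illustration: `BAD_STATUSES`, `UNRESOLVED_STATUSES`, and `to_binary_target` are hypothetical names, and treating late loans as "bad" or dropping loans that are still running is one possible choice among several.

```python
import pandas as pd

# Hypothetical grouping: which statuses count as "bad" is an assumption,
# not something the notebook has decided at this point.
BAD_STATUSES = {
    "Charged Off",
    "Default",
    "Late (31-120 days)",
    "Late (16-30 days)",
    "Does not meet the credit policy. Status:Charged Off",
}
# Statuses whose final outcome is not yet known (also an assumption).
UNRESOLVED_STATUSES = {"Current", "Issued", "In Grace Period"}


def to_binary_target(status: pd.Series) -> pd.Series:
    """Return 1 for bad loans and 0 for good loans, dropping unresolved ones."""
    resolved = status[~status.isin(UNRESOLVED_STATUSES)]
    return resolved.isin(BAD_STATUSES).astype(int)


toy_status = pd.Series(
    ["Fully Paid", "Charged Off", "Current", "Late (16-30 days)", "Issued"]
)
toy_target = to_binary_target(toy_status)
print(toy_target.tolist())
```

The returned series keeps the original index, so it can be used to select the matching rows of the feature matrix.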


In [5]:
for i in data.columns[2:]:
    print(i)
    print(data[i].head(), '\n')

loan_amnt
0     5000.0
1     2500.0
2     2400.0
3    10000.0
4     3000.0
Name: loan_amnt, dtype: float64 

funded_amnt
0     5000.0
1     2500.0
2     2400.0
3    10000.0
4     3000.0
Name: funded_amnt, dtype: float64 

funded_amnt_inv
0     4975.0
1     2500.0
2     2400.0
3    10000.0
4     3000.0
Name: funded_amnt_inv, dtype: float64 

term
0     36 months
1     60 months
2     36 months
3     36 months
4     60 months
Name: term, dtype: object 

int_rate
0    10.65
1    15.27
2    15.96
3    13.49
4    12.69
Name: int_rate, dtype: float64 

installment
0    162.87
1     59.83
2     84.33
3    339.31
4     67.79
Name: installment, dtype: float64 

grade
0    B
1    C
2    C
3    C
4    B
Name: grade, dtype: object 

sub_grade
0    B2
1    C4
2    C5
3    C1
4    B5
Name: sub_grade, dtype: object 

emp_title
0                         NaN
1                       Ryder
2                         NaN
3         AIR RESOURCES BOARD
4    University Medical Group
Name: emp_title, dtype: object
...

In [6]:
y = data['loan_status']

In [7]:
y.isna().sum()

0

No null values in target column.

In [8]:
X = data.drop(columns='loan_status')

In [9]:
X.shape

(887379, 73)

In [10]:
print('Number of null data points:', X.isna().sum().sum())
print('Percentage of null data:', round( (X.isna().sum().sum() / (X.shape[0]*X.shape[1]) ) * 100, 2), '%')    

Number of null data points: 17998493
Percentage of null data: 27.78 %


In [11]:
for i in X.columns:
    null = ( X[i].isna().sum() / len(X[i]) ) * 100
    print(f'Percentage null values in {i}: {round(null, 2)}%')

Percentage null values in id: 0.0%
Percentage null values in member_id: 0.0%
Percentage null values in loan_amnt: 0.0%
Percentage null values in funded_amnt: 0.0%
Percentage null values in funded_amnt_inv: 0.0%
Percentage null values in term: 0.0%
Percentage null values in int_rate: 0.0%
Percentage null values in installment: 0.0%
Percentage null values in grade: 0.0%
Percentage null values in sub_grade: 0.0%
Percentage null values in emp_title: 5.8%
Percentage null values in emp_length: 5.05%
Percentage null values in home_ownership: 0.0%
Percentage null values in annual_inc: 0.0%
Percentage null values in verification_status: 0.0%
Percentage null values in issue_d: 0.0%
Percentage null values in pymnt_plan: 0.0%
Percentage null values in url: 0.0%
Percentage null values in desc: 85.8%
Percentage null values in purpose: 0.0%
Percentage null values in title: 0.02%
Percentage null values in zip_code: 0.0%
Percentage null values in addr_state: 0.0%
Percentage null values in dti: 0.0%
...

Many features have a high percentage of null values.<br /> We will drop features with more than 97% null values.<br /> Features below that threshold that still have many nulls will be handled separately.
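
The per-column null percentage and the 97% cut-off can also be computed without an explicit loop. The sketch below uses a toy frame with made-up column names in place of `X`; only the threshold comes from the text above.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for X; the column names here are made up.
toy = pd.DataFrame(
    {
        "mostly_missing": [np.nan] * 99 + [1.0],   # 99% null
        "half_missing": [np.nan, 1.0] * 50,        # 50% null
        "complete": np.arange(100, dtype=float),   # 0% null
    }
)

null_pct = toy.isna().mean() * 100          # percentage of nulls per column
to_drop = null_pct[null_pct > 97].index     # same 97% threshold as the text
trimmed = toy.drop(columns=to_drop)

print(null_pct.round(2).to_dict())
print(list(to_drop), trimmed.shape)
```

`isna().mean()` works because the boolean mask averages to the fraction of missing entries in each column.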

In [12]:
dropped_list = []  # list to track dropped features
for i in list(X.columns):  # iterate over a snapshot, since X is modified inside the loop
    null = X[i].isna().mean() * 100  # percentage of null values in this column
    if null > 97:
        print(f'Percentage null values in {i}: {round(null, 2)}% -- FEATURE DROPPED')
        dropped_list.append(i)
        X.drop(columns=i, inplace=True)
    else:  # also covers exactly 97%, which the original two `if` checks skipped
        print(f'Percentage null values in {i}: {round(null, 2)}%')

print('Total features dropped: ', len(dropped_list))

Percentage null values in id: 0.0%
Percentage null values in member_id: 0.0%
Percentage null values in loan_amnt: 0.0%
Percentage null values in funded_amnt: 0.0%
Percentage null values in funded_amnt_inv: 0.0%
Percentage null values in term: 0.0%
Percentage null values in int_rate: 0.0%
Percentage null values in installment: 0.0%
Percentage null values in grade: 0.0%
Percentage null values in sub_grade: 0.0%
Percentage null values in emp_title: 5.8%
Percentage null values in emp_length: 5.05%
Percentage null values in home_ownership: 0.0%
Percentage null values in annual_inc: 0.0%
Percentage null values in verification_status: 0.0%
Percentage null values in issue_d: 0.0%
Percentage null values in pymnt_plan: 0.0%
Percentage null values in url: 0.0%
Percentage null values in desc: 85.8%
Percentage null values in purpose: 0.0%
Percentage null values in title: 0.02%
Percentage null values in zip_code: 0.0%
Percentage null values in addr_state: 0.0%
Percentage null values in dti: 0.0%
...

In [13]:
X.shape

(887379, 56)

17 features have been dropped.

In [14]:
for i in X.columns:
    null = X[i].isna().mean() * 100  # percentage of null values in this column
    if null > 50:
        print(f'Percentage null values in {i}: {round(null, 2)}%')

Percentage null values in desc: 85.8%
Percentage null values in mths_since_last_delinq: 51.2%
Percentage null values in mths_since_last_record: 84.56%
Percentage null values in mths_since_last_major_derog: 75.02%


We will drop the `desc` (description) feature and impute the other three features with 0.

In [15]:
X.drop(columns = 'desc', inplace = True)
dropped_list.append('desc')

In [16]:
num_cols = X.select_dtypes(include=['int64', 'float64']).columns
cat_cols = X.select_dtypes(exclude=['int64', 'float64']).columns
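
The split above names `int64` and `float64` explicitly, so a numeric column stored with another width (for example `int32`) would end up in the categorical group. As a minimal sketch, assuming that is not the intended behavior, `include='number'` selects every numeric dtype. The toy frame and the `term_months` column are hypothetical.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame(
    {
        "loan_amnt": np.array([5000, 2500], dtype="float64"),
        "term_months": np.array([36, 60], dtype="int32"),  # hypothetical column
        "grade": ["B", "C"],
    }
)

# Explicit dtype list, as in the cell above: misses the int32 column.
explicit = toy.select_dtypes(include=["int64", "float64"]).columns.tolist()
# Generic selector: matches any numeric dtype.
generic = toy.select_dtypes(include="number").columns.tolist()
categorical = toy.select_dtypes(exclude="number").columns.tolist()

print(explicit, generic, categorical)
```

With `loan.csv` read by `pd.read_csv`, numeric columns normally arrive as `int64` or `float64`, so both selectors should agree there; the difference matters only after downcasting.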

In [40]:
print('NUMERICAL COLUMNS\n')
count_num = 0
count_cat = 0
for i in num_cols:
    null = ( X[i].isna().sum() / len(X[i]) ) * 100
    if null > 0:
        count_num = count_num + 1
        print(f'Percentage null values in {i}: {round(null, 2)}%')
        print(f'Total null values in {i}: {X[i].isna().sum()}\n')
    

print('\nCATEGORICAL COLUMNS\n')
for i in cat_cols:
    null = ( X[i].isna().sum() / len(X[i]) ) * 100
    if null > 0:
        count_cat = count_cat + 1
        print(f'Percentage null values in {i}: {round(null, 2)}%')
        print(f'Total null values in {i}: {X[i].isna().sum()}\n')
    
print(f'Total numeric null features: {count_num}')
print(f'Total categorical null features: {count_cat}')

NUMERICAL COLUMNS

Percentage null values in annual_inc: 0.0%
Total null values in annual_inc: 4

Percentage null values in delinq_2yrs: 0.0%
Total null values in delinq_2yrs: 29

Percentage null values in inq_last_6mths: 0.0%
Total null values in inq_last_6mths: 29

Percentage null values in mths_since_last_delinq: 51.2%
Total null values in mths_since_last_delinq: 454312

Percentage null values in mths_since_last_record: 84.56%
Total null values in mths_since_last_record: 750326

Percentage null values in open_acc: 0.0%
Total null values in open_acc: 29

Percentage null values in pub_rec: 0.0%
Total null values in pub_rec: 29

Percentage null values in revol_util: 0.06%
Total null values in revol_util: 502

Percentage null values in total_acc: 0.0%
Total null values in total_acc: 29

Percentage null values in collections_12_mths_ex_med: 0.02%
Total null values in collections_12_mths_ex_med: 145

Percentage null values in mths_since_last_major_derog: 75.02%
...

In [18]:
X.drop(columns = ['id', 'member_id'], inplace = True)
dropped_list.append('id')
dropped_list.append('member_id')
num_cols = X.select_dtypes(include=['int64', 'float64']).columns

In [19]:
num_cols

Index(['loan_amnt', 'funded_amnt', 'funded_amnt_inv', 'int_rate',
       'installment', 'annual_inc', 'dti', 'delinq_2yrs', 'inq_last_6mths',
       'mths_since_last_delinq', 'mths_since_last_record', 'open_acc',
       'pub_rec', 'revol_bal', 'revol_util', 'total_acc', 'out_prncp',
       'out_prncp_inv', 'total_pymnt', 'total_pymnt_inv', 'total_rec_prncp',
       'total_rec_int', 'total_rec_late_fee', 'recoveries',
       'collection_recovery_fee', 'last_pymnt_amnt',
       'collections_12_mths_ex_med', 'mths_since_last_major_derog',
       'policy_code', 'acc_now_delinq', 'tot_coll_amt', 'tot_cur_bal',
       'total_rev_hi_lim'],
      dtype='object')

In [20]:
knn_imp = KNNImputer(add_indicator = True)
simp_imp = SimpleImputer(strategy = 'constant',
                         fill_value = 0.0)
mi = MissingIndicator()

In [29]:
new_data = X  # note: this is a reference, not a copy, so changes to new_data also modify X

In [35]:
for i in num_cols: # Adding a missing indicator for each numerical column with nan values
    if new_data[i].isna().sum() > 0:
        temp = mi.fit_transform(new_data[[i]])
        loc = new_data.columns.get_loc(i)
        new_data.insert(loc+1, value = temp, column = f'{i}_MISSING_INDICATOR')

In [37]:
new_data.shape

(887379, 68)

We have added 15 missing indicators, one for each of the 15 numerical columns with missing values.
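
To make the indicator columns concrete, here is a small sketch of what `MissingIndicator` returns for a single column. The values are toy data; only the `_MISSING_INDICATOR` suffix follows the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.impute import MissingIndicator

toy = pd.DataFrame({"revol_util": [83.7, np.nan, 98.5, np.nan]})

indicator = MissingIndicator()  # default: flag only features that contain NaN
flags = indicator.fit_transform(toy[["revol_util"]])  # boolean array, shape (4, 1)

# Store the flags next to the original column, as the notebook does.
toy["revol_util_MISSING_INDICATOR"] = flags.ravel()
print(toy)
```

Keeping the flag lets a model distinguish a genuine 0 from a value that was filled in later.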

In [43]:
new_data[num_cols].isna().sum()

loan_amnt                           0
funded_amnt                         0
funded_amnt_inv                     0
int_rate                            0
installment                         0
annual_inc                          4
dti                                 0
delinq_2yrs                        29
inq_last_6mths                     29
mths_since_last_delinq         454312
mths_since_last_record         750326
open_acc                           29
pub_rec                            29
revol_bal                           0
revol_util                        502
total_acc                          29
out_prncp                           0
out_prncp_inv                       0
total_pymnt                         0
total_pymnt_inv                     0
total_rec_prncp                     0
total_rec_int                       0
total_rec_late_fee                  0
recoveries                          0
collection_recovery_fee             0
last_pymnt_amnt                     0
...

Now we impute missing numerical values.

In [44]:
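# knn_imp was created with add_indicator=True, so fit_transform returns the imputed
# values followed by an indicator column; keep only the imputed values (column 0).
new_data['annual_inc'] = knn_imp.fit_transform(new_data[['annual_inc']])[:, 0]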

In [45]:
new_data['annual_inc'].isna().sum()

0
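
The cell above passes `annual_inc` to the imputer on its own. `KNNImputer` finds neighbors using the other features of each row, so it has more to work with when several columns are given together. The sketch below illustrates that on toy numbers; the column names are borrowed from the dataset for readability, and `n_neighbors=2` is an arbitrary choice.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Toy values; the column names are borrowed from the dataset for readability.
toy = pd.DataFrame(
    {
        "annual_inc": [1.0, 3.0, np.nan, 8.0],
        "loan_amnt": [2.0, 4.0, 6.0, 8.0],
        "installment": [np.nan, 3.0, 5.0, 7.0],
    }
)

imputer = KNNImputer(n_neighbors=2)  # average the 2 nearest rows
filled = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns)
print(filled)
```

The missing `annual_inc` in row 2 becomes 5.5, the mean of its two nearest rows (3.0 and 8.0), where distance is measured on the features both rows have observed.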

In [79]:
for i in num_cols:  # fill the remaining numerical NaNs with the constant 0.0
    if new_data[i].isna().sum() > 0:
        new_data[i] = simp_imp.fit_transform(new_data[[i]]).ravel()

In [81]:
new_data.head()

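|   | loan_amnt | funded_amnt | funded_amnt_inv | term | int_rate | installment | grade | sub_grade | emp_title | emp_length | ... | policy_code | application_type | acc_now_delinq | acc_now_delinq_MISSING_INDICATOR | tot_coll_amt | tot_coll_amt_MISSING_INDICATOR | tot_cur_bal | tot_cur_bal_MISSING_INDICATOR | total_rev_hi_lim | total_rev_hi_lim_MISSING_INDICATOR |
|---|-----------|-------------|-----------------|------|----------|-------------|-------|-----------|-----------|------------|-----|-------------|------------------|----------------|----------------------------------|--------------|--------------------------------|-------------|-------------------------------|------------------|------------------------------------|
| 0 | 5000.0 | 5000.0 | 4975.0 | 36 months | 10.65 | 162.87 | B | B2 | NaN | 10+ years | ... | 1.0 | INDIVIDUAL | 0.0 | False | 0.0 | True | 0.0 | True | 0.0 | True |
| 1 | 2500.0 | 2500.0 | 2500.0 | 60 months | 15.27 | 59.83 | C | C4 | Ryder | < 1 year | ... | 1.0 | INDIVIDUAL | 0.0 | False | 0.0 | True | 0.0 | True | 0.0 | True |
| 2 | 2400.0 | 2400.0 | 2400.0 | 36 months | 15.96 | 84.33 | C | C5 | NaN | 10+ years | ... | 1.0 | INDIVIDUAL | 0.0 | False | 0.0 | True | 0.0 | True | 0.0 | True |
| 3 | 10000.0 | 10000.0 | 10000.0 | 36 months | 13.49 | 339.31 | C | C1 | AIR RESOURCES BOARD | 10+ years | ... | 1.0 | INDIVIDUAL | 0.0 | False | 0.0 | True | 0.0 | True | 0.0 | True |
| 4 | 3000.0 | 3000.0 | 3000.0 | 60 months | 12.69 | 67.79 | B | B5 | University Medical Group | 1 year | ... | 1.0 | INDIVIDUAL | 0.0 | False | 0.0 | True | 0.0 | True | 0.0 | True |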


In [82]:
new_data[num_cols].isna().sum()

loan_amnt                      0
funded_amnt                    0
funded_amnt_inv                0
int_rate                       0
installment                    0
annual_inc                     0
dti                            0
delinq_2yrs                    0
inq_last_6mths                 0
mths_since_last_delinq         0
mths_since_last_record         0
open_acc                       0
pub_rec                        0
revol_bal                      0
revol_util                     0
total_acc                      0
out_prncp                      0
out_prncp_inv                  0
total_pymnt                    0
total_pymnt_inv                0
total_rec_prncp                0
total_rec_int                  0
total_rec_late_fee             0
recoveries                     0
collection_recovery_fee        0
last_pymnt_amnt                0
collections_12_mths_ex_med     0
mths_since_last_major_derog    0
policy_code                    0
acc_now_delinq                 0
...

We have imputed missing values for all numeric columns.
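
The two steps used here, adding a missing indicator and then filling with 0, can also be done in one call through the `add_indicator` parameter of `SimpleImputer`. This is a sketch of that alternative on toy values, not the approach the notebook takes.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

toy = pd.DataFrame(
    {
        "mths_since_last_delinq": [35.0, np.nan, np.nan, 12.0],
        "loan_amnt": [5000.0, 2500.0, 2400.0, 10000.0],
    }
)

imputer = SimpleImputer(strategy="constant", fill_value=0.0, add_indicator=True)
out = imputer.fit_transform(toy)

# Output columns: the two imputed features, then one indicator column
# for each feature that contained NaN during fit (here only the first).
print(out)
```

The indicator column comes back as 0.0/1.0 floats because it is stacked with the numeric features into one array.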