## Problem Statement

Develop a machine learning model that predicts the likelihood of a borrower defaulting on a loan based on factors such as credit history, repayment capacity, and annual income. This model aims to assist financial institutions in assessing the potential financial impact of credit risk and making informed lending decisions.

## Credit Risk Prediction

Credit Risk refers to the likelihood of a borrower failing to repay a loan, leading to potential financial losses for the lender. When financial institutions extend services like mortgages, credit cards, or personal loans, there exists an inherent risk that the borrower may default on their repayment obligations. To evaluate this risk, factors like credit history, repayment capacity, loan terms, and annual income can be considered.

Many companies, especially financial institutions, evaluate the credit risk of their existing and forthcoming customers. With the advent of technologies like machine learning, organizations can analyze customer data to establish a risk profile. Credit risk modeling evaluates a borrower's credit risk based primarily on two factors. The first factor is determining the probability of a borrower defaulting on a loan, while the second factor involves evaluating the financial impact on the lender in case of such a default.
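
The two factors above are often combined into a single expected-loss figure. The following is a minimal sketch, assuming the common decomposition of expected loss into probability of default, loss given default, and exposure at default. The function name `expected_loss` and all numbers are illustrative assumptions and do not come from the dataset.

```python
def expected_loss(prob_default: float, loss_given_default: float, exposure: float) -> float:
    """Expected loss = probability of default x loss given default x exposure."""
    if not 0.0 <= prob_default <= 1.0:
        raise ValueError("prob_default must be between 0 and 1")
    if not 0.0 <= loss_given_default <= 1.0:
        raise ValueError("loss_given_default must be between 0 and 1")
    return prob_default * loss_given_default * exposure


# Hypothetical loan: 10,000 outstanding, 20% chance of default,
# and 60% of the balance lost if the borrower does default.
loss = expected_loss(prob_default=0.20, loss_given_default=0.60, exposure=10_000)
print(round(loss, 2))  # 1200.0
```

The model built in this notebook addresses only the first factor (whether a borrower defaults); the other two inputs would have to come from elsewhere.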

https://www.kaggle.com/code/samudra89/01-credit-risk-modeling

https://www.kaggle.com/datasets/laotse/credit-risk-dataset?resource=download

## Dataset Description

The dataset you'll be working with is the Credit Risk dataset, which includes the following features:

* **person_age** - Age
* **person_income** - Annual Income
* **person_home_ownership** - Home ownership
* **person_emp_length** - Employment length (in years)
* **loan_intent** - Loan intent
* **loan_grade** - Loan grade
* **loan_amnt** - Loan amount
* **loan_int_rate** - Interest rate
* **loan_status** - Loan status (0 = non-default, 1 = default)
* **loan_percent_income** - Percent income
* **cb_person_default_on_file** - Historical default
* **cb_person_cred_hist_length** - Credit history length


In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
# Download dataset "credit_risk_dataset.csv"

!gdown https://drive.google.com/uc?id=12IEvK2qRdxgSxPGVTW_5KOQtOdPLV0iQ

Downloading...
From: https://drive.google.com/uc?id=12IEvK2qRdxgSxPGVTW_5KOQtOdPLV0iQ
To: /content/credit_risk_dataset.csv
100% 1.80M/1.80M [00:00<00:00, 19.1MB/s]


In [None]:
df = pd.read_csv("credit_risk_dataset.csv")
df.shape

(32581, 12)

In [None]:
df.head()

Unnamed: 0,person_age,person_income,person_home_ownership,person_emp_length,loan_intent,loan_grade,loan_amnt,loan_int_rate,loan_status,loan_percent_income,cb_person_default_on_file,cb_person_cred_hist_length
0,22,59000,RENT,123.0,PERSONAL,D,35000,16.02,1,0.59,Y,3
1,21,9600,OWN,5.0,EDUCATION,B,1000,11.14,0,0.1,N,2
2,25,9600,MORTGAGE,1.0,MEDICAL,C,5500,12.87,1,0.57,N,3
3,23,65500,RENT,4.0,MEDICAL,C,35000,15.23,1,0.53,N,2
4,24,54400,RENT,8.0,MEDICAL,C,35000,14.27,1,0.55,Y,4


In [None]:
df.describe()

Unnamed: 0,person_age,person_income,person_emp_length,loan_amnt,loan_int_rate,loan_status,loan_percent_income,cb_person_cred_hist_length
count,32581.0,32581.0,31686.0,32581.0,29465.0,32581.0,32581.0,32581.0
mean,27.7346,66074.85,4.789686,9589.371106,11.011695,0.218164,0.170203,5.804211
std,6.348078,61983.12,4.14263,6322.086646,3.240459,0.413006,0.106782,4.055001
min,20.0,4000.0,0.0,500.0,5.42,0.0,0.0,2.0
25%,23.0,38500.0,2.0,5000.0,7.9,0.0,0.09,3.0
50%,26.0,55000.0,4.0,8000.0,10.99,0.0,0.15,4.0
75%,30.0,79200.0,7.0,12200.0,13.47,0.0,0.23,8.0
max,144.0,6000000.0,123.0,35000.0,23.22,1.0,0.83,30.0
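
The summary shows a maximum `person_age` of 144 and a maximum `person_emp_length` of 123 years, which look like data-entry errors. One possible way to drop such rows is sketched below on a small stand-in frame. The limits `MAX_AGE` and `MAX_EMP_LENGTH` are assumptions chosen for illustration, not values used elsewhere in this notebook.

```python
import pandas as pd

# Small stand-in frame; the real notebook would use df instead
toy = pd.DataFrame({
    "person_age": [22, 21, 144, 30],
    "person_emp_length": [123.0, 5.0, 4.0, 7.0],
})

# Assumed plausibility limits; adjust them to the lending context
MAX_AGE = 100
MAX_EMP_LENGTH = 60

# Keep rows with a plausible age and a plausible (or missing) employment length
plausible = (toy["person_age"] <= MAX_AGE) & (
    toy["person_emp_length"].isna() | (toy["person_emp_length"] <= MAX_EMP_LENGTH)
)
toy_clean = toy[plausible]

print(len(toy), "->", len(toy_clean))  # 4 -> 2
```

Missing employment lengths are kept on purpose so that they can still be imputed later.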


In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32581 entries, 0 to 32580
Data columns (total 12 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   person_age                  32581 non-null  int64  
 1   person_income               32581 non-null  int64  
 2   person_home_ownership       32581 non-null  object 
 3   person_emp_length           31686 non-null  float64
 4   loan_intent                 32581 non-null  object 
 5   loan_grade                  32581 non-null  object 
 6   loan_amnt                   32581 non-null  int64  
 7   loan_int_rate               29465 non-null  float64
 8   loan_status                 32581 non-null  int64  
 9   loan_percent_income         32581 non-null  float64
 10  cb_person_default_on_file   32581 non-null  object 
 11  cb_person_cred_hist_length  32581 non-null  int64  
dtypes: float64(3), int64(5), object(4)
memory usage: 3.0+ MB


In [None]:
df.isna().sum()

Unnamed: 0,0
person_age,0
person_income,0
person_home_ownership,0
person_emp_length,895
loan_intent,0
loan_grade,0
loan_amnt,0
loan_int_rate,3116
loan_status,0
loan_percent_income,0


In [None]:
df['loan_int_rate'].mean()

11.011694892245036

In [None]:
# Handle missing values
df['loan_int_rate'] = df['loan_int_rate'].fillna(df['loan_int_rate'].mean())
df['person_emp_length'] = df['person_emp_length'].fillna(df['person_emp_length'].mean())
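
Filling with the mean of the full dataset lets information from the future test rows leak into training, and the mean is also pulled up by extreme values such as an employment length of 123. An alternative is sketched below on a small synthetic frame: split first, then fit scikit-learn's `SimpleImputer` with the median on the training part only. The frame `toy` and its values are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the two columns that contain missing values
toy = pd.DataFrame({
    "loan_int_rate": [7.9, np.nan, 10.99, 13.47, np.nan, 11.14, 16.02, 12.87],
    "person_emp_length": [2.0, 4.0, np.nan, 7.0, 123.0, 5.0, 1.0, 8.0],
})

toy_train, toy_test = train_test_split(toy, test_size=0.25, random_state=42)

# Learn the medians from the training rows only
imputer = SimpleImputer(strategy="median")
imputer.fit(toy_train)

# Apply the same medians to both parts
toy_train_filled = pd.DataFrame(
    imputer.transform(toy_train), columns=toy.columns, index=toy_train.index
)
toy_test_filled = pd.DataFrame(
    imputer.transform(toy_test), columns=toy.columns, index=toy_test.index
)

print(toy_train_filled.isna().sum().sum(), toy_test_filled.isna().sum().sum())
```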

In [None]:
df.isna().sum()

Unnamed: 0,0
person_age,0
person_income,0
person_home_ownership,0
person_emp_length,0
loan_intent,0
loan_grade,0
loan_amnt,0
loan_int_rate,0
loan_status,0
loan_percent_income,0


In [None]:
# Check categorical values
categorical_cols = df.select_dtypes(include='object').columns
categorical_cols

Index(['person_home_ownership', 'loan_intent', 'loan_grade',
       'cb_person_default_on_file'],
      dtype='object')

In [None]:
for col in categorical_cols:
    print(f"{col} ---> {df[col].unique()}")

person_home_ownership ---> ['RENT' 'OWN' 'MORTGAGE' 'OTHER']
loan_intent ---> ['PERSONAL' 'EDUCATION' 'MEDICAL' 'VENTURE' 'HOMEIMPROVEMENT'
 'DEBTCONSOLIDATION']
loan_grade ---> ['D' 'B' 'C' 'A' 'E' 'F' 'G']
cb_person_default_on_file ---> ['Y' 'N']


In [None]:
# Handle categorical columns
home_ownership_mapping = {'MORTGAGE': 0, 'RENT': 1, 'OWN': 2, 'OTHER': 3}
loan_grade_mapping = {'A': 0, 'B': 1, 'C': 2, 'D': 3, 'E': 4, 'F': 5, 'G': 6}
default_on_file_mapping = {'N': 0, 'Y': 1}

In [None]:
home_ownership_mapping['RENT']

1

In [None]:
def apply_home_ownership_mapping(value):
    return home_ownership_mapping[value]

apply_home_ownership_mapping('RENT')

1

In [None]:
#df['person_home_ownership'] = df['person_home_ownership'].apply(apply_home_ownership_mapping)

In [None]:
df['person_home_ownership'] = df['person_home_ownership'].map(home_ownership_mapping)
df['loan_grade'] = df['loan_grade'].map(loan_grade_mapping)
df['cb_person_default_on_file'] = df['cb_person_default_on_file'].map(default_on_file_mapping)
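
`Series.map` with a dictionary returns `NaN` for any value that is not a key, so a category missing from the mapping would silently become a missing value. A quick check is sketched below; the category `"LEASE"` is made up to trigger the case.

```python
import pandas as pd

home_ownership_mapping = {'MORTGAGE': 0, 'RENT': 1, 'OWN': 2, 'OTHER': 3}

# "LEASE" is a hypothetical category that the mapping does not cover
toy = pd.Series(["RENT", "OWN", "LEASE"])
encoded = toy.map(home_ownership_mapping)

# List the raw values that could not be mapped
unmapped = toy[encoded.isna()].unique().tolist()
print(unmapped)  # ['LEASE']
```

Running the same check on each mapped column right after mapping would catch unexpected categories early.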

In [None]:
df.head()

Unnamed: 0,person_age,person_income,person_home_ownership,person_emp_length,loan_intent,loan_grade,loan_amnt,loan_int_rate,loan_status,loan_percent_income,cb_person_default_on_file,cb_person_cred_hist_length
0,22,59000,1,123.0,PERSONAL,3,35000,16.02,1,0.59,1,3
1,21,9600,2,5.0,EDUCATION,1,1000,11.14,0,0.1,0,2
2,25,9600,0,1.0,MEDICAL,2,5500,12.87,1,0.57,0,3
3,23,65500,1,4.0,MEDICAL,2,35000,15.23,1,0.53,0,2
4,24,54400,1,8.0,MEDICAL,2,35000,14.27,1,0.55,1,4


In [None]:
# Label Encoder
from sklearn.preprocessing import LabelEncoder

loan_intent_encoder = LabelEncoder()

loan_intent_encoder.fit(df['loan_intent'])

df['loan_intent'] = loan_intent_encoder.transform(df['loan_intent'])
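
`LabelEncoder` assigns integers in alphabetical order, which gives `loan_intent` an ordering that the categories do not really have. Tree models usually cope with this, but linear models such as logistic regression may read the codes as magnitudes. One-hot encoding is a common alternative, sketched here with `pd.get_dummies` on a toy frame (the `intent` prefix is an arbitrary choice).

```python
import pandas as pd

toy = pd.DataFrame({"loan_intent": ["PERSONAL", "EDUCATION", "MEDICAL", "PERSONAL"]})

# One indicator column per category instead of a single integer code
one_hot = pd.get_dummies(toy, columns=["loan_intent"], prefix="intent", dtype=int)

print(one_hot.columns.tolist())
# ['intent_EDUCATION', 'intent_MEDICAL', 'intent_PERSONAL']
```

At inference time the one-hot columns would need to be aligned with the training columns, for example with `DataFrame.reindex(columns=..., fill_value=0)`.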

In [None]:
df.head()

Unnamed: 0,person_age,person_income,person_home_ownership,person_emp_length,loan_intent,loan_grade,loan_amnt,loan_int_rate,loan_status,loan_percent_income,cb_person_default_on_file,cb_person_cred_hist_length
0,22,59000,1,123.0,4,3,35000,16.02,1,0.59,1,3
1,21,9600,2,5.0,1,1,1000,11.14,0,0.1,0,2
2,25,9600,0,1.0,3,2,5500,12.87,1,0.57,0,3
3,23,65500,1,4.0,3,2,35000,15.23,1,0.53,0,2
4,24,54400,1,8.0,3,2,35000,14.27,1,0.55,1,4


In [None]:
loan_intent_encoder.classes_

array(['DEBTCONSOLIDATION', 'EDUCATION', 'HOMEIMPROVEMENT', 'MEDICAL',
       'PERSONAL', 'VENTURE'], dtype=object)

In [None]:
loan_intent_encoder.transform(['PERSONAL'])[0]

4

In [None]:
loan_intent_encoder.inverse_transform([4])

array(['PERSONAL'], dtype=object)

In [None]:
df.shape

(32581, 12)

In [None]:
df['loan_status'].value_counts()/len(df)

loan_status,count
0,0.781836
1,0.218164


In [None]:
from sklearn.model_selection import train_test_split

X = df.drop('loan_status', axis=1)
y = df['loan_status']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

X_train.shape, X_test.shape, y_train.shape, y_test.shape

((26064, 11), (6517, 11), (26064,), (6517,))

In [None]:
y_train.value_counts()/len(y_train)

loan_status,count
0,0.781845
1,0.218155


In [None]:
y_test.value_counts()/len(y_test)

loan_status,count
0,0.781801
1,0.218199


In [None]:
# Feature Scaling
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler.fit(X_train)

X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
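
The scaler and the model can also be bundled into a single scikit-learn pipeline, so the same scaling is applied automatically at prediction time and only one object has to be saved. The sketch below uses synthetic data from `make_classification` as a stand-in for the credit data; the variable names are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for the 11 credit features
X_toy, y_toy = make_classification(
    n_samples=500, n_features=11, weights=[0.78, 0.22], random_state=42
)
Xtr, Xte, ytr, yte = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=42
)

# The scaler is fitted on the training data only, inside fit()
scaled_lr = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_lr.fit(Xtr, ytr)

# predict() scales the raw test features before calling the model
toy_pred = scaled_lr.predict(Xte)
print(toy_pred[:10])
```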

In [None]:
# Logistic Regression
from sklearn.linear_model import LogisticRegression

lr_model = LogisticRegression()
lr_model.fit(X_train_scaled, y_train)

In [None]:
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

def evaluate_model(y_test, y_pred):
    acc = accuracy_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred)

    print(f"Accuracy: {round(acc, 3)}")
    print(f"F1 Score: {round(f1, 3)}")
    print(f"Precision: {round(precision, 3)}")
    print(f"Recall: {round(recall, 3)}")

In [None]:
y_pred = lr_model.predict(X_test_scaled)

evaluate_model(y_test, y_pred)

Accuracy: 0.838
F1 Score: 0.548
Precision: 0.699
Recall: 0.451
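
A recall of 0.451 means that more than half of the actual defaulters are missed. Because `predict` effectively applies a 0.5 cut-off to the predicted probability, one option is to lower that cut-off, trading precision for recall. The sketch below shows the idea on synthetic data; the threshold 0.3 is an arbitrary example, not a recommended value.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the credit data
X_toy, y_toy = make_classification(
    n_samples=2000, n_features=11, weights=[0.78, 0.22], random_state=0
)
Xtr, Xte, ytr, yte = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=0
)

clf = LogisticRegression(max_iter=1000).fit(Xtr, ytr)
proba_default = clf.predict_proba(Xte)[:, 1]  # probability of class 1 (default)


def scores_at(threshold):
    """Precision and recall when predicting default for proba >= threshold."""
    pred = (proba_default >= threshold).astype(int)
    return precision_score(yte, pred, zero_division=0), recall_score(yte, pred)


for t in (0.5, 0.3):
    p, r = scores_at(t)
    print(f"threshold={t}: precision={p:.3f}, recall={r:.3f}")
```

Lowering the threshold can never reduce recall, since every case flagged at 0.5 is still flagged at 0.3. Another option worth trying is `LogisticRegression(class_weight="balanced")`.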


In [None]:
# Decision Tree
from sklearn.tree import DecisionTreeClassifier

dt_model = DecisionTreeClassifier()
dt_model.fit(X_train, y_train)

In [None]:
y_pred = dt_model.predict(X_test)

evaluate_model(y_test, y_pred)

Accuracy: 0.889
F1 Score: 0.748
Precision: 0.74
Recall: 0.756


In [None]:
# Random Forest
from sklearn.ensemble import RandomForestClassifier

rf_model = RandomForestClassifier()
rf_model.fit(X_train, y_train)

In [None]:
y_pred = rf_model.predict(X_test)

evaluate_model(y_test, y_pred)

Accuracy: 0.93
F1 Score: 0.815
Precision: 0.96
Recall: 0.707
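
With roughly 78% non-defaults, accuracy alone can look better than the model really is. A confusion matrix and ROC AUC give a fuller picture. The sketch below computes both for a random forest on synthetic data; the data and variable names are stand-ins for the notebook's own.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the credit data
X_toy, y_toy = make_classification(
    n_samples=1000, n_features=11, weights=[0.78, 0.22], random_state=42
)
Xtr, Xte, ytr, yte = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=42
)

forest = RandomForestClassifier(n_estimators=100, random_state=42).fit(Xtr, ytr)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(yte, forest.predict(Xte))
tn, fp, fn, tp = cm.ravel()

# AUC uses the predicted probability of default, not the hard label
auc = roc_auc_score(yte, forest.predict_proba(Xte)[:, 1])

print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
print(f"ROC AUC: {auc:.3f}")
```

Here `fn` counts defaulters predicted as safe, which is usually the costlier mistake for a lender.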


In [None]:
X_train.head(2)

Unnamed: 0,person_age,person_income,person_home_ownership,person_emp_length,loan_intent,loan_grade,loan_amnt,loan_int_rate,loan_percent_income,cb_person_default_on_file,cb_person_cred_hist_length
15884,25,241875,0,4.0,1,0,16000,7.05,0.07,0,4
15138,21,18000,1,5.0,4,1,1500,12.18,0.08,0,4


In [None]:
X_test.head(2)

Unnamed: 0,person_age,person_income,person_home_ownership,person_emp_length,loan_intent,loan_grade,loan_amnt,loan_int_rate,loan_percent_income,cb_person_default_on_file,cb_person_cred_hist_length
6616,22,50000,1,6.0,4,1,6000,11.89,0.12,0,2
21802,32,52000,1,0.0,4,0,7125,7.49,0.14,0,10


In [None]:
# Inference

sample_input = {'person_age': 22,
                'person_income': 50000,
                'person_home_ownership': home_ownership_mapping['RENT'],
                'person_emp_length': 6.0,
                'loan_intent': loan_intent_encoder.transform(['DEBTCONSOLIDATION'])[0],
                'loan_grade': loan_grade_mapping['B'],
                'loan_amnt': 6000,
                'loan_int_rate': 11.89,
                'loan_percent_income': 0.12,
                'cb_person_default_on_file': default_on_file_mapping['N'],
                'cb_person_cred_hist_length': 2}

sample_input_df = pd.DataFrame(sample_input, index=[0])
sample_input_df

Unnamed: 0,person_age,person_income,person_home_ownership,person_emp_length,loan_intent,loan_grade,loan_amnt,loan_int_rate,loan_percent_income,cb_person_default_on_file,cb_person_cred_hist_length
0,22,50000,1,6.0,0,1,6000,11.89,0.12,0,2


In [None]:
pred = rf_model.predict(sample_input_df)

In [None]:
print(pred[0])

0


In [None]:
if pred[0] == 0:
    print("Less likely to default")
if pred[0] == 1:
    print("Likely to default")

Less likely to default


In [None]:
# Ternary operator

print("Less likely to default" if pred[0] == 0 else "Likely to default")

Less likely to default


In [None]:
def make_prediction(sample_input_df):
    prediction = rf_model.predict(sample_input_df)
    label = "Likely to default" if prediction[0] == 1 else "Less likely to default"
    return label


In [None]:
make_prediction(sample_input_df)

'Less likely to default'
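
`make_prediction` returns a hard label, while the problem statement asks for the likelihood of default. Classifiers such as `RandomForestClassifier` expose `predict_proba` for this. The helper `predict_default_probability` below is a hypothetical addition, shown on a small model trained on synthetic data.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier


def predict_default_probability(model, sample_df):
    """Return the predicted probability of class 1 (default) for the first row."""
    class_index = list(model.classes_).index(1)
    return float(model.predict_proba(sample_df)[0, class_index])


# Toy model on synthetic features; the notebook would pass rf_model instead
columns = ["f0", "f1", "f2", "f3"]
X_toy, y_toy = make_classification(n_samples=300, n_features=4, random_state=0)
toy_model = RandomForestClassifier(n_estimators=50, random_state=0)
toy_model.fit(pd.DataFrame(X_toy, columns=columns), y_toy)

sample = pd.DataFrame(X_toy[:1], columns=columns)
p = predict_default_probability(toy_model, sample)
print(f"Estimated probability of default: {p:.2f}")
```

For a random forest this value is the average of the per-tree class probabilities, so it is best read as a score and not as a calibrated probability.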

In [None]:
# Save model
import joblib

joblib.dump(rf_model, 'rf_model_loan_default_pred.pkl')

In [None]:
# Save Label encoder

joblib.dump(loan_intent_encoder, 'loan_intent_encoder.pkl')

['loan_intent_encoder.pkl']

In [None]:
# To load model
import joblib

model = joblib.load("rf_model_loan_default_pred.pkl")

model.predict(sample_input_df)

array([0])

In [None]:
intent_encoder = joblib.load("loan_intent_encoder.pkl")
intent_encoder.transform(['PERSONAL'])[0]

4
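
Inference needs the model, the label encoder, and the three mapping dictionaries, but only the first two are saved above. One way to keep them together is to store a single dictionary with `joblib`. The sketch below shows the round trip with a small encoder; the file name `loan_default_artifacts.pkl` and the dictionary keys are assumptions.

```python
import os
import tempfile

import joblib
from sklearn.preprocessing import LabelEncoder

# Small encoder standing in for loan_intent_encoder
intent_enc = LabelEncoder().fit(["PERSONAL", "EDUCATION", "MEDICAL"])

# Everything the inference code needs, stored under one roof
artifacts = {
    "loan_intent_encoder": intent_enc,
    "home_ownership_mapping": {'MORTGAGE': 0, 'RENT': 1, 'OWN': 2, 'OTHER': 3},
    "loan_grade_mapping": {'A': 0, 'B': 1, 'C': 2, 'D': 3, 'E': 4, 'F': 5, 'G': 6},
    "default_on_file_mapping": {'N': 0, 'Y': 1},
}

# Write to a temporary directory and read it back
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "loan_default_artifacts.pkl")
    joblib.dump(artifacts, path)
    restored = joblib.load(path)

print(restored["loan_grade_mapping"]["B"])  # 1
```

The trained model could be added to the same dictionary under its own key.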

## Gradio Implementation

In [None]:
!pip -q install gradio

In [None]:
import gradio as gr

In [None]:
def greet(fname, lname, day, city):
    response = "Hello " + fname + " " + lname + "!"
    response = response + "\nYour day was " + day + "."
    response = response + "\nYou live in " + city + "."
    return response


In [None]:
greet(fname="Yograj", lname="M", day="Good", city="a")

'Hello Yograj M!\nYour day was Good.\nYou live in a.'

In [None]:
# Input element
#in_name = gr.Textbox(label="Name")
in_fname = gr.Textbox(label="First name", placeholder="Your first name here", value="Default fname")
in_lname = gr.Textbox(label="Last name")
#in_age = gr.Number(label="Age")
in_day = gr.Radio(choices=['Good', 'Not good'], label="How was your day")
in_city = gr.Dropdown(['a', 'b', 'c'], label="Your city")


# Output element
out_greet = gr.Textbox(label="Greet Message")




# Create Interface object
iface = gr.Interface(fn=greet, inputs = [in_fname, in_lname, in_day, in_city], outputs = out_greet)


# Launch the interface object
iface.launch()


Running Gradio in a Colab notebook requires sharing enabled. Automatically setting `share=True` (you can turn this off by setting `share=False` in `launch()` explicitly).

Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
* Running on public URL: https://bd1546fc5f2c509889.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)




In [None]:
df.head()

Unnamed: 0,person_age,person_income,person_home_ownership,person_emp_length,loan_intent,loan_grade,loan_amnt,loan_int_rate,loan_status,loan_percent_income,cb_person_default_on_file,cb_person_cred_hist_length
0,22,59000,1,123.0,4,3,35000,16.02,1,0.59,1,3
1,21,9600,2,5.0,1,1,1000,11.14,0,0.1,0,2
2,25,9600,0,1.0,3,2,5500,12.87,1,0.57,0,3
3,23,65500,1,4.0,3,2,35000,15.23,1,0.53,0,2
4,24,54400,1,8.0,3,2,35000,14.27,1,0.55,1,4


In [None]:
categorical_cols

Index(['person_home_ownership', 'loan_intent', 'loan_grade',
       'cb_person_default_on_file'],
      dtype='object')

In [None]:
df.columns

Index(['person_age', 'person_income', 'person_home_ownership',
       'person_emp_length', 'loan_intent', 'loan_grade', 'loan_amnt',
       'loan_int_rate', 'loan_status', 'loan_percent_income',
       'cb_person_default_on_file', 'cb_person_cred_hist_length'],
      dtype='object')

In [None]:
list(loan_intent_encoder.classes_)

['DEBTCONSOLIDATION',
 'EDUCATION',
 'HOMEIMPROVEMENT',
 'MEDICAL',
 'PERSONAL',
 'VENTURE']

In [None]:
# Input elements
in_person_age = gr.Number(label="Age")
in_person_income = gr.Number(label="Annual Income")
in_person_home_ownership = gr.Radio(choices=["MORTGAGE", "RENT", "OWN", "OTHER"], label="Home Ownership")
in_person_emp_length = gr.Number(label="Employment Length")
in_loan_intent = gr.Radio(choices=list(loan_intent_encoder.classes_), label="Loan Intent")
in_loan_grade = gr.Radio(choices=["A", "B", "C", "D", "E", "F", "G"], label="Loan Grade")
in_loan_amnt = gr.Number(label="Loan Amount")
in_loan_int_rate = gr.Number(label="Interest Rate")
in_loan_percent_income = gr.Number(label="Percent Income")
in_cb_person_default_on_file = gr.Radio(choices=["N", "Y"], label="Historical Default")
in_cb_person_cred_hist_length = gr.Number(label="Credit History Length")


# Output element
out_loan_status = gr.Textbox(label="Prediction")


# Function
def predict_loan_status(person_age, person_income, person_home_ownership, person_emp_length, loan_intent, loan_grade, loan_amnt, loan_int_rate, loan_percent_income, cb_person_default_on_file, cb_person_cred_hist_length):
    sample_input = {'person_age': person_age,
                    'person_income': person_income,
                    'person_home_ownership': home_ownership_mapping[person_home_ownership],
                    'person_emp_length': person_emp_length,
                    'loan_intent': loan_intent_encoder.transform([loan_intent])[0],
                    'loan_grade': loan_grade_mapping[loan_grade],
                    'loan_amnt': loan_amnt,
                    'loan_int_rate': loan_int_rate,
                    'loan_percent_income': loan_percent_income,
                    'cb_person_default_on_file': default_on_file_mapping[cb_person_default_on_file],
                    'cb_person_cred_hist_length': cb_person_cred_hist_length}

    sample_input_df = pd.DataFrame(sample_input, index=[0])
    label = make_prediction(sample_input_df)

    return label
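
If a user submits the form without choosing a radio option, the value may arrive as `None`, and a lookup such as `home_ownership_mapping[person_home_ownership]` would then raise a `KeyError`. The sketch below shows one way to guard against that with plain Python. The helpers `find_missing_fields` and `guarded_predict`, and the stand-in `fake_predict`, are hypothetical.

```python
def find_missing_fields(**fields):
    """Return the names of fields whose value is None or an empty string."""
    return [name for name, value in fields.items() if value is None or value == ""]


def guarded_predict(predict_fn, **fields):
    """Call predict_fn only when every field has a value."""
    missing = find_missing_fields(**fields)
    if missing:
        return "Please fill in: " + ", ".join(missing)
    return predict_fn(**fields)


# Stand-in for predict_loan_status with just two of its parameters
def fake_predict(person_age, loan_grade):
    return "Less likely to default"


print(guarded_predict(fake_predict, person_age=22, loan_grade=None))
# Please fill in: loan_grade
print(guarded_predict(fake_predict, person_age=22, loan_grade="B"))
# Less likely to default
```

The same check could be placed at the top of `predict_loan_status` so that the message appears in the Prediction box.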



In [None]:
# Create interface
iface = gr.Interface(fn=predict_loan_status,
                     inputs=[in_person_age, in_person_income, in_person_home_ownership, in_person_emp_length, in_loan_intent, in_loan_grade, in_loan_amnt, in_loan_int_rate, in_loan_percent_income, in_cb_person_default_on_file, in_cb_person_cred_hist_length],
                     outputs=out_loan_status,
                     title="Loan Default Prediction")


In [None]:
iface.launch(debug=True)

## API Request

In [1]:
import requests
import json

In [7]:
# Example data to send to the model endpoint
data = {
  "person_age": 30,
  "person_income": 1000000,
  "person_home_ownership": "OWN",
  "person_emp_length": 6,
  "loan_intent": "DEBTCONSOLIDATION",
  "loan_grade": "B",
  "loan_amnt": 500000,
  "loan_int_rate": 12.0,
  "loan_percent_income": 0.5,
  "cb_person_default_on_file": "N",
  "cb_person_cred_hist_length": 4
}

# Explicit JSON content type (optional: passing json=... to requests already sets this header)
headers = {'Content-Type': 'application/json'}

# Endpoint URL (replace with your actual URL)
url = "https://upgraded-halibut-x7j49459v4hpgr6-8080.app.github.dev/predict"

# Make a POST request to the endpoint
response = requests.post(url, json=data, headers=headers)

# Check the response status code
if response.status_code == 200:
    # If successful, parse the prediction response
    prediction = response.json()  # Assuming the response is in JSON format
    print(f"Predicted class: {prediction['prediction']}")
else:
    print(f"Error: {response.status_code}, Message: {response.text}")


Predicted class: Less likely to default
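
The server code behind `/predict` is not shown in this notebook. The sketch below is one guess at what its core could look like, written as a framework-free function: it encodes the raw strings from the JSON payload the same way as during training and returns a dictionary with a `prediction` key. The names `handle_predict`, `FEATURE_ORDER`, and `StubModel` are hypothetical, and `StubModel` only imitates a trained model's `predict` method.

```python
# Same encodings as in training; LabelEncoder codes follow alphabetical order
LOAN_INTENTS = ['DEBTCONSOLIDATION', 'EDUCATION', 'HOMEIMPROVEMENT',
                'MEDICAL', 'PERSONAL', 'VENTURE']
HOME_OWNERSHIP = {'MORTGAGE': 0, 'RENT': 1, 'OWN': 2, 'OTHER': 3}
LOAN_GRADE = {'A': 0, 'B': 1, 'C': 2, 'D': 3, 'E': 4, 'F': 5, 'G': 6}
DEFAULT_ON_FILE = {'N': 0, 'Y': 1}

# Column order used when the model was trained
FEATURE_ORDER = ['person_age', 'person_income', 'person_home_ownership',
                 'person_emp_length', 'loan_intent', 'loan_grade', 'loan_amnt',
                 'loan_int_rate', 'loan_percent_income',
                 'cb_person_default_on_file', 'cb_person_cred_hist_length']


def handle_predict(payload, model):
    """Encode a raw JSON payload and return the response body as a dict."""
    encoded = dict(payload)
    encoded['person_home_ownership'] = HOME_OWNERSHIP[payload['person_home_ownership']]
    encoded['loan_intent'] = LOAN_INTENTS.index(payload['loan_intent'])
    encoded['loan_grade'] = LOAN_GRADE[payload['loan_grade']]
    encoded['cb_person_default_on_file'] = DEFAULT_ON_FILE[payload['cb_person_default_on_file']]

    row = [[encoded[name] for name in FEATURE_ORDER]]
    predicted = model.predict(row)[0]
    label = "Likely to default" if predicted == 1 else "Less likely to default"
    return {"prediction": label}


class StubModel:
    """Imitation model: flags loans above 40% of income (for the demo only)."""

    def predict(self, rows):
        idx = FEATURE_ORDER.index('loan_percent_income')
        return [1 if row[idx] > 0.4 else 0 for row in rows]


payload = {"person_age": 30, "person_income": 1000000, "person_home_ownership": "OWN",
           "person_emp_length": 6, "loan_intent": "DEBTCONSOLIDATION", "loan_grade": "B",
           "loan_amnt": 500000, "loan_int_rate": 12.0, "loan_percent_income": 0.5,
           "cb_person_default_on_file": "N", "cb_person_cred_hist_length": 4}

result = handle_predict(payload, StubModel())
print(result)  # {'prediction': 'Likely to default'}
```

With the real random forest, the row would be better passed as a one-row `DataFrame` with the training column names, as in the inference cells earlier in the notebook.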
