# **Credit Risk Analysis and Prediction**

This notebook analyzes a credit risk dataset and builds a machine learning model to predict loan default status. The dataset includes features such as `person_age`, `person_income`, `loan_amnt`, and `loan_int_rate`, along with the target column `loan_status`.

Analysis Steps:
1. Data Loading and Exploration
2. Data Preprocessing
3. Model Training
4. Model Evaluation
5. Inference (Testing Predictions)
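
The five steps above can also be expressed as a single scikit-learn `Pipeline`. The sketch below is not this notebook's code: it runs on a tiny made-up frame, and the names `toy`, `numeric_cols`, `categorical_demo_cols`, and `pipeline` are illustrative assumptions. It also swaps in one-hot encoding where the notebook uses label encoding.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy rows that reuse a few column names from the dataset.
toy = pd.DataFrame({
    'person_age': [22, 21, 25, 23, 24, 30, 41, 35],
    'loan_int_rate': [16.02, 11.14, np.nan, 15.23, 14.27, 7.9, np.nan, 10.99],
    'person_home_ownership': ['RENT', 'OWN', 'MORTGAGE', 'RENT',
                              'RENT', 'OWN', 'MORTGAGE', 'RENT'],
    'loan_status': [1, 0, 1, 1, 1, 0, 0, 0],
})
numeric_cols = ['person_age', 'loan_int_rate']
categorical_demo_cols = ['person_home_ownership']

# Numeric columns: median imputation, then standardization.
numeric_steps = Pipeline([
    ('impute', SimpleImputer(strategy='median')),
    ('scale', StandardScaler()),
])
# Categorical columns: one-hot encoding that tolerates unseen categories.
preprocess = ColumnTransformer([
    ('num', numeric_steps, numeric_cols),
    ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_demo_cols),
])
pipeline = Pipeline([
    ('preprocess', preprocess),
    ('model', RandomForestClassifier(n_estimators=10, random_state=42)),
])

features = toy.drop(columns='loan_status')
pipeline.fit(features, toy['loan_status'])
toy_predictions = pipeline.predict(features)
```

Bundling the steps this way means the same fitted imputer, encoder, and scaler are applied at inference time, which removes the need to repeat preprocessing by hand.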

In [23]:
# Import libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import warnings
warnings.filterwarnings('ignore')

## **Data Loading and Exploration**

In [24]:
# Load the dataset
df = pd.read_csv('credit_risk_dataset.csv')

In [25]:
# Display first few rows
df.head()

|   | person_age | person_income | person_home_ownership | person_emp_length | loan_intent | loan_grade | loan_amnt | loan_int_rate | loan_status | loan_percent_income | cb_person_default_on_file | cb_person_cred_hist_length |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 22 | 59000 | RENT | 123.0 | PERSONAL | D | 35000 | 16.02 | 1 | 0.59 | Y | 3 |
| 1 | 21 | 9600 | OWN | 5.0 | EDUCATION | B | 1000 | 11.14 | 0 | 0.1 | N | 2 |
| 2 | 25 | 9600 | MORTGAGE | 1.0 | MEDICAL | C | 5500 | 12.87 | 1 | 0.57 | N | 3 |
| 3 | 23 | 65500 | RENT | 4.0 | MEDICAL | C | 35000 | 15.23 | 1 | 0.53 | N | 2 |
| 4 | 24 | 54400 | RENT | 8.0 | MEDICAL | C | 35000 | 14.27 | 1 | 0.55 | Y | 4 |


In [26]:
# Display dataset info
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32581 entries, 0 to 32580
Data columns (total 12 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   person_age                  32581 non-null  int64  
 1   person_income               32581 non-null  int64  
 2   person_home_ownership       32581 non-null  object 
 3   person_emp_length           31686 non-null  float64
 4   loan_intent                 32581 non-null  object 
 5   loan_grade                  32581 non-null  object 
 6   loan_amnt                   32581 non-null  int64  
 7   loan_int_rate               29465 non-null  float64
 8   loan_status                 32581 non-null  int64  
 9   loan_percent_income         32581 non-null  float64
 10  cb_person_default_on_file   32581 non-null  object 
 11  cb_person_cred_hist_length  32581 non-null  int64  
dtypes: float64(3), int64(5), object(4)
memory usage: 3.0+ MB


In [27]:
# Check for missing values
df.isnull().sum()

person_age                       0
person_income                    0
person_home_ownership            0
person_emp_length              895
loan_intent                      0
loan_grade                       0
loan_amnt                        0
loan_int_rate                 3116
loan_status                      0
loan_percent_income              0
cb_person_default_on_file        0
cb_person_cred_hist_length       0
dtype: int64
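
To put the counts above in proportion, the short calculation below converts them to percentages of the 32,581 rows reported by `df.info()`. The variable names (`missing_counts`, `total_rows`, `missing_pct`) are illustrative and not part of the notebook.

```python
import pandas as pd

# Missing-value counts and row total copied from the outputs above.
missing_counts = pd.Series({'person_emp_length': 895, 'loan_int_rate': 3116})
total_rows = 32581

# Share of rows with a missing value, in percent.
missing_pct = (missing_counts / total_rows * 100).round(2)
print(missing_pct)
```

Roughly 2.75% of employment lengths and 9.56% of interest rates are missing, so imputation keeps far more data than dropping the affected rows would.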

## **Data Preprocessing**

- Handle missing values (median imputation)
- Remove outliers (unrealistic ages and employment lengths)
- Encode categorical variables
- Scale features

In [28]:
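# Handle missing values by imputing each column's median.
# Assigning the result back avoids chained in-place modification, which newer pandas versions warn about.
df['loan_int_rate'] = df['loan_int_rate'].fillna(df['loan_int_rate'].median())
df['person_emp_length'] = df['person_emp_length'].fillna(df['person_emp_length'].median())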

In [29]:
df.isnull().sum()

person_age                    0
person_income                 0
person_home_ownership         0
person_emp_length             0
loan_intent                   0
loan_grade                    0
loan_amnt                     0
loan_int_rate                 0
loan_status                   0
loan_percent_income           0
cb_person_default_on_file     0
cb_person_cred_hist_length    0
dtype: int64
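
One likely reason to prefer the median over the mean here is that it is less sensitive to extreme values such as the 123-year employment length in the first row of the data. A minimal sketch, using the five employment lengths shown by `df.head()` plus one missing entry (the names `toy_emp`, `median_value`, `mean_value`, and `filled` are illustrative):

```python
import numpy as np
import pandas as pd

# Five employment lengths from df.head() plus one missing value.
toy_emp = pd.Series([123.0, 5.0, 1.0, 4.0, 8.0, np.nan])

median_value = toy_emp.median()  # NaN is skipped by default
mean_value = toy_emp.mean()      # pulled upward by the 123.0 outlier

# Fill the gap with the median, as the notebook does.
filled = toy_emp.fillna(median_value)
print(median_value, mean_value)
```

Here the median is 5.0 while the mean is 28.2, so a mean-based fill would insert an implausible employment length.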

In [30]:
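# Remove outliers (age > 100 or employment length > 50 years).
# .copy() makes the filtered frame independent, so later column assignments do not trigger SettingWithCopyWarning.
df = df[(df['person_age'] <= 100) & (df['person_emp_length'] <= 50)].copy()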
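
The filter combines two boolean masks with `&`, so a row is kept only when both conditions hold. The sketch below shows this on a made-up frame; the 123.0 employment length comes from the data preview, while the age of 144 and the other rows are hypothetical.

```python
import pandas as pd

# Hypothetical rows: one bad employment length, one bad age, two valid rows.
toy_people = pd.DataFrame({
    'person_age': [22, 144, 35, 28],
    'person_emp_length': [123.0, 4.0, 10.0, 2.0],
})

age_ok = toy_people['person_age'] <= 100
emp_ok = toy_people['person_emp_length'] <= 50

# Keep rows that pass both checks.
kept = toy_people[age_ok & emp_ok].copy()
removed_count = len(toy_people) - len(kept)
print(f"Removed {removed_count} of {len(toy_people)} rows")
```

Comparing `len(df)` before and after the real filter in the same way gives a quick check on how many rows the thresholds discard.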

In [31]:
# Encode categorical variables
le = LabelEncoder()
categorical_cols = ['person_home_ownership', 'loan_intent', 'loan_grade', 'cb_person_default_on_file']
for col in categorical_cols:
    df[col] = le.fit_transform(df[col])
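
`LabelEncoder` assigns integer codes in sorted order of the classes it sees during `fit`, so the mapping depends entirely on the data it was fitted on. One way to keep that mapping available for later inference is to store one fitted encoder per column. This is a sketch on made-up rows, and the names `toy_train`, `encoders`, and `new_values` are assumptions, not the notebook's variables.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical training rows using categories that appear in the dataset.
toy_train = pd.DataFrame({
    'person_home_ownership': ['RENT', 'OWN', 'MORTGAGE', 'RENT'],
    'cb_person_default_on_file': ['Y', 'N', 'N', 'Y'],
})

# Fit and keep a separate encoder for every categorical column.
encoders = {}
for col in toy_train.columns:
    encoders[col] = LabelEncoder().fit(toy_train[col])
    toy_train[col] = encoders[col].transform(toy_train[col])

# Reuse the stored mapping on new data: transform only, no refit.
new_values = pd.Series(['OWN', 'RENT'])
codes = encoders['person_home_ownership'].transform(new_values)
print(list(encoders['person_home_ownership'].classes_), codes)
```

With classes sorted as `MORTGAGE`, `OWN`, `RENT`, the new values map to codes 1 and 2, matching what the training rows received.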

In [32]:
# Define features and target
X = df.drop('loan_status', axis=1)
y = df['loan_status']

In [33]:
# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
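
The split above is purely random. When the target classes are imbalanced, `train_test_split` also accepts a `stratify` argument that preserves the class ratio in both splits. The sketch below illustrates that option on synthetic labels; it is an alternative, not what the notebook runs, and the `*_toy` and `*_demo` names are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic target with 20% positives.
y_toy = np.array([0] * 80 + [1] * 20)
X_toy = np.arange(100).reshape(-1, 1)

X_tr_demo, X_te_demo, y_tr_demo, y_te_demo = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# Both splits keep the 20% positive rate.
print(y_tr_demo.mean(), y_te_demo.mean())
```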

In [34]:
# Scale all features: fit the scaler on the training split only, then reuse it on the test split
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
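
Fitting the scaler on the training split alone means the test split is standardized with the training mean and standard deviation, so no information from the test data leaks into preprocessing. A minimal sketch with made-up income values (the names `toy_scaler`, `train_income`, and `test_income` are illustrative):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical single-feature splits.
train_income = np.array([[20000.0], [40000.0], [60000.0]])
test_income = np.array([[80000.0]])

# Learn mean and standard deviation from the training split only.
toy_scaler = StandardScaler().fit(train_income)
train_scaled = toy_scaler.transform(train_income)
test_scaled = toy_scaler.transform(test_income)

# Training data ends up with mean 0 and unit variance; the test value is
# expressed in training standard deviations and is not re-centered.
print(train_scaled.ravel(), test_scaled.ravel())
```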

## **Model Training**

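A `RandomForestClassifier` is used because it is robust to outliers and captures non-linear feature interactions without heavy tuning. It does not correct for class imbalance by itself: defaults make up about 22% of the test set (1,416 of 6,515 rows), and the model below is trained without class weighting.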

In [35]:
# Train the model
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
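
A fitted random forest exposes `feature_importances_`, which can help show which inputs drive its predictions. The sketch below demonstrates the attribute on synthetic data from `make_classification`; the data, the feature names `f0` to `f3`, and the variable `forest` are assumptions for illustration only.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: 4 features, 2 of them informative.
X_syn, y_syn = make_classification(
    n_samples=300, n_features=4, n_informative=2, n_redundant=0, random_state=42
)
feature_names = ['f0', 'f1', 'f2', 'f3']  # hypothetical names

forest = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_syn, y_syn)

# Impurity-based importances sum to 1 across features.
importances = pd.Series(forest.feature_importances_, index=feature_names)
print(importances.sort_values(ascending=False))
```

In this notebook the names would come from `X.columns`, because `X_train` is a NumPy array once it has been scaled.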

In [36]:
# Make predictions on the test set
y_pred = model.predict(X_test)

## **Model Evaluation**

In [37]:
# Evaluate the model
print("Accuracy Score:", accuracy_score(y_test, y_pred))

Accuracy Score: 0.9338449731389102


In [38]:
print("Classification Report:")
print(classification_report(y_test, y_pred))

Classification Report:
              precision    recall  f1-score   support

           0       0.93      0.99      0.96      5099
           1       0.97      0.72      0.83      1416

    accuracy                           0.93      6515
   macro avg       0.95      0.86      0.89      6515
weighted avg       0.94      0.93      0.93      6515
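
Because the classes are imbalanced, accuracy is easier to interpret against a majority-class baseline, that is, a model that always predicts class 0. The calculation below uses the support counts from the report above; the variable names are illustrative.

```python
# Support counts taken from the classification report.
support_non_default = 5099
support_default = 1416
total = support_non_default + support_default

# Accuracy of always predicting the majority class (0 = no default).
baseline_accuracy = max(support_non_default, support_default) / total
print(f"Majority-class baseline accuracy: {baseline_accuracy:.3f}")
```

The baseline sits near 0.78, so the model's 0.93 accuracy is a real improvement, though the 0.72 recall on class 1 shows that a sizeable share of defaults is still missed.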



In [39]:
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))

Confusion Matrix:
[[5063   36]
 [ 395 1021]]
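
The metrics in the classification report can be recomputed from this matrix. In scikit-learn's layout, rows are true labels and columns are predicted labels, so the four cells are TN, FP, FN, and TP in row-major order. A short worked check (variable names are illustrative):

```python
import numpy as np

# Confusion matrix copied from the output above.
cm = np.array([[5063, 36],
               [395, 1021]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
precision = tp / (tp + fp)   # of predicted defaults, how many were real
recall = tp / (tp + fn)      # of real defaults, how many were caught
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.4f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.2f}")
```

The 395 false negatives are defaults predicted as non-defaults, which is what drives the class 1 recall of 0.72.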


## **Testing Predictions**

Let's test the model with a few sample inputs to predict loan default status.

In [40]:
# Create sample test cases
sample_data = pd.DataFrame({
    'person_age': [25, 40, 30],
    'person_income': [50000, 120000, 80000],
    'person_home_ownership': ['RENT', 'MORTGAGE', 'OWN'],
    'person_emp_length': [5.0, 10.0, 3.0],
    'loan_intent': ['EDUCATION', 'DEBTCONSOLIDATION', 'MEDICAL'],
    'loan_grade': ['B', 'A', 'C'],
    'loan_amnt': [10000, 20000, 15000],
    'loan_int_rate': [10.99, 7.9, 13.49],
    'loan_percent_income': [0.2, 0.17, 0.19],
    'cb_person_default_on_file': ['N', 'Y', 'N'],
    'cb_person_cred_hist_length': [3, 15, 7]})

In [41]:
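# Preprocess sample data: encode categories with the mapping learned from the full dataset.
# Refitting the encoder on three sample rows would assign different integer codes than the model was trained on.
# Assumes the outlier filter did not remove any category entirely.
raw = pd.read_csv('credit_risk_dataset.csv')
for col in categorical_cols:
    sample_data[col] = LabelEncoder().fit(raw[col]).transform(sample_data[col])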

In [42]:
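# Scale sample data with the scaler fitted on the training split,
# selecting columns in the same order the scaler was fitted on
sample_data_scaled = scaler.transform(sample_data[X.columns])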

In [43]:
# Make predictions
predictions = model.predict(sample_data_scaled)
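
`predict` returns hard labels using a 0.5 probability cut-off. For credit decisions it can be more informative to inspect `predict_proba` and choose a threshold explicitly. The sketch below shows the idea on synthetic data; `demo_model`, the data, and the 0.3 threshold are assumptions for illustration, not values taken from this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic binary classification data.
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)
demo_model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_demo, y_demo)

# Column 1 of predict_proba holds the estimated probability of class 1.
proba = demo_model.predict_proba(X_demo[:3])[:, 1]

# A lower threshold flags more cases as class 1 (hypothetical value).
threshold = 0.3
flags = (proba >= threshold).astype(int)
print(proba, flags)
```

Lowering the threshold trades precision for recall, which may suit cases where a missed default costs more than a declined good loan.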

In [45]:
# Display results
print("Inference Results for Sample Data:")
for i, pred in enumerate(predictions):
    status = 'Default' if pred == 1 else 'Paid Off'
    print(f"Sample {i+1}: Predicted Loan Status = {status}")

Inference Results for Sample Data:
Sample 1: Predicted Loan Status = Paid Off
Sample 2: Predicted Loan Status = Paid Off
Sample 3: Predicted Loan Status = Paid Off
