# Task 4: Loan Default Prediction  
# **Objective:** Build a classification model to predict whether a loan applicant will default using  financial data.  
● Dataset: **Lending Club Loan Dataset**

**● Steps:**  
1. Preprocess the data, handling missing values and class imbalance using  techniques like SMOTE.  
2. Train classifiers such as LightGBM or SVM.  
3. Evaluate performance using metrics such as Precision, Recall, and F1 Score.
4. Generate a comprehensive performance report and recommendations for lenders.  
**● Outcome:** A classification model to identify high-risk loan  
applicants, helping reduce defaults.  


# **Install and Import Required Libraries**

In [21]:
# Install necessary packages
!pip install imbalanced-learn lightgbm openpyxl scikit-learn

# Import libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report, confusion_matrix
from imblearn.over_sampling import SMOTE
import lightgbm as lgb



# **Load and Preprocess Data**

In [26]:
# Load the dataset
df = pd.read_csv('/content/loan.csv', low_memory=False)

# Filter to only 'Charged Off' and 'Fully Paid'
df = df[df['loan_status'].isin(['Fully Paid', 'Charged Off'])]

# Encode target variable: 1 = Charged Off (Default), 0 = Fully Paid
df['loan_status'] = df['loan_status'].map({'Fully Paid': 0, 'Charged Off': 1})

# Select features
selected_features = ['loan_amnt', 'int_rate', 'annual_inc', 'dti',
                     'delinq_2yrs', 'revol_util', 'total_acc',
                     'loan_status']

# Drop rows with missing values in selected columns
df = df[selected_features].dropna()

# Separate features (X) and target (y)
X = df.drop('loan_status', axis=1)
y = df['loan_status']

# Handle class imbalance with SMOTE
smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X, y)

# **Train-Test Split**

In [28]:
# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X_resampled, y_resampled, test_size=0.2, random_state=42
)

# **Model Training (LightGBM)**

In [29]:
# Train LightGBM classifier
model = lgb.LGBMClassifier(random_state=42)
model.fit(X_train, y_train)

[LightGBM] [Info] Number of positive: 5920, number of negative: 5878
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.001735 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 1367
[LightGBM] [Info] Number of data points in the train set: 11798, number of used features: 7
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.501780 -> initscore=0.007120
[LightGBM] [Info] Start training from score 0.007120


# **Model Evaluation**

In [30]:
# Make predictions on the test set
y_pred = model.predict(X_test)

# Evaluate the model
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.96      0.99      0.97      1496
           1       0.98      0.96      0.97      1454

    accuracy                           0.97      2950
   macro avg       0.97      0.97      0.97      2950
weighted avg       0.97      0.97      0.97      2950

[[1474   22]
 [  65 1389]]


# **Performance Report**

**Model Performance:**

The loan default prediction model demonstrates **exceptional performance** with high accuracy (0.97), precision (0.96 for class 0, 0.98 for class 1), recall (0.99 for class 0, 0.96 for class 1), and F1-score (0.97 for both classes). This indicates a **high level of accuracy in predicting both loan repayment and default**, with a **good balance between identifying potential defaulters and minimizing false positives**.

**Strengths:**

* **High precision and recall** for both classes suggest that the model is effective in identifying both low-risk and high-risk borrowers.
* **High F1-score** indicates a good balance between precision and recall showcasing the model's overall effectiveness.
* **High accuracy** demonstrates the model's ability to correctly predict the loan status for a large majority of borrowers.

**Weaknesses:**

* Given the already high performance metrics, there is **limited room for significant improvement.**

**Recommendations for Lenders**

* **Confident Loan Approvals:** The model's high precision and recall for non-defaulters (class 0) suggest that lenders can be confident in approving loans to borrowers predicted as low-risk.
* **Early Default Identification:** The model's strong performance in identifying defaulters (class 1) enables lenders to implement early warning systems and proactive measures to mitigate potential losses.
* **Risk-Based Pricing:** While the model exhibits high accuracy, consider using predicted probabilities of default to fine-tune interest rates and loan terms. This allows for risk-based pricing and personalized loan offers.
* **Continuous Monitoring:** Regularly retrain and update the model with new data to ensure it remains effective in changing economic conditions. Monitor performance metrics to identify any potential degradation and adjust strategies accordingly.
* **Human Oversight:** Despite the model's high performance, human oversight should be maintained in the loan approval process. Consider the model's predictions as a valuable tool to support decision-making but not as the sole determinant.
* **Transparency and Explainability:** As the model is used for critical financial decisions, ensure its predictions are transparent and explainable to borrowers and regulators. This fosters trust and accountability.

The provided classification report and confusion matrix indicate that your loan default prediction model has achieved excellent performance. By incorporating these recommendations, lenders can effectively utilize the model to improve loan approval processes, manage risk, and enhance overall profitability.