<a href="https://colab.research.google.com/github/Daniyal6124/DS_Tasks_2/blob/Task4/LDP.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Importing Libraries

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.svm import SVC
from lightgbm import LGBMClassifier

from imblearn.over_sampling import SMOTE
import warnings
warnings.filterwarnings("ignore")


Loading Dataset and basic EDA

In [3]:
df = pd.read_csv("/content/loan.csv")

print("Initial Shape:", df.shape)
print("\nMissing Values:\n", df.isnull().sum().sort_values(ascending=False).head())

Initial Shape: (211069, 145)

Missing Values:
 id                         211069
member_id                  211069
url                        211069
desc                       211069
payment_plan_start_date    211062
dtype: int64


Handling Missing values

In [4]:
df = df.drop(['id', 'member_id', 'url', 'desc', 'title', 'zip_code', 'emp_title'], axis=1, errors='ignore')
df = df.dropna(thresh=df.shape[0]*0.6, axis=1)

# Forward-fill remaining gaps; fillna(method='ffill') is deprecated in recent pandas
df = df.ffill()
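
To make the two cleaning steps concrete, here is a minimal sketch on a made-up frame (the columns `a`, `b`, `c` are illustrative and do not come from `loan.csv`). It suggests that `thresh` counts non-null values per column, and that a forward fill cannot fill a gap in the first row, so a backward fill may be worth adding as a fallback.

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical columns (not taken from loan.csv)
toy = pd.DataFrame({
    "a": [np.nan, 2.0, np.nan, 4.0, 5.0],     # 3 of 5 values present
    "b": [np.nan, np.nan, np.nan, 1.0, 2.0],  # 2 of 5 values present
    "c": [1.0, 2.0, 3.0, 4.0, 5.0],           # fully present
})

# Keep columns with at least 3 non-null values (60% of 5 rows)
min_non_null = 3
kept = toy.dropna(thresh=min_non_null, axis=1)

# Forward fill copies the last seen value downward,
# so a NaN in the very first row has nothing to copy from
filled = kept.ffill()

print(list(kept.columns))
print("NaNs left after ffill:", int(filled.isna().sum().sum()))
print("NaNs left after ffill + bfill:", int(filled.bfill().isna().sum().sum()))
```

If `loan.csv` has gaps in its first rows, the same leftover NaNs would reach the later cells, which is why checking `df.isnull().sum().sum()` after this step seems prudent.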

Encode target and features

In [5]:
df['loan_status'] = df['loan_status'].apply(lambda x: 1 if x == 'Charged Off' else 0)

# Select numeric + categorical features
num_cols = df.select_dtypes(include=['float64', 'int64']).columns
cat_cols = df.select_dtypes(include=['object']).columns

Encode categorical columns

In [6]:
df[cat_cols] = df[cat_cols].apply(lambda x: LabelEncoder().fit_transform(x.astype(str)))

# Feature and target separation
X = df.drop('loan_status', axis=1)
y = df['loan_status']
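
Cell 6 fits a `LabelEncoder` per column on the full frame, before the train-test split. One possible alternative, sketched here under assumptions, is scikit-learn's `OrdinalEncoder` fitted on the training rows only, with unseen test categories mapped to a sentinel. The toy columns and the `-1` sentinel are illustrative choices, not part of the notebook.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy train/test frames (hypothetical values for illustration)
train = pd.DataFrame({"grade": ["A", "B", "A"],
                      "home_ownership": ["RENT", "OWN", "RENT"]})
test = pd.DataFrame({"grade": ["B", "C"],
                     "home_ownership": ["OWN", "MORTGAGE"]})

# Learn the category-to-code mapping from training rows only;
# categories first seen at test time are encoded as -1
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
train_enc = encoder.fit_transform(train)
test_enc = encoder.transform(test)

print(test_enc)
```

Either encoder assigns integer codes in sorted category order, so the codes carry no meaningful ranking for nominal columns; tree models such as LightGBM tend to tolerate this better than distance-based models such as SVM.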


Train-Test Split

In [7]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)


Handling Class Imbalance using SMOTE

In [8]:
sm = SMOTE(random_state=42)
X_res, y_res = sm.fit_resample(X_train, y_train)

# Feature scaling: fit on the resampled training data, then apply to the test set
scaler = StandardScaler()
X_res_scaled = scaler.fit_transform(X_res)
X_test_scaled = scaler.transform(X_test)
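
SMOTE creates synthetic minority rows by interpolating between a minority sample and one of its nearest minority-class neighbours. The NumPy sketch below illustrates that idea only; it is not imblearn's implementation, and the function name `smote_like_sample` is hypothetical.

```python
import numpy as np

def smote_like_sample(minority, k=2, rng=None):
    """Create one synthetic point between a minority sample
    and one of its k nearest minority neighbours."""
    rng = np.random.default_rng(rng)
    minority = np.asarray(minority, dtype=float)

    # Pick a random minority sample as the base point
    base = minority[rng.integers(len(minority))]

    # Euclidean distances from the base to every minority sample
    dists = np.linalg.norm(minority - base, axis=1)
    neighbours = np.argsort(dists)[1:k + 1]  # index 0 is the base itself

    # Move a random fraction of the way towards one neighbour
    neighbour = minority[rng.choice(neighbours)]
    gap = rng.random()
    return base + gap * (neighbour - base)

# Four toy minority points on the corners of the unit square
minority_points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
synthetic = smote_like_sample(minority_points, k=2, rng=0)
print(synthetic)
```

Two consequences follow for this notebook. The neighbour search relies on Euclidean distance, so running SMOTE before scaling lets large-valued columns dominate it. Interpolation also produces fractional values for label-encoded categorical columns, which have no category behind them.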


In [9]:
# Train the LightGBM classifier
lgb_model = LGBMClassifier(random_state=42)
lgb_model.fit(X_res_scaled, y_res)
y_pred_lgb = lgb_model.predict(X_test_scaled)

# Train the support vector machine
svm_model = SVC(random_state=42)
svm_model.fit(X_res_scaled, y_res)
y_pred_svm = svm_model.predict(X_test_scaled)

[LightGBM] [Info] Number of positive: 168629, number of negative: 168629
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.771722 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 17721
[LightGBM] [Info] Number of data points in the train set: 337258, number of used features: 90
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.500000 -> initscore=0.000000


Evaluation

In [10]:
def evaluate_model(name, y_true, y_pred):
    print(f"\nModel: {name}")
    print("Confusion Matrix:\n", confusion_matrix(y_true, y_pred))
    print("Classification Report:\n", classification_report(y_true, y_pred))

evaluate_model("LightGBM", y_test, y_pred_lgb)
evaluate_model("SVM", y_test, y_pred_svm)



Model: LightGBM
Confusion Matrix:
 [[42154     3]
 [    0    57]]
Classification Report:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00     42157
           1       0.95      1.00      0.97        57

    accuracy                           1.00     42214
   macro avg       0.97      1.00      0.99     42214
weighted avg       1.00      1.00      1.00     42214


Model: SVM
Confusion Matrix:
 [[42157     0]
 [   13    44]]
Classification Report:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00     42157
           1       1.00      0.77      0.87        57

    accuracy                           1.00     42214
   macro avg       1.00      0.89      0.94     42214
weighted avg       1.00      1.00      1.00     42214



**Performance Report**:

The code trains two models, LightGBM and SVM, to predict loan charge-offs. It uses SMOTE to address class imbalance in the training set and scales the features using StandardScaler. Both models are evaluated on the held-out test set (42,214 loans, of which 57 are charged off) using a confusion matrix and a classification report. Because charge-offs make up only about 0.14% of the test set, overall accuracy rounds to 1.00 for both models and says little on its own; the charge-off (class 1) metrics are the ones to compare.

1. LightGBM:
*   Accuracy: 1.00 (3 errors in 42,214 predictions).
*   Precision: 0.95 for charge-offs. Of the 60 loans flagged, 57 were actual charge-offs and 3 were false positives.
*   Recall: 1.00 for charge-offs. All 57 charge-offs were caught, with no false negatives.
*   F1-Score: 0.97 for charge-offs, reflecting high precision and perfect recall.

2. SVM:
*   Accuracy: 1.00 (13 errors in 42,214 predictions).
*   Precision: 1.00 for charge-offs. All 44 flagged loans were actual charge-offs.
*   Recall: 0.77 for charge-offs. 44 of the 57 charge-offs were caught and 13 were missed.
*   F1-Score: 0.87 for charge-offs, pulled down by the lower recall.
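
The class-1 figures in the classification reports follow directly from the confusion matrices printed above. As a worked check, the sketch below recomputes them in plain Python, reading each matrix as `[[TN, FP], [FN, TP]]` (scikit-learn's layout). The helper name `charge_off_metrics` is illustrative.

```python
def charge_off_metrics(tn, fp, fn, tp):
    """Precision, recall, F1 for the positive class, plus overall accuracy."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    accuracy = (tp + tn) / (tn + fp + fn + tp)
    return {"precision": precision, "recall": recall,
            "f1": f1, "accuracy": accuracy}

# Counts copied from the confusion matrices in the output above
lgb = charge_off_metrics(tn=42154, fp=3, fn=0, tp=57)
svm = charge_off_metrics(tn=42157, fp=0, fn=13, tp=44)

# Accuracy of a trivial model that never predicts a charge-off
baseline_accuracy = 42157 / 42214

for name, metrics in [("LightGBM", lgb), ("SVM", svm)]:
    print(name, {key: round(value, 4) for key, value in metrics.items()})
print("All-negative baseline accuracy:", round(baseline_accuracy, 4))
```

The baseline line is the reason accuracy is a weak yardstick here: a model that flags no loans at all would still score above 0.99.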

**Recommendations for Lenders**:
*   Model Selection: LightGBM outperforms SVM on charge-off recall (1.00 vs. 0.77) and F1-score (0.97 vs. 0.87), while SVM has slightly higher precision (1.00 vs. 0.95); overall accuracy is effectively the same. If a missed charge-off costs a lender more than a false alarm, LightGBM is the better candidate for production.
*  Risk Assessment: The notebook does not measure feature importance, so it cannot yet say which factors drive charge-offs. Poor credit history and a high debt-to-income (DTI) ratio are plausible candidates; confirm them against the model before adjusting lending criteria or interest rates for such applicants.
*  Model Monitoring and Retraining: Regularly retrain the model with new data to maintain its performance and adapt to changing loan patterns. Continuous monitoring of model predictions and actual outcomes is crucial to identify potential drifts and ensure accuracy.
*  Data Quality: Ensure data quality and completeness for accurate predictions. Address missing values and outliers effectively during data preprocessing.
*  Explainability: Consider using techniques like SHAP (SHapley Additive exPlanations) to understand the model's predictions and gain insights into the factors driving loan charge-offs. This can help in making informed lending decisions and explaining them to stakeholders.
*  Compliance: Adhere to relevant regulations and guidelines related to lending and credit risk assessment.
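
For the monitoring recommendation, one common way to watch for drift in a single feature is the Population Stability Index (PSI), which compares the binned distribution of recent data against the training data. PSI is not used anywhere in the notebook; the sketch below is an assumption about how such a check might look, run on synthetic numbers standing in for a feature such as DTI. The function name and parameters are hypothetical.

```python
import numpy as np

def population_stability_index(reference, current, n_bins=10, eps=1e-6):
    """PSI between a reference sample and a current sample of one feature."""
    reference = np.asarray(reference, dtype=float)
    current = np.asarray(current, dtype=float)

    # Interior bin edges taken from quantiles of the reference sample
    quantiles = np.linspace(0.0, 1.0, n_bins + 1)[1:-1]
    inner_edges = np.quantile(reference, quantiles)

    # Assign each value to a bin index in 0 .. n_bins - 1 and count
    ref_bins = np.searchsorted(inner_edges, reference, side="right")
    cur_bins = np.searchsorted(inner_edges, current, side="right")
    ref_frac = np.bincount(ref_bins, minlength=n_bins) / len(reference)
    cur_frac = np.bincount(cur_bins, minlength=n_bins) / len(current)

    # Clip to avoid log(0) when a bin is empty
    ref_frac = np.clip(ref_frac, eps, None)
    cur_frac = np.clip(cur_frac, eps, None)
    return float(np.sum((cur_frac - ref_frac) * np.log(cur_frac / ref_frac)))

# Synthetic stand-ins for a feature at training time and later
rng = np.random.default_rng(42)
train_values = rng.normal(18.0, 8.0, size=5000)
recent_same = rng.normal(18.0, 8.0, size=5000)     # same distribution
recent_shifted = rng.normal(24.0, 8.0, size=5000)  # mean has moved

psi_same = population_stability_index(train_values, recent_same)
psi_shifted = population_stability_index(train_values, recent_shifted)
print("PSI, same distribution:", psi_same)
print("PSI, shifted distribution:", psi_shifted)
```

A rising PSI on important features would be one signal to schedule retraining; the alert threshold is a policy choice and should be set alongside checks on actual charge-off outcomes.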

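This performance report is based on the evaluation metrics and the specific dataset used in the provided code. The charge-off metrics rest on only 57 positive test cases, and the feature set was not screened for columns recorded after a loan's outcome is known, so the near-perfect scores should be checked for target leakage and validated further before real-world deployment.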
