### Task:
- Predict the probability that a loan borrower will default based on their financial profile.
- Use this to estimate the expected financial loss on any future loan application.

### Business Value:
- Improves loan portfolio risk assessment, reducing potential default losses by prioritizing safe borrowers.
- Allows proactive credit decision-making that can save up to $X per 1,000 loans based on predicted loss.

### Problem Solved:
- Eliminates guesswork from loan approvals by turning borrower data into measurable risk scores.
- Addresses the challenge of rising defaults by flagging high-risk applicants before issuing credit.

### Model Metrics (Logistic Regression):
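- Achieved an AUC of 0.78 on the held-out test set, indicating a reasonable ability to rank defaulters above non-defaulters on imbalanced data.
- The corresponding Gini coefficient is 0.56. The F1-score of 0.31 at the default 0.5 threshold reflects low recall (0.21), so most actual defaults are missed at that threshold.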

### Import the libraries

In [2]:
# Important Libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (
    roc_auc_score, precision_score, recall_score,
    f1_score, confusion_matrix, roc_curve
)


- Import data

In [3]:
# Load data
df = pd.read_csv("Task 3 and 4_Loan_Data.csv")

- Clean Data
- Drop unnecessary columns
1. customer_id: A unique identifier with no predictive value for default risk.
2. credit_lines_outstanding: Treated as redundant because it is correlated with loan_amt_outstanding, which risks multicollinearity.
3. total_debt_outstanding: May overlap with other features such as loan_amt_outstanding, which adds multicollinearity and reduces model interpretability.

In [4]:
df.drop(columns=["customer_id", "credit_lines_outstanding", "total_debt_outstanding"], inplace=True)


- Add random noise
- To test for overfitting: if the model assigns high importance to a completely meaningless feature (random noise), it may be learning noise instead of signal.
- It acts as a baseline dummy feature. A good model should assign near-zero importance to it; if it does not, the model might be too complex or the data too small.

In [5]:
# Add random noise and use the same seed to be able to run the same experiment with the same numbers
np.random.seed(42)
df["random_noise"] = np.random.rand(len(df))
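The noise check described above can be sketched on synthetic data. This is a minimal illustration, not the notebook's own analysis: the `signal` feature, the sample size, and the label rule are assumptions made up for the example; only the `random_noise` column mirrors the cell above.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Synthetic data (assumption): one informative feature plus pure noise
rng = np.random.default_rng(42)
n = 2000
toy = pd.DataFrame({
    "signal": rng.normal(size=n),
    "random_noise": rng.random(n),
})
# The label depends only on `signal`, with some label noise added
toy_y = (toy["signal"] + rng.normal(scale=0.5, size=n) > 0).astype(int)

toy_rf = RandomForestClassifier(n_estimators=100, random_state=42)
toy_rf.fit(toy, toy_y)

# Impurity-based importances sum to 1 across features
importances = pd.Series(toy_rf.feature_importances_, index=toy.columns)
print(importances.sort_values(ascending=False))
```

Impurity-based importances tend to give continuous noise features a non-zero share when trees are fully grown, so the comparison is relative: a noise share that approaches the share of a genuinely informative feature is the warning sign.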

- Separate the Features from labels

In [6]:
X = df.drop(columns=["default"])

- Target Column

In [8]:
y = df["default"]

- Features list

In [9]:
feature_columns = X.columns.tolist()

- Train and test split
- I am using stratify=y to make sure the split keeps the same balance of classes (like yes/no or 0/1) in both training and test data.

In [10]:
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.2, random_state=42)
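To see what stratification does, the following sketch splits a toy label array and compares the positive rate in each part. The label counts (200 positives out of 1,000) and the names `toy_labels` and `toy_features` are assumptions for illustration only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy labels (assumption): 20% positives, an imbalanced target
toy_labels = np.array([1] * 200 + [0] * 800)
toy_features = np.arange(len(toy_labels)).reshape(-1, 1)

_, _, y_tr, y_te = train_test_split(
    toy_features, toy_labels, stratify=toy_labels, test_size=0.2, random_state=42
)

# With stratification both rates stay close to the overall 0.20
print(f"train positive rate: {y_tr.mean():.3f}")
print(f"test positive rate:  {y_te.mean():.3f}")
```

Without `stratify`, the two rates can drift apart by chance, which matters most when the positive class is rare.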

- Scale the features
- I am scaling the features to make sure all values are on a similar scale, so that no feature dominates the others and the model can learn better and faster.

In [11]:
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
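As a rough illustration of what the scaler computes, the sketch below standardizes a few rows by hand with NumPy. The toy values are assumptions; the point is that the mean and standard deviation come from the training rows only and are then reused for the test rows, which is why the cell above calls fit_transform on the training set and transform on the test set.

```python
import numpy as np

# Toy columns (assumption): income in dollars and years employed
train = np.array([[30000.0, 1.0], [60000.0, 4.0], [120000.0, 10.0]])
test = np.array([[45000.0, 2.0]])

# Statistics come from the training rows only
mean = train.mean(axis=0)
std = train.std(axis=0)  # population std (ddof=0), as StandardScaler uses

train_scaled = (train - mean) / std
test_scaled = (test - mean) / std  # reuse training statistics, no refit

print(train_scaled.mean(axis=0))  # approximately 0 per column
print(train_scaled.std(axis=0))   # approximately 1 per column
```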

- Model training
- Logistic Regression
- Random Forest Classifier

In [13]:
lr_model = LogisticRegression(max_iter=10000)
lr_model.fit(X_train_scaled, y_train)

rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)

RandomForestClassifier parameters (cell output):
n_estimators: 100
criterion: 'gini'
max_depth: None
min_samples_split: 2
min_samples_leaf: 1
min_weight_fraction_leaf: 0.0
max_features: 'sqrt'
max_leaf_nodes: None
min_impurity_decrease: 0.0
bootstrap: True


- Important Metrics
- AUC: Measures how well the model separates the classes; higher means better distinction between positive and negative.
- Gini: A rescaling of AUC (Gini = 2 × AUC − 1) that shows the model's discriminatory power; higher is better.
- Precision: Out of all predicted positives, how many were actually correct.
- Recall: Out of all actual positives, how many did the model correctly find.
- F1 Score: Harmonic mean of precision and recall; balances false positives and false negatives.
- Confusion Matrix: Shows counts of true/false positives and negatives to summarize prediction results.
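The threshold-based metrics can be recomputed by hand from a confusion matrix. The sketch below uses the logistic regression counts printed in this notebook's output and assumes scikit-learn's layout, where rows are actual classes and columns are predicted classes.

```python
# Counts from the logistic regression confusion matrix in this notebook:
# [[TN, FP],
#  [FN, TP]]
tn, fp = 1583, 47
fn, tp = 293, 77

precision = tp / (tp + fp)   # predicted defaults that were real defaults
recall = tp / (tp + fn)      # real defaults that the model caught
f1 = 2 * precision * recall / (precision + recall)

print(f"Precision: {precision:.4f}")
print(f"Recall:    {recall:.4f}")
print(f"F1 Score:  {f1:.4f}")
```

The low recall comes from the 293 defaults that were predicted as non-defaults, which is what pulls the F1 score down even though precision is moderate.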

In [14]:
# Evaluate models
def gini_coefficient(y_true, y_prob):
    return 2 * roc_auc_score(y_true, y_prob) - 1

lr_probs = lr_model.predict_proba(X_test_scaled)[:, 1]
lr_preds = lr_model.predict(X_test_scaled)

rf_probs = rf_model.predict_proba(X_test)[:, 1]
rf_preds = rf_model.predict(X_test)

metrics = {
    "Logistic Regression": {
        "AUC": roc_auc_score(y_test, lr_probs),
        "Gini": gini_coefficient(y_test, lr_probs),
        "Precision": precision_score(y_test, lr_preds),
        "Recall": recall_score(y_test, lr_preds),
        "F1 Score": f1_score(y_test, lr_preds),
        "Confusion Matrix": confusion_matrix(y_test, lr_preds)
    },
    "Random Forest": {
        "AUC": roc_auc_score(y_test, rf_probs),
        "Gini": gini_coefficient(y_test, rf_probs),
        "Precision": precision_score(y_test, rf_preds),
        "Recall": recall_score(y_test, rf_preds),
        "F1 Score": f1_score(y_test, rf_preds),
        "Confusion Matrix": confusion_matrix(y_test, rf_preds)
    }
}

- Create the loss calculation
- Expected loss on a loan estimates how much money you might lose if the borrower defaults.
- It multiplies the probability of default by the portion of the loan not recovered and by the loan amount.
- Expected Loss = Probability of Default × (1 − Recovery Rate) × Loan Amount
- The default recovery_rate of 0.1 assumes that 10% of the loan is recovered after a default.

In [15]:
def calculate_expected_loss(prob_default, loan_amount, recovery_rate=0.1):
    return prob_default * (1 - recovery_rate) * loan_amount
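A short worked example of the formula, applied to a few loans and summed into a portfolio figure. The three loans are hypothetical values chosen for illustration; the function is repeated here only so the block runs on its own.

```python
def calculate_expected_loss(prob_default, loan_amount, recovery_rate=0.1):
    # Same formula as the cell above: PD x (1 - recovery rate) x loan amount
    return prob_default * (1 - recovery_rate) * loan_amount

# Hypothetical loans (assumption): (probability of default, loan amount)
toy_loans = [(0.02, 5000), (0.25, 10000), (0.60, 20000)]

losses = [calculate_expected_loss(p, amt) for p, amt in toy_loans]
for (p, amt), loss in zip(toy_loans, losses):
    print(f"PD={p:.2f}, loan=${amt:,}: expected loss ${loss:,.2f}")

print(f"Portfolio expected loss: ${sum(losses):,.2f}")
```

For the second loan this is 0.25 × 0.9 × 10,000 = 2,250. Summing per-loan expected losses gives the expected loss of the portfolio.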

- Probability of default and loss
- Why index with [0][1]?
- predict_proba() returns a 2D array like [[prob_no_default, prob_default]], with one row per input sample.
- [0] gets the first (and only) row.
- [1] picks the second value, which is the probability of default (class 1).
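The indexing can be checked on a tiny model. In this sketch the one-feature toy data and the names `toy_X`, `toy_y`, and `toy_model` are assumptions; the column order of predict_proba follows the fitted model's `classes_` attribute.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy data (assumption): one feature, label 1 for larger values
toy_X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
toy_y = np.array([0, 0, 0, 1, 1, 1])

toy_model = LogisticRegression().fit(toy_X, toy_y)

probs = toy_model.predict_proba(np.array([[4.5]]))
print(toy_model.classes_)  # column order of predict_proba
print(probs.shape)         # one row per input sample, one column per class
print(probs[0][1])         # probability of class 1 for the single input row
```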

In [16]:
# Define a function to predict default probability and expected loss
def predict_default_and_loss(model, scaler, input_dict, loan_amount):
    # Build a one-row DataFrame with the features in training order;
    # this keeps the feature names the scaler and forest were fitted with
    input_df = pd.DataFrame(
        [[input_dict[feat] for feat in feature_columns]], columns=feature_columns
    )
    # Logistic regression was trained on scaled features; compare by identity
    if model is lr_model:
        # Scale the features
        input_scaled = scaler.transform(input_df)
        # Column 1 of predict_proba is the probability of default (class 1)
        prob_default = model.predict_proba(input_scaled)[0][1]
    # Random forest was trained on unscaled features
    else:
        prob_default = model.predict_proba(input_df)[0][1]
    # Calculate the expected loss from the unrounded probability
    # (the result is named prob_default so it does not shadow pandas' pd)
    expected_loss = calculate_expected_loss(prob_default, loan_amount)
    # Round the probability to four decimals and the loss to two
    return round(prob_default, 4), round(expected_loss, 2)

- Function to interpret results

In [17]:
# Interpret the probability of default
def interpret_result(prob_default, expected_loss, scenario_name):
    # Below 0.2: low risk
    if prob_default < 0.2:
        risk_level = "Low risk of default"
    # From 0.2 up to (but not including) 0.5: moderate risk
    elif prob_default < 0.5:
        risk_level = "Moderate risk of default"
    # 0.5 or higher: high risk
    else:
        risk_level = "High risk of default"
    return (
        f"📌 {scenario_name}:\n"
        f"• Predicted Probability of Default: {prob_default * 100:.2f}%\n"
        f"• Expected Financial Loss: ${expected_loss:,.2f}\n"
        f"• Risk Assessment: {risk_level}\n"
    )


In [20]:

# Scenarios
scenario_inputs = {
    "Scenario 1 - High income, low loan, stable job": {
        "income": 120000,
        "loan_amt_outstanding": 2000,
        "years_employed": 10,
        "fico_score": 750,
        "random_noise": 0.25
    },
    "Scenario 2 - Low income, high loan, short employment": {
        "income": 30000,
        "loan_amt_outstanding": 15000,
        "years_employed": 1,
        "fico_score": 580,
        "random_noise": 0.77
    },
    "Scenario 3 - Mid income, average loan, decent credit": {
        "income": 60000,
        "loan_amt_outstanding": 7000,
        "years_employed": 4,
        "fico_score": 670,
        "random_noise": 0.55
    }
}

# Scenario outputs; expected loss is computed on a loan of 10000 for every scenario
for scenario_name, input_data in scenario_inputs.items():
    pd_val, el_val = predict_default_and_loss(lr_model, scaler, input_data, loan_amount=10000)
    print(interpret_result(pd_val, el_val, scenario_name))

# Display metrics
print("\n📊 Model Performance Summary:")
for model_name, stats in metrics.items():
    print(f"\n🔍 {model_name}")
    for metric, value in stats.items():
        if metric == "Confusion Matrix":
            print(f"{metric}:\n{value}")
        else:
            print(f"{metric}: {value:.4f}")


📌 Scenario 1 - High income, low loan, stable job:
• Predicted Probability of Default: 0.07%
• Expected Financial Loss: $6.38
• Risk Assessment: Low risk of default

📌 Scenario 2 - Low income, high loan, short employment:
• Predicted Probability of Default: 99.26%
• Expected Financial Loss: $8,933.18
• Risk Assessment: High risk of default

📌 Scenario 3 - Mid income, average loan, decent credit:
• Predicted Probability of Default: 28.76%
• Expected Financial Loss: $2,588.13
• Risk Assessment: Moderate risk of default


📊 Model Performance Summary:

🔍 Logistic Regression
AUC: 0.7818
Gini: 0.5635
Precision: 0.6210
Recall: 0.2081
F1 Score: 0.3117
Confusion Matrix:
[[1583   47]
 [ 293   77]]

🔍 Random Forest
AUC: 0.7442
Gini: 0.4883
Precision: 0.5417
Recall: 0.2108
F1 Score: 0.3035
Confusion Matrix:
[[1564   66]
 [ 292   78]]


