# Decision Trees for Loan Approval

**Dataset**: Loan Approval Classification Data (Kaggle)

---
## Part 1 – Understanding the Problem & Dataset

### Business Question
Banks want to decide whether to approve or reject a loan application.
- **Approve "1"**: the applicant is trustworthy enough.
- **Reject "0"**: too risky for the bank.

We will try to predict loan approval (binary classification) using features like:
- Age of the applicant
- Gender
- Credit score
- Loan intent (education, personal, home improvement, …)
- Income and loan amount
- Previous defaults
- Other features (see the dataset page on Kaggle)

This is a business decision problem: approving risky loans costs the bank money, while rejecting too many loans loses potential customers.

### Step 1 – Load the dataset

In [None]:
import pandas as pd

# Load dataset
df = pd.read_csv("loan_data.csv")

# First look at the dataset
df.head()

### Step 2 – Dataset structure

In [None]:
df.info()

### Step 3 – Target distribution

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

sns.countplot(x="loan_status", data=df)
plt.title("Loan Status Distribution (0=Rejected, 1=Approved)")
plt.show()
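
To complement the count plot with exact numbers, the class shares can be computed directly. The sketch below uses a small made-up frame (`toy`) instead of `loan_data.csv`, so the values are illustrative only; on the real data the same calls would be applied to `df`.

```python
import pandas as pd

# Toy stand-in for the real data; the labels are invented for illustration
toy = pd.DataFrame({"loan_status": [0, 0, 0, 1, 0, 1, 0, 0, 1, 0]})

# Absolute counts and relative shares of each class
counts = toy["loan_status"].value_counts().sort_index()
shares = toy["loan_status"].value_counts(normalize=True).sort_index()

print(pd.DataFrame({"count": counts, "share": shares}))
```

A strongly imbalanced target is a hint that plain accuracy may be misleading and that stratified splits are worth using.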

### Step 4 – Gender distribution

In [None]:
gender_counts = df['person_gender'].value_counts()
plt.figure(figsize=(6,6))
plt.pie(gender_counts, labels=gender_counts.index, autopct='%1.1f%%', 
        startangle=140, colors=['#66b3ff','#ff9999'])
plt.title("Distribution of Genders")
plt.axis("equal")
plt.show()

### Step 5 – Credit score distribution

In [None]:
sns.histplot(df['credit_score'], bins=30, kde=True)
plt.title("Distribution of Credit Scores")
plt.show()

### 🔎 Step 6 – Loan amount vs approval

In [None]:
plt.figure(figsize=(10,6))
sns.kdeplot(data=df[df['loan_status']==1], x='loan_amnt', label='Approved',
            fill=True, color='green')
sns.kdeplot(data=df[df['loan_status']==0], x='loan_amnt', label='Declined',
            fill=True, color='red')
plt.title("Loan Amount Distribution by Loan Status")
plt.xlabel("Loan Amount")
plt.ylabel("Density")
plt.legend()
plt.show()

### Step 7 – Correlation analysis

In [None]:
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
df1 = df.copy()
df1['person_gender'] = le.fit_transform(df1['person_gender'])
df1['previous_loan_defaults_on_file'] = le.fit_transform(df1['previous_loan_defaults_on_file'])
df1 = pd.get_dummies(
    df1,
    columns=['person_education', 'person_home_ownership', 'loan_intent'],
    drop_first=True,
    dtype=int,
)
correlation_matrix = df1.corr()

plt.figure(figsize=(12,10))
sns.heatmap(correlation_matrix, annot=True, fmt=".2f", cmap="coolwarm")
plt.title("Correlation Heatmap")
plt.show()

---
## Part 2 – Building a Decision Tree Classifier

### Step 1 – Splitting the dataset

We don't want to evaluate our model on the same data we train it on: the score would be overly optimistic and would hide overfitting.

So we split into Train, Validation, Test sets:
- **Train (60%)** → learn model parameters
- **Validation (20%)** → tune hyperparameters (max depth, criterion, …)
- **Test (20%)** → final evaluation, untouched until the end

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
import numpy as np

def make_data(df, cols=None, verbos=False):
    # Features & Target
    X = df.drop(columns=["loan_status"])
    if cols:
        X = X[cols]
    y = df["loan_status"]

    # Splits
    TEST_SIZE = 0.2
    VAL_SIZE = 0.2
    RANDOM_STATE = 42

    # First split off test
    X_temp, X_test, y_temp, y_test = train_test_split(
        X, y, test_size=TEST_SIZE, stratify=y, random_state=RANDOM_STATE
    )

    # Then split temp into train/val
    val_size_adjusted = VAL_SIZE / (1 - TEST_SIZE)
    X_train, X_val, y_train, y_val = train_test_split(
        X_temp, y_temp, test_size=val_size_adjusted, stratify=y_temp, random_state=RANDOM_STATE
    )

    if verbos:
        print(
            f"Train: {len(X_train)} ({(1-TEST_SIZE-VAL_SIZE):.0%}) | "
            f"Val: {len(X_val)} ({VAL_SIZE:.0%}) | "
            f"Test: {len(X_test)} ({TEST_SIZE:.0%})"
        )

    # Identify column types
    cat_cols = X.select_dtypes(include=["object", "category"]).columns.tolist()
    num_cols = X.select_dtypes(include=[np.number]).columns.tolist()  # np.number already covers float64/int64

    # Preprocessor: OneHot for categorical, passthrough for numeric
    preprocess = ColumnTransformer(
        transformers=[
            ("cat", OneHotEncoder(handle_unknown="ignore", sparse_output=False), cat_cols),
            ("num", "passthrough", num_cols),
        ]
    )
    return X_train, X_val, X_test, y_train, y_val, y_test, cat_cols, num_cols, preprocess
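
The `val_size_adjusted` step is easy to misread, so here is a minimal check of the arithmetic on synthetic data (1000 made-up rows with 20% positives, not the loan dataset). Because the second split is taken from the remaining 80%, asking for 0.2 / 0.8 = 0.25 of it yields 20% of the full data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

TEST_SIZE = 0.2
VAL_SIZE = 0.2

# Synthetic stand-in: 1000 rows, 20% positives (invented for illustration)
X = np.arange(1000).reshape(-1, 1)
y = np.array([1] * 200 + [0] * 800)

# First split off the test set (20% of everything)
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=TEST_SIZE, stratify=y, random_state=42
)

# 0.2 / (1 - 0.2) = 0.25: a quarter of the remaining 80% is 20% of the total
val_size_adjusted = VAL_SIZE / (1 - TEST_SIZE)
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=val_size_adjusted, stratify=y_temp, random_state=42
)

print("sizes:", len(X_train), len(X_val), len(X_test))
print("positive share:", y_train.mean(), y_val.mean(), y_test.mean())
```

Thanks to `stratify`, the share of positives should stay close to 20% in all three subsets.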

### Step 2 – Training a Decision Tree

The helper below defaults to a shallow tree (`max_depth=3`) for interpretability, just like in the Iris demo.

In [None]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report

def model_tree(preprocess, X_train, X_val, y_train, y_val, max_depth=3, criterion="gini", val=True):
    tree_clf = DecisionTreeClassifier(max_depth=max_depth, criterion=criterion, random_state=42)
    model = Pipeline(steps=[("prep", preprocess), ("tree", tree_clf)])
    model.fit(X_train, y_train)

    if val:
        y_pred = model.predict(X_val)
        print(classification_report(y_val, y_pred, digits=3))
    return model

### Step 3 – Train and Evaluate

In [None]:
# Prepare splits
X_train, X_val, X_test, y_train, y_val, y_test, cat_cols, num_cols, preprocess = make_data(df, verbos=True)

# Train a tree (max_depth=5 here, overriding the default of 3)
model = model_tree(preprocess, X_train, X_val, y_train, y_val, max_depth=5, val=True)
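
The validation set exists to tune hyperparameters such as `max_depth`. One possible way to do that is sketched below on synthetic data from `make_classification` (an assumption made so the example runs on its own; the depth grid and the choice of F1 as the metric are also illustrative). On the loan data, the same loop would fit on the training split and score on the validation split.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the loan data
X, y = make_classification(n_samples=2000, n_features=10, n_informative=5, random_state=42)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.25, stratify=y, random_state=42)

# Fit one tree per candidate depth and record train/validation F1
results = {}
for depth in [2, 3, 5, 8, None]:  # None lets the tree grow until leaves are pure
    clf = DecisionTreeClassifier(max_depth=depth, random_state=42).fit(X_tr, y_tr)
    results[depth] = (
        f1_score(y_tr, clf.predict(X_tr)),
        f1_score(y_va, clf.predict(X_va)),
    )

for depth, (f1_train, f1_val) in results.items():
    print(f"max_depth={depth}: train F1={f1_train:.3f} | val F1={f1_val:.3f}")

# Keep the depth with the best validation score
best_depth = max(results, key=lambda d: results[d][1])
print("best depth on validation:", best_depth)
```

A large gap between the training and validation scores for deep trees is the usual sign of overfitting; the test set stays untouched during this search.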

### Step 4 – Visualizing the Tree

We export the fitted tree to Graphviz format to inspect the rules it learned.

In [None]:
from sklearn.tree import export_graphviz
from graphviz import Source
import os

# Recover one-hot encoded feature names
ohe = model.named_steps["prep"].named_transformers_["cat"]
ohe_names = ohe.get_feature_names_out(cat_cols)
feature_names = np.r_[ohe_names, num_cols]

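# Export tree (create the output folder first if it does not exist)
path_file = "DT/images/loan_tree.dot"
os.makedirs(os.path.dirname(path_file), exist_ok=True)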
export_graphviz(
    model.named_steps["tree"],
    out_file=path_file,
    feature_names=feature_names,
    class_names=["Rejected", "Approved"],
    rounded=True,
    filled=True,
    proportion=True,
    impurity=True
)
Source.from_file(path_file)
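
If Graphviz is not installed, the same rules can be read as plain text with scikit-learn's `export_text`. The sketch below is not part of the original notebook: it uses synthetic data and placeholder feature names (`feat_a` to `feat_d`) so that it runs on its own. With the loan model, `model.named_steps["tree"]` and the recovered feature names would presumably be passed instead.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in with placeholder feature names
X, y = make_classification(
    n_samples=500, n_features=4, n_informative=2, n_redundant=0, random_state=0
)
names = ["feat_a", "feat_b", "feat_c", "feat_d"]

clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# One line per split or leaf, indented by depth
rules = export_text(clf, feature_names=names)
print(rules)
```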

---
## Part 3 – Interpreting the Decision Tree

Now that we have trained a Decision Tree, let's understand what it learned:
- Which features are most important?
- How do we visualize the decision boundaries in 2D?

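### Step 1 – Feature Importance

Decision Trees assign an importance score to each feature based on how much its splits reduce impurity (Gini or entropy). In scikit-learn, each split's reduction is weighted by the share of samples reaching that node, and the scores are normalized to sum to 1.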

In [None]:
# Extract feature names after preprocessing, dropping the "cat__" / "num__"
# prefixes that ColumnTransformer adds
feature_names = [name.split("__", 1)[1] for name in model.named_steps['prep'].get_feature_names_out()]

# Get feature importances from classifier
importances = model.named_steps['tree'].feature_importances_ * 100

# Build DataFrame
Importance = pd.DataFrame({'Importance': importances}, index=feature_names)

# Sort and plot
Importance.sort_values('Importance').plot(kind='barh', color='r', figsize=(8,5))
plt.xlabel('Variable Importance (%)')
plt.title("Feature Importance in Decision Tree")
plt.show()
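
As a rough illustration of the impurity reduction behind these scores, here is the Gini arithmetic for a single hypothetical split. The class counts are invented for the example, and `gini` is a helper defined here, not a scikit-learn function.

```python
def gini(counts):
    """Gini impurity 1 - sum(p_k^2) for a list of class counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

# Invented node: 70 rejected, 30 approved, split into two children
parent = [70, 30]
left = [60, 10]
right = [10, 20]

# Children impurities are weighted by the share of samples they receive
n = sum(parent)
weighted_children = (sum(left) / n) * gini(left) + (sum(right) / n) * gini(right)
decrease = gini(parent) - weighted_children

print(f"parent Gini:       {gini(parent):.3f}")
print(f"weighted children: {weighted_children:.3f}")
print(f"impurity decrease: {decrease:.3f}")
```

Summing such decreases over every node that splits on a given feature, then normalizing across features, gives the values reported in `feature_importances_`.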

### Step 2 – Focus on two features for visualization

For easy plotting, let's choose just two numerical features:
- **loan_int_rate** → the loan's interest rate
- **loan_percent_income** → the loan amount as a fraction of the applicant's annual income

This creates a 2D feature space where we can plot decision regions.

In [None]:
cols = ["loan_int_rate", "loan_percent_income"]

X_train, X_val, X_test, y_train, y_val, y_test, cat_cols, num_cols, preprocess = make_data(df, cols=cols)

model = model_tree(preprocess, X_train, X_val, y_train, y_val,
                   max_depth=3, val=True)  # shallow tree for clear boundaries

### Step 3 – Decision Boundary Visualization

We now plot the decision regions learned by the tree.

In [None]:
from utils import plot_decision_boundary_binary

plot_decision_boundary_binary(
    model, X_train[:1000], y_train[:1000], feature_names=cols,
    title="Loan Approval Decision Boundary (max_depth=3)"
)
plt.show()
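
The `utils` module is not shown in this notebook, so the following is only a guess at what such a helper might do, not its actual implementation. The hypothetical `sketch_decision_regions` predicts a class for every point on a grid over the two features and colours the plane accordingly; the data is synthetic so that the example runs on its own.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

def sketch_decision_regions(clf, X, y, feature_names, steps=200):
    """Hypothetical helper: colour the plane by the class a fitted
    two-feature classifier predicts, and overlay the data points."""
    # Build a regular grid covering the range of both features
    xx, yy = np.meshgrid(
        np.linspace(X[:, 0].min(), X[:, 0].max(), steps),
        np.linspace(X[:, 1].min(), X[:, 1].max(), steps),
    )
    # Predict a class for every grid point
    zz = clf.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

    fig, ax = plt.subplots(figsize=(8, 6))
    ax.contourf(xx, yy, zz, alpha=0.3, cmap="coolwarm")
    ax.scatter(X[:, 0], X[:, 1], c=y, cmap="coolwarm", s=10)
    ax.set_xlabel(feature_names[0])
    ax.set_ylabel(feature_names[1])
    return ax, zz

# Synthetic two-feature data standing in for the two loan features
X_demo, y_demo = make_classification(
    n_samples=300, n_features=2, n_informative=2, n_redundant=0, random_state=0
)
clf_demo = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_demo, y_demo)
ax, zz = sketch_decision_regions(clf_demo, X_demo, y_demo, ["feature_0", "feature_1"])
```

Because a tree splits on one feature at a time, the regions are axis-aligned rectangles. Outside a notebook, call `plt.show()` to display the figure.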