# Case Study: Predicting Loan Approval Status

## Objective

The goal of this case study is to predict whether a loan application will be approved based on applicant attributes such as income, loan amount, and credit history. The target variable, Loan_Status, records whether the application was approved, which makes this a binary classification problem.

## Dataset Overview

**The dataset contains the following features:**

- Loan_ID: Unique Loan ID
- Gender: Male/Female
- Married: Applicant married (Yes/No)
- Dependents: Number of dependents
- Education: Applicant Education (Graduate/Not Graduate)
- Self_Employed: Self-employed (Yes/No)
- ApplicantIncome: Applicant income
- CoapplicantIncome: Coapplicant income
- LoanAmount: Loan amount in thousands
- Loan_Amount_Term: Term of loan in months
- Credit_History: Credit history meets guidelines (1/0)
- Property_Area: Urban/Semi-Urban/Rural
- Loan_Status: Loan approved (Y/N)

## Load the Dataset

In [1]:
import pandas as pd

#Load the dataset
df = pd.read_csv('data/Loan_Prediction.csv')

In [2]:
df.head()

Unnamed: 0,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
0,LP001002,Male,No,0,Graduate,No,5849,0.0,,360.0,1.0,Urban,Y
1,LP001003,Male,Yes,1,Graduate,No,4583,1508.0,128.0,360.0,1.0,Rural,N
2,LP001005,Male,Yes,0,Graduate,Yes,3000,0.0,66.0,360.0,1.0,Urban,Y
3,LP001006,Male,Yes,0,Not Graduate,No,2583,2358.0,120.0,360.0,1.0,Urban,Y
4,LP001008,Male,No,0,Graduate,No,6000,0.0,141.0,360.0,1.0,Urban,Y


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 614 entries, 0 to 613
Data columns (total 13 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Loan_ID            614 non-null    object 
 1   Gender             601 non-null    object 
 2   Married            611 non-null    object 
 3   Dependents         599 non-null    object 
 4   Education          614 non-null    object 
 5   Self_Employed      582 non-null    object 
 6   ApplicantIncome    614 non-null    int64  
 7   CoapplicantIncome  614 non-null    float64
 8   LoanAmount         592 non-null    float64
 9   Loan_Amount_Term   600 non-null    float64
 10  Credit_History     564 non-null    float64
 11  Property_Area      614 non-null    object 
 12  Loan_Status        614 non-null    object 
dtypes: float64(4), int64(1), object(8)
memory usage: 62.5+ KB


In [4]:
df.describe()

Unnamed: 0,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History
count,614.0,614.0,592.0,600.0,564.0
mean,5403.459283,1621.245798,146.412162,342.0,0.842199
std,6109.041673,2926.248369,85.587325,65.12041,0.364878
min,150.0,0.0,9.0,12.0,0.0
25%,2877.5,0.0,100.0,360.0,1.0
50%,3812.5,1188.5,128.0,360.0,1.0
75%,5795.0,2297.25,168.0,360.0,1.0
max,81000.0,41667.0,700.0,480.0,1.0


**ApplicantIncome**
- Mean: ~5403, Std Dev: ~6109 - high variation.
- Min: 150, Max: 81000 - extreme outliers present.
- **Action**: Consider a log transformation or capping outliers if a scale-sensitive model is used. Tree-based models need no transformation, because their splits depend only on the ordering of the values.

**CoapplicantIncome**
- Mean: ~1621, Std Dev: ~2926 - high variation.
- Many zero values (no coapplicant).
- Max: 41667 - extreme outliers present.
- **Action**: Treat zeros carefully and consider a log transform for the non-zero values. Again, no transformation is needed for tree-based models.

**LoanAmount**
- Mean: ~146, Std Dev: ~86 - moderately spread.
- Min: 9, Max: 700 - presence of outliers.
- Count: 592 (22 missing values).
- **Action**: The extremes (min = 9, max = 700) pull the mean upward (mean ~146 vs. median 128), so impute missing values with the median, which is robust to outliers and skewness. No further transformation is needed for tree-based models.

**Loan_Amount_Term**
- Mean: ~342, Median: 360 - majority of loans are 30 years (360 months).
- Std Dev: ~65 - limited variation.
- Min: 12 months, Max: 480 months.
- Count: 600 (14 missing values).
- **Action**: Impute missing values with mode (360 months), since most loans follow this standard term.

**Credit_History**
- Mean: ~0.84, Median: 1 - most applicants have a good credit history.
- Binary feature (0 = bad, 1 = good).
- Count: 564 (50 missing values).
- **Action**: Impute missing with mode (1).

---

**Key Insights**
1. ApplicantIncome and CoapplicantIncome show high variation with extreme outliers (likely skewed), but tree-based models handle this well.
2. LoanAmount contains missing values and outliers; median imputation is suitable.
3. Loan_Amount_Term is dominated by 360 months; mode imputation is suitable.
4. Credit_History is a strong predictor but has missing values; mode imputation recommended.

Overall, the dataset has variation, outliers, and missing values. For tree-based methods, only the missing values need to be addressed before modeling; the outliers and skew can be left as they are.
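To make the median-versus-mean argument concrete, here is a minimal sketch on made-up loan amounts (the values are illustrative and are not rows from the dataset). It shows how one extreme value shifts the mean far more than the median.

```python
import statistics

# Hypothetical loan amounts (in thousands); not rows from the dataset
amounts = [100, 110, 120, 128, 135, 150, 168]

# Add one extreme value, equal to the maximum reported by describe()
with_outlier = amounts + [700]

mean_shift = statistics.mean(with_outlier) - statistics.mean(amounts)
median_shift = statistics.median(with_outlier) - statistics.median(amounts)

print(f"Mean shift:   {mean_shift:.1f}")
print(f"Median shift: {median_shift:.1f}")
```

The same reasoning is why a median fill is a reasonable default for a skewed column such as LoanAmount.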

In [5]:
print(df.isnull().sum())

Loan_ID               0
Gender               13
Married               3
Dependents           15
Education             0
Self_Employed        32
ApplicantIncome       0
CoapplicantIncome     0
LoanAmount           22
Loan_Amount_Term     14
Credit_History       50
Property_Area         0
Loan_Status           0
dtype: int64


## Preprocessing

### Step 1: Handle missing values

In [6]:
# Impute missing values

# Categorical columns
for col in ["Gender", "Married", "Self_Employed"]:
    mode_val = df[col].mode()[0]
    df[col] = df[col].fillna(mode_val)

# Dependents is stored as text ('0', '1', '2', '3+')
# Fill with the mode (most common value), then map '3+' to 3 and convert to float
df['Dependents'] = df['Dependents'].fillna(df['Dependents'].mode()[0])
df['Dependents'] = df['Dependents'].replace('3+', 3).astype(float)

# LoanAmount - median
df['LoanAmount'] = df['LoanAmount'].fillna(df['LoanAmount'].median())

# Loan_Amount_Term - mode (360 months)
df['Loan_Amount_Term'] = df['Loan_Amount_Term'].fillna(df['Loan_Amount_Term'].mode()[0])

# Credit_History - mode (most are 1)
df['Credit_History'] = df['Credit_History'].fillna(df['Credit_History'].mode()[0])

In [7]:
print(df.isnull().sum())

Loan_ID              0
Gender               0
Married              0
Dependents           0
Education            0
Self_Employed        0
ApplicantIncome      0
CoapplicantIncome    0
LoanAmount           0
Loan_Amount_Term     0
Credit_History       0
Property_Area        0
Loan_Status          0
dtype: int64


### Step 2: Convert categorical variables to numerical using LabelEncoder

Decision tree algorithms can, in principle, split on categorical values directly, but whether encoding is needed depends on the implementation.
   - Bagging (e.g., Random Forest): scikit-learn's Decision Tree and Random Forest implementations expect numeric input, so categorical variables must be encoded first.
   - Boosting (e.g., Gradient Boosting, AdaBoost): the scikit-learn estimators used here likewise expect numerically encoded features.

Note that LabelEncoder is designed for target labels and assigns integer codes in sorted label order. Applying it to features is harmless for binary columns, but for a multi-valued column such as Property_Area it imposes an alphabetical ordering that carries no real meaning.
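As a small illustration of how the integer codes are assigned, the sketch below encodes a few made-up Property_Area values. The exact spelling of the category values is an assumption, and keeping one encoder per column in a dictionary (`encoders`) is a hypothetical convenience, not something the notebook does.

```python
from sklearn.preprocessing import LabelEncoder

# Illustrative category values; the spelling in the real column is an assumption
property_area = ["Urban", "Rural", "Semiurban", "Urban", "Rural"]

# Keep one encoder per column so each mapping can be inspected or inverted later
encoders = {"Property_Area": LabelEncoder()}
codes = encoders["Property_Area"].fit_transform(property_area)

# classes_ is sorted, and each label's code is its position in classes_
mapping = {str(label): idx
           for idx, label in enumerate(encoders["Property_Area"].classes_)}
print(mapping)
print(list(codes))

# inverse_transform recovers the original labels from the integer codes
restored = list(encoders["Property_Area"].inverse_transform(codes))
print(restored)
```

This is consistent with the encoded `df.head()` output in this notebook, where Urban appears as 2 and Rural as 0.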

In [8]:
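from sklearn.preprocessing import LabelEncoder

# One encoder instance is reused; fit_transform refits it on each column,
# so only the mapping of the last column (Loan_Status) is retained afterwards.
label_encoder = LabelEncoder()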
df['Gender'] = label_encoder.fit_transform(df['Gender'])
df['Married'] = label_encoder.fit_transform(df['Married'])
df['Education'] = label_encoder.fit_transform(df['Education'])
df['Self_Employed'] = label_encoder.fit_transform(df['Self_Employed'])
df['Property_Area'] = label_encoder.fit_transform(df['Property_Area'])
df['Loan_Status'] = label_encoder.fit_transform(df['Loan_Status'])  # Target variable

In [9]:
df.head()

Unnamed: 0,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
0,LP001002,1,0,0.0,0,0,5849,0.0,128.0,360.0,1.0,2,1
1,LP001003,1,1,1.0,0,0,4583,1508.0,128.0,360.0,1.0,0,0
2,LP001005,1,1,0.0,0,1,3000,0.0,66.0,360.0,1.0,2,1
3,LP001006,1,1,0.0,1,0,2583,2358.0,120.0,360.0,1.0,2,1
4,LP001008,1,0,0.0,0,0,6000,0.0,141.0,360.0,1.0,2,1


### Step 3: Split the Data

In [10]:
from sklearn.model_selection import train_test_split
X = df.drop(columns=['Loan_ID', 'Loan_Status'])  # Features
y = df['Loan_Status']  # Target variable

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

In [11]:
X_train.head()

Unnamed: 0,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area
553,1,1,0.0,1,0,2454,2333.0,181.0,360.0,0.0,2
601,1,1,0.0,1,0,2894,2792.0,155.0,360.0,1.0,0
261,1,0,0.0,0,0,2060,2209.0,134.0,360.0,1.0,1
496,1,1,0.0,1,0,2600,1700.0,107.0,360.0,1.0,0
529,1,0,0.0,1,0,6783,0.0,130.0,360.0,1.0,1


In [12]:
y_train.head()

553    0
601    1
261    1
496    1
529    1
Name: Loan_Status, dtype: int64

## Decision Tree Classifier

### With Entropy

In [13]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Initialize the model
# The model will use Information Gain (based on entropy) to decide how to split nodes.
dt_entropy_model = DecisionTreeClassifier(criterion='entropy', random_state=42)

# Train the model
dt_entropy_model.fit(X_train, y_train)

# Predictions and evaluation
y_pred_entropy = dt_entropy_model.predict(X_test)
accuracy_entropy = accuracy_score(y_test, y_pred_entropy)

print("Decision Tree with Entropy - Accuracy: ", accuracy_entropy)
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred_entropy))
print("Classification Report:\n", classification_report(y_test, y_pred_entropy))

Decision Tree with Entropy - Accuracy:  0.7135135135135136
Confusion Matrix:
 [[33 32]
 [21 99]]
Classification Report:
               precision    recall  f1-score   support

           0       0.61      0.51      0.55        65
           1       0.76      0.82      0.79       120

    accuracy                           0.71       185
   macro avg       0.68      0.67      0.67       185
weighted avg       0.70      0.71      0.71       185
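The figures in the report above can be re-derived by hand from the confusion matrix, which helps when reading the later reports. In scikit-learn's layout, rows are the actual classes and columns are the predicted classes. The helper name `per_class_metrics` is introduced here only for illustration.

```python
# Confusion matrix reported for the entropy tree: rows = actual, columns = predicted
cm = [[33, 32],
      [21, 99]]

def per_class_metrics(cm, k):
    """Precision, recall and F1 for class k of a square confusion matrix."""
    tp = cm[k][k]
    predicted_k = sum(row[k] for row in cm)  # column total
    actual_k = sum(cm[k])                    # row total (the "support")
    precision = tp / predicted_k
    recall = tp / actual_k
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Accuracy is the diagonal (correct predictions) divided by all test rows
accuracy = (cm[0][0] + cm[1][1]) / sum(map(sum, cm))
print(f"accuracy = {accuracy:.4f}")

for k in (0, 1):
    p, r, f = per_class_metrics(cm, k)
    print(f"class {k}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

Class 0 is "not approved" and class 1 is "approved", so the low recall for class 0 means that many rejected applications are predicted as approved.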



### With Gini Index
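Both criteria measure how mixed the classes are at a node: Gini impurity is 1 minus the sum of squared class proportions, and entropy is the negative sum of p * log2(p). The sketch below computes both for a few hypothetical class counts (the counts are made up for illustration).

```python
import math

def gini(counts):
    """Gini impurity: 1 - sum(p_k^2) over the class proportions p_k."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def entropy(counts):
    """Shannon entropy in bits: sum(-p_k * log2(p_k)), skipping empty classes."""
    total = sum(counts)
    return sum(-(c / total) * math.log2(c / total) for c in counts if c > 0)

# Hypothetical class counts [rejected, approved] at three nodes
for counts in ([50, 50], [20, 80], [0, 100]):
    print(counts, "gini =", round(gini(counts), 3), "entropy =", round(entropy(counts), 3))
```

Both measures are largest for an evenly mixed node and zero for a pure node, so they often select similar splits even though the resulting trees can differ.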

In [14]:
dt_gini_model = DecisionTreeClassifier(criterion='gini', random_state=42)
dt_gini_model.fit(X_train, y_train)

# Predictions and evaluation
y_pred_gini = dt_gini_model.predict(X_test)
accuracy_gini = accuracy_score(y_test, y_pred_gini)

print("Decision Tree with Gini Index - Accuracy: ", accuracy_gini)
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred_gini))
print("Classification Report:\n", classification_report(y_test, y_pred_gini))

Decision Tree with Gini Index - Accuracy:  0.6432432432432432
Confusion Matrix:
 [[30 35]
 [31 89]]
Classification Report:
               precision    recall  f1-score   support

           0       0.49      0.46      0.48        65
           1       0.72      0.74      0.73       120

    accuracy                           0.64       185
   macro avg       0.60      0.60      0.60       185
weighted avg       0.64      0.64      0.64       185



### Pruning Techniques

Pruning helps to reduce the size of the decision tree and prevent overfitting.

* Pre-Pruning: Limit the depth or number of samples required for a split.
* Post-Pruning: Perform pruning after the tree is fully grown, removing nodes that do not improve the model’s performance.
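Post-pruning is not demonstrated in this notebook. One way to do it in scikit-learn is cost-complexity pruning via the `ccp_alpha` parameter. The following is a minimal sketch on synthetic stand-in data (an assumption made so that the example runs on its own); in the notebook, `X_train` and `y_train` would take the place of the generated arrays. Thinning the candidate alphas with `[::3]` is an arbitrary choice to keep the search short.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption); not the loan dataset
X, y = make_classification(n_samples=400, n_features=10, n_informative=4,
                           flip_y=0.1, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)

# Grow the full tree, then compute the candidate alphas along its pruning path
full_tree = DecisionTreeClassifier(random_state=42).fit(X_tr, y_tr)
path = full_tree.cost_complexity_pruning_path(X_tr, y_tr)

# Drop the last alpha (it prunes down to the root) and thin the rest;
# clip at zero to guard against tiny negative values from rounding
candidate_alphas = [float(max(a, 0.0)) for a in path.ccp_alphas[:-1][::3]]

# Pick the alpha with the best 5-fold cross-validated accuracy on the training data
best_alpha = max(
    candidate_alphas,
    key=lambda a: cross_val_score(
        DecisionTreeClassifier(ccp_alpha=a, random_state=42), X_tr, y_tr, cv=5
    ).mean(),
)

pruned_tree = DecisionTreeClassifier(ccp_alpha=best_alpha, random_state=42).fit(X_tr, y_tr)
print("Nodes before/after pruning:",
      full_tree.tree_.node_count, pruned_tree.tree_.node_count)
print("Test accuracy before/after:",
      full_tree.score(X_te, y_te), pruned_tree.score(X_te, y_te))
```

Larger values of `ccp_alpha` remove more of the tree, so the parameter plays a role similar to `max_depth` but is applied after the tree has been grown.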


In [15]:
# Train a Decision Tree with pre-pruning:

# - max_depth=4 - limits tree growth to avoid overfitting
# - min_samples_split=10 - a node must have at least 10 samples to split further
# - criterion defaults to "gini" (Gini Impurity)
dt_pruned = DecisionTreeClassifier(max_depth=4, min_samples_split=10, random_state=42)
dt_pruned.fit(X_train, y_train)

# Make predictions on test data
y_pred_pruned = dt_pruned.predict(X_test)

# Evaluate model performance
accuracy_pruned = accuracy_score(y_test, y_pred_pruned)
print("Decision Tree with Pre-Pruning - Accuracy:", accuracy_pruned)
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred_pruned))
print("Classification Report:\n", classification_report(y_test, y_pred_pruned))

Decision Tree with Pre-Pruning - Accuracy: 0.7405405405405405
Confusion Matrix:
 [[ 22  43]
 [  5 115]]
Classification Report:
               precision    recall  f1-score   support

           0       0.81      0.34      0.48        65
           1       0.73      0.96      0.83       120

    accuracy                           0.74       185
   macro avg       0.77      0.65      0.65       185
weighted avg       0.76      0.74      0.70       185



### Hyperparameter Tuning

In [16]:
from sklearn.model_selection import GridSearchCV

# Define the parameter grid for hyperparameter tuning
param_grid = {
    'max_depth': [3, 5, 7, 9],
    'min_samples_split': [2, 5, 10],
    'criterion': ['gini', 'entropy']
}

# Perform Grid Search with 5-fold cross-validation
grid_search_dt = GridSearchCV(DecisionTreeClassifier(random_state=42), param_grid, cv=5)
grid_search_dt.fit(X_train, y_train)

# Extract best model and evaluate
print("Best Parameters: ", grid_search_dt.best_params_)
best_dt = grid_search_dt.best_estimator_
y_pred_best_dt = best_dt.predict(X_test)
accuracy_best_dt = accuracy_score(y_test, y_pred_best_dt)

print("Best Decision Tree - Accuracy: ", accuracy_best_dt)


Best Parameters:  {'criterion': 'gini', 'max_depth': 3, 'min_samples_split': 2}
Best Decision Tree - Accuracy:  0.7783783783783784



This code tunes a Decision Tree classifier with GridSearchCV. It evaluates every combination of max_depth, min_samples_split, and criterion using 5-fold cross-validation on the training data. GridSearchCV then refits the best combination on the full training set (its default behavior), and that model is used to predict on the test set. Finally, the best parameters and the test accuracy are printed.
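To give a sense of how much work the search does, the sketch below enumerates the same grid with the standard library and counts the model fits. It only counts combinations and does not train anything.

```python
from itertools import product

# Same grid as in the cell above
param_grid = {
    'max_depth': [3, 5, 7, 9],
    'min_samples_split': [2, 5, 10],
    'criterion': ['gini', 'entropy'],
}
cv_folds = 5

# Every combination of the listed values is one candidate model
candidates = [dict(zip(param_grid, values))
              for values in product(*param_grid.values())]

n_candidates = len(candidates)
n_cv_fits = n_candidates * cv_folds  # one fit per candidate per fold
n_total_fits = n_cv_fits + 1         # plus the final refit of the best candidate

print(n_candidates, n_cv_fits, n_total_fits)
```

The count grows multiplicatively with each added parameter, which is why the grids for the ensemble models later in this notebook are kept small.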

The selected max_depth of 3 shows how limiting tree complexity curbs overfitting: test accuracy rises from 0.64-0.71 for the unpruned trees to 0.78.

## Bagging with Random Forest

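Bagging (Bootstrap Aggregating) reduces variance by training multiple models on bootstrap samples of the training data (rows drawn with replacement) and combining their predictions. Random Forest additionally considers only a random subset of features at each split.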
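The two ingredients, bootstrap sampling and aggregation, can be sketched with the standard library alone. The votes below are hypothetical, and `majority_vote` is an illustrative helper; scikit-learn's Random Forest averages the trees' predicted class probabilities rather than counting hard votes.

```python
import random
from collections import Counter

rng = random.Random(42)
n_train = 429  # 614 rows minus the 185 test rows shown in the evaluation reports

# One bootstrap sample: n_train row indices drawn with replacement
sample = [rng.randrange(n_train) for _ in range(n_train)]

# Because of the replacement, some rows repeat and others are left out
unique_fraction = len(set(sample)) / n_train
print(f"Unique rows in one bootstrap sample: {unique_fraction:.1%}")

def majority_vote(predictions):
    """Aggregate the class predictions of several models for one applicant."""
    return Counter(predictions).most_common(1)[0][0]

# Hypothetical votes from five trees for a single applicant (1 = approved)
print(majority_vote([1, 0, 1, 1, 0]))
```

On average a bootstrap sample contains roughly 63% of the distinct rows, so each tree sees a different view of the training data.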

In [17]:
from sklearn.ensemble import RandomForestClassifier

# Initialize Random Forest:
# - n_estimators=100 - build 100 decision trees
# - bootstrap=True   - sample data with replacement (bagging)
# - random_state=42  - reproducibility
rf_model = RandomForestClassifier(n_estimators=100, random_state=42, bootstrap=True)

# Train on training data
rf_model.fit(X_train, y_train)

# Make predictions on test data
y_pred_rf = rf_model.predict(X_test)

# Evaluate performance
accuracy_rf = accuracy_score(y_test, y_pred_rf)
print("Random Forest (Bagging) - Accuracy:", accuracy_rf)

# Show confusion matrix (TP, TN, FP, FN counts)
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred_rf))

# Show precision, recall, f1-score per class
print("Classification Report:\n", classification_report(y_test, y_pred_rf))


Random Forest (Bagging) - Accuracy: 0.7783783783783784
Confusion Matrix:
 [[ 32  33]
 [  8 112]]
Classification Report:
               precision    recall  f1-score   support

           0       0.80      0.49      0.61        65
           1       0.77      0.93      0.85       120

    accuracy                           0.78       185
   macro avg       0.79      0.71      0.73       185
weighted avg       0.78      0.78      0.76       185



### Feature Importance in Random Forest

Random Forest provides feature importance, which helps in identifying the most relevant features.

In [18]:
# Feature Importance
feature_importance = rf_model.feature_importances_
features = X_train.columns

# Display the feature importance
for feature, importance in zip(features, feature_importance):
    print(f"{feature}: {importance}")

Gender: 0.023617892886428113
Married: 0.02816497673677057
Dependents: 0.04832053088658131
Education: 0.024060564241272963
Self_Employed: 0.01913863998076867
ApplicantIncome: 0.2019245859094005
CoapplicantIncome: 0.11218164678732671
LoanAmount: 0.18450008562593972
Loan_Amount_Term: 0.047780718081722996
Credit_History: 0.25668081056109027
Property_Area: 0.05362954830269823
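The list is easier to read when ranked. The sketch below sorts the values printed above (rounded to four decimals) and accumulates them. Note that impurity-based importances tend to favor continuous features with many distinct values, so the ranking should be read as indicative rather than definitive.

```python
# Importances copied from the output above, rounded to four decimals
importances = {
    "Gender": 0.0236,
    "Married": 0.0282,
    "Dependents": 0.0483,
    "Education": 0.0241,
    "Self_Employed": 0.0191,
    "ApplicantIncome": 0.2019,
    "CoapplicantIncome": 0.1122,
    "LoanAmount": 0.1845,
    "Loan_Amount_Term": 0.0478,
    "Credit_History": 0.2567,
    "Property_Area": 0.0536,
}

# Sort from most to least important
ranked = sorted(importances.items(), key=lambda item: item[1], reverse=True)

cumulative = 0.0
for name, value in ranked:
    cumulative += value
    print(f"{name:<18} {value:.4f}  cumulative={cumulative:.4f}")
```

In the notebook itself, the same ranking could be obtained with `pd.Series(rf_model.feature_importances_, index=X_train.columns).sort_values(ascending=False)`.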


### Hyperparameter Tuning of Random Forest

In [19]:
# Define the hyperparameter grid to search
param_grid_rf = {
    'n_estimators': [100, 200, 300],    # number of trees
    'max_depth': [3, 5, 7, 9],          # max depth of each tree
    'min_samples_split': [2, 5, 10]     # min samples required to split a node
}

# Grid Search with 5-fold cross-validation
grid_search_rf = GridSearchCV(RandomForestClassifier(random_state=42), param_grid_rf, cv=5)
grid_search_rf.fit(X_train, y_train)

# Get the best parameters and the best model
best_params_rf = grid_search_rf.best_params_
best_rf_model = grid_search_rf.best_estimator_

# Print the best parameters
print(f"Best parameters for Random Forest: {best_params_rf}")

# Use the best model to make predictions on the test set
y_pred_rf_grid = best_rf_model.predict(X_test)

# Evaluate the model's performance
accuracy_rf_grid = accuracy_score(y_test, y_pred_rf_grid)
print("Random Forest Accuracy (with Grid Search CV): ", accuracy_rf_grid)
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred_rf_grid))
print("Classification Report:\n", classification_report(y_test, y_pred_rf_grid))


Best parameters for Random Forest: {'max_depth': 3, 'min_samples_split': 5, 'n_estimators': 100}
Random Forest Accuracy (with Grid Search CV):  0.7837837837837838
Confusion Matrix:
 [[ 27  38]
 [  2 118]]
Classification Report:
               precision    recall  f1-score   support

           0       0.93      0.42      0.57        65
           1       0.76      0.98      0.86       120

    accuracy                           0.78       185
   macro avg       0.84      0.70      0.71       185
weighted avg       0.82      0.78      0.76       185



## Boosting

### AdaBoost
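AdaBoost trains weak learners one after another and re-weights the training samples so that each new learner concentrates on the samples the previous one got wrong. The sketch below walks through one re-weighting round of the classic binary formulation on a made-up outcome vector. It is an illustration under that assumption, not scikit-learn's exact implementation, whose SAMME variant scales the learner weight differently.

```python
import math

# Hypothetical outcome of one weak learner on ten training samples:
# True = classified correctly, False = misclassified
correct = [True, True, False, True, True, True, False, True, True, True]

n = len(correct)
weights = [1.0 / n] * n  # start with uniform sample weights

# Weighted error of the weak learner and its vote weight (alpha)
err = sum(w for w, ok in zip(weights, correct) if not ok)
alpha = 0.5 * math.log((1.0 - err) / err)

# Increase the weights of misclassified samples, decrease the others, renormalize
updated = [w * math.exp(-alpha if ok else alpha) for w, ok in zip(weights, correct)]
total = sum(updated)
updated = [w / total for w in updated]

misclassified_mass = sum(w for w, ok in zip(updated, correct) if not ok)
print(f"err={err:.2f}  alpha={alpha:.3f}  "
      f"misclassified weight after update={misclassified_mass:.2f}")
```

After the update, the misclassified samples carry half of the total weight, which forces the next weak learner to pay attention to them.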

In [20]:
from sklearn.ensemble import AdaBoostClassifier

# Initialize AdaBoost; the default base estimator is a depth-1 decision tree (a decision stump)
ada_model = AdaBoostClassifier(n_estimators=100, random_state=42)

# Train the AdaBoost model
ada_model.fit(X_train, y_train)

# Make predictions on test data
y_pred_ada = ada_model.predict(X_test)

# Evaluate AdaBoost model performance
accuracy_ada = accuracy_score(y_test, y_pred_ada)
print("AdaBoost Classifier - Accuracy:", accuracy_ada)
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred_ada))
print("Classification Report:\n", classification_report(y_test, y_pred_ada))



AdaBoost Classifier - Accuracy: 0.7783783783783784
Confusion Matrix:
 [[ 31  34]
 [  7 113]]
Classification Report:
               precision    recall  f1-score   support

           0       0.82      0.48      0.60        65
           1       0.77      0.94      0.85       120

    accuracy                           0.78       185
   macro avg       0.79      0.71      0.72       185
weighted avg       0.79      0.78      0.76       185



#### Hyperparameter Tuning for AdaBoost

In [21]:
param_grid_ada = {
    'n_estimators': [100, 200],
    'learning_rate': [0.01, 0.1],
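    'algorithm': ['SAMME']  # may be deprecated in newer scikit-learn releases; drop this entry if it raises a warning or error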
}

grid_search_ada = GridSearchCV(AdaBoostClassifier(random_state=42), param_grid_ada, cv=5)
grid_search_ada.fit(X_train, y_train)

# Get the best parameters and the best model
best_params_ada = grid_search_ada.best_params_
best_ada_model = grid_search_ada.best_estimator_

# Print the best parameters
print(f"Best parameters for Ada Boost: {best_params_ada}")

# Use the best model to make predictions on the test set
y_pred_ada_grid = best_ada_model.predict(X_test)

# Evaluate the model's performance
accuracy_ada_grid = accuracy_score(y_test, y_pred_ada_grid)
print("ADA Boost Accuracy (with Grid Search CV): ", accuracy_ada_grid)
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred_ada_grid))
print("Classification Report:\n", classification_report(y_test, y_pred_ada_grid))


Best parameters for Ada Boost: {'algorithm': 'SAMME', 'learning_rate': 0.01, 'n_estimators': 100}
ADA Boost Accuracy (with Grid Search CV):  0.7837837837837838
Confusion Matrix:
 [[ 27  38]
 [  2 118]]
Classification Report:
               precision    recall  f1-score   support

           0       0.93      0.42      0.57        65
           1       0.76      0.98      0.86       120

    accuracy                           0.78       185
   macro avg       0.84      0.70      0.71       185
weighted avg       0.82      0.78      0.76       185



### Gradient Boosting

In [22]:
from sklearn.ensemble import GradientBoostingClassifier

# Initialize Gradient Boosting Classifier
gb_model = GradientBoostingClassifier(n_estimators=100, random_state=42)

# Train the Gradient Boosting model
gb_model.fit(X_train, y_train)

# Make predictions on test data
y_pred_gb = gb_model.predict(X_test)

# Evaluate Gradient Boosting model performance
accuracy_gb = accuracy_score(y_test, y_pred_gb)
print("Gradient Boosting Classifier - Accuracy:", accuracy_gb)
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred_gb))
print("Classification Report:\n", classification_report(y_test, y_pred_gb))

Gradient Boosting Classifier - Accuracy: 0.7567567567567568
Confusion Matrix:
 [[ 25  40]
 [  5 115]]
Classification Report:
               precision    recall  f1-score   support

           0       0.83      0.38      0.53        65
           1       0.74      0.96      0.84       120

    accuracy                           0.76       185
   macro avg       0.79      0.67      0.68       185
weighted avg       0.77      0.76      0.73       185



#### Hyperparameter Tuning for Gradient Boosting

In [23]:
param_grid_gb = {
    'n_estimators': [100, 200],
    'max_depth': [3, 6],
    'learning_rate': [0.01, 0.1]
}

grid_search_gb = GridSearchCV(GradientBoostingClassifier(random_state=42), param_grid_gb, cv=5)
grid_search_gb.fit(X_train, y_train)

# Get the best parameters and the best model
best_params_gb = grid_search_gb.best_params_
best_gb_model = grid_search_gb.best_estimator_

# Print the best parameters
print(f"Best parameters for GB Boost: {best_params_gb}")

# Use the best model to make predictions on the test set
y_pred_gb_grid = best_gb_model.predict(X_test)

# Evaluate the model's performance
accuracy_gb_grid = accuracy_score(y_test, y_pred_gb_grid)
print("Gradient Boost Accuracy (with Grid Search CV): ", accuracy_gb_grid)
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred_gb_grid))
print("Classification Report:\n", classification_report(y_test, y_pred_gb_grid))


Best parameters for GB Boost: {'learning_rate': 0.01, 'max_depth': 3, 'n_estimators': 100}
Gradient Boost Accuracy (with Grid Search CV):  0.7783783783783784
Confusion Matrix:
 [[ 26  39]
 [  2 118]]
Classification Report:
               precision    recall  f1-score   support

           0       0.93      0.40      0.56        65
           1       0.75      0.98      0.85       120

    accuracy                           0.78       185
   macro avg       0.84      0.69      0.71       185
weighted avg       0.81      0.78      0.75       185



### XGBoost (Extreme Gradient Boosting)

In [24]:
import xgboost as xgb

# Initialize XGBoost
xgb_model = xgb.XGBClassifier(n_estimators=100, random_state=42)
xgb_model.fit(X_train, y_train)

# Predictions and evaluation
y_pred_xgb = xgb_model.predict(X_test)
accuracy_xgb = accuracy_score(y_test, y_pred_xgb)

print("XGBoost Accuracy: ", accuracy_xgb)
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred_xgb))
print("Classification Report:\n", classification_report(y_test, y_pred_xgb))


XGBoost Accuracy:  0.745945945945946
Confusion Matrix:
 [[ 33  32]
 [ 15 105]]
Classification Report:
               precision    recall  f1-score   support

           0       0.69      0.51      0.58        65
           1       0.77      0.88      0.82       120

    accuracy                           0.75       185
   macro avg       0.73      0.69      0.70       185
weighted avg       0.74      0.75      0.74       185



#### Hyperparameter Tuning for XGBoost

In [25]:
param_grid_xgb = {
    'n_estimators': [100, 200],
    'max_depth': [3, 6],
    'learning_rate': [0.01, 0.1]
}

grid_search_xgb = GridSearchCV(xgb.XGBClassifier(random_state=42), param_grid_xgb, cv=5)
grid_search_xgb.fit(X_train, y_train)

# Get the best parameters and the best model
best_params_xgb = grid_search_xgb.best_params_
best_xgb_model = grid_search_xgb.best_estimator_

# Print the best parameters
print(f"Best parameters for XGB Boost: {best_params_xgb}")

# Use the best model to make predictions on the test set
y_pred_xgb_grid = best_xgb_model.predict(X_test)

# Evaluate the model's performance
accuracy_xgb_grid = accuracy_score(y_test, y_pred_xgb_grid)
print("XGB Boost Accuracy (with Grid Search CV): ", accuracy_xgb_grid)
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred_xgb_grid))
print("Classification Report:\n", classification_report(y_test, y_pred_xgb_grid))


Best parameters for XGB Boost: {'learning_rate': 0.01, 'max_depth': 6, 'n_estimators': 100}
XGB Boost Accuracy (with Grid Search CV):  0.7783783783783784
Confusion Matrix:
 [[ 28  37]
 [  4 116]]
Classification Report:
               precision    recall  f1-score   support

           0       0.88      0.43      0.58        65
           1       0.76      0.97      0.85       120

    accuracy                           0.78       185
   macro avg       0.82      0.70      0.71       185
weighted avg       0.80      0.78      0.75       185



## Conclusion

This case study demonstrated the use of Decision Tree, Bagging (Random Forest), and Boosting (AdaBoost, Gradient Boosting, XGBoost) for predicting loan approval.

**Key observations from the results include:**

- **Decision Trees** are simple and interpretable but prone to overfitting: the unpruned trees reached only 0.64 (Gini) and 0.71 (entropy) test accuracy. Pre-pruning raised this to 0.74, and the grid-searched tree (max_depth=3) reached 0.78, matching the untuned Random Forest.

- **Bagging (Random Forest)** reduces variance and produces stable, robust models. It reached 0.778 test accuracy with default settings and 0.784 after a small grid search.

- **Boosting methods (AdaBoost, Gradient Boosting, XGBoost)** improve model performance by sequentially focusing on harder-to-predict instances. In this dataset, after hyperparameter tuning, all boosting methods achieved accuracy comparable to Random Forest. XGBoost did not significantly outperform AdaBoost or Random Forest, though it remains a strong choice for larger datasets or more complex problems.

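Overall, ensemble methods clearly outperform unpruned Decision Trees. The tuned single tree and the tuned ensembles all land between 0.778 and 0.784, a gap of one test sample out of 185, so differences among the tuned models are minor for this dataset and should not be read as a ranking.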
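To put the reported accuracies side by side, the sketch below collects them from the outputs in this notebook and converts each one into a count of correctly classified test rows. Nothing is retrained; the numbers are copied from the cells above.

```python
n_test = 185  # test-set size shown in the classification reports

# Test accuracies copied from the outputs in this notebook
accuracy = {
    "Decision Tree (entropy)": 0.7135135135135136,
    "Decision Tree (gini)": 0.6432432432432432,
    "Decision Tree (pre-pruned)": 0.7405405405405405,
    "Decision Tree (grid search)": 0.7783783783783784,
    "Random Forest": 0.7783783783783784,
    "Random Forest (grid search)": 0.7837837837837838,
    "AdaBoost": 0.7783783783783784,
    "AdaBoost (grid search)": 0.7837837837837838,
    "Gradient Boosting": 0.7567567567567568,
    "Gradient Boosting (grid search)": 0.7783783783783784,
    "XGBoost": 0.745945945945946,
    "XGBoost (grid search)": 0.7783783783783784,
}

# Express each accuracy as a count of correctly classified test rows
for name, acc in sorted(accuracy.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:<32} {acc:.4f}  ({round(acc * n_test)}/{n_test} correct)")

# Spread among the grid-searched models, measured in test rows
tuned = [acc for name, acc in accuracy.items() if "grid search" in name]
spread_in_rows = round((max(tuned) - min(tuned)) * n_test)
print("Spread among tuned models, in test rows:", spread_in_rows)
```

Because the evaluation rests on a single 70/30 split, repeating the comparison with cross-validation or several random seeds would give a more reliable picture.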

## Final Recommendation

- **Random Forest** is recommended for a robust, stable model that is quick and easy to tune. It offers strong performance at modest computational cost.

- **AdaBoost or Gradient Boosting** can be considered if you want to experiment with boosting techniques, as they achieve similar accuracy to Random Forest.

- **XGBoost** is suitable when computational resources allow for more careful tuning or for larger, more complex datasets. For this dataset, it does not provide a clear accuracy advantage over Random Forest or AdaBoost.