<a href="https://colab.research.google.com/github/Boxika/Credit-Defaults-AI/blob/main/Notebooks/XGBoost_Train_and_Evaluate.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# XGBoost Training and Evaluation Notebook
This notebook documents the training and evaluation of an XGBoost model for predicting credit default using the Default of Credit Card Clients Dataset.

## Installation of Dependencies
The cells below install and import the libraries this notebook needs.

In [None]:
# Install necessary libraries (the SMOTE package is published on PyPI as imbalanced-learn)
!pip install pandas xgboost scikit-learn imbalanced-learn joblib matplotlib seaborn

In [None]:
import pandas as pd
import xgboost as xgb
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score, roc_curve
from imblearn.over_sampling import SMOTE
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

## Loading and Preprocessing the Data
The dataset is read from Google Drive. The cell below separates the features from the target, addresses class imbalance with SMOTE, and splits the data into training (70%), validation (15%), and test (15%) sets.

In [None]:
# Load the dataset
from google.colab import drive
drive.mount('/content/drive')

file_path = '/content/drive/My Drive/UCI_Credit_Card.csv'  # Update this path if needed
data = pd.read_csv(file_path)

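# Assume 'default.payment.next.month' is the target variable and features are all other columns.
# An 'ID' column, if present, is only a row identifier, so it is dropped as well.
X = data.drop(columns=['ID', 'default.payment.next.month'], errors='ignore')
y = data['default.payment.next.month']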

# Split first so the validation and test sets contain only real, untouched samples
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.5, stratify=y_temp, random_state=42
)

# Apply SMOTE to the training set only; resampling before the split would leak
# synthetic neighbours of training rows into the validation and test sets
sm = SMOTE(random_state=42)
X_train, y_train = sm.fit_resample(X_train, y_train)
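
As a sanity check on the two-step split, the sketch below reproduces the 70/15/15 proportions on synthetic data. The array names, the 22% positive rate, and the sample count are illustrative assumptions, not values taken from the dataset; the `stratify` argument is optional and simply keeps the class ratio equal across splits.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (hypothetical): 1000 rows, 22% positives
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(1000, 4))
y_demo = np.array([1] * 220 + [0] * 780)

# Step 1: hold out 30% of the rows
X_tr, X_tmp, y_tr, y_tmp = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=42
)
# Step 2: halve the held-out rows into validation and test
X_va, X_te, y_va, y_te = train_test_split(
    X_tmp, y_tmp, test_size=0.5, stratify=y_tmp, random_state=42
)

sizes = (len(X_tr), len(X_va), len(X_te))
pos_rates = tuple(round(float(part.mean()), 3) for part in (y_tr, y_va, y_te))
print("split sizes:", sizes)
print("positive rate per split:", pos_rates)
```

Printing the positive rate per split is a quick way to confirm that no split ended up with a distorted class ratio.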

## Training the Model

The cell below tunes and trains an XGBoost classifier. It defines a parameter grid over `alpha` (L1 regularization), `reg_lambda` (L2 regularization), `learning_rate`, `max_depth`, `n_estimators`, and `scale_pos_weight`, then runs `GridSearchCV` with 5-fold cross-validation, scored by ROC-AUC. After the search, the best parameters and the best cross-validated AUC are printed. The best estimator is then refit on the training set with early stopping, monitored on the validation set.
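
To get a feel for the cost of the search before running it, the number of candidates can be counted with scikit-learn's `ParameterGrid`. This is a small sketch that copies the grid values from this notebook; the variable names `n_candidates`, `cv_folds`, and `n_cv_fits` are illustrative.

```python
from sklearn.model_selection import ParameterGrid

# Same values as the grid used for the search in this notebook
param_grid = {
    "alpha": [0.1, 0.5],
    "reg_lambda": [1, 10],
    "learning_rate": [0.01, 0.1],
    "max_depth": [3, 4],
    "n_estimators": [100, 200],
    "scale_pos_weight": [1, 2, 2.5],
}

n_candidates = len(ParameterGrid(param_grid))  # 2 * 2 * 2 * 2 * 2 * 3
cv_folds = 5
n_cv_fits = n_candidates * cv_folds            # one fit per candidate per fold

print(f"{n_candidates} candidates x {cv_folds} folds = {n_cv_fits} fits")
```

With the default `refit=True`, `GridSearchCV` performs one more fit on the full training set using the best candidate.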


In [None]:
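# Define the parameter grid
param_grid = {
    'alpha': [0.1, 0.5],             # L1 regularization strength
    'reg_lambda': [1, 10],           # L2 regularization strength
    'learning_rate': [0.01, 0.1],
    'max_depth': [3, 4],
    'n_estimators': [100, 200],
    'scale_pos_weight': [1, 2, 2.5]  # weight applied to the positive (default) class
}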

# Initialize the model
xgb_clf = xgb.XGBClassifier(
    objective='binary:logistic',
    random_state=42,
    tree_method='hist'
)

# Set up GridSearchCV
grid_search = GridSearchCV(estimator=xgb_clf, param_grid=param_grid, scoring='roc_auc', cv=5, verbose=1, n_jobs=-1)

# Fit the model
grid_search.fit(X_train, y_train)

# Best parameters and best score
print("Best parameters found: ", grid_search.best_params_)
print("Best AUC score: ", grid_search.best_score_)

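# Refit the best model with early stopping on the validation set.
# XGBoost 1.6 deprecated passing eval_metric and early_stopping_rounds to fit(),
# and newer releases reject them, so they are set on the estimator instead.
best_xgb_clf = grid_search.best_estimator_
best_xgb_clf.set_params(eval_metric='auc', early_stopping_rounds=10)
best_xgb_clf.fit(
    X_train, y_train,
    eval_set=[(X_val, y_val)],
    verbose=True
)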
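
The early-stopping rule can be illustrated without XGBoost. The sketch below is a simplified, plain-Python illustration of the patience logic, not XGBoost's actual implementation; the function name `best_round_with_patience` and the AUC values are made up for the example.

```python
def best_round_with_patience(scores, patience):
    """Return (best_round, stop_round) for a higher-is-better metric.

    Training stops once `patience` consecutive rounds pass without
    a strict improvement over the best score seen so far.
    """
    best_score = float("-inf")
    best_round = -1
    for round_idx, score in enumerate(scores):
        if score > best_score:
            best_score, best_round = score, round_idx
        elif round_idx - best_round >= patience:
            return best_round, round_idx
    # Patience never ran out: training used every round
    return best_round, len(scores) - 1


# Hypothetical validation AUC per boosting round
val_auc = [0.70, 0.74, 0.76, 0.75, 0.76, 0.755, 0.75]
best_round, stop_round = best_round_with_patience(val_auc, patience=3)
print(f"best round: {best_round}, stopped at round: {stop_round}")
```

Here the best score is reached at round 2 and training halts at round 5, after three rounds without improvement. The same idea applies in the notebook with a patience of 10 rounds.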

## Evaluating the Model

The cell below evaluates the best XGBoost classifier on the training, validation, and test sets. For each split it predicts class labels and default probabilities, then prints the classification report, the confusion matrix, and the ROC-AUC score. It also plots the ROC curve for the test set, which shows the trade-off between the true positive rate and the false positive rate as the decision threshold varies. Comparing the three splits indicates how well the model generalizes: a large gap between training and test scores suggests overfitting.

In [None]:
# Evaluate on training data
y_train_pred = best_xgb_clf.predict(X_train)
y_train_pred_proba = best_xgb_clf.predict_proba(X_train)[:, 1]

print("Training Classification Report:")
print(classification_report(y_train, y_train_pred))
print("Training Confusion Matrix:")
print(confusion_matrix(y_train, y_train_pred))
print(f"Training ROC-AUC Score: {roc_auc_score(y_train, y_train_pred_proba):.4f}")

# Evaluate on validation data
y_val_pred = best_xgb_clf.predict(X_val)
y_val_pred_proba = best_xgb_clf.predict_proba(X_val)[:, 1]

print("Validation Classification Report:")
print(classification_report(y_val, y_val_pred))
print("Validation Confusion Matrix:")
print(confusion_matrix(y_val, y_val_pred))
print(f"Validation ROC-AUC Score: {roc_auc_score(y_val, y_val_pred_proba):.4f}")

# Evaluate on test data
y_test_pred = best_xgb_clf.predict(X_test)
y_test_pred_proba = best_xgb_clf.predict_proba(X_test)[:, 1]

print("Test Classification Report:")
print(classification_report(y_test, y_test_pred))
print("Test Confusion Matrix:")
print(confusion_matrix(y_test, y_test_pred))
print(f"Test ROC-AUC Score: {roc_auc_score(y_test, y_test_pred_proba):.4f}")

# Plot ROC curve for test data
fpr, tpr, _ = roc_curve(y_test, y_test_pred_proba)
plt.figure()
plt.plot(fpr, tpr, color='darkorange', lw=2, label='ROC curve (area = %0.4f)' % roc_auc_score(y_test, y_test_pred_proba))
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curve')
plt.legend(loc="lower right")
plt.show()
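
For intuition about what the ROC-AUC score measures, it equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below checks this on four made-up predictions by comparing a pairwise count against `roc_auc_score`; the labels and scores are illustrative only.

```python
from itertools import product

from sklearn.metrics import roc_auc_score

# Toy labels and predicted probabilities (illustrative values)
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

pos_scores = [s for s, y in zip(y_score, y_true) if y == 1]
neg_scores = [s for s, y in zip(y_score, y_true) if y == 0]

# Count positive/negative pairs ranked correctly; ties count as half
wins = sum(
    1.0 if p > n else 0.5 if p == n else 0.0
    for p, n in product(pos_scores, neg_scores)
)
pairwise_auc = wins / (len(pos_scores) * len(neg_scores))
sklearn_auc = roc_auc_score(y_true, y_score)

print(f"pairwise AUC: {pairwise_auc:.2f}, roc_auc_score: {sklearn_auc:.2f}")
```

Three of the four positive/negative pairs are ordered correctly, so both computations give 0.75.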

## Saving the Model

The code saves the trained XGBoost model to a file named 'best_xgb_model.joblib' using the `joblib` library.

In [None]:
import joblib

# Save the model to a file
joblib.dump(best_xgb_clf, 'best_xgb_model.joblib')
print("Model saved to best_xgb_model.joblib")
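
A saved model can be restored with `joblib.load`. The sketch below shows the round trip and checks that predictions survive it. To stay self-contained it uses a scikit-learn `LogisticRegression` on toy data as a stand-in for `best_xgb_clf`, and a temporary file instead of 'best_xgb_model.joblib'; both are assumptions made for the example.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression

# Stand-in estimator and toy data, used only to keep the sketch self-contained
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 3))
y_demo = (X_demo[:, 0] > 0).astype(int)
model = LogisticRegression().fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "demo_model.joblib")
    joblib.dump(model, model_path)      # save
    restored = joblib.load(model_path)  # load back

same_predictions = np.array_equal(model.predict(X_demo), restored.predict(X_demo))
print("Predictions identical after reload:", same_predictions)
```

Because joblib files are pickle-based, a model is best loaded with the same library versions that were used to save it.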


## Conclusion

In this notebook, we trained and evaluated an XGBoost model for credit default prediction. SMOTE was used to handle class imbalance, and hyperparameters were tuned with GridSearchCV. The evaluation showed promising results, with a high ROC-AUC score.

## Next Steps

- Experiment with different feature engineering techniques to improve model performance.
- Try different machine learning models and compare their performance.
- Deploy the model using a Flask API for real-time predictions.
