# Keras (TensorFlow) Training and Evaluation Notebook

This notebook documents the training and evaluation of a Keras model for predicting credit default using the Default of Credit Card Clients Dataset. Keras, which runs on top of TensorFlow, is a powerful deep learning library that allows for easy and fast prototyping, making it ideal for building neural networks.

## Installation of Dependencies

The cells below install and import the libraries the notebook needs.

In [None]:
# Install necessary libraries
!pip install pandas scikit-learn imbalanced-learn tensorflow keras joblib scikeras matplotlib scipy


In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.regularizers import l2
from scikeras.wrappers import KerasClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler
from tensorflow.keras.callbacks import EarlyStopping
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score, roc_curve
import matplotlib.pyplot as plt
import joblib
import os

## Loading and Preprocessing Data

The dataset is loaded from Google Drive and separated into features and target. Class imbalance is addressed with SMOTE, and the data is split 70/15/15 into training, validation, and test sets.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

# Load the data
file_path = '/content/drive/My Drive/VCRK Credit Defaults/Datasets/UCI_Credit_Card.csv'  # Update this path to your file location in Google Drive
data = pd.read_csv(file_path)

# Preprocessing
X = data.drop(columns=['default.payment.next.month'])
y = data['default.payment.next.month']

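# Split the dataset into training (70%), validation (15%), and test (15%) sets first,
# so that the validation and test sets contain only real, unresampled records
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.3, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)

# Handle class imbalance with SMOTE on the training set only; resampling before
# the split would leak synthetic neighbours of training rows into validation/test
sm = SMOTE(random_state=42)
X_train, y_train = sm.fit_resample(X_train, y_train)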

Mounted at /content/drive
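
To make the oversampling step more concrete, the following is a simplified NumPy sketch of the interpolation idea behind SMOTE: each synthetic row is placed on the line segment between a minority-class row and one of its nearest minority-class neighbours. The function name `smote_like_oversample` and the toy data are assumptions made for illustration; this is not the imbalanced-learn implementation used in the cell above.

```python
import numpy as np

def smote_like_oversample(X_min, n_new, k=5, seed=42):
    """Create n_new synthetic rows by interpolating between minority-class neighbours."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    # Pairwise Euclidean distances between minority samples
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour
    k = min(k, len(X_min) - 1)
    neighbours = np.argsort(dists, axis=1)[:, :k]  # k nearest neighbours per row

    base = rng.integers(0, len(X_min), size=n_new)             # random minority rows
    chosen = neighbours[base, rng.integers(0, k, size=n_new)]  # one random neighbour each
    lam = rng.random((n_new, 1))                               # interpolation factor in [0, 1)
    return X_min[base] + lam * (X_min[chosen] - X_min[base])

# Hypothetical toy data: 20 majority rows, 5 minority rows, 2 features
rng = np.random.default_rng(0)
X_majority = rng.normal(0.0, 1.0, size=(20, 2))
X_minority = rng.normal(3.0, 0.5, size=(5, 2))

X_synth = smote_like_oversample(X_minority, n_new=len(X_majority) - len(X_minority), k=3)
X_minority_balanced = np.vstack([X_minority, X_synth])
print(X_minority_balanced.shape)  # (20, 2): the minority class now matches the majority count
```

Because every synthetic row is a convex combination of two real minority rows, the new points stay inside the region already occupied by the minority class.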


## Hyperparameter Tuning and Model Definition

This cell scales the data, defines the Keras (TensorFlow) network, and tunes its hyperparameters. It first sets joblib environment variables to avoid multiprocessing issues in a multithreaded environment. The features are then standardized with StandardScaler, which is fitted on the training set only and applied to the validation and test sets, so that the training features have zero mean and unit variance.

The create_model function defines a network with three hidden layers (128, 64, and 32 units), each L2-regularized and followed by a dropout layer to limit overfitting. The model is compiled with binary cross-entropy loss and tracks accuracy; the optimizer defaults to Adam but is itself one of the tuned hyperparameters. RandomizedSearchCV samples five combinations of optimizer, activation function, dropout rate, batch size, and number of epochs, scores each with 3-fold cross-validation, and the best combination is printed and returned.

In [None]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.regularizers import l2
from scikeras.wrappers import KerasClassifier
from sklearn.model_selection import RandomizedSearchCV
from sklearn.preprocessing import StandardScaler
from scipy.stats import uniform, randint
import os

# Avoid fork issues with multithreading
os.environ['JOBLIB_MULTIPROCESSING'] = '0'
os.environ['JOBLIB_TEMP_FOLDER'] = '/tmp'

# Scale the data
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_val = scaler.transform(X_val)
X_test = scaler.transform(X_test)

# Define a function to create the model
def create_model(input_dim, optimizer='adam', activation='relu', dropout_rate=0.5, l2_reg=0.001):
    model = Sequential([
        Dense(128, input_dim=input_dim, activation=activation, kernel_regularizer=l2(l2_reg)),
        Dropout(dropout_rate),
        Dense(64, activation=activation, kernel_regularizer=l2(l2_reg)),
        Dropout(dropout_rate),
        Dense(32, activation=activation, kernel_regularizer=l2(l2_reg)),
        Dropout(dropout_rate),
        Dense(1, activation='sigmoid')
    ])
    model.compile(optimizer=optimizer, loss='binary_crossentropy', metrics=['accuracy'])
    return model

# Hyperparameter tuning function
def hyperparameter_tuning(X_train, y_train, input_dim):
    model = KerasClassifier(model=create_model, input_dim=input_dim, verbose=0)

    param_dist = {
        'model__optimizer': ['adam', 'rmsprop'],
        'model__activation': ['relu', 'tanh'],
        'model__dropout_rate': uniform(0.3, 0.4),  # uniform(loc, scale): samples from [0.3, 0.7]
        'batch_size': randint(32, 129),  # Samples integers from 32 to 128 (upper bound is exclusive)
        'epochs': randint(50, 150)  # Samples integers from 50 to 149 (upper bound is exclusive)
    }

    random_search = RandomizedSearchCV(estimator=model, param_distributions=param_dist, n_iter=5, n_jobs=1, cv=3, random_state=42)
    random_search_result = random_search.fit(X_train, y_train)

    # Summarize the results
    print("Best: %f using %s" % (random_search_result.best_score_, random_search_result.best_params_))
    return random_search_result.best_params_

# Perform hyperparameter tuning
best_params = hyperparameter_tuning(X_train, y_train, X_train.shape[1])
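
The search space above can be inspected without training anything. The sketch below uses scikit-learn's ParameterSampler, which RandomizedSearchCV uses to draw its candidates, to list five sampled combinations from the same distributions. It only illustrates how the scipy distributions are parameterized (`uniform(loc, scale)` and `randint(low, high)` with an exclusive upper bound); the variable name `candidates` is introduced here for illustration.

```python
from scipy.stats import randint, uniform
from sklearn.model_selection import ParameterSampler

param_dist = {
    'model__optimizer': ['adam', 'rmsprop'],
    'model__activation': ['relu', 'tanh'],
    'model__dropout_rate': uniform(0.3, 0.4),  # loc=0.3, scale=0.4 -> values in [0.3, 0.7]
    'batch_size': randint(32, 129),            # integers 32..128
    'epochs': randint(50, 150),                # integers 50..149
}

# Draw the same number of candidates as the search above (n_iter=5)
candidates = list(ParameterSampler(param_dist, n_iter=5, random_state=42))
for params in candidates:
    print(params)
```

Printing the candidates before launching the search is a cheap way to confirm that the sampled ranges match what the comments claim.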


## Training the Best Model

The best optimizer, activation function, dropout rate, batch size, and number of epochs are extracted from the best_params dictionary. L2 regularization was not part of the search space, so the create_model default of 0.001 is used. A new model is then built with create_model for the input dimension of the training data and trained with early stopping, which monitors the validation loss, stops training if it does not improve for 10 consecutive epochs, and restores the best weights observed. The training history is recorded for further analysis.

In [None]:
# Extract the best parameters
best_optimizer = best_params['model__optimizer']
best_activation = best_params['model__activation']
best_dropout_rate = best_params['model__dropout_rate']
# l2_reg is not in param_dist, so fall back to create_model's default
best_l2_reg = best_params.get('model__l2_reg', 0.001)
best_batch_size = best_params['batch_size']
best_epochs = best_params['epochs']

# Create the model with the best parameters
best_model = create_model(input_dim=X_train.shape[1], optimizer=best_optimizer, activation=best_activation, dropout_rate=best_dropout_rate, l2_reg=best_l2_reg)

# Train the best model with early stopping
from tensorflow.keras.callbacks import EarlyStopping

early_stopping = EarlyStopping(monitor='val_loss', patience=10, restore_best_weights=True)
history = best_model.fit(X_train, y_train, validation_data=(X_val, y_val), epochs=best_epochs, batch_size=best_batch_size, callbacks=[early_stopping], verbose=1)
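
The patience rule used by early stopping can be sketched in plain Python. The function below is a simplified illustration, not the Keras callback itself: it assumes a strict "lower is better" comparison with no minimum delta, and the name `early_stopping_epoch` and the loss values are made up for the example.

```python
def early_stopping_epoch(val_losses, patience=10):
    """Return (stop_epoch, best_epoch), both 0-indexed, for a patience-based rule."""
    best_loss = float('inf')
    best_epoch = -1
    wait = 0  # epochs since the last improvement
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch  # stop here; weights from best_epoch would be restored
    return len(val_losses) - 1, best_epoch  # ran to completion without triggering

# Hypothetical validation losses, one per epoch
losses = [0.70, 0.62, 0.58, 0.59, 0.60, 0.61, 0.57, 0.58, 0.59, 0.60]
stop_epoch, best_epoch = early_stopping_epoch(losses, patience=3)
print(stop_epoch, best_epoch)  # prints "5 2"
```

With a patience of 3, training stops at epoch 5 and never reaches the lower loss at epoch 6, which shows the trade-off a small patience makes; the notebook's patience of 10 is more tolerant of temporary plateaus.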


## Evaluating the Model

This code snippet evaluates the performance of a trained Keras model on validation and test datasets using key metrics and visualizations. It prints the classification report, confusion matrix, and ROC-AUC score for both datasets and plots the ROC curve for the test set.

In [None]:
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score, roc_curve
import matplotlib.pyplot as plt

# Evaluate on validation data
y_val_pred_proba = best_model.predict(X_val).ravel()
y_val_pred = (y_val_pred_proba > 0.5).astype(int)

print("Validation Classification Report:")
print(classification_report(y_val, y_val_pred))
print("Validation Confusion Matrix:")
print(confusion_matrix(y_val, y_val_pred))
print(f"Validation ROC-AUC Score: {roc_auc_score(y_val, y_val_pred_proba):.4f}")

# Evaluate on test data
y_test_pred_proba = best_model.predict(X_test).ravel()
y_test_pred = (y_test_pred_proba > 0.5).astype(int)

print("Test Classification Report:")
print(classification_report(y_test, y_test_pred))
print("Test Confusion Matrix:")
print(confusion_matrix(y_test, y_test_pred))
print(f"Test ROC-AUC Score: {roc_auc_score(y_test, y_test_pred_proba):.4f}")

# Plot ROC curve for test data
fpr, tpr, _ = roc_curve(y_test, y_test_pred_proba)
plt.figure()
plt.plot(fpr, tpr, color='darkorange', lw=2, label='ROC curve (area = %0.4f)' % roc_auc_score(y_test, y_test_pred_proba))
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curve')
plt.legend(loc="lower right")
plt.show()
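
The metrics above behave differently with respect to the 0.5 cutoff: the classification report and confusion matrix are computed from thresholded predictions, while ROC-AUC is computed from the raw probabilities and does not depend on any threshold. The toy labels and probabilities below are invented for illustration only.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

# Hypothetical true labels and predicted probabilities
y_true = np.array([0, 0, 1, 1])
y_proba = np.array([0.1, 0.4, 0.35, 0.8])

# The confusion matrix changes with the decision threshold
cm_05 = confusion_matrix(y_true, (y_proba > 0.5).astype(int))
cm_03 = confusion_matrix(y_true, (y_proba > 0.3).astype(int))

# ROC-AUC uses the probabilities directly, so it is the same for any threshold
auc = roc_auc_score(y_true, y_proba)

print(cm_05)   # [[2 0]
               #  [1 1]]
print(cm_03)   # [[1 1]
               #  [0 2]]
print(f"ROC-AUC: {auc:.4f}")  # ROC-AUC: 0.7500
```

Lowering the threshold from 0.5 to 0.3 catches the second defaulter at the cost of one false positive, a trade-off that matters for credit default data where missed defaults are typically the costlier error.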


## Saving the Model

This cell saves the fitted scaler and the trained model so that both can be reused for inference.

In [None]:
import joblib

# Save the scaler
joblib.dump(scaler, 'scaler.joblib')
print("Scaler saved to scaler.joblib")

# Save the model
best_model.save('keras_model.h5')
print("Model saved to keras_model.h5")
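
At inference time, new inputs need the same scaling that was applied during training, which is why the scaler is saved alongside the model. The sketch below round-trips a small StandardScaler through joblib in a temporary directory and checks that the reloaded object produces identical output. The demo data and the `_demo` names are assumptions for illustration; the saved Keras model itself could be reloaded with `tensorflow.keras.models.load_model('keras_model.h5')`.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical training features used only to fit a demo scaler
X_demo = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0]])
scaler_demo = StandardScaler().fit(X_demo)

# Save and reload the scaler, as the notebook does with 'scaler.joblib'
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'scaler.joblib')
    joblib.dump(scaler_demo, path)
    loaded_scaler = joblib.load(path)

# A new record must be scaled with the reloaded scaler before calling predict
X_new = np.array([[2.0, 300.0]])
same = np.allclose(scaler_demo.transform(X_new), loaded_scaler.transform(X_new))
print(same)  # True
```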


## Conclusion

In this notebook, we trained and evaluated a Keras (TensorFlow) model for credit default prediction. The model was trained using SMOTE to handle class imbalance. The evaluation showed promising results with a high ROC-AUC score.

## Next Steps

- Experiment with different neural network architectures and hyperparameters to improve model performance.
- Try different machine learning models and compare their performance.
- Deploy the model using a Flask API for real-time predictions.