
Preprocessing separates the features (inputs) from the target variable (loan_status) and identifies the categorical and numerical columns.
Nominal categorical variables are one-hot encoded, person_education and previous_loan_defaults_on_file are label encoded, and numerical features are standardized using StandardScaler.
The dataset is split into training (60%), validation (20%), and test (20%) sets for model evaluation.
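
As a quick check of the split proportions, the sketch below works through the arithmetic of a 40% hold-out that is then divided in half. The row count of 45,000 is an assumption inferred from the 9,000-row test set reported in the classification output further down, and `split_sizes` is a hypothetical helper, not part of the notebook.

```python
def split_sizes(n_rows, temp_fraction=0.4, test_fraction_of_temp=0.5):
    """Return approximate (train, validation, test) sizes for a two-stage hold-out split."""
    n_temp = round(n_rows * temp_fraction)           # rows held out from training
    n_test = round(n_temp * test_fraction_of_temp)   # half of the hold-out becomes the test set
    n_val = n_temp - n_test                          # the rest becomes the validation set
    n_train = n_rows - n_temp
    return n_train, n_val, n_test


# 45_000 is an assumed row count, inferred from the reported test-set size
n_train, n_val, n_test = split_sizes(45_000)
print(n_train, n_val, n_test)
```

Under that assumption the split works out to 60% training, 20% validation, and 20% test.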

ANN Model Definition:
A function build_model(hp) is defined to create a neural network with two hidden layers.
Hyperparameters such as the number of units in each layer, the activation function, the optimizer, and the learning rate are tuned through Keras Tuner.
The model is compiled with the selected optimizer (Adam, Adagrad, or RMSprop), a binary cross-entropy loss function, and accuracy as a performance metric.
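
To make the loss concrete, here is a minimal pure-Python sketch of mean binary cross-entropy. The labels and probabilities are made-up illustration values, and the clipping constant `eps` is an assumption chosen only to avoid `log(0)`; Keras computes the same quantity internally with its own numerical safeguards.

```python
import math


def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy over paired labels and predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # clip so that log() never receives 0
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


# Illustrative values only (not taken from the loan data)
labels = [1, 0, 1, 0]
probs = [0.9, 0.2, 0.6, 0.4]
loss = binary_cross_entropy(labels, probs)
print(f"{loss:.4f}")
```

Confident correct predictions (0.9 for a positive) contribute little to the loss, while hesitant ones (0.6 for a positive) contribute more.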

Hyperparameter Tuning:

The program uses three hyperparameter optimization techniques (Random Search, Bayesian Optimization, and Hyperband) to explore the hyperparameter space and select the configuration with the lowest validation loss.
The best model from each technique is then evaluated on the test set to assess its performance.
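
To give a sense of scale, the following sketch counts the combinations in the search space declared in build_model and draws ten random configurations from it. It is a plain-Python illustration of random sampling, not Keras Tuner's implementation; `sample_configuration` and the seed are hypothetical.

```python
import random

# Search space mirroring the choices declared in build_model
search_space = {
    "units_0": [32, 64, 96, 128],
    "units_1": [32, 64, 96, 128],
    "activation": ["relu", "tanh", "sigmoid"],
    "optimizer": ["adam", "adagrad", "rmsprop"],
    "learning_rate": [1e-2, 1e-3, 1e-4],
}

# Total number of distinct configurations
n_combinations = 1
for values in search_space.values():
    n_combinations *= len(values)


def sample_configuration(space, rng):
    """Draw one configuration uniformly at random (hypothetical helper)."""
    return {name: rng.choice(values) for name, values in space.items()}


rng = random.Random(42)  # arbitrary seed for reproducibility
trials = [sample_configuration(search_space, rng) for _ in range(10)]
coverage = len(trials) / n_combinations  # upper bound, since draws may repeat
print(n_combinations, f"{coverage:.1%}")
```

With 432 possible configurations, ten trials cover at most about 2% of the space, which is why the choice of search strategy matters.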

Performance Evaluation:

The evaluate_model function assesses a model on the test set by calculating the test loss.
The test loss, search time, and best hyperparameters for each optimization technique are collected and displayed in a table for comparison.


In [10]:
# Install Keras Tuner
!pip install keras-tuner -q

# Import the necessary libraries
import pandas as pd
import numpy as np
import time
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import Adam, RMSprop, Adagrad
from keras_tuner import RandomSearch, BayesianOptimization, Hyperband
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder, LabelEncoder
from sklearn.compose import ColumnTransformer
import matplotlib.pyplot as plt
import seaborn as sns

import os
os.chdir(r"C:\Users\hp\OneDrive - The University of Texas at Dallas\Financial Institutions")
data = pd.read_csv("loan_data.csv")

# List of the categorical columns 
categorical_cols = ['person_gender', 'person_education', 'person_home_ownership', 'loan_intent', 'previous_loan_defaults_on_file']

# unique values for each specified categorical column
for col in categorical_cols:
    print(f"Unique values in {col}: {data[col].unique()}")


Unique values in person_gender: ['female' 'male']
Unique values in person_education: ['Master' 'High School' 'Bachelor' 'Associate' 'Doctorate']
Unique values in person_home_ownership: ['RENT' 'OWN' 'MORTGAGE' 'OTHER']
Unique values in loan_intent: ['PERSONAL' 'EDUCATION' 'MEDICAL' 'VENTURE' 'HOMEIMPROVEMENT'
 'DEBTCONSOLIDATION']
Unique values in previous_loan_defaults_on_file: ['No' 'Yes']


In [11]:
# Preprocessing: Separate features and target variable

X = data.drop(columns=['loan_status'])
y = data['loan_status']

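# Separate nominal features (one-hot encoded) from features that are label encoded
categorical_features = ["person_gender", "person_home_ownership", "loan_intent"]
label_encoded_features = ["person_education", "previous_loan_defaults_on_file"]

# Label encode these variables separately.
# Note: LabelEncoder assigns integer codes in alphabetical order of the labels,
# so the codes for person_education do not follow the order of education levels.
for col in label_encoded_features:
    le = LabelEncoder()
    X[col] = le.fit_transform(X[col])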

# Identify numerical features (excluding categorical + label encoded ones)
numerical_features = X.drop(columns=categorical_features + label_encoded_features).columns.tolist()

# ColumnTransformer: standardize numerical features, one-hot encode nominal features,
# and pass the label-encoded columns through unchanged
data_transformer = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numerical_features),
        ('cat', OneHotEncoder(drop='first'), categorical_features)  # One-Hot Encode nominal features
    ],
    remainder='passthrough'  # Keeping label-encoded columns as they are
)
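
The integer codes given to person_education deserve a closer look. LabelEncoder numbers classes in sorted (alphabetical) order, which the pure-Python sketch below reproduces with `sorted`. The explicit ordering in `ordinal_order` is a hypothetical ranking of education levels, shown only as an alternative; if an ordinal encoding were wanted, scikit-learn's OrdinalEncoder with an explicit `categories` list would be one way to obtain it.

```python
# Labels as printed in the unique-values output above
education_levels = ["Master", "High School", "Bachelor", "Associate", "Doctorate"]

# Alphabetical codes: the same numbering that sorting the class labels produces
alphabetical_codes = {label: code for code, label in enumerate(sorted(education_levels))}

# Hypothetical explicit ordering from lowest to highest level of education
ordinal_order = ["High School", "Associate", "Bachelor", "Master", "Doctorate"]
ordinal_codes = {label: code for code, label in enumerate(ordinal_order)}

print(alphabetical_codes)
print(ordinal_codes)
```

Under alphabetical coding, "Doctorate" sits between "Bachelor" and "High School", so the numeric value carries no rank information.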


In [12]:

import warnings
warnings.filterwarnings('ignore')

# Split into training (60%) and temp (40%)
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.4, random_state=42)

# Split temp data into 50% validation, 50% test (which results in 20% validation, 20% test of full data)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)

# Apply preprocessing (only fit on training data)
X_train_scaled = data_transformer.fit_transform(X_train)  # Fit only on training data
X_val_scaled = data_transformer.transform(X_val)          # Transform validation data
X_test_scaled = data_transformer.transform(X_test)        # Transform test data

# Model-building function for tuning: two hidden layers, with 'sigmoid' included as an activation choice.
# The function takes hp (the hyperparameter object) as input, builds the network from the
# values sampled for each trial, and returns a compiled model that Keras Tuner trains and evaluates.
def build_model(hp):
    model = Sequential()

    # First hidden layer
    model.add(Dense(units=hp.Int('units_0', min_value=32, max_value=128, step=32),
                    activation=hp.Choice('activation', values=['relu', 'tanh', 'sigmoid']),
                    input_shape=(X_train_scaled.shape[1],)))

    # Second hidden layer (reuses the 'activation' hyperparameter, so both hidden layers share one activation)
    model.add(Dense(units=hp.Int('units_1', min_value=32, max_value=128, step=32),
                    activation=hp.Choice('activation', values=['relu', 'tanh', 'sigmoid'])))

    # Output layer
    model.add(Dense(1, activation='sigmoid'))


    # Choose optimizer dynamically
    optimizer_choice = hp.Choice('optimizer', values=['adam', 'adagrad', 'rmsprop'])

    # Choose learning rate dynamically
    learning_rate = hp.Choice('learning_rate', values=[1e-2, 1e-3, 1e-4])

    # Assign the chosen optimizer with the selected learning rate
    if optimizer_choice == 'adam':
        optimizer = Adam(learning_rate=learning_rate)
    elif optimizer_choice == 'adagrad':
        optimizer = Adagrad(learning_rate=learning_rate)
    else:
        optimizer = RMSprop(learning_rate=learning_rate)

    # Compile the model
    model.compile(optimizer=optimizer, loss='binary_crossentropy', metrics=['accuracy'])

    return model

# Function to evaluate model on the test set and return loss
def evaluate_model(model, X_test, y_test):
    test_loss = model.evaluate(X_test, y_test, verbose=0)[0]  # Return only test loss
    return test_loss

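# Search budget: max_trials applies to Random Search and Bayesian Optimization;
# max_epochs is the epochs per trial for those two and the epoch cap for Hyperband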
max_trials = 10
max_epochs = 3

In [13]:
# Random Search with overwrite=True to force re-run
random_tuner = RandomSearch(
    build_model,
    objective='val_loss',
    max_trials=max_trials,
    directory='random_search',
    project_name='customer_churn',
    overwrite=True
)

start_time = time.time()
# search() starts the hyperparameter tuning process: it tries different combinations of
# hyperparameters and trains the model with each combination.
random_tuner.search(X_train_scaled, y_train, epochs=max_epochs, validation_data=(X_val_scaled, y_val), verbose=0)
random_search_time = time.time() - start_time

# Get best model and evaluate on the test set
random_search_best_model = random_tuner.get_best_models(num_models=1)[0]
random_search_test_loss = evaluate_model(random_search_best_model, X_test_scaled, y_test)
random_search_best_hyperparameters = random_tuner.get_best_hyperparameters(num_trials=1)[0].values




In [14]:
# Bayesian Optimization with overwrite=True
bayesian_tuner = BayesianOptimization(
    build_model,
    objective='val_loss',
    max_trials=max_trials,
    directory='bayesian_opt',
    project_name='customer_churn',
    overwrite=True
)

start_time = time.time()
bayesian_tuner.search(X_train_scaled, y_train, epochs=max_epochs, validation_data=(X_val_scaled, y_val), verbose=0)
bayesian_search_time = time.time() - start_time

# Get best model and evaluate on the test set
bayesian_search_best_model = bayesian_tuner.get_best_models(num_models=1)[0]
bayesian_search_test_loss = evaluate_model(bayesian_search_best_model, X_test_scaled, y_test)
bayesian_search_best_hyperparameters = bayesian_tuner.get_best_hyperparameters(num_trials=1)[0].values

In [15]:
# Hyperband with overwrite=True
hyperband_tuner = Hyperband(
    build_model,
    objective='val_loss',
    max_epochs=max_epochs,
    directory='hyperband',
    project_name='customer_churn',
    overwrite=True
)

start_time = time.time()
hyperband_tuner.search(X_train_scaled, y_train, validation_data=(X_val_scaled, y_val), verbose=0)
hyperband_search_time = time.time() - start_time

# Get best model and evaluate on the test set
hyperband_search_best_model = hyperband_tuner.get_best_models(num_models=1)[0]
hyperband_search_test_loss = evaluate_model(hyperband_search_best_model, X_test_scaled, y_test)
hyperband_search_best_hyperparameters = hyperband_tuner.get_best_hyperparameters(num_trials=1)[0].values

In [16]:
# Adjust Pandas setting to display full DataFrame content
pd.set_option('display.max_colwidth', None)

# Collecting the results
results = {
    "Method": ["Random Search", "Bayesian Optimization", "Hyperband"],
    "Test Loss": [random_search_test_loss, bayesian_search_test_loss, hyperband_search_test_loss],
    "Time (seconds)": [random_search_time, bayesian_search_time, hyperband_search_time],
    "Best Hyperparameters": [
        random_search_best_hyperparameters,
        bayesian_search_best_hyperparameters,
        hyperband_search_best_hyperparameters
    ]
}

# Creating a DataFrame to display the results
results_df = pd.DataFrame(results)
print(results_df)

                  Method  Test Loss  Time (seconds)  \
0          Random Search   0.202476      152.223099   
1  Bayesian Optimization   0.198675      143.298021   
2              Hyperband   0.197792       57.601464   

                                                                                                                                                                                                Best Hyperparameters  
0                                                                                                                {'units_0': 32, 'activation': 'tanh', 'units_1': 32, 'optimizer': 'rmsprop', 'learning_rate': 0.01}  
1                                                                                                            {'units_0': 128, 'activation': 'sigmoid', 'units_1': 32, 'optimizer': 'rmsprop', 'learning_rate': 0.01}  
2  {'units_0': 128, 'activation': 'relu', 'units_1': 64, 'optimizer': 'rmsprop', 'learning_rate': 0.01, 'tuner/epochs': 3, 'tuner/init

In [17]:
bayesian_tuner.results_summary()



Results summary
Results in bayesian_opt\customer_churn
Showing 10 best trials
Objective(name="val_loss", direction="min")

Trial 08 summary
Hyperparameters:
units_0: 128
activation: sigmoid
units_1: 32
optimizer: rmsprop
learning_rate: 0.01
Score: 0.18966706097126007

Trial 02 summary
Hyperparameters:
units_0: 64
activation: sigmoid
units_1: 96
optimizer: rmsprop
learning_rate: 0.01
Score: 0.19087915122509003

Trial 04 summary
Hyperparameters:
units_0: 32
activation: sigmoid
units_1: 64
optimizer: rmsprop
learning_rate: 0.01
Score: 0.19434860348701477

Trial 09 summary
Hyperparameters:
units_0: 128
activation: tanh
units_1: 128
optimizer: adam
learning_rate: 0.01
Score: 0.19737371802330017

Trial 00 summary
Hyperparameters:
units_0: 96
activation: tanh
units_1: 64
optimizer: rmsprop
learning_rate: 0.001
Score: 0.20036818087100983

Trial 07 summary
Hyperparameters:
units_0: 32
activation: tanh
units_1: 64
optimizer: rmsprop
learning_rate: 0.001
Score: 0.2092665135860443

Trial 06 summar

In [18]:
hyperband_tuner.results_summary()

Results summary
Results in hyperband\customer_churn
Showing 10 best trials
Objective(name="val_loss", direction="min")

Trial 0003 summary
Hyperparameters:
units_0: 128
activation: relu
units_1: 64
optimizer: rmsprop
learning_rate: 0.01
tuner/epochs: 3
tuner/initial_epoch: 1
tuner/bracket: 1
tuner/round: 1
tuner/trial_id: 0000
Score: 0.19221267104148865

Trial 0005 summary
Hyperparameters:
units_0: 32
activation: relu
units_1: 64
optimizer: rmsprop
learning_rate: 0.001
tuner/epochs: 3
tuner/initial_epoch: 0
tuner/bracket: 0
tuner/round: 0
Score: 0.19535459578037262

Trial 0000 summary
Hyperparameters:
units_0: 128
activation: relu
units_1: 64
optimizer: rmsprop
learning_rate: 0.01
tuner/epochs: 1
tuner/initial_epoch: 0
tuner/bracket: 1
tuner/round: 0
Score: 0.22235043346881866

Trial 0004 summary
Hyperparameters:
units_0: 64
activation: tanh
units_1: 32
optimizer: rmsprop
learning_rate: 0.0001
tuner/epochs: 3
tuner/initial_epoch: 0
tuner/bracket: 0
tuner/round: 0
Score: 0.2231452167034
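
The tuner/bracket, tuner/round, and tuner/epochs fields above reflect Hyperband's successive-halving schedule. The sketch below is a simplified reconstruction of how the epoch budget per round could be derived from max_epochs. It assumes a reduction factor of 3 (the Keras Tuner default) and is not the library's own code; `hyperband_schedule` is a hypothetical helper.

```python
import math


def hyperband_schedule(max_epochs, factor=3):
    """Map each bracket to the epoch budget of its successive rounds (simplified sketch)."""
    # Count how many times max_epochs can be divided by factor before falling below 1
    num_brackets = 0
    epochs = max_epochs
    while epochs >= 1:
        epochs /= factor
        num_brackets += 1

    schedule = {}
    for bracket in range(num_brackets):
        # Later rounds in a bracket train the surviving trials for more epochs
        schedule[bracket] = [
            math.ceil(max_epochs / factor ** (bracket - round_num))
            for round_num in range(bracket + 1)
        ]
    return schedule


schedule = hyperband_schedule(max_epochs=3)
print(schedule)
```

For max_epochs=3 this gives two brackets: bracket 1 trains trials for 1 epoch and then continues the best ones to 3 epochs, while bracket 0 trains for 3 epochs directly, which agrees with the tuner/epochs values in the summary.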

In [19]:
# Hyperband is selected for the final evaluation because it was the fastest search and reached a test loss similar to Bayesian Optimization.

In [20]:
best_model = hyperband_tuner.get_best_models(num_models=1)[0]
y_pred_prob = best_model.predict(X_test_scaled)  # Get probability outputs
y_pred = (y_pred_prob > 0.5).astype(int)  # Convert probabilities to binary labels (0 or 1)
from sklearn.metrics import accuracy_score

accuracy = accuracy_score(y_test, y_pred)
print(f"Test Accuracy: {accuracy:.4f}")

[1m282/282[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 2ms/step
Test Accuracy: 0.9144


In [21]:
from sklearn.metrics import confusion_matrix

conf_matrix = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:\n", conf_matrix)

print("TP is:", conf_matrix[1,1])
print("TN is:", conf_matrix[0,0])
print("FP is:", conf_matrix[0,1])
print("FN is:", conf_matrix[1,0])

Confusion Matrix:
 [[6784  223]
 [ 547 1446]]
TP is: 1446
TN is: 6784
FP is: 223
FN is: 547
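
The four counts above are enough to derive the usual classification metrics by hand. The sketch below recomputes accuracy, precision, recall, and F1 for the positive class (loan_status = 1) from the printed confusion matrix; the variable names are illustrative.

```python
# Counts copied from the confusion matrix printed above
tp, tn, fp, fn = 1446, 6784, 223, 547

accuracy = (tp + tn) / (tp + tn + fp + fn)  # share of all predictions that are correct
precision = tp / (tp + fp)                  # share of predicted positives that are correct
recall = tp / (tp + fn)                     # share of actual positives that are found
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.4f} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

The recall of about 0.73 means that roughly one in four positive cases is missed at the 0.5 threshold, even though overall accuracy is above 0.91.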


In [22]:
from sklearn.metrics import classification_report

print("Classification Report:\n", classification_report(y_test, y_pred))

Classification Report:
               precision    recall  f1-score   support

           0       0.93      0.97      0.95      7007
           1       0.87      0.73      0.79      1993

    accuracy                           0.91      9000
   macro avg       0.90      0.85      0.87      9000
weighted avg       0.91      0.91      0.91      9000



In [36]:
### Logistic regression

X_train1, X_test1, y_train1, y_test1 = train_test_split(X, y, test_size = 0.3, random_state=42)


# Apply preprocessing (only fit on training data)
X_train_scaled1 = data_transformer.fit_transform(X_train1)  # Fit only on training data 
X_test_scaled1 = data_transformer.transform(X_test1)        # Transform test data


In [37]:
from sklearn.linear_model import LogisticRegression

# Run the logistic regression model

## (a) Define and train the model; report the coefficients and intercept
logreg = LogisticRegression()
logreg.fit(X_train_scaled1, y_train1)
print(logreg.coef_)
print(logreg.intercept_)

## (b) Evaluate model performance: test accuracy
y_pred1 = logreg.predict(X_test_scaled1)
print("Test Accuracy is:", logreg.score(X_test_scaled1, y_test1))


[[ 0.16914444  0.0371894  -0.14033906 -0.62910834  1.0058119   1.38698634
  -0.0159445  -0.44918338  0.02944378  0.18088065 -1.55347438  0.67255987
  -0.78408788  0.00792862 -0.25910475 -0.6595777  -1.14837178  0.00771856
  -7.70466907]]
[-0.50809809]
Test Accuracy is: 0.8938518518518519
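
One way to read the printed coefficients is through the logistic function and odds ratios. The sketch below uses the intercept and the most negative coefficient from the output above; which input column that coefficient belongs to is not identified here, and the variable names are illustrative.

```python
import math


def sigmoid(z):
    """Logistic function mapping a linear score to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


# Values copied from the printed output above
intercept = -0.50809809
most_negative_coef = -7.70466907

# Predicted probability of class 1 when every transformed feature equals zero
baseline_probability = sigmoid(intercept)

# Multiplicative change in the odds of class 1 per one-unit increase in that feature
odds_ratio = math.exp(most_negative_coef)

print(f"baseline probability={baseline_probability:.3f} odds ratio={odds_ratio:.5f}")
```

An odds ratio this close to zero indicates that a one-unit increase in the corresponding feature nearly eliminates the predicted odds of class 1, holding the other features fixed.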


In [38]:
conf_matrix = confusion_matrix(y_test1, y_pred1)
print("Confusion Matrix:\n", conf_matrix)

print("TP is:", conf_matrix[1,1])
print("TN is:", conf_matrix[0,0])
print("FP is:", conf_matrix[0,1])
print("FN is:", conf_matrix[1,0])

from sklearn.metrics import classification_report

print("Classification Report:\n", classification_report(y_test1, y_pred1))

Confusion Matrix:
 [[9822  671]
 [ 762 2245]]
TP is: 2245
TN is: 9822
FP is: 671
FN is: 762
Classification Report:
               precision    recall  f1-score   support

           0       0.93      0.94      0.93     10493
           1       0.77      0.75      0.76      3007

    accuracy                           0.89     13500
   macro avg       0.85      0.84      0.85     13500
weighted avg       0.89      0.89      0.89     13500

