<h1>Assignment 4</h1>

Lawrence Technological University<br>
Department of Math and Computer Science<br>
MCS 5623 Machine Learning<br>
Assignment # 04<br>
Michael Giuliani<br>
10/27/2024<br>

Credit Card Default Prediction (team assignment, maximum of two students) <br>

The aim here is to predict whether a client will default or not. The data and information about attributes are available at<br>
    https://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients <br>
You will follow the different stages of data pre-processing: data exploration/visualization, checking for missing values, duplicate removal, and feature selection/reduction.<br>
Once you have downloaded the data, you will prepare a data visualization report. Feel free to provide any additional visualization that might help in better understanding the data. Write a paragraph about the characteristics of the data you see via visualization.<br>
You must perform N-fold cross-validation of your models and report the mean and standard deviation of any performance metric you use.<br>
Your final model must be implemented and demonstrated as an app using the Gradio platform.<br>
For using Gradio, watch the YouTube video: Building and deploying your first machine learning app in Python using Gradio<br>


In [97]:
import pandas as pd # for statistical calculations and data handling
import numpy as np # for math operations and data handling
import gradio as gr # for user interface demo
from sklearn.model_selection import cross_val_score, StratifiedKFold # K Fold cross validation
from sklearn.preprocessing import MinMaxScaler # Feature scaling
from sklearn.linear_model import LogisticRegression # Logistic regression classifier
from sklearn.ensemble import RandomForestClassifier # Random forest classifier
import matplotlib.pyplot as plt # Plotting
import seaborn as sns # Correlation heatmap plotting
import time
import warnings # Suppressing unwanted warnings
warnings.filterwarnings("ignore")

<br><br><br><h1>Data Loading and Preprocessing</h1>

In [98]:
dataset = pd.read_excel("default of credit card clients.xls") # read the dataset into a Pandas dataframe

# rename the columns using the descriptions for readability, no reason to use the X notation
new_column_names = dataset.iloc[0].tolist() # get the column descriptions
old_column_names = dataset.columns.tolist() # get the original column names
renaming_template = dict(zip(old_column_names, new_column_names)) # match the X_ notated column names to their descriptions
renaming_template['X6'] = 'PAY_1' # Fix description (new label) for the first pay column from PAY_0, since the other columns start at 1
dataset.rename(columns=renaming_template, inplace = True) # modify the original dataset with the new column names

dataset.drop(0, inplace = True) # drop the description row since it is no longer needed

dataset.drop(['ID'], axis = 1, inplace = True) # drop the ID column since it just numbers the rows, not needed

dataset.drop_duplicates(inplace=True) # with ID gone, rows that differed only by ID are now exact duplicates, so drop them
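
Because the description row is read as data, every column initially mixes one string with numbers, so pandas stores the columns as `object`. The sketch below reproduces that layout on a tiny, made-up frame (the `raw` and `tidy` names and all values are hypothetical, not taken from the real spreadsheet) and shows one way to promote the first row to headers, convert to numeric dtypes, and then drop the duplicates that removing the ID column exposes. It is an illustration under those assumptions, not a replacement for the cell above.

```python
import pandas as pd

# Hypothetical miniature of the spreadsheet layout: generic headers,
# with the real column names stored in the first data row.
raw = pd.DataFrame({
    "Unnamed: 0": ["ID", 1, 2, 3],
    "X1": ["LIMIT_BAL", 20000, 120000, 20000],
    "X5": ["AGE", 24, 26, 24],
    "Y": ["default payment next month", 1, 0, 1],
})

# Promote the first row to column names, then leave it out of the data.
tidy = raw.iloc[1:].copy()
tidy.columns = raw.iloc[0].tolist()

# The columns are still object dtype; convert each one to a numeric dtype.
tidy = tidy.apply(pd.to_numeric)

# Drop the identifier, then remove rows that have become identical.
tidy = tidy.drop(columns=["ID"]).drop_duplicates().reset_index(drop=True)

print(tidy.dtypes)
print(tidy)
```

Rows 1 and 3 of the toy data differ only by ID, so only two rows remain after `drop_duplicates()`.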

<br><br><br><h1>Visualize Dataset</h1>

Print Info

In [99]:
def print_dataset_info():
    head = dataset.head() # see what the data looks like
    description = dataset.describe(include='all')  # get summary statistics for every column
    description_labels = description.index.tolist()  # get labels for the description rows
    description.insert(0, 'Description', description_labels) # add labels for the descriptions for better visualization

    null_counts = dataset.isnull().sum() # check for missing values
    null_counts = pd.DataFrame(null_counts).reset_index() # have to make this back into a dataframe for Gradio to be happy
    null_counts.columns = ['Column Name', 'Null Count'] # adding some info to make the visualization more readable

    duplicate_counts = int(dataset.duplicated().sum()) # check for duplicate rows

    class_label_counts = dataset['default payment next month'].value_counts() # get the count of how many defaulted or not defaulted samples are in the dataset
    class_label_counts = pd.DataFrame(class_label_counts).reset_index() # have to make this back into a dataframe for Gradio to be happy
    label_names = {0: "Not Defaulted", 1: "Defaulted"} # readable row labels for visualization
    class_label_counts['default payment next month'] = class_label_counts['default payment next month'].map(label_names) # map by value instead of row position, which also avoids chained assignment

    return head, description, null_counts, duplicate_counts, class_label_counts

Correlation Heatmap

In [100]:
def cross_corr_heatmap():
    columns = ['LIMIT_BAL', 'SEX', 'EDUCATION', 'MARRIAGE', 'AGE', # make a list of all the columns we want to cross correlate
                'PAY_1', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6',
                'BILL_AMT1', 'BILL_AMT2', 'BILL_AMT3', 'BILL_AMT4', 'BILL_AMT5', 'BILL_AMT6',
                'PAY_AMT1', 'PAY_AMT2', 'PAY_AMT3', 'PAY_AMT4', 'PAY_AMT5', 'PAY_AMT6', 'default payment next month']

    columns_to_corr = dataset[columns]  # get the data from the columns we're correlating
    corr_matrix = columns_to_corr.corr() # get cross correlations

    fig, _ = plt.subplots(figsize = (20, 8))
    sns.heatmap(corr_matrix, annot = True, cmap = 'coolwarm', vmin = -1, vmax = 1) # plot correlations, -1 strong negative correlation (blue), 1 strong positive correlation (red)
    plt.title('Default of Credit Card Clients Dataset Correlation Matrix')
    plt.show()

    return fig
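
The heatmap can also be read numerically for feature selection by ranking every column by the absolute value of its correlation with the label. The sketch below does this on synthetic data; the column names (`strong_feature`, `weak_feature`, `noise_feature`, `label`) and the noise levels are assumptions chosen only to make the ranking visible, and nothing here is computed from the credit card dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 2000
label = rng.integers(0, 2, size=n)  # synthetic binary target

# Hypothetical features with decreasing dependence on the target.
demo = pd.DataFrame({
    "strong_feature": label + rng.normal(0, 0.5, size=n),
    "weak_feature": label + rng.normal(0, 3.0, size=n),
    "noise_feature": rng.normal(0, 1.0, size=n),
    "label": label,
})

# Take the label's column of the correlation matrix, drop the label's
# self-correlation, and sort by magnitude so the sign does not matter.
ranking = demo.corr()["label"].drop("label").abs().sort_values(ascending=False)
print(ranking)
```

Applied to the real data, the same expression with `'default payment next month'` in place of `"label"` would list the columns from most to least linearly related to default.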


Age Vs. Population Defaulted (Negligible Correlation)

In [101]:
def plot_age_dist():
    defaulted_by_age_counts = dataset.groupby(['AGE', 'default payment next month']).size().unstack(fill_value = 0) # get ages and label of whether each person defaulted
    
    fig, ax = plt.subplots(figsize = (12, 5))
    defaulted_by_age_counts.plot(kind = 'bar', stacked = True, ax = ax) # plot one stacked bar per age, split into those that defaulted and those that didn't
    plt.xlabel('Age')
    plt.ylabel('Population')
    plt.title('Defaulted Payment by Age (Negligible Correlation)')
    plt.legend(title = 'Defaulted', labels = ['No', 'Yes'])
    
    return fig

History of Past Payment (September) Vs. Population Defaulted (Medium-Low Positive Correlation)

In [102]:
def plot_pay_hist_vs_defaulted():
    defaulted_by_pay_counts = dataset.groupby(['PAY_1', 'default payment next month']).size().unstack(fill_value = 0) # get first month's pay history and whether the person defaulted
    
    fig, ax = plt.subplots(figsize = (12, 5))
    defaulted_by_pay_counts.plot(kind = 'bar', stacked = True, ax = ax) # plot one stacked bar per pay-history value, split into those that defaulted and those that didn't
    plt.xlabel('Months of Payment Delay')
    plt.ylabel('Population')
    plt.title('History of Past Payment (September) Vs. Population Defaulted (Medium-Low Positive Correlation)')
    plt.legend(title = 'Defaulted', labels = ['No', 'Yes'])
    plt.show()

    return fig

Balance Limit Vs. Population Defaulted (Low Negative Correlation)

In [103]:
def plot_balance_lim_vs_defaulted():
    defaulted_by_limit_counts = dataset.groupby(['LIMIT_BAL', 'default payment next month']).size().unstack(fill_value = 0) # get account balance limits and whether the person defaulted
    
    fig, ax = plt.subplots(figsize = (20, 5))
    defaulted_by_limit_counts.plot(kind = 'bar', stacked = True, ax = ax) # plot one stacked bar per balance limit, split into those that defaulted and those that didn't
    plt.xlabel('Balance Limit')
    plt.ylabel('Population')
    plt.title('Balance Limit Vs. Population Defaulted (Low Negative Correlation)')
    plt.legend(title = 'Defaulted', labels = ['No', 'Yes'])
    plt.show()

    return fig
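
A bar per distinct balance limit produces a crowded axis, so one alternative is to bin the limits and compare the default rate per bin. The following is a minimal sketch on randomly generated data: the bin edges, the `demo` frame, and the `limit_bin` column are all hypothetical choices, and the random labels mean the printed rates carry no information about the real dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# Synthetic stand-in for the two relevant columns.
demo = pd.DataFrame({
    "LIMIT_BAL": rng.integers(1, 81, size=1000) * 10_000,  # limits from 10,000 to 800,000
    "default payment next month": rng.integers(0, 2, size=1000),  # random 0/1 labels
})

# Assumed bin edges; each interval is open on the left and closed on the right.
bins = [0, 50_000, 100_000, 200_000, 400_000, 800_000]
demo["limit_bin"] = pd.cut(demo["LIMIT_BAL"], bins=bins)

# The mean of a 0/1 label within a bin is the fraction that defaulted.
default_rate = demo.groupby("limit_bin", observed=False)["default payment next month"].mean()
print(default_rate)
```

Calling `default_rate.plot(kind="bar")` would then give five bars instead of one per unique limit.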

Class Split

In [104]:
def plot_class_split():
    class_label_counts = dataset['default payment next month'].value_counts() # get the count across the dataset of how many people did or didn't default
    class_label_counts = pd.DataFrame(class_label_counts).reset_index() # have to make this back into a dataframe for Gradio to be happy
    label_names = {0: "Not Defaulted", 1: "Defaulted"} # readable row labels for visualization
    class_label_counts['default payment next month'] = class_label_counts['default payment next month'].map(label_names) # map by value instead of row position, which also avoids chained assignment
    
    fig, ax = plt.subplots()
    class_label_counts.plot(kind = 'bar', x = 'default payment next month', y = 'count', ax = ax, legend = False) # use the readable labels on the x axis
    plt.xlabel('Class')
    plt.ylabel('Population')
    plt.title('Dataset Class Label Split')
    plt.show()

    return fig

<br><br><br><h1>Logistic Regression Classifier</h1>

In [105]:
def lin_reg_classifier(training_columns):
    data = dataset[training_columns] # get the column(s) being used for classification
    labels = dataset['default payment next month'].astype(int) # get the labels and convert them to ints

    scaler = MinMaxScaler() # create a scaler object that will scale the data between 0 and 1
    scaled_data = scaler.fit_transform(data) # scale the data

    # set up a K-Fold Cross Validation object to use 1/5th of the data for testing
    # shuffle the data before splitting it
    # use a set randomization seed for reproducibility
    # Stratified K-Fold ensures that the percent split of classes in each fold matches the split in the whole dataset
    stratified_k_fold = StratifiedKFold(n_splits = 5, shuffle = True, random_state = 1)

    model = LogisticRegression(max_iter = 1000, solver='newton-cholesky') # set up a logistic regression object with a maximum number of solver iterations to run before stopping if it has not converged, 'newton-cholesky' decreases execution time

    # run cross validation using the logistic regression model, min-max scaled data, truth labels, and stratified K-Fold object, and return the accuracy for each fold
    start_time = time.time()
    accuracy = cross_val_score(model, scaled_data, labels, cv = stratified_k_fold, scoring = 'accuracy')
    end_time = time.time()
    execution_time = end_time - start_time

    mean_accuracy = np.mean(accuracy) # calc average accuracy
    accuracy_std_deviation = np.std(accuracy) # get the std dev of the accuracies

    scores_2d = [[score] for score in accuracy] # put each fold's score in its own row so Gradio can display them as a dataframe
    mean_accuracy_str = f"{mean_accuracy:.4f}"
    std_deviation_str = f"{accuracy_std_deviation:.4f}"

    return scores_2d, mean_accuracy_str, std_deviation_str, execution_time
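
The function above fits the scaler on the full dataset before cross-validation, so each test fold's minimum and maximum influence the scaling of its training folds. A common alternative is to wrap the scaler and model in a scikit-learn `Pipeline`, which refits the scaler on the training portion of every fold. The sketch below shows that pattern on synthetic data from `make_classification`; the sample count, feature count, and class weights are assumptions for illustration only, and the default solver is used so the example does not depend on a particular scikit-learn version.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic, imbalanced stand-in for the real features and labels.
X, y = make_classification(n_samples=500, n_features=6,
                           weights=[0.78, 0.22], random_state=1)

# The pipeline is cloned and refit per fold, so MinMaxScaler only
# ever sees that fold's training rows.
pipeline = make_pipeline(MinMaxScaler(), LogisticRegression(max_iter=1000))

folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
scores = cross_val_score(pipeline, X, y, cv=folds, scoring="accuracy")

print(f"mean={scores.mean():.4f} std={scores.std():.4f}")
```

With min-max scaling the difference in scores is usually small, but the pipeline keeps the reported mean and standard deviation free of information from the held-out fold.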

<br><br><br><h1>Random Forest Classifier</h1>

In [106]:
def random_forest_classifier(training_columns):
    data = dataset[training_columns] # get the column(s) being used for classification
    labels = dataset['default payment next month'].astype(int) # get the labels and convert them to ints

    # set up a K-Fold Cross Validation object to use 1/5th of the data for testing
    # shuffle the data before splitting it
    # use a set randomization seed for reproducibility
    # Stratified K-Fold ensures that the percent split of classes in each fold matches the split in the whole dataset
    stratified_k_fold = StratifiedKFold(n_splits = 5, shuffle = True, random_state = 1)

    rf_classifier = RandomForestClassifier(random_state = 1, max_depth = 1, n_estimators = 1) # create a random forest classifier object with a set seed for reproducibility. max_depth and n_estimators being 1 creates a single decision stump, which gave the best accuracy here when training on PAY_1
    
    # get predictions from cross validation using the random forest model, data, truth labels, stratified K-Fold object, and return the accuracy for each fold
    start_time = time.time()
    accuracy = cross_val_score(rf_classifier, data, labels, cv = stratified_k_fold, scoring = 'accuracy')
    end_time = time.time()
    execution_time = end_time - start_time
    
    mean_accuracy = np.mean(accuracy) # calc average accuracy
    accuracy_std_deviation = np.std(accuracy) # get the std dev of the accuracies

    scores_2d = [[score] for score in accuracy] # put each fold's score in its own row so Gradio can display them as a dataframe
    mean_accuracy_str = f"{mean_accuracy:.4f}" 
    std_deviation_str = f"{accuracy_std_deviation:.4f}"

    return scores_2d, mean_accuracy_str, std_deviation_str, execution_time
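
Since the classes are not evenly split, accuracy alone can look good for a model that mostly predicts the majority class. One way to put the numbers in context is to cross-validate a majority-class baseline next to the model and to report a second metric such as balanced accuracy. The sketch below does this with `cross_validate` on synthetic data; the dataset, the `models` and `results` dictionaries, and the choice of balanced accuracy are assumptions for illustration and do not come from the assignment.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic, imbalanced stand-in for the real features and labels.
X, y = make_classification(n_samples=600, n_features=6,
                           weights=[0.78, 0.22], random_state=1)

folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
metrics = ["accuracy", "balanced_accuracy"]

models = {
    "majority baseline": DummyClassifier(strategy="most_frequent"),
    "decision stump": RandomForestClassifier(n_estimators=1, max_depth=1, random_state=1),
}

results = {}
for name, model in models.items():
    cv = cross_validate(model, X, y, cv=folds, scoring=metrics)
    # Store (mean, standard deviation) across folds for every metric.
    results[name] = {m: (cv[f"test_{m}"].mean(), cv[f"test_{m}"].std())
                     for m in metrics}
    for m, (mean, std) in results[name].items():
        print(f"{name:18s} {m:18s} mean={mean:.4f} std={std:.4f}")
```

The baseline always predicts the majority class, so its balanced accuracy is 0.5 even though its plain accuracy is close to the majority-class share; a useful model should beat it on both.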

<br><br><br><h1>Gradio</h1>

In [107]:
# Dataset Plots Interface Wrapper
def visualize_data(): # wrapper function so the gradio interface will plot everything at once for us
    heatmap = cross_corr_heatmap() # overall cross correlation heatmap
    plot1 = plot_age_dist() # age distribution
    plot2 = plot_pay_hist_vs_defaulted() # how on-time payments were at month 1 vs. if they defaulted
    plot3 = plot_balance_lim_vs_defaulted() # credit limit vs. if they defaulted
    plot4 = plot_class_split() # total population class label split 0 = not defaulted, 1 = defaulted

    return heatmap, plot1, plot2, plot3, plot4

In [108]:
with gr.Blocks() as credit_default_interface: # make a different tab for every piece of the assignment/application
    with gr.Tab("Dataset Info"):
        gr.Interface(fn = print_dataset_info, 
                    inputs = [], 
                    outputs = [gr.Dataframe(label = "Example Rows"),
                               gr.Dataframe(label = "Basic Description Info"),
                               gr.Dataframe(label = "Missing Values"),
                               gr.Number(label = "Duplicate Value Count"),
                               gr.Dataframe(label = "Class Representation"),])
    
    with gr.Tab("Dataset Plots"):
        gr.Interface(fn = visualize_data, 
                    inputs = [], 
                    outputs = [gr.Plot(), 
                               gr.Plot(), 
                               gr.Plot(),
                               gr.Plot(),
                               gr.Plot()])
        
    with gr.Tab("Logistic Regression Classification"):
        gr.Interface(fn = lin_reg_classifier, 
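                    inputs = gr.CheckboxGroup(label = "Select Columns", choices=dataset.columns.drop('default payment next month').tolist()), # offer every column except the label so the target cannot be selected as a feature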
                    outputs = [gr.Dataframe(label = "Fold Accuracies"),
                               gr.Textbox(label = "Accuracy Mean"),
                               gr.Textbox(label = "Accuracy Standard Deviation"),
                               gr.Number(label = "Execution Time (s)")])
        
    with gr.Tab("Random Forest Classification"):
        gr.Interface(fn = random_forest_classifier, 
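                    inputs = gr.CheckboxGroup(label = "Select Columns", choices=dataset.columns.drop('default payment next month').tolist()), # offer every column except the label so the target cannot be selected as a feature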
                    outputs = [gr.Dataframe(label = "Fold Accuracies"),
                               gr.Textbox(label = "Accuracy Mean"),
                               gr.Textbox(label = "Accuracy Standard Deviation"),
                               gr.Number(label = "Execution Time (s)")])

credit_default_interface.launch(share = True)

Running on local URL:  http://127.0.0.1:7868
Running on public URL: https://373d74a3d2a6dedb5c.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from Terminal to deploy to Spaces (https://huggingface.co/spaces)


