## Assessing credit risk using various machine learning models

In this notebook we train several machine learning models on the German Credit dataset, then run a fairness assessment with Fairlearn to collect fairness metrics for each trained model. The goal is to pass these metrics to an LLM and ask it which model(s) are most fair.

In [1]:
import itertools
import warnings                   # to deal with warnings

import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder, MinMaxScaler, StandardScaler
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import (accuracy_score, log_loss, confusion_matrix,
                             classification_report, RocCurveDisplay)
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from xgboost import XGBClassifier

pd.options.display.float_format = '{:.2f}'.format
warnings.filterwarnings('ignore')  # silence library warnings in the notebook output

In [2]:
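# Load the dataset from a Google Drive share link and preview it
url='https://drive.google.com/file/d/1IZoVERZH1dSp9zXhlIn8ITQ3CBn8USWp/view?usp=drive_link'
file_id = url.split('/')[-2]   # the file id is the second-to-last segment of the share link
dwn_url = 'https://drive.google.com/uc?id=' + file_id
df = pd.read_csv(dwn_url)

# Treat missing values as their own 'none' category
df = df.replace({np.nan: 'none'})
df.head()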

Unnamed: 0.1,Unnamed: 0,Age,Sex,Job,Housing,Saving accounts,Checking account,Credit amount,Duration,Purpose,Risk
0,0,67,male,2,own,none,little,1169,6,radio/TV,good
1,1,22,female,2,own,little,moderate,5951,48,radio/TV,bad
2,2,49,male,1,own,little,none,2096,12,education,good
3,3,45,male,2,free,little,little,7882,42,furniture/equipment,good
4,4,53,male,2,free,little,little,4870,24,car,bad
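The URL handling in the cell above relies on the file id being the second-to-last segment of the share link. A slightly more defensive version is sketched below; the helper name `drive_download_url` is hypothetical, and the sketch assumes share links of the form `.../file/d/<file_id>/view`.

```python
from urllib.parse import urlparse


def drive_download_url(share_url: str) -> str:
    """Turn a Google Drive share link into the direct-download form used above."""
    # Share links are assumed to look like https://drive.google.com/file/d/<file_id>/view?...
    parts = urlparse(share_url).path.strip("/").split("/")
    if "d" not in parts or parts.index("d") + 1 >= len(parts):
        raise ValueError(f"no file id found in {share_url!r}")
    file_id = parts[parts.index("d") + 1]
    return "https://drive.google.com/uc?id=" + file_id


url = "https://drive.google.com/file/d/1IZoVERZH1dSp9zXhlIn8ITQ3CBn8USWp/view?usp=drive_link"
print(drive_download_url(url))
```

Locating the id relative to the `d` segment, rather than counting from the end, keeps the helper working if the link has no trailing `/view` part.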


In [3]:
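# Separate categorical and numerical features
# Note: 'Credit amount' is numeric but is listed with the categorical columns here,
# so the next cell label-encodes it (e.g. 1169 becomes 142 in the encoded output).

num_df = df[['Age','Job','Duration']]
cat_df = df[['Sex','Housing','Saving accounts','Checking account','Credit amount','Purpose','Risk']].copy()  # copy so the encoding does not write into a slice of df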

In [4]:
# Label-encode each categorical column (integer codes follow the sorted order of the labels)

le = LabelEncoder()  # imported in the first cell; refitted on every column
for col in cat_df:
  cat_df[col] = le.fit_transform(df[col])

# Join encoded data to numeric data

main_df = pd.concat([num_df, cat_df], axis=1)
main_df.head(5)

Unnamed: 0,Age,Job,Duration,Sex,Housing,Saving accounts,Checking account,Credit amount,Purpose,Risk
0,67,2,6,1,1,2,0,142,5,1
1,22,2,48,0,1,0,1,770,5,0
2,49,1,12,1,1,0,2,390,3,1
3,45,2,42,1,0,0,0,848,4,1
4,53,2,24,1,0,0,0,734,1,0
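`LabelEncoder` assigns integer codes in the sorted order of the distinct labels, which is why the first row (`male`, `good`) is encoded as `Sex = 1`, `Risk = 1`. The pure-Python sketch below mimics that ordering so the mapping can be read off directly; `sorted_label_codes` is a hypothetical helper, not part of scikit-learn.

```python
def sorted_label_codes(values):
    """Map each distinct label to an integer code in sorted label order."""
    classes = sorted(set(values))
    return {label: code for code, label in enumerate(classes)}


sex_codes = sorted_label_codes(["male", "female", "male", "male"])
risk_codes = sorted_label_codes(["good", "bad", "good"])

print(sex_codes)   # {'female': 0, 'male': 1}
print(risk_codes)  # {'bad': 0, 'good': 1}
```

Because the loop above refits a single encoder on every column, only the last mapping (for `Risk`) remains available on `le` afterwards; keeping one encoder per column would be needed to decode the other features later.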


In [5]:
# Inputs and outputs

X_df = main_df.drop('Risk', axis=1)
y = main_df['Risk']

# Training and testing splits

X_train, X_test, y_train, y_test = train_test_split(X_df, y, test_size=0.2, random_state=42, stratify=y)

We now choose some models and tune their hyperparameters. `GridSearchCV` runs a cross-validated search over a parameter grid and, with `scoring='accuracy'`, keeps the combination with the highest mean validation accuracy.

In [6]:
# For Random Forest

param_grid_RFC = {
    'n_estimators': [50, 100, 200, 300],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2],
    'max_features': ['auto', 'sqrt', 'log2'],  # 'auto' was removed in scikit-learn 1.3; drop it on newer versions
    'bootstrap': [True, False]
}

grid_RFC = GridSearchCV(RandomForestClassifier(random_state=42),
                    param_grid_RFC,
                    refit=True,
                    verbose=3,
                    scoring='accuracy',  # Optimize for accuracy
                    n_jobs=-1)  # Use all available cores

grid_RFC.fit(X_train, y_train)

print(grid_RFC.best_params_)

Fitting 5 folds for each of 432 candidates, totalling 2160 fits
{'bootstrap': False, 'max_depth': 10, 'max_features': 'sqrt', 'min_samples_leaf': 1, 'min_samples_split': 10, 'n_estimators': 300}
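The "432 candidates, totalling 2160 fits" line follows directly from the grid: the number of candidates is the product of the list lengths, and each candidate is fitted once per cross-validation fold. A small check, assuming the default of 5 folds that the log reports:

```python
from math import prod

param_grid_RFC = {
    'n_estimators': [50, 100, 200, 300],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2],
    'max_features': ['auto', 'sqrt', 'log2'],
    'bootstrap': [True, False],
}

# One candidate per combination of values: 4 * 3 * 3 * 2 * 3 * 2
n_candidates = prod(len(values) for values in param_grid_RFC.values())
cv_folds = 5  # matches "Fitting 5 folds" in the log above

print(n_candidates, n_candidates * cv_folds)  # 432 2160
```

Running this kind of count before launching a search gives a quick estimate of how long the search will take.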


In [8]:
# For SVC

param_grid_SVC = {'C': [0.1, 1, 5],
                  'kernel': ['linear', 'rbf'],
                  'degree': [1, 2, 3, 4],  # only used by the 'poly' kernel, so it has no effect on 'linear' or 'rbf'
                  'gamma' :['scale','auto']}

grid_SVC = GridSearchCV(SVC(),
                    param_grid_SVC,
                    refit=True,
                    verbose=3,
                    scoring = 'accuracy',
                    n_jobs = -1)
grid_SVC.fit(X_train, y_train)

print(grid_SVC.best_params_)

Fitting 5 folds for each of 48 candidates, totalling 240 fits
{'C': 1, 'degree': 1, 'gamma': 'scale', 'kernel': 'linear'}
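Since `degree` only affects the `poly` kernel and `gamma` is not used by the `linear` kernel, many of the 48 candidates above are duplicates of each other. `GridSearchCV` also accepts a list of grids, which lets each kernel carry only the parameters it uses. The sketch below just counts candidates for both layouts; `count_candidates` is a hypothetical helper and the trimmed grid is one possible layout, not the one used in this notebook.

```python
from itertools import product


def count_candidates(grid):
    """Count parameter combinations for a dict grid or a list of dict grids."""
    grids = grid if isinstance(grid, list) else [grid]
    return sum(len(list(product(*g.values()))) for g in grids)


original_grid = {'C': [0.1, 1, 5],
                 'kernel': ['linear', 'rbf'],
                 'degree': [1, 2, 3, 4],
                 'gamma': ['scale', 'auto']}

# One sub-grid per kernel, each listing only the parameters that kernel uses
trimmed_grid = [
    {'C': [0.1, 1, 5], 'kernel': ['linear']},
    {'C': [0.1, 1, 5], 'kernel': ['rbf'], 'gamma': ['scale', 'auto']},
]

print(count_candidates(original_grid), count_candidates(trimmed_grid))  # 48 9
```

The trimmed layout explores the same distinct models with 9 candidates instead of 48.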


In [9]:
# For Logistic regression

param_grid_LR = {
    'penalty': ['l1', 'l2'],
    'C': [0.1, 1, 10],
    'solver': ['liblinear', 'saga'],
    'max_iter': [50, 100, 200, 300]
}

grid_LR = GridSearchCV(LogisticRegression(),
                    param_grid_LR,
                    refit=True,
                    verbose=3,
                    scoring = 'accuracy',
                    n_jobs = -1)

grid_LR.fit(X_train, y_train)

print(grid_LR.best_params_)

Fitting 5 folds for each of 48 candidates, totalling 240 fits
{'C': 0.1, 'max_iter': 50, 'penalty': 'l2', 'solver': 'liblinear'}


In [13]:
# XGBoost

param_grid_XG = {
    'n_estimators': [50, 100, 200, 300],
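    'learning_rate': [0.01, 0.025, 0.5, 0.1, 0.15],  # 0.5 sits out of sequence; possibly intended as 0.05
    'max_depth': [3, 5, 7, 10, 12],
    'subsample': [0.8, 0.9, 1.0],
    'colsample_bytree': [0.8, 0.9, 1.0, 1.2],  # valid range is (0, 1], so 1.2 is out of range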
    'gamma': [0.1, 0.2],
}

grid_XG = GridSearchCV(XGBClassifier(),
                    param_grid_XG,
                    refit=True,
                    verbose=3,
                    scoring='accuracy',  # Optimize for accuracy
                    n_jobs = -1)  # Use all available cores

grid_XG.fit(X_train, y_train)

print(grid_XG.best_params_)

Fitting 5 folds for each of 2400 candidates, totalling 12000 fits
{'colsample_bytree': 0.8, 'gamma': 0.1, 'learning_rate': 0.1, 'max_depth': 3, 'n_estimators': 200, 'subsample': 1.0}


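We now use these tuned hyperparameters to train 5 instances of each model type. Every instance is fitted on the same training split, so instances of a type differ only through randomness inside the estimator itself.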

In [11]:
num_models = 5

RFCS = []
for i in range(num_models):
  RFC = RandomForestClassifier(**grid_RFC.best_params_)
  RFC.fit(X_train, y_train)
  RFCS.append(RFC)

SVCS = []
for i in range(num_models):
  classifier = SVC(**grid_SVC.best_params_)
  classifier.fit(X_train, y_train)
  SVCS.append(classifier)

LRS = []
for i in range(num_models):
  LR = LogisticRegression(**grid_LR.best_params_)
  LR.fit(X_train, y_train)
  LRS.append(LR)

# Does changing the hyperparameters here make any real difference to the fairness?
XGBS = []
for i in range(num_models):
  XGB = XGBClassifier(**grid_XG.best_params_)
  XGB.fit(X_train, y_train)
  XGBS.append(XGB)
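The loops above do not pass a `random_state`, because `best_params_` only contains the parameters that were in the search grid. One way to make the instances both distinct and reproducible is to give each one an explicit seed. The sketch below illustrates this for a random forest on synthetic stand-in data; the data, the `best_params` placeholder and the seed list are assumptions for illustration, not values from this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption); the notebook uses the German Credit data
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

best_params = {"n_estimators": 25, "max_depth": 5}  # placeholder for grid_RFC.best_params_
seeds = [0, 1, 2, 3, 4]                             # one explicit seed per instance

models = [
    RandomForestClassifier(**best_params, random_state=seed).fit(X_tr, y_tr)
    for seed in seeds
]

# Test accuracy of each seeded instance
scores = [model.score(X_te, y_te) for model in models]
print(scores)
```

With explicit seeds, rerunning the cell reproduces the same five forests, which makes later accuracy and fairness numbers comparable across runs.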

We can now calculate the accuracy score for each model and arrange the scores in a matrix with one row per instance and one column per model type.

In [12]:
predictions = []
for i in range(num_models):
  predictions.append(RFCS[i].predict(X_test))
  predictions.append(SVCS[i].predict(X_test))
  predictions.append(LRS[i].predict(X_test))
  predictions.append(XGBS[i].predict(X_test))

accuracy_scores = [accuracy_score(y_test, pred) for pred in predictions]

# Number of columns in the accuracy matrix = number of model types
col_dim = len(accuracy_scores) // num_models

# Place accuracy scores into matrix
accuracy_matrix = np.array(accuracy_scores).reshape(num_models, col_dim)
print(accuracy_matrix)

[[0.775 0.745 0.755 0.74 ]
 [0.78  0.745 0.755 0.74 ]
 [0.785 0.745 0.755 0.74 ]
 [0.765 0.745 0.755 0.74 ]
 [0.775 0.745 0.755 0.74 ]]
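The reshape works because the predictions were appended instance by instance (Random Forest, SVC, Logistic Regression, XGBoost for instance 0, then the same for instance 1, and so on), and NumPy fills rows first. The sketch below replaces the scores with labels to make the resulting layout visible; the short names and the choice of 3 instances are for illustration only.

```python
import numpy as np

model_types = ["RF", "SVC", "LR", "XGB"]
num_models = 3  # instances per type (5 in the notebook)

# Same append order as the prediction loop: every type for instance 0, then instance 1, ...
flat_labels = [f"{name}_{i}" for i in range(num_models) for name in model_types]

# Row-major reshape: one row per instance, one column per model type
label_matrix = np.array(flat_labels).reshape(num_models, len(model_types))
print(label_matrix)
```

If the append order in the loop changed (for example, all Random Forest instances first), the same reshape would silently mix model types within a column.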


**Conducting a simple fairness assessment using Fairlearn**

In [13]:
from sklearn.metrics import accuracy_score, precision_score
from fairlearn.metrics import (
    MetricFrame,
    count,
    false_negative_rate,
    false_positive_rate,
    selection_rate,
)

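We will use the false positive rate (FPR), computed separately for men and women, as the fairness metric for each model. With `Risk` label-encoded as good = 1 and bad = 0, a false positive is a bad-risk applicant that the model predicts to be good.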
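To make the metric concrete, the per-group FPR can be computed by hand as FP / (FP + TN) within each group. The sketch below does this with NumPy on made-up toy labels; the helper name `false_positive_rate_by_group` is hypothetical, and it assumes 1 is the positive class and that every group contains at least one true negative.

```python
import numpy as np


def false_positive_rate_by_group(y_true, y_pred, groups):
    """Return {group: FP / (FP + TN)}, treating 1 as the positive class."""
    y_true, y_pred, groups = map(np.asarray, (y_true, y_pred, groups))
    rates = {}
    for g in np.unique(groups):                 # groups are visited in sorted order
        negatives = (groups == g) & (y_true == 0)
        false_pos = negatives & (y_pred == 1)
        rates[int(g)] = false_pos.sum() / negatives.sum()
    return rates


# Toy data (made up): Risk 1 = good, 0 = bad; Sex 0 = female, 1 = male
y_true = [0, 0, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 1, 1, 0, 1, 0]
sex    = [0, 0, 0, 0, 1, 1, 1, 1]

print(false_positive_rate_by_group(y_true, y_pred, sex))
```

In the toy data, two of the three bad-risk applicants in group 0 are predicted good, against one of three in group 1, so the two rates are 2/3 and 1/3.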

In [14]:
# Select the sensitive feature (label-encoded: 0 = female, 1 = male)
sex = X_test['Sex']
# Select metrics
metrics = {
    #"accuracy": accuracy_score,
    #"precision": precision_score,
    "false positive rate": false_positive_rate,
    #"false negative rate": false_negative_rate,
    #"selection rate": selection_rate,
    #"count": count,
}
metrics_list = []
for i in range(len(predictions)):

    FPRS = MetricFrame(
        metrics=metrics,
        y_true=y_test,
        y_pred=predictions[i],
        sensitive_features=sex,
    )
    metrics_list.append(FPRS.by_group.values)

Now that we have an accuracy score and a fairness metric for each model, we want to arrange them into arrays.

In [15]:
# Axis 0: instance index (num_models instances of each model type)
# Axis 1: model type (Random Forest, SVC, Logistic Regression, XGBoost)
# Axis 2: sensitive group, in MetricFrame's sorted group order (Sex = 0, then Sex = 1)
FPRS_matrix = np.array(metrics_list).reshape(num_models, col_dim, 2)
print(FPRS_matrix)

[[[0.5   0.55 ]
  [0.7   0.75 ]
  [0.7   0.7  ]
  [0.5   0.6  ]]

 [[0.55  0.55 ]
  [0.7   0.75 ]
  [0.7   0.7  ]
  [0.5   0.6  ]]

 [[0.5   0.575]
  [0.7   0.75 ]
  [0.7   0.7  ]
  [0.5   0.6  ]]

 [[0.55  0.6  ]
  [0.7   0.75 ]
  [0.7   0.7  ]
  [0.5   0.6  ]]

 [[0.55  0.6  ]
  [0.7   0.75 ]
  [0.7   0.7  ]
  [0.5   0.6  ]]]


In [16]:
# Swap the first two axes so that each sub-array corresponds to one model type
combined_array = FPRS_matrix.transpose(1, 0, 2)
print(combined_array)

[[[0.5   0.55 ]
  [0.55  0.55 ]
  [0.5   0.575]
  [0.55  0.6  ]
  [0.55  0.6  ]]

 [[0.7   0.75 ]
  [0.7   0.75 ]
  [0.7   0.75 ]
  [0.7   0.75 ]
  [0.7   0.75 ]]

 [[0.7   0.7  ]
  [0.7   0.7  ]
  [0.7   0.7  ]
  [0.7   0.7  ]
  [0.7   0.7  ]]

 [[0.5   0.6  ]
  [0.5   0.6  ]
  [0.5   0.6  ]
  [0.5   0.6  ]
  [0.5   0.6  ]]]
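A single number per model type is often easier to compare than a pair of rates. One simple choice, sketched below, is the absolute FPR difference between the two groups, averaged over the instances of each type. The array is copied from the output printed above; the summary statistic itself is an assumption about what is useful here, not something this notebook computes. Fairlearn's `MetricFrame` also offers a `difference()` method for between-group gaps on a single model.

```python
import numpy as np

model_types = ["Random Forest", "SVC", "Logistic Regression", "XGBoost"]

# FPRs copied from the printed combined_array: axes are (model type, instance, group)
combined_array = np.array([
    [[0.5, 0.55], [0.55, 0.55], [0.5, 0.575], [0.55, 0.6], [0.55, 0.6]],
    [[0.7, 0.75]] * 5,
    [[0.7, 0.7]] * 5,
    [[0.5, 0.6]] * 5,
])

# Absolute FPR difference between the two groups, averaged over instances
gaps = np.abs(combined_array[:, :, 0] - combined_array[:, :, 1]).mean(axis=1)

for name, gap in zip(model_types, gaps):
    print(f"{name}: mean FPR gap = {gap:.3f}")
```

A gap of zero means both groups have the same FPR, but it says nothing about how high that shared FPR is, so the gap is best read alongside the rates themselves and the accuracy.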


In [17]:
print(accuracy_matrix)

[[0.785 0.745 0.755 0.79 ]
 [0.77  0.745 0.755 0.79 ]
 [0.775 0.745 0.755 0.79 ]
 [0.765 0.745 0.755 0.79 ]
 [0.775 0.745 0.755 0.79 ]]


Given an array containing information about the accuracy score of each model, and fairness metrics about each model, can an LLM provide a useful insight into which model is best with respect to accuracy and fairness? How can we prompt a model to give us this information?

In [18]:
# combined_array holds the FPR for each sex for every model:

# [[[{FPR for females} {FPR for males}]
#   [{FPR for females} {FPR for males}] ...
#  ]]]

# Each sub-array represents one model type (e.g. logistic regression), and each row within a sub-array is one trained instance of that type.
# There are 4 model types: Random Forest, SVM, Logistic Regression and XGBoost.
# There are 5 instances of each model type.
# Each row gives the false positive rate for females and males respectively, because Sex is label-encoded as female = 0, male = 1 and MetricFrame lists groups in sorted order.

# accuracy_matrix holds the accuracy score for each model.
# Each column represents a model type, in the order Random Forest, SVM, Logistic Regression, XGBoost.
# Each row is one instance of each model type; there are 5 in total.

The following array contains accuracy scores for 4 types of machine learning model. There are 5 instances of each model type. Each column represents a model type and each row represents an instance of that type. The first column is Random Forest, the second is SVM, the third is logistic regression and the fourth is XGBoost.

```
[[0.77  0.7   0.74  0.795]
 [0.79  0.7   0.74  0.795]
 [0.775 0.7   0.74  0.795]
 [0.77  0.7   0.74  0.795]
 [0.765 0.7   0.74  0.795]]
```
The next array contains the false positive rates (FPR) of the models. Each sub array holds the FPRs for one model type, in the same order as above: the first sub array is for Random Forest, the second for SVM, the third for logistic regression and the fourth for XGBoost. Each row in a sub array corresponds to one instance of that model type. The first column in each sub array is the FPR for women and the second column is the FPR for men.

 ```

[[[0.5   0.525]
  [0.5   0.525]
  [0.5   0.5  ]
  [0.5   0.475]
  [0.5   0.475]]

 [[1.    1.   ]
  [1.    1.   ]
  [1.    1.   ]
  [1.    1.   ]
  [1.    1.   ]]

 [[0.7   0.7  ]
  [0.7   0.7  ]
  [0.7   0.7  ]
  [0.7   0.7  ]
  [0.7   0.7  ]]

 [[0.45  0.625]
  [0.45  0.625]
  [0.45  0.625]
  [0.45  0.625]
  [0.45  0.625]]]

```

Based on the information above, which model provides the best trade-off between accuracy and fairness?
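Rather than pasting arrays by hand, a prompt like the one above could be assembled from the arrays themselves, which keeps the numbers in the prompt in sync with the latest run. The sketch below is one possible layout; `build_prompt`, its wording and the example values are assumptions for illustration, and the group names are passed in explicitly so that the column order is stated rather than implied.

```python
import numpy as np


def build_prompt(accuracy_matrix, fpr_array, model_types, group_names):
    """Format accuracy scores and per-group FPRs into a single prompt string."""
    accuracy_matrix = np.asarray(accuracy_matrix)
    fpr_array = np.asarray(fpr_array)          # axes: (model type, instance, group)
    n_types, n_instances, _ = fpr_array.shape
    lines = [
        f"Accuracy scores for {n_types} model types, {n_instances} instances each.",
        f"Columns, in order: {', '.join(model_types)}.",
        np.array2string(accuracy_matrix, precision=3),
        "",
        "False positive rates (FPR) per model type.",
        f"Columns in each sub-array, in order: {', '.join(group_names)}.",
    ]
    for name, block in zip(model_types, fpr_array):
        lines.append(f"{name}:")
        lines.append(np.array2string(block, precision=3))
    lines.append("")
    lines.append("Which model type offers the best trade-off between accuracy and fairness?")
    return "\n".join(lines)


# Made-up example values: 2 model types, 3 instances, 2 groups
example_accuracy = [[0.77, 0.80], [0.79, 0.80], [0.78, 0.80]]
example_fpr = [
    [[0.50, 0.55], [0.50, 0.50], [0.55, 0.60]],
    [[0.45, 0.60], [0.45, 0.60], [0.45, 0.60]],
]
prompt = build_prompt(
    example_accuracy, example_fpr,
    model_types=["Random Forest", "XGBoost"],
    group_names=["Sex = 0", "Sex = 1"],
)
print(prompt)
```

Labelling each FPR block with its model type removes the need for the reader (human or LLM) to track sub-array positions.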



In [19]:
# print FPR and other metrics

# train multiple models to obtain a model space

# use LLM to suggest model parameters to optimise for fairness and accuracy tradeoff

# Need to optimise the prompt for this.