Three well-known machine learning models—Logistic Regression, Decision Tree Classifier, and Random Forest Classifier—will be compared in this analysis in order to better grasp their performance characteristics. To generate actionable insights based on complicated information, machine learning models have been widely employed in a banking data. The models chosen for this comparison have proven successful in various fields and are often used in both academic and real-world settings.
An established statistical model called logistic regression uses input factors to estimate the likelihood of a binary outcome of credit risk (Bad/good). Through a series of feature splits, Decision Tree Classifier uses a hierarchical framework to reach conclusions. The Random Forest Classifier creates an ensemble model by combining many decision trees, using the combined wisdom of the individual trees to provide more accurate predictions.
> **

In [1]:
import pandas as pd
import numpy as np

# importing packages for viz 
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go


In [2]:
# importing package to handle warnings
import warnings
warnings.filterwarnings("ignore")

# Import required csv file from (https://www.kaggle.com/datasets/kabure/german-credit-data-with-risk)
data = pd.read_csv("../input/german-credit-data-with-risk/german_credit_data.csv", index_col=0)

#data = pd.read_csv ('german_credit_data.csv')   
data.drop(data.columns[0], inplace=True, axis=1)
data.head()

Unnamed: 0,Sex,Job,Housing,Saving accounts,Checking account,Credit amount,Duration,Purpose,Risk
0,male,2,own,,little,1169,6,radio/TV,good
1,female,2,own,little,moderate,5951,48,radio/TV,bad
2,male,1,own,little,,2096,12,education,good
3,male,2,free,little,little,7882,42,furniture/equipment,good
4,male,2,free,little,little,4870,24,car,bad


In [3]:
##descriptive measures before data cleaning
data.describe()

Unnamed: 0,Job,Credit amount,Duration
count,1000.0,1000.0,1000.0
mean,1.904,3271.258,20.903
std,0.653614,2822.736876,12.058814
min,0.0,250.0,4.0
25%,2.0,1365.5,12.0
50%,2.0,2319.5,18.0
75%,2.0,3972.25,24.0
max,3.0,18424.0,72.0


In [4]:
### removing Upper case letters

data.columns = [x.lower().replace(" ","_") for x in data.columns]
data.columns

Index(['sex', 'job', 'housing', 'saving_accounts', 'checking_account',
       'credit_amount', 'duration', 'purpose', 'risk'],
      dtype='object')

In [5]:
### Risk Distribution 

fig = px.histogram(data, x='risk', color='risk', title='Credit Risk Distribution')
fig.update_layout(xaxis_title='Credit Risk', yaxis_title='Count')
fig.show()

In [6]:
## checking for missing values


print("Missing values in each column:\n{}".format(data.isnull().sum()))


Missing values in each column:
sex                   0
job                   0
housing               0
saving_accounts     183
checking_account    394
credit_amount         0
duration              0
purpose               0
risk                  0
dtype: int64


In [7]:
print("Unique values in each categorical column:")
for col in data.select_dtypes(include=[object]):
    print(col,":", data[col].unique())

Unique values in each categorical column:
sex : ['male' 'female']
housing : ['own' 'free' 'rent']
saving_accounts : [nan 'little' 'quite rich' 'rich' 'moderate']
checking_account : ['little' 'moderate' nan 'rich']
purpose : ['radio/TV' 'education' 'furniture/equipment' 'car' 'business'
 'domestic appliances' 'repairs' 'vacation/others']
risk : ['good' 'bad']


In [8]:
factors = ['sex', 'housing', 'saving_accounts', 'checking_account', 'purpose','duration']

def visualize_factors(data, col_list, hue='risk'):
    for col in col_list:
        fig = px.histogram(data, x=col, color=hue, title=f'{col} distribution by Credit Risk')
        fig.show()

visualize_factors(data, factors)

In [9]:
fig = px.histogram(data, x='credit_amount', nbins=20,
                   color='risk', marginal='box', 
                   title='Credit Amount Distribution')
fig.show()

In [10]:

# The second column we will replace nan with word "non_exc"
data['saving_accounts'] = data['saving_accounts'].fillna('non_exc')



data["checking_account"].fillna(data['checking_account'].mode()[0], inplace=True)
print("Missing values in each column:\n{}".format(data.isnull().sum()))


Missing values in each column:
sex                 0
job                 0
housing             0
saving_accounts     0
checking_account    0
credit_amount       0
duration            0
purpose             0
risk                0
dtype: int64


In [11]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1000 entries, 0 to 999
Data columns (total 9 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   sex               1000 non-null   object
 1   job               1000 non-null   int64 
 2   housing           1000 non-null   object
 3   saving_accounts   1000 non-null   object
 4   checking_account  1000 non-null   object
 5   credit_amount     1000 non-null   int64 
 6   duration          1000 non-null   int64 
 7   purpose           1000 non-null   object
 8   risk              1000 non-null   object
dtypes: int64(3), object(6)
memory usage: 78.1+ KB


In [12]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix
from sklearn.preprocessing import LabelEncoder

In [13]:
columns_to_encode = ['sex', 'housing',  'purpose', 'risk', 'saving_accounts','checking_account']

# Initialize the LabelEncoder
label_encoder = LabelEncoder()

# Apply label encoding to each column
for column in columns_to_encode:
    data[column] = label_encoder.fit_transform(data[column])

In [14]:
data.head()

Unnamed: 0,sex,job,housing,saving_accounts,checking_account,credit_amount,duration,purpose,risk
0,1,2,1,2,0,1169,6,5,1
1,0,2,1,0,1,5951,48,5,0
2,1,1,1,0,0,2096,12,3,1
3,1,2,0,0,0,7882,42,4,1
4,1,2,0,0,0,4870,24,1,0


In [15]:
## importing required packages for scaling the categorical columns
from sklearn.preprocessing import StandardScaler
stdscaler = StandardScaler()
data[['duration', 'credit_amount', 'job', 'age']] = stdscaler.fit_transform(data[['duration', 'credit_amount', 'job', 'age']])

KeyError: "['age'] not in index"

In [None]:
### Descriptive stats for Transformed and scaled data
data.describe()

In [None]:
## Subsetting independent variables or explanatory variable
explanatory_variable = data.drop("risk", axis=1)
### subsetting response variable 
response_variable = data["risk"]
X_train, X_val, y_train, y_val = train_test_split(explanatory_variable, response_variable, test_size=0.3, random_state=56)

In [None]:
print(X_train.shape)
print(y_train.shape)


In [None]:
from sklearn.metrics import precision_recall_curve, roc_curve, accuracy_score, precision_score, recall_score, f1_score, roc_auc_score

def metrics(y_true, y_pred):
    accuracy = accuracy_score(y_true, y_pred)
    precision = precision_score(y_true, y_pred)
    recall = recall_score(y_true, y_pred)
    f1 = f1_score(y_true, y_pred)
    roc_auc = roc_auc_score(y_true, y_pred)
    
    return accuracy, precision, recall, f1, roc_auc

def perf_measures(y_true, y_pred):
    accuracy, precision, recall, f1_score, roc_auc = metrics(y_true, y_pred)
    print("Accuracy: %.3f\nPrecision: %.3f\nRecall: %.3f\nF1 Score: %.3f\nROC AUC: %.3f" % (accuracy, precision, recall, f1_score, roc_auc))

def plot_roc(y_true, probas):
    fpr, tpr, tresholds = roc_curve(y_true, probas)
    plt.plot(fpr, tpr, color="g")
    plt.plot([0, 1], [0, 1], color="black", linestyle="--")
    plt.title("ROC Curve")
    plt.xlabel("False Positive Rate (FPR)")
    plt.ylabel("True Positive Rate (TPR)")
    fig.show()
    


In [None]:
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.model_selection import GridSearchCV


# define models and parameters
model = LogisticRegression()
solvers = ['newton-cg', 'lbfgs', 'liblinear']
penalty = ['l2']
c_values = [100, 10, 1.0, 0.1, 0.01]
# define grid search
grid = dict(solver=solvers,penalty=penalty,C=c_values)
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
grid_search = GridSearchCV(estimator=model, param_grid=grid, n_jobs=-1, cv=cv, scoring='accuracy',error_score=0)
grid_result = grid_search.fit(explanatory_variable, response_variable)
# summarize results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))

In [None]:
logistic_model = LogisticRegression(C = 0.1, penalty = 'l2', solver = 'liblinear')
logistic_model.fit(X_train, y_train)


In [None]:
logistic_pred = logistic_model.predict(X_val)
logistic_cm = confusion_matrix(y_val, logistic_pred)
logistic_cm
perf_measures(y_val, logistic_pred)
plot_roc(y_val, logistic_pred)

In [None]:




# Importing necessary packages
from scipy.stats import randint
from sklearn.model_selection import RandomizedSearchCV

# parameters and distribution
param_dist = {"max_depth": [3, None],
              "max_features": randint(1, 9),
              "min_samples_leaf": randint(1, 9),
              "criterion": ["gini", "entropy"]}

# training Decision Tree classifier
tree = DecisionTreeClassifier()

# Instantiate the RandomizedSearchCV object: tree_cv
tree_cv = RandomizedSearchCV(tree, param_dist, cv=5)

# Fit it to the data
tree_cv.fit(explanatory_variable,response_variable)

# Print the tuned parameters and score
print("Tuned Decision Tree Parameters: {}".format(tree_cv.best_params_))
print("Best score is {}".format(tree_cv.best_score_))



In [None]:

# Classification Trees
tree_model = DecisionTreeClassifier(criterion = 'entropy', max_depth= 3, max_features= 8, min_samples_leaf= 5)
tree_model.fit(X_train, y_train)


In [None]:
# Decision Trees Classifier
tree_pred = tree_model.predict(X_val)
tree_cm = confusion_matrix(y_val, tree_pred)
perf_measures(y_val,tree_pred)
plot_roc(y_val, tree_pred)


In [None]:
#hyperparameters for RandomForestClassifier
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

# define models and parameters
model = RandomForestClassifier()
n_estimators = [10, 100, 1000]
max_features = ['sqrt', 'log2']
# define grid search
grid = dict(n_estimators=n_estimators,max_features=max_features)
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
grid_search = GridSearchCV(estimator=model, param_grid=grid, n_jobs=-1, cv=cv, scoring='accuracy',error_score=0)
grid_result = grid_search.fit(explanatory_variable,response_variable)
# summarize results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification

model_rf = RandomForestClassifier(max_features='sqrt', n_estimators = 1000)
model_rf.fit(X_train,y_train)
#print(model_rf.fit)

In [None]:
# RandomForest Classiir Trees
rf_pred = model_rf.predict(X_val)
rf_cm = confusion_matrix(y_val, rf_pred)
perf_measures(y_val,rf_pred)
plot_roc(y_val, rf_pred)


****** Due to higher accuracy of 72.3 percent among all models and F1 score of 0.825 the logistic regression model has emerged as the one that best predicts risk factors given the features of the credit data. It is the best option for activities involving credit risk assessment due to its capacity for managing complicated relationships and providing precise predictions. Decision tree classifier has 69 percent accuracy and lowest performer in above three models. 

In [None]:
In conclusion, the comparison of the performance characteristics of Logistic Regression, Decision Tree Classifier, and Random Forest Classifier using German credit data has yielded useful insights. The Logistic regression Classifier has established itself as a reliable and effective model for estimating creditworthiness and evaluating credit risk and making decisions.

In conclusion, the comparison of the performance characteristics of Logistic Regression, Decision Tree Classifier, and Random Forest Classifier using German credit data has yielded useful insights. The Logistic regression Classifier has established itself as a reliable and effective model for estimating creditworthiness and evaluating credit risk and making decisions.******