### LENDING CLUB - CASE STUDY

This project explores the Lending Club dataset, which contains information about loans issued through the platform. The goal is to predict whether a loan will be fully repaid or defaulted by applying the key stages of a Machine Learning workflow: data preparation, model selection, and evaluation.

To assess performance, different models are compared and ranked using metrics such as AUC, accuracy, recall, and F1-score.


### Variable Description

<ul> 
<li> revol_bal – Total revolving credit balance of the borrower.
<li> dti – Debt-to-income ratio: percentage of the borrower’s monthly income allocated to debt payments.
<li> funded_amnt_inv – Loan amount funded by investors.
<li> revol_util – Revolving credit utilization rate (percentage of available credit in use).
<li> annual_inc – Declared annual income of the borrower.
<li> funded_amnt – Loan amount approved by the platform.
<li> loan_amnt – Loan amount requested by the borrower.
<li> term – Loan term (repayment period).
<li> grade – Credit grade assigned by the platform based on risk assessment.
<li> delinq_2yrs – Number of borrower’s delinquencies in the past 2 years.
<li> Fully Paid – Target variable (1 = Fully Paid, 0 = Default). Indicates whether the loan was fully repaid or defaulted.
</ul>


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, confusion_matrix, classification_report, roc_curve, auc
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier


In [None]:
# Load dataset

df = pd.read_csv("/Users/paolaavellino/Desktop/GITHUB/lending club loans.csv", sep=";")

In [None]:
# Display the first five rows to get an initial overview of the dataset

df.head(5)



In [None]:
# Display the last five rows to review the end of the dataset

df.tail(5)

In [None]:
# Check descriptive statistics for the dataset columns

df.describe()

In [None]:
# Check the column names

print(df.columns)

# Confirm the number of rows and columns

(df.shape)

In [None]:
# Examine the entire dataset to identify missing values

pd.set_option('display.max_rows', None)  
pd.set_option('display.max_columns', None) 

print(df.isnull().sum())  

In [None]:
# Calculate the share of missing values in each column (0 to 1; multiply by 100 for a percentage)

df.isnull().sum()/df.shape[0]

In [None]:
# Drop columns where all values are missing

df = df.dropna(axis=1, how='all') 

In [None]:
# Check the number of columns remaining after cleaning

df.shape

In [None]:
# Recheck the percentage of missing values in the columns

df.isnull().sum()/df.shape[0]

In [None]:
# Remove columns with more than 40% missing values

columns_to_drop = [
    "mths_since_last_delinq", "mths_since_last_record", "next_pymnt_d",
    "debt_settlement_flag_date", "settlement_status", "settlement_date",
    "settlement_amount", "settlement_percentage", "settlement_term"
]

df = df.drop(columns=columns_to_drop)
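
The list of columns above was written by hand after inspecting the missing-value shares. As a minimal sketch, the same 40% rule can be applied programmatically. The `toy` DataFrame and the `threshold` and `high_missing` names are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the Lending Club DataFrame (hypothetical values)
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],                  # no missing values
    "b": [1, np.nan, np.nan, np.nan, 5],   # 60% missing
    "c": [1, 2, np.nan, 4, 5],             # 20% missing
})

threshold = 0.40  # assumed cut-off, matching the 40% rule used above

# isnull().mean() gives the fraction of missing values per column
missing_share = toy.isnull().mean()
high_missing = missing_share[missing_share > threshold].index.tolist()

toy_clean = toy.drop(columns=high_missing)
print(high_missing)
print(toy_clean.columns.tolist())
```

Deriving the list from the threshold keeps the drop step in sync with the data if the file changes.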

In [None]:
# Verify that columns with more than 40% missing values were successfully removed

df.isnull().sum()/df.shape[0]

In [None]:
# Check the data types and non-null counts of each column

df.info()

In [None]:
# Reset the display options to their defaults so that large outputs are truncated again

pd.reset_option('display.max_rows')
pd.reset_option('display.max_columns')

In [None]:
# Display the DataFrame and its columns

df

In [None]:
df.columns

In [None]:
# Remove four variables considered irrelevant for the model

df = df.drop(columns=['desc', 'zip_code', 'emp_title', 'addr_state'])

# Remove variables containing post-loan information, as they won't be used in the models

df.drop(columns=[
    'out_prncp', 'out_prncp_inv', 'total_pymnt', 'total_pymnt_inv',
    'total_rec_prncp', 'total_rec_int', 'total_rec_late_fee',
    'recoveries', 'collection_recovery_fee', 'last_pymnt_d',
    'last_pymnt_amnt', 'last_credit_pull_d'
], inplace=True)


In [None]:
# Fix columns 'int_rate' and 'revol_util' recognized as objects instead of numeric (float) 

# Replace commas with dots as decimal separators
df['int_rate'] = df['int_rate'].str.replace(',', '.', regex=False)
df['revol_util'] = df['revol_util'].str.replace(',', '.', regex=False)

# Convert columns to numeric (float)
df['int_rate'] = pd.to_numeric(df['int_rate'], errors='coerce')
df['revol_util'] = pd.to_numeric(df['revol_util'], errors='coerce')

# Verify that the changes were applied correctly
print(df[['int_rate', 'revol_util']].head())
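
If the decimal comma is the only reason these columns are read as objects (an assumption, since the raw file is not shown here), `pd.read_csv` can parse them directly through its `decimal` parameter. The sketch below uses a small in-memory CSV with made-up values rather than the real file.

```python
import io
import pandas as pd

# Hypothetical two-row sample using ';' as separator and ',' as decimal mark
raw = "int_rate;revol_util\n10,65;83,7\n15,27;9,4\n"

# decimal="," tells pandas to treat the comma as the decimal separator
toy = pd.read_csv(io.StringIO(raw), sep=";", decimal=",")

print(toy.dtypes)
print(toy)
```

With this option the two replace-and-convert steps above would not be needed, provided the columns contain no other non-numeric characters.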

In [None]:
# Convert columns recognized as objects into proper datetime format

df['issue_d'] = pd.to_datetime(df['issue_d'], format='%d/%m/%y', errors='coerce')
df['earliest_cr_line'] = pd.to_datetime(df['earliest_cr_line'], format='%d/%m/%y', errors='coerce')


In [None]:
# Convert 'emp_length' column values into a numeric format for better analysis

def convert_emp_length(emp_length):
    if isinstance(emp_length, str):
        if "10+" in emp_length:
            return 10
        elif "< 1 year" in emp_length:
            return 0.5
        elif "year" in emp_length:
            num = ''.join(filter(str.isdigit, emp_length))
            return int(num) if num else None
        elif emp_length == "":
            return None  # Treat empty strings as missing so the column stays numeric
        elif "n/a" in emp_length.lower():
            return None
    return None  

df['emp_length_num'] = df['emp_length'].apply(convert_emp_length)

df[['emp_length', 'emp_length_num']]
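
The same mapping can also be expressed with vectorized string methods. This is a sketch on a hypothetical sample `s`, and it assumes the labels follow the patterns handled above ("10+ years", "< 1 year", "3 years", "n/a").

```python
import pandas as pd

# Hypothetical sample of emp_length labels
s = pd.Series(["10+ years", "< 1 year", "3 years", "1 year", "n/a", None])

# Extract the first run of digits; labels without digits become NaN
num = pd.to_numeric(s.str.extract(r"(\d+)")[0], errors="coerce")

# "< 1 year" contains the digit 1, so override it with 0.5 as in the function above
num = num.mask(s.str.contains("<", na=False), 0.5)

print(num)
```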


In [None]:
df.info()



For the models, I selected variables that do not rely on post-default information and that are relevant for predicting whether a loan will be fully repaid or defaulted.

Independent variables:

<ul>
<li> loan_amnt – Loan amount.
<li> int_rate – Interest rate.
<li> grade – Loan grade.
<li> home_ownership – Type of home ownership.
<li> term – Loan term.
<li> annual_inc – Annual income.
<li> purpose – Loan purpose.
<li> emp_length_num – Employment length.
</ul>

Dependent variable:

<ul>
<li> loan_status – Loan status (target variable).
</ul>

In [None]:
# Convert the categorical variables into numeric dummy variables for modeling

grade = pd.get_dummies(df['grade'],drop_first=True)

grade = grade * 1

grade.head()

In [None]:
home_ownership = pd.get_dummies(df['home_ownership'],drop_first=True)

home_ownership = home_ownership * 1

home_ownership.head()

In [None]:
term = pd.get_dummies(df['term'], drop_first=False)

term = term *1

term.head()

In [None]:

purpose = pd.get_dummies(df['purpose'], drop_first=False)

purpose = purpose * 1

purpose.head()


In [None]:
# Fill missing values in 'loan_status' with 'Missing' and create dummy variables for modeling

df['loan_status'] = df['loan_status'].fillna('Missing')  

loan_status = pd.get_dummies(df['loan_status'], drop_first=False)

loan_status = loan_status * 1

loan_status



In [None]:
df.shape

In [None]:
# Concatenate the new dummy variable columns to the original DataFrame and drop the original categorical columns no longer needed

df = pd.concat([df, grade, home_ownership, term, purpose, loan_status], axis=1)

df = df.drop(columns=['grade', 'home_ownership', 'term', 'purpose', 'loan_status'])

df.head()
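
The encode, concatenate and drop steps above can be collapsed into one call to `pd.get_dummies` with the `columns` argument. This is a sketch on toy data. Note that this form prefixes each dummy with its source column (for example `grade_B` instead of `B`), so the column names used later in this notebook would have to change accordingly.

```python
import pandas as pd

# Toy frame with two of the categorical columns used above (made-up rows)
toy = pd.DataFrame({
    "loan_amnt": [5000, 12000, 8000],
    "grade": ["A", "B", "A"],
    "term": [" 36 months", " 60 months", " 36 months"],
})

# Encodes the listed columns, drops the originals, and returns 0/1 integers
encoded = pd.get_dummies(toy, columns=["grade", "term"], dtype=int)

print(encoded.columns.tolist())
```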

## Variable Evaluation

Information Value (IV) measures how well a variable separates the two classes of the target, that is, its discriminative power.

The evaluation criteria are as follows:

<ul>
<li> IV < 0.02 – Not predictive, provides no relevant information for classification.

<li> 0.02 ≤ IV < 0.1 – Weakly predictive.

<li> 0.1 ≤ IV < 0.3 – Moderately predictive, contributes to classification.

<li> 0.3 ≤ IV < 0.5 – Highly predictive and useful for the model.

<li> IV ≥ 0.5 – Extremely predictive.
</ul>
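
To make the thresholds above concrete, here is a minimal sketch of the usual IV calculation for a categorical variable: for each category, the Weight of Evidence is the log of the ratio between the two class distributions, and IV is the sum of WoE times the distribution difference. The `information_value` helper and the toy data are illustrative assumptions, not the function used in this notebook.

```python
import numpy as np
import pandas as pd

def information_value(feature: pd.Series, target: pd.Series) -> float:
    """IV of a categorical feature against a binary (0/1) target."""
    # Rows are feature categories, columns are the target classes 0 and 1
    counts = pd.crosstab(feature, target)
    # Share of each class that falls into each category
    dist_0 = counts[0] / counts[0].sum()
    dist_1 = counts[1] / counts[1].sum()
    # Skip categories where one class is absent (WoE would be infinite)
    valid = (dist_0 > 0) & (dist_1 > 0)
    woe = np.log(dist_0[valid] / dist_1[valid])
    return float(((dist_0[valid] - dist_1[valid]) * woe).sum())

# Toy example: 10 loans, two terms, 1 = fully paid
toy = pd.DataFrame({
    "term": ["36"] * 6 + ["60"] * 4,
    "paid": [1, 1, 1, 1, 1, 0, 1, 1, 0, 0],
})

iv_term = information_value(toy["term"], toy["paid"])
print(f"IV for term: {iv_term:.4f}")
```

Because IV is symmetric in the two classes, it does not matter which class is labelled "good" or "bad" in the calculation.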

In [None]:
# Define a function to calculate the Information Value (IV) of each predictor variable in relation to the target variable (Fully Paid)

predictor_vars = [
    ' 36 months', ' 60 months', 'B', 'C', 'D', 'E', 'F', 'G', 'NONE', 'OTHER', 
    'OWN', 'RENT', 'car', 'credit_card', 'debt_consolidation', 'educational',
    'home_improvement', 'house', 'major_purchase', 'medical', 'moving', 
    'other', 'renewable_energy', 'small_business', 'vacation', 'wedding'
]


def calc_iv(df, feature, target, pr=0):
    lst = []
    for i in range(df[feature].nunique()):
        val = list(df[feature].unique())[i]
        lst.append([feature, val, df[df[feature] == val].count()[feature], df[(df[feature] == val) & (df[target] == 1)].count()[feature]])

    data = pd.DataFrame(lst, columns=['Variable', 'Value', 'All', 'Bad'])
    data = data[data['Bad'] > 0]
    data['Share'] = data['All'] / data['All'].sum()
    data['Bad Rate'] = data['Bad'] / data['All']
    data['Distribution Good'] = (data['All'] - data['Bad']) / (data['All'].sum() - data['Bad'].sum())
    data['Distribution Bad'] = data['Bad'] / data['Bad'].sum()
    data['WoE'] = np.log(data['Distribution Good'] / data['Distribution Bad'])
    data['IV'] = (data['WoE'] * (data['Distribution Good'] - data['Distribution Bad'])).sum()
    data = data.sort_values(by=['Variable', 'Value'], ascending=True)

    if pr == 1:
        print(data)

    return data['IV'].values[0]

# Calculate IV for all predictor variables and display results
iv_values = {}
for var in predictor_vars:
    iv_values[var] = calc_iv(df, var, 'Fully Paid')

for var, iv in iv_values.items():
    print(f"IV for {var}: {iv}")

After computing the Information Value of each dummy variable, I observed that all the selected variables showed low discriminative power. Therefore, I decided to discard most of them, keeping only the 'grade' and 'term' dummies, and to restart the search for more relevant predictors.

In [None]:
# Drop binary variables with low discriminative power, keeping only 'grade' and 'term'


columns_low_iv = [ 'NONE', 'OTHER', 'OWN', 'RENT', 'car', 'credit_card', 'debt_consolidation', 
                  'educational', 'home_improvement', 'house', 'major_purchase', 'medical', 
                  'moving', 'other', 'renewable_energy', 'small_business', 'vacation', 
                  'wedding']


df = df.drop(columns=columns_low_iv)


print(df.columns)
print(df.shape)

In [None]:
# Drop additional 'loan_status' categories not needed, keeping only 'Fully Paid' as the target variable

df.drop(columns=[
    'Charged Off', 
    'Does not meet the credit policy. Status:Charged Off',
    'Does not meet the credit policy. Status:Fully Paid', 
    'Missing'
], inplace=True)

df.columns

In [None]:
df.shape

In [None]:
df.info()

In [None]:
# Drop 'title' and 'emp_length' (replaced by numeric 'emp_length_num') to keep the dataset clean. Also drop 'policy_code' as it does not provide relevant information for the model

df.drop(columns=['title', 'emp_length'], inplace=True)
df.drop(columns=['policy_code'], inplace=True)

In [None]:
df.info()

In [None]:
# Convert all remaining categorical variables into binary dummy variables for evaluation


sub_grade = pd.get_dummies(df['sub_grade'], drop_first=True) * 1


verification_status = pd.get_dummies(df['verification_status'], drop_first=True) * 1


application_type = pd.get_dummies(df['application_type'], drop_first=True) * 1


pymnt_plan = pd.get_dummies(df['pymnt_plan'], drop_first=True) * 1


initial_list_status = pd.get_dummies(df['initial_list_status'], drop_first=True) * 1


hardship_flag = pd.get_dummies(df['hardship_flag'], drop_first=True) * 1


disbursement_method = pd.get_dummies(df['disbursement_method'], drop_first=True) * 1


debt_settlement_flag = pd.get_dummies(df['debt_settlement_flag'], drop_first=True) * 1



In [None]:
# Concatenate all dummy variables into the original DataFrame and drop the original categorical versions

df = pd.concat([df, sub_grade, verification_status, application_type, pymnt_plan, 
                initial_list_status, hardship_flag, disbursement_method, debt_settlement_flag], axis=1)

df.drop(columns=['sub_grade', 'verification_status', 'application_type', 'pymnt_plan', 
                 'initial_list_status', 'hardship_flag', 'disbursement_method', 'debt_settlement_flag'], inplace=True)


In [None]:
df.info()


In [None]:
df.head()

In [None]:
# Evaluate all variables in the dataset (except the target) to identify the most useful predictors for the models

pd.set_option('display.max_rows', None)  
pd.set_option('display.max_columns', None) 

def calc_iv(df, feature, target):
    lst = []

    for val in df[feature].unique():
        all_count = df[df[feature] == val].shape[0]
        bad_count = df[(df[feature] == val) & (df[target] == 1)].shape[0]

        if bad_count == 0 or all_count == bad_count:  
            continue

        lst.append([feature, val, all_count, bad_count])

    data = pd.DataFrame(lst, columns=['Variable', 'Value', 'All', 'Bad'])
    
    if data.empty:
        return 0  

    data['Share'] = data['All'] / data['All'].sum()
    data['Bad Rate'] = data['Bad'] / data['All']
    data['Distribution Good'] = (data['All'] - data['Bad']) / (data['All'].sum() - data['Bad'].sum())
    data['Distribution Bad'] = data['Bad'] / data['Bad'].sum()
    data['WoE'] = np.log(data['Distribution Good'] / data['Distribution Bad'])
    data['IV'] = data['WoE'] * (data['Distribution Good'] - data['Distribution Bad'])

    return data['IV'].sum()  

iv_results = {col: calc_iv(df, col, 'Fully Paid') for col in df.columns if col != 'Fully Paid'}

iv_results_df = pd.DataFrame.from_dict(iv_results, orient='index', columns=['IV'])
iv_results_df.sort_values(by='IV', ascending=False, inplace=True)

print(iv_results_df)

In [None]:
pd.reset_option('display.max_rows')
pd.reset_option('display.max_columns')

In [None]:
# Based on the Information Value (IV) results:
# - Discard variables with IV < 0.1, as they add little predictive value (exceptions: 'term', 'delinq_2yrs', and 'grade')
# - Keep variables with IV between 0.1 and 0.5, as they provide better discriminative power
# - Discard variables with IV > 0.5, as they are too aligned with the target and could introduce bias

selected_variables = [
    'revol_bal', 'dti', 'funded_amnt_inv', 
    'revol_util', 'annual_inc', 
    'funded_amnt', 'loan_amnt', ' 36 months', ' 60 months', 
    'B', 'C', 'D', 'E', 'F', 'G', 'delinq_2yrs'
]

df = df[selected_variables + ['Fully Paid']]



In [None]:
# Attempted to train a Logistic Regression model but encountered an error due to remaining null values

print(df.isnull().sum())  

In [None]:
df.describe()

In [None]:
# Fill missing values with the mean

imputador = SimpleImputer(strategy='mean')


df.iloc[:, :-1] = imputador.fit_transform(df.iloc[:, :-1])  

print("Remaining missing values in the DataFrame:")
print(df.isnull().sum().sum(), "null values in total")


## Logistic Regression

In [None]:
# Repeat the Logistic Regression process now that missing values are filled

# Separate the features (X) from the target (y)

X = df.drop(columns=['Fully Paid'])
y = df['Fully Paid']

In [None]:
# Split the dataset: 70% for training and 30% for model evaluation

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

In [None]:
X_train

In [None]:
X_test

In [None]:
y_train

In [None]:
y_test

In [None]:
# Initialize the Logistic Regression model

model = LogisticRegression(max_iter=1000, random_state=42)

In [None]:
model.fit(X_train,y_train)

In [None]:
model.score(X_train,y_train)

In [None]:
# Generate predictions: class labels (0 or 1) and probabilities for metrics such as AUC-ROC

y_pred = model.predict(X_test)

y_prob = model.predict_proba(X_test)[:, 1] 


In [None]:
# Calculate evaluation metrics

accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)  
f1 = f1_score(y_test, y_pred)
roc_auc_logreg = roc_auc_score(y_test, y_prob)

print("Logistic Regression Evaluation:")
print(f"Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1 Score: {f1:.4f}")
print(f"AUC-ROC: {roc_auc_logreg:.4f}")



In [None]:
# Confusion Matrix

conf_matrix = confusion_matrix(y_test, y_pred)
print("\nConfusion Matrix:")
print(conf_matrix)

sns.heatmap(conf_matrix, annot=True, fmt="d")
plt.show()


In [None]:
# Classification Report

print("\nClassification Report:")
print(classification_report(y_test, y_pred))

In [None]:
# ROC Curve

fpr, tpr, _ = roc_curve(y_test, y_prob)

plt.figure(figsize=(8,6))
plt.plot(fpr, tpr, label=f"AUC-ROC = {roc_auc_logreg:.4f}")
plt.plot([0, 1], [0, 1], linestyle='--', color='gray') 
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC Curve - Logistic Regression")
plt.legend()
plt.show()

In [None]:
# Gini Index for Logistic Regression

gini_logreg = 2 *  roc_auc_logreg - 1

print(f"Gini Index (Logistic Regression): {gini_logreg:.4f}")


## Logistic Regression Results

Logistic Regression turned out to be a weak predictor in this case: with an AUC-ROC of 0.5142 and a Gini Index of 0.0283, its ability to separate the classes is almost negligible.
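
One thing that may be worth trying, although it was not tested on this dataset, is to standardize the features before fitting, since Logistic Regression solvers are sensitive to variables on very different scales (such as income versus interest rate). The sketch below uses synthetic data and a scikit-learn pipeline, so the imputer and scaler are fitted on the training split only. All names and values are illustrative assumptions.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: two features on very different scales (hypothetical)
rng = np.random.default_rng(42)
n = 1000
income = rng.normal(60000, 15000, n)
rate = rng.normal(12, 3, n)
X_toy = np.column_stack([income, rate])
# Synthetic target: a higher rate makes repayment less likely
y_toy = (rate + rng.normal(0, 2, n) < 13).astype(int)
# Inject a few missing incomes to exercise the imputer
X_toy[rng.random(n) < 0.05, 0] = np.nan

# stratify keeps the class proportions equal in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=42, stratify=y_toy
)

# Impute, scale, then fit; every step learns from the training split only
pipe = make_pipeline(
    SimpleImputer(strategy="mean"),
    StandardScaler(),
    LogisticRegression(max_iter=1000),
)
pipe.fit(X_tr, y_tr)

auc_toy = roc_auc_score(y_te, pipe.predict_proba(X_te)[:, 1])
print(f"AUC-ROC on synthetic data: {auc_toy:.4f}")
```

Whether this changes the results on the Lending Club data would have to be checked by rerunning the model.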

## Decision Tree


In [None]:
# Initialize the Decision Tree model

decision_tree_model = DecisionTreeClassifier(random_state=42, class_weight="balanced", max_depth=10)

In [None]:
# Train the Decision Tree model

decision_tree_model.fit(X_train, y_train)

In [None]:
# Visualize the Decision Tree model

plt.figure(figsize=(20,20))
plot_tree(decision_tree=decision_tree_model, filled=True);

In [None]:
# Evaluate the Decision Tree model

y_pred_tree = decision_tree_model.predict(X_test)
y_prob_tree = decision_tree_model.predict_proba(X_test)[:, 1]

In [None]:
# Calculate evaluation metrics
accuracy = accuracy_score(y_test, y_pred_tree)
precision = precision_score(y_test, y_pred_tree)
recall = recall_score(y_test, y_pred_tree)
f1 = f1_score(y_test, y_pred_tree)
roc_auc_tree = roc_auc_score(y_test, y_prob_tree)

print("Decision Tree Evaluation:")
print(f"Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1 Score: {f1:.4f}")
print(f"AUC-ROC: {roc_auc_tree:.4f}")

In [None]:
# Confusion Matrix

conf_matrix_tree = confusion_matrix(y_test, y_pred_tree)
print("\nConfusion Matrix:")
print(conf_matrix_tree)

sns.heatmap(conf_matrix_tree, annot=True, fmt="d")
plt.show()


In [None]:
# Classification Report

print("\nClassification Report:")
print(classification_report(y_test, y_pred_tree))


In [None]:
# ROC Curve

fpr_tree, tpr_tree, _ = roc_curve(y_test, y_prob_tree)

plt.figure(figsize=(8,6))
plt.plot(fpr_tree, tpr_tree, label=f"AUC-ROC = {roc_auc_tree:.4f}")
plt.plot([0, 1], [0, 1], linestyle='--', color='gray') 
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC Curve - Decision Tree")
plt.legend()
plt.show()


In [None]:
# Gini Index for Decision Tree

gini_tree = 2 * roc_auc_tree - 1

print(f"Gini Index (Decision Tree): {gini_tree:.4f}")


## Decision Tree Results

Although the Decision Tree improves predictive performance compared to Logistic Regression, it still struggles with correctly classifying the negative class (0). The model tends to favor positive cases.

## Random Forest

In [None]:
# Initialize the Random Forest model

rf_model = RandomForestClassifier(
    n_estimators=300, 
    max_depth=15,  
    min_samples_split=10,  
    min_samples_leaf=5, 
    class_weight="balanced", 
    random_state=42
)


In [None]:
# Train the Random Forest model

rf_model.fit(X_train, y_train)


In [None]:
# Evaluate the Random Forest model

y_pred_rf = rf_model.predict(X_test)
y_prob_rf = rf_model.predict_proba(X_test)[:, 1]


In [None]:
# Calculate evaluation metrics

accuracy_rf = accuracy_score(y_test, y_pred_rf)
precision_rf = precision_score(y_test, y_pred_rf)
recall_rf = recall_score(y_test, y_pred_rf)
f1_rf = f1_score(y_test, y_pred_rf)
roc_auc_rf = roc_auc_score(y_test, y_prob_rf)

print("Random Forest Evaluation:")
print(f"Accuracy: {accuracy_rf:.4f}")
print(f"Precision: {precision_rf:.4f}")
print(f"Recall: {recall_rf:.4f}")
print(f"F1 Score: {f1_rf:.4f}")
print(f"AUC-ROC: {roc_auc_rf:.4f}")


In [None]:
# Confusion Matrix

conf_matrix_rf = confusion_matrix(y_test, y_pred_rf)
print("\nConfusion Matrix:")
print(conf_matrix_rf)

sns.heatmap(conf_matrix_rf, annot=True, fmt="d")
plt.show()


In [None]:
# Classification Report

print("\nClassification Report:")
print(classification_report(y_test, y_pred_rf))


In [None]:
# ROC Curve

fpr_rf, tpr_rf, _ = roc_curve(y_test, y_prob_rf)
roc_auc_rf = roc_auc_score(y_test, y_prob_rf)

plt.figure(figsize=(8,6))
plt.plot(fpr_rf, tpr_rf, label=f"AUC-ROC = {roc_auc_rf:.4f}")
plt.plot([0, 1], [0, 1], linestyle='--', color='gray')  
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC Curve - Random Forest")
plt.legend()
plt.show()


In [None]:
# Gini Index for Random Forest

gini_rf = 2 * roc_auc_rf - 1

print(f"Gini Index (Random Forest): {gini_rf:.4f}")


## Random Forest Results

Random Forest shows better overall performance, reducing the overfitting issues observed in the Decision Tree. Although detecting the negative class remains a challenge, the model proves to be more reliable than the previous ones.

## XGBoost

In [None]:
# !pip install xgboost

In [None]:
# Initialize the XGBoost model

xgb_model = XGBClassifier(
    n_estimators=300,  
    learning_rate=0.05,  
    max_depth=7,  
    gamma=0.1,  
    subsample=0.8,  
    colsample_bytree=0.8,  
    scale_pos_weight=(y_train.value_counts()[0] / y_train.value_counts()[1]),  
    random_state=42
)


In [None]:
# Train the XGBoost model

xgb_model.fit(X_train, y_train)


In [None]:
# Evaluate the XGBoost model

y_pred_xgb = xgb_model.predict(X_test)
y_prob_xgb = xgb_model.predict_proba(X_test)[:, 1]


In [None]:
# Calculate evaluation metrics

accuracy_xgb = accuracy_score(y_test, y_pred_xgb)
precision_xgb = precision_score(y_test, y_pred_xgb)
recall_xgb = recall_score(y_test, y_pred_xgb)
f1_xgb = f1_score(y_test, y_pred_xgb)
roc_auc_xgb = roc_auc_score(y_test, y_prob_xgb)

print("XGBoost Evaluation:")
print(f"Accuracy: {accuracy_xgb:.4f}")
print(f"Precision: {precision_xgb:.4f}")
print(f"Recall: {recall_xgb:.4f}")
print(f"F1 Score: {f1_xgb:.4f}")
print(f"AUC-ROC: {roc_auc_xgb:.4f}")


In [None]:
# Confusion Matrix

conf_matrix_xgb = confusion_matrix(y_test, y_pred_xgb)
print("\nConfusion Matrix:")
print(conf_matrix_xgb)

sns.heatmap(conf_matrix_xgb, annot=True, fmt="d")
plt.show()


In [None]:
# Classification Report

print("\nClassification Report:")
print(classification_report(y_test, y_pred_xgb))


In [None]:
# ROC Curve

fpr_xgb, tpr_xgb, _ = roc_curve(y_test, y_prob_xgb)
roc_auc_xgb = roc_auc_score(y_test, y_prob_xgb)

plt.figure(figsize=(8,6))
plt.plot(fpr_xgb, tpr_xgb, label=f"AUC-ROC = {roc_auc_xgb:.4f}")
plt.plot([0, 1], [0, 1], linestyle='--', color='gray') 
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC Curve - XGBoost")
plt.legend()
plt.show()


In [None]:
# Gini Index for XGBoost

gini_xgb = 2 * roc_auc_xgb - 1

print(f"Gini Index (XGBoost): {gini_xgb:.4f}")


## Final Comparison

In [None]:

results = pd.DataFrame({
    "Model": ["Logistic Regression", "Decision Tree", "Random Forest", "XGBoost"],
    "AUC-ROC": [0.5142, 0.6598, 0.7275, 0.7285],
    "Gini": [0.0283, 0.3197, 0.4550, 0.4571],
    "Accuracy": [0.8045, 0.6949, 0.7574, 0.7155],
    "Precision": [0.8060, 0.8608, 0.8605, 0.8737],
    "Recall": [0.9970, 0.7406, 0.8337, 0.7556],
    "F1-score": [0.8914, 0.7961, 0.8469, 0.8104]
})

results
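
The table above is typed in by hand from the printed outputs. As a sketch of an alternative, a small helper can compute the same columns from the labels, predictions and probabilities. The `summarize` function and the toy inputs are hypothetical, not part of the original notebook.

```python
import pandas as pd
from sklearn.metrics import (
    accuracy_score, f1_score, precision_score, recall_score, roc_auc_score
)

def summarize(name, y_true, y_pred, y_prob):
    """Return one row of the comparison table for a fitted model."""
    auc_value = roc_auc_score(y_true, y_prob)
    return {
        "Model": name,
        "AUC-ROC": auc_value,
        "Gini": 2 * auc_value - 1,  # same Gini definition used above
        "Accuracy": accuracy_score(y_true, y_pred),
        "Precision": precision_score(y_true, y_pred),
        "Recall": recall_score(y_true, y_pred),
        "F1-score": f1_score(y_true, y_pred),
    }

# Toy labels and scores standing in for a model's test-set output
y_true = [1, 1, 1, 0, 0, 1]
y_prob = [0.9, 0.8, 0.4, 0.3, 0.6, 0.7]
y_pred = [int(p >= 0.5) for p in y_prob]

toy_results = pd.DataFrame([summarize("toy model", y_true, y_pred, y_prob)])
print(toy_results.round(4))
```

In the notebook, each call could take one model's outputs, for example `summarize("XGBoost", y_test, y_pred_xgb, y_prob_xgb)`, so the table updates whenever a model is retrained.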


By analyzing the results, we can see that the weakest models were Logistic Regression and the Decision Tree. Logistic Regression achieved an AUC-ROC of 0.5142 and a Gini Index of 0.0283, indicating almost no discriminative power, only slightly better than random guessing. Its high accuracy (0.8045) is misleading: with a recall of 0.9970 it labels almost every loan as fully paid, so its accuracy and precision mostly reflect the share of fully paid loans in the test set. The Decision Tree, while an improvement over Logistic Regression, still showed limited performance with an AUC-ROC of 0.6598 and a Gini Index of 0.3197. Although this model managed to separate the classes better, it was not effective enough, misclassifying many negative cases as positive and showing the lowest accuracy of the four models (0.6949).

In contrast, the Random Forest delivered clearly stronger results, reaching an AUC-ROC of 0.7275 and a Gini Index of 0.4550. This model demonstrated good predictive power and achieved a solid balance between precision (0.8605) and recall (0.8337), making it a reliable option. Random Forest was particularly effective at capturing positive cases without sacrificing too much precision.

However, the best overall performer was XGBoost, although only by a narrow margin. It achieved the highest AUC-ROC (0.7285) and Gini Index (0.4571), just ahead of Random Forest (0.7275 and 0.4550), as well as the highest precision (0.8737). Its recall (0.7556) and accuracy (0.7155) were lower than Random Forest’s (0.8337 and 0.7574), so the choice between the two depends on whether precision or recall matters more. On AUC-ROC and Gini, the metrics used here to measure class separation, XGBoost proved to be the most effective model in this analysis.
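
Specificity (the share of actual negatives that are correctly identified) is not among the metrics reported in the table. As a minimal sketch, it can be read off the confusion matrix together with sensitivity (recall). The labels below are toy values, not results from this analysis.

```python
from sklearn.metrics import confusion_matrix

# Toy labels and predictions (1 = fully paid, 0 = not fully paid)
y_true = [1, 1, 1, 0, 0, 1]
y_pred = [1, 1, 0, 0, 1, 1]

# For binary labels 0/1 the matrix is [[tn, fp], [fn, tp]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

sensitivity = tp / (tp + fn)  # recall on the positive class
specificity = tn / (tn + fp)  # recall on the negative class

print(f"Sensitivity: {sensitivity:.4f}")
print(f"Specificity: {specificity:.4f}")
```

Applied to `y_test` and each model's predictions, this would show directly how well each model detects the negative class.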