# Credit Default Modelling

***

This Jupyter notebook incrementally improves on the performance of a baseline model (**i.e. an initial Logistic Regression model**) on a credit default dataset. It is specifically designed for financial and mathematical analysts in the *credit risk assessment industry*.

The problem that we are going to solve is a **binary classification problem**: is our new client going to default?

The notebook is divided into the following sections:

1. **[Service factory](#service_factory)**: 

    This is a collection of functions that are used multiple times throughout the notebook. The good thing about them is that *they are portable*: anyone copying and pasting from this notebook can adapt them to their own needs.<br><br>

2. **[Exploratory data analysis](#eda)**: 

Understanding the *business subject is as important* as good modelling. Therefore, I code EDA first. In this part of the notebook we can
<br>

3. **[Model Selection](#model_selection)**: 

We have been comparing different mode
<br>

4. **[Conclusions](#conclusions)**:
<br>

5. **[Appendix](#appendix)**: 

In this notebook we have fitted models *one after the other*. We would also like to showcase a *lazy* but less customizable approach for the curious reader to explore!
<br>

***

### Service Factory <a class="anchor" id="service_factory"></a>

In [None]:
# Create general constants, disable warnings globally, and (optionally) install the required packages.

import os
import warnings
warnings.filterwarnings("ignore")

##################################
RANDOM_STATE = 999
MAIN_PATH    = os.getcwd()
##################################

## Uncomment the following line to install all the packages used in this notebook.
# import sys
# !{sys.executable} -m pip install numpy pandas matplotlib seaborn xlrd lazypredict
# !{sys.executable} -m pip install networkx --force-reinstall --no-deps --upgrade --user
# !{sys.executable} -m pip install hyperopt

In [None]:
def perform_bayes_opt():
    
    pass

In [None]:
def standar_scaler():
    
    pass

In [None]:
def plot_roc():
    
    pass

In [None]:
def display_performance_metrics():
    
    pass

In [None]:
def split_test_train_for_all():
    
    pass

***

### Exploratory data analysis (EDA) <a class="anchor" id="eda"></a>

In [None]:
import os
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
os.chdir('..')
print(f'Main path for this project is {os.getcwd()}')

In [None]:
# Import the data and perform EDA

credit_data = pd.read_csv('./data/UCI_Credit_Card.csv')
credit_data = credit_data.rename(columns = {'default.payment.next.month':'default'})

In [None]:
# Before we even begin, let's look at the columns of our data and their types...

credit_data.info()

In [None]:
# Before we continue, let's spot possible missing data so that we can handle it...
# ...in pandas it is just one line of code!

credit_data.isnull().sum()

In [None]:
# Overall, we are speaking about this dataframe:

credit_data.head(3)

There are no missing datapoints, so we will not perform any missing data management such as that described [here](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3668100/).

*After* the initial exploration of missing data and types, we can compute some basic *Statistics and Correlations* as follows:

1. we display the **basic statistics** of all the features using `credit_data.describe()`.
2. we visualize the **correlation matrix** using a heatmap to understand the relationships between different features.

In [None]:
# 1. Let's perform some correlations

#print(credit_data.describe())

plt.figure(figsize=(15, 10))
sns.heatmap(credit_data.corr(), annot=True, cmap='coolwarm')
plt.title('Correlation Matrix')
plt.show()

Next, we look at the distribution of the numerical features by **plotting a histogram for each of them**, as follows.

In [None]:
# 2. Let's visualise the distribution of all the numerical features...

numerical_features = ['LIMIT_BAL', 'AGE', 'BILL_AMT1', 'BILL_AMT2', 'BILL_AMT3', 'BILL_AMT4', 'BILL_AMT5', 
                      'BILL_AMT6', 'PAY_AMT1', 'PAY_AMT2', 'PAY_AMT3', 'PAY_AMT4', 'PAY_AMT5', 'PAY_AMT6']

for feature in numerical_features:
    
    plt.figure()
    sns.histplot(credit_data[feature], kde=True)
    plt.title(f'{feature} Distribution')
    plt.show()

As we did for the numerical features, we do the same for the categorical features, *using count plots to visualize the number of observations in each category.*

In [None]:
# 3. Let's visualise the distribution of all the categorical features...

categorical_features = ['SEX', 'EDUCATION', 'MARRIAGE', 'PAY_0', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6']

for feature in categorical_features:
    
    plt.figure()
    sns.countplot(x=feature, data=credit_data)
    plt.title(f'{feature} Count Plot')
    plt.show()

We now investigate *Default Rates Across Different Categories*. <br>
To do so, we use bar plots to **visualize the default rates across different categories** of the categorical features, as follows:

In [None]:
# 4. Let's visualise the distribution of the default rates across different categories...

for feature in categorical_features:
    
    plt.figure()
    sns.barplot(x=feature, y='default', data=credit_data)
    plt.title(f'Default Rate by {feature}')
    plt.show()
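
The bar heights above can be read as default rates because `sns.barplot` shows, by default, the mean of `y` within each category, and the mean of a 0/1 flag is the share of ones. The following is a minimal sketch of that computation in plain Python; the toy rows are invented for illustration and are not taken from the dataset.

```python
from collections import defaultdict

# Toy (EDUCATION, default) pairs -- invented for illustration only.
rows = [(1, 0), (1, 1), (1, 0), (1, 0), (2, 1), (2, 1), (2, 0), (3, 0)]

def default_rate_by_category(pairs):
    """Return {category: share of rows whose default flag equals 1}."""
    totals = defaultdict(int)
    defaults = defaultdict(int)
    for category, flag in pairs:
        totals[category] += 1
        defaults[category] += flag
    return {c: defaults[c] / totals[c] for c in sorted(totals)}

rates = default_rate_by_category(rows)
print(rates)
```

On the real dataframe, `credit_data.groupby(feature)['default'].mean()` should give the same kind of table for each feature.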

We can perform one last analysis now: **we plot a pair plot for a subset of variables to visualize the relationships between them and how they are affected by the target variable ('default').**

In [None]:
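# Let's visualise the pair plot for a subset of variables to visualize relationships

# pairplot returns a grid of subplots, so the title goes on the whole figure rather than on the last subplot
pair_grid = sns.pairplot(credit_data[['LIMIT_BAL', 'AGE', 'BILL_AMT1', 'PAY_AMT1', 'default']], hue='default')
pair_grid.fig.suptitle('Pair Plot', y=1.02)
plt.show()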

We must now check **how balanced (or not) this dataset is**.

In [None]:
# The problem we are going to solve is imbalanced: let's see how imbalanced it is

credit_data['default'].value_counts(normalize=True)

We see that **77.88% of our data is about good payers while the remainder is about "defaulters"**. <br>
Since our dataset has 30,000 observations, this means we have ca. 6,636 bad payers.
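
The arithmetic behind that count can be checked in a few lines. This is only a sketch that uses the two figures quoted above (30,000 observations, 77.88% good payers); the variable names are illustrative.

```python
n_observations = 30_000   # dataset size quoted in the text
share_good = 0.7788       # share of good payers quoted in the text

share_bad = 1 - share_good
n_bad = round(n_observations * share_bad)
n_good = n_observations - n_bad
imbalance_ratio = n_good / n_bad   # good payers per defaulter

print(f"defaulters: {n_bad}, good payers: {n_good}, ratio: {imbalance_ratio:.2f}:1")
```

The ratio of roughly 3.5 good payers per defaulter is the quantity the resampling techniques below bring down to 1:1.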

Our data is not as imbalanced as we might have thought at the beginning, which means, from a business point of view, that the institution that owns this dataset has a lot of bad payers overall. <br>

In order to be accurate in our predictions, we are going to leverage the following techniques to balance this dataset:

1. [SMOTE](https://learn.microsoft.com/en-us/azure/machine-learning/component-reference/smote?view=azureml-api-2)
2. [Oversampling](https://en.wikipedia.org/wiki/Oversampling)
3. [Undersampling](https://en.wikipedia.org/wiki/Undersampling)

The code to do so follows.

In [None]:
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import RandomOverSampler, SMOTE
from imblearn.under_sampling import RandomUnderSampler

In [None]:
# --> Step 1: Split the data into training and testing sets

X = credit_data.drop('default', axis=1)
y = credit_data['default']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=RANDOM_STATE)

In [None]:
# --> Step 2: Apply SMOTE

smote            = SMOTE(random_state=RANDOM_STATE)
X_smote, y_smote = smote.fit_resample(X_train, y_train)

print(f'SMOTE shape is {X_smote.shape}')
y_smote.value_counts()

In [None]:
# --> Step 3: apply random oversampling

oversampler                  = RandomOverSampler(random_state=RANDOM_STATE)
X_oversampled, y_oversampled = oversampler.fit_resample(X_train, y_train)

print(f'Oversampled shape is {X_oversampled.shape}')
y_oversampled.value_counts()

In [None]:
# --> Step 4: apply random undersampling

undersampler                   = RandomUnderSampler(random_state=RANDOM_STATE)
X_undersampled, y_undersampled = undersampler.fit_resample(X_train, y_train)

print(f'Undersample shape is {X_undersampled.shape}')
y_undersampled.value_counts()

From what we have coded here above we can summarise:

>- **SMOTE**: we applied the SMOTE technique to generate synthetic samples of the minority class in the training data, helping to balance the class distribution.
>- **Random Oversampling**: we oversampled the minority class by randomly selecting samples with replacement, increasing the number of minority class samples in the training data.
>- **Random Undersampling**: we undersampled the majority class by randomly removing samples, reducing the number of majority class samples in the training data to balance the class distribution.

The *undersampled dataset* has the best final ratio, since only the bigger class (not defaulted) is undersampled and there are still enough datapoints to do **inference** on (i.e. 10,664 observations in total, equally split between default and not default).
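
To make the three resampling ideas concrete, here is a minimal sketch in plain Python on invented 2-D points. It is not how `imblearn` implements them: in particular, the SMOTE-like step draws the second point at random, whereas SMOTE picks it among the k nearest minority neighbours. All names and data are hypothetical.

```python
import random

rng = random.Random(999)  # fixed seed, only for repeatability of the sketch

# Toy 2-D training points -- invented for illustration.
majority = [(float(i), float(i % 3)) for i in range(8)]   # class 0
minority = [(10.0, 1.0), (11.0, 2.0), (12.0, 1.5)]        # class 1

# Random undersampling: keep a random subset of the majority, without replacement.
undersampled_majority = rng.sample(majority, k=len(minority))

# Random oversampling: redraw minority points with replacement until the classes match.
oversampled_minority = minority + rng.choices(minority, k=len(majority) - len(minority))

def smote_like_point(point, neighbour, rng):
    """Interpolate a synthetic point on the segment between two minority points."""
    u = rng.random()
    return tuple(p + u * (q - p) for p, q in zip(point, neighbour))

# SMOTE-style synthesis (simplified: the neighbour is drawn at random).
synthetic = []
while len(minority) + len(synthetic) < len(majority):
    a, b = rng.sample(minority, k=2)
    synthetic.append(smote_like_point(a, b, rng))

print(len(undersampled_majority), len(oversampled_minority), len(minority) + len(synthetic))
```

Undersampling shrinks the data to twice the minority size, while the two oversampling variants grow it to twice the majority size.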

In [None]:
# Our final dataset will then be what follows

# Reset the index and attach the target by position, so that pandas index alignment cannot introduce NaNs
credit_data_balanced            = X_undersampled.reset_index(drop=True)
credit_data_balanced['default'] = np.asarray(y_undersampled)

# This is going to be our balanced dataset... finally!
credit_data_balanced.head(3)

We will keep running our code on the **credit_data_balanced** dataset.

***

### Model Selection <a class="anchor" id="model_selection"></a>

We will start with a hypothesis: our best model is **a base model centered on Logistic Regression**.

Any future model that performs better than a "simple" Logistic Regression will be accepted as our **reference model** for this task.

In [None]:
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, roc_curve, roc_auc_score
from hyperopt import fmin, tpe, hp, Trials, STATUS_OK

In [None]:
# --> Step 1: Split the data into training and testing sets
X = credit_data_balanced.drop('default', axis=1)
y = credit_data_balanced['default']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=RANDOM_STATE)

In [None]:
# --> Step 2: Standardize the data
scaler         = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled  = scaler.transform(X_test)
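
As a rough illustration of what the scaling step does, the sketch below standardizes a toy feature by hand. The numbers are invented, and the helper `standardize` is hypothetical; the point it shows is that the mean and standard deviation come from the training data only and are then reused on the test data.

```python
from statistics import fmean, pstdev

train = [20_000.0, 50_000.0, 80_000.0, 210_000.0]  # toy credit limits, invented
test = [50_000.0, 300_000.0]

mu = fmean(train)
sigma = pstdev(train)   # population standard deviation (ddof=0), the convention StandardScaler uses

def standardize(values, mean, std):
    """Return z-scores of the values with respect to the given mean and std."""
    return [(v - mean) / std for v in values]

train_scaled = standardize(train, mu, sigma)
test_scaled = standardize(test, mu, sigma)  # reuse the training statistics; do not refit on test data

print(train_scaled)
print(test_scaled)
```

After the transformation the training feature has mean 0 and unit variance, while the test feature generally does not, which is expected.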

In [None]:
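# --> Step 3: Hyperparameter optimization using GridSearchCV
# Only penalty/solver pairs that LogisticRegression supports are listed; unsupported pairs fail to fit.
# 'elasticnet' needs 'saga' plus an l1_ratio: the value 0.5 below is an assumption, not a tuned choice.
C_values    = list(np.linspace(0.0001, 10, 10)) #[0.0001, 0.001, 0.01, 0.1, 1, 10, 100]
param_grid  = [{'C': C_values, 'penalty': ['l1', 'l2'],   'solver': ['saga', 'liblinear']},
               {'C': C_values, 'penalty': ['l2'],         'solver': ['newton-cg', 'lbfgs', 'sag']},
               {'C': C_values, 'penalty': ['elasticnet'], 'solver': ['saga'], 'l1_ratio': [0.5]}]

grid_search = GridSearchCV(LogisticRegression(), param_grid, cv=10)
grid_search.fit(X_train_scaled, y_train)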

In [None]:
# --> Step 4: Train the logistic regression model with the best hyperparameters
best_params = grid_search.best_params_
log_reg     = LogisticRegression(**best_params)
log_reg.fit(X_train_scaled, y_train)

In [None]:
# --> Step 5: Evaluate the model on the test set
y_pred       = log_reg.predict(X_test_scaled)
y_pred_proba = log_reg.predict_proba(X_test_scaled)[:, 1]

In [None]:
# --> Step 6: Display performance metrics

print("Best Hyperparameters:", best_params, '\n')
print("Accuracy Score:", accuracy_score(y_test, y_pred), '\n')
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred), '\n')
print("Classification Report:")
print(classification_report(y_test, y_pred))

In [None]:
# --> Step 7: Plot the ROC curve

fpr, tpr, thresholds = roc_curve(y_test, y_pred_proba)

plt.figure()
plt.plot(fpr, tpr, label='Logistic Regression (area = %0.2f)' % roc_auc_score(y_test, y_pred_proba))
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic')
plt.legend(loc="lower right")
plt.show()

Explanation:

- **ROC Curve**: The cell above plots the ROC (Receiver Operating Characteristic) curve, a graphical representation of the true positive rate against the false positive rate of the logistic regression model at various threshold settings. It gives a sense of the trade-off between the two rates.
  
- **AUC Score**: The area under the ROC curve (AUC) is also calculated and displayed in the legend of the plot. The AUC is a single-value summary of the performance of the model, useful for comparing models: a score of 0.5 indicates a model with no discriminative power, a score of 1.0 indicates a perfect model, and a higher AUC indicates a better performing model.

- **Interpretation**:
  - **True Positive Rate (Sensitivity)**: The ratio of the number of positive instances correctly predicted by the model to the total number of positive instances: TPR = TP / (TP + FN).
  - **False Positive Rate (1 - Specificity)**: The ratio of the number of negative instances incorrectly predicted as positive by the model to the total number of negative instances: FPR = FP / (FP + TN).

- **Thresholds**: The ROC curve is created by varying the threshold used to classify instances as positive or negative. By analyzing the ROC curve, you can choose a threshold that gives a good balance between sensitivity and specificity, depending on the specific requirements of your problem.

Together with the other performance metrics, the ROC curve and the AUC score give a more comprehensive view of the model's performance. Adjustments might be needed based on further analysis and insights derived from the EDA.
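
The TPR and FPR formulas, and the idea of picking a threshold from the ROC curve, can be sketched on toy data in plain Python. The labels and scores below are invented, and Youden's J (TPR minus FPR) is only one possible selection criterion, assumed here for illustration; the notebook itself does not prescribe one.

```python
# Toy labels and predicted probabilities -- invented for illustration.
y_true  = [0, 0, 0, 1, 1, 1, 0, 1]
y_score = [0.10, 0.30, 0.35, 0.40, 0.60, 0.70, 0.80, 0.90]

def rates_at(threshold, labels, scores):
    """Return (TPR, FPR) when scores >= threshold are classified as positive."""
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
    fn = sum(1 for y, s in zip(labels, scores) if y == 1 and s < threshold)
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
    tn = sum(1 for y, s in zip(labels, scores) if y == 0 and s < threshold)
    return tp / (tp + fn), fp / (fp + tn)

# One candidate threshold per distinct score, highest first.
candidates = sorted(set(y_score), reverse=True)
roc_points = [(t, *rates_at(t, y_true, y_score)) for t in candidates]

# Youden's J = TPR - FPR; pick the threshold that maximises it.
best_threshold, best_tpr, best_fpr = max(roc_points, key=lambda p: p[1] - p[2])
print(best_threshold, best_tpr, best_fpr)
```

In a credit setting, a missed defaulter and a rejected good payer rarely cost the same, so a cost-weighted criterion may be more appropriate than the symmetric one used in this sketch.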

### Random Forest

### Conclusions <a class="anchor" id="conclusions"></a>

### Appendix <a class="anchor" id="appendix"></a>

In [None]:
# Random Forest


import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, roc_curve, roc_auc_score
import matplotlib.pyplot as plt
from hyperopt import fmin, tpe, hp, Trials, STATUS_OK

# Step 1: Load and preprocess the data
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/00350/default%20of%20credit%20card%20clients.xls"
data = pd.read_excel(url, header=1)
data.rename(columns={'default payment next month': 'default'}, inplace=True)
data.drop('ID', axis=1, inplace=True)

# Step 2: Split the data into training and testing sets
X = data.drop('default', axis=1)
y = data['default']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Step 3: Standardize the data
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Step 4: Define the objective function for hyperopt
def objective(params):
    model = RandomForestClassifier(**params, random_state=42)
    model.fit(X_train_scaled, y_train)
    y_pred = model.predict(X_test_scaled)
    acc = accuracy_score(y_test, y_pred)
    return {'loss': -acc, 'status': STATUS_OK}

# Step 5: Define the hyperparameter space
space = {
    'n_estimators': hp.choice('n_estimators', range(50, 300)),
    'max_depth': hp.choice('max_depth', range(1, 20)),
    'min_samples_split': hp.choice('min_samples_split', range(2, 20)),
    'min_samples_leaf': hp.choice('min_samples_leaf', range(2, 20)),
    # 'auto' was removed from recent scikit-learn releases (it behaved like 'sqrt' for classifiers)
    'max_features': hp.choice('max_features', ['sqrt', 'log2']),
}

# Step 6: Run hyperopt optimization
trials = Trials()
best = fmin(fn=objective, space=space, algo=tpe.suggest, max_evals=50, trials=trials)

# Step 7: Train the Random Forest model with the best hyperparameters
# fmin returns the *index* of each hp.choice option, so add back the start of each range
best_params = {
    'n_estimators': best['n_estimators'] + 50,
    'max_depth': best['max_depth'] + 1,
    'min_samples_split': best['min_samples_split'] + 2,
    'min_samples_leaf': best['min_samples_leaf'] + 2,
    'max_features': ['sqrt', 'log2'][best['max_features']],
}
rf_model = RandomForestClassifier(**best_params, random_state=42)
rf_model.fit(X_train_scaled, y_train)

# Step 8: Evaluate the model on the test set
y_pred = rf_model.predict(X_test_scaled)
y_pred_proba = rf_model.predict_proba(X_test_scaled)[:, 1]

# Step 9: Display performance metrics
print("Best Hyperparameters:", best_params)
print("Accuracy Score:", accuracy_score(y_test, y_pred))
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))
print("Classification Report:")
print(classification_report(y_test, y_pred))

# Step 10: Plot the ROC curve
fpr, tpr, thresholds = roc_curve(y_test, y_pred_proba)
plt.figure()
plt.plot(fpr, tpr, label='Random Forest (area = %0.2f)' % roc_auc_score(y_test, y_pred_proba))
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic')
plt.legend(loc="lower right")
plt.show()


In [None]:
# Support Vector Machine


import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, roc_curve, roc_auc_score
import matplotlib.pyplot as plt
from hyperopt import fmin, tpe, hp, Trials, STATUS_OK

# Step 1: Load and preprocess the data
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/00350/default%20of%20credit%20card%20clients.xls"
data = pd.read_excel(url, header=1)
data.rename(columns={'default payment next month': 'default'}, inplace=True)
data.drop('ID', axis=1, inplace=True)

# Step 2: Split the data into training and testing sets
X = data.drop('default', axis=1)
y = data['default']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Step 3: Standardize the data
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Step 4: Define the objective function for hyperopt
def objective(params):
    # probability=True is left out here: it only adds a costly internal cross-validation
    # for predict_proba, which this accuracy-based objective does not need
    model = SVC(**params, random_state=42)
    model.fit(X_train_scaled, y_train)
    y_pred = model.predict(X_test_scaled)
    acc = accuracy_score(y_test, y_pred)
    return {'loss': -acc, 'status': STATUS_OK}

# Step 5: Define the hyperparameter space
space = {
    'C': hp.loguniform('C', -4, 2),
    'kernel': hp.choice('kernel', ['linear', 'rbf', 'poly', 'sigmoid']),
    'gamma': hp.choice('gamma', ['scale', 'auto']),
    'degree': hp.choice('degree', [2, 3, 4, 5, 6]),
}

# Step 6: Run hyperopt optimization
trials = Trials()
best = fmin(fn=objective, space=space, algo=tpe.suggest, max_evals=20, trials=trials)

# Step 7: Train the SVM model with the best hyperparameters
best_params = {
    'C': best['C'],
    'kernel': ['linear', 'rbf', 'poly', 'sigmoid'][best['kernel']],
    'gamma': ['scale', 'auto'][best['gamma']],
    'degree': best['degree'] + 2,
}
svm_model = SVC(**best_params, probability=True, random_state=42)
svm_model.fit(X_train_scaled, y_train)

# Step 8: Evaluate the model on the test set
y_pred = svm_model.predict(X_test_scaled)
y_pred_proba = svm_model.predict_proba(X_test_scaled)[:, 1]

# Step 9: Display performance metrics
print("Best Hyperparameters:", best_params)
print("Accuracy Score:", accuracy_score(y_test, y_pred))
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))
print("Classification Report:")
print(classification_report(y_test, y_pred))

# Step 10: Plot the ROC curve
fpr, tpr, thresholds = roc_curve(y_test, y_pred_proba)
plt.figure()
plt.plot(fpr, tpr, label='SVM (area = %0.2f)' % roc_auc_score(y_test, y_pred_proba))
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic')
plt.legend(loc="lower right")
plt.show()



Explanation:

- **Support Vector Machines (SVM)**: In step 7, we used the `SVC` class from `scikit-learn` to build and train the SVM model.
- **Hyperparameter Space**: In step 5, we defined a space of hyperparameters to search, including the regularization parameter `C`, the kernel type, the kernel coefficient `gamma`, and the degree of the polynomial kernel function.
- **Objective Function**: In step 4, we defined an objective function that takes a set of hyperparameters, trains an SVM model, and returns the negative accuracy as the loss to be minimized by `hyperopt`.
- **Bayesian Optimization**: In step 6, we used `hyperopt` to perform Bayesian optimization, using the Tree-structured Parzen Estimator (TPE) algorithm to find the best hyperparameters over 20 evaluations (`max_evals=20`).
- **Best Hyperparameters**: In step 7, we extracted the best hyperparameters from the optimization results and trained the SVM model using those hyperparameters.

This cell uses an SVM model with Bayesian optimization for hyperparameter tuning, aiming to find the best hyperparameters more efficiently than grid search. Adjustments might be needed based on further analysis and insights derived from the EDA.
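
One detail of step 7 is easy to miss: for every `hp.choice` entry, `fmin` returns the index of the selected option rather than the option itself. The sketch below shows the mapping back to values in plain Python. The `best` dictionary is a hypothetical result written by hand, not an actual optimization outcome.

```python
# The same option lists as in the search space of step 5.
kernel_choices = ['linear', 'rbf', 'poly', 'sigmoid']
gamma_choices = ['scale', 'auto']
degree_choices = [2, 3, 4, 5, 6]

# Hypothetical result in the shape fmin returns: an index for each hp.choice entry,
# and the sampled value itself for hp.loguniform.
best = {'C': 1.7, 'kernel': 1, 'gamma': 0, 'degree': 3}

best_params = {
    'C': best['C'],
    'kernel': kernel_choices[best['kernel']],
    'gamma': gamma_choices[best['gamma']],
    'degree': degree_choices[best['degree']],
}
print(best_params)
```

Because the degree options are the consecutive integers starting at 2, indexing the list gives the same result as `best['degree'] + 2`. Indexing is the more robust habit, since it keeps working if the option list changes; `hyperopt` also provides a `space_eval` helper that should perform this mapping directly from the search space.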

In [None]:
# Logistic regression with Bayesian optimization

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, roc_curve, roc_auc_score
import matplotlib.pyplot as plt
from hyperopt import fmin, tpe, hp, Trials, STATUS_OK

# Step 1: Load and preprocess the data
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/00350/default%20of%20credit%20card%20clients.xls"
data = pd.read_excel(url, header=1)
data.rename(columns={'default payment next month': 'default'}, inplace=True)
data.drop('ID', axis=1, inplace=True)

# Step 2: Split the data into training and testing sets
X = data.drop('default', axis=1)
y = data['default']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Step 3: Standardize the data
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Step 4: Define the objective function for hyperopt
def objective(params):
    model = LogisticRegression(**params)
    model.fit(X_train_scaled, y_train)
    y_pred = model.predict(X_test_scaled)
    acc = accuracy_score(y_test, y_pred)
    return {'loss': -acc, 'status': STATUS_OK}

# Step 5: Define the hyperparameter space
space = {
    'C': hp.loguniform('C', -4, 2),
    'penalty': hp.choice('penalty', ['l1', 'l2']),
    'solver': hp.choice('solver', ['liblinear'])
}

# Step 6: Run hyperopt optimization
trials = Trials()
best = fmin(fn=objective, space=space, algo=tpe.suggest, max_evals=50, trials=trials)

# Step 7: Train the logistic regression model with the best hyperparameters
best_params = {
    'C': best['C'],
    'penalty': ['l1', 'l2'][best['penalty']],
    'solver': 'liblinear'
}
log_reg = LogisticRegression(**best_params)
log_reg.fit(X_train_scaled, y_train)

# Step 8: Evaluate the model on the test set
y_pred = log_reg.predict(X_test_scaled)
y_pred_proba = log_reg.predict_proba(X_test_scaled)[:, 1]

# Step 9: Display performance metrics
print("Best Hyperparameters:", best_params)
print("Accuracy Score:", accuracy_score(y_test, y_pred))
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))
print("Classification Report:")
print(classification_report(y_test, y_pred))

# Step 10: Plot the ROC curve
fpr, tpr, thresholds = roc_curve(y_test, y_pred_proba)
plt.figure()
plt.plot(fpr, tpr, label='Logistic Regression (area = %0.2f)' % roc_auc_score(y_test, y_pred_proba))
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic')
plt.legend(loc="lower right")
plt.show()


Explanation:

- **Hyperparameter Space**: In step 5, we defined the hyperparameter space using `hyperopt`'s `hp` module. We defined a log-uniform distribution for `C` to explore a wide range of values on a logarithmic scale. We also defined the choices for the `penalty` and `solver` parameters.
- **Objective Function**: In step 4, we defined an objective function that takes a set of hyperparameters as input, trains a logistic regression model using those hyperparameters, and returns the negative accuracy as the loss. `hyperopt` minimizes the loss, so we return the negative accuracy to ensure that `hyperopt` is maximizing the accuracy.
- **Bayesian Optimization**: In step 6, we used `hyperopt`'s `fmin` function to perform Bayesian optimization. We used the Tree-structured Parzen Estimator (TPE) as the optimization algorithm and performed 50 evaluations to find the best hyperparameters.
- **Best Hyperparameters**: In step 7, we extracted the best hyperparameters from the optimization results and trained the logistic regression model using those hyperparameters.

This cell uses Bayesian optimization for hyperparameter tuning, which can potentially find better hyperparameters in fewer iterations than grid search. Adjustments might be needed based on further analysis and insights derived from the EDA.
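
To see which values of `C` the log-uniform prior actually covers, note that `hp.loguniform('C', -4, 2)` draws `exp(u)` with `u` uniform on `[-4, 2]`. The sketch below reproduces that with the standard library only; the seed and the sample size are arbitrary choices made for illustration.

```python
import math
import random

low, high = -4, 2                       # bounds passed to hp.loguniform in step 5
c_min, c_max = math.exp(low), math.exp(high)
print(f"C ranges over [{c_min:.4f}, {c_max:.4f}]")

# Drawing exp(uniform(low, high)) mimics the log-uniform distribution.
rng = random.Random(0)
samples = [math.exp(rng.uniform(low, high)) for _ in range(10_000)]
share_below_one = sum(s < 1 for s in samples) / len(samples)
print(f"share of draws with C < 1: {share_below_one:.2f}")
```

Roughly two thirds of the draws fall below 1, because four of the six units of the exponent range are negative. The search therefore spends most of its budget on stronger regularization.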