# Baseline Model Creation

In [22]:
import os

# Move working directory to project root if executed inside notebooks/
if os.getcwd().endswith("notebooks"):
    os.chdir("..")

print("Working directory:", os.getcwd())

Working directory: c:\Coding\pytorch\bank-marketing-ml


# Imports

In [23]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, f1_score


# **UNBALANCED DATASET**

To build intuition, I will first fit these models on the unbalanced dataset.

# Load Dataset

In [None]:
dataset_path = "data/processed/bank_processed_unbalanced.csv"
df = pd.read_csv(dataset_path)


# Separate Features and Target

In [25]:
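# Features are every column except the binary target y_yes
X = df.drop(columns=["y_yes"])
y = df["y_yes"]

# Hold out 20% for testing; stratify=y keeps the class ratio the same in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)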

# Logistic Regression

In [26]:
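# Fit a logistic regression with scikit-learn's default settings
lr = LogisticRegression().fit(X_train, y_train)
y_pred = lr.predict(X_test)

# Rows of the confusion matrix are true classes, columns are predicted classes
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))
print("\nClassification Report:")
print(classification_report(y_test, y_pred))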

Confusion Matrix:
[[7560  104]
 [ 830  177]]

Classification Report:
              precision    recall  f1-score   support

       False       0.90      0.99      0.94      7664
        True       0.63      0.18      0.27      1007

    accuracy                           0.89      8671
   macro avg       0.77      0.58      0.61      8671
weighted avg       0.87      0.89      0.86      8671



# Analysis
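The preliminary model performs very poorly on the positive class: it recovers only 177 of the 1,007 positive cases in the test set (recall 0.18, F1 0.27), even though overall accuracy is 0.89. This is likely a consequence of the significant class imbalance (positives make up about 12% of the test set) and the relatively weak predictive signal available in the features.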
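As a quick sanity check, the positive-class numbers in the report can be recomputed by hand from the confusion matrix printed above. This is only a small sketch: the variable names are illustrative, and the four counts are copied from the output rather than read from the fitted model.

```python
# Logistic regression confusion matrix, copied from the output above.
# scikit-learn layout: rows are true classes, columns are predicted classes.
tn, fp = 7560, 104
fn, tp = 830, 177

precision = tp / (tp + fp)  # of the predicted positives, how many were correct
recall = tp / (tp + fn)     # of the actual positives, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
# precision=0.63 recall=0.18 f1=0.27
```

The low recall comes from the 830 false negatives: most positive cases are predicted as negative.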

As a robustness check, we will next fit a random forest on the same unbalanced data.

# Random Forest

In [28]:
# Fit a random forest with default settings.
# No random_state is set, so the numbers below can vary slightly between runs.
rf = RandomForestClassifier()
rf.fit(X_train, y_train)
y_pred_rf = rf.predict(X_test)

print("Random Forest Confusion Matrix:")
print(confusion_matrix(y_test, y_pred_rf))
print("\nRandom Forest Classification Report:")
print(classification_report(y_test, y_pred_rf))

Random Forest Confusion Matrix:
[[7532  132]
 [ 790  217]]

Random Forest Classification Report:
              precision    recall  f1-score   support

       False       0.91      0.98      0.94      7664
        True       0.62      0.22      0.32      1007

    accuracy                           0.89      8671
   macro avg       0.76      0.60      0.63      8671
weighted avg       0.87      0.89      0.87      8671



# Analysis
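Better, but only slightly: positive-class recall rises from 0.18 to 0.22 and F1 from 0.27 to 0.32, while accuracy stays at 0.89. The class imbalance in this dataset is too severe to ignore and will need to be addressed. For robustness, however, the next notebook will still implement a neural network on the unbalanced dataset.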
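One way to see why accuracy says little here is to compare both models against a trivial rule that always predicts the majority class. The sketch below uses only the confusion matrices and support counts printed above. The `accuracy` helper and the always-negative baseline are illustrative assumptions, not part of the notebook's pipeline.

```python
# Confusion matrices copied from the outputs above (rows: true, columns: predicted).
matrices = {
    "logistic regression": [[7560, 104], [830, 177]],
    "random forest": [[7532, 132], [790, 217]],
}


def accuracy(cm):
    """Fraction of test rows on the diagonal, i.e. classified correctly."""
    total = sum(sum(row) for row in cm)
    return (cm[0][0] + cm[1][1]) / total


# Hypothetical baseline: always predict the majority (negative) class.
negatives, positives = 7664, 1007
baseline_accuracy = negatives / (negatives + positives)

print(f"majority-class baseline: {baseline_accuracy:.3f}")
for name, cm in matrices.items():
    print(f"{name}: {accuracy(cm):.3f}")
```

Both models beat the always-negative rule by only about one percentage point of accuracy, so positive-class recall and F1 are the more informative numbers to track on this data.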

# **BALANCED DATASET**