Team members:
- Sophia Gabriela Martinez Albarran A01424430
- Eduardo Botello Casey A01659281
- Marcos Saade Romano A01784220

## Dataset description

The data comes from direct marketing campaigns (phone calls) run by a Portuguese banking institution. The classification goal is to predict whether a client will subscribe to a term deposit (variable y). The dataset has 45211 instances. After dropping duration (see below), the feature set contains 6 integer variables and 9 categorical variables. We clean and preprocess the data before building the models.

In [1]:
import sklearn 
from ucimlrepo import fetch_ucirepo 
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score

  
# https://archive.ics.uci.edu/dataset/222/bank+marketing  
# fetch dataset 
bank_marketing = fetch_ucirepo(id=222) 
  
# data (as pandas dataframes) 
X = bank_marketing.data.features 
y = bank_marketing.data.targets 
  
# Drop `duration`: the dataset documentation notes that it is only known after the call ends
# and strongly determines the outcome, so it should be excluded from a realistic predictive model
X = X.drop(columns=['duration'])

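The duration column was removed following the dataset documentation: call duration is not known before a call is made, so it cannot be used to predict the outcome in advance.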

In [2]:
X.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45211 entries, 0 to 45210
Data columns (total 15 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   age          45211 non-null  int64 
 1   job          44923 non-null  object
 2   marital      45211 non-null  object
 3   education    43354 non-null  object
 4   default      45211 non-null  object
 5   balance      45211 non-null  int64 
 6   housing      45211 non-null  object
 7   loan         45211 non-null  object
 8   contact      32191 non-null  object
 9   day_of_week  45211 non-null  int64 
 10  month        45211 non-null  object
 11  campaign     45211 non-null  int64 
 12  pdays        45211 non-null  int64 
 13  previous     45211 non-null  int64 
 14  poutcome     8252 non-null   object
dtypes: int64(6), object(9)
memory usage: 5.2+ MB


In [3]:
X.describe()

| | age | balance | day_of_week | campaign | pdays | previous |
|---|---|---|---|---|---|---|
| count | 45211.0 | 45211.0 | 45211.0 | 45211.0 | 45211.0 | 45211.0 |
| mean | 40.93621 | 1362.272058 | 15.806419 | 2.763841 | 40.197828 | 0.580323 |
| std | 10.618762 | 3044.765829 | 8.322476 | 3.098021 | 100.128746 | 2.303441 |
| min | 18.0 | -8019.0 | 1.0 | 1.0 | -1.0 | 0.0 |
| 25% | 33.0 | 72.0 | 8.0 | 1.0 | -1.0 | 0.0 |
| 50% | 39.0 | 448.0 | 16.0 | 2.0 | -1.0 | 0.0 |
| 75% | 48.0 | 1428.0 | 21.0 | 3.0 | -1.0 | 0.0 |
| max | 95.0 | 102127.0 | 31.0 | 63.0 | 871.0 | 275.0 |


In [4]:
X.job = X.job.fillna('unknown')
X.education = X.education.fillna('unknown')
X.poutcome = X.poutcome.fillna('nonexistent')
X.contact = X.contact.fillna('unknown')


Missing values in the categorical variables were replaced with explicit placeholder categories: 'unknown' for job, education, and contact, and 'nonexistent' for poutcome.

Rather than deleting data, these values were preserved as valid categories.

One-hot encoding will later treat them as distinct features.



In [5]:
# Add polynomial features to help capture non-linear relationships
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
X_poly = poly.fit_transform(X.select_dtypes(include=['int64', 'float64']))

# Convert the polynomial features back to a DataFrame
X_poly = pd.DataFrame(X_poly, columns=poly.get_feature_names_out(X.select_dtypes(include=['int64', 'float64']).columns))

# Keep only categorical columns from original X and concatenate with polynomial features
X_categorical = X.select_dtypes(include=['object'])
X = pd.concat([X_categorical, X_poly], axis=1)

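We add degree-2 interaction features (pairwise products of the numeric columns; with interaction_only=True no squared terms are generated) to capture some nonlinear relationships.

This slows down training somewhat but may improve classification performance.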

In [6]:
from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder(handle_unknown='ignore')
X_encoded = encoder.fit_transform(X.select_dtypes(include=['object']))
X_encoded_df = pd.DataFrame(
    X_encoded.toarray(), 
    columns=encoder.get_feature_names_out(X.select_dtypes(include=['object']).columns)
)


In [7]:
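X_numerico = X.select_dtypes(exclude=['object'])
X_final = pd.concat([X_numerico.reset_index(drop=True), X_encoded_df.reset_index(drop=True)], axis=1)
# Note: the pipelines below take X and apply their own encoding, so X_final is not used again in this notebook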

# Logistic Regression Classifier 

Logistic Regression is widely used for binary classification problems and serves as an interpretable and effective baseline.

### Advantages:

- Interpretability: The model's coefficients indicate the direction and strength of each feature’s influence on the prediction.

- Efficiency: Training and inference are fast, even with large datasets.

- Probabilistic Output: It provides confidence scores for predictions.

- Built-in Regularization: It supports L2 regularization to reduce overfitting.

### Disadvantages:

- Assumes a linear relationship between the input features and the log-odds of the outcome.

- Sensitive to feature scaling, requiring standardized numeric features.

- Not suitable for nonlinear relationships unless combined with feature engineering.
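The log-odds assumption and the interpretability point can be illustrated with a short calculation. The intercept and coefficient below are hypothetical values chosen for the example, not coefficients from the fitted model.

```python
import math

def sigmoid(z: float) -> float:
    """Map a log-odds value to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

def odds(p: float) -> float:
    """Convert a probability to odds."""
    return p / (1.0 - p)

intercept = -2.0  # hypothetical
coef = 0.5        # hypothetical coefficient of one standardized feature

p_at_mean = sigmoid(intercept + coef * 0.0)    # feature at its mean
p_one_sd_up = sigmoid(intercept + coef * 1.0)  # feature one standard deviation higher

# The model is linear in the log-odds, so a one-unit increase always
# multiplies the odds by exp(coef), whatever the starting point.
odds_ratio = odds(p_one_sd_up) / odds(p_at_mean)
print(p_at_mean, p_one_sd_up, odds_ratio)
```

This is why standardizing the numeric features helps interpretation: each coefficient then describes the change in log-odds per standard deviation of its feature.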

In [8]:

cat_col = X.select_dtypes(include='object').columns.tolist()
num_col = X.select_dtypes(exclude='object').columns.tolist()


categorical_transformer = OneHotEncoder(handle_unknown='ignore')
numerical_transformer = StandardScaler()

preprocessor = ColumnTransformer(
    transformers=[
        ('cat', categorical_transformer, cat_col),
        ('num', numerical_transformer, num_col)
    ]
)

pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression(max_iter=1000))
])


To ensure consistent preprocessing of both categorical and numerical variables, we built a structured pipeline using ColumnTransformer and Pipeline.
This approach is modular, helps prevent data leakage (the transformers are fit only on the training portion of each cross-validation fold), and is compatible with cross-validation and model tuning.

The code identifies the categorical columns (those with object dtype) and stores them in cat_col; the remaining numerical columns are stored in num_col.

Then:

- OneHotEncoder: Converts each categorical column into a set of binary columns (0 or 1), one per category. It uses handle_unknown='ignore' to safely handle categories that might appear in new data but were not seen during training.

- StandardScaler: Standardizes numerical features by removing the mean and scaling to unit variance.

Both steps are needed: Logistic Regression is sensitive to feature scale, so the numeric features are standardized, and categorical variables must be encoded as numbers before the model can process them.

The ColumnTransformer applies categorical_transformer to the categorical columns and numerical_transformer to the numeric columns, so each group of columns receives the appropriate transformation, and the result plugs directly into the Pipeline.

In [9]:

y = y['y'].map({'yes': 1, 'no': 0})

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

The target variable y originally consisted of the categorical values 'yes' and 'no'. These were converted into the binary numeric values 1 and 0, respectively, using .map(). This makes the labels numeric and explicit for the scikit-learn estimators and metrics used below, and formalizes the task as a binary classification problem.

Then the dataset was split into training (80%) and test (20%) subsets using train_test_split(). The stratify=y parameter was used to maintain the original proportion of classes in both sets, which is particularly important for imbalanced classification problems. Additionally, random_state=42 was used to ensure that the split is reproducible, making the results consistent across runs.
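As a quick check of what stratification does, the sketch below splits a synthetic label vector with 10% positives (the real data has roughly 11.7% positives, 1058 of 9043 in the test set). The arrays are placeholders, not the bank data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels: 10 positives and 90 negatives; features are just row indices.
y_toy = np.array([1] * 10 + [0] * 90)
X_toy = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=42
)

# Both subsets keep the 10% positive rate of the full data.
print(y_tr.mean(), y_te.mean())
```

Without `stratify`, a small test set could by chance contain very few positives, which would make recall and ROC AUC estimates noisier.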

In [10]:
from sklearn.model_selection import GridSearchCV

param = {
    'classifier__C': [0.01, 0.1, 1, 10],
    'classifier__penalty': ['l2'],
    'classifier__solver': ['lbfgs']
}

grid_search = GridSearchCV(pipeline, param, cv=5, scoring='roc_auc')
grid_search.fit(X_train, y_train)

best_logisticreg_model = grid_search.best_estimator_

Hyperparameter tuning was performed using GridSearchCV with 5-fold cross-validation. A grid of candidate values was defined for C, the inverse of the regularization strength of the Logistic Regression model (smaller values mean stronger regularization). We fixed the penalty to L2 and used the lbfgs solver, which supports this penalty efficiently. The model was evaluated using the ROC AUC metric during cross-validation, which is appropriate for this imbalanced classification task. The best pipeline configuration was stored in best_logisticreg_model and used for final evaluation.
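Beyond the best estimator, it can be useful to look at how every candidate scored. The sketch below runs the same kind of grid on synthetic data from `make_classification` (a stand-in, not the bank data) and tabulates the cross-validated ROC AUC per value of C.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in data (about 10% positives).
X_toy, y_toy = make_classification(
    n_samples=300, n_features=5, weights=[0.9, 0.1], random_state=0
)

pipe = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("classifier", LogisticRegression(max_iter=1000)),
])
grid = GridSearchCV(pipe, {"classifier__C": [0.01, 0.1, 1, 10]}, cv=5, scoring="roc_auc")
grid.fit(X_toy, y_toy)

# One row per candidate: its C value, mean ROC AUC over the 5 folds, and rank.
results = pd.DataFrame(grid.cv_results_)[
    ["param_classifier__C", "mean_test_score", "rank_test_score"]
]
print(results)
print(grid.best_params_)
```

If the scores for neighbouring values of C are nearly identical, the choice of C matters little and the selected value should not be over-interpreted.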

In [11]:
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score

y_pred = best_logisticreg_model.predict(X_test)
y_proba = best_logisticreg_model.predict_proba(X_test)[:, 1]

print("Best parameters:", grid_search.best_params_)
print("Classification report:\n", classification_report(y_test, y_pred))
print("Confusion matrix:\n", confusion_matrix(y_test, y_pred))
print("ROC AUC:", roc_auc_score(y_test, y_proba))


Best parameters: {'classifier__C': 0.1, 'classifier__penalty': 'l2', 'classifier__solver': 'lbfgs'}
Classification report:
               precision    recall  f1-score   support

           0       0.90      0.99      0.94      7985
           1       0.66      0.17      0.27      1058

    accuracy                           0.89      9043
   macro avg       0.78      0.58      0.61      9043
weighted avg       0.87      0.89      0.86      9043

Confusion matrix:
 [[7894   91]
 [ 880  178]]
ROC AUC: 0.772730770004723


The model achieves moderate precision (0.66) on the positive class (yes), meaning that when it predicts a client will subscribe, it is correct about two times out of three.

However, it has low recall (0.17) on that same class, indicating that it misses a large portion of the actual positives.

This pattern is common in datasets with a skewed class distribution (most clients do not subscribe).

The high overall accuracy (0.89) is influenced by the model’s strong performance on the dominant class (no), but it may not be sufficient for applications where detecting potential subscribers is critical.
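The figures discussed above can be recomputed directly from the confusion matrix printed for Logistic Regression. The sketch also adds one number that is not in the report: the accuracy of always predicting "no", derived from the class supports.

```python
# Entries of the Logistic Regression confusion matrix reported above:
# [[TN, FP],
#  [FN, TP]]
tn, fp = 7894, 91
fn, tp = 880, 178

precision = tp / (tp + fp)                  # share of predicted "yes" that were correct
recall = tp / (tp + fn)                     # share of actual "yes" that were found
accuracy = (tp + tn) / (tn + fp + fn + tp)

# Accuracy of a trivial model that predicts "no" for every client.
majority_baseline = (tn + fp) / (tn + fp + fn + tp)

print(round(precision, 2), round(recall, 2), round(accuracy, 2), round(majority_baseline, 2))
```

The accuracy of 0.89 is only about one percentage point above the 0.88 obtained by predicting "no" for everyone, which is why accuracy alone says little about this model.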

# Naive Bayes Classifier

Naive Bayes is a family of simple yet effective probabilistic classifiers based on applying Bayes' Theorem with a strong (naive) assumption of feature independence.

### Advantages:

- Fast and Scalable: Extremely efficient to train and predict, even on large datasets.

- Works Well with Categorical Data: Especially useful when features are categorical or count-based.

- Robust to Irrelevant Features: Performs reasonably well even when some features are not informative.

- Probabilistic Interpretation: Naturally provides class probabilities, useful for ranking or decision thresholds.

### Disadvantages:

- Strong Independence Assumption: Assumes that all features are conditionally independent given the class, which is often unrealistic.

- Limited Expressiveness: Performs poorly when there are complex feature interactions or nonlinear boundaries.

- Poor Probability Calibration: Predicted probabilities may be unreliable compared to logistic regression or tree-based models.

- Requires Careful Feature Encoding: Must choose the right variant (Gaussian, Multinomial, Bernoulli) depending on the data type.


In [12]:
from sklearn.naive_bayes import GaussianNB

In [13]:
cat_col = X.select_dtypes(include='object').columns.tolist()

categorical_transformer = OneHotEncoder(handle_unknown='ignore', sparse_output=False)

preprocessor = ColumnTransformer(
    transformers=[
        ('cat', categorical_transformer, cat_col),
    ]
)

pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', GaussianNB())
])


The pipeline for the Naive Bayes classifier was constructed using Pipeline and ColumnTransformer to preprocess the categorical variables.  


The categorical columns (cat_col) were identified as those with object dtype.  
A OneHotEncoder was applied to these columns to convert them into binary features.  
This ensures that the categorical data is represented numerically, which is required for the Naive Bayes algorithm.  

The ColumnTransformer applies the OneHotEncoder to the categorical columns.  
The transformed data is then passed to the GaussianNB classifier for training and prediction.  

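Scaling is not applied in this pipeline because Gaussian Naive Bayes estimates a separate mean and variance for each feature and therefore does not require it. Note that this ColumnTransformer lists only the categorical columns; with the default remainder='drop', the numeric and interaction features are discarded, so this model is trained on the one-hot features alone.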

In [14]:
preprocessor = ColumnTransformer(
    transformers=[
        ('cat', categorical_transformer, cat_col),
    ]
)

pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', GaussianNB())
])

param = {
    'classifier__priors': [None, [0.6, 0.4], [0.4, 0.6]],
    'classifier__var_smoothing': [1e-09, 1e-08, 1e-07],
}

grid_search = GridSearchCV(pipeline, param, cv=5, scoring='roc_auc')
grid_search.fit(X_train, y_train)

best_gaussian_model = grid_search.best_estimator_

Hyperparameter tuning was performed using GridSearchCV with 5-fold cross-validation.

A grid of candidate values was defined for the priors, which represent the prior probabilities of the classes, and var_smoothing, a parameter that controls the variance smoothing to handle numerical stability in the Gaussian Naive Bayes model.

The model was evaluated using the ROC AUC metric during cross-validation, which is suitable for this imbalanced classification task. The best pipeline configuration was stored in best_gaussian_model and used for final evaluation.
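To show what the priors parameter changes, the sketch below fits GaussianNB twice on synthetic one-dimensional data (10% positives, generated with a fixed seed) and compares the predicted probability for one query point. The data and the query value are assumptions made for the example.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)
# Synthetic data: 90 negatives around 0 and 10 positives around 1.
X_toy = np.concatenate([
    rng.normal(0.0, 1.0, size=90),
    rng.normal(1.0, 1.0, size=10),
]).reshape(-1, 1)
y_toy = np.array([0] * 90 + [1] * 10)

default_nb = GaussianNB().fit(X_toy, y_toy)                   # priors from class frequencies
shifted_nb = GaussianNB(priors=[0.4, 0.6]).fit(X_toy, y_toy)  # priors favouring class 1

query = np.array([[0.5]])
p_default = default_nb.predict_proba(query)[0, 1]
p_shifted = shifted_nb.predict_proba(query)[0, 1]

# Same likelihoods, different priors: the positive-class probability rises.
print(default_nb.class_prior_, p_default, p_shifted)
```

Because a prior shifts every sample's log-odds by the same amount, it should mainly change which samples cross the 0.5 threshold (and hence precision and recall) rather than their ranking, which is what ROC AUC measures.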

In [15]:
y_pred = best_gaussian_model.predict(X_test)
y_proba = best_gaussian_model.predict_proba(X_test)[:, 1]

print("Best parameters :", grid_search.best_params_)
print("Classification report:\n", classification_report(y_test, y_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("ROC AUC:", roc_auc_score(y_test, y_proba))


Best parameters : {'classifier__priors': [0.4, 0.6], 'classifier__var_smoothing': 1e-09}
Classification report:
               precision    recall  f1-score   support

           0       0.93      0.86      0.90      7985
           1       0.33      0.53      0.41      1058

    accuracy                           0.82      9043
   macro avg       0.63      0.69      0.65      9043
weighted avg       0.86      0.82      0.84      9043

Confusion Matrix:
 [[6875 1110]
 [ 501  557]]
ROC AUC: 0.7492743956354838


# Decision Tree Classifier

### Advantages:

- Handles Nonlinear Relationships: Decision Trees can model complex, nonlinear relationships between features and the target variable.

- Interpretability: The tree structure is easy to visualize and interpret, making it accessible to non-technical stakeholders.

- No Need for Feature Scaling: Decision Trees do not require normalization or standardization of features.

- Handles Both Categorical and Numerical Data: Trees can split on both kinds of feature. scikit-learn's implementation requires numeric input, however, so the categorical columns are still one-hot encoded below.

- Robust to Outliers: Less sensitive to outliers compared to linear models.

### Disadvantages:

- Overfitting: Decision Trees are prone to overfitting, especially when they grow deep.

- Instability: Small changes in the data can lead to significantly different tree structures.

- Bias Towards Dominant Classes: In imbalanced datasets, the tree may favor the majority class.

- Limited Generalization: Without pruning or regularization, the model may not generalize well to unseen data.

- Computationally Expensive: Training can be slow for large datasets with many features.

In [16]:
from sklearn.tree import DecisionTreeClassifier

cat_col = X.select_dtypes(include='object').columns.tolist()

categorical_transformer = OneHotEncoder(handle_unknown='ignore', sparse_output=False)

preprocessor = ColumnTransformer(
    transformers=[
        ('cat', categorical_transformer, cat_col),
    ]
)

pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', DecisionTreeClassifier(random_state=42))
])


The pipeline for the Decision Tree classifier was constructed using Pipeline and ColumnTransformer to preprocess the categorical variables.

- Categorical Columns: The categorical columns (cat_col) were identified as those with object dtype.
- OneHotEncoder: A OneHotEncoder was applied to these columns to convert them into binary features. This ensures that the categorical data is represented numerically, which scikit-learn's DecisionTreeClassifier requires.
- ColumnTransformer: The ColumnTransformer applies the OneHotEncoder to the categorical columns. Because only these columns are listed and remainder defaults to 'drop', the numeric and interaction features are not passed to the tree.



In [17]:
param = {
    'classifier__criterion': ['gini', 'entropy'],
    'classifier__max_depth': [None, 10, 20, 30],
    'classifier__min_samples_split': [2, 5, 10],
    'classifier__min_samples_leaf': [1, 2, 4]
}

grid_search = GridSearchCV(pipeline, param, cv=5, scoring='roc_auc')
grid_search.fit(X_train, y_train)

best_decision_tree_model = grid_search.best_estimator_

Hyperparameter tuning was performed using GridSearchCV with 5-fold cross-validation.

A grid of candidate values was defined for the criterion, which determines the function to measure the quality of a split (e.g., 'gini' or 'entropy'), max_depth, which limits the depth of the tree to prevent overfitting, min_samples_split, the minimum number of samples required to split an internal node, and min_samples_leaf, the minimum number of samples required to be at a leaf node.

The model was evaluated using the ROC AUC metric during cross-validation, which is appropriate for imbalanced classification tasks. The best pipeline configuration was stored in best_decision_tree_model and used for final evaluation.
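The effect of the depth and leaf-size limits can be seen by fitting a constrained and an unconstrained tree on the same data. The sketch uses synthetic data from `make_classification` and illustrative limits (max_depth=3, min_samples_leaf=4), not the values selected by the grid search.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data, not the bank dataset.
X_toy, y_toy = make_classification(n_samples=500, n_features=8, random_state=0)

unconstrained = DecisionTreeClassifier(random_state=42).fit(X_toy, y_toy)
constrained = DecisionTreeClassifier(
    max_depth=3, min_samples_leaf=4, random_state=42
).fit(X_toy, y_toy)

# The unconstrained tree keeps splitting until its leaves are pure;
# the constrained one stops at depth 3 and keeps at least 4 samples per leaf.
print(unconstrained.get_depth(), unconstrained.get_n_leaves())
print(constrained.get_depth(), constrained.get_n_leaves())
```

A shallower tree with larger leaves gives smoother probability estimates, which tends to help a ranking metric such as ROC AUC.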

In [18]:
y_pred = best_decision_tree_model.predict(X_test)
y_proba = best_decision_tree_model.predict_proba(X_test)[:, 1]

print("Best parameters :", grid_search.best_params_)
print("Classification report:\n", classification_report(y_test, y_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("ROC AUC:", roc_auc_score(y_test, y_proba))


Best parameters : {'classifier__criterion': 'entropy', 'classifier__max_depth': 10, 'classifier__min_samples_leaf': 4, 'classifier__min_samples_split': 10}
Classification report:
               precision    recall  f1-score   support

           0       0.90      0.99      0.94      7985
           1       0.61      0.17      0.27      1058

    accuracy                           0.89      9043
   macro avg       0.76      0.58      0.60      9043
weighted avg       0.87      0.89      0.86      9043

Confusion Matrix:
 [[7871  114]
 [ 878  180]]
ROC AUC: 0.7594911536635919


# Ensemble
We are going to train some ensemble models, and compare their performance to the baseline models.

In [19]:
from sklearn.ensemble import VotingClassifier

# Define the ensemble models
hard_voting_ensemble = VotingClassifier(
    estimators=[
        ('logistic', best_logisticreg_model),
        ('gaussian', best_gaussian_model),
        ('decision_tree', best_decision_tree_model)
    ],
    voting='hard'
)

soft_voting_ensemble = VotingClassifier(
    estimators=[
        ('logistic', best_logisticreg_model),
        ('gaussian', best_gaussian_model),
        ('decision_tree', best_decision_tree_model)
    ],
    voting='soft'
)

# Train the hard voting ensemble
hard_voting_ensemble.fit(X_train, y_train)

# Train the soft voting ensemble
soft_voting_ensemble.fit(X_train, y_train)

In the cell above, we define and train two ensemble models using the VotingClassifier from scikit-learn. The first model, hard_voting_ensemble, uses hard voting, where the final prediction is based on the majority vote of the individual classifiers. The second model, soft_voting_ensemble, uses soft voting, where the final prediction is based on the average of the predicted probabilities from the individual classifiers.

Both ensembles combine three base models: the logistic regression model (best_logisticreg_model), the Gaussian Naive Bayes model (best_gaussian_model), and the decision tree model (best_decision_tree_model). After defining the ensembles, we fit them to the training data (X_train and y_train) to prepare them for evaluation.
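The difference between the two voting rules can be worked through by hand. The probabilities below are made up for a single hypothetical client and only illustrate how the rules can disagree.

```python
import numpy as np

# Hypothetical predicted probabilities of "yes" for one client from three models.
probas = np.array([0.45, 0.45, 0.90])

# Hard voting: each model casts a 0/1 vote at the 0.5 threshold; the majority wins.
votes = (probas >= 0.5).astype(int)
hard_vote = int(votes.sum() > len(votes) / 2)

# Soft voting: average the probabilities first, then apply the threshold.
mean_proba = probas.mean()
soft_vote = int(mean_proba >= 0.5)

print(votes, hard_vote)
print(mean_proba, soft_vote)
```

Here two models lean slightly towards "no" and one is strongly confident of "yes": hard voting predicts "no", while soft voting predicts "yes" because it takes confidence into account.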

In [20]:
# Evaluate hard voting ensemble
y_pred_hard = hard_voting_ensemble.predict(X_test)

print("Hard Voting Ensemble:")
print("Classification Report:\n", classification_report(y_test, y_pred_hard))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred_hard))

# Evaluate soft voting ensemble
y_pred_soft = soft_voting_ensemble.predict(X_test)
y_proba_soft = soft_voting_ensemble.predict_proba(X_test)[:, 1]

print("\nSoft Voting Ensemble:")
print("Classification Report:\n", classification_report(y_test, y_pred_soft))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred_soft))
print("ROC AUC:", roc_auc_score(y_test, y_proba_soft))

# Compare with baseline models
print("\nBaseline Models Comparison:")

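# Logistic Regression (recomputed here: y_pred and y_proba were overwritten by the Decision Tree cell)
y_pred_logistic = best_logisticreg_model.predict(X_test)
y_proba_logistic = best_logisticreg_model.predict_proba(X_test)[:, 1]
print("\nLogistic Regression:")
print("Classification Report:\n", classification_report(y_test, y_pred_logistic))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred_logistic))
print("ROC AUC:", roc_auc_score(y_test, y_proba_logistic))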

# Gaussian Naive Bayes
y_pred_gaussian = best_gaussian_model.predict(X_test)
y_proba_gaussian = best_gaussian_model.predict_proba(X_test)[:, 1]
print("\nGaussian Naive Bayes:")
print("Classification Report:\n", classification_report(y_test, y_pred_gaussian))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred_gaussian))
print("ROC AUC:", roc_auc_score(y_test, y_proba_gaussian))

# Decision Tree
y_pred_tree = best_decision_tree_model.predict(X_test)
y_proba_tree = best_decision_tree_model.predict_proba(X_test)[:, 1]
print("\nDecision Tree:")
print("Classification Report:\n", classification_report(y_test, y_pred_tree))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred_tree))
print("ROC AUC:", roc_auc_score(y_test, y_proba_tree))


Hard Voting Ensemble:
Classification Report:
               precision    recall  f1-score   support

           0       0.90      0.98      0.94      7985
           1       0.63      0.21      0.31      1058

    accuracy                           0.89      9043
   macro avg       0.77      0.60      0.63      9043
weighted avg       0.87      0.89      0.87      9043

Confusion Matrix:
 [[7855  130]
 [ 837  221]]

Soft Voting Ensemble:
Classification Report:
               precision    recall  f1-score   support

           0       0.92      0.95      0.94      7985
           1       0.52      0.37      0.44      1058

    accuracy                           0.89      9043
   macro avg       0.72      0.66      0.69      9043
weighted avg       0.87      0.89      0.88      9043

Confusion Matrix:
 [[7625  360]
 [ 663  395]]
ROC AUC: 0.7772201658828641

Baseline Models Comparison:

Logistic Regression:
Classification Report:
               precision    recall  f1-score   support

   

# Conclusion

In this report, we explored various machine learning models to predict whether a client will subscribe to a term deposit based on the Bank Marketing dataset. The workflow included data preprocessing, model training, hyperparameter tuning, and evaluation.

### Key Findings:
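1. **Baseline Models**:
    - Logistic Regression provided interpretable results and the best single-model ROC AUC (0.773), but at the default threshold it recovered only 17% of actual subscribers (precision 0.66).
    - Gaussian Naive Bayes was efficient and had the highest recall (0.53), at the cost of the lowest precision (0.33) and ROC AUC (0.749); its independence assumption is likely violated by the one-hot features.
    - Decision Tree, limited to max_depth=10 by the grid search to curb overfitting, performed close to Logistic Regression (ROC AUC 0.759, recall 0.17).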

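2. **Ensemble Models**:
    - The VotingClassifier ensembles (hard and soft voting) combined the three tuned models. Hard voting changed little relative to the individual models (precision 0.63, recall 0.21).
    - Soft voting, which averages predicted probabilities, reached the highest ROC AUC (0.777, versus 0.773 for Logistic Regression) and a better positive-class F1 than hard voting (0.44 versus 0.31). ROC AUC was not computed for hard voting, which outputs class labels only.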

3. **Evaluation Metrics**:
    - While accuracy was high across models, it was influenced by the imbalanced nature of the dataset.
    - ROC AUC was a more reliable metric, highlighting the models' ability to distinguish between classes.

### Recommendations:
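- For practical deployment, the soft voting ensemble is a reasonable choice: it had the highest ROC AUC (0.777) and the best positive-class F1 (0.44), although its ROC AUC margin over Logistic Regression is small.
- Further improvements could be explored through additional ensemble techniques (e.g., Random Forest, Gradient Boosting) and by passing the numeric and interaction features, currently used only by Logistic Regression, to the Naive Bayes and Decision Tree pipelines.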

This analysis demonstrates the importance of combining domain knowledge, data preprocessing, and model selection to achieve reliable predictions in real-world applications.