# Practical Application III: Comparing Classifiers

**Overview**: In this practical application, your goal is to compare the performance of the classifiers we encountered in this section, namely K Nearest Neighbor, Logistic Regression, Decision Trees, and Support Vector Machines.  We will utilize a dataset related to marketing bank products over the telephone.  



### Getting Started

Our dataset comes from the UCI Machine Learning repository [link](https://archive.ics.uci.edu/ml/datasets/bank+marketing).  The data is from a Portuguese banking institution and is a collection of the results of multiple marketing campaigns.  We will make use of the article accompanying the dataset [here](CRISP-DM-BANK.pdf) for more information on the data and features.



### Problem 1: Understanding the Data

To gain a better understanding of the data, please read the information provided in the UCI link above, and examine the **Materials and Methods** section of the paper.  How many marketing campaigns does this data represent?

According to the **Materials and Methods** section of the paper, the data was collected from 17 marketing campaigns that the Portuguese bank ran between May 2008 and November 2010.

Each row of the dataset records one phone contact with a client, made as part of these campaigns to promote a long-term deposit.

### Problem 2: Read in the Data

Use pandas to read in the dataset `bank-additional-full.csv` and assign to a meaningful variable name.

In [70]:
# Read the dataset into a meaningful variable
import pandas as pd

# 'bank_df' is used as a meaningful variable name representing the bank marketing dataset
bank_df = pd.read_csv("data/bank-additional-full.csv", sep=';')

# Display the first few rows
bank_df.head()


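|   | age | job | marital | education | default | housing | loan | contact | month | day_of_week | ... | campaign | pdays | previous | poutcome | emp.var.rate | cons.price.idx | cons.conf.idx | euribor3m | nr.employed | y |
| - | --- | --- | ------- | --------- | ------- | ------- | ---- | ------- | ----- | ----------- | --- | -------- | ----- | -------- | -------- | ------------ | -------------- | ------------- | --------- | ----------- | - |
| 0 | 56 | housemaid | married | basic.4y | no | no | no | telephone | may | mon | ... | 1 | 999 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | no |
| 1 | 57 | services | married | high.school | unknown | no | no | telephone | may | mon | ... | 1 | 999 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | no |
| 2 | 37 | services | married | high.school | no | yes | no | telephone | may | mon | ... | 1 | 999 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | no |
| 3 | 40 | admin. | married | basic.6y | no | no | no | telephone | may | mon | ... | 1 | 999 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | no |
| 4 | 56 | services | married | high.school | no | no | yes | telephone | may | mon | ... | 1 | 999 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | no |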


### Problem 3: Understanding the Features


Examine the data description below, and determine if any of the features are missing values or need to be coerced to a different data type.


```
Input variables:
# bank client data:
1 - age (numeric)
2 - job : type of job (categorical: 'admin.','blue-collar','entrepreneur','housemaid','management','retired','self-employed','services','student','technician','unemployed','unknown')
3 - marital : marital status (categorical: 'divorced','married','single','unknown'; note: 'divorced' means divorced or widowed)
4 - education (categorical: 'basic.4y','basic.6y','basic.9y','high.school','illiterate','professional.course','university.degree','unknown')
5 - default: has credit in default? (categorical: 'no','yes','unknown')
6 - housing: has housing loan? (categorical: 'no','yes','unknown')
7 - loan: has personal loan? (categorical: 'no','yes','unknown')
# related with the last contact of the current campaign:
8 - contact: contact communication type (categorical: 'cellular','telephone')
9 - month: last contact month of year (categorical: 'jan', 'feb', 'mar', ..., 'nov', 'dec')
10 - day_of_week: last contact day of the week (categorical: 'mon','tue','wed','thu','fri')
11 - duration: last contact duration, in seconds (numeric). Important note: this attribute highly affects the output target (e.g., if duration=0 then y='no'). Yet, the duration is not known before a call is performed. Also, after the end of the call y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.
# other attributes:
12 - campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)
13 - pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means client was not previously contacted)
14 - previous: number of contacts performed before this campaign and for this client (numeric)
15 - poutcome: outcome of the previous marketing campaign (categorical: 'failure','nonexistent','success')
# social and economic context attributes
16 - emp.var.rate: employment variation rate - quarterly indicator (numeric)
17 - cons.price.idx: consumer price index - monthly indicator (numeric)
18 - cons.conf.idx: consumer confidence index - monthly indicator (numeric)
19 - euribor3m: euribor 3 month rate - daily indicator (numeric)
20 - nr.employed: number of employees - quarterly indicator (numeric)

Output variable (desired target):
21 - y - has the client subscribed a term deposit? (binary: 'yes','no')
```



In [71]:
# Check for missing values
missing_values = bank_df.isnull().sum()
print("Missing values per column:")
print(missing_values[missing_values > 0])

# Check for 'object' columns and inspect unique values for coercion
print("\nData types of each column:")
print(bank_df.dtypes)

print("\nUnique values in object (categorical) columns:")
for col in bank_df.select_dtypes(include='object').columns:
    print(f"{col}: {bank_df[col].unique()[:5]}... (total {bank_df[col].nunique()} unique values)")


Missing values per column:
Series([], dtype: int64)

Data types of each column:
age                 int64
job                object
marital            object
education          object
default            object
housing            object
loan               object
contact            object
month              object
day_of_week        object
duration            int64
campaign            int64
pdays               int64
previous            int64
poutcome           object
emp.var.rate      float64
cons.price.idx    float64
cons.conf.idx     float64
euribor3m         float64
nr.employed       float64
y                  object
dtype: object

Unique values in object (categorical) columns:
job: ['housemaid' 'services' 'admin.' 'blue-collar' 'technician']... (total 12 unique values)
marital: ['married' 'single' 'divorced' 'unknown']... (total 4 unique values)
education: ['basic.4y' 'high.school' 'basic.6y' 'basic.9y' 'professional.course']... (total 8 unique values)
default: ['no' 'unknown' 'yes']
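
The check above finds no `NaN` values, but the data description suggests that missing information is encoded in other ways: several categorical columns use the label `'unknown'`, and `pdays` uses `999` to mean "not previously contacted". The sketch below shows one way to count these placeholders. It runs on a small made-up frame (`toy_df`, with illustrative rows only) rather than on `bank_df`, and the recoding at the end is an optional assumption, not something the assignment requires.

```python
import pandas as pd

# Toy stand-in for bank_df (illustrative rows only, not real data)
toy_df = pd.DataFrame({
    "job": ["admin.", "unknown", "services", "technician"],
    "default": ["no", "unknown", "unknown", "no"],
    "pdays": [999, 999, 6, 999],
})

# Count the literal string 'unknown' in each categorical column
cat_cols = ["job", "default"]
unknown_counts = (toy_df[cat_cols] == "unknown").sum()
print(unknown_counts)

# Share of rows where pdays holds the 999 sentinel
never_contacted_share = (toy_df["pdays"] == 999).mean()
print(f"pdays == 999 in {never_contacted_share:.0%} of rows")

# Optional recoding: a boolean flag plus NaN in place of the sentinel
toy_df["previously_contacted"] = toy_df["pdays"] != 999
toy_df["pdays_clean"] = toy_df["pdays"].where(toy_df["pdays"] != 999)
```

On the real data, the same two counts would show how much of each column is effectively missing before deciding whether to keep `'unknown'` as its own category.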

### Problem 4: Understanding the Task

After examining the description and data, your goal now is to clearly state the *Business Objective* of the task.  State the objective below.

#### Business Objective

We want to predict whether a client will subscribe to a term deposit.  
To achieve this, we will follow a data-driven approach that includes:

- Understanding and preparing the data
- Engineering relevant features
- Training and comparing different classification models
- Evaluating model performance using appropriate metrics
- Drawing insights that can support future marketing decisions


### Problem 5: Engineering Features

Now that you understand your business objective, we will build a basic model to get started.  Before we can do this, we must work to encode the data.  Using just the bank information features, prepare the features and target column for modeling with appropriate encoding and transformations.

In [72]:
# Create a copy of the original dataframe
df = bank_df.copy()

# Encode the target variable
df['y'] = df['y'].map({'yes': 1, 'no': 0})

# Convert categorical variables using one-hot encoding
categorical_cols = df.select_dtypes(include='object').columns
df = pd.get_dummies(df, columns=categorical_cols, drop_first=True)

# Display the structure of the processed dataframe
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 41188 entries, 0 to 41187
Data columns (total 54 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   age                            41188 non-null  int64  
 1   duration                       41188 non-null  int64  
 2   campaign                       41188 non-null  int64  
 3   pdays                          41188 non-null  int64  
 4   previous                       41188 non-null  int64  
 5   emp.var.rate                   41188 non-null  float64
 6   cons.price.idx                 41188 non-null  float64
 7   cons.conf.idx                  41188 non-null  float64
 8   euribor3m                      41188 non-null  float64
 9   nr.employed                    41188 non-null  float64
 10  y                              41188 non-null  int64  
 11  job_blue-collar                41188 non-null  bool   
 12  job_entrepreneur               41188 non-null 
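
The problem statement asks for only the bank information features, whereas the cell above encodes every column. A minimal sketch of the narrower version follows, assuming "bank information features" means the seven columns listed under "bank client data" in the description (`age` through `loan`). The frame `toy_df` and the names `bank_features`, `preprocess`, `X_bank`, and `y_bank` are illustrative, not part of the original notebook.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy rows using the seven bank-client columns from the data description
toy_df = pd.DataFrame({
    "age": [56, 57, 37, 40],
    "job": ["housemaid", "services", "services", "admin."],
    "marital": ["married", "married", "married", "single"],
    "education": ["basic.4y", "high.school", "high.school", "basic.6y"],
    "default": ["no", "unknown", "no", "no"],
    "housing": ["no", "no", "yes", "no"],
    "loan": ["no", "no", "no", "yes"],
    "y": ["no", "no", "yes", "no"],
})

bank_features = ["age", "job", "marital", "education", "default", "housing", "loan"]
categorical = [c for c in bank_features if c != "age"]

# Scale the numeric column and one-hot encode the categorical ones
preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["age"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
])

X_bank = preprocess.fit_transform(toy_df[bank_features])
y_bank = toy_df["y"].map({"yes": 1, "no": 0})
print(X_bank.shape)
```

A `ColumnTransformer` can also be placed inside a scikit-learn `Pipeline`, so the same encoding is learned on the training rows and reused on the test rows.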

### Problem 6: Train/Test Split

With your data prepared, split it into a train and test set.

In [73]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Define features and target
X = df.drop("y", axis=1)
y = df["y"]

# Scale the features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Split into train/test sets
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.3, random_state=42)

# Display shapes
X_train.shape, X_test.shape


((28831, 53), (12357, 53))
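
In the cell above, the scaler is fit on all rows before the split, so statistics from the test rows influence the scaling of the training rows. A common alternative is to split first and fit the scaler on the training portion only. The sketch below illustrates that order on synthetic data from `make_classification` (an assumption standing in for `X` and `y`); it also uses `stratify`, which the original cell does not, to keep the class ratio similar in both sets.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic imbalanced data standing in for X and y (roughly 11% positives)
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=10, weights=[0.89, 0.11], random_state=42
)

# Split first; stratify keeps the class ratio similar in both sets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=42
)

# Fit the scaler on the training rows only, then apply it to both sets
scaler_demo = StandardScaler().fit(X_tr)
X_tr_scaled = scaler_demo.transform(X_tr)
X_te_scaled = scaler_demo.transform(X_te)

print(np.round(X_tr_scaled.mean(axis=0), 3))
```

With this ordering the training columns have mean zero by construction, while the test columns are only approximately centered, which is the expected behavior for unseen data.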

### Problem 7: A Baseline Model

Before we build our first model, we want to establish a baseline.  What is the baseline performance that our classifier should aim to beat?

In [74]:
# Always predicting the majority class
from sklearn.dummy import DummyClassifier
from sklearn.metrics import classification_report

dummy = DummyClassifier(strategy='most_frequent')
dummy.fit(X_train, y_train)
y_pred_dummy = dummy.predict(X_test)

print("Classification Report for Baseline Model:")
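# zero_division=0 reports 0.0 (without warnings) for the class that is never predicted
print(classification_report(y_test, y_pred_dummy, zero_division=0))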


Classification Report for Baseline Model:
              precision    recall  f1-score   support

           0       0.89      1.00      0.94     10968
           1       0.00      0.00      0.00      1389

    accuracy                           0.89     12357
   macro avg       0.44      0.50      0.47     12357
weighted avg       0.79      0.89      0.83     12357
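
The baseline accuracy can also be read directly from the class counts, since always predicting the majority class is correct exactly as often as that class occurs. A short check, using the `support` values printed in the report above:

```python
# Class counts taken from the "support" column of the report above
n_no, n_yes = 10968, 1389

# Accuracy of always predicting the majority class ("no")
baseline_accuracy = n_no / (n_no + n_yes)
print(f"Majority-class baseline accuracy: {baseline_accuracy:.4f}")
```

Any classifier we build should beat this figure on accuracy, and because the baseline never identifies a subscriber, recall on class 1 is a more demanding comparison.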





### Problem 8: A Simple Model

Use Logistic Regression to build a basic model on your data.  

In [75]:
# Train a Logistic Regression as a simple model
from sklearn.linear_model import LogisticRegression

logreg = LogisticRegression(max_iter=1000)
logreg.fit(X_train, y_train)
y_pred_logreg = logreg.predict(X_test)

print("Classification Report for Logistic Regression:")
print(classification_report(y_test, y_pred_logreg))


Classification Report for Logistic Regression:
              precision    recall  f1-score   support

           0       0.93      0.97      0.95     10968
           1       0.68      0.42      0.52      1389

    accuracy                           0.91     12357
   macro avg       0.80      0.70      0.74     12357
weighted avg       0.90      0.91      0.90     12357
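
For the positive class, the F1 score in the report is the harmonic mean of precision and recall. The arithmetic below reproduces it from the rounded values printed above, so the result is approximate.

```python
# Class-1 precision and recall as printed (rounded) in the report above
precision, recall = 0.68, 0.42

# F1 is the harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)
print(f"F1 for the positive class: {f1:.2f}")
```

The low recall pulls F1 well below the overall accuracy, which is why accuracy alone can look better than the model's ability to find subscribers.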



### Problem 9: Score the Model

What is the accuracy of your model?

In [76]:
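from sklearn.metrics import accuracy_score, roc_auc_score

# Accuracy on the held-out test set (the metric this problem asks for)
test_accuracy_logreg = accuracy_score(y_test, y_pred_logreg)
print(f"Test Accuracy for Logistic Regression: {test_accuracy_logreg:.4f}")

# ROC AUC as a complementary, threshold-independent metric for the imbalanced target
y_proba_logreg = logreg.predict_proba(X_test)[:, 1]
roc_auc = roc_auc_score(y_test, y_proba_logreg)
print(f"ROC AUC Score for Logistic Regression: {roc_auc:.3f}")


Test Accuracy for Logistic Regression: 0.9123
ROC AUC Score for Logistic Regression: 0.936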


### Problem 10: Model Comparisons

Now, we aim to compare the performance of the Logistic Regression model to our KNN algorithm, Decision Tree, and SVM models.  Using the default settings for each of the models, fit and score each.  Also, be sure to compare the fit time of each of the models.  Present your findings in a `DataFrame` similar to that below:

| Model | Train Time | Train Accuracy | Test Accuracy |
| ----- | ---------- | -------------  | -----------   |
|     |    |.     |.     |

In [77]:
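import time
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Mostly default settings; random_state is fixed for reproducibility.
# Note: probability=True is not an SVC default. It enables predict_proba via
# internal cross-validation, which adds to the SVM's fit time.
models_with_time = {
    "K-Nearest Neighbors": KNeighborsClassifier(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Support Vector Machine": SVC(probability=True, random_state=42)
}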

timing_results = []

for name, model in models_with_time.items():
    start = time.time()
    model.fit(X_train, y_train)
    end = time.time()
    
    train_time = end - start
    train_accuracy = accuracy_score(y_train, model.predict(X_train))
    test_accuracy = accuracy_score(y_test, model.predict(X_test))
    
    timing_results.append({
        "Model": name,
        "Train Time (s)": round(train_time, 4),
        "Train Accuracy": round(train_accuracy, 4),
        "Test Accuracy": round(test_accuracy, 4)
    })

timing_df = pd.DataFrame(timing_results)
timing_df

|   | Model | Train Time (s) | Train Accuracy | Test Accuracy |
| - | ----- | -------------- | -------------- | ------------- |
| 0 | K-Nearest Neighbors | 0.0078 | 0.9195 | 0.8977 |
| 1 | Logistic Regression | 0.341 | 0.9118 | 0.9123 |
| 2 | Decision Tree | 0.4672 | 1.0 | 0.8899 |
| 3 | Support Vector Machine | 274.7309 | 0.9251 | 0.9093 |
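
One way to read these results is to look at the gap between train and test accuracy for each model. The sketch below copies the accuracies from the table above into a small frame (the name `results` and the `Gap` column are illustrative) and sorts by that gap.

```python
import pandas as pd

# Accuracies copied from the comparison table above
results = pd.DataFrame({
    "Model": ["K-Nearest Neighbors", "Logistic Regression",
              "Decision Tree", "Support Vector Machine"],
    "Train Accuracy": [0.9195, 0.9118, 1.0, 0.9251],
    "Test Accuracy": [0.8977, 0.9123, 0.8899, 0.9093],
})

# Train-minus-test gap: a large positive value suggests overfitting
results["Gap"] = (results["Train Accuracy"] - results["Test Accuracy"]).round(4)
print(results.sort_values("Gap", ascending=False))
```

The unconstrained Decision Tree fits the training set perfectly but has the lowest test accuracy of the four, which is the usual signature of overfitting and a motivation for limiting its depth.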


### Problem 11: Improving the Model

Now that we have some basic models on the board, we want to try to improve these.  Below, we list a few things to explore in this pursuit.

- More feature engineering and exploration.  For example, should we keep the gender feature?  Why or why not?
- Hyperparameter tuning and grid search.  All of our models have additional hyperparameters to tune and explore, for example the number of neighbors in KNN or the maximum depth of a Decision Tree.
- Adjust your performance metric
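
As an illustration of the last point, the sketch below scores a Logistic Regression with several metrics under cross-validation, with and without `class_weight="balanced"`. It uses synthetic imbalanced data from `make_classification` as a stand-in for the bank data, so the printed numbers say nothing about the real dataset. The choice of metrics and the `summary` dictionary are assumptions for the example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic imbalanced data (roughly 11% positives), not the bank dataset
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=10, weights=[0.89, 0.11], random_state=42
)

candidates = [
    ("default", LogisticRegression(max_iter=1000)),
    ("balanced", LogisticRegression(max_iter=1000, class_weight="balanced")),
]

# Mean 5-fold cross-validated score for each model and metric
summary = {}
for label, clf in candidates:
    for metric in ["accuracy", "recall", "f1"]:
        scores = cross_val_score(clf, X_demo, y_demo, cv=5, scoring=metric)
        summary[(label, metric)] = scores.mean()
        print(f"{label:>8} | {metric:<8} | {scores.mean():.3f}")
```

Comparing the rows side by side shows how the ranking of two models can depend on the metric, which matters when the positive class is rare.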

In [78]:
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Fit a quick model to check feature importance
rf = RandomForestClassifier(random_state=42)
rf.fit(X_train, y_train)

importances = rf.feature_importances_
features = df.drop("y", axis=1).columns
importance_df = pd.DataFrame({
    "Feature": features,
    "Importance": importances
}).sort_values(by="Importance", ascending=False)

importance_df.head(10)

|    | Feature | Importance |
| -- | ------- | ---------- |
| 1 | duration | 0.295058 |
| 8 | euribor3m | 0.094876 |
| 0 | age | 0.085516 |
| 9 | nr.employed | 0.054814 |
| 2 | campaign | 0.041603 |
| 3 | pdays | 0.030177 |
| 7 | cons.conf.idx | 0.029846 |
| 5 | emp.var.rate | 0.024137 |
| 52 | poutcome_success | 0.023622 |
| 6 | cons.price.idx | 0.022525 |
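
`duration` dominates the ranking, but the data description notes that it is only known after a call ends and should be discarded for a realistic predictive model. A minimal sketch of dropping it before modelling is shown below on a made-up frame (`toy_df`, illustrative values only); the names `X_realistic` and `y_realistic` are hypothetical.

```python
import pandas as pd

# Toy stand-in for the encoded frame `df` (illustrative values only)
toy_df = pd.DataFrame({
    "age": [56, 57, 37],
    "duration": [261, 149, 226],
    "campaign": [1, 1, 1],
    "y": [0, 0, 1],
})

# Drop the target and the post-call `duration` column before modelling
X_realistic = toy_df.drop(columns=["y", "duration"])
y_realistic = toy_df["y"]
print(list(X_realistic.columns))
```

Scores obtained without `duration` will likely be lower than those reported in this notebook, but they better reflect what could be predicted before a call is made.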


In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Grid search for KNN
knn_params = {'n_neighbors': [3, 5, 7, 9]}
knn_grid = GridSearchCV(KNeighborsClassifier(), knn_params, cv=5, scoring='roc_auc')
knn_grid.fit(X_train, y_train)

print(f"Best KNN parameters: {knn_grid.best_params_}")
print(f"Best KNN ROC AUC: {knn_grid.best_score_:.4f}")

# Grid search for Decision Tree
dt_params = {'max_depth': [3, 5, 10, 15, None], 'min_samples_split': [2, 5, 10]}
dt_grid = GridSearchCV(DecisionTreeClassifier(random_state=42), dt_params, cv=5, scoring='roc_auc')
dt_grid.fit(X_train, y_train)

print(f"Best Decision Tree parameters: {dt_grid.best_params_}")
print(f"Best Decision Tree ROC AUC: {dt_grid.best_score_:.4f}")

Best KNN parameters: {'n_neighbors': 9}
Best KNN ROC AUC: 0.8409
Best Decision Tree parameters: {'max_depth': 5, 'min_samples_split': 5}
Best Decision Tree ROC AUC: 0.9245
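
Beyond `best_params_` and `best_score_`, a fitted `GridSearchCV` stores the score of every candidate in `cv_results_`. The sketch below turns that dictionary into a ranked table. It fits a small grid on synthetic data (`make_classification`) so that it runs on its own; `demo_grid` and `cv_table` are illustrative names, and the grid is deliberately smaller than the one above.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data standing in for the training set
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=10, weights=[0.89, 0.11], random_state=42
)

demo_grid = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    {"max_depth": [3, 5, None], "min_samples_split": [2, 10]},
    cv=5,
    scoring="roc_auc",
)
demo_grid.fit(X_demo, y_demo)

# One row per parameter combination, best candidates first
cv_table = (
    pd.DataFrame(demo_grid.cv_results_)
    [["params", "mean_test_score", "std_test_score", "rank_test_score"]]
    .sort_values("rank_test_score")
)
print(cv_table.to_string(index=False))
```

Looking at the standard deviation next to the mean helps judge whether the best setting is meaningfully better than the runner-up or within fold-to-fold noise.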


In [81]:
# Compare performance of best tuned models (e.g., KNN and Decision Tree)
import time  # import the module so earlier cells that call time.time() still work
from sklearn.metrics import accuracy_score

best_models = {
    "Tuned KNN": knn_grid.best_estimator_,
    "Tuned Decision Tree": dt_grid.best_estimator_
}

tuned_results = []

for name, model in best_models.items():
    # Refit only to measure fit time; GridSearchCV already refit the best estimator
    start = time.time()
    model.fit(X_train, y_train)
    end = time.time()

    train_time = end - start
    train_accuracy = accuracy_score(y_train, model.predict(X_train))
    test_accuracy = accuracy_score(y_test, model.predict(X_test))

    tuned_results.append({
        "Model": name,
        "Train Time (s)": round(train_time, 4),
        "Train Accuracy": round(train_accuracy, 4),
        "Test Accuracy": round(test_accuracy, 4)
    })

tuned_results_df = pd.DataFrame(tuned_results)
tuned_results_df

|   | Model | Train Time (s) | Train Accuracy | Test Accuracy |
| - | ----- | -------------- | -------------- | ------------- |
| 0 | Tuned KNN | 0.0092 | 0.9111 | 0.8992 |
| 1 | Tuned Decision Tree | 0.1611 | 0.9174 | 0.9162 |
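
To see what tuning bought us, the test accuracies of the default and tuned versions can be placed side by side. The sketch below copies the values from the two result tables in this notebook into a small frame; the name `comparison` and the `Improvement` column are illustrative.

```python
import pandas as pd

# Test accuracies copied from the default and tuned result tables
comparison = pd.DataFrame({
    "Model": ["K-Nearest Neighbors", "Decision Tree"],
    "Default Test Accuracy": [0.8977, 0.8899],
    "Tuned Test Accuracy": [0.8992, 0.9162],
})

# Change in test accuracy after grid search
comparison["Improvement"] = (
    comparison["Tuned Test Accuracy"] - comparison["Default Test Accuracy"]
).round(4)
print(comparison)
```

The Decision Tree gains the most from tuning, while KNN barely moves. Note that the grid search optimized ROC AUC rather than accuracy, so this comparison is only one view of the improvement.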
