---
title: 'Lab01: Portuguese Bank Marketing Data'
subtitle: "MSDS 7331"
authors: "Anthony Burton-Cordova, Will Jones, Nick Sager"
date: September 24, 2023
jupyter: python3
---

## Introduction

For an introduction to the data, the business understanding, and an explanation of the dataset, please see [Lab01](Lab01.ipynb), which contains the exploratory data analysis (EDA) from Lab 01. This notebook focuses on modeling the data.

## Rubric
### Reference only - delete before submitting

| Category                 | Available | Requirements |
|--------------------------|-----------|--------------|
| Total Points             | 100       | Total             |
| Create Models            | 50        | Create a logistic regression model and a support vector machine model for the classification task involved with your dataset. Assess how well each model performs (use 80/20 training/testing split for your data). Adjust parameters of the models to make them more accurate. If your dataset size requires the use of stochastic gradient descent, then linear kernel only is fine to use. That is, the SGDClassifier is fine to use for optimizing logistic regression and linear support vector machines. For many problems, SGD will be required in order to train the SVM model in a reasonable timeframe. |
| Model Advantages         | 10        | Discuss the advantages of each model for each classification task. Does one type of model offer superior performance over another in terms of prediction accuracy? In terms of training time or efficiency? Explain in detail. |
| Interpret Feature Importance | 30    | Use the weights from logistic regression to interpret the importance of different features for the classification task. Explain your interpretation in detail. Why do you think some variables are more important? |
| Interpret Support Vectors   | 10     | Look at the chosen support vectors for the classification task. Do these provide any insight into the data? Explain. If you used stochastic gradient descent (and therefore did not explicitly solve for support vectors), try subsampling your data to train the SVC model— then analyze the support vectors from the subsampled dataset. |

## Import and Process Data

The following code chunks are explained in more detail in [Lab01](Lab01.ipynb).

In [1]:
import pandas as pd

# Choose File
# RawBank = "https://raw.githubusercontent.com/NickSager/DS7331_Projects/main/data/bank-additional-full.csv"
RawBank = "data/bank-additional-full.csv"

# Read the CSV file with a semicolon ; separator
bank = pd.read_csv(RawBank, sep=';')

# Get info on the dataset
# print(bank.info())
# bank.describe()

In [2]:
import numpy as np

# Set placeholder values to NaN so that pandas treats them as missing
df = bank.copy() # make a copy of the dataframe
df = df.replace(to_replace = 'unknown', value = np.nan) # replace 'unknown' with NaN (not a number)
df = df.replace(to_replace = 999, value = np.nan) # replace 999 with NaN in every column (this also hits genuine 999 values, e.g. in duration, which a later cell restores)
df = df.replace(to_replace = 'nonexistent', value = np.nan) # replace 'nonexistent' with NaN (not a number)

# print (df.info())
# df.describe() # scroll over to see the values

# From course material "01. Pandas.ipynb"
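
`DataFrame.replace` applies to every column, so a blanket replacement of 999 also converts any genuine 999 values in other numeric columns to NaN. As a minimal sketch, assuming 999 is a "missing" code only in `pdays`, the replacement can be scoped to the columns where the placeholder actually means missing. The toy frame below is hypothetical and only illustrates the idea.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: 999 is a placeholder in `pdays` but a real value in `duration`
toy = pd.DataFrame({
    "pdays": [999, 3, 999, 6],
    "duration": [120, 999, 45, 300],
    "job": ["admin.", "unknown", "technician", "unknown"],
})

cleaned = toy.copy()
# Replace each placeholder only in the column where it means "missing"
cleaned["pdays"] = cleaned["pdays"].replace(999, np.nan)
cleaned["job"] = cleaned["job"].replace("unknown", np.nan)

# Count missing values per column; `duration` is left untouched
print(cleaned.isna().sum())
```

With this approach, no later step is needed to put legitimate 999 values back.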

In [3]:
# Change NA Categoricals to 'unknown'
df['job'] = df['job'].fillna('unknown')
df['marital'] = df['marital'].fillna('unknown')
df['education'] = df['education'].fillna('unknown')

# Change NA Credit history values to 'no'
df['default'] = df['default'].fillna('no')
df['housing'] = df['housing'].fillna('no')
df['loan'] = df['loan'].fillna('no')

# Change NA Previous Outcome to 'not contacted'
df['poutcome'] = df['poutcome'].fillna('not contacted')

# Change NA pdays to the mean
df['pdays'] = df['pdays'].fillna(df['pdays'].mean())

# Restore duration values of 999 that the blanket replace above turned into NaN
df['duration'] = df['duration'].fillna(999)

# Bin the age variable into three labeled ranges (bins are right-inclusive by default)
df['age_range'] = pd.cut(df['age'], bins=[0, 40, 60, 1e6], labels=['Young', 'Middle-Age', 'Old']) # this creates a new variable
df.age_range.describe()

# print(df.info())

count     41188
unique        3
top       Young
freq      23768
Name: age_range, dtype: object
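
Because `pd.cut` uses right-inclusive bins by default, the boundary ages 40 and 60 fall into the lower range. The following small sketch uses made-up ages (not taken from the dataset) and an illustrative upper edge of 120 to show where the boundaries land.

```python
import pandas as pd

# Made-up ages chosen to sit on and around the bin edges
ages = pd.Series([17, 40, 41, 60, 61, 98])

# Default right=True gives the intervals (0, 40], (40, 60], (60, 120]
age_range = pd.cut(ages, bins=[0, 40, 60, 120], labels=["Young", "Middle-Age", "Old"])

print(age_range.tolist())
# ['Young', 'Young', 'Middle-Age', 'Middle-Age', 'Old', 'Old']
```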

In [4]:
# Convert all features to numeric using dummy variables
df = pd.get_dummies(df, columns=['job', 'marital', 'education', 'default', 'housing', 'loan', 'contact', 'month', 'day_of_week', 'poutcome', 'age_range'], drop_first=True)

print(df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 41188 entries, 0 to 41187
Data columns (total 53 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   age                            41188 non-null  int64  
 1   duration                       41188 non-null  float64
 2   campaign                       41188 non-null  int64  
 3   pdays                          41188 non-null  float64
 4   previous                       41188 non-null  int64  
 5   emp.var.rate                   41188 non-null  float64
 6   cons.price.idx                 41188 non-null  float64
 7   cons.conf.idx                  41188 non-null  float64
 8   euribor3m                      41188 non-null  float64
 9   nr.employed                    41188 non-null  float64
 10  y                              41188 non-null  object 
 11  job_blue-collar                41188 non-null  uint8  
 12  job_entrepreneur               41188 non-null 

## Create Models

In this section, we create Logistic Regression and Support Vector Machine (SVM) models to classify whether a customer will subscribe to a term deposit. The data is split 80/20 into training and holdout (test) sets. ROC AUC is the intended evaluation metric, although the baseline results reported below use accuracy.

The code in this section is adapted from the course material in the notebook '04. Logits and SVM.ipynb'.

First, we will change the data into a format that Scikit-Learn can use. We will also split the data into training and validation sets.

In [8]:
from sklearn.model_selection import train_test_split

# Consider deleting duration for practicality
# if 'duration' in df: del df['duration']

# Separate the feature matrix X from the target y
X = df.drop(columns=['y']).values
y = df['y'].values
    
# Split into training and test sets
X_train_holdout, X_test_holdout, y_train_holdout, y_test_holdout = train_test_split(
    X, y, test_size=0.2, random_state=137, stratify=y)


The 'holdout' splits will be used to evaluate the final model on unseen data. We will additionally use cross-validation to tune the hyperparameters of the models.
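
As a minimal sketch of this setup, the example below holds out 20% of the data and runs stratified 5-fold cross-validation on the remaining 80%, scoring each fold with ROC AUC. It uses a synthetic, imbalanced dataset from `make_classification` as a stand-in for the bank data; the names `X_demo`, `y_demo`, and `demo_pipe` and the 90/10 class weights are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with a minority positive class (illustrative only)
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=20, weights=[0.9, 0.1], random_state=137)

# 80/20 stratified split; the 20% is kept aside as the holdout set
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=137, stratify=y_demo)

demo_pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", LogisticRegression(max_iter=1000)),
])

# Cross-validate on the training portion only, scoring by ROC AUC
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
auc_scores = cross_val_score(demo_pipe, X_tr, y_tr, cv=cv, scoring="roc_auc")

print(f"CV ROC AUC: {auc_scores.mean():.4f} (+/- {auc_scores.std():.4f})")
```

Passing the whole pipeline to `cross_val_score` refits the scaler inside each fold, so no information from a validation fold leaks into the scaling step.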

Next, we define the pipeline steps that transform the data and fit the models. A StandardScaler standardizes the features, and PCA can optionally reduce their dimensionality. The models are LogisticRegression and SVC; the pipeline below starts with LogisticRegression and does not yet include a PCA step.

In [9]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Define the model
model = LogisticRegression()

# Define the pipeline
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', model)
])
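
For comparison, a pipeline that also includes the PCA step and an SVC classifier could look like the sketch below. It runs on synthetic data, and the step names, the 95% variance threshold, and the linear kernel are illustrative assumptions rather than tuned choices.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (illustrative only)
X_demo, y_demo = make_classification(n_samples=500, n_features=20, random_state=137)

svm_pipe = Pipeline([
    ("scaler", StandardScaler()),          # standardize features before PCA and SVM
    ("pca", PCA(n_components=0.95)),       # keep enough components to explain 95% of variance
    ("classifier", SVC(kernel="linear")),  # linear-kernel support vector classifier
])
svm_pipe.fit(X_demo, y_demo)

n_kept = svm_pipe.named_steps["pca"].n_components_
train_acc = svm_pipe.score(X_demo, y_demo)
print(f"Components kept: {n_kept}, training accuracy: {train_acc:.4f}")
```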

In [10]:
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold

# Lists to store metrics for each fold
accuracies = []

# Define the cross validation method on training holdout
skf = StratifiedKFold(n_splits=5, random_state=42, shuffle=True)

for train_index, val_index in skf.split(X_train_holdout, y_train_holdout):
    # Splitting the data
    X_train, X_val = X_train_holdout[train_index], X_train_holdout[val_index]
    y_train, y_val = y_train_holdout[train_index], y_train_holdout[val_index]
    
    # Train the model on the training data
    pipe.fit(X_train, y_train)
    
    # Predict on the validation fold
    y_pred = pipe.predict(X_val)

    # Calculate accuracy for this fold
    acc = accuracy_score(y_val, y_pred)
    accuracies.append(acc)
    
    # Optionally, print the accuracy for each fold
    print(f"Accuracy for fold: {len(accuracies)} is {acc:.4f}")

# Calculate mean and std deviation of the accuracies
mean_acc = np.mean(accuracies)
std_acc = np.std(accuracies)

print(f"\nOverall accuracy: {mean_acc:.4f} (+/- {std_acc:.4f})")


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Accuracy for fold: 1 is 0.9090
Accuracy for fold: 2 is 0.9123


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Accuracy for fold: 3 is 0.9073
Accuracy for fold: 4 is 0.9149
Accuracy for fold: 5 is 0.9109

Overall accuracy: 0.9109 (+/- 0.0026)
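
The warnings in the output above indicate that the solver stopped at its iteration limit on some folds before converging. `LogisticRegression` defaults to `max_iter=100`, and one common remedy, as the warning itself suggests, is to raise that limit. The sketch below shows the change on synthetic data; the value 1000 is an illustrative assumption, not a tuned setting.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (illustrative only)
X_demo, y_demo = make_classification(n_samples=1000, n_features=20, random_state=137)

pipe_more_iter = Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", LogisticRegression(max_iter=1000)),  # default max_iter is 100
])
pipe_more_iter.fit(X_demo, y_demo)

# n_iter_ records how many iterations the solver actually used
clf = pipe_more_iter.named_steps["classifier"]
print(f"Iterations used: {clf.n_iter_[0]} of {clf.max_iter}")
```

If the iterations used stay below `max_iter`, the solver converged and the warning should no longer appear.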


Next, we will evaluate the final model on the holdout set.

In [11]:
pipe.fit(X_train_holdout, y_train_holdout)
final_performance = pipe.score(X_test_holdout, y_test_holdout)
print(f"Final Model Performance on Holdout Set: {final_performance:.4f}")


Final Model Performance on Holdout Set: 0.9090
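
`pipe.score` reports accuracy, which can look strong on an imbalanced target even when the model adds little over always predicting the majority class. A hedged sketch of a fuller holdout check follows: it compares accuracy against a majority-class baseline and also reports ROC AUC. The data is synthetic, and the 90/10 class weights are an assumption made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in data (illustrative only)
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=20, weights=[0.9, 0.1], random_state=137)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=137, stratify=y_demo)

# Baseline that always predicts the most frequent class
baseline = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)

demo_model = Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", LogisticRegression(max_iter=1000)),
]).fit(X_tr, y_tr)

baseline_acc = baseline.score(X_te, y_te)
model_acc = demo_model.score(X_te, y_te)
# ROC AUC needs scores for the positive class, not hard predictions
model_auc = roc_auc_score(y_te, demo_model.predict_proba(X_te)[:, 1])

print(f"Baseline accuracy: {baseline_acc:.4f}")
print(f"Model accuracy:    {model_acc:.4f}")
print(f"Model ROC AUC:     {model_auc:.4f}")
```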


## Model Advantages

## Interpret Feature Importance
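
One possible way to read feature importance from the logistic regression weights is sketched below. Because a StandardScaler precedes the classifier, the coefficients are on a standardized scale, so their magnitudes can be compared across features. The example uses synthetic data, and the feature names are hypothetical placeholders, not columns from the bank dataset.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with placeholder feature names (illustrative only)
X_demo, y_demo = make_classification(n_samples=1000, n_features=8, random_state=137)
feature_names = [f"feature_{i}" for i in range(X_demo.shape[1])]

weights_pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", LogisticRegression(max_iter=1000)),
]).fit(X_demo, y_demo)

# coef_ has shape (1, n_features) for a binary problem
coefs = pd.Series(weights_pipe.named_steps["classifier"].coef_[0], index=feature_names)

# Rank features by absolute weight, keeping the sign for interpretation
ranked = coefs.reindex(coefs.abs().sort_values(ascending=False).index)
print(ranked)
```

A positive weight raises the predicted log-odds of the positive class as the standardized feature increases; a negative weight lowers it.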



## Interpret Support Vectors
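
Following the rubric's suggestion to subsample the data before training an SVC, the sketch below fits a linear-kernel SVC on a random subsample and inspects its support vectors. The data is synthetic, and the subsample size of 1000 and `C=1.0` are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (illustrative only)
X_demo, y_demo = make_classification(n_samples=5000, n_features=10, random_state=137)

# Draw a random subsample without replacement to keep SVC training fast
rng = np.random.default_rng(137)
idx = rng.choice(len(X_demo), size=1000, replace=False)
X_sub = StandardScaler().fit_transform(X_demo[idx])
y_sub = y_demo[idx]

svc = SVC(kernel="linear", C=1.0).fit(X_sub, y_sub)

# support_ holds row indices into X_sub; n_support_ counts support vectors per class
print(f"Support vector matrix shape: {svc.support_vectors_.shape}")
print(f"Support vectors per class:   {svc.n_support_}")
print(f"Share of subsample:          {len(svc.support_) / len(X_sub):.2%}")
```

Indexing the subsampled rows with `svc.support_` recovers the original records, which can then be compared against the full subsample to see which observations sit near the decision boundary.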



## References

@article{Moro2014ADA,
  title={A data-driven approach to predict the success of bank telemarketing},
  author={S{\'e}rgio Moro and P. Cortez and Paulo Rita},
  journal={Decis. Support Syst.},
  year={2014},
  volume={62},
  pages={22-31},
  url={https://api.semanticscholar.org/CorpusID:14181100}
}