# G04 Naïve Bayes & Linear Regression - Group 03

## Instructions
Apply k-fold cross-validation and the bootstrap method to a classification model based on the data set available at https://archive.ics.uci.edu/ml/datasets/Bank+Marketing

You need to identify whether a customer will subscribe to a term deposit or not.

You need not use all the variables to build the model. You can preprocess the data using Pandas as necessary.

Build a simple model and focus on interpretation and communication of your insights from the analysis.

## 01: Load the data
Initialize the project and download the data.

In [8]:
import pandas as pd
from sklearn.model_selection import cross_val_score, KFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.utils import resample
import numpy as np
import os
import requests
import zipfile
from io import BytesIO

# Create a temporary directory in the current working directory
tmp_dir = os.path.join(os.getcwd(), "tmp")
os.makedirs(tmp_dir, exist_ok=True)

# URL to download the zip file
url = 'https://archive.ics.uci.edu/static/public/222/bank+marketing.zip'

# Send an HTTP request to the URL of the zip file and fail early on a bad response
response = requests.get(url)
response.raise_for_status()

# Read the content of the file and create a BytesIO object from it
file_object = BytesIO(response.content)

# Create a zipfile object from the BytesIO
zipfile_object = zipfile.ZipFile(file_object)

# Extract the zip file
zipfile_object.extractall(tmp_dir)

# The archive contains two nested zip files; extract them one by one

# Unzip bank.zip
with zipfile.ZipFile(os.path.join(tmp_dir, "bank.zip"), 'r') as zip_ref:
    zip_ref.extractall(tmp_dir)

# Unzip bank-additional.zip
with zipfile.ZipFile(os.path.join(tmp_dir, "bank-additional.zip"), 'r') as zip_ref:
    zip_ref.extractall(tmp_dir)

# Read the data files
bank = pd.read_csv(os.path.join(tmp_dir, 'bank-full.csv'), sep=';')
bank_additional = pd.read_csv(os.path.join(tmp_dir, 'bank-additional', 'bank-additional-full.csv'), sep=';')


## 02: Data Preprocessing
Inspect, clean and preprocess the downloaded data to get it ready for the machine learning model.

In the following analysis, we will focus on the `bank` DataFrame, which is loaded from bank-full.csv.

In [12]:
# Preview the data
bank.head()

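|   | age | job | marital | education | default | balance | housing | loan | contact | day | month | duration | campaign | pdays | previous | poutcome | y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 58 | management | married | tertiary | no | 2143 | yes | no | unknown | 5 | may | 261 | 1 | -1 | 0 | unknown | no |
| 1 | 44 | technician | single | secondary | no | 29 | yes | no | unknown | 5 | may | 151 | 1 | -1 | 0 | unknown | no |
| 2 | 33 | entrepreneur | married | secondary | no | 2 | yes | yes | unknown | 5 | may | 76 | 1 | -1 | 0 | unknown | no |
| 3 | 47 | blue-collar | married | unknown | no | 1506 | yes | no | unknown | 5 | may | 92 | 1 | -1 | 0 | unknown | no |
| 4 | 33 | unknown | single | unknown | no | 1 | no | no | unknown | 5 | may | 198 | 1 | -1 | 0 | unknown | no |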


In [13]:
# Check the data types of the columns in the bank dataset.
bank.dtypes

age           int64
job          object
marital      object
education    object
default      object
balance       int64
housing      object
loan         object
contact      object
day           int64
month        object
duration      int64
campaign      int64
pdays         int64
previous      int64
poutcome     object
y            object
dtype: object

From the output, the column 'y' represents the target with 'yes' meaning the client subscribed to a term deposit and 'no' meaning they did not.

In [19]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

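# Function to preprocess the data
def preprocess_data(df):
    # Work on a copy so the caller's DataFrame is not modified in place
    df = df.copy()

    # Integer-encode every categorical (object) column, including the target 'y'
    categorical_cols = df.select_dtypes(include='object').columns.tolist()
    df[categorical_cols] = df[categorical_cols].apply(lambda col: LabelEncoder().fit_transform(col))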

    # Split the data into features 'X' and target 'y'
    X = df.drop('y', axis=1)
    y = df['y']

    # Split the data into training and testing sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    return X_train, X_test, y_train, y_test


# Preprocess the data
X_train_bank, X_test_bank, y_train_bank, y_test_bank = preprocess_data(bank)
X_train_bank_additional, X_test_bank_additional, y_train_bank_additional, y_test_bank_additional = preprocess_data(bank_additional)
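
`LabelEncoder` maps each category to an integer, which a linear model such as logistic regression treats as an ordered numeric value. One common alternative is one-hot encoding. The following is a minimal sketch on a tiny hand-made frame (the values and the names `toy`, `X_toy`, `y_toy` are illustrative assumptions, not part of the notebook):

```python
import pandas as pd

# Tiny hand-made frame that mimics a few columns of the bank data (illustrative values only)
toy = pd.DataFrame({
    "age": [58, 44, 33, 47],
    "job": ["management", "technician", "entrepreneur", "blue-collar"],
    "housing": ["yes", "yes", "yes", "no"],
    "y": ["no", "no", "yes", "no"],
})

# Encode the binary target explicitly as 0/1
y_toy = (toy["y"] == "yes").astype(int)

# One-hot encode the categorical features; drop_first removes one level per
# column so the dummy columns are not perfectly collinear
X_toy = pd.get_dummies(toy.drop(columns="y"), drop_first=True)

print(X_toy.columns.tolist())
```

Whether this changes the results on the bank data would have to be checked by rerunning the later cells with the one-hot features.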

## 03: Build a Classification Model

In this case, we use the Logistic Regression model from the scikit-learn library to build the classification model. Logistic regression is commonly used for binary classification problems like this one.

In [23]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Initialize the model
model_bank = LogisticRegression(max_iter=10000)


# Fit the model to the training data
model_bank.fit(X_train_bank, y_train_bank)


# Use the model to predict the test data
y_pred_bank = model_bank.predict(X_test_bank)


# Compute accuracy
acc_bank = accuracy_score(y_test_bank, y_pred_bank)

print(f"Accuracy for bank dataset: {acc_bank}")


Accuracy for bank dataset: 0.8854362490324007


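The accuracy of approximately 88.54% on the test data means that the model correctly predicts whether a client will subscribe to a term deposit about 88.54% of the time on this dataset. This looks like a good score, although accuracy should be read against the class balance of `y`: on imbalanced data, a model that always predicts the more frequent class can also score high.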
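
To put an accuracy figure in context, it can be compared with the accuracy of always predicting the most frequent label. The sketch below shows one way to compute that; the helper name `majority_baseline_accuracy` and the demo labels are assumptions made for illustration, not values from the bank data.

```python
import numpy as np

def majority_baseline_accuracy(y_train, y_test):
    """Accuracy obtained by always predicting the most frequent training label."""
    labels, counts = np.unique(y_train, return_counts=True)
    majority_label = labels[np.argmax(counts)]
    return float(np.mean(np.asarray(y_test) == majority_label))

# Illustrative labels only (0 = "no", 1 = "yes"), not the real bank data
y_train_demo = np.array([0] * 88 + [1] * 12)
y_test_demo = np.array([0] * 9 + [1] * 1)

baseline = majority_baseline_accuracy(y_train_demo, y_test_demo)
print(f"Majority-class baseline accuracy: {baseline:.2f}")
```

In the notebook, the same helper could be called as `majority_baseline_accuracy(y_train_bank, y_test_bank)` and compared with `acc_bank`.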

However, we could improve our model's performance with hyperparameter tuning.

In [33]:
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler

scaler_bank = StandardScaler()

# Fit the scaler to the training data and transform both the training and test data
X_train_bank_scaled = scaler_bank.fit_transform(X_train_bank)
X_test_bank_scaled = scaler_bank.transform(X_test_bank)

# Define the parameter grid
param_grid = {
    'C': [0.001, 0.01, 0.1, 1, 10, 100],
    'penalty': ['l2'],  # lbfgs solver supports only 'l2' penalty
}

# Initialize a logistic regression model
model = LogisticRegression(max_iter=10000)

# Initialize the GridSearchCV
grid_search = GridSearchCV(estimator=model, param_grid=param_grid, cv=5, scoring='accuracy', verbose=1)

# Fit the GridSearchCV to the scaled training data
grid_search.fit(X_train_bank_scaled, y_train_bank)

# Get the best parameters
best_params = grid_search.best_params_

print(f"Best parameters: {best_params}")

# Fit a logistic regression model with the best parameters to the scaled training data
best_model = LogisticRegression(max_iter=1000, C=best_params['C'], penalty=best_params['penalty'])
best_model.fit(X_train_bank_scaled, y_train_bank)

# Use the model with the best parameters to predict the scaled test data
y_pred_bank_best = best_model.predict(X_test_bank_scaled)

# Compute accuracy
acc_bank_best = accuracy_score(y_test_bank, y_pred_bank_best)

print(f"Accuracy for bank dataset with best model: {acc_bank_best}")

Fitting 5 folds for each of 6 candidates, totalling 30 fits
Best parameters: {'C': 0.1, 'penalty': 'l2'}
Accuracy for bank dataset with best model: 0.8880902355413026


The accuracy of our logistic regression model after hyperparameter tuning ('C': 0.1, 'penalty': 'l2') is 0.8881, slightly higher than the 0.8854 obtained initially with the default hyperparameters. The tuning step therefore improved test accuracy by about 0.3 percentage points. Note that this step also introduced feature scaling, so the gain cannot be attributed to `C` alone. We will keep using this model for the k-fold cross-validation and bootstrap methods.
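
One way to keep scaling and tuning together is to wrap both steps in a scikit-learn `Pipeline`, so that the scaler is refit inside every cross-validation fold. The following is a minimal sketch on synthetic data; the names `pipe`, `search`, `tuned_model` and the `make_classification` data are assumptions standing in for the bank features.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the bank data (assumption: any numeric feature matrix works here)
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Parameters of a pipeline step are addressed as <step name>__<parameter>
param_grid_demo = {"clf__C": [0.001, 0.01, 0.1, 1, 10, 100]}

search = GridSearchCV(pipe, param_grid_demo, cv=5, scoring="accuracy")
search.fit(X_tr, y_tr)

# best_estimator_ is already refit on the full training split (refit=True by default)
tuned_model = search.best_estimator_
test_accuracy = tuned_model.score(X_te, y_te)
print(search.best_params_, test_accuracy)
```

Because the fitted pipeline carries its own scaler, it can be passed unscaled features in later steps without a separate `transform` call.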

## 04: Apply k-fold Cross Validation

Use scikit-learn's cross_val_score function to apply k-fold cross-validation.

In [34]:
from sklearn.model_selection import cross_val_score

best_model.set_params(max_iter=10000)

# Compute cross-validation score
cv_scores_bank = cross_val_score(best_model, X_train_bank, y_train_bank, cv=10)

print(f"Cross-validation scores for bank dataset: {cv_scores_bank}")

# Compute mean and standard deviation of cross-validation scores
mean_cv_scores_bank = np.mean(cv_scores_bank)
std_cv_scores_bank = np.std(cv_scores_bank)

# Print mean and standard deviation of cross-validation scores
print(f"Mean cross-validation score for bank dataset: {mean_cv_scores_bank:.3f}")
print(f"Standard deviation of cross-validation score for bank dataset: {std_cv_scores_bank:.3f}")

Cross-validation scores for bank dataset: [0.890517   0.8885817  0.88747581 0.890517   0.88941111 0.89245231
 0.890517   0.89632292 0.88938053 0.89712389]
Mean cross-validation score for bank dataset: 0.891
Standard deviation of cross-validation score for bank dataset: 0.003


The logistic regression model performed consistently across the partitions of the training set, with an average accuracy of 0.891. The standard deviation of 0.003 indicates very low variability across the folds of the 10-fold cross-validation. This suggests that the model's accuracy does not depend strongly on which part of the training data it sees, which is encouraging for generalization to unseen data. As a rough guide, one standard deviation around the mean spans 0.888 to 0.894, although this is a spread of fold scores and not a formal confidence interval. Note also that these folds use the unscaled `X_train_bank`, whereas `C` was tuned on scaled features. With those caveats, the model appears to be a reliable tool for predicting whether a client will subscribe to a term deposit.
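
To make the procedure behind the fold scores explicit, here is a hand-written k-fold loop on synthetic data. It is a sketch, not a reproduction of the cell above: it uses a shuffled `KFold`, whereas `cross_val_score` with an integer `cv` and a classifier uses stratified folds, and the data and variable names are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold

# Synthetic stand-in for the training set
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=0)

kf = KFold(n_splits=10, shuffle=True, random_state=0)
fold_scores = []
for train_idx, val_idx in kf.split(X_demo):
    # Fit a fresh model on nine folds and score it on the held-out fold
    model = LogisticRegression(max_iter=1000)
    model.fit(X_demo[train_idx], y_demo[train_idx])
    fold_scores.append(accuracy_score(y_demo[val_idx], model.predict(X_demo[val_idx])))

fold_scores = np.array(fold_scores)
print(f"Mean accuracy: {fold_scores.mean():.3f}, std: {fold_scores.std():.3f}")
```

Each observation is used for validation exactly once, which is why the mean of the fold scores uses all of the training data for evaluation.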

## 05: Bootstrap method

In [35]:
from sklearn.utils import resample
from sklearn.metrics import accuracy_score

# Initialize a list to store all bootstrap sample scores
bootstrap_scores = []

# Number of bootstrap samples to create
n_bootstrap = 100

# For each bootstrap sample
for i in range(n_bootstrap):
    print(f"Bootstrap sample {i+1}/{n_bootstrap}...")
    # Create a bootstrap sample from the training set
    X_train_boot, y_train_boot = resample(X_train_bank, y_train_bank, replace=True)
    
    # Fit the model on the bootstrap sample
    best_model.fit(X_train_boot, y_train_boot)
    
    # Make predictions on the test set
    y_pred_boot = best_model.predict(X_test_bank)
    
    # Compute accuracy and append to the list of scores
    score = accuracy_score(y_test_bank, y_pred_boot)
    bootstrap_scores.append(score)

# Compute the mean and standard deviation of the bootstrap scores
mean_bootstrap_score = np.mean(bootstrap_scores)
std_bootstrap_score = np.std(bootstrap_scores)

print(f"Mean bootstrap score for bank dataset: {mean_bootstrap_score}")
print(f"Standard deviation of bootstrap score for bank dataset: {std_bootstrap_score}")

Bootstrap sample 1/100...
Bootstrap sample 2/100...
Bootstrap sample 3/100...
...

The bootstrap method was applied to assess the stability and performance of our logistic regression model on the bank dataset. The mean bootstrap score, that is, the average test-set accuracy of the model refit on each of the 100 bootstrap samples of the training set, was approximately 0.886. On average, a model trained on a bootstrap sample therefore classifies about 88.6% of the test instances correctly.

The standard deviation of the bootstrap scores, a measure of their dispersion, was approximately 0.00114. This low value indicates that the performance of our model is consistent across different bootstrap samples. In other words, the model's accuracy changes very little when it is trained on different resamples of the training set. Because every model is scored on the same fixed test set, this spread reflects sensitivity to the training data only.

These results, collectively, suggest that our logistic regression model is stable and robust in its performance, showing consistent high accuracy across different bootstrap samples of the bank dataset.
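
The mean and standard deviation can be complemented with a percentile interval over the bootstrap scores. The sketch below assumes a helper named `percentile_interval` and uses simulated scores for illustration; in the notebook, the list `bootstrap_scores` would be passed instead.

```python
import numpy as np

def percentile_interval(scores, confidence=0.95):
    """Percentile interval covering the central `confidence` share of the scores."""
    alpha = (1.0 - confidence) / 2.0
    lower, upper = np.quantile(scores, [alpha, 1.0 - alpha])
    return float(lower), float(upper)

# Simulated scores for illustration only, not the notebook's results
rng = np.random.default_rng(0)
demo_scores = rng.normal(loc=0.886, scale=0.001, size=100)

low, high = percentile_interval(demo_scores)
print(f"95% percentile interval: [{low:.4f}, {high:.4f}]")
```

Such an interval describes how much the test accuracy moves under resampling of the training set; it does not account for the uncertainty that comes from the test set itself.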

## 06: Interpretation and communication of your insights

Based on the evaluations and tests performed on the `bank` data set (**bank-full.csv**), we can conclude the following:

1. **Performance of the Model:** Our logistic regression model performs quite well, with an accuracy of approximately 0.891 in 10-fold cross-validation and 0.886 in the bootstrap analysis. The model therefore classifies roughly 89% of the instances correctly.
2. **Comparison to Baseline:** Our logistic regression model performs far better than uniform random guessing, which would be about 50% accurate on a binary target. Because most clients in the data do not subscribe, a stricter baseline is to always predict 'no'. Its accuracy equals the share of 'no' labels, and the model's advantage should be judged against that figure as well.
3. **Important Features:** We did not extract feature importance from the logistic regression model, so we cannot yet say which features are most influential in predicting customer behavior. Typically, in a banking scenario, features like age, job type, balance, housing, loan, contact, month, day of the week, duration of contact, and number of contacts performed during the campaign can be significant predictors. A separate analysis could extract feature importance from the model, for example from its coefficients. In the future, we could also run the logistic regression model on the "bank-additional" data set, which contains more attributes and may produce different results.
4. **Insights about Customers:** Without further analysis, it is difficult to characterize the customers who are likely to subscribe to a term deposit. A deeper dive into the coefficients of the logistic regression model could show which features increase the odds of a customer subscribing. Moreover, combining this model with an exploratory data analysis could reveal further insights, for example whether certain job types or age groups are more inclined to subscribe, or whether customers contacted during specific months have a higher likelihood of subscription.
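
The coefficient analysis mentioned in the list above could look roughly like the following sketch. It uses synthetic data with made-up feature names (`f0` to `f3`), so the numbers say nothing about the bank customers; with standardized features, the exponentiated coefficient is the multiplicative change in the odds for a one standard deviation increase in that feature.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data with made-up feature names (assumption: stands in for the scaled bank features)
feature_names = ["f0", "f1", "f2", "f3"]
X_demo, y_demo = make_classification(
    n_samples=400, n_features=4, n_informative=2, n_redundant=0, random_state=1
)
X_scaled = StandardScaler().fit_transform(X_demo)

clf = LogisticRegression(C=0.1, max_iter=1000).fit(X_scaled, y_demo)

# One row per feature, sorted by the absolute size of the coefficient
coef_table = pd.DataFrame({
    "feature": feature_names,
    "coefficient": clf.coef_[0],
    "odds_ratio": np.exp(clf.coef_[0]),
}).sort_values("coefficient", key=np.abs, ascending=False)

print(coef_table)
```

In the notebook, `best_model.coef_[0]` and `X_train_bank.columns` could fill the same table, provided the model was last fit on scaled features so that the coefficients are comparable.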

In conclusion, the logistic regression model built has shown robust and reliable performance in predicting whether a bank's clients will subscribe to a term deposit or not. Although the feature importance is not directly interpretable from the model, further analysis could provide deeper insights that would be valuable for the bank's marketing strategy and customer understanding.

