# Bank Marketing with Machine Learning

## Introduction

This project aims to use machine learning algorithms to solve a classification problem in data science. Specifically, we will be using a dataset related to direct marketing campaigns of a Portuguese banking institution to predict whether a client will subscribe or not to a term deposit. The dataset contains information on client demographics, previous marketing interactions, and economic indicators.

We will explore various machine learning algorithms, including logistic regression, decision trees, and random forests, to build models and evaluate their performance using appropriate metrics. Through this project, we hope to gain practical experience in data science while addressing a real-world problem.

## Dataset

| Feature    | Description |     
| --- | :--- |
| Age        | Age of the bank client (numeric)                                                                       |
| Job        | Type of job (categorical: 'admin.', 'blue-collar', 'entrepreneur', 'housemaid', 'management', 'retired', 'self-employed', 'services', 'student', 'technician', 'unemployed', 'unknown') |
| Marital    | Marital status (categorical: 'divorced', 'married', 'single'; note: 'divorced' means divorced or widowed) |
| Education  | Education level (categorical: 'primary', 'secondary', 'tertiary', 'unknown')                             |
| Default    | Has credit in default? (categorical: 'no', 'yes')                                                      |
| Balance    | Account balance of the client (numeric)                                                                |
| Housing    | Has housing loan? (categorical: 'no', 'yes')                                                           |
| Loan       | Has personal loan? (categorical: 'no', 'yes')                                                           |
| Contact    | Contact communication type (categorical: 'cellular', 'telephone', 'unknown')                           |
| Day        | Last contact day of the month (numeric)                                                                 |
| Month      | Last contact month of year (categorical: 'jan', 'feb', 'mar', ..., 'nov', 'dec')                        |
| Duration   | Last contact duration, in seconds (numeric)                                                            |
| Campaign   | Number of contacts performed during this campaign and for this client (numeric, includes last contact) |
| Pdays      | Number of days that passed by after the client was last contacted from a previous campaign (numeric; -1 means client was not previously contacted) |
| Previous   | Number of contacts performed before this campaign and for this client (numeric)                       |
| Poutcome   | Outcome of the previous marketing campaign (categorical: 'failure', 'success', 'other', 'unknown')    |
| Deposit    | Desired target. Has the client subscribed to a term deposit? (binary: 'yes', 'no')                                     |

## Project Definition

The classification goal is to predict if a client will subscribe to the bank term deposit (yes/no).


# Import Base Libraries

Import the relevant libraries: NumPy, Pandas, Matplotlib, and Seaborn. We will also use a Jupyter magic command `%matplotlib notebook` to enable interactive plots.

In [1]:
# import relevant libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style='whitegrid')
%matplotlib notebook

# Logging

We define a custom logging system using an enum and a class. The enum defines different logging levels with associated emojis, while the class provides a way to log messages with a specified level. We also define a `Logger` object with a default logging level of `LogLevel.INFO`. This code can be used to log messages in a more expressive and customizable way than the built-in `print()` function.
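In the `Logger.log` method of the cell that follows, the threshold check compares `level.value` with `self.level.value`. Because those values are ANSI-colored strings, the comparison is lexicographic rather than by severity. A minimal sketch of one alternative, assuming a hypothetical `SEVERITY` mapping and a simplified stand-in enum (neither is part of the original code), ranks the levels numerically:

```python
from enum import Enum

class Level(Enum):
    # Simplified stand-in for the notebook's LogLevel enum (labels without colors)
    ERROR = "ERROR"
    WARNING = "WARNING"
    INFO = "INFO"
    DEBUG = "DEBUG"

# Hypothetical ranking: a lower number means a more severe level
SEVERITY = {Level.ERROR: 0, Level.WARNING: 1, Level.INFO: 2, Level.DEBUG: 3}

def should_log(message_level, logger_level):
    """Return True when the message is at least as severe as the logger threshold."""
    return SEVERITY[message_level] <= SEVERITY[logger_level]

print(should_log(Level.ERROR, Level.INFO))  # True
print(should_log(Level.DEBUG, Level.INFO))  # False
```

Where a level such as `SUCCESS` belongs in this ranking is a design decision the original code does not settle, so it is left out of the sketch.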

In [2]:
from enum import Enum

def bold(text):
    return "\033[1m" + text + "\033[0m"

class LogLevel(Enum):
    ERROR = "\033[91m🔴 ERROR\033[0m"
    WARNING = "\033[93m⚠️ WARNING\033[0m"
    INFO = "\033[94m🟡 INFO\033[0m"
    DEBUG = "\033[34m🔵 DEBUG\033[0m"
    SUCCESS = "\033[32m🟢 SUCCESS\033[0m"


class Logger:
    def __init__(self, level=LogLevel.INFO):
        self.level = level

    def log(self, level, message, **kwargs):
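        # Note: the enum values are ANSI-colored strings, so this comparison is
        # lexicographic on those strings rather than an ordering by severity
        if level.value <= self.level.value: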
            nl = kwargs.get('nl', False)
            if nl:
                print()
            print(f"{level.value}: {bold(message)}")

    def error(self, message, **kwargs):
        self.log(LogLevel.ERROR, message, **kwargs)

    def warning(self, message, **kwargs):
        self.log(LogLevel.WARNING, message, **kwargs)

    def info(self, message, **kwargs):
        self.log(LogLevel.INFO, message, **kwargs)

    def debug(self, message, **kwargs):
        self.log(LogLevel.DEBUG, message, **kwargs)

    def success(self, message, **kwargs):
        self.log(LogLevel.SUCCESS, message, **kwargs)
        
logger = Logger(level=LogLevel.INFO)

# Preprocessing Presets

We define a set of preprocessing functions for cleaning and transforming data before it is used for machine learning tasks. First, `split_data` drops rows with missing values and splits a dataset into input features and an output variable; the `target_col` parameter specifies which column is the output, and the last column is used when it is omitted. Next, several encoding functions (`label_encode`, `one_hot_encode`, and `ordinal_encode`) transform categorical data into a numerical format. The scaling functions `minmax_scale`, `standard_scale`, and `robust_scale` rescale numeric data; `standard_scale`, for example, standardizes each column to zero mean and unit variance. Finally, `full_clean` applies the full sequence of encoding, feature engineering, and scaling steps to the bank dataset.

We will use this code as a toolset to prepare and clean data for various machine learning tasks.
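To make the standardization step concrete, here is a minimal sketch of the formula `z = (x - mean) / std` in plain Python. It uses the population standard deviation, which is what scikit-learn's `StandardScaler` uses; the `standardize` helper and the sample values are illustrative assumptions, not part of the notebook.

```python
import statistics

def standardize(values):
    """Scale values to zero mean and unit variance (population standard deviation)."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]

balances = [550.0, -54.0, 6026.0, 1200.0, 0.0]  # illustrative values, not taken from bank.csv
scaled = standardize(balances)

# After scaling, the mean is (numerically) 0 and the standard deviation is 1
print(round(statistics.fmean(scaled), 6), round(statistics.pstdev(scaled), 6))
```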

In [3]:
# import relevant libraries
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, OrdinalEncoder, StandardScaler, MinMaxScaler, RobustScaler

# preprocessing methods
def split_data(data, target_col=None):
    data = data.dropna()
    if target_col is None:
        X = data.iloc[:, :-1]
        y = data.iloc[:, -1]
    else:
        X = data.drop(target_col, axis=1)
        y = data[target_col]
    return X, y


def label_encode(target, mode='sklearn'):
    if mode == 'sklearn':
        return LabelEncoder().fit_transform(target)

def one_hot_encode(X, mode='pandas', **kwargs):
    if mode == 'pandas':
        return pd.get_dummies(X, **kwargs)
    if mode == 'sklearn':
        # keyword arguments configure the encoder; fit_transform only takes the data
        return OneHotEncoder(**kwargs).fit_transform(X)

def ordinal_encode(X, mode='sklearn', **kwargs):
    if mode == 'sklearn':
        return OrdinalEncoder(**kwargs).fit_transform(X)

def minmax_scale(X, mode='sklearn', **kwargs):
    if mode == 'sklearn':
        return MinMaxScaler(**kwargs).fit_transform(X)

def standard_scale(X, mode='sklearn', **kwargs):
    if mode == 'sklearn':
        return StandardScaler(**kwargs).fit_transform(X)

def robust_scale(X, mode='sklearn', **kwargs):
    if mode == 'sklearn':
        return RobustScaler(**kwargs).fit_transform(X)
    
def full_clean(df):
    df.drop_duplicates(inplace=True)
    
    df['is_default'] = df['default'].apply(lambda row: 1 if row == 'yes' else 0)
    df['is_housing'] = df['housing'].apply(lambda row: 1 if row == 'yes' else 0)
    df['is_loan'] = df['loan'].apply(lambda row: 1 if row == 'yes' else 0)
    df['target'] = df['deposit'].apply(lambda row: 1 if row == 'yes' else 0)

    marital_dummies = one_hot_encode(df['marital'], prefix='marital', dtype='int')
    marital_dummies.drop('marital_divorced', axis=1, inplace=True)
    df = pd.concat([df, marital_dummies], axis=1)

    job_dummies = one_hot_encode(df['job'], prefix='job', dtype='int')
    job_dummies.drop('job_unknown', axis=1, inplace=True)
    df = pd.concat([df, job_dummies], axis=1)

    education_dummies = one_hot_encode(df['education'], prefix='education', dtype='int')
    education_dummies.drop('education_unknown', axis=1, inplace=True)
    df = pd.concat([df, education_dummies], axis=1)

    contact_dummies = one_hot_encode(df['contact'], prefix='contact', dtype='int')
    contact_dummies.drop('contact_unknown', axis=1, inplace=True)
    df = pd.concat([df, contact_dummies], axis=1)

    poutcome_dummies = one_hot_encode(df['poutcome'], prefix='poutcome', dtype='int')
    poutcome_dummies.drop('poutcome_unknown', axis=1, inplace=True)
    df = pd.concat([df, poutcome_dummies], axis=1)

    months = {'jan': 1, 'feb': 2, 'mar': 3, 'apr': 4, 'may': 5, 'jun': 6, 'jul': 7, 'aug': 8, 'sep': 9, 'oct': 10,
              'nov': 11, 'dec': 12}
    df['month'] = df['month'].map(months)

    df.drop(['job', 'education', 'marital', 'default', 'housing', 'loan', 'contact', 'poutcome', 'deposit'],
            axis=1, inplace=True)
    
    numerical_cols = ['balance', 'duration', 'campaign', 'pdays', 'previous', 'age', 'month']
    df[numerical_cols] = standard_scale(df[numerical_cols])
    
    return df
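
In `full_clean`, each multi-valued categorical feature is one-hot encoded and one indicator column is then dropped (the `unknown` category where one exists), so that the dropped category becomes the implicit reference. The following sketch reproduces that pattern on a toy `contact` column; the sample values are made up for illustration.

```python
import pandas as pd

# Toy values for illustration only
contact = pd.Series(['cellular', 'unknown', 'telephone', 'cellular'], name='contact')

# One indicator column per category, named with the given prefix
dummies = pd.get_dummies(contact, prefix='contact', dtype='int')

# Drop the 'unknown' indicator so it becomes the implicit reference category
dummies = dummies.drop(columns='contact_unknown')
print(dummies)
```

A row whose original value was `unknown` is then encoded as all zeros across the remaining indicator columns.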

# Read Data for Data Exploration

In this section, we load the dataset so that we can explore it. The goal is to gain insights into the data and to understand which transformations will make it suitable for machine learning modeling later on.

In [4]:
from sklearn.model_selection import train_test_split

data = pd.read_csv('bank.csv')

# Data Exploration

### Overview of the Dataset

The dataset contains a total of 11162 samples.

There are 7 numeric columns and 10 object columns. Of the 10 object columns, 4 have only 2 unique values, so we can treat them as binary columns. In total, we have 7 numeric columns, 4 binary columns, and 6 categorical columns.

In [5]:
print('%s: %s\n' % (bold('Number of samples'), len(data)))

print(bold('Dataset info'))
print(data.info(), '\n')

print(bold('Dataset preview'))
print(data.head(), '\n')

# split the data
X, y = split_data(data)

# get the feature labels
features = X.columns

# use simple ordinal and label encode so we can maintain dataset structure
X = ordinal_encode(X)
y = label_encode(y)

# get numeric and categorical columns
numeric_col = data.select_dtypes(include=['int64', 'float64']).columns
category_col = data.select_dtypes(include=['object']).columns

print('%s: %s\n' % (bold('Numeric columns'), numeric_col.values))
print('%s: %s\n' % (bold('Categorical columns'), category_col.values))

print(bold('Dataset granularity'))
for col in data.columns:
    print(col, ':', data[col].nunique(), 'unique values')

Number of samples: 11162

Dataset info
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11162 entries, 0 to 11161
Data columns (total 17 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   age        11162 non-null  int64 
 1   job        11162 non-null  object
 2   marital    11162 non-null  object
 3   education  11162 non-null  object
 4   default    11162 non-null  object
 5   balance    11162 non-null  int64 
 6   housing    11162 non-null  object
 7   loan       11162 non-null  object
 8   contact    11162 non-null  object
 9   day        11162 non-null  int64 
 10  month      11162 non-null  object
 11  duration   11162 non-null  int64 
 12  campaign   11162 non-null  int64 
 13  pdays      11162 non-null  int64 
 14  previous   11162 non-null  int64 
 15  poutcome   11162 non-null  object
 16  deposit    11162 non-null  object
dtypes: int64(7), object(10)
memory usage: 1.4+ MB
None 

Dataset preview
   age       

### Prevalence of Target Class
This pie chart shows the distribution of the deposit column in the dataset. The 'yes' and 'no' labels indicate whether the client has subscribed to a term deposit or not, respectively. The chart shows that the dataset is relatively balanced, with 'no' representing 52.6% of the data and 'yes' representing 47.4%.

If it were imbalanced, we would need to adjust our data (for example, by resampling) to avoid biasing the model toward the majority class during training.
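The percentages quoted above follow directly from the class counts printed in the notebook output (5873 'no' and 5289 'yes'). A small sketch of that arithmetic:

```python
# Class counts as reported in the notebook output for the deposit column
counts = {'no': 5873, 'yes': 5289}

total = sum(counts.values())
prevalence = counts['yes'] / total  # share of the positive class

print(f"total={total}, prevalence={prevalence:.3f}")  # total=11162, prevalence=0.474
```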

In [6]:
def calc_prevalence(y_actual):
    # this function calculates the prevalence of the positive class (label = 1)
    return sum(y_actual) / len(y_actual)

plt.figure()

labels = ['No deposit', 'Has deposit']
n = len(y)

# Define the colors for 'no' and 'yes' 
colors = ["#a50404", "#028A0F"]

# Create the pie chart; reindex so the counts follow the 'no', 'yes' order of labels and colors
_, _, autotexts = plt.pie(data['deposit'].value_counts().reindex(['no', 'yes']), 
        labels=labels, 
        colors=colors, 
        autopct=lambda pct: f'{pct:.1f}%\n({int(round(pct/100*n))})',
        startangle=90, 
        textprops={'color': 'black', 'fontsize': 12})

for autotext in autotexts:
    autotext.set_size(16)
    autotext.set_color('white')
    
# Add a title to the plot
plt.title('Deposit Distribution', fontsize=18)

# Show the plot
plt.show()

# count the number of rows for each type
print(bold('Deposit class values'))
print(data.groupby('deposit').size(), '\n')

print('%s: %.3f\n' % (bold('Prevalence of the positive class'), calc_prevalence(y)))


Deposit class values
deposit
no     5873
yes    5289
dtype: int64 

Prevalence of the positive class: 0.474



### Point-Biserial Correlation for Each Feature with Target Variable

The point-biserial correlation is a special case of the Pearson correlation. It is used to measure the relationship between a continuous variable and a dichotomous variable, that is, one that has only two values (e.g., male/female, yes/no, true/false).

This code computes the point-biserial correlation between each feature in a dataset and a desired target variable. The resulting correlation coefficients are plotted as a bar chart. The x-axis shows the feature names, and the y-axis shows the correlation coefficients. The resulting plot allows us to visualize the strength and direction of the relationship between each feature and the target variable.

The figure gives us an indication of how important a feature might be. In particular, the duration feature has a significant impact on the final result.
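To illustrate why the point-biserial correlation is a special case of the Pearson correlation, the sketch below computes both on toy data using only the standard library. The helper names and the sample values are assumptions made for this example; the notebook itself relies on `scipy.stats.pointbiserialr`.

```python
import math
import statistics

def point_biserial(x, y):
    """Point-biserial correlation between continuous x and binary y (0/1)."""
    group1 = [xi for xi, yi in zip(x, y) if yi == 1]
    group0 = [xi for xi, yi in zip(x, y) if yi == 0]
    n1, n0, n = len(group1), len(group0), len(x)
    s_n = statistics.pstdev(x)  # population standard deviation of all x values
    mean_diff = statistics.fmean(group1) - statistics.fmean(group0)
    return mean_diff / s_n * math.sqrt(n1 * n0 / n**2)

def pearson(x, y):
    """Plain Pearson correlation coefficient, for comparison."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

duration = [120, 300, 90, 640, 410, 75]  # toy call durations in seconds
deposit = [0, 1, 0, 1, 1, 0]             # toy binary target

# Both formulas give the same coefficient on binary targets
print(point_biserial(duration, deposit), pearson(duration, deposit))
```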

In [7]:
from scipy.stats import pointbiserialr

# Calculate Point-Biserial correlation for each feature with target variable
correlations = []
for i in range(X.shape[1]):
    corr, pval = pointbiserialr(X[:,i], y)
    correlations.append(corr)

# Define color palette and map correlations to colors
cmap = sns.dark_palette('#69d', n_colors=len(features), reverse=True)
cor_range = max(correlations) - min(correlations)
color_mapping = dict(zip(sorted(correlations), cmap))  # map correlations to colors

# Set up visualization using seaborn
plt.figure(figsize=(8, 6))
ax = sns.barplot(x=features, y=correlations, palette=cmap)

# Color each bar based on its correlation value
for bar, corr in zip(ax.patches, correlations):
    bar.set_color(color_mapping[corr])

# Add title and axis labels, adjust font size, and rotate x-axis labels
plt.title('Point-Biserial Correlation of every feature to desired target', fontsize=14)
plt.xlabel('Features', fontsize=14)
plt.ylabel('Correlation Coefficient', fontsize=14)
ax.tick_params(axis='y', labelsize=14)
ax.set_xticklabels(ax.get_xticklabels(), rotation=45, ha='right', fontsize=14)
plt.subplots_adjust(bottom=.2)

# show the plot
plt.tight_layout()
plt.show()


### Effect of Balance on Term Deposits

In [8]:
# calculate the mean balance
mean_balance = data['balance'].mean()

# create a new dataframe whose balance column holds the balance level (above or below the mean)
grouped_data = pd.DataFrame({
    'deposit': data['deposit'],
    'balance': np.where(data['balance'] > mean_balance, 'above-average', 'below-average')
})

# group the data by balance level and deposit
grouped_data = grouped_data.groupby(['balance', 'deposit']).size().reset_index(name='count')

# create the bar plot
plt.figure(figsize=(8, 6))
sns.barplot(x="balance", y="count", hue="deposit", data=grouped_data, palette={'yes': "#028A0F", 'no': "#a50404"})
plt.show()


### Numerical Columns

#### Presence of Outliers in Numeric Columns

This plot shows how the numerical columns are distributed across quantiles, which enables us to identify possible outliers.
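Box plots conventionally draw their whiskers at 1.5 times the interquartile range (IQR) beyond the quartiles and show anything further out as an outlier. The sketch below applies that rule to toy data; the `iqr_outliers` helper is a hypothetical illustration and is not part of the notebook, which prints the 0%, 5%, 50%, 95%, and 100% quantiles instead.

```python
import statistics

def iqr_outliers(values):
    """Return the values lying more than 1.5 * IQR outside the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4, method='inclusive')
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < low or v > high]

# Toy contact counts; 63 matches the maximum of the campaign column in the quantile table
campaign = [1, 1, 2, 2, 2, 3, 3, 4, 63]
print(iqr_outliers(campaign))  # [63]
```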

In [9]:
# Calculate the quantiles
print(bold('Quantiles'))
quantiles = data[numeric_col].quantile([0, 0.05, 0.50, 0.95, 1]).T
print(quantiles)

# Calculate the number of rows and columns needed to fit all the plots
n_cols = 2
n_rows = (len(numeric_col) + 1) // 2  # Round up if necessary

# Create a subplot grid with the specified number of rows and columns
fig, axes = plt.subplots(n_rows, n_cols, figsize=(10, 4*n_rows))

# Flatten the axes array to iterate over the subplots
axes = axes.flatten()

# Iterate over the numeric columns and create a box plot for each one
for i, col in enumerate(numeric_col):
    sns.boxplot(data=data, x=col, hue='deposit', ax=axes[i])
    #axes[i].set_title(col)
    axes[i].set_xlabel(col, fontsize=12)
    axes[i].tick_params(axis='both', which='major', labelsize=12)

# Remove any unused subplots
for j in range(i+1, len(axes)):
    fig.delaxes(axes[j])

# Show the plots
plt.tight_layout()
plt.show()

Quantiles
            0.00   0.05   0.50     0.95     1.00
age         18.0  26.00   39.0    61.00     95.0
balance  -6847.0 -54.95  550.0  6026.45  81204.0
day          1.0   3.00   15.0    30.00     31.0
duration     2.0  51.00  255.0  1079.90   3881.0
campaign     1.0   1.00    2.0     7.00     63.0
pdays       -1.0  -1.00   -1.0   326.00    854.0
previous     0.0   0.00    0.0     5.00     58.0



#### Correlation Heatmap

This heatmap shows the pairwise correlation between the numeric features of the dataset. With the `Blues` colormap, darker blue cells indicate higher correlation coefficients and lighter cells indicate lower ones. The value in each cell is the correlation coefficient between the corresponding pair of features. This plot is useful for identifying strongly correlated features, which carry overlapping information and may be redundant in a machine learning model.

In [10]:
# Create the heatmap
plt.figure(figsize=(8, 6))
heatmap = sns.heatmap(data[numeric_col].corr(), cmap='Blues', annot=True)
heatmap.set_title('Correlation Heatmap', fontdict={'fontsize': 14})
plt.tight_layout()
plt.show()


#### Pair Grid

This plot is a PairGrid visualization that shows the relationship between different numeric variables in the dataset, with each point on the scatterplot representing a single observation. The scatterplot points are colored by the deposit column, with red representing 'no' and green representing 'yes'. This plot helps to identify any potential patterns or relationships between variables in the dataset and how they relate to the deposit column.

In [11]:
# Visualize distribution of dataset information
my_palette = {'yes': "#028A0F", 'no': "#a50404"}
g = sns.pairplot(data, vars=numeric_col, hue='deposit', height=1, aspect=1, palette=my_palette)
g.add_legend()
plt.show()


#### Histograms

This code generates a grid of histograms, one for each numerical feature in the dataset. The histograms are grouped by the deposit status, with 'yes' and 'no' colored in green and red, respectively. Each histogram shows the distribution of the feature's values, with the x-axis representing the range of values and the y-axis representing the count of observations falling in that range. Overall, this plot provides a quick, visual way to compare the distribution of each numerical feature across the two deposit groups.

In [12]:
# Plot for each numerical feature

# Calculate the number of rows and columns needed to fit all the plots
n_cols = 2
n_rows = (len(numeric_col) + 1) // 2  # Round up if necessary
my_palette = {'yes': "#028A0F", 'no': "#a50404"}

# Create a subplot grid with the specified number of rows and columns
fig, axes = plt.subplots(n_rows, n_cols, figsize=(10, 4*n_rows))

# Flatten the axes array to iterate over the subplots
axes = axes.flatten()

# Iterate over the continuous columns and create a histogram for each one
for i, col in enumerate(numeric_col):
    sns.histplot(data=data, x=col, hue='deposit', kde=True, multiple='stack', alpha=0.7, palette=my_palette, ax=axes[i])
    #axes[i].set_title(col)
    axes[i].set_xlabel(col, fontsize=12)
    axes[i].set_ylabel('Count', fontsize=12)
    axes[i].tick_params(axis='both', which='major', labelsize=12)

# Remove any unused subplots
for j in range(i+1, len(axes)):
    fig.delaxes(axes[j])

# Show the plots
plt.tight_layout()
plt.show()


### Categorical Columns

#### Countplot

This code generates a grid of subplots, with each subplot containing a countplot of a categorical column from a dataset. The 'deposit' column is used to color-code the bars in each countplot. The resulting plot provides an overview of the distribution of each categorical column and how it relates to the 'deposit' column.

In [13]:
# discard the deposit column
category_col_new = category_col.drop('deposit')

# Create a subplot grid with the specified number of rows and columns
n_cols = 2
n_rows = (len(category_col_new) + 1) // n_cols  # Round up if necessary
fig, axes = plt.subplots(n_rows, n_cols, figsize=(10, 4*n_rows))

# Flatten the axes array to iterate over the subplots
axes = axes.flatten()

# Define the order of the months
month_order = ['jan', 'feb', 'mar', 'apr', 'may', 'jun', 'jul', 'aug', 'sep', 'oct', 'nov', 'dec']

# Iterate over the categorical columns and create a count plot for each one
for i, col in enumerate(category_col_new):
    # Sort the 'month' column by the specified order
    if col == 'month':
        data[col] = pd.Categorical(data[col], categories=month_order, ordered=True)
    sns.countplot(data=data, x=col, hue='deposit', palette={'yes': "#028A0F", 'no': "#a50404"}, ax=axes[i])
    axes[i].set_xlabel(col, fontsize=12)
    
    # Set the ylabel for the first column of each row
    if i % n_cols != 0:
        axes[i].set_ylabel(None)
    else:
        axes[i].set_ylabel('count', fontsize=12)
        
    axes[i].tick_params(axis='both', which='major', labelsize=10)
    
    # Set custom rotation and alignment for 'job' category labels
    if col == 'job':
        axes[i].set_xticklabels(axes[i].get_xticklabels(), rotation=45, ha='right', fontsize=10)

# Remove any unused subplots
for j in range(i+1, len(axes)):
    fig.delaxes(axes[j])

# Show the plots
plt.tight_layout()
plt.show()


#### Correlation Heatmap of Every Feature

This heatmap shows the correlation between every pair of features in the dataset, including the encoded categorical features and the target. With the `Blues` colormap, darker blue cells indicate higher correlation coefficients and lighter cells indicate lower ones. The value in each cell is the correlation coefficient between the corresponding pair of features. This plot is useful for identifying strongly correlated features, which carry overlapping information and may be redundant in a machine learning model.

In [14]:
# create a new dataframe with encoded features and target
data_encoded = pd.concat([pd.DataFrame(X, columns=features), pd.Series(y, name='deposit')], axis=1)

# calculate correlation matrix
corr = data_encoded.corr()

# create heatmap
plt.figure(figsize=(12, 10))
sns.heatmap(corr, cmap='Blues', annot=True, fmt='.2f')
plt.title('Correlation Heatmap', fontdict={'fontsize': 14})
plt.tight_layout()
plt.show()


# Analytics Presets

The next code block contains several functions for analyzing and visualizing the results of a machine learning model. These functions include calculating various metrics such as accuracy score, area under the ROC curve, and generating confusion matrices and classification reports. Additionally, there are functions for plotting ROC curves, decision trees, and feature importance. These tools can aid in understanding how well a model is performing and which features are most important in predicting the target variable.
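As a reminder of how the reported scores relate to the confusion matrix, the sketch below derives accuracy, precision, recall, and F1 from the four cell counts. For binary labels 0/1, scikit-learn's `confusion_matrix` lays the cells out as `[[tn, fp], [fn, tp]]`. The `binary_metrics` helper and the counts are made up for illustration.

```python
def binary_metrics(tn, fp, fn, tp):
    """Derive the usual binary classification scores from confusion matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)  # share of predicted positives that are correct
    recall = tp / (tp + fn)     # share of actual positives that are found
    f1 = 2 * precision * recall / (precision + recall)
    return {'accuracy': accuracy, 'precision': precision, 'recall': recall, 'f1': f1}

# Made-up counts, not results from the bank dataset
scores = binary_metrics(tn=50, fp=10, fn=5, tp=35)
for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```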

In [15]:
# import relevant libraries
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report, roc_curve, auc, roc_auc_score, f1_score, precision_score, recall_score
from sklearn.model_selection import learning_curve
from sklearn.tree import plot_tree

# analytics methods
def get_best(accuracy_map, out=True):
    sorted_accuracy_map = sorted(accuracy_map.items(), key=lambda x: x[1], reverse=True)
    if out:
        for i, (k, v) in enumerate(sorted_accuracy_map, start=1):
            print(f"{i}. {k} ({v})")
        print()
    return sorted_accuracy_map

def get_report(y_test, y_pred, y_pred_prob=None, out=True):
    accuracy_score_ = accuracy_score(y_test, y_pred)
    auc_roc_score_ = roc_auc_score(y_test, y_pred if y_pred_prob is None else y_pred_prob)
    precision_score_ = precision_score(y_test, y_pred)
    f1_score_ = f1_score(y_test, y_pred)
    recall_score_ = recall_score(y_test, y_pred)
    confusion_matrix_ = confusion_matrix(y_test, y_pred)
    classification_report_ = classification_report(y_test, y_pred)
    if out:
        print_report(accuracy_score_, auc_roc_score_, precision_score_, f1_score_, recall_score_, confusion_matrix_, classification_report_)
    return accuracy_score_, auc_roc_score_, precision_score_, f1_score_, recall_score_, confusion_matrix_, classification_report_

def print_report(accuracy_score_=None, auc_roc_score_=None, precision_score_=None, f1_score_=None, recall_score_=None, confusion_matrix_=None, classification_report_=None):
    variables = [
        ('Accuracy score:', accuracy_score_), 
        ('AUC-ROC score:', auc_roc_score_),
        ('Precision score:', precision_score_), 
        ('F1 score:', f1_score_), 
        ('Recall score:', recall_score_),
        ('Confusion matrix:', confusion_matrix_), 
        ('Classification report:', classification_report_)
    ]

    for name, value in variables:
        if value is not None:
            print(bold(name), value)

    print()


def plot_confusion_matrix(y_test, y_pred):
    conf_mat = confusion_matrix(y_test, y_pred)
    sns.heatmap(conf_mat, annot=True, cmap='Blues')
    plt.xlabel('Predicted labels')
    plt.ylabel('True labels')
    plt.show()

def plot_roc(model, X_test, y_test):
    y_pred_prob = model.predict_proba(X_test)[:, 1]
    fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob)
    roc_auc = auc(fpr, tpr)
    plt.figure(figsize=(10, 6))
    plt.plot(fpr, tpr, label='ROC curve (area = %0.3f)' % roc_auc)
    plt.plot([0, 1], [0, 1], 'k--')  # Add a dashed diagonal line for comparison
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.05])
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('Receiver Operating Characteristic (ROC) Curve')
    
    # find the threshold that maximizes Youden's J statistic
    youden_j = tpr - fpr
    best_threshold = thresholds[np.argmax(youden_j)]
    
    # mark the point on the ROC curve that corresponds to the best threshold
    plt.plot(fpr[np.argmax(youden_j)], tpr[np.argmax(youden_j)], 'ro', label='Best Threshold = %0.3f' % best_threshold)
    
    plt.legend(loc="lower right")
    plt.show()
    
    return roc_auc, best_threshold

def plot_decision_tree(clf, column_values):
    plt.figure(figsize=(20, 12), dpi=100)
    plot_tree(clf, feature_names=column_values, class_names=["no", "yes"], filled=True, fontsize=10, max_depth=3)
    plt.show()

def plot_feature_importance(model, features):
    fig, ax = plt.subplots(figsize=(10, 6))
    sns.barplot(x=model.feature_importances_, y=features, ax=ax, orient='h')
    ax.set_title('Feature Importance')
    plt.tight_layout()
    plt.show()
    
def plot_learning_curve(lr, X, y, **kwargs):
    # Calculate the learning curve using the best estimator
    train_sizes, train_scores, test_scores = learning_curve(lr, X, y, n_jobs=-1, **kwargs)

    # Calculate the mean and standard deviation of the training and test scores
    train_mean = np.mean(train_scores, axis=1)
    train_std = np.std(train_scores, axis=1)
    test_mean = np.mean(test_scores, axis=1)
    test_std = np.std(test_scores, axis=1)

    # Plot the learning curve
    plt.figure(figsize=(10, 6))
    plt.plot(train_sizes, train_mean, color='blue', marker='o', markersize=5, label='training accuracy')
    plt.fill_between(train_sizes, train_mean + train_std, train_mean - train_std, alpha=0.15, color='blue')
    plt.plot(train_sizes, test_mean, color='green', linestyle='--', marker='s', markersize=5, label='validation accuracy')
    plt.fill_between(train_sizes, test_mean + test_std, test_mean - test_std, alpha=0.15, color='green')
    plt.xlabel('Number of training examples')
    plt.ylabel('Accuracy')
    plt.legend(loc='lower right')
    plt.show()
    
def plot_validation(x_label, y_label, plot_title, train_data, val_data):
    plt.figure(figsize=(10,6))
    labels = ["1st Fold", "2nd Fold", "3rd Fold", "4th Fold", "5th Fold"]
    X_axis = np.arange(len(labels))
    ax = plt.gca()
    plt.ylim(0.40000, 1)
    plt.bar(X_axis - 0.2, train_data, 0.4, color='blue', label='Training')
    plt.bar(X_axis + 0.2, val_data, 0.4, color='red', label='Validation')
    plt.title(plot_title, fontsize=30)
    plt.xticks(X_axis, labels)
    plt.xlabel(x_label, fontsize=14)
    plt.ylabel(y_label, fontsize=14)
    plt.legend()
    plt.grid(True)
    plt.show()
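
The `plot_roc` function above picks its best threshold by maximizing Youden's J statistic, `J = TPR - FPR`. A minimal standalone sketch of that selection, using made-up ROC points and a hypothetical `best_threshold_youden` helper:

```python
def best_threshold_youden(fpr, tpr, thresholds):
    """Return the threshold (and its J value) that maximizes J = TPR - FPR."""
    j_values = [t - f for f, t in zip(fpr, tpr)]
    best_index = max(range(len(j_values)), key=j_values.__getitem__)
    return thresholds[best_index], j_values[best_index]

# Made-up ROC points for illustration
fpr = [0.0, 0.1, 0.3, 0.6, 1.0]
tpr = [0.0, 0.6, 0.85, 0.9, 1.0]
thresholds = [1.0, 0.8, 0.5, 0.3, 0.0]

threshold, j = best_threshold_youden(fpr, tpr, thresholds)
print(threshold, round(j, 2))  # 0.5 0.55
```

Youden's J weighs false positives and false negatives equally; a different criterion may suit the campaign better if the two errors have different costs.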


# Training Framework

This code defines a function that enables training various machine learning classification models. The function is named `train`, and it supports models such as Logistic Regression, Decision Trees, Random Forests, Support Vector Machines, and others. When calling `train`, the user specifies which model to train by its class name and can pass additional keyword arguments to that model's constructor.
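The `train` function resolves the model class from its name at run time. One way to sketch the same dispatch is an explicit dictionary that maps names to classes, so that only known models can be built. `MODEL_REGISTRY` and `build_model` are hypothetical names introduced for this example and do not appear in the notebook.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# Hypothetical registry mapping model names to classes (only two entries shown)
MODEL_REGISTRY = {
    'LogisticRegression': LogisticRegression,
    'DecisionTreeClassifier': DecisionTreeClassifier,
}

def build_model(name, **kwargs):
    """Instantiate a registered classifier, forwarding kwargs to its constructor."""
    try:
        model_cls = MODEL_REGISTRY[name]
    except KeyError:
        raise ValueError(f"Unknown model name: {name!r}") from None
    return model_cls(**kwargs)

model = build_model('DecisionTreeClassifier', max_depth=3, random_state=0)
print(type(model).__name__, model.max_depth)
```

An unknown name fails early with a `ValueError` instead of evaluating arbitrary text.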

In [16]:
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier, AdaBoostClassifier, \
    ExtraTreesClassifier
from sklearn.linear_model import LogisticRegression, PassiveAggressiveClassifier, RidgeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier
from sklearn.model_selection import cross_validate

def cross_validation(X, y, name='LogisticRegression', full_result=True, out=True, options=None, **kwargs):
    model = return_function_name(name, **kwargs)
    # options holds extra keyword arguments for cross_validate (e.g. cv, scoring)
    result = cross_validate(model, X, y, return_train_score=True, **(options or {}))

    output = result if full_result else {k: [np.mean(v), np.std(v)] for k, v in result.items()}

    if out:
        for metric, scores in output.items():
            print(bold(metric), scores)
        print()

    return output

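def return_function_name(name: str, **kwargs):
    # Look up the imported classifier class by name instead of calling eval() on a string
    model_cls = globals().get(name)
    if not isinstance(model_cls, type):
        raise ValueError(f"Unknown model name: {name!r}")
    return model_cls(**kwargs)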

def train(X_train, y_train, name='LogisticRegression', **kwargs):
    model = return_function_name(name, **kwargs)
    model.fit(X_train, y_train)
    return model

# Read Data for Model Selection

Read data from a CSV file, clean and encode the data, and split it into training and testing sets.
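Note that `full_clean` standardizes the numeric columns before the train/test split, so the scaling statistics are computed on rows that later end up in the test set. A common alternative is to split first and fit the scaler on the training rows only. The sketch below shows that ordering on synthetic data; it is an illustration under that assumption, not the notebook's pipeline.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic features and labels, for illustration only
rng = np.random.default_rng(0)
X = rng.normal(loc=50, scale=10, size=(100, 2))
y = rng.integers(0, 2, size=100)

# Split first, using the same test size and seed as the notebook
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=60)

scaler = StandardScaler().fit(X_train)      # statistics learned from training rows only
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)    # reuses the training mean and standard deviation

print(X_train_scaled.mean(axis=0).round(6), X_test_scaled.shape)
```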

In [17]:
from sklearn.model_selection import train_test_split

# Load the data
data = pd.read_csv('bank.csv')

# Create a dataframe with column names and data types before cleaning
before_cleaning = pd.DataFrame({'Feature': data.columns, 'Type': data.dtypes})

# Clean the data
new_data = full_clean(data)

# Create a dataframe with column names and data types after cleaning
after_cleaning = pd.DataFrame({'Feature': new_data.columns, 'Type': new_data.dtypes})

# Create a figure with two subplots, side by side
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 8))

# Set the fontsize of both tables to 14
fontsize = 14

# Display the table for the data before cleaning in the first subplot
ax1.axis('off')
table1 = ax1.table(cellText=before_cleaning.values, colLabels=before_cleaning.columns, cellLoc='left', loc='center', colWidths=[0.75, 0.25])
table1.auto_set_font_size(False)
table1.set_fontsize(fontsize)

# Make the text in the first row bold
for j in range(len(before_cleaning.columns)):
    cell = table1[0, j]
    cell.set_text_props(weight='bold')
    cell.set_text_props(ha='left')
    
# Display the table for the cleaned data in the second subplot
ax2.axis('off')
table2 = ax2.table(cellText=after_cleaning.values, colLabels=after_cleaning.columns, cellLoc='left', loc='center', colWidths=[0.75, 0.25])
table2.auto_set_font_size(False)
table2.set_fontsize(fontsize)

# Make the text in the first row bold
for j in range(len(after_cleaning.columns)):
    cell = table2[0, j]
    cell.set_text_props(weight='bold')
    cell.set_text_props(ha='left')
    
# Add space between the subplots
plt.subplots_adjust(hspace=0.5, wspace=0.5)

# Add arrow in the middle of the subplots
plt.text(-0.38, 0.5, 'Transform', bbox = {'facecolor': 'oldlace', 'alpha': 0.5, 'boxstyle': "rarrow,pad=0.3", 'ec': 'red'})

# Show the figure
plt.show()

X, y = split_data(new_data, target_col='target')
features = X.columns
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=60)


# Model Training and Evaluation Pipeline

This code defines a general `run_model` function that works with any of the classifiers imported above, such as Logistic Regression, Decision Tree, and Random Forest. It trains the named model on the training set, predicts on the test set, and evaluates performance with accuracy, area under the ROC curve, precision, F1 score, and recall. When `ntimes` is greater than 1, training is repeated and the metrics are averaged across runs. The mean accuracy of each model is stored in `model_map`, which `run_ranking` uses to rank the models.

We use accuracy as the primary metric for evaluating the classification models because the Bank Marketing Dataset is balanced, with roughly equal numbers of positive and negative examples. Accuracy measures the proportion of correctly classified examples out of all examples, which makes it a suitable metric for balanced datasets. We also report precision, recall, F1 score, and AUC-ROC, which capture the trade-offs between different types of errors.
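
To illustrate why the balance of the dataset matters for this choice, here is a minimal sketch in plain Python. The helper `classification_metrics` and the toy labels are hypothetical and are not part of the project code; they only show how accuracy can look good on an imbalanced sample while precision, recall, and F1 reveal that the positive class is never found.

```python
def classification_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall and F1 for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Toy imbalanced sample (made-up labels): 9 negatives and 1 positive.
y_true = [0] * 9 + [1]
# A "model" that always predicts the majority class.
y_pred = [0] * 10

imbalanced = classification_metrics(y_true, y_pred)
print(imbalanced)
```

On this toy sample the accuracy is high even though the single positive example is missed, which is why accuracy alone is only trusted here on the grounds that the classes are roughly balanced.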

In [18]:
model_map = {}

# General Function
def run_model(name, natural_name=None, ntimes=1, **kwargs):
    if not natural_name:
        natural_name = name
    logger.info(f'Running {natural_name}')
    
    metric_names = ['accuracy', 'auc_roc', 'precision', 'f1', 'recall']

    global_metrics = {metric_name: 0 for metric_name in metric_names}

    for _ in range(ntimes):
        model = train(X_train, y_train, name=name, **kwargs)
        y_pred = model.predict(X_test)
        y_pred_prob = model.predict_proba(X_test)[:, 1]

        accuracy_score, auc_roc_score, precision_score, f1_score, recall_score, confusion_matrix, classification_report = get_report(y_test, y_pred, y_pred_prob, out=False)

        metrics = [accuracy_score, auc_roc_score, precision_score, f1_score, recall_score]

        for i, metric_name in enumerate(metric_names):
            global_metrics[metric_name] += metrics[i]

    for metric_name in metric_names:
        global_metrics[metric_name] /= ntimes

    global_accuracy = global_metrics['accuracy']
    global_auc = global_metrics['auc_roc']
    global_precision = global_metrics['precision']
    global_f1 = global_metrics['f1']
    global_recall = global_metrics['recall']
    
    print_report(global_accuracy, global_auc, global_precision, global_f1, global_recall)
    
    model_map[natural_name] = global_accuracy


# Ranking
def run_ranking():
    logger.info('Ranking')

    get_best(model_map)

# Run Classification Models with Default Hyperparameters

Run each classifier through `run_model`. Apart from the few settings shown in the list (such as `max_iter`), the models use their default hyperparameters, meaning that they are not fine-tuned for maximum accuracy.

The results of each model's performance can be used to compare their accuracy and efficiency, but they should be interpreted with caution as the default hyperparameters may not be optimal.

In [19]:
function_names = [
        {'name': "RandomForestClassifier"},
        {'name': "LogisticRegression", 'max_iter': 10000, 'penalty': None, 'natural_name': "LogisticsRegressionUnregularized"},
        {'name': "LogisticRegression", 'max_iter': 10000},
        {'name': "DecisionTreeClassifier"},
        {'name': "GradientBoostingClassifier"},
        {'name': "SVC", 'probability': True},
        {'name': "GaussianNB"},
        {'name': "KNeighborsClassifier"},
        {'name': "AdaBoostClassifier"},
        {'name': "XGBClassifier"},
        {'name': "MLPClassifier", 'max_iter': 1000},
        {'name': "ExtraTreesClassifier"},
        {'name': "LGBMClassifier"},
        {'name': "CatBoostClassifier", 'logging_level': 'Silent'},
]

for args in function_names:
    run_model(**args, ntimes=5)

🟡 INFO: Running RandomForestClassifier
Accuracy score: 0.852395879982087
AUC-ROC score: 0.9222844150155876
Precision score: 0.8309228178432196
F1 score: 0.8456957594355723
Recall score: 0.8610104861773117

🟡 INFO: Running LogisticsRegressionUnregularized
Accuracy score: 0.8199731303179579
AUC-ROC score: 0.8937912233014712
Precision score: 0.8317948717948717
F1 score: 0.8013833992094861
Recall score: 0.773117254528122

🟡 INFO: Running LogisticRegression
Accuracy score: 0.8213166144200625
AUC-ROC score: 0.8939280975446371
Precision score: 0.8336755646817249
F1 score: 0.8027681660899655
Recall score: 0.7740705433746425

🟡 INFO: Running DecisionTreeClassifier
Accuracy score: 0.7955217196596507
AUC-ROC score: 0.794342262901605
Precision score: 0.7867185287569061
F1 score: 0.

# Ranking of Classification Models with Default Hyperparameters

Since the dataset is relatively balanced, accuracy is a reasonable metric for comparing and ranking the models.

In [20]:
run_ranking()

🟡 INFO: Ranking
1. LGBMClassifier (0.8696820420958351)
2. CatBoostClassifier (0.8687863860277654)
3. XGBClassifier (0.8634124496193462)
4. GradientBoostingClassifier (0.8544558889386475)
5. RandomForestClassifier (0.852395879982087)
6. MLPClassifier (0.8294670846394985)
7. SVC (0.8253470667263769)
8. AdaBoostClassifier (0.8235557545902374)
9. ExtraTreesClassifier (0.8223914017017465)
10. LogisticRegression (0.8213166144200625)
11. LogisticsRegressionUnregularized (0.8199731303179579)
12. DecisionTreeClassifier (0.7955217196596507)
13. KNeighborsClassifier (0.793999104343932)
14. GaussianNB (0.7223466188983431)
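
The ranking above is consistent with sorting `model_map` by accuracy in descending order. The definition of `get_best` is not shown in this section, so the following is only a sketch under that assumption; `rank_models` is a hypothetical name, and the three accuracy values are copied from the output above.

```python
def rank_models(scores):
    """Return (name, score) pairs sorted from highest to lowest score."""
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Three accuracy values taken from the ranking output above.
model_map = {
    "RandomForestClassifier": 0.852395879982087,
    "LGBMClassifier": 0.8696820420958351,
    "GaussianNB": 0.7223466188983431,
}

ranking = rank_models(model_map)
for position, (name, score) in enumerate(ranking, start=1):
    print(f"{position}. {name} ({score})")
```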



# Cross-Validation - Learning

In [21]:
clr = cross_validation(X, y, name='LogisticRegression', full_result=False, options={"cv": 5}, max_iter=10000)
clru = cross_validation(X, y, name='LogisticRegression', full_result=False, options={"cv": 5}, max_iter=10000, penalty=None)

fit_time [0.15539917945861817, 0.027007826788646196]
score_time [0.002200031280517578, 0.00040036346739415977]
test_score [0.7687739463601533, 0.030612674843967746]
train_score [0.81013715892589, 0.0034166339129334013]

fit_time [0.19359855651855468, 0.053533261693428244]
score_time [0.0018012046813964844, 0.0003986148851523329]
test_score [0.7686843406253863, 0.030500109931951763]
train_score [0.8102043456673318, 0.0035137676611767103]
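
Each pair printed above is the mean and standard deviation of a per-fold array, because `cross_validation` was called with `full_result=False`. The sketch below reproduces that summary step with the standard library only. The fold scores are made up for illustration, and `summarize` is a hypothetical helper; `pstdev` is used because `np.std` computes the population standard deviation by default.

```python
from statistics import fmean, pstdev


def summarize(result):
    """Collapse each list of per-fold values into [mean, population std]."""
    return {key: [fmean(values), pstdev(values)] for key, values in result.items()}


# Made-up per-fold scores for a 5-fold run (illustration only).
result = {
    "test_score": [0.70, 0.75, 0.80, 0.75, 0.80],
    "train_score": [0.81, 0.80, 0.82, 0.81, 0.81],
}

summary = summarize(result)
# Gap between mean train and mean test score: a rough overfitting indicator.
gap = summary["train_score"][0] - summary["test_score"][0]

for key, (mean, std) in summary.items():
    print(f"{key}: mean={mean:.3f} std={std:.3f}")
print(f"train/test gap: {gap:.3f}")
```

In the actual output above, the mean train score is about 0.810 and the mean test score about 0.769 for both the regularized and the unregularized model, so the two behave almost identically under 5-fold cross-validation.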



# Classification threshold - Learning

In [22]:
# Train GBM model
gbm = LGBMClassifier()
gbm.fit(X_train, y_train)

# Compute ROC curve and AUC
roc_auc, best_threshold = plot_roc(gbm, X_test, y_test)

# Fit an identical GBM model and apply the best threshold to its predicted probabilities
gbm_new = LGBMClassifier()
gbm_new.fit(X_train, y_train)
y_pred_prob_new = gbm_new.predict_proba(X_test)[:, 1]
y_pred_new = (y_pred_prob_new >= best_threshold).astype(int)

# Calculate accuracy of original GBM model
y_pred_prob = gbm.predict_proba(X_test)[:, 1]
y_pred = (y_pred_prob >= 0.5).astype(int)
acc_original = accuracy_score(y_test, y_pred)

# Calculate accuracy of new GBM model with best threshold
acc_new = accuracy_score(y_test, y_pred_new)

# Print results
print("ROC AUC score: {:.3f}".format(roc_auc))
print("Best threshold: {:.3f}".format(best_threshold))
print("Accuracy of original GBM model: {:.3f}".format(acc_original))
print("Accuracy of new GBM model with best threshold: {:.3f}".format(acc_new))

lr = LogisticRegression(C=0.001, penalty='l2', solver='lbfgs', max_iter=10000, multi_class='auto')

lr.fit(X, y)

plot_learning_curve(lr, X, y, cv=8)


ROC AUC score: 0.934
Best threshold: 0.452
Accuracy of original GBM model: 0.870
Accuracy of new GBM model with best threshold: 0.876
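
How `plot_roc` picks `best_threshold` is not shown in this section. One common criterion is Youden's J statistic, which selects the threshold that maximizes the true positive rate minus the false positive rate. The sketch below implements that criterion in plain Python on made-up scores; the names `best_threshold_youden` and `accuracy_at` are hypothetical, and this is not necessarily what `plot_roc` does.

```python
def best_threshold_youden(y_true, scores):
    """Return (threshold, J) maximizing J = TPR - FPR over the observed scores."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    best_t, best_j = None, float("-inf")
    for t in sorted(set(scores)):
        tp = sum(1 for y, s in zip(y_true, scores) if y == 1 and s >= t)
        fp = sum(1 for y, s in zip(y_true, scores) if y == 0 and s >= t)
        j = tp / positives - fp / negatives
        if j > best_j:  # ties keep the lowest threshold
            best_t, best_j = t, j
    return best_t, best_j


def accuracy_at(y_true, scores, threshold):
    """Accuracy when predicting positive for scores >= threshold."""
    correct = sum(1 for y, s in zip(y_true, scores) if int(s >= threshold) == y)
    return correct / len(y_true)


# Made-up labels and predicted probabilities (illustration only).
y_true = [0, 0, 0, 1, 0, 1, 1, 1]
scores = [0.10, 0.20, 0.35, 0.40, 0.55, 0.60, 0.80, 0.90]

threshold, j_stat = best_threshold_youden(y_true, scores)
print("Best threshold:", threshold)
print("Accuracy at 0.5:", accuracy_at(y_true, scores, 0.5))
print("Accuracy at best threshold:", accuracy_at(y_true, scores, threshold))
```

Note that a threshold chosen on the same test set used to report accuracy tends to give a slightly optimistic result; selecting it on a separate validation split would give a cleaner estimate.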

