# Predicting Bank Term Deposit Subscriptions in a Marketing Campaign

We are using the Bank Marketing dataset from UC Irvine's Machine Learning Repository. This data relates to the marketing campaigns of a Portuguese banking institution. The campaigns were based on phone calls to clients to determine whether they would subscribe to a bank term deposit.

This project focuses on leveraging machine learning techniques to predict whether clients will subscribe to a term deposit as part of a direct marketing campaign. The primary goal is to develop the most accurate predictive model that can help the banking institution target its marketing efforts more effectively.

The project is structured as follows:
- Import and view the data
- Preprocess and clean the dataset for our ML models
- Run an exploratory analysis to better understand correlations in the data
- Build base ML models to predict whether a client will subscribe to a term deposit
- Fine-tune the hyperparameters of the base model using grid search
- Train the final model and evaluate its metrics

## Import Data From UC Irvine Machine Learning Repository

In [None]:
# Import Data From UCI Machine Learning Repository
from ucimlrepo import fetch_ucirepo
import pandas as pd
  
# fetch dataset 
bank_marketing = fetch_ucirepo(id=222) 
  
# data (as pandas dataframes) 
X = bank_marketing.data.features 
y = bank_marketing.data.targets 
  
# metadata 
print(bank_marketing.metadata) 
  
# variable information 
print(bank_marketing.variables) 

pd.set_option('display.max_columns', None) # Show all columns when displaying a dataframe

In [None]:
# Inspect predictor variable dataset
X.head()

In [None]:
# Inspect target variable
y.head()

### Inspect Data

In [None]:
# View summary statistics
print(X.describe())
print(y.describe())

In [None]:
# View df info (info() prints its summary directly, so no print() is needed)
X.info()
y.info()

In [None]:
# View data types
print(X.dtypes)
print(y.dtypes)

In [None]:
# Checking the missing values in the dataset
X.isna().sum()

In [None]:
# Checking the missing values in target variable
y.isna().sum()
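
The counts above are easier to compare across columns when expressed as a share of rows. The following is a minimal sketch on a made-up toy frame (the values and the `toy` name are assumptions for illustration, not the real dataset); the same `isna().mean()` call could be applied to `X`.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real features (values are made up)
toy = pd.DataFrame({
    "job": ["admin.", np.nan, "technician", "services"],
    "poutcome": [np.nan, np.nan, np.nan, "success"],
    "age": [30, 41, 52, 28],
})

# isna() gives a True/False mask; its column mean is the fraction of missing rows
missing_share = toy.isna().mean().sort_values(ascending=False)
print(missing_share)
```

Columns with a very high missing share are the ones that need a decision other than simple imputation.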

## Data Preprocessing

In this part of the project, we first handle the missing values discovered in the previous step. Since only categorical variables have missing values, we fill those rows with the most frequently occurring value in their respective column. However, over 75% of the 'poutcome' column is missing, so we leave those values for the one-hot encoding in the next step to handle. Ordinarily we would drop such a column, but as we see later in the project, 'poutcome' has one of the highest correlations with the target variable, so we keep it.

After handling the missing values in the 'job', 'education', and 'contact' columns, we convert the categorical variables into numeric values with one-hot encoding, using the get_dummies() function from pandas.
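
To make the two strategies concrete, here is a small sketch on made-up data (the `toy` frame and its values are assumptions for illustration). It shows mode imputation on one column and what "letting one-hot encoding handle" missing values means in practice: with the default `dummy_na=False`, `pd.get_dummies` creates no column for missing values, so those rows simply get 0 in every dummy column for that feature.

```python
import numpy as np
import pandas as pd

# Toy data with gaps in both columns (values are made up)
toy = pd.DataFrame({
    "contact": ["cellular", np.nan, "cellular", "telephone"],
    "poutcome": [np.nan, "success", np.nan, "failure"],
})

# Mode imputation: fill gaps with the most frequent category
toy["contact"] = toy["contact"].fillna(toy["contact"].mode()[0])

# get_dummies skips NaN by default (dummy_na=False), so rows with a
# missing 'poutcome' get 0 in every poutcome_* column
encoded = pd.get_dummies(toy).astype(int)
print(encoded)
```

Passing `dummy_na=True` would instead add an explicit indicator column for missing values, which is an alternative worth considering when missingness itself may be informative.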

In [None]:
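# Replace null values with the most frequent value in each column

X = X.copy()  # work on a copy so the assignments below do not raise chained-assignment warnings

columns_replace = ['job', 'education', 'contact']

for col in columns_replace:
    most_frequent_value = X[col].mode()[0]
    # Assign the result back: fillna(inplace=True) on a column selection may not update X
    X[col] = X[col].fillna(most_frequent_value)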

In [None]:
# Make sure the missing values have been replaced
X.isna().sum()

### Now convert categorical variables to numeric

In [None]:
# Convert the categorical features to numeric ones using get_dummies() from pandas.
# Each category in each categorical column gets its own column of boolean True/False values.
X_dummies = pd.get_dummies(X)

X_dummies.head()

In [None]:
# We don't want to do the same with the target variable, since we want a single column in its df. So we map the 'yes' and 'no' values to True/False

# Changing y df from "yes/no" to True/False
mapping = {'yes' : True, 'no': False}

# Apply mapping to column
y['y'] = y['y'].map(mapping)

In [None]:
# Last thing we do now is convert the boolean values (True/False) to binary (0/1) values

X_dummies = X_dummies.astype(int)

y = y.astype(int)

In [None]:
# Check work
print(X_dummies.dtypes)

print(y.dtypes)

## Analyzing Statistical Correlation between Predictor Variables (X) and Target Variable (y)

In this section, we look at the correlation between the predictor variables and the target variable to see whether any variables can be removed before modeling. We do not remove any after this analysis: no single variable has a strong correlation with the target variable, so we keep them all.

Note that in this analysis the 'poutcome' column has the second-highest correlation with the target variable, which is why we did not drop it earlier in the project.

In [None]:
# Looking at correlation with heatmap

import seaborn as sns

df = pd.concat([X_dummies, y], axis = 1) # combining predictor variables df (X) with target variable df (y)

sns.heatmap(df.corr(), cmap="YlGnBu")

In [None]:
# View correlation 
corr_matrix = df.corr()

corr_target = corr_matrix.iloc[:-1, -1]

sorted_corr = corr_target.abs().sort_values(ascending=False)

print(sorted_corr)
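
The ranking above uses absolute values, which hides whether a feature is positively or negatively related to the target. As a hedged alternative, the sketch below uses `DataFrame.corrwith` on made-up data (the `toy_X` and `toy_y` names and values are assumptions) to rank by magnitude while keeping the sign.

```python
import pandas as pd

# Toy predictors and target (values are made up for illustration)
toy_X = pd.DataFrame({
    "duration": [100, 250, 80, 400, 320, 60],
    "poutcome_success": [0, 1, 0, 1, 0, 0],
})
toy_y = pd.Series([0, 1, 0, 1, 1, 0], name="y")

# Pearson correlation of every predictor column with the target
corr_with_target = toy_X.corrwith(toy_y)

# Order by absolute value, but keep the signed correlations
order = corr_with_target.abs().sort_values(ascending=False).index
ranked = corr_with_target.reindex(order)
print(ranked)
```

This also avoids computing the full correlation matrix when only the target column is of interest.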

## Choosing a Base Model

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score, classification_report

# Split data into training and testing sets (80% for training, 20% for testing)
# The target is passed as a 1-D Series (y['y']) so scikit-learn does not warn about a column-vector y
X_train, X_test, y_train, y_test = train_test_split(X_dummies, y['y'], test_size=0.2, random_state=1)

# 'test_size' sets the proportion of the dataset held out for testing
# 'random_state' sets the seed for reproducibility
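
If the two classes turn out to be imbalanced (worth checking with `y['y'].value_counts()`), a plain random split can leave the test set with a slightly different class ratio than the training set. The sketch below, on synthetic data whose 10% positive rate is an assumption for illustration, shows the `stratify` argument of `train_test_split`, which keeps the ratio the same in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 1000 rows, 10% positive labels (made up for illustration)
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(1000, 3))
y_demo = np.array([1] * 100 + [0] * 900)

# stratify=y_demo preserves the class proportions in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1, stratify=y_demo
)

print(y_tr.mean(), y_te.mean())
```

Whether stratification changes the results for this project is not something the original analysis tested; it is shown here only as an option.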

### Logistic Regression Model

In [None]:
# Create Logistic Regression model
regressionmodel = LogisticRegression(random_state=10)

# Train model on training data
regressionmodel.fit(X_train, y_train)

# Make predictions on test data
y_pred_regression = regressionmodel.predict(X_test)

# Evaluate the model's performance
accuracy_regression = accuracy_score(y_test, y_pred_regression)
print(f'Accuracy : {accuracy_regression}')

# Generate a classification report
print(classification_report(y_test, y_pred_regression))

### Random Forest Model

In [None]:
# Create a Random Forest Classifier instance
model_forest = RandomForestClassifier(random_state=1)

# Train the classifier on training data
model_forest.fit(X_train, y_train)

# Use model to predict on test data
y_pred_forest = model_forest.predict(X_test)

# Evaluate accuracy
accuracy_forest = accuracy_score(y_test, y_pred_forest)
print(f'Accuracy: {accuracy_forest}')

# Generate a classification report
print(classification_report(y_test, y_pred_forest))

### Gradient Boosting Model

In [None]:
# Create a Gradient Boosting Classifier model
model_gradient = GradientBoostingClassifier(random_state=1)

# Train the model on training data
model_gradient.fit(X_train, y_train)

# Use model to predict on test data
y_pred_gradient = model_gradient.predict(X_test)

# Evaluate model accuracy
accuracy_gradient = accuracy_score(y_test, y_pred_gradient)
print(f'Accuracy: {accuracy_gradient}')

# Generate a classification report
print(classification_report(y_test, y_pred_gradient))
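
When comparing the three accuracy scores, it can help to know what a model that ignores the features entirely would score. The following sketch uses scikit-learn's `DummyClassifier` on synthetic labels (the 90/10 class split is an assumption for illustration, not a property measured on the real data) to show why accuracy alone can look strong on an imbalanced target while the F1 score for the positive class is zero.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, f1_score

# Synthetic labels: 90 negatives, 10 positives (made up for illustration)
y_demo = np.array([0] * 90 + [1] * 10)
X_demo = np.zeros((100, 1))  # the dummy model ignores the features

# Always predict the most frequent class seen during fit
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_demo, y_demo)
y_pred = baseline.predict(X_demo)

baseline_accuracy = accuracy_score(y_demo, y_pred)
baseline_f1 = f1_score(y_demo, y_pred, zero_division=0)
print(baseline_accuracy, baseline_f1)
```

Fitting the same baseline on `X_train` and `y_train` would give a reference point for the accuracies printed above, and the per-class rows of the classification reports serve the same purpose.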

## Hyperparameter Tuning

In [None]:
# Set the hyperparameter grid
from sklearn.model_selection import GridSearchCV

param_grid = {
    'n_estimators': [350, 500, 700],
    'max_depth': [30, 40, 50, 60],
    'min_samples_split': [2, 5, 10],
}

# Create GridSearchCV instance with the parameter grid
grid_search = GridSearchCV(estimator=model_forest, param_grid=param_grid, cv=5, n_jobs=-1, scoring='accuracy')

# Fit the grid search to the data
grid_search.fit(X_train, y_train)

# Get the best hyperparameters
best_params = grid_search.best_params_

In [None]:
# View hyperparameters
best_params
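
By default (`refit=True`), `GridSearchCV` retrains the best configuration on the full training data and exposes it as `best_estimator_`, so the winning values do not need to be copied by hand. The sketch below demonstrates this on a small synthetic dataset with a deliberately tiny grid (the data, the grid values, and the names `search` and `tuned` are assumptions chosen to keep the example fast).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Small synthetic classification problem (stand-in for the real data)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1
)

# Tiny grid so the search finishes quickly
small_grid = {"n_estimators": [20, 40], "max_depth": [3, 6]}
search = GridSearchCV(
    RandomForestClassifier(random_state=1), small_grid, cv=3, scoring="accuracy"
)
search.fit(X_tr, y_tr)

# best_estimator_ is already refit on all of X_tr with the best parameters
tuned = search.best_estimator_
test_accuracy = tuned.score(X_te, y_te)
print(search.best_params_, round(search.best_score_, 3), round(test_accuracy, 3))
```

Comparing `best_score_` (the mean cross-validated accuracy) with the test accuracy gives a quick sanity check that the tuned model generalizes about as well as the search suggested.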

In [None]:
# Create a Random Forest Classifier model with the best hyperparameters
model_tuned = RandomForestClassifier(random_state=1, n_estimators=350, max_depth=50, min_samples_split=5)

# Train the classifier on the entire training data
model_tuned.fit(X_train, y_train)

# Make predictions on the test data once again
y_pred_tuned = model_tuned.predict(X_test)

accuracy_tuned = accuracy_score(y_test, y_pred_tuned)
print(f'Accuracy: {accuracy_tuned}')

report = classification_report(y_test, y_pred_tuned)
print(report)

With the tuned model, the accuracy is nearly identical to that of the initial model, indicating that the tuned hyperparameters did not have a significant impact on performance for this dataset.
Possible next steps to take this project further would be to try different algorithms, such as neural networks, revisit hyperparameter tuning, gather more data, or revisit feature selection. This is why domain knowledge is so important in data analysis: industry knowledge can guide feature selection.