# Predicting Customer Bank Term Deposit Subscriptions

We are using the Bank Marketing dataset from UC Irvine's Machine Learning Repository. The data comes from the marketing campaigns of a Portuguese banking institution. These campaigns were based on phone calls to clients, and the goal is to determine whether a client will subscribe to a bank term deposit.

The structure of this project is as follows:
- First, import and view the data
- Clean the dataset of its missing values
- Preprocess the data for our machine learning model
- Finally, build the ML model to predict whether a client will subscribe to a term deposit

## Import Data From UC Irvine Machine Learning Repository

In [None]:
from ucimlrepo import fetch_ucirepo
import pandas as pd
  
# fetch dataset 
bank_marketing = fetch_ucirepo(id=222) 
  
# data (as pandas dataframes) 
X = bank_marketing.data.features 
y = bank_marketing.data.targets 
  
# metadata 
print(bank_marketing.metadata) 
  
# variable information 
print(bank_marketing.variables) 

In [None]:
X.head()

In [None]:
y.head()

## Inspecting Data

In [None]:
# View summary statistics
print(X.describe())
print(y.describe())

In [None]:
# View df info (info() prints its summary directly, so no print() is needed)
X.info()
y.info()

In [None]:
# View dataframe data types
print(X.dtypes)
print(y.dtypes)

In [None]:
# Convert the categorical features to numeric ones using get_dummies() method from pandas
X_dummies = pd.get_dummies(X)

# Show every column when displaying a dataframe
pd.set_option('display.max_columns', None)

X_dummies.head()

In [None]:
# Convert boolean values (True/False) to binary (0/1)
X_dummies = X_dummies.astype(int)

In [None]:
X_dummies.dtypes

In [None]:
# Map the target from "yes"/"no" to True/False
mapping = {'yes': True, 'no': False}

# Work on a copy so the fetched data is not modified, then apply the mapping
y = y.copy()
y['y'] = y['y'].map(mapping)

In [None]:
# Convert boolean values in output variable to binary 0/1 values
y_dummies = y.astype(int)

## Measuring Statistical Correlation between Predictor Variables (X) and Target Variable (y)

In [None]:
# Looking at correlation with heatmap

import seaborn as sns

df = pd.concat([X_dummies, y_dummies], axis = 1)

sns.heatmap(df.corr(), cmap="YlGnBu")

In [None]:
df

In [None]:
corr_matrix = df.corr()

corr_target = corr_matrix.iloc[:-1, -1]

sorted_corr = corr_target.abs().sort_values(ascending=False)

print(sorted_corr)

# Data Modeling

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report

## Split Data

In [None]:
# Split data into training and testing sets (80% for training, 20% for testing)
X_train, X_test, y_train, y_test = train_test_split(X_dummies, y_dummies, test_size=0.2, random_state=1) 

# 'test_size' sets the proportion of the dataset held out for testing (the rest is used for training)
# 'random_state' sets the seed for reproducibility
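To see what these two arguments do, here is a minimal sketch on toy data. The arrays `X_toy` and `y_toy` and their 10% positive rate are assumptions made for illustration, not the bank data. The last call also shows the optional `stratify` argument, which the split above does not use but which keeps the class ratio the same in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (assumed for illustration): 100 rows, 10 of them positive
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.array([1] * 10 + [0] * 90)

# test_size=0.2 holds out 20% of the rows; random_state fixes the shuffle
split_a = train_test_split(X_toy, y_toy, test_size=0.2, random_state=1)
split_b = train_test_split(X_toy, y_toy, test_size=0.2, random_state=1)

X_tr, X_te, y_tr, y_te = split_a
print(X_tr.shape, X_te.shape)            # (80, 2) (20, 2)
print(np.array_equal(X_te, split_b[1]))  # True: same seed, same split

# stratify=y_toy preserves the share of positives in train and test
_, _, y_tr_s, y_te_s = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=1, stratify=y_toy
)
print(y_tr_s.mean(), y_te_s.mean())
```

Stratifying is worth considering whenever one class is much rarer than the other, because a purely random split can leave the test set with too few positive examples to evaluate on.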

## Choose a Model

In [None]:
# Create a Random Forest Classifier instance
model = RandomForestClassifier(random_state=1)

# Train the classifier on training data
model.fit(X_train, y_train)

In [None]:
# Use model to predict on test data
y_pred = model.predict(X_test)

# Evaluate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy}')

# Generate a classification report
report = classification_report(y_test, y_pred)
print(report)

The accuracy score shows that our model is correct in approximately 90% of its predictions. However, the classification report shows that the model is generally better at predicting that a customer will not subscribe to a bank term deposit than that they will, so accuracy alone overstates how well it identifies subscribers.
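A small sketch of why accuracy can look good while the "yes" class is predicted poorly. The labels below are hypothetical (an assumed 88/12 split chosen for illustration, not taken from the notebook's output), and the "model" is a baseline that always predicts "no".

```python
# Hypothetical labels: 88 "no" (0) and 12 "yes" (1); illustrative only
y_true = [0] * 88 + [1] * 12

# A baseline that always predicts "no"
y_base = [0] * 100

# Accuracy: share of predictions that match the true label
accuracy = sum(t == p for t, p in zip(y_true, y_base)) / len(y_true)

# Recall for "yes": share of actual subscribers the baseline finds
true_pos = sum(t == 1 and p == 1 for t, p in zip(y_true, y_base))
actual_pos = sum(t == 1 for t in y_true)
recall_yes = true_pos / actual_pos

print(f"accuracy={accuracy:.2f}, recall(yes)={recall_yes:.2f}")
```

Under these assumed numbers the baseline scores 88% accuracy without identifying a single subscriber, which is why the per-class precision and recall in the classification report are the more informative figures here.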

## Tune Model

In [None]:
# Define the hyperparameter grid to search

param_grid = {
    'n_estimators': [50, 100, 150],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
}

from sklearn.model_selection import GridSearchCV

# Create GridSearchCV instance with the parameter grid
grid_search = GridSearchCV(estimator=model, param_grid=param_grid, cv=5, n_jobs=-1, scoring='accuracy')

# Fit the grid search to the data
grid_search.fit(X_train, y_train)

# Get the best hyperparameters
best_params = grid_search.best_params_

In [None]:
# Check best parameters for the model

best_params

In [None]:
# Create a Random Forest Classifier model with the best hyperparameters
model2 = RandomForestClassifier(random_state=1, n_estimators=150, max_depth=30, min_samples_split=5)

# Train the classifier on the entire training data
model2.fit(X_train, y_train)

In [None]:
# Make predictions on the test data once again
y_pred2 = model2.predict(X_test)

accuracy2 = accuracy_score(y_test, y_pred2)
print(f'Accuracy: {accuracy2}')

report2 = classification_report(y_test, y_pred2)
print(report2)


For the new model, the accuracy is identical to that of the initial model, indicating that the tuned hyperparameters did not have a significant impact on the model's performance for this specific dataset.

Possible next steps to take this project further would be to try different model algorithms, such as Gradient Boosting, Support Vector Machines, or Neural Networks. You could also revisit hyperparameter tuning, gather more data, or revisit your feature selection. This is why domain knowledge is so important in data analysis: it lets you guide your feature selection with your knowledge of the industry.
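As a rough sketch of the first suggestion, the loop below compares a random forest with scikit-learn's `GradientBoostingClassifier`. It runs on synthetic data from `make_classification` (an assumed stand-in with roughly 90% negatives), not on the bank data, so the printed scores say nothing about this project's results. F1 on the positive class is used here instead of accuracy, which is a choice made for this sketch.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the encoded features (assumed, roughly 90% negatives)
X_syn, y_syn = make_classification(
    n_samples=1000, n_features=10, weights=[0.9, 0.1], random_state=1
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=1, stratify=y_syn
)

# Candidate algorithms to compare under identical conditions
candidates = {
    "random_forest": RandomForestClassifier(random_state=1),
    "gradient_boosting": GradientBoostingClassifier(random_state=1),
}

scores = {}
for name, clf in candidates.items():
    clf.fit(X_tr, y_tr)
    # F1 on the positive class; zero_division=0 avoids a warning if nothing is predicted positive
    scores[name] = f1_score(y_te, clf.predict(X_te), zero_division=0)
    print(name, round(scores[name], 3))
```

To apply the same comparison to this project, the synthetic arrays would be replaced by the training and test splits created earlier.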

## Revisiting Feature Selection

In [None]:
X

In [None]:
# Checking the missing values in the initial dataset
X.isna().sum()

For this new selection of features, I decided to fill the missing values with the most common value in each affected column. The exception is the 'poutcome' column, which is almost entirely null. I believe that knowing how a client responded in previous phone calls is important, but since there is barely any data for it in this case, I decided to drop the column altogether.
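The reasoning above can be sketched on a toy dataframe. The rows and the 50% cutoff in `threshold` are hypothetical, chosen only to illustrate the idea: measure the share of missing values per column, drop columns that are mostly empty, and fill the rest with each column's most frequent value.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up rows; only the column names echo the real data
toy = pd.DataFrame({
    "job": ["admin.", "technician", np.nan, "admin."],
    "contact": ["cellular", np.nan, "cellular", "telephone"],
    "poutcome": [np.nan, np.nan, np.nan, "success"],
})

# Share of missing values in each column
missing_share = toy.isna().mean()
print(missing_share)

# Drop columns that are mostly empty (hypothetical cutoff)
threshold = 0.5
to_drop = missing_share[missing_share > threshold].index
cleaned = toy.drop(columns=to_drop)

# Fill the remaining gaps with each column's most frequent value
for col in cleaned.columns:
    cleaned[col] = cleaned[col].fillna(cleaned[col].mode()[0])

print(cleaned)
```

Checking `isna().mean()` first makes the drop decision explicit instead of relying on a visual scan of the counts.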

In [None]:
X2 = X.drop('poutcome', axis=1)

In [None]:
X2

In [None]:
# Replace null values with the most frequent value in each column

columns_replace = ['job', 'education', 'contact']

for col in columns_replace:
    most_frequent_value = X2[col].mode()[0]
    # Assign the result back; calling fillna(inplace=True) on a selected column may not update X2
    X2[col] = X2[col].fillna(most_frequent_value)


In [None]:
X2.isna().sum()

In [None]:
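# Convert the categorical features to numeric ones, as before
X_dummies2 = pd.get_dummies(X2)

# Convert boolean values (True/False) to binary (0/1)
X_dummies2 = X_dummies2.astype(int)

pd.set_option('display.max_columns', None)

# Display last so the notebook actually renders the preview
X_dummies2.head()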

## Build Model with Data Preprocessing

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X_dummies2, y_dummies, test_size=0.2, random_state=1)

In [None]:
# Create a Random Forest Classifier instance
model3 = RandomForestClassifier(random_state=1)

# Train the classifier on training data
model3.fit(X_train, y_train)

In [None]:
# Make predictions on the test data once again
y_pred3 = model3.predict(X_test)

accuracy3 = accuracy_score(y_test, y_pred3)
print(f'Accuracy: {accuracy3}')

report3 = classification_report(y_test, y_pred3)
print(report3)

In [None]:
# Define the hyperparameter grid to search

param_grid = {
    'n_estimators': [150, 300, 400],
    'max_depth': [30, 40, 50, 60],
    'min_samples_split': [2, 5, 10],
}

# Create GridSearchCV instance with the parameter grid
grid_search2 = GridSearchCV(estimator=model3, param_grid=param_grid, cv=5, n_jobs=-1, scoring='accuracy')

# Fit the grid search to the data
grid_search2.fit(X_train, y_train)

# Get the best hyperparameters
best_params2 = grid_search2.best_params_

In [None]:
best_params2

In [None]:
# Create a Random Forest Classifier model with the best hyperparameters
model3 = RandomForestClassifier(random_state=1, n_estimators=400, max_depth=30, min_samples_split=5)

# Train the classifier on the entire training data
model3.fit(X_train, y_train)

In [None]:
# Make predictions on the test data once again
y_pred4 = model3.predict(X_test)

accuracy4 = accuracy_score(y_test, y_pred4)
print(f'Accuracy: {accuracy4}')

report4 = classification_report(y_test, y_pred4)
print(report4)