# Week 10 Deliverables

**Team Members' Details**

Group Name: Data Science Bank Marketers
Members:

* Amr Hacoglu - amr.hacoglu@gmail.com - Turkey - University of Karabuk - Data Science
* Ha My Pham - mpham25@wooster.edu - US - College of Wooster - Data Science

**Problem Description**

ABC Bank aims to develop a machine learning model to predict whether a customer will subscribe to a term deposit product. This model will help the bank focus its marketing efforts on customers with a higher likelihood of purchasing the product, thereby optimizing resource allocation and reducing marketing costs.

**Exploratory Data Analysis (EDA)**

* Analyzed the distribution of features and their relationships with the target variable.
* Handled missing values using mean/median/mode imputation and model-based approaches.
* Identified and addressed outliers using the IQR method and capping.
* Performed feature engineering to create new features and transform existing ones.
* Addressed the class imbalance issue using SMOTE, class weighting, and undersampling techniques.
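
Class weighting, one of the imbalance strategies listed above, can be sketched without any resampling. The snippet below is a minimal illustration on a made-up 90/10 label vector (the counts are assumptions, not figures from the bank dataset). It uses scikit-learn's `compute_class_weight` to show the weights that the `'balanced'` setting assigns to each class.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Toy labels: 90 negatives and 10 positives (illustrative, not the bank data)
y_toy = np.array([0] * 90 + [1] * 10)

# 'balanced' weights are n_samples / (n_classes * class_count)
classes = np.array([0, 1])
weights = compute_class_weight(class_weight='balanced', classes=classes, y=y_toy)
class_weights = dict(zip(classes.tolist(), weights.tolist()))
print(class_weights)
```

The same weighting can be requested directly from a classifier, for example `RandomForestClassifier(class_weight='balanced')`, which leaves the training rows unchanged and instead penalizes errors on the minority class more heavily.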



**Final Recommendation**

Based on the EDA and feature engineering, we recommend building the predictive model with a combination of Logistic Regression, ensemble methods (e.g., Random Forest), and boosting algorithms (e.g., XGBoost, LightGBM). Because the target classes are imbalanced, accuracy alone would be misleading, so we suggest evaluating the model with AUC-ROC, precision, recall, and F1-score. These scores can then be translated into business metrics such as potential cost savings and increased conversion rates.
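
The metrics named above can be computed with scikit-learn as in the following sketch. The labels and predicted probabilities are toy values chosen for illustration, and the 0.5 cut-off is an assumption rather than a threshold taken from this project.

```python
import numpy as np
from sklearn.metrics import f1_score, precision_score, recall_score, roc_auc_score

# Toy ground truth and predicted probabilities (illustrative values only)
y_true = np.array([0, 0, 0, 0, 1, 1])
y_prob = np.array([0.1, 0.4, 0.35, 0.8, 0.65, 0.9])

# AUC-ROC measures ranking quality and needs no threshold
auc = roc_auc_score(y_true, y_prob)

# Precision, recall, and F1 depend on the chosen cut-off (0.5 here)
y_hat = (y_prob >= 0.5).astype(int)
precision = precision_score(y_true, y_hat)
recall = recall_score(y_true, y_hat)
f1 = f1_score(y_true, y_hat)

print(f'AUC-ROC: {auc:.3f}  precision: {precision:.3f}  recall: {recall:.3f}  F1: {f1:.3f}')
```

Lowering the cut-off generally trades precision for recall, so the threshold is worth tuning against the cost of a wasted call versus a missed subscriber.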

**GitHub Repository**

https://github.com/Amr-Hacoglu/Data-Glacier-Internship

In [None]:
from statistics import mean

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix
from sklearn.model_selection import RepeatedStratifiedKFold, cross_validate, train_test_split
from sklearn.preprocessing import LabelEncoder

In [None]:
# Load and inspect the dataset
df = pd.read_csv('/kaggle/input/bankdataset/bank-additional-full.csv', sep=';')
df.head()

In [None]:
# Handle missing values
numeric_columns = df.select_dtypes(include=[np.number]).columns
df[numeric_columns] = df[numeric_columns].fillna(df[numeric_columns].mean())

categorical_columns = df.select_dtypes(include=['object']).columns
df[categorical_columns] = df[categorical_columns].fillna(df[categorical_columns].mode().iloc[0])
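
`fillna` only touches real `NaN` values. If a file marks missing entries with a placeholder string such as `'unknown'` (an assumption here, worth checking with `df.isna().sum()` and `value_counts()` on the categorical columns), the placeholders have to be converted first. The sketch below shows the idea on a small made-up frame.

```python
import numpy as np
import pandas as pd

# Toy frame with one placeholder string and one real NaN (illustrative values)
toy = pd.DataFrame({
    'job': ['admin.', 'unknown', 'admin.', 'technician'],
    'age': [30.0, 40.0, np.nan, 50.0],
})

# Convert the placeholder string to a real missing value
toy = toy.replace('unknown', np.nan)

# Mode for the categorical column, mean for the numeric column
toy['job'] = toy['job'].fillna(toy['job'].mode().iloc[0])
toy['age'] = toy['age'].fillna(toy['age'].mean())
print(toy)
```

Whether a placeholder should be imputed or kept as its own category is a modeling decision; keeping it preserves the information that the value was not recorded.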

In [None]:
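# Identify outliers with the IQR rule and cap them at the fences
def handle_outliers(df, column):
    """Clip df[column] in place to [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR]."""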
    Q1 = df[column].quantile(0.25)
    Q3 = df[column].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    df[column] = df[column].clip(lower_bound, upper_bound)

for column in numeric_columns:
    handle_outliers(df, column)
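
To make the capping rule concrete, here is a small worked example on two made-up series. The helper `cap_iqr` is a hypothetical, non-mutating variant of the function above, written only for this illustration. The second series shows a caveat: when one value dominates a column, the IQR is zero and capping collapses the column to a constant.

```python
import pandas as pd

def cap_iqr(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Return a copy of `series` clipped to [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return series.clip(q1 - k * iqr, q3 + k * iqr)

# Q1 = 2, Q3 = 4, IQR = 2, so the fences are -1 and 7 and only 100 is capped
spread = pd.Series([1, 2, 3, 4, 100])
capped_spread = cap_iqr(spread)

# Q1 = Q3 = 999, so IQR = 0 and every value is forced to 999
dominated = pd.Series([999, 999, 999, 999, 3])
capped_dominated = cap_iqr(dominated)

print(capped_spread.tolist())
print(capped_dominated.tolist())
```

Checking `df[column].nunique()` before and after capping is a quick way to spot columns that lose all their variation.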

In [None]:
# Feature engineering: label-encode the categorical features and the target
categorical_features = [
    'month', 'day_of_week', 'poutcome', 'job', 'marital',
    'education', 'default', 'housing', 'loan', 'contact',
]
for column in categorical_features + ['y']:
    df[column] = LabelEncoder().fit_transform(df[column])

# All columns except the target are used as predictors
X = df.drop(columns=['y'])
y = df['y']
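
`LabelEncoder` assigns integer codes in sorted order of the distinct values, so the codes carry no calendar or business meaning. The sketch below shows this on a made-up `month` series and, as one possible alternative, a one-hot encoding with `pd.get_dummies` that avoids implying any order between categories.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy month values (illustrative, not the full dataset)
months = pd.Series(['mar', 'apr', 'may', 'apr'])

# LabelEncoder sorts the distinct values, so codes follow alphabetical order
encoder = LabelEncoder()
codes = encoder.fit_transform(months)
mapping = {str(label): code for code, label in enumerate(encoder.classes_)}
print(mapping)

# One-hot encoding creates one indicator column per category instead
one_hot = pd.get_dummies(months, prefix='month')
print(one_hot.columns.tolist())
```

Tree-based models such as Random Forest can usually cope with arbitrary integer codes, whereas linear models like Logistic Regression would treat them as ordered magnitudes, which is where one-hot encoding tends to matter.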

In [None]:
# Handle class imbalance using SMOTE (seeded for reproducibility)
oversample = SMOTE(random_state=0)
over_X, over_y = oversample.fit_resample(X, y)
over_X_train, over_X_test, over_y_train, over_y_test = train_test_split(
    over_X, over_y, test_size=0.1, stratify=over_y, random_state=0
)

In [None]:
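# Build the SMOTE + Standard Random Forest (SRF) model
from imblearn.pipeline import Pipeline

SMOTE_SRF = RandomForestClassifier(n_estimators=100, random_state=0)

# Cross-validate on the original data and apply SMOTE inside each training fold only,
# so the validation folds contain no synthetic samples
smote_srf_pipeline = Pipeline([('smote', SMOTE(random_state=0)), ('model', SMOTE_SRF)])
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
scoring = ('f1', 'recall', 'precision')
scores = cross_validate(smote_srf_pipeline, X, y, scoring=scoring, cv=cv)

print('Mean f1: %.3f' % mean(scores['test_f1']))
print('Mean recall: %.3f' % mean(scores['test_recall']))
print('Mean precision: %.3f' % mean(scores['test_precision']))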

In [None]:
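# Evaluate the model on a held-out split of the original data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, stratify=y, random_state=0)

# Oversample the training split only, so no test row (or a synthetic neighbour of one)
# is seen during training
X_train_res, y_train_res = SMOTE(random_state=0).fit_resample(X_train, y_train)
SMOTE_SRF.fit(X_train_res, y_train_res)
y_pred = SMOTE_SRF.predict(X_test)

cm = confusion_matrix(y_test, y_pred)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=['Will Not Buy', 'Will Buy'])
disp.plot(cmap='Greens')
plt.title('SMOTE + Standard Random Forest Confusion Matrix')
plt.show()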
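
The confusion matrix can be turned into the business metrics mentioned in the final recommendation. The sketch below is one possible way to do that. The function `campaign_summary`, the confusion-matrix counts, the cost per call, and the revenue per subscription are all hypothetical placeholders, not results or figures from this project.

```python
def campaign_summary(tn, fp, fn, tp, cost_per_call, revenue_per_subscription):
    """Compare calling only predicted buyers with calling every customer."""
    total = tn + fp + fn + tp
    targeted_calls = tp + fp  # customers the model flags as 'Will Buy'
    calls_saved = total - targeted_calls
    return {
        'calls_saved': calls_saved,
        'cost_saved': calls_saved * cost_per_call,
        'conversion_rate_all': (tp + fn) / total,
        'conversion_rate_targeted': tp / targeted_calls if targeted_calls else 0.0,
        'revenue_missed': fn * revenue_per_subscription,  # buyers never called
    }

# Hypothetical counts and prices, for illustration only
summary = campaign_summary(
    tn=800, fp=100, fn=20, tp=80,
    cost_per_call=5.0, revenue_per_subscription=200.0,
)
for name, value in summary.items():
    print(f'{name}: {value}')
```

With a fitted model, the four counts could be read from the matrix computed above with `tn, fp, fn, tp = cm.ravel()`.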

In [None]:
# Extract feature importances
feature_importances = SMOTE_SRF.feature_importances_
features = pd.DataFrame({
    'Feature': X.columns,
    'Importance': feature_importances
})
features = features.sort_values(by='Importance', ascending=False)

plt.figure(figsize=(12, 8))
plt.barh(features['Feature'], features['Importance'])
plt.xlabel('Importance')
plt.ylabel('Feature')
plt.title('Feature Importances from RandomForestClassifier')
plt.gca().invert_yaxis()
plt.show()

features

In [None]:
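# Refit a shallow forest on five selected features so that a single tree is small enough to plot
# (note: this reuses the name `features`, replacing the importance table above)
features = df[['duration', 'euribor3m', 'nr.employed', 'cons.conf.idx', 'cons.price.idx']]
print(features)
Model = RandomForestClassifier(n_estimators=100, random_state=0, max_depth=3)
Model.fit(features, y)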

In [None]:
from sklearn.tree import plot_tree

# Extract a single tree from the forest (e.g., the first tree)
tree = Model.estimators_[0]

# Plot the tree; feature_names is passed as a plain list of strings
plt.figure(figsize=(20, 10))
plot_tree(
    tree,
    filled=True,
    feature_names=list(features.columns),
    class_names=['Will Not Buy', 'Will Buy'],
    rounded=True,
)
plt.title('Decision Tree from RandomForestClassifier')
plt.show()