## Predictive Modeling/Classification Notebook
This notebook contains the Predictive Modeling and Classification tasks performed on the bank marketing dataset.

### Plan:
1. Data Preprocessing
2. Initial Modeling - validated with 10-fold cross-validation
3. Retraining with Outlier-Removed Features - validated with 10-fold cross-validation
4. Retraining with Top Features - validated with 10-fold cross-validation
5. Retraining with Outlier-Removed + Top Features - validated with 10-fold cross-validation
6. Ensemble Modeling - validated with 10-fold cross-validation
7. Hyperparameter Tuning with the Optuna Framework - validated with 10-fold cross-validation
8. Saving the Best-Performing Model Checkpoint for Deployment

#### Data Preprocessing - Label Encoding Categorical Features

In [64]:
# Import the necessary libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder

# Reload the data
df = pd.read_csv('/content/ML Assessment Dataset (Bank Data) - Sheet1.csv')

# Identify the categorical columns for Label Encoding
label_columns = df.select_dtypes(include=['object']).columns

# Initialize the Label Encoder
label_encoder = LabelEncoder()

# Convert categorical variables to numerical form using Label Encoding
for column in label_columns:
    df[column] = label_encoder.fit_transform(df[column])


# Separate features (X) and target variable (y)
X = df.drop('y', axis=1)
y = df['y']


# Standardize the feature variables (puts all features on a common scale)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
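
The cell above fits the scaler on all rows before cross-validation, so every validation fold contributes to the scaling statistics. A minimal sketch of one way to avoid that, assuming synthetic data from `make_classification` in place of the bank dataset (the names `X_demo`, `y_demo`, and `pipeline` are illustrative): wrapping the scaler and the classifier in a pipeline makes `cross_val_score` refit the scaler on the training part of each fold.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the bank data (assumption, not the real dataset)
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=42)

# The scaler is refit on the training part of every fold
pipeline = make_pipeline(StandardScaler(), GaussianNB())
demo_scores = cross_val_score(pipeline, X_demo, y_demo, cv=10)

print(f"mean={demo_scores.mean():.4f}, std={demo_scores.std():.4f}")
```

For tree-based models the effect is usually negligible, but the pipeline form also keeps preprocessing and model together when saving a checkpoint.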


#### Initial Model Training

Model Training DecisionTreeClassifier - Validating with 10 Folds


In [65]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score
import numpy as np

# Initialize the Decision Tree Classifier
dt_model = DecisionTreeClassifier(random_state=42)

# Perform 10-Fold Cross-Validation
cv_scores_dt = cross_val_score(dt_model, X_scaled, y, cv=10)

# Calculate the mean and standard deviation of the cross-validation scores
cv_mean_dt = np.mean(cv_scores_dt)
cv_std_dt = np.std(cv_scores_dt)

cv_mean_dt, cv_std_dt


(0.8672869171110982, 0.010317171288062858)

*   Mean Accuracy: Approximately 86.73%
*   Standard Deviation: Approximately 1.03%




Model Training Naive Bayes Classifier - Validating with 10 Folds

In [66]:
from sklearn.naive_bayes import GaussianNB

# Initialize the Naive Bayes Classifier
nb_model = GaussianNB()

# Perform 10-Fold Cross-Validation
cv_scores_nb = cross_val_score(nb_model, X_scaled, y, cv=10)

# Calculate the mean and standard deviation of the cross-validation scores
cv_mean_nb = np.mean(cv_scores_nb)
cv_std_nb = np.std(cv_scores_nb)

cv_mean_nb, cv_std_nb

(0.8345528336165973, 0.017938256605871426)



*   Mean Accuracy: Approximately 83.46%
*   Standard Deviation: Approximately 1.79%



Summary:
* Decision Tree Classifier: Mean Accuracy ~ 86.73%, Standard Deviation ~ 1.03%
* Naive Bayes Classifier: Mean Accuracy ~ 83.46%, Standard Deviation ~ 1.79%
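
Converting fold scores to percentages by hand is easy to get wrong by a factor of ten. A small sketch of a formatting helper, where `summarize_cv` is a hypothetical name and the example scores are made up purely to show the output format:

```python
import numpy as np

def summarize_cv(name, scores):
    """Format cross-validation scores as 'mean ± std' in percent."""
    scores = np.asarray(scores, dtype=float)
    return f"{name}: {scores.mean() * 100:.2f}% ± {scores.std() * 100:.2f}%"

# Hypothetical fold scores, used only to show the formatting
example_scores = [0.86, 0.88, 0.87, 0.85, 0.89]
print(summarize_cv("Example model", example_scores))
```

In the notebook the same helper could be called with `cv_scores_dt` and `cv_scores_nb`.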

#### Outlier Removal using Z-Score & Train Models

Outlier Removal using Z-Score

In [67]:
# Import the libraries needed for outlier detection
import pandas as pd
import numpy as np
from scipy import stats


# Calculate z-scores to identify outliers
z_scores = np.abs(stats.zscore(df.select_dtypes(include=[np.number])))

# Get boolean array indicating the presence of outliers
outliers = (z_scores > 3)

# Count outliers per feature
outliers_count = pd.DataFrame(outliers, columns=df.select_dtypes(include=[np.number]).columns).sum()

outliers_count

age           44
job            0
marital        0
education      0
default       76
balance       88
housing        0
loan           0
contact        0
day            0
month          0
duration      88
campaign      87
pdays        171
previous      99
poutcome       0
y              0
dtype: int64

In [68]:
# Remove rows containing any outlier; copy so later column assignments
# do not trigger pandas' SettingWithCopyWarning
df_no_outliers = df[(z_scores < 3).all(axis=1)].copy()

# Check the shape of the new DataFrame to see how many rows were removed
df.shape, df_no_outliers.shape


((4521, 17), (3908, 17))
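
To make the row filter above concrete, here is a small sketch on toy data (the frame `toy` and its values are assumptions chosen so that exactly one row is extreme). A row is kept only if every column lies within 3 standard deviations of that column's mean.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Toy data (assumption): column 'a' contains one extreme value, column 'b' none
toy = pd.DataFrame({"a": list(range(1, 20)) + [1000], "b": list(range(20))})

# Absolute z-score of every cell, computed per column
toy_z = np.abs(stats.zscore(toy.to_numpy(dtype=float), axis=0))

# Keep only rows where every column is within 3 standard deviations
toy_clean = toy[(toy_z < 3).all(axis=1)].copy()

print(toy.shape, toy_clean.shape)
```

Note that in the notebook the z-scores are computed on label-encoded columns and on the target `y` as well, so binary columns such as `default` can also be flagged.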

Preparing the Outlier-Removed Data for Model Training

In [69]:
# Re-import the necessary libraries for data preparation
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder

# Re-identify the categorical columns for Label Encoding
label_columns = df_no_outliers.select_dtypes(include=['object']).columns

# Re-initialize the Label Encoder and Scaler
label_encoder = LabelEncoder()
scaler = StandardScaler()

# Re-convert categorical variables to numerical form using Label Encoding
for column in label_columns:
    df_no_outliers[column] = label_encoder.fit_transform(df_no_outliers[column])

# Re-separate features (X) and target variable (y)
X_no_outliers = df_no_outliers.drop('y', axis=1)
y_no_outliers = df_no_outliers['y']

# Re-standardize the feature variables
X_scaled_no_outliers = scaler.fit_transform(X_no_outliers)


Train Decision Tree Classifier & Naive Bayes Classifier without Outliers

In [70]:
# Re-import the necessary library for 10-Fold Cross-Validation
from sklearn.model_selection import cross_val_score

# Initialize the Decision Tree Classifier
dt_model = DecisionTreeClassifier(random_state=42)

# Initialize the Naive Bayes Classifier
nb_model = GaussianNB()

# Perform 10-Fold Cross-Validation on the new Decision Tree model
cv_scores_dt_no_outliers = cross_val_score(dt_model, X_scaled_no_outliers, y_no_outliers, cv=10)

# Calculate the mean and standard deviation of the cross-validation scores for the new Decision Tree model
cv_mean_dt_no_outliers = np.mean(cv_scores_dt_no_outliers)
cv_std_dt_no_outliers = np.std(cv_scores_dt_no_outliers)

# Perform 10-Fold Cross-Validation on the new Naive Bayes model
cv_scores_nb_no_outliers = cross_val_score(nb_model, X_scaled_no_outliers, y_no_outliers, cv=10)

# Calculate the mean and standard deviation of the cross-validation scores for the new Naive Bayes model
cv_mean_nb_no_outliers = np.mean(cv_scores_nb_no_outliers)
cv_std_nb_no_outliers = np.std(cv_scores_nb_no_outliers)

cv_mean_dt_no_outliers, cv_std_dt_no_outliers, cv_mean_nb_no_outliers, cv_std_nb_no_outliers


(0.8710308872712966,
 0.01428552223865972,
 0.8469748835989245,
 0.023024169206089455)

**New Models with No Outliers:**

**Decision Tree Classifier:**
* Mean Accuracy: ~87.10%
* Standard Deviation: ~1.43%

**Naive Bayes Classifier:**
* Mean Accuracy: ~84.70%
* Standard Deviation: ~2.30%



---


**Initial Models:**

**Decision Tree Classifier:**
* Mean Accuracy: ~86.73%
* Standard Deviation: ~1.03%

**Naive Bayes Classifier:**
* Mean Accuracy: ~83.46%
* Standard Deviation: ~1.79%


---

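**Observations:** Mean accuracy improved slightly after removing outliers (Decision Tree: 86.73% to 87.10%; Naive Bayes: 83.46% to 84.70%). The standard deviation across folds also increased, and both gains are within about one standard deviation.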

#### Retraining Models with Top Features (Before Outlier Removal)

Find the most important features before outlier removal (initial dataset)

In [71]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder

# Prepare the data for training
# Convert categorical variables to numerical using Label Encoding
label_columns = df.select_dtypes(include=['object']).columns
df_encoded = df.copy()
label_encoder = LabelEncoder()
for column in label_columns:
    df_encoded[column] = label_encoder.fit_transform(df[column])

# Separate features and target variable
X = df_encoded.drop('y', axis=1)
y = df_encoded['y']

# Create a RandomForestClassifier model
rf_model = RandomForestClassifier(random_state=42)

# Fit the model
rf_model.fit(X, y)

# Get feature importances
feature_importances = rf_model.feature_importances_

# Create a DataFrame for feature importances
feature_importance_df = pd.DataFrame({'Feature': X.columns, 'Importance': feature_importances})

# Sort the features by importance
feature_importance_df = feature_importance_df.sort_values('Importance', ascending=False)


feature_importance_df


| | Feature | Importance |
|---|---|---|
| 11 | duration | 0.290557 |
| 5 | balance | 0.1058 |
| 0 | age | 0.105436 |
| 9 | day | 0.092174 |
| 10 | month | 0.083508 |
| 13 | pdays | 0.052059 |
| 1 | job | 0.050703 |
| 15 | poutcome | 0.045556 |
| 12 | campaign | 0.038802 |
| 3 | education | 0.030945 |


**Most Important Features:**
* Duration: This is the most important feature with an importance of approximately 0.29. This aligns with our earlier correlation analysis.
* Balance and Age: These are also important features, each with an importance around 0.10.
* Day and Month: These features have importance levels around 0.09 and 0.08, respectively.

**Least Important Features:**
* Default: This has the least importance, close to 0.003.
* Loan: This feature also has a low importance, around 0.01.
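
The ranking above comes from a random forest's impurity-based importances. A minimal sketch of the same idea on synthetic data, assuming hypothetical feature names `f0` to `f5`; the importances are normalized to sum to 1, which is why they can be read as shares.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data with hypothetical feature names f0..f5
X_arr, y_arr = make_classification(
    n_samples=300, n_features=6, n_informative=3, random_state=42
)
X_imp = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(6)])

forest = RandomForestClassifier(n_estimators=50, random_state=42).fit(X_imp, y_arr)

# Impurity-based importances are normalized, so they sum to 1
importances = pd.Series(
    forest.feature_importances_, index=X_imp.columns
).sort_values(ascending=False)

# Names of the three highest-ranked features
top_k = importances.head(3).index.tolist()
print(importances.round(3))
print("Top 3:", top_k)
```

Impurity-based importances can favor high-cardinality features, so treating the ranking as a guide rather than a definitive ordering seems prudent.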

Train Models with Top 10 Features

In [72]:
# Select the top 10 important features based on the feature scores
top_features = feature_importance_df['Feature'][:10].tolist()

# Create a new feature matrix with only the top important features
X_train_top_features = pd.DataFrame(X_scaled, columns=X.columns)[top_features]

# Initialize the Decision Tree Classifier
dt_model = DecisionTreeClassifier(random_state=42)

# Initialize the Naive Bayes Classifier
nb_model = GaussianNB()

# Perform 10-Fold Cross-Validation on the new Decision Tree model
cv_scores_dt_top_features = cross_val_score(dt_model, X_train_top_features, y, cv=10)

# Calculate the mean and standard deviation of the cross-validation scores for the new Decision Tree model
cv_mean_dt_top_features = np.mean(cv_scores_dt_top_features)
cv_std_dt_top_features = np.std(cv_scores_dt_top_features)

# Perform 10-Fold Cross-Validation on the new Naive Bayes model
cv_scores_nb_top_features = cross_val_score(nb_model, X_train_top_features, y, cv=10)

# Calculate the mean and standard deviation of the cross-validation scores for the new Naive Bayes model
cv_mean_nb_top_features = np.mean(cv_scores_nb_top_features)
cv_std_nb_top_features = np.std(cv_scores_nb_top_features)

cv_mean_dt_top_features, cv_std_dt_top_features, cv_mean_nb_top_features, cv_std_nb_top_features


(0.862862626736213,
 0.0144016057526754,
 0.8522475531852546,
 0.01958895278676839)

#### Retraining with Top Features (After Outlier Removal)

Find the most important features after outlier removal

In [73]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder

# Prepare the data for training
# Convert categorical variables to numerical using Label Encoding
label_columns = df_no_outliers.select_dtypes(include=['object']).columns

df_encoded = df_no_outliers.copy()
label_encoder = LabelEncoder()
for column in label_columns:
    df_encoded[column] = label_encoder.fit_transform(df_no_outliers[column])

# Separate features and target variable
X = df_encoded.drop('y', axis=1)
y = df_encoded['y']

# Create a RandomForestClassifier model
rf_model = RandomForestClassifier(random_state=42)

# Fit the model
rf_model.fit(X, y)

# Get feature importances
feature_importances = rf_model.feature_importances_

# Create a DataFrame for feature importances
feature_importance_df = pd.DataFrame({'Feature': X.columns, 'Importance': feature_importances})

# Sort the features by importance
feature_importance_df = feature_importance_df.sort_values('Importance', ascending=False)

feature_importance_df


| | Feature | Importance |
|---|---|---|
| 11 | duration | 0.269735 |
| 5 | balance | 0.118612 |
| 0 | age | 0.102628 |
| 9 | day | 0.101755 |
| 10 | month | 0.087414 |
| 1 | job | 0.053063 |
| 15 | poutcome | 0.047885 |
| 13 | pdays | 0.043582 |
| 12 | campaign | 0.043517 |
| 3 | education | 0.031151 |


**Observations:**

**Before Handling Outliers:**
Duration had the highest importance score (0.290557).
Default had the lowest importance score (0.003211).

**After Handling Outliers:**
Duration still has the highest score, but it is a bit lower (0.269735).
Default still has the lowest score, but it is a bit higher (0.004136).

**What Changed?**
Among the top 10 features, Balance, Day, Month, Job, Poutcome, Campaign, and Education gained importance after removing outliers (Education only marginally).
Duration, Age, and Pdays became a bit less important.
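
The direction of each change can be checked directly from the two importance tables. The sketch below copies the top-10 values shown in those tables into two series (`before` and `after` are illustrative names) and takes the difference.

```python
import pandas as pd

# Importances copied from the two tables above (top 10 features only)
before = pd.Series({
    "duration": 0.290557, "balance": 0.1058, "age": 0.105436, "day": 0.092174,
    "month": 0.083508, "pdays": 0.052059, "job": 0.050703, "poutcome": 0.045556,
    "campaign": 0.038802, "education": 0.030945,
})
after = pd.Series({
    "duration": 0.269735, "balance": 0.118612, "age": 0.102628, "day": 0.101755,
    "month": 0.087414, "job": 0.053063, "poutcome": 0.047885, "pdays": 0.043582,
    "campaign": 0.043517, "education": 0.031151,
})

# Positive values mean the feature gained importance after outlier removal
delta = (after - before).sort_values(ascending=False)
increased = delta[delta > 0].index.tolist()
decreased = delta[delta < 0].index.tolist()

print(delta.round(6))
print("Increased:", increased)
print("Decreased:", decreased)
```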

Train Models with Top 10 Features & Outlier Removed Dataset

In [74]:
# Select the top 10 important features based on the feature scores
top_features = feature_importance_df['Feature'][:10].tolist()

# Create a new feature matrix with only the top important features
X_train_top_features_outlier_rm = pd.DataFrame(X_scaled_no_outliers, columns=X.columns)[top_features]

# Initialize the Decision Tree Classifier
dt_model = DecisionTreeClassifier(random_state=42)

# Initialize the Naive Bayes Classifier
nb_model = GaussianNB()

# Perform 10-Fold Cross-Validation on the new Decision Tree model
cv_scores_dt_top_features = cross_val_score(dt_model, X_train_top_features_outlier_rm, y, cv=10)

# Calculate the mean and standard deviation of the cross-validation scores for the new Decision Tree model
cv_mean_dt_top_features = np.mean(cv_scores_dt_top_features)
cv_std_dt_top_features = np.std(cv_scores_dt_top_features)

# Perform 10-Fold Cross-Validation on the new Naive Bayes model
cv_scores_nb_top_features = cross_val_score(nb_model, X_train_top_features_outlier_rm, y, cv=10)

# Calculate the mean and standard deviation of the cross-validation scores for the new Naive Bayes model
cv_mean_nb_top_features = np.mean(cv_scores_nb_top_features)
cv_std_nb_top_features = np.std(cv_scores_nb_top_features)

cv_mean_dt_top_features, cv_std_dt_top_features, cv_mean_nb_top_features, cv_std_nb_top_features


(0.8743556954554398,
 0.013705997897794798,
 0.8600268870089843,
 0.02511212961936939)

**Models with Top 10 features with Outlier Removed:**

**Decision Tree Classifier:**
* Mean Accuracy: ~87.44%
* Standard Deviation: ~1.37%

**Naive Bayes Classifier:**
* Mean Accuracy: ~86.00%
* Standard Deviation: ~2.51%



---


**Models with Top 10 features without Outlier Removed:**


**Decision Tree Classifier:**
* Mean Accuracy: ~86.29%
* Standard Deviation: ~1.44%

**Naive Bayes Classifier:**
* Mean Accuracy: ~85.22%
* Standard Deviation: ~1.96%


---

**Observations:** With the top 10 features, mean accuracy again improved after removing outliers (Decision Tree: 86.29% to 87.44%; Naive Bayes: 85.22% to 86.00%).

#### Ensemble of Two Models on Top 10 Features of the Outlier-Removed Dataset

In [75]:
from sklearn.ensemble import VotingClassifier

# Initialize the Decision Tree Classifier
dt_model = DecisionTreeClassifier(random_state=42)

# Initialize the Naive Bayes Classifier
nb_model = GaussianNB()

# Initialize the ensemble model with Decision Tree and Naive Bayes
ensemble_model = VotingClassifier(estimators=[('dt', dt_model), ('nb', nb_model)],
                                  voting='soft',
                                  weights=[2, 1])

# Train the ensemble model on the training data with top features
# ensemble_model.fit(X_train_top_features, y_train_no_outliers)

# Perform 10-Fold Cross-Validation on the ensemble model
cv_scores_ensemble = cross_val_score(ensemble_model, X_train_top_features_outlier_rm, y, cv=10)

# Calculate the mean and standard deviation of the cross-validation scores for the ensemble model
cv_mean_ensemble = np.mean(cv_scores_ensemble)
cv_std_ensemble = np.std(cv_scores_ensemble)

cv_mean_ensemble, cv_std_ensemble


(0.8743556954554398, 0.013705997897794798)
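
The ensemble score is identical to the Decision Tree score on the same features. A likely explanation: an unpruned tree outputs class probabilities of exactly 0 or 1, so with `weights=[2, 1]` its vote always carries at least two thirds of the weighted average and Naive Bayes can never flip it. The sketch below illustrates this on synthetic data (an assumption, not the bank dataset).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption)
X_v, y_v = make_classification(n_samples=600, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_v, y_v, random_state=0)

# Standalone tree and a soft-voting ensemble that contains an identical tree
tree = DecisionTreeClassifier(random_state=42).fit(X_tr, y_tr)
vote = VotingClassifier(
    estimators=[("dt", DecisionTreeClassifier(random_state=42)), ("nb", GaussianNB())],
    voting="soft",
    weights=[2, 1],
).fit(X_tr, y_tr)

# An unpruned tree has pure leaves here, so its probabilities are 0 or 1
tree_proba = tree.predict_proba(X_te)
hard_probabilities = bool(np.isin(tree_proba, [0.0, 1.0]).all())

# The ensemble therefore predicts exactly what the tree predicts
same = np.array_equal(tree.predict(X_te), vote.predict(X_te))
print("tree probabilities are all 0/1:", hard_probabilities)
print("ensemble predictions equal tree predictions:", same)
```

Limiting the tree depth or setting `min_samples_leaf` above 1 gives the tree softer probabilities, which would let the second model influence the vote.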

#### Model Hyperparameter Tuning with the Optuna Framework

In [76]:
%pip install optuna



In [77]:
import optuna
from sklearn.model_selection import StratifiedKFold, cross_val_score

def objective(trial):
    # Specify the hyperparameter space
    classifier_name = trial.suggest_categorical('classifier', ['DecisionTree', 'NaiveBayes'])

    # Initialize k-Fold cross-validation
    kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)

    if classifier_name == 'DecisionTree':
        criterion = trial.suggest_categorical('criterion', ['gini', 'entropy'])
        max_depth = trial.suggest_int('max_depth', 10, 50)
        min_samples_split = trial.suggest_int('min_samples_split', 2, 15)
        min_samples_leaf = trial.suggest_int('min_samples_leaf', 1, 10)

        classifier_obj = DecisionTreeClassifier(
            criterion=criterion, max_depth=max_depth,
            min_samples_split=min_samples_split, min_samples_leaf=min_samples_leaf,
            random_state=42  # fixed seed so each trial is reproducible
        )
    else:
        var_smoothing = trial.suggest_float('var_smoothing', 1e-10, 1e-2, log=True)

        classifier_obj = GaussianNB(var_smoothing=var_smoothing)

    # Perform k-Fold cross-validation
    scores = cross_val_score(classifier_obj, X_train_top_features_outlier_rm, y, cv=kfold)

    # Return the mean of cross-validation scores as a metric to be optimized
    return scores.mean()

# Create a study object and specify the direction is maximize.
study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=50)


[I 2023-10-31 12:51:11,592] A new study created in memory with name: no-name-f6222b0c-ab14-4e7c-8b5c-caf1677b221e
[I 2023-10-31 12:51:11,759] Trial 0 finished with value: 0.8889441930618401 and parameters: {'classifier': 'DecisionTree', 'criterion': 'gini', 'max_depth': 46, 'min_samples_split': 3, 'min_samples_leaf': 5}. Best is trial 0 with value: 0.8889441930618401.
[I 2023-10-31 12:51:11,798] Trial 1 finished with value: 0.8590071480097056 and parameters: {'classifier': 'NaiveBayes', 'var_smoothing': 0.00034902452981797635}. Best is trial 0 with value: 0.8889441930618401.
[I 2023-10-31 12:51:11,836] Trial 2 finished with value: 0.8590071480097056 and parameters: {'classifier': 'NaiveBayes', 'var_smoothing': 3.832400703329542e-08}. Best is trial 0 with value: 0.8889441930618401.
[I 2023-10-31 12:51:11,872] Trial 3 finished with value: 0.8590071480097056 and parameters: {'classifier': 'NaiveBayes', 'var_smoothing': 0.000124178719135687}. Best is trial 0 with value: 0.8889441930618401.

In [78]:
# Get the best trial
best_trial = study.best_trial

# Get the best trial number
best_trial_number = best_trial.number

# Get the best parameters and the best score
best_params = best_trial.params
best_score = best_trial.value

print(f"Best Trial Number: {best_trial_number}")
print(f"Best Parameters: {best_params}")
print(f"Best Score: {best_score}")


Best Trial Number: 42
Best Parameters: {'classifier': 'DecisionTree', 'criterion': 'gini', 'max_depth': 47, 'min_samples_split': 5, 'min_samples_leaf': 8}
Best Score: 0.9019994753754345


In [79]:
# Initialize Decision Tree Classifier with the best hyperparameters reported above
best_dt_model = DecisionTreeClassifier(criterion='gini',
                                       max_depth=47,
                                       min_samples_split=5,
                                       min_samples_leaf=8,
                                       random_state=42)

# Train the model on the full dataset
best_dt_model.fit(X_train_top_features_outlier_rm, y)
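
One way to keep the final model in step with the study is to build it from the best trial's parameter dictionary instead of retyping the values. The sketch below uses a hypothetical helper, `build_best_model`, and assumes the dictionary has the same keys as the objective function above; the example parameters are the ones printed for the best trial.

```python
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

def build_best_model(params, random_state=42):
    """Create an unfitted classifier from an Optuna-style parameter dict.

    Hypothetical helper: it expects the 'classifier' key used in the
    objective above plus the hyperparameters of the chosen model.
    """
    params = dict(params)  # do not mutate the caller's dict
    name = params.pop("classifier")
    if name == "DecisionTree":
        return DecisionTreeClassifier(random_state=random_state, **params)
    if name == "NaiveBayes":
        return GaussianNB(**params)
    raise ValueError(f"Unknown classifier: {name}")

# Parameters reported for the best trial above
reported_params = {
    "classifier": "DecisionTree", "criterion": "gini",
    "max_depth": 47, "min_samples_split": 5, "min_samples_leaf": 8,
}
model = build_best_model(reported_params)
print(model)
```

In the notebook this would be called as `build_best_model(study.best_trial.params)` and then fitted on the selected features.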


#### Checkpoint, Label Encoder & Scaler Saving for Deployment in Streamlit [Top 10 Features, Outliers Removed]

Saving the model trained on the outlier-removed dataset with the top 10 features.

In [80]:
import joblib
joblib.dump(best_dt_model, 'best_dt_model.pkl')

['best_dt_model.pkl']

Saving the label encoders for the categorical features used in Streamlit

In [81]:
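import joblib
import pandas as pd
from sklearn.preprocessing import LabelEncoder

label_columns = ['job','education','month','poutcome']

# df was label-encoded in place earlier, so reload the raw data to fit
# the encoders on the original string categories
raw_df = pd.read_csv('/content/ML Assessment Dataset (Bank Data) - Sheet1.csv')

# Fit and save one LabelEncoder per categorical feature
label_encoders = {}
for column in label_columns:
    le = LabelEncoder()
    le.fit(raw_df[column])
    label_encoders[column] = le

joblib.dump(label_encoders, 'label_encoders.pkl')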


['label_encoders.pkl']

Saving the scaler for feature scaling in Streamlit

In [82]:
# Same feature order as the columns the model was trained on (top_features)
features = [ "duration", "balance", "age", "day", "month", "job", "poutcome", "pdays", "campaign", "education" ]

X_no_outliers_tmp = X_no_outliers[features]

X_no_outliers_tmp = scaler.fit_transform(X_no_outliers_tmp)

joblib.dump(scaler, 'scaler.pkl')


['scaler.pkl']
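
On the deployment side, the three saved artifacts need to be applied in the same order as in training: encode, scale, predict. A self-contained sketch of that round trip, assuming a tiny made-up training frame with only two features and temporary file names (none of these are the real artifacts):

```python
import os
import tempfile

import joblib
import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Tiny made-up training frame (assumption) with one categorical feature
train = pd.DataFrame({
    "job": ["admin", "student", "admin", "retired", "student", "retired"],
    "duration": [100, 900, 120, 800, 950, 90],
    "y": [0, 1, 0, 1, 1, 0],
})
feature_order = ["job", "duration"]

# Fit the three artifacts: encoders, scaler, model
encoders = {"job": LabelEncoder().fit(train["job"])}
encoded = train.assign(job=encoders["job"].transform(train["job"]))
scaler_demo = StandardScaler().fit(encoded[feature_order])
model_demo = DecisionTreeClassifier(random_state=42).fit(
    scaler_demo.transform(encoded[feature_order]), encoded["y"]
)

with tempfile.TemporaryDirectory() as tmp:
    for name, obj in [("enc", encoders), ("scaler", scaler_demo), ("model", model_demo)]:
        joblib.dump(obj, os.path.join(tmp, f"{name}.pkl"))

    # Inference side: load the artifacts back from disk
    enc = joblib.load(os.path.join(tmp, "enc.pkl"))
    sc = joblib.load(os.path.join(tmp, "scaler.pkl"))
    clf = joblib.load(os.path.join(tmp, "model.pkl"))

# Encode, scale, and predict one new row using the training feature order
row = pd.DataFrame([{"job": "student", "duration": 870}])
row["job"] = enc["job"].transform(row["job"])
prediction = clf.predict(sc.transform(row[feature_order]))[0]
print(prediction)
```

The key design point is that the column order passed to the scaler and the model at inference time matches the order used when they were fitted.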