# Model Selection and Training

## Introduction to Model Selection:


### Overview of Algorithms Considered

In this project, I explored several machine learning algorithms to build a predictive model on a processed dataset that combines interaction features, standardized numerical features, and one-hot encoded categorical features. The target is an imbalanced binary variable, with 0 as the majority class.

1. **Logistic Regression:**
   - **Performance:** Logistic Regression was one of the top-performing models, with a mean cross-validated ROC AUC of 0.817 after tuning. With `class_weight='balanced'` it copes with the class imbalance, and its simplicity and interpretability made it a strong candidate.
   - **Advantages:** Works well with standardized numerical features, and regularization keeps it stable under multicollinearity. The coefficients indicate the direction and relative strength of each feature's effect.
   - **Hyperparameter Tuning:** I used `GridSearchCV` with Stratified K-Fold (5 splits) to tune the regularization strength `C` and the penalty type, scoring by ROC AUC.

2. **Random Forest:**
   - **Performance:** After tuning, Random Forest performed on par with Logistic Regression, with a mean cross-validated ROC AUC of 0.817 (up from 0.791 with default parameters). It handled the imbalanced dataset and the mix of feature types well.
   - **Advantages:** Averaging many trees makes it less prone to overfitting than a single tree. It captures non-linear interactions between features and provides feature importance metrics.
   - **Hyperparameter Tuning:** As with Logistic Regression, `GridSearchCV` with Stratified K-Fold was used to tune `n_estimators`, `max_depth`, and `min_samples_split`.

3. **Decision Tree:**
   - **Performance:** Decision Tree performed poorly compared to Logistic Regression and Random Forest, with a cross-validated ROC AUC of 0.686. The model struggled with the imbalanced data and did not generalize well.
   - **Challenges:** An unpruned tree tends to overfit, especially in the presence of noise and complex feature interactions, and is less effective with imbalanced datasets.

4. **K-Nearest Neighbors (KNN):**
   - **Performance:** KNN also performed poorly, with a cross-validated ROC AUC of 0.740. The model was likely hindered by the high dimensionality introduced by the one-hot encoded features and struggled with the imbalanced target variable.
   - **Challenges:** Computationally expensive at prediction time for large datasets and high-dimensional spaces, sensitive to the choice of distance metric, and not well-suited for imbalanced data.

5. **Support Vector Machine (SVM):**
   - **Performance:** With default parameters, SVM reached a cross-validated ROC AUC of 0.794, but the computational cost of training it on this dataset made hyperparameter tuning impractical, so I did not pursue it further.
   - **Challenges:** While SVM is powerful for binary classification and can account for class imbalance through class weights, it can be computationally prohibitive on large or high-dimensional datasets.

**Summary:**
- **Best Performing Models:** Logistic Regression and Random Forest emerged as the top contenders, with nearly identical cross-validated ROC AUC scores (about 0.817 each). Both models were tuned using `GridSearchCV` with Stratified K-Fold.
- **Poor Performing Models:** Decision Tree and KNN were less effective (cross-validated ROC AUC of 0.686 and 0.740) due to overfitting and difficulty handling the dataset's dimensionality and imbalance.
- **Computational Limitations:** SVM was not tuned further due to its high computational cost.

This comparison identified Logistic Regression and Random Forest as the most suitable models for the task, balancing performance, interpretability, and computational efficiency.
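The comparison described above can be sketched as a single loop that scores every candidate with the same stratified folds. This is a minimal illustration, not the notebook's actual code: the synthetic data from `make_classification`, the reduced `n_estimators`, and the names `X_demo`, `y_demo`, `candidates`, and `cv_summary` are all assumptions made so the example runs on its own.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the processed dataset (assumption): roughly 12% positives
X_demo, y_demo = make_classification(
    n_samples=600, n_features=10, weights=[0.88, 0.12], random_state=0
)

# Hypothetical candidate set mirroring the algorithms discussed above
candidates = {
    "logistic_regression": LogisticRegression(class_weight="balanced", max_iter=1000),
    "decision_tree": DecisionTreeClassifier(class_weight="balanced", random_state=42),
    "random_forest": RandomForestClassifier(
        n_estimators=50, class_weight="balanced", random_state=42
    ),
    "knn": KNeighborsClassifier(),
}

# One shared splitter so every model is scored on identical folds
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

cv_summary = {}
for name, model in candidates.items():
    scores = cross_val_score(model, X_demo, y_demo, cv=cv, scoring="roc_auc")
    cv_summary[name] = (scores.mean(), scores.std())

# Print candidates from best to worst mean ROC AUC
for name, (mean_auc, std_auc) in sorted(
    cv_summary.items(), key=lambda item: item[1][0], reverse=True
):
    print(f"{name:20s} {mean_auc:.3f} +/- {std_auc:.3f}")
```

Sharing one splitter keeps the comparison fair, since every model sees exactly the same train/validation partitions.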

## Data Loading

In [5]:
import pandas as pd
import numpy as np

# Load the dataset
data = pd.read_csv('../data/bank-additional/bank_processed_data.csv')

data.shape

(41176, 77)

In [6]:
X = data.drop(['y'], axis=1)
y = data['y']

from sklearn.preprocessing import LabelEncoder
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(y)

print(X.shape)
print(y.shape)

(41176, 76)
(41176,)


In [7]:
# Split the data into train and test set
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    stratify=y,
    random_state=42)

print("X Train:", X_train.shape)
print("X Test:", X_test.shape)
print("Y Train:", y_train.shape)
print("Y Test:", y_test.shape)

X Train: (32940, 76)
X Test: (8236, 76)
Y Train: (32940,)
Y Test: (8236,)
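Because the split uses `stratify=y`, the positive-class rate should be nearly identical in the train and test sets. The check below is a small sketch on synthetic labels; the arrays `X_demo` and `y_demo` and the 11% positive rate are assumptions for illustration, not values taken from the dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic imbalanced labels (assumption): about 11% positives
rng = np.random.default_rng(42)
y_demo = (rng.random(1000) < 0.11).astype(int)
X_demo = rng.normal(size=(1000, 3))

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

# With stratification, each subset keeps (almost exactly) the overall class rate
overall_rate = y_demo.mean()
train_rate = y_tr.mean()
test_rate = y_te.mean()
print(f"overall={overall_rate:.3f} train={train_rate:.3f} test={test_rate:.3f}")
```

The same comparison can be run on the notebook's own `y`, `y_train`, and `y_test` arrays.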


## Baseline Model:
Train a majority-class baseline to establish a reference ROC AUC that any real model must beat.

In [8]:
from sklearn.dummy import DummyClassifier
baseline_model = DummyClassifier(strategy='most_frequent')
baseline_model.fit(X_train, y_train)

# Predicted class probabilities (constant for every sample with this strategy)
y_proba = baseline_model.predict_proba(X_test)

# Print AUC score
from sklearn.metrics import roc_auc_score
print("AUC score: %.3f" % roc_auc_score(y_test, y_proba[:, 1]))

AUC score: 0.500
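The 0.500 above follows from how the baseline works: a `most_frequent` classifier assigns the same score to every sample, so it cannot rank positives above negatives. The toy example below illustrates this with a hand-made label array (an assumption for illustration only), and also shows why accuracy is a misleading metric on imbalanced data.

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

# Toy imbalanced labels (assumption): 8 negatives, 2 positives
y_true = np.array([0, 0, 0, 0, 1, 0, 0, 1, 0, 0])

# A majority-class predictor gives every sample the same positive-class score
constant_scores = np.zeros(len(y_true))

auc_constant = roc_auc_score(y_true, constant_scores)
acc_constant = accuracy_score(y_true, constant_scores.astype(int))

print(f"ROC AUC:  {auc_constant:.3f}")   # no ranking ability
print(f"Accuracy: {acc_constant:.3f}")   # looks good only because of the imbalance
```

This is why ROC AUC, rather than accuracy, is used as the scoring metric throughout this section.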



## Training Models:
Train each candidate with near-default parameters (plus `class_weight='balanced'` where supported) and compare cross-validated ROC AUC:

In [4]:
# Logistic Regression
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression(class_weight='balanced',
                        max_iter=10000,
                        random_state=42)

# Define the Stratified K-Fold cross-validator
stratified_kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

scores = cross_val_score(lr,
                         X_train, y_train, 
                         cv=stratified_kfold, 
                         scoring='roc_auc')

print(f"CV ROC AUC Score: {np.mean(scores):.3f} +/- {np.std(scores):.3f}")

CV ROC AUC Score: 0.817 +/- 0.008


In [5]:
# Decision Tree
from sklearn.tree import DecisionTreeClassifier

tree_model = DecisionTreeClassifier(random_state=42, 
                                    class_weight='balanced')

# Define the Stratified K-Fold cross-validator
stratified_kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

scores = cross_val_score(tree_model,
                         X_train, y_train, 
                         cv=stratified_kfold, 
                         scoring='roc_auc')

print(f"CV ROC AUC Score: {np.mean(scores):.3f} +/- {np.std(scores):.3f}")

CV ROC AUC Score: 0.686 +/- 0.011


In [6]:
# Random forest
from sklearn.ensemble import RandomForestClassifier

forest = RandomForestClassifier(random_state=42, 
                                class_weight='balanced')

# Define the Stratified K-Fold cross-validator
stratified_kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

scores = cross_val_score(forest,
                         X_train, y_train, 
                         cv=stratified_kfold, 
                         scoring='roc_auc')

print(f"CV ROC AUC Score: {np.mean(scores):.3f} +/- {np.std(scores):.3f}")

CV ROC AUC Score: 0.791 +/- 0.010


In [7]:
# K Nearest Neighbors
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier()

# Define the Stratified K-Fold cross-validator
stratified_kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

scores = cross_val_score(knn,
                         X_train, y_train, 
                         cv=stratified_kfold, 
                         scoring='roc_auc')

print(f"CV ROC AUC Score: {np.mean(scores):.3f} +/- {np.std(scores):.3f}")

CV ROC AUC Score: 0.740 +/- 0.009


In [10]:
# Support Vector Machine (reuses stratified_kfold defined in the cells above)
from sklearn.svm import SVC

svm = SVC(class_weight='balanced', random_state=1)

scores = cross_val_score(svm,
                         X_train, y_train, 
                         cv=stratified_kfold, 
                         scoring='roc_auc')

print(f"CV ROC AUC Score: {np.mean(scores):.3f} +/- {np.std(scores):.3f}")

CV ROC AUC Score: 0.794 +/- 0.008


## Hyperparameter Tuning:
Grid search for hyperparameter optimization of best performing models.

In [9]:
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.linear_model import LogisticRegression

# Initialize the Logistic Regression model
lr = LogisticRegression(max_iter=10000, 
                        class_weight='balanced', 
                        random_state=42)

# Define the Stratified K-Fold cross-validator
stratified_kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Define the hyperparameters and their values to search
param_grid = {'C': np.logspace(-4, 4, 20),
              'solver': ['liblinear'],
              'penalty': ['l2', 'l1']}

# Set up GridSearchCV
gs = GridSearchCV(estimator=lr,
                  param_grid=param_grid,
                  scoring='roc_auc',
                  cv=stratified_kfold,
                  n_jobs=-1)

gs_lr = gs.fit(X_train, y_train)

# best_model = gs.best_estimator_

# best_score_ is the mean cross-validated score of the best candidate,
# not a score computed on the full training set
summary = f"""
Best C parameter: {gs_lr.best_params_['C']}
Best solver parameter: {gs_lr.best_params_['solver']}
Best penalty parameter: {gs_lr.best_params_['penalty']}
Mean CV ROC AUC: {gs_lr.best_score_}
"""
print(summary)


Best C parameter: 1.623776739188721
Best solver parameter: liblinear
Best penalty parameter: l1
Mean CV ROC AUC: 0.8174873112286329



In [10]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import StratifiedKFold

# Initialize Random Forest Model
forest = RandomForestClassifier(criterion='gini', class_weight='balanced', random_state=42)

# Define the hyperparameters to search over
param_grid = {
    'n_estimators': [280, 300, 320],
    'max_depth': [12, 15, 17],
    'min_samples_split': [11, 13, 15]}

# Define the Stratified K-Fold cross-validator
stratified_kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Set up GridSearchCV with Stratified K-Fold and ROC AUC as the scoring metric
gs = GridSearchCV(estimator=forest, 
                  param_grid=param_grid, 
                  cv=stratified_kfold, 
                  scoring='roc_auc',
                  n_jobs=-1)

gs_forest = gs.fit(X_train, y_train)

# best_score_ is the mean cross-validated score of the best candidate,
# not a score computed on the full training set
summary = f"""
Best n_estimators parameter: {gs_forest.best_params_['n_estimators']}
Best max_depth parameter: {gs_forest.best_params_['max_depth']}
Best min_samples_split parameter: {gs_forest.best_params_['min_samples_split']}
Mean CV ROC AUC: {gs_forest.best_score_}
"""

print(summary)


Best n_estimators parameter: 300
Best max_depth parameter: 15
Best min_samples_split parameter: 13
Mean CV ROC AUC: 0.8173020725332496
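Since the two tuned scores differ only in the fourth decimal place, it can help to look at the fold-to-fold spread of each candidate rather than the best mean alone. The sketch below shows one way to read this from `cv_results_`. It runs on synthetic data with a deliberately small grid; the names `X_demo`, `y_demo`, `search`, and `results` are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic stand-in for the training data (assumption)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=8, weights=[0.88, 0.12], random_state=0
)

search = GridSearchCV(
    estimator=LogisticRegression(class_weight="balanced", max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},
    scoring="roc_auc",
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
)
search.fit(X_demo, y_demo)

# One row per candidate: mean and standard deviation of ROC AUC across folds
results = (
    pd.DataFrame(search.cv_results_)[
        ["param_C", "mean_test_score", "std_test_score", "rank_test_score"]
    ]
    .sort_values("rank_test_score")
    .reset_index(drop=True)
)
print(results)

# The top-ranked row carries the same value that best_score_ reports
best_row = results.iloc[0]
```

When the gap between two models is smaller than `std_test_score`, the ranking between them is not reliable, and secondary criteria such as interpretability or training cost are reasonable tie-breakers.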



In [11]:
import joblib

# Save the best models
joblib.dump(gs_lr.best_estimator_, 'best_logistic_model.joblib')
joblib.dump(gs_forest.best_estimator_, 'best_forest_model.joblib')

['best_forest_model.joblib']
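A saved estimator can later be restored with `joblib.load` and used for prediction without retraining. The round trip below is a minimal sketch: it fits a small model on synthetic data and writes to a temporary directory, so the file name `demo_model.joblib` and the data are assumptions rather than the notebook's artifacts.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Small synthetic model standing in for the tuned estimators (assumption)
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "demo_model.joblib")  # hypothetical file name
    joblib.dump(model, model_path)
    restored = joblib.load(model_path)

# The restored estimator should reproduce the original predictions
same_predictions = np.allclose(
    model.predict_proba(X_demo), restored.predict_proba(X_demo)
)
print("Restored model reproduces predictions:", same_predictions)
```

Saved models should be loaded with the same scikit-learn version that produced them, since the serialized format is not guaranteed to be compatible across versions.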