# Baseline Model Comparison: Evaluating Performance for Optimal Tuning

* This notebook compares baseline Random Forest, Logistic Regression, and XGBoost classifiers to identify the best candidate for hyperparameter tuning in the customer promotion prediction task.

In [51]:
# Import necessary libraries for data processing and modeling
import numpy as np
import pandas as pd

# Import machine learning models
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from xgboost import XGBClassifier

# Import functions for model evaluation
from sklearn.model_selection import train_test_split

# Import sys module to modify the Python path
import sys
sys.path.append('../src')  # Add the '../src' directory to the module search path

# Import custom functions for scoring and testing results
from test_results import score, test_results

In [52]:
# Load the training dataset
train_data = pd.read_csv("../data/training.csv")

# Load the test dataset
test_data = pd.read_csv("../data/Test.csv")

# Display the first few rows of the training dataset
train_data.head()

Unnamed: 0,ID,Promotion,purchase,V1,V2,V3,V4,V5,V6,V7
0,1,No,0,2,30.443518,-1.165083,1,1,3,2
1,3,No,0,3,32.15935,-0.645617,2,3,2,2
2,4,No,0,2,30.431659,0.133583,1,1,4,2
3,5,No,0,0,26.588914,-0.212728,2,1,4,2
4,8,Yes,0,3,28.044331,-0.385883,1,1,2,2


In [69]:
# Define Group A (no promotion) as g1 and Group B (with promotion) as g2
g1 = train_data[(train_data.Promotion == "No")]
g2 = train_data[(train_data.Promotion == "Yes")]

# Build the feature matrix and target using only data from Group B (customers who received the promotion)
X = g2[["V1", "V2", "V3", "V4", "V5", "V6", "V7"]]  # Features
y = g2["purchase"]  # Target variable

# Perform train-test split with 70% training data and 30% testing data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

In [70]:
# Create baseline models that compensate for class imbalance through class weighting

# Initialize and fit a Random Forest classifier with balanced class weights
rf = RandomForestClassifier(class_weight="balanced", random_state=42)
rf.fit(X_train, y_train)

# Initialize and fit a Logistic Regression model with balanced class weights
lr = LogisticRegression(class_weight="balanced", random_state=42)
lr.fit(X_train, y_train)

# Calculate the scale_pos_weight for XGBoost to handle class imbalance
scale_pos_weight = len(y_train[y_train == 0]) / len(y_train[y_train == 1])

# Initialize and fit an XGBoost classifier with calculated scale_pos_weight
xgb_clf = XGBClassifier(scale_pos_weight=scale_pos_weight, random_state=42)
xgb_clf.fit(X_train, y_train)
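
To make the two imbalance corrections above concrete, the sketch below reproduces them with NumPy on a made-up label vector. The labels are purely illustrative and are not the notebook's data. The formula `n_samples / (n_classes * np.bincount(y))` is the one the scikit-learn documentation gives for `class_weight="balanced"`.

```python
import numpy as np

# Hypothetical labels for illustration only: 95 non-purchases, 5 purchases
y_demo = np.array([0] * 95 + [1] * 5)

# Per-class weights as scikit-learn derives them for class_weight="balanced"
counts = np.bincount(y_demo)
balanced_weights = len(y_demo) / (len(counts) * counts)

# Negative-to-positive ratio, computed the same way as scale_pos_weight above
scale_pos_weight_demo = counts[0] / counts[1]

print("balanced class weights:", balanced_weights)
print("scale_pos_weight:", scale_pos_weight_demo)
```

In this toy case both corrections weight a purchase 19 times as heavily as a non-purchase, so the three baselines treat the minority class on a comparable footing.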

In [71]:
# define results for storing model baselines
results = {
    "Model": [],
    "IRR": [],
    "NIR": []
}

In [72]:
# baseline of Random Forest Classifier
def promotion_strategy(df):
    y_pred = rf.predict(df)

    # use numpy.where to replace 1 with "Yes", else with "No"
    promotion = np.where(y_pred == 1, "Yes", "No")

    return promotion

irr, nir = test_results(promotion_strategy)
results["Model"].append("Random Forest Classifier")
results["IRR"].append(irr)
results["NIR"].append(nir)

# baseline of LogisticRegression
def promotion_strategy(df):
    y_pred = lr.predict(df)

    # use numpy.where to replace 1 with "Yes", else with "No"
    promotion = np.where(y_pred == 1, "Yes", "No")

    return promotion

irr, nir = test_results(promotion_strategy)
results["Model"].append("Logistic Regression")
results["IRR"].append(irr)
results["NIR"].append(nir)

# baseline of XGBoost Classifier
def promotion_strategy(df):
    y_pred = xgb_clf.predict(df)

    # use numpy.where to replace 1 with "Yes", else with "No"
    promotion = np.where(y_pred == 1, "Yes", "No")

    return promotion

irr, nir = test_results(promotion_strategy)
results["Model"].append("XGBoost Classifier")
results["IRR"].append(irr)
results["NIR"].append(nir)

Nice job!  See how well your strategy worked on our test data below!

Your irr with this strategy is 0.0000.

Your nir with this strategy is -2.40.
We came up with a model with an irr of 0.0188 and an nir of 189.45 on the test set.

 How did you do?
Nice job!  See how well your strategy worked on our test data below!

Your irr with this strategy is 0.0147.

Your nir with this strategy is -32.40.
We came up with a model with an irr of 0.0188 and an nir of 189.45 on the test set.

 How did you do?
Nice job!  See how well your strategy worked on our test data below!

Your irr with this strategy is 0.0189.

Your nir with this strategy is 56.00.
We came up with a model with an irr of 0.0188 and an nir of 189.45 on the test set.

 How did you do?
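
The three `promotion_strategy` definitions above differ only in the model they call. One possible way to remove the repetition is a small factory function. This is a sketch, not part of the original notebook: `make_promotion_strategy` and `ThresholdModel` are hypothetical names, and the stand-in model exists only so the example runs without the fitted classifiers.

```python
import numpy as np

def make_promotion_strategy(model):
    """Return a strategy function that maps the model's 0/1 predictions to "No"/"Yes"."""
    def promotion_strategy(df):
        y_pred = model.predict(df)
        return np.where(y_pred == 1, "Yes", "No")
    return promotion_strategy

class ThresholdModel:
    """Hypothetical stand-in for a fitted classifier: predicts 1 when the first feature is positive."""
    def predict(self, X):
        return (np.asarray(X)[:, 0] > 0).astype(int)

strategy = make_promotion_strategy(ThresholdModel())
demo_decisions = strategy(np.array([[1.0, 2.0], [-1.0, 0.5]]))
print(demo_decisions)
```

With the notebook's models, a call such as `test_results(make_promotion_strategy(rf))` could replace each hand-written definition, assuming `test_results` only needs a callable that accepts the feature DataFrame.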


In [73]:
# Create a DataFrame from the results dictionary
results_df = pd.DataFrame(results)

results_df

Unnamed: 0,Model,IRR,NIR
0,Random Forest Classifier,0.0,-2.4
1,Logistic Regression,0.014738,-32.4
2,XGBoost Classifier,0.018874,56.0


In [74]:
# save model baseline results as a CSV file
results_df.to_csv('../reports/baselines/model_baseline.csv', index=False)

## Based on our analysis of the baseline models, we observe the following:

* The Random Forest Classifier yields an IRR of 0.000000 and an NIR of -2.40. It produces no lift in purchase rate over the control group and a small net loss, so it is not suitable for our promotional strategy in its baseline form.

* The Logistic Regression model reaches an IRR of 0.014738 but has the lowest NIR of the three, -32.40. It identifies some responsive customers, yet its negative NIR means the strategy loses money overall.

* The XGBoost Classifier performs best, with an IRR of 0.018874 and the only positive NIR, 56.00. Its IRR is on par with the reference model's 0.0188, although its NIR remains well below the reference value of 189.45.

<b>Given these metrics, we will proceed with the XGBoost Classifier for further hyperparameter tuning. It is the strongest baseline, and tuning offers room to close the NIR gap to the reference model.</b>
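
The comparison above rests on two metrics that `test_results` computes but this notebook never defines. The sketch below assumes the definitions commonly used with this promotion exercise: among the customers a strategy selects, IRR is the purchase rate of those who received the promotion minus the purchase rate of those who did not, and NIR is the revenue from promoted purchasers minus promotion costs minus the revenue from non-promoted purchasers. The 10.0 revenue and 0.15 cost figures, and the function name `irr_nir`, are assumptions and are not taken from `test_results`.

```python
def irr_nir(n_treat, n_treat_purch, n_ctrl, n_ctrl_purch,
            revenue_per_purchase=10.0, cost_per_promotion=0.15):
    """Compute IRR and NIR from group counts, under the assumed definitions."""
    # Incremental response rate: difference in purchase rates between groups
    irr = n_treat_purch / n_treat - n_ctrl_purch / n_ctrl
    # Net incremental revenue: promoted revenue minus promotion cost minus control revenue
    nir = (revenue_per_purchase * n_treat_purch
           - cost_per_promotion * n_treat
           - revenue_per_purchase * n_ctrl_purch)
    return irr, nir

# Made-up counts for illustration only, not the notebook's test data
irr_demo, nir_demo = irr_nir(n_treat=1000, n_treat_purch=30,
                             n_ctrl=1000, n_ctrl_purch=10)
print(f"IRR = {irr_demo:.4f}, NIR = {nir_demo:.2f}")
```

Under these assumptions, a strategy can have a positive IRR and still a negative NIR when the promotions it sends cost more than the extra purchases bring in, which matches the pattern seen for Logistic Regression.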

