# Final Course Group Project – Insurance Purchase Prediction

**Course:** BZAN 6357 – Business Analytics with Python  
**Project Type:** Supervised ML (Classification)  
**Template generated:** 2025-10-30

## Team
- Aditya Boghara 
- Meghana

## Deliverables
Submit a single zip with:  
1) This notebook (fully executed).  
2) `my_prediction.csv` with **exactly** 3 columns: `id_new`, `probability`, `classification`.

## 1) Introduction & Objective
- **Background:** Cross-sell *car insurance* to existing medical policyholders.
- **Objective:** Predict purchase probability (1=purchased, 0=not purchased) and classify Score data.
- **Evaluation:** AUC-ROC and F1 score on held-out test; clarity and rigor of this notebook.
- **Approach (summary):** Data prep → EDA → Modeling (baseline → tuned) → Evaluation → Score file export.

## 2) Setup
Fill in project constants and file paths if needed.

In [55]:
# === Project constants ===
RANDOM_STATE = 42
TEST_SIZE = 0.2  # 20% test split
N_FOLDS = 5  # 5- or 10-fold CV recommended

# File names expected by the project
TRAIN_FILE = "bzan6357_insurance_3_TRAINING.csv"
SCORE_FILE = "bzan6357_insurance_3_SCORE.csv"
SUBMIT_FILE = "my_prediction.csv"  # must contain: id_new, probability, classification


## 3) Imports
Only add libraries you actually use.

In [56]:
import pandas as pd, numpy as np
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.metrics import roc_auc_score, f1_score, roc_curve, confusion_matrix
import tensorflow as tf
from tensorflow.keras import layers, Sequential
from sklearn.metrics import (
    accuracy_score,
    classification_report,
    ConfusionMatrixDisplay,
    precision_recall_curve,
)
from scikeras.wrappers import KerasClassifier
from sklearn.utils.class_weight import compute_class_weight

## 4) Data Load & Quick Audit
If files are missing, you'll see a helpful message instead of a crash.

In [57]:
# Load data (paths are already set above)
df_train = pd.read_csv(TRAIN_FILE)
df_score = pd.read_csv(SCORE_FILE)

df_train.head()

y = df_train["buy"]

print(y.value_counts())
print(y.value_counts(normalize=True) * 100)

print("Shape of df_train", df_train.shape)
print("Shape of df_score", df_score.shape)


buy
0    16705
1     3755
Name: count, dtype: int64
buy
0    81.647116
1    18.352884
Name: proportion, dtype: float64
Shape of df_train (20460, 12)
Shape of df_score (2000, 11)


## 5) Basic EDA (brief)
Keep this concise and focused on modeling decisions.

**Suggested checks:**
- Target balance (`buy`).  
- Distributions of numeric features (e.g., `age`, `tenure`, `v_prem_quote`).  
- Cardinality of `region`, `cs_rep`.  
- Categorical value ranges (`gender`, `v_age`, `v_accident`).

In [58]:
# Target and features
y = df_train["buy"].astype(int)
X = df_train.drop(columns=["buy"])


X = X.drop(columns=["id_new"])


# Identify feature types
numeric_features = X.select_dtypes(include=["int64", "float64"]).columns.tolist()
categorical_features = X.select_dtypes(exclude=["int64", "float64"]).columns.tolist()


print(numeric_features)
print(categorical_features)


score_ids = df_score["id_new"].copy()
X_score = df_score.drop(columns=["id_new"])

numeric_features = [c for c in X.select_dtypes(include=["int64", "float64"]).columns]
categorical_features = [
    c for c in X.select_dtypes(exclude=["int64", "float64"]).columns
]

preprocessor = ColumnTransformer(
    transformers=[
        ("OneHotEncoder", OneHotEncoder(drop="first"), categorical_features),
        ("StandardScaler", StandardScaler(with_mean=False), numeric_features),
    ]
)

print("Shape of df_train", X.shape)
print("Shape of df_score", X_score.shape)


['age', 'tenure', 'region', 'dl', 'has_v_insurance', 'v_prem_quote', 'cs_rep']
['gender', 'v_age', 'v_accident']
Shape of df_train (20460, 10)
Shape of df_score (2000, 10)


## 6) Preprocessing (Pipelines)
Use a **ColumnTransformer** so the *same* steps can be reused for TEST and SCORE.

**Notes:**
- Treat high-cardinality IDs (e.g., `region`, `cs_rep`) with One-Hot (can be large) or try frequency encoding.
- One-Hot encode: `gender`, `v_age`, `v_accident`, `region`, `cs_rep`.
- Scale numeric features as needed for certain models.

In [59]:
X = preprocessor.fit_transform(X)
X_score = preprocessor.transform(X_score)

X_score


array([[1.        , 0.        , 0.        , ..., 0.        , 1.39816747,
        9.7316272 ],
       [0.        , 1.        , 0.        , ..., 0.        , 2.3758078 ,
        9.19429809],
       [1.        , 0.        , 0.        , ..., 2.23570222, 1.64180453,
        7.28379459],
       ...,
       [1.        , 0.        , 0.        , ..., 2.23570222, 3.62885928,
        7.28379459],
       [1.        , 0.        , 0.        , ..., 0.        , 1.95069579,
        7.28379459],
       [1.        , 0.        , 1.        , ..., 0.        , 2.10455499,
        7.28379459]], shape=(2000, 11))

## 7) Train/Test Split
Stratify on `buy` to preserve class balance.

In [60]:
X_tr, X_va, y_tr, y_va = train_test_split(
    X, y, test_size=0.2, random_state=RANDOM_STATE
)

In [61]:
from imblearn.over_sampling import SMOTE

# 2. Apply SMOTE only on the training set
sm = SMOTE(random_state=42)
X_tr_res, y_tr_res = sm.fit_resample(X_tr, y_tr)


print("Before SMOTE:", X_tr.shape, y_tr.shape)
print("After SMOTE:", X_tr_res.shape, y_tr_res.shape)

print("value count before SMOTE:", y_tr.value_counts())
print("value count After SMOTE:", y_tr_res.value_counts())



Before SMOTE: (16368, 11) (16368,)
After SMOTE: (26786, 11) (26786,)
value count before SMOTE: buy
0    13393
1     2975
Name: count, dtype: int64
value count After SMOTE: buy
0    13393
1    13393
Name: count, dtype: int64


## 8) Baseline Models
Start with a few solid baselines and compare AUC/F1.

In [None]:
def build_sequential( input_dim: int, lr: float = 1e-3, dropout: float = 0.2) -> tf.keras.Model:
    """Binary-classification MLP in pure tf.keras Sequential."""
    model = Sequential(
        [
            layers.Input(shape=(input_dim,)),
            layers.Dense(256, activation="relu"),
            layers.BatchNormalization(),
            layers.Dropout(dropout),
            layers.Dense(128, activation="relu"),
            layers.BatchNormalization(),
            layers.Dropout(dropout),
            layers.Dense(64, activation="relu"),
            layers.BatchNormalization(),
            layers.Dropout(dropout / 2),
            layers.Dense(1, activation="sigmoid"),
        ]
    )
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=lr),
        loss="binary_crossentropy",
        metrics=[
            tf.keras.metrics.BinaryAccuracy(name="accuracy"),
            tf.keras.metrics.AUC(name="auc"),
        ],
    )
    return model


from tensorflow.keras.callbacks import EarlyStopping

early_stopping_callback = EarlyStopping(
    monitor="val_loss", patience=10, restore_best_weights=True
)


Using SMOTE

In [63]:
model = build_sequential(input_dim=X_tr_res.shape[1], lr=1e-3, dropout=0.2)
hist = model.fit(
    X_tr_res,
    y_tr_res,
    validation_data=(X_va, y_va),
    epochs=50,
    batch_size=512,
    callbacks=[early_stopping_callback],
    verbose=1
)


proba_va = model.predict(X_va).ravel()
prec, rec, thr = precision_recall_curve(y_va, proba_va)
f1_vals = (2 * prec * rec) / (prec + rec + 1e-12)
best_idx = int(np.nanargmax(f1_vals))
best_thr = float(thr[max(best_idx - 1, 0)]) if best_idx < len(thr) else 0.5


pred_va = (proba_va >= best_thr).astype(int)
print(f"\nValidation AUC: {roc_auc_score(y_va, proba_va):.4f}")
print(f"Best threshold (F1): {best_thr:.4f}")
print(f"Validation F1: {f1_score(y_va, pred_va):.4f}")
print("Confusion Matrix (val):\n", confusion_matrix(y_va, pred_va))
print(
    "\nClassification Report (val):\n", classification_report(y_va, pred_va, digits=4)
)


Epoch 1/50
[1m53/53[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 9ms/step - accuracy: 0.6855 - auc: 0.7288 - loss: 0.5882 - val_accuracy: 0.8096 - val_auc: 0.7395 - val_loss: 0.5644
Epoch 2/50
[1m53/53[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 6ms/step - accuracy: 0.7059 - auc: 0.7564 - loss: 0.5444 - val_accuracy: 0.8079 - val_auc: 0.7345 - val_loss: 0.5154
Epoch 3/50
[1m53/53[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 5ms/step - accuracy: 0.7137 - auc: 0.7634 - loss: 0.5361 - val_accuracy: 0.5565 - val_auc: 0.7359 - val_loss: 0.6033
Epoch 4/50
[1m53/53[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 5ms/step - accuracy: 0.7164 - auc: 0.7687 - loss: 0.5298 - val_accuracy: 0.5061 - val_auc: 0.7372 - val_loss: 0.6608
Epoch 5/50
[1m53/53[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 7ms/step - accuracy: 0.7192 - auc: 0.7740 - loss: 0.5253 - val_accuracy: 0.5227 - val_auc: 0.7498 - val_loss: 0.6534
Epoch 6/50
[1m53/53[0m [32m━━━━━━━━━━

Using Class Weights!!

In [70]:
from sklearn.utils import class_weight


class_weights = class_weight.compute_class_weight(
    'balanced',
    classes=np.unique(y_tr),
    y=y_tr
)

print(class_weights)



[0.61106548 2.75092437]


In [None]:

from sklearn.utils.class_weight import compute_class_weight

classes = np.unique(y_tr)
w = compute_class_weight('balanced', classes=classes, y=y_tr)
scale_pos = 0.7 
cw_soft = {int(c): (float(w[i]) * (scale_pos if c==1 else 1.0))
           for i, c in enumerate(classes)}



print(cw_soft)

early_stopping_callback = EarlyStopping(monitor='val_auc', mode='min',
                                        patience=5, restore_best_weights=True)

model1 = build_sequential(input_dim=X_tr.shape[1], lr=1e-3, dropout=0.2)
hist = model1.fit(
    X_tr,
    y_tr,
    validation_data=(X_va, y_va),
    epochs=50,
    batch_size=512,
    callbacks=[early_stopping_callback],
    verbose=1, 
    class_weight = cw_soft
)


proba_va = model1.predict(X_va).ravel()
prec, rec, thr = precision_recall_curve(y_va, proba_va)
f1_vals = (2 * prec * rec) / (prec + rec + 1e-12)


pred_va = (proba_va >= 0.7).astype(int)
print(f"\nValidation AUC: {roc_auc_score(y_va, proba_va):.4f}")
print(f"Validation F1: {f1_score(y_va, pred_va):.4f}")
print("Confusion Matrix (val):\n", confusion_matrix(y_va, pred_va))
print(
    "\nClassification Report (val):\n", classification_report(y_va, pred_va, digits=4)
)


{0: 0.6110654819681923, 1: 1.9256470588235295}
Epoch 1/50
[1m32/32[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 12ms/step - accuracy: 0.5672 - auc: 0.6917 - loss: 0.5668 - val_accuracy: 0.5281 - val_auc: 0.7138 - val_loss: 0.6701
Epoch 2/50
[1m32/32[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 6ms/step - accuracy: 0.6114 - auc: 0.7298 - loss: 0.5046 - val_accuracy: 0.6029 - val_auc: 0.6902 - val_loss: 0.6130
Epoch 3/50
[1m32/32[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 7ms/step - accuracy: 0.6332 - auc: 0.7418 - loss: 0.4826 - val_accuracy: 0.7520 - val_auc: 0.7148 - val_loss: 0.5490
Epoch 4/50
[1m32/32[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 7ms/step - accuracy: 0.6529 - auc: 0.7508 - loss: 0.4719 - val_accuracy: 0.7571 - val_auc: 0.7491 - val_loss: 0.5343
Epoch 5/50
[1m32/32[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 6ms/step - accuracy: 0.6568 - auc: 0.7484 - loss: 0.4701 - val_accuracy: 0.7036 - val_auc: 0.7476 - val_loss:

  _warn_prf(average, modifier, f"{metric.capitalize()} is", result.shape[0])
  _warn_prf(average, modifier, f"{metric.capitalize()} is", result.shape[0])
  _warn_prf(average, modifier, f"{metric.capitalize()} is", result.shape[0])


In [69]:
proba_va = model1.predict(X).ravel()
prec, rec, thr = precision_recall_curve(y, proba_va)
f1_vals = (2 * prec * rec) / (prec + rec + 1e-12)


pred_va = (proba_va >= 0.7).astype(int)
print(f"\nValidation AUC: {roc_auc_score(y, proba_va):.4f}")
print(f"Validation F1: {f1_score(y, pred_va):.4f}")
print("Confusion Matrix (val):\n", confusion_matrix(y, pred_va))
print(
    "\nClassification Report (val):\n", classification_report(y, pred_va, digits=4)
)

[1m640/640[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 394us/step

Validation AUC: 0.6920
Validation F1: 0.0000
Confusion Matrix (val):
 [[16705     0]
 [ 3755     0]]

Classification Report (val):
               precision    recall  f1-score   support

           0     0.8165    1.0000    0.8990     16705
           1     0.0000    0.0000    0.0000      3755

    accuracy                         0.8165     20460
   macro avg     0.4082    0.5000    0.4495     20460
weighted avg     0.6666    0.8165    0.7340     20460



  _warn_prf(average, modifier, f"{metric.capitalize()} is", result.shape[0])
  _warn_prf(average, modifier, f"{metric.capitalize()} is", result.shape[0])
  _warn_prf(average, modifier, f"{metric.capitalize()} is", result.shape[0])


## 9) Model Selection & (Optional) Hyperparameter Tuning
Pick the best baseline by AUC/F1, then optionally run a small grid search.


## 10) Fit Final Model on Full Training Set
Use the chosen/tuned pipeline and refit on the entire TRAIN set (`X`, `y`).

## 11) Score Dataset → Create `my_prediction.csv`
Follow the required format: `id_new`, `probability` (for class 1 only), `classification` (argmax).

## 12) Results, Interpretation, and Recommendations
**Summarize:**
- Best model and *why* it was chosen.
- AUC/F1 on the test set and what that implies.
- Any key drivers of purchase you identified.
- Business recommendations (who to target, how to use scores, next steps).

## Appendix
- Python/Sklearn versions
- Reproducibility notes
- Any references

In [67]:
import sys, sklearn

print("Python:", sys.version)
print("pandas:", pd.__version__)
print("numpy:", np.__version__)
print("sklearn:", sklearn.__version__)


Python: 3.12.12 | packaged by Anaconda, Inc. | (main, Oct 21 2025, 20:07:49) [Clang 20.1.8 ]
pandas: 2.3.3
numpy: 2.3.4
sklearn: 1.7.2
