# Final Course Group Project – Insurance Purchase Prediction

**Course:** BZAN 6357 – Business Analytics with Python  
**Project Type:** Supervised ML (Classification)  
**Template generated:** 2025-10-30

## Team
- Aditya Boghara 
- Meghana

## Deliverables
Submit a single zip with:  
1) This notebook (fully executed).  
2) `my_prediction.csv` with **exactly** 3 columns: `id_new`, `probability`, `classification`.

## 1) Introduction & Objective
- **Background:** Cross-sell *car insurance* to existing medical policyholders.
- **Objective:** Predict each customer's purchase probability and classify the Score data (`buy`: 1 = purchased, 0 = not purchased).
- **Evaluation:** AUC-ROC and F1 score on held-out test; clarity and rigor of this notebook.
- **Approach (summary):** Data prep → EDA → Modeling (baseline → tuned) → Evaluation → Score file export.

## 2) Setup
Fill in project constants and file paths if needed.

In [4]:
# === Project constants ===
RANDOM_STATE = 42
TEST_SIZE = 0.2  # 20% test split
N_FOLDS = 5  # 5- or 10-fold CV recommended

# File names expected by the project
TRAIN_FILE = "bzan6357_insurance_3_TRAINING.csv"
SCORE_FILE = "bzan6357_insurance_3_SCORE.csv"
SUBMIT_FILE = "my_prediction.csv"  # must contain: id_new, probability, classification


## 3) Imports
Only add libraries you actually use.

In [5]:
import numpy as np
import pandas as pd
import tensorflow as tf
from scikeras.wrappers import KerasClassifier
from sklearn.compose import ColumnTransformer
from sklearn.metrics import (
    ConfusionMatrixDisplay,
    accuracy_score,
    classification_report,
    confusion_matrix,
    f1_score,
    precision_recall_curve,
    roc_auc_score,
    roc_curve,
)
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.utils.class_weight import compute_class_weight
from tensorflow.keras import layers, Sequential

## 4) Data Load & Quick Audit
The load assumes both CSV files sit in the notebook's working directory; `pd.read_csv` raises `FileNotFoundError` if either is missing.

In [6]:
# Load data (paths are already set above)
df_train = pd.read_csv(TRAIN_FILE)
df_score = pd.read_csv(SCORE_FILE)

df_train.head()

y = df_train["buy"]

print(y.value_counts())
print(y.value_counts(normalize=True) * 100)

print("Shape of df_train", df_train.shape)
print("Shape of df_score", df_score.shape)


buy
0    16705
1     3755
Name: count, dtype: int64
buy
0    81.647116
1    18.352884
Name: proportion, dtype: float64
Shape of df_train (20460, 12)
Shape of df_score (2000, 11)
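
A small guard can turn a missing file into a readable message. The sketch below is one possible way to do that; `load_csv_checked` is a hypothetical helper and the demo file name is made up, neither is part of the notebook.

```python
from pathlib import Path

import pandas as pd


def load_csv_checked(path):
    """Read a CSV, raising a descriptive error when the file is absent.

    Hypothetical helper, not part of the original notebook.
    """
    csv_path = Path(path)
    if not csv_path.is_file():
        raise FileNotFoundError(
            f"Could not find '{csv_path}'. Place it in {Path.cwd()} "
            "or update the constants in the Setup section."
        )
    return pd.read_csv(csv_path)


# Demonstration with a made-up file name that should not exist.
try:
    load_csv_checked("no_such_file_for_demo.csv")
except FileNotFoundError as err:
    message = str(err)
    print(message)
```

In the notebook this would be called as `load_csv_checked(TRAIN_FILE)` and `load_csv_checked(SCORE_FILE)`.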


## 5) Basic EDA (brief)
Keep this concise and focused on modeling decisions.

**Suggested checks:**
- Target balance (`buy`).  
- Distributions of numeric features (e.g., `age`, `tenure`, `v_prem_quote`).  
- Cardinality of `region`, `cs_rep`.  
- Categorical value ranges (`gender`, `v_age`, `v_accident`).
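
The checks above might be run roughly as follows. This is a sketch on a tiny made-up frame: only the column names come from this notebook, and every value (including the category labels) is invented for illustration.

```python
import pandas as pd

# Made-up rows that only mimic the column names; all values are illustrative.
toy = pd.DataFrame(
    {
        "buy": [0, 0, 0, 1, 0, 1],
        "age": [25, 41, 37, 52, 29, 60],
        "tenure": [12, 48, 30, 80, 5, 110],
        "v_prem_quote": [310.0, 455.5, 390.0, 520.0, 298.0, 610.0],
        "region": [28, 28, 3, 11, 28, 3],
        "cs_rep": [152, 26, 152, 124, 26, 152],
        "gender": ["Male", "Female", "Male", "Female", "Male", "Male"],
        "v_age": ["new", "mid", "mid", "old", "new", "mid"],
        "v_accident": ["No", "Yes", "No", "Yes", "No", "Yes"],
    }
)

# Target balance as proportions per class.
target_share = toy["buy"].value_counts(normalize=True)

# Summary statistics for the continuous features.
numeric_summary = toy[["age", "tenure", "v_prem_quote"]].describe()

# Number of distinct codes in the ID-like columns.
cardinality = toy[["region", "cs_rep"]].nunique()

# Distinct levels of each string column.
levels = {col: sorted(toy[col].unique()) for col in ["gender", "v_age", "v_accident"]}

print(target_share, numeric_summary, cardinality, levels, sep="\n\n")
```

On the real data, replace `toy` with `df_train`.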

In [7]:
# Target and features
y = df_train["buy"].astype(int)
X = df_train.drop(columns=["buy", "id_new"])

# Keep the score IDs for the submission file, then drop them from the features
score_ids = df_score["id_new"].copy()
X_score = df_score.drop(columns=["id_new"])

# Identify feature types by dtype. Integer-coded columns such as `region`
# and `cs_rep` therefore count as numeric here.
numeric_features = X.select_dtypes(include=["int64", "float64"]).columns.tolist()
categorical_features = X.select_dtypes(exclude=["int64", "float64"]).columns.tolist()

print(numeric_features)
print(categorical_features)

# One-hot encode the string columns; scale the numeric ones without centering
preprocessor = ColumnTransformer(
    transformers=[
        ("OneHotEncoder", OneHotEncoder(drop="first"), categorical_features),
        ("StandardScaler", StandardScaler(with_mean=False), numeric_features),
    ]
)

print("Shape of df_train", X.shape)
print("Shape of df_score", X_score.shape)


['age', 'tenure', 'region', 'dl', 'has_v_insurance', 'v_prem_quote', 'cs_rep']
['gender', 'v_age', 'v_accident']
Shape of df_train (20460, 10)
Shape of df_score (2000, 10)


## 6) Preprocessing (Pipelines)
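Use a **ColumnTransformer** so the *same* fitted steps can be reused for TEST and SCORE.

**Notes:**
- High-cardinality ID columns (e.g., `region`, `cs_rep`) can be one-hot encoded (the matrix can get wide) or frequency encoded.
- One-hot encode: `gender`, `v_age`, `v_accident`, `region`, `cs_rep`. In this notebook `region` and `cs_rep` are integer-coded, so the dtype-based selection in Section 5 scales them as numeric instead of encoding them.
- Scale numeric features for models that are sensitive to feature scale.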
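
One way to follow these notes is to list the categorical columns by name, so integer-coded IDs are encoded rather than scaled, and to wrap the transformer and model in a single `Pipeline`. The sketch below assumes made-up toy data, a `LogisticRegression` stand-in model, and `handle_unknown="ignore"` for unseen codes; none of these choices come from the notebook.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Illustrative rows only; values are invented.
toy_train = pd.DataFrame(
    {
        "age": [25, 41, 37, 52, 29, 60, 33, 47],
        "v_prem_quote": [310.0, 455.5, 390.0, 520.0, 298.0, 610.0, 350.0, 480.0],
        "region": [28, 28, 3, 11, 28, 3, 11, 3],
        "gender": ["Male", "Female", "Male", "Female", "Male", "Male", "Female", "Female"],
        "buy": [0, 0, 0, 1, 0, 1, 0, 1],
    }
)
# Second row carries a region code (99) that the training rows never saw.
toy_new = pd.DataFrame(
    {
        "age": [30, 55],
        "v_prem_quote": [400.0, 500.0],
        "region": [28, 99],
        "gender": ["Female", "Male"],
    }
)

cat_cols = ["region", "gender"]      # named explicitly, regardless of dtype
num_cols = ["age", "v_prem_quote"]

prep = ColumnTransformer(
    [
        ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
        ("num", StandardScaler(), num_cols),
    ]
)
clf = Pipeline([("prep", prep), ("model", LogisticRegression(max_iter=1000))])

# Fitting the pipeline fits the preprocessing on the training rows only.
clf.fit(toy_train.drop(columns="buy"), toy_train["buy"])
proba_new = clf.predict_proba(toy_new)[:, 1]

# Inspect the encoded matrix: the unseen region becomes an all-zero block.
encoded = clf.named_steps["prep"].transform(toy_new)
encoded = encoded.toarray() if hasattr(encoded, "toarray") else encoded
print(encoded.shape, proba_new)
```

Because the preprocessing lives inside the pipeline, cross-validation refits it on each training fold, which keeps validation rows out of the scaling statistics.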

In [8]:
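# Fit the preprocessor on the training table, then apply the same fitted
# transform to the score set so both share one encoding and scaling.
# Note: this fit runs before the split in Section 7, so the scaling
# statistics also see the validation rows.
X = preprocessor.fit_transform(X)
X_score = preprocessor.transform(X_score)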

X_score


array([[1.        , 0.        , 0.        , ..., 0.        , 1.39816747,
        9.7316272 ],
       [0.        , 1.        , 0.        , ..., 0.        , 2.3758078 ,
        9.19429809],
       [1.        , 0.        , 0.        , ..., 2.23570222, 1.64180453,
        7.28379459],
       ...,
       [1.        , 0.        , 0.        , ..., 2.23570222, 3.62885928,
        7.28379459],
       [1.        , 0.        , 0.        , ..., 0.        , 1.95069579,
        7.28379459],
       [1.        , 0.        , 1.        , ..., 0.        , 2.10455499,
        7.28379459]], shape=(2000, 11))

## 7) Train/Test Split
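Stratifying on `buy` (passing `stratify=y`) preserves class balance in both parts. The split in this section omits it, so the class shares differ slightly between the parts.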

In [9]:
# Hold out TEST_SIZE (20%) of the rows for validation
X_tr, X_va, y_tr, y_va = train_test_split(
    X, y, test_size=TEST_SIZE, random_state=RANDOM_STATE
)
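
For comparison, a stratified split could look like the sketch below. It uses synthetic data with an 18% positive class (chosen to resemble the `buy` balance) rather than the project files, and the `_demo` names are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 1000 rows, 18% positives.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(1000, 3))
y_demo = np.array([1] * 180 + [0] * 820)

# stratify=y_demo keeps the class proportions the same in both parts.
X_a, X_b, y_a, y_b = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

share_train = y_a.mean()
share_test = y_b.mean()
print(f"positive share, train: {share_train:.3f}  test: {share_test:.3f}")
```

In the notebook the equivalent call would add `stratify=y` to `train_test_split`.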

In [10]:
from imblearn.over_sampling import SMOTE

# Apply SMOTE only to the training split, never to the validation split
sm = SMOTE(random_state=RANDOM_STATE)
X_tr_res, y_tr_res = sm.fit_resample(X_tr, y_tr)


print("Before SMOTE:", X_tr.shape, y_tr.shape)
print("After SMOTE:", X_tr_res.shape, y_tr_res.shape)

print("value count before SMOTE:", y_tr.value_counts())
print("value count After SMOTE:", y_tr_res.value_counts())



Before SMOTE: (16368, 11) (16368,)
After SMOTE: (26786, 11) (26786,)
value count before SMOTE: buy
0    13393
1     2975
Name: count, dtype: int64
value count After SMOTE: buy
0    13393
1    13393
Name: count, dtype: int64
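
Oversampling is one way to handle the imbalance; reweighting the classes is another. The sketch below is an alternative for comparison, not what this notebook does. It uses `compute_class_weight` (already imported in Section 3) with the class counts copied from the "before SMOTE" output, and also derives the negative-to-positive ratio that XGBoost's `scale_pos_weight` parameter is commonly set to.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Class counts copied from the "before SMOTE" output.
n_neg, n_pos = 13393, 2975
y_counts = np.array([0] * n_neg + [1] * n_pos)

# "balanced" weights are n_samples / (n_classes * class_count).
weights = compute_class_weight(
    class_weight="balanced", classes=np.array([0, 1]), y=y_counts
)
class_weight = {0: float(weights[0]), 1: float(weights[1])}

# Ratio of negatives to positives, a common choice for scale_pos_weight.
scale_pos_weight = n_neg / n_pos

print(class_weight)
print(f"scale_pos_weight: {scale_pos_weight:.3f}")
```

With weights like these the model trains on the original rows, so no synthetic samples are created.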


## 8) Baseline Models
Start with a few solid baselines and compare AUC/F1.

In [11]:
from xgboost import XGBClassifier

# Train and evaluate each baseline model
models = {
    "XGBoost Classifier": XGBClassifier(),
}

for name, model in models.items():
    model.fit(X_tr_res, y_tr_res)  # train on the SMOTE-resampled training split

    # Predictions on the resampled training data (0.5 probability threshold)
    y_proba_tr = model.predict_proba(X_tr_res)[:, 1]
    y_train_pred = (y_proba_tr >= 0.5).astype(int)

    # Predictions on the full preprocessed table X. X still contains the
    # training rows, so these scores are optimistic; X_va / y_va is the
    # held-out split.
    y_full_pred = model.predict(X)

    print(name)

    print("Confusion Matrix (resampled train):\n", confusion_matrix(y_tr_res, y_train_pred))
    print(
        "\nClassification Report (resampled train):\n",
        classification_report(y_tr_res, y_train_pred, digits=4),
    )
    print("accuracy: ", accuracy_score(y_tr_res, y_train_pred))

    print("----------------------------------")

    print("Confusion Matrix (full data):\n", confusion_matrix(y, y_full_pred))
    print(
        "\nClassification Report (full data):\n",
        classification_report(y, y_full_pred, digits=4),
    )
    print("accuracy: ", accuracy_score(y, y_full_pred))

    print("=" * 35)
    print("\n")

XGBoost Classifier
Confusion Matrix (resampled train):
 [[11545  1848]
 [ 1158 12235]]

Classification Report (resampled train):
               precision    recall  f1-score   support

           0     0.9088    0.8620    0.8848     13393
           1     0.8688    0.9135    0.8906     13393

    accuracy                         0.8878     26786
   macro avg     0.8888    0.8878    0.8877     26786
weighted avg     0.8888    0.8878    0.8877     26786

accuracy:  0.8877771970432315
----------------------------------
Confusion Matrix (full data):
 [[14114  2591]
 [ 1267  2488]]

Classification Report (full data):
               precision    recall  f1-score   support

           0     0.9176    0.8449    0.8798     16705
           1     0.4899    0.6626    0.5633      3755

    accuracy                         0.8114     20460
   macro avg     0.7037    0.7537    0.7215     20460
weighted avg     0.8391    0.8114    0.8217     20460

accuracy:  0.8114369501466275
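
The project is graded on AUC-ROC and F1 for held-out data, which the cell above does not compute. A minimal sketch of that evaluation follows, assuming synthetic data from `make_classification` and a `LogisticRegression` stand-in so it runs without the project files; in the notebook the fitted model, `X_va`, and `y_va` would take their place.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (roughly 82% / 18%), illustrative only.
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=10, weights=[0.82, 0.18], random_state=42
)
X_fit, X_hold, y_fit, y_hold = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

model = LogisticRegression(max_iter=1000, class_weight="balanced")
model.fit(X_fit, y_fit)

# AUC needs the class-1 probabilities; F1 needs hard labels.
proba_hold = model.predict_proba(X_hold)[:, 1]
pred_hold = (proba_hold >= 0.5).astype(int)

auc_hold = roc_auc_score(y_hold, proba_hold)
f1_hold = f1_score(y_hold, pred_hold)
print(f"held-out AUC: {auc_hold:.3f}  F1: {f1_hold:.3f}")
```

Scoring only rows the model never saw during fitting keeps both metrics from being inflated by memorized training examples.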




## 9) Model Selection & (Optional) Hyperparameter Tuning
Pick the best baseline by AUC/F1, then optionally run a small grid search.
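
A small grid search might be set up along these lines. The sketch assumes synthetic data, a `LogisticRegression` stand-in, and an arbitrary grid over `C`; the estimator and grid for this project would differ.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic imbalanced data, illustrative only.
X_demo, y_demo = make_classification(
    n_samples=600, n_features=8, weights=[0.82, 0.18], random_state=42
)

# Stratified folds keep the class balance similar in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

grid = GridSearchCV(
    estimator=LogisticRegression(max_iter=1000, class_weight="balanced"),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},
    scoring="roc_auc",  # select by cross-validated AUC
    cv=cv,
)
grid.fit(X_demo, y_demo)

best_C = grid.best_params_["C"]
best_auc = grid.best_score_
print(f"best C: {best_C}  mean CV AUC: {best_auc:.3f}")
```

By default `GridSearchCV` refits the best configuration on all rows passed to `fit`, so `grid.best_estimator_` is ready for prediction afterwards.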


## 10) Fit Final Model on Full Training Set
Use the chosen/tuned pipeline and refit on the entire TRAIN set (`X`, `y`).

## 11) Score Dataset → Create `my_prediction.csv`
Follow the required format: `id_new`, `probability` (for class 1 only), `classification` (argmax).
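
The required layout can be assembled as in the sketch below. The IDs and probabilities are made up; in the notebook they would come from `score_ids` and the final model's `predict_proba(X_score)[:, 1]`. The CSV is rendered to a string here so the example does not overwrite a real submission file.

```python
import numpy as np
import pandas as pd

# Made-up IDs and class-1 probabilities, illustrative only.
ids_demo = pd.Series([101, 102, 103, 104], name="id_new")
proba_demo = np.array([0.12, 0.73, 0.55, 0.49])

# Argmax over [P(class 0), P(class 1)] gives the predicted label.
both_classes = np.column_stack([1.0 - proba_demo, proba_demo])
labels_demo = both_classes.argmax(axis=1)

submission = pd.DataFrame(
    {
        "id_new": ids_demo.to_numpy(),
        "probability": proba_demo,
        "classification": labels_demo,
    }
)

# Guard against accidental extra or misordered columns.
expected_cols = ["id_new", "probability", "classification"]
if list(submission.columns) != expected_cols:
    raise ValueError(f"Unexpected columns: {list(submission.columns)}")

csv_text = submission.to_csv(index=False)  # index=False avoids a fourth column
print(csv_text)
```

For the actual file, `submission.to_csv(SUBMIT_FILE, index=False)` writes the same content to disk.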

## 12) Results, Interpretation, and Recommendations
**Summarize:**
- Best model and *why* it was chosen.
- AUC/F1 on the test set and what that implies.
- Any key drivers of purchase you identified.
- Business recommendations (who to target, how to use scores, next steps).

## Appendix
- Python/Sklearn versions
- Reproducibility notes
- Any references

In [14]:
import sys, sklearn, xgboost

print("Python:", sys.version)
print("pandas:", pd.__version__)
print("numpy:", np.__version__)
print("sklearn:", sklearn.__version__)
print("XGBoost:", xgboost.__version__)


Python: 3.12.12 | packaged by Anaconda, Inc. | (main, Oct 21 2025, 20:07:49) [Clang 20.1.8 ]
pandas: 2.3.3
numpy: 2.3.4
sklearn: 1.7.2
XGBoost: 3.1.1
