# Final Course Group Project – Insurance Purchase Prediction

**Course:** BZAN 6357 – Business Analytics with Python  
**Project Type:** Supervised ML (Classification)  
**Template generated:** 2025-10-30

## Team
- Aditya Boghara 
- Meghana

## Deliverables
Submit a single zip with:  
1) This notebook (fully executed).  
2) `my_prediction.csv` with **exactly** 3 columns: `id_new`, `probability`, `classification`.

## 1) Introduction & Objective
- **Background:** Cross-sell *car insurance* to existing medical policyholders.
- **Objective:** Predict purchase probability (1=purchased, 0=not purchased) and classify each record in the Score dataset.
- **Evaluation:** AUC-ROC and F1 score on the held-out test set; clarity and rigor of this notebook.
- **Approach (summary):** Data prep → EDA → Modeling (baseline → tuned) → Evaluation → Score file export.

## 2) Setup
Fill in project constants and file paths if needed.

In [2]:
# === Project constants ===
RANDOM_STATE = 42
TEST_SIZE = 0.2  # 20% test split
N_FOLDS = 5      # 5- or 10-fold CV recommended

# File names expected by the project
TRAIN_FILE = 'bzan6357_insurance_3_TRAINING.csv'
SCORE_FILE = 'bzan6357_insurance_3_SCORE.csv'
SUBMIT_FILE = 'my_prediction.csv'  # must contain: id_new, probability, classification


## 3) Imports
Only add libraries you actually use.

In [3]:
import numpy as np
import pandas as pd

from sklearn.compose import ColumnTransformer
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    ConfusionMatrixDisplay, accuracy_score, classification_report, f1_score,
    precision_score, recall_score, roc_auc_score, roc_curve,
)
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.tree import DecisionTreeClassifier

## 4) Data Load & Quick Audit
Both CSV files are read from the notebook's working directory, using the file names defined in Setup.

In [4]:
# Load data using the file-name constants from the Setup cell
df_train = pd.read_csv(TRAIN_FILE)
df_score = pd.read_csv(SCORE_FILE)

df_train.head()

y = df_train['buy']

# Class counts, then class shares in percent
print(y.value_counts())

y.value_counts(normalize=True) * 100



buy
0    16705
1     3755
Name: count, dtype: int64


buy
0    81.647116
1    18.352884
Name: proportion, dtype: float64

## 5) Basic EDA (brief)
Keep this concise and focused on modeling decisions.

**Suggested checks:**
- Target balance (`buy`).  
- Distributions of numeric features (e.g., `age`, `tenure`, `v_prem_quote`).  
- Cardinality of `region`, `cs_rep`.  
- Categorical value ranges (`gender`, `v_age`, `v_accident`).
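
The suggested checks can be sketched as follows. This is a minimal illustration on a synthetic frame, not the project data: the column names come from the list above, while `toy` and every value in it are made up for the example.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df_train; every value below is synthetic
rng = np.random.default_rng(42)
n = 200
toy = pd.DataFrame({
    "age": rng.integers(20, 80, size=n),
    "tenure": rng.integers(1, 300, size=n),
    "v_prem_quote": rng.normal(30000, 8000, size=n).round(2),
    "gender": rng.choice(["Male", "Female"], size=n),
    "v_accident": rng.choice(["Yes", "No"], size=n),
    "region": rng.integers(1, 50, size=n),
    "cs_rep": rng.integers(1, 150, size=n),
    "buy": rng.choice([0, 1], size=n, p=[0.8, 0.2]),
})

# 1) Target balance in percent
target_balance = toy["buy"].value_counts(normalize=True).mul(100).round(2)

# 2) Summary statistics for numeric features
numeric_summary = toy[["age", "tenure", "v_prem_quote"]].describe().T

# 3) Cardinality of ID-like columns
cardinality = toy[["region", "cs_rep"]].nunique()

# 4) Observed levels of low-cardinality categoricals
levels = {col: sorted(toy[col].unique()) for col in ["gender", "v_accident"]}

print(target_balance, numeric_summary, cardinality, levels, sep="\n\n")
```

On the real data, the same four calls would run against `df_train`; the cardinality check in particular indicates how wide a One-Hot encoding of `region` and `cs_rep` would become.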

In [5]:
# Target and features
y = df_train['buy'].astype(int)

# Drop the target and the row identifier (id_new is an ID, not a feature)
X = df_train.drop(columns=['buy', 'id_new'])

# Keep the score IDs for the submission file, then drop them from the features
score_ids = df_score['id_new'].copy()
X_score = df_score.drop(columns=['id_new'])

# Identify feature types after id_new is removed, so the ID is never scaled.
# Note: any categorical stored as integers would be picked up as numeric here.
numeric_features = X.select_dtypes(include=['int64', 'float64']).columns.tolist()
categorical_features = X.select_dtypes(exclude=['int64', 'float64']).columns.tolist()

# One-hot encode categoricals and scale numerics.
# handle_unknown='ignore' keeps transform() from failing on categories unseen
# during fit (combining it with drop='first' requires scikit-learn >= 1.0).
preprocessor = ColumnTransformer(
    transformers=[
        ('OneHotEncoder', OneHotEncoder(drop='first', handle_unknown='ignore'), categorical_features),
        ('StandardScaler', StandardScaler(with_mean=False), numeric_features)
    ]
)


## 6) Train/Test Split
Stratify on `buy` to preserve class balance.

In [17]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=TEST_SIZE, random_state=RANDOM_STATE, stratify=y
)
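
As a quick check of what stratifying on the target does, here is a minimal sketch on synthetic labels. The 18% positive rate mirrors the class balance printed in section 4; `X_demo` and `y_demo` are made-up names, not project variables.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with roughly the class balance seen in the training data
y_demo = np.array([1] * 180 + [0] * 820)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

# stratify=y_demo asks for the same class proportions in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

train_rate = y_tr.mean()
test_rate = y_te.mean()
print(f"positive rate: train={train_rate:.3f}, test={test_rate:.3f}")
```

Without `stratify`, the two rates can drift apart by chance, which makes AUC and F1 on the test split harder to compare with cross-validation results.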



## 7) Preprocessing (Pipelines)
Use a **ColumnTransformer** so the *same* steps can be reused for TEST and SCORE.

**Notes:**
- Treat high-cardinality IDs (e.g., `region`, `cs_rep`) with One-Hot (can be large) or try frequency encoding.
- One-Hot encode: `gender`, `v_age`, `v_accident`, `region`, `cs_rep`.
- Scale numeric features as needed for certain models.
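
Frequency encoding, mentioned above as an alternative to One-Hot for high-cardinality columns, could look roughly like this. It is a sketch under assumptions: the helper names `fit_frequency_map` and `apply_frequency_map` are hypothetical, the `cs_rep` values are invented, and mapping unseen levels to 0.0 is one possible choice among several.

```python
import pandas as pd

# Hypothetical train and score frames; values are illustrative only
train_demo = pd.DataFrame({"cs_rep": [26, 26, 152, 152, 152, 124, 160, 26]})
score_demo = pd.DataFrame({"cs_rep": [152, 26, 999]})  # 999 never appears in train


def fit_frequency_map(series: pd.Series) -> pd.Series:
    """Return each level's share of rows, learned from training data only."""
    return series.value_counts(normalize=True)


def apply_frequency_map(series: pd.Series, freq_map: pd.Series) -> pd.Series:
    """Replace each level by its training frequency; unseen levels get 0.0."""
    return series.map(freq_map).fillna(0.0)


freq_map = fit_frequency_map(train_demo["cs_rep"])
train_demo["cs_rep_freq"] = apply_frequency_map(train_demo["cs_rep"], freq_map)
score_demo["cs_rep_freq"] = apply_frequency_map(score_demo["cs_rep"], freq_map)
print(score_demo)
```

Learning the map on the training rows only and reusing it for TEST and SCORE keeps information from those sets out of the features.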

In [18]:
# Fit the preprocessor on the training split only, then apply it to the test split
X_train = preprocessor.fit_transform(X_train)
X_test = preprocessor.transform(X_test)


## 8) Baseline Models
Start with a few solid baselines and compare AUC/F1.
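
One way to compare baselines is to cross-validate each candidate with the same folds and report mean AUC and F1. The sketch below runs on data from `make_classification` as a stand-in for the preprocessed training set; the model choices and hyperparameters are illustrative assumptions, not results from this project.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for the preprocessed training data
X_demo, y_demo = make_classification(
    n_samples=600, n_features=10, n_informative=5, weights=[0.8, 0.2], random_state=42
)

baselines = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "DecisionTree": DecisionTreeClassifier(max_depth=5, random_state=42),
    "RandomForest": RandomForestClassifier(n_estimators=50, random_state=42),
}

# Same stratified folds for every model so the scores are comparable
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
results = {}
for name, model in baselines.items():
    scores = cross_validate(model, X_demo, y_demo, cv=cv, scoring=["roc_auc", "f1"])
    results[name] = {
        "auc": scores["test_roc_auc"].mean(),
        "f1": scores["test_f1"].mean(),
    }
    print(f"{name:20s} AUC={results[name]['auc']:.3f}  F1={results[name]['f1']:.3f}")
```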

## 9) Model Selection & (Optional) Hyperparameter Tuning
Pick the best baseline by AUC/F1, then optionally run a small grid search.
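
A small grid search might be set up as below. This is a sketch on synthetic data; the estimator, the grid values, and the 3-fold setting are assumptions chosen to keep the example fast, not tuned recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic stand-in for the preprocessed training data
X_demo, y_demo = make_classification(
    n_samples=400, n_features=8, n_informative=4, weights=[0.8, 0.2], random_state=42
)

# Deliberately small grid: 2 x 2 = 4 candidates
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [2, 3],
}

search = GridSearchCV(
    estimator=GradientBoostingClassifier(random_state=42),
    param_grid=param_grid,
    scoring="roc_auc",
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=42),
    n_jobs=1,
)
search.fit(X_demo, y_demo)

print("best params:", search.best_params_)
print(f"best CV AUC: {search.best_score_:.3f}")
```

With the default `refit=True`, `search.best_estimator_` is already refit on all the data passed to `fit`, so it can be evaluated on the held-out test split directly.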


## 10) Fit Final Model on Full Training Set
Use the chosen/tuned pipeline and refit on the entire TRAIN set (`X`, `y`).
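
Wrapping preprocessing and model in a single `Pipeline` lets the refit on the full training set happen in one `fit` call, with the encoder and scaler relearned on all rows. The sketch below assumes a logistic regression as the chosen model and uses a synthetic frame (`X_full`, `y_full` are made-up names with invented values).

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the full training features and target
rng = np.random.default_rng(42)
n = 300
X_full = pd.DataFrame({
    "age": rng.integers(20, 80, size=n),
    "v_prem_quote": rng.normal(30000, 8000, size=n),
    "gender": rng.choice(["Male", "Female"], size=n),
    "v_accident": rng.choice(["Yes", "No"], size=n),
})
y_full = pd.Series(rng.choice([0, 1], size=n, p=[0.8, 0.2]), name="buy")

# Preprocessing and model live in one object, so fit() relearns both
final_pipeline = Pipeline(steps=[
    ("prep", ColumnTransformer(transformers=[
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["gender", "v_accident"]),
        ("num", StandardScaler(), ["age", "v_prem_quote"]),
    ])),
    ("model", LogisticRegression(max_iter=1000)),
])

final_pipeline.fit(X_full, y_full)

# Column 1 of predict_proba is the probability of class 1 (purchase)
proba = final_pipeline.predict_proba(X_full)[:, 1]
print(proba[:5].round(3))
```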

## 11) Score Dataset → Create `my_prediction.csv`
Follow the required format: `id_new`, `probability` (for class 1 only), `classification` (argmax).
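
The required file layout can be sketched as follows. The IDs and probabilities are invented placeholders for the real score IDs and the fitted model's class-1 probabilities, and the output goes to a temporary demo path rather than `my_prediction.csv`.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Hypothetical IDs and class-1 probabilities; real values come from the
# Score dataset and the fitted model's predict_proba(...)[:, 1]
score_ids_demo = pd.Series([101, 102, 103, 104], name="id_new")
proba_demo = np.array([0.91, 0.12, 0.50, 0.49])

# For two classes, proba > 0.5 matches argmax (a 0.5/0.5 tie resolves to class 0)
submission = pd.DataFrame({
    "id_new": score_ids_demo.to_numpy(),
    "probability": proba_demo,
    "classification": (proba_demo > 0.5).astype(int),
})

# Guard the required format before writing
assert list(submission.columns) == ["id_new", "probability", "classification"]
assert submission["probability"].between(0, 1).all()

# index=False keeps the file at exactly three columns
out_path = os.path.join(tempfile.gettempdir(), "my_prediction_demo.csv")
submission.to_csv(out_path, index=False)

written = pd.read_csv(out_path)
print(written)
```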

## 12) Results, Interpretation, and Recommendations
**Summarize:**
- Best model and *why* it was chosen.
- AUC/F1 on the test set and what that implies.
- Any key drivers of purchase you identified.
- Business recommendations (who to target, how to use scores, next steps).
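
For the test-set metrics, a minimal sketch of how AUC and F1 are computed is shown below. The labels and probabilities are a hand-made toy example, not project results; AUC is computed from the probabilities, while F1 needs hard labels and therefore depends on the cutoff (0.5 is assumed here).

```python
import numpy as np
from sklearn.metrics import f1_score, roc_auc_score

# Toy labels and predicted class-1 probabilities (invented for illustration)
y_true = np.array([0, 0, 1, 1, 0, 1])
proba = np.array([0.10, 0.40, 0.35, 0.80, 0.20, 0.70])

# AUC ranks probabilities: share of (positive, negative) pairs ordered correctly
auc = roc_auc_score(y_true, proba)

# F1 is computed on thresholded predictions
y_pred = (proba > 0.5).astype(int)
f1 = f1_score(y_true, y_pred)

print(f"AUC={auc:.3f}  F1={f1:.3f}")
```

In this toy case 8 of the 9 positive/negative pairs are ranked correctly, and the 0.5 cutoff finds 2 of the 3 positives with no false positives.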

## Appendix
- Python/Sklearn versions
- Reproducibility notes
- Any references

In [None]:
# Record library versions for reproducibility
import sys
import sklearn
print('Python:', sys.version)
print('pandas:', pd.__version__)
print('numpy:', np.__version__)
print('sklearn:', sklearn.__version__)
