# Step 5: Feature Engineering

Feature engineering is performed selectively based on insights from exploratory data analysis and statistical validation. 
The goal is to strengthen behavioral fraud signals while maintaining interpretability and avoiding data leakage.

Only domain-relevant and EDA-supported features are engineered. No automated or high-dimensional transformations are applied.


In [1]:
# Import libraries
import pandas as pd
import numpy as np

# Load preprocessed dataset
df = pd.read_csv("../data/processed/cleaned_data.csv")

df.shape 


(15420, 74)

In [2]:

# Separate the features from the target column
X = df.drop('FraudFound_P', axis=1)
y = df['FraudFound_P']

X.shape, y.shape


((15420, 73), (15420,))

### Deductible Binning

EDA revealed that deductible values are policy-defined slabs rather than continuous values. 
To reflect this structure, deductible is converted into ordinal bins.


In [3]:
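# Bin deductible into ordered slabs; intervals are right-closed, e.g. (350, 400]
X['Deductible_Bin'] = pd.cut(
    X['Deductible'],
    bins=[0, 350, 400, 450, 1000],
    labels=['Low', 'Medium', 'High', 'Very_High']
)

# Drop the raw column so only the binned version remains
X = X.drop(columns=['Deductible'])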
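
As a quick check of how `pd.cut` treats these edges, the sketch below bins a few hypothetical deductible amounts (the sample values are illustrative and not taken from the dataset). Intervals are right-closed by default, so 400 falls in `Medium`, and the result is an ordered categorical.

```python
import pandas as pd

# Hypothetical deductible amounts, for illustration only
sample = pd.Series([300, 400, 500, 700])

binned = pd.cut(
    sample,
    bins=[0, 350, 400, 450, 1000],
    labels=['Low', 'Medium', 'High', 'Very_High']
)

# Default intervals: (0, 350], (350, 400], (400, 450], (450, 1000]
print(binned.tolist())

# The result is an ordered categorical, so integer codes follow bin order
print(binned.cat.ordered)
print(binned.cat.codes.tolist())
```

With these sample values the `High` bin stays empty. Running `value_counts()` on the real `Deductible_Bin` column would show whether any bin is unused in the actual data.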


### Age Grouping

Age showed a small but statistically significant relationship with fraud.
Grouping age into life-stage buckets makes the feature easier to interpret and less sensitive to small differences in exact age.


In [5]:
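# Group age into life-stage buckets; intervals are right-closed, e.g. (25, 35]
X['Age_Group'] = pd.cut(
    X['Age'],
    bins=[0, 25, 35, 50, 65, 100],
    labels=['Young', 'Early_Adult', 'Mid_Age', 'Senior', 'Elder']
)

# Drop the raw column so only the grouped version remains
X = X.drop(columns=['Age'])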
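
Because `pd.cut` excludes the left edge of the first interval by default, any age recorded as 0, or any age above 100, would become `NaN`. A minimal sketch of that check, using made-up ages (whether the cleaned data contains such values is not shown here):

```python
import pandas as pd

# Made-up ages for illustration, including values on and outside the bin edges
ages = pd.Series([0, 18, 25, 26, 65, 66, 101])

bins = [0, 25, 35, 50, 65, 100]
labels = ['Young', 'Early_Adult', 'Mid_Age', 'Senior', 'Elder']

groups = pd.cut(ages, bins=bins, labels=labels)

# Values outside (0, 100] are left unassigned
unassigned = ages[groups.isna()]
print(unassigned.tolist())

# include_lowest=True closes the first interval on the left, so 0 maps to 'Young'
groups_incl = pd.cut(ages, bins=bins, labels=labels, include_lowest=True)
print(groups_incl.isna().sum())
```

Checking `X['Age_Group'].isna().sum()` after binning is a cheap way to confirm that no rows were silently dropped into `NaN`.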


### Early Claim Flag

Claims filed soon after a policy starts may indicate opportunistic fraud.
A binary early-claim flag marks claims whose `Days_Policy_Accident` category is `none`, `1 to 7`, or `8 to 15`.


In [6]:
X['Early_Claim_Flag'] = X['Days_Policy_Accident'].isin(
    ['none', '1 to 7', '8 to 15']
).astype(int)


KeyError: 'Days_Policy_Accident'
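
The `KeyError` means `X` has no column named `Days_Policy_Accident`. One possible cause, given the 74 columns in the cleaned file, is that categorical columns were one-hot encoded during preprocessing; this is an assumption, not something the output confirms. The sketch below defines a hypothetical helper, `make_early_claim_flag`, that uses the raw column when present and otherwise falls back to dummy columns named `Days_Policy_Accident_<level>` (the naming `pd.get_dummies` produces by default). The demo frame is made up.

```python
import pandas as pd

EARLY_LEVELS = ['none', '1 to 7', '8 to 15']

def make_early_claim_flag(frame, col='Days_Policy_Accident', levels=EARLY_LEVELS):
    """Hypothetical helper: return a 0/1 Series flagging early claims."""
    if col in frame.columns:
        # Raw categorical column is available
        return frame[col].isin(levels).astype(int)
    # Assumed fallback: one-hot columns named '<col>_<level>'
    dummy_cols = [f'{col}_{lvl}' for lvl in levels if f'{col}_{lvl}' in frame.columns]
    if not dummy_cols:
        raise KeyError(f"No '{col}' column or matching dummy columns found")
    # A row is flagged if any of the early-level dummies is set
    return frame[dummy_cols].astype(bool).any(axis=1).astype(int)

# Made-up one-hot encoded frame for illustration
demo = pd.DataFrame({
    'Days_Policy_Accident_1 to 7': [1, 0, 0],
    'Days_Policy_Accident_more than 30': [0, 1, 0],
    'Days_Policy_Accident_none': [0, 0, 1],
})
flag = make_early_claim_flag(demo)
print(flag.tolist())
```

If a reference level was dropped during encoding (for example with `drop_first=True`), this fallback would undercount early claims, so listing the matching columns first with `[c for c in X.columns if c.startswith('Days')]` is a sensible check.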