# Step 5: Feature Engineering

Feature engineering is applied selectively, guided by insights from exploratory data analysis (EDA) and statistical validation.
The goal is to strengthen behavioral fraud signals while keeping features interpretable and avoiding data leakage.

All transformations operate on the raw categorical features, before encoding.


In [7]:
# Imports
import pandas as pd
import numpy as np

# Load preprocessed dataset
df = pd.read_csv("../data/processed/cleaned_raw_features.csv")

# Separate the features from the target column
X = df.drop('FraudFound_P', axis=1)
y = df['FraudFound_P']

X.shape



(15420, 22)

### Deductible Binning

EDA revealed that deductible values fall on a few fixed, policy-defined tiers rather than a continuous range.
To reflect this structure, `Deductible` is converted into ordered bins and the original column is dropped.


In [8]:
X['Deductible_Bin'] = pd.cut(
    X['Deductible'],
    bins=[0, 350, 400, 450, 1000],
    labels=['Low', 'Medium', 'High', 'Very_High']
)

X = X.drop(columns=['Deductible'])
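
As a quick sanity check on the bin edges, the sketch below applies the same `pd.cut` call to a handful of deductible values. The sample values (300, 400, 500, 700) are assumptions chosen for illustration and are not read from the dataset. `pd.cut` uses right-inclusive intervals by default, so a value of 400 falls in `(350, 400]`.

```python
import pandas as pd

# Illustrative deductible values (assumed, not taken from the dataset)
sample = pd.Series([300, 400, 500, 700], name="Deductible")

binned = pd.cut(
    sample,
    bins=[0, 350, 400, 450, 1000],
    labels=["Low", "Medium", "High", "Very_High"],
)

# Intervals are right-inclusive: (0, 350], (350, 400], (400, 450], (450, 1000]
print(binned.tolist())

# Count rows per bin; unused categories are reported with a count of 0
bin_counts = binned.value_counts(sort=False)
print(bin_counts)
```

With these sample values nothing lands in `High`. Running the same `value_counts` call on the real column would show whether every bin is actually populated.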


### Age Grouping

Age showed a small but statistically significant relationship with fraud.
Grouping age into life-stage buckets improves interpretability and robustness.


In [9]:
X['Age_Group'] = pd.cut(
    X['Age'],
    bins=[0, 25, 35, 50, 65, 100],
    labels=['Young', 'Early_Adult', 'Mid_Age', 'Senior', 'Elder']
)

X = X.drop(columns=['Age'])
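
The bin edges determine what happens at the boundaries, so it may be worth checking them before relying on `Age_Group`. The sketch below uses made-up ages (an assumption, not dataset values) to show that the lowest edge is exclusive by default: an age of exactly 0 is left unbinned and becomes `NaN`.

```python
import pandas as pd

# Boundary ages chosen for illustration (assumed, not taken from the dataset)
ages = pd.Series([0, 25, 26, 65, 66, 100], name="Age")

groups = pd.cut(
    ages,
    bins=[0, 25, 35, 50, 65, 100],
    labels=["Young", "Early_Adult", "Mid_Age", "Senior", "Elder"],
)

# Show each age next to its assigned group
print(pd.DataFrame({"Age": ages, "Age_Group": groups}))

# Values outside (0, 100] are not assigned to any bin
n_unbinned = int(groups.isna().sum())
print("unbinned rows:", n_unbinned)
```

If the cleaned data still contains ages of 0, passing `include_lowest=True` to `pd.cut` would place them in the first bin. Whether that is appropriate depends on how such rows were handled upstream.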



### Early Claim Flag

Claims occurring soon after policy initiation may indicate opportunistic fraud.
A binary early-claim flag marks records whose `Days_Policy_Accident` value is `none`, `1 to 7`, or `8 to 15`.


In [10]:
X['Early_Claim_Flag'] = X['Days_Policy_Accident'].isin(
    ['none', '1 to 7', '8 to 15']
).astype(int)
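
One way to check that a flag like this carries signal is to compare the fraud rate between flagged and unflagged rows. The sketch below does this on a tiny made-up frame: the rows, the target values, and the `more than 30` and `15 to 30` labels are all assumptions for illustration, so the printed rates say nothing about the real data.

```python
import pandas as pd

# Toy rows for illustration only (assumed values, not from the dataset)
toy = pd.DataFrame({
    "Days_Policy_Accident": [
        "none", "1 to 7", "more than 30", "more than 30", "8 to 15", "15 to 30",
    ],
    "FraudFound_P": [1, 0, 0, 0, 1, 0],
})

# Same construction as the notebook cell above
toy["Early_Claim_Flag"] = toy["Days_Policy_Accident"].isin(
    ["none", "1 to 7", "8 to 15"]
).astype(int)

# Mean of a 0/1 target per group is the fraud rate for that group
fraud_rate = toy.groupby("Early_Claim_Flag")["FraudFound_P"].mean()
print(fraud_rate)
```

On the real data, the same `groupby` would use `X['Early_Claim_Flag']` together with `y`.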



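### Repeat Claimant Flag

Policyholders whose `PastNumberOfClaims` value is `2 to 4` or `more than 4` are flagged as repeat claimants.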

In [11]:
X['Repeat_Claimant'] = X['PastNumberOfClaims'].isin(
    ['2 to 4', 'more than 4']
).astype(int)


### Address Change Risk

A claim is flagged when its `AddressChange_Claim` value is `1 year` or `2 years`.

In [12]:
X['Address_Change_Flag'] = X['AddressChange_Claim'].isin(
    ['1 year', '2 years']
).astype(int)
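
Because `isin` silently returns `False` for labels that never occur in a column, a mistyped or renamed category produces a flag that is quietly too sparse. A minimal sketch of a guard is shown below. The helper name `flag_from_labels` is hypothetical, and the toy series is made up for illustration.

```python
import warnings

import pandas as pd


def flag_from_labels(series, labels):
    """Return a 0/1 membership flag, warning about labels absent from the data."""
    observed = set(series.dropna().unique())
    missing = [label for label in labels if label not in observed]
    if missing:
        warnings.warn(f"Labels not found in {series.name!r}: {missing}")
    return series.isin(labels).astype(int)


# Toy column for illustration only (assumed values, not from the dataset)
toy = pd.Series(["no change", "1 year", "no change"], name="AddressChange_Claim")

# Capture warnings so they can be inspected instead of only printed to stderr
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    flag = flag_from_labels(toy, ["1 year", "2 years"])

print(flag.tolist())
print([str(w.message) for w in caught])
```

Comparing the label lists against `X[column].unique()` before building each flag achieves the same check without a helper.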


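### Policyholder at Fault

The `Fault` column is reduced to a binary indicator that equals 1 when the policyholder is at fault.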

In [13]:
X['Policyholder_At_Fault'] = (X['Fault'] == 'Policy Holder').astype(int)


### Dropping Source Columns

The four source columns are removed now that their flags have been derived.

In [14]:
X = X.drop(columns=[
    'Days_Policy_Accident',
    'PastNumberOfClaims',
    'AddressChange_Claim',
    'Fault'
])


### Saving the Feature-Engineered Dataset

The engineered features and the target are recombined and written to disk.

In [15]:
feature_eng_df = pd.concat([X, y], axis=1)

feature_eng_df.to_csv(
    "../data/processed/feature_engineered_raw.csv",
    index=False
)

feature_eng_df.shape


(15420, 23)
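
One thing to keep in mind when reloading this file: `pd.cut` produces an ordered categorical column, but CSV stores only the label text, so the ordering is lost on a round trip. The sketch below illustrates this with an in-memory buffer standing in for the file path and a three-row toy frame (both assumptions), and shows one way to restore the order with `pd.Categorical`.

```python
import io

import pandas as pd

age_labels = ["Young", "Early_Adult", "Mid_Age", "Senior", "Elder"]

# Toy frame for illustration only (assumed values, not from the dataset)
frame = pd.DataFrame({"Age": [22, 40, 70]})
frame["Age_Group"] = pd.cut(
    frame["Age"], bins=[0, 25, 35, 50, 65, 100], labels=age_labels
)

# Round-trip through CSV using an in-memory buffer instead of a real file
buffer = io.StringIO()
frame.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# The categorical dtype does not survive the round trip
dtype_after_reload = reloaded["Age_Group"].dtype
print(frame["Age_Group"].dtype, "->", dtype_after_reload)

# Restore the ordered categorical from the known label order
reloaded["Age_Group"] = pd.Categorical(
    reloaded["Age_Group"], categories=age_labels, ordered=True
)
print(reloaded["Age_Group"].cat.ordered)
```

The same restoration would apply to `Deductible_Bin` if a later step relies on the bin order.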