# Feature Engineering for Insurance Fraud Detection

This notebook builds on the cleaned dataset and prepares it for modeling by:

- Extracting meaningful **time-based features** from `incident_date`
- Creating a custom **fraud risk signal** (`fraud_weight`)
- Applying **One-Hot Encoding** to relevant categorical variables
- Saving the transformed dataset as a **model-ready CSV** for training and evaluation

Well-designed features can improve both model performance and interpretability in fraud detection.

In [None]:
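# Import Required Libraries
import pandas as pd
import numpy as np
import os

# Assumes `df` (the cleaned dataset) and `project_dir` are already defined
# before the cells below run.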

## 1. Create Time-Based Features

If the `incident_date` column exists, extract the following features:

- **`incident_year`** – Year of the incident (e.g., 2014)
- **`incident_month`** – Month of the incident (1–12)
- **`incident_dayofweek`** – Day of the week (0 = Monday, 6 = Sunday)
- **`incident_day`** – Day of the month (1–31)
- **`incident_is_weekend`** – Binary flag: 1 if the incident occurred on a weekend (Saturday/Sunday), 0 otherwise


In [None]:
# If 'incident_date' exists, convert and extract date-related features
if 'incident_date' in df.columns:
    df['incident_date'] = pd.to_datetime(df['incident_date'], errors='coerce')

    # Extract parts of the date into new columns
    df['incident_year'] = df['incident_date'].dt.year
    df['incident_month'] = df['incident_date'].dt.month
    df['incident_dayofweek'] = df['incident_date'].dt.dayofweek
    df['incident_day'] = df['incident_date'].dt.day

    # Binary feature: 1 if incident was on weekend (Saturday or Sunday), else 0.
    # Rows whose date could not be parsed (NaT after coercion) are also set to 0.
    df['incident_is_weekend'] = df['incident_dayofweek'].isin([5, 6]).astype(int)
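
Because the cell above uses `errors='coerce'`, unparseable dates become `NaT` and end up with `incident_is_weekend = 0`, which makes them indistinguishable from weekday incidents. A minimal sketch on toy data (the dates and the `incident_date_missing` column are hypothetical, not part of the original notebook) shows this behaviour and one possible way to keep the missing-date information:

```python
import pandas as pd

# Toy data (hypothetical): a Saturday, a Monday, and an unparseable entry.
toy = pd.DataFrame({"incident_date": ["2015-01-24", "2015-01-26", "not a date"]})
toy["incident_date"] = pd.to_datetime(toy["incident_date"], errors="coerce")

# Same logic as the notebook cell: NaT yields NaN for dayofweek,
# and NaN is never in [5, 6], so the weekend flag becomes 0.
toy["incident_dayofweek"] = toy["incident_date"].dt.dayofweek
toy["incident_is_weekend"] = toy["incident_dayofweek"].isin([5, 6]).astype(int)

# Optional extra flag (an assumption, not in the original notebook):
# records that the date was missing or could not be parsed.
toy["incident_date_missing"] = toy["incident_date"].isna().astype(int)

print(toy)
```

Whether a missing-date flag is worth keeping depends on how many dates fail to parse in the real dataset.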

## 2. Add Interaction Feature

Create `fraud_weight` as the product of `total_claim_amount` and `risk_score`, so that claims with both a large amount and a high risk score receive the highest values.

In [None]:
# Multiply claim amount by risk score to create a weighted fraud risk metric
df['fraud_weight'] = df['total_claim_amount'] * df['risk_score']

# Preview key fraud-related features
df[['total_claim_amount', 'risk_score', 'fraud_weight']].head()
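
As a worked illustration of the product above, the following sketch uses toy values (hypothetical, including the assumption that `risk_score` lies between 0 and 1, which the notebook does not state). It also shows that a missing value in either input propagates to `fraud_weight`:

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values); the last row has a missing risk score.
toy = pd.DataFrame({
    "total_claim_amount": [1000.0, 5000.0, 2000.0],
    "risk_score": [0.5, 0.25, np.nan],
})

# Element-wise product, as in the notebook cell.
toy["fraud_weight"] = toy["total_claim_amount"] * toy["risk_score"]

# Count rows where the interaction feature could not be computed.
n_missing = int(toy["fraud_weight"].isna().sum())
print(toy)
print("Rows with missing fraud_weight:", n_missing)
```

If the real data contains missing values in either column, they may need to be imputed or dropped before training, depending on the model used.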

## 3. Encode Categorical Features

Apply **One-Hot Encoding** to convert selected categorical columns into binary indicator (dummy) variables:

- Columns to encode: `incident_type`, `collision_type`, `police_report_available`
- Drop the first category of each column to avoid multicollinearity (`drop_first=True`)

In [None]:
# Define categorical columns to encode (selected during EDA)
categorical_cols = ['incident_type', 'collision_type', 'police_report_available']

# Apply one-hot encoding, dropping the first level to avoid multicollinearity
df_encoded = pd.get_dummies(df, columns=categorical_cols, drop_first=True)
print("One-hot encoding complete. New shape:", df_encoded.shape)
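
To make the effect of `drop_first=True` concrete, here is a small sketch on toy data. The category values are hypothetical, and `dtype=int` is an optional choice (not used in the cell above) that produces 0/1 columns instead of booleans:

```python
import pandas as pd

# Toy data with three hypothetical categories.
toy = pd.DataFrame({"collision_type": ["Front", "Rear", "Side", "Rear"]})

# Categories are ordered alphabetically, so "Front" is dropped and
# becomes the implicit baseline (all indicator columns equal to 0).
encoded = pd.get_dummies(toy, columns=["collision_type"], drop_first=True, dtype=int)

print(encoded.columns.tolist())
print(encoded)
```

A column with k categories therefore yields k - 1 indicator columns, and the dropped category is identified by all of them being zero.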

## 4. Save Final Feature Dataset

Save the fully transformed dataset, including the engineered features and encoded variables, as a CSV file.
This dataset will serve as the input for the upcoming model training and evaluation steps.

In [None]:
# Define folder path to save feature set
engineered_dir = os.path.join(project_dir, "data", "features")
os.makedirs(engineered_dir, exist_ok=True)  # Create if it doesn't exist

# Save final engineered feature set to CSV
feature_file = os.path.join(engineered_dir, "engineered_insurance_claims.csv")
df_encoded.to_csv(feature_file, index=False)
print(f"Feature dataset saved to:\n{feature_file}")
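
A quick round-trip check can confirm that the saved file reloads with the expected shape and columns. The sketch below writes a toy frame to a temporary directory rather than the real `project_dir`; the file name and values are hypothetical:

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for the engineered dataset (hypothetical values).
toy = pd.DataFrame({"fraud_weight": [500.0, 1250.0], "incident_is_weekend": [1, 0]})

with tempfile.TemporaryDirectory() as tmp_dir:
    # Mirror the notebook's folder layout inside a temporary directory.
    out_dir = os.path.join(tmp_dir, "data", "features")
    os.makedirs(out_dir, exist_ok=True)
    out_file = os.path.join(out_dir, "toy_features.csv")

    # Save without the index, then read the file back.
    toy.to_csv(out_file, index=False)
    reloaded = pd.read_csv(out_file)

same_shape = reloaded.shape == toy.shape
same_columns = list(reloaded.columns) == list(toy.columns)
print("Shape preserved:", same_shape, "| Columns preserved:", same_columns)
```

Note that CSV does not store dtypes: a datetime column such as `incident_date` is read back as plain strings unless `parse_dates` is passed to `pd.read_csv`.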