# Feature Engineering for Insurance Fraud Detection

This notebook builds on the cleaned dataset and prepares it for machine learning by:

- Extracting **time-based features** from `incident_date`
- Creating a custom **fraud signal** (`fraud_weight`)
- Applying **one-hot encoding** to key categorical variables
- Saving the final dataset as a model-ready CSV for training

These steps give the model inputs that are easier to learn from and easier to interpret than the raw columns.

In [None]:
# Import essential libraries
import pandas as pd
import numpy as np
import os

from IPython.display import Markdown

# Define project directory path
project_dir = os.path.abspath(os.path.join(os.getcwd(), '..'))

# Load cleaned dataset from ETL output
clean_data_path = os.path.join(project_dir, 'data', 'processed', 'cleaned_insurance_claims.csv')
df = pd.read_csv(clean_data_path)

# Confirm successful load
print("Loaded cleaned dataset with shape:", df.shape)
df.head()

## 1. Create Time-Based Features

If the `incident_date` column exists, extract the following features:

- **`incident_year`** – Year of the incident (e.g., 2014)
- **`incident_month`** – Month of the incident (1–12)
- **`incident_dayofweek`** – Day of the week (0 = Monday, 6 = Sunday)
- **`incident_day`** – Day of the month (1–31)
- **`incident_is_weekend`** – Binary flag: 1 if the incident occurred on a weekend (Saturday/Sunday), 0 otherwise

> These derived features capture seasonal, weekly, and behavioral patterns in claim timings, which may be linked to fraud likelihood.

In [None]:
# Check and convert 'incident_date' column
if 'incident_date' in df.columns:
    df['incident_date'] = pd.to_datetime(df['incident_date'], errors='coerce')

    # Display missing/invalid date conversions
    num_missing_dates = df['incident_date'].isna().sum()
    display(Markdown(f"**Missing or invalid incident_date rows:** {num_missing_dates}"))

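    # Drop rows with invalid dates (optional: the ETL step already cleans these, but I like to double-check)
    # .copy() makes the result an independent DataFrame, so the column assignments below do not trigger SettingWithCopyWarning
    df = df.dropna(subset=['incident_date']).copy()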

    # Extract parts of date
    df['incident_year'] = df['incident_date'].dt.year
    df['incident_month'] = df['incident_date'].dt.month
    df['incident_dayofweek'] = df['incident_date'].dt.dayofweek
    df['incident_day'] = df['incident_date'].dt.day

    # Create a weekend flag
    df['incident_is_weekend'] = df['incident_dayofweek'].isin([5, 6]).astype(int)
    print("Time-based features created:", ['incident_year', 'incident_month', 'incident_dayofweek', 'incident_day', 'incident_is_weekend'])
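
A quick way to sanity-check the conventions above (0 = Monday, weekend = 5 or 6) is to run the same logic on a few dates whose weekday is known. The sketch below is only an illustration: the helper name `add_date_parts` and the toy dates are assumptions, not part of the project code.

```python
import pandas as pd


def add_date_parts(frame: pd.DataFrame, col: str = "incident_date") -> pd.DataFrame:
    """Hypothetical helper: return a copy of `frame` with calendar features from `col`."""
    out = frame.copy()
    # Unparseable values become NaT instead of raising an error
    out[col] = pd.to_datetime(out[col], errors="coerce")
    # Drop rows whose date could not be parsed
    out = out.dropna(subset=[col]).copy()

    out["incident_year"] = out[col].dt.year
    out["incident_month"] = out[col].dt.month
    out["incident_dayofweek"] = out[col].dt.dayofweek  # 0 = Monday, 6 = Sunday
    out["incident_day"] = out[col].dt.day
    out["incident_is_weekend"] = out["incident_dayofweek"].isin([5, 6]).astype(int)
    return out


# Toy data (made up): a Saturday, a Monday, and one invalid value
toy = pd.DataFrame({"incident_date": ["2015-01-03", "2015-01-05", "not a date"]})
toy_features = add_date_parts(toy)
print(toy_features[["incident_date", "incident_dayofweek", "incident_is_weekend"]])
```

The invalid row is dropped, the Saturday gets `incident_dayofweek = 5` with the weekend flag set, and the Monday gets `0` with the flag cleared.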

## 3. Create a Fraud Signal Feature

This feature captures the interaction between the monetary value of a claim and its risk score.
Higher `fraud_weight` values may correlate with suspicious high-risk, high-value claims.

In [None]:
# Create a new feature that combines claim amount with risk
df['fraud_weight'] = df['total_claim_amount'] * df['risk_score']

# Preview to verify
df[['total_claim_amount', 'risk_score', 'fraud_weight']].head()
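
To see how the product behaves, here is a minimal sketch on made-up numbers. It assumes `risk_score` lies between 0 and 1, which this notebook does not confirm, and the helper name `add_fraud_weight` is hypothetical. The helper also fails early if either input column is missing, which the cell above does not do.

```python
import pandas as pd


def add_fraud_weight(
    frame: pd.DataFrame,
    amount_col: str = "total_claim_amount",
    risk_col: str = "risk_score",
) -> pd.DataFrame:
    """Hypothetical helper: add `fraud_weight` = claim amount * risk score."""
    missing = [c for c in (amount_col, risk_col) if c not in frame.columns]
    if missing:
        raise KeyError(f"Missing required columns: {missing}")
    out = frame.copy()
    out["fraud_weight"] = out[amount_col] * out[risk_col]
    return out


# Toy values (made up): a small claim, then two large claims with low and high risk
toy = pd.DataFrame({
    "total_claim_amount": [1000.0, 40000.0, 40000.0],
    "risk_score": [0.5, 0.25, 0.75],
})
toy_weighted = add_fraud_weight(toy)
print(toy_weighted)
```

The two large claims share the same amount but end up with different weights (10000.0 vs. 30000.0), while the small claim stays low even with a moderate risk score. Note that the feature carries the currency scale of the claim amount, so scale-sensitive models may need it standardized.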

## 4. One-Hot Encode Key Categorical Features

To prepare for machine learning, I convert selected categorical features into binary indicator columns.
I use `drop_first=True`, which drops one level per feature so that the remaining indicators are not perfectly collinear. This matters mainly for linear models.

In [None]:
# Define categorical columns to encode (from EDA insights)
categorical_cols = ['incident_type', 'collision_type', 'police_report_available']

# Perform one-hot encoding
df_encoded = pd.get_dummies(df, columns=categorical_cols, drop_first=True)

# Confirm new shape
print("One-hot encoding complete. New dataset shape:", df_encoded.shape)
df_encoded.head()
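
The effect of `drop_first=True` is easiest to see on a tiny example. The category values below are made up for illustration. The sketch also shows a side effect worth knowing: with the default `dummy_na=False`, a missing value is encoded as all zeros, exactly like the dropped baseline level.

```python
import pandas as pd

# Toy data (made up): three categories plus one missing value
toy = pd.DataFrame({
    "incident_type": ["Parked Car", "Single Vehicle Collision", "Vehicle Theft", None],
})

encoded = pd.get_dummies(toy, columns=["incident_type"], drop_first=True)
dummy_cols = [c for c in encoded.columns if c.startswith("incident_type_")]

# The first level in sorted order ("Parked Car") has no column of its own
print(dummy_cols)

# Number of active indicators per row: 0 for the baseline level AND for the missing value
print(encoded[dummy_cols].sum(axis=1).tolist())
```

If missing values should remain distinguishable from the baseline, one option is to fill them with an explicit label such as `"Unknown"` before encoding, or to pass `dummy_na=True`.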

## 5. Save Engineered Dataset

Save the final dataset to the `data/features/` folder. This file is the input for the next notebook, `model_training.ipynb`.

In [None]:
# Define output path
engineered_dir = os.path.join(project_dir, "data", "features")
os.makedirs(engineered_dir, exist_ok=True)  # This will create the directory if it doesn't exist

# Save to CSV
feature_file = os.path.join(engineered_dir, "engineered_insurance_claims.csv")
df_encoded.to_csv(feature_file, index=False)
print(f"Feature dataset saved to: {feature_file}")
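
One thing to keep in mind when loading this file later: CSV does not store dtypes, and the encoded frame still contains `incident_date`, so that column comes back as plain text unless it is parsed again. The sketch below illustrates this with a made-up two-row frame written to a temporary directory; the file name is hypothetical.

```python
import os
import tempfile

import pandas as pd

# Toy frame (made up) with a real datetime column
toy = pd.DataFrame({
    "incident_date": pd.to_datetime(["2015-01-03", "2015-01-05"]),
    "fraud_weight": [500.0, 10000.0],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_features.csv")  # hypothetical file name
    toy.to_csv(path, index=False)

    # Default read: the date column is loaded as text
    reloaded_plain = pd.read_csv(path)
    # Asking pandas to parse the column restores the datetime dtype
    reloaded_dates = pd.read_csv(path, parse_dates=["incident_date"])

print(reloaded_plain.dtypes)
print(reloaded_dates.dtypes)
```

In the training notebook, either pass `parse_dates=['incident_date']` to `pd.read_csv` or drop the raw date column before fitting, since the derived date parts are already in the dataset.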