# Feature Engineering for Insurance Fraud Detection

This notebook builds on the cleaned dataset and prepares it for machine learning by:

- Extracting meaningful **time-based features** from `incident_date`
- Creating a custom **fraud signal** (`fraud_weight`)
- Applying **One-Hot Encoding** to key categorical variables
- Saving the final dataset as a model-ready CSV for training

> These steps are intended to improve model performance and interpretability.

In [1]:
# Import essential libraries
import pandas as pd
import numpy as np
import os

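# display is imported explicitly so the cells also run outside an interactive IPython session
from IPython.display import Markdown, display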

# Define project directory path
project_dir = os.path.abspath(os.path.join(os.getcwd(), '..'))

# Load cleaned dataset from ETL output
clean_data_path = os.path.join(project_dir, 'data', 'cleaned', 'cleaned_insurance_claims.csv')
df = pd.read_csv(clean_data_path)

# Confirm successful load
print("Loaded cleaned dataset with shape:", df.shape)
df.head()

Loaded cleaned dataset with shape: (1000, 46)


Unnamed: 0,months_as_customer,age,policy_number,policy_bind_date,policy_state,policy_csl,policy_deductable,policy_annual_premium,umbrella_limit,insured_zip,...,auto_model,auto_year,fraud_reported,collision_type_missing_flag,police_report_available_missing_flag,property_damage_missing_flag,authorities_contacted_missing_flag,policy_csl_min,policy_csl_max,risk_score
0,328,48,521585,2014-10-17,OH,250/500,1000,1406.91,0,466132,...,92X,2004,1,0,0,0,0,250,500,2
1,228,42,342868,2006-06-27,IN,250/500,2000,1197.22,5000000,468176,...,E400,2007,1,0,0,0,0,250,500,0
2,134,29,687698,2000-09-06,OH,100/300,2000,1413.14,5000000,430632,...,RAM,2007,0,0,0,0,0,100,300,2
3,256,41,227811,1990-05-25,IL,250/500,2000,1415.74,6000000,608117,...,TAHOE,2014,1,0,0,0,0,250,500,3
4,228,44,367455,2014-06-06,IL,500/1000,1000,1583.91,6000000,610706,...,RSX,2009,0,0,0,0,0,500,1000,1


### 1. Create Time-Based Features

The `incident_date` column contains the date on which each insurance claim incident occurred. Temporal patterns can be informative for fraud detection, as fraud rates may vary by season, by day of the week, or between weekdays and weekends.

This section extracts multiple numeric features from the raw date:

- **`incident_year`** — The calendar year of the incident (e.g., 2014)
- **`incident_month`** — The calendar month (1–12)
- **`incident_dayofweek`** — Day of the week as an integer (0 = Monday, 6 = Sunday)
- **`incident_day`** — Day of the month (1–31)
- **`incident_is_weekend`** — Binary flag indicating whether the incident occurred on a weekend (Saturday or Sunday)

These features will allow the model to capture potential seasonal, weekly, and weekend-related fraud trends.

Additionally, the code will validate and clean the date column by coercing invalid entries to `NaT` and removing any rows with missing or invalid dates, ensuring robust downstream feature extraction.

In [2]:
# Check if 'incident_date' exists and convert to datetime
if 'incident_date' in df.columns:
    df['incident_date'] = pd.to_datetime(df['incident_date'], errors='coerce')

    # Count and display number of missing/invalid dates after conversion
    num_missing_dates = df['incident_date'].isna().sum()
    display(Markdown(f"**Missing or invalid incident_date rows:** {num_missing_dates}"))

    # Drop rows where date conversion failed (optional, but recommended)
    df = df.dropna(subset=['incident_date'])

    # Extract date components as numeric features
    df['incident_year'] = df['incident_date'].dt.year
    df['incident_month'] = df['incident_date'].dt.month
    df['incident_dayofweek'] = df['incident_date'].dt.dayofweek
    df['incident_day'] = df['incident_date'].dt.day

    # Create binary weekend flag (Saturday=5, Sunday=6)
    df['incident_is_weekend'] = df['incident_dayofweek'].isin([5, 6]).astype(int)

    print("Time-based features created:", [
        'incident_year', 'incident_month', 'incident_dayofweek', 'incident_day', 'incident_is_weekend'
    ])
else:
    print("Column 'incident_date' not found in dataset.")

**Missing or invalid incident_date rows:** 0

Time-based features created: ['incident_year', 'incident_month', 'incident_dayofweek', 'incident_day', 'incident_is_weekend']
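As a sanity check on the extraction logic, here is a minimal sketch on a hypothetical toy frame (the three dates are made up for illustration, and the third is deliberately invalid). It shows how `errors='coerce'` turns an unparseable entry into `NaT` and confirms that the derived values fall in the expected ranges.

```python
import pandas as pd

# Hypothetical toy data standing in for the claims table; the last entry is invalid on purpose.
toy = pd.DataFrame({"incident_date": ["2015-01-25", "2015-02-17", "not a date"]})

# Unparseable strings become NaT instead of raising an error.
toy["incident_date"] = pd.to_datetime(toy["incident_date"], errors="coerce")
n_invalid = int(toy["incident_date"].isna().sum())

# Drop the invalid rows; .copy() gives an independent frame for the new columns.
toy = toy.dropna(subset=["incident_date"]).copy()

toy["incident_month"] = toy["incident_date"].dt.month
toy["incident_dayofweek"] = toy["incident_date"].dt.dayofweek  # 0 = Monday, 6 = Sunday
toy["incident_day"] = toy["incident_date"].dt.day
toy["incident_is_weekend"] = toy["incident_dayofweek"].isin([5, 6]).astype(int)

# Range checks on the derived features.
assert toy["incident_month"].between(1, 12).all()
assert toy["incident_dayofweek"].between(0, 6).all()
assert toy["incident_day"].between(1, 31).all()
assert set(toy["incident_is_weekend"]).issubset({0, 1})

print("Invalid dates:", n_invalid)
print(toy[["incident_dayofweek", "incident_day", "incident_is_weekend"]])
```

The same range assertions could be run against `df` after the cell above; they are cheap and catch a silently mis-parsed date format early.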


### 2. Create a Fraud Signal Feature

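This feature captures the interaction between the monetary value of a claim and its risk score: `fraud_weight = total_claim_amount * risk_score`.
Higher `fraud_weight` values may correlate with suspicious high-risk, high-value claims. Note that a `risk_score` of 0 yields a `fraud_weight` of 0 regardless of the claim amount.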

In [3]:
# Create a new feature that combines claim amount with risk
df['fraud_weight'] = df['total_claim_amount'] * df['risk_score']

# Preview to verify
df[['total_claim_amount', 'risk_score', 'fraud_weight']].head()

| | total_claim_amount | risk_score | fraud_weight |
|---|---|---|---|
| 0 | 71610 | 2 | 143220 |
| 1 | 5070 | 0 | 0 |
| 2 | 34650 | 2 | 69300 |
| 3 | 63400 | 3 | 190200 |
| 4 | 6500 | 1 | 6500 |
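To make the arithmetic concrete, the sketch below recomputes `fraud_weight` for the first three preview rows above and counts how many rows end up with a weight of zero. The frame name `sample` is illustrative only; it is not part of the notebook's pipeline.

```python
import pandas as pd

# Values copied from the first three rows of the preview above.
sample = pd.DataFrame({
    "total_claim_amount": [71610, 5070, 34650],
    "risk_score": [2, 0, 2],
})

# Same interaction as in the notebook: claim amount multiplied by risk score.
sample["fraud_weight"] = sample["total_claim_amount"] * sample["risk_score"]

# A risk_score of 0 zeroes out the weight, whatever the claim amount is.
zero_weight_rows = int((sample["fraud_weight"] == 0).sum())

print(sample)
print("Rows with zero fraud_weight:", zero_weight_rows)
```

Because the product collapses every zero-risk claim to 0, it may be worth checking how many rows in the full dataset are affected before relying on this feature alone.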


### 3. One-Hot Encode Key Categorical Features

To prepare for machine learning, I converted selected categorical features into binary indicator columns.
I used `drop_first=True`, which drops the first category of each feature as an implicit baseline, to avoid multicollinearity in linear models.

In [4]:
# Define categorical columns to encode (from EDA insights)
categorical_cols = ['incident_type', 'collision_type', 'police_report_available']

# Perform one-hot encoding
df_encoded = pd.get_dummies(df, columns=categorical_cols, drop_first=True)

# Confirm new shape
print("One-hot encoding complete. New dataset shape:", df_encoded.shape)
df_encoded.head()

One-hot encoding complete. New dataset shape: (1000, 55)


Unnamed: 0,months_as_customer,age,policy_number,policy_bind_date,policy_state,policy_csl,policy_deductable,policy_annual_premium,umbrella_limit,insured_zip,...,incident_dayofweek,incident_day,incident_is_weekend,fraud_weight,incident_type_PARKED CAR,incident_type_SINGLE VEHICLE COLLISION,incident_type_VEHICLE THEFT,collision_type_REAR COLLISION,collision_type_SIDE COLLISION,police_report_available_YES
0,328,48,521585,2014-10-17,OH,250/500,1000,1406.91,0,466132,...,6,25,1,143220,False,True,False,False,True,True
1,228,42,342868,2006-06-27,IN,250/500,2000,1197.22,5000000,468176,...,2,21,0,0,False,False,True,False,False,False
2,134,29,687698,2000-09-06,OH,100/300,2000,1413.14,5000000,430632,...,6,22,1,69300,False,False,False,True,False,False
3,256,41,227811,1990-05-25,IL,250/500,2000,1415.74,6000000,608117,...,5,10,1,190200,False,True,False,False,False,False
4,228,44,367455,2014-06-06,IL,500/1000,1000,1583.91,6000000,610706,...,1,17,0,6500,False,False,True,False,False,False
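The effect of `drop_first=True` is easier to see on a small example. The sketch below uses a hypothetical toy column (the category labels are illustrative, not taken from the dataset) and compares the encoded columns with and without the option.

```python
import pandas as pd

# Hypothetical toy column; the category labels are illustrative only.
toy = pd.DataFrame({"collision_type": ["FRONT", "REAR", "SIDE", "REAR"]})

# Full encoding: one indicator column per category.
full = pd.get_dummies(toy, columns=["collision_type"])

# Reduced encoding: the first category (in sorted order) is dropped and becomes the baseline.
reduced = pd.get_dummies(toy, columns=["collision_type"], drop_first=True)

# A row in the baseline category has every remaining indicator set to False.
baseline_row_sum = int(reduced.iloc[0].sum())

print(list(full.columns))
print(list(reduced.columns))
print("Indicators set for the baseline row:", baseline_row_sum)
```

With the baseline dropped, a row whose indicators are all `False` is read as belonging to the omitted category, so the remaining columns are no longer perfectly collinear.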


### 4. Final Cleanup: Drop Raw Date Column

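After extracting all relevant time-based features from the `incident_date` column, I drop the original date column. Because `df_encoded` was created before this step and is the DataFrame saved in the next section, the column is dropped from it as well. This keeps the saved dataset focused on the engineered features that are useful for modeling.

In [5]:
# Drop original 'incident_date' column since all date features have been extracted.
# df_encoded is the frame that gets saved, so the column must be dropped there too.
df = df.drop(columns=['incident_date'], errors='ignore')
df_encoded = df_encoded.drop(columns=['incident_date'], errors='ignore')
print("'incident_date' column dropped before saving feature-engineered dataset.")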

'incident_date' column dropped before saving feature-engineered dataset.
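Since `pd.get_dummies` returns a new DataFrame, a drop applied to one frame does not carry over to the other. The following minimal sketch, on hypothetical toy data, shows this behavior and a simple check that no datetime column remains in the frame that will be saved.

```python
import pandas as pd

# Hypothetical toy data; names mirror the notebook but the values are made up.
toy = pd.DataFrame({
    "incident_date": pd.to_datetime(["2015-01-25", "2015-02-17"]),
    "incident_type": ["PARKED CAR", "VEHICLE THEFT"],
})
toy_encoded = pd.get_dummies(toy, columns=["incident_type"], drop_first=True)

# Dropping from the original frame leaves the encoded copy untouched.
toy = toy.drop(columns=["incident_date"], errors="ignore")
still_in_encoded = "incident_date" in toy_encoded.columns

# Dropping from the encoded frame is what changes the data that gets saved.
toy_encoded = toy_encoded.drop(columns=["incident_date"], errors="ignore")
datetime_cols = toy_encoded.select_dtypes(include=["datetime"]).columns.tolist()

print("Still in encoded frame after first drop:", still_in_encoded)
print("Datetime columns left in encoded frame:", datetime_cols)
```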


### 5. Save Engineered Dataset

Save the final dataset to the `data/features/` folder. It will be used in the next notebook, `model_training.ipynb`.

In [6]:
# Define output path
engineered_dir = os.path.join(project_dir, "data", "features")
os.makedirs(engineered_dir, exist_ok=True)  # This will create the directory if it doesn't exist

# Save to CSV
feature_file = os.path.join(engineered_dir, "engineered_insurance_claims.csv")
df_encoded.to_csv(feature_file, index=False)
print(f"Feature dataset saved to: {feature_file}")

Feature dataset saved to: c:\Users\Cloud\OneDrive\Desktop\Fraud_Analytics_Project\data\features\engineered_insurance_claims.csv
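A quick way to confirm the export is to read the file back and compare it with the frame in memory. The sketch below does this with a hypothetical two-column stand-in for `df_encoded` and a temporary directory, so it does not touch the project's `data/features/` folder.

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for df_encoded; values are illustrative.
toy = pd.DataFrame({
    "fraud_weight": [143220, 0],
    "police_report_available_YES": [True, False],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "engineered_sample.csv")
    # index=False keeps the row index out of the file, as in the notebook.
    toy.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

same_shape = reloaded.shape == toy.shape
same_columns = list(reloaded.columns) == list(toy.columns)

print("Shape preserved:", same_shape)
print("Columns preserved:", same_columns)
```

Applied to the real file, the same comparison between `pd.read_csv(feature_file)` and `df_encoded` would confirm that the row count and column list survived the export.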


# Conclusion and Next Steps

In this notebook, I successfully engineered new features to enhance the model’s ability to detect fraudulent insurance claims. Key activities included:

- Extracting time-based features from the `incident_date` column
- Creating a custom fraud risk signal (`fraud_weight`) based on domain logic
- Applying one-hot encoding to critical categorical variables such as `incident_type`, `collision_type`, and `police_report_available`
- Saving the fully transformed dataset in a clean, model-ready format

These engineered features were designed to improve the model’s ability to capture underlying fraud patterns, particularly temporal trends and categorical context. The missing-data flags carried over from the cleaned dataset remain available alongside them.

The resulting dataset will now be used in the next stage, `model_training.ipynb`, where I will build, evaluate, and save predictive models for fraud detection.