# Step 4: Data Preprocessing & Encoding

This step transforms the raw dataset into a model-ready format based on insights from exploratory data analysis.
Preprocessing decisions are driven by statistical validation, fraud-rate analysis, and feature stability considerations.

Key objectives:
- Remove low-value and leakage-prone features
- Handle rare categories
- Encode categorical variables appropriately
- Prepare a clean dataset for model training


In [1]:
# Import libraries
import pandas as pd
import numpy as np

# Load raw data
df = pd.read_csv("../data/raw/vehicle_claim_fraud.csv")

df.shape


(15420, 33)

### Feature Removal

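The features dropped in the next cell fall into three groups, each removed for low predictive value, high sparsity, or risk of data leakage:
- Identifier columns (`PolicyNumber`, `RepNumber`)
- Rare or weak categorical attributes (`WitnessPresent`, `PoliceReportFiled`, `NumberOfCars`)
- Low-signal temporal features (`DayOfWeek`, `Month`, `WeekOfMonth`, `WeekOfMonthClaimed`, `Year`)

Removing them is intended to reduce noise and improve model generalization.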
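
One way to judge whether a categorical feature carries signal is to compare the fraud rate across its levels alongside how many rows each level covers. The sketch below does this on a small made-up frame; the values are invented for illustration, and `fraud_rate_by_level` is a hypothetical helper. Only the column names `WitnessPresent` and `FraudFound_P` come from this notebook.

```python
import pandas as pd

# Toy data for illustration only; values are invented, not taken from the dataset.
toy = pd.DataFrame({
    "WitnessPresent": ["No", "No", "No", "No", "No", "No", "No", "Yes"],
    "FraudFound_P":   [0,    1,    0,    0,    0,    0,    1,    0],
})

def fraud_rate_by_level(frame, col, target="FraudFound_P"):
    """Return row count, fraud rate, and share of rows for each level of `col`."""
    summary = frame.groupby(col)[target].agg(n="count", fraud_rate="mean")
    summary["share"] = summary["n"] / len(frame)
    return summary.sort_values("n", ascending=False)

summary = fraud_rate_by_level(toy, "WitnessPresent")
print(summary)
```

In this toy frame the "Yes" level covers a single row, so its fraud rate cannot be estimated reliably. A feature dominated by one level in this way is the kind of sparse attribute that is a candidate for removal.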


In [None]:
# Drop Confirmed Low-Value / Leakage Features
drop_features = [
    'PolicyNumber', 'RepNumber',
    'WitnessPresent', 'PoliceReportFiled', 'NumberOfCars',
    'DayOfWeek', 'Month', 'WeekOfMonth', 'WeekOfMonthClaimed',
    'Year'
]

df = df.drop(columns=drop_features)

df.shape


(15420, 23)

In [None]:
# Separate Target Variable
X = df.drop('FraudFound_P', axis=1)
y = df['FraudFound_P']

X.shape, y.shape


((15420, 22), (15420,))

### Rare Category Handling

Categories with very low frequency were grouped into an "Other" category to improve feature stability and reduce overfitting risk.
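
To make the effect concrete, here is a minimal sketch of the same idea on a toy series. `group_rare` is a hypothetical, non-mutating helper, the car makes are made up, and the 10% threshold is chosen only so that the effect is visible on 20 rows.

```python
import pandas as pd

def group_rare(series, threshold=0.01, other_label="Other"):
    """Return a copy of `series` with levels rarer than `threshold` relabeled."""
    freq = series.value_counts(normalize=True)
    rare = freq[freq < threshold].index
    # Series.where keeps values where the condition holds and uses other_label elsewhere.
    return series.where(~series.isin(rare), other_label)

# Toy data: two dominant levels, plus two levels that each appear once in 20 rows (5%).
makes = pd.Series(["Honda"] * 10 + ["Toyota"] * 8 + ["Lexus", "Ferrari"])
grouped = group_rare(makes, threshold=0.10)

print(grouped.value_counts())
```

If the data is later split into training and test sets, computing the rare levels on the training portion only avoids using test-set frequencies during preprocessing.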


In [4]:
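# Select the categorical (object-typed) columns
categorical_cols = X.select_dtypes(include='object').columns

def handle_rare_categories(df, col, threshold=0.01):
    """Relabel levels of `col` that cover less than `threshold` of rows as 'Other'."""
    freq = df[col].value_counts(normalize=True)
    rare_labels = freq[freq < threshold].index
    # Note: this modifies `df` in place and also returns it
    df[col] = df[col].replace(rare_labels, 'Other')
    return df

for col in categorical_cols:
    X = handle_rare_categories(X, col)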


In [6]:
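# Encode categorical features
# Encoding strategy:
#   Nominal features → one-hot encoding
#   Ordinal / ordered features → label encoding
# Note: this cell one-hot encodes every remaining categorical column with
# pd.get_dummies; no separate label-encoding step is applied here.
# drop_first=True drops one level per feature to avoid redundant columns.

X_encoded = pd.get_dummies(X, drop_first=True)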

X_encoded.shape


(15420, 73)
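
The ordinal branch of the encoding strategy is not exercised in the cell above. As a minimal sketch of how an ordered feature could be mapped to integer codes, assuming a hypothetical column `AgeBand` with made-up levels, an explicit category order can be passed to `pd.Categorical`:

```python
import pandas as pd

# Hypothetical ordered feature; the column name and levels are assumptions for illustration.
toy = pd.DataFrame(
    {"AgeBand": ["new", "3 to 5 years", "more than 7", "new", "6 to 7 years"]}
)

# The order must be stated explicitly; alphabetical order would be wrong here.
age_order = ["new", "3 to 5 years", "6 to 7 years", "more than 7"]

cat = pd.Categorical(toy["AgeBand"], categories=age_order, ordered=True)
# Codes follow the position in age_order; a value not listed there gets -1.
toy["AgeBand_code"] = cat.codes

print(toy)
```

Stating the order by hand keeps the integer codes meaningful, which an automatic alphabetical labeling would not guarantee.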

### Numerical Features

Numerical features were retained in their original scale, as tree-based models do not require feature scaling.
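
A tree split only asks whether a feature value falls above or below a threshold, so any order-preserving rescaling leads to the same partition of the samples. The sketch below illustrates this with a hand-written single-split search in NumPy. It is a simplified stand-in, not the algorithm of any particular library, and the function names and synthetic data are assumptions.

```python
import numpy as np

def gini(labels):
    """Gini impurity of a binary 0/1 label array."""
    if len(labels) == 0:
        return 0.0
    p = labels.mean()
    return 2.0 * p * (1.0 - p)

def best_split_partition(x, y):
    """Find the threshold on x that minimizes weighted Gini; return the left-side mask."""
    values = np.unique(x)
    # Candidate thresholds: midpoints between consecutive distinct values.
    candidates = (values[:-1] + values[1:]) / 2.0
    best_score, best_mask = np.inf, None
    for t in candidates:
        left = x <= t
        score = (left.sum() * gini(y[left]) + (~left).sum() * gini(y[~left])) / len(y)
        if score < best_score:
            best_score, best_mask = score, left
    return best_mask

# Synthetic feature and label, for illustration only.
rng = np.random.default_rng(0)
x = rng.integers(0, 100, size=200).astype(float)
y = (x + rng.normal(0, 20, size=200) > 60).astype(int)

raw_mask = best_split_partition(x, y)
scaled_mask = best_split_partition((x - x.mean()) / x.std(), y)  # standardized copy

# Compare which samples land on the left side of the chosen split.
print(np.array_equal(raw_mask, scaled_mask))
```

The threshold value itself differs between the raw and standardized feature, but the set of samples on each side of the split does not.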


In [7]:
# Final sanity checks
X_encoded.isnull().sum().sum(), X_encoded.shape


(np.int64(0), (15420, 73))
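
Beyond counting missing values, a few further checks can catch common preprocessing slips. The following is a sketch under assumptions: `check_model_ready` is a hypothetical helper, and the toy frame merely stands in for `X_encoded` and `y`.

```python
import pandas as pd
from pandas.api.types import is_bool_dtype, is_numeric_dtype

def check_model_ready(features, target):
    """Raise AssertionError if the feature matrix fails basic readiness checks."""
    assert features.isnull().sum().sum() == 0, "features contain missing values"
    assert features.columns.is_unique, "duplicate column names"
    assert features.index.equals(target.index), "features and target are misaligned"
    non_numeric = [
        c for c in features.columns
        if not (is_numeric_dtype(features[c]) or is_bool_dtype(features[c]))
    ]
    assert not non_numeric, f"non-numeric columns remain: {non_numeric}"
    return True

# Toy stand-ins for X_encoded and y; values are illustrative.
toy_X = pd.get_dummies(
    pd.DataFrame({"Age": [34, 51, 27], "Sex": ["Male", "Female", "Male"]}),
    drop_first=True,
)
toy_y = pd.Series([0, 1, 0], name="FraudFound_P")

print(check_model_ready(toy_X, toy_y))
```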

In [8]:
# Save the processed dataset

processed_df = pd.concat([X_encoded, y], axis=1)

processed_df.to_csv("../data/processed/cleaned_data.csv", index=False)

processed_df.shape


(15420, 74)
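
A quick round-trip check can confirm that the saved file reads back with the same shape and column order. The sketch below writes to an in-memory buffer instead of disk so that it is self-contained; the toy frame and its values are assumptions standing in for `processed_df`.

```python
import io
import pandas as pd

# Toy stand-in for processed_df; values are illustrative.
toy_processed = pd.DataFrame({
    "Age": [34, 51, 27],
    "Sex_Male": [True, False, True],
    "FraudFound_P": [0, 1, 0],
})

buffer = io.StringIO()
toy_processed.to_csv(buffer, index=False)  # same options as the save cell, written to memory
buffer.seek(0)
restored = pd.read_csv(buffer)

# Shape and column order should survive the round trip.
same_shape = restored.shape == toy_processed.shape
same_columns = list(restored.columns) == list(toy_processed.columns)
print(same_shape, same_columns)
```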

## Step 4 Summary

- Removed low-value, sparse, and leakage-prone features
- Handled rare categorical levels to improve stability
- One-hot encoded the categorical features (`pd.get_dummies` with `drop_first=True`) and kept numerical features on their original scale
- Created a clean, model-ready dataset

The processed dataset is now ready for model training and evaluation.
