# Step 4: Data Preprocessing

This step performs basic data cleaning and feature pruning based on EDA insights.
No encoding or feature engineering is applied at this stage.
The output retains raw, human-readable categorical features.


In [1]:
# Load the raw data
import pandas as pd

df = pd.read_csv("../data/raw/vehicle_claim_fraud.csv")
df.shape


(15420, 33)

### Feature Removal

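The following features are removed due to low predictive value, high sparsity, or risk of data leakage:
- Identifier columns (`PolicyNumber`, `RepNumber`)
- Rare or weak categorical attributes flagged during EDA (`WitnessPresent`, `PoliceReportFiled`, `NumberOfCars`)
- Low-signal temporal features (`DayOfWeek`, `Month`, `WeekOfMonth`, `WeekOfMonthClaimed`, `Year`)

Dropping these 10 columns takes the dataset from 33 to 23 columns and is intended to reduce noise and help downstream models generalize.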


In [2]:
# Drop Confirmed Low-Value / Leakage Features
drop_features = [
    # Identifiers
    'PolicyNumber', 'RepNumber',

    # Removed after EDA
    'WitnessPresent', 'PoliceReportFiled', 'NumberOfCars',

    # Weak temporal features
    'DayOfWeek', 'Month', 'WeekOfMonth', 'WeekOfMonthClaimed',
    'Year'
]

df = df.drop(columns=drop_features)
df.shape

(15420, 23)
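`DataFrame.drop` raises a `KeyError` by default when a listed column is missing, which can be hard to read when the list is long. The sketch below shows one possible defensive wrapper that reports the missing names up front. The `drop_checked` helper and the toy frame (including the `Make` and `Age` columns and all values) are hypothetical and only serve the illustration.

```python
import pandas as pd


def drop_checked(df: pd.DataFrame, columns: list[str]) -> pd.DataFrame:
    """Drop columns, failing early with a readable message if any are absent."""
    missing = [c for c in columns if c not in df.columns]
    if missing:
        raise KeyError(f"Columns not found in DataFrame: {missing}")
    return df.drop(columns=columns)


# Toy stand-in for the claims data (values are made up for illustration).
toy = pd.DataFrame({
    "PolicyNumber": [1, 2, 3],
    "RepNumber": [7, 7, 9],
    "Make": ["Honda", "Toyota", "Ford"],
    "Age": [34, 51, 27],
})

pruned = drop_checked(toy, ["PolicyNumber", "RepNumber"])
print(pruned.columns.tolist())
```

Because the helper returns a new frame, the original `toy` frame keeps all four columns, which makes the cell safe to re-run.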

----------------------

### Rare Category Handling

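Grouping very low-frequency categories into an "Other" category can improve feature stability and reduce overfitting risk. Because no feature engineering is applied at this stage, the cell below is kept commented out for reference and the saved output retains the original category labels.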


In [None]:
# categorical_cols = X.select_dtypes(include='object').columns

# def handle_rare_categories(df, col, threshold=0.01):
#     freq = df[col].value_counts(normalize=True)
#     rare_labels = freq[freq < threshold].index
#     df[col] = df[col].replace(rare_labels, 'Other')
#     return df

# for col in categorical_cols:
#     X = handle_rare_categories(X, col)
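The grouping logic can be tried in isolation on toy data. The following is a minimal sketch of a non-mutating variant of the commented-out helper. The name `group_rare_categories`, the `other_label` parameter, and the toy `Make` values are illustrative assumptions, not part of the pipeline.

```python
import pandas as pd


def group_rare_categories(df, col, threshold=0.01, other_label="Other"):
    """Return a copy of df where labels in `col` rarer than `threshold` become `other_label`."""
    freq = df[col].value_counts(normalize=True)
    rare_labels = freq[freq < threshold].index
    out = df.copy()
    # Keep values that are not rare; replace the rest with the catch-all label.
    out[col] = out[col].where(~out[col].isin(rare_labels), other_label)
    return out


# Toy data: Honda 60%, Toyota 30%, Lexus 10% (made-up values).
toy = pd.DataFrame({"Make": ["Honda"] * 6 + ["Toyota"] * 3 + ["Lexus"]})

grouped = group_rare_categories(toy, "Make", threshold=0.2)
print(grouped["Make"].value_counts().to_dict())
# {'Honda': 6, 'Toyota': 3, 'Other': 1}
```

If this step is enabled later, computing `rare_labels` on the training split only and reusing them on validation and test data would avoid leaking label frequencies across splits.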


In [None]:
# Encode categorical features (disabled at this stage)
# Encoding strategy:
#   Nominal features → One-Hot Encoding
#   Ordinal / ordered features → Label Encoding

# from sklearn.preprocessing import OneHotEncoder

# X_encoded = pd.get_dummies(X, drop_first=True)

# X_encoded.shape


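The encoding strategy noted in the disabled cell (one-hot for nominal features, integer codes for ordered ones) might look like the sketch below if it is enabled in a later step. The column names `Make` and `PriceBand`, their values, and the chosen category order are hypothetical. An ordered `pd.Categorical` is used in place of scikit-learn's `LabelEncoder`, since `LabelEncoder` assigns integers in sorted label order, which does not necessarily match the intended ranking.

```python
import pandas as pd

# Toy frame with one nominal and one ordinal column (made-up values).
toy = pd.DataFrame({
    "Make": ["Honda", "Toyota", "Honda"],       # nominal: no natural order
    "PriceBand": ["low", "high", "mid"],        # ordinal: low < mid < high
})

# Ordinal feature: integer codes that follow an explicit, stated order.
band_order = ["low", "mid", "high"]
toy["PriceBand"] = pd.Categorical(
    toy["PriceBand"], categories=band_order, ordered=True
).codes

# Nominal feature: one-hot columns, dropping the first level to avoid redundancy.
encoded = pd.get_dummies(toy, columns=["Make"], drop_first=True)
print(encoded.columns.tolist())
```

Values not listed in `band_order` receive the code `-1`, so it is worth checking for that sentinel before training.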

### Numerical Features

Numerical features are kept on their original scale: tree-based models split on feature thresholds, so their splits are unaffected by monotonic rescaling such as standardization.
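To illustrate why scaling is unnecessary here, the sketch below finds the Gini-optimal threshold split for a single feature (a one-level "stump") and shows that the rows sent to each side are the same before and after a standardization-like rescale. The function `best_split_partition` and the toy `ages` / `fraud` values are assumptions made for this example and do not come from the dataset.

```python
def best_split_partition(xs, ys):
    """Return the set of row indices sent left by the Gini-optimal threshold split."""

    def gini(labels):
        # Gini impurity for binary labels (0/1).
        if not labels:
            return 0.0
        p = sum(labels) / len(labels)
        return 2 * p * (1 - p)

    order = sorted(range(len(xs)), key=lambda i: xs[i])
    best_score, best_left = float("inf"), None
    for k in range(1, len(order)):
        if xs[order[k - 1]] == xs[order[k]]:
            continue  # no threshold can separate equal values
        left, right = order[:k], order[k:]
        score = (
            len(left) * gini([ys[i] for i in left])
            + len(right) * gini([ys[i] for i in right])
        ) / len(xs)
        if score < best_score:
            best_score, best_left = score, frozenset(left)
    return best_left


ages = [21, 25, 34, 47, 52, 63]          # made-up feature values
fraud = [1, 1, 1, 0, 0, 0]               # made-up binary labels
scaled = [(a - 40) / 15 for a in ages]   # standardization-like rescale

print(best_split_partition(ages, fraud) == best_split_partition(scaled, fraud))
# True
```

The threshold value itself changes under rescaling, but the partition of rows, and therefore the tree's predictions, does not.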


In [None]:
# # Final sanity checks
# X_encoded.isnull().sum().sum(), X_encoded.shape



In [None]:
# # Save processed Data Set

# processed_df = pd.concat([X_encoded, y], axis=1)

# processed_df.to_csv("../data/processed/cleaned_data.csv", index=False)

# processed_df.shape



---------------------------------

### Save Processed Dataset

In [3]:
df.to_csv("../data/processed/cleaned_raw_features.csv", index=False)
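A quick round-trip check can confirm that the saved file reloads with the same shape and columns. The sketch below uses a temporary directory and a toy frame so it runs on its own; the `Make` and `Age` columns and their values are made up. It also creates the output folder first, since `to_csv` does not create missing parent directories.

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for the cleaned frame (made-up values).
toy = pd.DataFrame({"Make": ["Honda", "Toyota"], "Age": [34, 51]})

with tempfile.TemporaryDirectory() as tmp:
    out_dir = os.path.join(tmp, "processed")
    os.makedirs(out_dir, exist_ok=True)  # to_csv fails if the folder is missing

    path = os.path.join(out_dir, "cleaned_raw_features.csv")
    toy.to_csv(path, index=False)        # index=False avoids an extra "Unnamed: 0" column
    reloaded = pd.read_csv(path)

same_shape = reloaded.shape == toy.shape
same_columns = reloaded.columns.tolist() == toy.columns.tolist()
print(same_shape, same_columns)
```

The same two comparisons can be run against `df` and the real output path after the save cell above.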


## Step 4 Summary

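- Removed 10 low-value, sparse, or leakage-prone features (33 → 23 columns; 15,420 rows unchanged)
- Kept categorical features raw and human-readable; no encoding, rare-category grouping, or scaling applied
- Saved the result to `../data/processed/cleaned_raw_features.csv`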
