# 🧩 Car Insurance Claim Prediction — Data Preprocessing & Feature Engineering

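This notebook loads the car insurance data, one-hot encodes the categorical features, standardizes the numerical features, and saves the fitted preprocessing pipeline for model training and deployment.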


In [1]:
import pandas as pd
import numpy as np
import os
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
import joblib

# Load data
base_path = "car_insurance_data"
train_df = pd.read_csv(os.path.join(base_path, "train.csv"))
test_df = pd.read_csv(os.path.join(base_path, "test.csv"))

print("✅ Data Loaded Successfully")
print("Train shape:", train_df.shape)
print("Test shape:", test_df.shape)


✅ Data Loaded Successfully
Train shape: (58592, 44)
Test shape: (39063, 43)


## 🧠 Define Feature Categories
We’ll separate the target variable `is_claim`, drop the `policy_id` identifier, and split the remaining features into categorical and numerical columns by dtype.


In [2]:
target_col = "is_claim"

# Separate features and target
X = train_df.drop(columns=[target_col, "policy_id"])
y = train_df[target_col]

# Identify categorical (object dtype) and numerical (all other dtypes) columns
cat_cols = X.select_dtypes(include=["object"]).columns.tolist()
num_cols = X.select_dtypes(exclude=["object"]).columns.tolist()

print("Categorical Columns:", cat_cols)
print("Numerical Columns:", num_cols)


Categorical Columns: ['area_cluster', 'segment', 'model', 'fuel_type', 'max_torque', 'max_power', 'engine_type', 'is_esc', 'is_adjustable_steering', 'is_tpms', 'is_parking_sensors', 'is_parking_camera', 'rear_brakes_type', 'transmission_type', 'steering_type', 'is_front_fog_lights', 'is_rear_window_wiper', 'is_rear_window_washer', 'is_rear_window_defogger', 'is_brake_assist', 'is_power_door_locks', 'is_central_locking', 'is_power_steering', 'is_driver_seat_height_adjustable', 'is_day_night_rear_view_mirror', 'is_ecw', 'is_speed_alert']
Numerical Columns: ['policy_tenure', 'age_of_car', 'age_of_policyholder', 'population_density', 'make', 'airbags', 'displacement', 'cylinder', 'gear_box', 'turning_radius', 'length', 'width', 'height', 'gross_weight', 'ncap_rating']
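
Splitting by dtype treats every non-object column as a quantity to be scaled. Integer columns such as `make` may in fact be category codes, although that is an assumption to check against the data rather than something this notebook establishes. A minimal sketch for spotting such columns, using a hypothetical helper `low_cardinality_numeric` and made-up toy values:

```python
import pandas as pd


def low_cardinality_numeric(df, max_unique=10):
    """Return {column: n_unique} for non-object columns with few distinct values."""
    numeric = df.select_dtypes(exclude=["object"])
    counts = numeric.nunique()
    return counts[counts <= max_unique].sort_values().to_dict()


# Toy frame with made-up values; column names mirror the notebook's.
toy = pd.DataFrame({
    "make": [1, 2, 3, 1, 2, 3],
    "policy_tenure": [0.51, 0.67, 0.84, 0.90, 0.60, 1.02],
    "segment": ["A", "B2", "A", "C1", "B2", "A"],
})

candidates = low_cardinality_numeric(toy, max_unique=3)
print(candidates)
```

Running the helper on `X` would list the numerical columns worth a second look before deciding whether to scale them or move them to `cat_cols`.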


## 🧰 Create Preprocessing Pipelines
We’ll use:
- OneHotEncoder for categorical features
- StandardScaler for numerical features


In [3]:
# Define transformers
# Note: `sparse_output` requires scikit-learn >= 1.2 (older versions name it `sparse`)
categorical_transformer = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
numerical_transformer = StandardScaler()

# Combine into a ColumnTransformer
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, num_cols),
        ('cat', categorical_transformer, cat_cols)
    ]
)

# Build preprocessing pipeline
preprocess_pipeline = Pipeline(steps=[('preprocessor', preprocessor)])

print("✅ Preprocessing Pipeline Created")


✅ Preprocessing Pipeline Created


## 🧪 Train–Validation Split
We’ll hold out 20% of the data for validation, stratifying on `is_claim` so that both splits keep the same claim rate.


In [4]:
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print("Training set:", X_train.shape)
print("Validation set:", X_val.shape)


Training set: (46873, 42)
Validation set: (11719, 42)
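
The effect of `stratify=y` can be checked by comparing the positive rate in each split. The sketch below uses a synthetic target with a hypothetical 6% positive rate (the real claim rate is not shown in this notebook):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced target: 60 positives out of 1000 rows (assumed rate).
toy_y = np.array([1] * 60 + [0] * 940)
toy_X = np.arange(1000).reshape(-1, 1)

X_tr, X_va, y_tr, y_va = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=42, stratify=toy_y
)

# With stratification, both splits keep the overall positive rate.
print("Train positive rate:", y_tr.mean())
print("Validation positive rate:", y_va.mean())
```

The same check on the notebook's `y_train` and `y_val` (`y_train.mean()`, `y_val.mean()`) would confirm that the split preserved the class balance.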


## 🔄 Fit the Preprocessor on Training Data and Transform


In [5]:
# Fit on the training split only, then apply the fitted transforms to the validation split
X_train_transformed = preprocess_pipeline.fit_transform(X_train)
X_val_transformed = preprocess_pipeline.transform(X_val)

print("✅ Data transformed successfully!")
print("Transformed training shape:", X_train_transformed.shape)


✅ Data transformed successfully!
Transformed training shape: (46873, 127)


## 💾 Save the Preprocessor for Later Use
Saving the fitted pipeline lets the Streamlit app apply exactly the same transformations at prediction time.


In [6]:
# Save the pipeline
joblib.dump(preprocess_pipeline, "preprocessor.pkl")
print("✅ Preprocessor pipeline saved as preprocessor.pkl")


✅ Preprocessor pipeline saved as preprocessor.pkl


### 🔍 Notes
- The preprocessing pipeline ensures consistent encoding/scaling during model training and prediction.
- The `preprocessor.pkl` file will be loaded during deployment to preprocess user inputs.
- Next step ➡️ **Model Training and Evaluation**
