1. Policyholder Information: This includes demographic details such as age,
gender, occupation, marital status, and geographical location.
2. Claim History: Information regarding past insurance claims, including claim
amounts, types of claims (e.g., medical, automobile), frequency of claims, and
claim durations.
3. Policy Details: Details about the insurance policies held by the policyholders,
such as coverage type, policy duration, premium amount, and deductibles.
4. Risk Factors: Variables indicating potential risk factors associated with
policyholders, such as credit score, driving record (for automobile insurance),
health status (for medical insurance), and property characteristics (for home
insurance).
5. External Factors: Factors external to the policyholders that may influence claim
likelihood, such as economic indicators, weather conditions, and regulatory
changes.

## Data Preprocessing 

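# Import necessary libraries
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.impute import SimpleImputer
# display() is built into Jupyter; the explicit import keeps the cell runnable as a script
from IPython.display import display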

# Load the dataset
df = pd.read_csv('data.csv')

# Check for missing values
print("\nMissing values in each column:")
print(df.isnull().sum())

# Standardize continuous features to zero mean and unit variance
continuous_features = ['vehicle_age', 'customer_age',
                       'displacement', 'turning_radius', 'length', 'width', 'gross_weight']
scaler = StandardScaler()
df[continuous_features] = scaler.fit_transform(df[continuous_features])

# Display summary statistics of standardized features
print("\nSummary statistics of standardized features:")
print(df[continuous_features].describe())

# Define categorical features - these are columns with text or categorical values
categorical_features = ['region_code', 'segment', 'model', 'fuel_type', 'engine_type', 
                       'airbags', 'rear_brakes_type', 'cylinder', 'transmission_type',
                       'steering_type']

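# Encode categorical variables as integer codes, keeping each fitted encoder
# so the codes can be mapped back to the original labels later
label_encoders = {}
for column in categorical_features:
    le = LabelEncoder()
    df[column] = le.fit_transform(df[column])
    label_encoders[column] = le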

# Display first few rows of processed dataset
print("\nFirst few rows of processed dataset:")
display(df.head())

# Basic statistics of the processed dataset
print("\nProcessed dataset statistics:")
print(df.describe())

# Save the cleaned dataset
df.to_csv('cleaned_insurance_claims.csv', index=False)
print("\nCleaned dataset saved successfully!")


Missing values in each column:
policy_id                           0
subscription_length                 0
vehicle_age                         0
customer_age                        0
region_code                         0
region_density                      0
segment                             0
model                               0
fuel_type                           0
max_torque                          0
max_power                           0
engine_type                         0
airbags                             0
is_esc                              0
is_adjustable_steering              0
is_tpms                             0
is_parking_sensors                  0
is_parking_camera                   0
rear_brakes_type                    0
displacement                        0
cylinder                            0
transmission_type                   0
steering_type                       0
turning_radius                      0
length                              0
width             

Unnamed: 0,policy_id,subscription_length,vehicle_age,customer_age,region_code,region_density,segment,model,fuel_type,max_torque,...,is_brake_assist,is_power_door_locks,is_central_locking,is_power_steering,is_driver_seat_height_adjustable,is_day_night_rear_view_mirror,is_ecw,is_speed_alert,ncap_rating,claim_status
0,POL045360,9.3,-0.166143,-0.551353,20,8794,4,5,1,250Nm@2750rpm,...,Yes,Yes,Yes,Yes,Yes,No,Yes,Yes,3,0
1,POL016745,8.2,0.36277,-1.416462,11,27003,3,10,1,200Nm@1750rpm,...,No,Yes,Yes,Yes,Yes,Yes,Yes,Yes,4,0
2,POL007194,9.5,-1.047663,-0.118799,20,8794,4,5,1,250Nm@2750rpm,...,Yes,Yes,Yes,Yes,Yes,No,Yes,Yes,3,0
3,POL018146,5.2,-0.871359,-0.118799,1,73430,0,0,0,60Nm@3500rpm,...,No,No,No,Yes,No,No,No,Yes,0,0
4,POL049011,10.1,-0.342447,1.611418,4,5410,2,6,1,200Nm@3000rpm,...,No,Yes,Yes,Yes,No,No,Yes,Yes,5,0



Processed dataset statistics:
       subscription_length   vehicle_age  customer_age   region_code  \
count         58592.000000  5.859200e+04  5.859200e+04  58592.000000   
mean              6.111688 -2.053700e-16  1.250896e-16     13.035653   
std               4.142790  1.000009e+00  1.000009e+00      6.803915   
min               0.000000 -1.223968e+00 -1.416462e+00      0.000000   
25%               2.100000 -8.713593e-01 -8.397228e-01      6.000000   
50%               5.700000 -1.661427e-01 -1.187989e-01     15.000000   
75%              10.400000  7.153780e-01  6.021250e-01     20.000000   
max              14.000000  1.640645e+01  4.350929e+00     21.000000   

       region_density       segment         model     fuel_type   engine_type  \
count    58592.000000  58592.000000  58592.000000  58592.000000  58592.000000   
mean     18826.858667      1.938644      4.659237      1.003448      5.502748   
std      17660.174792      1.566329      3.197355      0.835104      2.684796
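
The preview of the processed dataset still shows `max_torque` as strings such as `250Nm@2750rpm` and several `is_*` columns as `Yes`/`No`, so those columns are not yet numeric. The following is a minimal sketch of one way to convert them, assuming every torque string follows the `<value>Nm@<rpm>rpm` pattern seen in the preview. The helper names `split_torque` and `encode_yes_no` and the output columns `max_torque_nm` and `max_torque_rpm` are hypothetical, and the sample frame only reuses values visible in the preview rather than the full dataset.

```python
import pandas as pd

# Tiny sample reusing values visible in the preview (not the full dataset)
sample = pd.DataFrame({
    "max_torque": ["250Nm@2750rpm", "200Nm@1750rpm", "60Nm@3500rpm"],
    "is_brake_assist": ["Yes", "No", "No"],
    "is_speed_alert": ["Yes", "Yes", "Yes"],
})

def split_torque(frame, column="max_torque"):
    """Split '<value>Nm@<rpm>rpm' strings into two numeric columns."""
    parts = frame[column].str.extract(r"^\s*([\d.]+)\s*Nm@\s*([\d.]+)\s*rpm\s*$")
    out = frame.drop(columns=[column])
    # Strings that do not match the assumed pattern become NaN
    out[column + "_nm"] = pd.to_numeric(parts[0], errors="coerce")
    out[column + "_rpm"] = pd.to_numeric(parts[1], errors="coerce")
    return out

def encode_yes_no(frame):
    """Map every column whose values are only 'Yes'/'No' to 1/0."""
    out = frame.copy()
    for col in out.columns:
        values = set(out[col].dropna().unique())
        if values and values <= {"Yes", "No"}:
            out[col] = out[col].map({"Yes": 1, "No": 0})
    return out

cleaned = encode_yes_no(split_torque(sample))
print(cleaned)
```

Rows whose torque string does not match the assumed pattern end up as `NaN` rather than raising an error, so a follow-up `cleaned.isnull().sum()` would reveal any formats the pattern misses.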

# Task 1: Data Cleaning and Initial Processing

# Task 2: Exploratory Data Analysis (EDA)

## Risk Segmentation

# Task 1: Customer Segmentation

# Task 2: Anomaly Detection
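
The processed dataset statistics report a maximum standardized `vehicle_age` of about 16.4, which is more than 16 standard deviations above the mean and suggests a few extreme records worth inspecting. As a starting point, a simple z-score rule could flag such rows. This is only a sketch: the function name `flag_outliers`, the threshold of 3, and the toy data are assumptions for illustration, not part of the original analysis.

```python
import pandas as pd

def flag_outliers(frame, columns, threshold=3.0):
    """Return a boolean Series that is True where any column's |z-score| exceeds threshold."""
    values = frame[columns]
    z = (values - values.mean()) / values.std(ddof=0)
    return (z.abs() > threshold).any(axis=1)

# Illustrative toy data: 20 typical values plus one extreme value
toy = pd.DataFrame({"vehicle_age": [1.0, 2.0] * 10 + [40.0]})
mask = flag_outliers(toy, ["vehicle_age"])
print(toy[mask])
```

On columns that were already standardized with `StandardScaler`, the same check reduces to comparing the absolute value against the threshold directly.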

## Predictive Modeling

# Task 1: Classification Model
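
The preprocessing cell fits `StandardScaler` and `LabelEncoder` on the full dataset, so if the data is later split for evaluation, statistics from the held-out rows will already have influenced the transformation. One way to avoid this is to fit the preprocessing on the training split only, inside a scikit-learn `Pipeline`. The sketch below assumes `claim_status` is the target (it appears as the last column in the preview). `LogisticRegression` is used purely as a placeholder classifier, and the data is randomly generated, including the made-up `fuel_type` labels.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in data: column names follow the notebook, values are random
rng = np.random.default_rng(0)
n = 200
data = pd.DataFrame({
    "vehicle_age": rng.uniform(0, 10, n),
    "customer_age": rng.integers(35, 75, n),
    "fuel_type": rng.choice(["Petrol", "Diesel", "CNG"], n),  # made-up labels
    "claim_status": rng.integers(0, 2, n),
})

X = data.drop(columns=["claim_status"])
y = data["claim_status"]

# Split first, so the scaler and encoder never see the held-out rows
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["vehicle_age", "customer_age"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["fuel_type"]),
])
model = Pipeline([
    ("preprocess", preprocess),
    ("classifier", LogisticRegression(max_iter=1000)),
])

# fit() learns scaling statistics and category levels from the training split only
model.fit(X_train, y_train)
predictions = model.predict(X_test)
print("Held-out accuracy:", model.score(X_test, y_test))
```

Because the target here is random noise, the printed accuracy carries no meaning; the point is only the order of operations, with the split happening before any fitting.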

# Task 2: Model Evaluation

## Association (?)

# Task 1: Association Rule Mining

# Task 2: Sequential Pattern Analysis   