# INSURANCE CLAIM PROBABILITY BASED ON POLICY FEATURES AND SAFETY RATINGS

## Problem Statement
Insurance companies face the challenge of accurately predicting the likelihood of claims on car insurance policies. With a variety of factors influencing claim frequency and severity, it is difficult to assess risk and determine appropriate premiums for policyholders. The problem arises from the complexity and variety of data involved, including demographic details, vehicle specifications, and past claim histories. Inaccurate predictions can lead to suboptimal premium pricing, financial loss, and inefficient claim management.

The goal of this project is to develop a model that uses this dataset to predict whether a car insurance claim will occur within a six-month period. The expected output is a binary classifier that separates policies likely to result in a claim from those unlikely to do so. The model should also provide actionable insights into the factors that contribute to the likelihood of claims.

In [1]:
# Importing required packages 
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

## Data Overview

In [3]:
# importing data using pandas
data = pd.read_csv(r'data/Data.csv')
data.shape

(58592, 44)

### a. Column names and Datatypes

In [6]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 58592 entries, 0 to 58591
Data columns (total 44 columns):
 #   Column                            Non-Null Count  Dtype  
---  ------                            --------------  -----  
 0   policy_id                         58592 non-null  object 
 1   policy_tenure                     58592 non-null  float64
 2   age_of_car                        58592 non-null  float64
 3   age_of_policyholder               58592 non-null  float64
 4   area_cluster                      58592 non-null  object 
 5   population_density                58592 non-null  int64  
 6   make                              58592 non-null  int64  
 7   segment                           58592 non-null  object 
 8   model                             58592 non-null  object 
 9   fuel_type                         58592 non-null  object 
 10  max_torque                        58592 non-null  object 
 11  max_power                         58592 non-null  object 
 12  engi

In [8]:
data.head(10)

Unnamed: 0,policy_id,policy_tenure,age_of_car,age_of_policyholder,area_cluster,population_density,make,segment,model,fuel_type,...,is_brake_assist,is_power_door_locks,is_central_locking,is_power_steering,is_driver_seat_height_adjustable,is_day_night_rear_view_mirror,is_ecw,is_speed_alert,ncap_rating,is_claim
0,ID00001,0.515874,0.05,0.644231,C1,4990,1,A,M1,CNG,...,No,No,No,Yes,No,No,No,Yes,0,0
1,ID00002,0.672619,0.02,0.375,C2,27003,1,A,M1,CNG,...,No,No,No,Yes,No,No,No,Yes,0,0
2,ID00003,0.84111,0.02,0.384615,C3,4076,1,A,M1,CNG,...,No,No,No,Yes,No,No,No,Yes,0,0
3,ID00004,0.900277,0.11,0.432692,C4,21622,1,C1,M2,Petrol,...,Yes,Yes,Yes,Yes,Yes,Yes,Yes,Yes,2,0
4,ID00005,0.596403,0.11,0.634615,C5,34738,2,A,M3,Petrol,...,No,Yes,Yes,Yes,No,Yes,Yes,Yes,2,0
5,ID00006,1.018709,0.07,0.519231,C6,13051,3,C2,M4,Diesel,...,Yes,Yes,Yes,Yes,Yes,No,Yes,Yes,3,0
6,ID00007,0.097992,0.16,0.403846,C7,6112,4,B2,M5,Diesel,...,No,Yes,Yes,Yes,No,No,Yes,Yes,5,0
7,ID00008,0.509085,0.14,0.423077,C8,8794,1,B2,M6,Petrol,...,Yes,Yes,Yes,Yes,Yes,Yes,Yes,Yes,2,0
8,ID00009,0.282394,0.07,0.298077,C7,6112,3,C2,M4,Diesel,...,Yes,Yes,Yes,Yes,Yes,No,Yes,Yes,3,0
9,ID00010,0.566255,0.04,0.442308,C9,17804,1,B2,M7,Petrol,...,Yes,Yes,Yes,Yes,Yes,Yes,Yes,Yes,0,0


### Data Summary:
* **Total Rows:** 58,592
* **Total Columns:** 44

### Data Types Overview:
* **float64:** 4 columns
* **int64:** 12 columns
* **object:** 28 columns (categorical)
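The per-dtype counts above can be reproduced directly from the frame instead of being tallied by hand from `data.info()`. The snippet below is a minimal sketch on a small stand-in frame: the column names mirror the dataset, but `sample` and its values are illustrative assumptions, not the real data.

```python
import pandas as pd

# Stand-in frame (assumption): a few columns named like the real dataset.
sample = pd.DataFrame({
    "policy_id": ["ID00001", "ID00002"],
    "policy_tenure": [0.515874, 0.672619],
    "population_density": [4990, 27003],
    "fuel_type": ["CNG", "CNG"],
})

# Count how many columns share each dtype.
dtype_counts = sample.dtypes.astype(str).value_counts()
print(dtype_counts)
```

On the full dataset the same call would be `data.dtypes.astype(str).value_counts()`.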

**Note**: 
* The values in the `age_of_car` and `age_of_policyholder` columns appear to be normalized.
* Recovering the original values of these two columns would require more information about the normalization technique and the parameters it used.
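To illustrate why the normalization parameters matter, here is a minimal sketch of inverting min-max scaling. Both the choice of min-max scaling and the bounds `ASSUMED_MIN_AGE` and `ASSUMED_MAX_AGE` are hypothetical: the dataset does not document how these columns were scaled, so the recovered numbers are placeholders, not real ages.

```python
import pandas as pd

def denormalize_min_max(scaled, original_min, original_max):
    """Invert min-max scaling: x = x_scaled * (max - min) + min."""
    return scaled * (original_max - original_min) + original_min

# Minimum, median and maximum of age_of_policyholder as reported by describe().
scaled_age = pd.Series([0.288462, 0.451923, 1.0])

# HYPOTHETICAL parameters: the true scaling bounds are unknown.
ASSUMED_MIN_AGE = 0.0
ASSUMED_MAX_AGE = 100.0

recovered_age = denormalize_min_max(scaled_age, ASSUMED_MIN_AGE, ASSUMED_MAX_AGE)
print(recovered_age)
```

Changing either assumed bound changes every recovered value, which is why the normalized columns are used as they are in the rest of the analysis.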

In [17]:
data.describe()

Unnamed: 0,policy_tenure,age_of_car,age_of_policyholder,population_density,make,airbags,displacement,cylinder,gear_box,turning_radius,length,width,height,gross_weight,ncap_rating,is_claim
count,58592.0,58592.0,58592.0,58592.0,58592.0,58592.0,58592.0,58592.0,58592.0,58592.0,58592.0,58592.0,58592.0,58592.0,58592.0,58592.0
mean,0.611246,0.069424,0.46942,18826.858667,1.763722,3.137066,1162.355851,3.626963,5.245443,4.852893,3850.476891,1672.233667,1553.33537,1385.276813,1.75995,0.063968
std,0.414156,0.056721,0.122886,17660.174792,1.136988,1.832641,266.304786,0.483616,0.430353,0.228061,311.457119,112.089135,79.62227,212.423085,1.389576,0.244698
min,0.002735,0.0,0.288462,290.0,1.0,1.0,796.0,3.0,5.0,4.5,3445.0,1475.0,1475.0,1051.0,0.0,0.0
25%,0.21025,0.02,0.365385,6112.0,1.0,2.0,796.0,3.0,5.0,4.6,3445.0,1515.0,1475.0,1185.0,0.0,0.0
50%,0.573792,0.06,0.451923,8794.0,1.0,2.0,1197.0,4.0,5.0,4.8,3845.0,1735.0,1530.0,1335.0,2.0,0.0
75%,1.039104,0.11,0.548077,27003.0,3.0,6.0,1493.0,4.0,5.0,5.0,3995.0,1755.0,1635.0,1510.0,3.0,0.0
max,1.396641,1.0,1.0,73430.0,5.0,6.0,1498.0,4.0,6.0,5.2,4300.0,1811.0,1825.0,1720.0,5.0,1.0


**Observations:**

1. **`policy_tenure`**: The values range from 0.0027 to 1.3966, with a mean of 0.611, so the data covers both very new policies and policies close to the maximum tenure.
   
2. **`age_of_car`**: The median is 0.06 and the 75th percentile is 0.11 on a scale from 0 to 1, which suggests that most cars are relatively new.

3. **`age_of_policyholder`**: The minimum value is 0.2885, the 75th percentile is 0.548, and the maximum is 1, which suggests that policyholders are mostly younger or middle-aged adults, although actual ages cannot be read from the normalized values.

4. **`population_density`**: The distribution is significantly right-skewed (mean 18,827 versus median 8,794). 75% of the records have a population density of at most 27,003, while the maximum is 73,430.

5. **`ncap_rating`**: Most cars have a rating between 0 and 3 (25th percentile 0, median 2, 75th percentile 3), with a few reaching the maximum value of 5, indicating a distribution skewed towards lower ratings.

6. **`gross_weight`**: The values range from 1,051 kg to 1,720 kg, with a median of 1,335 kg, so the heaviest vehicles sit well above the typical car.

7. **`is_claim`**: The mean of 0.064 shows that only about 6.4% of policies have a claim, which reflects a highly **imbalanced class distribution**.
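The imbalance noted for `is_claim` follows from the fact that the mean of a 0/1 column equals the share of positive cases. The sketch below shows this on a synthetic target built to have the same 6.4% claim rate; the series is an illustrative assumption, not the real column.

```python
import pandas as pd

# Synthetic target (assumption): 64 claims among 1,000 policies, i.e. a 6.4% rate.
is_claim = pd.Series([1] * 64 + [0] * 936, name="is_claim")

# For a binary 0/1 column, the mean is the fraction of positive cases.
claim_rate = is_claim.mean()

# Share of each class, and how many non-claims there are per claim.
class_share = is_claim.value_counts(normalize=True)
imbalance_ratio = class_share.loc[0] / class_share.loc[1]

print(claim_rate)
print(class_share)
print(imbalance_ratio)
```

On the real data the same check would be `data['is_claim'].value_counts(normalize=True)`.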
 

In [19]:
data['population_density'].skew()

1.6741777983981572

### Check for null values

In [20]:
data.isnull().sum()

policy_id                           0
policy_tenure                       0
age_of_car                          0
age_of_policyholder                 0
area_cluster                        0
population_density                  0
make                                0
segment                             0
model                               0
fuel_type                           0
max_torque                          0
max_power                           0
engine_type                         0
airbags                             0
is_esc                              0
is_adjustable_steering              0
is_tpms                             0
is_parking_sensors                  0
is_parking_camera                   0
rear_brakes_type                    0
displacement                        0
cylinder                            0
transmission_type                   0
gear_box                            0
steering_type                       0
turning_radius                      0
length      

**Observation:**
  * There are no null values present in any of the data columns

### Check for duplicated records

In [34]:
data.duplicated().sum()

0

**There are no duplicated records in the data set**

### Observing categorical columns

In [28]:
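# Keep only the object (string) columns
categorical_columns = data.select_dtypes(include='object')

# Skip the first column (policy_id) and print the distinct values of each remaining column
for i in categorical_columns.iloc[:,1:].columns:
    print('*'*10 +' '+ i +' '+'*'*10 )
    print(set(categorical_columns[i]))
    print()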

********** area_cluster **********
{'C11', 'C13', 'C22', 'C15', 'C1', 'C12', 'C9', 'C17', 'C19', 'C21', 'C20', 'C2', 'C7', 'C6', 'C8', 'C5', 'C18', 'C10', 'C3', 'C16', 'C4', 'C14'}

********** segment **********
{'C2', 'C1', 'A', 'B1', 'Utility', 'B2'}

********** model **********
{'M9', 'M4', 'M6', 'M5', 'M10', 'M11', 'M1', 'M2', 'M7', 'M8', 'M3'}

********** fuel_type **********
{'CNG', 'Diesel', 'Petrol'}

********** max_torque **********
{'200Nm@1750rpm', '85Nm@3000rpm', '170Nm@4000rpm', '113Nm@4400rpm', '91Nm@4250rpm', '250Nm@2750rpm', '60Nm@3500rpm', '200Nm@3000rpm', '82.1Nm@3400rpm'}

********** max_power **********
{'118.36bhp@5500rpm', '67.06bhp@5500rpm', '88.77bhp@4000rpm', '55.92bhp@5300rpm', '97.89bhp@3600rpm', '61.68bhp@6000rpm', '40.36bhp@6000rpm', '113.45bhp@4000rpm', '88.50bhp@6000rpm'}

********** engine_type **********
{'1.5 Turbocharged Revotorq', '1.0 SCe', '1.2 L K Series Engine', 'K10C', 'i-DTEC', 'G12B', '1.2 L K12N Dualjet', 'F8D Petrol Engine', '1.5 Turbocharge

**Observation**:
* There are no invalid values in any of the categorical columns
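The visual check above can be made explicit for columns whose values follow a fixed text format. The sketch below tests `max_torque`-style strings against a regular expression. The pattern is an assumption inferred from the values printed above, `find_invalid` is a hypothetical helper, and the sample series are illustrative.

```python
import re
import pandas as pd

# Assumed format, inferred from the printed values: "<number>Nm@<integer>rpm".
TORQUE_PATTERN = re.compile(r"^\d+(\.\d+)?Nm@\d+rpm$")

def find_invalid(values, pattern):
    """Return the distinct values that do not match the expected pattern."""
    return sorted(v for v in set(values) if not pattern.match(str(v)))

# A few values taken from the max_torque output above.
max_torque = pd.Series(["200Nm@1750rpm", "82.1Nm@3400rpm", "60Nm@3500rpm"])
print(find_invalid(max_torque, TORQUE_PATTERN))

# A deliberately malformed entry, to show what a failure looks like.
with_bad_value = pd.Series(["200Nm@1750rpm", "unknown"])
print(find_invalid(with_bad_value, TORQUE_PATTERN))
```

An empty list means every distinct value matches the assumed format.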

In [29]:
numerical_columns = data.select_dtypes(exclude='object')
numerical_columns

Unnamed: 0,policy_tenure,age_of_car,age_of_policyholder,population_density,make,airbags,displacement,cylinder,gear_box,turning_radius,length,width,height,gross_weight,ncap_rating,is_claim
0,0.515874,0.05,0.644231,4990,1,2,796,3,5,4.6,3445,1515,1475,1185,0,0
1,0.672619,0.02,0.375000,27003,1,2,796,3,5,4.6,3445,1515,1475,1185,0,0
2,0.841110,0.02,0.384615,4076,1,2,796,3,5,4.6,3445,1515,1475,1185,0,0
3,0.900277,0.11,0.432692,21622,1,2,1197,4,5,4.8,3995,1735,1515,1335,2,0
4,0.596403,0.11,0.634615,34738,2,2,999,3,5,5.0,3731,1579,1490,1155,2,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
58587,0.355089,0.13,0.644231,8794,2,2,999,3,5,5.0,3731,1579,1490,1155,2,0
58588,1.199642,0.02,0.519231,7788,1,2,796,3,5,4.6,3445,1515,1475,1185,0,0
58589,1.162273,0.05,0.451923,34738,1,2,796,3,5,4.6,3445,1515,1475,1185,0,0
58590,1.236307,0.14,0.557692,8794,1,2,1197,4,5,4.8,3845,1735,1530,1335,2,0


In [32]:
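# Skip the first four columns (policy_tenure, age_of_car, age_of_policyholder,
# population_density) and print the distinct values of the remaining numerical columns
for i in numerical_columns.iloc[:,4:].columns:
    print('*'*10 +' '+ i +' '+'*'*10 )
    print(set(numerical_columns[i]))
    print()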

********** make **********
{1, 2, 3, 4, 5}

********** airbags **********
{1, 2, 6}

********** displacement **********
{998, 999, 1196, 1197, 1199, 1493, 1497, 1498, 796}

********** cylinder **********
{3, 4}

********** gear_box **********
{5, 6}

********** turning_radius **********
{4.85, 5.0, 4.6, 4.8, 5.2, 4.7, 4.9, 4.5, 5.1}

********** length **********
{3845, 3655, 4300, 3731, 3445, 3990, 3993, 3675, 3995}

********** width **********
{1475, 1735, 1579, 1515, 1745, 1811, 1620, 1755, 1790, 1695}

********** height **********
{1825, 1635, 1475, 1606, 1515, 1675, 1490, 1523, 1530, 1500, 1501}

********** gross_weight **********
{1660, 1185, 1410, 1155, 1510, 1490, 1335, 1720, 1051, 1340}

********** ncap_rating **********
{0, 2, 3, 4, 5}

********** is_claim **********
{0, 1}



**Observation:**

**Categorical Nature of Some Numerical Columns**: 

* Several numerical columns, such as `make`, `airbags`, `displacement`, `cylinder`, `gear_box`, `turning_radius`, `length`, `width`, `height`, `gross_weight`, `ncap_rating`, and `is_claim`, contain only a small set of discrete values.
* These values sort the data into specific groups rather than representing a continuous range.
* These columns should be treated as categorical variables in further analysis and modeling, even though they are stored as numbers.
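One way to act on this observation is to flag numerical columns with few distinct values and cast them to the pandas `category` dtype. The sketch below is only an illustration: `low_cardinality_columns` is a hypothetical helper, the `max_unique` threshold is an arbitrary assumption, the sample frame is made up, and leaving the target `is_claim` as an integer is a design choice rather than something the analysis above prescribes.

```python
import pandas as pd

def low_cardinality_columns(frame, max_unique, exclude=()):
    """List numeric columns with at most `max_unique` distinct values."""
    numeric = frame.select_dtypes(include="number")
    return [
        col for col in numeric.columns
        if col not in exclude and numeric[col].nunique() <= max_unique
    ]

# Stand-in frame (assumption): values resemble the dataset but are illustrative.
sample = pd.DataFrame({
    "policy_tenure": [0.515874, 0.672619, 0.841110, 0.900277, 0.596403],
    "airbags": [2, 2, 2, 2, 2],
    "cylinder": [3, 3, 3, 4, 3],
    "is_claim": [0, 0, 0, 0, 0],
})

# Threshold chosen for this tiny sample; the target column is kept numeric.
cols = low_cardinality_columns(sample, max_unique=3, exclude=("is_claim",))

# Cast only the flagged columns; all other columns keep their dtype.
converted = sample.astype({col: "category" for col in cols})
print(cols)
print(converted.dtypes)
```

On the full dataset the threshold would need to be chosen by inspecting the distinct-value counts, since a continuous column can also have few distinct values in a small sample.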
    