# Insurance Claim Prediction for Buildings

## Problem Statement
The objective of this project is to build a predictive model that estimates, from a building's characteristics, the probability that it will have at least one insurance claim during the insured period.

The target variable `Claim` is defined as:
- 1: Building has at least one claim
- 0: Building has no claim


In [21]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, roc_auc_score, confusion_matrix, classification_report
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

sns.set()


In [22]:
train_data = pd.read_csv("../dataset/Train_data.csv")
var_desc = pd.read_csv("../dataset/Variable Description.csv")


## Data Overview
This section provides a high-level overview of the dataset structure and content.

In [34]:
train_data.head()


,Customer Id,YearOfObservation,Insured_Period,Residential,Building_Painted,Building_Fenced,Garden,Settlement,Building Dimension,Building_Type,Date_of_Occupancy,NumberOfWindows,Geo_Code,Claim
0,H14663,2013,1.0,0,N,V,V,U,290.0,1,1960.0,.,1053,0
1,H2037,2015,1.0,0,V,N,O,R,490.0,1,1850.0,4,1053,0
2,H3802,2014,1.0,0,N,V,V,U,595.0,1,1960.0,.,1053,0
3,H3834,2013,1.0,0,V,V,V,U,2840.0,1,1960.0,.,1053,0
4,H5053,2014,1.0,0,V,N,O,R,680.0,1,1800.0,3,1053,0


In [35]:
train_data.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7160 entries, 0 to 7159
Data columns (total 14 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   Customer Id         7160 non-null   object 
 1   YearOfObservation   7160 non-null   int64  
 2   Insured_Period      7160 non-null   float64
 3   Residential         7160 non-null   int64  
 4   Building_Painted    7160 non-null   object 
 5   Building_Fenced     7160 non-null   object 
 6   Garden              7160 non-null   object 
 7   Settlement          7160 non-null   object 
 8   Building Dimension  7160 non-null   float64
 9   Building_Type       7160 non-null   int64  
 10  Date_of_Occupancy   6652 non-null   float64
 11  NumberOfWindows     7160 non-null   object 
 12  Geo_Code            7058 non-null   object 
 13  Claim               7160 non-null   int64  
dtypes: float64(3), int64(4), object(7)
memory usage: 783.3+ KB


In [36]:

train_data.describe()

,YearOfObservation,Insured_Period,Residential,Building Dimension,Building_Type,Date_of_Occupancy,Claim
count,7160.0,7160.0,7160.0,7160.0,7160.0,6652.0,7160.0
mean,2013.669553,0.909758,0.305447,1871.873184,2.186034,1964.456404,0.228212
std,1.383769,0.239756,0.460629,2263.296186,0.940632,36.002014,0.419709
min,2012.0,0.0,0.0,1.0,1.0,1545.0,0.0
25%,2012.0,0.997268,0.0,531.5,2.0,1960.0,0.0
50%,2013.0,1.0,0.0,1083.0,2.0,1970.0,0.0
75%,2015.0,1.0,1.0,2250.0,3.0,1980.0,0.0
max,2016.0,1.0,1.0,20940.0,4.0,2016.0,1.0
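
The `Claim` mean of 0.228212 in the summary above implies that roughly 23% of buildings had at least one claim, so the two classes are imbalanced. A minimal sketch of checking this directly with `value_counts`, using a hypothetical toy frame (`toy`, with made-up values) in place of `train_data`:

```python
import pandas as pd

# Hypothetical toy stand-in for train_data; the values are illustrative only.
toy = pd.DataFrame({"Claim": [0, 0, 0, 1, 0, 1, 0, 0]})

# Absolute counts of each class.
claim_counts = toy["Claim"].value_counts()

# Share of each class (sums to 1.0).
claim_share = toy["Claim"].value_counts(normalize=True)

print(claim_counts)
print(claim_share)
```

On the real data, the same two calls on `train_data["Claim"]` would show the class balance, which can inform whether accuracy alone is a reasonable metric.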


In [37]:
from sklearn.impute import SimpleImputer

# Fill missing Garden values with the most frequent category.
cat_imputer = SimpleImputer(strategy='most_frequent')
train_data[['Garden']] = cat_imputer.fit_transform(train_data[['Garden']])


In [38]:
# SimpleImputer is already imported in the previous cell.
# Fill missing Building Dimension values with the column median.
num_imputer = SimpleImputer(strategy='median')
train_data[['Building Dimension']] = num_imputer.fit_transform(
    train_data[['Building Dimension']]
)


In [39]:
train_data.isnull().sum()


Customer Id             0
YearOfObservation       0
Insured_Period          0
Residential             0
Building_Painted        0
Building_Fenced         0
Garden                  0
Settlement              0
Building Dimension      0
Building_Type           0
Date_of_Occupancy     508
NumberOfWindows         0
Geo_Code              102
Claim                   0
dtype: int64
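
`NumberOfWindows` reports zero nulls here even though `head()` shows `.` entries in that column, which suggests that `.` is acting as a missing-value placeholder (an assumption; the variable description may say otherwise). A minimal sketch of surfacing such placeholders, using a hypothetical toy frame rather than the real data:

```python
import numpy as np
import pandas as pd

# Hypothetical toy stand-in mirroring the NumberOfWindows values seen in head().
toy = pd.DataFrame({"NumberOfWindows": [".", "4", ".", ".", "3"]})

# Treat the "." placeholder as missing so that isnull() can detect it.
windows = toy["NumberOfWindows"].replace(".", np.nan)
n_missing = int(windows.isnull().sum())

# Convert the remaining strings to numbers; unparseable entries become NaN.
windows_numeric = pd.to_numeric(windows, errors="coerce")

print(n_missing)
print(windows_numeric)
```

Because `errors="coerce"` silently turns any non-numeric string into NaN, it is worth inspecting the column's distinct values first before applying this to `train_data`.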

In [None]:
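# SimpleImputer is already imported above.
# Fill missing Date_of_Occupancy values with the most frequent year.
# The column name is case-sensitive: 'Date_of_Occupancy', not 'Date_of_occupancy'.
cat_imputer = SimpleImputer(strategy='most_frequent')
train_data[['Date_of_Occupancy']] = cat_imputer.fit_transform(train_data[['Date_of_Occupancy']])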
