## Adventure Insured: Predicting Claims



> The dataset used here contains 3,000 rows and 10 columns (see the `df.info()` output below). It covers travel insurance policies and their claim outcomes: agency and sales-channel details, customer age, trip duration and destination, sales and commission figures, the product purchased, and whether a claim was made (`Claimed`). The goal of this project is to predict claim outcomes with machine learning models. Identifying the strongest predictors of a claim can help insurance companies improve risk assessment, customer service, and decision-making.

In [1]:
import pandas as pd
import numpy as np

import seaborn as sns
import matplotlib.pyplot as plt

import warnings
warnings.filterwarnings("ignore")

In [2]:
df = pd.read_csv("insurance_part2_data-2.csv")
df.head()

Unnamed: 0,Age,Agency_Code,Type,Claimed,Commision,Channel,Duration,Sales,Product Name,Destination
0,48,C2B,Airlines,No,0.7,Online,7,2.51,Customised Plan,ASIA
1,36,EPX,Travel Agency,No,0.0,Online,34,20.0,Customised Plan,ASIA
2,39,CWT,Travel Agency,No,5.94,Online,3,9.9,Customised Plan,Americas
3,36,EPX,Travel Agency,No,0.0,Online,4,26.0,Cancellation Plan,ASIA
4,33,JZI,Airlines,No,6.3,Online,53,18.0,Bronze Plan,ASIA


In [3]:
df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3000 entries, 0 to 2999
Data columns (total 10 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Age           3000 non-null   int64  
 1   Agency_Code   3000 non-null   object 
 2   Type          3000 non-null   object 
 3   Claimed       3000 non-null   object 
 4   Commision     3000 non-null   float64
 5   Channel       3000 non-null   object 
 6   Duration      3000 non-null   int64  
 7   Sales         3000 non-null   float64
 8   Product Name  3000 non-null   object 
 9   Destination   3000 non-null   object 
dtypes: float64(2), int64(2), object(6)
memory usage: 234.5+ KB


In [4]:
print(f"Number of rows: {df.shape[0]}\nNumber of columns: {df.shape[1]}")


Number of rows: 3000
Number of columns: 10


In [6]:
df.describe()


Unnamed: 0,Age,Commision,Duration,Sales
count,3000.0,3000.0,3000.0,3000.0
mean,38.091,14.529203,70.001333,60.249913
std,10.463518,25.481455,134.053313,70.733954
min,8.0,0.0,-1.0,0.0
25%,32.0,0.0,11.0,20.0
50%,36.0,4.63,26.5,33.0
75%,42.0,17.235,63.0,69.0
max,84.0,210.21,4580.0,539.0


- Policyholders are fairly young: the mean age is 38.09 and the median is 36, with ages ranging from 8 to 84.
- Commission ranges from 0 to 210.21. The median (4.63) is well below the mean (14.53), and at least a quarter of policies carry no commission, so the distribution is strongly right-skewed. The data does not state whether these values are amounts or percentages.
- Duration has a mean of 70.00 and a median of 26.5, with a maximum of 4580. The minimum is -1, which cannot be a valid duration and points to a data-quality issue. The unit is not given in the data; days would be the natural reading for a trip, but that is an assumption.
- Sales averages 60.25 with a standard deviation of 70.73, ranging from 0 to 539.
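
The observations above suggest two quick checks: flagging impossible durations and comparing mean against median to gauge skew. The sketch below runs on a small made-up frame (`toy` and its values are illustrative, not taken from the dataset), and treating only negative durations as invalid is an assumption.

```python
import pandas as pd

# Toy stand-in for the notebook's df; the values are illustrative only.
toy = pd.DataFrame({
    "Duration": [7, 34, -1, 0, 4580],
    "Sales": [2.51, 20.0, 9.9, 26.0, 18.0],
})

# A negative duration cannot describe a real trip, so flag those rows for review.
invalid_duration = toy["Duration"] < 0
print(toy[invalid_duration])

# A mean far above the median is a quick sign of right skew.
summary = toy["Duration"].agg(["mean", "median"])
print(summary)
```

On the real data, the same mask would be built from `df["Duration"]`. Whether to drop, correct, or keep the flagged rows depends on how the field was recorded.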

In [7]:
df.duplicated().sum()


139

In [8]:
df[df.duplicated()]


Unnamed: 0,Age,Agency_Code,Type,Claimed,Commision,Channel,Duration,Sales,Product Name,Destination
63,30,C2B,Airlines,Yes,15.0,Online,27,60.0,Bronze Plan,ASIA
329,36,EPX,Travel Agency,No,0.0,Online,5,20.0,Customised Plan,ASIA
407,36,EPX,Travel Agency,No,0.0,Online,11,19.0,Cancellation Plan,ASIA
411,35,EPX,Travel Agency,No,0.0,Online,2,20.0,Customised Plan,ASIA
422,36,EPX,Travel Agency,No,0.0,Online,5,20.0,Customised Plan,ASIA
...,...,...,...,...,...,...,...,...,...,...
2940,36,EPX,Travel Agency,No,0.0,Online,8,10.0,Cancellation Plan,ASIA
2947,36,EPX,Travel Agency,No,0.0,Online,10,28.0,Customised Plan,ASIA
2952,36,EPX,Travel Agency,No,0.0,Online,2,10.0,Cancellation Plan,ASIA
2962,36,EPX,Travel Agency,No,0.0,Online,4,20.0,Customised Plan,ASIA


In [9]:
# Drop the two columns that are not used below
df.drop(columns=['Agency_Code', 'Age'], inplace=True)


In [10]:
# Work on an explicit copy of the numeric columns so later edits do not touch df
df1 = df[['Commision', 'Sales', 'Duration']].copy()
df1.head()


Unnamed: 0,Commision,Sales,Duration
0,0.7,2.51,7
1,0.0,20.0,34
2,5.94,9.9,3
3,0.0,26.0,4
4,6.3,18.0,53


In [11]:
from scipy.stats import zscore
df1=df1.apply(zscore)
df1.head()

Unnamed: 0,Commision,Sales,Duration
0,-0.542807,-0.816433,-0.470051
1,-0.570282,-0.569127,-0.268605
2,-0.337133,-0.71194,-0.499894
3,-0.570282,-0.484288,-0.492433
4,-0.323003,-0.597407,-0.126846


In [12]:
# Cap the z-scores at ±3 so extreme values cannot dominate
df1 = df1.clip(lower=-3, upper=3)
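
To make the two steps above concrete, the sketch below standardises a small made-up series and caps it at ±3. The series `values` is illustrative. Note that `scipy.stats.zscore` uses the population standard deviation (`ddof=0`) by default, which the hand computation mirrors.

```python
import numpy as np
import pandas as pd
from scipy.stats import zscore

# Nineteen ordinary values and one extreme value; illustrative only.
values = pd.Series([1.0] * 19 + [100.0], name="Sales")

# zscore subtracts the mean and divides by the population std (ddof=0).
z = pd.Series(zscore(values.to_numpy()), name="z")
by_hand = (values - values.mean()) / values.std(ddof=0)

# Capping leaves ordinary points untouched and pulls the outlier back to 3.
capped = z.clip(lower=-3, upper=3)
print(z.max(), capped.max())
```

Capping keeps every row in the data while limiting the influence of extreme points. It does not repair invalid raw values, which still need to be handled separately.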

In [13]:
df1[['Type','Claimed','Channel','Product Name','Destination']]=df[['Type','Claimed','Channel','Product Name','Destination']]

In [14]:
df1.head()


Unnamed: 0,Commision,Sales,Duration,Type,Claimed,Channel,Product Name,Destination
0,-0.542807,-0.816433,-0.470051,Airlines,No,Online,Customised Plan,ASIA
1,-0.570282,-0.569127,-0.268605,Travel Agency,No,Online,Customised Plan,ASIA
2,-0.337133,-0.71194,-0.499894,Travel Agency,No,Online,Customised Plan,Americas
3,-0.570282,-0.484288,-0.492433,Travel Agency,No,Online,Cancellation Plan,ASIA
4,-0.323003,-0.597407,-0.126846,Airlines,No,Online,Bronze Plan,ASIA


In [15]:
df.columns


Index(['Type', 'Claimed', 'Commision', 'Channel', 'Duration', 'Sales',
       'Product Name', 'Destination'],
      dtype='object')

In [16]:
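# Replace each text column with integer category codes
# (pd.Categorical sorts the labels, so codes follow alphabetical order)
for col in df1.columns:
    if df1[col].dtype == 'object':
        df1[col] = pd.Categorical(df1[col]).codes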
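
Once the labels are replaced by codes, the original text is gone from `df1`, so it can help to record the code-to-label mapping first. The sketch below shows one way to do that on a made-up column (`types` is illustrative):

```python
import pandas as pd

# Illustrative column; the labels match those seen in the Type column.
types = pd.Series(["Airlines", "Travel Agency", "Travel Agency", "Airlines"])

cat = pd.Categorical(types)
codes = cat.codes  # int8 codes, assigned in sorted label order

# categories[i] is the label behind code i.
mapping = dict(enumerate(cat.categories))
print(list(codes), mapping)
```

Integer codes imply an ordering that nominal features such as destination do not really have. Tree-based models are usually tolerant of this, while linear or distance-based models may be better served by one-hot encoding, for example with `pd.get_dummies`.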

In [17]:
df1.head()


Unnamed: 0,Commision,Sales,Duration,Type,Claimed,Channel,Product Name,Destination
0,-0.542807,-0.816433,-0.470051,0,0,1,2,0
1,-0.570282,-0.569127,-0.268605,1,0,1,2,0
2,-0.337133,-0.71194,-0.499894,1,0,1,2,1
3,-0.570282,-0.484288,-0.492433,1,0,1,1,0
4,-0.323003,-0.597407,-0.126846,0,0,1,0,0


In [18]:
df1.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3000 entries, 0 to 2999
Data columns (total 8 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Commision     3000 non-null   float64
 1   Sales         3000 non-null   float64
 2   Duration      3000 non-null   float64
 3   Type          3000 non-null   int8   
 4   Claimed       3000 non-null   int8   
 5   Channel       3000 non-null   int8   
 6   Product Name  3000 non-null   int8   
 7   Destination   3000 non-null   int8   
dtypes: float64(3), int8(5)
memory usage: 85.1 KB
