### The Story

An insurance company that provides health insurance to its customers needs a model to predict whether last year's policyholders would also be interested in the vehicle insurance it offers.

Such a model is valuable to the company because it can plan its communication strategy around the customers most likely to respond, and optimise its business model and revenue accordingly.

To make this prediction, you have information about demographics (gender, age, region code), vehicles (vehicle age, damage), and policy (premium, sourcing channel), among other fields.

We are therefore interested in predicting the customer's response to the vehicle insurance offer, that is, the probability that the response is 'yes'.


In [22]:
# importing the necessary libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt


In [23]:
# load the dataset and look at the first few rows
insuranc = pd.read_csv('dataset.csv')
insuranc.head()


,id,Gender,Age,Driving_License,Region_Code,Previously_Insured,Vehicle_Age,Vehicle_Damage,Annual_Premium,Policy_Sales_Channel,Vintage,Response
0,1,Male,44,1,28.0,0,> 2 Years,Yes,40454.0,26.0,217,1
1,2,Male,76,1,3.0,0,1-2 Year,No,33536.0,26.0,183,0
2,3,Male,47,1,28.0,0,> 2 Years,Yes,38294.0,26.0,27,1
3,4,Male,21,1,11.0,1,< 1 Year,No,28619.0,152.0,203,0
4,5,Female,29,1,41.0,1,< 1 Year,No,27496.0,152.0,39,0


In [24]:
# keep only the first 38,000 rows for this analysis
insurance=insuranc[:38000]

In [25]:
# to check on the missing values
insurance.isnull().sum()

id                      0
Gender                  0
Age                     0
Driving_License         0
Region_Code             0
Previously_Insured      0
Vehicle_Age             0
Vehicle_Damage          0
Annual_Premium          0
Policy_Sales_Channel    0
Vintage                 0
Response                0
dtype: int64

In [26]:
# to see the data types and more information
insurance.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 38000 entries, 0 to 37999
Data columns (total 12 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   id                    38000 non-null  int64  
 1   Gender                38000 non-null  object 
 2   Age                   38000 non-null  int64  
 3   Driving_License       38000 non-null  int64  
 4   Region_Code           38000 non-null  float64
 5   Previously_Insured    38000 non-null  int64  
 6   Vehicle_Age           38000 non-null  object 
 7   Vehicle_Damage        38000 non-null  object 
 8   Annual_Premium        38000 non-null  float64
 9   Policy_Sales_Channel  38000 non-null  float64
 10  Vintage               38000 non-null  int64  
 11  Response              38000 non-null  int64  
dtypes: float64(3), int64(6), object(3)
memory usage: 3.5+ MB


In [27]:
# to see the object columns
insurance_object = insurance.select_dtypes(include=['object'])
insurance_object

,Gender,Vehicle_Age,Vehicle_Damage
0,Male,> 2 Years,Yes
1,Male,1-2 Year,No
2,Male,> 2 Years,Yes
3,Male,< 1 Year,No
4,Female,< 1 Year,No
...,...,...,...
37995,Male,1-2 Year,Yes
37996,Female,1-2 Year,Yes
37997,Male,1-2 Year,Yes
37998,Female,< 1 Year,No


In [28]:
# to apply label encoder to the selected columns:
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
insurance.Gender= le.fit_transform(insurance.Gender.values)
insurance.Vehicle_Damage= le.fit_transform(insurance.Vehicle_Damage.values)


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self[name] = value
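
The warning above appears because `insurance` was created as a slice of `insuranc`, so pandas cannot tell whether the assignment should also affect the original frame. A minimal sketch of one common way to avoid it is to take an explicit `.copy()` of the slice before assigning. The small frame `raw` and the name `subset` below are hypothetical stand-ins, not the real dataset.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical stand-in for the real dataset (values are illustrative only)
raw = pd.DataFrame({
    "Gender": ["Male", "Female", "Male", "Female"],
    "Vehicle_Damage": ["Yes", "No", "No", "Yes"],
})

# Take an explicit copy of the slice so later assignments modify an
# independent DataFrame rather than a possible view of `raw`
subset = raw.iloc[:3].copy()

# Assign the encoded values by column name
subset["Gender"] = LabelEncoder().fit_transform(subset["Gender"])
subset["Vehicle_Damage"] = LabelEncoder().fit_transform(subset["Vehicle_Damage"])

print(subset)
```

In the notebook itself, the equivalent change would presumably be `insuranc[:38000].copy()` in the cell that creates `insurance`.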


In [29]:
insurance.head()

,id,Gender,Age,Driving_License,Region_Code,Previously_Insured,Vehicle_Age,Vehicle_Damage,Annual_Premium,Policy_Sales_Channel,Vintage,Response
0,1,1,44,1,28.0,0,> 2 Years,1,40454.0,26.0,217,1
1,2,1,76,1,3.0,0,1-2 Year,0,33536.0,26.0,183,0
2,3,1,47,1,28.0,0,> 2 Years,1,38294.0,26.0,27,1
3,4,1,21,1,11.0,1,< 1 Year,0,28619.0,152.0,203,0
4,5,0,29,1,41.0,1,< 1 Year,0,27496.0,152.0,39,0
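
The output above shows `Male` encoded as 1 and `Female` as 0. `LabelEncoder` assigns codes in sorted order of the labels, and the fitted `classes_` attribute records that order. Because the notebook reuses a single `le` for both columns, its `classes_` only reflects the last column fitted. The sketch below uses one encoder per column so both mappings can be inspected; the names `gender_le`, `damage_le`, and `mapping_for` are illustrative, not from the notebook.

```python
from sklearn.preprocessing import LabelEncoder

# One encoder per column, so each keeps its own fitted classes_
gender_le = LabelEncoder().fit(["Male", "Female", "Male"])
damage_le = LabelEncoder().fit(["Yes", "No", "Yes"])

def mapping_for(encoder):
    """Return a {label: integer code} dict from a fitted LabelEncoder."""
    return {str(label): code for code, label in enumerate(encoder.classes_)}

gender_mapping = mapping_for(gender_le)
damage_mapping = mapping_for(damage_le)

print(gender_mapping)
print(damage_mapping)
```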


In [63]:
# one-hot encode Vehicle_Age with scikit-learn's OneHotEncoder
from sklearn.preprocessing import OneHotEncoder

ohe = OneHotEncoder(drop='first')
ohe1 = ohe.fit(insurance[['Vehicle_Age']])
# transform() returns a sparse matrix; call .toarray() on that result
ohe_transform = ohe1.transform(insurance[['Vehicle_Age']]).toarray()

# reuse the original index so pd.concat aligns the rows correctly
ohe_df = pd.DataFrame(ohe_transform, index=insurance.index)

insurance_ = pd.concat([insurance, ohe_df], axis=1).drop(['Vehicle_Age'], axis=1)
insurance_

In [60]:
# alternative: one-hot encode Vehicle_Age with pandas get_dummies

dummies= pd.get_dummies(insurance['Vehicle_Age'], drop_first=True)
insurance_2 = pd.concat([insurance, dummies], axis=1).drop(['Vehicle_Age'], axis=1)
insurance_2

,id,Gender,Age,Driving_License,Region_Code,Previously_Insured,Vehicle_Damage,Annual_Premium,Policy_Sales_Channel,Vintage,Response,< 1 Year,> 2 Years
0,1,1,44,1,28.0,0,1,40454.0,26.0,217,1,0,1
1,2,1,76,1,3.0,0,0,33536.0,26.0,183,0,0,0
2,3,1,47,1,28.0,0,1,38294.0,26.0,27,1,0,1
3,4,1,21,1,11.0,1,0,28619.0,152.0,203,0,1,0
4,5,0,29,1,41.0,1,0,27496.0,152.0,39,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
37995,37996,1,44,1,28.0,0,1,30681.0,26.0,170,1,0,0
37996,37997,0,72,1,28.0,0,1,30798.0,13.0,273,0,0,0
37997,37998,1,59,1,21.0,0,1,24565.0,124.0,96,0,0,0
37998,37999,0,28,1,8.0,0,0,35588.0,152.0,55,1,1,0
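
With `drop_first=True`, the `1-2 Year` category is dropped and becomes the implicit baseline, which is why only `< 1 Year` and `> 2 Years` appear as columns above. Those column names contain `<` and `>`, which some libraries may reject as feature names (some versions of XGBoost, for example). A minimal sketch of renaming them follows; the replacement names and the toy frame `df` are assumptions chosen for illustration.

```python
import pandas as pd

# Toy stand-in for the Vehicle_Age column (illustrative values only)
df = pd.DataFrame(
    {"Vehicle_Age": ["> 2 Years", "1-2 Year", "< 1 Year", "1-2 Year"]}
)

# drop_first=True drops the first category in sorted order ("1-2 Year")
dummies = pd.get_dummies(df["Vehicle_Age"], drop_first=True)

# Hypothetical, model-friendly names for the remaining indicator columns
rename_map = {
    "< 1 Year": "Vehicle_Age_lt_1_Year",
    "> 2 Years": "Vehicle_Age_gt_2_Years",
}

# Cast to int because newer pandas versions return boolean dummies
dummies = dummies.rename(columns=rename_map).astype(int)

print(dummies)
```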
