# Problem Statement

Our client is an insurance company that has provided Health Insurance to its customers. It now needs your help building a model to predict whether last year's policyholders (customers) will also be interested in the Vehicle Insurance the company offers.

An insurance policy is an arrangement by which a company undertakes to provide a guarantee of compensation for specified loss, damage, illness, or death in return for the payment of a specified premium. A premium is a sum of money that the customer needs to pay regularly to an insurance company for this guarantee.

For example, you may pay a premium of Rs. 5,000 each year for a health insurance cover of Rs. 200,000 so that if, God forbid, you fall ill and need to be hospitalised in that year, the insurance provider will bear the cost of hospitalisation etc. up to Rs. 200,000. If you are wondering how the company can bear such a high hospitalisation cost when it charges a premium of only Rs. 5,000, that is where the concept of probabilities comes into the picture. Like you, there may be 100 customers paying a premium of Rs. 5,000 every year, but only a few of them (say 2-3) would be hospitalised that year. This way everyone shares the risk of everyone else.
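
The arithmetic behind this risk-sharing example can be sketched in a few lines. The figures come from the paragraph above; treating every claim as using the full cover is an assumption made here to get an upper bound, and the helper name `worst_case_payout` is purely illustrative.

```python
# Figures from the example above; the full-cover claim size is an assumed upper bound.
customers = 100
annual_premium = 5_000   # Rs. paid by each customer per year
cover = 200_000          # Rs. maximum payout per hospitalised customer

# Total premiums collected from all customers in one year.
premium_pool = customers * annual_premium


def worst_case_payout(claims: int) -> int:
    """Total payout if every claimant uses the full cover."""
    return claims * cover


for claims in (2, 3):
    payout = worst_case_payout(claims)
    print(f"{claims} claims: pool = {premium_pool}, "
          f"payout <= {payout}, balance >= {premium_pool - payout}")
```

The cover is a ceiling, so actual payouts can be lower than these bounds. The point is only that the premiums of the many customers who do not claim fund the few who do.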

Just like medical insurance, there is vehicle insurance, where every year the customer pays a premium of a certain amount to the insurance provider so that, in case of an unfortunate accident involving the vehicle, the provider will pay compensation (called the ‘sum assured’) to the customer.

![Health Insurance](https://image.freepik.com/free-vector/health-insurance-vector-illustration_159144-57.jpg)

# Task

Building a model to predict whether a customer would be interested in Vehicle Insurance is extremely helpful for the company, because it can then plan its communication strategy accordingly to reach those customers and optimise its business model and revenue.

To predict whether a customer would be interested in Vehicle Insurance, you have information about demographics (gender, age, region code type), vehicles (vehicle age, damage), policy (premium, sourcing channel), etc.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt # for plotting
import seaborn as sns
%matplotlib inline

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# Import Dataset

In [None]:
train_data = pd.read_csv("../input/health-insurance-cross-sell-prediction/train.csv")
test_data = pd.read_csv("../input/health-insurance-cross-sell-prediction/test.csv")

In [None]:
print(train_data.shape)
print(test_data.shape)

print('Train features : ', train_data.columns.values)
print('Test features : ', test_data.columns.values)
train_data.head()

In [None]:
train_data.info()

In [None]:
train_data.isnull().sum()

In [None]:
test_data.isnull().sum()

In [None]:
train_data.nunique()

In [None]:
train_data["Response"].value_counts(normalize= True)

In [None]:
train_data.describe().transpose()

# Exploratory Data Analysis

In [None]:
sns.countplot(x=train_data["Response"])

* There is a class imbalance problem here: there are far fewer records for the target value "1"
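
A minimal sketch of how the imbalance can be put into numbers. The labels below are a hypothetical toy list, not the real `Response` column, and `class_shares` is an illustrative helper that mirrors what `value_counts(normalize=True)` reports in pandas.

```python
from collections import Counter


def class_shares(labels):
    """Return each class's share of the total as a dict."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {cls: n / total for cls, n in counts.items()}


# Hypothetical toy target: nine negatives for every positive (not the real data).
toy_response = [0] * 90 + [1] * 10

shares = class_shares(toy_response)
# Ratio of the majority class share to the minority class share.
imbalance_ratio = max(shares.values()) / min(shares.values())

print(shares)
print("majority-to-minority ratio:", round(imbalance_ratio, 2))
```

With a ratio like this, a model that always predicts "0" already reaches high accuracy, which is why recall, precision and F1 are reported later alongside accuracy.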

In [None]:
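# distplot is deprecated in recent seaborn releases; histplot with a KDE overlay replaces it
sns.histplot(train_data.Age, kde=True, stat="density")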

* Most of the customers are in the 20-27 age group

In [None]:
sns.boxplot(y = 'Age', data = train_data)

In [None]:
plt.figure(figsize=(12,8))
sns.scatterplot(x=train_data['Age'],y=train_data['Annual_Premium'])

* The majority of customers aged 20-26 pay a lower annual premium, which may mean the younger generation tends to buy lower-value vehicles
* Some customers aged 40-50 pay higher premiums, which may mean they tend to buy higher-value vehicles

In [None]:
sns.countplot(x=train_data.Gender)

In [None]:
sns.countplot(x=train_data.Previously_Insured)

In [None]:
sns.countplot(x=train_data.Vehicle_Age)

In [None]:
sns.countplot(x=train_data.Vehicle_Damage)

In [None]:
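# distplot is deprecated in recent seaborn releases; histplot with a KDE overlay replaces it
sns.histplot(train_data.Annual_Premium, kde=True, stat="density")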

In [None]:
sns.boxplot(y = 'Annual_Premium', data = train_data)

In [None]:
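# distplot is deprecated in recent seaborn releases; histplot with a KDE overlay replaces it
sns.histplot(train_data.Vintage, kde=True, stat="density")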

# Data Preprocessing

### Encoding Variables for model

In [None]:
train_data['Gender'] = train_data['Gender'].map( {'Female': 0, 'Male': 1} ).astype(int)

In [None]:
train_data=pd.get_dummies(train_data,drop_first=True)

In [None]:
train_data.head()

In [None]:
train_data=train_data.rename(columns={"Vehicle_Age_< 1 Year": "Vehicle_Age_lt_1_Year", "Vehicle_Age_> 2 Years": "Vehicle_Age_gt_2_Years"})
train_data['Vehicle_Age_lt_1_Year']=train_data['Vehicle_Age_lt_1_Year'].astype('int')
train_data['Vehicle_Age_gt_2_Years']=train_data['Vehicle_Age_gt_2_Years'].astype('int')
train_data['Vehicle_Damage_Yes']=train_data['Vehicle_Damage_Yes'].astype('int')

In [None]:
train_data.head()

In [None]:
numeric_features = ['Age','Vintage']

from sklearn.preprocessing import LabelEncoder, StandardScaler, MinMaxScaler, RobustScaler
ss = StandardScaler()
train_data[numeric_features] = ss.fit_transform(train_data[numeric_features])


mm = MinMaxScaler()
train_data[['Annual_Premium']] = mm.fit_transform(train_data[['Annual_Premium']])

In [None]:
train_data=train_data.drop('id',axis=1)

In [None]:
test_data['Gender'] = test_data['Gender'].map( {'Female': 0, 'Male': 1} ).astype(int)
test_data=pd.get_dummies(test_data,drop_first=True)
test_data=test_data.rename(columns={"Vehicle_Age_< 1 Year": "Vehicle_Age_lt_1_Year", 
                                    "Vehicle_Age_> 2 Years": "Vehicle_Age_gt_2_Years"})
test_data['Vehicle_Age_lt_1_Year']=test_data['Vehicle_Age_lt_1_Year'].astype('int')
test_data['Vehicle_Age_gt_2_Years']=test_data['Vehicle_Age_gt_2_Years'].astype('int')
test_data['Vehicle_Damage_Yes']=test_data['Vehicle_Damage_Yes'].astype('int')

In [None]:
test_data[numeric_features] = ss.transform(test_data[numeric_features])


# Reuse the MinMaxScaler fitted on the training data; refitting on the test set would scale it differently
test_data[['Annual_Premium']] = mm.transform(test_data[['Annual_Premium']])

In [None]:
plt.figure(figsize=(12,8))
sns.heatmap(train_data.corr(), annot=True, cmap='viridis')

* The most relevant is the relation between "Response" and the rest of the variables. 
* We can see some good correlation between Response and "Vehicle_Damage"
* There is a negative relation with binary "Previously_Insured" variable.

In [None]:
plt.figure(figsize=(12,6))
train_data.corr()['Response'].drop('Response').sort_values(ascending=False).plot(kind='bar')

* The graph above shows a negative relationship with the Previously_Insured feature, while customers whose vehicle has been damaged in the past (Vehicle_Damage) tend to be interested in Vehicle Insurance
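
To make the sign of these correlations concrete, here is a small sketch of the Pearson correlation (the default in `DataFrame.corr()`) computed by hand for binary columns. The three lists are hypothetical toy values chosen for illustration, not rows from the dataset, and `pearson` is an illustrative helper.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical toy columns (0/1 encoded), not the real data.
previously_insured = [1, 1, 1, 1, 0, 0, 0, 0]
vehicle_damage     = [0, 0, 0, 1, 1, 1, 1, 0]
response           = [0, 0, 0, 0, 1, 1, 0, 0]

r_insured = pearson(previously_insured, response)
r_damage = pearson(vehicle_damage, response)

print("Previously_Insured vs Response:", round(r_insured, 3))
print("Vehicle_Damage vs Response:", round(r_damage, 3))
```

In this toy data the positive responses occur only among customers who are not yet insured and have had vehicle damage, so the first coefficient is negative and the second is positive, the same pattern as in the bar chart.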

## Train Test Split

In [None]:
id=test_data.id
test_data=test_data.drop('id',axis=1)

In [None]:
from sklearn.model_selection import train_test_split

train_target=train_data['Response']
train_data=train_data.drop(['Response'], axis = 1)
x_train,x_test,y_train,y_test = train_test_split(train_data,train_target, random_state = 0)

Since the target variable is imbalanced, SMOTE is used to oversample the minority class in the training split.
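
The core idea of SMOTE, sketched under simplifying assumptions: a synthetic minority sample is placed at a random position on the line segment between a minority sample and one of its minority-class neighbours. This sketch uses only the single nearest neighbour and plain Python lists; the imbalanced-learn implementation picks randomly among k nearest neighbours and is what the notebook actually runs. The function names and toy points are illustrative.

```python
import random


def nearest_neighbor(sample, others):
    """Closest other minority sample by squared Euclidean distance."""
    return min(others, key=lambda o: sum((a - b) ** 2 for a, b in zip(sample, o)))


def synthesize(sample, neighbor, rng):
    """Create one synthetic point on the segment between two minority samples."""
    gap = rng.random()  # random position along the segment, in [0, 1)
    return [s + gap * (n - s) for s, n in zip(sample, neighbor)]


rng = random.Random(2)

# Hypothetical minority-class points with two features each.
minority = [[0.0, 0.0], [1.0, 1.0], [4.0, 0.0]]

base = minority[0]
neighbor = nearest_neighbor(base, minority[1:])
new_point = synthesize(base, neighbor, rng)

print("neighbour:", neighbor)
print("synthetic point:", new_point)
```

Because the new points are interpolated rather than copied, the minority class grows without exact duplicates. Oversampling is applied to the training split only, so the held-out split keeps the original class balance.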

In [None]:
from imblearn.over_sampling import SMOTE
sm = SMOTE(random_state=2)
# fit_sample was removed from imbalanced-learn; fit_resample is the current method name
x_train, y_train = sm.fit_resample(x_train, y_train)

# Create Neural Network

In [None]:
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.optimizers import Adam
from sklearn.model_selection import cross_val_score, GridSearchCV
from sklearn.metrics import r2_score
from tensorflow.keras import backend as K

In [None]:
K.clear_session()

In [None]:
model = Sequential()
model.add(Dense(11, activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(6, activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(3, activation='relu'))
model.add(Dropout(0.2))

In [None]:
model.add(Dense(units=1, activation='sigmoid'))

In [None]:
opt = Adam(learning_rate=0.01)
model.compile(loss='binary_crossentropy', optimizer=opt)

In [None]:
early_stop = EarlyStopping(monitor='val_loss', mode='min', verbose=1, patience=15)
model.fit(x=x_train, y=y_train, epochs=100, batch_size=72, validation_data=(x_test, y_test), callbacks=[early_stop])

In [None]:
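# predict_classes was removed in TensorFlow 2.6; threshold the sigmoid output at 0.5 instead
pred = (model.predict(x_test) > 0.5).astype("int32").ravel()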

In [None]:
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score, f1_score, \
    recall_score, classification_report, precision_score

print(classification_report(y_test, pred))
print('-----------------------------------------------------------------')
accuracy = accuracy_score(y_test, pred)
precision = precision_score(y_test, pred)
confusion = confusion_matrix(y_test, pred)
recall = recall_score(y_test, pred)
f1 = f1_score(y_test, pred)
print('-----------------------------------------------------------------')
print('accuracy: ', accuracy)
print('-----------------------------------------------------------------')
print('recall: ', recall)
print('-----------------------------------------------------------------')
print('precision: ', precision)
print('-----------------------------------------------------------------')
print('f1_score: ', f1)
print('-----------------------------------------------------------------')
# ROC AUC is computed from predicted probabilities rather than thresholded labels
print('ROC AUC Score:', roc_auc_score(y_test, model.predict(x_test).ravel()))
print('-----------------------------------------------------------------')
print('confusion matrix:')
print(confusion)
print('-----------------------------------------------------------------')
print('-----------------------------------------------------------------')

# Submission Result Data

In [None]:
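# predict_classes was removed in TensorFlow 2.6; threshold the sigmoid output at 0.5 instead
pred = (model.predict(test_data) > 0.5).astype("int32").ravel()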

In [None]:
submission = None
submission = pd.concat([id, pd.DataFrame(columns = ['Prediction'], data = pred)], axis=1)
submission.to_csv('vehicle_insurance_predicted.csv', index = False)
submission.head()

<div style="text-align:center;color:white;font-size:150%;border-radius:5px;display:fill;background-color:#5642C5;font-family:Verdana;letter-spacing:0.5px;padding: 10px;" > <br> If you find this useful, make sure to appreciate it with an <b>UPVOTE</b>! 👍 
    <br>
</div>

<br>


![Suranga Nanayakkara](https://cdn1.bbcode0.com/uploads/2020/12/26/553f26c97fcdb5d167020f64ea95fa51-full.png) 

[Let's Connect on LinkedIn!](https://www.linkedin.com/in/surangan/)