# Health Insurance Cross Sell Prediction 🏠 🏥

Predict which health insurance owners will be interested in vehicle insurance.

# Workflow stages

The competition solution workflow goes through the following stages:
1. Acquire training and testing data
2. Wrangle, prepare, and cleanse the data


In [None]:
import pandas as pd
import numpy as np
import sys
import random as rnd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

# Acquire Data
We will use the pandas package to load the data and store it in train_df and test_df. We will also combine the two datasets in a list so that certain operations can be run on both together.

In [None]:
# Raw strings keep the backslashes in the Windows path from being read as escape sequences.
train_df = pd.read_csv(r"D:\Learning\ML_projects\Health_Insurance_cross_sell_prediction/train.csv")
test_df = pd.read_csv(r"D:\Learning\ML_projects\Health_Insurance_cross_sell_prediction/test.csv")
combine = [train_df, test_df]
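
The hard-coded paths above only work on one machine. As a minimal sketch, the loading step could be wrapped in a helper that takes the data directory as an argument. The function name `load_datasets` is hypothetical and not part of the original notebook; it assumes the directory contains `train.csv` and `test.csv`.

```python
from pathlib import Path

import pandas as pd


def load_datasets(data_dir):
    """Read train.csv and test.csv from data_dir and return them as DataFrames.

    Hypothetical helper: assumes both files exist in data_dir.
    """
    data_dir = Path(data_dir)
    train = pd.read_csv(data_dir / "train.csv")
    test = pd.read_csv(data_dir / "test.csv")
    return train, test
```

With such a helper, the cell above would reduce to `train_df, test_df = load_datasets(some_directory)`, where `some_directory` is wherever the competition files were downloaded.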

# Getting quick insights of data

We will take a quick look at the data by printing the first five and last five rows of both datasets. We will also print the column names to see which features are available.

In [None]:
print(train_df.columns.values)

In [None]:
train_df.head()

In [None]:
train_df.tail()

## Which features are categorical (nominal or ordinal) and which are numerical (continuous or discrete)?

* Nominal : **Gender, Driving_license, Previously_Insured, Vehicle_Damage, Response**

* Ordinal : **Vehicle_Age**

* Continuous : **Age, Annual_Premium.** 
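
One way to support a classification like the one above is to look at each column's dtype and number of distinct values. The sketch below does this on a small made-up frame; the rows are invented for illustration and `summarize_features` is a hypothetical helper, not something from the original notebook.

```python
import pandas as pd


def summarize_features(df):
    """Return one row per column with its dtype and number of distinct values."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "n_unique": df.nunique(),
    })


# Invented rows that only mimic the shape of the real data.
toy = pd.DataFrame({
    "Gender": ["Male", "Female", "Male"],
    "Vehicle_Age": ["< 1 Year", "1-2 Year", "> 2 Years"],
    "Age": [25, 40, 61],
})
summary = summarize_features(toy)
print(summary)
```

Columns with a handful of distinct values are usually categorical, while numeric columns with many distinct values are candidates for continuous features. Running `summarize_features(train_df)` would give the same overview for the real training set.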


In [None]:
test_df.head()

In [None]:
test_df.tail()

## Are there any blank, null or empty values?

The `info()` output below shows the non-null count for every column. There are no null values in either dataset, so we will move on to the next stage.

In [None]:
train_df.info()
print("="*100)
test_df.info()
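
The null check can also be made explicit by counting missing values per column with `isnull().sum()`. The sketch below uses a small invented frame that deliberately contains gaps; `count_missing` is a hypothetical helper name.

```python
import numpy as np
import pandas as pd


def count_missing(df):
    """Return the number of null values in each column."""
    return df.isnull().sum()


# Invented rows with one gap per column, purely for illustration.
toy = pd.DataFrame({
    "Age": [25, np.nan, 61],
    "Gender": ["Male", "Female", None],
})
missing = count_missing(toy)
print(missing)
```

Applied to the real data, `count_missing(train_df)` and `count_missing(test_df)` should report zero for every column if the datasets are complete.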

In [None]:
train_df.describe()

## Correlate the data

We plot a heatmap to check the correlations. Region_Code shows no notable correlation, so we will drop it. The id column is only an identifier, so we will drop it as well.

After that we will encode Gender, Vehicle_Age, and Vehicle_Damage as integers.

In [None]:
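def health_in(data):
    # Correlate numeric columns only; string columns such as Gender cannot be correlated.
    correlation = data.select_dtypes(include="number").corr()
    sns.heatmap(correlation, annot=True, cbar=True, cmap="RdYlGn")

health_in(train_df)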
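
A heatmap can be hard to read when deciding which columns to drop. As a complementary sketch, the correlations with the target can be listed and ranked by absolute value. The data below is invented and `correlation_with_target` is a hypothetical helper; it assumes the target column is named `Response`, as in the training set.

```python
import pandas as pd


def correlation_with_target(df, target="Response"):
    """Rank numeric features by the absolute value of their correlation with target."""
    numeric = df.select_dtypes(include="number")
    corr = numeric.corr()[target].drop(target)
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)


# Invented rows: Vehicle_Damage tracks Response exactly, Region_Code does not.
toy = pd.DataFrame({
    "Gender": ["Male", "Female", "Male", "Female", "Male", "Female"],
    "Vehicle_Damage": [0, 1, 0, 1, 0, 1],
    "Region_Code": [1, 1, 2, 2, 3, 3],
    "Response": [0, 1, 0, 1, 0, 1],
})
ranked = correlation_with_target(toy)
print(ranked)
```

Features near the bottom of this ranking are the ones a correlation-based filter would drop. Note that correlation only captures linear relationships, so a low value is a hint rather than proof that a feature is useless.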

In [None]:
train_df = train_df.drop(['id'], axis=1)
test_df = test_df.drop(['id'], axis=1)

In [None]:
train_df = train_df.drop(['Region_Code'], axis=1)
test_df = test_df.drop(['Region_Code'], axis=1)

In [None]:
train_df.head()

In [None]:
# Encode Gender and Vehicle_Age as integers in both datasets.
# map() returns a numeric column, whereas assigning through .loc leaves the dtype as object.
gender_map = {'Male': 0, 'Female': 1}
vehicle_age_map = {'< 1 Year': 0, '1-2 Year': 1, '> 2 Years': 2}
for df in (train_df, test_df):
    df['Gender'] = df['Gender'].map(gender_map)
    df['Vehicle_Age'] = df['Vehicle_Age'].map(vehicle_age_map)


train_df.head()

In [None]:
# Encode Vehicle_Damage as 1 (Yes) or 0 (No) in both datasets.
vehicle_damage_map = {'Yes': 1, 'No': 0}
for df in (train_df, test_df):
    df['Vehicle_Damage'] = df['Vehicle_Damage'].map(vehicle_damage_map)

test_df.head()


In [None]:
train_df.head()

In [None]:
X_train = train_df.drop(['Response'], axis =1)
Y_train = train_df['Response']

X_test = test_df
X_train.shape, Y_train.shape, X_test.shape
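
The test set has no Response column, so it cannot be used to measure accuracy. A common approach, sketched below on invented data, is to hold out part of the training set for validation with scikit-learn's `train_test_split`. The 80/20 split, the `random_state`, and the variable names are assumptions for illustration, not choices made in the original notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Invented features and labels: 20 rows, 5 of them positive.
X = pd.DataFrame({
    "Age": range(20, 40),
    "Previously_Insured": [0, 1] * 10,
})
y = pd.Series([0, 0, 0, 1] * 5, name="Response")

# stratify=y keeps the share of positive responses the same in both parts.
X_fit, X_val, y_fit, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)
print(X_fit.shape, X_val.shape)
```

A model fitted on `X_fit` can then be scored on `X_val` to estimate how it behaves on rows it has not seen.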

## Fitting and predictions

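We will now fit several classifiers to the data. First we import them, then for each one we fit it, predict on the test set, and store its score.
Each score is computed on the training data. The random forest and decision tree classifiers both reach 99.7 percent training accuracy, which shows how closely they fit the training rows rather than how well they will perform on unseen data.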


In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC, LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import Perceptron
from sklearn.linear_model import SGDClassifier
from sklearn.tree import DecisionTreeClassifier

In [None]:
logreg = LogisticRegression()
logreg.fit(X_train, Y_train)
Y_pred = logreg.predict(X_test)
acc_log = round(logreg.score(X_train, Y_train) * 100, 2)
acc_log

In [None]:
gaussian = GaussianNB()
gaussian.fit(X_train, Y_train)
Y_pred = gaussian.predict(X_test)
acc_gaussian = round(gaussian.score(X_train, Y_train) * 100, 2)
acc_gaussian

In [None]:
knn = KNeighborsClassifier(n_neighbors = 3)
knn.fit(X_train, Y_train)
Y_pred = knn.predict(X_test)
acc_knn = round(knn.score(X_train, Y_train) * 100, 2)
acc_knn 


In [None]:
linear_svc = LinearSVC()
linear_svc.fit(X_train, Y_train)
Y_pred = linear_svc.predict(X_test)
acc_linear_svc = round(linear_svc.score(X_train, Y_train) * 100, 2)
acc_linear_svc

In [None]:
sgd = SGDClassifier()
sgd.fit(X_train, Y_train)
Y_pred = sgd.predict(X_test)
acc_sgd = round(sgd.score(X_train, Y_train) * 100, 2)
acc_sgd

In [None]:
decision_tree = DecisionTreeClassifier()
decision_tree.fit(X_train, Y_train)
Y_pred = decision_tree.predict(X_test)
acc_decision_tree = round(decision_tree.score(X_train, Y_train) * 100, 2)
acc_decision_tree

In [None]:
random_forest = RandomForestClassifier(n_estimators=100)
random_forest.fit(X_train, Y_train)
Y_pred = random_forest.predict(X_test)
# Accuracy on the training data, as a percentage.
acc_random_forest = round(random_forest.score(X_train, Y_train) * 100, 2)
acc_random_forest

In [None]:
models = pd.DataFrame({
    'Model': ['KNN', 'Logistic Regression', 
              'Random Forest', 'Naive Bayes', 
              'Stochastic Gradient Descent', 'Linear SVC', 
              'Decision Tree'],
    'Score': [ acc_knn, acc_log, 
              acc_random_forest, acc_gaussian, 
              acc_sgd, acc_linear_svc, acc_decision_tree]})
models.sort_values(by='Score', ascending=False)
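
The scores in the table above are computed on the same rows the models were fitted on. As a hedged sketch of an alternative, k-fold cross-validation scores each model on rows held out from fitting. The example below uses synthetic data from `make_classification` and only two of the classifiers; the 5 folds, the `max_iter` value, and the variable names are assumptions for illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for X_train and Y_train.
X, y = make_classification(n_samples=300, n_features=6, random_state=0)

candidates = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
}

# Each model is fitted 5 times, each time scored on the fold it did not see.
cv_scores = {
    name: cross_val_score(model, X, y, cv=5, scoring="accuracy")
    for name, model in candidates.items()
}

cv_table = pd.DataFrame({
    "Model": list(cv_scores),
    "CV accuracy (%)": [round(scores.mean() * 100, 2) for scores in cv_scores.values()],
}).sort_values(by="CV accuracy (%)", ascending=False)
print(cv_table)
```

Replacing the synthetic `X` and `y` with `X_train` and `Y_train` would give a comparison that is less flattering to models that memorize the training rows.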

In [None]:
# Reload the original test file to recover the id column that was dropped earlier.
test_sf = pd.read_csv(r"D:\Learning\ML_projects\Health_Insurance_cross_sell_prediction/test.csv")


In [None]:
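# Y_pred holds the predictions of the last model fitted, which is the random forest
# if the cells above were run in order.
submission = pd.DataFrame({
        "Id": test_sf["id"],
        "Response": Y_pred
    })
submission.to_csv(r'D:\Learning\ML_projects\Health_Insurance_cross_sell_prediction/submission.csv', index=False)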