# Insurance Churn Prediction

Insurance companies around the world operate in a very competitive environment. With various aspects of data collected from millions of customers, it is painstakingly hard to analyze and understand the reason for a customer’s decision to switch to a different insurance provider.

For an industry where customer acquisition and retention are equally important, and where acquisition is the more expensive of the two, insurance companies rely on data to understand customer behavior and prevent churn. Knowing in advance that a customer is likely to switch gives insurance companies an opportunity to come up with strategies to prevent it from actually happening.

# Task

Given 16 distinguishing factors that can help in understanding customer churn, your objective as a data scientist is to build a Machine Learning model that can predict whether the insurance company will lose a customer or not.

You are provided with 16 anonymized factors (feature_0 to feature_15) that influence the churn of customers in the insurance industry.

*Build a Machine Learning model that can predict whether the insurance company will lose a customer or not using these factors.*

# 1. Read and import all data files

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.cm as cm
import matplotlib.pyplot as plt
import seaborn as sns
figure = plt.figure()

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
#read data
train = pd.read_csv("/kaggle/input/insurance-churn-prediction-weekend-hackathon/Insurance_Churn_ParticipantsData/Train.csv")
test = pd.read_csv("/kaggle/input/insurance-churn-prediction-weekend-hackathon/Insurance_Churn_ParticipantsData/Test.csv")

In [None]:
train.head()

In [None]:
test.head()

# 2. Features Analysis:

In [None]:
train.describe()

### 2.1 **Check Categorical features**

In [None]:
train.dtypes

As the dtypes show, there are no categorical features, so no categorical mapping is needed.
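The same check can be done programmatically instead of reading the `dtypes` output by eye. This is a minimal sketch on a made-up frame (the column names `feature_a`, `feature_b` and `plan` are hypothetical, not from the competition data); `select_dtypes` is a standard pandas method.

```python
import pandas as pd

# Toy frame with one non-numeric column (hypothetical, for illustration only)
toy = pd.DataFrame({
    "feature_a": [0.1, 0.2],
    "feature_b": [1, 2],
    "plan": ["basic", "gold"],
})

# Columns that are not numeric would need encoding before modelling
non_numeric = toy.select_dtypes(exclude="number").columns.tolist()
print(non_numeric)
```

An empty list here would confirm that no categorical mapping is needed.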

### 2.2 **Check missing values**

In [None]:
train.isnull().sum()

There are no missing values.

### 2.3 Univariate Analysis

This analysis helps us spot outliers in the dataset, which could otherwise lead the model to overfit to noise.

In [None]:
for col in train.columns:
    plt.boxplot(train[col])
    plt.title(f'Box plot of {col}')  # label each plot with its column name
    plt.show()

With the help of the plots, let's find the indices of the main outliers so we can delete them.

In [None]:
train[train['feature_1']>24].index

In [None]:
train[train['feature_3']>15].index

In [None]:
train[train['feature_4']>16].index

In [None]:
train[train['feature_6']>20].index

Now let us drop these outliers from the dataset.

In [None]:
train1 = train.drop([5445, 5606, 29608, 20042, 17893, 20894, 32159, 7705])
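The thresholds above were read off the box plots by eye. As an alternative, the 1.5 × IQR rule that box plot whiskers are based on can flag such points programmatically. This is only a sketch, not the method used in this notebook: the helper name `iqr_outlier_mask` and the toy column `feature_x` are hypothetical.

```python
import pandas as pd

def iqr_outlier_mask(series: pd.Series, k: float = 1.5) -> pd.Series:
    """True where a value lies outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)

# Toy data standing in for one feature column (not the competition data)
toy = pd.DataFrame({"feature_x": [0.1, -0.3, 0.2, 0.0, 0.4, -0.1, 25.0]})

mask = iqr_outlier_mask(toy["feature_x"])
outlier_index = toy[mask].index          # row labels of the flagged points
toy_clean = toy.drop(outlier_index)      # same drop-by-index idea as above
print(list(outlier_index), len(toy_clean))
```

On heavy-tailed features this rule can flag many rows, so it is worth checking how many points it marks before dropping them.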

### 2.4 Bi-variate Analysis
Check which features are relevant.

In [None]:
plt.figure(figsize=(16,8))
corr = train.corr()
ax = sns.heatmap(
    corr, 
    vmin=-1, vmax=1, center=0,
    cmap=sns.diverging_palette(20, 220, n=200),
    square=True
)
ax.set_xticklabels(
    ax.get_xticklabels(),
    rotation=45,
    horizontalalignment='right'
);
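A heatmap can be hard to read when only the relationship to the target matters. A possible complement is to take the `labels` column of the correlation matrix and sort it by absolute value. The sketch below uses synthetic data (the columns `feature_a` and `feature_b` are made up); only the column name `labels` is taken from this notebook.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500

# Synthetic data: the label depends on feature_a only
toy = pd.DataFrame({
    "feature_a": rng.normal(size=n),
    "feature_b": rng.normal(size=n),
})
toy["labels"] = (toy["feature_a"] + 0.5 * rng.normal(size=n) > 0).astype(int)

# Correlation of every feature with the label, strongest first
label_corr = toy.corr()["labels"].drop("labels")
order = label_corr.abs().sort_values(ascending=False).index
label_corr = label_corr.reindex(order)
print(label_corr)
```

Note that Pearson correlation only captures linear relationships, so a low value does not prove a feature is useless to a tree-based model.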

# 3. Split the features and labels from data

In [None]:
# Use train1 (outliers removed) so the dropped rows stay out of the model
X = train1[[col for col in train1.columns if col != 'labels']]
X = X.set_index('feature_0')
X.shape

In [None]:
y = train1['labels']
y.shape

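# 4. Splitting the dataset into the Training set and Validation set

We divide the data here into an 80% training set and a 20% validation set. The validation set is held out from the training data; it is separate from the competition's Test.csv.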

In [None]:
from sklearn.model_selection import train_test_split
train_X, val_X, train_y, val_y = train_test_split(X, y,train_size=0.8, test_size=0.2,random_state = 0)
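If churners are a minority class, as is common in churn data, a plain random split can leave the two sets with slightly different class ratios. One option, not used in this notebook, is the `stratify` argument of `train_test_split`, which keeps the class proportions the same in both parts. A small sketch on toy data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy imbalanced data: 900 negatives, 100 positives
X_toy = np.arange(1000).reshape(-1, 1)
y_toy = np.array([0] * 900 + [1] * 100)

tr_X, va_X, tr_y, va_y = train_test_split(
    X_toy, y_toy,
    train_size=0.8,
    random_state=0,
    stratify=y_toy,   # preserve the 90/10 class ratio in both splits
)
print(tr_y.mean(), va_y.mean())
```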

# 5. Create Model

Here I'm using XGBoost for classification.

XGBoost is an optimized distributed gradient boosting library designed to be highly efficient, flexible and portable. It implements machine learning algorithms under the Gradient Boosting framework. XGBoost provides parallel tree boosting (also known as GBDT, GBM) that solves many data science problems in a fast and accurate way.

### 5.1 Import model

The parameters below have already been tuned by me.

In [None]:
from xgboost import XGBClassifier
model_xgb = XGBClassifier(n_estimators=178,
                          eta=0.17,              # learning rate
                          booster='dart',        # the parameter is `booster`, not `booster_pram`
                          tree_method='hist',
                          scale_pos_weight=5,    # up-weight the positive (churn) class
                          max_bin=215,
                          random_state=0)
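The value of `scale_pos_weight` above came from tuning. As a starting point for that kind of search, the XGBoost documentation suggests the ratio of negative to positive samples. The helper below, `suggested_scale_pos_weight`, is a hypothetical name, and the label counts are invented for illustration.

```python
import numpy as np

def suggested_scale_pos_weight(y) -> float:
    """Ratio of negative to positive samples for binary 0/1 labels."""
    y = np.asarray(y)
    n_pos = int((y == 1).sum())
    n_neg = int((y == 0).sum())
    if n_pos == 0:
        raise ValueError("no positive samples in y")
    return n_neg / n_pos

# Made-up label distribution, not the competition data
toy_labels = [0] * 850 + [1] * 150
ratio = suggested_scale_pos_weight(toy_labels)
print(round(ratio, 2))
```

In this notebook the equivalent call would be on `train_y`, and the result could then be compared against the tuned value.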

### 5.2 Fit the model

In [None]:
model_xgb.fit(train_X,
          train_y)

### 5.3 Make predictions on the validation data

In [None]:
predict = model_xgb.predict(val_X)

### 5.4 Evaluate these predictions using F1 score metric

The F1 score is the harmonic mean of precision and recall; it reaches its best value at 1 and its worst at 0. Precision and recall contribute equally to the F1 score. The formula for the F1 score is:

F1 = 2 * (precision * recall) / (precision + recall)
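To make the formula concrete, here is a small worked example in plain Python. The counts of true positives, false positives and false negatives are invented, and `f1_from_counts` is a hypothetical helper, not part of scikit-learn.

```python
def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """F1 computed from raw counts via precision and recall."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * (precision * recall) / (precision + recall)

# Invented counts: 8 churners caught, 2 false alarms, 6 churners missed
tp, fp, fn = 8, 2, 6
precision = tp / (tp + fp)            # 0.8
recall = tp / (tp + fn)               # 8 / 14
f1 = f1_from_counts(tp, fp, fn)
arithmetic_mean = (precision + recall) / 2

print(f"precision={precision:.3f} recall={recall:.3f} "
      f"f1={f1:.3f} mean={arithmetic_mean:.3f}")
```

Because it is a harmonic mean, F1 sits below the arithmetic mean whenever precision and recall differ, so a model cannot score well by maximizing only one of them.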

In [None]:
from sklearn.metrics import f1_score
f1_score(val_y,predict)

0.63 is a pretty good score on this dataset. It can be improved with some feature engineering, and you can try different algorithms too.

### 5.5 Confusion Matrix

A confusion matrix is a table that is often used to describe the performance of a classification model (or "classifier") on a set of test data for which the true values are known.


In [None]:
from sklearn.metrics import confusion_matrix
print("Confusion matrix \n",confusion_matrix(val_y,predict))
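For binary labels, scikit-learn lays the matrix out with true classes as rows and predicted classes as columns, so `.ravel()` returns the four cells in the order TN, FP, FN, TP. The sketch below uses invented labels to show how those cells relate to the F1 score.

```python
from sklearn.metrics import confusion_matrix, f1_score

# Invented labels for illustration (1 = churn)
y_true = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 1, 0, 0, 1, 0, 1, 0, 1, 0]

# Rows are true classes, columns are predicted classes
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1_manual = 2 * precision * recall / (precision + recall)

print(tn, fp, fn, tp)
print(f1_manual, f1_score(y_true, y_pred))
```

In this notebook the same unpacking would apply to `confusion_matrix(val_y, predict)`.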

Keep supporting!


## Any advice would be appreciated.