![image.png](attachment:image.png)

Prudential is one of the largest issuers of life insurance in the USA.

In a one-click shopping world with on-demand everything, the life insurance application process is antiquated. Customers provide extensive information to identify risk classification and eligibility, including scheduling medical exams, a process that takes an average of 30 days.

The result? People are turned off. That’s why only 40% of U.S. households own individual life insurance. Prudential wants to make it quicker and less labor intensive for new and existing customers to get a quote while maintaining privacy boundaries.

By developing a predictive model that accurately classifies risk using a more automated approach, you can greatly impact public perception of the industry.


# Goal

In this dataset, you are provided over a hundred variables describing attributes of life insurance applicants. The task is to predict the "Response" variable for each Id in the test set. "Response" is an ordinal measure of risk that has 8 levels.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Any results you write to the current directory are saved as output.

Importing necessary packages and data

In [None]:
import pandas as pd 
import pandas_profiling as pdp
import matplotlib.pyplot as plt
import seaborn as sns

pd.options.display.float_format = '{:.3f}'.format
%matplotlib inline
plt.style.use('fivethirtyeight')

In [None]:
train=pd.read_csv('../input/prudential-life-insurance-assessment/train.csv.zip')
test=pd.read_csv('../input/prudential-life-insurance-assessment/test.csv.zip')

## Data Description

* train.csv - the training set, contains the Response values
* test.csv - the test set, you must predict the Response variable for all rows in this file


* Id :	A unique identifier associated with an application.
* Product_Info_1-7 :	A set of normalized variables relating to the product applied for
* Ins_Age :	Normalized age of applicant
* Ht :	Normalized height of applicant
* Wt :	Normalized weight of applicant
* BMI :	Normalized BMI of applicant
* Employment_Info_1-6 :	A set of normalized variables relating to the employment history of the applicant.
* InsuredInfo_1-6 :	A set of normalized variables providing information about the applicant.
* Insurance_History_1-9 :	A set of normalized variables relating to the insurance history of the applicant.
* Family_Hist_1-5 :	A set of normalized variables relating to the family history of the applicant.
* Medical_History_1-41 :	A set of normalized variables relating to the medical history of the applicant.
* Medical_Keyword_1-48 :	A set of dummy variables relating to the presence of/absence of a medical keyword being associated with the application.
* Response :	This is the target variable, an ordinal variable relating to the final decision associated with an application

In [None]:
train.head()

In [None]:
train.info()

#  Missing Value Analysis

In [None]:
def missing_data(data):
    total = data.isnull().sum()
    percent = (data.isnull().sum()/data.isnull().count()*100)
    tt = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
    types = []
    for col in data.columns:
        dtype = str(data[col].dtype)
        types.append(dtype)
    tt['Types'] = types
    return tt
    #return(np.transpose(tt))

In [None]:
#checking missing value percentage in train data
missing_data(train)['Percent'].sort_values(ascending=False)

In [None]:
#checking missing value percentage in test data
missing_data(test)['Percent'].sort_values(ascending=False)

Dropping columns that have more than 75% missing values

In [None]:
train=train[train.columns[train.isnull().mean() <= 0.75]]

In [None]:
test=test[test.columns[test.isnull().mean() <= 0.75]]
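
Applying the threshold to each file separately can leave train and test with different columns if a column crosses 75% in only one of them. Below is a minimal sketch of one way to keep them aligned: decide on the training set and apply the same decision to the test set. The helper name `drop_sparse_columns` and the toy frames are made up for illustration and are not part of this notebook.

```python
import numpy as np
import pandas as pd

def drop_sparse_columns(train_df, test_df, threshold=0.75):
    """Drop columns whose missing share in train_df exceeds threshold, in both frames."""
    missing_share = train_df.isnull().mean()
    sparse = missing_share[missing_share > threshold].index
    kept_train = train_df.drop(columns=sparse)
    # errors='ignore' because the test frame may lack some train columns (e.g. the target)
    kept_test = test_df.drop(columns=sparse, errors='ignore')
    return kept_train, kept_test, list(sparse)

# Toy data: column 'b' is 80% missing in train but only 40% missing in test
toy_train = pd.DataFrame({'a': [1, 2, 3, 4, 5],
                          'b': [np.nan, np.nan, np.nan, np.nan, 1.0]})
toy_test = pd.DataFrame({'a': [6, 7, 8, 9, 10],
                         'b': [np.nan, np.nan, 1.0, 2.0, 3.0]})

toy_train_kept, toy_test_kept, dropped = drop_sparse_columns(toy_train, toy_test)
print(dropped)
```

With the per-file approach, `b` would have been kept in the toy test frame and dropped from the toy train frame.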

In [None]:
train.isnull().sum().sort_values(ascending=False)

In [None]:
test.isnull().sum().sort_values(ascending=False)

**Taking null value column names**

In [None]:
list_train=train.columns[train.isna().any()].tolist()

In [None]:
list_test=test.columns[test.isna().any()].tolist()

**Printing the names and data types of columns that have null values**

In [None]:
for i in range(0,len(list_train)):
    print('column name: ',list_train[i],' Dtype:',train[list_train[i]].dtypes)

In [None]:
for i in range(0,len(list_test)):
    print('column name: ',list_test[i],' Dtype:',test[list_test[i]].dtypes)

**Filling Null Values With Mean**

In [None]:
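for column in list_train:
    # assign the result back; fillna(inplace=True) on a selected column may not update the DataFrame in newer pandas
    train[column] = train[column].fillna(train[column].mean())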

In [None]:
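for column in list_test:
    # assign the result back; fillna(inplace=True) on a selected column may not update the DataFrame in newer pandas
    test[column] = test[column].fillna(test[column].mean())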
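
The cells above fill each file with its own column means. A common alternative is to learn the means on the training data only and reuse them for the test data, so both files are imputed with the same values. Below is a minimal sketch on made-up toy frames (`toy_train`, `toy_test` are illustrative names, not this notebook's data).

```python
import numpy as np
import pandas as pd

toy_train = pd.DataFrame({'x': [1.0, np.nan, 3.0], 'y': [10.0, 20.0, np.nan]})
toy_test = pd.DataFrame({'x': [np.nan, 5.0], 'y': [np.nan, np.nan]})

# Means are learned from the training data only
train_means = toy_train.mean(numeric_only=True)

# fillna with a Series fills each column with the value stored under its name
toy_train_filled = toy_train.fillna(train_means)
toy_test_filled = toy_test.fillna(train_means)

print(toy_test_filled)
```

Note that column `y` in the toy test frame is entirely missing, so its own mean would be undefined; the train mean still gives a usable fill value.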

In [None]:
train.info()
test.info()

## Label Encoding

In [None]:
obj_train=list(train.select_dtypes(include=['object']).columns)
obj_test=list(test.select_dtypes(include=['object']).columns)

In [None]:
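from sklearn.preprocessing import LabelEncoder
# LabelEncoder works on one column at a time, so fit one encoder per object column
# on train and reuse it on test (assumes test has the same object columns as train)
for col in obj_train:
    le=LabelEncoder()
    train[col]=le.fit_transform(train[col])
    test[col]=le.transform(test[col])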
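
`LabelEncoder.transform` raises a `ValueError` when the test data contains a label that was not seen during fitting. If that is a concern, one possible workaround is to encode with pandas categoricals, where unseen values get the code -1. The sketch below uses toy values and a hypothetical helper `encode`; the column name is borrowed from the dataset only for illustration.

```python
import pandas as pd

toy_train = pd.DataFrame({'Product_Info_2': ['A1', 'B2', 'A1', 'C3']})
toy_test = pd.DataFrame({'Product_Info_2': ['B2', 'D4']})  # 'D4' never appears in train

# Categories are learned from the training column, in sorted order
categories = sorted(toy_train['Product_Info_2'].unique())

def encode(series, categories):
    # Values outside `categories` become code -1 instead of raising an error
    return pd.Categorical(series, categories=categories).codes

toy_train['Product_Info_2'] = encode(toy_train['Product_Info_2'], categories)
toy_test['Product_Info_2'] = encode(toy_test['Product_Info_2'], categories)

print(toy_train['Product_Info_2'].tolist(), toy_test['Product_Info_2'].tolist())
```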

#  **Analysing features**

* **Weight**

In [None]:
f, axes = plt.subplots(1, 2, figsize=(15,7))
sns.boxplot(y='Wt', data=train, ax=axes[0])
# histplot with kde=True replaces the deprecated distplot
sns.histplot(train['Wt'], kde=True, stat='density', ax=axes[1])

* **Height**

In [None]:
f, axes = plt.subplots(1, 2, figsize=(15,7))
sns.boxplot(y='Ht', data=train, ax=axes[0])
sns.histplot(train['Ht'], kde=True, stat='density', ax=axes[1])

* **BMI**

In [None]:
f, axes = plt.subplots(1, 2, figsize=(15,7))
sns.boxplot(y='BMI', data=train, ax=axes[0])
sns.histplot(train['BMI'], kde=True, stat='density', ax=axes[1])

* **Age**

In [None]:
f,axes=plt.subplots(1,2,figsize=(15,7))
sns.boxplot(y='Ins_Age',data=train,ax=axes[0])
sns.histplot(train['Ins_Age'],kde=True,stat='density',ax=axes[1])

**The image above compares a box plot of a nearly normal distribution with the probability density function (pdf) of a normal distribution. It is included because most readers are more used to reading a distribution curve than a box plot, so seeing the two side by side may help in interpreting the box plot.**
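
To make the link between the two views concrete, here is a small sketch that computes the box plot quantities for a synthetic standard normal sample (not the insurance data). For a normal distribution the interquartile range is about 1.35 standard deviations, and the usual 1.5 × IQR fences sit at roughly ±2.7 standard deviations, enclosing about 99.3% of the values.

```python
import numpy as np

rng = np.random.default_rng(0)
sample = rng.normal(loc=0.0, scale=1.0, size=100_000)  # synthetic, not the insurance data

# The box spans Q1 to Q3, with the median drawn inside it
q1, median, q3 = np.percentile(sample, [25, 50, 75])
iqr = q3 - q1

# Points beyond these fences are the ones a box plot draws as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
inside = np.mean((sample >= lower_fence) & (sample <= upper_fence))

print(f"IQR = {iqr:.3f}")
print(f"fences = ({lower_fence:.3f}, {upper_fence:.3f})")
print(f"share inside fences = {inside:.4f}")
```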

#  **Target Variable Analysis**

In [None]:
f,ax=plt.subplots(1,2,figsize=(18,8))
train['Response'].value_counts().plot.pie(autopct='%1.1f%%',ax=ax[0],shadow=True)
ax[0].set_title('Response')
ax[0].set_ylabel('')
sns.countplot(x='Response',data=train,ax=ax[1])
ax[1].set_title('Response')
plt.show()

**Class 8 is the most frequent response.**

**Converting target variable**

**We will turn this multiclass classification problem into a binary one by altering the target variable. The new problem statement is: based on the attributes of a customer, will the life insurance policy be approved, yes (1) or no (0)?**

**Responses 7 and below form one class (0) and Response 8 forms the other (1).**

In [None]:
#create a function to build a new target variable based on conditions

def new_target(row):
    if (row['Response']<=7) & (row['Response']>=0):
        val=0
    elif (row['Response']==8):
        val=1
    else:
        val=-1
    return val

In [None]:
#create a copy of original dataset
new_data=train.copy()

In [None]:
#create a new column
new_data['Final_Response']=new_data.apply(new_target,axis=1)
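
The row-wise `apply` works, but the same mapping can be written as a single vectorized comparison, which is usually much faster on large frames. This is a sketch on toy values. Unlike `new_target`, it does not flag out-of-range responses with -1 and simply maps everything that is not 8 to 0.

```python
import pandas as pd

toy = pd.DataFrame({'Response': [1, 8, 3, 8, 7]})

# 1 where the original class is 8, 0 for every other class
toy['Final_Response'] = (toy['Response'] == 8).astype(int)

print(toy['Final_Response'].tolist())
```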

In [None]:
new_data['Final_Response'].value_counts()

In [None]:
#distribution plot for target classes
sns.countplot(x=new_data.Final_Response).set_title('Distribution of rows by response categories')

In [None]:
#dropping already existing column
new_data.drop(['Response'],axis=1,inplace=True)
train=new_data
del new_data

In [None]:
train.rename(columns={'Final_Response':'Response'},inplace=True)

Categorizing BMI, age, height and weight into three bands each, using the 25th and 75th percentiles as cut-offs

In [None]:
# BMI Categorization
conditions = [
    (train['BMI'] <= train['BMI'].quantile(0.25)),
    (train['BMI'] > train['BMI'].quantile(0.25)) & (train['BMI'] <= train['BMI'].quantile(0.75)),
    (train['BMI'] > train['BMI'].quantile(0.75))]

choices = ['under_weight', 'average', 'overweight']

train['BMI_Wt'] = np.select(conditions, choices, default='unknown')

# Age Categorization
conditions = [
    (train['Ins_Age'] <= train['Ins_Age'].quantile(0.25)),
    (train['Ins_Age'] > train['Ins_Age'].quantile(0.25)) & (train['Ins_Age'] <= train['Ins_Age'].quantile(0.75)),
    (train['Ins_Age'] > train['Ins_Age'].quantile(0.75))]

choices = ['young', 'average', 'old']
train['Old_Young'] = np.select(conditions, choices, default='unknown')

# Height Categorization
conditions = [
    (train['Ht'] <= train['Ht'].quantile(0.25)),
    (train['Ht'] > train['Ht'].quantile(0.25)) & (train['Ht'] <= train['Ht'].quantile(0.75)),
    (train['Ht'] > train['Ht'].quantile(0.75))]

choices = ['short', 'average', 'tall']

train['Short_Tall'] = np.select(conditions, choices, default='unknown')

# Weight Categorization
conditions = [
    (train['Wt'] <= train['Wt'].quantile(0.25)),
    (train['Wt'] > train['Wt'].quantile(0.25)) & (train['Wt'] <= train['Wt'].quantile(0.75)),
    (train['Wt'] > train['Wt'].quantile(0.75))]

choices = ['thin', 'average', 'fat']

train['Thin_Fat'] = np.select(conditions, choices, default='unknown')
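
The four blocks above repeat the same three conditions. As a sketch, the banding could be wrapped in one helper built on `pd.cut`. The name `quartile_bands` and the toy values are hypothetical. `pd.cut` needs distinct bin edges, so this assumes the 25th and 75th percentiles differ.

```python
import numpy as np
import pandas as pd

def quartile_bands(series, labels):
    """Label values as low / middle / high using the 25th and 75th percentiles."""
    q25, q75 = series.quantile([0.25, 0.75])
    # Right-closed bins: (-inf, q25], (q25, q75], (q75, inf)
    return pd.cut(series, bins=[-np.inf, q25, q75, np.inf], labels=labels)

# Toy values standing in for a normalized feature
toy_bmi = pd.Series([10, 20, 30, 40, 50, 60, 70, 80, 90])
bands = quartile_bands(toy_bmi, ['under_weight', 'average', 'overweight'])

print(bands.value_counts())
```

The bin boundaries match the `<=` and `>` comparisons used in the `np.select` conditions, so a value equal to the 25th percentile still lands in the lowest band.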

In [None]:
plt.figure(figsize=(10,7))
sns.countplot(x = 'BMI_Wt', hue = 'Response', data = train)

Overweight applicants are mostly not offered standard terms.



In [None]:
plt.figure(figsize=(10,7))
sns.countplot(x = 'Old_Young', hue = 'Response', data = train)

Compared with young and average-age applicants, old applicants were more often not offered standard terms.

In [None]:
plt.figure(figsize=(10,7))
sns.countplot(x = 'Short_Tall', hue = 'Response', data = train)

Height does not show any clear pattern.



In [None]:
plt.figure(figsize=(10,7))
sns.countplot(x = 'Thin_Fat', hue = 'Response', data = train)


Applicants in the `fat` weight band are mostly not offered standard terms.
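
Count plots are hard to compare when the groups differ in size. One way to put numbers on these observations is a row-normalized cross-tabulation, which gives the share of each response within each band. This is a sketch on made-up toy rows, not the actual results for this dataset.

```python
import pandas as pd

toy = pd.DataFrame({
    'BMI_Wt':   ['overweight', 'overweight', 'overweight', 'overweight',
                 'average', 'average', 'average', 'average'],
    'Response': [0, 0, 0, 1, 0, 1, 1, 0],
})

# Each row sums to 1: the share of Response 0 and 1 within that band
shares = pd.crosstab(toy['BMI_Wt'], toy['Response'], normalize='index')

print(shares)
```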



**Let's get deeper into it**

In [None]:
def new_target(row):
    if (row['BMI_Wt']=='overweight') or (row['Old_Young']=='old')  or (row['Thin_Fat']=='fat'):
        val='extremely_risky'
    else:
        val='not_extremely_risky'
    return val

train['extreme_risk'] = train.apply(new_target,axis=1)

In [None]:
plt.figure(figsize=(10,7))
sns.countplot(x = 'extreme_risk', hue = 'Response', data = train)

In the "extreme risk" category, many policies are either rejected or issued on substandard terms.


In [None]:
def new_target(row):
    if (row['BMI_Wt']=='average') or (row['Old_Young']=='average')  or (row['Thin_Fat']=='average'):
        val='average'
    else:
        val='non_average'
    return val

train['average_risk'] = train.apply(new_target,axis=1)

In [None]:
plt.figure(figsize=(10,7))
sns.countplot(x = 'average_risk', hue = 'Response', data = train)

This split does not show any clear pattern.

In [None]:
def new_target(row):
    if (row['BMI_Wt']=='under_weight') or (row['Old_Young']=='young')  or (row['Thin_Fat']=='thin'):
        val='low_end'
    else:
        val='non_low_end'
    return val

train['low_end_risk'] = train.apply(new_target,axis=1)

In [None]:
plt.figure(figsize=(10,7))
sns.countplot(x = 'low_end_risk', hue = 'Response', data = train)

In the non-low-end risk category, many policies are either rejected or issued on substandard terms.
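
The three flags above are built with `or`, so they are not mutually exclusive: an applicant who is overweight but young counts as both "extremely risky" and "low end". The sketch below, on toy rows, writes two of the flags as vectorized boolean expressions and counts that overlap.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'BMI_Wt':    ['overweight', 'average', 'under_weight'],
    'Old_Young': ['young', 'average', 'old'],
    'Thin_Fat':  ['thin', 'average', 'thin'],
})

# True when at least one of the three bands is at its high end
is_extreme = (
    (toy['BMI_Wt'] == 'overweight')
    | (toy['Old_Young'] == 'old')
    | (toy['Thin_Fat'] == 'fat')
)
# True when at least one of the three bands is at its low end
is_low_end = (
    (toy['BMI_Wt'] == 'under_weight')
    | (toy['Old_Young'] == 'young')
    | (toy['Thin_Fat'] == 'thin')
)

toy['extreme_risk'] = np.where(is_extreme, 'extremely_risky', 'not_extremely_risky')
toy['low_end_risk'] = np.where(is_low_end, 'low_end', 'non_low_end')

# Rows that carry both flags at once
both = int((is_extreme & is_low_end).sum())
print(toy[['extreme_risk', 'low_end_risk']], both)
```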

In [None]:
plt.hist(train['Employment_Info_1']);
plt.title('Distribution of Employment_Info_1 variable');

**Exploring product features**

In [None]:
train['Product_Info_1'].value_counts()

In [None]:
#product1 vs response
sns.kdeplot(train[train['Response']==0]['Product_Info_1'],label='Rejected')
sns.kdeplot(train[train['Response']==1]['Product_Info_1'],label='Accepted')
plt.legend()

In [None]:
#product2 vs response
sns.kdeplot(train[train['Response']==0]['Product_Info_2'],label='Rejected')
sns.kdeplot(train[train['Response']==1]['Product_Info_2'],label='Accepted')
plt.legend()

In [None]:
#product3 vs response
sns.kdeplot(train[train['Response']==0]['Product_Info_3'],label='Rejected')
sns.kdeplot(train[train['Response']==1]['Product_Info_3'],label='Accepted')
plt.legend()

In [None]:
#product5 vs response
sns.kdeplot(train[train['Response']==0]['Product_Info_5'],label='Rejected')
sns.kdeplot(train[train['Response']==1]['Product_Info_5'],label='Accepted')
plt.legend()

In [None]:
#product6 vs response
sns.kdeplot(train[train['Response']==0]['Product_Info_6'],label='Rejected')
sns.kdeplot(train[train['Response']==1]['Product_Info_6'],label='Accepted')
plt.legend()

In [None]:
#product7 vs response
sns.kdeplot(train[train['Response']==0]['Product_Info_7'],label='Rejected')
sns.kdeplot(train[train['Response']==1]['Product_Info_7'],label='Accepted')
plt.legend()

In [None]:
corr = train.corr(numeric_only=True)  # skip the string-valued band columns
ax = sns.heatmap(
    corr, 
    vmin=-1, vmax=1, center=0,
    cmap=sns.diverging_palette(100, 370, n=200),
    square=True
)
ax.set_xticklabels(
    ax.get_xticklabels(),
    rotation=45,
    horizontalalignment='right'
);

With over a hundred features, the correlation heatmap is too dense to read.
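
One possible way around this is to rank features by their absolute correlation with the target and plot only the strongest few. The sketch below shows the ranking step on synthetic data. The feature names `f1` to `f3` and the cut-off of two features are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
toy = pd.DataFrame({'f1': rng.normal(size=200), 'f2': rng.normal(size=200)})
toy['f3'] = toy['f1'] * 2 + rng.normal(scale=0.1, size=200)  # nearly a copy of f1
toy['Response'] = (toy['f1'] > 0).astype(int)

corr = toy.corr()

# Absolute correlation of each feature with the target, strongest first
target_corr = corr['Response'].drop('Response').abs().sort_values(ascending=False)
top_features = target_corr.head(2).index.tolist()

# Small matrix with only the top features plus the target
small_corr = corr.loc[top_features + ['Response'], top_features + ['Response']]
print(small_corr.round(2))
```

The resulting `small_corr` can be passed to `sns.heatmap` in place of the full matrix, ideally with `annot=True` so the values are readable.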