# Prudential Life Insurance Assessment 
### Can you make buying life insurance easier?

Prudential, one of the largest issuers of life insurance in the USA, is hiring passionate data scientists to join a newly-formed Data Science group solving complex challenges and identifying opportunities. The results have been impressive so far but we want more. 

#### The Challenge

In a one-click shopping world with on-demand everything, the life insurance application process is antiquated. Customers provide extensive information to identify risk classification and eligibility, including scheduling medical exams, a process that takes an average of 30 days.

The result? People are turned off. That’s why only 40% of U.S. households own individual life insurance. Prudential wants to make it quicker and less labor intensive for new and existing customers to get a quote while maintaining privacy boundaries.

By developing a predictive model that accurately classifies risk using a more automated approach, you can greatly impact public perception of the industry.

The results will help Prudential better understand the predictive power of the data points in the existing assessment, enabling us to significantly streamline the process.

### Import the necessary libraries

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

### Load the train & test data

In [2]:
train_data = pd.read_csv(r'C:\Users\sadiq\Documents\Gatom_test\TATA AIA\train.csv')
test_data = pd.read_csv(r'C:\Users\sadiq\Documents\Gatom_test\TATA AIA\test.csv')

In [3]:
pd.set_option('display.max_columns', None)

### Understand the description of data 

In [4]:
train_data.head()

Unnamed: 0,Id,Product_Info_1,Product_Info_2,Product_Info_3,Product_Info_4,Product_Info_5,Product_Info_6,Product_Info_7,Ins_Age,Ht,Wt,BMI,Employment_Info_1,Employment_Info_2,Employment_Info_3,Employment_Info_4,Employment_Info_5,Employment_Info_6,InsuredInfo_1,InsuredInfo_2,InsuredInfo_3,InsuredInfo_4,InsuredInfo_5,InsuredInfo_6,InsuredInfo_7,Insurance_History_1,Insurance_History_2,Insurance_History_3,Insurance_History_4,Insurance_History_5,Insurance_History_7,Insurance_History_8,Insurance_History_9,Family_Hist_1,Family_Hist_2,Family_Hist_3,Family_Hist_4,Family_Hist_5,Medical_History_1,Medical_History_2,Medical_History_3,Medical_History_4,Medical_History_5,Medical_History_6,Medical_History_7,Medical_History_8,Medical_History_9,Medical_History_10,Medical_History_11,Medical_History_12,Medical_History_13,Medical_History_14,Medical_History_15,Medical_History_16,Medical_History_17,Medical_History_18,Medical_History_19,Medical_History_20,Medical_History_21,Medical_History_22,Medical_History_23,Medical_History_24,Medical_History_25,Medical_History_26,Medical_History_27,Medical_History_28,Medical_History_29,Medical_History_30,Medical_History_31,Medical_History_32,Medical_History_33,Medical_History_34,Medical_History_35,Medical_History_36,Medical_History_37,Medical_History_38,Medical_History_39,Medical_History_40,Medical_History_41,Medical_Keyword_1,Medical_Keyword_2,Medical_Keyword_3,Medical_Keyword_4,Medical_Keyword_5,Medical_Keyword_6,Medical_Keyword_7,Medical_Keyword_8,Medical_Keyword_9,Medical_Keyword_10,Medical_Keyword_11,Medical_Keyword_12,Medical_Keyword_13,Medical_Keyword_14,Medical_Keyword_15,Medical_Keyword_16,Medical_Keyword_17,Medical_Keyword_18,Medical_Keyword_19,Medical_Keyword_20,Medical_Keyword_21,Medical_Keyword_22,Medical_Keyword_23,Medical_Keyword_24,Medical_Keyword_25,Medical_Keyword_26,Medical_Keyword_27,Medical_Keyword_28,Medical_Keyword_29,Medical_Keyword_30,Medical_Keyword_31,Medical_Keyword_32,Medical_Keyword_33,Medical_Keyword_34,Medical_Keyword_35,Medical_Keyword_36,Medical_Keyword_37,Medical_Keyword_38,Medical_Keyword_39,Medical_Keyword_40,Medical_Keyword_41,Medical_Keyword_42,Medical_Keyword_43,Medical_Keyword_44,Medical_Keyword_45,Medical_Keyword_46,Medical_Keyword_47,Medical_Keyword_48,Response
0,2,1,D3,10,0.076923,2,1,1,0.641791,0.581818,0.148536,0.323008,0.028,12,1,0.0,3,,1,2,6,3,1,2,1,1,1,3,1,0.000667,1,1,2,2,,0.598039,,0.526786,4.0,112,2,1,1,3,2,2,1,,3,2,3,3,240.0,3,3,1,1,2,1,2,3,,1,3,3,1,3,2,3,,1,3,1,2,2,1,3,3,3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,8
1,5,1,A1,26,0.076923,2,3,1,0.059701,0.6,0.131799,0.272288,0.0,1,3,0.0,2,0.0018,1,2,6,3,1,2,1,2,1,3,1,0.000133,1,3,2,2,0.188406,,0.084507,,5.0,412,2,1,1,3,2,2,1,,3,2,3,3,0.0,1,3,1,1,2,1,2,3,,1,3,3,1,3,2,3,,3,1,1,2,2,1,3,3,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,4
2,6,1,E1,26,0.076923,2,3,1,0.029851,0.745455,0.288703,0.42878,0.03,9,1,0.0,2,0.03,1,2,8,3,1,1,1,2,1,1,3,,3,2,3,3,0.304348,,0.225352,,10.0,3,2,2,1,3,2,2,2,,3,2,3,3,,1,3,1,1,2,1,2,3,,2,2,3,1,3,2,3,,3,3,1,3,2,1,3,3,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,8
3,7,1,D4,10,0.487179,2,3,1,0.164179,0.672727,0.205021,0.352438,0.042,9,1,0.0,3,0.2,2,2,8,3,1,2,1,2,1,1,3,,3,2,3,3,0.42029,,0.352113,,0.0,350,2,2,1,3,2,2,2,,3,2,3,3,,1,3,1,1,2,2,2,3,,1,3,3,1,3,2,3,,3,3,1,2,2,1,3,3,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,8
4,8,1,D2,26,0.230769,2,3,1,0.41791,0.654545,0.23431,0.424046,0.027,9,1,0.0,2,0.05,1,2,6,3,1,2,1,2,1,1,3,,3,2,3,2,0.463768,,0.408451,,,162,2,2,1,3,2,2,2,,3,2,3,3,,1,3,1,1,2,1,2,3,,2,2,3,1,3,2,3,,3,3,1,3,2,1,3,3,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,8


In [5]:
train_data.describe()

Unnamed: 0,Id,Product_Info_1,Product_Info_3,Product_Info_4,Product_Info_5,Product_Info_6,Product_Info_7,Ins_Age,Ht,Wt,BMI,Employment_Info_1,Employment_Info_2,Employment_Info_3,Employment_Info_4,Employment_Info_5,Employment_Info_6,InsuredInfo_1,InsuredInfo_2,InsuredInfo_3,InsuredInfo_4,InsuredInfo_5,InsuredInfo_6,InsuredInfo_7,Insurance_History_1,Insurance_History_2,Insurance_History_3,Insurance_History_4,Insurance_History_5,Insurance_History_7,Insurance_History_8,Insurance_History_9,Family_Hist_1,Family_Hist_2,Family_Hist_3,Family_Hist_4,Family_Hist_5,Medical_History_1,Medical_History_2,Medical_History_3,Medical_History_4,Medical_History_5,Medical_History_6,Medical_History_7,Medical_History_8,Medical_History_9,Medical_History_10,Medical_History_11,Medical_History_12,Medical_History_13,Medical_History_14,Medical_History_15,Medical_History_16,Medical_History_17,Medical_History_18,Medical_History_19,Medical_History_20,Medical_History_21,Medical_History_22,Medical_History_23,Medical_History_24,Medical_History_25,Medical_History_26,Medical_History_27,Medical_History_28,Medical_History_29,Medical_History_30,Medical_History_31,Medical_History_32,Medical_History_33,Medical_History_34,Medical_History_35,Medical_History_36,Medical_History_37,Medical_History_38,Medical_History_39,Medical_History_40,Medical_History_41,Medical_Keyword_1,Medical_Keyword_2,Medical_Keyword_3,Medical_Keyword_4,Medical_Keyword_5,Medical_Keyword_6,Medical_Keyword_7,Medical_Keyword_8,Medical_Keyword_9,Medical_Keyword_10,Medical_Keyword_11,Medical_Keyword_12,Medical_Keyword_13,Medical_Keyword_14,Medical_Keyword_15,Medical_Keyword_16,Medical_Keyword_17,Medical_Keyword_18,Medical_Keyword_19,Medical_Keyword_20,Medical_Keyword_21,Medical_Keyword_22,Medical_Keyword_23,Medical_Keyword_24,Medical_Keyword_25,Medical_Keyword_26,Medical_Keyword_27,Medical_Keyword_28,Medical_Keyword_29,Medical_Keyword_30,Medical_Keyword_31,Medical_Keyword_32,Medical_Keyword_33,Medical_Keyword_34,Medical_Keyword_35,Medical_Keyword_36,Medical_Keyword_37,Medical_Keyword_38,Medical_Keyword_39,Medical_Keyword_40,Medical_Keyword_41,Medical_Keyword_42,Medical_Keyword_43,Medical_Keyword_44,Medical_Keyword_45,Medical_Keyword_46,Medical_Keyword_47,Medical_Keyword_48,Response
count,59381.0,59381.0,59381.0,59381.0,59381.0,59381.0,59381.0,59381.0,59381.0,59381.0,59381.0,59362.0,59381.0,59381.0,52602.0,59381.0,48527.0,59381.0,59381.0,59381.0,59381.0,59381.0,59381.0,59381.0,59381.0,59381.0,59381.0,59381.0,33985.0,59381.0,59381.0,59381.0,59381.0,30725.0,25140.0,40197.0,17570.0,50492.0,59381.0,59381.0,59381.0,59381.0,59381.0,59381.0,59381.0,59381.0,557.0,59381.0,59381.0,59381.0,59381.0,14785.0,59381.0,59381.0,59381.0,59381.0,59381.0,59381.0,59381.0,59381.0,3801.0,59381.0,59381.0,59381.0,59381.0,59381.0,59381.0,59381.0,1107.0,59381.0,59381.0,59381.0,59381.0,59381.0,59381.0,59381.0,59381.0,59381.0,59381.0,59381.0,59381.0,59381.0,59381.0,59381.0,59381.0,59381.0,59381.0,59381.0,59381.0,59381.0,59381.0,59381.0,59381.0,59381.0,59381.0,59381.0,59381.0,59381.0,59381.0,59381.0,59381.0,59381.0,59381.0,59381.0,59381.0,59381.0,59381.0,59381.0,59381.0,59381.0,59381.0,59381.0,59381.0,59381.0,59381.0,59381.0,59381.0,59381.0,59381.0,59381.0,59381.0,59381.0,59381.0,59381.0,59381.0,59381.0,59381.0
mean,39507.211515,1.026355,24.415655,0.328952,2.006955,2.673599,1.043583,0.405567,0.707283,0.292587,0.469462,0.077582,8.641821,1.300904,0.006283,2.142958,0.361469,1.209326,2.007427,5.83584,2.883666,1.02718,1.409188,1.038531,1.727606,1.055792,2.146983,1.958707,0.001733,1.901989,2.048484,2.41936,2.68623,0.47455,0.497737,0.44489,0.484635,7.962172,253.9871,2.102171,1.654873,1.007359,2.889897,2.012277,2.044088,1.769943,141.118492,2.993836,2.056601,2.768141,2.968542,123.760974,1.327529,2.978006,1.053536,1.034455,1.985079,1.108991,1.981644,2.528115,50.635622,1.194961,2.808979,2.980213,1.06721,2.542699,2.040771,2.985265,11.965673,2.804618,2.689076,1.002055,2.179468,1.938398,1.00485,2.83072,2.967599,1.641064,0.042,0.008942,0.049275,0.01455,0.008622,0.012597,0.01391,0.010407,0.006652,0.036459,0.058015,0.010003,0.005962,0.007848,0.190465,0.012715,0.009161,0.007494,0.009296,0.008134,0.014601,0.037167,0.097775,0.018895,0.089456,0.013439,0.011856,0.014937,0.011755,0.025042,0.010896,0.021168,0.022836,0.020646,0.006938,0.010407,0.066587,0.006837,0.013658,0.056954,0.010054,0.045536,0.01071,0.007528,0.013691,0.008488,0.019905,0.054496,5.636837
std,22815.883089,0.160191,5.072885,0.282562,0.083107,0.739103,0.291949,0.19719,0.074239,0.089037,0.122213,0.082347,4.227082,0.715034,0.032816,0.350033,0.349551,0.417939,0.085858,2.674536,0.320627,0.231566,0.491688,0.274915,0.445195,0.329328,0.989139,0.945739,0.007338,0.971223,0.755149,0.509577,0.483159,0.154959,0.140187,0.163012,0.1292,13.027697,178.621154,0.303098,0.475414,0.085864,0.456128,0.17236,0.291353,0.421032,107.759559,0.09534,0.231153,0.640259,0.197715,98.516206,0.740118,0.146778,0.225848,0.182859,0.121375,0.311847,0.134236,0.84917,78.149069,0.406082,0.393237,0.197652,0.250589,0.839904,0.1981,0.170989,38.718774,0.593798,0.724661,0.063806,0.412633,0.240574,0.069474,0.556665,0.252427,0.933361,0.200591,0.094141,0.216443,0.119744,0.092456,0.111526,0.117119,0.101485,0.081289,0.187432,0.233774,0.099515,0.076981,0.088239,0.392671,0.11204,0.095275,0.086244,0.095967,0.089821,0.119949,0.189172,0.297013,0.136155,0.285404,0.115145,0.108237,0.121304,0.10778,0.156253,0.103813,0.143947,0.14938,0.142198,0.083007,0.101485,0.249307,0.082405,0.116066,0.231757,0.099764,0.208479,0.102937,0.086436,0.116207,0.091737,0.139676,0.226995,2.456833
min,2.0,1.0,1.0,0.0,2.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,2.0,0.0,1.0,2.0,1.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
25%,19780.0,1.0,26.0,0.076923,2.0,3.0,1.0,0.238806,0.654545,0.225941,0.385517,0.035,9.0,1.0,0.0,2.0,0.06,1.0,2.0,3.0,3.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0004,1.0,1.0,2.0,2.0,0.362319,0.401961,0.323944,0.401786,2.0,112.0,2.0,1.0,1.0,3.0,2.0,2.0,2.0,8.0,3.0,2.0,3.0,3.0,17.0,1.0,3.0,1.0,1.0,2.0,1.0,2.0,3.0,1.0,1.0,3.0,3.0,1.0,3.0,2.0,3.0,0.0,3.0,3.0,1.0,2.0,2.0,1.0,3.0,3.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0
50%,39487.0,1.0,26.0,0.230769,2.0,3.0,1.0,0.402985,0.709091,0.288703,0.451349,0.06,9.0,1.0,0.0,2.0,0.25,1.0,2.0,6.0,3.0,1.0,1.0,1.0,2.0,1.0,3.0,2.0,0.000973,1.0,2.0,2.0,3.0,0.463768,0.519608,0.422535,0.508929,4.0,162.0,2.0,2.0,1.0,3.0,2.0,2.0,2.0,229.0,3.0,2.0,3.0,3.0,117.0,1.0,3.0,1.0,1.0,2.0,1.0,2.0,3.0,8.0,1.0,3.0,3.0,1.0,3.0,2.0,3.0,0.0,3.0,3.0,1.0,2.0,2.0,1.0,3.0,3.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,6.0
75%,59211.0,1.0,26.0,0.487179,2.0,3.0,1.0,0.567164,0.763636,0.345188,0.532858,0.1,9.0,1.0,0.0,2.0,0.55,1.0,2.0,8.0,3.0,1.0,2.0,1.0,2.0,1.0,3.0,3.0,0.002,3.0,3.0,3.0,3.0,0.57971,0.598039,0.56338,0.580357,9.0,418.0,2.0,2.0,1.0,3.0,2.0,2.0,2.0,240.0,3.0,2.0,3.0,3.0,240.0,1.0,3.0,1.0,1.0,2.0,1.0,2.0,3.0,64.0,1.0,3.0,3.0,1.0,3.0,2.0,3.0,2.0,3.0,3.0,1.0,2.0,2.0,1.0,3.0,3.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,8.0
max,79146.0,2.0,38.0,1.0,3.0,3.0,3.0,1.0,1.0,1.0,1.0,1.0,38.0,3.0,1.0,3.0,1.0,3.0,3.0,11.0,3.0,3.0,2.0,3.0,2.0,3.0,3.0,3.0,1.0,3.0,3.0,3.0,3.0,1.0,1.0,0.943662,1.0,240.0,648.0,3.0,2.0,3.0,3.0,3.0,3.0,3.0,240.0,3.0,3.0,3.0,3.0,240.0,3.0,3.0,3.0,3.0,3.0,3.0,2.0,3.0,240.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,240.0,3.0,3.0,3.0,3.0,3.0,2.0,3.0,3.0,3.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,8.0


### Finding the missing values if any

In [6]:
total = train_data.isnull().sum().sort_values(ascending=False)
percent = (train_data.isnull().sum()/train_data.isnull().count()).sort_values(ascending=False)
missing_data = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
missing_data.head(20)

Unnamed: 0,Total,Percent
Medical_History_10,58824,0.99062
Medical_History_32,58274,0.981358
Medical_History_24,55580,0.93599
Medical_History_15,44596,0.751015
Family_Hist_5,41811,0.704114
Family_Hist_3,34241,0.576632
Family_Hist_2,28656,0.482579
Insurance_History_5,25396,0.427679
Family_Hist_4,19184,0.323066
Employment_Info_6,10854,0.182786


#### Features such as 'Medical_History_10', 'Medical_History_32' and 'Medical_History_24' have more than 90% missing values, so they give the model little usable signal and are dropped. 'Id' is dropped as well because it is only a row identifier.

In [7]:
train_df = train_data.drop(['Medical_History_10','Medical_History_32','Medical_History_24','Id'], axis=1)

test_df = test_data.drop(['Medical_History_10','Medical_History_32','Medical_History_24','Id'], axis=1)
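
The columns to drop can also be derived from the missing-value ratios rather than typed by hand. A minimal sketch on a toy frame: the 0.9 threshold mirrors the 90% cut-off used above, and the helper name `high_missing_columns` is hypothetical, not part of the original notebook.

```python
import numpy as np
import pandas as pd

def high_missing_columns(df, threshold=0.9):
    """Return names of columns whose share of missing values exceeds threshold."""
    missing_ratio = df.isnull().mean()
    return missing_ratio[missing_ratio > threshold].index.tolist()

# Toy data: column 'b' is entirely missing, 'c' is half missing.
toy = pd.DataFrame({
    'a': [1.0, 2.0, 3.0, 4.0],
    'b': [np.nan, np.nan, np.nan, np.nan],
    'c': [1.0, np.nan, 3.0, np.nan],
})
to_drop = high_missing_columns(toy, threshold=0.9)
toy_reduced = toy.drop(columns=to_drop)
print(to_drop)
```

Computing the list once on the training data and reusing it for the test data keeps both frames with the same columns.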

## Feature Engineering

#### Create a new feature "BMI_Age", the product of "BMI" and "Ins_Age", on the assumption that this interaction is useful for the model to learn

In [8]:
train_df['BMI_Age'] = train_df['BMI'] * train_df['Ins_Age']

test_df['BMI_Age'] = test_df['BMI'] * test_df['Ins_Age']

#### Product_Info_2 is a categorical variable made up of a letter and a digit (for example D3). I split it into two additional columns, one holding the letter and one holding the digit, then factorize the original column and both new columns into integer codes.

In [9]:
train_df['Product_Info_2_char'] = train_df.Product_Info_2.str[0]

train_df['Product_Info_2_num'] = train_df.Product_Info_2.str[1]

train_df['Product_Info_2'] = pd.factorize(train_df['Product_Info_2'])[0]

train_df['Product_Info_2_char'] = pd.factorize(train_df['Product_Info_2_char'])[0]

train_df['Product_Info_2_num'] = pd.factorize(train_df['Product_Info_2_num'])[0]
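
To make the split concrete, here is a small sketch on the five `Product_Info_2` values shown in `train_data.head()`. The variable names are illustrative only.

```python
import pandas as pd

# The first five Product_Info_2 values from the training data.
codes = pd.Series(['D3', 'A1', 'E1', 'D4', 'D2'])

letter = codes.str[0]   # 'D', 'A', 'E', 'D', 'D'
digit = codes.str[1]    # '3', '1', '1', '4', '2'

# factorize assigns integer codes in order of first appearance.
letter_codes, letter_uniques = pd.factorize(letter)
digit_codes, digit_uniques = pd.factorize(digit)

print(letter_codes.tolist())  # [0, 1, 2, 0, 0]
print(digit_codes.tolist())   # [0, 1, 1, 2, 3]
```

Because the codes depend on the order in which values first appear, factorizing two datasets independently can assign different integers to the same value.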

In [10]:
train_df.head()

Unnamed: 0,Product_Info_1,Product_Info_2,Product_Info_3,Product_Info_4,Product_Info_5,Product_Info_6,Product_Info_7,Ins_Age,Ht,Wt,BMI,Employment_Info_1,Employment_Info_2,Employment_Info_3,Employment_Info_4,Employment_Info_5,Employment_Info_6,InsuredInfo_1,InsuredInfo_2,InsuredInfo_3,InsuredInfo_4,InsuredInfo_5,InsuredInfo_6,InsuredInfo_7,Insurance_History_1,Insurance_History_2,Insurance_History_3,Insurance_History_4,Insurance_History_5,Insurance_History_7,Insurance_History_8,Insurance_History_9,Family_Hist_1,Family_Hist_2,Family_Hist_3,Family_Hist_4,Family_Hist_5,Medical_History_1,Medical_History_2,Medical_History_3,Medical_History_4,Medical_History_5,Medical_History_6,Medical_History_7,Medical_History_8,Medical_History_9,Medical_History_11,Medical_History_12,Medical_History_13,Medical_History_14,Medical_History_15,Medical_History_16,Medical_History_17,Medical_History_18,Medical_History_19,Medical_History_20,Medical_History_21,Medical_History_22,Medical_History_23,Medical_History_25,Medical_History_26,Medical_History_27,Medical_History_28,Medical_History_29,Medical_History_30,Medical_History_31,Medical_History_33,Medical_History_34,Medical_History_35,Medical_History_36,Medical_History_37,Medical_History_38,Medical_History_39,Medical_History_40,Medical_History_41,Medical_Keyword_1,Medical_Keyword_2,Medical_Keyword_3,Medical_Keyword_4,Medical_Keyword_5,Medical_Keyword_6,Medical_Keyword_7,Medical_Keyword_8,Medical_Keyword_9,Medical_Keyword_10,Medical_Keyword_11,Medical_Keyword_12,Medical_Keyword_13,Medical_Keyword_14,Medical_Keyword_15,Medical_Keyword_16,Medical_Keyword_17,Medical_Keyword_18,Medical_Keyword_19,Medical_Keyword_20,Medical_Keyword_21,Medical_Keyword_22,Medical_Keyword_23,Medical_Keyword_24,Medical_Keyword_25,Medical_Keyword_26,Medical_Keyword_27,Medical_Keyword_28,Medical_Keyword_29,Medical_Keyword_30,Medical_Keyword_31,Medical_Keyword_32,Medical_Keyword_33,Medical_Keyword_34,Medical_Keyword_35,Medical_Keyword_36,Medical_Keyword_37,Medical_Keyword_38,Medical_Keyword_39,Medical_Keyword_40,Medical_Keyword_41,Medical_Keyword_42,Medical_Keyword_43,Medical_Keyword_44,Medical_Keyword_45,Medical_Keyword_46,Medical_Keyword_47,Medical_Keyword_48,Response,BMI_Age,Product_Info_2_char,Product_Info_2_num
0,1,0,10,0.076923,2,1,1,0.641791,0.581818,0.148536,0.323008,0.028,12,1,0.0,3,,1,2,6,3,1,2,1,1,1,3,1,0.000667,1,1,2,2,,0.598039,,0.526786,4.0,112,2,1,1,3,2,2,1,3,2,3,3,240.0,3,3,1,1,2,1,2,3,1,3,3,1,3,2,3,1,3,1,2,2,1,3,3,3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,8,0.207304,0,0
1,1,1,26,0.076923,2,3,1,0.059701,0.6,0.131799,0.272288,0.0,1,3,0.0,2,0.0018,1,2,6,3,1,2,1,2,1,3,1,0.000133,1,3,2,2,0.188406,,0.084507,,5.0,412,2,1,1,3,2,2,1,3,2,3,3,0.0,1,3,1,1,2,1,2,3,1,3,3,1,3,2,3,3,1,1,2,2,1,3,3,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,4,0.016256,1,1
2,1,2,26,0.076923,2,3,1,0.029851,0.745455,0.288703,0.42878,0.03,9,1,0.0,2,0.03,1,2,8,3,1,1,1,2,1,1,3,,3,2,3,3,0.304348,,0.225352,,10.0,3,2,2,1,3,2,2,2,3,2,3,3,,1,3,1,1,2,1,2,3,2,2,3,1,3,2,3,3,3,1,3,2,1,3,3,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,8,0.012799,2,1
3,1,3,10,0.487179,2,3,1,0.164179,0.672727,0.205021,0.352438,0.042,9,1,0.0,3,0.2,2,2,8,3,1,2,1,2,1,1,3,,3,2,3,3,0.42029,,0.352113,,0.0,350,2,2,1,3,2,2,2,3,2,3,3,,1,3,1,1,2,2,2,3,1,3,3,1,3,2,3,3,3,1,2,2,1,3,3,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,8,0.057863,0,2
4,1,4,26,0.230769,2,3,1,0.41791,0.654545,0.23431,0.424046,0.027,9,1,0.0,2,0.05,1,2,6,3,1,2,1,2,1,1,3,,3,2,3,2,0.463768,,0.408451,,,162,2,2,1,3,2,2,2,3,2,3,3,,1,3,1,1,2,1,2,3,2,2,3,1,3,2,3,3,3,1,3,2,1,3,3,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,8,0.177213,0,3


In [11]:
train_df.drop(['BMI','Ins_Age'], axis=1, inplace=True)

test_df.drop(['BMI','Ins_Age'], axis=1, inplace=True)

In [12]:
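test_df['Product_Info_2_char'] = test_df.Product_Info_2.str[0]

test_df['Product_Info_2_num'] = test_df.Product_Info_2.str[1]

# Reuse the category order of the training data so each value gets the same integer code in train and test
p2 = train_data['Product_Info_2']

test_df['Product_Info_2'] = pd.factorize(p2)[1].get_indexer(test_df['Product_Info_2'])

test_df['Product_Info_2_char'] = pd.factorize(p2.str[0])[1].get_indexer(test_df['Product_Info_2_char'])

test_df['Product_Info_2_num'] = pd.factorize(p2.str[1])[1].get_indexer(test_df['Product_Info_2_num'])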

#### There are 48 Medical_Keyword columns, a set of dummy variables indicating the presence or absence of a medical keyword associated with the application. I add a column that sums these dummy variables, then drop the individual columns.

In [13]:
med_keyword_columns = train_df.columns[train_df.columns.str.startswith('Medical_Keyword_')]

med_keyword_columns_test = test_df.columns[test_df.columns.str.startswith('Medical_Keyword_')]

In [14]:
train_df['Med_Keywords_Count'] = train_df[med_keyword_columns].sum(axis=1)

test_df['Med_Keywords_Count'] = test_df[med_keyword_columns_test].sum(axis=1)
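
As a quick illustration of what the count column holds, the sketch below applies the same steps to a toy frame with three keyword columns. The toy values are made up for the example.

```python
import pandas as pd

toy = pd.DataFrame({
    'Medical_Keyword_1': [0, 1, 0],
    'Medical_Keyword_2': [1, 1, 0],
    'Medical_Keyword_3': [0, 1, 0],
    'Wt': [0.15, 0.13, 0.29],
})

# Select every dummy column by its shared prefix.
keyword_cols = toy.columns[toy.columns.str.startswith('Medical_Keyword_')]

# Row-wise sum = number of keywords present for each applicant.
toy['Med_Keywords_Count'] = toy[keyword_cols].sum(axis=1)
toy = toy.drop(columns=keyword_cols)
print(toy)
```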

In [15]:
train_df.drop(columns=med_keyword_columns, inplace=True)

test_df.drop(columns=med_keyword_columns_test, inplace=True)

In [16]:
train_df['Response'] = train_df['Response'].astype(int)

#### Replace missing values with the median of their respective column, which is less sensitive to skewed data than the mean

In [17]:
median_fill_columns = ['Employment_Info_1', 'Employment_Info_4', 'Employment_Info_6', 'Insurance_History_5', 'Family_Hist_2', 'Family_Hist_4', 'Medical_History_1']
# Assign the result back rather than calling fillna(inplace=True) on a column selection, which recent pandas versions warn about
for col in median_fill_columns:
    train_df[col] = train_df[col].fillna(train_df[col].median())

In [18]:
# Same columns as for the training data, filled with the test set's own medians
for col in ['Employment_Info_1', 'Employment_Info_4', 'Employment_Info_6', 'Insurance_History_5', 'Family_Hist_2', 'Family_Hist_4', 'Medical_History_1']:
    test_df[col] = test_df[col].fillna(test_df[col].median())
# List the columns that still contain missing values
test_df.columns[test_df.isnull().any()]

Index(['Family_Hist_3', 'Family_Hist_5', 'Medical_History_15'], dtype='object')

In [19]:
# Fill the three remaining columns the same way
for col in ['Family_Hist_3', 'Family_Hist_5', 'Medical_History_15']:
    train_df[col] = train_df[col].fillna(train_df[col].median())
    test_df[col] = test_df[col].fillna(test_df[col].median())
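
The cells above fill the test set with its own medians. An alternative, sketched here as an assumption rather than the notebook's method, learns the medians on the training data with scikit-learn's `SimpleImputer` and applies those same values to both sets. The toy frames are made up for the example.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

train_toy = pd.DataFrame({'Family_Hist_3': [0.2, np.nan, 0.6, 0.4]})
test_toy = pd.DataFrame({'Family_Hist_3': [np.nan, 0.9]})

imputer = SimpleImputer(strategy='median')

# Learn the median from the training data only ...
train_filled = imputer.fit_transform(train_toy)
# ... and reuse that same value for the test data.
test_filled = imputer.transform(test_toy)

print(imputer.statistics_)  # median of 0.2, 0.4, 0.6 is 0.4
```

With this approach a missing value is replaced by the same number whether the row sits in the training or the test data.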

## Modeling
#### The label classes are highly imbalanced, so I oversample the data with SMOTE

In [21]:
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split 
X = train_df.drop('Response', axis=1)
y = train_df['Response']
oversample = SMOTE()
X, y = oversample.fit_resample(X, y)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size= 0.3, random_state=42)

In [22]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score,classification_report,f1_score,confusion_matrix
from ml_metrics import quadratic_weighted_kappa
lr = LogisticRegression(random_state=11)
lr.fit(X_train,y_train)
y_pred = lr.predict(X_test)
Quad_kappa = quadratic_weighted_kappa(y_test,y_pred)
Quad_kappa

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html.
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


0.3092495287179172
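
Quadratic weighted kappa measures agreement between two ordinal ratings and penalises each disagreement by the squared distance between the classes. The sketch below computes it by hand from the confusion matrix on toy labels and compares the result with scikit-learn's `cohen_kappa_score(weights='quadratic')`, which can serve as a cross-check on `ml_metrics`. The helper name `qwk` and the toy labels are assumptions for illustration.

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score, confusion_matrix

def qwk(y_true, y_pred, labels):
    """Quadratic weighted kappa computed from the confusion matrix."""
    n = len(labels)
    observed = confusion_matrix(y_true, y_pred, labels=labels).astype(float)
    # Expected counts if the two ratings were independent.
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / observed.sum()
    # Penalty grows with the squared distance between classes.
    idx = np.arange(n)
    weights = (idx[:, None] - idx[None, :]) ** 2 / (n - 1) ** 2
    return 1.0 - (weights * observed).sum() / (weights * expected).sum()

y_true = [1, 2, 3, 4, 4, 2, 1, 3]
y_pred = [1, 2, 4, 4, 3, 2, 2, 3]
labels = [1, 2, 3, 4]

manual = qwk(y_true, y_pred, labels)
reference = cohen_kappa_score(y_true, y_pred, labels=labels, weights='quadratic')
print(manual, reference)
```

A value of 1 means perfect agreement and 0 means agreement no better than chance, so predicting a 7 for a true 8 costs far less than predicting a 1.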

In [23]:
from xgboost import XGBClassifier
xgb_clf = XGBClassifier(random_state = 42)

xgb_clf.fit(X_train, y_train)

y_pred_sample = xgb_clf.predict(X_test)
Quad_kappa = quadratic_weighted_kappa(y_test,y_pred_sample)
Quad_kappa


0.6239351474702363

In [24]:
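# Refit on the full resampled data before predicting the competition test set.
# X_test is a subset of X here, so the kappa below is measured on rows the model has already seen.
xgb_clf.fit(X, y)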
prediction = xgb_clf.predict(X_test)
Quad_kappa = quadratic_weighted_kappa(y_test,prediction)
Quad_kappa

0.6994203257776355

In [25]:
#sc.fit_transform(test_df)
#test_x = test_df.values
xgboost_pred = xgb_clf.predict(test_df)
test_result = pd.DataFrame(test_data['Id'])
submission = pd.DataFrame({"Id": test_data["Id"],"Response": xgboost_pred})
submission.to_csv((r'C:\Users\sadiq\Documents\Gatom_test\TATA AIA\Submission_file\XGbClf_submission.csv'),index=False)

In [36]:
from xgboost import XGBClassifier
from sklearn.model_selection import GridSearchCV
#Parameter list
parameters={'learning_rate':[0.1,0.15,0.2,0.25,0.3],'max_depth':range(1,3)}
# Code starts here
xgb_model = XGBClassifier(random_state=0)
xgb_model.fit(X_train,y_train)
y_pred = xgb_model.predict(X_test)

Quad_kappa_1 = quadratic_weighted_kappa(y_test,y_pred)
print(Quad_kappa_1)

clf_model = GridSearchCV(estimator=xgb_model,param_grid=parameters)
clf_model.fit(X_train,y_train)
y_pred_1 = clf_model.predict(X_test)
Quad_kappa_2 = quadratic_weighted_kappa(y_test,y_pred_1)
print(Quad_kappa_2)

0.6239351474702363
0.5268539492670157
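
`GridSearchCV` ranks candidates by the estimator's default score (accuracy for classifiers) unless `scoring` is given, so the search above does not optimise kappa directly. The sketch below passes a quadratic weighted kappa scorer instead. It uses a `RandomForestClassifier` on synthetic data as a stand-in so it runs without the competition files, and the grid values are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import cohen_kappa_score, make_scorer
from sklearn.model_selection import GridSearchCV

X_toy, y_toy = make_classification(
    n_samples=300, n_features=8, n_informative=4, n_classes=3, random_state=0
)

# Wrap quadratic weighted kappa so the search can maximise it.
qwk_scorer = make_scorer(cohen_kappa_score, weights='quadratic')

search = GridSearchCV(
    estimator=RandomForestClassifier(n_estimators=25, random_state=0),
    param_grid={'max_depth': [2, 4]},
    scoring=qwk_scorer,
    cv=3,
)
search.fit(X_toy, y_toy)
print(search.best_params_, search.best_score_)
```

The same `scoring=qwk_scorer` argument can be passed when the estimator is an `XGBClassifier`.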


In [27]:
from sklearn.ensemble import RandomForestClassifier
rf_sample = RandomForestClassifier(random_state=42)

#rf_sample.fit(X_sample1, y_sample1)
rf_sample.fit(X_train, y_train)
# #predicting on test data
y_pred = rf_sample.predict(X_test)
#y_pred_sample = rf_sample.predict(X_test_vc)
Quad_kappa_2 = quadratic_weighted_kappa(y_test,y_pred)
print(Quad_kappa_2)

0.6995697751308181


In [31]:
from mlxtend.classifier import StackingClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
classifier1 = RandomForestClassifier(random_state=42)
classifier2= LogisticRegression(random_state=42)
classifier3 = LinearSVC(random_state=42)
classifier4= MultinomialNB()
classifier_list=[classifier1,classifier2,classifier3, classifier4]

m_classifier=LogisticRegression(random_state=42)

# # Code starts here
sclf = StackingClassifier(classifiers = classifier_list, meta_classifier = m_classifier)

sclf.fit(X_train, y_train)

y_pred_sample = sclf.predict(X_test)

Quad_kappa = quadratic_weighted_kappa(y_test,y_pred_sample)
print(Quad_kappa)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html.
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


0.699582351588218


In [32]:
from sklearn.preprocessing import StandardScaler , MinMaxScaler
sc = MinMaxScaler()
X_train_min_sc = sc.fit_transform(X_train)
X_test_min_sc = sc.transform(X_test)

lr = LogisticRegression(random_state=11)
lr.fit(X_train_min_sc,y_train)
y_pred = lr.predict(X_test_min_sc)
from ml_metrics import quadratic_weighted_kappa
Quad_kappa = quadratic_weighted_kappa(y_test,y_pred)
Quad_kappa

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html.
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


0.495967441066431
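
The convergence warning persists after scaling, which suggests the default iteration limit is also a factor. One option, sketched on synthetic data with an assumed `max_iter=1000`, wraps the scaler and the model in a pipeline so the scaler is fit on the training rows only and the solver gets more iterations.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

X_toy, y_toy = make_classification(
    n_samples=400, n_features=8, n_informative=4, n_classes=3, random_state=0
)
X_tr, X_val, y_tr, y_val = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=42
)

# Scaling and fitting happen together, so predictions on validation
# data reuse the min/max learned from the training rows.
model = make_pipeline(
    MinMaxScaler(),
    LogisticRegression(max_iter=1000, random_state=11),
)
model.fit(X_tr, y_tr)
val_pred = model.predict(X_val)
print(model.score(X_val, y_val))
```

Whether 1000 iterations is enough for the full dataset is something to verify by checking that the warning disappears.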

In [33]:
xgb_clf.fit(X_train_min_sc,y_train)
prediction = xgb_clf.predict(X_test_min_sc)
Quad_kappa = quadratic_weighted_kappa(y_test,prediction)
Quad_kappa

0.6249429343112662

In [34]:
from mlxtend.classifier import StackingClassifier
from sklearn.naive_bayes import MultinomialNB
classifier1 = RandomForestClassifier(random_state=42)
classifier2= LogisticRegression(random_state=42)
classifier3 = LinearSVC(random_state=42)
classifier4= MultinomialNB()
classifier_list=[classifier1,classifier2,classifier3, classifier4]

m_classifier=LogisticRegression(random_state=42)

# # Code starts here
sclf = StackingClassifier(classifiers = classifier_list, meta_classifier = m_classifier)

sclf.fit(X_train_min_sc,y_train)

y_pred_sample = sclf.predict(X_test_min_sc)

Quad_kappa = quadratic_weighted_kappa(y_test,y_pred_sample)
print(Quad_kappa)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html.
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


0.6989319577690692
