## Predictive Modeling on an Insurance Dataset

Here I have a dataset which contains the details of different customers of an insurance company. The target variable is binary: 1 means the customer pays the premium for the next month on time, and 0 means he/she fails to do so.

My task is to build a model which, given the proper parameters, will predict whether a customer of this insurance company will pay next month's premium on time or not.

### Importing the required libraries

In [11]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score
from sklearn.metrics import recall_score
from sklearn.metrics import r2_score

### Reading the data


I have a CSV file here which contains the required dataset for building and testing the model. I will be providing the CSV file later. Using the pandas **read_csv()** function, I have created a pandas dataframe and named it **train**.

In [12]:
train = pd.read_csv(r'C:\Users\hp world\Downloads\train_qnU1GcL.csv')
train.head()

| | id | perc_premium_paid_by_cash_credit | age_in_days | Income | Count_3-6_months_late | Count_6-12_months_late | Count_more_than_12_months_late | application_underwriting_score | no_of_premiums_paid | sourcing_channel | residence_area_type | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 110936 | 0.429 | 12058 | 355060 | 0.0 | 0.0 | 0.0 | 99.02 | 13 | C | Urban | 1 |
| 1 | 41492 | 0.01 | 21546 | 315150 | 0.0 | 0.0 | 0.0 | 99.89 | 21 | A | Urban | 1 |
| 2 | 31300 | 0.917 | 17531 | 84140 | 2.0 | 3.0 | 1.0 | 98.69 | 7 | C | Rural | 0 |
| 3 | 19415 | 0.049 | 15341 | 250510 | 0.0 | 0.0 | 0.0 | 99.57 | 9 | A | Urban | 1 |
| 4 | 99379 | 0.052 | 31400 | 198680 | 0.0 | 0.0 | 0.0 | 99.87 | 12 | B | Urban | 1 |


### Variable Identification

In [13]:
train.columns

Index(['id', 'perc_premium_paid_by_cash_credit', 'age_in_days', 'Income',
       'Count_3-6_months_late', 'Count_6-12_months_late',
       'Count_more_than_12_months_late', 'application_underwriting_score',
       'no_of_premiums_paid', 'sourcing_channel', 'residence_area_type',
       'target'],
      dtype='object')

In [14]:
train.dtypes

id                                    int64
perc_premium_paid_by_cash_credit    float64
age_in_days                           int64
Income                                int64
Count_3-6_months_late               float64
Count_6-12_months_late              float64
Count_more_than_12_months_late      float64
application_underwriting_score      float64
no_of_premiums_paid                   int64
sourcing_channel                     object
residence_area_type                  object
target                                int64
dtype: object

In [15]:
train.shape

(79853, 12)

We can see that the dataframe has 79853 rows and 12 columns. Upon inspection, two of the columns, **sourcing_channel** and **residence_area_type**, have the **object** datatype and are **categorical** in nature. If we feed the string values of a categorical variable to a predictive model, it will raise an error. So we need to convert them using the following step:

In [16]:
train = pd.get_dummies(train)

In [17]:
train.shape

(79853, 17)

In [18]:
train.columns

Index(['id', 'perc_premium_paid_by_cash_credit', 'age_in_days', 'Income',
       'Count_3-6_months_late', 'Count_6-12_months_late',
       'Count_more_than_12_months_late', 'application_underwriting_score',
       'no_of_premiums_paid', 'target', 'sourcing_channel_A',
       'sourcing_channel_B', 'sourcing_channel_C', 'sourcing_channel_D',
       'sourcing_channel_E', 'residence_area_type_Rural',
       'residence_area_type_Urban'],
      dtype='object')

Also, the **id** variable adds little value when it comes to training the model, so let us drop it right away.

In [19]:
train = train.drop(['id'],axis=1)
train.head()

| | perc_premium_paid_by_cash_credit | age_in_days | Income | Count_3-6_months_late | Count_6-12_months_late | Count_more_than_12_months_late | application_underwriting_score | no_of_premiums_paid | target | sourcing_channel_A | sourcing_channel_B | sourcing_channel_C | sourcing_channel_D | sourcing_channel_E | residence_area_type_Rural | residence_area_type_Urban |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.429 | 12058 | 355060 | 0.0 | 0.0 | 0.0 | 99.02 | 13 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
| 1 | 0.01 | 21546 | 315150 | 0.0 | 0.0 | 0.0 | 99.89 | 21 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
| 2 | 0.917 | 17531 | 84140 | 2.0 | 3.0 | 1.0 | 98.69 | 7 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
| 3 | 0.049 | 15341 | 250510 | 0.0 | 0.0 | 0.0 | 99.57 | 9 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
| 4 | 0.052 | 31400 | 198680 | 0.0 | 0.0 | 0.0 | 99.87 | 12 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |


Here we can see that the number of rows remains the same as before, but **get_dummies()** increased the number of columns by 5 (from 12 to 17, before **id** was dropped). Upon inspecting the columns we can see the new dummy variables that were created to replace the two categorical variables.
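
To make the column arithmetic concrete, here is a minimal sketch on a three-row toy frame that reuses a few values from the table above. The `toy` and `encoded` names are only for illustration, and `dtype=int` is my own choice so that the dummies show up as 0/1 (newer pandas versions return booleans by default).

```python
import pandas as pd

# Toy frame: one numeric column and the two categorical columns.
toy = pd.DataFrame({
    "Income": [355060, 315150, 84140],
    "sourcing_channel": ["C", "A", "C"],
    "residence_area_type": ["Urban", "Urban", "Rural"],
})

# Each categorical column is replaced by one 0/1 column per category present.
encoded = pd.get_dummies(toy, dtype=int)

print(toy.shape, "->", encoded.shape)
print(encoded.columns.tolist())
```

Only the categories that actually occur in the frame get a column, which is why the toy frame ends up with two channel columns while the full dataset ends up with five.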

### Univariate Analysis

Here we explore the variables one at a time and summarize them. Let us use the **describe()** function to get a good idea of the data set.

In [21]:
train.describe()

| | perc_premium_paid_by_cash_credit | age_in_days | Income | Count_3-6_months_late | Count_6-12_months_late | Count_more_than_12_months_late | application_underwriting_score | no_of_premiums_paid | target | sourcing_channel_A | sourcing_channel_B | sourcing_channel_C | sourcing_channel_D | sourcing_channel_E | residence_area_type_Rural | residence_area_type_Urban |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 79853.0 | 79853.0 | 79853.0 | 79756.0 | 79756.0 | 79756.0 | 76879.0 | 79853.0 | 79853.0 | 79853.0 | 79853.0 | 79853.0 | 79853.0 | 79853.0 | 79853.0 | 79853.0 |
| mean | 0.314288 | 18846.696906 | 208847.2 | 0.248671 | 0.078188 | 0.060008 | 99.067291 | 10.863887 | 0.93741 | 0.540168 | 0.20678 | 0.150765 | 0.094661 | 0.007627 | 0.396604 | 0.603396 |
| std | 0.334915 | 5208.719136 | 496582.6 | 0.691468 | 0.436507 | 0.312023 | 0.739799 | 5.170687 | 0.242226 | 0.498387 | 0.404999 | 0.357821 | 0.292749 | 0.086997 | 0.489195 | 0.489195 |
| min | 0.0 | 7670.0 | 24030.0 | 0.0 | 0.0 | 0.0 | 91.9 | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 0.034 | 14974.0 | 108010.0 | 0.0 | 0.0 | 0.0 | 98.81 | 7.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 50% | 0.167 | 18625.0 | 166560.0 | 0.0 | 0.0 | 0.0 | 99.21 | 10.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 75% | 0.538 | 22636.0 | 252090.0 | 0.0 | 0.0 | 0.0 | 99.54 | 14.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
| max | 1.0 | 37602.0 | 90262600.0 | 13.0 | 17.0 | 11.0 | 99.89 | 60.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |


From the **count** row we get an idea that the dataset has a few missing values. Let us inspect it further, as we will need to treat the missing values because keeping them will raise errors while building the model.

In [22]:
pd.isnull(train).sum()

perc_premium_paid_by_cash_credit       0
age_in_days                            0
Income                                 0
Count_3-6_months_late                 97
Count_6-12_months_late                97
Count_more_than_12_months_late        97
application_underwriting_score      2974
no_of_premiums_paid                    0
target                                 0
sourcing_channel_A                     0
sourcing_channel_B                     0
sourcing_channel_C                     0
sourcing_channel_D                     0
sourcing_channel_E                     0
residence_area_type_Rural              0
residence_area_type_Urban              0
dtype: int64

Using the above function we see that the columns **Count_3-6_months_late**, **Count_6-12_months_late** and **Count_more_than_12_months_late** have 97 missing values each, and the column **application_underwriting_score** has 2974 missing values. We will impute these in the following cells.

### Missing Value Treatment

In [23]:
missing_columns = ["Count_3-6_months_late", "Count_6-12_months_late", "Count_more_than_12_months_late", "application_underwriting_score"]

It is important to note that there are several imputation techniques one can use to fill in missing values, such as replacing them with a randomly drawn value or with the column's **mean, median or mode**. While developing a predictive model one should pick the technique which gives the maximum accuracy.
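
For comparison with the approach I take below, the simpler mean, median and mode options can be sketched as follows. This is only an illustration on a five-value toy series (the values are borrowed from the first rows of the data); the `scores` name and the `filled_*` names are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy series with one missing entry.
scores = pd.Series(
    [99.02, 99.89, np.nan, 99.57, 99.87],
    name="application_underwriting_score",
)

# Replace the missing entry with a single summary statistic of the observed values.
filled_mean = scores.fillna(scores.mean())
filled_median = scores.fillna(scores.median())
# mode() can return several values; take the first one.
filled_mode = scores.fillna(scores.mode().iloc[0])

print(filled_median.tolist())
```

These baselines are quick to try, but every missing entry in a column receives the same value, which shrinks the variance of that column.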

Here I have written a function which first randomly imputes the missing values of each column that contains them. Then, for each of those columns in turn, I fit a **Regression Model** and replace the randomly imputed values with the values predicted by it.

In [24]:
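def random_imputation(df, feature):
    # Fill the missing entries of df[feature + '_imp'] with values drawn at random
    # (with replacement) from the observed values of df[feature].
    number_missing = df[feature].isnull().sum()
    observed_values = df.loc[df[feature].notnull(), feature]
    df.loc[df[feature].isnull(), feature + '_imp'] = np.random.choice(observed_values, number_missing, replace = True)

    return df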

In [26]:
for feature in missing_columns:
    train[feature + '_imp'] = train[feature]
    train = random_imputation(train, feature)
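
To see what this step does, here is the same function run on a tiny toy frame. The `late_count` column and its values are made up for the illustration, and the fixed seed is my own addition so that the draw is repeatable.

```python
import numpy as np
import pandas as pd

def random_imputation(df, feature):
    # Same logic as the function defined above.
    number_missing = df[feature].isnull().sum()
    observed_values = df.loc[df[feature].notnull(), feature]
    df.loc[df[feature].isnull(), feature + '_imp'] = np.random.choice(
        observed_values, number_missing, replace=True
    )
    return df

np.random.seed(0)  # assumption: seeded only to make the example repeatable

toy = pd.DataFrame({"late_count": [0.0, 2.0, np.nan, 1.0, np.nan]})
toy["late_count_imp"] = toy["late_count"]  # copy that will receive the draws
toy = random_imputation(toy, "late_count")

print(toy)
```

The original column keeps its missing values, so it can still be used later to locate the rows that were imputed.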

In [27]:
deter_data = pd.DataFrame(columns = ["Det" + name for name in missing_columns])

for feature in missing_columns:
        
    deter_data["Det" + feature] = train[feature + "_imp"]
    parameters = list(set(train.columns) - set(missing_columns) - {feature + '_imp'})
    
    #Create a Linear Regression model to estimate the missing data
    model = LinearRegression()
    model.fit(X = train[parameters], y = train[feature + '_imp'])
    
    #observe that I preserve the index of the missing data from the original dataframe
    deter_data.loc[train[feature].isnull(), "Det" + feature] = model.predict(train[parameters])[train[feature].isnull()]
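
A common variation on this idea, which is not what the cell above does, is to fit the regression only on the rows where the feature was actually observed and then predict the missing rows. A minimal sketch on toy data follows; `regression_impute`, `toy`, and the column names `x` and `y` are all hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

def regression_impute(df, feature, predictors):
    """Return a copy of df with missing `feature` values predicted from `predictors`."""
    out = df.copy()
    missing = out[feature].isnull()
    if missing.any():
        model = LinearRegression()
        # Fit on observed rows only, so no placeholder values enter the fit.
        model.fit(out.loc[~missing, predictors], out.loc[~missing, feature])
        out.loc[missing, feature] = model.predict(out.loc[missing, predictors])
    return out

# Toy data where y is exactly 2 * x, with one value of y missing.
toy = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0, 5.0],
    "y": [2.0, 4.0, np.nan, 8.0, 10.0],
})
filled = regression_impute(toy, "y", ["x"])
print(filled)
```

Whether this gives better downstream accuracy than the random-then-regress approach is something one would have to check on the actual data.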

In [28]:
pd.isnull(deter_data).sum()

DetCount_3-6_months_late             0
DetCount_6-12_months_late            0
DetCount_more_than_12_months_late    0
Detapplication_underwriting_score    0
dtype: int64

In [29]:
train['Count_3-6_months_late'] = deter_data['DetCount_3-6_months_late']
train['Count_6-12_months_late'] = deter_data['DetCount_6-12_months_late']
train['Count_more_than_12_months_late'] = deter_data['DetCount_more_than_12_months_late']
train['application_underwriting_score'] = deter_data['Detapplication_underwriting_score']

In [30]:
pd.isnull(train).sum()

perc_premium_paid_by_cash_credit      0
age_in_days                           0
Income                                0
Count_3-6_months_late                 0
Count_6-12_months_late                0
Count_more_than_12_months_late        0
application_underwriting_score        0
no_of_premiums_paid                   0
target                                0
sourcing_channel_A                    0
sourcing_channel_B                    0
sourcing_channel_C                    0
sourcing_channel_D                    0
sourcing_channel_E                    0
residence_area_type_Rural             0
residence_area_type_Urban             0
Count_3-6_months_late_imp             0
Count_6-12_months_late_imp            0
Count_more_than_12_months_late_imp    0
application_underwriting_score_imp    0
dtype: int64

Finally we can see that the dataframe is free from missing values, so now we can proceed further.

### Data Preprocessing

In [31]:
train['target'].value_counts()

1    74855
0     4998
Name: target, dtype: int64

Here we can see that a total of **74855** rows have a **target** value of 1 and **4998** rows have a value of 0. So it is an **imbalanced dataset**.

The problem with an imbalanced dataset is that a model trained on it tends to become biased and to give most or all of its predictions in favor of the **majority** class, in this case 1.

So now our aim must be to tackle this problem, and to do so we need to make the counts of the majority and minority classes equal. This is basically done in 2 ways:
1. By randomly resampling the minority class with replacement until it is equal in size to the majority class.
2. By randomly deleting rows of the majority class until it is equal in size to the minority class.
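
I only use the first option in this notebook. The second option could be sketched roughly as below, on a ten-row toy frame; the `frame`, `pay_down` and `downsampled` names are made up for the illustration.

```python
import pandas as pd
from sklearn.utils import resample

# Toy frame: 8 rows of the majority class (1) and 2 rows of the minority class (0).
frame = pd.DataFrame({"feature": range(10), "target": [1] * 8 + [0] * 2})

pay = frame[frame.target == 1]
not_pay = frame[frame.target == 0]

# Downsample the majority class without replacement to the minority size.
pay_down = resample(
    pay,
    replace=False,
    n_samples=len(not_pay),
    random_state=27,
)

downsampled = pd.concat([pay_down, not_pay])
print(downsampled.target.value_counts())
```

Downsampling throws away majority rows, so it trades information for balance; with roughly 5000 minority rows in this dataset that may or may not be acceptable.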

Here I have used the **resample** function from the **utils** module of sklearn to resample the minority class until it is equal in size to the majority class. First I have used the **train_test_split** function and then resampled the classes of the training portion as stated above.

In [32]:
from sklearn.utils import resample
from sklearn.model_selection import train_test_split

# Separate input features and target
y = train.target
X = train.drop('target', axis=1)

# setting up testing and training sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=27)

# concatenate our training data back together
X = pd.concat([X_train, y_train], axis=1)

# separate minority and majority classes
not_pay = X[X.target==0]
pay = X[X.target==1]

# upsample minority
premium_default = resample(not_pay,
                          replace=True, # sample with replacement
                          n_samples=len(pay), # match number in majority class
                          random_state=27) # reproducible results

# combine majority and upsampled minority
upsampled = pd.concat([pay, premium_default])
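
The upsampled dataframe still contains the **target** column, so to train on it the features and the target have to be separated again. One way this could look is sketched below on synthetic data; the `demo` frame, its two features, and the `X_bal`/`y_bal` names are assumptions made for the example, not part of the dataset.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils import resample

# Synthetic imbalanced data (roughly 9 in 10 rows belong to class 1).
rng = np.random.default_rng(27)
n = 1000
demo = pd.DataFrame({"feature_a": rng.normal(size=n), "feature_b": rng.normal(size=n)})
demo["target"] = (demo["feature_a"] + rng.normal(scale=0.5, size=n) > -1.5).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    demo.drop("target", axis=1), demo["target"],
    test_size=0.30, random_state=27, stratify=demo["target"],
)

# Upsample the minority class inside the training portion only.
train_part = pd.concat([X_tr, y_tr], axis=1)
pay = train_part[train_part.target == 1]
not_pay = train_part[train_part.target == 0]
not_pay_up = resample(not_pay, replace=True, n_samples=len(pay), random_state=27)
balanced = pd.concat([pay, not_pay_up])

# Split the balanced frame back into features and target, then fit on it.
X_bal = balanced.drop("target", axis=1)
y_bal = balanced["target"]
clf = DecisionTreeClassifier(criterion="entropy", max_depth=5, random_state=27)
clf.fit(X_bal, y_bal)
pred = clf.predict(X_te)

print(y_bal.value_counts())
```

The test portion is left untouched, so the evaluation still reflects the real class proportions.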

### Modeling

#### Using Decision Tree

In [33]:
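from sklearn.tree import DecisionTreeClassifier
# Note: the tree is fit on X_train and y_train, not on the upsampled dataframe,
# and the name `upsampled` is rebound from that dataframe to the fitted classifier.
upsampled = DecisionTreeClassifier(criterion = 'entropy',max_depth = 5).fit(X_train, y_train)
upsampled_pred = upsampled.predict(X_test)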

#### Using K Nearest Neighbors (KNN)

In [34]:
from sklearn.neighbors import KNeighborsClassifier
neigh = KNeighborsClassifier(n_neighbors=5)
neigh.fit(X_train,y_train)
neigh_predict =neigh.predict(X_test)
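
KNN works on distances between rows, so a column with a very large range (such as **Income**, which goes up to about 90 million here) can dominate the result. A possible refinement, which I have not applied in this notebook, is to standardize the features before the neighbor search. The sketch below uses synthetic data; `X_demo`, `y_demo` and `scaled_knn` are hypothetical names.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: one large-scale column and one small-scale column.
rng = np.random.default_rng(27)
n = 400
income = rng.normal(200_000, 50_000, size=n)
score = rng.normal(99.0, 0.7, size=n)
X_demo = np.column_stack([income, score])
y_demo = (score > 99.0).astype(int)  # label depends only on the small-scale column

# Scaling and KNN chained together, so the scaler is fit on training rows only.
scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
scaled_knn.fit(X_demo[:300], y_demo[:300])

preds = scaled_knn.predict(X_demo[300:])
print(scaled_knn.score(X_demo[300:], y_demo[300:]))
```

Putting the scaler inside the pipeline means the same transformation is applied automatically whenever `predict` is called on new data.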

#### Using Logistic Regression

In [35]:
from sklearn.linear_model import LogisticRegression
logis = LogisticRegression()
logis.fit(X_train,y_train)
logis_predict = logis.predict(X_test)



### Accuracy Calculation For Different Models

#### For Decision Tree

In [36]:
accuracy_score(y_test, upsampled_pred)

0.9372599766238103

#### For KNN

In [37]:
accuracy_score(y_test, neigh_predict)

0.932042077141426

#### For Logistic Regression

In [38]:
accuracy_score(y_test, logis_predict)

0.9352980464184338

### F1 Score

#### For Decision Tree

In [30]:
f1_score(y_test, upsampled_pred)

0.9674773817014167

#### For KNN

In [39]:
f1_score(y_test, neigh_predict)

0.9648091306039515

#### For Logistic Regression

In [40]:
f1_score(y_test, logis_predict)

0.9665674474785384
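
A word of caution when reading these numbers: the mean of **target** in the describe() output is 0.93741, so a model which predicts 1 for every customer would already score about 0.937 accuracy on the training data. By default **f1_score** also treats class 1 as the positive class. The small sketch below, on made-up labels with a similar imbalance, shows how the picture changes when the minority class is scored instead; `y_true` and `y_all_ones` are hypothetical.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score, recall_score

# Made-up labels: 94 customers who pay (1) and 6 who default (0).
y_true = np.array([1] * 94 + [0] * 6)
# A "model" that always predicts the majority class.
y_all_ones = np.ones_like(y_true)

acc = accuracy_score(y_true, y_all_ones)
f1_majority = f1_score(y_true, y_all_ones)  # positive class = 1 (default)
f1_minority = f1_score(y_true, y_all_ones, pos_label=0, zero_division=0)
recall_minority = recall_score(y_true, y_all_ones, pos_label=0, zero_division=0)
cm = confusion_matrix(y_true, y_all_ones)  # rows: true 0/1, columns: predicted 0/1

print(acc, round(f1_majority, 3), f1_minority, recall_minority)
print(cm)
```

Here the accuracy and the default F1 score both look high even though not a single defaulter is caught, which is why the minority-class recall and the confusion matrix are worth checking alongside them.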

It is important to note that we can keep trying different models and see if the accuracy and F1 score increase. We can also tweak the different parameters of a particular model and check for the best fit to get the best predictions.
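
One way to tweak the parameters systematically is a cross-validated grid search. The sketch below is only an illustration on synthetic data from **make_classification**; the parameter grid, the `f1_macro` scoring choice and the `search` name are my own assumptions and not settings I have tuned for this dataset.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data standing in for the real features.
X_demo, y_demo = make_classification(
    n_samples=500, n_features=8, weights=[0.1, 0.9], random_state=27
)

# Candidate settings to compare.
param_grid = {"max_depth": [3, 5, 8], "criterion": ["gini", "entropy"]}

search = GridSearchCV(
    DecisionTreeClassifier(random_state=27),
    param_grid,
    scoring="f1_macro",  # averages F1 over both classes
    cv=3,
)
search.fit(X_demo, y_demo)

print(search.best_params_, round(search.best_score_, 3))
```

The fitted `search` object can then be used like a normal model, since it refits the best combination on all the data it was given.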

## Testing the model on an Unknown Dataset

Here I have an unseen dataset on which I can test my models and predict the values. The data preprocessing steps followed here are the same as the ones followed while preparing the data set for training the model.

In [41]:
test= pd.read_csv(r'C:\Users\hp world\Downloads\test_u8jxaCM.csv')
test.head()

| | id | perc_premium_paid_by_cash_credit | age_in_days | Income | Count_3-6_months_late | Count_6-12_months_late | Count_more_than_12_months_late | application_underwriting_score | no_of_premiums_paid | sourcing_channel | residence_area_type |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 649 | 0.001 | 27384 | 51150 | 0.0 | 0.0 | 0.0 | 99.89 | 7 | A | Rural |
| 1 | 81136 | 0.124 | 23735 | 285140 | 0.0 | 0.0 | 0.0 | 98.93 | 19 | A | Urban |
| 2 | 70762 | 1.0 | 17170 | 186030 | 0.0 | 0.0 | 0.0 | NaN | 2 | B | Urban |
| 3 | 53935 | 0.198 | 16068 | 123540 | 0.0 | 0.0 | 0.0 | 99.0 | 11 | B | Rural |
| 4 | 15476 | 0.041 | 10591 | 200020 | 1.0 | 0.0 | 0.0 | 99.17 | 14 | A | Rural |


In [43]:
test.shape

(34224, 11)

In [44]:
test.columns

Index(['id', 'perc_premium_paid_by_cash_credit', 'age_in_days', 'Income',
       'Count_3-6_months_late', 'Count_6-12_months_late',
       'Count_more_than_12_months_late', 'application_underwriting_score',
       'no_of_premiums_paid', 'sourcing_channel', 'residence_area_type'],
      dtype='object')

In [42]:
test.dtypes

id                                    int64
perc_premium_paid_by_cash_credit    float64
age_in_days                           int64
Income                                int64
Count_3-6_months_late               float64
Count_6-12_months_late              float64
Count_more_than_12_months_late      float64
application_underwriting_score      float64
no_of_premiums_paid                   int64
sourcing_channel                     object
residence_area_type                  object
dtype: object

#### Creating Dummy Variables

In [45]:
test = pd.get_dummies(test)

In [46]:
test1 = test.drop(['id'],axis=1)
test1.head()

| | perc_premium_paid_by_cash_credit | age_in_days | Income | Count_3-6_months_late | Count_6-12_months_late | Count_more_than_12_months_late | application_underwriting_score | no_of_premiums_paid | sourcing_channel_A | sourcing_channel_B | sourcing_channel_C | sourcing_channel_D | sourcing_channel_E | residence_area_type_Rural | residence_area_type_Urban |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.001 | 27384 | 51150 | 0.0 | 0.0 | 0.0 | 99.89 | 7 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
| 1 | 0.124 | 23735 | 285140 | 0.0 | 0.0 | 0.0 | 98.93 | 19 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
| 2 | 1.0 | 17170 | 186030 | 0.0 | 0.0 | 0.0 | NaN | 2 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
| 3 | 0.198 | 16068 | 123540 | 0.0 | 0.0 | 0.0 | 99.0 | 11 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
| 4 | 0.041 | 10591 | 200020 | 1.0 | 0.0 | 0.0 | 99.17 | 14 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |


In [47]:
test1.shape

(34224, 15)
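
In this case the test file contains the same categories as the training file, so **get_dummies()** produces matching columns. That is not guaranteed in general, because dummy columns are only created for categories that are present. A defensive sketch, on made-up toy frames, is to reindex the encoded test frame to the training columns; `train_demo`, `test_demo` and `test_aligned` are hypothetical names.

```python
import pandas as pd

# Toy frames: the test frame happens to contain only channel A.
train_demo = pd.DataFrame({"Income": [100, 200, 300], "sourcing_channel": ["A", "B", "C"]})
test_demo = pd.DataFrame({"Income": [150, 250], "sourcing_channel": ["A", "A"]})

train_enc = pd.get_dummies(train_demo, dtype=int)
test_enc = pd.get_dummies(test_demo, dtype=int)

# Force the test frame to have the training columns, in the same order;
# dummy columns that were never created are filled with 0.
test_aligned = test_enc.reindex(columns=train_enc.columns, fill_value=0)

print(test_enc.columns.tolist())
print(test_aligned.columns.tolist())
```

With the real data, the training columns to align against would be the feature columns used for fitting (that is, without **target**).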

### Univariate Analysis

In [48]:
test1.describe()

| | perc_premium_paid_by_cash_credit | age_in_days | Income | Count_3-6_months_late | Count_6-12_months_late | Count_more_than_12_months_late | application_underwriting_score | no_of_premiums_paid | sourcing_channel_A | sourcing_channel_B | sourcing_channel_C | sourcing_channel_D | sourcing_channel_E | residence_area_type_Rural | residence_area_type_Urban |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 34224.0 | 34224.0 | 34224.0 | 34193.0 | 34193.0 | 34193.0 | 32901.0 | 34224.0 | 34224.0 | 34224.0 | 34224.0 | 34224.0 | 34224.0 | 34224.0 | 34224.0 |
| mean | 0.314457 | 18824.215346 | 202820.1 | 0.238733 | 0.080718 | 0.058111 | 99.061898 | 10.890428 | 0.545582 | 0.202285 | 0.150362 | 0.094144 | 0.007626 | 0.397849 | 0.602151 |
| std | 0.334059 | 5246.525604 | 270253.6 | 0.686162 | 0.454634 | 0.307046 | 0.742942 | 5.216867 | 0.497925 | 0.401709 | 0.357431 | 0.292034 | 0.086996 | 0.489461 | 0.489461 |
| min | 0.0 | 7671.0 | 24030.0 | 0.0 | 0.0 | 0.0 | 91.9 | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 0.034 | 14972.0 | 106397.5 | 0.0 | 0.0 | 0.0 | 98.8 | 7.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 50% | 0.169 | 18623.0 | 165070.0 | 0.0 | 0.0 | 0.0 | 99.21 | 10.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 75% | 0.54 | 22636.0 | 250020.0 | 0.0 | 0.0 | 0.0 | 99.53 | 14.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
| max | 1.0 | 35785.0 | 21914550.0 | 12.0 | 10.0 | 7.0 | 99.89 | 59.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |


In [49]:
pd.isnull(test1).sum()

perc_premium_paid_by_cash_credit       0
age_in_days                            0
Income                                 0
Count_3-6_months_late                 31
Count_6-12_months_late                31
Count_more_than_12_months_late        31
application_underwriting_score      1323
no_of_premiums_paid                    0
sourcing_channel_A                     0
sourcing_channel_B                     0
sourcing_channel_C                     0
sourcing_channel_D                     0
sourcing_channel_E                     0
residence_area_type_Rural              0
residence_area_type_Urban              0
dtype: int64

### Treating Missing Values

In [50]:
missing_columns_test = ["Count_3-6_months_late", "Count_6-12_months_late", "Count_more_than_12_months_late", "application_underwriting_score"]

In [51]:
for feature in missing_columns_test:
    test1[feature + '_imp'] = test1[feature]
    test1 = random_imputation(test1, feature)

In [52]:
deter_data_test = pd.DataFrame(columns = ["Det" + name for name in missing_columns_test])

for feature in missing_columns_test:
        
    deter_data_test["Det" + feature] = test1[feature + "_imp"]
    parameters_test = list(set(test1.columns) - set(missing_columns_test) - {feature + '_imp'})
    
    #Create a Linear Regression model to estimate the missing data
    model1 = LinearRegression()
    model1.fit(X = test1[parameters_test], y = test1[feature + '_imp'])
    
    #observe that I preserve the index of the missing data from the original dataframe
    deter_data_test.loc[test1[feature].isnull(), "Det" + feature] = model1.predict(test1[parameters_test])[test1[feature].isnull()]

In [53]:
pd.isnull(deter_data_test).sum()

DetCount_3-6_months_late             0
DetCount_6-12_months_late            0
DetCount_more_than_12_months_late    0
Detapplication_underwriting_score    0
dtype: int64

In [54]:
test1['Count_3-6_months_late'] = deter_data_test['DetCount_3-6_months_late']
test1['Count_6-12_months_late'] = deter_data_test['DetCount_6-12_months_late']
test1['Count_more_than_12_months_late'] = deter_data_test['DetCount_more_than_12_months_late']
test1['application_underwriting_score'] = deter_data_test['Detapplication_underwriting_score']

In [55]:
pd.isnull(test1).sum()

perc_premium_paid_by_cash_credit      0
age_in_days                           0
Income                                0
Count_3-6_months_late                 0
Count_6-12_months_late                0
Count_more_than_12_months_late        0
application_underwriting_score        0
no_of_premiums_paid                   0
sourcing_channel_A                    0
sourcing_channel_B                    0
sourcing_channel_C                    0
sourcing_channel_D                    0
sourcing_channel_E                    0
residence_area_type_Rural             0
residence_area_type_Urban             0
Count_3-6_months_late_imp             0
Count_6-12_months_late_imp            0
Count_more_than_12_months_late_imp    0
application_underwriting_score_imp    0
dtype: int64

### Predictions

#### For Decision Tree

In [56]:
submission_pred = upsampled.predict(test1)

#### For KNN

In [57]:
submission_pred1 = neigh.predict(test1)

#### For Logistic Regression

In [58]:
submission_pred2 = logis.predict(test1)

### Creating Dataframes and Storing the Results

In [59]:
submission10 = pd.DataFrame()

In [60]:
submission11 = pd.DataFrame()

In [61]:
submission12 = pd.DataFrame()

#### For Decision Tree

In [63]:
submission10['id'] = test['id']
submission10['target']=submission_pred

#### For KNN

In [64]:
submission11['id'] = test['id']
submission11['target']=submission_pred1

#### For Logistic Regression

In [65]:
submission12['id'] = test['id']
submission12['target']=submission_pred2

In [66]:
submission10['target'].value_counts()

1    33855
0      369
Name: target, dtype: int64

In [67]:
submission11['target'].value_counts()

1    34076
0      148
Name: target, dtype: int64

In [68]:
submission12['target'].value_counts()

1    34224
Name: target, dtype: int64

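Using the value_counts() function we can see that the Logistic Regression still gives a **biased** model: it predicts 1 for every row. Since all three models were fit on X_train and y_train rather than on the upsampled data, the class imbalance was still present during training.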

### Saving the Dataframes to a CSV file

In [69]:
submission10.to_csv('submission10.csv', header=True, index=False)

In [70]:
submission11.to_csv('submission11.csv', header=True, index=False)

In [71]:
submission12.to_csv('submission12.csv', header=True, index=False)
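
The three submission frames are built in exactly the same way, so the steps above could also be written as a small helper and a loop. The sketch below uses the first three test ids and placeholder predictions; `build_submission`, `demo_ids` and `demo_predictions` are hypothetical names, and the prediction values are not real model output.

```python
import pandas as pd

def build_submission(ids, predictions):
    """Pair each id with its predicted target in a two-column frame."""
    return pd.DataFrame({"id": list(ids), "target": list(predictions)})

demo_ids = [649, 81136, 70762]
# Placeholder predictions, one list per model.
demo_predictions = {
    "decision_tree": [1, 1, 0],
    "knn": [1, 1, 1],
}

submissions = {
    name: build_submission(demo_ids, preds)
    for name, preds in demo_predictions.items()
}

for name, frame in submissions.items():
    # to_csv() without a path returns the CSV text; pass a filename to write a file.
    csv_text = frame.to_csv(index=False)
    print(name, frame["target"].value_counts().to_dict())
```

Keeping the frames in a dictionary keyed by model name also makes it harder to mix up which file came from which model.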