## Problem Statement

A non-life insurance company wants to evaluate customer lifetime value (CLV) based on each customer's demographics and policy information, including claim details. CLV is a profitability metric that expresses the value the company places on each customer. It has two dimensions: the customer's present value and potential future value.

Accurately predicting the CLV of its customers lets the company make better decisions and implement appropriate action plans.

An analytical and modelling framework is designed to predict the lifetime value of each customer. Various statistical and machine learning models are applied to predict CLV, based on the quantitative and qualitative features provided in the dataset.

**Goal**

To predict the customer lifetime value for an auto insurance company based on the quantitative and qualitative features provided.

In [403]:
## Import necessary libraries.

import numpy as np ## NumPy (used to create arrays and convert data frames to arrays).
import pandas as pd ## pandas (used to load data and build data frames).
import os ## For working-directory and file-path handling.
from sklearn.preprocessing import LabelEncoder ## For label encoding(converting categorical values to label).
from sklearn.model_selection import train_test_split ## For splitting data into train and validation.
from statsmodels.stats.outliers_influence import variance_inflation_factor ## For VIF.
from sklearn.linear_model import LinearRegression ## For regression model.
from sklearn.metrics import mean_squared_error ## For MSE.
from math import sqrt ## For square root.
from sklearn.tree import DecisionTreeRegressor ## For Decision tree model.
from sklearn.ensemble import RandomForestRegressor ## For Random Forest model.
from sklearn.neighbors import KNeighborsRegressor ## For KNN model.
from sklearn.svm import SVR ## For SVR model.
from sklearn.ensemble import AdaBoostRegressor ## For Adaboost model.
from sklearn.ensemble import GradientBoostingRegressor ## For GBR model.
from xgboost.sklearn import XGBRegressor ## For  XGB model.
from keras.models import Sequential ## For sequential model
from keras.layers import Dense ## For fully connected layer.
from sklearn.model_selection import GridSearchCV ## For Grid search.
from sklearn.linear_model import Ridge ## For Ridge model.
from sklearn.linear_model import Lasso ## For Lasso model.

In [2]:
## Get current working directory.
os.getcwd()

'D:\\Python\\Pratice\\Life Insurance'

In [3]:
## Set working directory.
os.chdir(r"D:\DataScience\Pratice\Life Insurance") ## Raw string so the backslashes are not read as escape sequences.
os.getcwd()

'D:\\DataScience\\Pratice\\Life Insurance'

In [4]:
## Read train and test data.
train = pd.read_csv('train.csv',sep=',',header='infer')
test = pd.read_csv('test.csv',sep=',',header='infer')

In [5]:
## Get first 5 records of train data.
train.head()

Unnamed: 0,CustomerID,Customer.Lifetime.Value,Coverage,Education,EmploymentStatus,Gender,Income,Location.Geo,Location.Code,Marital.Status,...,Months.Since.Policy.Inception,Number.of.Open.Complaints,Number.of.Policies,Policy.Type,Policy,Renew.Offer.Type,Sales.Channel,Total.Claim.Amount,Vehicle.Class,Vehicle.Size
0,5917,7824.372789,Basic,Bachelor,Unemployed,F,0,"17.7,77.7",Urban,Married,...,33,,2.0,Personal Auto,Personal L2,Offer2,Branch,267.214383,Four-Door Car,2.0
1,2057,8005.964669,Basic,College,Employed,M,63357,"28.8,76.6",Suburban,Married,...,42,0.0,5.0,Personal Auto,Personal L2,Offer2,Agent,565.508572,SUV,2.0
2,4119,8646.504109,Basic,High School or Below,Employed,F,64125,"21.6,88.4",Urban,Married,...,44,0.0,3.0,Personal Auto,Personal L1,Offer2,Branch,369.818708,SUV,1.0
3,1801,9294.088719,Basic,College,Employed,M,67544,"19,72.5",Suburban,Married,...,15,,3.0,Corporate Auto,Corporate L3,Offer1,Branch,556.8,SUV,3.0
4,9618,5595.971365,Basic,Bachelor,Retired,F,19651,"19.1,74.7",Suburban,Married,...,68,0.0,5.0,Personal Auto,Personal L1,Offer2,Web,345.6,Two-Door Car,3.0


In [6]:
## Get last 5 records of train data.
train.tail()

Unnamed: 0,CustomerID,Customer.Lifetime.Value,Coverage,Education,EmploymentStatus,Gender,Income,Location.Geo,Location.Code,Marital.Status,...,Months.Since.Policy.Inception,Number.of.Open.Complaints,Number.of.Policies,Policy.Type,Policy,Renew.Offer.Type,Sales.Channel,Total.Claim.Amount,Vehicle.Class,Vehicle.Size
9801,3735,20496.69426,Basic,High School or Below,Unemployed,F,0,"12.7,79.4",Suburban,Single,...,72,0.0,2.0,Personal Auto,Personal L2,Offer1,Branch,307.2,Four-Door Car,2.0
9802,5988,2592.437797,Basic,High School or Below,Employed,M,72421,"18.6,72.3",Suburban,Married,...,23,0.0,1.0,Corporate Auto,Corporate L3,Offer2,Call Center,312.0,Four-Door Car,3.0
9803,8767,3103.923041,Extended,College,Employed,F,74665,"19.2,74.7",Urban,Married,...,90,2.0,1.0,Corporate Auto,Corporate L2,Offer2,Call Center,236.902001,Four-Door Car,2.0
9804,9900,9161.655119,Basic,High School or Below,Employed,F,91763,"19.5,73.9",Urban,Married,...,64,0.0,3.0,Special Auto,Special L3,Offer1,Call Center,441.992043,SUV,3.0
9805,11323,8583.272854,Premium,High School or Below,Disabled,F,18017,"17.2,78.2",Suburban,Divorced,...,54,0.0,9.0,Personal Auto,Personal L3,Offer2,Call Center,547.2,Four-Door Car,2.0


In [7]:
## Get first 5 records of test data.
test.head()

Unnamed: 0,CustomerID,Coverage,Education,EmploymentStatus,Gender,Income,Location.Geo,Location.Code,Marital.Status,Monthly.Premium.Auto,...,Months.Since.Policy.Inception,Number.of.Open.Complaints,Number.of.Policies,Policy.Type,Policy,Renew.Offer.Type,Sales.Channel,Total.Claim.Amount,Vehicle.Class,Vehicle.Size
0,17,Basic,Bachelor,Employed,M,43836.0,"12.6,79.4",Rural,Single,73.0,...,44,0,1,Personal Auto,Personal L1,Offer1,Agent,138.130879,Four-Door Car,Medsize
1,19,Extended,College,Employed,F,28812.0,"17.3,78.4",Urban,Married,93.0,...,7,0,8,Special Auto,Special L2,Offer2,Branch,425.527834,Four-Door Car,Medsize
2,29,Premium,Master,Employed,M,77026.0,"18.4,73.5",Urban,Married,110.0,...,82,2,3,Corporate Auto,Corporate L1,Offer2,Agent,472.029737,Four-Door Car,Medsize
3,34,Basic,Bachelor,Employed,F,24599.0,"17.1,78.2",Rural,Married,64.0,...,50,1,2,Corporate Auto,Corporate L2,Offer2,Branch,42.920271,Four-Door Car,Medsize
4,37,Extended,Bachelor,Disabled,F,13789.0,1380.1,Suburban,Divorced,79.0,...,49,0,1,Personal Auto,Personal L3,Offer4,Call Center,379.2,Four-Door Car,Medsize


In [8]:
## Get last 5 records of test data.
test.tail()

Unnamed: 0,CustomerID,Coverage,Education,EmploymentStatus,Gender,Income,Location.Geo,Location.Code,Marital.Status,Monthly.Premium.Auto,...,Months.Since.Policy.Inception,Number.of.Open.Complaints,Number.of.Policies,Policy.Type,Policy,Renew.Offer.Type,Sales.Channel,Total.Claim.Amount,Vehicle.Class,Vehicle.Size
1762,11553,Basic,Bachelor,Employed,F,61896.0,"19.4,72.8",Urban,Married,104.0,...,97,0,1,Personal Auto,Personal L2,Offer4,Call Center,461.306722,SUV,Medsize
1763,11557,Basic,Doctor,Employed,F,39317.0,"19.4,71.9",Rural,Single,64.0,...,46,0,1,Personal Auto,Personal L2,Offer2,Branch,77.695607,Four-Door Car,Medsize
1764,11559,Basic,Bachelor,Employed,F,30205.0,"19.2,74.3",Suburban,Single,195.0,...,1,0,4,Personal Auto,Personal L3,Offer1,Agent,1329.957905,Luxury SUV,Large
1765,11570,Extended,College,Employed,M,36918.0,"18.9,72.7",Suburban,Divorced,76.0,...,77,3,3,Personal Auto,Personal L1,Offer1,Branch,364.8,Two-Door Car,Medsize
1766,11572,Extended,Bachelor,Employed,F,59367.0,"18.8,73.2",Rural,Married,84.0,...,48,0,1,Personal Auto,Personal L3,Offer2,Agent,6.880385,Four-Door Car,Medsize


In [9]:
## Check dimensions of train data.
train.shape

(9806, 22)

In [10]:
## Check dimensions of test data.
test.shape

(1767, 21)

In [11]:
## Get column names of train data.
train.columns

Index(['CustomerID', 'Customer.Lifetime.Value', 'Coverage', 'Education',
       'EmploymentStatus', 'Gender', 'Income', 'Location.Geo', 'Location.Code',
       'Marital.Status', 'Monthly.Premium.Auto', 'Months.Since.Last.Claim',
       'Months.Since.Policy.Inception', 'Number.of.Open.Complaints',
       'Number.of.Policies', 'Policy.Type', 'Policy', 'Renew.Offer.Type',
       'Sales.Channel', 'Total.Claim.Amount', 'Vehicle.Class', 'Vehicle.Size'],
      dtype='object')

In [12]:
## Get column names of test data.
test.columns

Index(['CustomerID', 'Coverage', 'Education', 'EmploymentStatus', 'Gender',
       'Income', 'Location.Geo', 'Location.Code', 'Marital.Status',
       'Monthly.Premium.Auto', 'Months.Since.Last.Claim',
       'Months.Since.Policy.Inception', 'Number.of.Open.Complaints',
       'Number.of.Policies', 'Policy.Type', 'Policy', 'Renew.Offer.Type',
       'Sales.Channel', 'Total.Claim.Amount', 'Vehicle.Class', 'Vehicle.Size'],
      dtype='object')

In [13]:
## Get index range of train data.
train.index

RangeIndex(start=0, stop=9806, step=1)

In [14]:
## Get index range of test data.
test.index

RangeIndex(start=0, stop=1767, step=1)

In [15]:
## Get summary statistics of train data.
train.describe(include='all')

Unnamed: 0,CustomerID,Customer.Lifetime.Value,Coverage,Education,EmploymentStatus,Gender,Income,Location.Geo,Location.Code,Marital.Status,...,Months.Since.Policy.Inception,Number.of.Open.Complaints,Number.of.Policies,Policy.Type,Policy,Renew.Offer.Type,Sales.Channel,Total.Claim.Amount,Vehicle.Class,Vehicle.Size
count,9806.0,9806.0,8881,9677,9688,9677,9806.0,9806,9687,9677,...,9806.0,8988.0,9685.0,8915,9685,9678,9678,9806.0,9680,9680.0
unique,,,3,5,5,2,4622.0,2840,3,3,...,,,,3,9,4,4,,6,
top,,,Basic,Bachelor,Employed,F,0.0,"NA,NA",Suburban,Married,...,,,,Personal Auto,Personal L3,Offer1,Agent,,Four-Door Car,
freq,,,5361,2934,6020,4985,2461.0,119,6204,5643,...,,,,6620,3637,3975,3670,,4869,
mean,5778.381807,7998.047015,,,,,,,,,...,48.165001,0.379172,2.960351,,,,,438.266734,,2.089773
std,3343.286093,6848.055899,,,,,,,,,...,27.96363,0.896427,2.389801,,,,,293.502301,,0.538524
min,1.0,1898.007675,,,,,,,,,...,0.0,0.0,1.0,,,,,0.099007,,1.0
25%,2879.25,4013.949039,,,,,,,,,...,24.0,0.0,1.0,,,,,280.352767,,2.0
50%,5783.0,5780.182197,,,,,,,,,...,48.0,0.0,2.0,,,,,384.007015,,2.0
75%,8678.75,8960.280213,,,,,,,,,...,71.75,0.0,4.0,,,,,553.540973,,2.0


In [16]:
## Get summary statistics of test data.
test.describe(include='all')

Unnamed: 0,CustomerID,Coverage,Education,EmploymentStatus,Gender,Income,Location.Geo,Location.Code,Marital.Status,Monthly.Premium.Auto,...,Months.Since.Policy.Inception,Number.of.Open.Complaints,Number.of.Policies,Policy.Type,Policy,Renew.Offer.Type,Sales.Channel,Total.Claim.Amount,Vehicle.Class,Vehicle.Size
count,1767.0,1767,1767,1767,1767,1528.0,1767,1767,1767,1695.0,...,1767.0,1767.0,1767.0,1725,1767,1767,1767,1767.0,1767,1767
unique,,3,5,4,2,,1192,3,3,,...,,,,3,9,4,4,,6,3
top,,Basic,Bachelor,Employed,F,,"17.1,78.2",Suburban,Married,,...,,,,Personal Auto,Personal L3,Offer1,Agent,,Four-Door Car,Medsize
freq,,1091,529,1153,886,,8,1063,1027,,...,,,,1301,669,721,666,,870,1252
mean,5834.826825,,,,,44606.390707,,,,93.622419,...,47.486701,0.413696,3.002264,,,,,423.389681,,
std,3328.701974,,,,,29046.821652,,,,34.752238,...,27.95486,0.955579,2.388154,,,,,289.518186,,
min,17.0,,,,,0.0,,,,61.0,...,0.0,0.0,1.0,,,,,1.332349,,
25%,2977.0,,,,,23491.75,,,,69.0,...,24.0,0.0,1.0,,,,,238.197494,,
50%,5813.0,,,,,42821.0,,,,84.0,...,47.0,0.0,2.0,,,,,381.118731,,
75%,8702.5,,,,,67968.5,,,,110.0,...,71.0,0.0,4.0,,,,,542.4,,


In [17]:
## Check NA values for train data.
train.isna().sum()

CustomerID                         0
Customer.Lifetime.Value            0
Coverage                         925
Education                        129
EmploymentStatus                 118
Gender                           129
Income                             0
Location.Geo                       0
Location.Code                    119
Marital.Status                   129
Monthly.Premium.Auto             794
Months.Since.Last.Claim            0
Months.Since.Policy.Inception      0
Number.of.Open.Complaints        818
Number.of.Policies               121
Policy.Type                      891
Policy                           121
Renew.Offer.Type                 128
Sales.Channel                    128
Total.Claim.Amount                 0
Vehicle.Class                    126
Vehicle.Size                     126
dtype: int64

In [18]:
## Check NA values for test data.
test.isna().sum()

CustomerID                         0
Coverage                           0
Education                          0
EmploymentStatus                   0
Gender                             0
Income                           239
Location.Geo                       0
Location.Code                      0
Marital.Status                     0
Monthly.Premium.Auto              72
Months.Since.Last.Claim            0
Months.Since.Policy.Inception      0
Number.of.Open.Complaints          0
Number.of.Policies                 0
Policy.Type                       42
Policy                             0
Renew.Offer.Type                   0
Sales.Channel                      0
Total.Claim.Amount                 0
Vehicle.Class                      0
Vehicle.Size                       0
dtype: int64

In [19]:
## Check column data types for train data.
train.dtypes

CustomerID                         int64
Customer.Lifetime.Value          float64
Coverage                          object
Education                         object
EmploymentStatus                  object
Gender                            object
Income                            object
Location.Geo                      object
Location.Code                     object
Marital.Status                    object
Monthly.Premium.Auto             float64
Months.Since.Last.Claim            int64
Months.Since.Policy.Inception      int64
Number.of.Open.Complaints        float64
Number.of.Policies               float64
Policy.Type                       object
Policy                            object
Renew.Offer.Type                  object
Sales.Channel                     object
Total.Claim.Amount               float64
Vehicle.Class                     object
Vehicle.Size                     float64
dtype: object

In [20]:
## Check column data types for test data.
test.dtypes

CustomerID                         int64
Coverage                          object
Education                         object
EmploymentStatus                  object
Gender                            object
Income                           float64
Location.Geo                      object
Location.Code                     object
Marital.Status                    object
Monthly.Premium.Auto             float64
Months.Since.Last.Claim            int64
Months.Since.Policy.Inception      int64
Number.of.Open.Complaints          int64
Number.of.Policies                 int64
Policy.Type                       object
Policy                            object
Renew.Offer.Type                  object
Sales.Channel                     object
Total.Claim.Amount               float64
Vehicle.Class                     object
Vehicle.Size                      object
dtype: object

In [21]:
### This method will return number of levels,null values,unique values,data types.

def statistics(df):
    return(pd.DataFrame({'dtypes' : df.dtypes,
                         'levels' : [df[x].unique() for x in df.columns],
                         'null_values' : df.isna().sum(),
                         'Unique Values': df.nunique()
                        }))

In [22]:
## Understand train data.
statistics(train)

Unnamed: 0,dtypes,levels,null_values,Unique Values
CustomerID,int64,"[5917, 2057, 4119, 1801, 9618, 2747, 3633, 385...",0,9806
Customer.Lifetime.Value,float64,"[7824.372789, 8005.964669, 8646.504109, 9294.0...",0,6477
Coverage,object,"[Basic, Extended, nan, Premium]",925,3
Education,object,"[Bachelor, College, High School or Below, Doct...",129,5
EmploymentStatus,object,"[Unemployed, Employed, Retired, Medical Leave,...",118,5
Gender,object,"[F, M, nan]",129,2
Income,object,"[0, 63357, 64125, 67544, 19651, 23589, 74126, ...",0,4622
Location.Geo,object,"[17.7,77.7, 28.8,76.6, 21.6,88.4, 19,72.5, 19....",0,2840
Location.Code,object,"[Urban, Suburban, Rural, nan]",119,3
Marital.Status,object,"[Married, Divorced, Single, nan]",129,3


In [23]:
## Understand test data.
statistics(test)

Unnamed: 0,dtypes,levels,null_values,Unique Values
CustomerID,int64,"[17, 19, 29, 34, 37, 44, 48, 49, 54, 65, 77, 8...",0,1767
Coverage,object,"[Basic, Extended, Premium]",0,3
Education,object,"[Bachelor, College, Master, High School or Bel...",0,5
EmploymentStatus,object,"[Employed, Disabled, Medical Leave, Unemployed]",0,4
Gender,object,"[M, F]",0,2
Income,float64,"[43836.0, 28812.0, 77026.0, 24599.0, 13789.0, ...",239,1219
Location.Geo,object,"[12.6,79.4, 17.3,78.4, 18.4,73.5, 17.1,78.2, 1...",0,1192
Location.Code,object,"[Rural, Urban, Suburban]",0,3
Marital.Status,object,"[Single, Married, Divorced]",0,3
Monthly.Premium.Auto,float64,"[73.0, 93.0, 110.0, 64.0, 79.0, 71.0, 72.0, 11...",72,142


In [142]:
## Check numeric columns for values that cannot be parsed as numbers.

def checkSpecialCharcters(df):
    for col in df.select_dtypes(['int64','float64']).columns: 
        print('\n',col,'----->')
        for index in df.index:  ## Iterate over the index labels so that row 0 is not skipped.
            try:
                skip=float(df.loc[index,col])
                skip=int(df.loc[index,col])
            except ValueError :
                ## int(NaN) raises ValueError, so missing values are filtered out here.
                if str(df.loc[index,col])!= 'nan':
                    print(index,df.loc[index,col])
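
The loop above only visits columns that pandas has already parsed as `int64` or `float64`, so it cannot surface stray characters in a column such as `Income` that was read as `object`. A minimal vectorized sketch of the same idea, applied to a single column, is shown below. The helper name `find_non_numeric` and the demo values are hypothetical and are not part of the original notebook.

```python
import pandas as pd

def find_non_numeric(series: pd.Series) -> pd.Series:
    """Return entries that are present but cannot be parsed as numbers."""
    parsed = pd.to_numeric(series, errors="coerce")
    # Keep values that were not missing originally but became NaN after coercion.
    mask = series.notna() & parsed.isna()
    return series[mask]

# Hypothetical sample that mimics an object-typed Income column.
demo = pd.Series(["0", "63357", "?", None, "19651"], name="Income")
bad = find_non_numeric(demo)
print(bad)
```

Calling `find_non_numeric(train['Income'])` before the numeric conversion would be one way to list the offending rows without a Python-level loop.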
            

In [144]:
## Check special characters for numeric columns of train data.
checkSpecialCharcters(train)


 CustomerID ----->

 Customer.Lifetime.Value ----->

 Monthly.Premium.Auto ----->

 Months.Since.Last.Claim ----->

 Months.Since.Policy.Inception ----->

 Number.of.Open.Complaints ----->

 Number.of.Policies ----->

 Total.Claim.Amount ----->

 Vehicle.Size ----->


In [145]:
## Check special characters for numeric columns of test data.
checkSpecialCharcters(test)


 CustomerID ----->

 Income ----->

 Monthly.Premium.Auto ----->

 Months.Since.Last.Claim ----->

 Months.Since.Policy.Inception ----->

 Number.of.Open.Complaints ----->

 Number.of.Policies ----->

 Total.Claim.Amount ----->


In [112]:
## Check special characters for categorical columns.
import re  ## Needed for the regex substitutions below.

def checkSpclCharcters(df):
    for col in df.select_dtypes(['object']).columns:
        print('\n',col,'----->')
        for index in df.index:  ## Iterate over the index labels so that row 0 is not skipped.
            value = str(df.loc[index,col])
            stripped = re.sub(r'[\s+]', '', value)  ## Remove whitespace and '+' characters.
            if  value.isdigit() or value==' ' or \
                value.isalpha() or stripped.isalpha() or \
                stripped.replace('-','').isalnum() or value.isalnum():
                skip = True
            else:
                print("Index ",index,"\tSpecial Character ",value)
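
The character test in this cell amounts to accepting values made only of letters, digits, whitespace, `+` and `-`. Under that reading, a rough vectorized equivalent can be sketched with `Series.str.fullmatch`. The names `ALLOWED` and `find_special_values` and the demo values are hypothetical, and the ASCII-only pattern is an assumption (the original `isalnum` calls also accept non-ASCII letters).

```python
import pandas as pd

# Assumed whitelist: ASCII letters, digits, whitespace, '+' and '-'.
ALLOWED = r"[A-Za-z0-9\s+\-]+"

def find_special_values(series: pd.Series) -> pd.Series:
    """Return non-missing entries that contain characters outside ALLOWED."""
    present = series.dropna().astype(str)
    return present[~present.str.fullmatch(ALLOWED)]

# Hypothetical sample of categorical values.
demo = pd.Series(["Four-Door Car", "Personal L2", "?", None, "63357"])
flagged = find_special_values(demo)
print(flagged)
```

Because missing values are dropped before matching, they are never reported, which mirrors how the loop treats `nan`.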

In [114]:
## Check special characters for category columns of train data.
checkSpclCharcters(train.drop('Location.Geo',axis=1))


 Coverage ----->

 Education ----->

 EmploymentStatus ----->

 Gender ----->

 Income ----->
Index  87 	Special Character  ?
Index  160 	Special Character  ?
Index  283 	Special Character  ?
Index  320 	Special Character  ?
Index  383 	Special Character  ?
Index  407 	Special Character  ?
Index  436 	Special Character  ?
Index  553 	Special Character  ?
Index  672 	Special Character  ?
Index  1032 	Special Character  ?
Index  1035 	Special Character  ?
Index  1052 	Special Character  ?
Index  1139 	Special Character  ?
Index  1270 	Special Character  ?
Index  1314 	Special Character  ?
Index  1487 	Special Character  ?
Index  1498 	Special Character  ?
Index  1508 	Special Character  ?
Index  1611 	Special Character  ?
Index  1686 	Special Character  ?
Index  1777 	Special Character  ?
Index  1888 	Special Character  ?
Index  2057 	Special Character  ?
Index  2098 	Special Character  ?
Index  2549 	Special Character  ?
Index  2608 	Special Character  ?
Index  2940 	Special Character 

In [115]:
## Check special characters for category columns of test data.
checkSpclCharcters(test.drop('Location.Geo',axis=1))


 Coverage ----->

 Education ----->

 EmploymentStatus ----->

 Gender ----->

 Location.Code ----->

 Marital.Status ----->

 Policy.Type ----->

 Policy ----->

 Renew.Offer.Type ----->

 Sales.Channel ----->

 Vehicle.Class ----->

 Vehicle.Size ----->


In [146]:
## Calculate variance for numeric columns.
def variance(x):
        return(pd.DataFrame({'Datatype' : x.dtypes,
                            'Variance': [round(x[i].var()) for i in x] }))

In [148]:
## Get variance for numeric columns of train data.
variance(train.select_dtypes(['int64','float64']))

Unnamed: 0,Datatype,Variance
CustomerID,int64,11177562
Customer.Lifetime.Value,float64,46895870
Monthly.Premium.Auto,float64,1185
Months.Since.Last.Claim,int64,100
Months.Since.Policy.Inception,int64,782
Number.of.Open.Complaints,float64,1
Number.of.Policies,float64,6
Total.Claim.Amount,float64,86144
Vehicle.Size,float64,0


In [149]:
## Get variance for numeric columns of test data.
variance(test.select_dtypes(['int64','float64']))

Unnamed: 0,Datatype,Variance
CustomerID,int64,11080257
Income,float64,843717848
Monthly.Premium.Auto,float64,1208
Months.Since.Last.Claim,int64,104
Months.Since.Policy.Inception,int64,781
Number.of.Open.Complaints,int64,1
Number.of.Policies,int64,6
Total.Claim.Amount,float64,83821


In [152]:
## Drop duplicate records.
train.drop_duplicates(keep = False, inplace = True) ## keep=False removes every copy of a duplicated row; inplace=True modifies train directly.
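
The `keep` argument decides which copies of a duplicated row survive, and the difference is easy to miss. The sketch below contrasts `keep=False` with the default `keep='first'` on a small hypothetical frame (the data is invented for illustration).

```python
import pandas as pd

# Hypothetical frame in which the row (2, 20) appears twice.
demo = pd.DataFrame({"CustomerID": [1, 2, 2, 3], "Income": [10, 20, 20, 30]})

# keep=False drops every row that has a duplicate, including the first copy.
all_copies_removed = demo.drop_duplicates(keep=False)

# keep='first' (the default) retains one copy of each duplicated row.
first_copy_kept = demo.drop_duplicates(keep="first")

print(len(all_copies_removed), len(first_copy_kept))
```

If the intent is only to remove the redundant copies, `keep='first'` would be the usual choice; `keep=False` is appropriate when duplicated records are considered unreliable altogether.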

In [154]:
## Replace ? with NA for train and test data.
train['Income'] = train['Income'].replace('?','NA')
test['Income'] = test['Income'].replace('?','NA')

In [155]:
## Check special characters for category columns of train data after replacing ? with NA.
checkSpclCharcters(train.drop('Location.Geo',axis=1))


 Coverage ----->

 Education ----->

 EmploymentStatus ----->

 Gender ----->

 Income ----->

 Location.Code ----->

 Marital.Status ----->

 Policy.Type ----->

 Policy ----->

 Renew.Offer.Type ----->

 Sales.Channel ----->

 Vehicle.Class ----->


In [156]:
## In the test data, Vehicle.Size uses text levels (Large/Medsize/Small), so convert them to numbers.
test['Vehicle.Size']=test['Vehicle.Size'].replace('Large','1.0').replace('Medsize','2.0').replace('Small','3.0')
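
The chained `replace` calls can also be written as a single dictionary lookup, which keeps the whole mapping in one place. This is a sketch only: the name `SIZE_CODES` is hypothetical, the mapping is copied from the cell above, and the demo values are invented.

```python
import pandas as pd

# Mapping copied from the replace chain above; the name SIZE_CODES is hypothetical.
SIZE_CODES = {"Large": "1.0", "Medsize": "2.0", "Small": "3.0"}

# Hypothetical sample of the test-set labels.
demo = pd.Series(["Medsize", "Large", "Small", "Medsize"], name="Vehicle.Size")

coded = demo.map(SIZE_CODES)

# Labels missing from the mapping become NaN with map, so they can be listed.
unmapped = demo[coded.isna() & demo.notna()]
print(coded.tolist(), unmapped.tolist())
```

Unlike `replace`, `map` turns any label that is absent from the dictionary into `NaN`, which makes unexpected levels visible instead of passing them through unchanged.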

In [157]:
## Data Type conversions.

In [158]:
## Convert category to numeric.
train['Income'] = pd.to_numeric(train['Income'], errors='coerce')
print(train['Income'].dtypes)
test['Income'] = pd.to_numeric(test['Income'], errors='coerce')
print(test['Income'].dtypes)

float64
float64


In [159]:
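## Convert numeric to category. Note: in this run astype('str') turns missing values into the literal string 'nan', so they become a category level instead of NA.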
train['Vehicle.Size'] = train['Vehicle.Size'].astype('str').astype('category')
print(train['Vehicle.Size'].dtypes)
test['Vehicle.Size'] = test['Vehicle.Size'].astype('str').astype('category')
print(test['Vehicle.Size'].dtypes)

category
category
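
Routing a numeric column through `str` before `category` can change how missing values are counted. The sketch below compares that route with a direct conversion on a hypothetical series. How `astype('str')` treats `NaN` depends on the pandas version, so no particular count is claimed for the first route.

```python
import numpy as np
import pandas as pd

# Hypothetical numeric column with one missing value.
demo = pd.Series([1.0, 2.0, np.nan, 3.0], name="Vehicle.Size")

# Route used in the notebook: str first, then category.
via_str = demo.astype("str").astype("category")

# Direct conversion: the missing value stays missing and is not a level.
direct = demo.astype("category")

print("missing via str:", via_str.isna().sum())
print("missing direct: ", direct.isna().sum())
print("levels direct:  ", list(direct.cat.categories))
```

If the first count prints 0, the missing entry has been absorbed as a `'nan'` level, and a later `isna().sum()` will no longer report it.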


In [164]:
## Convert object data type to category data type.
def dtypeConversion(df):  
    for i in df.select_dtypes('object'):
        df[i]=df[i].astype('category')

In [165]:
## Convert object data type to category data type for train data.
dtypeConversion(train)

In [166]:
## Convert object data type to category data type for test data.
dtypeConversion(test)

In [167]:
## Check column data types for train after conversion.
train.dtypes

CustomerID                          int64
Customer.Lifetime.Value           float64
Coverage                         category
Education                        category
EmploymentStatus                 category
Gender                           category
Income                            float64
Location.Geo                     category
Location.Code                    category
Marital.Status                   category
Monthly.Premium.Auto              float64
Months.Since.Last.Claim             int64
Months.Since.Policy.Inception       int64
Number.of.Open.Complaints         float64
Number.of.Policies                float64
Policy.Type                      category
Policy                           category
Renew.Offer.Type                 category
Sales.Channel                    category
Total.Claim.Amount                float64
Vehicle.Class                    category
Vehicle.Size                     category
dtype: object

In [168]:
## Check column data types for test after conversion.
test.dtypes

CustomerID                          int64
Coverage                         category
Education                        category
EmploymentStatus                 category
Gender                           category
Income                            float64
Location.Geo                     category
Location.Code                    category
Marital.Status                   category
Monthly.Premium.Auto              float64
Months.Since.Last.Claim             int64
Months.Since.Policy.Inception       int64
Number.of.Open.Complaints           int64
Number.of.Policies                  int64
Policy.Type                      category
Policy                           category
Renew.Offer.Type                 category
Sales.Channel                    category
Total.Claim.Amount                float64
Vehicle.Class                    category
Vehicle.Size                     category
dtype: object

In [169]:
## Drop insignificant column for train and test.
train.drop('Location.Geo', axis = 1,inplace=True)
test.drop('Location.Geo', axis = 1,inplace=True)

In [170]:
## Set index value to train and test.
train.set_index('CustomerID',inplace=True)
test.set_index('CustomerID',inplace=True)

In [174]:
## Split data into train and validation(70:30 ratio).
X_train,X_test,y_train,y_test = train_test_split(train.drop('Customer.Lifetime.Value',axis=1),train['Customer.Lifetime.Value'],test_size=0.3,random_state=123)

In [175]:
## Check first record of train data.
X_train.head(1)

CustomerID,Coverage,Education,EmploymentStatus,Gender,Income,Location.Code,Marital.Status,Monthly.Premium.Auto,Months.Since.Last.Claim,Months.Since.Policy.Inception,Number.of.Open.Complaints,Number.of.Policies,Policy.Type,Policy,Renew.Offer.Type,Sales.Channel,Total.Claim.Amount,Vehicle.Class,Vehicle.Size
9502,Basic,Doctor,Employed,M,34614.0,Suburban,Married,,12,97,,2.0,,Personal L1,Offer3,Branch,413.25233,Four-Door Car,2.0


In [176]:
## Check first record of validation data.
X_test.head(1)

CustomerID,Coverage,Education,EmploymentStatus,Gender,Income,Location.Code,Marital.Status,Monthly.Premium.Auto,Months.Since.Last.Claim,Months.Since.Policy.Inception,Number.of.Open.Complaints,Number.of.Policies,Policy.Type,Policy,Renew.Offer.Type,Sales.Channel,Total.Claim.Amount,Vehicle.Class,Vehicle.Size
4141,Basic,Master,Disabled,M,22239.0,Suburban,Married,62.0,34,85,0.0,1.0,Personal Auto,Personal L3,Offer2,Branch,297.6,Four-Door Car,3.0


In [177]:
## Check first record of train target data.
y_train.head(1)

CustomerID
9502    8730.421977
Name: Customer.Lifetime.Value, dtype: float64

In [178]:
## Check first record of validation target data.
y_test.head(1)

CustomerID
4141    2404.633766
Name: Customer.Lifetime.Value, dtype: float64

In [180]:
X_train.isna().sum()

Coverage                         663
Education                         88
EmploymentStatus                  82
Gender                            88
Income                            82
Location.Code                     87
Marital.Status                    88
Monthly.Premium.Auto             556
Months.Since.Last.Claim            0
Months.Since.Policy.Inception      0
Number.of.Open.Complaints        574
Number.of.Policies                86
Policy.Type                      614
Policy                            86
Renew.Offer.Type                  83
Sales.Channel                     83
Total.Claim.Amount                 0
Vehicle.Class                     84
Vehicle.Size                       0
dtype: int64

In [181]:
X_test.isna().sum()

Coverage                         262
Education                         41
EmploymentStatus                  36
Gender                            41
Income                            36
Location.Code                     32
Marital.Status                    41
Monthly.Premium.Auto             238
Months.Since.Last.Claim            0
Months.Since.Policy.Inception      0
Number.of.Open.Complaints        244
Number.of.Policies                35
Policy.Type                      277
Policy                            35
Renew.Offer.Type                  45
Sales.Channel                     45
Total.Claim.Amount                 0
Vehicle.Class                     42
Vehicle.Size                       0
dtype: int64

In [179]:
######################################################## Impute NA values #####################################################

In [185]:
## fillna on a categorical column needs the fill value to exist as a category, so add an 'Unknown' level first.
X_train['Coverage'] = X_train['Coverage'].cat.add_categories('Unknown')

## Fill NA values with Unknown for Coverage column of train data.
X_train['Coverage'] = X_train['Coverage'].fillna('Unknown')

In [218]:
## fillna on a categorical column needs the fill value to exist as a category, so add an 'Unknown' level first.
X_test['Coverage'] = X_test['Coverage'].cat.add_categories('Unknown')

## Fill NA values with Unknown for Coverage column of validation data.
X_test['Coverage'] = X_test['Coverage'].fillna('Unknown')

In [219]:
## fillna on a categorical column needs the fill value to exist as a category, so add an 'Unknown' level first.
test['Coverage'] = test['Coverage'].cat.add_categories('Unknown')

## Fill NA values with Unknown for Coverage column of test data.
test['Coverage'] = test['Coverage'].fillna('Unknown')
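
The same two steps (add an `Unknown` level, then fill) recur for each categorical column and for each of the three frames. One possible way to factor that out is sketched below. The helper `fill_unknown` and the demo frame are hypothetical and are not part of the original notebook.

```python
import numpy as np
import pandas as pd

def fill_unknown(df: pd.DataFrame, columns, label: str = "Unknown") -> pd.DataFrame:
    """Return a copy of df with NA in the given categorical columns set to label."""
    out = df.copy()
    for col in columns:
        # fillna on a categorical column needs the label to be an existing category.
        if label not in out[col].cat.categories:
            out[col] = out[col].cat.add_categories(label)
        out[col] = out[col].fillna(label)
    return out

# Hypothetical frame with one missing value per column.
demo = pd.DataFrame({
    "Coverage": pd.Categorical(["Basic", np.nan, "Premium"]),
    "Gender": pd.Categorical(["F", "M", np.nan]),
})
filled = fill_unknown(demo, ["Coverage", "Gender"])
print(filled)
```

With such a helper, a call like `fill_unknown(X_train, ['Coverage', 'Education', 'EmploymentStatus', 'Gender'])` could stand in for the per-column cells, and the guard on existing categories makes it safe to rerun.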

In [189]:
## fillna on a categorical column needs the fill value to exist as a category, so add an 'Unknown' level first.
X_train['Education'] = X_train['Education'].cat.add_categories('Unknown')

## Fill NA values with Unknown for Education column of train data.
X_train['Education'] = X_train['Education'].fillna('Unknown')

In [220]:
## fillna on a categorical column needs the fill value to exist as a category, so add an 'Unknown' level first.
X_test['Education'] = X_test['Education'].cat.add_categories('Unknown')

## Fill NA values with Unknown for Education column of validation data.
X_test['Education'] = X_test['Education'].fillna('Unknown')

In [221]:
## fillna on a categorical column needs the fill value to exist as a category, so add an 'Unknown' level first.
test['Education'] = test['Education'].cat.add_categories('Unknown')

## Fill NA values with Unknown for Education column of test data.
test['Education'] = test['Education'].fillna('Unknown')

In [191]:
##  fillna requires a value that already exists as a category for categorical columns so that's why adding Unkown level.
X_train['EmploymentStatus'] = X_train['EmploymentStatus'].cat.add_categories('Unknown')

## Fill NA values with Unknown for EmploymentStatus column of train data.
X_train['EmploymentStatus'].fillna('Unknown',inplace=True)

In [222]:
##  fillna requires a value that already exists as a category for categorical columns so that's why adding Unkown level.
X_test['EmploymentStatus'] = X_test['EmploymentStatus'].cat.add_categories('Unknown')

## Fill NA values with Unknown for EmploymentStatus column of validation data.
X_test['EmploymentStatus'].fillna('Unknown',inplace=True)

In [223]:
## fillna on a categorical column requires the fill value to already be a category, so add an 'Unknown' level first.
test['EmploymentStatus'] = test['EmploymentStatus'].cat.add_categories('Unknown')

## Fill NA values with Unknown for EmploymentStatus column of test data.
test['EmploymentStatus'].fillna('Unknown',inplace=True)

In [193]:
## fillna on a categorical column requires the fill value to already be a category, so add an 'Unknown' level first.
X_train['Gender'] = X_train['Gender'].cat.add_categories('Unknown')

## Fill NA values with Unknown for Gender column of train data.
X_train['Gender'].fillna('Unknown',inplace=True)

In [224]:
## fillna on a categorical column requires the fill value to already be a category, so add an 'Unknown' level first.
X_test['Gender'] = X_test['Gender'].cat.add_categories('Unknown')

## Fill NA values with Unknown for Gender column of validation data.
X_test['Gender'].fillna('Unknown',inplace=True)

In [225]:
## fillna on a categorical column requires the fill value to already be a category, so add an 'Unknown' level first.
test['Gender'] = test['Gender'].cat.add_categories('Unknown')

## Fill NA values with Unknown for Gender column of test data.
test['Gender'].fillna('Unknown',inplace=True)

In [195]:
## Fill NA values with 0 for Income column of train data.
X_train['Income'].fillna(0,inplace=True)

In [226]:
## Fill NA values with 0 for Income column of validation data.
X_test['Income'].fillna(0,inplace=True)

In [227]:
## Fill NA values with 0 for Income column of test data.
test['Income'].fillna(0,inplace=True)

In [198]:
## fillna on a categorical column requires the fill value to already be a category, so add an 'Unknown' level first.
X_train['Location.Code'] = X_train['Location.Code'].cat.add_categories('Unknown')

## Fill NA values with Unknown for Location.Code column of train data.
X_train['Location.Code'].fillna('Unknown',inplace=True)

In [None]:
## fillna on a categorical column requires the fill value to already be a category, so add an 'Unknown' level first.
X_test['Location.Code'] = X_test['Location.Code'].cat.add_categories('Unknown')

## Fill NA values with Unknown for Location.Code column of validation data.
X_test['Location.Code'].fillna('Unknown',inplace=True)

In [230]:
## fillna on a categorical column requires the fill value to already be a category, so add an 'Unknown' level first.
test['Location.Code'] = test['Location.Code'].cat.add_categories('Unknown')

## Fill NA values with Unknown for Location.Code column of test data.
test['Location.Code'].fillna('Unknown',inplace=True)

In [200]:
## fillna on a categorical column requires the fill value to already be a category, so add an 'Unknown' level first.
X_train['Marital.Status'] = X_train['Marital.Status'].cat.add_categories('Unknown')

## Fill NA values with Unknown for Marital.Status column of train data.
X_train['Marital.Status'].fillna('Unknown',inplace=True)

In [231]:
## fillna on a categorical column requires the fill value to already be a category, so add an 'Unknown' level first.
X_test['Marital.Status'] = X_test['Marital.Status'].cat.add_categories('Unknown')

## Fill NA values with Unknown for Marital.Status column of validation data.
X_test['Marital.Status'].fillna('Unknown',inplace=True)

In [232]:
## fillna on a categorical column requires the fill value to already be a category, so add an 'Unknown' level first.
test['Marital.Status'] = test['Marital.Status'].cat.add_categories('Unknown')

## Fill NA values with Unknown for Marital.Status column of test data.
test['Marital.Status'].fillna('Unknown',inplace=True)

In [202]:
## Fill NA values with 0 for Monthly.Premium.Auto column of train data.
X_train['Monthly.Premium.Auto'].fillna(0,inplace=True)

In [233]:
## Fill NA values with 0 for Monthly.Premium.Auto column of validation data.
X_test['Monthly.Premium.Auto'].fillna(0,inplace=True)

In [234]:
## Fill NA values with 0 for Monthly.Premium.Auto column of test data.
test['Monthly.Premium.Auto'].fillna(0,inplace=True)

In [204]:
## Fill NA values with 0 for Number.of.Open.Complaints column of train data.
X_train['Number.of.Open.Complaints'].fillna(0,inplace=True)

In [236]:
## Fill NA values with 0 for Number.of.Open.Complaints column of validation data.
X_test['Number.of.Open.Complaints'].fillna(0,inplace=True)

In [237]:
## Fill NA values with 0 for Number.of.Open.Complaints column of test data.
test['Number.of.Open.Complaints'].fillna(0,inplace=True)

In [206]:
## Fill NA values with 0 for Number.of.Policies column of train data.
X_train['Number.of.Policies'].fillna(0,inplace=True)

In [238]:
## Fill NA values with 0 for Number.of.Policies column of validation data.
X_test['Number.of.Policies'].fillna(0,inplace=True)

In [239]:
## Fill NA values with 0 for Number.of.Policies column of test data.
test['Number.of.Policies'].fillna(0,inplace=True)

In [208]:
## fillna on a categorical column requires the fill value to already be a category, so add an 'Unknown' level first.
X_train['Policy.Type'] = X_train['Policy.Type'].cat.add_categories('Unknown')

## Fill NA values with Unknown for Policy.Type column of train data.
X_train['Policy.Type'].fillna('Unknown',inplace=True)

In [240]:
## fillna on a categorical column requires the fill value to already be a category, so add an 'Unknown' level first.
X_test['Policy.Type'] = X_test['Policy.Type'].cat.add_categories('Unknown')

## Fill NA values with Unknown for Policy.Type column of validation data.
X_test['Policy.Type'].fillna('Unknown',inplace=True)

In [241]:
## fillna on a categorical column requires the fill value to already be a category, so add an 'Unknown' level first.
test['Policy.Type'] = test['Policy.Type'].cat.add_categories('Unknown')

## Fill NA values with Unknown for Policy.Type column of test data.
test['Policy.Type'].fillna('Unknown',inplace=True)

In [210]:
## fillna on a categorical column requires the fill value to already be a category, so add an 'Unknown' level first.
X_train['Policy'] = X_train['Policy'].cat.add_categories('Unknown')

## Fill NA values with Unknown for Policy column of train data.
X_train['Policy'].fillna('Unknown',inplace=True)

In [243]:
## fillna on a categorical column requires the fill value to already be a category, so add an 'Unknown' level first.
X_test['Policy'] = X_test['Policy'].cat.add_categories('Unknown')

## Fill NA values with Unknown for Policy column of validation data.
X_test['Policy'].fillna('Unknown',inplace=True)

In [242]:
## fillna on a categorical column requires the fill value to already be a category, so add an 'Unknown' level first.
test['Policy'] = test['Policy'].cat.add_categories('Unknown')

## Fill NA values with Unknown for Policy column of test data.
test['Policy'].fillna('Unknown',inplace=True)

In [212]:
## fillna on a categorical column requires the fill value to already be a category, so add an 'Unknown' level first.
X_train['Renew.Offer.Type'] = X_train['Renew.Offer.Type'].cat.add_categories('Unknown')

## Fill NA values with Unknown for Renew.Offer.Type column of train data.
X_train['Renew.Offer.Type'].fillna('Unknown',inplace=True)

In [244]:
## fillna on a categorical column requires the fill value to already be a category, so add an 'Unknown' level first.
X_test['Renew.Offer.Type'] = X_test['Renew.Offer.Type'].cat.add_categories('Unknown')

## Fill NA values with Unknown for Renew.Offer.Type column of validation data.
X_test['Renew.Offer.Type'].fillna('Unknown',inplace=True)

In [245]:
## fillna on a categorical column requires the fill value to already be a category, so add an 'Unknown' level first.
test['Renew.Offer.Type'] = test['Renew.Offer.Type'].cat.add_categories('Unknown')

## Fill NA values with Unknown for Renew.Offer.Type column of test data.
test['Renew.Offer.Type'].fillna('Unknown',inplace=True)

In [214]:
## fillna on a categorical column requires the fill value to already be a category, so add an 'Unknown' level first.
X_train['Sales.Channel'] = X_train['Sales.Channel'].cat.add_categories('Unknown')

## Fill NA values with Unknown for Sales.Channel column of train data.
X_train['Sales.Channel'].fillna('Unknown',inplace=True)

In [246]:
## fillna on a categorical column requires the fill value to already be a category, so add an 'Unknown' level first.
X_test['Sales.Channel'] = X_test['Sales.Channel'].cat.add_categories('Unknown')

## Fill NA values with Unknown for Sales.Channel column of validation data.
X_test['Sales.Channel'].fillna('Unknown',inplace=True)

In [247]:
## fillna on a categorical column requires the fill value to already be a category, so add an 'Unknown' level first.
test['Sales.Channel'] = test['Sales.Channel'].cat.add_categories('Unknown')

## Fill NA values with Unknown for Sales.Channel column of test data.
test['Sales.Channel'].fillna('Unknown',inplace=True)

In [216]:
## fillna on a categorical column requires the fill value to already be a category, so add an 'Unknown' level first.
X_train['Vehicle.Class'] = X_train['Vehicle.Class'].cat.add_categories('Unknown')

## Fill NA values with Unknown for Vehicle.Class column of train data.
X_train['Vehicle.Class'].fillna('Unknown',inplace=True)

In [248]:
## fillna on a categorical column requires the fill value to already be a category, so add an 'Unknown' level first.
X_test['Vehicle.Class'] = X_test['Vehicle.Class'].cat.add_categories('Unknown')

## Fill NA values with Unknown for Vehicle.Class column of validation data.
X_test['Vehicle.Class'].fillna('Unknown',inplace=True)

In [249]:
## fillna on a categorical column requires the fill value to already be a category, so add an 'Unknown' level first.
test['Vehicle.Class'] = test['Vehicle.Class'].cat.add_categories('Unknown')

## Fill NA values with Unknown for Vehicle.Class column of test data.
test['Vehicle.Class'].fillna('Unknown',inplace=True)
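
The cells above repeat the same two steps (add an 'Unknown' category, then fill NA values) for every categorical column and every data split. A minimal sketch of how those steps could be folded into one loop follows. The helper name `fill_categorical_na` and the toy frame are illustrative and not part of the original notebook. The sketch assigns the filled column back instead of calling `fillna(..., inplace=True)` on a column selection, since newer pandas versions warn about that pattern.

```python
import numpy as np
import pandas as pd

def fill_categorical_na(df, columns, level="Unknown"):
    """Add `level` as a category to each listed column and use it to fill NA values."""
    for col in columns:
        # Register the new level only if it is not already a category.
        if level not in df[col].cat.categories:
            df[col] = df[col].cat.add_categories(level)
        # Assign back rather than filling in place on a column selection.
        df[col] = df[col].fillna(level)
    return df

# Toy frame standing in for X_train, X_test or test (hypothetical values).
toy = pd.DataFrame({
    "Coverage": pd.Categorical(["Basic", None, "Premium"]),
    "Gender": pd.Categorical(["F", "M", None]),
    "Income": [100.0, np.nan, 300.0],
})

toy = fill_categorical_na(toy, ["Coverage", "Gender"])
toy["Income"] = toy["Income"].fillna(0)  # numeric columns are filled with 0, as in the notebook

print(toy.isna().sum())
```

The same call could then be applied to each split in turn, passing the list of categorical column names once.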

In [250]:
## Check NA values for train data after imputing.
X_train.isna().sum()

Coverage                         0
Education                        0
EmploymentStatus                 0
Gender                           0
Income                           0
Location.Code                    0
Marital.Status                   0
Monthly.Premium.Auto             0
Months.Since.Last.Claim          0
Months.Since.Policy.Inception    0
Number.of.Open.Complaints        0
Number.of.Policies               0
Policy.Type                      0
Policy                           0
Renew.Offer.Type                 0
Sales.Channel                    0
Total.Claim.Amount               0
Vehicle.Class                    0
Vehicle.Size                     0
dtype: int64

In [251]:
## Check NA values for validation data after imputing.
X_test.isna().sum()

Coverage                         0
Education                        0
EmploymentStatus                 0
Gender                           0
Income                           0
Location.Code                    0
Marital.Status                   0
Monthly.Premium.Auto             0
Months.Since.Last.Claim          0
Months.Since.Policy.Inception    0
Number.of.Open.Complaints        0
Number.of.Policies               0
Policy.Type                      0
Policy                           0
Renew.Offer.Type                 0
Sales.Channel                    0
Total.Claim.Amount               0
Vehicle.Class                    0
Vehicle.Size                     0
dtype: int64

In [252]:
## Check NA values for test data after imputing.
test.isna().sum()

Coverage                         0
Education                        0
EmploymentStatus                 0
Gender                           0
Income                           0
Location.Code                    0
Marital.Status                   0
Monthly.Premium.Auto             0
Months.Since.Last.Claim          0
Months.Since.Policy.Inception    0
Number.of.Open.Complaints        0
Number.of.Policies               0
Policy.Type                      0
Policy                           0
Renew.Offer.Type                 0
Sales.Channel                    0
Total.Claim.Amount               0
Vehicle.Class                    0
Vehicle.Size                     0
dtype: int64

In [253]:
#################################################### Label Encoding ###########################################################

In [256]:
le_coverage = LabelEncoder()
le_eduction = LabelEncoder()
le_employementStatus = LabelEncoder()
le_gender = LabelEncoder()
le_location_code = LabelEncoder()
le_marital_status = LabelEncoder()
le_policy_type = LabelEncoder()
le_policy = LabelEncoder()
le_renew_offer_type = LabelEncoder()
le_sales_channel = LabelEncoder()
le_vehicle_class = LabelEncoder()
le_vehicle_size = LabelEncoder()

In [257]:
## Do label encoding on train data.
X_train['Coverage'] = le_coverage.fit_transform(X_train['Coverage'])
X_train['Education'] = le_eduction.fit_transform(X_train['Education'])
X_train['EmploymentStatus'] = le_employementStatus.fit_transform(X_train['EmploymentStatus'])
X_train['Gender'] = le_gender.fit_transform(X_train['Gender'])
X_train['Location.Code'] = le_location_code.fit_transform(X_train['Location.Code'])
X_train['Marital.Status'] = le_marital_status.fit_transform(X_train['Marital.Status'])
X_train['Policy.Type'] = le_policy_type.fit_transform(X_train['Policy.Type'])
X_train['Policy'] = le_policy.fit_transform(X_train['Policy'])
X_train['Renew.Offer.Type'] = le_renew_offer_type.fit_transform(X_train['Renew.Offer.Type'])
X_train['Sales.Channel'] = le_sales_channel.fit_transform(X_train['Sales.Channel'])
X_train['Vehicle.Class'] = le_vehicle_class.fit_transform(X_train['Vehicle.Class'])
X_train['Vehicle.Size'] = le_vehicle_size.fit_transform(X_train['Vehicle.Size'])

In [258]:
## Do label encoding on validation data.
X_test['Coverage'] = le_coverage.transform(X_test['Coverage'])
X_test['Education'] = le_eduction.transform(X_test['Education'])
X_test['EmploymentStatus'] = le_employementStatus.transform(X_test['EmploymentStatus'])
X_test['Gender'] = le_gender.transform(X_test['Gender'])
X_test['Location.Code'] = le_location_code.transform(X_test['Location.Code'])
X_test['Marital.Status'] = le_marital_status.transform(X_test['Marital.Status'])
X_test['Policy.Type'] = le_policy_type.transform(X_test['Policy.Type'])
X_test['Policy'] = le_policy.transform(X_test['Policy'])
X_test['Renew.Offer.Type'] = le_renew_offer_type.transform(X_test['Renew.Offer.Type'])
X_test['Sales.Channel'] = le_sales_channel.transform(X_test['Sales.Channel'])
X_test['Vehicle.Class'] = le_vehicle_class.transform(X_test['Vehicle.Class'])
X_test['Vehicle.Size'] = le_vehicle_size.transform(X_test['Vehicle.Size'])

In [259]:
## Do label encoding on test data.
test['Coverage'] = le_coverage.transform(test['Coverage'])
test['Education'] = le_eduction.transform(test['Education'])
test['EmploymentStatus'] = le_employementStatus.transform(test['EmploymentStatus'])
test['Gender'] = le_gender.transform(test['Gender'])
test['Location.Code'] = le_location_code.transform(test['Location.Code'])
test['Marital.Status'] = le_marital_status.transform(test['Marital.Status'])
test['Policy.Type'] = le_policy_type.transform(test['Policy.Type'])
test['Policy'] = le_policy.transform(test['Policy'])
test['Renew.Offer.Type'] = le_renew_offer_type.transform(test['Renew.Offer.Type'])
test['Sales.Channel'] = le_sales_channel.transform(test['Sales.Channel'])
test['Vehicle.Class'] = le_vehicle_class.transform(test['Vehicle.Class'])
test['Vehicle.Size'] = le_vehicle_size.transform(test['Vehicle.Size'])
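
The three encoding cells above follow one rule: fit each encoder on the train split only, then reuse it on the validation and test splits. A compact sketch of that rule with a dictionary of encoders is given below. The names `cat_cols` and `encoders` and the toy frames are illustrative assumptions. One caveat worth keeping in mind is that `LabelEncoder.transform` raises a `ValueError` for a label it did not see during `fit`, so a level such as 'Unknown' that appears only in the validation or test split would fail at this step.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

cat_cols = ["Coverage", "Gender"]  # illustrative subset of the categorical columns

# Toy splits standing in for X_train and X_test (hypothetical values).
train_toy = pd.DataFrame({"Coverage": ["Basic", "Premium", "Unknown"],
                          "Gender": ["F", "M", "F"]})
valid_toy = pd.DataFrame({"Coverage": ["Premium", "Basic"],
                          "Gender": ["M", "M"]})

encoders = {}
for col in cat_cols:
    # Fit on the train split only, then apply the same mapping to the other split.
    encoders[col] = LabelEncoder().fit(train_toy[col])
    train_toy[col] = encoders[col].transform(train_toy[col])
    valid_toy[col] = encoders[col].transform(valid_toy[col])

# A label that was never seen during fit makes transform raise ValueError.
unseen_error = None
try:
    encoders["Coverage"].transform(["Extended"])
except ValueError as err:
    unseen_error = str(err)

print(valid_toy)
print(unseen_error)
```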

In [260]:
## Check train data after doing label encoding.
X_train.head(1)

CustomerID,Coverage,Education,EmploymentStatus,Gender,Income,Location.Code,Marital.Status,Monthly.Premium.Auto,Months.Since.Last.Claim,Months.Since.Policy.Inception,Number.of.Open.Complaints,Number.of.Policies,Policy.Type,Policy,Renew.Offer.Type,Sales.Channel,Total.Claim.Amount,Vehicle.Class,Vehicle.Size
9502,0,2,1,1,34614.0,1,1,0.0,12,97,0.0,2.0,3,3,2,1,413.25233,0,1


In [261]:
## Check validation data after doing label encoding.
X_test.head(1)

CustomerID,Coverage,Education,EmploymentStatus,Gender,Income,Location.Code,Marital.Status,Monthly.Premium.Auto,Months.Since.Last.Claim,Months.Since.Policy.Inception,Number.of.Open.Complaints,Number.of.Policies,Policy.Type,Policy,Renew.Offer.Type,Sales.Channel,Total.Claim.Amount,Vehicle.Class,Vehicle.Size
4141,0,4,0,1,22239.0,1,1,62.0,34,85,0.0,1.0,1,5,1,1,297.6,0,2


In [265]:
## Check test data after doing label encoding.
test.head(1)

CustomerID,Coverage,Education,EmploymentStatus,Gender,Income,Location.Code,Marital.Status,Monthly.Premium.Auto,Months.Since.Last.Claim,Months.Since.Policy.Inception,Number.of.Open.Complaints,Number.of.Policies,Policy.Type,Policy,Renew.Offer.Type,Sales.Channel,Total.Claim.Amount,Vehicle.Class,Vehicle.Size
17,0,0,1,1,43836.0,0,2,73.0,12,44,0,1,1,3,0,0,138.130879,0,1


In [267]:
## Check correlation between numeric columns of train data.
X_train.select_dtypes(['int64','float64']).corr()

,Income,Monthly.Premium.Auto,Months.Since.Last.Claim,Months.Since.Policy.Inception,Number.of.Open.Complaints,Number.of.Policies,Total.Claim.Amount
Income,1.0,-0.016566,-0.026203,0.006779,-0.000328,-0.001959,-0.34483
Monthly.Premium.Auto,-0.016566,1.0,-0.008001,0.016686,0.001556,-0.02097,0.490546
Months.Since.Last.Claim,-0.026203,-0.008001,1.0,-0.043324,0.009599,0.010029,0.003929
Months.Since.Policy.Inception,0.006779,0.016686,-0.043324,1.0,-0.027242,0.002154,0.009129
Number.of.Open.Complaints,-0.000328,0.001556,0.009599,-0.027242,1.0,0.009524,-0.019939
Number.of.Policies,-0.001959,-0.02097,0.010029,0.002154,0.009524,1.0,-0.020202
Total.Claim.Amount,-0.34483,0.490546,0.003929,0.009129,-0.019939,-0.020202,1.0


In [268]:
## Check correlation between numeric columns of validation data.
X_test.select_dtypes(['int64','float64']).corr()

,Income,Monthly.Premium.Auto,Months.Since.Last.Claim,Months.Since.Policy.Inception,Number.of.Open.Complaints,Number.of.Policies,Total.Claim.Amount
Income,1.0,-0.012955,-0.012346,0.018783,-0.014833,0.009596,-0.361669
Monthly.Premium.Auto,-0.012955,1.0,0.034616,0.007882,-0.006084,0.035743,0.448634
Months.Since.Last.Claim,-0.012346,0.034616,1.0,-0.005449,0.017473,-0.005598,0.021679
Months.Since.Policy.Inception,0.018783,0.007882,-0.005449,1.0,0.010549,-0.002297,-0.006707
Number.of.Open.Complaints,-0.014833,-0.006084,0.017473,0.010549,1.0,-0.040767,0.006852
Number.of.Policies,0.009596,0.035743,-0.005598,-0.002297,-0.040767,1.0,0.017134
Total.Claim.Amount,-0.361669,0.448634,0.021679,-0.006707,0.006852,0.017134,1.0


In [269]:
## Check correlation between numeric columns of test data.
test.select_dtypes(['int64','float64']).corr()

,Income,Monthly.Premium.Auto,Months.Since.Last.Claim,Months.Since.Policy.Inception,Number.of.Open.Complaints,Number.of.Policies,Total.Claim.Amount
Income,1.0,0.032953,-0.031149,-0.000737,0.011579,-0.028624,-0.365328
Monthly.Premium.Auto,0.032953,1.0,-0.007911,0.012925,0.031133,-0.005882,0.526968
Months.Since.Last.Claim,-0.031149,-0.007911,1.0,-0.059982,-0.012031,-0.016154,-0.007014
Months.Since.Policy.Inception,-0.000737,0.012925,-0.059982,1.0,0.008251,-0.028303,-0.015138
Number.of.Open.Complaints,0.011579,0.031133,-0.012031,0.008251,1.0,0.020432,-0.006349
Number.of.Policies,-0.028624,-0.005882,-0.016154,-0.028303,0.020432,1.0,0.017394
Total.Claim.Amount,-0.365328,0.526968,-0.007014,-0.015138,-0.006349,0.017394,1.0


In [271]:
## Create an empty dataframe and calculate VIF for train data.
vif=pd.DataFrame()
vif['Vif']=[variance_inflation_factor(X_train.values,i) for i in range(X_train.shape[1])]
vif['Variables']=X_train.columns.values
vif

,Vif,Variables
0,1.664872,Coverage
1,2.298623,Education
2,5.146637,EmploymentStatus
3,1.990233,Gender
4,4.270327,Income
5,2.619114,Location.Code
6,4.407563,Marital.Status
7,7.058076,Monthly.Premium.Auto
8,3.081701,Months.Since.Last.Claim
9,3.679393,Months.Since.Policy.Inception


In [272]:
## Create an empty dataframe and calculate VIF for validation data.
vif=pd.DataFrame()
vif['Vif']=[variance_inflation_factor(X_test.values,i) for i in range(X_test.shape[1])]
vif['Variables']=X_test.columns.values
vif

,Vif,Variables
0,1.645248,Coverage
1,2.264995,Education
2,4.798009,EmploymentStatus
3,2.013261,Gender
4,4.269196,Income
5,2.609732,Location.Code
6,4.380495,Marital.Status
7,6.641944,Monthly.Premium.Auto
8,3.046397,Months.Since.Last.Claim
9,3.728155,Months.Since.Policy.Inception


In [273]:
## Create an empty dataframe and calculate VIF for test data.
vif=pd.DataFrame()
vif['Vif']=[variance_inflation_factor(test.values,i) for i in range(test.shape[1])]
vif['Variables']=test.columns.values
vif

,Vif,Variables
0,1.859115,Coverage
1,2.277473,Education
2,5.353837,EmploymentStatus
3,2.026935,Gender
4,4.861503,Income
5,2.521012,Location.Code
6,4.655047,Marital.Status
7,10.072863,Monthly.Premium.Auto
8,2.992889,Months.Since.Last.Claim
9,3.660844,Months.Since.Policy.Inception


In [274]:
## VIF is high for two columns. We can build one model without those two columns and another that includes them,
## then compare the results.
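
The VIF cells above pass the raw feature matrix to `variance_inflation_factor`. statsmodels does not add an intercept inside that function, so each auxiliary regression is fit without a constant, which can give different (often larger) values than the textbook definition VIF_j = 1 / (1 - R²_j), where R²_j comes from regressing column j on the remaining columns plus an intercept. The sketch below computes that textbook version with plain numpy. The function name `vif_with_intercept` and the synthetic columns are illustrative assumptions, not part of the notebook.

```python
import numpy as np

def vif_with_intercept(X):
    """Return the VIF of each column of X, regressing it on the other columns plus an intercept."""
    X = np.asarray(X, dtype=float)
    n_rows, n_cols = X.shape
    vifs = []
    for j in range(n_cols):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        # Design matrix: intercept column followed by the remaining features.
        design = np.column_stack([np.ones(n_rows), others])
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        residual = target - design @ coef
        ss_res = residual @ residual
        ss_tot = ((target - target.mean()) ** 2).sum()
        r_squared = 1.0 - ss_res / ss_tot
        vifs.append(1.0 / (1.0 - r_squared))
    return np.array(vifs)

# Synthetic data: column c is nearly a copy of column a, column b is independent.
rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)
c = a + 0.1 * rng.normal(size=500)

vifs = vif_with_intercept(np.column_stack([a, b, c]))
print(vifs)
```

On this synthetic example the two nearly collinear columns get large VIFs while the independent column stays close to 1, which is the pattern the VIF check is meant to surface.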

In [266]:
########################################## Build Different Models #############################################################

In [275]:
################################################# Linear Regression ###########################################################

In [449]:
## Instantiate a linear regression model and fit it.
linreg=LinearRegression()
linear_model=linreg.fit(X_train,y_train)

In [450]:
## Get the predictions on train and validation data.
pred_train = linear_model.predict(X_train)
pred_test = linear_model.predict(X_test)

In [451]:
## Get predictions on test data.
test_pred = linear_model.predict(test)

In [452]:
## Display RMSE value for train and validation data.
print("Train Error:",sqrt(mean_squared_error(y_train, pred_train)))
print("Test Error:",sqrt(mean_squared_error(y_test, pred_test)))

Train Error: 6482.602859866607
Test Error: 6260.030152766292
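
The errors printed above are root mean squared errors, RMSE = sqrt(mean((y - y_pred)^2)), in the same units as the target. An RMSE is easier to judge next to a naive baseline that always predicts the training mean. The sketch below shows both the metric and such a baseline with numpy. The `rmse` helper and the toy target values are illustrative assumptions and are not taken from the notebook's data.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two equally sized arrays."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Toy targets standing in for y_train and y_test (hypothetical values).
y_train_toy = np.array([2000.0, 4000.0, 6000.0, 8000.0])
y_valid_toy = np.array([3000.0, 7000.0])

# Baseline: predict the training mean for every validation row.
baseline_pred = np.full_like(y_valid_toy, y_train_toy.mean())
baseline_rmse = rmse(y_valid_toy, baseline_pred)

print("Baseline RMSE:", baseline_rmse)
```

A model whose validation RMSE is not clearly below this kind of baseline has learned little from the features.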


In [453]:
## Prepare a dataframe with the test data index and the predicted values.
dataframe = pd.DataFrame({'CustomerID' : test.index,
                          'Customer.Lifetime.Value' : test_pred})

In [454]:
## Check dimensions of test data.
test.shape

(1767, 19)

In [455]:
## Check dimensions of dataframe.
dataframe.shape

(1767, 2)

In [456]:
## Copy dataframe data into a CSV file.
dataframe.to_csv('LinearRegression.csv',index=False)
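
The same four steps (build the output dataframe, check both shapes, write the CSV) are repeated after every model in this notebook. A small helper along the following lines could replace them. The name `make_submission` and the toy inputs are hypothetical. The length check stands in for the manual shape comparison.

```python
import pandas as pd

def make_submission(index, predictions, path=None):
    """Pair each CustomerID with its predicted value and optionally write the result to CSV."""
    if len(index) != len(predictions):
        raise ValueError("index and predictions must have the same length")
    out = pd.DataFrame({"CustomerID": index,
                        "Customer.Lifetime.Value": predictions})
    if path is not None:
        out.to_csv(path, index=False)
    return out

# Toy inputs standing in for test.index and test_pred (hypothetical values).
toy_index = pd.Index([17, 42, 99], name="CustomerID")
submission = make_submission(toy_index, [1000.0, 2500.5, 4000.0])

print(submission.shape)
```

With the notebook's objects this would be called as `make_submission(test.index, test_pred, 'LinearRegression.csv')`, changing only the file name per model.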

In [282]:
############################################### Decision Tree ##################################################################

In [457]:
## Instantiate and fit a regression model.
dtr = DecisionTreeRegressor(max_depth=5,min_samples_leaf=10,min_samples_split=5,random_state=123)
dtr.fit(X_train,y_train)

DecisionTreeRegressor(ccp_alpha=0.0, criterion='mse', max_depth=5,
                      max_features=None, max_leaf_nodes=None,
                      min_impurity_decrease=0.0, min_impurity_split=None,
                      min_samples_leaf=10, min_samples_split=5,
                      min_weight_fraction_leaf=0.0, presort='deprecated',
                      random_state=123, splitter='best')

In [458]:
## Get the predictions on train and validation data.
pred_train = dtr.predict(X_train)
pred_test = dtr.predict(X_test)

In [459]:
## Get predictions for test data.
test_pred = dtr.predict(test)

In [460]:
## Display train and validation RMSE.
print("Train Error:",sqrt(mean_squared_error(y_train, pred_train)))
print("Test Error:",sqrt(mean_squared_error(y_test, pred_test)))

Train Error: 4018.551584252528
Test Error: 3943.9726818873096


In [461]:
## Prepare a dataframe with the test data index and the predicted values.
dataframe = pd.DataFrame({'CustomerID' : test.index,
                          'Customer.Lifetime.Value' : test_pred})

In [462]:
## Check dimensions of test data.
test.shape

(1767, 19)

In [463]:
## Check dimensions of dataframe.
dataframe.shape

(1767, 2)

In [464]:
## Copy dataframe data into a CSV file.
dataframe.to_csv('DecisionTree.csv',index=False)

In [465]:
############################################## Random Forest ##################################################################

In [466]:
## Instantiate a regressor model.
rc = RandomForestRegressor(n_estimators= 200, max_depth= 10 ,min_samples_leaf = 4 ,max_features='sqrt')

In [467]:
## Fit a model.
rc.fit(X_train,y_train)

RandomForestRegressor(bootstrap=True, ccp_alpha=0.0, criterion='mse',
                      max_depth=10, max_features='sqrt', max_leaf_nodes=None,
                      max_samples=None, min_impurity_decrease=0.0,
                      min_impurity_split=None, min_samples_leaf=4,
                      min_samples_split=2, min_weight_fraction_leaf=0.0,
                      n_estimators=200, n_jobs=None, oob_score=False,
                      random_state=None, verbose=0, warm_start=False)

In [468]:
## Get the predictions on train and validation data.
pred_train = rc.predict(X_train)
pred_test = rc.predict(X_test)

In [469]:
## Get predictions on test data.
test_pred = rc.predict(test)

In [470]:
## Display RMSE values for train and validation data.
print("Train Error:",sqrt(mean_squared_error(y_train, pred_train)))
print("Test Error:",sqrt(mean_squared_error(y_test, pred_test)))

Train Error: 3585.24914088437
Test Error: 3920.0170055197127


In [471]:
## Prepare a dataframe with the test data index and the predicted values.
dataframe = pd.DataFrame({'CustomerID' : test.index,
                          'Customer.Lifetime.Value' : test_pred})

In [472]:
## Check dimensions of test data.
test.shape

(1767, 19)

In [473]:
## Check dimensions of dataframe.
dataframe.shape

(1767, 2)

In [474]:
## Copy dataframe data into a CSV file.
dataframe.to_csv('RandomForest.csv',index=False)

In [475]:
################################################### KNN #######################################################################

In [476]:
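## Import the KNN regressor (needed if not already imported above), then instantiate the model and fit it.
from sklearn.neighbors import KNeighborsRegressor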
knn = KNeighborsRegressor(algorithm = 'brute', n_neighbors = 10,
                           metric = "euclidean")
knn.fit(X_train, y_train)

KNeighborsRegressor(algorithm='brute', leaf_size=30, metric='euclidean',
                    metric_params=None, n_jobs=None, n_neighbors=10, p=2,
                    weights='uniform')

In [477]:
## Get the predictions on train and validation.
pred_train = knn.predict(X_train)
pred_test = knn.predict(X_test)

In [478]:
## Get predictions on test data.
test_pred = knn.predict(test)

In [479]:
## Display RMSE values for train and validation data.
print("Train Error:",sqrt(mean_squared_error(y_train, pred_train)))
print("Test Error:",sqrt(mean_squared_error(y_test, pred_test)))

Train Error: 6057.612456279684
Test Error: 6451.937354796598
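
KNN with a Euclidean metric is sensitive to feature scale. In this dataset Income is in the tens of thousands while the label-encoded columns are single digits, so the distance is likely dominated by Income. The sketch below shows, on synthetic data only, how standardizing the features inside a pipeline can change the result. It is an assumption-laden illustration, not the notebook's method, and the effect on the real data has not been verified here.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)))

# Synthetic data: a large-scale feature unrelated to the target
# and a small-scale feature that drives it.
rng = np.random.default_rng(0)
n = 400
income = rng.uniform(0, 100_000, size=n)
code = rng.integers(0, 5, size=n).astype(float)
X = np.column_stack([income, code])
y = 1000.0 * code + rng.normal(scale=50.0, size=n)

X_fit, X_val = X[:300], X[300:]
y_fit, y_val = y[:300], y[300:]

# Same model twice: once on raw features, once with standardization in front.
raw_knn = KNeighborsRegressor(n_neighbors=10).fit(X_fit, y_fit)
scaled_knn = make_pipeline(StandardScaler(),
                           KNeighborsRegressor(n_neighbors=10)).fit(X_fit, y_fit)

raw_rmse = rmse(y_val, raw_knn.predict(X_val))
scaled_rmse = rmse(y_val, scaled_knn.predict(X_val))
print("Raw features RMSE:   ", raw_rmse)
print("Scaled features RMSE:", scaled_rmse)
```

Keeping the scaler inside the pipeline means it is fit on the training split only and then reused unchanged on the validation and test splits.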


In [481]:
## Prepare a dataframe with the test data index and the predicted values.
dataframe = pd.DataFrame({'CustomerID' : test.index,
                          'Customer.Lifetime.Value' : test_pred})

In [482]:
## Check dimensions of test data.
test.shape

(1767, 19)

In [483]:
## Check dimensions of dataframe.
dataframe.shape

(1767, 2)

In [484]:
## Copy dataframe data into a CSV file.
dataframe.to_csv('KNN.csv',index=False)

In [485]:
#################################################### SVM #######################################################################

In [486]:
## Import SVR (needed if not already imported above), then instantiate the model.
from sklearn.svm import SVR
svr_model = SVR()
svr_model

SVR(C=1.0, cache_size=200, coef0=0.0, degree=3, epsilon=0.1, gamma='scale',
    kernel='rbf', max_iter=-1, shrinking=True, tol=0.001, verbose=False)

In [487]:
## Fit a model.
svr_model.fit(X = X_train, y = y_train)

SVR(C=1.0, cache_size=200, coef0=0.0, degree=3, epsilon=0.1, gamma='scale',
    kernel='rbf', max_iter=-1, shrinking=True, tol=0.001, verbose=False)

In [488]:
## Get the predictions on train and validation.
pred_train = svr_model.predict(X_train)
pred_test = svr_model.predict(X_test)

In [489]:
## Get predictions on test data.
test_pred = svr_model.predict(test)

In [490]:
## Display RMSE values for train and validation data.
print("Train Error:",sqrt(mean_squared_error(y_train, pred_train)))
print("Test Error:",sqrt(mean_squared_error(y_test, pred_test)))

Train Error: 7295.793994630088
Test Error: 6966.066714076594


In [491]:
## Prepare a dataframe with the test data index and the predicted values.
dataframe = pd.DataFrame({'CustomerID' : test.index,
                          'Customer.Lifetime.Value' : test_pred})

In [492]:
## Check dimensions of test data.
test.shape

(1767, 19)

In [493]:
## Check dimensions of dataframe.
dataframe.shape

(1767, 2)

In [494]:
## Copy dataframe data into a CSV file.
dataframe.to_csv('SVR.csv',index=False)

In [495]:
################################################### AdaBoost ##################################################################

In [496]:
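## Import the AdaBoost regressor (needed if not already imported above), then instantiate the model and fit it.
from sklearn.ensemble import AdaBoostRegressor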
Adaboost_model = AdaBoostRegressor(n_estimators=200,learning_rate=0.001)
%time Adaboost_model.fit(X_train, y_train)

Wall time: 6.21 s


AdaBoostRegressor(base_estimator=None, learning_rate=0.001, loss='linear',
                  n_estimators=200, random_state=None)

In [497]:
## Get the predictions on train and validation data.
pred_train = Adaboost_model.predict(X_train)
pred_test = Adaboost_model.predict(X_test)

In [498]:
## Get predictions on test data.
test_pred = Adaboost_model.predict(test)

In [499]:
## Display RMSE value for train and validation data.
print("Train Error:",sqrt(mean_squared_error(y_train, pred_train)))
print("Test Error:",sqrt(mean_squared_error(y_test, pred_test)))

Train Error: 4396.806805269717
Test Error: 4170.575760309947


In [500]:
## Prepare a dataframe with the test data index and the predicted values.
dataframe = pd.DataFrame({'CustomerID' : test.index,
                          'Customer.Lifetime.Value' : test_pred})

In [501]:
## Check dimensions of test data.
test.shape

(1767, 19)

In [502]:
## Check dimensions of dataframe.
dataframe.shape

(1767, 2)

In [503]:
## Copy dataframe data into a CSV file.
dataframe.to_csv('AdaBoost.csv',index=False)

In [365]:
##################################################### GradientBoosting #########################################################

In [504]:
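## Import the gradient boosting regressor (needed if not already imported above), then instantiate GBR and fit it.
from sklearn.ensemble import GradientBoostingRegressor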
gbm = GradientBoostingRegressor(n_estimators=200,learning_rate=0.001,random_state=474)
%time gbm.fit(X=X_train, y=y_train)

Wall time: 4.49 s


GradientBoostingRegressor(alpha=0.9, ccp_alpha=0.0, criterion='friedman_mse',
                          init=None, learning_rate=0.001, loss='ls',
                          max_depth=3, max_features=None, max_leaf_nodes=None,
                          min_impurity_decrease=0.0, min_impurity_split=None,
                          min_samples_leaf=1, min_samples_split=2,
                          min_weight_fraction_leaf=0.0, n_estimators=200,
                          n_iter_no_change=None, presort='deprecated',
                          random_state=474, subsample=1.0, tol=0.0001,
                          validation_fraction=0.1, verbose=0, warm_start=False)

In [505]:
## Get the predictions on train and validation.
pred_train = gbm.predict(X_train)
pred_test = gbm.predict(X_test)

In [506]:
## Get predictions on test data.
test_pred = gbm.predict(test)

In [507]:
## Display RMSE value for train and validation.
print("Train Error:",sqrt(mean_squared_error(y_train, pred_train)))
print("Test Error:",sqrt(mean_squared_error(y_test, pred_test)))

Train Error: 6221.131971580862
Test Error: 5930.60023635255
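One way to reason about the effect of `learning_rate=0.001` with `n_estimators=200`: under squared-error loss, each boosting stage can remove at most a fraction `learning_rate` of the remaining training residual, so after `m` stages at least `(1 - learning_rate)^m` of the initial residual is left. The sketch below computes that idealized bound; it assumes base learners that fit the residuals exactly, which real depth-limited trees do not, and the function name is hypothetical.

```python
def max_fraction_fitted(learning_rate, n_estimators):
    """Upper bound on the fraction of the initial residual removed after boosting.

    Assumes squared-error loss and base learners that fit the residuals exactly,
    so each stage removes `learning_rate` of whatever residual remains.
    """
    return 1.0 - (1.0 - learning_rate) ** n_estimators

# Setting used in the cell above.
print(round(max_fraction_fitted(0.001, 200), 3))
# A larger learning rate with the same number of stages, for comparison.
print(round(max_fraction_fitted(0.1, 200), 3))
```

Under these assumptions, a very small learning rate needs many more stages before the ensemble can move far from its initial prediction.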


In [508]:
## Prepare a dataframe with test data index,prediction values.
dataframe = pd.DataFrame({'CustomerID' : test.index,
                          'Customer.Lifetime.Value' : test_pred})

In [509]:
## Check dimensions of test data.
test.shape

(1767, 19)

In [510]:
## Check dimensions of dataframe.
dataframe.shape

(1767, 2)

In [511]:
## Copy dataframe data into a CSV file.
dataframe.to_csv('GradientBoost.csv',index=False)

In [376]:
################################################## XGBoost (Extreme Gradient Boosting) ##########################################

In [422]:
## Model Building with Grid Search.
xgb = XGBRegressor() ## Instantiate XGB model.

optimization_dict = {'max_depth': [2,3,4,5,6,7,10,15], ## Candidate max_depth and n_estimators values to search over.
                      'n_estimators': [50,60,70,80,90,100,150,200]} 

## Build best model with Grid Search params.
model = GridSearchCV(xgb, ## XGB model.
                     optimization_dict, ## Dictionary with the candidate max_depth and n_estimators values.
                     verbose=1, ## Print progress messages.
                     n_jobs=-1) ## Number of jobs to run in parallel; -1 means use all processors.

%time model.fit(X_train, y_train) ## Fit a model.
print(model.best_score_) ## Display the best mean cross-validated score (R^2 by default for regressors).
print(model.best_params_) ## Display best parameters.

Fitting 5 folds for each of 64 candidates, totalling 320 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done  26 tasks      | elapsed:   10.8s
[Parallel(n_jobs=-1)]: Done 176 tasks      | elapsed:   29.0s
[Parallel(n_jobs=-1)]: Done 320 out of 320 | elapsed:  1.3min finished


Wall time: 1min 22s
0.6598671886474938
{'max_depth': 3, 'n_estimators': 200}
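The `best_score_` printed above comes from the default scorer, which for regressors is R², so it is not on the same scale as the RMSE values reported elsewhere in the notebook. The sketch below shows one way to run the search on RMSE directly and to reuse the refit model. It assumes scikit-learn 0.22 or later (for the `'neg_root_mean_squared_error'` scorer), uses `GradientBoostingRegressor` as a stand-in for `XGBRegressor`, and runs on synthetic data rather than the CLV dataset.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic regression data standing in for X_train / y_train.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.5, size=200)

search = GridSearchCV(
    GradientBoostingRegressor(random_state=0),
    param_grid={'max_depth': [2, 3], 'n_estimators': [50, 100]},
    scoring='neg_root_mean_squared_error',  # scores are negated so that higher is better
    cv=3,
)
search.fit(X_demo, y_demo)

cv_rmse = -search.best_score_         # mean RMSE over the validation folds
tuned_model = search.best_estimator_  # already refit on all of X_demo with the best parameters
print(search.best_params_, cv_rmse)
```

Because `refit=True` is the default, `best_estimator_` can be used for prediction without instantiating a second model by hand.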


In [512]:
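## Instantiate XGBR and fit it.
## Note: the grid search above selected max_depth=3 and n_estimators=200 (at the default learning rate);
## this model instead uses max_depth=7 and learning_rate=0.001, so it does not reproduce the tuned configuration.
xgb_model=XGBRegressor(n_estimators=200,learning_rate=0.001,max_depth=7)
%time xgb_model.fit(X_train,y_train,verbose=True)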


Wall time: 5.74 s


XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
             colsample_bynode=1, colsample_bytree=1, gamma=0,
             importance_type='gain', learning_rate=0.001, max_delta_step=0,
             max_depth=7, min_child_weight=1, missing=None, n_estimators=200,
             n_jobs=1, nthread=None, objective='reg:linear', random_state=0,
             reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
             silent=None, subsample=1, verbosity=1)

In [513]:
## Get the predictions on train and validation.
pred_train = xgb_model.predict(X_train)
pred_test = xgb_model.predict(X_test)

In [514]:
## Get predictions on test data.
test_pred = xgb_model.predict(test)

In [515]:
## Get RMSE value for train and validation data.
print("Train Error:",sqrt(mean_squared_error(y_train, pred_train)))
print("Test Error:",sqrt(mean_squared_error(y_test, pred_test)))

Train Error: 8980.544157300707
Test Error: 8750.180710896557


In [516]:
## Prepare a dataframe with test data index,prediction values.
dataframe = pd.DataFrame({'CustomerID' : test.index,
                          'Customer.Lifetime.Value' : test_pred})

In [517]:
## Check dimensions of test data.
test.shape

(1767, 19)

In [518]:
## Check dimensions of dataframe.
dataframe.shape

(1767, 2)

In [519]:
## Copy dataframe data into a CSV file.
dataframe.to_csv('XGB.csv',index=False)

In [520]:
############################################# Neural Network Linear Algorithm #################################################

In [521]:
## Instantiate sequential model.
model = Sequential()

## Add a single dense layer with one output unit and no activation (a linear model).
model.add(Dense(1, input_dim=X_train.shape[1]))

## Compile the model with MSE loss and the RMSprop optimizer.
model.compile(loss='mse', optimizer='rmsprop')

## Fit the model.
model.fit(X_train, y_train, epochs=50, batch_size=32)

Epoch 1/50
...
Epoch 50/50


<keras.callbacks.callbacks.History at 0x203a80c3f98>
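A `Sequential` model with a single `Dense(1)` layer and no activation computes `w · x + b`, so it is a linear regression fitted iteratively rather than in closed form. The sketch below illustrates that idea with plain full-batch gradient descent in NumPy. It is not Keras and not RMSprop; the data, learning rate, and iteration count are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 3))              # synthetic, already standardized features
true_w, true_b = np.array([2.0, -1.0, 0.5]), 4.0
y_demo = X_demo @ true_w + true_b + rng.normal(scale=0.1, size=500)

w, b = np.zeros(3), 0.0                          # parameters of the single dense unit
learning_rate = 0.1
for _ in range(200):                             # full-batch gradient descent on MSE
    error = X_demo @ w + b - y_demo
    w -= learning_rate * 2.0 * X_demo.T @ error / len(y_demo)
    b -= learning_rate * 2.0 * error.mean()

print(np.round(w, 2), round(float(b), 2))
```

On features with very different scales, this kind of iterative fit typically needs feature scaling or more iterations to approach the least-squares solution.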

In [522]:
## Get the predictions on train and validation.
pred_train = model.predict(X_train)
pred_test = model.predict(X_test)

In [523]:
## Get predictions on test data.
test_pred = model.predict(test)

In [524]:
## Display RMSE value for train and validation.
print("Train Error:",sqrt(mean_squared_error(y_train, pred_train)))
print("Test Error:",sqrt(mean_squared_error(y_test, pred_test)))

Train Error: 6822.252685993944
Test Error: 6536.224518869452


In [546]:
## Prepare a dataframe with test data index,prediction values.
dataframe = pd.DataFrame({'CustomerID' : test.index,
                          'Customer.Lifetime.Value' : test_pred.tolist()})

In [547]:
## Each prediction is a one-element list; extract its float value.
dataframe['Customer.Lifetime.Value'] = dataframe['Customer.Lifetime.Value'].apply(lambda x: x[0])
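Keras returns predictions for a single-output model as an array of shape `(n_samples, 1)`, which is why each value is unwrapped above. An equivalent way to get one float per row is to flatten the array with NumPy's `ravel` before building the dataframe. The sketch below uses a made-up array in place of `test_pred`.

```python
import numpy as np

# Stand-in for the (n_samples, 1) array returned by model.predict(test).
preds_2d = np.array([[5120.4], [8033.1], [2977.9]])

preds_1d = preds_2d.ravel()   # shape (3,): one float per row
print(preds_1d.shape, preds_1d.tolist())
```

In the notebook this would correspond to passing `test_pred.ravel()` as the `'Customer.Lifetime.Value'` column.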

In [548]:
## Check dimensions of test data.
test.shape

(1767, 19)

In [549]:
## Check dimensions of dataframe.
dataframe.shape

(1767, 2)

In [550]:
## Copy dataframe data into a CSV file.
dataframe.to_csv('NeuralNetwork.csv',index=False)

In [401]:
############################### Perform Grid Search,Ridge,Lasso ###############################################################

In [404]:
##################################################### Ridge ###################################################################

In [406]:
## Ridge regression takes a regularization parameter alpha. Larger values of alpha shrink the coefficient magnitudes more strongly.
## To find the alpha that generalizes best to unseen data, we try several values and pick the best
## by performing a grid search with cross-validation.
alphas = np.array([1,0.1,0.01,0.001,0.0001,0,1.5,2]) ## Candidate alpha values; the grid search picks the best of these.
## Create and fit a ridge regression model, testing each alpha.
model_ridge = Ridge()
grid = GridSearchCV(estimator=model_ridge, param_grid=dict(alpha=alphas),cv=10) ## cv=10 means each alpha is scored on 10 held-out chunks (folds) of the training data and the scores are averaged.
grid.fit(X_train,y_train)
print(grid)

GridSearchCV(cv=10, error_score=nan,
             estimator=Ridge(alpha=1.0, copy_X=True, fit_intercept=True,
                             max_iter=None, normalize=False, random_state=None,
                             solver='auto', tol=0.001),
             iid='deprecated', n_jobs=None,
             param_grid={'alpha': array([1.0e+00, 1.0e-01, 1.0e-02, 1.0e-03, 1.0e-04, 0.0e+00, 1.5e+00,
       2.0e+00])},
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring=None, verbose=0)


In [407]:
## Display the best mean cross-validated score (R^2 by default) and the corresponding alpha.
print(grid.best_score_)
print(grid.best_estimator_.alpha)

0.11813011698075745
2.0
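To make the role of alpha concrete, the sketch below evaluates the closed-form ridge solution `(XᵀX + alpha·I)⁻¹ Xᵀy` for a few alpha values and prints the coefficient norm. It assumes synthetic data and no intercept term; `ridge_coefficients` is a hypothetical helper, not scikit-learn's implementation.

```python
import numpy as np

def ridge_coefficients(X, y, alpha):
    """Closed-form ridge solution (X^T X + alpha * I)^-1 X^T y, without an intercept."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 3))
y_demo = X_demo @ np.array([4.0, -3.0, 1.0]) + rng.normal(scale=0.5, size=50)

for alpha in [0.0, 2.0, 50.0]:
    coef = ridge_coefficients(X_demo, y_demo, alpha)
    print(alpha, np.round(coef, 3), round(float(np.linalg.norm(coef)), 3))
```

The coefficient norm shrinks as alpha grows, and `alpha=0.0` reduces to ordinary least squares.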


In [552]:
## Instantiate Ridge and fit it.
Ridge_model= Ridge(alpha=2,normalize=False)
Ridge_model.fit(X_train,y_train) ## Applying it on the train data, to obtain the coefficients.

Ridge(alpha=2, copy_X=True, fit_intercept=True, max_iter=None, normalize=False,
      random_state=None, solver='auto', tol=0.001)

In [553]:
## Get the predictions on train and validation data.
pred_train = Ridge_model.predict(X_train)
pred_test = Ridge_model.predict(X_test)

In [554]:
## Get predictions on test data.
test_pred = Ridge_model.predict(test)

In [555]:
## Display RMSE value for train and validation data.
print("Train Error:",sqrt(mean_squared_error(y_train, pred_train)))
print("Test Error:",sqrt(mean_squared_error(y_test, pred_test)))

Train Error: 6482.602867431113
Test Error: 6260.014516392784


In [556]:
## Prepare a dataframe with test data index,prediction values.
dataframe = pd.DataFrame({'CustomerID' : test.index,
                          'Customer.Lifetime.Value' : test_pred})

In [557]:
## Check dimensions of test data.
test.shape

(1767, 19)

In [558]:
## Check dimensions of dataframe.
dataframe.shape

(1767, 2)

In [559]:
## Copy dataframe data into a CSV file.
dataframe.to_csv('Ridge.csv',index=False)

In [412]:
####################################################### Lasso #################################################################

In [414]:
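## Get the best parameter values by doing a grid search.
## Note: the alpha grid includes 0, i.e. no regularization; scikit-learn's Lasso emits warnings for that value and recommends LinearRegression instead.
model_lasso = Lasso()
grid = GridSearchCV(estimator=model_lasso, param_grid=dict(alpha=alphas),cv=10) ## cv=10 means each alpha is scored on 10 held-out chunks (folds) of the training data and the scores are averaged.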
grid.fit(X_train,y_train)
print(grid)

GridSearchCV(cv=10, error_score=nan,
             estimator=Lasso(alpha=1.0, copy_X=True, fit_intercept=True,
                             max_iter=1000, normalize=False, positive=False,
                             precompute=False, random_state=None,
                             selection='cyclic', tol=0.0001, warm_start=False),
             iid='deprecated', n_jobs=None,
             param_grid={'alpha': array([1.0e+00, 1.0e-01, 1.0e-02, 1.0e-03, 1.0e-04, 0.0e+00, 1.5e+00,
       2.0e+00])},
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring=None, verbose=0)


In [415]:
## Display the best mean cross-validated score (R^2 by default) and the corresponding alpha.
print(grid.best_score_)
print(grid.best_estimator_.alpha)

0.1181308780672993
2.0
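The first number printed above is a score averaged over the 10 cross-validation folds; with the default scorer it is a mean R², not an error. The sketch below spells out that averaging by hand for an ordinary least-squares fit. It uses NumPy only, contiguous unshuffled folds, and synthetic data; `kfold_mean_r2` is a hypothetical helper.

```python
import numpy as np

def kfold_mean_r2(X, y, n_splits=10):
    """Mean R^2 of an ordinary least-squares fit over contiguous, unshuffled folds."""
    folds = np.array_split(np.arange(len(y)), n_splits)
    scores = []
    for held_out in folds:
        train = np.setdiff1d(np.arange(len(y)), held_out)
        A_train = np.column_stack([X[train], np.ones(len(train))])   # add intercept column
        coef, *_ = np.linalg.lstsq(A_train, y[train], rcond=None)
        A_val = np.column_stack([X[held_out], np.ones(len(held_out))])
        residual = y[held_out] - A_val @ coef
        total = y[held_out] - y[held_out].mean()
        scores.append(1.0 - residual @ residual / (total @ total))
    return float(np.mean(scores))

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 2))
y_demo = X_demo @ np.array([1.5, -2.0]) + rng.normal(scale=1.0, size=300)
print(kfold_mean_r2(X_demo, y_demo))
```

Each fold is held out once, the model is fitted on the other nine, and the ten held-out scores are averaged into a single number.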


In [560]:
## Instantiate Lasso and fit it.
Lasso_model= Lasso(alpha=2.0,normalize=False)
Lasso_model.fit(X_train,y_train) ## Applying it on the train data, to obtain the coefficients.

Lasso(alpha=2.0, copy_X=True, fit_intercept=True, max_iter=1000,
      normalize=False, positive=False, precompute=False, random_state=None,
      selection='cyclic', tol=0.0001, warm_start=False)

In [561]:
## Get the predictions on train and validation data.
pred_train = Lasso_model.predict(X_train)
pred_test = Lasso_model.predict(X_test)

In [562]:
## Get predictions on test data.
test_pred = Lasso_model.predict(test)

In [563]:
## Display RMSE value for train and validation data.
print("Train Error:",sqrt(mean_squared_error(y_train, pred_train)))
print("Test Error:",sqrt(mean_squared_error(y_test, pred_test)))

Train Error: 6482.608555101062
Test Error: 6259.619971942054


In [564]:
## Prepare a dataframe with test data index,prediction values.
dataframe = pd.DataFrame({'CustomerID' : test.index,
                          'Customer.Lifetime.Value' : test_pred})

In [565]:
## Check dimensions of test data.
test.shape

(1767, 19)

In [566]:
## Check dimensions of dataframe.
dataframe.shape

(1767, 2)

In [567]:
## Copy dataframe data into a CSV file.
dataframe.to_csv('Lasso.csv',index=False)