Problem Statement
VahanBima is one of the leading insurance companies in India. It provides motor vehicle insurance at the best prices with 24/7 claim settlement. It offers different types of policies for both personal and commercial vehicles. It has established its brand across different regions in India.

Around 90% of businesses today use personalized services. The company wants to launch personalized experience programs for VahanBima customers, such as dedicated resources for claim settlement or services delivered at the doorstep. To do so, it would like to segment customers into tiers based on their customer lifetime value (CLTV), which it wants to predict from each customer's activity and interaction with the platform.

The objective of the challenge is to build a high-performance, interpretable machine learning model that predicts customer lifetime value (CLTV) from the user and policy data.
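
As a rough illustration of the tiering idea in the problem statement, the sketch below bins CLTV values into equal-frequency tiers with `pd.qcut`. The number of tiers (three) and the labels `Low`, `Medium`, `High` are assumptions made for illustration; the problem statement does not specify how tiers are defined. The five CLTV values are the ones visible in the first rows of the training data.

```python
import pandas as pd

# CLTV values copied from the first five training rows shown in this notebook.
cltv = pd.Series([64308, 515400, 64212, 97920, 59736], name="cltv")

# Hypothetical tiering: three equal-frequency bins (the tier count and
# labels are illustrative assumptions, not part of the problem statement).
tiers = pd.qcut(cltv, q=3, labels=["Low", "Medium", "High"])

print(pd.DataFrame({"cltv": cltv, "tier": tiers}))
```

In practice the same call would be applied to predicted CLTV for customers whose true value is unknown, and the bin edges would probably be chosen with the business rather than by quantile alone.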

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
df_train = pd.read_csv('train_BRCpofr.csv')
df_test = pd.read_csv('test_koRSKBP.csv')

In [3]:
df_train.head()

,id,gender,area,qualification,income,marital_status,vintage,claim_amount,num_policies,policy,type_of_policy,cltv
0,1,Male,Urban,Bachelor,5L-10L,1,5,5790,More than 1,A,Platinum,64308
1,2,Male,Rural,High School,5L-10L,0,8,5080,More than 1,A,Platinum,515400
2,3,Male,Urban,Bachelor,5L-10L,1,8,2599,More than 1,A,Platinum,64212
3,4,Female,Rural,High School,5L-10L,0,7,0,More than 1,A,Platinum,97920
4,5,Male,Urban,High School,More than 10L,1,6,3508,More than 1,A,Gold,59736


In [4]:
df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 89392 entries, 0 to 89391
Data columns (total 12 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   id              89392 non-null  int64 
 1   gender          89392 non-null  object
 2   area            89392 non-null  object
 3   qualification   89392 non-null  object
 4   income          89392 non-null  object
 5   marital_status  89392 non-null  int64 
 6   vintage         89392 non-null  int64 
 7   claim_amount    89392 non-null  int64 
 8   num_policies    89392 non-null  object
 9   policy          89392 non-null  object
 10  type_of_policy  89392 non-null  object
 11  cltv            89392 non-null  int64 
dtypes: int64(5), object(7)
memory usage: 8.2+ MB


In [5]:
df_test.head()

,id,gender,area,qualification,income,marital_status,vintage,claim_amount,num_policies,policy,type_of_policy
0,89393,Female,Rural,High School,5L-10L,0,6,2134,More than 1,B,Silver
1,89394,Female,Urban,High School,2L-5L,0,4,4102,More than 1,A,Platinum
2,89395,Male,Rural,High School,5L-10L,1,7,2925,More than 1,B,Gold
3,89396,Female,Rural,Bachelor,More than 10L,1,2,0,More than 1,B,Silver
4,89397,Female,Urban,High School,2L-5L,0,5,14059,More than 1,B,Silver


In [6]:
df_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 59595 entries, 0 to 59594
Data columns (total 11 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   id              59595 non-null  int64 
 1   gender          59595 non-null  object
 2   area            59595 non-null  object
 3   qualification   59595 non-null  object
 4   income          59595 non-null  object
 5   marital_status  59595 non-null  int64 
 6   vintage         59595 non-null  int64 
 7   claim_amount    59595 non-null  int64 
 8   num_policies    59595 non-null  object
 9   policy          59595 non-null  object
 10  type_of_policy  59595 non-null  object
dtypes: int64(4), object(7)
memory usage: 5.0+ MB


In [7]:
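# DataFrame.append was deprecated in pandas 1.4 and removed in 2.0;
# pd.concat is the supported equivalent and keeps each frame's original index.
df = pd.concat([df_train, df_test])
df.head()
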

,id,gender,area,qualification,income,marital_status,vintage,claim_amount,num_policies,policy,type_of_policy,cltv
0,1,Male,Urban,Bachelor,5L-10L,1,5,5790,More than 1,A,Platinum,64308.0
1,2,Male,Rural,High School,5L-10L,0,8,5080,More than 1,A,Platinum,515400.0
2,3,Male,Urban,Bachelor,5L-10L,1,8,2599,More than 1,A,Platinum,64212.0
3,4,Female,Rural,High School,5L-10L,0,7,0,More than 1,A,Platinum,97920.0
4,5,Male,Urban,High School,More than 10L,1,6,3508,More than 1,A,Gold,59736.0


In [8]:
df.shape

(148987, 12)

In [9]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 148987 entries, 0 to 59594
Data columns (total 12 columns):
 #   Column          Non-Null Count   Dtype  
---  ------          --------------   -----  
 0   id              148987 non-null  int64  
 1   gender          148987 non-null  object 
 2   area            148987 non-null  object 
 3   qualification   148987 non-null  object 
 4   income          148987 non-null  object 
 5   marital_status  148987 non-null  int64  
 6   vintage         148987 non-null  int64  
 7   claim_amount    148987 non-null  int64  
 8   num_policies    148987 non-null  object 
 9   policy          148987 non-null  object 
 10  type_of_policy  148987 non-null  object 
 11  cltv            89392 non-null   float64
dtypes: float64(1), int64(4), object(7)
memory usage: 14.8+ MB
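
The `info()` output above shows `cltv` with 89392 non-null values and dtype `float64`: the test rows have no target, so stacking the two frames fills their `cltv` with NaN and promotes the column from integer to float. A minimal sketch of how the stacked frame could later be split back, using tiny stand-in frames built from values in the heads shown earlier (the names `train`, `test`, `combined`, `train_back` and `test_back` are hypothetical):

```python
import pandas as pd

# Tiny stand-ins for df_train / df_test, using values from the heads shown above.
train = pd.DataFrame({"id": [1, 2, 3], "vintage": [5, 8, 8],
                      "cltv": [64308, 515400, 64212]})
test = pd.DataFrame({"id": [89393, 89394], "vintage": [6, 4]})

# ignore_index=True gives a fresh 0..n-1 index; the notebook keeps the
# original indices instead, so its index labels repeat across the two frames.
combined = pd.concat([train, test], ignore_index=True)

# Rows that came from the test file have no target, so cltv is NaN there.
is_train = combined["cltv"].notna()
train_back = combined[is_train]
test_back = combined[~is_train].drop(columns="cltv")

print(combined["cltv"].dtype)
print(train_back.shape, test_back.shape)
```

Splitting on a missing target works here only because the training `cltv` column has no nulls of its own, as the earlier `df_train.info()` output confirms.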


In [12]:
for col in df.columns:
    print(f"Number of unique values in column {col} : ", df[col].nunique())

Number of unique values in column id :  148987
Number of unique values in column gender :  2
Number of unique values in column area :  2
Number of unique values in column qualification :  3
Number of unique values in column income :  4
Number of unique values in column marital_status :  2
Number of unique values in column vintage :  9
Number of unique values in column claim_amount :  12356
Number of unique values in column num_policies :  2
Number of unique values in column policy :  3
Number of unique values in column type_of_policy :  3
Number of unique values in column cltv :  18796
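
The per-column loop can also be collapsed into a single summary table. The sketch below is one possible way to do it, shown on a small stand-in frame (the names `sample` and `summary` are hypothetical); `DataFrame.nunique()` returns the distinct-value count for every column at once.

```python
import pandas as pd

# Small stand-in frame with one text column and one integer column.
sample = pd.DataFrame({
    "gender": ["Male", "Male", "Female"],
    "vintage": [5, 8, 7],
})

# One row per column: its dtype and its number of distinct values.
summary = pd.DataFrame({
    "dtype": sample.dtypes.astype(str),
    "n_unique": sample.nunique(),
})

print(summary)
```

Applied to `df`, a table like this makes it easy to spot identifier-like columns such as `id`, where the unique count equals the row count.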


In [15]:
df['gender'].dtype

dtype('O')

In [16]:
df['marital_status'].dtype

dtype('int64')

In [26]:
# List the distinct values of every text (object-dtype) column.
for col in df.columns:
    if df[col].dtype == 'O':
        print('Column Name : {}'.format(col), ',Values :', df[col].unique())

Column Name : gender ,Values : ['Male' 'Female']
Column Name : area ,Values : ['Urban' 'Rural']
Column Name : qualification ,Values : ['Bachelor' 'High School' 'Others']
Column Name : income ,Values : ['5L-10L' 'More than 10L' '2L-5L' '<=2L']
Column Name : num_policies ,Values : ['More than 1' '1']
Column Name : policy ,Values : ['A' 'C' 'B']
Column Name : type_of_policy ,Values : ['Platinum' 'Gold' 'Silver']
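
Since every text column has only two to four levels, a natural next step is to turn them into numbers. The sketch below is one possible encoding, not necessarily the approach taken later in this notebook: `income` is treated as ordered from lowest to highest bracket, and `num_policies` becomes a 0/1 flag. The bracket ordering and the new column names `income_code` and `multi_policy` are assumptions.

```python
import pandas as pd

# Stand-in rows using the category values printed above.
sample = pd.DataFrame({
    "income": ["5L-10L", "More than 10L", "2L-5L", "<=2L"],
    "num_policies": ["More than 1", "1", "More than 1", "1"],
})

# Assumed ordering of the income brackets, lowest to highest.
income_order = ["<=2L", "2L-5L", "5L-10L", "More than 10L"]
sample["income_code"] = pd.Categorical(
    sample["income"], categories=income_order, ordered=True
).codes

# Binary flag: 1 if the customer holds more than one policy, else 0.
sample["multi_policy"] = (sample["num_policies"] == "More than 1").astype(int)

print(sample)
```

Columns without an obvious order, such as `policy` (A, B, C), would usually be one-hot encoded instead, for example with `pd.get_dummies`.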
