## Problem Statement
VahanBima is one of the leading insurance companies in India. It provides motor vehicle insurance at the best prices with 24/7 claim settlement. It offers different types of policies for both personal and commercial vehicles, and it has established its brand across different regions of India.

Around 90% of businesses today use personalized services. The company wants to launch personalized experience programs for VahanBima customers, such as dedicated resources for claim settlement or different kinds of doorstep services. To do so, it would like to segment customers into tiers based on their customer lifetime value (CLTV).

To that end, the company would like to predict customer lifetime value from each customer's activity and interaction with the platform. As part of this challenge, your task is to build a high-performance, interpretable machine learning model that predicts CLTV from the user and policy data.
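
As a rough illustration of the tiering step described above, predicted CLTV values could be bucketed into tiers with quantile cuts. This is only a sketch: the three tiers, their names, and the sample values are assumptions made for the example, not part of the problem statement.

```python
import pandas as pd

# Hypothetical predicted CLTV values; in practice these would come from the model.
predicted = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6],
    "cltv": [52000, 64000, 66000, 98000, 120000, 515000],
})

# Split customers into three equally sized tiers by predicted CLTV.
# The number of tiers and the tier names are illustrative assumptions.
predicted["tier"] = pd.qcut(predicted["cltv"], q=3, labels=["Low", "Mid", "High"])
print(predicted)
```

Quantile cuts keep the tiers the same size; fixed monetary thresholds would be an equally reasonable choice if the business defines them.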

---


* __Step 1: Importing the Relevant Libraries__
* __Step 2: Data Inspection__
* __Step 3: Exploratory Data Analysis__
* __Step 4: Feature Engineering__
* __Step 5: Training the Model__
    


## Step 1: Importing the Relevant Libraries


---



In [3]:
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt
# import dtale
# import dtale.app as dtale_app
import seaborn as sns

In [4]:
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn import metrics

In [25]:
!pip install bayesian-optimization

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting bayesian-optimization
  Downloading bayesian_optimization-1.4.2-py3-none-any.whl (17 kB)
Collecting colorama>=0.4.6
  Downloading colorama-0.4.6-py2.py3-none-any.whl (25 kB)
Installing collected packages: colorama, bayesian-optimization
Successfully installed bayesian-optimization-1.4.2 colorama-0.4.6


In [26]:
from sklearn.model_selection import KFold
from sklearn.metrics import r2_score
import lightgbm as lgb
from lightgbm import LGBMRegressor 
from bayes_opt import BayesianOptimization

In [5]:
import warnings
# Silence library warnings to keep the notebook output readable.
warnings.filterwarnings('ignore')

Read the train and test CSV files.

In [6]:
train = pd.read_csv(r"/content/drive/MyDrive/Jobathon/20 1 23/train_BRCpofr.csv")
test = pd.read_csv(r"/content/drive/MyDrive/Jobathon/20 1 23/test_koRSKBP.csv")
train.head()

Unnamed: 0,id,gender,area,qualification,income,marital_status,vintage,claim_amount,num_policies,policy,type_of_policy,cltv
0,1,Male,Urban,Bachelor,5L-10L,1,5,5790,More than 1,A,Platinum,64308
1,2,Male,Rural,High School,5L-10L,0,8,5080,More than 1,A,Platinum,515400
2,3,Male,Urban,Bachelor,5L-10L,1,8,2599,More than 1,A,Platinum,64212
3,4,Female,Rural,High School,5L-10L,0,7,0,More than 1,A,Platinum,97920
4,5,Male,Urban,High School,More than 10L,1,6,3508,More than 1,A,Gold,59736


## Step 2: Data Inspection

In [7]:
train.isnull().sum()

id                0
gender            0
area              0
qualification     0
income            0
marital_status    0
vintage           0
claim_amount      0
num_policies      0
policy            0
type_of_policy    0
cltv              0
dtype: int64

None of the columns contains missing (NaN) values.

In [8]:
train.describe()

Unnamed: 0,id,marital_status,vintage,claim_amount,cltv
count,89392.0,89392.0,89392.0,89392.0,89392.0
mean,44696.5,0.575488,4.595669,4351.502416,97952.828978
std,25805.391969,0.494272,2.290446,3262.359775,90613.814793
min,1.0,0.0,0.0,0.0,24828.0
25%,22348.75,0.0,3.0,2406.0,52836.0
50%,44696.5,1.0,5.0,4089.0,66396.0
75%,67044.25,1.0,6.0,6094.0,103440.0
max,89392.0,1.0,8.0,31894.0,724068.0


In [9]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 89392 entries, 0 to 89391
Data columns (total 12 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   id              89392 non-null  int64 
 1   gender          89392 non-null  object
 2   area            89392 non-null  object
 3   qualification   89392 non-null  object
 4   income          89392 non-null  object
 5   marital_status  89392 non-null  int64 
 6   vintage         89392 non-null  int64 
 7   claim_amount    89392 non-null  int64 
 8   num_policies    89392 non-null  object
 9   policy          89392 non-null  object
 10  type_of_policy  89392 non-null  object
 11  cltv            89392 non-null  int64 
dtypes: int64(5), object(7)
memory usage: 8.2+ MB


## Step 3: Exploratory Data Analysis

---



In [10]:
train.groupby(['gender','income']).mean()

gender,income,id,marital_status,vintage,claim_amount,cltv
Female,2L-5L,44516.659328,0.498894,4.660065,5034.707785,108045.856104
Female,5L-10L,44745.591719,0.5536,4.55081,3986.400945,94587.89973
Female,<=2L,43400.453757,0.522158,4.560694,5963.054913,112668.358382
Female,More than 10L,45055.879069,0.530762,4.568034,3172.767904,92902.785156
Male,2L-5L,44433.118131,0.555679,4.706815,5425.541192,110624.269867
Male,5L-10L,44743.637821,0.629197,4.569222,4312.220685,95407.277676
Male,<=2L,44232.786765,0.569853,4.82598,6174.161765,109886.852941
Male,More than 10L,44931.460771,0.594282,4.584176,3603.925665,86621.310638


For both genders, customers in the two lower income bands (`<=2L` and `2L-5L`) have a higher mean CLTV than customers in the higher income bands.

In [11]:
train.groupby(['gender','qualification']).mean()

gender,qualification,id,marital_status,vintage,claim_amount,cltv
Female,Bachelor,44935.477857,0.544164,4.655554,4008.128028,100011.744311
Female,High School,44491.240636,0.527876,4.53166,4328.695038,98151.511256
Female,Others,45070.924506,0.553026,4.455362,3688.481126,78470.739365
Male,Bachelor,44693.859789,0.612685,4.675112,4302.103943,97454.165112
Male,High School,44655.483701,0.599945,4.552052,4723.155584,99916.032954
Male,Others,45108.191141,0.607126,4.531536,3825.0,76960.31584


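Customers with a Bachelor or High School qualification have a clearly higher mean CLTV (about 97,000 to 100,000) than customers in the Others category (about 77,000 to 78,500). The difference between Bachelor and High School is small.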

In [12]:
train.groupby(['marital_status']).mean()

marital_status,id,vintage,claim_amount,cltv
0,44691.374987,4.626831,4788.031411,106155.045536
1,44700.280499,4.572681,4029.493974,91902.410777


Customers with `marital_status = 0` (presumably unmarried) have a higher mean CLTV (about 106,000) than customers with `marital_status = 1` (about 92,000).

In [13]:
train.groupby(['type_of_policy']).cltv.mean()

type_of_policy
Gold        99381.983095
Platinum    99752.960331
Silver      92457.367539
Name: cltv, dtype: float64

Platinum policies have the highest mean CLTV, although Gold is almost identical; Silver is noticeably lower.

In [14]:
train.groupby(['num_policies']).mean()

num_policies,id,marital_status,vintage,claim_amount,cltv
1,44848.879433,0.634042,4.428645,3622.866181,50979.031618
More than 1,44622.845179,0.547185,4.676402,4703.699368,120658.299056


Customers holding more than one policy have a much higher mean CLTV (about 120,700) than customers with a single policy (about 51,000).
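
The group means above are computed on a right-skewed target: `train.describe()` reports a mean CLTV of about 97,953 against a median of 66,396. A sketch of how the same comparison could report the median and group size next to the mean is given below. The toy values are made up for illustration. Selecting the `cltv` column before aggregating also avoids averaging `id`, and avoids the error that newer pandas releases may raise when `mean()` meets text columns.

```python
import pandas as pd

# Toy rows in the same shape as the training data (values are made up).
toy = pd.DataFrame({
    "num_policies": ["1", "1", "More than 1", "More than 1", "More than 1"],
    "cltv": [48000, 54000, 60000, 90000, 510000],
})

# Aggregate only the target column and report several statistics per group.
summary = toy.groupby("num_policies")["cltv"].agg(["mean", "median", "count"])
print(summary)
```

When a few very large values dominate a group, the mean and the median diverge, which is worth knowing before drawing conclusions from the mean alone.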

## Step 4: Feature Engineering

---





Encode the `qualification`, `num_policies`, and `income` columns as ordinal integers, convert `gender` and `area` to binary flags, and remap the value 3 of `vintage` to 9.


In [15]:
def qualification_ord(x):
    if x == 'Bachelor':
        return 2
    elif x == 'High School':
        return 1
    elif x == 'Others':
        return 0
    else:
        return 3

def num_policies_ord(x):
    if x == 'More than 1':
        return 1
    else:
        return 0

def income_ord(x):
    if x == '2L-5L':
        return 3
    elif x == '<=2L':
        return 4
    elif x == '5L-10L':
        return 2
    else:
        return 1


def num_vint(x):
    if x == 3 :
        return 9
    else:
        return x
        
train['vintage'] = train['vintage'].apply(lambda x: num_vint(x))
train['qualification'] = train['qualification'].apply(lambda x: qualification_ord(x))
train['num_policies'] = train['num_policies'].apply(lambda x: num_policies_ord(x))
train['income'] = train['income'].apply(lambda x: income_ord(x))

train['gender'] = train['gender'].apply(lambda x: 1 if x == "Male" else 0)
train['area'] = train['area'].apply(lambda x: 1 if x == "Urban" else 0)
train.head()

Unnamed: 0,id,gender,area,qualification,income,marital_status,vintage,claim_amount,num_policies,policy,type_of_policy,cltv
0,1,1,1,2,2,1,5,5790,1,A,Platinum,64308
1,2,1,0,1,2,0,8,5080,1,A,Platinum,515400
2,3,1,1,2,2,1,8,2599,1,A,Platinum,64212
3,4,0,0,1,2,0,7,0,1,A,Platinum,97920
4,5,1,1,1,1,1,6,3508,1,A,Gold,59736
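
The `if`/`elif` helpers above can also be written as lookup tables. The sketch below uses the same codes as `qualification_ord` and `income_ord`; the constant names and the toy frame are assumptions made for the example.

```python
import pandas as pd

# Same codes as qualification_ord / income_ord, written as lookup tables.
QUALIFICATION_CODES = {"Others": 0, "High School": 1, "Bachelor": 2}
INCOME_CODES = {"More than 10L": 1, "5L-10L": 2, "2L-5L": 3, "<=2L": 4}

toy = pd.DataFrame({
    "qualification": ["Bachelor", "High School", "Others"],
    "income": ["5L-10L", "<=2L", "More than 10L"],
})

toy["qualification"] = toy["qualification"].map(QUALIFICATION_CODES)
toy["income"] = toy["income"].map(INCOME_CODES)

# .map() yields NaN for labels missing from the table, so an unexpected
# category shows up here instead of silently receiving a default code.
assert not toy.isna().any().any()
print(toy)
```

One difference in behavior: the original helpers assign a fallback code to any unknown label, while `.map()` leaves it as NaN. Which is preferable depends on whether unseen categories should be tolerated or flagged.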


In [16]:
from enum import Enum
class Colors(Enum):
    Platinum = 3
    Gold = 2
    Silver = 1
    def __str__(self):
        return self.name

train["type_of_policy"] = train["type_of_policy"].apply(lambda x: getattr(Colors, x).value)
train.head()

Unnamed: 0,id,gender,area,qualification,income,marital_status,vintage,claim_amount,num_policies,policy,type_of_policy,cltv
0,1,1,1,2,2,1,5,5790,1,A,3,64308
1,2,1,0,1,2,0,8,5080,1,A,3,515400
2,3,1,1,2,2,1,8,2599,1,A,3,64212
3,4,0,0,1,2,0,7,0,1,A,3,97920
4,5,1,1,1,1,1,6,3508,1,A,2,59736


In [17]:
# Performing One Hot Encoding and merging with original dataframe.
def dummy(dataframe,col):
  dummy_col = pd.get_dummies(dataframe[col],drop_first=True)

  dataframe= pd.merge(
      left=dataframe,
      right=dummy_col,
      left_index=True,
      right_index=True,
  )
  dataframe = dataframe.drop(col, axis=1)

  return dataframe

# pd.get_dummies(train['area', 'gender', 'policy'])
train = dummy(train,'policy')
train.head()

Unnamed: 0,id,gender,area,qualification,income,marital_status,vintage,claim_amount,num_policies,type_of_policy,cltv,B,C
0,1,1,1,2,2,1,5,5790,1,3,64308,0,0
1,2,1,0,1,2,0,8,5080,1,3,515400,0,0
2,3,1,1,2,2,1,8,2599,1,3,64212,0,0
3,4,0,0,1,2,0,7,0,1,3,97920,0,0
4,5,1,1,1,1,1,6,3508,1,2,59736,0,0
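
Because `dummy` is applied to the train and test frames separately, the two results only share the same columns when every `policy` category appears in both files. A minimal sketch of one way to guard against a mismatch is shown below, using made-up frames; `reindex` forces the test columns to match the training columns.

```python
import pandas as pd

train_demo = pd.DataFrame({"policy": ["A", "B", "C", "A"]})
test_demo = pd.DataFrame({"policy": ["A", "B", "B"]})   # category "C" is absent here

train_ohe = pd.get_dummies(train_demo["policy"], drop_first=True)
test_ohe = pd.get_dummies(test_demo["policy"], drop_first=True)

# Give the test frame exactly the training columns, filling missing ones with 0.
test_ohe = test_ohe.reindex(columns=train_ohe.columns, fill_value=0)
print(list(test_ohe.columns))
```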


Apply a logarithmic transform to the `claim_amount` column to compress its range.


In [18]:
train['claim_amount'] = np.log2(train['claim_amount']+1)
train.head()

Unnamed: 0,id,gender,area,qualification,income,marital_status,vintage,claim_amount,num_policies,type_of_policy,cltv,B,C
0,1,1,1,2,2,1,5,12.499597,1,3,64308,0,0
1,2,1,0,1,2,0,8,12.310897,1,3,515400,0,0
2,3,1,1,2,2,1,8,11.344296,1,3,64212,0,0
3,4,0,0,1,2,0,7,0.0,1,3,97920,0,0
4,5,1,1,1,1,1,6,11.776844,1,2,59736,0,0
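
The transform used above is `log2(x + 1)`. The `+ 1` keeps zero claims at exactly 0 instead of producing negative infinity, and the transform can be inverted with `2**t - 1`. A small sketch, using a few claim amounts that appear in the data:

```python
import numpy as np

claim_amount = np.array([0, 2599, 5790, 31894])

# Forward transform used in the notebook: log2(x + 1) maps a zero claim to 0.
transformed = np.log2(claim_amount + 1)

# Inverse transform recovers the original amounts: 2**t - 1.
recovered = np.exp2(transformed) - 1

print(transformed.round(3))
print(recovered.round(0))
```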


Splitting the data into training and validation sets.

In [19]:
# Separate features and target
X = train.drop(columns=['cltv', 'id'])
y = train['cltv']

# 20% data as validation set
X_train,X_valid,y_train,y_valid = train_test_split(X,y,test_size=0.2,random_state=22)

## Step 5: Training the Model

### Method1


---


Train several models and compare their RMSE and MAE on the training and validation sets.

In [11]:
from sklearn.linear_model import Ridge, Lasso, LinearRegression, ElasticNet
from sklearn.tree import DecisionTreeRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.svm import SVR
import xgboost as xg

In [13]:
algos = [ElasticNet(),KNeighborsRegressor(),RandomForestRegressor(),GradientBoostingRegressor(),SVR()]
names = ['ElasticNet',
         'K Neighbors Regressor','RandomForestRegressor','gb regressor','SVR']


In [14]:
rmse_list = []
mae_list = []
rmse_list_train = []
mae_list_train = []
for model in algos:
    model.fit(X_train,y_train)
    y_pred = model.predict(X_valid)

    MSE= metrics.mean_squared_error(y_valid,y_pred)
    MAE= metrics.mean_absolute_error(y_valid,y_pred)   
    rmse = np.sqrt(MSE)
    rmse_list.append(rmse)
    mae_list.append(MAE)

    Y_pred_train = model.predict(X_train)
    MSE_train= metrics.mean_squared_error(y_train,Y_pred_train)
    MAE_train= metrics.mean_absolute_error(y_train,Y_pred_train)
    rmse_train = np.sqrt(MSE_train)
    rmse_list_train.append(rmse_train)
    mae_list_train.append(MAE_train)

In [15]:
evaluation = pd.DataFrame({'Model': names,'RMSE_train': rmse_list_train,'MAE_test_train':mae_list_train,
                           'RMSE_test': rmse_list,'MAE_test':mae_list})
evaluation

Unnamed: 0,Model,RMSE_train,MAE_test_train,RMSE_test,MAE_test
0,ElasticNet,86675.997932,53961.887342,86307.659587,53360.477715
1,K Neighbors Regressor,74712.810851,45864.694363,90536.870792,55697.51655
2,RandomForestRegressor,38545.231712,23231.518835,88883.717533,55173.525933
3,gb regressor,82764.200673,50411.630754,82833.581799,50388.863898
4,SVR,95914.973158,48695.868808,95433.487048,48148.684576


GradientBoostingRegressor has the lowest validation RMSE and almost no gap between training and validation error, so it is used for the final model.
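
The validation RMSE values in the table are close to the standard deviation of `cltv` (about 90,614 in `train.describe()`), so it helps to compare each model against a predictor that always returns the training mean. The sketch below shows the idea on synthetic stand-in data, not on the competition data; the `rmse` helper and all `*_demo` names are assumptions made for the example.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: one informative feature plus noise.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 3))
y_demo = 50_000 + 20_000 * X_demo[:, 0] + rng.normal(scale=10_000, size=500)

X_tr, X_va, y_tr, y_va = train_test_split(X_demo, y_demo, test_size=0.2, random_state=22)

def rmse(model):
    """Fit on the training split and return RMSE on the validation split."""
    model.fit(X_tr, y_tr)
    return np.sqrt(mean_squared_error(y_va, model.predict(X_va)))

baseline_rmse = rmse(DummyRegressor(strategy="mean"))
gbr_rmse = rmse(GradientBoostingRegressor(random_state=0))
print(f"baseline: {baseline_rmse:.0f}, gradient boosting: {gbr_rmse:.0f}")
```

A model whose RMSE is only slightly below the baseline explains little of the variance, even if it ranks first among the candidates.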


---
Preparing test data.


In [16]:
test['vintage'] = test['vintage'].apply(lambda x: num_vint(x))
test['qualification'] = test['qualification'].apply(lambda x: qualification_ord(x))
test['num_policies'] = test['num_policies'].apply(lambda x: num_policies_ord(x))
test['income'] = test['income'].apply(lambda x: income_ord(x))
test["type_of_policy"] = test["type_of_policy"].apply(lambda x: getattr(Colors, x).value)
test['gender'] = test['gender'].apply(lambda x: 1 if x == "Male" else 0)
test['area'] = test['area'].apply(lambda x: 1 if x == "Urban" else 0)
test = dummy(test,'policy')
test.head()

Unnamed: 0,id,gender,area,qualification,income,marital_status,vintage,claim_amount,num_policies,type_of_policy,B,C
0,89393,0,0,1,2,0,6,2134,1,1,1,0
1,89394,0,1,1,3,0,4,4102,1,3,0,0
2,89395,1,0,1,2,1,7,2925,1,2,1,0
3,89396,0,0,2,1,1,2,0,1,1,1,0
4,89397,0,1,1,3,0,5,14059,1,1,1,0


In [17]:
test['claim_amount'] = np.log2(test['claim_amount']+1)
test.head()

Unnamed: 0,id,gender,area,qualification,income,marital_status,vintage,claim_amount,num_policies,type_of_policy,B,C
0,89393,0,0,1,2,0,6,11.06002,1,1,1,0
1,89394,0,1,1,3,0,4,12.002463,1,3,0,0
2,89395,1,0,1,2,1,7,11.514714,1,2,1,0
3,89396,0,0,2,1,1,2,0.0,1,1,1,0
4,89397,0,1,1,3,0,5,13.779309,1,1,1,0


In [18]:
test_ids = test['id']
X_t = test.drop(columns=['id'])

In [31]:
# Fit three regressors on the full training data and average their test predictions.
model = GradientBoostingRegressor()
model.fit(X, y)
final_predictions = model.predict(X_t)

model1 = SVR()
model1.fit(X, y)
final_predictions1 = model1.predict(X_t)

model2 = xg.XGBRegressor()
model2.fit(X, y)
final_predictions2 = model2.predict(X_t)

ensemble = (final_predictions + final_predictions1 + final_predictions2) / 3
final = pd.DataFrame({'id': test_ids, 'cltv': ensemble})
final.head()

Use cross-validation to get a more reliable estimate of model performance.

In [20]:
from sklearn.model_selection import ShuffleSplit
from sklearn.model_selection import cross_val_score

cv=ShuffleSplit(n_splits=5,test_size=0.2,random_state=0)

print("gb regressor     : ",cross_val_score(GradientBoostingRegressor(),X,y,cv=cv,scoring='r2'))
print("XGB   : ",cross_val_score(xg.XGBRegressor(),X,y,cv=cv,scoring='r2'))

gb regressor     :  [0.16487422 0.15936621 0.16576758 0.15745867 0.15604342]
XGB   :  [0.1647442  0.15907814 0.16606446 0.15744098 0.15613289]
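
The R² scores above can be related to the RMSE values reported earlier. Since R² = 1 - MSE / Var(y), the implied RMSE is roughly `std(y) * sqrt(1 - R²)`. The check below plugs in the `cltv` standard deviation from `train.describe()` and an R² of 0.16; it is an approximation, because the standard deviation of each validation split differs slightly from that of the full training set.

```python
import numpy as np

cltv_std = 90613.814793   # std of cltv from train.describe()
r2 = 0.16                 # approximate cross-validated R² reported above

# R² = 1 - MSE / Var(y)  =>  RMSE ≈ std(y) * sqrt(1 - R²)
implied_rmse = cltv_std * np.sqrt(1 - r2)
print(round(implied_rmse))
```

The result is about 83,000, in line with the validation RMSE of roughly 82,800 measured for the gradient boosting model.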


In [None]:
final.to_csv(r'C:\Users\panch\OneDrive\Desktop\AnalyTics Vidhya\cltv\result5_onlygb.csv',index = False)

### Method2


---


Scale the numeric columns with StandardScaler and train an SVR model.

In [None]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

def scaleColumns(df, cols_to_scale):
    # Standardize the selected columns (zero mean, unit variance per column).
    df[cols_to_scale] = scaler.fit_transform(df[cols_to_scale])
    return df

scaled_df = scaleColumns(train,['claim_amount','vintage'])
scaled_df

In [None]:
# Separate features and target
X = scaled_df.drop(columns=['cltv', 'id'])
y = scaled_df['cltv']

# 20% data as validation set
X_train,X_valid,y_train,y_valid = train_test_split(X,y,test_size=0.2,random_state=22)

In [None]:
from sklearn.model_selection import ShuffleSplit
from sklearn.model_selection import cross_val_score

cv=ShuffleSplit(n_splits=5,test_size=0.2,random_state=0)

print("SVR     : " ,cross_val_score(SVR(kernel = 'linear', C = 2, epsilon = .5),X,y,cv=cv,scoring='r2'))
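
In the cells above the scaler is fitted on the whole training frame before cross-validation, so each held-out split contributes to the scaling statistics. One common alternative is to put the scaler and the model in a pipeline, so that scaling is refitted inside every split. The sketch below uses synthetic stand-in data and, unlike the notebook, scales every feature rather than only `claim_amount` and `vintage`; both choices are assumptions made to keep the example short.

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in data (not the competition data).
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=5.0, scale=3.0, size=(200, 2))
y_demo = 2.0 * X_demo[:, 0] - X_demo[:, 1] + rng.normal(scale=0.1, size=200)

# The scaler is refitted on the training part of every split, so the
# held-out part never influences the scaling statistics.
pipeline = make_pipeline(StandardScaler(), SVR(kernel="linear", C=2, epsilon=0.5))

cv = ShuffleSplit(n_splits=5, test_size=0.2, random_state=0)
scores = cross_val_score(pipeline, X_demo, y_demo, cv=cv, scoring="r2")
print(scores)
```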

### Method3


---


Train a small DNN model using Keras.

In [None]:
import tensorflow as tf
from tensorflow import keras

modeld = keras.Sequential([
    keras.layers.Dense(8, input_dim=11, activation='relu'),
    keras.layers.Dropout(0.5),
    keras.layers.Dense(5, activation='relu'),
    keras.layers.Dropout(0.5),
    keras.layers.Dense(3, activation='relu'),
    keras.layers.Dense(1)
])

modeld.compile(loss='mean_squared_logarithmic_error', optimizer='adam', metrics=['mean_squared_logarithmic_error'])

modeld.fit(X_train, y_train, epochs=2, batch_size=32)

In [None]:
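# The network was trained with MSLE, so report RMSLE on the validation set.
# mean_squared_log_error requires non-negative predictions.
y_pred = modeld.predict(X_valid).ravel()
msle = metrics.mean_squared_log_error(y_valid, y_pred)
rmsle = np.sqrt(msle)
evaluation = pd.DataFrame({'Model': ['Keras DNN'], 'RMSLE': [rmsle]})
evaluation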
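
For reference, the mean squared logarithmic error used as the Keras loss and in the evaluation above is the mean of `(log(1 + y_true) - log(1 + y_pred))**2`. The sketch below computes it by hand and compares it with scikit-learn's implementation. The true values are CLTV figures from the training sample; the predictions are made up.

```python
import numpy as np
from sklearn.metrics import mean_squared_log_error

y_true = np.array([64308.0, 515400.0, 97920.0])
y_pred = np.array([70000.0, 300000.0, 90000.0])   # made-up predictions

# MSLE = mean((log(1 + y_true) - log(1 + y_pred))**2); inputs must be non-negative.
msle_manual = np.mean((np.log1p(y_true) - np.log1p(y_pred)) ** 2)
rmsle = np.sqrt(msle_manual)

assert np.isclose(msle_manual, mean_squared_log_error(y_true, y_pred))
print(msle_manual, rmsle)
```

Because the error is measured on a log scale, it penalizes relative rather than absolute differences, so its values are not comparable with the RMSE figures from Method1.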

### Method4


---


Tune a LightGBM model with Bayesian optimization and 5-fold cross-validation.

In [33]:
# This function will use bayesian optimization for searching the best parameters for our dataset
def search_best_param(X,y,cat_features):
    
    # trainXY = lgb.Dataset(data=X, label=y,categorical_feature = cat_features,free_raw_data=False)
    trainXY = lgb.Dataset(data=X, label=y,free_raw_data=False)
    def lightGBM_CV(max_depth, num_leaves, n_estimators, learning_rate, subsample, colsample_bytree, 
                lambda_l1, lambda_l2, min_child_weight):
    
        params = {'boosting_type': 'gbdt', 'objective': 'regression', 'metric':'rmse', 'verbose': -1,
                  'early_stopping_round':500}
        
        params['max_depth'] = int(round(max_depth))
        params["num_leaves"] = int(round(num_leaves))
        params["n_estimators"] = int(round(n_estimators))
        params['learning_rate'] = learning_rate
        params['subsample'] = subsample
        params['colsample_bytree'] = colsample_bytree
        params['lambda_l1'] = max(lambda_l1, 0)
        params['lambda_l2'] = max(lambda_l2, 0)
        params['min_child_weight'] = min_child_weight
    
        score = lgb.cv(params, trainXY, nfold=5, seed=1, stratified=False, verbose_eval =False, metrics=['rmse'])

        return -np.min(score['rmse-mean']) 

    # using bayesian optimization for the best hyper-parameters
    lightGBM_Bo = BayesianOptimization(lightGBM_CV, 
                                       {
                                          'max_depth': (5, 100),
                                          'num_leaves': (20, 200),
                                          'n_estimators': (50, 10000),
                                          'learning_rate': (0.005, 0.3),
                                          'subsample': (0.1, 1),
                                          'colsample_bytree' :(0.1, 0.99),
                                          'lambda_l1': (0, 5),
                                          'lambda_l2': (0, 3),
                                          'min_child_weight': (2, 100) 
                                      },
                                       random_state = 1,
                                       verbose = 0
                                      )
    np.random.seed(1)
    
    lightGBM_Bo.maximize(init_points=5, n_iter=25) 
    
    params_set = lightGBM_Bo.max['params']
    
    # get the params of the maximum target     
    max_target = -np.inf
    for i in lightGBM_Bo.res: # loop through all evaluated parameter sets
        if i['target'] > max_target:
            params_set = i['params']
            max_target = i['target']
    
    params_set.update({'verbose': -1})
    params_set.update({'metric': 'rmse'})
    params_set.update({'boosting_type': 'gbdt'})
    params_set.update({'objective': 'regression'})
    
    params_set['max_depth'] = int(round(params_set['max_depth']))
    params_set['num_leaves'] = int(round(params_set['num_leaves']))
    params_set['n_estimators'] = int(round(params_set['n_estimators']))
    params_set['seed'] = 1 #set seed
    
    return params_set

# This function will apply 5 fold cross validation with early stopping
def K_Fold_LightGBM(X_train, y_train , cat_features, num_folds = 5):
    num = 0
    models = []
    folds = KFold(n_splits=num_folds, shuffle=True, random_state=0)

    avg_r2 = 0
    for n_fold, (train_idx, valid_idx) in enumerate (folds.split(X_train, y_train)):
        print(f"model : {num}")
        train_X, train_y = X_train.iloc[train_idx], y_train.iloc[train_idx]
        valid_X, valid_y = X_train.iloc[valid_idx], y_train.iloc[valid_idx]
        
        # train_data=lgb.Dataset(train_X,label=train_y, categorical_feature = cat_features,free_raw_data=False)
        # valid_data=lgb.Dataset(valid_X,label=valid_y, categorical_feature = cat_features,free_raw_data=False)
        
        # params_set = search_best_param(train_X,train_y,cat_features)
        train_data=lgb.Dataset(train_X,label=train_y,free_raw_data=False)
        valid_data=lgb.Dataset(valid_X,label=valid_y,free_raw_data=False)
        
        params_set = search_best_param(train_X,train_y,cat_features)
        print("para set :",params_set)
        CV_LGBM = lgb.train(params_set,
                            train_data,
                            num_boost_round = 2500,
                            valid_sets = valid_data,
                            early_stopping_rounds = 200,
                            verbose_eval = 100
                           )
        
    
        # increasing early_stopping_rounds can lead to overfitting 
        models.append(CV_LGBM)
        tr_r2 = r2_score(train_y,models[num].predict(train_X))
        test_r2 = r2_score(valid_y,models[num].predict(valid_X))
        print("Train set r2:", tr_r2)
        print("Test set r2:", test_r2)
        print("\n")
        num = num + 1
        avg_r2 = avg_r2 + test_r2
    mean_r2 =   avg_r2/num_folds
    print("mean r2", mean_r2)
    return models

# This function will take model and test data and will predict our test data with that model.
def predict_cv(model,X):
    y_preds = model.predict(X)
    return y_preds


cat_features = None  # placeholder: this version does not pass categorical features to LightGBM
lgbm_models = K_Fold_LightGBM(X,y,cat_features,5)

model : 0
para set : {'colsample_bytree': 0.6810774990342702, 'lambda_l1': 4.816404352502415, 'lambda_l2': 2.522967575291378, 'learning_rate': 0.16327225751720942, 'max_depth': 5, 'min_child_weight': 65.0425109142572, 'n_estimators': 8944, 'num_leaves': 21, 'subsample': 0.362156147474192, 'verbose': -1, 'metric': 'rmse', 'boosting_type': 'gbdt', 'objective': 'regression', 'seed': 1}
Training until validation scores don't improve for 200 rounds.
[100]	valid_0's rmse: 82657.9
[200]	valid_0's rmse: 82779.5
Early stopping, best iteration is:
[33]	valid_0's rmse: 82566
Train set r2: 0.1670730836270773
Test set r2: 0.16579163095921323


model : 1
para set : {'colsample_bytree': 0.6810774990342702, 'lambda_l1': 4.816404352502415, 'lambda_l2': 2.522967575291378, 'learning_rate': 0.16327225751720942, 'max_depth': 5, 'min_child_weight': 65.0425109142572, 'n_estimators': 8944, 'num_leaves': 21, 'subsample': 0.362156147474192, 'verbose': -1, 'metric': 'rmse', 'boosting_type': 'gbdt', 'objective': 
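
`K_Fold_LightGBM` returns one model per fold, and a later cell averages the predictions of only the first and last of them. A sketch of averaging the predictions of every fold model is shown below. To keep it independent of the installed LightGBM version it uses scikit-learn's `GradientBoostingRegressor` as a stand-in and synthetic data; both are assumptions made for the example, not the notebook's actual setup.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import KFold

# Synthetic stand-in data (not the competition data).
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(300, 4))
y_demo = 3.0 * X_demo[:, 0] + rng.normal(scale=0.5, size=300)
X_new = rng.normal(size=(10, 4))   # plays the role of the unseen test set

fold_models, fold_r2 = [], []
for train_idx, valid_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X_demo):
    model = GradientBoostingRegressor(random_state=0)
    model.fit(X_demo[train_idx], y_demo[train_idx])
    fold_r2.append(r2_score(y_demo[valid_idx], model.predict(X_demo[valid_idx])))
    fold_models.append(model)

# Average the predictions of every fold model instead of a hand-picked subset.
ensemble_pred = np.mean([m.predict(X_new) for m in fold_models], axis=0)
print(np.mean(fold_r2), ensemble_pred.shape)
```

Using all fold models gives each training row equal influence on the final prediction, since every row is held out of exactly one model.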

In [34]:
# Importing required libraries
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold
from sklearn.metrics import r2_score
import lightgbm as lgb
from lightgbm import LGBMRegressor 
from bayes_opt import BayesianOptimization
import warnings
warnings.filterwarnings("ignore")


# This function will use bayesian optimization for searching the best parameters for our dataset
def search_best_param(X,y,cat_features):
    
    trainXY = lgb.Dataset(data=X, label=y,categorical_feature = cat_features,free_raw_data=False)
    def lightGBM_CV(max_depth, num_leaves, n_estimators, learning_rate, subsample, colsample_bytree, 
                lambda_l1, lambda_l2, min_child_weight):
    
        params = {'boosting_type': 'gbdt', 'objective': 'regression', 'metric':'rmse', 'verbose': -1,
                  'early_stopping_round':500}
        
        params['max_depth'] = int(round(max_depth))
        params["num_leaves"] = int(round(num_leaves))
        params["n_estimators"] = int(round(n_estimators))
        params['learning_rate'] = learning_rate
        params['subsample'] = subsample
        params['colsample_bytree'] = colsample_bytree
        params['lambda_l1'] = max(lambda_l1, 0)
        params['lambda_l2'] = max(lambda_l2, 0)
        params['min_child_weight'] = min_child_weight
    
        score = lgb.cv(params, trainXY, nfold=5, seed=1, stratified=False, verbose_eval =False, metrics=['rmse'])

        return -np.min(score['rmse-mean']) 

    # using bayesian optimization for the best hyper-parameters
    lightGBM_Bo = BayesianOptimization(lightGBM_CV, 
                                       {
                                          'max_depth': (5, 100),
                                          'num_leaves': (20, 200),
                                          'n_estimators': (50, 10000),
                                          'learning_rate': (0.005, 0.3),
                                          'subsample': (0.1, 1),
                                          'colsample_bytree' :(0.1, 0.99),
                                          'lambda_l1': (0, 5),
                                          'lambda_l2': (0, 3),
                                          'min_child_weight': (2, 100) 
                                      },
                                       random_state = 1,
                                       verbose = 0
                                      )
    np.random.seed(1)
    
    lightGBM_Bo.maximize(init_points=5, n_iter=25) 
    
    params_set = lightGBM_Bo.max['params']
    
    # get the params of the maximum target     
    max_target = -np.inf
    for i in lightGBM_Bo.res: # loop through all evaluated parameter sets
        if i['target'] > max_target:
            params_set = i['params']
            max_target = i['target']
    
    params_set.update({'verbose': -1})
    params_set.update({'metric': 'rmse'})
    params_set.update({'boosting_type': 'gbdt'})
    params_set.update({'objective': 'regression'})
    
    params_set['max_depth'] = int(round(params_set['max_depth']))
    params_set['num_leaves'] = int(round(params_set['num_leaves']))
    params_set['n_estimators'] = int(round(params_set['n_estimators']))
    params_set['seed'] = 1 #set seed
    
    return params_set

# This function will apply 5 fold cross validation with early stopping
def K_Fold_LightGBM(X_train, y_train , cat_features, num_folds = 5):
    num = 0
    models = []
    folds = KFold(n_splits=num_folds, shuffle=True, random_state=0)

    avg_r2 = 0
    for n_fold, (train_idx, valid_idx) in enumerate (folds.split(X_train, y_train)):
        print(f"model : {num}")
        train_X, train_y = X_train.iloc[train_idx], y_train.iloc[train_idx]
        valid_X, valid_y = X_train.iloc[valid_idx], y_train.iloc[valid_idx]
        
        train_data=lgb.Dataset(train_X,label=train_y, categorical_feature = cat_features,free_raw_data=False)
        valid_data=lgb.Dataset(valid_X,label=valid_y, categorical_feature = cat_features,free_raw_data=False)
        
        params_set = search_best_param(train_X,train_y,cat_features)
        print("parameter   : ",params_set)
        CV_LGBM = lgb.train(params_set,
                            train_data,
                            num_boost_round = 2500,
                            valid_sets = valid_data,
                            early_stopping_rounds = 200,
                            verbose_eval = 100
                           )
        
    
        # increasing early_stopping_rounds can lead to overfitting 
        models.append(CV_LGBM)
        tr_r2 = r2_score(train_y,models[num].predict(train_X))
        test_r2 = r2_score(valid_y,models[num].predict(valid_X))
        print("Train set r2:", tr_r2)
        print("Test set r2:", test_r2)
        print("\n")
        num = num + 1
        avg_r2 = avg_r2 + test_r2
    mean_r2 =   avg_r2/num_folds
    print("mean r2", mean_r2)
    return models

# This function will take model and test data and will predict our test data with that model.
def predict_cv(model,X):
    y_preds = model.predict(X)
    return y_preds

#Reading train, test and submission files
train = pd.read_csv("/content/drive/MyDrive/Jobathon/20 1 23/train_BRCpofr.csv")
test = pd.read_csv("/content/drive/MyDrive/Jobathon/20 1 23/test_koRSKBP.csv")
# submission = pd.read_csv("/kaggle/input/hackathon/sample_submission.csv")

#Data Analysis and Pre-processing

# Separating dependent and independent features from dataframes
train_x = train.iloc[:,1:-1] # Removing 'id' and target column from the train dataset
train_y = train.iloc[:,-1:] # Creating a dataframe with only target column
test_x = test.iloc[:,1:] # Removing 'id' column from the test dataset

# Combining both train and test independent features for feature engineering at a single time
X = pd.concat([train_x, test_x], axis=0)

# One hot encoding all the categorical columns and removing the first one
one_hot_columns = ['gender', 'area','qualification','income','num_policies','policy','type_of_policy']
X_ohe = pd.get_dummies(data=X, columns=one_hot_columns,drop_first=True)
X = X_ohe

# Separating back train and test datasets based on the number of training rows
n_train = len(train)
train_x = X.iloc[:n_train,:]
test_x = X.iloc[n_train:,:]
train_y = train_y['cltv']

# Model Building
# In our data, only 'vintage' and 'claim_amount' are not categorical columns, so all other columns are passed to LightGBM as categorical features
column_names = list(train_x.columns)
cat_features = [i for i in column_names if i not in ['vintage','claim_amount']]

# Training our data for 5 folds and getting our 5 models as outputs for prediction
lgbm_models = K_Fold_LightGBM(train_x,train_y,cat_features,5)

# Inference
# ensembling predictions from first and last model
# predicting on test dataset using first and fifth model
y_prediction_1 = predict_cv(lgbm_models[0],test_x)
y_prediction_5 = predict_cv(lgbm_models[-1],test_x)

y_prediction_final = (y_prediction_1 + y_prediction_5)/2

# Generating Submission File

# Building the submission dataframe from the test ids and the predicted values
submission = pd.DataFrame({'id': test['id'], 'cltv': y_prediction_final})

# # Saving the submission dataframe to a csv file
# submission.to_csv("final_prediction_test.csv",index=False)



model : 0
parameter   :  {'colsample_bytree': 0.99, 'lambda_l1': 3.0027303913927157, 'lambda_l2': 0.0, 'learning_rate': 0.005, 'max_depth': 52, 'min_child_weight': 2.0, 'n_estimators': 1750, 'num_leaves': 20, 'subsample': 1.0, 'verbose': -1, 'metric': 'rmse', 'boosting_type': 'gbdt', 'objective': 'regression', 'seed': 1}
Training until validation scores don't improve for 200 rounds.
[100]	valid_0's rmse: 85723
[200]	valid_0's rmse: 83894.4
[300]	valid_0's rmse: 83135
[400]	valid_0's rmse: 82823.6
[500]	valid_0's rmse: 82685.5
[600]	valid_0's rmse: 82614.5
[700]	valid_0's rmse: 82587.9
[800]	valid_0's rmse: 82574.4
[900]	valid_0's rmse: 82563.4
[1000]	valid_0's rmse: 82557
[1100]	valid_0's rmse: 82553.7
[1200]	valid_0's rmse: 82549.7
[1300]	valid_0's rmse: 82547.9
[1400]	valid_0's rmse: 82550.5
Early stopping, best iteration is:
[1291]	valid_0's rmse: 82547.3
Train set r2: 0.1733902112431206
Test set r2: 0.16616948292575495


model : 1
parameter   :  {'colsample_bytree': 0.99, 'lambda_l