<center><h1>Credit Card Lead Prediction 💳💳</h1>
    <h2> XGBoost, LightGBM, CatBoost, NN Review </h2>
<img class='center' height="600" width="800" src="https://cdn.britannica.com/02/160902-050-B58BAD84/Credit-cards.jpg">
</center>
    


# Credit Card Lead Prediction

Happy Customer Bank is a mid-sized private bank that deals in all kinds of banking products, like Savings accounts, Current accounts, investment products, credit products, among other offerings.



The bank also cross-sells products to its existing customers and to do so they use different kinds of communication like tele-calling, e-mails, recommendations on net banking, mobile banking, etc. 



In this case, the Happy Customer Bank wants to cross sell its credit cards to its existing customers. The bank has identified a set of customers that are eligible for taking these credit cards.



Now, the bank is looking for your help in identifying customers that could show higher intent towards a recommended credit card, given:

* Customer details (gender, age, region etc.)
* Details of his/her relationship with the bank (Channel_Code, Vintage, Avg_Asset_Value, etc.)

## Data Dictionary 
| Variable      | Definition |
| ----------- | ----------- |
| ID      | Unique Identifier for a row |
| Gender      | Gender of the Customer |
| Age      | Age of the Customer (in Years) |
| Region_Code      | Code of the Region for the customers |
| Occupation      | Occupation Type for the customer |
| Channel_Code      | Acquisition Channel Code for the Customer  (Encoded) |
| Vintage      | Vintage for the Customer (In Months) |
| Credit_Product      | If the Customer has any active credit product (Home loan,Personal loan, Credit Card etc.) |
| Avg_Account_Balance      | Average Account Balance for the Customer in last 12 Months |
| Is_Active      | If the Customer is Active in last 3 Months |
| Is_Lead(Target)      | If the Customer is interested for the Credit Card  {0 : Customer is not interested, 1 : Customer is interested} |

# Data Understanding

In [None]:
# !pip install catboost
# !pip install featuretools

# Suppress warnings

import warnings
warnings.filterwarnings('ignore')

# Importing necessary libraries

import os
import chardet

import numpy as np
import pandas as pd

# visualization libraries
import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use('fivethirtyeight')

# Model analysis and building libraries

from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.model_selection import cross_validate, cross_val_score
from sklearn.preprocessing import MinMaxScaler, StandardScaler, RobustScaler
from sklearn.impute import KNNImputer
import featuretools as ft

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from lightgbm import LGBMClassifier
from sklearn.preprocessing import LabelEncoder
from catboost import CatBoostClassifier
import xgboost as xgb
from xgboost.sklearn import XGBClassifier

from imblearn.over_sampling import SMOTE, SMOTENC
from imblearn.under_sampling import RandomUnderSampler

from sklearn.metrics import roc_auc_score

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers, Sequential, optimizers

In [None]:
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))


In [None]:
# Encoding is in standard ascii format

train = pd.read_csv('/kaggle/input/jobathon-may-2021-credit-card-lead-prediction/train.csv')
test = pd.read_csv('/kaggle/input/jobathon-may-2021-credit-card-lead-prediction/test.csv')

# Copy of train set for EDA
df = train.copy()
test_df = test.copy()

In [None]:
# Number of features and corresponding records

print('Training data shape: ',train.shape)
print('Testing data shape: ',test.shape)

* We have `246725` and `105312` rows in Train and Test dataset respectively.
* We have around `10` independent features in the dataset

In [None]:
df.head(5)

In [None]:
# Distribution of features data types

df.dtypes

Categorical Features in dataset: 7

Numerical features in dataset: 4

In [None]:
# Feature Analysis

df.describe(include='all').T

In [None]:
# Checking the Null values

df.info()

In [None]:
# Percentage of Null values in each column

round(df.isnull().sum()/len(df)*100, 2)

About 12% of the values in the `Credit_Product` feature are null.

# Data Cleaning


## Credit Product

`Credit_Product` : If the Customer has any active credit product (Home loan, Personal loan, Credit Card etc.)

We have `12%` of Null values in credit product

Ways to handle Null Values

* Drop the null value rows
* Impute the Null values
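
Both options can be tried on a toy frame before committing to either. The sketch below uses made-up values and a hypothetical `toy` frame; only the `Credit_Product` and `Is_Lead` column names come from the dataset.

```python
import numpy as np
import pandas as pd

# Toy data (made up) with the same two columns as the real dataset
toy = pd.DataFrame({
    "Credit_Product": ["Yes", "No", np.nan, "No", np.nan, "No"],
    "Is_Lead":        [1,     0,    1,      0,    1,      0],
})

# Option 1: drop rows where Credit_Product is missing
dropped = toy.dropna(subset=["Credit_Product"])

# Option 2: impute missing values with the most frequent category (mode)
mode_value = toy["Credit_Product"].mode()[0]
imputed = toy.fillna({"Credit_Product": mode_value})

# Lead rate among the rows with a missing value, to check whether
# missingness itself carries information about the target
lead_rate_missing = toy.loc[toy["Credit_Product"].isna(), "Is_Lead"].mean()
```

If the lead rate among the missing rows differs a lot from the rest, mode imputation would hide that signal, which is one argument for not imputing blindly.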


In [None]:
print('Null values: ', df.Credit_Product.isnull().sum())

In [None]:
ax = sns.countplot(x=df.Credit_Product, hue=df.Is_Lead)
ax.set_title('Credit Product Distribution')
plt.show()

* There is a trend in the output: customers who already hold a credit product are more likely to become a lead.

* Because the feature is informative and the dataset is large, we discard the records with null values rather than imputing them with the mode.

In [None]:
# Dropping Null values

df = df[~df['Credit_Product'].isna()]

# Feature Engineering and Preprocessing

## ID

`ID` : 	Unique Identifier for a row

* This feature will be dropped: every row has a unique value, so it carries no trend the model could learn.

In [None]:
print('Total number of records: ', len(df.ID))
print('Number of unique IDs: ', len(df.ID.unique()))

In [None]:
# Dropping the ID feature

df = df.drop(['ID'], axis=1)
test_df = test_df.drop(['ID'], axis=1)

## Credit Product

`Credit_Product`: If the Customer has any active credit product (Home loan, Personal loan, Credit Card etc.)

* Need to perform Binary Encoding

In [None]:
test_df.head(3)

In [None]:
df.Credit_Product = df.Credit_Product.map({'Yes':1,'No':0})
test_df.Credit_Product = test_df.Credit_Product.map({'Yes':1,'No':0})

## Gender

`Gender`: Gender of the Customer


In [None]:
df.Gender.value_counts()

In [None]:
ax = sns.countplot(x=df.Gender, hue=df.Is_Lead)
ax.set_title('Gender Distribution in Dataset')
plt.show()

> There is no strong bias in the Gender feature; Male and Female customers are roughly equally represented.

**Dummy Encoding**

Categorical variables are dummy encoded, dropping the first column to avoid collinearity among the indicator columns.
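
As a minimal sketch of what `drop_first=True` does, and of one way to keep train and test columns aligned, consider the toy example below. The two frames and the `reindex` alignment step are illustrative assumptions, not part of the original pipeline.

```python
import pandas as pd

train_toy = pd.DataFrame({"Occupation": ["Entrepreneur", "Other", "Salaried", "Self_Employed"]})
test_toy = pd.DataFrame({"Occupation": ["Other", "Salaried", "Salaried"]})

# k categories -> k-1 indicator columns; the first (alphabetical) level is the baseline
train_dummies = pd.get_dummies(train_toy["Occupation"], drop_first=True)

# Encoding the test set separately could drop a different baseline level,
# so encode it without drop_first and reindex to the training columns instead
test_dummies = pd.get_dummies(test_toy["Occupation"]).reindex(
    columns=train_dummies.columns, fill_value=0
)
```

When every category appears in both sets, as seems to be the case here, encoding each set separately gives the same columns; the alignment step only matters when a level is missing from one of them.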

In [None]:
dummy_encoding = pd.get_dummies(df['Gender'], drop_first=True)

# Concatenating with existing dataframe
df = pd.concat([df, dummy_encoding], axis=1)

# Drop parent category column which are encoded 
df = df.drop(['Gender'] , axis=1)

dummy_encoding = pd.get_dummies(test_df['Gender'], drop_first=True)

# Concatenating with existing dataframe
test_df = pd.concat([test_df, dummy_encoding], axis=1)

# Drop parent category column which are encoded 
test_df = test_df.drop(['Gender'] , axis=1)

In [None]:
df.head()

## Occupation

`Occupation`: Occupation Type for the customer


In [None]:
df.Occupation.value_counts()

In [None]:
f,ax = plt.subplots(nrows=1,ncols=2,figsize=(15,6))
sns.countplot(x=df.Occupation, hue=df.Is_Lead, ax=ax[0])
ax[0].set_title('Occupation Distribution')
# Restrict both x and hue to the Entrepreneur rows so the two vectors stay aligned
entrepreneurs = df[df.Occupation=='Entrepreneur']
sns.countplot(x=entrepreneurs.Occupation, hue=entrepreneurs.Is_Lead, ax=ax[1])
ax[1].set_title('Entrepreneur Distribution')
plt.tight_layout()

> Entrepreneurs make up only a small share of the dataset, possibly because Happy Customer Bank is a mid-sized bank.

> However, Entrepreneurs have a high probability of becoming a lead.

**Dummy Encoding**

Categorical variables are dummy encoded, dropping the first column to avoid collinearity among the indicator columns.

In [None]:
dummy_encoding = pd.get_dummies(df['Occupation'], drop_first=True)

# Concatenating with existing dataframe
df = pd.concat([df, dummy_encoding], axis=1)

# Drop parent category column which are encoded 
df = df.drop(['Occupation'] , axis=1)

dummy_encoding = pd.get_dummies(test_df['Occupation'], drop_first=True)

# Concatenating with existing dataframe
test_df = pd.concat([test_df, dummy_encoding], axis=1)

# Drop parent category column which are encoded 
test_df = test_df.drop(['Occupation'] , axis=1)

In [None]:
df.head()

## Channel Code

`Channel_Code`: Acquisition Channel Code for the Customer  (Encoded)


In [None]:
df.Channel_Code.value_counts()

In [None]:
ax = sns.countplot(x=df.Channel_Code)
ax.set_title('Channel Distribution')
plt.show()

> X4 channel has less weightage in the dataset

In [None]:
dummy_encoding = pd.get_dummies(df['Channel_Code'], drop_first=True)

# Concatenating with existing dataframe
df = pd.concat([df, dummy_encoding], axis=1)

# Drop parent category column which are encoded 
df = df.drop(['Channel_Code'] , axis=1)

dummy_encoding = pd.get_dummies(test_df['Channel_Code'], drop_first=True)

# Concatenating with existing dataframe
test_df = pd.concat([test_df, dummy_encoding], axis=1)

# Drop parent category column which are encoded 
test_df = test_df.drop(['Channel_Code'] , axis=1)

## Is Active

`Is_Active` : If the Customer is Active in last 3 Months

In [None]:
ax = sns.countplot(x=df.Is_Active, hue=df.Is_Lead)
ax.set_title('Distribution of Is_Active')
plt.show()

> Active customers have a higher probability of becoming a lead.

> As this is a binary feature, we encode Yes/No as 1/0.

In [None]:
# Binary Encoding to convert categorical values to numerical values

df.Is_Active = df.Is_Active.map({'Yes': 1, 'No': 0})
test_df.Is_Active = test_df.Is_Active.map({'Yes': 1, 'No': 0})

In [None]:
df.head()

## Region Code

`Region_Code` : Code of the Region for the customers

In [None]:
plt.figure(figsize=(15,6))
ax = sns.countplot(x=df.Region_Code, hue=df.Is_Lead)
ax.set_title('Distribution of Region Code')
plt.xticks(rotation=45)
plt.show()

1. Each region shows a different trend.

2. Dummy encoding Region_Code would add many columns, increasing model complexity and likely lowering performance.

3. Instead of categories, we use the lead probability score of each region.

> $probability\_score = \frac{no\_of\_leads\_in\_region}{ no\_of\_customers\_in\_region}$
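
The score can be worked through on a toy frame. The sketch below uses made-up rows and hypothetical names (`train_toy`, `test_toy`, `Region_Score`); the fallback to the overall lead rate for regions never seen in training is an added assumption, not something the original pipeline does.

```python
import pandas as pd

# Toy training data (made up); column names follow the dataset
train_toy = pd.DataFrame({
    "Region_Code": ["RG1", "RG1", "RG1", "RG2", "RG2", "RG3"],
    "Is_Lead":     [1,     0,     0,     1,     1,     0],
})
test_toy = pd.DataFrame({"Region_Code": ["RG1", "RG2", "RG9"]})  # RG9 is absent from training

# probability_score = leads in region / customers in region
region_score = train_toy.groupby("Region_Code")["Is_Lead"].mean()

# Regions absent from training map to NaN; fall back to the overall lead rate
global_rate = train_toy["Is_Lead"].mean()
test_toy["Region_Score"] = test_toy["Region_Code"].map(region_score).fillna(global_rate)
```

Here `RG1` scores 1/3, `RG2` scores 1.0, and the unseen `RG9` receives the overall rate of 0.5.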


In [None]:
rc_encoding = df.groupby('Region_Code')['Is_Lead'].mean().reset_index()

plt.figure(figsize=(15,6))
ax = sns.barplot(x='Region_Code', y='Is_Lead', data=rc_encoding.sort_values(by=['Is_Lead'], ascending=False));
ax.set_title('Lead Probability Distribution of Region Code')
plt.xticks(rotation=45)
plt.show()

In [None]:
# Will convert categorical variables to region lead probability

# Dictionary to map
rc_enc_dict = dict(zip(rc_encoding['Region_Code'], rc_encoding['Is_Lead']))

df.Region_Code = df.Region_Code.map(rc_enc_dict)
test_df.Region_Code = test_df.Region_Code.map(rc_enc_dict)

In [None]:
df.head()

## Average Account Balance

`Avg_Account_Balance` : Average Account Balance for the Customer in last 12 Months

In [None]:
plt.figure(figsize=(12,6))
ax = sns.distplot(df.Avg_Account_Balance/10000)
ax.set_title('Distribution of Average Account Balance (10k scale)')
plt.show()

In [None]:
plt.figure(figsize=(10,6))
ax = sns.boxplot(x=df.Avg_Account_Balance)
ax.set_title('Distribution of Average Account Balance ')
plt.show()

* We have outliers in the Average Account Balance feature

## Vintage

`Vintage` : Vintage for the Customer (In Months)

In [None]:
f,ax = plt.subplots(nrows=2,ncols=1,figsize=(12,6))
sns.distplot(df.Vintage, ax=ax[0])
ax[0].set_title('Distribution of Vintage')
sns.boxplot(x=df.Vintage, ax=ax[1])
ax[1].set_title('Distribution of Vintage')
plt.tight_layout()

## Age

`Age`: Age of the Customer (in Years)

In [None]:
plt.figure(figsize=(12, 6))
ax = sns.distplot(df.Age)
ax.set_title('Distribution of Age')
plt.show()

> Will Select Bins for Age based on Decision Tree
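
One way to read bin edges off a shallow tree is to inspect its split thresholds. The sketch below runs on synthetic ages and labels (made up, with a lead rate that jumps at age 40), so the resulting edges are illustrative only and are not the ones used in this notebook.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Synthetic ages and labels (made up): lead rate jumps at age 40
rng = np.random.default_rng(0)
age = rng.integers(23, 86, size=2000)
is_lead = (rng.random(2000) < np.where(age >= 40, 0.6, 0.1)).astype(int)

tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(age.reshape(-1, 1), is_lead)

# Internal nodes carry a split threshold; leaves are marked with feature == -2
is_split = tree.tree_.feature >= 0
bin_edges = np.sort(tree.tree_.threshold[is_split])
```

A depth-2 tree has at most three internal nodes, so it yields at most three edges and therefore at most four bins.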

In [None]:
tree_model = DecisionTreeClassifier(max_depth=2)
tree_model.fit(df.Age.to_frame(), df.Is_Lead)
df['Age_tree']=tree_model.predict_proba(df.Age.to_frame())[:,1] 

> Checking if Age Tree is a good predictor

In [None]:
fig = plt.figure()
fig = df.groupby(['Age_tree'])['Is_Lead'].mean().plot()
fig.set_title('Monotonic relationship between discretised Age and Lead')
fig.set_ylabel('Lead')
plt.show()

# A monotonic relationship indicates a good predictor

In [None]:
df.groupby(['Age_tree'])['Is_Lead'].count()

In [None]:
# Binning the Age into 4 different categories

def bin_age(x):
    if x in range(0,34): return 'Age_23_33'
    if x in range(34,36): return 'Age_34_35'
    if x in range(36,42): return 'Age_36_41'
    if x in range(42,100): return 'Age_42_85'

df.Age = df.Age.apply(bin_age)
test_df.Age = test_df.Age.apply(bin_age)

> Dummy Encoding Categorical variables

In [None]:
dummy_encoding = pd.get_dummies(df['Age'], drop_first=True)

# Concatenating with existing dataframe
df = pd.concat([df, dummy_encoding], axis=1)

# Drop parent category column which are encoded 
df = df.drop(['Age','Age_tree'] , axis=1)

dummy_encoding = pd.get_dummies(test_df['Age'], drop_first=True)

# Concatenating with existing dataframe
test_df = pd.concat([test_df, dummy_encoding], axis=1)

# Drop parent category column which are encoded 
test_df = test_df.drop(['Age'] , axis=1)

## Is Lead

`Is_Lead`: This is a target variable

In [None]:
plt.figure(figsize=(6,4))
ax = sns.countplot(x=df.Is_Lead)
ax.set_title('Distribution of Is Lead')
plt.show()

In [None]:
# Get percentage of Lead

df.Is_Lead.mean()*100

> The dataset is imbalanced.

> Non-leads outnumber leads by roughly 6:1.

In [None]:
y = df.pop('Is_Lead')

# Modelling

### Train Test Split

In [None]:
# Split first: SMOTE is then fit on the training split only, so no synthetic samples leak into the test split
X_train, X_test, y_train, y_test = train_test_split(df, y, test_size=0.20, random_state=42)

> 80:20 split (4:1) between training and test data

### SMOTE

**Handling Imbalanced Data**

A technique similar to upsampling is to create synthetic samples.

We will use imblearn's SMOTE (Synthetic Minority Oversampling Technique).

SMOTE uses a nearest-neighbours algorithm to generate new, synthetic data that we can use for training our model.

Because SMOTE interpolates between neighbouring minority-class samples instead of duplicating existing records, it oversamples the minority class with less risk of overfitting than plain duplication.
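
SMOTE's core step, interpolating between a minority sample and one of its nearest minority neighbours, can be sketched in a few lines of NumPy. This is a simplified illustration of the idea and not imblearn's implementation; `smote_like_sample` and the toy points are hypothetical.

```python
import numpy as np

def smote_like_sample(minority, k=2, rng=None):
    """Create one synthetic point between a random minority sample
    and one of its k nearest minority neighbours."""
    rng = np.random.default_rng() if rng is None else rng
    i = rng.integers(len(minority))
    # Euclidean distance from sample i to every minority sample
    dist = np.linalg.norm(minority - minority[i], axis=1)
    neighbours = np.argsort(dist)[1:k + 1]  # position 0 is the sample itself
    j = rng.choice(neighbours)
    gap = rng.random()  # interpolation factor in [0, 1)
    return minority[i] + gap * (minority[j] - minority[i])

# Toy minority-class points (made up)
minority = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.0], [3.0, 3.0]])
synthetic = smote_like_sample(minority, k=2, rng=np.random.default_rng(5))
```

The synthetic point always lies on the line segment between two real minority samples, which is why SMOTE assumes that features can be meaningfully interpolated; for one-hot columns `SMOTENC` is the variant designed for that case.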



In [None]:
# For Oversampling
sm = SMOTE(random_state=5)

# For Categorical encoder
#sm = SMOTENC(random_state=42, categorical_features=[2,4,5,6,7,8,9,10,11,12,13,14])

# For Undersampling
#sm = RandomUnderSampler(random_state=42)

SX_train, Sy_train = sm.fit_resample(X_train, y_train)
SX_train = pd.DataFrame(SX_train, columns=X_test.columns)

### Normalizing the Data

In [None]:
dff = SX_train.copy()
dft = test_df.copy()

# Fit the scaler on the training data only, then apply the same transform to the test set
scaler = MinMaxScaler()
dff[['Vintage','Avg_Account_Balance']] = scaler.fit_transform(dff[['Vintage','Avg_Account_Balance']])
dft[['Vintage','Avg_Account_Balance']] = scaler.transform(dft[['Vintage','Avg_Account_Balance']])

### Imputing Null Values in Test Set

> Using KNN Imputer 
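
To see what the imputer does, here is a small example on a toy matrix (the same kind of input as in the scikit-learn documentation). Each missing entry is replaced by the mean of that feature over the two nearest rows, where distance is computed only on the features both rows have.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Toy matrix with two missing entries
X = np.array([
    [1.0, 2.0, np.nan],
    [3.0, 4.0, 3.0],
    [np.nan, 6.0, 5.0],
    [8.0, 8.0, 7.0],
])

imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)

# Row 0, column 2 -> mean of rows 1 and 2 in that column: (3 + 5) / 2 = 4.0
# Row 2, column 0 -> mean of rows 1 and 3 in that column: (3 + 8) / 2 = 5.5
```

Note that the result is continuous, so an imputed binary column such as `Credit_Product` can end up with fractional values that need rounding or casting afterwards.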

In [None]:
imputer = KNNImputer(n_neighbors=2)
testset = imputer.fit_transform(dft)

# CAT Boost

In [None]:
dff[['Credit_Product','Is_Active','Male','Other','Salaried','Self_Employed','X2','X3','X4','Age_34_35','Age_36_41','Age_42_85']] = dff[['Credit_Product','Is_Active','Male','Other','Salaried','Self_Employed','X2','X3','X4','Age_34_35','Age_36_41','Age_42_85']].astype(int)

In [None]:
cat = CatBoostClassifier(learning_rate=0.05, 
                         l2_leaf_reg=1, 
                         iterations= 500, 
                         depth= 9, 
                         border_count= 20, 
                         eval_metric = 'AUC')

cat= cat.fit(X_train, y_train,cat_features=['X2','X3','X4','Age_34_35','Age_36_41','Age_42_85','Credit_Product', 'Other', 'Salaried', 'Self_Employed', 'Is_Active'],eval_set=(X_test, y_test),early_stopping_rounds=70,verbose=50)


## Hyperparameter Tuning

In [None]:
# cat = CatBoostClassifier(eval_metric = 'AUC')
# param = { 'depth':[3,1,2,6,4,8,9,10,20,30,50],
#          'iterations':[250,100,500,1000],
#          'learning_rate':[0.03,0.001,0.01,0.1,0.13,0.2,0.3],
#          'l2_leaf_reg':[3,1,5,10,100],
#          'border_count':[32,5,10,20,100,200]
#         }

# randm = RandomizedSearchCV(cat, param_distributions = param, cv=5,refit = True, n_iter = 10, n_jobs=-1)
# randm.fit(X_train, y_train, cat_features=['X2','X3','X4','Age_34_35','Age_36_41','Age_42_85','Credit_Product', 'Other', 'Salaried', 'Self_Employed', 'Is_Active'])

# randm.best_params_

In [None]:
cat_y_pred = cat.predict_proba(X_train)[:, 1]
cat_y_pred2 = cat.predict_proba(X_test)[:, 1]

print('Train ROC:',roc_auc_score(y_train,cat_y_pred))
print('Test ROC:',roc_auc_score(y_test,cat_y_pred2))

In [None]:
cft = testset.copy()
cft = pd.DataFrame(cft,columns=dft.columns)
cft[['Credit_Product','Is_Active','Male','Other','Salaried','Self_Employed','X2','X3','X4','Age_34_35','Age_36_41','Age_42_85']] = cft[['Credit_Product','Is_Active','Male','Other','Salaried','Self_Employed','X2','X3','X4','Age_34_35','Age_36_41','Age_42_85']].astype(int)

In [None]:
cat_pred = cat.predict_proba(cft)[:, 1]

# LGBM

In [None]:
lgb = LGBMClassifier(boosting_type='gbdt',n_estimators=500,depth=10,learning_rate=0.04,
                     objective='binary',metric='auc',is_unbalance=True,
                     colsample_bytree=0.5,reg_lambda=2,reg_alpha=2,random_state=42)

lgb= lgb.fit(X_train, y_train,eval_metric='auc',eval_set=(X_test , y_test),verbose=50,categorical_feature=[2,4,5,6,7,8,9,10,11,12,13,14],early_stopping_rounds= 50)


## Hyperparameter Tuning

In [None]:
# lgb = LGBMClassifier(objective='binary',metric='auc',is_unbalance=True)
# param = { 'depth':[3,1,2,6,4,8,9,10,20,30,50],
#          'n_estimators':[250,100,500,1000],
#          'learning_rate':[0.03,0.04,0.1,0.13,0.2,0.3]
#         }

# grid = GridSearchCV(lgb, param_grid = param, cv=5,refit = True, n_jobs=-1)
# grid.fit(X_train, y_train, categorical_feature=[2,4,5,6,7,8,9,10,11,12,13,14])

# grid.best_params_

In [None]:
lgb_y_pred = lgb.predict_proba(X_train)[:, 1]
lgb_y_pred2 = lgb.predict_proba(X_test)[:, 1]

print('Train AUC:',roc_auc_score(y_train,lgb_y_pred))
print('Test AUC:',roc_auc_score(y_test,lgb_y_pred2))

In [None]:
lgb_pred = lgb.predict_proba(cft)[:, 1]

# XGBoost

In [None]:
xg = XGBClassifier(n_estimators=200, max_depth=3,gamma=1)
xg.fit(X_train,y_train)

xg_y_pred = xg.predict_proba(X_train)[:, 1]
xg_y_pred2 = xg.predict_proba(X_test)[:, 1]

print('Train AUC:',roc_auc_score(y_train,xg_y_pred))
print('Test AUC:',roc_auc_score(y_test,xg_y_pred2))

## Hyperparameter Tuning

In [None]:
# xg = XGBClassifier(objective='binary:logistic', eval_metric='auc')
# param = { 'max_depth':[3,1,2,6,4,8,9,10,20,30,50],
#          'n_estimators':[250,100,500,1000]
#         }

# grid = GridSearchCV(xg, param_grid = param, cv=5,refit = True, n_jobs=-1)
# grid.fit(X_train, y_train)

# grid.best_params_

In [None]:
xg_pred = xg.predict_proba(cft)[:, 1]

# Random Forest

In [None]:
rf = RandomForestClassifier(n_estimators=200, max_depth=10, min_samples_split=5)
rf.fit(X_train,y_train)

rf_y_pred = rf.predict_proba(X_train)[:, 1]
rf_y_pred2 = rf.predict_proba(X_test)[:, 1]

print('ROC Train:',roc_auc_score(y_train,rf_y_pred))
print('ROC Test:',roc_auc_score(y_test,rf_y_pred2))

## Hyperparameter Tuning

In [None]:
# rf = RandomForestClassifier(class_weight='balanced')
# param = { 'max_depth':[3,1,2,6,4,8,9,10,20,30,50],
#          'n_estimators':[250,100,500,1000],
#          'min_samples_split': [3,4,5,6,7]
#         }

# grid = GridSearchCV(rf, param_grid = param, cv=5,refit = True, n_jobs=-1)
# grid.fit(X_train, y_train)

# grid.best_params_

In [None]:
rf_pred = rf.predict_proba(testset)[:, 1]

# Ensemble

In [None]:
validation_test = pd.DataFrame([y_test.values, rf_y_pred2, xg_y_pred2, cat_y_pred2, lgb_y_pred2]).T

validation_test = validation_test.rename(columns={0: 'y_test', 1: 'rf_y_pred2', 2:'xg_y_pred2', 3: 'cat_y_pred2', 4: 'lgb_y_pred2'})

validation_test['Mean'] = validation_test[['rf_y_pred2','xg_y_pred2','cat_y_pred2','lgb_y_pred2']].mean(axis=1)

In [None]:

print('Test Mean',roc_auc_score(validation_test['y_test'],validation_test['Mean']))

# Feature Engineering

* We will Try Aggregation of Features to see the improvement in the metrics
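
To make the aggregation concrete, the sketch below builds pairwise sums and products with plain pandas. It is only an approximation of what the `add_numeric` and `multiply_numeric` primitives produce; the helper `add_pairwise_features`, the column naming, and the toy values are assumptions.

```python
from itertools import combinations
import pandas as pd

def add_pairwise_features(frame):
    """Append the sum and product of every pair of columns (hypothetical helper)."""
    out = frame.copy()
    for a, b in combinations(frame.columns, 2):
        out[f"{a} + {b}"] = frame[a] + frame[b]
        out[f"{a} * {b}"] = frame[a] * frame[b]
    return out

# Toy frame (made up) with three of the dataset's columns
toy = pd.DataFrame({
    "Vintage": [10, 20],
    "Avg_Account_Balance": [100.0, 50.0],
    "Is_Active": [1, 0],
})
expanded = add_pairwise_features(toy)
# 3 original columns + 3 sums + 3 products = 9 columns
```

For n columns this yields n + 2·C(n, 2) columns. Assuming all 15 columns of the processed frame are treated as numeric, that is 15 + 105 + 105 = 225, which matches the `input_shape` of the neural network in this notebook.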

In [None]:
# We add features to both train and test set
fet = testset.copy()
fef = df.copy()
fe_y = y.copy()

fet = pd.DataFrame(fet, columns=X_train.columns)

enc = pd.concat([fef,fet],axis=0)

es = ft.EntitySet(id = 'Lead')
es.entity_from_dataframe(entity_id='ID', dataframe = enc,index='index')

feature_matrix , feature_defs = ft.dfs(entityset = es, target_entity='ID', trans_primitives = ['add_numeric', 'multiply_numeric'], verbose=True)

In [None]:
fef = feature_matrix[:len(fef)]
fet = feature_matrix[len(fef):]

> Will use stratified split instead of SMOTE
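
A stratified split keeps the class ratio the same in both parts, which is what makes it a reasonable alternative to resampling here. A minimal sketch on made-up labels with 15% positives (`X_toy` and `y_toy` are hypothetical):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy labels (made up): 150 positives out of 1000 rows
y_toy = np.array([1] * 150 + [0] * 850)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.20, stratify=y_toy, random_state=42
)

# The positive rate stays at about 0.15 in both splits
train_rate, test_rate = y_tr.mean(), y_te.mean()
```

Without `stratify`, the positive rate of a small test split can drift away from the training rate purely by chance.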

In [None]:
# Stratified Split

X_train, X_test, y_train, y_test = train_test_split(fef, fe_y, stratify=y,test_size=0.20, random_state=42)

# Neural Network


In [None]:
model = Sequential()
model.add(layers.Dense(500, activation = 'relu', input_shape = (225,))) 
model.add(layers.Dense(450, activation = 'relu')) 
model.add(layers.Dense(380, activation = 'relu'))
model.add(layers.Dense(260, activation = 'relu')) 
model.add(layers.Dense(150, activation = 'relu'))
model.add(layers.Dense(80, activation = 'relu'))
model.add(layers.Dense(30, activation = 'relu'))
model.add(layers.Dense(15, activation = 'relu'))
model.add(layers.Dense(4, activation = 'relu'))
model.add(layers.Dense(1, activation = 'sigmoid'))

model.compile(optimizer=optimizers.Adam(learning_rate=0.001),  loss='binary_crossentropy',  metrics=['AUC'])

In [None]:
model.fit(X_train, y_train, epochs=50, validation_data=(X_test, y_test),batch_size=2000, verbose=1)

In [None]:
model.evaluate(X_test,  y_test, verbose=2)

In [None]:
nn_pred = model.predict(fet)

# Conclusion

* Started with vanilla SVM, logistic regression and KNN classifiers, which performed on the lower side.

* CatBoost was selected and tuned to optimize the ROC AUC score, as the data is mostly categorical.

* XGBoost, Random Forest and LightGBM were also tuned to optimize the score, but their performance was similar to CatBoost.

* All four models (CatBoost, XGBoost, LightGBM and Random Forest) showed similar train and validation scores.

* This indicates that the models are not overfitting but are suffering from bias.

* Tried oversampling and undersampling the data to check whether class imbalance was the cause, but performance did not change significantly.

* Tried an ensemble that averages all four models, but performance did not improve much.

* Tried combining features with an automatic feature generator and training a neural network on them, but there was no significant effect.

* XGBoost seems to have a high ROC AUC on both the validation and test sets.

* XGBoost is selected as the final model for predicting on the test set.




# Test Dataset

In [None]:
sub = pd.read_csv('sample_submission_eyYijxG.csv')

sub.Is_Lead = xg_pred

sub.to_csv('MySubmission.csv', index=False)