# Problem Statement

### Credit Card Lead Prediction

Happy Customer Bank is a mid-sized private bank that deals in all kinds of banking products, such as savings accounts, current accounts, investment products, and credit products, among other offerings.


The bank also cross-sells products to its existing customers through several communication channels, such as tele-calling, e-mails, and recommendations on net banking and mobile banking.


In this case, Happy Customer Bank wants to cross-sell its credit cards to its existing customers. The bank has identified a set of customers who are eligible for these credit cards.


The bank now wants help identifying the customers who are likely to show higher intent towards a recommended credit card, given:

    Customer details (gender, age, region, etc.)
    Details of the customer's relationship with the bank (Channel_Code, Vintage, Avg_Asset_Value, etc.)


In [56]:
# !wget https://datahack-prod.s3.amazonaws.com/train_file/train_s3TEQDk.csv
# !wget https://datahack-prod.s3.amazonaws.com/test_file/test_mSzZ8RL.csv
# pip install -U scikit-learn

In [57]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split

from sklearn.preprocessing import LabelEncoder
from lightgbm import LGBMClassifier
from sklearn.metrics import roc_auc_score

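# Needed only on scikit-learn < 1.0, where this estimator was still experimental;
# from 1.0 onward the estimator can be imported from sklearn.ensemble directly.
from sklearn.experimental import enable_hist_gradient_boosting
from sklearn.ensemble import HistGradientBoostingClassifier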

pd.options.display.max_columns = 300

In [58]:
data = pd.read_csv("train_s3TEQDk.csv")
test = pd.read_csv("test_mSzZ8RL.csv")
print(f"Train shape {data.shape}, Test Shape {test.shape}")

Train shape (245725, 11), Test Shape (105312, 10)


In [59]:
train,valid = train_test_split(data,test_size=0.20,random_state=345,stratify=data['Is_Lead'])
train = train.copy()
valid = valid.copy()
print(f"Train shape {train.shape} Validation shape {valid.shape}")

Train shape (196580, 11) Validation shape (49145, 11)
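
The split above passes `stratify=data['Is_Lead']` so that both parts keep roughly the same share of leads. A minimal sketch on synthetic data (the toy frame and its 25% positive rate are assumptions for illustration, not the competition data) shows the effect:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data: 1,000 rows, 25% positives.
toy = pd.DataFrame({
    "feature": np.arange(1000),
    "Is_Lead": np.repeat([0, 1], [750, 250]),
})

toy_train, toy_valid = train_test_split(
    toy, test_size=0.20, random_state=345, stratify=toy["Is_Lead"]
)

# With stratification the positive rate is preserved in both parts.
train_rate = toy_train["Is_Lead"].mean()
valid_rate = toy_valid["Is_Lead"].mean()
print(train_rate, valid_rate)
```

Without `stratify`, the two rates would generally differ by sampling noise, which matters more when the positive class is small.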


In [60]:
train.head(3)

Unnamed: 0,ID,Gender,Age,Region_Code,Occupation,Channel_Code,Vintage,Credit_Product,Avg_Account_Balance,Is_Active,Is_Lead
53078,N3ZQ84QR,Female,46,RG280,Self_Employed,X2,51,No,863584,Yes,0
213644,JWGAMK7P,Male,67,RG258,Other,X2,43,Yes,706126,No,0
131870,CX9NGNQT,Male,46,RG279,Self_Employed,X2,26,Yes,422207,Yes,0


In [61]:
test.head(3)

Unnamed: 0,ID,Gender,Age,Region_Code,Occupation,Channel_Code,Vintage,Credit_Product,Avg_Account_Balance,Is_Active
0,VBENBARO,Male,29,RG254,Other,X1,25,Yes,742366,No
1,CCMEWNKY,Male,43,RG268,Other,X2,49,,925537,No
2,VK3KGA9M,Male,31,RG270,Salaried,X1,14,No,215949,No


In [62]:
valid.head(3)

Unnamed: 0,ID,Gender,Age,Region_Code,Occupation,Channel_Code,Vintage,Credit_Product,Avg_Account_Balance,Is_Active,Is_Lead
148453,OK9KJGZ2,Female,55,RG268,Self_Employed,X1,37,No,929257,No,0
117997,TTC7CPSI,Male,57,RG283,Self_Employed,X3,87,,909740,No,0
5432,MPUWVRAX,Male,39,RG275,Salaried,X1,8,Yes,961742,Yes,0


In [63]:
train['ID'].nunique()

196580

In [64]:
train.isna().sum()

ID                         0
Gender                     0
Age                        0
Region_Code                0
Occupation                 0
Channel_Code               0
Vintage                    0
Credit_Product         23525
Avg_Account_Balance        0
Is_Active                  0
Is_Lead                    0
dtype: int64
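
Only `Credit_Product` has missing values (23,525 of 196,580 training rows, about 12%). Before imputing, it can be worth checking whether missingness itself carries signal. A small sketch on made-up rows (the toy values are hypothetical; the real notebook would use `train`) compares the lead rate for missing and non-missing entries:

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows, chosen only to illustrate the calculation.
toy = pd.DataFrame({
    "Credit_Product": ["Yes", np.nan, "No", np.nan, "No", "Yes"],
    "Is_Lead":        [1,     1,      0,    1,      0,    0],
})

# Share of missing values per column.
missing_share = toy.isna().mean()

# Lead rate split by whether Credit_Product is missing (index: False / True).
lead_rate_by_missing = toy.groupby(toy["Credit_Product"].isna())["Is_Lead"].mean()

print(missing_share)
print(lead_rate_by_missing)
```

If the two lead rates differ clearly on the real data, keeping "missing" as its own category, as the notebook does later with `fillna('NA')`, preserves that information.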

In [65]:
train['Gender'].value_counts(normalize=True)

Male      0.546693
Female    0.453307
Name: Gender, dtype: float64

In [66]:
train['Region_Code'].nunique()

35

In [67]:
train['Occupation'].value_counts(normalize=True)

Self_Employed    0.411532
Salaried         0.292731
Other            0.284866
Entrepreneur     0.010871
Name: Occupation, dtype: float64

In [68]:
train['Channel_Code'].value_counts(normalize=True)

X1    0.421279
X3    0.280359
X2    0.275669
X4    0.022693
Name: Channel_Code, dtype: float64

In [69]:
train['Credit_Product'].value_counts(normalize=True)

No     0.667019
Yes    0.332981
Name: Credit_Product, dtype: float64

In [70]:
train['Avg_Account_Balance'].describe()

count    1.965800e+05
mean     1.129489e+06
std      8.532486e+05
min      2.079000e+04
25%      6.042470e+05
50%      8.954865e+05
75%      1.368733e+06
max      1.035201e+07
Name: Avg_Account_Balance, dtype: float64

In [71]:
train['Is_Active'].value_counts(normalize=True)

No     0.611375
Yes    0.388625
Name: Is_Active, dtype: float64

In [72]:
train['Is_Lead'].value_counts(normalize=True)

0    0.762794
1    0.237206
Name: Is_Lead, dtype: float64

In [73]:
train['Age'].describe()

count    196580.000000
mean         43.864971
std          14.821238
min          23.000000
25%          30.000000
50%          43.000000
75%          54.000000
max          85.000000
Name: Age, dtype: float64

In [74]:
train['Vintage'].describe()

count    196580.000000
mean         46.978121
std          32.346981
min           7.000000
25%          20.000000
50%          32.000000
75%          73.000000
max         135.000000
Name: Vintage, dtype: float64

In [75]:
train.groupby(['Is_Lead'])[['Age','Avg_Account_Balance']].median()

Is_Lead,Age,Avg_Account_Balance
0,38,871158
1,49,980686


In [76]:
# Treat a missing Credit_Product as its own category and log-transform the
# right-skewed balance; the same steps are applied to all three frames.
for df in (train, test, valid):
    df['Credit_Product'] = df['Credit_Product'].fillna('NA')
    df['Avg_Account_Balance'] = np.log1p(df['Avg_Account_Balance'])  # log(1 + x)
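
The log transform above compresses the long right tail of `Avg_Account_Balance` (in the training split the mean, about 1.13 million, sits well above the median, about 0.90 million). A hedged sketch on synthetic log-normal data, not the real balances, shows how `np.log1p` reduces skewness:

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed "balances" (log-normal); purely illustrative.
rng = np.random.default_rng(0)
balance = pd.Series(rng.lognormal(mean=13.7, sigma=0.6, size=10_000))

skew_raw = balance.skew()
skew_log = np.log1p(balance).skew()  # log1p(x) computes log(1 + x)

print(f"skew before: {skew_raw:.2f}, after: {skew_log:.2f}")
```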


In [77]:
train.sample(3)

Unnamed: 0,ID,Gender,Age,Region_Code,Occupation,Channel_Code,Vintage,Credit_Product,Avg_Account_Balance,Is_Active,Is_Lead
34,FXPTJYP7,Male,67,RG268,Other,X1,87,Yes,14.336791,No,1
196199,GQLACHBW,Male,49,RG283,Other,X3,98,NA,13.735415,No,1
142727,MDKVS4YD,Female,42,RG275,Self_Employed,X1,32,No,13.775077,No,0


In [78]:
train['Is_Active'] = train['Is_Active'].replace({'No':'N','Yes':'Y'})
test['Is_Active'] = test['Is_Active'].replace({'No':'N','Yes':'Y'})
valid['Is_Active'] = valid['Is_Active'].replace({'No':'N','Yes':'Y'})

In [79]:
def bucket_age(age):
    """Return the age-bucket label for `age`; each bucket includes its upper bound."""
    if age <= 25:
        return '0-25'
    elif age <= 30:
        return '25-30'
    elif age <= 35:
        return '30-35'
    elif age <= 40:
        return '35-40'
    elif age <= 45:
        return '40-45'
    elif age <= 50:
        return '45-50'
    elif age <= 55:
        return '50-55'
    elif age <= 60:
        return '55-60'
    elif age <= 65:
        return '60-65'
    elif age <= 70:
        return '65-70'
    return '70+'

train['age_cat'] = train['Age'].apply(bucket_age)
test['age_cat'] = test['Age'].apply(bucket_age)
valid['age_cat'] = valid['Age'].apply(bucket_age)
train[['Age','age_cat']].sample(3)

Unnamed: 0,Age,age_cat
158063,32,30-35
29994,24,0-25
185735,66,65-70


In [80]:
train[['Age','age_cat']].sample(3)

Unnamed: 0,Age,age_cat
241090,60,55-60
62338,33,30-35
126424,39,35-40
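
The same bucketing can be written in vectorised form with `pd.cut`. This is an alternative sketch, not the notebook's method; the edges and labels below are copied from `bucket_age`, and the sample ages are arbitrary:

```python
import numpy as np
import pandas as pd

# Bucket edges and labels mirroring bucket_age; right=True makes each
# bucket include its upper edge, e.g. age 25 falls in '0-25'.
edges = [-np.inf, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, np.inf]
labels = ['0-25', '25-30', '30-35', '35-40', '40-45', '45-50',
          '50-55', '55-60', '60-65', '65-70', '70+']

ages = pd.Series([24, 25, 26, 32, 60, 66, 70, 71])
age_cat = pd.cut(ages, bins=edges, labels=labels, right=True).astype(str)

print(age_cat.tolist())
```

On a frame with a couple of hundred thousand rows, a single `pd.cut` call usually avoids the per-row Python overhead of `apply`.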


In [81]:
cat_cols = ['age_cat','Gender','Region_Code','Occupation','Channel_Code','Credit_Product','Is_Active']
featured_cols = []
for idx,col in enumerate(cat_cols):
    for sub_col in cat_cols[idx+1:]:
        new_col = f"{col}-{sub_col}"
        featured_cols.append(new_col)
        train[new_col] = train[col] + "-" + train[sub_col]
        test[new_col] = test[col] + "-" + test[sub_col]
        valid[new_col] = valid[col] + "-" + valid[sub_col]

train.sample(3)

Unnamed: 0,ID,Gender,Age,Region_Code,Occupation,Channel_Code,Vintage,Credit_Product,Avg_Account_Balance,Is_Active,Is_Lead,age_cat,age_cat-Gender,age_cat-Region_Code,age_cat-Occupation,age_cat-Channel_Code,age_cat-Credit_Product,age_cat-Is_Active,Gender-Region_Code,Gender-Occupation,Gender-Channel_Code,Gender-Credit_Product,Gender-Is_Active,Region_Code-Occupation,Region_Code-Channel_Code,Region_Code-Credit_Product,Region_Code-Is_Active,Occupation-Channel_Code,Occupation-Credit_Product,Occupation-Is_Active,Channel_Code-Credit_Product,Channel_Code-Is_Active,Credit_Product-Is_Active
78094,DZ9HTUFA,Male,52,RG277,Other,X2,67,Yes,13.832029,N,0,50-55,50-55-Male,50-55-RG277,50-55-Other,50-55-X2,50-55-Yes,50-55-N,Male-RG277,Male-Other,Male-X2,Male-Yes,Male-N,RG277-Other,RG277-X2,RG277-Yes,RG277-N,Other-X2,Other-Yes,Other-N,X2-Yes,X2-N,Yes-N
60648,MNMJGZ2J,Male,50,RG281,Self_Employed,X3,74,Yes,13.201117,N,0,45-50,45-50-Male,45-50-RG281,45-50-Self_Employed,45-50-X3,45-50-Yes,45-50-N,Male-RG281,Male-Self_Employed,Male-X3,Male-Yes,Male-N,RG281-Self_Employed,RG281-X3,RG281-Yes,RG281-N,Self_Employed-X3,Self_Employed-Yes,Self_Employed-N,X3-Yes,X3-N,Yes-N
204946,HF25RAVK,Female,42,RG257,Other,X2,57,Yes,13.162801,Y,0,40-45,40-45-Female,40-45-RG257,40-45-Other,40-45-X2,40-45-Yes,40-45-Y,Female-RG257,Female-Other,Female-X2,Female-Yes,Female-Y,RG257-Other,RG257-X2,RG257-Yes,RG257-Y,Other-X2,Other-Yes,Other-Y,X2-Yes,X2-Y,Yes-Y
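
The nested loop above builds one crossed column for every unordered pair of the seven categorical columns, so it adds 7 × 6 / 2 = 21 columns. A compact sketch of the same pairing with `itertools.combinations`, using one toy row whose values are taken from the sample output above:

```python
from itertools import combinations

import pandas as pd

cat_cols = ['age_cat', 'Gender', 'Region_Code', 'Occupation',
            'Channel_Code', 'Credit_Product', 'Is_Active']

# Every unordered pair of categorical columns, in the same order as the nested loop.
pairs = list(combinations(cat_cols, 2))
print(len(pairs))

# One toy row to show the resulting crossed strings.
toy = pd.DataFrame([{
    'age_cat': '50-55', 'Gender': 'Male', 'Region_Code': 'RG277',
    'Occupation': 'Other', 'Channel_Code': 'X2',
    'Credit_Product': 'Yes', 'Is_Active': 'N',
}])
for left, right in pairs:
    toy[f"{left}-{right}"] = toy[left] + "-" + toy[right]

print(toy.loc[0, 'Gender-Region_Code'])
```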


In [82]:
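all_cat_cols = cat_cols + featured_cols
num_col = ['Vintage','Avg_Account_Balance','Is_Lead']
for idx,col in enumerate(all_cat_cols):
    print(f"\rWorking Cat Col {idx}/{len(all_cat_cols)}",end='')
    for num in num_col:
        # Per-category mean, std and a coefficient-of-variation-style ratio,
        # computed on the training split only.
        # Note: the Is_Lead statistics are built from the target itself, so on
        # `train` they are in-sample target encodings and may overstate training fit.
        grp = train.groupby([col])[num].agg(['mean','std'])
        grp['cv'] = grp['std']/(1+grp['mean'])
        grp = grp.add_prefix(f'{col}-{num}-')
        grp = grp.fillna(-1)  # std is NaN for categories with a single row
        grp = grp.reset_index()

        # Left-merge the same train-derived table onto all three frames;
        # merge() also resets each frame's index to 0..n-1.
        train = train.merge(grp,on=[col],how='left')
        test = test.merge(grp,on=[col],how='left')
        valid = valid.merge(grp,on=[col],how='left')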

Working Cat Col 27/28
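
Because the `Is_Lead` aggregates above are computed on the same rows they are merged back onto, each training row's features contain a little of its own label. One common mitigation, which is not what this notebook does, is out-of-fold encoding: the statistic for a row is computed only from rows in other folds. The helper name `oof_target_mean`, the fold count, and the toy data below are all assumptions for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold


def oof_target_mean(df, col, target, n_splits=5, seed=0):
    """Hypothetical helper: out-of-fold mean of `target` per category of `col`."""
    prior = df[target].mean()  # fallback for categories unseen in the fitting folds
    encoded = np.full(len(df), np.nan)
    folds = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for fit_idx, enc_idx in folds.split(df):
        # Category means come only from rows outside the fold being encoded.
        means = df.iloc[fit_idx].groupby(col)[target].mean()
        encoded[enc_idx] = df[col].iloc[enc_idx].map(means).to_numpy()
    return pd.Series(encoded, index=df.index).fillna(prior)


# Synthetic stand-in for the training frame.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "Occupation": rng.choice(["Salaried", "Self_Employed", "Other"], size=300),
    "Is_Lead": rng.integers(0, 2, size=300),
})
toy["Occupation-Is_Lead-oof_mean"] = oof_target_mean(toy, "Occupation", "Is_Lead")
print(toy.head(3))
```

For validation and test rows the encoding would still come from the full training split, as in the notebook; only the training rows need the out-of-fold treatment.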

In [83]:
train.head(3)

Unnamed: 0,ID,Gender,Age,Region_Code,Occupation,Channel_Code,Vintage,Credit_Product,Avg_Account_Balance,Is_Active,Is_Lead,age_cat,...,Credit_Product-Is_Active-Is_Lead-mean,Credit_Product-Is_Active-Is_Lead-std,Credit_Product-Is_Active-Is_Lead-cv
0,N3ZQ84QR,Female,46,RG280,Self_Employed,X2,51,No,13.668848,Y,0,45-50,...,0.08143,0.273497,0.252903
1,JWGAMK7P,Male,67,RG258,Other,X2,43,Yes,13.46755,N,0,65-70,...,0.248289,0.432026,0.346094
2,CX9NGNQT,Male,46,RG279,Self_Employed,X2,26,Yes,12.953253,Y,0,45-50,...,0.46923,0.499066,0.339679

3 rows × 285 columns (the 21 crossed categorical columns and all but the last three of the 252 group-statistic columns are elided as "...")


In [84]:
all_cat_cols = cat_cols + featured_cols
encoder = LabelEncoder()
for col in all_cat_cols:
    train[col] = encoder.fit_transform(train[col])
    test[col] = encoder.transform(test[col])
    valid[col] = encoder.transform(valid[col])
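
`LabelEncoder.transform` raises a `ValueError` when it meets a label that was not present during `fit`. With crossed columns such as `age_cat-Region_Code`, a combination that appears only in `valid` or `test` is possible in general. A hedged alternative, assuming scikit-learn 0.24 or later, is `OrdinalEncoder` with `handle_unknown="use_encoded_value"`, which maps unseen categories to a chosen sentinel. The toy frames below are made up:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Made-up crossed values; the second test value never occurs in training.
train_toy = pd.DataFrame({"Region_Code-Channel_Code": ["RG254-X1", "RG268-X2", "RG270-X1"]})
test_toy = pd.DataFrame({"Region_Code-Channel_Code": ["RG268-X2", "RG999-X4"]})

encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
encoder.fit(train_toy)

# Known categories get their sorted position; unseen ones get -1.
codes = encoder.transform(test_toy).ravel()
print(codes)
```

`OrdinalEncoder` also encodes several columns in one call, so the per-column loop would not be needed.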

In [85]:
x_train = train.drop(['ID','Is_Lead'],axis=1)
y_train = train['Is_Lead']

x_valid = valid.drop(['ID','Is_Lead'],axis=1)
y_valid = valid['Is_Lead']

x_test = test.drop(['ID'],axis=1)

In [86]:
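# Positions of the categorical columns within x_train.
# Note: cat_indx is collected here but is not passed to the model below.
cat_indx = [idx for idx, col in enumerate(x_train.columns) if col in all_cat_cols]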

In [89]:
hist_params = {
    'max_iter': 1000,
    'learning_rate': 0.05,
    'max_depth': 30,
    'early_stopping': 'auto',
    'verbose': 1,
    'max_bins': 255,
    'random_state': 636,
}


hist = HistGradientBoostingClassifier(**hist_params)
hist.fit(x_train,y_train)
pred = hist.predict_proba(x_valid)[:, 1]
roc_score = roc_auc_score(y_valid, pred)
print(f"roc_auc_score: {roc_score}")

Binning 0.401 GB of training data: 7.460 s
Binning 0.045 GB of validation data: 0.224 s
Fitting gradient boosted rounds:
[1/1000] 1 tree, 31 leaves, max depth = 8, train loss: 0.52718, val loss: 0.52718, in 0.318s
[2/1000] 1 tree, 31 leaves, max depth = 9, train loss: 0.50953, val loss: 0.50950, in 0.306s
[3/1000] 1 tree, 31 leaves, max depth = 8, train loss: 0.49420, val loss: 0.49412, in 0.314s
[4/1000] 1 tree, 31 leaves, max depth = 7, train loss: 0.48071, val loss: 0.48059, in 0.311s
[5/1000] 1 tree, 31 leaves, max depth = 7, train loss: 0.46874, val loss: 0.46856, in 0.313s
[6/1000] 1 tree, 31 leaves, max depth = 9, train loss: 0.45802, val loss: 0.45777, in 0.325s
[7/1000] 1 tree, 31 leaves, max depth = 7, train loss: 0.44837, val loss: 0.44807, in 0.338s
[8/1000] 1 tree, 31 leaves, max depth = 8, train loss: 0.43966, val loss: 0.43930, in 0.326s
[9/1000] 1 tree, 31 leaves, max depth = 8, train loss: 0.43176, val loss: 0.43135, in 0.361s
[10/1000] 1 tree, 31 leaves, max depth = 8
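
The validation metric used above, ROC AUC, can be read as the probability that a randomly chosen lead receives a higher score than a randomly chosen non-lead, which suits a task framed as ranking customers by intent. A small worked example on four made-up scores checks this pairwise reading against `roc_auc_score`:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Made-up labels and scores for illustration.
y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])

# Pairwise definition: P(score of a positive > score of a negative),
# counting ties as one half.
pos = scores[y_true == 1]
neg = scores[y_true == 0]
diff = pos[:, None] - neg[None, :]
auc_pairwise = ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size

auc_sklearn = roc_auc_score(y_true, scores)
print(auc_pairwise, auc_sklearn)  # both 0.75
```

Three of the four positive-negative pairs are ordered correctly (only 0.35 versus 0.4 is not), which gives 0.75.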

In [90]:
submit = pd.DataFrame()
submit['ID'] = test['ID']
submit['Is_Lead'] = hist.predict_proba(x_test)[:, 1]
submit.head(3)

Unnamed: 0,ID,Is_Lead
0,VBENBARO,0.033305
1,CCMEWNKY,0.87107
2,VK3KGA9M,0.054181


In [91]:
submit.to_csv("Hist_submit_v3.csv",index=False)
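
Before uploading, a few sanity checks on the submission frame can catch simple mistakes. The helper `check_submission` below is hypothetical, and the expected layout (an `ID` column plus an `Is_Lead` probability) is an assumption based on what the notebook writes, not a statement of the competition's rules:

```python
import pandas as pd


def check_submission(submit):
    """Hypothetical helper: basic sanity checks on a submission frame."""
    assert list(submit.columns) == ["ID", "Is_Lead"], "unexpected columns"
    assert submit["ID"].is_unique, "duplicate IDs"
    assert submit["Is_Lead"].notna().all(), "missing predictions"
    assert submit["Is_Lead"].between(0, 1).all(), "probabilities outside [0, 1]"
    return True


# The three rows shown in the preview above.
toy_submit = pd.DataFrame({
    "ID": ["VBENBARO", "CCMEWNKY", "VK3KGA9M"],
    "Is_Lead": [0.033305, 0.87107, 0.054181],
})
ok = check_submission(toy_submit)
print(ok)
```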