# Categorical Feature Challenge

We have been provided with a dataset that only has categorical variables and we are asked to try out different encoding schemes and compare how they perform.The competition is binary classification challenge with only categorical variables to train on.

### References

In [None]:
#https://www.kaggle.com/cdeotte/high-scoring-lgbm-malware-0-702-0-775
#https://www.kaggle.com/fabiendaniel/detecting-malwares-with-lgbm
#https://www.kaggle.com/humananalog/xgboost-lasso
#https://www.kaggle.com/ogrellier/good-fun-with-ligthgbm
#https://www.kaggle.com/mlisovyi/modular-good-fun-with-ligthgbm/output

### Import Necessary libraries:

In [1]:


import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns


import gc
from tqdm import tqdm

### Reading the data

In [3]:
kaggle=0

if kaggle==0:
    train=pd.read_csv("data/train.csv")
    test=pd.read_csv("data/test.csv")
    sample_submission=pd.read_csv("data/sample_submission.csv")
    
else:
    train=pd.read_csv("../input/train.csv")
    test=pd.read_csv("../input/test.csv")
    sample_submission=pd.read_csv("../input/sample_submission.csv")

In [152]:
train.head()

Unnamed: 0,id,bin_0,bin_1,bin_2,bin_3,bin_4,nom_0,nom_1,nom_2,nom_3,...,nom_9,ord_0,ord_1,ord_2,ord_3,ord_4,ord_5,day,month,target
0,0,0,0,0,T,Y,Green,Triangle,Snake,Finland,...,2f4cb3d51,2,Grandmaster,Cold,h,D,kr,2,2,0
1,1,0,1,0,T,Y,Green,Trapezoid,Hamster,Russia,...,f83c56c21,1,Grandmaster,Hot,a,A,bF,7,8,0
2,2,0,0,0,F,Y,Blue,Trapezoid,Lion,Russia,...,ae6800dd0,1,Expert,Lava Hot,h,R,Jc,7,2,0
3,3,0,1,0,F,Y,Red,Trapezoid,Snake,Canada,...,8270f0d71,1,Grandmaster,Boiling Hot,i,D,kW,2,1,1
4,4,0,0,0,F,N,Red,Trapezoid,Lion,Canada,...,b164b72a7,1,Grandmaster,Freezing,a,R,qP,7,8,0


In [214]:
train.shape,test.shape

((300000, 25), (200000, 24))

We see that the train dataset has 25 categorical columns with varying degree of cardinality.

Let check the distribution of the target value to understand whether the dataset is balanced or not.

In [86]:
train['target'].value_counts()

0    208236
1     91764
Name: target, dtype: int64

We see that the target has lot of 0's than 1's.Its an unbalanced problem.

In [87]:
train.dtypes

id         int64
bin_0      int64
bin_1      int64
bin_2      int64
bin_3     object
bin_4     object
nom_0     object
nom_1     object
nom_2     object
nom_3     object
nom_4     object
nom_5     object
nom_6     object
nom_7     object
nom_8     object
nom_9     object
ord_0      int64
ord_1     object
ord_2     object
ord_3     object
ord_4     object
ord_5     object
day        int64
month      int64
target     int64
dtype: object

In [88]:
test.dtypes

id        int64
bin_0     int64
bin_1     int64
bin_2     int64
bin_3    object
bin_4    object
nom_0    object
nom_1    object
nom_2    object
nom_3    object
nom_4    object
nom_5    object
nom_6    object
nom_7    object
nom_8    object
nom_9    object
ord_0     int64
ord_1    object
ord_2    object
ord_3    object
ord_4    object
ord_5    object
day       int64
month     int64
dtype: object

In [31]:
#convert all the columns to category datatype:
for f in train.columns:
    if f=="id" or f=="target": continue
    print(f'Converting {f} into category datatype\n')
    train[f]=train[f].astype('category')
    test[f]=test[f].astype('category')

Converting bin_0 into category datatype

Converting bin_1 into category datatype

Converting bin_2 into category datatype

Converting bin_3 into category datatype

Converting bin_4 into category datatype

Converting nom_0 into category datatype

Converting nom_1 into category datatype

Converting nom_2 into category datatype

Converting nom_3 into category datatype

Converting nom_4 into category datatype

Converting nom_5 into category datatype

Converting nom_6 into category datatype

Converting nom_7 into category datatype

Converting nom_8 into category datatype

Converting nom_9 into category datatype

Converting ord_0 into category datatype

Converting ord_1 into category datatype

Converting ord_2 into category datatype

Converting ord_3 into category datatype

Converting ord_4 into category datatype

Converting ord_5 into category datatype

Converting day into category datatype

Converting month into category datatype



### Cardinality of the columns

In [32]:
## For binary columns , the cardinality will be 2.Lets separate them out .
binary_columns=[c for c in train.columns if train[c].nunique()==2]

In [33]:
binary_columns

['bin_0', 'bin_1', 'bin_2', 'bin_3', 'bin_4', 'target']

In [34]:
categorical_columns=[c for c in train.columns if (c not in binary_columns)]

In [7]:
cardinality=[]
for c in categorical_columns:
    if c=='id':continue
    cardinality.append([c,train[c].nunique()])
cardinality.sort(key=lambda x:x[1],reverse=True)


In [159]:
cardinality

[['nom_9', 11981],
 ['nom_8', 2215],
 ['nom_7', 1220],
 ['nom_6', 522],
 ['nom_5', 222],
 ['ord_5', 192],
 ['ord_4', 26],
 ['ord_3', 15],
 ['month', 12],
 ['day', 7],
 ['nom_1', 6],
 ['nom_2', 6],
 ['nom_3', 6],
 ['ord_2', 6],
 ['ord_1', 5],
 ['nom_4', 4],
 ['nom_0', 3],
 ['ord_0', 3]]

We see that there are 7 columns with high cardinality.Feature encoding for these columns may include frequency encoding which is based on the ranking of categories based on the frequency of occurence in the group.

In [35]:
# Columns that can be safely label encoded
good_label_cols = [col for col in categorical_columns if 
                   set(train[col]) == set(test[col])]

In [38]:
good_label_cols

['nom_0',
 'nom_1',
 'nom_2',
 'nom_3',
 'nom_4',
 'nom_5',
 'nom_6',
 'ord_0',
 'ord_1',
 'ord_2',
 'ord_3',
 'ord_4',
 'ord_5',
 'day',
 'month']

In [8]:
## from https://www.kaggle.com/fabiendaniel/detecting-malwares-with-lgbm
def frequency_encoding(variable):
    t = pd.concat([train[variable], test[variable]]).value_counts().reset_index()
    t = t.reset_index()
    t.loc[t[variable] == 1, 'level_0'] = np.nan
    t.set_index('index', inplace=True)
    max_label = t['level_0'].max() + 1
    t.fillna(max_label, inplace=True)
    return t.to_dict()['level_0']

In [9]:
 #frequency_encoded_columns=['nom_9','nom_8','nom_7','nom_6','nom_5','ord_5','ord_4']

In [39]:
for variable in tqdm(good_label_cols):
    freq_encod_dict=frequency_encoding(variable)
    train[variable+'_FE']=train[variable].map(lambda x:freq_encod_dict.get(x,np.nan))
    test[variable+'_FE']=test[variable].map(lambda x:freq_encod_dict.get(x,np.nan))
    categorical_columns.remove(variable)

100%|██████████████████████████████████████████████████████████████████████████████████| 15/15 [00:00<00:00, 55.29it/s]


### Label Encoding

In [11]:
#https://www.kaggle.com/vprokopev/mean-likelihood-encodings-a-comprehensive-study

def factorize(train, test, features, na_value=-9999, full=False, sort=True):
    """Factorize categorical features.
    Parameters
    ----------
    train : pd.DataFrame
    test : pd.DataFrame
    features : list
           Column names in the DataFrame to be encoded.
    na_value : int, default -9999
    full : bool, default False
        Whether use all columns from train/test or only from train.
    sort : bool, default True
        Sort by values.
    Returns
    -------
    train : pd.DataFrame
    test : pd.DataFrame
    """

    for column in features:
        if full:
            vs = pd.concat([train[column], test[column]])
            labels, indexer = pd.factorize(vs, sort=sort)
        else:
            labels, indexer = pd.factorize(train[column], sort=sort)

        train[column+'_LE'] = indexer.get_indexer(train[column])
        test[column+'_LE'] = indexer.get_indexer(test[column])

        if na_value != -1:
            train[column] = train[column].replace(-1, na_value)
            test[column] = test[column].replace(-1, na_value)

    return train, test

In [9]:
# indexer = {}
# for col in tqdm(categorical_columns):
#     if col == 'id': continue
#     _, indexer[col] = pd.factorize([train[col],test[col]])

In [40]:
#categorical_columns.remove('id')
train,test=factorize(train,test,categorical_columns,full=True)

In [13]:
# train,test=factorize(train,test,frequency_encoded_columns,full=True)

In [120]:
# for col in tqdm(categorical_columns):
#     if col=='id':continue
#     train[col+'_LE']=indexer[col].get_indexer(train[col])
#     test[col+'_LE']=indexer[col].get_indexer(test[col])
    

Now we do one hot encoding for all the binary categorical variables.

In [41]:
train_cat_dum=pd.DataFrame()
test_cat_dum=pd.DataFrame()
for c_ in binary_columns:
    if c_=='target':continue
    train_cat_dum=pd.concat([train_cat_dum,pd.get_dummies(train[c_],prefix=c_).astype(np.uint8)],axis=1)
    test_cat_dum=pd.concat([test_cat_dum,pd.get_dummies(test[c_],prefix=c_).astype(np.uint8)],axis=1)

In [14]:
train_cat_dum.head()

Unnamed: 0,bin_0_0,bin_0_1,bin_1_0,bin_1_1,bin_2_0,bin_2_1,bin_3_F,bin_3_T,bin_4_N,bin_4_Y
0,1,0,1,0,1,0,0,1,0,1
1,1,0,0,1,1,0,0,1,0,1
2,1,0,1,0,1,0,1,0,0,1
3,1,0,0,1,1,0,1,0,0,1
4,1,0,1,0,1,0,1,0,1,0


In [42]:
train=pd.concat([train,train_cat_dum],axis=1)
test=pd.concat([test,test_cat_dum],axis=1)

In [43]:
train.head()

Unnamed: 0,id,bin_0,bin_1,bin_2,bin_3,bin_4,nom_0,nom_1,nom_2,nom_3,...,bin_0_0,bin_0_1,bin_1_0,bin_1_1,bin_2_0,bin_2_1,bin_3_F,bin_3_T,bin_4_N,bin_4_Y
0,0,0,0,0,T,Y,Green,Triangle,Snake,Finland,...,1,0,1,0,1,0,0,1,0,1
1,1,0,1,0,T,Y,Green,Trapezoid,Hamster,Russia,...,1,0,0,1,1,0,0,1,0,1
2,2,0,0,0,F,Y,Blue,Trapezoid,Lion,Russia,...,1,0,1,0,1,0,1,0,0,1
3,3,0,1,0,F,Y,Red,Trapezoid,Snake,Canada,...,1,0,0,1,1,0,1,0,0,1
4,4,0,0,0,F,N,Red,Trapezoid,Lion,Canada,...,1,0,1,0,1,0,1,0,1,0


Now,we have taken care of all the categorical variables.Lets build the model and with 5 fold cross validation .Before this ,lets delete the original categorical columns.

In [55]:
train.columns,test.columns

(Index(['nom_0_FE', 'nom_1_FE', 'nom_2_FE', 'nom_3_FE', 'nom_4_FE', 'nom_5_FE',
        'nom_6_FE', 'ord_0_FE', 'ord_1_FE', 'ord_2_FE', 'ord_3_FE', 'ord_4_FE',
        'ord_5_FE', 'day_FE', 'month_FE', 'nom_7_LE', 'nom_8_LE', 'nom_9_LE',
        'bin_0_0', 'bin_0_1', 'bin_1_0', 'bin_1_1', 'bin_2_0', 'bin_2_1',
        'bin_3_F', 'bin_3_T', 'bin_4_N', 'bin_4_Y'],
       dtype='object'),
 Index(['nom_0_FE', 'nom_1_FE', 'nom_2_FE', 'nom_3_FE', 'nom_4_FE', 'nom_5_FE',
        'nom_6_FE', 'ord_0_FE', 'ord_1_FE', 'ord_2_FE', 'ord_3_FE', 'ord_4_FE',
        'ord_5_FE', 'day_FE', 'month_FE', 'nom_7_LE', 'nom_8_LE', 'nom_9_LE',
        'bin_0_0', 'bin_0_1', 'bin_1_0', 'bin_1_1', 'bin_2_0', 'bin_2_1',
        'bin_3_F', 'bin_3_T', 'bin_4_N', 'bin_4_Y'],
       dtype='object'))

In [45]:
cols_to_remove=['id', 'bin_0', 'bin_1', 'bin_2', 'bin_3', 'bin_4', 'nom_0', 'nom_1',
       'nom_2', 'nom_3', 'nom_4', 'nom_5', 'nom_6', 'nom_7', 'nom_8', 'nom_9',
       'ord_0', 'ord_1', 'ord_2', 'ord_3', 'ord_4', 'ord_5', 'day', 'month','id_LE']

In [46]:
train=train.drop(cols_to_remove,axis=1)
test=test.drop(cols_to_remove,axis=1)

In [47]:
train.shape

(300000, 29)

In [48]:
test.shape

(200000, 28)

### Building the model

In [21]:
## Importing required libraries:
from sklearn.model_selection import KFold, StratifiedKFold
import lightgbm as lgb
from sklearn.metrics import roc_auc_score, precision_recall_curve, roc_curve, average_precision_score
from bayes_opt import BayesianOptimization
import warnings

In [49]:
y=train['target']
del train['target']

In [50]:
n_folds=5

In [51]:
folds=StratifiedKFold(n_splits=5,shuffle=True,random_state=1234)
feats=[f for f in train.columns if f not in ['id']]

In [52]:
categorical_columns

['id', 'nom_7', 'nom_8', 'nom_9']

In [53]:
oof_preds = np.zeros(train.shape[0])
sub_preds = np.zeros(test.shape[0])
    
feature_importance_df = pd.DataFrame()
categorical_features=[c for c in train.columns if c not in ['id_LE']]

In [54]:
categorical_features

['nom_0_FE',
 'nom_1_FE',
 'nom_2_FE',
 'nom_3_FE',
 'nom_4_FE',
 'nom_5_FE',
 'nom_6_FE',
 'ord_0_FE',
 'ord_1_FE',
 'ord_2_FE',
 'ord_3_FE',
 'ord_4_FE',
 'ord_5_FE',
 'day_FE',
 'month_FE',
 'nom_7_LE',
 'nom_8_LE',
 'nom_9_LE',
 'bin_0_0',
 'bin_0_1',
 'bin_1_0',
 'bin_1_1',
 'bin_2_0',
 'bin_2_1',
 'bin_3_F',
 'bin_3_T',
 'bin_4_N',
 'bin_4_Y']

### Bayesian Optimization

In [30]:
bayesian_tr_index, bayesian_val_index  = list(StratifiedKFold(n_splits=3, shuffle=True, random_state=1).split(train.values, y.values))[0]

In [41]:
def LGB_bayesian(
    num_leaves,  # int
    min_data_in_leaf,  # int
    learning_rate,
    lambda_l1,
    feature_fraction,
    max_depth):
    
    
    
    # LightGBM expects next three parameters need to be integer. So we make them integer
    num_leaves = int(num_leaves)
    min_data_in_leaf = int(min_data_in_leaf)
    max_depth = int(max_depth)

    assert type(num_leaves) == int
    assert type(min_data_in_leaf) == int
    assert type(max_depth) == int

    param = {
        'num_leaves': num_leaves,
        'min_data_in_leaf': min_data_in_leaf,
        'learning_rate': learning_rate,
        'max_depth': max_depth,
        'lambda_l1': lambda_l1,
        'save_binary': True, 
        'seed': 123,
        'bagging_seed': 123,
        'drop_seed': 123,
        'data_random_seed': 123,
        'objective': 'binary',
        'boosting_type': 'gbdt',
        'verbose': 1,
        'metric': 'auc',
        'is_unbalance': True,   

    }    
    
    
    xg_train = lgb.Dataset(train.iloc[bayesian_tr_index][feats].values,
                           label=y.iloc[bayesian_tr_index].values,
                           
                           free_raw_data = False
                           )
    xg_valid = lgb.Dataset(train.iloc[bayesian_val_index][feats].values,
                           label=y.iloc[bayesian_val_index].values,
                           
                           free_raw_data = False
                           )   

    num_round = 5000
    clf = lgb.train(param, xg_train, num_round, valid_sets = [xg_valid], verbose_eval=250, early_stopping_rounds = 50)
    
    predictions = clf.predict(train.iloc[bayesian_val_index][feats].values, num_iteration=clf.best_iteration)   
    
    score = roc_auc_score(y.iloc[bayesian_val_index].values, predictions)
    
    return score

In [46]:
bay_param={
    'num_leaves': (30,70), 
    'min_data_in_leaf': (10,70),  
    'learning_rate': (0.01, 0.3),
    'feature_fraction': (0.05, 0.5),
    'lambda_l1': (0, 5.0), 
    'max_depth':(3,15)
}

In [47]:
LGB_BO = BayesianOptimization(LGB_bayesian, bay_param, random_state=123)

In [34]:
init_points=3
n_iter=3

In [48]:
print('-' * 150)

with warnings.catch_warnings():
    warnings.filterwarnings('ignore')
    LGB_BO.maximize(init_points=init_points,n_iter=n_iter,alpha=1e-06)

------------------------------------------------------------------------------------------------------------------------------------------------------
|   iter    |  target   | featur... | lambda_l1 | learni... | max_depth | min_da... | num_le... |
-------------------------------------------------------------------------------------------------
Training until validation scores don't improve for 50 rounds.
[250]	valid_0's auc: 0.770767
Early stopping, best iteration is:
[442]	valid_0's auc: 0.772888
|  1        |  0.7729   |  0.3634   |  1.431    |  0.07579  |  9.616    |  53.17    |  46.92    |
Training until validation scores don't improve for 50 rounds.
[250]	valid_0's auc: 0.770955
Early stopping, best iteration is:
[222]	valid_0's auc: 0.771078
|  2        |  0.7711   |  0.4913   |  3.424    |  0.1495   |  7.705    |  30.59    |  59.16    |
Training until validation scores don't improve for 50 rounds.
[250]	valid_0's auc: 0.771209
Early stopping, best iteration is:
[249]	valid_0's 

In [49]:
LGB_BO.max['params']

{'feature_fraction': 0.44409372851515394,
 'lambda_l1': 0.010784561545872373,
 'learning_rate': 0.06815113734587146,
 'max_depth': 3.4673240905177125,
 'min_data_in_leaf': 68.92857798548327,
 'num_leaves': 31.109834110784448}

In [56]:
# param = {'num_leaves': 60,
#          'min_data_in_leaf': 60, 
#          'objective':'binary',
#          'max_depth': -1,
#          'learning_rate': 0.1,
#          "boosting": "gbdt",
#          "feature_fraction": 0.8,
#          "bagging_freq": 1,
#          "bagging_fraction": 0.8 ,
#          "bagging_seed": 11,
#          "metric": 'auc',
#          "lambda_l1": 0.1,
#          "random_state": 133,
#          "verbosity": -1}

In [61]:
#params after bayesian optimisation:

param = {'num_leaves': 31,
         'min_data_in_leaf': 69, 
         'objective':'binary',
         'max_depth': 4,
         'learning_rate': 0.06,
         "boosting": "gbdt",
         "feature_fraction": 0.33,
         "metric": 'auc',
         "lambda_l1": 0.01,
         "random_state": 133,
         "verbosity": -1}

In [62]:
for n_folds,(train_idx,valid_idx) in enumerate(folds.split(train.values,y.values)):
    print("fold n°{}".format(n_folds+1))
    trn_data = lgb.Dataset(train.iloc[train_idx][feats],
                           label=y.iloc[train_idx],
                           categorical_feature=categorical_features
                          )
    val_data = lgb.Dataset(train.iloc[valid_idx][feats],
                           label=y.iloc[valid_idx],categorical_feature=categorical_features
                          )

    num_round = 10000
    clf = lgb.train(param,
                    trn_data,
                    num_round,
                    valid_sets = [trn_data, val_data],
                    verbose_eval=100,
                    early_stopping_rounds = 200)
    
    #clf.fit(train_x,train_y,eval_set=[(train_x,train_y),(valid_x,valid_y)],verbose=500,eval_metric="auc",early_stopping_rounds=100)
    
    oof_preds[valid_idx]=clf.predict(train.iloc[valid_idx][feats],num_iteration=clf.best_iteration)
    sub_preds+=clf.predict(test[feats],num_iteration=clf.best_iteration)/folds.n_splits
    
    fold_importance_df=pd.DataFrame()
    fold_importance_df['features']=feats
    fold_importance_df['importance']=clf.feature_importance(importance_type='gain')
    fold_importance_df['folds']=n_folds+1
    print(f'Fold {n_folds+1}: Most important features are:\n')
    for i in np.argsort(fold_importance_df['importance'])[-5:]:
        print(f'{fold_importance_df.iloc[i,0]}-->{fold_importance_df.iloc[i,1]}')
    
    feature_importance_df=pd.concat([feature_importance_df,fold_importance_df],axis=0)
    
    print('Fold %2d AUC : %.6f' % (n_folds + 1, roc_auc_score(y.iloc[valid_idx], oof_preds[valid_idx])))
    del clf
    gc.collect()
    


print('Full auc score %.6f' % (roc_auc_score(y,oof_preds)))

test['target']=sub_preds
              

fold n°1




Training until validation scores don't improve for 200 rounds.
[100]	training's auc: 0.811879	valid_1's auc: 0.777955
[200]	training's auc: 0.842821	valid_1's auc: 0.788593
[300]	training's auc: 0.858847	valid_1's auc: 0.791891
[400]	training's auc: 0.870213	valid_1's auc: 0.793363
[500]	training's auc: 0.876864	valid_1's auc: 0.794068
[600]	training's auc: 0.881991	valid_1's auc: 0.794513
[700]	training's auc: 0.88593	valid_1's auc: 0.79476
[800]	training's auc: 0.890052	valid_1's auc: 0.794652
[900]	training's auc: 0.893563	valid_1's auc: 0.794376
Early stopping, best iteration is:
[734]	training's auc: 0.887034	valid_1's auc: 0.794813
Fold 1: Most important features are:

nom_9_LE-->67420.97451972961
nom_6_FE-->73209.61964058876
nom_8_LE-->88927.89646756649
nom_7_LE-->93389.90722155571
ord_5_FE-->98690.93132388592
Fold  1 AUC : 0.794813
fold n°2




Training until validation scores don't improve for 200 rounds.
[100]	training's auc: 0.811679	valid_1's auc: 0.776536
[200]	training's auc: 0.841886	valid_1's auc: 0.78824
[300]	training's auc: 0.857558	valid_1's auc: 0.792274
[400]	training's auc: 0.869453	valid_1's auc: 0.794206
[500]	training's auc: 0.876101	valid_1's auc: 0.795125
[600]	training's auc: 0.881031	valid_1's auc: 0.795686
[700]	training's auc: 0.88518	valid_1's auc: 0.795913
[800]	training's auc: 0.889347	valid_1's auc: 0.795782
[900]	training's auc: 0.892984	valid_1's auc: 0.795724
Early stopping, best iteration is:
[758]	training's auc: 0.887613	valid_1's auc: 0.795956
Fold 2: Most important features are:

nom_5_FE-->67740.34781748056
nom_6_FE-->73109.19807302952
nom_8_LE-->86625.97900009155
nom_7_LE-->97201.71965074539
ord_5_FE-->99465.74937680364
Fold  2 AUC : 0.795956
fold n°3




Training until validation scores don't improve for 200 rounds.
[100]	training's auc: 0.811692	valid_1's auc: 0.776925
[200]	training's auc: 0.841777	valid_1's auc: 0.788139
[300]	training's auc: 0.857718	valid_1's auc: 0.791788
[400]	training's auc: 0.868986	valid_1's auc: 0.793557
[500]	training's auc: 0.875916	valid_1's auc: 0.794369
[600]	training's auc: 0.880902	valid_1's auc: 0.794787
[700]	training's auc: 0.884956	valid_1's auc: 0.794968
[800]	training's auc: 0.88915	valid_1's auc: 0.79474
Early stopping, best iteration is:
[674]	training's auc: 0.883974	valid_1's auc: 0.795057
Fold 3: Most important features are:

nom_5_FE-->64530.66731393337
nom_6_FE-->70967.87460744381
nom_8_LE-->86735.7122707367
nom_7_LE-->89998.23136329651
ord_5_FE-->98152.36149513721
Fold  3 AUC : 0.795057
fold n°4




Training until validation scores don't improve for 200 rounds.
[100]	training's auc: 0.81238	valid_1's auc: 0.77463
[200]	training's auc: 0.842337	valid_1's auc: 0.78555
[300]	training's auc: 0.858409	valid_1's auc: 0.788866
[400]	training's auc: 0.870006	valid_1's auc: 0.790355
[500]	training's auc: 0.876864	valid_1's auc: 0.791324
[600]	training's auc: 0.882023	valid_1's auc: 0.791562
[700]	training's auc: 0.886209	valid_1's auc: 0.791634
[800]	training's auc: 0.89035	valid_1's auc: 0.791664
[900]	training's auc: 0.893884	valid_1's auc: 0.791479
Early stopping, best iteration is:
[737]	training's auc: 0.887395	valid_1's auc: 0.791805
Fold 4: Most important features are:

nom_9_LE-->67800.31713676453
nom_6_FE-->74421.83401936293
nom_8_LE-->87794.76416492462
nom_7_LE-->94528.62718009949
ord_5_FE-->99506.84189277887
Fold  4 AUC : 0.791805
fold n°5




Training until validation scores don't improve for 200 rounds.
[100]	training's auc: 0.812541	valid_1's auc: 0.776299
[200]	training's auc: 0.842658	valid_1's auc: 0.787245
[300]	training's auc: 0.858716	valid_1's auc: 0.790882
[400]	training's auc: 0.870606	valid_1's auc: 0.792234
[500]	training's auc: 0.877143	valid_1's auc: 0.793063
[600]	training's auc: 0.882093	valid_1's auc: 0.79337
[700]	training's auc: 0.886037	valid_1's auc: 0.793492
[800]	training's auc: 0.890215	valid_1's auc: 0.793477
[900]	training's auc: 0.893718	valid_1's auc: 0.793362
Early stopping, best iteration is:
[752]	training's auc: 0.888206	valid_1's auc: 0.793617
Fold 5: Most important features are:

nom_9_LE-->70548.29574012756
nom_6_FE-->73361.55004185438
nom_8_LE-->88230.54189872742
nom_7_LE-->95761.65492653847
ord_5_FE-->100280.53248739243
Fold  5 AUC : 0.793617
Full auc score 0.794241


In [58]:
sample_submission['target']=sub_preds

In [59]:
sample_submission.head()

Unnamed: 0,id,target
0,300000,0.293341
1,300001,0.544078
2,300002,0.210609
3,300003,0.309833
4,300004,0.705689


In [60]:
sample_submission.to_csv("sample_submission_01.csv",index=False)