# Train More - XGB + NN - Achieve LB Boost!
Previously when we ensembled our 7 KFold XGB with our 7 KFold NN, we achieved LB 0.97730 (see [here][1]). 

We will now attempt to boost their CV and LB! Previously our XGB learning rate was 0.1 for experiments. We will now train it with 0.02 for our final submission. Previously our XGB had 7 folds; we will now train it with 10 folds.

Previously our NN had 7 folds; we will now train it with 10 folds. We will also train the same 10 folds 5 times, each time with a different seed.

Afterward, we will ensemble all these new XGB and all these new NN. We will try to beat CV 0.97630 and LB 0.97730 by training more!

==============================

**NOTE** Versions 1 and 2 of this notebook have a bug where the OOF and PREDS of the multiple NN are not saved correctly. Inside each repeat for-loop, the variables OOF and PREDS were being reset to zero.

[1]: https://www.kaggle.com/code/adilshamim8/bank-term-deposit-prediction
[2]: https://www.kaggle.com/code/cdeotte/xgboost-using-original-data-cv-0-976
[3]: https://www.kaggle.com/code/cdeotte/nn-by-gpt5-cv-0-974-wow

# Load Data
We load the train, test, and original datasets. In every Kaggle playground competition, the data is synthetic and is generated from an original dataset. In this competition, the original dataset is [here][1].

[1]: https://www.kaggle.com/datasets/sushant097/bank-marketing-dataset-full

In [1]:
import pandas as pd, numpy as np, os
import cudf

PATH = "/kaggle/input/playground-series-s5e8/"
train = cudf.read_csv(f"{PATH}train.csv").set_index('id')
print("Train shape", train.shape )
train.head()

Train shape (750000, 17)


id,age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome,y
0,42,technician,married,secondary,no,7,no,no,cellular,25,aug,117,3,-1,0,unknown,0
1,38,blue-collar,married,secondary,no,514,no,no,unknown,18,jun,185,1,-1,0,unknown,0
2,36,blue-collar,married,secondary,no,602,yes,no,unknown,14,may,111,2,-1,0,unknown,0
3,27,student,single,secondary,no,34,yes,no,unknown,28,may,10,2,-1,0,unknown,0
4,26,technician,married,secondary,no,889,yes,no,cellular,3,feb,902,1,-1,0,unknown,1


In [2]:
test = cudf.read_csv(f"{PATH}test.csv").set_index('id')
# Placeholder target so that test has the same columns as train (these values are not real labels)
test['y'] = np.random.randint(0, 2, len(test))
print("Test shape", test.shape )
test.head()

Test shape (250000, 17)


id,age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome,y
750000,32,blue-collar,married,secondary,no,1397,yes,no,unknown,21,may,224,1,-1,0,unknown,1
750001,44,management,married,tertiary,no,23,yes,no,cellular,3,apr,586,2,-1,0,unknown,1
750002,36,self-employed,married,primary,no,46,yes,yes,cellular,13,may,111,2,-1,0,unknown,1
750003,58,blue-collar,married,secondary,no,-1380,yes,yes,unknown,29,may,125,1,-1,0,unknown,1
750004,28,technician,single,secondary,no,1950,yes,no,cellular,22,jul,181,1,-1,0,unknown,1


In [3]:
orig = cudf.read_csv("/kaggle/input/bank-marketing-dataset-full/bank-full.csv",delimiter=";")
orig['y'] = orig.y.map({'yes':1,'no':0})
orig['id'] = (np.arange(len(orig))+1e6).astype('int')
orig = orig.set_index('id')
print("Original data shape", orig.shape )
orig.head()

Original data shape (45211, 17)


id,age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome,y
1000000,58,management,married,tertiary,no,2143,yes,no,unknown,5,may,261,1,-1,0,unknown,0
1000001,44,technician,single,secondary,no,29,yes,no,unknown,5,may,151,1,-1,0,unknown,0
1000002,33,entrepreneur,married,secondary,no,2,yes,yes,unknown,5,may,76,1,-1,0,unknown,0
1000003,47,blue-collar,married,unknown,no,1506,yes,no,unknown,5,may,92,1,-1,0,unknown,0
1000004,33,unknown,single,unknown,no,1,no,no,unknown,5,may,198,1,-1,0,unknown,0


# EDA (Exploratory Data Analysis)
We now combine all data together and then explore the columns and their properties. We observe that there is no missing data, and that the categorical columns have low cardinality (i.e. at most 12). Most numerical columns have few unique values, but two of them (`duration` and `balance`) have around 2k and 8k unique values respectively.

In [4]:
combine = cudf.concat([train,test,orig],axis=0)
print("Combined data shape", combine.shape )

Combined data shape (1045211, 17)


In [5]:
CATS = []
NUMS = []
for c in combine.columns[:-1]:
    t = "CAT"
    if combine[c].dtype=='object':
        CATS.append(c)
    else:
        NUMS.append(c)
        t = "NUM"
    n = combine[c].nunique()
    na = combine[c].isna().sum()
    print(f"[{t}] {c} has {n} unique and {na} NA")
print("CATS:", CATS )
print("NUMS:", NUMS )

[NUM] age has 78 unique and 0 NA
[CAT] job has 12 unique and 0 NA
[CAT] marital has 3 unique and 0 NA
[CAT] education has 4 unique and 0 NA
[CAT] default has 2 unique and 0 NA
[NUM] balance has 8590 unique and 0 NA
[CAT] housing has 2 unique and 0 NA
[CAT] loan has 2 unique and 0 NA
[CAT] contact has 3 unique and 0 NA
[NUM] day has 31 unique and 0 NA
[CAT] month has 12 unique and 0 NA
[NUM] duration has 1824 unique and 0 NA
[NUM] campaign has 52 unique and 0 NA
[NUM] pdays has 628 unique and 0 NA
[NUM] previous has 54 unique and 0 NA
[CAT] poutcome has 4 unique and 0 NA
CATS: ['job', 'marital', 'education', 'default', 'housing', 'loan', 'contact', 'month', 'poutcome']
NUMS: ['age', 'balance', 'day', 'duration', 'campaign', 'pdays', 'previous']


# Feature Engineer (LE - Label Encode)
We will label encode all categorical columns. Also we will make a duplicate of each numerical column and treat the copy as a categorical column.

In [6]:
CATS1 = []
SIZES = {}
for c in NUMS + CATS:
    n = c
    if c in NUMS: 
        n = f"{c}2"
        CATS1.append(n)
    combine[n],_ = combine[c].factorize()
    SIZES[n] = combine[n].max()+1

    combine[c] = combine[c].astype('int32')
    combine[n] = combine[n].astype('int32')

print("New CATS:", CATS1 )
print("Cardinality of all CATS:", SIZES )

New CATS: ['age2', 'balance2', 'day2', 'duration2', 'campaign2', 'pdays2', 'previous2']
Cardinality of all CATS: {'age2': 78, 'balance2': 8590, 'day2': 31, 'duration2': 1824, 'campaign2': 52, 'pdays2': 628, 'previous2': 54, 'job': 12, 'marital': 3, 'education': 4, 'default': 2, 'housing': 2, 'loan': 2, 'contact': 3, 'month': 12, 'poutcome': 4}
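
As a small illustration of the label encoding step, here is a minimal sketch on a toy frame. It uses pandas rather than cudf so that it runs on CPU (an assumption; the notebook itself uses cudf), and the toy data and the name `sizes` are hypothetical. In pandas, `factorize` assigns codes in order of first appearance; the code order in cudf may differ.

```python
import pandas as pd

# Toy stand-in for `combine` (hypothetical data; pandas instead of cudf).
df = pd.DataFrame({
    "job": ["admin", "student", "admin", "retired"],
    "age": [42, 38, 42, 27],
})

# factorize() returns (codes, uniques); codes are integers from 0 to N-1.
codes, uniques = pd.factorize(df["job"])
df["job"] = codes.astype("int32")

# A numerical column keeps its raw values; its label-encoded copy gets
# the "2" suffix, mirroring the naming used in the cell above.
df["age2"] = pd.factorize(df["age"])[0].astype("int32")

# Cardinality of each categorical column = largest code + 1.
sizes = {c: int(df[c].max()) + 1 for c in ["job", "age2"]}
print(df)
print(sizes)
```

The cardinalities are kept because the pair-combination step needs them to build collision-free codes.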


# Feature Engineer (Combine Column Pairs)
We will create a new categorical column from every pair of existing categorical columns. The original categorical columns have been label encoded into integers from `0 to N-1` each. Therefore we can create a new column with unique integers using the formula `new_cols[name] = combine[c1] * SIZES[c2] + combine[c2]`.

In [7]:
from itertools import combinations

pairs = combinations(CATS + CATS1, 2)
new_cols = {}
CATS2 = []

for c1, c2 in pairs:
    name = "_".join(sorted((c1, c2)))
    new_cols[name] = combine[c1] * SIZES[c2] + combine[c2]
    CATS2.append(name)
if new_cols:
    new_df = cudf.DataFrame(new_cols)         
    combine = cudf.concat([combine, new_df], axis=1) 

print(f"Created {len(CATS2)} new CAT columns")

Created 120 new CAT columns
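
To see why the formula `c1 * SIZES[c2] + c2` gives every pair its own integer, here is a small pure-Python check. The cardinalities `size_c1` and `size_c2` and the helper `pair_code` are hypothetical names used only for this sketch.

```python
from itertools import product

# Hypothetical cardinalities of two label-encoded columns.
size_c1, size_c2 = 3, 4

def pair_code(a, b, size_b=size_c2):
    """Map a pair (a, b) with 0 <= b < size_b to a single integer."""
    return a * size_b + b

# Encode every possible pair of codes.
codes = {(a, b): pair_code(a, b)
         for a, b in product(range(size_c1), range(size_c2))}

# Every pair gets a distinct code in 0 .. size_c1*size_c2 - 1.
assert len(set(codes.values())) == size_c1 * size_c2

# divmod recovers the original pair, which is why no two pairs collide.
assert divmod(pair_code(2, 3), size_c2) == (2, 3)
print(pair_code(2, 3))  # 2*4 + 3 = 11
```

The guarantee relies on the second column being label encoded into `0 to N-1` with `N = SIZES[c2]`; it would not hold for raw, unencoded values.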


# Feature Engineer (CE - Count Encoding)
We now have 136 categorical columns. We will count encode each of them and create 136 new columns.

In [8]:
CE = []
CC = CATS+CATS1+CATS2
combine['i'] = np.arange( len(combine) )

print(f"Processing {len(CC)} columns... ",end="")
for i,c in enumerate(CC):
    if i%10==0: print(f"{i}, ",end="")
    tmp = combine.groupby(c).y.count()
    tmp = tmp.astype('int32')
    tmp.name = f"CE_{c}"
    CE.append( f"CE_{c}" )
    combine = combine.merge(tmp, on=c, how='left')
combine = combine.sort_values('i')
print()

Processing 136 columns... 0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 130, 
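
A minimal sketch of count encoding on a toy frame follows. It uses pandas instead of cudf (an assumption so that it runs on CPU), and it calls `reset_index()` before merging to keep the merge explicit; the data is hypothetical.

```python
import pandas as pd

# Toy stand-in for `combine` (hypothetical data).
df = pd.DataFrame({"month": [4, 4, 7, 4, 7, 1], "y": [0, 1, 0, 0, 1, 1]})
df["i"] = range(len(df))  # remember the original row order

# Number of rows per category value.
counts = df.groupby("month")["y"].count().astype("int32")
counts.name = "CE_month"

# Attach the count to every row, then restore the original order,
# because a merge is not guaranteed to preserve it on every backend.
df = df.merge(counts.reset_index(), on="month", how="left").sort_values("i")
print(df[["month", "CE_month"]])
```

The helper column `i` plays the same role as in the cell above: it lets the rows be sorted back so that train, test, and original rows can later be split by position.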


In [9]:
train = combine.iloc[:len(train)]
test = combine.iloc[len(train):len(train)+len(test)]
orig = combine.iloc[-len(orig):]
del combine
print("Train shape", train.shape,"Test shape", test.shape,"Original shape", orig.shape )

Train shape (750000, 281) Test shape (250000, 281) Original shape (45211, 281)


# Feature Engineering (TE - Original Data as Cols)
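Below is a technique to add the original data as new columns. For each categorical column, we compute the mean target per category in the original dataset and merge it into train and test as a new `TE_ORIG_` column. Categories that never appear in the original data are left as NaN at this stage.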

In [10]:
TE = []
CC = CATS+CATS1+CATS2

#mn = orig.y.mean() # WE FILL NAN AFTER XGB
print(f"Processing {len(CC)} columns... ",end="")
for i,c in enumerate(CC):
    if i%10==0: print(f"{i}, ",end="")
    tmp = orig.groupby(c).y.mean()
    tmp = tmp.astype('float32')
    NAME = f"TE_ORIG_{c}"
    tmp.name = NAME
    TE.append( NAME )
    train = train.merge(tmp, on=c, how='left')
    #train[NAME] = train[NAME].fillna(mn) # WE FILL NAN AFTER XGB
    test = test.merge(tmp, on=c, how='left')
    #test[NAME] = test[NAME].fillna(mn) # WE FILL NAN AFTER XGB
train = train.sort_values('i')
test = test.sort_values('i')
print()

Processing 136 columns... 0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 130, 
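
The following sketch shows the same idea on toy data: the per-category target mean is computed on the original rows only and then looked up for the train rows. It uses pandas instead of cudf (an assumption), and the frames are hypothetical.

```python
import pandas as pd

# Hypothetical "original" data with a known target.
orig = pd.DataFrame({"job": [0, 0, 1, 1, 1], "y": [1, 0, 1, 1, 0]})
# Hypothetical train rows; job=2 never occurs in the original data.
train = pd.DataFrame({"job": [1, 0, 2], "i": [0, 1, 2]})

# Mean target per category, computed from the original data only.
te = orig.groupby("job")["y"].mean().astype("float32")
te.name = "TE_ORIG_job"

# Left merge keeps every train row; unseen categories become NaN.
train = train.merge(te.reset_index(), on="job", how="left").sort_values("i")
print(train)
```

Because the statistics come from a separate dataset, they do not use any train labels, so they can be added once up front rather than inside each fold.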


# Train More - Train XGB w/ Original Data as Rows
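Previously our XGB learning rate was 0.1 for experiments. We will now train it with 0.02 for our final submission. Previously our XGB had 7 folds; we will now train it with 10 folds. The code below also repeats the 10 folds 3 times (`REPEATS = 3`), using a different XGBoost seed for every fold and repeat, and averages the predictions. The original data is appended as extra rows to the training part of each fold.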
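
The averaging over repeats and folds can be sketched with NumPy alone. This is a simplified illustration, not the training code: the "model" is a constant stand-in, the sizes are hypothetical, and the folds are built with `np.array_split` rather than `KFold`. The accumulators are created once, outside both loops, so that no repeat overwrites an earlier one.

```python
import numpy as np

n_train, n_test = 20, 5   # hypothetical sizes
FOLDS, REPEATS = 5, 3

# Accumulators live outside both loops so that repeats add up.
oof = np.zeros(n_train)
test_pred = np.zeros(n_test)

# The same split is reused in every repeat; only the model seed would change.
rng = np.random.default_rng(42)
folds = np.array_split(rng.permutation(n_train), FOLDS)

for rep in range(REPEATS):
    for fold, val_idx in enumerate(folds):
        # Stand-in for a trained model: a constant prediction per repeat.
        p = 0.1 * (rep + 1)
        # Each train row is validated once per repeat.
        oof[val_idx] += p / REPEATS
        # Each test row is predicted by every fold model of every repeat.
        test_pred += p / FOLDS / REPEATS

print(oof[:3], test_pred[:3])
```

With these divisors, both arrays end up as plain averages: each OOF entry averages `REPEATS` predictions and each test entry averages `FOLDS * REPEATS` predictions.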

In [11]:
from cuml.preprocessing import TargetEncoder
from sklearn.model_selection import KFold
import xgboost as xgb

print(f"XGBoost version {xgb.__version__}")

FEATURES = NUMS+CATS+CATS1+CATS2+CE
print(f"We have {len(FEATURES)} features.")

FOLDS = 10
SEED = 42

params = {
    "objective": "binary:logistic",  
    "eval_metric": "auc",           
    "learning_rate": 0.02,
    "max_depth": 0,
    "subsample": 0.8,
    "colsample_bytree": 0.7,
    "seed": SEED,
    "device": "cuda",
    "grow_policy": "lossguide", 
    "max_leaves": 32,          
    "alpha": 2.0,
}

class IterLoadForDMatrix(xgb.core.DataIter):
    def __init__(self, df=None, features=None, target=None, batch_size=256*1024):
        self.features = features
        self.target = target
        self.df = df
        self.it = 0 
        self.batch_size = batch_size
        self.batches = int( np.ceil( len(df) / self.batch_size ) )
        super().__init__()

    def reset(self):
        '''Reset the iterator'''
        self.it = 0

    def next(self, input_data):
        '''Yield next batch of data.'''
        if self.it == self.batches:
            return 0 # Return 0 when there's no more batch.
        
        a = self.it * self.batch_size
        b = min( (self.it + 1) * self.batch_size, len(self.df) )
        #dt = cudf.DataFrame(self.df.iloc[a:b])
        dt = self.df.iloc[a:b]
        input_data(data=dt[self.features], label=dt[self.target]) 
        self.it += 1
        return 1

oof_preds = np.zeros(len(train))
test_preds = np.zeros(len(test))

REPEATS = 3
for kk in range(REPEATS):
    kf = KFold(n_splits=FOLDS, shuffle=True, random_state=SEED)
    for fold, (train_idx, val_idx) in enumerate(kf.split(train)):
        print("#"*25)
        print(f"### REPEAT {kk+1}, Fold {fold+1} ###")
        print("#"*25)
    
        Xy_train = train.iloc[train_idx][ FEATURES+['y'] ].copy()
        Xy_more = orig[ FEATURES+['y'] ]
        for k in range(1):
            Xy_train = cudf.concat([Xy_train,Xy_more],axis=0,ignore_index=True)
        
        X_valid = train.iloc[val_idx][FEATURES].copy()
        y_valid = train.iloc[val_idx]['y']
        X_test = test[FEATURES].copy()
    
        CC = CATS1+CATS2
        print(f"Target encoding {len(CC)} features... ",end="")
        for i,c in enumerate(CC):
            if i%10==0: print(f"{i}, ",end="")
            TE0 = TargetEncoder(n_folds=10, smooth=0, split_method='random', stat='mean')
            Xy_train[c] = TE0.fit_transform(Xy_train[c],Xy_train['y']).astype('float32')
            X_valid[c] = TE0.transform(X_valid[c]).astype('float32')
            X_test[c] = TE0.transform(X_test[c]).astype('float32')
        print()
    
        Xy_train[CATS] = Xy_train[CATS].astype('category')
        X_valid[CATS] = X_valid[CATS].astype('category')
        X_test[CATS] = X_test[CATS].astype('category')
    
        Xy_train = IterLoadForDMatrix(Xy_train, FEATURES, 'y')
        dtrain = xgb.QuantileDMatrix(Xy_train, enable_categorical=True, max_bin=256)
        dval   = xgb.DMatrix(X_valid, label=y_valid, enable_categorical=True)
        dtest  = xgb.DMatrix(X_test, enable_categorical=True)
    
        params['seed'] = kk*FOLDS + fold
        model = xgb.train(
            params=params,
            dtrain=dtrain,
            num_boost_round=100_000, 
            evals=[(dtrain, "train"), (dval, "valid")],
            early_stopping_rounds=250,
            verbose_eval=500
        )
    
        oof_preds[val_idx] += model.predict(dval, iteration_range=(0, model.best_iteration + 1)) / REPEATS
        test_preds += model.predict(dtest, iteration_range=(0, model.best_iteration + 1)) / FOLDS / REPEATS

XGBoost version 2.0.3
We have 279 features.
#########################
### REPEAT 1, Fold 1 ###
#########################
Target encoding 127 features... 0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 




[0]	train-auc:0.94638	valid-auc:0.94769
[500]	train-auc:0.97518	valid-auc:0.97460
[1000]	train-auc:0.97757	valid-auc:0.97530
[1500]	train-auc:0.97935	valid-auc:0.97550
[2000]	train-auc:0.98092	valid-auc:0.97566
[2500]	train-auc:0.98234	valid-auc:0.97574
[3000]	train-auc:0.98366	valid-auc:0.97580
[3500]	train-auc:0.98486	valid-auc:0.97585
[4000]	train-auc:0.98598	valid-auc:0.97586
[4500]	train-auc:0.98705	valid-auc:0.97588
[5000]	train-auc:0.98803	valid-auc:0.97588
[5003]	train-auc:0.98803	valid-auc:0.97588
#########################
### REPEAT 1, Fold 2 ###
#########################
Target encoding 127 features... 0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 




[0]	train-auc:0.94960	valid-auc:0.95271
[500]	train-auc:0.97518	valid-auc:0.97487
[1000]	train-auc:0.97758	valid-auc:0.97549
[1500]	train-auc:0.97938	valid-auc:0.97571
[2000]	train-auc:0.98093	valid-auc:0.97579
[2500]	train-auc:0.98236	valid-auc:0.97586
[3000]	train-auc:0.98367	valid-auc:0.97591
[3500]	train-auc:0.98489	valid-auc:0.97593
[3913]	train-auc:0.98582	valid-auc:0.97594
#########################
### REPEAT 1, Fold 3 ###
#########################
Target encoding 127 features... 0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 




[0]	train-auc:0.94853	valid-auc:0.95158
[500]	train-auc:0.97519	valid-auc:0.97477
[1000]	train-auc:0.97759	valid-auc:0.97545
[1500]	train-auc:0.97939	valid-auc:0.97571
[2000]	train-auc:0.98097	valid-auc:0.97583
[2500]	train-auc:0.98237	valid-auc:0.97592
[3000]	train-auc:0.98370	valid-auc:0.97598
[3500]	train-auc:0.98491	valid-auc:0.97606
[4000]	train-auc:0.98605	valid-auc:0.97610
[4500]	train-auc:0.98710	valid-auc:0.97612
[5000]	train-auc:0.98806	valid-auc:0.97615
[5500]	train-auc:0.98898	valid-auc:0.97617
[5910]	train-auc:0.98968	valid-auc:0.97617
#########################
### REPEAT 1, Fold 4 ###
#########################
Target encoding 127 features... 0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 




[0]	train-auc:0.95505	valid-auc:0.95548
[500]	train-auc:0.97525	valid-auc:0.97391
[1000]	train-auc:0.97764	valid-auc:0.97460
[1500]	train-auc:0.97944	valid-auc:0.97487
[2000]	train-auc:0.98099	valid-auc:0.97501
[2500]	train-auc:0.98242	valid-auc:0.97510
[3000]	train-auc:0.98373	valid-auc:0.97517
[3500]	train-auc:0.98493	valid-auc:0.97520
[4000]	train-auc:0.98605	valid-auc:0.97521
[4500]	train-auc:0.98708	valid-auc:0.97523
[5000]	train-auc:0.98806	valid-auc:0.97524
[5500]	train-auc:0.98896	valid-auc:0.97524
[5691]	train-auc:0.98929	valid-auc:0.97523
#########################
### REPEAT 1, Fold 5 ###
#########################
Target encoding 127 features... 0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 




[0]	train-auc:0.94739	valid-auc:0.95125
[500]	train-auc:0.97521	valid-auc:0.97436
[1000]	train-auc:0.97761	valid-auc:0.97501
[1500]	train-auc:0.97940	valid-auc:0.97521
[2000]	train-auc:0.98097	valid-auc:0.97532
[2500]	train-auc:0.98239	valid-auc:0.97540
[3000]	train-auc:0.98369	valid-auc:0.97545
[3500]	train-auc:0.98489	valid-auc:0.97548
[4000]	train-auc:0.98603	valid-auc:0.97551
[4500]	train-auc:0.98706	valid-auc:0.97553
[4729]	train-auc:0.98751	valid-auc:0.97552
#########################
### REPEAT 1, Fold 6 ###
#########################
Target encoding 127 features... 0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 




[0]	train-auc:0.94557	valid-auc:0.94843
[500]	train-auc:0.97521	valid-auc:0.97484
[1000]	train-auc:0.97760	valid-auc:0.97544
[1500]	train-auc:0.97939	valid-auc:0.97566
[2000]	train-auc:0.98095	valid-auc:0.97579
[2500]	train-auc:0.98239	valid-auc:0.97587
[3000]	train-auc:0.98371	valid-auc:0.97592
[3500]	train-auc:0.98490	valid-auc:0.97596
[4000]	train-auc:0.98602	valid-auc:0.97601
[4314]	train-auc:0.98668	valid-auc:0.97600
#########################
### REPEAT 1, Fold 7 ###
#########################
Target encoding 127 features... 0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 




[0]	train-auc:0.94524	valid-auc:0.95056
[500]	train-auc:0.97513	valid-auc:0.97587
[1000]	train-auc:0.97751	valid-auc:0.97637
[1500]	train-auc:0.97932	valid-auc:0.97657
[2000]	train-auc:0.98091	valid-auc:0.97669
[2500]	train-auc:0.98233	valid-auc:0.97675
[3000]	train-auc:0.98365	valid-auc:0.97680
[3500]	train-auc:0.98487	valid-auc:0.97684
[4000]	train-auc:0.98599	valid-auc:0.97685
[4074]	train-auc:0.98615	valid-auc:0.97685
#########################
### REPEAT 1, Fold 8 ###
#########################
Target encoding 127 features... 0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 




[0]	train-auc:0.94753	valid-auc:0.94958
[500]	train-auc:0.97516	valid-auc:0.97490
[1000]	train-auc:0.97758	valid-auc:0.97557
[1500]	train-auc:0.97938	valid-auc:0.97576
[2000]	train-auc:0.98093	valid-auc:0.97584
[2500]	train-auc:0.98234	valid-auc:0.97591
[3000]	train-auc:0.98363	valid-auc:0.97595
[3500]	train-auc:0.98484	valid-auc:0.97600
[4000]	train-auc:0.98597	valid-auc:0.97603
[4500]	train-auc:0.98702	valid-auc:0.97604
[5000]	train-auc:0.98799	valid-auc:0.97605
[5133]	train-auc:0.98824	valid-auc:0.97605
#########################
### REPEAT 1, Fold 9 ###
#########################
Target encoding 127 features... 0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 




[0]	train-auc:0.95293	valid-auc:0.95521
[500]	train-auc:0.97511	valid-auc:0.97502
[1000]	train-auc:0.97753	valid-auc:0.97566
[1500]	train-auc:0.97933	valid-auc:0.97586
[2000]	train-auc:0.98091	valid-auc:0.97596
[2500]	train-auc:0.98232	valid-auc:0.97601
[3000]	train-auc:0.98365	valid-auc:0.97603
[3396]	train-auc:0.98463	valid-auc:0.97603
#########################
### REPEAT 1, Fold 10 ###
#########################
Target encoding 127 features... 0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 




[0]	train-auc:0.94853	valid-auc:0.95354
[500]	train-auc:0.97507	valid-auc:0.97619
[1000]	train-auc:0.97749	valid-auc:0.97678
[1500]	train-auc:0.97929	valid-auc:0.97696
[2000]	train-auc:0.98087	valid-auc:0.97707
[2500]	train-auc:0.98231	valid-auc:0.97713
[3000]	train-auc:0.98362	valid-auc:0.97717
[3500]	train-auc:0.98485	valid-auc:0.97721
[4000]	train-auc:0.98597	valid-auc:0.97723
[4100]	train-auc:0.98619	valid-auc:0.97723
#########################
### REPEAT 2, Fold 1 ###
#########################
Target encoding 127 features... 0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 




[0]	train-auc:0.94948	valid-auc:0.95135
[500]	train-auc:0.97523	valid-auc:0.97462
[1000]	train-auc:0.97759	valid-auc:0.97527
[1500]	train-auc:0.97936	valid-auc:0.97551
[2000]	train-auc:0.98094	valid-auc:0.97562
[2500]	train-auc:0.98237	valid-auc:0.97572
[3000]	train-auc:0.98367	valid-auc:0.97578
[3500]	train-auc:0.98489	valid-auc:0.97583
[4000]	train-auc:0.98602	valid-auc:0.97586
[4500]	train-auc:0.98706	valid-auc:0.97588
[4854]	train-auc:0.98775	valid-auc:0.97588
#########################
### REPEAT 2, Fold 2 ###
#########################
Target encoding 127 features... 0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 




[0]	train-auc:0.95120	valid-auc:0.95310
[500]	train-auc:0.97519	valid-auc:0.97486
[1000]	train-auc:0.97758	valid-auc:0.97550
[1500]	train-auc:0.97936	valid-auc:0.97571
[2000]	train-auc:0.98093	valid-auc:0.97582
[2500]	train-auc:0.98234	valid-auc:0.97589
[2826]	train-auc:0.98322	valid-auc:0.97588
#########################
### REPEAT 2, Fold 3 ###
#########################
Target encoding 127 features... 0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 




[0]	train-auc:0.95279	valid-auc:0.95527
[500]	train-auc:0.97517	valid-auc:0.97477
[1000]	train-auc:0.97759	valid-auc:0.97543
[1500]	train-auc:0.97939	valid-auc:0.97568
[2000]	train-auc:0.98097	valid-auc:0.97584
[2500]	train-auc:0.98240	valid-auc:0.97592
[3000]	train-auc:0.98371	valid-auc:0.97601
[3500]	train-auc:0.98490	valid-auc:0.97604
[4000]	train-auc:0.98602	valid-auc:0.97607
[4500]	train-auc:0.98706	valid-auc:0.97611
[5000]	train-auc:0.98805	valid-auc:0.97614
[5500]	train-auc:0.98896	valid-auc:0.97615
[5757]	train-auc:0.98941	valid-auc:0.97614
#########################
### REPEAT 2, Fold 4 ###
#########################
Target encoding 127 features... 0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 




[0]	train-auc:0.94978	valid-auc:0.95093
[500]	train-auc:0.97523	valid-auc:0.97391
[1000]	train-auc:0.97764	valid-auc:0.97459
[1500]	train-auc:0.97941	valid-auc:0.97481
[2000]	train-auc:0.98098	valid-auc:0.97494
[2500]	train-auc:0.98239	valid-auc:0.97503
[3000]	train-auc:0.98368	valid-auc:0.97508
[3500]	train-auc:0.98488	valid-auc:0.97514
[4000]	train-auc:0.98600	valid-auc:0.97517
[4500]	train-auc:0.98703	valid-auc:0.97517
[4549]	train-auc:0.98713	valid-auc:0.97516
#########################
### REPEAT 2, Fold 5 ###
#########################
Target encoding 127 features... 0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 




[0]	train-auc:0.94866	valid-auc:0.95109
[500]	train-auc:0.97520	valid-auc:0.97429
[1000]	train-auc:0.97759	valid-auc:0.97492
[1500]	train-auc:0.97940	valid-auc:0.97517
[2000]	train-auc:0.98095	valid-auc:0.97528
[2500]	train-auc:0.98238	valid-auc:0.97536
[3000]	train-auc:0.98368	valid-auc:0.97541
[3500]	train-auc:0.98487	valid-auc:0.97544
[3729]	train-auc:0.98539	valid-auc:0.97543
#########################
### REPEAT 2, Fold 6 ###
#########################
Target encoding 127 features... 0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 




[0]	train-auc:0.94842	valid-auc:0.95200
[500]	train-auc:0.97519	valid-auc:0.97481
[1000]	train-auc:0.97760	valid-auc:0.97544
[1500]	train-auc:0.97940	valid-auc:0.97565
[2000]	train-auc:0.98098	valid-auc:0.97579
[2500]	train-auc:0.98240	valid-auc:0.97588
[3000]	train-auc:0.98370	valid-auc:0.97591
[3500]	train-auc:0.98491	valid-auc:0.97595
[4000]	train-auc:0.98603	valid-auc:0.97598
[4351]	train-auc:0.98678	valid-auc:0.97598
#########################
### REPEAT 2, Fold 7 ###
#########################
Target encoding 127 features... 0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 




[0]	train-auc:0.94750	valid-auc:0.95332
[500]	train-auc:0.97514	valid-auc:0.97580
[1000]	train-auc:0.97753	valid-auc:0.97631
[1500]	train-auc:0.97936	valid-auc:0.97650
[2000]	train-auc:0.98092	valid-auc:0.97658
[2500]	train-auc:0.98233	valid-auc:0.97666
[3000]	train-auc:0.98364	valid-auc:0.97672
[3500]	train-auc:0.98484	valid-auc:0.97674
[4000]	train-auc:0.98599	valid-auc:0.97675
[4018]	train-auc:0.98603	valid-auc:0.97676
#########################
### REPEAT 2, Fold 8 ###
#########################
Target encoding 127 features... 0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 




[0]	train-auc:0.95229	valid-auc:0.95475
[500]	train-auc:0.97513	valid-auc:0.97489
[1000]	train-auc:0.97755	valid-auc:0.97556
[1500]	train-auc:0.97936	valid-auc:0.97578
[2000]	train-auc:0.98094	valid-auc:0.97587
[2500]	train-auc:0.98236	valid-auc:0.97593
[3000]	train-auc:0.98366	valid-auc:0.97596
[3500]	train-auc:0.98485	valid-auc:0.97599
[4000]	train-auc:0.98597	valid-auc:0.97599
[4065]	train-auc:0.98611	valid-auc:0.97599
#########################
### REPEAT 2, Fold 9 ###
#########################
Target encoding 127 features... 0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 




[0]	train-auc:0.95137	valid-auc:0.95331
[500]	train-auc:0.97512	valid-auc:0.97499
[1000]	train-auc:0.97753	valid-auc:0.97559
[1500]	train-auc:0.97934	valid-auc:0.97580
[2000]	train-auc:0.98092	valid-auc:0.97593
[2500]	train-auc:0.98233	valid-auc:0.97598
[3000]	train-auc:0.98364	valid-auc:0.97603
[3500]	train-auc:0.98483	valid-auc:0.97605
[3594]	train-auc:0.98505	valid-auc:0.97605
#########################
### REPEAT 2, Fold 10 ###
#########################
Target encoding 127 features... 0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 




[0]	train-auc:0.94721	valid-auc:0.95210
[500]	train-auc:0.97508	valid-auc:0.97616
[1000]	train-auc:0.97751	valid-auc:0.97676
[1500]	train-auc:0.97929	valid-auc:0.97693
[2000]	train-auc:0.98088	valid-auc:0.97708
[2500]	train-auc:0.98230	valid-auc:0.97712
[3000]	train-auc:0.98362	valid-auc:0.97715
[3500]	train-auc:0.98482	valid-auc:0.97718
[4000]	train-auc:0.98596	valid-auc:0.97719
[4500]	train-auc:0.98699	valid-auc:0.97720
[4683]	train-auc:0.98736	valid-auc:0.97720
#########################
### REPEAT 3, Fold 1 ###
#########################
Target encoding 127 features... 0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 




[0]	train-auc:0.94625	valid-auc:0.94906
[500]	train-auc:0.97515	valid-auc:0.97462
[1000]	train-auc:0.97756	valid-auc:0.97530
[1500]	train-auc:0.97937	valid-auc:0.97556
[2000]	train-auc:0.98094	valid-auc:0.97563
[2500]	train-auc:0.98236	valid-auc:0.97572
[3000]	train-auc:0.98366	valid-auc:0.97578
[3500]	train-auc:0.98487	valid-auc:0.97580
[4000]	train-auc:0.98601	valid-auc:0.97584
[4394]	train-auc:0.98683	valid-auc:0.97584
#########################
### REPEAT 3, Fold 2 ###
#########################
Target encoding 127 features... 0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 




[0]	train-auc:0.94899	valid-auc:0.95071
[500]	train-auc:0.97521	valid-auc:0.97486
[1000]	train-auc:0.97761	valid-auc:0.97549
[1500]	train-auc:0.97939	valid-auc:0.97568
[2000]	train-auc:0.98096	valid-auc:0.97578
[2500]	train-auc:0.98238	valid-auc:0.97586
[3000]	train-auc:0.98369	valid-auc:0.97589
[3500]	train-auc:0.98490	valid-auc:0.97594
[4000]	train-auc:0.98601	valid-auc:0.97596
[4500]	train-auc:0.98705	valid-auc:0.97598
[4560]	train-auc:0.98717	valid-auc:0.97599
#########################
### REPEAT 3, Fold 3 ###
#########################
Target encoding 127 features... 0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 




[0]	train-auc:0.94979	valid-auc:0.95178
[500]	train-auc:0.97518	valid-auc:0.97477
[1000]	train-auc:0.97760	valid-auc:0.97544
[1500]	train-auc:0.97939	valid-auc:0.97566
[2000]	train-auc:0.98095	valid-auc:0.97578
[2500]	train-auc:0.98237	valid-auc:0.97587
[3000]	train-auc:0.98367	valid-auc:0.97596
[3500]	train-auc:0.98488	valid-auc:0.97599
[4000]	train-auc:0.98602	valid-auc:0.97602
[4500]	train-auc:0.98705	valid-auc:0.97605
[5000]	train-auc:0.98804	valid-auc:0.97607
[5500]	train-auc:0.98896	valid-auc:0.97608
[5560]	train-auc:0.98906	valid-auc:0.97608
#########################
### REPEAT 3, Fold 4 ###
#########################
Target encoding 127 features... 0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 




[0]	train-auc:0.95000	valid-auc:0.95181
[500]	train-auc:0.97524	valid-auc:0.97391
[1000]	train-auc:0.97763	valid-auc:0.97459
[1500]	train-auc:0.97943	valid-auc:0.97478
[2000]	train-auc:0.98099	valid-auc:0.97490
[2500]	train-auc:0.98240	valid-auc:0.97496
[3000]	train-auc:0.98370	valid-auc:0.97503
[3500]	train-auc:0.98491	valid-auc:0.97505
[3639]	train-auc:0.98523	valid-auc:0.97506
#########################
### REPEAT 3, Fold 5 ###
#########################
Target encoding 127 features... 0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 




[0]	train-auc:0.94978	valid-auc:0.95141
[500]	train-auc:0.97523	valid-auc:0.97430
[1000]	train-auc:0.97763	valid-auc:0.97494
[1500]	train-auc:0.97941	valid-auc:0.97517
[2000]	train-auc:0.98097	valid-auc:0.97532
[2500]	train-auc:0.98238	valid-auc:0.97542
[3000]	train-auc:0.98368	valid-auc:0.97548
[3500]	train-auc:0.98488	valid-auc:0.97550
[4000]	train-auc:0.98601	valid-auc:0.97552
[4010]	train-auc:0.98603	valid-auc:0.97552
#########################
### REPEAT 3, Fold 6 ###
#########################
Target encoding 127 features... 0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 




[0]	train-auc:0.95199	valid-auc:0.95516
[500]	train-auc:0.97519	valid-auc:0.97484
[1000]	train-auc:0.97759	valid-auc:0.97549
[1500]	train-auc:0.97940	valid-auc:0.97569
[2000]	train-auc:0.98096	valid-auc:0.97581
[2500]	train-auc:0.98238	valid-auc:0.97589
[3000]	train-auc:0.98369	valid-auc:0.97594
[3500]	train-auc:0.98489	valid-auc:0.97596
[4000]	train-auc:0.98601	valid-auc:0.97597
[4214]	train-auc:0.98646	valid-auc:0.97597
#########################
### REPEAT 3, Fold 7 ###
#########################
Target encoding 127 features... 0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 




[0]	train-auc:0.95165	valid-auc:0.95720
[500]	train-auc:0.97512	valid-auc:0.97589
[1000]	train-auc:0.97752	valid-auc:0.97642
[1500]	train-auc:0.97933	valid-auc:0.97658
[2000]	train-auc:0.98090	valid-auc:0.97667
[2500]	train-auc:0.98232	valid-auc:0.97673
[3000]	train-auc:0.98365	valid-auc:0.97677
[3311]	train-auc:0.98441	valid-auc:0.97677
#########################
### REPEAT 3, Fold 8 ###
#########################
Target encoding 127 features... 0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 




[0]	train-auc:0.94063	valid-auc:0.94407
[500]	train-auc:0.97513	valid-auc:0.97490
[1000]	train-auc:0.97757	valid-auc:0.97561
[1500]	train-auc:0.97935	valid-auc:0.97580
[2000]	train-auc:0.98091	valid-auc:0.97591
[2500]	train-auc:0.98233	valid-auc:0.97600
[3000]	train-auc:0.98362	valid-auc:0.97603
[3500]	train-auc:0.98483	valid-auc:0.97604
[4000]	train-auc:0.98596	valid-auc:0.97607
[4243]	train-auc:0.98648	valid-auc:0.97606
#########################
### REPEAT 3, Fold 9 ###
#########################
Target encoding 127 features... 0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 




[0]	train-auc:0.95227	valid-auc:0.95423
[500]	train-auc:0.97512	valid-auc:0.97503
[1000]	train-auc:0.97754	valid-auc:0.97565
[1500]	train-auc:0.97932	valid-auc:0.97581
[2000]	train-auc:0.98088	valid-auc:0.97590
[2500]	train-auc:0.98231	valid-auc:0.97598
[3000]	train-auc:0.98362	valid-auc:0.97602
[3500]	train-auc:0.98482	valid-auc:0.97605
[4000]	train-auc:0.98595	valid-auc:0.97608
[4284]	train-auc:0.98656	valid-auc:0.97608
#########################
### REPEAT 3, Fold 10 ###
#########################
Target encoding 127 features... 0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 




[0]	train-auc:0.95044	valid-auc:0.95440
[500]	train-auc:0.97505	valid-auc:0.97612
[1000]	train-auc:0.97747	valid-auc:0.97673
[1500]	train-auc:0.97928	valid-auc:0.97695
[2000]	train-auc:0.98085	valid-auc:0.97706
[2500]	train-auc:0.98228	valid-auc:0.97713
[3000]	train-auc:0.98359	valid-auc:0.97717
[3500]	train-auc:0.98481	valid-auc:0.97721
[4000]	train-auc:0.98595	valid-auc:0.97722
[4399]	train-auc:0.98678	valid-auc:0.97722


In [12]:
from sklearn.metrics import roc_auc_score

m = roc_auc_score(train.y.to_numpy(), oof_preds)
print(f"XGB (Train More) with Original Data as rows CV = {m}")

XGB (Train More) with Original Data as rows CV = 0.976159429546943
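
For intuition about the CV number above: AUC is the probability that a randomly chosen positive row receives a higher score than a randomly chosen negative row. The sketch below computes it directly from that pairwise definition on a tiny made-up example. The function name `auc_pairwise` and the toy labels and scores are assumptions for illustration only; this is not how `roc_auc_score` is implemented, and the quadratic loop is far too slow for the 750,000-row OOF vector.

```python
def auc_pairwise(y_true, y_score):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # tie counts as half a win
    return wins / (len(pos) * len(neg))

# Toy data (made up): 2 negatives, 2 positives
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
toy_auc = auc_pairwise(y_true, y_score)
print(toy_auc)  # 3 of the 4 positive/negative pairs are ordered correctly -> 0.75
```

Because AUC depends only on the ordering of the scores, averaging OOF predictions over repeats can change the metric only through changes in rank.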


# Train More - Train XGB w/ Original Data as Cols
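Previously our XGB used a learning rate of 0.1 for experiments; for the final submission we train it with 0.02. Previously our XGB had 7 folds; we now train it with 10 folds. We repeat the 10-fold training 3 times, giving each model a different XGB seed, and average the predictions. In this section the original data is used as columns: the `TE` columns are appended to `FEATURES` before training.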
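
The averaging bookkeeping used by the training cell in this section can be sketched with the standard library alone. This is a toy illustration, not the notebook's pipeline: `kfold_indices` and `toy_model_predict` are hypothetical stand-ins for `KFold` and a trained XGB model, and the sizes are made up. It shows that every train row is scored once per repeat, so dividing by `REPEATS` gives an average OOF prediction, while test predictions are divided by `FOLDS` and `REPEATS` because every model scores the whole test set.

```python
import random

def kfold_indices(n_rows, n_folds, seed):
    """Shuffle row indices once and deal them into n_folds validation sets."""
    idx = list(range(n_rows))
    random.Random(seed).shuffle(idx)
    return [idx[f::n_folds] for f in range(n_folds)]

def toy_model_predict(rows, model_seed):
    """Hypothetical stand-in for a trained model: a noisy prediction near 0.5."""
    rng = random.Random(model_seed)
    return [0.5 + rng.uniform(-0.1, 0.1) for _ in rows]

N_TRAIN, N_TEST, FOLDS, REPEATS, SEED = 20, 5, 10, 3, 42  # toy sizes (assumed)

oof = [0.0] * N_TRAIN
test_pred = [0.0] * N_TEST
hits = [0] * N_TRAIN   # how many times each train row was scored
seeds_used = set()

for kk in range(REPEATS):
    # Same SEED every repeat, so the fold assignment is identical across repeats
    folds = kfold_indices(N_TRAIN, FOLDS, SEED)
    for fold, val_idx in enumerate(folds):
        model_seed = kk * FOLDS + fold   # unique per (repeat, fold)
        seeds_used.add(model_seed)
        # Each train row lands in exactly one validation fold per repeat
        for i, p in zip(val_idx, toy_model_predict(val_idx, model_seed)):
            oof[i] += p / REPEATS
            hits[i] += 1
        # Every model predicts the full test set
        for j, p in enumerate(toy_model_predict(range(N_TEST), model_seed + 1000)):
            test_pred[j] += p / FOLDS / REPEATS

print(set(hits), len(seeds_used))
```

With identical splits in every repeat, the only source of variation between repeats is the model seed, which is what the averaging is meant to smooth out.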

In [13]:
FEATURES += TE
print(f"We have {len(FEATURES)} features.")

oof_preds2 = np.zeros(len(train))
test_preds2 = np.zeros(len(test))

REPEATS = 3
for kk in range(REPEATS):
    # Same SEED each repeat: fold splits are identical, only the XGB seed changes below
    kf = KFold(n_splits=FOLDS, shuffle=True, random_state=SEED)
    for fold, (train_idx, val_idx) in enumerate(kf.split(train)):
        print("#"*25)
        print(f"### REPEAT {kk+1}, Fold {fold+1} ###")
        print("#"*25)
    
        Xy_train = train.iloc[train_idx][ FEATURES+['y'] ].copy()  
        X_valid = train.iloc[val_idx][FEATURES].copy()
        y_valid = train.iloc[val_idx]['y']
        X_test = test[FEATURES].copy()
    
        CC = CATS1+CATS2
        print(f"Target encoding {len(CC)} features... ",end="")
        for i,c in enumerate(CC):
            if i%10==0: print(f"{i}, ",end="")
            TE0 = TargetEncoder(n_folds=10, smooth=0, split_method='random', stat='mean')
            Xy_train[c] = TE0.fit_transform(Xy_train[c],Xy_train['y']).astype('float32')
            X_valid[c] = TE0.transform(X_valid[c]).astype('float32')
            X_test[c] = TE0.transform(X_test[c]).astype('float32')
        print()
    
        Xy_train[CATS] = Xy_train[CATS].astype('category')
        X_valid[CATS] = X_valid[CATS].astype('category')
        X_test[CATS] = X_test[CATS].astype('category')
    
        Xy_train = IterLoadForDMatrix(Xy_train, FEATURES, 'y')
        dtrain = xgb.QuantileDMatrix(Xy_train, enable_categorical=True, max_bin=256)
        dval   = xgb.DMatrix(X_valid, label=y_valid, enable_categorical=True)
        dtest  = xgb.DMatrix(X_test, enable_categorical=True)
    
        # Unique XGB seed per (repeat, fold) so repeats differ despite identical splits
        params['seed'] = kk*FOLDS + fold
        model = xgb.train(
            params=params,
            dtrain=dtrain,
            num_boost_round=100_000, 
            evals=[(dtrain, "train"), (dval, "valid")],
            early_stopping_rounds=250,
            verbose_eval=500
        )
    
        # Average OOF over repeats; average test predictions over folds and repeats
        oof_preds2[val_idx] += model.predict(dval, iteration_range=(0, model.best_iteration + 1)) / REPEATS
        test_preds2 += model.predict(dtest, iteration_range=(0, model.best_iteration + 1)) / FOLDS / REPEATS

# FILL NAN FOR NN BELOW (we skipped this above for xgb)
# Missing TE_ORIG_* values are replaced with the original data's mean target
CC = CATS+CATS1+CATS2
mn = orig.y.mean()
for c in CC:
    NAME = f"TE_ORIG_{c}"
    train[NAME] = train[NAME].fillna(mn)
    test[NAME] = test[NAME].fillna(mn)

We have 415 features.
#########################
### REPEAT 1, Fold 1 ###
#########################
Target encoding 127 features... 0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 




[0]	train-auc:0.95059	valid-auc:0.94970
[500]	train-auc:0.97694	valid-auc:0.97475
[1000]	train-auc:0.97948	valid-auc:0.97547
[1500]	train-auc:0.98133	valid-auc:0.97573
[2000]	train-auc:0.98296	valid-auc:0.97586
[2500]	train-auc:0.98444	valid-auc:0.97595
[3000]	train-auc:0.98581	valid-auc:0.97600
[3400]	train-auc:0.98680	valid-auc:0.97601
#########################
### REPEAT 1, Fold 2 ###
#########################
Target encoding 127 features... 0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 




[0]	train-auc:0.95047	valid-auc:0.94998
[500]	train-auc:0.97695	valid-auc:0.97490
[1000]	train-auc:0.97947	valid-auc:0.97560
[1500]	train-auc:0.98134	valid-auc:0.97582
[2000]	train-auc:0.98296	valid-auc:0.97591
[2500]	train-auc:0.98443	valid-auc:0.97599
[3000]	train-auc:0.98578	valid-auc:0.97602
[3500]	train-auc:0.98703	valid-auc:0.97607
[4000]	train-auc:0.98817	valid-auc:0.97608
[4044]	train-auc:0.98827	valid-auc:0.97609
#########################
### REPEAT 1, Fold 3 ###
#########################
Target encoding 127 features... 0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 




[0]	train-auc:0.95053	valid-auc:0.94990
[500]	train-auc:0.97695	valid-auc:0.97487
[1000]	train-auc:0.97946	valid-auc:0.97564
[1500]	train-auc:0.98132	valid-auc:0.97582
[2000]	train-auc:0.98295	valid-auc:0.97592
[2500]	train-auc:0.98442	valid-auc:0.97600
[3000]	train-auc:0.98578	valid-auc:0.97607
[3500]	train-auc:0.98701	valid-auc:0.97611
[4000]	train-auc:0.98815	valid-auc:0.97613
[4500]	train-auc:0.98922	valid-auc:0.97612
[4501]	train-auc:0.98922	valid-auc:0.97612
#########################
### REPEAT 1, Fold 4 ###
#########################
Target encoding 127 features... 0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 




[0]	train-auc:0.95397	valid-auc:0.95111
[500]	train-auc:0.97703	valid-auc:0.97387
[1000]	train-auc:0.97950	valid-auc:0.97462
[1500]	train-auc:0.98137	valid-auc:0.97484
[2000]	train-auc:0.98301	valid-auc:0.97495
[2500]	train-auc:0.98447	valid-auc:0.97504
[3000]	train-auc:0.98581	valid-auc:0.97510
[3500]	train-auc:0.98705	valid-auc:0.97513
[4000]	train-auc:0.98820	valid-auc:0.97517
[4290]	train-auc:0.98881	valid-auc:0.97517
#########################
### REPEAT 1, Fold 5 ###
#########################
Target encoding 127 features... 0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 




[0]	train-auc:0.94948	valid-auc:0.95094
[500]	train-auc:0.97699	valid-auc:0.97452
[1000]	train-auc:0.97949	valid-auc:0.97524
[1500]	train-auc:0.98133	valid-auc:0.97547
[2000]	train-auc:0.98295	valid-auc:0.97561
[2500]	train-auc:0.98442	valid-auc:0.97568
[3000]	train-auc:0.98577	valid-auc:0.97574
[3500]	train-auc:0.98701	valid-auc:0.97577
[3808]	train-auc:0.98773	valid-auc:0.97576
#########################
### REPEAT 1, Fold 6 ###
#########################
Target encoding 127 features... 0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 




[0]	train-auc:0.95145	valid-auc:0.95095
[500]	train-auc:0.97694	valid-auc:0.97489
[1000]	train-auc:0.97946	valid-auc:0.97569
[1500]	train-auc:0.98132	valid-auc:0.97592
[2000]	train-auc:0.98294	valid-auc:0.97605
[2500]	train-auc:0.98442	valid-auc:0.97613
[3000]	train-auc:0.98576	valid-auc:0.97620
[3500]	train-auc:0.98700	valid-auc:0.97624
[4000]	train-auc:0.98813	valid-auc:0.97626
[4500]	train-auc:0.98919	valid-auc:0.97628
[4970]	train-auc:0.99010	valid-auc:0.97629
#########################
### REPEAT 1, Fold 7 ###
#########################
Target encoding 127 features... 0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 




[0]	train-auc:0.95435	valid-auc:0.95627
[500]	train-auc:0.97691	valid-auc:0.97601
[1000]	train-auc:0.97941	valid-auc:0.97658
[1500]	train-auc:0.98127	valid-auc:0.97676
[2000]	train-auc:0.98291	valid-auc:0.97684
[2500]	train-auc:0.98437	valid-auc:0.97688
[3000]	train-auc:0.98573	valid-auc:0.97691
[3500]	train-auc:0.98698	valid-auc:0.97693
[3855]	train-auc:0.98781	valid-auc:0.97693
#########################
### REPEAT 1, Fold 8 ###
#########################
Target encoding 127 features... 0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 




[0]	train-auc:0.95042	valid-auc:0.95012
[500]	train-auc:0.97694	valid-auc:0.97509
[1000]	train-auc:0.97942	valid-auc:0.97582
[1500]	train-auc:0.98126	valid-auc:0.97602
[2000]	train-auc:0.98289	valid-auc:0.97617
[2500]	train-auc:0.98435	valid-auc:0.97624
[3000]	train-auc:0.98571	valid-auc:0.97628
[3500]	train-auc:0.98695	valid-auc:0.97632
[4000]	train-auc:0.98809	valid-auc:0.97634
[4431]	train-auc:0.98900	valid-auc:0.97634
#########################
### REPEAT 1, Fold 9 ###
#########################
Target encoding 127 features... 0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 




[0]	train-auc:0.95297	valid-auc:0.95256
[500]	train-auc:0.97690	valid-auc:0.97509
[1000]	train-auc:0.97940	valid-auc:0.97583
[1500]	train-auc:0.98128	valid-auc:0.97605
[2000]	train-auc:0.98292	valid-auc:0.97617
[2500]	train-auc:0.98438	valid-auc:0.97626
[3000]	train-auc:0.98574	valid-auc:0.97631
[3500]	train-auc:0.98698	valid-auc:0.97633
[3513]	train-auc:0.98701	valid-auc:0.97633
#########################
### REPEAT 1, Fold 10 ###
#########################
Target encoding 127 features... 0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 




[0]	train-auc:0.95448	valid-auc:0.95601
[500]	train-auc:0.97681	valid-auc:0.97647
[1000]	train-auc:0.97932	valid-auc:0.97712
[1500]	train-auc:0.98120	valid-auc:0.97731
[2000]	train-auc:0.98286	valid-auc:0.97743
[2500]	train-auc:0.98433	valid-auc:0.97750
[3000]	train-auc:0.98568	valid-auc:0.97754
[3500]	train-auc:0.98694	valid-auc:0.97755
[3762]	train-auc:0.98755	valid-auc:0.97755
#########################
### REPEAT 2, Fold 1 ###
#########################
Target encoding 127 features... 0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 




[0]	train-auc:0.95057	valid-auc:0.94882
[500]	train-auc:0.97693	valid-auc:0.97473
[1000]	train-auc:0.97941	valid-auc:0.97545
[1500]	train-auc:0.98128	valid-auc:0.97573
[2000]	train-auc:0.98292	valid-auc:0.97584
[2500]	train-auc:0.98440	valid-auc:0.97591
[3000]	train-auc:0.98576	valid-auc:0.97599
[3500]	train-auc:0.98700	valid-auc:0.97604
[4000]	train-auc:0.98815	valid-auc:0.97606
[4140]	train-auc:0.98846	valid-auc:0.97607
#########################
### REPEAT 2, Fold 2 ###
#########################
Target encoding 127 features... 0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 




[0]	train-auc:0.95398	valid-auc:0.95285
[500]	train-auc:0.97697	valid-auc:0.97490
[1000]	train-auc:0.97948	valid-auc:0.97556
[1500]	train-auc:0.98133	valid-auc:0.97575
[2000]	train-auc:0.98297	valid-auc:0.97587
[2500]	train-auc:0.98446	valid-auc:0.97597
[3000]	train-auc:0.98581	valid-auc:0.97601
[3500]	train-auc:0.98704	valid-auc:0.97606
[3909]	train-auc:0.98799	valid-auc:0.97607
#########################
### REPEAT 2, Fold 3 ###
#########################
Target encoding 127 features... 0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 




[0]	train-auc:0.95313	valid-auc:0.95290
[500]	train-auc:0.97695	valid-auc:0.97486
[1000]	train-auc:0.97945	valid-auc:0.97562
[1500]	train-auc:0.98132	valid-auc:0.97583
[2000]	train-auc:0.98297	valid-auc:0.97596
[2500]	train-auc:0.98445	valid-auc:0.97606
[3000]	train-auc:0.98580	valid-auc:0.97611
[3500]	train-auc:0.98703	valid-auc:0.97617
[3937]	train-auc:0.98803	valid-auc:0.97616
#########################
### REPEAT 2, Fold 4 ###
#########################
Target encoding 127 features... 0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 




[0]	train-auc:0.94753	valid-auc:0.94467
[500]	train-auc:0.97701	valid-auc:0.97391
[1000]	train-auc:0.97952	valid-auc:0.97467
[1500]	train-auc:0.98139	valid-auc:0.97492
[2000]	train-auc:0.98301	valid-auc:0.97502
[2500]	train-auc:0.98447	valid-auc:0.97508
[3000]	train-auc:0.98578	valid-auc:0.97513
[3500]	train-auc:0.98701	valid-auc:0.97514
[3714]	train-auc:0.98751	valid-auc:0.97514
#########################
### REPEAT 2, Fold 5 ###
#########################
Target encoding 127 features... 0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 




[0]	train-auc:0.95320	valid-auc:0.95350
[500]	train-auc:0.97695	valid-auc:0.97450
[1000]	train-auc:0.97946	valid-auc:0.97529
[1500]	train-auc:0.98133	valid-auc:0.97552
[2000]	train-auc:0.98295	valid-auc:0.97563
[2500]	train-auc:0.98441	valid-auc:0.97571
[3000]	train-auc:0.98575	valid-auc:0.97574
[3500]	train-auc:0.98698	valid-auc:0.97578
[4000]	train-auc:0.98812	valid-auc:0.97580
[4222]	train-auc:0.98861	valid-auc:0.97580
#########################
### REPEAT 2, Fold 6 ###
#########################
Target encoding 127 features... 0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 




[0]	train-auc:0.94978	valid-auc:0.94938
[500]	train-auc:0.97695	valid-auc:0.97493
[1000]	train-auc:0.97944	valid-auc:0.97569
[1500]	train-auc:0.98132	valid-auc:0.97593
[2000]	train-auc:0.98296	valid-auc:0.97605
[2500]	train-auc:0.98442	valid-auc:0.97610
[3000]	train-auc:0.98576	valid-auc:0.97615
[3438]	train-auc:0.98686	valid-auc:0.97616
#########################
### REPEAT 2, Fold 7 ###
#########################
Target encoding 127 features... 0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 




[0]	train-auc:0.95124	valid-auc:0.95391
[500]	train-auc:0.97691	valid-auc:0.97599
[1000]	train-auc:0.97943	valid-auc:0.97660
[1500]	train-auc:0.98130	valid-auc:0.97677
[2000]	train-auc:0.98293	valid-auc:0.97682
[2500]	train-auc:0.98440	valid-auc:0.97689
[3000]	train-auc:0.98575	valid-auc:0.97692
[3456]	train-auc:0.98688	valid-auc:0.97694
#########################
### REPEAT 2, Fold 8 ###
#########################
Target encoding 127 features... 0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 




[0]	train-auc:0.95092	valid-auc:0.95021
[500]	train-auc:0.97693	valid-auc:0.97510
[1000]	train-auc:0.97942	valid-auc:0.97587
[1500]	train-auc:0.98128	valid-auc:0.97612
[2000]	train-auc:0.98289	valid-auc:0.97623
[2500]	train-auc:0.98433	valid-auc:0.97630
[3000]	train-auc:0.98568	valid-auc:0.97635
[3500]	train-auc:0.98694	valid-auc:0.97639
[3965]	train-auc:0.98800	valid-auc:0.97640
#########################
### REPEAT 2, Fold 9 ###
#########################
Target encoding 127 features... 0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 




[0]	train-auc:0.95331	valid-auc:0.95352
[500]	train-auc:0.97688	valid-auc:0.97507
[1000]	train-auc:0.97943	valid-auc:0.97583
[1500]	train-auc:0.98129	valid-auc:0.97602
[2000]	train-auc:0.98292	valid-auc:0.97617
[2500]	train-auc:0.98440	valid-auc:0.97626
[3000]	train-auc:0.98573	valid-auc:0.97630
[3500]	train-auc:0.98696	valid-auc:0.97631
[3551]	train-auc:0.98708	valid-auc:0.97631
#########################
### REPEAT 2, Fold 10 ###
#########################
Target encoding 127 features... 0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 




[0]	train-auc:0.95381	valid-auc:0.95466
[500]	train-auc:0.97680	valid-auc:0.97649
[1000]	train-auc:0.97931	valid-auc:0.97713
[1500]	train-auc:0.98119	valid-auc:0.97737
[2000]	train-auc:0.98283	valid-auc:0.97747
[2500]	train-auc:0.98433	valid-auc:0.97753
[3000]	train-auc:0.98569	valid-auc:0.97758
[3500]	train-auc:0.98692	valid-auc:0.97760
[3664]	train-auc:0.98730	valid-auc:0.97760
#########################
### REPEAT 3, Fold 1 ###
#########################
Target encoding 127 features... 0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 




[0]	train-auc:0.95004	valid-auc:0.94829
[500]	train-auc:0.97697	valid-auc:0.97477
[1000]	train-auc:0.97946	valid-auc:0.97550
[1500]	train-auc:0.98133	valid-auc:0.97573
[2000]	train-auc:0.98293	valid-auc:0.97584
[2500]	train-auc:0.98440	valid-auc:0.97592
[3000]	train-auc:0.98574	valid-auc:0.97600
[3500]	train-auc:0.98697	valid-auc:0.97601
[3895]	train-auc:0.98791	valid-auc:0.97602
#########################
### REPEAT 3, Fold 2 ###
#########################
Target encoding 127 features... 0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 




[0]	train-auc:0.95402	valid-auc:0.95320
[500]	train-auc:0.97699	valid-auc:0.97492
[1000]	train-auc:0.97949	valid-auc:0.97560
[1500]	train-auc:0.98136	valid-auc:0.97581
[2000]	train-auc:0.98299	valid-auc:0.97591
[2500]	train-auc:0.98446	valid-auc:0.97597
[3000]	train-auc:0.98580	valid-auc:0.97601
[3500]	train-auc:0.98703	valid-auc:0.97605
[4000]	train-auc:0.98816	valid-auc:0.97607
[4081]	train-auc:0.98834	valid-auc:0.97606
#########################
### REPEAT 3, Fold 3 ###
#########################
Target encoding 127 features... 0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 


[0]	train-auc:0.95421	valid-auc:0.95283
[500]	train-auc:0.97694	valid-auc:0.97485
[1000]	train-auc:0.97945	valid-auc:0.97564
[1500]	train-auc:0.98131	valid-auc:0.97589
[2000]	train-auc:0.98294	valid-auc:0.97601
[2500]	train-auc:0.98442	valid-auc:0.97608
[3000]	train-auc:0.98577	valid-auc:0.97614
[3500]	train-auc:0.98702	valid-auc:0.97619
[4000]	train-auc:0.98819	valid-auc:0.97623
[4500]	train-auc:0.98925	valid-auc:0.97626
[4958]	train-auc:0.99013	valid-auc:0.97626
#########################
### REPEAT 3, Fold 4 ###
#########################
Target encoding 127 features... 0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 


[0]	train-auc:0.95117	valid-auc:0.94866
[500]	train-auc:0.97701	valid-auc:0.97388
[1000]	train-auc:0.97953	valid-auc:0.97462
[1500]	train-auc:0.98138	valid-auc:0.97487
[2000]	train-auc:0.98298	valid-auc:0.97500
[2500]	train-auc:0.98446	valid-auc:0.97507
[3000]	train-auc:0.98577	valid-auc:0.97513
[3500]	train-auc:0.98701	valid-auc:0.97518
[4000]	train-auc:0.98814	valid-auc:0.97521
[4500]	train-auc:0.98921	valid-auc:0.97525
[5000]	train-auc:0.99018	valid-auc:0.97527
[5026]	train-auc:0.99023	valid-auc:0.97528
#########################
### REPEAT 3, Fold 5 ###
#########################
Target encoding 127 features... 0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 


[0]	train-auc:0.94968	valid-auc:0.95092
[500]	train-auc:0.97699	valid-auc:0.97454
[1000]	train-auc:0.97949	valid-auc:0.97526
[1500]	train-auc:0.98132	valid-auc:0.97549
[2000]	train-auc:0.98296	valid-auc:0.97561
[2500]	train-auc:0.98443	valid-auc:0.97568
[3000]	train-auc:0.98578	valid-auc:0.97570
[3500]	train-auc:0.98702	valid-auc:0.97574
[4000]	train-auc:0.98816	valid-auc:0.97576
[4500]	train-auc:0.98922	valid-auc:0.97578
[4539]	train-auc:0.98930	valid-auc:0.97578
#########################
### REPEAT 3, Fold 6 ###
#########################
Target encoding 127 features... 0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 


[0]	train-auc:0.95334	valid-auc:0.95394
[500]	train-auc:0.97695	valid-auc:0.97492
[1000]	train-auc:0.97946	valid-auc:0.97568
[1500]	train-auc:0.98133	valid-auc:0.97592
[2000]	train-auc:0.98294	valid-auc:0.97601
[2500]	train-auc:0.98439	valid-auc:0.97611
[3000]	train-auc:0.98573	valid-auc:0.97618
[3500]	train-auc:0.98695	valid-auc:0.97621
[3981]	train-auc:0.98806	valid-auc:0.97620
#########################
### REPEAT 3, Fold 7 ###
#########################
Target encoding 127 features... 0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 


[0]	train-auc:0.95342	valid-auc:0.95548
[500]	train-auc:0.97695	valid-auc:0.97603
[1000]	train-auc:0.97945	valid-auc:0.97664
[1500]	train-auc:0.98132	valid-auc:0.97682
[2000]	train-auc:0.98294	valid-auc:0.97689
[2500]	train-auc:0.98440	valid-auc:0.97692
[3000]	train-auc:0.98575	valid-auc:0.97697
[3306]	train-auc:0.98651	valid-auc:0.97697
#########################
### REPEAT 3, Fold 8 ###
#########################
Target encoding 127 features... 0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 


[0]	train-auc:0.95386	valid-auc:0.95314
[500]	train-auc:0.97695	valid-auc:0.97516
[1000]	train-auc:0.97943	valid-auc:0.97590
[1500]	train-auc:0.98128	valid-auc:0.97611
[2000]	train-auc:0.98290	valid-auc:0.97621
[2500]	train-auc:0.98437	valid-auc:0.97631
[3000]	train-auc:0.98571	valid-auc:0.97637
[3500]	train-auc:0.98695	valid-auc:0.97641
[3735]	train-auc:0.98751	valid-auc:0.97641
#########################
### REPEAT 3, Fold 9 ###
#########################
Target encoding 127 features... 0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 


[0]	train-auc:0.95134	valid-auc:0.95015
[500]	train-auc:0.97693	valid-auc:0.97510
[1000]	train-auc:0.97945	valid-auc:0.97589
[1500]	train-auc:0.98132	valid-auc:0.97610
[2000]	train-auc:0.98292	valid-auc:0.97619
[2500]	train-auc:0.98440	valid-auc:0.97627
[3000]	train-auc:0.98575	valid-auc:0.97632
[3500]	train-auc:0.98700	valid-auc:0.97636
[4000]	train-auc:0.98814	valid-auc:0.97639
[4305]	train-auc:0.98880	valid-auc:0.97639
#########################
### REPEAT 3, Fold 10 ###
#########################
Target encoding 127 features... 0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 


[0]	train-auc:0.95327	valid-auc:0.95365
[500]	train-auc:0.97676	valid-auc:0.97650
[1000]	train-auc:0.97928	valid-auc:0.97712
[1500]	train-auc:0.98115	valid-auc:0.97732
[2000]	train-auc:0.98279	valid-auc:0.97745
[2500]	train-auc:0.98426	valid-auc:0.97752
[3000]	train-auc:0.98561	valid-auc:0.97757
[3500]	train-auc:0.98687	valid-auc:0.97760
[4000]	train-auc:0.98802	valid-auc:0.97762
[4500]	train-auc:0.98908	valid-auc:0.97765
[4825]	train-auc:0.98972	valid-auc:0.97763


In [14]:
from sklearn.metrics import roc_auc_score

m = roc_auc_score(train.y.to_numpy(), oof_preds2)
print(f"XGB (Train More) with Original Data as rows CV = {m}")

XGB (Train More) with Original Data as rows CV = 0.9763730574197561


# Normalize
NNs prefer numerical columns to be approximately Gaussian distributed. Therefore, we first apply a log transform to skewed columns and then standardize by subtracting the mean and dividing by the standard deviation. (The standardization code is included in the NN code below.)

In [15]:
LOG = ['balance','duration','campaign','pdays','previous']

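for c in LOG+CE:
    if c in LOG:
        # shift so the smallest value across train and test is 0 (log1p needs inputs > -1)
        mn = min( (train[c].min(), test[c].min()) )
        train[c] = train[c]-mn
        test[c] = test[c]-mn
    # compress the long right tail
    train[c] = np.log1p( train[c] )
    test[c] = np.log1p( test[c] )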
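
As a rough illustration of the normalization described above (shift to a minimum of 0, `log1p`, then standardize), here is a minimal pure-Python sketch on made-up values. The helper name `log_standardize` and the sample numbers are hypothetical; the notebook itself uses `np.log1p` here and `StandardScaler` inside the NN cell.

```python
import math
import statistics

def log_standardize(values):
    """Shift to a minimum of 0, apply log1p, then subtract mean and divide by std."""
    mn = min(values)
    logged = [math.log1p(v - mn) for v in values]
    mean = statistics.fmean(logged)
    std = statistics.pstdev(logged)  # population std (ddof=0), like StandardScaler
    return [(x - mean) / std for x in logged]

# Hypothetical skewed column with a negative minimum and a long right tail.
raw = [-50.0, 0.0, 10.0, 120.0, 900.0, 15000.0]
z = log_standardize(raw)
print([round(v, 3) for v in z])
```

The transform is monotonic, so the ordering of the rows is unchanged; only the spacing between values is compressed.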

In [16]:
FEATURES = CATS+NUMS+CATS1+CE+TE
TARGET_COL = 'y'
print(f"We have {len( FEATURES )} features.")

We have 295 features.


In [17]:
# =========================
# PyTorch MLP w/ Cat Embeddings + KFold OOF
# =========================
import os, math, random, gc, json
import numpy as np
import pandas as pd
from tqdm import tqdm

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader

from sklearn.model_selection import KFold
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.metrics import roc_auc_score, log_loss, accuracy_score

# -------------------------
# Config
# -------------------------
SEED          = 42
FOLDS         = 10
EPOCHS        = 8
BATCH_SIZE    = 512
LR_MAX        = 3e-3          # peak LR for OneCycle
WD            = 1e-4          # weight decay (AdamW)
EARLY_STOP    = 4             # epochs with no val AUC improvement
GRAD_CLIP     = 1.0
NUM_WORKERS   = 4
MODEL_DIR     = "./mlp_catemb_models"
OOF_PATH      = "./oof_catemb.csv"

os.makedirs(MODEL_DIR, exist_ok=True)

KNOWN_CARDINALITIES = SIZES
df = train[ FEATURES+['y'] ].to_pandas()
df2 = test[ FEATURES+['y'] ].to_pandas()

# -------------------------
# Repro
# -------------------------
def seed_everything(seed=SEED):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # cuDNN autotuning is left on for speed, so GPU runs are seeded but not bit-for-bit reproducible
    torch.backends.cudnn.deterministic = False
    torch.backends.cudnn.benchmark = True

seed_everything(SEED)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("Device:", device)

# -------------------------
# Load/define your dataframe `df`
# -------------------------
# df = pd.read_parquet("your_data.parquet")
# assert TARGET_COL in df.columns

categorical_cols = list( KNOWN_CARDINALITIES.keys() )
numeric_cols = [c for c in df.columns if c not in categorical_cols + [TARGET_COL]]

print(f"#categoricals={len(categorical_cols)}, #numerics={len(numeric_cols)}")

# -------------------------
# Label-encode categoricals (reserve 0=UNK)
# -------------------------
encoders = {}
for col in categorical_cols:
    le = LabelEncoder()
    tmp = pd.concat([ df[[col]],df2[[col]] ],axis=0)
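    # astype(str) turns missing values into a literal string, so a later fillna would be a no-op;
    # the transform calls below use the same astype(str) conversion
    tmp[col] = tmp[col].astype(str)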
    le.fit(tmp[col].values)
    df[col] = le.transform(df[col].astype(str).values) + 1
    df2[col] = le.transform(df2[col].astype(str).values) + 1
    encoders[col] = le
    del tmp

cardinalities = {}
for col in categorical_cols:
    if col in KNOWN_CARDINALITIES:
        cardinalities[col] = KNOWN_CARDINALITIES[col] + 1  # +1 for UNK
        df[col] = np.clip(df[col], 0, KNOWN_CARDINALITIES[col])
        df2[col] = np.clip(df2[col], 0, KNOWN_CARDINALITIES[col])
    else:
        cardinalities[col] = int( max(df[col].max(),df2[col].max()) ) + 1

# -------------------------
# Scale numerics
# -------------------------
scaler = StandardScaler()
df[numeric_cols] = scaler.fit_transform(df[numeric_cols].astype(np.float32))
df2[numeric_cols] = scaler.transform(df2[numeric_cols].astype(np.float32))

# -------------------------
# Embedding dims
# -------------------------
def emb_dim_from_card(n):
    return int(min(50, round(1.6 * (n**0.56))))

emb_info = [(cardinalities[c], emb_dim_from_card(cardinalities[c])) for c in categorical_cols]
total_emb_dim = sum(d for _, d in emb_info)
print("Embedding config:", dict(zip(categorical_cols, emb_info)))
print("Total embedding dim:", total_emb_dim, " + numeric:", len(numeric_cols))

# -------------------------
# Dataset
# -------------------------
class TabDataset(Dataset):
    def __init__(self, df, cat_cols, num_cols, target_col=None, idx=None):
        self.cat_cols = cat_cols
        self.num_cols = num_cols
        self.target_col = target_col
        self.df = df if idx is None else df.iloc[idx]
        self.cats = self.df[self.cat_cols].values.astype(np.int64)
        self.nums = self.df[self.num_cols].values.astype(np.float32)
        self.y = None if self.target_col is None else self.df[self.target_col].values.astype(np.float32)

    def __len__(self):
        return len(self.df)

    def __getitem__(self, i):
        cats = torch.from_numpy(self.cats[i])
        nums = torch.from_numpy(self.nums[i])
        if self.y is None:
            return cats, nums
        return cats, nums, torch.tensor(self.y[i])

# -------------------------
# Model
# -------------------------
class MLPWithCatEmb(nn.Module):
    def __init__(self, emb_info, n_num, hidden=[512, 256, 128], dropout=0.15):
        super().__init__()
        self.emb_layers = nn.ModuleList(
            [nn.Embedding(num_embeddings=card, embedding_dim=dim, padding_idx=0)
             for card, dim in emb_info]
        )
        emb_total = sum(dim for _, dim in emb_info)
        in_dim = emb_total + n_num

        self.bn_nums = nn.BatchNorm1d(n_num) if n_num > 0 else nn.Identity()
        self.dropout = nn.Dropout(dropout)

        layers = []
        last = in_dim
        for h in hidden:
            layers += [
                nn.Linear(last, h),
                nn.BatchNorm1d(h),
                nn.SiLU(),
                nn.Dropout(dropout)
            ]
            last = h
        self.mlp = nn.Sequential(*layers)
        self.head = nn.Linear(last, 1)

        for m in self.modules():
            if isinstance(m, nn.Linear):
                nn.init.kaiming_normal_(m.weight, nonlinearity='relu')
                if m.bias is not None:
                    nn.init.zeros_(m.bias)

    def forward(self, x_cat, x_num):
        emb = [emb_layer(x_cat[:, i]) for i, emb_layer in enumerate(self.emb_layers)]
        x_emb = torch.cat(emb, dim=1) if emb else None
        if x_num is not None and x_num.shape[1] > 0:
            x_num = self.bn_nums(x_num)
            x = torch.cat([x_emb, x_num], dim=1) if x_emb is not None else x_num
        else:
            x = x_emb
        x = self.mlp(x)
        logit = self.head(x).squeeze(1)
        return logit

# -------------------------
# Train / Eval helpers
# -------------------------
def train_one_epoch(model, loader, optimizer, scaler, scheduler=None):
    model.train()
    running = 0.0
    for cats, nums, y in loader:
        cats = cats.to(device, non_blocking=True)
        nums = nums.to(device, non_blocking=True)
        y = y.to(device, non_blocking=True)

        optimizer.zero_grad(set_to_none=True)
        with torch.amp.autocast('cuda', enabled=True):
            logits = model(cats, nums)
            loss = F.binary_cross_entropy_with_logits(logits, y)
        scaler.scale(loss).backward()
        if GRAD_CLIP is not None:
            scaler.unscale_(optimizer)
            torch.nn.utils.clip_grad_norm_(model.parameters(), GRAD_CLIP)
        scaler.step(optimizer)
        scaler.update()
        if scheduler is not None:
            scheduler.step()

        running += loss.detach().item() * y.size(0)
    return running / len(loader.dataset)

@torch.no_grad()
def validate(model, loader):
    model.eval()
    all_logits, all_y = [], []
    for batch in loader:
        if len(batch) == 3:
            cats, nums, y = batch
            all_y.append(y.detach().cpu())
        else:
            cats, nums = batch
        cats = cats.to(device, non_blocking=True)
        nums = nums.to(device, non_blocking=True)
        all_logits.append(model(cats, nums).detach().cpu())

    logits = torch.cat(all_logits).numpy()
    probs  = 1.0 / (1.0 + np.exp(-logits))
    probs_c = np.clip(probs, 1e-7, 1 - 1e-7)  # clip instead of log_loss(..., eps=...)

    if all_y:
        y_true = torch.cat(all_y).numpy().astype(np.int64)
        auc = roc_auc_score(y_true, probs)
        ll  = log_loss(y_true, probs_c)
        acc = accuracy_score(y_true, (probs >= 0.5).astype(int))
        return probs, {"auc": auc, "logloss": ll, "acc": acc}
    return probs, {}

oof = np.zeros(len(df), dtype=np.float32)
preds = np.zeros(len(df2), dtype=np.float32)

REPEATS = 5
for kk in range(REPEATS):
    print(f"##### REPEAT {kk+1} of {REPEATS} #####")
    seed_everything(SEED+kk)

    # -------------------------
    # KFold training
    # -------------------------
    # Same split in every repeat (random_state=SEED); only the model seed changes
    skf = KFold(n_splits=FOLDS, shuffle=True, random_state=SEED)
    
    y = df[TARGET_COL].astype(int).values
    oof2 = np.zeros(len(df), dtype=np.float32)
    fold_metrics = []
    
    for fold, (trn_idx, val_idx) in enumerate(skf.split(df, y), start=1):
        print(f"\n========== Fold {fold}/{FOLDS} ==========")
        trn_ds = TabDataset(df, categorical_cols, numeric_cols, TARGET_COL, trn_idx)
        val_ds = TabDataset(df, categorical_cols, numeric_cols, TARGET_COL, val_idx)
        test_ds = TabDataset(df2, categorical_cols, numeric_cols, TARGET_COL, np.arange(len(df2)) )
    
        trn_loader = DataLoader(
            trn_ds, batch_size=BATCH_SIZE, shuffle=True,
            num_workers=NUM_WORKERS, pin_memory=True, drop_last=True
        )
        val_loader = DataLoader(
            val_ds, batch_size=BATCH_SIZE, shuffle=False,
            num_workers=NUM_WORKERS, pin_memory=True
        )
        test_loader = DataLoader(
            test_ds, batch_size=BATCH_SIZE, shuffle=False,
            num_workers=NUM_WORKERS, pin_memory=True
        )
    
        model = MLPWithCatEmb(emb_info=emb_info, n_num=len(numeric_cols)).to(device)
        optimizer = torch.optim.AdamW(model.parameters(), lr=LR_MAX, weight_decay=WD)
    
        total_steps = EPOCHS * len(trn_loader)
        scheduler = torch.optim.lr_scheduler.OneCycleLR(
            optimizer, max_lr=LR_MAX, total_steps=total_steps,
            pct_start=0.0, div_factor=25.0, final_div_factor=10.5, anneal_strategy='cos'
        )
    
        scaler = torch.amp.GradScaler('cuda', enabled=True)
    
        best_auc = -1.0
        best_epoch = -1
        epochs_no_improve = 0
        best_path = os.path.join(MODEL_DIR, f"fold{fold}.pt")
    
        for epoch in range(1, EPOCHS+1):
            train_loss = train_one_epoch(model, trn_loader, optimizer, scaler, scheduler)
            _, val_stats = validate(model, val_loader)
            auc = val_stats.get("auc", float("nan"))
            print(f"Epoch {epoch:02d}: train_loss={train_loss:.4f} | "
                  f"val_auc={auc:.5f} val_logloss={val_stats['logloss']:.5f} val_acc={val_stats['acc']:.4f}")
    
            if True:  # `auc > best_auc` check disabled: checkpoint every epoch, so the else branch never runs
                best_auc = auc
                best_epoch = epoch
                epochs_no_improve = 0
                torch.save({
                    "model_state": model.state_dict(),
                    "config": {
                        "emb_info": emb_info,
                        "numeric_cols": numeric_cols,
                        "categorical_cols": categorical_cols
                    }
                }, best_path)
            else:
                epochs_no_improve += 1
                if epochs_no_improve >= EARLY_STOP:
                    print(f"Early stopping at epoch {epoch}. Best AUC {best_auc:.5f} @ epoch {best_epoch}")
                    break
    
        # Load the saved checkpoint (the final epoch, since every epoch overwrites it)
        ckpt = torch.load(best_path, map_location="cpu", weights_only=False)
        model.load_state_dict(ckpt["model_state"])
        model.to(device)
    
        # OOF for this fold
        val_probs, val_stats = validate(model, val_loader)
        oof[val_idx] += val_probs / REPEATS
        oof2[val_idx] = val_probs
        fold_metrics.append({"fold": fold, **val_stats})
        print(f"[Fold {fold}] AUC={val_stats['auc']:.5f}  LogLoss={val_stats['logloss']:.5f}  Acc={val_stats['acc']:.4f}")
    
        test_probs, _ = validate(model, test_loader)
        preds += test_probs / FOLDS / REPEATS
    
    # -------------------------
    # Overall OOF metrics
    # -------------------------
    oof_c = np.clip(oof2, 1e-7, 1 - 1e-7)
    oof_auc = roc_auc_score(y, oof2)
    oof_ll  = log_loss(y, oof_c)     # no eps kwarg
    oof_acc = accuracy_score(y, (oof2>=0.5).astype(int))
    print("\n========== Overall OOF ==========")
    print(f"OOF AUC={oof_auc:.5f}  LogLoss={oof_ll:.5f}  Acc={oof_acc:.4f}")
    
    # Save OOF and metrics
    #pd.DataFrame({
    #    "oof_pred": oof,
    #    TARGET_COL: y
    #}).to_csv(OOF_PATH, index=False)
    
    #with open(os.path.join(MODEL_DIR, "fold_metrics.json"), "w") as f:
    #    json.dump({"folds": fold_metrics, "oof": {"auc": oof_auc, "logloss": oof_ll, "acc": oof_acc}}, f, indent=2)
    
    #print(f"Saved OOF -> {OOF_PATH}")
    #print(f"Saved models -> {MODEL_DIR}")

Device: cuda
#categoricals=16, #numerics=279
Embedding config: {'age2': (79, 18), 'balance2': (8591, 50), 'day2': (32, 11), 'duration2': (1825, 50), 'campaign2': (53, 15), 'pdays2': (629, 50), 'previous2': (55, 15), 'job': (13, 7), 'marital': (4, 3), 'education': (5, 4), 'default': (3, 3), 'housing': (3, 3), 'loan': (3, 3), 'contact': (4, 3), 'month': (13, 7), 'poutcome': (5, 4)}
Total embedding dim: 246  + numeric: 279
##### REPEAT 1 of 5 #####

Epoch 01: train_loss=0.1501 | val_auc=0.97226 val_logloss=0.13613 val_acc=0.9412
Epoch 02: train_loss=0.1322 | val_auc=0.97397 val_logloss=0.13145 val_acc=0.9434
Epoch 03: train_loss=0.1258 | val_auc=0.97423 val_logloss=0.13155 val_acc=0.9434
Epoch 04: train_loss=0.1213 | val_auc=0.97391 val_logloss=0.13141 val_acc=0.9440
Epoch 05: train_loss=0.1163 | val_auc=0.97362 val_logloss=0.13324 val_acc=0.9434
Epoch 06: train_loss=0.1110 | val_auc=0.97304 val_logloss=0.13518 val_acc=0.9432
Epoch 07: train_loss=0.1065 | val_auc=0.97223 val_logloss=0.139
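
The `Embedding config` printed above can be re-derived from the sizing rule in `emb_dim_from_card`. A minimal check, with the cardinalities copied by hand from that printed line (they already include the +1 slot for UNK):

```python
def emb_dim_from_card(n):
    # Same rule as in the NN cell: grows like n**0.56, capped at 50.
    return int(min(50, round(1.6 * (n ** 0.56))))

# Cardinalities copied from the printed "Embedding config" line.
cards = {
    'age2': 79, 'balance2': 8591, 'day2': 32, 'duration2': 1825,
    'campaign2': 53, 'pdays2': 629, 'previous2': 55, 'job': 13,
    'marital': 4, 'education': 5, 'default': 3, 'housing': 3,
    'loan': 3, 'contact': 4, 'month': 13, 'poutcome': 5,
}

dims = {c: emb_dim_from_card(n) for c, n in cards.items()}
total = sum(dims.values())
print(dims)
print("Total embedding dim:", total)
```

The cap of 50 only affects the three high-cardinality columns (`balance2`, `duration2`, `pdays2`); every other column gets a dimension well below it.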

In [18]:
m = roc_auc_score(train.y.to_numpy(), oof)
print(f"NN (Train More) CV = {m}")

NN (Train More) CV = 0.9738956080062735
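
This CV score is computed on `oof`, which is created once outside the repeat loop and accumulates `val_probs / REPEATS`, while `oof2` is reset in every repeat. A minimal sketch of that accumulation pattern follows. The function `fake_model_predict`, the toy sizes, and the fold construction are hypothetical stand-ins, not the notebook's training code.

```python
import random

N_ROWS, FOLDS, REPEATS = 12, 3, 4

def fake_model_predict(rows, seed):
    """Hypothetical stand-in for a trained fold model: returns noisy scores."""
    rng = random.Random(seed)
    return [0.5 + rng.uniform(-0.1, 0.1) for _ in rows]

# Same fold assignment for every repeat (the notebook fixes random_state for KFold).
idx = list(range(N_ROWS))
random.Random(42).shuffle(idx)
folds = [idx[f::FOLDS] for f in range(FOLDS)]

oof = [0.0] * N_ROWS            # created once, OUTSIDE the repeat loop
per_repeat = []
for kk in range(REPEATS):
    oof2 = [0.0] * N_ROWS       # per-repeat buffer, reset in each repeat
    for f, val_idx in enumerate(folds):
        probs = fake_model_predict(val_idx, seed=1000 * kk + f)
        for i, p in zip(val_idx, probs):
            oof[i] += p / REPEATS   # running average across repeats
            oof2[i] = p
    per_repeat.append(oof2)

# The accumulated OOF equals the mean of the per-repeat OOF vectors.
mean_of_repeats = [sum(r[i] for r in per_repeat) / REPEATS for i in range(N_ROWS)]
print(max(abs(a - b) for a, b in zip(oof, mean_of_repeats)))
```

If `oof` were re-created inside the repeat loop instead, only the last repeat (scaled by `1 / REPEATS`) would survive, which is the kind of bug the note at the top of this notebook describes.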


# Create NN (Train More) Submission CSV

In [19]:
sub = pd.read_csv(f"{PATH}sample_submission.csv")
sub['y'] = preds
sub.to_csv("submission_nn_final.csv",index=False)
np.save("oof_nn_train_more",oof)
print('Submission shape',sub.shape)
#sub.head()

Submission shape (250000, 2)


# Create XGB (Train More) Submission CSV

In [20]:
sub = pd.read_csv(f"{PATH}sample_submission.csv")
# average the two XGB models (original data as rows and original data as columns)
preds_xgb = (test_preds+test_preds2)/2.
sub['y'] = preds_xgb
sub.to_csv("submission_xgb_final.csv",index=False)
np.save("oof_xgb_rows_train_more",oof_preds)
np.save("oof_xgb_cols_train_more",oof_preds2)
print('Submission shape',sub.shape)
#sub.head()

Submission shape (250000, 2)


# Ensemble - XGB and NN (Train More) - CV Score

In [21]:
oof_xgb = (oof_preds+oof_preds2)/2.
m = roc_auc_score(train.y.to_numpy(), oof_xgb)
print(f"Both XGB rows and XGB cols (Train More) CV = {m}")

Both XGB rows and XGB cols (Train More) CV = 0.9764472486174771


In [22]:
best_m = 0
best_w = 0
for w in np.arange(0,1.01,0.01):
    oof_ensemble = (1-w)*oof_xgb + w*oof
    m = roc_auc_score(train.y.to_numpy(), oof_ensemble)
    if m>best_m:
        best_m = m
        best_w = w
        
oof_ensemble = (1-best_w)*oof_xgb + best_w*oof
m = roc_auc_score(train.y.to_numpy(), oof_ensemble)
print(f"Ensemble XGB and NN (Train More) CV = {m}")
print(f" using best NN weight = {best_w}")

Ensemble XGB and NN (Train More) CV = 0.9767024498632033
 using best NN weight = 0.24
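
The weight search above is a one-dimensional grid search over the blend `(1-w)*oof_xgb + w*oof`. A self-contained sketch of the same idea on toy data is below. The helper names `auc` and `best_blend_weight` and the synthetic arrays are hypothetical, and the hand-rolled pairwise AUC only replaces `roc_auc_score` to keep the example dependency-free.

```python
import random

def auc(y_true, scores):
    """Pairwise AUC: share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def best_blend_weight(y_true, oof_a, oof_b, steps=100):
    """Grid-search w in [0, 1] that maximizes the AUC of (1-w)*oof_a + w*oof_b."""
    best_m, best_w = 0.0, 0.0
    for k in range(steps + 1):
        w = k / steps
        blend = [(1 - w) * a + w * b for a, b in zip(oof_a, oof_b)]
        m = auc(y_true, blend)
        if m > best_m:
            best_m, best_w = m, w
    return best_w, best_m

# Hypothetical toy data: two noisy views of the same binary target.
rng = random.Random(0)
y = [rng.randint(0, 1) for _ in range(300)]
oof_a = [t + rng.gauss(0, 0.8) for t in y]
oof_b = [t + rng.gauss(0, 1.0) for t in y]

w, m = best_blend_weight(y, oof_a, oof_b)
print(f"best weight on model B = {w:.2f}, blended AUC = {m:.4f}")
```

Because the grid includes `w = 0` and `w = 1`, the blended OOF score can never be lower than either model alone. The pairwise `auc` here scales with positives times negatives, so it is only suitable for toy sizes.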


# Create - XGB and NN (Train More) - Submission CSV

In [23]:
xgb_preds = preds_xgb 
sub = pd.read_csv(f"{PATH}sample_submission.csv")
sub['y'] = (1-best_w)*xgb_preds + best_w*preds
sub.to_csv("submission_ensemble_final.csv",index=False)
print('Submission shape',sub.shape)
sub.head()

Submission shape (250000, 2)


| | id | y |
|---|---|---|
| 0 | 750000 | 0.000717 |
| 1 | 750001 | 0.073897 |
| 2 | 750002 | 0.000165 |
| 3 | 750003 | 3.8e-05 |
| 4 | 750004 | 0.005686 |
