![](https://lightgbm.readthedocs.io/en/latest/_images/LightGBM_logo_black_text.svg)

# Summary

With only 5 days left in the competition, I am sharing my best models so that you can incorporate them into your stacks or improve them further as standalone models. Because the public subset is only 20% of the entire test set, I find it essential to train models that are likely to do well on unseen data. I therefore set conservative hyperparameters (more on this in the Model training section), trained each of the 15 models on 20 different seeds, and averaged the results. Typically, these models scored 0.85647 on the public test set. Adding the last 5 to my existing stack still improved the result, but not as much as the first 10.

# Dataset

You can find the 15 out-of-fold and test predictions [here](https://www.kaggle.com/adamwurdits/tps-10-2021-lightgbm-predictions). Individual model performances are listed in the dataset description.

# Importing libraries

In [None]:
import numpy as np
import pandas as pd
from sklearn import preprocessing
from sklearn import model_selection
from sklearn.metrics import roc_auc_score
import optuna
# from optuna.integration import LightGBMPruningCallback
from lightgbm import LGBMClassifier
import gc

# Loading the data

In [None]:
df_train = pd.read_csv('../input/tabular-playground-series-oct-2021/train.csv')
df_test = pd.read_csv('../input/tabular-playground-series-oct-2021/test.csv')
sample_submission = pd.read_csv('../input/tabular-playground-series-oct-2021/sample_submission.csv')

# Preprocessing

I tried adding various features and experimented with different scalers in earlier versions (15-19). Adding two row-wise sums, one over the binary features and one over the continuous features, and using a standard scaler gave me the best results. I didn't try transforming the numbers in any other way.

In [None]:
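# Feature columns are the ones whose names start with 'f'; this skips 'id'
features = [c for c in df_test.columns if c.startswith('f')]
# Split by dtype: integer columns are the binary flags, float columns are continuous
bin_features = [c for c in features if df_test[c].dtype == 'int64']
cont_features = [c for c in features if df_test[c].dtype == 'float64']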

df_train['bin_count'] = df_train[bin_features].sum(axis=1)
df_test['bin_count'] = df_test[bin_features].sum(axis=1)
features.append('bin_count')

df_train['cont_sum'] = df_train[cont_features].sum(axis=1)
df_test['cont_sum'] = df_test[cont_features].sum(axis=1)
features.append('cont_sum')

scaler = preprocessing.StandardScaler()
df_train[features] = scaler.fit_transform(df_train[features])
df_test[features] = scaler.transform(df_test[features])

# Creating folds

I used stratified k-fold cross-validation with 5 splits. Five folds keep training times short, which lets me train more models on more seeds.

In [None]:
df_train['kfold'] = -1

y_train = df_train.target
X_train = df_train.drop('target', axis=1)

skf = model_selection.StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for fold, (i_train, i_valid) in enumerate(skf.split(X_train, y_train)):
    df_train.loc[i_valid, 'kfold'] = fold
    
del X_train, y_train
gc.collect()
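A quick sanity check on the fold assignment can catch mistakes early. The sketch below uses a small hypothetical frame (`toy`, not the competition data) to show what stratification should give: every row gets a fold, and each fold has the same target rate as the whole set.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold

# Toy frame with a 25% positive rate (hypothetical data, not the competition set)
toy = pd.DataFrame({'x': np.arange(200), 'target': np.tile([0, 0, 0, 1], 50)})
toy['kfold'] = -1

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for fold, (_, i_valid) in enumerate(skf.split(toy, toy['target'])):
    toy.loc[i_valid, 'kfold'] = fold

# Every row should be assigned, and each fold should mirror the overall target rate
fold_summary = toy.groupby('kfold')['target'].agg(['size', 'mean'])
print(fold_summary)
```

Running the same `groupby` on `df_train` should show five folds of near-equal size with near-identical target means.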

# Hyperparameter optimization with Optuna

For the first 10 days of the competition I focused on finding the ideal k for KFold, making the most of feature engineering and scaling, and getting a ballpark idea of the hyperparameters to use. Later on, I ran Optuna studies in batches of 12 trials. For the most part these notebooks finished in less than 6 hours, which was ideal for overnight work. Ranges with identical bounds in the code below, such as num_leaves, pin that parameter to a single value.

In [None]:
# seed = 0

# def objective(trial):
#     fold = 0
#     params = {
#         'num_leaves': trial.suggest_int('num_leaves', 16, 16),
#         'min_data_in_leaf': trial.suggest_int('min_data_in_leaf', 2000, 6700),
#         'max_depth': trial.suggest_int('max_depth', 0, 0),
#         'max_bin': trial.suggest_int('max_bin', 200, 400),
#         'learning_rate': trial.suggest_float('learning_rate', 0.0071, 0.0076),
#         'lambda_l1': trial.suggest_loguniform('lambda_l1', 0.00001, 8),
#         'lambda_l2': trial.suggest_loguniform('lambda_l2', 0.00001, 100),
#         'min_gain_to_split': trial.suggest_float('min_gain_to_split', 0, 4),
#         'feature_fraction': trial.suggest_float('feature_fraction', 0.22, 0.35),
#         'bagging_fraction': trial.suggest_float('bagging_fraction', 0.49, 0.52),
#         'bagging_freq': trial.suggest_int('bagging_freq', 1, 1)        
#     }

#     X_train = df_train[df_train.kfold != fold].reset_index(drop=True)
#     X_valid = df_train[df_train.kfold == fold].reset_index(drop=True)
        
#     y_train = X_train.target
#     y_valid = X_valid.target
    
#     X_train = X_train[features]
#     X_valid = X_valid[features]
    
#     model = LGBMClassifier(
#             objective='binary',
#             tree_learner='serial',
#             seed=seed,
#             n_estimators=20000,
#             **params)
    
#     model.fit(X_train,
#               y_train,
#               early_stopping_rounds=500,
#               eval_set=[(X_valid, y_valid)],
#               eval_metric='auc',
# #               callbacks=[LightGBMPruningCallback(trial, 'auc')],
#               verbose=1000)
    
#     valid_pred = model.predict_proba(X_valid)[:,1]
        
#     auc = roc_auc_score(y_valid, valid_pred)
    
#     del X_train, X_valid, y_train, y_valid
#     gc.collect()
    
#     return auc

# study = optuna.create_study(direction="maximize")
# study.optimize(objective, n_trials=12)

# Model training

I wanted conservative models, so I set num_leaves to 16 (about half the default of 31) and used low learning rates. With max_depth set to 0, LightGBM applies no depth limit, so tree size is bounded by num_leaves and min_data_in_leaf instead. Each tree was trained on roughly half of the samples and only a quarter of the features. The binning and regularization settings varied widely from model to model.

I set up 10 notebooks, each training the same models on two seeds (one from 0 to 9 and one from 10 to 19), as this helped keep everything organized.

In [None]:
%%time

m = 15
s = 0

valid_preds = {}
test_preds = []
scores = []

for fold in range(5):
    X_train = df_train[df_train.kfold != fold].reset_index(drop=True)
    X_valid = df_train[df_train.kfold == fold].reset_index(drop=True)

    X_test = df_test[features].copy()

    valid_ids = X_valid.id.values.tolist()

    y_train = X_train.target
    y_valid = X_valid.target

    X_train = X_train[features]
    X_valid = X_valid[features]

    params = {'num_leaves': 16,
              'min_data_in_leaf': 4980,
              'max_depth': 0,
              'max_bin': 399,
              'learning_rate': 0.007443215095336714,
              'lambda_l1': 1.6181770821331433e-05,
              'lambda_l2': 1.696656500639349e-05,
              'min_gain_to_split': 1.335756684660189,
              'feature_fraction': 0.2550512769849608,
              'bagging_fraction': 0.5143697950295731,
              'bagging_freq': 1}

    model = LGBMClassifier(
        objective='binary',
        importance_type='split',
        boosting_type='gbdt',
        tree_learner='serial',
        num_threads=-1,
        random_state=s,
        n_estimators=20000,
        **params)

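    # Note: fit() accepts early_stopping_rounds and verbose in LightGBM 3.x.
    # LightGBM 4.0 removed them in favor of callbacks
    # (lightgbm.early_stopping and lightgbm.log_evaluation).
    model.fit(X_train,
              y_train,
              early_stopping_rounds=500,
              eval_set=[(X_valid, y_valid)],
              eval_metric='auc',
              verbose=1000)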

    valid_pred = model.predict_proba(X_valid)[:,1]
    test_pred = model.predict_proba(X_test)[:,1]

    valid_preds.update(dict(zip(valid_ids, valid_pred)))
    test_preds.append(test_pred)

    score = roc_auc_score(y_valid, valid_pred)    
    scores.append(score)

print(f'Mean auc {np.mean(scores)}, std {np.std(scores)}')

valid_preds = pd.DataFrame.from_dict(valid_preds, orient='index').reset_index()
valid_preds.columns = ['id', f'm{m}s{s}_pred']
valid_preds.to_csv(f'm{m}s{s}_valid_pred.csv', index=False)

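# Build a fresh frame instead of renaming sample_submission's columns,
# so its 'target' column stays available for later cells.
test_out = pd.DataFrame({
    'id': sample_submission['id'],
    f'm{m}s{s}_pred': np.mean(np.column_stack(test_preds), axis=1),
})
test_out.to_csv(f'm{m}s{s}_test_pred.csv', index=False)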

This is the same model as the one above, except it is trained on another seed. In this notebook I trained on seeds 0 and 10; in the others I trained on seeds 1 and 11, 2 and 12, and so on.

In [None]:
%%time

s = 10

valid_preds = {}
test_preds = []
scores = []

for fold in range(5):
    X_train = df_train[df_train.kfold != fold].reset_index(drop=True)
    X_valid = df_train[df_train.kfold == fold].reset_index(drop=True)

    X_test = df_test[features].copy()

    valid_ids = X_valid.id.values.tolist()

    y_train = X_train.target
    y_valid = X_valid.target

    X_train = X_train[features]
    X_valid = X_valid[features]

    params = {'num_leaves': 16,
              'min_data_in_leaf': 4980,
              'max_depth': 0,
              'max_bin': 399,
              'learning_rate': 0.007443215095336714,
              'lambda_l1': 1.6181770821331433e-05,
              'lambda_l2': 1.696656500639349e-05,
              'min_gain_to_split': 1.335756684660189,
              'feature_fraction': 0.2550512769849608,
              'bagging_fraction': 0.5143697950295731,
              'bagging_freq': 1}

    model = LGBMClassifier(
        objective='binary',
        importance_type='split',
        boosting_type='gbdt',
        tree_learner='serial',
        num_threads=-1,
        random_state=s,
        n_estimators=20000,
        **params)

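    # Note: fit() accepts early_stopping_rounds and verbose in LightGBM 3.x.
    # LightGBM 4.0 removed them in favor of callbacks
    # (lightgbm.early_stopping and lightgbm.log_evaluation).
    model.fit(X_train,
              y_train,
              early_stopping_rounds=500,
              eval_set=[(X_valid, y_valid)],
              eval_metric='auc',
              verbose=1000)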

    valid_pred = model.predict_proba(X_valid)[:,1]
    test_pred = model.predict_proba(X_test)[:,1]

    valid_preds.update(dict(zip(valid_ids, valid_pred)))
    test_preds.append(test_pred)

    score = roc_auc_score(y_valid, valid_pred)    
    scores.append(score)

print(f'Mean auc {np.mean(scores)}, std {np.std(scores)}')

valid_preds = pd.DataFrame.from_dict(valid_preds, orient='index').reset_index()
valid_preds.columns = ['id', f'm{m}s{s}_pred']
valid_preds.to_csv(f'm{m}s{s}_valid_pred.csv', index=False)

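# Build a fresh frame rather than assigning to sample_submission.target:
# once that column has been renamed, attribute assignment no longer writes to it
# and the saved file would silently keep the older predictions.
test_out = pd.DataFrame({
    'id': sample_submission['id'],
    f'm{m}s{s}_pred': np.mean(np.column_stack(test_preds), axis=1),
})
test_out.to_csv(f'm{m}s{s}_test_pred.csv', index=False)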
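The per-seed files written by the cells above can then be averaged into one prediction per model. Here is a minimal sketch, assuming the files follow the `m{m}s{s}_test_pred.csv` naming used in this notebook. The helper `average_seeds` is hypothetical, and the tiny in-memory frames stand in for the real files.

```python
import pandas as pd

def average_seeds(frames):
    """Average the prediction columns of per-seed frames that share an 'id' column."""
    merged = frames[0]
    for frame in frames[1:]:
        # Join on id so rows stay aligned even if the files are ordered differently
        merged = merged.merge(frame, on='id', how='inner', validate='one_to_one')
    pred_cols = [c for c in merged.columns if c != 'id']
    return pd.DataFrame({'id': merged['id'], 'pred': merged[pred_cols].mean(axis=1)})

# Toy stand-ins for pd.read_csv('m15s0_test_pred.csv') and pd.read_csv('m15s10_test_pred.csv')
seed_0 = pd.DataFrame({'id': [1, 2, 3], 'm15s0_pred': [0.2, 0.6, 0.9]})
seed_10 = pd.DataFrame({'id': [3, 1, 2], 'm15s10_pred': [0.7, 0.4, 0.8]})

averaged = average_seeds([seed_0, seed_10])
print(averaged)
```

The same helper works for the out-of-fold files, since they also carry an `id` column next to a single prediction column.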

# Considerations for stacking

Adding similar boosting algorithms to my stack didn't improve my CV or LB scores. Adding non-boosting models, however, even ones with much worse scores, improved my LB score by around 0.00010. Interestingly, tuning these non-boosting models decreased my overall scores again.
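One possible explanation for a weaker model helping is that its errors are less correlated with those of the boosting models. The sketch below illustrates that effect on synthetic data only. It is an assumption about a typical stacking setup, not necessarily the stack used here, and all names (`oof_boost`, `oof_other`, the logistic regression meta-model) are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)
n = 5000
y = rng.integers(0, 2, size=n)

# Synthetic out-of-fold predictions: two noisy views of the target with independent errors
signal = y.astype(float)
oof_boost = signal + rng.normal(0, 1.0, size=n)  # stronger model
oof_other = signal + rng.normal(0, 2.0, size=n)  # weaker model, different errors

X_meta = np.column_stack([oof_boost, oof_other])

# Cross-validated meta-predictions, so the stack is not scored on rows it was fit on
meta = LogisticRegression()
stack_pred = cross_val_predict(meta, X_meta, y, cv=5, method='predict_proba')[:, 1]

auc_boost = roc_auc_score(y, oof_boost)
auc_other = roc_auc_score(y, oof_other)
auc_stack = roc_auc_score(y, stack_pred)
print(f'boost {auc_boost:.4f}, other {auc_other:.4f}, stack {auc_stack:.4f}')
```

With real predictions, `X_meta` would be built from the out-of-fold columns in the dataset linked above, one column per model.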

Thank you for reading my notebook! Let me know if these models helped you in any way or if you have suggestions for improving them. I am still experimenting with different combinations and hoping to increase my final result to around 0.85665.