## Tabular Playground Series March 2021

<img src="https://i.imgur.com/uHVJtv0.png">



<br><br>

### Notebook Contents:

0. [**Imports, Data Loading and Preprocessing**](#loading)

1. [**Optuna Hyperparameter Optimization**](#optuna)

2. [**Submission**](#submission)

In [3]:
pip install optuna

Collecting optuna
  Downloading optuna-2.6.0-py3-none-any.whl (293 kB)
[K     |████████████████████████████████| 293 kB 4.5 MB/s eta 0:00:01
Collecting cliff
  Downloading cliff-3.7.0-py3-none-any.whl (80 kB)
[K     |████████████████████████████████| 80 kB 10.7 MB/s eta 0:00:01
[?25hCollecting alembic
  Downloading alembic-1.5.7-py2.py3-none-any.whl (159 kB)
[K     |████████████████████████████████| 159 kB 15.2 MB/s eta 0:00:01
[?25hCollecting sqlalchemy>=1.1.0
  Downloading SQLAlchemy-1.4.2-cp38-cp38-macosx_10_14_x86_64.whl (1.4 MB)
[K     |████████████████████████████████| 1.4 MB 24.2 MB/s eta 0:00:01
[?25hCollecting colorlog
  Downloading colorlog-4.8.0-py2.py3-none-any.whl (10 kB)
Collecting cmaes>=0.8.2
  Downloading cmaes-0.8.2-py3-none-any.whl (15 kB)
Collecting greenlet!=0.4.17
  Downloading greenlet-1.0.0-cp38-cp38-macosx_10_14_x86_64.whl (86 kB)
[K     |████████████████████████████████| 86 kB 9.6 MB/s  eta 0:00:01
Collecting Mako
  Downloading Mako-1.1.4-py2.py3-none-

<a id="loading"></a>

##### 0. Imports, Data Loading and Preprocessing

In [9]:
import numpy as np
import pandas as pd
pd.options.display.max_columns = 100
from lightgbm import LGBMClassifier
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import QuantileTransformer, StandardScaler
from sklearn.feature_selection import VarianceThreshold, SelectKBest
import warnings
warnings.filterwarnings('ignore')
import optuna
import os
root_path = '../../input'

In [10]:
train = pd.read_csv(os.path.join(root_path, 'train.csv'))
test = pd.read_csv(os.path.join(root_path, 'test.csv'))
sample_submission = pd.read_csv(os.path.join(root_path, 'sample_submission.csv'))

categorical_cols = [i for i in train.columns if 'cat' in i]

dataset = pd.concat([train, test], axis = 0, ignore_index = True)
train_len = len(train)

In [11]:
float_cols = list(set(dataset.select_dtypes(['float']).columns.tolist()) - set(['target']))

In [12]:
for col in float_cols:
    transformer = QuantileTransformer(n_quantiles=100, 
                                      random_state=0, output_distribution="normal")   # from optimal commit 9
    data_len = len(dataset)
    raw_vec = dataset[col].values.reshape(data_len, 1)
    transformer.fit(raw_vec)
    dataset[col+"_qt"] = transformer.transform(raw_vec)

In [13]:
dataset = pd.get_dummies(dataset, columns = categorical_cols)
train_preprocessed = dataset.iloc[:train_len, ]
test_preprocessed = dataset.iloc[train_len:, ]

test_cols_always_0 = (test_preprocessed.drop('target',1).sum(axis = 0)
                      .rename("n_non_null").to_frame().query("n_non_null == 0").index.tolist())

features = list(set(train_preprocessed.drop(['id', 'target'], 1).columns.tolist()) - set(test_cols_always_0))

assert train_preprocessed.shape[1] == test_preprocessed.shape[1]

**Disclaimer:** 

I did not inspect whether some of the categorical columns in train have values not present in test or viceversa. A simple solution would be to drop them in train if you knew it beforehand.

<a id="optuna"></a>

### Optuna

Look [here](https://optuna.readthedocs.io/en/stable/tutorial/) for reference about Optuna library. 

Look [here](https://lightgbm.readthedocs.io/en/latest/pythonapi/lightgbm.LGBMClassifier.html) for a set of Lightgbm Classifier hyperparameters.


Skip and go [here](#hyperparams) to find my best parameters.

In [14]:
#Set to False if you want to skip it

OPTUNA_OPTIMIZATION = True

In [15]:
N_SPLITS = 10 #Number of folds for validation
N_TRIALS = 50 #Number of trials to find best hyperparameters

In [16]:
def objective(trial, cv=StratifiedKFold(N_SPLITS, shuffle = True, random_state = 29)):
    
    
    param_lgb = {
        "random_state": trial.suggest_int("random_state", 1, 100),
        "objective": "binary",
        "metric": "binary_logloss",
        "verbosity": -1,
        "boosting_type": "gbdt",
        "reg_alpha": trial.suggest_float("reg_alpha", 1e-8, 10.0, log=True),
        "reg_lambda": trial.suggest_float("reg_lambda", 1e-8, 10.0, log=True),
        "max_depth": trial.suggest_int("max_depth", -1, 10),
        "n_estimators": trial.suggest_int("n_estimators", 50, 500),
        "num_leaves": trial.suggest_int("num_leaves", 2, 256),
        "colsample_bytree": trial.suggest_float("colsample_bytree", 0.4, 1.0),
        "subsample": trial.suggest_float("subsample", 0.4, 1.0),
        "subsample_freq": trial.suggest_int("subsample_freq", 1, 7),
        "min_child_samples": trial.suggest_int("min_child_samples", 5, 100),
    }
    
    model = LGBMClassifier(**param_lgb)
    
    val_aucs = []
    aucs = []
    
    for kfold, (train_idx, val_idx) in enumerate(cv.split(train_preprocessed[features].values, train['target'].values)):
        
        model.fit(train_preprocessed.loc[train_idx, features], train_preprocessed.loc[train_idx, 'target'])
        print('Fitted {}'.format(type(model).__name__))
        val_true = train.loc[val_idx, 'target'].values
        
        preds = model.predict(train_preprocessed.loc[val_idx, features])
        
        auc = roc_auc_score(val_true, preds)
        
        print('Fold: {}\t AUC: {}\n'.format(kfold, auc))
        aucs.append(auc)
    
    print('Average AUC: {}'.format(np.average(auc)))
    return np.average(aucs)

In [None]:
if OPTUNA_OPTIMIZATION:
    study = optuna.create_study(study_name = 'lgbm_parameter_opt', direction="maximize")
    study.optimize(objective, n_trials=N_TRIALS) 
    
    trial = study.best_trial
    
    print("  Value: {}".format(trial.value))
    
    print("  Params: ")
    for key, value in trial.params.items():
        print("    {}: {}".format(key, value))
else:
    trial = {"reg_alpha": 0.362136938773081,
             "reg_lambda": 2.930297242488071,
             "max_depth": 10,
             "n_estimators": 306,
             "num_leaves": 71,
             "colsample_bytree": 0.7121396258381646,
             "subsample": 0.793959734582999,
             "subsample_freq": 2,
             "min_child_samples": 18}

[32m[I 2021-03-22 14:02:51,857][0m A new study created in memory with name: lgbm_parameter_opt[0m


Best Params: 
    
    reg_alpha: 0.362136938773081
    reg_lambda: 2.930297242488071
    max_depth: 10
    n_estimators: 306
    num_leaves: 71
    colsample_bytree: 0.7121396258381646
    subsample: 0.793959734582999
    subsample_freq: 2
    min_child_samples: 18

In [10]:
if OPTUNA_OPTIMIZATION:
    final_model = LGBMClassifier(**trial.params)
else:
    final_model = LGBMClassifier(**trial)

In [11]:
test_preds = []

skf = StratifiedKFold(N_SPLITS, shuffle = True, random_state = 29)
aucs = []
for kfold, (train_idx, val_idx) in enumerate(skf.split(train_preprocessed[features].values, 
                                                      train_preprocessed['target'].values)):
        
        final_model.fit(train_preprocessed.loc[train_idx, features], train_preprocessed.loc[train_idx, 'target'])
        print('Fitted {}'.format(type(final_model).__name__))
        val_true = train.loc[val_idx, 'target'].values
        
        preds = final_model.predict(train_preprocessed.loc[val_idx, features])
        
        auc = roc_auc_score(val_true, preds)
        aucs.append(auc)
        print('Fold: {}\t Validation AUC: {}\n'.format(kfold, auc))
        
        test_preds.append(final_model.predict_proba(test_preprocessed[features])[:, 1])
        
print("Best Parameters mean AUC: {}".format(np.mean(aucs)))

Fitted LGBMClassifier
Fold: 0	 Validation AUC: 0.7813630335680872

Fitted LGBMClassifier
Fold: 1	 Validation AUC: 0.7797455076230869

Fitted LGBMClassifier
Fold: 2	 Validation AUC: 0.7801258351038276

Fitted LGBMClassifier
Fold: 3	 Validation AUC: 0.7765705729142831

Fitted LGBMClassifier
Fold: 4	 Validation AUC: 0.7798306360624886

Fitted LGBMClassifier
Fold: 5	 Validation AUC: 0.7797695544955657

Fitted LGBMClassifier
Fold: 6	 Validation AUC: 0.778813192002396

Fitted LGBMClassifier
Fold: 7	 Validation AUC: 0.7770425227454083

Fitted LGBMClassifier
Fold: 8	 Validation AUC: 0.7792939183142692

Fitted LGBMClassifier
Fold: 9	 Validation AUC: 0.7769395743604536

Best Parameters mean AUC: 0.7789494347189867


<a id = "submission"></a>

### Submission

In [12]:
test_predictions = np.mean(test_preds, axis = 0)

In [13]:
len(test_predictions) == len(test)

True

In [14]:
sample_submission['target'] = test_predictions

sample_submission.to_csv("submission.csv", index = False)