# LightGBM with parameter tuning using Optuna

This notebook is based on the one I used in February. It can help you get started with LGBM and Optuna, in case you have not used these technologies before. I believe there is a lot of room for improvement and I will work on it throughout the competition.

**VERSION 7**: I tried standardizing the continuous variables, but the results were a little worse.

**VERSION 8**: The continuous variables are not standardized anymore. The categorical variables are now label encoded independently in each column, since an "A" in column `cat0` does not necessarily mean the same thing as an "A" in column `cat1`.

**VERSION 10**: Added `cat_l2` to list of parameters being optimized.

**VERSION 11**: LGBM now recognizes the categorical features by itself, because their column type was changed to "category". The parameter `cat_feature` was removed from the list.

# Load libraries and data

In [None]:
import numpy as np
import pandas as pd
from pathlib import Path

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
        
from sklearn.model_selection import train_test_split, KFold
from sklearn.metrics import roc_auc_score
from lightgbm import LGBMClassifier
import optuna

input_path = Path('/kaggle/input/tabular-playground-series-mar-2021/')

In [None]:
train = pd.read_csv(input_path / 'train.csv', index_col='id')
test = pd.read_csv(input_path / 'test.csv', index_col='id')
target = train.pop('target')
submission = pd.read_csv(input_path / 'sample_submission.csv', index_col='id')

# Encode categorical variables

In [None]:
cat_cols = [col for col in train.columns if 'cat' in col]

train[cat_cols] = train[cat_cols].astype('category')
test[cat_cols] = test[cat_cols].astype('category')
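
Converting each column to `category` gives every column its own set of categories, which is the same idea as the per-column label encoding mentioned in the VERSION 8 note. The sketch below illustrates that idea in plain Python on made-up rows; the helper `encode_column` is hypothetical and is not what pandas or LightGBM run internally.

```python
# Toy rows standing in for the competition data (values are made up).
rows = [
    {"cat0": "A", "cat1": "B"},
    {"cat0": "B", "cat1": "C"},
    {"cat0": "A", "cat1": "B"},
]


def encode_column(values):
    """Map each distinct value of ONE column to an integer code (sorted order)."""
    mapping = {value: code for code, value in enumerate(sorted(set(values)))}
    return [mapping[value] for value in values], mapping


encoded = {}
mappings = {}
for col in ("cat0", "cat1"):
    # Each column gets its own mapping, learned only from that column.
    encoded[col], mappings[col] = encode_column([row[col] for row in rows])

print(mappings["cat0"])  # codes for cat0
print(mappings["cat1"])  # codes for cat1, independent of cat0
```

Here the value "B" receives a different code in `cat0` than in `cat1`, because each mapping is built from a single column.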

# Data split and base model

In [None]:
X_train, X_valid, y_train, y_valid = train_test_split(train, target, test_size=0.1, random_state=0)

Last month we had a regression problem and now we have a classification one. So I'm using `LGBMClassifier` instead of `LGBMRegressor`. For the base model, we need only set our metric to `auc`.

In [None]:
%%time
model = LGBMClassifier(random_state=0, metric='auc')
model.fit(X_train, y_train)
y_pred = model.predict_proba(X_valid)[:,1]
auc = roc_auc_score(y_valid, y_pred)
print('AUC =', f'{auc:0.5f}')

The AUC is 0.89125, which is a little better than the score obtained with the [Getting Started Notebook](https://www.kaggle.com/inversion/get-started-mar-tabular-playground-competition) random forest model (0.87176).

# Set objective function for Optuna with parameters and their ranges

I am using the same parameters and ranges I did last time, but of course you can try to find more suitable ones.

In [None]:
def objective(trial):
    params = {
        'metric': 'auc',
        'random_state': 0,
        'n_estimators': trial.suggest_categorical('n_estimators', [1000]),
        'learning_rate': trial.suggest_float('learning_rate', 0.001, 0.1),
        'reg_lambda': trial.suggest_float('reg_lambda', 0.001, 10.0),
        'reg_alpha': trial.suggest_float('reg_alpha', 0.001, 10.0),
        'colsample_bytree': trial.suggest_categorical('colsample_bytree', [0.1, 0.2, 0.3, 0.4, 0.5]),
        'min_child_samples': trial.suggest_int('min_child_samples', 1, 300),
        'max_depth': trial.suggest_int('max_depth', 6, 127),
        'num_leaves': trial.suggest_int('num_leaves', 31, 128),
        'cat_smooth': trial.suggest_int('cat_smooth', 10, 100),
        'cat_l2': trial.suggest_int('cat_l2', 1, 20),
    }
    model = LGBMClassifier(**params) 
    model.fit(X_train, y_train, eval_set=[(X_valid,y_valid)], early_stopping_rounds=100, verbose=0)
    y_pred = model.predict_proba(X_valid)[:,1]
    auc = roc_auc_score(y_valid, y_pred)
    
    return auc
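
The objective above allows up to 1000 trees, but `early_stopping_rounds=100` halts training once the validation AUC has not improved for 100 rounds. The following is a minimal sketch of that patience rule on a made-up score sequence; the function `best_iteration_with_patience` is a hypothetical illustration, not LightGBM's implementation.

```python
def best_iteration_with_patience(scores, patience):
    """Return (best_index, stop_index) for a score that should be maximized.

    Training "stops" at the first round that is `patience` rounds past the
    best round seen so far without a strict improvement.
    """
    best_index = 0
    best_score = scores[0]
    for i, score in enumerate(scores):
        if score > best_score:
            best_index, best_score = i, score
        elif i - best_index >= patience:
            return best_index, i
    # Never ran out of patience: training used every round.
    return best_index, len(scores) - 1


# Made-up validation AUC per boosting round.
toy_auc_per_round = [0.70, 0.75, 0.78, 0.77, 0.78, 0.76, 0.79, 0.78]

best, stopped = best_iteration_with_patience(toy_auc_per_round, patience=3)
print("best round:", best, "stopped at round:", stopped)
```

With a patience of 3 the toy run stops before reaching the later, higher score, while a patience of 5 would reach it. This is the trade-off behind choosing the number of early stopping rounds. In newer LightGBM releases the same rule is, to my knowledge, configured with the `lightgbm.early_stopping` callback passed through `callbacks=[...]` rather than with the `early_stopping_rounds` keyword.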

In last month's competition the metric was the RMSE (root mean squared error), which we wanted to minimize. This month it is the AUC (area under the ROC curve), so this time we want to **maximize** it. In [this notebook](https://www.kaggle.com/ekozyreff/tps-2021-03-roc-and-auc-tutorial) I wrote a brief tutorial about ROC and AUC.
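
To make the metric concrete: the AUC equals the fraction of (positive, negative) pairs in which the positive example receives the higher score, with ties counted as one half. Below is a minimal sketch of that definition on toy data. The function `pairwise_auc` is a hypothetical helper written for illustration; it is quadratic in the number of rows, so `roc_auc_score` remains the right tool for the real data.

```python
def pairwise_auc(y_true, y_score):
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # tie counts as half
    return wins / (len(pos) * len(neg))


# Toy labels and scores (made up).
y_true_toy = [0, 0, 1, 1]
y_score_toy = [0.1, 0.4, 0.35, 0.8]

auc_toy = pairwise_auc(y_true_toy, y_score_toy)
print("toy AUC =", auc_toy)
```

A perfect ranking gives 1.0 and a random one about 0.5, which is why the study uses `direction='maximize'`.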

In [None]:
%%time
study = optuna.create_study(direction='maximize', sampler=optuna.samplers.TPESampler(seed=0))
study.optimize(objective, n_trials=100)
print('Number of finished trials:', len(study.trials))
print('Best parameters:', study.best_trial.params)
print('Best AUC:', study.best_trial.value)

# Visualize optimization history

In [None]:
optuna.visualization.plot_optimization_history(study)

# Recover best parameters found and build final predictions

Let's recover the best parameters for our model according to the optimization performed by Optuna. 

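To perform the final predictions, I increased the number of estimators to 20000, since this worked well in previous tests. This is only an upper limit: training uses early stopping (`early_stopping_rounds=1000` in the cell that fits the folds), so it normally ends before reaching it.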

In [None]:
params = study.best_params
params['random_state'] = 0
params['metric'] = 'auc'
params['n_estimators'] = 20000

Finally, let us split the training data into 10 folds, fit one model per fold, and average the 10 sets of test predictions to build our final submission.

In [None]:
n_folds = 10
kf = KFold(n_splits=n_folds, shuffle=True, random_state=0)
y_pred = np.zeros(test.shape[0])

for fold, (train_index, valid_index) in enumerate(kf.split(train, target)):
    print("Running Fold {}".format(fold + 1))
    X_train, X_valid = train.iloc[train_index], train.iloc[valid_index]
    y_train, y_valid = target.iloc[train_index], target.iloc[valid_index]
    model = LGBMClassifier(**params)
    model.fit(X_train, y_train, eval_set=[(X_valid, y_valid)], early_stopping_rounds=1000, verbose=0)
    print("  AUC: {}".format(roc_auc_score(y_valid, model.predict_proba(X_valid)[:, 1])))
    y_pred += model.predict_proba(test)[:,1]    

y_pred /= n_folds

print("")
print("Done!")
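
The loop above can be summarized as: shuffle the rows, cut them into folds, train on all folds but one, and average the test predictions of the resulting models. The sketch below shows that pattern with the standard library only. The names `kfold_indices` and `toy_model_predict` are hypothetical, the "model" merely predicts the training mean, and the split is not guaranteed to match the folds `KFold` produces.

```python
import random


def kfold_indices(n_samples, n_splits, seed=0):
    """Yield (train_idx, valid_idx) pairs; each row is validated exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::n_splits] for i in range(n_splits)]
    for k in range(n_splits):
        valid_idx = sorted(folds[k])
        train_idx = sorted(i for j, fold in enumerate(folds) if j != k for i in fold)
        yield train_idx, valid_idx


def toy_model_predict(train_targets, n_test):
    """Stand-in 'model': predicts the training mean for every test row."""
    mean = sum(train_targets) / len(train_targets)
    return [mean] * n_test


targets_toy = [0, 1, 1, 0, 1, 0, 1, 1, 0, 1]  # made-up labels
n_test = 3
n_splits = 5

test_pred = [0.0] * n_test
for train_idx, valid_idx in kfold_indices(len(targets_toy), n_splits):
    fold_pred = toy_model_predict([targets_toy[i] for i in train_idx], n_test)
    # Accumulate this fold's predictions for the test rows.
    test_pred = [acc + p for acc, p in zip(test_pred, fold_pred)]

# Average over the folds, as `y_pred /= n_folds` does above.
test_pred = [p / n_splits for p in test_pred]
print(test_pred)
```

Averaging several models trained on different subsets usually gives more stable predictions than a single model trained on one split.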

In [None]:
submission['target'] = y_pred
submission.to_csv('lgbm_optuna_enc.csv')

If you found this notebook helpful, please upvote 👍 and also feel free to leave comments and suggestions below. Thanks!