# [An easy, fast way to tune hyperparameters](https://dacon.io/competitions/official/235713/codeshare/2704?page=1&dtype=recent)

`optuna` is a hyperparameter tuning library widely used in Kaggle competitions.

## Optuna

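- It is a recent AutoML technique used for hyperparameter tuning.
- Its main advantage is that tuning is fast.
- You can choose the hyperparameter search method, and it also provides a tuned LightGBM through an intuitive API.
- Compared with other libraries it is intuitive, which makes it easy to code with.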

In [2]:
!pip install optuna

Collecting optuna
  Downloading https://files.pythonhosted.org/packages/2b/21/d13081805e1e1afc71f5bb743ece324c8bd576237c51b899ecb38a717502/optuna-2.7.0-py3-none-any.whl (293kB)
Collecting alembic
  Downloading https://files.pythonhosted.org/packages/ab/ff/375a0a81965a7ad4e23d1786de218e9bae050c4d3927cc9b2783aa045401/alembic-1.6.2.tar.gz (1.2MB)
Collecting cmaes>=0.8.2
  Downloading https://files.pythonhosted.org/packages/01/1f/43b01223a0366171f474320c6e966c39a11587287f098a5f09809b45e05f/cmaes-0.8.2-py3-none-any.whl
Collecting cliff
  Downloading https://files.pythonhosted.org/packages/a2/d6/7d9acb68a77acd140be7fececb7f2701b2a29d2da9c54184cb8f93509590/cliff-3.7.0-py3-none-any.whl (80kB)
Collecting colorlog
  Downloading https://files.pythonhosted.org/packages/32/e6/e9ddc6fa1104fda718338b3

In [3]:
import numpy as np
import pandas as pd
import optuna
from lightgbm import LGBMClassifier
from optuna import Trial
from optuna.samplers import TPESampler
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

## Simple preprocessing

In [4]:
from urllib.request import urlretrieve

urlretrieve('https://drive.google.com/uc?export=download&id=1XLVFI_sK0smRVVuT8XU2s-M3lJT-68sN', './open.zip')

!unzip ./open.zip

train = pd.read_csv('./open/train.csv')
test = pd.read_csv('./open/test.csv')

Archive:  ./open.zip
   creating: open/
  inflating: open/train.csv          
  inflating: open/sample_submission.csv  
  inflating: open/test.csv           


In [5]:
train = train.drop(["index"], axis=1)
train.fillna("NAN", inplace=True)

test = test.drop(["index"], axis=1)
test.fillna("NAN", inplace=True)

In [6]:
train_ohe = pd.get_dummies(train)
test_ohe = pd.get_dummies(test)

In [7]:
X = train_ohe.drop(["credit"], axis=1)
y = train["credit"]
X_test = test_ohe.copy()
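
One caveat with the cells above: `pd.get_dummies` is applied to `train` and `test` separately, so a category that appears in only one of them would produce mismatched columns. Whether that happens with this dataset is not shown here. A minimal sketch of one way to align the columns, using made-up toy frames (the `color` column and its values are hypothetical):

```python
import pandas as pd

# Hypothetical toy data: "blue" appears only in train, "green" only in test.
toy_train = pd.DataFrame({"color": ["red", "blue", "red"], "credit": [0, 1, 2]})
toy_test = pd.DataFrame({"color": ["red", "green"]})

toy_X = pd.get_dummies(toy_train).drop(["credit"], axis=1)
toy_X_test = pd.get_dummies(toy_test)

# Match the test columns to the training columns: columns missing from test
# are added and filled with 0, and columns unseen in training are dropped.
toy_X_test_aligned = toy_X_test.reindex(columns=toy_X.columns, fill_value=0)

print(list(toy_X.columns))
print(list(toy_X_test_aligned.columns))
```

In the notebook the same idea would read `X_test = X_test.reindex(columns=X.columns, fill_value=0)`.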

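- Optuna is a black-box optimizer: it needs an `objective` function that returns a numeric value, which it uses to evaluate a set of hyperparameters and to decide where to sample in later trials.
- An Optuna-specific argument, `trial`, is passed to `objective`.
- Inside `objective`, `trial` is used to specify which hyperparameters should be tuned.
- `objective` returns the model's log loss, which Optuna uses as feedback on performance.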

In [8]:
def objective(trial: Trial) -> float:
    # Fixed settings plus the search space: each trial.suggest_* call
    # samples one value from the given range for this trial.
    params_lgb = {
        "random_state": 42,
        "verbosity": -1,
        "learning_rate": 0.05,
        "n_estimators": 10000,
        "objective": "multiclass",
        "metric": "multi_logloss",
        "reg_alpha": trial.suggest_float("reg_alpha", 1e-8, 3e-5),
        "reg_lambda": trial.suggest_float("reg_lambda", 1e-8, 9e-2),
        "max_depth": trial.suggest_int("max_depth", 1, 20),
        "num_leaves": trial.suggest_int("num_leaves", 2, 256),
        "colsample_bytree": trial.suggest_float("colsample_bytree", 0.4, 1.0),
        "subsample": trial.suggest_float("subsample", 0.3, 1.0),
        "subsample_freq": trial.suggest_int("subsample_freq", 1, 10),
        "min_child_samples": trial.suggest_int("min_child_samples", 5, 100),
        "max_bin": trial.suggest_int("max_bin", 200, 500),
    }

    # Hold out 20% for validation. No random_state is set here, so every
    # trial is scored on a different random split.
    X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2)

    model = LGBMClassifier(**params_lgb)
    # early_stopping_rounds and verbose are fit() arguments in LightGBM
    # releases before 4.0; later releases expect callbacks instead
    # (for example lightgbm.early_stopping).
    model.fit(
        X_train,
        y_train,
        eval_set=[(X_train, y_train), (X_valid, y_valid)],
        early_stopping_rounds=100,
        verbose=False,
    )

    # The validation log loss is the value Optuna minimizes.
    lgb_pred = model.predict_proba(X_valid)
    log_score = log_loss(y_valid, lgb_pred)

    return log_score
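
The score returned by `objective` is the multiclass log loss: the mean negative log of the probability the model assigned to the true class. A small hand-rolled sketch with made-up labels and probabilities shows what is being computed. It is not a replacement for `sklearn.metrics.log_loss`, which also handles label ordering and other details omitted here.

```python
import math


def multiclass_log_loss(y_true, y_prob, eps=1e-15):
    """Mean negative log-probability assigned to the true class."""
    total = 0.0
    for label, probs in zip(y_true, y_prob):
        # Clip to avoid log(0) when the true class gets zero probability.
        p = min(max(probs[label], eps), 1.0 - eps)
        total += -math.log(p)
    return total / len(y_true)


# Hypothetical example: three samples, three classes (0, 1, 2).
toy_labels = [0, 2, 1]
toy_probs = [
    [0.7, 0.2, 0.1],
    [0.1, 0.3, 0.6],
    [0.2, 0.5, 0.3],
]
toy_score = multiclass_log_loss(toy_labels, toy_probs)
print(toy_score)
```

Lower is better. With three classes, uniform predictions of 1/3 each would score ln 3 ≈ 1.0986, which is a handy reference point when reading trial values.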

In [9]:
sampler = TPESampler(seed=42)
study = optuna.create_study(
    study_name="lgbm_parameter_opt",
    direction="minimize",
    sampler=sampler,
)
study.optimize(objective, n_trials=10)
print("Best Score:", study.best_value)
print("Best trial:", study.best_trial.params)

[I 2021-05-21 04:35:01,442] A new study created in memory with name: lgbm_parameter_opt
[I 2021-05-21 04:35:14,057] Trial 0 finished with value: 0.7411985631873189 and parameters: {'reg_alpha': 1.12424581642324e-05, 'reg_lambda': 0.08556428806974939, 'max_depth': 15, 'num_leaves': 154, 'colsample_bytree': 0.4936111842654619, 'subsample': 0.40919616423534183, 'subsample_freq': 1, 'min_child_samples': 88, 'max_bin': 380}. Best is trial 0 with value: 0.7411985631873189.
[I 2021-05-21 04:35:24,616] Trial 1 finished with value: 0.7369014929153384 and parameters: {'reg_alpha': 2.1245096608103405e-05, 'reg_lambda': 0.0018526142807772773, 'max_depth': 20, 'num_leaves': 214, 'colsample_bytree': 0.5274034664069657, 'subsample': 0.42727747704497043, 'subsample_freq': 2, 'min_child_samples': 34, 'max_bin': 357}. Best is trial 1 with value: 0.7369014929153384.
[I 2021-05-21 04:35:42,596] Trial 2 finished with value: 0.746597216355216 and parameters: {

Best Score: 0.7262569995177078
Best trial: {'reg_alpha': 1.987904330777592e-05, 'reg_lambda': 0.028054003730936226, 'max_depth': 11, 'num_leaves': 141, 'colsample_bytree': 0.5109126733153162, 'subsample': 0.9787092394351908, 'subsample_freq': 8, 'min_child_samples': 95, 'max_bin': 469}
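
`study.best_trial.params` holds only the values suggested through `trial.suggest_*`; the fixed settings from `objective` (`learning_rate`, `n_estimators`, and so on) are not included. One way to rebuild the full configuration for a final model, as a sketch: the variable names are assumptions, and the values are copied from `objective` and from the "Best trial" output.

```python
# Fixed settings copied from the objective function.
fixed_params = {
    "random_state": 42,
    "verbosity": -1,
    "learning_rate": 0.05,
    "n_estimators": 10000,
    "objective": "multiclass",
    "metric": "multi_logloss",
}

# Tuned values copied from the "Best trial" output; in the notebook this
# dict would simply be study.best_trial.params.
best_params = {
    "reg_alpha": 1.987904330777592e-05,
    "reg_lambda": 0.028054003730936226,
    "max_depth": 11,
    "num_leaves": 141,
    "colsample_bytree": 0.5109126733153162,
    "subsample": 0.9787092394351908,
    "subsample_freq": 8,
    "min_child_samples": 95,
    "max_bin": 469,
}

# Merge the two; later keys would win on conflict, but these dicts do not overlap.
final_params = {**fixed_params, **best_params}
print(final_params)

# The merged dict could then be passed on, e.g. LGBMClassifier(**final_params).
```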


In [10]:
# Visualize the optimization history
optuna.visualization.plot_optimization_history(study)

In [11]:
# Relationships between the parameters
optuna.visualization.plot_parallel_coordinate(study)

In [12]:
# Pairwise relationships between parameters (contour plots)
optuna.visualization.plot_contour(
    study,
    params=[
        "max_depth",
        "num_leaves",
        "colsample_bytree",
        "subsample",
        "subsample_freq",
        "min_child_samples",
        "max_bin",
    ],
)

In [13]:
# Hyperparameter importance
optuna.visualization.plot_param_importances(study)