<a href="https://colab.research.google.com/github/Hibi1001/practice/blob/main/section_3/02_hyperparameter.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Hyperparameter Tuning
We tune the hyperparameters to improve model accuracy.

## Installing Optuna
Install Optuna, the library used here for hyperparameter optimization.

In [2]:
!pip install optuna

Collecting optuna
  Downloading optuna-4.1.0-py3-none-any.whl.metadata (16 kB)
Collecting alembic>=1.5.0 (from optuna)
  Downloading alembic-1.14.0-py3-none-any.whl.metadata (7.4 kB)
Collecting colorlog (from optuna)
  Downloading colorlog-6.9.0-py3-none-any.whl.metadata (10 kB)
Collecting Mako (from alembic>=1.5.0->optuna)
  Downloading Mako-1.3.8-py3-none-any.whl.metadata (2.9 kB)
Downloading optuna-4.1.0-py3-none-any.whl (364 kB)
Downloading alembic-1.14.0-py3-none-any.whl (233 kB)
Downloading colorlog-6.9.0-py3-none-any.whl (11 kB)
Downloading Mako-1.3.8-py3-none-any.whl (78 kB)
Installing collected packages: Ma

## Preparing the Data
Import the required libraries, then load and preprocess the data.

In [3]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.metrics import log_loss

import lightgbm as lgb
import optuna

train_data = pd.read_csv("train-2.csv")  # training data
test_data = pd.read_csv("test-2.csv")  # test data

test_id = test_data["PassengerId"]  # used when submitting the results

data = pd.concat([train_data, test_data], sort=False)  # combine training and test data

# Encode categorical data
# (assign the result back instead of calling inplace=True on a column,
#  which pandas warns will stop working in pandas 3.0)
data["Sex"] = data["Sex"].map({"male": 0, "female": 1})
data["Embarked"] = data["Embarked"].fillna("S").map({"S": 0, "C": 1, "Q": 2})

# Fill missing values with the column mean
data["Fare"] = data["Fare"].fillna(data["Fare"].mean())
data["Age"] = data["Age"].fillna(data["Age"].mean())

# Create a new feature
data["Family"] = data["Parch"] + data["SibSp"]

# Drop features that are not needed
data = data.drop(columns=["Name", "PassengerId", "SibSp", "Parch", "Ticket", "Cabin"])

# Build the inputs and targets
n_train = len(train_data)  # number of training rows, fixed before train_data is reassigned
train_data = data[:n_train]
test_data = data[n_train:]
t = train_data["Survived"]  # targets
x_train = train_data.drop("Survived", axis=1)  # inputs for training
x_test = test_data.drop("Survived", axis=1)  # inputs for testing

x_train.head()


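|   | Pclass | Sex | Age | Fare | Embarked | Family |
|---|---|---|---|---|---|---|
| 0 | 3 | 0 | 22.0 | 7.25 | 0 | 1 |
| 1 | 1 | 1 | 38.0 | 71.2833 | 1 | 1 |
| 2 | 3 | 1 | 26.0 | 7.925 | 0 | 0 |
| 3 | 1 | 1 | 35.0 | 53.1 | 0 | 1 |
| 4 | 3 | 0 | 35.0 | 8.05 | 0 | 0 |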


## Splitting the Data
Split the training data into a training set and a validation set.

In [4]:
x_train, x_valid, t_train, t_valid = train_test_split(x_train, t, test_size=0.3, stratify=t)
print(len(x_train))
print(len(x_valid))
print(len(t_train))
print(len(t_valid))


623
268
623
268
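
The `stratify=t` argument asks for a split in which each class keeps roughly the same share in both subsets. The pure-Python sketch below illustrates that idea on made-up labels; `stratified_split` and `positive_rate` are hypothetical helpers written for this example, not scikit-learn's implementation.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size=0.3, seed=0):
    """Return (train_idx, valid_idx) so that each class keeps roughly its share."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)

    train_idx, valid_idx = [], []
    for idx in by_class.values():
        rng.shuffle(idx)                       # shuffle within each class
        n_valid = round(len(idx) * test_size)  # per-class validation count
        valid_idx.extend(idx[:n_valid])
        train_idx.extend(idx[n_valid:])
    return sorted(train_idx), sorted(valid_idx)

def positive_rate(labels, idx):
    """Fraction of positive labels among the given row indices."""
    return sum(labels[i] for i in idx) / len(idx)

labels = [1] * 40 + [0] * 60  # toy targets (assumption): 40% positive
train_idx, valid_idx = stratified_split(labels, test_size=0.3)

print(len(train_idx), len(valid_idx))
print(positive_rate(labels, train_idx), positive_rate(labels, valid_idx))
```

Because the split is done class by class, both subsets end up with the same positive rate as the full toy data set, which is the property the notebook relies on when it compares training and validation loss.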


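## Hyperparameter Optimization
We define a function for hyperparameter optimization.  
The optimization uses a library called Optuna.  
https://github.com/optuna/optuna  
  
As the machine learning algorithm, we use LightGBM, which is based on decision trees.  
LightGBM is a gradient boosting implementation that is easy to use and has been used by many top-ranked Kaggle competitors.  
It builds a large number of decision trees, constructing each new tree so that it better predicts the data on which the preceding trees made large errors.  
https://lightgbm.readthedocs.io/en/latest/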
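
The idea that each new tree corrects the errors of the trees before it can be sketched in a few lines. The example below is a deliberately simplified regression version that uses one-split "stumps" and squared error on made-up data. It is not LightGBM's algorithm, which works with gradients of the chosen loss, histogram binning, and leaf-wise tree growth. The names `fit_stump`, `boost`, and `mse` are hypothetical.

```python
def fit_stump(xs, residuals):
    """Find the single threshold split on a 1-D feature that minimizes squared error."""
    best = None
    for thr in sorted(set(xs))[:-1]:  # skip the largest value so the right side is never empty
        left = [r for x, r in zip(xs, residuals) if x <= thr]
        right = [r for x, r in zip(xs, residuals) if x > thr]
        lmean = sum(left) / len(left)
        rmean = sum(right) / len(right)
        err = sum((r - lmean) ** 2 for r in left) + sum((r - rmean) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, thr, lmean, rmean)
    _, thr, lmean, rmean = best
    return lambda x: lmean if x <= thr else rmean

def boost(xs, ys, n_trees=50, learning_rate=0.1):
    """Fit each new stump to the residuals left by the current ensemble."""
    base = sum(ys) / len(ys)  # start from the mean prediction
    preds = [base] * len(ys)
    stumps = []
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump(x) for x, p in zip(xs, preds)]
    return lambda x: base + learning_rate * sum(s(x) for s in stumps)

def mse(model, xs, ys):
    """Mean squared error of a model on the given points."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

xs = [0, 1, 2, 3, 4, 5, 6, 7]
ys = [1.0, 1.2, 0.9, 3.1, 2.9, 3.0, 5.2, 4.8]  # toy targets (assumption)

few = boost(xs, ys, n_trees=1)
many = boost(xs, ys, n_trees=50)
print(mse(few, xs, ys), mse(many, xs, ys))
```

The training error shrinks as stumps are added, and the `learning_rate` scales how much each stump contributes, which mirrors the role of the `learning_rate` parameter in the LightGBM settings.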


In [5]:
categorical_features = ["Embarked", "Pclass", "Sex"]

def objective(trial):
    # Hyperparameter search space
    params = {
        "objective": "binary",  # binary classification
        "max_bin": trial.suggest_int("max_bin", 200, 500),  # maximum number of bins per feature
        "learning_rate": 0.05,  # learning rate
        "num_leaves": trial.suggest_int("num_leaves", 16, 128)  # maximum number of leaves per tree
    }

    # Build the datasets
    lgb_train = lgb.Dataset(x_train, t_train, categorical_feature=categorical_features)
    lgb_val = lgb.Dataset(x_valid, t_valid, reference=lgb_train, categorical_feature=categorical_features)

    # Train the model
    verbose_eval = 20  # interval (in iterations) at which training progress is logged
    model = lgb.train(params, lgb_train, valid_sets=[lgb_train, lgb_val],
                      num_boost_round=500,  # maximum number of boosting rounds
                      callbacks=[lgb.early_stopping(stopping_rounds=10,  # stop after 10 consecutive rounds without improvement
                                                    verbose=True),
                                 lgb.log_evaluation(verbose_eval)])

    y_valid = model.predict(x_valid, num_iteration=model.best_iteration)  # predict with the best iteration of the trained model
    score = log_loss(t_valid, y_valid)  # binary cross-entropy loss
    return score
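
The score returned by `objective` is the binary cross-entropy: the mean, over all samples, of the negative log of the probability the model assigned to the correct class. As a rough illustration, the sketch below computes it in plain Python. `binary_log_loss` is a hypothetical helper, the clipping constant `eps` is an assumption made here to avoid `log(0)`, and the targets and probabilities are made up.

```python
import math

def binary_log_loss(targets, probs, eps=1e-15):
    """Mean binary cross-entropy between 0/1 targets and predicted probabilities."""
    total = 0.0
    for t, p in zip(targets, probs):
        p = min(max(p, eps), 1 - eps)  # keep p away from exactly 0 or 1
        total += t * math.log(p) + (1 - t) * math.log(1 - p)
    return -total / len(targets)

targets = [1, 0, 1, 0]
confident = [0.9, 0.1, 0.8, 0.2]  # mostly correct, confident predictions
uncertain = [0.5, 0.5, 0.5, 0.5]  # no information: the loss equals ln 2

print(binary_log_loss(targets, confident))
print(binary_log_loss(targets, uncertain))
```

Lower values are better, so Optuna's default direction (minimize) matches this metric without any extra configuration.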

Use Optuna to optimize the hyperparameters.
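
With `RandomSampler`, each trial draws its hyperparameters at random from the declared ranges, and the study remembers the trial with the lowest objective value. The pure-Python sketch below mimics that loop over the same two ranges. It illustrates random search in general and is not Optuna's implementation; `toy_objective` is a made-up stand-in for the LightGBM validation loss, and `random_search` is a hypothetical helper.

```python
import random

def toy_objective(max_bin, num_leaves):
    # Hypothetical smooth loss with its minimum at max_bin=300, num_leaves=32
    return ((max_bin - 300) / 300) ** 2 + ((num_leaves - 32) / 112) ** 2

def random_search(n_trials=30, seed=0):
    """Sample parameters uniformly at random and keep the best trial seen so far."""
    rng = random.Random(seed)
    history = []
    best = None
    for trial in range(n_trials):
        params = {
            "max_bin": rng.randint(200, 500),    # both ends inclusive, like suggest_int
            "num_leaves": rng.randint(16, 128),
        }
        value = toy_objective(**params)
        history.append((trial, params, value))
        if best is None or value < best[2]:
            best = (trial, params, value)
    return best, history

best, history = random_search()
print("best trial:", best[0], "params:", best[1], "value:", best[2])
```

In Optuna itself, the corresponding results are available as `study.best_params` and `study.best_value` once `study.optimize` has returned.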


In [6]:
study = optuna.create_study(sampler=optuna.samplers.RandomSampler())
study.optimize(objective, n_trials=30)

[I 2025-01-12 17:21:12,085] A new study created in memory with name: no-name-c8c5b2ac-301a-42cb-98f5-66997ac714e7


[LightGBM] [Info] Number of positive: 239, number of negative: 384
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000391 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 193
[LightGBM] [Info] Number of data points in the train set: 623, number of used features: 6
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.383628 -> initscore=-0.474179
[LightGBM] [Info] Start training from score -0.474179
Training until validation scores don't improve for 10 rounds
[20]	training's binary_logloss: 0.429707	valid_1's binary_logloss: 0.477069

[I 2025-01-12 17:21:14,380] Trial 0 finished with value: 0.4352583294123291 and parameters: {'max_bin': 338, 'num_leaves': 103}. Best is trial 0 with value: 0.4352583294123291.



[40]	training's binary_logloss: 0.347955	valid_1's binary_logloss: 0.441419
[60]	training's binary_logloss: 0.298483	valid_1's binary_logloss: 0.441139
Early stopping, best iteration is:
[50]	training's binary_logloss: 0.321785	valid_1's binary_logloss: 0.435258
[LightGBM] [Info] Number of positive: 239, number of negative: 384
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.011293 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 193
[LightGBM] [Info] Number of data points in the train set: 623, number of used features: 6
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.383628 -> initscore=-0.474179
[LightGBM] [Info] Start training from score -0.474179
Training until validation scores don't improve for 10 rounds
[20]	training's binary_logloss: 0.429707	valid_1's binary_logloss: 0.477069
[40]	training's binary_logloss: 0.347955	valid_1'

[I 2025-01-12 17:21:16,042] Trial 1 finished with value: 0.4352583294123291 and parameters: {'max_bin': 388, 'num_leaves': 102}. Best is trial 0 with value: 0.4352583294123291.


[60]	training's binary_logloss: 0.298483	valid_1's binary_logloss: 0.441139
Early stopping, best iteration is:
[50]	training's binary_logloss: 0.321785	valid_1's binary_logloss: 0.435258
[LightGBM] [Info] Number of positive: 239, number of negative: 384
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000092 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 193
[LightGBM] [Info] Number of data points in the train set: 623, number of used features: 6
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.383628 -> initscore=-0.474179
[LightGBM] [Info] Start training from score -0.474179
Training until validation scores don't improve for 10 rounds
[20]	training's binary_logloss: 0.429707	valid_1's binary_logloss: 0.477069
[40]	training's binary_logloss: 0.347955	valid_1's binary_logloss: 0.441419


[I 2025-01-12 17:21:18,488] Trial 2 finished with value: 0.4352583294123291 and parameters: {'max_bin': 230, 'num_leaves': 126}. Best is trial 0 with value: 0.4352583294123291.


[60]	training's binary_logloss: 0.298483	valid_1's binary_logloss: 0.441139
Early stopping, best iteration is:
[50]	training's binary_logloss: 0.321785	valid_1's binary_logloss: 0.435258
[LightGBM] [Info] Number of positive: 239, number of negative: 384
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.174850 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 193
[LightGBM] [Info] Number of data points in the train set: 623, number of used features: 6
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.383628 -> initscore=-0.474179
[LightGBM] [Info] Start training from score -0.474179
Training until validation scores don't improve for 10 rounds
[20]	training's binary_logloss: 0.429723	valid_1's binary_logloss: 0.477072
[40]	training's binary_logloss: 0.348181	valid_1's binary_logloss: 0.444367


[I 2025-01-12 17:21:20,245] Trial 3 finished with value: 0.43675380329188673 and parameters: {'max_bin': 378, 'num_leaves': 23}. Best is trial 0 with value: 0.4352583294123291.


[60]	training's binary_logloss: 0.297735	valid_1's binary_logloss: 0.438217
Early stopping, best iteration is:
[53]	training's binary_logloss: 0.313826	valid_1's binary_logloss: 0.436754
[LightGBM] [Info] Number of positive: 239, number of negative: 384
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000276 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 193
[LightGBM] [Info] Number of data points in the train set: 623, number of used features: 6
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.383628 -> initscore=-0.474179
[LightGBM] [Info] Start training from score -0.474179
Training until validation scores don't improve for 10 rounds
[20]	training's binary_logloss: 0.429707	valid_1's binary_logloss: 0.477069
[40]	training's binary_logloss: 0.347955	valid_1's binary_logloss: 0.441419
[60]	training's binary_logloss: 0.298483	valid_1's

[I 2025-01-12 17:21:22,109] Trial 4 finished with value: 0.4352583294123291 and parameters: {'max_bin': 497, 'num_leaves': 50}. Best is trial 0 with value: 0.4352583294123291.


[LightGBM] [Info] Number of positive: 239, number of negative: 384
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000090 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 193
[LightGBM] [Info] Number of data points in the train set: 623, number of used features: 6
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.383628 -> initscore=-0.474179
[LightGBM] [Info] Start training from score -0.474179
Training until validation scores don't improve for 10 rounds


[I 2025-01-12 17:21:22,486] Trial 5 finished with value: 0.4352583294123291 and parameters: {'max_bin': 420, 'num_leaves': 78}. Best is trial 0 with value: 0.4352583294123291.


[20]	training's binary_logloss: 0.429707	valid_1's binary_logloss: 0.477069
[40]	training's binary_logloss: 0.347955	valid_1's binary_logloss: 0.441419
[60]	training's binary_logloss: 0.298483	valid_1's binary_logloss: 0.441139
Early stopping, best iteration is:
[50]	training's binary_logloss: 0.321785	valid_1's binary_logloss: 0.435258
[LightGBM] [Info] Number of positive: 239, number of negative: 384
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000424 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 193
[LightGBM] [Info] Number of data points in the train set: 623, number of used features: 6
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.383628 -> initscore=-0.474179
[LightGBM] [Info] Start training from score -0.474179
Training until validation scores don't improve for 10 rounds
[20]	training's binary_logloss: 0.429707	valid_1's

[I 2025-01-12 17:21:22,664] Trial 6 finished with value: 0.4352583294123291 and parameters: {'max_bin': 444, 'num_leaves': 88}. Best is trial 0 with value: 0.4352583294123291.
[I 2025-01-12 17:21:22,735] Trial 7 finished with value: 0.4337916959352715 and parameters: {'max_bin': 262, 'num_leaves': 22}. Best is trial 7 with value: 0.4337916959352715.


[40]	training's binary_logloss: 0.347955	valid_1's binary_logloss: 0.441419
[60]	training's binary_logloss: 0.298483	valid_1's binary_logloss: 0.441139
Early stopping, best iteration is:
[50]	training's binary_logloss: 0.321785	valid_1's binary_logloss: 0.435258
[LightGBM] [Info] Number of positive: 239, number of negative: 384
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.002287 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 193
[LightGBM] [Info] Number of data points in the train set: 623, number of used features: 6
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.383628 -> initscore=-0.474179
[LightGBM] [Info] Start training from score -0.474179
Training until validation scores don't improve for 10 rounds
[20]	training's binary_logloss: 0.429678	valid_1's binary_logloss: 0.476815
[40]	training's binary_logloss: 0.347933	valid_1's

[I 2025-01-12 17:21:23,031] Trial 8 finished with value: 0.4343627378126399 and parameters: {'max_bin': 387, 'num_leaves': 19}. Best is trial 7 with value: 0.4337916959352715.


[LightGBM] [Info] Number of positive: 239, number of negative: 384
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.013585 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 193
[LightGBM] [Info] Number of data points in the train set: 623, number of used features: 6
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.383628 -> initscore=-0.474179
[LightGBM] [Info] Start training from score -0.474179
Training until validation scores don't improve for 10 rounds
[20]	training's binary_logloss: 0.429707	valid_1's binary_logloss: 0.477069


[I 2025-01-12 17:21:23,446] Trial 9 finished with value: 0.4352583294123291 and parameters: {'max_bin': 364, 'num_leaves': 101}. Best is trial 7 with value: 0.4337916959352715.


[40]	training's binary_logloss: 0.347955	valid_1's binary_logloss: 0.441419
[60]	training's binary_logloss: 0.298483	valid_1's binary_logloss: 0.441139
Early stopping, best iteration is:
[50]	training's binary_logloss: 0.321785	valid_1's binary_logloss: 0.435258


[I 2025-01-12 17:21:23,616] Trial 10 finished with value: 0.4369354281653766 and parameters: {'max_bin': 203, 'num_leaves': 49}. Best is trial 7 with value: 0.4337916959352715.


[LightGBM] [Info] Number of positive: 239, number of negative: 384
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000294 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 252
[LightGBM] [Info] Number of data points in the train set: 623, number of used features: 6
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.383628 -> initscore=-0.474179
[LightGBM] [Info] Start training from score -0.474179
Training until validation scores don't improve for 10 rounds
[20]	training's binary_logloss: 0.426279	valid_1's binary_logloss: 0.475102
[40]	training's binary_logloss: 0.344356	valid_1's binary_logloss: 0.438275
Early stopping, best iteration is:
[48]	training's binary_logloss: 0.32303	valid_1's binary_logloss: 0.436935
[LightGBM] [Info] Number of positive: 239, number of negative: 384
[LightGBM] [Info] Auto-choosing row-wise multi-threading, th

[I 2025-01-12 17:21:24,001] Trial 11 finished with value: 0.4352583294123291 and parameters: {'max_bin': 237, 'num_leaves': 119}. Best is trial 7 with value: 0.4337916959352715.


[LightGBM] [Info] Number of positive: 239, number of negative: 384
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.039040 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 252
[LightGBM] [Info] Number of data points in the train set: 623, number of used features: 6
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.383628 -> initscore=-0.474179
[LightGBM] [Info] Start training from score -0.474179
Training until validation scores don't improve for 10 rounds
[20]	training's binary_logloss: 0.426279	valid_1's binary_logloss: 0.475102
[40]	training's binary_logloss: 0.344356	valid_1's binary_logloss: 0.438275
Early stopping, best iteration is:
[48]	training's binary_logloss: 0.32303	valid_1's binary_logloss: 0.436935

[I 2025-01-12 17:21:26,071] Trial 12 finished with value: 0.4369354281653766 and parameters: {'max_bin': 215, 'num_leaves': 86}. Best is trial 7 with value: 0.4337916959352715.



[LightGBM] [Info] Number of positive: 239, number of negative: 384
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.014839 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 193
[LightGBM] [Info] Number of data points in the train set: 623, number of used features: 6
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.383628 -> initscore=-0.474179
[LightGBM] [Info] Start training from score -0.474179
Training until validation scores don't improve for 10 rounds
[20]	training's binary_logloss: 0.429707	valid_1's binary_logloss: 0.477069
[40]	training's binary_logloss: 0.347955	valid_1's binary_logloss: 0.441419
[60]	training's binary_logloss: 0.298483	valid_1's binary_logloss: 0.441139
Early stopping, best iteration is:
[50]	training's binary_logloss: 0.321785	valid_1's binary_logloss: 0.435258


[I 2025-01-12 17:21:27,849] Trial 13 finished with value: 0.4352583294123291 and parameters: {'max_bin': 362, 'num_leaves': 55}. Best is trial 7 with value: 0.4337916959352715.


[LightGBM] [Info] Number of positive: 239, number of negative: 384
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.094777 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 193
[LightGBM] [Info] Number of data points in the train set: 623, number of used features: 6
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.383628 -> initscore=-0.474179
[LightGBM] [Info] Start training from score -0.474179
Training until validation scores don't improve for 10 rounds
[20]	training's binary_logloss: 0.429707	valid_1's binary_logloss: 0.477069


[I 2025-01-12 17:21:28,836] Trial 14 finished with value: 0.4352583294123291 and parameters: {'max_bin': 253, 'num_leaves': 39}. Best is trial 7 with value: 0.4337916959352715.


[40]	training's binary_logloss: 0.347955	valid_1's binary_logloss: 0.441419
[60]	training's binary_logloss: 0.298483	valid_1's binary_logloss: 0.441139
Early stopping, best iteration is:
[50]	training's binary_logloss: 0.321785	valid_1's binary_logloss: 0.435258
[LightGBM] [Info] Number of positive: 239, number of negative: 384
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000110 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 193
[LightGBM] [Info] Number of data points in the train set: 623, number of used features: 6
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.383628 -> initscore=-0.474179
[LightGBM] [Info] Start training from score -0.474179
Training until validation scores don't improve for 10 rounds
[20]	training's binary_logloss: 0.429707	valid_1's binary_logloss: 0.477069
[40]	training's binary_logloss: 0.347955	valid_1's

[I 2025-01-12 17:21:29,215] Trial 15 finished with value: 0.4352583294123291 and parameters: {'max_bin': 249, 'num_leaves': 125}. Best is trial 7 with value: 0.4337916959352715.



[60]	training's binary_logloss: 0.298483	valid_1's binary_logloss: 0.441139
Early stopping, best iteration is:
[50]	training's binary_logloss: 0.321785	valid_1's binary_logloss: 0.435258
[LightGBM] [Info] Number of positive: 239, number of negative: 384
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000088 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 193
[LightGBM] [Info] Number of data points in the train set: 623, number of used features: 6
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.383628 -> initscore=-0.474179
[LightGBM] [Info] Start training from score -0.474179
Training until validation scores don't improve for 10 rounds
[20]	training's binary_logloss: 0.429707	valid_1's binary_logloss: 0.477069


[I 2025-01-12 17:21:29,527] Trial 16 finished with value: 0.4352583294123291 and parameters: {'max_bin': 350, 'num_leaves': 59}. Best is trial 7 with value: 0.4337916959352715.


[40]	training's binary_logloss: 0.347955	valid_1's binary_logloss: 0.441419
[60]	training's binary_logloss: 0.298483	valid_1's binary_logloss: 0.441139
Early stopping, best iteration is:
[50]	training's binary_logloss: 0.321785	valid_1's binary_logloss: 0.435258
[LightGBM] [Info] Number of positive: 239, number of negative: 384
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000103 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 193
[LightGBM] [Info] Number of data points in the train set: 623, number of used features: 6
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.383628 -> initscore=-0.474179
[LightGBM] [Info] Start training from score -0.474179
Training until validation scores don't improve for 10 rounds
[20]	training's binary_logloss: 0.429707	valid_1's binary_logloss: 0.477069

[I 2025-01-12 17:21:29,755] Trial 17 finished with value: 0.4352583294123291 and parameters: {'max_bin': 323, 'num_leaves': 101}. Best is trial 7 with value: 0.4337916959352715.



[40]	training's binary_logloss: 0.347955	valid_1's binary_logloss: 0.441419
[60]	training's binary_logloss: 0.298483	valid_1's binary_logloss: 0.441139
Early stopping, best iteration is:
[50]	training's binary_logloss: 0.321785	valid_1's binary_logloss: 0.435258
[LightGBM] [Info] Number of positive: 239, number of negative: 384
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000093 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 193
[LightGBM] [Info] Number of data points in the train set: 623, number of used features: 6
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.383628 -> initscore=-0.474179
[LightGBM] [Info] Start training from score -0.474179
Training until validation scores don't improve for 10 rounds


[I 2025-01-12 17:21:29,990] Trial 18 finished with value: 0.4352583294123291 and parameters: {'max_bin': 266, 'num_leaves': 54}. Best is trial 7 with value: 0.4337916959352715.


[20]	training's binary_logloss: 0.429707	valid_1's binary_logloss: 0.477069
[40]	training's binary_logloss: 0.347955	valid_1's binary_logloss: 0.441419
[60]	training's binary_logloss: 0.298483	valid_1's binary_logloss: 0.441139
Early stopping, best iteration is:
[50]	training's binary_logloss: 0.321785	valid_1's binary_logloss: 0.435258
[LightGBM] [Info] Number of positive: 239, number of negative: 384
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000097 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 193
[LightGBM] [Info] Number of data points in the train set: 623, number of used features: 6
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.383628 -> initscore=-0.474179
[LightGBM] [Info] Start training from score -0.474179
Training until validation scores don't improve for 10 rounds
[20]	training's binary_logloss: 0.429707	valid_1's

[I 2025-01-12 17:21:30,261] Trial 19 finished with value: 0.4352583294123291 and parameters: {'max_bin': 329, 'num_leaves': 30}. Best is trial 7 with value: 0.4337916959352715.


[60]	training's binary_logloss: 0.298483	valid_1's binary_logloss: 0.441139
Early stopping, best iteration is:
[50]	training's binary_logloss: 0.321785	valid_1's binary_logloss: 0.435258
[LightGBM] [Info] Number of positive: 239, number of negative: 384
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000386 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 193
[LightGBM] [Info] Number of data points in the train set: 623, number of used features: 6
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.383628 -> initscore=-0.474179
[LightGBM] [Info] Start training from score -0.474179
Training until validation scores don't improve for 10 rounds
[20]	training's binary_logloss: 0.429707	valid_1's binary_logloss: 0.477069


[I 2025-01-12 17:21:30,630] Trial 20 finished with value: 0.4352583294123291 and parameters: {'max_bin': 482, 'num_leaves': 78}. Best is trial 7 with value: 0.4337916959352715.


[40]	training's binary_logloss: 0.347955	valid_1's binary_logloss: 0.441419
[60]	training's binary_logloss: 0.298483	valid_1's binary_logloss: 0.441139
Early stopping, best iteration is:
[50]	training's binary_logloss: 0.321785	valid_1's binary_logloss: 0.435258
[LightGBM] [Info] Number of positive: 239, number of negative: 384
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000417 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 252


[I 2025-01-12 17:21:30,779] Trial 21 finished with value: 0.43188770843094665 and parameters: {'max_bin': 212, 'num_leaves': 16}. Best is trial 21 with value: 0.43188770843094665.


[LightGBM] [Info] Number of data points in the train set: 623, number of used features: 6
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.383628 -> initscore=-0.474179
[LightGBM] [Info] Start training from score -0.474179
Training until validation scores don't improve for 10 rounds
[20]	training's binary_logloss: 0.430343	valid_1's binary_logloss: 0.476566
[40]	training's binary_logloss: 0.352972	valid_1's binary_logloss: 0.435083
[60]	training's binary_logloss: 0.309209	valid_1's binary_logloss: 0.433896
Early stopping, best iteration is:
[50]	training's binary_logloss: 0.3286	valid_1's binary_logloss: 0.431888
[LightGBM] [Info] Number of positive: 239, number of negative: 384
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000447 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 193
[LightGBM] [Info] Number of data points in the train

[I 2025-01-12 17:21:31,054] Trial 22 finished with value: 0.4352583294123291 and parameters: {'max_bin': 375, 'num_leaves': 65}. Best is trial 21 with value: 0.43188770843094665.


[40]	training's binary_logloss: 0.347955	valid_1's binary_logloss: 0.441419
[60]	training's binary_logloss: 0.298483	valid_1's binary_logloss: 0.441139
Early stopping, best iteration is:
[50]	training's binary_logloss: 0.321785	valid_1's binary_logloss: 0.435258
[LightGBM] [Info] Number of positive: 239, number of negative: 384
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000761 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 193
[LightGBM] [Info] Number of data points in the train set: 623, number of used features: 6
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.383628 -> initscore=-0.474179
[LightGBM] [Info] Start training from score -0.474179
Training until validation scores don't improve for 10 rounds
[20]	training's binary_logloss: 0.429707	valid_1's binary_logloss: 0.477069
[40]	training's binary_logloss: 0.347955	valid_1's

[I 2025-01-12 17:21:31,302] Trial 23 finished with value: 0.4352583294123291 and parameters: {'max_bin': 412, 'num_leaves': 45}. Best is trial 21 with value: 0.43188770843094665.


[LightGBM] [Info] Number of positive: 239, number of negative: 384
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000110 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 193
[LightGBM] [Info] Number of data points in the train set: 623, number of used features: 6
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.383628 -> initscore=-0.474179
[LightGBM] [Info] Start training from score -0.474179
Training until validation scores don't improve for 10 rounds
[20]	training's binary_logloss: 0.429707	valid_1's binary_logloss: 0.477069
[40]	training's binary_logloss: 0.347955	valid_1's binary_logloss: 0.441419


[I 2025-01-12 17:21:31,591] Trial 24 finished with value: 0.4352583294123291 and parameters: {'max_bin': 299, 'num_leaves': 53}. Best is trial 21 with value: 0.43188770843094665.


[60]	training's binary_logloss: 0.298483	valid_1's binary_logloss: 0.441139
Early stopping, best iteration is:
[50]	training's binary_logloss: 0.321785	valid_1's binary_logloss: 0.435258
[LightGBM] [Info] Number of positive: 239, number of negative: 384
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000091 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 193
[LightGBM] [Info] Number of data points in the train set: 623, number of used features: 6
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.383628 -> initscore=-0.474179
[LightGBM] [Info] Start training from score -0.474179
Training until validation scores don't improve for 10 rounds


[I 2025-01-12 17:21:31,858] Trial 25 finished with value: 0.4352583294123291 and parameters: {'max_bin': 226, 'num_leaves': 105}. Best is trial 21 with value: 0.43188770843094665.


[20]	training's binary_logloss: 0.429707	valid_1's binary_logloss: 0.477069
[40]	training's binary_logloss: 0.347955	valid_1's binary_logloss: 0.441419
[60]	training's binary_logloss: 0.298483	valid_1's binary_logloss: 0.441139
Early stopping, best iteration is:
[50]	training's binary_logloss: 0.321785	valid_1's binary_logloss: 0.435258
[LightGBM] [Info] Number of positive: 239, number of negative: 384
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000108 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 193
[LightGBM] [Info] Number of data points in the train set: 623, number of used features: 6
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.383628 -> initscore=-0.474179
[LightGBM] [Info] Start training from score -0.474179
Training until validation scores don't improve for 10 rounds


[I 2025-01-12 17:21:32,292] Trial 26 finished with value: 0.4352583294123291 and parameters: {'max_bin': 308, 'num_leaves': 28}. Best is trial 21 with value: 0.43188770843094665.


[20]	training's binary_logloss: 0.429707	valid_1's binary_logloss: 0.477069
[40]	training's binary_logloss: 0.347955	valid_1's binary_logloss: 0.441419
[60]	training's binary_logloss: 0.298483	valid_1's binary_logloss: 0.441139
Early stopping, best iteration is:
[50]	training's binary_logloss: 0.321785	valid_1's binary_logloss: 0.435258
[LightGBM] [Info] Number of positive: 239, number of negative: 384
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.140109 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 193
[LightGBM] [Info] Number of data points in the train set: 623, number of used features: 6
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.383628 -> initscore=-0.474179
[LightGBM] [Info] Start training from score -0.474179


[I 2025-01-12 17:21:33,384] Trial 27 finished with value: 0.43729234128457384 and parameters: {'max_bin': 369, 'num_leaves': 24}. Best is trial 21 with value: 0.43188770843094665.


Training until validation scores don't improve for 10 rounds
[20]	training's binary_logloss: 0.429708	valid_1's binary_logloss: 0.477059
[40]	training's binary_logloss: 0.34802	valid_1's binary_logloss: 0.441359
[60]	training's binary_logloss: 0.297347	valid_1's binary_logloss: 0.441327
Early stopping, best iteration is:
[50]	training's binary_logloss: 0.320932	valid_1's binary_logloss: 0.437292
[LightGBM] [Info] Number of positive: 239, number of negative: 384
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000445 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 193
[LightGBM] [Info] Number of data points in the train set: 623, number of used features: 6
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.383628 -> initscore=-0.474179
[LightGBM] [Info] Start training from score -0.474179
Training until validation scores don't improve for 

[I 2025-01-12 17:21:33,797] Trial 28 finished with value: 0.4352583294123291 and parameters: {'max_bin': 418, 'num_leaves': 110}. Best is trial 21 with value: 0.43188770843094665.


[20]	training's binary_logloss: 0.429707	valid_1's binary_logloss: 0.477069
[40]	training's binary_logloss: 0.347955	valid_1's binary_logloss: 0.441419
[60]	training's binary_logloss: 0.298483	valid_1's binary_logloss: 0.441139
Early stopping, best iteration is:
[50]	training's binary_logloss: 0.321785	valid_1's binary_logloss: 0.435258
[LightGBM] [Info] Number of positive: 239, number of negative: 384
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.127210 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 193
[LightGBM] [Info] Number of data points in the train set: 623, number of used features: 6
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.383628 -> initscore=-0.474179
[LightGBM] [Info] Start training from score -0.474179
Training until validation scores don't improve for 10 rounds
[20]	training's binary_logloss: 0.429707	valid_1's

[I 2025-01-12 17:21:35,298] Trial 29 finished with value: 0.4352583294123291 and parameters: {'max_bin': 289, 'num_leaves': 30}. Best is trial 21 with value: 0.43188770843094665.


Display the best hyperparameters found by the study.

In [7]:
print(study.best_params)

{'max_bin': 212, 'num_leaves': 16}
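Conceptually, the best parameters are those of the trial with the smallest objective value, here the validation log loss. The following is a minimal pure-Python sketch of that selection, not Optuna's implementation. The values for trials 24, 27, and 29 are copied from the log above. The parameters for trial 21 are assumed to be the ones printed by `study.best_params`, since the log reports trial 21 as the best. The `trials` list and its dictionary layout are hypothetical.

```python
# Trial results copied from the Optuna log (objective value = validation log loss).
# The dictionary layout is a hypothetical stand-in for Optuna's trial objects.
trials = [
    {"number": 21, "value": 0.43188770843094665,
     "params": {"max_bin": 212, "num_leaves": 16}},  # assumed: params printed by study.best_params
    {"number": 24, "value": 0.4352583294123291,
     "params": {"max_bin": 299, "num_leaves": 53}},
    {"number": 27, "value": 0.43729234128457384,
     "params": {"max_bin": 369, "num_leaves": 24}},
    {"number": 29, "value": 0.4352583294123291,
     "params": {"max_bin": 289, "num_leaves": 30}},
]

# When the objective is minimized, the best trial is the one with the lowest value.
best_trial = min(trials, key=lambda trial: trial["value"])

print(best_trial["number"])  # 21
print(best_trial["params"])  # {'max_bin': 212, 'num_leaves': 16}
```

Several trials in the log share the value 0.4352583294123291 despite different parameters, which suggests that on this small dataset many settings produce the same trees.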


Retrain the model with the best hyperparameters and predict on the test data.

In [8]:
# Set the best hyperparameters
params = {
    "objective": "binary",  # binary classification
    "max_bin": study.best_params["max_bin"],  # maximum number of bins per feature
    "learning_rate": 0.05,  # learning rate
    "num_leaves": study.best_params["num_leaves"]  # maximum number of leaves per tree
}

# Create the datasets
lgb_train = lgb.Dataset(x_train, t_train, categorical_feature=categorical_features)
lgb_val = lgb.Dataset(x_valid, t_valid, reference=lgb_train, categorical_feature=categorical_features)

# Train the model
verbose_eval = 20  # logging interval for the training progress
model = lgb.train(params, lgb_train, valid_sets=[lgb_train, lgb_val],
                    num_boost_round=500,  # maximum number of boosting rounds
                    callbacks=[lgb.early_stopping(stopping_rounds=10,  # stop after 10 consecutive rounds without improvement
                                                verbose=True),
                                lgb.log_evaluation(verbose_eval)])

y_test = model.predict(x_test, num_iteration=model.best_iteration)  # predict with the trained model at its best iteration

[LightGBM] [Info] Number of positive: 239, number of negative: 384
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000094 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 252
[LightGBM] [Info] Number of data points in the train set: 623, number of used features: 6
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.383628 -> initscore=-0.474179
[LightGBM] [Info] Start training from score -0.474179
Training until validation scores don't improve for 10 rounds
[20]	training's binary_logloss: 0.430343	valid_1's binary_logloss: 0.476566
[40]	training's binary_logloss: 0.352972	valid_1's binary_logloss: 0.435083
[60]	training's binary_logloss: 0.309209	valid_1's binary_logloss: 0.433896
Early stopping, best iteration is:
[50]	training's binary_logloss: 0.3286	valid_1's binary_logloss: 0.431888
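The log reports a best iteration of 50 even though training ran past iteration 60, because early stopping waits for 10 rounds without improvement before giving up. The sketch below illustrates that rule in plain Python. It is a simplified illustration, not LightGBM's implementation; the function name `best_iteration_with_early_stopping` and the loss values are hypothetical.

```python
def best_iteration_with_early_stopping(valid_losses, stopping_rounds):
    """Return (best_iteration, best_loss) under a simple patience rule.

    Iterations are 1-indexed. Training stops once `stopping_rounds`
    iterations have passed since the last improvement.
    """
    best_iteration = 0
    best_loss = float("inf")
    for iteration, loss in enumerate(valid_losses, start=1):
        if loss < best_loss:
            # New best validation loss: remember it and where it occurred.
            best_loss = loss
            best_iteration = iteration
        elif iteration - best_iteration >= stopping_rounds:
            # No improvement for `stopping_rounds` iterations: stop here.
            break
    return best_iteration, best_loss


# Hypothetical validation losses; the final 0.43 is never reached
# because training stops at iteration 5 (two rounds after the best).
losses = [0.50, 0.45, 0.44, 0.46, 0.47, 0.43]
result = best_iteration_with_early_stopping(losses, stopping_rounds=2)
print(result)  # (3, 0.44)
```

Passing `num_iteration=model.best_iteration` to `predict` then uses only the trees built up to that best iteration rather than all trees trained before the stop.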


## Submission data
Format the submission data and save it as a CSV file.

In [9]:
# Convert the predicted probabilities to 0 or 1
y_test = (y_test > 0.5).astype(int)

# Arrange the data in the submission format
survived_test = pd.Series(y_test, name="Survived")
subm_data = pd.concat([test_id, survived_test], axis=1)

# Save the submission CSV file
subm_data.to_csv("submission_titanic_hp.csv", index=False)

subm_data

| | PassengerId | Survived |
|---|---|---|
| 0 | 892 | 0 |
| 1 | 893 | 0 |
| 2 | 894 | 0 |
| 3 | 895 | 0 |
| 4 | 896 | 0 |
| ... | ... | ... |
| 413 | 1305 | 0 |
| 414 | 1306 | 1 |
| 415 | 1307 | 0 |
| 416 | 1308 | 0 |
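The `Survived` column comes from thresholding the predicted probabilities at 0.5. As a small illustration of that step, the sketch below does the same mapping in plain Python. The helper name `to_labels` and the example probabilities are hypothetical; the notebook itself uses the NumPy expression `(y_test > 0.5).astype(int)`.

```python
def to_labels(probabilities, threshold=0.5):
    """Map predicted probabilities to 0/1 labels.

    The comparison is strict, so a probability exactly equal to the
    threshold is mapped to 0, matching `y_test > 0.5` in the notebook.
    """
    return [int(p > threshold) for p in probabilities]


example_probabilities = [0.12, 0.5, 0.73]  # hypothetical model outputs
labels = to_labels(example_probabilities)
print(labels)  # [0, 0, 1]
```

The 0.5 cutoff is a convention rather than a requirement; a different threshold trades off false positives against false negatives.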
