<a href="https://colab.research.google.com/github/H-yana/colab/blob/main/Titanic_LightGBM.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Loading the data

In [None]:
import numpy as np
import pandas as pd
import lightgbm as lgb
from sklearn.model_selection import train_test_split
train = pd.read_csv("/content/drive/MyDrive/titanic/train.csv") # training data
test = pd.read_csv("/content/drive/MyDrive/titanic/test.csv") # test data

In [None]:
train.head() # show the first 5 rows

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [None]:
print(train.shape) # check rows and columns => (891, 12)
print(test.shape) # check rows and columns => (418, 11)

(891, 12)
(418, 11)


Missing-value imputation

In [None]:
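# Count missing values per column with isnull() and return them as a table
def null_table(df):
    null_val = df.isnull().sum()          # number of missing values per column
    percent = null_val/len(df)*100        # share of rows that are missing, in percent
    ret = pd.concat([null_val, percent], axis=1)
    ret = ret.rename(columns = {0:"欠損数", 1:"%"})  # "欠損数" = number of missing values
    return ret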
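
As a quick sanity check, here is a minimal sketch that applies the same counting logic to a small made-up DataFrame. The `toy` frame is hypothetical and not part of the Titanic data.

```python
import numpy as np
import pandas as pd

def null_table(df):
    # Count missing values per column and express them as a percentage of rows
    null_val = df.isnull().sum()
    percent = null_val / len(df) * 100
    ret = pd.concat([null_val, percent], axis=1)
    return ret.rename(columns={0: "欠損数", 1: "%"})

# Hypothetical toy data: one missing Age out of four rows
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Sex": ["male", "female", "female", "male"],
})
summary = null_table(toy)
print(summary)
```

The `Age` row of `summary` reports one missing value, which is 25% of the four rows.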

In [None]:
null_table(train)

Unnamed: 0,欠損数,%
PassengerId,0,0.0
Survived,0,0.0
Pclass,0,0.0
Name,0,0.0
Sex,0,0.0
Age,177,19.86532
SibSp,0,0.0
Parch,0,0.0
Ticket,0,0.0
Fare,0,0.0


In [None]:
null_table(test)

Unnamed: 0,欠損数,%
PassengerId,0,0.0
Pclass,0,0.0
Name,0,0.0
Sex,0,0.0
Age,86,20.574163
SibSp,0,0.0
Parch,0,0.0
Ticket,0,0.0
Fare,1,0.239234
Cabin,327,78.229665


Fill missing values with substitute values

In [None]:
train["Age"] = train["Age"].fillna(train["Age"].median()) # fill with the median
train["Embarked"] = train["Embarked"].fillna("S") # fill with the mode (most frequent value)
test["Age"] = test["Age"].fillna(test["Age"].median()) # fill with the median
test["Fare"] = test["Fare"].fillna(test["Fare"].mean()) # fill with the mean
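
The cell above fills each frame with its own statistics. A common alternative, sketched here on made-up data, is to compute the fill values from the training frame only and reuse them for the test frame, so that both sets are transformed identically. The frames `toy_train` and `toy_test` are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frames standing in for train and test
toy_train = pd.DataFrame({
    "Age": [20.0, 30.0, np.nan, 40.0],
    "Embarked": ["S", "S", None, "C"],
})
toy_test = pd.DataFrame({
    "Age": [np.nan, 50.0],
    "Embarked": ["Q", None],
})

# Fill values are derived from the training frame only
age_median = toy_train["Age"].median()
embarked_mode = toy_train["Embarked"].mode()[0]

for df in (toy_train, toy_test):
    df["Age"] = df["Age"].fillna(age_median)
    df["Embarked"] = df["Embarked"].fillna(embarked_mode)

print(toy_test)
```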

Convert strings to numbers so the model can interpret them

In [None]:
# Encode the string categories as integers.
# Series.map assigns each column in one step, which avoids the chained
# indexing (df["col"][mask] = value) that triggers SettingWithCopyWarning.
sex_map = {"male": 0, "female": 1}
embarked_map = {"S": 0, "C": 1, "Q": 2}
for df in (train, test):
    df["Sex"] = df["Sex"].map(sex_map).astype(int)
    df["Embarked"] = df["Embarked"].map(embarked_map).astype(int)
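
If the encoding is written with `Series.map`, one caveat is that labels missing from the mapping become NaN, and a later `astype(int)` then raises. The sketch below, on a hypothetical toy frame with a deliberately unknown label, checks for unmapped labels before casting.

```python
import pandas as pd

# "X" is a deliberately unknown label (hypothetical data)
toy = pd.DataFrame({"Embarked": ["S", "C", "Q", "X"]})
embarked_map = {"S": 0, "C": 1, "Q": 2}

encoded = toy["Embarked"].map(embarked_map)

# Labels that the mapping did not cover
unmapped = toy.loc[encoded.isna(), "Embarked"].unique().tolist()
print(unmapped)

# Cast only the rows that were mapped successfully
clean = encoded.dropna().astype(int)
print(clean.tolist())
```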

Select the feature columns

In [None]:
x_train = train[["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked"]] # features
y_train = train[["Survived"]] # target variable

Hold out validation data to guard against overfitting

In [None]:
x_trn, x_val, y_trn, y_val = train_test_split(x_train, y_train, test_size=0.1) # hold out 10% as validation data
train_data = lgb.Dataset(x_trn, label=y_trn) # training data
val_data = lgb.Dataset(x_val, label=y_val) # validation data
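
Because the split above is random, the validation score changes from run to run. A minimal sketch, assuming reproducibility and class balance are wanted, passes `random_state` and `stratify` to `train_test_split`. The toy arrays and the seed value 0 are illustrative choices, not taken from the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 rows, 60 negatives and 40 positives
X = np.arange(200).reshape(100, 2)
y = np.array([0] * 60 + [1] * 40)

# random_state fixes the shuffle; stratify keeps the class ratio in both parts
X_trn, X_val, y_trn, y_val = train_test_split(
    X, y, test_size=0.1, random_state=0, stratify=y
)
print(X_trn.shape, X_val.shape, int(y_val.sum()))
```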

Hyperparameter settings

In [None]:
parameter = {
    "objective": "binary" # binary classification; labels are 0 or 1
}
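
Only the objective is set here, so every other setting falls back to LightGBM's defaults. As a sketch, a few commonly tuned parameters can be written out explicitly. The values below are illustrative assumptions (mostly the library defaults), not settings used in this notebook.

```python
# Illustrative parameter dictionary; parameter_sketch is a hypothetical name
parameter_sketch = {
    "objective": "binary",        # binary classification, labels 0 or 1
    "metric": "binary_logloss",   # metric reported on the validation sets
    "learning_rate": 0.1,         # LightGBM default
    "num_leaves": 31,             # LightGBM default
    "seed": 0,                    # fixed seed (illustrative choice)
    "verbose": -1,                # suppress info and warning logs
}
print(parameter_sketch)
```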

Train the model

In [None]:
model = lgb.train(
    params = parameter,
    train_set = train_data,
    valid_sets = [train_data, val_data],
    num_boost_round = 10000,
    early_stopping_rounds = 100,
    verbose_eval = 200 # print the logloss every 200 iterations
)

Training until validation scores don't improve for 100 rounds.
[200]	training's binary_logloss: 0.112271	valid_1's binary_logloss: 0.318879
Early stopping, best iteration is:
[122]	training's binary_logloss: 0.155988	valid_1's binary_logloss: 0.294761


Create the predictions

In [None]:
testcase = test[["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked"]]
prediction = model.predict(testcase, num_iteration=model.best_iteration)
# values below 0.5 become 0, everything else becomes 1
prediction = np.where(prediction<0.5, 0, 1)
print(prediction)

[0 0 0 1 0 1 0 0 1 0 0 0 1 0 1 1 0 1 0 0 1 0 1 1 1 0 1 1 1 1 0 0 1 0 1 0 0
 0 0 1 0 1 0 1 1 0 0 0 1 1 1 0 1 1 0 0 0 0 0 1 0 0 0 1 1 1 1 0 0 1 1 0 0 0
 1 1 0 1 0 1 1 0 0 0 0 0 1 1 1 1 0 1 1 0 1 0 1 0 0 0 1 0 0 0 1 0 0 0 0 0 0
 1 1 1 1 0 0 1 1 1 1 0 1 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0
 1 0 1 0 0 0 0 0 1 0 1 0 1 1 1 1 0 0 0 0 1 0 0 0 0 0 0 1 1 1 1 1 0 1 1 0 1
 0 1 1 0 0 0 0 1 0 1 0 1 0 0 1 1 1 0 1 0 1 0 0 1 0 0 0 0 1 0 0 1 0 1 0 1 0
 1 0 0 1 0 1 0 0 1 1 0 0 1 0 1 1 1 1 0 1 0 0 0 0 1 0 1 0 1 0 0 0 0 0 0 0 1
 0 0 0 1 1 0 0 1 0 0 0 0 0 1 1 0 1 0 0 0 0 0 1 1 0 1 0 0 0 0 0 0 0 0 0 0 0
 1 1 0 0 0 0 0 0 1 1 0 1 0 0 0 1 0 0 1 1 0 0 0 0 0 0 0 1 1 0 1 0 0 0 1 0 0
 1 0 0 0 0 1 0 0 0 0 1 0 1 0 0 0 1 1 0 0 0 1 0 1 0 0 0 0 1 1 1 1 0 0 0 1 1
 0 1 0 0 1 1 0 0 0 1 0 0 0 1 0 1 0 0 0 0 0 1 1 0 0 1 0 1 0 0 1 0 1 0 0 0 0
 0 1 1 1 1 0 0 1 0 0 1]
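
With the binary objective, `model.predict` returns the probability of class 1, which is why a threshold is applied. The sketch below repeats the thresholding on made-up probabilities and scores the result against made-up labels with `accuracy_score`. In the notebook the same check could be run on `x_val` and `y_val`; that is a suggestion, not something the original does.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical predicted probabilities and true labels
proba = np.array([0.10, 0.80, 0.50, 0.30, 0.95])
labels = np.array([0, 1, 0, 0, 1])

# Below 0.5 becomes 0; 0.5 and above becomes 1
pred = np.where(proba < 0.5, 0, 1)
acc = accuracy_score(labels, pred)
print(pred, acc)
```

Note that a probability of exactly 0.5 is assigned to class 1 under this rule.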


In [None]:
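passenger_id = np.array(test["PassengerId"]).astype(int) # named passenger_id to avoid shadowing the built-in id()
submission = pd.DataFrame(prediction, passenger_id, columns=["Survived"]) # PassengerId becomes the index
submission.to_csv("prediction.csv", index_label=["PassengerId"]) # write the submission file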
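
Before uploading, it can help to read the file back and confirm its layout. The sketch below builds a small hypothetical submission in memory (the IDs and predictions are made up) and checks for the two columns, `PassengerId` and `Survived`, that the cell above writes.

```python
import io
import numpy as np
import pandas as pd

# Hypothetical IDs and predictions
passenger_id = np.array([892, 893, 894])
pred = np.array([0, 1, 0])

submission = pd.DataFrame({"PassengerId": passenger_id, "Survived": pred})

# Write to an in-memory buffer instead of a file, then read it back
buffer = io.StringIO()
submission.to_csv(buffer, index=False)
buffer.seek(0)
check = pd.read_csv(buffer)

print(check.columns.tolist(), check.shape)
```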