We train a Logistic Regression model that predicts whether a passenger survived, using the Kaggle Titanic dataset. See the link below for a detailed description of the data. Here we load a copy that was downloaded in advance.

https://www.kaggle.com/c/titanic/data

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

data_path = 'C:/Users/gmlkd/data/titanic_train.csv'
titanic = pd.read_csv(data_path, index_col='PassengerId')
titanic.head()

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


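The cell below converts the data into a numeric matrix: it drops the `Name` and `Ticket` columns, one-hot encodes `Embarked`, `Pclass`, and an age group (child under 20, adult otherwise, unknown when missing), and maps `Sex` to 0/1.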

In [3]:
titanic.drop(['Name', 'Ticket'], axis=1, inplace=True)
embark_dummy = pd.get_dummies(titanic['Embarked'], prefix='port')

# Start from 'adult', then overwrite children and missing ages
age_group = pd.Series('adult', index=titanic.index, name='AgeGroup')
age_group[titanic['Age'] < 20] = 'child'
age_group[titanic['Age'].isnull()] = 'unknown'
age_dummy = pd.get_dummies(age_group, prefix='Age')

pclass_dummy = pd.get_dummies(titanic['Pclass'], prefix='Pclass')
titanic['Sex'] = titanic['Sex'].map({'female':1, 'male':0})

titanic = pd.concat([titanic, pclass_dummy, embark_dummy, age_dummy], axis=1)
titanic.head()

PassengerId,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Cabin,Embarked,Pclass_1,Pclass_2,Pclass_3,port_C,port_Q,port_S,Age_adult,Age_child,Age_unknown
1,0,3,0,22.0,1,0,7.25,,S,0,0,1,0,0,1,1,0,0
2,1,1,1,38.0,1,0,71.2833,C85,C,1,0,0,1,0,0,1,0,0
3,1,3,1,26.0,0,0,7.925,,S,0,0,1,0,0,1,1,0,0
4,1,1,1,35.0,1,0,53.1,C123,S,1,0,0,0,0,1,1,0,0
5,0,3,0,35.0,0,0,8.05,,S,0,0,1,0,0,1,1,0,0
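
The age-group step above can also be written as a single vectorized expression. The sketch below is one possible alternative, not the notebook's method: it applies the same thresholds with `np.select` to a small made-up array of ages (the values and the name `ages` are hypothetical).

```python
import numpy as np

# Hypothetical ages for illustration; NaN stands for a missing value
ages = np.array([22.0, 5.0, np.nan, 20.0])

# Conditions are checked in order and the first match wins;
# rows matching no condition fall back to the default
age_group = np.select(
    [np.isnan(ages), ages < 20],
    ['unknown', 'child'],
    default='adult',
)
print(age_group.tolist())  # ['adult', 'child', 'unknown', 'adult']
```

The resulting labels can be wrapped in a `pd.Series` and passed to `pd.get_dummies(..., prefix='Age')` in the same way as in the cell above.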


Next, we split the data into training and test sets at a 4:1 ratio.

In [4]:
from sklearn.model_selection import train_test_split

def make_train_data(input_names, output_name):
    X = titanic[input_names].to_numpy()
    y = titanic[output_name].to_numpy()
    print(f'shape of X = {X.shape}')
    print(f'shape of y = {y.shape}')
    return X, y

input_names = 'Pclass_1 Pclass_2 Pclass_3 Sex SibSp Parch Fare port_C port_Q port_S Age_adult Age_child Age_unknown'.split()
output_name = 'Survived'

X, y = make_train_data(input_names, output_name)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

shape of X = (891, 13)
shape of y = (891,)
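
As a quick check of the 4:1 split, the set sizes can be worked out by hand. The sketch below assumes that scikit-learn rounds the test-set size up when `test_size` is a fraction and gives the remaining rows to the training set; the variable names are only for illustration.

```python
import math

n_samples = 891      # rows in X, from the shape printed above
test_size = 0.2      # fraction passed to train_test_split

# Assumption: the test-set size is rounded up, the rest goes to training
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test
print(n_train, n_test)  # 712 179
```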


In [5]:
from sklearn.linear_model import LogisticRegression
from sklearn import metrics

lr = LogisticRegression(penalty='l2', C=1.0, solver='lbfgs', max_iter=1000)
lr.fit(X_train, y_train)
y_pred_lr = lr.predict(X_test)

def check_accuracy(y_answer, y_pred):
    # Fraction of predictions that match the true labels
    accuracy = (y_answer == y_pred).sum() / y_pred.shape[0]
    return accuracy

accuracy_lr = check_accuracy(y_test, y_pred_lr)
accuracy_lr

0.7877094972067039
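
The accuracy printed above is the share of test rows whose predicted label matches the true label. The sketch below reproduces the calculation on made-up label arrays: 179 rows with 141 correct predictions, a count chosen because it is consistent with the reported value. The array names are hypothetical.

```python
import numpy as np

# Made-up labels: 179 test rows, of which 141 predictions are correct
y_true = np.zeros(179, dtype=int)
y_hat = np.zeros(179, dtype=int)
y_hat[:38] = 1   # 38 wrong predictions

# Mean of the equality mask = fraction of correct predictions
accuracy = float(np.mean(y_true == y_hat))
print(round(accuracy, 4))  # 0.7877
```

`metrics.accuracy_score(y_test, y_pred_lr)` from the already imported `sklearn.metrics` computes the same quantity.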

Next, we train a classifier on the same data with XGBoost. The XGBoost version is 0.90 or 1.6.1 (the run recorded below reports 1.5.1).

In [6]:
!pip install xgboost



In [None]:
# Write the versions of all installed packages to a txt file
# !pip freeze > requirements.txt

In [7]:
import warnings
import xgboost as xgb

print(f'xgboost=={xgb.__version__}')

xgboost==1.5.1




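Like scikit-learn, XGBoost makes training straightforward. We first set the number of boosted trees to a small value, 10. The test accuracy is close to that of logistic regression (0.793 versus 0.788).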

In [8]:
xgb_clf = xgb.XGBClassifier(
    n_estimators = 10,
    max_depth = 4,
    booster = 'gbtree',
    eta = 0.3,
    gamma = 0,
    verbosity = 1,  # replaces the removed `silent` parameter
    objective = 'binary:logistic',
    nthread = 4,
    base_score = 0.5,
)

xgb_clf.fit(X_train, y_train)
y_pred_xgb = xgb_clf.predict(X_test)
check_accuracy(y_test, y_pred_xgb)

0.7932960893854749

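Because XGBoost defines and trains its base estimators sequentially, we can first overfit as much as we like and then limit the number of base estimators used at prediction time. The `ntree_limit` argument sets that number. We first overfit with 500 boosted trees, then measure prediction accuracy while varying the number of trees from 10 to 500 in steps of 10.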

In [9]:
xgb_clf = xgb.XGBClassifier(
    n_estimators = 500,
    max_depth = 4,
    booster = 'gbtree',
    eta = 0.3,
    gamma = 0,
    verbosity = 1,  # replaces the removed `silent` parameter
    objective = 'binary:logistic',
    nthread = 4,
    base_score = 0.5,
)
xgb_clf.fit(X_train, y_train)

# (n_estimators, train accuracy, test accuracy)
performances = []
for n_estimators in range(10, 501, 10):
    # ntree_limit: number of trees used for this prediction
    y_pred_train_xgb = xgb_clf.predict(X_train, ntree_limit=n_estimators)
    y_pred_test_xgb = xgb_clf.predict(X_test, ntree_limit=n_estimators)
    train_accuracy = check_accuracy(y_train, y_pred_train_xgb)
    test_accuracy = check_accuracy(y_test, y_pred_test_xgb)
    performances.append((n_estimators, train_accuracy, test_accuracy))


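Plotting accuracy against the number of trees shows that training accuracy keeps rising, while test accuracy climbs to about 0.82 and then declines. This is overfitting: the training set is small (712 rows after the 4:1 split of 891), and the number of trees is far too large for it.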
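
Given a list shaped like `performances`, the tree count with the highest test accuracy can be read off directly. The sketch below uses made-up numbers, not the notebook's results, and the name `performances_demo` is hypothetical. Choosing the count on the test set makes the reported accuracy optimistic, so a separate validation split would normally be used for this step.

```python
# (n_estimators, train accuracy, test accuracy); hypothetical values
performances_demo = [
    (10, 0.88, 0.79),
    (20, 0.91, 0.82),
    (30, 0.93, 0.81),
    (40, 0.95, 0.80),
]

# Pick the entry with the highest test accuracy (index 2 of each tuple)
best_n, best_train, best_test = max(performances_demo, key=lambda row: row[2])
print(best_n, best_test)  # 20 0.82
```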

In fact, XGBoost is not the best fit for a dataset as small as Titanic. It works much better on very large datasets, but here we only look at how to use it.

In [10]:
!pip install bokeh

Collecting bokeh
  Downloading bokeh-2.4.3-py3-none-any.whl (18.5 MB)
     --------------------------------------- 18.5/18.5 MB 54.7 MB/s eta 0:00:00
Installing collected packages: bokeh
Successfully installed bokeh-2.4.3


In [11]:
from bokeh.plotting import figure, show, output_notebook, save
from bokeh.layouts import gridplot

output_notebook()

n_estimators, train_accuracy, test_accuracy = zip(*performances)
p = figure(width=800, height=400, title='Accuracy by n_estimators (XGB)')
p.line(n_estimators, train_accuracy, line_width=2, line_color='orange', legend_label='Train')
p.line(n_estimators, test_accuracy, line_width=2, line_color='blue', legend_label='Test')
p.xaxis.axis_label = 'n estimators'
p.yaxis.axis_label = 'accuracy'
show(p)

A regression model is used the same way. The only change is that `objective`, the loss function, must be set to one suited for regression.
The code below is an example.

In [None]:
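xgb_reg = xgb.XGBRegressor(
    n_estimators = 500,
    max_depth = 4,
    booster = 'gbtree',
    eta = 0.3,
    gamma = 0,
    verbosity = 1,  # replaces the removed `silent` parameter
    objective = 'reg:squarederror',
    nthread = 4,
    base_score = 0.5,
)
# x (a 1-D feature array) and y (continuous targets) are placeholders;
# this cell was not run in the notebook
xgb_reg = xgb_reg.fit(x.reshape(-1, 1), y)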
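
To make the role of `objective` concrete: gradient boosting fits each new tree using the first and second derivatives of the loss with respect to the current prediction. The sketch below writes these out in NumPy for squared error and for the logistic loss. It illustrates the math only and is not XGBoost's internal code; the function names are hypothetical.

```python
import numpy as np

def squared_error_grad_hess(pred, y):
    # Loss 0.5 * (pred - y)^2: gradient is the residual, hessian is 1
    return pred - y, np.ones_like(pred)

def logistic_grad_hess(margin, y):
    # Log loss on a raw margin, with p = sigmoid(margin)
    p = 1.0 / (1.0 + np.exp(-margin))
    return p - y, p * (1.0 - p)

targets = np.array([1.0, 0.0])
start = np.array([0.0, 0.0])   # initial raw prediction for both rows

reg_grad, reg_hess = squared_error_grad_hess(start, targets)
clf_grad, clf_hess = logistic_grad_hess(start, targets)
print(reg_grad, reg_hess)  # [-1.  0.] [1. 1.]
print(clf_grad, clf_hess)  # [-0.5  0.5] [0.25 0.25]
```

Switching `objective` from `'binary:logistic'` to `'reg:squarederror'` corresponds to switching from the second pair of derivatives to the first.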