This notebook does two things:

* Check how Category_encoders behaves
* Check how the Sklearn pipeline behaves

In [1]:
import category_encoders as ce
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score
from sklearn.pipeline import Pipeline
from catboost import CatBoostClassifier

## Checking how Category_encoders behaves

We use the Titanic dataset. For details on the data, see [the Kaggle description](https://www.kaggle.com/c/titanic/data).

In [2]:
df = sns.load_dataset('titanic')
df.head()

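| | survived | pclass | sex | age | sibsp | parch | fare | embarked | class | who | adult_male | deck | embark_town | alive | alone |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.25 | S | Third | man | True | NaN | Southampton | no | False |
| 1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C | First | woman | False | C | Cherbourg | yes | False |
| 2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.925 | S | Third | woman | False | NaN | Southampton | yes | True |
| 3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1 | S | First | woman | False | C | Southampton | yes | False |
| 4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.05 | S | Third | man | True | NaN | Southampton | no | True |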


Select the features to use for inference.

In [3]:
# Features to use
feature_names = [
    'class',
    'sex',
    'age',
    'sibsp',
    'parch',
    'fare',
    'embark_town',
    'deck',
]
df_x = df[feature_names]
df_y = df['survived']
print(type(df_x))
print(type(df_y))
df_x.head()

<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.series.Series'>


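| | class | sex | age | sibsp | parch | fare | embark_town | deck |
|---|---|---|---|---|---|---|---|---|
| 0 | Third | male | 22.0 | 1 | 0 | 7.25 | Southampton | NaN |
| 1 | First | female | 38.0 | 1 | 0 | 71.2833 | Cherbourg | C |
| 2 | Third | female | 26.0 | 0 | 0 | 7.925 | Southampton | NaN |
| 3 | First | female | 35.0 | 1 | 0 | 53.1 | Southampton | C |
| 4 | Third | male | 35.0 | 0 | 0 | 8.05 | Southampton | NaN |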


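Apply one-hot encoding to the categorical columns. The output confirms that the numeric columns are left unchanged and only the categorical columns are encoded. The numeric suffixes follow the order in which each category first appears (for example, `class_1` corresponds to `Third`), and the missing values in `deck` get a column of their own (`deck_1`).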

In [4]:
cols = ['class', 'sex', 'embark_town', 'deck']
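# Note: 'impute' is the option name used by older category_encoders releases;
# newer releases appear to document 'value', 'error', 'return_nan' and 'indicator' instead.
encoder = ce.OneHotEncoder(cols=cols, handle_unknown='impute')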
df_x = encoder.fit_transform(df_x)
df_x.head()

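| | class_1 | class_2 | class_3 | sex_1 | sex_2 | age | sibsp | parch | fare | embark_town_1 | ... | embark_town_3 | embark_town_4 | deck_1 | deck_2 | deck_3 | deck_4 | deck_5 | deck_6 | deck_7 | deck_8 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 0 | 1 | 0 | 22.0 | 1 | 0 | 7.25 | 1 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 1 | 0 | 0 | 1 | 38.0 | 1 | 0 | 71.2833 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 1 | 0 | 0 | 0 | 1 | 26.0 | 0 | 0 | 7.925 | 1 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 1 | 0 | 0 | 1 | 35.0 | 1 | 0 | 53.1 | 1 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 1 | 0 | 0 | 1 | 0 | 35.0 | 0 | 0 | 8.05 | 1 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |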

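To make the column layout above concrete, here is a minimal pure-Python sketch of one-hot encoding that numbers categories by first appearance and treats a missing value as its own category. The helper `one_hot` is hypothetical and is not how Category_encoders is implemented; the input lists are the first five `class` and `deck` values shown earlier, with `None` standing in for a missing value.

```python
def one_hot(values):
    """One-hot encode a list, numbering categories by first appearance.

    None stands in for a missing value and is treated as its own category.
    """
    categories = []
    for v in values:
        if v not in categories:
            categories.append(v)
    # One indicator column per category, in order of first appearance.
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows


# First five values of the `class` and `deck` columns.
class_values = ["Third", "First", "Third", "First", "Third"]
deck_values = [None, "C", None, "C", None]

class_cats, class_rows = one_hot(class_values)
deck_cats, deck_rows = one_hot(deck_values)

print(class_cats)  # ['Third', 'First']
print(class_rows)  # [[1, 0], [0, 1], [1, 0], [0, 1], [1, 0]]
print(deck_cats)   # [None, 'C']
print(deck_rows)   # [[1, 0], [0, 1], [1, 0], [0, 1], [1, 0]]
```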

## Checking how the Sklearn pipeline behaves

Reload the data and split it into training and test sets.

In [5]:
df = sns.load_dataset('titanic')
df_x = df[feature_names]
df_y = df["survived"]
x_tr, x_vl, y_tr, y_vl = train_test_split(df_x, df_y, test_size=0.33, shuffle=True, random_state=42)
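
As a rough illustration of what a shuffled hold-out split with a fixed seed does, the sketch below splits row indices in pure Python. The helper `shuffled_split` is hypothetical, the row count of 891 is an assumption about the seaborn Titanic dataset, and rounding the test size up is assumed to match scikit-learn. The partition itself will not be the one `train_test_split` produces, since the random number generators differ.

```python
import math
import random


def shuffled_split(n_samples, test_size, seed):
    """Return (train_indices, test_indices) for a shuffled hold-out split."""
    n_test = math.ceil(test_size * n_samples)  # test size is rounded up
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # fixed seed gives the same split each run
    return indices[n_test:], indices[:n_test]


# Assumed row count of the Titanic dataset: 891.
train_idx, test_idx = shuffled_split(n_samples=891, test_size=0.33, seed=42)
print(len(train_idx), len(test_idx))  # 596 295
```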

Apply one-hot encoding and train the model. The encoder is fitted on the training data only.

In [6]:
encoder = ce.OneHotEncoder(cols=cols, handle_unknown='impute')
x_tr_enc = encoder.fit_transform(x_tr)

clf = CatBoostClassifier(iterations=1000)
clf.fit(x_tr_enc, y_tr, verbose=False)

<catboost.core.CatBoostClassifier at 0x1084c12b0>

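Compute the predicted class probabilities for the first few training rows. Each output row is `[P(survived=0), P(survived=1)]`. Note that `df_y.head()` prints the labels of the first five rows of the full dataset; because the split shuffles the rows, these are not the labels of `x_tr_enc.head()`. Use `y_tr.head()` to compare against the matching labels.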

In [7]:
print(df_y.head())
clf.predict_proba(x_tr_enc.head())

0    0
1    1
2    1
3    1
4    0
Name: survived, dtype: int64


array([[0.79840799, 0.20159201],
       [0.86335611, 0.13664389],
       [0.84584548, 0.15415452],
       [0.82396434, 0.17603566],
       [0.68126113, 0.31873887]])
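
The following sketch shows how such a probability array can be read. The values are copied from the output above. It assumes the columns are ordered as class 0 then class 1, and that a predicted label is the class with the larger probability, which for two classes amounts to a 0.5 threshold.

```python
# Probabilities copied from the predict_proba output.
proba = [
    [0.79840799, 0.20159201],
    [0.86335611, 0.13664389],
    [0.84584548, 0.15415452],
    [0.82396434, 0.17603566],
    [0.68126113, 0.31873887],
]

# Each row holds [P(survived=0), P(survived=1)] and sums to 1.
row_sums = [sum(row) for row in proba]

# Pick the index of the larger probability in each row.
labels = [max(range(len(row)), key=row.__getitem__) for row in proba]
print(labels)  # [0, 0, 0, 0, 0]
```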

Evaluate performance on the training data using the F1 score.

In [8]:
f1_score(y_tr, clf.predict(x_tr_enc))

0.8725490196078431
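
For reference, the binary F1 score is the harmonic mean of precision and recall. The sketch below computes it by hand on a small made-up example. The helper `f1_binary` and the label lists are hypothetical and are not taken from the Titanic data.

```python
def f1_binary(y_true, y_pred, positive=1):
    """F1 score for one positive class, computed from TP, FP and FN counts."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up labels: 3 true positives, 1 false positive, 1 false negative.
y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 1, 1]
score = f1_binary(y_true, y_pred)
print(score)  # 0.75
```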

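Build a pipeline for inference on the test data. Both `encoder` and `clf` were already fitted on the training data, so the pipeline is used as-is for prediction and `pipe.fit` is not called.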

In [9]:
steps = [
    ('preprocessing', encoder),
    ('classification', clf)
]
pipe = Pipeline(steps)
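
Conceptually, predicting with a pipeline of fitted steps means passing the input through each preprocessing step's `transform` and then calling `predict` on the final step. The toy classes below (`TinyPipeline`, `SexEncoder`, `RuleModel`) are hypothetical stand-ins written for illustration; they are not scikit-learn's implementation.

```python
class TinyPipeline:
    """Toy stand-in for a pipeline of already-fitted steps."""

    def __init__(self, steps):
        self.steps = steps  # list of (name, fitted_object) pairs

    def predict(self, rows):
        *transformers, last = self.steps
        # Run every step except the last as a transformer.
        for _, transformer in transformers:
            rows = transformer.transform(rows)
        # The last step is the model that produces predictions.
        return last[1].predict(rows)


class SexEncoder:
    """Hypothetical fitted encoder: maps a category to a one-hot pair."""

    mapping = {"male": [1, 0], "female": [0, 1]}

    def transform(self, rows):
        return [self.mapping[r] for r in rows]


class RuleModel:
    """Hypothetical fitted model: predicts 1 when the second indicator is set."""

    def predict(self, rows):
        return [row[1] for row in rows]


toy_pipe = TinyPipeline([
    ("preprocessing", SexEncoder()),
    ("classification", RuleModel()),
])
toy_predictions = toy_pipe.predict(["male", "female", "female"])
print(toy_predictions)  # [0, 1, 1]
```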

Apply the pipeline to the test data.

In [10]:
pipe.predict_proba(x_vl.head())

array([[0.35825813, 0.64174187],
       [0.86556978, 0.13443022],
       [0.84418872, 0.15581128],
       [0.02557658, 0.97442342],
       [0.51861138, 0.48138862]])

Evaluate performance on the test data using the F1 score.

In [11]:
f1_score(y_vl, pipe.predict(x_vl))

0.7377777777777778