This notebook does two things:

* Verify the behavior of `category_encoders`
* Verify the behavior of scikit-learn's `Pipeline`

In [1]:
import category_encoders as ce
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score
from sklearn.pipeline import Pipeline
from catboost import CatBoostClassifier

## Verifying category_encoders

We use the Titanic dataset. For details on the data, see [Kaggle's description](https://www.kaggle.com/c/titanic/data).

In [2]:
df = sns.load_dataset('titanic')
df.head()

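| | survived | pclass | sex | age | sibsp | parch | fare | embarked | class | who | adult_male | deck | embark_town | alive | alone |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.25 | S | Third | man | True | NaN | Southampton | no | False |
| 1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C | First | woman | False | C | Cherbourg | yes | False |
| 2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.925 | S | Third | woman | False | NaN | Southampton | yes | True |
| 3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1 | S | First | woman | False | C | Southampton | yes | False |
| 4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.05 | S | Third | man | True | NaN | Southampton | no | True |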


Select the features used for prediction.

In [3]:
# Features to use
feature_names = [
    'class',
    'sex',
    'age',
    'sibsp',
    'parch',
    'fare',
    'embark_town',
    'deck',
]
df_x = df[feature_names]
df_y = df['survived']
print(type(df_x))
print(type(df_y))
df_x.head()

<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.series.Series'>


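| | class | sex | age | sibsp | parch | fare | embark_town | deck |
|---|---|---|---|---|---|---|---|---|
| 0 | Third | male | 22.0 | 1 | 0 | 7.25 | Southampton | NaN |
| 1 | First | female | 38.0 | 1 | 0 | 71.2833 | Cherbourg | C |
| 2 | Third | female | 26.0 | 0 | 0 | 7.925 | Southampton | NaN |
| 3 | First | female | 35.0 | 1 | 0 | 53.1 | Southampton | C |
| 4 | Third | male | 35.0 | 0 | 0 | 8.05 | Southampton | NaN |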


Check the values each categorical column can take.

In [4]:
# Names of the categorical columns
cols = ['class', 'sex', 'embark_town', 'deck']
for col in cols:
    print(df_x[col].unique())

['Third', 'First', 'Second']
Categories (3, object): ['First', 'Second', 'Third']
['male' 'female']
['Southampton' 'Cherbourg' 'Queenstown' nan]
[NaN, 'C', 'E', 'G', 'D', 'A', 'B', 'F']
Categories (7, object): ['A', 'B', 'C', 'D', 'E', 'F', 'G']


Based on these values, define the mappings used for encoding.

In [5]:
mapping = [
    {"col": "class", "mapping": {"First": 0, "Second": 1, "Third": 2}},
    {"col": "sex", "mapping": {"male": 0, "female": 1}},
    {"col": "embark_town", "mapping": {"Southampton": 0, "Cherbourg": 1, "Queenstown": 2}},
    {"col": "deck", "mapping": {"A": 0, "B": 1, "C": 2, "D": 3, "E": 4, "F": 5, "G": 6}},
]
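The list above follows a regular shape: one dict per column, with a `"col"` name and a `"mapping"` from category to integer code. As a minimal sketch, the same structure can be generated from an explicit category order per column. The names `category_orders` and `generated_mapping` are hypothetical and introduced only for this illustration.

```python
# Hypothetical helper input: the desired code order for each categorical column.
category_orders = {
    "class": ["First", "Second", "Third"],
    "sex": ["male", "female"],
}

# Build the list-of-dicts structure, assigning codes 0, 1, 2, ... in list order.
generated_mapping = [
    {"col": col, "mapping": {value: code for code, value in enumerate(values)}}
    for col, values in category_orders.items()
]

for entry in generated_mapping:
    print(entry)
```

Writing the order out explicitly keeps the codes stable even if a category is absent from a particular data sample.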

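Apply ordinal encoding to the categorical columns. In the output, the numeric columns are unchanged and the categorical columns are replaced by integer codes; missing `deck` values appear as -1.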

In [6]:

encoder = ce.OrdinalEncoder(cols=cols, mapping=mapping, handle_unknown='value')
df_x_enc = encoder.fit_transform(df_x)

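# Filling in missing values appears to turn these columns into float, so cast them back to int.
# CatBoost needs this cast in order to treat the columns as categorical features.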
df_x_enc["embark_town"] = df_x_enc["embark_town"].astype(int)
df_x_enc["deck"] = df_x_enc["deck"].astype(int)

df_x_enc.head()

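| | class | sex | age | sibsp | parch | fare | embark_town | deck |
|---|---|---|---|---|---|---|---|---|
| 0 | 2 | 0 | 22.0 | 1 | 0 | 7.25 | 0 | -1 |
| 1 | 0 | 1 | 38.0 | 1 | 0 | 71.2833 | 1 | 2 |
| 2 | 2 | 1 | 26.0 | 0 | 0 | 7.925 | 0 | -1 |
| 3 | 0 | 1 | 35.0 | 1 | 0 | 53.1 | 0 | 2 |
| 4 | 2 | 0 | 35.0 | 0 | 0 | 8.05 | 0 | -1 |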
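To make the float-to-int issue concrete, here is a minimal pandas-only sketch of the same idea. It is not how `category_encoders` is implemented; it only illustrates why a column with missing values ends up as float after a dictionary lookup, and why a cast is needed afterwards. The toy frame and the choice of -1 as the fill value (mirroring the output above) are assumptions for this illustration.

```python
import pandas as pd

# Toy data standing in for two Titanic columns (values are illustrative).
toy = pd.DataFrame({
    "sex": ["male", "female", "male"],
    "deck": [None, "C", None],
})
sex_map = {"male": 0, "female": 1}
deck_map = {"A": 0, "B": 1, "C": 2, "D": 3, "E": 4, "F": 5, "G": 6}

encoded = toy.copy()
encoded["sex"] = toy["sex"].map(sex_map)

# map() yields NaN for missing entries, which forces a floating-point column.
deck_codes = toy["deck"].map(deck_map)
print(deck_codes.dtype)

# Replace NaN with a sentinel code and cast back to int.
encoded["deck"] = deck_codes.fillna(-1).astype(int)
print(encoded)
```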


## Verifying the scikit-learn Pipeline

To build a pipeline, define an encoder subclass that adds custom post-processing: casting the encoded columns to int.

In [7]:
class IntOrdEncoder(ce.OrdinalEncoder):
    def __init__(self, cols, mapping, handle_unknown):
        super().__init__(cols=cols, mapping=mapping, handle_unknown=handle_unknown)
        self.cols = cols

    def transform(self, *args, **kwargs):
        """Transform x (a pd.DataFrame) and cast the encoded columns to int.
        """
        x = super().transform(*args, **kwargs)
        for col in self.cols:
            x[col] = x[col].astype(int)

        return x

    def fit_transform(self, *args, **kwargs):
        """Fit on x (a pd.DataFrame), transform it, and cast the encoded columns to int.
        """
        x = super().fit_transform(*args, **kwargs)
        for col in self.cols:
            x[col] = x[col].astype(int)

        return x
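An alternative way to add the int cast is to keep it in a separate transformer step rather than subclassing the encoder. The sketch below uses only scikit-learn base classes; `IntCaster` is a hypothetical name, not part of any library. Because `TransformerMixin` provides `fit_transform` as `fit` followed by `transform`, only `transform` needs to carry the cast.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class IntCaster(BaseEstimator, TransformerMixin):
    """Hypothetical transformer that casts the given columns to int."""

    def __init__(self, cols):
        self.cols = cols

    def fit(self, X, y=None):
        # Nothing to learn; return self to follow the scikit-learn convention.
        return self

    def transform(self, X):
        # astype with a dict casts only the listed columns and returns a new frame.
        return X.astype({col: int for col in self.cols})


frame = pd.DataFrame({"deck": [2.0, -1.0], "fare": [71.2833, 7.25]})
casted = IntCaster(cols=["deck"]).fit_transform(frame)
print(casted.dtypes)
```

Such a step could be placed after an encoder in a `Pipeline`, at the cost of one extra step compared with the subclass above.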

Reload the data and split it into training and test sets.

In [8]:
df = sns.load_dataset('titanic')
df_x = df[feature_names]
df_y = df["survived"]
x_tr, x_vl, y_tr, y_vl = train_test_split(df_x, df_y, test_size=0.33, shuffle=True, random_state=42)

Apply the ordinal encoder and train the model.

In [9]:
encoder = IntOrdEncoder(cols=cols, mapping=mapping, handle_unknown='value')
x_tr_enc = encoder.fit_transform(x_tr)

clf = CatBoostClassifier(iterations=1000, cat_features=cols)
clf.fit(x_tr_enc, y_tr, verbose=False)

<catboost.core.CatBoostClassifier at 0x1124c1550>

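Compute the predicted class probabilities for the first few training rows. Each row of the output is [P(label 0), P(label 1)]. Note that `df_y.head()` prints the labels of the first five rows of the full dataset; because `train_test_split` shuffles the rows, these are not the labels of `x_tr_enc.head()` (those are `y_tr.head()`).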

In [10]:
print(df_y.head())
clf.predict_proba(x_tr_enc.head())

0    0
1    1
2    1
3    1
4    0
Name: survived, dtype: int64


array([[0.71547344, 0.28452656],
       [0.81862504, 0.18137496],
       [0.77672603, 0.22327397],
       [0.76732297, 0.23267703],
       [0.62095082, 0.37904918]])
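When comparing predictions with labels after a shuffled split, the labels that correspond to the first training rows are the head of the training labels, not the head of the original label series. A minimal sketch with toy data (all `toy_*`, `x_a`, `y_a` names are hypothetical) shows that the feature and label splits share the same shuffled index:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy features and labels with a default 0..9 index.
toy_x = pd.DataFrame({"fare": [float(i) for i in range(10)]})
toy_y = pd.Series([i % 2 for i in range(10)], name="survived")

x_a, x_b, y_a, y_b = train_test_split(
    toy_x, toy_y, test_size=0.3, shuffle=True, random_state=42
)

# The training features and training labels carry the same (shuffled) index.
print(x_a.index.tolist())
print(y_a.index.tolist())
aligned = x_a.index.equals(y_a.index)
print(aligned)
```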

Evaluate performance on the training data using the F1 score.

In [11]:
f1_score(y_tr, clf.predict(x_tr_enc))

0.8070175438596492
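For reference, `f1_score` with its default arguments scores the positive class (label 1) as the harmonic mean of precision and recall. A small worked example with made-up labels:

```python
from sklearn.metrics import f1_score

# Made-up labels for illustration only.
y_true = [0, 1, 1, 1, 0]
y_pred = [0, 1, 0, 1, 1]

# For label 1: TP = 2, FP = 1, FN = 1.
precision = 2 / (2 + 1)
recall = 2 / (2 + 1)
f1_manual = 2 * precision * recall / (precision + recall)

f1_library = f1_score(y_true, y_pred)
print(f1_manual, f1_library)
```

Both values come out to 2/3 here, since precision and recall are equal.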

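Build a pipeline for inference on the test data. Both steps were already fitted above, so the pipeline is used for prediction only and `pipe.fit` is not called.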

In [12]:
steps = [
    ('preprocessing', encoder),
    ('classification', clf)
]
pipe = Pipeline(steps)

Apply the pipeline to the test data.

In [13]:
pipe.predict_proba(x_vl.head())

array([[0.57817033, 0.42182967],
       [0.86374878, 0.13625122],
       [0.84522229, 0.15477771],
       [0.03521136, 0.96478864],
       [0.43031737, 0.56968263]])
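The pipeline above wraps steps that were fitted beforehand. A more common pattern is to fit the pipeline itself, so that preprocessing and model are fitted together on the training data. The sketch below illustrates that pattern with stand-ins: scikit-learn's own `OrdinalEncoder` and `LogisticRegression` replace `category_encoders` and CatBoost, and the toy data is made up.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder

# Made-up categorical data; "Crew" in the test rows is unseen during fitting.
toy_train = pd.DataFrame({
    "sex": ["male", "female", "female", "male"],
    "class": ["Third", "First", "Third", "Second"],
})
toy_labels = [0, 1, 1, 0]
toy_test = pd.DataFrame({"sex": ["female", "male"], "class": ["Second", "Crew"]})

toy_pipe = Pipeline([
    # Unseen categories are encoded as -1 instead of raising an error.
    ("preprocessing", OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)),
    ("classification", LogisticRegression()),
])

# fit() runs fit_transform on the encoder, then fits the classifier on its output.
toy_pipe.fit(toy_train, toy_labels)
toy_proba = toy_pipe.predict_proba(toy_test)
print(toy_proba.shape)
```

Fitting through the pipeline means the same object can be passed to tools such as cross-validation without leaking test data into the preprocessing step.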

Evaluate performance on the test data using the F1 score.

In [14]:
f1_score(y_vl, pipe.predict(x_vl))

0.730593607305936