## 13.4 One-hot encoding and transformer pipelines with sklearn
A snippet for multiple linear regression using a ```sklearn``` pipeline.

The column selection relies on ```pandas.loc[]```-style indexing, so the data must be converted to pandas before it is passed to the pipeline.
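As a minimal sketch of why the input type matters, the example below selects a column by name in a ```ColumnTransformer```. It uses a small hypothetical DataFrame (not the ```tips``` data) and assumes a scikit-learn version in which name-based selection on a bare NumPy array raises ```ValueError```.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data (made-up values, not the tips dataset)
df = pd.DataFrame({
    "sex": ["Male", "Female", "Female", "Male"],
    "size": [2, 3, 2, 4],
})

ct = ColumnTransformer(
    transformers=[("cat", OneHotEncoder(drop="first"), ["sex"])],
    remainder="passthrough",
)

# Works: the name "sex" is resolved against the DataFrame's columns
encoded = ct.fit_transform(df)
print(encoded.shape)

# A bare NumPy array carries no column names, so name-based selection fails
try:
    ct.fit_transform(df.to_numpy())
    name_lookup_failed = False
except ValueError:
    name_lookup_failed = True
print(name_lookup_failed)
```

Selecting columns by name keeps the transformer definition readable, at the cost of requiring a DataFrame-like input at fit time.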

### (Common) Loading the data
This section uses the ```tips``` dataset from ```seaborn```.

In [1]:
import polars as pl
import seaborn as sns

# Load the data
tips = pl.DataFrame(sns.load_dataset("tips"))
display(tips.head())

| total_bill | tip | sex | smoker | day | time | size |
| --- | --- | --- | --- | --- | --- | --- |
| f64 | f64 | cat | cat | cat | cat | i64 |
| 16.99 | 1.01 | "Female" | "No" | "Sun" | "Dinner" | 2 |
| 10.34 | 1.66 | "Male" | "No" | "Sun" | "Dinner" | 3 |
| 21.01 | 3.5 | "Male" | "No" | "Sun" | "Dinner" | 3 |
| 23.68 | 3.31 | "Male" | "No" | "Sun" | "Dinner" | 2 |
| 24.59 | 3.61 | "Female" | "No" | "Sun" | "Dinner" | 4 |


Column names, dtypes, and the categories of ```sex```, ```smoker```, ```day```, and ```time```:

['total_bill', 'tip', 'sex', 'smoker', 'day', 'time', 'size']

[Float64, Float64, Categorical, Categorical, Categorical, Categorical, Int64]

array(['Male', 'Female'], dtype=object)

array(['Yes', 'No'], dtype=object)

array(['Thur', 'Fri', 'Sat', 'Sun'], dtype=object)

array(['Lunch', 'Dinner'], dtype=object)

In [4]:
from sklearn import linear_model

# Import the preprocessing utilities
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import Pipeline
import numpy as np

# Specify the variables to convert to dummy variables
categorical_features = ["sex", "smoker", "day", "time"]
categorical_transformer = OneHotEncoder(drop = "first")

# Initialize the preprocessor
preprocessor = ColumnTransformer(
    transformers = [
        ("cat", categorical_transformer, categorical_features),
    ],
    remainder = "passthrough"  # pass unprocessed variables through unchanged
)

# Initialize the pipeline
pipe = Pipeline(
    steps = [
        ("preprocessor", preprocessor),
        ("lr", linear_model.LinearRegression())
    ]
)

# Fit the model
pipe.fit(
    X = tips[["total_bill", "size", "sex", "smoker", "day", "time"]].to_pandas(),
    y = tips["tip"].to_pandas(),
)

# Display the intercept and the coefficients
coefficients = np.append(
    pipe.named_steps["lr"].intercept_, pipe.named_steps["lr"].coef_
)
labels = np.append(
    ["intercept"], pipe[:-1].get_feature_names_out()
)
coefs = pl.DataFrame({"variables": labels, "coef": coefficients})
display(coefs)

| variables | coef |
| --- | --- |
| object | f64 |
| intercept | 0.803817 |
| cat__sex_Male | -0.032441 |
| cat__smoker_Yes | -0.086408 |
| cat__day_Sat | -0.121458 |
| cat__day_Sun | -0.025481 |
| cat__day_Thur | -0.162259 |
| cat__time_Lunch | 0.068129 |
| remainder__total_bill | 0.094487 |
| remainder__size | 0.175992 |
