## 14.1 Logistic regression
Snippets for fitting a logistic regression with ```statsmodels``` and ```sklearn```.


### (Common) Loading the data
This section uses the ```titanic``` dataset from ```seaborn```.

In [26]:
import polars as pl
import seaborn as sns

# Load the data
titanic = pl.DataFrame(sns.load_dataset("titanic"))

# Extract the columns used in the analysis and drop rows with missing values
titanic_sub = (
    titanic[["survived", "sex", "age", "embarked"]].drop_nulls()
)
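# pandas sorts the categories automatically when creating dummy variables, but polars does not,
# so sort the rows beforehand to get the same reference categories.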
titanic_sub = titanic_sub.sort(by = ["sex", "age", "embarked"])
display(titanic_sub.head())

# Number of survivors
display(titanic_sub["survived"].value_counts())
# Port of embarkation
display(titanic_sub["embarked"].value_counts())

| survived (i64) | sex (str) | age (f64) | embarked (str) |
|---|---|---|---|
| 1 | female | 0.75 | C |
| 1 | female | 0.75 | C |
| 1 | female | 1.0 | C |
| 1 | female | 1.0 | S |
| 0 | female | 2.0 | S |


| survived (i64) | counts (u32) |
|---|---|
| 0 | 424 |
| 1 | 288 |


| embarked (str) | counts (u32) |
|---|---|
| S | 554 |
| Q | 28 |
| C | 130 |

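The sort in the cell above matters because the reference (dropped) category depends on which level is treated as "first". The following is a minimal plain-Python sketch of that idea; `dummies_drop_first` is a hypothetical helper written for illustration, not the implementation used by polars or pandas, and the `embarked` list is made-up sample data.

```python
def dummies_drop_first(values, sort_levels):
    """One-hot encode `values`, dropping the first level as the reference."""
    # Collect the levels in order of first appearance (dicts keep insertion order).
    levels = list(dict.fromkeys(values))
    if sort_levels:
        # Mimic a library that sorts the levels before encoding.
        levels = sorted(levels)
    reference, kept = levels[0], levels[1:]
    columns = {level: [int(v == level) for v in values] for level in kept}
    return reference, columns


# Illustrative values only, not the real column.
embarked = ["S", "C", "S", "Q"]

ref_unsorted, cols_unsorted = dummies_drop_first(embarked, sort_levels=False)
ref_sorted, cols_sorted = dummies_drop_first(embarked, sort_levels=True)

print("reference without sorting:", ref_unsorted, "kept:", list(cols_unsorted))
print("reference with sorting:   ", ref_sorted, "kept:", list(cols_sorted))
```

Without sorting, the reference level is whichever value happens to appear first, so the fitted coefficients would be expressed relative to a different baseline. Sorting the rows first makes the alphabetically smallest level the baseline in either case.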

### 14.1.1 statsmodels

In [27]:
import statsmodels.formula.api as smf

# Fit the model
form = "survived ~ sex + age + embarked"
model = smf.logit(formula = form, data = titanic_sub).fit()

# Display the fitted results
display(model.summary())

import pandas as pd
import numpy as np
res_sm = pd.DataFrame(model.params, columns = ["coefs_sm"])
res_sm["odds_sm"] = np.exp(res_sm["coefs_sm"])
display(res_sm.round(3))

Optimization terminated successfully.
         Current function value: 0.509889
         Iterations 6


| | | | |
|---|---|---|---|
| Dep. Variable: | survived | No. Observations: | 712 |
| Model: | Logit | Df Residuals: | 707 |
| Method: | MLE | Df Model: | 4 |
| Date: | Mon, 01 Jan 2024 | Pseudo R-squ.: | 0.2444 |
| Time: | 16:47:01 | Log-Likelihood: | -363.04 |
| converged: | True | LL-Null: | -480.45 |
| Covariance Type: | nonrobust | LLR p-value: | 1.209e-49 |

| | coef | std err | z | P>\|z\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | 2.2046 | 0.322 | 6.851 | 0.000 | 1.574 | 2.835 |
| sex[T.male] | -2.4760 | 0.191 | -12.976 | 0.000 | -2.850 | -2.102 |
| embarked[T.Q] | -1.8156 | 0.535 | -3.393 | 0.001 | -2.864 | -0.767 |
| embarked[T.S] | -1.0069 | 0.237 | -4.251 | 0.000 | -1.471 | -0.543 |
| age | -0.0081 | 0.007 | -1.233 | 0.217 | -0.021 | 0.005 |


| | coefs_sm | odds_sm |
|---|---|---|
| Intercept | 2.205 | 9.066 |
| sex[T.male] | -2.476 | 0.084 |
| embarked[T.Q] | -1.816 | 0.163 |
| embarked[T.S] | -1.007 | 0.365 |
| age | -0.008 | 0.992 |
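To make the coefficients and odds ratios above more concrete, here is a small sketch that plugs the reported values into the logistic function by hand. The coefficients are copied from the summary above (rounded to four decimals); `predict_proba` is a hypothetical helper for illustration, and in practice `model.predict` would be the usual route.

```python
import math

# Coefficients copied from the statsmodels summary (rounded values).
coefs = {
    "Intercept": 2.2046,
    "sex[T.male]": -2.4760,
    "embarked[T.Q]": -1.8156,
    "embarked[T.S]": -1.0069,
    "age": -0.0081,
}


def predict_proba(sex, age, embarked):
    """Survival probability for one passenger under the fitted model."""
    # The reference categories (female, embarked at C) contribute nothing.
    logit = coefs["Intercept"] + coefs["age"] * age
    if sex == "male":
        logit += coefs["sex[T.male]"]
    if embarked in ("Q", "S"):
        logit += coefs[f"embarked[T.{embarked}]"]
    # Inverse of the logit link.
    return 1.0 / (1.0 + math.exp(-logit))


# The odds ratio is the exponential of the coefficient.
odds_ratio_male = math.exp(coefs["sex[T.male]"])

p_female = predict_proba("female", 30, "C")
p_male = predict_proba("male", 30, "C")

print(f"odds ratio (male vs. female): {odds_ratio_male:.3f}")
print(f"P(survived | female, 30, C): {p_female:.3f}")
print(f"P(survived | male, 30, C):   {p_male:.3f}")
```

The odds ratio for `sex[T.male]` compares the odds of two passengers who differ only in sex, so the ratio of the two predicted odds equals `odds_ratio_male` regardless of the age or port chosen.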


### 14.1.2 sklearn

In [28]:
from sklearn import linear_model

# Create the dummy variables
import polars.selectors as cs
# Set the numeric columns aside (excluding the target)
df_numeric = titanic_sub.select(cs.numeric()).drop("survived")
# Create dummy variables for the categorical columns only
df_categorical = titanic_sub.select(cs.string())
df_dummy = df_categorical.to_dummies(drop_first = True)
# Build the DataFrame used for fitting
df_study = pl.concat(items = [df_numeric, df_dummy], how = "horizontal")


# Fit the model (penalty = None disables regularization)
lr = linear_model.LogisticRegression(penalty = None)
model = lr.fit(X = df_study, y = titanic_sub["survived"])


# Display the fitted results
import numpy as np
# Coefficient names
labels = ["Intercept"]
for col in df_study.columns:
    labels.append(col)
# Coefficients
coefficients = np.append(
    model.intercept_, model.coef_
)
# Fitted results
result = pl.DataFrame({
    "Label": labels,
    "Coefficient": coefficients
})
# Odds ratios
result = result.with_columns(
    result["Coefficient"].exp().round(decimals = 3).alias("Odds")
)
display(result)

| Label (str) | Coefficient (f64) | Odds (f64) |
|---|---|---|
| Intercept | 2.204564 | 9.066 |
| age | -0.008078 | 0.992 |
| sex_male | -2.475953 | 0.084 |
| embarked_Q | -1.815557 | 0.163 |
| embarked_S | -1.006956 | 0.365 |

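With `penalty = None`, `LogisticRegression` skips its default L2 regularization, so both libraries fit the same unpenalized maximum-likelihood model and their coefficients should agree up to optimizer tolerance. The sketch below checks this using the values copied from the two result tables; `name_map` is a hypothetical mapping between the two naming schemes, written by hand for this comparison.

```python
# Coefficients copied from the statsmodels summary (rounded to 4 decimals).
coefs_sm = {
    "Intercept": 2.2046,
    "sex[T.male]": -2.4760,
    "embarked[T.Q]": -1.8156,
    "embarked[T.S]": -1.0069,
    "age": -0.0081,
}

# Coefficients copied from the sklearn result table.
coefs_sk = {
    "Intercept": 2.204564,
    "age": -0.008078,
    "sex_male": -2.475953,
    "embarked_Q": -1.815557,
    "embarked_S": -1.006956,
}

# Hand-written mapping: statsmodels term name -> polars dummy column name.
name_map = {
    "Intercept": "Intercept",
    "sex[T.male]": "sex_male",
    "embarked[T.Q]": "embarked_Q",
    "embarked[T.S]": "embarked_S",
    "age": "age",
}

# Largest absolute difference between matching coefficients.
max_abs_diff = max(
    abs(coefs_sm[sm_name] - coefs_sk[sk_name])
    for sm_name, sk_name in name_map.items()
)
print(f"largest absolute difference: {max_abs_diff:.1e}")
```

The remaining difference is on the order of the rounding in the statsmodels summary, which suggests the two fits coincide. With the default `penalty = "l2"`, the sklearn coefficients would be shrunk toward zero and would no longer match.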