# Logistic regression for binary classification
sklearn.linear_model.LogisticRegression
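- Parameters
    - penalty: 'l1', 'l2', 'elasticnet', 'none' (recent scikit-learn releases use None instead of the string 'none')
    - solver: 'newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga'
    - multi_class: 'ovr', 'auto', 'multinomial' (deprecated in recent scikit-learn releases)
- .predict(X) returns the predicted label (class) for each sample
- .predict_proba(X) returns the predicted probability p(X) of each label (class)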
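
To illustrate how the two methods relate, here is a minimal sketch on synthetic data (the data from `make_classification` and the names `X_demo`, `y_demo`, `demo_model` are assumptions for illustration, not part of this notebook). It checks that the label from `.predict` matches the class with the highest probability from `.predict_proba`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary data (an assumption; not the Titanic data used in this notebook)
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)
demo_model = LogisticRegression().fit(X_demo, y_demo)

proba = demo_model.predict_proba(X_demo)   # shape (n_samples, n_classes)
labels = demo_model.predict(X_demo)        # shape (n_samples,)

# Columns of proba follow the order of demo_model.classes_
labels_from_proba = demo_model.classes_[np.argmax(proba, axis=1)]
print(proba.shape, np.array_equal(labels, labels_from_proba))
```

Each row of `proba` sums to 1, and the column order follows `classes_`.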

sklearn.metrics.log_loss
- log_loss(y_true, y_pred)
    - Pass the return value of .predict_proba(X) as y_pred
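
As a rough sketch of what `log_loss` computes, the example below evaluates the mean negative log probability of the true class by hand and compares it with `log_loss`. The labels and probabilities are made-up values for illustration.

```python
import numpy as np
from sklearn.metrics import log_loss

y_true = np.array([0, 1, 1, 0])
# Rows are samples; columns are P(class 0), P(class 1), as predict_proba returns them
y_proba = np.array([[0.9, 0.1],
                    [0.2, 0.8],
                    [0.4, 0.6],
                    [0.7, 0.3]])

# Probability assigned to the true class of each sample
p_true = y_proba[np.arange(len(y_true)), y_true]
manual = -np.mean(np.log(p_true))

print(manual, log_loss(y_true, y_proba))
```

The two printed values should agree up to floating-point error.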

## Data preparation

In [24]:
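import pandas as pd
import seaborn as sns
from sklearn.model_selection import train_test_split

# Load the Titanic dataset and drop rows with missing values
df = sns.load_dataset('titanic').dropna()
# Exclude the target 'survived' and 'alive', which duplicates it
X = df.loc[:, (df.columns != 'survived') & (df.columns != 'alive')]
# One-hot encode categorical columns, dropping the first level of each
X = pd.get_dummies(X, drop_first=True)
y = df['survived']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)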


## Implementation

In [20]:
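from sklearn.linear_model import LogisticRegression
# Default settings (solver='lbfgs', max_iter=100); the unscaled features
# trigger the convergence warning shown below
model = LogisticRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)              # predicted labels
y_pred_proba = model.predict_proba(X_test)  # predicted class probabilities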

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
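
The warning above suggests scaling the data or raising `max_iter`. A minimal sketch of that remedy follows, using scikit-learn's bundled breast cancer dataset as a stand-in (an assumption, chosen because its features also sit on very different scales; the names `scaled_model`, `X_bc`, `y_bc` are hypothetical). It is not the Titanic model fitted in this notebook.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Bundled dataset with features on very different scales (stand-in, an assumption)
X_bc, y_bc = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_bc, y_bc, test_size=0.3, random_state=0)

# Standardize features, then fit; max_iter is raised as an extra safety margin
scaled_model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_model.fit(X_tr, y_tr)

n_iter = scaled_model[-1].n_iter_[0]   # iterations actually used by the solver
accuracy = scaled_model.score(X_te, y_te)
print(n_iter, accuracy)
```

Putting the scaler inside a pipeline means it is fitted on the training split only, so the test split does not leak into the scaling statistics.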


Log loss, coefficients, intercept, classes, feature names, and so on.

In [23]:
from sklearn.metrics import log_loss
log_loss(y_test, y_pred_proba), model.coef_, model.intercept_,\
model.classes_, model.feature_names_in_

(0.4111567729103682,
 array([[-0.29613445, -0.02130647,  0.62287823, -0.37405926,  0.00478665,
         -0.88713974,  0.17503682,  0.38845621, -0.4045677 , -0.02303382,
         -0.4531837 , -0.40475737, -0.88713974,  1.21198449, -0.11127389,
         -1.20757958, -0.14162972,  0.61110768, -0.13071674, -0.5516418 ,
         -0.4045677 , -0.02303382]]),
 array([2.14323822]),
 array([0, 1]),
 array(['pclass', 'age', 'sibsp', 'parch', 'fare', 'adult_male', 'alone',
        'sex_male', 'embarked_Q', 'embarked_S', 'class_Second',
        'class_Third', 'who_man', 'who_woman', 'deck_B', 'deck_C',
        'deck_D', 'deck_E', 'deck_F', 'deck_G', 'embark_town_Queenstown',
        'embark_town_Southampton'], dtype=object))

# Logistic regression for multiclass classification
sklearn.linear_model.LogisticRegression
- multi_class: 'auto', 'ovr', 'multinomial'
    - 'auto': selects 'ovr' when solver='liblinear' or the problem is binary, and 'multinomial' otherwise
    - 'ovr': One vs Rest
    - 'multinomial': multinomial logistic regression

## Data preparation


In [27]:
import seaborn as sns
from sklearn.model_selection import train_test_split

df = sns.load_dataset('iris')
y_col = 'species'
X = df.loc[:, df.columns != y_col]  # all columns except the target
y = df[y_col]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)
print(len(X_train), len(X_test))

105 45


## OvR

Trains one binary classifier per class, so the training cost grows with the number of classes.

In [33]:
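from sklearn.linear_model import LogisticRegression
# No regularization, one-vs-rest; recent scikit-learn releases expect
# penalty=None instead of the string 'none'
model = LogisticRegression(penalty='none', multi_class='ovr')
model.fit(X_train, y_train)
y_pred_ovr = model.predict_proba(X_test)  # class probabilities, not labels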
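
To make the one-vs-rest idea concrete, the sketch below builds it by hand: one binary "this class vs the rest" model per class, with the per-class probabilities renormalized so each row sums to 1. It uses `load_iris` from scikit-learn and default (L2-regularized) models, so its numbers will differ from the unregularized fit above; the names `binary_models`, `raw`, and `ovr_proba` are hypothetical.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X_iris, y_iris = load_iris(return_X_y=True)
classes = np.unique(y_iris)

# One binary "this class vs the rest" model per class
binary_models = [
    LogisticRegression(max_iter=1000).fit(X_iris, (y_iris == c).astype(int))
    for c in classes
]

# P(class c | x) from each binary model, stacked column-wise
raw = np.column_stack([m.predict_proba(X_iris)[:, 1] for m in binary_models])
# The per-class scores need not sum to 1, so renormalize each row
ovr_proba = raw / raw.sum(axis=1, keepdims=True)
ovr_pred = classes[np.argmax(ovr_proba, axis=1)]
print(ovr_proba[:3], (ovr_pred == y_iris).mean())
```

Because each binary model is trained independently, the number of fits equals the number of classes.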

One intercept and one coefficient vector are fitted per label (class).

In [32]:
model.classes_, model.intercept_, model.coef_

(array(['setosa', 'versicolor', 'virginica'], dtype=object),
 array([   1.11952961,    6.81324426, -255.99667981]),
 array([[ 1.91635746e+00,  6.80805390e+00, -1.08014054e+01,
         -5.01387880e+00],
        [-4.15059756e-01, -2.43651049e+00,  1.48863127e+00,
         -3.08728666e+00],
        [-3.59713655e+02, -2.82241847e+02,  5.44953421e+02,
          3.64284106e+02]]))
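
As a small check of how these arrays are laid out, the sketch below fits a default `LogisticRegression` on `load_iris` (an assumption: default settings with regularization, not the unregularized model above) and confirms that row k of `coef_` together with entry k of `intercept_` gives the linear score for `classes_[k]`.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X_iris, y_iris = load_iris(return_X_y=True)
clf = LogisticRegression(max_iter=1000).fit(X_iris, y_iris)

print(clf.coef_.shape, clf.intercept_.shape)

# Row k of coef_ and entry k of intercept_ give the linear score for classes_[k]
scores = X_iris @ clf.coef_.T + clf.intercept_
print(np.allclose(scores, clf.decision_function(X_iris)))
```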

In [39]:
y_pred_ovr[:5]

array([[2.52308411e-016, 8.82201826e-002, 9.11779817e-001],
       [1.28141183e-009, 9.99999999e-001, 5.98424999e-214],
       [9.85945513e-001, 1.40544866e-002, 0.00000000e+000],
       [2.82376791e-019, 3.87069863e-001, 6.12930137e-001],
       [8.87568928e-001, 1.12431072e-001, 0.00000000e+000]])

## multinomial

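Uses the softmax function. Without regularization, as here, the predicted probabilities tend to be extreme (very close to 0 or 1).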

In [36]:
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(penalty='none', multi_class='multinomial')
model.fit(X_train, y_train)
y_pred_mn = model.predict_proba(X_test)
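
For intuition about why softmax outputs saturate when the fitted coefficients are large, here is a minimal NumPy sketch. The `softmax` helper and the logit values are made up for illustration; this is not scikit-learn's internal implementation.

```python
import numpy as np

def softmax(z):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    z = np.asarray(z, dtype=float)
    shifted = z - z.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

logits = np.array([1.0, 2.0, 3.0])   # made-up scores for three classes
p_small = softmax(logits)
p_large = softmax(100 * logits)      # same ordering, much larger magnitude
print(p_small, p_large)
```

Scaling the scores up leaves the ranking unchanged but pushes almost all of the probability mass onto the top class.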

OvR and multinomial produce different fitted parameters.

In [35]:
model.classes_, model.intercept_, model.coef_


(array(['setosa', 'versicolor', 'virginica'], dtype=object),
 array([  80.23761155,  129.79119698, -210.02880853]),
 array([[ 155.59729672,  358.73830451, -523.93808685, -248.11590736],
        [ 118.20297865,  -15.08081949,  -41.54669626,  -91.11017602],
        [-273.80027538, -343.65748502,  565.48478311,  339.22608338]]))

Compared with the OvR probabilities above, the multinomial ones are pushed even closer to 0 and 1.

In [38]:
y_pred_mn[:5]

array([[0.00000000e+000, 3.17565566e-259, 1.00000000e+000],
       [0.00000000e+000, 1.00000000e+000, 2.66332546e-242],
       [1.00000000e+000, 0.00000000e+000, 0.00000000e+000],
       [0.00000000e+000, 8.28139203e-194, 1.00000000e+000],
       [1.00000000e+000, 1.68097172e-284, 0.00000000e+000]])