## Sklearn LogisticRegression

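### Key parameters
- penalty : regularization type. 'l1' selects L1, 'l2' selects L2
- C : controls regularization strength. It is the inverse of alpha (C = 1/alpha), so a smaller C means stronger regularization
- solver : optimization algorithm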
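
The effect of C can be checked directly. The following is a minimal sketch, assuming the breast cancer dataset used in this notebook and three illustrative C values that are not from the original; it fits an L2-regularized model for each C and compares the size of the learned coefficients.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

# Illustrative C values (an assumption, not taken from the notebook).
# The default penalty is L2, so only C changes between fits.
coef_norms = {}
for C in [0.01, 1.0, 100.0]:
    clf = LogisticRegression(C=C, solver="lbfgs", max_iter=5000)
    clf.fit(X_scaled, y)
    # L2 norm of the coefficient vector (the intercept is not included)
    coef_norms[C] = float(np.linalg.norm(clf.coef_))
    print(f"C={C}: coefficient norm = {coef_norms[C]:.3f}")
```

Because C is the inverse of the regularization strength, the coefficient norm should grow as C increases.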

#### solver
- lbfgs : default since sklearn version 0.22. Memory-efficient, and can run optimization in parallel when many CPU cores are available
- liblinear : default up to sklearn version 0.21. Effective on small, high-dimensional datasets. However, it has a local-minimum issue and cannot run optimization in parallel
- newton-cg : allows more precise optimization, but is very slow on large datasets
- sag : Stochastic Average Gradient, a gradient-descent-based optimizer. Fast on large datasets
- saga : a variant of sag that also supports L1 regularization
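
Not every solver accepts every penalty. As a rough check, the sketch below tries to fit each solver with an L1 penalty and records whether sklearn accepts the combination; the `supports_l1` dictionary is a hypothetical helper name introduced only for this illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

# supports_l1 is a hypothetical name used only in this sketch.
supports_l1 = {}
for solver in ["lbfgs", "liblinear", "newton-cg", "sag", "saga"]:
    try:
        clf = LogisticRegression(penalty="l1", solver=solver, max_iter=5000)
        clf.fit(X_scaled, y)
        supports_l1[solver] = True
    except ValueError:
        # sklearn rejects solver/penalty combinations it does not support
        supports_l1[solver] = False

print(supports_l1)
```

Only liblinear and saga should accept the L1 penalty, which matters when building a hyperparameter grid that mixes solvers and penalties.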

In [2]:
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression

cancer = load_breast_cancer()

In [5]:
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

# Use StandardScaler() to standardize each feature to mean 0 and variance 1
scaler = StandardScaler()
data_scaled = scaler.fit_transform(cancer.data)

X_train, X_test, y_train, y_test = train_test_split(data_scaled, cancer.target, test_size = 0.3, random_state = 0)

In [6]:
from sklearn.metrics import accuracy_score, roc_auc_score

# solver default = lbfgs
lr_clf = LogisticRegression()
lr_clf.fit(X_train, y_train)
lr_preds = lr_clf.predict(X_test)

print('accuracy : {0:.3f}, roc_auc : {1:.3f}'.format(accuracy_score(y_test, lr_preds), roc_auc_score(y_test, lr_preds)))

accuracy : 0.977, roc_auc : 0.972
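
The ROC AUC above is computed from hard 0/1 predictions. ROC AUC is usually computed from a continuous score, so a variant that passes the predicted probability of the positive class may be worth comparing. The sketch below rebuilds the same split and model so it runs on its own; the names `auc_from_labels` and `auc_from_proba` are introduced here for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

cancer = load_breast_cancer()
data_scaled = StandardScaler().fit_transform(cancer.data)
X_train, X_test, y_train, y_test = train_test_split(
    data_scaled, cancer.target, test_size=0.3, random_state=0)

lr_clf = LogisticRegression().fit(X_train, y_train)

# AUC from hard class labels (what the cell above does)
auc_from_labels = roc_auc_score(y_test, lr_clf.predict(X_test))
# AUC from the predicted probability of class 1
auc_from_proba = roc_auc_score(y_test, lr_clf.predict_proba(X_test)[:, 1])

print('roc_auc (labels) : {0:.3f}, roc_auc (proba) : {1:.3f}'.format(
    auc_from_labels, auc_from_proba))
```

The probability-based value uses the full ranking of test samples rather than a single 0.5 threshold, so the two numbers generally differ.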


In [9]:
import time

solvers = ['lbfgs', 'liblinear', 'newton-cg', 'sag', 'saga']

for solver in solvers:
    start = time.time()
    lr_clf = LogisticRegression(solver = solver, max_iter = 600)
    lr_clf.fit(X_train, y_train)
    lr_preds = lr_clf.predict(X_test)
    
    print('solver : {0}, accuracy : {1:.3f}, roc_auc : {2:.3f}'.format(solver, 
                                                                       accuracy_score(y_test, lr_preds), 
                                                                       roc_auc_score(y_test, lr_preds)))
    end = time.time()
    print('{0:.5f}'.format(end-start))

solver : lbfgs, accuracy : 0.977, roc_auc : 0.972
0.01556
solver : liblinear, accuracy : 0.982, roc_auc : 0.979
0.00478
solver : newton-cg, accuracy : 0.977, roc_auc : 0.972
0.01119
solver : sag, accuracy : 0.982, roc_auc : 0.979
0.03900
solver : saga, accuracy : 0.982, roc_auc : 0.979
0.04502
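
The timings above include prediction and printing. To isolate training cost, one option is to time only `fit` with `time.perf_counter` and also report `n_iter_`, which shows how many iterations each solver used out of the `max_iter = 600` budget. This is a sketch under the same data split as above; the `results` dictionary is a name introduced for this example, and timings will vary by machine.

```python
import time

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

cancer = load_breast_cancer()
data_scaled = StandardScaler().fit_transform(cancer.data)
X_train, X_test, y_train, y_test = train_test_split(
    data_scaled, cancer.target, test_size=0.3, random_state=0)

solvers = ['lbfgs', 'liblinear', 'newton-cg', 'sag', 'saga']
results = {}  # solver -> (fit time in seconds, iterations used)

for solver in solvers:
    clf = LogisticRegression(solver=solver, max_iter=600)
    start = time.perf_counter()
    clf.fit(X_train, y_train)
    fit_seconds = time.perf_counter() - start
    # n_iter_ is an array; take the largest entry as the iteration count
    results[solver] = (fit_seconds, int(clf.n_iter_.max()))
    print('solver : {0}, fit time : {1:.5f}s, n_iter : {2}'.format(
        solver, *results[solver]))
```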


In [14]:
from sklearn.model_selection import GridSearchCV

params={'solver':['liblinear', 'lbfgs'],
        'penalty':['l1', 'l2'],
        'C':[0.01, 0.1, 1, 5, 10]}

lr_clf = LogisticRegression()

grid_clf = GridSearchCV(lr_clf, param_grid = params, scoring = 'accuracy', cv = 3)
grid_clf.fit(data_scaled, cancer.target)
print('best hyperparameters : {0}, best mean accuracy : {1:.3f}'.format(
    grid_clf.best_params_, grid_clf.best_score_))

best hyperparameters : {'C': 0.1, 'penalty': 'l2', 'solver': 'liblinear'}, best mean accuracy : 0.979


15 fits failed out of a total of 60.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
15 fits failed with the following error:
Traceback (most recent call last):
  File "/opt/anaconda3/lib/python3.9/site-packages/sklearn/model_selection/_validation.py", line 680, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/opt/anaconda3/lib/python3.9/site-packages/sklearn/linear_model/_logistic.py", line 1461, in fit
    solver = _check_solver(self.solver, self.penalty, self.dual)
  File "/opt/anaconda3/lib/python3.9/site-packages/sklearn/linear_model/_logistic.py", line 447, in _check_solver
    raise ValueError(
ValueError: Solver lbfgs supports only 'l2' or 'none' penalties, got l1 penalty.
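
The 15 failed fits come from the lbfgs + 'l1' combinations (5 values of C × 3 folds). One way to avoid them, sketched here as a suggestion rather than the notebook's own method, is to pass `param_grid` as a list of dictionaries so that each solver is paired only with the penalties it supports. `max_iter=1000` and `error_score='raise'` are assumptions added for this example.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler

cancer = load_breast_cancer()
data_scaled = StandardScaler().fit_transform(cancer.data)

C_values = [0.01, 0.1, 1, 5, 10]
# Each dictionary is its own sub-grid, so lbfgs is never combined with 'l1'.
params = [
    {'solver': ['liblinear'], 'penalty': ['l1', 'l2'], 'C': C_values},
    {'solver': ['lbfgs'], 'penalty': ['l2'], 'C': C_values},
]

grid_clf = GridSearchCV(LogisticRegression(max_iter=1000), param_grid=params,
                        scoring='accuracy', cv=3, error_score='raise')
grid_clf.fit(data_scaled, cancer.target)

print('best hyperparameters : {0}, best mean accuracy : {1:.3f}'.format(
    grid_clf.best_params_, grid_clf.best_score_))
```

With this grid there are 15 candidates and 45 fits, and no score should be set to nan.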

 0.97891024 0.97364708 0.968