# Improving Logistic Regression Performance on the Wine Data

## Importing the Required Libraries

In [1]:
import multiprocessing
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
plt.style.use(['seaborn-whitegrid'])

In [2]:
from sklearn.datasets import load_wine
from sklearn.pipeline import make_pipeline, Pipeline
from sklearn.model_selection import train_test_split, cross_validate, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression, SGDClassifier

## Analyzing the Wine Data

In [3]:
wine = load_wine()
print(wine.keys())
print(wine.DESCR)

dict_keys(['data', 'target', 'target_names', 'DESCR', 'feature_names'])
.. _wine_dataset:

Wine recognition dataset
------------------------

**Data Set Characteristics:**

    :Number of Instances: 178 (50 in each of three classes)
    :Number of Attributes: 13 numeric, predictive attributes and the class
    :Attribute Information:
 		- Alcohol
 		- Malic acid
 		- Ash
		- Alcalinity of ash  
 		- Magnesium
		- Total phenols
 		- Flavanoids
 		- Nonflavanoid phenols
 		- Proanthocyanins
		- Color intensity
 		- Hue
 		- OD280/OD315 of diluted wines
 		- Proline

    - class:
            - class_0
            - class_1
            - class_2
		
    :Summary Statistics:
    
                                   Min   Max   Mean     SD
    Alcohol:                      11.0  14.8    13.0   0.8
    Malic Acid:                   0.74  5.80    2.34  1.12
    Ash:                          1.36  3.23    2.36  0.27
    Alcalinity of Ash:            10.6  30.0    19.5   3.3
    Magnesium:        

In [5]:
X,y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.2)

In [6]:
scaler = StandardScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
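
The `make_pipeline` import above is otherwise unused. One way to put it to work is to move the scaler inside a pipeline, so that during cross-validation it is refit on each training fold rather than once on the whole training set. The following is a minimal sketch and not the notebook's own method: the names `pipe`, `pipe_grid`, `pipe_search`, the small `C` grid, `cv=5`, `max_iter=1000`, and `random_state=0` are all assumptions chosen for illustration.

```python
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Reproducible, stratified split (random_state and stratify are assumptions).
X, y = load_wine(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# The scaler is a pipeline step, so each CV fold fits its own scaler.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# make_pipeline names each step after its lowercased class name.
pipe_grid = {"logisticregression__C": [0.1, 1, 10]}

pipe_search = GridSearchCV(pipe, pipe_grid, cv=5)
pipe_search.fit(X_tr, y_tr)

test_acc = pipe_search.score(X_te, y_te)  # accuracy on the held-out split
print(pipe_search.best_params_)
print(pipe_search.best_score_, test_acc)
```

With this layout the raw `X_tr` is passed to `fit`, and the held-out split is scaled by the pipeline itself at scoring time.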

## Optimal Parameters for Logistic Regression

In [8]:
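estimator = LogisticRegression()
TF = [True, False]
C = [0.1, 0.5 , 1, 5, 10, 20, 30]  # inverse regularization strength
tol = [0.0001, 0.0005, 0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1]  # stopping tolerance
penalty = ['l1','l2']
# Note: the default lbfgs solver supports only penalty='l2' with dual=False.
# The other combinations fail to fit, and GridSearchCV scores them as NaN.
param_grid = {'C':C,
              'dual':TF,
              'fit_intercept':TF,
              'penalty':penalty,
              'tol':tol,
              'warm_start':TF}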

gs = GridSearchCV(
    estimator=estimator,
    param_grid=param_grid,
    n_jobs=multiprocessing.cpu_count(),
    cv=10,
    verbose=True
)
result = gs.fit(X_train, y_train)

print("Best score: {}".format(result.best_score_))
print("Best parameters: {}".format(result.best_params_))
print(gs.best_estimator_)
pd.DataFrame(result.cv_results_);

Fitting 10 folds for each of 1008 candidates, totalling 10080 fits


[Parallel(n_jobs=2)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=2)]: Done 2054 tasks      | elapsed:    3.5s


Best score: 0.9857142857142858
Best parameters: {'C': 0.1, 'dual': False, 'fit_intercept': True, 'penalty': 'l2', 'tol': 0.0001, 'warm_start': True}
LogisticRegression(C=0.1, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=True)


[Parallel(n_jobs=2)]: Done 10080 out of 10080 | elapsed:   17.6s finished
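
The candidate count in the log is the product of the grid sizes (7 × 2 × 2 × 2 × 9 × 2 = 1008). As a hedged sketch, `ParameterGrid` can reproduce that count, and a simple filter shows how many of those candidates the default `lbfgs` solver can actually fit, since it accepts only `penalty='l2'` with `dual=False`. The names `full_grid`, `n_candidates`, and `fittable` are illustrative and do not appear in the notebook.

```python
from sklearn.model_selection import ParameterGrid

TF = [True, False]
# Same value lists as the logistic regression grid in the notebook.
full_grid = {
    "C": [0.1, 0.5, 1, 5, 10, 20, 30],
    "dual": TF,
    "fit_intercept": TF,
    "penalty": ["l1", "l2"],
    "tol": [0.0001, 0.0005, 0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1],
    "warm_start": TF,
}

n_candidates = len(ParameterGrid(full_grid))
print(n_candidates)       # 1008, matching the log
print(n_candidates * 10)  # 10080 fits with cv=10

# Keep only the combinations that lbfgs supports.
fittable = [
    p for p in ParameterGrid(full_grid)
    if p["penalty"] == "l2" and not p["dual"]
]
print(len(fittable))      # 252
```

Only a quarter of the grid contributes usable scores under the default solver, which is consistent with the reported best parameters having `penalty='l2'` and `dual=False`.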


## Optimal Parameters for SGDClassifier

In [12]:
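estimator = SGDClassifier()
TF = [True, False]
alpha = [0.0005, 0.001, 0.01, 0.1, 0.5, 1]  # regularization strength
penalty = ['l1','l2']
# Note: `tol` is reused from the logistic regression cell above.
# `epsilon` is only used by the 'huber', 'epsilon_insensitive', and
# 'squared_epsilon_insensitive' losses, so it has no effect on the default hinge loss.
param_grid = {'alpha':alpha,
              'average':TF,
              'early_stopping':TF,
              'epsilon':alpha,
              'fit_intercept':TF,
              'penalty':penalty,
              'shuffle':TF,
              'tol':tol,
              'warm_start':TF}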

gs = GridSearchCV(
    estimator=estimator,
    param_grid=param_grid,
    n_jobs=multiprocessing.cpu_count(),
    cv=10,
    verbose=True
)
result = gs.fit(X_train, y_train)

print("Best score: {}".format(result.best_score_))
print("Best parameters: {}".format(result.best_params_))
print(gs.best_estimator_)
pd.DataFrame(result.cv_results_);

Fitting 10 folds for each of 20736 candidates, totalling 207360 fits


[Parallel(n_jobs=2)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=2)]: Done 216 tasks      | elapsed:    2.9s
[Parallel(n_jobs=2)]: Done 1416 tasks      | elapsed:   14.0s
[Parallel(n_jobs=2)]: Done 3416 tasks      | elapsed:   32.1s
[Parallel(n_jobs=2)]: Done 6216 tasks      | elapsed:   57.5s
[Parallel(n_jobs=2)]: Done 13064 tasks      | elapsed:  1.5min
[Parallel(n_jobs=2)]: Done 30664 tasks      | elapsed:  3.2min
[Parallel(n_jobs=2)]: Done 51464 tasks      | elapsed:  5.0min
[Parallel(n_jobs=2)]: Done 75464 tasks      | elapsed:  7.6min
[Parallel(n_jobs=2)]: Done 102664 tasks      | elapsed:  9.9min
[Parallel(n_jobs=2)]: Done 133064 tasks      | elapsed: 12.9min
[Parallel(n_jobs=2)]: Done 166664 tasks      | elapsed: 16.1min
[Parallel(n_jobs=2)]: Done 203464 tasks      | elapsed: 19.5min
[Parallel(n_jobs=2)]: Done 207360 out of 207360 | elapsed: 19.7min finished


Best score: 1.0
Best parameters: {'alpha': 0.01, 'average': False, 'early_stopping': True, 'epsilon': 0.0005, 'fit_intercept': True, 'penalty': 'l2', 'shuffle': False, 'tol': 0.05, 'warm_start': False}
SGDClassifier(alpha=0.01, average=False, class_weight=None, early_stopping=True,
              epsilon=0.0005, eta0=0.0, fit_intercept=True, l1_ratio=0.15,
              learning_rate='optimal', loss='hinge', max_iter=1000,
              n_iter_no_change=5, n_jobs=None, penalty='l2', power_t=0.5,
              random_state=None, shuffle=False, tol=0.05,
              validation_fraction=0.1, verbose=0, warm_start=False)
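
The exhaustive search above needed 207360 fits and about 19.7 minutes. If that cost matters, a randomized search over the same value lists is one possible alternative. Note that this swaps in `RandomizedSearchCV`, which the notebook does not use. The sketch below is self-contained, drops `epsilon` because the hinge loss ignores it, and treats `n_iter=50`, `cv=5`, and `random_state=0` as arbitrary assumptions.

```python
from sklearn.datasets import load_wine
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)
X_tr_s = StandardScaler().fit_transform(X_tr)  # scale the training split

TF = [True, False]
param_distributions = {
    "alpha": [0.0005, 0.001, 0.01, 0.1, 0.5, 1],
    "average": TF,
    "early_stopping": TF,
    "fit_intercept": TF,
    "penalty": ["l1", "l2"],
    "shuffle": TF,
    "tol": [0.0001, 0.0005, 0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1],
    "warm_start": TF,
}

# Sample 50 candidates from the grid instead of trying every combination.
rs = RandomizedSearchCV(
    SGDClassifier(random_state=0),
    param_distributions,
    n_iter=50,
    cv=5,
    random_state=0,
)
rs.fit(X_tr_s, y_tr)

print(len(rs.cv_results_["params"]))  # number of sampled candidates
print(rs.best_params_, rs.best_score_)
```

This runs 250 fits rather than 207360, at the price of possibly missing the best combination in the full grid.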


The grid search found the following optimal parameters for SGDClassifier:<br>
'alpha': 0.01, 'average': False, 'early_stopping': True, 'epsilon': 0.0005, <br>'fit_intercept': True, 'penalty': 'l2', 'shuffle': False, 'tol': 0.05, 'warm_start': False<br>

### The best cross-validation score was 1.0
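
`best_score_` is the mean cross-validation score on the training split, and the notebook never scores `X_test`. A held-out check could look like the sketch below, which refits an `SGDClassifier` with the reported parameters and measures accuracy on the test split. The split seed, `stratify=y`, and the names `reported_params`, `clf`, and `test_acc` are assumptions, so the number it prints will not necessarily match the notebook's run. The printed estimator also shows `loss='hinge'`, so this model is a linear SVM trained by SGD rather than a logistic regression.

```python
from sklearn.datasets import load_wine
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Fit the scaler on the training split only, then apply it to both splits.
scaler = StandardScaler().fit(X_tr)
X_tr_s, X_te_s = scaler.transform(X_tr), scaler.transform(X_te)

# Parameters copied from the grid search output above.
reported_params = {
    "alpha": 0.01, "average": False, "early_stopping": True,
    "epsilon": 0.0005, "fit_intercept": True, "penalty": "l2",
    "shuffle": False, "tol": 0.05, "warm_start": False,
}

clf = SGDClassifier(random_state=0, **reported_params)
clf.fit(X_tr_s, y_tr)

test_acc = accuracy_score(y_te, clf.predict(X_te_s))
print(test_acc)  # accuracy on data not seen during fitting
```

With only 36 test samples, a single misclassification moves this accuracy by almost three percentage points, so it is a rough estimate.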