<h1><font color="#f37626">[Lifecycle]</font> ML-Classification Example</h1>

-----

`add_experiment` is a method that automatically records the models you want to keep as experiment history during modeling in the accuinsight+ modeler console.

- data: breast cancer data 
    - label: 1 = malignant, 0 = benign
- uses sklearn
- logistic regression
---

### 1. Import packages
- Because sklearn is used, import `accuinsight` from the ML module

In [1]:
from Accuinsight.Lifecycle.ML import accuinsight

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, classification_report, confusion_matrix

2021-12-08 14:15:24.037643: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1


In [2]:
accu = accuinsight()

### 2. Data load and split

In [4]:
data = pd.read_csv('../data/breast_cancer_data.csv', index_col='id')
data.head()

id,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,symmetry_mean,...,radius_worst,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst
842302,M,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,...,25.38,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189
842517,M,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,...,24.99,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902
84300903,M,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,...,23.57,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758
84348301,M,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,0.2597,...,14.91,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173
84358402,M,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,0.1809,...,22.54,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678


In [5]:
# target variable encoding
data[['diagnosis']] = data[['diagnosis']].replace(['M', 'B'], [1, 0])

In [6]:
# distribution of target variable
data[['diagnosis']].value_counts()

diagnosis
0            357
1            212
dtype: int64

In [7]:
# Separate the features from the encoded target column
breast_X = data.drop('diagnosis', axis=1)
breast_y = data.loc[:, 'diagnosis']

# Hold out 20% for testing; stratify on the target so both splits keep the class ratio
X_train, X_test, y_train, y_test = train_test_split(breast_X, breast_y, test_size=0.2, random_state=0, stratify=breast_y)

X_train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 455 entries, 901836 to 90401601
Data columns (total 30 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   radius_mean              455 non-null    float64
 1   texture_mean             455 non-null    float64
 2   perimeter_mean           455 non-null    float64
 3   area_mean                455 non-null    float64
 4   smoothness_mean          455 non-null    float64
 5   compactness_mean         455 non-null    float64
 6   concavity_mean           455 non-null    float64
 7   concave points_mean      455 non-null    float64
 8   symmetry_mean            455 non-null    float64
 9   fractal_dimension_mean   455 non-null    float64
 10  radius_se                455 non-null    float64
 11  texture_se               455 non-null    float64
 12  perimeter_se             455 non-null    float64
 13  area_se                  455 non-null    float64
 14  smoothness_se   

In [8]:
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

(455, 30)
(114, 30)
(455,)
(114,)
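
The shapes above follow from simple split arithmetic. The sketch below is plain Python (no scikit-learn) and rests on one assumption: that a fractional `test_size` is rounded up and the remaining rows go to the training set. The class counts are taken from the `value_counts()` output earlier in this notebook.

```python
import math

# Class counts from the value_counts() output above
n_benign, n_malignant = 357, 212
n_samples = n_benign + n_malignant  # 569 rows in total

test_size = 0.2
# Assumption: the fractional test size is rounded up, the rest is used for training
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

# Share of malignant cases; a stratified split keeps roughly this share in both subsets
malignant_share = n_malignant / n_samples

print(n_train, n_test)            # 455 114
print(round(malignant_share, 3))  # 0.373
```

Because the split is stratified on `diagnosis`, both `y_train` and `y_test` should contain close to 37% malignant cases.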


### 3. Build model

In [9]:
# find appropriate C value with GridSearch
model = GridSearchCV(LogisticRegression(), param_grid={'C': [0.001, 0.005, 0.01, 0.05, 0.1, 1]}, cv=5)
model.fit(X_train, y_train)

GridSearchCV(cv=5, estimator=LogisticRegression(),
             param_grid={'C': [0.001, 0.005, 0.01, 0.05, 0.1, 1]})

In [10]:
print(model.best_params_)
print(model.best_score_)

{'C': 0.1}
0.945054945054945
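
`best_score_` is the mean cross-validated score of the winning parameter setting. As a rough sketch of that selection step, the plain-Python example below averages per-fold scores for each candidate `C` and keeps the highest mean. `pick_best` is a hypothetical helper, and the fold accuracies are made-up toy numbers, not results from this notebook.

```python
from statistics import mean

def pick_best(candidate_scores):
    """Return (param_value, mean_score) for the candidate with the highest mean fold score."""
    means = {value: mean(scores) for value, scores in candidate_scores.items()}
    best_value = max(means, key=means.get)
    return best_value, means[best_value]

# Toy fold accuracies (5 folds per candidate), for illustration only
toy_scores = {
    0.01: [0.90, 0.92, 0.91, 0.93, 0.89],
    0.1:  [0.94, 0.95, 0.93, 0.96, 0.92],
    1:    [0.93, 0.94, 0.92, 0.95, 0.91],
}

best_c, best_mean = pick_best(toy_scores)
print(best_c, round(best_mean, 2))
```

In the notebook itself, the per-fold scores behind `best_score_` are available in `model.cv_results_`.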


In [11]:
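# GridSearchCV already refits the best estimator on the full training set (refit=True by default),
# so the extra fit below is redundant but harmless
logistic = model.best_estimator_
logistic.fit(X_train, y_train)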

LogisticRegression(C=0.1)

In [12]:
logistic.get_params()

{'C': 0.1,
 'class_weight': None,
 'dual': False,
 'fit_intercept': True,
 'intercept_scaling': 1,
 'l1_ratio': None,
 'max_iter': 100,
 'multi_class': 'auto',
 'n_jobs': None,
 'penalty': 'l2',
 'random_state': None,
 'solver': 'lbfgs',
 'tol': 0.0001,
 'verbose': 0,
 'warm_start': False}

### 4. Model evaluation

In [13]:
pred_test = logistic.predict(X_test)

accuracy = accuracy_score(y_test, pred_test)
f1score = f1_score(y_test, pred_test)
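
For reference, both metrics can be derived from the four confusion-matrix counts. The sketch below is plain Python on made-up labels (not this notebook's predictions). `binary_scores` is a hypothetical helper, and the positive class is assumed to be 1 (malignant), which matches the default `pos_label=1` of `f1_score`.

```python
def binary_scores(y_true, y_pred, positive=1):
    """Compute accuracy and F1 for binary labels from confusion-matrix counts."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return accuracy, f1

# Toy labels for illustration only: 3 TP, 3 TN, 1 FP, 1 FN
toy_true = [1, 1, 1, 0, 0, 0, 0, 1]
toy_pred = [1, 0, 1, 0, 0, 1, 0, 1]

toy_accuracy, toy_f1 = binary_scores(toy_true, toy_pred)
print(toy_accuracy, toy_f1)
```

With imbalanced classes such as the 357/212 split here, F1 on the malignant class is a useful complement to accuracy because it ignores true negatives.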

### 5. Run model with experiment

#### [optional] 1-(1) Message settings

There are two ways to push a message.
1. Push a message when model training completes  
    - `send_message(message = 'your_message')`  
    
    
2. Push a message only when the training metric exceeds a given threshold
    - `send_message(thresholds = 0.5)`
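
How the threshold is compared is not spelled out here. As a rough sketch only, assuming the message is pushed when the metric is strictly greater than the threshold, the gating logic might look like the following. `should_push` is a hypothetical helper written for illustration and is not part of the accuinsight API.

```python
def should_push(metric_value, threshold=None):
    """Hypothetical gate: always push without a threshold, otherwise only above it."""
    if threshold is None:
        # Mode 1: a plain message is pushed whenever training completes
        return True
    # Mode 2 (assumption): strict "greater than" comparison against the threshold
    return metric_value > threshold

print(should_push(0.93))                 # True
print(should_push(0.93, threshold=0.5))  # True
print(should_push(0.42, threshold=0.5))  # False
```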

In [14]:
accu.send_message('[ML-binary-classification] Model training complete')

#### [optional] 1-(2) Alarm method settings
- Web push is the default method; when a message is set, the alert is sent automatically.
- slack: enter the hook URL of the Slack channel.
- mail: enter the mail address.

In [15]:
# accu.set_slack(hook_url='hook_url')

__To use the data drift feature, pass the `model_monitor=True` option to store feature importances.__  
This option is not available for unstructured data.

In [16]:
with accu.add_experiment(logistic, X_train, y_train, X_test, y_test, model_monitor=True) as exp:
    exp.log_params('max_iter')
    exp.log_params('C')
    exp.log_metrics('Accuracy', accuracy)
    exp.log_metrics('F1-score', f1score)

Using add_experiment(model_monitor=True)


### 6. Load saved model
- When you record a model's training history in Lifecycle with `add_experiment()`, the model file is saved automatically.
- You can load the saved model to share it with collaborators or to retrain it.

    1. Open the Accuinsight+ workspace list or the model's detail page and, under _Experiment_, copy the __Run name__ of the model you want to load.
    2. Call `load_model()` from ___utils___ to load the model.

In [16]:
from Accuinsight.Lifecycle.utils import load_model

loaded = load_model('LogisticRegression-FEC350F1A4AA4CC99E21EDB2D0D88392_146')

In [17]:
loaded.coef_

array([[ 0.50901549,  0.15411692,  0.33345012, -0.01079167, -0.01783694,
        -0.08635994, -0.12171115, -0.05012436, -0.03069576, -0.00552263,
         0.01754543,  0.19257186,  0.02587546, -0.07332319, -0.00179177,
        -0.01974638, -0.02803761, -0.00696167, -0.00750329, -0.00162289,
         0.50901535, -0.26793484, -0.18165898, -0.01676528, -0.03292529,
        -0.28051571, -0.35105002, -0.09993797, -0.09331063, -0.0258582 ]])
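
`coef_` holds one weight per feature (30 here). As a minimal sketch of how a binary logistic regression turns such weights into a probability, the plain-Python example below applies the sigmoid to a weighted sum. `predict_proba_positive` is a hypothetical helper, and the weights, intercept, and feature values are toy numbers chosen for illustration, not values from the loaded model.

```python
import math

def predict_proba_positive(weights, intercept, features):
    """Probability of the positive class: sigmoid(intercept + w . x)."""
    z = intercept + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

# Toy two-feature model, for illustration only
toy_weights = [0.5, -0.25]
toy_intercept = 0.0
toy_features = [3.0, 2.0]  # z = 0.5*3.0 - 0.25*2.0 = 1.0

p_malignant = predict_proba_positive(toy_weights, toy_intercept, toy_features)
predicted_label = int(p_malignant > 0.5)  # 1 = malignant, 0 = benign
print(round(p_malignant, 4), predicted_label)
```

A positive weight pushes the prediction toward class 1 (malignant) as the feature grows, and a negative weight pushes it toward class 0. The weights are only comparable across features when the features share a common scale.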