<a href="https://colab.research.google.com/github/CAVASOL/aiffel_quest/blob/main/ML_node/ML_with_Python_Supervised_Classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Node 5. Supervised Learning (Classification)**

1. Decision Tree
2. Random Forest
3. XGBoost
4. Cross Validation
5. Evaluation (Classification)

In [None]:
#lib
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

In [None]:
# Generate data
from sklearn.datasets import load_breast_cancer

def make_dataset():
  # Load the breast cancer dataset (binary target: 0 = malignant, 1 = benign)
  bc = load_breast_cancer()
  df = pd.DataFrame(bc.data, columns=bc.feature_names)
  df['target'] = bc.target
  X_train, X_test, y_train, y_test = train_test_split(
      df.drop('target', axis=1),
      df['target'],
      test_size=0.5,
      random_state=1004)
  return X_train, X_test, y_train, y_test

X_train, X_test, y_train, y_test = make_dataset()
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((284, 30), (285, 30), (284,), (285,))

In [None]:
#check target
y_train.value_counts()

1    190
0     94
Name: target, dtype: int64

**1. Decision Tree**

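* One of the most widely used techniques in supervised learning (classification)
* Starting at the root of the tree, the data is split on the feature that yields the largest information gain
* Impurity, the basis for choosing a split, is measured with either Gini impurity or entropy. When a node contains only one class, both measures are 0. When two classes are mixed in equal proportion, entropy reaches 1 and Gini impurity reaches its maximum of 0.5. Information gain is the parent node's impurity minus the size-weighted impurity of its child nodes, so maximizing information gain means minimizing the impurity of the children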
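
The impurity measures and information gain described above can be sketched in plain Python for the two-class case. The function names and the example split are assumptions made for illustration; scikit-learn computes these quantities internally.

```python
import math

def gini(p):
    """Gini impurity of a binary node whose positive-class fraction is p."""
    return 1.0 - (p ** 2 + (1.0 - p) ** 2)

def entropy(p):
    """Entropy (in bits) of a binary node whose positive-class fraction is p."""
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))

def information_gain(parent, left, right, impurity):
    """Parent impurity minus the size-weighted impurity of the two children.

    Each node is given as (n_positive, n_total).
    """
    def node_impurity(node):
        n_pos, n_total = node
        return impurity(n_pos / n_total)

    n = parent[1]
    weighted = (left[1] / n) * node_impurity(left) + (right[1] / n) * node_impurity(right)
    return node_impurity(parent) - weighted

print(gini(1.0), entropy(1.0))  # pure node: both are 0.0
print(gini(0.5), entropy(0.5))  # balanced node: 0.5 and 1.0

# Hypothetical split: parent has 10 positives out of 20,
# left child 9 of 10, right child 1 of 10
print(information_gain((10, 20), (9, 10), (1, 10), gini))
```

A split that separates the classes well leaves low-impurity children, so its gain is close to the parent's impurity.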

In [None]:
#Decision Tree
from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier(random_state=0)
model.fit(X_train, y_train)
pred = model.predict(X_test)
accuracy_score(y_test, pred)

0.9263157894736842

**Decision Tree Hyperparameters**

* criterion (default gini): impurity measure (alternatively entropy)
* max_depth(default None)
* min_samples_split(default 2)
* min_samples_leaf(default 1)
* max_features
* min_weight_fraction_leaf
* random_state
* class_weight

In [None]:
#Decision Tree Hyperparameters
from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier(
    criterion='entropy',
    max_depth=4,
    min_samples_leaf=2,
    min_samples_split=5,
    random_state=0)
model.fit(X_train, y_train)
pred = model.predict(X_test)
accuracy_score(y_test, pred)

0.9403508771929825

**2. Random Forest**

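* Built from multiple decision trees
* An ensemble method of the bagging type
* Each tree is trained on a bootstrap sample
* The final prediction is made by majority vote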
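
The two ingredients named above, bootstrap sampling and majority voting, can be sketched without any library. The helper names and the per-tree predictions below are hypothetical and serve only to illustrate the idea.

```python
import random
from collections import Counter

def bootstrap_indices(n_samples, rng):
    """Draw n_samples row indices with replacement (one bootstrap sample)."""
    return [rng.randrange(n_samples) for _ in range(n_samples)]

def majority_vote(per_tree_predictions):
    """Combine per-tree label lists into one label per sample by majority vote."""
    n_samples = len(per_tree_predictions[0])
    return [
        Counter(tree[i] for tree in per_tree_predictions).most_common(1)[0][0]
        for i in range(n_samples)
    ]

rng = random.Random(0)
sample = bootstrap_indices(10, rng)
# Rows never drawn are "out-of-bag" for this tree
out_of_bag = sorted(set(range(10)) - set(sample))
print(sample, out_of_bag)

# Hypothetical predictions from three trees for four samples
votes = [
    [1, 0, 1, 1],
    [1, 1, 0, 1],
    [0, 0, 1, 1],
]
print(majority_vote(votes))  # [1, 0, 1, 1]
```

Note that scikit-learn's `RandomForestClassifier` averages the trees' predicted class probabilities rather than counting hard votes, so this sketch shows the concept, not the library's exact procedure.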

In [None]:
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(
    n_estimators=500,
    max_depth=5,
    random_state=0)
model.fit(X_train, y_train)
pred = model.predict(X_test)
accuracy_score(y_test, pred)

0.9473684210526315

**Random Forest Hyperparameters**

* n_estimators (default 100): number of trees
* criterion (default gini): impurity measure
* max_depth(default None)
* min_samples_split(default 2)
* min_samples_leaf(default 1)

In [None]:
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(
    n_estimators=200,
    max_depth=5,
    random_state=0)
model.fit(X_train, y_train)
pred = model.predict(X_test)
accuracy_score(y_test, pred)

0.9473684210526315

**3. XGBoost, Extreme Gradient Boosting**

* An ensemble algorithm based on boosting
* A strong performer among tree ensemble algorithms
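
To show what boosting means, here is a minimal sketch of gradient boosting with squared error on a single feature, where each round fits a one-split stump to the current residuals. This is a deliberately simplified stand-in and not XGBoost's algorithm, which also uses second-order gradients and regularization. The data and all names are hypothetical.

```python
def fit_stump(xs, residuals):
    """Find the threshold on a 1-D feature that minimizes squared error."""
    best = None
    for t in sorted(set(xs))[:-1]:  # skip the largest value so the right side is never empty
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lv, rv = sum(left) / len(left), sum(right) / len(right)
        err = sum((r - lv) ** 2 for r in left) + sum((r - rv) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, t, lv, rv)
    _, t, lv, rv = best
    return lambda x: lv if x <= t else rv

def boost(xs, ys, n_rounds=20, learning_rate=0.3):
    """Add stumps one at a time, each correcting the errors left by the previous ones."""
    base = sum(ys) / len(ys)          # initial prediction: the mean target
    preds = [base] * len(xs)
    stumps, history = [], []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump(x) for x, p in zip(xs, preds)]
        history.append(sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys))
    return base, stumps, history

# Hypothetical toy data
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [0, 0, 0, 1, 0, 1, 1, 1]
base, stumps, history = boost(xs, ys)
print(history[0], history[-1])  # training error after the first and last round
```

Unlike bagging, where trees are trained independently, each new learner here depends on the errors of all the learners before it.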

In [None]:
from xgboost import XGBClassifier
model = XGBClassifier(random_state=0, use_label_encoder=False, eval_metric='logloss')
model.fit(X_train, y_train)
pred = model.predict(X_test)
accuracy_score(y_test, pred)

0.9508771929824561

**XGBoost Hyperparameters**

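* booster (default gbtree): boosting algorithm (alternatives: dart, gblinear)
* objective (default binary:logistic): binary classification (multiclass: multi:softmax)
* max_depth (default 6): maximum tree depth
* learning_rate (default 0.3 in current XGBoost releases; the older scikit-learn wrapper used 0.1): learning rate
* n_estimators (default 100): number of trees
* subsample (default 1): fraction of training samples used for each tree
* colsample_bytree (default 1): fraction of features used for each tree
* n_jobs (default 1 in older releases): number of cores to use (-1: use all cores)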

In [None]:
from xgboost import XGBClassifier
model = XGBClassifier(random_state=0,
                      use_label_encoder=False,
                      eval_metric='logloss',
                      booster = 'gbtree',
                      objective = 'binary:logistic',
                      max_depth = 5,
                      learning_rate = 0.05,
                      n_estimators = 500,
                      subsample = 1,
                      colsample_bytree = 1,
                      n_jobs = -1
                     )
model.fit(X_train, y_train)
pred = model.predict(X_test)
accuracy_score(y_test, pred)

0.9649122807017544

In [None]:
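# Early stopping: halt when validation logloss has not improved for 10 rounds
# Note: XGBoost 2.0+ takes early_stopping_rounds in the constructor rather than in fit()
# The test set doubles as the validation set here, so the final score is optimistic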
from xgboost import XGBClassifier
model = XGBClassifier(random_state=0,
                      use_label_encoder=False,
                      eval_metric='logloss',
                      learning_rate = 0.05,
                      n_estimators = 500)
eval_set = [(X_test, y_test)]
model.fit(X_train, y_train, eval_set=eval_set, early_stopping_rounds=10)
pred = model.predict(X_test)
accuracy_score(y_test, pred)

[0]	validation_0-logloss:0.65391
[1]	validation_0-logloss:0.61861
[2]	validation_0-logloss:0.58697
[3]	validation_0-logloss:0.55756
[4]	validation_0-logloss:0.53038
[5]	validation_0-logloss:0.50611
[6]	validation_0-logloss:0.48363
[7]	validation_0-logloss:0.46304
[8]	validation_0-logloss:0.44332
[9]	validation_0-logloss:0.42512
[10]	validation_0-logloss:0.40821
[11]	validation_0-logloss:0.39260
[12]	validation_0-logloss:0.37838
[13]	validation_0-logloss:0.36512
[14]	validation_0-logloss:0.35276
[15]	validation_0-logloss:0.34090
[16]	validation_0-logloss:0.33018
[17]	validation_0-logloss:0.31967
[18]	validation_0-logloss:0.30998
[19]	validation_0-logloss:0.30105
[20]	validation_0-logloss:0.29259
[21]	validation_0-logloss:0.28478
[22]	validation_0-logloss:0.27725
[23]	validation_0-logloss:0.27027
[24]	validation_0-logloss:0.26359
[25]	validation_0-logloss:0.25755
[26]	validation_0-logloss:0.25139
[27]	validation_0-logloss:0.24593
[28]	validation_0-logloss:0.24103
[29]	validation_0-loglos... (output truncated)

0.9473684210526315
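
The stopping rule itself is simple to sketch. The function below is a hypothetical helper, not part of XGBoost, and the loss values are made up for illustration. It stops once a given number of consecutive rounds fail to beat the best validation loss seen so far.

```python
def early_stopping_round(losses, patience):
    """Return (best_round, stop_round) for a list of per-round validation losses.

    Training stops once `patience` consecutive rounds fail to improve
    on the best loss seen so far.
    """
    best_round, best_loss = 0, losses[0]
    for i, loss in enumerate(losses[1:], start=1):
        if loss < best_loss:
            best_round, best_loss = i, loss
        elif i - best_round >= patience:
            return best_round, i
    return best_round, len(losses) - 1

# Hypothetical validation losses
losses = [0.60, 0.50, 0.45, 0.46, 0.47, 0.44, 0.45, 0.46, 0.47]
print(early_stopping_round(losses, patience=3))  # (5, 8)
print(early_stopping_round(losses, patience=2))  # (2, 4)
```

A smaller patience stops sooner but can miss a later improvement, as the second call shows.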

**4. Cross Validation**

In [None]:
def make_dataset2():
    bc = load_breast_cancer()
    df = pd.DataFrame(bc.data, columns=bc.feature_names)
    df['target'] = bc.target
    return df.drop('target', axis=1), df['target']
X, y = make_dataset2()

In [None]:
#KFold
from sklearn.model_selection import KFold
model = DecisionTreeClassifier(random_state=0)

kfold = KFold(n_splits=5)
for train_idx, test_idx in kfold.split(X):
    X_train, X_test = X.iloc[train_idx], X.iloc[test_idx]
    y_train, y_test = y.iloc[train_idx], y.iloc[test_idx]

    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    print(accuracy_score(y_test, pred))

0.8771929824561403
0.9122807017543859
0.9473684210526315
0.9385964912280702
0.8407079646017699
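
As a rough sketch of what an unshuffled K-fold split does, the hypothetical generator below cuts the row indices into contiguous folds, giving the first `n_samples % n_splits` folds one extra row. The figure of 569 rows is the total implied by the train and test shapes shown earlier (284 + 285).

```python
def kfold_indices(n_samples, n_splits):
    """Yield (train_idx, test_idx) for contiguous, unshuffled folds.

    The first n_samples % n_splits folds receive one extra sample.
    """
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        size = base + (1 if fold < extra else 0)
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size

sizes = [len(test) for _, test in kfold_indices(569, 5)]
print(sizes)  # [114, 114, 114, 114, 113]
```

Because the folds follow row order, the class mix in each fold depends on how the data happens to be ordered.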


In [None]:
#Stratified KFold
from sklearn.model_selection import StratifiedKFold
model = DecisionTreeClassifier(random_state=0)

kfold = StratifiedKFold(n_splits=5)
for train_idx, test_idx in kfold.split(X, y):
    X_train, X_test = X.iloc[train_idx], X.iloc[test_idx]
    y_train, y_test = y.iloc[train_idx], y.iloc[test_idx]

    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    print(accuracy_score(y_test, pred))

0.9035087719298246
0.9210526315789473
0.9122807017543859
0.9473684210526315
0.9026548672566371
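
Stratification keeps the class ratio roughly equal across folds. A simplified way to picture it is to deal out each class's rows to the folds in turn, as in the sketch below. This is an assumption-laden illustration with hypothetical labels, not the exact procedure `StratifiedKFold` uses.

```python
from collections import defaultdict

def stratified_folds(labels, n_splits):
    """Assign each sample to a test fold, dealing out each class round-robin."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(n_splits)]
    for indices in by_class.values():
        for pos, idx in enumerate(indices):
            folds[pos % n_splits].append(idx)
    return folds

# Hypothetical imbalanced labels: 30 positives followed by 10 negatives
labels = [1] * 30 + [0] * 10
for fold in stratified_folds(labels, 5):
    positives = sum(labels[i] for i in fold)
    print(len(fold), positives / len(fold))  # 8 0.75 for every fold
```

With the same sorted labels, contiguous unshuffled folds would put only negatives in the last fold, which is the situation stratification avoids.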


**Scikit-learn Cross Validation**
The scikit-learn API handles fit (training), predict (prediction), and evaluation internally in a single call

In [None]:
from sklearn.model_selection import cross_val_score
# With an integer cv and a classifier, scikit-learn uses StratifiedKFold
scores = cross_val_score(model, X, y, cv=3)
scores

array([0.88947368, 0.94210526, 0.86243386])

In [None]:
#avg
scores.mean()

0.8980042699340944

In [None]:
#Cross Validation Stratified KFold
kfold = StratifiedKFold(n_splits=5)
scores = cross_val_score(model, X, y, cv=kfold)
scores

array([0.90350877, 0.92105263, 0.9122807 , 0.94736842, 0.90265487])

In [None]:
#avg
scores.mean()

0.9173730787144851

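**5. Evaluation (Classification)**
* Accuracy: the proportion of predictions that match the actual values
* Precision: of the samples predicted positive, the proportion that are actually positive (of the cases predicted as cancer, the proportion that really are cancer)
* Recall: of the samples that are actually positive, the proportion predicted positive (of the actual cancer cases, the proportion predicted as cancer)
* F1: the harmonic mean of precision and recall
* ROC-AUC
    * ROC: the curve of the true positive rate (TPR) plotted against the false positive rate (FPR)
    * AUC: the area under the ROC curve (1 for a perfect classifier)

Note: in `load_breast_cancer`, label 1 is benign and label 0 is malignant, so the scikit-learn scores below, which default to `pos_label=1`, treat benign as the positive class.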
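
The first four metrics follow directly from confusion-matrix counts, as this minimal sketch shows. The counts are an assumption chosen so the results land close to the scores reported for the last fold in this notebook; they are not values the notebook prints.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall, and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)   # of predicted positives, how many are right
    recall = tp / (tp + fn)      # of actual positives, how many were found
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Illustrative counts (an assumption, not notebook output)
acc, prec, rec, f1 = classification_metrics(tp=63, fp=3, fn=8, tn=39)
print(round(acc, 4), round(prec, 4), round(rec, 4), round(f1, 4))
# 0.9027 0.9545 0.8873 0.9197
```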

In [None]:
# Accuracy (y_test and pred are left over from the last StratifiedKFold fold above)
from sklearn.metrics import accuracy_score
accuracy_score(y_test, pred)

0.9026548672566371

In [None]:
# Precision
from sklearn.metrics import precision_score
precision_score(y_test, pred)

0.9545454545454546

In [None]:
# Recall
from sklearn.metrics import recall_score
recall_score(y_test, pred)

0.8873239436619719

In [None]:
#F1
from sklearn.metrics import f1_score
f1_score(y_test, pred)

0.9197080291970803

In [None]:
#ROC-AUC
from sklearn.metrics import roc_auc_score
model = XGBClassifier(random_state=0, use_label_encoder=False, eval_metric='logloss')
model.fit(X_train, y_train)
# predict_proba returns one column per class; column 1 holds the probability of class 1
pred = model.predict_proba(X_test)

roc_auc_score(y_test, pred[:,1])

0.999664654594232
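
One way to read the AUC is as the probability that a randomly chosen positive sample scores higher than a randomly chosen negative one. The sketch below computes it that way by comparing every positive-negative pair. The function name is hypothetical, and the quadratic loop is for clarity, not efficiency.

```python
def roc_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Hypothetical labels and scores: three of the four pairs are ordered correctly
print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

This is why AUC needs probabilities or scores rather than hard class labels: it measures ranking quality across all thresholds.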