### Ensemble: combining multiple learners, typically decision-tree-based algorithms
[Summary of ensemble types]

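* (1) Voting : combines classifiers built with different algorithms; scikit-learn provides the VotingClassifier class

 <1> Hard voting : the final class is decided by majority vote over the classes predicted by the individual classifiers <br>
 <2> Soft voting : each classifier outputs class probabilities, these are averaged, and the class with the highest average probability becomes the result <br> <br>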

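* (2) Bagging : combines learners of the same algorithm type, each trained on a differently sampled subset of the data; RandomForest is the representative example. Short for Bootstrap Aggregating <br>
 ( Bootstrapping : sampling with replacement to create several datasets that may overlap )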
 <br>

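* (3) Boosting : several classifiers are trained sequentially, and each round boosts the weights of the samples that were mispredicted so far; examples are XGBoost (widely used in top-ranking Kaggle solutions) and LightGBM

 AdaBoost algorithm reference:  https://dohk.tistory.com/217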
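
The difference between hard and soft voting described above can be sketched in a few lines of plain Python. The three probability rows are hypothetical values chosen for illustration, and the helper names `hard_vote` and `soft_vote` are not part of scikit-learn; this is only a sketch of the idea, not how VotingClassifier is implemented.

```python
from collections import Counter

# Hypothetical class-probability outputs of three classifiers for one sample:
# each row is [P(class 0), P(class 1)].
probas = [
    [0.45, 0.55],  # classifier A leans towards class 1
    [0.40, 0.60],  # classifier B leans towards class 1
    [0.90, 0.10],  # classifier C is confident in class 0
]

def hard_vote(probas):
    """Majority vote over each classifier's predicted label."""
    labels = [max(range(len(p)), key=p.__getitem__) for p in probas]
    return Counter(labels).most_common(1)[0][0]

def soft_vote(probas):
    """Average the probabilities per class and pick the largest average."""
    n_classes = len(probas[0])
    avg = [sum(p[c] for p in probas) / len(probas) for c in range(n_classes)]
    return max(range(n_classes), key=avg.__getitem__)

print("hard voting:", hard_vote(probas))  # two of three classifiers pick class 1
print("soft voting:", soft_vote(probas))  # averaged probability favours class 0
```

In this toy case the two schemes disagree: the majority picks class 1, while the one confident classifier pulls the averaged probability towards class 0.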

### Bagging

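### Random Forest (RandomForest)
- A classification algorithm based on decision trees
- An ensemble that uses many decision trees of the same type; comparatively fast to run, since the trees are independent and can be trained in parallel
- The current notion of the random forest comes from a paper by Leo Breiman. The paper describes building a forest of uncorrelated trees with a CART (Classification And Regression Tree)-like procedure, combined with randomized node optimization (RNO) and bagging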
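
The bootstrapping step behind bagging can be illustrated with a short sketch using only the standard library. The sample count and the helper name `bootstrap_sample` are assumptions for illustration. Each tree in a bagged ensemble would get its own sample drawn this way.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is reproducible
n_samples = 10_000
indices = list(range(n_samples))

def bootstrap_sample(indices, rng):
    """Draw len(indices) items with replacement, so duplicates are expected."""
    return [rng.choice(indices) for _ in indices]

sample = bootstrap_sample(indices, rng)
unique_fraction = len(set(sample)) / n_samples
print("sample size:", len(sample))
print("fraction of distinct rows:", round(unique_fraction, 3))
```

Because rows are drawn with replacement, a bootstrap sample of size n contains roughly 63% of the distinct original rows on average (1 - 1/e); the remaining rows are left out of that sample.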

In [1]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn import metrics

from sklearn.datasets import load_breast_cancer

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [2]:
# Breast Cancer Wisconsin
dataset = load_breast_cancer()
type(dataset)  # Bunch
dataset.data.shape   # (569, 30)
dataset.target.shape # (569,) , 0: malignant, 1: benign

x_features = dataset.data # X, features
y_label = dataset.target  # y, labels

cancer_df = pd.DataFrame(data=x_features,columns=dataset.feature_names)
cancer_df['target'] = y_label
cancer_df

index,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,...,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension,target
0,17.99,10.38,122.80,1001.0,0.11840,0.27760,0.30010,0.14710,0.2419,0.07871,...,17.33,184.60,2019.0,0.16220,0.66560,0.7119,0.2654,0.4601,0.11890,0
1,20.57,17.77,132.90,1326.0,0.08474,0.07864,0.08690,0.07017,0.1812,0.05667,...,23.41,158.80,1956.0,0.12380,0.18660,0.2416,0.1860,0.2750,0.08902,0
2,19.69,21.25,130.00,1203.0,0.10960,0.15990,0.19740,0.12790,0.2069,0.05999,...,25.53,152.50,1709.0,0.14440,0.42450,0.4504,0.2430,0.3613,0.08758,0
3,11.42,20.38,77.58,386.1,0.14250,0.28390,0.24140,0.10520,0.2597,0.09744,...,26.50,98.87,567.7,0.20980,0.86630,0.6869,0.2575,0.6638,0.17300,0
4,20.29,14.34,135.10,1297.0,0.10030,0.13280,0.19800,0.10430,0.1809,0.05883,...,16.67,152.20,1575.0,0.13740,0.20500,0.4000,0.1625,0.2364,0.07678,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
564,21.56,22.39,142.00,1479.0,0.11100,0.11590,0.24390,0.13890,0.1726,0.05623,...,26.40,166.10,2027.0,0.14100,0.21130,0.4107,0.2216,0.2060,0.07115,0
565,20.13,28.25,131.20,1261.0,0.09780,0.10340,0.14400,0.09791,0.1752,0.05533,...,38.25,155.00,1731.0,0.11660,0.19220,0.3215,0.1628,0.2572,0.06637,0
566,16.60,28.08,108.30,858.1,0.08455,0.10230,0.09251,0.05302,0.1590,0.05648,...,34.12,126.70,1124.0,0.11390,0.30940,0.3403,0.1418,0.2218,0.07820,0
567,20.60,29.33,140.10,1265.0,0.11780,0.27700,0.35140,0.15200,0.2397,0.07016,...,39.42,184.60,1821.0,0.16500,0.86810,0.9387,0.2650,0.4087,0.12400,0


In [3]:
print(dataset.target_names)
print(cancer_df['target'].value_counts())

# Split the data into train (80%) and test (20%) sets
X_train,X_test,y_train,y_test = train_test_split(x_features,y_label,
                                                test_size=0.2,
                                                random_state=0)
print(X_train.shape) # (455, 30)
print(X_test.shape)  # (114, 30)
# type(X_train)      # ndarray

['malignant' 'benign']
1    357
0    212
Name: target, dtype: int64
(455, 30)
(114, 30)


In [4]:
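'RandomForestClassifier() parameter description'
# RandomForestClassifier(
#     n_estimators=100, (number of decision trees, default=100; more trees tend to give better, more stable results but slow down training)
#     criterion='gini',
#     max_depth=None,   (maximum tree depth; same as the decision tree parameter)
#     min_samples_split=2,(minimum number of samples required to split a node, default=2; used to control overfitting; same as the decision tree parameter)
#     min_samples_leaf=1,(minimum number of samples required at a leaf node, default=1; same as the decision tree parameter)
#     min_weight_fraction_leaf=0.0,
#     max_features='auto', (maximum number of features considered for the best split; same as the decision tree parameter; for classifiers 'auto' means sqrt(n_features), and newer scikit-learn releases spell it 'sqrt')
#     max_leaf_nodes=None, (maximum number of leaf nodes; same as the decision tree parameter)
#     min_impurity_decrease=0.0,
#     min_impurity_split=None, (deprecated; removed in scikit-learn 1.0)
#     bootstrap=True,
#     oob_score=False,
#     n_jobs=None,      (number of CPU cores used in parallel; -1 uses all cores)
#     random_state=None,(random seed)
#     verbose=0,
#     warm_start=False,
#     class_weight=None,
# )

'RandomForestClassifier() parameter description'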
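
To make the `max_features` parameter more concrete, here is a simplified sketch of how a random subset of features might be drawn at each node under the square-root rule. The helper `candidate_features` is hypothetical, and scikit-learn's actual split search differs in detail; the only point illustrated is that each split sees a small random subset of the 30 features, which helps decorrelate the trees.

```python
import math
import random

n_features = 30  # number of features in the breast cancer dataset used above
max_features = int(math.sqrt(n_features))  # square-root rule for classification

rng = random.Random(0)  # fixed seed for a reproducible sketch

def candidate_features(n_features, max_features, rng):
    """Pick the feature indices one node may consider for its split."""
    return sorted(rng.sample(range(n_features), max_features))

print("features considered per split:", max_features)
for node in range(3):
    print("node", node, "candidates:", candidate_features(n_features, max_features, rng))
```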

In [18]:
# from sklearn.ensemble import RandomForestClassifier

# Train
clf = RandomForestClassifier(n_estimators=100,random_state=10,n_jobs=-1)
clf.fit(X_train,y_train)

# Predict
pred = clf.predict(X_test)

# Measure accuracy
ac_score = metrics.accuracy_score(y_test,pred)
print('accuracy:',ac_score)  # 0.9649122807017544

# Note: roc_auc_score receives hard 0/1 predictions here, not predicted probabilities
roc_auc = metrics.roc_auc_score(y_test,pred)
print('roc_auc:',roc_auc)  # 0.9637980311209908

accuracy: 0.9649122807017544
roc_auc: 0.9637980311209908
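
The cells in this notebook pass hard 0/1 predictions to `roc_auc_score`. ROC AUC is normally computed from scores or probabilities (with scikit-learn, for example, `clf.predict_proba(X_test)[:, 1]`), and the two can give different values. The sketch below uses a small hand-written `roc_auc` helper and hypothetical toy data to show the difference; it is not scikit-learn's implementation.

```python
def roc_auc(y_true, scores):
    """AUC as the chance that a random positive outranks a random negative.

    Ties between a positive and a negative count as half a win.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical labels and predicted probabilities of class 1
y_true = [0, 0, 0, 1, 1, 1]
proba = [0.1, 0.3, 0.6, 0.4, 0.8, 0.9]
labels = [int(p >= 0.5) for p in proba]  # hard predictions at a 0.5 threshold

print("AUC from probabilities:", round(roc_auc(y_true, proba), 4))   # 8/9
print("AUC from hard labels:  ", round(roc_auc(y_true, labels), 4))  # 2/3
```

With hard labels the curve has a single operating point, so the value reduces to the average of the true positive rate and true negative rate; the ranking information in the probabilities is lost.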


In [20]:
print(len(clf.estimators_) ) #  100 DecisionTreeClassifier instances
# clf.estimators_

100


### Boosting
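 A learning method in which several weak learners (classifiers) train and predict sequentially, giving more weight to the data that was mispredicted so that errors are corrected step by step<br>
XGBoost (widely used in top-ranking Kaggle solutions), LightGBM (fast)
* (1) AdaBoost (Adaptive Boosting) : sequentially assigns weights to the individual weak learners and combines them to make the prediction
* (2) GBM (Gradient Boosting Machine) : similar to AdaBoost, but the updates are derived with gradient descent on a loss function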

 https://roytravel.tistory.com/52
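
As a worked example of the weighting idea, the sketch below runs one round of the classic binary AdaBoost update on hypothetical toy data. The five samples and the single misclassification are assumptions for illustration; library implementations (including scikit-learn's) may use variants of these formulas.

```python
import math

# Hypothetical toy round: 5 samples with equal initial weights,
# and a weak learner that misclassifies only the last sample.
weights = [0.2] * 5
correct = [True, True, True, True, False]

# Weighted error of the weak learner
err = sum(w for w, ok in zip(weights, correct) if not ok)

# Learner weight: how much say this learner gets in the final vote
alpha = 0.5 * math.log((1 - err) / err)

# Raise the weights of misclassified samples, lower the rest, then normalize
updated = [w * math.exp(-alpha if ok else alpha) for w, ok in zip(weights, correct)]
total = sum(updated)
updated = [w / total for w in updated]

print("error:", err)
print("alpha:", round(alpha, 4))
print("new weights:", [round(w, 4) for w in updated])
```

After the update the misclassified sample carries half of the total weight (0.5 versus 0.125 for each of the others), so the next weak learner is pushed to get it right.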

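### XGBoost (eXtreme Gradient Boosting)
: Faster than plain GBM, and resistant to overfitting thanks to its built-in regularization.
<br>
Tree pruning removes splits that yield no positive gain, reducing the number of splits
<br>
Early stopping halts training before all n_estimators rounds are run when the evaluation error stops decreasing
<br>
Originally a C/C++ library; the XGBoost developers later added a Python package, including a scikit-learn-compatible wrapper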
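
The early stopping rule can be sketched independently of any library. The function `early_stopping_round` and the loss values are hypothetical; the sketch only shows the "stop after N rounds without improvement" logic, not XGBoost's internal implementation.

```python
def early_stopping_round(val_losses, patience):
    """Return (best_round, stop_round) for per-round validation losses.

    Training stops once `patience` consecutive rounds fail to improve
    on the best loss seen so far.
    """
    best_loss = float("inf")
    best_round = -1
    for i, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_round = loss, i
        elif i - best_round >= patience:
            return best_round, i
    # Never triggered: all rounds were run
    return best_round, len(val_losses) - 1

# Hypothetical validation logloss per boosting round
losses = [0.60, 0.45, 0.38, 0.35, 0.36, 0.37, 0.36, 0.39]
print(early_stopping_round(losses, patience=3))  # (3, 6)
```

Here the best loss is reached at round 3, and training stops at round 6 after three rounds without improvement, so the last round is never run.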

In [21]:
# from sklearn.ensemble import GradientBoostingClassifier # plain GBM

In [24]:
# !pip install xgboost

In [32]:
from xgboost import XGBClassifier

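# Evaluation set monitored for early stopping (the test set is reused here for simplicity)
evals = [(X_test,y_test)]

xgb = XGBClassifier(n_estimators=400,learning_rate=0.1,max_depth=3)

# Train
# Note: xgboost 2.0 and later expect early_stopping_rounds and eval_metric
# in the XGBClassifier constructor rather than in fit()
xgb.fit(X_train,y_train,early_stopping_rounds=10,eval_set=evals,
               eval_metric='logloss', verbose=False)
# Predict
pred = xgb.predict(X_test)

# Measure accuracy
ac_score = metrics.accuracy_score(y_test,pred)
print('accuracy:',ac_score)  # 0.9736842105263158

roc_auc = metrics.roc_auc_score(y_test,pred)
print('roc_auc:',roc_auc)  # 0.971260717688155



accuracy: 0.9736842105263158
roc_auc: 0.971260717688155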


### LightGBM
Developed by Microsoft <br>
A boosting library that is faster than XGBoost and uses less memory; provides a scikit-learn wrapper <br>
Can overfit on small datasets; best suited to datasets with 10,000 or more rows

In [35]:
# Install the lightgbm module
# ! pip install lightgbm

In [34]:
import lightgbm
lightgbm.__version__

'3.3.5'

In [38]:
from lightgbm import LGBMClassifier
evals = [(X_test,y_test)]

lgbm = LGBMClassifier(n_estimators=400,learning_rate=0.1,max_depth=3)

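# Train
# Note: lightgbm 4.0 and later removed early_stopping_rounds and verbose from fit();
# use callbacks=[lightgbm.early_stopping(100), lightgbm.log_evaluation(0)] there instead
lgbm.fit(X_train,y_train,early_stopping_rounds=100,eval_set=evals,
               eval_metric='logloss', verbose=False)
# Predict
pred = lgbm.predict(X_test)

# Measure accuracy
ac_score = metrics.accuracy_score(y_test,pred)
print('accuracy:',ac_score)  # 0.9649122807017544

roc_auc = metrics.roc_auc_score(y_test,pred)
print('roc_auc:',roc_auc)  # 0.9606224198158145



accuracy: 0.9649122807017544
roc_auc: 0.9606224198158145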


### sklearn classifiers that are not ensemble models

### SVM (Support Vector Machine)

In [39]:
from sklearn import svm

clf = svm.SVC(kernel='linear')
clf.fit(X_train,y_train)

# Predict
pred = clf.predict(X_test)

# Measure accuracy
ac_score = metrics.accuracy_score(y_test,pred)
print('accuracy:',ac_score)

roc_auc = metrics.roc_auc_score(y_test,pred)
print('roc_auc:',roc_auc)

accuracy: 0.956140350877193
roc_auc: 0.9595109558590028


### K-Nearest Neighbors(KNN)

In [40]:
from sklearn.neighbors import KNeighborsClassifier

clf = KNeighborsClassifier()
clf.fit(X_train,y_train)

# Predict
pred = clf.predict(X_test)

# Measure accuracy
ac_score = metrics.accuracy_score(y_test,pred)
print('accuracy:',ac_score)

roc_auc = metrics.roc_auc_score(y_test,pred)
print('roc_auc:',roc_auc)

accuracy: 0.9385964912280702
roc_auc: 0.938234360114322


### Naive Bayes

In [41]:
from sklearn.naive_bayes import GaussianNB

clf = GaussianNB()
clf.fit(X_train,y_train)

# Predict
pred = clf.predict(X_test)

# Measure accuracy
ac_score = metrics.accuracy_score(y_test,pred)
print('accuracy:',ac_score)

roc_auc = metrics.roc_auc_score(y_test,pred)
print('roc_auc:',roc_auc)

accuracy: 0.9298245614035088
roc_auc: 0.9275960622419815


### Multi-layer Perceptron (MLP): Neural Networks

In [42]:
from sklearn.neural_network import MLPClassifier

clf = MLPClassifier(solver='lbfgs', alpha=1e-5,
                    hidden_layer_sizes=(10, 4), random_state=1)
clf.fit(X_train,y_train)

# Predict
pred = clf.predict(X_test)

# Measure accuracy
ac_score = metrics.accuracy_score(y_test,pred)
print('accuracy:',ac_score)

roc_auc = metrics.roc_auc_score(y_test,pred)
print('roc_auc:',roc_auc)

accuracy: 0.9385964912280702
roc_auc: 0.938234360114322


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
  self.n_iter_ = _check_optimize_result("lbfgs", opt_res, self.max_iter)
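
The warning above suggests raising `max_iter` or scaling the data. A minimal sketch of doing both follows, assuming the same dataset and split as earlier in the notebook; the `max_iter=1000` value and the name `scaled_mlp` are choices made here for illustration, and the resulting score is not claimed to match any number above.

```python
from sklearn import metrics
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Same dataset and split as in the cells above
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Standardize the features, then fit the same MLP with a larger iteration budget
scaled_mlp = make_pipeline(
    StandardScaler(),
    MLPClassifier(solver='lbfgs', alpha=1e-5, hidden_layer_sizes=(10, 4),
                  random_state=1, max_iter=1000),
)
scaled_mlp.fit(X_train, y_train)

pred = scaled_mlp.predict(X_test)
acc = metrics.accuracy_score(y_test, pred)
print('accuracy:', acc)
```

Putting the scaler inside the pipeline means it is fit on the training split only, so no statistics from the test split leak into preprocessing. The same pattern would apply to the SVM and KNN cells, which are also sensitive to feature scale.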
