# **Ensemble**


*   Builds several models and combines their predictions to form a stronger model
*   Methods are grouped into Voting, Bagging, and Boosting
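
Of the three families listed above, voting is the simplest to illustrate. The sketch below is a minimal hard-voting example in plain Python; the helper `hard_vote` and the three prediction lists are hypothetical and only stand in for the outputs of real classifiers.

```python
from collections import Counter


def hard_vote(predictions):
    """Return the majority label for each sample.

    predictions: list of per-model label lists, all of the same length.
    """
    n_samples = len(predictions[0])
    voted = []
    for i in range(n_samples):
        # Collect the label each model assigned to sample i
        labels = [model_preds[i] for model_preds in predictions]
        voted.append(Counter(labels).most_common(1)[0][0])
    return voted


# Hypothetical outputs of three classifiers on four samples
model_a = [1, 0, 1, 1]
model_b = [1, 1, 0, 1]
model_c = [0, 0, 1, 1]

ensemble = hard_vote([model_a, model_b, model_c])
print(ensemble)
```

Each model is wrong on a different sample here, so the majority vote can be right where any single model is not. That is the basic motivation for combining predictions.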




# **Random Forest** 


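*   Builds many randomized Decision Trees and combines their predictions (scikit-learn averages the trees' class probabilities)
*   At each node, searches for the best split among a random subset of the features
*   Tends to give stable performance among ensemble methods, since averaging many trees reduces overfitting
*   Each tree's training data is a bootstrap sample (drawn with replacement)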
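
The bootstrap sampling in the last bullet can be sketched with the standard library alone. This is an illustration of sampling with replacement, not scikit-learn's internal code; the sample size of 1000 and the seed are arbitrary choices.

```python
import random

random.seed(42)  # fixed seed so the sketch is repeatable

n_samples = 1000
indices = list(range(n_samples))

# Bootstrap sample: draw n_samples indices with replacement,
# so some rows appear several times and others not at all
bootstrap = random.choices(indices, k=n_samples)

in_bag = set(bootstrap)
# Rows that were never drawn are "out-of-bag" for this tree
out_of_bag = [i for i in indices if i not in in_bag]

unique_fraction = len(in_bag) / n_samples
print(f"unique rows in the bootstrap sample: {unique_fraction:.3f}")
print(f"out-of-bag rows:                     {len(out_of_bag) / n_samples:.3f}")
```

For large samples the in-bag fraction approaches 1 - 1/e, roughly 63%, so about a third of the rows are left out of each tree's training data.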



### **Data preprocessing**

In [1]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
# Load the wine dataset; 'class' is the target column
wine = pd.read_csv("https://bit.ly/wine_csv_data")
data = wine[['alcohol', 'sugar', 'pH']].to_numpy()
target = wine['class'].to_numpy()
# Hold out 20% of the rows as a test set
X_train, X_test, y_train, y_test = train_test_split(
    data, target, test_size=0.2, random_state=42
)

### **Random Forest** 

In [29]:
from sklearn.model_selection import cross_validate  # cross-validation
from sklearn.ensemble import RandomForestClassifier  # random forest model
rfc = RandomForestClassifier(n_jobs=-1)  # n_jobs=-1 uses all CPU cores
# Evaluate the random forest with k-fold cross-validation (5 folds by default)
scores = cross_validate(rfc, X_train, y_train, return_train_score=True, n_jobs=-1)
print(np.mean(scores['train_score']))
print(np.mean(scores['test_score']))

0.9974022965603432
0.8916709854149699


### **Feature importances**

In [21]:
rfc.fit(X_train, y_train)
print(rfc.feature_importances_)


[0.22889174 0.50452091 0.26658735]


### **OOB sample performance**

In [16]:
rfc=RandomForestClassifier(oob_score=True, n_jobs=-1)
rfc.fit(X_train, y_train)
print(rfc.oob_score_)

0.895516644217818


# **Extra Trees**


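*   Works like Random Forest, but differs in how it samples data and how it splits nodes
*   Each Decision Tree is built on the entire training set (no bootstrap sampling)
*   Nodes are split at random thresholds instead of searching for the best split (which makes training faster)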
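
The difference between a best split and a random split can be shown on a single feature. The sketch below is a simplified illustration with made-up data and hypothetical helper names; scikit-learn's Extra Trees draws one random threshold per candidate feature and keeps the best of those, which this one-feature toy does not reproduce.

```python
import random


def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)


def weighted_gini(values, labels, threshold):
    """Weighted impurity of the two children after splitting at threshold."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    n = len(labels)
    return len(left) / n * gini(left) + len(right) / n * gini(right)


def best_threshold(values, labels):
    """Exhaustive search over midpoints between sorted unique values."""
    points = sorted(set(values))
    candidates = [(a + b) / 2 for a, b in zip(points, points[1:])]
    return min(candidates, key=lambda t: weighted_gini(values, labels, t))


def random_threshold(values, rng):
    """Extra-Trees style: draw one threshold uniformly between min and max."""
    return rng.uniform(min(values), max(values))


# Toy feature column and labels (made up for illustration)
sugar = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
label = [0, 0, 0, 1, 1, 1]

rng = random.Random(0)
t_best = best_threshold(sugar, label)
t_rand = random_threshold(sugar, rng)

print("best split  :", t_best, weighted_gini(sugar, label, t_best))
print("random split:", t_rand, weighted_gini(sugar, label, t_rand))
```

The random threshold skips the search over candidates, which is where the speed advantage comes from. A single random split is usually worse than the best one, so the method relies on averaging many trees to compensate.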




In [17]:
from sklearn.ensemble import ExtraTreesClassifier
etc=ExtraTreesClassifier(n_jobs=-1)
scores=cross_validate(etc, X_train, y_train, return_train_score=True, n_jobs=-1)
print(np.mean(scores['train_score']))
print(np.mean(scores['test_score']))

0.9974503966084433
0.8891683941659879


# **Gradient Boosting**


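*   Uses shallow Decision Trees (default max_depth=3), which makes it resistant to overfitting
*   "Gradient" refers to gradient descent: trees are added to the ensemble one at a time, each one reducing the loss left by the trees before it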
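
The idea of adding trees by gradient descent can be sketched in a few lines. The example below is a simplified illustration and not scikit-learn's implementation: it assumes a regression task with squared error (where the negative gradient is just the residual) and uses one-split stumps as weak learners. The helper names `fit_stump` and `boost` and the toy data are hypothetical.

```python
def fit_stump(xs, residuals):
    """Find the one-split regression stump that best fits the residuals."""
    points = sorted(set(xs))
    best = None
    for a, b in zip(points, points[1:]):
        t = (a + b) / 2
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lv, rv = sum(left) / len(left), sum(right) / len(right)
        err = sum((r - lv) ** 2 for r in left) + sum((r - rv) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, t, lv, rv)
    _, t, lv, rv = best
    return lambda x: lv if x <= t else rv


def boost(xs, ys, n_estimators=50, learning_rate=0.1):
    """Gradient boosting for squared error with stumps as weak learners."""
    base = sum(ys) / len(ys)  # initial prediction: the mean of the targets
    preds = [base] * len(ys)
    stumps, losses = [], []
    for _ in range(n_estimators):
        # For squared error, the negative gradient is the residual y - F(x)
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        # Take a small step in the direction the new stump suggests
        preds = [p + learning_rate * stump(x) for x, p in zip(xs, preds)]
        losses.append(sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys))
    return base, stumps, losses


# Toy data (made up for illustration)
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.2, 0.9, 3.1, 3.0, 2.8, 5.2, 5.0]

base, stumps, losses = boost(xs, ys)
print(f"loss after first tree: {losses[0]:.4f}")
print(f"loss after last tree:  {losses[-1]:.4f}")
```

A smaller `learning_rate` makes each tree contribute less, so more trees are needed to reach the same training loss. The same trade-off applies to `n_estimators` and `learning_rate` in `GradientBoostingClassifier`.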



In [22]:
from sklearn.ensemble import GradientBoostingClassifier
gb = GradientBoostingClassifier(n_estimators=500, learning_rate=0.2)  # defaults: n_estimators=100, learning_rate=0.1
scores=cross_validate(gb, X_train, y_train, return_train_score=True, n_jobs=-1)
print(np.mean(scores['train_score']))
print(np.mean(scores['test_score']))
gb.fit(X_train, y_train)
print(gb.feature_importances_)

0.9464595437171814
0.8780082549788999
[0.15833642 0.68028419 0.16137939]


# **Histogram-based Gradient Boosting**


In [28]:
from sklearn.ensemble import HistGradientBoostingClassifier
hgb=HistGradientBoostingClassifier()
scores=cross_validate(hgb, X_train, y_train, return_train_score=True, n_jobs=-1)
print(np.mean(scores['train_score']))
print(np.mean(scores['test_score']))

0.9321723946453317
0.8801241948619236


# **XGBoost**

In [24]:
from xgboost import XGBClassifier
xgb=XGBClassifier(tree_method='hist')
scores=cross_validate(xgb, X_train, y_train, return_train_score=True, n_jobs=-1)
print(np.mean(scores['train_score']))
print(np.mean(scores['test_score']))

0.8824322471423747
0.8726214185237284


# **LightGBM**

In [25]:
from lightgbm import LGBMClassifier
lgb = LGBMClassifier()  # histogram-based by default; tree_method is an XGBoost parameter, not a LightGBM one
scores=cross_validate(lgb, X_train, y_train, return_train_score=True, n_jobs=-1)
print(np.mean(scores['train_score']))
print(np.mean(scores['test_score']))

0.9338079582727165
0.8789710890649293
