# Ensemble

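* A method that combines the predictions of several models to improve generalization and robustness
* There are two broad families of ensemble methods
  * Averaging methods
    * Build several estimators independently, then average their predictions
    * The combined estimator usually outperforms any single estimator because its variance is reduced
  * Boosting methods
    * Build the models sequentially
    * Each new model tries to reduce the bias of the combined model
    * The goal is to combine several weak models into one powerful ensemble model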
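
The variance argument behind averaging methods can be checked with a small simulation. This is only an illustrative sketch: `noisy_estimate` and `averaged_estimate` are hypothetical helpers, and the simulation assumes the individual estimates are independent and unbiased, which real ensemble members only approximate.

```python
import random
import statistics

random.seed(0)


def noisy_estimate(true_value=1.0, noise=1.0):
    """One high-variance estimate of true_value (hypothetical stand-in for a single model)."""
    return random.gauss(true_value, noise)


def averaged_estimate(n_estimators=10):
    """Average of n independent estimates, mimicking an averaging ensemble."""
    return statistics.fmean([noisy_estimate() for _ in range(n_estimators)])


n_trials = 2000
single = [noisy_estimate() for _ in range(n_trials)]
averaged = [averaged_estimate() for _ in range(n_trials)]

var_single = statistics.variance(single)
var_averaged = statistics.variance(averaged)
print("variance of a single estimate : {:.3f}".format(var_single))
print("variance of a 10-way average  : {:.3f}".format(var_averaged))
```

Under these assumptions the variance of the average is the single-estimate variance divided by the number of estimates. Correlated models would show a smaller reduction.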

## Bagging meta-estimator

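* Bagging is short for bootstrap aggregating
* Trains several models, each on a random subset of the original training set
* Combines the individual results into the final prediction
* Reduces variance and helps prevent overfitting
* Works well with strong, complex models (for example, fully grown decision trees)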
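
The steps above can be sketched by hand before turning to `BaggingClassifier`. The snippet is a minimal illustration under stated assumptions: the wine dataset, 25 trees, a 70/30 split, and a plain majority vote are all choices made here for the example. The library's own estimator may combine predictions differently (for instance by averaging predicted probabilities).

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

rng = np.random.default_rng(0)
n_estimators = 25
all_votes = []
for i in range(n_estimators):
    # Bootstrap: draw as many row indices as there are training rows, with replacement
    idx = rng.integers(0, len(X_train), size=len(X_train))
    tree = DecisionTreeClassifier(random_state=i).fit(X_train[idx], y_train[idx])
    all_votes.append(tree.predict(X_test))

votes = np.stack(all_votes)  # shape: (n_estimators, n_test_samples)

# Aggregate: majority vote over the trees for each test sample
n_classes = len(np.unique(y))
bagged_pred = np.array(
    [np.bincount(column, minlength=n_classes).argmax() for column in votes.T]
)

bagged_acc = (bagged_pred == y_test).mean()
single_acc = (
    DecisionTreeClassifier(random_state=0).fit(X_train, y_train).score(X_test, y_test)
)
print("single tree accuracy : {:.4f}".format(single_acc))
print("bagged trees accuracy: {:.4f}".format(bagged_acc))
```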

In [36]:
from sklearn.datasets import load_iris, load_wine, load_breast_cancer
# NOTE: load_boston was removed in scikit-learn 1.2; this import needs an older version.
from sklearn.datasets import load_boston, load_diabetes
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_validate

In [5]:
from sklearn.ensemble import BaggingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

In [42]:
from sklearn.ensemble import BaggingRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor

### Classification with Bagging

#### Loading the datasets

In [7]:
iris = load_iris()
wine = load_wine()
cancer = load_breast_cancer()

#### KNN

##### Iris data

In [8]:
base_model = make_pipeline(
    StandardScaler(),
    KNeighborsClassifier()
)

bagging_model = BaggingClassifier(base_model, n_estimators=10, max_samples=0.5, max_features=0.5)

In [10]:
cross_val = cross_validate(
    estimator=base_model,
    X=iris.data, y=iris.target,
    cv=5
)
print("avg fit time : {} (+/- {})".format(cross_val["fit_time"].mean(), cross_val["fit_time"].std()))
print("avg score time : {} (+/- {})".format(cross_val["score_time"].mean(), cross_val["score_time"].std()))
print("avg test score : {} (+/- {})".format(cross_val["test_score"].mean(), cross_val["test_score"].std()))

avg fit time : 0.001415395736694336 (+/- 0.0005992305128886332)
avg score time : 0.0020985126495361326 (+/- 0.0006733710154526535)
avg test score : 0.96 (+/- 0.024944382578492935)


In [11]:
cross_val = cross_validate(
    estimator=bagging_model,
    X=iris.data, y=iris.target,
    cv=5
)
print("avg fit time : {} (+/- {})".format(cross_val["fit_time"].mean(), cross_val["fit_time"].std()))
print("avg score time : {} (+/- {})".format(cross_val["score_time"].mean(), cross_val["score_time"].std()))
print("avg test score : {} (+/- {})".format(cross_val["test_score"].mean(), cross_val["test_score"].std()))

avg fit time : 0.012061214447021485 (+/- 0.005768281702950531)
avg score time : 0.0032387733459472655 (+/- 0.0007328310242160943)
avg test score : 0.9533333333333334 (+/- 0.04521553322083511)


##### Wine data

In [12]:
base_model = make_pipeline(
    StandardScaler(),
    KNeighborsClassifier()
)

bagging_model = BaggingClassifier(base_model, n_estimators=10, max_samples=0.5, max_features=0.5)

In [13]:
cross_val = cross_validate(
    estimator=base_model,
    X=wine.data, y=wine.target,
    cv=5
)
print("avg fit time : {} (+/- {})".format(cross_val["fit_time"].mean(), cross_val["fit_time"].std()))
print("avg score time : {} (+/- {})".format(cross_val["score_time"].mean(), cross_val["score_time"].std()))
print("avg test score : {} (+/- {})".format(cross_val["test_score"].mean(), cross_val["test_score"].std()))

avg fit time : 0.00148773193359375 (+/- 0.0005281331995077764)
avg score time : 0.0023520469665527345 (+/- 0.0009040415830707362)
avg test score : 0.9493650793650794 (+/- 0.037910929811115976)


In [14]:
cross_val = cross_validate(
    estimator=bagging_model,
    X=wine.data, y=wine.target,
    cv=5
)
print("avg fit time : {} (+/- {})".format(cross_val["fit_time"].mean(), cross_val["fit_time"].std()))
print("avg score time : {} (+/- {})".format(cross_val["score_time"].mean(), cross_val["score_time"].std()))
print("avg test score : {} (+/- {})".format(cross_val["test_score"].mean(), cross_val["test_score"].std()))

avg fit time : 0.012511587142944336 (+/- 0.00565176088782645)
avg score time : 0.0034942626953125 (+/- 0.0007973450616712772)
avg test score : 0.9384126984126985 (+/- 0.020454645980853996)


##### Breast cancer data

In [15]:
base_model = make_pipeline(
    StandardScaler(),
    KNeighborsClassifier()
)

bagging_model = BaggingClassifier(base_model, n_estimators=10, max_samples=0.5, max_features=0.5)

In [16]:
cross_val = cross_validate(
    estimator=base_model,
    X=cancer.data, y=cancer.target,
    cv=5
)
print("avg fit time : {} (+/- {})".format(cross_val["fit_time"].mean(), cross_val["fit_time"].std()))
print("avg score time : {} (+/- {})".format(cross_val["score_time"].mean(), cross_val["score_time"].std()))
print("avg test score : {} (+/- {})".format(cross_val["test_score"].mean(), cross_val["test_score"].std()))

avg fit time : 0.00142364501953125 (+/- 0.0008537683683582184)
avg score time : 0.010377311706542968 (+/- 0.006998926483571368)
avg test score : 0.9648501785437045 (+/- 0.009609970350036127)


In [17]:
cross_val = cross_validate(
    estimator=bagging_model,
    X=cancer.data, y=cancer.target,
    cv=5
)
print("avg fit time : {} (+/- {})".format(cross_val["fit_time"].mean(), cross_val["fit_time"].std()))
print("avg score time : {} (+/- {})".format(cross_val["score_time"].mean(), cross_val["score_time"].std()))
print("avg test score : {} (+/- {})".format(cross_val["test_score"].mean(), cross_val["test_score"].std()))

avg fit time : 0.022642898559570312 (+/- 0.03019439011825988)
avg score time : 0.005859231948852539 (+/- 0.0006572396990652549)
avg test score : 0.9648501785437045 (+/- 0.005549245701079563)


#### SVC

##### Iris data

In [18]:
base_model = make_pipeline(
    StandardScaler(),
    SVC()
)

bagging_model = BaggingClassifier(base_model, n_estimators=10, max_samples=0.5, max_features=0.5)

In [19]:
cross_val = cross_validate(
    estimator=base_model,
    X=iris.data, y=iris.target,
    cv=5
)
print("avg fit time : {} (+/- {})".format(cross_val["fit_time"].mean(), cross_val["fit_time"].std()))
print("avg score time : {} (+/- {})".format(cross_val["score_time"].mean(), cross_val["score_time"].std()))
print("avg test score : {} (+/- {})".format(cross_val["test_score"].mean(), cross_val["test_score"].std()))

avg fit time : 0.002055978775024414 (+/- 0.0010457669272601476)
avg score time : 0.0006916999816894531 (+/- 0.0002241628369265914)
avg test score : 0.9666666666666666 (+/- 0.02108185106778919)


In [20]:
cross_val = cross_validate(
    estimator=bagging_model,
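    # NOTE: this cell passes the breast cancer data, not iris, so the scores printed
    # below are for breast cancer; use X=iris.data, y=iris.target to match this section.
    X=cancer.data, y=cancer.target,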
    cv=5
)
print("avg fit time : {} (+/- {})".format(cross_val["fit_time"].mean(), cross_val["fit_time"].std()))
print("avg score time : {} (+/- {})".format(cross_val["score_time"].mean(), cross_val["score_time"].std()))
print("avg test score : {} (+/- {})".format(cross_val["test_score"].mean(), cross_val["test_score"].std()))

avg fit time : 0.017092466354370117 (+/- 0.0044032972652880645)
avg score time : 0.0045321941375732425 (+/- 0.0004840945050530314)
avg test score : 0.9666045645086168 (+/- 0.012901097440554028)


##### Wine data

In [21]:
base_model = make_pipeline(
    StandardScaler(),
    SVC()
)

bagging_model = BaggingClassifier(base_model, n_estimators=10, max_samples=0.5, max_features=0.5)

In [22]:
cross_val = cross_validate(
    estimator=base_model,
    X=wine.data, y=wine.target,
    cv=5
)
print("avg fit time : {} (+/- {})".format(cross_val["fit_time"].mean(), cross_val["fit_time"].std()))
print("avg score time : {} (+/- {})".format(cross_val["score_time"].mean(), cross_val["score_time"].std()))
print("avg test score : {} (+/- {})".format(cross_val["test_score"].mean(), cross_val["test_score"].std()))

avg fit time : 0.0023026466369628906 (+/- 0.000400953054785292)
avg score time : 0.0008338451385498047 (+/- 8.564831644019779e-05)
avg test score : 0.9833333333333334 (+/- 0.022222222222222233)


In [23]:
cross_val = cross_validate(
    estimator=bagging_model,
    X=wine.data, y=wine.target,
    cv=5
)
print("avg fit time : {} (+/- {})".format(cross_val["fit_time"].mean(), cross_val["fit_time"].std()))
print("avg score time : {} (+/- {})".format(cross_val["score_time"].mean(), cross_val["score_time"].std()))
print("avg test score : {} (+/- {})".format(cross_val["test_score"].mean(), cross_val["test_score"].std()))

avg fit time : 0.01601419448852539 (+/- 0.006157265951380905)
avg score time : 0.001936960220336914 (+/- 0.00038864148566427763)
avg test score : 0.9665079365079364 (+/- 0.020746948644437477)


##### Breast cancer data

In [24]:
base_model = make_pipeline(
    StandardScaler(),
    SVC()
)

bagging_model = BaggingClassifier(base_model, n_estimators=10, max_samples=0.5, max_features=0.5)

In [25]:
cross_val = cross_validate(
    estimator=base_model,
    X=cancer.data, y=cancer.target,
    cv=5
)
print("avg fit time : {} (+/- {})".format(cross_val["fit_time"].mean(), cross_val["fit_time"].std()))
print("avg score time : {} (+/- {})".format(cross_val["score_time"].mean(), cross_val["score_time"].std()))
print("avg test score : {} (+/- {})".format(cross_val["test_score"].mean(), cross_val["test_score"].std()))

avg fit time : 0.0037891387939453123 (+/- 0.0020459864792854764)
avg score time : 0.0012862205505371094 (+/- 0.0008267189777365049)
avg test score : 0.9736376339077782 (+/- 0.014678541667933545)


In [26]:
cross_val = cross_validate(
    estimator=bagging_model,
    X=cancer.data, y=cancer.target,
    cv=5
)
print("avg fit time : {} (+/- {})".format(cross_val["fit_time"].mean(), cross_val["fit_time"].std()))
print("avg score time : {} (+/- {})".format(cross_val["score_time"].mean(), cross_val["score_time"].std()))
print("avg test score : {} (+/- {})".format(cross_val["test_score"].mean(), cross_val["test_score"].std()))

avg fit time : 0.01792149543762207 (+/- 0.007245902452851581)
avg score time : 0.004671049118041992 (+/- 0.0005456176006148764)
avg test score : 0.9630957925787922 (+/- 0.014027872671634403)


#### Decision Tree

##### Iris data

In [27]:
base_model = make_pipeline(
    StandardScaler(),
    DecisionTreeClassifier()
)

bagging_model = BaggingClassifier(base_model, n_estimators=10, max_samples=0.5, max_features=0.5)

In [28]:
cross_val = cross_validate(
    estimator=base_model,
    X=iris.data, y=iris.target,
    cv=5
)
print("avg fit time : {} (+/- {})".format(cross_val["fit_time"].mean(), cross_val["fit_time"].std()))
print("avg score time : {} (+/- {})".format(cross_val["score_time"].mean(), cross_val["score_time"].std()))
print("avg test score : {} (+/- {})".format(cross_val["test_score"].mean(), cross_val["test_score"].std()))

avg fit time : 0.0018534183502197266 (+/- 0.0010189442112240456)
avg score time : 0.0004413604736328125 (+/- 0.00010705003759639017)
avg test score : 0.9666666666666668 (+/- 0.036514837167011066)


In [29]:
cross_val = cross_validate(
    estimator=bagging_model,
    X=iris.data, y=iris.target,
    cv=5
)
print("avg fit time : {} (+/- {})".format(cross_val["fit_time"].mean(), cross_val["fit_time"].std()))
print("avg score time : {} (+/- {})".format(cross_val["score_time"].mean(), cross_val["score_time"].std()))
print("avg test score : {} (+/- {})".format(cross_val["test_score"].mean(), cross_val["test_score"].std()))

avg fit time : 0.012614917755126954 (+/- 0.0049090418864234635)
avg score time : 0.0008601188659667969 (+/- 7.65176711432489e-05)
avg test score : 0.9466666666666665 (+/- 0.03399346342395189)


##### Wine data

In [30]:
base_model = make_pipeline(
    StandardScaler(),
    DecisionTreeClassifier()
)

bagging_model = BaggingClassifier(base_model, n_estimators=10, max_samples=0.5, max_features=0.5)

In [31]:
cross_val = cross_validate(
    estimator=base_model,
    X=wine.data, y=wine.target,
    cv=5
)
print("avg fit time : {} (+/- {})".format(cross_val["fit_time"].mean(), cross_val["fit_time"].std()))
print("avg score time : {} (+/- {})".format(cross_val["score_time"].mean(), cross_val["score_time"].std()))
print("avg test score : {} (+/- {})".format(cross_val["test_score"].mean(), cross_val["test_score"].std()))

avg fit time : 0.0017827987670898438 (+/- 0.00023741619154321326)
avg score time : 0.0003842353820800781 (+/- 8.3945363579487e-05)
avg test score : 0.8709523809523809 (+/- 0.051305948266633)


In [32]:
cross_val = cross_validate(
    estimator=bagging_model,
    X=wine.data, y=wine.target,
    cv=5
)
print("avg fit time : {} (+/- {})".format(cross_val["fit_time"].mean(), cross_val["fit_time"].std()))
print("avg score time : {} (+/- {})".format(cross_val["score_time"].mean(), cross_val["score_time"].std()))
print("avg test score : {} (+/- {})".format(cross_val["test_score"].mean(), cross_val["test_score"].std()))

avg fit time : 0.013465785980224609 (+/- 0.004857615911885542)
avg score time : 0.0009153842926025391 (+/- 5.1288175598823304e-05)
avg test score : 0.9553968253968254 (+/- 0.037610674843479096)


##### Breast cancer data

In [33]:
base_model = make_pipeline(
    StandardScaler(),
    DecisionTreeClassifier()
)

bagging_model = BaggingClassifier(base_model, n_estimators=10, max_samples=0.5, max_features=0.5)

In [34]:
cross_val = cross_validate(
    estimator=base_model,
    X=cancer.data, y=cancer.target,
    cv=5
)
print("avg fit time : {} (+/- {})".format(cross_val["fit_time"].mean(), cross_val["fit_time"].std()))
print("avg score time : {} (+/- {})".format(cross_val["score_time"].mean(), cross_val["score_time"].std()))
print("avg test score : {} (+/- {})".format(cross_val["test_score"].mean(), cross_val["test_score"].std()))

avg fit time : 0.007410192489624023 (+/- 0.0011814101746197376)
avg score time : 0.00040678977966308595 (+/- 0.00017386566950672606)
avg test score : 0.9208818506443098 (+/- 0.011367663900812813)


In [35]:
cross_val = cross_validate(
    estimator=bagging_model,
    X=cancer.data, y=cancer.target,
    cv=5
)
print("avg fit time : {} (+/- {})".format(cross_val["fit_time"].mean(), cross_val["fit_time"].std()))
print("avg score time : {} (+/- {})".format(cross_val["score_time"].mean(), cross_val["score_time"].std()))
print("avg test score : {} (+/- {})".format(cross_val["test_score"].mean(), cross_val["test_score"].std()))

avg fit time : 0.021436691284179688 (+/- 0.0075137363999600625)
avg score time : 0.0010018348693847656 (+/- 0.00023843569694103361)
avg test score : 0.952585002328831 (+/- 0.01305405236401316)


### Regression with Bagging

#### Loading the datasets

In [38]:
boston = load_boston()
diabetes = load_diabetes()

#### KNN

##### Boston housing price data

In [43]:
base_model = make_pipeline(
    StandardScaler(),
    KNeighborsRegressor()
)

bagging_model = BaggingRegressor(base_model, n_estimators=10, max_samples=0.5, max_features=0.5)

In [44]:
cross_val = cross_validate(
    estimator=base_model,
    X=boston.data, y=boston.target,
    cv=5
)
print("avg fit time : {} (+/- {})".format(cross_val["fit_time"].mean(), cross_val["fit_time"].std()))
print("avg score time : {} (+/- {})".format(cross_val["score_time"].mean(), cross_val["score_time"].std()))
print("avg test score : {} (+/- {})".format(cross_val["test_score"].mean(), cross_val["test_score"].std()))

avg fit time : 0.001759815216064453 (+/- 0.0007601353762795599)
avg score time : 0.0023005008697509766 (+/- 0.000883677845604637)
avg test score : 0.47357748833823543 (+/- 0.13243123464477455)


In [45]:
cross_val = cross_validate(
    estimator=bagging_model,
    X=boston.data, y=boston.target,
    cv=5
)
print("avg fit time : {} (+/- {})".format(cross_val["fit_time"].mean(), cross_val["fit_time"].std()))
print("avg score time : {} (+/- {})".format(cross_val["score_time"].mean(), cross_val["score_time"].std()))
print("avg test score : {} (+/- {})".format(cross_val["test_score"].mean(), cross_val["test_score"].std()))

avg fit time : 0.010712766647338867 (+/- 0.0058968633192018165)
avg score time : 0.003631258010864258 (+/- 0.00024809881789860305)
avg test score : 0.4992203990238428 (+/- 0.1214285724303666)


##### Diabetes data

In [46]:
base_model = make_pipeline(
    StandardScaler(),
    KNeighborsRegressor()
)

bagging_model = BaggingRegressor(base_model, n_estimators=10, max_samples=0.5, max_features=0.5)

In [47]:
cross_val = cross_validate(
    estimator=base_model,
    X=diabetes.data, y=diabetes.target,
    cv=5
)
print("avg fit time : {} (+/- {})".format(cross_val["fit_time"].mean(), cross_val["fit_time"].std()))
print("avg score time : {} (+/- {})".format(cross_val["score_time"].mean(), cross_val["score_time"].std()))
print("avg test score : {} (+/- {})".format(cross_val["test_score"].mean(), cross_val["test_score"].std()))

avg fit time : 0.0017104148864746094 (+/- 0.0008435605533813964)
avg score time : 0.0018630027770996094 (+/- 0.0004943005392948625)
avg test score : 0.3689720650295623 (+/- 0.044659049060165365)


In [48]:
cross_val = cross_validate(
    estimator=bagging_model,
    X=diabetes.data, y=diabetes.target,
    cv=5
)
print("avg fit time : {} (+/- {})".format(cross_val["fit_time"].mean(), cross_val["fit_time"].std()))
print("avg score time : {} (+/- {})".format(cross_val["score_time"].mean(), cross_val["score_time"].std()))
print("avg test score : {} (+/- {})".format(cross_val["test_score"].mean(), cross_val["test_score"].std()))

avg fit time : 0.010668373107910157 (+/- 0.005990193967423894)
avg score time : 0.0038419723510742187 (+/- 0.0005258139926263126)
avg test score : 0.3965541729094708 (+/- 0.06995635450284457)


#### SVR

##### Boston housing price data

In [49]:
base_model = make_pipeline(
    StandardScaler(),
    SVR()
)

bagging_model = BaggingRegressor(base_model, n_estimators=10, max_samples=0.5, max_features=0.5)

In [50]:
cross_val = cross_validate(
    estimator=base_model,
    X=boston.data, y=boston.target,
    cv=5
)
print("avg fit time : {} (+/- {})".format(cross_val["fit_time"].mean(), cross_val["fit_time"].std()))
print("avg score time : {} (+/- {})".format(cross_val["score_time"].mean(), cross_val["score_time"].std()))
print("avg test score : {} (+/- {})".format(cross_val["test_score"].mean(), cross_val["test_score"].std()))

avg fit time : 0.0069140434265136715 (+/- 0.004441731086414138)
avg score time : 0.0018841743469238282 (+/- 0.00011062348707702104)
avg test score : 0.17631266230186604 (+/- 0.522491491512898)


In [51]:
cross_val = cross_validate(
    estimator=bagging_model,
    X=boston.data, y=boston.target,
    cv=5
)
print("avg fit time : {} (+/- {})".format(cross_val["fit_time"].mean(), cross_val["fit_time"].std()))
print("avg score time : {} (+/- {})".format(cross_val["score_time"].mean(), cross_val["score_time"].std()))
print("avg test score : {} (+/- {})".format(cross_val["test_score"].mean(), cross_val["test_score"].std()))

avg fit time : 0.01911478042602539 (+/- 0.005144104398862236)
avg score time : 0.008036947250366211 (+/- 0.0005520189054432421)
avg test score : 0.09298838255918784 (+/- 0.3602102434602145)


##### Diabetes data

In [52]:
base_model = make_pipeline(
    StandardScaler(),
    SVR()
)

bagging_model = BaggingRegressor(base_model, n_estimators=10, max_samples=0.5, max_features=0.5)

In [53]:
cross_val = cross_validate(
    estimator=base_model,
    X=diabetes.data, y=diabetes.target,
    cv=5
)
print("avg fit time : {} (+/- {})".format(cross_val["fit_time"].mean(), cross_val["fit_time"].std()))
print("avg score time : {} (+/- {})".format(cross_val["score_time"].mean(), cross_val["score_time"].std()))
print("avg test score : {} (+/- {})".format(cross_val["test_score"].mean(), cross_val["test_score"].std()))

avg fit time : 0.005385637283325195 (+/- 0.0037336164994106746)
avg score time : 0.0017621994018554687 (+/- 0.0003587677084303775)
avg test score : 0.14659936199629436 (+/- 0.02190798003342926)


In [54]:
cross_val = cross_validate(
    estimator=bagging_model,
    X=diabetes.data, y=diabetes.target,
    cv=5
)
print("avg fit time : {} (+/- {})".format(cross_val["fit_time"].mean(), cross_val["fit_time"].std()))
print("avg score time : {} (+/- {})".format(cross_val["score_time"].mean(), cross_val["score_time"].std()))
print("avg test score : {} (+/- {})".format(cross_val["test_score"].mean(), cross_val["test_score"].std()))

avg fit time : 0.016515588760375975 (+/- 0.005617482051999203)
avg score time : 0.006556844711303711 (+/- 0.0007055688002283409)
avg test score : 0.059949307360234315 (+/- 0.032604381868061215)


#### Decision Tree

##### Boston housing price data

In [55]:
base_model = make_pipeline(
    StandardScaler(),
    DecisionTreeRegressor()
)

bagging_model = BaggingRegressor(base_model, n_estimators=10, max_samples=0.5, max_features=0.5)

In [56]:
cross_val = cross_validate(
    estimator=base_model,
    X=boston.data, y=boston.target,
    cv=5
)
print("avg fit time : {} (+/- {})".format(cross_val["fit_time"].mean(), cross_val["fit_time"].std()))
print("avg score time : {} (+/- {})".format(cross_val["score_time"].mean(), cross_val["score_time"].std()))
print("avg test score : {} (+/- {})".format(cross_val["test_score"].mean(), cross_val["test_score"].std()))

avg fit time : 0.004807615280151367 (+/- 0.0007538427293245176)
avg score time : 0.000522613525390625 (+/- 0.0001848177397565174)
avg test score : 0.10468765942368866 (+/- 0.9851980767138045)


In [57]:
cross_val = cross_validate(
    estimator=bagging_model,
    X=boston.data, y=boston.target,
    cv=5
)
print("avg fit time : {} (+/- {})".format(cross_val["fit_time"].mean(), cross_val["fit_time"].std()))
print("avg score time : {} (+/- {})".format(cross_val["score_time"].mean(), cross_val["score_time"].std()))
print("avg test score : {} (+/- {})".format(cross_val["test_score"].mean(), cross_val["test_score"].std()))

avg fit time : 0.015772294998168946 (+/- 0.005013588892850026)
avg score time : 0.0008565425872802734 (+/- 9.608932500587256e-05)
avg test score : 0.4362717068043554 (+/- 0.4121217955791576)


##### Diabetes data

In [58]:
base_model = make_pipeline(
    StandardScaler(),
    DecisionTreeRegressor()
)

bagging_model = BaggingRegressor(base_model, n_estimators=10, max_samples=0.5, max_features=0.5)

In [59]:
cross_val = cross_validate(
    estimator=base_model,
    X=diabetes.data, y=diabetes.target,
    cv=5
)
print("avg fit time : {} (+/- {})".format(cross_val["fit_time"].mean(), cross_val["fit_time"].std()))
print("avg score time : {} (+/- {})".format(cross_val["score_time"].mean(), cross_val["score_time"].std()))
print("avg test score : {} (+/- {})".format(cross_val["test_score"].mean(), cross_val["test_score"].std()))

avg fit time : 0.0034531116485595702 (+/- 0.0008152977655261128)
avg score time : 0.0004782676696777344 (+/- 0.00017816661652187134)
avg test score : -0.12439887092799551 (+/- 0.10497244792854409)


In [60]:
cross_val = cross_validate(
    estimator=bagging_model,
    X=diabetes.data, y=diabetes.target,
    cv=5
)
print("avg fit time : {} (+/- {})".format(cross_val["fit_time"].mean(), cross_val["fit_time"].std()))
print("avg score time : {} (+/- {})".format(cross_val["score_time"].mean(), cross_val["score_time"].std()))
print("avg test score : {} (+/- {})".format(cross_val["test_score"].mean(), cross_val["test_score"].std()))

avg fit time : 0.015687036514282226 (+/- 0.005207596405584561)
avg score time : 0.0009302139282226562 (+/- 0.00014515768971402234)
avg test score : 0.37104467453420403 (+/- 0.06169126426439304)


## Forests of randomized trees

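* The `sklearn.ensemble` module includes two averaging algorithms based on randomized decision trees
  * Random Forest
  * Extra-Trees
* Randomness is injected into model construction, which yields a diverse set of models
* The ensemble prediction is the average of the individual models' predictions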
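
As a rough illustration of the two algorithms, the sketch below cross-validates a single decision tree, a Random Forest, and Extra-Trees on the wine data, reusing the `cross_validate` pattern from the bagging cells. The dataset, `n_estimators=100`, and the fixed `random_state` are assumptions made for this example only.

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeClassifier

X, y = load_wine(return_X_y=True)

models = {
    "decision tree": DecisionTreeClassifier(random_state=0),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "extra trees": ExtraTreesClassifier(n_estimators=100, random_state=0),
}

results = {}
for name, model in models.items():
    cross_val = cross_validate(estimator=model, X=X, y=y, cv=5)
    results[name] = cross_val["test_score"]
    print("{:<13} avg test score : {:.4f} (+/- {:.4f})".format(
        name, results[name].mean(), results[name].std()))
```

Tree ensembles do not depend on feature scaling, so the `StandardScaler` step used in the earlier pipelines is left out here.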

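### Random Forests classification

### Random Forests regression

### Extremely Randomized Trees classification

### Extremely Randomized Trees regression

### Visualizing Random Forest and Extra Trees

* Visualize the decision boundaries and regression fits of a decision tree, Random Forest, and Extra Trees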

## AdaBoost

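* The best-known boosting algorithm
* Fits a sequence of weak models
* Each model is trained on a repeatedly modified (reweighted) version of the data
* The models' predictions are combined through a weighted vote (or sum)
* The first step trains on the original data; at each subsequent iteration the per-sample weights are updated and a model is trained again
  * Weights increase for misclassified samples and decrease for correctly predicted ones
  * Each successive weak model therefore concentrates on the samples that are hard to predict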

![AdaBoost](https://scikit-learn.org/stable/_images/sphx_glr_plot_adaboost_hastie_10_2_0011.png)
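
The reweighting loop described above can be written out by hand. The sketch below is a simplified binary AdaBoost with decision stumps, not the implementation behind `AdaBoostClassifier`. The breast cancer data, 20 rounds, labels recoded to -1/+1, and the variable names are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
y_signed = np.where(y == 1, 1, -1)  # recode the two classes as -1 / +1

n_samples = len(y_signed)
weights = np.full(n_samples, 1.0 / n_samples)  # first round: uniform weights
stumps, alphas = [], []

for _ in range(20):
    # Weak model: a depth-1 tree trained on the currently weighted data
    stump = DecisionTreeClassifier(max_depth=1, random_state=0)
    stump.fit(X, y_signed, sample_weight=weights)
    pred = stump.predict(X)

    error = weights[pred != y_signed].sum()  # weighted error rate
    if error <= 0.0 or error >= 0.5:
        break  # perfect, or no better than chance: stop boosting

    alpha = 0.5 * np.log((1.0 - error) / error)  # vote weight of this model
    # y * pred is -1 for mistakes (weight grows) and +1 for hits (weight shrinks)
    weights = weights * np.exp(-alpha * y_signed * pred)
    weights = weights / weights.sum()

    stumps.append(stump)
    alphas.append(alpha)

# Final prediction: sign of the weighted vote over all weak models
decision = sum(a * s.predict(X) for a, s in zip(alphas, stumps))
ensemble_pred = np.sign(decision)

first_stump_accuracy = (stumps[0].predict(X) == y_signed).mean()
train_accuracy = (ensemble_pred == y_signed).mean()
print("first stump train accuracy : {:.4f}".format(first_stump_accuracy))
print("boosted train accuracy     : {:.4f}".format(train_accuracy))
```

These are training-set accuracies, shown only to make the mechanism visible. For a real comparison, `AdaBoostClassifier` can be passed to `cross_validate` like any other estimator in this notebook.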

### AdaBoost classification

### AdaBoost regression

## Gradient Tree Boosting

* A generalization of boosting to arbitrary differentiable loss functions
* Usable for both classification and regression in many areas, such as web search ranking
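
For the squared-error loss, the negative gradient is simply the residual, so each boosting step fits a small tree to the current residuals. The sketch below shows that special case by hand and then cross-validates `GradientBoostingRegressor` with matching settings. The diabetes data, 50 rounds, depth-2 trees, and a learning rate of 0.1 are assumptions chosen for the example.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeRegressor

X, y = load_diabetes(return_X_y=True)

# Hand-written boosting for squared error: start from the mean prediction
learning_rate = 0.1
prediction = np.full(len(y), y.mean())
initial_mse = np.mean((y - prediction) ** 2)

for _ in range(50):
    residual = y - prediction  # negative gradient of 0.5 * (y - prediction)^2
    tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, residual)
    prediction = prediction + learning_rate * tree.predict(X)

final_mse = np.mean((y - prediction) ** 2)
print("train MSE before boosting : {:.1f}".format(initial_mse))
print("train MSE after boosting  : {:.1f}".format(final_mse))

# Library estimator with comparable settings, evaluated as in the earlier cells
model = GradientBoostingRegressor(
    n_estimators=50, learning_rate=0.1, max_depth=2, random_state=0
)
cross_val = cross_validate(estimator=model, X=X, y=y, cv=5)
print("avg test score : {} (+/- {})".format(
    cross_val["test_score"].mean(), cross_val["test_score"].std()))
```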

### Gradient Tree Boosting classification

### Gradient Tree Boosting regression

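## Voting-based classification (Voting Classifier)

* Combines the results of different models by voting
* Two voting schemes are available
  * Pick the most frequently predicted class (hard voting)
  * Take the weighted average of the predicted probabilities (soft voting)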
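
The two voting schemes can be compared with `VotingClassifier` and its `voting` parameter. This is a minimal sketch: the three base models, the iris data, and the hypothetical `make_estimators` helper are assumptions for illustration. Soft voting needs base models that expose `predict_proba`, which is why the ones below were picked.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)


def make_estimators():
    """Fresh (name, model) pairs so each voting ensemble gets its own copies."""
    return [
        ("knn", make_pipeline(StandardScaler(), KNeighborsClassifier())),
        ("logreg", make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))),
        ("tree", DecisionTreeClassifier(random_state=0)),
    ]


scores = {}
for voting in ("hard", "soft"):
    model = VotingClassifier(estimators=make_estimators(), voting=voting)
    cross_val = cross_validate(estimator=model, X=X, y=y, cv=5)
    scores[voting] = cross_val["test_score"]
    print("{} voting avg test score : {:.4f} (+/- {:.4f})".format(
        voting, scores[voting].mean(), scores[voting].std()))
```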

### Visualizing decision boundaries

## Voting-based regression (Voting Regressor)

* Uses the average of the predictions of different models
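
The averaging can be verified directly: with no weights given, the output of `VotingRegressor` should match the plain mean of its fitted members' predictions. The three base regressors and the diabetes data below are assumptions picked for the example.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.ensemble import VotingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

X, y = load_diabetes(return_X_y=True)

voter = VotingRegressor(
    estimators=[
        ("linear", LinearRegression()),
        ("knn", KNeighborsRegressor()),
        ("tree", DecisionTreeRegressor(max_depth=3, random_state=0)),
    ]
).fit(X, y)

# Predictions of each fitted member, stacked as (n_models, n_samples)
individual = np.stack([est.predict(X) for est in voter.estimators_])
manual_average = individual.mean(axis=0)

matches = np.allclose(voter.predict(X), manual_average)
print("ensemble equals mean of members:", matches)
```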

### Visualizing regression fits

## Stacked Generalization

* Uses the predictions of each model as the input to a final model
* Effective at reducing the bias of the models
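
A minimal sketch of stacking with `StackingClassifier`: the base models' cross-validated predictions become the features of the final model. The choice of base models, logistic regression as the final estimator, and the breast cancer data are assumptions made for this example.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

stack = StackingClassifier(
    estimators=[
        ("knn", make_pipeline(StandardScaler(), KNeighborsClassifier())),
        ("tree", DecisionTreeClassifier(max_depth=3, random_state=0)),
    ],
    # The final model is trained on the base models' out-of-fold predictions
    final_estimator=LogisticRegression(),
    cv=5,
)

cross_val = cross_validate(estimator=stack, X=X, y=y, cv=5)
stack_scores = cross_val["test_score"]
print("avg test score : {} (+/- {})".format(stack_scores.mean(), stack_scores.std()))
```

The inner `cv=5` controls how the training features for the final model are produced, while the outer `cross_validate` call measures the whole stack, matching the evaluation used elsewhere in this notebook.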

### Stacking regression

#### Visualizing regression fits

### Stacking classification

#### Visualizing decision boundaries