
# Ensemble

In [11]:
# load_boston was removed in scikit-learn 1.2, so this import needs an older release.
from sklearn.datasets import load_iris, load_wine, load_boston, load_diabetes
from sklearn.model_selection import cross_validate, cross_val_score
from sklearn.pipeline import make_pipeline

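- Cross-validation
  - Helps guard against overfitting
  - Instead of using all the training data at once, part of it is held out and used for testing
- cross_val_score
  - `cross_val_score(model, X, y, cv=number_of_folds)`
  - Higher scores generally indicate a better model
  - For regression models, `scoring='neg_mean_squared_error'` is often used to obtain the MSE; it returns the negated MSE, so values closer to 0 are better.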
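
The points above can be sketched as follows. This is a minimal illustration, not part of the original notebook: the `LogisticRegression` and `Ridge` models are assumptions chosen only because they train quickly, and the diabetes dataset stands in for a regression task.

```python
from sklearn.datasets import load_iris, load_diabetes
from sklearn.linear_model import LogisticRegression, Ridge
from sklearn.model_selection import cross_val_score

# Classification: the default score is accuracy, so higher is better.
X_cls, y_cls = load_iris(return_X_y=True)
clf_scores = cross_val_score(LogisticRegression(max_iter=1000), X_cls, y_cls, cv=5)
print(f"accuracy per fold: {clf_scores}")

# Regression: 'neg_mean_squared_error' returns the MSE with its sign flipped,
# so every value is <= 0 and values closer to 0 are better.
X_reg, y_reg = load_diabetes(return_X_y=True)
neg_mse = cross_val_score(Ridge(), X_reg, y_reg, scoring="neg_mean_squared_error", cv=5)
mse = -neg_mse  # negate to recover the ordinary (non-negative) MSE
print(f"MSE per fold: {mse}")
```

Negating the result, as the regression cells later in this notebook do, turns the score back into an ordinary MSE where smaller is better.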

In [20]:
iris = load_iris()
wine = load_wine()
boston = load_boston()
diabetes = load_diabetes()

## Voting-based classification and regression

In [7]:
from sklearn.ensemble import VotingClassifier, VotingRegressor

### Voting-based classification

In [8]:
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import SGDClassifier

In [9]:
model1 = SVC()
model2 = GaussianNB()
model3 = SGDClassifier()

In [10]:
vote_model = VotingClassifier(
    estimators = [("svc", model1), ("gaussian", model2), ("sgd", model3)]
)

In [12]:
for model in (model1, model2, model3, vote_model):
  model_name = type(model).__name__  # class name, e.g. "SVC"
  scores = cross_val_score(model, iris.data, iris.target, cv=5)
  print(f"{model_name}, Accuracy: {scores.mean()}")

SVC, Accuracy: 0.9666666666666666
GaussianNB, Accuracy: 0.9533333333333334
SGDClassifier, Accuracy: 0.8666666666666666
VotingClassifier, Accuracy: 0.9466666666666665


### Voting-based regression

In [14]:
from sklearn.svm import SVR
from sklearn.linear_model import SGDRegressor, LinearRegression

In [25]:
model1 = SVR()
model2 = SGDRegressor()
model3 = LinearRegression()

In [26]:
vote_model = VotingRegressor(
    estimators = [("svm", model1), ("sgd", model2), ("linear", model3)]
)

In [27]:
for model in (model1, model2, model3, vote_model):
  model_name = type(model).__name__  # class name, e.g. "SVR"
  # Negate the scores to turn neg_mean_squared_error back into a plain MSE (lower is better).
  scores = -cross_val_score(model, boston.data, boston.target, scoring='neg_mean_squared_error', cv=5)
  print(f"{model_name}, MSE: {scores.mean()}")

SVR, MSE: 71.85800739156483
SGDRegressor, MSE: 6.331453067269364e+27
LinearRegression, MSE: 37.13180746769903
VotingRegressor, MSE: 3.3003590163164434e+27


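Voting-based regression is rarely used in practice. In the run above, the ensemble's error is dominated by `SGDRegressor`, whose error explodes, most likely because the features were not scaled. Averaging its predictions with the others drags `VotingRegressor` down as well.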
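
If feature scaling is the cause of the blow-up above, putting a `StandardScaler` in front of the scale-sensitive models should keep the ensemble's error finite. The sketch below is an assumption-laden illustration rather than a rerun of the cell above: it uses the diabetes dataset instead of Boston (which recent scikit-learn releases no longer ship), and the variable names `scaled_svr`, `scaled_sgd`, `scaled_vote`, and `results` are introduced here.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.ensemble import VotingRegressor
from sklearn.linear_model import LinearRegression, SGDRegressor
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

X, y = load_diabetes(return_X_y=True)

# Each scale-sensitive model gets its own scaler, refitted inside every CV fold.
scaled_svr = make_pipeline(StandardScaler(), SVR())
scaled_sgd = make_pipeline(StandardScaler(), SGDRegressor(random_state=0))
linear = LinearRegression()

scaled_vote = VotingRegressor(
    estimators=[("svr", scaled_svr), ("sgd", scaled_sgd), ("linear", linear)]
)

results = {}
for name, model in [
    ("SVR", scaled_svr),
    ("SGDRegressor", scaled_sgd),
    ("LinearRegression", linear),
    ("VotingRegressor", scaled_vote),
]:
    # Negate neg_mean_squared_error to obtain a plain MSE (lower is better).
    mse = -cross_val_score(model, X, y, scoring="neg_mean_squared_error", cv=5)
    results[name] = mse.mean()
    print(f"{name}, MSE: {results[name]:.2f}")
```

Because `VotingRegressor` averages the predictions of its members, one badly behaved member is enough to ruin the ensemble, which is why each member is checked individually here as well.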

## Bagging

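- n_estimators: number of base estimators to train
- max_samples: number (or fraction) of samples drawn for each estimator (since this is bagging, they are drawn at random with replacement)
- max_features: number (or fraction) of features used by each estimator (each estimator trains on a random subset of the features rather than all of them)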
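
To make these three parameters concrete, the sketch below fits a small ensemble and inspects what each member actually received. It is an illustration under assumptions: `DecisionTreeClassifier` is used as the base estimator purely for speed (the notebook itself bags `SVC` and `GaussianNB`), and `random_state=0` is added only for reproducibility.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)  # 150 samples, 4 features

bagging = BaggingClassifier(
    DecisionTreeClassifier(),
    n_estimators=10,    # train 10 base estimators
    max_samples=0.5,    # each sees about half of the rows (drawn with replacement)
    max_features=0.5,   # each sees half of the columns
    random_state=0,
)
bagging.fit(X, y)

n_models = len(bagging.estimators_)
first_sample_idx = bagging.estimators_samples_[0]    # row indices drawn for the first estimator
first_feature_idx = bagging.estimators_features_[0]  # column indices used by the first estimator

print(f"fitted estimators: {n_models}")
print(f"rows drawn for the first estimator: {len(first_sample_idx)}")
print(f"features used by the first estimator: {first_feature_idx}")
```

The fitted attributes `estimators_`, `estimators_samples_`, and `estimators_features_` show that every member is trained on its own random slice of rows and columns.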

In [28]:
from sklearn.ensemble import BaggingClassifier, BaggingRegressor

### Classification with bagging

In [34]:
base_model = BaggingClassifier(SVC(), n_estimators=10, max_samples=0.5, max_features=0.5)
cross_val = cross_validate(
    estimator = base_model, X = iris.data, y = iris.target, cv = 5
)
print(f"{cross_val['test_score'].mean()}")

0.9466666666666667


In [45]:
from sklearn.preprocessing import StandardScaler

base_model = make_pipeline(
    StandardScaler(),
    SVC()
)
bagging_model = BaggingClassifier(base_model, n_estimators=10, max_samples=0.5, max_features=0.5)

base_cross_val = cross_validate(
    estimator = base_model, X = wine.data, y = wine.target, cv = 5
)
print(f"base: {base_cross_val['test_score'].mean()}")

bagging_cross_val = cross_validate(
    estimator = bagging_model, X = wine.data, y = wine.target, cv = 5
)
print(f"bagging: {bagging_cross_val['test_score'].mean()}")

base: 0.9833333333333334
bagging: 0.9607936507936508


In [46]:
from sklearn.preprocessing import StandardScaler

base_model = make_pipeline(
    StandardScaler(),
    GaussianNB()
)
bagging_model = BaggingClassifier(base_model, n_estimators=10, max_samples=0.5, max_features=0.5)

base_cross_val = cross_validate(
    estimator = base_model, X = wine.data, y = wine.target, cv = 5
)
print(f"base: {base_cross_val['test_score'].mean()}")

bagging_cross_val = cross_validate(
    estimator = bagging_model, X = wine.data, y = wine.target, cv = 5
)
print(f"bagging: {bagging_cross_val['test_score'].mean()}")

base: 0.9663492063492063
bagging: 0.9441269841269841


### Regression with bagging

In [49]:
from sklearn.preprocessing import StandardScaler

base_model = make_pipeline(
    StandardScaler(),
    SVR()
)
bagging_model = BaggingRegressor(base_model, n_estimators=10, max_samples=0.5, max_features=0.5)

base_cross_val = cross_validate(
    estimator = base_model, X = boston.data, y = boston.target, cv = 5
)
print(f"base: {base_cross_val['test_score'].mean()}")

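# Evaluate the bagging model on the same Boston data as the base model
# (the default score for regressors is R^2, so higher is better).
bagging_cross_val = cross_validate(
    estimator = bagging_model, X = boston.data, y = boston.target, cv = 5
)
print(f"bagging: {bagging_cross_val['test_score'].mean()}")

base: 0.17631266230186618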


## AdaBoost

In [50]:
from sklearn.ensemble import AdaBoostClassifier, AdaBoostRegressor

### AdaBoost classification

In [51]:
model = make_pipeline(
    StandardScaler(),
    AdaBoostClassifier()
)

In [55]:
cross_val = cross_validate(estimator=model, X=iris.data, y=iris.target, cv = 5)
print(f"{cross_val['test_score'].mean()}")

0.9466666666666667


In [56]:
cross_val = cross_validate(estimator=model, X=wine.data, y=wine.target, cv = 5)
print(f"{cross_val['test_score'].mean()}")

0.8085714285714285


### AdaBoost regression

In [57]:
model = make_pipeline(
    StandardScaler(),
    AdaBoostRegressor()
)

In [58]:
cross_val = cross_validate(estimator=model, X=boston.data, y=boston.target, cv = 5)
print(f"{cross_val['test_score'].mean()}")

0.56313831539566
