### sklearn.ensemble.AdaBoostClassifier
### sklearn.ensemble.AdaBoostRegressor
### sklearn.ensemble.GradientBoostingClassifier
### sklearn.ensemble.GradientBoostingRegressor

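- An ensemble model that trains several weak learners sequentially (in series): each round puts more emphasis on the samples the previous learners predicted incorrectly, so the errors are corrected step by step. AdaBoost does this by increasing the weights of misclassified samples; gradient boosting does it by fitting each new learner to the residual errors of the current ensemble.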
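
The reweighting idea can be sketched by hand. The block below is a minimal discrete AdaBoost loop built on decision stumps, not scikit-learn's internal implementation; the synthetic dataset, the number of rounds, and all variable names are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary data (an assumption; not the notebook's dataset)
X, y = make_classification(n_samples=300, n_features=5, random_state=0)
y_pm = np.where(y == 1, 1, -1)            # recode labels to {-1, +1}

n_rounds = 20
weights = np.full(len(X), 1 / len(X))     # start from uniform sample weights
stumps, alphas = [], []

for _ in range(n_rounds):
    # Weak learner: a depth-1 tree trained on the current sample weights
    stump = DecisionTreeClassifier(max_depth=1, random_state=0)
    stump.fit(X, y_pm, sample_weight=weights)
    pred = stump.predict(X)

    # Weighted error of this learner, clipped to avoid division by zero
    err = np.clip(weights[pred != y_pm].sum(), 1e-10, 1 - 1e-10)
    alpha = 0.5 * np.log((1 - err) / err)  # vote strength of this learner

    # Misclassified samples gain weight, correctly classified ones lose weight
    weights *= np.exp(-alpha * y_pm * pred)
    weights /= weights.sum()

    stumps.append(stump)
    alphas.append(alpha)

# Final prediction: sign of the alpha-weighted vote of all stumps
votes = sum(a * s.predict(X) for a, s in zip(alphas, stumps))
train_acc = np.mean(np.sign(votes) == y_pm)
first_stump_acc = np.mean(stumps[0].predict(X) == y_pm)
print(f"first stump: {first_stump_acc:.3f}, ensemble: {train_acc:.3f}")
```

Comparing the two printed accuracies shows how much the weighted vote adds over a single stump on the training data.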

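#### Key hyperparameters
- base_estimator : the base (weak) learner to be boosted (AdaBoost only; renamed `estimator` in scikit-learn 1.2)
- n_estimators : the number of weak learners (boosting rounds) to train sequentially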
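
To see how `n_estimators` affects the result, one option is `staged_predict`, which yields the ensemble's predictions after each boosting round, so a single fit covers every smaller value of `n_estimators`. The sketch below uses synthetic data and hypothetical variable names; it is an illustration, not part of the original analysis.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

# Synthetic data (an assumption for illustration)
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)

model = AdaBoostClassifier(n_estimators=200, random_state=0)
model.fit(X_tr, y_tr)

# Test accuracy after 1, 2, ..., k boosting rounds
# (k can be smaller than n_estimators if boosting stops early)
test_acc = [np.mean(pred == y_te) for pred in model.staged_predict(X_te)]
best_n = int(np.argmax(test_acc)) + 1
print(f"best number of rounds on this split: {best_n}, "
      f"accuracy: {test_acc[best_n - 1]:.3f}")
```

Choosing `best_n` on the test split is only for demonstration; in practice a validation split or cross-validation would be used for that choice.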

##### AdaBoostClassifier(base_estimator=None, *, n_estimators=50, learning_rate=1.0, algorithm='SAMME.R', random_state=None)
##### AdaBoostRegressor(base_estimator=None, *, n_estimators=50, learning_rate=1.0, loss='linear', random_state=None)

##### GradientBoostingClassifier(*, loss='deviance', learning_rate=0.1, n_estimators=100, subsample=1.0, criterion='friedman_mse', min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_depth=3, min_impurity_decrease=0.0, min_impurity_split=None, init=None, random_state=None, max_features=None, verbose=0, max_leaf_nodes=None, warm_start=False, validation_fraction=0.1, n_iter_no_change=None, tol=0.0001, ccp_alpha=0.0)
##### GradientBoostingRegressor(*, loss='ls', learning_rate=0.1, n_estimators=100, subsample=1.0, criterion='friedman_mse', min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_depth=3, min_impurity_decrease=0.0, min_impurity_split=None, init=None, random_state=None, max_features=None, alpha=0.9, verbose=0, max_leaf_nodes=None, warm_start=False, validation_fraction=0.1, n_iter_no_change=None, tol=0.0001, ccp_alpha=0.0)
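
For the gradient boosting estimators, the core loop with squared-error loss can be sketched as follows: start from a constant prediction, then repeatedly fit a small tree to the current residuals and add a shrunken copy of its output. This is a simplified illustration on synthetic data with hypothetical variable names, not scikit-learn's actual implementation; `learning_rate`, `n_stages`, and `max_depth=3` mirror the defaults in the signatures above.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data (an assumption for illustration)
X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)

learning_rate, n_stages = 0.1, 100
prediction = np.full(len(y), y.mean())    # stage 0: constant prediction
trees, mse_history = [], []

for _ in range(n_stages):
    residual = y - prediction             # negative gradient of squared loss
    tree = DecisionTreeRegressor(max_depth=3, random_state=0)
    tree.fit(X, residual)                 # the new learner models the remaining error
    prediction += learning_rate * tree.predict(X)
    trees.append(tree)
    mse_history.append(np.mean((y - prediction) ** 2))

print(f"training MSE after stage 1: {mse_history[0]:.1f}, "
      f"after stage {n_stages}: {mse_history[-1]:.1f}")
```

A smaller `learning_rate` makes each stage contribute less, which usually requires more stages for the same training error.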

# Analysis code - Classification
### AdaBoost

In [1]:
# Load libraries and data
import warnings
warnings.filterwarnings('ignore')
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import randint
from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV
from sklearn.preprocessing import MinMaxScaler, StandardScaler, LabelEncoder
from sklearn.ensemble import AdaBoostClassifier, AdaBoostRegressor, GradientBoostingClassifier, GradientBoostingRegressor
from sklearn.metrics import classification_report, confusion_matrix, mean_squared_error, accuracy_score

df = pd.read_csv('../input/big-data-certification-study/breast-cancer-wisconsin.csv', encoding='utf-8')
df.head()

Unnamed: 0,code,Clump_Thickness,Cell_Size,Cell_Shape,Marginal_Adhesion,Single_Epithelial_Cell_Size,Bare_Nuclei,Bland_Chromatin,Normal_Nucleoli,Mitoses,Class
0,1000025,5,1,1,1,2,1,3,1,1,0
1,1002945,5,4,4,5,7,10,3,2,1,0
2,1015425,3,1,1,1,2,2,3,1,1,0
3,1016277,6,8,8,1,3,4,3,7,1,0
4,1017023,4,1,1,3,2,1,3,1,1,0


In [2]:
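# Split features/target, then train/test (stratified on the class label)
X=df.drop(columns=['code','Class'])
y=df['Class']  # 1-D Series, the target shape scikit-learn estimators expect
X_train,X_test,y_train,y_test=train_test_split(X,y,stratify=y,random_state=42)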

In [3]:
# Min-max scaling (fit on the training set only, then applied to both sets)
scaler=MinMaxScaler()
scaler.fit(X_train)
mm_X_train=scaler.transform(X_train)
mm_X_test=scaler.transform(X_test)

In [4]:
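# Fit the model and score it on the training set
model=AdaBoostClassifier(n_estimators=100, random_state=0)
model.fit(mm_X_train, y_train)
x_pred=model.predict(mm_X_train)  # predictions on the training set
model.score(mm_X_train, y_train)

# Boosting keeps correcting errors on the training data, so it is prone to overfitting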

1.0

In [5]:
# Confusion matrix and classification report (training set)
cm_train=confusion_matrix(y_train,x_pred)
cfr_train=classification_report(y_train,x_pred)
print('Confusion matrix :\n',cm_train,'\n',
      'Classification report :\n',cfr_train,'\n\n')

Confusion matrix :
 [[333   0]
 [  0 179]] 
 Classification report :
               precision    recall  f1-score   support

           0       1.00      1.00      1.00       333
           1       1.00      1.00      1.00       179

    accuracy                           1.00       512
   macro avg       1.00      1.00      1.00       512
weighted avg       1.00      1.00      1.00       512

In [6]:
y_pred=model.predict(mm_X_test)
accuracy_score(y_test,y_pred)

0.9532163742690059
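
The gap between a perfect training score and a lower test score is the usual sign of overfitting. One way to get a less optimistic estimate before touching the test set is k-fold cross-validation on the training data. The sketch below uses scikit-learn's bundled breast cancer dataset as a stand-in (an assumption; it is not the CSV loaded in this notebook), so its numbers will differ from the outputs shown here.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score, train_test_split

# Stand-in data (assumption): the breast cancer dataset shipped with scikit-learn
X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)

model = AdaBoostClassifier(n_estimators=100, random_state=0)

# 5-fold cross-validated accuracy, computed on the training split only
cv_scores = cross_val_score(model, X_tr, y_tr, cv=5, scoring="accuracy")

model.fit(X_tr, y_tr)
train_acc = model.score(X_tr, y_tr)
test_acc = model.score(X_te, y_te)
print(f"train: {train_acc:.3f}  cv mean: {cv_scores.mean():.3f}  test: {test_acc:.3f}")
```

If the cross-validated score is clearly below the training score, lowering `n_estimators` or `learning_rate` is a reasonable first adjustment to try.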

In [7]:
# Confusion matrix and classification report (test set)
cm_test=confusion_matrix(y_test,y_pred)
cfr_test=classification_report(y_test,y_pred)
print('Confusion matrix :\n',cm_test,'\n',
      'Classification report :\n',cfr_test,'\n\n')

Confusion matrix :
 [[106   5]
 [  3  57]] 
 Classification report :
               precision    recall  f1-score   support

           0       0.97      0.95      0.96       111
           1       0.92      0.95      0.93        60

    accuracy                           0.95       171
   macro avg       0.95      0.95      0.95       171
weighted avg       0.95      0.95      0.95       171

### GradientBoosting

In [8]:
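# Fit the model and score it on the training set
model=GradientBoostingClassifier(n_estimators=100,
                                 learning_rate=0.1,
                                 max_depth=1, random_state=0)
model.fit(mm_X_train, y_train)
x_pred=model.predict(mm_X_train)  # predictions on the training set
model.score(mm_X_train, y_train)

# Boosting keeps correcting errors on the training data, so it is prone to overfitting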

0.984375

In [9]:
# Confusion matrix and classification report (training set)
cm_train=confusion_matrix(y_train,x_pred)
cfr_train=classification_report(y_train,x_pred)
print('Confusion matrix :\n',cm_train,'\n',
      'Classification report :\n',cfr_train,'\n\n')

Confusion matrix :
 [[329   4]
 [  4 175]] 
 Classification report :
               precision    recall  f1-score   support

           0       0.99      0.99      0.99       333
           1       0.98      0.98      0.98       179

    accuracy                           0.98       512
   macro avg       0.98      0.98      0.98       512
weighted avg       0.98      0.98      0.98       512

In [10]:
y_pred=model.predict(mm_X_test)
accuracy_score(y_test,y_pred)

0.9649122807017544

In [11]:
# Confusion matrix and classification report (test set)
cm_test=confusion_matrix(y_test,y_pred)
cfr_test=classification_report(y_test,y_pred)
print('Confusion matrix :\n',cm_test,'\n',
      'Classification report :\n',cfr_test,'\n\n')

Confusion matrix :
 [[106   5]
 [  1  59]] 
 Classification report :
               precision    recall  f1-score   support

           0       0.99      0.95      0.97       111
           1       0.92      0.98      0.95        60

    accuracy                           0.96       171
   macro avg       0.96      0.97      0.96       171
weighted avg       0.97      0.96      0.97       171

# Analysis code - Regression
### AdaBoost

In [12]:
df2=pd.read_csv('../input/big-data-certification-study/house_price.csv', encoding='utf-8')
df2.head()

Unnamed: 0,housing_age,income,bedrooms,households,rooms,house_value
0,23,6.777,0.141112,2.442244,8.10396,500000
1,49,6.0199,0.160984,2.726688,5.752412,500000
2,35,5.1155,0.249061,1.902676,3.888078,500000
3,32,4.7109,0.231383,1.913669,4.508393,500000
4,21,4.5625,0.255583,3.092664,4.667954,500000


In [13]:
# Split features/target, then train/test
X=df2.drop(columns=['house_value'])
y=df2['house_value']  # 1-D Series, the target shape scikit-learn estimators expect
X_train, X_test, y_train, y_test=train_test_split(X,y,random_state=42)

In [14]:
# Min-max scaling (fit on the training set only, then applied to both sets)
scale=MinMaxScaler()
scale.fit(X_train)
ms_X_train=scale.transform(X_train)
ms_X_test=scale.transform(X_test)

In [15]:
# Fit the model; score() returns R^2 for regressors
model=AdaBoostRegressor(n_estimators=100,random_state=0)
model.fit(ms_X_train, y_train)

pred_x=model.predict(ms_X_train)
model.score(ms_X_train, y_train)

0.44121772746215804

In [16]:
pred_y=model.predict(ms_X_test)
model.score(ms_X_test, y_test)

0.44209369928960385

In [17]:
# RMSE
rmse_train=np.sqrt(mean_squared_error(y_train,pred_x))
rmse_test=np.sqrt(mean_squared_error(y_test,pred_y))
print('Train RMSE :', round(rmse_train),
      '\nTest RMSE :', round(rmse_test))

Train RMSE : 71346 
Test RMSE : 71407


### GradientBoosting

In [18]:
# Fit the model; score() returns R^2 for regressors
model=GradientBoostingRegressor(random_state=0)
model.fit(ms_X_train, y_train)

pred_x=model.predict(ms_X_train)
model.score(ms_X_train, y_train)

0.6528129290117282

In [19]:
pred_y=model.predict(ms_X_test)
model.score(ms_X_test, y_test)

0.6255002049129976

In [20]:
# RMSE
rmse_train=np.sqrt(mean_squared_error(y_train,pred_x))
rmse_test=np.sqrt(mean_squared_error(y_test,pred_y))
print('Train RMSE :', round(rmse_train),
      '\nTest RMSE :', round(rmse_test))

Train RMSE : 56238 
Test RMSE : 58504


- Note that both GradientBoosting and AdaBoost can be tuned for better performance, so these results alone do not show that one model is better than the other.
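
As a rough illustration of such tuning, the sketch below runs a small `GridSearchCV` over a `GradientBoostingRegressor`. The synthetic data, the parameter grid, and the variable names are assumptions picked to keep the example fast; they are not recommended values for the house price data.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic regression data (an assumption for illustration)
X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)

# Hypothetical, deliberately small grid
param_grid = {
    "n_estimators": [50, 100],
    "learning_rate": [0.05, 0.1],
    "max_depth": [2, 3],
}

search = GridSearchCV(
    GradientBoostingRegressor(random_state=0),
    param_grid,
    cv=3,
    scoring="neg_root_mean_squared_error",  # higher is better, so RMSE is negated
)
search.fit(X_tr, y_tr)

# search.score() uses the same scorer, so negate it to recover the RMSE
test_rmse = -search.score(X_te, y_te)
print("best params:", search.best_params_)
print(f"CV RMSE: {-search.best_score_:.2f}  test RMSE: {test_rmse:.2f}")
```

The same pattern applies to `AdaBoostRegressor` by swapping the estimator and restricting the grid to parameters it accepts, such as `n_estimators` and `learning_rate`.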