# __Ensembles__

### __Ensembles almost always work better__

### Bias & Variance

![Alt text](./images/bias.png)

## Goal of ensembles: train many models to reduce error
>**Error reduction by lowering variance: Bagging, Random Forest** <br>
>Error reduction by lowering bias: Boosting
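
A minimal sketch of why averaging lowers variance, under the simplifying assumption that the ensemble members are unbiased and make independent errors. Real bagged trees are correlated, so the reduction is smaller in practice. The names `noisy_model` and `ensemble_mean` are hypothetical and only illustrate the idea.

```python
import random
import statistics

random.seed(0)

def noisy_model(true_value=1.0, noise_sd=1.0):
    """One unbiased but high-variance 'model': the truth plus Gaussian noise."""
    return random.gauss(true_value, noise_sd)

def ensemble_mean(n_models):
    """Average the predictions of n_models independent noisy models."""
    return statistics.fmean([noisy_model() for _ in range(n_models)])

# Repeat the experiment many times to estimate the variance of each predictor
single_preds = [noisy_model() for _ in range(2000)]
ensemble_preds = [ensemble_mean(25) for _ in range(2000)]

var_single = statistics.variance(single_preds)
var_ensemble = statistics.variance(ensemble_preds)
print(f"variance of a single model : {var_single:.3f}")
print(f"variance of a 25-model mean: {var_ensemble:.3f}")
```

With independent errors the variance of the mean shrinks roughly as 1/n, while the bias stays the same; this is the effect bagging and random forests aim for.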

# __Bagging__

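### Bagging: Bootstrap Aggregating
> Each member (model) of the ensemble is trained on a different training dataset <br>
> Each of these datasets, drawn from the original data by sampling with replacement, is called a bootstrap sample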
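
A small sketch of how one bootstrap sample can be drawn, using only the standard library. The helper `bootstrap_indices` and the row count are assumptions made for illustration. Because rows are drawn with replacement, some rows repeat and others are left out ("out of bag").

```python
import random

random.seed(0)

def bootstrap_indices(n_rows):
    """Draw n_rows row indices uniformly at random, with replacement."""
    return [random.randrange(n_rows) for _ in range(n_rows)]

n_rows = 1000                        # hypothetical dataset size
sample = bootstrap_indices(n_rows)   # indices used to train one ensemble member
in_bag = set(sample)                 # distinct rows that were drawn at least once
oob = set(range(n_rows)) - in_bag    # rows this member never saw

unique_fraction = len(in_bag) / n_rows
print(f"distinct rows in the bootstrap sample: {unique_fraction:.1%}")
print(f"out-of-bag rows: {len(oob)}")
```

For a bootstrap sample as large as the original data, about 63% of the rows are expected to appear at least once (1 - 1/e), leaving roughly 37% out of bag.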

![Alt text](./images/bagging.png)
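
The "aggregating" half of bagging can be sketched as a majority vote over the members' predictions. This is an illustrative simplification with a hypothetical helper `majority_vote`; scikit-learn's `BaggingClassifier` averages predicted class probabilities when the base estimator provides them and falls back to voting otherwise.

```python
from collections import Counter

def majority_vote(member_predictions):
    """Combine per-model label lists into one label per sample by majority vote."""
    # zip(*...) regroups the predictions so each tuple holds all votes for one sample
    return [Counter(votes).most_common(1)[0][0] for votes in zip(*member_predictions)]

# Predictions of three hypothetical ensemble members on four samples
member_predictions = [
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 0],
]
combined = majority_vote(member_predictions)
print(combined)  # [1, 0, 1, 0]
```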

In [1]:
# Building a bagging model with sklearn
# https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.BaggingClassifier.html
import sklearn
import pandas as pd
import numpy as np
from sklearn import model_selection # for cross-validation scores
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import BaggingClassifier # bagging
from sklearn.tree import DecisionTreeClassifier # decision tree
from collections import Counter # count
from sklearn.metrics import f1_score

import warnings
warnings.simplefilter("ignore", UserWarning)

- Variable descriptions (short name, with the column name used in the code in parentheses)
    - preg (`Pregnancies`): Number of times pregnant

    - plas (`Glucose`): Plasma glucose concentration at 2 hours in an oral glucose tolerance test

    - pres (`BloodPressure`): Diastolic blood pressure ($\text{mm Hg}$)

    - skin (`SkinThickness`): Triceps skin fold thickness ($\text{mm}$)

    - test (`Insulin`): 2-Hour serum insulin ($\text{mu U/ml}$)

    - mass (`BMI`): Body mass index ($\text{weight in kg}$/$(\text{height in m})^2$)

    - pedi (`DiabetesPedigreeFunction`): Diabetes pedigree function

    - age (`Age`): Age ($\text{years}$)

    - class (`Class`) = (1: `tested positive for diabetes`, 0: `tested negative for diabetes`)

In [2]:
filename = './dataset/pima-indians-diabetes.data.csv'
# The CSV file has no header row, so the column names are assigned manually
dataframe = pd.read_csv(filename, header=None)
dataframe.columns = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age', 'Class']
dataframe.head()

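| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Class |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |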


In [3]:
array = dataframe.values # convert to a NumPy array for easier indexing
array

array([[  6.   , 148.   ,  72.   , ...,   0.627,  50.   ,   1.   ],
       [  1.   ,  85.   ,  66.   , ...,   0.351,  31.   ,   0.   ],
       [  8.   , 183.   ,  64.   , ...,   0.672,  32.   ,   1.   ],
       ...,
       [  5.   , 121.   ,  72.   , ...,   0.245,  30.   ,   0.   ],
       [  1.   , 126.   ,  60.   , ...,   0.349,  47.   ,   1.   ],
       [  1.   ,  93.   ,  70.   , ...,   0.315,  23.   ,   0.   ]])

In [4]:
X = array[:,0:8].astype(float)  # columns 0-7 are the independent variables (features)
Y = array[:,8].astype(int) # the last column is the dependent variable (label)

print('X:',X[:5])
print('y:',Y[:5])

X: [[6.000e+00 1.480e+02 7.200e+01 3.500e+01 0.000e+00 3.360e+01 6.270e-01
  5.000e+01]
 [1.000e+00 8.500e+01 6.600e+01 2.900e+01 0.000e+00 2.660e+01 3.510e-01
  3.100e+01]
 [8.000e+00 1.830e+02 6.400e+01 0.000e+00 0.000e+00 2.330e+01 6.720e-01
  3.200e+01]
 [1.000e+00 8.900e+01 6.600e+01 2.300e+01 9.400e+01 2.810e+01 1.670e-01
  2.100e+01]
 [0.000e+00 1.370e+02 4.000e+01 3.500e+01 1.680e+02 4.310e+01 2.288e+00
  3.300e+01]]
y: [1 0 1 0 1]


In [5]:
train_x, test_x, train_y, test_y = train_test_split(X, Y, test_size=0.3, random_state=0)
print('Number of train set:', len(train_x))
print('Number of test set:', len(test_x))

Number of train set: 537
Number of test set: 231


In [6]:
assert len(train_x) == len(train_y)
assert len(test_x) == len(test_y)

In [7]:
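# hyperparameters
param_grid = {'n_estimators': [100, 200],
              'max_features': [1.0], # every tree sees all features
              'bootstrap_features': [False], # draw features without replacement
              'oob_score': [True], # compute the out-of-bag score
              'n_jobs':[-1], # use all CPU cores
              # max depth of each base tree; the prefix becomes `estimator__`
              # in scikit-learn >= 1.2 (`base_estimator` was removed in 1.4)
              'base_estimator__max_depth': [3, 5]
              }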

- Notes on the `cv` argument of GridSearchCV
    - None, to use the default 5-fold cross validation,
    - integer, to specify the number of folds in a (Stratified)KFold,
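
To get a feel for the cost of the search, the sketch below expands the bagging grid defined above into its candidate settings and counts the model fits for `cv=5`. It uses only `itertools`; the variable names `grid`, `candidates` and `n_fits` are chosen for illustration.

```python
from itertools import product

# Same grid as the bagging example above (copied here to keep the sketch self-contained)
grid = {'n_estimators': [100, 200],
        'max_features': [1.0],
        'bootstrap_features': [False],
        'oob_score': [True],
        'n_jobs': [-1],
        'base_estimator__max_depth': [3, 5]}
cv = 5

# Every combination of one value per key is one candidate setting
keys = list(grid)
candidates = [dict(zip(keys, values)) for values in product(*grid.values())]

n_fits = len(candidates) * cv  # one fit per candidate per fold
print(len(candidates), "candidates,", n_fits, "cross-validation fits")
```

Only `n_estimators` and the tree depth vary, so there are 2 x 2 = 4 candidates and 20 cross-validation fits. With the default `refit=True`, `GridSearchCV` then fits the best candidate once more on the full training set.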

In [8]:
# 1) Declare the base model
DT = DecisionTreeClassifier()
DT

In [9]:
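# List the available scoring names. `SCORERS` was removed in scikit-learn 1.3;
# newer versions provide sklearn.metrics.get_scorer_names() instead.
sklearn.metrics.SCORERS.keys()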

dict_keys(['explained_variance', 'r2', 'max_error', 'matthews_corrcoef', 'neg_median_absolute_error', 'neg_mean_absolute_error', 'neg_mean_absolute_percentage_error', 'neg_mean_squared_error', 'neg_mean_squared_log_error', 'neg_root_mean_squared_error', 'neg_mean_poisson_deviance', 'neg_mean_gamma_deviance', 'accuracy', 'top_k_accuracy', 'roc_auc', 'roc_auc_ovr', 'roc_auc_ovo', 'roc_auc_ovr_weighted', 'roc_auc_ovo_weighted', 'balanced_accuracy', 'average_precision', 'neg_log_loss', 'neg_brier_score', 'positive_likelihood_ratio', 'neg_negative_likelihood_ratio', 'adjusted_rand_score', 'rand_score', 'homogeneity_score', 'completeness_score', 'v_measure_score', 'mutual_info_score', 'adjusted_mutual_info_score', 'normalized_mutual_info_score', 'fowlkes_mallows_score', 'precision', 'precision_macro', 'precision_micro', 'precision_samples', 'precision_weighted', 'recall', 'recall_macro', 'recall_micro', 'recall_samples', 'recall_weighted', 'f1', 'f1_macro', 'f1_micro', 'f1_samples', 'f1_weig

In [27]:
# 2) Ensemble many models: bagging
# max_samples=0.5: each tree is trained on a bootstrap sample half the size of the training set
bag_model = BaggingClassifier(base_estimator=DT, random_state=1, max_samples=0.5)

# hyperparameter search
grid_search = GridSearchCV(bag_model, param_grid=param_grid, cv=5, scoring='f1')
grid_search.fit(train_x, train_y)

GridSearchCV(cv=5,
             estimator=BaggingClassifier(base_estimator=DecisionTreeClassifier(),
                                         max_samples=0.5, random_state=1),
             param_grid={'base_estimator__max_depth': [3, 5],
                         'bootstrap_features': [False], 'max_features': [1.0],
                         'n_estimators': [100, 200], 'n_jobs': [-1],
                         'oob_score': [True]},
             scoring='f1')

In [28]:
grid_search.best_params_

{'base_estimator__max_depth': 3,
 'bootstrap_features': False,
 'max_features': 1.0,
 'n_estimators': 200,
 'n_jobs': -1,
 'oob_score': True}

- Once the best hyperparameters are found, select the corresponding model

In [29]:
opt_model = grid_search.best_estimator_
opt_model

BaggingClassifier(base_estimator=DecisionTreeClassifier(max_depth=3),
                  max_samples=0.5, n_estimators=200, n_jobs=-1, oob_score=True,
                  random_state=1)

In [15]:
# Out-of-bag score on the training data (this is accuracy, not an F1 score)
opt_model.oob_score_

0.7616387337057728
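
For `BaggingClassifier`, `oob_score_` is an accuracy estimate: each training row is scored only by the ensemble members whose bootstrap sample did not contain it. The sketch below illustrates that bookkeeping in plain Python on toy 1-D data. The stump learner `fit_stump`, the data, and the use of hard votes are assumptions made for illustration and are a simplification of what scikit-learn does internally.

```python
import random
from collections import Counter

random.seed(0)

# Toy 1-D data (hypothetical): the label is 1 when x > 0.5
X = [random.random() for _ in range(200)]
y = [int(x > 0.5) for x in X]

def fit_stump(xs, ys):
    """'Train' a stump: threshold at the midpoint between the two class means."""
    mean0 = sum(x for x, t in zip(xs, ys) if t == 0) / max(1, ys.count(0))
    mean1 = sum(x for x, t in zip(xs, ys) if t == 1) / max(1, ys.count(1))
    return (mean0 + mean1) / 2

n = len(X)
votes = [Counter() for _ in range(n)]      # OOB votes collected per training row
for _ in range(50):                        # 50 ensemble members
    bag = [random.randrange(n) for _ in range(n)]
    threshold = fit_stump([X[i] for i in bag], [y[i] for i in bag])
    for i in set(range(n)) - set(bag):     # only rows this member never saw
        votes[i][int(X[i] > threshold)] += 1

# Score each row by the majority of its OOB votes
scored = [i for i in range(n) if votes[i]]
oob_accuracy = sum(votes[i].most_common(1)[0][0] == y[i] for i in scored) / len(scored)
print(f"OOB accuracy: {oob_accuracy:.3f}")
```

Because every row is judged only by models that did not train on it, the OOB score behaves like a built-in validation estimate without a separate hold-out set.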

In [30]:
# 4) Predict on the test set
test_pred_y = opt_model.predict(test_x)
test_pred_y

array([1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0,
       0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1,
       1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1,
       1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,
       0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,
       0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,
       0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0])

In [31]:
# F1 score on the test data
bag_f1 = f1_score(y_true=test_y, y_pred=test_pred_y)
bag_f1

0.528
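
The F1 score is the harmonic mean of precision and recall for the positive class, so it can be well below accuracy when positives are the minority. A hand-rolled sketch on made-up labels (the helper `f1_from_labels` and the label lists are hypothetical, not taken from this dataset):

```python
def f1_from_labels(y_true, y_pred):
    """Binary F1 for the positive class (label 1), computed from raw counts."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)   # how many predicted positives are correct
    recall = tp / (tp + fn)      # how many true positives are found
    return 2 * precision * recall / (precision + recall)

# Made-up labels: 4 positives, 6 negatives
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 0, 0, 1]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
f1 = f1_from_labels(y_true, y_pred)
print(f"accuracy = {accuracy:.3f}, F1 = {f1:.3f}")  # accuracy = 0.700, F1 = 0.571
```

Here precision is 2/3 and recall is 1/2, giving F1 = 4/7, while accuracy is 0.7. The same gap explains why an accuracy-type score and a test F1 score are not directly comparable.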

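- Variable importance
    - model_name.feature_importances_ (`BaggingClassifier` has no such attribute of its own, so it is read from each fitted tree in `estimators_` and averaged)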

In [32]:
def get_variable_importance(model):
    return np.mean([tree.feature_importances_ for tree in model.estimators_], axis =0)

var_df = pd.Series(get_variable_importance(opt_model), index = dataframe.columns[:-1])

var_df.sort_values(ascending=False)

Glucose                     0.556637
BMI                         0.171305
Age                         0.126675
DiabetesPedigreeFunction    0.061528
Insulin                     0.027821
Pregnancies                 0.023630
BloodPressure               0.017476
SkinThickness               0.014928
dtype: float64
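
The averaging step can be checked on toy numbers. In the sketch below, each row stands for the importances of one hypothetical tree (values invented for illustration). Since each tree's importances sum to 1, their column-wise mean also sums to 1, which matches the output above.

```python
# Importances of three hypothetical trees over three features (each row sums to 1)
tree_importances = [
    [0.6, 0.3, 0.1],
    [0.5, 0.2, 0.3],
    [0.7, 0.2, 0.1],
]

# Column-wise mean: zip(*rows) groups the values feature by feature
mean_importance = [sum(col) / len(col) for col in zip(*tree_importances)]
print([round(v, 4) for v in mean_importance])  # [0.6, 0.2333, 0.1667]
print(round(sum(mean_importance), 6))          # 1.0
```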

In [33]:
dataframe.columns

Index(['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin',
       'BMI', 'DiabetesPedigreeFunction', 'Age', 'Class'],
      dtype='object')

---

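# __Writing random forest code with a package__

### Random forest
> Bagging model: each model is built on a subsample using **all variables** <br>
> Random forest model: each model is built on a subsample using **randomly selected variables** (a random subset of candidate variables is drawn at each split)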
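
A rough sketch of the random variable selection, assuming the square-root rule that recent scikit-learn versions use by default for `RandomForestClassifier` (`max_features="sqrt"`). The helper `candidate_features` is hypothetical; a real tree repeats this draw at every split and then picks the best split among the drawn variables.

```python
import math
import random

random.seed(0)

feature_names = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness',
                 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age']

# Square-root rule: with 8 features, 2 candidates are considered per split
n_candidates = max(1, int(math.sqrt(len(feature_names))))

def candidate_features():
    """Draw the candidate variables for one split, without replacement."""
    return random.sample(feature_names, n_candidates)

for split in range(3):
    print(f"split {split}: candidates = {candidate_features()}")
```

Restricting each split to a few random candidates makes the trees less similar to one another, which is what lets averaging reduce variance further than plain bagging.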

![Alt text](./images/bagging.png)

In [34]:
# Building a random forest with sklearn
# https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html

import pandas as pd
from sklearn import model_selection
from sklearn.ensemble import RandomForestClassifier

In [35]:
# hyperparameters
param_grid = {'n_estimators': [100, 200],
              'oob_score': [True], # compute out of bag error
              'n_jobs':[-1], 
              'max_depth': [3, 5]
              }

In [36]:
# 1) Declare the model & 2) ensemble many models: random forest
# (no random_state is set, so results can differ slightly between runs)
rf_model = RandomForestClassifier()

# hyperparameter search
grid_search = GridSearchCV(rf_model, param_grid=param_grid, cv=5, scoring='f1')
grid_search.fit(train_x, train_y)

GridSearchCV(cv=5, estimator=RandomForestClassifier(),
             param_grid={'max_depth': [3, 5], 'n_estimators': [100, 200],
                         'n_jobs': [-1], 'oob_score': [True]},
             scoring='f1')

In [37]:
grid_search.best_params_

{'max_depth': 5, 'n_estimators': 200, 'n_jobs': -1, 'oob_score': True}

- Once the best hyperparameters are found, select the corresponding model

In [38]:
opt_model = grid_search.best_estimator_
opt_model

RandomForestClassifier(max_depth=5, n_estimators=200, n_jobs=-1, oob_score=True)

In [39]:
# Out-of-bag score on the training data (this is accuracy, not an F1 score)
opt_model.oob_score_

0.7560521415270018

In [40]:
# 4) Predict on the test set
test_pred_y = opt_model.predict(test_x)
test_pred_y

array([1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0,
       0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1,
       1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0,
       1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,
       0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,
       0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0,
       0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0])

In [41]:
# F1 score on the test data
rf_f1 = f1_score(y_true=test_y, y_pred=test_pred_y)
rf_f1

0.5528455284552845

- Variable importance
    - model_name.feature_importances_

In [42]:
opt_model.feature_importances_

array([0.07253087, 0.34993161, 0.0531809 , 0.0440453 , 0.06213759,
       0.17237682, 0.09305169, 0.15274522])

In [43]:
var_df = pd.Series(opt_model.feature_importances_, index = dataframe.columns[:-1])
var_df.sort_values(ascending=False)

Glucose                     0.349932
BMI                         0.172377
Age                         0.152745
DiabetesPedigreeFunction    0.093052
Pregnancies                 0.072531
Insulin                     0.062138
BloodPressure               0.053181
SkinThickness               0.044045
dtype: float64

---

- Summary

In [44]:
pd.Series([bag_f1,rf_f1],index =['bag', 'rf'], name = 'f1-score')

bag    0.528000
rf     0.552846
Name: f1-score, dtype: float64