#### Tree Ensemble
##### Ensemble Learning
- Combines many models into one stronger model; the ensembles covered here are all built on decision trees
- Generally among the best-performing approaches for structured data

##### Random Forest
- Builds many decision trees with randomness injected into each one; together the trees form the forest.
- Combines the predictions of the individual decision trees into a final prediction.

**Bootstrap Sample**
- Bootstrapping draws samples from a data set with replacement, so the same row can be drawn more than once.
    - Bootstrap sample: a sample produced by the bootstrap method
- Example: to draw 100 samples from a bag of 1,000, draw one sample, put it back into the bag, and then draw again.
- Drawing all 100 this way can pick the same sample more than once. The result is a bootstrap sample.
- By default the bootstrap sample is the same size as the training set
    - 1,000 training samples -> a bootstrap sample of 1,000 rows
- Training set -> random sample drawn by bootstrapping -> decision tree training
    - Repeating this produces many trees, which together form the forest
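
The sampling step above can be sketched with NumPy. This is a minimal illustration and not scikit-learn's internal code; the variable names and the fixed seed are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(42)  # fixed seed, only so the example is reproducible
n_samples = 1_000
train_indices = np.arange(n_samples)

# Draw n_samples row indices WITH replacement: this is one bootstrap sample
bootstrap_indices = rng.choice(train_indices, size=n_samples, replace=True)

# Rows that were never drawn are the out-of-bag (OOB) rows for this tree
oob_indices = np.setdiff1d(train_indices, bootstrap_indices)

unique_fraction = np.unique(bootstrap_indices).size / n_samples
print(f"unique rows in the bootstrap sample: {unique_fraction:.1%}")
print(f"out-of-bag rows: {oob_indices.size}")
```

On average about 63% of the rows (1 - 1/e) end up in a bootstrap sample of this size; the rest are left out and can be used for out-of-bag evaluation.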



**In scikit-learn**
- `RandomForestClassifier`
    - At each node, randomly selects as many candidate features as the square root of the total number of features
        - e.g. 4 features -> each node randomly picks 2 features and searches them for the best split
    - Trains 100 decision trees this way by default
        - Classification
            - Averages the per-class probabilities of all trees and predicts the class with the highest average
        - Regression (`RandomForestRegressor`)
            - Simply averages the predictions of all trees
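
The averaging rule for classification can be checked by hand. The sketch below uses synthetic data from `make_classification` (an assumption for the example, not the wine data used later) and compares the mean of the per-tree probabilities with the forest's own `predict_proba`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic 4-feature data set, used only for illustration
X, y = make_classification(n_samples=500, n_features=4, random_state=42)

forest = RandomForestClassifier(
    n_estimators=100,
    max_features="sqrt",  # sqrt(4) = 2 candidate features per split
    random_state=42,
).fit(X, y)

# Average the class probabilities predicted by every individual tree
tree_probas = np.stack([tree.predict_proba(X) for tree in forest.estimators_])
mean_proba = tree_probas.mean(axis=0)

# Predict the class with the highest averaged probability
manual_pred = forest.classes_[mean_proba.argmax(axis=1)]

print(np.allclose(mean_proba, forest.predict_proba(X)))
print(np.mean(manual_pred == forest.predict(X)))
```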
---

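#### Extra Trees
- Similar to random forest
- Default number of trees = 100
- Each node considers a random subset of the features, and the split threshold is drawn at random instead of searching for the best split
    - Individual trees are weaker
    - but training is fast, which suits building many trees
- Does not use bootstrap samples by default: each tree is trained on the whole training set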
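
The two differences from random forest can be read off the fitted estimators. This is a small sketch on synthetic data (the data set and the `rf_demo` / `et_demo` names are assumptions for the example).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=4, random_state=42)

rf_demo = RandomForestClassifier(n_estimators=10, random_state=42).fit(X, y)
et_demo = ExtraTreesClassifier(n_estimators=10, random_state=42).fit(X, y)

# Random forest bootstraps by default, extra trees does not
print("bootstrap:", rf_demo.bootstrap, et_demo.bootstrap)

# The underlying trees search for the best split vs. draw a random split
print("splitter:", rf_demo.estimators_[0].splitter, et_demo.estimators_[0].splitter)
```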
---

#### Gradient Boosting
- Ensembles shallow decision trees, each one correcting the errors of the trees before it

##### In scikit-learn
- `GradientBoostingClassifier`
    - Default tree depth = 3
    - Shallow trees make it resistant to overfitting, so it generally generalizes well
- Trees are added to the ensemble by gradient descent on the loss function
- For classification
    - the logistic loss function is used
- For regression
    - the mean squared error loss function is used
- Parameter: `subsample`
    - Default value is 1.0 -> each tree is trained on the entire training set
    - Less than 1.0 -> each tree is trained on a random fraction of the training set, analogous to stochastic or mini-batch gradient descent
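
The idea of each shallow tree correcting the previous trees' errors is easiest to see for regression with squared error, where the negative gradient is simply the residual. The following is a hand-rolled sketch under that assumption, on synthetic data; it is not how `GradientBoostingClassifier` is implemented internally, and the learning rate and tree count are arbitrary choices.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=500, n_features=3, noise=10.0, random_state=42)

learning_rate = 0.1
n_trees = 50

# Start from a constant prediction: the mean of the targets
prediction = np.full(y.shape, y.mean())
mse_history = [np.mean((y - prediction) ** 2)]

trees = []
for _ in range(n_trees):
    residual = y - prediction  # negative gradient of the squared error
    tree = DecisionTreeRegressor(max_depth=3, random_state=42).fit(X, residual)
    prediction += learning_rate * tree.predict(X)  # small step toward the residual
    trees.append(tree)
    mse_history.append(np.mean((y - prediction) ** 2))

print(f"training MSE: {mse_history[0]:.1f} -> {mse_history[-1]:.1f}")
```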
---

#### Histogram-Based Gradient Boosting
- Bins each input feature into 256 intervals
    - With so few candidate thresholds, the best split can be found very quickly
    - One of the 256 bins is reserved for missing values

##### In scikit-learn
- `HistGradientBoostingClassifier`
    - Performance is generally stable
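
The binning step can be illustrated with NumPy. This is a simplified sketch that assumes quantile-based bin edges and a dedicated last bin for missing values; it only mirrors the idea described above and is not scikit-learn's actual binning code.

```python
import numpy as np

rng = np.random.default_rng(42)
feature = rng.normal(size=10_000)
# Knock out 100 values to simulate missing data
feature[rng.choice(feature.size, size=100, replace=False)] = np.nan

n_value_bins = 255  # bins 0..254 hold observed values
missing_bin = 255   # bin 255 is reserved for missing values

mask = ~np.isnan(feature)
# 254 interior quantile thresholds split the observed values into 255 bins
edges = np.quantile(feature[mask], np.linspace(0, 1, n_value_bins + 1)[1:-1])

binned = np.full(feature.shape, missing_bin, dtype=np.uint8)
binned[mask] = np.searchsorted(edges, feature[mask], side="right")

print("distinct bins used:", np.unique(binned).size)
print("values in the missing-value bin:", np.sum(binned == missing_bin))
```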
---

#### Type of Data
##### Structured Data
- Easy to store in a database, Excel, or CSV
- Best handled by ensemble learning


##### Unstructured Data
- Hard to store in Excel or CSV
- Handled mainly by neural network algorithms
- e.g.,
    - music
    - text
    - etc.
---

In [1]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

In [2]:
wine = pd.read_csv('https://bit.ly/wine_csv_data')
wine.head()

| | alcohol | sugar | pH | class |
|---|---|---|---|---|
| 0 | 9.4 | 1.9 | 3.51 | 0.0 |
| 1 | 9.8 | 2.6 | 3.2 | 0.0 |
| 2 | 9.8 | 2.3 | 3.26 | 0.0 |
| 3 | 9.8 | 1.9 | 3.16 | 0.0 |
| 4 | 9.4 | 1.9 | 3.51 | 0.0 |


In [6]:
data = wine[['alcohol', 'sugar', 'pH']].to_numpy()
target = wine['class'].to_numpy()

In [7]:
train_input, test_input, train_target, test_target = train_test_split(
    data,
    target,
    test_size=0.2,
    random_state=42
)

In [8]:
from sklearn.model_selection import cross_validate

#### Random Forest

In [9]:
# import RandomForestClassifier
from sklearn.ensemble import RandomForestClassifier

# create the RandomForestClassifier
rf = RandomForestClassifier(
    n_estimators=100, # number of trees in the ensemble (100 is the default)
    n_jobs=-1, # use all CPU cores
    random_state=42, # fixed seed for reproducible results; usually omitted in real use
)

In [10]:
# use cross_validate for RandomForestClassifier
scores = cross_validate(
    rf,
    train_input,
    train_target,
    return_train_score=True,
    n_jobs=-1
)

In [41]:
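# A train score well above the test score indicates overfitting.
# Note: this cell was last run as In [41], after the histogram-based gradient boosting
# cell, so `scores` held that model's results and the output below is identical to the
# HGB output. Re-run it right after In [10] to see the random forest scores.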
print(
    np.mean(scores.get("train_score")),
    np.mean(scores.get("test_score"))
)

0.9321723946453317 0.8801241948619236


In [18]:
rf.fit(train_input, train_target)
print(rf.feature_importances_) # alcohol, sugar, pH
print(rf.n_features_in_)

[0.23167441 0.50039841 0.26792718]
3


In [19]:
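# OOB (out-of-bag) samples: training rows that were not drawn into a tree's bootstrap sample
# oob_score=True scores the forest on those rows, which acts like a built-in validation set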
rf = RandomForestClassifier(
    oob_score=True,
    n_jobs=1,
    random_state=42
)

rf.fit(train_input, train_target)

print(rf.oob_score_)

0.8934000384837406


##### Extra Trees

In [20]:
from sklearn.ensemble import ExtraTreesClassifier

et = ExtraTreesClassifier(
    n_jobs=-1,
    random_state=42, # fixed seed for reproducible results; usually omitted in real use
)

In [21]:
scores = cross_validate(
    et,
    train_input,
    train_target,
    return_train_score=True,
    n_jobs = -1
)

In [23]:
print(np.mean(scores["train_score"]), np.mean(scores["test_score"]))

0.9974503966084433 0.8887848893166506


In [25]:
et.fit(train_input, train_target)
print(et.feature_importances_) # Alcohol, Sugar, pH

[0.20183568 0.52242907 0.27573525]


In [26]:
# Regression version: ExtraTreesRegressor

##### Gradient Boosting

In [28]:
from sklearn.ensemble import GradientBoostingClassifier

gb = GradientBoostingClassifier(
    random_state=42
)

scores = cross_validate(
    gb,
    train_input,
    train_target,
    return_train_score=True,
    n_jobs=-1
)

print(np.mean(scores['train_score']), np.mean(scores['test_score'])) # train and test scores are close: little overfitting

0.8881086892152563 0.8720430147331015


In [30]:
gb = GradientBoostingClassifier(
    n_estimators=500,
    learning_rate=0.2,
    random_state=42
)

scores = cross_validate(
    gb,
    train_input,
    train_target,
    return_train_score=True,
    n_jobs=-1
)

print(np.mean(scores['train_score']), np.mean(scores['test_score']))

0.9464595437171814 0.8780082549788999


In [31]:
gb.fit(train_input, train_target)
print(gb.feature_importances_)

[0.15887763 0.6799705  0.16115187]


##### Histogram-Based Gradient Boosting

In [33]:
from sklearn.ensemble import HistGradientBoostingClassifier

hgb = HistGradientBoostingClassifier(
    random_state=42
)

scores = cross_validate(
    hgb,
    train_input,
    train_target,
    return_train_score=True,
)

print(np.mean(scores['train_score']), np.mean(scores['test_score']))

0.9321723946453317 0.8801241948619236


In [37]:
# compute feature importances with permutation importance (on the training set)
from sklearn.inspection import permutation_importance

hgb.fit(train_input, train_target)
result = permutation_importance(
    hgb,
    train_input,
    train_target,
    n_repeats=10,
    random_state=42,
    n_jobs=1
)
print(result.importances_mean)

[0.08876275 0.23438522 0.08027708]


In [39]:
result = permutation_importance(
    hgb,
    test_input,
    test_target,
    n_repeats=10,
    random_state=42,
    n_jobs=-1
)
print(result.importances_mean)

[0.05969231 0.20238462 0.049     ]


In [40]:
hgb.score(
    test_input,
    test_target
)

0.8723076923076923