## Decision Trees

- Decision Trees (DTs) are a non-parametric supervised learning method used for classification and regression: https://scikit-learn.org/stable/modules/tree.html
- A decision tree can be used to represent decisions and decision making visually and explicitly.
- It is the basic building block of the random forest, one of the most powerful machine learning algorithms.
- DTs need relatively little data preprocessing (e.g., no feature scaling).
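
As a small illustration of the "visual and explicit" point, the sketch below fits a shallow tree and prints its learned rules as text. The iris dataset and `max_depth=2` are assumptions chosen only to keep the output short; they are not part of the original notes. Note that the features are used unscaled.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy data chosen for illustration only (not the dataset used later in these notes).
iris = load_iris()

# A shallow tree keeps the printed rule set small; features are left unscaled.
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(iris.data, iris.target)

# export_text renders the fitted tree as nested if/else style rules.
rules = export_text(tree, feature_names=list(iris.feature_names))
print(rules)
```

Each printed line is a threshold test on one feature, and each leaf ends in a predicted class, which is what makes a single tree easy to read.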

#### Advantages and disadvantages (https://towardsdatascience.com/decision-trees-in-machine-learning-641b9c4e8052)

**Advantages**

  - Simple to understand, interpret, and visualize.
  - Decision trees implicitly perform variable screening or feature selection.
  - Can handle both numerical and categorical data.
  - Decision trees require relatively little effort from users for data preparation.
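
The implicit feature screening mentioned above can be inspected through the `feature_importances_` attribute of a fitted scikit-learn tree. This is a minimal sketch on synthetic data (an assumption for illustration): with `shuffle=False`, the first two of the six columns carry signal and the rest are noise.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, assumed for illustration: columns 0-1 are informative,
# columns 2-5 are pure noise (shuffle=False keeps that column order).
X, y = make_classification(n_samples=500, n_features=6, n_informative=2,
                           n_redundant=0, shuffle=False, random_state=0)

tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X, y)

# Impurity-based importances: non-negative and normalized to sum to 1.
importances = tree.feature_importances_
for i, imp in enumerate(importances):
    print(f"feature_{i}: {imp:.3f}")
```

Features that the tree never splits on receive an importance of zero, which is the sense in which the tree "screens" variables.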

**Disadvantages**

  - Decision-tree learners can create over-complex trees that do not generalize well to unseen data. This is called overfitting.
  - Decision trees can be unstable because small variations in the data might result in a completely different tree being generated. Thus, decision trees exhibit high variance, which needs to be lowered by methods like bagging and boosting (covered later in these notes).
  - Decision tree learners create biased trees if some classes dominate. It is therefore recommended to balance the data set prior to fitting with the decision tree.
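
The overfitting problem is easy to reproduce. The sketch below, on synthetic data with deliberately noisy labels (`flip_y=0.2`, an assumption for illustration), grows an unrestricted tree and compares training and test accuracy.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with label noise, assumed for illustration only.
X, y = make_classification(n_samples=600, n_features=10, flip_y=0.2,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# No depth or leaf-size limit: the tree keeps splitting until every leaf is pure.
deep = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

train_acc = deep.score(X_train, y_train)
test_acc = deep.score(X_test, y_test)
print(f"train accuracy: {train_acc:.3f}, test accuracy: {test_acc:.3f}")
```

The tree memorizes the training set, noise included, so the training accuracy is perfect while the test accuracy is noticeably lower.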

## How decision trees work

<img src="./img/decision_trees.PNG" width="700" height="700">
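
As a rough sketch of what happens at a single node, the code below scans candidate thresholds on one feature and keeps the one with the lowest weighted Gini impurity. It assumes CART-style binary splits with the Gini criterion (the default `criterion` in scikit-learn); the helper names `gini` and `best_split` are hypothetical and not taken from any library.

```python
import numpy as np

def gini(labels):
    """Gini impurity 1 - sum_k p_k^2 of a label array (0.0 if empty)."""
    if len(labels) == 0:
        return 0.0
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(x, y):
    """Try midpoints between sorted unique values of one feature.

    Returns (threshold, weighted impurity) of the best split, or
    (None, impurity of the unsplit node) if no split improves on it.
    """
    best_thr, best_imp = None, gini(y)
    values = np.unique(x)
    for lo, hi in zip(values[:-1], values[1:]):
        thr = (lo + hi) / 2.0
        left, right = y[x <= thr], y[x > thr]
        # Impurity of the two children, weighted by their sample counts.
        imp = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
        if imp < best_imp:
            best_thr, best_imp = thr, imp
    return best_thr, best_imp

x = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
y = np.array([0, 0, 0, 1, 1, 1])
thr, imp = best_split(x, y)
print(thr, imp)  # best threshold 6.5, weighted impurity 0.0
```

A full tree learner repeats this search over every feature at every node and recurses into the two children until a stopping rule is met.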

## Hyperparameters

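- Hyperparameters for regularization:
  - max_depth: maximum depth of the tree (default = None, i.e., no limit)
  - min_samples_split: minimum number of samples a node must have before it can be split
  - min_samples_leaf: minimum number of samples a leaf node must have
  - min_weight_fraction_leaf: same as min_samples_leaf, but expressed as a fraction of the total weighted number of samples
  - max_leaf_nodes: maximum number of leaf nodes
  - max_features: maximum number of features considered for the split at each node
- Increasing the parameters that start with min_ or decreasing those that start with max_ increases regularization.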
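
To see how these parameters constrain the tree, the sketch below fits the same synthetic data (assumed for illustration) under three settings and reports the resulting depth and number of leaves.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Noisy synthetic data, assumed for illustration only.
X, y = make_classification(n_samples=600, n_features=10, flip_y=0.2,
                           random_state=0)

settings = {
    "no restriction": {},
    "max_depth=3": {"max_depth": 3},
    "min_samples_leaf=4": {"min_samples_leaf": 4},
}

sizes = {}
for name, params in settings.items():
    tree = DecisionTreeClassifier(random_state=0, **params).fit(X, y)
    # get_depth() and get_n_leaves() describe the size of the fitted tree.
    sizes[name] = (tree.get_depth(), tree.get_n_leaves())
    print(f"{name:>20}: depth={sizes[name][0]}, leaves={sizes[name][1]}")
```

A depth limit of 3 caps the tree at 8 leaves, and a minimum leaf size of 4 caps it at one leaf per 4 samples, so both restricted trees are simpler than the unrestricted one.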

### The effect of hyperparameters - 1

- max_depth = 3
<img src="./img/decision_trees2.PNG" width="600" height="600">
- max_depth = 5
<img src="./img/decision_trees3.png" width="600" height="600">

### The effect of hyperparameters - 2

- No restriction
<img src="./img/decision_trees4.PNG" width="600" height="600">
- min_samples_leaf = 4
<img src="./img/decision_trees5.png" width="600" height="600">

## Ensemble Learning
- Ensemble: a set of weak learners is combined to create a strong learner that performs better than any single one.
  - Bagging (bootstrap aggregating): given a dataset, bagging draws many new training sets by random sampling with replacement, trains a model on each of them, and aggregates their predictions into the final prediction.
  - Boosting: several models are trained and make predictions sequentially; each later model is trained with extra weight on the samples that the earlier models predicted incorrectly, so that it can correct those errors.
    - AdaBoost (adaptive boosting)
    - Gradient boosting
  - Stacking

- Ensemble Learning can be viewed as the wisdom of the crowd.
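
The bagging procedure described above can be sketched by hand in a few lines: draw bootstrap samples, fit one tree per sample, and take a majority vote. The synthetic data, the choice of 25 trees, and the variable names are assumptions for illustration; scikit-learn's `BaggingClassifier` packages the same idea.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data, assumed for illustration only.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

rng = np.random.default_rng(0)
n_trees = 25  # odd, so a majority vote between two classes cannot tie
votes = np.zeros((n_trees, len(X_test)), dtype=int)

for i in range(n_trees):
    # Bootstrap: sample row indices with replacement, same size as the training set.
    idx = rng.integers(0, len(X_train), size=len(X_train))
    tree = DecisionTreeClassifier(random_state=i).fit(X_train[idx], y_train[idx])
    votes[i] = tree.predict(X_test)

# Aggregate: predict class 1 where more than half of the trees vote for it.
bagged_pred = (votes.mean(axis=0) > 0.5).astype(int)
bagged_acc = (bagged_pred == y_test).mean()
print(f"bagged accuracy: {bagged_acc:.3f}")
```

Boosting differs in that the models cannot be trained independently like this, because each one depends on the errors of the previous ones.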

<img src="./img/boosting.PNG" width="600" height="600">

|                           |      Bagging       |                Boosting                 | Stacking |
|---------------------------|:------------------:|:---------------------------------------:|:--------:|
| Partitioning into subsets |       random       | higher weights on misclassified samples | various  |
| Goal                      | minimize variance  |        increase predictive power        |   both   |
| Methods                   |  random subspace   |            gradient descent             | blending |
| Functions to combine      | (weighted) average |            weighted majority            |  logit   |

- $T_1 \sim T_{100}$
- correlation
- $X_1 \sim X_{5}$

### Random Forest

- Decision trees are known to be prone to overfitting, which increases the variance of the forecasts.
- The random forest (RF) method was designed to produce ensemble forecasts with lower variance.
- Random forest introduces two sources of randomness:
    - (1) Bootstrap aggregation: individual trees are independently trained over bootstrapped subsets of the data.
    - (2) When optimizing each node split, only a random subsample of the features is considered.
        - In decision trees, the most predictive features tend to be placed at the upper nodes.
        - If the features were not randomly sampled, every tree would tend to split on the same strong features near the root, so the trees would be more correlated.
        - Highly correlated trees reduce the benefit of the ensemble.
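
Why correlation between trees matters can be made precise with a standard result. As a simplifying assumption (the symbols $B$, $\sigma^2$ and $\rho$ are introduced here and are not from the original notes), suppose the predictions $T_1, \dots, T_B$ of the $B$ trees are identically distributed with variance $\sigma^2$ and pairwise correlation $\rho$. Then the variance of their average is

$$
\operatorname{Var}\left(\frac{1}{B}\sum_{b=1}^{B} T_b\right)
= \frac{1}{B^2}\left(B\,\sigma^2 + B(B-1)\,\rho\,\sigma^2\right)
= \rho\,\sigma^2 + \frac{1-\rho}{B}\,\sigma^2 .
$$

Adding more trees shrinks only the second term; the first term $\rho\,\sigma^2$ remains no matter how large $B$ is. Sampling features at each split lowers $\rho$, which is what lets the forest reduce variance beyond what bagging alone achieves.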

In [58]:
# Decision Tree example

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings

warnings.filterwarnings(action='ignore')
%matplotlib inline

df = pd.read_csv('data/titanic.csv', index_col=0)
df.head()

Unnamed: 0,survived,pclass,sex,sibsp,parch,fare,class,who,adult_male,alive,alone
0,0,3,male,1,0,7.25,Third,man,True,no,False
1,1,1,female,1,0,71.2833,First,woman,False,yes,False
2,1,3,female,0,0,7.925,Third,woman,False,yes,True
3,1,1,female,1,0,53.1,First,woman,False,yes,False
4,0,3,male,0,0,8.05,Third,man,True,no,True


In [59]:
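# One-hot encode the categorical columns, then cast every column (including the
# boolean adult_male and alone columns) to 0/1 integers.
temp = pd.get_dummies(df[['sex', 'class', 'adult_male', 'alone']]).astype(int)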
temp = pd.concat([temp, df[['pclass', 'fare', 'sibsp', 'parch']]], axis=1)
temp.head()

Unnamed: 0,adult_male,alone,sex_female,sex_male,class_First,class_Second,class_Third,pclass,fare,sibsp,parch
0,1,0,0,1,0,0,1,3,7.25,1,0
1,0,0,1,0,1,0,0,1,71.2833,1,0
2,0,1,1,0,0,0,1,3,7.925,0,0
3,0,0,1,0,1,0,0,1,53.1,1,0
4,1,1,0,1,0,0,1,3,8.05,0,0


In [67]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score, confusion_matrix, classification_report
from sklearn.model_selection import train_test_split, KFold, cross_val_score
from sklearn.model_selection import GridSearchCV

criterion = ['gini', 'entropy']
maxdepth = [3,5,10,20]
minsampleleafs = [0.005 ,0.01, 0.05, 0.10]

model = DecisionTreeClassifier(random_state=42)

hyperparameters = {'criterion': criterion,
                   'max_depth': maxdepth,
                   'min_samples_leaf':minsampleleafs
                  }

y = df.survived
X = temp.copy()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

gsearch = GridSearchCV(model, hyperparameters, verbose=1)
gsearch.fit(X_train, y_train)


Fitting 5 folds for each of 32 candidates, totalling 160 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 160 out of 160 | elapsed:    0.7s finished


GridSearchCV(estimator=DecisionTreeClassifier(random_state=42),
             param_grid={'criterion': ['gini', 'entropy'],
                         'max_depth': [3, 5, 10, 20],
                         'min_samples_leaf': [0.005, 0.01, 0.05, 0.1]},
             verbose=1)

In [62]:
gsearch.best_params_

{'criterion': 'gini', 'max_depth': 3, 'min_samples_leaf': 0.05}

In [63]:
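# GridSearchCV refits the best model on all of X_train by default (refit=True),
# so best_estimator_ is already fitted and can be used directly.
mod = gsearch.best_estimator_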
accuracy_score(y_test, mod.predict(X_test))

0.8208955223880597

In [64]:
print(confusion_matrix(y_test, mod.predict(X_test)))

[[140  17]
 [ 31  80]]


In [66]:
print(classification_report(y_test, mod.predict(X_test)))

              precision    recall  f1-score   support

           0       0.82      0.89      0.85       157
           1       0.82      0.72      0.77       111

    accuracy                           0.82       268
   macro avg       0.82      0.81      0.81       268
weighted avg       0.82      0.82      0.82       268



In [68]:
len(X.columns)

11

In [70]:
# Random Forest

nestimators = [10,20,50,100]
criterion = ['gini', 'entropy']
maxdepth = [3,5,10,20]
minsampleleafs = [0.005 ,0.01, 0.05, 0.10]
maxfeatures = [2,3,5,11]  # 11 = all features, i.e. no feature subsampling

model = RandomForestClassifier(random_state=42)

hyperparameters = {'n_estimators': nestimators,
                   'max_features': maxfeatures,
                   'criterion': criterion,
                   'max_depth': maxdepth,
                   'min_samples_leaf':minsampleleafs
                  }

y = df.survived
X = temp.copy()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

gsearch = GridSearchCV(model, hyperparameters, verbose=1)
gsearch.fit(X_train, y_train)

Fitting 5 folds for each of 512 candidates, totalling 2560 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 2560 out of 2560 | elapsed:  2.8min finished


GridSearchCV(estimator=RandomForestClassifier(random_state=42),
             param_grid={'criterion': ['gini', 'entropy'],
                         'max_depth': [3, 5, 10, 20],
                         'max_features': [2, 3, 5, 11],
                         'min_samples_leaf': [0.005, 0.01, 0.05, 0.1],
                         'n_estimators': [10, 20, 50, 100]},
             verbose=1)

In [71]:
# best_estimator_ was already refit on all of X_train by GridSearchCV (refit=True).
mod = gsearch.best_estimator_
accuracy_score(y_test, mod.predict(X_test))

0.7910447761194029

In [72]:
print(confusion_matrix(y_test, mod.predict(X_test)))

[[145  12]
 [ 44  67]]


In [73]:
print(classification_report(y_test, mod.predict(X_test)))

              precision    recall  f1-score   support

           0       0.77      0.92      0.84       157
           1       0.85      0.60      0.71       111

    accuracy                           0.79       268
   macro avg       0.81      0.76      0.77       268
weighted avg       0.80      0.79      0.78       268

