# Before jumping to the code implementation
***
- If you do not use Git and only want to download this ipynb file, you can use wget or curl instead.
    - wget https://raw.githubusercontent.com/COMBINE-SKKU/bio_data_mining/master/Lecture6-EnsembleMethods-Adaboost_GradientBoosting.ipynb
    - curl https://raw.githubusercontent.com/COMBINE-SKKU/bio_data_mining/master/Lecture6-EnsembleMethods-Adaboost_GradientBoosting.ipynb --output Lecture6-EnsembleMethods-Adaboost_GradientBoosting.ipynb
<br><br><br>    
- If you want to use git clone as shown in the video, note that git will not clone into a directory that already exists and is not empty, so delete that directory (for example, /Documents/BioDataMining) before running the command.
- Open the terminal and make a directory dedicated to the code for this class (e.g., mkdir ~/Documents/BioDataMining)
- Install Git (https://git-scm.com/book/en/v2/Getting-Started-Installing-Git)
- Go to the directory for this class, and type: git clone https://github.com/COMBINE-SKKU/bio_data_mining.git
- If you are new to Python programming and Jupyter Notebook, please install Anaconda
    - Windows: https://problemsolvingwithpython.com/01-Orientation/01.03-Installing-Anaconda-on-Windows/
    - Mac: https://problemsolvingwithpython.com/01-Orientation/01.04-Installing-Anaconda-on-MacOS/
    - Linux: https://problemsolvingwithpython.com/01-Orientation/01.05-Installing-Anaconda-on-Linux/
- Then learn how to open a Jupyter notebook (https://www.youtube.com/watch?v=OJMILWh6ARY)
- Run the following code.

# Ensemble Methods
---
1. <b>Random Forest</b>: Bagging + Decision Tree
2. <b>Boosting</b>: Sequential bagging + any type of supervised learning method (but mostly decision trees)
    - <b>AdaBoost (Adaptive Boosting)</b>: At every iteration, AdaBoost gives more weight to the cases that were misclassified in the previous round and updates the classifier to improve accuracy.
    - <b>Gradient boosting</b>: At every iteration of the training procedure, it fits a new model to the residual that the previous model could not predict. As such, the algorithm gradually reduces the residual between observation and prediction, providing increasingly higher prediction accuracy.

# Adaboost
---
## Differences with Random Forest
- Flexible depth of decision trees vs. fixed-size stumps
- Independent tree training vs. sequential training in which each tree improves on the previous one
- Multiple randomly selected features vs. one feature at a time
- Equal sample weights vs. different weights depending on the prediction results

## Keywords
- How to build the stump? -> Based on CART + Gini index! (Decision Tree Algorithm)
- How to compute the "Amount of Say"?
- How to update the weight of correctly classified and misclassified samples?
- How to resample the cases for the next stump? -> Bootstrap of weighted samples

# Heart disease classification
---
<img src='https://raw.githubusercontent.com/COMBINE-SKKU/combine-skku/master/class/week6/Fig6-1.png' width="500"/>
<br>
1. How to build the stump? Which feature are we going to use to build the first stump? CART cost function + Gini index
<img src='https://raw.githubusercontent.com/COMBINE-SKKU/combine-skku/master/class/week6/Fig6-2.png' width="500"/>

# Heart disease classification
---
Answer: The CART cost function selected the income feature, with a threshold of $51509, as the best criterion for predicting the illness condition.
<img src='https://raw.githubusercontent.com/COMBINE-SKKU/combine-skku/master/class/week6/Fig6-3.png' width="500"/>
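
As a rough illustration of how a stump's threshold can be chosen, the sketch below scans candidate thresholds on one numeric feature and keeps the one with the lowest weighted Gini impurity. The income values and labels are made up for illustration (they are not the data shown in the figures), and `gini` and `best_threshold` are hypothetical helper names.

```python
def gini(labels):
    """Gini impurity of a list of binary labels (0 or 1)."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 1.0 - p ** 2 - (1.0 - p) ** 2


def best_threshold(values, labels):
    """Return (threshold, weighted_gini) for the best split `value < threshold`."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = (None, float("inf"))
    for (v1, _), (v2, _) in zip(pairs, pairs[1:]):
        if v1 == v2:
            continue  # no threshold fits between identical values
        threshold = (v1 + v2) / 2  # midpoint between neighbouring values
        left = [y for v, y in pairs if v < threshold]
        right = [y for v, y in pairs if v >= threshold]
        score = len(left) / n * gini(left) + len(right) / n * gini(right)
        if score < best[1]:
            best = (threshold, score)
    return best


# Hypothetical toy data: income in dollars, 1 = ill, 0 = healthy
income = [30000, 42000, 48000, 55000, 61000, 70000, 82000, 95000]
ill = [1, 1, 1, 1, 0, 1, 0, 0]

threshold, score = best_threshold(income, ill)
print(threshold, score)
```

With several features, the same scan would be repeated per feature and the feature with the lowest weighted Gini impurity would become the stump.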

2. How to compute the "Amount of Say"? 
<img src='https://raw.githubusercontent.com/COMBINE-SKKU/combine-skku/master/class/week6/Fig6-4.png' width="500"/>
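Answer: Amount of Say for Income (threshold: 51509): <b>1/2 * log((1-(2/8))/(2/8)) ≈ 0.24</b> with a base-10 logarithm (≈ 0.55 with the natural logarithm, which is the usual convention for AdaBoost)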
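
The formula above can be evaluated directly. The sketch below is a minimal illustration: `amount_of_say` is a hypothetical helper name, and the small `eps` clamp is an added assumption to avoid dividing by zero when a stump makes no errors. It also prints the base-10 variant, because the base of the logarithm changes the number (about 0.55 with the natural logarithm versus about 0.24 with base 10 for an error of 2/8).

```python
import math


def amount_of_say(total_error, eps=1e-10):
    """Amount of Say = 1/2 * ln((1 - error) / error)."""
    # Clamp the error away from 0 and 1 so the logarithm stays finite (assumption)
    total_error = min(max(total_error, eps), 1 - eps)
    return 0.5 * math.log((1 - total_error) / total_error)


error = 2 / 8  # 2 of 8 samples misclassified by the stump

say_natural = amount_of_say(error)                      # natural logarithm
say_base10 = 0.5 * math.log10((1 - error) / error)      # base-10 logarithm

print(round(say_natural, 3), round(say_base10, 3))
```

A stump with an error of exactly 0.5 gets an Amount of Say of 0, and one with an error above 0.5 gets a negative value, so its vote is effectively reversed.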

3. How to update the weight of correctly classified and misclassified samples?
<img src='https://raw.githubusercontent.com/COMBINE-SKKU/combine-skku/master/class/week6/Fig6-5.png' width="500"/>

<img src='https://raw.githubusercontent.com/COMBINE-SKKU/combine-skku/master/class/week6/Fig6-6.png' width="500"/>
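
The weight update can be sketched as follows, assuming the common AdaBoost rule: multiply the weight of each misclassified sample by exp(+Amount of Say), multiply the weight of each correctly classified sample by exp(-Amount of Say), then normalise so the weights sum to 1. The positions of the two misclassified samples are hypothetical, the natural logarithm is assumed, and `update_weights` is an illustrative helper name.

```python
import math


def update_weights(weights, misclassified, say):
    """Scale weights up for errors and down for hits, then normalise to sum to 1."""
    scaled = [
        w * math.exp(say if wrong else -say)
        for w, wrong in zip(weights, misclassified)
    ]
    total = sum(scaled)
    return [w / total for w in scaled]


n = 8
weights = [1 / n] * n  # every sample starts with the same weight
# Hypothetical: the stump got samples 2 and 5 wrong
misclassified = [False, False, True, False, False, True, False, False]

say = 0.5 * math.log((1 - 2 / 8) / (2 / 8))  # natural logarithm assumed
new_weights = update_weights(weights, misclassified, say)

print([round(w, 4) for w in new_weights])
```

Under these assumptions the two misclassified samples end up with a weight of 0.25 each, so together they carry half of the total weight for the next stump.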

4. How to resample the cases for the next stump?
<img src='https://raw.githubusercontent.com/COMBINE-SKKU/combine-skku/master/class/week6/Fig6-7.png' width="500"/>
Answer: Cases with a higher weight are resampled more often. By doing this, the training of AdaBoost is steered towards the previously misclassified samples.

<img src='https://raw.githubusercontent.com/COMBINE-SKKU/combine-skku/master/class/week6/Fig6-8.png' width="500"/>
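
A minimal sketch of the weighted bootstrap, using `random.choices` from the Python standard library. The weights are illustrative values (two samples at 0.25, six at 1/12), and the seed is fixed only to make the sketch repeatable.

```python
import random

rng = random.Random(0)  # fixed seed, only for repeatability of the sketch

# Illustrative weights after one AdaBoost round: samples 2 and 5 were misclassified
weights = [1 / 12, 1 / 12, 0.25, 1 / 12, 1 / 12, 0.25, 1 / 12, 1 / 12]
indices = list(range(len(weights)))

# Draw a new training set of the same size, with replacement, proportional to weight
resampled = rng.choices(indices, weights=weights, k=len(weights))
print(resampled)

# Over many draws, sample 2 should appear in roughly 25% of the positions
many = rng.choices(indices, weights=weights, k=10000)
share_of_sample_2 = many.count(2) / len(many)
print(share_of_sample_2)
```

Because sampling is done with replacement, heavily weighted samples can appear several times in the new training set, which is what pushes the next stump to classify them correctly.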

5. Repeat the entire procedure 1)-4) to build the next stumps until the total error rate reaches its minimum and stops changing.
<img src='https://raw.githubusercontent.com/COMBINE-SKKU/combine-skku/master/class/week6/Fig6-9.png' width="500"/>

# Building an Adaboost model in Python
---

In [43]:
# Load libraries
from sklearn.ensemble import AdaBoostClassifier
from sklearn import datasets

# Import train_test_split function
from sklearn.model_selection import train_test_split

#Import scikit-learn metrics module for accuracy calculation
from sklearn import metrics

In [44]:
# Load Iris data
# This dataset comprises 4 features 
# (sepal length, sepal width, petal length, 
# petal width) and a target (the type of flower).
iris = datasets.load_iris()
X = iris.data
y = iris.target

In [45]:
# Split dataset into training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3) # 70% training and 30% test

In [46]:
# Create AdaBoost classifier object
abc = AdaBoostClassifier(n_estimators=50, learning_rate=1)
# Learning rate shrinks the contribution of each classifier by learning_rate.
# There is a trade-off between learning_rate and n_estimators.

# Train AdaBoost classifier
model = abc.fit(X_train, y_train)

# Predict the response for test dataset
y_pred = model.predict(X_test)

In [47]:
# Model Accuracy, how often is the classifier correct?
print("Accuracy:",metrics.accuracy_score(y_test, y_pred)) 

Accuracy: 0.9333333333333333


# Using Different Base Learners
---

In [34]:
# Import Support Vector Classifier
from sklearn.svm import SVC

# Import scikit-learn metrics module for accuracy calculation
from sklearn import metrics

# probability=True enables predict_proba on the SVC
svc = SVC(probability=True, kernel='linear')

# Create AdaBoost classifier object
# Note: base_estimator matches the scikit-learn version used in this notebook (0.22.1);
# from scikit-learn 1.2 onward the parameter is named estimator.
abc = AdaBoostClassifier(n_estimators=50, base_estimator=svc, learning_rate=1)

# Train AdaBoost classifier
model = abc.fit(X_train, y_train)

# Predict the response for test dataset
y_pred = model.predict(X_test)

# Model Accuracy, how often is the classifier correct?
print("Accuracy:", metrics.accuracy_score(y_test, y_pred))

Accuracy: 0.9555555555555556


## Pros
---
AdaBoost is easy to implement. It iteratively corrects the mistakes of the weak classifier and improves accuracy by combining weak learners. Many different base classifiers can be used with AdaBoost. In experiments, AdaBoost is often found to be relatively resistant to overfitting, although there is no single agreed-upon explanation for this.

## Cons
---
AdaBoost is sensitive to noisy data. It is strongly affected by outliers because it tries to fit every point perfectly. AdaBoost is also slower than XGBoost.

# Building a Gradient Boosting model in Python
---
- Another very popular boosting algorithm
- Instead of tweaking the instance weights at every iteration like AdaBoost does, this method tries to fit the new predictor to the residual errors made by the previous predictor.
- Compared to AdaBoost, it also borrows ideas from bootstrap aggregation to further improve the models, such as randomly sampling the data as well as the features (like random forest) when fitting ensemble members.
- Models are fit using any arbitrary differentiable loss function and a <b>gradient descent optimization algorithm</b>. This gives the technique its name, “gradient boosting,” as the loss is minimized by following its gradient as the model is fit, much like a neural network.
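
The idea of fitting each new predictor to the residuals of the previous ones can be sketched in plain Python. This is a simplified illustration, not scikit-learn's implementation: it assumes squared-error loss (for which the negative gradient equals the residual), a single feature, and one-split regression stumps. The toy data and the helper names `fit_stump` and `gradient_boost` are hypothetical.

```python
def fit_stump(x, residuals):
    """Least-squares regression stump: one split, one constant per side."""
    best = None
    xs = sorted(set(x))
    for lo, hi in zip(xs, xs[1:]):
        t = (lo + hi) / 2
        left = [r for xi, r in zip(x, residuals) if xi < t]
        right = [r for xi, r in zip(x, residuals) if xi >= t]
        lmean = sum(left) / len(left)
        rmean = sum(right) / len(right)
        sse = sum((r - lmean) ** 2 for r in left) + sum((r - rmean) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, lmean, rmean)
    _, t, lmean, rmean = best
    return lambda xi: lmean if xi < t else rmean


def gradient_boost(x, y, n_estimators=50, learning_rate=0.1):
    """Return a predict function built from stumps fitted to successive residuals."""
    base = sum(y) / len(y)  # initial prediction: the mean of y
    pred = [base] * len(y)
    stumps = []
    for _ in range(n_estimators):
        # Residuals = negative gradient of the squared-error loss
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        stump = fit_stump(x, residuals)
        stumps.append(stump)
        # Add a shrunken copy of the new stump to the running prediction
        pred = [pi + learning_rate * stump(xi) for xi, pi in zip(x, pred)]

    def predict(xi):
        return base + learning_rate * sum(s(xi) for s in stumps)

    return predict


def mse(predict_fn, x, y):
    return sum((yi - predict_fn(xi)) ** 2 for xi, yi in zip(x, y)) / len(y)


# Hypothetical toy regression problem: y = x^2 on 20 points
x = [i / 10 for i in range(20)]
y = [xi ** 2 for xi in x]

model = gradient_boost(x, y)
mean_y = sum(y) / len(y)
mse_baseline = mse(lambda xi: mean_y, x, y)  # error of predicting the mean
mse_model = mse(model, x, y)                 # training error after boosting
print(mse_baseline, mse_model)
```

A smaller `learning_rate` makes each stump contribute less, so more estimators are needed to reach the same training error, which is the trade-off between `learning_rate` and `n_estimators`.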

In [39]:
# check scikit-learn version
import sklearn
print(sklearn.__version__)

# test regression dataset
from sklearn.datasets import make_regression

# define dataset
X, y = make_regression(n_samples=1000, n_features=20, n_informative=15, noise=0.1, random_state=7)

# summarize the dataset
print(X.shape, y.shape)

0.22.1
(1000, 20) (1000,)


In [40]:
# evaluate gradient boosting ensemble for regression
from numpy import mean
from numpy import std
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedKFold
from sklearn.ensemble import GradientBoostingRegressor

# define the model
model = GradientBoostingRegressor()

# define the evaluation procedure
cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)

# evaluate the model
n_scores = cross_val_score(model, X, y, scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1)

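# report performance
# note: 'neg_mean_absolute_error' returns negated MAE values (higher is better),
# so the mean printed below is negative; its magnitude is the MAE
print('MAE: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))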

MAE: -62.463 (3.233)


# How to set up the hyperparameters
---
- Gradient boosting can be challenging to configure because the algorithm has many key hyperparameters that influence the behavior of the model on training data, and the hyperparameters interact with each other.
- Popular search processes include a random search and a grid search.
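
To make the difference between the two search strategies concrete, the sketch below enumerates candidates with the standard library only; no model is trained. The grid values mirror the grid searched later in this notebook, while the random-search budget of 20 and the fixed seed are arbitrary assumptions.

```python
import itertools
import random

grid = {
    "n_estimators": [10, 50, 100, 500],
    "learning_rate": [0.0001, 0.001, 0.01, 0.1, 1.0],
    "subsample": [0.5, 0.7, 1.0],
    "max_depth": [3, 7, 9],
}

# Grid search: evaluate every combination of the listed values
names = list(grid)
all_candidates = [
    dict(zip(names, combo)) for combo in itertools.product(*grid.values())
]
print(len(all_candidates))  # 4 * 5 * 3 * 3 combinations

# Random search: evaluate only a fixed budget of randomly drawn combinations
rng = random.Random(0)  # fixed seed, only for repeatability of the sketch
budget = 20             # arbitrary budget chosen for illustration
random_candidates = [
    {name: rng.choice(values) for name, values in grid.items()}
    for _ in range(budget)
]
print(random_candidates[0])
```

With 180 grid candidates and the 10-fold, 3-repeat cross-validation used in this notebook, a full grid search needs about 180 × 30 = 5400 model fits, which is why a random search with a fixed budget is often considered when the grid grows.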

In [41]:
# example of grid searching key hyperparameters for gradient boosting on a classification dataset
from sklearn.datasets import make_classification
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import GradientBoostingClassifier

# define dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=7)

# define the model with default hyperparameters
model = GradientBoostingClassifier()

In [42]:
# define the grid of values to search
grid = dict()
grid['n_estimators'] = [10, 50, 100, 500]
grid['learning_rate'] = [0.0001, 0.001, 0.01, 0.1, 1.0]
grid['subsample'] = [0.5, 0.7, 1.0]
grid['max_depth'] = [3, 7, 9]

# define the evaluation procedure
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)

# define the grid search procedure
grid_search = GridSearchCV(estimator=model, param_grid=grid, n_jobs=-1, cv=cv, scoring='accuracy')

# execute the grid search
grid_result = grid_search.fit(X, y)

# summarize the best score and configuration
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))

# summarize all scores that were evaluated
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean_score, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean_score, stdev, param))

Best: 0.938000 using {'learning_rate': 0.1, 'max_depth': 7, 'n_estimators': 500, 'subsample': 0.7}
0.531333 (0.095070) with: {'learning_rate': 0.0001, 'max_depth': 3, 'n_estimators': 10, 'subsample': 0.5}
0.525333 (0.077060) with: {'learning_rate': 0.0001, 'max_depth': 3, 'n_estimators': 10, 'subsample': 0.7}
0.524000 (0.072874) with: {'learning_rate': 0.0001, 'max_depth': 3, 'n_estimators': 10, 'subsample': 1.0}
0.771667 (0.032154) with: {'learning_rate': 0.0001, 'max_depth': 3, 'n_estimators': 50, 'subsample': 0.5}
0.772333 (0.038874) with: {'learning_rate': 0.0001, 'max_depth': 3, 'n_estimators': 50, 'subsample': 0.7}
0.738667 (0.049982) with: {'learning_rate': 0.0001, 'max_depth': 3, 'n_estimators': 50, 'subsample': 1.0}
0.827000 (0.031953) with: {'learning_rate': 0.0001, 'max_depth': 3, 'n_estimators': 100, 'subsample': 0.5}
0.814000 (0.037292) with: {'learning_rate': 0.0001, 'max_depth': 3, 'n_estimators': 100, 'subsample': 0.7}
0.761000 (0.043077) with: {'learning_rate': 0.0001,