# Modern Machine Learning - Lecture September 19th, 2019

Lecturer: Zhirong Yang

## Supervised Learning

* We will focus on multiple real covariates and single real responses


## Ensemble Learning (EL)

* State-of-the-art
* AdaBoost
* Random Forest (try this first)
* XGBoost
* CatBoost/LightGBM (quite new)
* Deep Forest (quite new)

What is ensemble learning?

* Weighted average of "weak learners"
* For binary classification we can use sgn(sum(weak learners))
* Multi-class classification can use majority vote
* In practice, almost all Kaggle winners use ensemble learning
* Ensemble learning can decrease bias, e.g. by boosting
* Ensemble learning can decrease variance, e.g. by bagging
* Ensemble learning can make an overall improvement, e.g. by stacking
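
The two combination rules above can be sketched in a few lines of plain Python. The function names, the weights and the example votes are made up for illustration; breaking ties toward +1 is an arbitrary choice, not something stated in the lecture.

```python
from collections import Counter


def weighted_sign_vote(predictions, weights):
    """Binary ensemble: predictions are +1/-1, output is sgn of the weighted sum."""
    total = sum(w * p for w, p in zip(weights, predictions))
    return 1 if total >= 0 else -1  # ties go to +1 (arbitrary choice)


def majority_vote(predictions):
    """Multi-class ensemble: the most frequent predicted label wins."""
    return Counter(predictions).most_common(1)[0][0]


# Three hypothetical weak learners vote on a single sample.
binary_result = weighted_sign_vote([+1, -1, -1], weights=[0.9, 0.3, 0.4])
multiclass_result = majority_vote(["cat", "dog", "cat"])
print(binary_result, multiclass_result)
```

Note that the heavily weighted first learner outvotes the other two in the binary case, which a plain majority vote would not allow.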

How to create different learners?

* Different learning algorithms
* Different hyperparameters
* Different representations
* Different training sets
* Artificial noise added to the data
* Random samples from posterior of the model parameters (instead of finding the maximum)

How to combine base learners?

**Boosting**

Boosting builds an ensemble incrementally: each new model is trained to emphasize the training instances that the previous models misclassified, so each model corrects the mistakes of its predecessors.

**Bagging**

Construct many base models at the same time. The collection of base models votes on the final decision. The base models differ in the datasets they have access to.
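
A minimal from-scratch sketch of bagging, assuming a one-dimensional toy data set and a threshold "stump" as the weak learner. Both are illustrative choices, as are the names `fit_stump`, `bagging_fit` and `bagging_predict`. Each model sees its own bootstrap sample (drawn with replacement), and the models vote.

```python
import random
from collections import Counter


def fit_stump(xs, ys):
    """Hypothetical weak learner: best single threshold on 1-D inputs, labels 0/1."""
    best = None
    for t in sorted(set(xs)):
        for left, right in ((0, 1), (1, 0)):
            errors = sum((left if x <= t else right) != y for x, y in zip(xs, ys))
            if best is None or errors < best[0]:
                best = (errors, t, left, right)
    _, t, left, right = best
    return lambda x: left if x <= t else right


def bagging_fit(xs, ys, n_models=25, seed=0):
    rng = random.Random(seed)
    n = len(xs)
    models = []
    for _ in range(n_models):
        # Bootstrap sample: n indices drawn with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        models.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return models


def bagging_predict(models, x):
    """Majority vote over all base models."""
    return Counter(m(x) for m in models).most_common(1)[0][0]


xs = [0.1, 0.4, 0.5, 0.9, 1.2, 1.5, 1.9, 2.3]
ys = [0, 0, 0, 0, 1, 1, 1, 1]
models = bagging_fit(xs, ys)
print(bagging_predict(models, 0.2), bagging_predict(models, 2.0))
```

Because the models are independent of each other, the loop in `bagging_fit` could be run in parallel.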

**Stacking**

Use the predictions of a set of base learners as features in a higher-order model, a so-called "meta learner".

## AdaBoost

AdaBoost, invented in 1995, was the first practical boosting algorithm. It constructs a set of base learners sequentially. Each sample has a weight, and the weights of misclassified samples are increased from round to round.
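
The re-weighting step can be illustrated with one round of the standard discrete AdaBoost update for labels in {-1, +1}. This is a textbook sketch, not scikit-learn's implementation; the function name `adaboost_round` and the toy predictions are assumptions for illustration.

```python
import math


def adaboost_round(weights, y_true, y_pred):
    """One AdaBoost round: return the learner weight alpha and new sample weights."""
    # Weighted error of the current base learner.
    err = sum(w for w, y, p in zip(weights, y_true, y_pred) if y != p) / sum(weights)
    # Assumes 0 < err < 1; a perfect or useless learner needs special handling.
    alpha = 0.5 * math.log((1.0 - err) / err)
    # y * p is -1 for a mistake, so mistakes are scaled up and hits scaled down.
    new_w = [w * math.exp(-alpha * y * p) for w, y, p in zip(weights, y_true, y_pred)]
    total = sum(new_w)
    return alpha, [w / total for w in new_w]


y_true = [+1, +1, -1, -1]
y_pred = [+1, -1, -1, -1]      # hypothetical stump output: the second sample is wrong
weights = [0.25] * 4           # start from uniform weights
alpha, weights = adaboost_round(weights, y_true, y_pred)
print(alpha, weights)
```

After the update the single misclassified sample carries half of the total weight, so the next base learner is pushed to get it right.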

In [3]:
import pandas
from sklearn import model_selection
from sklearn.ensemble import AdaBoostClassifier

In [4]:
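# No header row in this file: the first record becomes the column names unless header=None is passed.
pandas.read_csv("pima-indians-diabetes.csv")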

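| | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 1 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 2 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 3 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |
| 4 | 5 | 116 | 74 | 0 | 0 | 25.6 | 0.201 | 30 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 762 | 10 | 101 | 76 | 48 | 180 | 32.9 | 0.171 | 63 | 0 |
| 763 | 2 | 122 | 70 | 27 | 0 | 36.8 | 0.340 | 27 | 0 |
| 764 | 5 | 121 | 72 | 23 | 112 | 26.2 | 0.245 | 30 | 0 |
| 765 | 1 | 126 | 60 | 0 | 0 | 30.1 | 0.349 | 47 | 1 |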


AdaBoost is sensitive to outliers, and is slower than XGBoost.

**Most popular type of base learners**

* Logistic regression
* Naive Bayes Classifier
* SVM
* Decision Tree (most popular and should be tried first)

Decision trees are highly accurate and easy to use. They are invariant to input scale, and get good performance with little tuning.

**Random Forest**

* AdaBoost creates a forest sequentially
* But can we create the ensemble in parallel? Not by bagging on training subsets alone, because the trees would end up with similar splits. Random forest therefore also restricts each split to a random subset of the features.
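
One way to picture how a random forest keeps its trees from converging on the same splits: every split only gets to look at a random subset of the features. The helper `candidate_features` below is a hypothetical name and the numbers are arbitrary; it only shows the sampling step, not tree building.

```python
import random


def candidate_features(n_features, max_features, rng):
    """Indices of the features a single split is allowed to consider."""
    return sorted(rng.sample(range(n_features), max_features))


rng = random.Random(0)
for node in range(3):
    # Each node draws its own subset, so different trees rarely agree on every split.
    print(node, candidate_features(n_features=8, max_features=3, rng=rng))
```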

In [7]:
from sklearn.ensemble import RandomForestClassifier
iris = pandas.read_csv("iris.csv")
iris

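| | sepal.length | sepal.width | petal.length | petal.width | variety |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | Setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | Setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | Setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | Setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | Setosa |
| ... | ... | ... | ... | ... | ... |
| 145 | 6.7 | 3.0 | 5.2 | 2.3 | Virginica |
| 146 | 6.3 | 2.5 | 5.0 | 1.9 | Virginica |
| 147 | 6.5 | 3.0 | 5.2 | 2.0 | Virginica |
| 148 | 6.2 | 3.4 | 5.4 | 2.3 | Virginica |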


## Obtain feature importance

* Weight or split - The number of times a feature is used to split the data across all trees
* Gain - Average gain of the feature when it is used in trees
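
Both importance measures can be computed from a log of the splits made across all trees. The sketch below assumes such a log is available as `(feature, gain)` pairs; the log format, the feature names and the function name `importance` are hypothetical.

```python
from collections import defaultdict


def importance(splits):
    """splits: iterable of (feature_name, gain), one entry per split over all trees."""
    count = defaultdict(int)
    total_gain = defaultdict(float)
    for feature, split_gain in splits:
        count[feature] += 1
        total_gain[feature] += split_gain
    weight = dict(count)                                       # "weight"/"split"
    avg_gain = {f: total_gain[f] / count[f] for f in count}    # "gain"
    return weight, avg_gain


# Hypothetical split log from a small forest.
splits = [("age", 4.0), ("bmi", 1.0), ("age", 2.0), ("glucose", 9.0)]
weight, gain = importance(splits)
print(weight)
print(gain)
```

In this toy log `age` is used most often, but `glucose` has the highest average gain, so the two measures can rank features differently.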

## Tuning hyperparameters

* Every software model has tunable parameters
* A better score on the project requires parameter tuning
* Cross-validation is used for this purpose. Part of the training set is held out as a so-called validation set, which mimics the test set.
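
A minimal sketch of how k-fold cross-validation splits the training set, written in plain Python rather than with `sklearn.model_selection`. The function name `k_fold_indices` and the sizes are illustrative assumptions.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, val_idx) pairs; every sample is validated exactly once."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        val = folds[i]
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, val


for train_idx, val_idx in k_fold_indices(n_samples=10, k=5):
    # Fit on train_idx, score on val_idx, then average the k scores.
    print(sorted(val_idx), len(train_idx))
```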

# Lecture hour 2

* Regression trees have real numbers in the leaf nodes

In [8]:
from sklearn.ensemble import RandomForestRegressor

Tunable parameters:

* bootstrap - whether each tree is built on a bootstrap sample (drawn with replacement) or on the whole data set
* max_depth - max number of levels in each decision tree
* max_features - max number of features considered for splitting a node
* min_samples_leaf - min data points allowed in leaf node
* min_samples_split - min number of data points required in a node before it can be split
* n_estimators - number of trees in forest

An exhaustive grid search quickly becomes expensive. We can instead use:

1. A random grid search
2. A coarse grid search
3. A fine grid search

**We should do this in the project.**
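
The random stage of such a search might look like the sketch below. The hyperparameter names come from the list above, but the candidate values are illustrative, and `toy_score` is a stand-in for a real cross-validated score (it does not train anything). The full grid here has 4 · 4 · 2 · 3 · 3 · 2 = 576 combinations, while the random search evaluates only 20 of them.

```python
import random

# Candidate values per hyperparameter (illustrative, not recommended settings).
grid = {
    "n_estimators": [100, 200, 400, 800],
    "max_depth": [5, 10, 20, None],
    "max_features": ["sqrt", "log2"],
    "min_samples_leaf": [1, 2, 4],
    "min_samples_split": [2, 5, 10],
    "bootstrap": [True, False],
}


def random_search(grid, score_fn, n_iter=20, seed=0):
    """Evaluate n_iter random grid points and return the best (score, params)."""
    rng = random.Random(seed)
    best = None
    for _ in range(n_iter):
        params = {name: rng.choice(values) for name, values in grid.items()}
        score = score_fn(params)
        if best is None or score > best[0]:
            best = (score, params)
    return best


def toy_score(params):
    """Placeholder for a cross-validated score; NOT a real model evaluation."""
    return -abs(params["n_estimators"] - 400) - 10 * params["min_samples_leaf"]


best_score, best_params = random_search(grid, toy_score)
print(best_score, best_params)
```

The best random point can then seed a coarse grid around it, followed by a fine grid around the coarse optimum.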

Read [graduate student descent](https://sciencedryad.wordpress.com/2014/01/25/grad-student-descent/).

Random forest advantages:

* No need for feature normalization
* Parallel training
* Widely used
* Performs reasonably well with default parameters (quite an important advantage)
* Random forest is often the first choice for prototyping

How to get from 99% to 99.9%?

* Gradient Tree Boosting (1999)
* Gradient Tree Boosting with Regularization
    * Regularized Greedy Forest
    * XGBoost
    * LightGBM
    * One more
    
**What is XGBoost?**

* It is a software package, not only a model
* Regularized objective for better model
* Out of core computing
* Cache optimization
* Distributed computing
* Sparse aware algorithm
* Weighted approximate quantile sketch
* More than half of the teams on Kaggle use XGBoost
* Many industrial applications use XGBoost or its variants

In [None]:
from matplotlib import pyplot as plt

# We should all install xgboost
import xgboost as xgb

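# Wrap the training and test data; the arguments are elided in the lecture.
dtrain = xgb.DMatrix(...)
dtest = xgb.DMatrix(...)

...

# param (booster parameters) and num_round (number of boosting rounds)
# are assumed to be set in the elided part above.
model = xgb.train(param, dtrain, num_round)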

xgb.plot_importance(model, max_num_features=15)
plt.show()

xgb.plot_tree(model)
plt.show()

* XGBoost learns the best direction for missing values! Each split gets a default direction, which is found during training.
* It offers a speedup for sparse data, such as categorical encodings and other cases, for example bag of words.
* Out-of-core (disk-based) runs, approximate algorithm for split finding
* Can customize the loss function
* Can enable early-stopping
* Can set checkpoints
* Etc.
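
The idea behind the learned default direction in the first bullet can be sketched as follows: for a given split, try sending the samples with a missing feature value to each side and keep the side with the lower loss. XGBoost scores splits with gradient statistics; this sketch swaps in plain squared error around the leaf mean to stay short, and the function names are hypothetical.

```python
def sse(values):
    """Sum of squared errors around the mean (0 for an empty leaf)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def default_direction(xs, ys, threshold):
    """Pick the side for missing x (None) that gives the lower total squared error."""
    left = [y for x, y in zip(xs, ys) if x is not None and x < threshold]
    right = [y for x, y in zip(xs, ys) if x is not None and x >= threshold]
    missing = [y for x, y in zip(xs, ys) if x is None]
    loss_if_left = sse(left + missing) + sse(right)
    loss_if_right = sse(left) + sse(right + missing)
    return "left" if loss_if_left <= loss_if_right else "right"


xs = [1.0, 2.0, None, 8.0, 9.0, None]
ys = [0.1, 0.2, 0.9, 1.0, 1.1, 1.0]
direction = default_direction(xs, ys, threshold=5.0)
print(direction)
```

Here the samples with missing values have responses close to those on the right side, so "right" becomes the default direction for this split.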

## Parameter tuning in XGBoost

* The default settings usually do not work well, and tuning XGBoost is tricky
* There are quite a lot of parameters, way too many actually
* The complete lists can be found by searching "XGBoost parameters"
* "complete guide parameter tuning xgboost with codes python" on analyticsvidhya
    * First choose a relatively high learning rate
    * Fix the learning rate
    * Tune tree-specific parameters
    * More...
* Bayesian optimization can be used for parameter tuning. Gradient descent can't be used since we don't have the derivative. Read the philipperemy visualization blog post on github.io.
* There is also a Kaggle kernel by "nanomathias" in the slides
* **Remember to use stratified cross-validation!**
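
Stratified cross-validation keeps the class proportions roughly the same in every fold, which matters for imbalanced data. A plain-Python sketch of the idea, with a hypothetical function name and toy labels, could look like this:

```python
import random
from collections import defaultdict


def stratified_k_fold(labels, k, seed=0):
    """Return k validation folds with class proportions close to the full data set."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal each class out round-robin so every fold gets its share.
        for pos, i in enumerate(indices):
            folds[pos % k].append(i)
    return folds


labels = [0] * 8 + [1] * 4          # imbalanced toy labels
folds = stratified_k_fold(labels, k=4)
for fold in folds:
    print(sorted(fold), [labels[i] for i in sorted(fold)])
```

Every fold ends up with two samples of class 0 and one of class 1, matching the 2:1 ratio of the full set.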

## LightGBM (frontier)

* GBDT requires all data in each boosting step, and this is too expensive for large data sets
* Previous workarounds:
    * uniform downsampling to get a subset for training, but this loses information.
    * XGBoost uses histograms, which bin the feature values and thereby drop some information
* LightGBM tries to downsample in a smarter way
* AdaBoost has sample weights, but GBDT has no weights that can be used
* In LightGBM, the leaf with the highest gradient/error is the one that is grown further
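
The "smarter" downsampling is presumably the gradient-based one-side sampling (GOSS) described by the LightGBM authors; the lecture does not name it, so treat that link as an assumption. The idea is to keep all instances with large gradients, sample a fraction of the rest, and up-weight the sampled ones to compensate. The function name, rates and gradients below are illustrative.

```python
import random


def goss_sample(gradients, top_rate=0.2, other_rate=0.1, seed=0):
    """Return {index: weight} for the instances kept in one boosting step."""
    rng = random.Random(seed)
    n = len(gradients)
    order = sorted(range(n), key=lambda i: abs(gradients[i]), reverse=True)
    n_top = int(top_rate * n)
    n_other = int(other_rate * n)
    top, rest = order[:n_top], order[n_top:]
    sampled = rng.sample(rest, n_other)
    # Up-weight the sampled small-gradient instances to compensate for the ones dropped.
    scale = (1.0 - top_rate) / other_rate
    kept = {i: 1.0 for i in top}
    kept.update({i: scale for i in sampled})
    return kept


gradients = [0.05, -2.0, 0.1, 1.5, -0.02, 0.3, -0.07, 0.01, 0.9, -0.04]
kept = goss_sample(gradients)
print(kept)
```

The gradient magnitude plays the role that the sample weight plays in AdaBoost: it tells which instances are still poorly fitted.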

## CatBoost (frontier)

* Overcomes target leakage in categorical features and in boosting. Target leakage is when a feature value (for example the target mean of a category) is computed using the response of the same sample, which theoretically leads to a biased fit. I did not fully understand this, must read up on the concept...
* We usually apply one-hot encoding before using XGBoost, but it becomes problematic when the number of categories is large.
* Uses "ordered target statistics" (TS) instead.
    * First it shuffles the data (rows); the TS of each data point is then calculated using only the responses of the rows that come before it
    * Proven to have no target leakage
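
A simplified sketch of ordered target statistics, not CatBoost's actual implementation: rows are visited in a random order, and each row is encoded from the targets of earlier rows with the same category, smoothed toward a prior. The parameters `prior` and `prior_weight` and the function name are assumptions for illustration.

```python
import random


def ordered_target_statistics(categories, targets, prior=0.5, prior_weight=1.0, seed=0):
    """Encode each row from the targets of rows that precede it in a random order."""
    order = list(range(len(categories)))
    random.Random(seed).shuffle(order)
    sums, counts = {}, {}
    encoded = [0.0] * len(categories)
    for i in order:
        c = categories[i]
        s, k = sums.get(c, 0.0), counts.get(c, 0)
        # Only earlier rows contribute; row i's own target is added afterwards.
        encoded[i] = (s + prior_weight * prior) / (k + prior_weight)
        sums[c] = s + targets[i]
        counts[c] = k + 1
    return encoded


cats = ["a", "a", "b", "a", "b"]
ys = [1, 0, 1, 1, 0]
print(ordered_target_statistics(cats, ys))
```

Because a row's encoding never depends on its own response, flipping that response leaves the row's encoded value unchanged, which is the sense in which there is no leakage.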

In [None]:
import lightgbm as lgb

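# silent=False prints training messages; newer LightGBM releases may expect the verbosity parameter instead.
lg = lgb.LGBMClassifier(silent=False)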
lgb.Dataset(...)

import catboost as cb
cb_classifier = cb.CatBoostClassifier()

* CatBoost can perform better and can be trained faster

## Choice of stacking bases

This is the priority order for the project:

1. XGBoost, LightGBM, CatBoost
2. Random forest, AdaBoost, KNN
3. DNN, LogReg, SVM, Gaussian Process (for regression)
4. Fisher's linear discriminant, Naive Bayes

## Exercise

* Optical digits (Optical Recognition of Handwritten Digits Data Set - archive.ics.uci.edu)
* Optional: IJCNN (csie.ntu.edu.tw)

## Practical information

* The last lecture has been changed and the project deadline has changed