![Image of Venturenix](http://www.venturenix.com//assets/images/venture-nix-logo.png)

# Ensemble
AnthonyLo@VenturenixLab [2018]


In [1]:
from itertools import product, combinations, permutations
import random
import math

from sklearn import datasets
import numpy as np
import pandas as pd
from scipy import stats

import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier

from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.ensemble import AdaBoostClassifier

## "The wisdom of the masses exceeds that of the wisest individual."

**Ensemble methods use multiple learning algorithms to obtain better predictive performance than could be obtained from any of the constituent learning algorithms alone**
(https://en.wikipedia.org/wiki/Ensemble_learning#Bootstrap_aggregating_(bagging))
- **Regression:** Take the average result from multiple regression models
- **Classification:** Take the majority vote or average probability

**Two assumptions:**

- **Accurate:** Each model has to be better than random guessing
- **Independent:** The models' errors are independent of each other
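
To see why the two assumptions matter, the sketch below computes the accuracy of a majority vote under simplified conditions that are assumptions of this example, not of the notes: a binary task, an odd number of models, and every model being correct with the same probability `p`, independently of the others. The function name `majority_vote_accuracy` is made up for illustration.

```python
from math import comb


def majority_vote_accuracy(p: float, n_models: int) -> float:
    """Probability that a majority of n independent models is correct.

    Assumes a binary task, an odd number of models (no ties), and that
    every model is correct with the same probability p, independently.
    """
    if n_models % 2 == 0:
        raise ValueError("use an odd number of models to avoid ties")
    majority = n_models // 2 + 1
    # Sum the binomial probabilities of k correct models, k >= majority.
    return sum(
        comb(n_models, k) * p**k * (1 - p) ** (n_models - k)
        for k in range(majority, n_models + 1)
    )


for n in (1, 5, 25):
    print(n, round(majority_vote_accuracy(0.6, n), 3))
```

With `p = 0.6` the ensemble accuracy grows as models are added; with `p` below 0.5 the same formula shows the vote getting worse, which is why each model must beat random guessing.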



## Voting
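
Voting combines the predictions of several classifiers, either by majority vote on the predicted labels ("hard" voting) or by averaging the predicted probabilities ("soft" voting). The following is a minimal plain-Python sketch; the helper names `hard_vote` and `soft_vote` and the three classifier outputs are hypothetical.

```python
from collections import Counter


def hard_vote(predictions):
    """Return the most common class label among the models' predictions."""
    return Counter(predictions).most_common(1)[0][0]


def soft_vote(probabilities):
    """Average per-class probabilities across models and pick the largest.

    `probabilities` holds one dict per model: {label: probability}.
    Returns (winning label, averaged probabilities).
    """
    labels = probabilities[0].keys()
    averaged = {
        label: sum(p[label] for p in probabilities) / len(probabilities)
        for label in labels
    }
    return max(averaged, key=averaged.get), averaged


# Hypothetical outputs of three classifiers for a single sample.
labels = ["cat", "dog", "dog"]
probas = [
    {"cat": 0.9, "dog": 0.1},
    {"cat": 0.4, "dog": 0.6},
    {"cat": 0.45, "dog": 0.55},
]

print("hard vote:", hard_vote(labels))
print("soft vote:", soft_vote(probas)[0])
```

The two schemes can disagree: here two of three models pick "dog", but the one confident "cat" model dominates the averaged probabilities. In scikit-learn the same idea is available as `sklearn.ensemble.VotingClassifier` with `voting='hard'` or `voting='soft'`.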


## Bagging (Bootstrap aggregating)

- Improves stability and accuracy
- Reduces variance and helps avoid overfitting
- A special case of model averaging

![](File_Ozone.png)
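
The "bootstrap" part of bagging means that each model is trained on a sample drawn with replacement from the training set, and the "aggregating" part means that the results are averaged. The sketch below shows only that resampling and averaging step on toy data, using the mean as a stand-in for a trained model; `bootstrap_sample` and `bagged_estimate` are hypothetical helper names.

```python
import random
from statistics import mean


def bootstrap_sample(data, rng):
    """Draw len(data) items from data uniformly at random, with replacement."""
    return [rng.choice(data) for _ in data]


def bagged_estimate(data, estimator, n_rounds, rng):
    """Average an estimator over n_rounds bootstrap samples."""
    return mean(estimator(bootstrap_sample(data, rng)) for _ in range(n_rounds))


rng = random.Random(0)      # fixed seed so the run is repeatable
data = list(range(1000))    # toy data set: the integers 0..999

sample = bootstrap_sample(data, rng)
unique_fraction = len(set(sample)) / len(data)
print(f"unique items in one bootstrap sample: {unique_fraction:.1%}")

bagged_mean = bagged_estimate(data, mean, n_rounds=50, rng=rng)
print("bagged mean:", bagged_mean)
```

Because sampling is with replacement, each bootstrap sample contains roughly 63% of the distinct original items, so every model sees a slightly different data set. scikit-learn's `BaggingClassifier`, imported at the top of this notebook, applies the same idea to real estimators.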

## Boosting
- emphasizes the training instances that previous models misclassified
- often yields better accuracy than bagging
- tends to be more likely to overfit the training data
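
One concrete way to "emphasize" misclassified instances is the AdaBoost weight update. AdaBoost is used here as an example of boosting; the notes do not say which boosting algorithm they have in mind. The sketch performs a single round for a binary classifier, and the function name `adaboost_reweight` is hypothetical.

```python
import math


def adaboost_reweight(weights, is_correct):
    """One AdaBoost-style weight update for a binary classifier.

    weights    -- current instance weights, summing to 1
    is_correct -- booleans, True where the current model was right
    Returns (alpha, new_weights): the model's vote weight and the
    renormalized instance weights for the next round.
    """
    error = sum(w for w, ok in zip(weights, is_correct) if not ok)
    if not 0 < error < 0.5:
        raise ValueError("weighted error must be in (0, 0.5)")
    alpha = 0.5 * math.log((1 - error) / error)
    # Shrink the weights of correct instances, grow the misclassified ones.
    raw = [
        w * math.exp(-alpha if ok else alpha)
        for w, ok in zip(weights, is_correct)
    ]
    total = sum(raw)
    return alpha, [w / total for w in raw]


weights = [0.2] * 5                          # five instances, uniform weights
is_correct = [True, True, True, True, False]

alpha, new_weights = adaboost_reweight(weights, is_correct)
print(round(alpha, 3), [round(w, 3) for w in new_weights])
```

After the update, the single misclassified instance carries half of the total weight, so the next model is pushed to get it right.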

## Decision Tree (the building block)
![](dt1.png)



**What is Decision Tree?**



*   each internal node represents a test on a feature
*   each branch represents a decision
*   each leaf represents an outcome (a categorical or continuous value)


<img src="https://cdn-images-1.medium.com/max/660/1*XMId5sJqPtm8-RIwVVz2tg.png" width="400">


**How do we make a decision? / What is the learning algorithm?**

* Select the feature that best classifies the training data and make it the root of the tree. Repeat this process for each branch.


* This means we are performing a top-down, greedy search through the space of possible decision trees (**Recursive Binary Splitting**).

* Greedy algorithm: each split is the best one available at that step, and earlier splits are never revisited. In ID3-style trees, a categorical feature that has already been used on a path is not tried again on that path.


**How do we select the best variable for splitting?**

*  Select the feature that maximizes the information gain

> Equation: $IG(T,a) = H(T) - H(T|a)$

> where $H(T|a)$ is the conditional entropy of $T$ given the value of attribute $a$

* **Information Gain = Entropy before split - Entropy after split**
* You might also see it written as: **Information Gain = Entropy(parent) - [weighted average] * Entropy(children)**
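
The formula above can be sketched in a few lines of Python for categorical features. The toy rows, the feature names `outlook` and `windy`, and the helper names are all made up for illustration; this is not how scikit-learn implements its trees.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, labels, feature):
    """IG(T, a) = H(T) - H(T|a) for a categorical feature `a`."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[feature], []).append(label)
    # H(T|a): entropy of each child, weighted by its share of the rows.
    conditional = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


# Hypothetical toy data: should we play outside?
rows = [
    {"outlook": "sunny", "windy": "no"},
    {"outlook": "sunny", "windy": "yes"},
    {"outlook": "rainy", "windy": "no"},
    {"outlook": "rainy", "windy": "yes"},
]
labels = ["yes", "yes", "no", "no"]

gains = {f: information_gain(rows, labels, f) for f in ("outlook", "windy")}
best = max(gains, key=gains.get)
print(gains, "-> split on", best)
```

Here `outlook` separates the classes perfectly (gain of 1 bit) while `windy` tells us nothing (gain of 0), so a greedy learner would split on `outlook` first.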

**What is Entropy?**

For a binary class:
> $Entropy = \sum_i - p_i \log_2 p_i$
* $p_i$ is the probability of class $i$
* If all examples belong to the same class, $entropy = - 1 \log_2 1 = 0$ (minimum impurity)
* If 50% of the examples are in each class, $entropy = - 0.5 \log_2 0.5 - 0.5 \log_2 0.5 = 1$ (maximum impurity)
* When a split separates the classes better, the resulting nodes have lower entropy.
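
The two extreme cases above can be checked numerically. This is a small sketch for the two-class case only; `binary_entropy` is a hypothetical helper name.

```python
from math import log2


def binary_entropy(p):
    """Entropy in bits of a two-class node where one class has probability p."""
    # By convention 0 * log2(0) is treated as 0, so skip zero probabilities.
    return sum(-q * log2(q) for q in (p, 1 - p) if q > 0)


for p in (0.0, 0.1, 0.3, 0.5, 0.7, 0.9, 1.0):
    print(f"p = {p:.1f}  entropy = {binary_entropy(p):.3f}")
```

The values rise from 0 at `p = 0` to a maximum of 1 at `p = 0.5` and fall back to 0 at `p = 1`, symmetric around the middle.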

https://homes.cs.washington.edu/~shapiro/EE596/notes/InfoGain.pdf


Advantages of Decision Trees:
* Easy to interpret
* No need to normalize data

Disadvantage:

*   More likely to overfit as the tree gets deeper.

**Pruning**: removing or limiting branches, for example by capping the tree depth, to reduce overfitting in decision trees.
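
As a rough illustration of depth limiting, the sketch below fits an unrestricted tree and a depth-capped tree with scikit-learn. The Iris data set, the 70/30 split, and `max_depth=2` are arbitrary choices for this example, and the exact scores depend on the split.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# An unrestricted tree keeps splitting until its leaves are pure.
deep_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Capping the depth stops growth early (a form of pre-pruning).
shallow_tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(
    X_train, y_train
)

for name, tree in [("deep", deep_tree), ("shallow", shallow_tree)]:
    print(
        f"{name}: depth={tree.get_depth()}, "
        f"train acc={tree.score(X_train, y_train):.3f}, "
        f"test acc={tree.score(X_test, y_test):.3f}"
    )
```

A large gap between training and test accuracy for the deep tree is the usual sign of overfitting. scikit-learn also offers cost-complexity post-pruning through the `ccp_alpha` parameter, which is not shown here.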

**Split by purity**

$Entropy = - p(a) \log_2 p(a) - p(b) \log_2 p(b)$

![](ent.png)

## Random Forest (Bagging)
![](rf1.png)

- robust to correlated predictors
- can solve both regression and classification problems
- can handle thousands of input variables
- can be used as a feature selection tool
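
A short sketch of a random forest used both as a classifier and as a feature ranking tool. The Iris data set, `n_estimators=100`, and 5-fold cross-validation are arbitrary choices for this example; `feature_importances_` is scikit-learn's impurity-based importance.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

iris = load_iris()
X, y = iris.data, iris.target

forest = RandomForestClassifier(n_estimators=100, random_state=0)

# 5-fold cross-validated accuracy of the forest.
scores = cross_val_score(forest, X, y, cv=5)
print(f"mean CV accuracy: {scores.mean():.3f}")

# Fit on all data and rank features by impurity-based importance.
forest.fit(X, y)
ranked = sorted(
    zip(iris.feature_names, forest.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, importance in ranked:
    print(f"{name}: {importance:.3f}")
```

The importances sum to 1, so low-ranked features are candidates for removal. Treat the ranking as a heuristic: impurity-based importances can be misleading when features are strongly correlated.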


Sklearn: http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html

## XGBoost (Boosting)
Installation:
- Mac: `conda install py-xgboost`
- Windows: download the wheel from https://www.lfd.uci.edu/~gohlke/pythonlibs/#xgboost, then run `pip install xxx.whl`

XGBoost: https://xgboost.readthedocs.io/en/latest/
https://xgboost.readthedocs.io/en/latest/python/python_api.html
http://xgboost.readthedocs.io/en/latest/parameter.html
![](xgb1.png)

- A very common and powerful library in Kaggle competitions
- Supports regularization
- Supports parallel processing
- Supports custom optimization objectives and evaluation criteria
- Handles missing values


## Project: Digit Recognizer (MNIST)

https://www.kaggle.com/c/digit-recognizer/
![](1.png)
![](2.png)

## XGBoost

In [5]:
import xgboost as xgb

In [4]:
!conda install py-xgboost 

Collecting package metadata (current_repodata.json): done
Solving environment: done

## Package Plan ##

  environment location: /opt/anaconda3

  added / updated specs:
    - py-xgboost


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    _py-xgboost-mutex-2.0      |            cpu_0           8 KB
    conda-4.8.4                |           py37_0         2.9 MB
    libxgboost-0.90            |       h0a44026_1         1.2 MB
    py-xgboost-0.90            |   py37h0a44026_1          75 KB
    ------------------------------------------------------------
                                           Total:         4.1 MB

The following NEW packages will be INSTALLED:

  _py-xgboost-mutex  pkgs/main/osx-64::_py-xgboost-mutex-2.0-cpu_0
  libxgboost         pkgs/main/osx-64::libxgboost-0.90-h0a44026_1
  py-xgboost         pkgs/main/osx-64::py-xgboost-0.90-py37h0a44026_1

The following packages wi

## L1 and L2 regularization
https://www.kaggle.com/residentmario/l1-norms-versus-l2-norms