## [Decision Tree](https://en.wikipedia.org/wiki/Decision_tree_learning) 
---
**Elo notes**

Decision tree learning uses a decision tree (as a predictive model) to go from observations about an item (represented in the branches) to conclusions about the item's target value (represented in the leaves). 

Tree models where the target variable can take a **discrete set of values** are called **classification trees**; in these tree structures, **leaves represent class labels** and **branches represent conjunctions of features that lead to those class labels.**
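To make the leaves-and-branches picture concrete, here is a minimal sketch of a hand-built classification tree. The feature names, values, and labels are hypothetical and chosen only for illustration; a real tree would be learned from data rather than written by hand.

```python
# Hypothetical hand-built classification tree: internal nodes test one feature,
# leaves hold a class label. Feature names and labels are illustrative only.
tree = {
    "feature": "outlook",
    "branches": {
        "sunny": {
            "feature": "windy",
            "branches": {True: "don't play", False: "play"},
        },
        "overcast": "play",
        "rain": "don't play",
    },
}


def predict(node, item):
    """Follow the branches matching the item's feature values until a leaf is reached."""
    while isinstance(node, dict):
        node = node["branches"][item[node["feature"]]]
    return node


print(predict(tree, {"outlook": "sunny", "windy": False}))
print(predict(tree, {"outlook": "rain", "windy": True}))
```

The path to the first prediction is the conjunction `outlook == "sunny" AND windy == False`, which is what "branches represent conjunctions of features" means here.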

Decision Trees (DTs) are a [greedy](https://en.wikipedia.org/wiki/Greedy_algorithm), non-parametric, non-linear supervised learning method used for classification (nominal/discrete data) and regression (continuous data). The goal is to create a model that predicts the value of a target variable by learning simple decision rules inferred from the data features.


**[Decision Tree implementations differ primarily along these axes:](https://scikit-learn.org/stable/modules/tree.html#tree-algorithms-id3-c4-5-c5-0-and-cart)**

* The splitting criterion (i.e., how impurity, or "variance", is calculated)

* Whether it builds models for regression (continuous variables, e.g., a score) as well as classification (discrete variables, e.g., a class label)

* Technique to eliminate/reduce over-fitting

* Whether it can handle incomplete data


 **CART**, or Classification And Regression Trees, is often used as a generic term for decision trees, though it also has a more specific meaning as one particular tree algorithm. See the [scikit-learn comparison of ID3, C4.5, C5.0 and CART](http://scikit-learn.org/stable/modules/tree.html#tree-algorithms-id3-c4-5-c5-0-and-cart).

### [NP-completeness](https://en.wikipedia.org/wiki/NP-completeness)

Data comes in records of the form:

$  {\displaystyle ({\textbf {x}},Y)=(x_{1},x_{2},x_{3},...,x_{k},Y)} $

The dependent variable, $ {\displaystyle Y}$, is the target variable that we are trying to understand, classify or generalize. The vector $ {\displaystyle {\textbf {x}}}$ is composed of the features, $x_{1},x_{2},x_{3},...,x_{k}$, that are used for that task. 

### [Video](https://www.youtube.com/watch?v=AmCV4g7_-QM&list=PLBv09BD7ez_4temBw7vLA19p3tdQH6FYO&index=3)

In order to pick which feature to split on, we need a way of measuring how good the split is. We select the split by using **Information Gain** or the **Gini Impurity**.

[Why are implementations of decision tree algorithms usually binary and what are the advantages of the different impurity metrics?](https://github.com/rasbt/python-machine-learning-book/blob/master/faq/decision-tree-binary.md)

To arrive at these measurements we need to understand the following: 

1. Information Gain 
2. Gini Impurity

#### [Gini Impurity vs Entropy](https://datascience.stackexchange.com/questions/10228/gini-impurity-vs-entropy)

According to the scikit-learn documentation, Gini plays the same role as entropy does in information gain, rather than the role of information gain itself. This makes the comparison much simpler: it is now a question of the differences between Gini impurity and entropy.

A common rule of thumb is that Gini is intended for continuous attributes and entropy for attributes that occur in classes.

- Gini is geared toward minimizing misclassification; for two classes it is symmetric about $p = 0.5$.
- Entropy is suited to exploratory analysis; it penalizes small probabilities more heavily. (Entropy is a little slower to compute because of the logarithm.) 

Gini impurity and entropy behave very similarly in practice, and people often use them interchangeably. The formulae for both are below:

Gini: $\displaystyle \mathrm{Gini}(E) = 1-\sum _{i=1}^{J}{p_{i}}^{2}$

Entropy: $\displaystyle \mathrm{H}(E) = -\sum _{i=1}^{J}p_{i}\log _{2}p_{i}$

notes: 

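- Gini impurity does not require computing logarithms, which are comparatively expensive.

-  "Gini method works only when the target variable is a binary variable." - Learning Predictive Analytics with Python. Note, however, that the formula above is defined for any number of classes $J$, so this restriction does not follow from the definition itself.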
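The two formulae above can be compared side by side with a minimal sketch. The function names `gini` and `entropy` are illustrative helpers, not part of any library; both take a list of class probabilities that is assumed to sum to 1.

```python
import math


def gini(probs):
    """Gini impurity: 1 - sum(p_i^2)."""
    return 1.0 - sum(p * p for p in probs)


def entropy(probs):
    """Shannon entropy in bits; terms with p == 0 contribute 0."""
    return sum(-p * math.log2(p) for p in probs if p > 0)


# Two-class distributions (p, 1 - p): both measures are 0 for a pure node
# and largest for an even mix.
for p in (0.0, 0.1, 0.5, 0.9, 1.0):
    dist = (p, 1.0 - p)
    print(f"p={p:.1f}  gini={gini(dist):.3f}  entropy={entropy(dist):.3f}")
```

Both functions accept more than two probabilities, e.g. `gini((1/3, 1/3, 1/3))`, since the sums run over all $J$ classes.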



### [Gini Impurity](https://en.wikipedia.org/wiki/Decision_tree_learning#Gini_impurity)

Gini impurity is a measure of how often a randomly chosen element from the set would be incorrectly labeled if it was randomly labeled according to the distribution of labels in the subset.

${\displaystyle \operatorname {I} _{G}(p)=1-\sum _{i=1}^{J}{p_{i}}^{2}}$
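
As a worked example with hypothetical counts (not taken from the source), consider a node holding 3 items of class A and 1 item of class B, so $p_{A}=0.75$ and $p_{B}=0.25$:

${\displaystyle \operatorname {I} _{G}(p)=1-\left(0.75^{2}+0.25^{2}\right)=1-0.625=0.375}$

This matches the verbal definition: drawing an item at random and labeling it at random from the same distribution gives a wrong label with probability $0.75\cdot 0.25+0.25\cdot 0.75=0.375$.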

### [Entropy (Information Entropy or Shannon Entropy)](https://en.wikipedia.org/wiki/Entropy_(information_theory))


**Entropy is zero when one outcome is certain.**

Intuitively, if a set has all the **same labels, that'll have low entropy** and if it has a **mix of labels, that's high entropy**. 

We would like to create splits that minimize the entropy on each side. If our splits do a good job splitting along the boundary between classes, they have more predictive power. 

**The information entropy** is expressed in terms of a discrete set of probabilities $p_i$ so that the measure of information entropy associated with each possible data value is the negative logarithm of the probability mass function for the value: 

${\displaystyle \mathrm {H} (T)=\operatorname {I} _{E}\left(p_{1},p_{2},...,p_{J}\right)=-\sum _{i=1}^{J}{p_{i}\log _{2}p_{i}}}$

Where:

$\mathrm {T}$ denotes a set of training examples, each of the form ${\displaystyle ({\textbf {x}},y)=(x_{1},x_{2},x_{3},...,x_{k},y)}$ 
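
A minimal sketch of this formula applied to a list of labels, assuming the probabilities $p_i$ are estimated as the observed label frequencies. The function name `entropy_of_labels` and the example labels are illustrative.

```python
from collections import Counter
import math


def entropy_of_labels(labels):
    """Shannon entropy (bits) of the empirical label distribution."""
    total = len(labels)
    counts = Counter(labels)
    return sum(-(n / total) * math.log2(n / total) for n in counts.values())


# One certain outcome: zero entropy.
print(entropy_of_labels(["Play", "Play", "Play"]))
# Even mix of two labels: one bit of entropy.
print(entropy_of_labels(["Play", "Don't Play"] * 2))
```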

Thus, when the data source has a lower-probability value (i.e., when a low-probability event occurs), the event carries more "information" than when the source data has a higher-probability value. The amount of information conveyed by each event defined in this way becomes a random variable whose expected value is the information entropy. 

Entropy is a measure of unpredictability of the state, or equivalently, of its average information content. To get an intuitive understanding of these terms, consider the example of a political poll. Usually, such polls happen because the outcome of the poll is not already known. In other words, the outcome of the poll is relatively unpredictable, and actually performing the poll and learning the results gives some new information; these are just different ways of saying that the a priori entropy of the poll results is large. Now, consider the case that the same poll is performed a second time shortly after the first poll. Since the result of the first poll is already known, the outcome of the second poll can be predicted well and the results should not contain much new information; in this case the a priori entropy of the second poll result is small relative to that of the first. 
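
To put numbers on the poll example, assume (purely for illustration) that the first poll has two equally likely outcomes, and that after seeing its result the second poll is expected to agree with probability 0.99:

$ {\displaystyle \mathrm {H} _{\text{first}}=-\left(0.5\log _{2}0.5+0.5\log _{2}0.5\right)=1{\text{ bit}}} $

$ {\displaystyle \mathrm {H} _{\text{second}}=-\left(0.99\log _{2}0.99+0.01\log _{2}0.01\right)\approx 0.081{\text{ bits}}} $

Under these assumed probabilities the second poll carries less than a tenth of the information of the first.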



### [Information Gain ](https://en.wikipedia.org/wiki/Information_gain_in_decision_trees)

Information gain is used to decide which feature to split on at each step in building the tree. Simplicity is best, so we want to keep our tree small. 

In general terms, the expected information gain is the change in information entropy $\mathrm {H}$ from a prior state to a state that takes some information as given:

${\displaystyle IG(T,a)=\mathrm {H} {(T)}-\mathrm {H} {(T|a)}} $

where:

$\mathrm {T}$ denotes a set of training examples, each of the form ${\displaystyle ({\textbf {x}},y)=(x_{1},x_{2},x_{3},...,x_{k},y)}$. $\mathrm{T}$ is considered the **parent dataset**.

$ {\displaystyle \mathrm {H} {(T|a)}}$ is the conditional entropy of ${\displaystyle T}$ given the value of attribute ${\displaystyle a}$.

Where:

${\displaystyle a}$ is the feature to perform the split. 

${\displaystyle \overbrace {IG(T,a)} ^{\text{Information Gain}}=\overbrace {\mathrm {H} (T)} ^{\text{Entropy (parent)}}-\overbrace {\mathrm {H} (T|a)} ^{\text{Weighted Sum of Entropy (Children)}}}$

$ {\displaystyle {IG(T,a)} =-\sum _{i=1}^{J}p_{i}\log _{2}{p_{i}}-\sum _{a}{p(a)\sum _{i=1}^{J}-\Pr(i|a)\log _{2}{\Pr(i|a)}}} $

where $\mathrm {H}$ is entropy and $ {\displaystyle \mathrm {H} {(T|a)}}$ is the weighted average entropy of the sub-domains obtained after observing attribute $a$. (In a decision tree, each node splits its domain into several sub-domains, and the tree is grown iteratively on each node.)

Splitting $\mathrm {T}$ on attribute ${\displaystyle \mathrm{a}}$ partitions it into subsets, one per value of ${\displaystyle \mathrm{a}}$; each subset is considered a **child dataset**. The information gain of $\mathrm {T}$ for attribute ${\displaystyle \mathrm{a}}$ is then the difference between the a priori Shannon entropy $ {\displaystyle \mathrm {H} {(T)}}$ of the training set and the conditional entropy $ {\displaystyle \mathrm {H} {(T|a)}}$:

$ {\displaystyle  IG(T,a)=\mathrm {H} (T)- \sum _{v\in vals(a)}{{\frac {|S_{a}{(v)}|}{|T|}}\cdot \mathrm {H} \left(S_{a}{\left(v\right)}\right)}.}$

For a value ${\displaystyle v}$ taken by attribute ${\displaystyle a}$ , let
${\displaystyle S_{a}{(v)}=\{{\textbf {x}}\in T|x_{a}=v\}}$

The information gain of ${\displaystyle T}$ given ${\displaystyle a}$ can be defined as the difference between the unconditional Shannon entropy of ${\displaystyle T}$ and the expected entropy of ${\displaystyle T}$ conditioned on ${\displaystyle a}$
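
The definition above can be sketched in a few lines of Python. The rows below are only the four rows that appear in the `playgolf.csv` preview later in this notebook, not the full file, and the helper names `entropy` and `information_gain` are illustrative rather than taken from any library.

```python
from collections import Counter, defaultdict
import math

# The four rows shown in the notebook's preview of playgolf.csv (not the full file).
rows = [
    {"Outlook": "sunny", "Windy": False, "Result": "Don't Play"},
    {"Outlook": "sunny", "Windy": True, "Result": "Don't Play"},
    {"Outlook": "overcast", "Windy": False, "Result": "Play"},
    {"Outlook": "rain", "Windy": False, "Result": "Play"},
]


def entropy(labels):
    """Shannon entropy (bits) of the empirical label distribution."""
    total = len(labels)
    return sum(-(n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(rows, attribute, target="Result"):
    """IG(T, a) = H(T) - sum over v of |S_a(v)| / |T| * H(S_a(v))."""
    parent = [r[target] for r in rows]
    children = defaultdict(list)  # S_a(v): target labels grouped by value v of attribute a
    for r in rows:
        children[r[attribute]].append(r[target])
    weighted = sum(len(s) / len(rows) * entropy(s) for s in children.values())
    return entropy(parent) - weighted


# Greedy choice: split on the attribute with the largest information gain.
gains = {a: information_gain(rows, a) for a in ("Outlook", "Windy")}
best = max(gains, key=gains.get)
print(gains, best)
```

On these four rows, `Outlook` separates the two classes perfectly, so every child has zero entropy and the gain equals the parent entropy; `Windy` leaves a mixed child and gains less.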


### [Information Theory](https://en.wikipedia.org/wiki/Information_theory)


**Entropy is zero when one outcome is certain.**

The basic idea of information theory is the more one knows about a topic, the less new information one is apt to get about it. If an event is very probable, it is no surprise when it happens and thus provides little new information. Inversely, if the event was improbable, it is much more informative that the event happened. Therefore, the information content is an increasing function of the inverse of the probability of the event $(1/p)$.
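
One standard choice for such an increasing function is the base-2 logarithm of $1/p$, which measures information in bits. A minimal sketch (the function name `self_information` is illustrative):

```python
import math


def self_information(p):
    """Information content in bits of an event with probability p (0 < p <= 1)."""
    return math.log2(1.0 / p)


# A certain event carries no information; rarer events carry more.
for p in (1.0, 0.5, 0.01):
    print(p, self_information(p))
```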

A key measure in information theory is **entropy**. Entropy quantifies the amount of uncertainty (information) involved in the value of a random variable or the outcome of a random process. 

Entropy is a measure of information in a single random variable; mutual information is a measure of the information in common between two random variables. 

Intuitively, the entropy $H(X)$ of a discrete random variable X **is a measure of the amount of uncertainty associated with the value of X when only its distribution is known.**

The meaning of the events observed (the meaning of messages) does not matter in the definition of entropy. Entropy only takes into account the probability of observing a specific event, so the information it encapsulates is information about the underlying probability distribution, not the meaning of the events themselves.

### Ensemble Methods

- Boosted trees: incrementally build an ensemble by training each new instance to emphasize the training instances previously mis-modeled. A typical example is AdaBoost. These can be used for regression-type and classification-type problems.

- Bootstrap aggregated (or bagged) decision trees, an early ensemble method, build multiple decision trees by repeatedly resampling training data with replacement, and voting the trees for a consensus prediction.
    - A random forest classifier is a specific type of bootstrap aggregating.

- Rotation forest – in which every decision tree is trained by first applying principal component analysis (PCA) on a random subset of the input features.
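
The two mechanical pieces of bagging described above, resampling with replacement and voting, can be sketched as follows. This is only an outline under stated assumptions: the records and the list of per-tree predictions are hypothetical, and the step of actually training a tree on each sample is omitted.

```python
import random
from collections import Counter


def bootstrap_sample(data, rng):
    """Draw len(data) items from data uniformly at random, with replacement."""
    return [rng.choice(data) for _ in data]


def majority_vote(predictions):
    """Return the most common prediction among the ensemble members."""
    return Counter(predictions).most_common(1)[0][0]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
data = [("sunny", "Don't Play"), ("overcast", "Play"), ("rain", "Play")]  # hypothetical records
samples = [bootstrap_sample(data, rng) for _ in range(5)]  # one resample per tree

# Training a tree on each sample is omitted; suppose the trees predicted these labels:
votes = ["Play", "Don't Play", "Play", "Play", "Don't Play"]
print(majority_vote(votes))
```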


### Links


#### [Boilerplate](https://en.wikipedia.org/wiki/Boilerplate_code)

#### [Python test](http://pythontesting.net/framework/nose/nose-introduction/)

#### [Overriding `__str__` method](https://www.quora.com/What-is-the-use-of-__str__-in-python)

#### [Decision Trees - Sklearn](https://scikit-learn.org/stable/modules/tree.html)

#### [The probability-weighted average ](https://en.wikipedia.org/wiki/Weighted_arithmetic_mean)

#### [Expected value](https://en.wikipedia.org/wiki/Expected_value)

#### [Categorical variables](https://en.wikipedia.org/wiki/Categorical_variable)

In [5]:
from __future__ import division

from sklearn.model_selection import train_test_split
from sklearn import tree
from collections import Counter

import pandas as pd
import numpy as np

import graphviz


In [10]:
X = np.array([[1, 'bat'], [2, 'cat'], [2, 'rat'], [3, 'bat']])
y = np.array([1, 0, 1, 0])  # one label per row of X

In [11]:
X

array([['1', 'bat'],
       ['2', 'cat'],
       ['2', 'rat'],
       ['3', 'bat']], dtype='|S21')

In [12]:
X[0]

array(['1', 'bat'], dtype='|S21')

In [13]:
X[0, 0]

'1'

In [14]:
X[0, 1]

'bat'

In [15]:
# lambda that returns True when a value is a str, bool, or unicode (a candidate categorical value)
tffunc = lambda x: isinstance(x, str) or \
                    isinstance(x, bool) or \
                    isinstance(x, unicode)

In [16]:
tffunc

<function __main__.<lambda>>

In [17]:
tffunc()

TypeError: <lambda>() takes exactly 1 argument (0 given)

In [18]:
tffunc(X[0, 0])

True

In [19]:
tfarray(X[0, 1])

NameError: name 'tfarray' is not defined

In [20]:
np.vectorize(tffunc)

<numpy.lib.function_base.vectorize at 0x1a17cfdf90>

In [21]:
np.vectorize(tffunc)(X)

array([[ True,  True],
       [ True,  True],
       [ True,  True],
       [ True,  True]])

In [22]:
categorical_variables = np.vectorize(tffunc)(X[0])
categorical_variables

array([ True,  True])

In [23]:
categorical_variables[1]

True
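
Both columns come back `True` above because `np.array` coerced the mixed rows to a single string dtype, so the numeric column is stored as strings too. A minimal sketch of one way around this, assuming the intent is to flag only the non-numeric column: keep the rows as plain Python lists so each value retains its type. The name `is_categorical` is illustrative, and the code is written for Python 3, where `unicode` no longer exists.

```python
# Plain Python lists keep each value's original type (no coercion to strings).
rows = [[1, "bat"], [2, "cat"], [2, "rat"], [3, "bat"]]


def is_categorical(value):
    """Treat strings and booleans as categorical; other types as numeric."""
    return isinstance(value, (str, bool))


# Inspect the first row to decide which columns are categorical.
categorical_variables = [is_categorical(v) for v in rows[0]]
print(categorical_variables)  # [False, True]
```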

In [17]:
import nose.tools
from nose.tools import assert_almost_equal

In [None]:
assert_almost_equal()

In [7]:
df = pd.read_csv('playgolf.csv')

In [8]:
df[:4]

Unnamed: 0,Outlook,Temperature,Humidity,Windy,Result
0,sunny,85,85,False,Don't Play
1,sunny,80,90,True,Don't Play
2,overcast,83,78,False,Play
3,rain,70,96,False,Play


In [9]:
df.columns

Index([u'Outlook', u'Temperature', u'Humidity', u'Windy', u'Result'], dtype='object')