Topics to cover:

- Decision Tree;
- DBSCAN;
- Time Series;
- Anomaly detection;
- Cross Validation;
- Feature engineering.



## Decision Tree

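### 1. Splitting with Gini Index
The Gini index is the impurity measure used by the CART algorithm. For a node with $n$ classes, where $p_i$ is the proportion of samples belonging to class $i$, it is defined as:
$$ Gini = 1-\sum^n_{i=1}p_i^2 $$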

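It has the following properties:
* It favors larger partitions;
* It uses the squared proportion of each class;
* A perfectly classified (pure) node has a Gini score of zero;
* An evenly distributed node has a Gini score of $1-1/n$, where $n$ is the number of classes;
* The preferred split is the one whose child nodes have the lowest Gini index, weighted by the size of each child.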

In [2]:
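import numpy as np

def gini(y):
    """Gini impurity of a 1-D array of non-negative integer class labels."""
    hist = np.bincount(y)   # number of samples in each class
    N = np.sum(hist)        # total number of samples
    # Equivalent pure-Python form: 1 - sum((c / N) ** 2 for c in hist)
    return 1 - np.sum((hist / N) ** 2)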
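
As a quick sanity check on the properties listed above, the sketch below evaluates the Gini index on a pure node and on two evenly distributed nodes. The function is repeated so the snippet runs on its own, and the label arrays are made-up illustrations, not data from this notebook.

```python
import numpy as np

def gini(y):
    hist = np.bincount(y)
    N = np.sum(hist)
    return 1 - np.sum((hist / N) ** 2)

# Hypothetical label arrays chosen to hit the boundary cases.
pure = np.array([1, 1, 1, 1])        # every sample in one class
even_two = np.array([0, 0, 1, 1])    # two classes, evenly split
even_four = np.array([0, 1, 2, 3])   # four classes, evenly split

gini_pure = gini(pure)
gini_two = gini(even_two)
gini_four = gini(even_four)

# A pure node should score 0; an even split over n classes should score 1 - 1/n.
print(gini_pure, gini_two, gini_four)
```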

### 2. Splitting with Information Gain and Entropy

$$ Entropy = -\sum^n_{i=1}p_i\cdot \log_2(p_i) $$

* It favors splits with small counts but many unique values;
* It weights the probability of each class by the base-2 logarithm of that probability;
* A lower entropy means a purer node. The lower the entropy of the child nodes, the larger the drop from the parent node's entropy;
* Information Gain is the entropy of the parent node minus the weighted average entropy of the child nodes, with each child weighted by its share of the parent's samples.


In [4]:
def entropy(y):
    hist = np.bincount(y)
    ps = hist / np.sum(hist)
    # Drop empty classes: log2(0) is -inf and 0 * -inf evaluates to nan
    ps = ps[ps > 0]
    return -np.sum(ps * np.log2(ps))
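
Information gain as described in the list above can be sketched as follows. The helper name `information_gain` and its boolean `mask` argument are illustrative assumptions, not part of the original notebook. The child entropies are weighted by each child's share of the samples, which is the usual convention, and this version of `entropy` skips empty classes to avoid evaluating `log2(0)`.

```python
import numpy as np

def entropy(y):
    hist = np.bincount(y)
    ps = hist / np.sum(hist)
    ps = ps[ps > 0]  # ignore classes with no samples
    return -np.sum(ps * np.log2(ps))

def information_gain(y, mask):
    """Parent entropy minus the size-weighted entropy of the two children.

    `mask` is a boolean array: True sends a sample to the left child,
    False sends it to the right child. (Hypothetical helper.)
    """
    y = np.asarray(y)
    mask = np.asarray(mask, dtype=bool)
    left, right = y[mask], y[~mask]
    if len(left) == 0 or len(right) == 0:
        return 0.0  # nothing was actually split
    w_left = len(left) / len(y)
    children = w_left * entropy(left) + (1 - w_left) * entropy(right)
    return entropy(y) - children

# Made-up labels: one split separates the classes, the other does not.
labels = np.array([0, 0, 1, 1])
perfect_gain = information_gain(labels, np.array([True, True, False, False]))
useless_gain = information_gain(labels, np.array([True, False, True, False]))
print(perfect_gain, useless_gain)
```

A split that separates the classes completely recovers the full parent entropy, while a split that leaves both children as mixed as the parent gains nothing.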

Let's compare the two impurity measures on a toy dataset:

In [5]:
import pandas as pd
Class = ["A","A","A","A","A","B","B","B","B","B"]
var1 = [0,0,0,0,1,1,1,0,1,0]
var2 = [33,54,56,42,50,55,31,-4,77,49]
data = pd.DataFrame({'Class':Class, "Var1":var1, "Var2":var2})

In [10]:
y = (data.Var1==1)*1
gini_score = gini(y)
entropy_score = entropy(y)
print (f"The Gini Index is {gini_score} and the Entropy is {entropy_score}")

The Gini Index is 0.48 and the Entropy is 0.9709505944546686
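
The scores above describe how the values of Var1 themselves are distributed. To judge Var1 as a split for predicting Class, one common approach is to compare the impurity of Class before the split with the size-weighted impurity of the two child nodes. The sketch below does this with the toy data rebuilt as NumPy arrays; the helper `weighted_impurity` and the 0/1 encoding of the classes are assumptions made for illustration.

```python
import numpy as np

def gini(y):
    hist = np.bincount(y)
    return 1 - np.sum((hist / np.sum(hist)) ** 2)

def entropy(y):
    hist = np.bincount(y)
    ps = hist / np.sum(hist)
    ps = ps[ps > 0]  # ignore classes with no samples
    return -np.sum(ps * np.log2(ps))

def weighted_impurity(y, mask, impurity):
    """Size-weighted impurity of the two children produced by a boolean mask."""
    left, right = y[mask], y[~mask]
    return (len(left) * impurity(left) + len(right) * impurity(right)) / len(y)

# Toy data from the cells above; Class is encoded as A -> 0, B -> 1 (assumed encoding).
labels = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])
var1 = np.array([0, 0, 0, 0, 1, 1, 1, 0, 1, 0])
mask = var1 == 1

gini_reduction = gini(labels) - weighted_impurity(labels, mask, gini)
info_gain = entropy(labels) - weighted_impurity(labels, mask, entropy)
print(f"Gini reduction: {gini_reduction:.4f}, information gain: {info_gain:.4f}")
```

Both quantities are positive when the children are purer than the parent. A threshold on Var2 could be scored the same way by passing a different boolean mask.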
