# Decision Tree
Link: https://www.analyticsvidhya.com/blog/2021/08/decision-tree-algorithm/

## How to choose the root node
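For each categorical feature, split the observations by feature value (for example, Y and N) into one leaf per value. For each leaf, record the distribution of the target. The feature whose leaves are purest, as measured by the metrics below, is chosen as the root node.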

## Prediction Model 1: Entropy
Def: Entropy is an information theory metric that measures the impurity or uncertainty in a group of observations. 
- Bigger = impure
- Smaller = pure
- Range: 0 to 1 for a two-class target with $\log_2$; in general, 0 to $\log_2 k$ for $k$ classes

Equation: $E = \sum\limits_{x} p(x)\log_2\left(\frac{1}{p(x)}\right) = -\sum\limits_{x} p(x)\log_2(p(x))$
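
As a quick check of the formula, here is a minimal sketch that computes entropy from a list of class labels. The function name `entropy_bits` and the sample labels are illustrative assumptions, not part of the original notebook.

```python
import numpy as np

def entropy_bits(labels):
    """Shannon entropy (base 2) of a sequence of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()                 # class proportions p(x)
    return float(np.sum(p * np.log2(1 / p)))  # sum of p(x) * log2(1 / p(x))

print(entropy_bits(['Y', 'Y', 'N', 'N']))   # evenly mixed, two classes: 1.0
print(entropy_bits(['Y', 'Y', 'Y', 'Y']))   # pure group: 0.0
print(entropy_bits(['a', 'b', 'c', 'd']))   # four equally likely classes: 2.0
```

The last call shows why the 0 to 1 range only holds for a two-class target.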

## Prediction Model 2: Gini Impurity
Def: Gini impurity is the probability of misclassifying an observation if it is labeled at random according to the class distribution of its leaf. The lower the Gini impurity, the better the split and the lower the likelihood of misclassification. <br>
Gini Impurity for a Leaf = 1 - [\(probability of 'Yes'\)$^2$ + \(probability of 'No'\)$^2$]<br>
$Gini=1 - \sum\limits_{i=1}^{n}(p_i)^2$ for each leaf, where $p_i$ is the proportion of class $i$ in that leaf <br>
Then take the weighted sum of the leaf impurities, where each weight is the share of observations in that leaf (number in the leaf / total number): $\sum\limits_{j=1}^{m}\frac{n_j}{N} Gini_j$
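
The two formulas can be checked with a small worked example. This is a sketch only: the helper names `gini_leaf` and `gini_split` and the Yes/No labels are made up for illustration.

```python
from collections import Counter

def gini_leaf(labels):
    """Gini impurity of one leaf: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_split(leaves):
    """Weighted Gini impurity of a split, given one list of labels per leaf."""
    total = sum(len(leaf) for leaf in leaves)
    return sum(len(leaf) / total * gini_leaf(leaf) for leaf in leaves)

left = ['Yes', 'Yes', 'Yes', 'No']    # 3 Yes, 1 No
right = ['No', 'No', 'No', 'No']      # pure leaf
print(gini_leaf(left))                # 1 - (0.75**2 + 0.25**2) = 0.375
print(gini_split([left, right]))      # 4/8 * 0.375 + 4/8 * 0.0 = 0.1875
```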

## Prediction Model: Information Gain
Information gain measures how much information a feature provides about the class. It determines the **order of attributes** in the nodes of a decision tree: **the feature with the highest information gain is split first**. <br>
E(Parent): the entropy of the target y before splitting on any feature <br>
E(Parent|Feature X1): the weighted average entropy of y in the leaves after splitting on feature X1 <br>
**Information Gain = E(Parent) - E(Parent|Feature X1)** <br>
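
To make the definition concrete, the sketch below compares two toy features against the same target. The names `entropy_bits` and `information_gain` and the data are assumptions chosen for illustration; one feature separates the classes perfectly and the other not at all.

```python
import numpy as np

def entropy_bits(labels):
    """Shannon entropy (base 2) of a sequence of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(np.sum(p * np.log2(1 / p)))

def information_gain(feature, target):
    """E(Parent) minus the weighted entropy of the target after splitting on feature."""
    feature = np.asarray(feature)
    target = np.asarray(target)
    child = 0.0
    for value in np.unique(feature):
        subset = target[feature == value]               # target values in this leaf
        child += len(subset) / len(target) * entropy_bits(subset)
    return entropy_bits(target) - child

x1 = ['a', 'a', 'b', 'b']          # separates the classes perfectly
x2 = ['a', 'b', 'a', 'b']          # tells us nothing about the class
y = ['Yes', 'Yes', 'No', 'No']
print(information_gain(x1, y))     # 1.0
print(information_gain(x2, y))     # 0.0
```

Here `x1` would be chosen first because it has the higher gain.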

### Gini Impurity (Numeric Data)
**Steps**
1. Sort the feature values.
2. Compute the average of each pair of adjacent values. For example, [1, 3, 4, 6, 9, 13] gives the averages [2, 3.5, 5, 7.5, 11].
3. Use each average as a candidate threshold that splits the observations into two leaves (below and above the threshold).
4. Calculate the weighted Gini impurity of each candidate split, in the same way as for categorical data.
5. Choose the threshold with the lowest Gini impurity. <br>
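
The steps above can be sketched as follows. The feature values come from the example in step 2; the class labels, the function names, and the choice to drop duplicate values before averaging are assumptions made for illustration.

```python
import numpy as np

def gini_leaf(labels):
    """Gini impurity of one leaf."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - float(np.sum(p ** 2))

def best_threshold(values, labels):
    """Return (threshold, weighted Gini) for the best binary split of a numeric feature."""
    values = np.asarray(values, dtype=float)
    labels = np.asarray(labels)
    ordered = np.unique(values)                       # step 1: sorted (duplicates dropped)
    candidates = (ordered[:-1] + ordered[1:]) / 2     # step 2: averages of adjacent values
    best = (None, np.inf)
    for t in candidates:                              # step 3: try each threshold
        left, right = labels[values < t], labels[values >= t]
        # step 4: weighted Gini impurity of the two leaves
        score = (len(left) * gini_leaf(left) + len(right) * gini_leaf(right)) / len(labels)
        if score < best[1]:                           # step 5: keep the lowest impurity
            best = (float(t), score)
    return best

values = [1, 3, 4, 6, 9, 13]
labels = ['No', 'No', 'No', 'Yes', 'Yes', 'Yes']     # hypothetical target
print(best_threshold(values, labels))                # (5.0, 0.0)
```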

After choosing the root node, the same Gini impurity method is applied to each internal node (branch) to choose the next split, until every leaf has a Gini impurity of 0. Growing the tree this far can **overfit**, so we can
1. Prune the tree.
2. Stop splitting a node when its class distribution is skewed enough, e.g. 3:7 or 2:8.
3. Stop splitting a node when the number of observations in it falls below a chosen threshold.
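
Rules 2 and 3 can be expressed as a small check that runs before each split. This is a sketch only: `should_stop` is a hypothetical helper, and the default thresholds (at most 5 observations, a majority share of at least 80%) are arbitrary assumptions. Pruning (rule 1) is applied after the tree is grown and is not covered here.

```python
from collections import Counter

def should_stop(labels, min_samples=5, purity=0.8):
    """Return True if a node should become a leaf instead of being split again."""
    n = len(labels)
    if n <= min_samples:                        # rule 3: too few observations
        return True
    majority_share = Counter(labels).most_common(1)[0][1] / n
    return majority_share >= purity             # rule 2: distribution skewed enough

print(should_stop(['Yes'] * 8 + ['No'] * 2))    # True: 2:8 distribution
print(should_stop(['Yes'] * 5 + ['No'] * 5))    # False: evenly mixed, 10 observations
print(should_stop(['Yes', 'No', 'Yes']))        # True: only 3 observations
```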

In [2]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import math
import warnings
warnings.filterwarnings('ignore')
np.set_printoptions(suppress=True)
sns.set(rc={'figure.figsize':(10,8)})

### Gini impurity

In [3]:
def Gini(*groups):
    # each group is assumed to be an (X, y) pair, e.g. as returned by Split
    gini = 0
    total_num = np.sum([len(group[1]) for group in groups])
    
    for group in groups:
        y = group[1]
        group_size = y.shape[0]
        _, count = np.unique(y, return_counts=True)
        p = count / group_size
        weight = group_size / total_num
        
        gini += (1 - np.sum(p**2)) * weight    # Gini impurity of the group, weighted by its size
    return gini

### Data source

In [16]:
with open('data/lenses.txt', 'r') as fr:    # the file is closed once it has been read
    lenses = np.array([inst.strip().split('\t') for inst in fr])
lensesLabels = ['age', 'prescript', 'astigmatic', 'tearRate']
X = lenses[:, :-1]
y = lenses[:, -1]

### Entropy

In [5]:
def Entropy(y):
    _, count = np.unique(y, return_counts=True)
    p = count / len(y)
    entropy = p @ -np.log2(p)    # base-2 logarithm, matching the entropy equation
    return entropy

### Split the data on a feature value

In [6]:
def Split(X, y, col_index, value):
    mask = X[:, col_index] == value    # rows where the feature equals value
    return X[mask], y[mask]    # matching rows of X and y

### Find the best feature to split

In [7]:
def Best_Feature_to_Split(X, y):
    # entropy of the target before any split
    n = X.shape[0]
    parent_entropy = Entropy(y)
    best_info_gain = 0
    best_feature = 0

    # loop over each feature column
    for column in range(X.shape[1]):
        
        # weighted entropy of the subsets produced by each unique value of the feature
        feature_values = np.unique(X[:, column])
        child_entropy = 0
        for value in feature_values:
            _, subset_y = Split(X, y, column, value)    # target values where the feature equals value
            p = len(subset_y) / n
            child_entropy += p * Entropy(subset_y)
        info_gain = parent_entropy - child_entropy    # information gain

        # keep the feature with the largest information gain
        if info_gain > best_info_gain:
            best_info_gain = info_gain
            best_feature = column
    return best_feature

### Majority of values in the class

In [8]:
def Majority_Voting(y):
    target, counts = np.unique(y, return_counts=True)
    return target[counts.argmax()]    # return value with highest counts

### Tree

In [9]:
Best_Feature_to_Split(X, y)

3

In [21]:
def Tree(X, y, labels=None):
    if labels is None:
        labels = lensesLabels    # feature names for the columns of X
    if len(np.unique(y)) == 1:    # all y are the same: return that class as a leaf
        return y[0]
    if X.shape[1] == 0:    # no features left: return the majority class
        return Majority_Voting(y)
    best_feature = Best_Feature_to_Split(X, y)
    best_label = labels[best_feature]
    decision_tree = {best_label: {}}
    sub_labels = labels[:best_feature] + labels[best_feature + 1:]    # drop the name of the used feature
    
    for value in np.unique(X[:, best_feature]):    # one branch per value of the best feature
        subset_X, subset_y = Split(X, y, best_feature, value)
        subset_X = np.delete(subset_X, best_feature, 1)    # delete the column that was split on
        decision_tree[best_label][value] = Tree(subset_X, subset_y, sub_labels)
        
    return decision_tree    # returned only after every branch has been built

In [22]:
Tree(X, y)
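
A tree stored as nested dictionaries of the form `{feature: {value: subtree_or_label}}` can be used for prediction by following one branch per feature until a leaf is reached. The sketch below assumes that structure; `predict` and `toy_tree` are hypothetical names, and the toy tree is hand-written for illustration rather than produced from the lenses data.

```python
def predict(tree, sample):
    """Follow the branches of a nested-dict tree until a leaf label is reached."""
    while isinstance(tree, dict):
        feature = next(iter(tree))       # each internal node has a single feature key
        branches = tree[feature]
        value = sample[feature]
        if value not in branches:
            return None                  # feature value never seen in training
        tree = branches[value]
    return tree

# Hypothetical tree, written by hand for illustration
toy_tree = {'tearRate': {'reduced': 'no lenses',
                         'normal': {'astigmatic': {'yes': 'hard', 'no': 'soft'}}}}

print(predict(toy_tree, {'tearRate': 'normal', 'astigmatic': 'no'}))     # soft
print(predict(toy_tree, {'tearRate': 'reduced', 'astigmatic': 'yes'}))   # no lenses
```

Returning `None` for an unseen value is one possible design choice; falling back to the majority class of the node is another.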

In [57]:
def createTree(dataSet, labels):
    # NOTE: majorityCnt, chooseBestFeatureToSplit and splitDataSet are not defined in this notebook
    classList = [example[-1] for example in dataSet]    # last column holds the class
    if classList.count(classList[0]) == len(classList):
        return classList[0]    # stop splitting when all of the classes are equal
    if len(dataSet[0]) == 1:    # stop splitting when there are no more features in dataSet
        return majorityCnt(classList)
    bestFeat = chooseBestFeatureToSplit(dataSet)
    bestFeatLabel = labels[bestFeat]
    myTree = {bestFeatLabel: {}}
    del labels[bestFeat]    # mutates the list passed in by the caller
    featValues = [example[bestFeat] for example in dataSet]
    uniqueVals = set(featValues)
    for value in uniqueVals:
        subLabels = labels[:]    # copy all of labels, so trees don't mess up existing labels
        myTree[bestFeatLabel][value] = createTree(splitDataSet(dataSet, bestFeat, value), subLabels)
    return myTree

['no surfacing', 'flippers'] 0 no surfacing
{'no surfacing': {}}
[1, 1, 1, 0, 0]
{0, 1}


NameError: name 'createTree' is not defined