# Decision Tree (DT)
***

1. A decision tree is a tree structure made up of decision nodes and can be used for both classification and regression. Each internal node splits on a value of one feature to create decision branches, and each leaf node holds a predicted label. To make a prediction, we start at the root node and follow the branches that match our instance until we reach a leaf node, which gives the label for the instance. 
2. To train a classification decision tree, we greedily pick, among all possible splits, the feature value whose split gives the largest reduction in uncertainty. The reduction is calculated with the Gini impurity index or information entropy (giving the Gini gain or information gain), which measures how much uncertainty in the dataset is removed by the split. Each split creates two or more new child nodes. For each new node, we apply the same algorithm again to the subset of training instances that follows that decision path. We stop splitting when only one class is left among the remaining training instances; that node becomes a leaf node labeled with that class. 
3. A decision tree is interpretable and efficient to learn, but it tends to overfit: the tree can grow so complex that a slight change in an instance changes its predicted label. Post-pruning or limiting the maximum depth reduces overfitting. 
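
The prediction walk described in point 1 can be sketched as follows. This is a minimal illustration, not a reference implementation: the `Node` class, its field names, and the "go left when the value is at most the threshold" convention are assumptions made for the example, and the tree is built by hand rather than learned.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Node:
    """Hypothetical node type: a leaf sets `label`, an internal node sets a split."""
    feature: Optional[int] = None      # index of the feature tested at this node
    threshold: Optional[float] = None  # assumed rule: go left when x[feature] <= threshold
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    label: Optional[str] = None        # set only on leaf nodes


def predict(node: Node, x: Sequence[float]) -> str:
    """Follow the branch that matches x from the root until a leaf is reached."""
    while node.label is None:
        node = node.left if x[node.feature] <= node.threshold else node.right
    return node.label


# Hand-built tree: split on feature 0 at 2.0, then on feature 1 at 1.0.
tree = Node(
    feature=0, threshold=2.0,
    left=Node(label="A"),
    right=Node(feature=1, threshold=1.0,
               left=Node(label="B"), right=Node(label="A")),
)

print(predict(tree, [1.5, 0.0]))  # prints A
print(predict(tree, [3.0, 0.5]))  # prints B
```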

1. **[Gini Index and Information Entropy]**: Both apply to a dataset (instances with labels) and measure its uncertainty. Both are 0 when the set contains only one class. \
Gini Index (Gini impurity) measures the probability of incorrectly classifying a randomly chosen element in the dataset if it were randomly labeled according to the class distribution in the dataset. 
    $$ G(\mathcal{D}) = \sum_{c=1}^{C} \textrm{P}(c)(1-\textrm{P}(c)) = 1 - \sum_{c=1}^{C} \textrm{P}(c)^2 $$
    Information Entropy can be roughly thought of as the variance of the dataset's labels: it is largest when all classes are equally frequent.
    $$ E(\mathcal{D}) = -\sum_{c=1}^{C} \textrm{P}(c)\log_2\textrm{P}(c) $$
    In both cases, $\mathcal{D}$ is the dataset to be evaluated, $C$ is the total number of classes in $\mathcal{D}$ and $\textrm{P}(c)$ is the probability of picking an instance with the class $c$ (fraction of instances with class $c$ in $\mathcal{D}$).
2. **[Gini Gain and Information Gain]**: Both measure how much the uncertainty (Gini Index or Information Entropy) drops when the dataset is split. 
    $$ \textrm{Gain}(\mathcal{D}, S) = M(\mathcal{D}) - \sum_{s\in S}\frac{\lvert s \rvert}{\lvert \mathcal{D} \rvert} M(s)$$
    where $\mathcal{D}$ is the dataset before splitting, $S$ is the set of subsets of $\mathcal{D}$ produced by one candidate split of $\mathcal{D}$, $M$ is the Gini Index ($G$) or Information Entropy ($E$), and $\lvert \cdot \rvert$ gives the number of items in a set. 
3. **[Decision tree training algorithm]**: We consider a binary classification decision tree. Given a dataset $\mathcal{D}$, 
    1. Identify all possible splits across all features. For a categorical feature, each discrete value is a possible split. For a numerical feature, we can either a) treat it as a categorical feature by discretizing it, or b) sort the training values of the feature in ascending order and treat each interval between two consecutive values as a possible split. 
    2. Calculate the uncertainty difference (Gini Gain or Information Gain) for every possible split and split on the one with the largest difference. 
    3. Once a node splits into two children, determine which data points fall into each of the two branches. For each branch, return to step 1 with the new sub-dataset.
    4. Splitting stops when no further split can be made (the dataset contains only one class). 
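
The measures and the training steps above can be sketched in a few lines of Python. This is an illustrative sketch under several assumptions, not the implementation of any particular library: all features are numerical, the threshold for a candidate split is placed at the midpoint between two consecutive sorted values, internal nodes are plain dictionaries, and the function names (`gini`, `entropy`, `gain`, `best_split`, `build`) are made up for the example. When identical feature rows carry different labels, the sketch falls back to the majority label, which is an addition to the stopping rule described above.

```python
import math
from collections import Counter


def gini(labels):
    """Gini index: 1 - sum_c P(c)^2."""
    n = len(labels)
    return 1.0 - sum((k / n) ** 2 for k in Counter(labels).values())


def entropy(labels):
    """Information entropy: -sum_c P(c) * log2 P(c)."""
    n = len(labels)
    return -sum((k / n) * math.log2(k / n) for k in Counter(labels).values())


def gain(parent, subsets, measure=gini):
    """Uncertainty of the parent minus the size-weighted uncertainty of the subsets."""
    n = len(parent)
    return measure(parent) - sum(len(s) / n * measure(s) for s in subsets)


def best_split(X, y, measure=gini):
    """Steps 1-2: try every midpoint of every feature, keep the largest gain.

    Returns (gain, feature_index, threshold), or None if no split exists.
    """
    best = None
    for f in range(len(X[0])):
        values = sorted(set(row[f] for row in X))
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2  # assumed convention: midpoint threshold
            left = [lab for row, lab in zip(X, y) if row[f] <= t]
            right = [lab for row, lab in zip(X, y) if row[f] > t]
            g = gain(y, [left, right], measure)
            if best is None or g > best[0]:
                best = (g, f, t)
    return best


def build(X, y, measure=gini):
    """Steps 3-4: recurse on each branch until a node holds a single class."""
    if len(set(y)) == 1:
        return y[0]  # leaf: the only remaining class
    split = best_split(X, y, measure)
    if split is None:  # identical rows with mixed labels: majority fallback
        return Counter(y).most_common(1)[0][0]
    _, f, t = split
    left = [(row, lab) for row, lab in zip(X, y) if row[f] <= t]
    right = [(row, lab) for row, lab in zip(X, y) if row[f] > t]
    return {
        "feature": f,
        "threshold": t,
        "left": build([r for r, _ in left], [l for _, l in left], measure),
        "right": build([r for r, _ in right], [l for _, l in right], measure),
    }


X = [[1.0], [2.0], [3.0], [4.0]]
y = ["a", "a", "b", "b"]
print(gini(y), entropy(y))  # 0.5 1.0
print(best_split(X, y))     # (0.5, 0, 2.5)
print(build(X, y))
```

In this toy dataset the split at 2.5 separates the two classes perfectly, so both children are pure and the Gini gain equals the parent's Gini index of 0.5. Passing `measure=entropy` switches the same code to information gain.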

## References
***

1. https://victorzhou.com/blog/intro-to-random-forests/
1. https://www.math.snu.ac.kr/~hichoi/machinelearning/lecturenotes/CART.pdf