## Decision Tree ID3

The classic decision tree algorithms are ID3, C4.5, and CART. The main difference among the three is the feature selection criterion: ID3 selects features by information divergence (more commonly called information gain), C4.5 by the information divergence ratio (gain ratio), and CART by the Gini index.

As a basic classification and regression method, a decision tree can be understood in two ways: as a set of if-then rules, or as a conditional probability distribution of the class given the features.
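To make the if-then view concrete, here is a hand-written sketch of how a small tree over the weather data used later in this section could be read as nested rules. The function name `play_rule` and the particular rules are illustrative assumptions, not output of the ID3 code; the argument name `humility` simply follows the column spelling in the data.

```python
def play_rule(outlook, humility, windy):
    """Hand-written if-then rules; each root-to-leaf path is one rule."""
    if outlook == "overcast":
        return "yes"
    if outlook == "sunny":
        # on sunny days the decision depends only on the humility column
        return "no" if humility == "high" else "yes"
    # remaining case (rainy): the decision depends only on wind
    return "no" if windy else "yes"


print(play_rule("sunny", "high", False))
print(play_rule("rainy", "normal", False))
```

Each `if` corresponds to a test at an internal node, and each `return` to a leaf.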

Following these two views, decision tree learning can be regarded either as inducing a set of classification rules from the training data set or as estimating a conditional probability model from it. The learning process is recursive: at each step it selects the optimal feature and partitions the data set by that feature's values, so that the samples in each resulting subset are classified as well as possible.

Before introducing information divergence, we need to understand entropy, which measures the uncertainty of a random variable. If a discrete random variable $X$ has the probability distribution $P(X=x_{i})=p_{i}$, $i=1,\dots,n$, then the entropy of $X$ is $H(X)=-\sum_{i=1}^{n}p_{i} \log p_{i}$
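As a quick illustration of the discrete formula, the sketch below evaluates it for a few distributions. The helper name `entropy_from_probs` is hypothetical, and base-2 logarithms are assumed so that the result is in bits.

```python
from math import log2


def entropy_from_probs(probs):
    """Shannon entropy in bits; outcomes with p = 0 contribute nothing."""
    return sum(-p * log2(p) for p in probs if p > 0)


fair_coin = entropy_from_probs([0.5, 0.5])
four_way = entropy_from_probs([0.25, 0.25, 0.25, 0.25])
biased = entropy_from_probs([0.9, 0.1])
print(fair_coin, four_way, biased)
```

A uniform distribution gives the largest entropy for a given number of outcomes, and the value drops as the distribution becomes more lopsided.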

Similarly, for a continuous random variable $Y$, its entropy can be defined as: $H(Y)=-\int_{-\infty}^{+\infty} f(y) \log f(y) d y$
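The integral can be checked numerically. The sketch below approximates it with a midpoint Riemann sum for a Gaussian density and compares the result with the known closed form $\frac{1}{2}\log(2\pi e\sigma^{2})$. The function names, the integration range of ten standard deviations, and the step count are assumptions chosen for illustration; natural logarithms are used, so the result is in nats.

```python
from math import e, exp, log, pi, sqrt


def gaussian_pdf(y, sigma):
    """Density of a zero-mean normal distribution."""
    return exp(-y * y / (2 * sigma * sigma)) / (sigma * sqrt(2 * pi))


def differential_entropy(pdf, lo, hi, steps=200_000):
    """Approximate -integral of f(y) log f(y) dy with a midpoint Riemann sum."""
    dy = (hi - lo) / steps
    total = 0.0
    for k in range(steps):
        f = pdf(lo + (k + 0.5) * dy)
        if f > 0:
            total -= f * log(f) * dy
    return total


sigma = 2.0
numeric = differential_entropy(lambda y: gaussian_pdf(y, sigma), -10 * sigma, 10 * sigma)
closed_form = 0.5 * log(2 * pi * e * sigma ** 2)
print(numeric, closed_form)
```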

Given a random variable $X$ with $P(X=x_{i})=p_{i}$, the remaining uncertainty of a random variable $Y$ is the conditional entropy, the expectation of $H(Y|X=x_{i})$ over $X$: $H(Y|X) = \sum_{i=1}^{n}p_{i} H(Y|X=x_{i})$
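A minimal sketch of estimating conditional entropy from paired samples follows: each value of $X$ defines a group, and the group entropies are averaged with weights equal to the group's share of the samples. The helper names and the toy data are assumptions for illustration.

```python
from collections import Counter, defaultdict
from math import log2


def entropy(labels):
    """Empirical Shannon entropy of a list of labels, in bits."""
    n = len(labels)
    return sum(-c / n * log2(c / n) for c in Counter(labels).values())


def conditional_entropy(xs, ys):
    """Estimate H(Y|X) as the weighted average of H(Y | X = x) over the values x."""
    groups = defaultdict(list)
    for x, y in zip(xs, ys):
        groups[x].append(y)
    n = len(xs)
    return sum(len(g) / n * entropy(g) for g in groups.values())


xs = ["a", "a", "b", "b"]
ys = ["yes", "no", "yes", "yes"]
h_y_given_x = conditional_entropy(xs, ys)
print(h_y_given_x)  # 0.5
```

Here the group `a` is evenly mixed (1 bit) and the group `b` is pure (0 bits), so the weighted average is 0.5 bits.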

Information divergence is the reduction in uncertainty about the class $Y$ once the value of feature $X$ is known. Let $H(D)$ be the entropy of data set $D$ and $H(D|A)$ the conditional entropy of $D$ given feature $A$. Then the information divergence of feature $A$ for the data set is: $g(D,A)=H(D)-H(D|A)$

The larger the information divergence, the more this feature reduces the uncertainty of the data set, which indicates that the feature has strong classification ability.
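One caveat is worth illustrating: a feature with a distinct value for every row also reaches the maximum information divergence, even though it cannot generalize. The sketch below compares such a row-identifier feature with a useful one, and also computes the ratio criterion that C4.5 uses, which divides by the entropy of the feature's own values. The helper names and the four-row toy data are assumptions for illustration.

```python
from collections import Counter, defaultdict
from math import log2


def entropy(values):
    """Empirical Shannon entropy of a list of values, in bits."""
    n = len(values)
    return -sum(c / n * log2(c / n) for c in Counter(values).values())


def info_gain(feature, labels):
    """g(D, A) = H(D) - H(D|A) for one feature column."""
    groups = defaultdict(list)
    for x, y in zip(feature, labels):
        groups[x].append(y)
    n = len(labels)
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


def gain_ratio(feature, labels):
    """Information gain divided by the entropy of the feature's own values."""
    split_info = entropy(feature)
    return info_gain(feature, labels) / split_info if split_info > 0 else 0.0


labels = ["yes", "yes", "no", "no"]
row_id = [0, 1, 2, 3]          # unique per row, useless for new samples
useful = ["a", "a", "b", "b"]  # perfectly aligned with the labels

print(f"gain:  row_id={info_gain(row_id, labels):.2f}  useful={info_gain(useful, labels):.2f}")
print(f"ratio: row_id={gain_ratio(row_id, labels):.2f}  useful={gain_ratio(useful, labels):.2f}")
```

Both features tie on information divergence, while the ratio ranks the useful feature higher. This is also why a row-index column should be kept out of the candidate features.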

In [1]:
import numpy as np 
import pandas as pd 
from math import log

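# the first CSV column holds the saved row index, so use it as the index
# rather than as a feature
df = pd.read_csv('example_data.csv', index_col=0)
df

,humility,outlook,play,temp,windy
0,high,sunny,no,hot,False
1,high,sunny,no,hot,True
2,high,overcast,yes,hot,False
3,high,rainy,yes,mild,False
4,normal,rainy,yes,cool,False
5,normal,rainy,no,cool,True
6,normal,overcast,yes,cool,True
7,high,sunny,no,mild,False
8,normal,sunny,yes,cool,False
9,normal,rainy,yes,mild,False
10,normal,sunny,yes,mild,True
11,high,overcast,yes,mild,True
12,normal,overcast,yes,hot,False
13,high,rainy,no,mild,True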


Calculate the information entropy of the target feature:
$$E(S)= \sum _{i=1}^{c} -p_{i} \log _{2} p_{i}$$

Example: Yes = 9 and No = 5
$$
\begin{aligned}
\text { Entropy(Yes \& No) }&=\text { Entropy }(9,5) \\
&=\text { Entropy }(0.64,0.36) \\
&=-\left(0.64 \log _{2} 0.64\right)-\left(0.36 \log _{2} 0.36\right) \\
&=0.94
\end{aligned}
$$
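The arithmetic above can be reproduced in a few lines. This is only a check of the worked example; the variable names are arbitrary.

```python
from math import log2

counts = [9, 5]  # Yes = 9, No = 5
total = sum(counts)
probs = [c / total for c in counts]
h = -sum(p * log2(p) for p in probs)
print([round(p, 2) for p in probs], round(h, 2))  # [0.64, 0.36] 0.94
```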

In [2]:
def entropy(ele):
    '''
    input: list of class labels
    output: Shannon entropy of the labels, in bits
    '''
    # empirical probability of each distinct value in the list
    probs = [ele.count(i) / len(ele) for i in set(ele)]
    # entropy: -sum(p * log2(p))
    return -sum(prob * log(prob, 2) for prob in probs)

In [3]:
# split the data set by the values of one feature
def split_dataframe(data, col):
    '''
    input: dataframe, column name
    output: a dict mapping each unique value of the column
    to the sub-dataframe of rows that have that value
    '''
    return {value: data[data[col] == value] for value in data[col].unique()}

In [4]:
# choose the best column based on information gain
def choose_best_col(df, label):
    '''
    input: dataframe, label column name
    output: max information gain, best column,
    dict of split dataframes for the best column
    '''
    # entropy of the label over the whole dataframe
    entropy_D = entropy(df[label].tolist())
    # all columns except the label
    cols = [col for col in df.columns if col != label]
    # initialize the max information gain, best column and best split dict
    max_value, best_col = float('-inf'), None
    max_splited = None
    # try splitting the data on each column
    for col in cols:
        splited_set = split_dataframe(df, col)
        # conditional entropy H(D|A): weighted sum of the subsets' label entropies
        entropy_DA = 0
        for subset_col, subset in splited_set.items():
            entropy_Di = entropy(subset[label].tolist())
            entropy_DA += len(subset) / len(df) * entropy_Di
        # information gain of the current feature
        info_gain = entropy_D - entropy_DA
        if info_gain > max_value:
            max_value, best_col = info_gain, col
            max_splited = splited_set
    return max_value, best_col, max_splited


In [5]:
choose_best_col(df, 'play')

(0.2467498197744391, 'outlook', {'sunny':    humility outlook play  temp  windy
  0      high   sunny   no   hot  False
  1      high   sunny   no   hot   True
  7      high   sunny   no  mild  False
  8    normal   sunny  yes  cool  False
  10   normal   sunny  yes  mild   True,
  'overcast':    humility   outlook play  temp  windy
  2      high  overcast  yes   hot  False
  6    normal  overcast  yes  cool   True
  11     high  overcast  yes  mild   True
  12   normal  overcast  yes   hot  False,
  'rainy':    humility outlook play  temp  windy
  3      high   rainy  yes  mild  False
  4    normal   rainy  yes  cool  False
  5    normal   rainy   no  cool   True
  9    normal   rainy  yes  mild  False
  13     high   rainy   no  mild   True})
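To see how the other columns compare, the following standalone sketch recomputes the information gain of every feature using only the standard library. The 14 rows are copied from the data shown in this section; the names `ROWS`, `COLUMNS`, and `info_gain` are assumptions introduced for the example and are not part of the notebook's code.

```python
from collections import Counter, defaultdict
from math import log2

COLUMNS = ["humility", "outlook", "temp", "windy", "play"]
ROWS = [
    ("high", "sunny", "hot", False, "no"),
    ("high", "sunny", "hot", True, "no"),
    ("high", "overcast", "hot", False, "yes"),
    ("high", "rainy", "mild", False, "yes"),
    ("normal", "rainy", "cool", False, "yes"),
    ("normal", "rainy", "cool", True, "no"),
    ("normal", "overcast", "cool", True, "yes"),
    ("high", "sunny", "mild", False, "no"),
    ("normal", "sunny", "cool", False, "yes"),
    ("normal", "rainy", "mild", False, "yes"),
    ("normal", "sunny", "mild", True, "yes"),
    ("high", "overcast", "mild", True, "yes"),
    ("normal", "overcast", "hot", False, "yes"),
    ("high", "rainy", "mild", True, "no"),
]


def entropy(labels):
    """Empirical Shannon entropy of a list of labels, in bits."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())


def info_gain(rows, col_idx, label_idx):
    """H(D) - H(D|A) for the feature stored at position col_idx."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[col_idx]].append(row[label_idx])
    labels = [row[label_idx] for row in rows]
    conditional = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


label_idx = COLUMNS.index("play")
gains = {
    name: info_gain(ROWS, i, label_idx)
    for i, name in enumerate(COLUMNS)
    if i != label_idx
}
# list the features from highest to lowest information gain
for name, gain in sorted(gains.items(), key=lambda kv: -kv[1]):
    print(f"{name:10s} {gain:.4f}")
best = max(gains, key=gains.get)
```

The top-ranked feature is `outlook`, in agreement with the result of `choose_best_col` above.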

In [10]:
class ID3Tree:
    # a tree node: a name (feature or class label) plus labelled edges to children
    class Node:
        def __init__(self, name):
            self.name = name
            self.connections = {}

        def connect(self, label, node):
            self.connections[label] = node

    def __init__(self, data, label):
        self.columns = data.columns
        self.data = data
        self.label = label
        self.root = self.Node("Root")

    # print the tree recursively, one indentation level per depth
    def print_tree(self, node, tabs):
        # debug output: the raw mapping from edge label to child node
        print(node.connections)
        print(tabs + node.name)
        for connection, child_node in node.connections.items():
            print(tabs + "\t" + "(" + str(connection) + ")")
            self.print_tree(child_node, tabs + "\t\t")

    def construct_tree(self):
        self.construct(self.root, "", self.data, self.columns)

    # construct the tree recursively
    def construct(self, parent_node, parent_connection_label, input_data, columns):
        max_value, best_col, max_splited = choose_best_col(input_data[columns], self.label)
        # no feature left to split on: make a leaf with the majority class
        if best_col is None:
            node = self.Node(input_data[self.label].mode().iloc[0])
            parent_node.connect(parent_connection_label, node)
            return

        node = self.Node(best_col)
        parent_node.connect(parent_connection_label, node)
        new_columns = [col for col in columns if col != best_col]
        # recurse into each subset produced by the best feature
        for splited_value, splited_data in max_splited.items():
            self.construct(node, splited_value, splited_data, new_columns)

In [11]:
tree = ID3Tree(df, 'play')
tree.construct_tree()
tree.print_tree(tree.root, "")

{'': <__main__.ID3Tree.Node object at 0x7fecf8c80210>}
Root
	()
{'sunny': <__main__.ID3Tree.Node object at 0x7fecd8864d90>, 'overcast': <__main__.ID3Tree.Node object at 0x7fecf8cc9e90>, 'rainy': <__main__.ID3Tree.Node object at 0x7fecf8cc9b50>}
		outlook
			(sunny)
{'high': <__main__.ID3Tree.Node object at 0x7fecf8cc9a10>, 'normal': <__main__.ID3Tree.Node object at 0x7fecf8cc9910>}
				humility
					(high)
{'hot': <__main__.ID3Tree.Node object at 0x7fecd8864d10>, 'mild': <__main__.ID3Tree.Node object at 0x7fecd8864790>}
						temp
							(hot)
{False: <__main__.ID3Tree.Node object at 0x7fecd8864390>, True: <__main__.ID3Tree.Node object at 0x7fecd88643d0>}
								windy
									(False)
{}
										no
									(True)
{}
										no
							(mild)
{False: <__main__.ID3Tree.Node object at 0x7fecd8864ed0>}
								windy
									(False)
{}
										no
					(normal)
{'cool': <__main__.ID3Tree.Node object at 0x7fecd8864350>, 'mild': <__main__.ID3Tree.Node object at 0x7fecd88642d0>}
						t