## Decision Tree
Decision trees are a supervised learning algorithm. They can be used to solve both regression and classification problems.

A decision tree solves a problem using a tree representation in which each internal node tests an attribute and each leaf node corresponds to a class label.

Any boolean function over discrete attributes can be represented by a decision tree.
![](decision_trees.png)
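
As a small illustration of the boolean-function claim, the sketch below writes XOR as a two-level decision tree: the root tests one attribute and each child tests the other. The function name `xor_tree` and the attribute names `a` and `b` are made up for this example.

```python
def xor_tree(a: bool, b: bool) -> bool:
    """Decision tree for `a XOR b`: the root tests `a`, each child tests `b`."""
    if a:              # root node: test attribute a
        if b:          # internal node on the a=True branch
            return False   # leaf
        return True        # leaf
    else:
        if b:          # internal node on the a=False branch
            return True    # leaf
        return False       # leaf

# Evaluate the tree on the full truth table.
truth_table = {(a, b): xor_tree(a, b) for a in (False, True) for b in (False, True)}
print(truth_table)
```

Every path from the root to a leaf corresponds to one row of the truth table, which is why any boolean function on discrete attributes can be encoded this way.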

The major challenge in building a decision tree is identifying which attribute to test at the root node of each level (that is, at the root of each subtree). This process is known as attribute selection. There are two popular attribute selection measures:

1. Information Gain

2. Gini Index
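
The rest of this notebook works with information gain, so here is a minimal sketch of the other measure for comparison. It assumes the common definition of the Gini index as `1 - sum(p_i^2)` over the class proportions; the helper name `gini_index` is illustrative, and the 9 yes / 5 no label counts are those of the weather dataset loaded further down.

```python
from collections import Counter

def gini_index(labels):
    """Gini impurity of a collection of class labels: 1 - sum(p_i ** 2)."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((count / total) ** 2 for count in counts.values())

# Class distribution of the 'play' column: 9 yes, 5 no.
labels = ['yes'] * 9 + ['no'] * 5
print(round(gini_index(labels), 3))  # 0.459
```

Like entropy, the Gini index is 0 for a pure set and largest when the classes are evenly mixed.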

In [1]:
import pandas as pd
import math

Load the data

Download the dataset from https://gist.github.com/bigsnarfdude/515849391ad37fe593997fe0db98afaa and save it as `weather.csv`.

In [2]:
filename='weather.csv'
data=pd.read_csv(filename)
data

Unnamed: 0,outlook,temperature,humidity,windy,play
0,overcast,hot,high,False,yes
1,overcast,cool,normal,True,yes
2,overcast,mild,high,True,yes
3,overcast,hot,normal,False,yes
4,rainy,mild,high,False,yes
5,rainy,cool,normal,False,yes
6,rainy,cool,normal,True,no
7,rainy,mild,normal,False,yes
8,rainy,mild,high,True,no
9,sunny,hot,high,False,no


In [4]:
total_examples=data.shape[0]
total_examples

14

Drop the `play` column because it is the output (the value to predict).

In [5]:
features=data
features=features.drop('play',axis=1)
features

Unnamed: 0,outlook,temperature,humidity,windy
0,overcast,hot,high,False
1,overcast,cool,normal,True
2,overcast,mild,high,True
3,overcast,hot,normal,False
4,rainy,mild,high,False
5,rainy,cool,normal,False
6,rainy,cool,normal,True
7,rainy,mild,normal,False
8,rainy,mild,high,True
9,sunny,hot,high,False


In [6]:
data['play'].value_counts()

yes    9
no     5
Name: play, dtype: int64

### Entropy
Entropy is the degree of randomness of a set of elements; in other words, it is a measure of impurity. Mathematically, it is calculated from the probabilities of the items as:
![](entropy.jpeg)

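For example, if the faces observed over four throws of a die are 1, 1, 2, 3, then

   p(1) = 0.5

   p(2) = 0.25

   p(3) = 0.25

entropy = - (0.5 * log10(0.5)) - (0.25 * log10(0.25)) - (0.25 * log10(0.25))

        = 0.45

This value uses base-10 logarithms. With base-2 logarithms, which the `get_entropy` function in this notebook uses, the same distribution has an entropy of 1.5 bits.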
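
The dice example can be checked numerically. The sketch below is a small standalone helper (the name `entropy` and its `base` parameter are illustrative, not part of the notebook's own functions) that computes the entropy of a list of observations for a chosen logarithm base.

```python
import math
from collections import Counter

def entropy(items, base=2):
    """Entropy of the empirical distribution of `items` in the given log base."""
    total = len(items)
    counts = Counter(items)
    return -sum((c / total) * math.log(c / total, base) for c in counts.values())

throws = [1, 1, 2, 3]  # faces observed in the example above
print(round(entropy(throws, base=10), 2))  # 0.45
print(round(entropy(throws, base=2), 2))   # 1.5
```

Changing the base only rescales the result by a constant factor, so it does not affect which attribute has the highest information gain.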

In [17]:
def get_entropy(data):
    """Return the base-2 entropy of the 'play' column of `data`."""
    total = data.shape[0]
    counts = data['play'].value_counts()
    entropy = 0
    # Each class that is present contributes -p * log2(p);
    # a class that is absent contributes nothing.
    for label in ('yes', 'no'):
        if label in counts:
            p = counts[label] / total
            entropy += -p * math.log(p, 2)
    return entropy

In [8]:
attribute = data['outlook']
attribute

0     overcast
1     overcast
2     overcast
3     overcast
4        rainy
5        rainy
6        rainy
7        rainy
8        rainy
9        sunny
10       sunny
11       sunny
12       sunny
13       sunny
Name: outlook, dtype: object

In [9]:
attribute.unique()

array(['overcast', 'rainy', 'sunny'], dtype=object)

In [10]:
data.loc[attribute=='overcast']

Unnamed: 0,outlook,temperature,humidity,windy,play
0,overcast,hot,high,False,yes
1,overcast,cool,normal,True,yes
2,overcast,mild,high,True,yes
3,overcast,hot,normal,False,yes


In [11]:
total_examples=data.shape[0]
average_entropy_info=0
total_examples

14

In [12]:
for value in attribute.unique():
    value_df=data.loc[attribute==value]
    #display(value_df)
    average_entropy_info+=(value_df.shape[0]/total_examples)*get_entropy(value_df)

In [18]:
def get_average_entropy(attribute,data):
    total_examples=data.shape[0]
    average_entropy_info=0
    for value in attribute.unique():
        value_df=data.loc[attribute==value]
        #display(value_df)
        average_entropy_info+=(value_df.shape[0]/total_examples)*get_entropy(value_df)
    return average_entropy_info

### Information Gain
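When several features are available for splitting the current working set, the information gain of a feature at a node is defined as

**Information Gain(feature) = Entropy(parent) - weighted average of Entropy(children for feature)**

where each child's weight is the fraction of the parent's examples that fall into it.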
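
To make the formula concrete, here is a self-contained sketch that computes information gain on a tiny made-up table. The DataFrame `toy` and the helper names `entropy_of` and `information_gain` are assumptions for this example only; they are not the weather data or the notebook's own functions.

```python
import math
import pandas as pd

def entropy_of(labels: pd.Series) -> float:
    """Base-2 entropy of a column of class labels."""
    probs = labels.value_counts(normalize=True)
    return -sum(p * math.log2(p) for p in probs)

def information_gain(df: pd.DataFrame, feature: str, target: str) -> float:
    """Entropy of the parent minus the weighted average entropy of the children."""
    parent = entropy_of(df[target])
    weighted = sum(
        (len(group) / len(df)) * entropy_of(group[target])
        for _, group in df.groupby(feature)
    )
    return parent - weighted

# Hypothetical four-row table: 'windy' separates the classes perfectly,
# 'humidity' does not separate them at all.
toy = pd.DataFrame({
    'windy':    [True, True, False, False],
    'humidity': ['high', 'normal', 'high', 'normal'],
    'play':     ['no', 'no', 'yes', 'yes'],
})
print(information_gain(toy, 'windy', 'play'))     # 1.0
print(information_gain(toy, 'humidity', 'play'))  # 0.0
```

A gain equal to the parent entropy means the split yields pure children; a gain of 0 means the split tells us nothing about the class.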

In [13]:
def get_gain(data_entropy,attribute,data):
    info=get_average_entropy(attribute,data)
    gain=data_entropy-info
    return gain

In [14]:
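def construct_tree(tree,curr_split,features,tree_order):
    # Recursively prints the order in which attributes are chosen for splitting.
    # `tree` holds the values of the attribute that was split on one level up.
    global root
    for value in tree:

        max_gain = 0  # renamed from `max` to avoid shadowing the built-in
        if root is not None:
            # Note: this filters the full dataset on the current attribute only,
            # not on the splits made higher up the path.
            new_data = data.loc[data[curr_split] == value]
            data_entropy = get_entropy(new_data)

        else:
            # First call: work on the whole dataset.
            data_entropy = get_entropy(data)
            print('data_entropy',data_entropy)
            new_data = data

        # Pick the attribute with the largest information gain.
        for attribute in features:
            gain = get_gain(data_entropy, new_data[attribute], new_data)
            if gain > max_gain:
                max_gain = gain
                split_attribute = attribute

        if max_gain == 0:
            # No attribute reduces the entropy any further: this branch is a leaf.
            split_attribute = None
            print("Reached leaves at {} ".format(curr_split))
            continue

        else:
            root=split_attribute
            tree_order.append(split_attribute)
            print("Tree order at {} = {}".format(curr_split, tree_order))
            construct_tree(data[split_attribute].unique(),split_attribute,features.drop(split_attribute, axis=1),tree_order)
            tree_order=[]
    print()
    return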

In [15]:
root=None
construct_tree([1],None,features,[])

data_entropy 0.9402859586706309
Tree order at None = ['outlook']
Reached leaves at outlook 
Tree order at outlook = ['outlook', 'windy']
Tree order at windy = ['outlook', 'windy', 'humidity']
Tree order at humidity = ['outlook', 'windy', 'humidity', 'temperature']
Reached leaves at temperature 
Reached leaves at temperature 
Reached leaves at temperature 

Tree order at humidity = ['temperature']
Reached leaves at temperature 
Reached leaves at temperature 
Reached leaves at temperature 


Tree order at windy = ['temperature']
Tree order at temperature = ['temperature', 'humidity']
Reached leaves at humidity 
Reached leaves at humidity 

Reached leaves at temperature 
Tree order at temperature = ['humidity']
Reached leaves at humidity 
Reached leaves at humidity 



Tree order at outlook = ['humidity']
Tree order at humidity = ['humidity', 'temperature']
Tree order at temperature = ['humidity', 'temperature', 'windy']
Reached leaves at windy 
Reached leaves at windy 

Tree order at tem

In [16]:
for attribute in features:
    print(attribute)

outlook
temperature
humidity
windy
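
`construct_tree` prints the split order and, at each level, filters the global `data` on a single attribute. As a possible alternative, the sketch below returns the tree as a nested dictionary and passes each child only the rows that reached it. It is a simplified ID3-style illustration under stated assumptions, not the notebook's method: the helpers `entropy_of`, `best_split`, `build_tree` and the five-row `toy` table are all hypothetical.

```python
import math
import pandas as pd

def entropy_of(labels):
    """Base-2 entropy of a column of class labels."""
    probs = labels.value_counts(normalize=True)
    return -sum(p * math.log2(p) for p in probs)

def best_split(df, features, target):
    """Return the feature with the highest positive information gain, or None."""
    parent = entropy_of(df[target])
    best_feature, best_gain = None, 0.0
    for feature in features:
        weighted = sum(
            (len(group) / len(df)) * entropy_of(group[target])
            for _, group in df.groupby(feature)
        )
        gain = parent - weighted
        if gain > best_gain + 1e-12:  # small tolerance for float noise
            best_feature, best_gain = feature, gain
    return best_feature

def build_tree(df, features, target):
    """Return a nested dict {feature: {value: subtree}}, or a class label at a leaf."""
    split = best_split(df, features, target)
    if split is None:
        # Nothing left to gain: predict the most frequent class in this subset.
        return df[target].mode()[0]
    remaining = [f for f in features if f != split]
    # Each child is built from the subset of rows that followed this branch.
    return {split: {value: build_tree(subset, remaining, target)
                    for value, subset in df.groupby(split)}}

# Hypothetical miniature dataset, not the weather.csv data.
toy = pd.DataFrame({
    'outlook': ['sunny', 'sunny', 'overcast', 'rainy', 'rainy'],
    'windy':   [False, True, False, False, True],
    'play':    ['no', 'no', 'yes', 'yes', 'no'],
})
tree = build_tree(toy, ['outlook', 'windy'], 'play')
print(tree)
```

On this toy table the root splits on `outlook`, and only the `rainy` branch needs a further split on `windy`.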
