# Decision Tree Learning

## Load the Weather dataset

In [1]:
import pandas as pd
import numpy as np
df = pd.read_csv('weather.csv')
df

Unnamed: 0,ID,Outlook,Temperature,Humidity,Windy,Play
0,a,sunny,hot,high,no,no
1,b,sunny,hot,high,yes,no
2,c,overcast,hot,high,no,yes
3,d,rainy,mild,high,no,yes
4,e,rainy,cool,normal,no,yes
5,f,rainy,cool,normal,yes,no
6,g,overcast,cool,normal,yes,yes
7,h,sunny,mild,high,no,no
8,i,sunny,cool,normal,no,yes
9,j,rainy,mild,normal,no,yes


## Exercise 1
Write a Python function `freq` to find the relative frequency distribution of an attribute.
Hint: Try the `value_counts` method.

In [2]:
def freq(series):
    # Relative frequency of each distinct value: count / number of rows.
    return series.value_counts() / len(series)

p_temp = freq(df['Temperature'])
print(p_temp)

mild    0.428571
hot     0.285714
cool    0.285714
Name: Temperature, dtype: float64


Expected output:

    mild    0.428571
    cool    0.285714
    hot     0.285714
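
For comparison, the same relative frequencies can be sketched with only the standard library. The list below is an assumption reconstructed from the output above (14 rows: 6 mild, 4 hot, 4 cool), and `relative_freq` is a hypothetical helper name, not part of the exercise.

```python
from collections import Counter

# Assumed column contents, reconstructed from the frequencies printed above
# (6/14 = 0.428571, 4/14 = 0.285714).
temperature = ['mild'] * 6 + ['hot'] * 4 + ['cool'] * 4

def relative_freq(values):
    """Map each distinct value to its share of the total count."""
    counts = Counter(values)
    total = len(values)
    return {value: count / total for value, count in counts.items()}

p = relative_freq(temperature)
print(p)
```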

## Entropy

The entropy $H(S)$ of an attribute $S$ is defined by
$$H(S) = -\sum_{i=1}^k p_i \log_2(p_i)$$
where $p_1, \ldots, p_k$ are the probabilities (relative frequencies) of the values of $S$.
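
As a worked instance of the formula, the sketch below evaluates the entropy of the Temperature distribution from Exercise 1 ($6/14$, $4/14$, $4/14$) using only the `math` module. The helper name `entropy_bits` is hypothetical; it skips zero probabilities, following the usual convention $0 \log_2 0 = 0$.

```python
import math

def entropy_bits(probs):
    """Shannon entropy in bits; terms with p == 0 contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

p_temperature = [6 / 14, 4 / 14, 4 / 14]  # mild, hot, cool
h_temperature = entropy_bits(p_temperature)
print(h_temperature)              # about 1.5567 bits
print(entropy_bits([0.5, 0.5]))   # a fair coin carries exactly 1 bit
```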

## Exercise 2

Write a function `info` that calculates the entropy for a probability distribution.


In [3]:
def info(p):
    return -np.sum(p * np.log2(p))

print(info(p_temp)) # Expected answer: 1.5566567074628228


1.5566567074628228


It's more convenient to combine `info` and `freq` into a single function.

In [4]:
def entropy(series):
    return info(freq(series))

Use this function to calculate the entropy of the `Play` attribute. (Answer: 0.94 bits)

In [5]:
print(entropy(df['Play']))

0.9402859586706309


## Split Entropy

Let $T$ and $A$ be attributes. The *split entropy* $H(T, A)$ is the weighted average entropy of $T$ within the groups obtained by splitting on the values of $A$, where each group is weighted by its share of the rows.

For example, let's calculate $H(\text{Play}, \text{Outlook})$. We split the dataset into groups based on the value of `Outlook`.

In [6]:
grouped = df.groupby('Outlook')
play = grouped['Play']

Next, we calculate the entropy of each group.

In [7]:
h = play.aggregate(entropy)
print(h)

Outlook
overcast   -0.000000
rainy       0.970951
sunny       0.970951
Name: Play, dtype: float64


Since there are 4 overcast days, 5 rainy days, and 5 sunny days, the split entropy is

$$\frac{4}{14} (0) + \frac{5}{14} (0.97) + \frac{5}{14} (0.97) = 0.69$$
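
The weighted average can be checked numerically. This sketch plugs in the group sizes and the per-group entropies printed above; the variable names are illustrative only.

```python
# Group sizes and entropies of Play within each Outlook group, as printed above.
sizes = {'overcast': 4, 'rainy': 5, 'sunny': 5}
entropies = {'overcast': 0.0, 'rainy': 0.970951, 'sunny': 0.970951}

total = sum(sizes.values())
# Weight each group's entropy by its share of the rows.
h_split = sum(sizes[g] / total * entropies[g] for g in sizes)
print(round(h_split, 4))  # 0.6935
```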

## Exercise 3

Write a function to calculate split entropy. You may use the code from the Split Entropy section as a starting point.



In [8]:
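def split_entropy(df, T, A):
    # Note: in this function T is the column to split on and A is the
    # column whose entropy is measured, so split_entropy(df, 'Outlook', 'Play')
    # computes H(Play, Outlook) in the notation above.
    groups = df.groupby(T)[A]
    h = groups.aggregate(entropy)  # entropy of A within each group
    # Weight each group's entropy by the group size.
    return np.average(h, weights=groups.size())

print(split_entropy(df, 'Outlook', 'Play')) # Expected answer: 0.6935361388961919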

0.693536138896


## Information Gain

The **information gain** $IG(T, A)$ is the reduction in the entropy of $T$ after splitting on $A$. It is defined by

$$IG(T, A) = H(T) - H(T, A).$$

For example, if we split on the `Outlook` attribute, then the entropy decreases from 0.94 to 0.69, so the information gain is $0.94 - 0.69 = 0.25$.

## Exercise 4

Write a function to calculate information gain. Use the `entropy` and `split_entropy` functions.

In [9]:
def information_gain(df, T, A):
    # Here T is the column to split on and A is the target column.
    return entropy(df[A]) - split_entropy(df, T, A)

print(information_gain(df, 'Outlook', 'Play')) # Expected answer: 0.246749819774439

0.246749819774


Which attribute gives the greatest information gain?

## Gain Ratio

The gain ratio is the information gain from splitting on an attribute, divided by the information in the split, that is, the entropy of the attribute $A$ being split on.
The formula is
$$GR(T, A) = \frac{IG(T, A)}{H(A)}.$$
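
To see the ratio at work for `Outlook`, the sketch below takes the information gain reported above (about 0.2467) and divides it by the information in the split, computed here as the entropy of the Outlook group sizes (4 overcast, 5 rainy, 5 sunny). Reading "information in the split" as the entropy of the split attribute is an assumption of this sketch, and `entropy_bits` is a hypothetical helper.

```python
import math

def entropy_bits(probs):
    """Shannon entropy in bits; terms with p == 0 contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

outlook_counts = [4, 5, 5]             # overcast, rainy, sunny
n = sum(outlook_counts)
split_info = entropy_bits([c / n for c in outlook_counts])

ig_outlook = 0.246749819774439         # information gain for Outlook, from above
gr_outlook = ig_outlook / split_info

print(round(split_info, 4))  # 1.5774
print(round(gr_outlook, 4))  # 0.1564
```

The result agrees with the value listed for `Outlook` in Exercise 5.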

## Exercise 5
Write a function to calculate the gain ratio. Which attribute has the highest gain ratio?

In [10]:
def gain_ratio(df, T, A):
    # T is the column to split on, so entropy(df[T]) is the information in the split.
    return information_gain(df, T, A) / entropy(df[T])

for attr in ('Outlook', 'Temperature', 'Humidity', 'Windy'):
    print("%-15s %f" % (attr, gain_ratio(df, attr, 'Play')))

Outlook         0.156428
Temperature     0.018773
Humidity        0.151836
Windy           0.048849


Expected answers:

    Outlook         0.156428
    Temperature     0.018773
    Humidity        0.151836
    Windy           0.048849

## Decision trees in scikit-learn

The decision tree classifier in scikit-learn expects the attributes to be numeric, so the categorical columns must be encoded first. Here the target attribute is also encoded as integers from 0 to $n-1$, where $n$ is the number of classes.

Let's drop the ID column, and recode the other columns as integers.

In [11]:
df = df.drop('ID', axis=1)

codes = {
    'rainy':  0, 'overcast': 1, 'sunny': 2,
    'cool':   0, 'mild':     1, 'hot': 2,
    'normal': 0, 'high':     1,
    'no':     0, 'yes': 1
}

df = df.replace(codes)
df

Unnamed: 0,Outlook,Temperature,Humidity,Windy,Play
0,2,2,1,0,0
1,2,2,1,1,0
2,1,2,1,0,1
3,0,1,1,0,1
4,0,0,0,0,1
5,0,0,0,1,0
6,1,0,0,1,1
7,2,1,1,0,0
8,2,0,0,0,1
9,0,1,0,0,1


Now we use the decision tree classifier in sklearn to build our model.

In [12]:
from sklearn import tree
X = df.drop('Play', axis=1)
y = df['Play']
clf = tree.DecisionTreeClassifier()
clf.fit(X, y)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best')

The decision tree can be exported in `dot` format, and rendered using `GraphViz`. Alternatively, the tree can be rendered in Python using the `pydotplus` package.

In [13]:
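import pydotplus
# Export the fitted tree as DOT source (out_file=None returns it as a string).
dot_data = tree.export_graphviz(clf, out_file=None, feature_names=X.columns)
# Parse the DOT source and render it to a PNG file.
graph = pydotplus.graph_from_dot_data(dot_data)
_ = graph.write_png("decision-tree.png")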

![](decision-tree.png)
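
If GraphViz is not available, a plain-text view of the tree is one alternative. The sketch below is a minimal example under stated assumptions: it uses only the ten encoded rows displayed above (so the resulting tree may differ from the one in the image), and it sets `criterion='entropy'` so that splits are chosen by information gain rather than the default Gini impurity.

```python
from sklearn import tree

feature_names = ['Outlook', 'Temperature', 'Humidity', 'Windy']
# The ten encoded rows displayed above (the full dataset may contain more).
X = [
    [2, 2, 1, 0], [2, 2, 1, 1], [1, 2, 1, 0], [0, 1, 1, 0], [0, 0, 0, 0],
    [0, 0, 0, 1], [1, 0, 0, 1], [2, 1, 1, 0], [2, 0, 0, 0], [0, 1, 0, 0],
]
y = [0, 0, 1, 1, 1, 0, 1, 0, 1, 1]  # Play: 0 = no, 1 = yes

# criterion='entropy' selects splits by information gain instead of Gini impurity.
clf = tree.DecisionTreeClassifier(criterion='entropy', random_state=0)
clf.fit(X, y)

# export_text returns an indented, plain-text description of the fitted tree.
report = tree.export_text(clf, feature_names=feature_names)
print(report)
```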