# Decision Tree Learning

We will use an invented dataset to explore decision tree learning. The dataset contains weather observations for 14 days, and the task is to predict whether a given day is good for playing tennis. This example is from [Induction of Decision Trees](http://hunch.net/~coms-4771/quinlan.pdf) by J.R. Quinlan, published in 1986.

## Load the Weather dataset

In [None]:
import pandas as pd
import numpy as np
df = pd.read_csv('weather.csv')
df

## Exercise 1
Write a Python function `freq` to find the relative frequency distribution of an attribute.
Hint: Try the `value_counts` method.

In [None]:
def freq(series):
    # TODO: Your code here!
    return 0

p_temp = freq(df['Temperature'])
print(p_temp)

Expected output:

    mild    0.428571
    cool    0.285714
    hot     0.285714

## Entropy

The entropy $H(S)$ of an attribute $S$ is defined by
$$H(S) = -\sum_{i=1}^k p_i \log_2(p_i)$$
where $p_1, \ldots, p_k$ are the probabilities (relative frequencies) of the values of $S$.
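
As a quick check of the definition, here is a worked example on two made-up distributions (they are illustrations only, not taken from the weather data). A fair coin with probabilities $(\tfrac{1}{2}, \tfrac{1}{2})$ has
$$H = -\left(\tfrac{1}{2} \log_2 \tfrac{1}{2} + \tfrac{1}{2} \log_2 \tfrac{1}{2}\right) = 1 \text{ bit},$$
and a three-valued attribute with probabilities $(\tfrac{1}{2}, \tfrac{1}{4}, \tfrac{1}{4})$ has
$$H = -\left(\tfrac{1}{2} \log_2 \tfrac{1}{2} + \tfrac{1}{4} \log_2 \tfrac{1}{4} + \tfrac{1}{4} \log_2 \tfrac{1}{4}\right) = 0.5 + 0.5 + 0.5 = 1.5 \text{ bits}.$$
An attribute that always takes the same value has entropy 0, using the usual convention $0 \log_2 0 = 0$.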

## Exercise 2

Write a function `info` that calculates the entropy for a probability distribution.


In [None]:
def info(p):
    # TODO: Your code here!
    return 0

print(info(p_temp)) # Expected answer: approximately 1.5567
# Note: p_temp was calculated in the previous cell.

It's more convenient to combine `info` and `freq` into a single function.

In [None]:
def entropy(series):
    return info(freq(series))

Use this function to calculate the entropy of the `Play` attribute. (Answer: 0.94 bits)

In [None]:
# TODO: Your code here!


## Split Entropy

Let $T$ and $A$ be attributes. The *split entropy* $H(T, A)$ is the weighted average entropy of $T$ when we split on the values of $A$.
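
One way to write this weighted average as a formula is shown below. The notation is introduced here only for illustration: $S$ is the full dataset, $S_v$ is the subset of rows where $A = v$, and $H(T \mid A = v)$ is the entropy of $T$ within that subset.
$$H(T, A) = \sum_{v \in \text{values}(A)} \frac{|S_v|}{|S|} \, H(T \mid A = v)$$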

For example, let's calculate $H(\text{Play}, \text{Outlook})$. We split the dataset into groups based on the value of `Outlook`.

In [None]:
grouped = df.groupby('Outlook')
play = grouped['Play']

Next, we calculate the entropy of each group.

In [None]:
h = play.aggregate(entropy)
print(h)

Since there are 4 overcast days, 5 rainy days, and 5 sunny days, the split entropy is

$$\frac{4}{14} (0) + \frac{5}{14} (0.97) + \frac{5}{14} (0.97) = 0.69$$
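
The arithmetic above can be reproduced with a short standard-library sketch. The group sizes come from the text. The per-group fractions of `yes` days are an assumption: they are not stated above, and 3/5 and 2/5 were chosen because both give a group entropy of about 0.97. The helper name `binary_entropy` is hypothetical.

```python
from math import log2

def binary_entropy(p):
    """Entropy in bits of a two-valued distribution (p, 1 - p)."""
    if p in (0.0, 1.0):
        return 0.0  # convention: 0 * log2(0) = 0
    return -(p * log2(p) + (1 - p) * log2(1 - p))

# Group sizes from the text: 4 overcast, 5 rainy, 5 sunny days.
sizes = {'overcast': 4, 'rainy': 5, 'sunny': 5}

# Assumed fraction of 'yes' days per group (not stated in the text;
# chosen so the group entropies are 0, 0.97 and 0.97).
p_yes = {'overcast': 1.0, 'rainy': 3 / 5, 'sunny': 2 / 5}

total = sum(sizes.values())
# Weight each group's entropy by its share of the rows.
weighted = sum(sizes[g] / total * binary_entropy(p_yes[g]) for g in sizes)
print(round(weighted, 4))
```

The hard-coded numbers are only a check on the hand calculation; Exercise 3 asks for the same quantity computed from the DataFrame.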

## Exercise 3

Write a function to calculate split entropy. You may use the code from the Split Entropy section as a starting point.



In [None]:
def split_entropy(df, T, A):
    # TODO: Your code here!
    return 0

print(split_entropy(df, 'Play', 'Outlook')) # Expected answer: 0.6935361388961919

## Information Gain

The **information gain** $IG(T, A)$ is the reduction in the entropy of $T$ after splitting on $A$. It is defined by

$$IG(T, A) = H(T) - H(T, A).$$

For example, if we split on the `Outlook` attribute, then the entropy decreases from 0.94 to 0.69, so the information gain is $0.94 - 0.69 = 0.25$.

## Exercise 4

Write a function to calculate information gain. Use the `entropy` and `split_entropy` functions.

In [None]:
def information_gain(df, T, A):
    # TODO: Your code here!
    return 0

# T is the target attribute ('Play'); A is the attribute we split on.
print(information_gain(df, 'Play', 'Outlook')) # Expected answer: 0.246749819774439

Which attribute gives the greatest information gain?

In [None]:
# TODO: Your code here!

## Gain Ratio

The gain ratio is the information gain from splitting on an attribute, divided by the information in the split, that is, the entropy of the splitting attribute $A$.
The formula is
$$GR(T, A) = \frac{IG(T, A)}{H(A)}.$$

## Exercise 5
Write a function to calculate the gain ratio. Which attribute has the highest gain ratio?

In [None]:
def gain_ratio(df, T, A):
    # TODO: Your code here!
    return 0

# T is the target attribute ('Play'); A is the attribute we split on.
for attr in ('Outlook', 'Temperature', 'Humidity', 'Windy'):
    print('%-15s%f' % (attr, gain_ratio(df, 'Play', attr)))

Expected answers:

    Outlook         0.156428
    Temperature     0.018773
    Humidity        0.151836
    Windy           0.048849
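
The `Outlook` value can be checked by hand with a small standard-library sketch. It divides the information gain quoted in Exercise 4 by the entropy of `Outlook` itself, computed from the group sizes 4, 5 and 5 given in the Split Entropy section. The helper name `entropy_from_counts` is hypothetical and is not part of the exercises.

```python
from math import log2

def entropy_from_counts(counts):
    """Entropy in bits of the distribution given by raw counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

# Outlook group sizes quoted in the Split Entropy section: 4, 5, 5.
h_outlook = entropy_from_counts([4, 5, 5])

# Information gain for Outlook, copied from the expected answer in Exercise 4.
ig_outlook = 0.246749819774439

# Gain ratio = information gain / entropy of the splitting attribute.
ratio = ig_outlook / h_outlook
print(round(h_outlook, 4), round(ratio, 6))
```

Because `Outlook` has three fairly balanced values, its own entropy is above 1.5 bits, which is why its gain ratio is much smaller than its information gain.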

## Decision trees in scikit-learn

The scikit-learn decision tree classifier expects numeric attribute values, so the categorical attributes must be encoded as numbers before fitting. Here we also encode the target attribute as integers from 0 to $n-1$, where $n$ is the number of classes.

Let's drop the ID column, and recode the other columns as integers.

In [None]:
df = df.drop('ID', axis=1)

codes = {
    'rainy':  0, 'overcast': 1, 'sunny': 2,
    'cool':   0, 'mild':     1, 'hot': 2,
    'normal': 0, 'high':     1,
    'no':     0, 'yes': 1
}

df = df.replace(codes)
df
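
A single global `replace` works here because, as far as the `codes` dictionary shows, no label means different things in different columns. If that were not the case, a per-column mapping would be a more defensive alternative. The sketch below illustrates the idea on a hypothetical two-row frame; the names `toy`, `column_codes` and `encoded` are made up for this example and are not part of the notebook.

```python
import pandas as pd

# Hypothetical two-row frame standing in for the weather data.
toy = pd.DataFrame({
    'Outlook': ['sunny', 'rainy'],
    'Humidity': ['high', 'normal'],
})

# One mapping per column, so a label is only recoded in its own column.
column_codes = {
    'Outlook': {'rainy': 0, 'overcast': 1, 'sunny': 2},
    'Humidity': {'normal': 0, 'high': 1},
}

encoded = toy.copy()
for column, mapping in column_codes.items():
    # Series.map looks each value up in the dictionary;
    # labels missing from the mapping would become NaN.
    encoded[column] = toy[column].map(mapping)

print(encoded)
```

Unmapped labels turning into `NaN` is a useful side effect, since typos in the data show up immediately instead of passing through unchanged.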

Now we use the decision tree classifier in sklearn to build our model.

In [None]:
from sklearn import tree
X = df.drop('Play', axis=1)
y = df['Play']
clf = tree.DecisionTreeClassifier(criterion='entropy', max_depth=3)
clf.fit(X, y)

## Making predictions on new data

Let's create two new days and see what the model predicts. The first day is sunny and mild with normal humidity and calm wind. The second day is rainy and cool with high humidity and high wind. The model predicts that the first day is good for tennis, but the second day is not.

In [None]:
new_days = pd.DataFrame(
    [['sunny', 'mild', 'normal', 'no'],
     ['rainy', 'cool', 'high', 'yes']],
    columns=X.columns  # use the same feature names as the training data
).replace(codes)

print(clf.predict(new_days))

## Rendering the decision tree

The decision tree can be exported in `dot` format and rendered using Graphviz. Alternatively, the tree can be rendered in Python using the `pydotplus` package.

In [None]:
import pydotplus
# Export the fitted tree as a dot-format string (out_file=None returns the string).
dot_data = tree.export_graphviz(clf, out_file=None, feature_names=list(X.columns))
graph = pydotplus.graph_from_dot_data(dot_data)
graph.write_png("decision-tree-entropy.png")

![](decision-tree-entropy.png)
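
If Graphviz is not installed, a fitted tree can also be printed as plain text with `tree.export_text`. The sketch below is self-contained and uses hypothetical toy data (two integer-coded features, four rows) rather than the weather dataset; the feature names and the names `X_toy`, `y_toy` and `toy_clf` are chosen only for illustration.

```python
from sklearn import tree

# Hypothetical toy data: two integer-coded features, four rows.
X_toy = [[0, 0], [0, 1], [1, 0], [1, 1]]
y_toy = [0, 0, 1, 1]  # the label depends only on the first feature

toy_clf = tree.DecisionTreeClassifier(criterion='entropy', random_state=0)
toy_clf.fit(X_toy, y_toy)

# export_text returns the tree as an indented, human-readable string.
rules = tree.export_text(toy_clf, feature_names=['Outlook', 'Windy'])
print(rules)
```

In the notebook itself, the same call should work on the fitted `clf` with `feature_names=list(X.columns)`.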