# Using Decision Trees: detailed example

The _Play Tennis_ example data (from Mitchell, 1997, _Machine Learning_) used in the notes for Week 8 is a nice, relatively small example.

The predictors are discrete-valued labels, making the decision splits easier to understand.

The first step is to read the data.

In [None]:
import math
import pandas as pd
import pprint
pp = pprint.PrettyPrinter(indent=2)
dataDir = "data"
df = pd.read_csv(dataDir+"/playTennis.csv",header=0,quotechar="'")
print(df)

For convenience, I have created a function `calcH` that builds the expression for the _entropy_ function `H`.

It looks more complex than it is, mainly because I assemble the calculation in a string for teaching purposes. This string is returned to the caller, which `eval`uates it to obtain a number.

Note that `H` has two forms. At Level 1 it applies to the overall set: the arguments `predNameB` and `predLabelB` are each empty strings and `pd` is the number of rows in the overall set. At Level 2 it is conditional on a setting of one of the predictors, specified by `predNameB` and `predLabelB`, and `pd` is the number of rows with that setting. (Inside `calcH`, the argument `pd` is a row count; it shadows the pandas alias `pd` within the function only.)

If a particular predictor setting does not arise for a given class setting, the row count `pn` is zero and the log term needs special treatment. The log of zero is minus infinity, but the log term is itself multiplied by zero, so the product is taken as zero.
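
The zero-count convention can also be sketched as a plain numeric function, without building a string. This is only an illustration: the name `entropy` and its list-of-counts argument are my own assumptions and are not used elsewhere in this notebook.

```python
import math

def entropy(counts):
    """Shannon entropy (in bits) of a class distribution given as row counts."""
    total = sum(counts)
    h = 0.0
    for n in counts:
        if n == 0:
            continue  # convention: 0 * log2(0) is taken as 0
        p = n / total
        h -= p * math.log2(p)
    return h

print(entropy([1, 1]))  # two equally likely classes: 1 bit
print(entropy([4, 0]))  # a pure node: 0 bits
```

Skipping the zero counts has the same effect as the special case in `calcH`, where the log factor is left out of the string.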

In [None]:
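def calcH(predNameB,predLabelB,pd):
  # Build (as a string) the entropy expression: the sum over class labels of -p*log2(p).
  # Note: the argument pd is the denominator row count, not the pandas module.
  acc = ""
  for classLabel in labels[className]:
    # Name of the rowCount entry to look up; the predictor parts are empty at Level 1
    pns = "rowCount[className]"+predNameB+"[classLabel]"+predLabelB
    pn = eval(pns)
    p = "("+str(pn) +"/" + str(pd)+")"
    if (pn == 0):
      # 0*log(0) is taken as 0, so omit the log factor
      acc += " -" + p + " "
    else:
      acc += " -" + p + " * math.log(" + p + ", 2) "
  return acc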

## Preliminary: Deriving the row counts

We now start to count rows depending on various (combinations of) settings. We use a `rowCount` dictionary to store the counts. The first dictionary element gets the count of `'ALL'` rows.

We then get a list of the column names `colNames` and the name of the class column to be predicted, `className`.

Looping over the column names, we ask what the unique list of labels is for each column name.

We can then loop over each of these labels and count the number of rows for which this column name takes a particular label value.
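
As an aside, pandas can produce the same kind of per-column counts with `value_counts()`. The sketch below uses a small made-up frame (`toy`, not the contents of `playTennis.csv`) so that it runs on its own; the variable names are hypothetical.

```python
import pandas as pd

# Toy data for illustration only; not the contents of playTennis.csv.
toy = pd.DataFrame({
    "outlook": ["sunny", "sunny", "rainy", "overcast"],
    "play":    ["no",    "yes",   "yes",   "yes"],
})

# One dictionary of label -> row count per column, like rowCount[colName].
toy_counts = {col: toy[col].value_counts().to_dict() for col in toy.columns}
print(toy_counts["outlook"])
print(toy_counts["play"])
```

The explicit loop used in this notebook is longer, but it makes each counting step visible.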

In [None]:
rowCount = {}
rowCount["ALL"] = len(df.index)
colNames = list(df.columns.values)
className = "play"
labels = {}
for colName in colNames:
  labels[colName] = df[colName].unique().tolist()
  rowCount[colName] = {}
  for label in labels[colName]:
    rowCount[colName][label] = len(df.loc[df[colName] == label].index)
pp.pprint(rowCount)

For the Decision Tree classifier, we need to go to Level 2 also.

Therefore we also need to split the predictor counts above based on whether the decision was to play or not.

Our first step is to use a _list comprehension_ to return a list of the predictor column names only `predNames`.

We can then loop over just the predictors.

The resulting `rowCount` values depend on both the `className` = `classLabel` and the `predName` = `predLabel` conditions.

We then print the rowCount variable and see that `rowCount` has been extended with these counts.
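
A contingency table gives another view of these two-way counts. The following is a minimal sketch using `pd.crosstab` on a made-up frame (`toy` is illustrative data, not the real file), so the numbers are not those of the Play Tennis set.

```python
import pandas as pd

# Toy data for illustration only; not the contents of playTennis.csv.
toy = pd.DataFrame({
    "outlook": ["sunny", "sunny", "rainy", "overcast"],
    "play":    ["no",    "yes",   "yes",   "yes"],
})

# Rows are class labels, columns are predictor labels, cells are row counts.
table = pd.crosstab(toy["play"], toy["outlook"])
print(table)

# Combinations that never occur are reported as 0, the case that
# needs the special log treatment in the entropy calculation.
print(table.loc["no", "rainy"])
```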


In [None]:
predNames = [x for x in colNames if x != className]
for predName in predNames:
  rowCount[className][predName] = {}
  for classLabel in labels[className]:
    rowCount[className][predName][classLabel] = {}
    for predLabel in labels[predName]:
      rowCount[className][predName][classLabel][predLabel] = len(df.loc[(df[className] == classLabel) & (df[predName] == predLabel)].index)
pp.pprint(rowCount)

## Entropy calculations

The top-level entropy is calculated without any splits by predictor.

It is calculated in terms of the class column ("play" in this case), which takes two values: "yes" and "no". For a two-valued class the entropy lies between 0 (the class is perfectly predictable) and 1 bit (the two values are equally likely).

Note that the `calcH` function is called with empty strings for `predNameB`  and `predLabelB` and the `rowCount` for ALL settings.

Also, the string returned by `calcH` needs to be evaluated to a number.

We print both the `H` expression (as a string) and its value (as a number).

In [None]:
H = {}
H[className] = {}
acc = calcH(predNameB="",predLabelB="",pd=rowCount["ALL"])
H[className][""] = eval(acc)

print(acc)
print(H[className][""])

We now consider the effects of splitting by each of the predictors in turn.

In each case we weight by `p`, which is the proportion of rows taking a particular setting of the predictor.

The other term is the `H` of the class variable within that setting. Note that here `calcH` is called with `predNameB` and `predLabelB` that are not just empty strings.

Using the `term` counter, we can tell whether we need to start a new accumulated string or just add to an existing accumulated string.
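
The weighted sum described above can be sketched numerically as follows. The functions `entropy` and `conditional_entropy` and the counts in `toy_counts` are hypothetical, chosen so the result is easy to check by hand; they are not taken from the Play Tennis data.

```python
import math

def entropy(counts):
    """Entropy in bits of a list of class counts (zero counts are skipped)."""
    total = sum(counts)
    return -sum((n / total) * math.log2(n / total) for n in counts if n > 0)

def conditional_entropy(counts_by_label):
    """counts_by_label maps each predictor label to its list of class counts."""
    n_all = sum(sum(c) for c in counts_by_label.values())
    # Weight each setting's entropy by the proportion of rows with that setting.
    return sum((sum(c) / n_all) * entropy(c) for c in counts_by_label.values())

# Hypothetical counts: label "a" splits the classes evenly, label "b" is pure.
toy_counts = {"a": [2, 2], "b": [4, 0]}
h_top = entropy([6, 2])                      # entropy before splitting
h_split = conditional_entropy(toy_counts)    # (4/8)*1 + (4/8)*0
print(h_split)
print(h_top - h_split)                       # reduction in entropy from the split
```

The difference `h_top - h_split` is usually called the information gain; choosing the predictor with the smallest split entropy is the same as choosing the one with the largest gain.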

In [None]:
for predName in predNames:
  term = 0
  predNameB = '["'+predName+'"]'
  for predLabel in labels[predName]:
    predLabelB = '["'+predLabel+'"]'
    p = "(" + str(rowCount[predName][predLabel]) + "/" + str(rowCount["ALL"]) + ") * "
    a = p + "("+calcH(predNameB,predLabelB,pd=rowCount[predName][predLabel])+")"
    if (term == 0):
      acc = a
    else:
      acc = acc + " + " + a
    term += 1
  H[className][predName] = eval(acc)

  print(acc)
  print("Splitting on {} gives entropy {}".format(predName,H[className][predName]))

So which predictor should we use in our node, to split the data to get the smallest entropy over the available predictors?

It seems that the `outlook` variable is the one to choose!

In [None]:
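# Search the predictors only, so the top-level entry H[className][""] can never be selected
splitVariable = min(predNames, key=lambda name: H[className][name])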
print("The first split should be on the _{}_ variable, reducing the entropy from {} to {}".format(splitVariable,H[className][""],H[className][splitVariable]))

Note that this process can be continued as necessary, stopping when all leaves are _pure_ (have entropy = 0). According to https://codefying.com/2015/03/09/decision-tree-classifier-part-1/, the resulting decision tree is shown below:

![View of Play Tennis decision tree](https://codefying.files.wordpress.com/2015/03/mltree.jpg)

Note that the `temp` predictor does not appear in the decision tree: it is not needed to separate the classes.

## Using sklearn to derive the decision tree

Note that the treatment above is intended to help understanding. There is no need to program Decision Tree classifiers yourself!

`scikit-learn` offers a `DecisionTreeClassifier` with the same sort of API as other supervised learning algorithms, alongside settings that are specific to Decision Trees.

One awkward feature of the sklearn implementation is that the predictor labels need to be coded as numbers (otherwise you get a "could not convert string to float" error). While there are tools to do this, it does mean that the resulting decision tree is not as easy to "read".



In [None]:
from sklearn import tree
features = ['outlook', 'temp', 'humidity', 'windy']
X = df[features].copy()
y = df['play'].copy()

Working with the copies `X` and `y`, we need to encode them as integers. Be careful when looking at code others have shared on the web. Older versions of sklearn did not have an `OrdinalEncoder`, so people used `LabelEncoder` instead. `LabelEncoder` assigns codes in sorted (alphabetical) label order, which usually does not respect the natural ordering of ordinal features. This is a problem for Decision Trees, whose splits are thresholds on the encoded values.
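
The difference between the two encoders can be seen on a single ordinal feature. This is a small sketch: the array `temps` and the variable names are made up for illustration, and the category order `cool < mild < hot` is the one assumed for the `temp` column in this notebook.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

temps = np.array(["cool", "mild", "hot"])

# LabelEncoder sorts the labels, so 'hot' is coded between 'cool' and 'mild'.
alphabetical = LabelEncoder().fit_transform(temps)

# OrdinalEncoder with explicit categories keeps cool < mild < hot.
ordered = OrdinalEncoder(categories=[["cool", "mild", "hot"]]).fit_transform(
    temps.reshape(-1, 1)
).ravel()

print(alphabetical)  # codes for cool, mild, hot in alphabetical coding
print(ordered)       # codes for cool, mild, hot in the stated order
```

With the alphabetical coding, a threshold split such as "code <= 0.5" still isolates `cool`, but no single threshold can separate `hot` from the other two values.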

In [None]:
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OrdinalEncoder
le = LabelEncoder()
oe = OrdinalEncoder(categories=
                               [['rainy','overcast','sunny'],
                                ['cool','mild','hot'],
                                ['normal','high'],
                                ['low','high']]).fit(X)
oe

Now apply the encodings to the variables, generating XE and yE from X and y respectively.

In [None]:
yE = pd.Series(le.fit_transform(y))
XE = pd.DataFrame(data=oe.transform(X), columns=features)

Now look at the encoded features. Note that the encoded features respect the ordinal nature of the original features.

In [None]:
XE

Now fit the data using the sklearn DecisionTree classifier, using the encoded features and labels.

In [None]:
clf = tree.DecisionTreeClassifier()
clf = clf.fit(XE, yE)
clf
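
Once fitted, a classifier is used through `predict()`. The sketch below shows this on a tiny made-up, already-encoded data set (`X_toy` and `y_toy` are illustrative, not the Play Tennis data), where a single threshold is enough to separate the classes.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Toy, already-encoded data for illustration only (one ordinal feature).
X_toy = np.array([[0.0], [1.0], [2.0], [3.0]])
y_toy = np.array([0, 0, 1, 1])

toy_clf = DecisionTreeClassifier(random_state=0).fit(X_toy, y_toy)

# Predict the class for the smallest and largest feature values.
toy_pred = toy_clf.predict(np.array([[0.0], [3.0]]))
print(toy_pred)
print(toy_clf.get_depth())  # number of split levels in the fitted tree
```

Predictions come back as encoded class values; with a fitted `LabelEncoder` they can be mapped back to the original labels.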

## Exercise

Now that we have classified the data, here are some tasks for you:

1. Introduce an 80:20 train-test split and comment on the prediction error, size of data set, properties of decision trees, etc.
2. Use the `inverse_transform()` method to make the tree easier to "read".
3. Using [sklearn advice on using decision tree classifiers](https://scikit-learn.org/stable/modules/tree.html), investigate `export_graphviz()` and `export_text()` as two ways to visualise the tree output.