# Decision Trees

In this hands-on we will look at how to create a decision tree using Scikit-Learn. We will implement the tennis decision problem that we saw in the lecture.

Our data is in the file tennis.csv, which contains 14 data rows plus one header row at the top of the file:

***

Outlook,Temp,Humidity,Wind,Decision

Sunny,Hot,High,Weak,No

Sunny,Hot,High,Strong,No

Overcast,Hot,High,Weak,Yes

Rain,Mild,High,Weak,Yes

Rain,Cool,Normal,Weak,Yes

Rain,Cool,Normal,Strong,No

Overcast,Cool,Normal,Strong,Yes

Sunny,Mild,High,Weak,No

Sunny,Cool,Normal,Weak,Yes

Rain,Mild,Normal,Weak,Yes

Sunny,Mild,Normal,Strong,Yes

Overcast,Mild,High,Strong,Yes

Overcast,Hot,Normal,Weak,Yes

Rain,Mild,High,Strong,No

---

For convenience we will load the data into a Pandas DataFrame, which lets us access each column by its column name, much like a dict.
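As a minimal sketch of accessing columns by name, the snippet below reads the first three rows of the data from an in-memory string instead of from tennis.csv. The names `csv_text` and `sample` are illustrative only and are not used elsewhere in this hands-on.

```python
import io
import pandas as pd

# First three data rows of tennis.csv, inlined so the example
# runs without the file being present
csv_text = """Outlook,Temp,Humidity,Wind,Decision
Sunny,Hot,High,Weak,No
Sunny,Hot,High,Strong,No
Overcast,Hot,High,Weak,Yes
"""

sample = pd.read_csv(io.StringIO(csv_text))

print(list(sample.columns))        # column names taken from the header row
print(sample['Outlook'].tolist())  # one column, accessed by its name
```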

The DecisionTreeClassifier we are using accepts only numeric values, so we have to turn all our text values (Overcast, Hot, etc.) into numeric labels. We will use the LabelEncoder to do this.
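To see what the encoder does, here is a small sketch on a made-up list of outlook values. The names `outlook`, `enc` and `codes` are illustrative. LabelEncoder sorts the distinct labels and replaces each value with its position in that sorted list.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# A short made-up column of text values
outlook = np.array(['Sunny', 'Overcast', 'Rain', 'Sunny'])

enc = LabelEncoder()
codes = enc.fit_transform(outlook)

print(enc.classes_)                  # distinct labels, in sorted order
print(codes)                         # position of each value within classes_
print(enc.inverse_transform(codes))  # numeric codes mapped back to text
```

Note that the numeric codes carry no meaning beyond identifying the label: the ordering is alphabetical, not a ranking.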

Although we only have 14 training samples, we will set aside 5 for testing and use 9 for training.

As always we import all the modules that we need. Here we will use accuracy_score to measure how accurate our decision making is, based on historical data.

In [None]:
import numpy as np                                                              
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
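
As a quick illustration of what accuracy_score reports, the sketch below compares two short made-up lists. The values in `actual` and `predicted` are invented for this example and are not taken from the tennis data.

```python
from sklearn.metrics import accuracy_score

actual = [1, 0, 1, 1]      # made-up ground truth
predicted = [1, 0, 0, 1]   # made-up predictions; the third one is wrong

# Fraction of positions where the prediction matches the ground truth
score = accuracy_score(actual, predicted)
print(score)  # 3 correct out of 4
```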

We begin by loading our data into a Pandas DataFrame. We then create one LabelEncoder for each column, and use fit_transform to learn the labels in each column and change them to numeric values:

In [None]:
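# Raw string so the regex backslashes reach Pandas unchanged;
# the pattern also strips any whitespace around the commas
tennis_data = pd.read_csv('tennis.csv', sep=r'\s*,\s*', engine='python')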
o_l = LabelEncoder()
t_l = LabelEncoder()
h_l = LabelEncoder()
w_l = LabelEncoder()
d_l = LabelEncoder()

t_outlook = o_l.fit_transform(tennis_data['Outlook'])
t_temp = t_l.fit_transform(tennis_data['Temp'])
t_humid = h_l.fit_transform(tennis_data['Humidity'])
t_wind = w_l.fit_transform(tennis_data['Wind'])
t_decision = d_l.fit_transform(tennis_data['Decision'])


The t_outlook column now looks like this (the other columns will look similar):

In [None]:
print(t_outlook)

Each encoded column is now a one-dimensional array. We need to reshape each one into a column vector, then concatenate the column vectors side by side to form the input to the decision tree. We also convert the target into a column vector, and use train_test_split as before to put aside 5 samples for testing.
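
The reshape-and-concatenate step can be sketched on two tiny made-up arrays. The names `a`, `b`, `X` and `X_alt` are illustrative. `np.column_stack` is shown as an equivalent shortcut; it is not what this hands-on uses.

```python
import numpy as np

a = np.array([2, 2, 0])   # stand-in for one encoded column
b = np.array([1, 1, 1])   # stand-in for another encoded column

print(a.shape)                 # one-dimensional: (3,)
print(a.reshape(-1, 1).shape)  # column vector: (3, 1)

# Join the column vectors side by side: one row per sample
X = np.concatenate((a.reshape(-1, 1), b.reshape(-1, 1)), axis=1)

# column_stack performs the reshape and the concatenation in one call
X_alt = np.column_stack((a, b))

print(X)
```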

(Note: If you want to specify the actual number of testing samples, supply an integer to the test_size argument of train_test_split. If you want to specify a proportion, supply a float between 0.0 and 1.0. So:

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 5) # Put aside 5 samples for testing.

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.05) # Put aside 5 percent of samples for testing.)
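
The difference between an integer and a float test_size can be checked on dummy data. This is a sketch on 14 made-up samples, not the tennis data, and the names `X_test_int`, `X_test_frac` and `rejected` are illustrative. A float must lie strictly between 0.0 and 1.0, so a value such as 5.0 is rejected.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(28).reshape(14, 2)   # 14 dummy samples with 2 features each
Y = np.arange(14)                  # 14 dummy targets

# Integer: an absolute number of test samples
_, X_test_int, _, _ = train_test_split(X, Y, test_size=5)

# Float: a proportion of the data set
_, X_test_frac, _, _ = train_test_split(X, Y, test_size=0.25)

print(len(X_test_int))    # 5 samples
print(len(X_test_frac))   # 25 percent of 14, rounded up by scikit-learn

# A float outside the open interval (0, 1) is an error
rejected = False
try:
    train_test_split(X, Y, test_size=5.0)
except ValueError:
    rejected = True
print(rejected)
```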


In [None]:
labels = np.concatenate((t_outlook.reshape(-1, 1),
                         t_temp.reshape(-1, 1),
                         t_humid.reshape(-1, 1),
                         t_wind.reshape(-1, 1)), axis = 1)

targets = t_decision.reshape(-1, 1)

X_train, X_test, Y_train, Y_test = train_test_split(labels, targets,
                                                    test_size = 5)

Our training input and target values now look like this:

In [None]:
print("Training input:")
print(X_train)
print("Training targets:")
print(Y_train)

Now let's train our classifier:

In [None]:
clf = DecisionTreeClassifier()
clf.fit(X_train, Y_train)

train_predict = clf.predict(X_train).reshape(-1,1)
test_predict = clf.predict(X_test).reshape(-1,1)
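
A fitted tree can also be printed as text rules, which helps when checking what the classifier has learned. The sketch below uses `export_text` from `sklearn.tree` on a toy data set. The data, the feature names `f0` and `f1`, and the name `toy_clf` are assumptions made for illustration and are not part of the tennis example.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy data: the first feature alone decides the class,
# the second feature carries no information
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 1, 1])

toy_clf = DecisionTreeClassifier(random_state=0)
toy_clf.fit(X, y)

# Render the learned splits as indented text rules
rules = export_text(toy_clf, feature_names=['f0', 'f1'])
print(rules)
```

The same call should work on the tennis classifier by passing `clf` and the four attribute names, although the numeric thresholds would then refer to the encoded labels rather than to the original text values.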

We will now measure the accuracy of our classifier. Let's also look at the decisions made by the classifier on the test samples and compare them against the historical data ("Overcast" was shortened to "Ovrcst" for formatting reasons).

In [None]:
train_perf = accuracy_score(Y_train, train_predict)
test_perf = accuracy_score(Y_test, test_predict)

print("Train accuracy: %3.2f, Test accuracy: %3.2f" % (train_perf, test_perf))
# For convenience we transpose the X_test matrix so that each row
# holds one attribute across all test samples
X_trans = X_test.transpose()

# Get the labels
X_labels = [o_l.inverse_transform(X_trans[0])]
X_labels.append(t_l.inverse_transform(X_trans[1]))
X_labels.append(h_l.inverse_transform(X_trans[2]))
X_labels.append(w_l.inverse_transform(X_trans[3]))

# Flatten the results vectors to suppress complaints from
# LabelEncoder
X_labels.append(d_l.inverse_transform(np.ravel(test_predict)))
X_labels.append(d_l.inverse_transform(np.ravel(Y_test)))

# Transpose it back to num_samples x num_columns
results = np.array(X_labels).transpose()

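# Print the attribute names only; the Decision column is covered
# by the Predicted and Actual headings below
for cname in tennis_data.columns[:-1]:
    print(cname + '\t', end = '')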

print("Predicted\tActual")
print("----------------------------------------------------------------------------------------------\n")

for row in results:
    for col in row:
        print("%s\t\t" % col, end = '')
    print()
print()                 

Since train_test_split shuffles the data randomly, a different set of training and testing data is used on each run, so you will get a different result each time. This is especially so here because of the tiny training set of just 9 samples. Nonetheless this is a simple example of how you can use decision trees.
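
If you want the same split on every run, one option is to pass a fixed random_state to train_test_split. The sketch below uses dummy data, and the seed value 42 is an arbitrary choice. The names `split_a`, `split_b` and `same` are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(28).reshape(14, 2)   # 14 dummy samples with 2 features each
Y = np.arange(14)                  # 14 dummy targets

# Two separate calls with the same seed
split_a = train_test_split(X, Y, test_size=5, random_state=42)
split_b = train_test_split(X, Y, test_size=5, random_state=42)

# Compare X_train, X_test, Y_train and Y_test pairwise
same = all(np.array_equal(p, q) for p, q in zip(split_a, split_b))
print(same)
```

DecisionTreeClassifier accepts a random_state argument as well, which fixes the randomness used inside the tree-building step.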