COMP 472 - Decision Trees COMP 6721 Applied Artificial Intelligence

Write a Python program that uses scikit-learn’s Decision Tree Classifier:

# Question 1 
Given the training instances below, use information theory to find whether
‘Outlook’ or ‘Windy’ is the best feature to decide when to play a game of golf.

In [None]:
import numpy as np
dataset = np.array([
['sunny', 85, 85, 0, 'Don\'t Play'],
['sunny', 80, 90, 1, 'Don\'t Play'],
['overcast', 83, 78, 0, 'Play'],
['rain', 70, 96, 0, 'Play'],
['rain', 68, 80, 0, 'Play'],
['rain', 65, 70, 1, 'Don\'t Play'],
['overcast', 64, 65, 1, 'Play'],
['sunny', 72, 95, 0, 'Don\'t Play'],
['sunny', 69, 70, 0, 'Play'],
['rain', 75, 80, 0, 'Play'],
['sunny', 75, 70, 1, 'Play'],
['overcast', 72, 90, 1, 'Play'],
['overcast', 81, 75, 0, 'Play'],
['rain', 71, 80, 1, 'Don\'t Play'],
])
print(dataset)

In [None]:
import pandas as pd
df2 = pd.DataFrame(dataset,
                   columns=['Outlook', 'Temperature', 'Humidity','Windy','Play / Don’t Play'])
blankIndex=[''] * len(df2)
df2.index=blankIndex
df2
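
For Question 1, the comparison between 'Outlook' and 'Windy' can be checked numerically. The following is a minimal sketch in plain Python, assuming the usual definition of information gain (entropy of the labels minus the weighted entropy after the split). The helper names entropy and information_gain are made up for this illustration and are not part of scikit-learn; the rows repeat only the Outlook, Windy, and label columns of the dataset above.

```python
from collections import Counter
from math import log2

# (Outlook, Windy, label) triples copied from the training instances above
rows = [
    ('sunny', 0, "Don't Play"), ('sunny', 1, "Don't Play"),
    ('overcast', 0, 'Play'), ('rain', 0, 'Play'), ('rain', 0, 'Play'),
    ('rain', 1, "Don't Play"), ('overcast', 1, 'Play'),
    ('sunny', 0, "Don't Play"), ('sunny', 0, 'Play'), ('rain', 0, 'Play'),
    ('sunny', 1, 'Play'), ('overcast', 1, 'Play'), ('overcast', 0, 'Play'),
    ('rain', 1, "Don't Play"),
]

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(rows, feature_index):
    """Label entropy minus the weighted entropy after splitting on one feature."""
    labels = [row[-1] for row in rows]
    remainder = 0.0
    for value in set(row[feature_index] for row in rows):
        subset = [row[-1] for row in rows if row[feature_index] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return entropy(labels) - remainder

gain_outlook = information_gain(rows, 0)  # feature 0 = Outlook
gain_windy = information_gain(rows, 1)    # feature 1 = Windy
print(f"Gain(Outlook) = {gain_outlook:.3f}")
print(f"Gain(Windy)   = {gain_windy:.3f}")
```

Under these assumptions, 'Outlook' gives the larger gain (about 0.247 bits, versus about 0.048 bits for 'Windy'), so it is the better of the two features for the first split.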

EXERCISE 1

In [None]:
from sklearn import tree
from sklearn import preprocessing

For our feature vectors, we need the first four columns and for the training labels, we use the last column from the dataset:


In [None]:
X = dataset[:, 0:4]
y = dataset[:, 4]

However, you will not be able to use the data as-is: All features must be
numerical for training the classifier, so you have to transform the strings into
numbers. Fortunately, scikit-learn has a preprocessing class for label encoding
that we can use:

In [None]:
le = preprocessing.LabelEncoder()
X[:, 0] = le.fit_transform(X[:, 0])
y = le.fit_transform(y)
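
To see which number a label encoder assigns to which string, you can inspect its classes_ attribute. The sketch below uses a few made-up sample values and one encoder per column; the names outlook_encoder and label_encoder are hypothetical. Note that the cell above reuses a single encoder, so after the second fit_transform call it only remembers the mapping for y.

```python
from sklearn import preprocessing

# A few sample values, for illustration only
outlook_values = ['sunny', 'overcast', 'rain', 'sunny']
labels = ['Play', "Don't Play", 'Play', 'Play']

# Hypothetical names: one encoder per column, so neither fit overwrites the other
outlook_encoder = preprocessing.LabelEncoder()
label_encoder = preprocessing.LabelEncoder()

outlook_codes = outlook_encoder.fit_transform(outlook_values)
label_codes = label_encoder.fit_transform(labels)

# classes_ holds the original strings in sorted order; the position is the code
outlook_mapping = {str(name): code
                   for code, name in enumerate(outlook_encoder.classes_)}
print(outlook_mapping)

# inverse_transform maps the integer codes back to the original strings
print(list(label_encoder.inverse_transform(label_codes)))
```

Because the classes are sorted alphabetically, 'overcast' becomes 0, 'rain' becomes 1, and 'sunny' becomes 2.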

Now you can create a classifier object:

In [None]:
dtc = tree.DecisionTreeClassifier(criterion="entropy")

Note that we are using the entropy option for building the tree, which is the
method we’ve studied in class and used on the exercise sheet. Train the classifier
to build the tree:


In [None]:
dtc.fit(X, y)

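Now you can predict the label of a new instance using dtc.predict(test), where test is a
list of encoded feature vectors. Note: if you want the string output that
you transformed above, you can use the inverse_transform(predict) function
of the label encoder that was fitted on y to change the predicted result back into a string.
In the example below, the Outlook value 2 stands for 'sunny', because the label encoder
numbers the classes in sorted order.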

In [None]:
y_pred = dtc.predict([[2, 81, 95, 1]])
print("Predicted output: ", le.inverse_transform(y_pred))

You can also print the tree:

In [None]:
tree.plot_tree(dtc)

The plot can be a bit hard to read; to get a prettier version you can use the
Graphviz visualization software, which you can call from Python like this:

In [None]:
import graphviz
dot_data = tree.export_graphviz(dtc, out_file=None,
                                feature_names=['Outlook', 'Temperature', 'Humidity', 'Windy'],
                                class_names=le.classes_,
                                filled=True, rounded=True)
graph = graphviz.Source(dot_data)
graph.render("mytree1")

The result will be stored in a file mytree1.pdf
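
If Graphviz is not available, a plain-text view of the tree is one possible alternative. The sketch below uses scikit-learn's tree.export_text function on a small, already-encoded sample; the six rows and the encoding (overcast=0, rain=1, sunny=2) are assumptions made for this illustration, and clf is a hypothetical name.

```python
from sklearn import tree

# Numerically encoded toy data (assumed encoding: overcast=0, rain=1, sunny=2)
X = [[2, 85, 85, 0], [2, 80, 90, 1], [0, 83, 78, 0], [1, 70, 96, 0],
     [1, 65, 70, 1], [0, 64, 65, 1]]
y = [0, 0, 1, 1, 0, 1]  # 0 = Don't Play, 1 = Play

clf = tree.DecisionTreeClassifier(criterion="entropy", random_state=0)
clf.fit(X, y)

# export_text returns the learned decision rules as an indented string
rules = tree.export_text(
    clf, feature_names=['Outlook', 'Temperature', 'Humidity', 'Windy'])
print(rules)
```

Each indented line shows a threshold test on one feature, and each leaf ends with the predicted class.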

EXERCISE 2 

Let’s start working with some “real” data: scikit-learn comes with a number of
example datasets, including the Iris flower dataset. If you do not know this
dataset, start by reading
https://en.wikipedia.org/wiki/Iris_flower_data_set.


In [None]:
from sklearn import tree
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
# load the Iris dataset
iris = load_iris()
X, y = iris.data, iris.target
# hold out 20% of the data for testing (80%/20% split)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
# create and print the decision tree
dtc = tree.DecisionTreeClassifier(criterion="entropy")
dtc.fit(X_train, y_train)
tree.plot_tree(dtc)

In [None]:
import graphviz
dot_data = tree.export_graphviz(dtc, out_file=None,
                                feature_names=iris.feature_names,
                                class_names=iris.target_names,
                                filled=True, rounded=True,
                                special_characters=True)
graph = graphviz.Source(dot_data)
graph.render("iristree")


Now you have to evaluate the performance of your classifier. Use
scikit-learn's train_test_split helper function to split the Iris dataset into
a training and a testing subset, as discussed in the lecture. Then run an
evaluation to compute the Precision, Recall, F1-measure, and Accuracy
of your classifier using the evaluation tools in scikit-learn.


To create an 80%/20% split of the training/testing data, pass test_size=0.2 to train_test_split.
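
A minimal sketch of such a split, assuming the Iris data is loaded as before and random_state=0 is used to make the shuffle repeatable:

```python
from sklearn import tree
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
X, y = iris.data, iris.target

# test_size=0.2 holds out 20% of the 150 samples; random_state fixes the shuffle
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
print(X_train.shape, X_test.shape)

# train on the training subset only, so the test subset stays unseen
dtc = tree.DecisionTreeClassifier(criterion="entropy")
dtc.fit(X_train, y_train)
```

With 150 samples, this leaves 120 instances for training and 30 for testing.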

scikit-learn has a helper function to produce a report for all the metrics,
classification_report:

In [None]:
from sklearn.metrics import classification_report
# predict the labels of the held-out test instances
y_pred = dtc.predict(X_test)
print(classification_report(y_test, y_pred))


With an 80%/20% split, the data is easy to learn and your classifier should
reach a perfect (or nearly perfect) score. Try decreasing the training data to
just 50% of the dataset to see some errors.
Note: Since we have more than two classes, the report lists the metrics for
each individual class, followed by their macro and weighted averages.
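
To make the averaging concrete, the sketch below computes the per-class and the macro-averaged values with scikit-learn's precision_score, recall_score, and accuracy_score. The labels are made up for this illustration and are not taken from the Iris data.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Made-up labels for three classes, for illustration only
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 2, 2, 2]

# average=None returns one value per class
per_class_precision = precision_score(y_true, y_pred, average=None)

# average='macro' is the unweighted mean of the per-class values
macro_precision = precision_score(y_true, y_pred, average='macro')
macro_recall = recall_score(y_true, y_pred, average='macro')
accuracy = accuracy_score(y_true, y_pred)

print("Per-class precision:", per_class_precision)
print("Macro precision:", macro_precision)
print("Macro recall:", macro_recall)
print("Accuracy:", accuracy)
```

Here the per-class precision values are 1, 1, and 2/3, so the macro average is 8/9.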


Finally, compute and print out the confusion matrix.

In [None]:
from sklearn.metrics import confusion_matrix
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
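
To read the confusion matrix, recall that scikit-learn puts the true classes in the rows and the predicted classes in the columns. The sketch below uses a hypothetical 3-class matrix (not the output of the cell above) to show how accuracy, recall, and precision follow from it.

```python
# Hypothetical 3-class confusion matrix: rows are true classes, columns are predictions
cm = [
    [10, 0, 0],
    [0, 9, 1],
    [0, 2, 8],
]
n_classes = len(cm)

# Correct predictions are on the diagonal
total = sum(sum(row) for row in cm)
correct = sum(cm[i][i] for i in range(n_classes))
accuracy = correct / total

# Recall of class i: diagonal entry divided by the sum of row i
recall = [cm[i][i] / sum(cm[i]) for i in range(n_classes)]

# Precision of class i: diagonal entry divided by the sum of column i
precision = [cm[i][i] / sum(row[i] for row in cm) for i in range(n_classes)]

print("Accuracy:", accuracy)
print("Recall per class:", recall)
print("Precision per class:", precision)
```

Off-diagonal entries show which classes get confused with each other; in this made-up matrix, only the second and third classes are ever mixed up.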