Decision Tree Experiment

This experiment uses the tennis.txt dataset, which contains 14 samples. Each sample consists of four weather-related features (weather, temperature, humidity, and wind) and a label indicating whether the conditions are suitable for playing tennis.

Step 1: Import dependencies

Input:

In [4]:
import pandas as pd
import numpy as np
from sklearn import tree
import pydotplus

Step 2: Define the function for generating a decision tree

Input:

In [5]:
# Generate a decision tree
def createTree(trainingData):
    data = trainingData.iloc[:, :-1]      # Feature matrix
    labels = trainingData.iloc[:, -1]     # Labels
    trainedTree = tree.DecisionTreeClassifier(criterion="entropy")      # Decision tree classifier
    trainedTree.fit(data, labels)      # Train the model
    return trainedTree

Step 3: Define the function for saving the generated tree diagram

Input:

In [6]:
# Save the decision tree diagram as a PDF file
def showtree2pdf(trainedTree, filename):
    dot_data = tree.export_graphviz(trainedTree, out_file=None)    # Export the tree as a Graphviz DOT string.
    graph = pydotplus.graph_from_dot_data(dot_data)    # Build a graph object from the DOT string.
    graph.write_pdf(filename)     # Save the tree diagram to the local machine in PDF format.
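The diagram exported this way labels features by position (X[0], X[1], and so on). As a minimal sketch, export_graphviz also accepts feature_names and class_names, which makes the nodes easier to read. The toy data and the feature names below are assumptions for illustration only and are not taken from tennis.txt.

```python
from sklearn import tree

# Hypothetical, already-encoded toy data: [weather, temperature, humidity, wind]
X = [[2, 1, 0, 1], [2, 1, 0, 0], [0, 1, 0, 1],
     [1, 2, 0, 1], [1, 0, 1, 1], [1, 0, 1, 0]]
y = ["N", "N", "Y", "Y", "Y", "N"]

clf = tree.DecisionTreeClassifier(criterion="entropy", random_state=0)
clf.fit(X, y)

# Return the DOT source as a string, with readable feature and class names
dot_data = tree.export_graphviz(
    clf,
    out_file=None,
    feature_names=["weather", "temperature", "humidity", "wind"],  # assumed names
    class_names=clf.classes_.tolist(),  # same order as the "value" list in each node
    filled=True,                        # color nodes by majority class
)
print(dot_data[:200])
```

The resulting string can be passed to pydotplus.graph_from_dot_data in the same way as in showtree2pdf.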

Step 4: Define the function for generating vectorized data

In the function, pd.Categorical(column).codes returns an integer code for each value in the column, which converts the categorical information into numeric information. The codes index into the sorted list of distinct values, so identical strings always receive the same code.

Input:

In [7]:
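# Convert every feature column from category strings to integer codes
def data2vectoc(data):
    data = data.copy()                    # Work on a copy so the caller's DataFrame is not modified
    names = data.columns[:-1]             # All columns except the last one (the label)
    for i in names:
        col = pd.Categorical(data[i])     # Collect the distinct values of the column
        data[i] = col.codes               # Replace each value with its integer code
    return data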
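To see what pd.Categorical produces, the following sketch encodes a small column and recovers the mapping from codes back to strings. The values are hypothetical; the strings actually used in tennis.txt may differ.

```python
import pandas as pd

# Hypothetical column of weather values (not taken from tennis.txt)
outlook = pd.Series(["sunny", "overcast", "rain", "sunny"])

cat = pd.Categorical(outlook)
print(list(cat.categories))   # ['overcast', 'rain', 'sunny']  (sorted distinct values)
print(cat.codes.tolist())     # [2, 0, 1, 2]

# Map each code back to its original string
code_to_label = dict(enumerate(cat.categories))
print(code_to_label)          # {0: 'overcast', 1: 'rain', 2: 'sunny'}
```

Because the codes follow the sorted order of the strings, code 0 is not necessarily the first value that appears in the file.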

Step 5: Invoke the functions for prediction

Input:

In [8]:
data = pd.read_table("tennis.txt",header=None,sep='\t')   # Read training data
trainingvec=data2vectoc(data)     # Vectorize data
decisionTree=createTree(trainingvec)   # Create a decision tree.
showtree2pdf(decisionTree,"tennis.pdf")  # Plot the decision tree

The generated PDF file shows the trained decision tree. In the diagram, X[2] is the third feature (humidity), X[0] is the first feature (weather), and X[3] is the fourth feature (wind). For each node, entropy is the entropy of the samples in that node, samples is the number of samples that reach the node, and value is the number of samples in each class. For example, the root node has samples = 14, the size of the training set, and its value lists 5 "no" samples and 9 "yes" samples.
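The entropy shown in the root node can be checked by hand from its class counts. The sketch below computes the Shannon entropy in bits for the 5 "no" and 9 "yes" samples; the helper name entropy is introduced here for illustration only.

```python
import numpy as np

def entropy(counts):
    """Shannon entropy (in bits) of a node, given its per-class sample counts."""
    counts = np.asarray(counts, dtype=float)
    p = counts[counts > 0] / counts.sum()   # class proportions, skipping empty classes
    return float(-(p * np.log2(p)).sum())

root_entropy = entropy([5, 9])   # 5 "no" samples and 9 "yes" samples
print(round(root_entropy, 3))    # 0.94
```

A node that contains samples of only one class has an entropy of 0, which is why the leaf nodes in the diagram show entropy = 0.0.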

Predict a new sample. Input:

In [9]:
# Weather is sunny, temperature is low, humidity is high, and wind is strong.
# Each code follows the sorted category order produced by pd.Categorical for that column.
testVec = [0, 0, 1, 1]
print(decisionTree.predict(np.array(testVec).reshape(1,-1)))   # Predict

Output:

['Y']
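Writing the codes of a new sample by hand is error-prone, because they must match the codes assigned during training. One possible approach, sketched here under the assumption that the raw training column is still available, is to look each value up in the categories of that column. The column values and the helper encode are hypothetical.

```python
import pandas as pd

# Hypothetical raw training column; tennis.txt may use different strings
train_weather = pd.Series(["sunny", "overcast", "rain", "sunny"])
categories = pd.Categorical(train_weather).categories

def encode(value, categories):
    """Return the integer code that pd.Categorical assigned to value."""
    # get_loc raises KeyError for a value that was not seen during training
    return int(categories.get_loc(value))

print(encode("sunny", categories))      # 2
print(encode("overcast", categories))   # 0
```

Repeating the lookup for each feature column yields a test vector that is consistent with the encoded training data.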
