# Lesson Two

In lesson one, we used a decision tree as our classifier. In this lesson we'll add code to visualize it so we can see how it works under the hood. There are many types of classifiers:
1. Artificial Neural Network
2. Support Vector Machine
3. Decision Tree
and more.

So why did we use a decision tree to start? Well, they have a unique property: they're easy to read and understand. In fact, they're one of the few models that are interpretable, where you can understand exactly why the classifier makes a decision. That's amazingly useful in practice.

To get started, I'll introduce you to a real data set we'll work with today. It's called Iris. Iris is a classic machine learning problem. In it, you want to identify what type of flower you have based on different measurements, like the length and width of the petal. The data set includes three different types of flowers. They're all species of iris: setosa, versicolor, and virginica.


Scrolling down the Wikipedia page for the data set, you can see we're given 50 examples of each type, so 150 examples total. Notice there are four features that are used to describe each example: the length and width of the sepal and petal. And just like in our apples and oranges problem, the first four columns give the features and the last column gives the labels, which is the type of flower in each row.

Our goal is to use this data set to train a classifier. Then we can use that classifier to predict what species of flower we have if we're given a new flower that we've never seen before. Knowing how to work with an existing data set is a good skill, so let's import Iris into scikit-learn and see what it looks like in code. Conveniently, the friendly folks at scikit-learn provided a bunch of sample data sets, including Iris, as well as utilities to make them easy to import.

In [4]:
# Import dataset
from sklearn.datasets import load_iris


We can import Iris into our code like this. The data set includes both the table from Wikipedia as well as some metadata.

In [5]:
iris = load_iris()
print(iris.feature_names)
print(iris.target_names)
print(iris.data[0])

['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
['setosa' 'versicolor' 'virginica']
[ 5.1  3.5  1.4  0.2]


The metadata tells you the names of the features and the names of different types of flowers. The features and examples themselves are contained in the data variable. For example, if I print out the first entry, you can see the measurements for this flower. These index to the feature names, so the first value refers to the sepal length, and the second to sepal width, and so on.
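
To make the positional link between feature names and measurements concrete, here's a minimal sketch that pairs them up for the first flower. The variable name `first_flower` is my own choice for illustration.

```python
from sklearn.datasets import load_iris

iris = load_iris()

# Pair each feature name with the value at the same position in the first row.
first_flower = dict(zip(iris.feature_names, iris.data[0]))

for name, value in first_flower.items():
    print(name, "=", value)
```

Building a dictionary like this is handy for inspection, but the classifier itself still expects the plain numeric rows in `iris.data`.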


In [6]:
print(iris.target[0])

0


The target variable contains the labels. Likewise, these index to the target names. Let's print out the first one. A label of 0 means it's a setosa. If you look at the table from Wikipedia, you'll notice that we just printed out the first row. Both the data and target variables have 150 entries. If you want, you can iterate over them to print out the entire data set like this.

In [7]:
for i in range(len(iris.target)):
    print("Example: ", i, " label: ", iris.target[i], " features: ", iris.data[i])

Example:  0  label:  0  features:  [ 5.1  3.5  1.4  0.2]
Example:  1  label:  0  features:  [ 4.9  3.   1.4  0.2]
Example:  2  label:  0  features:  [ 4.7  3.2  1.3  0.2]
Example:  3  label:  0  features:  [ 4.6  3.1  1.5  0.2]
Example:  4  label:  0  features:  [ 5.   3.6  1.4  0.2]
Example:  5  label:  0  features:  [ 5.4  3.9  1.7  0.4]
Example:  6  label:  0  features:  [ 4.6  3.4  1.4  0.3]
Example:  7  label:  0  features:  [ 5.   3.4  1.5  0.2]
Example:  8  label:  0  features:  [ 4.4  2.9  1.4  0.2]
Example:  9  label:  0  features:  [ 4.9  3.1  1.5  0.1]
Example:  10  label:  0  features:  [ 5.4  3.7  1.5  0.2]
Example:  11  label:  0  features:  [ 4.8  3.4  1.6  0.2]
Example:  12  label:  0  features:  [ 4.8  3.   1.4  0.1]
Example:  13  label:  0  features:  [ 4.3  3.   1.1  0.1]
Example:  14  label:  0  features:  [ 5.8  4.   1.2  0.2]
Example:  15  label:  0  features:  [ 5.7  4.4  1.5  0.4]
Example:  16  label:  0  features:  [ 5.4  3.9  1.3  0.4]
Example:  17  label:  0 
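
Rather than scrolling through all 150 rows, you can confirm the shape of the data set and translate a numeric label into a species name with a few lines. This is a small sketch; `np.bincount` is just one convenient way to count how many examples carry each label.

```python
from sklearn.datasets import load_iris
import numpy as np

iris = load_iris()

# 150 examples, each described by 4 features, with one label per example.
print(iris.data.shape)
print(iris.target.shape)

# Use the numeric label as an index into target_names to get the species.
first_label = iris.target[0]
first_species = iris.target_names[first_label]
print(first_species)

# Count how many examples there are of each label (0, 1, 2).
counts = np.bincount(iris.target)
print(counts)
```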

Now that we know how to work with the data set, we're ready to train a classifier. But before we do that, first we need to split up the data. I'm going to remove several of the examples and put them aside for later. We'll call the examples I'm putting aside our testing data. We'll keep these separate from our training data, and later on we'll use our testing examples to test how accurate the classifier is on data it's never seen before. Testing is actually a really important part of doing machine learning well in practice, and we'll cover it in more detail in a future episode.

Just for this exercise, I'll remove one example of each type of flower. And as it happens, the data set is ordered so the first setosa is at index 0, and the first versicolor is at 50, and so on. The syntax looks a little bit complicated, but all I'm doing is removing three entries from the data and target variables. Then I'll create two new sets of variables: one for training and one for testing. Training will have the majority of our data, and testing will have just the examples I removed.
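
If removing entries by index sounds abstract, here's a toy sketch of the idea using NumPy's `np.delete` on a tiny made-up table. The array values, the names `features`, `labels`, and `held_out`, and the chosen indices are all invented for illustration.

```python
import numpy as np

# Made-up 4x2 feature table and matching labels, for illustration only.
features = np.array([[1.0, 2.0],
                     [3.0, 4.0],
                     [5.0, 6.0],
                     [7.0, 8.0]])
labels = np.array([0, 0, 1, 1])
held_out = [0, 2]  # hypothetical indices to set aside for testing

# axis=0 removes whole rows; without it, np.delete flattens the array first.
train_features = np.delete(features, held_out, axis=0)
train_labels = np.delete(labels, held_out)

# Indexing with the same list picks out exactly the rows that were removed.
test_features = features[held_out]
test_labels = labels[held_out]

print(train_features)
print(test_features)
```

Note that `np.delete` returns a new array and leaves the original untouched, which is why the same indices can still be used afterwards to pull out the testing rows.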

Now, just as before, we can create a decision tree classifier and train it on our training data. Before we visualize it, let's use the tree to classify our testing data. We know we have one flower of each type, and we can print out the labels we expect. Now let's see what the tree predicts. We'll give it the features for our testing data, and we'll get back labels. You can see the predicted labels match our testing data. That means it got them all right. Now, keep in mind, this was a very simple test, and we'll go into more detail down the road.

In [2]:
# Train a classifier
from sklearn.datasets import load_iris
from sklearn import tree
import numpy as np

iris = load_iris()
test_idx = [0, 50, 100]

# training data
train_target = np.delete(iris.target, test_idx)
train_data = np.delete(iris.data, test_idx, axis=0)

# testing data
test_target = iris.target[test_idx]
test_data = iris.data[test_idx]

clf_tree = tree.DecisionTreeClassifier()
clf_tree.fit(train_data, train_target)

# Predict label for new flower
print(test_target)
print(clf_tree.predict(test_data))

[0 1 2]
[0 1 2]
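
With only three test examples you can compare the two printed arrays by eye, but a larger test set calls for a number. Here's a minimal sketch that reports accuracy using scikit-learn's `accuracy_score`. The `random_state=0` argument is my own addition so the tree is built the same way on every run; it isn't part of the original cell.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn import tree
import numpy as np

iris = load_iris()
test_idx = [0, 50, 100]

# Same split as above: hold out one flower of each species.
train_data = np.delete(iris.data, test_idx, axis=0)
train_target = np.delete(iris.target, test_idx)
test_data = iris.data[test_idx]
test_target = iris.target[test_idx]

# random_state is an assumption added here for repeatable results.
clf_tree = tree.DecisionTreeClassifier(random_state=0)
clf_tree.fit(train_data, train_target)

# Fraction of testing examples whose predicted label matches the true label.
predictions = clf_tree.predict(test_data)
accuracy = accuracy_score(test_target, predictions)
print(accuracy)
```

An accuracy measured on three examples says very little on its own, so treat this as a pattern to reuse on a bigger testing set rather than as a meaningful score.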


Now let's visualize the tree so we can see how the classifier works. To do that, I'm going to copy-paste some code in from scikit-learn's tutorials, and because this code is for visualization and not machine-learning concepts, I won't cover the details here. Note that I'm combining the code from these two examples to create an easy-to-read PDF. The cell below writes the tree to a Graphviz file, tree.dot, which can then be rendered to a PDF so we can open it up and see the tree.


In [3]:
# Visualize the tree
from IPython.display import Image

# Write the trained tree to tree.dot in Graphviz DOT format.
# To render it as a PDF, run Graphviz from a shell:
#   dot -Tpdf tree.dot -o tree.pdf
tree.export_graphviz(clf_tree,
                    out_file="tree.dot",
                    feature_names=iris.feature_names,
                    class_names=list(iris.target_names),
                    filled=True, rounded=True,
                    impurity=False)
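
If you don't have Graphviz installed, scikit-learn can also print the tree's rules as plain text with `tree.export_text` (available in version 0.21 and later). This is a different technique from the Graphviz rendering above, sketched here with the same train/test split; `random_state=0` is again my own addition for repeatable output.

```python
from sklearn.datasets import load_iris
from sklearn import tree
import numpy as np

iris = load_iris()
test_idx = [0, 50, 100]

# Train on everything except the three held-out flowers.
train_data = np.delete(iris.data, test_idx, axis=0)
train_target = np.delete(iris.target, test_idx)

# random_state is an assumption added here for repeatable results.
clf_tree = tree.DecisionTreeClassifier(random_state=0)
clf_tree.fit(train_data, train_target)

# Each indented line is a yes/no question on one feature; leaves show the class.
rules = tree.export_text(clf_tree, feature_names=iris.feature_names)
print(rules)
```

Reading the output from top to bottom follows the same questions the classifier asks when it predicts a label for a new flower.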


