# Dinosaurs and Decision Trees
### Predicting dinosaur species' diets

In this notebook we're going to load and clean a small dataset of some better-known dinosaur species, and then build a predictive model that will classify them as either herbivores or meat-eaters (carnivores and omnivores are grouped together).

## Import modules
(The usual suspects)

In [2]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn import tree


## Load and clean data

The data is held in a CSV file called "dino_data.csv". We're going to read that into a data frame, and then create some new columns that turn the text categories (like 'bipedal' and 'quadrupedal') into a column of just ones and zeroes (one for bipedal, zero for quadrupedal). We're also creating a couple of other calculated columns.

In [3]:
df = pd.read_csv('dino_data.csv')
df = df[~df.name.isin(['Allosaurus', 'Dryosaurus'])] #This is kinda cheating but it made things clearer for the essay.

In [4]:
# Let's look at the training data

df[df.train == 1]

Unnamed: 0,name,weight (tonnes),diet,gait,length (m),Jurassic,train,display,defence,feathers (confirmed),feathers (likely)
0,Velociraptor,0.015,carnivore,bipedal,2.0,0,1,0,0,1,1
3,Elaphrosaurus,0.21,carnivore,bipedal,6.0,1,1,0,0,0,0
4,Gallimimus,0.04,omnivore,bipedal,6.0,0,1,0,0,0,1
7,Dilophosaurus,0.4,carnivore,bipedal,6.0,1,1,1,0,0,1
9,Utahraptor,0.5,carnivore,bipedal,7.0,0,1,0,0,1,1
10,Parasaurolophus,3.2,herbivore,bipedal,12.0,0,1,1,0,0,0
11,Ceratosaurus,0.4,carnivore,bipedal,6.0,1,1,1,0,0,1
14,Bactrosaurus,1.5,herbivore,bipedal,6.0,0,1,0,0,0,0
15,Spinosaurus,9.0,carnivore,bipedal,15.0,0,1,1,0,0,0
16,Stegosaurus,2.4,herbivore,quadrupedal,9.0,1,1,0,1,0,0


In [5]:
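df['bipedal'] = (df.gait == 'bipedal') * 1  # 1 = bipedal, 0 = quadrupedal
df['meatasaurus'] = (df.diet != 'herbivore') * 1  # 1 = carnivore or omnivore, 0 = herbivore
df['tonnes per meter'] = df['weight (tonnes)'] / df['length (m)']  # rough measure of bulk
df['defence or display'] = df.defence | df.display  # 1 if either flag is set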

Last, we'll split the data into "training" and "testing" datasets, according to a flag in the data. (More traditionally this is done randomly, but I've hard-coded it so the results are consistent.) The training dataset is used to train the models, and then we test how well they do on unseen data from the test set.

In [6]:
train = df[df.train == 1]
test = df[df.train != 1]

In [7]:
print(len(train), len(test))

20 7
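For comparison, here's a rough sketch of the more traditional random split, using the `train_test_split` function imported at the top. It runs on a toy frame built from the ten training rows displayed earlier, not the full dataset, and the `toy`, `toy_train` and `toy_test` names, the test size of 3 and the `stratify` choice are all my own assumptions for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame: names and labels copied from the ten training rows shown earlier.
toy = pd.DataFrame({
    'name': ['Velociraptor', 'Elaphrosaurus', 'Gallimimus', 'Dilophosaurus',
             'Utahraptor', 'Parasaurolophus', 'Ceratosaurus', 'Bactrosaurus',
             'Spinosaurus', 'Stegosaurus'],
    'meatasaurus': [1, 1, 1, 1, 1, 0, 1, 0, 1, 0],
})

# Hold out 3 rows at random. random_state fixes the shuffle so the split is
# repeatable; stratify keeps the meat-eater/herbivore mix similar in both parts.
toy_train, toy_test = train_test_split(
    toy, test_size=3, random_state=1, stratify=toy['meatasaurus'])

print(len(toy_train), len(toy_test))
```

Fixing `random_state` gives the same repeatability as the hard-coded flag, as long as the data and library version don't change.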


## Training a model

We'll use a simple decision tree first, using only a few features.

In [9]:
tree_classifier = DecisionTreeClassifier(max_depth=3,
                                    random_state=1 # Decision trees have some random elements, so this ensures it's always the same.
                                   )

In [10]:
basic_features = ['bipedal', 'length (m)', 'Jurassic', 'weight (tonnes)'] 


In [11]:
train[['name'] + basic_features + ['meatasaurus']].head(8)

Unnamed: 0,name,bipedal,length (m),Jurassic,weight (tonnes),meatasaurus
0,Velociraptor,1,2.0,0,0.015,1
3,Elaphrosaurus,1,6.0,1,0.21,1
4,Gallimimus,1,6.0,0,0.04,1
7,Dilophosaurus,1,6.0,1,0.4,1
9,Utahraptor,1,7.0,0,0.5,1
10,Parasaurolophus,1,12.0,0,3.2,0
11,Ceratosaurus,1,6.0,1,0.4,1
14,Bactrosaurus,1,6.0,0,1.5,0


In [12]:
tree_classifier.fit(train[basic_features], train.meatasaurus)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=3,
            max_features=None, max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=1, splitter='best')

## A quick digression - explaining a tree

I wrote this function (using mostly code stolen from Stack Overflow) to print out a description of the tree's steps. It's ugly, but it's easier than getting the GraphViz libraries working for the built-in visualisation tools (which are much prettier).

In [13]:
def explain_tree(trained_model, features):
    n_nodes = trained_model.tree_.node_count
    children_left = trained_model.tree_.children_left
    children_right = trained_model.tree_.children_right
    feature = trained_model.tree_.feature
    threshold = trained_model.tree_.threshold

    values = trained_model.tree_.value
    node_depth = np.zeros(shape=n_nodes, dtype=np.int64)
    is_leaves = np.zeros(shape=n_nodes, dtype=bool)
    stack = [(0, -1)]  # seed is the root node id and its parent depth
    while len(stack) > 0:
        node_id, parent_depth = stack.pop()
        node_depth[node_id] = parent_depth + 1

        # If we have a test node
        if (children_left[node_id] != children_right[node_id]):
            stack.append((children_left[node_id], parent_depth + 1))
            stack.append((children_right[node_id], parent_depth + 1))
        else:
            is_leaves[node_id] = True

    print("The binary tree structure has %s nodes and has "
          "the following tree structure:"
          % n_nodes)
    for i in range(n_nodes):
        if is_leaves[i]:
            vals = values[i][0]
            n = sum(vals)
            proportion = int(vals[1]/n*100)
            print("%snode=%s leaf node. %s%% meatosaurus of %d examples." % (node_depth[i] * "\t", i, proportion, n))
        else:
            print("%snode=%s test node: go to node %s if %s <= %s else to "
                  "node %s."
                  % (node_depth[i] * "\t",
                     i,
                     children_left[i],
                     features[feature[i]],
                     threshold[i],
                     children_right[i],
                     ))
    

## Examining the Decision Tree

Let's have a look at the tree we've built. We can describe the "flow chart" that we've made, and look at how much each feature contributes to the overall result.

In [14]:
explain_tree(tree_classifier, basic_features)

The binary tree structure has 7 nodes and has the following tree structure:
node=0 test node: go to node 1 if bipedal <= 0.5 else to node 2.
	node=1 leaf node. 0% meatosaurus of 7 examples.
	node=2 test node: go to node 3 if weight (tonnes) <= 0.800000011921 else to node 4.
		node=3 leaf node. 100% meatosaurus of 6 examples.
		node=4 test node: go to node 5 if length (m) <= 12.1999998093 else to node 6.
			node=5 leaf node. 0% meatosaurus of 3 examples.
			node=6 leaf node. 100% meatosaurus of 4 examples.


Decision trees have a neat property called "feature importance": how much each feature contributes to the splits that separate the classes. The importances add up to one, and a feature the tree never splits on scores zero. Here are the feature importances for our model:

In [15]:
pd.DataFrame(list(zip(basic_features, tree_classifier.feature_importances_)), columns = ['feature', 'importance'])

Unnamed: 0,feature,importance
0,bipedal,0.538462
1,length (m),0.342857
2,Jurassic,0.0
3,weight (tonnes),0.118681
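To see where those numbers come from, here's a small hand calculation based on my understanding that scikit-learn's importances are the normalised total decrease in Gini impurity, with each split weighted by the share of training examples that reach it. The node counts are read off the printed tree above; the `gini` helper and the `splits` list are names I've made up for this sketch.

```python
def gini(n_meat, n_total):
    """Gini impurity of a node holding n_meat meat-eaters out of n_total examples."""
    p = float(n_meat) / n_total
    return 1.0 - p ** 2 - (1.0 - p) ** 2

# One entry per test node: (feature, parent, left child, right child),
# with each node written as (meat-eaters, examples).
splits = [
    ('bipedal',         (10, 20), (0, 7), (10, 13)),  # node 0 -> nodes 1 and 2
    ('weight (tonnes)', (10, 13), (6, 6), (4, 7)),    # node 2 -> nodes 3 and 4
    ('length (m)',      (4, 7),   (0, 3), (4, 4)),    # node 4 -> nodes 5 and 6
]
n_root = 20.0

raw = {}
for feature, parent, left, right in splits:
    # Impurity decrease of this split, weighted by node size relative to the root.
    decrease = (parent[1] * gini(*parent)
                - left[1] * gini(*left)
                - right[1] * gini(*right)) / n_root
    raw[feature] = raw.get(feature, 0.0) + decrease

# Normalise so the importances sum to one.
total = sum(raw.values())
importances = {feature: value / total for feature, value in raw.items()}

for feature, value in importances.items():
    print(feature, round(value, 6))
```

This gives 0.538462 for `bipedal`, 0.118681 for `weight (tonnes)` and 0.342857 for `length (m)`, matching the table. `Jurassic` is never used in a split, so it gets zero.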


## Making Predictions

We'll make predictions for our test dataset using the basic model.

In [13]:
predictions = tree_classifier.predict(test[basic_features])


In [14]:
pd.DataFrame(list(zip(test.name, predictions, test.meatasaurus)), columns = ['name', 'predicted', 'actual'])

Unnamed: 0,name,predicted,actual
0,Albertonykus,1,1
1,Deinonychus,1,1
2,Dracopelta,0,0
3,Pachycephalosaurus,1,0
4,Albertosaurus,0,1
5,Megalosaurus,0,1
6,Yangchuanosaurus,0,1


Oh no! Our model sucks! It only got 3 of the 7 test dinosaurs right. Let's see if we can do better.

## Better Models Part One - Logistic Regression

Let's quickly see how well a logistic regression model compares to our decision tree. We'll use exactly the same features, and just change the prediction algorithm.

In [15]:
logistic_classifier = LogisticRegression()

logistic_classifier.fit(train[basic_features], train.meatasaurus)

predictions = logistic_classifier.predict(test[basic_features])

pd.DataFrame(list(zip(test.name, predictions, test.meatasaurus)), columns = ['name', 'predicted', 'actual'])

Unnamed: 0,name,predicted,actual
0,Albertonykus,1,1
1,Deinonychus,1,1
2,Dracopelta,0,0
3,Pachycephalosaurus,1,0
4,Albertosaurus,1,1
5,Megalosaurus,1,1
6,Yangchuanosaurus,1,1


It's much better (6 out of 7 right), but not perfect. It shows how big a difference algorithm selection can make (and that logistic regression is often a strong first choice).
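Unlike the tree's yes/no flow chart, logistic regression learns one weight per feature and outputs a probability. Here's a minimal sketch of how you might peek at both, fitted on only the ten training rows displayed earlier, so its numbers won't match the model above. The `rows` frame, the `demo_logistic` name and `max_iter=1000` are my own assumptions for the example.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Ten training rows copied from the table shown earlier (not the full training set).
rows = pd.DataFrame({
    'bipedal':         [1, 1, 1, 1, 1, 1, 1, 1, 1, 0],
    'length (m)':      [2.0, 6.0, 6.0, 6.0, 7.0, 12.0, 6.0, 6.0, 15.0, 9.0],
    'Jurassic':        [0, 1, 0, 1, 0, 0, 1, 0, 0, 1],
    'weight (tonnes)': [0.015, 0.21, 0.04, 0.4, 0.5, 3.2, 0.4, 1.5, 9.0, 2.4],
    'meatasaurus':     [1, 1, 1, 1, 1, 0, 1, 0, 1, 0],
})
basic_features = ['bipedal', 'length (m)', 'Jurassic', 'weight (tonnes)']

# max_iter is raised so the solver has plenty of room to converge on unscaled features.
demo_logistic = LogisticRegression(max_iter=1000)
demo_logistic.fit(rows[basic_features], rows['meatasaurus'])

# One learned weight per feature: positive pushes towards "meatasaurus".
weights = dict(zip(basic_features, demo_logistic.coef_[0]))
print(weights)

# Column 0 is P(herbivore), column 1 is P(meatasaurus); each row sums to one.
probabilities = demo_logistic.predict_proba(rows[basic_features])
print(probabilities.round(2))
```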

## Better Models Part Two: Fancy Features

We'll make another decision tree, but this time give it access to all the cool features in the dataset.

In [16]:
better_tree_classifier = DecisionTreeClassifier(max_depth=3,
                                    random_state=1
                                   )

fancy_features = ['bipedal', 'Jurassic', 'defence or display', 'feathers (likely)', 'tonnes per meter']

train[['name'] + fancy_features + ['meatasaurus']].head(8)

Unnamed: 0,name,bipedal,Jurassic,defence or display,feathers (likely),tonnes per meter,meatasaurus
0,Velociraptor,1,0,0,1,0.0075,1
3,Elaphrosaurus,1,1,0,0,0.035,1
4,Gallimimus,1,0,0,1,0.006667,1
7,Dilophosaurus,1,1,1,1,0.066667,1
9,Utahraptor,1,0,0,1,0.071429,1
10,Parasaurolophus,1,0,1,0,0.266667,0
11,Ceratosaurus,1,1,1,1,0.066667,1
14,Bactrosaurus,1,0,0,0,0.25,0


In [17]:
better_tree_classifier.fit(train[fancy_features], train.meatasaurus)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=3,
            max_features=None, max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=1, splitter='best')

In [18]:
explain_tree(better_tree_classifier, fancy_features)

The binary tree structure has 7 nodes and has the following tree structure:
node=0 test node: go to node 1 if bipedal <= 0.5 else to node 2.
	node=1 leaf node. 0% meatosaurus of 7 examples.
	node=2 test node: go to node 3 if feathers (likely) <= 0.5 else to node 6.
		node=3 test node: go to node 4 if tonnes per meter <= 0.433333337307 else to node 5.
			node=4 leaf node. 25% meatosaurus of 4 examples.
			node=5 leaf node. 100% meatosaurus of 3 examples.
		node=6 leaf node. 100% meatosaurus of 6 examples.


In [19]:
pd.DataFrame(list(zip(fancy_features, better_tree_classifier.feature_importances_)), columns = ['feature', 'importance'])

Unnamed: 0,feature,importance
0,bipedal,0.633484
1,Jurassic,0.0
2,defence or display,0.0
3,feathers (likely),0.139625
4,tonnes per meter,0.226891


In [20]:
predictions = better_tree_classifier.predict(test[fancy_features])

In [21]:
pd.DataFrame(list(zip(test['name'], predictions, test['meatasaurus'])), 
             columns = ['name', 'predicted', 'actual'])

Unnamed: 0,name,predicted,actual
0,Albertonykus,1,1
1,Deinonychus,1,1
2,Dracopelta,0,0
3,Pachycephalosaurus,0,0
4,Albertosaurus,1,1
5,Megalosaurus,1,1
6,Yangchuanosaurus,1,1


Now it rules! All 7 test dinosaurs are classified correctly.
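To put a number on each model, here's a quick sketch that scores the three sets of predictions printed above with `accuracy_score` from `sklearn.metrics` (not imported in the original notebook, so that's an addition of mine). The prediction lists are copied by hand from the output tables.

```python
from sklearn.metrics import accuracy_score

# Actual labels and predictions, copied from the three output tables above.
actual        = [1, 1, 0, 0, 1, 1, 1]
basic_tree    = [1, 1, 0, 1, 0, 0, 0]
logistic      = [1, 1, 0, 1, 1, 1, 1]
fancy_tree    = [1, 1, 0, 0, 1, 1, 1]

scores = {
    'basic tree': accuracy_score(actual, basic_tree),
    'logistic regression': accuracy_score(actual, logistic),
    'fancy tree': accuracy_score(actual, fancy_tree),
}

for model_name, score in scores.items():
    print(model_name, round(score, 3))
```

That works out to 3/7, 6/7 and 7/7 correct respectively.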