# Implementing a Decision Tree with scikit-learn

Authors: [j.rogel.datascience@gmail.com](mailto:j.rogel.datascience@gmail.com), [dima.galat@outlook.com](mailto:dima.galat@outlook.com)

Let us import the libraries that we will use in this exercise.

In [1]:
import numpy as np
import pandas as pd

We can read the data with pandas using the `read_csv` function.

In [2]:
iris_data = pd.read_csv('./data/iris.csv') 

Let us look at the first 5 records (the default number of rows returned by `head`):

In [3]:
iris_data.head()

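| Unnamed: 0 | sepal_length | sepal_width | petal_length | petal_width | species |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |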
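
The `Unnamed: 0` column in the output above suggests that the CSV file stores the row index as a first, unnamed column. The following is a minimal sketch, assuming that is the case, of how `index_col=0` keeps that column out of the data. The inline `csv_text` is a hypothetical three-row excerpt standing in for `./data/iris.csv`, so the snippet runs on its own.

```python
import io
import pandas as pd

# Hypothetical excerpt standing in for ./data/iris.csv; the leading comma
# in the header row is what makes pandas name the column "Unnamed: 0".
csv_text = """,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
"""

# Default read: the unnamed first column is kept as an ordinary data column.
with_extra = pd.read_csv(io.StringIO(csv_text))

# index_col=0 uses the first column as the row index instead.
sample = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(with_extra.columns))
print(list(sample.columns))
```

The rest of this notebook selects features by name, so it works either way; the option only tidies the displayed table.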


As we can see, the species is provided as a string, but the algorithms we are likely to use only take numerical values. 

Let us write a function that transforms the strings into numbers:

- Setosa: 0
- Versicolor: 1
- Virginica: 2

In [4]:
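def get_num(x):
    """Map a species name to its integer code."""
    if x == 'setosa':
        y = 0
    elif x == 'versicolor':
        y = 1
    elif x == 'virginica':
        y = 2
    else:
        # Fail loudly on an unexpected label instead of returning an unbound `y`
        raise ValueError(f'Unknown species: {x!r}')
    return y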

In [5]:
iris_data['species'].value_counts()

setosa        50
versicolor    50
virginica     50
Name: species, dtype: int64

We can now apply the function to the `species` field in our data:

In [6]:
iris_data['target'] = iris_data['species'].apply(get_num)

In [7]:
iris_data.head()

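| Unnamed: 0 | sepal_length | sepal_width | petal_length | petal_width | species | target |
|---|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa | 0 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa | 0 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa | 0 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa | 0 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa | 0 |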
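
The same encoding can also be written as a lookup table and applied with the pandas `Series.map` method. This is only an alternative sketch, not a replacement for the function above; the names `species` and `species_to_num` are introduced here for illustration.

```python
import pandas as pd

# Small stand-in for the `species` column of the iris data.
species = pd.Series(['setosa', 'versicolor', 'virginica', 'setosa'])

# Same codes as get_num, expressed as a dictionary.
species_to_num = {'setosa': 0, 'versicolor': 1, 'virginica': 2}

target = species.map(species_to_num)

# map() yields NaN for any label missing from the dictionary,
# so it is worth checking that every row was encoded.
assert target.notna().all()

print(target.tolist())  # [0, 1, 2, 0]
```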


# Modelling the data

As we mentioned above, the algorithm we are going to use requires the data to be numerical and structured in arrays.

We can extract the values from the pandas dataframe:

- `X`: the iris attributes
- `y`: target species

In [8]:
feature_names = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
X = iris_data[feature_names].values
y = iris_data['target'].values

Let us import the `tree` module from the scikit-learn library.

In [9]:
from sklearn import tree

Scikit-learn requires us to create an instance of the model. In this case we use the `DecisionTreeClassifier` class, with `entropy` as the criterion for partitioning our data.

Entropy in information theory tells us how much information there is in an event. In general, the more uncertain or random the event is, the more information it will contain. The concept of information entropy was created by mathematician Claude Shannon.
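
To make the idea concrete, the entropy of a set of class labels can be computed as the sum, over the classes, of `p * log2(1 / p)`, where `p` is the fraction of samples in each class. The helper below is a minimal sketch written for this explanation (the function name `entropy` and the example arrays are assumptions, not part of scikit-learn's API).

```python
import numpy as np

def entropy(labels):
    """Shannon entropy, in bits, of the class distribution in `labels`."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    # Sum of p * log2(1/p); classes that do not occur never appear in `counts`.
    return float(np.sum(p * np.log2(1.0 / p)))

# Three equally frequent classes, as in the iris data (50 samples each).
balanced = np.repeat([0, 1, 2], 50)
# A group containing a single class has no uncertainty left.
pure = np.zeros(50, dtype=int)

print(entropy(balanced))  # log2(3), roughly 1.585 bits
print(entropy(pure))      # 0.0
```

A split is attractive when the groups it produces have lower entropy than the group it started from, that is, when each side is closer to containing a single species.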


In [10]:
model = tree.DecisionTreeClassifier(criterion='entropy')

Once we have an instance of the model, we can train it with the `fit` method by providing the inputs and the target:

In [11]:
iris_tree = model.fit(X, y)

Remember that we are interested in predicting the likely species of a flower based on its characteristics. 

We can obtain the predictions given by the model with the help of the `predict` method.

In [12]:
iris_pred = iris_tree.predict(X)
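
The same method works for flowers the model has not seen. The sketch below is self-contained, so it assumes scikit-learn's bundled copy of the iris data (`load_iris`) as a stand-in for `./data/iris.csv`; the measurements in `new_flower` are made up for illustration.

```python
from sklearn import tree
from sklearn.datasets import load_iris

# Assumption: the bundled iris data stands in for ./data/iris.csv.
iris = load_iris()
clf = tree.DecisionTreeClassifier(criterion='entropy', random_state=0)
clf.fit(iris.data, iris.target)

# predict expects a 2-D array: one row per flower, one column per feature
# (sepal length, sepal width, petal length, petal width).
new_flower = [[5.0, 3.4, 1.5, 0.2]]  # made-up measurements
pred = clf.predict(new_flower)

# Translate the numeric code back into a species name.
print(pred, iris.target_names[pred])
```

Note the double brackets: passing a flat list of four numbers would raise an error, because the model expects one row per sample.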

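Finally, we can see how well we have done by comparing the predictions to the targets. Keep in mind that these predictions are made on the same data the model was trained on, so a perfect match says little about how the tree would perform on unseen flowers: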

In [13]:
(iris_pred == y).all()

True
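
The comparison above scores the tree on the very data it was fitted to. A common way to get a more honest estimate is to hold out part of the data for testing. The following is a minimal sketch of that idea, again assuming scikit-learn's bundled iris data in place of the CSV file; the 70/30 split and the `random_state` values are arbitrary choices.

```python
from sklearn import tree
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Assumption: the bundled iris data stands in for ./data/iris.csv.
iris = load_iris()

# Keep 30% of the flowers aside; stratify so each species is equally represented.
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3,
    stratify=iris.target, random_state=42)

clf = tree.DecisionTreeClassifier(criterion='entropy', random_state=42)
clf.fit(X_train, y_train)

# score() returns the fraction of correct predictions (accuracy).
train_acc = clf.score(X_train, y_train)
test_acc = clf.score(X_test, y_test)
print(f'train accuracy: {train_acc:.3f}')
print(f'test accuracy:  {test_acc:.3f}')
```

The test accuracy is the figure to look at when judging how well the tree generalises.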

# Looking at the rules

Don't worry too much at this stage about the details of the function below.

We are using it to take a look at the rules that the decision tree we implemented has generated.

We need to install the Graphviz package to be able to visualize the tree.  
Graphviz uses a notation that allows us to describe the tree in a form that can be rendered as a picture. 

We also require a Python package that interfaces with this graph description language (a tree is a type of graph).  

Both packages can be installed through Anaconda.  
> `conda` is Anaconda's command for installing packages.  

In [14]:
!conda install -y graphviz python-graphviz ipywidgets

Solving environment: done

# All requested packages already installed.



The resulting graphical representation of this tree:

In [15]:
from ipywidgets import interact
from graphviz import Source
from IPython.display import Image, display

@interact(depth=(1, 4), min_split=(0.05, 1), min_leaf=(0.05, 0.5))
def make_tree(depth, min_split, min_leaf):
    """Fit a tree with the chosen settings and visualise it

    Displays the fitted decision tree inline in jupyter; the sliders
    control max_depth, min_samples_split and min_samples_leaf.
    """

    # Refit on the full data with the slider values as hyperparameters
    model = tree.DecisionTreeClassifier(random_state=42,
            criterion='entropy', max_depth=depth,
            min_samples_split=min_split,
            min_samples_leaf=min_leaf).fit(X, y)
    # Export the tree in Graphviz DOT notation and render it as a PNG
    dt_full = Source(tree.export_graphviz(model, out_file=None,
                     class_names=iris_data['species'].unique(),
                     filled=True, feature_names=feature_names))
    display(Image(dt_full.pipe(format='png')))

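
If Graphviz is not available, the rules can also be printed as plain text with `export_text` from `sklearn.tree`. The sketch below is self-contained and therefore assumes scikit-learn's bundled iris data in place of the CSV file; the depth limit of 2 is an arbitrary choice that keeps the output short.

```python
from sklearn import tree
from sklearn.datasets import load_iris

# Assumption: the bundled iris data stands in for ./data/iris.csv.
iris = load_iris()
feature_names = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']

clf = tree.DecisionTreeClassifier(criterion='entropy', max_depth=2,
                                  random_state=42)
clf.fit(iris.data, iris.target)

# One line per node: a threshold test for inner nodes, a class for leaves.
rules = tree.export_text(clf, feature_names=feature_names)
print(rules)
```

Each level of indentation in the printed output corresponds to one level of depth in the tree.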