# Interactive Example of Decision Trees

This Jupyter notebook interactively illustrates the inner workings of the machine learning (ML) algorithm called the decision tree. 
This notebook is based on the same dataset as the `fisher_discrimant.ipynb` notebook: the __[Iris dataset](https://en.wikipedia.org/wiki/Iris_flower_data_set)__. The focus of this small example is neither the actual code nor any specific result but, hopefully, a better understanding of the foundation of all tree-based ML algorithms: the decision tree. This is also why we don't describe the code in great detail and simply load the dataset directly from scikit-learn with the __[load_iris](https://scikit-learn.org/stable/datasets/index.html#iris-plants-dataset)__ function; the first part of the code should hopefully look familiar by now. 

***

### Authors: 
- Christian Michelsen (Niels Bohr Institute)
- Troels C. Petersen (Niels Bohr Institute)

### Date:    
- 03-01-2020 (latest update)

## Before you start: Installation instructions for Graphviz

To make this notebook work, you need to install a visualization package called Graphviz. The installation instructions are given below:

---

### Installing Graphviz for the Decision Tree example (Linux)

__Make sure that you have loaded anaconda before proceeding__


* Go to [the Graphviz download page](https://graphviz.gitlab.io/_pages/Download/Download_source.html)

* Download the archive: __graphviz-2.40.1.tar.gz__

* In your terminal, unpack the archive to a location of your choice: `tar -zxvf graphviz-2.40.1.tar.gz -C /absolute/path/of/yourchoice/`

* In the unzipped folder, run `./autogen.sh`.

* In the unzipped folder, create a build directory: `mkdir build_directory/`

* In the unzipped folder, run `./configure --prefix /absolute/path/of/yourchoice/build_directory/`

* In the unzipped folder, type `make` then `make install`

* There should now be a `bin` folder in `/absolute/path/of/yourchoice/build_directory/`. Add the `bin` directory to your PATH variable:

`export PATH=$PATH:/absolute/path/of/yourchoice/build_directory/bin/`

* Install the Python interface of Graphviz using `pip install graphviz`

* The notebook should now run.


---

### macOS instructions

__Make sure that you have loaded anaconda before proceeding__

* Install graphviz using homebrew: `brew install graphviz`, then `brew upgrade graphviz`

* Install the python module for graphviz using pip: `pip install graphviz`


---

### Windows instructions

If using Anaconda, the following has been reported to work:

`conda install -c anaconda python-graphviz`


Otherwise, please follow these instructions:

1. Download the stable version from the Graphviz website, install it, and add it to your PATH. If the PATH entry does not work, go to the advanced system settings on your laptop and add it manually there. Note that you should run as administrator if you are using a laptop from KU.

2. Instead of `pip install`, use `conda install` to install graphviz for Python.

3. Add the Graphviz path from within Python (after `import os`):
`os.environ["PATH"] += os.pathsep + 'C:...\\Graphviz 2.44.1\\bin'`
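
Whichever platform you are on, it can be useful to check that the installation worked before running the rest of the notebook. The snippet below is a minimal sketch of such a check, assuming that the Python `graphviz` package renders graphs by calling the Graphviz executable `dot`; the helper name `graphviz_dot_path` is our own and not part of any library.

```python
import shutil


def graphviz_dot_path():
    """Return the full path of the Graphviz `dot` executable, or None if it is not on PATH."""
    return shutil.which("dot")


dot = graphviz_dot_path()
if dot is None:
    print("Graphviz 'dot' was not found on PATH; rendering the tree will likely fail.")
else:
    print(f"Found Graphviz 'dot' at: {dot}")
```

If `dot` is not found, revisit the PATH step for your platform above.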


In [1]:
from sklearn.tree import DecisionTreeClassifier, export_graphviz
from sklearn import tree
from sklearn.datasets import load_iris, load_wine
from sklearn.metrics import accuracy_score
from IPython.display import SVG
from graphviz import Source
from IPython.display import display                               
from ipywidgets import interactive

verbose = True
N_verbose = 10

Next, load the dataset and extract the feature matrix (the independent variables), which has shape `(# samples, # features) = (150, 4)`, together with the target vector:

In [2]:
# Load dataset - either Fisher's Iris data or the alternative Wine data
data = load_iris()
# data = load_wine()

# Feature matrix
X = data.data
# Target vector
y = data.target
# Feature names
feature_names = data.feature_names

if verbose:
    print(feature_names + ['class'])
    for i, (xi, yi) in enumerate(zip(X, y)):
        if i < N_verbose:
            print(xi, yi)


['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)', 'class']
[5.1 3.5 1.4 0.2] 0
[4.9 3.  1.4 0.2] 0
[4.7 3.2 1.3 0.2] 0
[4.6 3.1 1.5 0.2] 0
[5.  3.6 1.4 0.2] 0
[5.4 3.9 1.7 0.4] 0
[4.6 3.4 1.4 0.3] 0
[5.  3.4 1.5 0.2] 0
[4.4 2.9 1.4 0.2] 0
[4.9 3.1 1.5 0.1] 0
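
The first ten rows all belong to class 0, so the printout alone says little about the other classes. As a small optional check, assuming the Iris data is loaded as above, the following sketch prints the array shapes and maps each integer class label to its species name and sample count:

```python
import numpy as np
from sklearn.datasets import load_iris

data = load_iris()
X, y = data.data, data.target

print("X shape:", X.shape)
print("y shape:", y.shape)

# Count how many samples carry each integer label 0, 1, 2
class_counts = np.bincount(y)
for label, (name, count) in enumerate(zip(data.target_names, class_counts)):
    print(f"class {label}: {name} ({count} samples)")
```

The integer labels are the same ones that appear as class names in the tree graphs further down.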


## Part A: Max Depth

First we start with the simple part of the exercise: understanding the (hyper)parameter `depth` and what it changes. Play around with the slider below and see how the decision tree graph changes along with the accuracy of the prediction (which is computed on the same data the tree was fitted on):

In [3]:
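def fit_and_grapth_estimator(estimator):
    """Fit the estimator on (X, y), print its training accuracy and display the tree."""
    estimator.fit(X, y)
    # Accuracy is evaluated on the same data the tree was fitted on
    accuracy = accuracy_score(y, estimator.predict(X))
    print(f'Accuracy: {accuracy:.4f}')
    # Export the fitted tree as Graphviz DOT source and render it inline as SVG
    graph = Source(tree.export_graphviz(estimator, out_file=None, feature_names=feature_names,
                                        class_names=['0', '1', '2'], filled = True))
    display(SVG(graph.pipe(format='svg')))
    return estimator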


def plot_tree_simple(depth=1):
    estimator = DecisionTreeClassifier(random_state = 0, max_depth = depth)
    estimator = fit_and_grapth_estimator(estimator)
    return estimator

inter_simple = interactive(plot_tree_simple, depth=(1, 5, 1))

display(inter_simple)

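
If the slider is not available (for example when reading a static export of the notebook), the same exploration can be done in a plain loop. The sketch below is one possible non-interactive version, not part of the original exercise: it refits the tree for each depth from 1 to 5, prints the number of leaves and the training accuracy, and uses scikit-learn's `export_text` to show a shallow tree as plain text, which does not require Graphviz.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier, export_text

data = load_iris()
X, y = data.data, data.target

# Refit the tree for each maximum depth and record the training accuracy
train_accuracy = {}
for depth in range(1, 6):
    estimator = DecisionTreeClassifier(random_state=0, max_depth=depth).fit(X, y)
    train_accuracy[depth] = accuracy_score(y, estimator.predict(X))
    print(f"max_depth={depth}: {estimator.get_n_leaves()} leaves, "
          f"training accuracy {train_accuracy[depth]:.4f}")

# Plain-text rendering of a shallow tree (no Graphviz needed)
shallow = DecisionTreeClassifier(random_state=0, max_depth=2).fit(X, y)
print(export_text(shallow, feature_names=list(data.feature_names)))
```

Keep in mind that these accuracies are measured on the training data, so a deeper tree scoring higher here does not by itself mean it generalizes better.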

## Part B: More hyperparameters

Now we add some extra hyperparameters to vary: the criterion `crit`, which states how the quality of a split is measured (either the __[Gini impurity](https://en.wikipedia.org/wiki/Decision_tree_learning#Gini_impurity)__ or the __[Entropy](https://en.wikipedia.org/wiki/Decision_tree_learning#Information_gain)__); the strategy behind the splitting algorithm, `split`; and the minimum number of samples required to split a node, `min_split`, or to form a leaf, `min_leaf`. See the following link for more information about the __[DecisionTreeClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html)__.
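
To make the two criteria concrete, here is a small sketch in pure Python using the textbook definitions: the Gini impurity $1 - \sum_k p_k^2$ and the entropy $-\sum_k p_k \log_2 p_k$, where $p_k$ is the fraction of samples in class $k$. The function names are our own. For three equally populated classes, as in the full Iris dataset, this gives a Gini impurity of $2/3$ and an entropy of $\log_2 3 \approx 1.585$, while a node containing a single class has impurity zero under both measures.

```python
from collections import Counter
from math import log2


def class_fractions(labels):
    """Fraction of samples in each class present in `labels`."""
    n = len(labels)
    return [count / n for count in Counter(labels).values()]


def gini_impurity(labels):
    """Gini impurity: 1 - sum_k p_k^2."""
    return 1.0 - sum(p * p for p in class_fractions(labels))


def entropy(labels):
    """Shannon entropy in bits: -sum_k p_k log2(p_k)."""
    return sum(-p * log2(p) for p in class_fractions(labels))


balanced = [0] * 50 + [1] * 50 + [2] * 50  # like the root node of the Iris tree
pure = [0] * 50                            # a node containing a single class

for name, labels in [("balanced", balanced), ("pure", pure)]:
    print(f"{name}: gini = {gini_impurity(labels):.4f}, entropy = {entropy(labels):.4f}")
```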

Try to play around with the different parameters and see how they affect the tree graph and accuracy:

In [4]:
def plot_tree_advanced(depth=1, crit='gini', split='best', min_split=2, min_leaf=1):
    estimator = DecisionTreeClassifier(random_state = 0, 
                                       criterion = crit, 
                                       splitter = split, 
                                       max_depth = depth, 
                                       min_samples_split=min_split, 
                                       min_samples_leaf=min_leaf,
                                      )
    return fit_and_grapth_estimator(estimator)

inter_advanced = interactive(plot_tree_advanced, 
                             depth=(1, 5, 1),
                             min_split=(2, 10), 
                             min_leaf=(1, 10),
                             crit = ["gini", "entropy"], 
                             split = ["best", "random"], 
                            )

display(inter_advanced)

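
To get a feeling for what the `best` splitting strategy and the `min_leaf` constraint do at a single node, the sketch below scans all candidate thresholds on one feature and keeps the one with the lowest weighted Gini impurity of the two children. This is a simplified illustration under our own assumptions, not scikit-learn's actual implementation; the function names and the nine toy measurements are made up for the example.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_threshold(values, labels, min_leaf=1):
    """Return (threshold, weighted child impurity) of the best split `value <= threshold`.

    The threshold is None if no allowed split lowers the impurity of the node.
    """
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = (None, gini(labels))
    for i in range(1, n):
        low, high = pairs[i - 1][0], pairs[i][0]
        if low == high:
            continue  # cannot split between identical values
        if i < min_leaf or n - i < min_leaf:
            continue  # a child would contain fewer than min_leaf samples
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        score = (i * gini(left) + (n - i) * gini(right)) / n
        if score < best[1]:
            best = ((low + high) / 2, score)
    return best


# Hypothetical toy data: one feature, three classes with three samples each
toy_values = [1.4, 1.3, 1.5, 4.7, 4.5, 4.9, 6.0, 5.1, 5.9]
toy_labels = [0, 0, 0, 1, 1, 1, 2, 2, 2]

print("min_leaf=1:", best_threshold(toy_values, toy_labels, min_leaf=1))
print("min_leaf=4:", best_threshold(toy_values, toy_labels, min_leaf=4))
```

With `min_leaf=1` the split can isolate one class completely, whereas a larger `min_leaf` rules out that split and forces a less pure one, which is the same effect you see when moving the `min_leaf` slider above.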