# Decision Trees

## Setup
### Imports

In [2]:
import pandas as pd                                     # for dataset manipulation (DataFrames)
import numpy as np                                      # allows some mathematical operations
import matplotlib.pyplot as plt                         # library used to display graphs
import seaborn as sns                                   # more convenient visualisation library for dataframes
import time                                             # for execution time measurement

### Import `graphviz`
To visualize decision trees, you can use the `graphviz` library. To install it in Datalore, run the following commands in the terminal (Tools > Terminal):
```
sudo apt update
sudo apt-get install graphviz
```
Then, import the library in Python. Remember to add the `graphviz` Python package to your environment!

If you are working locally, you can refer to [the graphviz website](https://graphviz.org/download/).

In [3]:
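import graphviz
import os

# Windows only: make the Graphviz binaries visible on PATH (adjust the path to your installation)
if os.name == "nt":
    os.environ["PATH"] += os.pathsep + 'C:\\Program Files\\Graphviz\\bin\\'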

## Decision trees

### Importing and observing the data
Using what you've learned in the previous practicals, answer the following questions:
1. Import the iris dataset directly from `scikit-learn`.
2. Convert it to a `pandas` DataFrame. Remember to add the `target` column. ***Note**: Keep the `target` column as digits; do not replace the digits with the class names. This is because machine learning algorithms generally work with numerical values.*
3. Using the relevant graphs and functions, carry out an analysis of the dataset and its features.
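
As a possible starting point for the steps above, here is a minimal sketch. The variable names `iris` and `iris_df` are illustrative choices, not names required by the practical.

```python
import pandas as pd
from sklearn.datasets import load_iris

# Load the dataset as a Bunch object (data, target, feature_names, target_names)
iris = load_iris()

# Build a DataFrame from the feature matrix and append the numeric target
iris_df = pd.DataFrame(iris.data, columns=iris.feature_names)
iris_df["target"] = iris.target

# First look: dimensions, summary statistics and class balance
print(iris_df.shape)
print(iris_df.describe())
print(iris_df["target"].value_counts())
```

From there, a plot such as `sns.pairplot(iris_df, hue="target")` is one way to see how well each pair of features separates the classes.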

*[Your comments here]*

In [4]:
# Your code here

### Preparing the data

1. Split the data into a training set and a test set using an 80/20 split.
2. Display the number of examples in each set.
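
One way to do this is with `train_test_split`, sketched below. The `random_state` and `stratify` arguments are optional choices made here for reproducibility and class balance; the practical does not require them.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# test_size=0.2 keeps 20% of the examples for the test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Number of examples in each set
print(len(X_train), len(X_test))
```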

*[Your comments here]*

In [5]:
# Your code here

### Training a decision tree

We will now train a decision tree on this dataset. In the following cell, import, instantiate and train a decision tree classifier.
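
The three steps could look like the following sketch. It reloads and splits the data so that it runs on its own; `tree_clf` is an illustrative name, and fixing `random_state` is an assumption made only to get repeatable results.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Instantiate the classifier with default hyperparameters
tree_clf = DecisionTreeClassifier(random_state=42)

# Train it on the training set only
tree_clf.fit(X_train, y_train)

# Basic facts about the fitted tree
print(tree_clf.get_depth(), tree_clf.get_n_leaves())
```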

In [6]:
# Import the decision tree classifier from scikit-learn


# Instantiate the classifier


# Train the classifier

### Visualizing a decision tree

The `graphviz` library that we installed at the beginning of this practical can be used to visualize decision trees.
1. Complete the following code cell to display the decision tree you created above.
2. Explain the content of the blocks in the resulting graph.
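
If `graphviz` is not available, a plain-text view of the same tree can help when reading the nodes. The sketch below uses `export_text` from `sklearn.tree` on a small tree; limiting `max_depth` to 2 is an assumption made here only to keep the output short.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# A deliberately shallow tree so the printed rules stay readable
tree_clf = DecisionTreeClassifier(max_depth=2, random_state=42)
tree_clf.fit(iris.data, iris.target)

# Each line is either a split condition (feature <= threshold) or a leaf with its predicted class
rules = export_text(tree_clf, feature_names=list(iris.feature_names))
print(rules)
```

The text output only shows thresholds and predicted classes; the `graphviz` rendering additionally reports the impurity, the number of samples and the per-class counts in each node.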

In [10]:
from sklearn.tree import export_graphviz

graph_data = export_graphviz(
    decision_tree=..., # the decision tree you trained 
    feature_names=..., # the names of the features  
    class_names=...,   # the names of the classes
    
    # the following parameters format the graph
    filled=True, 
    rounded=True,  
    special_characters=True
)

graph = graphviz.Source(graph_data)  
graph 

### Measuring the performance of the classifier

In the cells below, answer the following questions:
1. Measure and analyze the performance of the decision tree you have built.
2. Create and train two new decision trees: one using only sepal features, and one using only petal features.
3. Measure the performance of these new trees. Compare and comment.
4. [**Bonus**] Using the `criterion` parameter, build new decision trees using a different splitting criterion. How does it impact the performance? 
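
A possible way to organize the comparison is sketched below, using accuracy on the test set as the metric. The column indices assume the feature order returned by `load_iris` (sepal length, sepal width, petal length, petal width), and the names `feature_subsets` and `scores` are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Columns 0-1 are the sepal features, columns 2-3 the petal features
feature_subsets = {"all": [0, 1, 2, 3], "sepal": [0, 1], "petal": [2, 3]}

scores = {}
for name, cols in feature_subsets.items():
    # Train one tree per feature subset and evaluate it on the same test examples
    clf = DecisionTreeClassifier(random_state=42)
    clf.fit(X_train[:, cols], y_train)
    scores[name] = accuracy_score(y_test, clf.predict(X_test[:, cols]))

print(scores)
```

For the bonus question, the same loop can be reused with `DecisionTreeClassifier(criterion="entropy", random_state=42)`. With only 30 test examples, small differences in accuracy should be interpreted with care.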

*[Your comments here]*

In [11]:
# Your code here

## Bonus questions

1. Train a random forest classifier on this same task. Try to optimize its parameters. Compare and comment on the results, as well as the pros and cons of using random forests over decision trees.
2. Refer to the documentation to find out how to get the feature importance values for each feature. How are these values computed? How do you interpret them in this case?
3. For practice, you can try to follow the same procedure using another of [`scikit-learn`'s toy datasets](https://scikit-learn.org/stable/datasets/toy_dataset.html).
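
As a starting point for the first two questions, here is a minimal sketch of a random forest with its feature importances. The hyperparameter values are placeholders rather than tuned settings, and `forest` is an illustrative name.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42, stratify=iris.target
)

# n_estimators is the number of trees in the forest (untuned placeholder value)
forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(X_train, y_train)

# Accuracy on the test set
print(forest.score(X_test, y_test))

# Impurity-based importances, one value per feature
for name, importance in zip(iris.feature_names, forest.feature_importances_):
    print(f"{name}: {importance:.3f}")
```

The `feature_importances_` attribute is also available on a fitted `DecisionTreeClassifier`, which makes it possible to compare the two models on the same footing.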

*[Your comments here]*

In [None]:
# Your code here