# Assignment 3
## Decision Trees and Random Forests for Regression, Part 1

### About this notebook

The general description, the instructions, and the questions for this walk-through (Part 1 of the task, i.e. this notebook) are found in the assignment description in Canvas.


In [None]:
# YOU DON'T HAVE TO RUN THIS IF EVERYTHING IS ALREADY INSTALLED CORRECTLY
!pip3 install --upgrade pip
!pip3 install graphviz
!pip3 install dtreeviz
!pip3 install numpy scipy

## Steps 0-2: Dataset(s)

**Step 0:** First, load the dataset. Ultimately, you should be working with the California housing data, but for quicker test runs, it might help to first start out with the Diabetes data.

In [None]:
from sklearn.datasets import load_diabetes
from sklearn.datasets import fetch_california_housing
from ConceptDataRegr import ConceptDataRegr
from sklearn.model_selection import train_test_split 

import ssl
ssl._create_default_https_context = ssl._create_unverified_context

test_case = 'diabetes'
#test_case = 'california'

if test_case == 'california':
    dataset = fetch_california_housing()
elif test_case == 'diabetes':
    dataset = load_diabetes()
else:
    raise ValueError('Unknown test case')

X = dataset.data
y = dataset.target


**Step 1:** Get some information about the dataset you're looking at

In [None]:
if test_case == 'california' :
    print("target:", list(dataset.target_names))
print("features:", list(dataset.feature_names))
print("description:", dataset.DESCR)


**Step 2:** Split the data into train, validation and test sets.

In [None]:
# Split into train (70%), validation (15%) and test (15%) sets with scikit-learn's train_test_split
train_ratio = 0.70
validation_ratio = 0.15
test_ratio = 0.15

# First split off the training set; the remaining 30% is held out ...
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=1 - train_ratio, random_state=0)
# ... and then divided into validation and test sets
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=test_ratio/(test_ratio + validation_ratio), random_state=0)
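
To sanity-check a three-way split like the one in Step 2, it can help to run the same two-step procedure on a small synthetic array and look at the resulting sizes. The sketch below is only an illustration: the helper `three_way_split` and the toy data are hypothetical and not part of the assignment code.

```python
import numpy as np
from sklearn.model_selection import train_test_split

def three_way_split(X, y, train_ratio=0.70, validation_ratio=0.15, test_ratio=0.15, seed=0):
    """Hypothetical helper: split into train/validation/test sets in two steps."""
    # Step 1: keep train_ratio of the data for training, hold out the rest
    X_train, X_rest, y_train, y_rest = train_test_split(
        X, y, test_size=1 - train_ratio, random_state=seed)
    # Step 2: divide the held-out part between validation and test
    X_val, X_test, y_val, y_test = train_test_split(
        X_rest, y_rest,
        test_size=test_ratio / (test_ratio + validation_ratio),
        random_state=seed)
    return X_train, X_val, X_test, y_train, y_val, y_test

# Toy data: 200 samples with 3 features (made up for this check)
X_toy = np.arange(600, dtype=float).reshape(200, 3)
y_toy = np.arange(200, dtype=float)

X_tr, X_va, X_te, y_tr, y_va, y_te = three_way_split(X_toy, y_toy)
print("train/val/test sizes:", len(X_tr), len(X_va), len(X_te))
```

Because scikit-learn rounds the requested fractions to whole samples, the sizes can differ from the exact 70/15/15 proportions by a sample or so.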

## Step 3: The SKLearn Decision Tree Regressor

Set up and fit a [DecisionTreeRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html?highlight=decision+tree) with *random_state=0*, and use its *score* method to evaluate it in a single step. Also check out [cross_val_score](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html) to learn about a different type of evaluation tool. Use the documentation where needed. Be prepared to answer "random" questions posed by the TA.
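
As a rough illustration of how the two evaluation tools differ, the sketch below runs both on synthetic data. Everything in it (the `make_regression` data and the names `X_demo`, `demo_tree`, and so on) is an assumption made for this example and is separate from the regressors you are asked to build.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data (made up for this illustration)
X_demo, y_demo = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)
X_demo_train, X_demo_test, y_demo_train, y_demo_test = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0)

demo_tree = DecisionTreeRegressor(random_state=0)

# cross_val_score clones the estimator, fits it on k-1 folds and scores it on the
# remaining fold, repeated k times; for a regressor the default score is R^2
cv_scores = cross_val_score(demo_tree, X_demo_train, y_demo_train, cv=5)
print("R^2 per fold:", cv_scores)
print("mean R^2 over folds:", cv_scores.mean())

# score evaluates an already fitted model once on the data it is given
demo_tree.fit(X_demo_train, y_demo_train)
train_r2 = demo_tree.score(X_demo_train, y_demo_train)
test_r2 = demo_tree.score(X_demo_test, y_demo_test)
print("R^2 on training data:", train_r2)
print("R^2 on held-out data:", test_r2)
```

Comparing the training score with the cross-validated and held-out scores is one way to see how much an unrestricted tree adapts to the data it was fitted on.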

In [None]:
from sklearn.tree import DecisionTreeRegressor

# set up your first regressor, e.g.: regressor1 = ...

In [None]:
from sklearn.model_selection import cross_val_score
# apply cross_val_score to your regressor1 (note: this is done on the training data) and see what happens
# start with cv=10, but you can also test other values


In [None]:
# fit your regressor1 to training data and evaluate with 'score' on the test data


## Step 4: Decision Tree Parameters
Now, work with two parameters, *max_depth* and *min_samples_leaf*. Read the documentation to understand what they do. Create two more regressors and experiment with different settings for those parameters, e.g. 1 for *max_depth* and 20 for *min_samples_leaf*. Evaluate with both *cross_val_score* and *score*.
Explain the outcomes.
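
One way to see what the two parameters do structurally is to inspect the size of the fitted trees. The sketch below does this on synthetic data; the data and all names (`X_demo`, `settings`, `fitted`) are assumptions for illustration only, and it looks at tree structure rather than at the scores you are asked to explain.

```python
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data (made up for this illustration)
X_demo, y_demo = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)

settings = {
    "default": {},
    "max_depth=1": {"max_depth": 1},
    "min_samples_leaf=20": {"min_samples_leaf": 20},
}

fitted = {}
for name, params in settings.items():
    model = DecisionTreeRegressor(random_state=0, **params).fit(X_demo, y_demo)
    fitted[name] = model
    print(f"{name:>20}: depth={model.get_depth()}, leaves={model.get_n_leaves()}")

# Smallest number of training samples in any leaf of the min_samples_leaf=20 tree
t = fitted["min_samples_leaf=20"].tree_
is_leaf = t.children_left == -1  # leaves have no children
smallest_leaf = t.n_node_samples[is_leaf].min()
print("smallest leaf:", smallest_leaf)
```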

In [None]:
# regressor2...



In [None]:
# regressor3 ...



## Steps 5-6: Decision Tree Visualization

The next cells give examples of how to visualize regressor2 and regressor3.

**Step 5:** Visualisation with GraphViz, as used for the lecture slides, but here with rectangular nodes

In [None]:
from sklearn import tree
import graphviz
from IPython.display import Image

# The visualisation below assumes a regressor called 'regressor2'. 
# Change in the code below if your naming above is different

dot_data = tree.export_graphviz(regressor2, feature_names=dataset.feature_names, out_file=None, filled=True, rounded=True, special_characters=True)
graph = graphviz.Source(dot_data, format="png") 
graph.render("decision_tree_regressor2")
Image("decision_tree_regressor2.png")

In [None]:
# The visualisation below assumes a regressor called 'regressor3'. 
# Change in the code below if your naming above is different

dot_data = tree.export_graphviz(regressor3, feature_names=dataset.feature_names, out_file=None, filled=True, rounded=True, special_characters=True)
graph = graphviz.Source(dot_data, format="png") 
graph.render("decision_tree_regressor3")
Image("decision_tree_regressor3.png")

**Step 6:** Another way to visualize the decision trees is to use dtreeviz. These plots take quite some time to generate, so we recommend using this visualization tool only for trees with few nodes.

In [None]:
import dtreeviz

# The visualisation below assumes a regressor called 'regressor2'. 
# Change in the code below if your naming above is different

viz = dtreeviz.model(regressor2, X, y,
                target_name="target",
                feature_names=dataset.feature_names)

viz.view(fontname="monospace", scale=3) # this displays the output inside the notebook.


# If you want to store the output in a file use:
#viz.save("dtreeviz.svg")


## Steps 7-9: Explainability

**Step 7:** If you want to visualize (explain) the decision path for one prediction, you can also use dtreeviz:

In [None]:
import numpy as np

# The visualisation below assumes a regressor called 'regressor2'. 
# Change in the code below if your naming above is different

sample = X_test[np.random.randint(0, len(X_test)),:] # random sample from the test set

viz = dtreeviz.model(regressor2, X, y,
                target_name="target",
                feature_names=dataset.feature_names)

viz.view(fontname="monospace", scale=3, x = sample)

**Step 8:** For bigger trees, you can show just the decision path

In [None]:
# The visualisation below assumes a regressor called 'regressor3'. 
# Change in the code below if your naming above is different

viz = dtreeviz.model(regressor3, X, y,
                target_name="target",
                feature_names=dataset.feature_names)
#viz.view()
viz.view(fontname="monospace", scale=3, x=sample, show_just_path=True)

**Step 9:** Another option for explaining a prediction made by a big tree is a textual description of the decision path

In [None]:


# The call below reuses 'viz' and 'sample' from the cells above ('viz' was built for 'regressor3').
# Change in the code below if your naming above is different

print(viz.explain_prediction_path(sample))
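
The same kind of decision path can also be read directly from a fitted scikit-learn tree, without dtreeviz. The sketch below is a minimal illustration on synthetic data; the data and the names `demo_tree` and `x_one` are assumptions for this example, and it uses scikit-learn's `decision_path` and `apply` methods rather than anything from the cells above.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data and a small tree (made up for this illustration)
X_demo, y_demo = make_regression(n_samples=200, n_features=4, noise=5.0, random_state=0)
demo_tree = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X_demo, y_demo)

x_one = X_demo[:1]                                   # a single sample, kept 2D
node_ids = demo_tree.decision_path(x_one).indices    # nodes visited, root first
leaf_id = demo_tree.apply(x_one)[0]                  # leaf the sample ends up in

t = demo_tree.tree_
for node in node_ids:
    if node == leaf_id:
        # the value stored in the leaf is the tree's prediction for this sample
        print(f"leaf {node}: predict {t.value[node, 0, 0]:.2f}")
    else:
        f, thr = t.feature[node], t.threshold[node]
        op = "<=" if x_one[0, f] <= thr else ">"
        print(f"node {node}: x[{f}] = {x_one[0, f]:.3f} {op} {thr:.3f}")
```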

## Step 10: Random Forests

Create a *RandomForestRegressor*, e.g. for 5 or 10 trees, and experiment with different parameters for it (explore the documentation!). Test at least two different parameter sets (evaluate with *score*) and discuss the outcomes.
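
To get a feeling for how the trees in a forest are combined, the sketch below compares the forest's prediction with the average of the predictions of its individual trees. It runs on synthetic data, and the names `X_demo` and `demo_forest` are assumptions for illustration; it is not a suggested parameter set for the task.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression data (made up for this illustration)
X_demo, y_demo = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)

demo_forest = RandomForestRegressor(n_estimators=5, random_state=0).fit(X_demo, y_demo)
print("number of trees:", len(demo_forest.estimators_))

# Predictions of each individual tree for the first three samples
per_tree = np.stack([est.predict(X_demo[:3]) for est in demo_forest.estimators_])
forest_pred = demo_forest.predict(X_demo[:3])

print("per-tree predictions:\n", per_tree)
print("mean over trees:  ", per_tree.mean(axis=0))
print("forest prediction:", forest_pred)
```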

In [None]:
from sklearn.ensemble import RandomForestRegressor

# number_of_trees = ... 
# forest = ...

In [None]:
# You can visualise the trees as above - but maybe not more than 5 and not the really big ones ;-)

# The visualisation below assumes a random forest called 'forest' and a parameter 'number_of_trees'. 
# Change in the code below if your naming above is different

import matplotlib.pyplot as plt

for treeid in range(number_of_trees):
    dot_data = tree.export_graphviz(forest.estimators_[treeid], feature_names=dataset.feature_names, out_file=None, filled=True, rounded=True, special_characters=True)
    graph = graphviz.Source(dot_data, format="png") 
    graph.render("forest_treeid"+str(treeid))

# squeeze=False keeps 'ax' a 2D array, so ax.flat also works for a single tree
fig, ax = plt.subplots(number_of_trees, 1, squeeze=False) # use plt.subplots(number_of_trees//2, 2, squeeze=False) if you want two columns (even number of trees)
for i, axi in enumerate(ax.flat):
    axi.set_title("Tree {}".format(i))
    tree.plot_tree(forest.estimators_[i], ax=axi, feature_names=dataset.feature_names, filled=True, rounded=True)
fig.tight_layout()