# Decision Trees

For this notebook, your environment will require the following packages:

* pandas
* numpy
* scikit-learn
* ucimlrepo
* certifi

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier
from ucimlrepo import fetch_ucirepo 
np.random.seed(42)

In [2]:
# fetch dataset
glass_identification = fetch_ucirepo(id=42)

# data (as pandas dataframes)
X = glass_identification.data.features
y = glass_identification.data.targets

Let's evaluate decision trees on the Glass Identification dataset using 5-fold cross-validation (CV).

Scikit-Learn has a built-in function, `cross_val_score`, that performs simple k-fold CV evaluations. We can use it here for exploration.
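To make the idea of stratified folds concrete, here is a minimal pure-Python sketch of how sample indices might be dealt into 5 folds so that each fold keeps roughly the same class mix. The function `stratified_folds` and the toy labels are hypothetical names for illustration; this is a simplified stand-in, not the algorithm `StratifiedKFold` actually uses.

```python
import random
from collections import defaultdict

def stratified_folds(labels, n_splits=5, seed=42):
    """Assign each sample index to one of n_splits folds, class by class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(n_splits)]
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Deal indices round-robin so each fold gets a near-equal share of this class.
        for position, idx in enumerate(indices):
            folds[position % n_splits].append(idx)
    return folds

# Hypothetical toy labels: 10 samples of class "a", 5 of class "b".
toy_labels = ["a"] * 10 + ["b"] * 5
folds = stratified_folds(toy_labels)
for k, test_idx in enumerate(folds):
    counts = {c: sum(toy_labels[i] == c for i in test_idx) for c in ("a", "b")}
    print(f"fold {k}: test size {len(test_idx)}, class counts {counts}")
```

In each round of CV, one fold serves as the test set and the remaining four are used for training, so every sample is tested exactly once.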

We will look at three models:
* `tree_model_1` : This version grows the tree fully, splitting nodes until the leaves are pure. Impurity is measured with the Gini index, which is the default criterion.
* `tree_model_2` : This version tries to regularize the tree by refusing to split a node that too few samples reach.  The tuning parameter is `min_samples_split`.
* `tree_model_3` : This version tries to regularize the tree by refusing to split a node when the split would decrease impurity only slightly.  The tuning parameter is `min_impurity_decrease`.

To learn more about the Scikit-Learn DecisionTreeClassifier and the tuning parameters, see the documentation here:

<https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html>
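As a rough illustration of the two regularization rules, the sketch below computes the Gini impurity of a node and the weighted impurity decrease of a candidate split, then applies both thresholds. The helper names `gini` and `weighted_impurity_decrease` and the toy labels are assumptions made for this example. The real classifier applies further checks (such as `max_depth` and `min_samples_leaf`) that are omitted here.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def weighted_impurity_decrease(parent, left, right, n_total):
    """Impurity decrease of a split, weighted by the node's share of all samples."""
    n_t = len(parent)
    child_impurity = (len(left) / n_t) * gini(left) + (len(right) / n_t) * gini(right)
    return (n_t / n_total) * (gini(parent) - child_impurity)

# Hypothetical toy node: three samples of class "a" and three of class "b".
parent = ["a", "a", "a", "b", "b", "b"]
left, right = ["a", "a", "a"], ["b", "b", "b"]  # a perfect split
decrease = weighted_impurity_decrease(parent, left, right, n_total=len(parent))
print(f"parent Gini: {gini(parent):.3f}, impurity decrease: {decrease:.3f}")

# Thresholds matching the values tried in this notebook.
min_samples_split = 4
min_impurity_decrease = 0.002
would_split = len(parent) >= min_samples_split and decrease >= min_impurity_decrease
print("split allowed:", would_split)
```

Raising either threshold makes splits harder to justify, which tends to produce a smaller tree.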

In [3]:
# Stratified 5-fold splitter (n_splits=5 is the default, stated here for clarity).
strat_5_fold = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
# Model 1 is a decision tree classifier that will build complete trees.
tree_model_1 = DecisionTreeClassifier()

# TODO: Find a value for X in the line below that will perform better than `tree_model_1`:
tree_model_2 = DecisionTreeClassifier(min_samples_split=4) # Don't split a node that has fewer than X samples.

# TODO: Find a value for X in the line below that will perform better than `tree_model_1`:
tree_model_3 = DecisionTreeClassifier(min_impurity_decrease=0.002) # Don't split if the impurity decrease is less than X.

Evaluate the first model, which grows the trees fully (but might overfit).

In [4]:
scores = cross_val_score(tree_model_1, X, y, scoring='balanced_accuracy', cv=strat_5_fold)
print(f"Tree model #1 balanced accuracy mean: {scores.mean():0.3f} stdev: {scores.std():0.3f}")

Tree model #1 balanced accuracy mean: 0.642 stdev: 0.115
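The `balanced_accuracy` score reported here is the mean of the per-class recall values, so every class counts equally regardless of how many samples it has. The sketch below shows that calculation in plain Python on hypothetical toy data; the function `balanced_accuracy` is written for illustration and is not the library implementation.

```python
from collections import defaultdict

def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall over the classes present in y_true."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        if truth == pred:
            correct[truth] += 1
    recalls = [correct[c] / total[c] for c in total]
    return sum(recalls) / len(recalls)

# Hypothetical imbalanced toy data: 8 samples of class 1, 2 of class 2.
y_true = [1] * 8 + [2] * 2
y_pred = [1] * 10  # always predict the majority class
plain_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"plain accuracy: {plain_accuracy:.2f}")
print(f"balanced accuracy: {balanced_accuracy(y_true, y_pred):.2f}")
```

A model that always predicts the majority class looks good on plain accuracy but scores only 0.5 on balanced accuracy in this two-class example, which is why the metric is a reasonable choice when classes are unevenly represented.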


**TODO**: After you finish the TODO section above that declares `tree_model_2`, run the code block below (un-comment its lines first if needed) to see how that model performs.

In [5]:
scores = cross_val_score(tree_model_2, X, y, scoring='balanced_accuracy', cv=strat_5_fold)
print(f"Tree model #2 balanced accuracy mean: {scores.mean():0.3f} stdev: {scores.std():0.3f}")

Tree model #2 balanced accuracy mean: 0.677 stdev: 0.115


**TODO**: After you finish the TODO section above that declares `tree_model_3`, run the code block below (un-comment its lines first if needed) to see how that model performs.

In [6]:
scores = cross_val_score(tree_model_3, X, y, scoring='balanced_accuracy', cv=strat_5_fold)
print(f"Tree model #3 balanced accuracy mean: {scores.mean():0.3f} stdev: {scores.std():0.3f}")

Tree model #3 balanced accuracy mean: 0.651 stdev: 0.100


### Submit

Submit the finished notebook, with output saved.