# <font color='#31394d'> Decision Trees Practice Exercise </font>

It's time for you to build your own decision tree! For this exercise we'll be using the wine dataset from the UCI Machine Learning Repository (luckily, it's also included with sklearn). This dataset lets us try to predict the class of a wine (one of three cultivars, labeled 0 to 2) based on its chemical composition. In our professional opinion, this homework pairs well with whatever wine you have at home :)

We'll start by loading the wine dataset from sklearn and splitting it into features `X` and target `y`.

In [18]:
import seaborn as sns
import sklearn.datasets as datasets
import pandas as pd
import numpy as np

In [20]:
wine = datasets.load_wine()
wine.keys()

X = pd.DataFrame(wine['data'], columns = wine['feature_names'])
y = pd.Series(wine['target'])
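
As an aside, recent versions of scikit-learn can hand back the same features and target directly as pandas objects. The sketch below assumes scikit-learn 0.23 or newer (where `as_frame` was added); the names `X_alt` and `y_alt` are only illustrative. It also checks how many samples fall into each class.

```python
from sklearn.datasets import load_wine

# as_frame=True returns a DataFrame for the features and a Series for the target
# (assumes scikit-learn >= 0.23)
X_alt, y_alt = load_wine(return_X_y=True, as_frame=True)

print(X_alt.shape)                        # (rows, columns)
print(y_alt.value_counts().sort_index())  # number of samples per class
```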

🚀 <font color='#D9C4B1'>Exercise: </font> Build a decision tree classifier for the wine data, and get a cross-validated score of its accuracy. Hint: use the `cross_val_score` function from `sklearn.model_selection`.

In [21]:
# your code here
from sklearn import tree
from sklearn.model_selection import cross_val_score

clf = tree.DecisionTreeClassifier()
# ?cross_val_score
cv_scores = cross_val_score(clf, X, y, cv=5)

print("Mean & Standard Deviation")
print("Cross-validated accuracy score: %0.2f (+/- %0.2f)" % (cv_scores.mean(), cv_scores.std() * 2))

Mean & Standard Deviation
Cross-validated accuracy score: 0.89 (+/- 0.10)
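
The exact score above can shift a little between runs, because `DecisionTreeClassifier` permutes the features when choosing splits unless `random_state` is set. A minimal sketch of a reproducible version follows; the seed value `42` and the names `seeded_clf` and `seeded_scores` are arbitrary choices for illustration, not part of the exercise.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = load_wine(return_X_y=True)

# Fixing random_state makes tie-breaking between equally good splits repeatable
seeded_clf = DecisionTreeClassifier(random_state=42)
seeded_scores = cross_val_score(seeded_clf, X_demo, y_demo, cv=5)

print("Per-fold accuracy:", seeded_scores.round(2))
print("Mean accuracy: %0.2f" % seeded_scores.mean())
```

Looking at the per-fold scores, not just the mean, shows how much the estimate depends on which samples land in each fold.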


🚀 <font color='#D9C4B1'>Exercise: </font> Fit your tree model on the full data and visualize your decision tree using graphviz. The `draw_tree` function from class is included - if you have extra time try playing around with the options and understand what the function is doing!

In [22]:
import graphviz
from sklearn.tree import export_graphviz

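def draw_tree(model):
    """Render a fitted decision tree to tree.png and open it.

    Relies on the global `independent_variables` (the feature names), which
    must be defined before this function is called. Rendering also requires
    the Graphviz executables (`dot`) to be on the system PATH.
    """
    dot_data = export_graphviz(model, out_file=None,
                               feature_names=independent_variables,
                               class_names=['0', '1', '2'],
                               filled=True,      # color nodes by majority class
                               impurity=False,   # hide the impurity value in each node
                               rounded=True,
                               special_characters=True)

    graph = graphviz.Source(dot_data)
    graph.format = 'png'
    graph.render('tree', view=True)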
    

ExecutableNotFound: failed to execute WindowsPath('dot'), make sure the Graphviz executables are on your systems' PATH

In [None]:
# Fit model
clf.fit(wine.data, wine.target)
# Define independent variables
independent_variables = wine.feature_names
# Draw tree
draw_tree(clf)
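
If the Graphviz executables are not installed (the `ExecutableNotFound` error above is what that looks like), scikit-learn can still print the tree as plain text. The sketch below uses `sklearn.tree.export_text`; the `max_depth=2` cap and the names `wine_demo` and `small_tree` are only there to keep the example short and self-contained.

```python
from sklearn.datasets import load_wine
from sklearn.tree import DecisionTreeClassifier, export_text

wine_demo = load_wine()

# A shallow tree keeps the text output readable (max_depth=2 is an arbitrary choice)
small_tree = DecisionTreeClassifier(max_depth=2, random_state=0)
small_tree.fit(wine_demo.data, wine_demo.target)

# export_text needs no external executables
tree_text = export_text(small_tree, feature_names=list(wine_demo.feature_names))
print(tree_text)
```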

🚀 <font color='#D9C4B1'>Exercise: </font> Try varying the `max_depth` parameter. What's the best depth you can find (try options in the range of 2 to 10)? Hint: don't test them one by one, automate the hyperparameter tuning!

In [24]:
from sklearn.model_selection import GridSearchCV

# Set up parameter grid
param_grid = {'max_depth': range(2, 11)}

# Use GridSearchCV to find the best hyperparameters
grid_search = GridSearchCV(clf, param_grid=param_grid, cv=5)
grid_search.fit(wine.data, wine.target)

# Print the best hyperparameters and the corresponding score
print("Best max_depth hyperparameter:", grid_search.best_params_['max_depth'])
print("Best score:", grid_search.best_score_)

Best max_depth hyperparameter: 4
Best score: 0.9047619047619048
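
To see how every depth fared, not just the winner, the grid search results can be tabulated. This is a sketch using the `cv_results_` attribute of `GridSearchCV`; `random_state=0` and the variable names are assumptions added for repeatability, so the numbers may differ from the output above.

```python
import pandas as pd
from sklearn.datasets import load_wine
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = load_wine(return_X_y=True)

search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                      param_grid={'max_depth': range(2, 11)}, cv=5)
search.fit(X_demo, y_demo)

# One row per candidate depth, with its mean and spread across the 5 folds
results = pd.DataFrame({
    'max_depth': [p['max_depth'] for p in search.cv_results_['params']],
    'mean_accuracy': search.cv_results_['mean_test_score'],
    'std_accuracy': search.cv_results_['std_test_score'],
})
print(results.sort_values('mean_accuracy', ascending=False))
```

If several depths score within one standard deviation of each other, the shallower tree is usually the easier one to interpret.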


🚀 <font color='#D9C4B1'>Exercise: </font> Which features are the most important in your model? Hint: take a peek at the `feature_importances_` attribute of decision tree objects and how we accessed it in the class notebook.

In [25]:
# Print feature importances
for feature, importance in zip(wine.feature_names, clf.feature_importances_):
    print(feature, importance)

alcohol 0.012570564071187309
malic_acid 0.014223159778821876
ash 0.0
alcalinity_of_ash 0.0
magnesium 0.0
total_phenols 0.0
flavanoids 0.1414466773122087
nonflavanoid_phenols 0.0
proanthocyanins 0.0
color_intensity 0.07906148272987158
hue 0.058185091460406506
od280/od315_of_diluted_wines 0.3120425747831769
proline 0.38247044986432716
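
The list is easier to read when sorted. A small sketch, assuming a freshly fitted tree: `importance_clf`, seeded with `random_state=0`, is a stand-in for the `clf` above, so its values will not necessarily match the output exactly.

```python
import pandas as pd
from sklearn.datasets import load_wine
from sklearn.tree import DecisionTreeClassifier

wine_demo = load_wine()
importance_clf = DecisionTreeClassifier(random_state=0)
importance_clf.fit(wine_demo.data, wine_demo.target)

# Pair each importance with its feature name, largest first
importances = pd.Series(importance_clf.feature_importances_,
                        index=wine_demo.feature_names).sort_values(ascending=False)
print(importances.head(5))
```

The importances sum to 1, so each value can be read as that feature's share of the total impurity reduction in the tree.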


In [None]:
# proline is the most important feature (importance ~0.38), followed by
# od280/od315_of_diluted_wines (~0.31) and flavanoids (~0.14)