# <font color='#31394d'> Decision Trees Practice Exercise </font>

It's time for you to build your own decision tree! For this exercise we'll be using the wine dataset from the UCI Machine Learning Repository (luckily, it is included with sklearn). This dataset lets us try to predict the class of a wine (1 of 3 classes, which in sklearn's copy of the data correspond to three cultivars rather than quality levels) based on its chemical composition. In our professional opinion, this homework pairs well with whatever wine you have at home :)

We'll start by loading the wine dataset from sklearn and breaking it into X and Y.

In [2]:
import seaborn as sns
import sklearn.datasets as datasets
import pandas as pd
import numpy as np

In [3]:
wine = datasets.load_wine()
wine.keys()

dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names'])

In [4]:
wine_pd = pd.DataFrame(wine['data'], columns=wine['feature_names'])
wine_pd.head(7)

| | alcohol | malic_acid | ash | alcalinity_of_ash | magnesium | total_phenols | flavanoids | nonflavanoid_phenols | proanthocyanins | color_intensity | hue | od280/od315_of_diluted_wines | proline |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 14.23 | 1.71 | 2.43 | 15.6 | 127.0 | 2.8 | 3.06 | 0.28 | 2.29 | 5.64 | 1.04 | 3.92 | 1065.0 |
| 1 | 13.2 | 1.78 | 2.14 | 11.2 | 100.0 | 2.65 | 2.76 | 0.26 | 1.28 | 4.38 | 1.05 | 3.4 | 1050.0 |
| 2 | 13.16 | 2.36 | 2.67 | 18.6 | 101.0 | 2.8 | 3.24 | 0.3 | 2.81 | 5.68 | 1.03 | 3.17 | 1185.0 |
| 3 | 14.37 | 1.95 | 2.5 | 16.8 | 113.0 | 3.85 | 3.49 | 0.24 | 2.18 | 7.8 | 0.86 | 3.45 | 1480.0 |
| 4 | 13.24 | 2.59 | 2.87 | 21.0 | 118.0 | 2.8 | 2.69 | 0.39 | 1.82 | 4.32 | 1.04 | 2.93 | 735.0 |
| 5 | 14.2 | 1.76 | 2.45 | 15.2 | 112.0 | 3.27 | 3.39 | 0.34 | 1.97 | 6.75 | 1.05 | 2.85 | 1450.0 |
| 6 | 14.39 | 1.87 | 2.45 | 14.6 | 96.0 | 2.5 | 2.52 | 0.3 | 1.98 | 5.25 | 1.02 | 3.58 | 1290.0 |


In [5]:
wine_pd.isna().sum()

alcohol                         0
malic_acid                      0
ash                             0
alcalinity_of_ash               0
magnesium                       0
total_phenols                   0
flavanoids                      0
nonflavanoid_phenols            0
proanthocyanins                 0
color_intensity                 0
hue                             0
od280/od315_of_diluted_wines    0
proline                         0
dtype: int64

In [6]:
wine_pd.shape

(178, 13)

In [7]:
X = pd.DataFrame(wine['data'], columns = wine['feature_names'])
Y = pd.Series(wine['target'])

🚀 <font color='#D9C4B1'>Exercise: </font> Build a decision tree classifier for the wine data, and get a cross-validated score of its accuracy. Hint: use the `cross_val_score` function from `sklearn.model_selection`.

In [8]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn import metrics

tree_md = DecisionTreeClassifier()
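The exercise above asks for a cross-validated accuracy, so here is a minimal sketch of one way to get it. The 5 folds and `random_state=0` are arbitrary choices made for illustration, and the names `X_cv`, `y_cv`, `cv_tree` and `cv_scores` are not part of the class notebook.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Load features and labels directly as arrays.
X_cv, y_cv = load_wine(return_X_y=True)

# random_state=0 is an arbitrary seed, used only so reruns give the same tree.
cv_tree = DecisionTreeClassifier(random_state=0)

# With a classifier and an integer cv, scikit-learn uses stratified folds.
cv_scores = cross_val_score(cv_tree, X_cv, y_cv, cv=5, scoring="accuracy")

print(cv_scores)
print(f"mean accuracy: {cv_scores.mean():.3f} (std {cv_scores.std():.3f})")
```

Reporting the spread alongside the mean helps show how much the score moves from fold to fold on a dataset this small.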

In [9]:
X_train, X_test, y_train, y_test = train_test_split(X,Y,test_size=0.2)

🚀 <font color='#D9C4B1'>Exercise: </font> Fit your tree model on the full data and visualize your decision tree using graphviz. The `draw_tree` function from class is included. If you have extra time, try playing around with the options to understand what the function is doing!

In [10]:
tree_md.fit(X_train, y_train)

DecisionTreeClassifier()

In [11]:
y_hat = tree_md.predict(X_test)
y_hat[:10]

array([0, 2, 0, 0, 0, 1, 2, 0, 0, 1])

In [12]:
print(metrics.confusion_matrix(y_test,y_hat))

[[13  2  0]
 [ 1 10  0]
 [ 0  0 10]]


In [13]:
print(metrics.classification_report(y_test,y_hat))

              precision    recall  f1-score   support

           0       0.93      0.87      0.90        15
           1       0.83      0.91      0.87        11
           2       1.00      1.00      1.00        10

    accuracy                           0.92        36
   macro avg       0.92      0.93      0.92        36
weighted avg       0.92      0.92      0.92        36
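The numbers in the classification report can be recomputed by hand from the confusion matrix printed above, which is a useful check on how precision and recall are defined. The sketch below hard-codes that matrix (rows are true classes, columns are predicted classes, as in sklearn); the variable names are illustrative.

```python
import numpy as np

# Confusion matrix copied from the output above: rows = true, columns = predicted.
cm = np.array([[13, 2, 0],
               [1, 10, 0],
               [0, 0, 10]])

# Accuracy: correctly classified samples (the diagonal) over all samples.
accuracy = np.trace(cm) / cm.sum()

# Precision: of everything predicted as class k, the fraction that really is k.
precision = np.diag(cm) / cm.sum(axis=0)

# Recall: of everything that really is class k, the fraction predicted as k.
recall = np.diag(cm) / cm.sum(axis=1)

print(round(accuracy, 2))     # 0.92
print(precision.round(2))     # [0.93 0.83 1.  ]
print(recall.round(2))        # [0.87 0.91 1.  ]
```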



In [14]:
import graphviz
from sklearn.tree import export_graphviz

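def draw_tree(tree, feature_names=None):
    """Render a fitted tree to tree.png with Graphviz and open it."""
    dot_data = export_graphviz(tree, out_file=None,
                         # Pass e.g. list(X.columns) to label splits by feature name.
                         feature_names=feature_names,
                         class_names=['0', '1', '2'],
                         filled=True,
                         impurity=False,
                         rounded=True,
                         special_characters=True)

    graph = graphviz.Source(dot_data)
    graph.format = 'png'
    # Writes tree.png (overwriting any earlier render) and opens it in the default viewer.
    graph.render('tree', view=True)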

In [15]:
draw_tree(tree_md)
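If Graphviz is not installed, `sklearn.tree.export_text` prints the learned rules as plain text instead. This is an alternative to the class's `draw_tree` helper, not a replacement for it; the sketch fits its own small tree (`max_depth=3` and `random_state=0` are arbitrary choices) so that it runs on its own.

```python
from sklearn.datasets import load_wine
from sklearn.tree import DecisionTreeClassifier, export_text

wine_txt = load_wine()

# A shallow tree keeps the printed rules short; the depth is an arbitrary choice.
txt_tree = DecisionTreeClassifier(max_depth=3, random_state=0)
txt_tree.fit(wine_txt.data, wine_txt.target)

# export_text returns one indented line per split and per leaf.
rules = export_text(txt_tree, feature_names=list(wine_txt.feature_names))
print(rules)
```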

In [16]:
dict(zip(X, tree_md.feature_importances_))

{'alcohol': 0.0,
 'malic_acid': 0.0,
 'ash': 0.0,
 'alcalinity_of_ash': 0.020988922081155072,
 'magnesium': 0.0,
 'total_phenols': 0.0,
 'flavanoids': 0.39481068520585255,
 'nonflavanoid_phenols': 0.0,
 'proanthocyanins': 0.0,
 'color_intensity': 0.39798634252064946,
 'hue': 0.0,
 'od280/od315_of_diluted_wines': 0.04145676863894484,
 'proline': 0.1447572815533981}
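A dictionary in column order is hard to scan, so one option is to put the importances in a pandas Series and sort them. The sketch below refits a tree on the full data with an arbitrary `random_state=0`, so its values will not match the dictionary above exactly; the names `X_imp`, `y_imp` and `imp_tree` are illustrative.

```python
import pandas as pd
from sklearn.datasets import load_wine
from sklearn.tree import DecisionTreeClassifier

# as_frame=True returns the features as a DataFrame with named columns.
wine_frame = load_wine(as_frame=True)
X_imp, y_imp = wine_frame.data, wine_frame.target

imp_tree = DecisionTreeClassifier(random_state=0).fit(X_imp, y_imp)

# Index the importances by feature name and sort from most to least important.
importances = pd.Series(imp_tree.feature_importances_, index=X_imp.columns)
importances = importances.sort_values(ascending=False)

# Features the tree never split on have importance 0, so hide them.
print(importances[importances > 0])
```

The importances sum to 1, so each value can be read as that feature's share of the total impurity reduction in this particular tree.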

🚀 <font color='#D9C4B1'>Exercise: </font> Try varying the `max_depth` parameter. What's the best depth you can find (try options in the range of 2 to 10)? Hint: don't test them one by one, automate the hyper-parameter tuning!

In [17]:
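# np.arange excludes the stop value, so this covers depths 2 through 9.
depths = np.arange(2, 10)
results = []

for depth in depths:
    best_depth_tree = DecisionTreeClassifier(max_depth=depth)
    # Mean accuracy over 3 cross-validation folds for this depth.
    results.append(cross_val_score(best_depth_tree, X, Y, scoring='accuracy', cv=3).mean())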


In [18]:
test = pd.DataFrame({"depths": depths, "mean_acc": results})
test.sort_values("mean_acc", ascending=False)

| | depths | mean_acc |
|---|---|---|
| 1 | 3 | 0.89322 |
| 2 | 4 | 0.893126 |
| 3 | 5 | 0.887571 |
| 5 | 7 | 0.887571 |
| 7 | 9 | 0.887571 |
| 4 | 6 | 0.870904 |
| 6 | 8 | 0.848682 |
| 0 | 2 | 0.786441 |
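The manual loop over depths can also be handed to `GridSearchCV`, which runs the same cross-validation for every candidate and records the best one. This is a sketch under a few assumptions: the grid covers depths 2 through 10 inclusive, `cv=3` mirrors the loop above, and `random_state=0` is an arbitrary seed, so the scores will not reproduce the table exactly.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X_gs, y_gs = load_wine(return_X_y=True)

# range(2, 11) includes depth 10, since the stop value is exclusive.
param_grid = {"max_depth": list(range(2, 11))}

search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid,
    cv=3,
    scoring="accuracy",
)
search.fit(X_gs, y_gs)

print(search.best_params_, round(search.best_score_, 3))
```

The per-depth scores are available in `search.cv_results_["mean_test_score"]`. When several depths score within a few thousandths of each other, as in the table above, the shallower tree is usually the simpler choice.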


🚀 <font color='#D9C4B1'>Exercise: </font> What features are the most important in your model? Hint: take a peek at the `feature_importances_` attribute of decision tree objects and how we accessed it in the class notebook.

### Color_intensity, flavanoids & proline are the most important features in this model for predicting the wine class

### Using only the important features to see whether they'll make a better model

In [19]:
important_features = ['color_intensity', 'flavanoids', 'proline']
X_simple = X[important_features]
X_train_simple, X_test_simple, y_train_simple, y_test_simple = train_test_split(X_simple, Y, test_size=0.2, random_state=3)
tree_simp = DecisionTreeClassifier()
tree_simp.fit(X_train_simple, y_train_simple)
y_hat_simple = tree_simp.predict(X_test_simple)
print(metrics.confusion_matrix(y_test_simple, y_hat_simple))
print(metrics.classification_report(y_test_simple, y_hat_simple))


[[15  0  0]
 [ 0 13  1]
 [ 0  1  6]]
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        15
           1       0.93      0.93      0.93        14
           2       0.86      0.86      0.86         7

    accuracy                           0.94        36
   macro avg       0.93      0.93      0.93        36
weighted avg       0.94      0.94      0.94        36



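### The reduced model scored 0.94 versus 0.92 for the full model, but the two were evaluated on different random 36-sample test splits, so the gap (one wine) is too small to show that the important features make a better model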
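One way to compare the two feature sets more fairly is to score both on the same cross-validation folds. The sketch below assumes 5 shuffled stratified folds and `random_state=0` throughout (both arbitrary), and the names `X_full`, `X_small` and `folds` are illustrative.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

wine_cmp = load_wine(as_frame=True)
X_full, y_cmp = wine_cmp.data, wine_cmp.target
X_small = X_full[["color_intensity", "flavanoids", "proline"]]

# A fixed seed makes both models see exactly the same train/test folds.
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

full_scores = cross_val_score(DecisionTreeClassifier(random_state=0), X_full, y_cmp, cv=folds)
small_scores = cross_val_score(DecisionTreeClassifier(random_state=0), X_small, y_cmp, cv=folds)

print(f"all 13 features: {full_scores.mean():.3f} (std {full_scores.std():.3f})")
print(f"3 features:      {small_scores.mean():.3f} (std {small_scores.std():.3f})")
```

If the two means differ by less than the fold-to-fold spread, the data do not clearly favor either feature set.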

In [20]:
draw_tree(tree_simp)