# <font color='#31394d'> Decision Trees Practice Exercise </font>

It's time for you to build your own decision tree! For this exercise we'll be using the wine dataset from the UCI Machine Learning Repository (luckily, it is also included with sklearn). This dataset lets us try to predict which of 3 classes a wine belongs to (in sklearn's copy the classes are three different cultivars, not quality levels) based on its chemical composition. In our professional opinion, this homework pairs well with whatever wine you have at home :)

We'll start by loading the wine dataset from sklearn and splitting it into `X` (features) and `Y` (target).

In [1]:
import seaborn as sns
import sklearn.datasets as datasets
import pandas as pd
import numpy as np
import warnings
warnings.simplefilter('ignore')

In [2]:
wine = datasets.load_wine()
wine.keys()

X = pd.DataFrame(wine['data'], columns = wine['feature_names'])
Y = pd.Series(wine['target'])

In [3]:
wine['feature_names']

['alcohol',
 'malic_acid',
 'ash',
 'alcalinity_of_ash',
 'magnesium',
 'total_phenols',
 'flavanoids',
 'nonflavanoid_phenols',
 'proanthocyanins',
 'color_intensity',
 'hue',
 'od280/od315_of_diluted_wines',
 'proline']

In [4]:
X.isna().sum()

alcohol                         0
malic_acid                      0
ash                             0
alcalinity_of_ash               0
magnesium                       0
total_phenols                   0
flavanoids                      0
nonflavanoid_phenols            0
proanthocyanins                 0
color_intensity                 0
hue                             0
od280/od315_of_diluted_wines    0
proline                         0
dtype: int64

In [5]:
Y

0      0
1      0
2      0
3      0
4      0
      ..
173    2
174    2
175    2
176    2
177    2
Length: 178, dtype: int32

In [6]:
Y.isna().sum()

0

🚀 <font color='#D9C4B1'>Exercise: </font> Build a decision tree classifier for the wine data, and get a cross-validated score of its accuracy. Hint: use the `cross_val_score` function from `sklearn.model_selection`.

### Split dataset into train and test set

In [7]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.25, random_state=123)

### Fit the model

In [8]:
from sklearn.tree import DecisionTreeClassifier
classifier = DecisionTreeClassifier()
classifier = classifier.fit(X_train, y_train)

### Make predictions

In [9]:
y_pred = classifier.predict(X_test)
y_pred

array([2, 1, 2, 1, 1, 2, 0, 2, 2, 1, 2, 2, 2, 0, 0, 2, 1, 1, 0, 1, 2, 2,
       2, 2, 1, 2, 2, 0, 0, 0, 0, 0, 2, 1, 2, 1, 2, 0, 0, 1, 2, 2, 0, 0,
       1])
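The predictions above are only printed, not scored. As a minimal sketch of how they could be compared with the held-out labels, the block below rebuilds the same split and reports accuracy and a confusion matrix. The `random_state=0` on the tree and all the `*_demo` names are assumptions added for repeatability; they are not part of the original notebook.

```python
from sklearn.datasets import load_wine
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Rebuild the same 75/25 split used above (same test_size and random_state)
X_demo, y_demo = load_wine(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=123
)

# random_state=0 is an assumption so the sketch gives the same tree on every run
tree_demo = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
pred_demo = tree_demo.predict(X_te)

# Fraction of test wines classified correctly
test_accuracy = accuracy_score(y_te, pred_demo)
# Rows are true classes, columns are predicted classes
test_confusion = confusion_matrix(y_te, pred_demo, labels=[0, 1, 2])

print('Test accuracy:', test_accuracy)
print(test_confusion)
```

The off-diagonal cells of the confusion matrix show which classes the tree mixes up, which a single accuracy number hides.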

In [10]:
from sklearn.model_selection import cross_val_score
cv_score = cross_val_score(classifier, X_train, y_train, cv=5, scoring='accuracy')
print('Accuracy: ', np.mean(cv_score))

Accuracy:  0.8501424501424502
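A single mean hides how much the score moves from fold to fold. The sketch below is one possible variation, not the notebook's method: it cross-validates on all 178 rows with explicitly shuffled, stratified folds and reports the spread as well. The fold seed, the tree seed, and the names `X_all`, `y_all`, `folds`, and `fold_scores` are assumptions for illustration.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X_all, y_all = load_wine(return_X_y=True)

# Stratified folds keep the class proportions similar in every fold;
# the seeds are assumptions so the numbers repeat between runs
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

fold_scores = cross_val_score(
    DecisionTreeClassifier(random_state=0),
    X_all,
    y_all,
    cv=folds,
    scoring='accuracy',
)

print('Per-fold accuracy:', fold_scores)
print('Mean: %.3f, std: %.3f' % (fold_scores.mean(), fold_scores.std()))
```

With only 178 samples, each fold holds about 35 wines, so a difference of a few percentage points between two models may just be fold noise.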


🚀 <font color='#D9C4B1'>Exercise: </font> Fit your tree model on the full data and visualize your decision tree using graphviz. The `draw_tree` function from class is included - if you have extra time, try playing around with the options to understand what the function is doing!

In [11]:
import graphviz
from sklearn.tree import export_graphviz
from IPython.display import display

def draw_tree(tree):
    # Export the fitted tree to Graphviz DOT text (out_file=None returns a string)
    dot_data = export_graphviz(tree, out_file=None,
                               feature_names=list(X.columns),  # same 13 wine feature names
                               class_names=['0', '1', '2'],
                               filled=True,       # color nodes by majority class
                               impurity=False,    # hide the impurity value in each node
                               rounded=True,
                               special_characters=True)

    graph = graphviz.Source(dot_data)
    graph.format = 'png'
    # Writes tree.png and opens it in the system viewer
    graph.render('tree', view=True)
    # Also show the tree inline in the notebook
    display(graph)


In [None]:
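# The exercise asks for a tree fitted on the full data, so fit a separate model on X, Y
full_tree = DecisionTreeClassifier().fit(X, Y)
draw_tree(full_tree)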
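If the Graphviz binaries are not installed, `sklearn.tree.export_text` gives a plain-text view of the same kind of tree. This is a minimal sketch rather than a replacement for `draw_tree`; the depth limit of 3 and the names `wine_demo`, `text_tree`, and `rules` are assumptions chosen to keep the printout short.

```python
from sklearn.datasets import load_wine
from sklearn.tree import DecisionTreeClassifier, export_text

wine_demo = load_wine()

# max_depth=3 and random_state=0 are assumptions to keep the output small and repeatable
text_tree = DecisionTreeClassifier(max_depth=3, random_state=0)
text_tree.fit(wine_demo.data, wine_demo.target)

# One line per node: split conditions for internal nodes, predicted class for leaves
rules = export_text(text_tree, feature_names=list(wine_demo.feature_names))
print(rules)
```

Each indentation level in the printout corresponds to one level of depth in the tree.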

🚀 <font color='#D9C4B1'>Exercise: </font> Try varying the `max_depth` parameter - what's the best depth you can find (try options in the range of 2 to 10)? Hint: don't test one by one, automate the hyper-parameter tuning!

In [None]:
from sklearn.model_selection import RandomizedSearchCV
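# Candidate depths 2 through 10 (a list [2, 10] would only try the two endpoints)
params = {"max_depth": list(range(2, 11)),
          "criterion": ["gini", "entropy"]}
# By default RandomizedSearchCV samples n_iter=10 of the 18 combinations
classifier_cv = RandomizedSearchCV(classifier, params, cv=5)
classifier_cv.fit(X_train, y_train)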

In [None]:
print('Tuned parameters: ', classifier_cv.best_params_)
print('Best score: ', classifier_cv.best_score_)
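With 9 depths and 2 criteria there are only 18 combinations, so an exhaustive search is also cheap. The sketch below uses `GridSearchCV` as an alternative to the randomized search; it is one possible approach, and the tree seed and the `*_g` names are assumptions for illustration.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X_g, y_g = load_wine(return_X_y=True)
X_g_train, X_g_test, y_g_train, y_g_test = train_test_split(
    X_g, y_g, test_size=0.25, random_state=123
)

# Every depth from 2 to 10, crossed with both split criteria: 18 candidates
depth_grid = {
    'max_depth': list(range(2, 11)),
    'criterion': ['gini', 'entropy'],
}

# random_state=0 on the tree is an assumption so repeated runs agree
grid_search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    depth_grid,
    cv=5,
    scoring='accuracy',
)
grid_search.fit(X_g_train, y_g_train)

print('Tuned parameters:', grid_search.best_params_)
print('Best CV accuracy:', grid_search.best_score_)
# By default the best model is refit on the whole training set, so it can score the test set
print('Held-out accuracy:', grid_search.score(X_g_test, y_g_test))
```

Several depths often tie or nearly tie on a dataset this small, so the reported best depth is better read as "one of the good ones" than as a sharp optimum.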

🚀 <font color='#D9C4B1'>Exercise: </font> What features are the most important in your model? Hint: take a peek at the `feature_importances_` attribute of decision tree objects and how we accessed it in the class notebook.

In [None]:
importance = classifier.feature_importances_
# Pair each column name with its importance score
for idx, (name, score) in enumerate(zip(X.columns, importance)):
    print('Feature ', idx, ': ', name, ', Score:', score)

In this run, the most important features for the model were:
- `proline`
- `od280/od315_of_diluted_wines`
- `flavanoids`
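Reading thirteen unsorted scores is error-prone, so it can help to rank them. The sketch below puts the importances in a pandas Series indexed by feature name and sorts it; the tree is refit on the full data with a fixed seed, which is an assumption, so the exact ranking may differ from the list above. The names `wine_fi`, `tree_fi`, and `ranked` are illustrative.

```python
import pandas as pd
from sklearn.datasets import load_wine
from sklearn.tree import DecisionTreeClassifier

wine_fi = load_wine()

# random_state=0 is an assumption; a different seed can change how ties are broken
tree_fi = DecisionTreeClassifier(random_state=0).fit(wine_fi.data, wine_fi.target)

# Label each importance with its feature name, largest first
ranked = pd.Series(
    tree_fi.feature_importances_,
    index=wine_fi.feature_names,
).sort_values(ascending=False)

print(ranked.head(5))
```

The scores are normalized to sum to 1, and a feature the tree never splits on gets exactly 0, which does not by itself mean the feature is uninformative.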