# Decision Trees Exercises

## Introduction

We will be using the wine quality data set for these exercises. This data set contains various chemical properties of wine, such as acidity, sugar, pH, and alcohol. It also contains a quality metric (3 to 9, with higher being better) and a color (red or white). The name of the file is `Wine_Quality_Data.csv`.

In [1]:
from __future__ import print_function
import os
data_path = ['C:\\data']

## Question 1

* Import the data and examine the features.
* We will be using all of them to predict `color` (white or red), but the `color` feature will need to be integer encoded.

In [2]:
import pandas as pd
import numpy as np

filepath = os.sep.join(data_path + ['data_1.csv'])
data = pd.read_csv(filepath, sep=',')

# Integer-encode the target labels
data['state'] = data['state'].map({'sit': 0, 'walk': 1, 'sitandmove': 2}).astype(int)
data

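| | data | average | rms | distance | distance2 | state |
|---|---|---|---|---|---|---|
| 0 | 190 | 95 | 134 | -95 | -55 | 0 |
| 1 | 175 | 121 | 149 | -53 | -25 | 0 |
| 2 | 176 | 135 | 156 | -40 | -19 | 0 |
| 3 | 176 | 143 | 160 | -32 | -15 | 0 |
| 4 | 176 | 148 | 163 | -27 | -12 | 0 |
| ... | ... | ... | ... | ... | ... | ... |
| 12563 | 332 | 270 | 311 | -61 | -20 | 2 |
| 12564 | 332 | 270 | 311 | -61 | -20 | 2 |
| 12565 | 332 | 270 | 311 | -61 | -20 | 2 |
| 12566 | 332 | 270 | 311 | -61 | -20 | 2 |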


## Question 2

* Use `StratifiedShuffleSplit` to split data into train and test sets that are stratified by wine quality. If possible, preserve the indices of the split for question 5 below.
* Check the percent composition of each quality level for both the train and test data sets.

In [3]:
# All data columns except for the target, `state`
feature_cols = [x for x in data.columns if x != 'state']

from sklearn.model_selection import StratifiedShuffleSplit

# Split the data into two parts with 2000 points in the test data
# This creates a splitter object; its split() method returns a generator
strat_shuff_split = StratifiedShuffleSplit(n_splits=1, test_size=2000, random_state=42)

# Get the index values from the generator
train_idx, test_idx = next(strat_shuff_split.split(data[feature_cols], data['state']))

# Create the data sets
X_train = data.loc[train_idx, feature_cols]
y_train = data.loc[train_idx, 'state']

X_test = data.loc[test_idx, feature_cols]
y_test = data.loc[test_idx, 'state']

In [4]:
y_train.value_counts(normalize=True).sort_index()

0    0.224735
1    0.236942
2    0.538323
Name: state, dtype: float64

In [5]:
y_test.value_counts(normalize=True).sort_index()

0    0.2245
1    0.2370
2    0.5385
Name: state, dtype: float64
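
To compare the two distributions at a glance, the normalized counts can be placed side by side. The following is a minimal sketch on synthetic labels: the `labels` series, the `features` frame, and the 60/25/15 class proportions are made up for illustration and are not the notebook's data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

# Synthetic, imbalanced labels (illustrative only)
rng = np.random.RandomState(0)
labels = pd.Series(rng.choice([0, 1, 2], size=1000, p=[0.6, 0.25, 0.15]), name='state')
features = pd.DataFrame({'x': rng.normal(size=1000)})

splitter = StratifiedShuffleSplit(n_splits=1, test_size=200, random_state=42)
train_idx, test_idx = next(splitter.split(features, labels))

# split() returns positional indices, so iloc is the safe accessor
train_share = labels.iloc[train_idx].value_counts(normalize=True).sort_index()
test_share = labels.iloc[test_idx].value_counts(normalize=True).sort_index()

# One column per subset, plus the absolute gap between them
composition = pd.concat([train_share.rename('train'), test_share.rename('test')], axis=1)
composition['abs_diff'] = (composition['train'] - composition['test']).abs()
print(composition)
```

With a stratified split, the `abs_diff` column should stay close to zero for every class, which matches the near-identical proportions seen in the two outputs above.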

## Question 3

* Fit a decision tree classifier with no set limits on maximum depth, features, or leaves.
* Determine how many nodes are present and what the depth of this (very large) tree is.
* Using this tree, measure the prediction error in the train and test data sets. What do you think is going on here based on the differences in prediction error?

In [6]:
from sklearn.tree import DecisionTreeClassifier

dt = DecisionTreeClassifier(random_state=42)
dt = dt.fit(X_train, y_train)

In [7]:
dt.tree_.node_count, dt.tree_.max_depth

(23, 8)

In [8]:
from sklearn import tree
text_representation = tree.export_text(dt)
print(text_representation)

|--- feature_2 <= 381.50
|   |--- feature_2 <= 173.00
|   |   |--- feature_0 <= 154.50
|   |   |   |--- class: 0
|   |   |--- feature_0 >  154.50
|   |   |   |--- feature_2 <= 149.50
|   |   |   |   |--- feature_1 <= 146.00
|   |   |   |   |   |--- feature_4 <= -50.50
|   |   |   |   |   |   |--- class: 0
|   |   |   |   |   |--- feature_4 >  -50.50
|   |   |   |   |   |   |--- class: 2
|   |   |   |   |--- feature_1 >  146.00
|   |   |   |   |   |--- class: 0
|   |   |   |--- feature_2 >  149.50
|   |   |   |   |--- feature_2 <= 166.50
|   |   |   |   |   |--- feature_1 <= 143.50
|   |   |   |   |   |   |--- class: 0
|   |   |   |   |   |--- feature_1 >  143.50
|   |   |   |   |   |   |--- feature_2 <= 163.50
|   |   |   |   |   |   |   |--- class: 2
|   |   |   |   |   |   |--- feature_2 >  163.50
|   |   |   |   |   |   |   |--- feature_0 <= 209.50
|   |   |   |   |   |   |   |   |--- class: 0
|   |   |   |   |   |   |   |--- feature_0 >  209.50
|   |   |   |   |   |   |   |   |--- 
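
The generic `feature_0`, `feature_1`, ... labels in this output can be replaced with real column names through the `feature_names` argument of `export_text`. The sketch below uses the iris data set as a stand-in (an assumption for illustration, not the notebook's data) and a shallow tree so the output stays short.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Iris stands in for the notebook's data (illustrative only)
iris = load_iris()
clf = DecisionTreeClassifier(max_depth=2, random_state=42).fit(iris.data, iris.target)

# feature_names replaces the generic feature_0, feature_1, ... labels
named_text = export_text(clf, feature_names=list(iris.feature_names))
print(named_text)
```

In this notebook, the equivalent call would presumably be `tree.export_text(dt, feature_names=feature_cols)`, since `feature_cols` already holds the training column names in order.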

In [9]:
with open("decision_tree1.log", "w") as fout:
    fout.write(text_representation)

In [12]:
import matplotlib.pyplot as plt
fig = plt.figure(figsize=(100,100))
tree.plot_tree(dt)

[Text(4197.916666666667, 7272.222222222223, 'X[2] <= 381.5\ngini = 0.604\nsamples = 10568\nvalue = [2375, 2504, 5689]'),
 Text(3552.0833333333335, 6416.666666666667, 'X[2] <= 173.0\ngini = 0.416\nsamples = 8064\nvalue = [2375, 0, 5689]'),
 Text(2906.25, 5561.111111111111, 'X[0] <= 154.5\ngini = 0.084\nsamples = 2484\nvalue = [2375, 0, 109]'),
 Text(2260.416666666667, 4705.555555555556, 'gini = 0.0\nsamples = 2081\nvalue = [2081, 0, 0]'),
 Text(3552.0833333333335, 4705.555555555556, 'X[2] <= 149.5\ngini = 0.395\nsamples = 403\nvalue = [294, 0, 109]'),
 Text(1937.5, 3850.0, 'X[1] <= 146.0\ngini = 0.124\nsamples = 181\nvalue = [169, 0, 12]'),
 Text(1291.6666666666667, 2994.4444444444443, 'X[4] <= -50.5\ngini = 0.142\nsamples = 13\nvalue = [1, 0, 12]'),
 Text(645.8333333333334, 2138.8888888888887, 'gini = 0.0\nsamples = 1\nvalue = [1, 0, 0]'),
 Text(1937.5, 2138.8888888888887, 'gini = 0.0\nsamples = 12\nvalue = [0, 0, 12]'),
 Text(2583.3333333333335, 2994.4444444444443, 'gini = 0.0\nsample

In [13]:
fig.savefig("decision_tree.png")

In [10]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

def measure_error(y_true, y_pred, label):
    """Return accuracy and class-weighted precision, recall, and F1 as a Series named `label`."""
    return pd.Series({'accuracy': accuracy_score(y_true, y_pred),
                      'precision': precision_score(y_true, y_pred, average='weighted'),
                      'recall': recall_score(y_true, y_pred, average='weighted'),
                      'f1': f1_score(y_true, y_pred, average='weighted')},
                     name=label)

In [11]:
# The error on the training and test data sets
y_train_pred = dt.predict(X_train)
y_test_pred = dt.predict(X_test)

train_test_full_error = pd.concat([measure_error(y_train, y_train_pred, 'train'),
                              measure_error(y_test, y_test_pred, 'test')],
                              axis=1)

train_test_full_error

| | train | test |
|---|---|---|
| accuracy | 1.0 | 0.9985 |
| precision | 1.0 | 0.998501 |
| recall | 1.0 | 0.9985 |
| f1 | 1.0 | 0.998499 |


## Question 4

* Using grid search with cross validation, find a decision tree that performs well on the test data set. Use a different variable name for this decision tree model than in question 3.
* Determine the number of nodes and the depth of this tree.
* Measure the errors on the training and test sets as before and compare them to those from the tree in question 3.

In [22]:
from sklearn.model_selection import GridSearchCV

param_grid = {'max_depth':range(1, dt.tree_.max_depth+1, 2),
              'max_features': range(1, len(dt.feature_importances_)+1)}

GR = GridSearchCV(DecisionTreeClassifier(random_state=42),
                  param_grid=param_grid,
                  scoring='accuracy',
                  n_jobs=-1)

GR = GR.fit(X_train, y_train)



In [23]:
GR.best_estimator_.max_depth

7
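
Note that `max_depth` on the estimator is the hyperparameter chosen by the search, which is an upper bound rather than the depth the tree actually reached. The node count and fitted depth live on the `tree_` attribute of the best estimator. A minimal sketch, using the iris data set and a small hypothetical grid as stand-ins for the notebook's data and `param_grid`:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)  # stand-in data, illustrative only

search = GridSearchCV(DecisionTreeClassifier(random_state=42),
                      param_grid={'max_depth': range(1, 6)},  # hypothetical grid
                      scoring='accuracy')
search.fit(X, y)

best_tree = search.best_estimator_
# max_depth is the hyperparameter; tree_.max_depth is the depth actually reached
print('max_depth parameter:', best_tree.max_depth)
print('fitted depth:', best_tree.tree_.max_depth)
print('node count:', best_tree.tree_.node_count)
print('best params:', search.best_params_)
```

Applied to this notebook, the same two attributes would be read from `GR.best_estimator_.tree_`, which allows a direct comparison with the `(node_count, max_depth)` pair reported for `dt` in question 3.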

In [24]:
from sklearn.neighbors import KNeighborsClassifier

param_grid2 = {'n_neighbors':range(2, 5)}

GR_knn = GridSearchCV(KNeighborsClassifier(),
                     param_grid=param_grid2,
                     scoring='accuracy',
                     n_jobs=-1)
GR_knn = GR_knn.fit(X_train, y_train)



In [25]:
GR_knn.best_estimator_.get_params()

{'algorithm': 'auto',
 'leaf_size': 30,
 'metric': 'minkowski',
 'metric_params': None,
 'n_jobs': None,
 'n_neighbors': 2,
 'p': 2,
 'weights': 'uniform'}

In [26]:
y_train_pred_gr = GR.predict(X_train)
y_test_pred_gr = GR.predict(X_test)

train_test_gr_error = pd.concat([measure_error(y_train, y_train_pred_gr, 'train'),
                                 measure_error(y_test, y_test_pred_gr, 'test')],
                                axis=1)

In [27]:
train_test_gr_error

| | train | test |
|---|---|---|
| accuracy | 1.0 | 0.9985 |
| precision | 1.0 | 0.998501 |
| recall | 1.0 | 0.9985 |
| f1 | 1.0 | 0.998499 |


In [None]:
from sklearn import datasets
import pickle
import joblib
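
The imports in this last cell suggest the fitted model was going to be saved to disk, although the notebook stops before doing so. One possible way to do that, sketched under assumptions: the iris data set stands in for the notebook's data, the file name `decision_tree.joblib` is hypothetical, and the standalone `joblib` package is used for serialization.

```python
import os
import tempfile

import joblib
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)  # stand-in data, illustrative only
model = DecisionTreeClassifier(random_state=42).fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, 'decision_tree.joblib')  # hypothetical file name
    joblib.dump(model, model_path)      # serialize the fitted estimator
    restored = joblib.load(model_path)  # read it back while the file still exists

# The restored model should reproduce the original predictions
same_predictions = bool((restored.predict(X) == model.predict(X)).all())
print(same_predictions)
```

Because joblib files are pickle-based, they should only be loaded from trusted sources, and ideally with the same scikit-learn version that produced them.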