# Do decision trees overfit?

## Wine Quality Dataset
This dataset contains red and white wine samples.
The inputs are objective physicochemical tests (e.g. pH values) and the output is based on sensory data
(the median of at least 3 evaluations made by wine experts). Each expert graded the wine quality
between 0 (very bad) and 10 (excellent).

The two datasets are related to red and white variants of the Portuguese "Vinho Verde" wine.
For more details, consult: http://www.vinhoverde.pt/en/ or the reference [Cortez et al., 2009].
Due to privacy and logistical issues, only physicochemical (input) and sensory (output) variables
are available (e.g. there is no data about grape types, wine brand, wine selling price, etc.).

In [None]:
import pandas as pd
df = pd.read_csv('../Lecture_6-AppliedMachineLearning/data/winequality-white.csv', sep=';')
df.info()

We have loaded the white wine dataset, whose input features are all real-valued. We will now try to model it with a *decision tree*.

In [None]:
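# Prepare X and y: every column except the last is an input feature,
# and the last column ("quality") is the target.
features = list(df.columns[:-1])
X = df[features]
y = df["quality"]

# Fit an unconstrained decision tree (no depth or leaf-size limit)
from sklearn import tree
clf = tree.DecisionTreeClassifier()
clf = clf.fit(X, y)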

In [None]:
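import graphviz
# Export the fitted tree to DOT format, labelling splits with the feature names
dot_data = tree.export_graphviz(clf, out_file=None, feature_names=features)
graph = graphviz.Source(dot_data)
graph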

### What is the performance of the classifier on the training data?

In [None]:
from sklearn.metrics import confusion_matrix

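# Quality grades used as confusion-matrix rows and columns
class_names = list(range(1, 11))
prediction = clf.predict(X)  # predictions on the training data itself
# `labels` is keyword-only in recent scikit-learn versions
cm = confusion_matrix(y, prediction, labels=class_names)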
cm

In [None]:
import matplotlib.pyplot as plt
%matplotlib notebook
plt.matshow(cm)
plt.colorbar()

### Another plot of the confusion matrix

In [None]:
import itertools
import numpy as np
import matplotlib.pyplot as plt
%matplotlib notebook

def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
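    if normalize:
        # Divide each row by its total; rows with no samples stay at zero
        row_sums = cm.sum(axis=1)[:, np.newaxis]
        cm = cm.astype('float') / np.maximum(row_sums, 1)
        print("Normalized confusion matrix")
    else:
        print('Confusion matrix, without normalization')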

    #print(cm)

    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)

    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
    

# Plot the confusion matrix
plt.figure()
plot_confusion_matrix(cm, classes=class_names)
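
A confusion matrix also lets us read off summary numbers directly: the diagonal holds the correctly classified samples, so overall accuracy is the diagonal sum divided by the total, and per-class recall is each diagonal entry divided by its row total. The sketch below uses a small made-up matrix (`toy_cm` is purely illustrative, not the wine result) and guards against classes that have no true samples.

```python
import numpy as np

# Toy 3-class confusion matrix with made-up counts (illustrative only).
# Rows are true labels, columns are predicted labels.
toy_cm = np.array([[5, 1, 0],
                   [2, 6, 2],
                   [0, 0, 0]])  # third class never occurs in the true labels

# Overall accuracy: correctly classified samples (diagonal) over all samples
accuracy = np.trace(toy_cm) / toy_cm.sum()

# Per-class recall: diagonal entry over its row total.
# np.maximum(..., 1) avoids dividing by zero for classes with no samples.
row_totals = toy_cm.sum(axis=1)
recall = np.diag(toy_cm) / np.maximum(row_totals, 1)

print("accuracy:", accuracy)
print("recall per class:", recall)
```

Applied to the `cm` computed above, the first expression should agree with `accuracy_score` as long as every label in `y` is listed in `class_names`.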

In [None]:
from sklearn.metrics import accuracy_score
accuracy_score(y, prediction)
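
The accuracy above is measured on the same samples the tree was trained on, so by itself it cannot tell us whether the tree generalizes. A minimal sketch of a held-out check follows. It uses synthetic data from `make_classification` as a stand-in (an assumption, so that the snippet runs without the CSV file); the split ratio, label-noise level and `_demo` names are illustrative choices, not part of the original analysis.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the wine data (assumption): 11 real-valued features,
# 3 classes, and 20% of the labels assigned at random to mimic noisy ratings.
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=11, n_informative=5,
    n_classes=3, flip_y=0.2, random_state=0,
)

# Hold out 30% of the samples; the tree never sees them during fitting
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0,
)

# Same unconstrained tree as before, fitted on the training part only
demo_clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

train_acc = accuracy_score(y_train, demo_clf.predict(X_train))
test_acc = accuracy_score(y_test, demo_clf.predict(X_test))
print(f"train accuracy: {train_acc:.3f}")
print(f"test accuracy:  {test_acc:.3f}")
```

A large gap between the two numbers is the usual symptom of overfitting. On the wine data, the same comparison would use `X` and `y` in place of the synthetic arrays.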