# Training and testing - Answers for practice

In this practice you will review and apply the **training and testing** workflow. We now switch to the **white wine quality** dataset and again use a **decision tree** to fit it. The goal is to become more familiar with the concepts and steps of the workflow, as well as the technical details of implementing it with the scikit-learn API.

In [1]:
import os, sys
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, plot_tree, export_text
from sklearn.metrics import f1_score
from sklearn.metrics import confusion_matrix
np.random.seed(18937) # please leave this line alone

## Load Dataset

**Load the dataset** from the file into a multi-dimensional array and understand its structure.

Note that this .csv file is semicolon-separated (;). Make sure you **shuffle the dataset using the sample() and reset_index() methods.**
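
The shuffle idiom can be tried on a small frame first. The sketch below uses a made-up four-row DataFrame (the `toy` name and its values are illustrative assumptions, not part of the wine data) and passes `random_state` only so the result is repeatable.

```python
import pandas as pd

# Hypothetical toy frame standing in for the wine data
toy = pd.DataFrame({"alcohol": [9.5, 10.4, 11.4, 12.0], "quality": [5, 6, 6, 7]})

# sample(frac=1) returns every row in random order; random_state makes it repeatable
shuffled = toy.sample(frac=1, random_state=0)

# The shuffled frame keeps the original row labels, so rebuild a clean 0..n-1 index
shuffled = shuffled.reset_index(drop=True)

print(shuffled)
```

Passing `drop=True` discards the old row labels instead of keeping them as an extra column, which would otherwise end up among the features.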

In [2]:
# Dataset location
DATASET = '/dsa/data/all_datasets/wine-quality/winequality-white.csv'
assert os.path.exists(DATASET)

# Add your code below this comment (Question #P1001)
# ----------------------------------
# Load and shuffle
dataset = pd.read_csv(DATASET, sep = ';').sample(frac=1).reset_index(drop=True)

**Print shape and describe** this dataset.

In [3]:
# Add your code below this comment (Question #P1002)
# ----------------------------------
print(dataset.shape)
dataset.describe()

(4898, 12)


| | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 4898.0 | 4898.0 | 4898.0 | 4898.0 | 4898.0 | 4898.0 | 4898.0 | 4898.0 | 4898.0 | 4898.0 | 4898.0 | 4898.0 |
| mean | 6.854788 | 0.278241 | 0.334192 | 6.391415 | 0.045772 | 35.308085 | 138.360657 | 0.994027 | 3.188267 | 0.489847 | 10.514267 | 5.877909 |
| std | 0.843868 | 0.100795 | 0.12102 | 5.072058 | 0.021848 | 17.007137 | 42.498065 | 0.002991 | 0.151001 | 0.114126 | 1.230621 | 0.885639 |
| min | 3.8 | 0.08 | 0.0 | 0.6 | 0.009 | 2.0 | 9.0 | 0.98711 | 2.72 | 0.22 | 8.0 | 3.0 |
| 25% | 6.3 | 0.21 | 0.27 | 1.7 | 0.036 | 23.0 | 108.0 | 0.991723 | 3.09 | 0.41 | 9.5 | 5.0 |
| 50% | 6.8 | 0.26 | 0.32 | 5.2 | 0.043 | 34.0 | 134.0 | 0.99374 | 3.18 | 0.47 | 10.4 | 6.0 |
| 75% | 7.3 | 0.32 | 0.39 | 9.9 | 0.05 | 46.0 | 167.0 | 0.9961 | 3.28 | 0.55 | 11.4 | 6.0 |
| max | 14.2 | 1.1 | 1.66 | 65.8 | 0.346 | 289.0 | 440.0 | 1.03898 | 3.82 | 1.08 | 14.2 | 9.0 |


**Initialize** the multi-dimensional arrays **X** and **y**.

**Print the number of labels for class 0 and class 1, respectively,** after binarization.

In [4]:
# Complete code below this comment  (Question #P1003)
# ----------------------------------
X = np.array(dataset.iloc[:,:-1])
y = np.array(dataset.quality)
print('Label distribution (before):', {i: np.sum(y==i) for i in np.unique(dataset.quality)})

# Binarize wine quality for simplicity: 1 if quality >= 7, otherwise 0
y = (y>=7).astype(int)
# Caution: reversing the order of the 2 statements above results in a non-obvious mistake.
#     Make it a habit to check the label distribution to make sure it still makes sense and is roughly balanced.

print('X', X.shape, 'Y', y.shape)
print('Label distribution (after):', {0: np.sum(y==0), 1: np.sum(y==1)})

Label distribution (before): {3: 20, 4: 163, 5: 1457, 6: 2198, 7: 880, 8: 175, 9: 5}
X (4898, 11) Y (4898,)
Label distribution (after): {0: 3838, 1: 1060}
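
The binarization and the distribution check can be reproduced on a handful of values. This is a minimal sketch with made-up quality scores (the `quality` array below is an assumption for illustration, not a slice of the dataset); it uses the same `>= 7` threshold as the cell above.

```python
import numpy as np

# Hypothetical quality scores, not taken from the dataset
quality = np.array([5, 6, 7, 6, 8, 4, 7, 5])

# The comparison gives True/False; astype(int) maps them to 1/0
labels = (quality >= 7).astype(int)

# np.unique with return_counts=True tallies each class in one call
classes, counts = np.unique(labels, return_counts=True)
distribution = {int(c): int(n) for c, n in zip(classes, counts)}
print(distribution)

# Fraction of positives, a quick check on class balance
positive_rate = labels.mean()
print(positive_rate)
```

Applied to the output above, 1060 of the 4898 wines fall in class 1, which is about 22%.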


## Train and validate

**Set up a training/test split, holding out 25% of the data for validation.**

In [5]:
# Complete code below this comment  (Question #P1004)
# ----------------------------------
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)
model = DecisionTreeClassifier()
model.fit(X_train, y_train)
model.score(X_test, y_test)

0.8269387755102041
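
The split above is purely random. As an optional variation that is not part of the original answer, `train_test_split` also accepts `stratify` and `random_state`; the sketch below shows them on synthetic data (`X_demo` and `y_demo` are assumed stand-ins, not the wine arrays).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 100 samples, 3 features, 20% positives
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = np.array([1] * 20 + [0] * 80)

# stratify=y_demo keeps the class ratio the same in both parts;
# random_state fixes the shuffle so the split is repeatable
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=0
)

print(X_tr.shape, X_te.shape)
print(y_tr.mean(), y_te.mean())
```

Stratifying matters most when one class is much smaller than the other, because a random split can otherwise leave the test set with an unrepresentative share of positives.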

## Understand your classification model

In the cells below, add code to print the [confusion matrix](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html) and compute the [F₁ score](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html).

In [6]:
# Add your code below this comment (Question #P1005)
# ----------------------------------
confusion_matrix(y_test, model.predict(X_test))

array([[853, 113],
       [ 99, 160]])
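
To read the matrix, it may help to derive the usual metrics from it by hand. The sketch below copies the four counts printed above and relies on scikit-learn's convention that rows are true labels and columns are predicted labels; the variable names are chosen here for illustration.

```python
import numpy as np

# Confusion matrix printed above: rows are true labels, columns are predictions
cm = np.array([[853, 113],
               [99, 160]])

# For a binary problem, ravel() yields the counts in this order
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()    # share of all predictions that are correct
precision = tp / (tp + fp)         # share of predicted positives that are right
recall = tp / (tp + fn)            # share of actual positives that are found
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.4f} precision={precision:.4f} "
      f"recall={recall:.4f} f1={f1:.4f}")
```

The accuracy comes out at about 0.8269, matching `model.score`, while precision, recall, and F₁ for class 1 are roughly 0.586, 0.618, and 0.602. The model is noticeably weaker on the minority class than the accuracy alone suggests.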

In [7]:
# Add your code below this comment (Question #P1006)
# ----------------------------------
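# Note: for single-label problems, micro-averaged F1 equals accuracy (compare with model.score above).
# model.predict() already returns integer class labels, so no rounding or casting is needed.
f1_score(y_test, model.predict(X_test), average='micro')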

0.8269387755102042
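
The score above is the same as the accuracy from `model.score` up to floating-point rounding. That is expected: with `average='micro'` on a single-label problem, F₁ reduces to accuracy. The sketch below compares the averaging modes on made-up labels (`y_true` and `y_pred` are illustrative assumptions).

```python
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical labels and predictions: TP=2, FP=2, FN=2, TN=4
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 1, 1, 1, 1, 0, 0]

acc = accuracy_score(y_true, y_pred)
f1_micro = f1_score(y_true, y_pred, average="micro")    # pools counts over both classes
f1_binary = f1_score(y_true, y_pred, average="binary")  # scores class 1 only

print(acc, f1_micro, f1_binary)
```

Here accuracy and micro-averaged F₁ are both 0.6, while the class-1 F₁ is 0.5. When the positive class is the one of interest, `average='binary'` (the default) is usually the more informative choice.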

# Save your notebook!