# Training and testing - Answers for practice

In this practice you will review and apply the **training and testing** workflow. We now switch to the **white wine quality** dataset and again use a **decision tree** to fit it. The goal is to become more familiar with the concepts and steps involved, as well as the technical details of implementing them with the scikit-learn API.

In [None]:
import os, sys
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, plot_tree, export_text
from sklearn.metrics import f1_score
from sklearn.metrics import confusion_matrix
np.random.seed(18937) # please leave this line alone

## Load Dataset

**Load dataset** from files into multi-dimensional array and understand its structure.

Note that this .csv file is semicolon-separated (`;`). Make sure you **shuffle the dataset using the `sample()` and `reset_index()` methods.**

In [None]:
# Dataset location
DATASET = '/dsa/data/all_datasets/wine-quality/winequality-white.csv'
assert os.path.exists(DATASET)

# Add your code below this comment (Question #P1001)
# ----------------------------------
# Load and shuffle
dataset = pd.read_csv(DATASET, sep = ';').sample(frac=1).reset_index(drop=True)
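
To see the load-and-shuffle chain in isolation, here is a minimal sketch that reads a tiny semicolon-separated table from an in-memory string instead of the wine file. The column values and the `random_state=0` argument are illustrative assumptions, not part of the exercise.

```python
import io
import pandas as pd

# Hypothetical stand-in for the semicolon-separated wine file
csv_text = "alcohol;pH;quality\n9.4;3.51;5\n9.8;3.20;6\n10.5;3.16;7\n11.2;3.30;8\n"

# sep=';' tells pandas that fields are separated by semicolons, not commas
toy = pd.read_csv(io.StringIO(csv_text), sep=';')

# sample(frac=1) returns every row in random order;
# reset_index(drop=True) discards the old index and renumbers rows 0..n-1
shuffled = toy.sample(frac=1, random_state=0).reset_index(drop=True)

print(shuffled)
```

Without `drop=True`, the old row positions would be kept as an extra `index` column, which would then leak into the feature matrix if features are selected by position.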

**Print shape and describe** this dataset.

In [None]:
# Add your code below this comment (Question #P1002)
# ----------------------------------
print(dataset.shape)
dataset.describe()

**Initialize** the multi-dimensional arrays **X** and **y**.

**Print number of labels for class 0 and class 1 respectively** after binarization.

In [None]:
# Complete code below this comment  (Question #P1003)
# ----------------------------------
X = np.array(dataset.iloc[:,:-1])
y = np.array(dataset.quality)
print('Label distribution (before):', {i: np.sum(y==i) for i in np.unique(dataset.quality)})

# Binarize wine quality just for simplification
y = (y>=7).astype(int)
# Caution: reversing the order of the two statements above results in a non-obvious mistake.
#     Make it a habit to check the label distribution to make sure it still makes sense and is roughly balanced.

print('X', X.shape, 'Y', y.shape)
print('Label distribution (after):', {0: np.sum(y==0), 1: np.sum(y==1)})
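
The binarize-then-count step can be sketched on a handful of made-up quality scores (these values are assumptions, not taken from the wine dataset). `np.unique` with `return_counts=True` is one alternative way to get the per-class counts.

```python
import numpy as np

# Hypothetical quality scores, for illustration only
quality = np.array([5, 6, 7, 8, 6, 5, 7, 4])

# Threshold at 7: scores of 7 or more become class 1, the rest class 0
labels = (quality >= 7).astype(int)

# return_counts=True yields each distinct class and how often it occurs
classes, counts = np.unique(labels, return_counts=True)
distribution = dict(zip(classes.tolist(), counts.tolist()))

print('Label distribution:', distribution)
```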

## Train and validate

**Set up a training/test split and use 25% of the data for validation.**

In [None]:
# Complete code below this comment  (Question #P1004)
# ----------------------------------
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)
model = DecisionTreeClassifier()
model.fit(X_train, y_train)
model.score(X_test, y_test)
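
As an optional variation on the split above, the sketch below passes `stratify` and `random_state` to `train_test_split` so that both parts keep a similar class ratio and the split is repeatable. The synthetic data and the names `X_toy`, `y_toy`, and `clf` are assumptions made for illustration; the answer above does not use these options.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced two-class data (illustrative only)
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 4))
y_toy = (X_toy[:, 0] + X_toy[:, 1] > 1.0).astype(int)

# Hold out 25% of the rows; stratify keeps the class ratio similar in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, stratify=y_toy, random_state=0
)

clf = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)

print('train', X_tr.shape, 'test', X_te.shape)
print('positive rate (train, test):', y_tr.mean(), y_te.mean())
print('held-out accuracy:', clf.score(X_te, y_te))
```

Stratification matters most when one class is rare, since a purely random split can leave the held-out part with very few positive examples.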

## Understand your classification model

In the cell below, add code to dump out the [confusion matrix](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html) and compute the [F₁ score](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html).

In [None]:
# Add your code below this comment (Question #P1005)
# ----------------------------------
confusion_matrix(y_test, model.predict(X_test))
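
To make the layout of the matrix concrete, here is a small sketch on hand-written labels (assumed values, unrelated to the wine data). In scikit-learn, rows correspond to true classes and columns to predicted classes, so for binary labels `ravel()` yields the counts in the order TN, FP, FN, TP.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hand-written labels for illustration only
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 0])
y_pred = np.array([0, 1, 0, 0, 1, 0, 1, 0])

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred)

# For a 2x2 matrix, flattening row by row gives TN, FP, FN, TP
tn, fp, fn, tp = cm.ravel()

print(cm)
print('TN', tn, 'FP', fp, 'FN', fn, 'TP', tp)
```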

In [None]:
# Add your code below this comment (Question #P1006)
# ----------------------------------
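# Predictions are already integer class labels, so no rounding or casting is needed.
# Note: with average='micro' on single-label data, the F1 score equals plain accuracy.
f1_score(y_test, model.predict(X_test), average='micro')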
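
The choice of `average` changes what the F₁ score measures. The sketch below, on hand-written labels (assumed values), compares the default `average='binary'`, which scores only the positive class, with `average='micro'`, and checks the binary score against the formula F₁ = 2·P·R / (P + R).

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Hand-written labels for illustration only
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 0])
y_pred = np.array([0, 1, 0, 0, 1, 0, 1, 0])

f1_pos = f1_score(y_true, y_pred)                     # default: positive class only
f1_micro = f1_score(y_true, y_pred, average='micro')  # pooled over both classes
acc = accuracy_score(y_true, y_pred)

# Recompute the positive-class F1 by hand from the raw counts
tp = np.sum((y_true == 1) & (y_pred == 1))
fp = np.sum((y_true == 0) & (y_pred == 1))
fn = np.sum((y_true == 1) & (y_pred == 0))
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1_hand = 2 * precision * recall / (precision + recall)

print('binary F1:', f1_pos, 'by hand:', f1_hand)
print('micro F1:', f1_micro, 'accuracy:', acc)
```

On imbalanced labels the two numbers can differ noticeably, so it is worth stating which averaging was used when reporting an F₁ score.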

# Save your notebook!