# Module 1: Training and testing - Practice

In this practice you will review and apply the **training and testing** workflow. 
We now switch over to the **white wine quality** dataset. 
We will still use a **Decision Tree model** to fit the dataset. 
The goal is to become more familiar with the concepts and steps involved, as well as the technical details of implementing them with the scikit-learn API.

+ Look for <span style="background:yellow">**&lt;placeholder&gt;**</span>s in the code cells and fill in the appropriate code.
+ Expect requirements in **bold** font when provided.
+ The presentation of print-outs is not strict as long as they are readable and equivalent; e.g., the following are equally acceptable:
    + `Index(['fixed acidity', 'volatile acidity'])`
    + `['fixed acidity', 'volatile acidity']`
    + `fixed acidity, volatile acidity`

For instance, **import DecisionTreeClassifier from the sklearn.tree module** below.

In [None]:
import os, sys
import numpy as np
import pandas as pd
from sklearn.metrics import f1_score
from sklearn.metrics import confusion_matrix

np.random.seed(18937) # please leave this line alone

## Load Dataset

**Load the dataset** from the file into a pandas DataFrame (a two-dimensional labeled table) and explore it.

Note that this .csv file is semicolon-separated (;). Make sure you **shuffle the dataset using the sample() and reset_index() methods.**
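As a rough illustration of the load-and-shuffle pattern, the sketch below uses a made-up three-row table instead of the wine file. The column names `a` and `b`, the in-memory text, and `random_state=0` are assumptions for the example only; the exercise itself relies on the `np.random.seed` call above.

```python
import io
import pandas as pd

# Hypothetical three-row table standing in for a semicolon-separated file
csv_text = "a;b;quality\n1.0;0.1;5\n2.0;0.2;6\n3.0;0.3;7\n"

toy = (
    pd.read_csv(io.StringIO(csv_text), sep=";")  # sep=";" because the fields are semicolon-separated
    .sample(frac=1, random_state=0)              # frac=1 returns every row, in random order
    .reset_index(drop=True)                      # discard the old index and renumber from 0
)

print(toy.shape)
print(list(toy.index))
```

Without `drop=True`, `reset_index()` would keep the old row order as an extra `index` column, which would then be treated as a feature.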

In [None]:
# Dataset location
DATASET = '/dsa/data/all_datasets/wine-quality/winequality-white.csv'
assert os.path.exists(DATASET)


In [None]:
# Add your code below this comment (Question #P01)
# ----------------------------------
# Load and shuffle
dataset = pd.read_csv(<placeholder>).sample(<placeholder>).reset_index(<placeholder>)

**Print shape and describe** this dataset.

In [None]:
# Add your code below this comment (Question #P02)
# ----------------------------------




The last column is the quality of the wine (0 to 10); the other columns are features. 
We are going to build a classifier to estimate quality of wine based on its features. 
Then we will evaluate the performance of the classifier. 
Therefore, the classifier has the following input/output.

~~~
X = all except last column
y = last column
~~~

In addition, we are going to binarize y into 0 (ok wine) or 1 (good wine) to simplify the problem.

**Initialize** the arrays **X** and **y**.

**Print the number of labels for class 0 and class 1 respectively** after binarization.
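To make the binarization step concrete, here is a small sketch on made-up quality scores. The array values and the names `quality`, `labels`, and `counts` are illustrative assumptions; the threshold of 7 matches the one used in the cell below.

```python
import numpy as np

# Made-up quality scores, not taken from the wine dataset
quality = np.array([5, 6, 7, 8, 6, 7, 4])

# Scores of 7 or higher become class 1 (good wine); the rest become class 0 (ok wine)
labels = (quality >= 7).astype(int)

# Count how many samples fall into each class
counts = {0: int(np.sum(labels == 0)), 1: int(np.sum(labels == 1))}
print(counts)
```

Checking the class counts after binarization shows how imbalanced the two classes are, which matters when interpreting accuracy later.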

In [None]:
# Complete code below this comment  (Question #P03)
# ----------------------------------
X = np.array(dataset.<placeholder>)  # Slice out the feature columns (all except the last)
y = np.array(dataset.<placeholder>)  # Slice out the target column (wine quality)

print('Label distribution (before):', {i: np.sum(y==i) for i in np.unique(dataset.quality)})

# Binarize wine quality just for simplification
y = (y>=7).astype(int)

print('X', X.shape, 'y', y.shape)
print('Label distribution:', {0: <placeholder>, 1: <placeholder>})

## Train and validate

**Set up a training/test split and hold out 25% of the data for validation.**
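The split-fit-score sequence can be sketched on synthetic data as follows. The random features, the labeling rule, the variable names, and the `random_state=0` values are all assumptions made for this illustration and are unrelated to the wine dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data: 100 samples with 4 features, labeled by the sign of the first feature
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 4))
y_toy = (X_toy[:, 0] > 0).astype(int)

# test_size=0.25 holds out a quarter of the samples for evaluation
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.25, random_state=0)

# Fit on the training portion only, then measure accuracy on the held-out portion
clf = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
accuracy = clf.score(X_te, y_te)

print(X_tr.shape, X_te.shape, accuracy)
```

Note the order of the four values returned by `train_test_split`: both pieces of the first array come before both pieces of the second.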

In [None]:
# Complete code below this comment  (Question #P04)
# ----------------------------------
from sklearn.model_selection import train_test_split
<placeholder> = train_test_split(<placeholder>)
model = <placeholder>()
model.fit(X_train, y_train)
model.score(X_test, y_test)

## Understand your classification model

In the cell below, add code to dump out the [confusion matrix](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html).
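As a reminder of how to read the result, the sketch below computes a confusion matrix for six made-up labels. The lists `y_true` and `y_pred` are hypothetical and stand in for the test labels and the model's predictions.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical ground-truth labels and predictions
y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0]

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred)
print(cm)
```

In scikit-learn's layout for binary labels, `cm[0, 0]` counts true negatives, `cm[0, 1]` false positives, `cm[1, 0]` false negatives, and `cm[1, 1]` true positives.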

In [None]:
# Add your code below this comment (Question #P05)
# ----------------------------------




Compute the [F₁ score](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html).
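The F₁ score is the harmonic mean of precision and recall for the positive class. A minimal sketch on made-up labels (the lists `y_true` and `y_pred` are hypothetical) compares the library result with the formula computed by hand:

```python
from sklearn.metrics import f1_score

# Hypothetical ground-truth labels and predictions
y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0]

# Manual computation for the positive class (label 1)
tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
manual_f1 = 2 * precision * recall / (precision + recall)

# Library computation; by default the positive class is label 1
score = f1_score(y_true, y_pred)
print(score, manual_f1)
```

Unlike accuracy, F₁ ignores true negatives, so it is more informative when the positive class is rare.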

In [None]:
# Add your code below this comment (Question #P06)
# ----------------------------------




# Save your notebook!  Then `File > Close and Halt`