# Training and Testing

**SPAE-CS-DS A Data Science Short Course**

<small>Lecturer: Dr. CHAN, Chung<br>Department of Computer Science</small>
___

## Background

To give an unbiased performance estimate of a learning algorithm of interest, the fundamental principle is 

> to use separate datasets for training and testing. 

If there is only one dataset, we should split it into *training sets* and *test sets* by *random sampling* to avoid bias in the performance estimate. This notebook illustrates some methods of splitting the datasets for training and testing.

## Setup

We first import the iris dataset from `sklearn` and create a `pandas` dataframe to operate on the dataset. You may review the last notebook on [data preparation](./1.Data%20preparation.ipynb) for the details.

In [None]:
%reset -f
from sklearn import datasets
import pandas as pd
import numpy as np

# load the iris dataset from sklearn
iris = datasets.load_iris()

# Create pandas dataframe
iris_df = pd.DataFrame(data = iris.data, # write the input features
                       columns = iris.feature_names)

iris_df.insert(len(iris_df.columns), # append the target values
               'target',
               pd.Categorical(iris.target))

# give meaningful category names (assigning to .cat.categories directly no longer works in pandas 2.x)
iris_df['target'] = iris_df.target.cat.rename_categories(
    [iris.target_names[i] for i in iris_df.target.cat.categories])

iris_df # to display the dataframe

## Stratified holdout method

This method randomly samples data for training or testing without replacement. It is implemented by the `train_test_split` function from the `sklearn.model_selection` package.
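As a rough illustration of what sampling without replacement means here, the following sketch splits a set of row indices using plain NumPy rather than scikit-learn. The names `rng`, `shuffled`, `train_idx`, and `test_idx` are illustrative and do not appear elsewhere in this notebook.

```python
import numpy as np

rng = np.random.default_rng(1)        # fixed seed so the split is reproducible
n_samples = 150                       # the iris dataset has 150 instances
n_test = round(0.2 * n_samples)       # hold out 20% of the instances

shuffled = rng.permutation(n_samples) # every index appears exactly once
test_idx = shuffled[:n_test]          # first block goes to the test set
train_idx = shuffled[n_test:]         # the rest goes to the training set

# No instance is in both sets, and together they cover the whole dataset.
overlap = np.intersect1d(train_idx, test_idx)
print(len(train_idx), len(test_idx), overlap.size)
```

Because the indices come from a single permutation, an instance can never be used for both training and testing.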

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(iris_df[iris.feature_names], 
                                                    iris_df.target, 
                                                    test_size=0.2, # randomly holdout 20% of data
                                                    random_state=1) # random seed.

X_train.shape, X_test.shape, Y_train.shape, Y_test.shape # show the dimensions of the training and testing data

Note that we have separated the input features and target for the training and test sets. The size of the test set is $\frac{30}{150}=20\%$ as intended.
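`train_test_split` also accepts a `stratify` argument (its default is `None`). Passing the class labels requests a split that preserves the class proportions of the full dataset. The sketch below is self-contained, and the names with the `_s` suffix are chosen only to avoid clashing with the variables above.

```python
from collections import Counter

from sklearn import datasets
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()

# stratify=iris.target asks for the class proportions of the full dataset
# to be preserved in both the training set and the test set.
X_train_s, X_test_s, Y_train_s, Y_test_s = train_test_split(
    iris.data, iris.target,
    test_size=0.2,
    random_state=1,
    stratify=iris.target)

train_counts = Counter(Y_train_s.tolist())
test_counts = Counter(Y_test_s.tolist())
print('train:', sorted(train_counts.items()))
print('test: ', sorted(test_counts.items()))
```

Since iris has 50 instances per class, a stratified 20% holdout puts 10 instances of each class in the test set and 40 in the training set.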

**Exercise** Using the following widget, check whether the class proportion is maintained for the training and test sets. Why is it useful to set the random seed?

In [None]:
from ipywidgets import interact
from IPython.display import display

@interact(data=['iris_df.target','Y_train','Y_test'],seed=(0,5))
def class_proportions(data,seed=0):
    Y_train, Y_test = train_test_split(iris_df.target,  # need only split the target series for class distribution
                                       test_size=0.2,
                                       random_state=seed) # set different random seeds.
    series = {'iris_df.target': iris_df.target, # look up the selected series without using eval
              'Y_train': Y_train,
              'Y_test': Y_test}[data]
    dist = series.value_counts().sort_index()
    print('total counts: {:d}'.format(len(series.index)))
    display(dist.to_frame('count'))
    dist.plot(kind='bar')

Next, we apply the learning algorithm of interest to train a classifier using only the training set. Let's say we want to evaluate the decision tree induction algorithm in `sklearn`.

In [None]:
from sklearn import tree

clf = tree.DecisionTreeClassifier(random_state=0) # the training is also randomized
clf.fit(X_train, Y_train) # fit the model to the training set 

We can use the `predict` method of the classifier to predict the iris species of an instance based on the lengths and widths of its sepals and petals. The following code adds the predictions as a separate column to the `iris_df` dataframe.

In [None]:
iris_df['prediction'] = pd.Categorical(clf.predict(iris_df[iris.feature_names]))
iris_df

**Exercise** Write a function that returns a `DataFrame` containing only the rows with incorrect predictions.

*Hint: Use the `loc` indexer of `DataFrame`.*

In [None]:
def misclassified_instances(df):
    """
    Returns misclassified instances.
    
    Parameters:
    df (pandas.DataFrame): must contain columns 'target' and 'prediction'
    
    Returns:
    pandas.DataFrame: same as df but contains only instances with target not equal to prediction.
    
    """
    # YOUR CODE HERE
    raise NotImplementedError()
    
misclassified_instances(iris_df)

To evaluate the performance of the classifier, we will consider only the predictions on the test set. The accuracy can be computed using the `score` method.
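For a classifier, `score` returns the mean accuracy, i.e., the fraction of test instances that are predicted correctly. The following self-contained sketch checks this against a manual computation and against `sklearn.metrics.accuracy_score`. The names `X_tr`, `X_te`, `y_tr`, `y_te`, and `model` are illustrative and separate from the variables used in this notebook.

```python
import numpy as np
from sklearn import datasets, tree
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
X_tr, X_te, y_tr, y_te = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=1)

model = tree.DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
y_pred = model.predict(X_te)

acc_score = model.score(X_te, y_te)        # accuracy via the classifier's score method
acc_manual = np.mean(y_pred == y_te)       # fraction of correct predictions
acc_metric = accuracy_score(y_te, y_pred)  # same quantity from sklearn.metrics

print(acc_score, acc_manual, acc_metric)
```

All three numbers agree, since they are different ways of computing the same fraction.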

In [None]:
print('Accuracy: {:0.3f}'.format(clf.score(X_test, Y_test)))
iris_df.loc[X_test.index].loc[lambda df: df['prediction']!=df['target']] # display misclassified instances

**Exercise** Apply random subsampling to compute a better accuracy estimate. In particular, define a function that returns a numpy array of the accuracies of $20\%$ stratified hold-out with the random seed set to 0, 1, 2, 3, and 4.

In [None]:
import numpy as np

def random_subsampling_scores():
    scores = np.zeros(5)
    for seed in range(5):
        clf = tree.DecisionTreeClassifier(random_state=seed)
        # YOUR CODE HERE
        raise NotImplementedError()
    return scores

print('Accuracy: {:0.3f}'.format(random_subsampling_scores().mean()))

## Stratified cross validation

This method randomly splits the data into $k$ *folds* (blocks of roughly equal size). The score is the average of the accuracies obtained by using each fold in turn to test a classifier trained on the remaining folds.

<center><img src="https://upload.wikimedia.org/wikipedia/commons/4/4b/KfoldCV.gif" style="width:600px" alt="Cross validation"></center>
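To make the notion of folds concrete, the sketch below lists the size and class counts of each fold produced by `StratifiedKFold` on the iris data. Shuffling with `random_state=0` is an assumption made for this illustration. When `cv` is given as an integer and the estimator is a classifier, scikit-learn uses `StratifiedKFold` without shuffling.

```python
from collections import Counter

from sklearn import datasets
from sklearn.model_selection import StratifiedKFold

iris = datasets.load_iris()
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

fold_sizes = []      # (training size, test size) for each fold
seen_test_idx = []   # collects every index that is used for testing

for fold, (train_idx, test_idx) in enumerate(skf.split(iris.data, iris.target)):
    fold_sizes.append((len(train_idx), len(test_idx)))
    seen_test_idx.extend(test_idx.tolist())
    class_counts = sorted(Counter(iris.target[test_idx].tolist()).items())
    print('fold', fold, 'train', len(train_idx), 'test', len(test_idx), class_counts)
```

Each instance is used for testing in exactly one fold and for training in the other four.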


In [None]:
from sklearn.model_selection import cross_val_predict

clf = tree.DecisionTreeClassifier(random_state=0)
iris_df['prediction'] = pd.Categorical(cross_val_predict(clf, iris_df[iris.feature_names], iris_df.target, cv=5))
iris_df.loc[lambda df: df['target'] != df['prediction']]

**Exercise** Compute the accuracy obtained by the cross validation result above.

In [None]:
def cv_score():
    # YOUR CODE HERE
    raise NotImplementedError()
    return score
 
print('Accuracy: {:0.3f}'.format(cv_score()))

**Exercise** Follow the documentation [here](https://ogrisel.github.io/scikit-learn.org/sklearn-tutorial/modules/generated/sklearn.cross_validation.Bootstrap.html) to explore the bootstrap sampling method.

In [None]:
# You may add more cells

**Feedback**
___
Your feedback here.

___