# Cross validation and model selection

## Threefold split and parameter search
The simplest way to adjust parameters is to split the data into three parts: a training, a validation and a test set.
For each parameter setting, we fit a model on the training set, and evaluate it on the validation set.
We select the "best" parameter setting (or model) based on the validation set. We then rebuild a model using training and
validation data with this parameter setting, and evaluate it on the test set. The test set performance serves as an estimate of the generalization performance.
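
As a minimal sketch of the strategy described above, the following uses synthetic data from ``make_regression`` as a stand-in (an assumption, not the dataset used in the tasks) and ``Ridge`` as the model. The names ``candidate_alphas`` and ``val_scores`` are illustrative only.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

# synthetic stand-in data (an assumption; not the dataset used in the tasks)
X, y = make_regression(n_samples=300, n_features=20, noise=10.0, random_state=0)

# first split off the test set, then split the remainder into training and validation
X_trainval, X_test, y_trainval, y_test = train_test_split(X, y, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_trainval, y_trainval, random_state=0)

candidate_alphas = np.logspace(-3, 3, 7)
val_scores = []
for alpha in candidate_alphas:
    model = Ridge(alpha=alpha).fit(X_train, y_train)
    val_scores.append(model.score(X_val, y_val))  # R^2 on the validation set

# pick the setting with the highest validation score
best_alpha = candidate_alphas[int(np.argmax(val_scores))]

# refit on training + validation data, then evaluate once on the test set
final_model = Ridge(alpha=best_alpha).fit(X_trainval, y_trainval)
test_score = final_model.score(X_test, y_test)
print("best alpha:", best_alpha)
print("test-set R^2: {:.3f}".format(test_score))
```

The test set is touched only once, after the parameter has been chosen, which is what lets its score serve as an estimate of generalization performance.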

### Task 1
Load the Boston housing data. Split the data into three parts, for example by calling ``train_test_split`` twice.
As yesterday, scale the data and create polynomial features.
Search for the best setting of the regularization parameter alpha using the strategy described above.


In [None]:
# note: load_boston was removed in scikit-learn 1.2, so this cell needs an older version
from sklearn.datasets import load_boston
boston = load_boston()

In [None]:
import numpy as np
alphas = np.logspace(-3, 3, 7)
np.set_printoptions(suppress=True)
print(alphas)

In [None]:
# solution here

## Cross validation
To get a better understanding of cross-validation, we'll implement it from scratch.
Our goal is to estimate the performance of a single model, say ``Ridge(alpha=1)``, on the original Boston housing dataset.
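
For intuition about what the folds look like, here is a small sketch using scikit-learn's ``KFold`` on ten toy samples. The array ``toy_X`` and the list ``held_out`` are illustrative names, not part of the tasks.

```python
import numpy as np
from sklearn.model_selection import KFold

toy_X = np.arange(10).reshape(10, 1)  # ten toy samples (illustrative only)

held_out = []
for train_idx, test_idx in KFold(n_splits=5).split(toy_X):
    # without shuffling, each fold holds out one contiguous block of samples
    print("train:", train_idx, "hold-out:", test_idx)
    held_out.extend(test_idx.tolist())
```

Across the five folds, every sample is held out exactly once and used for training in the other four folds.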

### Task 2
Complete the code below to fit a model for each of the folds of 5-fold cross-validation and compute the hold-out $R^2$ using the ``score`` method.

In [None]:
import numpy as np
X = boston.data[:505]  # we make it divisible by n_folds to make the code simpler
y = boston.target[:505]  # keep y the same length as X


scores = []
n_folds = 5
n_samples = len(X)
fold_size = n_samples // n_folds
for fold in range(n_folds):
    hold_out_mask = np.zeros(n_samples, dtype=bool)  # np.bool was removed from NumPy
    # assign True to the samples that are supposed to be held out in this fold
    # ...
    training_mask = ~hold_out_mask  # training data is inverse of hold out data
    # assign training and hold-out portions
    # build model
    # compute scores
    # ...

print(scores)

### Task 3
Compare the result of your implementation with the result of the ``cross_val_score`` function in scikit-learn.

In [None]:
from sklearn.model_selection import cross_val_score
scores_sklearn = cross_val_score() # fill in missing arguments

# compare scores_sklearn with scores

## Parameter selection with cross-validation
### Task 4
Implement the same search over the parameter ``alpha`` in ``Ridge`` that you did in Task 1, but instead of splitting the data into three parts, use cross-validation.
In more detail:
- Split the Boston housing data (with polynomial features) into two parts, training and testing.
- Loop over different values of alpha.
- For each value of alpha, call ``cross_val_score`` on the training set, and compute the mean cross-validated score ($R^2$ by default for regression).
- Select the parameter with the best mean cross-validation score, and build a model on all of the training data.
- Evaluate the model on the test data.
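
The steps above can be sketched as follows. This is a minimal illustration on synthetic data from ``make_regression`` (an assumption, standing in for the housing data), and names such as ``candidate_alphas`` and ``mean_cv_scores`` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score, train_test_split

# synthetic stand-in data (an assumption; not the dataset used in the tasks)
X, y = make_regression(n_samples=300, n_features=20, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

candidate_alphas = np.logspace(-3, 3, 7)
mean_cv_scores = []
for alpha in candidate_alphas:
    # five R^2 scores, one per fold, computed on the training set only
    fold_scores = cross_val_score(Ridge(alpha=alpha), X_train, y_train, cv=5)
    mean_cv_scores.append(fold_scores.mean())

best_alpha = candidate_alphas[int(np.argmax(mean_cv_scores))]

# rebuild the model on all of the training data and evaluate on the test set
final_model = Ridge(alpha=best_alpha).fit(X_train, y_train)
test_score = final_model.score(X_test, y_test)
print("best alpha:", best_alpha)
print("test-set R^2: {:.3f}".format(test_score))
```

Compared with the threefold split, no separate validation set is needed, because each training sample serves as validation data in exactly one fold.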

## GridSearchCV
Because searching for the parameters of a model is such a common task, scikit-learn provides ``GridSearchCV``, which implements the procedure from Task 4 (with some bells and whistles).
To use ``GridSearchCV`` we simply have to define a parameter grid to search as a dictionary, with the keys being the names of the parameters and the values being the settings we would like to try. The ``GridSearchCV`` class has the same interface as the classification and regression models, and we can call ``fit`` to perform the grid-search with cross-validation. It even refits the model using the best parameters! We can then use ``predict`` or ``score`` to use the model with the best parameters, retrained on the whole training data.
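
As a rough sketch of this interface, the example below runs a grid search on synthetic data (an assumption, not the dataset from the tasks). The name ``demo_grid`` is illustrative.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split

# synthetic stand-in data (an assumption; not the dataset used in the tasks)
X, y = make_regression(n_samples=300, n_features=20, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# key: parameter name, value: the settings to try
param_grid = {'alpha': np.logspace(-3, 3, 7)}

demo_grid = GridSearchCV(Ridge(), param_grid, cv=5)
demo_grid.fit(X_train, y_train)  # cross-validates each setting, then refits the best one

best_params = demo_grid.best_params_
test_score = demo_grid.score(X_test, y_test)  # uses the refitted best model
print("best parameters:", best_params)
print("test-set R^2: {:.3f}".format(test_score))
```

Because ``refit=True`` is the default, ``demo_grid`` behaves like a fitted ``Ridge`` model after the call to ``fit``, and the refitted model is also available as ``demo_grid.best_estimator_``.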

### Task 5
Do the same search from Task 4 (and Task 1) again, this time using ``GridSearchCV`` (from the ``sklearn.model_selection`` module).

In [None]:
from sklearn.model_selection import GridSearchCV
alphas = np.logspace(-3, 3, 7)
param_grid = {'alpha':  alphas}

grid = GridSearchCV( ,return_train_score=True) # complete me!
grid.fit(X_train, y_train)
print("best mean cross-validation score: {:.3f}".format(grid.best_score_))
print("best parameters: {}".format(grid.best_params_))
print("test-set score: {:.3f}".format(grid.score(X_test, y_test)))

The ``GridSearchCV`` object stores a lot of useful information from the grid-search in the ``cv_results_`` attribute.
The easiest way to access it is to convert it to a pandas DataFrame:

In [None]:
import pandas as pd
results = pd.DataFrame(grid.cv_results_)

In [None]:
results.columns

In [None]:
results.params

We can even plot the cross-validation scores and their associated uncertainties:

In [None]:
import matplotlib.pyplot as plt

# the grid above was searched over alpha, so the parameter column is param_alpha
results.plot('param_alpha', 'mean_train_score')
results.plot('param_alpha', 'mean_test_score', ax=plt.gca())
plt.fill_between(results.param_alpha.astype(float),
                 results['mean_train_score'] + results['std_train_score'],
                 results['mean_train_score'] - results['std_train_score'], alpha=0.2)
plt.fill_between(results.param_alpha.astype(float),
                 results['mean_test_score'] + results['std_test_score'],
                 results['mean_test_score'] - results['std_test_score'], alpha=0.2)
plt.xscale('log')  # alphas are spaced logarithmically
plt.legend()

### Task 6
Select the best value of ``n_neighbors`` for ``KNeighborsClassifier`` on the ``digits`` dataset.

## Evaluation Metrics and scoring

In this section, we'll look at different evaluation metrics in scikit-learn and how to use them.
There are two main ways to use metrics:
- As functions in the ``sklearn.metrics`` module, such as ``accuracy_score`` and ``roc_auc_score``. These take the true labels and the predictions (or predicted scores) as arguments.
- By specifying a metric in ``cross_val_score``, ``GridSearchCV`` or another evaluation method using the ``scoring`` keyword, e.g. ``cross_val_score(..., scoring='roc_auc')``.
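
Both ways can be sketched on a small synthetic problem. The data from ``make_classification`` and the names ``clf``, ``acc``, ``auc`` and ``cv_auc`` are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import cross_val_score, train_test_split

# synthetic imbalanced data (an assumption): roughly 90% negatives, 10% positives
X, y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# 1) metric functions: true labels first, then predictions or scores
acc = accuracy_score(y_test, clf.predict(X_test))
auc = roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])

# 2) the scoring keyword: the metric is selected by its string name
cv_auc = cross_val_score(LogisticRegression(max_iter=1000), X_train, y_train,
                         cv=5, scoring='roc_auc')

print("accuracy: {:.3f}, ROC AUC: {:.3f}".format(acc, auc))
print("cross-validated ROC AUC:", cv_auc)
```

Note that ``accuracy_score`` expects hard class predictions, while ``roc_auc_score`` expects continuous scores such as the predicted probability of the positive class.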

### Metrics for binary classification
As we mentioned, accuracy is not a great metric in imbalanced classification problems.
We'll look at some alternatives.

### Task 7
Create an imbalanced classification problem from the digits dataset by classifying the digit 4 against all other digits.
Split the data into training and test sets.

In [None]:
from sklearn.datasets import load_digits

# ....
# create X_train, X_test, y_train, y_test for "4 vs rest"

Now train a ``LogisticRegression`` model, a ``DummyClassifier(strategy='most_frequent')`` and a ``DecisionTreeClassifier(max_depth=2)``, and compare their test-set accuracy:

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.dummy import DummyClassifier

lr = LogisticRegression()
tree = DecisionTreeClassifier(max_depth=2)
dummy = DummyClassifier(strategy='most_frequent')

# build models
# compare them using accuracy (for example using .score)

To get a better picture, now use the ``classification_report`` function from ``sklearn.metrics``:

The classification report provides precision and recall for the default threshold. To look at all possible thresholds, we can plot the precision-recall curve:

In [None]:
from sklearn.metrics import precision_recall_curve
positive_probs_lr = lr.predict_proba(X_test)[:, 1]
# complete:
# positive_probs_tree = tree.
# plot curves for tree and logistic regression
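
As a hedged illustration of what ``precision_recall_curve`` returns, the sketch below uses synthetic imbalanced data (an assumption, not the "4 vs rest" problem) and a single ``LogisticRegression`` model named ``demo_clf``.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import train_test_split

# synthetic imbalanced data (an assumption): roughly 90% negatives, 10% positives
X, y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

demo_clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
positive_probs = demo_clf.predict_proba(X_test)[:, 1]  # probability of the positive class

# one (precision, recall) pair per threshold, plus a final point (precision=1, recall=0)
precision, recall, thresholds = precision_recall_curve(y_test, positive_probs)
print(len(precision), len(recall), len(thresholds))
```

The ``precision`` and ``recall`` arrays can be passed directly to a plotting call such as ``plt.plot(recall, precision)`` to draw the curve.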

We can look at a summary by computing the average precision (``average_precision_score``):

In [None]:
from sklearn.metrics import average_precision_score
# ...


Finally, to use something like ``average_precision_score`` in cross-validation, we can simply specify the ``scoring`` argument of ``cross_val_score`` (the corresponding string is ``'average_precision'``). Use ``cross_val_score`` to compute the 5-fold cross-validated average precision of ``LogisticRegression`` and ``DecisionTreeClassifier(max_depth=2)``.

In [None]:
# ... solution here...