# Model selection and evaluation 1
___
## Cross-validation

To detect overfitting and estimate how well a model generalizes to unseen data, we usually use [cross-validation][1].  
Commonly used cross-validation methods include:

- **Exhaustive cross-validation**  

    ```
    Exhaustive cross-validation methods are cross-validation methods which learn and test on all possible ways to divide the original sample into a training and a validation set.
    ```

    - **Leave-p-out cross-validation**  
    From all the data points, select $p$ points as the test set and train the model on the rest. Repeat the process for every possible combination of $p$ points.
    
    - **Leave-one-out cross-validation**  
    The special case $p = 1$ of *leave-p-out cross-validation*.
- **Non-exhaustive cross-validation**  

    ```
    Non-exhaustive cross validation methods do not compute all ways of splitting the original sample. Those methods are approximations of leave-p-out cross-validation.
    ```
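    - **k-fold cross-validation**  
    Split the data into $k$ folds of (nearly) equal size. Each fold is used once as the test set while the remaining $k - 1$ folds are used to train the model.  
       ![](K-fold_cross_validation_EN.jpg)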
    - **Holdout method**  
    The simplest method. Randomly select part of the data to train the model and hold out the rest for testing.
    - **Repeated random sub-sampling validation**  
    This method, also known as Monte Carlo cross-validation, randomly splits the dataset into training and validation data. For each such split, the model is fit to the training data, and predictive accuracy is assessed using the validation data. The results are then averaged over the splits.

[1]:https://en.wikipedia.org/wiki/Cross-validation_(statistics)
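
The difference in cost between exhaustive and non-exhaustive methods can be seen by counting the splits each one produces. The following is a minimal sketch, assuming scikit-learn's ```LeavePOut```, ```LeaveOneOut``` and ```KFold``` classes and a made-up toy dataset of 6 samples (the data values are arbitrary and only the number of rows matters).

```python
import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut, LeavePOut

# Toy data (an assumption for illustration): 6 samples, 1 feature.
X = np.arange(6).reshape(-1, 1)

# Exhaustive: every possible way to leave p = 2 samples out, C(6, 2) splits.
n_lpo = LeavePOut(p=2).get_n_splits(X)

# Exhaustive with p = 1: one split per sample.
n_loo = LeaveOneOut().get_n_splits(X)

# Non-exhaustive: the number of splits is fixed by n_splits.
n_kfold = KFold(n_splits=3).get_n_splits(X)

print(n_lpo, n_loo, n_kfold)  # 15 6 3
```

The leave-p-out count grows as ${n \choose p}$, which is why the non-exhaustive methods are usually preferred on anything but very small datasets.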

## Usage example
In the [scikit-learn][3] documentation, cross-validation is illustrated with a simple SVM classifier.  
First, we show basic usage of the ```svm``` module with a single train/test split.

[3]:http://scikit-learn.org/stable/modules/cross_validation.html

In [29]:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn import datasets
from sklearn import svm

iris = datasets.load_iris()

In [57]:
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=1./3, random_state=1)

In [58]:
clf = svm.SVC(kernel='linear', C=1)
clf.fit(X_train, y_train)
clf.score(X_test, y_test)

1.0

### Simple usage
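The simplest way to run cross-validation is to call [cross_val_score][2]. When no ```cv``` argument is given, it uses a default k-fold split (3 folds in the version that produced the output below; 5 folds since scikit-learn 0.22). The basic call looks like this: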

```
cross_val_score(estimator, X, y)
```
[2]:http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html#sklearn.model_selection.cross_val_score

In [52]:
from sklearn.model_selection import cross_val_score
clf = svm.SVC(kernel='linear', C=1)
scores = cross_val_score(clf, iris.data, iris.target)
scores

array([ 1.        ,  0.96078431,  0.97916667])
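
The per-fold scores are usually condensed into a mean and a spread. The following is a minimal sketch of that summary; it assumes 5 folds, set explicitly with ```cv=5``` so that the result does not depend on the library's default, and the names ```mean_score``` and ```std_score``` are chosen here for illustration.

```python
from sklearn import datasets, svm
from sklearn.model_selection import cross_val_score

iris = datasets.load_iris()
clf = svm.SVC(kernel='linear', C=1)

# cv=5 is set explicitly so the number of folds is the same on every version.
scores = cross_val_score(clf, iris.data, iris.target, cv=5)

# Summarize the fold accuracies by their mean and standard deviation.
mean_score = scores.mean()
std_score = scores.std()
print("accuracy: %0.2f (+/- %0.2f)" % (mean_score, std_score))
```

Reporting the standard deviation alongside the mean gives a rough idea of how much the score varies from one split to another.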

### Use other cross-validation methods
We can also use other cross-validation methods. Import the corresponding class and construct it with parameters such as the number of splits and the test size. The resulting object is then passed as the ```cv``` argument to define how the data are split.

In [60]:
from sklearn.model_selection import ShuffleSplit
n_samples = iris.data.shape[0]
cv = ShuffleSplit(n_splits=5, test_size=0.3, random_state=0)
cross_val_score(clf, iris.data, iris.target, cv=cv)

array([ 0.97777778,  0.97777778,  1.        ,  0.95555556,  1.        ])

In [66]:
cv

ShuffleSplit(n_splits=5, random_state=0, test_size=0.3, train_size=None)
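
Any other cross-validator can be passed through the ```cv``` argument in the same way. The following is a minimal sketch using ```StratifiedKFold```; the choice of 5 shuffled folds and the name ```fold_class_counts``` are assumptions made for illustration. Because iris has 50 samples in each of its 3 classes, every test fold should contain 10 samples per class.

```python
import numpy as np
from sklearn import datasets, svm
from sklearn.model_selection import StratifiedKFold, cross_val_score

iris = datasets.load_iris()
clf = svm.SVC(kernel='linear', C=1)

# Stratified folds keep the class proportions of the full dataset.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(clf, iris.data, iris.target, cv=skf)

# Count how many samples of each class land in each test fold.
fold_class_counts = [
    np.bincount(iris.target[test], minlength=3).tolist()
    for _, test in skf.split(iris.data, iris.target)
]
print(scores)
print(fold_class_counts)
```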

In the ```ShuffleSplit``` example, ```cv``` is called a cross-validator. A cross-validator provides two methods:

- ```get_n_splits([X, y, groups])```: returns the number of splitting iterations in the cross-validator.
- ```split(X[, y, groups])```: generates indices to split the data into training and test sets.
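
Both methods are easiest to follow on a dataset small enough to read at a glance. The following is a minimal sketch, assuming a made-up array of 4 samples and a ```KFold``` cross-validator without shuffling, so the splits are deterministic.

```python
import numpy as np
from sklearn.model_selection import KFold

# Toy data (an assumption for illustration): 4 samples, 2 features.
X = np.arange(8).reshape(4, 2)
kf = KFold(n_splits=2)  # no shuffling, so the folds are contiguous blocks

# get_n_splits reports how many train/test pairs split() will yield.
n_splits = kf.get_n_splits(X)

# split yields index arrays, not the data itself.
splits = [(train.tolist(), test.tolist()) for train, test in kf.split(X)]

print(n_splits)  # 2
for train, test in splits:
    print("train:", train, "test:", test)
# train: [2, 3] test: [0, 1]
# train: [0, 1] test: [2, 3]
```

Since ```split``` returns indices, the actual training and test data are obtained by indexing, for example ```X[train]``` and ```X[test]```.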

Now let's look at what cv contains:

In [73]:
# Print the train and test indices produced by each split.
for i, (train, test) in enumerate(cv.split(iris.data), start=1):
    print("cv " + str(i))
    print("train data: ")
    print(train)
    print("test data: ")
    print(test)

cv 1
train data: 
[ 60 116 144 119 108  69 135  56  80 123 133 106 146  50 147  85  30 101
  94  64  89  91 125  48  13 111  95  20  15  52   3 149  98   6  68 109
  96  12 102 120 104 128  46  11 110 124  41 148   1 113 139  42   4 129
  17  38   5  53 143 105   0  34  28  55  75  35  23  74  31 118  57 131
  65  32 138  14 122  19  29 130  49 136  99  82  79 115 145  72  77  25
  81 140 142  39  58  88  70  87  36  21   9 103  67 117  47]
test data: 
[114  62  33 107   7 100  40  86  76  71 134  51  73  54  63  37  78  90
  45  16 121  66  24   8 126  22  44  97  93  26 137  84  27 127 132  59
  18  83  61  92 112   2 141  43  10]
cv 2
train data: 
[ 80 107  90   0  36 112   5  57 102  55  34 128  33  21  73   7  45 129
 103 146 120  94  50 134  99 126 114   9  39  97 101  29  81  20  46  51
  53  23  27   2  28  37 111  10  84 137 127  43  87  69 144 140  35  76
   3  82 145 116  88  44 147   1  93  38  11 115  54  40  18  41  79  24
  56  71  13  31  85  70 132 125 123 100  32 104 
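
The index arrays are long, so it can be easier to check their properties programmatically. The following is a minimal sketch, assuming a placeholder array with 150 rows (the size of the iris dataset) and the same ```ShuffleSplit``` settings as above. It checks that each split has 105 training and 45 test indices, that the two never overlap within a split, and that test sets from different splits do share samples.

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

n_samples = 150  # same number of rows as the iris dataset
X = np.zeros((n_samples, 1))  # placeholder features; only the row count matters
cv = ShuffleSplit(n_splits=5, test_size=0.3, random_state=0)

sizes = []
test_union = set()
for train, test in cv.split(X):
    # Within one split, train and test indices never overlap.
    assert set(train.tolist()).isdisjoint(test.tolist())
    sizes.append((len(train), len(test)))
    test_union.update(test.tolist())

total_test = sum(n_test for _, n_test in sizes)
print(sizes)                        # five (105, 45) pairs
print(total_test, len(test_union))  # 225 test slots over at most 150 samples
```

Five test sets of 45 samples add up to 225, more than the 150 samples available, so some samples must appear in several test sets. This is the main difference from k-fold, where the test folds partition the data.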

The following table summarizes the available cross-validation methods:

|**Usage**| **Cross Validator** | **Explanation** |
| :---: | :---: | :--- |
|i.i.d. data| ```KFold``` | KFold divides all the samples into $k$ groups of samples, called folds, of equal size (if possible). The prediction function is learned using $k - 1$ folds, and the fold left out is used for testing. If $k = n$, this is equivalent to the Leave One Out strategy. |
|i.i.d. data| ```RepeatedKFold``` | RepeatedKFold repeats K-Fold n times. It can be used when KFold needs to be run n times, producing different splits in each repetition. |
|i.i.d. data| ```LeaveOneOut``` | Each learning set is created by taking all the samples except one, the test set being the sample left out. |
|i.i.d. data| ```LeavePOut``` | LeavePOut creates all the possible training/test sets by removing p samples from the complete set. For n samples, this produces ${n \choose p}$ train-test pairs. |
|i.i.d. data| ```ShuffleSplit``` | Samples are first shuffled and then split into a pair of train and test sets. |
|class label based| ```StratifiedKFold``` | A variation of KFold in which each set contains approximately the same percentage of samples of each target class as the complete set. |
|class label based| ```StratifiedShuffleSplit``` | A variation of ShuffleSplit that creates splits preserving the same percentage of each target class as in the complete set. |
| grouped data |


