# Cross Validation

> Cross-validation, sometimes called rotation estimation or out-of-sample testing, is any of various similar model validation techniques for assessing how the results of a statistical analysis will generalize to an independent data set. It is mainly used in settings where the goal is prediction, and one wants to estimate how accurately a predictive model will perform in practice. In a prediction problem, a model is usually given a dataset of known data on which training is run (training dataset), and a dataset of unknown data (or first seen data) against which the model is tested (called the validation dataset or testing set). The goal of cross-validation is to test the model's ability to predict new data that was not used in estimating it, in order to flag problems like overfitting or selection bias and to give an insight on how the model will generalize to an independent dataset (i.e., an unknown dataset, for instance from a real problem).

- [Wiki](https://en.wikipedia.org/wiki/Cross-validation_(statistics))

<img width=600 src="https://scikit-learn.org/stable/_images/grid_search_cross_validation.png">

- [scikit-learn user guide: Cross-validation](https://scikit-learn.org/stable/modules/cross_validation.html)
- [Holdout vs. cross-validation in machine learning (Medium)](https://medium.com/@eijaz/holdout-vs-cross-validation-in-machine-learning-7637112d3f8f)

### Hold Out

Prefer a single hold-out split when:

- rapid prototyping is required

- little is known yet about the data or the problem

- computing power is limited
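
A minimal sketch of a hold-out split, for comparison with the k-fold cells further down. It assumes scikit-learn's bundled iris data as a stand-in dataset; the names `X_demo`, `y_demo`, and `holdout_model` and the 30% test size are illustrative choices, not part of the original notebook.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Assumption: the built-in iris data stands in for any labelled dataset.
X_demo, y_demo = load_iris(return_X_y=True)

# Hold out 30% of the rows for testing; stratify keeps the class proportions
# the same in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=0
)

# Train once on the training part, evaluate once on the held-out part.
holdout_model = DecisionTreeClassifier(random_state=0)
holdout_model.fit(X_tr, y_tr)
holdout_accuracy = accuracy_score(y_te, holdout_model.predict(X_te))
print('Hold-out accuracy:', holdout_accuracy)
```

The model is trained only once, which is why this approach is cheap, but the estimate depends on which rows happen to land in the test part.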

### K-Folds

Recommended in most cases, especially when:

- sufficient hardware is available to train the model several times

- it has to be combined with hyperparameter optimization (for example, a grid search)

- more time is available for testing
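
One way k-fold can be combined with parameter optimization is `GridSearchCV`, which scores every candidate setting with cross-validation. The sketch below is only an illustration: the iris data, the `max_depth` grid, and the variable names are assumptions, not taken from the notebook.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = load_iris(return_X_y=True)

# Hypothetical grid, chosen only for illustration.
param_grid = {'max_depth': [2, 3, 5, None]}

# Each of the 4 candidates is evaluated with 5-fold cross-validation.
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid,
    cv=5,
    scoring='accuracy',
)
search.fit(X_demo, y_demo)

print('Best parameters:', search.best_params_)
print('Best mean CV accuracy:', search.best_score_)
```

With 4 candidates and 5 folds the tree is fitted 20 times (plus one final refit), which shows why k-fold needs more computing time than a single hold-out split.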


In [1]:
import pandas as pd 
import numpy as np 
from sklearn import datasets
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score, KFold
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score

In [2]:
df = pd.read_csv('../Datasets/Week9/wine.csv')
df.head()

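|  | Wine | Alcohol | Malic.acid | Ash | Acl | Mg | Phenols | Flavanoids | Nonflavanoid.phenols | Proanth | Color.int | Hue | OD | Proline |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 14.23 | 1.71 | 2.43 | 15.6 | 127 | 2.8 | 3.06 | 0.28 | 2.29 | 5.64 | 1.04 | 3.92 | 1065 |
| 1 | 1 | 13.2 | 1.78 | 2.14 | 11.2 | 100 | 2.65 | 2.76 | 0.26 | 1.28 | 4.38 | 1.05 | 3.4 | 1050 |
| 2 | 1 | 13.16 | 2.36 | 2.67 | 18.6 | 101 | 2.8 | 3.24 | 0.3 | 2.81 | 5.68 | 1.03 | 3.17 | 1185 |
| 3 | 1 | 14.37 | 1.95 | 2.5 | 16.8 | 113 | 3.85 | 3.49 | 0.24 | 2.18 | 7.8 | 0.86 | 3.45 | 1480 |
| 4 | 1 | 13.24 | 2.59 | 2.87 | 21.0 | 118 | 2.8 | 2.69 | 0.39 | 1.82 | 4.32 | 1.04 | 2.93 | 735 |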


In [3]:
df.dtypes

Wine                      int64
Alcohol                 float64
Malic.acid              float64
Ash                     float64
Acl                     float64
Mg                        int64
Phenols                 float64
Flavanoids              float64
Nonflavanoid.phenols    float64
Proanth                 float64
Color.int               float64
Hue                     float64
OD                      float64
Proline                   int64
dtype: object

In [4]:
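from sklearn import metrics
# List the names accepted by the `scoring` parameter.
# Note: `metrics.SCORERS` was deprecated and later removed in newer scikit-learn
# releases; there, `metrics.get_scorer_names()` returns the same names.
metrics.SCORERS.keys()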

dict_keys(['explained_variance', 'r2', 'max_error', 'neg_median_absolute_error', 'neg_mean_absolute_error', 'neg_mean_squared_error', 'neg_mean_squared_log_error', 'accuracy', 'roc_auc', 'balanced_accuracy', 'average_precision', 'neg_log_loss', 'brier_score_loss', 'adjusted_rand_score', 'homogeneity_score', 'completeness_score', 'v_measure_score', 'mutual_info_score', 'adjusted_mutual_info_score', 'normalized_mutual_info_score', 'fowlkes_mallows_score', 'precision', 'precision_macro', 'precision_micro', 'precision_samples', 'precision_weighted', 'recall', 'recall_macro', 'recall_micro', 'recall_samples', 'recall_weighted', 'f1', 'f1_macro', 'f1_micro', 'f1_samples', 'f1_weighted', 'jaccard', 'jaccard_macro', 'jaccard_micro', 'jaccard_samples', 'jaccard_weighted'])

In [5]:
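# Features are every column except the class label 'Wine'.
# (Passing the axis positionally, as in df.drop(['Wine'], 1), fails on pandas 2.x.)
X = df.drop(columns=['Wine'])
y = df['Wine']

# 3-fold cross-validation; for a classifier, an integer cv uses stratified folds.
tree = DecisionTreeClassifier()
score = cross_val_score(tree, X, y, scoring='accuracy', error_score=np.nan, cv=3)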
score

array([0.95      , 0.83333333, 0.9137931 ])

In [6]:
# Same model, now evaluated with 4 folds.
tree = DecisionTreeClassifier()
score = cross_val_score(tree, X, y, cv=4, scoring='accuracy', error_score=np.nan)
score

array([0.82222222, 0.86666667, 0.91111111, 0.88372093])

In [7]:
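# Mean accuracy over the four folds. np.abs only matters for negated scorers
# such as 'neg_mean_squared_error'; accuracy is already non-negative.
np.abs(np.mean(score))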

0.8709302325581396
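
Fold scores are often summarised as a mean together with their spread, and `cross_validate` can report several of the scorer names listed earlier in one run. The sketch below assumes scikit-learn's bundled copy of the wine data (`load_wine`) in place of the CSV file used in this notebook; `X_wine`, `y_wine`, and `results` are illustrative names.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeClassifier

# Assumption: the bundled wine data replaces '../Datasets/Week9/wine.csv'.
X_wine, y_wine = load_wine(return_X_y=True)

# Evaluate two metrics on the same 4 folds.
results = cross_validate(
    DecisionTreeClassifier(random_state=0),
    X_wine, y_wine,
    cv=4,
    scoring=['accuracy', 'f1_macro'],
)

# Each metric appears under the key 'test_<scorer name>'.
for name in ('test_accuracy', 'test_f1_macro'):
    fold_scores = results[name]
    print(f'{name}: {fold_scores.mean():.3f} +/- {fold_scores.std():.3f}')
```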

In [8]:
# shuffle: whether to shuffle the data before splitting it into folds.
# If shuffle is True, set random_state to an int to get reproducible splits.

kf = KFold(n_splits=3, shuffle=False)  
print('splits', kf.get_n_splits(X))

for train_index, test_index in kf.split(X):
    print("TRAIN:\t", train_index, "\nTEST:\t", test_index)
    # kf.split yields positional indices, so select rows with .iloc
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    print(X_train, y_train)
    tree.fit(X_train, y_train)
    y_pred = tree.predict(X_test)
    # accuracy_score expects (y_true, y_pred)
    print('Accuracy:\t', accuracy_score(y_test, y_pred))

splits 3
TRAIN:	 [ 60  61  62  63  64  65  66  67  68  69  70  71  72  73  74  75  76  77
  78  79  80  81  82  83  84  85  86  87  88  89  90  91  92  93  94  95
  96  97  98  99 100 101 102 103 104 105 106 107 108 109 110 111 112 113
 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131
 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149
 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167
 168 169 170 171 172 173 174 175 176 177] 
TEST:	 [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
 48 49 50 51 52 53 54 55 56 57 58 59]
     Alcohol  Malic.acid   Ash   Acl   Mg  Phenols  Flavanoids  \
60     12.33        1.10  2.28  16.0  101     2.05        1.09   
61     12.64        1.36  2.02  16.8  100     2.02        1.41   
62     13.67        1.25  1.92  18.0   94     2.10        1.79   
63     12.37        1.13  2.16  19.0   87     3
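
With `shuffle=False` the folds follow the row order of the data, so if the rows are sorted by class a test fold can miss whole classes. The sketch below counts the classes in each test fold for plain `KFold` and for `StratifiedKFold`. It assumes scikit-learn's bundled wine data (`load_wine`), whose rows are ordered by class and whose labels are 0 to 2 rather than the 1 to 3 in the CSV; the helper `class_counts_per_test_fold` is a hypothetical name introduced here.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import KFold, StratifiedKFold

# Assumption: bundled wine data, rows ordered by class (labels 0, 1, 2).
X_wine, y_wine = load_wine(return_X_y=True)


def class_counts_per_test_fold(splitter):
    """Return one array of per-class sample counts for each test fold."""
    return [
        np.bincount(y_wine[test_idx], minlength=3)
        for _, test_idx in splitter.split(X_wine, y_wine)
    ]


plain_counts = class_counts_per_test_fold(KFold(n_splits=3, shuffle=False))
stratified_counts = class_counts_per_test_fold(StratifiedKFold(n_splits=3))

print('KFold (no shuffle):')
for counts in plain_counts:
    print('  ', counts)
print('StratifiedKFold:')
for counts in stratified_counts:
    print('  ', counts)
```

Passing an integer `cv` to `cross_val_score` with a classifier uses stratified folds, which is one reason its scores can differ from a manual unshuffled `KFold` loop.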

In [9]:

kf = KFold(n_splits=3, shuffle=True, random_state=42)  
print('splits', kf.get_n_splits(X))

for train_index, test_index in kf.split(X):
    print("TRAIN:\t", train_index, "\nTEST:\t", test_index)
    # kf.split yields positional indices, so select rows with .iloc
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    print(X_train, y_train)
    tree.fit(X_train, y_train)
    y_pred = tree.predict(X_test)
    # accuracy_score expects (y_true, y_pred)
    print('Accuracy:\t', accuracy_score(y_test, y_pred))

splits 3
TRAIN:	 [  0   1   3   4   5   6   7   8  10  11  13  14  17  20  21  23  25  27
  28  32  33  34  35  37  39  40  43  44  46  47  48  49  50  52  53  54
  57  58  59  61  62  63  64  69  70  71  72  73  74  75  77  79  80  81
  83  84  86  87  88  89  91  92  94  95  96  97  99 101 102 103 105 106
 107 110 112 115 116 120 121 123 124 125 126 127 129 130 131 132 133 134
 135 136 139 142 144 146 147 148 149 151 152 155 156 157 160 161 162 163
 165 166 167 168 170 172 173 175 176 177] 
TEST:	 [  2   9  12  15  16  18  19  22  24  26  29  30  31  36  38  41  42  45
  51  55  56  60  65  66  67  68  76  78  82  85  90  93  98 100 104 108
 109 111 113 114 117 118 119 122 128 137 138 140 141 143 145 150 153 154
 158 159 164 169 171 174]
     Alcohol  Malic.acid   Ash   Acl   Mg  Phenols  Flavanoids  \
0      14.23        1.71  2.43  15.6  127     2.80        3.06   
1      13.20        1.78  2.14  11.2  100     2.65        2.76   
3      14.37        1.95  2.50  16.8  113     3.85  

## Iris

In [10]:
iris = datasets.load_iris()
X = iris.data
y = iris.target
print('Tree:', cross_val_score(tree, X, y, cv=5, scoring='accuracy'))


Tree: [0.96666667 0.96666667 0.9        1.         1.        ]


In [11]:
# gamma only affects the 'rbf', 'poly' and 'sigmoid' kernels; the linear kernel ignores it.
svc = SVC(C=1, kernel='linear', gamma='scale')
print('SVC:', cross_val_score(svc, X, y, cv=3, scoring='f1_macro'))

SVC: [1.         0.96064815 0.9791463 ]


In [12]:
kf = KFold(n_splits=3, shuffle=True, random_state=42)  
print('splits', kf.get_n_splits(X))

for train_index, test_index in kf.split(X):
    print("TRAIN:\t", train_index, "\nTEST:\t", test_index)
    # X and y are NumPy arrays here, so plain positional indexing works
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    print(X_train, y_train)
    tree.fit(X_train, y_train)
    y_pred = tree.predict(X_test)
    # accuracy_score expects (y_true, y_pred)
    print('Accuracy:\t', accuracy_score(y_test, y_pred))

splits 3
TRAIN:	 [  0   1   2   3   5   6   7   8  13  14  17  20  21  23  24  25  28  33
  34  35  37  38  39  40  41  43  44  46  47  48  49  50  52  53  54  57
  58  59  60  61  62  63  66  67  70  71  72  74  77  79  80  83  84  87
  88  89  90  91  92  93  94  95  96  97  98  99 100 101 102 103 105 106
 107 111 112 113 114 115 116 117 119 120 121 122 123 124 125 126 129 130
 134 135 136 138 139 140 144 147 148 149] 
TEST:	 [  4   9  10  11  12  15  16  18  19  22  26  27  29  30  31  32  36  42
  45  51  55  56  64  65  68  69  73  75  76  78  81  82  85  86 104 108
 109 110 118 127 128 131 132 133 137 141 142 143 145 146]
[[5.1 3.5 1.4 0.2]
 [4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]
 [4.6 3.1 1.5 0.2]
 [5.4 3.9 1.7 0.4]
 [4.6 3.4 1.4 0.3]
 [5.  3.4 1.5 0.2]
 [4.4 2.9 1.4 0.2]
 [4.3 3.  1.1 0.1]
 [5.8 4.  1.2 0.2]
 [5.1 3.5 1.4 0.3]
 [5.4 3.4 1.7 0.2]
 [5.1 3.7 1.5 0.4]
 [5.1 3.3 1.7 0.5]
 [4.8 3.4 1.9 0.2]
 [5.  3.  1.6 0.2]
 [5.2 3.4 1.4 0.2]
 [5.5 4.2 1.4 0.2]
 [4.9 3.1 1.5 0.2]
 [5
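
The manual loop above and `cross_val_score` can be made to evaluate exactly the same folds by passing the `KFold` object as `cv`. The sketch below is a minimal check of that idea. The names `kf_demo`, `clf`, `manual_scores`, and `auto_scores` are illustrative, and the tree is given a fixed `random_state` (an assumption not in the notebook) so that both routes are deterministic.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X_iris, y_iris = load_iris(return_X_y=True)
kf_demo = KFold(n_splits=3, shuffle=True, random_state=42)
clf = DecisionTreeClassifier(random_state=0)

# Route 1: explicit loop, collecting one accuracy per fold.
manual_scores = []
for train_idx, test_idx in kf_demo.split(X_iris):
    clf.fit(X_iris[train_idx], y_iris[train_idx])
    fold_pred = clf.predict(X_iris[test_idx])
    manual_scores.append(accuracy_score(y_iris[test_idx], fold_pred))

# Route 2: let cross_val_score run the same splitter.
auto_scores = cross_val_score(clf, X_iris, y_iris, cv=kf_demo, scoring='accuracy')

print('Manual loop:    ', np.round(manual_scores, 4))
print('cross_val_score:', np.round(auto_scores, 4))
```

Collecting the per-fold scores in a list, rather than only printing them, makes it easy to compute their mean afterwards.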