# KFold CV calculation on the Iris Dataset

There are different types of cross-validation iterators. Some of them are described on the following scikit-learn page:

http://scikit-learn.org/stable/modules/cross_validation.html#cross-validation-iterators

In this tutorial, we work with the **KFold** cross-validation iterator. To show how it works, we start with a very simple example.

In [1]:
import numpy as np
from sklearn.cross_validation import KFold

X = np.array([[0, 1], [2, 3], [4, 5], [6, 7], [8, 9], [0, 2], [4, 6], [8, 1]])
y = np.array([11, 12, 13, 14, 15, 16, 17, 18])

kf = KFold(8, n_folds=5, shuffle=True)

print(kf)

for train_index, test_index in kf:
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

sklearn.cross_validation.KFold(n=8, n_folds=5, shuffle=True, random_state=None)
TRAIN: [1 2 3 5 6 7] TEST: [0 4]
TRAIN: [0 1 2 4 6 7] TEST: [3 5]
TRAIN: [0 1 3 4 5 6] TEST: [2 7]
TRAIN: [0 2 3 4 5 6 7] TEST: [1]
TRAIN: [0 1 2 3 4 5 7] TEST: [6]
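
The `sklearn.cross_validation` module used above was deprecated in scikit-learn 0.18 and removed in 0.20. On a newer installation, the same example can be sketched roughly as follows with `sklearn.model_selection`. This is only an illustration: the seed `random_state=0` and the helper lists `fold_sizes` and `seen` are assumptions added here and are not part of the original example.

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.array([[0, 1], [2, 3], [4, 5], [6, 7], [8, 9], [0, 2], [4, 6], [8, 1]])
y = np.array([11, 12, 13, 14, 15, 16, 17, 18])

# In the newer API the number of folds is n_splits, and the data is
# passed to split() instead of giving the sample count to the constructor.
# random_state=0 is an arbitrary seed, used only to make the run repeatable.
kf = KFold(n_splits=5, shuffle=True, random_state=0)

fold_sizes = []  # illustrative bookkeeping: size of each test fold
seen = []        # illustrative bookkeeping: every test index encountered

for train_index, test_index in kf.split(X):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    fold_sizes.append(len(test_index))
    seen.extend(test_index.tolist())
```

With 8 samples and 5 folds, the first three folds hold 2 test samples and the last two hold 1, which matches the pattern in the output above.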


Now we apply **KFold** to the Iris data set. As usual, we first load the data.

In [2]:
from sklearn.svm import SVC
from sklearn.datasets import load_iris

iris = load_iris()
X = iris.data
y = iris.target
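
Before splitting, it can help to confirm the size of the data, since the sample count is what gets passed to **KFold**. The following is a small check, assuming the standard Iris data shipped with scikit-learn; the names `n_samples`, `n_features` and `class_counts` are introduced here only for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()
X = iris.data
y = iris.target

# Number of samples and number of features per sample.
n_samples, n_features = X.shape

# Number of samples in each of the three classes.
class_counts = np.bincount(y)

print("samples:", n_samples, "features:", n_features)
print("samples per class:", class_counts)
```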

The number of folds, **k**, is set with the **n_folds** parameter:

In [3]:
kf_iris = KFold(150, n_folds=15, shuffle=True)
print(kf_iris)

sklearn.cross_validation.KFold(n=150, n_folds=15, shuffle=True, random_state=None)


In [4]:
for train_index, test_index in kf_iris:
    print("TEST:", test_index)

TEST: [ 23  36  44  61  89  90  92 100 108 120]
TEST: [  4  11  19  35  65  74 123 136 143 146]
TEST: [ 14  18  32  39  54  56  57  77 121 124]
TEST: [ 28  43  48  58  72  76 115 133 147 148]
TEST: [  5   9  49  88  91  99 126 129 138 144]
TEST: [  1   6  27  42  86  94  95 119 122 130]
TEST: [ 12  52  67  93  96 111 113 125 134 142]
TEST: [  3  13  15  37  62  70  82 107 127 137]
TEST: [  0  10  40  55  79 112 117 118 128 141]
TEST: [  2   7  21  24  30  59  66  73  97 105]
TEST: [ 22  26  33  38  51  68  71  75 101 140]
TEST: [ 60  80  84  87 102 106 109 131 135 139]
TEST: [ 17  25  41  45  46  78  83  85  98 114]
TEST: [ 16  20  29  47  50  64  69 104 145 149]
TEST: [  8  31  34  53  63  81 103 110 116 132]


As you can see above, **KFold** generated 15 test sets of 10 samples each, and no sample appears in more than one test set. In other words, **KFold** first shuffled the entire data set and then divided it into 15 folds.
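
The shuffle-then-divide idea can be sketched with NumPy alone. This is a simplified illustration of the principle, not scikit-learn's actual implementation; the seed and the variable names (`rng`, `shuffled`, `folds`) are assumptions made for this sketch.

```python
import numpy as np

n_samples = 150
n_folds = 15

# Shuffle the sample indices once (the seed 0 is arbitrary).
rng = np.random.default_rng(0)
shuffled = rng.permutation(n_samples)

# Cut the shuffled indices into n_folds chunks; each chunk is one test set.
folds = np.array_split(shuffled, n_folds)

for test_index in folds:
    # Sort only for display, so the output resembles the listing above.
    print("TEST:", np.sort(test_index))
```

Every index lands in exactly one chunk, so each sample is used for testing exactly once across the 15 folds.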

In [5]:
accuracy = []
for train_index, test_index in kf_iris:
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    clf = SVC().fit(X_train, y_train)

    # Predict once per fold, then count the correct predictions.
    y_pred = clf.predict(X_test)
    i_correct = np.sum(y_pred == y_test)

    accuracy.append(i_correct * 100 / len(y_test))

# Report the mean accuracy and its standard deviation across the folds.
print("Accuracy: %0.2f (+/- %0.2f)" % (np.mean(accuracy), np.std(accuracy)))

Accuracy: 98.67 (+/- 3.40)
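
The per-fold loop above can also be written more compactly with scikit-learn's `cross_val_score` helper. The sketch below assumes a scikit-learn version that provides `sklearn.model_selection` (0.18 or later); `random_state=0` is an arbitrary seed added here. The default `SVC` parameters differ between scikit-learn versions, so the numbers will not necessarily match the output above.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, cross_val_score
from sklearn.svm import SVC

iris = load_iris()
X, y = iris.data, iris.target

# 15 shuffled folds of 10 samples each, as in the manual loop.
kf_iris = KFold(n_splits=15, shuffle=True, random_state=0)

# For a classifier, the default score is accuracy, as a fraction in [0, 1].
scores = cross_val_score(SVC(), X, y, cv=kf_iris)

# Convert to percent to match the format used in the manual loop.
print("Accuracy: %0.2f (+/- %0.2f)" % (100 * scores.mean(), 100 * scores.std()))
```

`cross_val_score` fits a fresh copy of the classifier on each training split, so the result is equivalent in spirit to the explicit loop, with less bookkeeping.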
