# Cross-Validation on the Iris Dataset

Here is an example of how to split the iris dataset into a training set and a test set.

In [1]:
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()
X = iris.data
y = iris.target

n_samples, n_features = iris.data.shape
print(n_samples, n_features)

150 4


First we need to shuffle the samples and their targets together.
The iris samples are stored in class order, so shuffling ensures
that all classes are well represented on both sides of the split:

In [2]:
indices = np.arange(n_samples)
indices[:]

array([  0,   1,   2,   3,   4,   5,   6,   7,   8,   9,  10,  11,  12,
        13,  14,  15,  16,  17,  18,  19,  20,  21,  22,  23,  24,  25,
        26,  27,  28,  29,  30,  31,  32,  33,  34,  35,  36,  37,  38,
        39,  40,  41,  42,  43,  44,  45,  46,  47,  48,  49,  50,  51,
        52,  53,  54,  55,  56,  57,  58,  59,  60,  61,  62,  63,  64,
        65,  66,  67,  68,  69,  70,  71,  72,  73,  74,  75,  76,  77,
        78,  79,  80,  81,  82,  83,  84,  85,  86,  87,  88,  89,  90,
        91,  92,  93,  94,  95,  96,  97,  98,  99, 100, 101, 102, 103,
       104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116,
       117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129,
       130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142,
       143, 144, 145, 146, 147, 148, 149])

In [3]:
np.random.RandomState(42).shuffle(indices)
indices[:]

array([ 73,  18, 118,  78,  76,  31,  64, 141,  68,  82, 110,  12,  36,
         9,  19,  56, 104,  69,  55, 132,  29, 127,  26, 128, 131, 145,
       108, 143,  45,  30,  22,  15,  65,  11,  42, 146,  51,  27,   4,
        32, 142,  85,  86,  16,  10,  81, 133, 137,  75, 109,  96, 105,
        66,   0, 122,  67,  28,  40,  44,  60, 123,  24,  25,  23,  94,
        39,  95, 117,  47,  97, 113,  33, 138, 101,  62,  84, 148,  53,
         5,  93, 111,  49,  35,  80,  77,  34, 114,   7,  43,  70,  98,
       120,  83, 134, 135,  89,   8,  13, 119, 125,   3,  17,  38,  72,
       136,   6, 112, 100,   2,  63,  54, 126,  50, 115,  46, 139,  61,
       147,  79,  59,  91,  41,  58,  90,  48,  88, 107, 124,  21,  57,
       144, 129,  37, 140,   1,  52, 130, 103,  99, 116,  87,  74, 121,
       149,  20,  71, 106,  14,  92, 102])

In [4]:
X = iris.data[indices]
y = iris.target[indices]
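The key point of the cells above is that one set of shuffled indices is applied to both the data and the target, so every sample stays paired with its label. Here is a minimal sketch of that idea on toy arrays. `X_toy`, `y_toy` and `perm` are hypothetical names, and it uses NumPy's `default_rng` generator rather than `RandomState`, so the resulting order differs from the one shown above.

```python
import numpy as np

# Toy data (hypothetical): row i of X_toy is [i, i] and its label is i.
X_toy = np.column_stack([np.arange(6), np.arange(6)])
y_toy = np.arange(6)

# One permutation, applied to both arrays, keeps each row with its label.
perm = np.random.default_rng(42).permutation(len(y_toy))
X_shuffled, y_shuffled = X_toy[perm], y_toy[perm]

print(perm)
# Each shuffled row still carries the value of its own label.
print(np.array_equal(X_shuffled[:, 0], y_shuffled))
```

Shuffling `X` and `y` with two independent permutations would break this pairing and make the labels meaningless.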

In [5]:
print(y)

[1 0 2 1 1 0 1 2 1 1 2 0 0 0 0 1 2 1 1 2 0 2 0 2 2 2 2 2 0 0 0 0 1 0 0 2 1
 0 0 0 2 1 1 0 0 1 2 2 1 2 1 2 1 0 2 1 0 0 0 1 2 0 0 0 1 0 1 2 0 1 2 0 2 2
 1 1 2 1 0 1 2 0 0 1 1 0 2 0 0 1 1 2 1 2 2 1 0 0 2 2 0 0 0 1 2 0 2 2 0 1 1
 2 1 2 0 2 1 2 1 1 1 0 1 1 0 1 2 2 0 1 2 2 0 2 0 1 2 2 1 2 1 1 2 2 0 1 2 0
 1 2]
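To see why the shuffle matters, it can help to look at what an unshuffled 2/3 - 1/3 split would contain. The sketch below uses `y_sorted`, a stand-in built with `np.repeat` that mimics the layout of `iris.target` (50 samples per class, in class order); it is an illustration, not part of the original notebook.

```python
import numpy as np

# Stand-in for iris.target before shuffling: 50 samples per class, in class order.
y_sorted = np.repeat([0, 1, 2], 50)
split = (len(y_sorted) * 2) // 3  # 100 training samples, 50 test samples

# Class counts on each side of an unshuffled split.
train_counts = np.bincount(y_sorted[:split], minlength=3)
test_counts = np.bincount(y_sorted[split:], minlength=3)

print(train_counts.tolist())  # [50, 50, 0]
print(test_counts.tolist())   # [0, 0, 50]
```

Without shuffling, the classifier would never see class 2 during training and would be tested on class 2 only.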


We can now split the data using a 2/3 - 1/3 ratio:

In [6]:
split = (n_samples * 2) // 3

X_train, X_test = X[:split], X[split:]
y_train, y_test = y[:split], y[split:]

print(X_train.shape)

(100, 4)


In [7]:
X_test.shape

(50, 4)

In [8]:
y_train.shape

(100,)

In [9]:
y_test.shape

(50,)
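The manual shuffle and slice above can also be done in one call with scikit-learn's `train_test_split`. The following is a sketch, not a reproduction of the split above: the variable names ending in `_alt` are hypothetical, the resulting samples differ from the manual split, and `stratify` is an optional extra that keeps the class proportions roughly equal on both sides.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()

# test_size=50 holds out 50 of the 150 samples (a 2/3 - 1/3 split).
# stratify=iris.target balances the classes across the two sides.
X_train_alt, X_test_alt, y_train_alt, y_test_alt = train_test_split(
    iris.data, iris.target, test_size=50, stratify=iris.target, random_state=42
)

print(X_train_alt.shape, X_test_alt.shape)
```

Fixing `random_state` makes the split reproducible, in the same way as the seeded `RandomState(42)` used for the manual shuffle.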

We can now train a linear classifier on the training set only:

In [10]:
from sklearn.svm import LinearSVC
clf = LinearSVC().fit(X_train, y_train)

To evaluate its quality we can compute the fraction of correct
classifications on the test set, that is, its accuracy:

In [11]:
np.mean(clf.predict(X_test) == y_test)

1.0

This shows that the model has a predictive accuracy of 100% on the
test set, which means that it generalized perfectly from the training
set to samples it had not seen before. This is rarely so easy on
real-life datasets, as we will see in the later sections.
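The accuracy computation relies on the fact that comparing two arrays gives a boolean array, and that the mean of a boolean array is the fraction of `True` values. Here is a small sketch with made-up labels (`y_true` and `y_hat` are hypothetical and not taken from the iris run):

```python
import numpy as np

# Hypothetical labels, for illustration only.
y_true = np.array([0, 1, 2, 2, 1, 0])
y_hat = np.array([0, 1, 2, 1, 1, 0])

matches = y_hat == y_true  # boolean array, True where the prediction is correct
acc_mean = np.mean(matches)  # True counts as 1, False as 0
acc_count = np.count_nonzero(matches) / len(y_true)  # correct / total

print(acc_mean, acc_count)
```

Both expressions give 5 correct predictions out of 6. For a fitted scikit-learn classifier, `clf.score(X_test, y_test)` should report the same mean accuracy.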

In [12]:
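# Count the correct predictions one by one: same accuracy as above, as a percentage.
y_pred = clf.predict(X_test)  # predict once instead of on every iteration
r = 0
for i in range(len(X_test)):
    if y_pred[i] == y_test[i]:
        r += 1

print(r*100/len(y_test))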

100.0
