# K-Fold Cross Validation

Let's revisit the Iris data set:

In [2]:
import numpy as np  
from sklearn.model_selection import cross_val_score, train_test_split 
from sklearn import datasets  
from sklearn import svm  

iris = datasets.load_iris()  # load the iris dataset

A single train/test split is easy with the train_test_split function from scikit-learn's model_selection module:

In [10]:
# Split the iris data into train/test data sets with 40% reserved for testing
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.4, random_state=0)

# Build an SVC model for predicting iris classifications using training data
clf = svm.SVC(kernel='linear', C=1).fit(X_train, y_train)

# Now measure its performance with the test data
clf.score(X_test, y_test)

0.9666666666666667
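
The score from a single split depends on which samples end up in the test set. A minimal sketch of that effect, assuming we simply repeat the same 60/40 split with a few different `random_state` values (the seeds 0 to 4 are arbitrary illustrative choices, not part of the original walkthrough):

```python
from sklearn import datasets, svm
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()

# Repeat the 60/40 split with several arbitrary seeds (illustrative choice)
split_scores = []
for seed in range(5):
    X_tr, X_te, y_tr, y_te = train_test_split(
        iris.data, iris.target, test_size=0.4, random_state=seed)
    model = svm.SVC(kernel='linear', C=1).fit(X_tr, y_tr)
    split_scores.append(model.score(X_te, y_te))

# One accuracy per split, then the gap between the best and worst split
print(split_scores)
print(max(split_scores) - min(split_scores))
```

If the printed accuracies differ from seed to seed, that variation is the motivation for averaging over several folds rather than trusting one split.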

K-fold cross-validation is just as easy; let's use K = 5:

In [12]:
# We give cross_val_score a model, the entire data set and its "real" values (the labels), and the number of folds:
scores = cross_val_score(clf, iris.data, iris.target, cv=5)

# Print the accuracy for each fold:
print(scores)

# And the mean accuracy of all 5 folds:
print(scores.mean())

[0.96666667 1.         0.96666667 0.96666667 1.        ]
0.9800000000000001
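
To see what happens inside that one call, here is a rough sketch of the same procedure written out by hand. It assumes the documented scikit-learn behaviour that, for a classifier and an integer `cv`, `cross_val_score` uses unshuffled stratified folds; the names `manual_scores` and `library_scores` are illustrative only.

```python
import numpy as np
from sklearn import datasets, svm
from sklearn.base import clone
from sklearn.model_selection import StratifiedKFold, cross_val_score

iris = datasets.load_iris()
X, y = iris.data, iris.target
model = svm.SVC(kernel='linear', C=1)

# Each fold takes one turn as the test set; the other four folds train the model
manual_scores = []
for train_idx, test_idx in StratifiedKFold(n_splits=5).split(X, y):
    fold_model = clone(model).fit(X[train_idx], y[train_idx])  # fresh copy per fold
    manual_scores.append(fold_model.score(X[test_idx], y[test_idx]))

manual_scores = np.array(manual_scores)
library_scores = cross_val_score(model, X, y, cv=5)

print(manual_scores)
print(np.allclose(manual_scores, library_scores))  # do the two approaches agree?
```

Every sample is used for testing exactly once and for training four times, which is why the averaged score is less sensitive to one lucky or unlucky split.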


Is a more complex kernel better? Let's try a polynomial kernel:

In [13]:
clf = svm.SVC(kernel='poly', C=1)
scores = cross_val_score(clf, iris.data, iris.target, cv=5)
print(scores)
print(scores.mean())

[0.96666667 1.         0.96666667 0.96666667 1.        ]
0.9800000000000001


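No. In the run shown here, the more complex polynomial kernel did no better than the simple linear kernel: both average 0.98 across the five folds. The extra complexity buys nothing, and a more flexible kernel is more prone to overfitting. A single train/test split alone would have told a different story:

In [15]:
# Build an SVC model for predicting iris classifications using training data
clf = svm.SVC(kernel='poly', C=1).fit(X_train, y_train)

# Now measure its performance with the test data
clf.score(X_test, y_test)

0.9

That is well below the 0.9667 the linear kernel scored on the same split, a gap that 5-fold cross-validation does not show. A single split depends on which samples happen to land in the test set, so averaging over several folds gives a more trustworthy comparison.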
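
When comparing kernels this way, it can help to look at the spread of the fold scores as well as their mean. A minimal sketch follows; reporting the standard deviation is an addition here, not part of the original walkthrough, and the exact numbers may vary with the scikit-learn version.

```python
from sklearn import datasets, svm
from sklearn.model_selection import cross_val_score

iris = datasets.load_iris()

# Collect (mean, standard deviation) of the 5 fold accuracies for each kernel
summary = {}
for kernel in ('linear', 'poly'):
    model = svm.SVC(kernel=kernel, C=1)
    fold_scores = cross_val_score(model, iris.data, iris.target, cv=5)
    summary[kernel] = (fold_scores.mean(), fold_scores.std())
    print(f"{kernel}: mean={summary[kernel][0]:.3f} std={summary[kernel][1]:.3f}")
```

A difference in means that is small relative to the fold-to-fold standard deviation is weak evidence that one kernel is really better than the other.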