# K-Fold Cross Validation

In [2]:
import numpy as np  
from sklearn.model_selection import train_test_split, cross_val_score
# The original import had to be changed: the old sklearn.cross_validation
# module has been removed, and these functions now live in sklearn.model_selection
from sklearn import datasets  
from sklearn import svm

iris = datasets.load_iris()  # load the iris (flower) dataset


A single train/test split is made easy with the train_test_split function in the model_selection module:

In [15]:
# Split the iris data into train/test data sets with 40% reserved for testing
X_train, X_test, y_train, y_test = train_test_split(iris.data,  # all of the features
                                                    iris.target,  # the labels we want to predict
                                                    test_size=0.4,  # hold out 40% of the data for testing
                                                    random_state=0  # seed, so the split is reproducible
                                                    )

# Build an SVC model for predicting iris classifications using training data
clf = svm.SVC(kernel='linear', C=1).fit(X_train,  # the remaining 60% of the data
                                        y_train  # the species labels for those rows
                                        )

# Now measure its performance with the test data
clf.score(X_test, y_test)

0.9666666666666667
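
A single split's score depends on which rows happen to land in the test set. As a minimal sketch of that effect, the cell below repeats the split above with a few different random_state values and prints each score. The seed values and the names seeds and split_scores are arbitrary choices made for this illustration.

```python
from sklearn import datasets, svm
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()

# Seeds chosen arbitrarily for illustration; any integers would do
seeds = [0, 1, 2, 3, 4]
split_scores = []

for seed in seeds:
    # Same 60/40 split as above, but with a different shuffle each time
    X_tr, X_te, y_tr, y_te = train_test_split(
        iris.data, iris.target, test_size=0.4, random_state=seed
    )
    model = svm.SVC(kernel='linear', C=1).fit(X_tr, y_tr)
    split_scores.append(model.score(X_te, y_te))

for seed, score in zip(seeds, split_scores):
    print(f"random_state={seed}: {score:.4f}")
```

If the printed scores differ from seed to seed, that spread is the uncertainty a single split hides, and it is what K-Fold cross validation averages over.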

K-Fold cross validation is just as easy; let's use a K of 5:

In [16]:
# We give cross_val_score a model, the entire data set and its "real" values, and the number of folds:
scores = cross_val_score(clf,  # the model
                         iris.data,  # all of the features
                         iris.target,  # the flower species
                         cv=5  # number of folds
                         )

# Print the accuracy for each fold:
print(scores)

# And the mean accuracy of all 5 folds:
print(scores.mean())

[0.96666667 1.         0.96666667 0.96666667 1.        ]
0.9800000000000001
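
To make it clearer what cross_val_score does with cv=5, here is a minimal sketch of the same procedure written as an explicit loop. It relies on scikit-learn's documented behavior that an integer cv with a classifier uses StratifiedKFold without shuffling, and that the estimator is cloned and refit on each fold. The names fold_model and manual_scores are made up for this example.

```python
import numpy as np
from sklearn import datasets, svm
from sklearn.base import clone
from sklearn.model_selection import StratifiedKFold, cross_val_score

iris = datasets.load_iris()
X, y = iris.data, iris.target
model = svm.SVC(kernel='linear', C=1)

manual_scores = []
# Each iteration holds out one fold for testing and trains on the other four
for train_idx, test_idx in StratifiedKFold(n_splits=5).split(X, y):
    fold_model = clone(model)  # fresh, unfitted copy for every fold
    fold_model.fit(X[train_idx], y[train_idx])
    manual_scores.append(fold_model.score(X[test_idx], y[test_idx]))

manual_scores = np.array(manual_scores)
print(manual_scores)
print(manual_scores.mean())
```

Every row is used for testing exactly once, so the mean reflects performance over the whole data set rather than over one arbitrary 40% slice.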


Our model is even better than we thought! Can we do better? Let's try a different kernel (poly):

In [17]:
# Try a polynomial kernel
# (cross_val_score refits a clone of the model on each fold, so this fit is not strictly required)
clf = svm.SVC(kernel='poly', C=1).fit(X_train, y_train)

scores = cross_val_score(clf, iris.data, iris.target, cv=5)

print(scores)
print(scores.mean())


[0.96666667 1.         0.96666667 0.96666667 1.        ]
0.9800000000000001


Not here: the polynomial kernel's fold scores are identical to the linear kernel's, with the same mean of 0.98, so the more complex kernel bought us nothing. On a single train/test split, though, the two kernels look quite different:

In [25]:
clf = svm.SVC(kernel='poly', C=1).fit(X_train, y_train)

clf.score(X_test, y_test)


0.9

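That's lower than the 0.967 the linear kernel scored on the same split, even though the 5-fold scores printed above are identical for both kernels. A single split's score depends heavily on which rows land in the test set, which makes it a shaky basis for comparing models.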
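
One way to look for overfitting more directly is to compare accuracy on the training folds with accuracy on the held-out folds. The cell below is a minimal sketch using scikit-learn's cross_validate with return_train_score=True; the name gap_by_kernel is made up for this example, and the size of gap that counts as overfitting is a judgment call.

```python
from sklearn import datasets, svm
from sklearn.model_selection import cross_validate

iris = datasets.load_iris()

gap_by_kernel = {}
for kernel in ['linear', 'poly']:
    result = cross_validate(
        svm.SVC(kernel=kernel, C=1),
        iris.data,
        iris.target,
        cv=5,
        return_train_score=True,  # also score each model on its own training folds
    )
    train_mean = result['train_score'].mean()
    test_mean = result['test_score'].mean()
    # A large positive gap means the model fits the training folds
    # much better than it fits unseen data
    gap_by_kernel[kernel] = train_mean - test_mean
    print(f"{kernel}: train={train_mean:.4f} test={test_mean:.4f}")
```

A model whose training accuracy is far above its held-out accuracy is a likelier overfitting candidate than one whose two scores are close.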

## Activity

The "poly" kernel for SVC has another parameter, degree, which sets the degree of the polynomial and defaults to 3. For example: svm.SVC(kernel='poly', degree=3, C=1)

Judging by the results above, the default third-degree polynomial may be overfitting. How about a degree of 2? Give that a try and compare it to the linear kernel.

In [28]:
# Try a second-degree polynomial kernel
clf_2 = svm.SVC(kernel='poly', degree=2, C=1).fit(X_train, y_train)

# Linear kernel, trained on the same split for comparison
clf_linear = svm.SVC(kernel='linear', C=1).fit(X_train, y_train)

# Accuracy of each model on the held-out 40%
score_poly = clf_2.score(X_test, y_test)
score_linear = clf_linear.score(X_test, y_test)

print(f"Polynomial: {score_poly}")
print(f"Linear: {score_linear}")

Polynomial: 0.95
Linear: 0.9666666666666667
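
The comparison above uses a single train/test split. As a minimal sketch of the same activity under 5-fold cross validation, the cell below scores the linear kernel and the polynomial kernel at degrees 2 and 3. The candidates dictionary and its labels are made up for this example.

```python
from sklearn import datasets, svm
from sklearn.model_selection import cross_val_score

iris = datasets.load_iris()

# Models to compare; the labels are only used for printing
candidates = {
    'linear': svm.SVC(kernel='linear', C=1),
    'poly degree=2': svm.SVC(kernel='poly', degree=2, C=1),
    'poly degree=3': svm.SVC(kernel='poly', degree=3, C=1),
}

mean_scores = {}
for name, model in candidates.items():
    # cross_val_score fits a fresh clone on each fold, so no prior fit is needed
    fold_scores = cross_val_score(model, iris.data, iris.target, cv=5)
    mean_scores[name] = fold_scores.mean()
    print(f"{name}: {fold_scores} mean={fold_scores.mean():.4f}")
```

Because every model is scored on the same five folds, differences in the means are less likely to be an accident of one particular split.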
