# Introduction to Artificial Intelligence: HS 2023


---

## Lecture 7: Model Evaluation

In [None]:
import numpy as np
import matplotlib.pyplot as plt



## K-fold Cross-Validation in Scikit-Learn

- In the example below, we create a synthetic dataset to illustrate the use of scikit-learn's cross-validation iterators. The dataset is balanced, i.e. it contains the same number of samples for each of the two classes:

In [None]:
from sklearn.model_selection import KFold

# Synthetic dataset: 10 rows (samples), 4 columns (features)
rng = np.random.RandomState(123)

y = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])
X = rng.random_sample((y.shape[0], 4))


cv = KFold(n_splits=5) # create a KFold iterator with 5 folds (no shuffling)

for k in cv.split(X, y): # each k is a (train indices, test indices) pair for one fold
    print(k)


(array([2, 3, 4, 5, 6, 7, 8, 9]), array([0, 1]))
(array([0, 1, 4, 5, 6, 7, 8, 9]), array([2, 3]))
(array([0, 1, 2, 3, 6, 7, 8, 9]), array([4, 5]))
(array([0, 1, 2, 3, 4, 5, 8, 9]), array([6, 7]))
(array([0, 1, 2, 3, 4, 5, 6, 7]), array([8, 9]))


- Recall that we should usually shuffle the dataset before splitting. If the records are sorted by class label, as in the output above, contiguous folds can over- or under-represent a class: here the first two test folds contain only class 0 and the last two only class 1.
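
The effect of sorted labels can be made more extreme with a small sketch. It assumes two folds instead of five (`n_splits=2` is chosen only for illustration) and uses a hypothetical placeholder feature matrix `X_dummy`, since only the labels matter here. Without shuffling, each test fold then contains a single class, so each model would be trained on one class and validated on the other.

```python
import numpy as np
from sklearn.model_selection import KFold

# Labels sorted by class, as in the synthetic dataset above
y_sorted = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])
X_dummy = np.zeros((y_sorted.shape[0], 1))  # hypothetical placeholder features

# Two unshuffled folds: each test fold is one contiguous half of the data
unshuffled = KFold(n_splits=2)
test_label_sets = [
    set(y_sorted[test_idx].tolist())
    for _, test_idx in unshuffled.split(X_dummy)
]

print(test_label_sets)  # [{0}, {1}]: every test fold holds a single class
```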

In [None]:
cv = KFold(n_splits=5, shuffle=True, random_state=1) # KFold with shuffling; random_state makes the shuffle reproducible

for k in cv.split(X, y): # each k is a (train indices, test indices) pair for one fold
    print(k)

(array([0, 1, 3, 4, 5, 6, 7, 8]), array([2, 9]))
(array([0, 1, 2, 3, 5, 7, 8, 9]), array([4, 6]))
(array([1, 2, 4, 5, 6, 7, 8, 9]), array([0, 3]))
(array([0, 2, 3, 4, 5, 6, 8, 9]), array([1, 7]))
(array([0, 1, 2, 3, 4, 6, 7, 9]), array([5, 8]))


- As we can see, the `KFold` iterator yields arrays of indices for the selected samples. What we actually need, however, are the feature values and their corresponding class labels, which we obtain by indexing `X` and `y` with these arrays.

In [None]:
cv = KFold(shuffle=True)  # KFold with shuffling (n_splits defaults to 5; no random_state, so the folds change between runs)

for train_idx, test_idx in cv.split(X, y): # get the train and test indices of each fold
    print('train labels with shuffling', y[train_idx])
    print('test labels with shuffling', y[test_idx])
    print('\n')

train labels with shuffling [0 0 0 0 0 1 1 1]
test labels with shuffling [1 1]


train labels with shuffling [0 0 0 0 0 1 1 1]
test labels with shuffling [1 1]


train labels with shuffling [0 0 0 0 1 1 1 1]
test labels with shuffling [0 1]


train labels with shuffling [0 0 0 1 1 1 1 1]
test labels with shuffling [0 0]


train labels with shuffling [0 0 0 1 1 1 1 1]
test labels with shuffling [0 0]
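
The cell above indexes only the labels. The same index arrays also select the matching feature rows, as the following minimal sketch shows. It rebuilds the synthetic dataset so that it runs on its own; the names `X_train_fold`, `X_test_fold` and `fold_shapes` are illustrative and not part of the original notebook.

```python
import numpy as np
from sklearn.model_selection import KFold

# Rebuild the synthetic dataset from the earlier cell
rng = np.random.RandomState(123)
y = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])
X = rng.random_sample((y.shape[0], 4))

cv = KFold(n_splits=5, shuffle=True, random_state=1)

fold_shapes = []
for train_idx, test_idx in cv.split(X, y):
    # Index features and labels with the same arrays so rows stay aligned
    X_train_fold, y_train_fold = X[train_idx], y[train_idx]
    X_test_fold, y_test_fold = X[test_idx], y[test_idx]
    fold_shapes.append((X_train_fold.shape, X_test_fold.shape))

print(fold_shapes)  # five folds, each with 8 training rows and 2 test rows
```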




- Because datasets usually have their own class distribution, it is often important to stratify the splits so that each fold preserves the original class proportions. This matters especially for skewed (imbalanced) datasets.

- Use the iris dataset and stratified k-fold cross-validation to train a classifier (the code further below uses a decision tree). Next, evaluate the model on the held-out test set.

In [None]:
from sklearn.model_selection import StratifiedKFold

cv = StratifiedKFold(n_splits=5) # initialise a stratified splitter with 5 folds (no shuffling)

for train_idx, test_idx in cv.split(X, y): # get the train and test indices of each fold
    print('train labels with stratification', y[train_idx])
    print('test labels with stratification', y[test_idx])
    print('\n')

train labels with stratification [0 0 0 0 1 1 1 1]
test labels with stratification [0 1]


train labels with stratification [0 0 0 0 1 1 1 1]
test labels with stratification [0 1]


train labels with stratification [0 0 0 0 1 1 1 1]
test labels with stratification [0 1]


train labels with stratification [0 0 0 0 1 1 1 1]
test labels with stratification [0 1]


train labels with stratification [0 0 0 0 1 1 1 1]
test labels with stratification [0 1]
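
The balanced example above hides how much stratification matters for skewed data. The sketch below assumes a hypothetical dataset with 90 samples of class 0 and 10 of class 1, sorted by label, and counts the minority-class samples in each test fold. `y_skewed`, `X_skewed` and the helper `minority_counts` are illustrative names, not part of the lecture material.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Hypothetical skewed dataset: 90 samples of class 0, then 10 of class 1
y_skewed = np.array([0] * 90 + [1] * 10)
X_skewed = np.zeros((y_skewed.shape[0], 1))  # placeholder features


def minority_counts(splitter):
    """Count the class-1 samples in the test part of every fold."""
    return [
        int(np.sum(y_skewed[test_idx] == 1))
        for _, test_idx in splitter.split(X_skewed, y_skewed)
    ]


plain_counts = minority_counts(KFold(n_splits=5))
stratified_counts = minority_counts(StratifiedKFold(n_splits=5))

print(plain_counts)       # [0, 0, 0, 0, 10]: the minority class sits in one fold
print(stratified_counts)  # [2, 2, 2, 2, 2]: 10% minority samples in every fold
```

With plain `KFold`, four of the five validation folds contain no minority samples at all, so their scores say nothing about that class.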




In [None]:
from sklearn.tree import DecisionTreeClassifier
from mlxtend.data import iris_data
from sklearn.model_selection import train_test_split

nb_folds = 5
X, y = iris_data()
# hold out 15% of the data as a stratified test set
X_train, X_test, y_train, y_test = train_test_split(X, y, shuffle=True, test_size=0.15, stratify=y)

cv = StratifiedKFold(n_splits=nb_folds, shuffle=True, random_state=1) # initialise stratified folding with nb_folds folds

kfold_acc = 0.
for train_idx, valid_idx in cv.split(X_train, y_train): # get train and validation indices
    clf = DecisionTreeClassifier(max_depth=3, random_state=1) # initialise a fresh classifier for each fold
    clf.fit(X_train[train_idx], y_train[train_idx]) # fit the classifier on the training part of the fold
    y_pred = clf.predict(X_train[valid_idx]) # get the model predictions on the corresponding validation fold
    acc = np.mean(y_pred == y_train[valid_idx])*100 # validation accuracy of this fold in percent
    kfold_acc += acc
kfold_acc /= nb_folds # average over the number of folds actually used

print('Kfold Accuracy: %.2f%%' % kfold_acc)



Kfold Accuracy: 95.26%
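
The exercise also asks for an evaluation on the test set, which the cell above does not show. One possible sketch, under stated assumptions: it loads iris with scikit-learn's `load_iris` instead of `mlxtend` so that it is self-contained, and it fixes `random_state=1` in `train_test_split` for reproducibility, which the original split does not do. The classifier is refit on the whole training set before being scored once on the held-out data.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Assumption: load_iris stands in for mlxtend's iris_data
X, y = load_iris(return_X_y=True)

# Assumption: random_state=1 is added here only to make the split reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.15, shuffle=True, stratify=y, random_state=1
)

# Refit on the full training set, then score once on the untouched test set
final_clf = DecisionTreeClassifier(max_depth=3, random_state=1)
final_clf.fit(X_train, y_train)
test_acc = np.mean(final_clf.predict(X_test) == y_test) * 100

print('Test Accuracy: %.2f%%' % test_acc)
```

Keeping the test set out of the cross-validation loop means this number is not influenced by any choices made while comparing folds.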


Next, we use scikit-learn's `cross_val_score` function to perform the cross-validation automatically, replacing the manual loop above.


In [None]:
from sklearn.model_selection import cross_val_score


cv_acc = cross_val_score(
    estimator=DecisionTreeClassifier(max_depth=3, random_state=1),
    X=X_train,
    y=y_train,
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=1),
)

print('Kfold Accuracy: %.2f%%' % (np.mean(cv_acc)*100))

Kfold Accuracy: 95.26%
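
`cross_val_score` returns one score per fold, so the spread across folds can be reported alongside the mean. The sketch below also checks that a manual loop over the same splitter gives the same per-fold accuracies. As assumptions, it uses scikit-learn's `load_iris` and the full dataset (no separate test split) to stay short and self-contained.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Assumption: the full iris data is used here, without a held-out test set
X, y = load_iris(return_X_y=True)

# A fixed random_state makes every call to cv.split produce the same folds
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

# Manual loop: one accuracy value (as a fraction) per fold
manual_scores = []
for train_idx, valid_idx in cv.split(X, y):
    clf = DecisionTreeClassifier(max_depth=3, random_state=1)
    clf.fit(X[train_idx], y[train_idx])
    manual_scores.append(np.mean(clf.predict(X[valid_idx]) == y[valid_idx]))

# Same procedure in a single call; classifiers are scored by accuracy by default
auto_scores = cross_val_score(
    DecisionTreeClassifier(max_depth=3, random_state=1), X, y, cv=cv
)

print('Per-fold accuracy:', np.round(auto_scores, 3))
print('Mean: %.2f%%, std: %.2f%%' % (auto_scores.mean() * 100, auto_scores.std() * 100))
print('Manual loop matches:', np.allclose(manual_scores, auto_scores))
```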
