Lecture: AI I - Basics 

Previous:
[**Chapter 4.2: Machine Learning with scikit-learn**](../04_ml/02_machine_learning.ipynb)

---

# Chapter 4.3: Evaluation with scikit-learn

- [Cross-Validation](#cross-validation)
- [Hyperparameter Tuning](#hyperparameter-tuning)
- [Metrics](#metrics)
- [Visualization of Results](#visualization-of-results)

Model selection means choosing the ML model that best explains the given data. Because there is a wide variety of models, each with a large range of possible parameter settings, this choice is difficult and has a strong influence on the final result. For example, a model that performs very well on the training data but fails on new data has not learned the underlying data distribution.  

In this notebook, we will cover the following topics:  
- Cross-validation  
- Hyperparameter tuning  
- Metrics and results  
- Visualization of results  

The topic of model selection and evaluation is also covered in great detail in the [sklearn user guide](https://scikit-learn.org/stable/model_selection.html).  


In [1]:
import numpy as np
from matplotlib import pyplot as plt
from sklearn import datasets

## Cross-validation
Learning the parameters of a prediction function and testing on the same data is a methodological mistake: a model that simply repeats the labels of the samples it has just seen would achieve a perfect result, but would not be able to predict anything useful for unseen data. This situation is called overfitting. To avoid this, it is common practice in (supervised) machine learning experiments to set aside part of the available data as a test set.  

### Simple data splitting
The simplest way to split the data is the [train_test_split](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) function, which randomly divides the samples into a training set and a test set.  


In [2]:
from sklearn.model_selection import train_test_split

# load data
X, y = datasets.load_iris(return_X_y=True)
print(X.shape, y.shape)

# split the data into 2 sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=0)

print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)

(150, 4) (150,)
(90, 4) (90,)
(60, 4) (60,)


A support vector machine can now be trained on the training data and evaluated on the test data. For a classifier, `score` returns the mean accuracy on the given data.  


In [3]:
from sklearn import svm
clf = svm.SVC(kernel='linear', C=0.001).fit(X_train, y_train)
clf.score(X_test, y_test)

0.26666666666666666

The result depends strongly on the regularization parameter `C`, so we try out several other values.  


In [4]:
clf = svm.SVC(kernel='linear', C=0.01).fit(X_train, y_train)
clf.score(X_test, y_test)

0.8833333333333333

In [5]:
clf = svm.SVC(kernel='linear', C=0.1).fit(X_train, y_train)
clf.score(X_test, y_test)

0.9333333333333333

In [6]:
clf = svm.SVC(kernel='linear', C=1).fit(X_train, y_train)
clf.score(X_test, y_test)

0.9666666666666667
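
The four cells above can be condensed into a single loop. The following is a minimal sketch that reuses the same split and the same `C` values; the dictionary name `test_scores` is only chosen for illustration.

```python
from sklearn import datasets, svm
from sklearn.model_selection import train_test_split

# same data and split as in the cells above
X, y = datasets.load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=0)

# train one linear SVM per C value and record its accuracy on the test set
test_scores = {}
for C in [0.001, 0.01, 0.1, 1]:
    clf = svm.SVC(kernel='linear', C=C).fit(X_train, y_train)
    test_scores[C] = clf.score(X_test, y_test)

for C, score in test_scores.items():
    print(f"C={C}: accuracy={score:.3f}")
```

Note that choosing `C` by comparing these test scores is exactly the practice the next section warns about.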

### Why Cross-validation?
When evaluating different settings ("hyperparameters") for prediction functions, such as the C value that must be manually set for an SVM, there is still the risk of overfitting to the test set, since the parameters can be changed until the prediction functions work optimally. In this way, knowledge about the test set can "leak" into the model, and the evaluation metrics no longer reflect generalization performance. To solve this problem, another part of the dataset can be set aside as a so-called "validation set": training is done on the training set, evaluation is then performed on the validation set, and if the experiment seems successful, the final evaluation is carried out on the test set.  

However, by splitting the available data into three sets, the number of data points that can be used to train the model is drastically reduced. In addition, the results may depend on a particular random choice of the (training, validation) set pair.  
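
A three-way split can be sketched by calling `train_test_split` twice. The proportions used here (60% training, 20% validation, 20% test) are an assumption made for illustration, not a recommendation from this lecture.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split

X, y = datasets.load_iris(return_X_y=True)

# first split: hold out 20% of all samples as the final test set
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# second split: 25% of the remaining 80% becomes the validation set (20% overall)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=0)

print(X_train.shape, X_val.shape, X_test.shape)
```

Only 90 of the 150 samples remain for training, which illustrates the loss of training data described above.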

To overcome these problems, cross-validation can be applied. A test set is still held out for the final evaluation, but a separate validation set is no longer necessary. In the basic version, called $k$-fold cross-validation, the training set is split into $k$ subsets (folds). The model is then trained $k$ times, each time on $k-1$ folds, and evaluated on the one remaining fold, so that every fold serves as the validation set exactly once. The reported performance is the average of the $k$ scores.  

<img src="https://scikit-learn.org/stable/_images/grid_search_cross_validation.png">  
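
To make the procedure concrete, here is a minimal sketch of $k$-fold cross-validation written as an explicit loop with the `KFold` iterator. It assumes the same 60/40 split as before and a linear SVM with `C=1`; shuffling the folds and the variable names are choices made for this illustration.

```python
import numpy as np
from sklearn import datasets, svm
from sklearn.model_selection import KFold, train_test_split

X, y = datasets.load_iris(return_X_y=True)
# the test set stays untouched during cross-validation
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=0)

k = 5
kf = KFold(n_splits=k, shuffle=True, random_state=0)

fold_scores = []
for train_idx, val_idx in kf.split(X_train):
    # train a fresh model on k-1 folds ...
    clf = svm.SVC(kernel='linear', C=1)
    clf.fit(X_train[train_idx], y_train[train_idx])
    # ... and evaluate it on the held-out fold
    fold_scores.append(clf.score(X_train[val_idx], y_train[val_idx]))

print("mean validation accuracy:", np.mean(fold_scores))
```

The helper functions in the next section perform this loop internally.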

### Using Cross-validation
The easiest way to use cross-validation is with the [cross_val_score](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html#sklearn.model_selection.cross_val_score) helper function.  


In [7]:
from sklearn.model_selection import cross_val_score
clf = svm.SVC(kernel='linear', C=1, random_state=42)

k = 5
scores = cross_val_score(clf, X, y, cv=k)
scores

array([0.96666667, 1.        , 0.96666667, 0.96666667, 1.        ])

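The function returns an array of length $k$. Each element is the score obtained on the fold that was held out in the respective iteration. The mean and standard deviation of these values summarize the result.  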


In [8]:
print("%0.2f accuracy with a standard deviation of %0.2f" % (scores.mean(), scores.std()))

0.98 accuracy with a standard deviation of 0.02


By default, the result of each cross-validation iteration is the value of the model’s `score` method. To change this, the `scoring` parameter can be set.  


In [9]:
scores = cross_val_score(clf, X, y, cv=5, scoring='f1_macro')
scores

array([0.96658312, 1.        , 0.96658312, 0.96658312, 1.        ])

Another way to perform cross-validation is the [cross_validate](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_validate.html) function. In contrast to `cross_val_score`, it accepts multiple metrics at once. In addition to the test scores, it returns the fit and score times for each fold.  


In [10]:
from sklearn.model_selection import cross_validate

scoring = ['precision_macro', 'recall_macro']
clf = svm.SVC(kernel='linear', C=1, random_state=0)
scores = cross_validate(clf, X, y, scoring=scoring)
scores

{'fit_time': array([0.00112128, 0.0012362 , 0.00085711, 0.00105786, 0.00059175]),
 'score_time': array([0.00267792, 0.00205421, 0.00188398, 0.00191569, 0.00161552]),
 'test_precision_macro': array([0.96969697, 1.        , 0.96969697, 0.96969697, 1.        ]),
 'test_recall_macro': array([0.96666667, 1.        , 0.96666667, 0.96666667, 1.        ])}

### Cross-validation Iterators
Both functions accept either an integer $k$ or a cross-validation iterator via the `cv` parameter. If an integer is passed, the [StratifiedKFold](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedKFold.html) iterator with $k$ folds is used when the estimator is a classifier and `y` is binary or multiclass; in all other cases, the [KFold](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html) iterator is used. If an iterator is passed, the given [cross-validation iterator](https://scikit-learn.org/stable/modules/cross_validation.html#cross-validation-iterators) determines how the dataset is split.  
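
As a minimal sketch of the second option, a `StratifiedKFold` iterator can be created explicitly and passed to `cross_val_score`. Enabling `shuffle` and fixing `random_state=0` are assumptions made for this example; they are not the defaults that apply when an integer is passed.

```python
from sklearn import datasets, svm
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = datasets.load_iris(return_X_y=True)
clf = svm.SVC(kernel='linear', C=1)

# explicit iterator: 5 stratified folds, shuffled with a fixed seed
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

scores = cross_val_score(clf, X, y, cv=cv)
print(scores.mean(), scores.std())
```

Passing an iterator instead of an integer gives control over shuffling, stratification and the number of splits.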


---

Exercise: [**Exercise 4.3: Evaluation with scikit-learn**](../04_ml/exercises/03_evaluation.ipynb)

Next: [**Chapter 5.1: Assessment 1**]()