# Questions on Methodology

## Question 1

Assume that we have a dataset that we would like to use for model generation with a specific learning algorithm as well as for measuring the error. If we decide to use ten-fold cross-validation, which of the following are correct?

A: Each instance in the dataset will be used exactly once for testing.

B: The performance estimate obtained from averaging the results on the ten test sets is an unbiased estimate of the performance of the model trained from the whole dataset. 

C: The setup corresponds to drawing ten independent samples of training and test sets from some unknown underlying distribution, where the training sets are nine times larger than the test sets. 

D: Using the t-distribution, we can generate a confidence interval from the ten measurements, which contains the true error level at the specified confidence level, if we assume that the errors are distributed normally.

Correct answer: A

Explanation: Since the dataset is divided into ten (non-overlapping) partitions, each instance will appear in exactly one of them, and hence be used exactly once for testing. Since each test set is used to evaluate a model trained on 9/10 of the data, one can expect that model to be (slightly) weaker than a model trained from the full set; the estimate can hence be expected to underestimate the performance of the latter. The ten pairs of training and test sets are not independent; the same instance cannot appear in multiple test sets, and any two training sets share exactly 8/9 of their instances. Since the observed measurements do not come from ten independent samples, the procedure for forming confidence intervals comes with no guarantees.
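The partitioning argument can be checked with a few lines of Python. This is a minimal sketch using only the standard library; the helper `k_fold_test_sets` and the dataset size of 1000 are assumptions made for illustration, not part of the question.

```python
import random

def k_fold_test_sets(n_instances, k, seed=0):
    """Shuffle the instance indices and deal them into k disjoint test folds."""
    indices = list(range(n_instances))
    random.Random(seed).shuffle(indices)
    return [set(indices[i::k]) for i in range(k)]

n, k = 1000, 10  # hypothetical dataset size, ten folds
test_sets = k_fold_test_sets(n, k)
# Each training set is everything outside the corresponding test fold.
train_sets = [set(range(n)) - fold for fold in test_sets]

# How many test folds does each instance appear in?
times_tested = [sum(i in fold for fold in test_sets) for i in range(n)]

# Fraction of one training set that is shared with another training set.
overlap = len(train_sets[0] & train_sets[1]) / len(train_sets[0])

print(set(times_tested))
print(overlap)
```

Every instance is tested once, and any two training sets share 800 of their 900 instances, i.e., 8/9, which is why the ten measurements cannot be treated as independent samples.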


## Question 2

Assume that we are organizing a Kaggle competition and have received 100 contributions (by independent teams), in the form of predictive models generated from 1000 training instances. We have evaluated the models on a test set of the same size, which has been hidden from the teams, and found that the best performing model reached an accuracy of 90.1%, just above the 90% threshold required for receiving an award of 100 000 USD. What should we expect when evaluating the best performing (selected) model on a second test set of the same size, assuming it has been sampled from the same underlying distribution as the first test set?

A: If the competing models are clearly outperformed, then it is more likely that the selected model reaches the target level, compared to if most models performed on the same level.

B: The chance that the selected model performs above the target level decreases if the winning team has submitted several contributions which differ randomly in 2% of the predictions.

C: The chance that the selected model performs above the target level increases with the number of contributions.

Correct answer: A, B, possibly also C

Explanation: The estimated accuracy of a model will vary depending on what set of instances is used to evaluate its performance. If we consider such a set of instances to be a random sample drawn from some (unknown) target distribution, the estimated
accuracy of the model will sometimes be lower and sometimes higher than the
true accuracy (the accuracy of the model with respect to the target distribution). If the
compared models have similar true accuracies, then the observed differences in
the estimated accuracies will mainly be due to the sample we have drawn, and
hence the extreme values of the observed estimated accuracies will be biased; the
lowest value will be overly pessimistic and the highest value overly optimistic,
and the difference between the extreme values and the true accuracies will increase with the number of compared models. Hence, if we select the highest
of these values as an estimate for the true accuracy of the best model, we will
systematically be overestimating the performance. However, if the true accuracy of one of the models is much higher than for the others (and hence would
almost always be outperforming the others independently of the sample used),
then this bias will be smaller. Since we typically do not know beforehand what
the true accuracies are, there is hence a high risk that the estimated accuracy
is indeed too high. In principle, an increased number of contributions could result in a model on a new, high level being found, hence eliminating the bias due to the sampling error.
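The selection effect described above can be illustrated with a small simulation. The sketch below is an assumption-laden toy, not a model of a real competition: the true accuracies (0.88, 0.80, 0.90) are made up, and each model's test result is drawn independently, whereas models scored on a shared test set would have correlated errors.

```python
import random

def observed_accuracy(true_acc, n, rng):
    """Accuracy measured on n test instances for a model with the given true accuracy."""
    return sum(rng.random() < true_acc for _ in range(n)) / n

def average_gap(true_accs, n_test=1000, trials=20, seed=0):
    """Mean difference between the winner's first-test score and its second-test score."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        first = [observed_accuracy(a, n_test, rng) for a in true_accs]
        best = max(range(len(first)), key=first.__getitem__)  # select the top scorer
        second = observed_accuracy(true_accs[best], n_test, rng)
        total += first[best] - second
    return total / trials

similar = [0.88] * 100               # hypothetical: all models equally good
clear_winner = [0.80] * 99 + [0.90]  # hypothetical: one model far ahead

gap_similar = average_gap(similar)
gap_clear = average_gap(clear_winner)
print(gap_similar, gap_clear)
```

When all models are equally good, the winner's first score is inflated and drops on the second test set; when one model is clearly ahead, it wins almost regardless of the sample, and the gap is close to zero.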

## Question 3

Assume that we want to compare a new algorithm to a baseline algorithm, for
some classification task. As we are not sure what hyper-parameter settings to
use for the new algorithm, we will investigate 100 different settings for that,
while we use a standard hyper-parameter setting for the baseline algorithm.
We first randomly split a given dataset into two equal-sized halves; one for
model building and one for testing. We then employ 10-fold cross-validation
using the first half of the data, measuring the accuracy of each model generated from an algorithm and hyper-parameter setting. Assume that the best performing hyper-parameter setting for the new algorithm results in a higher (cross-validation) accuracy than the baseline algorithm. Assume further that the two models (trained on the entire first half) are evaluated on the second half of the data. Which of the following are correct?

A: We should expect to see the same relative performance, i.e., the new algorithm (with the best-performing hyper-parameter setting) outperforms the baseline (with the standard hyperparameter setting).

B: If a majority of the evaluated hyperparameter settings outperform the baseline on the first half, we may expect the best performing configuration to outperform the baseline on the second half.

C: The observed differences in performance between the different hyperparameter settings should not affect any conclusions on what relative performance to expect between the new and the baseline algorithm on the test set.


Correct answer: B

Explanation: Since we have evaluated 100 different configurations for the novel algorithm, the
observed (cross-validation) accuracy for the best performing of these is most
likely biased, i.e., the performance is over-estimated, due to sampling error.
Hence, although the best performing configuration outperforms the baseline
when performing cross-validation on the first half of the data, the corresponding
model trained with this configuration on this half and evaluated on the second
half may very well be outperformed by the baseline. However, if a majority of
the evaluated configurations outperform the baseline on the first half, we may
expect the best performing configuration to still outperform the baseline on the
second half. Moreover, if one (or a few) of the configurations clearly outperform the others, then we can expect the sampling error to be reduced, which may affect what should be expected from the comparison on the second half.
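To see how much the outcome on the second half depends on how the settings compare to the baseline overall, consider the following rough simulation. It is only a sketch under strong assumptions: all 100 settings share one hypothetical true accuracy, the cross-validation score is approximated by an independent draw on 500 instances, and the values 0.79, 0.80 and 0.84 are invented for illustration.

```python
import random

def observed_accuracy(true_acc, n, rng):
    """Accuracy measured on n instances for a model with the given true accuracy."""
    return sum(rng.random() < true_acc for _ in range(n)) / n

def holdout_win_rate(config_acc, baseline_acc=0.80, n_configs=100,
                     n_half=500, trials=40, seed=0):
    """How often the selected setting also beats the baseline on the second half."""
    rng = random.Random(seed)
    wins = selected = 0
    for _ in range(trials):
        cv_scores = [observed_accuracy(config_acc, n_half, rng)
                     for _ in range(n_configs)]
        baseline_cv = observed_accuracy(baseline_acc, n_half, rng)
        if max(cv_scores) <= baseline_cv:
            continue  # the question assumes the best setting wins on the first half
        selected += 1
        new_test = observed_accuracy(config_acc, n_half, rng)
        baseline_test = observed_accuracy(baseline_acc, n_half, rng)
        wins += new_test > baseline_test
    return wins / selected if selected else float("nan")

rate_worse = holdout_win_rate(0.79)   # settings truly slightly below the baseline
rate_better = holdout_win_rate(0.84)  # settings truly above the baseline
print(rate_worse, rate_better)
```

In the first scenario the best of 100 settings usually looks better than the baseline on the first half purely by chance, yet it loses more often than not on the second half. In the second scenario, where most settings beat the baseline already on the first half, the selected setting typically keeps its lead.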

## Question 4

Assume that we would like to mitigate the curse of dimensionality for the decision tree
learning algorithm and a very large dataset. We decide to select
the ten features in the dataset with the highest information gain, out of the five hundred available categorical features. Assume that we now employ leave-one-out cross-validation on the reduced dataset, i.e., using only the ten selected features, together with the learning algorithm. Which of the following are correct?

A: The chance of generating a decision tree that perfectly fits the training data is reduced.

B: Feature selection reduces the risk that the estimated performance is higher than what would be expected on an independent test set.

C: If we had kept all features, then the performance estimate would be less biased.

Correct answer: A, C

Explanation: Since we no longer have access to all features, the chance that two instances with different class labels end up in the same leaf, and consequently the probability of not perfectly fitting the training instances, increases. 

In the proposed setup, we would have access to the test labels when performing
feature selection. This means that we could end up with a set of features which
perform very well together with the learning algorithm on the test set, but which
may not have been selected when basing the decision on what features to include
using the training instances only. Since we have selected a model guided by the
test instances, there is a risk that the estimated performance using these test
instances is biased compared to when evaluating the model on independent test
instances, which have been used neither for feature selection nor for model building.
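The leak can be made visible on data where nothing can be learned at all. The sketch below is a toy under stated assumptions: features and labels are random bits (a small dataset, unlike the very large one in the question), and a simple per-feature majority vote stands in for the decision tree learner. All function names are hypothetical.

```python
import math
import random

def entropy(labels):
    """Binary entropy (in bits) of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return -sum(q * math.log2(q) for q in (p, 1 - p) if q > 0)

def information_gain(X, y, j, idx):
    """Information gain of binary feature j, computed on the instances in idx."""
    gain = entropy([y[i] for i in idx])
    for value in (0, 1):
        subset = [y[i] for i in idx if X[i][j] == value]
        gain -= len(subset) / len(idx) * entropy(subset)
    return gain

def select_features(X, y, idx, k):
    """Indices of the k features with the highest information gain on idx."""
    ranked = sorted(range(len(X[0])),
                    key=lambda j: information_gain(X, y, j, idx), reverse=True)
    return ranked[:k]

def predict(X, y, train_idx, features, x):
    """Stand-in classifier: each feature votes for the majority training label
    among the training instances that share its value with x."""
    votes = 0
    for j in features:
        matching = [y[i] for i in train_idx if X[i][j] == x[j]]
        if matching:
            votes += 1 if 2 * sum(matching) > len(matching) else -1
    return 1 if votes > 0 else 0

def loo_accuracy(X, y, k, select_inside_fold):
    """Leave-one-out accuracy, selecting features either once on all data
    (leaky) or separately inside each fold (honest)."""
    all_idx = list(range(len(y)))
    global_features = select_features(X, y, all_idx, k)  # sees every label
    correct = 0
    for i in all_idx:
        train_idx = [t for t in all_idx if t != i]
        features = (select_features(X, y, train_idx, k)
                    if select_inside_fold else global_features)
        correct += predict(X, y, train_idx, features, X[i]) == y[i]
    return correct / len(y)

rng = random.Random(0)
n, d, k = 40, 300, 10
X = [[rng.randint(0, 1) for _ in range(d)] for _ in range(n)]
y = [rng.randint(0, 1) for _ in range(n)]  # labels are pure noise

leaky = loo_accuracy(X, y, k, select_inside_fold=False)
honest = loo_accuracy(X, y, k, select_inside_fold=True)
print(leaky, honest)
```

Since the labels are noise, no model can truly do better than chance. Selecting the features once on all instances nevertheless yields an optimistic leave-one-out estimate, while repeating the selection inside each fold removes that optimism.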

## Question 5

Assume that we want to develop a model and estimate its performance on
independent data. We have therefore decided to randomly split an available
dataset, with numerical features only, into a training and test set, using the former to train the model and the latter to estimate its performance. However, since the learning algorithm
that we would like to use cannot directly deal with missing values, we have
decided to employ some imputation technique prior to applying the algorithm.

Which of the following statements are correct?

A: The performance estimate will not be biased as long as imputation is done in the same way for the training and test sets.

B: The performance estimate will not be biased as long as the test instances do not affect the way in which imputation is done.

C: The performance estimate will not be biased if we replace all missing values with zero.

Correct answer: B, C

Explanation: To obtain an estimate of the performance on independent data, we need to make sure that no information from the test set is carried over to the trained model. However, if we employ imputation before splitting the dataset into a training and a test set, some information from the test set may be used when imputing feature values in the training set. The model could then be better fitted to the test set than to some other independent test set, from which no information has been extracted. Hence, handling the training and test set in the same way may still lead to a biased estimate. If the imputation is not affected by the test instances, we should not expect the performance on the latter to differ from the performance on independent instances. The same holds if we just replace missing values with zero, as this is one way of preventing the test instances from affecting how imputation is done.
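As a concrete illustration, mean imputation can be fitted on the training set only and then applied unchanged to the test set. This is a minimal sketch; mean imputation is just one possible technique, missing values are represented by `None`, the helper names and the tiny dataset are made up, and every column is assumed to have at least one observed training value.

```python
def column_means(rows):
    """Mean of each column, ignoring missing values (None)."""
    means = []
    for col in zip(*rows):
        present = [v for v in col if v is not None]
        means.append(sum(present) / len(present))
    return means

def impute(rows, fill_values):
    """Replace each missing value with the fill value of its column."""
    return [[fill if v is None else v for v, fill in zip(row, fill_values)]
            for row in rows]

train = [[1.0, None], [3.0, 4.0], [None, 8.0]]
test = [[None, 100.0], [50.0, None]]

# Fit the imputation on the training instances only ...
fill_values = column_means(train)
# ... and apply the same fill values to both sets.
train_imputed = impute(train, fill_values)
test_imputed = impute(test, fill_values)

# For contrast: means computed before splitting are shaped by the test instances.
leaky_fill = column_means(train + test)
print(fill_values, leaky_fill)
```

Here the training-only fill values are `[2.0, 6.0]`, whereas the fill values computed on the combined data are pulled towards the test instances, which is exactly the information flow that option B rules out.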


## Question 6

Assume that we have a dataset with extreme (binary) class imbalance, making it hard to find a more accurate model than a dummy model that just predicts the majority class. Assume that we have decided to use a naive Bayes classifier.

Which of the following statements are correct?

A: Random undersampling of the majority class can be expected to improve accuracy.

B: Random oversampling of the minority class can be expected to improve accuracy.

C: The effect of oversampling by including multiple copies of the minority class instances can be achieved by modifying the class priors.

D: Random under- or oversampling can be expected to improve the area under the ROC curve.

Correct answer: C

Explanation: Random under- and oversampling will mainly affect the class priors, and since these will no longer reflect what can be expected in independent test data, one would expect the accuracy to decrease. With undersampling, the accuracy will also suffer because the class-conditional probabilities become less precise for the undersampled class, as the number of observations is reduced. The effect of adding copies of the minority class instances can be achieved by modifying the class priors; the class-conditional probabilities will generally not be affected, as long as the probabilities for feature-value combinations with zero observations remain the same. Since random under- and oversampling (as well as changing the class priors) will not affect the rankings, the area under the ROC curve will not be affected.
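The equivalence between copying minority instances and changing the priors can be verified exactly on a toy example. The sketch below assumes a categorical naive Bayes model with maximum-likelihood estimates and no smoothing (with Laplace smoothing the class-conditional estimates would shift slightly); the data and function names are invented for illustration.

```python
from collections import Counter
from fractions import Fraction

def fit_naive_bayes(rows, labels):
    """Maximum-likelihood priors and class-conditional probabilities (no smoothing)."""
    class_counts = Counter(labels)
    priors = {c: Fraction(cnt, len(labels)) for c, cnt in class_counts.items()}
    value_counts = Counter(
        (c, j, v) for row, c in zip(rows, labels) for j, v in enumerate(row))
    conditionals = {key: Fraction(cnt, class_counts[key[0]])
                    for key, cnt in value_counts.items()}
    return priors, conditionals

def posterior_positive(row, priors, conditionals):
    """P(class 1 | row) under the naive Bayes independence assumption."""
    joint = {}
    for c, prior in priors.items():
        p = prior
        for j, v in enumerate(row):
            p *= conditionals.get((c, j, v), 0)  # unseen combination: probability 0
        joint[c] = p
    return joint[1] / (joint[0] + joint[1])

majority = [("a", "x")] * 4 + [("a", "y")] * 2 + [("b", "x"), ("b", "y")]
minority = [("b", "y"), ("b", "y"), ("a", "x")]
rows = majority + minority
labels = [0] * len(majority) + [1] * len(minority)

# Oversample by adding three extra copies of every minority instance.
rows_over = rows + minority * 3
labels_over = labels + [1] * (len(minority) * 3)

priors, cond = fit_naive_bayes(rows, labels)
priors_over, cond_over = fit_naive_bayes(rows_over, labels_over)

queries = [("a", "x"), ("a", "y"), ("b", "x"), ("b", "y")]
ranking = sorted(queries, key=lambda q: posterior_positive(q, priors, cond))
ranking_over = sorted(queries,
                      key=lambda q: posterior_positive(q, priors_over, cond_over))
print(cond == cond_over, priors, priors_over, ranking == ranking_over)
```

The class-conditional probabilities are identical before and after oversampling and only the priors differ, so plugging the new priors into the original model gives the same classifier. Because the posterior odds are merely rescaled by a constant factor, the ranking of the query instances, and hence the area under the ROC curve, stays the same.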