# Questions on Combining Models

## Question 1

Can the generation of a random forest with 1024 trees be at least as fast as the generation of a single decision tree (using the standard divide-and-conquer algorithm)?

A: Yes, but only if the trees in the forest are generated in parallel

B: Yes, if the number of training instances is large enough

C: Yes, if the number of features is large enough

Correct answer: C

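Explanation: When generating a random forest, only a random subset of the features is evaluated when searching
for the best way to split a node in a tree. If this subset constitutes less than one percent of the
original features, then the cost of growing a hundred trees is expected to be less than that of growing a single
tree while evaluating all features (as the computational cost is directly proportional to the number
of evaluated features). By the same argument, 1024 trees are no more expensive than a single tree once the
subset makes up at most 1/1024 of the features, which is possible only when the number of features is large enough.
Moreover, as the trees in the random forest are generated from bootstrap replicates, which contain fewer
distinct training instances than the original training set, the fully grown trees can be expected to be
shallower, and hence faster to generate.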
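The proportionality argument can be checked with a few lines of arithmetic. The sketch below is only a rough cost model: the function name `relative_forest_cost` is hypothetical, the cost is assumed to depend on nothing but the number of evaluated features, and the square-root rule for the subset size is an assumed (though common) choice that the question itself does not specify.

```python
import math

def relative_forest_cost(n_trees, n_features, features_per_split):
    """Cost of a forest relative to one tree that evaluates all features.

    Simplifying assumption: cost is proportional to the number of
    features evaluated per split, and nothing else differs.
    """
    return n_trees * features_per_split / n_features

# 100 trees, each evaluating 1% of 10,000 features: break-even.
print(relative_forest_cost(100, 10_000, 100))  # 1.0

# 1024 trees with sqrt(n_features) features per split (assumed rule).
for n_features in (10_000, 1024 ** 2, 4 * 1024 ** 2):
    k = math.isqrt(n_features)
    print(n_features, relative_forest_cost(1024, n_features, k))
# 10000 10.24
# 1048576 1.0
# 4194304 0.5
```

Under these assumptions the forest only breaks even at roughly a million features, which illustrates why the answer hinges on the number of features rather than on the number of training instances.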

## Question 2

Assume that we want to apply both random forests (RF) and the gradient boosting
machine (GBM) to a regression task, where all regression values in the training set fall
into the range $[0,1]$. Which of the following are correct?

A: RF may produce predictions outside the range $[0,1]$

B: GBM may produce predictions outside the range $[0,1]$

C: Whether RF may produce predictions outside the range depends on the number of trees

D: Whether GBM may produce predictions outside the range depends on the number of base models


Correct answer: B, D

Explanation: Random forests form their predictions by averaging the predictions of the individual trees, and since these (normally) are restricted to making predictions by averaging observed regression values in the leaves, they will not produce predictions that are higher (lower) than the highest (lowest) regression value in the training set. Gradient boosting machines form their predictions by summing the predictions of the individual trees, which may very well result in predictions that fall outside the range of observed regression values. However, if the GBM contains one base model only, it will predict the mean of the regression values in the training set.
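The difference between averaging and summing can be seen on a toy problem. The following is a minimal sketch, not a full implementation of either algorithm: the helper names (`fit_stump`, `predict_stump`, `gbm_predict`, `forest_predict`) are hypothetical, the base models are least-squares stumps on binary features, the learning rate is assumed to be 1, and the initial constant model (the training mean) is counted separately from the boosting stages.

```python
import numpy as np

def fit_stump(X, target):
    """Least-squares stump on binary features: (feature, left_mean, right_mean)."""
    mean = target.mean()
    best, best_sse = (0, mean, mean), ((target - mean) ** 2).sum()  # constant fallback
    for j in range(X.shape[1]):
        right = X[:, j] > 0.5
        if right.all() or not right.any():
            continue  # this split would leave one side empty
        lm, rm = target[~right].mean(), target[right].mean()
        sse = ((target[~right] - lm) ** 2).sum() + ((target[right] - rm) ** 2).sum()
        if sse < best_sse:
            best, best_sse = (j, lm, rm), sse
    return best

def predict_stump(stump, X):
    j, lm, rm = stump
    return np.where(X[:, j] > 0.5, rm, lm)

def gbm_predict(X, y, n_stages, learning_rate=1.0):
    """Start from the mean, then add stumps fitted to the residuals."""
    pred = np.full(len(y), y.mean())
    for _ in range(n_stages):
        pred = pred + learning_rate * predict_stump(fit_stump(X, y - pred), X)
    return pred

def forest_predict(X, y, n_trees, rng):
    """Average stumps fitted to bootstrap replicates (leaf values are label means)."""
    preds = []
    for _ in range(n_trees):
        idx = rng.choice(len(y), len(y), replace=True)
        preds.append(predict_stump(fit_stump(X[idx], y[idx]), X))
    return np.mean(preds, axis=0)

# Toy data: the label is 1 only when both binary features are 1.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0.0, 0.0, 0.0, 1.0])

print(gbm_predict(X, y, n_stages=1))  # [0.  0.  0.5 0.5]
print(gbm_predict(X, y, n_stages=2))  # [-0.25  0.25  0.25  0.75]
rf = forest_predict(X, y, n_trees=25, rng=np.random.default_rng(0))
print(rf.min() >= 0.0, rf.max() <= 1.0)
```

With one stage the boosted predictions stay inside $[0,1]$, while the second stage pushes one of them to $-0.25$, so whether the range is left depends on the number of base models. The averaged predictions are convex combinations of observed label means and therefore remain within the range.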

## Question 3

Assume that we would like to generate an ensemble of Naive Bayes classifiers in a similar way to how random forests are trained.

Which of the following statements are correct?

A: Bagging can be employed

B: Random feature selection can be employed

C: The class priors may differ between the base models

D: Numerical features have to be discretized in the same way across the base models

Correct answer: A, B, C

Explanation: If we would like to generate an ensemble of Naive Bayes classifiers using the
strategy of random forests, it would mean that we introduce diversity in two
ways: by training each classifier from a bootstrap replicate (through bagging)
and by considering a random subset of the features for each classifier. Since the class frequencies may differ between bootstrap replicates, so may the class priors of the base models. The preprocessing steps, such as the discretization of numerical features, may be performed independently for each base model.
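The two sources of diversity, and their effect on the class priors, can be sketched with NumPy alone. This is an illustration under assumed settings: the sizes `n`, `d`, `n_models` and `k` are arbitrary, the seed is fixed only for repeatability, and no actual Naive Bayes classifier is trained; only the rows, the feature subset and the class priors that each ensemble member would see are computed.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, only for repeatability

n, d = 20, 6          # instances and features (arbitrary sizes)
n_models, k = 5, 3    # ensemble size and features per member (assumed)
y = rng.choice([0, 1], n)

members = []
for _ in range(n_models):
    rows = rng.choice(n, n, replace=True)    # bagging: bootstrap replicate
    cols = rng.choice(d, k, replace=False)   # random feature subset for this member
    priors = np.bincount(y[rows], minlength=2) / n  # class priors in the replicate
    members.append((rows, cols, priors))

for rows, cols, priors in members:
    print(sorted(cols.tolist()), priors)
```

The printed priors generally differ from member to member, because each bootstrap replicate has its own class frequencies.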

## Question 4

Assume that we would like to compare AdaBoost using decision stumps and fully grown decision trees as base models, respectively.

Which of the following statements are correct?

A: We can expect more base models to be generated when using stumps

B: We can expect the accuracy to be higher when using fully grown trees compared to using stumps

C: We can expect that using more base models leads to improved predictive performance

Correct answer: A, C

Explanation: The AdaBoost algorithm terminates before the maximum number of models
has been generated if the error on the training set either exceeds 50% or
is 0%. The latter may very well happen if we allow the decision tree to be
fully grown, as it can easily overfit the training set. In the worst case,
AdaBoost then iterates only once, producing a model consisting of a single,
typically rather weak, decision tree. In contrast, it is often not possible to find
a single decision stump that perfectly classifies all training instances, which allows
AdaBoost to continue generating ensemble members until the maximum number
of iterations is reached. The resulting ensemble can be expected to achieve higher predictive performance, as this increases with the number of included base models.
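The early-termination behaviour can be reproduced with a small sketch. It implements one common variant of the AdaBoost stopping rule (stop at zero error or at an error of at least 50%); implementations differ in the details. The names `adaboost`, `fit_memorizer` and `fit_stump` are hypothetical, and the memorizing learner is merely a stand-in for a fully grown tree that fits the training set perfectly.

```python
import numpy as np

def adaboost(fit, X, y, max_models):
    """Labels in {-1, +1}; fit(X, y, w) returns a prediction function."""
    w = np.full(len(y), 1.0 / len(y))
    ensemble = []
    for _ in range(max_models):
        h = fit(X, y, w)
        miss = h(X) != y
        err = w[miss].sum()
        if err >= 0.5:
            break                      # no better than chance: stop
        if err == 0.0:
            ensemble.append((1.0, h))  # perfect on the training set: keep and stop
            break
        alpha = 0.5 * np.log((1 - err) / err)
        ensemble.append((alpha, h))
        w = w * np.exp(np.where(miss, alpha, -alpha))  # upweight mistakes
        w = w / w.sum()
    return ensemble

def fit_memorizer(X, y, w):
    """Stand-in for a fully grown tree: memorizes the (distinct) training points."""
    table = dict(zip(X.tolist(), y.tolist()))
    return lambda Z: np.array([table[z] for z in Z.tolist()])

def fit_stump(X, y, w):
    """Weighted decision stump on a single numeric feature."""
    best = None
    for t in np.unique(X):
        for sign in (1, -1):
            err = w[np.where(X <= t, sign, -sign) != y].sum()
            if best is None or err < best[0]:
                best = (err, t, sign)
    _, t, sign = best
    return lambda Z: np.where(Z <= t, sign, -sign)

X = np.arange(6.0)
y = np.array([1, 1, -1, -1, 1, 1])  # not separable by any single stump

print(len(adaboost(fit_memorizer, X, y, max_models=10)))  # 1
print(len(adaboost(fit_stump, X, y, max_models=10)))
```

The memorizing learner stops after a single model because its training error is zero, whereas the stump-based run keeps adding members.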

## Question 5

Assume that we first train base models using a set of algorithms
from a given dataset and then generate a stacked model from the output of
the base models when given the same dataset as input.

Which of the following statements are correct?

A: We can expect the stacked model to outperform each of the individual models

B: If one base model is overfitting the training set, this will be compensated for by having included other base models

C: We can expect averaging of the base models to outperform the stacked model 

Correct answer: C

Explanation: If any of the algorithms has a tendency to overfit the training instances, such
as the decision tree learning algorithm, the label predicted by
the corresponding model for a training instance will, with high likelihood, be the
same as the original label of that instance. In the extreme case, the base
model will perfectly classify all training instances. In such a case, there is a risk
that the stacking model will rely only on the output of such an overfitted base
model, and hence not perform any better than it. Averaging can handle the problem unless a majority of the base models suffer from overfitting. The problem would be
avoided by using a separate dataset and generating the stacking model from the
output of the base models on this dataset.
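The effect can be illustrated by comparing the outputs of an overfitting base model on its own training data with its outputs on held-out data. The sketch below makes several assumptions: a 1-nearest-neighbour predictor (hypothetical helper `one_nn_predict`) stands in for an overfitting base model, the labels are pure noise so that there is nothing to learn, and held-out outputs are obtained by 5-fold cross-validation, which is one way of realizing the "separate dataset" idea rather than the only one.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, only for repeatability

n = 40
X = rng.random((n, 2))
y = rng.choice([0, 1], n)  # noise labels: no real signal

def one_nn_predict(X_train, y_train, X_test):
    """Predict the label of the closest training point (memorizes the training set)."""
    dist = ((X_test[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=2)
    return y_train[dist.argmin(axis=1)]

# Meta-feature computed on the same data the base model was trained on.
in_sample = one_nn_predict(X, y, X)

# Meta-feature computed on data the base model has not seen (5 folds).
held_out = np.empty(n, dtype=y.dtype)
for test in np.array_split(rng.permutation(n), 5):
    train = np.setdiff1d(np.arange(n), test)
    held_out[test] = one_nn_predict(X[train], y[train], X[test])

print("agreement with labels, in-sample:", (in_sample == y).mean())
print("agreement with labels, held-out: ", (held_out == y).mean())
```

The in-sample output reproduces the labels exactly, so a stacked model trained on it would learn to trust this base model completely. The held-out output gives the stacked model a more honest picture of how reliable the base model is.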

## Question 6

```python
import numpy as np

rng = np.random.default_rng()

# Now write:
# rng.choice([1, 2, 3], 10)
# instead of (legacy):
# np.random.choice([1, 2, 3], 10)

n = 10
d = 5
X = rng.random((n, d))
y = rng.choice([0, 1], n)

# Which of the following results in a proper
# bootstrap replicate of the objects and labels:

# A
X_bootstrap = X[rng.choice(n, n, replace=False)]
y_bootstrap = y[rng.choice(n, n, replace=False)]

# B
X_bootstrap = X[rng.choice(n, n, replace=True)]
y_bootstrap = y[rng.choice(n, n, replace=True)]

# C
bootstrap = rng.choice(n, n, replace=False)
X_bootstrap = X[bootstrap]
y_bootstrap = y[bootstrap]

# D
bootstrap = rng.choice(n, n, replace=True)
X_bootstrap = X[bootstrap]
y_bootstrap = y[bootstrap]
```

Correct answer: D

Explanation: The random selection should be with replacement, and the same indexes should be used for both the objects and the labels.
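A quick way to convince oneself that shared indexes keep objects and labels aligned is to use labels that identify their own row. The sketch below assumes such artificial labels (`y` equal to the row index) and a fixed seed for repeatability; neither is part of the original question.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, only for repeatability

n, d = 10, 5
X = rng.random((n, d))
y = np.arange(n)  # artificial labels: each label is its own row index

# Option D: one index vector, drawn with replacement, used for both arrays.
idx = rng.choice(n, n, replace=True)
X_bootstrap, y_bootstrap = X[idx], y[idx]

# Every sampled object still carries its own label.
aligned = all(np.array_equal(X_bootstrap[i], X[y_bootstrap[i]]) for i in range(n))
print(aligned)  # True

# Sampling with replacement typically leaves some objects out ("out-of-bag").
out_of_bag = np.setdiff1d(np.arange(n), idx)
print(len(out_of_bag), "objects were not drawn")
```

Drawing two separate index vectors, as in options A and B, would pair objects with the labels of other objects, and sampling without replacement, as in options A and C, merely permutes the data.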

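```python
# One more illustration:

import numpy as np

rng = np.random.default_rng()

# Draw 5 integers from {0, 1}; the upper bound 2 is exclusive.
rng.integers(0, 2, 5)
# Example output (varies between runs): array([0, 0, 1, 0, 1])
```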