# Questions on Decision Trees and Rules

## Question 1

Which of the following statements are correct regarding decision trees generated from categorical features?

A: In a tree with multi-way splits, the same feature may appear multiple times on a path from the root to a leaf

B: In a tree with binary splits, the same feature may appear multiple times on a path from the root to a leaf

C: A tree with multi-way splits will always contain more nodes than a tree with binary splits

Correct answer: B

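Explanation: With multi-way splits, all instances that follow a path containing an equality condition on a feature have the same value for that feature, so splitting on it again would place all instances in the same partition. (This holds even if some of the instances are missing the feature value.) With binary splits, the branch for the inequality condition may still contain instances with different values, so the feature can be used again further down the path. A tree with multi-way splits does not always contain more nodes: for a feature with only two values, for example, the two kinds of split coincide.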
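The argument can be checked on a toy dataset. The following is a minimal sketch, not a tree learner: the `colour` feature, the data, and the helpers `multiway_split` and `binary_split` are hypothetical names introduced only for illustration.

```python
from collections import defaultdict

# Hypothetical toy data: each instance has a single categorical feature.
instances = [{"colour": c} for c in ["red", "green", "blue", "red", "blue", "green"]]


def multiway_split(rows, feature):
    """Create one child per distinct value of the feature."""
    children = defaultdict(list)
    for row in rows:
        children[row[feature]].append(row)
    return dict(children)


def binary_split(rows, feature, value):
    """Create two children: feature == value and feature != value."""
    equal = [r for r in rows if r[feature] == value]
    other = [r for r in rows if r[feature] != value]
    return equal, other


# Multi-way: each child holds a single value, so splitting on the same
# feature again produces exactly one partition per child.
multiway_children = multiway_split(instances, "colour")
resplit_sizes = {
    value: len(multiway_split(rows, "colour"))
    for value, rows in multiway_children.items()
}

# Binary: the "!=" branch still mixes values, so the feature can be reused.
_, not_red = binary_split(instances, "colour", "red")
green, blue = binary_split(not_red, "colour", "green")

print(resplit_sizes)
print(len(green), len(blue))
```

Re-splitting after the multi-way split never separates any instances, whereas the second binary split on the same feature still produces two non-empty partitions.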

## Question 2

Which of the following statements are correct regarding decision trees with binary splits generated from numerical features?

A: The same feature may appear multiple times on a path from the root to a leaf

B: Missing values must be imputed prior to generating the tree

C: Min-max normalization may have an effect on how the instances are partitioned 

Correct answer: A

Explanation: A numerical feature can be tested against different thresholds along the same path. Instances with missing values may be distributed over the two child nodes following each split, so imputation is not required. Min-max normalization does not change the ordering of the instances, and hence the number of possible splits and the resulting partitions stay the same.
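The claim about min-max normalization can be illustrated with a small sketch. It assumes that candidate thresholds are midpoints between consecutive distinct values, which is one common convention; the data and the helper names `min_max` and `candidate_partitions` are made up for the example.

```python
def min_max(values):
    """Rescale values linearly to the interval [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def candidate_partitions(values):
    """Return the set of left-hand partitions over all binary splits.

    Thresholds are assumed to be midpoints between consecutive distinct
    values; each partition is the set of instance indices with
    value <= threshold.
    """
    distinct = sorted(set(values))
    partitions = set()
    for a, b in zip(distinct, distinct[1:]):
        threshold = (a + b) / 2
        left = frozenset(i for i, v in enumerate(values) if v <= threshold)
        partitions.add(left)
    return partitions


raw = [12.0, 3.5, 7.0, 3.5, 20.0, 9.25]
scaled = min_max(raw)

# The thresholds differ, but the instances end up in the same partitions.
same = candidate_partitions(raw) == candidate_partitions(scaled)
print(same)
```

Since the impurity of a split depends only on which instances fall on each side, identical partitions imply identical split evaluations.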



## Question 3

Assume that trees with multi-way splits are considered and that we by mistake have included a unique keyword identifier as a feature in the dataset.


Which of the following statements are correct?

A: Most likely, more informative features will be chosen for the splits

B: The feature will be evaluated as uninformative by Information gain

C: The feature will be evaluated as uninformative by Gini index

D: Laplace correction would not have an impact on whether the feature will be selected or not

Correct answer: None

Explanation: Each child node resulting from the split contains exactly one instance, so the relative frequency is one for one of the classes and zero for the others. Both Information gain and Gini index will therefore consider the split to be the most informative possible; there cannot be a better split according to these metrics. If the class probabilities are corrected by Laplace, the resulting probabilities will be 2/3 and 1/3, rather than 1/1 and 0/1, for binary classification tasks, so the split will be far from perfect according to the metrics.
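The numbers in the explanation can be reproduced with a short sketch. It assumes a binary classification task and a child node holding a single instance; the helper names `entropy`, `gini` and `class_probs` are introduced here for illustration only.

```python
from math import log2


def entropy(probs):
    """Entropy in bits; terms with zero probability contribute nothing."""
    return sum(-p * log2(p) for p in probs if p > 0)


def gini(probs):
    """Gini impurity of a class distribution."""
    return 1 - sum(p * p for p in probs)


def class_probs(counts, laplace=False):
    """Relative frequencies, optionally with Laplace correction."""
    n, k = sum(counts), len(counts)
    if laplace:
        return [(c + 1) / (n + k) for c in counts]
    return [c / n for c in counts]


# A child node produced by splitting on the unique identifier:
# one instance of one class, none of the other.
child_counts = [1, 0]

uncorrected = class_probs(child_counts)
corrected = class_probs(child_counts, laplace=True)

print(uncorrected, entropy(uncorrected), gini(uncorrected))
print(corrected, entropy(corrected), gini(corrected))
```

Without correction, every child is perfectly pure under both metrics. With Laplace correction, each child has a Gini impurity of 4/9 and an entropy of roughly 0.92 bits, close to the maximum of 1 bit for two classes.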

## Question 4

Assume that missing values are handled during generation and application of trees with binary splits.

Which of the following statements are correct?

A: For categorical features, missing values must not be treated like non-missing values

B: For numerical features, missing values must not be treated like non-missing values

C: For test instances, we need to impute values before applying the tree

D: If not imputing values before application, a test instance may follow a path for which the conditions are mutually exclusive

Correct answer: B

Explanation: A missing value may in principle be treated as a unique categorical value, which keeps the (in)equality test well-defined. However, the corresponding threshold test for numerical values cannot be defined for missing values in a reasonable way. Test instances with missing values can be distributed over multiple nodes, and the (multiple) leaf node predictions can then be aggregated, so imputation is not required. The only way to obtain a path with mutually exclusive conditions would be to allow the tree to be grown from empty partitions.
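Distributing a test instance over multiple nodes can be sketched as follows. This is one possible scheme, under the assumption that each internal node stores the fraction of training instances that went left (`left_fraction`); the tree, its feature names and the function `predict_proba` are hypothetical and hand-built for the example.

```python
# Hypothetical hand-built tree. Internal nodes test feature <= threshold;
# leaves store class probabilities.
toy_tree = {
    "feature": "age", "threshold": 30.0, "left_fraction": 0.4,
    "left": {"probs": {"yes": 0.9, "no": 0.1}},
    "right": {
        "feature": "income", "threshold": 50.0, "left_fraction": 0.5,
        "left": {"probs": {"yes": 0.2, "no": 0.8}},
        "right": {"probs": {"yes": 0.6, "no": 0.4}},
    },
}


def predict_proba(node, instance, weight=1.0):
    """Aggregate weighted leaf predictions, splitting the weight on missing values."""
    if "probs" in node:
        return {label: weight * p for label, p in node["probs"].items()}
    value = instance.get(node["feature"])
    if value is None:
        # Missing value: follow both branches, weighted by training fractions.
        branches = [(node["left"], node["left_fraction"]),
                    (node["right"], 1 - node["left_fraction"])]
    elif value <= node["threshold"]:
        branches = [(node["left"], 1.0)]
    else:
        branches = [(node["right"], 1.0)]
    total = {}
    for child, share in branches:
        for label, p in predict_proba(child, instance, weight * share).items():
            total[label] = total.get(label, 0.0) + p
    return total


# The value for "age" is missing, so both branches of the root are followed.
result = predict_proba(toy_tree, {"income": 80.0})
print(result)
```

The instance reaches two leaves with weights 0.4 and 0.6, and the aggregated prediction is still a proper probability distribution, so no imputation is needed before applying the tree.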

## Question 5

Which of the following statements are correct?

A: A strategy is needed to handle non-covered instances for decision lists

B: A strategy is needed to handle conflicting rules for decision lists

C: Rule sets are more compact than decision lists

D: A decision tree can be directly converted to both a rule set and a decision list

Correct answer: D

Explanation: Only the first matching rule in a decision list will be applied, so there are no conflicts to resolve, and the last (default) rule will cover any instances not covered by previous rules. Rule sets contain rules for all classes and are hence typically less compact than decision lists. Each path in a decision tree defines a rule, and all the paths can be put in a set of (non-overlapping) rules. We could equally well form a decision list by replacing all rules predicting a specific class with a default rule, which is placed at the end of the decision list.
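The conversion described in the explanation can be sketched in a few lines. The tree, its feature names and the helpers `tree_to_rules` and `rules_to_decision_list` are hypothetical; the tree is a hand-built nested dictionary with binary equality tests.

```python
# Hypothetical tree: internal nodes test feature == value.
toy_tree = {
    "feature": "outlook", "value": "sunny",
    "yes": {"label": "stay"},
    "no": {
        "feature": "windy", "value": "true",
        "yes": {"label": "stay"},
        "no": {"label": "play"},
    },
}


def tree_to_rules(node, conditions=()):
    """Return one (conditions, label) rule per root-to-leaf path."""
    if "label" in node:
        return [(conditions, node["label"])]
    feature, value = node["feature"], node["value"]
    return (
        tree_to_rules(node["yes"], conditions + (f"{feature} == {value}",))
        + tree_to_rules(node["no"], conditions + (f"{feature} != {value}",))
    )


def rules_to_decision_list(rules, default_label):
    """Drop all rules for one class and append a default rule for it."""
    kept = [rule for rule in rules if rule[1] != default_label]
    return kept + [((), default_label)]


rule_set = tree_to_rules(toy_tree)
decision_list = rules_to_decision_list(rule_set, default_label="stay")

for conditions, label in rule_set:
    print("IF", " AND ".join(conditions), "THEN", label)
print(decision_list)
```

Because the rules obtained from the tree paths do not overlap, their order in the decision list does not matter; only the default rule has to come last. In this example the rule set has three rules, while the decision list needs only two.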

## Question 6

```python
from sklearn import tree

# Which of the following parameter settings may lead to a larger number of nodes
# in the resulting tree compared to the default?

# A
dt = tree.DecisionTreeClassifier(max_depth=3)
# default=None -> np.inf

# B
dt = tree.DecisionTreeClassifier(min_samples_split=3)
# default=2

# C
dt = tree.DecisionTreeClassifier(min_samples_leaf=3)
# default=1

# D
dt = tree.DecisionTreeClassifier(max_features="log2")
# default=None -> n_features
```

Correct answers: C, D

Explanation: The first two options restrict how far the tree may be grown, in two different ways, and hence cannot lead to an increased number of nodes (decisions regarding splits above the nodes that meet the termination conditions are not affected). The last two options restrict which splits may be chosen, so splits that would result in a more compact tree may be left out.
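The node counts for the four settings can be inspected empirically. The sketch below is only an illustration on one dataset: the choice of the Iris data and of `random_state=0` are assumptions made here, and the counts on a single dataset do not prove the general statement, since whether C or D actually yields a larger tree depends on the data.

```python
from sklearn import tree
from sklearn.datasets import load_iris

# Assumption: the Iris dataset is used purely as a convenient example.
X, y = load_iris(return_X_y=True)

settings = {
    "default": {},
    "A": {"max_depth": 3},
    "B": {"min_samples_split": 3},
    "C": {"min_samples_leaf": 3},
    "D": {"max_features": "log2"},
}

node_counts = {}
for name, params in settings.items():
    # A fixed random_state makes the runs repeatable.
    dt = tree.DecisionTreeClassifier(random_state=0, **params).fit(X, y)
    # tree_.node_count is the total number of nodes (internal nodes and leaves).
    node_counts[name] = dt.tree_.node_count

print(node_counts)
```

Comparing each entry with the `default` count shows which settings produced a larger or smaller tree on this particular dataset.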