Skip to content

[Model] MinimumDecisionTree #932

@isPANN

Description

@isPANN

Motivation

MINIMUM DECISION TREE (MS15) from Garey & Johnson, A1.2. Given a set of objects distinguished by binary tests, find a decision tree that identifies each object with minimum total external path length (the weighted sum of test depths for identifying each object). NP-hard even when each test has at most 3 objects with value 1 (Hyafil and Rivest 1976). The problem captures optimal test sequencing for identification — applications in medical diagnosis, fault isolation, species classification, and information-theoretic search.

Definition

Name: MinimumDecisionTree
Reference: Garey & Johnson, Computers and Intractability, A1.2 MS15; Hyafil and Rivest 1976

Mathematical definition:

INSTANCE: Set S of objects, collection T = {T_1, T_2, ..., T_m} of binary tests, where each test T_j is a function from S into {0,1}, such that for every pair of objects s, s' ∈ S there exists a test T_j ∈ T with T_j(s) ≠ T_j(s').
OBJECTIVE: Find a decision tree using tests from T that identifies each member of S and minimizes the total external path length.

A decision tree is a binary tree where each internal node is labeled by a test T_j. The left child corresponds to outcome 0 and the right child to outcome 1. Each leaf is labeled by an object. The external path length of a leaf is the number of edges from the root to that leaf. The total external path length is the sum over all leaves.

Variables

  • Count: The tree structure is the decision variable. Equivalently, one can encode the tree as a sequence of test selections at each internal node.
  • Per-variable domain: Each internal node chooses from {T_1, ..., T_m}.
  • Meaning: The assignment of tests to internal nodes of a binary tree. The tree must correctly distinguish all objects, and the objective is to minimize the total external path length.

Schema (data type)

Type name: MinimumDecisionTree
Variants: none (unique input structure)

Field Type Description
test_matrix Vec<Vec<bool>> Binary matrix: test_matrix[j][i] = true iff object i passes test j
num_objects usize Number of objects
num_tests usize Number of tests

Complexity

  • Best known exact algorithm: O*(num_tests^num_objects) by enumerating all possible decision trees. NP-hard even when each test has at most 3 objects with value 1 (Hyafil and Rivest 1976; transformation from EXACT COVER BY 3-SETS).
  • Complexity string: "num_tests^num_objects"

Extra Remark

Full book text:

INSTANCE: Set S of "objects," collection T = {T_1, T_2, ..., T_m} of "binary tests," where each test T_j is a function from S into {0,1}, such that for every pair of objects s, s' ∈ S there exists a test T_j ∈ T with T_j(s) ≠ T_j(s'), positive integer K.
QUESTION: Is there a decision tree for T that "identifies" each member of S and has total external path length of at most K?

Reference: [Hyafil and Rivest, 1976]. Transformation from EXACT COVER BY 3-SETS.
Comment: NP-complete even if each test T_j has at most 3 objects s ∈ S for which T_j(s) = 1.

Note: The original G&J formulation is a decision problem with threshold K. We use the equivalent optimization formulation: minimize total external path length.

How to solve

  • It can be solved by (existing) bruteforce. (Enumerate all binary trees with test labels at internal nodes; prune those that don't identify all objects; return minimum total path length.)
  • It can be solved by reducing to integer programming.
  • Other:

Reduction Rule Crossref

(Planned: ExactCoverBy3Sets → MinimumDecisionTree, following Hyafil & Rivest 1976)

Example Instance

Input:
Objects: S = {o0, o1, o2, o3} (4 objects).
Tests:

       o0  o1  o2  o3
T0:     1   1   0   0
T1:     1   0   0   0
T2:     0   1   0   1

Each pair of objects is distinguished by at least one test:
(o0,o1): T1,T2; (o0,o2): T0,T1; (o0,o3): T0,T1,T2; (o1,o2): T0,T2; (o1,o3): T0; (o2,o3): T2.

Optimal decision tree (total path length = 8):

        T0
       /  \
    (0)    (1)
    T2      T1
   /  \    /  \
  o2   o3  o1  o0
  • Root: test T0. Left (T0=0): objects {o2, o3}. Right (T0=1): objects {o0, o1}.
  • Left child: test T2. T2(o2)=0 → leaf o2 (depth 2). T2(o3)=1 → leaf o3 (depth 2).
  • Right child: test T1. T1(o1)=0 → leaf o1 (depth 2). T1(o0)=1 → leaf o0 (depth 2).
  • Total path length = 2+2+2+2 = 8. ✓

Suboptimal decision tree (total path length = 9):

        T1
       /    \
    (0)      (1)
     T0      o0
    /  \
  T2    o1
 /  \
o2   o3
  • Root: test T1. T1(o0)=1 → leaf o0 (depth 1). T1=0: objects {o1, o2, o3}.
  • T0 on {o1,o2,o3}: T0(o1)=1 → leaf o1 (depth 2). T0=0: objects {o2, o3}.
  • T2 on {o2,o3}: T2(o2)=0 → leaf o2 (depth 3). T2(o3)=1 → leaf o3 (depth 3).
  • Total path length = 1+2+3+3 = 9.

Why the balanced tree wins: Starting with T1 (which isolates only o0) gives depth 1 for o0 but pushes three objects deeper. The balanced split by T0 (2-2) yields uniform depth 2 for all objects, achieving the minimum total path length.

Expected Outcome

Optimal value: Min(8) — the minimum total external path length is 8. There are 6 valid decision trees with 2 distinct path lengths: 4 trees with PL=8 and 2 trees with PL=9.

Metadata

Metadata

Assignees

No one assigned

    Labels

    GoodAn issue passed all checks.modelA model problem to be implemented.

    Type

    No type

    Projects

    Status

    Ready

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions