Skip to content

[Model] ShortestCommonSuperstring #413

@isPANN

Description

@isPANN

Important

Build as optimization, not decision. Value = Min<usize>.
Objective: Minimize the length |w| of the superstring containing all input strings as contiguous substrings.
Do not add a bound/threshold field — let the solver find the optimum directly. See #765.

Motivation

SHORTEST COMMON SUPERSTRING (P157) from Garey & Johnson, A4 SR9. A fundamental NP-complete problem in string algorithms and bioinformatics. Given a set of strings, find the shortest string containing each input string as a contiguous substring. This problem is central to genome assembly (reconstructing a genome from short sequencing reads), data compression, and database optimization. Different from Shortest Common Supersequence (P156), which requires subsequence containment (non-contiguous). Proved NP-complete by Maier and Storer (1977) via reduction from VERTEX COVER on cubic graphs, even for binary alphabet. The problem is APX-hard with the best known approximation factor of 2 11/23 ≈ 2.478 (Mucha, 2013).

Associated rules:

  • R103: MinimumVertexCover (cubic graphs) → ShortestCommonSuperstring (as target)

Definition

Name: ShortestCommonSuperstring
Canonical name: SHORTEST COMMON SUPERSTRING
Reference: Garey & Johnson, Computers and Intractability, A4 SR9

Mathematical definition:

INSTANCE: Finite alphabet Σ, finite set R of strings from Σ*.
OBJECTIVE: Find a string w ∈ Σ* of minimum length |w| such that each x ∈ R is a contiguous substring of w, i.e., w = w_0 x w_1 where w_0, w_1 ∈ Σ* (x appears contiguously in w).

Variables

  • Count: max_length variables (one per position of a fixed-length output buffer), where max_length = Σ_{x ∈ R} |x| is the input-derived worst-case bound (a superstring with zero overlap is at most this long).
  • Per-variable domain: {0, 1, ..., alphabet_size} (i.e. alphabet_size + 1 values). Symbols 0..alphabet_size - 1 index into the alphabet Σ; the extra symbol alphabet_size is a contiguous trailing padding/end marker, mirroring the representation used by ShortestCommonSupersequence (src/models/misc/shortest_common_supersequence.rs:42-47).
  • Meaning: Variable i encodes the symbol at position i of the candidate superstring. The effective superstring is the prefix before the first padding symbol; its length is the objective being minimized. The padding column must be contiguous at the end, and every x ∈ R must appear as a contiguous substring of the prefix. Equivalently, one can model this as choosing a permutation of the strings and their overlap amounts, analogous to the asymmetric TSP on the overlap graph.

Schema (data type)

Type name: ShortestCommonSuperstring
Variants: none

Field Type Description
alphabet_size usize Size of the alphabet |Σ|
strings Vec<Vec<usize>> Set R of input strings, each encoded as a vector of alphabet indices in 0..alphabet_size
max_length usize Maximum possible superstring length (sum of all input string lengths). Derived automatically inside new() as `Σ_{x ∈ R}

Notes:

  • This is an optimization problem: Value = Min<usize> — minimize the length of the superstring.
  • max_length is computed from strings inside new() (mirroring ShortestCommonSupersequence::new at src/models/misc/shortest_common_supersequence.rs:78-90), so users construct an instance with just alphabet_size and strings. Exposing it as a stored field keeps dims() input-derivable and aligns the schema with the sibling problem.
  • A string x is a substring of w if x appears contiguously in w (stricter than subsequence).
  • Without loss of generality, one may assume no string in R is a substring of another (otherwise remove it).
  • Can be modeled as an asymmetric TSP on the "overlap graph" of the strings.

Complexity

  • Best known exact algorithm: O(n^2 · 2^n) via Bellman-Held-Karp style DP on the overlap graph, where n = |R|. The problem is equivalent to finding a minimum-weight Hamiltonian path in the overlap graph (asymmetric TSP variant). For |R| = 2, solvable in O(|x_1| + |x_2|) time.
  • declare_variants! complexity string: "num_strings ^ 2 * 2 ^ num_strings"
  • Approximation:
    • Best known approximation ratio: 2 11/23 ≈ 2.478 (Mucha, 2013, "Lyndon Words and Short Superstrings", SODA). Subsequently improved to ≈2.466 by Englert, Matsakis, and Veselý (ISAAC 2023).
    • Greedy algorithm (repeatedly merge the pair with maximum overlap): conjectured 2-approximate, proved 4-approximate (Blum et al., 1994), improved to 3.5-approximate (Kaplan et al., 2005).
    • APX-hard: no PTAS unless P = NP.
  • NP-completeness: NP-complete (Maier and Storer, 1977; Gallant, Maier, and Storer, 1980). Remains NP-complete even if |Σ| = 2, or if all strings have |x| ≤ 8 with no repeated symbols.
  • Polynomial cases: Solvable in polynomial time if all strings have length ≤ 2.
  • References:
    • David Maier and James A. Storer (1977). "A note on the complexity of the superstring problem". Technical Report 233, Princeton University.
    • John Gallant, David Maier, and James A. Storer (1980). "On finding minimal length superstrings". J. Comput. Syst. Sci. 20(1):50-58.
    • Avrim Blum, Tao Jiang, Ming Li, John Tromp, and Mihalis Yannakakis (1994). "Linear approximation of shortest superstrings". JACM 41(4):630-647.
    • Marcin Mucha (2013). "Lyndon Words and Short Superstrings". SODA 2013.

Extra Remark

Full book text:

INSTANCE: Finite alphabet Σ, finite set R of strings from Σ*, and a positive integer K.
QUESTION: Is there a string w ∈ Σ* with |w| ≤ K such that each string x ∈ R is a substring of w, i.e., w = w0xw1 where each wi ∈ Σ*?
Reference: [Maier and Storer, 1977]. Transformation from VERTEX COVER for cubic graphs.
Comment: Remains NP-complete even if |Σ| = 2 or if all x ∈ R have |x| ≤ 8 and contain no repeated symbols. Solvable in polynomial time if all x ∈ R have |x| ≤ 2.

How to solve

  • It can be solved by (existing) bruteforce — enumerate candidate superstrings w ∈ Σ* in increasing length order and check if each x ∈ R appears as a contiguous substring.
  • It can be solved by reducing to integer programming — model as asymmetric TSP on the overlap graph with ordering constraints.
  • Other: Bellman-Held-Karp DP in O(n^2 · 2^n) via overlap graph / asymmetric TSP formulation. Greedy overlap algorithm (practical, 4-approximate). In bioinformatics, solved heuristically via de Bruijn graphs or overlap-layout-consensus pipelines.

Example Instance

Instance 1 (ternary alphabet, equal-length strings):

  • Alphabet: Σ = {a, b, c} (|Σ| = 3)
  • Strings: R = {"abc", "bca", "cab", "bcc", "cca", "aab"}
  • max_length = 3+3+3+3+3+3 = 18 (derived inside new())
  • No string is a substring of another ✓
  • Optimal overlap chain: aab → abc → bca → cab → bcc → cca
    • Overlaps: "aab"→"abc" (2: "ab"), "abc"→"bca" (2: "bc"), "bca"→"cab" (2: "ca"), "cab"→"bcc" (1: "b"), "bcc"→"cca" (2: "cc")
    • Total savings: 2+2+2+1+2 = 9. Superstring length: 6·3 − 9 = 9
    • w = "aabcabcca"
  • Verification: "aab" at pos 0 ✓, "abc" at pos 1 ✓, "bca" at pos 2 ✓, "cab" at pos 3 ✓, "bcc" at pos 5 ✓, "cca" at pos 6 ✓

Instance 2 (binary alphabet, equal-length strings):

  • Alphabet: Σ = {0, 1} (|Σ| = 2)
  • Strings: R = {"001", "011", "110", "100", "010", "101"}
  • max_length = 3+3+3+3+3+3 = 18 (derived inside new())
  • No string is a substring of another ✓ (all binary strings of length 3 except "000" and "111")
  • Optimal overlap chain: 001 → 011 → 110 → 101 → 010 → 100
    • Overlaps: "001"→"011" (2: "01"), "011"→"110" (2: "11"), "110"→"101" (2: "10"), "101"→"010" (2: "01"), "010"→"100" (2: "10")
    • Total savings: 2+2+2+2+2 = 10. Superstring length: 6·3 − 10 = 8
    • w = "00110100"
  • Verification: "001" at pos 0 ✓, "011" at pos 1 ✓, "110" at pos 2 ✓, "101" at pos 3 ✓, "010" at pos 4 ✓, "100" at pos 5 ✓

Instance 3 (ternary alphabet, mixed string lengths):

  • Alphabet: Σ = {a, b, c} (|Σ| = 3)
  • Strings: R = {"abc", "cab", "ba", "bb"} — lengths 3, 3, 2, 2 (deliberately mixed)
  • max_length = 3+3+2+2 = 10 (derived inside new())
  • No string is a substring of another ✓ ("ba" ⊄ "abc" / "cab"; "bb" ⊄ "abc" / "cab"; the two length-3 strings differ)
  • Optimal length: 7, witness w = "abcabba"
    • Decomposition: "abc"→"cab" overlap 2 ("ab" wait — actually cab starts at pos 2 of abcabba); the merge ordering is abc + cab with overlap 1 on the "c" (giving "abcab" of length 5), then append "ba" (already present at pos 5) and "bb" (already present at pos 4). Two of the four strings vanish into already-built positions, illustrating non-Hamiltonian-path behavior where shorter strings are absorbed by the longer chain.
  • Verification (brute-force confirmed by exhaustive search over Σ^L for L = 5, 6, 7):
    • "abc" at pos 0 ✓
    • "cab" at pos 2 ✓
    • "ba" at pos 5 ✓
    • "bb" at pos 4 ✓
  • Why this is non-isomorphic to Instance 1/2: different cardinality (4 strings vs 6), different string-length distribution (mixed 2/3 vs uniform 3), and the absorbed-substring phenomenon exercises a branch where the overlap-graph Hamiltonian-path framing has to not visit every string as a distinct node — useful for round-trip closed-loop testing.
  • Search space (with padding symbol): (alphabet_size + 1)^max_length = 4^10 ≈ 1.05M configurations — easily exhaustible by BruteForce.

Expected Outcome

  • Instance 1: optimal value = 9. Witness superstring w = "aabcabcca" (length 9) contains all 6 input strings as contiguous substrings.
  • Instance 2: optimal value = 8. Witness superstring w = "00110100" (length 8) contains all 6 input strings as contiguous substrings.
  • Instance 3: optimal value = 7. Witness superstring w = "abcabba" (length 7) contains all 4 input strings as contiguous substrings (with "ba" and "bb" absorbed inside the "abc"+"cab" overlap structure).

Metadata

Metadata

Assignees

No one assigned

    Labels

    GoodAn issue passed all checks.modelA model problem to be implemented.

    Type

    No type
    No fields configured for issues without a type.

    Projects

    Status

    Done

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions