Build as optimization, not decision. Value = Min<usize>. Objective: Minimize the length |w| of the superstring containing all input strings as contiguous substrings.
Do not add a bound/threshold field — let the solver find the optimum directly. See #765.
Motivation
SHORTEST COMMON SUPERSTRING (P157) from Garey & Johnson, A4 SR9. A fundamental NP-complete problem in string algorithms and bioinformatics: given a set of strings, find the shortest string containing each input string as a contiguous substring. The problem is central to genome assembly (reconstructing a genome from short sequencing reads), data compression, and database optimization. It differs from Shortest Common Supersequence (P156), which only requires subsequence (non-contiguous) containment. Proved NP-complete by Maier and Storer (1977) via reduction from VERTEX COVER on cubic graphs; it remains NP-complete for a binary alphabet. The problem is APX-hard; Mucha (2013) gave an approximation factor of 2 11/23 ≈ 2.478, later improved to ≈ 2.466 (Englert, Matsakis, and Veselý, 2023).
Definition
Name: ShortestCommonSuperstring
Canonical name: SHORTEST COMMON SUPERSTRING
Reference: Garey & Johnson, Computers and Intractability, A4 SR9
Mathematical definition:
INSTANCE: Finite alphabet Σ, finite set R of strings from Σ*.
OBJECTIVE: Find a string w ∈ Σ* of minimum length |w| such that each x ∈ R is a contiguous substring of w, i.e., w = w_0 x w_1 where w_0, w_1 ∈ Σ* (x appears contiguously in w).
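The containment condition in the definition can be sketched as a small feasibility check. This is only an illustration: the function names `is_substring` and `is_common_superstring` are hypothetical and not taken from the repository; strings are encoded as vectors of alphabet indices, as in the schema.

```rust
/// Returns true if `x` occurs as a contiguous block inside `w`.
fn is_substring(x: &[usize], w: &[usize]) -> bool {
    // The empty string is a substring of every string; checking it first
    // also avoids calling `windows(0)`, which panics.
    x.is_empty() || w.windows(x.len()).any(|win| win == x)
}

/// Returns true if every string in `r` is a contiguous substring of `w`,
/// i.e. `w` is a feasible (not necessarily shortest) common superstring.
fn is_common_superstring(w: &[usize], r: &[Vec<usize>]) -> bool {
    r.iter().all(|x| is_substring(x, w))
}
```

With a = 0, b = 1, c = 2, the call `is_common_superstring(&[0, 1, 2, 0, 1, 1, 0], &r)` checks the string "abcabba" against a set `r`.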
Variables
Count: max_length variables (one per position of a fixed-length output buffer), where max_length = Σ_{x ∈ R} |x| is the input-derived worst-case bound (a superstring with zero overlap is at most this long).
Per-variable domain: {0, 1, ..., alphabet_size} (i.e. alphabet_size + 1 values). Symbols 0..alphabet_size - 1 index into the alphabet Σ; the extra symbol alphabet_size is a padding/end marker that may only appear as a contiguous trailing block, mirroring the representation used by ShortestCommonSupersequence (src/models/misc/shortest_common_supersequence.rs:42-47).
Meaning: Variable i encodes the symbol at position i of the candidate superstring. The effective superstring is the prefix before the first padding symbol; its length is the objective being minimized. A configuration is valid only if all padding symbols form a contiguous suffix of the buffer and every x ∈ R appears as a contiguous substring of the prefix. Equivalently, one can model this as choosing a permutation of the strings and their overlap amounts, analogous to the asymmetric TSP on the overlap graph.
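A minimal sketch of how one configuration of the buffer could be evaluated under the encoding described above. The function name `evaluate` and the `Option<usize>` return type are assumptions for illustration; the actual model would presumably return its own `Min<usize>` value type.

```rust
/// Evaluates one configuration of the fixed-length buffer.
/// Returns `Some(length)` for a valid superstring and `None` otherwise.
/// (Hypothetical helper; not the repository's evaluation routine.)
fn evaluate(config: &[usize], alphabet_size: usize, strings: &[Vec<usize>]) -> Option<usize> {
    let pad = alphabet_size; // the extra symbol used as padding/end marker

    // Reject symbols outside the domain {0, ..., alphabet_size}.
    if config.iter().any(|&s| s > pad) {
        return None;
    }

    // Effective length = number of symbols before the first padding symbol.
    let len = config.iter().position(|&s| s == pad).unwrap_or(config.len());

    // Padding must be a contiguous suffix: only padding may follow position `len`.
    if config[len..].iter().any(|&s| s != pad) {
        return None;
    }

    // Every input string must occur contiguously in the effective prefix.
    let prefix = &config[..len];
    let covers = strings.iter().all(|x| {
        x.is_empty() || prefix.windows(x.len()).any(|win| win == x.as_slice())
    });

    if covers { Some(len) } else { None }
}
```

Returning `None` for interior padding keeps each effective superstring tied to exactly one buffer configuration.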
Schema (data type)
Type name: ShortestCommonSuperstring
Variants: none
| Field | Type | Description |
|---|---|---|
| alphabet_size | `usize` | Size of the alphabet \|Σ\| |
| strings | `Vec<Vec<usize>>` | Set R of input strings, each encoded as a vector of alphabet indices in 0..alphabet_size |
| max_length | `usize` | Maximum possible superstring length (sum of all input string lengths). Derived automatically inside new() as `Σ_{x ∈ R} \|x\|` |
Notes:
This is an optimization problem: Value = Min<usize> — minimize the length of the superstring.
max_length is computed from strings inside new() (mirroring ShortestCommonSupersequence::new at src/models/misc/shortest_common_supersequence.rs:78-90), so users construct an instance with just alphabet_size and strings. Exposing it as a stored field keeps dims() input-derivable and aligns the schema with the sibling problem.
A string x is a substring of w if x appears contiguously in w (stricter than subsequence).
Without loss of generality, one may assume no string in R is a substring of another (otherwise remove it).
Can be modeled as an asymmetric TSP on the "overlap graph" of the strings.
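The schema and the first two notes could look roughly like the following. This is a hedged sketch, not the repository's definition: the accessor names, the validation in `new()`, and the `dims()` signature (one domain size per variable) are assumptions.

```rust
/// Hypothetical sketch of the proposed schema.
#[derive(Debug, Clone)]
pub struct ShortestCommonSuperstring {
    alphabet_size: usize,
    strings: Vec<Vec<usize>>,
    max_length: usize,
}

impl ShortestCommonSuperstring {
    /// Users pass only `alphabet_size` and `strings`; `max_length` is derived.
    pub fn new(alphabet_size: usize, strings: Vec<Vec<usize>>) -> Self {
        // Assumed validation: every symbol must be an index into the alphabet.
        assert!(
            strings.iter().flatten().all(|&s| s < alphabet_size),
            "symbols must lie in 0..alphabet_size"
        );
        // Worst case: concatenate all strings with zero overlap.
        let max_length: usize = strings.iter().map(Vec::len).sum();
        Self { alphabet_size, strings, max_length }
    }

    pub fn strings(&self) -> &[Vec<usize>] {
        &self.strings
    }

    pub fn max_length(&self) -> usize {
        self.max_length
    }

    /// Assumed shape of `dims()`: `max_length` variables,
    /// each with `alphabet_size + 1` values (alphabet plus padding).
    pub fn dims(&self) -> Vec<usize> {
        vec![self.alphabet_size + 1; self.max_length]
    }
}
```

Because `max_length` is stored rather than recomputed, `dims()` here depends only on two fields.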
Complexity
Best known exact algorithm: O(n^2 · 2^n) via Bellman-Held-Karp style DP on the overlap graph, where n = |R|. The problem is equivalent to finding a minimum-weight Hamiltonian path in the overlap graph (asymmetric TSP variant). For |R| = 2, solvable in O(|x_1| + |x_2|) time.
Best known approximation ratio: 2 11/23 ≈ 2.478 (Mucha, 2013, "Lyndon Words and Short Superstrings", SODA). Subsequently improved to ≈2.466 by Englert, Matsakis, and Veselý (ISAAC 2023).
Greedy algorithm (repeatedly merge the pair with maximum overlap): conjectured 2-approximate, proved 4-approximate (Blum et al., 1994), improved to 3.5-approximate (Kaplan and Shafrir, 2005).
APX-hard: no PTAS unless P = NP.
NP-completeness: NP-complete (Maier and Storer, 1977; Gallant, Maier, and Storer, 1980). Remains NP-complete even if |Σ| = 2, or if all strings have |x| ≤ 8 with no repeated symbols.
Polynomial cases: Solvable in polynomial time if all strings have length ≤ 2.
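The exact algorithm mentioned above can be sketched as a subset DP that maximizes total overlap along an ordering of the strings. This is an illustrative sketch under two assumptions: no input string is a substring of another (the WLOG condition from the notes), and n is small enough for a 2^n table. The names `overlap` and `shortest_superstring_len` are hypothetical.

```rust
/// Length of the longest suffix of `a` that is also a prefix of `b`.
fn overlap(a: &[usize], b: &[usize]) -> usize {
    let max = a.len().min(b.len());
    // k = 0 always matches, so the search never comes back empty.
    (0..=max).rev().find(|&k| a[a.len() - k..] == b[..k]).unwrap_or(0)
}

/// Minimum superstring length via Bellman-Held-Karp style DP over subsets.
/// Assumes no string in `strings` is a substring of another.
fn shortest_superstring_len(strings: &[Vec<usize>]) -> usize {
    let n = strings.len();
    if n == 0 {
        return 0;
    }
    // Overlap graph: ov[i][j] = overlap when string j follows string i.
    let ov: Vec<Vec<usize>> = strings
        .iter()
        .map(|a| strings.iter().map(|b| overlap(a, b)).collect())
        .collect();

    // best[mask][last] = maximum total overlap over orderings of the strings
    // in `mask` that end with `last` (None if `last` is not in `mask`).
    let mut best = vec![vec![None::<usize>; n]; 1usize << n];
    for i in 0..n {
        best[1usize << i][i] = Some(0);
    }
    for mask in 1..(1usize << n) {
        for last in 0..n {
            let Some(cur) = best[mask][last] else { continue };
            for next in 0..n {
                if mask & (1usize << next) != 0 {
                    continue; // already placed
                }
                let cand = cur + ov[last][next];
                let slot = &mut best[mask | (1usize << next)][next];
                if slot.map_or(true, |old| cand > old) {
                    *slot = Some(cand);
                }
            }
        }
    }

    let total: usize = strings.iter().map(Vec::len).sum();
    let max_overlap = best[(1usize << n) - 1].iter().flatten().max().copied().unwrap_or(0);
    total - max_overlap
}
```

The table has 2^n · n states with n transitions each, which matches the O(n^2 · 2^n) bound once the pairwise overlaps are precomputed.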
References:
David Maier and James A. Storer (1977). "A note on the complexity of the superstring problem". Technical Report 233, Princeton University.
John Gallant, David Maier, and James A. Storer (1980). "On finding minimal length superstrings". J. Comput. Syst. Sci. 20(1):50-58.
Avrim Blum, Tao Jiang, Ming Li, John Tromp, and Mihalis Yannakakis (1994). "Linear approximation of shortest superstrings". JACM 41(4):630-647.
Marcin Mucha (2013). "Lyndon Words and Short Superstrings". SODA 2013.
Extra Remark
Full book text:
INSTANCE: Finite alphabet Σ, finite set R of strings from Σ*, and a positive integer K.
QUESTION: Is there a string w ∈ Σ* with |w| ≤ K such that each string x ∈ R is a substring of w, i.e., w = w0xw1 where each wi ∈ Σ*?
Reference: [Maier and Storer, 1977]. Transformation from VERTEX COVER for cubic graphs.
Comment: Remains NP-complete even if |Σ| = 2 or if all x ∈ R have |x| ≤ 8 and contain no repeated symbols. Solvable in polynomial time if all x ∈ R have |x| ≤ 2.
How to solve
It can be solved by (existing) bruteforce — enumerate candidate superstrings w ∈ Σ* in increasing length order and check if each x ∈ R appears as a contiguous substring.
It can be solved by reducing to integer programming — model as asymmetric TSP on the overlap graph with ordering constraints.
Other: Bellman-Held-Karp DP in O(n^2 · 2^n) via overlap graph / asymmetric TSP formulation. Greedy overlap algorithm (practical; proved 3.5-approximate, conjectured 2-approximate). In bioinformatics, solved heuristically via de Bruijn graphs or overlap-layout-consensus pipelines.
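For comparison with the exact methods, the greedy overlap heuristic can be sketched as follows. This is a simple quadratic-scan illustration rather than an optimized implementation; the names `overlap` and `greedy_superstring` are hypothetical, and ties are broken by scan order, which is an arbitrary choice.

```rust
/// Length of the longest suffix of `a` that is also a prefix of `b`.
fn overlap(a: &[usize], b: &[usize]) -> usize {
    let max = a.len().min(b.len());
    (0..=max).rev().find(|&k| a[a.len() - k..] == b[..k]).unwrap_or(0)
}

/// Greedy merge: repeatedly merges the ordered pair with maximum overlap.
/// Returns a common superstring that is not necessarily shortest.
fn greedy_superstring(strings: &[Vec<usize>]) -> Vec<usize> {
    let mut pool: Vec<Vec<usize>> = strings.to_vec();
    while pool.len() > 1 {
        // (i, j, k): merge pool[i] followed by pool[j], overlapping k symbols.
        // Defaults to plain concatenation of the first two strings.
        let mut best = (0, 1, 0);
        for i in 0..pool.len() {
            for j in 0..pool.len() {
                if i != j {
                    let k = overlap(&pool[i], &pool[j]);
                    if k > best.2 {
                        best = (i, j, k);
                    }
                }
            }
        }
        let (i, j, k) = best;
        let mut merged = pool[i].clone();
        merged.extend_from_slice(&pool[j][k..]);

        // Remove the higher index first so the lower index stays valid.
        let (hi, lo) = if i > j { (i, j) } else { (j, i) };
        pool.swap_remove(hi);
        pool.swap_remove(lo);
        pool.push(merged);
    }
    pool.pop().unwrap_or_default()
}
```

Each merge produces a string that contains both of its inputs, so the result is always feasible; only its length can be suboptimal.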
Instance 3 (strings "abc", "cab", "ba", "bb"):
No string is a substring of another ✓ ("ba" ⊄ "abc" / "cab"; "bb" ⊄ "abc" / "cab"; the two length-3 strings differ)
Optimal length: 7, witness w = "abcabba"
Decomposition: merge "abc" + "cab" with overlap 1 on the "c" (giving "abcab" of length 5), then append "bb" with overlap 1 on the trailing "b" (giving "abcabb" of length 6), then append "ba" with overlap 1 on the trailing "b" (giving "abcabba" of length 7). The total overlap is 3, so the length is 3+3+2+2 − 3 = 7. In the witness, "cab" starts at pos 2, "bb" at pos 4, and "ba" at pos 5.
Verification (brute-force confirmed by exhaustive search over Σ^L for L = 5, 6, 7):
"abc" at pos 0 ✓
"cab" at pos 2 ✓
"ba" at pos 5 ✓
"bb" at pos 4 ✓
Why this is non-isomorphic to Instance 1/2: different cardinality (4 strings vs 6) and different string-length distribution (mixed 2/3 vs uniform 3). The two length-2 strings each contribute only one new symbol to the chain, so the optimum depends on overlaps between strings of different lengths, which is useful for round-trip closed-loop testing.
Search space (with padding symbol): (alphabet_size + 1)^max_length = 4^10 ≈ 1.05M configurations — easily exhaustible by BruteForce.
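The exhaustive check over Σ^L can be sketched as below. Note that this sketch enumerates candidate strings in increasing length (the variant described under "How to solve"), not the padded fixed-length buffer; the names `covers`, `next_word`, and `brute_force` are hypothetical and unrelated to the repository's BruteForce solver.

```rust
/// Returns true if every string in `strings` occurs contiguously in `w`.
fn covers(w: &[usize], strings: &[Vec<usize>]) -> bool {
    strings
        .iter()
        .all(|x| x.is_empty() || w.windows(x.len()).any(|win| win == x.as_slice()))
}

/// Advances `w` to the next word of the same length, like an odometer in
/// base `alphabet_size`. Returns false once all words have been visited.
fn next_word(w: &mut [usize], alphabet_size: usize) -> bool {
    for pos in (0..w.len()).rev() {
        w[pos] += 1;
        if w[pos] < alphabet_size {
            return true;
        }
        w[pos] = 0; // carry into the next position to the left
    }
    false
}

/// Enumerates Σ^L for L = 0, 1, ..., max_length and returns the first
/// common superstring found, which is therefore a shortest one.
/// Assumes alphabet_size >= 1 and all symbols lie in 0..alphabet_size.
fn brute_force(alphabet_size: usize, strings: &[Vec<usize>]) -> Option<Vec<usize>> {
    let max_length: usize = strings.iter().map(Vec::len).sum();
    for len in 0..=max_length {
        let mut w = vec![0usize; len];
        loop {
            if covers(&w, strings) {
                return Some(w);
            }
            if !next_word(&mut w, alphabet_size) {
                break; // no superstring of this length; try the next length
            }
        }
    }
    None
}
```

For Instance 3 this visits at most 3^0 + 3^1 + ... + 3^7 candidate strings, far fewer than the 4^10 padded configurations.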
Expected Outcome
Instance 1: optimal value = 9. Witness superstring w = "aabcabcca" (length 9) contains all 6 input strings as contiguous substrings.
Instance 2: optimal value = 8. Witness superstring w = "00110100" (length 8) contains all 6 input strings as contiguous substrings.
Instance 3: optimal value = 7. Witness superstring w = "abcabba" (length 7) contains all 4 input strings as contiguous substrings (with "bb" and "ba" each chained onto "abcab" with overlap 1).
declare_variants! complexity string: "num_strings ^ 2 * 2 ^ num_strings"
Example Instance
Instance 1 (ternary alphabet, equal-length strings):
max_length = 3+3+3+3+3+3 = 18 (derived inside new())
Instance 2 (binary alphabet, equal-length strings):
max_length = 3+3+3+3+3+3 = 18 (derived inside new())
Instance 3 (ternary alphabet, mixed string lengths):
max_length = 3+3+2+2 = 10 (derived inside new())