[Model] ShortestCommonSuperstring

> [!IMPORTANT]
> **Build as optimization, not decision.** `Value = Min<usize>`.
> **Objective:** Minimize the length |w| of the superstring containing all input strings as contiguous substrings.
> Do not add a `bound`/`threshold` field — let the solver find the optimum directly. See #765.

## Motivation

SHORTEST COMMON SUPERSTRING (P157) from Garey & Johnson, A4 SR9. A fundamental NP-complete problem in string algorithms and bioinformatics. Given a set of strings, find the shortest string containing each input string as a contiguous substring. This problem is central to genome assembly (reconstructing a genome from short sequencing reads), data compression, and database optimization. Different from Shortest Common Supersequence (P156), which requires subsequence containment (non-contiguous). Proved NP-complete by Maier and Storer (1977) via reduction from VERTEX COVER on cubic graphs, even for binary alphabet. The problem is APX-hard with the best known approximation factor of 2 11/23 ≈ 2.478 (Mucha, 2013).

**Associated rules:**
- R103: MinimumVertexCover (cubic graphs) → ShortestCommonSuperstring (as target)

## Definition

**Name:** `ShortestCommonSuperstring`
**Canonical name:** SHORTEST COMMON SUPERSTRING
**Reference:** Garey & Johnson, *Computers and Intractability*, A4 SR9

**Mathematical definition:**

INSTANCE: Finite alphabet Σ, finite set R of strings from Σ*.
OBJECTIVE: Find a string w ∈ Σ* of minimum length |w| such that each x ∈ R is a contiguous substring of w, i.e., w = w_0 x w_1 where w_0, w_1 ∈ Σ* (x appears contiguously in w).

## Variables


- **Count:** `max_length` variables (one per position of a fixed-length output buffer), where `max_length = Σ_{x ∈ R} |x|` is the input-derived worst-case bound (a superstring with zero overlap is at most this long).
- **Per-variable domain:** `{0, 1, ..., alphabet_size}` (i.e. `alphabet_size + 1` values). Symbols `0..alphabet_size - 1` index into the alphabet Σ; the extra symbol `alphabet_size` is a contiguous trailing padding/end marker, mirroring the representation used by `ShortestCommonSupersequence` (`src/models/misc/shortest_common_supersequence.rs:42-47`).
- **Meaning:** Variable `i` encodes the symbol at position `i` of the candidate superstring. The effective superstring is the prefix before the first padding symbol; its length is the objective being minimized. The padding column must be contiguous at the end, and every `x ∈ R` must appear as a contiguous substring of the prefix. Equivalently, one can model this as choosing a permutation of the strings and their overlap amounts, analogous to the asymmetric TSP on the overlap graph.

## Schema (data type)


**Type name:** `ShortestCommonSuperstring`
**Variants:** none

| Field | Type | Description |
|-------|------|-------------|
| `alphabet_size` | `usize` | Size of the alphabet \|Σ\| |
| `strings` | `Vec<Vec<usize>>` | Set R of input strings, each encoded as a vector of alphabet indices in `0..alphabet_size` |
| `max_length` | `usize` | Maximum possible superstring length (sum of all input string lengths). Derived automatically inside `new()` as `Σ_{x ∈ R} |x|`; stored on the struct so it is part of the schema and reachable by `dims()`. |

**Notes:**
- This is an **optimization problem**: `Value = Min<usize>` — minimize the length of the superstring.
- `max_length` is **computed from `strings` inside `new()`** (mirroring `ShortestCommonSupersequence::new` at `src/models/misc/shortest_common_supersequence.rs:78-90`), so users construct an instance with just `alphabet_size` and `strings`. Exposing it as a stored field keeps `dims()` input-derivable and aligns the schema with the sibling problem.
- A string x is a substring of w if x appears contiguously in w (stricter than subsequence).
- Without loss of generality, one may assume no string in R is a substring of another (otherwise remove it).
- Can be modeled as an asymmetric TSP on the "overlap graph" of the strings.

## Complexity


- **Best known exact algorithm:** O(n^2 · 2^n) via Bellman-Held-Karp style DP on the overlap graph, where n = |R|. The problem is equivalent to finding a minimum-weight Hamiltonian path in the overlap graph (asymmetric TSP variant). For |R| = 2, solvable in O(|x_1| + |x_2|) time.
- **`declare_variants!` complexity string:** `"num_strings ^ 2 * 2 ^ num_strings"`
- **Approximation:**
  - Best known approximation ratio: 2 11/23 ≈ 2.478 (Mucha, 2013, "Lyndon Words and Short Superstrings", SODA). Subsequently improved to ≈2.466 by Englert, Matsakis, and Veselý (ISAAC 2023).
  - Greedy algorithm (repeatedly merge the pair with maximum overlap): conjectured 2-approximate, proved 4-approximate (Blum et al., 1994), improved to 3.5-approximate (Kaplan et al., 2005).
  - APX-hard: no PTAS unless P = NP.
- **NP-completeness:** NP-complete (Maier and Storer, 1977; Gallant, Maier, and Storer, 1980). Remains NP-complete even if |Σ| = 2, or if all strings have |x| ≤ 8 with no repeated symbols.
- **Polynomial cases:** Solvable in polynomial time if all strings have length ≤ 2.
- **References:**
  - David Maier and James A. Storer (1977). "A note on the complexity of the superstring problem". Technical Report 233, Princeton University.
  - John Gallant, David Maier, and James A. Storer (1980). "On finding minimal length superstrings". *J. Comput. Syst. Sci.* 20(1):50-58.
  - Avrim Blum, Tao Jiang, Ming Li, John Tromp, and Mihalis Yannakakis (1994). "Linear approximation of shortest superstrings". *JACM* 41(4):630-647.
  - Marcin Mucha (2013). "Lyndon Words and Short Superstrings". *SODA 2013*.

## Extra Remark

**Full book text:**

INSTANCE: Finite alphabet Σ, finite set R of strings from Σ*, and a positive integer K.
QUESTION: Is there a string w ∈ Σ* with |w| ≤ K such that each string x ∈ R is a substring of w, i.e., w = w0xw1 where each wi ∈ Σ*?
Reference: [Maier and Storer, 1977]. Transformation from VERTEX COVER for cubic graphs.
Comment: Remains NP-complete even if |Σ| = 2 or if all x ∈ R have |x| ≤ 8 and contain no repeated symbols. Solvable in polynomial time if all x ∈ R have |x| ≤ 2.

## How to solve

- [x] It can be solved by (existing) bruteforce — enumerate candidate superstrings w ∈ Σ* in increasing length order and check if each x ∈ R appears as a contiguous substring.
- [x] It can be solved by reducing to integer programming — model as asymmetric TSP on the overlap graph with ordering constraints.
- [x] Other: Bellman-Held-Karp DP in O(n^2 · 2^n) via overlap graph / asymmetric TSP formulation. Greedy overlap algorithm (practical, 4-approximate). In bioinformatics, solved heuristically via de Bruijn graphs or overlap-layout-consensus pipelines.

## Example Instance



**Instance 1 (ternary alphabet, equal-length strings):**
- Alphabet: Σ = {a, b, c} (|Σ| = 3)
- Strings: R = {"abc", "bca", "cab", "bcc", "cca", "aab"}
- `max_length` = 3+3+3+3+3+3 = 18 (derived inside `new()`)
- No string is a substring of another ✓
- Optimal overlap chain: aab → abc → bca → cab → bcc → cca
  - Overlaps: "aab"→"abc" (2: "ab"), "abc"→"bca" (2: "bc"), "bca"→"cab" (2: "ca"), "cab"→"bcc" (1: "b"), "bcc"→"cca" (2: "cc")
  - Total savings: 2+2+2+1+2 = 9. Superstring length: 6·3 − 9 = 9
  - w = "aabcabcca"
- Verification: "aab" at pos 0 ✓, "abc" at pos 1 ✓, "bca" at pos 2 ✓, "cab" at pos 3 ✓, "bcc" at pos 5 ✓, "cca" at pos 6 ✓

**Instance 2 (binary alphabet, equal-length strings):**
- Alphabet: Σ = {0, 1} (|Σ| = 2)
- Strings: R = {"001", "011", "110", "100", "010", "101"}
- `max_length` = 3+3+3+3+3+3 = 18 (derived inside `new()`)
- No string is a substring of another ✓ (all binary strings of length 3 except "000" and "111")
- Optimal overlap chain: 001 → 011 → 110 → 101 → 010 → 100
  - Overlaps: "001"→"011" (2: "01"), "011"→"110" (2: "11"), "110"→"101" (2: "10"), "101"→"010" (2: "01"), "010"→"100" (2: "10")
  - Total savings: 2+2+2+2+2 = 10. Superstring length: 6·3 − 10 = 8
  - w = "00110100"
- Verification: "001" at pos 0 ✓, "011" at pos 1 ✓, "110" at pos 2 ✓, "101" at pos 3 ✓, "010" at pos 4 ✓, "100" at pos 5 ✓

**Instance 3 (ternary alphabet, mixed string lengths):**
- Alphabet: Σ = {a, b, c} (|Σ| = 3)
- Strings: R = {"abc", "cab", "ba", "bb"} — lengths 3, 3, 2, 2 (deliberately mixed)
- `max_length` = 3+3+2+2 = 10 (derived inside `new()`)
- No string is a substring of another ✓ ("ba" ⊄ "abc" / "cab"; "bb" ⊄ "abc" / "cab"; the two length-3 strings differ)
- Optimal length: **7**, witness `w = "abcabba"`
  - Decomposition: "abc"→"cab" overlap 2 ("ab" wait — actually `cab` starts at pos 2 of `abcabba`); the merge ordering is `abc + cab` with overlap 1 on the "c" (giving "abcab" of length 5), then append "ba" (already present at pos 5) and "bb" (already present at pos 4). Two of the four strings vanish into already-built positions, illustrating non-Hamiltonian-path behavior where shorter strings are absorbed by the longer chain.
- Verification (brute-force confirmed by exhaustive search over Σ^L for L = 5, 6, 7):
  - "abc" at pos 0 ✓
  - "cab" at pos 2 ✓
  - "ba"  at pos 5 ✓
  - "bb"  at pos 4 ✓
- Why this is non-isomorphic to Instance 1/2: different cardinality (4 strings vs 6), different string-length distribution (mixed 2/3 vs uniform 3), and the absorbed-substring phenomenon exercises a branch where the overlap-graph Hamiltonian-path framing has to *not* visit every string as a distinct node — useful for round-trip closed-loop testing.
- Search space (with padding symbol): `(alphabet_size + 1)^max_length = 4^10 ≈ 1.05M` configurations — easily exhaustible by `BruteForce`.

## Expected Outcome

- **Instance 1:** optimal value = 9. Witness superstring w = "aabcabcca" (length 9) contains all 6 input strings as contiguous substrings.
- **Instance 2:** optimal value = 8. Witness superstring w = "00110100" (length 8) contains all 6 input strings as contiguous substrings.
- **Instance 3:** optimal value = 7. Witness superstring w = "abcabba" (length 7) contains all 4 input strings as contiguous substrings (with "ba" and "bb" absorbed inside the "abc"+"cab" overlap structure).

Field	Type	Description
`alphabet_size`	`usize`	Size of the alphabet \|Σ\|
`strings`	`Vec<Vec<usize>>`	Set R of input strings, each encoded as a vector of alphabet indices in `0..alphabet_size`
`max_length`	`usize`	Maximum possible superstring length (sum of all input string lengths). Derived automatically inside `new()` as `Σ_{x ∈ R}

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Model] ShortestCommonSuperstring #413

Motivation

Definition

Variables

Schema (data type)

Complexity

Extra Remark

How to solve

Example Instance

Expected Outcome

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

[Model] ShortestCommonSuperstring #413

Description

Motivation

Definition

Variables

Schema (data type)

Complexity

Extra Remark

How to solve

Example Instance

Expected Outcome

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions