CSE 462 Final Project - Group 7
Bangladesh University of Engineering and Technology
Department of Computer Science and Engineering
Given a finite set S = {s1, ..., sn} of strings, find the shortest string T such that every si is a substring of T. This problem is NP-complete (Gallant et al., 1980) and APX-hard (Blum et al., 1994).
- Original: Repeatedly merge the pair with maximum overlap. Guaranteed <= 3.396 approximation ratio.
- Mod 1 - Or-opt Local Search: Post-greedy improvement by removing 1-3 consecutive strings from the ordering and reinserting at the best position. Respects asymmetric overlaps (unlike 2-opt which reverses segments).
- Original: Permutation encoding, Order Crossover (OX), swap/inversion mutation, tournament selection.
- Mod 1 - Greedy-Seeded Init: Seed 10% of population with greedy solution and perturbations. Provides ~20% improvement on structured data.
- Mod 2 - Memetic (Or-opt + Seeded): Combines greedy-seeded initialization with Or-opt local search on offspring and mixed OX/PMX crossover.
Bitmask DP for ground truth on small instances (n <= 20). O(2^n * n^2) time.
# Setup
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
# Run all experiments (full mode, ~30-90 min)
python run_all.py
# Run quick test (~15 sec)
python run_all.py --quick
# Run specific experiment
python run_all.py --experiment exp4
# Skip GenBank download
python run_all.py --skip-download| # | Experiment | Description |
|---|---|---|
| 1 | Algorithm Comparison | Greedy vs GA vs DP across instance sizes |
| 2 | Parameter Sensitivity | GA parameter sweep (population, mutation rate) |
| 3 | Scalability | Runtime vs number of strings |
| 4 | Modifications | Original vs modified algorithms (key experiment) |
| 5 | Dataset Effects | Performance across random, structured, DNA, adversarial data |
- Random DNA: Uniform random over {A,C,G,T}
- Structured Overlap: Fragments from a random genome with controlled overlap (30%, 50%, 70%)
- Real DNA: Fragments from NCBI GenBank (Phi X 174 bacteriophage, 5386 bp), downloaded and cached locally
- Adversarial: Binary alphabet constructions exposing greedy's weaknesses
results/- CSV files with experimental datafigures/- PNG plots for the presentation
- Gallant, Maier, Storer (1980). NP-completeness of SCS.
- Blum, Jiang, Li, Tromp, Yannakakis (1994). Linear approximation of shortest superstrings. JACM.
- Tarhio, Ukkonen (1988). Greedy conjecture.
- Or (1976). Or-opt moves for asymmetric routing problems.
- Sweedyk (1999). 2.5-approximation. SIAM J. Comput.
- Mucha (2013). 2.478-approximation via Lyndon words. SODA.
- Englert, Matsakis, Vesely (2023). 2.466-approximation. ISAAC.
- Branke, Middendorf, Schneider (1998). GA for SCS. OR Spektrum.
- Goldberg, Lingle (1985). PMX Crossover.
- Moscato (1989). Memetic algorithms.