Skip to content

Rahat463/CSE_462_Algorithm_Engineering_project

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

3 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Shortest Common Superstring: Algorithms, Modifications, and Experiments

CSE 462 Final Project - Group 7
Bangladesh University of Engineering and Technology
Department of Computer Science and Engineering

Problem

Given a finite set S = {s1, ..., sn} of strings, find the shortest string T such that every si is a substring of T. This problem is NP-complete (Gallant et al., 1980) and APX-hard (Blum et al., 1994).

Algorithms Implemented

1. Greedy SCS (with 1 modification)

  • Original: Repeatedly merge the pair with maximum overlap. Guaranteed <= 3.396 approximation ratio.
  • Mod 1 - Or-opt Local Search: Post-greedy improvement by removing 1-3 consecutive strings from the ordering and reinserting at the best position. Respects asymmetric overlaps (unlike 2-opt which reverses segments).

2. Genetic Algorithm (with 2 modifications)

  • Original: Permutation encoding, Order Crossover (OX), swap/inversion mutation, tournament selection.
  • Mod 1 - Greedy-Seeded Init: Seed 10% of population with greedy solution and perturbations. Provides ~20% improvement on structured data.
  • Mod 2 - Memetic (Or-opt + Seeded): Combines greedy-seeded initialization with Or-opt local search on offspring and mixed OX/PMX crossover.

3. Exact DP (Held-Karp)

Bitmask DP for ground truth on small instances (n <= 20). O(2^n * n^2) time.

Quick Start

# Setup
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt

# Run all experiments (full mode, ~30-90 min)
python run_all.py

# Run quick test (~15 sec)
python run_all.py --quick

# Run specific experiment
python run_all.py --experiment exp4

# Skip GenBank download
python run_all.py --skip-download

Experiments

# Experiment Description
1 Algorithm Comparison Greedy vs GA vs DP across instance sizes
2 Parameter Sensitivity GA parameter sweep (population, mutation rate)
3 Scalability Runtime vs number of strings
4 Modifications Original vs modified algorithms (key experiment)
5 Dataset Effects Performance across random, structured, DNA, adversarial data

Datasets

  • Random DNA: Uniform random over {A,C,G,T}
  • Structured Overlap: Fragments from a random genome with controlled overlap (30%, 50%, 70%)
  • Real DNA: Fragments from NCBI GenBank (Phi X 174 bacteriophage, 5386 bp), downloaded and cached locally
  • Adversarial: Binary alphabet constructions exposing greedy's weaknesses

Output

  • results/ - CSV files with experimental data
  • figures/ - PNG plots for the presentation

References

  • Gallant, Maier, Storer (1980). NP-completeness of SCS.
  • Blum, Jiang, Li, Tromp, Yannakakis (1994). Linear approximation of shortest superstrings. JACM.
  • Tarhio, Ukkonen (1988). Greedy conjecture.
  • Or (1976). Or-opt moves for asymmetric routing problems.
  • Sweedyk (1999). 2.5-approximation. SIAM J. Comput.
  • Mucha (2013). 2.478-approximation via Lyndon words. SODA.
  • Englert, Matsakis, Vesely (2023). 2.466-approximation. ISAAC.
  • Branke, Middendorf, Schneider (1998). GA for SCS. OR Spektrum.
  • Goldberg, Lingle (1985). PMX Crossover.
  • Moscato (1989). Memetic algorithms.

About

Advanced algorithm engineering implementations and analysis (BUET CSE 462)

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages