A Snakemake workflow for quickly screening assembled genomes against SRA datasets.
For each genome and SRA run, the workflow:
- Downloads reads from SRA (fasterq-dump)
- Quality-trims reads (fastp)
- Maps reads to reference (bwa mem + samtools)
- Computes per-feature coverage (bedtools)
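The per-run steps above can be sketched as one shell command each. A minimal Python helper that assembles them (illustrative only — thread counts, file names, and flags here are assumptions, not the workflow's actual Snakemake rules):

```python
def build_commands(sra_id, fasta, gff3, map_threads=8, io_threads=4):
    """Return the per-run shell commands as strings (illustrative sketch;
    the real workflow's rules may use different flags and paths).
    Assumes `bwa index` has already been run on the FASTA."""
    r1, r2 = f"{sra_id}_1.fastq", f"{sra_id}_2.fastq"
    t1, t2 = f"{sra_id}_1.trim.fastq.gz", f"{sra_id}_2.trim.fastq.gz"
    return [
        # 1. Download paired reads from SRA
        f"fasterq-dump --split-files --threads {io_threads} {sra_id}",
        # 2. Quality-trim with fastp
        f"fastp -i {r1} -I {r2} -o {t1} -O {t2} --thread {io_threads}",
        # 3. Map with bwa mem, sort and index with samtools
        f"bwa mem -t {map_threads} {fasta} {t1} {t2} | samtools sort -o {sra_id}.bam -",
        f"samtools index {sra_id}.bam",
        # 4. Per-feature coverage against the annotation
        f"bedtools coverage -a {gff3} -b {sra_id}.bam > {sra_id}.coverage.tsv",
    ]

for cmd in build_commands("SRR123456", "genome.fna", "genome.gff3"):
    print(cmd)
```

In the actual workflow, Snakemake wires these steps together per SRA run and per genome, so they run in parallel up to the `--cores` limit.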
Output: One Excel file per genome with mapping statistics and feature-level coverage.
# 1. Install dependencies
conda env create -f environment.yml
conda activate screen-sra
# 2. Edit config.yaml with your SRA IDs and genome files
# 3. Run (workflow parallelizes over SRA IDs/genomes with --cores)
snakemake --cores 16
# Note: `threads: 8` in config is used for `bwa mem` mapping.
# Download uses 4 threads, QC uses 4 threads.
# 4. Check results in results/excel/

Edit config.yaml:
threads: 8 # Threads for bwa mem mapping
sra_ids_file: SRR_Acc_List.txt # File with SRA IDs (one per line)
keep_aux: true # Keep intermediate files (true/false)
keep_mapping: true # Keep CRAM/BAM files (true/false)
genomes:
  - genome_id: your_genome
    fasta: path/to/genome.fna
    gff3: path/to/genome.gff3

- SRA IDs: Text file with one SRA run ID per line (e.g., SRR123456)
- Genomes: FASTA + GFF3 files for each reference genome
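For example, the SRA ID file is just a plain list of run accessions, one per line (these IDs are placeholders):

```text
SRR123456
SRR123457
```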
results/
├── excel/
│ └── {genome_id}.xlsx # Final reports (one per genome)
│ ├── General_mapping sheet
│ └── Per-SRA sample sheets
├── reads/ # Raw fastq.gz files
├── qc/ # fastp reports
├── mapping/ # BAM files
└── gene_tables/ # Per-sample TSV tables
- This pipeline is designed for bacterial genome screening with annotation files generated by Bakta.
- Other annotation files should also work, depending on their GFF/GFF3 structure.
- Code was reviewed and optimized with GPT-5.3-Codex.
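If you want a quick sanity check on a non-Bakta annotation before running the workflow, counting the feature types in the GFF3 is a simple start. This helper is not part of the workflow — it is a hypothetical check based only on the standard GFF3 column layout:

```python
from collections import Counter

def count_feature_types(gff3_text):
    """Count feature types (GFF3 column 3); comment/pragma lines are skipped."""
    types = Counter()
    for line in gff3_text.splitlines():
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) >= 8:  # a well-formed GFF3 feature line has 9 columns
            types[cols[2]] += 1
    return types

# Tiny made-up example; real files come from your annotation tool.
example = "\n".join([
    "##gff-version 3",
    "chr1\tBakta\tgene\t1\t900\t.\t+\t.\tID=gene1",
    "chr1\tBakta\tCDS\t1\t900\t.\t+\t0\tID=cds1;Parent=gene1",
])
print(count_feature_types(example))
```

If the counts show gene/CDS features, the per-feature coverage step has something to work with; a GFF3 with unusual feature types may need adjustment.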