Skip to content

Optional Shovill-based assembly workflow #203

@vascokarla

Description

@vascokarla

Thank you for keep developing PHoeNIx! we use it routinely and really appreciate how simple, comprehensive, and effective it is for surveillance workflows.

We would like to request support for an optional Shovill-based assembly workflow, while keeping SPAdes as the default assembler.

Background / Motivation

We have encountered several cases where SPAdes assembly quality appears suboptimal, particularly in high-coverage samples (~200x+) and certain edge cases. In these situations, assemblies tend to be: more fragmented and potentially misassembled.

This has downstream impact on analyses that depend on assembly quality, including:

  • cgMLST (allele calling and distance estimation)
  • AMRFinder+ (in some cases relies on SNPs to determine alleles or point mutations)
  • MLST
  • Annotation pipelines (e.g., Prokka)
  • Other assembly-dependent analyses

Example case (250x coverage)

For one sample (~250x coverage), we compared assemblies generated via PHoeNIx (SPAdes) and Shovill, and identified that cgMLST distances (assembly based) with another isolate from the same patient were higher than expected:

  • cgMLST allelic distance (AD):
    • SPAdes (PHoeNIx): 35
    • Shovill: 2
  • The Shovill result matches 2 SNPs identified using read-based variant calling (Snippy)

This discrepancy, along with a high scaffold count, raised concerns about assembly quality.

Assembly statistics

Metric shovill/spades shovill/skesa phoenix/spades_scaffolds_filtered phoenix/spades_scaffolds_unfiltered
# contigs (>= 0 bp) 155 40 235 5455
# contigs (>= 1000 bp) 28 33 152 152
# contigs (>= 5000 bp) 26 28 102 102
# contigs (>= 10000 bp) 21 23 86 86
# contigs (>= 25000 bp) 20 20 55 55
# contigs (>= 50000 bp) 18 17 26 26
Total length (>= 0 bp) 4012311 3981371 3996992 4621215
Total length (>= 1000 bp) 3983587 3977730 3938174 3938174
Total length (>= 5000 bp) 3981380 3967751 3827954 3827954
Total length (>= 10000 bp) 3945865 3932391 3701982 3701982
Total length (>= 25000 bp) 3932705 3886630 3213205 3213205
Total length (>= 50000 bp) 3866129 3780376 2125939 2125939
# contigs 40 37 235 235
Largest contig 885631 883558 164545 164545
Total length 3991278 3980011 3996992 3996992
GC (%) 38.84 38.83 38.82 38.82
N50 264832 264558 52168 52168
N75 176295 176071 31967 31967
L50 5 5 24 24
L75 9 9 48 48
# N's per 100 kbp 0 0 783.09 783.09

Proposed implementation

A minimal, backward-compatible approach could be:

  1. Add a parameter: --assembler {spades,shovill} with spades as default

  2. Implement a SHOVILL_WF (using spades assembler).

Example integration

if (params.assembler == 'spades') {
    SPADES_WF(reads_ch)
    raw_assembly_ch   = SPADES_WF.out.spades_ch
    assembly_fail_ch  = SPADES_WF.out.spades_outcome
    fairy_fail_ch     = SPADES_WF.out.fairy_outcome
    taxonomy_seed_ch  = SPADES_WF.out.taxonomy
}
else if (params.assembler == 'shovill') {
    SHOVILL_WF(reads_ch)
    raw_assembly_ch   = SHOVILL_WF.out.assembly_ch
    assembly_fail_ch  = SHOVILL_WF.out.assembly_outcome
    fairy_fail_ch     = SHOVILL_WF.out.fairy_outcome
    taxonomy_seed_ch  = SHOVILL_WF.out.taxonomy
}

RENAME_FASTA_HEADERS(raw_assembly_ch)
BBMAP_REFORMAT(RENAME_FASTA_HEADERS.out.renamed_scaffolds)
SCAFFOLD_COUNT_CHECK(...)

We would be happy to help develop, test or contribute to implementation

Thanks you so much for your help,
Karla

Metadata

Metadata

Assignees

Labels

EnhancementNew feature or request

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions