Thank you for keep developing PHoeNIx! we use it routinely and really appreciate how simple, comprehensive, and effective it is for surveillance workflows.
We would like to request support for an optional Shovill-based assembly workflow, while keeping SPAdes as the default assembler.
Background / Motivation
We have encountered several cases where SPAdes assembly quality appears suboptimal, particularly in high-coverage samples (~200x+) and certain edge cases. In these situations, assemblies tend to be: more fragmented and potentially misassembled.
This has downstream impact on analyses that depend on assembly quality, including:
- cgMLST (allele calling and distance estimation)
- AMRFinder+ (in some cases relies on SNPs to determine alleles or point mutations)
- MLST
- Annotation pipelines (e.g., Prokka)
- Other assembly-dependent analyses
Example case (250x coverage)
For one sample (~250x coverage), we compared assemblies generated via PHoeNIx (SPAdes) and Shovill, and identified that cgMLST distances (assembly based) with another isolate from the same patient were higher than expected:
- cgMLST allelic distance (AD):
- SPAdes (PHoeNIx): 35
- Shovill: 2
- The Shovill result matches 2 SNPs identified using read-based variant calling (Snippy)
This discrepancy, along with a high scaffold count, raised concerns about assembly quality.
Assembly statistics
| Metric |
shovill/spades |
shovill/skesa |
phoenix/spades_scaffolds_filtered |
phoenix/spades_scaffolds_unfiltered |
| # contigs (>= 0 bp) |
155 |
40 |
235 |
5455 |
| # contigs (>= 1000 bp) |
28 |
33 |
152 |
152 |
| # contigs (>= 5000 bp) |
26 |
28 |
102 |
102 |
| # contigs (>= 10000 bp) |
21 |
23 |
86 |
86 |
| # contigs (>= 25000 bp) |
20 |
20 |
55 |
55 |
| # contigs (>= 50000 bp) |
18 |
17 |
26 |
26 |
| Total length (>= 0 bp) |
4012311 |
3981371 |
3996992 |
4621215 |
| Total length (>= 1000 bp) |
3983587 |
3977730 |
3938174 |
3938174 |
| Total length (>= 5000 bp) |
3981380 |
3967751 |
3827954 |
3827954 |
| Total length (>= 10000 bp) |
3945865 |
3932391 |
3701982 |
3701982 |
| Total length (>= 25000 bp) |
3932705 |
3886630 |
3213205 |
3213205 |
| Total length (>= 50000 bp) |
3866129 |
3780376 |
2125939 |
2125939 |
| # contigs |
40 |
37 |
235 |
235 |
| Largest contig |
885631 |
883558 |
164545 |
164545 |
| Total length |
3991278 |
3980011 |
3996992 |
3996992 |
| GC (%) |
38.84 |
38.83 |
38.82 |
38.82 |
| N50 |
264832 |
264558 |
52168 |
52168 |
| N75 |
176295 |
176071 |
31967 |
31967 |
| L50 |
5 |
5 |
24 |
24 |
| L75 |
9 |
9 |
48 |
48 |
| # N's per 100 kbp |
0 |
0 |
783.09 |
783.09 |
Proposed implementation
A minimal, backward-compatible approach could be:
-
Add a parameter: --assembler {spades,shovill} with spades as default
-
Implement a SHOVILL_WF (using spades assembler).
Example integration
if (params.assembler == 'spades') {
SPADES_WF(reads_ch)
raw_assembly_ch = SPADES_WF.out.spades_ch
assembly_fail_ch = SPADES_WF.out.spades_outcome
fairy_fail_ch = SPADES_WF.out.fairy_outcome
taxonomy_seed_ch = SPADES_WF.out.taxonomy
}
else if (params.assembler == 'shovill') {
SHOVILL_WF(reads_ch)
raw_assembly_ch = SHOVILL_WF.out.assembly_ch
assembly_fail_ch = SHOVILL_WF.out.assembly_outcome
fairy_fail_ch = SHOVILL_WF.out.fairy_outcome
taxonomy_seed_ch = SHOVILL_WF.out.taxonomy
}
RENAME_FASTA_HEADERS(raw_assembly_ch)
BBMAP_REFORMAT(RENAME_FASTA_HEADERS.out.renamed_scaffolds)
SCAFFOLD_COUNT_CHECK(...)
We would be happy to help develop, test or contribute to implementation
Thanks you so much for your help,
Karla
Thank you for keep developing PHoeNIx! we use it routinely and really appreciate how simple, comprehensive, and effective it is for surveillance workflows.
We would like to request support for an optional Shovill-based assembly workflow, while keeping SPAdes as the default assembler.
Background / Motivation
We have encountered several cases where SPAdes assembly quality appears suboptimal, particularly in high-coverage samples (~200x+) and certain edge cases. In these situations, assemblies tend to be: more fragmented and potentially misassembled.
This has downstream impact on analyses that depend on assembly quality, including:
Example case (250x coverage)
For one sample (~250x coverage), we compared assemblies generated via PHoeNIx (SPAdes) and Shovill, and identified that cgMLST distances (assembly based) with another isolate from the same patient were higher than expected:
This discrepancy, along with a high scaffold count, raised concerns about assembly quality.
Assembly statistics
Proposed implementation
A minimal, backward-compatible approach could be:
Add a parameter:
--assembler {spades,shovill}withspadesas defaultImplement a
SHOVILL_WF(using spades assembler).Example integration
We would be happy to help develop, test or contribute to implementation
Thanks you so much for your help,
Karla