Optional Shovill-based assembly workflow

Thank you for keep developing PHoeNIx! we use it routinely and really appreciate how simple, comprehensive, and effective it is for surveillance workflows.

We would like to request support for an **optional Shovill-based assembly workflow**, while keeping **SPAdes as the default assembler**.


## Background / Motivation

We have encountered several cases where **SPAdes assembly quality appears suboptimal**, particularly in high-coverage samples (~200x+) and certain edge cases. In these situations, assemblies tend to be: more  fragmented and potentially misassembled. 

This has downstream impact on analyses that depend on assembly quality, including:

- **cgMLST** (allele calling and distance estimation)
- **AMRFinder+** (in some cases relies on SNPs to determine alleles or point mutations)
- **MLST**
- **Annotation pipelines** (e.g., Prokka)
- Other assembly-dependent analyses


## Example case (250x coverage)

For one sample (~250x coverage), we compared assemblies generated via PHoeNIx (SPAdes) and Shovill, and identified that cgMLST distances (assembly based) with another isolate from the same patient were higher than expected:

- **cgMLST allelic distance (AD)**:
  - SPAdes (PHoeNIx): **35**
  - Shovill: **2**
- The Shovill result matches **2 SNPs** identified using read-based variant calling (Snippy)

This discrepancy, along with a high scaffold count, raised concerns about assembly quality.

### Assembly statistics

| Metric | shovill/spades | shovill/skesa | phoenix/spades_scaffolds_filtered | phoenix/spades_scaffolds_unfiltered |
|---|---:|---:|---:|---:|
| # contigs (>= 0 bp) | 155 | 40 | 235 | 5455 |
| # contigs (>= 1000 bp) | 28 | 33 | 152 | 152 |
| # contigs (>= 5000 bp) | 26 | 28 | 102 | 102 |
| # contigs (>= 10000 bp) | 21 | 23 | 86 | 86 |
| # contigs (>= 25000 bp) | 20 | 20 | 55 | 55 |
| # contigs (>= 50000 bp) | 18 | 17 | 26 | 26 |
| Total length (>= 0 bp) | 4012311 | 3981371 | 3996992 | 4621215 |
| Total length (>= 1000 bp) | 3983587 | 3977730 | 3938174 | 3938174 |
| Total length (>= 5000 bp) | 3981380 | 3967751 | 3827954 | 3827954 |
| Total length (>= 10000 bp) | 3945865 | 3932391 | 3701982 | 3701982 |
| Total length (>= 25000 bp) | 3932705 | 3886630 | 3213205 | 3213205 |
| Total length (>= 50000 bp) | 3866129 | 3780376 | 2125939 | 2125939 |
| # contigs | 40 | 37 | 235 | 235 |
| Largest contig | 885631 | 883558 | 164545 | 164545 |
| Total length | 3991278 | 3980011 | 3996992 | 3996992 |
| GC (%) | 38.84 | 38.83 | 38.82 | 38.82 |
| N50 | 264832 | 264558 | 52168 | 52168 |
| N75 | 176295 | 176071 | 31967 | 31967 |
| L50 | 5 | 5 | 24 | 24 |
| L75 | 9 | 9 | 48 | 48 |
| # N's per 100 kbp | 0 | 0 | 783.09 | 783.09 |

---

## Proposed implementation

A minimal, backward-compatible approach could be:

1. Add a parameter: ` --assembler {spades,shovill}` with `spades` as default

2. Implement a `SHOVILL_WF`  (using spades assembler).


## Example integration

```nextflow
if (params.assembler == 'spades') {
    SPADES_WF(reads_ch)
    raw_assembly_ch   = SPADES_WF.out.spades_ch
    assembly_fail_ch  = SPADES_WF.out.spades_outcome
    fairy_fail_ch     = SPADES_WF.out.fairy_outcome
    taxonomy_seed_ch  = SPADES_WF.out.taxonomy
}
else if (params.assembler == 'shovill') {
    SHOVILL_WF(reads_ch)
    raw_assembly_ch   = SHOVILL_WF.out.assembly_ch
    assembly_fail_ch  = SHOVILL_WF.out.assembly_outcome
    fairy_fail_ch     = SHOVILL_WF.out.fairy_outcome
    taxonomy_seed_ch  = SHOVILL_WF.out.taxonomy
}

RENAME_FASTA_HEADERS(raw_assembly_ch)
BBMAP_REFORMAT(RENAME_FASTA_HEADERS.out.renamed_scaffolds)
SCAFFOLD_COUNT_CHECK(...)
```

---

We would be happy to help develop, test or contribute to implementation


Thanks you so much for your help,
Karla


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Optional Shovill-based assembly workflow #203

Background / Motivation

Example case (250x coverage)

Assembly statistics

Proposed implementation

Example integration

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Metric	shovill/spades	shovill/skesa	phoenix/spades_scaffolds_filtered	phoenix/spades_scaffolds_unfiltered
# contigs (>= 0 bp)	155	40	235	5455
# contigs (>= 1000 bp)	28	33	152	152
# contigs (>= 5000 bp)	26	28	102	102
# contigs (>= 10000 bp)	21	23	86	86
# contigs (>= 25000 bp)	20	20	55	55
# contigs (>= 50000 bp)	18	17	26	26
Total length (>= 0 bp)	4012311	3981371	3996992	4621215
Total length (>= 1000 bp)	3983587	3977730	3938174	3938174
Total length (>= 5000 bp)	3981380	3967751	3827954	3827954
Total length (>= 10000 bp)	3945865	3932391	3701982	3701982
Total length (>= 25000 bp)	3932705	3886630	3213205	3213205
Total length (>= 50000 bp)	3866129	3780376	2125939	2125939
# contigs	40	37	235	235
Largest contig	885631	883558	164545	164545
Total length	3991278	3980011	3996992	3996992
GC (%)	38.84	38.83	38.82	38.82
N50	264832	264558	52168	52168
N75	176295	176071	31967	31967
L50	5	5	24	24
L75	9	9	48	48
# N's per 100 kbp	0	0	783.09	783.09

Optional Shovill-based assembly workflow #203

Description

Background / Motivation

Example case (250x coverage)

Assembly statistics

Proposed implementation

Example integration

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions