Commit f5d1145

added nextclade support
1 parent 9ba21bb commit f5d1145

4 files changed

+85
-25
lines changed

posts/NAAM-02-preparation.qmd

Lines changed: 4 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -14,17 +14,17 @@ Notice the small *"Copy to Clipboard"* button on the right hand side of each cod
1414

1515
This workflow is distributed as a self-contained Singularity container image, which includes all necessary software dependencies and helper scripts. This simplifies setup considerably. It is required that [Singularity](https://docs.sylabs.io/guides/3.5/user-guide/introduction.html){target="_blank"} version 3.x or later is available on your system. If you are working with a high performance computing (HPC) system, then this will likely already be installed and available for use. Try writing `singularity --help` in your terminal (that's connected to the HPC system) and see if the command is recognized.
1616

17-
## 1.2 Download pre-built image
17+
### Download pre-built image
1818

1919
The singularity container needs an image file to activate the precompiled work environment. You can download the required workflow image file (naam_workflow.sif) directly through the terminal via:
2020

2121
``` bash
22-
wget https://github.com/LucvZon/nanopore-amplicon-analysis-manual/releases/download/v1.0.0/naam_workflow.sif
22+
wget https://github.com/LucvZon/nanopore-amplicon-analysis-manual/releases/download/v1.1.0/naam_workflow.sif
2323
```
2424

25-
Or go to the [github page](https://github.com/LucvZon/nanopore-amplicon-analysis-manual/releases/tag/v1.0.0){target="_blank"} and manually download it there, then transfer it to your HPC system.
25+
Or go to the [github page](https://github.com/LucvZon/nanopore-amplicon-analysis-manual/releases/tag/v1.1.0){target="_blank"} and manually download it there, then transfer it to your HPC system.
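As a minimal sketch of the download step, the snippet below builds the release URL from a version tag and reports whether `singularity` is on the `PATH`. The function names `naam_image_url` and `have_singularity` are hypothetical helpers introduced for illustration; the URL pattern is taken from the `wget` command in this file.

```bash
#!/usr/bin/env bash
# Hypothetical helper: build the release URL of the workflow image for a tag.
naam_image_url() {
  local tag="$1"
  printf 'https://github.com/LucvZon/nanopore-amplicon-analysis-manual/releases/download/%s/naam_workflow.sif\n' "$tag"
}

# Hypothetical helper: succeed if the singularity command is available.
have_singularity() {
  command -v singularity >/dev/null 2>&1
}

if have_singularity; then
  echo "singularity found: $(singularity --version)"
else
  echo "singularity not found on PATH" >&2
fi

echo "Image URL: $(naam_image_url v1.1.0)"
# To download the image (requires network access):
# wget "$(naam_image_url v1.1.0)"
```

Keeping the tag as a parameter means only one value changes when a new release is published.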
2626

27-
## 1.3 Verify container {.unnumbered}
27+
### 1.2 Verify container {.unnumbered}
2828

2929
You can test basic execution:
3030

posts/NAAM-04-generate_consensus.qmd

Lines changed: 5 additions & 3 deletions
@@ -34,7 +34,8 @@ samtools index {input}
3434

3535
virconsens \
3636
-b {input} \
37-
-o {output} \
37+
-o {output.fa} \
38+
-vf {output.vf} \
3839
-n {name} \
3940
-r {reference} \
4041
-d {coverage} \
@@ -43,10 +44,11 @@ virconsens \
4344
```
4445

4546
- `{input}` is the mapping bam file from step 3.1.
46-
- `{output}` is the fasta file containing the consensus sequence (e.g. `barcode01_consensus.fasta`)
47+
- `{output.fa}` is the fasta file containing the consensus sequence (e.g. `barcode01_consensus.fasta`)
48+
- `{output.vf}` is an optional tsv file which contains variant information.
4749
- `{name}` is the custom name of your sequence that will be used in the fasta file (e.g. `barcode01_consensus`)
4850
- `{reference}` is the fasta file containing the preferred reference, the same as in the previous step.
49-
- `{coverage}` is the minimal depth at which to not consider any alternative alleles
51+
- `{coverage}` is the minimal depth at which to not consider any alternative alleles.
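The placeholders above can be filled in per sample. The following dry-run sketch only prints the resulting `virconsens` call and does not execute it. The helper name `virconsens_cmd`, the `_variants.tsv` suffix and the depth value `30` are assumptions made for illustration, and any flags that follow `-d` in the full command are left out.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print (not run) the virconsens call for one sample.
# Arguments: sample name, mapping BAM file, reference FASTA, minimal depth.
virconsens_cmd() {
  local sample="$1" bam="$2" reference="$3" coverage="$4"
  # The _variants.tsv file name is an assumed naming convention.
  printf 'virconsens -b %s -o %s_consensus.fasta -vf %s_variants.tsv -n %s_consensus -r %s -d %s\n' \
    "$bam" "$sample" "$sample" "$sample" "$reference" "$coverage"
}

# Example with illustrative values
virconsens_cmd barcode01 barcode01.bam reference.fasta 30
```

Printing the command first makes it easy to confirm that `{output.fa}` and `{output.vf}` point to distinct files before running anything.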
5052

5153
::: {.callout-note}
5254
We now have a consensus sequence of our sequencing result. This is the "raw" result we can continue using for multiple sequence alignment and phylogeny in the next chapter.

posts/NAAM-06-sars_cov_2.qmd

Lines changed: 58 additions & 14 deletions
Original file line numberDiff line numberDiff line change
@@ -1,27 +1,71 @@
1-
# 5. SARS-CoV-2 analysis {.unnumbered}
1+
# 5. Nextclade CLI {.unnumbered}
22

3-
If you are dealing with SARS-Cov-2 data, then you can run the [pangolin software](https://github.com/cov-lineages/pangolin) to submit your SARS-CoV-2 genome sequences which then are compared with other genome sequences and assigned the most likely lineage.
3+
You can use the command line version of [Nextclade](https://docs.nextstrain.org/projects/nextclade/en/stable/) to identify differences between query sequences and a reference sequence and to assign query sequences to clades.
4+
5+
Nextclade can use official and community datasets, which are maintained at [github.com/nextstrain/nextclade_data](https://github.com/nextstrain/nextclade_data). You can also create your own dataset and use it with Nextclade. For more information on how to create your own dataset, visit [https://docs.nextstrain.org/projects/nextclade/en/stable/user/datasets.html](https://docs.nextstrain.org/projects/nextclade/en/stable/user/datasets.html).
6+
7+
## 5.1 How to use nextclade {.unnumbered}
8+
9+
Gather a list of all official and community datasets:
10+
```
11+
nextclade dataset list
12+
```
13+
14+
Download an official or community dataset:
15+
```
16+
nextclade dataset get --name '{dataset}' --output-dir '{output}'
17+
```
18+
19+
- `{dataset}` is the name of a dataset.
20+
- `{output}` is the location where the dataset will be downloaded.
21+
22+
Run nextclade:
23+
```
24+
nextclade run \
25+
--input-dataset {dataset} \
26+
--output-all={output}/ \
27+
{sequences}
28+
```
29+
30+
- `{dataset}` is the path to a dataset directory, either a downloaded official/community dataset or a custom-made one.
31+
- `{output}` is the folder where all the output files will be stored.
32+
- `{sequences}` is your fasta file with all of your consensus sequences.
33+
34+
::: callout-important
35+
When using Nextclade, make sure that the reference sequence of the dataset is exactly the same as the reference you used to generate the consensus sequence in chapter 3.
36+
:::
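One way to check this is to compare the two sequences while ignoring FASTA headers, line wrapping and letter case. The sketch below assumes that the dataset directory contains a `reference.fasta` file and that both files hold a single record. The function names `fasta_seq` and `same_reference` are hypothetical.

```bash
#!/usr/bin/env bash
# Print the sequence of a FASTA file as one uppercase line, dropping headers
# and line wrapping. Intended for single-record reference files.
fasta_seq() {
  grep -v '^>' "$1" | tr -d ' \t\r\n' | tr '[:lower:]' '[:upper:]'
}

# Succeed only if both FASTA files hold the same sequence.
same_reference() {
  [ "$(fasta_seq "$1")" = "$(fasta_seq "$2")" ]
}

# Usage: bash check_reference.sh {dataset}/reference.fasta {reference}
if [ "$#" -eq 2 ]; then
  if same_reference "$1" "$2"; then
    echo "references match"
  else
    echo "references differ" >&2
    exit 1
  fi
fi
```

The comparison is deliberately strict: a single differing base counts as a mismatch, which is the condition the note above asks for.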
37+
38+
## 5.2 Custom nextclade visualisation {.unnumbered}
39+
40+
If your Nextclade dataset contains a GFF3 annotation file for the reference sequence, then you can use the [viz_nextclade_cli.R](https://github.com/LucvZon/nanopore-amplicon-analysis-manual/tree/main/scripts) script to visualize the amino acid mutations per genetic feature.
441

542
Execute the following:
6-
```bash
7-
pangolin {input} --outfile {output}
43+
```
44+
Rscript viz_nextclade_cli.R \
45+
--nextclade-input-dir {input_dir} \
46+
--json-file {input.json} \
47+
--plotly-output-dir {plotly_output_dir} \
48+
--ggplotly-output-dir {ggplotly_output_dir}
849
```
950

10-
- `{input}` is your aggregated consensus fasta file from step X.X.
11-
- `{output}` is a .csv file that contains taxon name and lineage assigned per fasta sequence. Read more about the output format: [https://cov-lineages.org/resources/pangolin/output.html](https://cov-lineages.org/resources/pangolin/output.html)
51+
- `{input_dir}` is the output folder from the nextclade run (step 5.1).
52+
- `{input.json}` is the nextclade.json file that should be present in the output folder from the nextclade run (step 5.1).
53+
- `{plotly_output_dir}` is the folder where the HTML plots made with plotly will be stored.
54+
- `{ggplotly_output_dir}` is the folder where the HTML plots made with ggplotly will be stored.
1255

56+
The plots will be generated for each genetic feature of the reference sequence. Currently, we output both plotly and ggplotly versions; use whichever looks best to you.
1357

14-
## To be added...
58+
## 5.3 Pangolin (redundant)
1559

16-
Here are some of the snakemake rules that are currently excluded:
60+
If you are dealing with SARS-CoV-2 data, then you can run the [pangolin software](https://github.com/cov-lineages/pangolin) on your SARS-CoV-2 genome sequences, which are then compared with other genome sequences and assigned the most likely lineage.
1761

18-
- create_depth_file
19-
- create_vcf
20-
- annotate_vcf
21-
- filter_vcf
22-
- create_filtered_vcf_tables
62+
Because Nextclade already performs the pangolin classification step for you, running pangolin in addition to Nextclade has become redundant. However, if you still want to run it manually, execute the following:
63+
```bash
64+
pangolin {input} --outfile {output}
65+
```
2366

24-
These rules are exclusively for analysis of SARS-Cov-2 data and will be implemented into the container workflow in the near future.
67+
- `{input}` is your aggregated consensus fasta file from step X.X.
68+
- `{output}` is a .csv file that contains taxon name and lineage assigned per fasta sequence. Read more about the output format: [https://cov-lineages.org/resources/pangolin/output.html](https://cov-lineages.org/resources/pangolin/output.html)
2569

2670
::: {.callout-note}
2771
You can now move to the final chapter to automate all of the steps we’ve previously discussed.

posts/NAAM-07-nanopore_hpc.qmd

Lines changed: 18 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -22,6 +22,7 @@ singularity exec naam_workflow.sif python /amplicon_project.py --help
2222

2323
```default
2424
usage: amplicon_project.py [-h] [-p PROJECT_DIR] -n STUDY_NAME -d RAW_FASTQ_DIR -P PRIMER -r PRIMER_REFERENCE -R REFERENCE_GENOME [-m MIN_LENGTH] [-c COVERAGE] [-t THREADS] [--use_sars_cov_2_workflow]
25+
[--nextclade_dataset NEXTCLADE_DATASET]
2526
2627
Interactive tool for setting up a Snakemake project.
2728
@@ -47,9 +48,12 @@ options:
4748
Maximum number of threads for the Snakefile (default: 8)
4849
--use_sars_cov_2_workflow
4950
Add this parameter if you want to analyze SARS-CoV-2 data
51+
--nextclade_dataset NEXTCLADE_DATASET
52+
Path to a custom Nextclade dataset directory, OR an official Nextclade dataset name (e.g., 'nextstrain/sars-cov-2/wuhan-hu-1/orfs'). Check official nextclade datasets with `nextclade
53+
dataset list`.
5054
```
5155

52-
Now prepare your project directory with prepare_project.py as follows:
56+
Now prepare your project directory with **amplicon_project.py** as follows:
5357

5458
``` bash
5559
singularity exec \
@@ -66,9 +70,12 @@ singularity exec \
6670
-P {primer} \
6771
-r {primer.reference} \
6872
-R {reference} \
69-
-t {threads}
73+
-t {threads} \
74+
--nextclade_dataset {nextclade.dataset}
7075
```
7176

77+
REQUIRED ARGUMENTS:
78+
7279
- `{project.folder}` is your project folder. This is where you run your workflow and store results.
7380
- `{name}` is the name of your study, no spaces allowed.
7481
- `{reads}` is the folder that contains your barcode directories (e.g. barcode01, barcode02).
@@ -78,7 +85,13 @@ singularity exec \
7885
- `{primer.reference}` is the reference sequence .fasta file used for primer trimming.
7986
- `{reference}` is the reference sequence .fasta file used for the consensus generation.
8087

81-
Please use absolute paths for the reads, primers and references so that they can always be located.
88+
OPTIONAL ARGUMENTS:
89+
90+
- `{nextclade.dataset}` is the name of an official Nextclade dataset or the path to a custom one. You can list the official Nextclade datasets with the following command: `singularity exec naam_workflow.sif nextclade dataset list`. If you are using a self-made custom Nextclade dataset, then please provide the absolute path to the dataset.
91+
92+
::: callout-important
93+
Please use absolute paths for the **reads**, **primers** and **references** so that they can always be located.
94+
:::
8295

8396
The `--bind` arguments are needed to explicitly tell Singularity to mount the necessary host directories into the container. The part before the colon is the path on the host machine that you want to make available. The path after the colon is the path inside the container where the host directory should be mounted.
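As an illustration of the `host:container` form, the sketch below turns a list of absolute host directories into `--bind` arguments that mount each directory at the same path inside the container. The helper name `bind_args` and the example paths are hypothetical, and mounting at an identical path is only one possible choice.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print one "--bind host:container" argument per
# directory, mounting each host directory at the same path in the container.
bind_args() {
  local path
  # Reject relative paths first, so nothing is printed on error.
  for path in "$@"; do
    case "$path" in
      /*) ;;
      *) echo "not an absolute path: $path" >&2; return 1 ;;
    esac
  done
  # Emit each bind once, even if a directory is listed twice.
  for path in "$@"; do
    printf -- '--bind %s:%s\n' "$path" "$path"
  done | sort -u
}

# Example with illustrative directories
bind_args /data/reads /data/references
```

Using the same path on both sides of the colon keeps the absolute paths you pass for reads, primers and references valid inside the container.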
8497

@@ -88,7 +101,7 @@ Once the setup is completed, move to your newly created project directory with `
88101

89102
Next, use the `ls` command to list the files in the project directory and check if the following files are present: `sample.tsv`, `config.yaml` and `Snakefile`.
90103

91-
- The **sample.tsv** should have 8 columns:
104+
- The **sample.tsv** should have 9 columns:
92105
- `unique_id`: the unique sample name that's generated based on the barcode directories.
93106
- `sequence_name`: the name given to the consensus sequence at the end of the pipeline. It's generated with the following template: {study_name}_{unique_id}.
94107
- `fastq_path`: the location of the raw .fastq.gz files per sample.
@@ -97,6 +110,7 @@ Next, use the `ls` command to list the files in the project directory and check
97110
- `primer_reference`: the location of the reference sequence for primer trimming.
98111
- `coverage`: minimum coverage required.
99112
- `min_length`: minimum length required.
113+
- `nextclade_db`: path to the nextclade dataset. This column can be empty if you're not including Nextclade.
100114

101115
- The **config.yaml** determines if the SARS_CoV_2 section of the workflow is enabled and the default number of threads to use.
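A quick sanity check of the generated **sample.tsv** could look like the sketch below. It assumes the file is tab-separated and that every line, including any header line, has 9 columns. The function name `check_sample_tsv` is hypothetical. An empty `nextclade_db` value still counts as a column as long as its tab separator is present.

```bash
#!/usr/bin/env bash
# Hypothetical helper: report every line of a tab-separated file that does
# not have exactly 9 columns, and fail if any such line is found.
check_sample_tsv() {
  awk -F'\t' -v n=9 '
    NF != n { printf "line %d has %d columns, expected %d\n", NR, NF, n; bad = 1 }
    END { exit bad }
  ' "$1"
}

# Run the check only when a sample.tsv is present in the current directory.
if [ -f sample.tsv ]; then
  check_sample_tsv sample.tsv && echo "sample.tsv looks consistent"
fi
```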
102116
