added nextclade support

LucvZon · LucvZon · commit f5d1145782b9 · 2025-06-18T15:45:33.000+02:00
diff --git a/posts/NAAM-02-preparation.qmd b/posts/NAAM-02-preparation.qmd
@@ -14,17 +14,17 @@ Notice the small *"Copy to Clipboard"* button on the right hand side of each cod
 
 This workflow is distributed as a self-contained Singularity container image, which includes all necessary software dependencies and helper scripts. This simplifies setup considerably. It is required that [Singularity](https://docs.sylabs.io/guides/3.5/user-guide/introduction.html){target="_blank"} version 3.x or later is available on your system. If you are working with a high performance computing (HPC) system, then this will likely already be installed and available for use. Try writing `singularity --help` in your terminal (that's connected to the HPC system) and see if the command is recognized.
 
-## 1.2 Download pre-built image
+### Download pre-built image
 
 The singularity container needs an image file to activate the precompiled work environment. You can download the required workflow image file (naam_workflow.sif) directly through the terminal via:
 
 ``` bash
-wget https://github.com/LucvZon/nanopore-amplicon-analysis-manual/releases/download/v1.0.0/naam_workflow.sif
+wget https://github.com/LucvZon/nanopore-amplicon-analysis-manual/releases/download/v1.1.0/naam_workflow.sif
 ```
 
-Or go to the [github page](https://github.com/LucvZon/nanopore-amplicon-analysis-manual/releases/tag/v1.0.0){target="_blank"} and manually download it there, then transfer it to your HPC system.
+Or go to the [github page](https://github.com/LucvZon/nanopore-amplicon-analysis-manual/releases/tag/v1.1.0){target="_blank"} and manually download it there, then transfer it to your HPC system.
 
-## 1.3 Verify container {.unnumbered}
+### 1.2 Verify container {.unnumbered}
 
 You can test basic execution:
 
diff --git a/posts/NAAM-04-generate_consensus.qmd b/posts/NAAM-04-generate_consensus.qmd
@@ -34,7 +34,8 @@ samtools index {input}
 
 virconsens \
 -b {input} \
--o {output} \
+-o {output.fa} \
+-vf {output.vf} \
 -n {name} \
 -r {reference} \
 -d {coverage} \
@@ -43,10 +44,11 @@ virconsens \
 ```
 
 - `{input}` is the mapping bam file from step 3.1. 
-- `{output}` is the fasta file containing the consensus sequence (e.g. `barcode01_consensus.fasta`)
+- `{output.fa}` is the fasta file containing the consensus sequence (e.g. `barcode01_consensus.fasta`)
+- `{output.vf}` is an optional tsv file which contains variant information. 
 - `{name}` is the custom name of your sequence that will be used in the fasta file (e.g. `barcode01_consensus`)
 - `{reference}` is the fasta file containing the preferred reference, the same as in the previous step.
-- `{coverage}` is the minimal depth at which to not consider any alternative alleles
+- `{coverage}` is the minimal depth at which to not consider any alternative alleles.
 
 ::: {.callout-note}
 We now have a consensus sequence of our sequencing result. This is the "raw" result we can continue using for multiple sequence alignment and phylogeny in the next chapter.
diff --git a/posts/NAAM-06-sars_cov_2.qmd b/posts/NAAM-06-sars_cov_2.qmd
@@ -1,27 +1,71 @@
-# 5. SARS-CoV-2 analysis {.unnumbered}
+# 5. Nextclade CLI {.unnumbered}
 
-If you are dealing with SARS-Cov-2 data, then you can run the [pangolin software](https://github.com/cov-lineages/pangolin) to submit your SARS-CoV-2 genome sequences which then are compared with other genome sequences and assigned the most likely lineage.
+You can use the command line version of [Nextclade](https://docs.nextstrain.org/projects/nextclade/en/stable/) to identify differences between query sequences and a reference sequence, and to assign query sequences to clades.
+
+Nextclade can use official and community datasets, which are maintained at [github.com/nextstrain/nextclade_data](https://github.com/nextstrain/nextclade_data). You can also create your own dataset and use it with Nextclade. For more information on how to create one, see [https://docs.nextstrain.org/projects/nextclade/en/stable/user/datasets.html](https://docs.nextstrain.org/projects/nextclade/en/stable/user/datasets.html).
+
+## 5.1 How to use nextclade {.unnumbered}
+
+Gather a list of all official and community datasets:
+```bash
+nextclade dataset list
+```
+
+Download an official or community dataset:
+```bash
+nextclade dataset get --name '{dataset}' --output-dir '{output}'
+```
+
+- `{dataset}` is the name of a dataset.
+- `{output}` is the location where the dataset will be downloaded. 
+
+Run nextclade:
+```bash
+nextclade run \
+--input-dataset {dataset} \
+--output-all={output}/ \
+{sequences}
+```
+
+- `{dataset}` is either an official/community or custom made dataset.
+- `{output}` is the folder where all the output files will be stored.
+- `{sequences}` is your fasta file with all of your consensus sequences. 
+
+::: callout-important
+When using Nextclade, make sure that the reference sequence of the dataset is exactly the same as the reference you used to generate the consensus sequence in chapter 3.
+:::
+
+## 5.2 Custom nextclade visualisation {.unnumbered}
+
+If your Nextclade dataset contained a GFF3 annotation file for the reference sequence, then you can use the [viz_nextclade_cli.R](https://github.com/LucvZon/nanopore-amplicon-analysis-manual/tree/main/scripts) script to visualize the amino acid mutations per genetic feature.  
 
 Execute the following:
-```bash
-pangolin {input} --outfile {output}
+```
+Rscript viz_nextclade_cli.R \
+--nextclade-input-dir {input_dir} \
+--json-file {input.json} \
+--plotly-output-dir {plotly_output_dir} \
+--ggplotly-output-dir {ggplotly_output_dir}
 ```
 
-- `{input}` is your aggregated consensus fasta file from step X.X.
-- `{output}` is a .csv file that contains taxon name and lineage assigned per fasta sequence. Read more about the output format: [https://cov-lineages.org/resources/pangolin/output.html](https://cov-lineages.org/resources/pangolin/output.html)
+- `{input_dir}` is the output folder from the nextclade run (step 5.1).
+- `{input.json}` is the nextclade.json file that should be present in the output folder from the nextclade run (step 5.1).
+- `{plotly_output_dir}` is the folder where the html plots made with plotly are written.
+- `{ggplotly_output_dir}` is the folder where the html plots made with ggplotly are written.
 
+The plots are generated for each genetic feature of the reference sequence. Currently, both plotly and ggplotly versions are produced; use whichever looks best to you.
 
-## To be added...
+## 5.3 Pangolin (redundant) {.unnumbered}
 
-Here are some of the snakemake rules that are currently excluded:
+If you are dealing with SARS-CoV-2 data, you can run the [pangolin software](https://github.com/cov-lineages/pangolin), which compares your SARS-CoV-2 genome sequences with other genome sequences and assigns the most likely lineage.
 
-- create_depth_file
-- create_vcf
-- annotate_vcf
-- filter_vcf
-- create_filtered_vcf_tables
+As Nextclade already performs the pangolin classification step for you, running pangolin in addition to Nextclade has become redundant. However, if you still want to run it manually, execute the following:
+```bash
+pangolin {input} --outfile {output}
+```
 
-These rules are exclusively for analysis of SARS-Cov-2 data and will be implemented into the container workflow in the near future.
+- `{input}` is your aggregated consensus fasta file from step X.X.
+- `{output}` is a .csv file that contains taxon name and lineage assigned per fasta sequence. Read more about the output format: [https://cov-lineages.org/resources/pangolin/output.html](https://cov-lineages.org/resources/pangolin/output.html)
 
 ::: {.callout-note}
 You can now move to the final chapter to automate all of the steps we’ve previously discussed.
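Both the Nextclade and pangolin steps above take a single fasta file holding all consensus sequences. A minimal sketch of how such a file could be assembled is shown here; the helper name `aggregate_consensus` and the `*_consensus.fasta` naming pattern are assumptions based on the `barcode01_consensus.fasta` example from chapter 3, not part of the workflow itself.

```shell
# Hypothetical helper (not shipped with the container): concatenate
# per-sample consensus FASTA files into one aggregated FASTA file.
aggregate_consensus() {
    in_dir="$1"    # folder holding files such as barcode01_consensus.fasta
    out_file="$2"  # aggregated FASTA; keep it outside in_dir

    # Start from an empty output file.
    : > "$out_file"

    for fa in "$in_dir"/*_consensus.fasta; do
        # Skip the unexpanded pattern when no file matches.
        [ -e "$fa" ] || continue
        cat "$fa" >> "$out_file"
    done

    # Print the number of sequences (header lines) that were written.
    grep -c '^>' "$out_file" || true
}
```

For example, `aggregate_consensus results/ all_consensus.fasta` would print the number of sequences collected, and `all_consensus.fasta` could then be passed as `{sequences}` or `{input}`. The folder name `results/` is only an illustration.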
diff --git a/posts/NAAM-07-nanopore_hpc.qmd b/posts/NAAM-07-nanopore_hpc.qmd
@@ -22,6 +22,7 @@ singularity exec naam_workflow.sif python /amplicon_project.py --help
 
 ```default
 usage: amplicon_project.py [-h] [-p PROJECT_DIR] -n STUDY_NAME -d RAW_FASTQ_DIR -P PRIMER -r PRIMER_REFERENCE -R REFERENCE_GENOME [-m MIN_LENGTH] [-c COVERAGE] [-t THREADS] [--use_sars_cov_2_workflow]
+                           [--nextclade_dataset NEXTCLADE_DATASET]
 
 Interactive tool for setting up a Snakemake project.
 
@@ -47,9 +48,12 @@ options:
                         Maximum number of threads for the Snakefile (default: 8)
   --use_sars_cov_2_workflow
                         Add this parameter if you want to analyze SARS-CoV-2 data
+  --nextclade_dataset NEXTCLADE_DATASET
+                        Path to a custom Nextclade dataset directory, OR an official Nextclade dataset name (e.g., 'nextstrain/sars-cov-2/wuhan-hu-1/orfs'). Check official nextclade datasets with `nextclade
+                        dataset list`.
 ```
 
-Now prepare your project directory with prepare_project.py as follows:
+Now prepare your project directory with **amplicon_project.py** as follows:
 
 ``` bash
 singularity exec \
@@ -66,9 +70,12 @@ singularity exec \
     -P {primer} \
     -r {primer.reference} \
     -R {reference} \
-    -t {threads}
+    -t {threads} \
+    --nextclade_dataset {nextclade.dataset}
 ```
 
+REQUIRED ARGUMENTS:
+
 - `{project.folder}` is your project folder. This is where you run your workflow and store results.
 - `{name}` is the name of your study, no spaces allowed.
 - `{reads}` is the folder that contains your barcode directories (e.g. barcode01, barcode02).
@@ -78,7 +85,13 @@ singularity exec \
 - `{primer.reference}` is the reference sequence .fasta file used for primer trimming.
 - `{reference}` is the reference sequence .fasta file used for the consensus generation.
 
-Please use absolute paths for the reads, primers and references so that they can always be located.
+OPTIONAL ARGUMENTS:
+
+- `{nextclade.dataset}` is the name of an official nextclade dataset or the path to a custom one. You can list the official nextclade datasets with the following command: `singularity exec naam_workflow.sif nextclade dataset list`. If you are using a self-made custom nextclade dataset, please provide the absolute path to the dataset.
+
+::: callout-important
+Please use absolute paths for the **reads**, **primers** and **references** so that they can always be located.
+:::
 
 The `--bind` arguments are needed to explicitly tell Singularity to mount the necessary host directories into the container. The part before the colon is the path on the host machine that you want to make available. The path after the colon is the path inside the container where the host directory should be mounted. 
 
@@ -88,7 +101,7 @@ Once the setup is completed, move to your newly created project directory with `
 
 Next, use the `ls` command to list the files in the project directory and check if the following files are present: `sample.tsv`, `config.yaml` and `Snakefile`.
 
--   The **sample.tsv** should have 8 columns:
+-   The **sample.tsv** should have 9 columns:
     - `unique_id`: the unique sample name that's generated based on the barcode directories.
     - `sequence_name`: the name given to the consensus sequence at the end of the pipeline. It's generated with the following template: {study_name}_{unique_id}.
     - `fastq_path`: the location of the raw .fastq.gz files per sample.
@@ -97,6 +110,7 @@ Next, use the `ls` command to list the files in the project directory and check
     - `primer_reference`: the location of the reference sequence for primer trimming.
     - `coverage`: minimum coverage required.
     - `min_length`: minimum length required.
+    - `nextclade_db`: path to the nextclade dataset. This column can be empty if you're not including Nextclade.
 
 -   The **config.yaml** determines if the SARS_CoV_2 section of the workflow is enabled and the amount of default threads to use.
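After this change, `sample.tsv` is expected to have 9 columns, including `nextclade_db`. The sketch below is one way to sanity-check the header row of a generated file. The helper name `check_sample_tsv` is hypothetical, and it assumes the file is tab-separated with the column names on the first line.

```shell
# Hypothetical helper (not part of the workflow): check that the header
# row of sample.tsv has 9 tab-separated columns, one named nextclade_db.
check_sample_tsv() {
    file="$1"

    # Count the tab-separated fields in the header row.
    ncols=$(head -n 1 "$file" | awk -F '\t' '{ print NF }')
    ncols=${ncols:-0}
    echo "columns: $ncols"

    if [ "$ncols" -ne 9 ]; then
        echo "expected 9 columns, found $ncols" >&2
        return 1
    fi

    # The nextclade_db column must be named in the header,
    # even if its values are left empty.
    if ! head -n 1 "$file" | tr '\t' '\n' | grep -qx 'nextclade_db'; then
        echo "missing nextclade_db column" >&2
        return 1
    fi
}
```

Running `check_sample_tsv sample.tsv` from the project directory returns a non-zero exit status when the file still has the older 8-column layout.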