🚧 This pipeline is currently under development and is not yet functional 🚧
- Description
- Diagram
- User guide
- Benchmarking
- Workflow summaries
- Additional notes
- Help/FAQ/Troubleshooting
- Acknowledgements/citations/credits
Somatic-shortV-nf is a pipeline for identifying somatic short variants (SNPs and indels) in human Illumina short-read whole genome sequence data from tumour and matched normal BAM files. The pipeline follows GATK's Best Practices workflow. It is written in Nextflow and uses Singularity to run containerised tools.
There are two main steps to this workflow:
- Generate a large set of candidate somatic variants using GATK's Mutect2.
- Filter the candidate variants to obtain a more confident set of somatic variant calls.
To run this pipeline, you will need to prepare your input files and reference data, and clone this repository. Before proceeding, ensure Nextflow is installed on the system you're working on. To install Nextflow, see these instructions.
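For example, one common way to install Nextflow (assuming Java 11 or later is already available, and using ~/bin as an illustrative install location on your PATH) is:

# Download the Nextflow launcher into the current directory
curl -s https://get.nextflow.io | bash

# Make it executable, move it onto your PATH, and confirm it runs
chmod +x nextflow
mv nextflow ~/bin/
nextflow -version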
To run this pipeline you will need the following inputs:
- Paired Tumor-Normal (T-N) BAM files
- Corresponding BAM index files
- Input sample sheet
- Panel of Normals (PoN) file
This pipeline processes paired BAM files and is capable of processing multiple samples in parallel. BAM files are expected to be coordinate sorted and indexed (see Fastq-to-BAM for an example of a best practice workflow that can generate these files).
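For example, you can confirm a BAM file is coordinate sorted and create its index with Samtools (the file name here is illustrative):

# The @HD line of the header should report SO:coordinate
samtools view -H sample1-T.bam | grep '@HD'

# Create the index (sample1-T.bam.bai) if one does not already exist
samtools index sample1-T.bam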
Before running the pipeline, you will need to create a sample sheet with information about the samples you are processing. This file must be comma-separated and contain a header and one row per sample. Columns should correspond to sampleID, BAM-N file path, and BAM-T file path:
sampleID,bam-N,bam-T
SAMPLE1,/data/Bams/sample1-N.bam,/data/Bams/sample1-T.bam
SAMPLE2,/data/Bams/sample2-N.bam,/data/Bams/sample2-T.bam
When you run the pipeline, you will use the mandatory --input parameter to specify the location and name of the input sample sheet file:
--input /path/to/samples.csv
You can create a panel of normals (PoN) containing germline and artifactual sites for use with Mutect2 using the instructions provided here. You will use the mandatory --ponvcf parameter to specify the location and name of the PoN file:
--ponvcf /path/to/PoN
We will include this functionality to create the PoN as an optional module in the next version of the pipeline.
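As a rough guide, GATK's documented approach builds the PoN in three steps: run Mutect2 in tumour-only mode on each normal BAM, consolidate the per-sample calls with GenomicsDBImport, and create the panel with CreateSomaticPanelOfNormals. A minimal sketch (file names and interval list are illustrative) is:

# Step 1: call each normal sample in tumour-only mode
gatk Mutect2 -R ref.fasta -I normal1.bam --max-mnp-distance 0 -O normal1.vcf.gz

# Step 2: consolidate the per-normal calls into a GenomicsDB workspace
gatk GenomicsDBImport -R ref.fasta -L intervals.interval_list \
    --genomicsdb-workspace-path pon_db -V normal1.vcf.gz -V normal2.vcf.gz

# Step 3: create the panel of normals VCF
gatk CreateSomaticPanelOfNormals -R ref.fasta -V gendb://pon_db -O pon.vcf.gz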
To run this pipeline you will need the following reference files:
- Indexed reference genome
- Common biallelic variant resources
Reference genome indexes
You can download FASTA files from the Ensembl, UCSC, or NCBI FTP sites. Reference FASTA files must be accompanied by specific index files. You can use our IndexReferenceFasta-nf pipeline to generate indexes.
This pipeline uses the following tools for generating specific index files.
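For example, assuming Samtools and GATK are available, the FASTA index and sequence dictionary used by this pipeline can be generated with:

# FASTA index (ref.fasta.fai)
samtools faidx ref.fasta

# Sequence dictionary (ref.dict), required by GATK tools
gatk CreateSequenceDictionary -R ref.fasta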
When you run the pipeline, you will use the mandatory --ref and --dict parameters to specify the location and names of the reference files:
--ref /path/to/ref.fasta --dict /path/to/ref.dict
Common biallelic variant resources
Common biallelic variant databases (e.g. gnomAD) can be used to filter false positive germline variants from your somatic variant dataset. We have provided a script, scripts/gatk4_selectvariants.pbs, for this purpose; it executes GATK's SelectVariants tool. This will be included as optional functionality in this pipeline in an upcoming update.
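For reference, the GATK-documented command that such a script would run restricts a population resource (e.g. gnomAD) to common biallelic SNPs; a minimal sketch (the allele-frequency threshold and file names are illustrative) is:

# Keep only biallelic SNPs with a population allele frequency above 5%
gatk SelectVariants \
    -R ref.fasta \
    -V gnomad.vcf.gz \
    --select-type-to-include SNP \
    --restrict-alleles-to BIALLELIC \
    -select "AF > 0.05" \
    -O common_biallelic_variants.vcf.gz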
When you run the pipeline, you will use the mandatory --common_biallelic_variants parameter to specify the location and name of the VCF file:
--common_biallelic_variants /path/to/common_biallelic_variants.vcf
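GATK tools generally expect resource VCFs to be accompanied by an index; if yours is not already indexed, one way to create it (file name illustrative) is:

# Create an index alongside the bgzipped VCF
gatk IndexFeatureFile -I common_biallelic_variants.vcf.gz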
Download the code contained in this repository with:
git clone https://github.com/Sydney-Informatics-Hub/Somatic-shortV-nf
This will create a directory with the following structure:
Somatic-shortV-nf/
├── LICENSE
├── README.md
├── config/
├── images/
├── scripts/
├── main.nf
├── modules/
└── nextflow.config
The important features are:
- main.nf contains the main Nextflow script that calls all the processes in the workflow.
- nextflow.config contains the default parameters to use in the pipeline.
- modules contains individual process files for each step in the workflow.
- config contains infrastructure-specific config files (this is currently under development).
The minimal run command for executing this pipeline is:
nextflow run main.nf --input samples.csv \
--ref /path/to/ref.fasta --dict /path/to/ref.dict \
--ponvcf /path/to/pon \
--common_biallelic_variants /path/to/common_biallelic_variants
By default, this will generate a work directory, a results output directory, and a runInfo run metrics directory inside the results directory.
To specify additional optional tool-specific parameters, see what flags are supported by running:
nextflow run main.nf --help
The Nextflow command with optional parameters, where you can define the name of the output folder and set an integer number of genomic-interval files, is:
nextflow run main.nf --input samples.csv \
--ref /path/to/ref.fasta --dict /path/to/ref.dict \
--ponvcf /path/to/pon \
--common_biallelic_variants /path/to/common_biallelic_variants \
--outDir name_output_folder \
--number_of_intervals INT
Mandatory parameters
- --input: Full path and name of the sample input file (CSV format)
- --dict: Full path and name of the reference genome dictionary file (dict format) (step: create genomic intervals)
- --ref: Full path and name of the reference genome (FASTA format) (step: Mutect2)
- --ponvcf: Full path and name of the Panel of Normals file (VCF format) (step: Mutect2)
- --common_biallelic_variants: Full path and name of the common biallelic variant resources file (VCF format) (step: GetPileupSummaries)
Optional parameters
- --outDir: Name of the results directory (default: results)
- --number_of_intervals: Define a specific number of genomic intervals for parallelisation (default: automatically calculated from genome size)
If for any reason your workflow fails, you can resume it from the last successful process by adding the -resume flag to your run command.
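For example, re-running the original command with -resume appended will reuse the cached results of completed tasks:

nextflow run main.nf --input samples.csv \
  --ref /path/to/ref.fasta --dict /path/to/ref.dict \
  --ponvcf /path/to/pon \
  --common_biallelic_variants /path/to/common_biallelic_variants \
  -resume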
This pipeline has been optimised for NCI Gadi HPC; instructions for executing on Gadi are provided in Infrastructure usage and recommendations below.
Once the pipeline is complete, you will find all outputs in the results directory, with a sub-directory for each sampleID. Each sampleID directory contains a sub-directory for every step of the pipeline, which in turn stores all intermediate files and results generated by that step.
A directory called intervals_folder is created immediately inside the results directory; it contains all interval files created for the mutect2 scatter-gather step.
The following directories will be created inside every sample directory results/$sampleID/:
- mutect2: All files generated by mutect2 to call somatic variants using the scatter approach
- GatherVcfs: A single VCF file generated by gathering multiple VCF files from the scatter operations
- MergeMutectStats: Combined stats files across the scattered intervals
- LearnReadOrientationModel: Orientation bias mixture model filter file
- GetPileupSummaries: Table containing pileup metrics for inferring contamination
- CalculateContamination: Table containing fraction of reads coming from cross-sample contamination
- FilterMutectCalls: A VCF file from the GatherVcfs step with variants marked for filtering based on the contamination table
- getFilteredVariants: A final VCF file containing only the filtered variants for downstream analysis
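As a quick sanity check once a run has finished, you could count the variants that pass all filters in the final VCF; this assumes bcftools is available, and the file name shown is illustrative:

# Count PASS variants in the final filtered VCF for one sample
bcftools view -f PASS results/SAMPLE1/getFilteredVariants/SAMPLE1.filtered.vcf.gz | grep -vc '^#'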
This pipeline has been successfully implemented on NCI Gadi HPC using an infrastructure-specific config.
As per config/gadi.config, the main script is executed on the copyq queue so that the Singularity container images required by the pipeline can be downloaded. The NCI Gadi config currently runs all other tasks on the normal queue. This config can be used to interact with the job scheduler and assign a project code to all task job submissions.
The following flags must be specified in the command:
- --whoami: your NCI or Pawsey user name
- --gadi_account: the Gadi project account you would like to bill service units to
The config uses the --gadi_account flag to assign a project code to all task job submissions for billing purposes. The version of Nextflow installed on Gadi has been modified to make it easier to specify resource options for jobs submitted to the cluster. See NCI's Gadi user guide for more details.
The minimal run command for executing this pipeline on NCI Gadi HPC is:
nextflow run main.nf --input samples.csv \
--ref /path/to/ref.fasta --dict /path/to/ref.dict \
--ponvcf /path/to/pon \
--common_biallelic_variants /path/to/common_biallelic_variants \
-profile gadi \
--whoami $(whoami) --gadi_account $PROJECT
Before running the pipeline you will need to load Nextflow and Singularity, both of which are globally installed modules on Gadi. You can do this by running the commands below:
module purge
module load nextflow singularity
To run this workflow on NCI Gadi HPC, you can execute the script scripts/run_pipeline_on_gadi_script.sh after first entering the following details in the PBS header of the script (a sketch of such a header is shown after the list below):
- project code
- resource-related details:
  - walltime
  - ncpus
  - mem
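As an illustration only, the PBS header of such a script might look like the one below; the queue follows the copyq note above, the resource values are placeholders to adjust for your data, and on Gadi a storage directive is also usually needed for access to /scratch and /g/data:

#!/bin/bash
#PBS -P <project_code>
#PBS -q copyq
#PBS -l walltime=10:00:00
#PBS -l ncpus=1
#PBS -l mem=10GB
#PBS -l wd
#PBS -l storage=scratch/<project_code>+gdata/<project_code>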
You can then submit the script using the command:
qsub scripts/run_pipeline_on_gadi_script.sh
Coming soon!
metadata field | Somatic-shortV-nf / v1.0 |
---|---|
Version | 1.0.0 |
Maturity | under development |
Creators | Tracy Chew, Cali Willet, Nandan Deshpande |
Source | NA |
License | GNU General Public License v3.0 |
Workflow manager | Nextflow |
Container | See component tools |
Install method | NA |
GitHub | https://github.com/Sydney-Informatics-Hub/Somatic-shortV-nf |
bio.tools | NA |
BioContainers | NA |
bioconda | NA |
To run this pipeline you must have Nextflow and Singularity installed on your machine. All other tools are run using containers.
Tool | Version |
---|---|
Nextflow | >=20.07.1 |
Singularity | 3.11.3 |
GATK | 4.4.0.0 |
Resources
- It is essential that the reference genome you're using contains the same chromosomes, contigs, and scaffolds as the BAM files. To confirm what contigs are included in your indexed BAM file, you can use Samtools idxstats:
samtools idxstats input.bam | cut -f 1
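To compare these against the contigs in your reference, you can list the sequence names from the FASTA index and diff the two sets (note that idxstats also reports a trailing * line for unmapped reads):

# List contig names from the reference FASTA index and from the BAM, then compare
cut -f1 ref.fasta.fai | sort > ref_contigs.txt
samtools idxstats input.bam | cut -f1 | sort > bam_contigs.txt
diff ref_contigs.txt bam_contigs.txt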
- Tracy Chew (Sydney Informatics Hub, University of Sydney)
- Cali Willet (Sydney Informatics Hub, University of Sydney)
- Nandan Deshpande (Sydney Informatics Hub, University of Sydney)
- This pipeline was built using the Nextflow DSL2 template.
- Documentation was created following the Australian BioCommons documentation guidelines.
Acknowledgements (and co-authorship, where appropriate) are an important way for us to demonstrate the value we bring to your research. Your research outcomes are vital for ongoing funding of the Sydney Informatics Hub and national compute facilities. We suggest including the following acknowledgement in any publications that follow from this work:
The authors acknowledge the technical assistance provided by the Sydney Informatics Hub, a Core Research Facility of the University of Sydney and the Australian BioCommons which is enabled by NCRIS via Bioplatforms Australia.