nf-germline-snv

Nextflow pipeline for germline variant calling.

Introduction

nf-germline-snv is a genomics pipeline built with Nextflow and Docker that uses common bioinformatics tools to perform germline variant calling.

The pipeline uses the following tools:

fastqc
flexbar
multiqc
bwa
samtools
bcftools
picard
gatk
manta
strelka

Quick Start

Install Nextflow (>=20.04.0)
Install Docker (requires root permissions; remember to start the Docker service after installation)
Install Graphviz (optional)
Download the pipeline and run it on the test data with a single command:

nextflow run . -profile test

Test data with a full reference genome:

nextflow run . -profile test_hg37
nextflow run . -profile test_hg38

With an external Yandex S3 bucket:

nextflow run . -profile test2,yandex --accessKey <accessKey> --secretKey <secretKey>
nextflow run . -profile test2_hg37,yandex --accessKey <accessKey> --secretKey <secretKey>
nextflow run . -profile test2_hg38,yandex --accessKey <accessKey> --secretKey <secretKey>

Using your own data:

nextflow run . --input test_data_paths.csv --metadata_from_file_name false

Pipeline Inputs

The pipeline takes raw paired-end fastq[.gz] files as input. Their paths are provided in an input CSV file:

input.csv:

sample,fastq_1,fastq_2
sample_S1_L001,path/to/data/sample_S1_L001_R1_001.fastq.gz,path/to/data/sample_S1_L001_R2_001.fastq.gz
sample_S1_L002,path/to/data/sample_S1_L001_R1_002.fastq.gz,path/to/data/sample_S1_L001_R2_002.fastq.gz

The first line of the file must be a header. The first CSV column must contain sample names; the other two columns must specify the paths to the forward and reverse fastq files. Paths to AWS S3 buckets (s3://...) are accepted. To use S3 paths to non-Amazon S3 buckets, create and use a config similar to conf/yandex.config.
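The layout described above can be checked before launching a run. The following is a minimal Python sketch, not part of the pipeline: the function name `read_input_csv` is hypothetical, and the checks (a three-column header, three non-empty columns per row) are assumptions derived from the description above rather than the pipeline's own validation.

```python
import csv
import io


def read_input_csv(text):
    """Parse input CSV text into a list of (sample, fastq_1, fastq_2) tuples.

    Assumption: the first line is a header with exactly three columns and
    every following non-blank line has three non-empty fields.
    """
    reader = csv.reader(io.StringIO(text))
    header = next(reader, None)
    if header is None or len(header) != 3:
        raise ValueError(f"expected a 3-column header, got: {header}")
    rows = []
    for line_no, row in enumerate(reader, start=2):
        if not row:
            continue  # skip blank lines
        if len(row) != 3 or not all(field.strip() for field in row):
            raise ValueError(f"line {line_no}: expected 3 non-empty columns")
        rows.append(tuple(field.strip() for field in row))
    return rows


example = (
    "sample,fastq_1,fastq_2\n"
    "sample_S1_L001,path/to/data/sample_S1_L001_R1_001.fastq.gz,"
    "path/to/data/sample_S1_L001_R2_001.fastq.gz\n"
)
for sample, fastq_1, fastq_2 in read_input_csv(example):
    print(sample, fastq_1, fastq_2)
```

The sketch only checks the shape of the file; it does not verify that the fastq paths exist, since they may point to remote S3 locations.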

Naming convention

By default, the pipeline obtains the following metadata directly from the fastq file names:

sample_ID
bio-type
seq-type
seq-machine
flowcell-ID
lane
barcode

To provide this information, fastq file names must follow the naming convention:

{sampleID}-{bio-type}-{seq-type}-{seq-machine}-{flowcell-ID}-{lane}-{barcode}-{read-direction}.fq[.gz]

Individual items are separated by single dashes (-). The delimiter can be changed with the --sample_name_format_delimeter parameter.

Example:

ZD210122-GED-E080AS6-MG-300056277-3-66-F.fq.gz
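To illustrate how the convention maps onto metadata fields, here is a minimal Python sketch. It is not the pipeline's own parser: the function name `parse_fastq_name` is hypothetical, and the field names are taken from the list and the naming pattern above.

```python
import os

# Field order follows the naming convention described above.
FIELDS = [
    "sample_ID",
    "bio-type",
    "seq-type",
    "seq-machine",
    "flowcell-ID",
    "lane",
    "barcode",
    "read-direction",
]


def parse_fastq_name(file_name, delimiter="-"):
    """Split a fastq file name into its metadata fields."""
    base = os.path.basename(file_name)
    # Strip the optional .gz suffix first, then the .fq suffix.
    for suffix in (".gz", ".fq"):
        if base.endswith(suffix):
            base = base[: -len(suffix)]
    parts = base.split(delimiter)
    if len(parts) != len(FIELDS):
        raise ValueError(
            f"expected {len(FIELDS)} fields in '{base}', found {len(parts)}"
        )
    return dict(zip(FIELDS, parts))


meta = parse_fastq_name("ZD210122-GED-E080AS6-MG-300056277-3-66-F.fq.gz")
for field, value in meta.items():
    print(f"{field}: {value}")
```

A file name with too few or too many fields raises an error in this sketch, which is one way to spot files that do not follow the convention before starting a run.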

If you do not want to provide metadata this way and prefer to treat all samples independently, use the pipeline parameter --metadata_from_file_name false

Bio-type

Possible values:

GED - germline DNA
SOD - somatic DNA
GER - germline RNA
SOR - somatic RNA
CFD - cell free DNA
SCD - single cell DNA

Optional inputs

Genome reference data

By default, the pipeline uses reference data from the iGenomes S3 bucket for the pre-defined genomes hg37 and hg38 (GRCh38). hg38 is the default genome; this can be changed with the --genome parameter.

Alternatively, users may provide their own reference data bundle with the corresponding arguments. Note that the full list of reference data must be provided in this case (http and s3 links are also accepted). Check the conf/igenomes.config file to see which files are already in use and which reference data files the pipeline needs in general.

Targeted sequencing regions files (exome sequencing)

For targeted sequencing projects such as exome sequencing, target and bait regions files must be provided in GATK interval_list format. The pipeline ships with prepared SureSelect V6 S07604514 and V7 S31285117 files converted to interval_list format for both hg37 and hg38. These files, along with instructions on how they were obtained, are in the assets/regions_files folder.

The pipeline requires at least one targeted sequencing regions file:

--target_regions SS_V7_hg19_regions.bed.interval_list

If available, the "probes" bait regions file should also be provided:

--bait_regions SS_V7_hg19_probes.bed.interval_list

If only the target regions file is provided, it is used as both the target and the bait file in the targeted sequencing specificity QC step. This is less accurate than providing both files.
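The fallback described above can be sketched in a few lines of Python. This is an illustration only, not the pipeline's implementation; the function name `resolve_region_files` is hypothetical.

```python
def resolve_region_files(target_regions, bait_regions=None):
    """Return the (target, bait) files to use for the specificity QC step.

    The target file is mandatory; when no bait file is given, the target
    file is reused as the bait file.
    """
    if not target_regions:
        raise ValueError("a target regions file is required")
    return target_regions, bait_regions or target_regions


# Both files provided: each is used as given.
print(resolve_region_files(
    "SS_V7_hg19_regions.bed.interval_list",
    "SS_V7_hg19_probes.bed.interval_list",
))

# Only the target file provided: it doubles as the bait file.
print(resolve_region_files("SS_V7_hg19_regions.bed.interval_list"))
```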

Pipeline resource configuration

`--max_cpus`

Upper limit on the number of CPUs that can be allocated to a single process. Defaults to the total number of CPUs on the machine where the pipeline is started.

Note: the pipeline will still try to use all available compute resources; --max_cpus only caps each individual process. The processes affected by this parameter are the mapping processes (4 CPUs by default, set with the --cpus_mapping parameter, see below) and the BAM merging processes (which always use max_cpus).

Note 2: --max_cpus takes priority over the --cpus_mapping parameter.
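The interaction between the two parameters amounts to taking the smaller of the two values. The Python sketch below illustrates this under that assumption; the function name `effective_cpus` is hypothetical and the pipeline's actual logic lives in its Nextflow configuration.

```python
def effective_cpus(requested, max_cpus):
    """Cap a per-process CPU request at the --max_cpus limit."""
    if requested < 1 or max_cpus < 1:
        raise ValueError("CPU counts must be positive")
    return min(requested, max_cpus)


max_cpus = 2       # value passed as --max_cpus
cpus_mapping = 4   # default value of --cpus_mapping

# Mapping processes request cpus_mapping but are capped by max_cpus.
print(effective_cpus(cpus_mapping, max_cpus))  # 2

# BAM merging processes always use max_cpus.
print(effective_cpus(max_cpus, max_cpus))      # 2
```

With --max_cpus 8 and the default --cpus_mapping 4, the same function returns 4 for mapping processes, so the cap only takes effect when it is lower than the request.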

`--max_memory`

Same as --max_cpus, but for RAM. Default: 16.GB. Must be given in Nextflow memory format, e.g. 8.GB or 500.MB (see the Nextflow documentation).

`--max_time`

Maximum execution time allowed for a single process. Any process that reaches this time limit is aborted.

Default: 24.h

`--cpus_mapping`

Number of cores used by mapping processes.

Default: 4

`--memory_mapping`

Amount of memory allocated to mapping processes.

Default: 8.GB

Test profiles

Using public test data:

nextflow run . -profile test

Using private test data:

nextflow run . -profile test2,yandex --accessKey '<accessKey>' --secretKey '<secretKey>'

where <accessKey> and <secretKey> are the credentials for a private S3 bucket, for example s3://zenome-ngs-data.
