Whole Genome Sequencing Structural Variation Pipeline
    # Install nextflow:
    curl -fsSL get.nextflow.io | bash
    mv ./nextflow ~/bin

    # Set work dir to no-backup, put this in your .bashrc
    export NXF_WORK=$SNIC_NOBACKUP/work

    # Pull workflow from this repo, run manta, normalize, and variant effect predictor:
    nextflow run -profile milou NBISweden/wgs-structvar --project <uppmax_project_id> --bam <bamfile.bam> --steps manta,normalize,vep

    # Monitor log file
    tail -f .nextflow.log
Your summary files will be in the
This is a pipeline for running the two structural variation callers fermikit and manta on UPPMAX. You can run either caller or both (and generate summary files). The main focus of this pipeline is to enable better comparisons with the SweGen dataset, so the default parameters for the tools are the same as those used for that dataset. If you have access to the structural variants in the SweGen dataset, you can add that file to the pipeline and thereby filter population-specific variants.
Profiles for running on Uppmax HPC clusters
The pipeline can be run in two ways: as a single-node job, or by letting Nextflow distribute the tasks through the SLURM queueing system. There are also slight differences in module usage depending on which HPC system is used.
Specify the profile to use with the
-profile option to Nextflow:
- -profile milou: Run on the milou cluster using the queueing system (for example, directly from the login node).
- -profile miloulocal: Run on milou but only on the local node. Use this in a batch job on a single node; reserving the node for 48 hours should be enough.
- -profile bianca: The same as `milou` but on the Bianca system.
- -profile biancalocal: The same as `miloulocal` but on the Bianca system.
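As a rough sketch of the single-node case, the helper below writes a SLURM batch file that reserves one node for 48 hours and runs the workflow with `-profile miloulocal`. The helper name `write_batch_script`, the partition name `node`, and the job name are assumptions made for illustration; adjust them to your project and cluster.

```shell
#!/usr/bin/env bash
# Sketch only: write_batch_script is a hypothetical helper, not part of the
# pipeline. The partition (-p node) and job name are assumed values.
write_batch_script() {
    local project="$1" bam="$2" out="$3"
    cat > "$out" <<EOF
#!/bin/bash
#SBATCH -A ${project}
#SBATCH -p node
#SBATCH -N 1
#SBATCH -t 48:00:00
#SBATCH -J wgs-structvar

nextflow run -profile miloulocal NBISweden/wgs-structvar \\
    --project ${project} --bam ${bam} --steps manta,normalize,vep
EOF
}

# Usage (project id and bam file are placeholders):
#   write_batch_script <uppmax_project_id> <bamfile.bam> structvar.sbatch
#   sbatch structvar.sbatch
```

Generating the file rather than typing it by hand keeps the project id consistent between the `#SBATCH -A` line and the `--project` option.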
The pipeline will use the following mask files to remove known artifacts:
- From cc2qe/speedseq: https://github.com/cc2qe/speedseq/raw/master/annotations/ceph18.b37.lumpy.exclude.2014-01-15.bed
- From lh3/varcmp: https://github.com/lh3/varcmp/raw/master/scripts/LCR-hs37d5.bed.gz
You can configure the location of the artifact mask files with the
--mask_artifacts_dir command line option.
The pipeline can take bed files to filter variants. To run the pipeline with
filters, put the bed files in the mask_cohort/ subdirectory and add mask_cohort
to the comma-separated --steps command line argument, e.g.:
    cp some_bed_file.bed <path-to-wgs-structvar>/mask_cohort/
    nextflow run -profile biancalocal <path-to-wgs-structvar>/main.nf --project <uppmax_project_id> --bam <bamfile.bam> --steps manta,normalize,vep,mask_cohort
You can configure the location of the cohort mask files with the
--mask_cohort_dir command line option.
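A small pre-flight check can catch an empty mask directory before a run is submitted. The sketch below is a hypothetical wrapper helper (`steps_with_cohort_mask` is not part of the pipeline): it appends `mask_cohort` to a `--steps` list only when the given directory contains at least one `.bed` file.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of wgs-structvar: print the --steps value,
# adding mask_cohort only if the mask directory holds at least one .bed file.
steps_with_cohort_mask() {
    local steps="$1" mask_dir="$2"
    local bed
    for bed in "$mask_dir"/*.bed; do
        # An unmatched glob stays literal, so check that the file exists.
        if [ -f "$bed" ]; then
            printf '%s,mask_cohort\n' "$steps"
            return 0
        fi
    done
    printf '%s\n' "$steps"
}

# Usage (paths and ids are placeholders):
#   steps=$(steps_with_cohort_mask manta,normalize,vep /path/to/masks)
#   nextflow run -profile milou NBISweden/wgs-structvar \
#       --project <uppmax_project_id> --bam <bamfile.bam> \
#       --steps "$steps" --mask_cohort_dir /path/to/masks
```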
Command line options
    Run a local copy of the wgs-structvar WF:
        nextflow main.nf --bam <bamfile> [more options]
    OR run from github:
        nextflow nbisweden/wgs-structvar --bam <bamfile> [more options]

    Options:
      Required
        --bam                 Input bamfile
          OR
        --runfile             Input runfile for multiple bamfiles in the same run.
                              Whitespace separated, first column is bam file,
                              second column is output directory and an optional
                              third column with a run id to more easily keep
                              track of the run (otherwise it's autogenerated).
        --project             Uppmax project to log cluster time to
        -profile <profile>    Where profile can be any of milou, localmilou,
                              bianca, localbianca and devel. The local<x> are for
                              running the entire workflow on a single node on the
                              cluster, without the local prefix the slurm
                              queueing system is used.
      Optional
        --help                Show this message and exit
        --fastq               Input fastqfile (default is bam but with fq as
                              fileending). Used by fermikit, will be created from
                              the bam file if missing.
        --steps               Specify what steps to run, comma separated:
                              (default: manta, vep)
                                Callers: manta, fermikit
                                Annotation: vep, snpeff
                                Extra: normalize (with vt),
                                       mask_cohort (with bed files in mask_cohort/)
        --sg_mask_ovlp        Fractional overlap for use with the filter option
        --no_sg_reciprocal    Don't use a reciprocal overlap for the filter option
        --outdir              Directory where resultfiles are stored
                              (default: results)
        --prefix              Prefix for result filenames (default: no prefix)
        --mask_artifacts_dir  Directory with bed files for artifact filtering
                              (default: mask_artifacts)
        --mask_cohort_dir     Directory with bed files for cohort filtering
                              (default: mask_cohort)
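To illustrate the `--runfile` format described in the usage message, here is a sketch that builds a runfile from a list of BAM files. The helper name `make_runfile`, the output directory layout (`results/<sample>`), and the choice of the BAM basename as run id are assumptions for the example, not conventions of the pipeline.

```shell
#!/usr/bin/env bash
# Hypothetical helper: write one runfile line per BAM file.
# Columns: <bam file> <output directory> <run id (optional)>
make_runfile() {
    local runfile="$1"; shift
    local bam sample
    : > "$runfile"                       # start with an empty file
    for bam in "$@"; do
        sample=$(basename "$bam" .bam)   # e.g. sampleA.bam -> sampleA
        printf '%s\t%s\t%s\n' "$bam" "results/$sample" "$sample" >> "$runfile"
    done
}

# Usage (project id is a placeholder):
#   make_runfile runs.txt /data/sampleA.bam /data/sampleB.bam
#   nextflow run -profile milou NBISweden/wgs-structvar \
#       --project <uppmax_project_id> --runfile runs.txt
```

Because the runfile is whitespace separated, paths containing spaces would not survive this format and are best avoided.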
The log file .nextflow.log is produced while the pipeline runs and can be monitored with:

    tail -f .nextflow.log
Nextflow can pull directly from GitHub (master branch), so if you specify this repo
it runs whatever is currently in it. If you want to customize the parameters further,
clone the repo and edit the
nextflow.config file in it.
Probably only the
params scope of the config file is of interest.
The first part has the default values for the command line parameters, see the usage message for information on them.
The next section has the reference assembly to use, both as fasta and assembly name.
You may want to use different versions of the modules used by this workflow;
currently you have to edit the profiles to do that. On UPPMAX we have the
milou profile, which specifies all the modules and versions, see the
The runtimes of the different programs are set in the
file. That file also specifies how to deal with errors and the interaction
with the Slurm scheduler, you probably don't want to change those unless you
know what you are doing.
The two folders mask_artifacts and mask_cohort contain bed files to
filter the vcf-files from the callers. The artifact directory contains files
for removing problematic regions: everything that has an overlap of at least
25% with a region in the artifact mask is removed. The cohort directory is
for more stringent filtering of already known variants, and here the default
filter threshold is instead a reciprocal overlap of 95%. It can be customized
with the two options
sg_mask_ovlp (default 0.95) and no_sg_reciprocal.
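To make the two thresholds concrete, the sketch below computes overlap fractions for a pair of intervals with awk. It only illustrates the arithmetic: the function names are hypothetical and the pipeline's own filtering may be implemented differently. Intervals are treated as half-open, as in bed files, and are assumed to have positive length. Measuring the plain overlap against the first interval is also an assumption of this example.

```shell
#!/usr/bin/env bash
# Illustration only; function names are hypothetical.
# Intervals are half-open [start, end) as in bed files.

# overlap_fraction S1 E1 S2 E2: shared length divided by length of interval 1.
overlap_fraction() {
    awk -v s1="$1" -v e1="$2" -v s2="$3" -v e2="$4" 'BEGIN {
        lo = (s1 > s2) ? s1 : s2
        hi = (e1 < e2) ? e1 : e2
        ov = (hi > lo) ? hi - lo : 0
        printf "%.2f\n", ov / (e1 - s1)
    }'
}

# reciprocal_overlap S1 E1 S2 E2 MIN: succeed if the shared length is at
# least the fraction MIN of both intervals.
reciprocal_overlap() {
    awk -v s1="$1" -v e1="$2" -v s2="$3" -v e2="$4" -v min="$5" 'BEGIN {
        lo = (s1 > s2) ? s1 : s2
        hi = (e1 < e2) ? e1 : e2
        ov = (hi > lo) ? hi - lo : 0
        exit !((ov >= min * (e1 - s1)) && (ov >= min * (e2 - s2)))
    }'
}

# Usage:
#   overlap_fraction 100 200 150 400           # prints 0.50
#   reciprocal_overlap 100 200 102 201 0.95    # succeeds (98 bases shared)
#   reciprocal_overlap 100 200 100 300 0.95    # fails (only half of interval 2)
```

The last call shows why the reciprocal test is stricter: the first interval is fully covered, but the shared region is only half of the second one.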
If you need help with this module, please create a support issue on GitHub.