If you use or are inspired by code from this repo, please cite related manuscripts and data including, but not limited to:
• Lind et al. (in press) Haploid, diploid, and pooled exome capture recapitulate features of biology and paralogy in two non-model tree species. Accepted to Molecular Ecology Resources. Available on bioRxiv https://doi.org/10.1101/2020.10.07.329961
- Access to an HPC with a scheduler (e.g., slurm, SGE, PBS, TORQUE) - this pipeline assumes slurm
- Ability to load the following modules via:

```
module load bwa/0.7.17
module load samtools/1.9
module load picard/2.18.9
module load gatk/4.1.0.0
module load bcftools/1.9
```
- In the `parentdir` folder that contains the fastq files, copy the following into a file called `bash_variables`. The `def-someuser` reflects the Compute Canada account you would like to use to submit jobs. If you have multiple accounts available, the pipeline will balance load among them (you choose these accounts during 00_start execution); the exports below are needed to submit jobs before the pipeline balances load. See the example file in the GitHub repository.

```
export SLURM_ACCOUNT=def-someuser
export SBATCH_ACCOUNT=$SLURM_ACCOUNT
export SALLOC_ACCOUNT=$SLURM_ACCOUNT
export PYTHONPATH="${PYTHONPATH}:$HOME/gatk_pipeline"
export SQUEUE_FORMAT="%.8i %.8u %.15a %.68j %.3t %16S %.10L %.5D %.4C %.6b %.7m %N (%r)"
# placeholder for python environment activation (see below)
```
- The following is assumed regarding the names of the slurm accounts found by `sshare -U --user $USER --format=Account`:
  - The total character length of each account name is less than 15; the full slurm account name needs to appear in the ACCOUNT column output from `squeue -u $USER` (using the exported `SQUEUE_FORMAT` above). If it does not, increase the digits in `SQUEUE_FORMAT` from `%.15a` to e.g. `%.20a`.
  - The characters in the account name that come before an underscore are sufficient to distinguish unique accounts; if the account name does not have an underscore, this is fine.
  - The accounts that the user would like to use end in `_cpu` (as opposed to e.g. `_gpu`). The pipeline will skip over non-`_cpu` accounts.
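The account-name assumptions can be sketched in plain shell. This is not part of the pipeline, just a hypothetical illustration with made-up account names: only `_cpu` accounts are kept, and each is keyed by the text before its first underscore.

```shell
# Hypothetical account list, as sshare might report it.
accounts="def-someuser_cpu def-someuser_gpu rrg-otherlab_cpu"
for acct in $accounts; do
    case "$acct" in
        *_cpu) echo "${acct%%_*}" ;;  # keep; print prefix before first '_'
    esac
done
```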
- Ability to install a virtual environment with python 3.7 (e.g., `virtualenv --no-download ~/py3`)
- Add the appropriate activation command to the `bash_variables` file (e.g., `source ~/anaconda3/bin/activate py3` for conda, or `source ~/py3/bin/activate` for virtualenv)
- The reference fasta file should be uncompressed (e.g., ref.fa not ref.fa.gz), and the following commands should be executed before starting the pipeline:

```
bwa index ref.fa
samtools faidx ref.fa
```
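As a quick sanity check (a sketch, assuming `ref.fa` sits in the current directory), you can verify that the index files written by the two commands exist before launching the pipeline:

```shell
# `bwa index` writes .amb .ann .bwt .pac .sa; `samtools faidx` writes .fai.
# Report any index file that is missing or empty.
for f in ref.fa.amb ref.fa.ann ref.fa.bwt ref.fa.pac ref.fa.sa ref.fa.fai; do
    [ -s "$f" ] || echo "missing: $f"
done
```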
- This pipeline assumes that the user will want to parallelize elements of the HaplotypeCaller and GenotypeGVCFs stages by using interval files (interval.list files). For example, chromosomes are often parallelized for this stage; ultimately the degree of parallelization (i.e., the number of interval.list files) is up to the user. The following is assumed about these files:
  - They are located within a directory called 'intervals' in the same folder as the ref.fa.
  - The interval filenames are of the form 'batch_uniqueID.list', where uniqueID can be any string as long as there are no hyphens or underscores (e.g., batch_0013.list, batch_chrXIII.list).
  - The contents of the .list files should have at least one entry (e.g., a chromosome name found in the ref.fa file) or a list of chromosomes/scaffolds/intervals with or without start-stop positions. There should not be a blank line at the end of the .list file; this will cause an error. For example:
    - batch_chrXI.list
      ```
      chrXI
      ```
    - batch_0001.list
      ```
      Scaffold_0001
      ```
    - batch_0013.list
      ```
      Scaffold_1693:1-122845
      Scaffold_1693:123446-409462
      Scaffold_1693:410063-584146
      ...
      ```
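One simple way to build such files is to write one .list file per sequence in the fasta index. This is a hypothetical helper, not part of the pipeline; it assumes `ref.fa.fai` already exists and numbers the files so the uniqueID contains no hyphens or underscores. `printf '%s\n'` emits exactly one trailing newline, so no blank line ends up at the end of each .list file.

```shell
# Create intervals/batch_0001.list, batch_0002.list, ... with one
# chromosome/scaffold name per file, taken from column 1 of ref.fa.fai.
mkdir -p intervals
i=0
cut -f1 ref.fa.fai | while read -r chrom; do
    i=$((i + 1))
    printf '%s\n' "$chrom" > "$(printf 'intervals/batch_%04d.list' "$i")"
done
```

Adjust the split (e.g., many scaffolds per file, or start-stop sub-intervals) to reach the degree of parallelization you want.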
- Clone the pipeline repo's master branch to the server and create a symlink in `$HOME` so that it can be accessed via `$HOME/gatk_pipeline`.
- First create a python3 environment, then activate it (see above).
  - Then use the `requirements.txt` file in the repo to install the appropriate python dependencies: `pip install -r $HOME/gatk_pipeline/requirements.txt`
- See the example datatable.txt file needed for `00_start-pipeline.py`:
  - file names in the `file_name_r1` and `file_name_r2` columns should be basenames, not full paths
  - the entries in the `ref` column should be full paths
  - sample names in the 'sample_name' column can be duplicated for samples sequenced multiple times
    - RG info, file paths, etc. should of course differ between sequenced files of a single sample
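To make the column rules concrete, a fragment along these lines would satisfy them. The file names, paths, and column order here are hypothetical, and the real datatable.txt requires additional columns (e.g., the RG fields), so consult the example file in the repo rather than copying this:

```
sample_name	file_name_r1	file_name_r2	ref
samp_A	samp_A_run1_R1.fastq.gz	samp_A_run1_R2.fastq.gz	/full/path/to/ref.fa
samp_A	samp_A_run2_R1.fastq.gz	samp_A_run2_R2.fastq.gz	/full/path/to/ref.fa
```

Note the duplicated `samp_A` rows for a sample sequenced twice, the basenames in the fastq columns, and the full path in `ref`.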
- Once the environment is set up, put `datatable.txt` and the fastq files (symlinks work too) into a folder. This is the folder I call `PARENTDIR`.
- To kick off the pipeline, source your `bash_variables` file in `parentdir` (`source bash_variables`) to activate the python env and export the pythonpath to the pipeline and the other slurm variables. Then run `00_start-gatk_pipeline.py` from the home node, and it will run the rest of the preprocessing pipeline automatically by serially sbatching jobs (through `05_combine_and_genotype_supervised.py`).
```
(py3) [user@host ~]$ python $HOME/gatk_pipeline/00_start-gatk_pipeline.py -p PARENTDIR [-e EMAIL [-n EMAIL_OPTIONS]] [-maf INT] [-h]

required arguments:
  -p PARENTDIR      /path/to/directory/with/fastq.gz-files/

optional arguments:
  -e EMAIL          the email address you would like to have notifications
                    sent to (default: None)
  -n EMAIL_OPTIONS [EMAIL_OPTIONS ...]
                    the type(s) of email notifications you would like to
                    receive from the pipeline. Requires --email-address.
                    These options are used to fill out the #SBATCH flags.
                    Must be one (or multiple) of ['all', 'fail', 'begin',
                    'end', 'pipeline-finish'] (default: None)
  -maf MAF          At the end of the pipeline, VCF files will be filtered
                    for MAF. If the pipeline is run on a single
                    population/pool, the user can set MAF to 0.0 so as to
                    filter variants based on global allele frequency
                    across populations/pools at a later time. (default: 0.05)
  -h, --help        Show this help message and exit.
```