Copyright 2017-2018: Alexander Gosdschan, Katarzyna Wreczycka, Bren Osberg, Ricardo Wurmus. This work is distributed under the terms of the GNU General Public License, version 3 or later. It is free to use for all purposes.
PiGx BSseq is a data processing pipeline for raw fastq read data of bisulfite experiments; it produces reports on aggregate methylation and coverage and can be used to produce information on differential methylation and segmentation. It was first developed by the Akalin group at MDC in Berlin in 2017.
You can install this pipeline with all its dependencies using GNU Guix:
guix package -i pigx-bsseq
You can also install it from source manually. PiGx BSseq uses the GNU build system. If you want to install PiGx BSseq from source (you can find the latest release here), please make sure that all required dependencies are installed and then follow these steps after unpacking the latest release tarball:
./configure --prefix=/some/where make install
By default, the
configure script expects tools to be in a directory
listed in the
PATH environment variable. For reproducibility
We recommend installing the necessary tools with
GNU Guix. If the tools are installed in a
location that is not on the
PATH you can tell the
script about them with variables. Run
./configure --help for a list
of all variables and options.
The following tools must be available:
- samtools [>=1.3]
- Python [>=3.5]
- methylKit [>=1.3.1]
- rmarkdown [>=1.5]
All of these dependencies must be present in the environment at configuration time.
Installation of dependencies via Guix
Rather than installing all required packages manually, we generally recommend using GNU Guix. The following command spawns a sub-shell in which all dependencies are available:
guix environment -l guix.scm
To run PiGx BSseq on your experimental data, first enter the
necessary parameters in the spreadsheet file (see following section),
and configure the appropriate settings in
then, from the terminal type:
$ ./pigx-bsseq [options] sample_sheet.csv -s settings.yaml
To see all available options add the
--help option to the above command
$ ./pigx-bsseq --help usage: pigx-bsseq [-h] [-v] -s SETTINGS [-c CONFIGFILE] [--target TARGET] [-n] [--graph GRAPH] [--force] [--reason] [--unlock] sample_sheet PiGx BSseq Pipeline. PiGx BSseq is a data processing pipeline for raw fastq read data of bisulfite experiments. It produces methylation and coverage information and can be used to produce information on differential methylation and segmentation. positional arguments: sample_sheet The sample sheet containing sample data in CSV format. optional arguments: -h, --help show this help message and exit -v, --version show program's version number and exit -s SETTINGS, --settings SETTINGS A YAML file for settings that deviate from the defaults. -c CONFIGFILE, --configfile CONFIGFILE The config file used for calling the underlying snakemake process. By default the file 'config.json' is dynamically created from the sample sheet and the settings file. --target TARGET Stop when the named target is completed instead of running the whole pipeline. The default target is "final-report". Pass "--target=help" to describe all available targets. -n, --dry-run Only show what work would be performed. Do not actually run the pipeline. --graph GRAPH Output a graph in PDF format showing the relations between rules of this pipeline. You must specify a graph file name such as "graph.pdf". --force Force the execution of rules, even though the outputs are considered fresh. --reason Print the reason why a rule is executed. --unlock Recover after a snakemake crash. --verbose Print supplementary info on job execution. --printshellcmds Explicitly print commands being executed by snakemake to standard out. This pipeline was developed by the Akalin group at MDC in Berlin in 2017-2018.
The pipeline expects two kinds of input: a sample sheet in CSV format and a settings file specifying the desired behaviour of PiGx BSseq. PiGx BSseq will automatically generate a JSON configuration file from these inputs.
The sample sheet is a table with sample-specific information containing the names of fastq files, unique sample ids, the type of bisulfite sequencing experiment (such as RRBS or WGBS, currently, only WGBS is available) and treatment group for differential methylation detection.
An example sample sheet is provided in tests/sample_sheet.csv and contains the following:
Read1,Read2,SampleID,Protocol,Treatment PE_1.fq.gz,PE_2.fq.gz,PEsample,WGBS,0 SE_techrep1.fq.gz,,SEsample,WGBS,1 SE_techrep2.fq.gz,,SEsample_v2,WGBS,2
In the same folder,
sample\_sheet.ods is provided to allow for input
specification in the Libre Office Spreadsheet package, or similar
Excel-like programs. Once this spreadsheet is saved, select
Text CSV (.csv), and select the option to edit filters
to ensure that fields are separated by a comma (i.e.
Field delimiter = ',')
The default settings file can be found at
etc/settings.yaml; the values
here are used for any settings that are not specified by the user in the main
settings file (which override the defaults). An example settings file is
provided in tests/settings.yaml with the following content
locations: input-dir: in/ output-dir: out/ genome-dir: genome/ general: assembly: hg19 methylation-calling: minimum-coverage: 0 minimum-quality: 10 differential-methylation: cores: 20 treatment-groups: - ['0', '1'] - ['0', '2'] annotation: cpgIsland_bedfile: genome/cpgIslandExt.hg19.bed.gz refGenes_bedfile: genome/refGene.hg19.bed.gz webfetch: no execution: submit-to-cluster: no jobs: 6 nice: 19 cluster: memory: 8G stack: 128M queue: all contact-email: none
Note that indentation implies hierarchical structure; the significance of each section is discussed below.
PiGx BSseq recognizes four sections in the settings file:
locationsfor input, output, and genome directories
generalfor general settings
executionfor settings affecting the pipeline execution
|input-dir||string: location of the experimental input data files (currently requires .fastq.gz)|
|output-dir||string: ultimate location of the output data and report files|
|genome-dir||string: location of the reference genome data for alignment|
All input files (paired- or single-end) must be present in the folder
input-dir. All output produced by the pipeline will be written to the folder
output-dir, with subdirectories corresponding to the various stages of the
process. The directory
genome-dir must contain the reference genome being
mapped to. If
genome-dir contains a pre-existing sub-directory called
Bisulfite\_Genome, then the bisulfite-converted genome contained therein will
be used for mapping. If this sub-directory does not exist, then the pipeline
will perform this conversion itself (and will require permission to write to
disk at the location
genome-dir in order to create this sub-directory,
otherwise an error will be raised.)
general: assembly: hg19 methylation-calling: minimum-coverage: 0 minimum-quality: 10 differential-methylation: cores: 20 treatment-groups: - ['0', '1'] annotation: cpgIsland_bedfile: genome/cpgIslandExt.hg19.bed.gz refGenes_bedfile: genome/refGene.hg19.bed.gz webfetch: no
|assembly||string: UCSC assembly release name e.g. "hg19"|
|methylation-calling:minimum-coverage||integer: Minimum read coverage to be included in the methylKit objects. Defaults to 1. Any methylated base/region in the with fewer hits than this value will be ignored.|
|methylation-calling:minimum-quality||integer: Minimum phred quality score to call a methylation status for a base. Defaults to 10.|
|differential-methylation:cores||integer: Denotes how many cores should be employed in parallel differential methylation calculations|
|differential-methylation:treatment-groups||Array of strings indicating which groups (the "Treatment" column in the sample sheet) ought to be compared against one-another in differential methylation. The index corresponding to the control group must be entered first, followed by the 'treatment' under consideration. If differential methylation is to be omitted, remove this variable entirely from the settings file.|
|differential-methylation:annotation||Annotation files for differential methylation, based on CpG islands and reference genes respectively.|
|webfetch||Boolean: Should pigx download these annotation files from the internet if they are not found in the locations specified? (if
execution: submit-to-cluster: no jobs: 6 nice: 19 cluster: memory: 8G stack: 128M queue: all.q contact-email: none
|submit-to-cluster||string: Whether the pipeline should run locally ("no") or on a cluster with an SGE queueing system ("yes").|
|jobs||string: Number of jobs sent to cluster, e.g. "6"|
|nice||integer: From -20 to 19; higher values make the program execution less demanding on computational resources|
|cluster:memory||string: Amount of memory used for all jobs besides bismark, e.g. "8G"|
|cluster:stack||string: Stack size limit (used for cluster jobs), e.g. "128m"|
|cluster:contact-email||string: Email address where information about pipelines progress is sent (if it is running on a cluster).|
Further values can be supplied. For example, should the user wish to allocate job submission onto a specific queue (which we will call X) for all rules being executed, they may set:
execution: ... cluster: queue: X ... ...
Or certain rules can be allocated to specific queues as follows:
execution: ... rules: __default__: queue: X bismark_align_and_map_se: queue: Y ... ...
In the latter case, the
__default__ rule must be defined, and there may not
be any multiply-defined variables.
The values for the
executable field for each tool are determined at
configure time and usually won't have to be changed unless you want to
experiment with a custom variant of a particular tool.
args field for each tool accepts a string for additional
arguments to be passed to the specified tool.
bismark tool supports additional settings, such as
number of cores used by bismark) and
memory for the amount of RAM
that bismark may use.
Running the pipeline
Once the sample sheet and settings file have been prepared, it may be useful to
submit the pipeline with the
dry-run argument shown above, to confirm
execution will run. Once a run is submitted using the commands above, the
config.json file will be automatically generated, and the unicode
logo should appear. Sub-diretories will also be created in the
output folder with names that are (approximately) sequentially-ordered to
indicate their place in the sequence of operations.
If a run is interuppted, it may be necessary to run the
submission script with the option
--unlock before resuming from where the
previous run was terminated. PiGx generates its own documentation to screen
during execution, however, the underlying process based on snakemake will
produce further updates to screen showing the status of the run. Within each
subfolder, further programs will generate their own log files that can be
consulted in case debugging is necessary (e.g. if a run fails with a cryptic
error message, and the output directory contains folders named
03_*, etc., we recommend reading the last few lines of the
*.log files in the last directory to be created.)