Copyright 2017-2018: Vedran Franke, Alexander Gosdschan, Ricardo Wurmus. This work is distributed under the terms of the GNU General Public License, version 3 or later. It is free to use for all purposes.


Summary

PiGx ChIPseq (pipelines in genomics for Chromatin Immunoprecipitation Sequencing) is an analysis pipeline for preprocessing, peak calling and reporting for ChIP sequencing experiments. It is easy to use and produces high quality reports. The inputs are the read files from the sequencing experiment and a configuration file that describes the experiment. In addition to quality control of the experiment, the pipeline supports multiple peak calling analyses and can generate a UCSC track hub in an easily configurable manner.

What does it do

  • Trim reads using trim-galore
  • Quality control of reads using FastQC and MultiQC
  • Map reads to genome using Bowtie2
  • Call peaks for multiple combinations of samples using MACS2
  • Assess reproducibility of experiments using IDR
  • Generate a UCSC track hub to view in Genome Browser

What does it output

  • QC reports
  • bam files
  • bigwig files
  • narrowPeak files
  • UCSC track hub folder

Install

You can install this pipeline with all its dependencies using GNU Guix:

guix package -i pigx-chipseq

You can also install it from source manually. You can find the latest release here. PiGx uses the GNU build system. Please make sure that all required dependencies are installed and then follow these steps after unpacking the latest release tarball:

./configure --prefix=/some/where
make install

Dependencies

By default the configure script expects tools to be in a directory listed in the PATH environment variable. If the tools are installed in a location that is not on the PATH you can tell the configure script about them with variables. Run ./configure --help for a list of all variables and options.

You can prepare a suitable environment with Conda or with GNU Guix. If you do not use one of these package managers, you will need to ensure that the following software is installed:

Software dependencies
  • R
    • argparser
    • biocparallel
    • biostrings
    • chipseq
    • data.table
    • dplyr
    • genomation
    • genomicalignments
    • genomicranges
    • rsamtools
    • rtracklayer
    • s4vectors
    • stringr
    • jsonlite
    • heatmaply
    • htmlwidgets
    • ggplot2
    • ggrepel
    • plotly
    • rmarkdown
  • python
    • snakemake
    • pyyaml
    • wrapper
    • pytest
    • xlrd
    • magic
  • pandoc
  • fastqc
  • multiqc
  • trim-galore
  • bowtie2
  • macs2
  • idr
  • samtools
  • bedtools
  • bedToBigBed
  • bamToBed

Via Guix

Assuming you have Guix installed, the following command spawns a sub-shell in which all dependencies are available:

guix environment -l guix.scm

Getting started

To run PiGx on your experimental data, first enter the necessary parameters in the spreadsheet file (see following section), and then from the terminal type

$ pigx-chipseq [options] sample_sheet.csv

To see all available options, use the --help option:

$ pigx-chipseq --help

usage: pigx-chipseq [-h] [-v] -s SETTINGS [-c CONFIGFILE] [--target TARGET]
                   [-n] [--graph GRAPH] [--force] [--reason] [--unlock]
                   samplesheet

PiGx ChIPseq Pipeline.

PiGx ChIPseq is a data processing pipeline for ChIPseq read data.

positional arguments:
  samplesheet                             The sample sheet containing sample data in yaml format.

optional arguments:
  -h, --help                              show this help message and exit
  -v, --version                           show program version number and exit
  -s SETTINGS, --settings SETTINGS        A YAML file for settings that deviate from the defaults.
  -c CONFIGFILE, --configfile CONFIGFILE  The config file used for calling the underlying snakemake process.  By
                                          default the file 'config.json' is dynamically created from the sample
                                          sheet and the settings file.
  --target TARGET                         Stop when the named target is completed instead of running the whole
                                          pipeline.  The default target is "final-report".  Pass "--target=help"
                                          to describe all available targets.
  -n, --dry-run                           Only show what work would be performed.  Do not actually run the
                                          pipeline.
  --graph GRAPH                           Output a graph in Graphviz dot format showing the relations between
                                          rules of this pipeline.  You must specify a graph file name such as
                                          "graph.pdf".
  --force                                 Force the execution of rules, even though the outputs are considered
                                          fresh.
  --reason                                Print the reason why a rule is executed.
  --unlock                                Recover after a snakemake crash.

This pipeline was developed by the Akalin group at MDC in Berlin in 2017-2018.
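
When the pipeline is launched from a script rather than typed by hand, the command line can be assembled from the options listed in the help text above. The following is a minimal sketch: the helper name `build_pigx_command` and the file name `settings.yaml` are illustrative assumptions, and only flags shown in the help output are used.

```python
import shlex


def build_pigx_command(sample_sheet, settings, dry_run=False, target=None):
    """Assemble a pigx-chipseq command line as a list of arguments."""
    # -s is shown without brackets in the usage line, so it is always passed.
    cmd = ["pigx-chipseq", "-s", settings]
    if dry_run:
        # Only show what work would be performed.
        cmd.append("--dry-run")
    if target is not None:
        # Stop at the named target instead of running the whole pipeline.
        cmd.append(f"--target={target}")
    cmd.append(sample_sheet)
    return cmd


cmd = build_pigx_command("sample_sheet.csv", "settings.yaml", dry_run=True)
print(shlex.join(cmd))
```

The resulting list can be handed to `subprocess.run(cmd, check=True)`; starting with a dry run is a cheap way to confirm that the sample sheet and settings file are picked up before any computation starts.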

Input Parameters

The pipeline requires two files as input to specify the samples and the design of the analysis.

Sample Sheet

The samples used for any subsequent analysis are defined in the sample sheet.

SampleName Read Read2
  • SampleName is the name for the sample
  • Read/Read2 are the fastq file names of paired-end reads
    • the location of these files is specified in settings.yaml
    • for single-end data, leave the Read2 column in place, but have it empty

Technical Replicates

The sample sheet supports technical replicates: repeat the sample name (first column) for each set of input files (second and third columns). The quality check is performed for every input file, and replicates are merged during mapping.

SampleName Read Read2
ChIPpe ChIPpe_R1.fq.gz ChIPpe_R2.fq.gz
ChIPpe ChIPpe_t2_R1.fq.gz ChIPpe_t2_R2.fq.gz
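
To illustrate how repeated sample names translate into replicate groups, here is a small sketch that reads a sample sheet and collects the read files per sample. It assumes a comma-separated file with the three documented columns; the function name `group_replicates` and the single-end sample `ChIPse` are made up for the example, and this is not the pipeline's own parsing code.

```python
import csv
import io
from collections import defaultdict


def group_replicates(handle):
    """Map each SampleName to the list of (Read, Read2) pairs listed for it."""
    groups = defaultdict(list)
    for row in csv.DictReader(handle):
        # An empty Read2 column marks single-end data.
        read2 = (row["Read2"] or "").strip() or None
        groups[row["SampleName"].strip()].append((row["Read"].strip(), read2))
    return dict(groups)


# Assumed CSV layout; ChIPse is a hypothetical single-end sample.
sheet = io.StringIO(
    "SampleName,Read,Read2\n"
    "ChIPpe,ChIPpe_R1.fq.gz,ChIPpe_R2.fq.gz\n"
    "ChIPpe,ChIPpe_t2_R1.fq.gz,ChIPpe_t2_R2.fq.gz\n"
    "ChIPse,ChIPse_R1.fq.gz,\n"
)
replicates = group_replicates(sheet)
for sample, files in replicates.items():
    print(sample, len(files), "input file set(s)")
```

Samples that end up with more than one entry are the ones whose technical replicates would be merged during mapping.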

Settings File

The settings file is a file in yaml format specifying general settings and the details of the analysis. It has the following required sections:

Locations

Defines the paths used by the pipeline. Some items are required; optional items can be left blank:

| item | required | description |
|------|----------|-------------|
| input-dir | yes | directory of the input files (fastq files) |
| output-dir | yes | output directory for the pipeline |
| genome-file | yes | path to the reference genome in fasta format |
| index-dir | no | directory containing pre-built mapping indices for the given reference genome (created with bowtie2-build) |
| gff-file | no | location of a GTF file with genome annotations for the given reference genome |

General

These settings apply to all analyses (unless adjusted for a single analysis):

| item | required | description |
|------|----------|-------------|
| assembly | yes | version of the reference genome (e.g. hg19, mm9, ...) |
| params | no | list of default parameters for tools and scripts (for tools, check the respective manual for available parameters) |

Execution

The execution section in the settings file allows the user to specify whether the pipeline is to be submitted to a cluster, or run locally, and the degree of parallelism. For a full list of possible parameters, see etc/settings.yaml.

A minimal settings file could look like this. Note that no analysis will be performed until the analysis sections described below are added:

locations:
  input-dir: in/reads/
  output-dir: out/
  genome-file: genome/my_genome.fa
  index-dir:
  gff-file: genome/mm_chr19.gtf

general:
  assembly: mm9
  params:
    export_bigwig:
        extend: 200
        scale_bw: 'yes'
    bowtie2:
        # set k if you want to report at most k alignments per read
        k: 4
        N: 0
    bam_filter:
        mapq: 0
        deduplicate: no
    idr:
        idr-threshold: 0.1
    macs2:
        keep-dup: auto
        q: 0.05
    extract_signal:
        expand_peak: 200
        number_of_bins: 50
    peak_statistics:
        resize: 500

execution:
  submit-to-cluster: no
  rules:
    __default__:
      queue: all.q
      memory: 8G
    bowtie2:
      queue: all.q
      memory: 16G
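
Before starting a run it can help to check that the items marked as required are actually filled in. The sketch below works on the dictionary that a YAML parser (for example `yaml.safe_load` from pyyaml) would return for the settings file; the helper name `missing_settings` is an assumption for illustration and the check covers only the required items documented above.

```python
REQUIRED_LOCATIONS = ("input-dir", "output-dir", "genome-file")


def missing_settings(settings):
    """Return the required settings items that are absent or left blank."""
    missing = []
    locations = settings.get("locations") or {}
    for key in REQUIRED_LOCATIONS:
        # A blank YAML value is parsed as None, which counts as missing.
        if not locations.get(key):
            missing.append(f"locations.{key}")
    if not (settings.get("general") or {}).get("assembly"):
        missing.append("general.assembly")
    return missing


# Dictionary mirroring part of the example settings file above.
settings = {
    "locations": {
        "input-dir": "in/reads/",
        "output-dir": "out/",
        "genome-file": "genome/my_genome.fa",
        "index-dir": None,  # optional, left blank
        "gff-file": "genome/mm_chr19.gtf",
    },
    "general": {"assembly": "mm9"},
}
print(missing_settings(settings))
```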

Analysis Sections

The analysis part of the settings file describes the experiment. It has the following sections:

| section | required | description |
|---------|----------|-------------|
| peak_calling | yes | defines which samples will be used to detect regions of enriched binding (multiple combinations and variations are possible, see the Peak Calling section for details) |
| idr | no | specifies pairs of peak calling analyses that are compared to determine the reproducibility of the general experiment (see the IDR section for details) |
| hub | no | describes the general layout of a UCSC hub that can be created from the processed data and allows visual inspection of the results in the UCSC Genome Browser (see the Hub section for details) |
| feature_combination | no | defines, for a list of peak calling and/or idr analyses, the combination of regions shared among this list (see the Feature Combination section for details) |

Creating these sections is straightforward when using the following snippets as templates. Comments and examples within the snippets provide guidance on what is possible and what to take care of.

Peak Calling

The previously defined samples are used for subsequent peak calling analysis to detect regions of enriched binding. Any number of comparisons can be defined in this section, and multiple combinations and variations are possible. In terms of peak calling, the ChIP (also called treatment) is the sample in which we want to detect enriched regions compared to the Cont(rol) (or background) sample. Each analysis can be run with its own set of parameters, and default parameters for all analyses can be defined in the settings file; check the available parameters and their descriptions here. For more information, have a look at the publication for the software we are using: "Zhang et al. Model-based Analysis of ChIP-Seq (MACS). Genome Biol (2008) vol. 9 (9) pp. R137".

# define peak calling analysis
peak_calling:
    # analysis can have any name, but the names have to be unique 
    Peaks1: 
        # sample(s) to be used as treatment sample 
        ChIP: ChIP1
        # sample(s) to be used as control sample
        Cont: Cont1
        params:
            macs2:
                # each analysis can be adjusted independently
                # add/modify available parameters of the analysis
                nomodel: ''
                extsize: 300
    Peaks2:
        ChIP: ChIP2
        Cont: Cont2
        params:
            macs2:
                # each analysis can be adjusted independently
                nomodel: ''
                extsize: 147
    Peaks4:
        ChIP:
            # multiple samples can be used as treatment
            - ChIP1
            - ChIP2
        Cont:
            # multiple samples can be used as control
            - Cont1
            - Cont2
        params:
            macs2:
                nomodel: ''

    Peaks5:
        # the number of samples per group can differ 
        ChIP: ChIP2
        Cont:
            - Cont1
            - Cont2
        params:
            macs2:
                nomodel: ''

    Peaks6:
        # analysis can be performed without control
        ChIP: ChIP1
        Cont:
        params:
            macs2:
                nomodel: ''
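
The template shows that ChIP and Cont may each be a single sample, a list of samples, or blank. A small sketch of how such entries could be normalised and checked against the sample sheet follows; the helper names `as_list` and `undefined_samples` are assumptions for illustration, not functions provided by the pipeline.

```python
def as_list(value):
    """Normalise a ChIP/Cont entry: blank -> [], scalar -> [scalar], list -> list."""
    if value is None:
        return []
    if isinstance(value, (list, tuple)):
        return list(value)
    return [value]


def undefined_samples(peak_calling, sample_names):
    """Return {analysis: [samples]} for samples missing from the sample sheet."""
    known = set(sample_names)
    problems = {}
    for name, analysis in peak_calling.items():
        used = as_list(analysis.get("ChIP")) + as_list(analysis.get("Cont"))
        unknown = [sample for sample in used if sample not in known]
        if unknown:
            problems[name] = unknown
    return problems


# Parsed form of three analyses from the template above.
peak_calling = {
    "Peaks1": {"ChIP": "ChIP1", "Cont": "Cont1"},
    "Peaks4": {"ChIP": ["ChIP1", "ChIP2"], "Cont": ["Cont1", "Cont2"]},
    "Peaks6": {"ChIP": "ChIP1", "Cont": None},  # no control
}
# Cont2 is deliberately left out of the known sample names.
print(undefined_samples(peak_calling, ["ChIP1", "ChIP2", "Cont1"]))
```

Treating all three forms as lists keeps the downstream logic uniform, whether an analysis has one control, several, or none.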

(optional) IDR

If some of the samples are (biological or technical) replicates, the consistency between them can be measured with the irreproducible discovery rate (IDR), "Li, Q., Brown, J. B., Huang, H., & Bickel, P. J. (2011). Measuring reproducibility of high-throughput experiments. The annals of applied statistics, 5(3), 1752-1779.", which is in general a good (but very stringent) quality control.

idr:
    # idr analysis can have any name, but the names have to be unique 
    ChIP_IDR:
        # define the pair of samples, add more combinations for more replicates
        ChIP1: Peaks1
        ChIP2: Peaks2

(optional) Hub

The hub section describes the general layout of a UCSC Track Hub with a few minimal items. The track hub is generated from the processed data and allows visual inspection of the results in the UCSC Genome Browser (for supported genomes).

The required items to define the hub are the following:

| item | example | description |
|------|---------|-------------|
| name | PiGx_Hub | name of the hub directory |
| shortLabel | PiGx_Short | short name of the hub, displayed as the name above track groups |
| longLabel | PiGx_Hub_Long | descriptive longer label for the hub, displayed as the hub description |
| email | my.mail[at]domain.com | whom to contact for questions about the hub or data |
| descriptionUrl | pigx_hub.html | URL to an HTML page with a description of the hub's contents |
| super_tracks | see below | specification of the hub layout (track groups, tracks) |

A small example of how this could look:

hub:
    name: PiGx_Hub
    shortLabel: PiGx_Short
    longLabel: PiGx_Hub_Long
    email: my.mail@domain.com
    descriptionUrl: pigx_hub.html
    super_tracks:
        # track groups can have any name, but the names have to be unique 
        Tracks1:
            # tracks can have any name, but the names have to be unique 
            track11:
                # to add peaks as a track, define "type: macs" 
                name: Peaks1
                type: macs
            track12:
                # to add coverage signal as a track, define "type: bigWig"
                name: ChIP1
                type: bigWig
            # descriptive longer label for track group is
            # displayed as description in track settings
            long_label: Tracks1_long
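
Note that each track group mixes two kinds of entries: the tracks themselves (mappings with `name` and `type`) and the group's `long_label` string. The sketch below walks a parsed hub section and lists only the tracks; `list_tracks` is a hypothetical helper written for this example.

```python
def list_tracks(hub):
    """Yield (group, track, name, type) for every track in hub['super_tracks']."""
    for group_name, group in hub["super_tracks"].items():
        for key, value in group.items():
            # long_label is a plain string describing the group, not a track.
            if isinstance(value, dict):
                yield group_name, key, value["name"], value["type"]


# Parsed form of the hub example above (labels omitted for brevity).
hub = {
    "name": "PiGx_Hub",
    "super_tracks": {
        "Tracks1": {
            "track11": {"name": "Peaks1", "type": "macs"},
            "track12": {"name": "ChIP1", "type": "bigWig"},
            "long_label": "Tracks1_long",
        }
    },
}
tracks = list(list_tracks(hub))
for entry in tracks:
    print(*entry)
```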

(optional) Feature Combination

To find the combination of enriched binding regions, which is shared among a set of peak calling and/or idr analysis results, define a feature in the feature_combination section. Only items defined in the peak_calling and idr sections can be used here.

feature_combination:
    # features can have any name, but the names have to be unique
    Feature1:
        # define feature based on only one result
        - ChIP_IDR
    Feature2:
        # define feature based on more than one result
        - Peaks6
        - Peaks5
    Feature3:
        # define feature based on different analysis types
        - ChIP_IDR
        - Peaks5
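
Since only names defined in the peak_calling and idr sections may appear in a feature, a quick cross-check can catch typos before the pipeline runs. This is a sketch under the assumption that the settings have already been parsed into a dictionary; `unknown_references` and the misspelled entry `Peaks7` are invented for the example.

```python
def unknown_references(settings):
    """Return {feature: [names]} for entries not defined in peak_calling or idr."""
    defined = set(settings.get("peak_calling") or {}) | set(settings.get("idr") or {})
    problems = {}
    for feature, members in (settings.get("feature_combination") or {}).items():
        unknown = [member for member in members if member not in defined]
        if unknown:
            problems[feature] = unknown
    return problems


# Analysis bodies are left empty because only the names matter here.
settings = {
    "peak_calling": {"Peaks5": {}, "Peaks6": {}},
    "idr": {"ChIP_IDR": {}},
    "feature_combination": {
        "Feature1": ["ChIP_IDR"],
        "Feature2": ["Peaks6", "Peaks5"],
        "Feature3": ["ChIP_IDR", "Peaks7"],  # Peaks7 is a deliberate mistake
    },
}
print(unknown_references(settings))
```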

Output Folder Structure

|-- Analysis
|-- Annotation
|-- BigWig
|-- Bowtie2_Index
|-- FastQC
|-- Log
|-- Mapped
|-- Peaks
|-- Reports
|-- Trimmed
|-- UCSC_HUB

Analysis

Contains RDS files with intermediary analysis steps. RDS files are binary files that efficiently store R objects.

Annotation

Formatted GTF annotation.

BigWig

Symbolic links to the bigWig signal files.

Bowtie2_Index

Processed genome file along with the Bowtie2 index.

FastQC

FastQC sequencing quality report

Log

Detailed output from execution of each step of the pipeline.

Mapped

Mapped reads in .bam format, and corresponding bigWig files.

Peaks

Peaks called with MACS2. Depending on the parameters, the files are in either narrowPeak or broadPeak format. sample_qsort.bed contains uniformly processed peaks, sorted by their corresponding p value.

Reports

Contains MultiQC and ChIP quality reports in html format.

Trimmed

Adapter- and quality-trimmed files produced by trim-galore.

UCSC_HUB

Contains a completely formatted UCSC hub, with track descriptions, peaks and bigWig tracks.

Questions

If you have further questions please e-mail: pigx@googlegroups.com or use the web form to ask questions https://groups.google.com/forum/#!forum/pigx/