repliseq-nf

Introduction

repliseq-nf is a bioinformatics analysis pipeline used for Repli-seq data.

The pipeline is built using Nextflow, a workflow tool that runs tasks across multiple compute infrastructures in a portable manner. It comes with Docker containers, making installation trivial and results highly reproducible.

Pipeline summary

We loosely follow the steps of the Nature Protocols paper for processing E/L Repli-seq next-generation sequencing data.

Raw read QC (FastQC)
Adapter trimming (Trim Galore!)
Alignment (BWA)
Filter for:
- reads that are marked as duplicates (SAMtools)
- reads that aren't marked as primary alignments (SAMtools)
- reads that are unmapped (SAMtools)
- reads that map to multiple locations (SAMtools)
- reads containing > 4 mismatches (BAMTools)
- reads that are soft-clipped (BAMTools)
- reads that have an insert size > 2kb (BAMTools; paired-end only)
- reads that map to different chromosomes (Pysam; paired-end only)
- reads that aren't in FR orientation (Pysam; paired-end only)
- reads where only one read of the pair fails the above criteria (Pysam; paired-end only)
Merge filtered alignments across replicates (picard)
Re-mark duplicates (picard)
Calculate E/L ratio (replication timing) RT-tracks (deepTools)
Normalize RT-tracks:
- Loess-smoothened raw tracks
- Quantile-normalized raw tracks across all samples
- Loess-smoothened quantile-normalized tracks across all samples
Create bigWig files (bedGraphToBigWig)
Present QC for raw reads, alignment and filtering (MultiQC)
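
The ratio and quantile-normalization steps above can be illustrated with a small, dependency-free Python sketch. This is not the pipeline's implementation (the pipeline uses deepTools for the RT-tracks). The log2 transform, the pseudocount of 1, the toy read counts and the function names are all assumptions made for illustration.

```python
import math


def log2_ratio(early, late, pseudocount=1.0):
    """Per-window log2(E/L). The pseudocount (an assumption) avoids division by zero."""
    return [math.log2((e + pseudocount) / (l + pseudocount))
            for e, l in zip(early, late)]


def quantile_normalize(tracks):
    """Force all tracks onto the same value distribution.

    tracks: dict of sample name -> list of values (all lists of equal length).
    Each value is replaced by the mean, across samples, of the values sharing
    its rank. Ties are broken by window order, which is a simplification.
    """
    names = list(tracks)
    n = len(tracks[names[0]])
    sorted_tracks = [sorted(tracks[name]) for name in names]
    rank_means = [sum(t[i] for t in sorted_tracks) / len(names) for i in range(n)]
    normalized = {}
    for name in names:
        order = sorted(range(n), key=lambda i: tracks[name][i])
        out = [0.0] * n
        for rank, idx in enumerate(order):
            out[idx] = rank_means[rank]
        normalized[name] = out
    return normalized


# Toy read counts per window for two conditions (hypothetical numbers).
rt = {
    "shLacZ": log2_ratio(early=[30, 10, 3], late=[10, 10, 15]),
    "shKD": log2_ratio(early=[7, 20, 1], late=[7, 4, 7]),
}
rt_qn = quantile_normalize(rt)
```

After normalization every sample holds the same set of values, only assigned to different windows, which is what makes tracks from different samples directly comparable.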

Quick Start

i. Install nextflow

ii. Install one of docker, singularity or conda

iii. Clone repository

nextflow pull pavrilab/repliseq-nf

iv. Start running your own analysis!

nextflow run pavrilab/repliseq-nf --design design.txt --genome mm9 --single_end

Main arguments

`-profile`

Use this parameter to choose a configuration profile. Profiles provide configuration presets for different compute environments. Multiple profiles can be loaded as a comma-separated list, in which case their order matters. Example: -profile docker

If -profile is not specified, the pipeline runs locally and expects all software to be installed and available on the PATH.

docker
- A generic configuration profile to be used with Docker
- Pulls software from DockerHub: zuberlab/repliseq-nf
singularity
- A generic configuration profile to be used with Singularity
- Pulls software from DockerHub: zuberlab/repliseq-nf

`--design`

You will need to create a design file with information about the samples in your experiment before running the pipeline. Use this parameter to specify its location. It has to be a comma-separated file with 5 columns, and a header row as shown in the examples below.

--design '[path to design file]'

Multiple replicates

The condition identifier is the same for the individual early (E) and late (L) phase samples. For each condition, exactly one E and one L phase group has to be present in order to calculate proper replication timing (RT) tracks from the E/L ratio. When you have multiple replicates from the same phase, increment the replicate identifier accordingly. The first replicate value for any given experimental group must be 1. A final design file may look like the one below, which describes two experimental conditions with each phase in duplicate.

condition,phase,replicate,fastq_1,fastq_2
shLacZ,E,1,shLacZ_S_rep1.fastq.gz,
shLacZ,E,2,shLacZ_S_rep2.fastq.gz,
shLacZ,L,1,shLacZ_L_rep1.fastq.gz,
shLacZ,L,2,shLacZ_L_rep2.fastq.gz,
shKD,E,1,shKD_S_rep1.fastq.gz,
shKD,E,2,shKD_S_rep2.fastq.gz,
shKD,L,1,shKD_L_rep1.fastq.gz,
shKD,L,2,shKD_L_rep2.fastq.gz,

Column	Description
`condition`	Condition of this sample. This will be identical for all phases / replicate samples from the same experimental condition.
`phase`	Phase identifier for sample. Either "E" or "L" for early / late samples. This will be identical for replicate samples from the same phase group.
`replicate`	Integer representing replicate number. Must run from 1 to the number of replicates in the group.
`fastq_1`	Full path to FastQ file for read 1. File has to be gzipped and have the extension ".fastq.gz" or ".fq.gz".
`fastq_2`	Full path to FastQ file for read 2 (paired-end data only; left empty for single-end data, as in the example above). File has to be gzipped and have the extension ".fastq.gz" or ".fq.gz".
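
The rules above can be checked before launching a run. The following is a minimal sketch of such a check, not part of the pipeline; the function name `check_design` and the wording of its messages are hypothetical.

```python
import csv
import io
from collections import defaultdict

EXPECTED_HEADER = ["condition", "phase", "replicate", "fastq_1", "fastq_2"]
FASTQ_SUFFIXES = (".fastq.gz", ".fq.gz")


def check_design(text):
    """Return a list of problems found in design file contents (empty if none)."""
    rows = list(csv.reader(io.StringIO(text.strip())))
    if not rows or rows[0] != EXPECTED_HEADER:
        return ["header must be: " + ",".join(EXPECTED_HEADER)]
    problems = []
    replicates = defaultdict(list)  # (condition, phase) -> replicate numbers
    for line_no, row in enumerate(rows[1:], start=2):
        if len(row) != 5:
            problems.append(f"line {line_no}: expected 5 columns, found {len(row)}")
            continue
        condition, phase, replicate, fastq_1, fastq_2 = row
        if phase not in ("E", "L"):
            problems.append(f"line {line_no}: phase must be E or L")
        if not fastq_1:
            problems.append(f"line {line_no}: fastq_1 is required")
        for path in (fastq_1, fastq_2):
            if path and not path.endswith(FASTQ_SUFFIXES):
                problems.append(f"line {line_no}: {path} is not .fastq.gz or .fq.gz")
        if not replicate.isdigit():
            problems.append(f"line {line_no}: replicate must be an integer")
            continue
        replicates[(condition, phase)].append(int(replicate))
    # Every condition needs both phases, with replicates numbered 1..n.
    for condition in sorted({cond for cond, _ in replicates}):
        for phase in ("E", "L"):
            found = sorted(replicates.get((condition, phase), []))
            if not found:
                problems.append(f"{condition}: no {phase} phase samples")
            elif found != list(range(1, len(found) + 1)):
                problems.append(f"{condition}/{phase}: replicate numbers must be 1..{len(found)}")
    return problems


design = """condition,phase,replicate,fastq_1,fastq_2
shLacZ,E,1,shLacZ_S_rep1.fastq.gz,
shLacZ,E,2,shLacZ_S_rep2.fastq.gz,
shLacZ,L,1,shLacZ_L_rep1.fastq.gz,
shLacZ,L,2,shLacZ_L_rep2.fastq.gz,
"""
print(check_design(design))
```

The check works on the file contents as a string, so it can be run on a design file read with `open(path).read()` without touching the FastQ files themselves.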

Generic arguments

`--single_end`

By default, the pipeline expects paired-end data. If you have single-end data, specify --single_end on the command line when you launch the pipeline.

It is not possible to run a mixture of single-end and paired-end files in one run.
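
A mixed design can be caught early with a check along these lines. This is a hedged sketch rather than pipeline code: the function name `detect_layout` is hypothetical, and it assumes that a row describes single-end data exactly when its fastq_2 column is empty.

```python
import csv
import io


def detect_layout(design_text):
    """Return 'single-end' or 'paired-end'; raise ValueError on a mixture.

    Assumption: a row is single-end when its fastq_2 column is empty.
    """
    reader = csv.DictReader(io.StringIO(design_text.strip()))
    layouts = {
        "paired-end" if (row.get("fastq_2") or "").strip() else "single-end"
        for row in reader
    }
    if not layouts:
        raise ValueError("design file contains no samples")
    if len(layouts) > 1:
        raise ValueError("design file mixes single-end and paired-end samples")
    return layouts.pop()


single = """condition,phase,replicate,fastq_1,fastq_2
shLacZ,E,1,shLacZ_S_rep1.fastq.gz,
shLacZ,L,1,shLacZ_L_rep1.fastq.gz,
"""
print(detect_layout(single))
```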

Reference genomes

The pipeline config files come bundled with paths to the Illumina iGenomes reference index files. If running with Docker or AWS, the configuration is set up to use the AWS-iGenomes resource.

`--genome` (using iGenomes)

There are 31 different species supported in the iGenomes references. To run the pipeline, you must specify which to use with the --genome flag.

You can find the keys to specify the genomes in the iGenomes config file. Common genomes that are supported are:

Human
- --genome GRCh37
Mouse
- --genome GRCm38
Drosophila
- --genome BDGP6
S. cerevisiae
- --genome 'R64-1-1'

There are numerous others - check the config file for more.

Note that you can use the same configuration setup to save sets of reference files for your own use, even if they are not part of the iGenomes resource. See the Nextflow documentation for instructions on where to save such a file.

The syntax for this reference configuration is as follows:

params {
  genomes {
    'GRCh37' {
      fasta   = '<path to the genome fasta file>' // Used if no bwa index given
      bwa     = '<path to the bwa index file>'
    }
    // Any number of additional genomes, key is used with --genome
  }
}

`--fasta`

Full path to the FASTA file containing the reference genome (mandatory if --genome is not specified). If you don't have a BWA index available, one will be generated for you automatically. Combine with --save_reference to save the BWA index for future runs.

--fasta '[path to FASTA reference]'

`--bwa`

Full path to an existing BWA index for your reference genome including the base name for the index.

--bwa '[directory containing BWA index]/genome.fa'
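
Because the value is a base name rather than a single file, it can be worth confirming that the index files sit next to it before starting a run. The sketch below assumes the five file suffixes written by `bwa index` in BWA 0.7.x; the function name and the example path are hypothetical.

```python
from pathlib import Path

# Suffixes written by `bwa index` (BWA 0.7.x) next to the base name.
BWA_INDEX_SUFFIXES = (".amb", ".ann", ".bwt", ".pac", ".sa")


def missing_bwa_index_files(index_base):
    """Return the index files expected for `index_base` that do not exist."""
    return [
        index_base + suffix
        for suffix in BWA_INDEX_SUFFIXES
        if not Path(index_base + suffix).is_file()
    ]


# Hypothetical location; an empty list means the index looks complete.
missing = missing_bwa_index_files("/path/to/index/genome.fa")
if missing:
    print("Missing BWA index files:", ", ".join(missing))
```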

`--windowSize`

Window size in which to calculate E/L ratios (default: 5000).

--windowSize '[size of windows in bp]'
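
To make the effect of the window size concrete, the sketch below tiles a chromosome into fixed-size windows. It assumes non-overlapping, 0-based half-open windows with a shorter final window; how the pipeline itself lays out windows is not specified here, and the function name is hypothetical.

```python
def tile_windows(chrom_length, window_size=5000):
    """Yield (start, end) half-open windows covering [0, chrom_length)."""
    if window_size <= 0:
        raise ValueError("window_size must be positive")
    for start in range(0, chrom_length, window_size):
        yield start, min(start + window_size, chrom_length)


# A 12 kb toy chromosome yields two full windows and one truncated window.
windows = list(tile_windows(12000))
print(windows)  # [(0, 5000), (5000, 10000), (10000, 12000)]
```

Smaller windows give finer resolution but fewer reads per window, so the E/L ratio in each window becomes noisier.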

`--loessSpan`

Span size of the loess smoothing (default: 300000).

--loessSpan '[span size in bp]'
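
With the defaults, a 300000 bp span covers 60 windows of 5000 bp. The sketch below shows one way a span given in bp can drive loess-style smoothing: a local linear fit with tricube weights over all windows within half the span of each position. It is an illustration under that assumption, not the pipeline's smoother, and the function name is hypothetical.

```python
def loess_smooth(positions, values, span_bp=300000):
    """Local linear regression with tricube weights.

    positions: window coordinates in bp; values: RT signal per window.
    Assumption: span_bp is the total width of the neighborhood around
    each position, so points closer than span_bp / 2 contribute.
    """
    half = span_bp / 2.0
    smoothed = []
    for x0 in positions:
        sw = sx = sy = sxx = sxy = 0.0
        for x, y in zip(positions, values):
            dist = abs(x - x0) / half
            if dist >= 1.0:
                continue  # outside the span
            w = (1.0 - dist ** 3) ** 3  # tricube weight
            dx = x - x0  # centering makes the intercept the fit at x0
            sw += w
            sx += w * dx
            sy += w * y
            sxx += w * dx * dx
            sxy += w * dx * y
        denom = sw * sxx - sx * sx
        if denom > 0.0:
            smoothed.append((sxx * sy - sx * sxy) / denom)
        else:
            smoothed.append(sy / sw)  # single point in range: weighted mean
    return smoothed


# A noisy step: smoothing pulls the isolated spike toward its neighbors.
positions = list(range(0, 1000000, 5000))
values = [1.0] * len(positions)
values[100] = 5.0
smoothed = loess_smooth(positions, values)
```

This version is quadratic in the number of windows, which is fine for a demonstration but too slow for whole genomes.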

`--outputDir`

Name of the folder to which the output will be saved (default: results).

--outputDir '[directory name]'

Credits

The pipeline was developed by Tobias Neumann for use at the IMP, Vienna.

The nf-core/rnaseq and nf-core/chipseq pipelines developed by Phil Ewels were initially used as a template for this pipeline. Many thanks to Phil for all of his help and advice, and the team at SciLifeLab.

Many thanks to others who have helped out along the way too, including (but not limited to): @apeltzer, @micans, @pditommaso.

Citations

Pipeline tools

Nextflow

Di Tommaso P, Chatzou M, Floden EW, Barja PP, Palumbo E, Notredame C. Nextflow enables reproducible computational workflows. Nat Biotechnol. 2017 Apr 11;35(4):316-319. doi: 10.1038/nbt.3820. PubMed PMID: 28398311.
BWA

Li H, Durbin R. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics. 2009 Jul 15;25(14):1754-60. doi: 10.1093/bioinformatics/btp324. Epub 2009 May 18. PubMed PMID: 19451168; PubMed Central PMCID: PMC2705234.
BEDTools

Quinlan AR, Hall IM. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics. 2010 Mar 15;26(6):841-2. doi: 10.1093/bioinformatics/btq033. Epub 2010 Jan 28. PubMed PMID: 20110278; PubMed Central PMCID: PMC2832824.
SAMtools

Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R; 1000 Genome Project Data Processing Subgroup. The Sequence Alignment/Map format and SAMtools. Bioinformatics. 2009 Aug 15;25(16):2078-9. doi: 10.1093/bioinformatics/btp352. Epub 2009 Jun 8. PubMed PMID: 19505943; PubMed Central PMCID: PMC2723002.
BamTools

Barnett DW, Garrison EK, Quinlan AR, Strömberg MP, Marth GT. BamTools: a C++ API and toolkit for analyzing and managing BAM files. Bioinformatics. 2011 Jun 15;27(12):1691-2. doi: 10.1093/bioinformatics/btr174. Epub 2011 Apr 14. PubMed PMID: 21493652; PubMed Central PMCID: PMC3106182.
UCSC tools

Kent WJ, Zweig AS, Barber G, Hinrichs AS, Karolchik D. BigWig and BigBed: enabling browsing of large distributed datasets. Bioinformatics. 2010 Sep 1;26(17):2204-7. doi: 10.1093/bioinformatics/btq351. Epub 2010 Jul 17. PubMed PMID: 20639541; PubMed Central PMCID: PMC2922891.
deepTools

Ramírez F, Ryan DP, Grüning B, Bhardwaj V, Kilpert F, Richter AS, Heyne S, Dündar F, Manke T. deepTools2: a next generation web server for deep-sequencing data analysis. Nucleic Acids Res. 2016 Jul 8;44(W1):W160-5. doi: 10.1093/nar/gkw257. Epub 2016 Apr 13. PubMed PMID: 27079975; PubMed Central PMCID: PMC4987876.
MultiQC

Ewels P, Magnusson M, Lundin S, Käller M. MultiQC: summarize analysis results for multiple tools and samples in a single report. Bioinformatics. 2016 Oct 1;32(19):3047-8. doi: 10.1093/bioinformatics/btw354. Epub 2016 Jun 16. PubMed PMID: 27312411; PubMed Central PMCID: PMC5039924.
FastQC
Trim Galore!
picard-tools

R packages

R

R Core Team (2017). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria.
preprocessCore

Bolstad B (2019). preprocessCore: A collection of pre-processing functions.
getopt

Trevor L Davis (2010). getopt: C-Like 'getopt' Behavior.

Software packaging/containerisation tools

Bioconda

Grüning B, Dale R, Sjödin A, Chapman BA, Rowe J, Tomkins-Tinch CH, Valieris R, Köster J; Bioconda Team. Bioconda: sustainable and comprehensive software distribution for the life sciences. Nat Methods. 2018 Jul;15(7):475-476. doi: 10.1038/s41592-018-0046-7. PubMed PMID: 29967506.
Anaconda

Anaconda Software Distribution. Computer software. Vers. 2-2.4.0. Anaconda, Nov. 2016. Web.
Singularity

Kurtzer GM, Sochat V, Bauer MW. Singularity: Scientific containers for mobility of compute. PLoS One. 2017 May 11;12(5):e0177459. doi: 10.1371/journal.pone.0177459. eCollection 2017. PubMed PMID: 28494014; PubMed Central PMCID: PMC5426675.
Docker
