repliseq-nf is a bioinformatics analysis pipeline used for Repli-seq data.
The pipeline is built using Nextflow, a workflow tool that runs tasks across multiple compute infrastructures in a portable manner. It comes with Docker containers, which make installation trivial and results highly reproducible.
We loosely follow the steps of the Nature Protocols paper for processing E/L Repli-seq next-generation sequencing data.
- Raw read QC (FastQC)
- Adapter trimming (Trim Galore!)
- Alignment (BWA)
- Filter out:
  - reads that are marked as duplicates (SAMtools)
  - reads that aren't marked as primary alignments (SAMtools)
  - reads that are unmapped (SAMtools)
  - reads that map to multiple locations (SAMtools)
  - reads containing > 4 mismatches (BAMTools)
  - reads that are soft-clipped (BAMTools)
  - reads that have an insert size > 2kb (BAMTools; paired-end only)
  - reads that map to different chromosomes (Pysam; paired-end only)
  - reads that aren't in FR orientation (Pysam; paired-end only)
  - reads where only one read of the pair fails the above criteria (Pysam; paired-end only)
- Merge filtered alignments across replicates (picard)
- Re-mark duplicates (picard)
- Calculate E/L ratio (replication timing) RT-tracks (deepTools)
- Normalize RT-tracks:
  - Loess-smoothened raw tracks
  - Quantile-normalized raw tracks across all samples
  - Loess-smoothened quantile-normalized tracks across all samples
- Create bigWig files (bedGraphToBigWig)
- Present QC for raw reads, alignment and filtering (MultiQC)
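Some of the read-level filters listed above can be illustrated with a short Python sketch. This is not the pipeline's own code (which relies on SAMtools, BAMTools and Pysam): the `Read` class and both function names are hypothetical, and only the flag bit values are taken from the SAM format. The mismatch, soft-clipping and multi-mapping filters are left out.

```python
from dataclasses import dataclass

# SAM flag bits as defined in the SAM format specification
PAIRED = 0x1
UNMAPPED = 0x4
REVERSE = 0x10
MATE_REVERSE = 0x20
SECONDARY = 0x100
DUPLICATE = 0x400


@dataclass
class Read:
    """Hypothetical minimal view of one alignment record (illustration only)."""
    flag: int
    chrom: str
    pos: int
    mate_chrom: str
    mate_pos: int
    insert_size: int


def passes_flag_filters(read):
    """Reject duplicates, non-primary alignments and unmapped reads."""
    return not (read.flag & (DUPLICATE | SECONDARY | UNMAPPED))


def passes_pair_filters(read, max_insert=2000):
    """Paired-end checks: same chromosome, insert size <= 2 kb, FR orientation."""
    if not read.flag & PAIRED:
        return True  # single-end reads are not subject to these checks
    if read.chrom != read.mate_chrom:
        return False
    if abs(read.insert_size) > max_insert:
        return False
    reverse = bool(read.flag & REVERSE)
    mate_reverse = bool(read.flag & MATE_REVERSE)
    if reverse == mate_reverse:
        return False  # both mates on the same strand, so not FR
    # FR: the forward-strand mate starts at or upstream of the reverse-strand mate
    if reverse:
        fwd_pos, rev_pos = read.mate_pos, read.pos
    else:
        fwd_pos, rev_pos = read.pos, read.mate_pos
    return fwd_pos <= rev_pos
```

A read would be kept only if both functions return True; the pipeline additionally drops a whole pair when either mate fails.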
i. Install nextflow
ii. Install one of docker, singularity or conda
iii. Clone repository
nextflow pull pavrilab/repliseq-nf
iv. Start running your own analysis!
nextflow run pavrilab/repliseq-nf --design design.txt --genome mm9 --singleEnd
Use this parameter to choose a configuration profile. Profiles can give configuration presets for different compute environments. Note that multiple profiles can be loaded, for example: -profile docker. The order of arguments is important!
If -profile is not specified at all, the pipeline will be run locally and expects all software to be installed and available on the PATH.
docker
- A generic configuration profile to be used with Docker
- Pulls software from DockerHub: zuberlab/repliseq-nf
singularity
- A generic configuration profile to be used with Singularity
- Pulls software from DockerHub: zuberlab/repliseq-nf
You will need to create a design file with information about the samples in your experiment before running the pipeline. Use this parameter to specify its location. It has to be a comma-separated file with 5 columns, and a header row as shown in the examples below.
--design '[path to design file]'
The condition identifier is the same for the individual early (E) and late (L) phase samples. For each condition, exactly one E and one L phase group has to be present in order to calculate proper replication timing (RT) tracks from the E/L ratio. When you have multiple replicates from the same phase, just increment the replicate identifier appropriately. The first replicate value for any given experimental group must be 1. A final design file may look something like the one below. This is for two experimental conditions, with each phase in duplicate.
condition,phase,replicate,fastq_1,fastq_2
shLacZ,E,1,shLacZ_S_rep1.fastq.gz,
shLacZ,E,2,shLacZ_S_rep2.fastq.gz,
shLacZ,L,1,shLacZ_L_rep1.fastq.gz,
shLacZ,L,2,shLacZ_L_rep2.fastq.gz,
shKD,E,1,shKD_S_rep1.fastq.gz,
shKD,E,2,shKD_S_rep2.fastq.gz,
shKD,L,1,shKD_L_rep1.fastq.gz,
shKD,L,2,shKD_L_rep2.fastq.gz,
Column | Description |
---|---|
condition | Condition of this sample. This will be identical for all phases / replicate samples from the same experimental condition. |
phase | Phase identifier for sample. Either "E" or "L" for early / late samples. This will be identical for replicate samples from the same phase group. |
replicate | Integer representing the replicate number. Must run from 1 to <number of replicates>. |
fastq_1 | Full path to FastQ file for read 1. File has to be zipped and have the extension ".fastq.gz" or ".fq.gz". |
fastq_2 | Full path to FastQ file for read 2. File has to be zipped and have the extension ".fastq.gz" or ".fq.gz". |
By default, the pipeline expects paired-end data. If you have single-end data, specify --single_end on the command line when you launch the pipeline.
It is not possible to run a mixture of single-end and paired-end files in one run.
The pipeline config files come bundled with paths to the Illumina iGenomes reference index files. If running with Docker or AWS, the configuration is set up to use the AWS-iGenomes resource.
There are 31 different species supported in the iGenomes references. To run the pipeline, you must specify which to use with the --genome flag.
You can find the keys to specify the genomes in the iGenomes config file. Common genomes that are supported are:
- Human: --genome GRCh37
- Mouse: --genome GRCm38
- Drosophila: --genome BDGP6
- S. cerevisiae: --genome 'R64-1-1'
There are numerous others - check the config file for more.
Note that you can use the same configuration setup to save sets of reference files for your own use, even if they are not part of the iGenomes resource. See the Nextflow documentation for instructions on where to save such a file.
The syntax for this reference configuration is as follows:
params {
  genomes {
    'GRCh37' {
      fasta = '<path to the genome fasta file>' // Used if no bwa index given
      bwa = '<path to the bwa index file>'
    }
    // Any number of additional genomes, key is used with --genome
  }
}
Full path to the FASTA file containing the reference genome (mandatory if --genome is not specified). If you don't have a BWA index available, one will be generated for you automatically. Combine with --save_reference to save the BWA index for future runs.
--fasta '[path to FASTA reference]'
Full path to an existing BWA index for your reference genome including the base name for the index.
--bwa '[directory containing BWA index]/genome.fa'
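Because --bwa expects the index base name rather than a directory, it can help to confirm that the index files exist under that prefix before starting a run. The sketch below assumes the index was built with `bwa index`, which writes files with the suffixes .amb, .ann, .bwt, .pac and .sa; the function name is hypothetical and not part of the pipeline.

```python
from pathlib import Path

# Suffixes written by `bwa index` next to the given base name
BWA_INDEX_SUFFIXES = (".amb", ".ann", ".bwt", ".pac", ".sa")


def missing_bwa_index_files(prefix):
    """Return the index files that are absent for the given base name."""
    return [prefix + suffix for suffix in BWA_INDEX_SUFFIXES
            if not Path(prefix + suffix).is_file()]
```

An empty list means all five files were found for that prefix, for example `missing_bwa_index_files('/refs/mm9/genome.fa')` with an illustrative path.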
Window size in which to calculate E/L ratios (default: 5000).
--windowSize '[size of windows in bp]'
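To make the role of the window size concrete, here is a rough Python sketch of a windowed log2 E/L ratio. The pipeline itself computes RT-tracks with deepTools, so this is only an illustration under assumptions: read positions are given as plain integers on one chromosome, counts are scaled by library size, and a pseudocount of 1 (an arbitrary choice here) avoids division by zero. Both function names are hypothetical.

```python
import math
from collections import Counter


def window_counts(positions, window_size=5000):
    """Count reads per window, keyed by the window start coordinate."""
    return Counter((pos // window_size) * window_size for pos in positions)


def log2_el_ratio(early, late, window_size=5000, pseudocount=1.0):
    """Return {window_start: log2(E/L)} after scaling by library size."""
    e_counts = window_counts(early, window_size)
    l_counts = window_counts(late, window_size)
    e_total = sum(e_counts.values()) or 1
    l_total = sum(l_counts.values()) or 1
    ratios = {}
    for start in sorted(set(e_counts) | set(l_counts)):
        # Counter returns 0 for windows that have no reads in one phase
        e_norm = (e_counts[start] + pseudocount) / e_total
        l_norm = (l_counts[start] + pseudocount) / l_total
        ratios[start] = math.log2(e_norm / l_norm)
    return ratios
```

Positive values mark windows enriched in the early fraction and negative values mark windows enriched in the late fraction. Smaller windows give finer resolution but noisier ratios.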
Span size of the loess smoothing (default: 300000).
--loessSpan '[span size in bp]'
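The effect of the span can be illustrated with a simplified smoother. The sketch below is an assumption-laden stand-in, not the pipeline's loess implementation: it fits a tricube-weighted local line within a fixed window of span_bp centred on each position, whereas standard loess chooses its neighbourhood as a fraction of the data points. The function name is hypothetical.

```python
def local_linear_smooth(positions, values, span_bp=300000):
    """Smooth values with a tricube-weighted local linear fit of fixed width."""
    half_width = span_bp / 2
    smoothed = []
    for x0 in positions:
        sw = swx = swy = swxx = swxy = 0.0
        for x, y in zip(positions, values):
            dx = x - x0  # centring on x0 keeps the arithmetic well conditioned
            u = abs(dx) / half_width
            if u >= 1:
                continue  # outside the window
            w = (1 - u ** 3) ** 3  # tricube weight
            sw += w
            swx += w * dx
            swy += w * y
            swxx += w * dx * dx
            swxy += w * dx * y
        denom = sw * swxx - swx * swx
        if denom <= 0:
            # Too few distinct positions for a line: fall back to a weighted mean
            smoothed.append(swy / sw)
        else:
            # Intercept of the weighted least-squares line, i.e. the fit at x0
            smoothed.append((swxx * swy - swx * swxy) / denom)
    return smoothed
```

A larger span averages over more windows and yields smoother tracks, at the cost of blurring sharp replication timing transitions.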
Name of the folder to which the output will be saved (default: results).
--outputDir '[directory name]'
The pipeline was developed by Tobias Neumann for use at the IMP, Vienna.
The nf-core/rnaseq and nf-core/chipseq pipelines developed by Phil Ewels were initially used as a template for this pipeline. Many thanks to Phil for all of his help and advice, and the team at SciLifeLab.
Many thanks to others who have helped out along the way too, including (but not limited to): @apeltzer, @micans, @pditommaso.
- Di Tommaso P, Chatzou M, Floden EW, Barja PP, Palumbo E, Notredame C. Nextflow enables reproducible computational workflows. Nat Biotechnol. 2017 Apr 11;35(4):316-319. doi: 10.1038/nbt.3820. PubMed PMID: 28398311.
- Li H, Durbin R. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics. 2009 Jul 15;25(14):1754-60. doi: 10.1093/bioinformatics/btp324. Epub 2009 May 18. PubMed PMID: 19451168; PubMed Central PMCID: PMC2705234.
- Quinlan AR, Hall IM. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics. 2010 Mar 15;26(6):841-2. doi: 10.1093/bioinformatics/btq033. Epub 2010 Jan 28. PubMed PMID: 20110278; PubMed Central PMCID: PMC2832824.
- Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R; 1000 Genome Project Data Processing Subgroup. The Sequence Alignment/Map format and SAMtools. Bioinformatics. 2009 Aug 15;25(16):2078-9. doi: 10.1093/bioinformatics/btp352. Epub 2009 Jun 8. PubMed PMID: 19505943; PubMed Central PMCID: PMC2723002.
- Barnett DW, Garrison EK, Quinlan AR, Strömberg MP, Marth GT. BamTools: a C++ API and toolkit for analyzing and managing BAM files. Bioinformatics. 2011 Jun 15;27(12):1691-2. doi: 10.1093/bioinformatics/btr174. Epub 2011 Apr 14. PubMed PMID: 21493652; PubMed Central PMCID: PMC3106182.
- Kent WJ, Zweig AS, Barber G, Hinrichs AS, Karolchik D. BigWig and BigBed: enabling browsing of large distributed datasets. Bioinformatics. 2010 Sep 1;26(17):2204-7. doi: 10.1093/bioinformatics/btq351. Epub 2010 Jul 17. PubMed PMID: 20639541; PubMed Central PMCID: PMC2922891.
- Ramírez F, Ryan DP, Grüning B, Bhardwaj V, Kilpert F, Richter AS, Heyne S, Dündar F, Manke T. deepTools2: a next generation web server for deep-sequencing data analysis. Nucleic Acids Res. 2016 Jul 8;44(W1):W160-5. doi: 10.1093/nar/gkw257. Epub 2016 Apr 13. PubMed PMID: 27079975; PubMed Central PMCID: PMC4987876.
- Ewels P, Magnusson M, Lundin S, Käller M. MultiQC: summarize analysis results for multiple tools and samples in a single report. Bioinformatics. 2016 Oct 1;32(19):3047-8. doi: 10.1093/bioinformatics/btw354. Epub 2016 Jun 16. PubMed PMID: 27312411; PubMed Central PMCID: PMC5039924.
- R Core Team (2017). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria.
- Bolstad B (2019). preprocessCore: A collection of pre-processing functions.
- Trevor L Davis (2010). getopt: C-Like 'getopt' Behavior.
- Grüning B, Dale R, Sjödin A, Chapman BA, Rowe J, Tomkins-Tinch CH, Valieris R, Köster J; Bioconda Team. Bioconda: sustainable and comprehensive software distribution for the life sciences. Nat Methods. 2018 Jul;15(7):475-476. doi: 10.1038/s41592-018-0046-7. PubMed PMID: 29967506.
- Anaconda Software Distribution. Computer software. Vers. 2-2.4.0. Anaconda, Nov. 2016. Web.
- Kurtzer GM, Sochat V, Bauer MW. Singularity: Scientific containers for mobility of compute. PLoS One. 2017 May 11;12(5):e0177459. doi: 10.1371/journal.pone.0177459. eCollection 2017. PubMed PMID: 28494014; PubMed Central PMCID: PMC5426675.