(If necessary, begin by installing Nextflow and Conda as described in
Quickstart
)Nextflow allows the pipeline to be downloaded simply by using the "nextflow run" command, and providing the user/repository path to the relevant github repository. CnR-flow has a one-step installation mode that can be performed simultaneously ( :cl_param:--mode initiate ).
Together, this gives the command:$ nextflow run rennelab/CnR-flow --mode initiateThis should download the pipeline, install the few required local files and dependencies, and prepare the pipeline for execution.Note
After the initial download, the pipeline can then be referredto by name, as with: | "nextflow run CnR-flow ..."
cnr-flow comes preconfigured to utilize the conda environment management together with bioconda packages to handle dependency utilization.
for more details, or for an alternative configuration, seedependency config
either using the default dependency configuration, or with a user's custom configuration, the dependency setup can then be tested using the :cl_param:--mode validate or :cl_param:--mode validate_all parameters.# validate all workflow dependencies (recommended) $ nextflow run CnR-flow --mode validate_all
# validate only steps enabled in nextflow.config $ nextflow run CnR-flow --mode validate
CnR-flow provides one-step preparation of alignment reference genome(s) for use by the pipeline. Either the local path or URL of a fasta file are provided to theref_fasta
paramater, and the execution is performed with:$ nextflow run CnR-flow --mode prep_fastaThis copies the reference fasta to the directory specified byref_dir
, creates a bowtie2 alignment reference, creates a fasta index using Samtools, and creates a ".chrom.sizes" file using UCSC faCount [faCount_Citation]. The effective genome size is also calculated with faCount, using the (Total - N's) method. Reference details are written to a ".refinfo.txt" in the same directory.Note
If normalization is enabled, the same process will be repeated for the fasta file supplied to
norm_ref_fasta
for alignments to the spike-in control genome.These referenes are then detected automatically, using the same parameter used for preparation setup. For more details, seeTask Setup
. The list of all detectable prepared databases can be provided using the :cl_param:mode list_refs run mode:$ nextflow run CnR-flow --mode list_refs
CUT&RUN-Flow allows automated handling of treatment (Ex: H3K4me3) and and control (Ex: IgG) input files, performing the analysis steps on each condition in parallel, and then associating the treatment with the control for the final peak calling step. This can be performed either with a single treatment/control group, or with multiple groups in parallel. For more information, seeTask Setup
.
This step is enabled with paramater :flag_param:do_retrim (default:
true
). This step takes one example input fastq[.gz] file and determines the sequence length, for use in later steps.
This step is enabled with paramater :flag_param:do_merge_lanes (default:
true
). If multiple sets of paired end files are provided that differ only by the "_L00#_" component of the name, these sequences are concatenated for further analysis.For example, these files will be merged into the common name: 'my_sample_CTRL_R(1/2)_001.fastq.gz':
./my_sample_CTRL_L001_R1_001.fastq.gz ./my_sample_CTRL_L001_R2_001.fastq.gz ./my_sample_CTRL_L002_R1_001.fastq.gz ./my_sample_CTRL_L002_R2_001.fastq.gz #... --> ./my_sample_CTRL_R1_001.fastq.gz ./my_sample_CTRL_R2_001.fastq.gz
This step is enabled with paramater :flag_param:do_fastqc (default:
true
). FastQC [FastQC_Citation] is utilized to perform quality control checks on the input (presumably untrimmed) fastq[.gz] files.
This step is enabled with paramater :flag_param:do_trim (default:true
). Trimming of input fastq[.gz] files for read quality and adapter content is performed by Trimmomatic [Trimmomatic_Citation].
Default trimming parameters:Default flags:
This step is enabled with paramater :flag_param:do_retrim (default:true
). Trimming of input fastq[.gz] files is performed by the kseq_test executable from the CUTRUNTools toolkit [CUTRUNTools_Citation]. It is designed to identify and remove very short adapter sequences from tags that were potentially missed by previous trimming steps.
This step is enabled with paramater :flag_param:do_fastqc (default:
true
). FastQC [FastQC_Citation] is utilized to perform quality control checks on sequences after any/all trimming steps are performed.
Sequence reads are aligned to the reference genome using Bowtie2 [Bowtie2_Citation].
Default alignment parameters were selected using concepts presented in work by the Henikoff Lab [Meers2019] and the Yuan Lab [CUTRUNTools_Citation].
Default flags:Warning
None of the output alignments (.sam/.bam/.cram) files produced in this step (or indeed, anywhere else in the pipeline) are normalized. The only normalized outputs are are genome genome coverage tracks produced if normalization is enabled.
Output alignments are then subjected to several cleaning, filtering, and preprocessing steps utilizing Samtools [Samtools_Citation].
These are:
- Removal of unmapped reads (samtools view)
- Sorting by name (samtools sort [required for fixmate])
- Adding/correcting mate pair information (samtools fixmate -m)
- Sorting by genome coordinate (samtools sort)
- Marking duplicates (samtools mkdup)
- ( Optional Processing Steps [ see below ] )
- Alignment compression BAM -> CRAM (samtools view)
- Alignment indexing (samtools index)
Optional processing steps include:
- Removal of Duplicates
- Filtering to reads <= 120 bp in length
The desired category (or categories) of output are selected withuse_aln_modes
. Multiple categores can be specifically selected using :config_param:use_aln_modes as a list, and the resulting selections are analyzed and output in parallel.
(Example: :config_param:use_aln_modes ['all', 'less_120_dedup'])
Option Deduplicated Length <= 120bp all false false all_dedup true false less_120 false true less_120_dedup true true Default mode:
Further cleaning steps are then performed on the outputs, to prepare the alignments for (optional) normalization and peak calling.
These modifications are performed as suggested by the Henikoff lab in the documentation for SEACR, https://github.com/FredHutch/SEACR/blob/master/README.md [SEACR_Citation] , and are performed utilizing Samtools [Samtools_Citation] and bedtools [bedtools_Citation].These are:
- Sorting by name and uncompress CRAM -> BAM (samtools sort)
- Covert BAM to bedgraph (bedtools bamtobed)
- Filter discordant tag pairs (awk)
- Change bedtools bed format to BEDPE format (cut | sort)
- Convert BEDPE to (non-normalized) bedgraph (bedtools genomecov)
Note
Genome coverage tracks output by this step are NOT normalized.
This step is enabled with paramater :flag_param:do_norm_spike (default:true
).
This step calculates a normalization factor for scaling output genome coverage tracks.
- Strategy:
A dual-alignment strategy is used to filter out any reads that cross-map to both the primary reference and the normalization references. Sequence pairs that align to the normalization reference are then re-aligned to the primary reference. The number of read pairs that align to both references is then subtracted from the normalization factor output by this step, depending on the value of
norm_mode
(default: obj:true).- Details:
Sequence reads are first aligned to the normalization reference genome using Bowtie2 [Bowtie2_Citation]. Default alignment parameters are the same as with alignment to the primary reference genome.Default flags:
All reads that aligned to the normalization reference are then again aligned to the primary reference using Bowtie2 [Bowtie2_Citation].
Counts are then performed of pairs of sequence reads that align (and re-align, respectively) to each reference using Samtools [Samtools_Citation] (viasamtools view
). The count of aligned pairs to the spike-in genome reference is then returned, with the number of cross-mapped pairs subtracted depending on the value ofnorm_mode
.
norm_mode Normalization Factor Used all norm_ref_aligned (pairs) adj norm_ref_aligned - cross_map_aligned (pairs)
This step is enabled with paramater :flag_param:do_norm_spike (default:true
).
This step uses a normalization factor to create scaled genome coverage tracks. The calculation as provided by the Henikoff Lab: https://github.com/Henikoff/Cut-and-Run/blob/master/spike_in_calibration.csh [Meers2019] is:countnorm = (countsite * norm_scale)/norm_factor
Thus, the scaling factor used is calucated as:scale_factor = norm_scale/norm_factor
Wherenorm_factor
is calculated in the previous step, and the arbitrarynorm_scale
is provided by the parameter:norm_scale
.
Defaultnorm_scale
value:The normalized genome coverage track is then created by bedtools [bedtools_Citation] using the-scale
option.
This step is enabled with paramater :flag_param:do_make_bigwig (default:true
).
This step converts the output genome coverage file from the previous steps as in the UCSC bigWig file format using UCSC bedGraphToBigWig, a genome coverage format with significantly decreased file size [bedGraphToBigWig_Citation].Warning
The bigWig file format is a "lossy" file format that cannot be reconverted to bedGraph with all information intact.
peak_callers
. This can either be provided a single argument, as with:
peak_callers seacr
:config_param:peak-callers ['macs', 'seacr']
peak_callers
value:This step is enabled ifmacs
is included in :config_param:peak-callers.
This step calls peaks using the non-normalized alignment data produced in previous steps, using the MACS2 peak_caller [MACS2_Citation]Default MACS2 Settings:
This step is enabled ifseacr
is included in :config_param:peak_callers.
This step calls peaks using the normalized alignment data produced in previous steps (if normalization is enabled, using the SEACR peak caller [SEACR_Citation].
Special thanks to the Henikoff group for their permission to distribute SEACR bundled with this pipeline.
- Parameters:
seacr_norm_mode
passes eithernorm
ornon
to SEACR. Options:
'auto'
:
- if
do_norm = true
- Passes'non'
to SEACR- if
do_norm = false
- passes'norm'
to SEACR'norm'
'non'
seacr_fdr
is passed directly to SEACR.seacr_call_stringent
- SEACR is called in "stringent" mode.
seacr_call_relaxed
- SEACR is called in "relaxed" mode.
(If both of these are true, both outputs will be produced)Default SEACR Settings: