Exon quantification pipeline (EQP-cluster)
EQP-cluster is a Unix-based RNA-seq quantification pipeline that takes a set of sample Fastq files as input, aligns them against reference files, and generates files with the gene, exon, or junction counts for each sample.
It provides scripts to facilitate the distributed execution of the
alignment and quantification operations for each of the samples and
is designed to support the Univa* Grid Engine (UGE)
batch-queuing/scheduling system (job submission via qsub).
If only the quantification step is of interest and your Fastq files are
pre-aligned, then you can also use the companion quantification module,
a Unix-based RNA-seq module that takes SAM/BAM genome alignment files as input and creates gene, exon, and junction counts.
* Univa® is a registered trademark of Univa Corporation
This document corresponds to EQP-cluster version 2.1.0
Installation of EQP-cluster
After downloading and unpacking the files from GitHub, the download
directory should contain the following entries:
LICENSE.txt bin/ exon-pipeline-scripts/
README.md exon-pipeline-files.tgz samples.tgz
To complete the installation of EQP, go to the
bin directory and
call the script create-bowtie2-indices.sh.
create-bowtie2-indices.sh requires that the Bowtie2 programs (in
particular bowtie2-build) are installed; see below for the
complete list of prerequisites.
create-bowtie2-indices.sh extracts the files from the archive
exon-pipeline-files.tgz into the directory exon-pipeline-files
and launches three calls to
bowtie2-build in the background to construct
the Bowtie2 indices of the genome Fasta file, the transcript Fasta file,
and a custom junction Fasta file. The computation of the Bowtie2 indices
will take several hours.
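The installation steps above can be sketched as follows; the project path is a placeholder for your own download directory:

```shell
# Sketch of the installation described above; replace the path with
# your own download directory.
cd /path/to/EQP-cluster/bin
./create-bowtie2-indices.sh
# The script unpacks exon-pipeline-files.tgz and starts three
# bowtie2-build runs in the background; expect several hours.
```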
Note that EQP-cluster currently makes use of human Ensembl76 reference files (on which the files contained in the directory
exon-pipeline-files are based)
and is, thus, configured to quantify only human data; furthermore, due to
the way the custom junction Fasta file is created, EQP-cluster
works best with reads of length 101bp or less.
Other reference file packages will be made available in the future.
Once the Bowtie2 indices are computed, EQP-cluster is ready to be used.
The directory containing the subdirectories
bin, exon-pipeline-files, exon-pipeline-scripts, and samples will be called
<project directory> in the following.
Prerequisites
- Python (>= version 2.6.5, imported libraries: copy, gettext, gzip, numpy, os, re, sets, sys, textwrap, warnings)
- Java (>= version 1.6)
- samtools (>= version 0.1.17)
- bedtools (>= version 2.24.0)
- bowtie2 (>= version 2.0.5)
- Univa® Grid Engine
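A quick way to verify that the prerequisites are available on the PATH is the loop below (a convenience sketch, not part of EQP-cluster itself):

```shell
# Report which of the required tools are available on this machine.
for tool in python java samtools bedtools bowtie2 qsub; do
  if command -v "$tool" >/dev/null 2>&1; then
    echo "found:   $tool"
  else
    echo "missing: $tool"
  fi
done
```

Note that this only checks for the presence of the tools, not their versions.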
For the processing of samples EQP relies on a fairly rigid directory
structure in which reference files are located at predefined locations in the
exon-pipeline-files subdirectory, scripts and programs are
located at predefined locations in the bin and exon-pipeline-scripts
subdirectories, and Fastq files are located at predefined locations in the
samples subdirectory. In the following we describe how to set up a project-specific
directory structure in the samples subdirectory.
Creating a samples directory
For each of your samples you need to create a subdirectory in
<project directory>/samples. For each sample directory EQP assumes that
it contains one or two Fastq files (or links to Fastq files) whose names
end either in .fastq.gz or in _1.fastq.gz and _2.fastq.gz,
depending on whether you want to process single-read or paired-end
reads. See the document
manual/EQP-manual.pdf for more information on
how to create sample directories in a more automated way.
After the creation of the sample directories, the content of the
<project directory> should look as follows:
LICENSE.txt bin/ exon-pipeline-scripts/
README.md exon-pipeline-files/ samples/
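For example, a paired-end sample directory could be set up as below; the sample name, the file-name prefix, and the source paths are placeholders for your own data:

```shell
# Create a directory for one paired-end sample inside samples/ and
# link its Fastq files. "sample1" and the /data/... paths are
# placeholders, not EQP-mandated names.
mkdir -p samples/sample1
ln -s /data/runs/run42/sample1_1.fastq.gz samples/sample1/
ln -s /data/runs/run42/sample1_2.fastq.gz samples/sample1/
```

For a single-read sample, a single link to a file ending in .fastq.gz suffices.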
Generating the counts
After the creation of the sample directories EQP is ready for execution
using the script
process-samples.sh in the bin subdirectory. A
typical (and complete) processing of the Fastq files in the samples
subdirectory is accomplished by the following calls:
1. Split the input Fastq files into chunks of 25M reads and filter out reads that originate completely from the poly-A tails. Please note that this step needs to be executed even if the input Fastq files contain fewer than 25M reads, as EQP changes the read identifiers in this step to a format that is required by the subsequent steps. All sub-processes must finish before running the next command.
2. Align all chunk Fastq files against the transcript, genome, and junction Fasta reference files.
3. Generate the gene, exon, and junction counts for each chunk. This step can be executed directly after the alignment step, without waiting for the completion of the alignment jobs. Again, all sub-processes must finish before the next call.
4. Combine the counts of all chunks of one sample into an aggregate count, and combine the counts for the samples into a file with a matrix structure, with the gene, exon, or junction identifiers as the rows and the counts for each sample as the columns.
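Based on the subcommand names that appear later in this document (split, compute-counts, combine-counts; the spelling "align" is an assumption), the four calls could look like this when issued from the bin directory:

```shell
# Hedged sketch of a complete run; subcommand spellings other than
# split, compute-counts, and combine-counts are assumptions.
./process-samples.sh split            # 1. split and filter the Fastq files
# ...wait for all split jobs to finish...
./process-samples.sh align            # 2. align chunks against the references
./process-samples.sh compute-counts   # 3. may start right after alignment
# ...wait for all alignment and count jobs to finish...
./process-samples.sh combine-counts   # 4. runs on the local machine
```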
Please read the document
manual/EQP-manual.pdf for a more detailed
description of the different operations and the various options, for
instance, how to check the status of the submitted jobs.
Location of the results
The results can then be found in the files:
<process directory>/samples/count-files/<process directory base>-bowtie2-pe-<count type>.cnt
where <process directory base> is the base name of <process directory>, that is, just the name of the directory without the complete path, and
<count type> is one of gene, exon, or junction.
EQP-cluster and UGE
EQP is designed to work with the Univa® Grid Engine (UGE, formerly Sun
Grid Engine, SGE). Most of the processing steps are submitted to the UGE
scheduler via the
qsub command (with the notable exception of
process-samples.sh combine-counts, which is executed on the machine on
which the command is issued). Once submitted, a job is then executed
asynchronously on a node of the cluster managed by UGE. Below we show which
resources are used by the
qsub call of EQP. It is also possible to run
EQP without UGE. In this case the option:
-noqsub <number of cores>
needs to be supplied to all commands. With this option EQP launches
processes in the background until the number of cores given by the
option is exhausted. Usually, the different commands require more than one
core per process; split requires one core.
Note that the number of cores used by EQP will be at
most the number given after
-noqsub. If this number is smaller than the
minimum number of required cores for one process, then, nevertheless, at
least one process will be launched.
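For instance, to run the split step locally on a machine with eight cores (the placement of the option before the subcommand is an assumption):

```shell
# Run the split step without UGE, using at most 8 cores worth of
# background processes; option placement is assumed, not documented here.
./process-samples.sh -noqsub 8 split
```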
The format of the qsub command used by EQP is as follows:
qsub -V -l m_mem_free=<mem> -l h_rt=<run time> -N <job name> -hold_jid <dependence jobs>
-o <log file> -pe smp <num cores> <command>
The parameters -V, -N, -hold_jid, -o, and
-pe smp are generally
accepted by UGE
qsub, whereas the parameters
-l m_mem_free=<mem> and -l h_rt=<run time> are dependent on the configuration of the UGE instance.
Here <mem> is the memory requirement per core, and <run time>
is either given as a number of seconds (14400 for four hours, the default
value) or as hours (12:00:00 for compute-counts, as it sometimes may
take more than four hours).
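As an illustration, a call following the qsub format above might look as follows; the memory value, job names, log path, and the script name align-chunk.sh are made-up examples, not EQP defaults:

```shell
# Example instantiation of the qsub format above; all concrete values
# (memory, job names, paths, align-chunk.sh) are illustrative placeholders.
qsub -V -l m_mem_free=4G -l h_rt=14400 -N eqp-align-sample1 \
     -hold_jid eqp-split-sample1 -o logs/eqp-align-sample1.log \
     -pe smp 4 align-chunk.sh
```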
If your UGE instance does not accept these parameters
or requires other parameters (e.g. a queue name), the recommendation is
to edit the script
process-samples.sh and to adapt the calls to
qsub in the function
submitJob, which is the only place where
jobs are actually submitted.
Testing EQP-cluster
In order to test EQP, just unpack the file samples.tgz in the <project directory>:
tar xfz samples.tgz
Then go to the
bin directory and run the test script.
This executes the sequence of split, align, compute-counts, and
combine-counts for the two samples contained in the
samples directory, using the -noqsub option.
If you want to test whether the submission of jobs via UGE works, just
execute the four commands
as described above and compare the newly computed counts in the directory
samples/count-files with the previously computed counts in the directory
samples/comparison-counts.
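One generic way to compare the two count directories is a recursive diff (a standard Unix technique, not an EQP-specific script):

```shell
# Compare freshly computed counts against the shipped reference counts;
# no output and exit status 0 mean the directories agree.
diff -r samples/count-files samples/comparison-counts
```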
Copyright 2015 Novartis Institutes for Biomedical Research
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.