This repository tests the performance of rnabridge-align and rnabridge-denovo. Here we provide scripts to download datasets, run these tools and reproduce the results and figures in the manuscript.
The pipeline involves the following five steps:
- Download the necessary datasets (`data` directory).
- Download and/or compile the necessary programs (`programs` directory).
- Run the methods and produce results for `rnabridge-align` (`align` directory).
- Run the methods and produce results for `rnabridge-denovo` (`denovo` directory).
- Summarize the results and produce figures (`plots` directory).
We evaluate the tools on two datasets, namely simulation80 and encode10.
We also need the reference annotation files for evaluating reference-based transcript assembly.
In the `data` directory, we provide metadata for these datasets, along with scripts to download them.
The simulation80 dataset was simulated with Flux-Simulator. We varied two parameters: the average fragment length (300 and 500) and the read length (75 and 100). For each of the four combinations, we simulated 20 samples. The reads, ground-truth transcripts, and alignments (using STAR) can be downloaded through Penn State Data Commons (https://doi.org/10.26208/b01x-aq20).
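As a quick sanity check on the design above, the parameter grid can be enumerated in a few lines of shell. This is only a sketch: the label format `frag<F>.read<R>.sample<I>` is a hypothetical naming chosen for illustration and is not the naming used in the downloaded files.

```shell
# Enumerate simulated configurations: 2 fragment lengths x 2 read lengths x 20 samples.
# The label format printed here is hypothetical and used for illustration only.
list_simulation_samples() {
  for frag in 300 500; do
    for rlen in 75 100; do
      i=1
      while [ "$i" -le 20 ]; do
        echo "frag${frag}.read${rlen}.sample${i}"
        i=$((i + 1))
      done
    done
  done
}

# Count the enumerated samples (4 combinations x 20 samples each).
list_simulation_samples | wc -l
```

The count of 80 is consistent with the dataset name simulation80.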
The encode10 dataset contains 10 human RNA-seq samples downloaded from ENCODE. It has also been used in scalloptest. All these samples were sequenced with strand-specific, paired-end protocols. We aligned each of the 10 samples with two RNA-seq aligners, STAR and HISAT2. You may download all these read alignments via Penn State Data Commons (https://doi.org/10.26208/8c06-w247).
Use the following script in `data` to download annotations:
./download.annotation.sh
The downloaded files will appear under `data/ensembl`.
Our experiments (used in the manuscript) involve the following programs:
Program | Version | Description |
---|---|---|
rnabridge-align | v1.0.1 | bridging RNA-seq alignments |
Scallop | v0.10.5 | transcript assembler |
StringTie | v2.1.4 | transcript assembler |
gffcompare | v0.11.2 | evaluate assembled transcripts |
gtfcuff | | a set of utilities for processing RNA-seq data |
You need to download and/or compile them, and then link them into the `programs` directory.
Make sure that the program names are in lowercase (i.e., `stringtie`, `scallop`, and `gffcompare`) in the `programs` directory.
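One way to do the linking is with symbolic links. The helper below is a minimal sketch, not part of this repository: `link_program` is a hypothetical function name, and the source paths in the example calls are placeholders that depend on where you installed each tool. It lowercases the link name so that it matches the naming requirement above.

```shell
# link_program SRC NAME: symlink the executable SRC into programs/ under a
# lowercase version of NAME. Hypothetical helper, for illustration only.
link_program() {
  src=$1
  name=$(printf '%s' "$2" | tr '[:upper:]' '[:lower:]')
  mkdir -p programs
  ln -sf "$src" "programs/$name"
}

# Example calls (placeholder paths; adjust to your installation):
# link_program /path/to/stringtie StringTie
# link_program /path/to/scallop Scallop
# link_program /path/to/gffcompare gffcompare
```

Passing an absolute path as `SRC` is the safer choice, because a relative symlink target is resolved relative to the `programs` directory rather than the directory you ran the command from.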
Once the datasets and programs are available, use the following scripts in `align` to run the experiments:
./run.simulation80.sh
./run.encode10.sh
You can modify each of these scripts to run with different parameters.
For each run, you need to specify a `run-id`, which will be used later when collecting the results.
After the experiments finish running, the following script collects the accuracies:
./collect.sh
This will report results to a directory `results.RUN-ID`, which can be directly used by the scripts that generate the figures (below).
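Before moving on, it can help to confirm that the results directory for a given run-id exists. The snippet below is a sketch under assumptions: `results_dir` is a hypothetical helper, and it assumes the `results.RUN-ID` directory is created in the current working directory, which may not match where `collect.sh` actually writes.

```shell
# results_dir RUN_ID: print the results directory name for a run-id,
# following the results.RUN-ID pattern. Hypothetical helper.
results_dir() {
  echo "results.$1"
}

# D400 is the run-id this README gives for the manuscript results;
# replace it with your own run-id.
run_id=D400
dir=$(results_dir "$run_id")
if [ -d "$dir" ]; then
  echo "found $dir"
else
  echo "missing $dir (assumed location); run ./collect.sh first"
fi
```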
Once the results have been generated, use the following script in `plots` to reproduce the figures:
./build.figures.sh
You may need to install the R package `tikzDevice`. You may also need to modify these scripts to match the run-id(s) you specified.
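To check whether `tikzDevice` is already available, something like the following can be used. This is a sketch rather than part of the repository: `check_tikzdevice` is a hypothetical helper, and it assumes `Rscript` is on your `PATH`.

```shell
# check_tikzdevice: return 0 if the R package tikzDevice can be loaded,
# 1 if it is missing, and 2 if Rscript itself is not available.
# Hypothetical helper, for illustration only.
check_tikzdevice() {
  if ! command -v Rscript >/dev/null 2>&1; then
    echo "Rscript not found on PATH" >&2
    return 2
  fi
  Rscript -e 'quit(status = if (requireNamespace("tikzDevice", quietly = TRUE)) 0 else 1)'
}

if check_tikzdevice; then
  echo "tikzDevice is installed"
else
  echo "tikzDevice not available; in R, try install.packages(\"tikzDevice\")"
fi
```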
The results used in the manuscript (run-id = D400) have been updated in this repo (including the GTEx dataset), so directly running the above script can generate all figures used in the manuscript.