This repository tests the performance of rnabridge-align and rnabridge-denovo. Here we provide scripts to download datasets, run these tools and reproduce the results and figures in the manuscript.
The pipeline involves the following five steps:
- Download necessary datasets (`data` directory).
- Download and/or compile necessary programs (`programs` directory).
- Run the methods and produce results regarding `rnabridge-align` (`align` directory).
- Run the methods and produce results regarding `rnabridge-denovo` (`denovo` directory).
- Summarize results and produce figures (`plots` directory).
We evaluate them on two datasets, namely simulation80 and encode10.
We also need the reference annotation files for evaluating reference-based transcript assembly.
In directory data, we provide metadata for these datasets, and also provide scripts to download them.
The simulation80 dataset was simulated with Flux-Simulator. We varied two parameters: the average fragment length (300 and 500) and the read length (75 and 100). For each of the four combinations, we simulated 20 samples, giving 80 samples in total. The reads, ground-truth transcripts, and alignments (using STAR) can be downloaded through Penn State Data Commons (https://doi.org/10.26208/b01x-aq20).
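To make the size of the parameter grid concrete, the loop below enumerates every combination of fragment length, read length, and sample index. It is only an illustrative sketch: the `frag<F>.read<R>.sample<N>` naming is hypothetical and does not reflect the actual file names in the dataset.

```shell
# Enumerate the simulation80 parameter grid:
# 2 fragment lengths x 2 read lengths x 20 samples = 80 samples.
# The naming scheme printed here is hypothetical, for illustration only.
list_simulation_samples() {
    for frag in 300 500; do
        for read in 75 100; do
            n=1
            while [ "$n" -le 20 ]; do
                echo "frag${frag}.read${read}.sample${n}"
                n=$((n + 1))
            done
        done
    done
}

# Prints the number of enumerated samples (80).
list_simulation_samples | wc -l
```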
The encode10 dataset contains 10 human RNA-seq samples downloaded from ENCODE; it has also been used in scalloptest. All of these samples were sequenced with strand-specific, paired-end protocols. We aligned each of the 10 samples with two RNA-seq aligners, STAR and HISAT2. You may download all of these read alignments via Penn State Data Commons (https://doi.org/10.26208/8c06-w247).
Use the following script in data to download annotations:
./download.annotation.sh
The downloaded files will appear under data/ensembl.
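One quick way to confirm that the download succeeded is to check that the target directory exists and is not empty. The helper below is a minimal sketch: `check_annotation_dir` is a hypothetical function, not part of this repository, and the path in the final line assumes you are in the repository root.

```shell
# Report whether a directory exists and contains at least one entry.
# check_annotation_dir is a hypothetical helper, not a script shipped with this repo.
check_annotation_dir() {
    dir="$1"
    if [ -d "$dir" ] && [ -n "$(ls -A "$dir" 2>/dev/null)" ]; then
        count=$(ls -A "$dir" | wc -l | tr -d ' ')
        echo "ok: $dir (entries: $count)"
    else
        echo "missing or empty: $dir" >&2
        return 1
    fi
}

# Assumes the current directory is the repository root.
check_annotation_dir data/ensembl || echo "run ./download.annotation.sh in data first"
```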
Our experiments (used in the manuscript) involve the following five programs:
| Program | Version | Description |
|---|---|---|
| rnabridge-align | v1.0.1 | bridging RNA-seq alignments |
| Scallop | v0.10.5 | transcript assembler |
| StringTie | v2.1.4 | transcript assembler |
| gffcompare | v0.11.2 | evaluates assembled transcripts |
| gtfcuff | | a set of utilities for processing RNA-seq data |
You need to download and/or compile them, and then link them into the `programs` directory.
Make sure that the program names are lowercase (i.e., stringtie, scallop, and gffcompare) in the `programs` directory.
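The linking step could be scripted along the following lines. This is a sketch under assumptions: `link_program` is a hypothetical helper, and the example path in the usage comment is a placeholder for wherever you downloaded or built each tool.

```shell
# Symlink an executable into a programs directory under a lowercase name.
# link_program is a hypothetical helper, not part of this repository.
link_program() {
    src="$1"            # path to a downloaded or compiled executable
    programs_dir="$2"   # destination directory, e.g. "programs"

    # Resolve the source to an absolute path so the symlink stays valid.
    src_dir=$(cd "$(dirname "$src")" && pwd) || return 1
    src_base=$(basename "$src")

    # Lowercase the link name, as the experiment scripts expect.
    name=$(echo "$src_base" | tr '[:upper:]' '[:lower:]')

    mkdir -p "$programs_dir" || return 1
    ln -sf "$src_dir/$src_base" "$programs_dir/$name" || return 1
    echo "$programs_dir/$name"
}

# Usage (placeholder path):
#   link_program /path/to/build/Scallop programs
```

Resolving the source to an absolute path avoids dangling links when the helper is called with a relative path.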
Once the datasets and programs are available, use the following scripts in align to run:
./run.simulation80.sh
./run.encode10.sh
You can modify each of these scripts to run with different parameters.
For each run, you need to specify a run-id, which will be used later on when
collecting the results.
After experiments finish running, the following script can collect accuracies:
./collect.sh
This will write the results to a directory results.RUN-ID, which can be
used directly by the scripts that generate the figures (below).
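Before moving on to the figures, it can help to confirm that the results directory for your run-id exists. The snippet below is a minimal sketch: `results_dir_for` is a hypothetical helper, `D400` is the run-id this README cites for the manuscript results, and the directory is assumed to be relative to wherever `collect.sh` writes its output.

```shell
# Build the results directory name for a given run-id.
# results_dir_for is a hypothetical helper, not part of this repository.
results_dir_for() {
    echo "results.$1"
}

run_id="D400"   # substitute the run-id you used
dir=$(results_dir_for "$run_id")

# Assumption: run this from the directory where collect.sh writes its output.
if [ -d "$dir" ]; then
    echo "found $dir"
else
    echo "$dir not found; run ./collect.sh first"
fi
```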
Once the results have been generated, use the following script in plots to reproduce the figures:
./build.figures.sh
You may need to install R tikzDevice. You may also need to modify these scripts to match the run-id(s) you specified.
#The results used in the manuscript (run-id = D400) have been updated in this repo (including the GTEx dataset),
#so directly running the above script can generate all figures used in the manuscript.
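The steps described in this README can be chained in a small driver, sketched below. Several things here are assumptions: `run_step` and `run_all` are hypothetical helpers, `collect.sh` is assumed to live in `align`, and the `denovo` step is left out because its script names are not listed here. The run-id still has to be set inside the individual scripts.

```shell
# Run one pipeline script inside its directory, or report that it is missing.
# run_step and run_all are hypothetical helpers, not part of this repository.
run_step() {
    dir="$1"
    script="$2"
    if [ -x "$dir/$script" ]; then
        echo "running $dir/$script"
        (cd "$dir" && "./$script")
    else
        echo "skipping $dir/$script (not found or not executable)"
    fi
}

# Order follows this README; the chain stops at the first script that fails.
# Assumption: collect.sh lives in the align directory.
run_all() {
    run_step data  download.annotation.sh &&
    run_step align run.simulation80.sh &&
    run_step align run.encode10.sh &&
    run_step align collect.sh &&
    run_step plots build.figures.sh
}

# Usage, from the repository root:
#   run_all
```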