An alignment-free approach to estimating exon-inclusion ratios without a reference transcriptome
FreePSI is a new method for genome-wide percent spliced in (PSI) estimation that requires neither a reference transcriptome (hence, transcriptome-free) nor the mapping of RNA-seq reads (hence, alignment-free). The first freedom allows FreePSI to work effectively when a high quality reference transcriptome is unavailable and the second freedom not only helps make FreePSI more efficient, it also eliminates the necessity of dealing with multi-reads, which is a challenging problem by itself. Note that this is the first alignment-free method in RNA-seq data analysis that does not require a reference transcriptome. An outline of the method is sketched below.
FreePSI takes as the input a reference genome with exon boundary annotation and a set of RNA-seq reads. Since a reference transcriptome is not assumed, it uses a weighted directed bipartite graph (called an abundance flow graph) to represent all possible isoforms of a gene and their expression levels. In such a graph, each vertex represents an exon boundary and each edge represents either an exon or an exon junction. The weight of an edge represents the total relative abundance of all isoforms covering the corresponding exon or junction. Obviously, to estimate the PSI value of each exon, it suffices to infer the edge weights in every abundance flow graph. By regarding each edge as a sequence of k-mers, FreePSI constructs a novel probabilistic model for generating all observed k-mers in the input RNA-seq reads based on the abundance flow graphs for all genes. It then employs the expectation-maximization framework to solve a genome-wide maximum likelihood estimation of the model and a divide-and-conquer strategy to decompose the key optimization problem in the M-step into independent subproblems for each gene, which are then solved by an ultrafast algorithm, conjugate gradient projection descent. Finally, it uses a post-processing procedure based on straightforward correlation analysis to 'smooth out' the PSI values in each gene. FreePSI could have important applications in alternative splicing analysis when a high quality reference transcriptome is unavailable.
FreePSI requires Jellyfish (version 2.2.6 or higher) for counting k-mers in RNA-seq reads and Python3 (version 3.5.2 or higher) with Scipy (version 0.18.1 or higher) and Numpy (version 1.11.1 or higher) for post-processing and summarizing output.
Currently, we provide a binary release here (compiled with GCC version 5.4.0) for 64-bit Linux system.
Building FreePSI from source requires installing libraries Eigen (version 3.3 or higher) and Boost (version 1.61.0 or higher). We will provide an installable version soon.
We provide an example for testing FreePSI.
The example contains the reference genome with exon boundary annotation of human chromosome 21 and corresponding simulated RNA-seq reads.
Please run the shell script run_FreePSI.sh
in the folder example
for a try.
Description:
The build
command builds the k-mer hash table from a reference genome and loads the k-mer counts of RNA-seq reads.
Usage:
freePSI build -g <GENOME_DIR> -a <EXON_BND_ANNOT> [-r <KMER_COUNT> | -1 <KMER_COUNT> -2 <KMER_COUNT>] -o <HASHTABLE> [options..]
Options:
-g <GENOME_DIR>
The directory containing the reference genome of each chromosome (.fasta format)
-a <EXON_BND_ANNOT>
The annotation of exon boundary (.bed format)
-r <KMER_COUNT>
The k-mer count of single-end reads produced by Jellyfish (.fasta format)
-1 <KMER_COUNT> -2 <KMER_COUNT>
The k-mer count of paired-end reads produced by Jellyfish (.fasta format)
-o <HASHTABLE>
The k-mer hash table (.json format)
-k [Integer]
The length of k-mer (Default: 27)
-p [Thread number]
The thread numbers. (Default: 1)
-h
Help information
Description:
The quant
command quantifies the PSI values.
Usage:
freePSI quant -i <INDEX> -o <OUTPUT>
Options:
-i <HASHTABLE>
The k-mer hash table (.json format)
-o <OUTPUT>
The result of PSI values (.json format)
-k [Integer]
The length of k-mer (Default: 27)
-p [Thread number]
The thread numbers. (Default: 1)
-h
Help information
python3 postProc.py <RAW_OUTPUT> <REFINED_OUTPUT>
python3 summary.py <EXON_BND_ANNOT> <REFINED_OUTPUT> <SUMMARY>