GitHub - lifengtian/SplicePL: Yet another bioinformatics tool to detect de novo splice junctions from paired-end RNA-seq reads (human genome only)

Branches Tags

Name		Name	Last commit message	Last commit date
Latest commit History 40 Commits
data		data
doc		doc
scripts		scripts
t		t
Changes		Changes
MANIFEST		MANIFEST
Makefile.PL		Makefile.PL
README		README
SplicePL.pm		SplicePL.pm

Repository files navigation

LIMITATION!!!!
So far, SplicePL does NOT use non-uniquely mapped reads in the splice junction analysis. Since we really don't know
how to resolve the ambiguous reads, we choose NOT to use them.

SplicePL only works with human genome right now. The chromosome numbers are hard-coded from chr1 to chrM.
I will encourage to change the source code to make it work for your genome of interest.

INTRODUCTION

SplicePL is a simple BLAT parser and splice site filter, originally hacked in one day to test the idea of
using BLAT as an aligner. After it showed us promising results, it was rewritten in Perl5.

We find it a useful tool to detect de novo splice junctions from paired-end human RNA-seq
data, especially for reads longer than 75 nt. With these long reads, novel exon-exon junctions can be directly
identified by aligning reads to the reference genome using algorithms such as BLAT (Kent, 2002) or SSAHA2 (Ning
et al., 2001), rather than splitting reads into segments as in TopHat (Trapnell et al., 2009) and SpliceMap (Au
et al., 2010).

SplicePL detects splice junctions by a three-stage process. Initially, each read is aligned to the reference
genome using BLAT, followed by identifying spliced reads across exon-exon junction. Finally, the putative
junctions are identified and filtered based on splice site donor/acceptor site, intron and flanking sequence
length.

In summary, SplicePL is a simple BLAT parser and splice site filter. It is written in Perl, packaged as
a perl module to facilitate testing and installation. The code can be improved a lot by using BioPerl modules,
however, for the sake of simplicity and ease of use, we choose to implement it the way it is.

At the moment of writing (Jun 2010), SplicePL has the highest sensitivity for long reads (at least 100 nt)
compared with TopHat and SpliceMap. However, SpliceMap has better specificity than SplicePL.

In our own RNA-seq analysis, we used TopHat, SpliceMap, SplicePL, and GSNAP. We find robust results
can be obtained from highly expressed transcripts. Lowly expressed junctions tend to disagree more for
different methods.

PREREQUISITE
Before running SplicePL, please make sure:
1. Either BLAT or gfServer/gfClient binaries is installed. The most recent version (Jun 25, 2010) is 34x7 .
You also need faToTwoBit, which is part of BLAT.
For BLAT installation, please go to http://genome.ucsc.edu/FAQ/FAQblat.html#blat3

2. fasta files for individual human chromosomes were downloaded, and in the same folder, a 2BIT format file is
prepared from chr1.fa, chr2.fa, ..., chr22.fa, chrX.fa, chrY.fa,and chrM.fa. You can download the hg19
assembly ( or The Feb. 2009 human reference sequence (GRCh37) ) from
http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/

3. you know exact number of RAM and processors in your LINUX server.

HOW TO USE IT
SplicePL.pm is designed to run on a LINUX server with large memory and multiple CPUs. SplicePL is tested on
Ubuntu 10.04 and Red Hat 4.1.2-46.

It requires 3GB of RAM to load the hg19 assembly to the memory. To run multiple BLAT processes (n=N), you need
to have at least 4.0*N GB of RAM.

You run SplicePL.pm within a LINUX terminal.
1. (Recommended) Use the BLAT program which uses about 4.0GB of RAM for human genome hg19, including index
(1.3GB) plus sequences (3GB):
perl SplicePL/SplicePL.pm --forward=read1.fa --reverse=read2.fa --genome=/data/share/db/human/hg19.fa.masked
--genome_2bit_filename=hg19.fa.masked.2bit --flanksize=10 --processes=6

2. (optional, not recommended) Use the gfServer/gfClient programs which use 1.2GB of RAM for human genome hg19.
In the alignment stage, it will load needed sequences into RAM:
perl SplicePL/SplicePL.pm --forward=read1.fa --reverse=read2.fa --genome=/data/share/db/human/hg19.fa.masked
--genome_2bit_filename=hg19.fa.masked.2bit --flanksize=10 --processes=14 --gfServer

OUTPUT
The alignment output is stored in the 'temp' directory; the junction files are in the 'output' directory.

The first column in the junction file is the splice junction. It has three parts:
1. chromosome name
2. last nucleotide of the donor exon
3. first nucleotide of the acceptor exon
First base of the chromosome starts from 1. For example, it may look like chr10:102286311-102286732

ACKNOWLEDGEMENT
Using BLAT, instead of short reads aligner such as Bowtie, to align the long reads to the genome was initially
suggested by Greg Grant.

REFERENCE
Kent, W. J. (2002) Blat-the blast-like alignment tool, Genome Res 12, 656-664.
Ning, Z., Cox, a. J., and Mullikin, J. C. (2001) SSAHA: a fast search method for large DNA databases.,
Genome research 11, 1725-9.
CONTACT
If you have any comments and questions about SplicePL, please contact lifeng2@mail.med.upenn.edu

TO DO
1. convert junction format to bed format [Done. Jul 8, 2010, scripts/junction_to_bed.pl ]
2. filter junctions by read coverage
3. convert BLAT psl to BAM format such that RPKM can be calculated using Cufflinks