Skip to content

Vivianstats/MSIQ

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

3 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

MSIQ: Joint modeling of multiple RNA-seq samples for accurate isoform quantification

Wei Vivian Li, Anqi Zhao, Jingyi Jessica Li 2017-08-19

Introduction

MSIQ is developed for more accurate and robust isoform quantification, by integrating multiple RNA-seq samples under a Bayesian framework. Our method aims to

  • identify the consistent group of samples with homogeneous quality
  • improve isoform quantification accuracy by jointly modeling multiple RNA-seq samples with more weights on the consistent group.

Any suggestions on the package are welcome! For technical problems, please report to Issues. For suggestions and comments on the method, please contact Wei (liw@ucla.edu) or Dr. Jessica Li (jli@stat.ucla.edu).

Installation

The package is not on CRAN yet. For installation please use the following codes in R

install.packages("devtools")
library(devtools)

install_github("Vivianstats/MSIQ")

Usage

The package needs two types of input: GTF file and BAM files.

GTF file

MSIQ is an annotation-based method, so the users need to supply a GTF file from which MSIQ extracts the gene structures.

BAM files

MSIQ starts from alignment file in bam format. Each bam file represents one RNA-seq sample. The bam files need to be sorted and indexed, and the index files should be in the same folder as their corresponding bam files. Currently, the package only supports paired-end reads.

Example

Suppose the genes of interest are specified in the GTF file named as "example.gtf". We would like to estimate the transcript expression of these genes based on 6 RNA-seq samples. The bam files are named as "sample1.bam", ... , "sample4.bam". Suppose all the files are in the working directory, then we can use the following example codes to run the program

library(MSIQ)
D = 4 # number of samples
gtf_path = "example.gtf"
bam_path = c("sample1.bam", "sample2.bam", "sample3.bam", "sample4.bam")
ncores = 5
result = msiq(D, gtf_path, bam_path, ncores)

MSIQ summarizes the estimation results in MSIQ_result/transcript_summary.txt. The text file includes the following the columns:

  • transcriptID. The transcript ID extracted from the GTF file.
  • geneID. The gene ID extracted from the GTF file.
  • FPKM_1, ..., FPKM_D. The FPKM values of the corresponding transcript in samples 1, ..., D.
  • sample1, ..., sampleD. The estimated indicator variables of samples 1, ..., D. 1 means that the sample belongs to the consistent group; 0 means that the sample is not in the consistent group.

Below is an example output for one gene:

transcriptID geneID frac FPKM_1 FPKM_2 FPKM_3 FPKM_4 sample1 sample2 sample3 sample4
ENST00000464356 ENSG00000116604 0.0996 0.0365 0.0549 0.016 0.0124 1 1 1 0
ENST00000348159 ENSG00000116604 0.2601 0.0892 0.1342 0.0391 0.0304 1 1 1 0
ENST00000360595 ENSG00000116604 0.629 0.4154 0.6249 0.1819 0.1417 1 1 1 0
ENST00000368240 ENSG00000116604 0.0018 0.0021 0.0031 9e-04 7e-04 1 1 1 0
ENST00000475587 ENSG00000116604 0.0032 0.0044 0.0067 0.0019 0.0015 1 1 1 0
ENST00000493077 ENSG00000116604 4e-04 0.0024 0.0036 0.0011 8e-04 1 1 1 0
ENST00000489057 ENSG00000116604 0.0059 0.0297 0.0447 0.013 0.0101 1 1 1 0

About

Joint modeling of multiple RNA-seq samples for accurate isoform quantification

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published