Skip to content

splicebox/MntJULiP

Repository files navigation

MntJULiP

MntJULiP is a program for comprehensively and accurately quantifying splicing differences at intron level from collections of RNA-seq data. MntJULiP detects both differences in intron abundance levels, or differential splicing abundance (DSA), and differences in intron splicing ratios relative to the local gene output, or differential splicing ratio (DSR). MntJULiP uses a Bayesian mixture model that allows comparison of multiple conditions, and can model the effects of covariates to enable analyses of large and complex human data from disease cohorts and population studies.

Described in:

  • Yang G, Sabunciyan S, and Florea L (2022). Comprehensive and scalable quantification of splicing differences with MntJULiP. Genome Biol 23:195. [Suppl. data, Suppl. scripts]
  • Lui WW, Yang G, and Florea L (2023). MntJULiP and Jutils: Differential splicing analysis of RNA-seq data with covariates. Submitted.

For the original version of MntJULiP, please refer to the original branch.

Copyright (C) 2019-2024, and GNU GPL v3.0, by †Guangyu Yang, †Wui Wang Lui, Liliana Florea († equal contributors)

This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

Table of contents

What is MntJULiP?

MntJULiP is a high-performance Python package for comprehensive and accurate quantification of splicing differences from RNA-seq data. It uses Bayesian mixture models to detect changes in splicing ratios (differential splicing ratio, DSR) and in absolute splicing levels (differential splicing abundance, DSA). Its statistical underpinnings include a Dirichlet multinomial mixture model, to test for differences in the splicing ratio, and a zero-inflated negative binomial mixture model, to test for differential splicing abundance. MntJULiP works at the level of introns, or splice junctions, and therefore it is assembly-free, and can be used with or without a reference annotation. MntJULiP can perform multi-way comparisons, which may be desirable for complex and time-series experiments. Additionally, it can model confounders such as age, sex, BMI and others, and removes their biases from the data to allow for accurate comparisons. MntJULiP is fully scalable, and can work with data sets from a few to hundred or thousands of RNA-seq samples.

MntJULiP was designed to be compatible with alignments generated by STAR or Tophat2, but can work with output from other aligners.

Features

  • A novel assembly-free model for detecting differential intron abundance and differential intron splicing ratio across samples;
  • Allows multi-way comparisons;
  • Incorporates the treatment of covariates within the models;
  • Can be used with or without a reference annotation;
  • Multi-threaded and highly scalable, can process hundreds of samples in hours.

Installation

MntJULiP is written in Python, you can install the latest version from our GitHub repository. To download the code, you can clone this repository by

git clone https://github.com/splicebox/MntJULiP.git

System requirement

  • Linux
  • Python 3.7 or later
  • gcc 6.4.0 or later

Prerequisites:

MntJULiP has the following dependencies:

  • PyStan, a package for statistical modeling and high-performance statistical computation.
  • NumPy, a fundamental package for scientific computing with Python.
  • SciPy, a Python-based package for mathematics, science, and engineering.  
  • Statsmodels, a Python module for the estimation of different statistical models, conducting statistical tests and data exploration.  
  • Pandas, a fast, powerful, flexible and easy to use open source data analysis and manipulation tool.
  • Dask, a Python package that provides advanced parallelism for analytics.
  • scikit-learn, a Python simple and efficient tools for predictive data analysis.

The required packages may be installed using conda:

cd MnJULiP
conda env create -f mntjulip_cov.yml
#  run setup.py to install MntJULiP and all the required packages
module load cmake
python3 setup.py install  

Usage

Usage: python run.py [options] [--bam-list bam_file_list.txt | --splice-list splice_file_list.txt]

required arguments:
  --bam-list BAM_LIST   a text file that contains the list of the BAM file
                        paths and sample conditions
  OR
  --splice-list SPLICE_LIST
                        a text file that contains the list of the SPLICE file
                        paths and sample conditions
optional arguments:
  --anno-file ANNO_FILE
                        annotation file in GTF format
  --out-dir OUT_DIR     output folder to store the results and temporary files (default: ./out)
  --num-threads NUM_THREADS
                        number of CPU cores use to run the program (default: 4)
  --raw-counts-only     output sample-level raw values only (default: both raw and estimated values)
  -v, --version         show program's version number and exit
  -h, --help            show this help message and exit

Here is an example to run MntJULiP with a set of alignment files and the GENCODE annotation:

ANNO_FILE="gencode.v22.annotation.gtf"
BAM_LIST="bam_file_list.txt"
python3 run.py --bam-list ${BAM_LIST} \
               --anno-file ${ANNO_FILE} \
               --threads 8            

The 'bam_list' is a .txt file with columns separated by tabs ('\t'); covariate columns are optional. Here is an example:

sample	condition	covariate1	covariate2
sample1.bam	control	Male	18
sample2.bam	case	Female	49

Note that in the current version, MntJULiP automatically determines a reference condition, by sorting the conditions and choosing the lexicographically smallest one.

Extracting the splice junctions and their read counts needed for quantification from the .bam files is the first and most time consuming step. It may be beneficial to avoid recalculating the values on subsequent runs of the data or subgroups. Also, in some cases the .bam files may not be available, or the splice junctions can be obtained from other sources. For efficient processing and to accommodate such cases, MntJULiP can work directly with the .splice files, instead of the .bam files, as input.

Here is an example of how to run MntJULiP with the GENCODE annotation and the splice file list:

SPLICE_LIST="splice_file_list.txt"
ANNO_FILE="gencode.v22.annotation.gtf.gz"
python run.py --splice-list ${SPLICE_LIST} \
               --anno-file ${ANNO_FILE} \
               --num-threads 8 

The 'splice_file_list.txt' is a .txt file with columns separated by 'tab' or '\t'; covariate columns are optional. Here is an example:

sample  condition	covariate1	covariate2
sample1.splice.gz  control	Male	18
sample2.splice.gz  case	Female	49

The .splice file for a given sample is a space or ' ' separated file with at least 5 columns "chrom start end count strand" (the header is excluded):

chr1 1311924 1312018 100 -

A splice file may have additional columns; for instance, those generated by the junc tool included in this package will distinguish the numbers of uniquely and multimapped supporting reads:

chr1 1311924 1312018 100 - 67 33 ...

Input/Output

Input

The main input of MntJULiP is a list of BAM files containing RNA-seq read alignments. The BAM files can be generated by STAR with or without '--outSAMstrandField intronMotif' option.

STAR --genomeDir ${STARIDX_DIR} \
     --outSAMstrandField intronMotif \
     --readFilesIn ${DATA_DIR}/${name}_1.fastq ${DATA_DIR}/${name}_2.fastq \
     --outSAMtype BAM SortedByCoordinate \
     --outFileNamePrefix ${WORK_DIR}/${name}/

Alternatively, MntJULiP can take as input .splice files generated by an external program, such as the junc tool included in this package, or other versions from our PsiCLASS or CLASS2 packages.

junc my_alignment_fil.bam -a > my_alignment_file.splice

Output

Output generated by MntJULiP includes five types of files:

  • diff_spliced_introns.txt and diff_spliced_groups.txt: contains the results of the differential intron splicing ratio (DSR) analysis;
  • diff_introns.txt: contains the results of the differential intron abundance (DSA) analysis;
  • intron_data.txt and group_data.txt: contains information about the introns (splice junctions), including genomic coordinates, raw and estimated read counts, average abundance levels, etc.; and respectively groups, including genomic location, splicing ratios (PSI values) both calculated from read counts and estimated by the model, and others.

These files are described in detail below.

In addition, the directory will contain a temporary folder named 'data', containing the .splice files generated by MntJULiP.

1. DSR analysis: diff_spliced_introns.txt & diff_spliced_groups.txt

The two files contain information about differential spliced introns determined by the DSR criteria and their group ('bunch') context. A 'bunch' is a group of exons that share the same junction, either the lower coordinate ('i') or the higher coordinate ('o') of the intron(s) along the genomic axis. The diff_spliced_groups.txt file lists each 'bunch' along with information about the log-likelihood DSR test and its significance:

group_id        chrom   loc     strand  gene_name       structure       llr     p_value q_value
g000001 chr1    3421702 -       Xkr4    o       1.85645 0.156227        0.477186

The file diff_spliced_introns.txt lists all introns by group ('bunch') along with their location on the genome, Percent Splice In (PSI) value estimated for each of the conditions in the comparison, and the difference in PSI values dPSI=PSI2-PSI1. For multi-way comparisons, the dPSI value is excluded.

group_id        chrom   start   end     strand  gene_name       psi(control)    psi(epileptic)  delta_psi
g000001 chr1    3216968 3421702 -       Xkr4    0.843335        0.784751        -0.0585835
g000001 chr1    3323760 3421702 -       Xkr4    0.156665        0.215249        0.0585835

To determine introns that are differentially spliced by the DSR test, one may query the two files for introns in 'bunches' that satisfy the statistical significance condition (e.g., p-val<=0.05 or q-val<=0.05) and for which the dPSI value exceeds a pre-defined cutoff (e.g., |dPSI|>=0.05). For convenience, the script filter_DSR_introns.py was provided with this package:

filter_DSR_introns.py [-h] [--dir DIR] [--pvalue PVALUE] [--qvalue QVALUE] [--dpsi DPSI]

optional arguments:
  -h, --help       show this help message and exit
  --dir DIR        directory that contains diff_spliced_introns.txt and diff_spliced_groups.txt
  --pvalue PVALUE  filter by p-value (default 0.05)
  --qvalue QVALUE  filter by q-value (default 1.0)
  --dpsi DPSI      filter by absolute value of dPSI (default 0.05)

Note that, by virtue of how 'bunches' are constructed, an intron may be reported more than one time, for instance the exon-spanning intron in an exon skipping event will appear in two groups, one for each of its two endpoints. Additionally, more than one exon from the same 'bunch' may satisfy the statistical test and be reported. Our visualization tool Jutils selects an unbiased set of introns for display in heatmaps, following the criteria outlined here.

2. DSA analysis: diff_introns.txt

The file diff_introns.txt contains information about differential spliced introns determined by the DSA criteria. It lists all MntJULiP-selected introns, individually, along with their location on the genome, abundance (expression) estimates in terms of normalized read counts in each condition, and information about the log-likelihood ratio test and its significance:

chrom   start   end     strand  gene_name       status  llr     p_value q_value avg_read_counts(control)        avg_read_counts(epileptic)
chr1    3207317 3213439 -       Xkr4    TEST    0.718835        0.696729        0.89138 59.64   41.36

Introns that are differentially spliced by the DSA test can be determined directly from the file by querying the entries for statistical significance and/or other criteria.

3. Supporting intron & group data

The file intron_data.txt contains supporting data for all introns detected by MntJULiP that passed the initial internal quality filters.

chrom   start   end     strand  gene_name       status  read_counts(0)    read_counts(1)  est_counts(0)  est_counts(1)
chr1    966803  970277  +       PLEKHN1    OK      9,7,7,9,21   3,16,15,17,22   11.473772,9.75089,9.75089,11.473772,17.631809   11.957485,16.183491,15.404332,16.943721,20.49379
chr1    970621  1054485 +       PLEKHN1    LO_DATA 1,0,2,0,0    0,0,1,0,0       None,None,None,None,None        None,None,None,None,None

It lists all introns individually along with their genomic location, status of the test (OK or LO_DATA), the raw read counts for the intron in all samples, and the model fitted read counts, grouped by condition. If the '--raw-counts-only' option is used, only raw read counts are reported (n.b., this option is compatible with versions of MntJULiP before 1.5). This information can be used, for instance, to further filter introns with low support across all or subsets of the samples, or from genes with low expression levels, or can be used to generate visualizations such as heatmaps or PCA plots.

The file group_data.txt contains supporting data for all introns in groups tested by MntJULiP.

group_id        chrom   start   end     strand  gene_name       status  psi_values(0)   psi_values(1)   est_psi_values(0)       est_psi_values(1)
g000001 chr1    971006  971077  +       PLEKHN1 TEST    0.25,0.333333,0.375,0.666667,0.458333   0.454545,0.5,0.44,0.333333,0.478261     0.274425,0.361496,0.404335,0.692248,0.458361    0.484908,0.499894,0.440096,0.333756,0.478239
g000001 chr1    971006  971113  +       PLEKHN1 TEST    0.75,0.666667,0.625,0.333333,0.541667   0.545455,0.5,0.56,0.666667,0.521739     0.725575,0.638504,0.595665,0.307752,0.541639    0.515092,0.500106,0.559904,0.666244,0.521761

It lists all introns in a group, on separate lines, along with their genomic location, status of the test (TEST or LO_DATA), and per sample PSI values, both calculated from the raw read counts and estimated by the model, separated by condition. If the '--raw-counts-only' option is used, only PSI values calculated from the raw read counts are reported. This file is new in MntJULiP starting with release 1.5. As with intron data, this information can be used to filter low ratio isoforms or to generate visualizations.

Visualization

The output generated by MntJULiP DSR and DSA can be further visualized using the Jutils visualization suite. Jutils can generate heatmaps of differentially spliced events, sashimi plots, PCA plots and Venn diagrams of gene sets determined to be differentially spliced based on program or user-defined criteria, using information extracted directly from the MntJULiP output.

Support

Contact: Please submit an Issue through this Github page.

License information

See the file LICENSE for information on the history of this software, terms & conditions for usage, and a DISCLAIMER OF ALL WARRANTIES.