# Quantifying alternative splicing from RNA-seq data

This workflow calls alternative splicing events from RNA-seq data, using [`leafcutter`](https://www.nature.com/articles/s41588-017-0004-9) and [`psichomics`](https://academic.oup.com/nar/article/47/2/e7/5114259) to process RNA-seq data originally provided as `fastq.gz` files. It follows the GTEx pipeline used for the GTEx/TOPMed project; please refer to [this page](https://github.com/broadinstitute/gtex-pipeline/blob/master/TOPMed_RNAseq_pipeline.md) for details. The choice of pipeline modules in this project is supported by internal (unpublished) benchmarks from the GTEx group.

**Various reference data need to be prepared before using this workflow**. [Here we provide a module](https://cumc.github.io/xqtl-pipeline/code/data_preprocessing/reference_data.html) to download and prepare the reference data.

The output of this workflow can be used to generate phenotype tables with `/molecular_phenotypes/QC/splicing_normalization.ipynb`.

## Methods overview

There are many types of alternative splicing events. See [Wang et al (2008)](https://pubmed.ncbi.nlm.nih.gov/18978772/) and [Park et al (2018)](https://pubmed.ncbi.nlm.nih.gov/29304370/) for an illustration of the different events and of how splicing is regulated. We apply two methods to quantify alternative splicing:

1. [`psichomics`](https://academic.oup.com/nar/article/47/2/e7/5114259) quantifies each specific type of event, in particular exon skipping, which is also used in the GTEx sQTL analysis.
2. [`leafcutter`](https://www.nature.com/articles/s41588-017-0004-9) quantifies the usage of alternatively excised introns. This collectively captures skipped exons, 5’ and 3’ alternative splice site usage, and other complex events. The method was previously applied to ROSMAP data as part of Brain xQTL version 2.0.

## Input

### `leafcutter`

The BAM files can be generated by the `STAR_align` workflow from our `RNA_calling.ipynb` module.

A meta-data file, whitespace-delimited and without a header, containing 3 columns: sample ID, RNA strandness, and path to the BAM file:

```
sample_1 rf samp1.bam
sample_2 fr samp2.bam
sample_3 unstranded samp3.bam
```

All the BAM files should be available under the specified folder (by default, the folder that contains the meta-data file).
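
As an illustration of the format described above, the following minimal sketch writes such a meta-data file from a list of samples. The helper name `write_sample_list`, the file name `sample_bam.list`, and the sample rows are assumptions made for this example. The accepted strand labels are taken from the example file above.

```python
import os
import tempfile

# Strand labels accepted in column 2 (taken from the example meta-data file).
VALID_STRANDS = {"rf", "fr", "unstranded"}

def write_sample_list(rows, out_path):
    """Write (sample_id, strandness, bam_name) rows as a header-less, space-delimited file."""
    # Validate everything first so that a bad row never leaves a partial file behind.
    for sample_id, strand, bam_name in rows:
        if strand not in VALID_STRANDS:
            raise ValueError(f"Unknown strandness '{strand}' for sample {sample_id}")
    with open(out_path, "w") as fh:
        for sample_id, strand, bam_name in rows:
            fh.write(f"{sample_id} {strand} {bam_name}\n")
    return out_path

# Hypothetical samples; BAM names are relative to the folder holding the list.
rows = [
    ("sample_1", "rf", "samp1.bam"),
    ("sample_2", "fr", "samp2.bam"),
    ("sample_3", "unstranded", "samp3.bam"),
]
list_path = write_sample_list(rows, os.path.join(tempfile.mkdtemp(), "sample_bam.list"))
with open(list_path) as fh:
    written_lines = fh.read().splitlines()
```

Writing BAM names relative to the list location keeps the default `--data-dir` behaviour usable without extra options.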

To exclude some chromosomes from the analysis, add a text file named `black_list.txt`, with one chromosome name per line, to the same directory as the meta-data file.
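
A minimal sketch of creating the blacklist file next to the meta-data file is shown below. The helper name `write_black_list` and the chromosome names are hypothetical, and the names presumably need to match those used in the BAM files.

```python
import os
import tempfile

def write_black_list(sample_list_path, chromosomes):
    """Write black_list.txt (one chromosome per line) next to the meta-data file."""
    folder = os.path.dirname(os.path.abspath(sample_list_path))
    out_path = os.path.join(folder, "black_list.txt")
    with open(out_path, "w") as fh:
        fh.write("\n".join(chromosomes) + "\n")
    return out_path

# Hypothetical meta-data file location and chromosome names.
workdir = tempfile.mkdtemp()
sample_list_path = os.path.join(workdir, "sample_bam.list")
black_list_path = write_black_list(sample_list_path, ["chrX", "chrY", "chrM"])
with open(black_list_path) as fh:
    black_list_lines = fh.read().splitlines()
```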


### `psichomics`



## Output

### `leafcutter`

`{sample_list}` below refers to the name of the meta-data file input.

The main outputs include:

- `{sample_list}_intron_usage_perind.counts.gz`: rows are labeled in the format "chromosome:intron_start:intron_end:cluster_id" and columns are labeled with the input sample names. Each cell holds the intron usage ratio for that sample, i.e. (count of the particular intron in a sample) / (total count of all introns classified in the same cluster in that sample).
- `{sample_list}_intron_usage_perind_numers.counts.gz`: same row and column labels, but each cell holds the count of the intron in that sample.
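
To make the relation between the two files concrete, here is a small sketch that derives usage ratios from per-intron counts, using the row ID convention described above. The function name, the toy counts, and the cluster names are made up for illustration. Returning NaN for a cluster with no reads in a sample is a choice of this sketch, not necessarily what `leafcutter` does.

```python
def intron_usage_ratios(counts):
    """Compute per-sample intron usage ratios within each cluster.

    counts maps a row ID "chromosome:intron_start:intron_end:cluster_id"
    to a list of counts, one per sample.
    """
    # First pass: sum the counts of all introns in the same cluster, per sample.
    totals = {}
    for row_id, values in counts.items():
        cluster = row_id.rsplit(":", 1)[1]
        previous = totals.get(cluster, [0] * len(values))
        totals[cluster] = [p + v for p, v in zip(previous, values)]
    # Second pass: divide each intron count by its cluster total.
    ratios = {}
    for row_id, values in counts.items():
        cluster = row_id.rsplit(":", 1)[1]
        ratios[row_id] = [
            v / t if t else float("nan") for v, t in zip(values, totals[cluster])
        ]
    return ratios

# Toy data: two samples, two clusters (hypothetical values).
toy_counts = {
    "chr1:100:200:clu_1": [30, 0],
    "chr1:100:300:clu_1": [10, 0],
    "chr1:500:900:clu_2": [5, 8],
}
toy_ratios = intron_usage_ratios(toy_counts)
```

In this toy example the first intron of `clu_1` has 30 of the cluster's 40 counts in the first sample, so its ratio there is 0.75.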

### `psichomics`

## Minimal working example


### For `leafcutter`
A minimal working example is available on [Google Drive](https://drive.google.com/drive/folders/1lpcx3eKG2UpauntLUuJ6bMBjHyIhWW_R?usp=sharing).

In [None]:
sos run pipeline/splicing_calling.ipynb leafcutter \
    --cwd output/ \
    --samples sample_bam.list \
    --container containers/leafcutter.sif 

### For `psichomics`

In [None]:
sos run splicing_calling.ipynb psichomics \
    --cwd output/rnaseq/splicing \
    --samples data/sample_bam.list \
    --data-dir data \
    --container container/splicing.sif

## Command interface

In [1]:
sos run splicing_calling.ipynb -h

usage: sos run splicing_calling.ipynb
               [workflow_name | -t targets] [options] [workflow_options]
  workflow_name:        Single or combined workflows defined in this script
  targets:              One or more targets to generate
  options:              Single-hyphen sos parameters (see "sos run -h" for details)
  workflow_options:     Double-hyphen workflow-specific parameters

Workflows:
  leafcutter

Global Workflow Options:
  --cwd output (as path)
                        The output directory for generated files.
  --samples VAL (as path, required)
                        Sample meta data list
  --data-dir  path(f"{samples:d}")

                        Raw data directory, default to the same directory as
                        sample list
  --job-size 1 (as int)
                        For cluster jobs, number commands to run per job
  --walltime 5h
                        Wall clock time expected
  --mem 16G
                        Memory expected
  --numThreads 8 (a

## Setup and global parameters

In [17]:
[global]
# The output directory for generated files. 
parameter: cwd = path("output")
# Sample meta data list
parameter: samples = path
# Raw data directory, default to the same directory as sample list
parameter: data_dir = path(f"{samples:d}")
# For cluster jobs, number commands to run per job
parameter: job_size = 1
# Wall clock time expected
parameter: walltime = "5h"
# Memory expected
parameter: mem = "16G"
# Number of threads
parameter: numThreads = 8
# Software container option
parameter: container = ""
from sos.utils import expand_size
cwd = path(f'{cwd:a}')

def get_samples(fn, dr):
    import os
    samples = [x.strip().split() for x in open(fn).readlines()]
    names = []
    strandness = []
    files = []
    bams = []
    
    for i, x in enumerate(samples):
        if len(x)<3:
            raise ValueError(f"Line {i+1} of file {fn} must have 3 columns")
        names.append(x[0])
        strandness.append(x[1])
        files.append(x[2])
        
    for j in range(len(strandness)):
        # for regtools command usage, replace 0 = unstranded/XS, 1 = first-strand/RF, 2 = second-strand/FR
        # (index with the loop variable j, not the leftover i from the loop above)
        if strandness[j] == 'rf':
            strandness[j] = 1
        elif strandness[j] == 'fr':
            strandness[j] = 2
        elif strandness[j] == 'unstranded':
            strandness[j] = 0
            
    for y in files:
        y = os.path.join(dr, y)
        if not os.path.isfile(y):
            raise ValueError(f"File {y} does not exist")
        bams.append(y)
        
    if len(files) != len(set(files)):
        raise ValueError("Duplicated files are found (but should not be allowed) in BAM file list")
        
    return names, bams, strandness

sample_id, bam, strandness = get_samples(samples, data_dir)

## `leafcutter`

Documentation: [`leafcutter`](https://davidaknowles.github.io/leafcutter/index.html). The choices of `regtools` parameters are [discussed here](https://github.com/davidaknowles/leafcutter/issues/127).

### Parameter Annotations

* anchor_len: anchor length
* min_intron_len: minimum intron length to be analyzed
* max_intron_len: maximum intron length to be analyzed
* min_split_reads: minimum number of split reads required to form a cluster
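
As a rough sketch of how the first three parameters reach `regtools`, the snippet below assembles the same command line that the `leafcutter_1` step runs, using that step's default values. The helper name `regtools_extract_cmd` and the file names are assumptions for illustration; the flags and strand codes are those used in this notebook.

```python
import shlex

# Strand codes as described in the setup code: 0 = unstranded, 1 = RF, 2 = FR.
STRAND_CODES = {"unstranded": 0, "rf": 1, "fr": 2}

def regtools_extract_cmd(bam, junc, strand,
                         anchor_len=8, min_intron_len=50, max_intron_len=500000):
    """Build the 'regtools junctions extract' argument list for one BAM file."""
    return [
        "regtools", "junctions", "extract",
        "-a", str(anchor_len),        # anchor length
        "-m", str(min_intron_len),    # minimum intron length
        "-M", str(max_intron_len),    # maximum intron length
        "-s", str(STRAND_CODES[strand]),
        bam, "-o", junc,
    ]

# Hypothetical input and output names.
cmd = regtools_extract_cmd("samp1.bam", "output/samp1.junc", "rf")
cmd_string = shlex.join(cmd)  # printable form; pass `cmd` itself to subprocess.run
```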

### Things to keep in mind

* If `.bam.bai` index files for the input `.bam` files already exist before running leafcutter, they can be placed in the same directory as the input `.bam` files and the `samtools index ${_input}` line can be skipped.
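
One way to decide which files still need indexing is to look for a `.bam.bai` file next to each BAM file. The sketch below does this with placeholder files. The helper name `has_bam_index` and the file names are hypothetical, and the check only covers the `<name>.bam.bai` naming mentioned above.

```python
import os
import tempfile

def has_bam_index(bam_path):
    """Return True if a '<name>.bam.bai' index sits next to the BAM file."""
    return os.path.isfile(bam_path + ".bai")

# Empty placeholder files standing in for real BAM and index files.
workdir = tempfile.mkdtemp()
indexed_bam = os.path.join(workdir, "samp1.bam")
unindexed_bam = os.path.join(workdir, "samp2.bam")
for path in (indexed_bam, indexed_bam + ".bai", unindexed_bam):
    open(path, "w").close()

# Only these files would still need 'samtools index'.
to_index = [b for b in (indexed_bam, unindexed_bam) if not has_bam_index(b)]
```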


In [18]:
[leafcutter_1]
parameter: anchor_len = 8
parameter: min_intron_len = 50
parameter: max_intron_len = 500000
input: bam, group_by = 1, group_with = "strandness"
output: f'{cwd}/{_input:bn}.junc' 
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads
bash: expand= "${ }", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout', container=container
    samtools index ${_input}
    regtools junctions extract -a ${anchor_len} -m ${min_intron_len} -M ${max_intron_len} -s ${_strandness} ${_input} -o ${_output}

In [19]:
[leafcutter_2]
parameter: min_split_reads = 50
parameter: max_intron_len = 500000
input: group_by = 'all'
output: f'{cwd}/{samples:bn}_intron_usage_perind.counts.gz'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads
bash: expand= "${ }", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout', container=container
    rm -f ${_output:nn}.junc
    for i in ${_input:r}; do
    echo $i >> ${_output:nn}.junc ; done
    python /opt/leafcutter/clustering/leafcutter_cluster_regtools.py -j ${_output:nn}.junc -o ${f'{_output:bnn}'.replace("_perind","")} -m ${min_split_reads} -l ${max_intron_len} -r ${cwd}
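
The file handling in the step above can be sketched in plain Python as follows: the shell loop writes one `.junc` path per line into a list file named after the output, and the `-o` prefix is the output base name with `_perind` removed. The function name and the paths are hypothetical, and the sketch assumes that `:nn` strips the last two extensions (`.counts.gz`).

```python
import os
import tempfile

def write_junction_list(junc_files, counts_output):
    """Mimic the shell loop: one .junc path per line, in a file named after the output."""
    # Drop the last two extensions (.counts.gz), as assumed for ${_output:nn}.
    stem = counts_output
    for _ in range(2):
        stem = os.path.splitext(stem)[0]
    list_path = stem + ".junc"
    with open(list_path, "w") as fh:  # "w" truncates, so no separate 'rm -f' is needed
        for junc in junc_files:
            fh.write(junc + "\n")
    # Prefix for the -o option: base name without "_perind".
    prefix = os.path.basename(stem).replace("_perind", "")
    return list_path, prefix

# Hypothetical working directory and junction files.
workdir = tempfile.mkdtemp()
counts_output = os.path.join(workdir, "sample_bam_intron_usage_perind.counts.gz")
juncs = [os.path.join(workdir, f"samp{i}.junc") for i in (1, 2, 3)]
list_path, prefix = write_junction_list(juncs, counts_output)
with open(list_path) as fh:
    listed = fh.read().splitlines()
```

With these names the prefix is `sample_bam_intron_usage`, which is consistent with the `_perind.counts.gz` and `_perind_numers.counts.gz` output names listed in the Output section.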

## `psichomics`

Documentation: [`psichomics`](http://bioconductor.org/packages/release/bioc/html/psichomics.html)