# Quantifying alternative splicing from RNA-seq data

This notebook implements our pipeline to call alternative splicing events from RNA-seq data, using [`leafcutter`](https://www.nature.com/articles/s41588-017-0004-9) and [`psichomics`](https://academic.oup.com/nar/article/47/2/e7/5114259). The workflows start from BAM files derived from the original `fastq.gz` data. The pipeline follows the GTEx pipeline for the GTEx/TOPMed project; please refer to [this page](https://github.com/broadinstitute/gtex-pipeline/blob/master/TOPMed_RNAseq_pipeline.md) for details. The choice of pipeline modules in this project is supported by internal (unpublished) benchmarks from the GTEx group.

**Reference data need to be prepared before using this workflow**. [We provide a module](https://cumc.github.io/xqtl-pipeline/code/data_preprocessing/reference_data.html) to download and prepare the reference data.

## Methods overview

There are many types of alternative splicing events; see [Wang et al (2008)](https://pubmed.ncbi.nlm.nih.gov/18978772/) for an illustration of the different events. We apply two methods to quantify alternative splicing:

1. [`psichomics`](https://academic.oup.com/nar/article/47/2/e7/5114259), which quantifies each specific event type. We focus in particular on exon skipping events, which are also used in the GTEx sQTL analysis.
2. [`leafcutter`](https://www.nature.com/articles/s41588-017-0004-9), which quantifies the usage of alternatively excised introns. This collectively captures skipped exons, 5’ and 3’ alternative splice site usage, and other complex events. The method was previously applied to ROSMAP data as part of the Brain xQTL version 2.0.
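
To illustrate what event-level quantification means, percent spliced-in (PSI) for an exon skipping event is commonly defined as the inclusion read count divided by the sum of inclusion and exclusion read counts. The sketch below follows that common definition, averaging the two junctions that support inclusion. The function name and read counts are made up for illustration; this is not `psichomics` code.

```python
def exon_skipping_psi(c1_a, a_c2, c1_c2):
    """PSI for a skipped exon A flanked by constitutive exons C1 and C2.

    c1_a, a_c2: reads on the two junctions supporting inclusion of A
    c1_c2:      reads on the junction that skips A
    """
    inclusion = (c1_a + a_c2) / 2  # average over the two inclusion junctions
    total = inclusion + c1_c2
    if total == 0:
        return None  # no junction reads: PSI is undefined
    return inclusion / total


# Hypothetical counts: 30 and 50 inclusion reads, 10 skipping reads
print(exon_skipping_psi(30, 50, 10))  # 0.8
```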

## Input

A whitespace-delimited meta-data file without a header, containing 2 columns (sample ID and BAM file):

```
sample_1 samp1.bam
sample_2 samp2.bam
sample_3 samp3.bam
```

All BAM files should be available under the specified folder (by default, the folder that contains the meta-data file).
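
If the BAM files already sit in one folder, a meta-data file in this format can be generated with a few lines of Python. This is a minimal sketch, assuming that the sample ID is simply the BAM file name without its extension; the function name `write_sample_list` is hypothetical and not part of the pipeline.

```python
from pathlib import Path


def write_sample_list(bam_dir, out_file):
    """Write one 'sample_id bam_file' line per BAM file found in bam_dir."""
    bam_files = sorted(Path(bam_dir).glob("*.bam"))
    # Assumption: sample ID = file name without the .bam extension
    lines = [f"{bam.stem} {bam.name}" for bam in bam_files]
    Path(out_file).write_text("".join(f"{line}\n" for line in lines))
    return lines
```

For example, `write_sample_list("data", "data/sample_bam.list")` would list every `*.bam` file under `data`, with file names relative to that folder.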

## Output

- `leafcutter`: counts for each intron cluster within a gene, as well as intron usage ratio
- `psichomics`: ...
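
To make "intron usage ratio" concrete: within a cluster, the ratio for an intron is its read count divided by the total read count over all introns in that cluster. The sketch below computes this for one sample and one cluster; the intron identifiers and counts are hypothetical, and the code does not parse actual `leafcutter` output.

```python
def intron_usage_ratios(cluster_counts):
    """Map each intron in a cluster to its share of the cluster's reads."""
    total = sum(cluster_counts.values())
    if total == 0:
        # No reads in the cluster: ratios are undefined
        return {intron: None for intron in cluster_counts}
    return {intron: n / total for intron, n in cluster_counts.items()}


# Hypothetical counts for three introns in one cluster
counts = {"chr1:100:200": 30, "chr1:100:300": 10, "chr1:150:300": 0}
print(intron_usage_ratios(counts))
# {'chr1:100:200': 0.75, 'chr1:100:300': 0.25, 'chr1:150:300': 0.0}
```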

## Minimal working example

Toy `BAM` data can be found on ...

### For `leafcutter`

In [None]:
sos run splicing_calling.ipynb leafcutter \
    --cwd output/rnaseq/splicing \
    --samples data/sample_bam.list \
    --data-dir data \
    --container container/splicing.sif

### For `psichomics`

In [None]:
sos run splicing_calling.ipynb psichomics \
    --cwd output/rnaseq/splicing \
    --samples data/sample_bam.list \
    --data-dir data \
    --container container/splicing.sif

## Command interface

In [3]:
sos run splicing_calling.ipynb -h

usage: sos run splicing_calling.ipynb
               [workflow_name | -t targets] [options] [workflow_options]
  workflow_name:        Single or combined workflows defined in this script
  targets:              One or more targets to generate
  options:              Single-hyphen sos parameters (see "sos run -h" for details)
  workflow_options:     Double-hyphen workflow-specific parameters

Workflows:
  leafcutter

Global Workflow Options:
  --cwd output (as path)
                        The output directory for generated files.
  --samples VAL (as path, required)
                        Sample meta data list
  --data-dir  path(f"{samples:d}")

                        Raw data directory, default to the same directory as
                        sample list
  --job-size 1 (as int)
                        For cluster jobs, number commands to run per job
  --walltime 5h
                        Wall clock time expected
  --mem 16G
                        Memory expected
  --numThreads 8 (a

## Setup and global parameters

In [6]:
[global]
# The output directory for generated files. 
parameter: cwd = path("output")
# Sample meta data list
parameter: samples = path
# Raw data directory, default to the same directory as sample list
parameter: data_dir = path(f"{samples:d}")
# For cluster jobs, number commands to run per job
parameter: job_size = 1
# Wall clock time expected
parameter: walltime = "5h"
# Memory expected
parameter: mem = "16G"
# Number of threads
parameter: numThreads = 8
# Software container option
parameter: container = ""
from sos.utils import expand_size
cwd = path(f'{cwd:a}')


def get_samples(fn, dr):
    import os
    # Each line: sample ID followed by the BAM file name (relative to `dr`)
    with open(fn) as f:
        samples = [x.strip().split() for x in f]
    names = []
    files = []
    for i, x in enumerate(samples):
        if len(x) < 2:
            raise ValueError(f"Line {i+1} of file {fn} must have 2 columns")
        names.append(x[0])
        for y in x[1:]:
            y = os.path.join(dr, y)
            if not os.path.isfile(y):
                raise ValueError(f"File {y} does not exist")
            files.append(y)
    # The same BAM file must not be listed more than once
    if len(files) != len(set(files)):
        raise ValueError("Duplicated files are found (but should not be allowed) in BAM file list")
    return names, files

sample_id, bam = get_samples(samples, data_dir)

## `leafcutter`

Documentation: [`leafcutter`](https://davidaknowles.github.io/leafcutter/index.html). The choices of `regtools` parameters are [discussed here](https://github.com/davidaknowles/leafcutter/issues/127).


In [7]:
[leafcutter_1]
# Minimum anchor length (regtools -a)
parameter: min_anchor_len = 8
# Minimum intron length (regtools -m)
parameter: min_intron_len = 50
# Maximum intron length (regtools -M)
parameter: max_intron_len = 500000
input: bam, group_by = 1
output: f'{cwd}/{_input:bn}.junc'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads
bash: expand= "${ }", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout', container=container
    # Write junctions to the declared output under cwd, not next to the input BAM
    regtools junctions extract -a ${min_anchor_len} -m ${min_intron_len} -M ${max_intron_len} -o ${_output} ${_input}
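
The `.junc` files written by this step can be inspected with a short parser. The sketch below assumes the BED12-like layout that `regtools junctions extract` produces, where columns 2 and 3 span the junction including its anchors, column 5 holds the read count, and column 11 holds the two anchor (block) sizes. The intron itself is then obtained by trimming the anchors. The function name and the example line are hypothetical.

```python
def parse_junc_line(line):
    """Parse one BED12-like junction line into intron coordinates and read count."""
    fields = line.rstrip("\n").split("\t")
    chrom, start, end, name, score, strand = fields[:6]
    # Column 11: sizes of the left and right anchors flanking the intron
    left_anchor, right_anchor = (int(s) for s in fields[10].split(",")[:2])
    return {
        "chrom": chrom,
        "intron_start": int(start) + left_anchor,  # BED-style, 0-based
        "intron_end": int(end) - right_anchor,
        "strand": strand,
        "reads": int(score),
    }


# Made-up example line, not taken from real data
example = "chr1\t14729\t15038\tJUNC00000001\t12\t-\t14729\t15038\t255,0,0\t2\t100,90\t0,219"
print(parse_junc_line(example))
```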

In [7]:
[leafcutter_2]
# Minimum number of reads supporting a cluster (leafcutter -m)
parameter: min_split_reads = 50
# Maximum intron length (leafcutter -l)
parameter: max_intron_len = 500000
input: group_by = 'all'
output: f'{cwd}/{samples:bn}_intron_usage.counts.gz'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads
bash: expand= "${ }", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout', container=container
    # -j expects a text file listing the junction files, one path per line
    for f in ${_input}; do echo "$f"; done > ${_output:nn}.junc
    leafcutter_cluster_regtools.py -j ${_output:nn}.junc -m ${min_split_reads} -o ${_output:nn} -l ${max_intron_len}
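
Conceptually, the clustering step groups introns that are linked through shared splice sites. The sketch below illustrates that idea with a small union-find over made-up intron coordinates. It is not the `leafcutter` implementation, which additionally filters clusters by read support and intron length; the function name is hypothetical.

```python
def cluster_introns(introns):
    """Group (chrom, start, end) introns that share a start or an end coordinate."""
    parent = list(range(len(introns)))

    def find(i):
        # Find the cluster representative, compressing the path as we go
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    first_seen = {}  # splice site -> index of the first intron using it
    for i, (chrom, start, end) in enumerate(introns):
        for site in ((chrom, "start", start), (chrom, "end", end)):
            if site in first_seen:
                # Merge this intron's cluster with the one already using the site
                parent[find(i)] = find(first_seen[site])
            else:
                first_seen[site] = i

    clusters = {}
    for i, intron in enumerate(introns):
        clusters.setdefault(find(i), []).append(intron)
    return sorted(sorted(c) for c in clusters.values())


# Hypothetical introns: the first three are linked through shared splice sites
introns = [("chr1", 100, 200), ("chr1", 100, 300), ("chr1", 150, 300), ("chr1", 500, 600)]
for cluster in cluster_introns(introns):
    print(cluster)
```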

## `psichomics`

Documentation: [`psichomics`](http://bioconductor.org/packages/release/bioc/html/psichomics.html)