# Reference data standardization

This module provides reference data download, indexing and preprocessing (if necessary), in preparation for use throughout the pipeline.

We have included the PDF document compiled by the data standardization subgroup in the [minimal working example folder on Google Drive](https://drive.google.com/file/d/1R5sw5o8vqk_mbQQb4CGmtH3ldu1T3Vu0/view?usp=sharing). It specifies the reference data to use for the project.

## Overview

This module is based on the [TOPMed workflow from Broad](https://github.com/broadinstitute/gtex-pipeline/blob/master/TOPMed_RNAseq_pipeline.md).

Workflows implemented include:

### Convert transcript feature file gff3 to gtf

- Input: an uncompressed gff3 file (i.e., one that can be viewed with `cat`).
- Output: a gtf file.

### Collapse transcript features into genes

- Input: a gtf file.
- Output: a gtf file with a collapsed gene model.

### Generate STAR index based on gtf and reference fasta

- Input: a gtf file and an accompanying fasta file.
- Output: a folder containing the STAR index.


### Generate RSEM index based on gtf and reference fasta

- Input: a gtf file and an accompanying fasta file.
- Output: a folder containing the RSEM index.

## Example commands

To download reference data:

In [None]:
sos run reference_data.ipynb download_hg_reference --cwd reference_data
sos run reference_data.ipynb download_gene_annotation --cwd reference_data
sos run reference_data.ipynb download_ercc_reference --cwd reference_data

To format reference data:

In [None]:
sos run reference_data.ipynb hg_reference \
    --cwd reference_data \
    --ercc-reference reference_data/ERCC92.fa \
    --hg-reference reference_data/GRCh38_full_analysis_set_plus_decoy_hla.fa \
    --container container/rna_quantification.sif

To format gene feature data:

In [None]:
sos run reference_data.ipynb gene_annotation \
    --cwd reference_data \
    --ercc-gtf reference_data/ERCC92.gtf \
    --hg-gtf reference_data/Homo_sapiens.GRCh38.103.chr.gtf \
    --hg-reference reference_data/GRCh38_full_analysis_set_plus_decoy_hla.noALT_noHLA_noDecoy_ERCC.fasta \
    --container container/rna_quantification.sif

**For stranded RNA-seq protocols, add `--is-stranded` to the command above. More details can be found later in this document.**

To generate STAR index using the GTF annotation file before gene model collapse:

In [None]:
sos run reference_data.ipynb STAR_index \
    --cwd reference_data \
    --hg-reference reference_data/GRCh38_full_analysis_set_plus_decoy_hla.noALT_noHLA_noDecoy_ERCC.fasta \
    --hg-gtf reference_data/Homo_sapiens.GRCh38.103.chr.reformatted.ERCC.gtf \
    --container container/rna_quantification.sif \
    --mem 40G

**Note that the command above requires at least 40G of memory and can take a long time to complete.**

To generate RSEM index:

In [None]:
sos run reference_data.ipynb RSEM_index \
    --cwd reference_data \
    --hg-reference reference_data/GRCh38_full_analysis_set_plus_decoy_hla.noALT_noHLA_noDecoy_ERCC.fasta \
    --hg-gtf reference_data/Homo_sapiens.GRCh38.103.chr.reformatted.ERCC.gtf \
    --container container/rna_quantification.sif \
    --mem 40G

## Command interface

In [1]:
sos run reference_data.ipynb -h

usage: sos run reference_data.ipynb [workflow_name | -t targets] [options] [workflow_options]
  workflow_name:        Single or combined workflows defined in this script
  targets:              One or more targets to generate
  options:              Single-hyphen sos parameters (see "sos run -h" for details)
  workflow_options:     Double-hyphen workflow-specific parameters

Workflows:
  download_hg_reference
  download_gene_annotation
  download_ercc_reference
  gff3_to_gtf
  hg_reference
  hg_gtf
  ercc_gtf
  gene_annotation
  STAR_index
  RSEM_index

Global Workflow Options:
  --cwd VAL (as path, required)
                        The output directory for generated files.
  --job-size 1 (as int)
                        For cluster jobs, number of commands to run per job
  --walltime 5h
                        Wall clock time expected
  --mem 16G
                        Memory expected
  --numThreads 8 (as int)
                        Number of threads
  --container ''
               

In [None]:
[global]
# The output directory for generated files.
parameter: cwd = path
# For cluster jobs, number of commands to run per job
parameter: job_size = 1
# Wall clock time expected
parameter: walltime = "5h"
# Memory expected
parameter: mem = "16G"
# Number of threads
parameter: numThreads = 8
# Software container option
parameter: container = ""
cwd = path(f'{cwd:a}')
from sos.utils import expand_size

## Data download

In [None]:
[download_hg_reference]
output: f"{cwd:a}/GRCh38_full_analysis_set_plus_decoy_hla.fa"
download: dest_dir = cwd
    ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/GRCh38_reference_genome/GRCh38_full_analysis_set_plus_decoy_hla.fa

In [None]:
[download_gene_annotation]
output: f"{cwd:a}/Homo_sapiens.GRCh38.103.chr.gtf"
download: dest_dir = cwd, decompress=True
    http://ftp.ensembl.org/pub/release-103/gtf/homo_sapiens/Homo_sapiens.GRCh38.103.chr.gtf.gz

In [None]:
[download_ercc_reference]
output: f"{cwd:a}/ERCC92.gtf", f"{cwd:a}/ERCC92.fa"
download: dest_dir = cwd, decompress=True
    https://tools.thermofisher.com/content/sfs/manuals/ERCC92.zip

## GFF3 to GTF formatting

In [None]:
[gff3_to_gtf]
parameter: gff3_file = path
input: gff3_file
output: f'{cwd}/{_input:bn}.gtf'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime,  mem = mem, tags = f'{step_name}_{_output:bn}'
bash: container=container, expand= "${ }", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout'
        gffread ${_input} -T -o ${_output}

## HG reference file preprocessing

1. Remove the HLA/ALT/decoy records from the fasta file
2. Add ERCC information to the fasta file
3. Generate an index for the fasta file
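
The filtering rule in step 1 can be illustrated with a small streaming sketch. This is not the code used by the workflow (which reads the whole file into memory); the helper names `keep_contig` and `filter_fasta` and the toy sequences are hypothetical, and the contig name is assumed to be the first whitespace-delimited token of each header line.

```python
import io

def keep_contig(name: str) -> bool:
    """Return True for contigs that are not ALT, HLA, or decoy records."""
    return not (name.endswith("_alt") or name.startswith("HLA") or name.endswith("_decoy"))

def filter_fasta(src, dst) -> None:
    """Copy FASTA records from src to dst, skipping excluded contigs."""
    keep = False
    for line in src:
        if line.startswith(">"):
            # Assumption: the contig name is the first token of the header.
            fields = line[1:].split()
            keep = bool(fields) and keep_contig(fields[0])
        if keep:
            dst.write(line)

# Toy input with made-up sequences (hypothetical, for illustration only)
toy = io.StringIO(
    ">chr1 test\nACGT\n"
    ">chr1_KI270762v1_alt\nGGGG\n"
    ">HLA-A*01:01\nCCCC\n"
    ">chrUn_JTFH01000001v1_decoy\nTTTT\n"
    ">chrM\nAAAA\n"
)
out = io.StringIO()
filter_fasta(toy, out)
filtered = out.getvalue()  # only the chr1 and chrM records remain
```

Processing the file line by line keeps memory use low for a genome-sized fasta, at the cost of slightly more code than the read-and-split approach.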

In [None]:
[hg_reference_1 (HLA ALT Decoy removal)]
# Path to HG reference file
parameter: hg_reference = path
input: hg_reference
output:  f'{cwd}/{_input:bn}.noALT_noHLA_noDecoy.fasta'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime,  mem = mem, tags = f'{step_name}_{_output:bn}'
python: expand = "${ }", stderr = f'{_output}.stderr', stdout = f'{_output}.stdout', container = container
    with open('${_input}', 'r') as fasta:
        contigs = fasta.read()
        contigs = contigs.split('>')
        contig_ids = [i.split(' ', 1)[0] for i in contigs]

        # exclude ALT, HLA and decoy contigs
        filtered_fasta = '>'.join([c for i,c in zip(contig_ids, contigs)
        if not (i[-4:]=='_alt' or i[:3]=='HLA' or i[-6:]=='_decoy')])
    
    with open('${_output}', 'w') as fasta:
        fasta.write(filtered_fasta)

In [None]:
[hg_reference_2 (merge with ERCC reference)]
parameter: ercc_reference = path
output: f'{cwd}/{_input:bn}_ERCC.fasta'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime,  mem = mem, tags = f'{step_name}_{_output:bn}'
bash: expand = "${ }", stderr = f'{_output[0]}.stderr', stdout = f'{_output}.stdout', container = container
    sed 's/ERCC-/ERCC_/g' ${ercc_reference} >  ${ercc_reference:n}.patched.fa
    cat ${_input} ${ercc_reference:n}.patched.fa > ${_output}

In [None]:
[hg_reference_3 (index the fasta file)]
output: f'{cwd}/{_input:bn}.dict'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime,  mem = mem, tags = f'{step_name}_{_output:bn}'
bash: expand = "${ }", stderr = f'{_output}.stderr', stdout = f'{_output}.stdout', container = container
    samtools faidx ${_input}
    java -jar /opt/picard-tools/picard.jar \
    CreateSequenceDictionary \
    R=${_input} \
    O=${_output}

## Transcript and gene model reference processing

This step modifies the `gtf` file for the following reasons:

1. RSEM requires the GTF input to use the same chromosome name format (with the `chr` prefix) as the fasta file. **Although for STAR this problem can be solved with the now commented-out `--sjdbGTFchrPrefix "chr"` option, we have to add `chr` to the GTF for use with RSEM.**
2. The gene model collapsing script `collapse_annotation.py` from GTEx requires the GTF to have `transcript_type` instead of `transcript_biotype` in its annotation. We rename it here. **This problem could also be solved by modifying `collapse_annotation.py` while building the docker image, but since we are already doing 1 above, we think it is better to add this customization here as well.**
3. Add ERCC information to the `gtf` reference.

We may reimplement 1 and 2 if the problem with RSEM is solved, or when RSEM is no longer needed.
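
As an illustration of points 1 and 2, the per-line transformation could be sketched in Python as below. The workflow itself does this in R; the function name `harmonize_gtf_line` and the example record are hypothetical, and the sketch assumes a well-formed, tab-delimited GTF line with nine fields.

```python
def harmonize_gtf_line(line: str, fasta_has_chr: bool) -> str:
    """Match the contig naming of the fasta and rename transcript_biotype."""
    if line.startswith("#"):
        # Header/comment lines are passed through unchanged in this sketch.
        return line
    fields = line.rstrip("\n").split("\t")
    contig = fields[0]
    if fasta_has_chr and not contig.startswith("chr"):
        fields[0] = "chr" + contig
    elif not fasta_has_chr and contig.startswith("chr"):
        fields[0] = contig[len("chr"):]
    # collapse_annotation.py expects transcript_type rather than transcript_biotype
    fields[8] = fields[8].replace("transcript_biotype", "transcript_type")
    return "\t".join(fields) + "\n"

# Hypothetical Ensembl-style record without the chr prefix
example = (
    '1\thavana\ttranscript\t11869\t14409\t.\t+\t.\t'
    'gene_id "ENSG00000223972"; transcript_biotype "processed_transcript";\n'
)
result = harmonize_gtf_line(example, fasta_has_chr=True)
```

Here `result` starts with `chr1` and carries `transcript_type` in its attribute column, which is the form both RSEM and the collapsing script expect.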

In [None]:
[hg_gtf_1 (add chr prefix to gtf file)]
parameter: hg_reference = path
parameter: hg_gtf = path
input: hg_reference, hg_gtf
output: f'{cwd}/{_input[1]:bn}.reformatted.gtf'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime,  mem = mem, tags = f'{step_name}_{_output:bn}'
R: expand = "${ }", stderr = f'{_output}.stderr', stdout = f'{_output}.stdout', container = container
    library("readr")
    library("stringr")
    library("dplyr")
    options(scipen = 999)
    con <- file("${_input[0]}","r")
    fasta <- readLines(con,n=1)
    close(con)
    gtf = read_delim("${_input[1]}", "\t",  col_names  = F, comment = "#", col_types="ccccccccc")
    if(!str_detect(fasta,">chr")) {
        gtf_mod = gtf%>%mutate(X1 = str_remove_all(X1,"chr"))
    } else if (!any(str_detect(gtf$X1[1],"chr"))) {
        gtf_mod = gtf%>%mutate(X1 = paste0("chr",X1))
    } else {
        # fasta and gtf both already use the chr prefix: keep the gtf unchanged
        gtf_mod = gtf
    }
    if(any(str_detect(gtf_mod$X9, "transcript_biotype"))) {
      gtf_mod = gtf_mod%>%mutate(X9 = str_replace_all(X9,"transcript_biotype","transcript_type"))
    }
    gtf_mod%>%write.table("${_output}",sep = "\t",quote = FALSE,col.names = F,row.names = F)

**Text below is taken from https://github.com/broadinstitute/gtex-pipeline/tree/master/gene_model**


Gene-level expression and eQTLs from the GTEx project are calculated based on a collapsed gene model (i.e., combining all isoforms of a gene into a single transcript), according to the following rules:

1. Transcripts annotated as “retained_intron” or “read_through” are excluded. Additionally, transcripts that overlap with annotated read-through transcripts may be blacklisted (blacklists for GENCODE v19, 24 & 25 are provided in this repository; no transcripts were blacklisted for v26).
2. The union of all exon intervals of each gene is calculated.
3. Overlapping intervals between genes are excluded from all genes.


The purpose of step 3 is primarily to exclude overlapping regions from genes annotated on both strands, which can't be unambiguously quantified from unstranded RNA-seq (GTEx samples were sequenced using an unstranded protocol). For stranded protocols, this step can be skipped by adding the `--collapse_only` flag.

Further documentation is available on the [GTEx Portal](https://gtexportal.org/home/documentationPage#staticTextAnalysisMethods).
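
Rules 2 and 3 can be illustrated with a toy interval computation. This is a simplified sketch, not the logic of `collapse_annotation.py`: it ignores strands, contigs, and transcript filtering, uses half-open integer intervals, and all function and gene names are hypothetical.

```python
def merge(intervals):
    """Union of possibly overlapping half-open intervals, returned sorted."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(iv) for iv in merged]

def subtract(intervals, mask):
    """Remove every region in `mask` from `intervals` (both sorted and merged)."""
    result = []
    for start, end in intervals:
        cursor = start
        for m_start, m_end in mask:
            if m_end <= cursor or m_start >= end:
                continue  # no overlap with the remaining part of this interval
            if m_start > cursor:
                result.append((cursor, m_start))
            cursor = max(cursor, m_end)
        if cursor < end:
            result.append((cursor, end))
    return result

def collapse(exons_by_gene, collapse_only=False):
    """Rule 2: union exons per gene. Rule 3: drop regions shared between genes."""
    unions = {gene: merge(exons) for gene, exons in exons_by_gene.items()}
    if collapse_only:
        return unions  # skip rule 3, as described for stranded protocols
    collapsed = {}
    for gene, intervals in unions.items():
        others = merge([iv for g, ivs in unions.items() if g != gene for iv in ivs])
        collapsed[gene] = subtract(intervals, others)
    return collapsed

# Two hypothetical genes whose exons overlap between positions 250 and 300
toy = {"geneA": [(100, 200), (150, 300)], "geneB": [(250, 400)]}
collapsed = collapse(toy)
unions_only = collapse(toy, collapse_only=True)
```

In this toy case the shared region is removed from both genes, leaving `geneA` with `(100, 250)` and `geneB` with `(300, 400)`, while `collapse_only=True` keeps the full unions.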

In [None]:
[hg_gtf_2 (collapsed gene model)]
parameter: is_stranded = False
output: f'{_input:n}{".collapse_only" if is_stranded else ""}.gene.gtf'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, tags = f'{step_name}_{_output:bn}'
bash: expand = "${ }", stderr = f'{_output}.stderr', stdout = f'{_output}.stdout', container = container
    collapse_annotation.py ${"--collapse_only" if is_stranded else ""} ${_input} ${_output}

In [None]:
[ercc_gtf (Preprocess ERCC gtf file)]
parameter: ercc_gtf = path
input: ercc_gtf
output: f'{cwd}/{_input:bn}.genes.patched.gtf'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime,  mem = mem, tags = f'{step_name}_{_output:bn}'
python: expand = "${ }", stderr = f'{_output}.stderr', stdout = f'{_output}.stdout', container = container
    with open('${_input}') as exon_gtf, open('${_output}', 'w') as gene_gtf:
        for line in exon_gtf:
            f = line.strip().split('\t')
            f[0] = f[0].replace('-','_')  # required for RNA-SeQC/GATK (no '-' in contig name)
        
            attr = f[8]
            if attr[-1]==';':
                attr = attr[:-1]
            attr = dict([i.split(' ') for i in attr.replace('"','').split('; ')])
            # add gene_name, gene_type
            attr['gene_name'] = attr['gene_id']
            attr['gene_type'] = 'ercc_control'
            attr['gene_status'] = 'KNOWN'
            attr['level'] = 2
            for k in ['id', 'type', 'name', 'status']:
                attr['transcript_'+k] = attr['gene_'+k]
        
            attr_str = []
            for k in ['gene_id', 'transcript_id', 'gene_type', 'gene_status', 'gene_name',
                'transcript_type', 'transcript_status', 'transcript_name']:
                attr_str.append('{0:s} "{1:s}";'.format(k, attr[k]))
            attr_str.append('{0:s} {1:d};'.format('level', attr['level']))
            f[8] = ' '.join(attr_str)
        
            # write gene, transcript, exon
            gene_gtf.write('\t'.join(f[:2]+['gene']+f[3:])+'\n')
            gene_gtf.write('\t'.join(f[:2]+['transcript']+f[3:])+'\n')
            f[8] = ' '.join(attr_str[:2])
            gene_gtf.write('\t'.join(f[:2]+['exon']+f[3:])+'\n')

In [None]:
[gene_annotation]
input: output_from("hg_gtf_1"), output_from("hg_gtf_2"), output_from("ercc_gtf")
output: f'{cwd}/{_input[0]:bn}.ERCC.gtf', f'{cwd}/{_input[1]:bn}.ERCC.gtf'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime,  mem = mem, tags = f'{step_name}_{_output:bn}'
bash: expand = "${ }", stderr = f'{_output[0]}.stderr', stdout = f'{_output[0]}.stdout', container = container
    cat ${_input[0]} ${_input[2]} > ${_output[0]}
    cat ${_input[1]} ${_input[2]} > ${_output[1]}

## Generating index file for `STAR` 

This step generates the index files for STAR alignment. The index only needs to be generated once and can be reused.

**At least 40GB of memory is needed**.

### Step Inputs

* `gtf` and `fasta`: paths to the gene annotation and reference sequence files. Both need to be uncompressed. The `gtf` should be the one prior to collapsing by gene.
* `sjdbOverhang`: specifies the length of the genomic sequence around the annotated junction to be used in constructing the splice junctions database. Ideally, this length should be equal to the ReadLength-1, where ReadLength is the length of the reads. We use 100 here as recommended by the TOPMed pipeline. See here for [some additional discussions](https://groups.google.com/g/rna-star/c/h9oh10UlvhI/m/BfSPGivUHmsJ). 
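
The `ReadLength-1` guideline can be written as a one-line helper. This is a minimal sketch with a hypothetical function name; taking the maximum when read lengths vary reflects our understanding of the STAR documentation and should be checked against the manual for the STAR version in use.

```python
def sjdb_overhang(read_lengths):
    """Return max(read length) - 1 for an iterable of read lengths."""
    lengths = list(read_lengths)
    if not lengths or min(lengths) < 2:
        raise ValueError("read lengths must be integers of at least 2")
    return max(lengths) - 1

overhang_101 = sjdb_overhang([101])        # 101-bp reads give 100
overhang_mixed = sjdb_overhang([76, 101])  # hypothetical mixed lengths, also 100
```

A value computed this way can be passed to the step through `--sjdbOverhang` when the default of 100 does not match the data.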

### Step Output

* Index files stored in `{cwd}/STAR_Index`, which will be used by `STAR`

In [None]:
[STAR_index]
parameter: hg_gtf = path
parameter: hg_reference = path
# Specifies the length of the genomic sequence around the annotated junction to be used in constructing the splice junctions database. Ideally, this length should be equal to the ReadLength-1, where ReadLength is the length of the reads.
# Default choice follows from TOPMed pipeline recommendation.
parameter: sjdbOverhang = 100
fail_if(expand_size(mem) < expand_size('40G'), msg = "At least 40GB of memory is required for this step")
input: hg_reference, hg_gtf
output: f"{cwd}/STAR_Index/chrName.txt", f"{cwd}/STAR_Index/Log.out", 
        f"{cwd}/STAR_Index/transcriptInfo.tab", f"{cwd}/STAR_Index/exonGeTrInfo.tab", 
        f"{cwd}/STAR_Index/SAindex", f"{cwd}/STAR_Index/SA", f"{cwd}/STAR_Index/genomeParameters.txt", 
        f"{cwd}/STAR_Index/chrStart.txt", f"{cwd}/STAR_Index/sjdbList.out.tab", 
        f"{cwd}/STAR_Index/exonInfo.tab", f"{cwd}/STAR_Index/sjdbList.fromGTF.out.tab", 
        f"{cwd}/STAR_Index/chrLength.txt", f"{cwd}/STAR_Index/sjdbInfo.txt", 
        f"{cwd}/STAR_Index/Genome", f"{cwd}/STAR_Index/chrNameLength.txt", 
        f"{cwd}/STAR_Index/geneInfo.tab"
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, tags = f'{step_name}_{_output[0]:bd}'
bash: container=container, expand= "${ }", stderr = f'{_output[0]:d}.stderr', stdout = f'{_output[0]:d}.stdout'
    STAR --runMode genomeGenerate \
         --genomeDir ${_output[0]:d} \
         --genomeFastaFiles ${_input[0]} \
         --sjdbGTFfile ${_input[1]} \
         --sjdbOverhang ${sjdbOverhang} \
         --runThreadN ${numThreads} #--sjdbGTFchrPrefix "chr" 

## Generating index file for `RSEM`

This step generates the index files for `RSEM`. The index only needs to be generated once.

### Step Inputs

* `gtf` and `fasta`: paths to the gene annotation and reference sequence files. The `gtf` should be the one prior to collapsing by gene.

### Step Outputs
* Index files stored in `{cwd}/RSEM_Index`, which will be used by `RSEM`

In [None]:
[RSEM_index]
parameter: hg_gtf = path
parameter: hg_reference = path
input: hg_reference, hg_gtf
output: f"{cwd}/RSEM_Index/rsem_reference.n2g.idx.fa", f"{cwd}/RSEM_Index/rsem_reference.grp", 
        f"{cwd}/RSEM_Index/rsem_reference.idx.fa", f"{cwd}/RSEM_Index/rsem_reference.ti", 
        f"{cwd}/RSEM_Index/rsem_reference.chrlist", f"{cwd}/RSEM_Index/rsem_reference.seq", 
        f"{cwd}/RSEM_Index/rsem_reference.transcripts.fa"
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, tags = f'{step_name}_{_output[0]:bd}'
bash: container=container, expand= "${ }", stderr = f'{_output[0]:d}.stderr', stdout = f'{_output[0]:d}.stdout'
    rsem-prepare-reference \
            ${_input[0]} \
            ${_output[1]:n} \
            --gtf ${_input[1]} \
            --num-threads ${numThreads}