# Reference Processing
This module performs several preprocessing steps on reference data.
In particular:
1. Convert gff3 to gtf

Input: an uncompressed gff3 file (i.e. one that can be viewed with `cat`).

Output: a gtf file.

2. Produce a gene-collapsed version of the gtf

Input: a gtf file.

Output: a gtf file with a collapsed gene model.


3. Generate STAR index based on gtf and reference fasta

Input: a gtf file and an accompanying fasta file.

Output: A folder of STAR index.


4. Generate RSEM index based on gtf and reference fasta

Input: a gtf file and an accompanying fasta file.

Output: A folder of RSEM index.

### Working examples
To reproduce the examples, download the reference data described in ADSP_GCAD_genome_resources v2.pdf with the following commands:
1. gtf: `wget http://ftp.ensembl.org/pub/release-103/gff3/homo_sapiens/Homo_sapiens.GRCh38.103.chr.gtf`
2. fasta: `wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/GRCh38_reference_genome/GRCh38_full_analysis_set_plus_decoy_hla.fa`
3. ERCC.fa + ERCC.gtf: `wget https://tools.thermofisher.com/content/sfs/manuals/ERCC92.zip`

In [None]:
[global]
# The output directory for generated files. MUST BE FULL PATH
parameter: wd = path("./")
cwd = wd
# For cluster jobs, number of commands to run per job
parameter: job_size = 1
# Wall clock time expected
parameter: walltime = "5h"
# Memory expected
parameter: mem = "16G"
# Number of threads
parameter: numThreads = 8
# Software container option
parameter: container = ""


In [None]:
[gff3_to_gtf]
parameter: gff3_file = path
input: gff3_file
output: f'{wd}/{_input:bn}.gtf'
bash: container=container, expand= "${ }", stderr = f'{_input[0]}.stderr', stdout = f'{_input[0]}.stdout'
        gffread ${_input} -T -o ${_output}
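The step above expects a plain-text GFF3 file. The following is a minimal sketch, assuming gzip is the only compression of concern, of how a file could be checked before it is passed as `--gff3_file`. The helper name `is_gzip_compressed` and the file names are illustrative and not part of the pipeline.

```python
import gzip
import os
import tempfile


def is_gzip_compressed(path):
    """Return True if the file starts with the gzip magic bytes 0x1f 0x8b."""
    with open(path, "rb") as handle:
        return handle.read(2) == b"\x1f\x8b"


# Demonstration on two throwaway files (hypothetical names, not pipeline data).
with tempfile.TemporaryDirectory() as tmp:
    plain = os.path.join(tmp, "example.gff3")
    packed = os.path.join(tmp, "example.gff3.gz")
    with open(plain, "w") as out:
        out.write("##gff-version 3\n")
    with gzip.open(packed, "wt") as out:
        out.write("##gff-version 3\n")
    plain_is_gz = is_gzip_compressed(plain)
    packed_is_gz = is_gzip_compressed(packed)

print(plain_is_gz, packed_is_gz)
```

A file for which the check returns `True` would need to be decompressed (for example with `gunzip`) before running `gff3_to_gtf`.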

### Fasta Processing
1. Remove the HLA/ALT/decoy records from the fasta
2. Add the ERCC sequences to the fasta file
3. Generate the index for the fasta file

### Example code

Running the following command runs all the steps for fasta processing.

In [None]:
nohup sos run /home/hs3163/GIT/xqtl-pipeline/pipeline/molecular_phenotypes/Reference_processing.ipynb FASTA_index \
    --ERCC_fa /mnt/mfs/statgen/xqtl_workflow_testing/rna_quant_topmed/data/ERCC92.fa \
    --fasta  /mnt/mfs/statgen/xqtl_workflow_testing/rna_quant/data/GRCh38_full_analysis_set_plus_decoy_hla.fa \
    --container "/mnt/mfs/statgen/containers/xqtl_pipeline_sif/rna_quantification.sif" -s force &

In [None]:
[HLA_removal]
parameter: fasta = path
input: fasta
output:  f'{wd}/{_input:bn}.noALT_noHLA_noDecoy.fasta'
task: trunk_workers = 1, trunk_size = 1, walltime = '24h',  mem = '30G', tags = f'{step_name}_{_output[0]:bn}'
python: expand = "${ }", stderr = f'{_output}.stderr', stdout = f'{_output}.stdout',container = container
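    # Read the whole FASTA and split it into records on the '>' header marker.
    # Note: this loads the entire genome into memory.
    with open('${_input}', 'r') as fasta:
        contigs = fasta.read().split('>')

    # The contig ID is the first whitespace-delimited token of each record
    # (the empty chunk before the first '>' gets an empty ID and is kept).
    contig_ids = [c.split(None, 1)[0] if c.strip() else '' for c in contigs]

    # Exclude ALT, HLA and decoy contigs
    filtered_fasta = '>'.join(
        c for i, c in zip(contig_ids, contigs)
        if not (i.endswith('_alt') or i.startswith('HLA') or i.endswith('_decoy'))
    )

    with open('${_output}', 'w') as fasta:
        fasta.write(filtered_fasta)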
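The `HLA_removal` step reads the whole FASTA into memory before filtering. As an alternative, here is a minimal streaming sketch that applies the same naming rule (`_alt` suffix, `HLA` prefix, `_decoy` suffix) one line at a time. The function names `is_excluded` and `filter_fasta` and the example records are illustrative assumptions, not part of the pipeline.

```python
import io


def is_excluded(contig_id):
    """Naming rule used by the step above: ALT, HLA and decoy contigs."""
    return (contig_id.endswith("_alt")
            or contig_id.startswith("HLA")
            or contig_id.endswith("_decoy"))


def filter_fasta(lines, out):
    """Copy FASTA records from `lines` to `out`, skipping excluded contigs."""
    keep = True
    for line in lines:
        if line.startswith(">"):
            # The contig ID is the first whitespace-delimited token of the header.
            fields = line[1:].split()
            keep = not is_excluded(fields[0]) if fields else True
        if keep:
            out.write(line)


# Small made-up example with one record of each kind.
example = io.StringIO(
    ">chr1 primary\nACGT\n"
    ">chr1_KI270762v1_alt\nTTTT\n"
    ">HLA-A*01:01:01:01\nGGGG\n"
    ">chrUn_JTFH01000001v1_decoy\nCCCC\n"
    ">chrM\nAAAA\n"
)
result = io.StringIO()
filter_fasta(example, result)
filtered = result.getvalue()
print(filtered)
```

With real files, `lines` and `out` would be open file handles, so memory use stays independent of genome size.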

In [None]:
[FASTA_merge]
parameter: ERCC_fa = path
input: output_from("HLA_removal")
output: f'{wd}/{_input:bn}_ERCC.fasta'
bash: expand = "${ }", stderr = f'{_output[0]}.stderr', stdout = f'{_output[0]}.stdout',container = container
    sed 's/ERCC-/ERCC_/g' ${ERCC_fa} >  ${ERCC_fa:n}.patched.fa
    cat ${_input} ${ERCC_fa:n}.patched.fa > ${_output}

In [None]:
[FASTA_index]
input: output_from("FASTA_merge")
output: f'{wd}/{_input:bn}.dict'
bash: expand = "${ }", stderr = f'{_output[0]}.stderr', stdout = f'{_output[0]}.stdout',container = container
    samtools faidx ${_input}
    java -jar /opt/picard-tools/picard.jar \
    CreateSequenceDictionary \
    R=${_input} \
    O=${_output}

### GTF Processing

This step modifies the gtf file for the following reasons:
1. RSEM requires the GTF input to use the same chromosome name format as the fasta file.

**For STAR, this problem can be solved by the now commented-out --sjdbGTFchrPrefix "chr" option**

2. collapse_annotation.py from GTEx requires the gtf to use transcript_type instead of transcript_biotype in its annotation.
**This problem can be solved by modifying collapse_annotation.py while building the docker image**

Once the problem with RSEM is solved, or when RSEM is no longer needed, the aforementioned remedies can be implemented and this step can be removed.

3. Adding in ERCC information to the gtf
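The first two adjustments listed above can be sketched per GTF line as follows. This is a minimal illustration, assuming standard 9-column, tab-separated GTF lines. The function name `harmonize_gtf_line`, the `fasta_uses_chr` flag and the example record (with made-up IDs `G1` and `T1`) are hypothetical, and this is not the pipeline's own implementation, which is written in R.

```python
def harmonize_gtf_line(line, fasta_uses_chr):
    """Match the contig naming of the FASTA and rename transcript_biotype."""
    if line.startswith("#") or not line.strip():
        return line  # comments and blank lines pass through unchanged
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 9:
        return line  # not a standard GTF record; leave it untouched
    has_chr = fields[0].startswith("chr")
    if fasta_uses_chr and not has_chr:
        fields[0] = "chr" + fields[0]
    elif not fasta_uses_chr and has_chr:
        fields[0] = fields[0][len("chr"):]
    # collapse_annotation.py expects transcript_type rather than transcript_biotype
    fields[8] = fields[8].replace("transcript_biotype", "transcript_type")
    return "\t".join(fields) + "\n"


# Made-up record in Ensembl style (no "chr" prefix, transcript_biotype key).
record = ('1\thavana\ttranscript\t100\t900\t.\t+\t.\t'
          'gene_id "G1"; transcript_id "T1"; transcript_biotype "lncRNA";\n')
harmonized = harmonize_gtf_line(record, fasta_uses_chr=True)
print(harmonized, end="")
```

Whether `fasta_uses_chr` should be true can be read off the first header line of the FASTA, which is also what the `chrom_reformating` step inspects.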

### Example commands
Running the following command generates the reformatted annotation gtf and the collapsed gtf, both with the ERCC records added.

In [None]:
nohup sos run /home/hs3163/GIT/xqtl-pipeline/pipeline/molecular_phenotypes/Reference_processing.ipynb gtf_merge \
    --ERCC_gtf /mnt/mfs/statgen/xqtl_workflow_testing/rna_quant_topmed/data/ERCC92.gtf \
    --gtf  /mnt/mfs/statgen/xqtl_workflow_testing/rna_quant/data/Homo_sapiens.GRCh38.103.chr.gtf \
    --fasta  /mnt/mfs/statgen/xqtl_workflow_testing/rna_quant_topmed/data/GRCh38_full_analysis_set_plus_decoy_hla.noALT_noHLA_noDecoy_ERCC.fasta  \
    -s force &

In [None]:
[chrom_reformating]
# Reference genome
parameter: gtf = path
parameter: fasta = path
parameter: empty_rows = 5
input: fasta, gtf
output:  f'{wd}/{_input[1]:bn}.reformated.gtf'
task: trunk_workers = 1, trunk_size = 1, walltime = '24h',  mem = '30G', tags = f'{step_name}_{_output[0]:bn}'
R: expand = "${ }", stderr = f'{_output}.stderr', stdout = f'{_output}.stdout'
    library("readr")
    library("stringr")
    library("dplyr")
    options(scipen = 999)
    # The first FASTA header shows whether contigs are named "chr1" or "1"
    fasta = system("head -1 ${_input[0]}",intern = TRUE)
    gtf = read_delim("${_input[1]}", delim = "\t", col_names = FALSE, skip = ${empty_rows})
    # Default to the unmodified table so gtf_mod exists when the naming already matches
    gtf_mod = gtf
    if(!str_detect(fasta,">chr")){
        gtf_mod = gtf%>%mutate(X1 = str_remove(X1,"^chr"))
    } else if(!any(str_detect(gtf$X1[1],"chr"))) {
        gtf_mod = gtf%>%mutate(X1 = paste0("chr",X1))
    }
    # collapse_annotation.py expects transcript_type rather than transcript_biotype
    if(any(str_detect(gtf_mod$X9,"transcript_biotype"))){
        gtf_mod = gtf_mod%>%mutate(X9 = str_replace_all(X9,"transcript_biotype","transcript_type"))
    }
    gtf_mod%>%write.table("${_output}",sep = "\t",quote = FALSE,col.names = F,row.names = F)

In [None]:
[collapse]
parameter: gtf = path
parameter: collapse_only_switch = False
input: gtf
output: f'{wd}/{_input:bn}{".collapse_only" if collapse_only_switch else ""}.gene.gtf'
bash: expand = "${ }", stderr = f'{_output[0]}.stderr', stdout = f'{_output[0]}.stdout',container = container
    collapse_annotation.py ${"--collapse-only" if collapse_only_switch else ""} ${_input} ${_output}
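To give an intuition for what a collapsed gene model is, the sketch below merges the exon intervals of one gene into their union. It assumes that collapsing amounts to taking the union of a gene's exons across transcripts; it is not the `collapse_annotation.py` implementation, which applies further rules. The function name `merge_intervals`, the choice to also merge directly adjacent intervals, and the coordinates are illustrative assumptions.

```python
def merge_intervals(intervals):
    """Merge overlapping or adjacent closed intervals (1-based, inclusive, as in GTF)."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1] + 1:
            # Overlaps or touches the previous interval: extend it.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(interval) for interval in merged]


# Made-up exon coordinates from several transcripts of a single gene.
exons = [(100, 200), (150, 300), (400, 500), (501, 600)]
collapsed = merge_intervals(exons)
print(collapsed)
```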

In [None]:
[ERCC_gtf]
parameter: ERCC_gtf = path
input: ERCC_gtf
output: f'{wd}/{_input:bn}.genes.patched.gtf'
python: expand = "${ }", stderr = f'{_output[0]}.stderr', stdout = f'{_output[0]}.stdout',container = container
    with open('${_input}') as exon_gtf, open('${_output}', 'w') as gene_gtf:
        for line in exon_gtf:
            f = line.strip().split('\t')
            f[0] = f[0].replace('-','_')  # required for RNA-SeQC/GATK (no '-' in contig name)
        
            attr = f[8]
            if attr[-1]==';':
                attr = attr[:-1]
            attr = dict([i.split(' ') for i in attr.replace('"','').split('; ')])
            # add gene_name, gene_type
            attr['gene_name'] = attr['gene_id']
            attr['gene_type'] = 'ercc_control'
            attr['gene_status'] = 'KNOWN'
            attr['level'] = 2
            for k in ['id', 'type', 'name', 'status']:
                attr['transcript_'+k] = attr['gene_'+k]
        
            attr_str = []
            for k in ['gene_id', 'transcript_id', 'gene_type', 'gene_status', 'gene_name',
                'transcript_type', 'transcript_status', 'transcript_name']:
                attr_str.append('{0:s} "{1:s}";'.format(k, attr[k]))
            attr_str.append('{0:s} {1:d};'.format('level', attr['level']))
            f[8] = ' '.join(attr_str)
        
            # write gene, transcript, exon
            gene_gtf.write('\t'.join(f[:2]+['gene']+f[3:])+'\n')
            gene_gtf.write('\t'.join(f[:2]+['transcript']+f[3:])+'\n')
            f[8] = ' '.join(attr_str[:2])
            gene_gtf.write('\t'.join(f[:2]+['exon']+f[3:])+'\n')

In [None]:
[gtf_merge]
parameter: gtf = path
input:  output_from("chrom_reformating") ,output_from("collapse"),output_from("ERCC_gtf")
output: f'{wd}/{_input[0]:bn}.ERCC.gtf', f'{wd}/{_input[1]:bn}.ERCC.gtf'
bash: expand = "${ }", stderr = f'{_output[0]}.stderr', stdout = f'{_output[0]}.stdout',container = container
    cat ${_input[0]} ${_input[2]} > ${_output[0]}
    cat ${_input[1]} ${_input[2]} > ${_output[1]}

## Generating indexing file for `STAR` 
This step generates the index files for STAR alignment. The index only needs to be generated once and can be reused.

At least 40GB of memory is needed.

### Step Inputs:
* `wd`: the output directory; the index is written to `{wd}/STAR_Index`.
* `gtf` and `fasta`: paths to the reference annotation and sequence. Both need to be uncompressed.
* `sjdbOverhang`: specifies the length of the genomic sequence around the annotated junction to be used in constructing the splice junctions database. Ideally, this length should be equal to the ReadLength-1, where ReadLength is the length of the reads.
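The ReadLength-1 rule for `sjdbOverhang` can be written out as a short calculation. This is a minimal sketch; the function name `sjdb_overhang` is illustrative, and using the longest read when lengths vary is the convention assumed here.

```python
def sjdb_overhang(read_lengths):
    """Return max(ReadLength) - 1, the value suggested above for sjdbOverhang."""
    longest = max(read_lengths)
    if longest < 2:
        raise ValueError("read length must be at least 2")
    return longest - 1


overhang_101 = sjdb_overhang([101])        # 101 bp reads
overhang_mixed = sjdb_overhang([76, 151])  # mixed lengths: use the longest
print(overhang_101, overhang_mixed)
```

Under this rule, the step's default of `sjdbOverhang = 150` corresponds to 151 bp reads; for other read lengths the value can be overridden with `--sjdbOverhang`.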

### Step Output:
* Index files stored in `{wd}/STAR_Index`, which will be used by `STAR`

In [None]:
[STAR_indexing]

# The directory for STAR index
# Reference genome
parameter: gtf = path
parameter: fasta = path

# Length:
parameter: sjdbOverhang = 150
input: fasta, gtf
output: f'{wd}/STAR_Index/genomeParameters.txt'
task: trunk_workers = 1, trunk_size = 1, walltime = '24h',  mem = '40G', tags = f'{step_name}_{_output[0]:bn}'
bash: container=container, expand= "${ }", stderr = f'{_input[0]}.stderr', stdout = f'{_input[0]}.stdout'
    STAR --runMode genomeGenerate \
         --genomeDir ${_output:d} \
         --genomeFastaFiles ${_input[0]} \
         --sjdbGTFfile ${_input[1]} \
         --sjdbOverhang ${sjdbOverhang} \
         --runThreadN ${numThreads} #--sjdbGTFchrPrefix "chr" 

## Generating indexing file for `RSEM`
This step generates the index files for `RSEM`. The index only needs to be generated once.

### Step Inputs:

* `wd`: the output directory; the index is written to `{wd}/RSEM_Index`.
* `gtf` and `fasta`: paths to the reference annotation and sequence.
* `name`: a string parameter that the step declares as required; the command shown in this step does not use it.

### Step Outputs:
* Index files stored in `{wd}/RSEM_Index` with the prefix `rsem_reference`, which will be used by `RSEM`

### Example Command

In [None]:
[RSEM_indexing]
# Output directory:

# Reference genome
parameter: gtf = path
parameter: fasta = path
parameter: name = str
input: fasta, gtf
output: f'{wd}/RSEM_Index/rsem_reference.idx.fa'
task: trunk_workers = 1, trunk_size = 1, walltime = '24h',  mem = '40G', tags = f'{step_name}_{_output[0]:bn}'
bash: container=container, expand= "${ }", stderr = f'{_input[0]}.stderr', stdout = f'{_input[0]}.stdout'
    rsem-prepare-reference \
            ${_input[0]} \
            ${_output:nn} \
            --gtf ${_input[1]} \
            --num-threads ${numThreads}
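Because RSEM needs the GTF and the fasta to use the same chromosome names, it can help to compare the two files before building the index. The following is a minimal sketch of such a check; the function names and the example records are illustrative assumptions and not part of the pipeline.

```python
import io


def fasta_contigs(lines):
    """Collect contig IDs (first token of each '>' header line)."""
    names = set()
    for line in lines:
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:
                names.add(fields[0])
    return names


def gtf_contigs(lines):
    """Collect the first column of every non-comment, non-blank GTF line."""
    names = set()
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        names.add(line.split("\t", 1)[0])
    return names


def missing_from_fasta(fasta_lines, gtf_lines):
    """Return GTF contig names that have no matching FASTA record."""
    return sorted(gtf_contigs(gtf_lines) - fasta_contigs(fasta_lines))


# Made-up inputs: the GTF uses "1" while the FASTA uses "chr1".
fasta = io.StringIO(">chr1 primary\nACGT\n>ERCC_00002\nGGCC\n")
gtf = io.StringIO(
    "#!genome-build GRCh38\n"
    "1\thavana\tgene\t100\t900\t.\t+\t.\tgene_id \"G1\";\n"
    "ERCC_00002\tERCC\texon\t1\t4\t.\t+\t.\tgene_id \"ERCC_00002\";\n"
)
missing = missing_from_fasta(fasta, gtf)
print(missing)
```

An empty list would indicate that every contig named in the GTF is present in the fasta; any name listed points to a naming mismatch to resolve first.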