# Gene coordinate annotation


This workflow adds genomic coordinate annotation to gene-level molecular phenotype files (for example, files in GCT format) and converts them to `bed` format.

## Overview

This pipeline is based on [`pyqtl`, as demonstrated here](https://github.com/broadinstitute/gtex-pipeline/blob/master/qtl/src/eqtl_prepare_expression.py).

### Alternative implementation

Previously we used the `biomaRt` package in R instead of code from `pyqtl`. The core function calls are:

```r
    ensembl = useEnsembl(biomart = "ensembl", dataset = "hsapiens_gene_ensembl", version = "$[ensembl_version]")
    ensembl_df <- getBM(attributes=c("ensembl_gene_id","chromosome_name", "start_position", "end_position"),mart=ensembl)
```

The ENSEMBL version must be specified explicitly in this pipeline. As of 2021, the Brain xQTL project uses ENSEMBL version 103.

## Input

1. Molecular phenotype data, with ENSEMBL IDs in the first column and one column per sample after it.
2. GTF for the collapsed gene model
    - the gene names must be consistent with the GCT matrices (e.g. ENSG00000000003 vs. ENSG00000000003.1 will not work)
3. (Optional) Metadata to match sample names between the expression data and the genotype files
    - Tab-delimited, with a header
    - Exactly 2 columns: the first column is the sample name in the expression data, the second column is the sample name in the genotype data
    - **must contain all sample names in the expression matrices, even if they do not exist in the genotype data**
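
The two consistency requirements above (gene IDs matching the GTF, and the lookup covering every expression sample) can be checked before running the workflow. The following is a minimal sketch using only the Python standard library; the toy IDs, the sample names, and the helper names `strip_version` and `read_lookup` are hypothetical and not part of the pipeline.

```python
import csv
import io


def strip_version(gene_id):
    """Drop a trailing Ensembl version: 'ENSG00000000003.1' -> 'ENSG00000000003'."""
    return gene_id.split(".", 1)[0]


def read_lookup(handle):
    """Read the two-column, tab-delimited lookup (with header) into a dict."""
    reader = csv.reader(handle, delimiter="\t")
    next(reader)  # skip the header row
    return {row[0]: row[1] for row in reader if row}


# Hypothetical toy inputs, for illustration only.
pheno_header = ["gene_ID", "S1", "S2", "S3"]
pheno_gene_ids = ["ENSG00000000003.1", "ENSG00000000005.5"]
gtf_gene_ids = {"ENSG00000000003", "ENSG00000000005"}
lookup = read_lookup(io.StringIO("sample_id\tparticipant_id\nS1\tP1\nS2\tP2\n"))

# Gene IDs that have no exact match in the GTF.
unmatched = [g for g in pheno_gene_ids if g not in gtf_gene_ids]
# The same check after removing version suffixes.
unmatched_stripped = [g for g in pheno_gene_ids if strip_version(g) not in gtf_gene_ids]
# Expression samples that the lookup does not cover.
missing_samples = [s for s in pheno_header[1:] if s not in lookup]

print("unmatched gene IDs:", unmatched)
print("unmatched after stripping versions:", unmatched_stripped)
print("samples missing from lookup:", missing_samples)
```

In this toy case the versioned IDs fail the exact match and sample `S3` is reported as missing from the lookup. Whether stripping the version suffix is appropriate depends on how the GTF was prepared.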
    

## Output

Molecular phenotype data in `bed` format.
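
A quick way to sanity-check the output is to read it back and inspect the layout. The sketch below assumes the usual phenotype BED layout: a header starting with `#chr`, `start`, `end`, then an ID column, then one column per sample. The toy file content and the helper names are hypothetical. It relies on the fact that bgzip-compressed files can be read with Python's `gzip` module.

```python
import gzip
import os
import tempfile


def read_bed_gz(path):
    """Return (header, rows) from a gzip/bgzip-compressed, tab-delimited BED file."""
    with gzip.open(path, "rt") as handle:
        header = handle.readline().rstrip("\n").split("\t")
        rows = [line.rstrip("\n").split("\t") for line in handle if line.strip()]
    return header, rows


def starts_sorted_within_chr(rows):
    """True if start never decreases between consecutive rows on the same chromosome."""
    for prev, cur in zip(rows, rows[1:]):
        if prev[0] == cur[0] and int(prev[1]) > int(cur[1]):
            return False
    return True


# Hypothetical toy file standing in for the workflow output.
toy = (
    "#chr\tstart\tend\tgene_id\tP1\tP2\n"
    "chr1\t99\t100\tENSG_A\t1.5\t2.5\n"
    "chr1\t499\t500\tENSG_B\t0.1\t0.2\n"
    "chr2\t9\t10\tENSG_C\t3.0\t4.0\n"
)
path = os.path.join(tempfile.mkdtemp(), "toy.bed.gz")
with gzip.open(path, "wt") as handle:
    handle.write(toy)

header, rows = read_bed_gz(path)
samples = header[4:]  # everything after the ID column
one_bp_intervals = all(int(r[2]) - int(r[1]) == 1 for r in rows)

print("coordinate columns:", header[:3])
print("samples:", samples)
print("sorted within chromosome:", starts_sorted_within_chr(rows))
print("all intervals are 1 bp:", one_bp_intervals)
```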

## Minimal working example

The MWE is available on [Google Drive](https://drive.google.com/drive/u/0/folders/1Rv2bWHBbX_tastTh49ToYVDMV6rFP5Wk).

In [None]:
sos run gene_annotation.ipynb annotate_coord \
    --cwd output \
    --phenoFile data/MWE.pheno_log2cpm.tsv \
    --annotation-gtf reference_data/Homo_sapiens.GRCh38.103.chr.reformatted.gene.ERCC.gtf \
    --sample-participant-lookup data/sampleSheetAfterQC.txt \
    --container container/rna_quantification.sif --phenotype-id-type gene_name

## Command interface

In [5]:
sos run gene_annotation.ipynb -h

usage: sos run gene_annotation.ipynb
               [workflow_name | -t targets] [options] [workflow_options]
  workflow_name:        Single or combined workflows defined in this script
  targets:              One or more targets to generate
  options:              Single-hyphen sos parameters (see "sos run -h" for details)
  workflow_options:     Double-hyphen workflow-specific parameters

Workflows:
  annotate_coord
  annotate_coord_biomart

Global Workflow Options:
  --cwd VAL (as path, required)
                        Work directory & output directory
  --phenoFile VAL (as path, required)
                        Molecular phenotype matrix
  --job-size 1 (as int)
                        For cluster jobs, number commands to run per job
  --walltime 5h
                        Wall clock time expected
  --mem 16G
                        Memory expected
  --numThreads 1 (as int)
                        Number of threads
  --container ''

Sections
  annotate_coord:
    Workflow Options:

In [2]:
[global]
# Work directory & output directory
parameter: cwd = path
# Molecular phenotype matrix
parameter: phenoFile = path
# For cluster jobs, number of commands to run per job
parameter: job_size = 1
# Wall clock time expected
parameter: walltime = "5h"
# Memory expected
parameter: mem = "16G"
# Number of threads
parameter: numThreads = 1
# Container image in which to run the step actions (optional)
parameter: container = ""

## Implementation using `pyqtl`

This implementation is based on the [GTEx pipeline](https://github.com/broadinstitute/gtex-pipeline/blob/master/qtl/src/eqtl_prepare_expression.py).

In [None]:
[annotate_coord]
# Gene annotation table in GTF format
parameter: annotation_gtf = path
# File mapping sample IDs in the expression data to IDs in the genotype data. It must contain two columns, sample_id and participant_id (the two IDs can be the same).
parameter: sample_participant_lookup = path()
# Whether the phenotypes are named by gene_id or gene_name. The default is gene_id; set this to gene_name if the data uses gene names.
parameter: phenotype_id_type = 'gene_id'
input: phenoFile, annotation_gtf
output: f'{cwd:a}/{_input[0]:bn}.bed.gz'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime,  mem = mem, tags = f'{step_name}_{_output:bn}'  
python: expand= "${ }", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout', container = container
    import pandas as pd
    import qtl.io
    from pathlib import Path
    def prepare_bed(df, bed_template_df, chr_subset=None):
        bed_df = pd.merge(bed_template_df, df, left_index=True, right_index=True)
        # sort by start position
        bed_df = bed_df.groupby('chr', sort=False, group_keys=False).apply(lambda x: x.sort_values('start'))
        if chr_subset is not None:
            # subset chrs from VCF
            bed_df = bed_df[bed_df.chr.isin(chr_subset)]
        return bed_df
    # Load data
    df = pd.read_csv(${_input[0]:ar}, sep='\t', skiprows=0, index_col=0)
    sample_participant_lookup = Path("${sample_participant_lookup:a}")
    
    # change sample IDs to participant IDs (skipped when no lookup file is given)
    if sample_participant_lookup.is_file():
        # two-column lookup read as a Series: index = expression sample ID, value = genotype sample ID
        sample_participant_lookup_s = pd.read_csv(sample_participant_lookup, sep="\t", index_col=0, dtype={0:str,1:str}).squeeze("columns")
        df.rename(columns=sample_participant_lookup_s.to_dict(), inplace=True)
    # build the coordinate template from the GTF, then attach the phenotype values
    bed_template_df = qtl.io.gtf_to_tss_bed(${_input[1]:ar}, feature='transcript',phenotype_id = "${phenotype_id_type}" )
    bed_df = prepare_bed(df, bed_template_df)
    qtl.io.write_bed(bed_df, ${_output:r})
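
To make the logic of this step easier to follow, here is a minimal standard-library sketch of the same idea: derive a 1 bp, strand-aware TSS interval for each GTF record, join it with the phenotype values, and sort by start within each chromosome. It is not the `pyqtl` code. The TSS convention used (`end` = TSS, `start` = TSS - 1), the helper names, and the toy records are assumptions made for illustration.

```python
def parse_attributes(field):
    """Parse the GTF attribute column, e.g. 'gene_id "G1"; gene_name "A";'."""
    attrs = {}
    for item in field.split(";"):
        key, _, value = item.strip().partition(" ")
        if key:
            attrs[key] = value.strip().strip('"')
    return attrs


def tss_bed_from_gtf(lines, feature="transcript", id_key="gene_id"):
    """Return {phenotype ID: (chr, start, end)} with end = TSS and start = TSS - 1."""
    bed = {}
    for line in lines:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != feature:
            continue
        chrom, left, right, strand = fields[0], int(fields[3]), int(fields[4]), fields[6]
        # The TSS is the left end on the + strand and the right end on the - strand.
        tss = left if strand == "+" else right
        bed[parse_attributes(fields[8])[id_key]] = (chrom, tss - 1, tss)
    return bed


def annotate(expression, bed):
    """Join expression values with coordinates; IDs without coordinates are dropped."""
    chr_rank = {}
    for chrom, _, _ in bed.values():
        chr_rank.setdefault(chrom, len(chr_rank))  # keep chromosome order of the GTF
    rows = [(*bed[g], g, *values) for g, values in expression.items() if g in bed]
    rows.sort(key=lambda row: (chr_rank[row[0]], row[1]))
    return rows


# Hypothetical toy inputs: one transcript per gene, as in a collapsed gene model.
gtf_lines = [
    'chr1\ttoy\ttranscript\t1000\t2000\t.\t-\t.\tgene_id "G2"; gene_name "B";',
    'chr1\ttoy\ttranscript\t100\t500\t.\t+\t.\tgene_id "G1"; gene_name "A";',
    'chr2\ttoy\ttranscript\t50\t80\t.\t+\t.\tgene_id "G3"; gene_name "C";',
]
expression = {"G1": [1.0, 2.0], "G2": [3.0, 4.0], "G9": [5.0, 6.0]}

bed = tss_bed_from_gtf(gtf_lines)
rows = annotate(expression, bed)
for row in rows:
    print("\t".join(str(value) for value in row))
```

Passing `id_key="gene_name"` plays the role of `--phenotype-id-type gene_name` in this sketch. `G9` has no GTF record and is dropped, which is why mismatched IDs silently reduce the number of output rows.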

## Implementation using biomaRt

This workflow adds chromosome and position annotation (TSS, where start = end - 1) and `gene_ID` to the `bed` file. **This workflow is obsolete**.

In [None]:
[annotate_coord_biomart]
parameter: ensembl_version=int
input: phenoFile
output: f'{cwd:a}/{_input:bn}.bed.gz',
        f'{cwd:a}/{_input:bn}.region_list'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime,  mem = mem, tags = f'{step_name}_{_output[0]:bn}'  
R:  expand= "$[ ]", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout' ,container = container
    library("biomaRt")
    library(dplyr)
    library(readr)
    biomartCacheClear()
    gene_exp = readr::read_delim("$[_input[0]]",delim = "\t")
    if("#chr" %in% colnames(gene_exp) ){
      # need to re-annotate
      gene_exp = gene_exp[,4:ncol(gene_exp)]
    }
    ensembl = useEnsembl(biomart = "ensembl", dataset = "hsapiens_gene_ensembl", version = "$[ensembl_version]")
    ensembl_df <- getBM(attributes=c("ensembl_gene_id","chromosome_name", "start_position", "end_position"),mart=ensembl)
    my_genes = gene_exp$gene_ID
    keep_genes =  my_genes
    my_genes_ann = ensembl_df[match(my_genes, ensembl_df$ensembl_gene_id),]%>%filter(chromosome_name%in%1:23)%>%dplyr::rename( "#chr" = chromosome_name, "start" = start_position, "end" = end_position,"gene_ID" = ensembl_gene_id)%>%filter(gene_ID!="NA", gene_ID%in%keep_genes)
    my_genes_ann%>%select(`#chr`,start,end,gene_ID)%>%write_delim(path = "$[_output[1]]", delim = "\t")
    my_gene_bed = inner_join(my_genes_ann %>%mutate(end = start + 1) %>%select(`#chr`,start,end,gene_ID),gene_exp,by = "gene_ID" )%>%arrange(`#chr`,start) 
    my_gene_bed%>%readr::write_tsv( path = "$[_output[0]:n]", na = "NA", append = FALSE, col_names = TRUE, quote_escape = "double")

bash: expand = "$[ ]", stderr = f'{_output[0]}.stderr', stdout = f'{_output[0]}.stdout',container = container
        bgzip -f $[_output[0]:n]
        tabix -p bed $[_output[0]] -f