# Molecular phenotype normalization
This notebook implements the normalization step of the data processing pipeline for the xQTL workflow. It generates:
1. A normalized bed file covering the whole molecular phenotype.


### Input
The inputs for this workflow are:
1. A complete gene count table
2. A complete gene TPM table
3. A GTF annotation table downloaded from GENCODE
4. An index file cross-referencing the sample names in the expression and genotype data
5. A `vcf_chrom_list` file, provided by default, with a single column listing chromosomes chr1 to chr22 and no header
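
If the chromosome list (input 5) is not at hand, it is simple to generate. The following is a minimal sketch, assuming the chromosome names carry the `chr` prefix; the function name `write_vcf_chrom_list` and its arguments are illustrative and not part of the pipeline.

```python
from pathlib import Path


def write_vcf_chrom_list(out_path, chroms=range(1, 23), prefix="chr"):
    """Write one chromosome name per line, without a header.

    Defaults to chr1 ... chr22. Both the prefix and the range are
    assumptions; adjust them to match the names used in the VCF.
    """
    names = [f"{prefix}{c}" for c in chroms]
    Path(out_path).write_text("\n".join(names) + "\n")
    return names
```

Calling `write_vcf_chrom_list("vcf_chrom_list")` would produce a 22-line file that can be passed to `--vcf_chr_list`.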


Requirements for inputs 1 and 2:
1. Tab-separated ("\t")
2. Two unneeded rows above the column names
3. File name ends with `gct`
4. Contain only samples listed in input 4

Requirements for input 4:
1. `sample_id` must not be all numeric
2. Tab-separated ("\t")

The gene_ID format in inputs 1, 2 and 4 must be the same, i.e. ENSG00000000003 and ENSG00000000003.1 cannot coexist.
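
These requirements can be checked before launching the workflow. The sketch below is one possible way to do so in plain Python; all function names are hypothetical helpers, not part of the pipeline, and the gene ID pattern assumes Ensembl-style IDs such as the ones in the example above.

```python
import re


def is_all_numeric(sample_id):
    """True if a sample ID consists only of digits (not allowed in input 4)."""
    return sample_id.isdigit()


def has_version(gene_id):
    """True for versioned Ensembl-style IDs such as ENSG00000000003.1."""
    return re.fullmatch(r"ENSG\d+\.\d+", gene_id) is not None


def gene_ids_consistent(gene_ids):
    """True if the IDs are uniformly versioned or uniformly unversioned."""
    return len({has_version(g) for g in gene_ids}) <= 1


def samples_missing_from_lookup(table_samples, lookup_samples):
    """Return samples present in a gct table but absent from input 4."""
    return sorted(set(table_samples) - set(lookup_samples))


def read_gct_header(path):
    """Skip the two leading rows of a tab-separated gct file and return its column names."""
    if not str(path).endswith("gct"):
        raise ValueError(f"{path}: file name must end with 'gct'")
    with open(path) as fh:
        next(fh)  # first unneeded row
        next(fh)  # second unneeded row
        return fh.readline().rstrip("\n").split("\t")
```

A table passes when `gene_ids_consistent` returns `True` for the combined gene IDs and `samples_missing_from_lookup` returns an empty list.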


### Output
For each collection, there is one output:
1. A normalized molecular phenotype bed file

This step takes care of the purple part of the following diagram.


### Test run commands:

In [None]:
sos run /home/hs3163/GIT/xqtl-pipeline/pipeline/data_preprocessing/phenotype/normalization.ipynb Normalization \
        --counts_gct "./geneCounts.gct" \
        --tpm_gct "./geneTpm.gct" \
        --vcf_chr_list "/mnt/mfs/statgen/xqtl_workflow_testing/expression_normalization/vcf_chrom_list" \
        --sample_participant_lookup "/mnt/mfs/statgen/xqtl_workflow_testing/expression_normalization/sampleSheetAfterQc.txt" \
        --name "test" \
        --script_dir "/mnt/mfs/statgen/xqtl_workflow_testing/expression_normalization/" --wd ./ \
        --annotation_gtf /mnt/mfs/statgen/xqtl_workflow_testing/expression_normalization/gencode.v26.annotation.gtf.gz  -s build &

In [2]:
[global]
import os
# Work directory & output directory
parameter: wd = path
# Container holding the software needed for the analysis
parameter: container = 'gaow/twas'
# Name for the analysis output
parameter: name = 'ROSMAP'
# A gene count table
parameter: counts_gct = path
# A gene TPM table
parameter: tpm_gct = path
# A gene GTF annotation table
parameter: annotation_gtf = path
# A file listing the chromosomes to include in follow-up analysis
parameter: vcf_chr_list = path("./")
# A file mapping sample IDs from expression to genotype. It must contain two columns,
# sample_id and participant_id, mapping IDs in the expression files to IDs in the
# genotype (these can be the same).
parameter: sample_participant_lookup = path

# The directory containing the .py script needed
parameter: script_dir = path("./")


# For cluster jobs, number of commands to run per job
parameter: job_size = 1
# Wall clock time expected
parameter: walltime = "5h"
# Memory expected
parameter: mem = "16G"
# Number of threads
parameter: numThreads = 20

In [None]:
[Normalization]
# Inputs are the gene count and TPM tables (.gct) and the GTF annotation.
input: counts_gct, tpm_gct, annotation_gtf
output: f'{wd}/{name}.mol_phe.bed.gz'
task: trunk_workers = 1, trunk_size = 1, walltime = '4h',  mem = '20G', tags = f'{step_name}_{_output[0]:bn}'
bash: expand = "${ }", stderr = f'{_output[0]}.stderr', stdout = f'{_output[0]}.stdout',container = container
    eqtl_prepare_expression.py ${tpm_gct} ${counts_gct} ${_input[2]} \
    ${sample_participant_lookup} ${ vcf_chr_list if vcf_chr_list != path("./") else ""} ${name} \
    --tpm_threshold 0.1 \
    --count_threshold 1 \
    --sample_frac_threshold 0.2 \
    --normalization_method tmm

### Input formatting
The following steps are optional; they ensure that the input meets the requirements of the Normalization step:
1. Filter the geneCount table to the genes present in the TPM table.
2. Add two placeholder lines above the header of the TPM and geneCount tables, in case they are missing.
3. If the sample names are all numeric, prefix them so that they are no longer numeric.
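
The three steps above can be illustrated with a small standard-library sketch. It is not the workflow's implementation (the workflow step below uses bash and R); the function names, the `gene_ID` column default, and the `#` placeholder lines are assumptions made for illustration. Unlike the R step, it only renames the IDs it is given and does not touch table column names.

```python
import csv


def format_inputs(tpm_rows, count_rows, sample_ids, id_col="gene_ID"):
    """Apply steps 1 and 3 to in-memory tables (lists of dicts keyed by column name)."""
    # Step 1: keep only genes present in the TPM table, with columns in TPM order.
    tpm_cols = list(tpm_rows[0].keys())
    tpm_genes = {row[id_col] for row in tpm_rows}
    counts = [
        {col: row[col] for col in tpm_cols}
        for row in count_rows
        if row[id_col] in tpm_genes
    ]
    # Step 3: prefix sample IDs with "X" when every ID is purely numeric.
    if all(s.isdigit() for s in sample_ids):
        sample_ids = ["X" + s for s in sample_ids]
    return counts, sample_ids


def write_gct(path, rows, placeholder_lines=2):
    """Step 2: write placeholder lines, then the header and the table, tab-separated."""
    cols = list(rows[0].keys())
    with open(path, "w", newline="") as fh:
        fh.write("#\n" * placeholder_lines)
        writer = csv.DictWriter(fh, fieldnames=cols, delimiter="\t", lineterminator="\n")
        writer.writeheader()
        writer.writerows(rows)
```

Each file written by `write_gct` then has three leading lines: two placeholders and the column names.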

In [None]:
[input_preprocessing]
input: tpm_gct, counts_gct,sample_participant_lookup
output: f'{wd}/{name}.processed.tpm.gct',f'{wd}/{name}.processed.geneCount.gct',f'{wd}/{name}.processed.sample_lookup.txt'
task: trunk_workers = 1, trunk_size = 1, walltime = '24h',  mem = '30G', tags = f'{step_name}_{_output[0]:bn}'
bash: expand = "$[ ]", stderr = f'{_output[0]}.stderr', stdout = f'{_output[0]}.stdout',container = container
    echo "# 
          #  " > $[_output[0]:r]
    cp $[_output[0]:r] $[_output[1]:r]
R: expand = "$[ ]", stderr = f'{_output[0]}.stderr', stdout = f'{_output[0]}.stdout',container = container
    library("tidyverse")
    # read.table is used because it accommodates both " " and "\t" as separators
    tpm = read.table($[_input[0]:r], header = T)
    geneCount = read.table($[_input[1]:r], header = T)
    sample_name = read.table($[_input[2]:r], header = T)
    ## Make geneCount consistent with tpm
    geneCount = geneCount %>% filter(gene_ID %in% tpm$gene_ID) %>% select(colnames(tpm))
    ## Check whether the sample names are all numeric (this is problematic for the normalization step)
    if(is.numeric(sample_name[,1])){
      sample_name[,1] = paste0("X",sample_name[,1])
      }
    ## Append each table, with its column names, below the two lines written by the bash step
    tpm %>% write_delim($[_output[0]:r], delim = "\t", append = T, col_names = T)
    geneCount %>% write_delim($[_output[1]:r], delim = "\t", append = T, col_names = T)
    ## The sample lookup needs no extra header lines, so it is written from scratch
    sample_name %>% write_delim($[_output[2]:r], delim = "\t")