# Covariate data preprocessing
This notebook is the data processing pipeline for the xQTL workflow. It generates:
1. Factors from expression data
2. PCA from genotype data
3. GRM from genotype data
4. LD from genotype data, filtered by GRM [TBD]
5. Molecular phenotype files per chromosome within selected regions, in the formats that APEX and tensorQTL take


**FIXME: Hao, I am thinking this kind of notebook (that sits outside these folders) should be of a tutorial nature. It should only contain `sos run` commands interactively with enough text explanations. For those who want to run the default analysis they should work with `master_control.ipynb` and generate the commands to run as is. For those who want to customize the analysis, they should refer to each of these "recipe" and change the parameters here. That should cover 95% user cases. People will read the module notebooks only for learning purpose. For those who want to edit the module notebooks we will consider them developers or at least power users and I expect few of them.**


### Input
The input for this workflow is one row of the input recipe file, documenting the paths to:
1. One complete molecular_phenotype data set
2. One collection of genotype data in PLINK format, partitioned by chromosome
3. One file listing the regions to be analyzed
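
The region list is read as whitespace-delimited text, skipping blank lines and lines that start with `#`, and the first column is taken as the chromosome. A minimal Python sketch of that parsing is shown here. The columns after the first one (start, end, gene ID) and the example values are assumptions made for illustration, not a specification of the file.

```python
import io

def read_regions(handle):
    """Parse whitespace-delimited region lines, skipping blanks and '#' comments."""
    regions = []
    for line in handle:
        line = line.strip()
        if line and not line.startswith("#"):
            regions.append(line.split())
    return regions

def unique_chromosomes(regions):
    """Return the sorted unique values of the first column (the chromosome)."""
    return sorted({fields[0] for fields in regions})

# Hypothetical region list; only the first column (chromosome) is relied on.
example = io.StringIO(
    "#chr start end gene_id\n"
    "chr2 1000 2000 geneB\n"
    "\n"
    "chr1 500 900 geneA\n"
    "chr2 5000 7000 geneC\n"
)
regions = read_regions(example)
chrom = unique_chromosomes(regions)
print(chrom)
```

The sort is a plain string sort, so a label such as `chr10` would be ordered before `chr2`.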


### Output
For each collection, the output is 23 sets of:
1. EXP file for the selected regions
2. Genotype from the VCF file

and 1 set of:
1. PCA + Factor + Covariate file
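
The combined covariate file stacks the genotype PCs below the factor covariates: each row is one covariate, each column is one sample, and the first column is named `#id`. The sketch below illustrates that layout in plain Python. The function names, sample IDs, and values are hypothetical, and the workflow itself performs this step in R.

```python
import csv
import io

def stack_covariates(factor_rows, pc_scores):
    """Append one row per PC (columns are samples) below the factor rows.

    factor_rows: list of dicts keyed by "#id" plus sample IDs.
    pc_scores:   list of dicts, one per sample, with "IID" and PC columns.
    """
    pc_names = [key for key in pc_scores[0] if "PC" in key]
    pc_rows = []
    for pc in pc_names:
        row = {"#id": pc}
        for score in pc_scores:
            row[score["IID"]] = score[pc]
        pc_rows.append(row)
    # Keep the factor file's column order, then add samples seen only in the PCs.
    header = list(factor_rows[0].keys())
    for score in pc_scores:
        if score["IID"] not in header:
            header.append(score["IID"])
    return factor_rows + pc_rows, header

def to_tsv(rows, header):
    """Write rows as tab-delimited text, filling missing cells with NA."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=header, delimiter="\t",
                            restval="NA", lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

# Hypothetical inputs: one factor row and two PCs for two samples.
factor_rows = [{"#id": "factor1", "S1": 0.1, "S2": -0.2}]
pc_scores = [
    {"IID": "S1", "PC1": 0.5, "PC2": 0.0},
    {"IID": "S2", "PC1": -0.5, "PC2": 1.0},
]
rows, header = stack_covariates(factor_rows, pc_scores)
text = to_tsv(rows, header)
print(text)
```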

### Executable
This notebook depends on the scripts in several other notebooks; the directory that contains them is specified by `exe_dir`.

In [None]:
nohup sos run /home/hs3163/GIT/ADSPFG-xQTL/workflow/Data_Processing/Data_Processing.ipynb region_extraction \
            --wd $[wd] \
            --container $[container] \
            --name $[name] \
            --numThreads $[numThreads] \
            --yml $[yml] \
            --queue $[queue] \
            --J $[J] \
            --exe_dir $[exe_dir] -s build &

In [2]:
[global]
import os
# Work directory & output directory
parameter: wd = path
# Container image that provides the software environment
parameter: container = 'gaow/twas'
# name for the analysis output
parameter: name = str
# For cluster jobs, number of commands to run per job
parameter: job_size = 1
# Wall clock time expected
parameter: walltime = "24h"
# Memory expected
parameter: mem = "60G"
# Number of threads
parameter: numThreads = 20
# Directory of the executable notebooks
parameter: exe_dir = path("~/GIT/ADSPFG-xQTL/workflow")
# yml template
parameter: yml = '/home/hs3163/GIT/ADSPFG-xQTL/code/csg.yml'
# queue for analysis
parameter: queue = "csg"
# Number of job submissions (passed to `sos run` as -J)
parameter: J = 200
# Factor analysis method: "APEX" or "PEER"
parameter: factor_option = "APEX"

## Temp   
parameter: container_lmm = str
parameter: container_apex = str

parameter: region_list = path
regions = [x.strip().split() for x in open(region_list).readlines() if x.strip() and not x.strip().startswith('#')]
# Get the unique chromosomes that have regions to be analyzed.
def extract(lst):
    return [item[0] for item in lst]
chrom = list(set(extract(regions)))
chrom.sort()

In [None]:
[pca_factor]
## PCA models
input: output_from("project_sample")["project_sample"], output_from("Factor_analysis")["Factor_analysis"]
output: pca_factor = f'{_input[1]:n}.pca.cov'
R: expand= "$[ ]", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout'
        library("dplyr")
        library("tibble")
        library("readr")
        library("modelr")
        library("purrr")
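        # PC scores from the genotype PCA: one row per sample, with an IID column
        pca_output = readRDS("$[_input[0]]")$pc_scores
        # Transpose so that each row is a PC and each column is a sample
        mtx = pca_output%>%select(contains("PC"))%>%t()
        colnames(mtx) <- pca_output$IID
        mtx = mtx%>%as_tibble()%>%mutate("#id" = rownames(mtx))%>%select("#id",everything())
        # Append the PC rows below the factor covariates and write a tab-delimited file
        factor_cov = read_delim("$[_input[1]]","\t")
        output = bind_rows(factor_cov,mtx)
        output%>%write_delim("$[_output]","\t")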

## Process of factor analysis
The input file depends on the method selected for factor analysis, PEER or APEX. For APEX, the input is a `bed.gz` file with a `tbi` index. For PEER, the molecular phenotype file itself suffices.
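
The method-dependent input requirement can be checked before a run. The following is a minimal sketch, not part of the workflow: the function names and file paths are hypothetical, and it assumes the `tbi` index sits next to the `bed.gz` file as `<file>.tbi`.

```python
from pathlib import Path

def required_inputs(factor_option, molecular_pheno):
    """List the files a factor-analysis method needs before it can run."""
    pheno = Path(molecular_pheno)
    method = factor_option.upper()
    if method == "APEX":
        if not pheno.name.endswith(".bed.gz"):
            raise ValueError("APEX expects a bed.gz file")
        # Assumption: the index is stored alongside the data as <file>.tbi
        return [pheno, pheno.with_name(pheno.name + ".tbi")]
    if method == "PEER":
        # PEER only needs the molecular phenotype file itself
        return [pheno]
    raise ValueError(f"Unknown factor_option: {factor_option!r}")

def missing_inputs(factor_option, molecular_pheno):
    """Return the required files that do not exist on disk."""
    return [p for p in required_inputs(factor_option, molecular_pheno)
            if not p.exists()]

# Hypothetical paths, used only to show which files each method requires.
apex_files = [p.name for p in required_inputs("APEX", "pheno/expr.bed.gz")]
peer_files = [p.name for p in required_inputs("PEER", "pheno/expr.tsv")]
print(apex_files)
print(peer_files)
```

Calling `missing_inputs` before submitting jobs gives an early failure when the index has not been built.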

In [None]:
[Factor_analysis]
parameter: covariate = ""
# Number of PEER factors. If not specified, or specified as 0, the default values
# suggested by UCSC (based on sample size) will be used
parameter: N = 4
# Default values from PEER:
## The number of iterations
parameter: max_iter = 30
## Prior parameters
parameter: Alpha_a = 0.001
parameter: Alpha_b = 0.1
parameter: Eps_a = 0.1
parameter: Eps_b = 10.
## Tolerance parameters
parameter: tol = 0.001
parameter: var_tol = 1e-08
input: output_from("Region_extraction_1")["molecular_pheno_whole_bed"], output_from("plink2vcf")["qced_vcf_genotype_list"]
output: Factor_analysis = f'{wd:a}/Factor_and_Covariate/{name}.{factor_option}.cov'
bash: expand = "$[ ]", stderr = f'{_output[0]}.stderr', stdout = f'{_output[0]}.stdout'
        sos run $[exe_dir]/Data_Processing/Factor_and_Covariate/factor.ipynb $[factor_option]  \
            --wd $[wd]/Factor_and_Covariate/ \
            --container_apex $[container_apex] \
            --name $[name] \
            --numThreads $[numThreads] \
            --molecular_pheno $[_input[0]] \
            --genotype_list $[_input[1]] \
            --N $[N] \
            --Alpha_a $[Alpha_a]  \
            --Alpha_b $[Alpha_b] \
            --Eps_a  $[Eps_a] \
            --Eps_b  $[Eps_b] \
            --tol  $[tol] \
            --var_tol $[var_tol] \
            -J $[J] -q $[queue] -c $[yml] $[f'--covariate {covariate}' if os.path.exists(covariate) else f'']