# Phenotype Data Imputation

This workflow provides a collection of methods for imputing missing values in omics data.

## Description

## Input
* Molecular phenotype data with missing values, where the first four columns are chr, start, end, and ID. The remaining columns are samples.

## Output
* Complete molecular phenotype data, where the first four columns are chr, start, end, and ID. The remaining columns are samples.
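
Before running a workflow, it can help to confirm that a table follows the layout described above. The following is a minimal sketch rather than part of the pipeline: the function name `summarize_phenotype` and the example rows are hypothetical, and it assumes a tab-delimited table in which missing values are written as `NA` or left empty.

```python
import csv
import io

def summarize_phenotype(handle, na_strings=("NA", "")):
    """Return (n_features, n_samples, n_missing) for a BED-like phenotype table."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    if len(header) < 5:
        raise ValueError("expected 4 annotation columns plus at least one sample")
    n_samples = len(header) - 4
    n_features = 0
    n_missing = 0
    for row in reader:
        if len(row) != len(header):
            raise ValueError(
                f"row {n_features + 1} has {len(row)} fields, expected {len(header)}"
            )
        n_features += 1
        # Only the sample columns (5th onward) can hold missing measurements
        n_missing += sum(value in na_strings for value in row[4:])
    return n_features, n_samples, n_missing

# Hypothetical two-feature, three-sample table
example = io.StringIO(
    "#chr\tstart\tend\tID\tsample1\tsample2\tsample3\n"
    "chr1\t100\t101\tprotA\t0.5\tNA\t1.2\n"
    "chr1\t200\t201\tprotB\tNA\tNA\t0.3\n"
)
print(summarize_phenotype(example))  # (2, 3, 3)
```

For a compressed file, the same function can be given a handle opened with `gzip.open(path, "rt")`.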

## Minimal Working Example

### a. Phenotype Imputation

Timing: X min

In [None]:
sos run xqtl-pipeline/pipeline/phenotype_imputation.ipynb flash \
    --phenoFile /phenotype/protocol_example.protein.bed.gz \
    --cwd output/phenotype \
    --prior ebnm_point_normal --varType 1 \
    --container oras://ghcr.io/cumc/omics_imputation_apptainer:latest

## Troubleshooting

| Step | Substep | Problem | Possible Reason | Solution |
|------|---------|---------|------------------|---------|
|  |  |  |  |  |




## Command Interface

In [1]:
!sos run phenotype_imputation.ipynb -h

usage: sos run phenotype_imputation.ipynb
               [workflow_name | -t targets] [options] [workflow_options]
  workflow_name:        Single or combined workflows defined in this script
  targets:              One or more targets to generate
  options:              Single-hyphen sos parameters (see "sos run -h" for details)
  workflow_options:     Double-hyphen workflow-specific parameters

Workflows:
  flash
  missforest
  missxgboost
  knn
  soft
  mean
  lod
  bed_filter_na

Global Workflow Options:
  --cwd output (as path)
                        Work directory & output directory
  --phenoFile VAL (as path, required)
                        Molecular phenotype matrix
  --job-size 1 (as int)
                        For cluster jobs, number commands to run per job
  --walltime 72h
                        Wall clock time expected
  --mem 16G
                        Memory expected
  --numThreads 20 (as int)
                        Number of threads
  --container ''
  --entrypoint

## Setup and global parameters

In [2]:
[global]
# Work directory & output directory
parameter: cwd = path("output")
# Molecular phenotype matrix
parameter: phenoFile = path
# For cluster jobs, number commands to run per job
parameter: job_size = 1
# Wall clock time expected
parameter: walltime = "72h"
# Memory expected
parameter: mem = "16G"
# Number of threads
parameter: numThreads = 20
parameter: container = ""
import re
parameter: entrypoint= ('micromamba run -a "" -n' + ' ' + re.sub(r'(_apptainer:latest|_docker:latest|\.sif)$', '', container.split('/')[-1])) if container else ""
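
The default `entrypoint` is derived from the container name by stripping a known suffix and using the remainder as the micromamba environment name. The sketch below reproduces that expression outside of SoS so its effect is easier to see. The helper name `derive_entrypoint` and the `.sif` path are hypothetical and used only for illustration.

```python
import re

def derive_entrypoint(container):
    """Mirror the default value of `entrypoint` in the [global] section."""
    if not container:
        return ""
    # Keep only the image name, e.g. "omics_imputation_apptainer:latest"
    image = container.split("/")[-1]
    # Strip the suffix to get the environment name, e.g. "omics_imputation"
    env_name = re.sub(r"(_apptainer:latest|_docker:latest|\.sif)$", "", image)
    return 'micromamba run -a "" -n ' + env_name

print(derive_entrypoint("oras://ghcr.io/cumc/omics_imputation_apptainer:latest"))
# micromamba run -a "" -n omics_imputation
print(derive_entrypoint("containers/omics_imputation.sif"))
# micromamba run -a "" -n omics_imputation
print(repr(derive_entrypoint("")))
# ''
```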

In [None]:
[flash]
# prior distribution of loadings and factors
parameter: prior = "ebnm_point_normal"
# Type of estimated variance: 1 estimates a separate variance for each row.
parameter: varType = '1'
input: phenoFile
output: f'{cwd:a}/{_input:bn}.imputed.bed.gz'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime,  mem = mem, tags = f'{step_name}_{_output:bn}'  
R: expand= "${ }", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout', container = container, entrypoint=entrypoint
   library(flashier)
   library("tibble")
   library("readr")
   library("dplyr")
    
    pheno <- read_delim(${_input:ar}, delim = "\t")
    pheno_NAs <- pheno[, 5:ncol(pheno)]
    f <- flashier::flash(as.matrix(pheno_NAs), ebnm_fn = ${prior}, var_type = ${varType})
    Yfill <- ifelse(is.na(as.matrix(pheno_NAs)), fitted(f), as.matrix(pheno_NAs))
    pheno_imp <- as.data.frame(cbind(pheno[, 1:4], Yfill))
    # Write the uncompressed table; the bash step below compresses it with bgzip
    write_delim(pheno_imp, ${_output:nr}, delim = "\t")

bash: expand= "${ }", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout', container=container, entrypoint=entrypoint
    bgzip -f ${_output:n}
    tabix ${_output}

bash: expand= "$[ ]", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout', container = container, entrypoint=entrypoint
        stdout=$[_output:n].stdout
        for i in $[_output] ; do 
        echo "output_info: $i " >> $stdout;
        echo "output_size:" `ls -lh $i | cut -f 5  -d  " "`   >> $stdout;
        echo "output_rows:" `zcat $i | wc -l  | cut -f 1 -d " "`   >> $stdout;
        echo "output_column:" `zcat $i | head -1 | wc -w `   >> $stdout;
        echo "output_headerow:" `zcat $i | grep "##" | wc -l `   >> $stdout;
        echo "output_preview:"   >> $stdout;
        zcat $i  | grep -v "##" | head  | cut -f 1,2,3,4,5,6   >> $stdout ; done
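
The `flash` step keeps every observed value and takes the fitted value only where the input is missing (the `ifelse(is.na(...), fitted(f), ...)` line). The sketch below illustrates that fill pattern alone, in plain Python, and does not perform any factor analysis. The function name `fill_missing` is hypothetical, `None` stands in for `NA`, and the `fitted` numbers are made up for illustration.

```python
def fill_missing(observed, fitted):
    """Keep observed entries; use the fitted value only where observed is None."""
    return [
        [fit if obs is None else obs for obs, fit in zip(obs_row, fit_row)]
        for obs_row, fit_row in zip(observed, fitted)
    ]

observed = [[1.0, None, 3.0], [None, 5.0, 6.0]]
fitted = [[0.9, 2.1, 3.2], [4.2, 5.1, 5.8]]  # illustrative values, not model output
print(fill_missing(observed, fitted))
# [[1.0, 2.1, 3.0], [4.2, 5.0, 6.0]]
```

Observed entries are never overwritten, so the imputed matrix agrees with the input wherever the input was measured.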

In [1]:
[missforest]
input: phenoFile
output: f'{cwd:a}/{_input:bn}.imputed.bed.gz'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime,  mem = mem, tags = f'{step_name}_{_output:bn}'  
R: expand= "${ }", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout', container = container, entrypoint=entrypoint
   library(missForest)
   library("tibble")
   library("readr")
   library("dplyr")
    
    pheno <- read_delim(${_input:ar}, delim = "\t")
    pheno_NAs <- pheno[, 5:ncol(pheno)]
    Yfill <- missForest(as.matrix(pheno_NAs), parallelize = 'variables')$ximp
    pheno_imp <- as.data.frame(cbind(pheno[, 1:4], Yfill))
    # Write the uncompressed table; the bash step below compresses it with bgzip
    write_delim(pheno_imp, ${_output:nr}, delim = "\t")

bash: expand= "${ }", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout', container=container, entrypoint=entrypoint
    bgzip -f ${_output:n}
    tabix ${_output}

bash: expand= "$[ ]", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout', container = container, entrypoint=entrypoint
        stdout=$[_output:n].stdout
        for i in $[_output] ; do 
        echo "output_info: $i " >> $stdout;
        echo "output_size:" `ls -lh $i | cut -f 5  -d  " "`   >> $stdout;
        echo "output_rows:" `zcat $i | wc -l  | cut -f 1 -d " "`   >> $stdout;
        echo "output_column:" `zcat $i | head -1 | wc -w `   >> $stdout;
        echo "output_headerow:" `zcat $i | grep "##" | wc -l `   >> $stdout;
        echo "output_preview:"   >> $stdout;
        zcat $i  | grep -v "##" | head  | cut -f 1,2,3,4,5,6   >> $stdout ; done

In [None]:
[missxgboost]
input: phenoFile
output: f'{cwd:a}/{_input:bn}.imputed.bed.gz'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime,  mem = mem, tags = f'{step_name}_{_output:bn}'  
R: expand= "${ }", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout', container = container, entrypoint=entrypoint
   source('/mnt/vast/hpc/csg/zq2209/xgb_imp.R')
   library("tibble")
   library("readr")
   library("dplyr")
    
    pheno <- read_delim(${_input:ar}, delim = "\t")
    pheno_NAs <- pheno[, 5:ncol(pheno)]
    Yfill <- xgboost_imputation(as.matrix(pheno_NAs))
    pheno_imp <- as.data.frame(cbind(pheno[, 1:4], Yfill))
    # Write the uncompressed table; the bash step below compresses it with bgzip
    write_delim(pheno_imp, ${_output:nr}, delim = "\t")

bash: expand= "${ }", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout', container=container, entrypoint=entrypoint
    bgzip -f ${_output:n}
    tabix ${_output}
bash: expand= "$[ ]", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout', container = container, entrypoint=entrypoint
        stdout=$[_output:n].stdout
        for i in $[_output] ; do 
        echo "output_info: $i " >> $stdout;
        echo "output_size:" `ls -lh $i | cut -f 5  -d  " "`   >> $stdout;
        echo "output_rows:" `zcat $i | wc -l  | cut -f 1 -d " "`   >> $stdout;
        echo "output_column:" `zcat $i | head -1 | wc -w `   >> $stdout;
        echo "output_headerow:" `zcat $i | grep "##" | wc -l `   >> $stdout;
        echo "output_preview:"   >> $stdout;
        zcat $i  | grep -v "##" | head  | cut -f 1,2,3,4,5,6   >> $stdout ; done

In [None]:
[knn]
input: phenoFile
output: f'{cwd:a}/{_input:bn}.imputed.bed.gz'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime,  mem = mem, tags = f'{step_name}_{_output:bn}'  
R: expand= "${ }", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout', container = container, entrypoint=entrypoint
   library(impute)
   library("tibble")
   library("readr")
   library("dplyr")
    
    pheno <- read_delim(${_input:ar}, delim = "\t")
    pheno_NAs <- pheno[, 5:ncol(pheno)]
    Yfill <- impute.knn(as.matrix(pheno_NAs), rowmax = 1)$data
    pheno_imp <- as.data.frame(cbind(pheno[, 1:4], Yfill))
    # Write the uncompressed table; the bash step below compresses it with bgzip
    write_delim(pheno_imp, ${_output:nr}, delim = "\t")

bash: expand= "${ }", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout', container=container, entrypoint=entrypoint
    bgzip -f ${_output:n}
    tabix ${_output}
bash: expand= "$[ ]", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout', container = container, entrypoint=entrypoint
        stdout=$[_output:n].stdout
        for i in $[_output] ; do 
        echo "output_info: $i " >> $stdout;
        echo "output_size:" `ls -lh $i | cut -f 5  -d  " "`   >> $stdout;
        echo "output_rows:" `zcat $i | wc -l  | cut -f 1 -d " "`   >> $stdout;
        echo "output_column:" `zcat $i | head -1 | wc -w `   >> $stdout;
        echo "output_headerow:" `zcat $i | grep "##" | wc -l `   >> $stdout;
        echo "output_preview:"   >> $stdout;
        zcat $i  | grep -v "##" | head  | cut -f 1,2,3,4,5,6   >> $stdout ; done

In [None]:
[soft]
input: phenoFile
output: f'{cwd:a}/{_input:bn}.imputed.bed.gz'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime,  mem = mem, tags = f'{step_name}_{_output:bn}'  
R: expand= "${ }", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout', container = container, entrypoint=entrypoint
   library(softImpute)
   library("tibble")
   library("readr")
   library("dplyr")
    
    pheno <- read_delim(${_input:ar}, delim = "\t")
    pheno_NAs <- pheno[, 5:ncol(pheno)]
    X_mis_C <- as(as.matrix(pheno_NAs), "Incomplete")
    ###uses "svd" algorithm
    fit <- softImpute(X_mis_C,rank = 50,lambda = 30,type = "svd")
    Yfill <- complete(as.matrix(pheno_NAs),fit)
    pheno_imp <- as.data.frame(cbind(pheno[, 1:4], Yfill))
    # Write the uncompressed table; the bash step below compresses it with bgzip
    write_delim(pheno_imp, ${_output:nr}, delim = "\t")

bash: expand= "${ }", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout', container=container, entrypoint=entrypoint
    bgzip -f ${_output:n}
    tabix ${_output}
bash: expand= "$[ ]", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout', container = container, entrypoint=entrypoint
        stdout=$[_output:n].stdout
        for i in $[_output] ; do 
        echo "output_info: $i " >> $stdout;
        echo "output_size:" `ls -lh $i | cut -f 5  -d  " "`   >> $stdout;
        echo "output_rows:" `zcat $i | wc -l  | cut -f 1 -d " "`   >> $stdout;
        echo "output_column:" `zcat $i | head -1 | wc -w `   >> $stdout;
        echo "output_headerow:" `zcat $i | grep "##" | wc -l `   >> $stdout;
        echo "output_preview:"   >> $stdout;
        zcat $i  | grep -v "##" | head  | cut -f 1,2,3,4,5,6   >> $stdout ; done

In [None]:
[mean]
input: phenoFile
output: f'{cwd:a}/{_input:bn}.imputed.bed.gz'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime,  mem = mem, tags = f'{step_name}_{_output:bn}'  
R: expand= "${ }", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout', container = container, entrypoint=entrypoint
   library("tibble")
   library("readr")
   library("dplyr")
    
    pheno <- read_delim(${_input:ar}, delim = "\t")
    pheno_NAs <- pheno[, 5:ncol(pheno)]
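    Yfill <- as.matrix(pheno_NAs)
    # Compute the row means once, then fill each row's missing entries with its own mean
    row_means <- rowMeans(Yfill, na.rm = TRUE)
    for (t.row in seq_len(nrow(Yfill))) {
        Yfill[t.row, is.na(Yfill[t.row, ])] <- row_means[t.row]
    }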
    pheno_imp <- as.data.frame(cbind(pheno[, 1:4], Yfill))
    # Write the uncompressed table; the bash step below compresses it with bgzip
    write_delim(pheno_imp, ${_output:nr}, delim = "\t")

bash: expand= "${ }", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout', container=container, entrypoint=entrypoint
    bgzip -f ${_output:n}
    tabix ${_output}
bash: expand= "$[ ]", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout', container = container, entrypoint=entrypoint
        stdout=$[_output:n].stdout
        for i in $[_output] ; do 
        echo "output_info: $i " >> $stdout;
        echo "output_size:" `ls -lh $i | cut -f 5  -d  " "`   >> $stdout;
        echo "output_rows:" `zcat $i | wc -l  | cut -f 1 -d " "`   >> $stdout;
        echo "output_column:" `zcat $i | head -1 | wc -w `   >> $stdout;
        echo "output_headerow:" `zcat $i | grep "##" | wc -l `   >> $stdout;
        echo "output_preview:"   >> $stdout;
        zcat $i  | grep -v "##" | head  | cut -f 1,2,3,4,5,6   >> $stdout ; done
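
The `mean` workflow replaces each missing value with the mean of the observed values in the same row (that is, the same feature across samples). A minimal Python sketch of that rule follows. The function name `impute_row_mean` is hypothetical, `None` stands in for `NA`, and raising an error for a row with no observed values is a choice made for this sketch, not the behaviour of the R code above.

```python
def impute_row_mean(matrix):
    """Fill None entries in each row with the mean of that row's observed values."""
    filled = []
    for row in matrix:
        observed = [v for v in row if v is not None]
        if not observed:
            # Assumption of this sketch: refuse to impute a fully missing row
            raise ValueError("cannot impute a row with no observed values")
        mean = sum(observed) / len(observed)
        filled.append([mean if v is None else v for v in row])
    return filled

print(impute_row_mean([[1.0, None, 3.0], [4.0, 4.0, None]]))
# [[1.0, 2.0, 3.0], [4.0, 4.0, 4.0]]
```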

In [None]:
[lod]
input: phenoFile
output: f'{cwd:a}/{_input:bn}.imputed.bed.gz'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime,  mem = mem, tags = f'{step_name}_{_output:bn}'  
R: expand= "${ }", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout', container = container, entrypoint=entrypoint
   library("tibble")
   library("readr")
   library("dplyr")
    
    pheno <- read_delim(${_input:ar}, delim = "\t")
    pheno_NAs <- pheno[, 5:ncol(pheno)]
    Yfill <- as.matrix(pheno_NAs)
    for (t.row in 1:nrow(pheno_NAs)) {
        Yfill[t.row, is.na(Yfill[t.row,])] <- min(Yfill[t.row, ], na.rm = TRUE)
    }
    pheno_imp <- as.data.frame(cbind(pheno[, 1:4], Yfill))
    # Write the uncompressed table; the bash step below compresses it with bgzip
    write_delim(pheno_imp, ${_output:nr}, delim = "\t")

bash: expand= "${ }", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout', container=container, entrypoint=entrypoint
    bgzip -f ${_output:n}
    tabix ${_output}
bash: expand= "$[ ]", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout', container = container, entrypoint=entrypoint
        stdout=$[_output:n].stdout
        for i in $[_output] ; do 
        echo "output_info: $i " >> $stdout;
        echo "output_size:" `ls -lh $i | cut -f 5  -d  " "`   >> $stdout;
        echo "output_rows:" `zcat $i | wc -l  | cut -f 1 -d " "`   >> $stdout;
        echo "output_column:" `zcat $i | head -1 | wc -w `   >> $stdout;
        echo "output_headerow:" `zcat $i | grep "##" | wc -l `   >> $stdout;
        echo "output_preview:"   >> $stdout;
        zcat $i  | grep -v "##" | head  | cut -f 1,2,3,4,5,6   >> $stdout ; done
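
The `lod` workflow fills each missing value with the smallest observed value in the same row, using the row minimum as a stand-in for the limit of detection. The sketch below shows that rule in plain Python. The function name `impute_row_min` is hypothetical, `None` stands in for `NA`, and the error for a fully missing row is an assumption of this sketch.

```python
def impute_row_min(matrix):
    """Fill None entries in each row with the minimum of that row's observed values."""
    filled = []
    for row in matrix:
        observed = [v for v in row if v is not None]
        if not observed:
            # Assumption of this sketch: refuse to impute a fully missing row
            raise ValueError("cannot impute a row with no observed values")
        floor = min(observed)
        filled.append([floor if v is None else v for v in row])
    return filled

print(impute_row_min([[0.8, None, 0.2], [None, 1.5, 2.5]]))
# [[0.8, 0.2, 0.2], [1.5, 1.5, 2.5]]
```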

In [None]:
[bed_filter_na]
parameter: rank_max = 50 # maximum rank estimated in the per-chromosome methylation matrix
parameter: lambda_hyp = 30 # hyperparameter weighting the nuclear-norm penalty
parameter: impute_method = "soft"
# Missingness tolerance: rows with a missing rate larger than tol_missing are removed;
# the remaining rows are imputed with impute_method. For example, to keep only rows with at most 5% missing values, use 0.05 as tol_missing.
parameter: tol_missing = 0.05
input: phenoFile
output: f'{_input:nn}.filter_na.{impute_method}.bed.gz'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads
R: expand= "${ }", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout', container=container, entrypoint=entrypoint
   library("dplyr")
   library("tibble")
   library("readr")
   library(softImpute)
   compute_missing <- function(mtx){
          miss <- sum(is.na(mtx))/length(mtx)
          return(miss)
        }

        mean_impute <- function(mtx){
          f <- apply(mtx, 2, function(x) mean(x,na.rm = TRUE))
          for (i in 1:length(f)) mtx[,i][which(is.na(mtx[,i]))] <- f[i]
          return(mtx)
        }
    
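        # NOTE: this helper is not called below; the "soft" branch uses softImpute() directly.
        # As written, it performs column-mean imputation.
        soft_impute <- function(mtx){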
          f <- apply(mtx, 2, function(x) mean(x,na.rm = TRUE))
          for (i in 1:length(f)) mtx[,i][which(is.na(mtx[,i]))] <- f[i]
          return(mtx)
        }
  
  
        filter_mtx <- function(X, missing_rate_thresh) {
            rm_col <- which(apply(X, 2, compute_missing) > missing_rate_thresh)
            if (length(rm_col)) X <- X[, -rm_col]
            return((X))
        }  
  
    bed = read_delim("${_input}")
    mtx = bed[,5:ncol(bed)]%>%as.matrix
    rownames(mtx) = bed[,4]%>%unlist()
    tbl_filtered = filter_mtx(mtx%>%t(),${tol_missing})
    if ( "${impute_method}" == "mean" ){
    tbl_filtered = tbl_filtered%>%mean_impute()%>%t()
     } else if ("${impute_method}" == "soft"){ 
      tbl_filtered_C= as(t(tbl_filtered),"Incomplete")
      fit=softImpute(tbl_filtered_C,rank=${rank_max},lambda=${lambda_hyp},type="svd")
      tbl_filtered = complete(t(tbl_filtered),fit)
    }
    tbl_filtered = tbl_filtered%>%as_tibble(rownames = colnames(bed)[4])  
    bed_filtered = inner_join(bed[,1:4],tbl_filtered)
    bed_filtered%>%write_delim("${_output:n}", "\t" )
  
bash: expand= "${ }", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout', container=container, entrypoint=entrypoint
    bgzip -f ${_output:n}
    tabix ${_output}
bash: expand= "$[ ]", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout', container = container, entrypoint=entrypoint
        stdout=$[_output].stdout
        for i in $[_output] ; do 
        echo "output_info: $i " >> $stdout;
        echo "output_size:" `ls -lh $i | cut -f 5  -d  " "`   >> $stdout;
        echo "output_rows:" `zcat $i | wc -l  | cut -f 1 -d " "`   >> $stdout;
        echo "output_column:" `zcat $i | grep -v "##"   | head -1 | wc -w `   >> $stdout;
        echo "output_headerow:" `zcat $i | grep "##" | wc -l `   >> $stdout;
        echo "output_preview:"   >> $stdout;
        zcat $i  | grep -v "##" | head  | cut -f 1,2,3,4,5,6   >> $stdout ; done
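
The filtering stage of `bed_filter_na` drops every feature whose fraction of missing samples exceeds `tol_missing` and passes the rest on to imputation. The sketch below illustrates only that filtering rule. The function name `filter_by_missing_rate` and the feature IDs are hypothetical, and `None` stands in for `NA`.

```python
def filter_by_missing_rate(rows, tol_missing):
    """Keep (feature_id, values) pairs whose missing fraction is <= tol_missing."""
    kept = []
    for feature_id, values in rows:
        rate = sum(v is None for v in values) / len(values)
        # Features with a rate strictly above the tolerance are removed
        if rate <= tol_missing:
            kept.append((feature_id, values))
    return kept

rows = [
    ("featA", [0.1, 0.2, 0.3, 0.4]),    # 0% missing
    ("featB", [0.1, None, 0.3, 0.4]),   # 25% missing
    ("featC", [None, None, 0.3, 0.4]),  # 50% missing
]
print([fid for fid, _ in filter_by_missing_rate(rows, tol_missing=0.25)])
# ['featA', 'featB']
```

A feature whose missing rate equals the tolerance is kept, which matches the strict `>` comparison used in `filter_mtx`.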

In [None]:
The resource usage for soft-imputing 450K methylation data is as follows:

``` 
time elapsed: 880.90s
peak first occurred: 152.11s
peak last occurred: 175.41s
max vms_memory: 38.95GB
max rss_memory: 34.35GB
memory check interval: 1s
return code: 0
```