# Quantifying methylation data

## Methods overview

To be completed.

## Input

1. `phenoFile`: The input of this module is a folder containing all the IDAT files, one pair per sample. Directly under the folder there should be one companion csv file documenting all the meta-information of the bisulfite sequencing. Specify `phenoFile` as the path to this companion SampleSheet csv file.

2. Optional: `cross_reactive_list`: A list of CpG probes that are reported to [map to multiple regions in the genome](https://academic.oup.com/nargab/article/2/4/lqaa105/6040968).
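
Because each sample needs both a green and a red IDAT file, it can help to check the folder before running the workflow. The sketch below is a hypothetical helper, not part of the pipeline, written in Python (the language SoS itself is built on). It assumes the conventional Illumina file suffixes `_Grn.idat` and `_Red.idat`; the function name `find_idat_pairs` is made up for illustration.

```python
from collections import defaultdict
from pathlib import Path


def find_idat_pairs(folder):
    """Map each IDAT basename under `folder` to True if both channels exist.

    Assumption: files are named <basename>_Grn.idat and <basename>_Red.idat.
    Sub-folders are searched as well.
    """
    channels = defaultdict(set)
    for f in Path(folder).rglob("*.idat"):
        for suffix in ("_Grn", "_Red"):
            if f.stem.endswith(suffix):
                # Key on the path without the channel suffix
                base = f.stem[: -len(suffix)]
                channels[str(f.parent / base)].add(suffix)
    return {base: found == {"_Grn", "_Red"} for base, found in channels.items()}
```

Any basename mapped to `False` is missing one of its two channel files and would need attention before the sample sheet is passed as `--phenoFile`.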

## Output

A pair of bed.gz files, one for the beta values and one for the M values.
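
For readers unfamiliar with the two scales, the usual relationship is that the M value is the base-2 logit of the beta value. The following Python sketch illustrates that relationship only; it is not the code the pipeline runs (the workflow obtains both matrices from minfi), and the function names are hypothetical.

```python
import math


def beta_to_m(beta):
    """Convert a beta value in (0, 1) to an M value: log2(beta / (1 - beta))."""
    if not 0.0 < beta < 1.0:
        raise ValueError("beta must be strictly between 0 and 1")
    return math.log2(beta / (1.0 - beta))


def m_to_beta(m):
    """Inverse transform: beta = 2^M / (2^M + 1)."""
    return 2.0 ** m / (2.0 ** m + 1.0)
```

A beta of 0.5 (half methylated) corresponds to an M value of 0. Betas close to 0 or 1 map to large negative or positive M values.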


## Minimal working example




In [None]:
sos run pipeline/methylation_calling.ipynb methylation \
    --phenoFile data/MWE/MWE_Sample_sheet.csv \
    --container containers/methylation.sif 

## Command interface

In [1]:
sos run methylation_calling.ipynb -h

usage: sos run methylation_calling.ipynb
               [workflow_name | -t targets] [options] [workflow_options]
  workflow_name:        Single or combined workflows defined in this script
  targets:              One or more targets to generate
  options:              Single-hyphen sos parameters (see "sos run -h" for details)
  workflow_options:     Double-hyphen workflow-specific parameters

Workflows:
  methylation

Global Workflow Options:
  --cwd output (as path)
                        The output directory for generated files.
  --phenoFile VAL (as path, required)
                        The companion sample sheet csv file as outlined in the
                        input section.
  --job-size 1 (as int)
                        For cluster jobs, number commands to run per job
  --walltime 5h
                        Wall clock time expected
  --mem 16G
                        Memory expected
  --numThreads 8 (as int)
                        Number of threads
  --container ''
     

## Setup and global parameters

In [6]:
[global]
# The output directory for generated files.
parameter: cwd = path("output")
# The companion sample sheet csv file as outlined in the input section.
parameter: phenoFile = path
# For cluster jobs, number commands to run per job
parameter: job_size = 1
# Wall clock time expected
parameter: walltime = "5h"
# Memory expected
parameter: mem = "16G"

# Number of threads
parameter: numThreads = 8
# Software container option
parameter: container = ""
from sos.utils import expand_size
cwd = path(f'{cwd:a}')

## Generate methylation data matrix

The first step of methylation data processing is to obtain the beta and M values after preliminary QC and filtering of the IDAT data.

By default, EPIC data are annotated based on hg38 using [this annotation](https://github.com/achilleasNP/IlluminaHumanMethylationEPICanno.ilm10b5.hg38). Alternatively, users can set the `--hg` parameter to 19 to use the [hg19 annotation](https://bioconductor.org/packages/release/data/annotation/html/IlluminaHumanMethylationEPICanno.ilm10b4.hg19.html).

For 450K data, however, only the hg19 annotation is available, so that is what is used.
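
The annotation choice described above can be summarized as a small decision function. This is an illustrative Python sketch rather than pipeline code: the EPIC array name and the `ilm10b5.hg38` string appear in the workflow below, while the 450K array name and the `ilmn12.hg19` string are assumptions based on the standard minfi annotation packages.

```python
def choose_annotation(array, hg=38):
    """Return the annotation label to use for a given array type and genome build.

    Assumption: array names follow minfi's conventions.
    """
    if array == "IlluminaHumanMethylationEPIC":
        # EPIC supports both builds; hg38 is the default
        return "ilm10b5.hg38" if hg == 38 else "ilm10b4.hg19"
    if array == "IlluminaHumanMethylation450k":
        # Only hg19 is available for 450K, whatever `hg` is set to
        return "ilmn12.hg19"
    raise ValueError(f"unsupported array type: {array}")
```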

1. Based on the one input csv file, all the IDAT files in the folder and its sub-folders are loaded.
2. The samples are first filtered based on [bisulfite conversion rate](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4527772/). This is done using the [bscon function from the wateRmelon package](http://www.bioconductor.org/packages/release/bioc/vignettes/wateRmelon/inst/doc/wateRmelon.html#introduction).
3. Samples are then filtered based on their [detection p-value](https://www.rdocumentation.org/packages/minfi/versions/1.18.4/topics/detectionP), which indicates the quality of the signal at each genomic position.
4. [Stratified quantile normalization](https://rdrr.io/bioc/minfi/man/preprocessQuantile.html) is then applied.
5. Features are removed if they are on the sex chromosomes, are known to be [cross-reactive, mapping to multiple regions in the genome](https://academic.oup.com/nargab/article/2/4/lqaa105/6040968), overlap with SNPs, or have a detection p-value at or above the threshold in any sample. The list of cross-reactive probes can be found as `/opt/cross_reactive_probe_Hop2020.txt` in our docker and [here](https://raw.githubusercontent.com/hsun3163/xqtl-pipeline/main/data/cross_reactive_probe_Hop2020.txt).
6. Beta and M values for all the probes/samples are then each saved to an indexed bed.gz file.
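
The threshold logic behind the bisulfite conversion and detection p-value filters in the steps above can be sketched as follows. This is a simplified Python illustration under stated assumptions, not the R code the workflow runs: the conversion rate is reduced to a median over generic converted/unconverted control intensities, detection p-values are held in a plain dictionary, and all function names are hypothetical.

```python
from statistics import mean, median


def bisulfite_conversion_rate(converted, unconverted):
    """Median of converted / (converted + unconverted) over control probes, in percent."""
    return 100 * median(c / (c + u) for c, u in zip(converted, unconverted))


def filter_by_detection_p(det_p, samples, sample_thr=0.05, probe_thr=0.01):
    """det_p maps probe ID -> list of detection p-values, one per entry in `samples`."""
    # Keep samples whose mean detection p-value across probes is below sample_thr
    keep_idx = [j for j in range(len(samples))
                if mean(row[j] for row in det_p.values()) < sample_thr]
    kept_samples = [samples[j] for j in keep_idx]
    # Keep probes whose p-value is below probe_thr in every retained sample
    kept_probes = [probe for probe, row in det_p.items()
                   if all(row[j] < probe_thr for j in keep_idx)]
    return kept_samples, kept_probes
```

With the defaults used in this workflow, a sample would be kept only if its conversion rate exceeds 85 and its mean detection p-value is below 0.05. A probe survives only if every retained sample has a detection p-value below 0.01 for it.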

[As documented here](https://github.com/cumc/xqtl-pipeline/issues/312), when the IDAT files come from different batches, reading them fails unless the `force = TRUE` option is specified, as in `read.metharray.exp(targets = targets, force = TRUE)`.

In [1]:
[methylation_1]
# threshold to filter out samples based on detection P value
parameter: samples_pval_tre = 0.05
# threshold to filter out probes based on detection P value
parameter: probe_det_pval_tre = 0.01
# threshold to remove samples based on bisulfite_conversion_rate.
parameter: bisulfite_conversion_tre = 85 # 85 indicates a conversion rate of 85%
## Use the default list in our docker; to skip cross-reactive probe filtering, specify it as "."
parameter: cross_reactive_list = path("/opt/cross_reactive_probe_Hop2020.txt")
parameter: hg = 38 # hg = 38 or 19 for EPIC data, 38 by default. Note that for 450K data only hg19 is available
input: phenoFile
output: f'{cwd}/{_input:bn}.minfi.rds',f'{cwd}/{_input:bn}.methyl.beta.bed',f'{cwd}/{_input:bn}.methyl.M.bed',f'{cwd}/{_input:bn}.methyl.M.region_list'  
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads
R: expand= "${ }", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout', container=container
    ## load libraries
    library(dplyr)
    library(readr)
    library(tibble)
    library(minfi)
    sessionInfo()
    cross_reactive = readr::read_delim("${cross_reactive_list}","\t")$probe
    
    ## Define functions
    bscon_minfi <- function(RGsetEx){
        
        #getting the values from the green channel
        (csp.green <- function (RGsetEx, controls = c("BISULFITE CONVERSION I", "BISULFITE CONVERSION II"))
        {
          minfi:::.isRGOrStop(RGsetEx)
          r <- getRed(RGsetEx)
          g <- getGreen(RGsetEx)
          sapply (controls, function( controlType ) {
        
           ctrlAddress <- try (getControlAddress(RGsetEx, controlType = controlType), silent = T)
           if (!inherits (ctrlAddress, 'try-error')){ctrlAddress <- getControlAddress(RGsetEx, controlType = controlType)}
           else
             stop ("450k QC data could not be found")
        
        
           g[ctrlAddress, ]
          })})
        
        green <- csp.green(RGsetEx)
        
        #Getting values from the red channel
        (csp.red <- function (RGsetEx, controls = c("BISULFITE CONVERSION I", "BISULFITE CONVERSION II"))
        {
          minfi:::.isRGOrStop(RGsetEx)
          r <- getRed(RGsetEx)
          g <- getGreen(RGsetEx)
          sapply (controls, function( controlType ) {
        
            ctrlAddress <- getControlAddress(RGsetEx, controlType = controlType)
        
            r[ctrlAddress, ]
          })})
        
        red <- csp.red (RGsetEx)
        
        #selecting only the Bisulfite conversion I values from both green and red
        bsI.green <- green$`BISULFITE CONVERSION I`
        bsI.red <- red$`BISULFITE CONVERSION I`
        #selecting only the Bisulfite conversion II values from both green and red
        bsII.green <- green$`BISULFITE CONVERSION II`
        bsII.red <- red$`BISULFITE CONVERSION II`
        
        # calculate BS conv type I betas as an example of using an index vector
        if(nrow(bsI.green) > 11){ # 450K
          BSI.betas <- rbind(bsI.green[1:3,], bsI.red[7:9,])/((rbind(bsI.green[1:3,], bsI.red[7:9,])) + rbind(bsI.green[4:6,], bsI.red[10:12,]))
        } else { # EPIC
          BSI.betas <- rbind(bsI.green[1:2,], bsI.red[6:7,])/((rbind(bsI.green[1:2,], bsI.red[6:7,])) + rbind(bsI.green[3:4,], bsI.red[ 8:9 ,]))
        }
        
        #calculation of BS con in Type II data
        BSII.betas <- bsII.red/(bsII.red + bsII.green)
        
        apply(rbind(BSI.betas, BSII.betas), 2, median)*100 ## this is the value you are interested in
        }

    ## 1. read idat files
    targets <- read.metharray.sheet(${_input:adr})
    colnames(targets)[1] = "Sample_Name"
    ### Only read samples with data
    Missing_sample = targets%>%filter(!stringr::str_detect(targets$Basename ,"/") )%>%pull(Sample_Name) 
    targets = targets%>%filter(stringr::str_detect(targets$Basename ,"/") ) 
    message(paste0("Following samples: ",paste0(Missing_sample,collapse = ", "), " don't have IDAT data" ))
    
    rgSet <- read.metharray.exp(targets = targets)
    # switch EPIC data to the hg38 annotation when requested
    if(${hg} == 38 && rgSet@annotation["array"] == 'IlluminaHumanMethylationEPIC' ){rgSet@annotation['annotation'] = "ilm10b5.hg38"}
    message("rgset created")
    
    ###### Quality Control and Normalization ###############
    ## 2. bisulfite conversion rate filtering
    rgSet_bcr = bscon_minfi(rgSet)
    rgSet = rgSet[,names(which(rgSet_bcr > ${bisulfite_conversion_tre} ))]
    
    ## 3. QC based on detection p-value: keep samples whose mean detection p-value is below ${samples_pval_tre}
    
    detP <- detectionP(rgSet)
    keep <- colMeans(detP) < ${samples_pval_tre}
    rgSet <- rgSet[,keep]
    targets <- targets[keep,]
    message("samples with avg det p-val >= ${samples_pval_tre} removed")

    ## 4. Normalize the data - Quantile
    mSetSq <- preprocessQuantile(rgSet)
    message("data quantile-normalized")
    ## 5. Remove XY chr - probes
    
    mSetSq <- mSetSq[!as.vector(mSetSq@rowRanges@seqnames)%in% c("chrX","chrY"),]
    message("sex probes removed")
    
    ## 6. Remove cross-reactive probes
    no_cross_reactive <- !(featureNames(mSetSq) %in% cross_reactive)
    mSetSq <- mSetSq[no_cross_reactive, ]
    message("cross-reactive probes removed")
    
    ## 7. Drop probes that are also SNPs
    
    mSetSq <- dropLociWithSnps(mSetSq)
    message("probes overlapping with snps removed")
    
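    ## 8. Remove probes whose detection p-value is >= ${probe_det_pval_tre} in any retained sample
    # restrict detP to the probes and samples still present in mSetSq so the counts line up
    detP <- detP[match(featureNames(mSetSq),rownames(detP)), colnames(mSetSq), drop = FALSE]
    keep <- rowSums(detP < ${probe_det_pval_tre}) == ncol(mSetSq)
    mSetSq <- mSetSq[keep,]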
    
    ## 9. get Beta and Mvalues
    mSetSqbval <- getBeta(mSetSq)%>%as_tibble(rownames = "ID")
    mSetSqMval <- getM(mSetSq)%>%as_tibble(rownames = "ID")
    cpg_regions = mSetSq@rowRanges%>%as.data.frame()%>%as_tibble(rownames = "ID")%>%select("#chr" = seqnames, start, end, ID, -width,-strand)
    mSetSqbval = cpg_regions%>%right_join(mSetSqbval, by = "ID")%>%mutate(end = start + 1)
    mSetSqMval = cpg_regions%>%right_join(mSetSqMval, by = "ID")%>%mutate(end = start + 1)
    message("Beta-values and M-values obtained")
    ## 10. output data
    mSetSqMval = mSetSqMval%>%rename_at(vars(rgSet@colData%>%rownames()), function(x) rgSet@colData[x,]%>%as_tibble%>%pull(Sample_Name) )
    mSetSqbval = mSetSqbval%>%rename_at(vars(rgSet@colData%>%rownames()), function(x) rgSet@colData[x,]%>%as_tibble%>%pull(Sample_Name) )
    mSetSqMval[,1:4]%>%write_delim(${_output[3]:r},"\t")
    mSetSqbval%>%readr::write_delim("${_output[1]}","\t")
    mSetSqMval%>%readr::write_delim("${_output[2]}","\t")
    # store the objects themselves (not their names) so downstream steps can reuse them
    output = list(rgSet = rgSet, mSetSq = mSetSq, mSetSqbval = mSetSqbval, mSetSqMval = mSetSqMval)
    output%>%saveRDS(${_output[0]:r})

## Format and annotate methylation data

After the beta and M values are generated, each output bed file is compressed with bgzip and indexed with tabix.
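
As a rough picture of the file layout, each CpG probe occupies one row with a 1-bp interval (`end = start + 1`) followed by one column per sample, and tabix expects rows to be sorted by chromosome and position. The Python sketch below illustrates that layout only. The header names mirror those written by the R step above; the helper name `to_bed_lines` and the example values are hypothetical.

```python
def to_bed_lines(records, samples):
    """Format (chrom, start, probe_id, values) records as tab-separated BED lines.

    Rows are sorted by chromosome and start position, as indexing requires.
    """
    lines = ["\t".join(["#chr", "start", "end", "ID", *samples])]
    for chrom, start, probe_id, values in sorted(records, key=lambda r: (r[0], r[1])):
        fields = [chrom, str(start), str(start + 1), probe_id, *map(str, values)]
        lines.append("\t".join(fields))
    return lines


example = to_bed_lines(
    [("chr2", 500, "cg2", [0.1, 0.2]), ("chr1", 100, "cg1", [0.9, 0.8])],
    ["S1", "S2"],
)
print("\n".join(example))
```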

In [None]:
[methylation_2]
output: f'{_input[0]:nn}.methyl.beta.bed.gz',f'{_input[0]:nn}.methyl.M.bed.gz' 
bash: expand= "${ }", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout', container=container
    # compress each bed file, then build a tabix index using the bed preset
    bgzip -f ${_input[1]} 
    tabix -p bed ${_output[0]}
    bgzip -f ${_input[2]} 
    tabix -p bed ${_output[1]}

In [None]:
#[combat_batch]
parameter: batchFile = path(".")
input: output_from("methylation")
output:f'{_input}.MDS.pdf',f'{_input}.beta.batch_corrected.txt'
R: expand= "${ }", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout', container=container
    library(ggplot2)
    library(metaMA)
    library(sva)
    library(lme4)
    ## Define functions
    posibatches <- function(dat, Sentrix, batch=TRUE, par.prior=TRUE, prior.plots=FALSE, mean.only.posi=FALSE, mean.only.batch=FALSE) {
      require(sva)
      require(lme4)
      #get the position and batch information----------------------------------------
      ### Extraction of the position numbers and chip numbers--------------------
      #if (is.null(SentrixVector)){
      if (is.null(Sentrix)){
        stop('Sentrix information must be provided.')
      }
      #chips<-as.numeric(factor(substr(Sentrix$sampleNames, 1, 10))) 
      #positions<-as.numeric(factor(substr(Sentrix$sampleNames, 12, 17))) 
      
      chips <- as.numeric(factor(sapply(strsplit(as.character(Sentrix$sampleNames), "_"), function(x) x[1])))
      positions <- as.numeric(factor(sapply(strsplit(as.character(Sentrix$sampleNames), "_"), function(x) x[2])))
      
      if (length(positions)!=length(chips)){
        stop('positions and chips must have the same length')
      }
      if (sum(positions>12)>0){
        stop('Position number cannot be greater than 12')
      }
      if (sum(is.na(chips))>0 || sum(is.na(positions))>0){
        stop('One or more position or chip numbers missing')
      }
      
      if (sum(is.na(Sentrix$batches))==length(chips)){
        batches<-chips
      } else {
        batches<-as.factor(Sentrix$batches)
      }
      
      if(batch==TRUE){
        
        ################################################################################
        ################################################################################
        pct_threshold = .8 # Amount of variability desired to be explained by the principal components.  Set to match the results in book chapter and SAS code.  User can adjust this to a higher (>= 0.8) number but < 1.0
        dataRowN <- nrow(dat)
        dataColN <- ncol(dat)
        
        ########## Center the data (center rows) ##########
        datCentered <- matrix(data = 0, nrow = dataRowN, ncol = dataColN)
        datCentered_transposed = apply(dat, 1, scale, center = TRUE, scale = FALSE)
        datCentered = t(datCentered_transposed)
        
        exp_design<-data.frame(cbind(positions,batches))
        expDesignRowN <- nrow(exp_design)
        expDesignColN <- ncol(exp_design)
        myColNames <- names(exp_design)
        
        
        ########## Compute correlation matrix ##########
        
        theDataCor <- cor(datCentered)
        
        ########## Obtain eigenvalues ##########
        
        eigenData <- eigen(theDataCor)
        eigenValues = eigenData$values
        ev_n <- length(eigenValues)
        eigenVectorsMatrix = eigenData$vectors
        eigenValuesSum = sum(eigenValues)
        percents_PCs = eigenValues /eigenValuesSum
        
        ########## Merge experimental file and eigenvectors for n components ##########
        
        my_counter_2 = 0
        my_sum_2 = 1
        for (i in ev_n:1){
          my_sum_2  = my_sum_2 - percents_PCs[i]
          if ((my_sum_2) <= pct_threshold ){
            my_counter_2 = my_counter_2 + 1
          }
          
        }
        if (my_counter_2 < 3){
          pc_n  = 3
          
        }else {
          pc_n = my_counter_2
        }
        
        # pc_n is the number of principal components to model
        
        pc_data_matrix <- matrix(data = 0, nrow = (expDesignRowN*pc_n), ncol = 1)
        mycounter = 0
        for (i in 1:pc_n){
          for (j in 1:expDesignRowN){
            mycounter <- mycounter + 1
            pc_data_matrix[mycounter,1] = eigenVectorsMatrix[j,i]
            
          }
        }
        
        AAA <- exp_design[rep(1:expDesignRowN,pc_n),]
        
        Data <- cbind(AAA,pc_data_matrix)
        
        ####### Edit these variables according to your factors #######
        
        variables <- c(colnames(exp_design))
        for (i in seq_along(variables)) {
          # convert each design column (positions, batches) to a factor
          Data[[variables[i]]] <- as.factor(Data[[variables[i]]])
        }
        
        ########## Mixed linear model ##########
        
        p <- options(warn = (-1))
        #effects_n = expDesignColN + choose(expDesignColN, 2) + 1
        effects_n = expDesignColN  + 1
        randomEffectsMatrix <- matrix(data = 0, nrow = pc_n, ncol = effects_n)
        
        model.func <- c()
        index <- 1
        for (i in 1:length(variables)) {
          mod = paste("(1|", variables[i], ")", sep = "")
          model.func[index] = mod
          index = index + 1
        }
        
        function.mods <- paste(model.func, collapse = " + ")
        
        for (i in 1:pc_n) {
          y = (((i - 1) * expDesignRowN) + 1)
          funct <- paste("pc_data_matrix", function.mods, sep = " ~ ")
          Rm1ML <- lmer(funct, Data[y:(((i - 1) * expDesignRowN) +
                                         expDesignRowN), ], REML = TRUE, control=lmerControl(check.nobs.vs.nlev = "ignore",check.nobs.vs.rankZ = "ignore",check.nobs.vs.nRE="ignore"),verbose = FALSE,
                        na.action = na.omit)
          randomEffects <- Rm1ML
          randomEffectsMatrix[i, ] <- c(unlist(VarCorr(Rm1ML)),
                                        resid = sigma(Rm1ML)^2)
        }
        effectsNames <- c(names(getME(Rm1ML, "cnms")), "resid")
        ########## Standardize Variance ##########
        
        randomEffectsMatrixStdze <- matrix(data = 0, nrow = pc_n, ncol = effects_n)
        for (i in 1:pc_n){
          mySum = sum(randomEffectsMatrix[i,])
          for (j in 1:effects_n){
            randomEffectsMatrixStdze[i,j] = randomEffectsMatrix[i,j]/mySum
          }
        }
        
        ########## Compute Weighted Proportions ##########
        
        randomEffectsMatrixWtProp <- matrix(data = 0, nrow = pc_n, ncol = effects_n)
        for (i in 1:pc_n){
          weight = eigenValues[i]/eigenValuesSum
          for (j in 1:effects_n){
            randomEffectsMatrixWtProp[i,j] = randomEffectsMatrixStdze[i,j]*weight
          }
        }
        ######### Compute Weighted Ave Proportions ##########
        
        randomEffectsSums <- matrix(data = 0, nrow = 1, ncol = effects_n)
        randomEffectsSums <-colSums(randomEffectsMatrixWtProp)
        totalSum = sum(randomEffectsSums)
        randomEffectsMatrixWtAveProp <- matrix(data = 0, nrow = 1, ncol = effects_n)
        
        for (j in 1:effects_n){
          randomEffectsMatrixWtAveProp[j] = randomEffectsSums[j]/totalSum
          
        }
        
        if(randomEffectsMatrixWtAveProp[,1]<randomEffectsMatrixWtAveProp[,2]){
          afterbatchExp<-ComBat(dat = dat, batch = batches, par.prior = par.prior, prior.plots = prior.plots, mean.only = mean.only.batch)
          afterposiExp<-ComBat(dat = afterbatchExp, batch = positions, par.prior = par.prior, prior.plots = prior.plots, mean.only = mean.only.posi)
          dat<-afterposiExp
        } else{
          afterposiExp<-ComBat(dat = dat, batch = positions, par.prior = par.prior, prior.plots = prior.plots, mean.only = mean.only.posi)
          afterbatchExp<-ComBat(dat = afterposiExp, batch = batches, par.prior = par.prior, prior.plots = prior.plots, mean.only = mean.only.batch)
          dat<-afterbatchExp
        }
      }else{
        afterposiExp<-ComBat(dat = dat, batch = positions, par.prior = par.prior, prior.plots = prior.plots, mean.only = mean.only.posi)
        dat<-afterposiExp
      }
      return(dat)
    }
    ## Combat batch effect
    mSetSq = readRDS("${_input[0]}")$mSetSq
    Sentrix = read.csv("${batchFile}")
    results <- posibatches(mSetSq, Sentrix, batch=TRUE, par.prior=TRUE, prior.plots=FALSE, mean.only.posi=FALSE, mean.only.batch=FALSE)
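    # ComBat returns a matrix; convert it to a data frame with an ID column before writing
    readr::write_delim(tibble::rownames_to_column(as.data.frame(results), "ID"), ${_output[1]:r}, "\t")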
    ## Plot MDS
    data4mds <- function(data, topvals){
      b <- data
      o <- order(rowVars(b), decreasing = TRUE)[seq_len(topvals)]
      d <- dist(t(b[o, ]))
      fit <- cmdscale(d)
      return(fit)
    }
    
    myfit <- as.data.frame(data4mds(mSetSq, 1000))
    colnames(myfit) <- c("fit1", "fit2")

    MDSplot1 = ggplot(myfit, aes(x = fit1, y = fit2)) + geom_point(size = 1.5) + 
      labs(title = "MDS plot",  x = "Coordinate 1",  y = "Coordinate 2") + 
      theme(axis.title.x = element_text(face = "bold", size = 26), axis.text.x = element_text(face="bold", size = 20)) + theme(axis.text.y = element_text(face="bold", size = 20), axis.title.y = element_text(face = "bold", size = 26)) +  
      theme(panel.background = element_rect(fill = "white", colour = "white")) + 
      theme(panel.border = element_blank(), panel.grid.major = element_blank(), panel.grid.minor = element_blank(), axis.line = element_line(size = 0.5, linetype = "solid", colour = "black")) +  
      theme(plot.title = element_text(face = "bold", size = 30, hjust = 0.5)) + 
      theme(legend.position="bottom", legend.text=element_text(size=16))
    # ggsave takes the file name first, so pass the plot by name
    ggsave(${_output[0]:r}, plot = MDSplot1)