
DreamAI2

DreamAI2::DreamAI2

Description

The function DreamAI2 imputes a dataset with missing values or NA's using individual or ensemble output from 7 different methods.

Individual methods:

  • "KNN": k nearest neighbor
  • "MissForest": nonparametric Missing Value Imputation using Random Forest
  • "ADMIN": abundance dependent missing imputation
  • "Birnn": imputation using IRNN-SCAD algorithm
  • "SpectroFM": imputation using matrix factorization
  • "RegImpute": imputation using Glmnet ridge regression
  • "MICE": Multiple Imputation by Chained Equations

Ensemble methods:

  • "Ensemble": average of the 7 individual methods or the user specified methods among the 7.
  • "Ensemble.Fast": average of the 7 individual methods or the user specified methods among the 7 excluding "MissForest".
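
The ensemble output can be pictured as an element-wise average of the matrices produced by the selected methods. The sketch below is a minimal base R illustration under that assumption, not the package's internal code: fill_row_mean, fill_col_mean and ensemble_average are hypothetical helper names, and the two simple fills only stand in for real imputation methods.

```r
# Hypothetical stand-ins for two imputation methods (not part of DreamAI2).
fill_row_mean <- function(x) {
  row_means <- rowMeans(x, na.rm = TRUE)
  idx <- which(is.na(x), arr.ind = TRUE)   # (row, col) positions of the NAs
  x[idx] <- row_means[idx[, "row"]]
  x
}
fill_col_mean <- function(x) {
  col_means <- colMeans(x, na.rm = TRUE)
  idx <- which(is.na(x), arr.ind = TRUE)
  x[idx] <- col_means[idx[, "col"]]
  x
}

# Element-wise average of a list of equally sized imputed matrices.
ensemble_average <- function(imputed_list) {
  Reduce(`+`, imputed_list) / length(imputed_list)
}

toy <- matrix(c(1, NA, 3,
                4, 5, NA), nrow = 2, byrow = TRUE)
imputed <- list(row = fill_row_mean(toy), col = fill_col_mean(toy))
ens <- ensemble_average(imputed)
```

Observed entries are identical in every imputed matrix, so the average leaves them unchanged; only the originally missing cells differ between methods.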

Usage

DreamAI2(data, k = 10, maxiter_MF = 10, ntree = 100,
  maxnodes = NULL, maxiter_ADMIN = 30, tol = 10^(-2),
  gamma_ADMIN = NA, gamma = 50, CV = FALSE,
  fillmethod = "row_mean", maxiter_RegImpute = 10,
  conv_nrmse = 1e-06, iter_SpectroFM = 40,
  m_mice = 1, method_mice = 'pmm', maxit_mice = 20,
  method = c("KNN", "MissForest", "ADMIN", "Birnn", "SpectroFM", "RegImpute", "MICE"),
  out = c("Ensemble.Fast"))

Arguments

| Parameter | Default | Description |
|---|---|---|
| data | | dataset in the form of a matrix or data frame with missing values or NA's. The function throws an error and stops if any row or column in the dataset is missing all values |
| k | 10 | number of neighbors to be used in the imputation by "KNN" and "ADMIN" |
| maxiter_MF | 10 | maximum number of iterations to be performed in the imputation by "MissForest" if the stopping criterion is not met beforehand |
| ntree | 100 | number of trees to grow in each forest in "MissForest" |
| maxnodes | NULL | maximum number of terminal nodes for trees in the forest in "MissForest"; must be at least the number of columns in the given data |
| maxiter_ADMIN | 30 | maximum number of iterations to be performed in the imputation by "ADMIN" if the stopping criterion is not met beforehand |
| tol | 10^(-2) | convergence threshold for "ADMIN" |
| gamma_ADMIN | NA | parameter for "ADMIN" to control abundance dependent missingness. Set gamma_ADMIN=0 for log ratio intensity data. For abundance data set gamma_ADMIN=NA, and it will be estimated from the data |
| gamma | 50 | parameter of the supergradients of popular nonconvex surrogate functions of the L0-norm, e.g. SCAD and MCP, for "Birnn" |
| CV | FALSE | a logical value indicating whether to choose the best gamma by cross-validation for "Birnn". If CV=FALSE, the default gamma=50 is used; if CV=TRUE, gamma is calculated using cross-validation |
| fillmethod | "row_mean" | a string identifying the simple imputation method used to initially fill the missing values for "RegImpute". It can be "row_mean" or "zeros", with "row_mean" being the default. A warning is thrown if "row_median" is used |
| maxiter_RegImpute | 10 | maximum number of iterations to reach convergence in the imputation by "RegImpute" |
| conv_nrmse | 1e-06 | convergence threshold for "RegImpute" |
| iter_SpectroFM | 40 | number of iterations for "SpectroFM" |
| m_mice | 1 | number of multiple imputations in "MICE" |
| method_mice | "pmm" | imputation method to be used for each column in "MICE" |
| maxit_mice | 20 | a scalar giving the number of iterations in "MICE" |
| method | c("KNN", "MissForest", "ADMIN", "Birnn", "SpectroFM", "RegImpute", "MICE") | a vector of imputation methods selected from "KNN", "MissForest", "ADMIN", "Birnn", "SpectroFM", "RegImpute" and "MICE" |
| out | c("Ensemble.Fast") | a vector of imputation methods for which the function will output the imputed matrices. Default is "Ensemble.Fast" |
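
Because the function stops when any row or column is missing all values, it can help to check for such rows and columns before calling it. The following base R sketch shows one way to do that; find_all_missing is a hypothetical helper, not part of DreamAI2, and the toy matrix is made up for illustration.

```r
# Hypothetical helper (not part of DreamAI2): report the rows and columns
# that are entirely NA.
find_all_missing <- function(x) {
  x <- as.matrix(x)
  na_mask <- is.na(x)
  list(rows = which(rowSums(na_mask) == ncol(x)),
       cols = which(colSums(na_mask) == nrow(x)))
}

toy <- matrix(c( 1, NA,  3,
                NA, NA, NA,
                 7, NA,  9), nrow = 3, byrow = TRUE)
bad <- find_all_missing(toy)

# Drop the offending rows and columns before imputing.
clean <- toy[setdiff(seq_len(nrow(toy)), bad$rows),
             setdiff(seq_len(ncol(toy)), bad$cols), drop = FALSE]
```

Dropping columns can in principle leave a row with no observed values (and vice versa), so on real data it may be worth re-running the check on the cleaned matrix.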

Value

A list of imputed datasets, one for each method specified by the user.

Notes

If all methods are specified for obtaining the "Ensemble" imputed matrix, imputing a dataset of dimension 26000 x 200 takes approximately 50 hours.

Example

data(datapnnl)
data <- datapnnl.rm.ref[1:100, 1:21]
impute <- DreamAI2(data, k = 10, maxiter_MF = 10, ntree = 100,
  maxnodes = NULL, maxiter_ADMIN = 30, tol = 10^(-2),
  gamma_ADMIN = NA, gamma = 50, CV = FALSE,
  fillmethod = "row_mean", maxiter_RegImpute = 10,
  conv_nrmse = 1e-6, iter_SpectroFM = 40,
  m_mice = 1, method_mice = 'pmm', maxit_mice = 20,
  method = c("KNN", "MissForest", "ADMIN", "Birnn", "SpectroFM", "RegImpute", "MICE"),
  out = "Ensemble.Fast")
impute$Ensemble.Fast

DreamAI2::DreamAI2_Bagging

Description

The function DreamAI2_Bagging imputes a dataset with missing values or NA's by bag imputation with the help of parallel processing. Pseudo datasets are generated that contain both the true missing values (as in the original dataset) and additional pseudo missing values, and every such pseudo dataset is imputed by the individual or ensemble output of the 7 different methods: KNN, MissForest, ADMIN, Birnn, SpectroFM, RegImpute and MICE (described in the documentation of the function DreamAI2).
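
To make the idea of a pseudo dataset concrete, the sketch below masks some of the observed entries of a toy matrix while keeping the true missing values in place. It is only an illustration: add_pseudo_missing is a hypothetical helper, and choosing the masked entries uniformly at random is an assumption, since this document does not say how the package selects pseudo missing values.

```r
# Hypothetical helper (not part of DreamAI2): mask a fraction of the observed
# entries, chosen uniformly at random (an assumption), and remember their
# true values so imputations can later be compared against them.
add_pseudo_missing <- function(x, frac = 0.1) {
  observed <- which(!is.na(x))
  n_mask <- max(1, floor(frac * length(observed)))
  masked <- observed[sample.int(length(observed), n_mask)]
  pseudo <- x
  pseudo[masked] <- NA
  list(data = pseudo, masked = masked, truth = x[masked])
}

set.seed(1)
toy <- matrix(rnorm(40), nrow = 8)
toy[c(3, 17)] <- NA                      # true missing values
bag <- add_pseudo_missing(toy, frac = 0.2)
```

Keeping the true values of the masked entries is what allows true and imputed values to be compared for every pseudo missing entry.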

Usage

DreamAI2_Bagging(data, k = 10, maxiter_MF = 10, ntree = 100,
  maxnodes = NULL, maxiter_ADMIN = 30, tol = 10^(-2),
  gamma_ADMIN = NA, gamma = 50, CV = FALSE,
  fillmethod = "row_mean", maxiter_RegImpute = 10,
  conv_nrmse = 1e-06, iter_SpectroFM = 40,
  m_mice = 1, method_mice = 'pmm', maxit_mice = 20,
  method = c("KNN", "MissForest", "ADMIN", "Birnn", "SpectroFM", "RegImpute", "MICE"),
  out = c("Ensemble.Fast"),
  SamplesPerBatch, n.bag, save.out = TRUE, path = NULL, ProcessNum)

Arguments

| Parameter | Default | Description |
|---|---|---|
| data | | dataset in the form of a matrix or data frame with missing values or NA's. The function throws an error and stops if any row or column in the dataset is missing all values |
| k | 10 | number of neighbors to be used in the imputation by "KNN" and "ADMIN" |
| maxiter_MF | 10 | maximum number of iterations to be performed in the imputation by "MissForest" if the stopping criterion is not met beforehand |
| ntree | 100 | number of trees to grow in each forest in "MissForest" |
| maxnodes | NULL | maximum number of terminal nodes for trees in the forest in "MissForest"; must be at least the number of columns in the given data |
| maxiter_ADMIN | 30 | maximum number of iterations to be performed in the imputation by "ADMIN" if the stopping criterion is not met beforehand |
| tol | 10^(-2) | convergence threshold for "ADMIN" |
| gamma_ADMIN | NA | parameter for "ADMIN" to control abundance dependent missingness. Set gamma_ADMIN=0 for log ratio intensity data. For abundance data set gamma_ADMIN=NA, and it will be estimated from the data |
| gamma | 50 | parameter of the supergradients of popular nonconvex surrogate functions of the L0-norm, e.g. SCAD and MCP, for "Birnn" |
| CV | FALSE | a logical value indicating whether to choose the best gamma by cross-validation for "Birnn". If CV=FALSE, the default gamma=50 is used; if CV=TRUE, gamma is calculated using cross-validation |
| fillmethod | "row_mean" | a string identifying the simple imputation method used to initially fill the missing values for "RegImpute". It can be "row_mean" or "zeros", with "row_mean" being the default. A warning is thrown if "row_median" is used |
| maxiter_RegImpute | 10 | maximum number of iterations to reach convergence in the imputation by "RegImpute" |
| conv_nrmse | 1e-06 | convergence threshold for "RegImpute" |
| iter_SpectroFM | 40 | number of iterations for "SpectroFM" |
| m_mice | 1 | number of multiple imputations in "MICE" |
| method_mice | "pmm" | imputation method to be used for each column in "MICE" |
| maxit_mice | 20 | a scalar giving the number of iterations in "MICE" |
| method | | a vector of imputation methods selected from "KNN", "MissForest", "ADMIN", "Birnn", "SpectroFM", "RegImpute" and "MICE"; must be specified |
| SamplesPerBatch | | number of samples per batch (batch size in the original data) |
| n.bag | | number of pseudo datasets to generate and impute in the current process |
| save.out | TRUE | logical indicator of whether to save the output. When TRUE the output is saved; when FALSE the output is returned |
| path | NULL | location to save the output file from the current process. The path only needs to be specified when save.out=TRUE |
| ProcessNum | | process number, starting from 1, when run on a cluster, e.g. 1 - 10, 1 - 100 etc. Needs to be specified only if the output is saved |
| out | "Ensemble.Fast" | a vector of imputation methods for which the function will output the imputed matrices |

Value

A list containing the imputed dataset for each method specified by the user (averaged over all pseudo imputed data matrices), n.bag, and a summary matrix with the gene name, sample name, true value and imputed value of every pseudo missing entry, combined from the n.bag datasets.

Notes

This function can be run as a parallel job on a cluster. It generates and saves a .RData file containing the output from the current process in the location provided by the user, with the process number in the file name. If the user runs it on a local computer multiple times, changing ProcessNum every time will generate and save a separate .RData file for each given ProcessNum.
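
One way to drive this from a cluster job array is to derive ProcessNum from the scheduler's task index. The sketch below assumes a Slurm array job, whose index is exposed in the SLURM_ARRAY_TASK_ID environment variable, and falls back to 1 for a local run; run_bagging_process is a hypothetical wrapper, and its SamplesPerBatch and n.bag values are simply copied from the example in this document.

```r
# Read the process number from the scheduler. Using Slurm's array index is an
# assumption; any scheduler variable that counts from 1 would serve.
process_num <- as.integer(Sys.getenv("SLURM_ARRAY_TASK_ID", unset = "1"))

# Hypothetical wrapper: each process runs one call and saves its own output
# file into the shared directory out_dir.
run_bagging_process <- function(data, out_dir, process_num) {
  DreamAI2::DreamAI2_Bagging(data = data, SamplesPerBatch = 3, n.bag = 2,
                             save.out = TRUE, path = out_dir,
                             ProcessNum = process_num)
}
```

Once every process has finished, bag.summary can be pointed at the same directory to combine the saved outputs.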

Example

data(datapnnl)
data <- datapnnl.rm.ref[1:100, 1:21]
impute <- DreamAI2_Bagging(data = data, k = 10, maxiter_MF = 10, ntree = 100,
  maxnodes = NULL, maxiter_ADMIN = 30, tol = 10^(-2),
  gamma_ADMIN = NA, gamma = 50, CV = FALSE,
  fillmethod = "row_mean", maxiter_RegImpute = 10,
  conv_nrmse = 1e-6, iter_SpectroFM = 40,
  m_mice = 1, method_mice = 'pmm', maxit_mice = 20,
  method = c("KNN", "MissForest", "ADMIN", "Birnn", "SpectroFM", "RegImpute", "MICE"),
  SamplesPerBatch = 3, n.bag = 2, save.out = TRUE,
  path = "C:\\Users\\chowds14\\Desktop\\test_package\\", ProcessNum = 1)
impute$Ensemble.Fast

DreamAI2::bag.summary

Description

Wrapper function for summarizing the outputs from DreamAI2_Bagging.

Usage

bag.summary(method = c("KNN", "MissForest", "ADMIN", "Birnn",
  "SpectroFM", "RegImpute", "MICE"), nNodes = 3, path = NULL)

Arguments

| Parameter | Default | Description |
|---|---|---|
| method | "Ensemble" | a vector of imputation methods. This vector should be the same as, or a subset of, the vector out in DreamAI2_Bagging. Default is "Ensemble" |
| nNodes | 3 | number of parallel processes |
| path | NULL | location where the bagging output is saved |

Value

A list containing the final imputed data and a confidence score for every gene, computed using the pseudo missing values.

Example

data(datapnnl)
data <- datapnnl.rm.ref[1:100, 1:21]
impute <- DreamAI2_Bagging(data = data, k = 10, maxiter_MF = 10, ntree = 100,
  maxnodes = NULL, maxiter_ADMIN = 30, tol = 10^(-2),
  gamma_ADMIN = NA, gamma = 50, CV = FALSE,
  fillmethod = "row_mean", maxiter_RegImpute = 10,
  conv_nrmse = 1e-6, iter_SpectroFM = 40,
  m_mice = 1, method_mice = 'pmm', maxit_mice = 20,
  method = c("KNN", "MissForest", "ADMIN", "Birnn", "SpectroFM", "RegImpute", "MICE"),
  SamplesPerBatch = 3, n.bag = 2, save.out = TRUE,
  path = "C:\\Users\\chowds14\\Desktop\\test_package\\", ProcessNum = 1)
final.out <- bag.summary(method = c("KNN"), nNodes = 2,
  path = "C:\\Users\\chowds14\\Desktop\\test_package\\")
final.out$score
final.out$imputed_data

About

DreamAI2 for Bioinformatics Application Note
