- DreamAI2::DreamAI2
- Imputation of Missing Protein Abundances with Iterative Prediction Model
- DreamAI2::DreamAI2_Bagging
- Bag Imputation of Missing Protein Abundances with Iterative Prediction Model
- DreamAI2::bag.summary
- Wrapper function for summarizing the outputs from DreamAI2_Bagging
The function DreamAI2 imputes a dataset with missing values or NA's using individual or ensemble output from 7 different methods.
Individual methods:
- "KNN": k nearest neighbor
- "MissForest": nonparametric Missing Value Imputation using Random Forest
- "ADMIN": abundance dependent missing imputation
- "Birnn": imputation using IRNN-SCAD algorithm
- "SpectroFM": imputation using matrix factorization
- "RegImpute": imputation using Glmnet ridge regression
- "MICE": Multiple Imputation by Chained Equations
Ensemble methods:
- "Ensemble": average of the 7 individual methods, or of the user-specified subset of them.
- "Ensemble.Fast": average of the 7 individual methods, or of the user-specified subset of them, excluding "MissForest".
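The averaging behind the two ensemble outputs can be illustrated with a minimal base R sketch. It assumes that each individual method returns a complete matrix of the same dimensions and that the ensemble is a plain element-wise mean; the helper `ensemble_average` and the toy values are hypothetical and are not part of the package.

```r
# Toy "imputed" matrices standing in for the outputs of individual methods.
# The values are made up for illustration only.
imputed <- list(
  KNN        = matrix(c(1, 2, 3, 4), nrow = 2),
  MissForest = matrix(c(3, 2, 5, 4), nrow = 2),
  ADMIN      = matrix(c(2, 2, 4, 4), nrow = 2)
)

# Hypothetical helper: element-wise mean over a list of equally sized matrices.
ensemble_average <- function(mats) {
  stopifnot(length(mats) > 0)
  Reduce(`+`, mats) / length(mats)
}

# "Ensemble"-style average over all supplied methods.
ens <- ensemble_average(imputed)

# "Ensemble.Fast"-style average: the same, but leaving out "MissForest".
ens_fast <- ensemble_average(imputed[setdiff(names(imputed), "MissForest")])

print(ens)
print(ens_fast)
```

Dropping "MissForest" only changes which matrices enter the mean, which is why the fast variant is cheaper to compute.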
DreamAI2(data, k = 10, maxiter_MF = 10, ntree = 100,
maxnodes = NULL, maxiter_ADMIN = 30, tol = 10^(-2),
gamma_ADMIN = NA, gamma = 50, CV = FALSE,
fillmethod = "row_mean", maxiter_RegImpute = 10,
conv_nrmse = 1e-06, iter_SpectroFM = 40,
m_mice = 1, method_mice = 'pmm', maxit_mice = 20,
method = c("KNN", "MissForest", "ADMIN", "Birnn", "SpectroFM", "RegImpute", "MICE"),
out = c("Ensemble.Fast"))
| Parameter | Default | Description |
|---|---|---|
| data | must specify | dataset in the form of a matrix or data frame with missing values (NAs). The function throws an error and stops if any row or column in the dataset is missing all values |
| k | 10 | number of neighbors to be used in the imputation by KNN and ADMIN |
| maxiter_MF | 10 | maximum number of iterations to perform in the imputation by "MissForest" if the stopping criterion is not met beforehand |
| ntree | 100 | number of trees to grow in each forest in "MissForest" |
| maxnodes | NULL | maximum number of terminal nodes for trees in the forest in "MissForest"; must be at least the number of columns in the given data |
| maxiter_ADMIN | 30 | maximum number of iterations to perform in the imputation by "ADMIN" if the stopping criterion is not met beforehand |
| tol | 10^(-2) | convergence threshold for "ADMIN" |
| gamma_ADMIN | NA | parameter for "ADMIN" that controls abundance-dependent missingness. Set gamma_ADMIN=0 for log-ratio intensity data. For abundance data, set gamma_ADMIN=NA and it will be estimated accordingly |
| gamma | 50 | parameter of the supergradients of popular nonconvex surrogate functions, e.g. SCAD and MCP of L0-norm for Birnn |
| CV | FALSE | a logical value indicating whether to fit the best gamma with cross validation for "Birnn". If CV=FALSE, default gamma=50 is used, while if CV=TRUE gamma is calculated using cross-validation. |
| fillmethod | "row_mean" | a string identifying the simple imputation method used to initially fill the missing values for "RegImpute". It can be "row_mean" or "zeros", with "row_mean" being the default. A warning is thrown if "row_median" is used. |
| maxiter_RegImpute | 10 | maximum number of iterations to reach convergence in the imputation by "RegImpute" |
| conv_nrmse | 1e-06 | convergence threshold for "RegImpute" |
| iter_SpectroFM | 40 | number of iterations for "SpectroFM" |
| m_mice | 1 | Number of multiple imputations in "MICE" |
| method_mice | "pmm" | imputation method to be used for each column in "MICE" |
| maxit_mice | 20 | A scalar giving the number of iterations in "MICE" |
| method | c("KNN", "MissForest", "ADMIN", "Birnn", "SpectroFM", "RegImpute", "MICE") | a vector of imputation methods selected from "KNN", "MissForest", "ADMIN", "Birnn", "SpectroFM", "RegImpute" and "MICE". |
| out | c("Ensemble.Fast") | a vector of imputation methods for which the function will output the imputed matrices. Default is "Ensemble.Fast" |
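Because the function stops when a row or column has no observed values, it can help to check the input beforehand. The following is a minimal base R sketch of such a check; the helper `find_all_missing` is hypothetical and is not part of the package.

```r
# Hypothetical helper: indices of rows and columns with no observed values.
find_all_missing <- function(x) {
  x <- as.matrix(x)
  list(
    rows = which(rowSums(!is.na(x)) == 0),
    cols = which(colSums(!is.na(x)) == 0)
  )
}

# Toy matrix: row 2 and column 2 are entirely missing.
toy <- matrix(c( 1, NA,  3,
                NA, NA, NA,
                 4, NA,  6), nrow = 3, byrow = TRUE)

bad <- find_all_missing(toy)

# Keep only rows and columns that have at least one observed value.
keep_rows <- setdiff(seq_len(nrow(toy)), bad$rows)
keep_cols <- setdiff(seq_len(ncol(toy)), bad$cols)
clean <- toy[keep_rows, keep_cols, drop = FALSE]

print(bad)
print(clean)
```

Whether to drop such rows and columns or handle them some other way depends on the analysis; the sketch only shows how to locate them.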
Returns a list of imputed datasets, one for each method specified by the user.
If all methods are used to obtain the "Ensemble" imputed matrix, imputing a dataset of dimension 26000 x 200 takes approximately 50 hours.
data(datapnnl)
data<-datapnnl.rm.ref[1:100,1:21]
impute<- DreamAI2(data,k=10,maxiter_MF = 10, ntree = 100,maxnodes = NULL,maxiter_ADMIN=30,tol=10^(-2),gamma_ADMIN=NA,gamma=50,CV=FALSE,fillmethod="row_mean",maxiter_RegImpute=10,conv_nrmse = 1e-6,iter_SpectroFM=40, m_mice = 1, method_mice = 'pmm', maxit_mice = 20,
method = c("KNN", "MissForest", "ADMIN", "Birnn", "SpectroFM", "RegImpute","MICE"),out="Ensemble.Fast")
impute$Ensemble.Fast
The function DreamAI2_Bagging imputes a dataset with missing values (NAs) by bag imputation with the help of parallel processing. Pseudo datasets are generated that contain both the true missing values (as in the original dataset) and additional pseudo-missing values. Every such pseudo dataset is imputed by the individual or ensemble output of the 7 different methods: KNN, MissForest, ADMIN, Birnn, SpectroFM, RegImpute and MICE (descriptions are included in the documentation of the function DreamAI2).
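The idea of a pseudo dataset can be sketched in base R. This sketch assumes that pseudo-missing entries are drawn uniformly at random from the observed entries, which this document does not state; the package may select them differently. The helper `make_pseudo_missing` and its `frac` argument are hypothetical.

```r
set.seed(1)

# Hypothetical helper: mask a fraction of the observed entries and keep
# their true values so imputations can be compared against them later.
make_pseudo_missing <- function(x, frac = 0.1) {
  x <- as.matrix(x)
  observed <- which(!is.na(x))                       # linear indices of observed cells
  n_mask <- max(1, floor(frac * length(observed)))   # number of cells to mask
  masked <- observed[sample.int(length(observed), n_mask)]
  truth <- x[masked]
  x[masked] <- NA                                    # true NAs are left untouched
  list(data = x, masked = masked, truth = truth)
}

# Toy data with two true missing values.
toy <- matrix(rnorm(20), nrow = 4)
toy[c(2, 7)] <- NA

pseudo <- make_pseudo_missing(toy, frac = 0.2)

# The pseudo dataset has the original NAs plus the newly masked cells.
print(sum(is.na(toy)))
print(sum(is.na(pseudo$data)))
```

Keeping `truth` alongside the masked positions is what makes it possible to compare imputed values with known values afterwards.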
DreamAI2_Bagging(data, k = 10, maxiter_MF = 10, ntree = 100,
maxnodes = NULL, maxiter_ADMIN = 30, tol = 10^(-2),
gamma_ADMIN = NA, gamma = 50, CV = FALSE,
fillmethod = "row_mean", maxiter_RegImpute = 10,
conv_nrmse = 1e-06, iter_SpectroFM = 40,
m_mice = 1, method_mice = 'pmm', maxit_mice = 20,
method = c("KNN", "MissForest", "ADMIN", "Birnn", "SpectroFM", "RegImpute", "MICE"), out = c("Ensemble.Fast"),
SamplesPerBatch, n.bag, save.out = TRUE, path = NULL, ProcessNum)
| Parameter | Default | Description |
|---|---|---|
| data | must specify | dataset in the form of a matrix or data frame with missing values (NAs). The function throws an error and stops if any row or column in the dataset is missing all values |
| k | 10 | number of neighbors to be used in the imputation by KNN and ADMIN |
| maxiter_MF | 10 | maximum number of iterations to perform in the imputation by "MissForest" if the stopping criterion is not met beforehand |
| ntree | 100 | number of trees to grow in each forest in "MissForest" |
| maxnodes | NULL | maximum number of terminal nodes for trees in the forest in "MissForest"; must be at least the number of columns in the given data |
| maxiter_ADMIN | 30 | maximum number of iterations to perform in the imputation by "ADMIN" if the stopping criterion is not met beforehand |
| tol | 10^(-2) | convergence threshold for "ADMIN" |
| gamma_ADMIN | NA | parameter for "ADMIN" that controls abundance-dependent missingness. Set gamma_ADMIN=0 for log-ratio intensity data. For abundance data, set gamma_ADMIN=NA and it will be estimated accordingly |
| gamma | 50 | parameter of the supergradients of popular nonconvex surrogate functions, e.g. SCAD and MCP of L0-norm for Birnn |
| CV | FALSE | a logical value indicating whether to fit the best gamma with cross validation for "Birnn". If CV=FALSE, default gamma=50 is used, while if CV=TRUE gamma is calculated using cross-validation. |
| fillmethod | "row_mean" | a string identifying the simple imputation method used to initially fill the missing values for "RegImpute". It can be "row_mean" or "zeros", with "row_mean" being the default. A warning is thrown if "row_median" is used. |
| maxiter_RegImpute | 10 | maximum number of iterations to reach convergence in the imputation by "RegImpute" |
| conv_nrmse | 1e-06 | convergence threshold for "RegImpute" |
| iter_SpectroFM | 40 | number of iterations for "SpectroFM" |
| m_mice | 1 | Number of multiple imputations in "MICE" |
| method_mice | "pmm" | imputation method to be used for each column in "MICE" |
| maxit_mice | 20 | A scalar giving the number of iterations in "MICE" |
| method | must specify | a vector of imputation methods selected from "KNN", "MissForest", "ADMIN", "Birnn", "SpectroFM", "RegImpute" and "MICE" |
| SamplesPerBatch | must specify | number of samples per batch (batch size in the original data) |
| n.bag | must specify | number of pseudo datasets to generate and impute in the current process |
| save.out | TRUE | logical indicator of whether to save the output. When TRUE the output is saved; when FALSE it is returned |
| path | NULL | location to save the output file from the current process. Only needs to be specified when save.out=TRUE |
| ProcessNum | none | process number starting from 1 when run on a cluster, e.g. 1 - 10, 1 - 100 etc. Needs to be specified only if the output is saved |
| out | "Ensemble.Fast" | a vector of imputation methods for which the function will output the imputed matrices. |
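The two documented `fillmethod` options, "row_mean" and "zeros", describe a simple initial fill before the iterative "RegImpute" steps. A minimal base R sketch of such a fill is shown here; the helper `fill_initial` is hypothetical and only illustrates the idea, not the package's internal code.

```r
# Hypothetical helper: initial fill of missing values by row mean or by zero.
fill_initial <- function(x, fillmethod = c("row_mean", "zeros")) {
  fillmethod <- match.arg(fillmethod)
  x <- as.matrix(x)
  miss <- is.na(x)
  if (fillmethod == "zeros") {
    x[miss] <- 0
  } else {
    mu <- rowMeans(x, na.rm = TRUE)   # mean of the observed values in each row
    x[miss] <- mu[row(x)[miss]]       # look up the row mean for each missing cell
  }
  x
}

toy <- matrix(c(1, NA,  3,
                4,  6, NA), nrow = 2, byrow = TRUE)

filled_mean <- fill_initial(toy, "row_mean")
filled_zero <- fill_initial(toy, "zeros")

print(filled_mean)
print(filled_zero)
```

In the toy matrix the row means of the observed values are 2 and 5, so those are the values placed in the missing cells under "row_mean".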
Returns a list containing the imputed datasets (averaged over all pseudo-imputed data matrices) for the methods specified by the user, n.bag, and a summary matrix containing the gene name, sample name, and true and imputed values of every pseudo-missing entry, combined from the n.bag datasets.
This function can be run as a parallel job on a cluster. It generates and saves a .RData file containing the output from the current process in the location provided by the user, with the process number in the file name. If the user runs it on a local computer multiple times, changing ProcessNum each time will generate and save a separate .RData file for each given ProcessNum.
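The one-file-per-process pattern described above can be sketched in base R. The file name pattern `bag_output_<ProcessNum>.RData` and the helper `save_process_output` are hypothetical; the actual file name written by the package is not given in this document. The sketch writes to a temporary directory built with `file.path`, which avoids hard-coding an operating-system-specific path.

```r
# Portable output directory for the sketch (a temporary folder).
out_dir <- file.path(tempdir(), "bagging_output")
dir.create(out_dir, showWarnings = FALSE, recursive = TRUE)

# Hypothetical helper: save one process's result with its number in the file name.
save_process_output <- function(result, path, ProcessNum) {
  # Assumed naming scheme; the real package may name the file differently.
  file <- file.path(path, sprintf("bag_output_%d.RData", ProcessNum))
  save(result, file = file)
  file
}

# Simulate three processes, each saving its own file.
files <- vapply(1:3, function(i) {
  save_process_output(list(process = i), out_dir, i)
}, character(1))

# Load the second file back into a separate environment and inspect it.
e <- new.env()
load(files[2], envir = e)
print(basename(files))
print(e$result$process)
```

Because each process writes to a distinct file, runs with different process numbers do not overwrite each other's output.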
data(datapnnl)
data<-datapnnl.rm.ref[1:100,1:21]
impute<- DreamAI2_Bagging(data=data,k=10,maxiter_MF = 10, ntree = 100,maxnodes = NULL,maxiter_ADMIN=30,tol=10^(-2),gamma_ADMIN=NA,gamma=50,CV=FALSE,fillmethod="row_mean",maxiter_RegImpute=10,conv_nrmse = 1e-6,iter_SpectroFM=40,m_mice = 1, method_mice = 'pmm', maxit_mice = 20,
method=c("KNN","MissForest","ADMIN","Birnn","SpectroFM","RegImpute","MICE"),SamplesPerBatch=3,n.bag=2,save.out=TRUE,path="C:\\Users\\chowds14\\Desktop\\test_package\\",ProcessNum=1)
impute$Ensemble.Fast
Wrapper function for summarizing the outputs from DreamAI2_Bagging
bag.summary(method = c("KNN", "MissForest", "ADMIN", "Birnn",
"SpectroFM", "RegImpute", "MICE"), nNodes = 3, path = NULL)
| Parameter | Default | Description |
|---|---|---|
| method | Ensemble | a vector of imputation methods. This vector should be the same as, or a subset of, the vector out in DreamAI2_Bagging. Default is "Ensemble" |
| nNodes | 3 | number of parallel processes |
| path | NULL | location where the bagging output is saved |
Returns a list of the final imputed data and a confidence score for every gene, computed from the pseudo-missing values.
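This document does not say how the per-gene confidence score is defined. As an illustration only, the sketch below computes one plausible measure: the per-gene Pearson correlation between the true and imputed values of the pseudo-missing entries. The data frame layout (gene, sample, true, imputed), the toy values, and the choice of correlation are all assumptions, not the package's definition.

```r
# Toy summary of pseudo-missing entries; layout and values are assumed.
pseudo <- data.frame(
  gene    = c("G1", "G1", "G1", "G2", "G2", "G2"),
  sample  = c("S1", "S2", "S3", "S1", "S2", "S3"),
  true    = c(1, 2, 3, 1, 2, 3),
  imputed = c(1.1, 2.1, 3.1, 3, 2, 1)
)

# Assumed score: correlation between true and imputed values within each gene.
score <- vapply(
  split(pseudo, pseudo$gene),
  function(d) cor(d$true, d$imputed),
  numeric(1)
)

print(score)
```

Under this assumed measure, a gene whose imputed values track the true values gets a score near 1, while a gene whose imputed values move in the opposite direction gets a score near -1.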
data(datapnnl)
data<-datapnnl.rm.ref[1:100,1:21]
impute<- DreamAI2_Bagging(data=data,k=10,maxiter_MF = 10, ntree = 100,maxnodes = NULL,maxiter_ADMIN=30,tol=10^(-2),gamma_ADMIN=NA,gamma=50,CV=FALSE,fillmethod="row_mean",maxiter_RegImpute=10,conv_nrmse = 1e-6,iter_SpectroFM=40,m_mice = 1, method_mice = 'pmm', maxit_mice = 20,
method=c("KNN","MissForest","ADMIN","Birnn","SpectroFM","RegImpute","MICE"),SamplesPerBatch=3,n.bag=2,save.out=TRUE,path="C:\\Users\\chowds14\\Desktop\\test_package\\",ProcessNum=1)
final.out<-bag.summary(method=c("KNN"),nNodes=2,path="C:\\Users\\chowds14\\Desktop\\test_package\\")
final.out$score
final.out$imputed_data