CBDA

Compressive Big Data Analytics (CBDA)

Overview

The theoretical foundations of Big Data Science are not yet fully developed. The CBDA project investigates a new Big Data theory for high-throughput analytics and model-free inference. Specifically, we explore the core principles of distribution-free and model-agnostic methods for scientific inference based on Big Data sets.

Compressive Big Data Analytics (CBDA) iteratively generates random (sub)samples from the Big Data collection, uses established techniques to develop model-based or non-parametric inference, repeats the (re)sampling and inference steps many times, and finally uses bootstrapping techniques to quantify probabilities, estimate likelihoods, or assess the accuracy of findings. The CBDA approach may provide a scalable solution that avoids some of the Big Data management and analytics challenges. CBDA sampling is conducted at the data-element level, not at the case level, and the sampled values are not necessarily consistent across all data elements (e.g., high-throughput random sampling from cases and variables within cases).

An alternative approach is to use Bayesian methods to investigate the theoretical properties (e.g., asymptotics as sample sizes increase to infinity while the data remain sparse) of model-free inference based entirely on the complete dataset, without any parametric or model-limited restrictions.
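The subsample, fit, repeat, and aggregate loop described above can be illustrated with a minimal base-R sketch. This is not the CBDA package implementation: the simulated data, the sampling fractions `csr` and `fsr`, the use of `lm()` as the learner, and the "top 10% of subsamples" ranking rule are all assumptions chosen only to make the idea concrete.

```r
# Minimal sketch of a subsample-fit-repeat-aggregate loop (base R only).
# Every name and setting here is an illustrative assumption, not the CBDA API.
set.seed(1)

# Simulated data: 500 cases, 20 features; only features 1-3 drive the outcome.
n <- 500
p <- 20
X <- matrix(rnorm(n * p), nrow = n, dimnames = list(NULL, paste0("x", 1:p)))
y <- 2 * X[, 1] - 1.5 * X[, 2] + X[, 3] + rnorm(n)

M   <- 1000  # number of random subsamples
csr <- 0.30  # fraction of cases drawn per subsample (assumed value)
fsr <- 0.25  # fraction of features drawn per subsample (assumed value)

results <- vector("list", M)
for (m in seq_len(M)) {
  rows <- sample(n, size = floor(csr * n))   # random subset of cases
  cols <- sample(p, size = floor(fsr * p))   # random subset of features
  train <- data.frame(y = y[rows], X[rows, cols, drop = FALSE])
  fit <- lm(y ~ ., data = train)             # stand-in for any established learner
  test <- data.frame(X[-rows, cols, drop = FALSE])
  mse <- mean((y[-rows] - predict(fit, newdata = test))^2)  # held-out error
  results[[m]] <- list(cols = cols, mse = mse)
}

# Keep the best-performing 10% of subsamples and count how often
# each feature appears among them.
mse_all <- vapply(results, function(r) r$mse, numeric(1))
top <- order(mse_all)[seq_len(ceiling(0.1 * M))]
feature_counts <- table(factor(unlist(lapply(results[top], function(r) r$cols)),
                               levels = 1:p))
top_features <- as.integer(names(sort(feature_counts, decreasing = TRUE))[1:3])
print(sort(top_features))
```

Each iteration sees only a small slice of the data, so the iterations are independent and could be run in parallel. Features that truly carry signal should appear more often in the low-error subsamples than noise features do.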

This project investigates the parallels between the (established) field of compressive sensing (CS) for signal representation, reconstruction, recovery, and data denoising, and the (new) field of big data analytics and inference. Ultimately, the project will develop the foundational principles of scientific inference based on compressive big data analytics. We investigate various methods for efficient data aggregation, compressive analytics, scientific inference, interactive services for data interrogation (including high-dimensional data visualization), and the integration of research and education. Specific applications include neuroimaging-genetics studies of Alzheimer’s disease, predictive modeling of cancer treatment outcomes, and high-throughput data analytics using graphical pipeline workflows.

A manuscript entitled "Controlled Feature Selection and Compressive Big Data Analytics: Applications to Big Biomedical and Health Studies" has been published in PLOS ONE Bioinformatics.

R package

The CBDA protocol has been developed in the R environment; see the CBDA R package download site on CRAN for the latest R version (currently R-3.5.1). Because the protocol needs a large number of smaller training sets to converge, we created a workflow that runs on the LONI pipeline environment, a free platform for high-performance computing that allows the simultaneous submission of hundreds of independent instances/jobs of the CBDA protocol. The methods, software, workflows, datasets, and protocols developed here are publicly accessible and openly shared in our first CBDA release.

The source code to run the CBDA protocol is at source1.zip or at source2.zip.

The CBDA protocol steps are illustrated in Figure.

Installation

Version 1.0.0 of the CBDA package can be downloaded and installed with the following command:

install.packages("CBDA",repos = 'https://cran.r-project.org/')

The historical CBDA statistics (since publication on CRAN on April 16, 2018) are shown in the figure below.

A comparison with some other similar packages for the month of November 2018 is shown below.

Vignettes

The documentation and vignettes, as well as the source and binary files, can be found on CRAN. The binary and source files for the CBDA R package can also be downloaded from our GitHub repository and installed via the following commands.

# Installation from the Windows binary (recommended for Windows systems)
install.packages("/filepath/CBDA_1.0.0.zip", repos = NULL, type = "win.binary") 
# Installation from the source (recommended for Macs and Linux systems)
install.packages("/filepath/CBDA_1.0.0.tar.gz", repos = NULL, type = "source")

The packages needed to run the CBDA algorithm are installed automatically when the package is installed. They can also be installed and attached by calling the CBDA_initialization() function. If the install parameter is set to TRUE (the default is FALSE), CBDA_initialization() installs (if needed) and attaches all the packages necessary to run the CBDA package v1.0.0. This function can be run before any production run or test. The list of packages can be customized to include extra packages needed for an expanded SL.library or for other user needs. The output is a table (see the figure below) that displays TRUE or FALSE for each package, so that any package marked FALSE can be followed up individually.
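To make the TRUE/FALSE status table concrete, here is a small stand-alone sketch of an "install if needed, then report" helper. The function `check_packages()` and its arguments are hypothetical and are not taken from the CBDA source; unlike the behavior described for CBDA_initialization(), this sketch only checks that each package can be loaded and does not attach it.

```r
# Hypothetical helper (NOT the CBDA source): optionally install missing
# packages, then report which ones can be loaded.
check_packages <- function(pkg, install = FALSE) {
  # Packages whose namespace cannot currently be loaded.
  missing_pkg <- pkg[!vapply(pkg, requireNamespace, logical(1), quietly = TRUE)]
  if (install && length(missing_pkg) > 0) {
    install.packages(missing_pkg, repos = "https://cran.r-project.org/")
  }
  # TRUE means the package is installed and its namespace loads.
  status <- vapply(pkg, requireNamespace, logical(1), quietly = TRUE)
  data.frame(package = pkg, available = unname(status), row.names = NULL)
}

# With install = FALSE nothing is downloaded; the made-up package name
# is included only to show what a FALSE row looks like.
status_table <- check_packages(c("stats", "utils", "aPackageThatDoesNotExist"))
print(status_table)
```

Leaving `install = FALSE` as the default keeps the check free of side effects, which is convenient on shared or offline systems where installation needs to be a deliberate step.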

N.B.: to eliminate a warning on Windows caused by the "doMC" package not being available (it was intended for Mac), install "doMC" with the following command:

install.packages("doMC", repos = "http://R-Forge.R-project.org")

Figure: package status table (ipaktable).

Acknowledgments

This work is supported in part by NIH grants U54 EB020406, P20 NR015331, P50 NS091856, P30 DK089503, P30 AG053760, UL1 TR002240, and NSF grants 1734853, 1636840, 1416953, 0716055, and 1023115. Students, trainees, scholars, and researchers from SOCR, BDDS, MNORC, MIDAS, MADC, MICHR, and the broad R-statistical computing community have contributed ideas, code, and support.

References