This repo contains supporting documents for the manuscript "Defining the AIM: An Abstraction for Improving Machine Learning Prediction"

Project Description:

We present the code, data, and supplementary figures and documents used in preparing the manuscript "Defining the AIM: An Abstraction for Improving Machine Learning Prediction". Focusing on the well-known ALL/AML dataset [1], we illustrate the need for an abstraction that describes Machine Learning pipelines in order to facilitate the comparison, improvement, and study of ML results. We define an abstraction layer for leaderboard-style competitions to improve ML results.

Repository Contents:

  • LiteratureSearch folder:
    This folder contains two notebooks, one giving the results of our literature analysis (LiteratureSearchResults.ipynb) and the other presenting ML pipelines for the articles (SummaryofMLpipelines.ipynb).
  • ReproducingMLpipelines folder:
    This folder contains 12 notebooks: 5 covering the articles we studied in depth (one per article), 5 for the comparison of the articles' methods (Table 1 in the manuscript), and 2 for comparison summaries. The folder also includes the intermediate .Rdata file we created.
  • See the ReproducingMLpipeline example folder for reproducible containers (Singularity and Docker) to run the pipeline.

Data and Associated Repos:

  • Data in the Golub et al. paper [1]: the datasets used in [1], consisting of a training dataset (38 by 7129) and a testing dataset (34 by 7129).
  • Data Version 2: the leukemia data in the R package spikeslab (72 by 3571). We have shown that this data is a transformed version of the original data.
  • Data Version 3: the 'golub' data in the R package multtest, in which 'golub' is the training dataset (38 by 3051) and 'golub.cl' is the test dataset (34 by 3051). We have also shown that this data is another transformed version of the original data.
  • We use the data in [1] (also here and in the LiteratureSearch folder) to reproduce results in the papers.
  • Associated Repos
    Previous work
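
The dataset dimensions listed above can be sanity-checked after loading any of the data versions. The following is a minimal sketch, not code from this repository: the dictionary labels and the helper functions `check_shape` and `total_samples` are hypothetical names introduced for illustration, and the shapes are copied from the list above as (samples, variables) pairs.

```python
# Expected (samples, variables) shapes, copied from the README's data list.
# The labels are hypothetical; they are not file or object names from the repo.
EXPECTED_SHAPES = {
    "golub_train": (38, 7129),
    "golub_test": (34, 7129),
    "spikeslab_leukemia": (72, 3571),
    "multtest_train": (38, 3051),
    "multtest_test": (34, 3051),
}


def check_shape(label, n_samples, n_variables):
    """Return True if the given shape matches the one listed for `label`."""
    return EXPECTED_SHAPES[label] == (n_samples, n_variables)


def total_samples(*labels):
    """Sum the sample counts of the named datasets."""
    return sum(EXPECTED_SHAPES[label][0] for label in labels)


if __name__ == "__main__":
    # A loaded training table with 38 rows and 7129 columns passes the check.
    print(check_shape("golub_train", 38, 7129))
    # Training plus testing samples should equal the spikeslab sample count.
    combined = total_samples("golub_train", "golub_test")
    print(combined == EXPECTED_SHAPES["spikeslab_leukemia"][0])
```

A check like this only compares dimensions. It says nothing about how the values in one version were transformed relative to the original data.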

If you have any questions, please contact us at vcs@stodden.net or xwu64@illinois.edu.