SPARC-FAIR-Codeathon/KnowMore

Say "no more" to tedious manual discovery across SPARC datasets


Read manuscript draft · Report Issue

About

This is the repository of Team KnowMore (Team #6) at the 2021 SPARC Codeathon. Click here to find out more about the NIH SPARC (Stimulating Peripheral Activity to Relieve Conditions) Program. Click here to find out more about the SPARC Codeathon 2021. Check out the Team section of this page to find out more about our team members.

No work was done on this project prior to the Codeathon. Only the high-level idea of a knowledge discovery tool for SPARC datasets was submitted as a project suggestion for the Codeathon. The detailed outline of the project (what knowledge? what level of automation? what use case?) was defined jointly by the team after the kick-off of the Codeathon. The development of all the code and documentation was initiated subsequently.

The Problem

The NIH SPARC program seeks to accelerate the development of therapeutic devices that modulate electrical activity in nerves to improve organ function. All the datasets generated by SPARC-funded projects are curated and shared according to strict guidelines that ensure these datasets are Findable, Accessible, Interoperable, and Reusable (FAIR). Specifically, the datasets are organized and annotated following the SPARC Data Structure, shared publicly on the Pennsieve Discover platform, and easily accessible via the SPARC Data Portal (sparc.science). The SPARC program thus provides a wealth of openly available and well-curated datasets that span multiple organs, species, and data types and have the potential to lead to new discoveries. This is great! But...

While it is very easy to learn about the findings of a single dataset on the portal by looking at the dataset summary page (e.g., see here), there is currently no easy way to rapidly analyze multiple SPARC datasets together. Typically, a user of the portal searches for datasets through the search feature and identifies datasets of interest from the list of results. To find relations across those datasets, the user then has to work manually: read the description of each dataset on its summary page, go through the protocols, browse the files that are accessible from the browser, etc. For a deeper investigation and cross-analysis of the data, the user then has to download each dataset before analyzing it further on their computer (payment may be required for downloading large datasets according to AWS pricing). This tedious process urgently needs to be improved to enable rapid discoveries from SPARC datasets, which would 1) enhance the speed of innovation in the neuromodulation field and 2) elevate the impact of the SPARC program.

Our Solution - KnowMore

To address this problem, we have developed a tool called KnowMore. KnowMore is an automated knowledge discovery tool integrated within the SPARC Portal that allows users of the portal to visualize, in just a few clicks, potential similarities, differences, and connections between multiple SPARC datasets of their choice. This simple process for using KnowMore is illustrated in the figure below.


knowmore-usage

Illustration of the simple user-side workflow of KnowMore. Note that the tool is not currently integrated in the official SPARC Portal but is accessible through our own deployed prototype. See the "Using KnowMore" section for details.


The output of KnowMore consists of multiple interactive visualization items displayed to the user such that they can progressively gain knowledge on the potential similarities, differences, and relations across the datasets. This output is intended to provide foundational information to the user such that they can rapidly make novel discoveries from SPARC datasets, generate new hypotheses, or simply decide on their next step (assess each dataset individually on the portal, download and analyze the datasets further, remove/add datasets to their analysis pool, etc.). A list of the visualization items is provided in the table below, along with the potential knowledge that could be gained from each of them.

| Visualization item | Knowledge gained across the datasets | Raw data used for generating the visualization and how it was obtained | Status |
| --- | --- | --- | --- |
| Knowledge Graph | High-level connections (authors, institutions, funding organizations, etc.) | Dataset metadata from the Pennsieve API and the SciCrunch Elasticsearch API | |
| Summary Table | Similarities/differences in the study design | Dataset metadata.json file from the Pennsieve API | |
| Common Keywords | Common themes | Dataset metadata.json file and all dataset text files from the Pennsieve API, protocol text from the protocols.io API | |
| Abstract | Common study design and findings | Dataset metadata.json file and all dataset text files from the Pennsieve API, protocol text from the protocols.io API | |
| Data Plots | Comparison between measured numerical data (if any) | MAT files in the derivative folder of the datasets, from the Pennsieve API | |
| Image Clustering | Comparison between image data (if any) | Image files associated with the datasets, from the Biolucida API | |

Table listing the visualization items automatically generated by KnowMore along with their status (✅ = available, ❌ = not fully ready)
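To make the "Common Keywords" row more concrete, here is a minimal sketch of one way common themes could be surfaced: count how many datasets each word appears in and keep the words shared by at least two. This is not the NLP workflow KnowMore actually uses (that is described in the manuscript draft). The function names, the tiny stop-word list, and the sample texts are all hypothetical and only serve as an illustration.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption); a real pipeline would use a fuller one.
STOP_WORDS = {"the", "of", "and", "in", "a", "to", "on", "with", "for", "is", "was", "were"}


def tokenize(text):
    """Lowercase the text and return its set of distinct words, minus stop words and very short words."""
    words = re.findall(r"[a-z]+", text.lower())
    return {w for w in words if w not in STOP_WORDS and len(w) > 2}


def common_keywords(texts_by_dataset, min_datasets=2):
    """Map each keyword found in at least `min_datasets` datasets to the number of datasets containing it."""
    counts = Counter()
    for text in texts_by_dataset.values():
        # Each dataset contributes a word at most once, so counts are per dataset.
        counts.update(tokenize(text))
    return {word: n for word, n in counts.items() if n >= min_datasets}


# Made-up placeholder texts, not taken from real SPARC datasets.
sample_texts = {
    "dataset_a": "Vagus nerve stimulation in the rat stomach",
    "dataset_b": "Recording of vagus nerve activity in the pig stomach",
    "dataset_c": "Colon motility in the rat",
}
keywords = common_keywords(sample_texts)
print(sorted(keywords))
```

Counting per dataset rather than per occurrence keeps one long protocol text from dominating the result, which seems a reasonable choice when the goal is to find themes shared across datasets.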


A sample output is presented in the figure below.


coming-soon...
Sample output from KnowMore. It consists of interactive text and plots displayed to the user.


Under the hood, KnowMore uses several Machine Learning and Data Science workflows to output the above-mentioned elements, including Natural Language Processing (NLP), Image clustering, and Data Correlation. Details about these are discussed in the draft of our manuscript associated with this project.

Workflow

The overall workflow of KnowMore is shown in the figure below. Our architecture consists of three main blocks that can all run independently:

  1. The front end of our app is based on a fork of the sparc-app (i.e. the front-end of sparc.science) where we have integrated additional UI elements and back-end logic for KnowMore. Learn more about the sparc app.
  2. The back-end consists of a Flask application that listens to front-end requests and launches the data processing jobs.
  3. Data processing and result generation are done by MATLAB code (for 'MAT' data files) and Python code (for all other data types) that run on osparc, the SPARC-supported cloud computing platform. Learn more about osparc.

knowmore-workflow
Illustration of the overall technical workflow of KnowMore. The red rectangles highlight the major code blocks of KnowMore that have been developed during this Codeathon.
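The split between the MATLAB and Python processing jobs described above can be sketched as a simple routing step. The example below is a hypothetical illustration, not the code of the KnowMore back-end: the function name `route_files`, the job labels, and the file paths are assumptions, and the actual submission of jobs to osparc is left out.

```python
from pathlib import PurePosixPath


def route_files(file_paths):
    """Split dataset files into a MATLAB batch ('.mat' files) and a Python batch (everything else)."""
    jobs = {"matlab": [], "python": []}
    for path in file_paths:
        # Compare the extension case-insensitively so 'data.MAT' is also routed to MATLAB.
        suffix = PurePosixPath(path).suffix.lower()
        target = "matlab" if suffix == ".mat" else "python"
        jobs[target].append(path)
    return jobs


# Hypothetical file paths used only for illustration.
example_files = [
    "derivative/sub-1/recording.mat",
    "primary/notes.txt",
    "derivative/summary.MAT",
    "docs/readme.md",
]
jobs = route_files(example_files)
print(jobs)
```

In a setup like this, the back-end would hand each batch to the corresponding processing job once the routing is done.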


This design was motivated by our aim of making KnowMore ready for integration into the SPARC Data Portal:

  • Integrating the front-end of KnowMore would only require merging our fork of the sparc-app into the main sparc-app branch.
  • The back-end of the sparc-app, the sparc-api, is built with Flask, so the KnowMore back-end would be readily compatible.
  • The data processing jobs are designed to run on osparc, the SPARC-supported cloud computing platform, and would not require any integration work since our back-end handles communication with osparc.

Moreover, each of the three main elements of KnowMore is fully independent. While the front-end will not be of much use on its own, having the back-end fully interoperable is very valuable because our Flask application can be connected to any other front-end if needed (another analysis tool, website, software, etc.). The data processing and results generation jobs are also independent, so they can be used directly to get the visualization items, for instance in a Jupyter Notebook. Note that the data for the Knowledge Graph is obtained from Pennsieve/SciCrunch on the front-end for efficiency, but the same results can be generated in the back-end as well.
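As an illustration of how Knowledge Graph data could be generated outside the front-end, for instance in a notebook, the sketch below links datasets that share an author or institution. It assumes the metadata has already been fetched and reduced to plain dictionaries. The field names, function names, and sample records are hypothetical and do not reflect the actual Pennsieve or SciCrunch response format.

```python
from collections import defaultdict
from itertools import combinations


def build_graph(metadata_by_dataset, fields=("authors", "institutions")):
    """Map each (field, value) entity to the set of datasets that mention it."""
    entity_to_datasets = defaultdict(set)
    for dataset_id, metadata in metadata_by_dataset.items():
        for field in fields:
            for value in metadata.get(field, []):
                entity_to_datasets[(field, value)].add(dataset_id)
    return entity_to_datasets


def shared_links(entity_to_datasets):
    """Return, for each pair of datasets, the sorted list of entities they have in common."""
    links = defaultdict(list)
    for entity, datasets in entity_to_datasets.items():
        # Every pair of datasets mentioning the same entity gets an edge labeled with it.
        for pair in combinations(sorted(datasets), 2):
            links[pair].append(entity)
    return {pair: sorted(entities) for pair, entities in links.items()}


# Made-up placeholder metadata, not real SPARC records.
sample_metadata = {
    "dataset_a": {"authors": ["Author One", "Author Two"], "institutions": ["Institute X"]},
    "dataset_b": {"authors": ["Author Two"], "institutions": ["Institute X"]},
    "dataset_c": {"authors": ["Author Three"], "institutions": ["Institute Y"]},
}
links = shared_links(build_graph(sample_metadata))
print(links)
```

Each key of `links` is an edge between two datasets, and its value lists the shared entities that justify the edge; datasets with nothing in common, like `dataset_c` here, simply have no edge.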

Use case

Our development and testing revolved around these three datasets:

They were selected for their common theme, with the aim of making some interesting and meaningful discoveries during the Codeathon. Our discoveries from these datasets are discussed in the draft of our manuscript initiated during the Codeathon. Our results can be reproduced by selecting these datasets when using KnowMore (see next section). We would like to emphasize that our tool is not specifically designed around these datasets and is intended to work with any user-selected datasets. Only the Data Plots are limited to these datasets, because we found that tabular data are structured very differently from dataset to dataset, which limited the automatic generation of meaningful plots. Our suggestion to SPARC is to focus on standardizing tabular data in the next update to the SPARC Data Structure to increase the interoperability of SPARC datasets. Recommendations to achieve that are discussed in the draft of our manuscript as well as in the "Recommendation from the KnowMore Team to increase FAIRness of SPARC datasets" document we have submitted to SPARC.

Using KnowMore

You can test the current prototype of KnowMore directly on our fork of the SPARC Data Portal: https://sparc-know-more.herokuapp.com/sparc-app/

Follow the steps described below:

  1. Find datasets of interest and click on the "Add to KnowMore" button, visible in the search list and on the dataset page, to add each of them to the KnowMore analysis
  2. Go to the KnowMore tab, and check that all your selected datasets are listed
  3. Click on "Discover" to initiate the automated discovery process
  4. Wait until the results are displayed

Using the Source Code

Prerequisites

We recommend using Anaconda to create and manage your development environments for KnowMore. All the subsequent instructions are provided assuming you are using Anaconda (Python 3 version).

Running the app

To use the source code, clone or download the repository first.

git clone https://github.com/SPARC-FAIR-Codeathon/KnowMore.git --recurse

Then run our back-end flask-server following the documentation available here.

Finally, run our fork of the sparc-app by following the documentation available here.

Our back-end data processing consists of MATLAB code for handling 'MAT' files (when found in a dataset) and Python code for processing all other data types. Both run on osparc and, as such, do not require any local setup. The documentation for using or editing them is, however, available here if needed.

Reporting issues and contributing

To report an issue or suggest a new feature, use the issue page. Check existing issues before submitting a new one.

Fork this repository and submit a pull request to contribute. Before doing so, please read our Code of Conduct and Contributing Guidelines.

Cite us

If you use KnowMore to make new discoveries or use the source code, please cite us as follows:

Ryan Quey, Matthew Schiefer, Anmol Kiran, Bhavesh Patel (2021). KnowMore: v1.0.0 - Automated Knowledge Discovery Tool for SPARC Datasets. 
Zenodo. https://doi.org/10.5281/zenodo.5137255.

FAIR practices

Given that this Codeathon revolved around FAIR data, we deemed it suitable to ensure that the development of KnowMore is also FAIR. Accordingly, we have assessed the FAIRness of KnowMore against the FAIR Principles established for research software. The details are available in this document.

License

KnowMore is fully Open Source and distributed under the very permissive MIT License. See LICENSE for more information.

Team

Acknowledgements

We would like to thank the organizers of the 2021 SPARC Codeathon and thank the SPARC Data Resource Center (DRC) teams for their guidance and help during this Codeathon.