CoronaWhy Common Research and Data Infrastructure

What is CoronaWhy?

CoronaWhy.org is a global volunteer organization dedicated to driving actionable insights into significant world issues using industry-leading data science, artificial intelligence and knowledge sharing. CoronaWhy was founded during the 2020 COVID-19 crisis, following a White House call to help extract valuable data from more than 50,000 coronavirus-related scholarly articles, dating back decades. Currently at over 1000 volunteers, CoronaWhy is composed of data scientists, doctors, epidemiologists, students, and various subject matter experts on everything from technology and engineering to communications and program management.

What has CoronaWhy produced so far?

Read about our creations before you start.

CoronaWhy infrastructure setup

The infrastructure can be set up locally and exposed as a number of CoronaWhy services using Traefik.

You need to set the value of "traefikhost" before you start deploying the infrastructure:

export traefikhost=lab.coronawhy.org

or, for a local deployment,

export traefikhost=localhost

Download all CoronaWhy notebooks:

./build-coronawhy-infra.sh

Then simply run:

docker-compose up

After that, the following CoronaWhy services are exposed:

Airflow http://airflow.apps.coronawhy.org (takes some time to launch)
Whoami http://whoami.apps.coronawhy.org (simple webserver returning host stats)
CoronaWhy API http://api.apps.coronawhy.org (FastAPI with Swagger)
Elasticsearch http://es.apps.coronawhy.org
SPARQL http://sparql.apps.coronawhy.org (Virtuoso as a service)
INDRA http://indra.apps.coronawhy.org (INDRA REST API https://indra.readthedocs.io/en/latest/rest_api.html)
Grlc http://grlc.apps.coronawhy.org (converts SPARQL queries into RESTful APIs)
Doccano http://doccano.apps.coronawhy.org
Jupyter http://jupyter.apps.coronawhy.org (look for token in the logs)
Portainer http://portainer.apps.coronawhy.org
Traefik dashboard http://apps.coronawhy.org:8080 (insecure setup)
Kibana http://kibana.apps.coronawhy.org

Warning: in this example all infrastructure components are deployed on *.apps.coronawhy.org. Depending on the value of "traefikhost", you should get a local deployment on *.localhost (doccano.localhost, etc.) or a deployment on *.lab.coronawhy.org.
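
The hostname pattern described in the warning can be sketched as a small shell helper. This is only an illustration: it assumes every service is exposed as http://<service>.<traefikhost>, as the doccano.localhost example suggests, and the function name service_url is hypothetical and not part of the repository.

```shell
#!/bin/sh
# Hypothetical helper: derive a service URL from the traefikhost variable.
# Assumes the pattern http://<service>.<traefikhost> (e.g. doccano.localhost).
service_url() {
    # $1 = service name, $2 = optional host
    # (defaults to $traefikhost, then to localhost)
    printf 'http://%s.%s\n' "$1" "${2:-${traefikhost:-localhost}}"
}

# Print the expected address of a few services for the current setting.
for svc in airflow api es doccano jupyter; do
    service_url "$svc"
done
```

Running it after "export traefikhost=localhost" prints the local addresses to try once docker-compose has started.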

CoronaWhy datasets

The CoronaWhy community is building an infrastructure for Open Science that can be distributed, scaled up in the future, and reused for other important tasks such as cancer research. The community's vision is to build it entirely from open source components, to publish all data in a FAIR way, and to keep all available provenance information.

We're using Harvard Data Commons as a foundation that allows all CoronaWhy members to work together. We're building different services and running experimental Labs, while our data infrastructure is common and reusable: a place where all research groups share the same resources. It is built on top of the Dataverse data repository developed by Harvard University and is available at datasets.coronawhy.org.

CoronaWhy also maintains various APIs that produce aggregated COVID-19 datasets. You can access the data by querying the CoronaWhy Data API with a country code, for example FRA for France: http://api.apps.coronawhy.org/country/FRA
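
A minimal sketch of such a query from the command line follows. The helper names country_url and fetch_country are hypothetical, and upper-casing the code is an assumption made to match the FRA example; the response format is not documented here, so the output is simply saved to a file.

```shell
#!/bin/sh
# Base URL taken from the example above; override API_BASE for a local deployment.
API_BASE="${API_BASE:-http://api.apps.coronawhy.org}"

# Hypothetical helper: build the request URL for a three-letter country code.
country_url() {
    # Upper-case the code so that "fra" and "FRA" give the same URL (assumption).
    code=$(printf '%s' "$1" | tr '[:lower:]' '[:upper:]')
    printf '%s/country/%s\n' "$API_BASE" "$code"
}

# Hypothetical helper: download the data for one country into <CODE>.json.
# -f makes curl fail on HTTP errors; -sS hides progress but keeps error messages.
fetch_country() {
    url=$(country_url "$1")
    curl -fsS "$url" -o "${url##*/}.json"
}

# Usage: fetch_country FRA
country_url FRA
```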

CoronaWhy dashboards

Task-Risk helps to identify risk factors that can increase the chance of being infected, or that affect the severity or the survival outcome of the infection
Task-Ties explores transmission, incubation and environmental stability
Named Entity Recognition across the entire corpus of CORD-19 papers with full text
Match Clinical Trials allows exploration of the results from the COVID-19 International Clinical Trials dataset
COVID-19 Literature Visualization helps to explore the data behind the AI-powered literature review

More detailed information about every dashboard is published on Kaggle.

CORD-19 preprocessing pipeline

Download the COVID-19 Open Research Dataset Challenge (CORD-19) data from Kaggle:

bash ./download_dataset.sh

Start the NLP pipeline manually by executing

docker run -v /data/distrib/covid-19-infrastructure/data/original:/data -it coronawhy/pipeline /bin/bash

or automatically with

docker-compose -f ./docker-compose-pipeline.yml up
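
Before starting the pipeline manually it can help to check that the mounted directory actually contains data. The sketch below assumes the dataset lives in the host path used in the docker run command above; the DATA_DIR variable and both helper functions are hypothetical. It prints the command instead of running it.

```shell
#!/bin/sh
# Host directory with the dataset; the default is the path from the command above.
DATA_DIR="${DATA_DIR:-/data/distrib/covid-19-infrastructure/data/original}"

# Hypothetical helper: print the docker command for a given data directory.
pipeline_cmd() {
    printf 'docker run -v %s:/data -it coronawhy/pipeline /bin/bash\n' "$1"
}

# Hypothetical helper: succeed only if the directory exists and is not empty.
check_data_dir() {
    [ -d "$1" ] && [ -n "$(ls -A "$1" 2>/dev/null)" ]
}

if check_data_dir "$DATA_DIR"; then
    pipeline_cmd "$DATA_DIR"
else
    echo "No data in $DATA_DIR; run ./download_dataset.sh first" >&2
fi
```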

Follow all updates on our YouTube channel and the CoronaWhy GitHub

Getting Started with CoronaWhy Common infrastructure

How to access Elasticsearch and Dataverse, notebook

CoronaWhy Elasticsearch Tutorial notebook

How to Create Knowledge Graph, notebook

Dataverse Colab Connect, notebook

GitHub dataset sync with Dataverse, notebook

CoronaWhy Services

You can connect your notebooks to a number of the services listed below. All services coming from CoronaWhy Labs have experimental status. Join the fight against COVID-19 if you want to help us!

Data repository

Dataverse is deployed as a data service at https://datasets.coronawhy.org. Dataverse is an open source web application to share, preserve, cite, explore, and analyze research data. It facilitates making data available to others.

Elasticsearch

CoronaWhy Elasticsearch holds CORD-19 indexes at the sentence level and is available at CoronaWhy Search.

Available indexes:
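
The indexes present on a running instance can be listed with Elasticsearch's _cat/indices API. The sketch below assumes the endpoint http://es.apps.coronawhy.org from the service list in the setup section and that it is reachable from your machine; the helper names are hypothetical.

```shell
#!/bin/sh
# Endpoint taken from the service list; adjust ES_URL for a local deployment.
ES_URL="${ES_URL:-http://es.apps.coronawhy.org}"

# Hypothetical helper: URL of the _cat/indices API with column headers (v) enabled.
es_indices_url() {
    printf '%s/_cat/indices?v\n' "${1:-$ES_URL}"
}

# Hypothetical helper: list the indexes of the given (or default) endpoint.
list_indices() {
    curl -fsS "$(es_indices_url "${1:-}")"
}

# Usage: list_indices                      # default endpoint
#        list_indices http://es.localhost  # local deployment
es_indices_url
```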

MongoDB

The MongoDB service is deployed on mongodb.coronawhy.org and is available from CoronaWhy Labs Virtual Machines. Please contact our administrators if you want to use it.

Hypothesis

Our Hypothesis annotation service is running on hypothesis.labs.coronawhy.org and allows manual annotation of CORD-19 papers. Please try our Hypothesis Demo if you're interested.

OpenLink Virtuoso triplestore

We provide Virtuoso as a service with a public SPARQL endpoint. It offers an HTTP-based query service that operates on Entity Relationship Types (Relations) represented as RDF sentence collections, using the SPARQL Query Language. https://virtuoso.openlinksw.com

You can run a simple SPARQL query to get an overview of the triples in the CoronaWhy Knowledge Graph.
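
One way to send such a query from the command line is sketched below. The endpoint URL is an assumption: it combines the SPARQL host from the service list with /sparql, Virtuoso's usual default path. The query is a generic one that returns ten arbitrary triples, and the helper name run_sparql is hypothetical.

```shell
#!/bin/sh
# Assumption: the endpoint is served under /sparql, Virtuoso's usual default path.
SPARQL_ENDPOINT="${SPARQL_ENDPOINT:-http://sparql.apps.coronawhy.org/sparql}"

# A generic query returning ten arbitrary triples.
QUERY='SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 10'

# Hypothetical helper: send a query and ask for SPARQL JSON results.
# -G turns the URL-encoded data into a GET query string.
run_sparql() {
    curl -fsS -G "$SPARQL_ENDPOINT" \
        -H 'Accept: application/sparql-results+json' \
        --data-urlencode "query=$1"
}

# Usage: run_sparql "$QUERY" > triples.json
printf '%s\n' "$QUERY"
```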

Kibana

Kibana is deployed as a community service connected to CoronaWhy Elasticsearch at https://kibana.labs.coronawhy.org. It lets you visualize Elasticsearch data and navigate the Elastic Stack, so you can do anything from tracking query load to understanding the way requests flow through your apps. https://www.elastic.co/kibana

BEL

BEL Commons 3.0 is available as a service at https://bel.labs.coronawhy.org

BEL Commons is an environment for curating, validating, and exploring knowledge assemblies encoded in Biological Expression Language (BEL) to support elucidating disease-specific, mechanistic insight.

You can watch the introduction video and read the Corona BEL Tutorial if you want to know more.

INDRA

INDRA is deployed as a service at https://indra.labs.coronawhy.org/indra.

INDRA (Integrated Network and Dynamical Reasoning Assembler) generates executable models of pathway dynamics from natural language (using the TRIPS and REACH parsers), and from BioPAX and BEL sources (including the Pathway Commons database and NDEx).

You can quickly test the service by running:

curl -X POST "https://indra.labs.coronawhy.org/bel/process_pybel_neighborhood" -H "accept: application/json" -H "content-type: application/json" -d "{ \"genes\": [ \"MAP2K1\" ]}" -l -o test_coronawhy_map2k1.json
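
To query several genes without hand-escaping the JSON, the request body can be generated. The helper below is a hypothetical sketch: it assumes the endpoint accepts more than one entry in the "genes" list, as the list syntax in the command above suggests, and that gene symbols contain no quotes or backslashes.

```shell
#!/bin/sh
# Hypothetical helper: build the {"genes": [...]} payload from gene symbols.
genes_payload() {
    list=''
    for gene in "$@"; do
        # Put a comma before every entry except the first.
        list="${list:+$list, }\"$gene\""
    done
    printf '{ "genes": [ %s ] }\n' "$list"
}

# Usage with the request shown above:
#   curl -X POST "https://indra.labs.coronawhy.org/bel/process_pybel_neighborhood" \
#        -H "accept: application/json" -H "content-type: application/json" \
#        -d "$(genes_payload MAP2K1 MAPK1)" -o test_coronawhy_genes.json
genes_payload MAP2K1 MAPK1
```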

Geoparser

Geoparser is available as a service at https://geoparser.labs.coronawhy.org

The Geoparser is a software tool that can process information from any type of file, extract geographic coordinates, and visualize locations on a map. Users who are interested in seeing a geographical representation of information or data can choose to search for locations using the Geoparser, through a search index or by uploading files from their computer. https://github.com/nasa-jpl-memex/GeoParser

Tabula

Tabula allows you to extract data from PDF files into a CSV or Microsoft Excel spreadsheet using a simple, easy-to-use interface. We deployed it as a CoronaWhy service available to all community members. More information is available on the Tabula website.

Teamchatviz

We use Teamchatviz to explore how communication works in our distributed team and to learn how communication shapes culture in the CoronaWhy community. https://moovel.github.io/teamchatviz/

In progress

We are working on the deployment of a Neo4j graph database.

Articles produced by CoronaWhy people

I’m an AI researcher and here’s how I fight corona by Artur Kiulian

Exploration of Document Clustering with SPECTER Embeddings by Brandon Eychaner

COVID-19 Research Papers Geolocation by Ishan Sharma
