
Search


GA4GH Search is a standard for searching biomedical data developed by the Discovery Work Stream of the Global Alliance for Genomics & Health.



Summary

GA4GH Search is an API specification for a simple, uniform mechanism to publish, discover, query, and analyze biomedical data. It applies to any “rectangular” data that fits into rows and columns. The API has two principal components: a Tables API that exposes structured tabular data, and a Query API that supports SQL queries over that data. It is intentionally general-purpose and minimal. It does not prescribe a particular backend implementation or data model, and it supports federation by design.
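As a rough illustration of the Query API side, the sketch below builds (but does not send) an HTTP request carrying a SQL query. The `/search` path, the `{"query": ...}` JSON body, the host name, and the table name are assumptions made for this example; consult the full specification for the actual request shape.

```python
import json
import urllib.request


def build_query_request(base_url, sql):
    """Build a POST request that submits a SQL query.

    Assumption: the Query API accepts a JSON body of the form
    {"query": "<SQL>"} at the path /search under base_url.
    """
    body = json.dumps({"query": sql}).encode("utf-8")
    return urllib.request.Request(
        base_url.rstrip("/") + "/search",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical host and table name, for illustration only.
request = build_query_request(
    "https://search.example.org",
    "SELECT * FROM mytable LIMIT 10",
)
```

Passing the resulting object to `urllib.request.urlopen` would send it; that step is left out here so the sketch stays free of network access.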

Purpose and Motivation

Ever-advancing biomedical techniques, such as next-generation genome sequencing and imaging, are creating vast amounts of data. Every day, researchers and clinicians accumulate and analyze the world's exponentially growing volumes of genomic and clinical data. With data at this scale comes the challenge of exploring and finding it while interpreting the various available formats.

In this specification, we offer a simple, uniform mechanism to publish, discover, query, and analyze any format of biomedical data. There are thousands of ways data can be stored or moved over the network. Any “rectangular” data that fits into rows and columns can be represented via GA4GH Search. This gives all kinds of data a common means of access, regardless of how the data was collected.

Background

The GA4GH has previously developed two standards for discovery. Beacon is a standard for discovery of genomic variants, while Matchmaker is a standard for discovery of subjects with certain genomic and phenotypic features. Implementations of these standards have been linked into federated networks (Beacon Network and Matchmaker Exchange, respectively).

Each standard (and corresponding network) has been successful in its own right. It was acknowledged that it would be broadly useful to develop standards that abstracted common utilities for building searchable, federated networks for a variety of applications in genomics and health.

The Discovery Work Stream develops GA4GH Search as a general-purpose framework for building federatable search-based applications.

Intended Audience

The intended audience of this standard includes:

  • Data custodians looking to make their data discoverable and searchable, especially in a federated way.
  • Data consumers looking to discover and search data in an interoperable way, including outside of the genomics community.
  • Developers of applications, such as data explorers.
  • API developers within and outside GA4GH looking to incorporate search functionality in their APIs.
  • Data model developers within and outside of GA4GH looking to make their data models searchable and interoperable with other standards.

API Specification

You can view our Full Discovery Search Specification and our Open API 3 Specification.

To see example request/response pairs using this API specification, click here.

Use Cases

Sample use cases include:

Full summary of use cases can be found in USECASES.md.

Applications

Various applications can be built on top of GA4GH Search, such as

  • Data and metadata indexers
  • Query tools
  • Data federations
  • Concept cross-references
  • Parameters for batch workflows
  • Workflow result summaries
  • Patient matchmaking
  • (Most importantly) Things we haven’t yet imagined!

Out of scope

  • Developing data models. GA4GH Search does not define data models. It defers that effort to others in the GA4GH or outside implementers.
  • Application development. GA4GH Search does not prescribe a specific application. It is intentionally general-purpose. It defers to other efforts in the Discovery Work Stream, GA4GH, and beyond to build domain-specific applications.

Benefits

  • Simple, interoperable, uniform mechanism to publish, discover, query, and analyze biomedical data.
  • Flexibility. Works with any “rectangular” data that fits into rows and columns. Does not prescribe a data model and as such, allows custodians to make their data available without extensive ETL transformations.
  • Supports federation. Serves as a general-purpose framework for building federatable search-based applications across multiple implementations. Federations reference common schemas and properties.
  • Minimal by design. The API is purposely kept minimal so that the barriers to publishing existing data are as small as possible.
  • Backend agnostic. It is possible to implement the API across a large variety of backend datastores.
  • General purpose. Admits use cases that have not yet been thought of.

Implementations

Architecture of a GA4GH Search system:

Sample implementations:

Tables-in-a-bucket (no-code implementation)

The specification allows for a no-code implementation as a collection of files served statically (e.g. in a cloud bucket, or a Git repository). To do this, you need the following JSON files:

  • tables: served in response to GET /tables
  • table/{table_name}/info: served in response to GET /table/{table_name}/info. For example, a table named mytable should have a corresponding file table/mytable/info
  • table/{table_name}/data: served in response to GET /table/{table_name}/data. For example, a table named mytable should have a corresponding file table/mytable/data
  • table/{table_name}/data_{pageNumber}: additional pages of data, each linked from the next_page_url of the preceding page, starting with the first data file of the table (e.g. table/mytable/data).
  • table/{table_name}/data_models/{schemaFile}: Though not required, data models may be linked via $ref. Data models can also be stored as static JSON documents, and be referred to by relative or absolute URLs.
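The file layout above can be sketched as a small generator script. This is a minimal sketch, not a reference implementation: the JSON shapes (`tables`, `data_model`, `data`, and a `pagination` object holding `next_page_url`) and the function name `write_static_tables` are assumptions chosen for illustration, so check them against the specification before publishing real data.

```python
import json
from pathlib import Path


def write_static_tables(root, table_name, rows, data_model, page_size):
    """Write the static JSON files for one table under `root`.

    Returns the names of the data files written, in page order.
    """
    root = Path(root)
    table_dir = root / "table" / table_name
    table_dir.mkdir(parents=True, exist_ok=True)

    # File served for GET /tables (assumed shape: a list of table descriptors).
    tables_doc = {"tables": [{"name": table_name, "data_model": data_model}]}
    (root / "tables").write_text(json.dumps(tables_doc, indent=2))

    # File served for GET /table/{table_name}/info.
    info_doc = {"name": table_name, "data_model": data_model}
    (table_dir / "info").write_text(json.dumps(info_doc, indent=2))

    # Split rows into pages: the first page is "data", later ones "data_1", "data_2", ...
    pages = [rows[i:i + page_size] for i in range(0, len(rows), page_size)] or [[]]
    written = []
    for index, page in enumerate(pages):
        file_name = "data" if index == 0 else f"data_{index}"
        doc = {"data_model": data_model, "data": page}
        if index + 1 < len(pages):
            # Relative link to the next page file (assumed pagination shape).
            doc["pagination"] = {"next_page_url": f"data_{index + 1}"}
        (table_dir / file_name).write_text(json.dumps(doc, indent=2))
        written.append(file_name)
    return written
```

Uploading the resulting directory tree unchanged to a bucket or static host is then enough to serve the listed endpoints, provided the host serves these extensionless files as JSON.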

A concrete example test implementation is available (list endpoint) with documentation.
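A client can read such a static implementation by requesting the first data file and following each next_page_url until none remains. The sketch below assumes each page is a JSON object with a `data` array and an optional `pagination.next_page_url` field; the helper names `fetch_json` and `iter_table_rows` are hypothetical.

```python
import json
import urllib.request
from urllib.parse import urljoin


def fetch_json(url):
    """Fetch a URL and decode its body as JSON."""
    with urllib.request.urlopen(url) as response:
        return json.load(response)


def iter_table_rows(base_url, table_name, fetch=fetch_json):
    """Yield every row of a table, following next_page_url links.

    `fetch` is injectable so the paging logic can be exercised
    without network access.
    """
    url = urljoin(base_url, f"table/{table_name}/data")
    while url:
        page = fetch(url)
        yield from page.get("data", [])
        next_url = (page.get("pagination") or {}).get("next_page_url")
        # Resolve relative links against the page that contained them.
        url = urljoin(url, next_url) if next_url else None
```

Resolving links with `urljoin` lets the same loop handle both relative and absolute next_page_url values. Note that `base_url` should end with a slash so the table path is appended rather than replacing the last path segment.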

Google Sheets implementation

A Google Sheets spreadsheet can also be exposed via the Tables API using the sheets adapter, located here.

Implementation based on PrestoSQL

DNAstack has provided an implementation of GA4GH Search on top of PrestoSQL.

Security

Sensitive information transmitted over public networks, such as access tokens and human genomic data, MUST be protected using Transport Layer Security (TLS) version 1.2 or later, as specified in RFC 5246.

If the data holder requires client authentication and/or authorization, then the client’s HTTPS API request MUST present an OAuth 2.0 bearer access token as specified in RFC 6750, in the Authorization request header field with the Bearer authentication scheme:

Authorization: Bearer [access_token]

The policies and processes used to perform user authentication and authorization, and the means through which access tokens are issued, are beyond the scope of this API specification. GA4GH recommends the use of the OpenID Connect and OAuth 2.0 framework (RFC 6749) for authentication and authorization.
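On the client side, attaching the token might look like the following sketch. The function name and the HTTPS check are illustrative choices rather than requirements of the specification, and how the token is obtained is deliberately left out.

```python
import urllib.request


def build_authorized_request(url, access_token):
    """Return a GET request carrying the token in the Authorization header."""
    if not url.lower().startswith("https://"):
        # Access tokens should only travel over TLS-protected connections.
        raise ValueError("refusing to send an access token over a non-HTTPS URL")
    request = urllib.request.Request(url, method="GET")
    request.add_header("Authorization", f"Bearer {access_token}")
    return request
```

Rejecting non-HTTPS URLs up front is one simple way to avoid leaking a token over an unprotected connection.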

CORS

Cross-origin resource sharing (CORS) is an essential technique for relaxing the same-origin policy enforced by browsers. This policy restricts a webpage from making a request to another website and leaking potentially sensitive information. However, the same-origin policy is also a barrier to using open APIs. GA4GH open API implementers should enable CORS to an acceptable level as defined by their internal policy. Public API implementations should allow requests from any origin.

GA4GH published a CORS best practices document, which implementers should refer to for guidance when enabling CORS on public API instances.
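To make the two policies concrete, the sketch below chooses the CORS response headers for a request. It is an illustration only and is not taken from the GA4GH best practices document; the function name and the allow-list parameter are hypothetical.

```python
def cors_headers(request_origin, allowed_origins=None):
    """Return CORS response headers for a request from `request_origin`.

    allowed_origins=None models a fully public API that accepts any origin;
    otherwise only origins in the given collection are allowed.
    """
    if allowed_origins is None:
        return {"Access-Control-Allow-Origin": "*"}
    if request_origin in allowed_origins:
        # Echo the origin and tell caches the response varies by Origin.
        return {"Access-Control-Allow-Origin": request_origin, "Vary": "Origin"}
    # No CORS headers: the browser will block the cross-origin read.
    return {}
```

A public instance would call `cors_headers(origin)` with no allow-list, while an instance governed by a stricter internal policy would pass its approved origins.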

Contributing

The GA4GH is an open community that strives for inclusivity. Guidelines for contributing to this repository are listed in CONTRIBUTING.md. Teleconferences and corresponding meeting minutes are open to the public. To learn how to contribute to this effort, please email Rishi Nag (rishi.nag@ga4gh.org).

Testing

Use the Swagger Validator Badge to validate the YAML file, or its OAS Validator wrapper to validate changes to the OpenAPI specification.

Reporting Security Issues

Please send an email to security-notification@ga4gh.org.
