CanCOGeN Contextual Data Specification

Rhiannon Cameron edited this page Nov 15, 2023 · 10 revisions

CanCOGeN VirusSeq

What is Contextual Data?

Public health genomics data includes sequence data as well as contextual data (i.e. sample metadata, lab, clinical/epidemiology, and methods data). Contextual data are data that enable further analysis by providing additional information and context about a subject, entity, or event.

Importance of Contextual Data to Public Health

Contextual data facilitates the interpretation of sequence data, informs decision making, and enables us to answer questions pertinent to the health of Canadians.

Importance of Well-Structured, Consistent Contextual Data

Contextual data that are structured and consistent, particularly when they comply with community standards such as minimum information checklists and ontologies, can be more easily understood and processed by both humans and computers, and can be more easily aggregated and reused for different types of analyses. Structuring contextual data with data standards promotes interoperability between datasets, makes errors and inconsistencies easier to resolve, and helps future-proof the data, preserving its value and reusability.

Privacy & Contextual Data Stewardship

As different types of contextual data are subject to different privacy concerns, not all types of CanCOGeN contextual data are immediately or widely shareable. We must recognize that good data management (tracking and documenting) is different from data sharing.

Good data stewardship practices inform us that it is important to capture linked, well organized, standardized contextual data so that it adheres to FAIR principles (Findable, Accessible, Interoperable, Reusable). These practices are not only critical for auditability and reproducibility, but for posterity; documenting critical contextual information can help build a roadmap for dealing with future public health crises.

The CanCOGeN Approach to Contextual Data

A primary goal of the CanCOGeN initiative is to create consistent, high quality minimal contextual data for public use so that Canada can: be an effective partner in global pandemic responses, achieve dynamic national public health surveillance, and better enable public health analyses on the local level.

We recommend a tiered approach to data management and data sharing for the CanCOGeN sequencing consortium, in order to ensure privacy, maximize information linkage, content, and interoperability, and better enable fast analyses to fight COVID-19 in Canada and around the world.

The Three Tiers of CanCOGeN Contextual Data

Tier 1) Minimal Contextual Data

  • Public - To be shared publicly (NCBI, GISAID).

Tier 2) Enhanced Contextual Data

  • National - To be shared among protected trusted partners (i.e. National Database); will not be shared publicly.

Tier 3) Enhanced Regional Contextual Data

  • Regional Use Only - to remain private for local analyses, which could be shared in the future for research and surveillance purposes.
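One way to picture the tiers is as nested sharing scopes: anything cleared for a wider audience is also visible to the narrower ones. The Python sketch below assumes each field is tagged with the widest tier at which it may be shared. The `FIELD_TIERS` assignments, the two enhanced field names, and the `shareable_at` helper are hypothetical and chosen only for illustration; they are not part of the specification.

```python
from enum import IntEnum


class Tier(IntEnum):
    """Sharing tiers; a lower number means a wider audience."""
    PUBLIC = 1      # Tier 1: minimal contextual data (e.g. NCBI, GISAID)
    NATIONAL = 2    # Tier 2: trusted partners via the national database
    REGIONAL = 3    # Tier 3: local analyses only


# Hypothetical field-to-tier assignments, for illustration only.
FIELD_TIERS = {
    "Sample collection date": Tier.PUBLIC,
    "Geo_loc name (country)": Tier.PUBLIC,
    "Exposure event": Tier.NATIONAL,        # assumed enhanced field
    "Local outbreak code": Tier.REGIONAL,   # assumed regional field
}


def shareable_at(record: dict, audience: Tier) -> dict:
    """Return only the fields that may be released to the given audience.

    Fields with no known tier are withheld, which is the conservative choice.
    """
    return {
        field: value
        for field, value in record.items()
        if field in FIELD_TIERS and FIELD_TIERS[field] <= audience
    }
```

Calling `shareable_at(record, Tier.PUBLIC)` would keep only the Tier 1 fields, while `Tier.NATIONAL` would keep Tier 1 and Tier 2 fields.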

1) Minimal Contextual Data

Updated: 2021-05-06

In order to meet the minimal requirements of the public sequence repositories (e.g. NCBI, GISAID), a minimal contextual data package has been developed using existing community standards and in collaboration with other national sequencing consortia (e.g. US SPHERES and COG-UK).

The proposed minimal contextual data package comprises 23 required fields, namely:

Contextual Data Field | Definition
Specimen collector sample ID | The user-defined name for the sample.
Sample collected by | The name of the agency that collected the original sample.
Sequence submitted by | The name of the agency that generated the sequence.
Sample collection date | The date on which the sample was collected.
Sample collection date precision | The precision to which the "Sample collection date" was provided.
Geo_loc name (country) | The country where the sample was collected.
Geo_loc name (province/territory) | The province/territory where the sample was collected.
Organism | Taxonomic name of the organism sampled.
Isolate | Identifier of the specific isolate.
Purpose of sampling | The reason that the sample was collected.
Purpose of sampling details | The description of why the sample was collected, providing specific details.
Isolation source* | The material sampled.
Host (scientific name) | The taxonomic, or scientific, name of the host.
Host disease | The name of the disease experienced by the host.
Host age | Age of host at the time of sampling.
Host age bin | Age of host at the time of sampling, expressed as an age group.
Host age unit | The unit used to measure the host age, in either months or years.
Host gender | The gender of the host at the time of sample collection.
Sequencing instrument | The model of the sequencing instrument used.
Purpose of sequencing | The reason that the isolate was sequenced.
Purpose of sequencing details | The description of why the sample was sequenced, providing specific details.
Consensus sequence method name | The name of the protocol used to produce the consensus sequence.
Consensus sequence method version | The name and version number of the protocol used to produce the consensus sequence.

*Isolation source has been subdivided for better data management. See the CanCOGeN Metadata Curation SOP section "Describing the material and/or site sampled".
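A completeness check against this checklist could be sketched as follows. This is a minimal illustration in Python, not the DataHarmonizer's own validation: the field labels are copied from the table above, records are assumed to be plain dictionaries keyed by those labels, and the `missing_required` helper is hypothetical. How missing values should be recorded in practice is not covered here.

```python
# Field labels follow the minimal contextual data table above.
REQUIRED_FIELDS = [
    "Specimen collector sample ID",
    "Sample collected by",
    "Sequence submitted by",
    "Sample collection date",
    "Sample collection date precision",
    "Geo_loc name (country)",
    "Geo_loc name (province/territory)",
    "Organism",
    "Isolate",
    "Purpose of sampling",
    "Purpose of sampling details",
    "Isolation source",
    "Host (scientific name)",
    "Host disease",
    "Host age",
    "Host age bin",
    "Host age unit",
    "Host gender",
    "Sequencing instrument",
    "Purpose of sequencing",
    "Purpose of sequencing details",
    "Consensus sequence method name",
    "Consensus sequence method version",
]


def is_blank(value) -> bool:
    """Treat None and empty or whitespace-only strings as blank (0 is a valid value)."""
    return value is None or str(value).strip() == ""


def missing_required(record: dict) -> list:
    """Return the required fields that are absent or blank, in checklist order."""
    return [field for field in REQUIRED_FIELDS if is_blank(record.get(field))]
```

A record would be considered complete for Tier 1 purposes, under these assumptions, when `missing_required(record)` returns an empty list.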

2) Enhanced Contextual Data

Enhanced contextual data is data that provides additional information beyond the minimal requirements. We classify enhanced contextual data into two categories, based on whether it is shared nationally or kept for regional use only.

One of the goals of CanCOGeN is to bring public health workers and researchers together to discuss contextual data needs to improve genomic-based analysis. As such, these sets of enhanced contextual data have been and will continue to be established through consensus.

National Contextual Data

It is important to recognize that national contextual data will be protected in the national database, and will not be shared publicly along with minimal contextual data. Different types of enhanced contextual data, if included in the national database, would enable improved tracking, analysis, and modeling of COVID-19 across Canada. Consensus across data providers will be crucial to ensure that the national database is as complete and comprehensive as possible.

The CanCOGeN national contextual data set will be derived from joint discussion and consensus among participating health authorities, with input from participating researchers. As such, the national contextual data set will be discussed at upcoming CPHLN meetings, as well as through additional consultation with member researchers.

3) Regional Contextual Data

In addition to minimal and nationally-shared contextual data, there may be more contextual data that data generators wish to document for local data management and analyses. As such, the contextual data harmonization team will work with consortium members to enable the capture of private contextual data so that it is easier to aggregate and use in local analyses. This data will not be shared publicly, nor with the national database, and its collection and use are at the discretion of data generators.

It is important to remember that prospective contextual data collection is easier and more efficient than retrospective collection, and that standardization will better enable future sharing and integration for research and surveillance.

Case Report Forms and Surveillance

Updated: 2021-05-06

Case report forms are the primary data collection tool for gathering information about patients at the point of sample collection (for diagnostic testing). This information includes contact details, demographic information, risks and exposures (e.g. occupational, patient settings, travel, contact with infected individuals), clinical presentation and pre-existing conditions, symptoms and outcomes, lab testing requisitions and results, and more. To compare the information collected in case report forms across provincial and federal jurisdictions, we gathered provincial, territorial, and national SARS-CoV-2 (COVID-19) case collection forms and cross-referenced their contents. This information is being compiled into a report that will be shared with CPHLN members and CanCOGeN researchers to better inform discussions regarding enhanced contextual data sharing.

Curation of case report forms is also empowering the contextual data team to generate well-defined controlled vocabularies for standardized reporting.

A collection of provincial, territorial and federal case report forms is also available.

COVID-19 Data Harmonizer

The DataHarmonizer is a spreadsheet application developed for the Canadian COVID Genomics Network (CanCOGeN) to harmonize and validate COVID-19 contextual data prior to sharing with the Canadian SARS-CoV-2 national genomics database.

The latest release can be found on GitHub at https://github.com/cidgoh/pathogen-genomics-package/releases

Curation and Dataflow

Curation

The meaning and structure (e.g. field header or term label) of vocabulary may vary between organizations and institutions. As such, the CanCOGeN contextual data team aims to provide fields and terms that are well defined and comprehensive. The curation team manually reviews every term, restructuring and redefining as needed, to ensure the language provided encompasses all the descriptive and prescriptive uses of that term among data providers. These standardized terms are in the process of being structured ontologically, each with its own unique ID, content, and relational axioms.

Curation of vocabulary for required metadata fields has been prioritized; however, further curation will continue throughout the project to fulfill additional metadata needs. If a desired term is missing from a pick list, term requests can be made by contacting the curation team.
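Checking entries against a pick list, and collecting the values that might prompt a term request, could look roughly like the sketch below. The only pick list shown is for host age unit, which the minimal contextual data table describes as months or years; the exact labels `month` and `year`, the `PICK_LISTS` structure, and the `off_list_values` helper are assumptions made for illustration.

```python
# Assumed pick list labels; the specification's exact wording may differ.
PICK_LISTS = {
    "Host age unit": {"month", "year"},
}


def off_list_values(records, pick_lists=PICK_LISTS) -> dict:
    """Collect values that are not in the pick list, grouped by field.

    Fields that are absent from a record are ignored here; completeness
    is a separate check.
    """
    unknown = {}
    for record in records:
        for field, allowed in pick_lists.items():
            value = record.get(field)
            if value is not None and value not in allowed:
                unknown.setdefault(field, set()).add(value)
    return unknown
```

Anything returned by `off_list_values` is either a data entry error to correct or a candidate term to raise with the curation team.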

Dataflow

The diagram below outlines how CanCOGeN data harmonization integrates into the data flow of laboratories that submit their own sequence data and laboratories that submit their samples to the National Microbiology Laboratory (NML).

Dataflow Diagram - NOT FINALIZED

International Adoption

Updated: 2021-05-07

The CanCOGeN contextual data specification is now being adopted and implemented around the world by the following SARS-CoV-2 sequencing initiatives, platforms, and databases:

Country | Initiative/System
International | PHA4GE
Australia | Austrakka
Canada | DNAStack*
Canada | National Genomic Surveillance Database
Canada | VirusSeq Data Portal
Latin America | COV GEN Network
Nigeria | ACEGID
South Africa | BAOBAB LIMS
USA | SPHERES
USA | TOAST CDC
USA | NCBI

* Private Company

Ontologizing the Specification

Updated: 2021-05-07

To facilitate compatibility and interoperability between disparate SARS-CoV-2 contextual datasets, we are in the process of ontologizing the CanCOGeN contextual data specification. Rather than develop a new ontology, specification fields and picklist values are being integrated into the open-source Open Biological and Biomedical Ontologies Foundry (OBOF) Genomic Epidemiology Ontology (GenEpiO). GenEpiO is an application ontology developed to describe pathogen genomic and contextual (metadata) data for the surveillance and investigation of health-related epidemiological events.

The CanCOGeN specification and field classes are being directly integrated into GenEpiO, with individual fields being implemented as instances that utilize Linked data Modeling Language (LinkML) classes and relations for data and object property assertions. The use of LinkML specification augments our specification mapping efforts as LinkML is in the process of coding other standards (e.g. GSC MIxS) more precisely, and also facilitates the generation of YAML and JSON data mappings.

Following OBOF principles, we work collaboratively with domain ontologies to obtain or request terms relevant to specification picklist values. We have worked with the following ontologies (so far):

Once integration into GenEpiO is complete, the ontologized specification will be incorporated into the DataHarmonizer to allow export of ontologized, harmonized relational datasets in YAML or JSON format.
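To give a rough picture of what such an export might contain, the Python sketch below pairs each value with an ontology term identifier and serializes the result as JSON. The DataHarmonizer's actual export format is not described here, so the output structure, the `to_ontologized_json` helper, and the `EXAMPLE:` identifiers are all placeholders; they are not real GenEpiO terms.

```python
import json

# Placeholder mapping from field labels to term IDs (NOT real ontology IDs).
FIELD_TERM_IDS = {
    "Organism": "EXAMPLE:0000001",
    "Host disease": "EXAMPLE:0000002",
}


def to_ontologized_json(record: dict) -> str:
    """Serialize a record as JSON, attaching a term ID to each field where known."""
    entries = [
        {
            "field": label,
            "term_id": FIELD_TERM_IDS.get(label),  # None when no mapping exists
            "value": value,
        }
        for label, value in record.items()
    ]
    return json.dumps(entries, indent=2, ensure_ascii=False)
```

Keeping the human-readable label next to the identifier is a design choice for this sketch: the label stays legible to curators, while the identifier is what other systems would match on.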

Training and Webinars

Updated: 2020-12

Tutorial Video Part 1

  • Topic: DataHarmonizer (version 13.6)
  • Date: November 11th, 2020
  • Presenter: Sarah Savić Kallesøe
  • Developers: Emma Griffiths, Sarah Savić Kallesøe, and Rhiannon Cameron
  • Outline: How to find, download, use, and export from the COVID-19 DataHarmonizer.

  • Topic: Applying for a CNPHI Account
  • Date: November 30th, 2020
  • Presenter: Sarah Savić Kallesøe
  • Developers: Emma Griffiths, Sarah Savić Kallesøe, and Rhiannon Cameron

  • Topic: Uploading to CNPHI LaSER
  • Date: November 30th, 2020
  • Presenter: Sarah Savić Kallesøe
  • Developers: Emma Griffiths, Sarah Savić Kallesøe, and Rhiannon Cameron
  • Outline: How to upload, troubleshoot, and submit your harmonized CSV data file to CNPHI LaSER.

  • Topic: The CanCOGeN DataHarmonizer Training Session
  • Date: June 22nd, 2020
  • Presenters: Emma Griffiths and Ivan Gill
  • Outline: Importance of well-structured, harmonized contextual data; Collection template/validator application installation; Entering data; Validation; Export.

Papers and Resources

SOP = Standard Operating Procedure

DataHarmonizer

CanCOGeN Contextual Data Curation / DataHarmonizer SOP:

CNPHI LaSER Upload SOP:

Relevant Publications

  • (2020) COVID-19 pandemic reveals the peril of ignoring metadata standards. Lynn M. Schriml, Maria Chuvochina, Neil Davies, Emiley A. Eloe-Fadrosh, Robert D. Finn, Philip Hugenholtz, Christopher I. Hunter, Bonnie L. Hurwitz, Nikos C. Kyrpides, Folker Meyer, Ilene Karsch Mizrachi, Susanna-Assunta Sansone, Granger Sutton, Scott Tighe, Ramona Walls.
  • (2020) CanCOGeN VirusSeq comparison and analysis of Canadian public health SARS-CoV-2 case report forms. Rhiannon Cameron, Sarah Savić Kallesøe, Emma Griffiths.
  • (2020) Dynamic linkage of COVID-19 test results between Public Health England's Second Generation Surveillance System and UK BioBank. Jacob Armstrong, Justine K. Rudkin, Naomi Allen, Derrick W. Crook, Daniel J. Wilson, David H. Wyllie, Anne Marie O’Connell.
  • (2020) The PHA4GE SARS-CoV-2 Contextual Data Specification for Open Genomic Epidemiology. Emma J. Griffiths, Ruth E. Timme, Andrew J. Page, Nabil-Fareed Alikhan, Dan Fornika, Finlay Maguire, Catarina Inês Mendes, Simon H. Tausch, Allison Black, Thomas R. Connor, Gregory H. Tyson, David M. Aanensen, Brian Alcock, Josefina Campos, Alan Christoffels, Anders Gonçalves da Silva, Emma Hodcroft, William W.L. Hsiao, Lee S. Katz, Samuel M. Nicholls, Paul E. Oluniyi, Idowu B. Olawoye, Amogelang R. Raphenya, Ana Tereza R. Vasconcelos, Adam A. Witney, Duncan R. MacCannell.
  • (2014) Standardized Metadata for Human Pathogen/Vector Genomic Sequences. Vivien G. Dugan, Scott J. Emrich, Gloria I. Giraldo-Calderón, Omar S. Harb, Ruchi M. Newman, Brett E. Pickett, Lynn M. Schriml, Timothy B. Stockwell, Christian J. Stoeckert Jr, Dan E. Sullivan, Indresh Singh, Doyle V. Ward, Alison Yao, Jie Zheng, Tanya Barrett, Bruce Birren, Lauren Brinkac, Vincent M. Bruno, Elizabet Caler, Sinéad Chapman, Frank H. Collins, Christina A. Cuomo, Valentina Di Francesco, Scott Durkin, Mark Eppinger, Michael Feldgarden, Claire Fraser, W. Florian Fricke, Maria Giovanni, Matthew R. Henn, Erin Hine, Julie Dunning Hotopp, Ilene Karsch-Mizrachi, Jessica C. Kissinger, Eun Mi Lee, Punam Mathur, Emmanuel F. Mongodin, Cheryl I. Murphy, Garry Myers, Daniel E. Neafsey, Karen E. Nelson, William C. Nierman, Julia Puzak, David Rasko, David S. Roos, Lisa Sadzewicz, Joana C. Silva, Bruno Sobral, R. Burke Squires, Rick L. Stevens, Luke Tallon, Herve Tettelin, David Wentworth, Owen White, Rebecca Will, Jennifer Wortman, Yun Zhang, Richard H. Scheuermann.

Contact Information

Users are encouraged to contact the contextual data harmonization team, who can readily assist with troubleshooting and curation.

If you have ideas for improving the COVID-19 DataHarmonizer, or have encountered problems running the application, you may open an issue on GitHub for discussion.

For more information or further assistance please contact Dr. William Hsiao at wwhsiao@sfu.ca or Dr. Emma Griffiths at emma_griffiths@sfu.ca.