1. HDR UK Dataset Schema - YAML - JSON
The latest version of the specification required for onboarding datasets onto the Gateway is held in this repository and comprises the following:
- The latest JSON schema and YAML, which can be found here: https://github.com/HDRUK/schemata/blob/master/schema/dataset/latest. They represent the V2 metadata specification for onboarding datasets onto the Gateway, as presented in the descriptive metadata documentation.
- Example JSON files, which can be found here: https://github.com/HDRUK/schemata/blob/master/examples. They include a current mapped dataset, a minimum JSON example and a full example of the metadata for onboarding to the Gateway.
- The latest Word documentation, change log and mapping file, which can be found here: https://github.com/HDRUK/schemata/tree/master/docs/dataset/2.0.0/distribution. The documentation provides details of the descriptive metadata needed for the Gateway, including definitions and user stories that illustrate the purpose of each item.
- An impact assessment and indicative mapping files, which can be found here: https://github.com/HDRUK/schemata/tree/master/docs/dataset/2.0.0/impact-assessment. The folder contains the following files:
  - aggregated_errors.xlsx: aggregated validation errors, with an overview of the most common errors per attribute that will need to be resolved during migration.
  - generated_mapping.py: the mapping algorithm that generates v2 data-models (essentially a mapping function for every field in the new v2 specification).
  - v1_to_v2.json: for each data-model (keyed by UUID), the old schema under v1 and the new schema under v2, where v2 is the mapping result produced by generated_mapping.py. It covers all data-models in the Gateway.
  - dm_validation.json: for each data-model (again keyed by UUID), the dataset expressed as v2 attributes together with its JSON schema validation.
  - aggregated_errors.json: an aggregation of the validation of all data-models by attribute, with a list of unique error messages, a count of unique error messages, and a total count of the errors across all data-models. Note that the JSON schema validator generates an entry for every hierarchy level, so some error messages are repeated along the hierarchy.
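The per-attribute aggregation described for aggregated_errors.json can be sketched roughly as follows. This is an illustration only: the input layout (a dict keyed by UUID, each holding a list of `attribute`/`message` entries), the function name `aggregate_errors` and the sample attribute names are assumptions made here, and the actual layout of dm_validation.json may differ.

```python
from collections import defaultdict


def aggregate_errors(validation):
    """Aggregate validation errors by attribute across all data-models.

    `validation` uses a hypothetical layout:
    {uuid: [{"attribute": str, "message": str}, ...]}
    """
    by_attribute = defaultdict(lambda: {"messages": set(), "total": 0})
    for errors in validation.values():
        for err in errors:
            entry = by_attribute[err["attribute"]]
            entry["messages"].add(err["message"])  # unique messages only
            entry["total"] += 1                    # every occurrence counts
    return {
        attr: {
            "unique_messages": sorted(entry["messages"]),
            "unique_count": len(entry["messages"]),
            "total_count": entry["total"],
        }
        for attr, entry in by_attribute.items()
    }


# Illustrative input; the UUIDs, attribute names and messages are made up.
example = {
    "uuid-1": [{"attribute": "summary.title", "message": "is too short"}],
    "uuid-2": [
        {"attribute": "summary.title", "message": "is too short"},
        {"attribute": "summary.title", "message": "is a required property"},
    ],
}
result = aggregate_errors(example)
print(result["summary.title"])
```

Because the validator reports an entry at every hierarchy level, a real aggregation would likely show the same message under both a parent and a child attribute; this sketch does not attempt to de-duplicate those.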
Below is the breakdown of the HDR UK V2 Dataset Schema by its properties and sub-properties, as defined in the JSON Schema. Each property from 1-7 has its own schema with a description of its sub-properties, including their data types and whether they are required fields.
1. summary: Summary metadata must be completed by Data Custodians onboarding metadata into the Innovation Gateway MVP.
2. documentation: Documentation can include a rich text description of the dataset or links to media such as documents, images, presentations, videos or links to data dictionaries, profiles or dashboards. Organisations are required to confirm that they have permission to distribute any additional media.
3. coverage: This information includes attributes for geographical and temporal coverage, cohort details etc. to enable a deeper understanding of the dataset content so that researchers can make decisions about the relevance of the underlying data.
4. provenance: Provenance information allows researchers to understand data within the context of its origins and can be an indicator of quality, authenticity and timeliness.
5. accessibility: Accessibility information allows researchers to understand access, usage, limitations, formats, standards and linkage or interoperability with toolsets.
6. enrichmentAndLinkage: This section includes information about related datasets that may have previously been linked, and indicates whether there is an opportunity to link to other datasets in the future. If a dataset has been enriched, or if derivations, scores and existing tools are available, this section allows providers to indicate this to researchers.
7. observations: Multiple observations about the dataset may be provided and users are expected to provide at least one observation (1..*). We will be supporting the schema.org observation model (https://schema.org/Observation) with default values. Users will be encouraged to provide their own statistical populations as the project progresses.
8. structuralMetadata: Descriptions and details about the tables and columns within a dataset.
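As a quick orientation, the top-level layout above can be checked with a few lines of Python before running full validation. This is a minimal sketch, not a substitute for validating against the published JSON Schema: the helper name `check_sections` and the sample record are hypothetical, and which sections are actually mandatory is defined by the schema itself, not by this snippet.

```python
# Top-level properties as listed in the breakdown above.
TOP_LEVEL_PROPERTIES = [
    "summary",
    "documentation",
    "coverage",
    "provenance",
    "accessibility",
    "enrichmentAndLinkage",
    "observations",
    "structuralMetadata",
]


def check_sections(metadata):
    """Return (absent_sections, notes) for a metadata dict.

    Hypothetical helper: it only reports which top-level sections are
    absent and whether at least one observation is present.
    """
    absent = [name for name in TOP_LEVEL_PROPERTIES if name not in metadata]
    notes = []
    if not metadata.get("observations"):
        notes.append("observations: at least one observation is expected")
    return absent, notes


# Illustrative record; the field contents are placeholders.
record = {"summary": {"title": "Example dataset"}, "observations": []}
absent, notes = check_sections(record)
print(absent)
print(notes)
```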
Once a dataset is onboarded onto the Gateway, a quality check is run on its corresponding JSON to produce a weighted quality score, based on weighted field completeness and weighted field error percentage. The weight of each field can be found here (https://github.com/HDRUK/datasets/tree/master/config/weights) and details of the quality score calculation can be found here (https://github.com/HDRUK/datasets/tree/master/reports#how-scores-are-calculated).
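To give a feel for what weighted field completeness and weighted field error percentage might look like, here is a rough sketch. Everything specific in it is an assumption: the field names, the weights, the helper `weighted_percentage`, and in particular the way the two figures are combined into a single score. The authoritative weights and formula are the ones in the linked repository.

```python
def weighted_percentage(flags, weights):
    """Weighted percentage of fields whose flag is True.

    `weights` maps field name -> weight; `flags` maps field name -> bool.
    """
    total = sum(weights.values())
    if total == 0:
        return 0.0
    hit = sum(w for field, w in weights.items() if flags.get(field, False))
    return 100.0 * hit / total


# Hypothetical weights and field states, for illustration only.
weights = {"summary.title": 5, "summary.abstract": 3, "coverage.spatial": 2}
completed = {"summary.title": True, "summary.abstract": True, "coverage.spatial": False}
in_error = {"summary.title": False, "summary.abstract": True, "coverage.spatial": False}

completeness = weighted_percentage(completed, weights)
error_percent = weighted_percentage(in_error, weights)

# Assumed combination: average of completeness and (100 - error percentage).
# The real calculation is documented in the linked reports page.
score = (completeness + (100.0 - error_percent)) / 2
print(completeness, error_percent, score)
```

Weighting matters here because a missing high-weight field lowers completeness more than a missing low-weight one, so two datasets with the same number of empty fields can score differently.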
Based on the weighted quality score, a dataset is given a medallion rating as follows: