# Schema for MOTBX resources

This notebook defines a data schema for MOTBX resources. The schema is first validated against the JSON Schema draft 2020-12 metaschema and is then used to validate MOTBX resources. MOTBX resources are stored as YAML files and the schema is stored as JSON; both are loaded into Python as dictionaries using the *yaml* and *json* libraries, respectively. The *jsonschema* library performs the validation.
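
As a minimal sketch of the loading step described above, the snippet below parses the same record once from YAML and once from JSON and compares the resulting dictionaries. The record is a made-up fragment, not a real MOTBX resource.

```python
import json
import yaml

# Hypothetical resource fragment, written once as YAML and once as JSON.
yaml_text = """
resourceID: "1"
resourceTags:
  - ISO standard
  - guidelines
"""
json_text = '{"resourceID": "1", "resourceTags": ["ISO standard", "guidelines"]}'

from_yaml = yaml.safe_load(yaml_text)
from_json = json.loads(json_text)

# Both parsers return plain dictionaries, so they can be compared directly.
same_content = from_yaml == from_json
print(same_content)

# An unquoted YAML scalar such as 1 is parsed as an integer, not a string.
unquoted = yaml.safe_load("resourceID: 1")
print(type(unquoted["resourceID"]).__name__)
```

Because the schema declares `resourceID` as a string, an ID written without quotes in a YAML file would be loaded as an integer and would presumably fail validation.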

In [12]:
import yaml
import json
import jsonschema

from pathlib import Path
import pprint
pp = pprint.PrettyPrinter(indent=2, width=80, compact=True)


CWD = Path.cwd()
if CWD.name != "notebooks":
    print("Make sure to run this notebook from the 'notebooks' directory.")
MOTBX_DIR = CWD.parent
SCHEMA_JSON = MOTBX_DIR.joinpath("schema/motbxschema.json")
TEST_RESOURCE_YAML = MOTBX_DIR.joinpath("tests/resources_pass/test1.yaml")

In [13]:
schema = {
    # "$id": a URI
    "$schema": "https://json-schema.org/draft/2020-12/schema",
    "title": "MOTBX resource",
    "description": "Schema for resources of the EATRIS Multi-omics Toolbox (MOTBX)",
    "type": "object",
    "properties": {
        # "resource": {
            # "type": "object",
        "resourceID": {"type": "string"},
        # "properties": {
        "resourceCategory": {
            "type": "string",
            "enum": [
                # allowed values for field "resourceCategory"
                "Genomics",
                "Epigenomics",
                "Transcriptomics",
                "Proteomics",
                "Metabolomics",

                #"Internal Quality Control",
                #"External Quality Assessment",
                "Quality control and assessment",

                #"Omics data management and analysis"
                "Data management and stewardship",
                "Data analysis"
            ]
        },
        "resourceSubcategory": {
            "type": "string",
            # allowed values are defined below under "anyOf" based on value of
            # "resourceCategory"
        },
        "resourceTitle": {
            "type": "string",
            "minLength": 15,
            "maxLength": 160
        },
        "resourceDescription": {
            "type": "string",
            "minLength": 50,
            "maxLength": 2500
        },
        "resourceUrl": {
            "type": "string",
            "format": "uri",
            "pattern": "^https://|.pdf$"  #"^https?://"
        },
        "resourceTags": {
            "type": "array",
            "items": {
                "type": "string"},
            "minItems": 1
        },
        "resourceKeywords": {
            "type": "array",
            "items": {
                "type": "string"}
        },
    },
    "anyOf": [
        {"properties": {
            "resourceCategory": {"enum": [
                "Genomics",
                "Epigenomics",
                "Transcriptomics",
                "Proteomics",
                "Metabolomics"]},
            "resourceSubcategory": {"enum": [
                "Guidelines and best practices",
                "Laboratory protocols and methods",
                "Translational research use case"]}
        }},
        {"properties": {
            "resourceCategory": {"enum": [
                "Quality control and assessment",]},
            "resourceSubcategory": {"enum": [
                "Guidelines and best practices",
                "Reference materials for quality control",
                "Proficiency testing and external quality assessment",
                "Quality certification"]}
        }},
        {"properties": {
            "resourceCategory": {"enum": [
                "Data management and stewardship"]},
            "resourceSubcategory": {"enum": [
                "Guidelines and best practices",
                "Data and metadata standards",
                "Databases and catalogues",
                "Translational research use cases"]}
        }},
        {"properties": {
            "resourceCategory": {"enum": [
                "Data analysis"]},
            "resourceSubcategory": {"enum": [
                "Guidelines and best practices",
                "Software applications and workflows",
                "Computing platforms",
                "Translational research use cases"]}
        }}
    ],
    "required": [
        "resourceID",
        "resourceCategory",
        "resourceSubcategory",
        "resourceTitle",
        "resourceDescription",
        "resourceUrl",
        "resourceTags"],  # "resourceKeywords" is optional
    #"additionalProperties": False,
    #"examples":
}

# example resource
resource = {
    "resourceID": "1",
    # "resource": {
    "resourceCategory": "Quality control and assessment",
    "resourceSubcategory": "Guidelines and best practices",
    "resourceTitle": "ISO Guide 80:2014: Guidance for in-house preparation of quality control materials",
    "resourceDescription": "ISO Guide 80:2014 guidance for the in-house preparation of quality control materials (QCMs). ISO Guide 80 outlines the characteristics and preparation processes of reference materials for quality control. It applies to stable materials used locally and those transported without significant property changes. Laboratory staff preparing in-house quality control materials should follow ISO Guides 34 and 35 for transportation-based supply chains. The preparation of quality control materials requires assessments for homogeneity, stability, and limited characterization. It aims to demonstrate statistical control in a measurement system but does not provide usage guidance. The guide offers general information on preparation and includes case studies for different sectors. Users should have material knowledge and be aware of matrix effects and contamination risks.",
    "resourceUrl": "https://www.iso.org/standard/44313.html",
    "resourceTags": ["ISO standard", "guidelines", "quality control material", "in-house", "genomics"],
    # },
    # "resourceMetadata": {"last_modified": str(datetime.date(2023, 8, 4))}
}

# validate schema against metaschema
jsonschema.Draft202012Validator.check_schema(schema)

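# validate example resource against schema
# note: with FormatChecker, "format": "uri" is only enforced if an optional
# dependency of jsonschema (e.g. rfc3987) is installed
jsonschema.validate(resource, schema, format_checker=jsonschema.FormatChecker())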

In [14]:
with open(TEST_RESOURCE_YAML, "w") as fp:
    yaml.dump(resource, fp)
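
By default `yaml.dump` writes mapping keys in alphabetical order, so the test file will not keep the key order of the `resource` dictionary. The sketch below, using a hypothetical two-field record, shows the default behaviour next to `sort_keys=False` (available in PyYAML 5.1 and later) and confirms that both forms load back to the same dictionary.

```python
import yaml

# Hypothetical two-field record; insertion order is ID first, then category.
record = {"resourceID": "1", "resourceCategory": "Data analysis"}

sorted_text = yaml.dump(record)                    # keys sorted alphabetically
ordered_text = yaml.dump(record, sort_keys=False)  # insertion order preserved

print(sorted_text)
print(ordered_text)

# Key order does not affect the parsed content.
round_trip_ok = yaml.safe_load(sorted_text) == yaml.safe_load(ordered_text) == record
```

Key order has no effect on validation, so `sort_keys=False` is only worth considering if the YAML files are meant to be read by people.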

In [15]:
# print schema formatted as YAML
print(yaml.dump(schema))

$schema: https://json-schema.org/draft/2020-12/schema
anyOf:
- properties:
    resourceCategory:
      enum:
      - Genomics
      - Epigenomics
      - Transcriptomics
      - Proteomics
      - Metabolomics
    resourceSubcategory:
      enum:
      - Guidelines and best practices
      - Laboratory protocols and methods
      - Translational research use case
- properties:
    resourceCategory:
      enum:
      - Quality control and assessment
    resourceSubcategory:
      enum:
      - Guidelines and best practices
      - Reference materials for quality control
      - Proficiency testing and external quality assessment
      - Quality certification
- properties:
    resourceCategory:
      enum:
      - Data management and stewardship
    resourceSubcategory:
      enum:
      - Guidelines and best practices
      - Data and metadata standards
      - Databases and catalogues
      - Translational research use cases
- properties:
    resourceCategory:
      enum:
      - Data 
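
The `anyOf` block pairs each category with the subcategories it accepts. As a reading aid only, the sketch below restates that pairing as a plain Python lookup. The names `ALLOWED_SUBCATEGORIES` and `subcategory_allowed` are hypothetical and not part of the notebook; the values are transcribed from the schema definition.

```python
# Category -> allowed subcategories, transcribed from the "anyOf" branches.
OMICS_CATEGORIES = [
    "Genomics", "Epigenomics", "Transcriptomics", "Proteomics", "Metabolomics",
]

ALLOWED_SUBCATEGORIES = {
    "Quality control and assessment": [
        "Guidelines and best practices",
        "Reference materials for quality control",
        "Proficiency testing and external quality assessment",
        "Quality certification",
    ],
    "Data management and stewardship": [
        "Guidelines and best practices",
        "Data and metadata standards",
        "Databases and catalogues",
        "Translational research use cases",
    ],
    "Data analysis": [
        "Guidelines and best practices",
        "Software applications and workflows",
        "Computing platforms",
        "Translational research use cases",
    ],
}
# All omics categories share one list of subcategories.
for category in OMICS_CATEGORIES:
    ALLOWED_SUBCATEGORIES[category] = [
        "Guidelines and best practices",
        "Laboratory protocols and methods",
        "Translational research use case",
    ]


def subcategory_allowed(category, subcategory):
    """Return True if the pair matches one of the anyOf branches."""
    return subcategory in ALLOWED_SUBCATEGORIES.get(category, [])


print(subcategory_allowed("Quality control and assessment", "Quality certification"))
print(subcategory_allowed("Genomics", "Quality certification"))
```

One difference from the schema: a `properties` keyword only constrains keys that are present, so an `anyOf` branch would pass if a key were missing. The top-level `required` list closes that gap in the schema itself.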

In [16]:
# print schema formatted as JSON
print(json.dumps(schema, indent=2))

{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "title": "MOTBX resource",
  "description": "Schema for resources of the EATRIS Multi-omics Toolbox (MOTBX)",
  "type": "object",
  "properties": {
    "resourceID": {
      "type": "string"
    },
    "resourceCategory": {
      "type": "string",
      "enum": [
        "Genomics",
        "Epigenomics",
        "Transcriptomics",
        "Proteomics",
        "Metabolomics",
        "Quality control and assessment",
        "Data management and stewardship",
        "Data analysis"
      ]
    },
    "resourceSubcategory": {
      "type": "string"
    },
    "resourceTitle": {
      "type": "string",
      "minLength": 15,
      "maxLength": 160
    },
    "resourceDescription": {
      "type": "string",
      "minLength": 50,
      "maxLength": 2500
    },
    "resourceUrl": {
      "type": "string",
      "format": "uri",
      "pattern": "^https://|.pdf$"
    },
    "resourceTags": {
      "type": "array",
      "ite
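
The `pattern` for `resourceUrl` is an alternation: a URL passes if it starts with `https://` or if it ends with any character followed by `pdf`. jsonschema applies `pattern` as an unanchored regular-expression search, so the behaviour can be explored with `re.search` alone. In the sketch below the example URLs are invented, and `stricter_pattern` is a hypothetical alternative with the dot escaped, not the pattern used by the schema.

```python
import re

schema_pattern = "^https://|.pdf$"       # copied from the schema
stricter_pattern = r"^https://|\.pdf$"   # hypothetical variant, dot escaped

# Invented example URLs covering each branch of the alternation.
candidates = [
    "https://www.iso.org/standard/44313.html",  # https prefix
    "http://example.org/page.html",             # neither branch
    "http://example.org/report.pdf",            # ends with .pdf
    "http://example.org/reportXpdf",            # unescaped dot matches "X"
]

schema_matches = {url: bool(re.search(schema_pattern, url)) for url in candidates}
stricter_matches = {url: bool(re.search(stricter_pattern, url)) for url in candidates}

for url in candidates:
    print(url, schema_matches[url], stricter_matches[url])
```

The two patterns differ only for strings such as the last one, where the character before `pdf` is not a literal dot.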

In [17]:
# save schema
with open(SCHEMA_JSON, "w") as fp:
    json.dump(schema, fp, indent=2)
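
A quick way to confirm that the saved file is usable is to load it back and compare it with the in-memory dictionary. The sketch below does this with a small stand-in schema and a temporary directory, so it does not touch `SCHEMA_JSON`; the file name and the `schema_demo` dictionary are assumptions for illustration.

```python
import json
import tempfile
from pathlib import Path

# Small stand-in for the full schema dictionary.
schema_demo = {
    "$schema": "https://json-schema.org/draft/2020-12/schema",
    "type": "object",
    "required": ["resourceID"],
}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "motbxschema.json"  # temporary location, not the real file
    with open(path, "w") as fp:
        json.dump(schema_demo, fp, indent=2)
    with open(path) as fp:
        reloaded = json.load(fp)

# The JSON round trip should reproduce the dictionary exactly.
round_trip_ok = reloaded == schema_demo
print(round_trip_ok)
```

Python comments inside the schema definition are not part of the dictionary, so they do not appear in the saved JSON file.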