# ETL for Ingesting Data from JSON Sources

This notebook outlines how to transform NGS data for ingestion into Beacon in the data portal.

We start with the basic imports required for this data transformation.


In [1]:
import json

## Dataset Entry

A dataset entry encapsulates information about the data collection as a whole. Let's create one using some made-up data. In practice, populate the relevant fields with the actual data you gather during data collection.

We also need to record the reference assembly used to call the variants (e.g. GRCH37 or GRCH38).


In [2]:
dataset = {
    "createDateTime": "2021-03-21T02:37:00-08:00",
    "dataUseConditions": {
        "duoDataUse": [
            {
                "id": "DUO:0000042",
                "label": "general research use",
                "version": "17-07-2016"
            }
        ]
    },
    "description": "Simulation set 1.",
    "externalUrl": "http://example.org/wiki/Main_Page",
    "info": {},
    "name": "Dataset with NGS data",
    "updateDateTime": "2022-08-05T17:21:00+01:00",
    "version": "v1.1"
}
assembly_id = "GRCH38"
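
Before moving on, it can help to confirm that the two timestamp fields parse as date-times with a UTC offset. The sketch below assumes that ISO 8601 with an offset is the expected format (the notebook does not state this), and `check_timestamps` is a hypothetical helper name, not part of any Beacon tooling.

```python
from datetime import datetime


def check_timestamps(dataset):
    """Return a list of problems found in the dataset's timestamp fields."""
    problems = []
    parsed = {}
    for field in ("createDateTime", "updateDateTime"):
        value = dataset.get(field)
        try:
            stamp = datetime.fromisoformat(value)
        except (TypeError, ValueError):
            # TypeError covers a missing (None) value, ValueError a malformed string
            problems.append(f"{field} is not an ISO 8601 date-time: {value!r}")
            continue
        if stamp.tzinfo is None:
            problems.append(f"{field} has no UTC offset: {value!r}")
            continue
        parsed[field] = stamp
    # Only compare the two values when both parsed cleanly
    if len(parsed) == 2 and parsed["updateDateTime"] < parsed["createDateTime"]:
        problems.append("updateDateTime is earlier than createDateTime")
    return problems
```

Calling `check_timestamps(dataset)` on the entry above should return an empty list.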

## Individuals Information

Each VCF sample must originate from an individual. These entries can be constructed as follows. This example uses four individual entries.


In [3]:
individuals = [
    {
      "id": "INDIVIDUAL_1",
      "karyotypicSex": "XX",
      "sex": {
        "id": "SNOMED:248153007",
        "label": "MALE"
      }
    },
    {
      "id": "INDIVIDUAL_2",
      "karyotypicSex": "XY",
      "sex": {
        "id": "SNOMED:248152002",
        "label": "FEMALE"
      }
    },
    {
      "id": "INDIVIDUAL_3",
      "karyotypicSex": "XX",
      "sex": {
        "id": "SNOMED:248153007",
        "label": "MALE"
      }
    },
    {
      "id": "INDIVIDUAL_4",
      "karyotypicSex": "XY",
      "sex": {
        "id": "SNOMED:248152002",
        "label": "FEMALE"
      }
    }
]

## Biosample Information

A biosample encapsulates information about the physiological sample collected from an individual's body. In this example, each individual has exactly one biosample.


In [4]:
biosamples = [
    {
      "id": "BIOSAMPLE_1",
      "individualId": "INDIVIDUAL_1",
      "biosampleStatus": {
        "id": "SNOMED:365641003",
        "label": "Minor blood groups - finding"
      },
      "collectionDate": "2020-1-1",
      "sampleOriginType": {
        "id": "SNOMED:31675002",
        "label": "Capillary blood"
      },
      "sampleOriginDetail": {
        "id": "SNOMED:31675002",
        "label": "Capillary blood"
      },
      "info": {},
      "notes": ""
    },
    {
      "id": "BIOSAMPLE_2",
      "individualId": "INDIVIDUAL_2",
      "biosampleStatus": {
        "id": "SNOMED:365641003",
        "label": "Minor blood groups - finding"
      },
      "collectionDate": "2020-1-2",
      "sampleOriginType": {
        "id": "SNOMED:31675002",
        "label": "Capillary blood"
      },
      "sampleOriginDetail": {
        "id": "SNOMED:31675002",
        "label": "Capillary blood"
      },
      "info": {},
      "notes": ""
    },
    {
      "id": "BIOSAMPLE_3",
      "individualId": "INDIVIDUAL_3",
      "biosampleStatus": {
        "id": "SNOMED:365641003",
        "label": "Minor blood groups - finding"
      },
      "collectionDate": "2020-1-3",
      "sampleOriginType": {
        "id": "SNOMED:31675002",
        "label": "Capillary blood"
      },
      "sampleOriginDetail": {
        "id": "SNOMED:31675002",
        "label": "Capillary blood"
      },
      "info": {},
      "notes": ""
    },
    {
      "id": "BIOSAMPLE_4",
      "individualId": "INDIVIDUAL_4",
      "biosampleStatus": {
        "id": "SNOMED:365641003",
        "label": "Minor blood groups - finding"
      },
      "collectionDate": "2020-1-4",
      "sampleOriginType": {
        "id": "SNOMED:31675002",
        "label": "Capillary blood"
      },
      "sampleOriginDetail": {
        "id": "SNOMED:31675002",
        "label": "Capillary blood"
      },
      "info": {},
      "notes": ""
    }
]
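
Because every biosample points at an individual through `individualId`, a quick cross-check can catch typos before submission. The sketch below is illustrative only: `check_biosample_links` is a hypothetical helper, and the one-biosample-per-individual rule reflects this example's layout rather than a stated Beacon requirement.

```python
from collections import Counter


def check_biosample_links(individuals, biosamples):
    """Return problems with the biosample -> individual references."""
    known_ids = {person["id"] for person in individuals}
    per_individual = Counter(sample["individualId"] for sample in biosamples)
    problems = []
    # Every biosample must reference an individual that exists
    for sample in biosamples:
        if sample["individualId"] not in known_ids:
            problems.append(
                f"{sample['id']} references unknown individual {sample['individualId']}"
            )
    # In this example each individual is expected to have exactly one biosample
    for individual_id in sorted(known_ids):
        count = per_individual[individual_id]
        if count != 1:
            problems.append(f"{individual_id} has {count} biosamples, expected 1")
    return problems
```

For the data in this notebook, `check_biosample_links(individuals, biosamples)` should return an empty list.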

## Run Information

A run encapsulates information about the sequencing run associated with a biosample.


In [5]:
runs = [
    {
      "id": "RUN_1",
      "biosampleId": "BIOSAMPLE_1",
      "individualId": "INDIVIDUAL_1",
      "libraryLayout": "PAIRED",
      "librarySelection": "RANDOM",  
      "libraryStrategy": "WGS",
      "platform": "Illumina", 
      "platformModel": {
        "id": "OBI:0002630",
        "label": "Illumina NovaSeq 6000"
      },
      "librarySource": {
        "id": "GENEPIO:0001969",
        "label": "other library source"
      },
      "runDate": "2019-1-1"
    },
    {
      "id": "RUN_2",
      "biosampleId": "BIOSAMPLE_2",
      "individualId": "INDIVIDUAL_2",
      "libraryLayout": "PAIRED",
      "librarySelection": "RANDOM",  
      "libraryStrategy": "WGS",
      "platform": "Illumina", 
      "platformModel": {
        "id": "OBI:0002630",
        "label": "Illumina NovaSeq 6000"
      },
      "librarySource": {
        "id": "GENEPIO:0001969",
        "label": "other library source"
      },
      "runDate": "2019-1-2"
    },
    {
      "id": "RUN_3",
      "biosampleId": "BIOSAMPLE_3",
      "individualId": "INDIVIDUAL_3",
      "libraryLayout": "PAIRED",
      "librarySelection": "RANDOM",  
      "libraryStrategy": "WGS",
      "platform": "Illumina", 
      "platformModel": {
        "id": "OBI:0002630",
        "label": "Illumina NovaSeq 6000"
      },
      "librarySource": {
        "id": "GENEPIO:0001969",
        "label": "other library source"
      },
      "runDate": "2019-1-3"
    },
    {
      "id": "RUN_4",
      "biosampleId": "BIOSAMPLE_4",
      "individualId": "INDIVIDUAL_4",
      "libraryLayout": "PAIRED",
      "librarySelection": "RANDOM",  
      "libraryStrategy": "WGS",
      "platform": "Illumina", 
      "platformModel": {
        "id": "OBI:0002630",
        "label": "Illumina NovaSeq 6000"
      },
      "librarySource": {
        "id": "GENEPIO:0001969",
        "label": "other library source"
      },
      "runDate": "2019-1-4"
    }
]
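
The four run entries differ only in their index and date, so they could also be generated from a template instead of being written out by hand. This is an optional alternative, not the approach the notebook itself uses; `RUN_TEMPLATE`, `make_run` and `generated_runs` are hypothetical names, and the field values are copied from the entries above.

```python
import copy

# Fields shared by every run in this example
RUN_TEMPLATE = {
    "libraryLayout": "PAIRED",
    "librarySelection": "RANDOM",
    "libraryStrategy": "WGS",
    "platform": "Illumina",
    "platformModel": {"id": "OBI:0002630", "label": "Illumina NovaSeq 6000"},
    "librarySource": {"id": "GENEPIO:0001969", "label": "other library source"},
}


def make_run(index):
    """Build the run entry for the individual and biosample numbered `index`."""
    run = copy.deepcopy(RUN_TEMPLATE)  # deep copy so nested dicts are not shared
    run.update(
        {
            "id": f"RUN_{index}",
            "biosampleId": f"BIOSAMPLE_{index}",
            "individualId": f"INDIVIDUAL_{index}",
            "runDate": f"2019-1-{index}",
        }
    )
    return run


generated_runs = [make_run(i) for i in range(1, 5)]
```

Apart from key order, `generated_runs` holds the same four entries as the literal `runs` list.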

## Analysis Information

This entity holds information about the bioinformatics analysis performed on a run.


In [6]:
analyses = [
    {
      "id": "ANALYSIS_1",
      "individualId": "INDIVIDUAL_1",
      "biosampleId": "BIOSAMPLE_1",
      "runId": "RUN_1",
      "analysisDate": "2019-1-1",
      "pipelineName": "GATK Best Practices",
      "pipelineRef": "Broad Institute",
      "variantCaller": "GATK HaplotypeCaller",
      "vcfSampleId": "SMPL_1"
    },
    {
      "id": "ANALYSIS_2",
      "individualId": "INDIVIDUAL_2",
      "biosampleId": "BIOSAMPLE_2",
      "runId": "RUN_2",
      "analysisDate": "2019-1-2",
      "pipelineName": "GATK Best Practices",
      "pipelineRef": "Broad Institute",
      "variantCaller": "GATK HaplotypeCaller",
      "vcfSampleId": "SMPL_2"
    },
    {
      "id": "ANALYSIS_3",
      "individualId": "INDIVIDUAL_3",
      "biosampleId": "BIOSAMPLE_3",
      "runId": "RUN_3",
      "analysisDate": "2019-1-3",
      "pipelineName": "GATK Best Practices",
      "pipelineRef": "Broad Institute",
      "variantCaller": "GATK HaplotypeCaller",
      "vcfSampleId": "SMPL_3"
    },
    {
      "id": "ANALYSIS_4",
      "individualId": "INDIVIDUAL_4",
      "biosampleId": "BIOSAMPLE_4",
      "runId": "RUN_4",
      "analysisDate": "2019-1-4",
      "pipelineName": "GATK Best Practices",
      "pipelineRef": "Broad Institute",
      "variantCaller": "GATK HaplotypeCaller",
      "vcfSampleId": "SMPL_4"
    }
]
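
Each analysis repeats the `biosampleId` and `individualId` of the run it belongs to, so the two records can drift apart when edited by hand. The following sketch checks that they agree. `check_analysis_links` is a hypothetical helper, and the rule that a `vcfSampleId` is used only once is an assumption based on this example.

```python
def check_analysis_links(runs, analyses):
    """Return problems with the analysis -> run references."""
    runs_by_id = {run["id"]: run for run in runs}
    problems = []
    seen_samples = set()
    for analysis in analyses:
        run = runs_by_id.get(analysis["runId"])
        if run is None:
            problems.append(
                f"{analysis['id']} references unknown run {analysis['runId']}"
            )
        else:
            # The analysis and its run should name the same biosample and individual
            for key in ("biosampleId", "individualId"):
                if analysis[key] != run[key]:
                    problems.append(
                        f"{analysis['id']} and {run['id']} disagree on {key}"
                    )
        # Assumption: each VCF sample column belongs to a single analysis
        sample = analysis["vcfSampleId"]
        if sample in seen_samples:
            problems.append(f"vcfSampleId {sample} is used more than once")
        seen_samples.add(sample)
    return problems
```

For the data in this notebook, `check_analysis_links(runs, analyses)` should return an empty list.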

In [7]:
vcf = """
##fileformat=VCFv4.2
##contig=<ID=chr2,length=242193529>
##contig=<ID=chr3,length=198295559>
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tSMPL_1\tSMPL_2\tSMPL_3\tSMPL_4
chr2\t20436\t.\tC\tT\t47\tPASS\tNS=4;DP=18\tGT\t1|1\t1|0\t0|1\t1|0
chr2\t42619\t.\tG\tT\t23\tPASS\tNS=4;DP=30\tGT\t0|1\t1|0\t0|0\t0|1
chr3\t23398\t.\tG\tA\t60\tPASS\tNS=4;DP=70\tGT\t1|0\t1|1\t1|1\t0|1
chr3\t30264\t.\tA\tT\t30\tPASS\tNS=4;DP=79\tGT\t0|0\t0|1\t0|1\t1|1
chr3\t47708\t.\tT\tG\t25\tPASS\tNS=4;DP=31\tGT\t1|0\t1|1\t0|1\t0|0
""".strip()

with open("data.vcf", "w") as f:
    f.write(vcf)

!bgzip -f data.vcf
!tabix -f data.vcf.gz
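
The sample columns in the VCF header (`SMPL_1` to `SMPL_4`) line up with the `vcfSampleId` values of the analyses. Assuming that is how analyses are tied to VCF samples, the sketch below lists any `vcfSampleId` that has no matching column. It works on the uncompressed VCF text, and both function names are hypothetical.

```python
def vcf_sample_names(vcf_text):
    """Return the sample column names from the #CHROM header line of a VCF."""
    for line in vcf_text.splitlines():
        if line.startswith("#CHROM"):
            columns = line.split("\t")
            # The first nine columns (CHROM through FORMAT) are fixed;
            # everything after them is a sample name
            return columns[9:]
    raise ValueError("no #CHROM header line found")


def missing_vcf_samples(vcf_text, analyses):
    """Return the vcfSampleId values that have no column in the VCF."""
    available = set(vcf_sample_names(vcf_text))
    return [
        analysis["vcfSampleId"]
        for analysis in analyses
        if analysis["vcfSampleId"] not in available
    ]
```

In this notebook, `missing_vcf_samples(vcf, analyses)` should return an empty list, since the `vcf` string is still in memory after the file has been compressed.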

In [8]:
submission = {
    "dataset": dataset,
    "assemblyId": assembly_id,
    "individuals": individuals,
    "biosamples": biosamples,
    "runs": runs,
    "analyses": analyses
}

print(json.dumps(submission, indent=2))
# Write the submission to disk; the context manager closes the file handle
with open("ngs-submission.json", "w") as f:
    json.dump(submission, f, indent=2)



## Validating the Submission

Now you can validate the generated submission against the schemas using the following cells.

First, we install the Python library `jsonschema`, which will help us validate the created payloads.

We then download the latest schemas to use with the `jsonschema` library.


In [9]:
%pip install jsonschema

Note: you may need to restart the kernel to use updated packages.


In [10]:
%mkdir -p schemas
!wget https://raw.githubusercontent.com/GSI-Xapiens-CSIRO/sBeacon-BGSi/refs/heads/main/shared_resources/schemas/analysis-schema.json -O schemas/analysis-schema.json    
!wget https://raw.githubusercontent.com/GSI-Xapiens-CSIRO/sBeacon-BGSi/refs/heads/main/shared_resources/schemas/biosample-schema.json -O schemas/biosample-schema.json
!wget https://raw.githubusercontent.com/GSI-Xapiens-CSIRO/sBeacon-BGSi/refs/heads/main/shared_resources/schemas/dataset-schema.json -O schemas/dataset-schema.json
!wget https://raw.githubusercontent.com/GSI-Xapiens-CSIRO/sBeacon-BGSi/refs/heads/main/shared_resources/schemas/individual-schema.json -O schemas/individual-schema.json
!wget https://raw.githubusercontent.com/GSI-Xapiens-CSIRO/sBeacon-BGSi/refs/heads/main/shared_resources/schemas/run-schema.json -O schemas/run-schema.json
!wget https://raw.githubusercontent.com/GSI-Xapiens-CSIRO/sBeacon-BGSi/refs/heads/main/shared_resources/schemas/submit-dataset-schema-new.json -O schemas/submit-dataset-schema-new.json


In [11]:
from jsonschema import Draft202012Validator, RefResolver
import os



In [12]:

def validate_request(parameters):
    # Work on a shallow copy so the caller's submission is not modified.
    # Inject dummy values for missing fields - these will be added by the Dataportal during submission
    parameters = {
        **parameters,
        "datasetId": "Dataset Id",
        "projectName": "Project Name",
        "vcfLocations": ["vcfLocation1", "vcfLocation2"],
    }
    # Load the top-level schema; base_uri lets relative references resolve against the schemas directory
    schema_path = "./schemas/submit-dataset-schema-new.json"
    schema_dir = os.path.dirname(os.path.abspath(schema_path))
    with open(schema_path) as f:
        new_schema = json.load(f)
    resolver = RefResolver(base_uri="file://" + schema_dir + "/", referrer=new_schema)
    validator = Draft202012Validator(new_schema, resolver=resolver)
    errors = []

    # Sort by path; stringify the parts so mixed keys and list indices compare safely
    for error in sorted(validator.iter_errors(parameters), key=lambda e: [str(p) for p in e.path]):
        # Report each error as "<message> /<path>/<to>/<field>"
        error_message = f"{error.message} "
        for part in error.path:
            error_message += f"/{part}"
        errors.append(error_message)
    return errors

validate_request(submission)

[]