In [1]:
pip install fastavro

Note: you may need to restart the kernel to use updated packages.


# PFB Worked Example - ICDC

This notebook continues [an introduction](https://github.com/CBIIT/bento-meta/tree/master/python/bento_meta/pfb#portable-format-for-bioinformatics) to the "Portable Format for Bioinformatics".

Consider the following example. In the [Integrated Canine Data Commons (ICDC) model](https://cbiit.github.io/icdc-model-tool/), a subject of a study is called a _case_, and a case has the following properties (variables, or slots for data) associated with it:

    case:
        Props:
          - case_id
          - patient_id
          - patient_first_name

(This and the following snippets of YAML are taken from the [model description files](https://github.com/CBIIT/bento-mdf) found at https://cbiit.github.io/icdc-model-tool/model-desc/.) The properties are defined in the model as follows (omitting human readable descriptions):

    case_id:
        Type: string
        Req: true
    patient_id:
        Type: string
        Req: true
    patient_first_name:
        Type: string

(Patient first name is OK, because we're talking about dogs.)

A case may have a number of other sets of related data. These are separate nodes in the graph, associated with a specific case via relationships (links, edges). There are 14 such nodes, but only three of these are linked by outgoing (i.e., from case to node) relationships. The _cohort_ node is a simple example:

    cohort:
        Props:
          - cohort_description
          - cohort_dose

Both properties have type ``string``. The relationship ``member_of`` indicates the association:

    Relationships:
      member_of:
        Mul: many_to_one
        Ends:
          - Src: case
            Dst: cohort

Every node, regardless of type, also entails an internal _id_ field.
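To make the direction of a relationship concrete, the model fragments above can be held in a plain dictionary and queried for the edges that leave a given node. This is only a sketch: the top-level `Nodes` key and the helper `outgoing` are assumptions made for illustration, and only the two nodes and one relationship quoted above are included.

```python
# Hypothetical in-memory form of the YAML snippets above. The top-level
# "Nodes" key is an assumption; "Relationships" follows the snippet shown.
model = {
    "Nodes": {
        "case": {"Props": ["case_id", "patient_id", "patient_first_name"]},
        "cohort": {"Props": ["cohort_description", "cohort_dose"]},
    },
    "Relationships": {
        "member_of": {
            "Mul": "many_to_one",
            "Ends": [{"Src": "case", "Dst": "cohort"}],
        },
    },
}


def outgoing(model, node):
    """List (relationship, destination, multiplicity) for edges leaving `node`."""
    return [
        (rel_name, end["Dst"], rel["Mul"])
        for rel_name, rel in model["Relationships"].items()
        for end in rel["Ends"]
        if end["Src"] == node
    ]


print(outgoing(model, "case"))    # [('member_of', 'cohort', 'many_to_one')]
print(outgoing(model, "cohort"))  # []
```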

## Customizing the PFB schema 

Avro schemas that will encode the nodes _case_ and _cohort_ are straightforward enough. These are the User Data Types.

    icdc_case_schema = {
        "name": "case",
        "type": "record",
        "fields": [
            { "name": "id",
              "type": "string" },
            { "name": "case_id",
              "type": "string" },
            { "name": "patient_id",
              "type": "string" },
            { "name": "patient_first_name",
              "type": "string" }
        ]
    }

    icdc_cohort_schema = {
        "name": "cohort",
        "type": "record",
        "fields": [
            { "name": "id",
              "type": "string" },
            { "name": "cohort_description",
              "type": "string" },
            { "name": "cohort_dose",
              "type": "string" }
        ]
    }

These schemas need to be included in the ``Entity`` schema at the time the PFB message is created. 
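Both records follow the same pattern: an `id` field plus one string field per model property. They could therefore be generated from the property lists instead of typed by hand. The following is a minimal sketch under the assumption that every property is encoded as a plain Avro `string`; `make_record_schema` is a hypothetical helper, not part of fastavro or PFB.

```python
import json


def make_record_schema(name, props):
    """Build an Avro record schema (as a dict) for one node type.

    Assumption: every property is a plain Avro "string", and every node
    carries an internal "id" field, as in the two schemas above.
    """
    fields = [{"name": "id", "type": "string"}]
    fields.extend({"name": prop, "type": "string"} for prop in props)
    return {"name": name, "type": "record", "fields": fields}


cohort_schema = make_record_schema(
    "cohort", ["cohort_description", "cohort_dose"]
)
print(json.dumps(cohort_schema, indent=2))
```

A real generator would also need to map non-string model types and optional properties to suitable Avro types; that is left out here.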

In the next cell, we use [fastavro](https://fastavro.readthedocs.io/en/latest/schema.html) to read the PFB schemas that we've modularized into .avsc files, and appropriately place the custom schemas above into that structure.


In [2]:
import fastavro
from fastavro.schema import load_schema
from fastavro.validation import validate
from tempfile import NamedTemporaryFile

import json
import os

# the following gyrations enable
# * the PFB schema to be modularized with named types,
# * the addition of the custom data types to the pfb.Entity schema, and
# * the recursive loading of the named type schemas by fastavro.

pfb_schema = None
tempf = None
with open("pfb.Entity.avsc","r") as Entity:
    # load Entity schema as simple json
    pfb_schema_json = json.load( Entity )
    # find the "object" field (named object_field to avoid shadowing the builtin)
    [object_field] = [ x for x in pfb_schema_json["fields"] if x["name"] == "object" ]
    # add the custom schemas (as names) to the object field's type array
    object_field["type"].extend([ "icdc.case", "icdc.cohort" ])
    # dump json to a tempfile to take advantage of fastavro avsc 
    # name resolution in fastavro.schema.load_schema()
    tempf = NamedTemporaryFile(mode="w+",dir=".")
    json.dump(pfb_schema_json,tempf)
    tempf.seek(0)
    # load the customized schema
    pfb_schema = load_schema(tempf.name)
    pass
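The key step in the cell above is the edit to the type union of the `object` field. The sketch below isolates that step on an in-memory stand-in, with no files involved. The stand-in's field layout and namespace are assumptions for illustration and are not copied from `pfb.Entity.avsc`.

```python
import json

# Hypothetical, simplified stand-in for the Entity schema: a record whose
# "object" field is a union of named types. The real file may differ.
entity_schema = {
    "type": "record",
    "name": "Entity",
    "namespace": "pfb",
    "fields": [
        {"name": "id", "type": ["null", "string"], "default": None},
        {"name": "name", "type": "string"},
        {"name": "object", "type": ["pfb.Metadata"]},
    ],
}

# Locate the single field named "object". The one-element unpacking raises
# ValueError if there are zero or several matches.
[object_field] = [f for f in entity_schema["fields"] if f["name"] == "object"]

# Append the custom record types by name, so the union accepts them too.
object_field["type"].extend(["icdc.case", "icdc.cohort"])

print(json.dumps(object_field))
```

Because `object_field` is a reference into `entity_schema`, extending it modifies the schema dictionary in place, which is why the notebook can dump the same dictionary straight back to JSON.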


### Metadata schemas and links

To encode these data nodes in PFB, we also must construct corresponding [``Node``](./pfb.Node.avsc) and [``Property``](./pfb.Property.avsc) metadata schemas. Example metadata schemas for ``case`` and ``cohort`` nodes are defined in the following cell. 




In [3]:
icdc_cohort_meta = { 
    "name": "icdc.cohort",
    "ontology_reference": "",
    "values": {},
    "links":[],
    "properties": [
        { 
            "name": "cohort_description",
            "ontology_reference": "NCIT",
            "values": {
                "concept_code": "C166209"
                }
        },
        { 
            "name": "cohort_dose",
            "ontology_reference": "NCIT",
            "values": {
                "concept_code": "C166210"
                }
        }
    ]
}

icdc_case_meta = {
    "name": "icdc.case",
    "ontology_reference": "",
    "values": {},
    "properties": [
        {
            "name": "case_id",
            "ontology_reference": "NCIT",
            "values": {
                "concept_code": "C164324"
            }
        },
        {
            "name": "patient_id",
            "ontology_reference": "NCIT",
            "values": {
                "concept_code": "C164337"
            }
        }
    ],
    "links": [
        {
            "name": "member_of",
            "dst": "cohort",
            "multiplicity": "MANY_TO_ONE"
        }
    ]
}

Note these are _instances_ of the [``Node`` _schema_](./pfb.Node.avsc). The instances are Avro records (JSON objects), and have acceptable keys ``name``, ``ontology_reference``, ``values``, ``links``, and ``properties``.  

The ``cohort`` as a node type does not have an external terminology reference as yet, so ``ontology_reference`` and ``values`` are present, but set to empty data entities. It does not have any outgoing links in the model, so ``links`` is also set to an empty array. ICDC properties are associated with NCI Thesaurus codes, so these are provided in the ``properties`` schemas.

The ``case`` node schema uses a ``links`` specification to indicate that a case can be a member of a cohort. We take advantage of this below.
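Before running full schema validation, a quick structural check can catch misspelled keys in a Node instance. The sketch below compares top-level keys against the five keys listed above; `unexpected_keys` is a hypothetical helper, and the check is no substitute for validating against the schema with fastavro.

```python
# The acceptable keys of a Node instance, as listed in the text above.
ALLOWED_NODE_KEYS = {"name", "ontology_reference", "values", "links", "properties"}


def unexpected_keys(node_meta):
    """Return the top-level keys of a Node instance that are not allowed."""
    return sorted(set(node_meta) - ALLOWED_NODE_KEYS)


# Abbreviated copy of the cohort metadata (properties omitted for brevity).
cohort_meta = {
    "name": "icdc.cohort",
    "ontology_reference": "",
    "values": {},
    "links": [],
    "properties": [],
}

print(unexpected_keys(cohort_meta))                   # []
print(unexpected_keys({**cohort_meta, "props": []}))  # ['props']
```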

Are these two Node instances valid within PFB? We can check by asking fastavro to validate a Metadata instance that includes both of them:

In [4]:
if validate( {
    "name":"Metadata",
    "misc":{},
    "nodes": [
        icdc_case_meta,
        icdc_cohort_meta
    ]}, pfb_schema):
    print("Valid!")
else:
    print("INVALID")    

Valid!


## Data for PFB Message

Now, we set up actual data to be encoded in a PFB message. This will be a ``case`` node instance, with values for its properties, and a similar ``cohort`` node instance, as well as the information that links the case to the cohort.

Each node instance needs to be wrapped in a PFB [``Entity`` schema](./pfb.Entity.avsc). The next code cell steps through these constructs, validating each step.

In [5]:
# data for PFB message:

# cohort_data
cohort_data = {
    "id": "n201",
    "cohort_description": "arm1",
    "cohort_dose": "10mg/kg"
}

if validate( ("icdc.cohort", cohort_data), pfb_schema ):
    print("Valid!")
else:
    print("INVALID")

# case data
case_data = {
    "id": "n101",
    "case_id": "UBC01-007",
    "patient_id": "007",
    "patient_first_name": "Fluffy"
}    

if validate( ("icdc.case", case_data), pfb_schema ):
    print("Valid!")
else:
    print("INVALID")

link = {
    "dst_name": "icdc.cohort",
    "dst_id": "n201"
}

# case_data wrapped in pfb.Entity
case_data_entity = {
    "name": "icdc.case",
    "id": "n101",
    "object": case_data,
    "relations": [ link ]
}

if validate( ("pfb.Entity", case_data_entity), pfb_schema ):
    print("Valid!")
else:
    print("INVALID")

# cohort_data wrapped in pfb.Entity
cohort_data_entity = {
    "name":"icdc.cohort",
    "id": "n201",
    "object": cohort_data,
    "relations":[]
}

if validate( ("pfb.Entity", cohort_data_entity), pfb_schema ):
    print("Valid!")
else:
    print("INVALID")

Valid!
Valid!
Valid!
Valid!
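Both Entity wrappers in the cell above have the same shape, so the wrapping can be factored into a small function. This is a sketch only: `wrap_entity` is a hypothetical helper, and it assumes that the Entity `id` always mirrors the `id` of the node instance, as it does in both examples here.

```python
def wrap_entity(node_name, data, relations=None):
    """Wrap a node instance in an Entity-shaped dict.

    Assumption: the Entity "id" is the node instance's own "id" field.
    """
    return {
        "name": node_name,
        "id": data["id"],
        "object": data,
        "relations": list(relations or []),
    }


cohort = {"id": "n201", "cohort_description": "arm1", "cohort_dose": "10mg/kg"}
case = {
    "id": "n101",
    "case_id": "UBC01-007",
    "patient_id": "007",
    "patient_first_name": "Fluffy",
}

cohort_entity = wrap_entity("icdc.cohort", cohort)
case_entity = wrap_entity(
    "icdc.case",
    case,
    relations=[{"dst_name": "icdc.cohort", "dst_id": cohort["id"]}],
)
print(case_entity["relations"])
```

Taking the link's `dst_id` from `cohort["id"]` rather than retyping the literal keeps the two records from drifting apart.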


## Create PFB Payload and PFB Message

We have built all the components for the PFB message. Now we can bring them together in an array to provide to ``fastavro.writer``, along with the PFB schema, to render the message to a binary file.

The payload consists of a Metadata instance (which describes the case and cohort semantic information, see above), the Entity instance containing the cohort data, and the Entity instance containing the case data.

In [6]:
payload = [
      { 
        "name": "Metadata",
        "object": {
            "name": "pfb.Metadata",
            "misc": {},
            "nodes": [
                icdc_cohort_meta,
                icdc_case_meta
            ]
        }
      },
      cohort_data_entity,
      case_data_entity
    ]
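Schema validation looks at one record at a time, so it cannot confirm that the `dst_id` in a link names an entity that is actually present in the payload. A minimal sketch of such a cross-check follows. `dangling_links` is a hypothetical helper, and matching on the pair (`dst_name`, `dst_id`) is an assumption about how links should resolve.

```python
def dangling_links(entities):
    """Return (source id, dst_name, dst_id) for links with no matching entity."""
    # Entries without an "id" (such as the Metadata record) map to None
    # and simply never match a link target.
    known = {(e["name"], e.get("id")) for e in entities}
    return [
        (e.get("id"), rel["dst_name"], rel["dst_id"])
        for e in entities
        for rel in e.get("relations", [])
        if (rel["dst_name"], rel["dst_id"]) not in known
    ]


# Abbreviated stand-ins for the two data entities in the payload.
entities = [
    {"name": "icdc.cohort", "id": "n201", "object": {}, "relations": []},
    {
        "name": "icdc.case",
        "id": "n101",
        "object": {},
        "relations": [{"dst_name": "icdc.cohort", "dst_id": "n201"}],
    },
]

print(dangling_links(entities))  # []
```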

Now, a call to ``fastavro.writer`` creates the message.

In [7]:
with open("worked-example.avro","wb") as out:
    fastavro.writer(out, pfb_schema, payload)

We can read back this message and check whether the records we sent are correctly reconstituted.

In [8]:
with open("worked-example.avro","rb") as inf:
    rdr = fastavro.reader(inf)
    for rec in rdr:
        print(rec)
        print()

{'id': None, 'name': 'Metadata', 'object': {'nodes': [{'name': 'icdc.cohort', 'ontology_reference': '', 'values': {}, 'links': [], 'properties': [{'name': 'cohort_description', 'ontology_reference': 'NCIT', 'values': {'concept_code': 'C166209'}}, {'name': 'cohort_dose', 'ontology_reference': 'NCIT', 'values': {'concept_code': 'C166210'}}]}, {'name': 'icdc.case', 'ontology_reference': '', 'values': {}, 'links': [{'multiplicity': 'MANY_TO_ONE', 'dst': 'cohort', 'name': 'member_of'}], 'properties': [{'name': 'case_id', 'ontology_reference': 'NCIT', 'values': {'concept_code': 'C164324'}}, {'name': 'patient_id', 'ontology_reference': 'NCIT', 'values': {'concept_code': 'C164337'}}]}], 'misc': {}}, 'relations': []}

{'id': 'n201', 'name': 'icdc.cohort', 'object': {'id': 'n201', 'cohort_description': 'arm1', 'cohort_dose': '10mg/kg'}, 'relations': []}

{'id': 'n101', 'name': 'icdc.case', 'object': {'id': 'n101', 'case_id': 'UBC01-007', 'patient_id': '007', 'patient_first_name': 'Fluffy'}, 'rel
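In the output above, the Metadata record comes back with `'id': None` and `'relations': []`, neither of which was in the payload; presumably they were filled in from defaults in the schema. A strict equality test between sent and received records would therefore fail. The sketch below instead checks that everything sent is present in what was received, ignoring extra keys. `contains` is a hypothetical helper, and `received` is a stand-in dictionary, not actual reader output.

```python
def contains(received, sent):
    """True if every key/value in `sent` also appears in `received`.

    Extra keys in `received` (for example, fields filled in from schema
    defaults on read) are ignored. Lists must match element by element.
    """
    if isinstance(sent, dict):
        return isinstance(received, dict) and all(
            key in received and contains(received[key], value)
            for key, value in sent.items()
        )
    if isinstance(sent, list):
        return (
            isinstance(received, list)
            and len(sent) == len(received)
            and all(contains(r, s) for r, s in zip(received, sent))
        )
    return received == sent


sent = {"name": "Metadata", "object": {"misc": {}, "nodes": []}}
# Stand-in for the record as read back, with default-filled extras.
received = {
    "id": None,
    "name": "Metadata",
    "object": {"nodes": [], "misc": {}},
    "relations": [],
}

print(contains(received, sent))  # True
print(contains(sent, received))  # False
```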

In [9]:
# closing a NamedTemporaryFile also deletes it (delete=True is the default)
tempf.close()