# Example Usage

To use `bento-mdf` in a project, start by installing the latest version with `pip install bento-mdf` and importing it into your project.

In [52]:
import bento_mdf
from pathlib import Path # for file paths
from importlib.metadata import version # check package version

version("bento_mdf")

'0.10.0'

## Loading the Model from MDF(s)

The `bento-mdf` package provides functionality for loading, validating, and manipulating MDF file content in Python.

The `MDF` class is the main interface to the package. It is initialized with one or more MDF files, file paths, or URLs pointing to them.

In [53]:
from bento_mdf.mdf import MDF

### Loading from File(s)

First, we specify the paths to the MDF files we want to load. Then, we pass these to the `MDF` class to initialize the model. This loads the content of the files into their corresponding `bento-meta` Python object representations, which we can access via the `Model` object found at `MDF.model`.

(Note: if a top-level model `Handle` is not present in the MDFs, it needs to be provided to the MDF class's `handle` argument.)

In [54]:
mdf_dir = Path.cwd().parent / "tests" / "samples"
ctdc_model = mdf_dir / "ctdc_model_file.yaml"
ctdc_props = mdf_dir / "ctdc_model_properties_file.yaml"

mdf_from_file = MDF(ctdc_model, ctdc_props, handle="CTDC")
mdf_from_file.model

<bento_meta.model.Model at 0x218cbcdb7d0>

### Loading from URL(s)

Similarly, we can instantiate an MDF from URL(s) pointing to the model file(s):

In [55]:
model_url = "https://cbiit.github.io/icdc-model-tool/model-desc/icdc-model.yml"
props_url = "https://cbiit.github.io/icdc-model-tool/model-desc/icdc-model-props.yml"

mdf = MDF(model_url, props_url, handle="ICDC")
mdf.model

<bento_meta.model.Model at 0x218cb4f6240>

## Exploring the Model

Once we've loaded the model, we can start looking at the entities that make it up, including Nodes, Relationships, Properties, and Terms. These are conveniently stored in the `bento-meta Model` object. 

Note: This example will use the model created in the previous section from a URL.

### Nodes

Model nodes are stored in `Model.nodes`, a dictionary where the keys are node handles and the values are `bento-meta Node` objects.

In [56]:
nodes = mdf.model.nodes

len(nodes)

33

In [57]:
list(nodes.keys())[:3]

['program', 'study', 'study_site']

In [58]:
list(nodes.values())[:3]

[<bento_meta.objects.Node at 0x218cba03b60>,
 <bento_meta.objects.Node at 0x218cbb86240>,
 <bento_meta.objects.Node at 0x218cbb86e40>]

In [59]:
nodes["study"]

<bento_meta.objects.Node at 0x218cbb86240>

The `get_attr_dict()` method is a convenient way to get a dictionary of the attributes that are set on a `bento-meta Entity`. The attribute values are returned as strings, which can be useful for exploring the entity or for providing parameters to Neo4j Cypher queries.

Note: this only includes simple attributes, not other `bento-meta` Entities or collections of Entities. All attributes can be accessed on the entity by name (e.g. `nodes["diagnosis"].handle`).

In [60]:
nodes["diagnosis"].get_attr_dict()

{'handle': 'diagnosis',
 'model': 'ICDC',
 'desc': 'The Diagnosis node contains numerous properties which fully characterize the type of cancer with which any given patient/subject/donor was diagnosed, inclusive of stage. This node also contains properties pertaining to comorbidities, and the availability of pathology reports, treatment data and follow-up data.'}
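
Because the returned values are plain strings, the dictionary can be handed to a parameterized query as-is. The sketch below shows one way that might look. It uses a hard-coded stand-in dictionary (an abbreviated copy of the output above) instead of a live model, and both the `cypher_merge` helper and the `node` label are hypothetical names chosen for illustration. Actually running the query would also require a Neo4j driver, which is not shown.

```python
def cypher_merge(label: str, attrs: dict) -> str:
    """Build a parameterized MERGE statement from an attribute dictionary."""
    # Reference each attribute as a $parameter rather than inlining its value.
    assignments = ", ".join(f"{key}: ${key}" for key in attrs)
    return f"MERGE (n:{label} {{{assignments}}}) RETURN n"


# Stand-in for nodes["diagnosis"].get_attr_dict(), with 'desc' omitted for brevity.
attrs = {"handle": "diagnosis", "model": "ICDC"}

# 'node' is an assumed label, not one defined by bento-mdf.
query = cypher_merge("node", attrs)
print(query)
```

The query text and the `attrs` dictionary would then be passed together to whatever driver call executes the statement, so the values never need to be escaped by hand.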

### Relationships

Similarly, Model relationships are stored in `Model.edges`. This is a dictionary where the keys are (edge.handle, src.handle, dst.handle) tuples and the values are `Edge` objects.

In [61]:
edges = mdf.model.edges

len(edges)

49

In [62]:
list(edges.keys())[:3]

[('member_of', 'case', 'cohort'),
 ('member_of', 'cohort', 'study_arm'),
 ('member_of', 'study_arm', 'study')]

In [63]:
list(edges.values())[:3]

[<bento_meta.objects.Edge at 0x218cbaf1dc0>,
 <bento_meta.objects.Edge at 0x218cbaf1bb0>,
 <bento_meta.objects.Edge at 0x218cbd58590>]

In [64]:
edges[("of_case", "diagnosis", "case")].get_attr_dict()

{'handle': 'of_case', 'model': 'ICDC', 'multiplicity': 'many_to_one'}

In [65]:
edge = edges[("of_case", "diagnosis", "case")]
print(edge.handle, edge.src.handle, edge.dst.handle, sep=", ")


# TIP: here's a convenient method to get the 3-tuple of an edge
print(edge.triplet)

of_case, diagnosis, case
('of_case', 'diagnosis', 'case')


An `Edge`'s `src` and `dst` attributes are `Node` objects.

In [66]:
print(edge.src)

print(edge.src.handle)

<bento_meta.objects.Node object at 0x00000218CBB317F0>
diagnosis


The `Model` object also has some useful methods to work with relationships/edges including:
  * `edges_by_src(node)` - get all edges that have a given node as their src attribute
  * `edges_by_dst(node)` - get all edges that have a given node as their dst attribute
  * `edges_by_type(edge_handle)` - get all edges that have a given edge type (i.e., handle)

In [67]:
[e.triplet for e in mdf.model.edges_by_dst(mdf.model.nodes["case"])]

[('of_case', 'enrollment', 'case'),
 ('of_case', 'demographic', 'case'),
 ('of_case', 'diagnosis', 'case'),
 ('of_case', 'cycle', 'case'),
 ('of_case', 'follow_up', 'case'),
 ('of_case', 'sample', 'case'),
 ('of_case', 'file', 'case'),
 ('of_case', 'visit', 'case'),
 ('of_case', 'adverse_event', 'case'),
 ('of_case', 'registration', 'case')]

In [68]:
[e.triplet for e in mdf.model.edges_by_type("of_study")]

[('of_study', 'study_site', 'study'),
 ('of_study', 'principal_investigator', 'study'),
 ('of_study', 'file', 'study'),
 ('of_study', 'image_collection', 'study'),
 ('of_study', 'publication', 'study')]
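
Conceptually, these lookups correspond to filtering on one position of the (edge.handle, src.handle, dst.handle) key. The sketch below illustrates that idea on a plain dictionary. The keys are copied from the outputs above, the values are placeholder strings where a real model holds `Edge` objects, and `keys_by_dst` and `keys_by_type` are hypothetical helpers, not the `bento-meta` implementation.

```python
# Stand-in for Model.edges: real values are Edge objects, not strings.
edge_keys = [
    ("member_of", "case", "cohort"),
    ("of_case", "diagnosis", "case"),
    ("of_case", "sample", "case"),
    ("of_study", "study_site", "study"),
]
edges = {key: f"<Edge {key[0]}>" for key in edge_keys}


def keys_by_dst(edges: dict, dst_handle: str) -> list:
    """Return keys whose third element (the dst handle) matches."""
    return [key for key in edges if key[2] == dst_handle]


def keys_by_type(edges: dict, edge_handle: str) -> list:
    """Return keys whose first element (the edge handle) matches."""
    return [key for key in edges if key[0] == edge_handle]


print(keys_by_dst(edges, "case"))
print(keys_by_type(edges, "of_study"))
```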

### Properties

Model properties are stored in `Model.props`. This is a dictionary where the keys are ({edge|node}.handle, prop.handle) tuples. The values are `Property` objects.

In [69]:
props = mdf.model.props

len(props)

240

In [70]:
list(props.keys())[:3]

[('program', 'program_name'),
 ('program', 'program_acronym'),
 ('program', 'program_short_description')]
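
Since each key pairs a parent handle with a property handle, the keys alone are enough to group property handles by parent. The following is a minimal sketch on a stand-in list of keys taken from this example, not on a live `Model.props`; the variable names are illustrative.

```python
from collections import defaultdict

# Stand-in for list(mdf.model.props.keys()); a small sample of the real keys.
prop_keys = [
    ("program", "program_name"),
    ("program", "program_acronym"),
    ("program", "program_short_description"),
    ("diagnosis", "primary_disease_site"),
]

# Collect property handles under the handle of the node or edge that owns them.
handles_by_parent = defaultdict(list)
for parent_handle, prop_handle in prop_keys:
    handles_by_parent[parent_handle].append(prop_handle)

print(dict(handles_by_parent))
```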

In [71]:
list(props.values())[:3]

[<bento_meta.objects.Property at 0x218cbd777d0>,
 <bento_meta.objects.Property at 0x218cbd76c60>,
 <bento_meta.objects.Property at 0x218cbd77aa0>]

In [72]:
primary_disease_site = props[("diagnosis", "primary_disease_site")]
primary_disease_site.get_attr_dict()

{'handle': 'primary_disease_site',
 'model': 'ICDC',
 'value_domain': 'value_set',
 'is_required': 'Yes',
 'is_key': 'False',
 'is_nullable': 'False',
 'is_strict': 'True',
 'desc': 'The anatomical location at which the primary disease originated, recorded in relatively general terms at the subject level; the anatomical locations from which tumor samples subject to downstream analysis were acquired is recorded in more detailed terms at the sample level.'}

#### Properties with Value Sets

Properties with the value_domain "value_set" have the `value_set` attribute (`bento-meta ValueSet`), which has a `terms` attribute (`bento-meta Term` dictionary like `{term.value: Term}`).

In [73]:
primary_disease_site.value_set

<bento_meta.objects.ValueSet at 0x218cbf3a7b0>

In [74]:
primary_disease_site.value_set.terms

{'Bladder': <bento_meta.objects.Term object at 0x00000218CBF3A960>, 'Bladder, Prostate': <bento_meta.objects.Term object at 0x00000218CBF3AAB0>, 'Bladder, Urethra': <bento_meta.objects.Term object at 0x00000218CBF3AC00>, 'Bladder, Urethra, Prostate': <bento_meta.objects.Term object at 0x00000218CBF3B0B0>, 'Bladder, Urethra, Vagina': <bento_meta.objects.Term object at 0x00000218CBF3AD20>, 'Bone': <bento_meta.objects.Term object at 0x00000218CBF3AC30>, 'Bone (Appendicular)': <bento_meta.objects.Term object at 0x00000218CBF3AE40>, 'Bone (Axial)': <bento_meta.objects.Term object at 0x00000218CBF3ADE0>, 'Bone Marrow': <bento_meta.objects.Term object at 0x00000218CBF3ABA0>, 'Brain': <bento_meta.objects.Term object at 0x00000218CBF3B020>, 'Carpus': <bento_meta.objects.Term object at 0x00000218CBF3AE70>, 'Chest Wall': <bento_meta.objects.Term object at 0x00000218CBF3B2F0>, 'Distal Urethra': <bento_meta.objects.Term object at 0x00000218CBF3B110>, 'Kidney': <bento_meta.objects.Term object at 0x0

`Property` objects with value sets have some useful methods to get to those terms and their values including:
  * `.terms` returns the dictionary of `Term` objects from the property's value set, keyed by term value
  * `.values` returns a list of the term values from the property's value set

In [75]:
print(primary_disease_site.terms)

# TIP: this is the same object found at the ValueSet's `terms` attribute
print(primary_disease_site.terms is primary_disease_site.value_set.terms)

{'Bladder': <bento_meta.objects.Term object at 0x00000218CBF3A960>, 'Bladder, Prostate': <bento_meta.objects.Term object at 0x00000218CBF3AAB0>, 'Bladder, Urethra': <bento_meta.objects.Term object at 0x00000218CBF3AC00>, 'Bladder, Urethra, Prostate': <bento_meta.objects.Term object at 0x00000218CBF3B0B0>, 'Bladder, Urethra, Vagina': <bento_meta.objects.Term object at 0x00000218CBF3AD20>, 'Bone': <bento_meta.objects.Term object at 0x00000218CBF3AC30>, 'Bone (Appendicular)': <bento_meta.objects.Term object at 0x00000218CBF3AE40>, 'Bone (Axial)': <bento_meta.objects.Term object at 0x00000218CBF3ADE0>, 'Bone Marrow': <bento_meta.objects.Term object at 0x00000218CBF3ABA0>, 'Brain': <bento_meta.objects.Term object at 0x00000218CBF3B020>, 'Carpus': <bento_meta.objects.Term object at 0x00000218CBF3AE70>, 'Chest Wall': <bento_meta.objects.Term object at 0x00000218CBF3B2F0>, 'Distal Urethra': <bento_meta.objects.Term object at 0x00000218CBF3B110>, 'Kidney': <bento_meta.objects.Term object at 0x0

In [76]:
print(primary_disease_site.values[20])

print(len(primary_disease_site.values))

print(primary_disease_site.values == list(primary_disease_site.terms.keys()))

Shoulder
29
True
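
One possible use of the value list is checking incoming data against a property's permissible values. The sketch below assumes a hard-coded subset of the values shown above in place of `primary_disease_site.values`, and the `find_invalid` helper and the sample records are hypothetical.

```python
def find_invalid(records: list, permissible: list) -> list:
    """Return the records that are not among the permissible values."""
    allowed = set(permissible)  # set membership keeps each lookup cheap
    return [value for value in records if value not in allowed]


# Stand-in for primary_disease_site.values (only a few of the 29 values).
permissible = ["Bladder", "Bone", "Brain", "Kidney", "Shoulder"]

# Made-up input data; the second entry is deliberately not a permissible value.
records = ["Shoulder", "not a disease site", "Brain"]

print(find_invalid(records, permissible))
```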


#### Properties via Parent

Model properties can also be accessed via the `props` attribute of their parent node or edge, which is a dictionary of properties keyed by property handle.

In [77]:
diagnosis_props = nodes["diagnosis"].props
len(diagnosis_props)

14

In [78]:
list(diagnosis_props.keys())[:3]

['diagnosis_id', 'disease_term', 'primary_disease_site']

In [79]:
list(diagnosis_props.values())[:3]

[<bento_meta.objects.Property at 0x218cbf38cb0>,
 <bento_meta.objects.Property at 0x218cbf39100>,
 <bento_meta.objects.Property at 0x218cbf3a660>]

Properties accessed via their parents are the same `Property` objects found in `Model.props`.

In [80]:
diagnosis_props["primary_disease_site"] is props[("diagnosis", "primary_disease_site")]

True

### Terms

Model terms are stored in `Model.terms` as a dictionary of `Term` objects. Terms are used to relate string descriptors in the model, such as permissible values in a property's value set, or semantic concepts from other frameworks that can describe an entity in the model via annotation (e.g. a caDSR Common Data Element/CDE annotating a model property).

The keys in `Model.terms` are (term.handle, term.origin_name) tuples and the values are `bento-meta` `Term` objects.

In [81]:
terms = mdf.model.terms

len(terms)

538

In [82]:
list(terms.keys())[:3]

[('Unrestricted', 'ICDC'), ('Pending', 'ICDC'), ('Under Embargo', 'ICDC')]

In [83]:
list(terms.values())[:3]

[<bento_meta.objects.Term at 0x218cbd82600>,
 <bento_meta.objects.Term at 0x218cbd81df0>,
 <bento_meta.objects.Term at 0x218cbd81dc0>]

In [84]:
shoulder = terms[("Shoulder", "ICDC")]
shoulder.get_attr_dict()

{'handle': 'Shoulder', 'value': 'Shoulder', 'origin_name': 'ICDC'}

#### Terms via ValueSet

Terms that are part of a value set can also be accessed via the owner of that value set. This is the same object found in `Model.terms`.

In [85]:
primary_disease_site.terms["Shoulder"] is shoulder

True

#### Term Annotations

Terms are also used to annotate model entities with semantic representations from some other framework. For example, a Term from caDSR may be used to annotate a model property with a semantically equivalent CDE. In the `MDF`, these annotations are provided under the `Term` key for a given entity.

In [86]:
mdf_dir = Path.cwd().parent / "tests" / "samples"
model_with_terms = mdf_dir / "test-model-with-terms-a.yml"
# Tip: model 'Handle' key is in the yaml file so we don't need to provide one to MDF()
terms_mdf = MDF(model_with_terms)
terms_mdf.model

100%|██████████| 2/2 [00:00<00:00, 2000.62it/s]


<bento_meta.model.Model at 0x218cbd32390>

Terms can annotate nodes, relationships, and properties. The annotating term(s) are linked to the annotated entity via a `bento-meta Concept`, which stores them in a dictionary of the same format found at `Model.terms` (i.e. `{(term.value, term.origin_name): Term}`).

In [87]:
case_concept = terms_mdf.model.nodes["case"].concept
case_concept

<bento_meta.objects.Concept at 0x218cbce4c50>

In [88]:
case_concept.terms

{('case_term', 'CTDC'): <bento_meta.objects.Term object at 0x00000218CBCE7350>, ('subject', 'caDSR'): <bento_meta.objects.Term object at 0x00000218CBD19B50>}

In [89]:
# TIP: to find an annotating CDE, we can look for entries where the origin is 'caDSR'
for term_key, term in case_concept.terms.items():
    if term_key[1] == "caDSR":
        print(term.get_attr_dict())

{'handle': 'subject', 'value': 'subject', 'origin_name': 'caDSR'}
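
The same origin filter can be wrapped in a small reusable function. This sketch runs on a stand-in dictionary whose keys mirror `case_concept.terms` above and whose values are placeholder strings where the real dictionary holds `Term` objects; `terms_from_origin` is a hypothetical helper name.

```python
def terms_from_origin(terms: dict, origin: str) -> dict:
    """Keep only entries whose key's second element (the origin) matches."""
    return {key: term for key, term in terms.items() if key[1] == origin}


# Stand-in for case_concept.terms; real values are Term objects.
concept_terms = {
    ("case_term", "CTDC"): "<Term case_term>",
    ("subject", "caDSR"): "<Term subject>",
}

print(terms_from_origin(concept_terms, "caDSR"))
```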


In [90]:
terms_mdf.model.edges[("of_case", "sample", "case")].concept.terms

{('of_case_term', 'CTDC'): <bento_meta.objects.Term object at 0x00000218CBCD80E0>}

In [91]:
terms_mdf.model.props[("case", "case_id")].concept.terms

{('case_id', 'CTDC'): <bento_meta.objects.Term object at 0x00000218CBCF5EE0>}

In [92]:
# TIP: terms found in Model.terms are the same objects as those in an entity's concept
case_id_anno = terms_mdf.model.props[("case", "case_id")].concept.terms[("case_id", "CTDC")]
terms_mdf.model.terms[("case_id", "CTDC")] is case_id_anno

True

### Tags

A tags entry can be added to any object in the model. Tags are used to associate meta-information with an entity for downstream custom processing. Any `bento-meta Entity` except a `Tag` can be tagged with these key-value pairs. They are accessible via the entity's `tags` attribute, a dictionary in which each key is the tag's 'key' and each value is a `bento-meta Tag` object.

In [93]:
icdc_breed_tags = mdf.model.props[("demographic", "breed")].tags
icdc_breed_tags

{'Labeled': <bento_meta.objects.Tag object at 0x00000218CBD881A0>}

In [94]:
icdc_breed_tags["Labeled"].get_attr_dict()

{'key': 'Labeled', 'value': 'Breed'}

## Validating the Model

As the `MDF` class loads the model, it automatically validates it against the MDF schema and will raise an exception if the model is invalid. This will use the [default schema](https://github.com/CBIIT/bento-mdf/blob/main/schema/mdf-schema.yaml) unless one is provided via the `MDF` class's `mdf_schema` argument.

`bento-mdf` also provides the `MDFValidator` class, which can be used to validate a model against the MDF schema directly.

In [95]:
from bento_mdf.validator import MDFValidator

validator = MDFValidator(
    None,
    *[ctdc_model, ctdc_props],
    raise_error=True,
)
validator

<bento_mdf.validator.MDFValidator at 0x218cbd76c00>

In [96]:
validator.load_and_validate_schema(); # load and check that JSON schema is valid

In [97]:
validator.load_and_validate_yaml().as_dict(); # load and check YAML is valid

In [98]:
validator.validate_instance_with_schema(); # check YAML against the schema

If the schema or the YAML instances (from MDF files) are invalid, the validation will fail.

In [99]:
from jsonschema import SchemaError, ValidationError
from yaml.parser import ParserError
from IPython.display import clear_output

### Schema is invalid

In [100]:
bad_schema = mdf_dir / "mdf-bad-schema.yaml"

try:
    MDFValidator(bad_schema, raise_error=True).load_and_validate_schema()
except SchemaError as e:
    clear_output()
    print(e)

'crobject' is not valid under any of the given schemas

Failed validating 'anyOf' in metaschema['properties']['properties']['additionalProperties']['properties']['type']:
    {'anyOf': [{'$ref': '#/definitions/simpleTypes'},
               {'type': 'array',
                'items': {'$ref': '#/definitions/simpleTypes'},
                'minItems': 1,
                'uniqueItems': True}]}

On schema['properties']['UniversalNodeProperties']['type']:
    'crobject'


### YAML structure is invalid

In [101]:
bad_yaml = mdf_dir / "ctdc_model_bad.yaml"

try:
    MDFValidator(None, bad_yaml, raise_error=True).load_and_validate_yaml()
except ParserError as e:
    clear_output()
    print(e)

while parsing a block mapping
  in "c:\Users\nelso\Documents\GitHub\bento-mdf\python\tests\samples\ctdc_model_bad.yaml", line 1, column 1
expected <block end>, but found '<block mapping start>'
  in "c:\Users\nelso\Documents\GitHub\bento-mdf\python\tests\samples\ctdc_model_bad.yaml", line 3, column 3


### MDF YAMLs are invalid against the MDF schema

In [102]:
test_schema = mdf_dir / "mdf-schema.yaml"
ctdc_bad = mdf_dir / "ctdc_model_file_invalid.yaml"

try:
    v = MDFValidator(
        test_schema,
        *[ctdc_bad, ctdc_props],
        raise_error=True
    )
    v.load_and_validate_schema()
    v.load_and_validate_yaml()
    v.validate_instance_with_schema()
except ValidationError as e:
    clear_output()
    print(e)

'case.show_node' does not match '^[A-Za-z_][A-Za-z0-9_]*$'

Failed validating 'pattern' in schema['properties']['PropDefinitions']['propertyNames']:
    {'$id': '#snake_case_id',
     'type': 'string',
     'pattern': '^[A-Za-z_][A-Za-z0-9_]*$'}

On instance['PropDefinitions']:
    'case.show_node'
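
To see why this property name is rejected, the pattern from the error message can be tried on its own with Python's `re` module. This is only an illustration of the pattern, not of how `MDFValidator` applies the schema, and `show_node` is a hypothetical corrected name.

```python
import re

# Pattern copied from the validation error above.
SNAKE_CASE_ID = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")

# 'case.show_node' contains a '.', which the character classes do not allow.
for name in ["case.show_node", "show_node"]:
    print(name, bool(SNAKE_CASE_ID.match(name)))
```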


## Model Diff

`bento-mdf` also provides the `diff_models` function, which can be used to compare two models and report on the differences between them. This is useful for comparing models that have been updated or modified over time.

`diff_models()` has two required arguments, both of which are `bento_meta.Model` objects:
  * `mdl_a`: The first model to compare.
  * `mdl_b`: The second model to compare.

The function returns a `dict` with keys for nodes, edges, props, and terms, each with a dictionary with keys:
  * `"added"`: found in `mdl_b` but not in `mdl_a`
  * `"removed"`: found in `mdl_a` but not in `mdl_b`
  * `"changed"`: found in both models but with altered attributes

In [103]:
from bento_mdf.diff import diff_models

old_model = mdf_dir / "test-model-d.yml"
new_model = mdf_dir / "test-model-e.yml"

old_mdf = MDF(old_model, handle="TEST")
new_mdf = MDF(new_model, handle="TEST")

diff_models(mdl_a=old_mdf.model, mdl_b=new_mdf.model)

{'nodes': {'changed': {'diagnosis': {'props': {'removed': {'fatal': <bento_meta.objects.Property at 0x218cbb76150>},
     'added': None}}},
  'removed': None,
  'added': {'outcome': <bento_meta.objects.Node at 0x218cbcc3e60>}},
 'edges': {'removed': None,
  'added': {('end_result',
    'diagnosis',
    'outcome'): <bento_meta.objects.Edge at 0x218cbfd6240>}},
 'props': {'removed': {('diagnosis',
    'fatal'): <bento_meta.objects.Property at 0x218cbb76150>},
  'added': {('outcome',
    'fatal'): <bento_meta.objects.Property at 0x218cbfd7500>}}}
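
The `added` and `removed` buckets in this output can be understood as set differences between the key sets of the old and new models. The sketch below reproduces the `props` portion of the result on stand-in dictionaries with placeholder string values. `key_diff` is a hypothetical helper and is not how `diff_models` is implemented.

```python
def key_diff(old: dict, new: dict) -> dict:
    """Bucket entries by whether their key exists only in old or only in new."""
    added = {key: new[key] for key in new.keys() - old.keys()}
    removed = {key: old[key] for key in old.keys() - new.keys()}
    # Mirror the output above, where an empty bucket is reported as None.
    return {"removed": removed or None, "added": added or None}


# Stand-ins for old_mdf.model.props and new_mdf.model.props.
old_props = {("diagnosis", "fatal"): "<Property fatal>"}
new_props = {("outcome", "fatal"): "<Property fatal>"}

print(key_diff(old_props, new_props))
```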

`diff_models` has two optional arguments:
  * `objects_as_dicts`: if True, the output will convert `bento-meta Entity` objects like `Node` or `Edge` to dictionaries with `get_attr_dict()`
  * `include_summary`: if True, the output will include a formatted string summary of the differences between the two models. This can be useful for GitHub changelogs when a model is updated, for example.

In [104]:
diff = diff_models(
    old_mdf.model,
    new_mdf.model,
    objects_as_dicts=True, include_summary=True)

diff["nodes"]["changed"]

{'diagnosis': {'props': {'removed': {'fatal': {'handle': 'fatal',
     'model': 'TEST',
     'value_domain': 'value_set',
     'is_required': 'False',
     'is_key': 'False',
     'is_nullable': 'False',
     'is_strict': 'True'}},
   'added': None}}}

In [127]:
print(diff["summary"])

1 node(s) added; 1 edge(s) added; 1 prop(s) removed; 1 prop(s) added; 1 attribute(s) changed for 1 node(s)
- Added node: 'outcome'
- Added edge: 'end_result' with src: 'diagnosis' and dst: 'outcome'
- Removed prop: 'fatal' with parent: 'diagnosis'
- Added prop: 'fatal' with parent: 'outcome'
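
The diff dictionary can also be walked programmatically, for example to count entries per bucket. The sketch below uses a hand-written stand-in shaped like the output shown earlier, with empty dictionaries in place of the entity details, and `count_changes` is a hypothetical helper. Note the `or {}` guard, which is needed because empty buckets appear as `None`.

```python
def count_changes(diff: dict) -> dict:
    """Count entries in each non-empty added/removed/changed bucket."""
    counts = {}
    for entity_type, buckets in diff.items():
        if not isinstance(buckets, dict):
            continue  # skip non-dict entries such as a 'summary' string
        for bucket in ("added", "removed", "changed"):
            entries = buckets.get(bucket) or {}  # empty buckets are None
            if entries:
                counts[(entity_type, bucket)] = len(entries)
    return counts


# Stand-in shaped like the diff_models output in this example.
diff = {
    "nodes": {"changed": {"diagnosis": {}}, "removed": None, "added": {"outcome": {}}},
    "edges": {"removed": None, "added": {("end_result", "diagnosis", "outcome"): {}}},
    "props": {"removed": {("diagnosis", "fatal"): {}}, "added": {("outcome", "fatal"): {}}},
    "summary": "placeholder summary text",
}

print(count_changes(diff))
```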
