Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

added functionality to merge clinical trial metadata objects #114

Merged
merged 4 commits into from Aug 7, 2019
Merged
Show file tree
Hide file tree
Changes from 2 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Jump to
Jump to file
Failed to load files.
Diff view
Diff view
35 changes: 35 additions & 0 deletions cidc_schemas/prism.py
Expand Up @@ -878,3 +878,38 @@ def merge_artifact(

# return new object
return new_ct


def merge_clinical_trial_metadata(patch: dict, target: dict) -> dict:
"""
merges two clinical trial metadata objects together

Args:
patch: the metadata object to add
target: the existing metadata object

Returns:
arg1: the merged metadata object
"""

# first we assert the trial details are the same
key_details = ["lead_organization_study_id"]
for d in key_details:
if d in patch and d in target:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If d isn't in patch or isn't in target, then the function continues on happily and tries to merge -- I guess this missing ID would be caught by the validator below, but it's a little hard to follow that in the code. I would either add a comment to that effect, condense this condition into a one-liner with line 899 a la if patch.get(d) != target.get(d), or else raise an error related to missing an expected key.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok I've added explicit pre-merge validation to each object. This will report any errors in validation before merging to reduce confusion.

if patch[d] != target[d]:
raise RuntimeError("unable to merge trials with different \
Copy link
Member

@curlup curlup Aug 6, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if there will be two trials in different lead organizations but with the same lead_organization_study_id?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this ID is globally unique

lead_organization_study_id")

# merge the copy with the original.
validator = load_and_validate_schema(
"clinical_trial.json", return_validator=True)
schema = validator.schema
merger = Merger(schema)

merged = merger.merge(target, patch)

# validate this
validator.validate(merged)

# now return it
return merged
60 changes: 59 additions & 1 deletion tests/test_prism.py
Expand Up @@ -12,7 +12,8 @@
from pprint import pprint
from jsonmerge import Merger

from cidc_schemas.prism import prismify, merge_artifact
from cidc_schemas.prism import prismify, merge_artifact, \
merge_clinical_trial_metadata
from cidc_schemas.json_validation import load_and_validate_schema
from cidc_schemas.template import Template
from cidc_schemas.template_writer import RowType
Expand Down Expand Up @@ -342,3 +343,60 @@ def test_snippet_wes():
ds = ct | grep(url)
assert 'matched_values' in ds
assert len(ds['matched_values']) > 0


def test_merge_ct_meta():
"""
tests merging of two clinical trial metadata
objects. Currently this test only supports
WES but other tests should be added in the
future
"""

# create two clinical trials
ct1 = copy.deepcopy(CLINICAL_TRIAL)
ct2 = copy.deepcopy(CLINICAL_TRIAL)

# first we assert the merge is only happening on the same trial
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we should also test the non-schema-conformant patch or target case?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added this as well. pre-merge validation will stop the merge before it happens.

ct1["lead_organization_study_id"] = "not_the_same"
with pytest.raises(RuntimeError):
merge_clinical_trial_metadata(ct1, ct2)

# revert the data to same key trial id but
# include data in 1 that is missing in the other
# at the trial level and assert the merge
# does not clobber any
ct1["lead_organization_study_id"] = ct2["lead_organization_study_id"]
ct1['trial_name'] = 'name ABC'
ct2['nci_id'] = 'xyz1234'

ct_merge = merge_clinical_trial_metadata(ct1, ct2)
assert ct_merge['trial_name'] == 'name ABC'
assert ct_merge['nci_id'] == 'xyz1234'

# assert the patch over-writes the original value
# when value is present in both objects
ct1['trial_name'] = 'name ABC'
ct2['trial_name'] = 'CBA eman'

ct_merge = merge_clinical_trial_metadata(ct1, ct2)
assert ct_merge['trial_name'] == 'name ABC'

# now change the participant ids
# this should cause the merge to have two
# participants.
ct1['participants'][0]['cimac_participant_id'] = 'different_id'

ct_merge = merge_clinical_trial_metadata(ct1, ct2)
assert len(ct_merge['participants']) == 2

# now lets have the same participant but adding multiple samples.
ct1["lead_organization_study_id"] = ct2["lead_organization_study_id"]
ct1['participants'][0]['cimac_participant_id'] = \
ct2['participants'][0]['cimac_participant_id']
ct1['participants'][0]['samples'][0]['cimac_sample_id'] = 'new_id_1'
ct1['participants'][0]['samples'][1]['cimac_sample_id'] = 'new_id_2'

ct_merge = merge_clinical_trial_metadata(ct1, ct2)
assert len(ct_merge['participants']) == 1
assert len(ct_merge['participants'][0]['samples']) == 4