# GDC to CCDH Conversion

This notebook demonstrates one method for converting GDC data into CCDH (CRDC-H) instance data: reading node data as JSON and writing it out in the CCDH model, which is expressed in LinkML. The LinkML model can be used to [generate](https://github.com/linkml/linkml#python-dataclasses) [Python Data Classes](https://docs.python.org/3/library/dataclasses.html), which can then be exported as JSON-LD, a JSON-based format used to represent RDF data.

## Why Python Data Classes?

Python Data Classes provide several useful features that we will demonstrate below:

1. **Python Data Classes are generated automatically.** Rather than requiring additional effort to maintain a Python library for accessing the CCDH model, the [LinkML toolset](https://linkml.github.io/) can generate the Python Data Classes directly from the CCDH model, ensuring that users can always access the most recent version of the CCDH model programmatically. This also allows us to maintain Python Data Classes for accessing previous versions of the CCDH model, which we plan to use to implement [data migration between CCDH model versions](https://cancerdhc.github.io/ccdhmodel/latest/data-migration/).
2. **Python Data Classes provide validation on creation.** As we will demonstrate below, creating a Python Data Class requires that all required attributes are filled in, and that every field is filled in with the expected format or enumeration value.
3. **Easy to use in Python IDEs.** Since the generated Python Data Classes include the model documentation as Python docstrings, users of Python IDEs can see available options and documentation while writing their code.
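
To make the second point concrete, here is a minimal sketch of validation on creation using only the standard library. The `SketchSpecimen` class and `SpecimenType` enumeration are hypothetical stand-ins, not the classes LinkML generates; they only illustrate the general idea of checking required attributes and enumeration values in `__post_init__`.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class SpecimenType(Enum):
    """Hypothetical enumeration standing in for a generated LinkML enumeration."""

    PORTION = "portion"
    ALIQUOT = "aliquot"
    ANALYTE = "analyte"
    SLIDE = "slide"


@dataclass
class SketchSpecimen:
    """Hypothetical, simplified stand-in for a generated class such as Specimen."""

    id: str
    specimen_type: Optional[SpecimenType] = None

    def __post_init__(self):
        # Required attribute: refuse to build an instance without an identifier.
        if not self.id:
            raise ValueError("id must be supplied")
        # Coerce plain strings into the enumeration; unknown values raise ValueError.
        if self.specimen_type is not None and not isinstance(
            self.specimen_type, SpecimenType
        ):
            self.specimen_type = SpecimenType(self.specimen_type)


# A valid value is accepted and converted to the enumeration member.
ok = SketchSpecimen(id="specimen-1", specimen_type="aliquot")

# A misspelled value is rejected at creation time.
try:
    SketchSpecimen(id="specimen-2", specimen_type="aliqout")  # Note misspelling.
    error_message = None
except ValueError as error:
    error_message = str(error)

print(ok)
print(error_message)
```

The generated CCDH classes are considerably richer than this, but the pattern is similar in spirit: problems surface when the object is built rather than when it is later exported.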

## Setup

We start by installing the [LinkML](https://pypi.org/project/linkml/) and [pandas](https://pypi.org/project/pandas/) packages, along with rdflib and jsonschema. These are included in the pipenv file in this source repository: if you used `pipenv run jupyter notebook` to start this Notebook, you should be set up already. If not, you may need to uncomment the following lines to install these packages.

(Note: if you are running this on macOS 11 "Big Sur", you might need to set `SYSTEM_VERSION_COMPAT=1` in your Terminal environment before running `pipenv install`).

In [3]:
import sys

# These don't need to be installed if you started Jupyter Notebook from this repository's
# managed environment, as described above.

# Install LinkML.
# We use our own fork of LinkML, but all changes made to this repository will eventually be sent
# upstream to the main LinkML release.
#!{sys.executable} -m pip install linkml

# Install pandas.
#!{sys.executable} -m pip install pandas

# Install rdflib.
#!{sys.executable} -m pip install rdflib

# Install JSON Schema.
#!{sys.executable} -m pip install jsonschema

## Loading GDC data as an example

In this demonstration, we will use a dataset of 560 cases relating to head and neck cancers previously downloaded from the public GDC API as [documented elsewhere in this repository](https://github.com/cancerDHC/example-data/blob/main/head-and-mouth/Head%20and%20Mouth%20Cancer%20Datasets.ipynb).

In [4]:
import json
import pandas

with open("head-and-mouth/gdc-head-and-mouth.json") as file:
    gdc_head_and_mouth = json.load(file)

pandas.DataFrame(gdc_head_and_mouth)

Unnamed: 0,aliquot_ids,case_id,created_datetime,days_to_lost_to_followup,demographic,diagnoses,diagnosis_ids,disease_type,id,index_date,...,submitter_diagnosis_ids,submitter_id,submitter_sample_ids,updated_datetime,slide_ids,submitter_slide_ids,analyte_ids,portion_ids,submitter_analyte_ids,submitter_portion_ids
0,[3d4995b8-5b04-46f2-8d37-7e0b9f9b1b1a],a203ac35-914f-4f4d-816c-2af124257500,2018-09-13T13:41:51.057497-05:00,,"{'age_at_index': 22645, 'age_is_obfuscated': N...","[{'age_at_diagnosis': None, 'ajcc_clinical_m':...",[a4f6276a-b3cc-45f9-9fb8-30edd56ad4ea],Squamous Cell Neoplasms,a203ac35-914f-4f4d-816c-2af124257500,Initial Genomic Sequencing,...,[GENIE-DFCI-011620-10763_diagnosis],GENIE-DFCI-011620,[GENIE-DFCI-011620-10763],2019-11-18T13:54:59.294543-06:00,,,,,,
1,[57d18da1-d1b9-40b0-8ee6-0f94fd9f7575],26d5f693-dfbc-44ec-a073-49a59a3f09a0,2019-06-03T12:43:36.681258-05:00,,"{'age_at_index': 22645, 'age_is_obfuscated': N...","[{'age_at_diagnosis': None, 'ajcc_clinical_m':...",[2e4427b7-e557-49c7-85ef-39a02c4a441c],Squamous Cell Neoplasms,26d5f693-dfbc-44ec-a073-49a59a3f09a0,Initial Genomic Sequencing,...,[GENIE-DFCI-050738-234120_diagnosis],GENIE-DFCI-050738,[GENIE-DFCI-050738-234120],2019-11-18T13:54:59.294543-06:00,,,,,,
2,[95066691-03ea-422a-bb4c-ba09e9cbd7ab],d7c7ecbd-7495-4d29-8bb6-78797f5a47eb,2018-09-13T13:44:12.915115-05:00,,"{'age_at_index': 21184, 'age_is_obfuscated': N...","[{'age_at_diagnosis': None, 'ajcc_clinical_m':...",[85d63c6e-a6c1-47a5-92a9-216719073a4a],Squamous Cell Neoplasms,d7c7ecbd-7495-4d29-8bb6-78797f5a47eb,Initial Genomic Sequencing,...,[GENIE-DFCI-004072-413_diagnosis],GENIE-DFCI-004072,[GENIE-DFCI-004072-413],2019-11-18T13:54:59.294543-06:00,,,,,,
3,[b17a8d8a-395e-4d42-bfb0-7e829e1d4a8b],33fa625e-852e-49ef-8134-6ea46edb5183,2019-06-04T18:08:22.482657-05:00,,"{'age_at_index': 11688, 'age_is_obfuscated': N...","[{'age_at_diagnosis': None, 'ajcc_clinical_m':...",[d0678ad6-4a4a-4ec8-9e5f-352c1347efc2],Squamous Cell Neoplasms,33fa625e-852e-49ef-8134-6ea46edb5183,Initial Genomic Sequencing,...,[GENIE-GRCC-a8pxs0u6-sample-a_diagnosis],GENIE-GRCC-a8pxs0u6,[GENIE-GRCC-a8pxs0u6-sample-a],2019-11-14T11:30:41.503307-06:00,,,,,,
4,[a0f16f51-94eb-4a6b-beb6-9159fa6acc4b],4d49b9f5-09a0-49de-84f7-0d3441c214f6,2018-10-02T17:53:10.070290-05:00,,"{'age_at_index': 22280, 'age_is_obfuscated': N...","[{'age_at_diagnosis': None, 'ajcc_clinical_m':...",[2707217b-5a4b-404b-97de-921526581d45],Squamous Cell Neoplasms,4d49b9f5-09a0-49de-84f7-0d3441c214f6,Initial Genomic Sequencing,...,[GENIE-GRCC-2b4655c3-sample-a_diagnosis],GENIE-GRCC-2b4655c3,[GENIE-GRCC-2b4655c3-sample-a],2019-11-14T11:30:41.503307-06:00,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
555,"[c85fabe1-652b-4fa6-9b10-7438db6dfffd, 84615ba...",b835289f-6174-4bf9-aedf-9ebb9ef4eb62,,,"{'age_at_index': 43, 'created_datetime': None,...","[{'age_at_diagnosis': 15920, 'ajcc_clinical_m'...",[d80aa9c7-2a22-57da-87ef-750dd4c0b494],Squamous Cell Neoplasms,b835289f-6174-4bf9-aedf-9ebb9ef4eb62,,...,[TCGA-IQ-A61O_diagnosis],TCGA-IQ-A61O,"[TCGA-IQ-A61O-10A, TCGA-IQ-A61O-01A, TCGA-IQ-A...",2019-08-06T14:26:51.527876-05:00,"[17064bb9-496a-42be-b3f4-f0285c2cb45d, 5286fc4...","[TCGA-IQ-A61O-01Z-00-DX1, TCGA-IQ-A61O-01A-01-...","[d9ca2772-e5ca-4802-a502-b847c25798b4, 6ee6a28...","[a18f89dc-eb85-4dae-a4ff-e410f272076c, c5da291...","[TCGA-IQ-A61O-10A-01W, TCGA-IQ-A61O-01A-11D, T...","[TCGA-IQ-A61O-10A-01, TCGA-IQ-A61O-01A-21-A45L..."
556,"[15043d6b-83f2-46b2-ad20-cc92d21adacb, d9671d1...",600d5316-b795-4ab2-86fc-52835059386f,,,"{'age_at_index': 47, 'created_datetime': None,...","[{'age_at_diagnosis': 17454, 'ajcc_clinical_m'...",[47de90b5-a5ed-5ce0-ada8-7f2c8e621913],Squamous Cell Neoplasms,600d5316-b795-4ab2-86fc-52835059386f,,...,[TCGA-CN-A641_diagnosis],TCGA-CN-A641,"[TCGA-CN-A641-10A, TCGA-CN-A641-01Z, TCGA-CN-A...",2019-08-06T14:25:39.854271-05:00,"[39da8684-906b-4954-8ff0-d526064171d0, 1a06b35...","[TCGA-CN-A641-01Z-00-DX1, TCGA-CN-A641-01A-01-...","[41cf3dbd-0568-4d86-9ed4-c4ea80bb9915, 47172df...","[f5a2c14f-121c-4429-b4da-b3eeadd69a3a, 958ab42...","[TCGA-CN-A641-01A-11R, TCGA-CN-A641-01A-11W, T...","[TCGA-CN-A641-01A-21-A45M-20, TCGA-CN-A641-01A..."
557,"[b7770bb6-ccf0-4fb0-9da2-88317719c1f3, eae6dd5...",5ed786f3-1d15-4079-bb8f-2f34ef305644,,,"{'age_at_index': 62, 'created_datetime': None,...","[{'age_at_diagnosis': 22796, 'ajcc_clinical_m'...",[b07c0682-d50b-5680-97fc-8e709e21c064],Squamous Cell Neoplasms,5ed786f3-1d15-4079-bb8f-2f34ef305644,,...,[TCGA-D6-8568_diagnosis],TCGA-D6-8568,"[TCGA-D6-8568-01Z, TCGA-D6-8568-01A, TCGA-D6-8...",2019-08-06T14:26:39.780396-05:00,"[2b595936-8247-45c0-8cf1-4ee3c2137098, a5e8107...","[TCGA-D6-8568-01A-01-BS1, TCGA-D6-8568-01A-01-...","[78232fea-96ff-4367-8dc5-d792e548f42d, 6c95dea...","[2fd01e02-73e3-465c-aa7e-666d4f800b7e, 4b40c12...","[TCGA-D6-8568-01A-11W, TCGA-D6-8568-01A-11R, T...","[TCGA-D6-8568-01A-21-A45L-20, TCGA-D6-8568-01A..."
558,"[1573766d-cd60-4498-bdad-dcb8d6bc88c8, 6077ba6...",60fcc18c-0509-4c7b-aca9-178f47055077,,,"{'age_at_index': 46, 'created_datetime': None,...","[{'age_at_diagnosis': 17125, 'ajcc_clinical_m'...",[f44a9ab8-f121-59c5-8087-567864266629],Squamous Cell Neoplasms,60fcc18c-0509-4c7b-aca9-178f47055077,,...,[TCGA-CN-A63Y_diagnosis],TCGA-CN-A63Y,"[TCGA-CN-A63Y-01A, TCGA-CN-A63Y-01Z, TCGA-CN-A...",2019-08-06T14:25:39.854271-05:00,"[267002c6-8a76-42f4-af49-6c4d36b8ee17, 8dc964e...","[TCGA-CN-A63Y-01Z-00-DX1, TCGA-CN-A63Y-01A-01-...","[8a3c4fe0-060e-4f99-b1b3-d118fb2c15a2, 713c26c...","[6ba4e65f-9347-424c-ad80-c03e97554d65, dfbc654...","[TCGA-CN-A63Y-01A-11R, TCGA-CN-A63Y-10A-01D, T...","[TCGA-CN-A63Y-10A-01, TCGA-CN-A63Y-01A-21, TCG..."


## Loading the Python classes for the CCDH model

The Python Data Classes for the CCDH model are available at https://github.com/cancerDHC/ccdhmodel/. They cannot be loaded directly from that GitHub repository yet, but we [plan to implement this functionality soon](https://github.com/cancerDHC/ccdhmodel/issues/40). For now, we have copied the file into this repository so we can import it here.

Note that the Python Data Classes include documentation on entities and enumerations.

In [6]:
import crdch_model

# Documentation for an entity.
print(f"Documentation for Specimen: {crdch_model.Specimen.__doc__}")

# Documentation for an enumeration.
print(
    f"Documentation for Specimen.specimen_type: {crdch_model.EnumCRDCHSpecimenSpecimenType.__doc__}"
)

# List of permissible values for Specimen.specimen_type
print("Permissible values in enumeration Specimen.specimen_type:")
pvalues = [
    pv
    for key, pv in crdch_model.EnumCRDCHSpecimenSpecimenType.__dict__.items()
    if isinstance(pv, crdch_model.PermissibleValue)
]
for pv in pvalues:
    print(f' - Value "{pv.text}": {pv.description}')

Documentation for Specimen: 
    Any material taken as a sample from a biological entity (living or dead), or from a physical object or the
    environment. Specimens are usually collected as an example of their kind, often for use in some investigation.
    
Documentation for Specimen.specimen_type: 
    A high-level type of specimen, based on its derivation provenance (i.e. how far removed it is from the original
    sample extracted from a source).
    
Permissible values in enumeration Specimen.specimen_type:
 - Value "portion": A physical sub-part taken from an existing specimen.
 - Value "aliquot": A specimen that results from the division of some parent specimen into equal amounts for downstream analysis.
 - Value "analyte": A specimen generated through the extraction of a specified class of substance/chemical (e.g. DNA, RNA, protein) from a parent specimen, which is stored in solution as an analyte.
 - Value "slide": A specimen that is mounted on a slide or coverslip for micros

## Transforming GDC cases into CCDH Research Subject

The primary transformation we will demonstrate here is transforming a [GDC case](https://docs.gdc.cancer.gov/Data_Dictionary/viewer/#?view=table-definition-view&id=case) into a [CCDH Research Subject](https://cancerdhc.github.io/ccdhmodel/latest/ResearchSubject/). To do this, we need to translate three additional components as well:
* Each GDC case includes a diagnosis, which we need to transform into a [CCDH Diagnosis](https://cancerdhc.github.io/ccdhmodel/latest/Diagnosis/).
* Each GDC diagnosis includes a description of the cancer stage (see properties named `ajcc_*` in [the GDC documentation](https://docs.gdc.cancer.gov/Data_Dictionary/viewer/#?view=table-definition-view&id=diagnosis)). We will translate this into a [CCDH Cancer Stage Observation Set](https://cancerdhc.github.io/ccdhmodel/latest/CancerStageObservationSet/).
* Each GDC case contains a hierarchy of samples, portions, analytes, aliquots and slides. For the purposes of this demonstration, we will focus on transforming only the top-level specimens into [CCDH Specimens](https://cancerdhc.github.io/ccdhmodel/latest/Specimen/), but the same method can be used to transform other parts of the hierarchy. We plan to [include that transformation](https://github.com/cancerDHC/example-data/issues/6) in this tutorial eventually. Note that in our model, specimens are associated with diagnoses rather than directly with Research Subjects.
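
As a rough sketch of how these pieces nest inside a single GDC case, the example below walks a hypothetical, heavily shortened case record and lists what each transformation step would consume. The field names follow those used elsewhere in this notebook; all values and the `summarize_case` helper are invented for illustration.

```python
# Hypothetical, heavily shortened GDC case; all values are invented for illustration.
sketch_case = {
    "case_id": "case-0001",
    "primary_site": "Larynx",
    "diagnoses": [
        {
            "diagnosis_id": "diagnosis-0001",
            "primary_diagnosis": "Squamous cell carcinoma, NOS",
            "ajcc_clinical_t": "T1",
            "ajcc_clinical_n": None,
        }
    ],
    "samples": [
        {"sample_id": "sample-0001", "sample_type": "Primary Tumor"},
    ],
}


def summarize_case(case):
    """List the parts of a case that each transformation step would consume."""
    # Staging information lives in the populated `ajcc_*` fields of each diagnosis.
    stage_fields = sorted(
        key
        for diagnosis in case.get("diagnoses") or []
        for key, value in diagnosis.items()
        if key.startswith("ajcc_") and value is not None
    )
    return {
        "research_subject": case.get("case_id"),
        "diagnoses": [d.get("diagnosis_id") for d in case.get("diagnoses") or []],
        "stage_fields": stage_fields,
        "specimens": [s.get("sample_id") for s in case.get("samples") or []],
    }


summary = summarize_case(sketch_case)
print(summary)
```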

The CCDH Python Data Classes help in writing these transformation methods by applying validation on the data and ensuring that constraints (such as the required fields) are met. We begin by defining a transformation for creating a [CCDH BodySite](https://cancerdhc.github.io/ccdhmodel/latest/BodySite/), which we also use to demonstrate the validation features available on CCDH Python Data Classes.

In [10]:
def codeable_concept(system, code, label=None, text=None, tags=None):
    """Create a crdch_model.CodeableConcept for a given [single] system and code."""
    coding = crdch_model.Coding(system=system, code=code)
    if label is not None:
        coding.label = label
    # `tags` defaults to None rather than [] so that no mutable list is shared between calls.
    if tags:
        coding.tag = tags
    cc = crdch_model.CodeableConcept(coding)
    if text is not None:
        cc.text = text
    return cc

GDC_URL = "http://crdc.nci.nih.gov/gdc"

def create_body_site(site_name):
    """Create a CCDH BodySite based on the name of a site in the human body."""

    return crdch_model.BodySite(site=codeable_concept(GDC_URL, site_name))


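# Try to create a body site with a misspelled site name.
# Note: the cell output below shows no "Could not create BodySite" message, which suggests
# that this version of the model stores the name as a free-text Coding and does not check
# it against an enumeration. The ValueError handler is kept in case validation applies.
try:
    create_body_site("Laryn")  # Note misspelling.
except ValueError as v:
    print(f"Could not create BodySite: {v}")

# Using a valid name generates no errors.
create_body_site("Larynx")

# A GDC-style name such as "Larynx, NOS" generates no errors either; as the output below
# shows, it is stored as given.
create_body_site("Larynx, NOS")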

BodySite(site=CodeableConcept(coding=[Coding(code='Larynx, NOS', system='http://crdc.nci.nih.gov/gdc', label=None, system_version=None, value_set=None, value_set_version=None, tag=[])], text=None), qualifier=[])

We need a more sophisticated transformation method for transforming the GDC cancer stage information into [CCDH Cancer Stage Observation Set](https://cancerdhc.github.io/ccdhmodel/latest/CancerStageObservationSet/). Each observation set is made up of a number of [CCDH Cancer Stage Observations](https://cancerdhc.github.io/ccdhmodel/latest/CancerStageObservation/), each of which represents a different type of observation.

In [49]:
def create_stage_observation(observation_type, value):
    """Create a crdch_model.CancerStageObservation from an observation type and a value (both GDC codes)."""

    # The parameter is named `observation_type` to avoid shadowing the built-in `type`.
    return crdch_model.CancerStageObservation(
        observation_type=codeable_concept(GDC_URL, observation_type),
        value_codeable_concept=codeable_concept(GDC_URL, value),
    )


def create_stage_from_gdc(diagnosis):
    """Create a CancerStageObservationSet from the staging fields of a GDC diagnosis."""
    # Create an observation set.
    obs = crdch_model.CancerStageObservationSet()
    if diagnosis.get("ajcc_staging_system_edition") == "7th":
        obs.method_type = codeable_concept(GDC_URL, "AJCC staging system 7th edition")

    # Add observations for every type of observation in the GDC diagnosis.
    if diagnosis.get("tumor_stage") is not None:
        obs.observations.append(
            create_stage_observation("Overall", diagnosis.get("tumor_stage"))
        )

    if diagnosis.get("ajcc_clinical_stage") is not None:
        obs.observations.append(
            create_stage_observation(
                "Clinical Overall", diagnosis.get("ajcc_clinical_stage")
            )
        )

    if diagnosis.get("ajcc_clinical_t") is not None:
        obs.observations.append(
            create_stage_observation(
                "Clinical Tumor (T)", diagnosis.get("ajcc_clinical_t")
            )
        )

    if diagnosis.get("ajcc_clinical_n") is not None:
        obs.observations.append(
            create_stage_observation(
                "Clinical Node (N)", diagnosis.get("ajcc_clinical_n")
            )
        )

    if diagnosis.get("ajcc_clinical_m") is not None:
        obs.observations.append(
            create_stage_observation(
                "Clinical Metastasis (M)", diagnosis.get("ajcc_clinical_m")
            )
        )

    if diagnosis.get("ajcc_pathologic_stage") is not None:
        obs.observations.append(
            create_stage_observation(
                "Pathological Overall", diagnosis.get("ajcc_pathologic_stage")
            )
        )

    if diagnosis.get("ajcc_pathologic_t") is not None:
        obs.observations.append(
            create_stage_observation(
                "Pathological Tumor (T)", diagnosis.get("ajcc_pathologic_t")
            )
        )

    if diagnosis.get("ajcc_pathologic_n") is not None:
        obs.observations.append(
            create_stage_observation(
                "Pathological Node (N)", diagnosis.get("ajcc_pathologic_n")
            )
        )

    if diagnosis.get("ajcc_pathologic_m") is not None:
        obs.observations.append(
            create_stage_observation(
                "Pathological Metastasis (M)", diagnosis.get("ajcc_pathologic_m")
            )
        )

    return obs


# Test the transform with the first diagnosis of one of the loaded cases (index 558).
example_observation_set = create_stage_from_gdc(gdc_head_and_mouth[558]["diagnoses"][0])
example_observation_set

CancerStageObservationSet(id=None, category=None, focus=[], subject=None, method_type=CodeableConcept(coding=[Coding(code='AJCC staging system 7th edition', system='http://crdc.nci.nih.gov/gdc', label=None, system_version=None, value_set=None, value_set_version=None, tag=[])], text=None), performed_by=None, observations=[CancerStageObservation(observation_type=CodeableConcept(coding=[Coding(code='Clinical Overall', system='http://crdc.nci.nih.gov/gdc', label=None, system_version=None, value_set=None, value_set_version=None, tag=[])], text=None), value_codeable_concept=CodeableConcept(coding=[Coding(code='Stage I', system='http://crdc.nci.nih.gov/gdc', label=None, system_version=None, value_set=None, value_set_version=None, tag=[])], text=None), id=None, category=None, method_type=None, focus=None, subject=None, performed_by=None), CancerStageObservation(observation_type=CodeableConcept(coding=[Coding(code='Clinical Tumor (T)', system='http://crdc.nci.nih.gov/gdc', label=None, system_ve

Reading Python Data Classes in their default text output can be difficult! However, we can use LinkML's [YAML](https://en.wikipedia.org/wiki/YAML) dumper to display this Cancer Stage Observation Set as a YAML string. YAML objects are a good way to export LinkML data, and include detailed descriptions of all the enumerations referenced from this object. We currently include basic descriptions for the permissible values (see e.g. "N1 Stage Finding" below), but we will include more detailed descriptions in the future.

In [50]:
from linkml_runtime.dumpers import yaml_dumper

print(yaml_dumper.dumps(example_observation_set))

method_type:
  coding:
  - code: AJCC staging system 7th edition
    system: http://crdc.nci.nih.gov/gdc
observations:
- observation_type:
    coding:
    - code: Clinical Overall
      system: http://crdc.nci.nih.gov/gdc
  value_codeable_concept:
    coding:
    - code: Stage I
      system: http://crdc.nci.nih.gov/gdc
- observation_type:
    coding:
    - code: Clinical Tumor (T)
      system: http://crdc.nci.nih.gov/gdc
  value_codeable_concept:
    coding:
    - code: T1
      system: http://crdc.nci.nih.gov/gdc
- observation_type:
    coding:
    - code: Clinical Node (N)
      system: http://crdc.nci.nih.gov/gdc
  value_codeable_concept:
    coding:
    - code: N0
      system: http://crdc.nci.nih.gov/gdc
- observation_type:
    coding:
    - code: Clinical Metastasis (M)
      system: http://crdc.nci.nih.gov/gdc
  value_codeable_concept:
    coding:
    - code: M0
      system: http://crdc.nci.nih.gov/gdc
- observation_type:
    coding:
    - code: Pathological Overall
      sys
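
The mapping from GDC staging fields to observation types used in `create_stage_from_gdc` can also be written as a lookup table. The sketch below assumes plain dictionaries in place of the `crdch_model` classes so that it runs on its own; `STAGE_FIELDS`, `stage_observations` and the sample diagnosis are hypothetical names and values, and the field order mirrors the cell above.

```python
GDC_URL = "http://crdc.nci.nih.gov/gdc"

# (GDC diagnosis field, observation type label), in the same order as the cell above.
STAGE_FIELDS = [
    ("tumor_stage", "Overall"),
    ("ajcc_clinical_stage", "Clinical Overall"),
    ("ajcc_clinical_t", "Clinical Tumor (T)"),
    ("ajcc_clinical_n", "Clinical Node (N)"),
    ("ajcc_clinical_m", "Clinical Metastasis (M)"),
    ("ajcc_pathologic_stage", "Pathological Overall"),
    ("ajcc_pathologic_t", "Pathological Tumor (T)"),
    ("ajcc_pathologic_n", "Pathological Node (N)"),
    ("ajcc_pathologic_m", "Pathological Metastasis (M)"),
]


def stage_observations(diagnosis):
    """Return one plain-dict observation per populated staging field."""
    return [
        {
            "observation_type": {"system": GDC_URL, "code": label},
            "value_codeable_concept": {"system": GDC_URL, "code": diagnosis[field]},
        }
        for field, label in STAGE_FIELDS
        if diagnosis.get(field) is not None
    ]


# Hypothetical diagnosis fragment; unset fields are skipped.
sketch_diagnosis = {
    "ajcc_clinical_stage": "Stage I",
    "ajcc_clinical_t": "T1",
    "ajcc_clinical_m": None,
}

observations = stage_observations(sketch_diagnosis)
for observation in observations:
    print(
        observation["observation_type"]["code"],
        "->",
        observation["value_codeable_concept"]["code"],
    )
```

Keeping the field list in one place makes it easier to add or rename a staging field without touching the loop.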

GDC cases contain samples, which we transform into [CCDH Specimens](https://cancerdhc.github.io/ccdhmodel/latest/Specimen/) and later attach to the diagnosis.

In [56]:
def transform_sample_to_specimen(gdc_sample):
    """
    A method for transforming a GDC Sample into CCDH Specimen.
    """
    specimen = crdch_model.Specimen(id=gdc_sample.get("sample_id"))
    if gdc_sample.get("sample_type"):
        specimen.source_material_type = codeable_concept(
            GDC_URL, gdc_sample.get("sample_type")
        )

    if gdc_sample.get("tissue_type"):
        specimen.general_tissue_pathology = codeable_concept(
            GDC_URL, gdc_sample.get("tissue_type")
        )

    if gdc_sample.get("tumor_code"):
        specimen.specific_tissue_pathology = codeable_concept(
            GDC_URL, gdc_sample.get("tumor_code")
        )
        
    if gdc_sample.get("tumor_descriptor"):
        specimen.tumor_status_at_collection = codeable_concept(
            GDC_URL, gdc_sample.get("tumor_descriptor")
        )
        
    return specimen


# Let's try creating a test specimen.
test_specimen = transform_sample_to_specimen(gdc_head_and_mouth[2]["samples"][0])
test_specimen

Specimen(id='9a26d9df-9ab2-48df-ada8-8bd8455cccd6', identifier=[], description=None, specimen_type=None, analyte_type=None, associated_project=None, data_provider=None, source_material_type=CodeableConcept(coding=[Coding(code='Primary Tumor', system='http://crdc.nci.nih.gov/gdc', label=None, system_version=None, value_set=None, value_set_version=None, tag=[])], text=None), parent_specimen=[], source_subject=None, tumor_status_at_collection=None, creation_activity=None, processing_activity=[], storage_activity=[], transport_activity=[], contained_in=None, dimensional_measures=None, quantity_measure=[], quality_measure=[], cellular_composition_type=None, histological_composition_measures=[], general_tissue_pathology=CodeableConcept(coding=[Coding(code='Not Reported', system='http://crdc.nci.nih.gov/gdc', label=None, system_version=None, value_set=None, value_set_version=None, tag=[])], text=None), specific_tissue_pathology=None, preinvasive_tissue_morphology=None, morphology_pathological

We can now transform an entire diagnosis into a [CCDH Diagnosis](https://cancerdhc.github.io/ccdhmodel/latest/Diagnosis/).

In [75]:
ICD10_URL = "http://hl7.org/fhir/ValueSet/icd-10"

def transform_diagnosis(gdc_diagnosis, gdc_case):
    diagnosis = crdch_model.Diagnosis(id=gdc_diagnosis.get("diagnosis_id"))
    
    if gdc_diagnosis.get("diagnosis_id"):
        diagnosis.identifier = [
            crdch_model.Identifier(
                value=gdc_diagnosis["diagnosis_id"],
                system=f"{GDC_URL}#diagnosis_id",
            )
        ]
        
    condition_codings = []
    if gdc_diagnosis.get("primary_diagnosis"):
        condition_codings.append(
            crdch_model.Coding(
                system=GDC_URL,
                code=gdc_diagnosis.get("primary_diagnosis"),
                tag=["original"],
            )
        )

    if gdc_diagnosis.get("icd_10_code"):
        condition_codings.append(
            crdch_model.Coding(
                system=ICD10_URL,
                code=gdc_diagnosis.get("icd_10_code"),
                tag=["original"],
            )
        )

    diagnosis.condition = crdch_model.CodeableConcept(coding=condition_codings)

    if gdc_diagnosis.get("morphology"):
        diagnosis.morphology = codeable_concept(
            GDC_URL, gdc_diagnosis.get("morphology")
        )
    
    diagnosis.stage=create_stage_from_gdc(gdc_diagnosis)
    
    # Convert the case's samples into specimens.
    specimens = [
        transform_sample_to_specimen(sample)
        for sample in gdc_case.get("samples") or []
    ]
    if len(specimens) > 0:
        diagnosis.related_specimen = specimens

    # Note: this replaces the identifier list assigned from `diagnosis_id` above.
    diagnosis.identifier = [
        crdch_model.Identifier(system="GDC-submitter-id", value=gdc_diagnosis.get("submitter_id"))
    ]

    if "primary_site" in gdc_case and gdc_case["primary_site"] != "":
        body_site = create_body_site(gdc_case["primary_site"])
        if body_site is not None:
            diagnosis.metastatic_site.append(body_site)

    return diagnosis


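# Note that this example pairs the diagnosis from case 558 with the case-level data
# (samples and primary site) from case 131.
example_diagnosis = transform_diagnosis(
    gdc_head_and_mouth[558]["diagnoses"][0], gdc_head_and_mouth[131]
)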
print(yaml_dumper.dumps(example_diagnosis))

id: f44a9ab8-f121-59c5-8087-567864266629
identifier:
- value: TCGA-CN-A63Y_diagnosis
  system: GDC-submitter-id
condition:
  coding:
  - code: Squamous cell carcinoma, NOS
    system: http://crdc.nci.nih.gov/gdc
    tag:
    - original
  - code: C09.9
    system: http://hl7.org/fhir/ValueSet/icd-10
    tag:
    - original
metastatic_site:
- site:
    coding:
    - code: Nasopharynx
      system: http://crdc.nci.nih.gov/gdc
stage:
  method_type:
    coding:
    - code: AJCC staging system 7th edition
      system: http://crdc.nci.nih.gov/gdc
  observations:
  - observation_type:
      coding:
      - code: Clinical Overall
        system: http://crdc.nci.nih.gov/gdc
    value_codeable_concept:
      coding:
      - code: Stage I
        system: http://crdc.nci.nih.gov/gdc
  - observation_type:
      coding:
      - code: Clinical Tumor (T)
        system: http://crdc.nci.nih.gov/gdc
    value_codeable_concept:
      coding:
      - code: T1
        system: http://crdc.nci.nih.gov/gdc
  

## Exporting Python Data Classes as JSON-LD

Python Data Classes can be exported as [JSON-LD](https://en.wikipedia.org/wiki/JSON-LD), allowing CCDH instance data to be shared in a [JSON](https://en.wikipedia.org/wiki/JSON)-based [RDF](https://en.wikipedia.org/wiki/Resource_Description_Framework) format. RDF formats are particularly useful in sharing data, since they allow us to share [Linked Data](https://en.wikipedia.org/wiki/Linked_data) that can be understood by other consumers.
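
As a minimal sketch of what such a document looks like, the example below wraps a plain dictionary in `@graph` and pairs it with an `@context` that maps property names onto IRIs. The context here is hypothetical and uses an example namespace; in this notebook the real context is generated from the CCDH schema. The `wrap_as_jsonld` helper and the entity values are likewise invented for illustration.

```python
import json

# Hypothetical context: property names are resolved against an example namespace.
sketch_context = {
    "crdch": "https://example.org/crdch/",
    "@vocab": "https://example.org/crdch/",
}

# Hypothetical entity, shaped like a small part of a diagnosis.
sketch_entity = {
    "id": "diagnosis-0001",
    "condition": {
        "coding": [
            {
                "code": "Squamous cell carcinoma, NOS",
                "system": "http://crdc.nci.nih.gov/gdc",
            }
        ]
    },
}


def wrap_as_jsonld(entity, context):
    """Combine instance data and a context into one JSON-LD document string."""
    return json.dumps({"@graph": entity, "@context": context}, indent=2)


doc = wrap_as_jsonld(sketch_entity, sketch_context)
print(doc)
```

The instance data itself is unchanged; the context is what lets an RDF-aware consumer interpret keys such as `condition` as IRIs.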

In [91]:
from linkml.generators.jsonldcontextgen import ContextGenerator
from linkml_runtime.dumpers import json_dumper
import rdflib
import requests

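def entity_to_jsonld(entity):
    # Download the CCDH LinkML schema, failing early if the download does not succeed.
    yaml_schema_url = "https://raw.githubusercontent.com/cancerDHC/ccdhmodel/main/model/schema/crdch_model.yaml"
    req = requests.get(yaml_schema_url)
    req.raise_for_status()
    ccdh_yaml_schema = req.text

    # Generate a JSON-LD context from the schema and parse it into a dict.
    jsonld_context = ContextGenerator(ccdh_yaml_schema).serialize()
    jsonld_context_as_dict = json.loads(jsonld_context)

    # Serialize the entity (or list of entities) together with the context.
    as_json_str = json_dumper.dumps(
        {"@graph": entity, "@context": jsonld_context_as_dict}
    )
    return as_json_str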

# Display the example diagnosis we constructed in a previous step.
print(entity_to_jsonld(example_diagnosis))

{
  "@graph": {
    "id": "f44a9ab8-f121-59c5-8087-567864266629",
    "identifier": [
      {
        "value": "TCGA-CN-A63Y_diagnosis",
        "system": "GDC-submitter-id"
      }
    ],
    "condition": {
      "coding": [
        {
          "code": "Squamous cell carcinoma, NOS",
          "system": "http://crdc.nci.nih.gov/gdc",
          "tag": [
            "original"
          ]
        },
        {
          "code": "C09.9",
          "system": "http://hl7.org/fhir/ValueSet/icd-10",
          "tag": [
            "original"
          ]
        }
      ]
    },
    "metastatic_site": [
      {
        "site": {
          "coding": [
            {
              "code": "Nasopharynx",
              "system": "http://crdc.nci.nih.gov/gdc"
            }
          ]
        }
      }
    ],
    "stage": {
      "method_type": {
        "coding": [
          {
            "code": "AJCC staging system 7th edition",
            "system": "http://crdc.nci.nih.gov/gdc"
          }
     

We can also transform all the diagnoses in this file.
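
Because validation happens when objects are created, a single malformed record could stop a bulk transformation partway through. One option, sketched here with a hypothetical `transform_all` helper and a stand-in `sketch_transform` function (neither is part of this notebook or of LinkML), is to collect failures alongside the successful results. The notebook's own cell, which follows, uses a plain loop instead.

```python
def transform_all(records, transform):
    """Apply `transform` to each record, collecting failures instead of stopping."""
    transformed, failures = [], []
    for index, record in enumerate(records):
        try:
            transformed.append(transform(record))
        except (ValueError, KeyError, TypeError) as error:
            # Remember which record failed and why, then carry on.
            failures.append((index, f"{type(error).__name__}: {error}"))
    return transformed, failures


def sketch_transform(record):
    """Hypothetical stand-in for a transformation that validates its input."""
    if not record.get("diagnosis_id"):
        raise ValueError("diagnosis_id is required")
    return {"id": record["diagnosis_id"]}


# Hypothetical records; the second one is missing its identifier.
records = [
    {"diagnosis_id": "diagnosis-0001"},
    {},
    {"diagnosis_id": "diagnosis-0003"},
]

transformed, failures = transform_all(records, sketch_transform)
print(f"{len(transformed)} transformed, {len(failures)} failed")
for index, message in failures:
    print(f" - record {index}: {message}")
```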

In [96]:
diagnoses = []
for case in gdc_head_and_mouth:
    for diagnosis in case["diagnoses"]:
        diagnoses.append(transform_diagnosis(diagnosis, case))
        
diagnoses_jsonld = entity_to_jsonld(diagnoses)
print(diagnoses_jsonld[0:2000] + '...')

{
  "@graph": [
    {
      "id": "a4f6276a-b3cc-45f9-9fb8-30edd56ad4ea",
      "identifier": [
        {
          "value": "GENIE-DFCI-011620-10763_diagnosis",
          "system": "GDC-submitter-id"
        }
      ],
      "condition": {
        "coding": [
          {
            "code": "Squamous cell carcinoma, NOS",
            "system": "http://crdc.nci.nih.gov/gdc",
            "tag": [
              "original"
            ]
          }
        ]
      },
      "metastatic_site": [
        {
          "site": {
            "coding": [
              {
                "code": "Oropharynx",
                "system": "http://crdc.nci.nih.gov/gdc"
              }
            ]
          }
        }
      ],
      "morphology": {
        "coding": [
          {
            "code": "8070/3",
            "system": "http://crdc.nci.nih.gov/gdc"
          }
        ]
      },
      "related_specimen": [
        {
          "id": "d697e728-9813-4ff5-83eb-c7d814e07bcc",
          "source_

## Converting JSON-LD to Turtle

Although JSON-LD is a complete RDF serialization, many people find RDF easier to read in a format like [Turtle](https://en.wikipedia.org/wiki/Turtle_(syntax)). We can convert the generated JSON-LD output into Turtle by using the [rdflib](https://rdflib.readthedocs.io/en/stable/) package.

Note that this section is intended to be illustrative -- these are *not* finalized IRIs for properties and entities. We will choose IRIs and develop a canonical RDF representation in future phases of development.
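
To give a feel for why the resulting RDF consists largely of nested anonymous nodes, the sketch below flattens a nested dictionary into subject-predicate-object triples, giving every nested object a blank-node label. This is only an illustration written with the standard library: rdflib does the real conversion in this notebook, and the `flatten` helper, the namespace and the sample entity are assumptions made for the example.

```python
import itertools
import json

CRDCH = "https://example.org/crdch/"  # Illustrative namespace only, not a finalized IRI.


def flatten(node, triples, counter):
    """Append triples for `node` to `triples` and return its blank-node label."""
    subject = f"_:b{next(counter)}"
    for key, value in node.items():
        predicate = f"<{CRDCH}{key}>"
        # Treat single values and lists uniformly.
        for item in value if isinstance(value, list) else [value]:
            if isinstance(item, dict):
                # Nested objects become their own blank nodes.
                triples.append((subject, predicate, flatten(item, triples, counter)))
            else:
                # Scalars become literals; json.dumps adds quotes around strings.
                triples.append((subject, predicate, json.dumps(item)))
    return subject


# Hypothetical entity, shaped like a small part of a diagnosis.
sketch_entity = {
    "id": "diagnosis-0001",
    "condition": {"coding": [{"code": "C09.9"}]},
}

triples = []
root = flatten(sketch_entity, triples, itertools.count())
for triple in triples:
    print(" ".join(triple), ".")
```

Since none of the entities here carries an IRI of its own, every node ends up anonymous, which is consistent with the bracketed `[ ... ]` structures in the Turtle output.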

In [98]:
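# We can convert this JSON-LD into Turtle.
from rdflib import Graph

def entity_to_turtle(entity):
    as_json_str = entity_to_jsonld(entity)

    # Convert JSON-LD into Turtle.
    g = Graph()
    g.parse(data=as_json_str, format="json-ld")
    rdf_as_turtle = g.serialize(format="turtle")
    # rdflib 5 and earlier return bytes here; rdflib 6 and later return str.
    if isinstance(rdf_as_turtle, bytes):
        rdf_as_turtle = rdf_as_turtle.decode()

    return rdf_as_turtle

# Reuse the JSON-LD already generated for all diagnoses rather than regenerating it.
g = Graph()
g.parse(data=diagnoses_jsonld, format="json-ld")
rdf_as_turtle = g.serialize(format="turtle")
if isinstance(rdf_as_turtle, bytes):
    rdf_as_turtle = rdf_as_turtle.decode()
print(rdf_as_turtle[0:1000])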

@prefix : <https://example.org/crdch/> .
@prefix crdch: <https://example.org/crdch/> .

[] crdch:condition [ crdch:coding [ crdch:code "Squamous cell carcinoma, NOS" ;
                    crdch:system "http://crdc.nci.nih.gov/gdc" ;
                    crdch:tag "original" ] ] ;
    crdch:id "f8933756-4001-4c7b-94c7-2d8964ffda9e" ;
    crdch:identifier [ crdch:system "GDC-submitter-id" ;
            crdch:value "GENIE-DFCI-005118-1198_diagnosis" ] ;
    crdch:metastatic_site [ crdch:site [ crdch:coding [ crdch:code "Oropharynx" ;
                            crdch:system "http://crdc.nci.nih.gov/gdc" ] ] ] ;
    crdch:morphology [ crdch:coding [ crdch:code "8070/3" ;
                    crdch:system "http://crdc.nci.nih.gov/gdc" ] ] ;
    crdch:related_specimen [ crdch:general_tissue_pathology [ crdch:coding [ crdch:code "Not Reported" ;
                            crdch:system "http://crdc.nci.nih.gov/gdc" ] ] ;
            crdch:id "430338db-7df9-43c8-be80-6ef5a6b98899" ;
            