# PDC to CRDCH Transformation Workflow

This notebook demonstrates the conversion of data from one of the CRDC nodes, specifically the PDC, into CRDCH instance data. The notebook reads the node data as JSON, transforms it into CRDCH model objects, and displays the result as YAML using the LinkML runtime.

In [1]:
import sys

# Installing these packages is only necessary if you don't start this Notebook by running
# `poetry run jupyter notebook`.

# LinkML is a modeling language with built-in generators
# that can automatically generate artifacts in many output formats,
# such as JSON-LD, Python dataclass modules, etc.
#!{sys.executable} -m pip install linkml

# Dataframes library to visualize node data in a tabular format
#!{sys.executable} -m pip install pandas

# Utilities to visualize data in LinkML YAML format
#!{sys.executable} -m pip install linkml-runtime

# Python dataclasses to load, validate and transform 
# CRDCH instance data
#!{sys.executable} -m pip install crdch-model

## Load and Visualize PDC data

For the purposes of this demonstration, we have aggregated a dataset of 560 cases from the PDC by querying the [PDC public API](https://pdc.cancer.gov/data-dictionary/publicapi-documentation/#!/Case/allCases). The querying and download protocol has been documented in a different [Jupyter notebook](https://github.com/cancerDHC/example-data/blob/main/head-and-mouth/Head%20and%20Mouth%20Cancer%20Datasets.ipynb).

In [2]:
import json
import pandas

with open('head-and-mouth/pdc-head-and-mouth.json') as file:
    pdc_head_and_mouth = json.load(file)
    
pandas.DataFrame(pdc_head_and_mouth)

| | case_id | case_submitter_id | days_to_lost_to_followup | demographics | diagnoses | disease_type | externalReferences | index_date | lost_to_followup | primary_site | project_submitter_id | samples |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0232701d-6d00-440c-af6c-5899fbbf4142 | OSCC_13 | | [{'cause_of_death': None, 'days_to_birth': Non... | [{'age_at_diagnosis': None, 'ajcc_clinical_m':... | Oral Squamous Cell Carcinoma | [] | | | Head and Neck | Oral Squamous Cell Carcinoma - Chang Gung Univ... | [{'aliquots': [{'aliquot_id': '426a2696-f073-4... |
| 1 | 0e943de7-c277-48f2-8fa9-b2e836b03c2c | OSCC_25 | | [{'cause_of_death': None, 'days_to_birth': Non... | [{'age_at_diagnosis': None, 'ajcc_clinical_m':... | Oral Squamous Cell Carcinoma | [] | | | Head and Neck | Oral Squamous Cell Carcinoma - Chang Gung Univ... | [{'aliquots': [{'aliquot_id': '38404eb4-20a6-4... |
| 2 | 1104505a-9890-49ce-8d7d-7a8070261324 | OSCC_23 | | [{'cause_of_death': None, 'days_to_birth': Non... | [{'age_at_diagnosis': None, 'ajcc_clinical_m':... | Oral Squamous Cell Carcinoma | [] | | | Head and Neck | Oral Squamous Cell Carcinoma - Chang Gung Univ... | [{'aliquots': [{'aliquot_id': '15218d5b-fc40-4... |
| 3 | 195cd133-0d53-402d-b31c-3d4fe0481858 | OSCC_37 | | [{'cause_of_death': None, 'days_to_birth': Non... | [{'age_at_diagnosis': None, 'ajcc_clinical_m':... | Oral Squamous Cell Carcinoma | [] | | | Head and Neck | Oral Squamous Cell Carcinoma - Chang Gung Univ... | [{'aliquots': [{'aliquot_id': '47e8d70c-646d-4... |
| 4 | 1df726a4-8520-4474-8c00-d238a7384be1 | OSCC_06 | | [{'cause_of_death': None, 'days_to_birth': Non... | [{'age_at_diagnosis': None, 'ajcc_clinical_m':... | Oral Squamous Cell Carcinoma | [] | | | Head and Neck | Oral Squamous Cell Carcinoma - Chang Gung Univ... | [{'aliquots': [{'aliquot_id': '333e24c9-ec45-4... |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 143 | df6bef95-c233-4b10-b321-36ef4e79b5d4 | OSCC_40 | | [{'cause_of_death': None, 'days_to_birth': Non... | [{'age_at_diagnosis': None, 'ajcc_clinical_m':... | Oral Squamous Cell Carcinoma | [] | | | Head and Neck | Oral Squamous Cell Carcinoma - Chang Gung Univ... | [{'aliquots': [{'aliquot_id': 'a3402806-a9ec-4... |
| 144 | e11e9155-4ac6-43dc-b8e5-1be822cd2dab | OSCC_47 | | [{'cause_of_death': None, 'days_to_birth': Non... | [{'age_at_diagnosis': None, 'ajcc_clinical_m':... | Oral Squamous Cell Carcinoma | [] | | | Head and Neck | Oral Squamous Cell Carcinoma - Chang Gung Univ... | [{'aliquots': [{'aliquot_id': '9d1789f8-d629-4... |
| 145 | ea7c9fbd-8353-4f3c-9fea-2fba79140536 | OSCC_56 | | [{'cause_of_death': None, 'days_to_birth': Non... | [{'age_at_diagnosis': None, 'ajcc_clinical_m':... | Oral Squamous Cell Carcinoma | [] | | | Head and Neck | Oral Squamous Cell Carcinoma - Chang Gung Univ... | [{'aliquots': [{'aliquot_id': '9c36e4e9-a971-4... |
| 146 | f581075d-1b69-4812-9fe4-2bde4aad8bf2 | OSCC_38 | | [{'cause_of_death': None, 'days_to_birth': Non... | [{'age_at_diagnosis': None, 'ajcc_clinical_m':... | Oral Squamous Cell Carcinoma | [] | | | Head and Neck | Oral Squamous Cell Carcinoma - Chang Gung Univ... | [{'aliquots': [{'aliquot_id': '4059003c-b576-4... |
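
The `demographics`, `diagnoses` and `samples` columns hold nested lists of records, which the tabular view truncates. As a minimal sketch (not part of the original workflow), the nested diagnoses could be flattened into one row per diagnosis using plain Python. The helper name `flatten_diagnoses` and the two sample records below are hypothetical; only the `case_id` and `diagnoses` keys are taken from the data shown above.

```python
def flatten_diagnoses(cases):
    """Return one flat dict per diagnosis, tagged with its parent case_id."""
    rows = []
    for case in cases:
        # `diagnoses` may be missing or None in some records; treat both as empty.
        for diagnosis in case.get('diagnoses') or []:
            row = dict(diagnosis)
            row['case_id'] = case.get('case_id')
            rows.append(row)
    return rows


# Hypothetical records shaped like the PDC cases loaded above.
sample_cases = [
    {'case_id': 'case-1', 'diagnoses': [{'tumor_stage': 'Stage IV', 'ajcc_clinical_m': 'M0'}]},
    {'case_id': 'case-2', 'diagnoses': None},
]

flat_rows = flatten_diagnoses(sample_cases)
print(flat_rows)
```

Passing the result to `pandas.DataFrame(...)` should then give one row per diagnosis, which may be easier to scan than the nested column.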


## Transform PDC Case into CRDCH Research Subject

This section demonstrates the transformation of a PDC case into a CRDCH Research Subject. For the purposes of this demo, we transform two components: the diagnosis information from a PDC case into a CRDCH Diagnosis, and the top-level specimen component into CRDCH Specimens.

TODO: Similar to specimen, other top-level components like samples, aliquots, analytes, and slides can also be transformed.
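
To give a sense of what transforming the other nested components would involve, here is a rough sketch that walks the samples of a case and the aliquots inside each sample. The helper `iter_sample_components` and the example record are hypothetical; the keys `samples`, `sample_id`, `aliquots` and `aliquot_id` are the ones visible in the data, and how each component should map onto the CRDCH model is left open here.

```python
def iter_sample_components(case):
    """Yield (kind, identifier) pairs for each sample and aliquot in a case."""
    for sample in case.get('samples') or []:
        yield ('sample', sample.get('sample_id'))
        # Aliquots are nested inside each sample record.
        for aliquot in sample.get('aliquots') or []:
            yield ('aliquot', aliquot.get('aliquot_id'))


# Hypothetical case record; the identifiers are placeholders.
example_case = {
    'case_id': 'case-1',
    'samples': [
        {'sample_id': 'sample-1',
         'aliquots': [{'aliquot_id': 'aliquot-1'}, {'aliquot_id': 'aliquot-2'}]},
        {'sample_id': 'sample-2', 'aliquots': []},
    ],
}

components = list(iter_sample_components(example_case))
print(components)
```

A full transformation would replace the yielded tuples with the corresponding CRDCH objects once the target classes and fields are decided.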

In [3]:
import crdch_model
import utils

def create_stage_observation(observation_type, value):
    """Create a CRDCH CancerStageObservation from an observation type and a stage value."""
    # As with the body site example below, we need to map PDC values into the values
    # allowed under the CRDCH model.
    stage_mappings = {
        'not reported': 'Not Reported',
        'unknown': 'Unknown',
        'stage i': 'Stage I',
        'stage ii': 'Stage II',
        'stage iii': 'Stage III',
        'stage iva': 'Stage IVA',
        'stage ivb': 'Stage IVB',
        'stage ivc': 'Stage IVC',
    }

    # Stages named with lower case Roman numerals are mapped to their upper case
    # equivalents; any other value is passed through unchanged.
    return crdch_model.CancerStageObservation(
        observation_type=utils.codeable_concept(
            code=observation_type,
            system="https://example.org/CancerStageObservation"
        ),
        value_codeable_concept=utils.codeable_concept(
            code=stage_mappings.get(value, value),
            system="https://example.org/CancerStageObservation"
        )
    )
    
def create_stage_from_pdc(diagnosis):
    """Create a CRDCH CancerStageObservationSet from the staging fields of a PDC diagnosis."""
    # Create an observation set.
    # method_type information is not present in the case.
    obs = crdch_model.CancerStageObservationSet()

    # PDC diagnosis fields paired with the observation type each one produces.
    stage_fields = [
        ('tumor_stage', 'Overall'),
        ('ajcc_clinical_stage', 'Clinical Overall'),
        ('ajcc_clinical_t', 'Clinical Tumor (T)'),
        ('ajcc_clinical_n', 'Clinical Node (N)'),
        ('ajcc_clinical_m', 'Clinical Metastasis (M)'),
        ('ajcc_pathologic_stage', 'Pathological Overall'),
        ('ajcc_pathologic_t', 'Pathological Tumor (T)'),
        ('ajcc_pathologic_n', 'Pathological Node (N)'),
        ('ajcc_pathologic_m', 'Pathological Metastasis (M)'),
    ]

    # Add an observation for every staging field present in the PDC diagnosis.
    for field_name, stage_type in stage_fields:
        if diagnosis.get(field_name) is not None:
            obs.observations.append(create_stage_observation(stage_type, diagnosis.get(field_name)))

    return obs

# Test the transform with the first diagnosis of one loaded case (index 131).
# Note that the resulting CancerStageObservationSet contains one observation
# for each staging field that is present in the diagnosis.
example_observation_set = create_stage_from_pdc(pdc_head_and_mouth[131]['diagnoses'][0])
example_observation_set

CancerStageObservationSet(id=None, category=None, focus=[], subject=None, method_type=[], performed_by=None, observations=[CancerStageObservation(observation_type=CodeableConcept(coding=[Coding(code='Overall', system='https://example.org/CancerStageObservation', label=None, system_version=None, value_set=None, value_set_version=None, tag=[])], text=None), value_codeable_concept=CodeableConcept(coding=[Coding(code='Stage IV', system='https://example.org/CancerStageObservation', label=None, system_version=None, value_set=None, value_set_version=None, tag=[])], text=None), id=None, category=None, method_type=None, focus=None, subject=None, performed_by=None), CancerStageObservation(observation_type=CodeableConcept(coding=[Coding(code='Clinical Metastasis (M)', system='https://example.org/CancerStageObservation', label=None, system_version=None, value_set=None, value_set_version=None, tag=[])], text=None), value_codeable_concept=CodeableConcept(coding=[Coding(code='M0', system='https://exa
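
The lookup in `create_stage_observation` only matches values that are written entirely in lower case, so any other spelling is passed through as-is. If mixed-case or irregularly spaced values turned out to occur in the source data (an assumption, not something verified here), a case-insensitive variant could look like the sketch below. The names `STAGE_LABELS`, `STAGE_LOOKUP` and `normalize_stage` are hypothetical and not part of the notebook.

```python
# Canonical labels taken from the stage_mappings table in the notebook.
STAGE_LABELS = [
    'Not Reported', 'Unknown',
    'Stage I', 'Stage II', 'Stage III',
    'Stage IVA', 'Stage IVB', 'Stage IVC',
]

# Index the canonical labels by their lower-case form.
STAGE_LOOKUP = {label.lower(): label for label in STAGE_LABELS}


def normalize_stage(value):
    """Map a stage string to its canonical label, ignoring case and extra spaces."""
    if value is None:
        return None
    key = ' '.join(value.split()).lower()
    # Values without a canonical label are returned unchanged.
    return STAGE_LOOKUP.get(key, value)


print(normalize_stage('STAGE iva'))
print(normalize_stage('Stage IV'))
```

Values that are not in the table, such as `Stage IV` or `T4a`, still pass through unchanged, which matches the pass-through behaviour of the original function.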

Use the LinkML YAML dumper to format and display the Cancer Stage Observation Set as a YAML string.

In [4]:
from linkml_runtime.dumpers import yaml_dumper

print(yaml_dumper.dumps(example_observation_set))

observations:
- observation_type:
    coding:
    - code: Overall
      system: https://example.org/CancerStageObservation
  value_codeable_concept:
    coding:
    - code: Stage IV
      system: https://example.org/CancerStageObservation
- observation_type:
    coding:
    - code: Clinical Metastasis (M)
      system: https://example.org/CancerStageObservation
  value_codeable_concept:
    coding:
    - code: M0
      system: https://example.org/CancerStageObservation
- observation_type:
    coding:
    - code: Pathological Overall
      system: https://example.org/CancerStageObservation
  value_codeable_concept:
    coding:
    - code: Stage IVA
      system: https://example.org/CancerStageObservation
- observation_type:
    coding:
    - code: Pathological Tumor (T)
      system: https://example.org/CancerStageObservation
  value_codeable_concept:
    coding:
    - code: T4a
      system: https://example.org/CancerStageObservation
- observation_type:
    coding:
    - code: Patholog
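
The YAML view is much shorter than the object representation shown earlier because fields that are `None` or empty do not appear in it. The sketch below illustrates that idea with a standard-library dataclass. It is only an illustration of the effect, not a description of how `linkml_runtime` implements `yaml_dumper`; the stand-in `Coding` class and the `prune_empty` helper are hypothetical.

```python
from dataclasses import dataclass, field, asdict
from typing import List, Optional


@dataclass
class Coding:
    """Hypothetical stand-in with a few of the fields seen in the output above."""
    code: str
    system: Optional[str] = None
    label: Optional[str] = None
    tag: List[str] = field(default_factory=list)


def prune_empty(value):
    """Recursively drop dict entries whose value is None, [] or {}."""
    if isinstance(value, dict):
        pruned = {key: prune_empty(item) for key, item in value.items()}
        return {key: item for key, item in pruned.items() if item not in (None, [], {})}
    if isinstance(value, list):
        return [prune_empty(item) for item in value]
    return value


coding = Coding(code='M0', system='https://example.org/CancerStageObservation')
compact = prune_empty(asdict(coding))
print(compact)
```

Only `code` and `system` survive the pruning here, mirroring the two keys shown per coding in the YAML output.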

## Transform PDC Diagnosis into CRDCH Diagnosis

We first define transformation functions for creating a CRDCH BodySite from the `primary_site` in the PDC case, and for transforming sample information from the case into a CRDCH Specimen. Using these two functions, we can define a combined diagnosis transformation function that transforms a full PDC diagnosis into a CRDCH Diagnosis.

In [6]:
def create_body_site(site_name):
    """Create a CRDCH BodySite based on the name of a site in the human body."""

    # Accept 'None'.
    if site_name is None:
        return None

    # Some body sites are not currently included in the CRDCH model. We need to translate
    # these sites into values that *are* included in the CRDCH model.
    site_mappings = {
        'Larynx, NOS': crdch_model.EnumCRDCHBodySiteSite.Larynx
    }

    # Map values if needed. Otherwise, pass them through unmapped.
    return crdch_model.BodySite(
        utils.codeable_concept(
            code=site_mappings.get(site_name, site_name),
            system="https://example.org/BodySite"
        )
    )

def transform_sample_to_specimen(sample):
    """Transform a PDC Sample into a CRDCH Specimen."""

    specimen = crdch_model.Specimen(id=sample.get('sample_id'))
    specimen.source_material_type = sample.get('sample_type')
    specimen.general_tissue_morphology = sample.get('tissue_type')
    specimen.specific_tissue_morphology = sample.get('tumor_code')
    specimen.tumor_status_at_collection = sample.get('tumor_descriptor')

    return specimen

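def transform_diagnosis(diagnosis, case):
    """Transform a PDC diagnosis, plus the samples and primary site of its case, into a CRDCH Diagnosis."""
    ccdh_diagnosis = crdch_model.Diagnosis(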
        id=diagnosis.get('diagnosis_id'),
        condition=utils.codeable_concept(
            code=diagnosis.get('primary_diagnosis'),
            system="https://example.org/PrimaryDiagnosis"
        ),
        morphology=utils.codeable_concept(
            code=diagnosis.get('morphology'),
            system="https://example.org/Morphology"
        ),
        grade=diagnosis.get('grade'),
        stage=create_stage_from_pdc(diagnosis),
        # age_at_diagnosis=diagnosis.get('year_of_diagnosis'),
        # Fall back to an empty list so that a case without samples does not raise.
        related_specimen=[
            transform_sample_to_specimen(
                sample
            ) for sample in case.get('samples') or []
        ]
    )
    ccdh_diagnosis.identifier = [
        crdch_model.Identifier(
            system='PDC-submitter-id',
            value=diagnosis.get('diagnosis_submitter_id')
        )
    ]
    
    if 'primary_site' in case and case['primary_site'] != '':
        body_site = create_body_site(case['primary_site'])
        if body_site is not None:
            ccdh_diagnosis.metastatic_site.append(body_site)

    return ccdh_diagnosis

example_diagnosis = transform_diagnosis(pdc_head_and_mouth[131]['diagnoses'][0], pdc_head_and_mouth[131])
print(yaml_dumper.dumps(example_diagnosis))

id: 68e054bf-2850-11ec-b712-0a4e2186f121
identifier:
- value: C3N-03889-DX
  system: PDC-submitter-id
condition:
  coding:
  - code: Squamous cell carcinoma, NOS
    system: https://example.org/PrimaryDiagnosis
metastatic_site:
- site:
    coding:
    - code: Head and Neck
      system: https://example.org/BodySite
stage:
- observations:
  - observation_type:
      coding:
      - code: Overall
        system: https://example.org/CancerStageObservation
    value_codeable_concept:
      coding:
      - code: Stage IV
        system: https://example.org/CancerStageObservation
  - observation_type:
      coding:
      - code: Clinical Metastasis (M)
        system: https://example.org/CancerStageObservation
    value_codeable_concept:
      coding:
      - code: M0
        system: https://example.org/CancerStageObservation
  - observation_type:
      coding:
      - code: Pathological Overall
        system: https://example.org/CancerStageObservation
    value_codeable_concept:
      codi