<a href="https://colab.research.google.com/github/ImagingDataCommons/crdc_compound_objects/blob/main/CRDC_compound_object.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Intro

This notebook investigates a compound object replacement for DRS bundles. The compound object is a JSON structure that is intended to be extensible in support of various use cases. This is a strawman proposal.

This notebook explores how the proposed compound object could be used to represent IDC DICOM series objects.

# Compound objects

The DRS URI of an IDC DICOM instance can be resolved to obtain the location of the corresponding file object no matter where it is available. Similarly, it should be possible to resolve a DRS URI of an IDC DICOM series to obtain the location(s) of the DICOM instance data of which it is composed.

This instance data might take different forms. For example, in addition to an individual GCS 'blob' for each DICOM instance in the series, there might also be an archive, such as a zip, that includes all the instances in the series.

For this purpose we have defined a compound object, a JSON data structure, that contains DRS URIs that can be resolved to various embodiments of the corresponding series. We expect that the same structure can be used to represent other complex objects in other use cases.

The proposed compound object is very loosely based on the DrsObject, but with significant modifications. Rather than include a formal JSON schema, we will show an example that might be used to represent an IDC series.

Such a schema can be found [here](https://docs.google.com/document/d/1fvXneaDOlnqdl-nhhwkEbv_2sbTE5FTNRSXDay76q1E/edit).

Here is an example of such a compound object, as might be obtained by resolving the DRS URI of a particular IDC series:
```
{
  "encoding_version": "1.0",
  "description": "IDC CRDC DICOM series compound object",
  "object_type": "DICOM series",
  "id": "0190fe71-7144-40ae-a24c-c8d21a99317d",
  "name": "1.3.6.1.4.1.14519.5.2.1.8421.4018.242623218081388666714980416901",
  "self_uri": "drs://dg.4DFC/0190fe71-7144-40ae-a24c-c8d21a99317d",
  "access_methods": [
    {
      "method": "children",
      "mime_type": "application/dicom",
      "description": "List of DRS URIs of instances in this series",
      "contents": [
        {
          "name": "1.3.6.1.4.1.14519.5.2.1.8421.4018.218816382845480339633358956983",
          "drs_uri": "drs://dg.4DFC/01210a30-8395-498c-905f-6667db67101a"
        },
        {
          "name": "1.3.6.1.4.1.14519.5.2.1.8421.4018.241858580971386059648748116938",
          "drs_uri": "drs://dg.4DFC/01b37013-7484-4968-8a9a-3a6b28c457e9"
        },
        :
        :
        :
        {
          "name": "1.3.6.1.4.1.14519.5.2.1.8421.4018.306949952402324943719349396501",
          "drs_uri": "drs://dg.4DFC/fb8a3f58-3e6d-48c5-8ed7-08aac223213e"
        }
      ]
    },
    {
      "method": "folder_object",
      "mime_type": "application/json",
      "description": "DRS URI that resolves to a gs or s3 folder object corresponding to this series",
      "contents": [
        {
          "name": "1.3.6.1.4.1.14519.5.2.1.8421.4018.242623218081388666714980416901",
          "drs_uri": "drs://dg.4DFC/some_TBD_uuid"
        }
      ]
    },
    {
      "method": "archive_package",
      "mime_type": "application/zip",
      "description": "DRS URI that resolves to a zip archive of the instances in this series",
      "contents": [
        {
          "name": "1.3.6.1.4.1.14519.5.2.1.8421.4018.242623218081388666714980416901.zip",
          "drs_uri": "drs://dg.4DFC/some_TBD_uuid"
        }
      ]
    }
  ],
  "aliases": []
}
```

The `description`, `id`, `name`, and `self_uri` fields are approximately as defined in the DrsObject. The `encoding_version` field is added in the expectation that this JSON structure might need revision. The `object_type` field provides a way to document the type of object represented by this object (although the `description` might be sufficient for that purpose).

When used to represent an IDC DICOM series, the value of `id` could be the `series_uuid` of the series (a UUID4 assigned to a version of a DICOM series), and the value of `name` could be the DICOM SeriesInstanceUID of the series.

The `access_methods` field, while having the same name as a field in the DrsObject, is very different. Here, the `access_methods` field recognizes that there may be more than one 'embodiment' of, or method of accessing, the same object.

In the case of a DICOM series in the IDC data set, the series might be embodied as the set of GCS (and, at some future time, S3) DICOM instance blobs and accessed by resolving a DRS URI for each such instance. Another expected method is an archive, such as a zip, again embodied as a GCS or S3 blob. A third method could be a GCS or S3 'directory' that contains the instances in the series.

Rather than define an IDC-specific compound object for this purpose, we have tried to specify a JSON encoding that is somewhat self-describing and adaptable to other use cases. Admittedly, the goal of supporting arbitrary access types requires some use-case-specific knowledge on the part of the user. Specifically, in the case of the IDC DICOM series compound object, the user must know that they are looking for a 'children' method and/or a 'folder_object' method and/or an 'archive_package' method. Other methods, with different `method` IDs, might be defined in support of other use cases.

Each access method has:
*  a `method`, a string that identifies the method,
*  a `mime_type`, the MIME type of the data element(s) to which each DRS URI in the `contents` array resolves,
*  a `description`, an arbitrary use-case-specific string,
*  a `contents` array, each element of which has
 * a `name`, an arbitrary use-case-specific string,
 * a `drs_uri`, a DRS URI that can be resolved to some object.

At a very high level, the expectation is that, having obtained such a compound object, the user will find the access method of interest by searching for some `method` string. Having found such an access method, the user will select one or more elements from the `contents` array, perhaps based on the `name` value, and then resolve the corresponding `drs_uri`. In subsequent sections, we'll demonstrate this process.
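
The lookup described above can be sketched as a pair of small helpers. This is only an illustration, not part of the proposal: `find_access_method` and `drs_uris_for` are hypothetical names, and the sample object is a trimmed copy of the example shown earlier.

```python
from typing import List, Optional

# Trimmed copy of the example series object; only the fields used below are kept.
series_object_example = {
    "object_type": "DICOM series",
    "access_methods": [
        {
            "method": "children",
            "mime_type": "application/dicom",
            "contents": [
                {
                    "name": "1.3.6.1.4.1.14519.5.2.1.8421.4018.218816382845480339633358956983",
                    "drs_uri": "drs://dg.4DFC/01210a30-8395-498c-905f-6667db67101a",
                },
                {
                    "name": "1.3.6.1.4.1.14519.5.2.1.8421.4018.241858580971386059648748116938",
                    "drs_uri": "drs://dg.4DFC/01b37013-7484-4968-8a9a-3a6b28c457e9",
                },
            ],
        },
        {
            "method": "archive_package",
            "mime_type": "application/zip",
            "contents": [
                {
                    "name": "1.3.6.1.4.1.14519.5.2.1.8421.4018.242623218081388666714980416901.zip",
                    "drs_uri": "drs://dg.4DFC/some_TBD_uuid",
                }
            ],
        },
    ],
}


def find_access_method(compound_object: dict, method_id: str) -> Optional[dict]:
    """Return the first access method whose `method` equals method_id, or None."""
    return next(
        (m for m in compound_object.get("access_methods", []) if m.get("method") == method_id),
        None,
    )


def drs_uris_for(compound_object: dict, method_id: str) -> List[str]:
    """Return the DRS URIs listed under the given method (empty if the method is absent)."""
    method = find_access_method(compound_object, method_id)
    return [content["drs_uri"] for content in method["contents"]] if method else []


children_uris = drs_uris_for(series_object_example, "children")
folder_uris = drs_uris_for(series_object_example, "folder_object")  # not present in this trimmed copy
```

Returning `None` (or an empty list) for a missing method lets the caller decide how to fall back, for example from `archive_package` to `children`.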

Note that we will refer to the CRDC compound object of an IDC DICOM series as a `series object`. 

# Demo
We've populated a GCS bucket, `crdcobj`, with a selected set of series from the [NLST collection](https://doi.org/10.7937/tcia.hmq8-j677).

Instances have hierarchical names like `<series_uuid>/<instance_uuid>.dcm` as described in the [Hierarchical Instance Names](https://github.com/ImagingDataCommons/crdc_compound_objects/blob/main/Hierarchical_Instance_Names.ipynb) notebook.

## Demo config

In [None]:
demo_bucket = 'crdcobj'

The jq JSON processor is useful for inspecting JSON:

In [None]:
!apt -qq update
!apt -qq install jq

In order to access GCS, we first authenticate to Colab:

In [None]:
from google.colab import auth
auth.authenticate_user()

We'll be using the GCS libraries and the Requests module:

In [None]:
import os
import shutil
import json
from google.cloud import storage
from google.cloud.storage import Blob
import requests
client = storage.Client()
bucket = client.bucket(demo_bucket)

## Series objects
For each series in the demo dataset, we have also created a series object, a CRDC compound object representing a DICOM series, like the example above. Each has a GCS URL like:
`gs://crdcobj/<series_uuid>/crdcobj.json`
We might have obtained such a GCS URL from an IDC cohort manifest or from a BQ query. In that case we can get the series object using gsutil or other tools. We'll assume that we have obtained the GCS URL of the compound object for the series whose `series_uuid` is  
`f5d6b517-2c02-4035-9444-0f15be7180ff`:



In [None]:
!gsutil cat gs://crdcobj/f5d6b517-2c02-4035-9444-0f15be7180ff/crdcobj.json | jq

We'll use this example throughout this notebook.
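
Later cells split this GCS URL into a bucket name and a blob name with inline `split` calls. A minimal sketch of that step as a reusable helper follows; `split_gcs_url` is a hypothetical name, not something defined by IDC or the GCS library, and it also rejects malformed URLs such as one with a single slash after `gs:`.

```python
from typing import Tuple


def split_gcs_url(gcs_url: str) -> Tuple[str, str]:
    """Split 'gs://<bucket>/<blob name>' into (bucket_name, blob_name)."""
    scheme, separator, rest = gcs_url.partition("://")
    if scheme != "gs" or not separator:
        raise ValueError(f"Not a GCS URL: {gcs_url!r}")
    bucket_name, _, blob_name = rest.partition("/")
    if not bucket_name or not blob_name:
        raise ValueError(f"GCS URL lacks a bucket or blob name: {gcs_url!r}")
    return bucket_name, blob_name


bucket_name, blob_name = split_gcs_url(
    "gs://crdcobj/f5d6b517-2c02-4035-9444-0f15be7180ff/crdcobj.json"
)
# With hierarchical names, the first path component is the series_uuid.
series_uuid = blob_name.split("/")[0]
```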

## Obtaining Unsigned or Signed URLs for DICOM Instances
The Hierarchical Instance Names notebook describes various methods for accessing the instances in a series based on hierarchical naming. This access method can only be used when access to the instance data does not require a signed URL. 

Currently all IDC data is accessible using unsigned URLs, but, in the future, access to some data might require a signed URL. For this purpose, the `series object` includes a `children` access method.

For demonstration purposes, we will assume that we have obtained the GCS URL, `gs://crdcobj/f5d6b517-2c02-4035-9444-0f15be7180ff/crdcobj.json`, of the previously discussed `series object`:



In [None]:
series_object_gcs_url = 'gs://crdcobj/f5d6b517-2c02-4035-9444-0f15be7180ff/crdcobj.json'

Next we'll use the GCS API to get the contents of that series object:

In [None]:
# Create a bucket object
bucket = client.bucket(demo_bucket)
# Extract the blob name (<series_uuid>/crdcobj.json) from the URL
blob_name = series_object_gcs_url.split('/', 3)[3]
# Create a blob object
src_blob = bucket.blob(blob_name)
series_object = json.loads(src_blob.download_as_string())
series_object

Now we can resolve each DRS URI to obtain a signed URL, and then access the corresponding blob. We'll first make a list of the instance DRS URIs:

In [None]:
# Try to find a 'children' access method; the default makes next() return None if there is none
method = next((item for item in series_object['access_methods'] if item['method'] == 'children'), None)
if method:
  # Strip the 'drs://' scheme to get the DRS ID of each instance
  drs_ids = [content['drs_uri'].split('drs://')[1] for content in method['contents']]
  for i, drs_id in enumerate(drs_ids):
    print(f'{i}: {drs_id}')
else:
  print('A "children" access method was not found')



We can resolve each DRS ID to get a signed URL for the instance blob and copy each blob to the local file system.

Warning: This operation is very slow.

In [None]:
drs_server = 'https://nci-crdc.datacommons.io/ga4gh/drs/v1/objects'
!mkdir -vp /tmp/dicom_data

for drs_id in drs_ids:
  url = f"{drs_server}/{drs_id}"
  result = requests.get(url)
  result.raise_for_status()
  drsObject = result.json()
  # We need the access_id to get the signed URL 
  access_id = drsObject['access_methods'][0]['access_id']
  url = f'{url}/access/{access_id}'
  # Get the signed URL
  result = requests.get(url)
  result.raise_for_status()
  signed_url = result.json()['url']

  # Use requests to get the blob contents, failing early on an HTTP error
  instance_blob_id = drs_id.split('/')[-1]
  result = requests.get(signed_url)
  result.raise_for_status()
  with open(f"/tmp/dicom_data/{instance_blob_id}", "wb") as file_obj:
    file_obj.write(result.content)
  print(f'resolved and copied {drs_id}')

# Demonstrate that we've downloaded the blobs.      
!ls -l /tmp/dicom_data

# Clean up
!rm -vr /tmp/dicom_data





Of course, if the instance data in a series can be accessed using unsigned URLs, we can use the same process and simply skip the extra step of getting a signed URL.

We'll use the GCS library to copy the blobs.

In [None]:
drs_server = 'https://nci-crdc.datacommons.io/ga4gh/drs/v1/objects'
!mkdir -vp /tmp/dicom_data

for drs_id in drs_ids:
  url = f"{drs_server}/{drs_id}"
  result = requests.get(url)
  result.raise_for_status()
  drsObject = result.json()
  # The access_url holds the unsigned URL, so no access_id is needed
  unsigned_url = drsObject['access_methods'][0]['access_url']['url']

  # Use the GCS library to get the blob contents
  bucket_id = unsigned_url.split('/')[2]
  # The blob name is everything after the bucket; the local file name is its last component
  blob_name = unsigned_url.split('/', 3)[3]
  instance_blob_id = blob_name.split('/')[-1]
  bucket = client.bucket(bucket_id)
  with open(f"/tmp/dicom_data/{instance_blob_id}", "wb") as file_obj:
     bucket.blob(blob_name).download_to_file(file_obj)
  print(f'resolved and copied {drs_id}')

# Demonstrate that we've downloaded the blobs.      
!ls -l /tmp/dicom_data

# Clean up
!rm -vr /tmp/dicom_data
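
The two resolution paths above differ only in which field of the DrsObject's access method is used: `access_url` gives an unsigned URL directly, while `access_id` names an `/access/<access_id>` endpoint that returns a signed URL. The sketch below builds the relevant URLs without making any network calls. The helper names and the sample DrsObject (including its bucket and blob names) are hypothetical; only the server URL is taken from the cells above.

```python
from typing import Tuple

DRS_SERVER = "https://nci-crdc.datacommons.io/ga4gh/drs/v1/objects"


def drs_object_url(drs_uri: str, server: str = DRS_SERVER) -> str:
    """Map 'drs://<drs_id>' to the URL that returns the corresponding DrsObject."""
    prefix = "drs://"
    if not drs_uri.startswith(prefix):
        raise ValueError(f"Not a DRS URI: {drs_uri!r}")
    return f"{server}/{drs_uri[len(prefix):]}"


def choose_fetch_url(object_url: str, drs_object: dict, prefer_signed: bool = False) -> Tuple[str, str]:
    """Return ('unsigned', url) or ('access', url of the endpoint that issues a signed URL)."""
    for access_method in drs_object.get("access_methods", []):
        unsigned = (access_method.get("access_url") or {}).get("url")
        access_id = access_method.get("access_id")
        # Use the signed path when asked to, or when no unsigned URL is offered.
        if access_id and (prefer_signed or not unsigned):
            return "access", f"{object_url}/access/{access_id}"
        if unsigned:
            return "unsigned", unsigned
    raise LookupError("The DrsObject has no usable access method")


object_url = drs_object_url("drs://dg.4DFC/01210a30-8395-498c-905f-6667db67101a")

# Hypothetical DrsObject fragment; a real one is returned by requests.get(object_url).json().
sample_drs_object = {
    "access_methods": [
        {
            "type": "gs",
            "access_id": "gs",
            "access_url": {"url": "gs://example-bucket/example-instance.dcm"},
        }
    ]
}

unsigned_choice = choose_fetch_url(object_url, sample_drs_object)
signed_choice = choose_fetch_url(object_url, sample_drs_object, prefer_signed=True)
```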

## Obtaining Unsigned or Signed URLs of an Archive
As mentioned, there might be an archive, such as a zip, of all the instances in some series. Such a blob might be freely accessible or might require a signed URL.

For demonstration purposes, we will again assume that we have obtained the GCS URL, `gs://crdcobj/f5d6b517-2c02-4035-9444-0f15be7180ff/crdcobj.json`, of the previously discussed `series object`:

In [None]:
series_object_gcs_url = 'gs://crdcobj/f5d6b517-2c02-4035-9444-0f15be7180ff/crdcobj.json'

Next we'll use the GCS API to get the contents of the series object:

In [None]:
# Split out the bucket name
bucket_name = series_object_gcs_url.split('/')[2]
# Split out the blob name (<series_uuid>/crdcobj.json) and create a blob object
src_blob = client.bucket(bucket_name).blob(series_object_gcs_url.split('/', 3)[3])
# Get the contents of the blob
series_object = json.loads(src_blob.download_as_string())
series_object

Get the `archive_package` access method from the series object:

In [None]:

# The default makes next() return None when there is no such access method
method = next((item for item in series_object['access_methods'] if item['method'] == 'archive_package'), None)
if method:
  drs_uri = method['contents'][0]['drs_uri']
  print(f'drs_uri: {drs_uri}')
else:
  print('An "archive_package" access method was not found')

Note that we have not yet created any such archives and therefore can't demonstrate resolving the DRS URI of such an object. However, the process is the same as for any other DRS URI.
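
Once an `archive_package` DRS URI can be resolved and the zip's bytes fetched, the instances could be unpacked with the standard library. The following is only a sketch under stated assumptions: the archive is built in memory for illustration, its member names and contents are placeholders, and the layout of a real series archive has not been defined.

```python
import io
import zipfile
from typing import Dict


def build_placeholder_archive(members: Dict[str, bytes]) -> bytes:
    """Build an in-memory zip standing in for a downloaded series archive."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for name, data in members.items():
            archive.writestr(name, data)
    return buffer.getvalue()


def extract_instances(archive_bytes: bytes) -> Dict[str, bytes]:
    """Return {member name: contents} for every .dcm member of the archive."""
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as archive:
        return {
            name: archive.read(name)
            for name in archive.namelist()
            if name.endswith(".dcm")
        }


# Placeholder members; real archives would hold the DICOM instances of the series.
archive_bytes = build_placeholder_archive(
    {"instance-a.dcm": b"placeholder-a", "instance-b.dcm": b"placeholder-b"}
)
instances = extract_instances(archive_bytes)
```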

## Using the instance name hierarchy

The Hierarchical Instance Names notebook, previously mentioned, describes accessing the instances in a series when given the `folder object` of the series. If we have a `series object`, we can obtain the GCS URL of its folder object:

1. Resolve the DRS URI in the 'folder_object' method in the series's compound object (shown as `drs://dg.4DFC/some_TBD_uuid` in the above example; we've not yet generated these UUIDs).
2. Resolving the DRS URI will return a DrsObject. An `AccessURL` in the DrsObject will contain a `url` whose value is the GCS URL of the folder object.

The following example obtains the folder object DRS URI from the same `series_object` used to obtain the archive DRS URI.

In [None]:
# The default makes next() return None when there is no such access method
method = next((item for item in series_object['access_methods'] if item['method'] == 'folder_object'), None)
if method:
  drs_uri = method['contents'][0]['drs_uri']
  print(f'drs_uri: {drs_uri}')
else:
  print('A "folder_object" access method was not found')

We cannot demonstrate resolving this DRS URI because we have not yet indexed folder objects.