This notebook requires read/list access to idc-dev-etl buckets.

I've created a small subset of IDC data in two different buckets. Instances in the whc_si bucket have names like:  
\<series_uuid\>/\<instance_uuid\>.dcm.  
Instances in the whc_ssi bucket have names like:  
\<study_uuid\>/\<series_uuid\>/\<instance_uuid\>.dcm.
The two buckets are otherwise almost identical.

A tree of objects enumerates the entire hierarchy. The root of the tree is a blob named idc.idc. 
The children of the root are the IDC versions. There are similar objects for each version, collection, patient, study and series.

Every such object includes metadata about itself, including its object type, its hierarchical hash and its self_uri.
The "encoding" value would presumably be bumped if we revised the encoding, giving users a way to handle different encodings. 
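As a rough illustration of how a client might use the "encoding" value, the sketch below refuses objects whose encoding it does not recognize. The `SUPPORTED_ENCODINGS` set and the `check_encoding` helper are hypothetical names introduced here; "1.0" is the only encoding that appears in this notebook.

```python
# Hypothetical helper: accept only encodings this client knows how to parse.
SUPPORTED_ENCODINGS = {"1.0"}  # assumption: the only encoding shown in this notebook


def check_encoding(obj):
    """Return the object's encoding, or raise if this client does not support it."""
    encoding = obj.get("encoding")
    if encoding not in SUPPORTED_ENCODINGS:
        raise ValueError(f"unsupported encoding: {encoding!r}")
    return encoding


print(check_encoding({"encoding": "1.0", "object_type": "root"}))
```

A client written this way fails loudly on a revised encoding instead of silently misreading it.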

Each object also includes metadata about each of its children including a DICOM or other ID and the GCS URL of the corresponding object. We would also include AWS URLs if/when IDC data is on S3. 
Each patient and study object also includes the DRS object ID of each of its children.  
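To make the tree structure concrete, here is a minimal sketch of a depth-first walk from the root. It assumes only the `object_type`, `children`, `gs` and `gs_object_ids` keys that appear in the objects in this notebook. The `walk` function, the `fetch` callable and the tiny in-memory `store` are hypothetical stand-ins for real bucket access.

```python
def walk(fetch, blob_name, depth=0):
    """Yield (depth, object_type, blob_name) for an object and its descendants.

    `fetch` is any callable that maps a blob name to a parsed object.
    """
    obj = fetch(blob_name)
    yield depth, obj["object_type"], blob_name
    # Series objects list instances rather than child objects, so they have
    # no "children" key and the recursion stops there.
    children = obj.get("children", {})
    for child_blob in children.get("gs", {}).get("gs_object_ids", []):
        yield from walk(fetch, child_blob, depth + 1)


# Tiny in-memory stand-in for a bucket (hypothetical, heavily abbreviated).
store = {
    "idc.idc": {"object_type": "root",
                "children": {"gs": {"gs_object_ids": ["idc_v1.idc"]}}},
    "idc_v1.idc": {"object_type": "version",
                   "children": {"gs": {"gs_object_ids": ["c1.idc"]}}},
    "c1.idc": {"object_type": "collection",
               "children": {"gs": {"gs_object_ids": []}}},
}

for depth, object_type, name in walk(store.__getitem__, "idc.idc"):
    print("  " * depth + f"{object_type}: {name}")
```

Against a real bucket, `fetch` would download and parse the named blob. A full walk touches every object, so in practice it would usually be restricted to one version or collection.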


Define some routines to download and pretty print the contents of a blob:

In [4]:
%pip install -q google-cloud-storage
from google.cloud import storage
import json

client = storage.Client()

def get_json(bucket, blob_name):
    # Download the named blob as text and parse it as JSON.
    contents = bucket.blob(blob_name).download_as_text()
    return json.loads(contents)

def pp(obj, indent=2):
    # Pretty print a parsed JSON object.
    print(json.dumps(obj, indent=indent))

# For this part of the demonstration, the bucket will be 'whc_si'
bucket = client.bucket("whc_si")




Take a look at the contents of idc.idc:

In [5]:
blob_name = "idc.idc"
root = get_json(bucket, blob_name)
pp(root)

{
  "encoding": "1.0",
  "object_type": "root",
  "md5_hash": "9fd635af110a569b7404fc63682011cb",
  "self_uri": "gs://whc_si/idc.idc",
  "children": {
    "count": 10,
    "object_ids": [
      "idc_v1",
      "idc_v2",
      "idc_v3",
      "idc_v4",
      "idc_v5",
      "idc_v6",
      "idc_v7",
      "idc_v8",
      "idc_v9",
      "idc_v10"
    ],
    "gs": {
      "region": "us-central1",
      "bucket": "whc_si",
      "gs_object_ids": [
        "idc_v1.idc",
        "idc_v2.idc",
        "idc_v3.idc",
        "idc_v4.idc",
        "idc_v5.idc",
        "idc_v6.idc",
        "idc_v7.idc",
        "idc_v8.idc",
        "idc_v9.idc",
        "idc_v10.idc"
      ]
    }
  }
}


We can also do this from the command line:

In [6]:
!gsutil cat gs://whc_dev/idc.idc|python3 -m json.tool

{
    "encoding": "1.0",
    "object_type": "root",
    "md5_hash": "9fd635af110a569b7404fc63682011cb",
    "self_uri": "gs://whc_dev/idc.idc",
    "children": {
        "count": 10,
        "object_ids": [
            "idc_v1",
            "idc_v2",
            "idc_v3",
            "idc_v4",
            "idc_v5",
            "idc_v6",
            "idc_v7",
            "idc_v8",
            "idc_v9",
            "idc_v10"
        ],
        "gs": {
            "region": "us-central1",
            "bucket": "whc_dev",
            "gs_object_ids": [
                "idc_v1.idc",
                "idc_v2.idc",
                "idc_v3.idc",
                "idc_v4.idc",
                "idc_v5.idc",
                "idc_v6.idc",
                "idc_v7.idc",
                "idc_v8.idc",
                "idc_v9.idc",
                "idc_v10.idc"
            ]
        }
    }
}


We see that there are currently ten IDC versions. 

Let's look at version 1. Its version object is a blob named idc_v1.idc. 

Instance blobs have the .dcm extension; we'll look at them later. All other blobs have the .idc extension.
The items in the object_ids list have the same ordering as the items in the gs_object_ids list. We'll create another routine to do the lookup from object_id to gs_object_id:


In [8]:
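def get_blob_name(jsn, object_id):
    # object_ids and gs_object_ids are parallel lists, so the position of
    # object_id in the first is the position of its blob name in the second.
    indx = jsn['children']['object_ids'].index(object_id)
    blob_name = jsn['children']['gs']['gs_object_ids'][indx]
    return blob_name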
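
Because the two lists are parallel, the same lookup can also be expressed as a dictionary built once per object. This is only a sketch of an alternative; `child_blob_names` and `root_example` are names introduced here, and `root_example` is an abbreviated copy of the root object printed earlier.

```python
def child_blob_names(obj):
    """Map each child object_id to its GCS blob name."""
    children = obj["children"]
    return dict(zip(children["object_ids"], children["gs"]["gs_object_ids"]))


# Abbreviated copy of the root object (only the first two versions).
root_example = {
    "object_type": "root",
    "children": {
        "count": 2,
        "object_ids": ["idc_v1", "idc_v2"],
        "gs": {"bucket": "whc_si", "gs_object_ids": ["idc_v1.idc", "idc_v2.idc"]},
    },
}

print(child_blob_names(root_example)["idc_v2"])
```

A dictionary avoids a linear search on every lookup, which may matter for collections with many patients.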


Now we can look at idc_v1's object:

In [9]:
blob_name = get_blob_name(root,"idc_v1")
v1 = get_json(bucket, blob_name)
pp(v1)

{
  "encoding": "1.0",
  "object_type": "version",
  "version_id": 1,
  "md5_hash": "304ecfae7d9812435d058563d39a07a9",
  "self_uri": "whc_si/idc_v1.idc",
  "children": {
    "count": 2,
    "object_ids": [
      "tcga_esca",
      "tcga_read"
    ],
    "gs": {
      "region": "us-central1",
      "bucket": "whc_si",
      "gs_object_ids": [
        "c919eb76-9395-4ddb-89a7-bd5f800598fa.idc",
        "c4f84514-72bf-41e4-864c-d8c728a8df51.idc"
      ]
    }
  }
}


We see that there are two collections in version 1, tcga_esca and tcga_read. We'll look at tcga_read:

In [10]:
blob_name = get_blob_name(v1,"tcga_read")
tcga_read_v1 = get_json(bucket, blob_name)
pp(tcga_read_v1)

{
  "encoding": "1.0",
  "object_type": "collection",
  "tcia_api_collection_id": "TCGA-READ",
  "idc_webapp_collection_id": "tcga_read",
  "uuid": "c4f84514-72bf-41e4-864c-d8c728a8df51",
  "md5_hash": "86d9aa0d299c202a38212402d08e8946",
  "init_idc_version": 1,
  "rev_idc_version": 1,
  "final_idc_version": 7,
  "self_uri": "whc_si/c4f84514-72bf-41e4-864c-d8c728a8df51.idc",
  "children": {
    "count": 3,
    "object_ids": [
      "TCGA-BM-6198",
      "TCGA-CL-5917",
      "TCGA-CL-4957"
    ],
    "gs": {
      "region": "us-central1",
      "bucket": "gs://whc_si",
      "gs_object_ids": [
        "a83d4205-ecd3-48b2-b23c-9e3fe17f3422.idc",
        "b58e6030-4e72-4772-8c95-4b93a122f560.idc",
        "c6176059-a961-495e-a128-df4d6cdf302e.idc"
      ]
    }
  }
}


The collection metadata tells us that tcga_read first appeared in IDC version 1, that this version of tcga_read was "retired" after IDC version 7, and that it includes 3 patients. Because collections are seldom "retired", this means there will likely be a revised version of tcga_read in IDC version 8. We can check this by inspecting the idc_v8 object:


In [11]:
blob_name = get_blob_name(root,"idc_v8")
v8 = get_json(bucket, blob_name)
pp(v8)

{
  "encoding": "1.0",
  "object_type": "version",
  "version_id": 8,
  "md5_hash": "d2ed1a322ddd82714fc931f47f525cdf",
  "self_uri": "whc_si/idc_v8.idc",
  "children": {
    "count": 4,
    "object_ids": [
      "apollo_5_lscc",
      "cptac_sar",
      "tcga_esca",
      "tcga_read"
    ],
    "gs": {
      "region": "us-central1",
      "bucket": "whc_si",
      "gs_object_ids": [
        "4467f3bf-3ed2-445d-95b3-ff48df10006a.idc",
        "43519368-8044-4235-a601-1c5f91984467.idc",
        "960a3040-bd35-4caf-8e71-11379695f942.idc",
        "990f73f6-3f62-4007-8471-51792732880d.idc"
      ]
    }
  }
}


tcga_esca is also in v8, and two additional collections have been added since v1. 
If we look at the tcga_read revision in IDC v8, we will see that it now has 171 patients, up from the original 3.
Because final_idc_version==0, we know that this version of tcga_read has not been retired:

In [12]:
blob_name = get_blob_name(v8,"tcga_read")
tcga_read_v8 = get_json(bucket, blob_name)
pp(tcga_read_v8)

{
  "encoding": "1.0",
  "object_type": "collection",
  "tcia_api_collection_id": "TCGA-READ",
  "idc_webapp_collection_id": "tcga_read",
  "uuid": "990f73f6-3f62-4007-8471-51792732880d",
  "md5_hash": "946235a93fa85f45a09347b68f0d415b",
  "init_idc_version": 1,
  "rev_idc_version": 8,
  "final_idc_version": 0,
  "self_uri": "whc_si/990f73f6-3f62-4007-8471-51792732880d.idc",
  "children": {
    "count": 171,
    "object_ids": [
      "TCGA-AF-2693",
      "TCGA-DC-6681",
      "TCGA-DC-6156",
      "TCGA-AG-3600",
      "TCGA-F5-6864",
      "TCGA-AG-3882",
      "TCGA-CI-6623",
      "TCGA-AG-3728",
      "TCGA-AG-4021",
      "TCGA-F5-6464",
      "TCGA-AF-3913",
      "TCGA-AG-3598",
      "TCGA-CI-6622",
      "TCGA-AF-3400",
      "TCGA-AG-4001",
      "TCGA-AG-3583",
      "TCGA-AH-6897",
      "TCGA-DY-A1DC",
      "TCGA-AH-6549",
      "TCGA-AG-A008",
      "TCGA-AG-3902",
      "TCGA-AG-3727",
      "TCGA-DT-5265",
      "TCGA-AF-4110",
      "TCGA-F5-6861",
      "TCGA-AG-357

Note that our encoding doesn't directly support finding other versions of an object. We could add a siblings list to each object, if that would be useful. Similarly, we could add a parents list. 
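
Until something like a siblings list exists, one possible workaround is to scan the version objects and collect the distinct blobs that a given object_id has pointed to. The sketch below does this for collections. `find_revisions` is a hypothetical helper, and the two version objects are abbreviated copies of the idc_v1 and idc_v8 objects printed above.

```python
def find_revisions(version_objects, object_id):
    """Return [(version_id, blob_name)] for each distinct revision of object_id.

    Each revision is reported with the earliest scanned version that contains it.
    """
    seen = set()
    revisions = []
    for version in sorted(version_objects, key=lambda v: v["version_id"]):
        children = version["children"]
        if object_id not in children["object_ids"]:
            continue
        index = children["object_ids"].index(object_id)
        blob_name = children["gs"]["gs_object_ids"][index]
        if blob_name not in seen:
            seen.add(blob_name)
            revisions.append((version["version_id"], blob_name))
    return revisions


# Abbreviated copies of the idc_v1 and idc_v8 version objects.
v1_example = {
    "version_id": 1,
    "children": {
        "object_ids": ["tcga_esca", "tcga_read"],
        "gs": {"gs_object_ids": [
            "c919eb76-9395-4ddb-89a7-bd5f800598fa.idc",
            "c4f84514-72bf-41e4-864c-d8c728a8df51.idc",
        ]},
    },
}
v8_example = {
    "version_id": 8,
    "children": {
        "object_ids": ["tcga_esca", "tcga_read"],
        "gs": {"gs_object_ids": [
            "960a3040-bd35-4caf-8e71-11379695f942.idc",
            "990f73f6-3f62-4007-8471-51792732880d.idc",
        ]},
    },
}

for version_id, blob_name in find_revisions([v8_example, v1_example], "tcga_read"):
    print(version_id, blob_name)
```

With only two versions scanned, the reported version is the first scanned version containing each revision; scanning all versions would give the true first appearance. This costs one download per version object, which is why a siblings list in the encoding itself could be preferable.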

Let's look at a patient in the initial tcga_read version.

In [13]:
blob_name = get_blob_name(tcga_read_v1,"TCGA-BM-6198")
patient_TCGA_BM_6198 = get_json(bucket, blob_name)
pp(patient_TCGA_BM_6198)

{
  "encoding": "1.0",
  "object_type": "patient",
  "submitter_case_id": "TCGA-BM-6198",
  "idc_case_id": "2bc64b47-ae06-43ff-8ae2-307830a68426",
  "uuid": "a83d4205-ecd3-48b2-b23c-9e3fe17f3422",
  "md5_hash": "24758c365b922a2a9430e5bd3448301d",
  "init_idc_version": 1,
  "rev_idc_version": 1,
  "final_idc_version": 7,
  "self_uri": "gs://whc_si/a83d4205-ecd3-48b2-b23c-9e3fe17f3422.idc",
  "children": {
    "count": 2,
    "object_ids": [
      "1.3.6.1.4.1.14519.5.2.1.8421.4018.329305334176079996095294344892",
      "1.3.6.1.4.1.14519.5.2.1.8421.4018.304030957341830836628192929917"
    ],
    "gs": {
      "region": "us-central1",
      "bucket": "whc_si",
      "gs_object_ids": [
        "16532652-6fb0-41ca-bb21-705511f93dde.idc",
        "7efeae5d-6263-4184-9ad4-8df22720ada9.idc"
      ]
    },
    "drs": {
      "drs_server": "drs://nci-crdc.datacommons.io",
      "drs_object_ids": [
        "dg.4DFC/16532652-6fb0-41ca-bb21-705511f93dde",
        "dg.4DFC/7efeae5d-6263-4184-9ad4-8

This patient has two studies. In addition to the list of study GCS URLs, the patient object also includes a DRS object ID for each of the studies. (This presumes that study (and series) objects are indexed with DCF as DRS blobs.) 

These study and series objects could replace DRS bundles. Therefore, resolving such a DRS object ID will presumably yield the GCS URL of the study object. This will be the same URL as in the corresponding element in the gs_object_ids list.
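
As a sketch of how a client might pair each child with a full DRS URI, the helper below joins the `drs_server` value with each entry in `drs_object_ids`. Joining them with a "/" is an assumption, chosen because it reproduces the form of the `drs_object_id` field shown in study and series objects. `drs_uris` and `patient_example` are names introduced here; `patient_example` keeps only the first study of the patient above.

```python
def drs_uris(obj):
    """Map each child object_id to a full DRS URI (assumed form: <server>/<id>)."""
    children = obj["children"]
    drs = children["drs"]
    server = drs["drs_server"].rstrip("/")
    return {
        object_id: f"{server}/{drs_id}"
        for object_id, drs_id in zip(children["object_ids"], drs["drs_object_ids"])
    }


# Abbreviated copy of the patient object (first study only).
patient_example = {
    "object_type": "patient",
    "children": {
        "object_ids": [
            "1.3.6.1.4.1.14519.5.2.1.8421.4018.329305334176079996095294344892",
        ],
        "drs": {
            "drs_server": "drs://nci-crdc.datacommons.io",
            "drs_object_ids": ["dg.4DFC/16532652-6fb0-41ca-bb21-705511f93dde"],
        },
    },
}

for study_uid, uri in drs_uris(patient_example).items():
    print(study_uid, "->", uri)
```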

We could, of course, index root, version, collection and patient objects as well. There are relatively few of these objects: currently 9 version objects, 189 collection objects and 73155 patient objects.

Continuing on, let's look at the study object whose object_id (in this case a StudyInstanceUID) is 1.3.6.1.4.1.14519.5.2.1.8421.4018.304030957341830836628192929917:

In [14]:
blob_name = get_blob_name(patient_TCGA_BM_6198,"1.3.6.1.4.1.14519.5.2.1.8421.4018.304030957341830836628192929917")
study_917 = get_json(bucket, blob_name)
pp(study_917)

{
  "encoding": "1.0",
  "object_type": "study",
  "StudyInstanceUID": "1.3.6.1.4.1.14519.5.2.1.8421.4018.304030957341830836628192929917",
  "uuid": "7efeae5d-6263-4184-9ad4-8df22720ada9",
  "md5_hash": "39c636d36d9aad3a40ab00690835eb09",
  "init_idc_version": 1,
  "rev_idc_version": 1,
  "final_idc_version": 0,
  "self_uri": "gs://whc_si/7efeae5d-6263-4184-9ad4-8df22720ada9.idc",
  "drs_object_id": "drs://nci-crdc.datacommons.io/dg.4DFC/7efeae5d-6263-4184-9ad4-8df22720ada9",
  "children": {
    "count": 27,
    "object_ids": [
      "1.3.6.1.4.1.14519.5.2.1.8421.4018.242623218081388666714980416901",
      "1.3.6.1.4.1.14519.5.2.1.8421.4018.282430338990386948066869857090",
      "1.3.6.1.4.1.14519.5.2.1.8421.4018.303578576216694624163326470809",
      "1.3.6.1.4.1.14519.5.2.1.8421.4018.303561813582038351658059976611",
      "1.3.6.1.4.1.14519.5.2.1.8421.4018.524153479299178096224190777660",
      "1.3.6.1.4.1.14519.5.2.1.8421.4018.164159958994050807616117604470",
      "1.3.6.1.4.1.145

There are 27 series in this study. We'll have a look at the first one:

In [15]:
blob_name = get_blob_name(study_917,"1.3.6.1.4.1.14519.5.2.1.8421.4018.242623218081388666714980416901")
series_901 = get_json(bucket, blob_name)
pp(series_901)

{
  "encoding": "1.0",
  "object_type": "series",
  "SeriesInstanceUID": "1.3.6.1.4.1.14519.5.2.1.8421.4018.242623218081388666714980416901",
  "source_doi": "10.7937/K9/TCIA.2016.F7PPNPNU",
  "source_url": "",
  "uuid": "0190fe71-7144-40ae-a24c-c8d21a99317d",
  "md5_hash": "4ef53629c644cb9037e1e0c3422094eb",
  "init_idc_version": 1,
  "rev_idc_version": 1,
  "final_idc_version": 0,
  "self_uri": "gs://whc_si/0190fe71-7144-40ae-a24c-c8d21a99317d.idc",
  "drs_object_id": "drs://nci-crdc.datacommons.io/dg.4DFC/0190fe71-7144-40ae-a24c-c8d21a99317d",
  "instances": {
    "count": 40,
    "SOPInstanceUIDs": [
      "1.3.6.1.4.1.14519.5.2.1.8421.4018.218816382845480339633358956983",
      "1.3.6.1.4.1.14519.5.2.1.8421.4018.241858580971386059648748116938",
      "1.3.6.1.4.1.14519.5.2.1.8421.4018.171808263784449313763905577800",
      "1.3.6.1.4.1.14519.5.2.1.8421.4018.158180798539531140053983330841",
      "1.3.6.1.4.1.14519.5.2.1.8421.4018.162146643573228585012915154839",
      "1.3.6.1.4.1.

The series consists of 40 instances. Notice that the gs dictionary has a "bucket/folder" element rather than the "bucket" element that the higher-level objects have. This is because instances are named \<series_uuid\>/\<instance_uuid\>.dcm. Structuring instance names in this way makes it somewhat easier to deal with all the instances in a series. For example, on the command line, we can list all the instances in the series:

In [16]:
!gsutil ls -l  gs://whc_dev/0190fe71-7144-40ae-a24c-c8d21a99317d/*




         0  2022-06-01T18:41:53Z  gs://whc_dev/0190fe71-7144-40ae-a24c-c8d21a99317d/
    101386  2022-03-21T23:02:32Z  gs://whc_dev/0190fe71-7144-40ae-a24c-c8d21a99317d/1.3.6.1.4.1.14519.5.2.1.8421.4018.101624474422696304500489142527.dcm
    101382  2022-03-21T23:02:32Z  gs://whc_dev/0190fe71-7144-40ae-a24c-c8d21a99317d/1.3.6.1.4.1.14519.5.2.1.8421.4018.115940616503179287436193856892.dcm
    101386  2022-03-21T23:02:32Z  gs://whc_dev/0190fe71-7144-40ae-a24c-c8d21a99317d/1.3.6.1.4.1.14519.5.2.1.8421.4018.117426665486356168553370990774.dcm
    101386  2022-03-21T23:02:33Z  gs://whc_dev/0190fe71-7144-40ae-a24c-c8d21a99317d/1.3.6.1.4.1.14519.5.2.1.8421.4018.127687675413194415711297550169.dcm
    101386  2022-03-21T23:02:33Z  gs://whc_dev/0190fe71-7144-40ae-a24c-c8d21a99317d/1.3.6.1.4.1.14519.5.2.1.8421.4018.150463956086078029825544487248.dcm
    101386  2022-03-21T23:02

...and easy to copy all the instances in a series to our local file system:

In [17]:
!mkdir -p /tmp/0190fe71-7144-40ae-a24c-c8d21a99317d
!gsutil -m -q cp gs://whc_dev/0190fe71-7144-40ae-a24c-c8d21a99317d/* /tmp/0190fe71-7144-40ae-a24c-c8d21a99317d
!ls -l /tmp/0190fe71-7144-40ae-a24c-c8d21a99317d/*


-rw-r--r--  1 BillClifford  wheel  101386 Jun 14 17:50 /tmp/0190fe71-7144-40ae-a24c-c8d21a99317d/1.3.6.1.4.1.14519.5.2.1.8421.4018.101624474422696304500489142527.dcm
-rw-r--r--  1 BillClifford  wheel  101382 Jun 14 17:50 /tmp/0190fe71-7144-40ae-a24c-c8d21a99317d/1.3.6.1.4.1.14519.5.2.1.8421.4018.115940616503179287436193856892.dcm
-rw-r--r--  1 BillClifford  wheel  101386 Jun 14 17:50 /tmp/0190fe71-7144-40ae-a24c-c8d21a99317d/1.3.6.1.4.1.14519.5.2.1.8421.4018.117426665486356168553370990774.dcm
-rw-r--r--  1 BillClifford  wheel  101386 Jun 14 17:50 /tmp/0190fe71-7144-40ae-a24c-c8d21a99317d/1.3.6.1.4.1.14519.5.2.1.8421.4018.127687675413194415711297550169.dcm
-rw-r--r--  1 BillClifford  wheel  101386 Jun 14 17:50 /tmp/0190fe71-7144-40ae-a24c-c8d21a99317d/1.3.6.1.4.1.14519.5.2.1.8421.4018.150463956086078029825544487248.dcm
-rw-r--r--  1 BillClifford  wheel  101386 Jun 14 17:50 /tmp/0190fe71-7144-40ae-a24c-c8d21a99317d/1.3.6.1.4.1.14519.5.2.1.8421.4018.158180798539531140053983330841.dcm

We can now dump the metadata of one of these DICOM files which we've downloaded:

In [18]:
from pydicom import dcmread as rdcm
r = rdcm('/tmp/0190fe71-7144-40ae-a24c-c8d21a99317d/1.3.6.1.4.1.14519.5.2.1.8421.4018.253459301039207139994226290725.dcm')
print(r)

Dataset.file_meta -------------------------------
(0002, 0000) File Meta Information Group Length  UL: 196
(0002, 0001) File Meta Information Version       OB: b'\x00\x01'
(0002, 0002) Media Storage SOP Class UID         UI: MR Image Storage
(0002, 0003) Media Storage SOP Instance UID      UI: 1.3.6.1.4.1.14519.5.2.1.8421.4018.253459301039207139994226290725
(0002, 0010) Transfer Syntax UID                 UI: Explicit VR Little Endian
(0002, 0012) Implementation Class UID            UI: 1.2.40.0.13.1.1.1
(0002, 0013) Implementation Version Name         SH: 'dcm4che-1.4.34'
-------------------------------------------------
(0008, 0005) Specific Character Set              CS: 'ISO_IR 100'
(0008, 0008) Image Type                          CS: ['ORIGINAL', 'PRIMARY', 'M', 'NORM', 'DIS2D']
(0008, 0012) Instance Creation Date              DA: '19910209'
(0008, 0013) Instance Creation Time              TM: '141912.890000'
(0008, 0016) SOP Class UID                       UI: MR Image Storage
(0

Delete all the DICOM instances.

In [19]:
!rm -r /tmp/0190fe71-7144-40ae-a24c-c8d21a99317d/
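
The series listing done with gsutil earlier could also be done from Python. The sketch below relies on the `list_blobs(bucket_or_name, prefix=...)` method of `google.cloud.storage.Client`; the `list_series_instances` helper is a name introduced here. The function accepts any client object with that method, so it can be exercised without network access.

```python
def list_series_instances(client, bucket_name, series_uuid):
    """Return the sorted names of the .dcm blobs stored under <series_uuid>/."""
    prefix = f"{series_uuid}/"
    blobs = client.list_blobs(bucket_name, prefix=prefix)
    # The zero-length "folder" placeholder object does not end in .dcm,
    # so this filter drops it.
    return sorted(blob.name for blob in blobs if blob.name.endswith(".dcm"))


# Intended usage, assuming credentials with list access to the bucket:
#   from google.cloud import storage
#   names = list_series_instances(storage.Client(), "whc_dev",
#                                 "0190fe71-7144-40ae-a24c-c8d21a99317d")
```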

Given the uuid of any collection, patient, study or series object, and the bucket where objects are kept, we can get the object's information by adding '.idc' to the uuid.
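
The naming rule can be captured in a pair of small helpers. This is a sketch; `idc_blob_name` and `idc_gs_uri` are hypothetical names, and validating the uuid with the standard `uuid` module is an extra step added here, not something the encoding requires.

```python
import uuid as uuid_lib


def idc_blob_name(object_uuid):
    """Return '<uuid>.idc' after checking that the argument parses as a UUID."""
    parsed = uuid_lib.UUID(object_uuid)  # raises ValueError if malformed
    return f"{parsed}.idc"


def idc_gs_uri(bucket_name, object_uuid):
    """Return the gs:// URI of the object's blob in the given bucket."""
    return f"gs://{bucket_name}/{idc_blob_name(object_uuid)}"


print(idc_gs_uri("whc_si", "6553ee7e-7752-4689-80c0-03a5a35744e1"))
```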

A series uuid:

In [22]:
uuid = '6553ee7e-7752-4689-80c0-03a5a35744e1'
obj = get_json(bucket, f"{uuid}.idc")  # 'obj' avoids shadowing the built-in 'object'
pp(obj)

{
  "encoding": "1.0",
  "object_type": "series",
  "SeriesInstanceUID": "1.3.6.1.4.1.14519.5.2.1.8421.4018.323516750746191384774778750988",
  "source_doi": "10.7937/K9/TCIA.2016.F7PPNPNU",
  "source_url": "",
  "uuid": "6553ee7e-7752-4689-80c0-03a5a35744e1",
  "md5_hash": "3620b7778acd179706cf797971878848",
  "init_idc_version": 1,
  "rev_idc_version": 1,
  "final_idc_version": 0,
  "self_uri": "gs://whc_si/6553ee7e-7752-4689-80c0-03a5a35744e1.idc",
  "drs_object_id": "drs://nci-crdc.datacommons.io/dg.4DFC/6553ee7e-7752-4689-80c0-03a5a35744e1",
  "instances": {
    "count": 40,
    "SOPInstanceUIDs": [
      "1.3.6.1.4.1.14519.5.2.1.8421.4018.151275341243018498725810943966",
      "1.3.6.1.4.1.14519.5.2.1.8421.4018.227585305007378379506744979124",
      "1.3.6.1.4.1.14519.5.2.1.8421.4018.150267274381683246307003343142",
      "1.3.6.1.4.1.14519.5.2.1.8421.4018.146951145204043238869868630688",
      "1.3.6.1.4.1.14519.5.2.1.8421.4018.338885100478996297908651749333",
      "1.3.6.1.4.1.

A study:

In [24]:
uuid = "7efeae5d-6263-4184-9ad4-8df22720ada9"
obj = get_json(bucket, f"{uuid}.idc")
pp(obj)


{
  "encoding": "1.0",
  "object_type": "study",
  "StudyInstanceUID": "1.3.6.1.4.1.14519.5.2.1.8421.4018.304030957341830836628192929917",
  "uuid": "7efeae5d-6263-4184-9ad4-8df22720ada9",
  "md5_hash": "39c636d36d9aad3a40ab00690835eb09",
  "init_idc_version": 1,
  "rev_idc_version": 1,
  "final_idc_version": 0,
  "self_uri": "gs://whc_si/7efeae5d-6263-4184-9ad4-8df22720ada9.idc",
  "drs_object_id": "drs://nci-crdc.datacommons.io/dg.4DFC/7efeae5d-6263-4184-9ad4-8df22720ada9",
  "children": {
    "count": 27,
    "object_ids": [
      "1.3.6.1.4.1.14519.5.2.1.8421.4018.242623218081388666714980416901",
      "1.3.6.1.4.1.14519.5.2.1.8421.4018.282430338990386948066869857090",
      "1.3.6.1.4.1.14519.5.2.1.8421.4018.303578576216694624163326470809",
      "1.3.6.1.4.1.14519.5.2.1.8421.4018.303561813582038351658059976611",
      "1.3.6.1.4.1.14519.5.2.1.8421.4018.524153479299178096224190777660",
      "1.3.6.1.4.1.14519.5.2.1.8421.4018.164159958994050807616117604470",
      "1.3.6.1.4.1.145

A patient:

In [25]:
uuid = "a83d4205-ecd3-48b2-b23c-9e3fe17f3422"
obj = get_json(bucket, f"{uuid}.idc")
pp(obj)


{
  "encoding": "1.0",
  "object_type": "patient",
  "submitter_case_id": "TCGA-BM-6198",
  "idc_case_id": "2bc64b47-ae06-43ff-8ae2-307830a68426",
  "uuid": "a83d4205-ecd3-48b2-b23c-9e3fe17f3422",
  "md5_hash": "24758c365b922a2a9430e5bd3448301d",
  "init_idc_version": 1,
  "rev_idc_version": 1,
  "final_idc_version": 7,
  "self_uri": "gs://whc_si/a83d4205-ecd3-48b2-b23c-9e3fe17f3422.idc",
  "children": {
    "count": 2,
    "object_ids": [
      "1.3.6.1.4.1.14519.5.2.1.8421.4018.329305334176079996095294344892",
      "1.3.6.1.4.1.14519.5.2.1.8421.4018.304030957341830836628192929917"
    ],
    "gs": {
      "region": "us-central1",
      "bucket": "whc_si",
      "gs_object_ids": [
        "16532652-6fb0-41ca-bb21-705511f93dde.idc",
        "7efeae5d-6263-4184-9ad4-8df22720ada9.idc"
      ]
    },
    "drs": {
      "drs_server": "drs://nci-crdc.datacommons.io",
      "drs_object_ids": [
        "dg.4DFC/16532652-6fb0-41ca-bb21-705511f93dde",
        "dg.4DFC/7efeae5d-6263-4184-9ad4-8

A collection:

In [26]:
uuid = "2f0b8396-e892-4715-81d3-81c40bcf6637"
obj = get_json(bucket, f"{uuid}.idc")
pp(obj)


{
  "encoding": "1.0",
  "object_type": "collection",
  "tcia_api_collection_id": "CPTAC-SAR",
  "idc_webapp_collection_id": "cptac_sar",
  "uuid": "2f0b8396-e892-4715-81d3-81c40bcf6637",
  "md5_hash": "e6f1f1499ce297923cadf83b7df99e32",
  "init_idc_version": 2,
  "rev_idc_version": 2,
  "final_idc_version": 6,
  "self_uri": "whc_si/2f0b8396-e892-4715-81d3-81c40bcf6637.idc",
  "children": {
    "count": 24,
    "object_ids": [
      "C3L-03366",
      "C3L-03200",
      "C3N-03232",
      "C3L-03953",
      "C3L-01466",
      "C3L-01551",
      "C3N-03489",
      "C3L-03960",
      "C3L-01059",
      "C3N-00875",
      "C3N-01002",
      "C3L-02851",
      "C3L-03197",
      "C3L-03403",
      "C3L-03166",
      "C3N-03914",
      "C3L-02846",
      "C3L-03253",
      "C3L-03196",
      "C3L-01038",
      "C3N-04164",
      "C3L-03956",
      "C3N-00843",
      "C3L-02213"
    ],
    "gs": {
      "region": "us-central1",
      "bucket": "gs://whc_si",
      "gs_object_ids": [
        

The object metadata enumerates the object type and other details. A criticism of DRS is that the returned object is not self describing in this way. 
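
Because every object carries its own `object_type`, a client can handle an arbitrary uuid without knowing in advance what it refers to. The sketch below summarizes any object using only keys seen in this notebook: series objects carry an `instances` dictionary, and the other object types carry a `children` dictionary. `summarize` is a hypothetical helper.

```python
def summarize(obj):
    """Return (object_type, member_kind, count) for any .idc object."""
    if "instances" in obj:
        # Series objects enumerate DICOM instances.
        return obj["object_type"], "instances", obj["instances"]["count"]
    # Root, version, collection, patient and study objects enumerate children.
    return obj["object_type"], "children", obj["children"]["count"]


# Abbreviated copies of a series object and a study object shown above.
series_example = {"object_type": "series", "instances": {"count": 40}}
study_example = {"object_type": "study", "children": {"count": 27}}

print(summarize(series_example))
print(summarize(study_example))
```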

We'll now target the whc_ssi bucket:

In [46]:
bucket = client.bucket("whc_ssi")

Now when we examine a series object, we will see that the bucket/folder field has the form \<study_uuid\>/\<series_uuid\>, whereas in the whc_si bucket it had the form \<series_uuid\>:

In [47]:
uuid = "6553ee7e-7752-4689-80c0-03a5a35744e1"
obj = get_json(bucket, f"{uuid}.idc")
pp(obj)

{
  "encoding": "1.0",
  "object_type": "series",
  "SeriesInstanceUID": "1.3.6.1.4.1.14519.5.2.1.8421.4018.323516750746191384774778750988",
  "source_doi": "10.7937/K9/TCIA.2016.F7PPNPNU",
  "source_url": "",
  "uuid": "6553ee7e-7752-4689-80c0-03a5a35744e1",
  "md5_hash": "3620b7778acd179706cf797971878848",
  "init_idc_version": 1,
  "rev_idc_version": 1,
  "final_idc_version": 0,
  "self_uri": "gs://whc_ssi/6553ee7e-7752-4689-80c0-03a5a35744e1.idc",
  "drs_object_id": "drs://nci-crdc.datacommons.io/dg.4DFC/6553ee7e-7752-4689-80c0-03a5a35744e1",
  "instances": {
    "count": 40,
    "SOPInstanceUIDs": [
      "1.3.6.1.4.1.14519.5.2.1.8421.4018.151275341243018498725810943966",
      "1.3.6.1.4.1.14519.5.2.1.8421.4018.227585305007378379506744979124",
      "1.3.6.1.4.1.14519.5.2.1.8421.4018.150267274381683246307003343142",
      "1.3.6.1.4.1.14519.5.2.1.8421.4018.146951145204043238869868630688",
      "1.3.6.1.4.1.14519.5.2.1.8421.4018.338885100478996297908651749333",
      "1.3.6.1.4.1

Now we can use gsutil to list all the series in a study:

In [41]:
!gsutil ls gs://whc_ssi/7efeae5d-6263-4184-9ad4-8df22720ada9/

gs://whc_ssi/7efeae5d-6263-4184-9ad4-8df22720ada9/
gs://whc_ssi/7efeae5d-6263-4184-9ad4-8df22720ada9/1.3.6.1.4.1.14519.5.2.1.8421.4018.115796396159931879291210366669/
gs://whc_ssi/7efeae5d-6263-4184-9ad4-8df22720ada9/1.3.6.1.4.1.14519.5.2.1.8421.4018.116333616240640138163404191997/
gs://whc_ssi/7efeae5d-6263-4184-9ad4-8df22720ada9/1.3.6.1.4.1.14519.5.2.1.8421.4018.155661808279350667660503899999/
gs://whc_ssi/7efeae5d-6263-4184-9ad4-8df22720ada9/1.3.6.1.4.1.14519.5.2.1.8421.4018.158023417769972666501480099896/
gs://whc_ssi/7efeae5d-6263-4184-9ad4-8df22720ada9/1.3.6.1.4.1.14519.5.2.1.8421.4018.164159958994050807616117604470/
gs://whc_ssi/7efeae5d-6263-4184-9ad4-8df22720ada9/1.3.6.1.4.1.14519.5.2.1.8421.4018.171278458682577702045055744172/
gs://whc_ssi/7efeae5d-6263-4184-9ad4-8df22720ada9/1.3.6.1.4.1.14519.5.2.1.8421.4018.198461476658037865560948236665/
gs://whc_ssi/7efeae5d-6263-4184-9ad4-8df22720ada9/1.3.6.1.4.1.14519.5.2.1.8421.4018.206096527423871496086339173610/
gs://whc_ssi

...and all the instances in a series in a study:

In [42]:
!gsutil ls gs://whc_ssi/7efeae5d-6263-4184-9ad4-8df22720ada9/1.3.6.1.4.1.14519.5.2.1.8421.4018.115796396159931879291210366669

gs://whc_ssi/7efeae5d-6263-4184-9ad4-8df22720ada9/1.3.6.1.4.1.14519.5.2.1.8421.4018.115796396159931879291210366669/
gs://whc_ssi/7efeae5d-6263-4184-9ad4-8df22720ada9/1.3.6.1.4.1.14519.5.2.1.8421.4018.115796396159931879291210366669/1.3.6.1.4.1.14519.5.2.1.8421.4018.110238514170356133957991595438.dcm
gs://whc_ssi/7efeae5d-6263-4184-9ad4-8df22720ada9/1.3.6.1.4.1.14519.5.2.1.8421.4018.115796396159931879291210366669/1.3.6.1.4.1.14519.5.2.1.8421.4018.114193003786715624244450650714.dcm
gs://whc_ssi/7efeae5d-6263-4184-9ad4-8df22720ada9/1.3.6.1.4.1.14519.5.2.1.8421.4018.115796396159931879291210366669/1.3.6.1.4.1.14519.5.2.1.8421.4018.121020043490528745973829325260.dcm
gs://whc_ssi/7efeae5d-6263-4184-9ad4-8df22720ada9/1.3.6.1.4.1.14519.5.2.1.8421.4018.115796396159931879291210366669/1.3.6.1.4.1.14519.5.2.1.8421.4018.123044677903361463906500937018.dcm
gs://whc_ssi/7efeae5d-6263-4184-9ad4-8df22720ada9/1.3.6.1.4.1.14519.5.2.1.8421.4018.115796396159931879291210366669/1.3.6.1.4.1.14519.5.2.1.8421

...and all the instances in a study:

In [43]:
!gsutil ls gs://whc_ssi/7efeae5d-6263-4184-9ad4-8df22720ada9/*/*.dcm

gs://whc_ssi/7efeae5d-6263-4184-9ad4-8df22720ada9/1.3.6.1.4.1.14519.5.2.1.8421.4018.115796396159931879291210366669/1.3.6.1.4.1.14519.5.2.1.8421.4018.110238514170356133957991595438.dcm
gs://whc_ssi/7efeae5d-6263-4184-9ad4-8df22720ada9/1.3.6.1.4.1.14519.5.2.1.8421.4018.115796396159931879291210366669/1.3.6.1.4.1.14519.5.2.1.8421.4018.114193003786715624244450650714.dcm
gs://whc_ssi/7efeae5d-6263-4184-9ad4-8df22720ada9/1.3.6.1.4.1.14519.5.2.1.8421.4018.115796396159931879291210366669/1.3.6.1.4.1.14519.5.2.1.8421.4018.121020043490528745973829325260.dcm
gs://whc_ssi/7efeae5d-6263-4184-9ad4-8df22720ada9/1.3.6.1.4.1.14519.5.2.1.8421.4018.115796396159931879291210366669/1.3.6.1.4.1.14519.5.2.1.8421.4018.123044677903361463906500937018.dcm
gs://whc_ssi/7efeae5d-6263-4184-9ad4-8df22720ada9/1.3.6.1.4.1.14519.5.2.1.8421.4018.115796396159931879291210366669/1.3.6.1.4.1.14519.5.2.1.8421.4018.123269574487182141526551456351.dcm
gs://whc_ssi/7efeae5d-6263-4184-9ad4-8df22720ada9/1.3.6.1.4.1.14519.5.2.1.8

gs://whc_ssi/7efeae5d-6263-4184-9ad4-8df22720ada9/1.3.6.1.4.1.14519.5.2.1.8421.4018.206096527423871496086339173610/1.3.6.1.4.1.14519.5.2.1.8421.4018.181249486829274434627094228318.dcm
gs://whc_ssi/7efeae5d-6263-4184-9ad4-8df22720ada9/1.3.6.1.4.1.14519.5.2.1.8421.4018.206096527423871496086339173610/1.3.6.1.4.1.14519.5.2.1.8421.4018.186473347502507178269944713949.dcm
gs://whc_ssi/7efeae5d-6263-4184-9ad4-8df22720ada9/1.3.6.1.4.1.14519.5.2.1.8421.4018.206096527423871496086339173610/1.3.6.1.4.1.14519.5.2.1.8421.4018.186799561071241362599845128678.dcm
gs://whc_ssi/7efeae5d-6263-4184-9ad4-8df22720ada9/1.3.6.1.4.1.14519.5.2.1.8421.4018.206096527423871496086339173610/1.3.6.1.4.1.14519.5.2.1.8421.4018.189215750247148830162584872020.dcm
gs://whc_ssi/7efeae5d-6263-4184-9ad4-8df22720ada9/1.3.6.1.4.1.14519.5.2.1.8421.4018.206096527423871496086339173610/1.3.6.1.4.1.14519.5.2.1.8421.4018.194002641720580686548338246484.dcm
gs://whc_ssi/7efeae5d-6263-4184-9ad4-8df22720ada9/1.3.6.1.4.1.14519.5.2.1.8

gs://whc_ssi/7efeae5d-6263-4184-9ad4-8df22720ada9/1.3.6.1.4.1.14519.5.2.1.8421.4018.282430338990386948066869857090/1.3.6.1.4.1.14519.5.2.1.8421.4018.176963165061122370676321814086.dcm
gs://whc_ssi/7efeae5d-6263-4184-9ad4-8df22720ada9/1.3.6.1.4.1.14519.5.2.1.8421.4018.282430338990386948066869857090/1.3.6.1.4.1.14519.5.2.1.8421.4018.177047930580049693948390888331.dcm
gs://whc_ssi/7efeae5d-6263-4184-9ad4-8df22720ada9/1.3.6.1.4.1.14519.5.2.1.8421.4018.282430338990386948066869857090/1.3.6.1.4.1.14519.5.2.1.8421.4018.177399810991640637864093106521.dcm
gs://whc_ssi/7efeae5d-6263-4184-9ad4-8df22720ada9/1.3.6.1.4.1.14519.5.2.1.8421.4018.282430338990386948066869857090/1.3.6.1.4.1.14519.5.2.1.8421.4018.177605247427473326190894608244.dcm
gs://whc_ssi/7efeae5d-6263-4184-9ad4-8df22720ada9/1.3.6.1.4.1.14519.5.2.1.8421.4018.282430338990386948066869857090/1.3.6.1.4.1.14519.5.2.1.8421.4018.179169230239811598934941280548.dcm
gs://whc_ssi/7efeae5d-6263-4184-9ad4-8df22720ada9/1.3.6.1.4.1.14519.5.2.1.8421.4
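
Once the instance names of a study have been listed, the three-level naming makes it straightforward to regroup them by series on the client side. The sketch below is one way to do that; `group_by_series` is a hypothetical helper, it expects blob names without the gs://\<bucket\>/ prefix, and the names in the example are short placeholders, not real identifiers.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def group_by_series(blob_names):
    """Group '<study>/<series>/<instance>.dcm' blob names by their series component."""
    groups = defaultdict(list)
    for name in blob_names:
        path = PurePosixPath(name)
        if path.suffix != ".dcm" or len(path.parts) != 3:
            continue  # skip folder placeholders and anything that is not an instance
        _study, series, _instance = path.parts
        groups[series].append(path.name)
    return dict(groups)


# Placeholder names, not real identifiers.
names = [
    "study/",
    "study/seriesA/",
    "study/seriesA/1.dcm",
    "study/seriesA/2.dcm",
    "study/seriesB/3.dcm",
]
print(group_by_series(names))
```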

We can, of course, copy these same sets of instances:

In [49]:
!mkdir -p /tmp/7efeae5d-6263-4184-9ad4-8df22720ada9/1.3.6.1.4.1.14519.5.2.1.8421.4018.115796396159931879291210366669
!gsutil -m -q cp gs://whc_ssi/7efeae5d-6263-4184-9ad4-8df22720ada9/1.3.6.1.4.1.14519.5.2.1.8421.4018.115796396159931879291210366669/* /tmp/7efeae5d-6263-4184-9ad4-8df22720ada9/1.3.6.1.4.1.14519.5.2.1.8421.4018.115796396159931879291210366669
!ls -l /tmp/7efeae5d-6263-4184-9ad4-8df22720ada9/1.3.6.1.4.1.14519.5.2.1.8421.4018.115796396159931879291210366669/*


-rw-r--r--  1 BillClifford  wheel  169524 Jun 14 18:50 /tmp/7efeae5d-6263-4184-9ad4-8df22720ada9/1.3.6.1.4.1.14519.5.2.1.8421.4018.115796396159931879291210366669/1.3.6.1.4.1.14519.5.2.1.8421.4018.110238514170356133957991595438.dcm
-rw-r--r--  1 BillClifford  wheel  169524 Jun 14 18:50 /tmp/7efeae5d-6263-4184-9ad4-8df22720ada9/1.3.6.1.4.1.14519.5.2.1.8421.4018.115796396159931879291210366669/1.3.6.1.4.1.14519.5.2.1.8421.4018.114193003786715624244450650714.dcm
-rw-r--r--  1 BillClifford  wheel  169524 Jun 14 18:50 /tmp/7efeae5d-6263-4184-9ad4-8df22720ada9/1.3.6.1.4.1.14519.5.2.1.8421.4018.115796396159931879291210366669/1.3.6.1.4.1.14519.5.2.1.8421.4018.121020043490528745973829325260.dcm
-rw-r--r--  1 BillClifford  wheel  169524 Jun 14 18:50 /tmp/7efeae5d-6263-4184-9ad4-8df22720ada9/1.3.6.1.4.1.14519.5.2.1.8421.4018.115796396159931879291210366669/1.3.6.1.4.1.14519.5.2.1.8421.4018.123044677903361463906500937018.dcm
-rw-r--r--  1 BillClifford  wheel  169524 Jun 14 18:50 /tmp/7efeae5d-626

Dump the metadata from one of these files:

In [50]:
r = rdcm('/tmp/7efeae5d-6263-4184-9ad4-8df22720ada9/1.3.6.1.4.1.14519.5.2.1.8421.4018.115796396159931879291210366669/1.3.6.1.4.1.14519.5.2.1.8421.4018.110238514170356133957991595438.dcm')
print(r)

Dataset.file_meta -------------------------------
(0002, 0000) File Meta Information Group Length  UL: 196
(0002, 0001) File Meta Information Version       OB: b'\x00\x01'
(0002, 0002) Media Storage SOP Class UID         UI: MR Image Storage
(0002, 0003) Media Storage SOP Instance UID      UI: 1.3.6.1.4.1.14519.5.2.1.8421.4018.110238514170356133957991595438
(0002, 0010) Transfer Syntax UID                 UI: Explicit VR Little Endian
(0002, 0012) Implementation Class UID            UI: 1.2.40.0.13.1.1.1
(0002, 0013) Implementation Version Name         SH: 'dcm4che-1.4.34'
-------------------------------------------------
(0008, 0005) Specific Character Set              CS: 'ISO_IR 100'
(0008, 0008) Image Type                          CS: ['ORIGINAL', 'PRIMARY', 'M', 'DIS2D']
(0008, 0012) Instance Creation Date              DA: '19910209'
(0008, 0013) Instance Creation Time              TM: '143052.562000'
(0008, 0016) SOP Class UID                       UI: MR Image Storage
(0008, 001

Delete the instances:

In [52]:
!rm -r /tmp/7efeae5d-6263-4184-9ad4-8df22720ada9/

The whc_si and whc_ssi buckets have also been populated with 'folder objects' that allow efficient gcsfuse access to the object and instance data in these buckets. It is probably best to experiment with gcsfuse on a VM. (Installing gcsfuse on my Mac, running Monterey 12.4, currently fails with "Installer: Error - The FUSE for macOS installation package is not compatible with this version of macOS.") Installation instructions are [here](https://github.com/GoogleCloudPlatform/gcsfuse/blob/master/docs/installing.md).

To use it, mount one of the buckets to a local directory, e.g.:

\$gcsfuse whc_ssi temp  

You should now be able to ls against that directory, e.g.:  

\$ls temp/idc.idc/  
\$ls temp/7efeae5d-6263-4184-9ad4-8df22720ada9  
