<a href="https://colab.research.google.com/github/ImagingDataCommons/crdc_compound_objects/blob/main/Hierarchical_Instance_Names_s3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Intro

In this notebook we investigate a proposed alternative to current IDC blob naming and its interaction with the proposed CRDC compound object.

## IDC DICOM Instance Names

IDC instance blobs are stored in GCS buckets and currently have a flat name space. Blobs have GCS names like `<instance_uuid>.dcm`. `<instance_uuid>` is the UUID4 assigned by IDC to a version of a DICOM instance.

This notebook demonstrates an alternative to current IDC blob naming in which instance blobs would be named as `<series_uuid>/<instance_uuid>.dcm` and where we index series and instances. Here `<series_uuid>` is the UUID4 assigned by IDC to a version of a DICOM series. By definition, all the instances in a series are in a single bucket.
  
We demonstrate that, given such a naming convention, we do not need to resolve the DRS URI of individual instances when accessing publicly available instance blobs, that is, when signed URLs are not needed for access. Specifically, because gsutil and other libraries and utilities understand hierarchical naming, knowing a `<series_uuid>` and the name of the bucket in which the instances of that series reside is sufficient to identify and access all the instances in the series.
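
As a minimal sketch of the proposed convention, the helpers below assemble a hierarchical blob name and the GCS URL of a series "folder". The function names `instance_blob_name` and `series_folder_url`, and the placeholder instance UUID, are hypothetical and used only for illustration.

```python
# Hypothetical helpers illustrating the proposed naming convention.

def instance_blob_name(series_uuid: str, instance_uuid: str) -> str:
    """Blob name under the proposed scheme: <series_uuid>/<instance_uuid>.dcm"""
    return f"{series_uuid}/{instance_uuid}.dcm"


def series_folder_url(bucket_name: str, series_uuid: str) -> str:
    """GCS URL of the 'folder' that holds every instance of a series."""
    return f"gs://{bucket_name}/{series_uuid}/"


series = "f5d6b517-2c02-4035-9444-0f15be7180ff"
# Placeholder instance UUID, not a real IDC instance.
instance = "00000000-0000-4000-8000-000000000000"

print(instance_blob_name(series, instance))
print(series_folder_url("crdcobj", series))
```

Only the bucket name and the `<series_uuid>` are needed to build the folder URL; no per-instance lookup is involved.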

In the case that signed URLs are needed to access IDC instance data in a series, the DRS URI, which has the form `drs://dg.4DFC/<series_uuid>`, can be resolved to obtain the GCS URL of an IDC series object. See the [CRDC compound object notebook](https://github.com/ImagingDataCommons/crdc_compound_objects/blob/main/CRDC_compound_object.ipynb).

# Demo

We've populated a GCS bucket, `crdcobj`, with a selected set of series from the [NLST collection](https://doi.org/10.7937/tcia.hmq8-j677).

Instances have hierarchical names like `<series_uuid>/<instance_uuid>.dcm` as described above.



In [None]:
demo_bucket = 'crdcobj'

# Demo config
The jq JSON processor is useful for inspecting JSON:

In [None]:
!apt -qq update
!apt -qq install jq

In order to access GCS, we first authenticate to Colab:

In [None]:
from google.colab import auth
auth.authenticate_user()

We'll be using the GCS Python libraries:

In [None]:
from google.cloud import storage
import json
client = storage.Client()
bucket = client.bucket(demo_bucket)

Install s5cmd.

In [None]:
!wget https://github.com/peak/s5cmd/releases/download/v2.0.0-beta/s5cmd_2.0.0-beta_Linux-64bit.tar.gz
!tar zxf s5cmd_2.0.0-beta_Linux-64bit.tar.gz

# Series objects in the demo bucket

Note that we have also created a series compound object for each series. Each has a GCS URL like:
`gs://crdcobj/<series_uuid>/crdcobj.json`. For example:

In [None]:
!gsutil cat gs://crdcobj/f5d6b517-2c02-4035-9444-0f15be7180ff/crdcobj.json | jq

See the CRDC compound object notebook for more details. We won't discuss these `series objects` further in this notebook.

## Using the instance name hierarchy

As mentioned previously, instance blobs in the demo_bucket have hierarchical names. This means that, given a GCS URL of the form `gs://<bucket_name>/<series_uuid>/`, the user can take advantage of the 'wildcarding' capabilities of various utilities to access the instances in the corresponding series. We will demonstrate.

For this purpose, we will assume that the user has obtained such a GCS URL, specifically this URL:  
`gs://crdcobj/f5d6b517-2c02-4035-9444-0f15be7180ff/`

We expect that users will usually obtain such a GCS URL from the manifest of some IDC cohort or from some BQ query against an IDC BQ table such as dicom_all.

Another way in which the user could have obtained such a URL is by:
1. Resolving the series's DRS URI, which in this case would be `drs://dg.4DFC/f5d6b517-2c02-4035-9444-0f15be7180ff` (perhaps obtained from the manifest of an IDC cohort or from some BQ query against IDC BQ tables like dicom_all), to obtain the series compound object shown above.
2. Resolving the DRS URI in the 'folder_object' method (`drs://dg.4DFC/some_TBD_uuid` in the above compound object; we've not yet generated these UUIDs).
3. Taking the `url` value of an `AccessURL` in the DrsObject that this resolution returns, which will be `gs://crdcobj/f5d6b517-2c02-4035-9444-0f15be7180ff/`.

As noted previously, see the CRDC compound object notebook for more details.
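
However the URL was obtained, it carries both pieces of information the tools below need. The following sketch splits a series folder URL into its bucket name and `<series_uuid>`; the function name `split_series_url` is hypothetical, and the code assumes the URL has exactly the `gs://<bucket_name>/<series_uuid>/` form described above.

```python
from typing import Tuple
from urllib.parse import urlparse


def split_series_url(gcs_url: str) -> Tuple[str, str]:
    """Split gs://<bucket_name>/<series_uuid>/ into (bucket_name, series_uuid)."""
    parsed = urlparse(gcs_url)
    if parsed.scheme != "gs" or not parsed.netloc:
        raise ValueError(f"not a GCS URL: {gcs_url}")
    # The path is '/<series_uuid>/'; drop the surrounding slashes.
    series_uuid = parsed.path.strip("/")
    if not series_uuid or "/" in series_uuid:
        raise ValueError(f"expected gs://<bucket_name>/<series_uuid>/: {gcs_url}")
    return parsed.netloc, series_uuid


bucket_name, series_uuid = split_series_url(
    "gs://crdcobj/f5d6b517-2c02-4035-9444-0f15be7180ff/"
)
print(bucket_name, series_uuid)
```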

### Accessing the name hierarchy with gsutil
Regardless of how the GCS URL of a series folder object was obtained, it can now be used with gsutil to access the instances in the series:

In [None]:
!gsutil ls gs://crdcobj/f5d6b517-2c02-4035-9444-0f15be7180ff

We'll create a directory and copy all the instances in the series to the directory using gsutil.

In [None]:
%%time

!mkdir -vp /tmp/dicom_data
!gsutil -m cp gs://crdcobj/f5d6b517-2c02-4035-9444-0f15be7180ff/*.dcm /tmp/dicom_data
# Demonstrate that we've copied the instances
!ls -l /tmp/dicom_data
# No longer needed. Delete them
!rm -vr /tmp/dicom_data

### Accessing the name hierarchy using gcsfuse

[gcsfuse](https://github.com/GoogleCloudPlatform/gcsfuse) is a [FUSE](https://en.wikipedia.org/wiki/Filesystem_in_Userspace) file system that allows you to mount a GCS bucket as a file system. Using gcsfuse we can access a series 'folder' in GCS as a directory.

We first need to install gcsfuse:

In [None]:
!echo "deb http://packages.cloud.google.com/apt gcsfuse-bionic main" > /etc/apt/sources.list.d/gcsfuse.list
!curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | apt-key add -
!apt -qq update
!apt -qq install gcsfuse

Create a directory on which to mount the bucket, and perform the mount. 

In [None]:
!mkdir -p /tmp/gcsfuse_mnt
!gcsfuse $demo_bucket /tmp/gcsfuse_mnt

We can list the contents of that same series using the `<series_uuid>`:


In [None]:
!ls -l /tmp/gcsfuse_mnt/f5d6b517-2c02-4035-9444-0f15be7180ff/

As with gsutil, we can copy all the instances in a series:

In [None]:
%%time

!mkdir -vp /tmp/dicom_data
!cp -v /tmp/gcsfuse_mnt/f5d6b517-2c02-4035-9444-0f15be7180ff/*.dcm /tmp/dicom_data
!ls -l /tmp/dicom_data
!rm -vr /tmp/dicom_data

### Accessing the name hierarchy using s5cmd
The [s5cmd repo](https://github.com/peak/s5cmd) describes s5cmd as "a very fast S3 and local filesystem execution tool". But s5cmd can also be used against GCS. 

To use s5cmd against GCS, you must first create an HMAC key. See [this section](https://github.com/peak/s5cmd#google-cloud-storage-support) of the s5cmd documentation, which links to [these instructions](https://cloud.google.com/storage/docs/authentication/managing-hmackeys#create).

s5cmd reads the HMAC key from an AWS-style credentials file. Keeping that file in Google Drive allows for convenient access from Colab. The following assumes the credentials are in `aws/credentials` on the user's Google Drive:

In [None]:
from google.colab import drive

drive.mount('/content/gdrive')

!mkdir -p ~/.aws
!cp /content/gdrive/MyDrive/aws/credentials ~/.aws

s5cmd can take a manifest of files to be transferred. We'll use gcsfuse to enumerate the instances in a series and then use sed to convert the listing to the format required by s5cmd.

In [None]:
!ls /tmp/gcsfuse_mnt/f5d6b517-2c02-4035-9444-0f15be7180ff/*.dcm | sed  "s|/tmp/gcsfuse_mnt/|cp s3://$demo_bucket/|" | sed "s|\(.*\)|\1 /tmp/dicom_data/.|" > s5cmd_manifest.txt
# !gsutil ls gs://crdcobj/f5d6b517-2c02-4035-9444-0f15be7180ff/*.dcm | sed  "s|gs://|cp s3://|" | sed "s|\(.*\)|\1 /tmp/dicom_data/.|" > s5cmd_manifest.txt
!cat s5cmd_manifest.txt
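
The same manifest can also be built in Python. This is only a sketch of what the two sed substitutions above produce: the function name `s5cmd_manifest_lines` and the instance names in the example are hypothetical, and in practice the blob names would come from a listing of the series.

```python
def s5cmd_manifest_lines(bucket_name, blob_names, dest_dir="/tmp/dicom_data"):
    """Build one s5cmd 'cp' command per blob, in the format produced by sed above."""
    return [f"cp s3://{bucket_name}/{name} {dest_dir}/." for name in blob_names]


# Placeholder blob names following <series_uuid>/<instance_uuid>.dcm
example_names = [
    "f5d6b517-2c02-4035-9444-0f15be7180ff/instance-a.dcm",
    "f5d6b517-2c02-4035-9444-0f15be7180ff/instance-b.dcm",
]
for line in s5cmd_manifest_lines("crdcobj", example_names):
    print(line)
```

Writing these lines to `s5cmd_manifest.txt`, one per line, should give a file equivalent to the one generated with sed.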

Now we can pass the manifest to s5cmd for processing:

In [None]:
%%time

!mkdir -vp /tmp/dicom_data
!./s5cmd --endpoint-url https://storage.googleapis.com run s5cmd_manifest.txt
!ls -l /tmp/dicom_data
!rm -vr /tmp/dicom_data

As we can see from the wall time, s5cmd is very fast.

### Accessing the name hierarchy using the Python GCS API

The Google GCS APIs understand the implicit hierarchy in blob names. We can thus use them to access all the instance blobs of a series.

First we define a function that we will use to "walk" the hierarchy.  
The `list_blobs()` function is documented [here](https://googleapis.dev/python/storage/latest/storage/client.html?highlight=prefixes). Its *prefix* and *delimiter* parameters are used to emulate hierarchical naming. For each blob name that begins with the specified *prefix* and contains the *delimiter* after it, the function reports the partial name up to and including the first occurrence of the *delimiter*.

In [None]:
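def list_blobs_with_prefix(prefix, delimiter=None):
    """Return the sorted partial blob names ("prefixes") found under `prefix`."""
    client = storage.Client()
    bucket = client.bucket(demo_bucket)
    blobs = bucket.list_blobs(prefix=prefix, delimiter=delimiter)
    # blobs.prefixes is only populated as the iterator is consumed,
    # so walk through every result before reading it.
    for _ in blobs:
        pass
    # Each prefix is a blob name truncated just after the delimiter.
    return sorted(blobs.prefixes)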
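
To see why passing `.dcm` as the delimiter yields full instance blob names, the following pure-Python sketch emulates the prefix and delimiter behavior on a plain list of names. It is an illustration of the listing semantics, not the library's implementation; `emulate_list_blobs` and the sample names are hypothetical.

```python
def emulate_list_blobs(names, prefix="", delimiter=None):
    """Emulate prefix/delimiter listing on a list of blob names.

    Returns (items, prefixes): names containing the delimiter after the
    prefix are rolled up into `prefixes`; all other matches are `items`.
    """
    items, prefixes = [], set()
    for name in sorted(names):
        if not name.startswith(prefix):
            continue
        if delimiter:
            rest = name[len(prefix):]
            pos = rest.find(delimiter)
            if pos >= 0:
                # Keep the name up to and including the first delimiter.
                prefixes.add(prefix + rest[:pos + len(delimiter)])
                continue
        items.append(name)
    return items, sorted(prefixes)


sample = ["s1/a.dcm", "s1/b.dcm", "s1/crdcobj.json", "s2/c.dcm"]
# Delimiter '.dcm': every instance name ends at the delimiter, so the
# reported prefixes are the complete instance blob names of series s1.
print(emulate_list_blobs(sample, prefix="s1", delimiter=".dcm"))
# Delimiter '/': the reported prefixes are the series 'folders'.
print(emulate_list_blobs(sample, delimiter="/"))
```

Because the series compound object `crdcobj.json` does not contain `.dcm`, it is returned as an ordinary item rather than a prefix, which is why it is excluded from the instance list.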



We'll use the `<series_uuid>` as the prefix to find all .dcm (instance) blobs:

In [None]:
instances = list_blobs_with_prefix(prefix='f5d6b517-2c02-4035-9444-0f15be7180ff', delimiter='.dcm')
print("instance blob names:")
# enumerate avoids a repeated list search for every element
for index, instance in enumerate(instances):
  print(index, instance)


Now that we have the `instances` list of instance blob names, we can use other GCS functions to copy the contents of those blobs to a local directory:

In [None]:
%%time

import os
import shutil
from google.cloud.storage import Blob

# Create a directory for the instance data:
if not os.path.exists('/tmp/dicom_data'):
    os.makedirs('/tmp/dicom_data')

for instance in instances:
  src_blob = bucket.blob(instance)
  instance_blob_id = instance.split('/')[-1]
  with open(f"/tmp/dicom_data/{instance_blob_id}", "wb") as file_obj:
      src_blob.download_to_file(file_obj)

# Demonstrate that we've downloaded the blobs.      
!ls -l /tmp/dicom_data
# Clean up
shutil.rmtree('/tmp/dicom_data')

## Accessing instance data using the Python Requests library or curl
Unlike the previous interfaces, requests, curl and wget do not support wildcarding. They each require a complete HTTPS URL in order to access an instance blob. That is, you can't use these tools directly to get the instances in a series given just a bucket name and series_uuid.

However, if an application really needs to use the Python Requests library, it can use the Google Storage library, as demonstrated above, to get the full name of each instance in a series, and then use Requests to access the instance blob.

We'll define a function for this purpose:



In [None]:
import requests
access_url_prefix = f'https://storage.googleapis.com/{demo_bucket}'
access_token = !gcloud auth print-access-token
def get_blob_contents(instance):
  url = f'{access_url_prefix}/{instance}'
  headers = dict(Authorization=f'Bearer {access_token[0]}')
  result = requests.get(url, headers=headers)
  result.raise_for_status()
  return result


Now we can use requests.get to copy each instance in the previously created `instances` list to a local directory.

Warning: This operation is very slow:

In [None]:
%%time

import os
import shutil
from google.cloud.storage import Blob

# Create a directory for the instance data:
if not os.path.exists('/tmp/dicom_data'):
    os.makedirs('/tmp/dicom_data')

bucket = client.get_bucket(demo_bucket)
for instance in instances:
  instance_blob_id = instance.split('/')[-1]
  with open(f"/tmp/dicom_data/{instance_blob_id}", "wb") as file_obj:
      result = get_blob_contents(instance)
      file_obj.write(result.content)
      print(f'Copied {instance}')
# Demonstrate that we've downloaded the blobs.      
!ls -l /tmp/dicom_data
# Clean up
shutil.rmtree('/tmp/dicom_data')

Similarly, a shell application that will use curl to copy blobs can use gsutil to get the names of instance blobs in a series.

Warning: This operation is very slow:

In [None]:
%%time

%%bash -s "$demo_bucket" "f5d6b517-2c02-4035-9444-0f15be7180ff"
mkdir -vp /tmp/dicom_data
for instance in $(gsutil ls gs://$1/$2/*.dcm); \
do \
  IFS='/'; read -r -a parts <<< "$instance"; unset IFS; \
  bckt="${parts[2]}"; series_uuid="${parts[3]}"; instance_uuid="${parts[4]}"; \
  URL="https://storage.googleapis.com/$bckt/$series_uuid/$instance_uuid?alt=media"; \
  curl -s -X GET \
    -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    -o /tmp/dicom_data/$instance_uuid  \
    $URL; \
  echo Copied $instance; \
done
# Demonstrate that we've downloaded the blobs.      
ls -l /tmp/dicom_data
# Clean up
rm -vr /tmp/dicom_data
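
The URL rewriting done in the bash loop can also be sketched in Python. The function name `gcs_to_https` is hypothetical. The sketch produces the `https://storage.googleapis.com/<bucket>/<blob name>` form used with Requests earlier in this notebook and omits the `?alt=media` query string that the curl example appends.

```python
from urllib.parse import quote, urlparse


def gcs_to_https(gcs_url: str) -> str:
    """Rewrite gs://<bucket>/<blob name> as a storage.googleapis.com HTTPS URL."""
    parsed = urlparse(gcs_url)
    if parsed.scheme != "gs" or not parsed.netloc:
        raise ValueError(f"not a GCS URL: {gcs_url}")
    blob_name = parsed.path.lstrip("/")
    # quote() leaves '/' intact, so the series/instance hierarchy is preserved.
    return f"https://storage.googleapis.com/{parsed.netloc}/{quote(blob_name)}"


# Placeholder instance name following <series_uuid>/<instance_uuid>.dcm
print(gcs_to_https("gs://crdcobj/f5d6b517-2c02-4035-9444-0f15be7180ff/instance-a.dcm"))
```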