<a href="https://colab.research.google.com/github/ImagingDataCommons/crdc_compound_objects/blob/main/Hierarchical_Instance_Names.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Intro

In this notebook we investigate a proposed alternative to current IDC blob naming and its interaction with the proposed compound object.

## IDC DICOM Instance Names

IDC instance blobs are stored in GCS buckets and currently have a flat name space. Blobs have GCS names like `<instance_uuid>.dcm`. `<instance_uuid>` is the UUID4 assigned by IDC to a version of a DICOM instance.

This notebook demonstrates an alternative to current IDC blob naming in which blobs are named `<series_uuid>/<instance_uuid>.dcm` and in which we index both series and instances. Here `<series_uuid>` is the UUID4 assigned by IDC to a version of a DICOM series. By definition, all the instances in a series are in a single bucket.
  
In this notebook, we demonstrate that, given such a naming convention, we do not need to resolve the DRS URI of individual instances when accessing publicly available instance blobs, that is, when signed URLs are not needed for access. Specifically, because gsutil and other libraries and utilities understand hierarchical naming, a `<series_uuid>` and the name of the bucket in which the instances of that series reside are sufficient to identify and access all the instances in a series.
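To make the convention concrete, the following minimal sketch composes a blob name and a series "folder" URL from their parts. The helper names `instance_blob_name` and `series_folder_url` are illustrative assumptions and are not part of any IDC or Google library; the UUIDs are the ones used in the demo later in this notebook.

```python
def instance_blob_name(series_uuid: str, instance_uuid: str) -> str:
    """Blob name under the proposed hierarchical convention."""
    return f"{series_uuid}/{instance_uuid}.dcm"


def series_folder_url(bucket_name: str, series_uuid: str) -> str:
    """GCS URL that prefixes every instance blob of one series."""
    return f"gs://{bucket_name}/{series_uuid}/"


# Example values taken from the demo later in this notebook.
series_uuid = "f5d6b517-2c02-4035-9444-0f15be7180ff"
instance_uuid = "00f84d01-384b-4610-a8db-ebb68f3b3dc3"

print(instance_blob_name(series_uuid, instance_uuid))
print(series_folder_url("crdcobj", series_uuid))
```

Because every instance name starts with the series UUID, knowing only the bucket and the series UUID is enough to enumerate the instances.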

When signed URLs are needed to access IDC instance data in a series, the DRS URI, which has the form `drs://dg.4DFC/<series_uuid>`, can be resolved to obtain the GCS URL of an IDC series object. See the [CRDC compound object notebook](https://github.com/ImagingDataCommons/crdc_compound_objects/blob/main/CRDC_compound_object.ipynb).
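As a small illustration of the DRS URI form described above, the sketch below builds a series DRS URI and recovers the series UUID from one. The function names are hypothetical, and no DRS server is contacted; this only manipulates strings.

```python
import uuid

# URI prefix as given in the text above.
DRS_PREFIX = "drs://dg.4DFC/"


def series_drs_uri(series_uuid: str) -> str:
    """Compose the DRS URI of a series from its UUID."""
    return f"{DRS_PREFIX}{series_uuid}"


def series_uuid_from_drs_uri(drs_uri: str) -> str:
    """Extract the series UUID from a DRS URI of the form shown above."""
    if not drs_uri.startswith(DRS_PREFIX):
        raise ValueError(f"not a series DRS URI: {drs_uri!r}")
    candidate = drs_uri[len(DRS_PREFIX):]
    # uuid.UUID raises ValueError if the remainder is not a well-formed UUID.
    return str(uuid.UUID(candidate))


uri = series_drs_uri("f5d6b517-2c02-4035-9444-0f15be7180ff")
print(uri)
print(series_uuid_from_drs_uri(uri))
```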

# Demo

We've populated a GCS bucket, `crdcobj`, with a selected set of series from the [NLST collection](https://doi.org/10.7937/tcia.hmq8-j677).

Instances have hierarchical names like `<series_uuid>/<instance_uuid>.dcm` as described above.



In [1]:
demo_bucket = 'crdcobj'

# Demo config
The jq JSON processor is useful for inspecting JSON:

In [2]:
!apt -qq update
!apt -qq install jq

30 packages can be upgraded. Run 'apt list --upgradable' to see them.
The following package was automatically installed and is no longer required:
  libnvidia-common-460
Use 'apt autoremove' to remove it.
The following additional packages will be installed:
  libjq1 libonig4
The following NEW packages will be installed:
  jq libjq1 libonig4
0 upgraded, 3 newly installed, 0 to remove and 30 not upgraded.
Need to get 276 kB of archives.
After this operation, 930 kB of additional disk space will be used.
Selecting previously unselected package libonig4:amd64.
(Reading database ... 123942 files and directories currently installed.)
Preparing to unpack .../libonig4_6.7.0-1_amd64.deb ...
Unpacking libonig4:amd64 (6.7.0-1) ...
Selecting previously unselected package libjq1:amd64.
Preparing to unpack .../libjq1_1.5+dfsg-2_amd64.deb ...
Unpacking libjq1:amd64 (1.5+dfsg-2) ...
Selecting previously unselected package jq.
Preparing to unpack .../jq_1.5+dfsg-2_amd64.deb ...
Unpacking jq (1.5+dfsg-2

In order to access GCS, we first authenticate to Colab:

In [3]:
from google.colab import auth
auth.authenticate_user()

We'll be using the GCS Python libraries:

In [4]:
from google.cloud import storage
import json
client = storage.Client()
bucket = client.bucket(demo_bucket)

We have also created a series compound object for each series. Each has a GCS URL like
`gs://crdcobj/<series_uuid>/crdcobj.json`. For example:

In [9]:
!gsutil cat gs://crdcobj/f5d6b517-2c02-4035-9444-0f15be7180ff/crdcobj.json | jq

{
  "encoding_version": "1.0",
  "description": "IDC CRDC DICOM series compound object",
  "object_type": "DICOM series",
  "id": "f5d6b517-2c02-4035-9444-0f15be7180ff",
  "name": "1.3.6.1.4.1.14519.5.2.1.7009.9004.330787529051851659409647497231",
  "self_uri": "drs://dg.4DFC/f5d6b517-2c02-4035-9444-0f15be7180ff",
  "access_methods": [
    {
      "method": "children",
      "mime_type": "application/dicom",
      "description": "List of DRS URIs of instances in this series",
      "contents": [
        {

See the CRDC compound object notebook for more details.
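For readers who prefer Python to jq, here is a minimal sketch of reading the top-level fields of a compound object. The sample text is an abbreviated copy of the fields visible in the output above; the real `contents` list is truncated in this notebook, so it is left empty here as a placeholder. The function name `summarize_compound_object` is an assumption made for illustration.

```python
import json

# Abbreviated copy of the fields visible in the output above. The real
# "contents" list is truncated in this notebook, so it is left empty here.
sample_text = """
{
  "encoding_version": "1.0",
  "object_type": "DICOM series",
  "id": "f5d6b517-2c02-4035-9444-0f15be7180ff",
  "self_uri": "drs://dg.4DFC/f5d6b517-2c02-4035-9444-0f15be7180ff",
  "access_methods": [
    {"method": "children", "mime_type": "application/dicom", "contents": []}
  ]
}
"""


def summarize_compound_object(text: str) -> dict:
    """Return the identifying fields and the names of the access methods."""
    obj = json.loads(text)
    return {
        "id": obj["id"],
        "object_type": obj["object_type"],
        "self_uri": obj["self_uri"],
        "methods": [m["method"] for m in obj.get("access_methods", [])],
    }


summary = summarize_compound_object(sample_text)
print(summary)
```

In a live session the text could instead come from the `crdcobj.json` blob of the series.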

## Using the instance name hierarchy

As mentioned previously, instance blobs in the demo_bucket have hierarchical names. This means that, given a GCS URL of the form `gs://<bucket_name>/<series_uuid>/`, the user can take advantage of gsutil wildcarding to access the instances in the corresponding series. We will demonstrate.

For this purpose, we will assume that the user has obtained such a GCS URL, specifically this URL:  
`gs://crdcobj/f5d6b517-2c02-4035-9444-0f15be7180ff/`

We expect that users will usually obtain such a GCS URL from the manifest of some IDC cohort or from some BQ query against IDC BQ tables like dicom_all.

Another way in which the user could have obtained such a URL is by:
1. Resolving the series' DRS URI, which would in this case be `drs://dg.4DFC/f5d6b517-2c02-4035-9444-0f15be7180ff` (perhaps obtained from the manifest of an IDC cohort or from some BQ query against IDC BQ tables like dicom_all), to obtain the series compound object shown above.
2. Resolving the DRS URI in the 'folder_object' method (`drs://dg.4DFC/some_TBD_uuid` in the above compound object; we have not yet generated these UUIDs). This returns a DrsObject.
3. Reading the `url` of an `AccessURL` in that DrsObject. Its value will be `gs://crdcobj/f5d6b517-2c02-4035-9444-0f15be7180ff/`.

As noted previously, see the CRDC compound object notebook for more details.
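The last step of that alternative route can be sketched as follows. This assumes the resolved DrsObject follows the GA4GH DRS layout, in which each entry of `access_methods` may carry an `access_url` object with a `url` field. The DrsObject below is hypothetical (the folder-object UUIDs have not been generated yet), and `gcs_urls` is an illustrative helper name.

```python
def gcs_urls(drs_object: dict) -> list:
    """Collect every gs:// URL found in a DrsObject's access methods."""
    urls = []
    for method in drs_object.get("access_methods", []):
        access_url = method.get("access_url") or {}
        url = access_url.get("url", "")
        if url.startswith("gs://"):
            urls.append(url)
    return urls


# Hypothetical DrsObject: only the fields needed for this sketch are present.
hypothetical_drs_object = {
    "id": "some_TBD_uuid",
    "access_methods": [
        {
            "type": "gs",
            "access_url": {
                "url": "gs://crdcobj/f5d6b517-2c02-4035-9444-0f15be7180ff/"
            },
        }
    ],
}

print(gcs_urls(hypothetical_drs_object))
```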

### Accessing the name hierarchy with gsutil
Regardless of how the GCS URL of a series folder object was obtained, it can now be used with gsutil to access the instances in the series:

In [10]:
!gsutil ls gs://crdcobj/f5d6b517-2c02-4035-9444-0f15be7180ff

gs://crdcobj/f5d6b517-2c02-4035-9444-0f15be7180ff/
gs://crdcobj/f5d6b517-2c02-4035-9444-0f15be7180ff/00f84d01-384b-4610-a8db-ebb68f3b3dc3.dcm
gs://crdcobj/f5d6b517-2c02-4035-9444-0f15be7180ff/011bc2f0-3b70-4a9d-b9de-9b9022c6facc.dcm
gs://crdcobj/f5d6b517-2c02-4035-9444-0f15be7180ff/04e23de0-416d-42db-ba34-5e233cdf7dcd.dcm
gs://crdcobj/f5d6b517-2c02-4035-9444-0f15be7180ff/05f7526f-a9c5-401f-8cbc-794f541392c5.dcm
gs://crdcobj/f5d6b517-2c02-4035-9444-0f15be7180ff/0b2393fa-a16d-4f7a-9da1-52ce56ad3007.dcm
gs://crdcobj/f5d6b517-2c02-4035-9444-0f15be7180ff/0da3b58f-ab68-4735-97a8-608cd94789c8.dcm
gs://crdcobj/f5d6b517-2c02-4035-9444-0f15be7180ff/0e234611-bc68-4801-86b9-73008209358a.dcm
gs://crdcobj/f5d6b517-2c02-4035-9444-0f15be7180ff/0e922ba0-5576-4d25-8ea3-1e9879790b37.dcm
gs://crdcobj/f5d6b517-2c02-4035-9444-0f15be7180ff/0ed09652-f14b-49ea-a559-970eda2e5475.dcm
gs://crdcobj/f5d6b517-2c02-4035-9444-0f15be7180ff/0fcd3eca-ad7f-41e9-8660-dabe5bf79130.dcm
gs://crdcobj/f5d6b517-2c02-4035-9444-0f

We'll create a directory and copy all the instances in the series to the directory using gsutil.

In [11]:
%%time

!mkdir -vp /tmp/dicom_data
!gsutil -m cp gs://crdcobj/f5d6b517-2c02-4035-9444-0f15be7180ff/*.dcm /tmp/dicom_data
# Demonstrate that we've copied the instances
!ls -l /tmp/dicom_data
# No longer needed. Delete them
!rm -vr /tmp/dicom_data

mkdir: created directory '/tmp/dicom_data'
Copying gs://crdcobj/f5d6b517-2c02-4035-9444-0f15be7180ff/00f84d01-384b-4610-a8db-ebb68f3b3dc3.dcm...
Copying gs://crdcobj/f5d6b517-2c02-4035-9444-0f15be7180ff/04e23de0-416d-42db-ba34-5e233cdf7dcd.dcm...
Copying gs://crdcobj/f5d6b517-2c02-4035-9444-0f15be7180ff/0e234611-bc68-4801-86b9-73008209358a.dcm...
Copying gs://crdcobj/f5d6b517-2c02-4035-9444-0f15be7180ff/0da3b58f-ab68-4735-97a8-608cd94789c8.dcm...
Copying gs://crdcobj/f5d6b517-2c02-4035-9444-0f15be7180ff/011bc2f0-3b70-4a9d-b9de-9b9022c6facc.dcm...
Copying gs://crdcobj/f5d6b517-2c02-4035-9444-0f15be7180ff/05f7526f-a9c5-401f-8cbc-794f541392c5.dcm...
Copying gs://crdcobj/f5d6b517-2c02-4035-9444-0f15be7180ff/0e922ba0-5576-4d25-8ea3-1e9879790b37.dcm...
Copying gs://crdcobj/f5d6b517-2c02-4035-9444-0f15be7180ff/0b2393fa-a16d-4f7a-9da1-52ce56ad3007.dcm...
Copying gs://crdcobj/f5d6b517-2c02-4035-9444-0f15be7180ff/0fcd3eca-ad7f-41e9-8660-dabe5bf79130.dcm...
Copying gs://crdcobj/f5d6b517-2c02-4035

...and do the same with gcloud alpha storage:

In [12]:
%%time
!mkdir -vp /tmp/dicom_data
!gsutil ls gs://crdcobj/f5d6b517-2c02-4035-9444-0f15be7180ff/*.dcm  \
| gcloud alpha storage cp --read-paths-from-stdin /tmp/dicom_data
# Demonstrate that we've copied the instances
!ls -l /tmp/dicom_data
# No longer needed. Delete them
!rm -vr /tmp/dicom_data

mkdir: created directory '/tmp/dicom_data'
Copying gs://crdcobj/f5d6b517-2c02-4035-9444-0f15be7180ff/00f84d01-384b-4610-a8db-ebb68f3b3dc3.dcm to file:///tmp/dicom_data/00f84d01-384b-4610-a8db-ebb68f3b3dc3.dcm
Copying gs://crdcobj/f5d6b517-2c02-4035-9444-0f15be7180ff/011bc2f0-3b70-4a9d-b9de-9b9022c6facc.dcm to file:///tmp/dicom_data/011bc2f0-3b70-4a9d-b9de-9b9022c6facc.dcm
Copying gs://crdcobj/f5d6b517-2c02-4035-9444-0f15be7180ff/04e23de0-416d-42db-ba34-5e233cdf7dcd.dcm to file:///tmp/dicom_data/04e23de0-416d-42db-ba34-5e233cdf7dcd.dcm
Copying gs://crdcobj/f5d6b517-2c02-4035-9444-0f15be7180ff/05f7526f-a9c5-401f-8cbc-794f541392c5.dcm to file:///tmp/dicom_data/05f7526f-a9c5-401f-8cbc-794f541392c5.dcm
Copying gs://crdcobj/f5d6b517-2c02-4035-9444-0f15be7180ff/0b2393fa-a16d-4f7a-9da1-52ce56ad3007.dcm to file:///tmp/dicom_data/0b2393fa-a16d-4f7a-9da1-52ce56ad3007.dcm
Copying gs://crdcobj/f5d6b517-2c02-4035-9444-0f15be7180ff/0da3b58f-ab68-4735-97a8-608cd94789c8.dcm to file:///tmp/dicom_data/0d

### Accessing the name hierarchy using gcsfuse

[gcsfuse](https://github.com/GoogleCloudPlatform/gcsfuse) is a [FUSE](https://en.wikipedia.org/wiki/Filesystem_in_Userspace) file system that allows you to mount a GCS bucket as a local file system. Using gcsfuse we can access a series 'folder' in GCS as a directory.

We first need to install gcsfuse:

In [13]:
!echo "deb http://packages.cloud.google.com/apt gcsfuse-bionic main" > /etc/apt/sources.list.d/gcsfuse.list
!curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | apt-key add -
!apt -qq update
!apt -qq install gcsfuse

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0100  2537  100  2537    0     0  72485      0 --:--:-- --:--:-- --:--:-- 72485
OK
30 packages can be upgraded. Run 'apt list --upgradable' to see them.
The following package was automatically installed and is no longer required:
  libnvidia-common-460
Use 'apt autoremove' to remove it.
The following NEW packages will be installed:
  gcsfuse
0 upgraded, 1 newly installed, 0 to remove and 30 not upgraded.
Need to get 13.3 MB of archives.
After this operation, 30.7 MB of additional disk space will be used.
Selecting previously unselected package gcsfuse.
(Reading database ... 123959 files and directories currently installed.)
Preparing to unpack .../gcsfuse_0.41.8_amd64.deb ...
Unpacking gcsfuse (0.41.8) ...
Setting up gcsfuse (0.41.8) ...


Create a directory on which to mount the bucket, and perform the mount. 

In [14]:
!mkdir -p /tmp/gcsfuse_mnt
!gcsfuse $demo_bucket /tmp/gcsfuse_mnt

2022/11/07 20:57:30.951492 Start gcsfuse/0.41.8 (Go version go1.18.4) for app "" using mount point: /tmp/gcsfuse_mnt
2022/11/07 20:57:30.968779 Opening GCS connection...
2022/11/07 20:57:32.270537 Mounting file system "crdcobj"...
2022/11/07 20:57:32.271393 File system has been successfully mounted.


We can list the contents of that same series with ls, using the `<series_uuid>` as the directory name:


In [15]:
!ls -l /tmp/gcsfuse_mnt/f5d6b517-2c02-4035-9444-0f15be7180ff/

total 51980
-rw-r--r-- 1 root root 526824 Nov  7 18:49 00f84d01-384b-4610-a8db-ebb68f3b3dc3.dcm
-rw-r--r-- 1 root root 526824 Nov  7 18:49 011bc2f0-3b70-4a9d-b9de-9b9022c6facc.dcm
-rw-r--r-- 1 root root 526824 Nov  7 18:49 04e23de0-416d-42db-ba34-5e233cdf7dcd.dcm
-rw-r--r-- 1 root root 526820 Nov  7 18:45 05f7526f-a9c5-401f-8cbc-794f541392c5.dcm
-rw-r--r-- 1 root root 526824 Nov  7 18:52 0b2393fa-a16d-4f7a-9da1-52ce56ad3007.dcm
-rw-r--r-- 1 root root 526820 Nov  7 18:50 0da3b58f-ab68-4735-97a8-608cd94789c8.dcm
-rw-r--r-- 1 root root 526820 Nov  7 18:44 0e234611-bc68-4801-86b9-73008209358a.dcm
-rw-r--r-- 1 root root 526820 Nov  7 18:40 0e922ba0-5576-4d25-8ea3-1e9879790b37.dcm
-rw-r--r-- 1 root root 526826 Nov  7 18:45 0ed09652-f14b-49ea-a559-970eda2e5475.dcm
-rw-r--r-- 1 root root 526820 Nov  7 18:43 0fcd3eca-ad7f-41e9-8660-dabe5bf79130.dcm
-rw-r--r-- 1 root root 526824 Nov  7 18:49 1212d8f7-b955-4891-b01d-0af3455b7bb4.dcm
-rw-r--r-- 1 root root 526820 Nov  7 18:41 14b17ed3-0436-4d60-89

As with gsutil, we can copy all the instances in a series:

In [16]:
%%time

!mkdir -vp /tmp/dicom_data
!cp -v /tmp/gcsfuse_mnt/f5d6b517-2c02-4035-9444-0f15be7180ff/*.dcm /tmp/dicom_data
!ls -l /tmp/dicom_data
!rm -vr /tmp/dicom_data

mkdir: created directory '/tmp/dicom_data'
'/tmp/gcsfuse_mnt/f5d6b517-2c02-4035-9444-0f15be7180ff/00f84d01-384b-4610-a8db-ebb68f3b3dc3.dcm' -> '/tmp/dicom_data/00f84d01-384b-4610-a8db-ebb68f3b3dc3.dcm'
'/tmp/gcsfuse_mnt/f5d6b517-2c02-4035-9444-0f15be7180ff/011bc2f0-3b70-4a9d-b9de-9b9022c6facc.dcm' -> '/tmp/dicom_data/011bc2f0-3b70-4a9d-b9de-9b9022c6facc.dcm'
'/tmp/gcsfuse_mnt/f5d6b517-2c02-4035-9444-0f15be7180ff/04e23de0-416d-42db-ba34-5e233cdf7dcd.dcm' -> '/tmp/dicom_data/04e23de0-416d-42db-ba34-5e233cdf7dcd.dcm'
'/tmp/gcsfuse_mnt/f5d6b517-2c02-4035-9444-0f15be7180ff/05f7526f-a9c5-401f-8cbc-794f541392c5.dcm' -> '/tmp/dicom_data/05f7526f-a9c5-401f-8cbc-794f541392c5.dcm'
'/tmp/gcsfuse_mnt/f5d6b517-2c02-4035-9444-0f15be7180ff/0b2393fa-a16d-4f7a-9da1-52ce56ad3007.dcm' -> '/tmp/dicom_data/0b2393fa-a16d-4f7a-9da1-52ce56ad3007.dcm'
'/tmp/gcsfuse_mnt/f5d6b517-2c02-4035-9444-0f15be7180ff/0da3b58f-ab68-4735-97a8-608cd94789c8.dcm' -> '/tmp/dicom_data/0da3b58f-ab68-4735-97a8-608cd94789c8.dcm'
'/t

### Accessing the name hierarchy using s5cmd
The [s5cmd repo](https://github.com/peak/s5cmd) describes s5cmd as "a very fast S3 and local filesystem execution tool". But s5cmd can also be used against GCS. 

Install s5cmd.

In [17]:
!wget https://github.com/peak/s5cmd/releases/download/v2.0.0-beta/s5cmd_2.0.0-beta_Linux-64bit.tar.gz
!tar zxf s5cmd_2.0.0-beta_Linux-64bit.tar.gz

--2022-11-07 20:59:27--  https://github.com/peak/s5cmd/releases/download/v2.0.0-beta/s5cmd_2.0.0-beta_Linux-64bit.tar.gz
Resolving github.com (github.com)... 20.27.177.113
Connecting to github.com (github.com)|20.27.177.113|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://objects.githubusercontent.com/github-production-release-asset-2e65be/73909333/aafb8c9b-5844-4d77-bd36-a58662d19c98?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAIWNJYAX4CSVEH53A%2F20221107%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20221107T205927Z&X-Amz-Expires=300&X-Amz-Signature=718f6c976a55a1c95dfb3c616cba6749fd0c279a45cca00aced1b6f178552c8a&X-Amz-SignedHeaders=host&actor_id=0&key_id=0&repo_id=73909333&response-content-disposition=attachment%3B%20filename%3Ds5cmd_2.0.0-beta_Linux-64bit.tar.gz&response-content-type=application%2Foctet-stream [following]
--2022-11-07 20:59:27--  https://objects.githubusercontent.com/github-production-release-asset-2e65be/73909333/aafb8c9

Keeping your AWS credentials in Google Drive allows for convenient access:

In [18]:
# see here on the details on setting up GCP credentials
# for s5cmd: https://github.com/peak/s5cmd/pull/377
# The following assumes AWS credentials are in /aws/credentials on user's Google Drive

from google.colab import drive

drive.mount('/content/gdrive')

!mkdir -p ~/.aws
!cp /content/gdrive/MyDrive/aws/credentials ~/.aws

Mounted at /content/gdrive


s5cmd can take a manifest of files to be transferred. We'll use gcsfuse to enumerate the instances in a series and then use sed to convert the listing to the format required by s5cmd.

In [19]:
!ls /tmp/gcsfuse_mnt/f5d6b517-2c02-4035-9444-0f15be7180ff/*.dcm | sed  "s|/tmp/gcsfuse_mnt/|cp s3://$demo_bucket/|" | sed "s|\(.*\)|\1 /tmp/dicom_data/.|" > s5cmd_manifest.txt
# !gsutil ls gs://crdcobj/f5d6b517-2c02-4035-9444-0f15be7180ff/*.dcm | sed  "s|gs://|cp s3://|" | sed "s|\(.*\)|\1 /tmp/dicom_data/.|" > s5cmd_manifest.txt
!cat s5cmd_manifest.txt

cp s3://crdcobj/f5d6b517-2c02-4035-9444-0f15be7180ff/00f84d01-384b-4610-a8db-ebb68f3b3dc3.dcm /tmp/dicom_data/.
cp s3://crdcobj/f5d6b517-2c02-4035-9444-0f15be7180ff/011bc2f0-3b70-4a9d-b9de-9b9022c6facc.dcm /tmp/dicom_data/.
cp s3://crdcobj/f5d6b517-2c02-4035-9444-0f15be7180ff/04e23de0-416d-42db-ba34-5e233cdf7dcd.dcm /tmp/dicom_data/.
cp s3://crdcobj/f5d6b517-2c02-4035-9444-0f15be7180ff/05f7526f-a9c5-401f-8cbc-794f541392c5.dcm /tmp/dicom_data/.
cp s3://crdcobj/f5d6b517-2c02-4035-9444-0f15be7180ff/0b2393fa-a16d-4f7a-9da1-52ce56ad3007.dcm /tmp/dicom_data/.
cp s3://crdcobj/f5d6b517-2c02-4035-9444-0f15be7180ff/0da3b58f-ab68-4735-97a8-608cd94789c8.dcm /tmp/dicom_data/.
cp s3://crdcobj/f5d6b517-2c02-4035-9444-0f15be7180ff/0e234611-bc68-4801-86b9-73008209358a.dcm /tmp/dicom_data/.
cp s3://crdcobj/f5d6b517-2c02-4035-9444-0f15be7180ff/0e922ba0-5576-4d25-8ea3-1e9879790b37.dcm /tmp/dicom_data/.
cp s3://crdcobj/f5d6b517-2c02-4035-9444-0f15be7180ff/0ed09652-f14b-49ea-a559-970eda2e5475.dcm /tmp/dicom
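The same manifest lines could also be produced in Python rather than with sed. The sketch below is one possible equivalent, assuming a list of blob names is already available; `s5cmd_manifest_lines` is an illustrative name, and the output format simply mirrors the lines shown above.

```python
def s5cmd_manifest_lines(bucket_name, blob_names, dest_dir):
    """Build one s5cmd 'cp' command per blob, in the format shown above."""
    dest = dest_dir.rstrip("/")
    return [f"cp s3://{bucket_name}/{name} {dest}/." for name in blob_names]


# Two blob names taken from the listing earlier in this notebook.
blob_names = [
    "f5d6b517-2c02-4035-9444-0f15be7180ff/00f84d01-384b-4610-a8db-ebb68f3b3dc3.dcm",
    "f5d6b517-2c02-4035-9444-0f15be7180ff/011bc2f0-3b70-4a9d-b9de-9b9022c6facc.dcm",
]

manifest = s5cmd_manifest_lines("crdcobj", blob_names, "/tmp/dicom_data")
print("\n".join(manifest))
```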

Now we can pass the manifest to s5cmd for processing:

In [20]:
%%time

!mkdir -vp /tmp/dicom_data
!./s5cmd --endpoint-url https://storage.googleapis.com run s5cmd_manifest.txt
!ls -l /tmp/dicom_data
!rm -vr /tmp/dicom_data

mkdir: created directory '/tmp/dicom_data'
cp s3://crdcobj/f5d6b517-2c02-4035-9444-0f15be7180ff/00f84d01-384b-4610-a8db-ebb68f3b3dc3.dcm /tmp/dicom_data/00f84d01-384b-4610-a8db-ebb68f3b3dc3.dcm
cp s3://crdcobj/f5d6b517-2c02-4035-9444-0f15be7180ff/f67ce8ee-dd2a-4733-812a-0b44ee7d39f3.dcm /tmp/dicom_data/f67ce8ee-dd2a-4733-812a-0b44ee7d39f3.dcm
cp s3://crdcobj/f5d6b517-2c02-4035-9444-0f15be7180ff/011bc2f0-3b70-4a9d-b9de-9b9022c6facc.dcm /tmp/dicom_data/011bc2f0-3b70-4a9d-b9de-9b9022c6facc.dcm
cp s3://crdcobj/f5d6b517-2c02-4035-9444-0f15be7180ff/05f7526f-a9c5-401f-8cbc-794f541392c5.dcm /tmp/dicom_data/05f7526f-a9c5-401f-8cbc-794f541392c5.dcm
cp s3://crdcobj/f5d6b517-2c02-4035-9444-0f15be7180ff/0da3b58f-ab68-4735-97a8-608cd94789c8.dcm /tmp/dicom_data/0da3b58f-ab68-4735-97a8-608cd94789c8.dcm
cp s3://crdcobj/f5d6b517-2c02-4035-9444-0f15be7180ff/04e23de0-416d-42db-ba34-5e233cdf7dcd.dcm /tmp/dicom_data/04e23de0-416d-42db-ba34-5e233cdf7dcd.dcm
cp s3://crdcobj/f5d6b517-2c02-4035-9444-0f15be7180f

As we can see from the wall time, s5cmd is very fast.

### Accessing the name hierarchy using the Python GCS API

The Google GCS APIs understand the implicit hierarchy in blob names. We can thus use them to access all the instance blobs of a series.

First we define a function that we will use to "walk" the hierarchy.  
The list_blobs() function is documented [here](https://googleapis.dev/python/storage/latest/storage/client.html?highlight=prefixes). Its *prefix* and *delimiter* parameters are used to emulate hierarchical naming. Among the blobs whose names begin with the specified *prefix*, any name that contains the *delimiter* after the prefix is not returned as a blob; instead, the part of its name up to and including the first such delimiter is reported in the iterator's `prefixes`.

In [21]:
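def list_blobs_with_prefix(prefix, delimiter=None):
    # Reuse the module-level `client` and `bucket` created earlier.
    blobs = client.list_blobs(bucket, prefix=prefix, delimiter=delimiter)
    # `blobs.prefixes` is only populated once the iterator has been consumed.
    for _ in blobs:
        pass
    # Each prefix is a blob name truncated just after the first delimiter.
    return sorted(blobs.prefixes)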


We'll use the <series_uuid> as the prefix to find all .dcm (instance) blobs:

In [22]:
instances = list_blobs_with_prefix(prefix='f5d6b517-2c02-4035-9444-0f15be7180ff', delimiter='.dcm')
print("instance uuids:")
# enumerate() yields the position directly, avoiding a list search per item.
for index, instance in enumerate(instances):
    print(index, instance)


instance uuids:
0 f5d6b517-2c02-4035-9444-0f15be7180ff/00f84d01-384b-4610-a8db-ebb68f3b3dc3.dcm
1 f5d6b517-2c02-4035-9444-0f15be7180ff/011bc2f0-3b70-4a9d-b9de-9b9022c6facc.dcm
2 f5d6b517-2c02-4035-9444-0f15be7180ff/04e23de0-416d-42db-ba34-5e233cdf7dcd.dcm
3 f5d6b517-2c02-4035-9444-0f15be7180ff/05f7526f-a9c5-401f-8cbc-794f541392c5.dcm
4 f5d6b517-2c02-4035-9444-0f15be7180ff/0b2393fa-a16d-4f7a-9da1-52ce56ad3007.dcm
5 f5d6b517-2c02-4035-9444-0f15be7180ff/0da3b58f-ab68-4735-97a8-608cd94789c8.dcm
6 f5d6b517-2c02-4035-9444-0f15be7180ff/0e234611-bc68-4801-86b9-73008209358a.dcm
7 f5d6b517-2c02-4035-9444-0f15be7180ff/0e922ba0-5576-4d25-8ea3-1e9879790b37.dcm
8 f5d6b517-2c02-4035-9444-0f15be7180ff/0ed09652-f14b-49ea-a559-970eda2e5475.dcm
9 f5d6b517-2c02-4035-9444-0f15be7180ff/0fcd3eca-ad7f-41e9-8660-dabe5bf79130.dcm
10 f5d6b517-2c02-4035-9444-0f15be7180ff/1212d8f7-b955-4891-b01d-0af3455b7bb4.dcm
11 f5d6b517-2c02-4035-9444-0f15be7180ff/14b17ed3-0436-4d60-8957-1f103471fcd2.dcm
12 f5d6b517-2c02-4035-
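
Why `delimiter='.dcm'` yields full instance names may not be obvious. The following pure-Python emulation is a sketch of the prefix/delimiter rule as we understand it, not the library's implementation; `emulate_list` is a made-up name and the three blob names are a small sample from the demo bucket.

```python
def emulate_list(names, prefix="", delimiter=None):
    """Split names into (items, prefixes) the way a delimited listing would."""
    items, prefixes = [], set()
    for name in names:
        if not name.startswith(prefix):
            continue
        rest = name[len(prefix):]
        cut = rest.find(delimiter) if delimiter else -1
        if cut == -1:
            # No delimiter after the prefix: reported as an ordinary blob.
            items.append(name)
        else:
            # Delimiter found: report the name up to and including it.
            prefixes.add(prefix + rest[:cut + len(delimiter)])
    return sorted(items), sorted(prefixes)


series = "f5d6b517-2c02-4035-9444-0f15be7180ff"
names = [
    f"{series}/00f84d01-384b-4610-a8db-ebb68f3b3dc3.dcm",
    f"{series}/011bc2f0-3b70-4a9d-b9de-9b9022c6facc.dcm",
    f"{series}/crdcobj.json",
]

items, prefixes = emulate_list(names, prefix=series, delimiter=".dcm")
print("items:", items)
print("prefixes:", prefixes)
```

With `.dcm` as the delimiter, each instance name ends exactly at the delimiter, so the reported prefixes are the complete instance blob names, while `crdcobj.json` remains an ordinary item. Using `/` as the delimiter with an empty prefix would instead report one prefix per series.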

Now that we have the `instances` list of instance blob names, we can use other GCS functions to copy the contents of those blobs to a local directory:

In [23]:
%%time

import os
import shutil
from google.cloud.storage import Blob

# Create a directory for the instance data:
if not os.path.exists('/tmp/dicom_data'):
    os.makedirs('/tmp/dicom_data')

bucket = client.get_bucket(demo_bucket)
for instance in instances:
    src_blob = Blob(instance, bucket)
    # Use the last path component (<instance_uuid>.dcm) as the local file name.
    instance_blob_id = instance.split('/')[-1]
    with open(f"/tmp/dicom_data/{instance_blob_id}", "wb") as file_obj:
        src_blob.download_to_file(file_obj)

# Demonstrate that we've downloaded the blobs.      
!ls -l /tmp/dicom_data
# Clean up
shutil.rmtree('/tmp/dicom_data')

total 52116
-rw-r--r-- 1 root root 526824 Nov  7 21:00 00f84d01-384b-4610-a8db-ebb68f3b3dc3.dcm
-rw-r--r-- 1 root root 526824 Nov  7 21:00 011bc2f0-3b70-4a9d-b9de-9b9022c6facc.dcm
-rw-r--r-- 1 root root 526824 Nov  7 21:00 04e23de0-416d-42db-ba34-5e233cdf7dcd.dcm
-rw-r--r-- 1 root root 526820 Nov  7 21:00 05f7526f-a9c5-401f-8cbc-794f541392c5.dcm
-rw-r--r-- 1 root root 526824 Nov  7 21:00 0b2393fa-a16d-4f7a-9da1-52ce56ad3007.dcm
-rw-r--r-- 1 root root 526820 Nov  7 21:00 0da3b58f-ab68-4735-97a8-608cd94789c8.dcm
-rw-r--r-- 1 root root 526820 Nov  7 21:00 0e234611-bc68-4801-86b9-73008209358a.dcm
-rw-r--r-- 1 root root 526820 Nov  7 21:00 0e922ba0-5576-4d25-8ea3-1e9879790b37.dcm
-rw-r--r-- 1 root root 526826 Nov  7 21:01 0ed09652-f14b-49ea-a559-970eda2e5475.dcm
-rw-r--r-- 1 root root 526820 Nov  7 21:01 0fcd3eca-ad7f-41e9-8660-dabe5bf79130.dcm
-rw-r--r-- 1 root root 526824 Nov  7 21:01 1212d8f7-b955-4891-b01d-0af3455b7bb4.dcm
-rw-r--r-- 1 root root 526820 Nov  7 21:01 14b17ed3-0436-4d60-89
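
The loop above downloads one blob at a time. A thread pool is one way such downloads might be overlapped; the sketch below is an untested idea rather than a measured improvement. `download_all` and `fake_fetch` are hypothetical names, and the fetcher is a stand-in so the example runs without GCS access.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def download_all(blob_names, fetch, dest_dir, max_workers=8):
    """Save fetch(name) for each blob name into dest_dir, using threads."""
    os.makedirs(dest_dir, exist_ok=True)

    def save(name):
        # Use the last path component (<instance_uuid>.dcm) as the file name.
        path = os.path.join(dest_dir, name.split("/")[-1])
        with open(path, "wb") as file_obj:
            file_obj.write(fetch(name))
        return path

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(save, blob_names))


# Stand-in fetcher (an assumption for this sketch). With the GCS library,
# something like `lambda name: bucket.blob(name).download_as_bytes()`
# could be passed instead.
def fake_fetch(name):
    return name.encode()


with tempfile.TemporaryDirectory() as tmp_dir:
    paths = download_all(["series/a.dcm", "series/b.dcm"], fake_fetch, tmp_dir)
    saved = sorted(os.path.basename(p) for p in paths)

print(saved)
```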

## Accessing instance data using the Python Requests library or curl
Unlike the previous interfaces, requests, curl and wget do not support wildcarding. Each requires a complete HTTPS URL in order to access an instance blob. That is, you can't use these tools directly to get the instances in a series given just a bucket name and series_uuid.

However, if an application really needs to use the Python Requests library, it can use the Google Storage library, as demonstrated above, to get the full name of each instance in a series, and then use Requests to access each instance blob.

We'll define a function for this purpose:



In [27]:
import requests
access_url_prefix = f'https://storage.googleapis.com/{demo_bucket}'
access_token = !gcloud auth print-access-token
def get_blob_contents(instance):
  url = f'{access_url_prefix}/{instance}'
  headers = dict(Authorization=f'Bearer {access_token[0]}')
  result = requests.get(url, headers=headers)
  result.raise_for_status()
  return result


Now we can use requests.get to copy each instance in the previously created `instances` list to a local directory.

Warning: This operation is very slow:

In [28]:
%%time

import os
import shutil
from google.cloud.storage import Blob

# Create a directory for the instance data:
if not os.path.exists('/tmp/dicom_data'):
    os.makedirs('/tmp/dicom_data')

bucket = client.get_bucket(demo_bucket)
for instance in instances:
  instance_blob_id = instance.split('/')[-1]
  with open(f"/tmp/dicom_data/{instance_blob_id}", "wb") as file_obj:
      result = get_blob_contents(instance)
      file_obj.write(result.content)
      print(f'Copied {instance}')
# Demonstrate that we've downloaded the blobs.      
!ls -l /tmp/dicom_data
# Clean up
shutil.rmtree('/tmp/dicom_data')

Copied f5d6b517-2c02-4035-9444-0f15be7180ff/00f84d01-384b-4610-a8db-ebb68f3b3dc3.dcm
Copied f5d6b517-2c02-4035-9444-0f15be7180ff/011bc2f0-3b70-4a9d-b9de-9b9022c6facc.dcm
Copied f5d6b517-2c02-4035-9444-0f15be7180ff/04e23de0-416d-42db-ba34-5e233cdf7dcd.dcm
Copied f5d6b517-2c02-4035-9444-0f15be7180ff/05f7526f-a9c5-401f-8cbc-794f541392c5.dcm
Copied f5d6b517-2c02-4035-9444-0f15be7180ff/0b2393fa-a16d-4f7a-9da1-52ce56ad3007.dcm
Copied f5d6b517-2c02-4035-9444-0f15be7180ff/0da3b58f-ab68-4735-97a8-608cd94789c8.dcm
Copied f5d6b517-2c02-4035-9444-0f15be7180ff/0e234611-bc68-4801-86b9-73008209358a.dcm
Copied f5d6b517-2c02-4035-9444-0f15be7180ff/0e922ba0-5576-4d25-8ea3-1e9879790b37.dcm
Copied f5d6b517-2c02-4035-9444-0f15be7180ff/0ed09652-f14b-49ea-a559-970eda2e5475.dcm
Copied f5d6b517-2c02-4035-9444-0f15be7180ff/0fcd3eca-ad7f-41e9-8660-dabe5bf79130.dcm
Copied f5d6b517-2c02-4035-9444-0f15be7180ff/1212d8f7-b955-4891-b01d-0af3455b7bb4.dcm
Copied f5d6b517-2c02-4035-9444-0f15be7180ff/14b17ed3-0436-4d60-89

Similarly, a shell application that will use curl to copy blobs can use gsutil to get the names of instance blobs in a series.

Warning: This operation is very slow:

In [25]:
%%time

%%bash -s "$demo_bucket" "f5d6b517-2c02-4035-9444-0f15be7180ff"
mkdir -vp /tmp/dicom_data
for instance in $(gsutil ls gs://$1/$2/*.dcm); \
do \
  IFS='/'; read -r -a parts <<< "$instance"; unset IFS; \
  bckt="${parts[2]}"; series_uuid="${parts[3]}"; instance_uuid="${parts[4]}"; \
  URL="https://storage.googleapis.com/$bckt/$series_uuid/$instance_uuid?alt=media"; \
  curl -s -X GET \
    -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    -o /tmp/dicom_data/$instance_uuid  \
    $URL; \
  echo Copied $instance; \
done
# Demonstrate that we've downloaded the blobs.      
ls -l /tmp/dicom_data
# Clean up
rm -vr /tmp/dicom_data

mkdir: created directory '/tmp/dicom_data'
Copied gs://crdcobj/f5d6b517-2c02-4035-9444-0f15be7180ff/00f84d01-384b-4610-a8db-ebb68f3b3dc3.dcm
Copied gs://crdcobj/f5d6b517-2c02-4035-9444-0f15be7180ff/011bc2f0-3b70-4a9d-b9de-9b9022c6facc.dcm
Copied gs://crdcobj/f5d6b517-2c02-4035-9444-0f15be7180ff/04e23de0-416d-42db-ba34-5e233cdf7dcd.dcm
Copied gs://crdcobj/f5d6b517-2c02-4035-9444-0f15be7180ff/05f7526f-a9c5-401f-8cbc-794f541392c5.dcm
Copied gs://crdcobj/f5d6b517-2c02-4035-9444-0f15be7180ff/0b2393fa-a16d-4f7a-9da1-52ce56ad3007.dcm
Copied gs://crdcobj/f5d6b517-2c02-4035-9444-0f15be7180ff/0da3b58f-ab68-4735-97a8-608cd94789c8.dcm
Copied gs://crdcobj/f5d6b517-2c02-4035-9444-0f15be7180ff/0e234611-bc68-4801-86b9-73008209358a.dcm
Copied gs://crdcobj/f5d6b517-2c02-4035-9444-0f15be7180ff/0e922ba0-5576-4d25-8ea3-1e9879790b37.dcm
Copied gs://crdcobj/f5d6b517-2c02-4035-9444-0f15be7180ff/0ed09652-f14b-49ea-a559-970eda2e5475.dcm
Copied gs://crdcobj/f5d6b517-2c02-4035-9444-0f15be7180ff/0fcd3eca-ad7f-41e9
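
The `IFS='/'` splitting in the shell loop above has a straightforward Python counterpart. The sketch below parses a `gs://` instance URL into its parts and builds the corresponding HTTPS URL; both function names are illustrative assumptions, and no request is sent.

```python
from urllib.parse import urlsplit


def split_instance_url(gs_url):
    """Return (bucket, series_uuid, instance_file) from a gs:// instance URL."""
    parts = urlsplit(gs_url)
    if parts.scheme != "gs":
        raise ValueError(f"not a gs:// URL: {gs_url!r}")
    # The path looks like /<series_uuid>/<instance_uuid>.dcm
    series_uuid, instance_file = parts.path.lstrip("/").split("/", 1)
    return parts.netloc, series_uuid, instance_file


def instance_https_url(gs_url):
    """HTTPS URL for the same blob on storage.googleapis.com."""
    bucket_name, series_uuid, instance_file = split_instance_url(gs_url)
    return f"https://storage.googleapis.com/{bucket_name}/{series_uuid}/{instance_file}"


example = (
    "gs://crdcobj/f5d6b517-2c02-4035-9444-0f15be7180ff/"
    "00f84d01-384b-4610-a8db-ebb68f3b3dc3.dcm"
)
print(split_instance_url(example))
print(instance_https_url(example))
```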