<a href="https://colab.research.google.com/github/ImagingDataCommons/IDC-Examples/blob/master/notebooks/download_benchmarking.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Benchmarking download performance from GCS using gsutil and s5cmd

Initial version prepared and executed by Andrey Fedorov, andrey.fedorov@gmail.com, on May 6, 2022

## Executive summary

Poor performance of `gsutil` has been [well documented](https://www.doit-intl.com/optimize-data-transfer-between-compute-engine-and-cloud-storage/), and [s5cmd](https://github.com/peak/s5cmd) has been recommended as a preferred alternative. 

Anecdotally, IDC users have also experienced poor `gsutil` performance, and spent some time investigating how to optimize it by driving `gsutil` through `xargs`, parameterized by the batch size and the number of parallel processes (`-n` and `-P` arguments).

Because download performance depends on a variety of factors, such as the size and number of files and the hardware configuration of the VM, we ran experiments to compare `gsutil` and `s5cmd` on datasets representative of what we have in IDC.

The datasets used are public, and the cells below provide instructions on how to reproduce the experiments.

Our conclusions from the experiments are:
* `s5cmd` provides significantly better download performance (at times close to an order of magnitude improvement)
* `s5cmd` does not require any effort to parameterize the download with the number of threads etc.: its performance with the default configuration is consistently better, both for datasets with a small number of large files and for datasets with a large number of smaller files
* `s5cmd`, however, requires a small additional effort to set up HMAC keys

Test dataset | Total size | Number of files | Download time, s5cmd | Download time, gsutil
---|---|---|---|---
SM_MAX_Series | 8 GB | 6 | 40.1s | 1m 44s
MR_MAX_Series | 0.9 GB | 7200 | 15.5s | 2m 41s
CT_NLST_Series | 0.2 GB | 319 | 2.52s | 12.5s
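
To make the "close to an order of magnitude" claim concrete, the speedup factors can be derived directly from the timings in the table. The snippet below is a small sketch of that arithmetic; the helper name `parse_duration` is introduced here for illustration only.

```python
import re


def parse_duration(text: str) -> float:
    """Convert strings such as '40.1s' or '1m 44s' to seconds."""
    match = re.fullmatch(r"(?:(\d+)m\s*)?(\d+(?:\.\d+)?)s", text.strip())
    if match is None:
        raise ValueError(f"unrecognized duration: {text!r}")
    minutes = int(match.group(1) or 0)
    return 60 * minutes + float(match.group(2))


# (s5cmd time, gsutil time), copied from the summary table
timings = {
    "SM_MAX_Series": ("40.1s", "1m 44s"),
    "MR_MAX_Series": ("15.5s", "2m 41s"),
    "CT_NLST_Series": ("2.52s", "12.5s"),
}

# Speedup = gsutil time divided by s5cmd time
speedups = {
    name: parse_duration(gsutil) / parse_duration(s5cmd)
    for name, (s5cmd, gsutil) in timings.items()
}

for name, factor in speedups.items():
    print(f"{name}: s5cmd is {factor:.1f}x faster")
```

With the numbers above this gives roughly 2.6x for the few-large-files case, 10.4x for the many-small-files case, and 5.0x for the NLST CT series.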





# Prerequisites

* Google account
* GCP project
* Generate a personal HMAC key (https://cloud.google.com/storage/docs/authentication/hmackeys):

  * in the GCP console, go to "Cloud Storage > Settings"
  * switch to the "Interoperability" tab
  * under the "User account HMAC" section, set the default GCP project
  * generate an access key for your user account
  * populate the `~/.aws/credentials` file with the following content, substituting the access key you generated (I keep that file in my Drive, and then copy it to the Colab VM for convenience):
    ```text
    [default]
    aws_access_key_id=<access key>
    aws_secret_access_key=<secret>
    ```
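
If you prefer to create the credentials file programmatically instead of copying it, the following is a minimal sketch using only the Python standard library. The function name `write_hmac_credentials` is hypothetical, and the demonstration writes to a temporary directory with placeholder values rather than to the real `~/.aws/credentials`.

```python
import configparser
import tempfile
from pathlib import Path


def write_hmac_credentials(path: Path, access_key: str, secret: str) -> None:
    """Write an AWS-style credentials file holding a GCS HMAC key."""
    # interpolation=None so that a '%' in a value is stored verbatim
    config = configparser.ConfigParser(interpolation=None)
    config["default"] = {
        "aws_access_key_id": access_key,
        "aws_secret_access_key": secret,
    }
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w") as handle:
        # produce key=value lines without spaces around '='
        config.write(handle, space_around_delimiters=False)
    path.chmod(0o600)  # keep the secret readable by the owner only


# Demonstration on a throwaway directory with placeholder values
demo_path = Path(tempfile.mkdtemp()) / ".aws" / "credentials"
write_hmac_credentials(demo_path, "<access key>", "<secret>")
print(demo_path.read_text())
```

To write the real file, call the function with `Path.home() / ".aws" / "credentials"` and the key and secret shown in the GCP console.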

In [None]:
from google.colab import auth
auth.authenticate_user()

# replace the below with a project you can use
# note, project ID is used only to perform BigQuery queries against public tables
my_ProjectID = "idc-sandbox-000"

import os
os.environ["GCP_PROJECT_ID"] = my_ProjectID
os.environ["MANIFEST"] = "/content/idc_manifest.txt"
os.environ["MANIFEST_S5CMD"] = "/content/idc_manifest_s5cmd.txt"

!gcloud config set project $my_ProjectID

Updated property [core/project].


Run the cell below if you keep your GCP HMAC key in the `aws/credentials` file in your Google Drive, as discussed above. Otherwise, skip this cell and populate `~/.aws/credentials` manually.

In [None]:
# for details on setting up GCP credentials for s5cmd, see
# https://github.com/peak/s5cmd/pull/377

from google.colab import drive

drive.mount('/content/gdrive')

!mkdir -p ~/.aws
!cp /content/gdrive/MyDrive/aws/credentials ~/.aws

Mounted at /content/gdrive


Sample various DICOM series from IDC based on size and number of files

```sql
SELECT
  SeriesInstanceUID,
  SUM(instance_size/POW(1024,3)) AS series_size_gb,
  STRING_AGG(DISTINCT(Modality)) AS series_modalities,
  COUNT(DISTINCT(SOPInstanceUID)) AS num_files
FROM
  `bigquery-public-data.idc_current.dicom_all`
GROUP BY
  SeriesInstanceUID
ORDER BY
  series_size_gb DESC
```

Based on the above, the following samples were selected:

* **SM_MAX_Series**: Largest series (SM, 8 GB, 6 files): `1.3.6.1.4.1.5962.99.1.110304717.1947547991.1640787844557.2.0`
* **MR_MAX_Series**: Series with the largest number of files (MR, 0.9 GB, 7200 files): `1.3.6.1.4.1.14519.5.2.1.7009.2402.338975706529210469207534758159`
* **CT_NLST_Series**: A (possibly representative) series from NLST (CT, ~0.2 GB, 319 files): `1.3.6.1.4.1.14519.5.2.1.7009.9004.304086955181030444939278938416`
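
The query above sorts by size only, so finding the series with the most files presumably involved sorting by `num_files` instead. The sketch below shows one way to parameterize the sort column and run the query from Python. The function names, the `LIMIT` clause, and the use of the `google-cloud-bigquery` client are assumptions for illustration and not part of the original workflow.

```python
SERIES_QUERY_TEMPLATE = """
SELECT
  SeriesInstanceUID,
  SUM(instance_size/POW(1024,3)) AS series_size_gb,
  STRING_AGG(DISTINCT(Modality)) AS series_modalities,
  COUNT(DISTINCT(SOPInstanceUID)) AS num_files
FROM
  `bigquery-public-data.idc_current.dicom_all`
GROUP BY
  SeriesInstanceUID
ORDER BY
  {order_by} DESC
LIMIT {limit}
"""

ALLOWED_ORDER_COLUMNS = ("series_size_gb", "num_files")


def build_series_query(order_by: str = "series_size_gb", limit: int = 10) -> str:
    """Return the sampling query sorted by the requested column."""
    if order_by not in ALLOWED_ORDER_COLUMNS:
        raise ValueError(f"order_by must be one of {ALLOWED_ORDER_COLUMNS}")
    return SERIES_QUERY_TEMPLATE.format(order_by=order_by, limit=int(limit))


def run_series_query(project_id: str, order_by: str = "series_size_gb"):
    """Run the query; requires google-cloud-bigquery and valid credentials."""
    from google.cloud import bigquery  # imported lazily, only needed here

    client = bigquery.Client(project=project_id)
    return client.query(build_series_query(order_by)).to_dataframe()


print(build_series_query("num_files"))
```

In Colab, after `auth.authenticate_user()`, calling `run_series_query(my_ProjectID, "num_files")` should return the top series by file count as a DataFrame.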

In [None]:
import os
os.environ["SM_MAX_Series"] = "1.3.6.1.4.1.5962.99.1.110304717.1947547991.1640787844557.2.0"
os.environ["MR_MAX_Series"] = "1.3.6.1.4.1.14519.5.2.1.7009.2402.338975706529210469207534758159"
os.environ["CT_NLST_Series"] = "1.3.6.1.4.1.14519.5.2.1.7009.9004.304086955181030444939278938416"

Install s5cmd https://github.com/peak/s5cmd

In [None]:
!wget https://github.com/peak/s5cmd/releases/download/v2.0.0-beta/s5cmd_2.0.0-beta_Linux-64bit.tar.gz
!tar zxf s5cmd_2.0.0-beta_Linux-64bit.tar.gz

--2022-06-22 18:59:48--  https://github.com/peak/s5cmd/releases/download/v2.0.0-beta/s5cmd_2.0.0-beta_Linux-64bit.tar.gz
Resolving github.com (github.com)... 52.192.72.89
Connecting to github.com (github.com)|52.192.72.89|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://objects.githubusercontent.com/github-production-release-asset-2e65be/73909333/aafb8c9b-5844-4d77-bd36-a58662d19c98?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAIWNJYAX4CSVEH53A%2F20220622%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20220622T185948Z&X-Amz-Expires=300&X-Amz-Signature=902fbc2cc3d7c75f27328d506d2407c448996a125827b60e4a7f11cde167c371&X-Amz-SignedHeaders=host&actor_id=0&key_id=0&repo_id=73909333&response-content-disposition=attachment%3B%20filename%3Ds5cmd_2.0.0-beta_Linux-64bit.tar.gz&response-content-type=application%2Foctet-stream [following]
--2022-06-22 18:59:49--  https://objects.githubusercontent.com/github-production-release-asset-2e65be/73909333/aafb8c9b-

# Performance tests

The results below were obtained on a Colab VM under a Pro subscription, with the following specs:

```text
/content# cat /proc/cpuinfo 

stepping        : 0
microcode       : 0x1
cpu MHz         : 2199.998
cache size      : 56320 KB
physical id     : 0
siblings        : 2
core id         : 0
cpu cores       : 1
apicid          : 1
initial apicid  : 1
fpu             : yes
fpu_exception   : yes
cpuid level     : 13
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single ssbd ibrs ibpb stibp fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm rdseed adx smap xsaveopt arat md_clear arch_capabilities
bugs            : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs taa
bogomips        : 4399.99
clflush size    : 64
cache_alignment : 64
address sizes   : 46 bits physical, 48 bits virtual
power management:
```



# Generate manifests for s5cmd and gsutil

1. Create query templates for generating manifests suitable for use with s5cmd or gsutil, or containing a list of DRS URIs.
2. Create queries for the specific series of interest.

In [None]:
!rm -rf *txt *sql

!echo "SELECT gcs_url FROM bigquery-public-data.idc_current.dicom_all WHERE SeriesInstanceUID = \"REPLACE_WITH_SeriesInstanceUID_value\"" > query_template.txt
!echo "SELECT CONCAT(\"cp \",REPLACE(gcs_url, \"gs://\", \"s3://\"), \" .\") FROM bigquery-public-data.idc_current.dicom_all WHERE SeriesInstanceUID = \"REPLACE_WITH_SeriesInstanceUID_value\"" > s5cmd_query_template.txt
!echo "SELECT CONCAT(\"drs://dg.4DFC/\",crdc_instance_uuid) as drs_uri FROM bigquery-public-data.idc_current.dicom_all WHERE SeriesInstanceUID = \"REPLACE_WITH_SeriesInstanceUID_value\"" > drs_query_template.txt

Replace `REPLACE_WITH_SeriesInstanceUID_value` with a `SeriesInstanceUID` value (e.g., the value of `SM_MAX_Series`) to obtain a query that retrieves the list of GCS or DRS URIs that can be used to download the files. You can then run the resulting query to save the list of URIs corresponding to the selected files.

In [None]:
!echo $SM_MAX_Series
!echo $MR_MAX_Series
!echo $CT_NLST_Series

1.3.6.1.4.1.5962.99.1.110304717.1947547991.1640787844557.2.0
1.3.6.1.4.1.14519.5.2.1.7009.2402.338975706529210469207534758159
1.3.6.1.4.1.14519.5.2.1.7009.9004.304086955181030444939278938416


In [None]:
!sed -e "s/REPLACE_WITH_SeriesInstanceUID_value/$SM_MAX_Series/" query_template.txt > SM_MAX_Series_query_gcs.txt
!sed -e "s/REPLACE_WITH_SeriesInstanceUID_value/$MR_MAX_Series/" query_template.txt > MR_MAX_Series_query_gcs.txt
!sed -e "s/REPLACE_WITH_SeriesInstanceUID_value/$CT_NLST_Series/" query_template.txt > CT_NLST_Series_query_gcs.txt

!sed -e "s/REPLACE_WITH_SeriesInstanceUID_value/$SM_MAX_Series/" s5cmd_query_template.txt > SM_MAX_Series_query_s5cmd.txt
!sed -e "s/REPLACE_WITH_SeriesInstanceUID_value/$MR_MAX_Series/" s5cmd_query_template.txt > MR_MAX_Series_query_s5cmd.txt
!sed -e "s/REPLACE_WITH_SeriesInstanceUID_value/$CT_NLST_Series/" s5cmd_query_template.txt > CT_NLST_Series_query_s5cmd.txt

!sed -e "s/REPLACE_WITH_SeriesInstanceUID_value/$SM_MAX_Series/" drs_query_template.txt > SM_MAX_Series_query_drs.txt
!sed -e "s/REPLACE_WITH_SeriesInstanceUID_value/$MR_MAX_Series/" drs_query_template.txt > MR_MAX_Series_query_drs.txt
!sed -e "s/REPLACE_WITH_SeriesInstanceUID_value/$CT_NLST_Series/" drs_query_template.txt > CT_NLST_Series_query_drs.txt
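
The `sed` substitution above can also be expressed in plain Python, which may be convenient when the series list is generated programmatically. This is only a sketch: `fill_template` is a hypothetical helper, and the template string mirrors the content written to `query_template.txt`.

```python
PLACEHOLDER = "REPLACE_WITH_SeriesInstanceUID_value"


def fill_template(template: str, series_uid: str) -> str:
    """Substitute the placeholder with a concrete SeriesInstanceUID."""
    if PLACEHOLDER not in template:
        raise ValueError("template does not contain the placeholder")
    return template.replace(PLACEHOLDER, series_uid)


# Same text as query_template.txt
template = (
    "SELECT gcs_url FROM bigquery-public-data.idc_current.dicom_all "
    'WHERE SeriesInstanceUID = "REPLACE_WITH_SeriesInstanceUID_value"'
)

sm_max_series = "1.3.6.1.4.1.5962.99.1.110304717.1947547991.1640787844557.2.0"
sm_query = fill_template(template, sm_max_series)
print(sm_query)
```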

In [None]:
!bq query --use_legacy_sql=false --format=csv --max_rows=200000 < SM_MAX_Series_query_gcs.txt | tail -n +2 > SM_MAX_Series_gcs_manifest.txt
!bq query --use_legacy_sql=false --format=csv --max_rows=200000 < MR_MAX_Series_query_gcs.txt | tail -n +2 > MR_MAX_Series_gcs_manifest.txt
!bq query --use_legacy_sql=false --format=csv --max_rows=200000 < CT_NLST_Series_query_gcs.txt | tail -n +2 > CT_NLST_Series_gcs_manifest.txt

In [None]:
!bq query --use_legacy_sql=false --format=csv --max_rows=200000 < SM_MAX_Series_query_s5cmd.txt | tail -n +2 > SM_MAX_Series_s5cmd_manifest.txt
!bq query --use_legacy_sql=false --format=csv --max_rows=200000 < MR_MAX_Series_query_s5cmd.txt | tail -n +2 > MR_MAX_Series_s5cmd_manifest.txt
!bq query --use_legacy_sql=false --format=csv --max_rows=200000 < CT_NLST_Series_query_s5cmd.txt | tail -n +2 > CT_NLST_Series_s5cmd_manifest.txt

Waiting on bqjob_r1d4cb9b5743db7d5_000001818cb957ae_1 ... (0s) Current status: DONE   
Waiting on bqjob_r3d2c979cc9d04108_000001818cb9745c_1 ... (0s) Current status: DONE   
Waiting on bqjob_r7ee7eef0c1b00505_000001818cb996e2_1 ... (1s) Current status: DONE   


In [None]:
!bq query --use_legacy_sql=false --format=csv --max_rows=200000 < SM_MAX_Series_query_drs.txt | tail -n +2 > SM_MAX_Series_drs_manifest.txt
!bq query --use_legacy_sql=false --format=csv --max_rows=200000 < MR_MAX_Series_query_drs.txt | tail -n +2 > MR_MAX_Series_drs_manifest.txt
!bq query --use_legacy_sql=false --format=csv --max_rows=200000 < CT_NLST_Series_query_drs.txt | tail -n +2 > CT_NLST_Series_drs_manifest.txt

Waiting on bqjob_r46ce06668fd866ff_000001818cbaf41e_1 ... (0s) Current status: DONE   
Waiting on bqjob_r727fb7b2d69426fd_000001818cbb11ab_1 ... (0s) Current status: DONE   
Waiting on bqjob_r2e9caacf6c0ed648_000001818cbb2f85_1 ... (0s) Current status: DONE   
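
The s5cmd manifest differs from the GCS manifest only in that each URL is rewritten from `gs://` to `s3://` and wrapped in a `cp <url> .` command, as done by the `CONCAT`/`REPLACE` expression in the s5cmd query template. If you already have a GCS manifest, the same transformation can be sketched in Python without a second query. The function names below are hypothetical, and unlike the SQL `REPLACE`, this version rewrites only the leading scheme.

```python
def gcs_to_s5cmd_line(gcs_url: str, destination: str = ".") -> str:
    """Turn one gs:// URL into an s5cmd 'cp' command line."""
    prefix = "gs://"
    if not gcs_url.startswith(prefix):
        raise ValueError(f"not a GCS URL: {gcs_url!r}")
    return f"cp s3://{gcs_url[len(prefix):]} {destination}"


def convert_manifest(gcs_lines):
    """Convert the lines of a GCS manifest, skipping empty lines."""
    return [gcs_to_s5cmd_line(line.strip()) for line in gcs_lines if line.strip()]


# One URL taken from the gsutil output of the SM_MAX_Series download
example = ["gs://public-datasets-idc/93749fc6-5c42-4346-9b46-41128b54a93e.dcm\n"]
for command in convert_manifest(example):
    print(command)
```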


# SM_MAX_Series

In [None]:
%%time

!rm -rf *dcm
!./s5cmd --endpoint-url https://storage.googleapis.com run SM_MAX_Series_s5cmd_manifest.txt

cp s3://public-datasets-idc/93749fc6-5c42-4346-9b46-41128b54a93e.dcm 93749fc6-5c42-4346-9b46-41128b54a93e.dcm
cp s3://public-datasets-idc/cdad2c18-bfa4-4e28-ba91-e60cbfe2a992.dcm cdad2c18-bfa4-4e28-ba91-e60cbfe2a992.dcm
cp s3://public-datasets-idc/121e6068-7b8f-4cf5-a4c1-3cec3e2ececf.dcm 121e6068-7b8f-4cf5-a4c1-3cec3e2ececf.dcm
cp s3://public-datasets-idc/4a12d515-ee8e-455b-865d-8d29fd7906d1.dcm 4a12d515-ee8e-455b-865d-8d29fd7906d1.dcm
cp s3://public-datasets-idc/1b719f60-81ae-436f-b3d8-ce245094afe8.dcm 1b719f60-81ae-436f-b3d8-ce245094afe8.dcm
cp s3://public-datasets-idc/5cb43495-6f3e-4c65-9bc5-4566e6481d14.dcm 5cb43495-6f3e-4c65-9bc5-4566e6481d14.dcm
CPU times: user 443 ms, sys: 111 ms, total: 553 ms
Wall time: 51.8 s


In [None]:
%%time

!rm -rf *dcm
!cat SM_MAX_Series_gcs_manifest.txt | gsutil -m cp -I .

Copying gs://public-datasets-idc/1b719f60-81ae-436f-b3d8-ce245094afe8.dcm...
Copying gs://public-datasets-idc/93749fc6-5c42-4346-9b46-41128b54a93e.dcm...
Copying gs://public-datasets-idc/cdad2c18-bfa4-4e28-ba91-e60cbfe2a992.dcm...
Copying gs://public-datasets-idc/4a12d515-ee8e-455b-865d-8d29fd7906d1.dcm...
Copying gs://public-datasets-idc/5cb43495-6f3e-4c65-9bc5-4566e6481d14.dcm...
Copying gs://public-datasets-idc/121e6068-7b8f-4cf5-a4c1-3cec3e2ececf.dcm...
\ [6/6 files][  8.0 GiB/  8.0 GiB] 100% Done  22.9 MiB/s ETA 00:00:00           
Operation completed over 6 objects/8.0 GiB.                                      
CPU times: user 1.35 s, sys: 227 ms, total: 1.58 s
Wall time: 1min 54s
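
The wall times recorded in the two cells above (51.8 s and 1 min 54 s) differ somewhat from the values in the summary table, presumably because they come from a different run. As a rough sanity check, the average throughput for this run can be computed from the 8.0 GiB total reported by gsutil, assuming both tools transferred the same six files.

```python
MIB_PER_GIB = 1024

# Total size reported by gsutil for the six files of SM_MAX_Series
size_mib = 8.0 * MIB_PER_GIB

# Wall times taken from the cell outputs above, in seconds
wall_times_s = {"s5cmd": 51.8, "gsutil": 60 + 54}

# Average throughput over the whole transfer, in MiB/s
throughput = {tool: size_mib / seconds for tool, seconds in wall_times_s.items()}

for tool, rate in throughput.items():
    print(f"{tool}: {rate:.0f} MiB/s")
```

This works out to about 158 MiB/s for s5cmd and about 72 MiB/s for gsutil on this run.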


# MR_MAX_Series

In [None]:
%%time
%%capture

!rm -rf *dcm
!./s5cmd --endpoint-url https://storage.googleapis.com run MR_MAX_Series_s5cmd_manifest.txt

Streaming output truncated to the last 5000 lines.
cp s3://idc-open-idc/f031d713-f67c-491c-a1aa-6b53a912b4fd.dcm f031d713-f67c-491c-a1aa-6b53a912b4fd.dcm
cp s3://idc-open-idc/31e767d3-a198-45ea-9eac-64ee642d862c.dcm 31e767d3-a198-45ea-9eac-64ee642d862c.dcm
cp s3://idc-open-idc/609f99b8-e103-421a-a2fe-dab8ace8a083.dcm 609f99b8-e103-421a-a2fe-dab8ace8a083.dcm
cp s3://idc-open-idc/98127838-a58e-44d9-bf34-f2c47412915f.dcm 98127838-a58e-44d9-bf34-f2c47412915f.dcm
cp s3://idc-open-idc/f5b6cc7c-7771-4d40-a68a-50907dd2a1d6.dcm f5b6cc7c-7771-4d40-a68a-50907dd2a1d6.dcm
cp s3://idc-open-idc/4eb24760-f9de-4df6-8589-9e18d33601f3.dcm 4eb24760-f9de-4df6-8589-9e18d33601f3.dcm
cp s3://idc-open-idc/8dbb013b-5674-49ea-8d2e-c03b4e040184.dcm 8dbb013b-5674-49ea-8d2e-c03b4e040184.dcm
cp s3://idc-open-idc/43a99d05-cb9f-4eeb-a8a8-04ccdc3db5c4.dcm 43a99d05-cb9f-4eeb-a8a8-04ccdc3db5c4.dcm
cp s3://idc-open-idc/4ed35faf-4f6a-4188-9301-b0d7cf51749a.dcm 4ed35faf-4f6a-4188-9301-b0d7cf51749a.dcm
cp s3://

In [None]:
%%time

!rm -rf *dcm
!cat MR_MAX_Series_gcs_manifest.txt | gsutil -m cp -I .

Copying gs://idc-open-idc/5e1740c6-6013-473c-a7da-162c5b60571a.dcm...
Copying gs://idc-open-idc/3ab2418d-3920-445b-aef7-ce1b8dfb0f86.dcm...
Copying gs://idc-open-idc/b59b6fbc-8dc1-4bcf-a38e-10bc1f88547f.dcm...
Copying gs://idc-open-idc/9feb45b2-251b-4fd3-95e7-c2cdb196e7ca.dcm...
Copying gs://idc-open-idc/60167694-87cc-472f-ab7f-520c921061ec.dcm...
Copying gs://idc-open-idc/70defda4-e9ab-4709-9fa4-f7c4661d8818.dcm...
Copying gs://idc-open-idc/40bc5f2c-47bc-4e4a-8097-ae0ca896678b.dcm...
Copying gs://idc-open-idc/16ec51b5-ded1-4a82-8145-0f909c8f9504.dcm...
Copying gs://idc-open-idc/2776fe65-eb59-4b33-afe0-2358b75a5567.dcm...
Copying gs://idc-open-idc/a6341f2d-a463-4be6-a81a-0f8805a85de9.dcm...
Copying gs://idc-open-idc/7c1d8626-acc9-4b74-8f02-be825f336a03.dcm...
Copying gs://idc-open-idc/9bdca19a-1e29-4ee5-bdf3-1701aa2a0160.dcm...
Copying gs://idc-open-idc/ce4865ef-1b46-4b8a-8815-208b8d4f2437.dcm...
Copying gs://idc-open-idc/579c7eed-294c-4664-8122-bb516a1976cd.dcm...
Copying gs://idc-ope

# CT_NLST_Series

In [None]:
%%time

!rm -rf *dcm
!./s5cmd --endpoint-url https://storage.googleapis.com run CT_NLST_Series_s5cmd_manifest.txt

In [None]:
%%time

!rm -rf *dcm
!cat CT_NLST_Series_gcs_manifest.txt | gsutil -m cp -I .