# IDC ETL, v10
This notebook implements the IDC ETL process. It closely follows the steps in section 6 of [ETL Workflow, v10](https://docs.google.com/document/d/1luEnT0Vr5_VZwOYl2WaIwwHDvVfm0IBeXVzxzRGfzMs/edit#heading=h.2r0uhxc). Refer to that document for a description of the process.

## Preliminary

We do ETL development in Pycharm, wherein Pycharm is configured to execute ETL scripts on a remote VM. For purposes of remote execution, Pycharm maintains a copy of most of the files in a Pycharm project on the remote VM.
Presumably, one could also pull/clone the project data from the [etl_flow](https://github.com/ImagingDataCommons/etl_flow) repo. We have not yet tried that.

When performing the ingestion step, an 8-core VM is recommended to support multi-process downloading of the data. For other tasks, a 2-core VM is usually sufficient.

Regardless, the following constants should be changed to define the location of the etl_flow project.

VM is an alias of the VM on which ETL scripts are executed. Such an alias can be generated by the [gcloud compute config-ssh](https://cloud.google.com/sdk/gcloud/reference/compute/config-ssh) CLI.

EF is the top directory of the Pycharm etl_flow project.

In [None]:
VM = 'etl-dev-whc.us-central1-a.idc-etl-processing'
EF = '/pycharm/etl_flow/tmp/pycharm_project_936'

Define an alias for initiating remote execution of an ETL script. It is used like:

%remex utilities/tcia_helpers.py

remex takes a single parameter: the path, relative to the project root, of the script to be executed.

In [None]:
%alias remex ssh bcliffor@$VM env PYTHONPATH=/pycharm/etl_flow/tmp/pycharm_project_936:/pycharm/etl_flow/tmp/secure_files PYTHONUNBUFFERED=1 SECURE_LOCAL_PATH=../secure_files/etl SETTINGS_MODULE=settings python3.9 $EF/%s

### A note on logging
We are working to standardize logging as follows:

Log files are created in the directory settings.LOGGING_BASE/settings.BASE_NAME where, currently:

```
LOGGING_BASE = f'/mnt/disks/idc-etl/logs/v{CURRENT_VERSION}'
BASE_NAME = sys.argv[0].rsplit('/',1)[-1].rsplit('.',1)[0]
LOG_DIR = f'{LOGGING_BASE}/{BASE_NAME}'
```

Then we configure three loggers, successlogger, errlogger and progresslogger, which output, respectively, to success.log, error.log and progress.log files in LOG_DIR. 
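
As a rough illustration of where these files end up, the following sketch mirrors the path expressions above for an arbitrary script path. The helper names `base_name` and `log_paths` are hypothetical and are not part of etl_flow; the version number is passed in as an argument rather than read from settings.

```python
def base_name(script_path: str) -> str:
    """Strip the directory and the final extension, as in BASE_NAME above."""
    return script_path.rsplit('/', 1)[-1].rsplit('.', 1)[0]


def log_paths(script_path: str, current_version: int) -> dict:
    """Return the three log file paths for a script (hypothetical helper)."""
    logging_base = f'/mnt/disks/idc-etl/logs/v{current_version}'
    log_dir = f'{logging_base}/{base_name(script_path)}'
    return {stem: f'{log_dir}/{stem}.log'
            for stem in ('success', 'error', 'progress')}


paths = log_paths('ingestion/ingest.py', 10)
print(paths['success'])  # /mnt/disks/idc-etl/logs/v10/ingest/success.log

# Only the last extension is stripped, so dotted script names keep their inner dots.
print(base_name('bq/bq_IO/upload_psql_to_bq.vnext.dev.py'))  # upload_psql_to_bq.vnext.dev
```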

In general, outputs to success.log track completed operations, and are used to avoid repeating those operations in the event that execution of a script is interrupted and must be restarted. For example, a script that copies some set of instances from one bucket to another might log the name of each blob as it is successfully copied. Then, if that script must be restarted, it can input the contents of success.log, and skip copying any instance whose ID is in that input data.
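
The restart pattern described above can be sketched as follows. This is a minimal illustration, not the etl_flow implementation: it assumes that success.log holds one bare blob name per line (the real logger may add a prefix to each line), and `load_done`, `copy_all` and the `copy_one` callback are hypothetical names.

```python
import os
import tempfile


def load_done(success_log_path):
    """Return the set of names already recorded in success.log (empty if absent)."""
    if not os.path.exists(success_log_path):
        return set()
    with open(success_log_path) as f:
        return {line.strip() for line in f if line.strip()}


def copy_all(blob_names, success_log_path, copy_one):
    """Copy each blob not yet logged; append its name to success.log on success."""
    done = load_done(success_log_path)
    copied = []
    with open(success_log_path, 'a') as log:
        for name in blob_names:
            if name in done:
                continue  # Completed in an earlier run, so skip it
            copy_one(name)  # Assumed to raise on failure
            log.write(name + '\n')
            log.flush()
            copied.append(name)
    return copied


# Simulate an interrupted run followed by a restart.
with tempfile.TemporaryDirectory() as tmp:
    log_path = os.path.join(tmp, 'success.log')
    first_run = copy_all(['a.dcm', 'b.dcm'], log_path, copy_one=lambda n: None)
    second_run = copy_all(['a.dcm', 'b.dcm', 'c.dcm'], log_path, copy_one=lambda n: None)

print(first_run)   # ['a.dcm', 'b.dcm']
print(second_run)  # ['c.dcm']
```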

As its name implies, errors are logged to error.log.

The progresslogger is generally used to log information about the progress of the operation. 

A particular script may use all, some or none of these loggers.

Note that when a script is restarted, its progress.log is truncated. The other log files are appended to.
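
One way to get this truncate-versus-append behavior with the standard `logging` module is to open the progress handler in write mode and the other two in append mode. The sketch below is an assumption about how such a configuration could look, not a copy of the etl_flow settings; `configure_loggers` and the `demo.*` logger names are hypothetical.

```python
import logging
import os
import tempfile


def configure_loggers(log_dir):
    """Create success/error/progress loggers; only progress.log is truncated."""
    os.makedirs(log_dir, exist_ok=True)
    loggers = {}
    for stem, mode in (('success', 'a'), ('error', 'a'), ('progress', 'w')):
        logger = logging.getLogger(f'demo.{stem}')
        logger.setLevel(logging.INFO)
        logger.propagate = False
        # Drop handlers left over from a previous configuration
        for old in list(logger.handlers):
            logger.removeHandler(old)
            old.close()
        handler = logging.FileHandler(os.path.join(log_dir, f'{stem}.log'), mode=mode)
        handler.setFormatter(logging.Formatter('%(message)s'))
        logger.addHandler(handler)
        loggers[stem] = logger
    return loggers


def read_lines(path):
    with open(path) as f:
        return f.read().splitlines()


with tempfile.TemporaryDirectory() as tmp:
    run1 = configure_loggers(tmp)          # First execution of a script
    run1['success'].info('blob-1')
    run1['progress'].info('run 1 started')
    run2 = configure_loggers(tmp)          # Restart of the same script
    run2['success'].info('blob-2')
    run2['progress'].info('run 2 started')
    for logger in run2.values():           # Close files before reading them
        for handler in list(logger.handlers):
            logger.removeHandler(handler)
            handler.close()
    success_lines = read_lines(os.path.join(tmp, 'success.log'))
    progress_lines = read_lines(os.path.join(tmp, 'progress.log'))

print(success_lines)   # ['blob-1', 'blob-2']
print(progress_lines)  # ['run 2 started']
```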

We often execute `tail -f xxx.log` on each of the log files in separate terminal windows to monitor progress. 

## Pre-ingestion


### Update settings.py

The CURRENT_VERSION and PREVIOUS_VERSION in settings.py must be set as needed for the new version.

### Import the DB
Create a new database, typically "idc_v\<X\>", where X is the CURRENT_VERSION in settings. Then import from the saved final idc_v\<Y\> DB where Y is PREVIOUS_VERSION.

### Revise the collection_id_map DB table
Determine if there are any non-casing collectionID changes and revise collection_id as needed. For this purpose we execute the detect_tcia_collection_name_changes.py preingestion script. It uses the TCIA source DOI to map between TCIA collection IDs and IDC collection names. Collection IDs that differ only in casing are ignored. Only failures are reported:

In [None]:
%remex preingestion/detect_tcia_collection_name_changes.py

A failure indicates that the TCIA ID of a collection, which we refer to as the tcia_api_collection_id, has changed in more than just casing. If such a change is detected, the collection_id_map table must be manually updated. 
Note that this script only deals with TCIA collections. Dealing with other sources is TBD.

### Update wsi metadata tables
If there is new WSI data to be ingested, we must first update the WSI database tables. There are several parameters that likely need to be specified in order to override the defaults set in the script.
We can run the build_wsi_metadata_tables_tsv.py script with --help to view the possible parameters:

In [None]:
%remex preingestion/wsi_build/build_wsi_metadata_tables_tsv.py --help

As an example use of these parameters: At the time of writing, several HTAN collections have been copied into the *htan-transfer* bucket. The *HTAN-V1-Converted/Converted_20220416* folder in that bucket is the root of the specific conversion set which we want to ingest. The *identifiers.txt* blob in that folder is a tsv file that enumerates the instances in all the HTAN collections in the particular conversion. 

At this time, we only want to ingest the HTAN-OHSU collection and thus want to skip the HTAN-HMS, HTAN-Vanderbilt and HTAN-WUSTL collections. None of these collections belongs to a collection group whose members are all to be skipped, so we list them individually.

Therefore, we execute build_wsi_metadata_tables_tsv.py as follows:

In [None]:
%remex preingestion/wsi_build/build_wsi_metadata_tables_tsv.py \
    --src_bucket htan-transfer \
    --src_path HTAN-V1-Converted/Converted_20220416 \
    --tsv_blob identifiers.txt \
    --skipped_collections HTAN-HMS HTAN-Vanderbilt HTAN-WUSTL



Now verify that the new instances have been correctly added to the DB:

In [None]:
# TBD

As an aside, the preingestion/wsi_build/remove_wsi_metadata_tsv.py script does the opposite of build_wsi_metadata_tables_tsv.py: it removes any instance that is listed in a tsv and which it finds in the wsi_instance DB table. It also hierarchically removes emptied series, studies, patients and collections from the corresponding wsi tables. The parameterization is the same as for build_wsi_metadata_tables_tsv.py. For example, the following will remove the wsi metadata added above:

In [None]:
%remex preingestion/wsi_build/remove_wsi_metadata_tsv.py \
    --src_bucket htan-transfer \
    --src_path HTAN-V1-Converted/Converted_20220416 \
    --tsv_blob identifiers.txt \
    --skipped_collections HTAN-HMS HTAN-Vanderbilt HTAN-WUSTL


## Ingestion

We are now almost ready to ingest data. The base script is ingestion/ingest.py. It has several parameters that need to be configured:

In [None]:
%remex ingestion/ingest.py --help

ingestion/ingest.py can spawn multiple processes to speed up ingestion. If num_processes is 0, then all work is performed in the base process. We do not generally exceed 16 processes in order to avoid overloading the TCIA/NBIA server. 

Ingestion pulls radiology data from TCIA, and pulls pathology data from the bucket or buckets into which it has been placed by the WSI conversion process, as specified in the wsi_instance table. Basically, ingestion asks each of the sources "What collections do you have?" and then proceeds to get new data and update our database. However, we often want to limit the collections which we try to get from one of the sources. In particular, we do not revise excluded collections; these are collections that we have previously ingested but then did not make public because they were considered to be of questionable quality. Similarly, we do not revise redacted collections; these are the collections that are known to contain head scans. So we specifically do not update the radiology component of these collections but, in some cases, do want to get pathology. We also never try to update NLST radiology, which is supposed never to change. However, we have wanted to, and will again want to, update NLST pathology, for example.

For this purpose, there are separate skipped_tcia_groups and skipped_path_groups parameters that separately list groups of collections which we don't want ingestion to process.

The skipped_tcia_collections and skipped_path_collections parameters allow skipping additional collections from some other collection group.

Collections identified by the include_tcia_collections and include_path_collections override corresponding sets of skipped collections. That is, the collections enumerated by these parameters are processed even if enumerated in one of the previous parameters.
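
The interaction of the three kinds of parameters, for one source, can be sketched with set operations. This is only an illustration of the rule stated above (skipped groups plus skipped collections, minus included collections); the function name, the group membership and the `coll_*` collection names are all hypothetical.

```python
def collections_to_process(source_collections, skipped_groups, group_members,
                           skipped_collections=(), included_collections=()):
    """Return the source's collections that ingestion would process."""
    skipped = set(skipped_collections)
    for group in skipped_groups:
        skipped |= set(group_members.get(group, ()))
    # Included collections override both kinds of skipping
    skipped -= set(included_collections)
    return [c for c in source_collections if c not in skipped]


# Hypothetical group membership and collection names, for illustration only
group_members = {
    'redacted_collections': ['coll_b', 'coll_c'],
    'excluded_collections': ['coll_d'],
}
result = collections_to_process(
    source_collections=['coll_a', 'coll_b', 'coll_c', 'coll_d', 'coll_e'],
    skipped_groups=['redacted_collections', 'excluded_collections'],
    group_members=group_members,
    skipped_collections=['coll_e'],
    included_collections=['coll_c'],
)
print(result)  # ['coll_a', 'coll_c']
```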

The server parameter, essentially, indicates whether to check for updates to the NLST collection. Because we don't expect to ever revise NLST, this parameter can be ignored.

Ingestion copies data from the sources into per-version/per-collection/per-source buckets, e.g. idc_v10_path_tcga_brca. The prestaging_tcia_bucket_prefix and prestaging_path_bucket_prefix parameters allow changing the bucket prefix from the defaults `idc_v<CURRENT_VERSION>_tcia_` and `idc_v<CURRENT_VERSION>_path_`.

Finally, it is often desirable to see which collections the ingestion process has identified as new, subject to revision, or about to be retired, before proceeding. If stop_after_collection_summary is True, ingestion will exit after printing out a summary of these pending changes, e.g.:

In [None]:
%remex ingestion/ingest.py --stop_after_collection_summary True

In the following we have added two collections to both skipped_tcia_collections and skipped_path_collections. Notice that, because the collection IDs contain spaces, we must both escape the spaces and quote the IDs:

In [None]:
%remex ingestion/ingest.py \
    --skipped_tcia_collections NLST HCC-TACE-Seg 'QIN\ Breast\ DCE-MRI' 'QIN\ LUNG\ CT' \
    --skipped_path_collections NLST HCC-TACE-Seg "QIN\ Breast\ DCE-MRI"  "QIN\ LUNG\ CT" \
    --stop_after_collection_summary True
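
The need for both quoting and escaping comes from the command line being parsed twice: once by the local shell and once by the remote shell, with ssh joining its arguments with spaces in between. The following sketch approximates that double parsing with `shlex`, which follows POSIX shell splitting rules closely enough for this illustration; `remote_argv` is a hypothetical helper.

```python
import shlex


def remote_argv(local_command_line):
    """Approximate the argument list seen by the script on the remote VM."""
    local_args = shlex.split(local_command_line)  # Parsing by the local shell
    remote_line = ' '.join(local_args)            # ssh joins its arguments with spaces
    return shlex.split(remote_line)               # Parsing by the remote shell


# Quoted and escaped: the ID survives both rounds of parsing as one argument.
escaped = remote_argv(r"--skipped_tcia_collections 'QIN\ LUNG\ CT'")
print(escaped)  # ['--skipped_tcia_collections', 'QIN LUNG CT']

# Quoted only: the remote shell splits the ID into three arguments.
quoted_only = remote_argv("--skipped_tcia_collections 'QIN LUNG CT'")
print(quoted_only)  # ['--skipped_tcia_collections', 'QIN', 'LUNG', 'CT']
```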

We can now perform ingestion. This can take several days, depending on the amount of data to be ingested. In particular, pulling data from NBIA generally runs at 8-12MB/s. 
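
As a back-of-the-envelope check on that estimate, the sketch below converts a data volume and a transfer rate into days. The 2 TB volume is a made-up figure for illustration, 1 MB is taken as 10^6 bytes, and `transfer_days` is a hypothetical helper.

```python
def transfer_days(total_bytes, rate_mb_per_s):
    """Days needed to move total_bytes at a sustained rate given in MB/s."""
    seconds = total_bytes / (rate_mb_per_s * 1_000_000)
    return seconds / 86_400  # Seconds per day


hypothetical_volume = 2 * 10**12  # 2 TB, an assumed volume
for rate in (8, 12):
    print(f'{rate} MB/s: {transfer_days(hypothetical_volume, rate):.1f} days')
# 8 MB/s: 2.9 days
# 12 MB/s: 1.9 days
```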

In [None]:
%remex ingestion/ingest.py \
    --num_processes 12 \
    --skipped_tcia_collections NLST HCC-TACE-Seg 'QIN\ Breast\ DCE-MRI' 'QIN\ LUNG\ CT' \
    --skipped_path_collections NLST HCC-TACE-Seg "QIN\ Breast\ DCE-MRI"  "QIN\ LUNG\ CT" 

### Validate UUID uniqueness
The DB will check for the uniqueness of UUIDs within a level, e.g. among all instances, or among all patients, but not across levels. We now need to validate that there are no collisions among all IDC-generated UUIDs. Such collisions are extremely unlikely but still possible:

In [None]:
%remex ingestion/validation/uuids_are_unique.py

In the event that there is a collision, the UUID of the new instance, and the name of the corresponding blob, must be changed. There is no script for this, so it must be done manually. Note that the `series_instance` many_to_many table is based on UUIDs, and so will also have to be changed.
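
The idea of a cross-level uniqueness check can be sketched in a few lines. This is not the uuids_are_unique.py implementation; it assumes the UUIDs of each level have already been fetched into lists, and the function name and toy IDs are hypothetical. Duplicates within a single level are not reported here, since the DB already enforces that.

```python
from collections import defaultdict


def find_cross_level_collisions(uuids_by_level):
    """Map each UUID that appears in more than one level to those levels."""
    levels_seen = defaultdict(set)
    for level, uuids in uuids_by_level.items():
        for uuid in uuids:
            levels_seen[uuid].add(level)
    return {u: sorted(levels) for u, levels in levels_seen.items() if len(levels) > 1}


# Toy identifiers, for illustration only
collisions = find_cross_level_collisions({
    'study': ['aaa', 'bbb'],
    'series': ['ccc', 'ddd'],
    'instance': ['eee', 'bbb'],
})
print(collisions)  # {'bbb': ['instance', 'study']}
```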

### Other ingestion validation
To a large extent, ingestion is self-validating. Specifically, the instance stage of ingestion verifies that after instances are copied to GCS, GCS has the expected blobs and that they each have the expected hash.

Then, at each stage in the hierarchy, it compares hierarchical hashes with those from the corresponding source or sources. A hash mismatch prevents marking the corresponding objects as done. That, in turn, prevents marking higher level objects as done.

Thus, if ingestion marks the new version as done, then all hierarchical hashes match the corresponding hashes of the sources.
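
To make the propagation concrete, here is a small sketch of a hierarchical hash check. The hashing scheme used here (MD5 over the concatenation of the sorted child hashes) is an assumption chosen for illustration; the scheme actually used by ingestion is described in the ETL Workflow document, and the function names are hypothetical.

```python
import hashlib


def md5_hex(data: bytes) -> str:
    return hashlib.md5(data).hexdigest()


def hierarchical_hash(child_hashes):
    """Assumed scheme: MD5 of the concatenated, sorted child hashes."""
    return md5_hex(''.join(sorted(child_hashes)).encode())


def series_done(local_instance_hashes, source_series_hash):
    """A series is done only if its hash matches the source's hash."""
    return hierarchical_hash(local_instance_hashes) == source_series_hash


def study_done(series_done_flags):
    """A study cannot be done unless every one of its series is done."""
    return all(series_done_flags)


source_instances = [md5_hex(b'instance-1'), md5_hex(b'instance-2')]
source_hash = hierarchical_hash(source_instances)

good = series_done(list(reversed(source_instances)), source_hash)       # Order does not matter
bad = series_done([md5_hex(b'instance-1'), md5_hex(b'corrupted')], source_hash)

print(good, bad)                # True False
print(study_done([good, bad]))  # False
```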

## Post-ingestion


### Revise collection group tables
There are four collection group *tables* in the database:
- cr_collections
- defaced_collections
- redacted_collections
- excluded_collections

These are described in the ETL Workflow document. 
Any collections not in these tables are defined as open collections. `open_collections` is a view that resolves to the metadata of all open collections.
If ingestion adds a new collection to one of the cr, defaced, redacted or excluded collection groups, the corresponding table must be manually updated. This is expected to be a rare event so there is no script for this. Example SQL for manually updating the cr_collections table:
```
idc_v10=> insert into cr_collections values('ISPY2', 'cc608e04-d37b-4faa-86d8-5b17068c6f80', 'idc-dev-cr', 'idc-dev-cr', 'idc-open-cr', 'idc-open-cr', 'Public', 'Public');
idc_v10=> insert into cr_collections values('ACRIN-6698', '50befbe5-9bd6-4d08-b766-b8825a8b7bb3', 'idc-dev-cr', 'idc-dev-cr', 'idc-open-cr', 'idc-open-cr', 'Public', 'Public');
```


### Revise program table
The program table associates each collection with a program. It is manually updated after updating a separately maintained [spreadsheet](https://docs.google.com/spreadsheets/d/1-sk8CMTDDj-deKv7sXglLvHUhDSNS1cRqUg5Oy5UpRY/edit#gid=0). Example SQL:

```
idc_v10=> insert into program values ('ISPY2', 'NCI Trials');
idc_v10=> insert into program values ('ACRIN-6698', 'NCI Trials');
```

### Populate a DICOM store and export metadata to dicom_metadata BQ table
This step has several substeps.

#### Import all IDC data in the version from buckets
The first substep is to import all instances from the idc-dev-open, idc-dev-cr, idc-dev-defaced, and idc-dev-redacted staging buckets, and the premerge buckets (e.g. idc_v9_path_tcga_gbm), into a new DICOM store. In other words, we import into the DICOM store before merging the premerge buckets.

We will eventually export DICOM metadata to BQ. For this purpose, we import from the redacted collections because we continue to include metadata of those collections in BQ. 

The default parameters do not normally need overriding.

Note that the progress.log will show 'errors'. These are due to there being more than one version of some instances in a bucket. However, a GCH DICOM store can only hold a single version of an instance (i.e., an instance having a particular SOPInstanceUID), and GCH reports, as an error, each attempt to import an instance having a SOPInstanceUID that is already in the DICOM store. We will deal with this issue in subsequent steps.

We could avoid such errors by importing instances individually, but it is much more efficient to import entire buckets.

In [None]:
%remex gch/populate_dicom_store/step1_import_buckets_with_redacted.py

#### Delete revised and retired instances from the DICOM store
The errors described above occur when there is more than one version of an instance. When such an error has occurred, we know that one version of an instance was uploaded successfully, but any other versions were rejected. We need to end up with the most recent version of any such instance, but we have no way of knowing which version of any instance was actually imported. 

Therefore, in this substep we delete from the DICOM store all instances that have ever been revised. In addition, by importing the entire contents of a bucket, we will have imported any instances that have been "retired", that is, are no longer in the IDC version that we are building. So, in this substep, we also delete those retired instances.

As above, the default parameters do not normally need overriding.


In [None]:
%remex gch/populate_dicom_store/step2_delete_revised_retired_instances_with_redacted.py

#### Insert revised instances
Now that we have deleted all instances that have revisions, we load the latest revision of each such instance. We do not load instances that have been retired; they are not in this IDC version.

In [None]:
%remex gch/populate_dicom_store/step3_insert_revised_instances_with_redaction.py
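
The three substeps so far can be illustrated with a toy model in which the DICOM store is a dict keyed by SOPInstanceUID. This only simulates the logic described above and makes no calls to the GCH API; the function names, UUIDs and SOPInstanceUIDs are hypothetical.

```python
def import_buckets(blobs):
    """Step 1: bulk import. The first version seen of a SOPInstanceUID wins;
    later versions are rejected and reported as 'errors'."""
    store, errors = {}, []
    for uuid, sop_uid in blobs:
        if sop_uid in store:
            errors.append(uuid)
        else:
            store[sop_uid] = uuid
    return store, errors


def delete_revised_and_retired(store, revised_sop_uids, retired_sop_uids):
    """Step 2: remove every instance that was ever revised or is retired."""
    for sop_uid in set(revised_sop_uids) | set(retired_sop_uids):
        store.pop(sop_uid, None)


def insert_revised(store, current_version, revised_sop_uids):
    """Step 3: load the latest revision of each revised, non-retired instance."""
    for sop_uid in revised_sop_uids:
        if sop_uid in current_version:
            store[sop_uid] = current_version[sop_uid]


# Bucket contents as (blob UUID, SOPInstanceUID); 'sop-A' has two versions.
blobs = [('u1', 'sop-A'), ('u2', 'sop-A'), ('u3', 'sop-B'), ('u4', 'sop-C')]
current_version = {'sop-A': 'u2', 'sop-B': 'u3'}  # 'sop-C' has been retired

store, errors = import_buckets(blobs)
delete_revised_and_retired(store, revised_sop_uids=['sop-A'], retired_sop_uids=['sop-C'])
insert_revised(store, current_version, revised_sop_uids=['sop-A'])

print(errors)                    # ['u2']
print(store == current_version)  # True
```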

#### Export DICOM metadata 
The DICOM store now holds all instances whose metadata we want to export to BQ. The step4_export_metadata.py script exports DICOM metadata to the idc-dev-etl.idc_vX_pub.dicom_metadata table. There are no parameters that might need overriding.

In [None]:
%remex gch/populate_dicom_store/step4_export_metadata.py

#### Delete redacted instances
The DICOM store holds instance data from redacted collections. We now remove them from the DICOM store so that they cannot be viewed.

In [None]:
%remex gch/populate_dicom_store/step5_delete_redacted_instances.py

#### Validation  
Validate that the dicom_metadata BQ table has the expected SOPInstanceUIDs.

In [None]:
%remex gch/validate_dicom_store/validate_dicom_store_instance_count.py

### Create an external connection
Uploading DB tables to BQ, the next step, uses an external connection to access a Cloud SQL database. Each external connection is specific to a particular database, so we need to create an external connection for the DB of the new IDC version. Previously defined connections can be seen in the BQ console in the External connections dataset of the idc-dev-etl project. 

The following script creates a new connection:

In [None]:
%remex bq/bq_IO/utils/create_bq_external_connection.py

### Upload DB tables to BQ
Next, we upload these DB tables to BQ:
- analysis_id_map
- collection_id_map
- version
- version_collection
- collection
- collection_patient
- patient
- patient_study
- study
- study_series
- series
- series_instance
- instance
- cr_collections
- defaced_collections
- excluded_collections
- open_collections
- redacted_collections
- all_collections
- all_included_collections
- program
- non_tcia_collection_metadata

All of these tables are uploaded by default:

In [None]:
%remex bq/bq_IO/upload_psql_to_bq.vnext.dev.py 

On occasion, we need to upload selected tables, e.g. if we have had to correct a table in the database. Use the --upload parameter for this, e.g.:

In [None]:
%remex bq/bq_IO/upload_psql_to_bq.vnext.dev.py \
    --upload 'analysis_results_descriptions'

### Generate analysis_results_metadata
Generate the analysis_results_metadata BQ table into the dev project. This script currently scrapes the TCIA analysis results page to get most of the data.
The normal flow is to generate this table in the idc-dev-etl project and later copy to the PDP staging project.


In [None]:
%remex bq/gen_analysis_results_table/gen_analysis_results_metadata_table.py

### Generate original_collections_metadata
As for the analysis_results_metadata table, we'll copy to the PDP staging project later. 

In [None]:
%remex bq/gen_original_data_collections_table/gen_included_original_collection_metadata.py

### Generate excluded_collections_metadata
excluded_collections_metadata includes the metadata of collections in the excluded_collections group. Unlike original_collections_metadata, it is not a public table.

In [None]:
%remex bq/gen_original_data_collections_table/gen_excluded_original_collection_metadata.py

### Generate auxiliary_metadata
The auxiliary_metadata BQ table defines the version. It contains IDC metadata for each instance in the version. In particular it includes the GCS URL of each instance.

There are two GCS blobs for each instance. One copy is in one of several buckets in the idc-dev-etl project. Another copy is (or will be, after Google PDP staging) in a Google PDP-owned or IDC-owned public bucket. (Instances in the excluded collections are an exception: there is only a single copy of each, in the idc-dev-excluded bucket.) Therefore we generate two versions of auxiliary_metadata: idc-dev-etl.idc_v\<version\>.auxiliary_metadata has GCS URLs in the idc-dev-etl buckets, and the version in idc-pdp-staging has GCS URLs in the public buckets. We generate the two tables separately.

At this point in the process, the newly ingested data remains in the per-version/per-collection/per-source "premerge" buckets. The following script is configured to build from data in those buckets as well as from the idc-dev-open, idc-dev-cr, idc-dev-defaced and idc-dev-redacted staging buckets.

In [None]:
%remex bq/gen_aux_metadata_table/gen_auxiliary_metadata_table.dev.premerge.py

Now we build the table in the idc-pdp-staging project:

In [None]:
%remex bq/gen_aux_metadata_table/gen_auxiliary_metadata_table.pdp.py

Verify that auxiliary_metadata has the same set of SOPInstanceUIDs as dicom_metadata. First check the auxiliary_metadata in idc-dev-etl:

In [None]:
%remex bq/gen_aux_metadata_table/validate/aux_matches_dicom_metadata.dev.py

Then auxiliary_metadata in the PDP staging project:

In [None]:
%remex bq/gen_aux_metadata_table/validate/aux_matches_dicom_metadata.pdp.py
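
Conceptually, this validation is a comparison of two sets of SOPInstanceUIDs. The sketch below shows that comparison on in-memory lists; it is not the validation script itself, which works against BQ, and the function name and toy UIDs are hypothetical.

```python
def compare_uid_sets(aux_uids, dicom_uids):
    """Report the SOPInstanceUIDs found in only one of the two tables."""
    aux, dicom = set(aux_uids), set(dicom_uids)
    return {
        'only_in_auxiliary_metadata': sorted(aux - dicom),
        'only_in_dicom_metadata': sorted(dicom - aux),
    }


# Toy UIDs, for illustration only
report = compare_uid_sets(['1.2.3', '1.2.4', '1.2.5'], ['1.2.3', '1.2.5', '1.2.6'])
print(report)
# {'only_in_auxiliary_metadata': ['1.2.4'], 'only_in_dicom_metadata': ['1.2.6']}

tables_match = not any(report.values())
print(tables_match)  # False
```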

### Generate version_metadata
Create/update a table of per-version metadata.

In [None]:
%remex bq/gen_version_metadata_table/gen_version_metadata_table.py

### Copy bioclin tables

At this time the TCGA and NLST bioclinical tables are in the idc-dev-etl.idc_v\<version\>\_pub dataset rather than idc_clinical. They are generally unchanged across IDC versions so we just copy them from the previous version.

In [None]:
%remex bq/copy_tables/copy_bioclin_tables.vn-1_to_vn.py

### Copy BQ tables from dev to pdp datasets
We can now copy the following tables from idc-dev-etl.idc_v\<version\>\_pub to idc-pdp-staging.idc_v\<version\>:
- analysis_results_metadata
- dicom_metadata
- nlst_canc
- nlst_ctab
- nlst_ctabc
- nlst_prsn
- nlst_screen
- original_collections_metadata
- tcga_biospecimen_rel9
- tcga_clinical_rel9
- version_metadata

All the above are copied by default.

In [None]:
%remex bq/copy_tables/copy_public_tables.dev_to_pdp.py

You can override the default --bqtables parameter, e.g.:

In [None]:
%remex bq/copy_tables/copy_public_tables.dev_to_pdp.py --bqtables nlst_canc nlst_ctab

### Generate open_collections_blob_names
Generate a BQ table of the blob names (\<uuid\>.dcm) of all blobs that should be in the Google PDP public-datasets-idc bucket. This is for the use of the Google PDP program and is only needed in the pdp project:

In [None]:
%remex bq/gen_open_collections_blob_names/gen_open_collections_blob_names.py

### Populate BQ views
Several BQ views are now generated. First generate views in the idc-dev-etl project:

In [None]:
%remex bq/view_creation/BQ_Table_Building/publish_bq_views.dev.py

Then generate views in the idc-pdp-staging project:

In [None]:
%remex bq/view_creation/BQ_Table_Building/publish_bq_views.pdp.py

Now validate that the views "point" to tables in the expected project/dataset:

In [None]:
# TBD

### Generate idc_current dataset
There is an `idc_current` dataset in both the idc-dev-etl and the idc-pdp-staging projects. For completeness, these must be created (or recreated) after the webapp team has generated the dicom_derived_all table and the dicom_pivot_v\<version\> in each of these projects. 

Generate the idc_current dataset in idc-dev-etl:

In [None]:
%remex bq/gen_idc_current/gen_idc_current_dataset.dev.py


and in idc-pdp-staging:

In [None]:
%remex bq/gen_idc_current/gen_idc_current_dataset.pdp.py


Now validate that the views "point" to tables in the expected project/dataset:

In [None]:
# TBD

### Generate DCF manifest
For each IDC version, we need DCF to index all new (versions of) instances:

In [None]:
%remex dcf/gen_instance_manifest/vX_instance_manifest.py

The script saves the resulting manifest in GCS (in the idc-dev-etl project) as gs://indexd_manifests/dcf_input/pdp_hosting/idc\_v\<version\>\_instance_manifest_\*.tsv, where \* is some number. The manifest for a version may consist of several parts. All such parts must be uploaded to [this](https://drive.google.com/drive/folders/1wNYfxLhX0Bhc_CCcuk_llB28YATX4aIm) DCF Drive folder for indexing. In addition, the new manifest name should be added to [this](https://docs.google.com/spreadsheets/d/1CcaPf4hjK9is2JbHnxpfmDYr9WJ7ZACI/edit#gid=820166513) spreadsheet.

After DCF indexes a manifest, we do a statistical validation. dcf/validate_indexes/validate_indexes_in_version.py takes two parameters: --version specifies the version to be validated, --starts_with is a string of hex digits (0-9a-f). The script creates a list of UUIDs of new instances in --version which start with the --starts_with value. Thus, if --starts_with is '0', then all UUIDs that start with '0' are selected, resulting in a random selection of about one in sixteen instances. A --starts_with value of '00' selects one in 256, etc.

The script then attempts to resolve each GUID at the DCF server and validate that the expected URL is returned. (It only validates instances (DRS blobs), not series or studies (DRS bundles).) Resolution is quite slow, about 3/sec, so --starts_with needs to be chosen so that validation completes in a reasonable time (whatever 'reasonable' means to the validator).

\-\-version defaults to settings.CURRENT_VERSION  
\-\-starts_with has no default

In [None]:
%remex dcf/validate_indexes/validate_indexes_in_version.py --version 9 --starts_with '000'
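
To help choose a --starts_with value, the following sketch estimates the sample fraction and the validation time for a given prefix. It assumes UUIDs are uniformly distributed over hex prefixes and uses the approximate rate of 3 resolutions per second mentioned above; the helper names and the count of one million new instances are hypothetical.

```python
def select_by_prefix(uuids, starts_with):
    """Select the UUIDs that begin with the given hex prefix."""
    return [u for u in uuids if u.startswith(starts_with)]


def expected_fraction(starts_with):
    """Each hex digit in the prefix reduces the sample by a factor of 16."""
    return 1 / 16 ** len(starts_with)


def estimated_minutes(n_new_instances, starts_with, resolutions_per_sec=3):
    """Rough validation time for the sampled instances, in minutes."""
    sampled = n_new_instances * expected_fraction(starts_with)
    return sampled / resolutions_per_sec / 60


print(select_by_prefix(['0abc', '1abc', '00ff'], '0'))  # ['0abc', '00ff']

n_new = 1_000_000  # Assumed number of new instances
for prefix in ('0', '00', '000'):
    print(f"--starts_with '{prefix}': about {estimated_minutes(n_new, prefix):.1f} minutes")
```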

### Populate the staging and public buckets
We need to copy new instances to the staging and public buckets. 
Google PDP pulls new data from the idc-open-pdp-staging bucket. This is a “delta” bucket, containing only the instances that are new to a version. Therefore, it must be emptied before populating with new instances in the next version:

In [None]:
%remex gcs/empty_bucket_mp/empty_pdp_bucket_mp.py

Now copy dev staging buckets to PDP staging and IDC public buckets. The public buckets are idc-open-cr and idc-open-idc1; these hold data from the cr, defaced, and redacted collection groups:

In [None]:
%remex gcs/copy_new_blobs_to_pub_buckets/copy_new_blobs_to_pub_buckets.py

We need to validate that the staging and public buckets are correctly populated.
Validate the staging bucket:

In [None]:
%remex gcs/validate_bucket/validate_idc_open_pdp_staging.py

Validate idc-open-cr:

In [None]:
%remex gcs/validate_bucket/validate_idc_open_cr.py

Validate idc-open-idc1:

In [None]:
%remex gcs/validate_bucket/validate_public_datasets_idc.py

## Post-Release

### Validate PDP release
Validate that the public-datasets-idc bucket has the correct set of instances:

Check that the bigquery-public-datasets.idc_v\<version\> dataset is correct:

In [None]:
%remex gcs/validate_bucket/validate_public_datasets_idc.py

Check that the bigquery-public-datasets.idc_current dataset is correct:

In [None]:
# TBD

### Merge the premerge buckets
Up to this point, new data has remained in per-version/per-collection/per-source buckets in idc-dev-etl. We now move it to the staging buckets idc-dev-open, idc-dev-cr, idc-dev-defaced and idc-dev-redacted as needed:

In [None]:
%remex gcs/copy_prestaging_to_staging/copy_prestaging_to_staging.py 

Validate that the contents of the staging buckets are correct:

In [None]:
%remex gcs/validate_bucket/validate_idc_dev_open.py
%remex gcs/validate_bucket/validate_idc_dev_cr.py
%remex gcs/validate_bucket/validate_idc_dev_defaced.py
%remex gcs/validate_bucket/validate_idc_dev_redacted.py

### Regenerate auxiliary_metadata
Now that the premerge buckets have been merged, we need to regenerate the idc-dev-etl auxiliary_metadata table so that gcs_urls are correct.

In [None]:
%remex bq/gen_aux_metadata_table/gen_auxiliary_metadata_table.dev.postmerge.py

### Delete premerge buckets
The per-version/per-collection/per-source premerge buckets are now no longer needed.

In [None]:
# Needs debugging
# %remex gcs/delete_prestaging_buckets/delete_prestaging_buckets.py

### Delete DICOM store
Finally, delete the DICOM store for the previous IDC version. This is done via the GCH portal, e.g. [https://console.cloud.google.com/healthcare/browser?authuser=1&project=canceridc-data](https://console.cloud.google.com/healthcare/browser?authuser=1&project=canceridc-data)