<a href="https://colab.research.google.com/github/ImagingDataCommons/Cloud-Resources-Workflows/blob/notebooks2/Notebooks/Totalsegmentator/Preprocessing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

##**How to Generate a Datatable for Terra to Run the TotalSegmentatortwoVmWorkflowOnTerra**

This notebook provides a step-by-step guide on how to prepare a datatable for Terra that is compatible with the [TotalSegmentatortwoVmWorkflowOnTerra](https://dockstore.org/workflows/github.com/ImagingDataCommons/Cloud-Resources-Workflows/TotalSegmentatortwoVmWorkflowOnTerra:dev?tab=info) workflow. This workflow performs segmentation and feature extraction on DICOM images using two virtual machines (VMs) on Terra.

The steps are:

1. **Filter out localizer and inconsistent series**. Run an SQL query to exclude series that are localizer scans or have geometric inconsistencies from the cohort of interest.
2. **Extract the AWS URLs of the series**. The IDC buckets store DICOM images at the series level, so a reference to the series folder is enough and there is no need to list the URL of each SOP instance. The AWS URL of each series is therefore the only input the workflow requires, although you can include other attributes that help you organize or filter the data. The `s5cmdUrls` column of the resulting table contains the command that `s5cmd` can use to download the series. Note: by default the query sets the download destination to the `idc_data` folder, because the notebooks clean this folder after processing each series. To use a different destination folder, modify the SQL query.
3. **Split the cohort into chunks**. Create manifests of 12 series each, so you can leverage Terra's parallel computing capabilities and run the workflow across thousands of VMs on Terra. Note: Rawls, the underlying engine of Terra, can run up to 3000 jobs and up to 28800 tasks (a job may contain multiple tasks) at a time.
4. **Copy the manifests to the Terra workspace bucket**. Use the `gsutil` command to copy the manifests from your local machine to the bucket associated with your Terra workspace.
5. **Generate a Terra datatable**. Use the manifests and the AWS URLs of the series to create a datatable that has the inputs for the TotalSegmentatortwoVmWorkflowOnTerra. The datatable has one row per manifest (a batch of up to 12 series) and one column per input parameter.
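
The scale implied by step 3 can be estimated with a little arithmetic. The sketch below is illustrative only: the function name `plan_batches` is hypothetical, and it assumes that each manifest is submitted as one job. The batch size of 12 and the limit of 3000 concurrent jobs are the figures quoted above.

```python
import math

def plan_batches(n_series, batch_size=12, max_concurrent_jobs=3000):
    """Return (number of manifests, number of submission rounds).

    Assumes one job per manifest, which is an assumption of this sketch.
    """
    if n_series < 0 or batch_size <= 0 or max_concurrent_jobs <= 0:
        raise ValueError('n_series must be >= 0 and the other arguments > 0')
    n_batches = math.ceil(n_series / batch_size)
    n_rounds = math.ceil(n_batches / max_concurrent_jobs)
    return n_batches, n_rounds

# Cohort size matching the NLST query result used later in this notebook
print(plan_batches(126087))
```

A cohort that produces more manifests than the concurrency limit would need to be submitted in several rounds, or the extra jobs would have to wait in the queue.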



##**Authenticate gcloud**

In [2]:
project_id='my-test-project'
terra_workspace_bucket__folder_url='gs://my-test-terra-workspace-bucket/nlst-121523'
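
The folder URL set above is later joined to each manifest name with a `/`, so a trailing slash or a missing `gs://` prefix would produce paths that do not match the uploaded files. A minimal sketch of a sanity check follows; the helper name `normalize_gs_folder` is hypothetical and not part of the workflow.

```python
def normalize_gs_folder(url):
    """Validate a gs:// folder URL and strip any trailing slashes."""
    url = url.strip()
    if not url.startswith('gs://'):
        raise ValueError(f'expected a gs:// URL, got: {url!r}')
    url = url.rstrip('/')
    # The bucket name is the first path component after the scheme
    bucket = url[len('gs://'):].split('/', 1)[0]
    if not bucket:
        raise ValueError('the URL does not contain a bucket name')
    return url

print(normalize_gs_folder('gs://my-test-terra-workspace-bucket/nlst-121523/'))
```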

In [None]:
!gcloud auth login

In [4]:
!gcloud config set project $project_id

Updated property [core/project].


##**Download and run the SQL query that removes localizer and geometrically inconsistent series**

In [5]:
!wget https://raw.githubusercontent.com/ImagingDataCommons/Cloud-Resources-Workflows/sqlquery/sqlQueries/nlstCohort.sql

--2023-12-15 17:46:47--  https://raw.githubusercontent.com/ImagingDataCommons/Cloud-Resources-Workflows/sqlquery/sqlQueries/nlstCohort.sql
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.109.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 10512 (10K) [text/plain]
Saving to: ‘nlstCohort.sql’


2023-12-15 17:46:47 (13.0 MB/s) - ‘nlstCohort.sql’ saved [10512/10512]



In [6]:
!cat nlstCohort.sql

#standardSQL

-- The point of this query is to remove the CT Image Series from nlst study which do not conform to
-- geometrical checks generally required for NIFTI file converison
-- also to remove those series that require additional decompression steps before passing to dcm2niix for conversion to NIfTI format.

-- The assumptions made here are
  -- consider only those series that have CT modality and belong to the NLST collection and TransferSyntaxUID's NOT IN ( '1.2.840.10008.1.2.4.70','1.2.840.10008.1.2.4.51')
  -- do not contain LOCALIZER in ImageType
  -- all instances in a series have identical values for ImageOrientationPatient (converted to string for the purposes of comparison)
  -- all instances in a series have 1 ± 0.01 as the dot product between cross product of first and second vectors, and [1,0,0] in ImageOrientationPosition
  -- have number of instances in the series equal to the number of distinct values of ImagePositionPatient attribute
      -- (converted to string 

###Run this command twice: the first time `bq` is run, it returns an initialization message.

See https://github.com/GoogleCloudPlatform/terraform-google-secured-data-warehouse/issues/35

In [10]:
!cat nlstCohort.sql | bq query --format=csv  --project_id=$project_id --max_rows=999999999 --use_legacy_sql=false > nlst_cohort.csv
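
Because the output of `bq` is redirected straight into `nlst_cohort.csv`, it can help to confirm that the file starts with the expected header row before going further. The following is a sketch under assumptions: the helper name `cohort_csv_problem` is hypothetical, and the required column names are taken from the query result shown later in this notebook.

```python
import csv

def cohort_csv_problem(path, required=('SeriesInstanceUID', 's5cmdUrls', 'sopInstanceCount')):
    """Return None if the first non-empty row has the required columns, else a message."""
    with open(path, newline='') as f:
        # Skip blank lines and take the first row that has any content
        header = next((row for row in csv.reader(f) if row), [])
    missing = [name for name in required if name not in header]
    if missing:
        return f'first row is not the expected header; missing columns: {missing}'
    return None
```

Calling `cohort_csv_problem('nlst_cohort.csv')` returns `None` when the header looks right; any other return value suggests rerunning the query.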

##**Generate batches of 12 series and a Terra data table**

In [13]:
from datetime import datetime
import math
import numpy as np
import os
import pandas as pd
import shutil
df= pd.read_csv('nlst_cohort.csv')
df

Unnamed: 0,SeriesInstanceUID,seriesNumber,s5cmdUrls,StudyInstanceUID,PatientID,iopCount,dotProduct,pixelSpacingCount,positionCount,sopInstanceCount,...,pixelColumnCount,maxSliceIntervalDifference,minSliceIntervalDifference,sliceIntervalifferenceTolerance,exposureCount,maxExposure,minExposure,maxExposureDifference,seriesSizeInMB,viewerUrl
0,1.3.6.1.4.1.14519.5.2.1.7009.9004.100143549999...,3,cp --show-progress s3://idc-open-data/d0686cc8...,1.3.6.1.4.1.14519.5.2.1.7009.9004.157698385886...,214471,1,1.0,1,173,173,...,1,1.801,1.799,0.002,1,75.0,75.0,0.0,86.882189,https://viewer.imaging.datacommons.cancer.gov/...
1,1.3.6.1.4.1.14519.5.2.1.7009.9004.100148350742...,3,cp --show-progress s3://idc-open-data/65597299...,1.3.6.1.4.1.14519.5.2.1.7009.9004.292622187955...,216711,1,1.0,1,194,194,...,1,1.701,1.699,0.002,1,40.0,40.0,0.0,97.436441,https://viewer.imaging.datacommons.cancer.gov/...
2,1.3.6.1.4.1.14519.5.2.1.7009.9004.100241427395...,3,cp --show-progress s3://idc-open-data/0ac326f0...,1.3.6.1.4.1.14519.5.2.1.7009.9004.659103532541...,207917,1,1.0,1,179,179,...,1,1.801,1.799,0.002,1,60.0,60.0,0.0,89.894773,https://viewer.imaging.datacommons.cancer.gov/...
3,1.3.6.1.4.1.14519.5.2.1.7009.9004.100266844261...,3,cp --show-progress s3://idc-open-data/2cbe4a26...,1.3.6.1.4.1.14519.5.2.1.7009.9004.151770170572...,203675,1,1.0,1,182,182,...,1,1.701,1.699,0.002,1,80.0,80.0,0.0,91.401714,https://viewer.imaging.datacommons.cancer.gov/...
4,1.3.6.1.4.1.14519.5.2.1.7009.9004.100554983367...,3,cp --show-progress s3://idc-open-data/fb516653...,1.3.6.1.4.1.14519.5.2.1.7009.9004.705857522143...,213742,1,1.0,1,197,197,...,1,1.801,1.799,0.002,1,60.0,60.0,0.0,98.933716,https://viewer.imaging.datacommons.cancer.gov/...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
126083,1.3.6.1.4.1.14519.5.2.1.7009.9004.998833275664...,4,cp --show-progress s3://idc-open-data/062c6be2...,1.3.6.1.4.1.14519.5.2.1.7009.9004.364348944209...,207027,1,1.0,1,335,335,...,1,1.000,1.000,0.000,1,100.0,100.0,0.0,168.268482,https://viewer.imaging.datacommons.cancer.gov/...
126084,1.3.6.1.4.1.14519.5.2.1.7009.9004.998940893051...,2,cp --show-progress s3://idc-open-data/be689701...,1.3.6.1.4.1.14519.5.2.1.7009.9004.133839788276...,215319,1,1.0,1,291,291,...,1,1.000,1.000,0.000,1,20.0,20.0,0.0,146.199738,https://viewer.imaging.datacommons.cancer.gov/...
126085,1.3.6.1.4.1.14519.5.2.1.7009.9004.999757880961...,4,cp --show-progress s3://idc-open-data/102fa1a7...,1.3.6.1.4.1.14519.5.2.1.7009.9004.289218334054...,200392,1,1.0,1,168,168,...,1,2.000,2.000,0.000,1,634.0,634.0,0.0,84.406584,https://viewer.imaging.datacommons.cancer.gov/...
126086,1.3.6.1.4.1.14519.5.2.1.7009.9004.118957004151...,100,cp --show-progress s3://idc-open-data/cbf4f300...,1.3.6.1.4.1.14519.5.2.1.7009.9004.749265201934...,201729,1,1.0,1,149,149,...,1,2.000,2.000,0.000,0,,,,74.829597,https://viewer.imaging.datacommons.cancer.gov/...
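
Each value in the `s5cmdUrls` column is a complete `s5cmd` command rather than a bare URL. If only the S3 location is needed, for example to group or deduplicate series, it can be pulled out of the command string. This is a sketch only: the helper `s3_source` is hypothetical, and the example command is made up to resemble the truncated values shown above.

```python
import shlex

def s3_source(command):
    """Return the first s3:// argument of an s5cmd command string, or None."""
    for token in shlex.split(command):
        if token.startswith('s3://'):
            return token
    return None

# Hypothetical command, shaped like the (truncated) values in the table above
example = 'cp --show-progress s3://example-bucket/example-series/* idc_data'
print(s3_source(example))
```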


In [16]:
df['projected_batch_number']=np.ceil((df.index + 1) / 12)
df

Unnamed: 0,SeriesInstanceUID,seriesNumber,s5cmdUrls,StudyInstanceUID,PatientID,iopCount,dotProduct,pixelSpacingCount,positionCount,sopInstanceCount,...,maxSliceIntervalDifference,minSliceIntervalDifference,sliceIntervalifferenceTolerance,exposureCount,maxExposure,minExposure,maxExposureDifference,seriesSizeInMB,viewerUrl,projected_batch_number
0,1.3.6.1.4.1.14519.5.2.1.7009.9004.100143549999...,3,cp --show-progress s3://idc-open-data/d0686cc8...,1.3.6.1.4.1.14519.5.2.1.7009.9004.157698385886...,214471,1,1.0,1,173,173,...,1.801,1.799,0.002,1,75.0,75.0,0.0,86.882189,https://viewer.imaging.datacommons.cancer.gov/...,1.0
1,1.3.6.1.4.1.14519.5.2.1.7009.9004.100148350742...,3,cp --show-progress s3://idc-open-data/65597299...,1.3.6.1.4.1.14519.5.2.1.7009.9004.292622187955...,216711,1,1.0,1,194,194,...,1.701,1.699,0.002,1,40.0,40.0,0.0,97.436441,https://viewer.imaging.datacommons.cancer.gov/...,1.0
2,1.3.6.1.4.1.14519.5.2.1.7009.9004.100241427395...,3,cp --show-progress s3://idc-open-data/0ac326f0...,1.3.6.1.4.1.14519.5.2.1.7009.9004.659103532541...,207917,1,1.0,1,179,179,...,1.801,1.799,0.002,1,60.0,60.0,0.0,89.894773,https://viewer.imaging.datacommons.cancer.gov/...,1.0
3,1.3.6.1.4.1.14519.5.2.1.7009.9004.100266844261...,3,cp --show-progress s3://idc-open-data/2cbe4a26...,1.3.6.1.4.1.14519.5.2.1.7009.9004.151770170572...,203675,1,1.0,1,182,182,...,1.701,1.699,0.002,1,80.0,80.0,0.0,91.401714,https://viewer.imaging.datacommons.cancer.gov/...,1.0
4,1.3.6.1.4.1.14519.5.2.1.7009.9004.100554983367...,3,cp --show-progress s3://idc-open-data/fb516653...,1.3.6.1.4.1.14519.5.2.1.7009.9004.705857522143...,213742,1,1.0,1,197,197,...,1.801,1.799,0.002,1,60.0,60.0,0.0,98.933716,https://viewer.imaging.datacommons.cancer.gov/...,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
126083,1.3.6.1.4.1.14519.5.2.1.7009.9004.998833275664...,4,cp --show-progress s3://idc-open-data/062c6be2...,1.3.6.1.4.1.14519.5.2.1.7009.9004.364348944209...,207027,1,1.0,1,335,335,...,1.000,1.000,0.000,1,100.0,100.0,0.0,168.268482,https://viewer.imaging.datacommons.cancer.gov/...,10507.0
126084,1.3.6.1.4.1.14519.5.2.1.7009.9004.998940893051...,2,cp --show-progress s3://idc-open-data/be689701...,1.3.6.1.4.1.14519.5.2.1.7009.9004.133839788276...,215319,1,1.0,1,291,291,...,1.000,1.000,0.000,1,20.0,20.0,0.0,146.199738,https://viewer.imaging.datacommons.cancer.gov/...,10508.0
126085,1.3.6.1.4.1.14519.5.2.1.7009.9004.999757880961...,4,cp --show-progress s3://idc-open-data/102fa1a7...,1.3.6.1.4.1.14519.5.2.1.7009.9004.289218334054...,200392,1,1.0,1,168,168,...,2.000,2.000,0.000,1,634.0,634.0,0.0,84.406584,https://viewer.imaging.datacommons.cancer.gov/...,10508.0
126086,1.3.6.1.4.1.14519.5.2.1.7009.9004.118957004151...,100,cp --show-progress s3://idc-open-data/cbf4f300...,1.3.6.1.4.1.14519.5.2.1.7009.9004.749265201934...,201729,1,1.0,1,149,149,...,2.000,2.000,0.000,0,,,,74.829597,https://viewer.imaging.datacommons.cancer.gov/...,10508.0
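
The batch numbers above are floats (`1.0`, `10508.0`) because `np.ceil` returns floating-point values. Integer floor division gives the same grouping with integer labels. The sketch below compares the two on a small made-up frame; `demo` and its contents are illustrative and not part of the cohort.

```python
import numpy as np
import pandas as pd

batch_size = 12

# Small stand-in for the cohort: 30 placeholder series
demo = pd.DataFrame({'SeriesInstanceUID': [f'series-{n}' for n in range(30)]})

# Same expression as in the cell above (float result)
demo['ceil_batch'] = np.ceil((demo.index + 1) / batch_size)

# Integer alternative: rows 0-11 -> 1, rows 12-23 -> 2, and so on
demo['int_batch'] = demo.index // batch_size + 1

# Both approaches assign every row to the same batch
assert (demo['ceil_batch'] == demo['int_batch']).all()
print(demo['int_batch'].value_counts().sort_index())
```

Both forms rely on the default 0-based index that `pd.read_csv` produces; a filtered or re-sorted frame would need `reset_index(drop=True)` first.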


In [17]:
# Start from an empty `urls` folder (errors such as "folder not found" are ignored)
shutil.rmtree('urls', ignore_errors=True)
os.makedirs('urls')


# Set the number of rows per file
rows_per_file = 12

# Calculate the number of files needed
num_files = math.ceil(len(df) / rows_per_file)

# Split the dataframe into multiple dataframes
dfs = [df[i*rows_per_file:(i+1)*rows_per_file] for i in range(num_files)]

# Get the current date and time formatted with underscores up to minutes
now = datetime.now().strftime('%Y_%m_%d_%H_%M')

# Create a new column name for the batch_id column
batch_id_column = f'entity:twoVM_{now}_id'

# Collect one datatable row per manifest
batch_rows = []

# Write each chunk to its own manifest and record the workflow inputs for it
# (the loop variable is named `chunk` so the full cohort in `df` is not overwritten)
for i, chunk in enumerate(dfs):
    max_sopinstancecount = chunk['sopInstanceCount'].max()
    url_suffix = f'batch_{i+1}.csv'
    filename = f'urls/{url_suffix}'
    chunk.to_csv(filename, index=False)
    s5cmd_url = f'{terra_workspace_bucket__folder_url}/{url_suffix}'

    # Batches containing a series with 300 or more instances get more CPU and RAM
    if max_sopinstancecount >= 300:
        cpu = 8
        ram = 32
    else:
        cpu = 4
        ram = 16

    batch_rows.append({
        batch_id_column: i+1,
        'dicomToNiftiConverterTool': 'dcm2niix',
        's5cmd_url': s5cmd_url,
        'dicomSegAndSRcpu': cpu,
        'dicomSegAndSRram': ram
    })

# Build the batch dataframe once, instead of concatenating inside the loop
batch_df = pd.DataFrame(batch_rows, columns=[batch_id_column, 'dicomToNiftiConverterTool', 's5cmd_url', 'dicomSegAndSRcpu', 'dicomSegAndSRram'])
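
The sizing rule used in the loop above can be isolated into a small function, which makes the threshold easy to test and adjust. This is a sketch: the function name `pick_resources` and the `threshold` parameter are hypothetical, while the values (300 instances, 8 CPU / 32 RAM versus 4 CPU / 16 RAM) are the ones used in the cell above.

```python
def pick_resources(max_sop_instance_count, threshold=300):
    """Return the CPU and RAM inputs for a batch, based on its largest series."""
    if max_sop_instance_count >= threshold:
        return {'dicomSegAndSRcpu': 8, 'dicomSegAndSRram': 32}
    return {'dicomSegAndSRcpu': 4, 'dicomSegAndSRram': 16}

print(pick_resources(173))
print(pick_resources(335))
```

Note that the rule is applied per batch: a single large series raises the resources for all 12 series in its manifest.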


In [18]:
batch_df.to_csv(f'terra_data_table_manifest_{now}.tsv', sep='\t', index=False)
batch_df

Unnamed: 0,entity:twoVM_2023_12_15_17_56_id,dicomToNiftiConverterTool,s5cmd_url,dicomSegAndSRcpu,dicomSegAndSRram
0,1,dcm2niix,gs://my-test-terra-workspace-bucket/nlst-12152...,4,16
1,2,dcm2niix,gs://my-test-terra-workspace-bucket/nlst-12152...,4,16
2,3,dcm2niix,gs://my-test-terra-workspace-bucket/nlst-12152...,4,16
3,4,dcm2niix,gs://my-test-terra-workspace-bucket/nlst-12152...,4,16
4,5,dcm2niix,gs://my-test-terra-workspace-bucket/nlst-12152...,4,16
...,...,...,...,...,...
10503,10504,dcm2niix,gs://my-test-terra-workspace-bucket/nlst-12152...,8,32
10504,10505,dcm2niix,gs://my-test-terra-workspace-bucket/nlst-12152...,8,32
10505,10506,dcm2niix,gs://my-test-terra-workspace-bucket/nlst-12152...,8,32
10506,10507,dcm2niix,gs://my-test-terra-workspace-bucket/nlst-12152...,8,32
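
Terra load files are expected to name the first column `entity:<table name>_id`, which is why the batch ID column above is built that way. A quick header check before uploading the TSV might look like the sketch below. The helper `terra_header_problems` is hypothetical, and the set of characters allowed in the table name is an assumption of this sketch, not taken from the Terra documentation.

```python
import re

# Assumption: table names consist of letters, digits, underscores and dashes
ENTITY_ID_PATTERN = re.compile(r'^entity:[A-Za-z0-9_-]+_id$')

def terra_header_problems(columns):
    """Return a list of problems found in a datatable header (empty if none)."""
    columns = list(columns)
    if not columns:
        return ['no columns']
    problems = []
    if not ENTITY_ID_PATTERN.match(columns[0]):
        problems.append(f'first column {columns[0]!r} is not of the form entity:<name>_id')
    if len(set(columns)) != len(columns):
        problems.append('duplicate column names')
    return problems
```

With the dataframe from the cell above, `terra_header_problems(batch_df.columns)` should return an empty list.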


##**Copy files to Terra workspace bucket**
You do not need to create the destination folder first: `gsutil` creates it automatically if it does not exist.


In [None]:
!gsutil -m cp -r urls/* $terra_workspace_bucket__folder_url
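
Since the datatable refers to each manifest by name, it is worth confirming that every referenced file exists in the local `urls` folder, either before or after the copy. The following is a minimal sketch; the helper name `missing_manifests` is hypothetical, and it only compares file names, not file contents or what is actually in the bucket.

```python
import os

def missing_manifests(s5cmd_urls, local_dir='urls'):
    """Return the URLs whose file name has no matching file in local_dir."""
    local_files = set(os.listdir(local_dir))
    return [url for url in s5cmd_urls if url.rsplit('/', 1)[-1] not in local_files]
```

With the objects from this notebook, `missing_manifests(batch_df['s5cmd_url'])` should return an empty list.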