<a href="https://colab.research.google.com/github/ImagingDataCommons/Cloud-Resources-Workflows/blob/notebooks2/Notebooks/Totalsegmentator/Preprocessing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

##**How to Generate a Datatable for Terra to Run the TotalSegmentatortwoVmWorkflowOnTerra**

This notebook provides a step-by-step guide on how to prepare a datatable for Terra that is compatible with the [TotalSegmentatortwoVmWorkflowOnTerra](https://dockstore.org/workflows/github.com/ImagingDataCommons/Cloud-Resources-Workflows/TotalSegmentatortwoVmWorkflowOnTerra:dev?tab=info) workflow. This workflow performs segmentation and feature extraction on DICOM images using two virtual machines (VMs) on Terra.

The steps are:

1. **Filter out localizer and inconsistent series**. Run an SQL query to exclude series that are localizer scans or have geometric inconsistencies from the cohort of interest.
2. **Extract the AWS URLs of the series**. The IDC buckets store DICOM images at the series level, so a reference to the series folder is enough and there is no need to get the URL of each SOPInstance. The AWS URL of each series is therefore the only input the workflow requires, although you can also include other attributes that help you organize or filter the data. The `s5cmdurl` column of the resulting table contains the command that `s5cmd` can use to download the series. Note: By default, the query downloads each series to the `idc_data` folder, because the notebooks clean this folder after processing each series. To use a different destination folder, modify the SQL query.
3. **Split the cohort into chunks**. Create manifests of 12 series each, so you can leverage Terra's parallel computing capabilities and run the workflow across thousands of VMs on Terra. Note: Rawls, the underlying engine of Terra, can run up to 3000 jobs and up to 28800 tasks (a job may contain multiple tasks) at a time.
4. **Copy the manifests to the Terra workspace bucket**. Use the `gsutil` command to copy the manifests from your local machine to the bucket associated with your Terra workspace.
5. **Generate a Terra datatable**. Use the manifests and their locations in the workspace bucket to create a datatable that holds the inputs for the TotalSegmentatortwoVmWorkflowOnTerra. The datatable has one row per manifest (a batch of up to 12 series) and one column per input parameter.
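As a rough sanity check on step 3, the number of manifests for a cohort can be estimated up front and compared with the 3000-job limit quoted above. This is a minimal sketch rather than part of the workflow: `estimate_batches` and the sample cohort size are hypothetical, and it assumes each manifest is submitted as a single job.

```python
import math

SERIES_PER_MANIFEST = 12    # chunk size used in this notebook
MAX_CONCURRENT_JOBS = 3000  # job limit quoted in step 3


def estimate_batches(num_series, series_per_manifest=SERIES_PER_MANIFEST):
    """Return how many manifests are needed to cover num_series series."""
    if num_series < 0:
        raise ValueError("num_series must be non-negative")
    return math.ceil(num_series / series_per_manifest)


# Hypothetical cohort size, used only for illustration
example_series = 1000
example_batches = estimate_batches(example_series)
fits_in_one_submission = example_batches <= MAX_CONCURRENT_JOBS
print(example_batches, fits_in_one_submission)
```

If the estimate exceeds the limit, the cohort would presumably need to be submitted in several rounds.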



##**Authenticate gcloud**

In [None]:
project_id='my-test-project'
terra_workspace_bucket__folder_url='gs://my-test-terra-workspace-bucket/nlst-121523'

In [None]:
!gcloud auth login

In [None]:
!gcloud config set project $project_id

##**Download and run the SQL query that removes localizer and geometrically inconsistent series**

In [None]:
!wget https://raw.githubusercontent.com/ImagingDataCommons/Cloud-Resources-Workflows/sqlqueryfix/sqlQueries/nlstCohort.sql

In [None]:
!cat nlstCohort.sql

###Run this command twice: the first time bq is run, it returns an initialization message.

https://github.com/GoogleCloudPlatform/terraform-google-secured-data-warehouse/issues/35

In [None]:
!cat nlstCohort.sql | bq query --format=csv  --project_id=$project_id --max_rows=999999999 --use_legacy_sql=false > nlst_cohort.csv
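Because the first `bq` run can write an initialization message instead of query results, it may be worth checking the header of the output before going further. The sketch below is one possible check, not part of the original notebook: `check_cohort_header` is a hypothetical helper, and the sample data and the `SeriesInstanceUID` column are illustrative. Only `sopInstanceCount` is known to be used later in this notebook.

```python
import csv
import io


def check_cohort_header(lines, required_columns=("sopInstanceCount",)):
    """Return the CSV header row, or raise ValueError if a required column is missing."""
    header = next(csv.reader(lines), [])
    missing = [col for col in required_columns if col not in header]
    if missing:
        raise ValueError(f"missing columns {missing}; first line was {header!r}")
    return header


# Illustrative in-memory sample; in the notebook the input would be the
# file written by bq, e.g. open('nlst_cohort.csv', newline='')
sample = io.StringIO("SeriesInstanceUID,sopInstanceCount\n1.2.3,120\n")
header = check_cohort_header(sample)
print(header)
```

If the check raises, rerunning the `bq` command as described above should produce a clean CSV.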

##**Generate batches of 12 series and a Terra data table**

In [None]:
from datetime import datetime
import math
import numpy as np
import os
import pandas as pd
import shutil
df= pd.read_csv('nlst_cohort.csv')
df

In [None]:
df['projected_batch_number']=np.ceil((df.index + 1) / 12)
df

In [None]:
# Start from an empty 'urls' folder; ignore the error if it does not exist yet
shutil.rmtree('urls', ignore_errors=True)
os.makedirs('urls')


# Set the number of rows per file
rows_per_file = 12

# Calculate the number of files needed
num_files = math.ceil(len(df) / rows_per_file)

# Split the dataframe into multiple dataframes
dfs = [df[i*rows_per_file:(i+1)*rows_per_file] for i in range(num_files)]

# Get the current date and time formatted with underscores up to minutes
now = datetime.now().strftime('%Y_%m_%d_%H_%M')

# Create a new column name for the batch_id column
batch_id_column = f'entity:twoVM_{now}_id'

# Collect one row of batch information per manifest file
batch_rows = []

# Write each chunk to its own manifest and record the workflow inputs for it.
# The loop variable is named `chunk` so the full cohort in `df` is not overwritten.
for i, chunk in enumerate(dfs):
    max_sopinstancecount = chunk['sopInstanceCount'].max()
    url_suffix = f'batch_{i+1}.csv'
    filename = f'urls/{url_suffix}'
    chunk.to_csv(filename, index=False)
    s5cmd_url = f'{terra_workspace_bucket__folder_url}/{url_suffix}'

    # Request more CPU and RAM when a batch contains a series with 300 or more instances
    if max_sopinstancecount >= 300:
        cpu = 8
        ram = 32
    else:
        cpu = 4
        ram = 16

    batch_rows.append({
        batch_id_column: i + 1,
        'dicomToNiftiConverterTool': 'dcm2niix',
        's5cmd_url': s5cmd_url,
        'dicomSegAndSRcpu': cpu,
        'dicomSegAndSRram': ram
    })

# Build the batch dataframe in one step instead of concatenating inside the loop
batch_df = pd.DataFrame(batch_rows, columns=[batch_id_column, 'dicomToNiftiConverterTool', 's5cmd_url', 'dicomSegAndSRcpu', 'dicomSegAndSRram'])


In [None]:
batch_df.to_csv(f'terra_data_table_manifest_{now}.tsv', sep='\t', index=False)
batch_df
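Terra expects the first column of an uploaded table to be named `entity:<table name>_id`, which is why the batch ID column is built that way above. The following is a minimal sketch of a check that could be run before uploading the TSV. `terra_entity_type` is a hypothetical helper, the timestamp in the example is made up, and the set of characters allowed in a table name is an assumption.

```python
import re


def terra_entity_type(first_column):
    """Extract the table name from an 'entity:<name>_id' column header."""
    # Assumption: table names consist of letters, digits, underscores and dashes
    match = re.fullmatch(r"entity:([A-Za-z0-9_\-]+)_id", first_column)
    if match is None:
        raise ValueError(f"not a valid Terra ID column: {first_column!r}")
    return match.group(1)


# Made-up header following the naming pattern used in this notebook
example_header = "entity:twoVM_2023_12_15_10_30_id"
table_name = terra_entity_type(example_header)
print(table_name)
```

In the notebook, the same check could be applied to `batch_df.columns[0]`.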

##**Copy files to Terra workspace bucket**
The destination folder does not need to be created first; `gsutil` creates it automatically if it does not exist.


In [None]:
!gsutil -m cp -r urls/* $terra_workspace_bucket__folder_url