<a href="https://colab.research.google.com/github/ImagingDataCommons/CloudSegmentator/blob/main/workflows/TotalSegmentator/Notebooks/Preprocessing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

##**How to Generate a Datatable for Terra to Run the TotalSegmentatortwoVmWorkflowOnTerra**

This notebook provides a step-by-step guide on how to prepare a datatable for Terra that is compatible with the [TotalSegmentatortwoVmWorkflowOnTerra](https://dockstore.org/workflows/github.com/ImagingDataCommons/Cloud-Resources-Workflows/TotalSegmentatortwoVmWorkflowOnTerra:dev?tab=info) workflow. This workflow performs segmentation and feature extraction on DICOM images using two virtual machines (VMs) on Terra.

The steps are:

1. **Filter out localizer and inconsistent series**. Run an SQL query that excludes localizer scans and series with geometric inconsistencies from the cohort of interest.
2. **Split the cohort into chunks**. Create batches of 12 series (one batch per VM) so that the workflow can run in parallel across thousands of VMs on Terra. Note: Rawls, the underlying engine of Terra, can run up to 3000 jobs and up to 28800 tasks (a job may contain multiple tasks) at a time.
3. **Generate a Terra datatable**. Each row of the datatable holds a batch ID and that batch's list of SeriesInstanceUIDs, serialized in a YAML-compatible form that can be passed to papermill.
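
The batching arithmetic behind step 2 can be sketched as follows. This is a minimal illustration, not part of the workflow: the helper names `chunk` and `batch_plan` are hypothetical, the IDs are placeholders, and the sketch assumes that each batch (datatable row) is submitted as one job, so the 3000-job limit quoted above determines how many submission rounds a large cohort would need.

```python
import math

SERIES_PER_BATCH = 12        # batch size used in this notebook
MAX_CONCURRENT_JOBS = 3000   # Rawls job limit quoted above

def chunk(items, size):
    """Split a list into consecutive chunks of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]

def batch_plan(n_series, size=SERIES_PER_BATCH, max_jobs=MAX_CONCURRENT_JOBS):
    """Return (number of batches, number of submission rounds).

    Assumes one batch corresponds to one job.
    """
    n_batches = math.ceil(n_series / size)
    n_rounds = math.ceil(n_batches / max_jobs)
    return n_batches, n_rounds

uids = [f"series-{i}" for i in range(30)]  # placeholder IDs, not real UIDs
batches = chunk(uids, SERIES_PER_BATCH)
print([len(b) for b in batches])  # [12, 12, 6]
print(batch_plan(100_000))        # (8334, 3)
```

The last batch is simply shorter when the cohort size is not a multiple of 12, which matches the slicing used later in this notebook.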



##**Authenticate gcloud**

In [None]:
project_id='my-test-project'

In [None]:
!gcloud auth login

In [None]:
!gcloud config set project $project_id

##**Download and run the SQL query that removes localizer and geometrically inconsistent series**

In [None]:
!wget https://raw.githubusercontent.com/ImagingDataCommons/CloudSegmentator/main/workflows/TotalSegmentator/sqlQueries/nlstCohort.sql

In [None]:
!cat nlstCohort.sql

###Run this command twice: the first time bq is run, it returns an initialization message.

https://github.com/GoogleCloudPlatform/terraform-google-secured-data-warehouse/issues/35

In [None]:
!cat nlstCohort.sql | bq query --format=csv  --project_id=$project_id --max_rows=999999999 --use_legacy_sql=false > nlst_cohort.csv
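
Because the first `bq` run can leave an initialization message where the query result should be, it may be worth checking the CSV before moving on. The sketch below is one way to do that with the standard library; `missing_columns` is a hypothetical helper, and the two required column names are taken from the batching code later in this notebook.

```python
import csv

# Columns that the batching code in this notebook reads from the cohort CSV
REQUIRED_COLUMNS = {"SeriesInstanceUID", "sopInstanceCount"}

def missing_columns(csv_path, required=REQUIRED_COLUMNS):
    """Return the required columns that are absent from the CSV header line."""
    with open(csv_path, newline="") as f:
        header = next(csv.reader(f), [])  # empty file -> empty header
    return sorted(set(required) - set(header))

# Example use after running the query:
# missing = missing_columns("nlst_cohort.csv")
# if missing:
#     print("Re-run the bq command; missing columns:", missing)
```

An empty result means the first line of the file looks like the expected header; anything else suggests the query should be run again.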

##**Generate batches of 12 series and a Terra datatable**

In [None]:
import pandas as pd

# Load the cohort produced by the SQL query above
df = pd.read_csv('nlst_cohort.csv')
df

In [None]:
import pandas as pd
import math
from datetime import datetime
import os
import yaml
import json

# Set the number of series per batch
series_per_batch = 12

# Calculate the number of batches needed
num_batches = math.ceil(len(df) / series_per_batch)

# Split the dataframe into one dataframe per batch
dfs = [df[i*series_per_batch:(i+1)*series_per_batch] for i in range(num_batches)]

# Get the current date and time formatted with underscores up to minutes
now = datetime.now().strftime('%Y_%m_%d_%H_%M')

# Set the directory for the manifests
manifests_dir = 'manifests'

# Make sure the directory exists
os.makedirs(manifests_dir, exist_ok=True)

# Create a new column name for the batch_id column
batch_id_column = f'entity:twoVM_{now}_id'

# Column order for the Terra datatable
columns = [batch_id_column, 'SeriesInstanceUIDs', 'idc-version', 'dicomSegAndSRcpu', 'dicomSegAndSRram']

# Collect one row per batch; the dataframe is built once after the loop
rows = []

# Process each batch: write its YAML manifest and record its datatable row
for i, df_batch in enumerate(dfs):

    # Create a list of SeriesInstanceUIDs for this batch
    series_list = df_batch['SeriesInstanceUID'].tolist()

    # Create a dictionary with the key 'SeriesInstanceUIDs'
    data_dict = {'SeriesInstanceUIDs': series_list}

    # Create the filename for the YAML file
    yaml_filename = os.path.join(manifests_dir, f'batch_{i+1}.yaml')

    # Write the dictionary to a YAML file
    with open(yaml_filename, 'w') as yaml_file:
        yaml.dump(data_dict, yaml_file)

    # Largest number of SOP instances in any series of this batch
    max_sopinstancecount = df_batch['sopInstanceCount'].max()

    # Serialize the series list as a JSON string
    json_series_dict = json.dumps(data_dict)

    # Request more CPU and RAM when any series has 300 or more instances
    if max_sopinstancecount >= 300:
        cpu = 8
        ram = 32
    else:
        cpu = 4
        ram = 16

    # Record the batch information and the series list
    rows.append({
        batch_id_column: i + 1,
        'SeriesInstanceUIDs': json_series_dict,
        'idc-version': 'v17',
        'dicomSegAndSRcpu': cpu,
        'dicomSegAndSRram': ram,
    })

# Build the batch dataframe from the collected rows
batch_df = pd.DataFrame(rows, columns=columns)

# Display the final batch_df
batch_df

In [None]:
batch_df.to_csv(f'terra_data_table_manifest_{now}.tsv', sep='\t', index=False)
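
Before uploading the TSV to Terra, the file can be read back to confirm that every row decodes into a list of SeriesInstanceUIDs. The following is a sketch using only the standard library; `read_batches` is a hypothetical helper, and it assumes the first column is the entity ID column written by the cell above.

```python
import csv
import json

def read_batches(tsv_path):
    """Map each batch ID (as a string) to its list of SeriesInstanceUIDs."""
    batches = {}
    with open(tsv_path, newline="") as f:
        reader = csv.DictReader(f, delimiter="\t")
        # Assumption: the first column is the entity ID column,
        # e.g. 'entity:twoVM_<timestamp>_id'
        id_column = reader.fieldnames[0]
        for row in reader:
            payload = json.loads(row["SeriesInstanceUIDs"])
            batches[row[id_column]] = payload["SeriesInstanceUIDs"]
    return batches

# Example use (substitute the file name written above):
# batches = read_batches("terra_data_table_manifest_<timestamp>.tsv")
# print(len(batches), "batches;", max(len(v) for v in batches.values()), "series at most")
```

The `csv` module undoes the quoting that `to_csv` applies to the JSON strings, so each cell can be passed to `json.loads` directly.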