<a href="https://colab.research.google.com/github/ImagingDataCommons/CloudSegmentator/blob/v1.2.0/workflows/TotalSegmentator/Notebooks/preProccessing_of_postProcessingExtractPerframe.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## **This notebook provides a step-by-step guide on how to generate a datatable for Terra. This datatable is essential for extracting the DICOM attribute, PerFrameFunctionalGroupsSequence, using the workflow linked below.**

You can find the PerFrameFunctionalGroupsSequence Extraction workflow [here](https://dockstore.org/workflows/github.com/ImagingDataCommons/CloudSegmentator/perFrameFunctionalGroupSequenceExtractionOnTerra:main?tab=info).

The workflow requires a list of paths to the lz4-compressed DICOM SEG objects generated by our TotalSegmentator workflows, such as the one [here](https://github.com/ImagingDataCommons/CloudSegmentator/blob/v1.2.0/workflows/TotalSegmentator/Notebooks/dicomsegAndRadiomicsSR_Notebook.ipynb). When the workflow runs in the cloud, each VM is assigned 10 batches of compressed DICOM SEGs (the number is arbitrary and can be changed), amounting to up to 120 DICOM SEG files.

Once these steps are completed, the resulting datatable is ready to be uploaded to Terra's data tables, where the PerFrameFunctionalGroupsSequence Extraction workflow can reference it.
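The batching arithmetic described above can be sketched as follows. This is only an illustration: `plan_rows` is a hypothetical helper, and the limit of 12 files per batch is inferred from the figures quoted above (120 files across 10 batches), not taken from the workflow itself.

```python
import math


def plan_rows(num_batches, batches_per_row=10, max_files_per_batch=12):
    """Return (number of datatable rows, worst-case DICOM SEG files per row).

    max_files_per_batch=12 is an assumption inferred from 120 files / 10 batches.
    """
    if num_batches < 0 or batches_per_row <= 0:
        raise ValueError("num_batches must be >= 0 and batches_per_row must be > 0")
    # Each row groups up to `batches_per_row` batches; the last row may be smaller
    num_rows = math.ceil(num_batches / batches_per_row)
    max_files_per_row = batches_per_row * max_files_per_batch
    return num_rows, max_files_per_row


num_rows, max_files = plan_rows(num_batches=25)
print(f"{num_rows} rows, up to {max_files} DICOM SEG files per row")
```

With 25 input batches and the default grouping of 10, the last row would hold only 5 batches.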

### **Installing Packages**

In [2]:
%%capture
!sudo apt-get update \
  && sudo apt-get install -y --no-install-recommends \
  lz4

In [3]:
%%capture
!pip install pydicom \
   google-cloud-bigquery \
   pyarrow \
   db_dtypes

### **Importing Packages**

In [4]:
from datetime import datetime
import json
import math
import os
import shutil
import pandas as pd
import pydicom
import traceback
import logging
from tqdm import tqdm
import subprocess
import yaml

### **Example Terra datatable**

In [8]:
segFilesCsv='https://github.com/ImagingDataCommons/CloudSegmentator/releases/download/v1.0.0/sample_manifest_for_perframe.tsv'

### **Read the TSV exported from the twoVMworkflow datatable on Terra**

In [None]:
data = pd.read_table(segFilesCsv)
data
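Before building the manifests, it may help to confirm that the table contains the column the later cells rely on. The sketch below is a hypothetical check, not part of the original workflow: `check_manifest`, the example bucket paths, and the assumption that every path ends in `.lz4` are all illustrative.

```python
import pandas as pd

# Column name used by the manifest-generation cell in this notebook
REQUIRED_COLUMN = "dicomsegAndRadiomicsSR_CompressedFiles"


def check_manifest(table):
    """Summarise missing or unexpected entries in the compressed-file column."""
    if REQUIRED_COLUMN not in table.columns:
        raise KeyError(f"Expected column '{REQUIRED_COLUMN}' in the datatable")
    paths = table[REQUIRED_COLUMN]
    missing = int(paths.isna().sum())
    # Assumption: compressed archives carry an '.lz4' suffix
    not_lz4 = int((~paths.dropna().astype(str).str.endswith(".lz4")).sum())
    return {"rows": len(table), "missing_paths": missing, "not_lz4": not_lz4}


# Made-up example table; the bucket and file names are placeholders
example = pd.DataFrame({
    "entity:example_id": ["b1", "b2", "b3"],
    REQUIRED_COLUMN: [
        "gs://example-bucket/b1.tar.lz4",
        None,
        "gs://example-bucket/b3.zip",
    ],
})
report = check_manifest(example)
print(report)
```

In the notebook, the same function could be called as `check_manifest(data)` right after the table is loaded.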

### **Generate manifests for Terra datatable**

In [None]:
# Set the number of input batches grouped into each row of the Terra datatable
batches_per_row = 10

# Sort the dataframe by its first column (assumed to hold the batch ID)
df = data.sort_values(by=data.columns[0])

# Calculate the number of datatable rows needed
num_files = math.ceil(len(df) / batches_per_row)

# Split the dataframe into chunks of at most batches_per_row rows each
dfs = [df[i*batches_per_row:(i+1)*batches_per_row] for i in range(num_files)]

# Get the current date and time formatted with underscores up to minutes
now = datetime.now().strftime('%Y_%m_%d_%H_%M')

# Set the directory for the manifests
manifests_dir = 'manifests'

# Make sure the directory exists
os.makedirs(manifests_dir, exist_ok=True)

# Create a new column name for the batch_id column
batch_id_column = f'entity:perFrameExtraction_{now}_id'

# Build one row per chunk; collecting rows in a list avoids repeated pd.concat calls
rows = []
for i, df_batch in enumerate(dfs):
    # Create a list of segFiles for this batch
    segFiles_list = df_batch['dicomsegAndRadiomicsSR_CompressedFiles'].tolist()

    # Convert the list to a JSON string with double quotes
    segFiles_json = json.dumps(segFiles_list)

    # Record the 1-based row ID together with the JSON-encoded list of paths
    rows.append({
        batch_id_column: i + 1,
        'dicomsegAndRadiomicsSR_CompressedFiles': segFiles_json,
    })

# Assemble the batch dataframe in one step
batch_df = pd.DataFrame(rows, columns=[batch_id_column, 'dicomsegAndRadiomicsSR_CompressedFiles'])

batch_df

In [15]:
batch_df.to_csv(f'perframe_datatable_{now}.tsv', sep="\t", index=False)
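As an optional sanity check before uploading, the written TSV can be read back to confirm that each cell still parses as a JSON list of paths. The sketch below uses an in-memory buffer and made-up paths so it runs on its own; the ID column name and bucket paths are placeholders, and in the notebook the file written above would be read instead.

```python
import io
import json

import pandas as pd

column = "dicomsegAndRadiomicsSR_CompressedFiles"
id_column = "entity:perFrameExtraction_example_id"  # placeholder ID column name

# Small stand-in for batch_df with JSON-encoded path lists (paths are made up)
table = pd.DataFrame({
    id_column: [1, 2],
    column: [
        json.dumps(["gs://example-bucket/a.lz4", "gs://example-bucket/b.lz4"]),
        json.dumps(["gs://example-bucket/c.lz4"]),
    ],
})

# Write and re-read as tab-separated text, mirroring to_csv(..., sep="\t")
buffer = io.StringIO()
table.to_csv(buffer, sep="\t", index=False)
buffer.seek(0)
restored = pd.read_table(buffer)

# Each cell should decode back into a Python list of path strings
paths_per_row = [json.loads(cell) for cell in restored[column]]
counts = [len(paths) for paths in paths_per_row]
print(counts)
```

A row whose cell fails to decode would raise `json.JSONDecodeError`, which makes malformed entries easy to spot before the upload.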