<a href="https://colab.research.google.com/github/vkt1414/CloudSegmentator/blob/feat-convert-raw-radiomics-to-dataframe/workflows/TotalSegmentator/Notebooks/preProccessing_of_RadiomicsJsonToDataFrame.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## **This notebook is a step-by-step guide to generating a data table for Terra. The data table is the input to the workflow linked below, which converts radiomics features in JSON format to a pandas DataFrame.**

You can find the radiomics JSON-to-DataFrame conversion workflow [here](https://dockstore.org/workflows/github.com/ImagingDataCommons/CloudSegmentator/convert_radiomics_features_json_to_dataframe:main?tab=info).

The workflow requires a list of paths to the lz4-compressed JSON files of raw radiomics features generated by our TotalSegmentator workflows, as in [this notebook](https://github.com/ImagingDataCommons/CloudSegmentator/blob/main/workflows/TotalSegmentator/Notebooks/dicomsegAndRadiomicsSR_Notebook.ipynb). When the workflow runs in the cloud, each VM is assigned 100 batches of compressed JSON files (the number is arbitrary and can be changed), which amounts to up to 1200 JSON files per VM.
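
The batching arithmetic described above can be sketched as follows. This is a minimal illustration, not part of the workflow itself: `rows_needed` and `max_files_per_vm` are hypothetical helpers, the figure of 12 JSON files per batch is an assumption inferred from 1200 files spread over 100 batches, and the batch count of 10450 is a made-up example.

```python
import math

# Assumption: at most 12 JSON files per batch (1200 files / 100 batches).
FILES_PER_BATCH = 12


def rows_needed(num_batches: int, batches_per_row: int = 100) -> int:
    """Number of data table rows (one row per VM) needed to cover all batches."""
    if batches_per_row <= 0:
        raise ValueError("batches_per_row must be positive")
    return math.ceil(num_batches / batches_per_row)


def max_files_per_vm(batches_per_row: int = 100,
                     files_per_batch: int = FILES_PER_BATCH) -> int:
    """Upper bound on the number of JSON files a single VM processes."""
    return batches_per_row * files_per_batch


# Hypothetical manifest with 10450 batches
print(rows_needed(10450))   # rows (VMs) required
print(max_files_per_vm())   # upper bound on files per VM
```

Lowering `batches_per_row` spreads the work over more VMs, at the cost of more rows in the data table.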

Once these steps are completed, the data table is ready to be uploaded to Terra.

### **Installing Packages**

In [None]:
%%capture
!sudo apt-get update \
  && sudo apt-get install -y --no-install-recommends \
  lz4

### **Importing Packages**

In [None]:
from datetime import datetime
import json
import math
import os
import pandas as pd
import subprocess


### **Example Terra datatable**

In [None]:
segFilesCsv='https://github.com/ImagingDataCommons/CloudSegmentator/releases/download/v1.0.0/sample_manifest_for_perframe.tsv'


### **Read the TSV exported from the twoVMworkflow data table on Terra**

In [None]:
data = pd.read_table(segFilesCsv)
data
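
Before grouping, it can help to confirm that the manifest actually contains the column the later cells rely on. The sketch below is one possible check, not part of the original notebook: `check_manifest` is a hypothetical helper, and the toy rows use placeholder `gs://example-bucket/...` paths rather than real outputs.

```python
import pandas as pd

REQUIRED_COLUMN = "pyradiomicsRadiomicsFeatures"


def check_manifest(df: pd.DataFrame, column: str = REQUIRED_COLUMN) -> pd.DataFrame:
    """Return df unchanged if `column` exists and has no missing values."""
    if column not in df.columns:
        raise KeyError(f"manifest is missing the '{column}' column")
    missing = int(df[column].isna().sum())
    if missing:
        raise ValueError(f"{missing} row(s) have no path in '{column}'")
    return df


# Toy manifest with placeholder paths (assumption: one path per row)
toy = pd.DataFrame({
    "batch_id": [1, 2],
    REQUIRED_COLUMN: [
        "gs://example-bucket/batch_1.json.lz4",
        "gs://example-bucket/batch_2.json.lz4",
    ],
})
checked = check_manifest(toy)
print(len(checked))
```

In the notebook, the same check could be applied as `check_manifest(data)` right after reading the TSV.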

### **Generate manifests for Terra datatable**

In [8]:
# Set the number of batches grouped into each data table row (one row per VM)
batches_per_row = 100

# Sort the dataframe by the first non-index column (assuming it's 'batch_id')
df = data.sort_values(by=data.columns[0])

# Calculate the number of data table rows needed
num_files = math.ceil(len(df) / batches_per_row)

# Split the dataframe into multiple dataframes
dfs = [df[i*batches_per_row:(i+1)*batches_per_row] for i in range(num_files)]

# Get the current date and time formatted with underscores up to minutes
now = datetime.now().strftime('%Y_%m_%d_%H_%M')

# Set the directory for the manifests
manifests_dir = 'manifests'

# Make sure the directory exists
os.makedirs(manifests_dir, exist_ok=True)

# Name the ID column; Terra expects the first column of a data table to be named entity:<table_name>_id
batch_id_column = f'entity:rawRadiomics_{now}_id'

# Create a new dataframe to store the batch information
batch_df = pd.DataFrame(columns=[batch_id_column, 'pyradiomicsRadiomicsFeatures'])

# Analyze each file and add a row to the batch dataframe
for i, df_batch in enumerate(dfs):
    # Create a list of raw radiomics file paths for this row
    rawRadiomics_list = df_batch['pyradiomicsRadiomicsFeatures'].tolist()

    # Convert the list to a JSON string with double quotes
    rawRadiomics_json = json.dumps(rawRadiomics_list)

    # Create a new row with the row ID and the JSON-encoded list of raw radiomics files
    new_row = pd.DataFrame({
        batch_id_column: [i+1],
        'pyradiomicsRadiomicsFeatures': [rawRadiomics_json],
    })
    # Add the new row to the batch dataframe
    batch_df = pd.concat([batch_df, new_row], ignore_index=True)

batch_df

| | entity:rawRadiomics_2024_05_15_20_18_id | pyradiomicsRadiomicsFeatures |
|---|---|---|
| 0 | 1 | ["gs://fc-5af492dc-6993-4c91-bbf6-3e2747868642... |
| 1 | 2 | ["gs://fc-5af492dc-6993-4c91-bbf6-3e2747868642... |
| 2 | 3 | ["gs://fc-5af492dc-6993-4c91-bbf6-3e2747868642... |
| 3 | 4 | ["gs://fc-5af492dc-6993-4c91-bbf6-3e2747868642... |
| 4 | 5 | ["gs://fc-5af492dc-6993-4c91-bbf6-3e2747868642... |
| ... | ... | ... |
| 101 | 102 | ["gs://fc-5af492dc-6993-4c91-bbf6-3e2747868642... |
| 102 | 103 | ["gs://fc-5af492dc-6993-4c91-bbf6-3e2747868642... |
| 103 | 104 | ["gs://fc-5af492dc-6993-4c91-bbf6-3e2747868642... |
| 104 | 105 | ["gs://fc-5af492dc-6993-4c91-bbf6-3e2747868642... |
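
The grouping step above can also be written as a single function that collects the rows in a list and builds the DataFrame once, which avoids calling `pd.concat` on every iteration. This is a sketch under stated assumptions rather than the notebook's own method: `build_batch_table` is a hypothetical helper, and the paths and table name are placeholders.

```python
import json
import math

import pandas as pd

PATH_COLUMN = "pyradiomicsRadiomicsFeatures"


def build_batch_table(paths, batches_per_row, id_column):
    """Group `paths` into chunks of `batches_per_row` and JSON-encode each chunk."""
    num_rows = math.ceil(len(paths) / batches_per_row)
    rows = []
    for i in range(num_rows):
        chunk = paths[i * batches_per_row:(i + 1) * batches_per_row]
        # Row IDs start at 1, matching the table shown above
        rows.append({id_column: i + 1, PATH_COLUMN: json.dumps(chunk)})
    return pd.DataFrame(rows, columns=[id_column, PATH_COLUMN])


# Placeholder paths and table name, for illustration only
toy_paths = [f"gs://example-bucket/batch_{n}.json.lz4" for n in range(5)]
table = build_batch_table(toy_paths, 2, "entity:rawRadiomics_example_id")
print(table.shape)
```

With five paths and two per row, the last row holds the single leftover path, which mirrors how the final row of the real table may hold fewer than 100 batches.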


In [9]:
batch_df.to_csv(f'rawRadiomics_{now}.tsv',sep="\t", index=False)
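
Before uploading, the written file can be read back to confirm that each cell still decodes to a list of paths. The sketch below shows such a round trip on a toy table written to a temporary directory; the column name suffix, file name, and `gs://example-bucket/...` paths are placeholders, not values from the real run.

```python
import json
import os
import tempfile

import pandas as pd

# Toy table in the same shape as the one written above (placeholder values)
toy = pd.DataFrame({
    "entity:rawRadiomics_example_id": [1, 2],
    "pyradiomicsRadiomicsFeatures": [
        json.dumps(["gs://example-bucket/a.json.lz4", "gs://example-bucket/b.json.lz4"]),
        json.dumps(["gs://example-bucket/c.json.lz4"]),
    ],
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "rawRadiomics_example.tsv")
    toy.to_csv(path, sep="\t", index=False)
    # Read the file back the same way the manifest was read earlier
    restored = pd.read_table(path)

# Each cell should decode back into a Python list of paths
decoded = restored["pyradiomicsRadiomicsFeatures"].apply(json.loads)
print(decoded.apply(len).tolist())
```

A failure in `json.loads` at this point would indicate that the list column was not serialized as valid JSON.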