<a href="https://colab.research.google.com/github/ImagingDataCommons/IDC-Tutorials/blob/master/notebooks/collections_demos/TotalSegmentator_CT_Segmentations_features_extraction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Extraction of radiomics features for the TotalSegmentator-CT-Segmentations collection

This notebook is provided for transparency. It documents how the radiomics features, shared in CSV and Parquet formats, were extracted for the following dataset:

> Thiriveedhi, V. K., Krishnaswamy, D., Clunie, D., & Fedorov, A. (2023). TotalSegmentator segmentations and radiomics features for NCI Imaging Data Commons CT images [Data set]. Zenodo. https://doi.org/10.5281/zenodo.8347012

---

Initial version: June 2024

## Step 1: Generation of a features pivot table

The query below creates a pivot table, that is, a table in which each feature is stored in a dedicated column. The result was saved to `idc-sandbox-000.andrey_cohorts.totalsegmentator_quant_pivot`. The next step queries this saved table instead of recomputing the pivot, which reduces query time.

```sql
SELECT
  da.PatientID AS PatientID,
  StudyInstanceUID,
  StudyDate,
  sourceSegmentedSeriesUID AS CT_SeriesInstanceUID,
  SeriesInstanceUID AS SEG_SeriesInstanceUID,
  segmentationSegmentNumber[0] AS SEG_SegmentNumber,
  findingSite.CodeMeaning FindingSite,
  lateralityModifier.CodeMeaning FindingSiteLaterality,
  MAX(CASE
      WHEN Quantity.CodeMeaning = '10th percentile' THEN Value
  END
    ) AS `Percentile_10th`,
  MAX(CASE
      WHEN Quantity.CodeMeaning = '90th percentile' THEN Value
  END
    ) AS `Percentile_90th`,
  MAX(CASE
      WHEN Quantity.CodeMeaning = 'Elongation' THEN Value
  END
    ) AS `Elongation`,
  MAX(CASE
      WHEN Quantity.CodeMeaning = 'Energy' THEN Value
  END
    ) AS `Energy`,
  MAX(CASE
      WHEN Quantity.CodeMeaning = 'Flatness' THEN Value
  END
    ) AS `Flatness`,
  MAX(CASE
      WHEN Quantity.CodeMeaning = 'Intensity Histogram Entropy' THEN Value
  END
    ) AS `Intensity_Histogram_Entropy`,
  MAX(CASE
      WHEN Quantity.CodeMeaning = 'Intensity histogram uniformity' THEN Value
  END
    ) AS `Intensity_histogram_uniformity`,
  MAX(CASE
      WHEN Quantity.CodeMeaning = 'Interquartile range' THEN Value
  END
    ) AS `Interquartile_range`,
  MAX(CASE
      WHEN Quantity.CodeMeaning = 'Kurtosis' THEN Value
  END
    ) AS `Kurtosis`,
  MAX(CASE
      WHEN Quantity.CodeMeaning = 'Least Axis in 3D Length' THEN Value
  END
    ) AS `Least_Axis_in_3D_Length`,
  MAX(CASE
      WHEN Quantity.CodeMeaning = 'Major Axis in 3D Length' THEN Value
  END
    ) AS `Major_Axis_in_3D_Length`,
  MAX(CASE
      WHEN Quantity.CodeMeaning = 'Maximum 3D Diameter of a Mesh' THEN Value
  END
    ) AS `Maximum_3D_Diameter_of_a_Mesh`,
  MAX(CASE
      WHEN Quantity.CodeMeaning = 'Maximum grey level' THEN Value
  END
    ) AS `Maximum_grey_level`,
  MAX(CASE
      WHEN Quantity.CodeMeaning = 'Mean' THEN Value
  END
    ) AS `Mean`,
  MAX(CASE
      WHEN Quantity.CodeMeaning = 'Mean absolute deviation' THEN Value
  END
    ) AS `Mean_absolute_deviation`,
  MAX(CASE
      WHEN Quantity.CodeMeaning = 'Median' THEN Value
  END
    ) AS `Median`,
  MAX(CASE
      WHEN Quantity.CodeMeaning = 'Minimum grey level' THEN Value
  END
    ) AS `Minimum_grey_level`,
  MAX(CASE
      WHEN Quantity.CodeMeaning = 'Minor Axis in 3D Length' THEN Value
  END
    ) AS `Minor_Axis_in_3D_Length`,
  MAX(CASE
      WHEN Quantity.CodeMeaning = 'Range' THEN Value
  END
    ) AS `Range`,
  MAX(CASE
      WHEN Quantity.CodeMeaning = 'Robust mean absolute deviation' THEN Value
  END
    ) AS `Robust_mean_absolute_deviation`,
  MAX(CASE
      WHEN Quantity.CodeMeaning = 'Root mean square' THEN Value
  END
    ) AS `Root_mean_square`,
  MAX(CASE
      WHEN Quantity.CodeMeaning = 'Skewness' THEN Value
  END
    ) AS `Skewness`,
  MAX(CASE
      WHEN Quantity.CodeMeaning = 'Sphericity' THEN Value
  END
    ) AS `Sphericity`,
  MAX(CASE
      WHEN Quantity.CodeMeaning = 'Surface Area of Mesh' THEN Value
  END
    ) AS `Surface_Area_of_Mesh`,
  MAX(CASE
      WHEN Quantity.CodeMeaning = 'Surface to Volume Ratio' THEN Value
  END
    ) AS `Surface_to_Volume_Ratio`,
  MAX(CASE
      WHEN Quantity.CodeMeaning = 'Variance' THEN Value
  END
    ) AS `Variance`,
  MAX(CASE
      WHEN Quantity.CodeMeaning = 'Volume from Voxel Summation' THEN Value
  END
    ) AS `Volume_from_Voxel_Summation`,
  MAX(CASE
      WHEN Quantity.CodeMeaning = 'Volume of Mesh' THEN Value
  END
    ) AS `Volume_of_Mesh`
FROM
  `bigquery-public-data.idc_v18.quantitative_measurements` qm
JOIN
  `bigquery-public-data.idc_v18.dicom_all` da
ON
  qm.segmentationInstanceUID=da.SOPInstanceUID
WHERE
  analysis_result_id IN ('TotalSegmentator-CT-Segmentations')
GROUP BY
  1,
  2,
  3,
  4,
  5,
  6,
  7,
  8
ORDER BY
  PatientID,
  StudyDate,
  FindingSite,
  FindingSiteLaterality
  
```
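The `MAX(CASE WHEN ... THEN Value END)` pattern in the query above turns one row per measurement into one column per feature. The following is a minimal sketch of the same reshaping in plain Python, included only to illustrate the idea. The function name `pivot`, the segment keys, and all values are made up for illustration and are not part of the original workflow.

```python
# Toy long-format rows: (segment key, Quantity.CodeMeaning, Value).
# All keys and values here are invented for illustration.
long_rows = [
    (("seg-1", "Liver"), "Mean", 55.0),
    (("seg-1", "Liver"), "Median", 57.0),
    (("seg-2", "Spleen"), "Mean", 48.5),
]

# Map CodeMeaning strings to output column names, as the AS clauses do.
column_names = {
    "Mean": "Mean",
    "Median": "Median",
    "Volume of Mesh": "Volume_of_Mesh",
}


def pivot(rows, columns):
    """Turn one-row-per-measurement data into one-row-per-segment data."""
    wide = {}
    for key, quantity, value in rows:
        # Each group (GROUP BY key) starts with every feature column empty.
        features = wide.setdefault(key, {name: None for name in columns.values()})
        if quantity in columns:
            column = columns[quantity]
            current = features[column]
            # Like MAX(), ignore missing values and keep the largest one seen.
            features[column] = value if current is None else max(current, value)
    return wide


wide_rows = pivot(long_rows, column_names)
for key, features in wide_rows.items():
    print(key, features)
```

Features that were never measured for a segment stay empty (`None` here, `NULL` in BigQuery), which is why some columns of the pivot table can contain missing values.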

## Step 2: Save radiomics features for individual segmented structures

In the following cells, the intermediate table is queried to select features for one structure at a time, and each result is saved in CSV and Parquet formats.

In [None]:
#@title Enter your Project ID
# initialize this variable with your Google Cloud Project ID!
my_ProjectID = "idc-sandbox-000" #@param {type:"string"}

import os
os.environ["GCP_PROJECT_ID"] = my_ProjectID

from google.colab import auth
auth.authenticate_user()

In [None]:
from google.cloud import bigquery

# The BigQuery client is initialized with the project ID
# specified at the beginning of the notebook.
bq_client = bigquery.Client(my_ProjectID)

selection_query = """
SELECT
  DISTINCT FindingSite, FindingSiteLaterality
FROM
  idc-sandbox-000.andrey_cohorts.totalsegmentator_quant_pivot
ORDER BY
  FindingSite, FindingSiteLaterality
"""

selection_result = bq_client.query(selection_query)
selection_df = selection_result.result().to_dataframe()

selection_df

In [None]:
%%bigquery --project idc-external-002

    SELECT * from idc-sandbox-000.andrey_cohorts.totalsegmentator_quant_pivot
        where FindingSite = 'Clavicle' and FindingSiteLaterality = 'Left' limit 10


In [None]:
!mkdir -p csv
!mkdir -p parquet

In [None]:
# Run a parameterized BigQuery query for each structure and save the result.

from google.cloud import bigquery

# The queries below are executed with bq_client, which was created
# earlier in the notebook with the project ID.

query_lateral = """
    SELECT * from idc-sandbox-000.andrey_cohorts.totalsegmentator_quant_pivot
        where FindingSite = @findingSite and FindingSiteLaterality = @findingSiteLaterality
"""

query_non_lateral = """
    SELECT * from idc-sandbox-000.andrey_cohorts.totalsegmentator_quant_pivot
        where FindingSite = @findingSite
"""

for row in selection_df.itertuples():
  print(row.FindingSite, row.FindingSiteLaterality)

  if str(row.FindingSiteLaterality) == "None":
    query_parameters = [
        bigquery.ScalarQueryParameter("findingSite", "STRING", row.FindingSite)
    ]

    job_config = bigquery.QueryJobConfig(query_parameters=query_parameters)
    query_job = bq_client.query(query_non_lateral, job_config=job_config)  # Make an API request.

    # Only the file name stem goes here; the output directory is added when saving.
    filePrefix = f'{row.FindingSite}'

  else:
    query_parameters = [
          bigquery.ScalarQueryParameter("findingSite", "STRING", row.FindingSite),
          bigquery.ScalarQueryParameter("findingSiteLaterality", "STRING", row.FindingSiteLaterality),
      ]

    job_config = bigquery.QueryJobConfig(query_parameters=query_parameters)
    query_job = bq_client.query(query_lateral, job_config=job_config)  # Make an API request.

    filePrefix = f'{row.FindingSite}_{row.FindingSiteLaterality}'

  query_df = query_job.result().to_dataframe()

  query_df.to_csv(f'./csv/{filePrefix}.csv', index=False)
  query_df.to_parquet(f'./parquet/{filePrefix}.parquet', compression='gzip', index=False)

  #break

#query_df
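The naming rule used in the loop above (structure name, plus laterality when one is present) can be isolated in a small helper, which makes it easy to check on its own. This is a minimal sketch: `make_file_prefix` is a hypothetical name that does not appear in the original notebook, and treating `NaN` as a missing laterality is an added assumption about how NULL values may arrive from the dataframe.

```python
import math


def make_file_prefix(finding_site, laterality=None):
    """Return '<FindingSite>' or '<FindingSite>_<Laterality>'.

    Hypothetical helper. None, NaN, and the string 'None' are all
    treated as "this structure has no laterality".
    """
    missing = (
        laterality is None
        or (isinstance(laterality, float) and math.isnan(laterality))
        or str(laterality) == "None"
    )
    return finding_site if missing else f"{finding_site}_{laterality}"


print(make_file_prefix("Clavicle", "Left"))  # lateral structure
print(make_file_prefix("Liver", None))       # structure without laterality
```

The prefix contains no directory component, so the same value can be reused for both the `./csv` and `./parquet` output folders.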

In [None]:
!gsutil -m cp -r ./csv gs://af-dev-storage/ts_features_20240617
!gsutil -m cp -r ./parquet gs://af-dev-storage/ts_features_20240617

In [None]:
for i in query_df.columns:
  print(i)
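The saved CSV files can be inspected without BigQuery access. The sketch below assumes a file laid out like the ones written by the loop above, with the identifier columns of the pivot table followed by numeric feature columns. `read_features` is a hypothetical helper, and the sample file it reads is a stand-in with made-up values, written only so that the example runs on its own.

```python
import csv
import os
import tempfile

# Identifier columns of the pivot table; all other columns hold numeric features.
ID_COLUMNS = {
    "PatientID", "StudyInstanceUID", "StudyDate", "CT_SeriesInstanceUID",
    "SEG_SeriesInstanceUID", "SEG_SegmentNumber",
    "FindingSite", "FindingSiteLaterality",
}


def read_features(path):
    """Read a per-structure CSV; feature columns become floats, empty cells None."""
    records = []
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            for name, value in row.items():
                if name not in ID_COLUMNS:
                    row[name] = float(value) if value != "" else None
            records.append(row)
    return records


# Write a tiny stand-in file (invented values, reduced set of columns).
sample_path = os.path.join(tempfile.mkdtemp(), "Clavicle_Left.csv")
with open(sample_path, "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["PatientID", "FindingSite", "FindingSiteLaterality",
                     "Mean", "Volume_of_Mesh"])
    writer.writerow(["toy-patient", "Clavicle", "Left", "212.5", ""])

records = read_features(sample_path)
print(records[0]["Mean"], records[0]["Volume_of_Mesh"])
```

Empty cells are mapped to `None` because pandas writes missing values as empty strings in CSV output by default.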

Features dictionary: https://docs.google.com/spreadsheets/d/1GbNv0yX06okLNtjxPjP1P9g_JrH0wDNqdUQsUBEFtPs/edit?usp=sharing

Extracted using the following query:

```sql
SELECT
  DISTINCT Quantity.CodeMeaning Quantity_CodeMeaning,
  Quantity.CodeValue Quantity_CodeValue,
  Quantity.CodingSchemeDesignator Quantity_CodingSchemeDesignator,
  qm.Units.CodeMeaning Units_CodeMeaning,
  qm.Units.CodeValue Units_CodeValue,
  qm.Units.CodingSchemeDesignator Units_CodingSchemeDesignator,
FROM
  `bigquery-public-data.idc_v18.quantitative_measurements` qm
JOIN
  `bigquery-public-data.idc_v18.dicom_all` da
ON
  qm.segmentationInstanceUID=da.SOPInstanceUID
WHERE
  analysis_result_id IN ('TotalSegmentator-CT-Segmentations')
ORDER BY
  Quantity_CodeMeaning,
  Units_CodeMeaning
```

Anatomic structures dictionary: https://docs.google.com/spreadsheets/d/169G8Yo2tZKIYYP3JmHVFLWERQCM9XsRnz3xYEZlAUoo/edit?usp=sharing

Extracted using the following query:

```sql
SELECT
  DISTINCT findingSite.CodeMeaning FindingSite_CodeMeaning,
  findingSite.CodeValue FindingSite_CodeValue,
  findingSite.CodingSchemeDesignator FindingSite_CodingSchemeDesignator,
  lateralityModifier.CodeMeaning FindingSiteLaterality_CodeMeaning,
  lateralityModifier.CodeValue FindingSiteLaterality_CodeValue,
  lateralityModifier.CodingSchemeDesignator FindingSiteLaterality_CodingSchemeDesignator,
FROM
  `bigquery-public-data.idc_v18.quantitative_measurements` qm
JOIN
  `bigquery-public-data.idc_v18.dicom_all` da
ON
  qm.segmentationInstanceUID=da.SOPInstanceUID
WHERE
  analysis_result_id IN ('TotalSegmentator-CT-Segmentations')
ORDER BY
  FindingSite_CodeMeaning,
  FindingSiteLaterality_CodeMeaning
```

## Support

If you have any questions, please post them on the IDC forum: https://discourse.canceridc.dev.