# Dataset Creation

## DICOM Datasets


### Find (and optionally Reorganize) DICOMs

CLI example:

This example uses the publicly available [upenn-gbm](https://portal.imaging.datacommons.cancer.gov/explore/filters/?collection_id=upenn_gbm) dataset.

In [1]:
!preprocessing dicom-dataset --help

usage: preprocessing <command> [<args>]

The following commands are available:
    validate-installation       Check that the `preprocessing` library is installed correctly along
                                with all of its dependencies.

    dicom-dataset               Create a DICOM dataset CSV compatible with subsequent `preprocessing`
                                scripts. The final CSV provides a series level summary of the location
                                of each series alongside metadata extracted from DICOM headers.  If the
                                previous organization schems of the dataset does not enforce a DICOM
                                series being isolated to a unique directory (instances belonging to
                                multiple series must not share the same lowest level directory),
                                reorganization must be applied for NIfTI conversion.

    nifti-dataset               Create a NIfTI dataset CSV compat

In [2]:
!preprocessing dicom-dataset \
    /autofs/space/crater_001/datasets/public/NIH_IDC_Brain/upenn_gbm \
    dicom_dataset_examples/upenn_gbm_dataset.csv \
    -c 80

Constructing DICOM dataset: 100%|█████▉| 839990/840621 [18:38<00:00, 751.06it/s]
Dataset of DICOM instances saved to /autofs/space/crater_001/tools/repos/preprocessing_dev/notebooks/dicom_dataset_examples/upenn_gbm_dataset_instances.csv
Anonymizing dataset: 100%|██████████████████| 630/630 [00:00<00:00, 1177.71it/s]
Anonymization completed
Dataset written to /autofs/space/crater_001/tools/repos/preprocessing_dev/notebooks/dicom_dataset_examples/upenn_gbm_dataset.csv


Python API example:

This example uses the publicly available [remind](https://portal.imaging.datacommons.cancer.gov/explore/filters/?collection_id=remind) dataset.

In [3]:
from preprocessing.data import create_dicom_dataset
help(create_dicom_dataset)

Help on function create_dicom_dataset in module preprocessing.data.datasets:

create_dicom_dataset(dicom_dir: pathlib.Path | str, dataset_csv: pathlib.Path | str, reorg_dir: pathlib.Path | str | None = None, anon: Literal['is_anon', 'auto', 'deferred'] = 'auto', batch_size: int = 1000, file_extension: Literal['*', '*.dcm'] = '*', mode: Literal['arbitrary', 'midas'] = 'arbitrary', cpus: int = 1)
    Create a DICOM dataset CSV compatible with subsequent `preprocessing`
    scripts. The final CSV provides a series level summary of the location
    of each series alongside metadata extracted from DICOM headers.  If the
    previous organization schems of the dataset does not enforce a DICOM
    series being isolated to a unique directory (instances belonging to
    multiple series must not share the same lowest level directory),
    reorganization must be applied for NIfTI conversion.
    
    
    Parameters
    ----------
    dicom_dir: Path | str
        The directory in which the DICOM

In [4]:
create_dicom_dataset(
    dicom_dir="/autofs/space/crater_001/datasets/public/NIH_IDC_Brain/remind",
    dataset_csv="dicom_dataset_examples/remind_dataset.csv",
    cpus=80
)

Constructing DICOM dataset: 90419it [03:21, 448.13it/s]


Dataset of DICOM instances saved to /autofs/space/crater_001/tools/repos/preprocessing_dev/notebooks/dicom_dataset_examples/remind_dataset_instances.csv


Anonymizing dataset: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 114/114 [00:00<00:00, 1790.80it/s]


Anonymization completed
Dataset written to /autofs/space/crater_001/tools/repos/preprocessing_dev/notebooks/dicom_dataset_examples/remind_dataset.csv


Remember that subsequent commands require the data to be converted from DICOM to the NIfTI file format. See [DICOM to NIfTI Conversion](#DICOM-to-NIfTI-Conversion) for more information.
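The help text above states that reorganization is needed when instances belonging to multiple series share the same lowest-level directory. The following is a minimal sketch of how that condition could be checked, not the library's own implementation. It assumes you already have a mapping from each DICOM file path to its SeriesInstanceUID (for example, read from the DICOM headers with a tool of your choice); the function name `find_shared_directories` and all paths and UIDs are hypothetical.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def find_shared_directories(series_by_file):
    """Return directories that hold instances from more than one series.

    `series_by_file` maps each DICOM file path to its SeriesInstanceUID.
    """
    series_by_dir = defaultdict(set)
    for file_path, series_uid in series_by_file.items():
        # Group series UIDs by the lowest-level directory containing the file.
        series_by_dir[str(PurePosixPath(file_path).parent)].add(series_uid)
    return {
        directory: sorted(uids)
        for directory, uids in series_by_dir.items()
        if len(uids) > 1
    }


# Hypothetical paths and UIDs, for illustration only.
example = {
    "/data/patient_1/study_1/a/0001.dcm": "1.2.3",
    "/data/patient_1/study_1/a/0002.dcm": "1.2.3",
    "/data/patient_1/study_1/b/0001.dcm": "1.2.4",
    "/data/patient_1/study_1/b/0002.dcm": "1.2.5",
}
shared = find_shared_directories(example)
print(shared)  # {'/data/patient_1/study_1/b': ['1.2.4', '1.2.5']}
```

A non-empty result suggests that the dataset would need reorganization before NIfTI conversion.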

## NIfTI Datasets

In [5]:
!preprocessing nifti-dataset --help

usage: preprocessing <command> [<args>]

The following commands are available:
    validate-installation       Check that the `preprocessing` library is installed correctly along
                                with all of its dependencies.

    dicom-dataset               Create a DICOM dataset CSV compatible with subsequent `preprocessing`
                                scripts. The final CSV provides a series level summary of the location
                                of each series alongside metadata extracted from DICOM headers.  If the
                                previous organization schems of the dataset does not enforce a DICOM
                                series being isolated to a unique directory (instances belonging to
                                multiple series must not share the same lowest level directory),
                                reorganization must be applied for NIfTI conversion.

    nifti-dataset               Create a NIfTI dataset CSV compat

## DICOM to NIfTI Conversion

The outputs of our NIfTI conversion script follow a naming convention inspired by [BIDS](https://bids-website.readthedocs.io/en/latest/getting_started/folders_and_files/files.html#mri). However, note that the outputs are **NOT** BIDS-compliant, as the `preprocessing` library tracks metadata differently through the use of a dataset CSV.

Before NIfTI conversion, the "NormalizedSeriesDescription" and "SeriesType" columns of the dataset CSV **MUST** be populated. To avoid potential conflicts in the resulting NIfTI filenames, we do not use the standard "SeriesDescription" attribute and instead specify our own values before running our script. Keep in mind that repeated values of "NormalizedSeriesDescription" within a study would result in the same converted filename. Therefore, if a value is repeated within a study, only the first file will be converted and the others will be ignored. The "SeriesType" entries serve to further separate the files, both in terms of organization and in terms of the preprocessing logic applied. Again borrowing from BIDS, expected values include "anat", "dwi", "fmap", "func", "perf", "etc". Currently, only "anat" is explicitly supported (preprocessing steps do not currently depend on the value of "SeriesType"), but this is intended to change in the future.
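Because a repeated "NormalizedSeriesDescription" within a study means that only the first series is converted, it can help to look for such repeats before running the conversion. The sketch below is one possible check using pandas and is not part of `preprocessing`. It assumes that a study is identified by the "StudyInstanceUID" column; the function name `find_filename_conflicts` and the example rows are hypothetical.

```python
import pandas as pd


def find_filename_conflicts(df):
    """Return rows that repeat a NormalizedSeriesDescription within a study."""
    # Ignore series that have not been selected (empty or missing label).
    labeled = df[df["NormalizedSeriesDescription"].fillna("") != ""]
    # Flag every occurrence after the first within each study.
    repeats = labeled.duplicated(
        subset=["StudyInstanceUID", "NormalizedSeriesDescription"], keep="first"
    )
    return labeled[repeats]


# Hypothetical rows, for illustration only.
example_df = pd.DataFrame(
    {
        "StudyInstanceUID": ["s1", "s1", "s1", "s2"],
        "SeriesInstanceUID": ["a", "b", "c", "d"],
        "NormalizedSeriesDescription": ["T1Post", "T1Post", None, "T1Post"],
    }
)
conflicts = find_filename_conflicts(example_df)
print(conflicts["SeriesInstanceUID"].tolist())  # ['b']
```

Any rows returned here correspond to series that would be skipped, so they are candidates for relabeling or for clearing their "NormalizedSeriesDescription".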

There are no automated methods for populating these columns included directly in `preprocessing`, as we do not consider this a preprocessing step in and of itself. Rather, series selection should be handled manually or by another tool. Continuing the examples from before, we assume that the example datasets are curated well enough that T1-weighted post-contrast MRI can be identified through the standard "SeriesDescription" attribute; matching series are given the "NormalizedSeriesDescription" of "T1Post". In practice, we do not recommend this approach because this DICOM tag is frequently inaccurate, and we suggest using tools based on DICOM metadata or manual identification for series selection.

In [6]:
# Manually perform series selection: any series whose "SeriesDescription"
# contains "post" (case-insensitive) is labeled as a T1 post-contrast series.
import pandas as pd


def label_t1post(dataset_csv):
    """Populate the two required columns of a dataset CSV and save it in place."""
    df = pd.read_csv(dataset_csv, dtype=str)
    df["NormalizedSeriesDescription"] = df["SeriesDescription"].apply(
        lambda x: "T1Post" if "post" in str(x).lower() else ''
    )
    # Only the selected series receive a "SeriesType".
    df["SeriesType"] = df["NormalizedSeriesDescription"].apply(
        lambda x: "anat" if x == "T1Post" else None
    )
    df.to_csv(dataset_csv, index=False)
    return df


remind_dataset_csv = "dicom_dataset_examples/remind_dataset.csv"
remind_df = label_t1post(remind_dataset_csv)

upenn_dataset_csv = "dicom_dataset_examples/upenn_gbm_dataset.csv"
upenn_df = label_t1post(upenn_dataset_csv)
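After series selection, a quick sanity check of the dataset CSV can catch problems before conversion. The sketch below is a hedged illustration and is not part of `preprocessing`. The required columns are those listed in the `convert_batch_to_nifti` help text, while the set of accepted "SeriesType" values is an assumption covering the five BIDS-style values named in the text (the quoted "etc" is left out on the assumption that it is shorthand and not a literal value). The function name `check_ready_for_conversion` and the example rows are hypothetical.

```python
import pandas as pd

# Columns listed as required in the `convert_batch_to_nifti` help text.
REQUIRED_COLUMNS = [
    "Dicoms", "AnonPatientID", "AnonStudyID", "StudyInstanceUID",
    "SeriesInstanceUID", "Manufacturer", "NormalizedSeriesDescription", "SeriesType",
]
# Assumption: the five BIDS-style values named in the text.
EXPECTED_SERIES_TYPES = {"anat", "dwi", "fmap", "func", "perf"}


def check_ready_for_conversion(df):
    """Return a list of problems found; an empty list means none were found."""
    problems = [f"missing column: {c}" for c in REQUIRED_COLUMNS if c not in df.columns]
    if problems:
        return problems
    # Only series with a non-empty label are considered selected.
    selected = df[df["NormalizedSeriesDescription"].fillna("") != ""]
    if selected.empty:
        problems.append("no series selected")
    bad = selected[~selected["SeriesType"].isin(EXPECTED_SERIES_TYPES)]
    for uid in bad["SeriesInstanceUID"]:
        problems.append(f"unexpected SeriesType for series {uid}")
    return problems


# Hypothetical rows, for illustration only.
example_df = pd.DataFrame(
    {
        "Dicoms": ["/d/a", "/d/b", "/d/c"],
        "AnonPatientID": ["p1", "p1", "p1"],
        "AnonStudyID": ["st1", "st1", "st1"],
        "StudyInstanceUID": ["s1", "s1", "s1"],
        "SeriesInstanceUID": ["a", "b", "c"],
        "Manufacturer": ["m", "m", "m"],
        "NormalizedSeriesDescription": ["T1Post", "T1Post", None],
        "SeriesType": ["anat", None, None],
    }
)
problems = check_ready_for_conversion(example_df)
print(problems)  # ['unexpected SeriesType for series b']
```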

In [7]:
!preprocessing dataset-to-nifti --help

usage: preprocessing <command> [<args>]

The following commands are available:
    validate-installation       Check that the `preprocessing` library is installed correctly along
                                with all of its dependencies.

    dicom-dataset               Create a DICOM dataset CSV compatible with subsequent `preprocessing`
                                scripts. The final CSV provides a series level summary of the location
                                of each series alongside metadata extracted from DICOM headers.  If the
                                previous organization schems of the dataset does not enforce a DICOM
                                series being isolated to a unique directory (instances belonging to
                                multiple series must not share the same lowest level directory),
                                reorganization must be applied for NIfTI conversion.

    nifti-dataset               Create a NIfTI dataset CSV compat

In [8]:
!preprocessing dataset-to-nifti \
    "/autofs/space/crater_001/datasets/public/NIH_IDC_Brain/remind_nifti" \
    "dicom_dataset_examples/remind_dataset.csv" \
    -ss T1Post \
    -c 80

Converting to NIfTI:   0%|                              | 0/226 [00:00<?, ?it/s]
/autofs/space/crater_001/datasets/public/NIH_IDC_Brain/remind/ReMIND-023/1.3.6.1.4.1.14519.5.2.1.36117150865305132080994538736709909848/US_1.3.6.1.4.1.14519.5.2.1.229063924195156492220643991189118349143 does not pass integrity checks and will not be converted to NIfTI
/autofs/space/crater_001/datasets/public/NIH_IDC_Brain/remind/ReMIND-106/1.3.6.1.4.1.14519.5.2.1.291284197960973976865313182698264218988/US_1.3.6.1.4.1.14519.5.2.1.304956174426870126481642669784274153861 does not pass integrity checks and will not be converted to NIfTI
/autofs/space/crater_001/datasets/public/NIH_IDC_Brain/remind/ReMIND-110/1.3.6.1.4.1.14519.5.2.1.100183031377204705413739421095350394731/US_1.3.6.1.4.1.14519.5.2.1.143822537397792974937792828054343479598 does not pass integrity checks and will not be converted to NIfTI
/autofs/space/crater_001/datasets/public/NIH_IDC_Brain/remind/ReMIND-069/1.3.6.1.4.1.14519.5.2.1.15793795550059

In [9]:
from preprocessing.data import convert_batch_to_nifti
help(convert_batch_to_nifti)

Help on function convert_batch_to_nifti in module preprocessing.data.nifti_conversion:

convert_batch_to_nifti(nifti_dir: pathlib.Path | str, csv: pathlib.Path | str, seg_source: str | None = None, overwrite_nifti: bool = False, skip_integrity_checks: bool = False, tolerance: float = 0.05, cpus: int = 1, check_columns: bool = True) -> pandas.core.frame.DataFrame
    Convert a DICOM dataset to NIfTI files representing each series.
    
    Parameters
    ----------
    nifti_dir: Path | str
        The root directory under which the converted NIfTI files will be written. Subdirectories
        will be created to follow a BIDS inspired convention.
    
    csv: Path | str
        The path to a CSV containing an entire dataset. It must contain the following
        columns: ['Dicoms', 'AnonPatientID', 'AnonStudyID', 'StudyInstanceUID',
        'SeriesInstanceUID', 'Manufacturer', 'NormalizedSeriesDescription', 'SeriesType'].
    
    seg_source: str | None
        The 'NormalizedSeriesDes

In [10]:
df = convert_batch_to_nifti(
    nifti_dir="/autofs/space/crater_001/datasets/public/NIH_IDC_Brain/upenn_gbm_nifti",
    csv="dicom_dataset_examples/upenn_gbm_dataset.csv",
    seg_source="T1Post",
    overwrite_nifti=False,
    cpus=80
)

Converting to NIfTI:   1%|██▏                                                                                                                                                                                                                                                                                | 18/2270 [00:07<06:25,  5.85it/s]

/autofs/space/crater_001/datasets/public/NIH_IDC_Brain/upenn_gbm/UPENN-GBM-00263/1.3.6.1.4.1.14519.5.2.1.326397760301295542527701713877675360535/MR_1.3.6.1.4.1.14519.5.2.1.149535003298172005278901576115724126751 does not pass integrity checks and will not be converted to NIfTI


Converting to NIfTI: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2270/2270 [05:32<00:00,  6.83it/s]
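
The conversion output reports each skipped series with a message of the form "<path> does not pass integrity checks and will not be converted to NIfTI". If you capture that output as text, the skipped series can be collected for review. The sketch below is one way to do this with a regular expression and is not part of `preprocessing`. It assumes that paths contain no whitespace and that each message either starts a line or directly follows a progress bar ending in "]"; the function name `find_failed_series` and the sample log are hypothetical.

```python
import re

# A path (no whitespace) at the start of a line or right after a progress bar.
FAILED_PATTERN = re.compile(
    r"(?:^|\])\s*(/\S+) does not pass integrity checks", flags=re.MULTILINE
)


def find_failed_series(log_text):
    """Return the series directories reported as failing integrity checks."""
    return FAILED_PATTERN.findall(log_text)


# Hypothetical log text, for illustration only.
sample_log = (
    "Converting to NIfTI:   0%|   | 0/2 [00:00<?, ?it/s]"
    "/data/study_a/series_1 does not pass integrity checks and will not be converted to NIfTI\n"
    "/data/study_b/series_2 does not pass integrity checks and will not be converted to NIfTI\n"
    "Converting to NIfTI: 100%|██| 2/2 [00:01<00:00,  1.50it/s]\n"
)
failed = find_failed_series(sample_log)
print(failed)  # ['/data/study_a/series_1', '/data/study_b/series_2']
```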
