# Dataset Creation

## DICOM Datasets


### Find (and optionally Reorganize) DICOMs

CLI example:

For a publicly available dataset: [upenn-gbm](https://portal.imaging.datacommons.cancer.gov/explore/filters/?collection_id=upenn_gbm)

In [1]:
!preprocessing dicom-dataset --help

usage: preprocessing <command> [<args>]

The following commands are available:
    validate-installation       Check that the `preprocessing` library is installed correctly along
                                with all of its dependencies.

    dicom-dataset               Create a DICOM dataset CSV compatible with subsequent `preprocessing`
                                scripts. The final CSV provides a series level summary of the location
                                of each series alongside metadata extracted from DICOM headers.  If the
                                previous organization schems of the dataset does not enforce a DICOM
                                series being isolated to a unique directory (instances belonging to
                                multiple series must not share the same lowest level directory),
                                reorganization must be applied for NIfTI conversion.

    nifti-dataset               Create a NIfTI dataset CSV compat

In [2]:
!preprocessing dicom-dataset \
    /autofs/space/crater_001/datasets/public/NIH_IDC_Brain/upenn_gbm \
    dicom_dataset_examples/upenn_gbm_dataset.csv \
    -c 80

Constructing DICOM dataset: 100%|█████▉| 839990/840621 [18:38<00:00, 751.06it/s]
Dataset of DICOM instances saved to /autofs/space/crater_001/tools/repos/preprocessing_dev/notebooks/dicom_dataset_examples/upenn_gbm_dataset_instances.csv
Anonymizing dataset: 100%|██████████████████| 630/630 [00:00<00:00, 1177.71it/s]
Anonymization completed
Dataset written to /autofs/space/crater_001/tools/repos/preprocessing_dev/notebooks/dicom_dataset_examples/upenn_gbm_dataset.csv
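
The help text above says reorganization is needed when instances from more than one series share the same lowest level directory. The sketch below shows one way to check for that. It assumes you already have `(file_path, SeriesInstanceUID)` pairs from some source of your own; `find_shared_directories` and the example paths are hypothetical and not part of `preprocessing`.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def find_shared_directories(instances):
    """Return directories that hold instances from more than one series.

    `instances` is an iterable of (file_path, series_instance_uid) pairs.
    How the pairs are obtained (for example, from DICOM headers) is left
    to the caller.
    """
    series_by_dir = defaultdict(set)
    for file_path, series_uid in instances:
        # Group series UIDs by the directory that directly contains the file.
        series_by_dir[str(PurePosixPath(file_path).parent)].add(series_uid)
    return {
        directory: sorted(uids)
        for directory, uids in series_by_dir.items()
        if len(uids) > 1
    }


# Hypothetical paths and UIDs, for illustration only.
example = [
    ("/data/patient_a/study_1/0001.dcm", "series-1"),
    ("/data/patient_a/study_1/0002.dcm", "series-2"),
    ("/data/patient_b/series_3/0001.dcm", "series-3"),
]
shared = find_shared_directories(example)
print(shared)
```

A non-empty result suggests the dataset layout mixes series within a directory, which is the situation the help text describes as requiring reorganization.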


Python API example:

For a publicly available dataset: [remind](https://portal.imaging.datacommons.cancer.gov/explore/filters/?collection_id=remind)

In [3]:
from preprocessing.data import create_dicom_dataset
help(create_dicom_dataset)

Help on function create_dicom_dataset in module preprocessing.data.datasets:

create_dicom_dataset(dicom_dir: pathlib.Path | str, dataset_csv: pathlib.Path | str, reorg_dir: pathlib.Path | str | None = None, anon: Literal['is_anon', 'auto', 'deferred'] = 'auto', batch_size: int = 1000, file_extension: Literal['*', '*.dcm'] = '*', mode: Literal['arbitrary', 'midas'] = 'arbitrary', cpus: int = 1)
    Create a DICOM dataset CSV compatible with subsequent `preprocessing`
    scripts. The final CSV provides a series level summary of the location
    of each series alongside metadata extracted from DICOM headers.  If the
    previous organization schems of the dataset does not enforce a DICOM
    series being isolated to a unique directory (instances belonging to
    multiple series must not share the same lowest level directory),
    reorganization must be applied for NIfTI conversion.
    
    
    Parameters
    ----------
    dicom_dir: Path | str
        The directory in which the DICOM

In [4]:
create_dicom_dataset(
    dicom_dir="/autofs/space/crater_001/datasets/public/NIH_IDC_Brain/remind",
    dataset_csv="dicom_dataset_examples/remind_dataset.csv",
    cpus=80
)

Constructing DICOM dataset: 90419it [03:21, 448.13it/s]


Dataset of DICOM instances saved to /autofs/space/crater_001/tools/repos/preprocessing_dev/notebooks/dicom_dataset_examples/remind_dataset_instances.csv


Anonymizing dataset: 100%|██████████████████| 114/114 [00:00<00:00, 1790.80it/s]


Anonymization completed
Dataset written to /autofs/space/crater_001/tools/repos/preprocessing_dev/notebooks/dicom_dataset_examples/remind_dataset.csv
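
Since the dataset CSV is described as a series level summary, a quick sanity check is to count rows per patient. The sketch below uses only the standard library and assumes the CSV has one row per series and an `AnonPatientID` column; `series_per_patient` is a hypothetical helper, not part of `preprocessing`.

```python
import csv
from collections import Counter


def series_per_patient(dataset_csv):
    """Count rows (assumed one per series) for each AnonPatientID."""
    with open(dataset_csv, newline="") as handle:
        return Counter(row["AnonPatientID"] for row in csv.DictReader(handle))


# Example usage (path taken from the cell above):
# series_per_patient("dicom_dataset_examples/remind_dataset.csv").most_common(5)
```

Patients with unexpectedly few or many series are a reasonable place to start when spot checking the result.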


Remember that subsequent commands require the data to be converted from DICOM to the NIfTI file format. See [DICOM to NIfTI Conversion](#DICOM-to-NIfTI-Conversion) for more information.

## NIfTI Datasets

In [5]:
!preprocessing nifti-dataset --help

usage: preprocessing <command> [<args>]

The following commands are available:
    validate-installation       Check that the `preprocessing` library is installed correctly along
                                with all of its dependencies.

    dicom-dataset               Create a DICOM dataset CSV compatible with subsequent `preprocessing`
                                scripts. The final CSV provides a series level summary of the location
                                of each series alongside metadata extracted from DICOM headers.  If the
                                previous organization schems of the dataset does not enforce a DICOM
                                series being isolated to a unique directory (instances belonging to
                                multiple series must not share the same lowest level directory),
                                reorganization must be applied for NIfTI conversion.

    nifti-dataset               Create a NIfTI dataset CSV compat

## DICOM to NIfTI Conversion

In [1]:
from preprocessing.data import convert_batch_to_nifti
help(convert_batch_to_nifti)

Help on function convert_batch_to_nifti in module preprocessing.data.nifti_conversion:

convert_batch_to_nifti(nifti_dir: pathlib.Path | str, csv: pathlib.Path | str, seg_source: str | None = None, overwrite_nifti: bool = False, cpus: int = 1, check_columns: bool = True) -> pandas.core.frame.DataFrame
    Convert a DICOM dataset to NIfTI files representing each series.
    
    Parameters
    ----------
    nifti_dir: Path | str
        The root directory under which the converted NIfTI files will be written. Subdirectories
        will be created to follow a BIDS inspired convention.
    
    csv: Path | str
        The path to a CSV containing an entire dataset. It must contain the following
        columns: ['Dicoms', 'AnonPatientID', 'AnonStudyID', 'StudyInstanceUID',
        'SeriesInstanceUID', 'Manufacturer', 'NormalizedSeriesDescription', 'SeriesType'].
    
    overwrite: bool
        Whether to overwrite the NIfTI file if there is already one with the same output name.
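
The docstring above lists the columns the CSV must contain. A minimal sketch for checking a CSV header against that list before starting a long conversion is shown below; `missing_columns` is a hypothetical helper, and the function's own `check_columns` option may already cover this.

```python
import csv

# Column names copied from the `convert_batch_to_nifti` docstring above.
REQUIRED_COLUMNS = [
    "Dicoms", "AnonPatientID", "AnonStudyID", "StudyInstanceUID",
    "SeriesInstanceUID", "Manufacturer", "NormalizedSeriesDescription",
    "SeriesType",
]


def missing_columns(csv_path, required=REQUIRED_COLUMNS):
    """Return the required columns absent from the header row of `csv_path`."""
    with open(csv_path, newline="") as handle:
        header = next(csv.reader(handle), [])
    return [name for name in required if name not in header]
```

An empty list means every required column is present in the header; it says nothing about whether the values in those columns are filled in.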
      

In [2]:
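import pandas as pd

dataset_csv = "dicom_dataset_examples/upenn_gbm_dataset.csv"

# Read every column as a string so UIDs and dates are not parsed as numbers.
df = pd.read_csv(dataset_csv, dtype=str)

# Label series whose description mentions "post" as T1Post; leave the rest empty.
df["NormalizedSeriesDescription"] = df["SeriesDescription"].apply(
    lambda x: "T1Post" if "post" in str(x).lower() else ""
)

# Only T1Post series are assigned a SeriesType ("anat") in this example.
df["SeriesType"] = df["NormalizedSeriesDescription"].apply(
    lambda x: "anat" if x == "T1Post" else None
)

# Write the added columns back to the dataset CSV used for conversion.
df.to_csv(dataset_csv, index=False)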
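
The labeling rule in the cell above can also be written as named functions, which makes it easier to test and to extend with further sequence types later. This is a sketch of the same rule only; `normalize_description` and `series_type` are hypothetical names, and the first example description is made up for illustration.

```python
def normalize_description(description):
    """Map a raw SeriesDescription to "T1Post", or "" if it does not match."""
    # str() keeps missing values (None, NaN) from raising on .lower().
    text = str(description).lower()
    if "post" in text:
        return "T1Post"
    return ""


def series_type(normalized):
    """Return "anat" for T1Post series and None otherwise."""
    return "anat" if normalized == "T1Post" else None


for raw in ["AX T1 POST", "t2_Flair_axial", None]:
    label = normalize_description(raw)
    print(repr(raw), "->", repr(label), repr(series_type(label)))
```

With pandas, these could be applied as `df["SeriesDescription"].apply(normalize_description)` followed by `df["NormalizedSeriesDescription"].apply(series_type)`.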

In [4]:
convert_batch_to_nifti(
    nifti_dir="/autofs/space/crater_001/datasets/public/NIH_IDC_Brain/upenn_gbm_nifti",
    csv=dataset_csv,
    seg_source="T1Post",
    overwrite_nifti=False,
    cpus=80
)

Converting to NIfTI:   0%|▎                 | 3/2270 [00:02<27:38,  1.37it/s]

/autofs/space/crater_001/datasets/public/NIH_IDC_Brain/upenn_gbm/UPENN-GBM-00263/1.3.6.1.4.1.14519.5.2.1.326397760301295542527701713877675360535/MR_1.3.6.1.4.1.14519.5.2.1.149535003298172005278901576115724126751 does not pass integrity checks and will not be converted to NIfTI


Converting to NIfTI: 100%|██████████████████| 2270/2270 [05:42<00:00,  6.63it/s]


| Unnamed: 0 | SeriesInstanceUID | StudyInstanceUID | PatientID | AccessionNumber | Manufacturer | StudyDate | StudyDescription | SeriesDescription | Modality | Dicoms | AnonPatientID | AnonStudyID | NormalizedSeriesDescription | SeriesType | Nifti | Seg |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1.2.276.0.7230010.3.1.3.17436516.3156027.17205... | 1.3.6.1.4.1.14519.5.2.1.2564367332369685313507... | UPENN-GBM-00450 |  | QIICR | 20110910 | BrainTumor | AIMI Brain MRI AI segmentation | SEG | /autofs/space/crater_001/datasets/public/NIH_I... | sub-01 | ses-01 |  |  |  |  |
| 1 | 1.3.6.1.4.1.14519.5.2.1.3205019030341469044367... | 1.3.6.1.4.1.14519.5.2.1.2564367332369685313507... | UPENN-GBM-00450 |  | SIEMENS | 20110910 | BrainTumor | t2_Flair_axial: Processed_CaPTk | MR | /autofs/space/crater_001/datasets/public/NIH_I... | sub-01 | ses-01 |  |  |  |  |
| 2 | 1.2.276.0.7230010.3.1.3.17436516.3155769.17205... | 1.3.6.1.4.1.14519.5.2.1.1266203829456751464071... | UPENN-GBM-00450 |  | QIICR | 20110910 | BrainTumor | AIMI Brain MRI AI segmentation | SEG | /autofs/space/crater_001/datasets/public/NIH_I... | sub-01 | ses-02 |  |  |  |  |
| 3 | 1.3.6.1.4.1.14519.5.2.1.3046573623188230926346... | 1.3.6.1.4.1.14519.5.2.1.1266203829456751464071... | UPENN-GBM-00450 |  | SIEMENS | 20110910 | BrainTumor | t1 axial_ 3D: Processed_CaPTk | MR | /autofs/space/crater_001/datasets/public/NIH_I... | sub-01 | ses-02 |  |  |  |  |
| 4 | 1.3.6.1.4.1.14519.5.2.1.1840500037272730575369... | 1.3.6.1.4.1.14519.5.2.1.1821188704518021360282... | UPENN-GBM-00450 |  | SIEMENS | 20110910 | BrainTumor | ep2d_DTI_30dir | MR | /autofs/space/crater_001/datasets/public/NIH_I... | sub-01 | ses-03 |  |  |  |  |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 6059 | 1.2.276.0.7230010.3.1.3.17436516.3889651.17204... | 1.3.6.1.4.1.14519.5.2.1.3372171645775161420612... | UPENN-GBM-00046 |  | QIICR | 20040613 | BRAIN^ROUTINE | AIMI Brain MRI AI segmentation | SEG | /autofs/space/crater_001/datasets/public/NIH_I... | sub-99 | ses-04 |  |  |  |  |
| 6060 | 1.3.6.1.4.1.14519.5.2.1.3076294954827787757748... | 1.3.6.1.4.1.14519.5.2.1.3372171645775161420612... | UPENN-GBM-00046 |  | SIEMENS | 20040613 | BRAIN^ROUTINE | Axial T2 tse: Processed_CaPTk | MR | /autofs/space/crater_001/datasets/public/NIH_I... | sub-99 | ses-04 |  |  |  |  |
| 6061 | 1.2.276.0.7230010.3.1.3.17436516.3888998.17204... | 1.3.6.1.4.1.14519.5.2.1.1977575682986953456091... | UPENN-GBM-00046 |  | QIICR | 20040613 | BRAIN^SPECTROSCOPY | AIMI Brain MRI AI segmentation | SEG | /autofs/space/crater_001/datasets/public/NIH_I... | sub-99 | ses-05 |  |  |  |  |
| 6062 | 1.3.6.1.4.1.14519.5.2.1.1547248985746940291713... | 1.3.6.1.4.1.14519.5.2.1.1977575682986953456091... | UPENN-GBM-00046 |  | SIEMENS | 20040613 | BRAIN^SPECTROSCOPY | t2_Flair_axial: Processed_CaPTk | MR | /autofs/space/crater_001/datasets/public/NIH_I... | sub-99 | ses-05 |  |  |  |  |
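
The returned table includes `Nifti` and `Seg` columns, and the log above shows that some series can be skipped during conversion. The sketch below lists series without a `Nifti` entry. It assumes a populated `Nifti` value indicates a converted series, which is not confirmed here; `unconverted_series` and the example records are hypothetical.

```python
def unconverted_series(records):
    """Return SeriesInstanceUIDs whose `Nifti` entry is missing or empty.

    `records` is an iterable of dicts, one per series, such as the result
    of calling `.to_dict("records")` on the returned DataFrame.
    """
    missing = []
    for record in records:
        nifti = record.get("Nifti")
        # Missing values may appear as None, NaN (a float), or an empty string.
        if not (isinstance(nifti, str) and nifti.strip()):
            missing.append(record["SeriesInstanceUID"])
    return missing


# Hypothetical records, for illustration only.
example_records = [
    {"SeriesInstanceUID": "uid-1", "Nifti": "/out/sub-01/ses-01/anat/scan.nii.gz"},
    {"SeriesInstanceUID": "uid-2", "Nifti": float("nan")},
    {"SeriesInstanceUID": "uid-3", "Nifti": ""},
]
print(unconverted_series(example_records))
```

Rows without a `SeriesType` may be left unconverted by design, so an entry in this list is a prompt to look closer rather than a sign of failure.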
