# DataManager Tutorial: Introduction to the DataManager class

@Author : [MEDomics consortium](https://github.com/medomics/)

@Email : medomics.info@gmail.com


**STATEMENT**:
This file is part of <https://github.com/MEDomics/MEDomicsLab/>,
a package providing PYTHON programming tools for radiomics analysis.
--> Copyright (C) MEDomicsLab consortium.

This package is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.

This package is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
GNU General Public License for more details.

You should have received a copy of the GNU General Public License
along with this package.  If not, see <http://www.gnu.org/licenses/>.

## Introduction


This notebook is a tutorial for the *DataManager* class. It gives a detailed introduction to this Python class and explains how to use it. The *DataManager* class is the main object used in the *MEDimage* package to process raw data in NIfTI and DICOM formats. This class can:
 - Create ``MEDimage`` class objects from the raw data and make these objects easy to manipulate.
 - Help find the proper dimension and re-segmentation range options for radiomics analysis by running some pre-computation checks.

In this tutorial, we will go through all this step by step and help you learn everything you need about the *DataManager* class.

The ``DataManager`` class is used in one of the first steps of the radiomics analysis workflow, because it creates the ``MEDimage`` class objects, which are the main asset used in the *MEDimage* package.

<img src="images/MEDimageFlowchart.png"/>

Let's take a look inside the ``DataManager`` box:

<img src="images/DataManager-workflow.png"/>

Make sure your folder structure looks like the figure below. It is the same as the repository structure, but you need to download the imaging data and place it in *DICOM-STS*:
<img src="images/DataManagerFolderStructure.png"/>

Imports

In [1]:
import os
import sys

MODULE_DIR = os.path.dirname(os.path.abspath('../MEDimage/'))
sys.path.append(os.path.dirname(MODULE_DIR))

import MEDimage

## Dataset

The data used here consists of 204 scans of different types (MR, PET, CT), with different contours (regions of interest, ROIs) of soft tissue sarcoma (STS). Every dataset should come with a CSV file. In our case, the dataset is linked to 4 CSV files, each containing information about the ROI we are dealing with for each scan:
 - **GTV mass**: Contains scans with GTV mass contours.
 - **GTV edema**: Contains scans with GTV edema contours.
 - **GTV ring**: Contains scans with GTV mass contours included but GTV edema contours excluded.
 - **GTV**: A more general CSV containing all scans with combined ROI contours.

You can find the CSV files for this dataset in the *CSV* folder of the repository. We will start by loading these files and taking a look at them before going deeper into the ``DataManager`` class.

In [2]:
import pandas

from pathlib import Path

path_csv_mass = Path(os.getcwd()) / "CSV" / "roiNames_GTVmass.csv"
path_csv_edema = Path(os.getcwd()) / "CSV" / "roiNames_GTVedema.csv"
path_csv_ring = Path(os.getcwd()) / "CSV" / "roiNames_GTVring.csv"
path_csv = Path(os.getcwd()) / "CSV" / "roiNames_GTV.csv"

pandas.read_csv(path_csv_mass)

Unnamed: 0,PatientID,ImagingScanName,ImagingModality,ROIname
0,STS-McGill-001,T1,MRscan,{GTV_Mass}
1,STS-McGill-001,CT,CTscan,{GTV_Mass}
2,STS-McGill-001,PET,PTscan,{GTV_Mass}
3,STS-McGill-001,T2FS,MRscan,{GTV_Mass}
4,STS-McGill-002,T1,MRscan,{GTV_Mass}
...,...,...,...,...
199,STS-McGill-050,T2FS,MRscan,{GTV_Mass}
200,STS-McGill-051,T1,MRscan,{GTV_Mass}
201,STS-McGill-051,CT,CTscan,{GTV_Mass}
202,STS-McGill-051,PET,PTscan,{GTV_Mass}


In [3]:
pandas.read_csv(path_csv_edema)

Unnamed: 0,PatientID,ImagingScanName,ImagingModality,ROIname
0,STS-McGill-001,T1,MRscan,{GTV_Edema}
1,STS-McGill-001,CT,CTscan,{GTV_Edema}
2,STS-McGill-001,PET,PTscan,{GTV_Edema}
3,STS-McGill-001,T2FS,MRscan,{GTV_Edema}
4,STS-McGill-002,T1,MRscan,{GTV_Edema}
...,...,...,...,...
123,STS-McGill-048,T2FS,MRscan,{GTV_Edema}
124,STS-McGill-050,T1,MRscan,{GTV_Edema}
125,STS-McGill-050,CT,CTscan,{GTV_Edema}
126,STS-McGill-050,PET,PTscan,{GTV_Edema}


In [4]:
pandas.read_csv(path_csv_ring)

Unnamed: 0,PatientID,ImagingScanName,ImagingModality,ROIname
0,STS-McGill-001,T1,MRscan,{GTV_Mass}-{GTV_Edema}
1,STS-McGill-001,CT,CTscan,{GTV_Mass}-{GTV_Edema}
2,STS-McGill-001,PET,PTscan,{GTV_Mass}-{GTV_Edema}
3,STS-McGill-001,T2FS,MRscan,{GTV_Mass}-{GTV_Edema}
4,STS-McGill-002,T1,MRscan,{GTV_Mass}-{GTV_Edema}
...,...,...,...,...
123,STS-McGill-048,T2FS,MRscan,{GTV_Mass}-{GTV_Edema}
124,STS-McGill-050,T1,MRscan,{GTV_Mass}-{GTV_Edema}
125,STS-McGill-050,CT,CTscan,{GTV_Mass}-{GTV_Edema}
126,STS-McGill-050,PET,PTscan,{GTV_Mass}-{GTV_Edema}


In general, the CSV is a summary of the scans in the dataset, organized as follows:
- **PatientID**: ID of the patient or scan.
- **ImagingScanName**: Imaging scan name, also known as series_description in DICOM headers.
- **ImagingModality**: Imaging modality (MRscan, PTscan or CTscan).
- **ROIname**: Names of the ROIs that will be processed and extracted in the analysis.
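
To make the column layout concrete, here is a minimal sketch that reads a few rows copied from the ring table above with Python's standard ``csv`` module and counts scans per modality. The ``split_roi_expression`` helper is hypothetical: it only assumes, based on the GTV ring description, that a name preceded by ``-`` is excluded. It is not the parser used by *MEDimage*.

```python
import csv
import io
import re
from collections import Counter

# A few rows copied from the ring CSV shown above, kept inline so the sketch is self-contained.
CSV_TEXT = """PatientID,ImagingScanName,ImagingModality,ROIname
STS-McGill-001,T1,MRscan,{GTV_Mass}-{GTV_Edema}
STS-McGill-001,CT,CTscan,{GTV_Mass}-{GTV_Edema}
STS-McGill-001,PET,PTscan,{GTV_Mass}-{GTV_Edema}
STS-McGill-001,T2FS,MRscan,{GTV_Mass}-{GTV_Edema}
"""


def split_roi_expression(expression):
    """Hypothetical helper: split an ROIname expression into included and excluded ROI names.

    Assumption: a name preceded by '-' is excluded, any other name is included.
    """
    included, excluded = [], []
    for sign, name in re.findall(r"([+-]?)\{([^}]*)\}", expression):
        (excluded if sign == "-" else included).append(name)
    return included, excluded


rows = list(csv.DictReader(io.StringIO(CSV_TEXT)))

# Count how many scans each imaging modality has in these rows.
scans_per_modality = Counter(row["ImagingModality"] for row in rows)

# Split the ROI expression of the first row.
included, excluded = split_roi_expression(rows[0]["ROIname"])

print(dict(scans_per_modality))
print("included:", included, "excluded:", excluded)
```

With the real files, the same counting can be applied to the rows read from ``path_csv_ring`` or any of the other CSV paths.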

## DataManager initialization

As mentioned before, we are going to use the ``DataManager`` class to create ``MEDimage`` objects from raw data. We will use DICOM files only, but the process is the same for the NIfTI format. We only need the path to the data folder (the folder that holds all the data). The class diagram below shows the different attributes and methods. Worth noting are:
- ``process_all`` method: processes all types of data (NIfTI and DICOM).
- ``process_all_dicoms`` method: processes DICOM data only, using the path to the DICOM data (given at initialization).
- ``process_all_niftis`` method: processes NIfTI data only, using the path to the NIfTI data (given at initialization).
- ``save`` attribute: if ``True``, the created ``MEDimage`` objects are saved locally.
- ``instances`` attribute: holds the created ``MEDimage`` objects (10 objects maximum).

<img src="images/DataManagerClassDiagram.png"/>

We will go through all the functionalities of the ``DataManager`` class. For more details about the class, please refer to the [DataManager documentation](https://medimage.readthedocs.io/en/documentation/wrangling.html#module-MEDimage.wrangling.DataManager).

### Initialization

The minimum information needed for initialization is a path to the raw data folder; we can then instantiate the ``DataManager`` class from the ``wrangling`` sub-package. We will also provide some additional information:
 - ``path_save``: path to where the results will be saved.
 - ``path_csv``: path to the CSV file describing the dataset.
 - ``save``: ``True`` to save the created ``MEDimage`` objects (if you have more than 10 scans in your dataset, you must set it to ``True``).
 - ``n_batch``: number of batches to use in the parallel computations (use 0 for serial computation).
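
To give an intuition for what a number of batches means, the sketch below splits a list of 204 placeholder scan IDs into ``n_batch`` contiguous groups. ``split_into_batches`` is a hypothetical helper written for this illustration; how *MEDimage* actually distributes the work across batches is not shown in this tutorial and may differ.

```python
def split_into_batches(items, n_batch):
    """Hypothetical helper: split items into n_batch roughly equal, contiguous batches.

    Assumption: n_batch <= 0 means serial processing, i.e. a single batch.
    """
    items = list(items)
    if n_batch <= 0:
        return [items]
    size, remainder = divmod(len(items), n_batch)
    batches, start = [], 0
    for i in range(n_batch):
        # The first `remainder` batches receive one extra item.
        end = start + size + (1 if i < remainder else 0)
        batches.append(items[start:end])
        start = end
    return batches


# Placeholder IDs standing in for the 204 scans of the dataset.
scan_ids = [f"scan-{i:03d}" for i in range(204)]
batches = split_into_batches(scan_ids, n_batch=2)
print([len(batch) for batch in batches])
```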

In [5]:
path_dicoms_data = Path(os.getcwd()) / "data" / "DICOM-STS"
path_save = Path(os.getcwd()) / "data" / "npy"
dm = MEDimage.wrangling.DataManager(path_to_dicoms=path_dicoms_data,
                                    path_save=path_save,
                                    path_csv=path_csv,
                                    save=True, 
                                    n_batch=2)

We have now initialized the ``DataManager``, and we can process our data by calling ``process_all_dicoms`` for DICOM data, ``process_all_niftis`` for NIfTI data or ``process_all`` for both. These methods return a list of instances for datasets with 10 scans or fewer (for memory considerations). In our case we have 204 scans, so the scans will be saved locally.

In [6]:
dm.process_all_dicoms()

2022-08-09 15:14:20,610	INFO services.py:1090 -- View the Ray dashboard at http://127.0.0.1:8268


--> Reading all DICOM objects to create MEDimage classes

--> Scanning all folders in initial directory...

100%|██████████| 34928/34928 [00:26<00:00, 1308.52it/s]


DONE
--> Associating all RT objects to imaging volumes


100%|██████████| 204/204 [00:00<00:00, 124185.49it/s]


DONE


100%|██████████| 204/204 [14:34<00:00,  4.29s/it]

DONE





[<MEDimage.MEDimage.MEDimage at 0x7fb4548c39d0>,
 <MEDimage.MEDimage.MEDimage at 0x7fb36cf3d190>,
 <MEDimage.MEDimage.MEDimage at 0x7fb36cdf4610>,
 <MEDimage.MEDimage.MEDimage at 0x7fb36c08d160>,
 <MEDimage.MEDimage.MEDimage at 0x7fb36ebab940>,
 <MEDimage.MEDimage.MEDimage at 0x7fb00fe7b160>,
 <MEDimage.MEDimage.MEDimage at 0x7fb36da25ac0>,
 <MEDimage.MEDimage.MEDimage at 0x7fb036c32d60>,
 <MEDimage.MEDimage.MEDimage at 0x7fb006037070>,
 <MEDimage.MEDimage.MEDimage at 0x7fb36ec13c10>]

After this is done, we recommend calling the ``summarize`` method, which gives a summary of the processed scans based on the CSV file you are using for your data.

In [7]:
dm.summarize()

| study   | institution   | scan_type   | roi_type   |   count |
|:--------|:--------------|:------------|:-----------|--------:|
| STS     |               |             |            |     204 |
| STS     | McGill        |             |            |     204 |
| STS     | McGill        | PET         |            |      51 |
| STS     | McGill        | CT          |            |      51 |
| STS     | McGill        | T2FS        |            |      51 |
| STS     | McGill        | T1          |            |      51 |


You can also change the dataset CSV file and update your ``DataManager`` by calling the ``update_from_csv`` method, which takes the path to your new CSV file as an argument and automatically calls ``summarize`` at the end.

GTV Mass

In [8]:
dm.update_from_csv(path_csv_mass)

| study   | institution   | scan_type   | roi_type   |   count |
|:--------|:--------------|:------------|:-----------|--------:|
| STS     |               |             |            |     204 |
| STS     | McGill        |             |            |     204 |
| STS     | McGill        | PET         |            |      51 |
| STS     | McGill        | PET         | GTVmass    |      51 |
| STS     | McGill        | CT          |            |      51 |
| STS     | McGill        | CT          | GTVmass    |      51 |
| STS     | McGill        | T2FS        |            |      51 |
| STS     | McGill        | T2FS        | GTVmass    |      51 |
| STS     | McGill        | T1          |            |      51 |
| STS     | McGill        | T1          | GTVmass    |      51 |


GTV Edema

In [9]:
dm.update_from_csv(path_csv_edema)

| study   | institution   | scan_type   | roi_type   |   count |
|:--------|:--------------|:------------|:-----------|--------:|
| STS     |               |             |            |     204 |
| STS     | McGill        |             |            |     204 |
| STS     | McGill        | PET         |            |      51 |
| STS     | McGill        | PET         | GTVedema   |      32 |
| STS     | McGill        | CT          |            |      51 |
| STS     | McGill        | CT          | GTVedema   |      32 |
| STS     | McGill        | T2FS        |            |      51 |
| STS     | McGill        | T2FS        | GTVedema   |      32 |
| STS     | McGill        | T1          |            |      51 |
| STS     | McGill        | T1          | GTVedema   |      32 |


GTV Ring

In [10]:
dm.update_from_csv(path_csv_ring)

| study   | institution   | scan_type   | roi_type   |   count |
|:--------|:--------------|:------------|:-----------|--------:|
| STS     |               |             |            |     204 |
| STS     | McGill        |             |            |     204 |
| STS     | McGill        | PET         |            |      51 |
| STS     | McGill        | PET         | GTVring    |      32 |
| STS     | McGill        | CT          |            |      51 |
| STS     | McGill        | CT          | GTVring    |      32 |
| STS     | McGill        | T2FS        |            |      51 |
| STS     | McGill        | T2FS        | GTVring    |      32 |
| STS     | McGill        | T1          |            |      51 |
| STS     | McGill        | T1          | GTVring    |      32 |


We now have all scans converted to ``MEDimage`` objects and saved locally in ``path_save``.
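
The summary tables above group scans by study, institution and scan type. As a rough illustration of that grouping, the sketch below splits each ``PatientID`` on the study-institution-id format and counts scans at each level. ``summarize_scans`` is a hypothetical helper and the scan list is synthetic (51 patient IDs with four scans each, mirroring the counts shown above). It is not the implementation behind ``summarize``.

```python
from collections import Counter


def summarize_scans(scans):
    """Hypothetical helper: count scans per study, per institution and per scan name.

    Assumption: every PatientID follows the study-institution-id format.
    """
    counts = Counter()
    for patient_id, scan_name in scans:
        study, institution, _ = patient_id.split("-", 2)
        counts[(study, "", "")] += 1
        counts[(study, institution, "")] += 1
        counts[(study, institution, scan_name)] += 1
    return counts


# Synthetic scan list: 51 patient IDs, each with the four scan names of the dataset.
scans = [
    (f"STS-McGill-{i:03d}", scan_name)
    for i in range(1, 52)
    for scan_name in ("T1", "CT", "PET", "T2FS")
]

counts = summarize_scans(scans)
for (study, institution, scan_name), count in counts.items():
    print(f"{study:<6}{institution:<8}{scan_name:<6}{count:>4}")
```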

## DataManager radiomics pre-checks

We have seen the first use of the ``DataManager`` class: creating ``MEDimage`` objects from raw data. We will now use it to run a pre-analysis of a certain group of scans. These pre-checks help find the proper dimension and re-segmentation range options for radiomics analysis. As seen in the class diagram above, the class provides methods responsible for the pre-checks analysis:
- ``pre_radiomics_checks()`` runs radiomics checks to find proper dimension and re-segmentation range options.
- More methods for imaging summary will be added soon.

Using ``pre_radiomics_checks()`` needs a little attention because it requires more arguments than the methods we have used so far. We can either add a JSON settings file to our ``DataManager`` instance or set the parameters directly (our case). The parameters we need to set are the following:
 - ``wildcards_dimensions``: Wildcards for the dimensions check ([read more about wildcards](https://www.linuxtechtips.com/2013/11/how-wildcards-work-in-linux-and-unix.html)). Wildcards are used here to identify the scans we would like to pre-process by scan type, study, institution, etc. For example, a wildcard like ``'STS*.MRscan.npy'`` means we will only pre-process MR scans from STS studies. You can set a different wildcard for each pre-processing step: one for the dimensions check and another for the window check. We will set it to ``'STS*.MRscan.npy'`` to process only the MR scans of the STS study (all of which come from the McGill institution in this dataset).
 - ``wildcards_window``: Same as ``wildcards_dimensions``, but for the window check.
 - ``use_instances``: If ``True``, uses the instances/objects of the ``MEDimage`` class held in the ``DataManager`` instance. We will set it to ``False`` since we saved all our objects locally.
 - ``path_csv``: Path to the CSV file with ROI information (check the example in the repository).
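
As a rough illustration of how such a wildcard selects files, the sketch below uses ``fnmatchcase`` from Python's standard ``fnmatch`` module. The file names are hypothetical (assumed to follow a ``PatientID__ScanName.Modality.npy`` pattern), ``select_files`` is an illustrative helper rather than part of *MEDimage*, and the matching the package performs internally may differ.

```python
from fnmatch import fnmatchcase

# Hypothetical file names, assumed to follow a PatientID__ScanName.Modality.npy pattern.
saved_files = [
    "STS-McGill-001__T1.MRscan.npy",
    "STS-McGill-001__CT.CTscan.npy",
    "STS-McGill-001__PET.PTscan.npy",
    "STS-McGill-001__T2FS.MRscan.npy",
    "Glioma-TCGA-001__T1.MRscan.npy",
]


def select_files(file_names, wildcards):
    """Illustrative helper: keep the names that match at least one wildcard (case-sensitive)."""
    return [
        name for name in file_names
        if any(fnmatchcase(name, wildcard) for wildcard in wildcards)
    ]


# Only the MR scans of the STS study are kept.
selected = select_files(saved_files, ["STS*.MRscan.npy"])
print(selected)
```

Trying a candidate wildcard on a handful of file names this way is a cheap check before launching the pre-checks on the whole dataset.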

In [11]:
wildcards_dimensions = ['STS*.MRscan.npy']  # dimensions analysis wildcard
wildcards_window = ['STS*.MRscan.npy']  # windows ranges analysis wildcard
use_instances = False  # instances are saved locally and cannot be used on the fly

We now have everything we need to run our pre-checks. All that is left is to call the method and pass the right arguments.

In [12]:
dm.pre_radiomics_checks(wildcards_dimensions=wildcards_dimensions, 
                        wildcards_window=wildcards_window, 
                        use_instances=use_instances,
                        path_csv=path_csv_mass)



************************* PRE-RADIOMICS CHECKS *************************
--> PRE-RADIOMICS CHECKS -- DIMENSIONS ... 

100%|██████████| 102/102 [00:04<00:00, 20.75it/s]


DONE
Elapsed time: 4.92 sec

--> PRE-RADIOMICS CHECKS -- WINDOW ... 


100%|██████████| 102/102 [00:09<00:00, 11.27it/s]


DONE
Elapsed time: 17.23 sec

--> TOTAL TIME FOR PRE-RADIOMICS CHECKS: 22.15 seconds
-------------------------------------------------------------------------------------


The results should now be saved locally as JSON files in a folder called *checks*.
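
One possible way to inspect these results is to load every JSON file of the folder with the standard ``json`` module. The sketch below is a minimal illustration: ``load_check_results`` is a hypothetical helper, and the demo runs on a temporary folder with a placeholder file, because the actual file names and content written by the pre-checks are not shown in this tutorial.

```python
import json
import tempfile
from pathlib import Path


def load_check_results(path_checks):
    """Hypothetical helper: load every JSON file of a folder into a dict keyed by file name."""
    results = {}
    for json_file in sorted(Path(path_checks).glob("*.json")):
        with open(json_file, "r", encoding="utf-8") as f:
            results[json_file.name] = json.load(f)
    return results


# Demo on a temporary folder holding a placeholder file.
# Real check files will have different names and content.
with tempfile.TemporaryDirectory() as tmp_dir:
    placeholder = Path(tmp_dir) / "placeholder.json"
    placeholder.write_text('{"example_key": [1, 2, 3]}', encoding="utf-8")
    for file_name, content in load_check_results(tmp_dir).items():
        print(file_name, "->", list(content))
```

To use it on real results, pass the path of your own *checks* folder instead of the temporary one.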

### Common errors to avoid:

- Bad dataset management: 
  - NIfTI: Make sure all your NIfTI files respect the file naming convention:
    - Imaging volume: ``patientID__ImagingScanName(roiTypeLabel).ScanType.nii.gz``. For example *Glioma-TCGA-001__T1(tumorAuto).MRscan.nii.gz*
    - ROI mask: ``patientID__ImagingScanName(roiName).ROI.nii.gz``. For example *Glioma-TCGA-001__T1(NET).ROI.nii.gz*
  - DICOM: Make sure the PatientID in the DICOM headers follows the format study-institution-id *(e.g. Glioma-TCGA-001)*.
- No files found: make sure you have used the right names in the wildcard. If you saved your files locally, they should be available at ``path_save`` and the code will find them automatically.
- Same scans in different paths: make sure you don't have the same scans (same filenames) in different paths (for example in a sub-folder).
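
Two of these checks can be automated before processing. The sketch below is a minimal illustration with hypothetical helpers: the regular expressions are derived only from the two example file names above (a corrected ``tumorAuto`` spelling is used), so they may be stricter or looser than what *MEDimage* actually accepts, and the paths in the duplicate check are made up.

```python
import re
from collections import defaultdict
from pathlib import PurePosixPath

# PatientID part: study-institution-id, followed by a double underscore.
PATIENT_ID = r"(?P<study>[^-_]+)-(?P<institution>[^-_]+)-(?P<id>[^-_]+)__"

# Imaging volume, e.g. Glioma-TCGA-001__T1(tumorAuto).MRscan.nii.gz
IMAGE_PATTERN = re.compile(
    "^" + PATIENT_ID
    + r"(?P<scan_name>[^().]+)\((?P<roi_label>[^()]+)\)"
    + r"\.(?P<scan_type>\w+scan)\.nii\.gz$"
)

# ROI mask, e.g. Glioma-TCGA-001__T1(NET).ROI.nii.gz
ROI_PATTERN = re.compile(
    "^" + PATIENT_ID
    + r"(?P<scan_name>[^().]+)\((?P<roi_name>[^()]+)\)\.ROI\.nii\.gz$"
)


def classify_nifti_name(file_name):
    """Hypothetical helper: return 'image', 'roi' or 'invalid' for a NIfTI file name."""
    if IMAGE_PATTERN.match(file_name):
        return "image"
    if ROI_PATTERN.match(file_name):
        return "roi"
    return "invalid"


def find_duplicate_names(paths):
    """Hypothetical helper: map each file name found under several paths to those paths."""
    paths_by_name = defaultdict(list)
    for path in paths:
        paths_by_name[PurePosixPath(path).name].append(path)
    return {name: found for name, found in paths_by_name.items() if len(found) > 1}


print(classify_nifti_name("Glioma-TCGA-001__T1(tumorAuto).MRscan.nii.gz"))
print(classify_nifti_name("Glioma-TCGA-001__T1(NET).ROI.nii.gz"))
print(classify_nifti_name("Glioma_TCGA_001_T1.nii.gz"))

# Made-up paths: the same file name appears in a folder and in one of its sub-folders.
example_paths = [
    "data/NIfTI/Glioma-TCGA-001__T1(NET).ROI.nii.gz",
    "data/NIfTI/backup/Glioma-TCGA-001__T1(NET).ROI.nii.gz",
    "data/NIfTI/Glioma-TCGA-002__T1(NET).ROI.nii.gz",
]
print(find_duplicate_names(example_paths))
```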