# Converting Data

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/AustralianCancerDataNetwork/pydicer/blob/main/examples/ConvertingData.ipynb)

This example demonstrates how to preprocess and convert DICOM data. These are
essential first steps before data can be analysed using PyDicer.

In [1]:
try:
    from pydicer import PyDicer
except ImportError:
    !pip install pydicer
    from pydicer import PyDicer

from pathlib import Path

from pydicer.input.test import TestInput

## Setup PyDicer

As in the [Getting Started example](https://australiancancerdatanetwork.github.io/pydicer/_examples/GettingStarted.html), we must first define a working directory for our dataset. We
also create a `PyDicer` object.

In [2]:
directory = Path("./working")
pydicer = PyDicer(directory)

## Fetch some data

A [TestInput class](https://australiancancerdatanetwork.github.io/pydicer/input.html#pydicer.input.test.TestInput) is provided in PyDicer to download some sample data to work with. Several other
input classes exist if you'd like to retrieve DICOM data for conversion from somewhere else. See
the [docs for information](https://australiancancerdatanetwork.github.io/pydicer/html/input.html)
on how the PyDicer input classes work.

Most commonly, if you have DICOM files stored in a folder on your file system, you can simply
pass the path to that DICOM directory to the `pydicer.add_input()` method.

In [3]:
dicom_directory = directory.joinpath("dicom")
test_input = TestInput(dicom_directory)
test_input.fetch_data()

# Add the input DICOM location to the pydicer object
pydicer.add_input(dicom_directory)

## Preprocess

With some DICOM data ready to work with, we must first use the PyDicer [preprocess module](https://australiancancerdatanetwork.github.io/pydicer/preprocess.html). This
module crawls all available DICOM data and indexes the information required to convert
the data.

In [4]:
pydicer.preprocess()

100%|██████████| 1309/1309 [00:03<00:00, 413.80files/s, preprocess]
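To give a feel for what "crawling" a directory for DICOM data can involve, here is a minimal sketch using only the standard library. It is not PyDicer's implementation: it simply relies on the fact that DICOM Part 10 files carry the four bytes `DICM` after a 128-byte preamble. The function names `has_dicom_marker` and `crawl_for_dicom` are hypothetical, and some legacy DICOM files lack the preamble and would be missed by this check.

```python
from __future__ import annotations

import tempfile
from pathlib import Path


def has_dicom_marker(path: Path) -> bool:
    """Return True if the file has the 'DICM' marker at byte offset 128."""
    try:
        with path.open("rb") as f:
            f.seek(128)
            return f.read(4) == b"DICM"
    except OSError:
        # Unreadable files are treated as non-DICOM in this sketch.
        return False


def crawl_for_dicom(root: Path) -> list[Path]:
    """Recursively collect files under root that look like DICOM Part 10 files."""
    return sorted(p for p in root.rglob("*") if p.is_file() and has_dicom_marker(p))


# Demonstration on a throwaway directory holding one fake DICOM file and one text file.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "series").mkdir()
    (root / "series" / "slice.dcm").write_bytes(b"\x00" * 128 + b"DICM")
    (root / "notes.txt").write_text("not DICOM")
    found_names = [p.name for p in crawl_for_dicom(root)]

print(found_names)  # → ['slice.dcm']
```

In practice, `pydicer.preprocess()` does this work for you and also reads the DICOM headers needed for the index.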




### Inspect Preprocessed Data

Here we load the data that was indexed during preprocessing and output the first rows. The
conversion step that follows uses this index.

In [5]:
df_preprocessed = pydicer.read_preprocessed_data()
df_preprocessed.head()

Unnamed: 0,patient_id,study_uid,series_uid,modality,sop_class_uid,sop_instance_uid,for_uid,file_path,slice_location,referenced_uid,referenced_for_uid
1185,HNSCC-01-0019,1.3.6.1.4.1.14519.5.2.1.1706.8040.797724702538...,1.3.6.1.4.1.14519.5.2.1.1706.8040.233510441938...,CT,1.2.840.10008.5.1.4.1.1.2,1.3.6.1.4.1.14519.5.2.1.1706.8040.418136430763...,1.3.6.1.4.1.14519.5.2.1.1706.8040.290727775603...,working/dicom/HNSCC/HNSCC-01-0019/07-04-1998-N...,-807.0,,
1242,HNSCC-01-0019,1.3.6.1.4.1.14519.5.2.1.1706.8040.797724702538...,1.3.6.1.4.1.14519.5.2.1.1706.8040.233510441938...,CT,1.2.840.10008.5.1.4.1.1.2,1.3.6.1.4.1.14519.5.2.1.1706.8040.206018114826...,1.3.6.1.4.1.14519.5.2.1.1706.8040.290727775603...,working/dicom/HNSCC/HNSCC-01-0019/07-04-1998-N...,-804.0,,
1186,HNSCC-01-0019,1.3.6.1.4.1.14519.5.2.1.1706.8040.797724702538...,1.3.6.1.4.1.14519.5.2.1.1706.8040.233510441938...,CT,1.2.840.10008.5.1.4.1.1.2,1.3.6.1.4.1.14519.5.2.1.1706.8040.100785615013...,1.3.6.1.4.1.14519.5.2.1.1706.8040.290727775603...,working/dicom/HNSCC/HNSCC-01-0019/07-04-1998-N...,-801.0,,
1209,HNSCC-01-0019,1.3.6.1.4.1.14519.5.2.1.1706.8040.797724702538...,1.3.6.1.4.1.14519.5.2.1.1706.8040.233510441938...,CT,1.2.840.10008.5.1.4.1.1.2,1.3.6.1.4.1.14519.5.2.1.1706.8040.113351005230...,1.3.6.1.4.1.14519.5.2.1.1706.8040.290727775603...,working/dicom/HNSCC/HNSCC-01-0019/07-04-1998-N...,-798.0,,
1230,HNSCC-01-0019,1.3.6.1.4.1.14519.5.2.1.1706.8040.797724702538...,1.3.6.1.4.1.14519.5.2.1.1706.8040.233510441938...,CT,1.2.840.10008.5.1.4.1.1.2,1.3.6.1.4.1.14519.5.2.1.1706.8040.112029189313...,1.3.6.1.4.1.14519.5.2.1.1706.8040.290727775603...,working/dicom/HNSCC/HNSCC-01-0019/07-04-1998-N...,-795.0,,
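Each row of the index above describes a single DICOM file, so a conversion step has to gather the rows that belong to one series and order them before building an image volume. The sketch below illustrates that idea with plain Python. The rows are hypothetical stand-ins that use a subset of the column names shown above, and `group_slices_by_series` is an illustrative helper, not part of PyDicer.

```python
from collections import defaultdict

# Hypothetical stand-in rows (shortened, made-up UIDs) using index column names.
rows = [
    {"series_uid": "1.2.3", "sop_instance_uid": "1.2.3.2", "slice_location": -804.0},
    {"series_uid": "1.2.3", "sop_instance_uid": "1.2.3.1", "slice_location": -807.0},
    {"series_uid": "1.2.4", "sop_instance_uid": "1.2.4.1", "slice_location": 0.0},
    {"series_uid": "1.2.3", "sop_instance_uid": "1.2.3.3", "slice_location": -801.0},
]


def group_slices_by_series(index_rows):
    """Group index rows by series_uid and sort each group by slice_location."""
    series = defaultdict(list)
    for row in index_rows:
        series[row["series_uid"]].append(row)
    for slices in series.values():
        slices.sort(key=lambda r: r["slice_location"])
    return dict(series)


grouped = group_slices_by_series(rows)
for series_uid, slices in grouped.items():
    print(series_uid, [s["slice_location"] for s in slices])
```

With the real DataFrame, the equivalent operation would be a group-by on `series_uid` followed by a sort on `slice_location`.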


## Convert Data

Now that the DICOM data has been indexed during preprocessing, we are ready to convert it
into NIfTI format, which will be stored within the PyDicer standard directory structure.

Running the following cell will begin the conversion process. While this cell is running, take a
look inside the `working/data` directory to see how the converted data is being stored.

Notice the `converted.csv` file stored for each patient, which tracks each converted data object.
PyDicer loads this file as a [Pandas DataFrame](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html) for use throughout the library.


In [6]:
pydicer.convert.convert()

ImageSeriesReader (0x5624d21c1b90): Non uniform sampling or missing slices detected,  maximum nonuniformity:0.000641026

100%|██████████| 21/21 [04:03<00:00, 11.58s/objects, convert]
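Once conversion has finished, the per-patient `converted.csv` files can also be inspected directly. The following is a rough sketch, assuming each file sits immediately inside its patient folder (for example `working/data/<patient_id>/converted.csv`); that layout is an assumption made for illustration, and `read_all_converted` is a hypothetical helper. The supported way to load this information is `pydicer.read_converted_data()`.

```python
import csv
import tempfile
from pathlib import Path


def read_all_converted(data_dir: Path):
    """Concatenate the rows of every <patient>/converted.csv found under data_dir."""
    rows = []
    for csv_path in sorted(data_dir.glob("*/converted.csv")):
        with csv_path.open(newline="") as f:
            rows.extend(csv.DictReader(f))
    return rows


# Demonstration with two made-up patients in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    data_dir = Path(tmp) / "data"
    for patient_id, hashed_uid in [("PAT-A", "aaaaaa"), ("PAT-B", "bbbbbb")]:
        patient_dir = data_dir / patient_id
        patient_dir.mkdir(parents=True)
        with (patient_dir / "converted.csv").open("w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=["hashed_uid", "modality", "patient_id"])
            writer.writeheader()
            writer.writerow({"hashed_uid": hashed_uid, "modality": "CT", "patient_id": patient_id})
    all_rows = read_all_converted(data_dir)

print([row["patient_id"] for row in all_rows])  # → ['PAT-A', 'PAT-B']
```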




### Load Converted DataFrame

Once data is converted, we can load a Pandas DataFrame which contains a description of each object
converted.

The most useful columns in the DataFrame for working with this data in PyDicer are:
- `hashed_uid`: This is a 6-character hexadecimal hash of the associated DICOM SeriesInstanceUID.
  PyDicer refers to objects using this hashed identifier for a more concise representation.
- `modality`: The modality of the data object.
- `patient_id`: The ID of the patient this data object belongs to.
- `path`: The path within the working directory where files for this data object are stored.
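To illustrate how a long UID can be reduced to a short hexadecimal identifier like `hashed_uid`, here is a small sketch. It assumes a SHA-256 digest truncated to six characters; the hash function PyDicer actually uses is not stated here, so treat both the algorithm and the helper name `short_hash` as assumptions.

```python
import hashlib


def short_hash(uid: str, length: int = 6) -> str:
    """Return the first `length` hex characters of the SHA-256 digest of `uid`."""
    return hashlib.sha256(uid.encode("utf-8")).hexdigest()[:length]


example_uid = "1.2.840.99999.1.2.3"  # hypothetical UID, not from the test dataset
print(short_hash(example_uid))
```

A six-character hexadecimal string has about 16.7 million possible values, which is compact to read but not guaranteed to be unique across very large collections.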

In [7]:
df = pydicer.read_converted_data()
df

Unnamed: 0,sop_instance_uid,hashed_uid,modality,patient_id,series_uid,for_uid,referenced_sop_instance_uid,path
0,1.3.6.1.4.1.14519.5.2.1.1706.8040.240263316258...,72b0f9,CT,HNSCC-01-0199,1.3.6.1.4.1.14519.5.2.1.1706.8040.261759476368...,1.3.6.1.4.1.14519.5.2.1.1706.8040.870916135819...,,working/data/HNSCC-01-0199/images/72b0f9
1,1.3.6.1.4.1.14519.5.2.1.1706.8040.264264397186...,c16e76,RTDOSE,HNSCC-01-0199,1.3.6.1.4.1.14519.5.2.1.1706.8040.233527028792...,1.3.6.1.4.1.14519.5.2.1.1706.8040.870916135819...,1.3.6.1.4.1.14519.5.2.1.1706.8040.287865632112...,working/data/HNSCC-01-0199/doses/c16e76
2,1.3.6.1.4.1.14519.5.2.1.1706.8040.287865632112...,664e96,RTPLAN,HNSCC-01-0199,1.3.6.1.4.1.14519.5.2.1.1706.8040.137463901488...,1.3.6.1.4.1.14519.5.2.1.1706.8040.870916135819...,1.3.6.1.4.1.14519.5.2.1.1706.8040.166429645421...,working/data/HNSCC-01-0199/plans/664e96
3,1.3.6.1.4.1.14519.5.2.1.1706.8040.166429645421...,06e49c,RTSTRUCT,HNSCC-01-0199,1.3.6.1.4.1.14519.5.2.1.1706.8040.243934637013...,1.3.6.1.4.1.14519.5.2.1.1706.8040.870916135819...,1.3.6.1.4.1.14519.5.2.1.1706.8040.240263316258...,working/data/HNSCC-01-0199/structures/06e49c
4,1.3.6.1.4.1.14519.5.2.1.1706.8040.334001018535...,c4ffd0,CT,HNSCC-01-0176,1.3.6.1.4.1.14519.5.2.1.1706.8040.151938046710...,1.3.6.1.4.1.14519.5.2.1.1706.8040.120880328745...,,working/data/HNSCC-01-0176/images/c4ffd0
5,1.3.6.1.4.1.14519.5.2.1.1706.8040.107072817915...,8e0da9,CT,HNSCC-01-0176,1.3.6.1.4.1.14519.5.2.1.1706.8040.176143398282...,1.3.6.1.4.1.14519.5.2.1.1706.8040.216161306702...,,working/data/HNSCC-01-0176/images/8e0da9
6,1.3.6.1.4.1.14519.5.2.1.1706.8040.133948865586...,ec4aec,CT,HNSCC-01-0176,1.3.6.1.4.1.14519.5.2.1.1706.8040.192899726585...,1.3.6.1.4.1.14519.5.2.1.1706.8040.216161306702...,,working/data/HNSCC-01-0176/images/ec4aec
7,1.3.6.1.4.1.14519.5.2.1.1706.8040.469610481459...,33c44a,CT,HNSCC-01-0176,1.3.6.1.4.1.14519.5.2.1.1706.8040.244362210503...,1.3.6.1.4.1.14519.5.2.1.1706.8040.310630617866...,,working/data/HNSCC-01-0176/images/33c44a
8,1.3.6.1.4.1.14519.5.2.1.1706.8040.169033525924...,833a74,RTDOSE,HNSCC-01-0176,1.3.6.1.4.1.14519.5.2.1.1706.8040.279793773343...,1.3.6.1.4.1.14519.5.2.1.1706.8040.706719210726...,1.3.6.1.4.1.14519.5.2.1.1706.8040.470253980284...,working/data/HNSCC-01-0176/doses/833a74
9,1.3.6.1.4.1.14519.5.2.1.1706.8040.267291308489...,bf3fba,RTDOSE,HNSCC-01-0176,1.3.6.1.4.1.14519.5.2.1.1706.8040.283706688235...,1.3.6.1.4.1.14519.5.2.1.1706.8040.566662631858...,1.3.6.1.4.1.14519.5.2.1.1706.8040.173917268454...,working/data/HNSCC-01-0176/doses/bf3fba
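The `path` column above follows a regular pattern: the patient ID, then a folder that depends on the modality, then the hashed UID. The sketch below reconstructs that pattern. The folder mapping is read off the rows shown in the table and covers only the modalities that appear there; `object_path` is a hypothetical helper, and in normal use you would take the path straight from the DataFrame.

```python
from pathlib import Path

# Folder names as they appear in the `path` column of the table above.
MODALITY_FOLDERS = {
    "CT": "images",
    "RTDOSE": "doses",
    "RTPLAN": "plans",
    "RTSTRUCT": "structures",
}


def object_path(working_dir: Path, patient_id: str, modality: str, hashed_uid: str) -> Path:
    """Build the directory path for a converted object from its DataFrame fields."""
    return working_dir / "data" / patient_id / MODALITY_FOLDERS[modality] / hashed_uid


ct_path = object_path(Path("working"), "HNSCC-01-0199", "CT", "72b0f9")
print(ct_path.as_posix())  # → working/data/HNSCC-01-0199/images/72b0f9
```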


## Data Quarantine

If anything goes wrong while processing a DICOM object during either the preprocess step or the
conversion step, the problematic DICOM data will be copied to the `working/quarantine` directory.

It's a good idea to regularly check your quarantine directory to ensure that no critical data
objects are being quarantined. If they are, you may want to rectify the issue and run the
preprocess and conversion steps again.

As can be seen by running the cell below, several DICOM objects were quarantined while
processing our test dataset. This was due to there being multiple slices at the same location with
differing pixel data in one CT image series.

In [8]:
df_quarantine = pydicer.read_quarantined_data()
df_quarantine

Unnamed: 0,file,error,quarantine_dttm,PatientID,Modality,SOPInstanceUID,SeriesDescription
0,working/dicom/HNSCC/HNSCC-01-0176/03-05-2004-N...,2 slices at location 0.0 containing different ...,2024-12-16 10:45:33.607202,HNSCC-01-0176,CT,1.3.6.1.4.1.14519.5.2.1.1706.8040.258957568007...,SCOUT/NECK-ORAL/NASO W/CON
1,working/dicom/HNSCC/HNSCC-01-0176/03-05-2004-N...,2 slices at location 0.0 containing different ...,2024-12-16 10:45:33.612104,HNSCC-01-0176,CT,1.3.6.1.4.1.14519.5.2.1.1706.8040.181695106907...,SCOUT/NECK-ORAL/NASO W/CON
2,working/dicom/HNSCC/HNSCC-01-0176/03-05-2004-N...,2 slices at location -155.0 containing differe...,2024-12-16 10:45:33.626832,HNSCC-01-0176,CT,1.3.6.1.4.1.14519.5.2.1.1706.8040.308207714344...,BONE
3,working/dicom/HNSCC/HNSCC-01-0176/03-05-2004-N...,2 slices at location -155.0 containing differe...,2024-12-16 10:45:33.632129,HNSCC-01-0176,CT,1.3.6.1.4.1.14519.5.2.1.1706.8040.189167578552...,BONE
4,working/dicom/HNSCC/HNSCC-01-0176/03-05-2004-N...,2 slices at location -155.0 containing differe...,2024-12-16 10:45:33.637284,HNSCC-01-0176,CT,1.3.6.1.4.1.14519.5.2.1.1706.8040.146032766668...,BONE
...,...,...,...,...,...,...,...
607,working/dicom/HNSCC/HNSCC-01-0176/03-05-2004-N...,2 slices at location -155.0 containing differe...,2024-12-16 10:45:38.243995,HNSCC-01-0176,CT,1.3.6.1.4.1.14519.5.2.1.1706.8040.190466192108...,"CONTRAST120CC@3CC/S,90S DELAY"
608,working/dicom/HNSCC/HNSCC-01-0176/03-05-2004-N...,2 slices at location -155.0 containing differe...,2024-12-16 10:45:38.254101,HNSCC-01-0176,CT,1.3.6.1.4.1.14519.5.2.1.1706.8040.209452648754...,"CONTRAST120CC@3CC/S,90S DELAY"
609,working/dicom/HNSCC/HNSCC-01-0176/03-05-2004-N...,2 slices at location -155.0 containing differe...,2024-12-16 10:45:38.264199,HNSCC-01-0176,CT,1.3.6.1.4.1.14519.5.2.1.1706.8040.174557835738...,"CONTRAST120CC@3CC/S,90S DELAY"
610,working/dicom/HNSCC/HNSCC-01-0176/03-05-2004-N...,2 slices at location -155.0 containing differe...,2024-12-16 10:45:38.274353,HNSCC-01-0176,CT,1.3.6.1.4.1.14519.5.2.1.1706.8040.113716820433...,"CONTRAST120CC@3CC/S,90S DELAY"
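The error reported above (several slices at one location containing different pixel data) can be illustrated with a short check. This is a sketch of the idea only, not PyDicer's implementation: slices are represented as hypothetical `(slice_location, pixel_bytes)` pairs, and `find_conflicting_slices` is an illustrative name.

```python
import hashlib
from collections import defaultdict


def find_conflicting_slices(slices):
    """Return slice locations that hold more than one distinct set of pixel data.

    `slices` is an iterable of (slice_location, pixel_bytes) pairs.
    """
    digests = defaultdict(set)
    for location, pixel_bytes in slices:
        digests[location].add(hashlib.sha256(pixel_bytes).hexdigest())
    return sorted(loc for loc, seen in digests.items() if len(seen) > 1)


# Two slices at -155.0 differ; the two slices at -152.0 are exact duplicates.
series = [
    (-155.0, b"\x00\x01"),
    (-155.0, b"\x00\x02"),
    (-152.0, b"\x05"),
    (-152.0, b"\x05"),
]
print(find_conflicting_slices(series))  # → [-155.0]
```

Exact duplicates at one location are harmless in this sketch because they hash to the same digest; only genuinely different pixel data at the same location is flagged.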
