# Working with Data

Here we present some useful tips and tricks for working with data which has been converted
using PyDicer. As you will see, working with data in PyDicer is heavily oriented around DataFrames
provided by the Pandas library. If you aren't familiar with Pandas, we recommend working through
the [Pandas Getting Started Tutorials](https://pandas.pydata.org/docs/getting_started/index.html).

In [1]:
try:
    from pydicer import PyDicer
except ImportError:
    !pip install pydicer
    from pydicer import PyDicer

from pathlib import Path

import SimpleITK as sitk

from pydicer.utils import (
    fetch_converted_test_data,
    load_object_metadata,
    determine_dcm_datetime,
)

## Setup PyDicer

Here we load the LCTSC data which has already been converted. This is downloaded into the
`testdata_lctsc` directory. We also initialise a `PyDicer` object.

In [2]:
working_directory = fetch_converted_test_data("./testdata_lctsc", dataset="LCTSC")

pydicer = PyDicer(working_directory)

Working directory %s aready exists, won't download test data.


## Read Converted Data

To obtain a DataFrame of the converted data, use the `read_converted_data` function.

In [3]:
df = pydicer.read_converted_data()
df

Unnamed: 0,sop_instance_uid,hashed_uid,modality,patient_id,series_uid,for_uid,referenced_sop_instance_uid,path
0,1.3.6.1.4.1.14519.5.2.1.7014.4598.235489581364...,914d57,CT,LCTSC-Test-S1-102,1.3.6.1.4.1.14519.5.2.1.7014.4598.639871532605...,1.3.6.1.4.1.14519.5.2.1.7014.4598.408067568497...,,testdata_lctsc/data/LCTSC-Test-S1-102/images/9...
1,1.3.6.1.4.1.14519.5.2.1.7014.4598.110977663386...,6c6ea4,RTSTRUCT,LCTSC-Test-S1-102,1.3.6.1.4.1.14519.5.2.1.7014.4598.110977663386...,1.3.6.1.4.1.14519.5.2.1.7014.4598.408067568497...,1.3.6.1.4.1.14519.5.2.1.7014.4598.235489581364...,testdata_lctsc/data/LCTSC-Test-S1-102/structur...
2,1.3.6.1.4.1.14519.5.2.1.7014.4598.318848546630...,aa38e6,CT,LCTSC-Train-S1-002,1.3.6.1.4.1.14519.5.2.1.7014.4598.234842392725...,1.3.6.1.4.1.14519.5.2.1.7014.4598.145984743865...,,testdata_lctsc/data/LCTSC-Train-S1-002/images/...
3,1.3.6.1.4.1.14519.5.2.1.7014.4598.291449913947...,f036b8,RTSTRUCT,LCTSC-Train-S1-002,1.3.6.1.4.1.14519.5.2.1.7014.4598.291449913947...,1.3.6.1.4.1.14519.5.2.1.7014.4598.145984743865...,1.3.6.1.4.1.14519.5.2.1.7014.4598.318848546630...,testdata_lctsc/data/LCTSC-Train-S1-002/structu...
4,1.3.6.1.4.1.14519.5.2.1.7014.4598.188371727865...,6834c9,CT,LCTSC-Train-S1-001,1.3.6.1.4.1.14519.5.2.1.7014.4598.330486033168...,1.3.6.1.4.1.14519.5.2.1.7014.4598.109432485688...,,testdata_lctsc/data/LCTSC-Train-S1-001/images/...
5,1.3.6.1.4.1.14519.5.2.1.7014.4598.267594131248...,b5bddb,RTSTRUCT,LCTSC-Train-S1-001,1.3.6.1.4.1.14519.5.2.1.7014.4598.267594131248...,1.3.6.1.4.1.14519.5.2.1.7014.4598.109432485688...,1.3.6.1.4.1.14519.5.2.1.7014.4598.188371727865...,testdata_lctsc/data/LCTSC-Train-S1-001/structu...
6,1.3.6.1.4.1.14519.5.2.1.7014.4598.141349678572...,d91c84,CT,LCTSC-Train-S1-004,1.3.6.1.4.1.14519.5.2.1.7014.4598.269433294341...,1.3.6.1.4.1.14519.5.2.1.7014.4598.313160975008...,,testdata_lctsc/data/LCTSC-Train-S1-004/images/...
7,1.3.6.1.4.1.14519.5.2.1.7014.4598.595315284787...,61758b,RTSTRUCT,LCTSC-Train-S1-004,1.3.6.1.4.1.14519.5.2.1.7014.4598.595315284787...,1.3.6.1.4.1.14519.5.2.1.7014.4598.313160975008...,1.3.6.1.4.1.14519.5.2.1.7014.4598.141349678572...,testdata_lctsc/data/LCTSC-Train-S1-004/structu...
8,1.3.6.1.4.1.14519.5.2.1.7014.4598.140943693489...,738c1a,CT,LCTSC-Train-S1-005,1.3.6.1.4.1.14519.5.2.1.7014.4598.338518041666...,1.3.6.1.4.1.14519.5.2.1.7014.4598.964812114328...,,testdata_lctsc/data/LCTSC-Train-S1-005/images/...
9,1.3.6.1.4.1.14519.5.2.1.7014.4598.214803404117...,68d663,RTSTRUCT,LCTSC-Train-S1-005,1.3.6.1.4.1.14519.5.2.1.7014.4598.214803404117...,1.3.6.1.4.1.14519.5.2.1.7014.4598.964812114328...,1.3.6.1.4.1.14519.5.2.1.7014.4598.140943693489...,testdata_lctsc/data/LCTSC-Train-S1-005/structu...
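
Because the result is a regular Pandas DataFrame, the usual filtering and grouping operations apply to it. The following is a minimal sketch which uses a small hand-built DataFrame (`df_example`, a stand-in with a few of the columns and rows shown above) so that it runs without the test data. With real data you would apply the same operations to the `df` returned by `read_converted_data`.

```python
import pandas as pd

# Stand-in for the DataFrame returned by read_converted_data();
# only a subset of the columns and rows from the table above is reproduced
df_example = pd.DataFrame(
    {
        "hashed_uid": ["914d57", "6c6ea4", "aa38e6", "f036b8"],
        "modality": ["CT", "RTSTRUCT", "CT", "RTSTRUCT"],
        "patient_id": [
            "LCTSC-Test-S1-102",
            "LCTSC-Test-S1-102",
            "LCTSC-Train-S1-002",
            "LCTSC-Train-S1-002",
        ],
    }
)

# Select only the structure sets
df_structs = df_example[df_example["modality"] == "RTSTRUCT"]

# Count the number of objects per patient and modality
counts = df_example.groupby(["patient_id", "modality"]).size()
print(counts)
```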


## Iterating Over Objects

If you want to perform some operation on, for example, all images in your dataset, you can iterate
over each image row like this. Within each iteration we load the image as a `SimpleITK` image (just
for demonstration purposes).

In [4]:
for _, ct_row in df[df.modality == "CT"].iterrows():
    print(f"Loading image with hashed UID: {ct_row.hashed_uid}...", end="")

    # Each converted CT is stored as CT.nii.gz inside the object's directory
    img_path = Path(ct_row.path).joinpath("CT.nii.gz")
    img = sitk.ReadImage(str(img_path))

    print(" Complete")

Loading image with hashed UID: 914d57... Complete
Loading image with hashed UID: aa38e6... Complete
Loading image with hashed UID: 6834c9... Complete
Loading image with hashed UID: d91c84... Complete
Loading image with hashed UID: 738c1a... Complete
Loading image with hashed UID: 5adf40... Complete
Loading image with hashed UID: dd0026... Complete
Loading image with hashed UID: 2bf2f9... Complete
Loading image with hashed UID: 666be6... Complete
Loading image with hashed UID: 88c5ef... Complete
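
The same iteration pattern can be used to relate objects to each other. In the converted DataFrame shown earlier, the `referenced_sop_instance_uid` of each RTSTRUCT row appears to match the `sop_instance_uid` of the CT for the same patient. The sketch below assumes that relationship holds and finds the structure sets linked to each image. The DataFrame `df_example` and all UIDs in it are made-up placeholders so that the code runs without the test data.

```python
import pandas as pd

# Stand-in for the converted DataFrame; all UIDs below are hypothetical placeholders
df_example = pd.DataFrame(
    {
        "sop_instance_uid": ["1.2.3.1", "1.2.3.2", "1.2.3.3", "1.2.3.4"],
        "hashed_uid": ["aaa111", "bbb222", "ccc333", "ddd444"],
        "modality": ["CT", "RTSTRUCT", "CT", "RTSTRUCT"],
        "referenced_sop_instance_uid": [None, "1.2.3.1", None, "1.2.3.3"],
    }
)

df_ct = df_example[df_example["modality"] == "CT"]
df_struct = df_example[df_example["modality"] == "RTSTRUCT"]

linked = {}
for _, ct_row in df_ct.iterrows():
    # Structure sets whose referenced UID points at this image
    is_linked = df_struct["referenced_sop_instance_uid"] == ct_row["sop_instance_uid"]
    linked[ct_row["hashed_uid"]] = df_struct[is_linked]["hashed_uid"].tolist()

print(linked)
```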


## Loading Object Metadata

The metadata from the DICOM headers is stored by PyDicer and can be easily loaded using the
`load_object_metadata` function. Simply pass a row from the converted DataFrame into this function
to load the metadata for that object.

In [5]:
first_row = df.iloc[0]
ds = load_object_metadata(first_row)
ds

(0008, 0005) Specific Character Set              CS: 'ISO_IR 100'
(0008, 0008) Image Type                          CS: ['ORIGINAL', 'PRIMARY', 'AXIAL', 'CT_SOM5 SPI']
(0008, 0012) Instance Creation Date              DA: '20050726'
(0008, 0013) Instance Creation Time              TM: '091720.609000'
(0008, 0016) SOP Class UID                       UI: CT Image Storage
(0008, 0018) SOP Instance UID                    UI: 1.3.6.1.4.1.14519.5.2.1.7014.4598.235489581364823930179693677352
(0008, 0020) Study Date                          DA: '20031104'
(0008, 0021) Series Date                         DA: '20031104'
(0008, 0023) Content Date                        DA: '20031104'
(0008, 0030) Study Time                          TM: '112017.690000'
(0008, 0031) Series Time                         TM: '114545.847000'
(0008, 0033) Content Time                        TM: '114241.493000'
(0008, 0050) Accession Number                    SH: '2356173256177297'
(0008, 0060) Modality                    

### Keep only specific header tags

Loading object metadata can be slow, especially when doing this for many objects at once. If you
know which header attributes you need, you can specify them with the `keep_tags` argument. This
speeds up loading metadata significantly.

Here we load only the `StudyDate`, `PatientSex` and `Manufacturer`.

> Tip: These tags are defined by the DICOM standard, and we use `pydicom` to load this metadata. In
> fact, the metadata returned is a `pydicom` Dataset. Check out the [`pydicom` documentation](https://pydicom.github.io/pydicom/dev/old/pydicom_user_guide.html) for more information.

In [6]:
ds = load_object_metadata(first_row, keep_tags=["StudyDate", "PatientSex", "Manufacturer"])
ds

(0008, 0020) Study Date                          DA: '20031104'
(0008, 0070) Manufacturer                        LO: 'SIEMENS'
(0010, 0040) Patient's Sex                       CS: 'F'

### Loading metadata for all data objects

You can use the Pandas `apply` function to load metadata for all rows and add it as a column to the
converted DataFrame.

In [7]:
df["StudyDescription"] = df.apply(
    lambda row: load_object_metadata(row, keep_tags=["StudyDescription"]).StudyDescription,
    axis=1,
)
df

Unnamed: 0,sop_instance_uid,hashed_uid,modality,patient_id,series_uid,for_uid,referenced_sop_instance_uid,path,StudyDescription
0,1.3.6.1.4.1.14519.5.2.1.7014.4598.235489581364...,914d57,CT,LCTSC-Test-S1-102,1.3.6.1.4.1.14519.5.2.1.7014.4598.639871532605...,1.3.6.1.4.1.14519.5.2.1.7014.4598.408067568497...,,testdata_lctsc/data/LCTSC-Test-S1-102/images/9...,RT^RCCT_THORAX_8F_High (Adult)
1,1.3.6.1.4.1.14519.5.2.1.7014.4598.110977663386...,6c6ea4,RTSTRUCT,LCTSC-Test-S1-102,1.3.6.1.4.1.14519.5.2.1.7014.4598.110977663386...,1.3.6.1.4.1.14519.5.2.1.7014.4598.408067568497...,1.3.6.1.4.1.14519.5.2.1.7014.4598.235489581364...,testdata_lctsc/data/LCTSC-Test-S1-102/structur...,RT^RCCT_THORAX_8F_High (Adult)
2,1.3.6.1.4.1.14519.5.2.1.7014.4598.318848546630...,aa38e6,CT,LCTSC-Train-S1-002,1.3.6.1.4.1.14519.5.2.1.7014.4598.234842392725...,1.3.6.1.4.1.14519.5.2.1.7014.4598.145984743865...,,testdata_lctsc/data/LCTSC-Train-S1-002/images/...,RT^RCCT_THORAX_8F (Adult)
3,1.3.6.1.4.1.14519.5.2.1.7014.4598.291449913947...,f036b8,RTSTRUCT,LCTSC-Train-S1-002,1.3.6.1.4.1.14519.5.2.1.7014.4598.291449913947...,1.3.6.1.4.1.14519.5.2.1.7014.4598.145984743865...,1.3.6.1.4.1.14519.5.2.1.7014.4598.318848546630...,testdata_lctsc/data/LCTSC-Train-S1-002/structu...,RT^RCCT_THORAX_8F (Adult)
4,1.3.6.1.4.1.14519.5.2.1.7014.4598.188371727865...,6834c9,CT,LCTSC-Train-S1-001,1.3.6.1.4.1.14519.5.2.1.7014.4598.330486033168...,1.3.6.1.4.1.14519.5.2.1.7014.4598.109432485688...,,testdata_lctsc/data/LCTSC-Train-S1-001/images/...,RT^RCCT_THORAX_8F_Low (Adult)
5,1.3.6.1.4.1.14519.5.2.1.7014.4598.267594131248...,b5bddb,RTSTRUCT,LCTSC-Train-S1-001,1.3.6.1.4.1.14519.5.2.1.7014.4598.267594131248...,1.3.6.1.4.1.14519.5.2.1.7014.4598.109432485688...,1.3.6.1.4.1.14519.5.2.1.7014.4598.188371727865...,testdata_lctsc/data/LCTSC-Train-S1-001/structu...,RT^RCCT_THORAX_8F_Low (Adult)
6,1.3.6.1.4.1.14519.5.2.1.7014.4598.141349678572...,d91c84,CT,LCTSC-Train-S1-004,1.3.6.1.4.1.14519.5.2.1.7014.4598.269433294341...,1.3.6.1.4.1.14519.5.2.1.7014.4598.313160975008...,,testdata_lctsc/data/LCTSC-Train-S1-004/images/...,RT^RCCT_THORAX_8F_High_CONTRAST (Adult)
7,1.3.6.1.4.1.14519.5.2.1.7014.4598.595315284787...,61758b,RTSTRUCT,LCTSC-Train-S1-004,1.3.6.1.4.1.14519.5.2.1.7014.4598.595315284787...,1.3.6.1.4.1.14519.5.2.1.7014.4598.313160975008...,1.3.6.1.4.1.14519.5.2.1.7014.4598.141349678572...,testdata_lctsc/data/LCTSC-Train-S1-004/structu...,RT^RCCT_THORAX_8F_High_CONTRAST (Adult)
8,1.3.6.1.4.1.14519.5.2.1.7014.4598.140943693489...,738c1a,CT,LCTSC-Train-S1-005,1.3.6.1.4.1.14519.5.2.1.7014.4598.338518041666...,1.3.6.1.4.1.14519.5.2.1.7014.4598.964812114328...,,testdata_lctsc/data/LCTSC-Train-S1-005/images/...,RT^RCCT_THORAX_8F_High (Adult)
9,1.3.6.1.4.1.14519.5.2.1.7014.4598.214803404117...,68d663,RTSTRUCT,LCTSC-Train-S1-005,1.3.6.1.4.1.14519.5.2.1.7014.4598.214803404117...,1.3.6.1.4.1.14519.5.2.1.7014.4598.964812114328...,1.3.6.1.4.1.14519.5.2.1.7014.4598.140943693489...,testdata_lctsc/data/LCTSC-Train-S1-005/structu...,RT^RCCT_THORAX_8F_High (Adult)
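
Not every header attribute is guaranteed to be present for every object, and accessing a missing attribute on a `pydicom` Dataset raises an `AttributeError`. A minimal sketch of a more defensive variant is to use `getattr` with a default value. To keep the example runnable without PyDicer, `fake_load_metadata` is a hypothetical stand-in for `load_object_metadata`, and `df_example` is a made-up DataFrame.

```python
from types import SimpleNamespace

import pandas as pd


def fake_load_metadata(row):
    """Hypothetical stand-in for load_object_metadata, so this runs without PyDicer."""
    if row["modality"] == "CT":
        return SimpleNamespace(StudyDescription="Example study")
    return SimpleNamespace()  # StudyDescription is missing for this object


# Made-up rows for illustration only
df_example = pd.DataFrame(
    {"hashed_uid": ["aaa111", "bbb222"], "modality": ["CT", "RTSTRUCT"]}
)

# getattr with a default avoids an AttributeError when the attribute is absent
df_example["StudyDescription"] = df_example.apply(
    lambda row: getattr(fake_load_metadata(row), "StudyDescription", None),
    axis=1,
)
print(df_example)
```

Rows where the attribute is missing end up with an empty value in the new column, which can then be filtered with `isna` or filled as needed.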


### Determine Date of Object

There are several DICOM header tags which could define the date of an object. The DICOM standard
doesn't require all of these to be set within the metadata. PyDicer provides the
`determine_dcm_datetime` function to extract the date and time from the DICOM header.

In [8]:
ds = load_object_metadata(first_row)
obj_datetime = determine_dcm_datetime(ds)
print(obj_datetime)

2003-11-04 11:45:45.847000
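
The value printed above matches the Series Date and Series Time in the header listing shown earlier. To illustrate the general idea of falling back across several date tags, here is a minimal sketch using only the standard library. It is not PyDicer's implementation: the function name `first_available_datetime`, the plain `dict` input and the order of candidate tags are all assumptions made for illustration, and it only handles time values given in full `HHMMSS` form.

```python
from datetime import datetime

# Candidate (date, time) keyword pairs; this order is an assumption for illustration
CANDIDATE_TAGS = [
    ("SeriesDate", "SeriesTime"),
    ("StudyDate", "StudyTime"),
    ("InstanceCreationDate", "InstanceCreationTime"),
]


def first_available_datetime(header):
    """Return the first parseable date/time from a dict of DICOM header values."""
    for date_key, time_key in CANDIDATE_TAGS:
        date_str = header.get(date_key)
        if not date_str:
            continue  # this date tag is not set, try the next candidate
        time_str = header.get(time_key) or "000000"
        # DICOM DA is YYYYMMDD; TM is assumed here to be HHMMSS with an
        # optional fractional part
        fmt = "%Y%m%d%H%M%S.%f" if "." in time_str else "%Y%m%d%H%M%S"
        return datetime.strptime(date_str + time_str, fmt)
    return None


# Only the study date and time are set, so the function falls back to them
header = {"StudyDate": "20031104", "StudyTime": "112017.690000"}
print(first_available_datetime(header))
```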
