# DICOM Dataset EDA

We will discuss some of the methods that can be applied to dataset analysis when you are dealing with a large collection of 3D volumes. Many of the principles used for EDA of an individual volume apply here as well.

In this case we will take a collection of images and try to figure out what we are looking at, using some of the techniques that help us collect relevant metadata.

In [None]:
import pydicom
import matplotlib.pyplot as plt
import matplotlib.cm as cm
import numpy.ma as ma
import numpy as np
import os

Let us load the metadata for all series, but not the pixel data:

In [None]:
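path = r"data"
# One list per non-empty directory (assumed to hold one series); each item is (file path, header).
# A plain list is used here because recent NumPy versions raise an error on ragged nested sequences.
series = [[(os.path.join(dp, f), pydicom.dcmread(os.path.join(dp, f), stop_before_pixels=True)) for f in files]
          for dp, _, files in os.walk(path) if len(files) != 0]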



Let's print a few, see what we've got:

In [None]:
series[0][0][1]

*Looks like we have MR data*  
*Looks like we can rely on patient IDs*

How many files in total?

In [None]:
instances = [f for l in series for f in l]
len(instances)

How many patients?

In [None]:
patient_ids = np.unique([inst[1].PatientID for inst in instances])
len(patient_ids)

*Great: no errors, so every instance has the PatientID tag, and it looks like we can rely on it*
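
The absence of an exception only tells us the tag exists; it could still be empty. A minimal sketch of a more explicit check is below. The helper name `count_missing` and the `SimpleNamespace` stand-ins are assumptions made for illustration, not part of the notebook's data.

```python
from types import SimpleNamespace


def count_missing(datasets, tag_names):
    """Count, per tag name, how many headers lack the attribute or hold an empty value."""
    missing = {name: 0 for name in tag_names}
    for ds in datasets:
        for name in tag_names:
            value = getattr(ds, name, None)
            if value is None or str(value).strip() == "":
                missing[name] += 1
    return missing


# Hypothetical stand-ins for DICOM headers (not real data)
demo_headers = [
    SimpleNamespace(PatientID="P1", StudyInstanceUID="1.2.3"),
    SimpleNamespace(PatientID="", StudyInstanceUID="1.2.4"),   # empty PatientID
    SimpleNamespace(StudyInstanceUID="1.2.5"),                 # PatientID absent
]

missing = count_missing(demo_headers, ["PatientID", "StudyInstanceUID"])
print(missing)
```

With the real data, the same helper could probably be called on `[inst[1] for inst in instances]`, since attribute lookup with a default should behave the same way on pydicom datasets.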

How many total series (i.e. 3D volumes)?

In [None]:
len(series)

What is the relationship between patients, studies and series?

In [None]:
# How many studies?

studies = {}

for s in series:
    studies.setdefault(s[0][1].StudyInstanceUID, []).append(s)   

In [None]:
len(studies)

Let's see how many studies per patient

In [None]:
[len([st for st in studies.values() if st[0][0][1].PatientID == p]) for p in patient_ids]

*Nice, all even. Let's look at the directory on the file system.*  
*Looks like 2 points in time per patient*
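
The patient/study/series relationship can also be derived in one pass from the headers alone, without relying on directory layout. The sketch below assumes each header carries `PatientID`, `StudyInstanceUID` and `SeriesInstanceUID`; the function name `build_hierarchy` and the sample values are hypothetical.

```python
from collections import defaultdict
from types import SimpleNamespace


def build_hierarchy(headers):
    """Map PatientID -> StudyInstanceUID -> set of SeriesInstanceUIDs."""
    tree = defaultdict(lambda: defaultdict(set))
    for h in headers:
        tree[h.PatientID][h.StudyInstanceUID].add(h.SeriesInstanceUID)
    return tree


# Hypothetical stand-ins for DICOM headers (not real data)
demo_headers = [
    SimpleNamespace(PatientID="A", StudyInstanceUID="s1", SeriesInstanceUID="x1"),
    SimpleNamespace(PatientID="A", StudyInstanceUID="s1", SeriesInstanceUID="x2"),
    SimpleNamespace(PatientID="A", StudyInstanceUID="s2", SeriesInstanceUID="x3"),
    SimpleNamespace(PatientID="B", StudyInstanceUID="s3", SeriesInstanceUID="x4"),
]

tree = build_hierarchy(demo_headers)
studies_per_patient = {p: len(st) for p, st in tree.items()}
series_per_study = {uid: len(sr) for st in tree.values() for uid, sr in st.items()}
print(studies_per_patient)
print(series_per_study)
```

Grouping on UIDs rather than on folders has the advantage that a series split across two directories, or two series dumped into one, would still be counted correctly.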

In [None]:
# Let's see how many series per study

series_per_study = [(len(sr), sr[0][0][1].PatientID) for sr in studies.values()]
series_per_study

Let's take a quick glimpse at that outlier on the file system. 

- seems like it's missing some sequences
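
Rather than spotting the outlier by eye, we could flag any study whose series count differs from the most common count. This is only a sketch: `flag_outliers` is a hypothetical helper, and the input mimics the `(count, PatientID)` pairs built above with made-up values.

```python
from collections import Counter


def flag_outliers(counts):
    """Return the (count, label) pairs whose count differs from the most common count."""
    mode, _ = Counter(c for c, _ in counts).most_common(1)[0]
    return [(c, label) for c, label in counts if c != mode]


# Made-up (series count, patient id) pairs for illustration
demo_counts = [(10, "P1"), (10, "P1"), (10, "P2"), (8, "P2"), (10, "P3")]

outliers = flag_outliers(demo_counts)
print(outliers)
```

The same helper would work for images per series, as long as the input is a list of `(count, label)` pairs.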

Finally, how many images per series:

In [None]:
img_per_series = [len(s) for s in series]
print(img_per_series)

*Nice, no outliers*

Let's look at spacing and in-plane resolution:

In [None]:
res = {}
spc = {}
thck = {}

for sr in series:
    dcm = sr[0][1]
    key = str(dcm.PixelSpacing)
    spc.setdefault(key, [])
    spc[key].append((dcm.PatientID, dcm.StudyDescription, dcm.StudyDate, dcm.SeriesDescription))
    
    key = str((dcm.Rows, dcm.Columns))
    res.setdefault(key, [])
    res[key].append((dcm.PatientID, dcm.StudyDescription, dcm.StudyDate, dcm.SeriesDescription))
    
    key = str(dcm.SliceThickness)
    thck.setdefault(key, [])
    thck[key].append((dcm.PatientID, dcm.StudyDescription, dcm.StudyDate, dcm.SeriesDescription))
    


Let's look at slice thickness

In [None]:
thck.keys()

*Great, all consistent*

Let's look at pixel spacing

In [None]:
spc.keys()

*Not very consistent, let's try to see what is going on*

In [None]:
spc
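
One caveat with string keys: the same spacing written as `0.9375` and `0.93750` would end up in two different groups. A hedged alternative is to group on rounded numeric values instead. The helper name `group_by_spacing`, the rounding precision and the sample records are all assumptions for illustration.

```python
from collections import defaultdict


def group_by_spacing(records, ndigits=4):
    """Group labels by pixel spacing, rounded so formatting differences do not split groups."""
    groups = defaultdict(list)
    for label, spacing in records:
        key = tuple(round(float(v), ndigits) for v in spacing)
        groups[key].append(label)
    return dict(groups)


# Made-up (series label, PixelSpacing) records for illustration
demo_records = [
    ("T1", ["0.9375", "0.9375"]),
    ("T2", ["0.93750", "0.93750"]),   # same spacing, different formatting
    ("FLAIR", ["0.4688", "0.4688"]),
]

groups = group_by_spacing(demo_records)
for key, labels in groups.items():
    print(key, labels)
```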

*So it looks like there is a slight discrepancy among the T1/T2 series. Note to self: make sure to resample if I'm using them.  
Also, it seems that some sequences have smaller pixel spacing than others. If I'm using those, I need to make sure resampling is meaningful*
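
To get a feel for what resampling implies, the sketch below computes the grid size that keeps the physical field of view unchanged when moving to a new pixel spacing. It covers only the size arithmetic, not the interpolation itself; `resampled_shape` is a hypothetical helper and the numbers are made up.

```python
def resampled_shape(rows, cols, spacing, target_spacing):
    """Grid size that preserves the physical extent when pixel spacing changes.

    spacing and target_spacing are (row spacing, column spacing) in mm,
    the order used by the DICOM PixelSpacing attribute.
    """
    new_rows = round(rows * spacing[0] / target_spacing[0])
    new_cols = round(cols * spacing[1] / target_spacing[1])
    return new_rows, new_cols


# Made-up example: a 256 x 256 slice at 0.9375 mm resampled to 1 mm pixels
shape = resampled_shape(256, 256, (0.9375, 0.9375), (1.0, 1.0))
print(shape)
```

Resampling a coarse series up to a much finer grid adds pixels but no information, which is what "meaningful" is getting at here.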

Let's look at in-plane resolution

In [None]:
res.keys()

*Not a lot of variety, but clearly some sequence types are high-res, need to see what's up*

In [None]:
res

Now let's visualize some of those images and see if the series align with each other

In [None]:
# Let's see how images from different series of the same study stack with each other.
# We might want to use multiple input channels for our problem.

# Remember, though, that we did not load the pixel data earlier. Let's load it properly:

seq1 = r"PGBM-003/10-17-1995-MR RCBV SEQUENCE-57198/34911-T1prereg-46949"
t1_slices = [pydicom.dcmread(os.path.join(path, seq1, f)) for f in os.listdir(os.path.join(path, seq1))]
# Sort along z; float() avoids collapsing sub-millimetre positions into ties
t1_slices.sort(key=lambda inst: float(inst.ImagePositionPatient[2]))

seq2 = r"PGBM-003/10-17-1995-MR RCBV SEQUENCE-57198/36471-FLAIRreg-02052"
flair_slices = [pydicom.dcmread(os.path.join(path, seq2, f)) for f in os.listdir(os.path.join(path, seq2))]
flair_slices.sort(key=lambda inst: float(inst.ImagePositionPatient[2]))
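
Sorting on the third component of `ImagePositionPatient` works for axial acquisitions. For coronal, sagittal or oblique series, a more general option is to sort by the position projected onto the slice normal, which is the cross product of the row and column direction cosines in `ImageOrientationPatient`. The sketch below uses `SimpleNamespace` stand-ins instead of real slices, and `slice_position` is a hypothetical helper name.

```python
from types import SimpleNamespace


def slice_position(ds):
    """Distance of the slice origin along the slice normal, in mm."""
    rx, ry, rz, cx, cy, cz = (float(v) for v in ds.ImageOrientationPatient)
    # Slice normal = row direction x column direction (cross product)
    normal = (ry * cz - rz * cy, rz * cx - rx * cz, rx * cy - ry * cx)
    return sum(n * float(p) for n, p in zip(normal, ds.ImagePositionPatient))


# Made-up coronal slices: z is constant, so sorting on z alone would not order them
coronal_iop = [1, 0, 0, 0, 0, -1]
demo_slices = [
    SimpleNamespace(ImageOrientationPatient=coronal_iop, ImagePositionPatient=[0, y, 100])
    for y in (10, -5, 0)
]

ordered = sorted(demo_slices, key=slice_position)
positions = [s.ImagePositionPatient[1] for s in ordered]
print(positions)
```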

In [None]:
# t1_slices[0]

In [None]:
# flair_slices[0]

In [None]:
t1 = np.stack([s.pixel_array for s in t1_slices])
flair = np.stack([s.pixel_array for s in flair_slices])

In [None]:
[ipp.ImagePositionPatient for ipp in t1_slices]

In [None]:
[ipp.ImagePositionPatient for ipp in flair_slices]

In [None]:
np.array([ipp.ImageOrientationPatient for ipp in t1_slices]) - np.array([ipp.ImageOrientationPatient for ipp in flair_slices])

In [None]:
t1.shape

In [None]:
flair.shape

In [None]:
plt.imshow((flair+1.0*t1)[9,:,:], cmap="gray")

In [None]:
plt.imshow((0.0*flair+t1)[9,:,:], cmap="gray")

At this point we could visualize a few more series and run some code to check that ImagePositionPatient (IPP) and ImageOrientationPatient (IOP) align across the whole dataset. We could also load this data in Slicer to get a feel for what we are dealing with, and then start looking at the ground truth and photometrics.
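
A dataset-wide alignment check could be built from a pairwise comparison like the one sketched here. It assumes both slice lists are already sorted the same way. The function name `geometry_matches`, the tolerance and the stand-in data are assumptions for illustration.

```python
from types import SimpleNamespace


def geometry_matches(slices_a, slices_b, tol=1e-3):
    """True if two sorted slice lists share slice count, orientation and positions within tol."""
    if len(slices_a) != len(slices_b):
        return False
    for a, b in zip(slices_a, slices_b):
        pairs = list(zip(a.ImageOrientationPatient, b.ImageOrientationPatient))
        pairs += list(zip(a.ImagePositionPatient, b.ImagePositionPatient))
        if any(abs(float(x) - float(y)) > tol for x, y in pairs):
            return False
    return True


def make_series(z_start, n=3, thickness=5.0):
    """Build made-up axial slice headers starting at z_start (illustration only)."""
    return [
        SimpleNamespace(
            ImageOrientationPatient=[1, 0, 0, 0, 1, 0],
            ImagePositionPatient=[-120.0, -120.0, z_start + i * thickness],
        )
        for i in range(n)
    ]


t1_demo = make_series(0.0)
flair_demo = make_series(0.0)
shifted_demo = make_series(2.5)

aligned = geometry_matches(t1_demo, flair_demo)
misaligned = geometry_matches(t1_demo, shifted_demo)
print(aligned, misaligned)
```

A fuller check would also compare `Rows`, `Columns` and `PixelSpacing`, since two series can share origins and orientation and still sit on different grids.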

However, by going through these simple steps, we already learned the following about our dataset:

* We are dealing with MR images
* We have data from 4 patients
* Each patient has 2 studies representing 2 time points
* Each study has between 8 and 10 series, 76 series altogether
* In total we have 1794 slices
* We have varying pixel spacing, even within same sequence types
* We have consistent slice thickness (5mm)
* We have low-res slices (older ones) and higher-res slices (newer ones)
* Our volumes seem to all be registered together, so it should be safe to simply stack them, i.e. they could be passed as multi-channel data into our ML algorithms

That is a lot! 