# ☑️ Overview:

### 🥅 Goal: investigate the metadata of the 33,026 CT images belonging to the 176 patients in the training set, and collect it in a DataFrame.

### 🔗 Link to my [DICOM Metadata DataFrame](https://www.kaggle.com/samuellongenbach/osic-pulmonary-fibrosis-dicom-metadata-pkl)

### 👏 Thanks to the following notebooks for their ideas & code:  

* [What should we consider when handling DICOM?](https://www.kaggle.com/jryoungw/what-should-we-consider-when-handling-dicom) by [jryoungw](https://www.kaggle.com/jryoungw)

* [Pulmonary Fibrosis Competition: EDA & DICOM Prep](https://www.kaggle.com/andradaolteanu/pulmonary-fibrosis-competition-eda-dicom-prep) by [andradaolteanu](https://www.kaggle.com/andradaolteanu) 

* [Understanding DICOMS✔](https://www.kaggle.com/avirdee/understanding-dicoms) by [avirdee](https://www.kaggle.com/avirdee)



### 👍 If you find the notebook or dataset helpful, suggestions & an upvote are appreciated!

## 📚 Libraries:

In [None]:
## Data handling & plotting
import numpy as np
import pandas as pd
import os
import matplotlib.pyplot as plt
import seaborn as sns
## DICOM I/O
import pydicom 

## 📊Load CSV Data:

### Key Findings:
* 176 Patients in the training set.
* 5 Patients in the test set, all of which also appear in the training set.

* Each patient has a 3D baseline CT scan which is stored as a collection of 2D images.
* We also refer to these images as slices.
* The number of 2D images for each of the Patients varies from 12 to 1018 images.  
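The per-patient image counts come from listing each patient's folder. A minimal sketch of that counting step, using only the standard library, is given here. The helper name `count_images_per_patient`, the fake patient IDs, and the throwaway directory layout are illustrative assumptions and not part of the competition data.

```python
import os
import statistics
import tempfile


def count_images_per_patient(base_dir):
    """Map each patient folder name to the number of .dcm files it contains."""
    counts = {}
    for patient in sorted(os.listdir(base_dir)):
        folder = os.path.join(base_dir, patient)
        if os.path.isdir(folder):
            counts[patient] = sum(
                1 for name in os.listdir(folder) if name.lower().endswith(".dcm")
            )
    return counts


# Build a tiny fake layout (hypothetical patient IDs) to exercise the helper.
with tempfile.TemporaryDirectory() as base:
    for patient, n_files in {"patient_a": 3, "patient_b": 5, "patient_c": 4}.items():
        os.makedirs(os.path.join(base, patient))
        for i in range(1, n_files + 1):
            open(os.path.join(base, patient, f"{i}.dcm"), "w").close()
    demo_counts = count_images_per_patient(base)

print(demo_counts)
print("min:", min(demo_counts.values()),
      "max:", max(demo_counts.values()),
      "median:", statistics.median(demo_counts.values()))
```

Counting once per folder, rather than once per row of `train.csv`, avoids listing the same directory repeatedly for patients with several rows.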

In [None]:
train_df = pd.read_csv("../input/osic-pulmonary-fibrosis-progression/train.csv")
test_df = pd.read_csv("../input/osic-pulmonary-fibrosis-progression/test.csv")

train_df.head(5)

In [None]:
print("# of Patients in Train: ",len(np.unique(train_df["Patient"])))
print("# of Patients in Test: ",len(np.unique(test_df["Patient"])))
print("Train/Test overlap?: ",len(np.intersect1d(train_df["Patient"],test_df["Patient"])))

In [None]:
# Base directory for Train .dcm files
osic_dir = "../input/osic-pulmonary-fibrosis-progression/train/"

train_df["Path"] = osic_dir + train_df["Patient"] 

# Calculate how many CT images each patient has.
# Assign the whole column at once: chained indexing such as
# train_df["CT_images"][k] = ... may write to a copy and silently do nothing.
train_df["CT_images"] = [len(os.listdir(path)) for path in train_df["Path"]]

train_df.head(5)

In [None]:
# CT Scans per Patient
data = train_df.groupby(by="Patient")["CT_images"].first().reset_index(drop=False)

# Sort by number of CT Scans
data = data.sort_values(['CT_images']).reset_index(drop=True)
print("Minimum number of CT images: {}".format(data["CT_images"].min()), "\n" +
      "Maximum number of CT images: {}".format(data["CT_images"].max()), "\n" +
      "Median number of CT images: {}".format(data["CT_images"].median()))

# Plot
plt.figure(figsize = (16, 6))
p = sns.barplot(x=data["Patient"], y=data["CT_images"], color="darkgreen")
plt.axvline(x=85, color="lightgreen", linestyle='--', lw=3)

plt.title("Number of CT images in baseline for each Patient", fontsize = 17)
plt.xlabel('Patient', fontsize=14)
plt.ylabel('Number of CT images', fontsize=14)

plt.text(86, 850, "Median={:.0f}".format(data["CT_images"].median()), fontsize=13)

p.axes.get_xaxis().set_visible(False);

## ⚕️Collect DICOM Metadata:

### Example:
* For Patient -> ID00122637202216437668965 we load all the 2D slices using **pydicom.dcmread**.
* Below we display the metadata for one of the slices:
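`os.listdir` returns files in arbitrary order, which is why the slices are sorted by `InstanceNumber` after loading. When only the filenames are at hand, a numeric sort on the stem is a rough substitute. This sketch assumes names of the form `<number>.dcm` (as `6.dcm` later in this notebook suggests); whether the file number always matches `InstanceNumber` is not verified here. `sort_numeric` is a hypothetical helper.

```python
import os


def sort_numeric(filenames):
    """Sort names like '10.dcm' by their integer stem rather than as text."""
    return sorted(filenames, key=lambda name: int(os.path.splitext(name)[0]))


names = ["10.dcm", "2.dcm", "1.dcm", "21.dcm"]
print(sorted(names))        # plain text sort places '10.dcm' before '2.dcm'
print(sort_numeric(names))  # numeric sort: 1, 2, 10, 21
```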


In [None]:
path = "../input/osic-pulmonary-fibrosis-progression/train/ID00122637202216437668965"

slices = [pydicom.dcmread(path + '/' + s) for s in os.listdir(path)] 
slices.sort(key = lambda x: int(x.InstanceNumber)) 

print("Patient: ","ID00122637202216437668965")
print("Image/Slice: ",slices[36]["InstanceNumber"].value)
slices[36]

### Metadata for a Patient:
* **Metadata_for_Patient** creates a DataFrame with all the metadata for a given Patient. 
* If the DICOM attribute doesn't exist, append np.nan.
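The NaN-filling rule can be illustrated without any DICOM files. The sketch below uses plain dictionaries as stand-ins for pydicom datasets (an assumption made only so the example runs anywhere) and a hypothetical helper `collect_columns`. The real function in the next cell reads the attributes from `pydicom` objects instead.

```python
import math


def collect_columns(records, attributes):
    """Build one column per attribute, using NaN where a record lacks it."""
    columns = {}
    for att in attributes:
        columns[att] = [rec.get(att, math.nan) for rec in records]
    return columns


# Stand-in "slices": the second one has no SliceThickness, and none has KVP.
records = [
    {"Modality": "CT", "SliceThickness": 1.25},
    {"Modality": "CT"},
]
columns = collect_columns(records, ["Modality", "SliceThickness", "KVP"])
print(columns)
```

Every column ends up with one entry per slice, so the columns can be placed side by side in a DataFrame even when scanners write different sets of attributes.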

In [None]:
dicom_atts = ["SpecificCharacterSet","ImageType","SOPInstanceUID","Modality","Manufacturer","ManufacturerModelName","PatientName","PatientID",
             "PatientSex","DeidentificationMethod","BodyPartExamined","SliceThickness","KVP","SpacingBetweenSlices","DistanceSourceToDetector","DistanceSourceToPatient","GantryDetectorTilt",
             "TableHeight","RotationDirection","XRayTubeCurrent","GeneratorPower","FocalSpots","ConvolutionKernel","PatientPosition","RevolutionTime","SingleCollimationWidth","TotalCollimationWidth","TableSpeed","TableFeedPerRotation","SpiralPitchFactor",
              "StudyInstanceUID","SeriesInstanceUID","StudyID","InstanceNumber","PatientOrientation","ImagePositionPatient","ImageOrientationPatient","FrameOfReferenceUID","PositionReferenceIndicator","SliceLocation","SamplesPerPixel","PhotometricInterpretation",
             "Rows","Columns","PixelSpacing","BitsAllocated","BitsStored","HighBit","PixelRepresentation","PixelPaddingValue","WindowCenter","WindowWidth","RescaleIntercept","RescaleSlope","RescaleType"]

list_attributes = ["ImageType","ImagePositionPatient","ImageOrientationPatient","PixelSpacing"]

def Metadata_for_Patient(folder_path):
    files = os.listdir(folder_path)
    patient_id = folder_path.split('/')[-1]
    
    ## Each row is an image file:
    base_data = {'Patient': [patient_id]*len(files), 'File': files}
    patient_df = pd.DataFrame(data=base_data)
    
    ## Add Columns by looping through DICOM attributes for each image file:
    slices = [pydicom.dcmread(folder_path + '/' + s) for s in files] 
    for d in dicom_atts:
        attribute_i = []
        for s in slices:
            try:
                attribute_i.append(s[d].value)
            except Exception:  # attribute not present in this slice
                attribute_i.append(np.nan)
        patient_df[d] = attribute_i
        
    ## Store min pixel value for each image file 
    attribute_min_pixel = []
    for s in slices:
        try:
            mp = np.min(s.pixel_array.astype(np.int16).flatten())
        except Exception:  # pixel data could not be decoded
            mp = np.nan
        attribute_min_pixel.append(mp)
    patient_df["MinPixelValue"] = attribute_min_pixel
  
    return patient_df

In [None]:
Metadata_for_Patient(path).head()

### 💾 Create DataFrame for all Patients:

In [None]:
## For all 176 Patient Folders:
unique_patient_df = train_df.groupby(by="Patient").first()

## Build one DataFrame per patient, then concatenate once
## (concatenating inside the loop re-copies the growing frame on every iteration)
patient_frames = [Metadata_for_Patient(pth) for pth in unique_patient_df["Path"]]
DICOM_Meta_df = pd.concat(patient_frames, ignore_index=True)
    
## SAVE:
DICOM_Meta_df.to_pickle("DICOM_Metadata.pkl")

## LOAD:
#load_df = pd.read_pickle("DICOM_Metadata.pkl")

## 💾 Metadata EDA:

* To be done...

In [None]:
## All Images:
print("Shape: ", DICOM_Meta_df.shape)
DICOM_Meta_df.info()

In [None]:
## All Patients:
unique_meta_df = DICOM_Meta_df.groupby(by="Patient").first()
unique_meta_df.info()

In [None]:
unique_meta_df.describe()

### CT Image Issues:
* Patient -> ID00011637202177653955184: 31 images fail to load
* Patient -> ID00052637202186188008618: 1 image fails to load
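Storing NaN records that a slice failed to load, but not why. A minimal sketch of keeping the error message alongside the value follows. `safe_call` and the two stand-in loaders are hypothetical; in this notebook the callable would wrap something like reading `pixel_array` from a loaded slice.

```python
import math


def safe_call(fn):
    """Run fn(); return (result, None) on success or (NaN, message) on failure."""
    try:
        return fn(), None
    except Exception as exc:  # deliberately broad: any failure is recorded
        return math.nan, f"{type(exc).__name__}: {exc}"


def good_loader():
    return -1024  # stand-in for a minimum pixel value


def bad_loader():
    raise RuntimeError("decoder not available")  # stand-in for a failed decode


print(safe_call(good_loader))
print(safe_call(bad_loader))
```

Keeping the message in its own column would make it possible to group failures by cause instead of inspecting problem files one at a time.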

In [None]:
issues = DICOM_Meta_df[DICOM_Meta_df["MinPixelValue"].isna()]
issues

In [None]:
issues["Patient"].value_counts()

In [None]:
issue_path = "../input/osic-pulmonary-fibrosis-progression/train/ID00011637202177653955184/6.dcm"

# Reproduce the failure: accessing pixel_array raises an error for this file
issue_slice = pydicom.dcmread(issue_path)
issue_slice.pixel_array