<h1 style="border-style: outset;border-color: red;text-align: center;">SIIM-FISABIO-RSNA COVID-19 Dataset Preparation</h1>

<img src="https://content.presspage.com/uploads/2110/1920_gettyimages-1216304354.jpg" height="500" width="500" style="display: block;margin-left: auto;margin-right: auto;"> 

<h2 style="text-align: center;border-style: double;text-align: center;border-color: red; ">About SIIM</h2>
<img src="https://siim.org/resource/resmgr/SIIM_logo-600x315.png" width="200" style="display: block;margin-left: auto;margin-right: auto;">
<p> <b>Society for Imaging Informatics in Medicine</b> (<a href="https://siim.org/">SIIM</a>) is the leading healthcare professional organization for those interested in the current and future use of informatics in medical imaging. The society's mission is to advance medical imaging informatics across the enterprise through education, research, and innovation in a multi-disciplinary community.</p>

<a href = "https://www.kaggle.com/shanmukh05/siim-covid19-dataset-256px-jpg" style="font-weight:'bold'; color:blue; font-family:monospace; "> <h3>My Dataset</h3></a>
<a href = "https://www.kaggle.com/shanmukh05/siim-covid-19-detection-detectron2-training" style="font-weight:'bold'; color:blue; font-family:monospace; "> <h3>My Training Notebook</h3></a> To be updated
<a href = "" style="font-weight:'bold'; color:blue; font-family:monospace; "> <h3>My Data Visualization Notebook</h3></a> Will be created soon
<a href = "" style="font-weight:'bold'; color:blue; font-family:monospace; "> <h3>My Inference Notebook</h3></a> Will be created soon.

In [None]:
##---------------------------------
# installing GDCM, which pydicom needs to decode compressed pixel data
##---------------------------------

!conda install gdcm -c conda-forge -y

In [None]:
##--------------------------
#importing required libraries
##--------------------------

import numpy as np
import pandas as pd

import os
import ast

import PIL
from PIL import Image
import matplotlib.pyplot as plt

import tensorflow as tf

import pydicom
from pydicom.pixel_data_handlers.util import apply_voi_lut

In [None]:
TRAIN_PATH = "../input/siim-covid19-detection/train"
TEST_PATH = "../input/siim-covid19-detection/test"
FINAL_TRAIN_PATH = "./dataset/train"
FINAL_TEST_PATH = "./dataset/test"

HEIGHT,WIDTH = 640,640

TRAIN_FILES = tf.io.gfile.glob(TRAIN_PATH+"/*/*/*.dcm")
TEST_FILES = tf.io.gfile.glob(TEST_PATH+"/*/*/*.dcm")

train_id,test_id = [], []
train_h, test_h = [], []
train_w, test_w = [], []

<h1 style = "font-family:'Courier New';font-weight: bold;margin-top: 0px;margin-bottom: 1px;text-align: center;">Preparing Image Data</h1>

<h2 style="font-weight:'bold'; color:red; font-family:verdana; text-align: center;">What is DICOM format</h2>
<p> <b>Digital Imaging and Communications in Medicine</b> (DICOM) is the standard for the communication and management of medical imaging information and related data. It is most commonly used for storing and transmitting medical images, which enables the integration of medical imaging devices such as scanners, servers, workstations, printers, and network hardware.</p>

<a href = "https://en.wikipedia.org/wiki/DICOM" style="font-weight:'bold'; color:blue; font-family:monospace; text-align: center;"> Read more about DICOM here</a>

<h2 style="font-weight:'bold'; color:red; font-family:verdana; text-align: center;">Why is DICOM used</h2>

<p>Some benefits of using the DICOM format in medical imaging:</p>

- Collaboration With Existing IT Systems
- High-Performance Review
- Complete Scanning and Image Reviewing

<a href = "https://www.covetus.com/blog/how-is-dicom-important-beneficial-for-the-healthcare-industry" style="font-weight:'bold'; color:blue; font-family:monospace; text-align: center;"> Read more about DICOM applications here</a>

In [None]:
##---------------------------------------
#dicom to pixel array converting function
##---------------------------------------

# Ref : https://www.kaggle.com/raddar/convert-dicom-to-np-array-the-correct-way
def dicom2arr(path, voi_lut = True, fix_monochrome = True):
    dicom = pydicom.dcmread(path) # read_file is a deprecated alias of dcmread
    
    if voi_lut:
        arr = apply_voi_lut(dicom.pixel_array, dicom)
    else:
        arr = dicom.pixel_array
               
    if fix_monochrome and dicom.PhotometricInterpretation == "MONOCHROME1":
        arr = np.amax(arr) - arr
        
    arr = arr - np.min(arr)
    arr = arr / np.max(arr)
    arr = (arr * 255).astype(np.uint8)
        
    return arr

##-----------------------
#resizing the pixel array
##-----------------------

def resizeArr(arr,is_train=True):
    im = Image.fromarray(arr)
    if is_train:
        train_w.append(im.size[0])
        train_h.append(im.size[1])
    else:
        test_w.append(im.size[0])
        test_h.append(im.size[1]) 
    im = im.resize((WIDTH,HEIGHT),resample= Image.LANCZOS) # PIL expects size as (width, height)
    return im

##------------------------------
#Filename for resized jpg images 
##------------------------------

def getFilename(filepath,is_train=True):
    '''
        Format = '{STUDY-ID}_{SUB-STUDY-ID}_{IMAGE-ID}.jpg'
    '''
    ls = filepath.split("/")
    filename = ls[-3]+'_'+ls[-2]+'_'+ls[-1].split(".")[0]+".jpg"
    
    if is_train:
        train_id.append(ls[-1].split(".")[0])
    else:
        test_id.append(ls[-1].split(".")[0])
    return filename

##-------------------
#Finally saving image
##-------------------

def saveImage(filepath,mainpath=FINAL_TRAIN_PATH,train=True):
    arr = dicom2arr(filepath)
    arr = resizeArr(arr,train)
    path = os.path.join(mainpath,getFilename(filepath,train))
    
    arr.save(path)
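
The intensity handling in `dicom2arr` can be checked without a DICOM file. The sketch below applies the same steps (optional MONOCHROME1 inversion, shift to zero, scale to 0-255) to a small made-up array. `to_uint8` is a hypothetical helper name, and the guard for a constant image is an assumption added here for safety.

```python
import numpy as np

def to_uint8(arr, invert=False):
    """Min-max scale an array to 8-bit, optionally inverting it first."""
    arr = arr.astype(np.float64)
    if invert:
        # MONOCHROME1 stores high values as dark, so flip the intensities
        arr = arr.max() - arr
    arr = arr - arr.min()
    peak = arr.max()
    if peak > 0:
        # Guard (assumption): skip the division for a constant image
        arr = arr / peak
    return (arr * 255).astype(np.uint8)

# Made-up 12-bit style values, only for illustration
raw = np.array([[0, 1024], [2048, 4095]])
scaled = to_uint8(raw)
inverted = to_uint8(raw, invert=True)
print(scaled)
print(inverted)
```

After scaling, the darkest pixel is always 0 and the brightest is always 255, so images from different scanners end up in the same range before being saved as JPG.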

In [None]:
os.makedirs(FINAL_TRAIN_PATH,exist_ok=True)
os.makedirs(FINAL_TEST_PATH,exist_ok=True)

#Preparing Training Image Data
for filepath in TRAIN_FILES:
    saveImage(filepath)

#Preparing Test Image Data
for filepath in TEST_FILES:
    saveImage(filepath,mainpath=FINAL_TEST_PATH,train=False)
    
# Ref : https://www.kaggle.com/xhlulu/siim-covid-19-convert-to-jpg-256px
!tar -zcf train.tar.gz -C "./dataset/train/" .
!tar -zcf test.tar.gz -C "./dataset/test/" .

<h1 style = "font-family:'Courier New';font-weight: bold;margin-top: 0px;margin-bottom: 1px;text-align: center;">Saving metadata of Images</h1>

In [None]:
# Saving metadata of images into csv files
meta_train = pd.DataFrame.from_dict({
    "ImageInstanceUID" : train_id,
    "width" : train_w,
    "height" : train_h
})
meta_train.to_csv("./meta_train.csv",index=False)

meta_test = pd.DataFrame.from_dict({
    "ImageInstanceUID" : test_id,
    "width" : test_w,
    "height" : test_h
})
meta_test.to_csv("./meta_test.csv",index=False)

In [None]:
# Sample metadata of training set
meta_train.head()

In [None]:
# Sample metadata of test set
meta_test.head()

In [None]:
classes_dict = {
    0 : "Negative for Pneumonia",
    1  : "Typical Appearance",
    2  : "Indeterminate Appearance",
    3  : "Atypical Appearance"
}


# Format of dictionary : {"image_id" : "study_id"}
id_dict = {}
for file in TRAIN_FILES:
    ls = file.split("/")
    id_dict[ls[-1].split(".")[0]] = ls[-3]
    
##-----------------------------------------
#getting filepath from study_id or image_id
##-----------------------------------------

def get_path(file_id,main_path,id_type):
    if id_type == "study":
        path = tf.io.gfile.glob(main_path+f"/{file_id}/*/*.dcm")[0]
    else:
        path = tf.io.gfile.glob(main_path+f"/*/*/{file_id}.dcm")[0]
    return path
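
The lookup above relies on the three-level layout `{study_id}/{sub_study_id}/{image_id}.dcm`. Here is a minimal sketch of the same idea using only the standard library. It builds a throwaway directory tree with made-up IDs (`study_a`, `sub_1`, `img_001` and so on are placeholders, not real UIDs) and uses `glob` in place of `tf.io.gfile.glob`. `find_image` is a hypothetical variant of `get_path`.

```python
import glob
import os
import tempfile

# Build a tiny fake tree: {study}/{sub-study}/{image}.dcm (all IDs are made up)
root = tempfile.mkdtemp()
layout = [
    ("study_a", "sub_1", "img_001"),
    ("study_a", "sub_2", "img_002"),
    ("study_b", "sub_3", "img_003"),
]
for study, sub, image in layout:
    os.makedirs(os.path.join(root, study, sub), exist_ok=True)
    open(os.path.join(root, study, sub, image + ".dcm"), "w").close()

files = sorted(glob.glob(os.path.join(root, "*", "*", "*.dcm")))

# Map image_id -> study_id, as id_dict does
image_to_study = {}
for path in files:
    parts = os.path.normpath(path).split(os.sep)
    image_id = os.path.splitext(parts[-1])[0]
    image_to_study[image_id] = parts[-3]

def find_image(image_id, main_path):
    """Return the path of one image, or None when nothing matches."""
    matches = glob.glob(os.path.join(main_path, "*", "*", image_id + ".dcm"))
    return matches[0] if matches else None

print(image_to_study)
```

Returning `None` instead of indexing `[0]` directly avoids an `IndexError` when an ID has no matching file.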

<h1 style = "font-family:'Courier New';font-weight: bold;margin-top: 0px;margin-bottom: 1px;text-align: center;">Preparing CSV file for Training</h1>

In [None]:
# Read CSV files
image_df = pd.read_csv("../input/siim-covid19-detection/train_image_level.csv")
study_df = pd.read_csv("../input/siim-covid19-detection/train_study_level.csv")

In [None]:
# Print sample data of image_level csv file
image_df.head(2)

In [None]:
# Print sample data of study_level csv file
study_df.head(2)

In [None]:
# Collapsing the 4 one-hot class columns into a class index (study_label) and class name (label_id), then dropping the original columns
study_df["one_hot"] = study_df.apply(lambda x : np.array([x["Negative for Pneumonia"],
                                                        x["Typical Appearance"],
                                                        x["Indeterminate Appearance"],
                                                        x["Atypical Appearance"]]),axis=1)

study_df["label_id"] = study_df["one_hot"].map(lambda x : classes_dict[np.argmax(x)])
study_df["study_label"] = study_df["one_hot"].map(lambda x : np.argmax(x))
study_df = study_df.drop(["Negative for Pneumonia","Typical Appearance","Indeterminate Appearance","Atypical Appearance","one_hot"],axis=1)
study_df.head(1)
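
To make the conversion concrete, the following sketch applies the same argmax step to a single made-up row using plain Python. `row_to_label` is a hypothetical helper, and the class order is copied from `classes_dict`.

```python
CLASS_NAMES = [
    "Negative for Pneumonia",
    "Typical Appearance",
    "Indeterminate Appearance",
    "Atypical Appearance",
]

def row_to_label(row):
    """Return (class index, class name) for a dict of one-hot study labels."""
    one_hot = [row[name] for name in CLASS_NAMES]
    index = one_hot.index(max(one_hot))  # like np.argmax: the first maximum wins
    return index, CLASS_NAMES[index]

# Made-up example row in the shape of train_study_level.csv
example = {
    "id": "example_study",
    "Negative for Pneumonia": 0,
    "Typical Appearance": 1,
    "Indeterminate Appearance": 0,
    "Atypical Appearance": 0,
}
result = row_to_label(example)
print(result)
```

On the full DataFrame, `study_df[CLASS_NAMES].to_numpy().argmax(axis=1)` should give the same indices without a row-wise `apply`, provided it runs before the class columns are dropped.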

In [None]:
# Stripping the "_image"/"_study" suffixes, then renaming "id" to "ImageInstanceUID" (image_df) and "StudyInstanceUID" (study_df)
image_df["id"] = image_df["id"].map(lambda x : x.replace("_image",""))
image_df.rename(columns={'id':"ImageInstanceUID",'label':"image_label"},inplace=True)

study_df["id"] = study_df["id"].map(lambda x : x.replace("_study",""))
study_df.rename(columns={"id" : "StudyInstanceUID"},inplace=True)

<h3>In the cell below I use the <i>ast</i> library to convert the <b>string values of the boxes column</b> into <b>lists of dictionaries</b></h3>

<h2 style="font-weight:'bold'; color:red; font-family:verdana; text-align: center;">What is ast library</h2>

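<h4>The ast (<b>Abstract Syntax Tree</b>) module helps Python applications process trees of the Python abstract syntax grammar. This notebook only uses <i>ast.literal_eval</i>, which safely evaluates a string containing a Python literal.</h4>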

<a href = "https://docs.python.org/3/library/ast.html" style="font-weight:'bold'; color:blue; font-family:monospace; text-align: center;"> Read more about AST here</a>

<img src="https://libcst.readthedocs.io/en/latest/_images/graphviz-d27e3495fa9bb130d76879db599060e8039a9fc5.png" height="500" width="500" style="display: block;margin-left: auto;margin-right: auto;"> 
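
As a small illustration of what `ast.literal_eval` does, the sketch below parses one string shaped like a `boxes` value. The numbers are made up, not taken from the dataset.

```python
import ast

# Made-up value in the same shape as the "boxes" column
boxes_str = "[{'x': 10.5, 'y': 20.0, 'width': 100.0, 'height': 50.0}]"

boxes = ast.literal_eval(boxes_str)  # parses Python literals only, never runs code
print(type(boxes).__name__, boxes[0]["width"])

# Anything that is not a plain literal is rejected with ValueError
try:
    ast.literal_eval("__import__('os').getcwd()")
    rejected = False
except ValueError:
    rejected = True
print(rejected)
```

Because the strings use single quotes, they are not valid JSON, so `json.loads` would fail on them; `ast.literal_eval` is a reasonable fit here.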

In [None]:
train_df = pd.merge(image_df,study_df,on = "StudyInstanceUID") # Merging study_df and image_df

# Merging with meta_train for the original height and width; skipped if meta_train is unavailable
try:
    train_df = pd.merge(train_df,meta_train,on = "ImageInstanceUID")
except (NameError, KeyError):
    pass

# Filling NaN values (rows without boxes) with a 1x1 placeholder box
train_df["boxes"] = train_df["boxes"].fillna("[{'x':0,'y':0,'width':1,'height':1}]")
temp = train_df # alias (not a copy) for going through the data
train_df["boxes"] = train_df["boxes"].map(ast.literal_eval) # string -> list of dicts

# Keeping columns in a fixed order; height and width exist only if the merge above succeeded
columns = ["ImageInstanceUID","StudyInstanceUID","label_id","study_label","height","width","boxes","image_label"]
train_df = train_df[[col for col in columns if col in train_df.columns]]

train_df.to_csv("./train.csv",index=False)
train_df.head()
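
One likely use of the stored `width` and `height` (an assumption, since the training notebook is not shown here) is rescaling the boxes from original-image pixels to the resized 640x640 images. A minimal sketch with made-up numbers follows; `scale_box` is a hypothetical helper.

```python
def scale_box(box, orig_w, orig_h, new_w=640, new_h=640):
    """Rescale one {'x','y','width','height'} box from the original size to the resized one."""
    sx = new_w / orig_w
    sy = new_h / orig_h
    return {
        "x": box["x"] * sx,
        "y": box["y"] * sy,
        "width": box["width"] * sx,
        "height": box["height"] * sy,
    }

# Made-up box on a made-up 1280x2560 (width x height) image
box = {"x": 320.0, "y": 640.0, "width": 640.0, "height": 1280.0}
scaled_box = scale_box(box, orig_w=1280, orig_h=2560)
print(scaled_box)
```

Width and height get separate scale factors because resizing to a fixed square does not preserve the aspect ratio of the original image.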

In [None]:
import shutil
shutil.rmtree("./dataset",ignore_errors=False)
