# EDA, Visualization and Some Level of Prediction: SIIM-FISABIO-RSNA COVID-19 Detection
In this notebook, we will walk step by step through different approaches to identifying COVID-19 abnormalities on chest radiographs and telling them apart from other significant pneumonias, such as bacterial ones.

### Problem Statement:
Currently, COVID-19 can be diagnosed via polymerase chain reaction to detect genetic material from the virus or chest radiograph. However, it can take a few hours and sometimes days before the molecular test results are back. By contrast, chest radiographs can be obtained in minutes. While guidelines exist to help radiologists differentiate COVID-19 from other types of infection, their assessments vary. In addition, non-radiologists could be supported with better localization of the disease, such as with a visual bounding box.

In this competition, you’ll identify and localize COVID-19 abnormalities on chest radiographs. In particular, you'll categorize the radiographs as negative for pneumonia or typical, indeterminate, or atypical for COVID-19. You and your model will work with imaging data and annotations from a group of radiologists.

If successful, you'll help radiologists diagnose the millions of COVID-19 patients more confidently and quickly. This will also enable doctors to see the extent of the disease and help them make decisions regarding treatment. Depending upon severity, affected patients may need hospitalization, admission into an intensive care unit, or supportive therapies like mechanical ventilation. As a result of better diagnosis, more patients will quickly receive the best care for their condition, which could mitigate the most severe effects of the virus.
(Copied from Overview Section)

### Dataset Information: Understand your Data First
The train dataset comprises 6,334 chest scans in DICOM format, which were de-identified to protect patient privacy. All images were labeled by a panel of experienced radiologists for the presence of opacities as well as overall appearance.

Note that all images are stored in paths with the form study/series/image. The study ID here relates directly to the study-level predictions, and the image ID is the ID used for image-level predictions.
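
To make the study/series/image layout concrete, here is a minimal sketch that splits such a path into its three IDs. The helper name `split_dicom_path` and the example path are made up for illustration; only the folder layout comes from the description above.

```python
from pathlib import PurePosixPath

def split_dicom_path(path):
    """Return (study_id, series_id, image_id) for a path ending in study/series/image.dcm."""
    p = PurePosixPath(path)
    # The last three components are the study folder, the series folder and the image file.
    study_id, series_id = p.parts[-3], p.parts[-2]
    image_id = p.stem  # file name without the .dcm extension
    return study_id, series_id, image_id

# Hypothetical IDs, not taken from the dataset
example = "train/study_abc/series_def/image_ghi.dcm"
print(split_dicom_path(example))
```

The study ID is what the study-level labels are keyed on, and the image ID is what the image-level labels are keyed on.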

The hidden test dataset is of roughly the same scale as the training dataset.

**Before we dive into the EDA, let's take a closer look at the data.
Here are the files that have been provided:**

* train_study_level.csv - the train study-level metadata, with one row for each study, including correct labels.
* train_image_level.csv - the train image-level metadata, with one row for each image, including both correct labels and any bounding boxes in a dictionary format. Some images in both test and train have multiple bounding boxes.
* sample_submission.csv - a sample submission file containing all image- and study-level IDs.

**Columns**

**train_study_level.csv**
* id - unique study identifier
* Negative for Pneumonia - 1 if the study is negative for pneumonia, 0 otherwise
* Typical Appearance - 1 if the study has this appearance, 0 otherwise
* Indeterminate Appearance  - 1 if the study has this appearance, 0 otherwise
* Atypical Appearance  - 1 if the study has this appearance, 0 otherwise

**train_image_level.csv**
* id - unique image identifier
* boxes - bounding boxes in easily-readable dictionary format
* label - the correct prediction label for the provided bounding boxes

Here is the evaluation metric:
<br>Standard PASCAL VOC 2010 mean Average Precision (mAP) at IoU > 0.5
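
The IoU (intersection over union) in that threshold measures how much a predicted box overlaps a ground-truth box. Below is a minimal sketch of the IoU calculation only, not of the full mAP metric. The function name `iou` and the example boxes are illustrative, and boxes are assumed to be in (xmin, ymin, xmax, ymax) order.

```python
def iou(box_a, box_b):
    """Intersection over union of two (xmin, ymin, xmax, ymax) boxes."""
    # Corners of the overlapping rectangle (if there is one)
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])
    # Width and height are clamped at zero when the boxes do not overlap
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

predicted = (0, 0, 10, 10)
truth = (5, 0, 15, 10)
print(iou(predicted, truth))  # overlap 50, union 150
```

With these two boxes the IoU is one third, so this prediction would not count as a match at the 0.5 threshold.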

### Time to Import the Basic Libraries and Datasets

Reference: I took help from:
https://www.kaggle.com/tanlikesmath/siim-covid-19-detection-a-simple-eda

In [None]:
!conda install -c conda-forge gdcm -y

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import os
from os import listdir, mkdir

import pydicom
import scipy.ndimage
import gdcm

import glob

from skimage import measure
from mpl_toolkits.mplot3d.art3d import Poly3DCollection
from skimage.morphology import disk, opening, closing
from tqdm.notebook import tqdm

from IPython.display import HTML
from PIL import Image

from pydicom.pixel_data_handlers.util import apply_voi_lut
from skimage import exposure
import cv2

import warnings

import vtk
from vtk.util import numpy_support
from fastai.vision.all import *
from fastai.medical.imaging import *

reader = vtk.vtkDICOMImageReader()

warnings.filterwarnings("ignore", category=DeprecationWarning)
warnings.filterwarnings("ignore", category=UserWarning)
warnings.filterwarnings("ignore", category=FutureWarning)

In [None]:
listdir("../input/")

In [None]:
# Time to have a quick look at the CSV files that are provided in the competition
train_study_level = pd.read_csv("../input/siim-covid19-detection/train_study_level.csv") # let's see the train_study_level.csv file
train_image_level = pd.read_csv("../input/siim-covid19-detection/train_image_level.csv") # let's see the train_image_level.csv file

train_study_level.head(15) # I love seeing little more rows than just typical 5 rows for no reason lol

The 15 rows above come from train_study_level.csv. Besides the identifier, the file has four class columns, and those four values are what we will predict as we move forward with the model:
* **id** - unique study identifier
* **Negative for Pneumonia** - 1 if the study is negative for pneumonia, 0 otherwise
* **Typical Appearance** - 1 if the study has this appearance, 0 otherwise
* **Indeterminate Appearance** - 1 if the study has this appearance, 0 otherwise
* **Atypical Appearance** - 1 if the study has this appearance, 0 otherwise
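
Since the four class columns are 0/1 flags, they can be collapsed into a single label column when that is more convenient. The sketch below does this on a small made-up DataFrame (the `id` values and the `study_label` column name are hypothetical). It assumes every study has exactly one of the four columns set to 1.

```python
import pandas as pd

study_classes = ['Negative for Pneumonia', 'Typical Appearance',
                 'Indeterminate Appearance', 'Atypical Appearance']

# Toy rows in the same layout as train_study_level.csv (values are made up)
toy = pd.DataFrame({
    'id': ['a_study', 'b_study', 'c_study'],
    'Negative for Pneumonia': [1, 0, 0],
    'Typical Appearance': [0, 1, 0],
    'Indeterminate Appearance': [0, 0, 0],
    'Atypical Appearance': [0, 0, 1],
})

# idxmax along the columns returns the name of the column holding the 1
toy['study_label'] = toy[study_classes].idxmax(axis=1)
print(toy[['id', 'study_label']])
```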

In [None]:
train_image_level.head(15) # Want to look at the same amount of rows as above, no compromise

Here is what we see in the train_image_level.csv file above:
* **id** - unique identifier for each image
* **boxes** - bounding boxes of the image, kept in an easily-readable dictionary format
* **label** - the correct prediction label for the provided bounding boxes

The file also has a `StudyInstanceUID` column, which the code further down uses to find each image's study folder.
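
The `boxes` column is read by pandas as a plain string (or NaN when there is no box), so it has to be parsed before use. The sketch below assumes each cell is the text of a Python list of dictionaries with `x`, `y`, `width` and `height` keys. That key layout is an assumption here, so check it against the `head()` output above. The helper name `parse_boxes` and the example cell are made up.

```python
import ast
import math

def parse_boxes(cell):
    """Turn one `boxes` cell into a list of dicts; missing values become an empty list."""
    if cell is None or (isinstance(cell, float) and math.isnan(cell)):
        return []
    # literal_eval safely evaluates the string as a Python literal
    return ast.literal_eval(cell)

# Made-up cell using the assumed x / y / width / height keys
example_cell = "[{'x': 10.0, 'y': 20.0, 'width': 30.0, 'height': 40.0}]"
boxes = parse_boxes(example_cell)
print(len(boxes), boxes[0]['width'])
print(parse_boxes(float('nan')))
```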

#### Further:
In `label`, every entry starts with a class name, either `opacity` or `none`, followed by a confidence score and the bounding box coordinates. Since there are only two classes, we will create an independent feature for the class as we move forward.
Additionally, we will split out the confidence score and the bounding box. If `boxes` is NaN (there is no xmin, ymin, xmax, ymax), the label is `none 1 0 0 1 1`, meaning class `none`, confidence 1 and a placeholder box `0 0 1 1`. Otherwise, the label holds one `opacity` entry per box.

Let's continue exploring

In [None]:
study_classes = ['Negative for Pneumonia', 'Typical Appearance', 'Indeterminate Appearance', 'Atypical Appearance']
np.unique(train_study_level[study_classes].values, axis=0)

In [None]:
# Checking the distribution of each study class
plt.figure(figsize=(10, 8))
plt.bar([1,2,3,4], train_study_level[study_classes].values.sum(axis=0))
plt.xticks([1,2,3,4], study_classes)
plt.ylabel('Frequency')
plt.show()

In [None]:
# Time to look at the image data
train_image_level.head()

The label column is a sequence of entries, each made up of a class ID, a confidence score and a bounding box:

class ID: either opacity or none

confidence score: confidence from a neural network model, 1 if the class ID is none

bounding box: the usual xmin ymin xmax ymax format. If the class ID is none, the whole entry reads `none 1 0 0 1 1`, so the placeholder box is `0 0 1 1`
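
As a small illustration of that format, the sketch below turns one label string into a list of typed records, six tokens per entry. The function name `parse_label` and the example strings are made up.

```python
def parse_label(label):
    """Split a label string into (class, confidence, xmin, ymin, xmax, ymax) records."""
    tokens = label.split()
    records = []
    # Every entry takes exactly six whitespace-separated tokens
    for start in range(0, len(tokens), 6):
        cls, conf, xmin, ymin, xmax, ymax = tokens[start:start + 6]
        records.append((cls, float(conf), float(xmin), float(ymin), float(xmax), float(ymax)))
    return records

print(parse_label("none 1 0 0 1 1"))
print(parse_label("opacity 1 10 20 30 40 opacity 1 50 60 70 80"))
```

Converting the numbers to `float` once here avoids repeating the conversion every time a coordinate is used.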

In [None]:
# Checking the distribution of opacity vs none
train_image_level['split_label'] = train_image_level.label.apply(lambda x: [x.split()[offs:offs+6] for offs in range(0, len(x.split()), 6)])

# Collect the class ID (first token) of every label entry
classes_freq = [entry[0] for entries in train_image_level['split_label'] for entry in entries]
plt.hist(classes_freq)
plt.ylabel('Frequency')
plt.show()

Time to check the distribution of the bounding box areas

In [None]:
bbox_areas = []

for i in range(len(train_image_level)):
    for j in train_image_level.iloc[i].split_label:
        # j = [class, confidence, xmin, ymin, xmax, ymax]; area = width * height
        bbox_areas.append((float(j[4])-float(j[2]))*(float(j[5])-float(j[3])))
plt.hist(bbox_areas)
plt.ylabel('Frequency')
plt.show()
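
For reference, the area of a box is its width (xmax - xmin) times its height (ymax - ymin). The sketch below shows that on made-up records. It also skips `none` entries, whose 1x1 placeholder box is not a real annotation. That filtering is a suggestion of this sketch, not something the cell above does, and the name `box_area` is illustrative.

```python
def box_area(xmin, ymin, xmax, ymax):
    """Area of an (xmin, ymin, xmax, ymax) box, clamped at zero for degenerate boxes."""
    return max(0.0, xmax - xmin) * max(0.0, ymax - ymin)

# Made-up records in (class, confidence, xmin, ymin, xmax, ymax) order
records = [
    ("opacity", 1.0, 100.0, 200.0, 400.0, 600.0),
    ("none", 1.0, 0.0, 0.0, 1.0, 1.0),
]

# Keep only real opacity boxes so the placeholder does not distort the histogram
areas = [box_area(*rec[2:]) for rec in records if rec[0] == "opacity"]
print(areas)  # 300 wide * 400 high
```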

#### Let's Start Exploring Image Dataset Now!


In [None]:
def dicom2array(path, voi_lut=True, fix_monochrome=True): # Converting pixel data to array
    # dcmread is the current pydicom reader; read_file is its deprecated alias
    dicom = pydicom.dcmread(path)
    # voi_lut is used to transform raw DICOM data to "human-friendly" view
    if voi_lut:
        data = apply_voi_lut(dicom.pixel_array, dicom)
    else:
        data = dicom.pixel_array
        
    # depending on this value, X-ray may look inverted - fix that:
    if fix_monochrome and dicom.PhotometricInterpretation == "MONOCHROME1":
        data = np.amax(data) - data # np.amax returns the maximum of an array or maximum along the axis (if mentioned)
    data = data - np.min(data) # Actual value - minimum value of pixel array
    data = data / np.max(data) # Actual value / maximum value of pixel array
    data = (data * 255).astype(np.uint8)
    return data
        
def plot_img(img, size=(6, 6), is_rgb=True, title="", cmap='gray'):
    plt.figure(figsize=size)
    plt.imshow(img, cmap=cmap)
    plt.suptitle(title)
    plt.show()


def plot_imgs(imgs, cols=4, size=6, is_rgb=True, title="", cmap='gray', img_size=(500,500)):
    rows = max(1, (len(imgs) + cols - 1)//cols) # ceiling division, so a full last row does not add an empty one
    fig = plt.figure(figsize=(cols*size, rows*size))
    for i, img in enumerate(imgs):
        if img_size is not None:
            img = cv2.resize(img, img_size)
        fig.add_subplot(rows, cols, i+1)
        plt.imshow(img, cmap=cmap)
    plt.suptitle(title)
    plt.show()
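
To see what the inversion and rescaling steps in `dicom2array` do to the pixel values, here is a minimal sketch on a tiny synthetic array, so it runs without any DICOM file. The helper name `to_uint8` is made up, and the guard against a constant image (division by zero) is an addition of this sketch.

```python
import numpy as np

def to_uint8(data, invert=False):
    """Min-max scale an array to 0..255, optionally inverting it first (as for MONOCHROME1)."""
    data = data.astype(np.float64)
    if invert:
        data = data.max() - data  # flip bright and dark
    data = data - data.min()      # shift so the minimum is 0
    peak = data.max()
    if peak > 0:                  # avoid dividing by zero on a constant image
        data = data / peak        # scale so the maximum is 1
    return (data * 255).astype(np.uint8)

raw = np.array([[100, 300], [500, 900]], dtype=np.uint16)
print(to_uint8(raw))
print(to_uint8(raw, invert=True))
```

After scaling, the darkest pixel is always 0 and the brightest 255, whatever the original bit depth of the scan was.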

In [None]:
data_path = Path('../input/siim-covid19-detection')

# Let's look at a sample of 16 images for a better understanding of the data
dicom_paths = get_dicom_files(data_path/'train')
imgs = [dicom2array(path) for path in dicom_paths[:16]]
plot_imgs(imgs)

Let's analyze how many images are available per study

In [None]:
# Looking at the image data per study
images_per_study = []
for study_dir in (data_path/'train').ls():
    # List the DICOM files once per study instead of three times
    n_images = len(get_dicom_files(study_dir))
    images_per_study.append(n_images)
    if n_images > 5:
        print(f'Study {study_dir} had {n_images} images')

plt.hist(images_per_study)
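
The same count can also be taken from a table that maps each image to its study, which avoids walking the file system. The sketch below uses made-up IDs in a plain dictionary. In the notebook, the `StudyInstanceUID` column of train_image_level could play that role, assuming it lists every image.

```python
from collections import Counter

# Made-up mapping from image ID to study ID
image_to_study = {
    "img1": "studyA",
    "img2": "studyA",
    "img3": "studyB",
}

# Count how many images belong to each study
images_per_study_count = Counter(image_to_study.values())

# Studies with more than one image
multi_image_studies = {s: n for s, n in images_per_study_count.items() if n > 1}
print(images_per_study_count)
print(multi_image_studies)
```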

In [None]:
# Finding the file path of each image by searching its study folder
def image_path(row):
    study_path = data_path/'train'/row.StudyInstanceUID
    for i in get_dicom_files(study_path):
        if row.id.split('_')[0] == i.stem: return i 
        
train_image_level['image_path'] = train_image_level.apply(image_path, axis=1)

# Have a look at a few
train_image_level['image_path'].head()

In [None]:
# Initialize list for images
imgs = []
# set up image path values
image_paths = train_image_level['image_path'].values

# Box line thickness, and the factor by which images are shrunk before drawing
thickness = 10
scale = 5

for _ in range(8):
    # sample_path avoids shadowing the image_path() function defined earlier
    sample_path = random.choice(image_paths)
    print(sample_path)
    img = dicom2array(path=sample_path)
    img = cv2.resize(img, None, fx=1/scale, fy=1/scale, interpolation=cv2.INTER_CUBIC)
    img = np.stack([img, img, img], axis=-1)
    # Each entry is [class, confidence, xmin, ymin, xmax, ymax]
    for entry in train_image_level.loc[train_image_level['image_path'] == sample_path].split_label.values[0]:
        if entry[0] == 'opacity':
            img = cv2.rectangle(img,
                                (int(float(entry[2])/scale), int(float(entry[3])/scale)),
                                (int(float(entry[4])/scale), int(float(entry[5])/scale)),
                                [255,0,0], thickness)
    
    img = cv2.resize(img, (500,500))
    imgs.append(img)

# Plotting Images
plot_imgs(imgs, cmap=None)
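
The drawing loop divides every box coordinate by the same factor used to shrink the image, which is what keeps the boxes aligned with the lungs. A minimal sketch of that step on its own, with a made-up box and the hypothetical helper name `scale_box`:

```python
def scale_box(box, scale):
    """Shrink an (xmin, ymin, xmax, ymax) box by `scale` and round down to whole pixels."""
    return tuple(int(v / scale) for v in box)

original_box = (1000.0, 500.0, 2000.0, 1500.0)  # made-up coordinates in full-size pixels
print(scale_box(original_box, 5))
```

Because the boxes are drawn before the final resize to 500x500, that last resize needs no further coordinate adjustment.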

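Now it's all about the submission. Nothing to worry about for now: the cells below simply load the sample submission and write it back out as a placeholder. I'm going to keep improving the notebook every day and will see how far I can go! Learning every day! :)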

In [None]:
submission_df = pd.read_csv(data_path/'sample_submission.csv')

In [None]:
submission_df.head()

In [None]:
submission_df.iloc[2000:2010]

In [None]:
submission_df.to_csv('submission.csv', index=False)
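
Since the sample submission holds both study-level and image-level IDs, it can help to separate the two before filling in predictions. The sketch below assumes the IDs end in `_study` or `_image`. That suffix convention is an assumption here (the `row.id.split('_')[0]` call earlier hints at it for image IDs), so verify it against the `head()` output above. The IDs and the helper name `id_level` are made up.

```python
def id_level(row_id):
    """Split an ID like 'abc123_study' into its base ID and its level suffix."""
    base, _, suffix = row_id.rpartition("_")
    return base, suffix

# Made-up IDs following the assumed suffix convention
ids = ["abc123_study", "def456_image"]
study_ids = [i for i in ids if id_level(i)[1] == "study"]
image_ids = [i for i in ids if id_level(i)[1] == "image"]
print(study_ids, image_ids)
```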