<br>
<h1 style = "font-size:60px; font-family:Garamond ; font-weight : normal; background-color: #f6f5f5 ; color : #fe346e; text-align: center; border-radius: 200px 200px;"> SIIM COVID-19 Detection: Complete EDA   <br> Exploratory Data Analysis 🧐 & Modeling</h1>
<br>

# Objective 
In this competition, we are identifying and localizing COVID-19 abnormalities on chest radiographs. This is an object detection and classification problem.

For each test image, you will be predicting a bounding box and class for all findings. If you predict that there are no findings, you should create a prediction of "none 1 0 0 1 1" ("none" is the class ID for no finding, and this provides a one-pixel bounding box with a confidence of 1.0).

Further, for each test study, you should make a determination within the following labels:

'Negative for Pneumonia', 'Typical Appearance', 'Indeterminate Appearance', 'Atypical Appearance'

To make a prediction of one of the above labels, create a prediction string similar to the "none" class above, e.g. "atypical 1 0 0 1 1".

Please see the Evaluation page for more details about formatting predictions.
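
The prediction strings described above can be assembled with a small helper. This is a minimal sketch, not the official submission code: the function names `format_prediction` and `format_image_predictions` are hypothetical, and the `opacity` class name and box values in the usage example are assumptions made for illustration. The Evaluation page remains the authority on the exact format.

```python
def format_prediction(class_id, confidence, box):
    """Build one 'class confidence xmin ymin xmax ymax' entry."""
    xmin, ymin, xmax, ymax = box
    return f"{class_id} {confidence:g} {xmin:g} {ymin:g} {xmax:g} {ymax:g}"


def format_image_predictions(findings):
    """Join all findings for one image; fall back to the 'none' string."""
    if not findings:
        # No findings: one-pixel box with confidence 1, as described above
        return format_prediction("none", 1, (0, 0, 1, 1))
    return " ".join(format_prediction(*finding) for finding in findings)


# Image with no findings
no_finding = format_image_predictions([])

# Study-level prediction uses the same one-pixel box convention
study_pred = format_prediction("atypical", 1, (0, 0, 1, 1))

# Hypothetical image-level finding (class name and coordinates are made up)
one_box = format_image_predictions([("opacity", 0.5, (10, 20, 110, 220))])
```

Keeping the formatting in one place makes it harder to mix up the order of confidence and coordinates when writing the submission file.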

The images are in DICOM format, which means they contain additional data that might be useful for visualizing and classifying.

# Dataset information

The train dataset comprises 6,334 chest scans in DICOM format, which were de-identified to protect patient privacy. All images were labeled by a panel of experienced radiologists for the presence of opacities as well as overall appearance.

Note that all images are stored in paths with the form study/series/image. The study ID here relates directly to the study-level predictions, and the image ID is the ID used for image-level predictions.
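
Since the CSV ids carry a `_study` or `_image` suffix while the folders and files do not, it helps to convert between the two forms. The sketch below assumes the `study/series/image.dcm` layout described above; the helper names and the ids in the usage lines are hypothetical.

```python
from pathlib import PurePosixPath


def strip_suffix(raw_id):
    """Drop a trailing '_study' or '_image' part: 'abc_image' -> 'abc'."""
    return raw_id.rsplit("_", 1)[0]


def parse_dicom_path(path):
    """Split '.../study/series/image.dcm' into its three ids."""
    p = PurePosixPath(path)
    return {"study": p.parts[-3], "series": p.parts[-2], "image": p.stem}


# Hypothetical ids, used only to show the shapes involved
image_name = strip_suffix("imageC_image")
parts = parse_dicom_path("train/studyA/seriesB/imageC.dcm")
```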

The hidden test dataset is of roughly the same scale as the training dataset.

### Files
* train_study_level.csv - the train study-level metadata, with one row for each study, including correct labels.
* train_image_level.csv - the train image-level metadata, with one row for each image, including both correct labels and any bounding boxes in a dictionary format. Some images in both test and train have multiple bounding boxes.
* sample_submission.csv - a sample submission file containing all image- and study-level IDs.

### Columns
#### train_study_level.csv

* id - unique study identifier
* Negative for Pneumonia - 1 if the study is negative for pneumonia, 0 otherwise
* Typical Appearance - 1 if the study has this appearance, 0 otherwise
* Indeterminate Appearance  - 1 if the study has this appearance, 0 otherwise
* Atypical Appearance  - 1 if the study has this appearance, 0 otherwise

#### train_image_level.csv

* id - unique image identifier
* boxes - bounding boxes in easily-readable dictionary format
* label - the correct prediction label for the provided bounding boxes
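
The `boxes` column is stored as text, so each entry has to be parsed before use. The following is a minimal sketch under two assumptions: each box is a dictionary with `x`, `y`, `width` and `height` keys, and rows without findings hold a missing value. The helper names and the sample string are hypothetical.

```python
import ast
import math


def parse_boxes(raw):
    """Return a list of box dicts; missing values (None, NaN, '') give []."""
    if raw is None or raw == "":
        return []
    if isinstance(raw, float) and math.isnan(raw):
        return []
    return ast.literal_eval(raw)


def to_corners(box):
    """Convert an x/y/width/height box to (xmin, ymin, xmax, ymax)."""
    return (box["x"], box["y"],
            box["x"] + box["width"], box["y"] + box["height"])


# Made-up sample in the dictionary format described above
sample = "[{'x': 10.0, 'y': 20.0, 'width': 100.0, 'height': 50.0}]"
corners = [to_corners(b) for b in parse_boxes(sample)]
```

`ast.literal_eval` only evaluates Python literals, which makes it a safer choice than `eval` for strings read from a file.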

In [None]:
!pip install -q sweetviz
!pip install -q klib

In [None]:


import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)


import matplotlib.pyplot as plt
import matplotlib
import seaborn as sns
import pydicom as dicom
import cv2
import ast
import os


import plotly.express as px
import plotly.graph_objects as go
import plotly.figure_factory as ff
from plotly.subplots import make_subplots
from plotly.offline import init_notebook_mode, iplot
init_notebook_mode(connected=True)


import sweetviz
import klib


import tensorflow as tf


import warnings
warnings.filterwarnings("ignore")

In [None]:
AUTO = tf.data.experimental.AUTOTUNE

try:
    tpu = tf.distribute.cluster_resolver.TPUClusterResolver()  
    print('Running on TPU ', tpu.master())
except ValueError:
    tpu = None

if tpu:
    tf.config.experimental_connect_to_cluster(tpu)
    tf.tpu.experimental.initialize_tpu_system(tpu)
    strategy = tf.distribute.experimental.TPUStrategy(tpu)
else:
    strategy = tf.distribute.get_strategy()

print("REPLICAS: ", strategy.num_replicas_in_sync)

In [None]:
PATH='/kaggle/input/siim-covid19-detection/'
TRAIN_PATH = "../input/siim-covid19-detection/train"
#TEST_PATH = GCS_DS_PATH + "/test"
TEST_PATH = "../input/siim-covid19-detection/test"
TRAIN_FILES = tf.io.gfile.glob(TRAIN_PATH+"/*/*/*.dcm")
TEST_FILES = tf.io.gfile.glob(TEST_PATH+"/*/*/*.dcm")

In [None]:
df_train_img=pd.read_csv(PATH+'train_image_level.csv')
df_train_study=pd.read_csv(PATH+'train_study_level.csv')

In [None]:
classes_dict = {
    0 : "Negative for Pneumonia",
    1  : "Typical Appearance",
    2  : "Indeterminate Appearance",
    3  : "Atypical Appearance"
}

In [None]:
#getting filepath from study_id or image_id
def get_path(file_id,main_path,id_type):
    name = file_id.split("_")[0]
    if id_type == "study":
        path = tf.io.gfile.glob(main_path+f"/{name}/*/*.dcm")[0]
    else:
        path = tf.io.gfile.glob(main_path+f"/*/*/{name}.dcm")[0]
    return path

In [None]:
df_train_img.head()

In [None]:
print('Total images available for model training     :- ', df_train_img['id'].nunique())
print('Total images without bounding boxes (no finding) :- ', df_train_img['boxes'].isnull().sum())

In [None]:
df_train_img.loc[0, 'StudyInstanceUID']

In [None]:
df_train_img

In [None]:
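# Locate the first training image; the series folder is found by globbing instead of being hardcoded
study_dir = PATH+'train/'+df_train_img.loc[0, 'StudyInstanceUID']
img_id = df_train_img.loc[0, 'id'].replace('_image', '.dcm')
dcm_path = tf.io.gfile.glob(study_dir+'/*/'+img_id)[0]
data_file = dicom.dcmread(dcm_path)
img = data_file.pixel_array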

## What is in a DICOM file?

Digital Imaging and Communications in Medicine (DICOM) is the standard for the communication and management of medical imaging information and related data. DICOM is most commonly used for storing and transmitting medical images, enabling the integration of medical imaging devices such as scanners, servers, workstations, printers, network hardware, and picture archiving and communication systems (PACS) from multiple manufacturers. It has been widely adopted by hospitals and is making inroads into smaller applications like dentists' and doctors' offices.

DICOM files can be exchanged between two entities that are capable of receiving image and patient data in DICOM format. The different devices come with DICOM Conformance Statements which state which DICOM classes they support. The standard includes a file format definition and a network communications protocol that uses TCP/IP to communicate between systems.

The National Electrical Manufacturers Association (NEMA) holds the copyright to the published standard, which was developed by the DICOM Standards Committee, some of whose members are also members of NEMA. It is also known as NEMA standard PS3, and as ISO standard 12052:2017 "Health informatics -- Digital imaging and communication in medicine (DICOM) including workflow and data management".
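
Because a DICOM file bundles header attributes with the pixel data, it is often useful to pull a few header fields into a plain dictionary before looking at the image. The sketch below is an illustration under stated assumptions: `summarize_header` is a hypothetical helper, the chosen field names are standard DICOM keywords that may or may not be present in a given file, and a `SimpleNamespace` with made-up values stands in for a dataset returned by `pydicom.dcmread`, which exposes its elements as attributes in the same way.

```python
from types import SimpleNamespace

# Standard DICOM keywords that are often relevant for chest radiographs
FIELDS = ("Modality", "BodyPartExamined", "PhotometricInterpretation",
          "Rows", "Columns", "BitsStored")


def summarize_header(ds, fields=FIELDS):
    """Collect selected header attributes; absent ones become None."""
    return {name: getattr(ds, name, None) for name in fields}


# Stand-in object with made-up values, used instead of a real DICOM dataset
fake_ds = SimpleNamespace(Modality="DX", Rows=2320, Columns=2832,
                          PhotometricInterpretation="MONOCHROME2")
summary = summarize_header(fake_ds)
```

`PhotometricInterpretation` is worth checking in particular: `MONOCHROME1` images store low values as bright, so they need to be inverted before being displayed with a standard grayscale colormap.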

In [None]:
data_file

# Shape of the Image

In [None]:
print('Image shape:', img.shape)

In [None]:
boxes = ast.literal_eval(df_train_img.loc[0, 'boxes'])
boxes

In [None]:
fig, ax = plt.subplots(1, 1, figsize=(20, 4))

for box in boxes:
    p = matplotlib.patches.Rectangle((box['x'], box['y']), box['width'], box['height'],
                                     ec='r', fc='none', lw=2.)
    ax.add_patch(p)
ax.imshow(img, cmap='gray')
plt.show()

## Let's see more samples

In [None]:
fig, axs = plt.subplots(3, 3, figsize=(20, 20))
fig.subplots_adjust(hspace = .1, wspace=.1)
axs = axs.ravel()

for row in range(9):
    # Load the DICOM file that matches this row's image id, so the boxes belong to the image shown
    study = df_train_img.loc[row, 'StudyInstanceUID']
    img_id = df_train_img.loc[row, 'id'].replace('_image', '.dcm')
    path_file = tf.io.gfile.glob(PATH+'train/'+study+'/*/'+img_id)[0]

    data_file = dicom.dcmread(path_file)
    img = data_file.pixel_array
    # Rows without findings hold NaN in 'boxes'; draw rectangles only when boxes exist
    if pd.notna(df_train_img.loc[row, 'boxes']):
        boxes = ast.literal_eval(df_train_img.loc[row, 'boxes'])

        for box in boxes:
            p = matplotlib.patches.Rectangle((box['x'], box['y']), box['width'], box['height'],
                                     ec='r', fc='none', lw=2.)
            axs[row].add_patch(p)
    axs[row].imshow(img, cmap='gray')
    axs[row].set_title(df_train_img.loc[row, 'label'].split(' ')[0])
    axs[row].set_xticklabels([])
    axs[row].set_yticklabels([])

In [None]:
def split_label(s):
    return s.split(' ')[0]

In [None]:
df_train_img['label_name'] = df_train_img['label'].apply(split_label)
df_train_img['label_name'].value_counts()

In [None]:
df_train_study

In [None]:
klib.missingval_plot(df_train_study)

# Distribution of Pneumonia Symptoms

In [None]:
fig, ax = plt.subplots(2,2,figsize=(20,16))
# `fill=True` replaces the deprecated `shade=True` argument of sns.kdeplot
sns.kdeplot(df_train_study["Negative for Pneumonia"], fill=True,ax=ax[0,0],color="#ffb4a2")
ax[0,0].set_title("Negative for Pneumonia Distribution",font="Serif", fontsize=15)
sns.kdeplot(df_train_study["Typical Appearance"], fill=True,ax=ax[0,1],color="#e5989b")
ax[0,1].set_title("Typical Appearance Distribution",font="Serif", fontsize=15)
sns.kdeplot(df_train_study["Indeterminate Appearance"], fill=True,ax=ax[1,0],color="#b5838d")
ax[1,0].set_title("Indeterminate Appearance Distribution",font="Serif", fontsize=15)
sns.kdeplot(df_train_study["Atypical Appearance"], fill=True,ax=ax[1,1],color="#6d6875")
ax[1,1].set_title("Atypical Appearance Distribution",font="Serif", fontsize=15)
plt.show()

In [None]:
df_train_img.head()

In [None]:
df_train_study['StudyInstanceUID'] = df_train_study['id'].apply(lambda x: x.replace('_study', ''))

In [None]:
df_train_img = df_train_img.merge(df_train_study[['Negative for Pneumonia', 'Typical Appearance','Indeterminate Appearance', 'Atypical Appearance','StudyInstanceUID']], on='StudyInstanceUID')

In [None]:
df_train_img

# Parallel categories plot of targets

In [None]:
# Plotly creates its own figure, so no matplotlib figure is needed here
fig = px.parallel_categories(df_train_img[['Negative for Pneumonia', 'Typical Appearance',
       'Indeterminate Appearance', 'Atypical Appearance']],
                             color="Negative for Pneumonia", color_continuous_scale="sunset",
                             title="Parallel categories plot of targets")
fig

#### Observation:
    Negative for Pneumonia has a share of the other symptoms
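
One way to check an observation like this is to count, for each study, how many of the four label columns are set to 1. The sketch below runs on hypothetical rows rather than the real table, and `count_positive_labels` is an illustrative helper, not part of the notebook.

```python
from collections import Counter

LABELS = ["Negative for Pneumonia", "Typical Appearance",
          "Indeterminate Appearance", "Atypical Appearance"]


def count_positive_labels(row):
    """Number of label columns equal to 1 in a single study row."""
    return sum(int(row[name]) for name in LABELS)


# Hypothetical rows, for illustration only
rows = [
    {"Negative for Pneumonia": 1, "Typical Appearance": 0,
     "Indeterminate Appearance": 0, "Atypical Appearance": 0},
    {"Negative for Pneumonia": 0, "Typical Appearance": 1,
     "Indeterminate Appearance": 0, "Atypical Appearance": 0},
]

# Maps "number of positive labels" to "number of studies"
label_count_histogram = Counter(count_positive_labels(r) for r in rows)
```

On the real data, `df_train_study[LABELS].sum(axis=1).value_counts()` gives the same histogram. If every study falls under the count of 1, the labels are mutually exclusive and no overlap exists between categories.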

In [None]:
df_train = df_train_img.copy()

In [None]:
#converting into one-hot label
df_train["one_hot"] = df_train.apply(lambda x : np.array([x["Negative for Pneumonia"],
                                                        x["Typical Appearance"],
                                                        x["Indeterminate Appearance"],
                                                        x["Atypical Appearance"]]),axis=1)

df_train = df_train.drop(["Negative for Pneumonia","Typical Appearance","Indeterminate Appearance","Atypical Appearance"],axis=1)

In [None]:
df_train["label_id"] = df_train["one_hot"].map(lambda x : classes_dict[np.argmax(x)])

#### ... In progress