# 🥱😬 TL;DR

Two major findings:

1. `Atypical Appearance`, `Indeterminate Appearance`, `Negative for Pneumonia` are highly unbalanced classes.


2. Pixel spacing varies widely between images and between train and test data. It should be standardized during preprocessing for better, more stable predictions.


Everything else looks normal!

# SIIM-FISABIO-RSNA COVID-19 Detection EDA

👉 [Problem Type] Object Detection and Multiclass Classification Problem

In [None]:
!pip install pandas-profiling[notebook] --quiet
!pip install pydicom --quiet

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.patches as patches
import plotly.express as px
import plotly.graph_objects as go
import plotly.io as pio
import pydicom as dicom
from pathlib import Path
from pandas_profiling import ProfileReport
from PIL import Image
from fastprogress import progress_bar

sns.set_style("whitegrid", {'axes.grid' : False})
%config InlineBackend.figure_format = 'retina'

pio.templates.default = "ggplot2"

In [None]:
DATADIR = Path("../input/siim-covid19-detection")
TRAINDIR = DATADIR/"train"
TESTDIR = DATADIR/"test"
train_study_filepath = DATADIR/"train_study_level.csv"
train_image_filepath = DATADIR/"train_image_level.csv"

In [None]:
train_study_df = pd.read_csv(train_study_filepath)
train_image_df = pd.read_csv(train_image_filepath)

# 🔥 Pandas Profiler 
The pandas `df.describe()` function is great but a little basic for serious exploratory data analysis. *pandas_profiling* extends the pandas DataFrame with `df.profile_report()` for quick data analysis.

This saves the time of writing basic EDA code.

There are two ways to use it:
1. As a DataFrame method: `df.profile_report()`
2. As a class: `pandas_profiling.ProfileReport(df)` (see below)

In [None]:
profile = ProfileReport(train_study_df, title="Study Level")
profile

🤔 Takeaway from Study Level profiler:
1. Very unbalanced class distribution. 
2. *Typical Appearance* is the most balanced: ~52.8% (class 0) and 47.2% (class 1).
3. *Atypical Appearance* has very few class 1 samples: ~92.2% (class 0) and 7.8% (class 1).
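
The class fractions quoted above can also be recomputed directly from the dataframe. Here is a minimal sketch, assuming the four label columns hold 0/1 values. The toy dataframe and the `class_balance` helper are hypothetical stand-ins, not the real `train_study_df`:

```python
import pandas as pd

LABEL_COLS = [
    "Negative for Pneumonia", "Typical Appearance",
    "Indeterminate Appearance", "Atypical Appearance",
]

def class_balance(df, cols=LABEL_COLS):
    """Return the fraction of positive (1) samples for each label column."""
    return df[cols].mean().sort_values(ascending=False)

# Toy stand-in for train_study_df (hypothetical values, not the real data)
toy_df = pd.DataFrame({
    "Negative for Pneumonia":   [1, 0, 0, 0],
    "Typical Appearance":       [0, 1, 1, 0],
    "Indeterminate Appearance": [0, 0, 0, 1],
    "Atypical Appearance":      [0, 0, 0, 0],
})
balance = class_balance(toy_df)
print(balance)
```

On the real data, `class_balance(train_study_df)` should give the class 1 fractions reported by the profiler, which may be handy later for picking class weights.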

In [None]:
train_image_df.profile_report(title="Image Level")

🤔 Takeaway from Image Level profiler:
1. boxes has 2040 (32.2%) missing values 
2. label has 2040 (32.2%) `"none 1 0 0 1 1"` values.
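
The `label` strings can be split into individual boxes for later plotting. This is only a sketch: the six-token layout (class name, confidence, xmin, ymin, xmax, ymax) is an assumption inferred from the `"none 1 0 0 1 1"` value above, and `parse_label` is a hypothetical helper name:

```python
def parse_label(label):
    """Split a label string into a list of box dicts.

    Assumes six space-separated tokens per box:
    class name, confidence, xmin, ymin, xmax, ymax.
    """
    tokens = label.split()
    if len(tokens) % 6 != 0:
        raise ValueError(f"unexpected label format: {label!r}")
    boxes = []
    for i in range(0, len(tokens), 6):
        name, conf, *coords = tokens[i:i + 6]
        xmin, ymin, xmax, ymax = map(float, coords)
        boxes.append({"class": name, "conf": float(conf),
                      "xmin": xmin, "ymin": ymin, "xmax": xmax, "ymax": ymax})
    return boxes

parsed = parse_label("none 1 0 0 1 1")
print(parsed)
```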

### Using plots to understand data

To simplify the metadata, let's merge the two dataframes (image level and study level) into one. This can be done by joining on the `StudyInstanceUID` column of the *image level* dataframe and the `id` column of the *study level* dataframe (after removing the `_study` suffix from the ids). Note that `pd.merge` performs an inner join by default, so rows without a match in both dataframes are dropped.

In [None]:
# Remove _study part from id column and save ids in new column named StudyInstanceUID to merge dataframes
train_study_df["StudyInstanceUID"] = train_study_df.id.apply(lambda x: x.split("_")[0])
traindf = pd.merge(train_study_df, train_image_df, on="StudyInstanceUID")
traindf.rename({"id_x": "id_study", "id_y": "id_image"}, axis=1, inplace=True)
traindf.head()
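
A quick sanity check on the merge can catch dropped or duplicated rows. The sketch below uses hypothetical toy frames rather than the real CSVs; `how`, `validate` and `indicator` are standard `pd.merge` parameters:

```python
import pandas as pd

# Toy stand-ins for the study-level and image-level frames (hypothetical ids)
study = pd.DataFrame({"id": ["s1_study", "s2_study"], "Typical Appearance": [1, 0]})
image = pd.DataFrame({"id": ["a_image", "b_image", "c_image"],
                      "StudyInstanceUID": ["s1", "s1", "s2"]})

# Same idea as above: strip the _study suffix to get the join key
study["StudyInstanceUID"] = study["id"].str.split("_").str[0]

merged = pd.merge(
    study, image, on="StudyInstanceUID",
    how="outer",              # keep unmatched rows so they can be counted
    validate="one_to_many",   # raise if a study id is duplicated on the left
    indicator=True,           # adds a _merge column: both / left_only / right_only
)
unmatched = int((merged["_merge"] != "both").sum())
print(len(merged), "rows,", unmatched, "unmatched")
```

If `unmatched` is zero on the real data, the inner and outer joins give the same result and nothing is lost.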


In [None]:
fig = go.Figure()
fig.add_trace(go.Histogram(histfunc="count", x=traindf["Typical Appearance"], name="Typical Appearance"))
fig.add_trace(go.Histogram(histfunc="count", x=traindf["Atypical Appearance"], name="Atypical Appearance"))
fig.add_trace(go.Histogram(histfunc="count", x=traindf["Indeterminate Appearance"], name="Indeterminate Appearance"))
fig.add_trace(go.Histogram(histfunc="count", x=traindf["Negative for Pneumonia"], name="Negative for Pneumonia"))
fig.update_layout(title_text='Sample count per class') # title of plot
fig.show()

# DICOM Metadata Analysis

In this section, we analyse the DICOM metadata to see if it holds any useful information.

In [None]:
dcmpaths = TRAINDIR.rglob("*.dcm")

In [None]:
sample_path = next(dcmpaths)
sample = dicom.dcmread(sample_path)
print("Image size =", (sample.Rows, sample.Columns))
sample

In [None]:
plt.figure(figsize=(10, 10))
plt.imshow(sample.pixel_array)
plt.axis("off");
plt.title("Sample image", {'fontsize':20});

Let's create a dataframe out of metadata for all samples.

In [None]:
# TRAINING DATA
dcmpaths = TRAINDIR.rglob("*.dcm")
metadata_traindf = {
    "Gender": [], "BodyPartExamined": [], "ImgHeight": [], 
    "ImgWidth": [], "ImagerPixelSpacing": [], "SOPInstanceUID": []
}
for path in progress_bar(list(dcmpaths)):
    sample = dicom.dcmread(path)
    metadata_traindf["Gender"].append(sample.PatientSex)
    metadata_traindf["BodyPartExamined"].append(sample.BodyPartExamined)
    metadata_traindf["ImgHeight"].append(sample.Rows)
    metadata_traindf["ImgWidth"].append(sample.Columns)
    metadata_traindf["ImagerPixelSpacing"].append(float(sample.ImagerPixelSpacing[0]))
    metadata_traindf["SOPInstanceUID"].append(sample.SOPInstanceUID)
    
metadata_traindf = pd.DataFrame.from_dict(metadata_traindf)

In [None]:
metadata_traindf.profile_report(title="DICOM Training Metadata")

In [None]:
# TESTING DATA
dcmpaths = TESTDIR.rglob("*.dcm")
metadata_testdf = {
    "Gender": [], "BodyPartExamined": [], "ImgHeight": [], 
    "ImgWidth": [], "ImagerPixelSpacing": [], "SOPInstanceUID": []
}
for path in progress_bar(list(dcmpaths)):
    sample = dicom.dcmread(path)
    metadata_testdf["Gender"].append(sample.PatientSex)
    metadata_testdf["BodyPartExamined"].append(sample.BodyPartExamined)
    metadata_testdf["ImgHeight"].append(sample.Rows)
    metadata_testdf["ImgWidth"].append(sample.Columns)
    metadata_testdf["ImagerPixelSpacing"].append(float(sample.ImagerPixelSpacing[0]))
    metadata_testdf["SOPInstanceUID"].append(sample.SOPInstanceUID)
    
metadata_testdf = pd.DataFrame.from_dict(metadata_testdf)
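
The training and testing loops above differ only in the directory, so they could share one helper. This is a sketch, assuming some files may lack a tag (hence `getattr` with a default). The `extract_metadata` name and the `SimpleNamespace` stand-in for a DICOM dataset are hypothetical:

```python
from types import SimpleNamespace

# Output column name -> DICOM attribute name (same fields as the loops above)
FIELDS = {
    "Gender": "PatientSex",
    "BodyPartExamined": "BodyPartExamined",
    "ImgHeight": "Rows",
    "ImgWidth": "Columns",
    "SOPInstanceUID": "SOPInstanceUID",
}

def extract_metadata(ds):
    """Collect the metadata fields used above from a dataset-like object."""
    row = {col: getattr(ds, tag, None) for col, tag in FIELDS.items()}
    spacing = getattr(ds, "ImagerPixelSpacing", None)
    # Keep only the first spacing value, as the loops above do
    row["ImagerPixelSpacing"] = float(spacing[0]) if spacing else None
    return row

# Hypothetical stand-in for a dataset returned by dicom.dcmread(path)
fake_ds = SimpleNamespace(PatientSex="F", BodyPartExamined="CHEST", Rows=2330,
                          Columns=2846, ImagerPixelSpacing=["0.15", "0.15"],
                          SOPInstanceUID="abc")
row = extract_metadata(fake_ds)
print(row)
```

With real files this could be called as `pd.DataFrame([extract_metadata(dicom.dcmread(p, stop_before_pixels=True)) for p in TRAINDIR.rglob("*.dcm")])`. The `stop_before_pixels=True` option of `dcmread` skips the pixel data, which should make a metadata-only pass faster.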

In [None]:
metadata_testdf.profile_report(title="DICOM Testing Metadata")

🤔 Takeaway from Metadata profilers:
1. Training and test data have almost the same gender distribution.
2. Both training and test sets have ~80% CHEST body part scans.
3. Pixel spacing is not consistent between images, or between training and testing data. We need to fix this in preprocessing, otherwise the results might vary.

Let's compare with plots

### Gender

As said earlier, distribution across train and test is similar as shown below.

In [None]:
fig = go.Figure()
fig.add_trace(go.Histogram(histfunc="count", x=metadata_traindf["Gender"], name="Train data Gender"))
fig.add_trace(go.Histogram(histfunc="count", x=metadata_testdf["Gender"], name="Test data Gender"))
fig.update_layout(title_text='Sample count per gender') # title of plot
fig.show()
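
The same comparison can be made numerically, which helps because the train and test sets differ in size. A small sketch with hypothetical toy data; `compare_proportions` is an illustrative helper, not part of any library:

```python
import pandas as pd

def compare_proportions(train, test):
    """Side-by-side category proportions for two Series."""
    table = pd.concat(
        [train.value_counts(normalize=True), test.value_counts(normalize=True)],
        axis=1, keys=["train", "test"],
    ).fillna(0.0)  # a category missing from one set gets proportion 0
    table["abs_diff"] = (table["train"] - table["test"]).abs()
    return table

# Toy stand-ins for metadata_traindf["Gender"] and metadata_testdf["Gender"]
train_gender = pd.Series(["M", "M", "F", "F"])
test_gender = pd.Series(["M", "F", "F", "F"])
table = compare_proportions(train_gender, test_gender)
print(table)
```

The same helper should work for `BodyPartExamined` or any other categorical column.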

### Pixel Spacing
The distribution of `ImagerPixelSpacing` is similar across train and test data, but the spacing values themselves vary widely from image to image (see below). For better predictions, we need to fix the pixel spacing in preprocessing.

In [None]:
fig = go.Figure()
fig.add_trace(go.Histogram(histfunc="count", x=metadata_traindf["ImagerPixelSpacing"], histnorm="probability", name="Train ImagerPixelSpacing"))
fig.add_trace(go.Histogram(histfunc="count", x=metadata_testdf["ImagerPixelSpacing"], histnorm="probability", name="Test ImagerPixelSpacing"))
fig.update_layout(title_text='Sample fraction per ImagerPixelSpacing value (Probability Normalized just to adjust sample scale)') # title of plot
fig.show()
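
One possible way to fix the spacing is to resample every image to a common pixel spacing. The notebook does not prescribe a method, so treat this as a sketch under that assumption. `resampled_shape` is a hypothetical helper and the numbers are made up:

```python
def resampled_shape(rows, cols, spacing_mm, target_spacing_mm):
    """Image shape after resampling to a common pixel spacing.

    The physical extent (pixels * spacing) stays the same, so the pixel count
    scales by spacing_mm / target_spacing_mm. Assumes square pixels.
    """
    scale = spacing_mm / target_spacing_mm
    return max(1, round(rows * scale)), max(1, round(cols * scale))

# Hypothetical example: a 2000 x 2500 image at 0.1 mm/pixel, resampled to 0.2 mm/pixel
new_shape = resampled_shape(2000, 2500, spacing_mm=0.1, target_spacing_mm=0.2)
print(new_shape)
```

The actual resize could then be done with PIL's `Image.resize`, which expects `(width, height)`, so the order is `(new_cols, new_rows)`. Bounding box coordinates would need the same scale factor.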

### Pixel Spacing and Gender
A further breakdown by gender, to see which pixel spacing values are most common.

In [None]:
px.histogram(metadata_traindf, x='ImagerPixelSpacing', marginal="box", color='Gender', title="Train data ImagerPixelSpacing Distribution (based on Gender)")

In [None]:
px.histogram(metadata_testdf, x='ImagerPixelSpacing', marginal="box", color='Gender', title="Test data ImagerPixelSpacing Distribution (based on Gender)")

### BodyPart Examined and Gender
The distribution looks similar across genders.

In [None]:
metadata_traindf.BodyPartExamined.unique()

Note that the same body part is spelled differently across the metadata. This may be due to different scanners, but I don't know. If you do, please let me know in the comments.

Anyway, I'm replacing the different spellings with a single one.
So, there will be two changes:
1. Replacing `2- TORAX`, `TORAX`, `TÒRAX`, `T?RAX` with `THORAX`.
2. Replacing `Pecho` with `PECHO`

In [None]:
replace_dict = {
    '2- TORAX': 'THORAX',
    'TORAX': 'THORAX',
    'TÒRAX': 'THORAX',
    'T?RAX': 'THORAX',
    'Pecho': 'PECHO'
}
metadata_traindf.BodyPartExamined = metadata_traindf.BodyPartExamined.replace(replace_dict)
metadata_testdf.BodyPartExamined = metadata_testdf.BodyPartExamined.replace(replace_dict)

#### 1. Train and Test set distribution comparison

In [None]:
fig = go.Figure()
fig.add_trace(go.Histogram(histfunc="count", x=metadata_traindf["BodyPartExamined"], histnorm="probability", name="Train BodyPartExamined"))
fig.add_trace(go.Histogram(histfunc="count", x=metadata_testdf["BodyPartExamined"], histnorm="probability", name="Test BodyPartExamined"))
fig.update_layout(title_text='Sample count per Bodypart (Probability Normalized just to adjust sample scale)') # title of plot
fig.show()

Same thing on a log scale for better visualization. Looks normal to me.

In [None]:
fig = go.Figure()
fig.add_trace(go.Histogram(histfunc="count", x=metadata_traindf["BodyPartExamined"], histnorm="probability", name="Train BodyPartExamined"))
fig.add_trace(go.Histogram(histfunc="count", x=metadata_testdf["BodyPartExamined"], histnorm="probability", name="Test BodyPartExamined"))
fig.update_layout(title_text='(Log) Sample count per Bodypart (Probability Normalized just to adjust sample scale)') # title of plot
fig.update_yaxes(type="log")
fig.show()

#### 2. BodyPartExamined and Gender

In [None]:
px.histogram(metadata_traindf, x='BodyPartExamined', marginal="violin", color='Gender', title="Train data BodyPartExamined Distribution (based on Gender)")

In [None]:
px.histogram(metadata_testdf, x='BodyPartExamined', marginal="violin", color='Gender', title="Test data BodyPartExamined Distribution (based on Gender)")

#### That's all! Please upvote if you find this information useful ✌️🙂