# Exploratory Data Analysis (EDA)

### Imports

In [None]:
import numpy as np
import pandas as pd

%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
from PIL import Image

In [None]:
random_seed: int = 8080
data_root: Path = Path("../data")
xray_images_root: Path = Path("/home/uziel/Downloads/nih_chest_x_rays")

## 1. Data Loading

### 1.1. Load samples annotation

In [None]:
annot_df = pd.read_csv(data_root.joinpath("samples_annotation_2017.csv"))
annot_df

The column `labels` contains all the disease annotations.

### 1.2. Replace `labels` column with dummy variables

In [None]:
labels_dummies = annot_df["labels"].str.get_dummies("|")
labels_dummies.columns = [c.replace(" ", "_").lower() for c in labels_dummies]

In [None]:
annot_df = annot_df.join(labels_dummies).drop(columns=["labels"])
annot_df
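To illustrate what the dummy encoding above does, here is a minimal sketch on a hypothetical three-row table (the image names and label combinations are made up for illustration; only the `|` separator and the column clean-up mirror the cells above).

```python
import pandas as pd

# Hypothetical toy annotations using the same "|" separator as the `labels` column.
toy_df = pd.DataFrame(
    {
        "image_name": ["a.png", "b.png", "c.png"],
        "labels": ["Pneumonia|Infiltration", "No Finding", "Pleural Thickening"],
    }
)

# One 0/1 column per distinct label found in any row.
toy_dummies = toy_df["labels"].str.get_dummies("|")
# Normalise column names: spaces to underscores, lower case.
toy_dummies.columns = [c.replace(" ", "_").lower() for c in toy_dummies.columns]

toy_encoded = toy_df.join(toy_dummies).drop(columns=["labels"])
print(toy_encoded)
```

Each image can carry several labels, so a row may have more than one column set to 1.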

## 2. Exploration of patients metadata

### Missing values

In [None]:
annot_df.isna().sum()

There are no missing values in our data.

### Patient information

In [None]:
annot_df.nunique()

In [None]:
annot_df.groupby("patient_id")["image_name"].count().mean()

We have information for 30,805 patients, with an average of 3 to 4 images available per patient.

In [None]:
patient_had_pneumonia = annot_df.groupby("patient_id")["pneumonia"].sum().astype(bool)
patient_had_pneumonia

In [None]:
patient_had_pneumonia.sum() / len(patient_had_pneumonia)

Around 3% of all patients have had pneumonia.
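The patient-level rate above is not the same as the share of images labelled with pneumonia, because patients contribute different numbers of images. A minimal sketch of the distinction, on hypothetical toy data that reuses the `patient_id` and `pneumonia` column names:

```python
import pandas as pd

# Hypothetical toy data: three patients, six images in total.
toy = pd.DataFrame(
    {
        "patient_id": [1, 1, 1, 2, 3, 3],
        "pneumonia": [0, 1, 0, 0, 0, 0],
    }
)

# Share of images labelled with pneumonia.
image_rate = toy["pneumonia"].mean()

# Share of patients with at least one pneumonia image.
patient_rate = toy.groupby("patient_id")["pneumonia"].max().astype(bool).mean()

# Number of images per patient.
images_per_patient = toy.groupby("patient_id").size()

print(f"image-level rate:   {image_rate:.3f}")
print(f"patient-level rate: {patient_rate:.3f}")
print(f"mean images per patient: {images_per_patient.mean():.2f}")
```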

### Disease Labels

In [None]:
disease_labels = labels_dummies.columns

print(
    f"There are up to {len(disease_labels)} possible disease labels "
    f"(including no finding) annotated in each image:"
)
print("\n\t- " + "\n\t- ".join(disease_labels))

#### How many annotations per disease?

In [None]:
labels_dummies.sum().sort_values(ascending=False)

Pneumonia is the second least common disease label in our dataset.

### Gender distribution among patients with and without pneumonia

In [None]:
annot_df["patient_gender"].value_counts(normalize=True)

We have more images from males than from females, but the imbalance is mild. Note that these proportions are computed per image, not per patient.
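To compare the gender split between images with and without pneumonia, and to check it per patient rather than per image, one option is sketched below. The toy table is hypothetical; it only reuses the `patient_id`, `patient_gender` and `pneumonia` column names.

```python
import pandas as pd

# Hypothetical toy data: four patients, six images.
toy = pd.DataFrame(
    {
        "patient_id": [1, 1, 2, 3, 3, 4],
        "patient_gender": ["M", "M", "F", "M", "M", "F"],
        "pneumonia": [0, 1, 0, 0, 0, 1],
    }
)

# Gender proportions within each pneumonia group (each row sums to 1).
gender_by_pneumonia = pd.crosstab(
    toy["pneumonia"], toy["patient_gender"], normalize="index"
)

# Gender proportions counting each patient once instead of each image.
gender_per_patient = (
    toy.drop_duplicates("patient_id")["patient_gender"].value_counts(normalize=True)
)

print(gender_by_pneumonia)
print(gender_per_patient)
```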

### Age distribution among patients with and without pneumonia

In [None]:
annot_df["patient_age"].plot(kind="hist")

In [None]:
annot_df[annot_df["patient_age"] > 100]["patient_age"].tolist()

There seem to be some extreme values, probably due to human error. Since there are only a few, we remove them.

In [None]:
# Keep ages up to and including 100, matching the "> 100" check above.
annot_df = annot_df[annot_df["patient_age"] <= 100]

Now we look at the age distribution across patients with and without pneumonia:

In [None]:
sns.catplot(
    annot_df,
    y="patient_age",
    x="pneumonia",
    hue="patient_gender",
    bw=0.25,
    cut=0,
    split=True,
    kind="violin",
)

Let's look at the quantile distribution:

In [None]:
pneumonia_quantiles = pd.Series(
    {
        f"{q*100:.0f}%": annot_df[annot_df["pneumonia"].astype(bool)][
            "patient_age"
        ].quantile(q)
        for q in np.arange(0.1, 1, 0.1)
    }
)
pneumonia_quantiles

About 80% of pneumonia cases (the range between the 10th and 90th percentiles) fall between the early 20s and the mid 60s. We can expect our algorithm to perform better in this demographic and worse outside of it. Patients under the age of 20 are likely still growing, so the size and shape of their chest cavity is likely different, which could impact the performance of the final algorithm. Similarly, older patients are more likely to suffer from multiple diseases simultaneously, making it harder to distinguish pneumonia from any other disease.
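The "80%" figure corresponds to the interval between the 10th and 90th percentiles. A minimal sketch of that reading, using a hypothetical series of ages from 0 to 100 rather than the real data:

```python
import pandas as pd

# Hypothetical ages: one sample per integer age from 0 to 100.
ages = pd.Series(range(0, 101), name="patient_age")

# `quantile` accepts a list, so both bounds come from a single call.
low, high = ages.quantile([0.1, 0.9])

# Share of samples inside the 10th-90th percentile interval (about 80%).
central_share = ages.between(low, high).mean()

print(f"10th percentile: {low}, 90th percentile: {high}")
print(f"share inside interval: {central_share:.3f}")
```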

### View Position distribution among patients with and without pneumonia

In [None]:
annot_df["view_position"].value_counts(normalize=True)

Our dataset contains both posterior-anterior (PA) and anterior-posterior (AP) projections.

- **PA projection**: The standard chest radiograph is acquired with the patient standing up, and with the X-ray beam passing through the patient from Posterior to Anterior. The chest X-ray image produced is viewed as if looking at the patient from the front, face-to-face. The heart is on the right side of the image as you look at it.
- **AP projection**: Sometimes it is not possible for radiographers to acquire a PA chest X-ray. This is usually because the patient is too unwell to stand. The chest X-ray image is still viewed as if looking at the patient face-to-face.

Source and more information on [Radiology Masterclass](https://www.radiologymasterclass.co.uk/tutorials/chest/chest_quality/chest_xray_quality_projection).

In [None]:
sns.histplot(
    annot_df.astype({"pneumonia": str}),
    x="pneumonia",
    hue="view_position",
    stat="percent",
    multiple="fill",
)

In [None]:
view_positions_with = (
    annot_df[annot_df["pneumonia"] == 1]["view_position"]
    .value_counts(normalize=True)
    .rename("Pneumonia presence")
)
view_positions_without = (
    annot_df[annot_df["pneumonia"] == 0]["view_position"]
    .value_counts(normalize=True)
    .rename("Pneumonia absence")
)
pd.concat(
    [
        view_positions_with,
        view_positions_without,
        abs(view_positions_without - view_positions_with).rename("Difference"),
    ],
    axis=1,
)

Since the main difference between AP and PA projections is the apparent heart size, the projection shouldn't affect our ability to detect pneumonia. Therefore, we expect this difference of 16 percentage points to have little impact on final model performance.

### Disease comorbidity

In [None]:
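# Label-by-label co-occurrence counts: entry (i, j) is the number of images
# annotated with both label i and label j.
comorbidity_mat = annot_df[disease_labels].T.dot(annot_df[disease_labels])
# Zero the diagonal (per-label totals) so that only co-occurrences remain.
np.fill_diagonal(comorbidity_mat.values, 0)
comorbidity_mat

In [None]:
sns.heatmap(comorbidity_mat, robust=True)

In [None]:
# Percentage of pneumonia images that also carry each other label.
((comorbidity_mat["pneumonia"] / annot_df["pneumonia"].sum()) * 100).round(
    2
).sort_values(ascending=False)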

The output above lists the diseases that most often co-occur with pneumonia: infiltration, edema, effusion, etc. Each percentage is the share of pneumonia cases that were also labelled with the other disease. Infiltration is the most common, appearing in 42% of pneumonia cases.
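The co-occurrence computation can be checked by hand on a small example. The sketch below uses a hypothetical four-image table with three labels; it zeroes the diagonal on a separate writable copy, which is a variation on the in-place approach used in the cells above.

```python
import numpy as np
import pandas as pd

# Hypothetical 0/1 label table: four images, three labels.
toy_labels = pd.DataFrame(
    {
        "pneumonia": [1, 1, 0, 1],
        "infiltration": [1, 0, 1, 1],
        "edema": [0, 0, 1, 1],
    }
)

# Entry (i, j) counts the images that carry both label i and label j.
co_counts = toy_labels.T.dot(toy_labels)

# The diagonal holds the total number of images per label.
totals = pd.Series(np.diag(co_counts), index=co_counts.index)

# Zero the diagonal on a writable copy so only co-occurrences remain.
co_values = co_counts.to_numpy().copy()
np.fill_diagonal(co_values, 0)
co_occurrence = pd.DataFrame(
    co_values, index=co_counts.index, columns=co_counts.columns
)

# Percentage of pneumonia images that also carry each other label.
pct_of_pneumonia = (co_occurrence["pneumonia"] / totals["pneumonia"] * 100).round(2)
print(pct_of_pneumonia.sort_values(ascending=False))
```

In this toy table, two of the three pneumonia images also show infiltration and one also shows edema.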

## 3. Exploration of image pixel data

### Healthy patients (no disease detected)

In [None]:
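# Sample 100 images without findings; a set gives fast membership checks below.
healthy_images = set(
    annot_df[annot_df["no_finding"] == 1]
    .sample(100, random_state=random_seed)["image_name"]
)
# Locate the sampled files anywhere under the images root.
healthy_images_files = [
    img_file
    for img_file in xray_images_root.glob("**/*.png")
    if img_file.name in healthy_images
]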

In [None]:
avg_healthy_image = np.mean(
    [np.array(Image.open(img_file).convert("L")) for img_file in healthy_images_files],
    axis=0,
)
plt.imshow(avg_healthy_image, cmap="gray", vmin=0, vmax=255)

While we can observe blurred edges and cavities, probably due to slight differences between patients, overall the lung area looks clear in healthy patients.

### Pneumonia patients

In [None]:
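# Sample 100 pneumonia images; a set gives fast membership checks below.
pneumonia_images = set(
    annot_df[annot_df["pneumonia"] == 1]
    .sample(100, random_state=random_seed)["image_name"]
)
# Locate the sampled files anywhere under the images root.
pneumonia_images_files = [
    img_file
    for img_file in xray_images_root.glob("**/*.png")
    if img_file.name in pneumonia_images
]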

In [None]:
avg_pneumonia_image = np.mean(
    [
        np.array(Image.open(img_file).convert("L"))
        for img_file in pneumonia_images_files
    ],
    axis=0,
)
plt.imshow(avg_pneumonia_image, cmap="gray", vmin=0, vmax=255)

Compared to healthy patients, we can see that the lung area is significantly more opaque. This suggests that on average, pneumonia patients show white spots in their lungs, as expected.
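One way to make this comparison more explicit is to subtract the two average images pixel by pixel. The sketch below uses small synthetic arrays instead of real X-rays (the stacks, their shapes and the constant brightness offset are all assumptions for illustration), and it relies on every image in a stack having the same shape.

```python
import numpy as np

rng = np.random.default_rng(8080)

# Synthetic stand-ins for image stacks: 5 images of 4x4 pixels each.
healthy_stack = rng.integers(0, 100, size=(5, 4, 4)).astype(np.float64)
# Simulate uniformly brighter (more opaque) images for the pneumonia group.
pneumonia_stack = healthy_stack + 50.0

# Pixel-wise averages over the image axis.
avg_healthy = healthy_stack.mean(axis=0)
avg_pneumonia = pneumonia_stack.mean(axis=0)

# Positive values mark pixels that are brighter on average in the pneumonia group.
avg_difference = avg_pneumonia - avg_healthy
print(avg_difference)
```

With real data, the same difference array could be passed to `plt.imshow` with a diverging colormap to highlight where the two groups differ.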

## 4. Summary and conclusions

In [None]:
annot_df.to_csv(data_root.joinpath("processed_annotations.csv"), index=False)

Given the exploratory data analysis above, we will use the processed sample annotations to train a classification model that outputs the probability that pneumonia is present in a given X-ray image.

To avoid data leakage, we should ensure that all images from a given patient fall in either the training set or the validation set, never in both. Moreover, other metadata such as gender and age should be similarly distributed in each set.
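A patient-level split can be sketched as follows. This is a minimal illustration on hypothetical toy data, assuming a 20% validation share; it does not balance gender, age or pneumonia prevalence between the sets, which would need an additional stratification step.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: six patients, ten images.
toy = pd.DataFrame(
    {
        "patient_id": [1, 1, 2, 3, 3, 3, 4, 5, 5, 6],
        "pneumonia": [0, 1, 0, 0, 0, 1, 0, 0, 0, 0],
    }
)

rng = np.random.default_rng(8080)

# Shuffle the unique patients, not the images.
patient_ids = rng.permutation(toy["patient_id"].unique())

# Assumed validation share of 20% of patients, with at least one patient.
n_val = max(1, int(round(0.2 * len(patient_ids))))
val_patients = patient_ids[:n_val]

# Every image follows its patient into exactly one of the two sets.
is_val = toy["patient_id"].isin(val_patients)
train_df = toy[~is_val]
val_df = toy[is_val]

print(f"train patients: {sorted(train_df['patient_id'].unique())}")
print(f"val patients:   {sorted(val_df['patient_id'].unique())}")
```

scikit-learn's `GroupShuffleSplit` and `StratifiedGroupKFold` offer ready-made versions of this idea if that dependency is acceptable.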