# Introduction

Machine learning (ML) can have a powerful impact in many fields, and healthcare is one of those that stands to benefit most. Advances in ML have made it possible to automate challenging healthcare tasks such as diagnosing diseases, estimating risk and monitoring treatment. One such task is analyzing X-ray images. Chest X-rays can be used to detect many diseases, including early-stage lung cancer and pneumonia, and failing to detect them can have serious consequences for patients. In 2017, 2.5M people died from pneumonia worldwide (source: https://ourworldindata.org/pneumonia). The death rate can be reduced if patients are diagnosed and treated in time. Deploying machine learning models to detect diseases like these could improve patients' survival rates and save thousands of lives.


## Dataset
One of the most widely used public chest X-ray datasets is the “ChestX-ray8” dataset. It contains over 100k frontal-view X-ray images of more than 30k unique patients. The images have nine labels: Atelectasis, Cardiomegaly, Effusion, Infiltration, Mass, Nodule, Pneumonia, Pneumothorax and No Finding. The CSV file used below also contains further labels, such as Emphysema and Hernia. Dataset link: https://arxiv.org/abs/1705.02315

In this notebook, I am going to explore the "ChestX-ray8" dataset and create the train, validation and test datasets for training a pneumonia detection model.

In [1]:
# import libraries
import numpy as np
import pandas as pd

In [2]:
# read in the csv file
dataset = pd.read_csv("Data_Entry_2017_v2020.csv")

In [3]:
dataset.head()

Unnamed: 0,Image Index,Finding Labels,Follow-up #,Patient ID,Patient Age,Patient Gender,View Position,OriginalImage[Width,Height],OriginalImagePixelSpacing[x,y]
0,00000001_000.png,Cardiomegaly,0,1,57,M,PA,2682,2749,0.143,0.143
1,00000001_001.png,Cardiomegaly|Emphysema,1,1,58,M,PA,2894,2729,0.143,0.143
2,00000001_002.png,Cardiomegaly|Effusion,2,1,58,M,PA,2500,2048,0.168,0.168
3,00000002_000.png,No Finding,0,2,80,M,PA,2500,2048,0.171,0.171
4,00000003_001.png,Hernia,0,3,74,F,PA,2500,2048,0.168,0.168


For my application, I only need the 'Image Index', 'Finding Labels' and 'Patient ID' columns, so I will drop the rest. Also, for training with Amazon SageMaker, the label column needs to be the first column, so I will reorder the columns as well.

In [4]:
# selecting necessary columns
reduced_df = dataset[["Finding Labels", "Image Index", "Patient ID"]]
reduced_df.head()

Unnamed: 0,Finding Labels,Image Index,Patient ID
0,Cardiomegaly,00000001_000.png,1
1,Cardiomegaly|Emphysema,00000001_001.png,1
2,Cardiomegaly|Effusion,00000001_002.png,1
3,No Finding,00000002_000.png,2
4,Hernia,00000003_001.png,3


Next, I will select the entries whose labels include Pneumonia and label them as 1. All other entries will be labeled as 0.

In [5]:
# find the entries that have Pneumonia
pneumonia = reduced_df[reduced_df["Finding Labels"].str.contains('Pneumonia')].copy()

# set the labels to 1 (positive sample)
pneumonia['Finding Labels'] = 1

pneumonia.head()

Unnamed: 0,Finding Labels,Image Index,Patient ID
73,1,00000013_010.png,13
126,1,00000032_012.png,32
253,1,00000056_000.png,56
276,1,00000061_012.png,61
279,1,00000061_015.png,61


In [6]:
# find the entries that don't have Pneumonia
not_pneumonia = reduced_df[~reduced_df["Finding Labels"].str.contains('Pneumonia')].copy()

# set the labels to 0 (negative sample)
not_pneumonia['Finding Labels'] = 0

not_pneumonia.head()

Unnamed: 0,Finding Labels,Image Index,Patient ID
0,0,00000001_000.png,1
1,0,00000001_001.png,1
2,0,00000001_002.png,1
3,0,00000002_000.png,2
4,0,00000003_001.png,3


In [7]:
# merge the two datasets and restore the original row order
pneumonia_dataset = pd.concat([pneumonia, not_pneumonia], axis=0)
pneumonia_dataset = pneumonia_dataset.sort_index()
pneumonia_dataset

Unnamed: 0,Finding Labels,Image Index,Patient ID
0,0,00000001_000.png,1
1,0,00000001_001.png,1
2,0,00000001_002.png,1
3,0,00000002_000.png,2
4,0,00000003_001.png,3
...,...,...,...
112115,1,00030801_001.png,30801
112116,0,00030802_000.png,30802
112117,0,00030803_000.png,30803
112118,0,00030804_000.png,30804
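
The filter, relabel, concatenate and sort steps above can also be collapsed into a single per-row mapping. The sketch below shows one way to do that. `has_finding` is a hypothetical helper, not part of this notebook, and it matches whole labels in the pipe-separated string rather than substrings.

```python
def has_finding(finding_labels: str, finding: str = "Pneumonia") -> int:
    """Return 1 if `finding` is one of the pipe-separated labels, else 0."""
    labels = [label.strip() for label in finding_labels.split("|")]
    return int(finding in labels)


# Label strings in the same format as the "Finding Labels" column
examples = ["Cardiomegaly|Emphysema", "Infiltration|Pneumonia", "No Finding"]
binary_labels = [has_finding(s) for s in examples]
print(binary_labels)  # [0, 1, 0]
```

Applied to the dataframe, this would look like `reduced_df["Finding Labels"].map(has_finding)`, which keeps the original row order, so no `sort_index` call is needed.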


## Creating train, validation and test dataset

First, let's check the fraction of Pneumonia samples and Not Pneumonia samples in our dataset.

In [8]:
def count_class(series):
    """
    Given a series of binary values, return the count of 1 and 0
    """
    pos = series.sum()
    neg = series.shape[0] - pos
    return pos, neg

In [9]:
# calculating class imbalance
n_pneumonia, n_not_pneumonia = count_class(pneumonia_dataset["Finding Labels"])
n_total = pneumonia_dataset.shape[0]

frac_pneumonia = n_pneumonia / n_total
frac_not_pneumonia = n_not_pneumonia / n_total

ratio = n_pneumonia / n_not_pneumonia

print(f"Total samples: {n_total}")
print(f"Fraction of Pneumonia samples: {frac_pneumonia: .4f}")
print(f"Fraction of Normal samples: {frac_not_pneumonia: .4f}")
print(f"Ratio: {ratio: .4f}")

Total samples: 112120
Fraction of Pneumonia samples:  0.0128
Fraction of Normal samples:  0.9872
Ratio:  0.0129


Since there is a huge class imbalance, I will use a weighted loss function for the model. For that reason, I also want to keep the class ratio similar across the train, validation and test datasets.
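
One common way to weight a binary cross-entropy loss is to give each class a weight equal to the frequency of the other class, so that both classes contribute equally to the total loss. The sketch below assumes that scheme; this notebook does not say which weighting is used later, and `class_weights` is an illustrative name.

```python
from typing import Tuple


def class_weights(n_pos: int, n_neg: int) -> Tuple[float, float]:
    """Return (w_pos, w_neg) such that n_pos * w_pos equals n_neg * w_neg."""
    total = n_pos + n_neg
    w_pos = n_neg / total  # the rare class gets the large weight
    w_neg = n_pos / total  # the frequent class gets the small weight
    return w_pos, w_neg


# Toy counts for illustration only (not the dataset's real counts)
w_pos, w_neg = class_weights(n_pos=13, n_neg=987)
print(f"w_pos={w_pos:.3f}, w_neg={w_neg:.3f}")  # w_pos=0.987, w_neg=0.013
```

With the fractions printed above, this scheme would give a weight of roughly 0.9872 to Pneumonia samples and 0.0128 to all other samples.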

For medical datasets, it is necessary to ensure that the train, validation and test datasets don't contain data from the same patient. So let's check whether the dataset contains several entries for the same patient.

In [10]:
n_unique_patients = len(set(pneumonia_dataset['Patient ID'].values))

print(f"Number of unique patients: {n_unique_patients}")
print(f"Number of entries: {n_total}")

Number of unique patients: 30805
Number of entries: 112120


As the number of unique patients is smaller than the total number of entries, there are clearly several entries for some patients.
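
To see how the entries are spread over patients, one can count the entries per patient ID. A minimal sketch using only the standard library, with a toy list standing in for `pneumonia_dataset['Patient ID']`:

```python
from collections import Counter

# Toy patient IDs, one per image (stand-in for the 'Patient ID' column)
patient_ids = [1, 1, 1, 2, 3, 13, 13]

entries_per_patient = Counter(patient_ids)
repeat_patients = {pid: n for pid, n in entries_per_patient.items() if n > 1}

print(f"Unique patients: {len(entries_per_patient)}")                 # 4
print(f"Patients with more than one entry: {len(repeat_patients)}")   # 2
```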

Let's also check whether the Pneumonia and Not Pneumonia datasets have any patients in common (e.g. patients who were also imaged before developing or after recovering from Pneumonia).

In [11]:
# unique patient ids (pid)
pneumonia_pid = set(pneumonia['Patient ID'].values)
not_pneumonia_pid = set(not_pneumonia['Patient ID'].values)
common_pid = pneumonia_pid.intersection(not_pneumonia_pid)

print(f"Number of patients having Pneumonia: {len(pneumonia_pid)}")
print(f"Number of patients not having Pneumonia: {len(not_pneumonia_pid)}")
print(f"Common patients in both dataset: {len(common_pid)}")

Number of patients having Pneumonia: 1008
Number of patients not having Pneumonia: 30725
Common patients in both dataset: 928


Now, I am going to define a function that splits the dataset into training, validation and test datasets while maintaining the class ratio. For this, I need a helper function, "maintain_ratio", which determines the number of positive and negative cases required to maintain the ratio in each sub-dataset. The equations for the required positive and negative cases are given below.

Here, $total = pos + neg$ and $ratio = \frac{pos}{neg}$ <br/>
=> $ratio + 1 = \frac{pos + neg}{neg} = \frac{total}{neg}$ <br/>
=> $neg = \frac{total}{1 + ratio}$ and $pos = total - neg$

In [15]:
def maintain_ratio(df, frac, ratio):
    """
    Calculates the number of positive and negative samples required
    to maintain the given pos/neg ratio in a subset holding `frac` of the dataframe
    """
    total = round(df.shape[0] * frac)
    neg = round(total / (1 + ratio))
    pos = total - neg

    # round() should give an integer, but on my local machine "neg" comes back as a float,
    # likely because `ratio` is a NumPy float and round() on it can return a float
    # depending on the NumPy version, so cast explicitly
    return int(pos), int(neg)
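
As a quick check of the formula, the sketch below repeats the same arithmetic on plain numbers and applies it to the totals printed earlier, using the rounded ratio 0.0129. `split_counts` is an illustrative name; it takes a row count instead of a dataframe.

```python
from typing import Tuple


def split_counts(n_rows: int, frac: float, ratio: float) -> Tuple[int, int]:
    """Same arithmetic as maintain_ratio, but on a plain row count."""
    total = round(n_rows * frac)
    neg = int(round(total / (1 + ratio)))
    pos = total - neg
    return pos, neg


n_total = 112120
ratio = 0.0129  # rounded value from the output above

for name, frac in [("train", 0.6), ("val", 0.2), ("test", 0.2)]:
    pos, neg = split_counts(n_total, frac, ratio)
    print(f"{name}: pos={pos}, neg={neg}, pos/neg={pos / neg:.4f}")
```

Because of rounding, the achieved pos/neg ratio in each subset matches the target only approximately.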

In [16]:
from random import shuffle, seed

def train_val_test_split(train_frac = 0.6, val_frac = 0.2, test_frac = 0.2, df = pneumonia_dataset,
                         ratio=ratio, common_pid = common_pid, random_seed=1):
    """
    Splits the given dataframe into train, validation and test sets while maintaining the class ratio.
    Patients in `common_pid` are assigned to exactly one set; returns three DataFrames.
    """
    assert train_frac + val_frac + test_frac <= 1, "Total of train, val and test fractions exceeds 1"
    seed(random_seed)
    
    # calculate the number of positive and negative cases required for each dataset to maintain the ratio
    train_pos, train_neg = maintain_ratio(df, train_frac, ratio)
    val_pos, val_neg = maintain_ratio(df, val_frac, ratio)
    test_pos, test_neg = maintain_ratio(df, test_frac, ratio)
    
    # first, split the patients that are common to both the Pneumonia and Not Pneumonia datasets
    ## convert set to list
    common_pid = list(common_pid)
    shuffle(common_pid)

    ## calculate the number of common patients for each dataset
    to_train = round(len(common_pid) * train_frac)
    to_val = round(len(common_pid) * val_frac)
    to_test = round(len(common_pid) * test_frac)  # was train_frac, which over-counted the test share
    
    ## distribute the common pids 
    train_pid = common_pid[ : to_train]
    val_pid = common_pid[to_train : to_train+to_val]
    test_pid = common_pid[to_train+to_val : to_train+to_val+to_test]
    
    ## create dataframes with corresponding common pids
    train_set = df[df['Patient ID'].isin(train_pid)]
    val_set = df[df['Patient ID'].isin(val_pid)]
    test_set = df[df['Patient ID'].isin(test_pid)]
    
    # count positive and negative sample in each dataframe
    train_common_pos, train_common_neg = count_class(train_set['Finding Labels'])
    val_common_pos, val_common_neg= count_class(val_set['Finding Labels'])
    test_common_pos, test_common_neg = count_class(test_set['Finding Labels'])
    
    # make sure that the filled positive cases are not greater than the required positive cases
    assert (train_pos >= train_common_pos), "Try different random_seed value"
    assert (val_pos >= val_common_pos), "Try different random_seed value"
    assert (test_pos >= test_common_pos), "Try different random_seed value"
    
    # required positive cases
    req_train = train_pos - train_common_pos
    req_val = val_pos - val_common_pos
    req_test = test_pos - test_common_pos
    
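    # index labels of the remaining Pneumonia images (patients outside common_pid)
    # NOTE: these are drawn per image, not per patient, so such a patient can still land in more than one set
    rest_pneumonia = list(df[(df["Finding Labels"] == 1) & (~df["Patient ID"].isin(common_pid))].index)
    shuffle(rest_pneumonia)

    # fill up the positive sample requirement for each dataset
    # (.loc, because rest_pneumonia holds index labels rather than positions)
    train_set = pd.concat([train_set, df.loc[rest_pneumonia[: req_train], :]], axis=0)
    val_set = pd.concat([val_set, df.loc[rest_pneumonia[req_train : req_train+req_val], :]], axis=0)
    test_set = pd.concat([test_set, df.loc[rest_pneumonia[req_train+req_val : req_train+req_val+req_test], :]], axis=0)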
    
    # required negative cases
    req_train = train_neg - train_common_neg
    req_val = val_neg - val_common_neg
    req_test = test_neg - test_common_neg
    
    # index labels of the remaining Not Pneumonia images (also drawn per image, not per patient)
    rest_not_pneumonia = list(df[(df["Finding Labels"] == 0) & (~df["Patient ID"].isin(common_pid))].index)
    shuffle(rest_not_pneumonia)

    # fill the negative cases (.loc, because the list holds index labels rather than positions)
    train_set = pd.concat([train_set, df.loc[rest_not_pneumonia[:req_train], :]], axis=0)
    val_set = pd.concat([val_set, df.loc[rest_not_pneumonia[req_train : req_train+req_val], :]], axis=0)
    test_set = pd.concat([test_set, df.loc[rest_not_pneumonia[req_train+req_val : req_train+req_val+req_test], :]], axis=0)
    
    return train_set, val_set, test_set
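
The function above assigns the common patients as whole patients, but draws the remaining images by image index. A simplified alternative, which is not the method used in this notebook, is to assign every patient to exactly one split and let the images follow. In the sketch below, patients are stratified by whether they have at least one positive image, so the image-level class ratio is only preserved approximately. `split_by_patient` and its arguments are illustrative names.

```python
from random import Random


def split_by_patient(records, fracs=(0.6, 0.2, 0.2), random_seed=1):
    """
    records: iterable of (patient_id, label) pairs, one pair per image.
    Returns three sets of patient IDs (train, val, test) that never overlap.
    """
    # a patient counts as positive if any of their images is positive
    is_positive = {}
    for pid, label in records:
        is_positive[pid] = is_positive.get(pid, False) or label == 1

    rng = Random(random_seed)
    splits = (set(), set(), set())
    # split positive and negative patients separately (stratification)
    for flag in (True, False):
        pids = sorted(pid for pid, pos in is_positive.items() if pos == flag)
        rng.shuffle(pids)
        n_train = round(len(pids) * fracs[0])
        n_val = round(len(pids) * fracs[1])
        splits[0].update(pids[:n_train])
        splits[1].update(pids[n_train:n_train + n_val])
        splits[2].update(pids[n_train + n_val:])  # everything left goes to test
    return splits


# Toy data: 10 patients with 2 images each; patients 0 to 4 have one positive image
records = [(pid, int(pid < 5 and k == 0)) for pid in range(10) for k in range(2)]
train_pids, val_pids, test_pids = split_by_patient(records)
print(len(train_pids), len(val_pids), len(test_pids))  # 6 2 2
```

With the dataframe, the records could come from `zip(df['Patient ID'], df['Finding Labels'])`, and each split would then be selected with `df[df['Patient ID'].isin(train_pids)]`.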

In [17]:
# split dataset
train_set, val_set, test_set = train_val_test_split(train_frac = 0.6, val_frac = 0.2, test_frac = 0.2)
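
Since the goal is to keep each patient in a single split, it is worth checking the result, especially because the remaining images are drawn by image index rather than by patient. The helper below is a hypothetical sketch (`patient_overlap` is not part of this notebook) that reports the patient IDs shared between any two splits. It works on any iterables of IDs, so toy lists are used here.

```python
from itertools import combinations


def patient_overlap(**splits):
    """Return {(name_a, name_b): shared_ids} for every pair of splits that share patients."""
    id_sets = {name: set(ids) for name, ids in splits.items()}
    overlaps = {}
    for (name_a, ids_a), (name_b, ids_b) in combinations(id_sets.items(), 2):
        shared = ids_a & ids_b
        if shared:
            overlaps[(name_a, name_b)] = shared
    return overlaps


# Toy example: patient 3 appears in both train and test
leaks = patient_overlap(train=[1, 2, 3], val=[4, 5], test=[3, 6])
print(leaks)  # {('train', 'test'): {3}}
```

For the splits above, the call would be `patient_overlap(train=train_set['Patient ID'], val=val_set['Patient ID'], test=test_set['Patient ID'])`; an empty dictionary means no patient is shared.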