# Prep for Balanced Data Model of Chest X-Rays

In [1]:
"""
Extracts a random sample of records from the full NIH Chest X-Ray dataset
(https://nihcc.app.box.com/v/ChestXray-NIHCC) to build a balanced dataset.
"""

# Start with imports for data handling
import pandas as pd
import random
from pathlib import Path
from sklearn.preprocessing import MultiLabelBinarizer, LabelEncoder

In [2]:
# Read in the CSV file
full_data = pd.read_csv(Path("Resources/Data_Entry_2017_v2020.csv"))
full_data.head()

Unnamed: 0,Image Index,Finding Labels,Follow-up #,Patient ID,Patient Age,Patient Gender,View Position,OriginalImage[Width,Height],OriginalImagePixelSpacing[x,y]
0,00000001_000.png,Cardiomegaly,0,1,57,M,PA,2682,2749,0.143,0.143
1,00000001_001.png,Cardiomegaly|Emphysema,1,1,58,M,PA,2894,2729,0.143,0.143
2,00000001_002.png,Cardiomegaly|Effusion,2,1,58,M,PA,2500,2048,0.168,0.168
3,00000002_000.png,No Finding,0,2,80,M,PA,2500,2048,0.171,0.171
4,00000003_001.png,Hernia,0,3,74,F,PA,2500,2048,0.168,0.168


### Utilizing code from Aleksandar to isolate individual diagnoses from Finding Labels

See data_prep.ipynb for an explanation of the process.

In [3]:
# Split 'Finding Labels' into a list of labels
full_data['Finding Labels'] = full_data['Finding Labels'].apply(lambda x: x.split('|'))

# One-hot encode the findings
mlb = MultiLabelBinarizer()
labels_encoded = mlb.fit_transform(full_data['Finding Labels'])
labels_encoded_df = pd.DataFrame(labels_encoded, columns=mlb.classes_)

# Concatenate with original dataframe
full_data = pd.concat([full_data, labels_encoded_df], axis=1)

# Show the count of each label in the dataset
label_counts = full_data.iloc[:, -len(mlb.classes_):].sum().sort_values(ascending=False)
print(label_counts)

No Finding            60361
Infiltration          19894
Effusion              13317
Atelectasis           11559
Nodule                 6331
Mass                   5782
Pneumothorax           5302
Consolidation          4667
Pleural_Thickening     3385
Cardiomegaly           2776
Emphysema              2516
Edema                  2303
Fibrosis               1686
Pneumonia              1431
Hernia                  227
dtype: int64


### We can see from here the data is very unbalanced.

**Hypothesis:** If we take all 227 Hernia images and randomly select 227 images from each of the other 14 categories, we should get a much more balanced sample dataset to work with. Some images will still carry multiple diagnoses. This gives a combined training and testing set of around 3,400 images (15 × 227 = 3,405 before any duplicates are removed). We can then check for and handle outliers and perform other data preprocessing on this much smaller but more balanced set.

In [4]:
# Start with taking all Hernias
hernia_df = full_data.loc[full_data["Hernia"] == 1]
hernia_df.shape

(227, 26)

In [5]:
# Now let's sample all others
none_df = full_data.loc[full_data["No Finding"] == 1].sample(n=227, random_state=1)
infiltration_df = full_data.loc[full_data["Infiltration"] == 1].sample(n=227, random_state=1)
effusion_df = full_data.loc[full_data["Effusion"] == 1].sample(n=227, random_state=1)
atelectasis_df = full_data.loc[full_data["Atelectasis"] == 1].sample(n=227, random_state=1)
nodule_df = full_data.loc[full_data["Nodule"] == 1].sample(n=227, random_state=1)
mass_df = full_data.loc[full_data["Mass"] == 1].sample(n=227, random_state=1)
pneumothorax_df = full_data.loc[full_data["Pneumothorax"] == 1].sample(n=227, random_state=1)
consolidation_df = full_data.loc[full_data["Consolidation"] == 1].sample(n=227, random_state=1)
pleural_thickening_df = full_data.loc[full_data["Pleural_Thickening"] == 1].sample(n=227, random_state=1)
cardiomegaly_df = full_data.loc[full_data["Cardiomegaly"] == 1].sample(n=227, random_state=1)
emphysema_df = full_data.loc[full_data["Emphysema"] == 1].sample(n=227, random_state=1)
edema_df = full_data.loc[full_data["Edema"] == 1].sample(n=227, random_state=1)
fibrosis_df = full_data.loc[full_data["Fibrosis"] == 1].sample(n=227, random_state=1)
pneumonia_df = full_data.loc[full_data["Pneumonia"] == 1].sample(n=227, random_state=1)

In [6]:
# And pack them all together
frames = [none_df,
          infiltration_df,
          effusion_df,
          atelectasis_df,
          nodule_df,
          mass_df,
          pneumothorax_df,
          consolidation_df,
          pleural_thickening_df,
          cardiomegaly_df,
          emphysema_df,
          edema_df,
          fibrosis_df,
          pneumonia_df,
          hernia_df]
balanced_df = pd.concat(frames)
balanced_df.shape

(3405, 26)
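
The fourteen near-identical `sample` calls above can also be written as a loop over the label columns. The following is only a minimal sketch on a toy frame: `toy`, `label_cols`, and `n_per_class` are hypothetical names and values, not part of the notebook. In the notebook itself the loop would presumably run over `mlb.classes_` with `full_data`.

```python
import pandas as pd

# Toy stand-in for full_data with one-hot label columns (hypothetical values)
toy = pd.DataFrame({
    "Image Index": [f"img_{i:02d}.png" for i in range(12)],
    "Hernia":      [1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    "Effusion":    [0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
    "No Finding":  [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1],
})
label_cols = ["Hernia", "Effusion", "No Finding"]

# The rarest class sets the per-class sample size (227 in the notebook)
n_per_class = int(toy[label_cols].sum().min())

# Draw n_per_class rows from each class, then stack the samples
frames = [
    toy.loc[toy[col] == 1].sample(n=n_per_class, random_state=1)
    for col in label_cols
]
balanced_toy = pd.concat(frames)

print(n_per_class)         # 2
print(balanced_toy.shape)  # (6, 4)
```

One difference from the cells above: the loop also passes the rarest class through `sample`, which returns all of its rows in shuffled order rather than in their original order.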

In [7]:
# Drop any duplicate images that may have been randomly selected
balanced_df = balanced_df.drop_duplicates(subset=["Image Index"])
balanced_df.shape

(3343, 26)
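
Duplicates arise because an image with several findings can be drawn once for each of its labels. Here the shape drops from 3405 to 3343 rows, so 62 repeated rows were removed. A small sketch of the mechanism, using two hypothetical per-class samples (the frames and file names below are made up for illustration):

```python
import pandas as pd

# Hypothetical samples: img_b.png carries both labels, so it appears in both
effusion_sample = pd.DataFrame({"Image Index": ["img_a.png", "img_b.png"],
                                "Effusion": [1, 1], "Hernia": [0, 1]})
hernia_sample = pd.DataFrame({"Image Index": ["img_b.png", "img_c.png"],
                              "Effusion": [1, 0], "Hernia": [1, 1]})

stacked = pd.concat([effusion_sample, hernia_sample])

# Count repeated images before dropping them
n_repeats = int(stacked.duplicated(subset=["Image Index"]).sum())
deduped = stacked.drop_duplicates(subset=["Image Index"])

print(n_repeats)     # 1
print(len(deduped))  # 3
```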

In [8]:
# Now let's look at the balance of our new sample data
label_counts = balanced_df.iloc[:, -len(mlb.classes_):].sum().sort_values(ascending=False)
print(label_counts)

Infiltration          926
Effusion              876
Atelectasis           679
Mass                  476
Nodule                450
Pneumothorax          444
Consolidation         418
Pleural_Thickening    381
Edema                 333
Emphysema             306
Cardiomegaly          298
Pneumonia             297
Fibrosis              283
Hernia                227
No Finding            227
dtype: int64


### An observation

While the data is still not totally balanced per se, we can see that some diagnoses are more likely than others to be comorbid. Because many images have multiple diagnoses, we will never get a totally balanced dataset. Training a model on this subset of the data may still give interesting results, especially since our model can return multiple diagnoses.
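
One way to inspect which diagnoses tend to appear together is a label co-occurrence matrix. The sketch below uses a toy one-hot frame with made-up values; `labels` and `cooccurrence` are hypothetical names. In this notebook the input would presumably be `balanced_df[list(mlb.classes_)]`.

```python
import pandas as pd

# Toy one-hot label frame: one row per image (hypothetical values)
labels = pd.DataFrame({
    "Effusion":     [1, 1, 0, 1],
    "Infiltration": [1, 0, 1, 1],
    "Hernia":       [0, 0, 1, 0],
})

# Entry (a, b) counts the images labeled with both a and b;
# the diagonal holds the per-label totals
cooccurrence = labels.T.dot(labels)

print(cooccurrence.loc["Effusion", "Infiltration"])  # 2
print(cooccurrence.loc["Hernia", "Hernia"])          # 1
```

Large off-diagonal entries relative to the diagonal point to pairs of findings that are frequently comorbid.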

### Back to additional preprocessing from Aleksandar

The steps differ slightly because this CSV came from NIH rather than Kaggle.

In [9]:
# Remove rows with any missing values
balanced_df = balanced_df.dropna()

# Encode other categorical data
le_gender = LabelEncoder()
balanced_df['Patient Gender'] = le_gender.fit_transform(balanced_df["Patient Gender"])

le_view = LabelEncoder()
balanced_df["View Position"] = le_view.fit_transform(balanced_df["View Position"])
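
`LabelEncoder` assigns integer codes in the sorted order of the unique values it sees, so it helps to record the mapping before the original strings are overwritten. A minimal sketch on a toy list (the values and the `mapping` name are assumptions for illustration):

```python
from sklearn.preprocessing import LabelEncoder

# Toy gender column (hypothetical values)
genders = ["M", "F", "M", "F"]

le = LabelEncoder()
codes = le.fit_transform(genders)

# classes_ is sorted, so each label's position is its integer code
mapping = {str(label): int(code) for code, label in enumerate(le.classes_)}

print(list(codes))  # [1, 0, 1, 0]
print(mapping)      # {'F': 0, 'M': 1}
```

The same lookup on `le_gender.classes_` and `le_view.classes_` would show which integer stands for which category in the encoded columns.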

### And a little of my own thoughts

In [10]:
# Remove columns that would have no predictive value for medical diagnoses (and the label column we have now encoded)
cleaner_df = balanced_df.drop(["Finding Labels",
                               "Follow-up #",
                               "Patient ID",
                               "OriginalImage[Width",
                               "Height]",
                               "OriginalImagePixelSpacing[x",
                               "y]"], axis=1)

# Sort by image file name (the original row index is kept)
cleaner_df = cleaner_df.sort_values("Image Index")

# View sample results
cleaner_df


Unnamed: 0,Image Index,Patient Age,Patient Gender,View Position,Atelectasis,Cardiomegaly,Consolidation,Edema,Effusion,Emphysema,Fibrosis,Hernia,Infiltration,Mass,No Finding,Nodule,Pleural_Thickening,Pneumonia,Pneumothorax
11,00000003_000.png,81,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0
4,00000003_001.png,74,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0
5,00000003_002.png,75,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0
6,00000003_003.png,76,0,1,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0
7,00000003_004.png,77,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
112031,00030744_000.png,51,1,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0
112047,00030753_007.png,54,0,1,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0
112078,00030774_000.png,43,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0
112092,00030786_002.png,61,0,0,0,1,0,0,1,1,0,0,0,0,0,0,0,0,1


In [12]:
# This file is now ready for adding the image data, training/testing split, and then any scaling/normalization of the data
# Export the data to new CSV file
cleaner_df.to_csv(Path("Resources/Clean_n_Balanced.csv"), index=False)