# Notebook for creating the preprocessed dataframe of PadChest images

We will filter the dataframe based on the following points:
- Exclude instances with NaN in the labels
- Exclude instances with 'suboptimal study' in labels
- Only keep 'AP', 'PA' and 'AP_horizontal' projections
- Strip leading and trailing whitespace from every label, so that, for instance, 'pneumonia' and ' pneumonia' are treated as the same label
- Lowercase all labels
- Exclude empty strings ('') in labels
- Remove invalid instances (given by the image preprocessing)
- Remove duplicates within each label list (e.g. when 'pneumonia' appears twice for one instance)
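
The row-level filters in the list above (NaN labels, 'suboptimal study', projection) can be sketched on a toy dataframe. The rows below are invented for illustration and only mimic the `ImageID`, `Labels` and `Projection` columns used in this notebook; the sketch is not the notebook's own pipeline.

```python
import pandas as pd

# Toy stand-in for the PadChest table; all rows are made up for illustration.
toy = pd.DataFrame({
    "ImageID": ["a.png", "b.png", "c.png", "d.png"],
    "Labels": ["['pneumonia']", None, "['suboptimal study']", "['normal']"],
    "Projection": ["PA", "AP", "PA", "L"],
})

# One boolean mask per rule, combined with & so a row must pass all three.
keep = (
    toy["Labels"].notna()
    & ~toy["Labels"].str.contains("suboptimal study", regex=False, na=False)
    & toy["Projection"].isin(["PA", "AP", "AP_horizontal"])
)

# .copy() gives an independent dataframe that can be modified later without warnings.
toy_filtered = toy[keep].copy()
print(toy_filtered["ImageID"].tolist())  # → ['a.png']
```

Combining the masks first and indexing once keeps all three rules visible in one place.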

In [7]:
# Imports
import pandas as pd
import ast

In [8]:
# Loading the original, full dataframe
data = pd.read_csv("/home/data_shares/purrlab/padchest/PADCHEST_chest_x_ray_images_labels_160K_01.02.19.csv", index_col=0)

## Preprocessing the 'Labels' column and filtering based on the 'Projection' column

In [9]:
# Excluding NaNs in the labels
data_prep = data[~data["Labels"].isna()]

# Excluding labels including the 'suboptimal study' label
data_prep = data_prep[~data_prep["Labels"].str.contains('suboptimal study')]

# Keeping only the PA, AP and AP_horizontal projections
# (.copy() so that assigning to the 'Labels' column later does not raise a SettingWithCopyWarning)
data_prep = data_prep[data_prep['Projection'].isin(['PA', 'AP', 'AP_horizontal'])].copy()

In [10]:
# Stripping and lowercasing all individual labels
stripped_lowercased_labels = []

for label_list in list(data_prep['Labels']):
    label_list = ast.literal_eval(label_list)
    prepped_labels = []
    
    for label in label_list:
        new_label = label.strip().lower()   # Stripping and lowercasing
        if new_label != '':                 # Skipping labels that are empty after stripping
            prepped_labels.append(new_label)
    
    # Removing label duplicates while keeping the original label order
    stripped_lowercased_labels.append(list(dict.fromkeys(prepped_labels)))

# Applying it to the preprocessed dataframe
data_prep['Labels'] = stripped_lowercased_labels
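
The label cleaning can also be written as a small function, which makes it easy to check on a single example. This is a minimal sketch; `normalize_labels` is a hypothetical helper name that does not appear in the original notebook, and the input string is invented.

```python
import ast

def normalize_labels(raw):
    """Parse a stringified label list, then strip, lowercase,
    drop empty labels, and remove duplicates (keeping first-seen order)."""
    cleaned = []
    for label in ast.literal_eval(raw):
        label = label.strip().lower()
        # Skip labels that are empty after stripping, and labels already seen.
        if label and label not in cleaned:
            cleaned.append(label)
    return cleaned

print(normalize_labels("['pneumonia', ' Pneumonia', '', ' ', 'normal ']"))
# → ['pneumonia', 'normal']
```

With such a helper, the whole column could be cleaned with `data_prep['Labels'].apply(normalize_labels)`.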

In [13]:
len(data_prep)

109070

In [12]:
# Removing invalid images, found through manual inspection of images for annotation
invalid_images = pd.read_csv('/home/caap/LabelReliability_and_PathologyDetection_in_ChestXrays/Data/Invalid_images.csv', index_col=0)
data_prep_no_invalid = data_prep[~data_prep['ImageID'].isin(invalid_images['ImageID'])]

EmptyDataError: No columns to parse from file
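
`pd.read_csv` raises the `EmptyDataError` above when the file it is given has no content at all. One way to make the step robust is to catch that exception, as in the sketch below. `read_invalid_ids` is a hypothetical helper, and treating an empty file as "no invalid images" is an assumption: it may hide a wrong path or a file that was never written, so checking the file first is advisable.

```python
import io
import pandas as pd

def read_invalid_ids(source):
    """Return the set of ImageID values in `source`,
    or an empty set if the file has no content."""
    try:
        invalid = pd.read_csv(source, index_col=0)
    except pd.errors.EmptyDataError:
        # Assumption: an empty file means there is nothing to exclude.
        return set()
    return set(invalid["ImageID"])

# In-memory files stand in for Invalid_images.csv here.
print(read_invalid_ids(io.StringIO("")))                     # → set()
print(read_invalid_ids(io.StringIO(",ImageID\n0,a.png\n")))  # → {'a.png'}
```

The returned set can be passed to `isin` in the same way as `invalid_images['ImageID']`.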

In [6]:
print(len(data_prep_no_invalid))
data_prep_no_invalid[:2]

NameError: name 'data_prep_no_invalid' is not defined

## Saving the preprocessed dataframe in a file

In [16]:
data_prep_no_invalid = data_prep_no_invalid.reset_index(drop=True)

In [17]:
#data_prep_no_invalid.to_csv('Data/preprocessed_df.csv', sep=",")
#print('Saved :)')

Saved :)
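
After the cleaning step, the 'Labels' column holds Python lists, and CSV stores them as strings such as `"['pneumonia', 'normal']"`. They therefore need to be parsed again when the file is reloaded. The sketch below shows the round trip on an invented one-row dataframe, using an in-memory buffer in place of `Data/preprocessed_df.csv`.

```python
import ast
import io
import pandas as pd

# Invented one-row example with a list-valued 'Labels' column.
df = pd.DataFrame({"ImageID": ["a.png"], "Labels": [["pneumonia", "normal"]]})

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
df.to_csv(buffer, sep=",")
buffer.seek(0)

reloaded = pd.read_csv(buffer, index_col=0)
# At this point each entry in 'Labels' is a string, not a list.

# Parse the strings back into Python lists.
reloaded["Labels"] = reloaded["Labels"].apply(ast.literal_eval)
print(reloaded.loc[0, "Labels"])  # → ['pneumonia', 'normal']
```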
