# Notebook for creating the preprocessed dataframe of PadChest images

We will filter the dataframe based on the following points:
- Exclude instances with NaN in the labels
- Exclude instances with 'suboptimal study' in labels
- Only keep 'AP', 'PA' and 'AP_horizontal' projections
- Strip leading and trailing spaces from all labels (so that, for instance, 'pneumonia' and ' pneumonia' are not treated as different labels)
- Lowercase all labels
- Exclude empty strings ('') in labels
- Remove invalid instances (given by the image preprocessing)
- Remove duplicates in label lists (e.g. when 'pneumonia' appears twice for one instance)
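
As a rough illustration of the label-level points above (strip, lowercase, drop empty strings, remove duplicates), here is a minimal sketch on a single made-up label string. The helper name `clean_labels` and the example labels are hypothetical and not part of the notebook, and unlike the `set()`-based cell further down, this version keeps labels in first-seen order.

```python
import ast


def clean_labels(raw):
    """Parse a stringified label list and normalize it (hypothetical helper)."""
    labels = ast.literal_eval(raw)          # "['a', 'b']" -> ['a', 'b']
    cleaned = []
    for label in labels:
        label = label.strip(' ').lower()    # strip spaces, then lowercase
        # Skip empty strings and labels that were already seen
        if label and label not in cleaned:
            cleaned.append(label)
    return cleaned


# Made-up example input, in the stringified-list form used by the 'Labels' column
example = "['Pneumonia', ' pneumonia', '', 'COPD signs ']"
print(clean_labels(example))   # ['pneumonia', 'copd signs']
```

Checking for the empty string after stripping means that a label consisting only of spaces is dropped as well.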

In [1]:
# Imports
import pandas as pd
import ast

In [2]:
# Loading the original, full dataframe
data = pd.read_csv("/home/data_shares/purrlab/padchest/PADCHEST_chest_x_ray_images_labels_160K_01.02.19.csv", index_col=0)



## Preprocessing the 'Labels' column and filtering based on the 'Projection' column

In [3]:
# Excluding NaNs in the labels
data_prep = data[~data["Labels"].isna()]

# Excluding labels including the 'suboptimal study' label
data_prep = data_prep[~data_prep["Labels"].str.contains('suboptimal study')]

# Keeping only the PA, AP and AP_horizontal projections
data_prep = data_prep[data_prep['Projection'].isin(['PA', 'AP', 'AP_horizontal'])]
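
To see how the three row filters in this cell interact, here is a small sketch on a made-up dataframe. The toy rows, the `'L'` projection value and the variable names are assumptions for illustration only. At this stage `Labels` still holds stringified lists, so `str.contains` performs a plain substring search on that text.

```python
import pandas as pd

# Made-up rows; 'Labels' holds stringified lists, as in the raw CSV
toy = pd.DataFrame({
    "ImageID": ["a.png", "b.png", "c.png", "d.png", "e.png"],
    "Labels": ["['normal']", None, "['suboptimal study', 'pneumonia']",
               "['normal']", "['copd signs']"],
    "Projection": ["PA", "PA", "AP", "L", "AP_horizontal"],
})

has_labels = toy["Labels"].notna()
# regex=False: literal substring match; na=False: missing labels count as "no match"
suboptimal = toy["Labels"].str.contains("suboptimal study", regex=False, na=False)
frontal = toy["Projection"].isin(["PA", "AP", "AP_horizontal"])

kept = toy[has_labels & ~suboptimal & frontal]
print(kept["ImageID"].tolist())   # ['a.png', 'e.png']
```

Combining the boolean masks in one step gives the same rows as applying the filters one after another, and keeps each condition easy to inspect on its own.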

In [4]:
# Stripping and lowercasing all individual labels
stripped_lowercased_labels = []

for label_list in list(data_prep['Labels']):
    label_list = ast.literal_eval(label_list)
    prepped_labels = []
    
    for label in label_list:
        new_label = label.strip(' ').lower()   # Stripping and lowercasing
        if new_label != '':   # Checked after stripping, so labels containing only spaces are excluded too
            prepped_labels.append(new_label)
    
    # Removing duplicate labels before appending (note: set() does not preserve the label order)
    stripped_lowercased_labels.append(list(set(prepped_labels)))

# Applying it to the preprocessed dataframe (copying first, since data_prep is a filtered slice of data)
data_prep = data_prep.copy()
data_prep['Labels'] = stripped_lowercased_labels

In [20]:
# Removing invalid images, found through manual inspection of images for annotation
invalid_images = pd.read_csv('/home/caap/LabelReliability_and_PathologyDetection_in_ChestXrays/Data/Invalid_images.csv', index_col=0)
invalid_images.columns = list(data_prep.columns) + ["path"]
data_prep_no_invalid = data_prep[~data_prep['ImageID'].isin(invalid_images['ImageID'])]
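
The exclusion above is an anti-join on `ImageID`. A minimal sketch with made-up data, which also reports how many rows were dropped (the frames, file names and path below are hypothetical):

```python
import pandas as pd

# Made-up stand-ins for data_prep and the invalid-image list
frame = pd.DataFrame({"ImageID": ["a.png", "b.png", "c.png"],
                      "Projection": ["PA", "AP", "PA"]})
invalid = pd.DataFrame({"ImageID": ["b.png"], "path": ["/some/dir/b.png"]})

# Keep only rows whose ImageID does not appear in the invalid list
valid = frame[~frame["ImageID"].isin(invalid["ImageID"])]
print(f"removed {len(frame) - len(valid)} of {len(frame)} rows")   # removed 1 of 3 rows
```

Printing the number of removed rows is a cheap sanity check that the two `ImageID` columns actually share values.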

In [21]:
print(len(data_prep_no_invalid))
data_prep_no_invalid[:2]

109044


Unnamed: 0,ImageID,ImageDir,StudyDate_DICOM,StudyID,PatientID,PatientBirth,PatientSex_DICOM,ViewPosition_DICOM,Projection,MethodProjection,...,ExposureTime,RelativeXRayExposure_DICOM,ReportID,Report,MethodLabel,Labels,Localizations,LabelsLocalizationsBySentence,labelCUIS,LocalizationsCUIS
0,20536686640136348236148679891455886468_k6ga29.png,0,20140915,20536686640136348236148679891455886468,839860488694292331637988235681460987,1930.0,F,POSTEROANTERIOR,PA,Manual review of DICOM fields,...,10,-1.42,4765777,sin hallazg patolog edad pacient .,Physician,[normal],[],"[['normal'], ['normal']]",[],[]
2,135803415504923515076821959678074435083_fzis7b...,0,20150914,135803415504923515076821959678074435083,313572750430997347502932654319389875966,1929.0,M,POSTEROANTERIOR,PA,Manual review of DICOM fields,...,10,,4991845,cambi pulmonar cronic sever . sign fibrosis b...,Physician,"[kyphosis, ground glass pattern, pseudonodule,...","['loc basal', 'loc basal bilateral']","[['pulmonary fibrosis', 'loc basal bilateral']...",['C0034069' 'C0742362' 'C2115817' 'C3544344'],['C1282378']


## Saving the preprocessed dataframe in a file

In [22]:
data_prep_no_invalid = data_prep_no_invalid.reset_index(drop=True)

In [23]:
data_prep_no_invalid.to_csv('/home/caap/LabelReliability_and_PathologyDetection_in_ChestXrays/Data/preprocessed_df.csv', sep=",")
print('Saved :)')

Saved :)
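
One thing to keep in mind when loading the saved file later: `to_csv` writes each label list as its string representation, so `Labels` comes back as text unless it is parsed again. A minimal sketch of the round trip, using an in-memory buffer and made-up rows instead of the real file:

```python
import ast
import io

import pandas as pd

# Made-up rows with real Python lists in 'Labels', as after the preprocessing
frame = pd.DataFrame({"ImageID": ["a.png", "b.png"],
                      "Labels": [["normal"], ["kyphosis", "pseudonodule"]]})

buffer = io.StringIO()          # stands in for the CSV file on disk
frame.to_csv(buffer)

# Plain read: 'Labels' comes back as strings such as "['normal']"
buffer.seek(0)
as_text = pd.read_csv(buffer, index_col=0)

# Read with a converter: 'Labels' is parsed back into lists
buffer.seek(0)
as_lists = pd.read_csv(buffer, index_col=0,
                       converters={"Labels": ast.literal_eval})

print(type(as_text.loc[0, "Labels"]).__name__)   # str
print(as_lists.loc[1, "Labels"])                 # ['kyphosis', 'pseudonodule']
```

Passing `ast.literal_eval` through `converters` mirrors the parsing step used earlier in this notebook on the original `Labels` column.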
