# Data Cleaning 

The initial data is composed of three datasets:
1. `stage_2_train_labels.csv`: containing the patient Id, information about coordinates that I won't be using for the current problem, and a target binary column: 1 for pneumonia and 0 for no pneumonia.
2. `stage_2_detailed_class_info.csv`: containing the patient Id, and whether the chest X-Ray (CXR) is normal, has pneumonia, or is not normal but does not have pneumonia.
3. A corpus of 26684 DICOM files (.dcm), each composed of an image and metadata about the patient and the CXR, which may be useful for our particular problem.

In the current notebook we will import both CSVs, drop duplicate patient entries, join them into a single dataframe and drop the columns we won't need.

Two columns will also be created, containing the name of each file in .dcm and in .jpg format.

Afterwards, metadata available from the .dcm files will be extracted and appended to the resulting dataframe.

## Environment 

### Imports 

In [None]:
# General imports
import numpy as np
import pandas as pd

# System and file management
import os

# DICOM
import pydicom
from pydicom.filereader import dcmread

### Functions 

In [2]:
def dicom_info_extractor(file_path, file_name, info):
    """
    Extract one metadata field from a given DICOM file.
    Inputs:
        - file_path: path of the directory containing the file to parse.
        - file_name: name of the specific file to parse.
        - info: the field to extract. One of: 'age', 'sex' or 'pos' (position of the patient).
    Outputs:
        - A string containing the metadata.
    """
    file = os.path.join(file_path, file_name)
    # Only the header is needed, so skip reading the pixel data.
    dicom = dcmread(file, stop_before_pixels=True)

    if info == 'age':
        return dicom.PatientAge
    elif info == 'sex':
        return dicom.PatientSex
    elif info == 'pos':
        return dicom.ViewPosition
    # Fail loudly instead of silently returning None for an unknown key.
    raise ValueError(f"info must be 'age', 'sex' or 'pos', got {info!r}")
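
Because the function above opens the file on every call, reading several fields means parsing each file several times. The sketch below shows one possible single-pass variant. It is not part of the original notebook: `extract_metadata` and `FIELDS` are hypothetical names, and a `SimpleNamespace` stands in for the dataset object that `dcmread` returns, so the block runs without any DICOM files.

```python
from types import SimpleNamespace

# Hypothetical mapping: output column name -> DICOM attribute name.
FIELDS = {'sex': 'PatientSex', 'age': 'PatientAge', 'view': 'ViewPosition'}

def extract_metadata(dataset, fields=FIELDS):
    """Return the requested attributes of an already-parsed dataset as a dict.

    Attributes missing from the dataset are returned as None.
    """
    return {column: getattr(dataset, attribute, None)
            for column, attribute in fields.items()}

# Stand-in for a parsed DICOM dataset (assumption: real code would pass
# the object returned by dcmread instead).
fake_dataset = SimpleNamespace(PatientSex='F', PatientAge='51', ViewPosition='PA')
metadata = extract_metadata(fake_dataset)
print(metadata)
```

In the notebook, the dictionaries produced for each file could be collected in a list and passed to `pd.DataFrame` to obtain all three columns at once.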

### Paths 

In [3]:
PATH = 'data/'
CSV_PATH = os.path.join(PATH,'csv')
IMG_POOL_PATH = os.path.join(PATH,'pool')

## Import and arrange CSVs

In [4]:
# Labels CSVs
info = pd.read_csv(os.path.join(CSV_PATH, 'stage_2_train_labels.csv')).drop_duplicates(subset='patientId')
detailed = pd.read_csv(os.path.join(CSV_PATH, 'stage_2_detailed_class_info.csv')).drop_duplicates(subset='patientId')

In [5]:
# Joining both
labels = detailed.join(info.set_index('patientId'), on='patientId', how='left').drop(columns=['x', 'y', 'width', 'height'])
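
A patient can appear on several rows of the CSVs, which is why duplicates are dropped on `patientId` before joining. The toy example below sketches the same steps with made-up rows; it uses `merge` with `validate='one_to_one'` as an optional safeguard (an addition for illustration, not what the notebook does) so that pandas raises an error if either side still contains repeated patients.

```python
import pandas as pd

# Made-up rows: patient 'b' appears twice in both tables.
info = pd.DataFrame({
    'patientId': ['a', 'b', 'b', 'c'],
    'x': [None, 10.0, 200.0, None],
    'y': [None, 20.0, 220.0, None],
    'width': [None, 50.0, 60.0, None],
    'height': [None, 80.0, 90.0, None],
    'Target': [0, 1, 1, 0],
})
detailed = pd.DataFrame({
    'patientId': ['a', 'b', 'b', 'c'],
    'class': ['Normal', 'Lung Opacity', 'Lung Opacity',
              'No Lung Opacity / Not Normal'],
})

# Keep one row per patient in each table.
info = info.drop_duplicates(subset='patientId')
detailed = detailed.drop_duplicates(subset='patientId')

# Left join on patientId; validate checks that keys are unique on both sides.
labels = detailed.merge(info, on='patientId', how='left', validate='one_to_one')
labels = labels.drop(columns=['x', 'y', 'width', 'height'])
print(labels)
```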

In [6]:
# Adding .dcm extension to file name.
labels['dcm_file_name'] = labels['patientId'] + '.dcm'

In [7]:
# Adding .jpg extension to file name.
labels['jpg_file_name'] = labels['patientId'] + '.jpg'
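
Before reading metadata from the files, it can help to confirm that every generated file name actually exists in the image pool. The following is a minimal sketch under stated assumptions: `missing_files` is a hypothetical helper, and a temporary directory with made-up file names stands in for `IMG_POOL_PATH`.

```python
import os
import tempfile

def missing_files(directory, file_names):
    """Return the names in file_names that are not regular files in directory."""
    return [name for name in file_names
            if not os.path.isfile(os.path.join(directory, name))]

# Temporary directory standing in for the image pool (assumption).
with tempfile.TemporaryDirectory() as pool:
    # Create one empty placeholder file; the second name is left missing.
    open(os.path.join(pool, 'patient-1.dcm'), 'wb').close()
    expected = ['patient-1.dcm', 'patient-2.dcm']
    missing = missing_files(pool, expected)

print(missing)
```

In the notebook this could be called as `missing_files(IMG_POOL_PATH, labels['dcm_file_name'])`, with an empty list meaning every file is present.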

In [8]:
# Extracting sex information from the images by using the function defined above:
labels['sex'] = labels.dcm_file_name.apply(
    lambda x: dicom_info_extractor(file_path=IMG_POOL_PATH, file_name=x, info='sex'))

In [9]:
# Extracting age information from the images by using the function defined above:
labels['age'] = labels.dcm_file_name.apply(
    lambda x: dicom_info_extractor(file_path=IMG_POOL_PATH, file_name=x, info='age'))

In [10]:
# Extracting CXR view information from the images by using the function defined above:
labels['view'] = labels.dcm_file_name.apply(
    lambda x: dicom_info_extractor(file_path=IMG_POOL_PATH, file_name=x, info='pos'))

In [12]:
# Rearranging and changing column names
labels = labels[['patientId', 'dcm_file_name', 'jpg_file_name','class', 'sex', 'age', 'view', 'Target']] # Rearranging columns.
labels.columns = ['patient_id', 'dcm_file_name', 'jpg_file_name', 'type', 'sex', 'age', 'view', 'target'] # Changing column names.

In [13]:
# Encoding the classes to numerical values
type_dict = {'Normal': 0, 'No Lung Opacity / Not Normal': 1, 'Lung Opacity': 2}
labels['target_3'] = labels.type.map(type_dict)

In [15]:
# one-hot-encoding and appending to dataframe
labels = pd.concat([labels, pd.get_dummies(labels['target_3'], prefix='type')], axis=1)
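
The encoding steps can be checked on a small made-up frame, as sketched below. Two details here are additions for illustration rather than part of the notebook: the assertion guards against class labels missing from `type_dict` (which `map` would turn into NaN), and `dtype='uint8'` is passed explicitly because recent pandas versions return boolean dummy columns by default, whereas the output shown in this notebook has `uint8` columns.

```python
import pandas as pd

type_dict = {'Normal': 0, 'No Lung Opacity / Not Normal': 1, 'Lung Opacity': 2}

# Made-up rows covering the three classes.
toy = pd.DataFrame({
    'type': ['Normal', 'Lung Opacity', 'No Lung Opacity / Not Normal'],
})

# Map class names to integer codes.
toy['target_3'] = toy['type'].map(type_dict)
# Series.map yields NaN for labels absent from the dictionary; fail early.
assert toy['target_3'].notna().all()

# One indicator column per code: type_0, type_1, type_2.
dummies = pd.get_dummies(toy['target_3'], prefix='type', dtype='uint8')
toy = pd.concat([toy, dummies], axis=1)
print(toy)
```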

In [16]:
labels.head()

,patient_id,dcm_file_name,jpg_file_name,type,sex,age,view,target,target_3,type_0,type_1,type_2
0,0004cfab-14fd-4e49-80ba-63a80b6bddd6,0004cfab-14fd-4e49-80ba-63a80b6bddd6.dcm,0004cfab-14fd-4e49-80ba-63a80b6bddd6.jpg,No Lung Opacity / Not Normal,F,51,PA,0,1,0,1,0
1,00313ee0-9eaa-42f4-b0ab-c148ed3241cd,00313ee0-9eaa-42f4-b0ab-c148ed3241cd.dcm,00313ee0-9eaa-42f4-b0ab-c148ed3241cd.jpg,No Lung Opacity / Not Normal,F,48,PA,0,1,0,1,0
2,00322d4d-1c29-4943-afc9-b6754be640eb,00322d4d-1c29-4943-afc9-b6754be640eb.dcm,00322d4d-1c29-4943-afc9-b6754be640eb.jpg,No Lung Opacity / Not Normal,M,19,AP,0,1,0,1,0
3,003d8fa0-6bf1-40ed-b54c-ac657f8495c5,003d8fa0-6bf1-40ed-b54c-ac657f8495c5.dcm,003d8fa0-6bf1-40ed-b54c-ac657f8495c5.jpg,Normal,M,28,PA,0,0,1,0,0
4,00436515-870c-4b36-a041-de91049b9ab4,00436515-870c-4b36-a041-de91049b9ab4.dcm,00436515-870c-4b36-a041-de91049b9ab4.jpg,Lung Opacity,F,32,AP,1,2,0,0,1


In [17]:
labels.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 26684 entries, 0 to 30225
Data columns (total 12 columns):
patient_id       26684 non-null object
dcm_file_name    26684 non-null object
jpg_file_name    26684 non-null object
type             26684 non-null object
sex              26684 non-null object
age              26684 non-null object
view             26684 non-null object
target           26684 non-null int64
target_3         26684 non-null int64
type_0           26684 non-null uint8
type_1           26684 non-null uint8
type_2           26684 non-null uint8
dtypes: int64(2), object(7), uint8(3)
memory usage: 2.1+ MB
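
The summary above reports `age` as `object`, meaning the ages are still stored as strings taken from the DICOM metadata. If numeric ages are needed later, one option is `pd.to_numeric`. The sketch below uses made-up values, and `errors='coerce'` is a choice made here for illustration: it turns entries that cannot be parsed into NaN instead of raising.

```python
import pandas as pd

# Made-up age strings, including one unparseable entry.
ages = pd.Series(['51', '48', '19', 'unknown'])

# Convert to numbers; unparseable values become NaN.
ages_numeric = pd.to_numeric(ages, errors='coerce')
print(ages_numeric)
```

Counting the NaN values afterwards with `ages_numeric.isna().sum()` shows how many entries failed to convert.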


In [18]:
labels.shape

(26684, 12)

## Export clean CSV

In [19]:
labels.to_csv(os.path.join(CSV_PATH,'cxr_information.csv'),index=False)