# X-Ray Image Labelling & Reporting
## Data Cleaning & Preprocessing
**`Team AKAKI!` | `Minerva University`**

---

## Project Overview

In summary, we measure how similar our predicted captions are to the actual captions written by doctors.

The work is broken down into several steps:
- Clean the Indiana X-Ray imaging data.
- Explore ways to add and engineer features for better results.
- Use machine learning, NLP, computer vision, and other methods to label the chest X-Rays.
- Compare the labels we generate against the actual labels provided by the doctors.

For example,
- **Real Caption:** the lungs are hyperinflated with coarse interstitial markings compatible with obstructive pulmonary disease and emphysema periodseq there is chronic pleuralparenchymal scarring within the lung bases periodseq no lobar consolidation is seen periodseq no pleural effusion or pneumothorax periodseq heart size is normal period.
- **Prediction Caption:** typical findings of pulmonary consolidation periodseq no pneumothorax periodseq there is no evidence for effusion.
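
How the similarity is scored is not specified at this stage. As a minimal sketch, assuming a simple unigram-overlap F1 stands in for whatever metric is finally used (the function name `token_f1` and the whitespace tokenisation are illustrative choices, not the project's method), two captions could be compared like this:

```python
from collections import Counter


def token_f1(reference, prediction):
    """Unigram F1 between two whitespace-tokenised captions (illustrative only)."""
    ref_counts = Counter(reference.lower().split())
    pred_counts = Counter(prediction.lower().split())
    # Multiset intersection: each shared token counts at most as often
    # as it appears in both captions
    overlap = sum((ref_counts & pred_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


# Shortened versions of the two example captions
real = "no pleural effusion or pneumothorax periodseq heart size is normal"
pred = "no pneumothorax periodseq there is no evidence for effusion"
score = token_f1(real, pred)
print(round(score, 3))
```

A score of 1.0 means the two captions use exactly the same tokens, and 0.0 means they share none. Word order is ignored, so this is only a rough first check.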

---

## 1. Import Libraries

In [1]:
import string
import contractions
import pandas as pd
import matplotlib.pyplot as plt
from cleantext import clean
from sklearn.model_selection import train_test_split

import pickle
from glob import glob

!ls /datasets/gdrive/XRay-AKAKI

images_normalized	 indiana_reports.csv
indiana_projections.csv  radiology_vocabulary_final.xlsx


## 2. Load the data

In [2]:
data = pd.read_csv('../data/raw/raw_merged_xray_data.csv')

In [4]:
# defines image path
images_path = "/datasets/gdrive/XRay-AKAKI/images_normalized"

# Use glob to grab every file in the images folder (no extension filter is applied)
images_file_names = glob(images_path + '/*')

print(len(images_file_names))

7693
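
The folder holds 7693 files, and each row of `data` refers to an image through its `filename` column, so it can be useful to check that the two line up. The sketch below is one possible check, not part of the original notebook; `find_unmatched` is a hypothetical helper and the toy lists stand in for `data['filename']` and `images_file_names`.

```python
import os


def find_unmatched(table_filenames, image_paths):
    """Return (names listed in the table but not on disk, files on disk but not in the table)."""
    on_disk = {os.path.basename(path) for path in image_paths}
    in_table = set(table_filenames)
    return in_table - on_disk, on_disk - in_table


# Toy stand-ins for data['filename'] and images_file_names
toy_filenames = ["1_a.png", "2_b.png", "3_c.png"]
toy_paths = ["/imgs/1_a.png", "/imgs/2_b.png", "/imgs/9_z.png"]

missing_on_disk, missing_in_table = find_unmatched(toy_filenames, toy_paths)
print(missing_on_disk, missing_in_table)
```

On the real data the call would be `find_unmatched(data['filename'], images_file_names)`.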


## 3. Quick data preview

In [5]:
data

Unnamed: 0,uid,MeSH,Problems,image,indication,comparison,findings,impression,filename,projection
0,1,normal,normal,Xray Chest PA and Lateral,Positive TB test,None.,The cardiac silhouette and mediastinum size ar...,Normal chest x-XXXX.,1_IM-0001-4001.dcm.png,Frontal
1,1,normal,normal,Xray Chest PA and Lateral,Positive TB test,None.,The cardiac silhouette and mediastinum size ar...,Normal chest x-XXXX.,1_IM-0001-3001.dcm.png,Lateral
2,2,Cardiomegaly/borderline;Pulmonary Artery/enlarged,Cardiomegaly;Pulmonary Artery,"Chest, 2 views, frontal and lateral",Preop bariatric surgery.,None.,Borderline cardiomegaly. Midline sternotomy XX...,No acute pulmonary findings.,2_IM-0652-1001.dcm.png,Frontal
3,2,Cardiomegaly/borderline;Pulmonary Artery/enlarged,Cardiomegaly;Pulmonary Artery,"Chest, 2 views, frontal and lateral",Preop bariatric surgery.,None.,Borderline cardiomegaly. Midline sternotomy XX...,No acute pulmonary findings.,2_IM-0652-2001.dcm.png,Lateral
4,3,normal,normal,Xray Chest PA and Lateral,"rib pain after a XXXX, XXXX XXXX steps this XX...",,,"No displaced rib fractures, pneumothorax, or p...",3_IM-1384-1001.dcm.png,Frontal
...,...,...,...,...,...,...,...,...,...,...
7461,3997,Opacity/lung/upper lobe/right/round/small;Gran...,Opacity;Granuloma,PA and lateral views of the chest.,XXXX-year-old male with positive PPD.,None available.,"Heart size within normal limits. Small, nodula...","No acute findings, no evidence for active TB.",3997_IM-2048-1002.dcm.png,Lateral
7462,3998,normal,normal,"PA and lateral chest XXXX, XXXX XXXX comparis...",tuberculosis positive PPD,,,Heart size is normal and the lungs are clear.,3998_IM-2048-1001.dcm.png,Frontal
7463,3998,normal,normal,"PA and lateral chest XXXX, XXXX XXXX comparis...",tuberculosis positive PPD,,,Heart size is normal and the lungs are clear.,3998_IM-2048-1002.dcm.png,Lateral
7464,3999,normal,normal,"CHEST PA and LATERAL: on XXXX, XXXX.",This is a XXXX-year-old female patient with sh...,"Chest x-XXXX, XXXX, XXXX.",,The cardiac silhouette is normal in size and c...,3999_IM-2049-1001.dcm.png,Frontal


In [6]:
images_file_names[0:5]

['/datasets/gdrive/XRay-AKAKI/images_normalized/596_IM-2188-25001.dcm.png',
 '/datasets/gdrive/XRay-AKAKI/images_normalized/680_IM-2251-1001.dcm.png',
 '/datasets/gdrive/XRay-AKAKI/images_normalized/932_IM-2430-2001.dcm.png',
 '/datasets/gdrive/XRay-AKAKI/images_normalized/3523_IM-1721-1002.dcm.png',
 '/datasets/gdrive/XRay-AKAKI/images_normalized/3022_IM-1397-2001.dcm.png']

In [7]:
# see distinct values in image column
data['image'].unique()[:5]

array(['Xray Chest PA and Lateral', 'Chest, 2 views, frontal and lateral',
       'PA and lateral views of the chest XXXX, XXXX at XXXX hours ',
       'PA and Lateral Chest. XXXX, XXXX at XXXX ',
       'PA and lateral chest x-XXXX XXXX. '], dtype=object)

In [8]:
# see distinct values in comparison column
data['comparison'].unique()[:5]

array(['None.', nan, 'None available', 'XXXX, XXXX',
       'Two views of the chest dated XXXX.'], dtype=object)

In [10]:
# see distinct values in impression column
data['impression'].unique()[:5]

array(['Normal chest x-XXXX.', 'No acute pulmonary findings.',
       'No displaced rib fractures, pneumothorax, or pleural effusion identified. Well-expanded and clear lungs. Mediastinal contour within normal limits. No acute cardiopulmonary abnormality identified.',
       '1. Bullous emphysema and interstitial fibrosis. 2. Probably scarring in the left apex, although difficult to exclude a cavitary lesion. 3. Opacities in the bilateral upper lobes could represent scarring, however the absence of comparison exam, recommend short interval followup radiograph or CT thorax to document resolution.',
       'No acute cardiopulmonary abnormality.'], dtype=object)

Examining the unique values in the columns above gave us a better understanding of what each column contains. Descriptions of the columns are stored in the reference.txt file.

---

## 4. Data Cleaning & Pre-Processing

### a. Drop unimportant rows and columns
Drop rows where both `findings` and `impression` are `NaN`, because such rows have no report text to train on or to compare predicted captions against.

In [None]:
#Drop unnecessary columns: image, comparison and indication
#projection already tells us what type of image it is, so image is unnecessary
data_dropped_cols = data.drop(columns=['image','comparison','indication'])

#Drop rows with NaNs on both findings and impressions
#.copy() gives an independent dataframe, so later column assignments do not act on a slice of data
data_dropped_nans = data_dropped_cols.dropna(subset=['findings','impression'], how='all').copy()
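
To make the effect of `how='all'` concrete, here is a small sketch on a toy dataframe (the rows are invented for illustration). A row is dropped only when every column in `subset` is missing, whereas `how='any'` would also drop rows that have an impression but no findings.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "findings":   ["clear lungs", np.nan, np.nan],
    "impression": ["normal chest", "no acute findings", np.nan],
})

# Drops only row 2, where both columns are missing
kept_all = toy.dropna(subset=["findings", "impression"], how="all").copy()

# Also drops row 1, which has an impression but no findings
kept_any = toy.dropna(subset=["findings", "impression"], how="any").copy()

print(list(kept_all.index), list(kept_any.index))
```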

### b. Fuse the findings and impression columns to get richer outcome variable `caption`

In [None]:
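#Append the impression to findings to make new caption column
#astype('str') turns a missing value into the string 'nan'; it is stripped during cleaning in section c
data_dropped_nans['caption'] = data_dropped_nans[['findings','impression']].astype('str').agg(' '.join, axis=1)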

print(f"Dropped {len(data)-len(data_dropped_nans)} rows from {len(data)} rows. New row count: {len(data_dropped_nans)}")

Dropped 40 rows from 7466 rows. New row count: 7426


In [None]:
data_dropped_nans.head()

Unnamed: 0,uid,MeSH,Problems,findings,impression,filename,projection,caption
0,1,normal,normal,The cardiac silhouette and mediastinum size ar...,Normal chest x-XXXX.,1_IM-0001-4001.dcm.png,Frontal,The cardiac silhouette and mediastinum size ar...
1,1,normal,normal,The cardiac silhouette and mediastinum size ar...,Normal chest x-XXXX.,1_IM-0001-3001.dcm.png,Lateral,The cardiac silhouette and mediastinum size ar...
2,2,Cardiomegaly/borderline;Pulmonary Artery/enlarged,Cardiomegaly;Pulmonary Artery,Borderline cardiomegaly. Midline sternotomy XX...,No acute pulmonary findings.,2_IM-0652-1001.dcm.png,Frontal,Borderline cardiomegaly. Midline sternotomy XX...
3,2,Cardiomegaly/borderline;Pulmonary Artery/enlarged,Cardiomegaly;Pulmonary Artery,Borderline cardiomegaly. Midline sternotomy XX...,No acute pulmonary findings.,2_IM-0652-2001.dcm.png,Lateral,Borderline cardiomegaly. Midline sternotomy XX...
4,3,normal,normal,,"No displaced rib fractures, pneumothorax, or p...",3_IM-1384-1001.dcm.png,Frontal,"nan No displaced rib fractures, pneumothorax, ..."
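
The `nan` prefix in the last caption above comes from casting a missing `findings` value to a string. As an alternative sketch (not the approach used in this notebook), filling missing values with an empty string before concatenating avoids creating that token in the first place. The toy rows below are invented for illustration.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "findings":   ["Clear lungs.", np.nan],
    "impression": ["Normal chest.", "No displaced rib fractures."],
})

# Missing parts become empty strings, and strip() removes the stray separator
toy["caption"] = (
    toy["findings"].fillna("") + " " + toy["impression"].fillna("")
).str.strip()

print(toy["caption"].tolist())
```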


### c. Clean out placeholders, trailing punctuation and excess whitespace

In [None]:
### MESH AND PROBLEMS COLUMNS
#Clean MeSH and Problems: '/' becomes a space and ';' becomes ','
data_dropped_nans['MeSH'] = data_dropped_nans['MeSH'].str.replace('/',' ', regex=False).str.replace(';',',', regex=False)
data_dropped_nans['Problems'] = data_dropped_nans['Problems'].str.replace('/',' ', regex=False).str.replace(';',',', regex=False)

### CLEAN ALL TEXT COLUMNS EXCEPT FILENAME
cols_to_clean = ['MeSH','Problems','findings','impression','projection','caption']

#Clean out placeholders of the form xx-year-old, xxxx, x-XXX, etc., plus the 'nan' string left by astype('str')
#Note: 'nan' is not anchored to word boundaries, so it is also removed from inside longer words
data_dropped_nans[cols_to_clean] = data_dropped_nans[cols_to_clean].replace(r'[xX]+-?\s?year-?\s?old\s?([xX]+|with|w+)?\s?(with)?|[xX]+ are intact|[xX]+-[xX]+|-[xX]+|[xX]{2,}|nan', '', 
                                              regex=True)

#Remove leading and trailing commas, fullstops, etc.
data_dropped_nans[cols_to_clean] = data_dropped_nans[cols_to_clean].replace(r'(^[.,;/:\s]+)|([.,;/:\s]+$)', '', regex=True)

#Clear excess whitespace between words and punctuation 
data_dropped_nans[cols_to_clean] = data_dropped_nans[cols_to_clean].replace(r'\s+([,?.!;"])',r'\1', regex=True)

#Clean out leftover 'male with' / 'female with' fragments
#Note: the pattern has no word boundaries, so ' with' is also removed from 'within' ('size within' becomes 'sizein')
data_dropped_nans[cols_to_clean] = data_dropped_nans[cols_to_clean].replace(r'(male|female)? with','',regex=True)

#Get rid of numbering in impression or findings
data_dropped_nans[cols_to_clean] = data_dropped_nans[cols_to_clean].replace(r'\d+[.]',"", regex=True)

#Inspect one cleaned caption
data_dropped_nans.caption[2]


'Borderline cardiomegaly. Midline sternotomy. Enlarged pulmonary arteries. Clear lungs. Inferior. No acute pulmonary findings'
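
Two of the patterns in the cell above, `nan` and `(male|female)? with`, are not anchored to word boundaries. The dataframe preview in section d shows one consequence: "Heart size within normal limits" appears as "Heart sizein normal limits". The sketch below shows one possible tightening with `\b`. It is an assumed alternative rather than the notebook's code, and `strip_leftovers` is a hypothetical helper name.

```python
import re


def strip_leftovers(text):
    """Remove a stand-alone 'nan' token and leftover '(fe)male with' fragments."""
    # \b keeps 'nan' inside longer words (e.g. 'Dominant') untouched
    text = re.sub(r"\bnan\b", "", text)
    # The trailing \b stops ' with' from matching the start of ' within'
    text = re.sub(r"(?:\b(?:male|female))? with\b", "", text)
    return text.strip()


print(strip_leftovers("nan Heart size within normal limits"))
print(strip_leftovers("Dominant nodule"))
print(strip_leftovers("female with cough"))
```

Applied to the dataframe, the same two patterns could be passed to `DataFrame.replace(..., regex=True)` in place of the unanchored ones.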

### d. Replace missing values

In [None]:
#If MeSH and Problems were both recorded as normal, fill missing findings with 'no unusual findings'
filter_q = (data_dropped_nans.MeSH == 'normal') & (data_dropped_nans.Problems == 'normal')
data_dropped_nans.loc[filter_q, 'findings'] = data_dropped_nans.loc[filter_q, 'findings'].fillna('no unusual findings')


In [None]:
#Replacing the nan values
data_dropped_nans['findings'] = data_dropped_nans['findings'].fillna('no findings')
data_dropped_nans['impression'] = data_dropped_nans['impression'].fillna('no impression')


In [None]:
data_dropped_nans

Unnamed: 0,uid,MeSH,Problems,findings,impression,filename,projection,caption
0,1,normal,normal,The cardiac silhouette and mediastinum size ar...,Normal chest,1_IM-0001-4001.dcm.png,Frontal,The cardiac silhouette and mediastinum size ar...
1,1,normal,normal,The cardiac silhouette and mediastinum size ar...,Normal chest,1_IM-0001-3001.dcm.png,Lateral,The cardiac silhouette and mediastinum size ar...
2,2,"Cardiomegaly borderline,Pulmonary Artery enlarged","Cardiomegaly,Pulmonary Artery",Borderline cardiomegaly. Midline sternotomy. E...,No acute pulmonary findings,2_IM-0652-1001.dcm.png,Frontal,Borderline cardiomegaly. Midline sternotomy. E...
3,2,"Cardiomegaly borderline,Pulmonary Artery enlarged","Cardiomegaly,Pulmonary Artery",Borderline cardiomegaly. Midline sternotomy. E...,No acute pulmonary findings,2_IM-0652-2001.dcm.png,Lateral,Borderline cardiomegaly. Midline sternotomy. E...
4,3,normal,normal,no unusual findings,"No displaced rib fractures, pneumothorax, or p...",3_IM-1384-1001.dcm.png,Frontal,"No displaced rib fractures, pneumothorax, or p..."
...,...,...,...,...,...,...,...,...
7461,3997,"Opacity lung upper lobe right round small,Gran...","Opacity,Granuloma","Heart sizein normal limits. Small, nodular opa...","No acute findings, no evidence for active TB",3997_IM-2048-1002.dcm.png,Lateral,"Heart sizein normal limits. Small, nodular opa..."
7462,3998,normal,normal,no unusual findings,Heart size is normal and the lungs are clear,3998_IM-2048-1001.dcm.png,Frontal,Heart size is normal and the lungs are clear
7463,3998,normal,normal,no unusual findings,Heart size is normal and the lungs are clear,3998_IM-2048-1002.dcm.png,Lateral,Heart size is normal and the lungs are clear
7464,3999,normal,normal,no unusual findings,The cardiac silhouette is normal in size and c...,3999_IM-2049-1001.dcm.png,Frontal,The cardiac silhouette is normal in size and c...


### e. Additional text cleaning
- Case folding
- Fixing unicode
- Replace contractions, etc.

In [None]:
#Get all punctuation except the fullstop
punct_without_fullstop = string.punctuation.replace('.','')

def clean_text(text, clean_all=True, clean_with_clean_txt=False, clean_punc=False,
               clean_contractions=False):
    """
    Function to clean text leveraging the cleantext, string and contractions packages
    
    Input:
        - text (str): Uncleaned report text
        - clean_all (bool): Perform all cleaning operations
        - clean_with_clean_txt (bool): Perform cleaning with clean-text package
        - clean_punc (bool): Remove punctuation except fullstops
        - clean_contractions (bool): Replace contractions with their full words
        
    Output:
        - cleaned_text (str): Cleaned report text
    """

    #Start from the raw text so that every combination of flags returns a string
    cleaned_text = text

    #Perform all cleaning operations
    if clean_all:
        clean_with_clean_txt = True
        clean_punc = True
        clean_contractions = True

    #Use clean-text package to fix unicode, case fold, etc.
    if clean_with_clean_txt:
        cleaned_text = clean(cleaned_text,
                        fix_unicode=True, # fix various unicode errors
                        to_ascii=True,    # transliterate to closest ASCII representation
                        lower=True,       # lowercase text
                        no_line_breaks=True, # fully strip line breaks
                        no_urls=True,      # replace all URLs with a special token
                        no_emails=True,   # replace all email addresses with a special token
                        no_phone_numbers=True, # replace all phone numbers with a special token
                        no_currency_symbols=True, # replace all currency symbols with a special token
                        )
    
    if clean_punc:
        #Remove punctuation except fullstops
        #We don't remove fullstops because they help separate sentences
        cleaned_text = cleaned_text.translate(str.maketrans('', '', punct_without_fullstop))

    if clean_contractions:
        #Replace contractions with full words
        cleaned_text = contractions.fix(cleaned_text)

    return cleaned_text
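
The punctuation step in `clean_text` relies on `str.translate` with a deletion table. The standard-library sketch below isolates that step on two invented sample sentences (`punct_to_strip` is an illustrative name) and shows a side effect worth knowing about: hyphens are deleted rather than replaced with a space, so hyphenated words are fused.

```python
import string

# Every ASCII punctuation character except the full stop
punct_to_strip = string.punctuation.replace(".", "")
strip_table = str.maketrans("", "", punct_to_strip)

sample = "no acute findings; heart size (normal), lungs clear."
hyphenated = "well-expanded lungs."

stripped = sample.translate(strip_table)
fused = hyphenated.translate(strip_table)

print(stripped)
print(fused)
```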

In [None]:
#Perform full clean on the findings, impression and caption columns using the function above
for col in ['findings','impression','caption']:
    data_dropped_nans[col] = data_dropped_nans[col].apply(clean_text)

#Partial cleaning on MeSH and Problems: clean-text and contractions only, punctuation is kept
for col in ['MeSH','Problems']:
    data_dropped_nans[col] = data_dropped_nans[col].apply(clean_text, clean_all=False, clean_with_clean_txt=True,
                                                          clean_punc=False, clean_contractions=True)


In [None]:
#Random sample from the different columns to visualize output of cleaning process
for col in ['MeSH','Problems','findings','impression','caption']:
    print(col.capitalize())
    print('------------------')
    for idx,txt in enumerate(data_dropped_nans[col].sample(5)):
        print(str(idx+1)+')',txt)
    print('\n')

Mesh
------------------
1) pleural effusion right small,pulmonary atelectasis base right scattered,surgical instruments
2) technical quality of image unsatisfactory,cardiomegaly severe
3) normal
4) opacity lung base left patchy,lucency diaphragm right,pleural effusion left small,atherosclerosis aorta,arthritis,pulmonary atelectasis base left,pneumonia base left
5) lung, hyperlucent,lung hyperdistention,technical quality of image unsatisfactory


Problems
------------------
1) normal
2) catheters, indwelling,thoracic vertebrae
3) granuloma
4) normal
5) aorta,aortic aneurysm,cicatrix,spondylosis,aorta,medical device


Findings
------------------
1) heart size moderately enlarged stable mediastinal contours. lateral view curvilinear densities over the heart suggestive of coronary artery stents. diaphragm eventration. no focal alveolar consolidation no definite pleural effusion seen. no typical findings of pulmonary edema
2) there is chronic asymmetric elevation of the right hemidiaphragm.

## 5. Data Exportation

### a. Split Dataset into Frontal and Lateral, and Train and Test

In [None]:
# Split into frontal and lateral datasets
frontal_df = data_dropped_nans.query("projection == 'Frontal' ")
lateral_df = data_dropped_nans.query("projection == 'Lateral' ")

print(frontal_df.shape[0], lateral_df.shape[0])

3794 3632


In [None]:
#Split into train and test set for frontal dataset
frontal_train,frontal_test = train_test_split(frontal_df, test_size=0.25, random_state=42, shuffle=True)

#Split into train and test set for lateral dataset
lateral_train,lateral_test = train_test_split(lateral_df, test_size=0.25, random_state=1, shuffle=True)

print('frontal train size:',frontal_train.shape[0], 'frontal test size:',frontal_test.shape[0])
print('lateral train size:',lateral_train.shape[0], 'lateral test size:',lateral_test.shape[0])

frontal train size: 2845 frontal test size: 949
lateral train size: 2724 lateral test size: 908
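
`train_test_split` above shuffles individual image rows. If a `uid` ever contributes more than one image of the same projection, rows from the same report could land in both the train and the test set. Whether that happens in this dataset is not checked here, so the following is only a sketch of a group-aware alternative using scikit-learn's `GroupShuffleSplit`, run on an invented toy dataframe.

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Toy data: uids 1 and 3 each have two images
toy = pd.DataFrame({
    "uid":      [1, 1, 2, 3, 3, 4, 5, 6],
    "filename": [f"img_{i}.png" for i in range(8)],
})

# test_size is the share of groups (uids), not rows, placed in the test set
splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=42)
train_idx, test_idx = next(splitter.split(toy, groups=toy["uid"]))

train_df, test_df = toy.iloc[train_idx], toy.iloc[test_idx]
shared_uids = set(train_df["uid"]) & set(test_df["uid"])

print(len(train_df), len(test_df), shared_uids)
```

Because whole groups are assigned to one side, the row counts of the two sets can deviate from an exact 75/25 split.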


### b. Write dataframes to pickles

We store the dataframes as pickles because they are quick to write and reload, and they keep the column dtypes and index intact.

In [None]:
#Write full cleaned dataset to pickle
data_dropped_nans.to_pickle('../data/interim/full_cleaned_data.pickle')

#Write full frontal and laterals datasets to pickle
frontal_df.to_pickle('../data/interim/full_frontal.pickle')
lateral_df.to_pickle('../data/interim/full_lateral.pickle')

#Write train datasets to pickle
frontal_train.to_pickle('../data/train/frontal_train.pickle')
lateral_train.to_pickle('../data/train/lateral_train.pickle')

#Write test datasets to pickle
frontal_test.to_pickle('../data/test/frontal_test.pickle')
lateral_test.to_pickle('../data/test/lateral_test.pickle')
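
Downstream notebooks can load these files back with `pd.read_pickle`. The round trip below is a small sketch on an invented toy dataframe, written to a temporary directory rather than to the project's `data` folders.

```python
import os
import tempfile

import pandas as pd

toy = pd.DataFrame({
    "uid": [1, 2],
    "caption": ["normal chest", "no acute findings"],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy.pickle")
    toy.to_pickle(path)
    # Values, dtypes and index come back unchanged
    restored = pd.read_pickle(path)

print(restored.equals(toy))
```

Pickle files can execute code when loaded, so only pickles from a trusted source should be read this way.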
