# Tutorial on Processing the CheXpertPlus Dataset

Due to the [license](https://stanfordaimi.azurewebsites.net/datasets/5158c524-d3ab-4e02-96e9-6ee9efc110a1) of the CheXpertPlus dataset, we are unable to redistribute the original chest X-ray (CXR) images and detailed radiological reports.

In this tutorial, we will show how to access and process the CheXpertPlus dataset, and link the CXRs with the textual data in our ICG-CXR dataset.

## 1. Accessing the CheXpertPlus Dataset

The homepage of CheXpertPlus is at: https://stanfordaimi.azurewebsites.net/datasets/5158c524-d3ab-4e02-96e9-6ee9efc110a1

After logging in, you can download this dataset using `AzCopy` or `Azure Storage Explorer`. The downloaded file structure should look like:
```
chexpertplus_data
├── chexbert_labels.zip
├── df_chexpert_plus_240401.csv
├── radgraph-XL-annotations.zip
├── DICOM
└── PNG
    ├── train
    │   ├── patient00001
    │   │   ├── study1
    │   │   │   ├── view1_frontal.png
    │   │   │   └── ...
    │   │   ├── study2
    │   │   └── ...
    │   ├── patient00002
    │   └── ...
    └── valid
```
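
Before moving on, it can help to confirm that the download is complete. The following is a minimal sketch, assuming the top-level entries match the tree above; the helper name `check_chexpertplus_layout` and the two constants are hypothetical and not part of any CheXpertPlus tooling.

```python
import tempfile
from pathlib import Path

# Entries expected under the download root, taken from the directory tree above.
EXPECTED_FILES = (
    "chexbert_labels.zip",
    "df_chexpert_plus_240401.csv",
    "radgraph-XL-annotations.zip",
)
EXPECTED_DIRS = ("DICOM", "PNG", "PNG/train", "PNG/valid")


def check_chexpertplus_layout(root):
    """Return the expected files and folders that are missing under `root`."""
    root = Path(root)
    missing = [name for name in EXPECTED_FILES if not (root / name).is_file()]
    missing += [name for name in EXPECTED_DIRS if not (root / name).is_dir()]
    return missing


# Demonstration on a temporary directory that only contains PNG/train.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "PNG" / "train").mkdir(parents=True)
    demo_missing = check_chexpertplus_layout(tmp)
print(demo_missing)
```

On a real download, calling `check_chexpertplus_layout('/your/path/to/chexpertplus_data')` should return an empty list if everything in the tree is present.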

## 2. Retrieving CXR Data from the CheXpertPlus Dataset

Next, we link the CXRs from the CheXpertPlus dataset to the CheXpertPlus study pairs in our ICG-CXR dataset. Before doing so, let's import the necessary libraries and define a few helper functions:

In [1]:
import os
import sys
import json
import shutil
import random
import warnings
import itertools
import numpy as np
import pandas as pd
from tqdm import tqdm
from collections import defaultdict
import matplotlib.pyplot as plt


def read_json(json_path, encoding='utf-8'):
    with open(json_path, 'r', encoding=encoding) as f:
        data = json.load(f)
    return data

def write_json(data, json_path, write_mode='w', encoding='utf-8', ensure_ascii=False):
    with open(json_path, write_mode, encoding=encoding) as f:
        json.dump(data, f, indent=4, ensure_ascii=ensure_ascii)


def show_multiple_images(
    images, 
    nrows=1, 
    ncols=None, 
    titles=None, 
    suptitle=None, 
    tight=True, 
    cmaps='gray', 
    figsize=None, 
    dpi=None, 
    set_axis_off=True,
):
    """Show a list of images (tensors, arrays, or PIL images) on a grid of subplots."""
    num_imgs = len(images)
    ncols = num_imgs // nrows if ncols is None else ncols
    if ncols * nrows < num_imgs:
        ncols += 1
    num_plots = int(nrows * ncols)
    cmaps = [cmaps] * num_imgs if not isinstance(cmaps, (tuple, list)) else cmaps
    titles = [titles] * num_imgs if not isinstance(titles, (tuple, list)) else titles
    assert num_imgs <= num_plots, f'num_imgs = {num_imgs}, nrows = {nrows}, ncols = {ncols}.'
    fig, axes = plt.subplots(nrows, ncols, squeeze=False, figsize=figsize, dpi=dpi)
    axes = axes.flatten()
    for i in range(num_imgs):
        strtype = str(type(images[i]))
        if 'torch.Tensor' in strtype:
            img = images[i].cpu().squeeze()
        elif 'numpy.ndarray' in strtype:
            img = images[i].squeeze()
        elif 'Image.Image' in strtype:
            img = np.array(images[i]).squeeze()
        else:
            img = images[i]
        
        axes[i].imshow(img, cmap=cmaps[i])
        if set_axis_off:
            axes[i].axis('off')
        # `titles` is always a list at this point, so check the individual entry
        if titles[i] is not None:
            axes[i].set_title(titles[i])
    
    if suptitle is not None:
        fig.suptitle(suptitle)
    
    if num_imgs < num_plots:
        for i in range(num_imgs, num_plots):
            axes[i].axis('off')
            
    if tight:
        fig.tight_layout()
    

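def find_files_recursively(directory, ext='', inclusions='', exclusions=None):
    """Return the sorted paths of all files under `directory` whose names end with `ext`,
    contain every string in `inclusions`, and contain none of the strings in `exclusions`."""
    matched_paths = []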
    
    # Ensure inclusions and exclusions are lists
    if not isinstance(inclusions, (list, tuple)):
        inclusions = [inclusions]
    if exclusions is not None and not isinstance(exclusions, (list, tuple)):
        exclusions = [exclusions]

    if not os.path.exists(directory):
        raise ValueError(f'Path does not exist: {directory}.')

    for root, dirs, files in os.walk(directory):
        filtered_dirs = []
        
        for d in dirs:
            full_path = os.path.join(root, d)
            # Only treat it as a file if it's actually a file
            if d.endswith('.nii.gz') and os.path.isfile(full_path):
                files.append(d)  # Treat it as a file
            else:
                filtered_dirs.append(d)  # Keep directories that are actually directories

        dirs[:] = filtered_dirs  # Modify dirs in place to control recursion

        for file in files:
            cond1 = all(s in file for s in inclusions)
            cond2 = file.endswith(ext)
            cond3 = all(s not in file for s in exclusions) if exclusions else True

            if cond1 and cond2 and cond3:
                matched_paths.append(os.path.join(root, file))

    return sorted(matched_paths)

### 2.1. Know the ICG-CXR Data

We will use the JSON files in the ICG-CXR (CheXpertPlus Ext.) dataset to retrieve the CXR images from the CheXpertPlus dataset.

So, first things first, let's take a look at a JSON file in the ICG-CXR dataset.

In [2]:
icgcxr_chexp_dir = './chexpertplus'

# Recursively find all JSON files in the ICG-CXR (CheXpertPlus Ext.) directory
icgcxr_chexp_meta_paths = find_files_recursively(icgcxr_chexp_dir, ext='.json')

# The patient ID is the third-from-last component of each JSON path
icgcxr_chexp_patients = [_.split('/')[-3] for _ in icgcxr_chexp_meta_paths]
icgcxr_chexp_patients = sorted(set(icgcxr_chexp_patients))
print(f"Number of JSON files in ICG-CXR (CheXpertPlus Ext.): {len(icgcxr_chexp_meta_paths)}")
print(f"Number of patients in ICG-CXR (CheXpertPlus Ext.): {len(icgcxr_chexp_patients)}")

Number of JSON files in ICG-CXR (CheXpertPlus Ext.): 3655
Number of patients in ICG-CXR (CheXpertPlus Ext.): 2498
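
The cell above takes the patient ID from `split('/')[-3]`, which relies on `/` being the path separator. A minimal sketch of a separator-agnostic variant, assuming the same `<patient>/<study pair>/<file>.json` nesting, is shown here; the helper name and the demo paths (`pair_a`, `meta.json`) are hypothetical.

```python
from pathlib import Path


def patient_id_from_meta_path(meta_path):
    """Return the third-from-last path component, like `split('/')[-3]`."""
    return Path(meta_path).parts[-3]


# Made-up paths that only mimic the assumed nesting.
demo_paths = [
    "chexpertplus/patient00001/pair_a/meta.json",
    "chexpertplus/patient00001/pair_b/meta.json",
    "chexpertplus/patient00002/pair_a/meta.json",
]
demo_patients = sorted({patient_id_from_meta_path(p) for p in demo_paths})
print(demo_patients)
```

Because `Path` splits on whatever separator the operating system uses, the same helper also works on paths returned by `os.walk` on Windows.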


In [3]:
data_example = read_json(icgcxr_chexp_meta_paths[0])
data_example

{'changes-of-findings': 'There is blunting of the left costophrenic angle, which was previously sharp, indicating the presence of a trace pleural effusion.',
 'progression-description': 'Mild pleural effusion has developed in the left lower lung.',
 'comment': {'confidence': 5,
  'reason': 'Clear and consistent observations of stable findings with a newly noted mild blunting of the left costophrenic angle.'},
 'reference-report': {'findings': 'Due to the CheXpertPlus license, we are unable to redistribute the CXR data. Please see `process_chexpertplus.ipynb` for retrieving these data from the CheXpertPlus dataset.',
  'impression': 'Due to the CheXpertPlus license, we are unable to redistribute the CXR data. Please see `process_chexpertplus.ipynb` for retrieving these data from the CheXpertPlus dataset.'},
 'followup-report': {'findings': 'Due to the CheXpertPlus license, we are unable to redistribute the CXR data. Please see `process_chexpertplus.ipynb` for retrieving these data from 
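
As the example shows, the `reference-report` and `followup-report` fields ship with a placeholder sentence instead of the report text. A minimal sketch for detecting whether a JSON file still carries that placeholder might look like this; the helper name `has_placeholder_reports` and the prefix constant are illustrative assumptions, not part of the ICG-CXR tooling.

```python
# Assumption: every placeholder starts with this fragment, as in the example above.
PLACEHOLDER_PREFIX = "Due to the CheXpertPlus license"


def has_placeholder_reports(meta):
    """Return True if any report section in `meta` still holds the placeholder text."""
    for report_key in ("reference-report", "followup-report"):
        report = meta.get(report_key, {})
        for section in ("findings", "impression"):
            if str(report.get(section, "")).startswith(PLACEHOLDER_PREFIX):
                return True
    return False


# Demonstration on two hand-written dictionaries with made-up report text.
demo_unfilled = {
    "reference-report": {"findings": PLACEHOLDER_PREFIX + ", ...", "impression": "..."},
    "followup-report": {"findings": "...", "impression": "..."},
}
demo_filled = {
    "reference-report": {"findings": "Example findings.", "impression": "Example impression."},
    "followup-report": {"findings": "Example findings.", "impression": "Example impression."},
}
print(has_placeholder_reports(demo_unfilled), has_placeholder_reports(demo_filled))
```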

### 2.2. Retrieve CXR Images

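Now let's retrieve the CXRs: for each JSON file, we copy the reference and follow-up images from the CheXpertPlus `PNG` folder into the directory containing that JSON file.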

In [None]:
chexp_png_root = '/your/path/to/chexpertplus_data'
chexp_png_dir = os.path.join(chexp_png_root, 'PNG')

for path in tqdm(icgcxr_chexp_meta_paths):
    meta = read_json(path)
    
    # This is the image identifier of the prior image in a pair of consecutive studies:
    ref_img_id = meta['reference-dicom-id']  
    # This is the image identifier of the subsequent image in a pair of consecutive studies:
    flu_img_id = meta['followup-dicom-id']
    
    # Source image paths
    src_ref_img_path = os.path.join(chexp_png_dir, ref_img_id + '.png')
    src_flu_img_path = os.path.join(chexp_png_dir, flu_img_id + '.png')
    
    # Target image paths
    tgt_ref_img_path = path.replace('.json', '-ref-init.png')
    tgt_flu_img_path = path.replace('.json', '-flu-init.png')
    
    # Retrieve the CXRs by copying them next to the JSON file
    shutil.copy(src_ref_img_path, tgt_ref_img_path)
    shutil.copy(src_flu_img_path, tgt_flu_img_path)
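
If the download is incomplete, `shutil.copy` stops at the first missing image. One option is a dry run that lists missing source files before anything is copied. The sketch below assumes the same key names and target suffixes as the loop above; the helper names `plan_image_copies` and `find_missing_sources` are hypothetical.

```python
import os
import tempfile


def plan_image_copies(meta, meta_path, png_dir):
    """Return (source, target) path pairs for the two images of one JSON file."""
    pairs = []
    for key, suffix in (
        ("reference-dicom-id", "-ref-init.png"),
        ("followup-dicom-id", "-flu-init.png"),
    ):
        src = os.path.join(png_dir, meta[key] + ".png")
        # splitext only strips the final extension of the JSON path
        tgt = os.path.splitext(meta_path)[0] + suffix
        pairs.append((src, tgt))
    return pairs


def find_missing_sources(pairs):
    """Return the source paths that do not exist on disk."""
    return [src for src, _ in pairs if not os.path.isfile(src)]


# Demonstration: only the reference image exists in a temporary PNG folder.
with tempfile.TemporaryDirectory() as tmp:
    png_dir = os.path.join(tmp, "PNG")
    ref_id = "train/patient00001/study1/view1_frontal"
    flu_id = "train/patient00001/study2/view1_frontal"
    os.makedirs(os.path.join(png_dir, os.path.dirname(ref_id)))
    open(os.path.join(png_dir, ref_id + ".png"), "wb").close()
    demo_meta = {"reference-dicom-id": ref_id, "followup-dicom-id": flu_id}
    demo_pairs = plan_image_copies(demo_meta, os.path.join(tmp, "pair.json"), png_dir)
    demo_missing = find_missing_sources(demo_pairs)
print(demo_missing)
```

Collecting the missing paths over all JSON files first makes it easier to tell a wrong `chexp_png_dir` (everything missing) from a partial download (a few files missing).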


### 2.3. Retrieve CXR Reports (optional)

Since the ICG-CXR dataset directly provides the disease progression prompts and the radiological differences between the two consecutive CXR images in a study pair, this step is optional.

However, if you would like to work with the original text data in the CheXpertPlus dataset, you can run the following code to retrieve it.

In [None]:
chexp_report_path = os.path.join(chexp_png_root, 'df_chexpert_plus_240401.csv')
chexp_reports = pd.read_csv(chexp_report_path, usecols=['path_to_dcm', 'section_findings', 'section_impression'])

# We need to reorganize the `chexp_reports`, so we can use the image identifier to find the corresponding report.
chexp_reports_reorg = {}
for idx, row in chexp_reports.iterrows():
    path_to_dcm = row['path_to_dcm']
    # `path_to_dcm` is a string like `train/patient42142/study5/view1_frontal.dcm`
    img_identifier = path_to_dcm.replace('.dcm', '')
    chexp_reports_reorg[img_identifier] = {
        'findings': row['section_findings'],
        'impression': row['section_impression']
    }

# We then can retrieve the textual data.
for path in tqdm(icgcxr_chexp_meta_paths):
    meta = read_json(path)
    
    ref_img_id = meta['reference-dicom-id']
    ref_findings = str(chexp_reports_reorg[ref_img_id]['findings'])
    ref_impression = str(chexp_reports_reorg[ref_img_id]['impression'])
    
    flu_img_id = meta['followup-dicom-id']
    flu_findings = str(chexp_reports_reorg[flu_img_id]['findings'])
    flu_impression = str(chexp_reports_reorg[flu_img_id]['impression'])
    
    meta['reference-report'] = {'findings': ref_findings, 'impression': ref_impression}
    meta['followup-report'] = {'findings': flu_findings, 'impression': flu_impression}
    
    write_json(meta, path)
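
Note that `str()` turns a missing CSV cell (NaN) into the literal string `'nan'`. If empty strings are preferred, the lookup can be built with an explicit `fillna('')`, and without `iterrows`. This is a minimal sketch of that alternative, assuming the same three CSV columns; `build_report_lookup` is a hypothetical helper and the demo report text is made up.

```python
import pandas as pd


def build_report_lookup(reports):
    """Map image identifiers (`path_to_dcm` without '.dcm') to report sections."""
    cols = ["path_to_dcm", "section_findings", "section_impression"]
    df = reports[cols].fillna("")  # missing sections become empty strings
    df = df.rename(
        columns={"section_findings": "findings", "section_impression": "impression"}
    )
    df.index = df["path_to_dcm"].str.replace(".dcm", "", regex=False)
    # Keep the last row per identifier, mirroring dict overwrite semantics
    df = df[~df.index.duplicated(keep="last")]
    return df[["findings", "impression"]].to_dict(orient="index")


# Demonstration on a tiny in-memory table with one missing findings section.
demo_reports = pd.DataFrame(
    {
        "path_to_dcm": [
            "train/patient00001/study1/view1_frontal.dcm",
            "train/patient00001/study2/view1_frontal.dcm",
        ],
        "section_findings": ["Example findings.", float("nan")],
        "section_impression": ["Example impression.", "Another impression."],
    }
)
demo_lookup = build_report_lookup(demo_reports)
print(demo_lookup["train/patient00001/study2/view1_frontal"])
```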
    

## 3. Image Registration

The registration is performed using the SimpleITK library. The code is somewhat complex and tedious, and it is time-consuming unless run with multiple processes, so we do not include it in this Jupyter notebook. Please see `register_chexpertplus.py` for details.

However, some images may be poorly registered and need to be fixed manually. In this case, we recommend using a Jupyter notebook to fix those images interactively.
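
One common way to spot a poor registration by eye is a checkerboard overlay of the two images. The sketch below is a generic NumPy illustration and is not taken from `register_chexpertplus.py`; the function name `checkerboard_overlay` and its `tiles` parameter are assumptions made for this example.

```python
import numpy as np


def checkerboard_overlay(fixed, moving, tiles=8):
    """Interleave square tiles from two same-shaped 2D images.

    Misaligned structures show up as breaks at the tile borders.
    """
    fixed = np.asarray(fixed)
    moving = np.asarray(moving)
    if fixed.ndim != 2 or fixed.shape != moving.shape:
        raise ValueError("Both images must be 2D arrays of the same shape.")
    rows, cols = fixed.shape
    tile_h = max(1, rows // tiles)
    tile_w = max(1, cols // tiles)
    row_idx = (np.arange(rows) // tile_h)[:, None]
    col_idx = (np.arange(cols) // tile_w)[None, :]
    use_moving = (row_idx + col_idx) % 2 == 1  # alternate tiles take `moving`
    return np.where(use_moving, moving, fixed)


# Demonstration on two constant 4x4 images with 2x2 tiles.
demo_overlay = checkerboard_overlay(np.zeros((4, 4)), np.ones((4, 4)), tiles=2)
print(demo_overlay)
```

The returned array can be passed to `show_multiple_images` alongside the two input images for a side-by-side check.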

For more information, please contact clma24@m.fudan.edu.cn