# CXR-ML-GZSL

## Overview

This notebook aims to reproduce the findings of the paper "Multi-Label Generalized Zero Shot Learning for the Classification of Disease in Chest Radiographs" using the code provided with it.

* Paper: https://arxiv.org/abs/2107.06563
* Code: https://github.com/nyuad-cai/CXR-ML-GZSL/

The provided code is four years old, so I made some changes to resolve deprecation warnings and errors. I also cleaned up some imports and whitespace and adapted the code to run in a Jupyter notebook. In all other respects, my goal was to use the code as is.

In [1]:
import os
import urllib.request
import hashlib
import tarfile

## Dataset

The paper uses a dataset introduced in an earlier paper. The dataset was initially known as `ChestX-ray8` and was renamed `ChestX-ray14` when it was expanded from eight to fourteen distinct disease labels.

* Paper: https://arxiv.org/abs/1705.02315
* Dataset: https://nihcc.app.box.com/v/ChestXray-NIHCC/folder/36938765345

The dataset link includes code for downloading the image archives, as well as the expected MD5 checksum for each archive. I use both below.
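
The download cell hashes each archive by reading the whole file into memory in one call. Since the archives are large, a streaming hash may be gentler on memory. The following is a minimal sketch, not part of the original code: `md5_of_file` and its `chunk_size` parameter are hypothetical names introduced here for illustration.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of the file at `path`, reading it in chunks.

    Hypothetical helper for illustration; not part of the original code.
    """
    digest = hashlib.md5()
    with open(path, 'rb') as f:
        # iter() with a sentinel keeps calling read() until it returns b'' at EOF
        for chunk in iter(lambda: f.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest()
```

In the download loop, `md5_of_file(fn)` could stand in for `hashlib.md5(f.read()).hexdigest()`; both produce the same digest.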

In [2]:
DATA_PATH = 'data/nih_chest_xrays'

# Create the data directory if it does not already exist
os.makedirs(DATA_PATH, exist_ok=True)

# The data entry CSV must be downloaded manually from the dataset link
if os.path.isfile(f'{DATA_PATH}/Data_Entry_2017_v2020.csv'):
    print("Using existing data entry file")
else:
    raise FileNotFoundError(f"Please download Data_Entry_2017_v2020.csv from the dataset link and place it in {DATA_PATH}")

IMAGE_PATH = f'{DATA_PATH}/images'

if os.path.exists(IMAGE_PATH):
    print("Using existing image data")
else:
    # Credit: https://nihcc.app.box.com/v/ChestXray-NIHCC/file/371647823217
    links = [
        'https://nihcc.box.com/shared/static/vfk49d74nhbxq3nqjg0900w5nvkorp5c.gz',
        'https://nihcc.box.com/shared/static/i28rlmbvmfjbl8p2n3ril0pptcmcu9d1.gz',
        'https://nihcc.box.com/shared/static/f1t00wrtdk94satdfb9olcolqx20z2jp.gz',
        'https://nihcc.box.com/shared/static/0aowwzs5lhjrceb3qp67ahp0rd1l1etg.gz',
        'https://nihcc.box.com/shared/static/v5e3goj22zr6h8tzualxfsqlqaygfbsn.gz',
        'https://nihcc.box.com/shared/static/asi7ikud9jwnkrnkj99jnpfkjdes7l6l.gz',
        'https://nihcc.box.com/shared/static/jn1b4mw4n6lnh74ovmcjb8y48h8xj07n.gz',
        'https://nihcc.box.com/shared/static/tvpxmn7qyrgl0w8wfh9kqfjskv6nmm1j.gz',
        'https://nihcc.box.com/shared/static/upyy3ml7qdumlgk2rfcvlb9k6gvqq2pj.gz',
        'https://nihcc.box.com/shared/static/l6nilvfa9cg3s28tqv1qc1olm3gnz54p.gz',
        'https://nihcc.box.com/shared/static/hhq8fkdgvcari67vfhs7ppg2w6ni4jze.gz',
        'https://nihcc.box.com/shared/static/ioqwiy20ihqwyr8pf4c24eazhh281pbu.gz'
    ]

    # Credit: https://nihcc.app.box.com/v/ChestXray-NIHCC/file/249502714403
    md5_checksums = [
        'fe8ed0a6961412fddcbb3603c11b3698',
        'ab07a2d7cbe6f65ddd97b4ed7bde10bf',
        '2301d03bde4c246388bad3876965d574',
        '9f1b7f5aae01b13f4bc8e2c44a4b8ef6',
        '1861f3cd0ef7734df8104f2b0309023b',
        '456b53a8b351afd92a35bc41444c58c8',
        '1075121ea20a137b87f290d6a4a5965e',
        'b61f34cec3aa69f295fbb593cbd9d443',
        '442a3caa61ae9b64e61c561294d1e183',
        '09ec81c4c31e32858ad8cf965c494b74',
        '499aefc67207a5a97692424cf5dbeed5',
        'dc9fda1757c2de0032b63347a7d2895c'
    ]

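    # Pair each archive link with its expected checksum; archives are numbered from 1
    for idx, (link, expected_md5) in enumerate(zip(links, md5_checksums), start=1):
        fn = os.path.join(DATA_PATH, f'images_{idx:02d}.tar.gz')

        print(f'downloading {fn}...')
        urllib.request.urlretrieve(link, fn)

        # Verify the archive against the published checksum before extracting it
        print(f"Checking MD5 checksum for {fn}...")
        with open(fn, 'rb') as f:
            file_md5 = hashlib.md5(f.read()).hexdigest()

        assert file_md5 == expected_md5, f"Invalid MD5 checksum for {fn}"

        print(f"Extracting {fn}...")
        with tarfile.open(fn, 'r:gz') as tar:
            tar.extractall(path=DATA_PATH)

        # Remove the archive once it has been extracted to free disk space
        print(f"Deleting {fn}...")
        os.remove(fn)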

# Sanity check: the full dataset should contain 112,120 image files
num_images = sum(
    os.path.isfile(os.path.join(IMAGE_PATH, f)) for f in os.listdir(IMAGE_PATH)
)
assert num_images == 112120, "Dataset is not the expected size!"

print("Dataset download complete")

Using existing data entry file
downloading data/nih_chest_xrays/images_01.tar.gz...
Checking MD5 checksum for data/nih_chest_xrays/images_01.tar.gz...
Extracting data/nih_chest_xrays/images_01.tar.gz...
Deleting data/nih_chest_xrays/images_01.tar.gz...
downloading data/nih_chest_xrays/images_02.tar.gz...
Checking MD5 checksum for data/nih_chest_xrays/images_02.tar.gz...
Extracting data/nih_chest_xrays/images_02.tar.gz...
Deleting data/nih_chest_xrays/images_02.tar.gz...
downloading data/nih_chest_xrays/images_03.tar.gz...
Checking MD5 checksum for data/nih_chest_xrays/images_03.tar.gz...
Extracting data/nih_chest_xrays/images_03.tar.gz...
Deleting data/nih_chest_xrays/images_03.tar.gz...
downloading data/nih_chest_xrays/images_04.tar.gz...
Checking MD5 checksum for data/nih_chest_xrays/images_04.tar.gz...
Extracting data/nih_chest_xrays/images_04.tar.gz...
Deleting data/nih_chest_xrays/images_04.tar.gz...
downloading data/nih_chest_xrays/images_05.tar.gz...
Checking MD5 checksum for da