In [6]:
from pathlib import Path
import pandas as pd
import torch
import pydicom

ROOT_DIR = Path("/dfs/scratch1/sabrieyuboglu/data/mimic-cxr-2.0.0.physionet.org")

# MIMIC-CXR Data Overview
Source (copied from): https://mimic-cxr.mit.edu/data/overview/ (the link may not be accessible without access to the dataset)

A records file, cxr-record-list.csv.gz, provides a mapping between the image (dicom_id), the study (study_id), and the patient (subject_id). Another records file, cxr-study-list.csv.gz, provides a mapping between the studies (study_id) and patients (subject_id).

All patient identifiers begin with the digit 1 and have a total length of 8 digits. All study identifiers begin with the digit 5 and have a total length of 8 digits. DICOM file names are unique 40 character hexadecimal strings with dashes separating groups of eight characters.
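
The identifier formats described above can be turned into simple sanity checks. The following is a minimal sketch and not part of any MIMIC-CXR tooling: the pattern and function names are hypothetical, and the regular expressions assume that the stated formats hold for every identifier and that DICOM names use lowercase hexadecimal digits.

```python
import re

# Patterns transcribed from the prose description (assumed to hold for all IDs).
SUBJECT_ID_RE = re.compile(r"1\d{7}")  # 8 digits, starting with 1
STUDY_ID_RE = re.compile(r"5\d{7}")    # 8 digits, starting with 5
# 40 hex characters in five dash-separated groups of eight (lowercase assumed).
DICOM_ID_RE = re.compile(r"[0-9a-f]{8}(?:-[0-9a-f]{8}){4}")


def is_valid_subject_id(value) -> bool:
    """Return True if `value` looks like a MIMIC-CXR patient identifier."""
    return SUBJECT_ID_RE.fullmatch(str(value)) is not None


def is_valid_study_id(value) -> bool:
    """Return True if `value` looks like a MIMIC-CXR study identifier."""
    return STUDY_ID_RE.fullmatch(str(value)) is not None


def is_valid_dicom_id(value) -> bool:
    """Return True if `value` looks like a MIMIC-CXR DICOM file name (no extension)."""
    return DICOM_ID_RE.fullmatch(str(value)) is not None
```

Checks like these could be applied to the ID columns of the record list before joining tables, to catch IDs that were mangled when a CSV was re-saved.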

Images are provided in DICOM format; see the image section for more information about the images.

Reports are provided as plain text files; see the reports section for more information about the reports.

## Data Organization
Data files are made available in a hierarchical structure. The following block lists the first patient’s records as a demonstrative example (MIMIC-CXR v2.0.0):
```
files/
 p10/
   p10000032/
    s50414267/
      02aa804e-bde0afdd-112c0b34-7bc16630-4e384014.dcm.gz
      174413ec-4ec4c1f7-34ea26b7-c5f994f8-79ef1962.dcm.gz
    s53189527/
      2a2277a9-b0ded155-c0de8eb9-c124d10e-82c5caab.dcm.gz
      e084de3b-be89b11e-20fe3f9f-9c8d8dfe-4cfd202c.dcm.gz
    s53911762/
      68b5c4b1-227d0485-9cc38c3f-7b84ab51-4b472714.dcm.gz
      fffabebf-74fd3a1f-673b6b41-96ec0ac9-2ab69818.dcm.gz
    s56699142/
      ea030e7a-2e3b1346-bc518786-7a8fd698-f673b44c.dcm.gz
    s50414267.txt
    s53189527.txt
    s53911762.txt
    s56699142.txt
 ...
```
You will note a high-level folder: p10. This is done to avoid having many files in a single directory. All patient folders are stored in a higher-level folder named after the first 3 characters of their folder name, i.e. p10000032 will be in folder p10, p11000011 will be in p11, and so on.

Above, this patient (10000032) has four studies. Most of the studies have two images (usually a frontal and a lateral chest x-ray), but one study (56699142) has only one image. Each study is associated with a de-identified free-text radiology report (e.g. s56699142.txt). Note that the identifiers are random and do not indicate the order of the studies in any way.
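
The directory layout described above can be reproduced from the identifiers alone. This is a minimal sketch under stated assumptions: the helper names (`study_dir`, `dicom_path`, `report_path`) are hypothetical, paths are relative to the dataset root, and the `.dcm.gz` suffix is taken from the example listing, so it may need changing if the local copy stores uncompressed files.

```python
from pathlib import PurePosixPath


def study_dir(subject_id: int, study_id: int) -> PurePosixPath:
    """Directory holding the images of one study, relative to the dataset root."""
    patient = f"p{subject_id}"
    # The top-level folder is the first 3 characters of the patient folder name.
    return PurePosixPath("files") / patient[:3] / patient / f"s{study_id}"


def dicom_path(subject_id: int, study_id: int, dicom_id: str,
               suffix: str = ".dcm.gz") -> PurePosixPath:
    """Path of one image; `suffix` is an assumption based on the listing above."""
    return study_dir(subject_id, study_id) / f"{dicom_id}{suffix}"


def report_path(subject_id: int, study_id: int) -> PurePosixPath:
    """Path of the free-text report, stored next to the study directory."""
    return study_dir(subject_id, study_id).parent / f"s{study_id}.txt"
```

For example, `report_path(10000032, 56699142)` yields `files/p10/p10000032/s56699142.txt`, which can be joined onto `ROOT_DIR` to locate the file on disk.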


https://physionet.org/content/mimic-cxr-jpg/2.0.0/

In [21]:
# CSV mapping each `study_id` onto the `dicom_id` and the path where the DICOM lives   
dicom_df = pd.read_csv(ROOT_DIR / "cxr-record-list.csv")
dicom_df.head()

Unnamed: 0,subject_id,study_id,dicom_id,path
0,10000032,50414267,02aa804e-bde0afdd-112c0b34-7bc16630-4e384014,files/p10/p10000032/s50414267/02aa804e-bde0afd...
1,10000032,50414267,174413ec-4ec4c1f7-34ea26b7-c5f994f8-79ef1962,files/p10/p10000032/s50414267/174413ec-4ec4c1f...
2,10000032,53189527,2a2277a9-b0ded155-c0de8eb9-c124d10e-82c5caab,files/p10/p10000032/s53189527/2a2277a9-b0ded15...
3,10000032,53189527,e084de3b-be89b11e-20fe3f9f-9c8d8dfe-4cfd202c,files/p10/p10000032/s53189527/e084de3b-be89b11...
4,10000032,53911762,68b5c4b1-227d0485-9cc38c3f-7b84ab51-4b472714,files/p10/p10000032/s53911762/68b5c4b1-227d048...


In [22]:
# CSV mapping each `study_id` onto the path where the study report lives   
study_df = pd.read_csv(ROOT_DIR / "cxr-study-list.csv")
study_df.head()

Unnamed: 0,subject_id,study_id,path
0,10000032,50414267,files/p10/p10000032/s50414267.txt
1,10000032,53189527,files/p10/p10000032/s53189527.txt
2,10000032,53911762,files/p10/p10000032/s53911762.txt
3,10000032,56699142,files/p10/p10000032/s56699142.txt
4,10000764,57375967,files/p10/p10000764/s57375967.txt


## Structured labels
Source (copied from): https://mimic-cxr.mit.edu/data/overview/ (the link may not be accessible without access to the dataset)

The mimic-cxr-2.0.0-chexpert.csv.gz and mimic-cxr-2.0.0-negbio.csv.gz files are compressed comma-separated value files. A total of 227,827 studies are assigned a label by CheXpert and NegBio. Eight studies could not be labeled due to a lack of a findings or impression section. The first two columns are:

`subject_id` - An integer unique for an individual patient  
`study_id` - An integer unique for an individual study (i.e. an individual radiology report with one or more images associated with it)

The remaining columns are labels as presented in the CheXpert article [8]:
```
Atelectasis
Cardiomegaly
Consolidation
Edema
Enlarged Cardiomediastinum
Fracture
Lung Lesion
Lung Opacity
Pleural Effusion
Pneumonia
Pneumothorax
Pleural Other
Support Devices
No Finding
```
Note that "No Finding" is the absence of any of the 13 descriptive labels and a check that the text does not mention a specified set of other common findings beyond those covered by the descriptive labels. Thus, it is possible for a study in the CheXpert set to have no labels assigned. For example, study 57321224 has the following findings/impression text: "Hyperinflation.  No evidence of acute disease.". Normally this would be assigned a label of "No Finding", but the use of "hyperinflation" suppresses the labeling of no finding. For details see the CheXpert article [8]; the list of phrases is publicly available in their code repository (phrases/mention/no_finding.txt). There are 2,414 studies which do not have a label assigned by CheXpert. Conversely, all studies present in the provided files have been assigned a label by NegBio.
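
Studies with no label assigned are the rows in which every label column is empty. The sketch below shows one way to find them on a small made-up table; the data, the IDs, and the name `ID_COLS` are hypothetical, and `Unnamed: 0` is included only on the assumption that a re-saved CSV may carry an extra index column.

```python
import pandas as pd

# Toy stand-in for the CheXpert label table (values are made up).
toy_labels = pd.DataFrame({
    "subject_id": [10000001, 10000002, 10000003],
    "study_id": [50000001, 50000002, 50000003],
    "No Finding": [1.0, None, None],
    "Pneumonia": [None, None, -1.0],
})

# Columns that are identifiers rather than labels (assumption, see lead-in).
ID_COLS = {"subject_id", "study_id", "Unnamed: 0"}
label_cols = [c for c in toy_labels.columns if c not in ID_COLS]

# A study is unlabeled when every label column is missing (NaN).
is_unlabeled = toy_labels[label_cols].isna().all(axis=1)
unlabeled_ids = toy_labels.loc[is_unlabeled, "study_id"].tolist()
```

Applied to the real CheXpert table, the sum of the same mask could be compared against the 2,414 figure quoted above.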

Each label column contains one of four values: 1.0, -1.0, 0.0, or missing. These labels have the following interpretation:

`1.0` - The label was positively mentioned in the associated study, and is present in one or more of the corresponding images
e.g. "A large pleural effusion".  
`0.0` - The label was negatively mentioned in the associated study, and therefore should not be present in any of the corresponding images. 
e.g. "No pneumothorax."  
`-1.0` - The label was either: (1) mentioned with uncertainty in the report, and therefore may or may not be present to some degree in the corresponding image, or (2) mentioned with ambiguous language in the report and it is unclear if the pathology exists or not. 

Explicit uncertainty: "The cardiac size cannot be evaluated."
Ambiguous language: "The cardiac contours are stable."

`Missing` (empty element) - No mention of the label was made in the report
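
The four-way encoding above can be made explicit with a small lookup. This is a minimal sketch: the function name `interpret_label` and the category names are hypothetical, and any value other than the four listed raises a `KeyError`.

```python
import math

# Human-readable names for the three numeric codes (names are illustrative).
LABEL_MEANING = {1.0: "positive", 0.0: "negative", -1.0: "uncertain"}


def interpret_label(value) -> str:
    """Map a raw label cell (1.0, 0.0, -1.0, or missing) to a category name."""
    # pandas represents an empty CSV cell as NaN; None is accepted as well.
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return "not mentioned"
    return LABEL_MEANING[float(value)]
```

With pandas, this could be applied to a label column via `Series.map`, for instance `labels_df["Pneumonia"].map(interpret_label)`, assuming the column holds only the four values described above.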


In [16]:
# CSV mapping each `study_id` onto the labels extracted by CheXpert
labels_df = pd.read_csv(ROOT_DIR / "mimic-cxr-2.0.0-chexpert.csv")
labels_df.head()

Unnamed: 0,subject_id,study_id,Atelectasis,Cardiomegaly,Consolidation,Edema,Enlarged Cardiomediastinum,Fracture,Lung Lesion,Lung Opacity,No Finding,Pleural Effusion,Pleural Other,Pneumonia,Pneumothorax,Support Devices
0,10000032,50414267,,,,,,,,,1.0,,,,,
1,10000032,53189527,,,,,,,,,1.0,,,,,
2,10000032,53911762,,,,,,,,,1.0,,,,,
3,10000032,56699142,,,,,,,,,1.0,,,,,
4,10000764,57375967,,,1.0,,,,,,,,,-1.0,,


In [24]:
# Do we have labels for all of the studies in MIMIC? Count the studies with no row in the label table.
(~study_df.study_id.isin(labels_df.study_id)).sum()

8

This matches the MIMIC-CXR documentation: "Eight studies could not be labeled due to a lack of a findings or impression section."
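
To see which studies are missing rather than only how many, a left join with an indicator column is one option. The sketch below uses small made-up tables in place of `study_df` and `labels_df`; the IDs and variable names are hypothetical.

```python
import pandas as pd

# Toy stand-ins for the study list and the label table (IDs are made up).
studies = pd.DataFrame({"study_id": [50000001, 50000002, 50000003]})
labels = pd.DataFrame({"study_id": [50000001, 50000003]})

# indicator=True adds a `_merge` column saying where each row was found.
merged = studies.merge(
    labels[["study_id"]].drop_duplicates(),
    on="study_id",
    how="left",
    indicator=True,
)

# "left_only" rows exist in the study list but not in the label table.
missing_ids = merged.loc[merged["_merge"] == "left_only", "study_id"].tolist()
```

The reports of the resulting studies could then be opened to check that they lack a findings or impression section.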