# Loading patient-matched EHR and CXR data into VolView Insight

This notebook is a lightweight example of loading patient-matched chest X-ray and EHR data into the [VolView Insight web app](https://github.com/KitwareMedical/volview-insight). You will need matching example data, meaning both images and EHR data for the same patients. This example uses MIMIC data. MIMIC-CXR is a standard database of DICOM chest X-rays and is several terabytes in size.

If you are loading this data at Kitware Medical, a NAS has all of it pre-installed (MIMIC-IV, MIMIC-CXR, and MIMIC-IV FHIR); ask around for access. Once you have access, mount the NAS directory containing the MIMIC data on the computer where you are running this notebook. On the NAS, the data should be at `/volume1/data/physionet.org/files/`, unless things have changed since this notebook was written. Then create a symlink to the mounted location of `/volume1/data/`.

Assuming that `PROJECT_ROOT` is the root of this repository and that the symlink `PROJECT_ROOT/local` points to `/the/mounted/location/of/volume1/data`, the paths in this notebook should work as written. Lastly, you may have to adjust the URLs for the HAPI FHIR and Orthanc servers below if they are not on ports `3000` and `8042`, respectively, or if they are not running on `localhost`.
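Before running the rest of the notebook, it can help to confirm that the symlink resolves to the expected dataset directories. The following is a minimal sketch, not part of the original notebook: `missing_data_dirs` and `EXPECTED_SUBDIRS` are hypothetical names, and the subdirectories simply mirror the dataset versions used in the path constants of this notebook. The demonstration runs against a throwaway temporary directory rather than the real NAS mount.

```python
from pathlib import Path
import tempfile

# Dataset directories this notebook expects under PROJECT_ROOT/local
# (hypothetical constant; versions mirror the notebook's path constants).
EXPECTED_SUBDIRS = [
    "physionet.org/files/mimiciv/3.1",
    "physionet.org/files/mimic-cxr/2.0.0",
    "physionet.org/files/mimic-iv-fhir/2.1",
]

def missing_data_dirs(project_root: Path) -> list[str]:
    """Return the expected subdirectories that are absent under <project_root>/local."""
    local = Path(project_root) / "local"
    return [sub for sub in EXPECTED_SUBDIRS if not (local / sub).is_dir()]

# Demonstration on a temporary directory containing only the MIMIC-IV folder.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "local" / "physionet.org/files/mimiciv/3.1").mkdir(parents=True)
    demo_missing = missing_data_dirs(root)

print(demo_missing)
```

In practice you would call `missing_data_dirs(PROJECT_ROOT)` and expect an empty list once the mount and symlink are in place.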

## Notebook Structure:

1. 📦 Environment Setup
    - Install packages (if needed)
    - Import libraries (pandas, requests, fhirclient, etc.)

2. 📂 Data Access and Loading
    - Connect to MIMIC-IV & MIMIC-CXR
    - Decide on a small number of patients to load into the VolView Insight web app
    - Load the corresponding MIMIC-IV FHIR resources (in MIMIC-on-FHIR) into the HAPI FHIR server
    - Load the corresponding MIMIC-CXR data into the Orthanc server

## 📦 Environment Setup

In [None]:
from fhirclient.models.patient import Patient
import csv
import gzip
import json
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from pathlib import Path
import requests
import time

PROJECT_ID = "02-load-pt-matched-data-into-volview-insight"

# Local directory paths
PROJECT_ROOT = Path("/set/this/to/a/path/containing/local")
MIMIC_IV_BASE = Path("local/physionet.org/files/mimiciv/3.1/")
MIMIC_CXR_BASE = Path("local/physionet.org/files/mimic-cxr/2.0.0/")
MIMIC_IV_FHIR_BASE = Path("local/physionet.org/files/mimic-iv-fhir/2.1/")

# Server paths
HAPI_FHIR_BASE = "http://localhost:3000/hapi-fhir-jpaserver/fhir"  # no trailing slash; "/<ResourceType>/<id>" is appended below
ORTHANC_BASE = "http://localhost:8042/instances"

In [None]:
os.chdir(PROJECT_ROOT)

# Construct paths
data_dir = os.path.join(PROJECT_ROOT, "data", PROJECT_ID)
img_dir = os.path.join(PROJECT_ROOT, "img", PROJECT_ID)
results_dir = os.path.join(PROJECT_ROOT, "results", PROJECT_ID)

# Create directories if they don't exist
os.makedirs(data_dir, exist_ok=True)
os.makedirs(img_dir, exist_ok=True)
os.makedirs(results_dir, exist_ok=True)

def create_result(result: int | float | str, name_of_result_file_without_extension: str):
    """
    Writes a single result to a one-column CSV file in the results directory.
    The header and filename must match `name_of_result_file_without_extension`.

    Raises:
        ValueError: If the header and filename don't match.
    """
    filename = f"{name_of_result_file_without_extension}.csv"
    result_file_path = os.path.join(results_dir, filename)

    header = name_of_result_file_without_extension
    with open(result_file_path, mode='w', newline='') as file:
        writer = csv.writer(file)
        writer.writerow([header])
        writer.writerow([str(result)])

    # Validation: Ensure header matches file name
    if header != os.path.splitext(os.path.basename(result_file_path))[0]:
        raise ValueError("Header must match the filename without extension.")

## 📂 Data Access and Loading

In [None]:
# Load MIMIC-IV patients dataframe
patients_df = pd.read_csv(MIMIC_IV_BASE / "hosp/patients.csv.gz")
patients_df

In [None]:
# Load MIMIC-CXR chest X-Ray records dataframe
image_record_df = pd.read_csv(MIMIC_CXR_BASE / "cxr-record-list.csv.gz")
image_record_df

In [None]:
# Keep only the MIMIC-IV patients that have at least one chest X-ray record
patients_df = patients_df[patients_df['subject_id'].isin(image_record_df['subject_id'])]
patients_df

In [None]:
# Choose 10 patients with images
selected_patients = patients_df.iloc[:10]['subject_id']
selected_images = image_record_df[image_record_df['subject_id'].isin(selected_patients)][['subject_id', 'path']]
selected_images

In [None]:
# Now, get the FHIR resources to load into the HAPI FHIR database. Save each
# MimicPatient "id", the unique reference to the patient in the FHIR system the
# data came from. A successful POST would make the new server generate a fresh
# "id" (returned in the Location header); we want to keep the original "id"
# instead, so each resource is PUT to its original "id". The scoped "identifier"
# is used here only to recover the subject_id.

# Convert Series to DataFrame
if isinstance(selected_patients, pd.Series):
    selected_patients = selected_patients.to_frame(name='subject_id')
    
# Add the column with default None (if it doesn't exist already)
if 'orig_patient_reference' not in selected_patients.columns:
    selected_patients['orig_patient_reference'] = None
    
headers = {
    'Content-Type': 'application/fhir+json',
    'Accept': 'application/fhir+json',
    'Accept-Charset': 'UTF-8',
}

with gzip.open(MIMIC_IV_FHIR_BASE / "fhir" / "MimicPatient.ndjson.gz", 'rt') as f:
    for line in f:
        patient_json = json.loads(line)
        subject_id = int(patient_json["identifier"][0]["value"])
        if subject_id in selected_patients['subject_id'].values:
            patient = Patient(patient_json)
            selected_patients.loc[selected_patients['subject_id'] == subject_id, 'orig_patient_reference'] = patient_json["id"]
            response = requests.put(f"{HAPI_FHIR_BASE}/Patient/{patient_json['id']}", headers=headers, json=patient.as_json())
            print(f"UPLOADING PATIENT {subject_id}: {response.status_code}, headers:\n {response.headers}\n")
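The cell above preserves each patient's original `id` by sending a PUT to a URL that ends in that `id`, rather than a POST to the resource type. As a minimal sketch of how such a URL could be assembled, the hypothetical helper `fhir_put_url` below (not part of the notebook) strips any trailing slash from the base so the resulting path never contains a double slash. The example resource values are made up for illustration.

```python
def fhir_put_url(base_url: str, resource: dict) -> str:
    """Build the update URL <base>/<resourceType>/<id> for a FHIR resource dict."""
    # Tolerate a base URL written with or without a trailing slash.
    return f"{base_url.rstrip('/')}/{resource['resourceType']}/{resource['id']}"

# Illustrative resource; the id is a placeholder, not real MIMIC data.
example_patient = {"resourceType": "Patient", "id": "example-id"}

url_with_slash = fhir_put_url("http://localhost:3000/hapi-fhir-jpaserver/fhir/", example_patient)
url_without_slash = fhir_put_url("http://localhost:3000/hapi-fhir-jpaserver/fhir", example_patient)
print(url_with_slash)
```

Both calls produce the same URL, which could then be passed to `requests.put` together with the headers and JSON body used in the cell above.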

In [None]:
# Now, load the DICOM files into the Orthanc server. Note that for each `.dcm`,
# the PatientID is an LO ("Long String") that *does* match the patient ID in
# selected_images['subject_id'].
for idx, row in selected_images.iterrows():
    dicom_path = MIMIC_CXR_BASE / row['path']

    if not dicom_path.exists():
        print(f"[Warning] File not found: {dicom_path}")
        continue

    with open(dicom_path, 'rb') as f:
        response = requests.post(
            ORTHANC_BASE,
            headers={'Content-Type': 'application/dicom'},
            data=f
        )

    if response.status_code == 200:
        print(f"Uploaded DICOM for subject {row['subject_id']}: {dicom_path.name}")
    else:
        print(f"[Error] Failed to upload {dicom_path.name}: {response.status_code} - {response.text}")

In [None]:
# MIMIC-IV on FHIR also defines a few custom FHIR resources of its own. For now
# we only load the base FHIR resources, which can be uploaded without also
# transferring the resource definitions.
BASE_FHIR_RESOURCES = [
    "Condition", "Encounter", "Medication",
    "MedicationAdministration", "MedicationDispense", "MedicationRequest",
    "ObservationChartevents", "Organization", "Patient", "Procedure",
]

RESOURCE_FILES = [MIMIC_IV_FHIR_BASE / "fhir" / ("Mimic" + resource + ".ndjson.gz") for resource in BASE_FHIR_RESOURCES if resource != "Patient"]

headers = {
    'Content-Type': 'application/fhir+json',
    'Accept': 'application/fhir+json',
    'Accept-Charset': 'UTF-8',
}

# Upload other resources
for file in RESOURCE_FILES:
    resource_type = file.name.removeprefix("Mimic").removesuffix(".ndjson.gz")  # e.g., "Condition"
    print(f"\nProcessing {resource_type}...")

    success_count = 0
    start_time = time.time()  # Start timing

    with gzip.open(file, 'rt') as f:
        for line in f:
            resource_json = json.loads(line)
            resource_orig_id = resource_json["id"]

            if "subject" not in resource_json or "reference" not in resource_json["subject"]:
                print(f"Skipping {resource_type} because the resources are not linked to patients")
                break
                
            orig_patient_reference = resource_json["subject"]["reference"].split("/")[1]

            if orig_patient_reference not in selected_patients['orig_patient_reference'].values:
                continue

            response = requests.put(
                f"{HAPI_FHIR_BASE}/{resource_json['resourceType']}/{resource_orig_id}",
                headers=headers,
                json=resource_json
            )

            if response.status_code in [200, 201]:
                success_count += 1

    end_time = time.time()
    elapsed = end_time - start_time
    print(f"{success_count} {resource_type} resources successfully uploaded in {elapsed:.2f} seconds.")
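The filtering step in the loop above can be isolated from the network calls so it is easy to check on its own. The sketch below is an illustration under stated assumptions, not the notebook's code: `patient_id_from_reference` and `resources_for_patients` are hypothetical helpers, the wanted ids are held in a `set` for constant-time lookups, and resources without a `subject.reference` are skipped one by one (the loop above instead stops reading the whole file at the first such resource). The demo lines are made-up NDJSON, not MIMIC data.

```python
import json
from typing import Iterable, Iterator

def patient_id_from_reference(reference: str) -> str:
    """Extract the id from a reference such as 'Patient/<id>'."""
    return reference.rsplit("/", 1)[-1]

def resources_for_patients(lines: Iterable[str], wanted_ids: set[str]) -> Iterator[dict]:
    """Yield parsed NDJSON resources whose subject refers to a wanted patient id."""
    for line in lines:
        resource = json.loads(line)
        reference = resource.get("subject", {}).get("reference")
        if reference is None:
            continue  # resource is not linked to a patient
        if patient_id_from_reference(reference) in wanted_ids:
            yield resource

# Made-up NDJSON lines for illustration.
demo_lines = [
    json.dumps({"resourceType": "Condition", "id": "c1", "subject": {"reference": "Patient/p1"}}),
    json.dumps({"resourceType": "Condition", "id": "c2", "subject": {"reference": "Patient/p2"}}),
    json.dumps({"resourceType": "Organization", "id": "o1"}),
]
demo_ids = [r["id"] for r in resources_for_patients(demo_lines, {"p1"})]
print(demo_ids)
```

With the real data, `lines` could be the file object returned by `gzip.open(file, 'rt')` and `wanted_ids` could be `set(selected_patients['orig_patient_reference'])`.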