# Loading patient-matched EHR and CXR data into VolView Insight (Docker Volumes)

This notebook writes patient-matched chest X-ray (CXR) and EHR data into **local Docker volume directories** instead of uploading it to running HAPI FHIR and Orthanc servers. The data is saved in the layout the containers expect, under the paths that `docker-compose.yml` binds as volumes.

## Key Changes:
- ✅ **No server dependencies** - works without HAPI FHIR or Orthanc running
- ✅ **Docker volume format** - data saved in format containers expect
- ✅ **Automatic mounting** - docker-compose automatically mounts the volumes
- ✅ **Version control friendly** - volume paths are defined relative to the VolView Insight project root

## Directory Structure Created:
```
volumes/
├── hapi-fhir-data/
│   ├── patients.json                    # All patient data
│   ├── patient_[id].json               # Individual patient files
│   ├── condition/
│   │   └── condition.json
│   ├── encounter/
│   │   └── encounter.json
│   └── ... (other resource types)
└── orthanc-data/
    ├── patient_10000032/
    │   ├── 02aa804e-bde0afdd-112c0b34-7bc16630-4e384014.dcm
    │   └── ... (other DICOM files)
    ├── patient_10000764/
    │   └── ... (DICOM files)
    └── dicom_metadata.json              # Metadata about copied files
```
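The layout above can be checked after the notebook has run. The following is a minimal sketch, not part of the original notebook: `EXPECTED` and `missing_entries` are hypothetical names, and the list covers only a few of the files shown in the tree.

```python
import tempfile
from pathlib import Path

# A few entries from the tree above; extend as needed (hypothetical helper).
EXPECTED = [
    "hapi-fhir-data/patients.json",
    "hapi-fhir-data/condition/condition.json",
    "hapi-fhir-data/encounter/encounter.json",
    "orthanc-data/dicom_metadata.json",
]

def missing_entries(volumes_root, expected=EXPECTED):
    """Return the expected relative paths that do not exist under volumes_root."""
    root = Path(volumes_root)
    return [rel for rel in expected if not (root / rel).exists()]

# Demo on a throwaway directory that contains only part of the layout.
with tempfile.TemporaryDirectory() as tmp:
    volumes = Path(tmp) / "volumes"
    (volumes / "hapi-fhir-data" / "condition").mkdir(parents=True)
    (volumes / "hapi-fhir-data" / "patients.json").write_text("[]")
    (volumes / "hapi-fhir-data" / "condition" / "condition.json").write_text("[]")
    missing = missing_entries(volumes)

print(missing)
```

An empty list means every listed file is present; anything else names what still needs to be generated.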


## 📦 Environment Setup


In [1]:
from fhirclient.models.patient import Patient
import gzip
import json
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from pathlib import Path
import shutil
import time

PROJECT_ID = "Load-pt-matched-data-into-volview-insight-volumes"

# Local directory paths
PROJECT_ROOT = Path("/home/local/KHQ/andinet.enquobahrie/data/volview_insight-internal")
MIMIC_IV_BASE = Path("/home/local/KHQ/andinet.enquobahrie/mnt/physionet/physionet.org/files/mimiciv/3.1/")
MIMIC_CXR_BASE = Path("/home/local/KHQ/andinet.enquobahrie/mnt/physionet/physionet.org/files/mimic-cxr/2.0.0/")
MIMIC_IV_FHIR_BASE = Path("/home/local/KHQ/andinet.enquobahrie/mnt/physionet/physionet.org/files/mimic-iv-fhir/2.1/")

# Docker volume paths (relative to volview-insight project root)
# These paths should match the docker-compose.yml volume bindings
VOLVIEW_INSIGHT_ROOT = Path("/home/local/KHQ/andinet.enquobahrie/s/volview-insight-docker")
LOCAL_FHIR_VOLUME = VOLVIEW_INSIGHT_ROOT / "volumes" / "hapi-fhir-data"
LOCAL_ORTHANC_VOLUME = VOLVIEW_INSIGHT_ROOT / "volumes" / "orthanc-data"

print(f"Volume paths:")
print(f"  FHIR data: {LOCAL_FHIR_VOLUME}")
print(f"  Orthanc data: {LOCAL_ORTHANC_VOLUME}")


Volume paths:
  FHIR data: /home/local/KHQ/andinet.enquobahrie/s/volview-insight-docker/volumes/hapi-fhir-data
  Orthanc data: /home/local/KHQ/andinet.enquobahrie/s/volview-insight-docker/volumes/orthanc-data


In [2]:
# Create Docker volume directories
LOCAL_FHIR_VOLUME.mkdir(parents=True, exist_ok=True)
LOCAL_ORTHANC_VOLUME.mkdir(parents=True, exist_ok=True)

print(f"Created volume directories:")
print(f"  FHIR data: {LOCAL_FHIR_VOLUME}")
print(f"  Orthanc data: {LOCAL_ORTHANC_VOLUME}")
print(f"  FHIR exists: {LOCAL_FHIR_VOLUME.exists()}")
print(f"  Orthanc exists: {LOCAL_ORTHANC_VOLUME.exists()}")


Created volume directories:
  FHIR data: /home/local/KHQ/andinet.enquobahrie/s/volview-insight-docker/volumes/hapi-fhir-data
  Orthanc data: /home/local/KHQ/andinet.enquobahrie/s/volview-insight-docker/volumes/orthanc-data
  FHIR exists: True
  Orthanc exists: True
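The paths in the setup cell are absolute and specific to one machine. One way to make them portable is sketched below, assuming a hypothetical `VOLVIEW_INSIGHT_ROOT` environment variable (nothing in the project is known to read it) and falling back to the current working directory. `resolve_volume_paths` is an illustrative name.

```python
import os
from pathlib import Path

def resolve_volume_paths(env=None):
    """Return (fhir_volume, orthanc_volume) under a configurable project root.

    VOLVIEW_INSIGHT_ROOT is a hypothetical variable name; when it is unset,
    the current working directory is used as the root.
    """
    env = os.environ if env is None else env
    root = Path(env.get("VOLVIEW_INSIGHT_ROOT", Path.cwd()))
    volumes = root / "volumes"
    return volumes / "hapi-fhir-data", volumes / "orthanc-data"

# Demo with an explicit mapping instead of the real environment.
fhir_volume, orthanc_volume = resolve_volume_paths(
    {"VOLVIEW_INSIGHT_ROOT": "/tmp/volview-demo"}
)
print(fhir_volume)
print(orthanc_volume)
```

Passing the environment as an argument keeps the function easy to test and avoids editing the notebook for each machine.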


## 📂 Data Access and Loading


In [4]:
# Load MIMIC-IV patients dataframe
patients_df = pd.read_csv(MIMIC_IV_BASE / "hosp/patients.csv.gz")
patients_df


| | subject_id | gender | anchor_age | anchor_year | anchor_year_group | dod |
|---|---|---|---|---|---|---|
| 0 | 10000032 | F | 52 | 2180 | 2014 - 2016 | 2180-09-09 |
| 1 | 10000048 | F | 23 | 2126 | 2008 - 2010 | |
| 2 | 10000058 | F | 33 | 2168 | 2020 - 2022 | |
| 3 | 10000068 | F | 19 | 2160 | 2008 - 2010 | |
| 4 | 10000084 | M | 72 | 2160 | 2017 - 2019 | 2161-02-13 |
| ... | ... | ... | ... | ... | ... | ... |
| 364622 | 19999828 | F | 46 | 2147 | 2017 - 2019 | |
| 364623 | 19999829 | F | 28 | 2186 | 2008 - 2010 | |
| 364624 | 19999840 | M | 58 | 2164 | 2008 - 2010 | 2164-09-17 |
| 364625 | 19999914 | F | 49 | 2158 | 2017 - 2019 | |


In [5]:
# Load MIMIC-CXR chest X-Ray records dataframe
image_record_df = pd.read_csv(MIMIC_CXR_BASE / "cxr-record-list.csv.gz")
print(f"Loaded {len(image_record_df)} image records from MIMIC-CXR")
image_record_df


Loaded 377110 image records from MIMIC-CXR


| | subject_id | study_id | dicom_id | path |
|---|---|---|---|---|
| 0 | 10000032 | 50414267 | 02aa804e-bde0afdd-112c0b34-7bc16630-4e384014 | files/p10/p10000032/s50414267/02aa804e-bde0afd... |
| 1 | 10000032 | 50414267 | 174413ec-4ec4c1f7-34ea26b7-c5f994f8-79ef1962 | files/p10/p10000032/s50414267/174413ec-4ec4c1f... |
| 2 | 10000032 | 53189527 | 2a2277a9-b0ded155-c0de8eb9-c124d10e-82c5caab | files/p10/p10000032/s53189527/2a2277a9-b0ded15... |
| 3 | 10000032 | 53189527 | e084de3b-be89b11e-20fe3f9f-9c8d8dfe-4cfd202c | files/p10/p10000032/s53189527/e084de3b-be89b11... |
| 4 | 10000032 | 53911762 | 68b5c4b1-227d0485-9cc38c3f-7b84ab51-4b472714 | files/p10/p10000032/s53911762/68b5c4b1-227d048... |
| ... | ... | ... | ... | ... |
| 377105 | 19999733 | 57132437 | 428e2c18-5721d8f3-35a05001-36f3d080-9053b83c | files/p19/p19999733/s57132437/428e2c18-5721d8f... |
| 377106 | 19999733 | 57132437 | 58c403aa-35ff8bd9-73e39f54-8dc9cc5d-e0ec3fa9 | files/p19/p19999733/s57132437/58c403aa-35ff8bd... |
| 377107 | 19999987 | 55368167 | 58766883-376a15ce-3b323a28-6af950a0-16b793bd | files/p19/p19999987/s55368167/58766883-376a15c... |
| 377108 | 19999987 | 58621812 | 7ba273af-3d290f8d-e28d0ab4-484b7a86-7fc12b08 | files/p19/p19999987/s58621812/7ba273af-3d290f8... |


In [6]:
# Filter patients to only those with images
patients_df = patients_df[patients_df['subject_id'].isin(image_record_df['subject_id'])]
print(f"Found {len(patients_df)} patients with both EHR and imaging data")
patients_df


Found 61868 patients with both EHR and imaging data


| | subject_id | gender | anchor_age | anchor_year | anchor_year_group | dod |
|---|---|---|---|---|---|---|
| 0 | 10000032 | F | 52 | 2180 | 2014 - 2016 | 2180-09-09 |
| 26 | 10000764 | M | 86 | 2132 | 2014 - 2016 | |
| 31 | 10000898 | F | 80 | 2188 | 2014 - 2016 | |
| 33 | 10000935 | F | 52 | 2182 | 2008 - 2010 | 2187-11-12 |
| 38 | 10000980 | F | 73 | 2186 | 2008 - 2010 | 2193-08-26 |
| ... | ... | ... | ... | ... | ... | ... |
| 364603 | 19999287 | F | 71 | 2191 | 2008 - 2010 | 2197-09-02 |
| 364609 | 19999376 | M | 44 | 2145 | 2011 - 2013 | |
| 364611 | 19999442 | M | 41 | 2146 | 2008 - 2010 | |
| 364618 | 19999733 | F | 19 | 2152 | 2011 - 2013 | |


In [7]:
# Choose the first 10 patients that have images, then collect their image paths
selected_patients = patients_df.iloc[:10]['subject_id']
selected_images = image_record_df[image_record_df['subject_id'].isin(selected_patients)][['subject_id', 'path']]

selected_images
#print(f"Selected {len(selected_patients)} patients:")
#for patient_id in selected_patients:
#    patient_images = selected_images[selected_images['subject_id'] == patient_id]
#    print(f"  Patient {patient_id}: {len(patient_images)} images")

#print(f"\nTotal images to process: {len(selected_images)}")
#selected_images.head()


| | subject_id | path |
|---|---|---|
| 0 | 10000032 | files/p10/p10000032/s50414267/02aa804e-bde0afd... |
| 1 | 10000032 | files/p10/p10000032/s50414267/174413ec-4ec4c1f... |
| 2 | 10000032 | files/p10/p10000032/s53189527/2a2277a9-b0ded15... |
| 3 | 10000032 | files/p10/p10000032/s53189527/e084de3b-be89b11... |
| 4 | 10000032 | files/p10/p10000032/s53911762/68b5c4b1-227d048... |
| ... | ... | ... |
| 62 | 10001401 | files/p10/p10001401/s55350604/d9db838d-4612fd1... |
| 63 | 10001401 | files/p10/p10001401/s56534136/d69651ae-dc7bacc... |
| 64 | 10001401 | files/p10/p10001401/s57492692/a83c7ff9-2d42639... |
| 65 | 10001401 | files/p10/p10001401/s58747570/19e55bee-714bb19... |
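The selection logic (keep patients that have images, take the first few, gather their image rows) can be tried on a toy dataset before running it on MIMIC. The sketch below uses made-up rows and paths; only `isin`, `iloc`, and `groupby(...).size()` come from the approach used in this notebook, and the per-patient count stands in for the commented-out summary loop.

```python
import pandas as pd

# Made-up stand-ins for cxr-record-list.csv.gz and patients.csv.gz.
records = pd.DataFrame({
    "subject_id": [10000032, 10000032, 10000764, 19999987],
    "path": ["files/a.dcm", "files/b.dcm", "files/c.dcm", "files/d.dcm"],
})
patients = pd.DataFrame({"subject_id": [10000032, 10000048, 10000764]})

# Keep only patients that appear in the image records.
with_images = patients[patients["subject_id"].isin(records["subject_id"])]

# Take the first two such patients and collect their image rows.
selected_ids = with_images["subject_id"].iloc[:2]
selected = records[records["subject_id"].isin(selected_ids)]

# Number of images per selected patient.
images_per_patient = selected.groupby("subject_id").size()
print(images_per_patient)
```

Patient 10000048 drops out because it has no image rows, and 19999987 is excluded because it is not in the patient table.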


## 💾 Save FHIR Patient Resources to Local Volume


In [8]:
# Save FHIR Patient resources to the local volume instead of uploading them to a server.
# Record each MimicPatient "id" as well: it is the patient's unique reference in the
# source FHIR data, and it is needed later to match other resources to these patients.

# Convert Series to DataFrame
if isinstance(selected_patients, pd.Series):
    selected_patients = selected_patients.to_frame(name='subject_id')
    
# Add the column with default None (if it doesn't exist already)
if 'orig_patient_reference' not in selected_patients.columns:
    selected_patients['orig_patient_reference'] = None

# Save patient data to local volume
patient_data = []
print("Processing FHIR Patient resources...")

with gzip.open(MIMIC_IV_FHIR_BASE / "fhir" / "MimicPatient.ndjson.gz", 'rt') as f:
    for line in f:
        patient_json = json.loads(line)
        subject_id = int(patient_json["identifier"][0]["value"])
        if subject_id in selected_patients['subject_id'].values:
            patient_data.append(patient_json)
            selected_patients.loc[selected_patients['subject_id'] == subject_id, 'orig_patient_reference'] = patient_json["id"]
            
            # Save individual patient file
            patient_file = LOCAL_FHIR_VOLUME / f"patient_{patient_json['id']}.json"
            with open(patient_file, 'w') as pf:
                json.dump(patient_json, pf, indent=2)
            print(f"Saved Patient {subject_id} to {patient_file.name}")

# Save all patient data to a single file for easy loading
patients_file = LOCAL_FHIR_VOLUME / "patients.json"
with open(patients_file, 'w') as f:
    json.dump(patient_data, f, indent=2)

print(f"\n✅ Saved {len(patient_data)} patients to {patients_file.name}")
print(f"📁 Individual patient files saved to: {LOCAL_FHIR_VOLUME}")


Processing FHIR Patient resources...
Saved Patient 10000032 to patient_0a8eebfd-a352-522e-89f0-1d4a13abdebc.json
Saved Patient 10000764 to patient_62c0431d-ad3e-5f73-a4e0-b2358beacf78.json
Saved Patient 10000898 to patient_67dbcef1-d72b-545a-a9cf-6b2587f8a628.json
Saved Patient 10000935 to patient_7f448141-4f68-5e99-88aa-0dedae4bcfcd.json
Saved Patient 10000980 to patient_b9751acd-fc69-520d-b2e6-f7e232dbea03.json
Saved Patient 10001038 to patient_e6cebc51-37de-5ced-8f9b-fdac06a2fc71.json
Saved Patient 10001122 to patient_d377090e-f214-5fc2-a883-671ede9a7463.json
Saved Patient 10001176 to patient_7efc6f11-1cb4-5d64-8dcd-7c66ef1b90fc.json
Saved Patient 10001217 to patient_a6e7e991-6801-5425-b435-4ca6b7decfcc.json
Saved Patient 10001401 to patient_f22618de-6490-5e23-a1c4-4de3f02a7b97.json

✅ Saved 10 patients to patients.json
📁 Individual patient files saved to: /home/local/KHQ/andinet.enquobahrie/s/volview-insight-docker/volumes/hapi-fhir-data
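Once `patients.json` exists, the mapping from MIMIC `subject_id` to FHIR Patient `id` can be rebuilt from it without rereading the gzipped source. The sketch below assumes the same structure the cell above relies on (`identifier[0].value` holds the subject ID). `build_reference_map` and the two Patient stubs are illustrative, not real data.

```python
import json
import tempfile
from pathlib import Path

def build_reference_map(patients_file):
    """Map MIMIC subject_id (int) to the FHIR Patient "id" string."""
    patients = json.loads(Path(patients_file).read_text())
    return {int(p["identifier"][0]["value"]): p["id"] for p in patients}

# Two made-up Patient stubs stand in for the real patients.json.
demo_patients = [
    {"resourceType": "Patient", "id": "aaaa-1111", "identifier": [{"value": "10000032"}]},
    {"resourceType": "Patient", "id": "bbbb-2222", "identifier": [{"value": "10000764"}]},
]
with tempfile.TemporaryDirectory() as tmp:
    demo_file = Path(tmp) / "patients.json"
    demo_file.write_text(json.dumps(demo_patients))
    reference_map = build_reference_map(demo_file)

print(reference_map)
```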


## 🏥 Copy DICOM Files to Local Volume


In [9]:
# Copy DICOM files to local volume instead of uploading to Orthanc server
print("Copying DICOM files to local volume...")

dicom_files_copied = []
total_files = len(selected_images)
processed_files = 0

for idx, row in selected_images.iterrows():
    source_path = MIMIC_CXR_BASE / row['path']
    
    if not source_path.exists():
        print(f"[Warning] File not found: {source_path}")
        continue
    
    # Create patient-specific directory
    patient_dir = LOCAL_ORTHANC_VOLUME / f"patient_{row['subject_id']}"
    patient_dir.mkdir(exist_ok=True)
    
    # Copy DICOM file
    dest_path = patient_dir / source_path.name
    shutil.copy2(source_path, dest_path)
    
    dicom_files_copied.append({
        'subject_id': row['subject_id'],
        'source_path': str(source_path),
        'dest_path': str(dest_path),
        'file_size': dest_path.stat().st_size
    })
    
    processed_files += 1
    if processed_files % 10 == 0:
        print(f"Processed {processed_files}/{total_files} files...")

# Save metadata about copied files
metadata_file = LOCAL_ORTHANC_VOLUME / "dicom_metadata.json"
with open(metadata_file, 'w') as f:
    json.dump(dicom_files_copied, f, indent=2)

print(f"\n✅ Copied {len(dicom_files_copied)} DICOM files to {LOCAL_ORTHANC_VOLUME}")
print(f"📁 Metadata saved to: {metadata_file.name}")

# Show summary by patient
print(f"\n📊 Summary by patient:")
for patient_id in selected_patients['subject_id']:
    patient_files = [f for f in dicom_files_copied if f['subject_id'] == patient_id]
    total_size = sum(f['file_size'] for f in patient_files)
    print(f"  Patient {patient_id}: {len(patient_files)} files ({total_size / 1024 / 1024:.1f} MB)")


Copying DICOM files to local volume...
Processed 10/67 files...
Processed 20/67 files...
Processed 30/67 files...
Processed 40/67 files...
Processed 50/67 files...
Processed 60/67 files...

✅ Copied 67 DICOM files to /home/local/KHQ/andinet.enquobahrie/s/volview-insight-docker/volumes/orthanc-data
📁 Metadata saved to: dicom_metadata.json

📊 Summary by patient:
  Patient 10000032: 7 files (99.8 MB)
  Patient 10000764: 3 files (44.5 MB)
  Patient 10000898: 5 files (74.2 MB)
  Patient 10000935: 10 files (115.3 MB)
  Patient 10000980: 16 files (208.4 MB)
  Patient 10001038: 3 files (44.5 MB)
  Patient 10001122: 5 files (58.9 MB)
  Patient 10001176: 5 files (53.3 MB)
  Patient 10001217: 3 files (44.4 MB)
  Patient 10001401: 10 files (134.4 MB)
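A cheap sanity check on the copied files is to confirm that each one carries the DICOM Part 10 marker: a 128-byte preamble followed by the ASCII bytes `DICM`. The sketch below uses only the standard library. `looks_like_dicom` is a hypothetical helper, and the two demo files are synthetic rather than real images.

```python
import tempfile
from pathlib import Path

def looks_like_dicom(path):
    """Return True if the file has the 'DICM' marker right after the 128-byte preamble."""
    with open(path, "rb") as f:
        header = f.read(132)
    return len(header) == 132 and header[128:132] == b"DICM"

# Demo: one synthetic file with a valid marker, one without.
with tempfile.TemporaryDirectory() as tmp:
    good = Path(tmp) / "good.dcm"
    bad = Path(tmp) / "bad.dcm"
    good.write_bytes(b"\x00" * 128 + b"DICM" + b"payload")
    bad.write_bytes(b"not a dicom file")
    results = {p.name: looks_like_dicom(p) for p in (good, bad)}

print(results)
```

This only checks the file header; it does not validate the dataset or pixel data.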


## 📋 Save Other FHIR Resources to Local Volume


In [10]:
# Save other FHIR resources to local volume instead of uploading to HAPI FHIR server
BASE_FHIR_RESOURCES = ["Condition", "Encounter", "Medication",
"MedicationAdministration", "MedicationDispense", "MedicationRequest",
"ObservationChartevents", "Organization", "Patient", "Procedure"]

RESOURCE_FILES = [MIMIC_IV_FHIR_BASE / "fhir" / ("Mimic" + resource + ".ndjson.gz") for resource in BASE_FHIR_RESOURCES if resource != "Patient"]

# Create resource-specific directories
for resource in BASE_FHIR_RESOURCES:
    if resource != "Patient":
        (LOCAL_FHIR_VOLUME / resource.lower()).mkdir(exist_ok=True)

print("Processing other FHIR resources...")

# Process and save resources
total_resources_saved = 0
for file in RESOURCE_FILES:
    resource_type = file.name.replace("Mimic", "").replace(".ndjson.gz", "").split('.')[0]
    print(f"\nProcessing {resource_type}...")
    
    resource_data = []
    success_count = 0
    start_time = time.time()
    
    with gzip.open(file, 'rt') as f:
        for line in f:
            resource_json = json.loads(line)
            
            if "subject" not in resource_json or "reference" not in resource_json["subject"]:
                print(f"Skipping {resource_type} because resources are not linked to patients")
                break
                
            orig_patient_reference = resource_json["subject"]["reference"].split("/")[1]
            
            if orig_patient_reference not in selected_patients['orig_patient_reference'].values:
                continue
            
            resource_data.append(resource_json)
            success_count += 1
    
    # Save all resources of this type to a single file
    resource_file = LOCAL_FHIR_VOLUME / resource_type.lower() / f"{resource_type.lower()}.json"
    with open(resource_file, 'w') as f:
        json.dump(resource_data, f, indent=2)
    
    end_time = time.time()
    elapsed = end_time - start_time
    print(f"✅ {success_count} {resource_type} resources saved to {resource_file.name} in {elapsed:.2f} seconds")
    total_resources_saved += success_count

print(f"\n🎉 All FHIR resources saved to {LOCAL_FHIR_VOLUME}")
print(f"📊 Total resources saved: {total_resources_saved}")


Processing other FHIR resources...

Processing Condition...
✅ 415 Condition resources saved to condition.json in 54.07 seconds

Processing Encounter...
✅ 26 Encounter resources saved to encounter.json in 9.01 seconds

Processing Medication...
Skipping Medication because resources are not linked to patients
✅ 0 Medication resources saved to medication.json in 0.02 seconds

Processing MedicationAdministration...
✅ 1745 MedicationAdministration resources saved to medicationadministration.json in 370.78 seconds

Processing MedicationDispense...
✅ 815 MedicationDispense resources saved to medicationdispense.json in 184.27 seconds

Processing MedicationRequest...
✅ 1031 MedicationRequest resources saved to medicationrequest.json in 233.35 seconds

Processing ObservationChartevents...
✅ 3559 ObservationChartevents resources saved to observationchartevents.json in 3961.55 seconds

Processing Organization...
Skipping Organization because resources are not linked to patients
✅ 0 Organization res
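The per-resource filtering step can be isolated into a small generator. The sketch below is a variation on the cell above, not a copy of it: it builds a `set` of patient IDs once, and it skips individual resources that lack a `subject` reference instead of abandoning the whole file at the first one. `filter_ndjson_by_patient` and the demo resources are hypothetical.

```python
import gzip
import json
import tempfile
from pathlib import Path

def filter_ndjson_by_patient(ndjson_gz, patient_ids):
    """Yield resources whose subject reference points at one of patient_ids."""
    wanted = set(patient_ids)  # built once; constant-time membership per line
    with gzip.open(ndjson_gz, "rt") as f:
        for line in f:
            resource = json.loads(line)
            reference = resource.get("subject", {}).get("reference", "")
            # "Patient/<id>" -> "<id>"; an empty reference never matches.
            if reference.rsplit("/", 1)[-1] in wanted:
                yield resource

# Demo on a tiny synthetic NDJSON file.
demo_resources = [
    {"resourceType": "Condition", "id": "c1", "subject": {"reference": "Patient/aaaa-1111"}},
    {"resourceType": "Condition", "id": "c2", "subject": {"reference": "Patient/zzzz-9999"}},
    {"resourceType": "Medication", "id": "m1"},
    {"resourceType": "Condition", "id": "c3", "subject": {"reference": "Patient/aaaa-1111"}},
]
with tempfile.TemporaryDirectory() as tmp:
    demo_file = Path(tmp) / "MimicCondition.ndjson.gz"
    with gzip.open(demo_file, "wt") as f:
        for resource in demo_resources:
            f.write(json.dumps(resource) + "\n")
    kept_ids = [r["id"] for r in filter_ndjson_by_patient(demo_file, ["aaaa-1111"])]

print(kept_ids)
```

Most of the run time reported above is probably spent decompressing and parsing each line, so the set lookup alone should not be expected to shorten it much.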

## 📊 Summary and Next Steps


In [11]:
# Summary of what was created
print("🎉 Data Loading Complete!")
print("=" * 50)

print(f"\n📁 Volume Directories Created:")
print(f"  FHIR data: {LOCAL_FHIR_VOLUME}")
print(f"  Orthanc data: {LOCAL_ORTHANC_VOLUME}")

print(f"\n👥 Patients Processed: {len(selected_patients)}")
print(f"🏥 DICOM Files Copied: {len(dicom_files_copied)}")
print(f"📋 FHIR Resources Saved: {total_resources_saved}")

print(f"\n📂 Directory Structure:")
print(f"volumes/")
print(f"├── hapi-fhir-data/")
print(f"│   ├── patients.json")
print(f"│   ├── patient_*.json")
for resource in BASE_FHIR_RESOURCES:
    if resource != "Patient":
        print(f"│   ├── {resource.lower()}/")
        print(f"│   │   └── {resource.lower()}.json")
print(f"└── orthanc-data/")
for patient_id in selected_patients['subject_id']:
    patient_files = [f for f in dicom_files_copied if f['subject_id'] == patient_id]
    print(f"    ├── patient_{patient_id}/")
    print(f"    │   └── {len(patient_files)} DICOM files")
print(f"    └── dicom_metadata.json")

print(f"\n🚀 Next Steps:")
print(f"1. Start Docker Compose: docker-compose up -d")
print(f"2. Access VolView Insight: http://localhost:8080")
print(f"3. The data will be automatically available in the containers!")

print(f"\n💡 Benefits:")
print(f"✅ No server dependencies during data preparation")
print(f"✅ Data persists between container restarts")
print(f"✅ Version control friendly (relative paths)")
print(f"✅ Easy to share and reproduce")
print(f"✅ Docker automatically mounts the volumes")


🎉 Data Loading Complete!

📁 Volume Directories Created:
  FHIR data: /home/local/KHQ/andinet.enquobahrie/s/volview-insight-docker/volumes/hapi-fhir-data
  Orthanc data: /home/local/KHQ/andinet.enquobahrie/s/volview-insight-docker/volumes/orthanc-data

👥 Patients Processed: 10
🏥 DICOM Files Copied: 67
📋 FHIR Resources Saved: 7633

📂 Directory Structure:
volumes/
├── hapi-fhir-data/
│   ├── patients.json
│   ├── patient_*.json
│   ├── condition/
│   │   └── condition.json
│   ├── encounter/
│   │   └── encounter.json
│   ├── medication/
│   │   └── medication.json
│   ├── medicationadministration/
│   │   └── medicationadministration.json
│   ├── medicationdispense/
│   │   └── medicationdispense.json
│   ├── medicationrequest/
│   │   └── medicationrequest.json
│   ├── observationchartevents/
│   │   └── observationchartevents.json
│   ├── organization/
│   │   └── organization.json
│   ├── procedure/
│   │   └── procedure.json
└── orthanc-data/
    ├── patient_10000032/
    │   └── 7 D