# Medical Data Gateway — Demo Walkthrough

This notebook walks through the full prototype pipeline:

1. **DICOM Inspection** — load a file, explore metadata
2. **The Privacy Problem** — show PHI tags, then anonymize
3. **Medical Visualization** — windowing and HU conversion
4. **Intensity Clustering** — K-Means on pixel values
5. **Fleet Quality Control** — scanner anomaly detection
6. **What Production Looks Like** — gaps and next steps

> **Reminder:** This is a learning prototype. Nothing in this notebook constitutes medical advice or a compliance-ready solution.


In [None]:
import sys
import os
sys.path.insert(0, os.path.dirname(os.getcwd()))

import numpy as np
import matplotlib.pyplot as plt
import pydicom
from pydicom.dataset import FileDataset
from pydicom.uid import ExplicitVRLittleEndian

%matplotlib inline
print("Imports OK")


## 1. DICOM Inspection

This repo contains no real scan files, so we build a synthetic DICOM dataset in memory and inspect its metadata.


In [None]:
def make_synthetic_dicom(patient_name="Doe^John", patient_id="12345", rows=64, cols=64):
    """Create a synthetic DICOM dataset for demonstration."""
    file_meta = pydicom.Dataset()
    file_meta.MediaStorageSOPClassUID = pydicom.uid.UID("1.2.840.10008.5.1.4.1.1.2")
    file_meta.MediaStorageSOPInstanceUID = pydicom.uid.generate_uid()
    file_meta.TransferSyntaxUID = ExplicitVRLittleEndian

    ds = FileDataset(filename_or_obj=None, dataset={}, file_meta=file_meta, preamble=b"\0" * 128)
    ds.PatientName = patient_name
    ds.PatientID = patient_id
    ds.PatientBirthDate = "19800101"
    ds.PatientSex = "M"
    ds.StudyDate = "20230601"
    ds.ContentDate = "20230601"
    ds.StudyTime = "120000"
    ds.ContentTime = "120000"
    ds.AccessionNumber = "ACC001"
    ds.StudyID = "STUDY001"
    ds.InstitutionName = "City General Hospital"
    ds.ReferringPhysicianName = "Smith^Jane"
    ds.Modality = "CT"
    ds.RescaleSlope = 1.0
    ds.RescaleIntercept = -1024.0
    ds.WindowCenter = 40.0
    ds.WindowWidth = 400.0
    ds.Rows = rows
    ds.Columns = cols
    ds.SamplesPerPixel = 1
    ds.PhotometricInterpretation = "MONOCHROME2"
    ds.PixelRepresentation = 0
    ds.BitsAllocated = 16
    ds.BitsStored = 16
    ds.HighBit = 15

    # Synthetic pixel data: simulate tissue density variation
    rng = np.random.default_rng(42)
    pixels = rng.integers(800, 1200, size=(rows, cols), dtype=np.uint16)
    # Add a bright "bone" region
    pixels[20:30, 20:40] = 1800
    ds.PixelData = pixels.tobytes()
    return ds

ds = make_synthetic_dicom()
print("Synthetic DICOM dataset created")
print(f"  Patient Name   : {ds.PatientName}")
print(f"  Patient ID     : {ds.PatientID}")
print(f"  Institution    : {ds.InstitutionName}")
print(f"  Study Date     : {ds.StudyDate}")
print(f"  Modality       : {ds.Modality}")
print(f"  Image size     : {ds.Rows} x {ds.Columns}")


## 2. The Privacy Problem

DICOM files embed patient identity in dozens of header tags.  If a scan file
leaves the mobile unit unmodified, patient identity travels with it.

Below we anonymize the dataset using `src.anonymizer` and verify the result.


In [None]:
from src.anonymizer import anonymize_dataset, TAGS_TO_REMOVE

print("=== BEFORE ANONYMIZATION ===")
phi_tags = ["PatientName", "PatientID", "PatientBirthDate", "InstitutionName"]
for tag in phi_tags:
    print(f"  {tag}: {getattr(ds, tag, '(not present)')}")

anonymize_dataset(ds, station_name="DEMO_UNIT_01")

print("\n=== AFTER ANONYMIZATION ===")
for tag in phi_tags:
    print(f"  {tag}: {getattr(ds, tag, '(removed)')}")

print(f"\n  PatientIdentityRemoved : {ds.PatientIdentityRemoved}")
print(f"  StationName            : {ds.StationName}")
print(f"  DeidentificationMethod : {ds.DeidentificationMethod[:60]}...")
print(f"\nTotal PHI tags targeted for removal: {len(TAGS_TO_REMOVE)}")


## 3. Medical Visualization — Windowing

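CT files store raw pixel values that are converted to Hounsfield Units (HU), a
tissue-density scale, via `HU = RescaleSlope * stored_value + RescaleIntercept`.
A *window* with center C and width W maps the HU range [C - W/2, C + W/2] to
the 0–255 display range and clips everything outside it.  The same slice looks
completely different through a bone window than through a brain window.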
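
The rescale-and-window mapping can be sketched in a few lines of NumPy. This is a minimal illustration, not the code in `src.windowing`: the helper names `to_hounsfield` and `apply_window` are hypothetical, and the mapping is the simplified linear form rather than the exact DICOM display formula.

```python
import numpy as np

def to_hounsfield(stored, slope=1.0, intercept=-1024.0):
    """Convert stored pixel values to HU: HU = slope * stored + intercept."""
    return stored.astype(np.float64) * slope + intercept

def apply_window(hu, center, width):
    """Map [center - width/2, center + width/2] linearly onto 0-255, clipping the rest."""
    low = center - width / 2.0
    high = center + width / 2.0
    clipped = np.clip(hu, low, high)
    return np.round((clipped - low) / (high - low) * 255.0).astype(np.uint8)

# Stored values chosen to land below, inside, and above a soft-tissue window.
stored = np.array([[0, 964, 1024], [1264, 1800, 4000]], dtype=np.uint16)
hu = to_hounsfield(stored)                       # -1024, -60, 0, 240, 776, 2976
soft = apply_window(hu, center=40.0, width=400.0)
print(soft)
```

Everything at or above 240 HU saturates to 255 in this window, which is why bone detail disappears in a soft-tissue view and needs a wider window.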


In [None]:
from src.windowing import WINDOW_PRESETS, window_from_dataset

# Re-create a fresh dataset (anonymize modified the previous one in-place)
ds2 = make_synthetic_dicom()

presets = list(WINDOW_PRESETS.keys())
fig, axes = plt.subplots(1, len(presets), figsize=(4 * len(presets), 4))

for ax, preset in zip(axes, presets):
    windowed = window_from_dataset(ds2, preset=preset)
    center, width = WINDOW_PRESETS[preset]
    ax.imshow(windowed, cmap="gray")
    ax.set_title(f"{preset.replace('_', ' ').title()}\nC={center}, W={width}", fontsize=10)
    ax.axis("off")

fig.suptitle("Same CT Slice Through Different Clinical Windows", fontsize=13)
fig.tight_layout()
plt.show()
print("The bright rectangle (simulated bone) looks very different across windows.")


## 4. Intensity Clustering

K-Means groups pixels by their windowed intensity value.  With k=3 clusters
on a CT slice, the groups loosely correspond to air/soft-tissue/bone —
but the algorithm has no knowledge of anatomy.

**This is visualisation, not segmentation.**
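
To make the idea concrete, here is a small NumPy-only sketch of K-Means on scalar intensities (Lloyd's algorithm). It is an illustration under stated assumptions, not the implementation behind `src.clustering`: the function name `kmeans_1d`, the quantile initialisation, and the toy image are all made up for this example.

```python
import numpy as np

def kmeans_1d(values, k=3, n_iter=20):
    """Lloyd's algorithm on scalar intensities; returns (labels, centroids)."""
    values = np.asarray(values, dtype=np.float64).ravel()
    # Start from evenly spaced quantiles so the result is deterministic.
    centroids = np.quantile(values, np.linspace(0.0, 1.0, k))
    for _ in range(n_iter):
        # Assign each pixel to its nearest centroid.
        labels = np.argmin(np.abs(values[:, None] - centroids[None, :]), axis=1)
        # Move each centroid to the mean of its members (unchanged if empty).
        new_centroids = np.array([
            values[labels == j].mean() if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Toy "windowed" image: a dark band, mid-grey noise, and a bright block.
rng = np.random.default_rng(0)
image = rng.integers(90, 110, size=(16, 16)).astype(np.float64)
image[:4, :] = 0.0
image[6:10, 6:12] = 255.0

labels, centroids = kmeans_1d(image, k=3)
cluster_map = labels.reshape(image.shape)
print(centroids)
```

Because of the quantile initialisation, labels in this sketch happen to be ordered by intensity. That is a property of the sketch, not of K-Means in general, and the groups still carry no anatomical meaning.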


In [None]:
from src.clustering import cluster_scan

ds3 = make_synthetic_dicom()
windowed, cluster_map, silhouette = cluster_scan(ds3, n_clusters=3)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))

ax1.imshow(windowed, cmap="gray")
ax1.set_title("Windowed Scan (soft_tissue preset)", fontsize=11)
ax1.axis("off")

ax2.imshow(cluster_map, cmap="plasma")
ax2.set_title(
    f"K-Means Intensity Clustering (k=3)\nSilhouette score: {silhouette:.3f}",
    fontsize=11
)
ax2.axis("off")

fig.tight_layout()
plt.show()
print(f"Silhouette score: {silhouette:.3f}  (range: -1 to +1; higher = better separated clusters)")
print("Note: cluster labels are arbitrary integers — they don't map to named tissue types.")


## 5. Fleet Quality Control

In a real deployment, dozens of mobile scanners send files to a central
server.  `src.scanner_qc` extracts simple statistics from each file and
clusters the fleet to surface potential outliers.

We simulate a small fleet of synthetic scans below.
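
As a rough picture of what "simple statistics" can mean, the sketch below reduces each scan to a mean (average density) and a standard deviation (contrast), then flags unusual scans. Note the swap in technique: `src.scanner_qc` clusters the fleet, whereas this example uses a modified z-score based on the median and MAD, which is easier to show in a few lines. The names `scan_features` and `flag_outliers` and the toy fleet are assumptions for illustration.

```python
import numpy as np

def scan_features(pixels):
    """Return (average density, contrast) as the pixel mean and standard deviation."""
    pixels = np.asarray(pixels, dtype=np.float64)
    return float(pixels.mean()), float(pixels.std())

def flag_outliers(values, threshold=3.5):
    """Flag values whose modified z-score (median/MAD based) exceeds the threshold."""
    values = np.asarray(values, dtype=np.float64)
    median = np.median(values)
    mad = np.median(np.abs(values - median))
    if mad == 0:
        return np.zeros(values.shape, dtype=bool)
    return np.abs(0.6745 * (values - median) / mad) > threshold

# Deterministic toy fleet: a zero-mean texture shifted to different mean levels.
pattern = np.linspace(-1.0, 1.0, 32 * 32).reshape(32, 32)
normal_means = [1000, 1005, 995, 1010, 990, 1002]
anomalous_means = [400, 410]
fleet = [m + 150 * pattern for m in normal_means]
fleet += [m + 50 * pattern for m in anomalous_means]

features = [scan_features(scan) for scan in fleet]
densities = [density for density, _ in features]
flags = flag_outliers(densities)
print(flags)
```

A median-based score is used here because a plain mean and standard deviation are themselves pulled toward the outliers when the fleet is small.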


In [None]:
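import os
import tempfile
import zlib
from src.scanner_qc import run_qc, ScanFeatures
from src.visualization import plot_fleet_qc

# Write synthetic DICOMs to a temp folder
tmp_dir = tempfile.mkdtemp()

def write_synthetic(path, mean_val, std_val, size=32):
    # Seed from the file name with CRC32: built-in hash() is salted per process
    # for strings, and the temp folder name changes on every run.
    rng = np.random.default_rng(zlib.crc32(os.path.basename(path).encode()))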
    pixels = rng.normal(mean_val, std_val, size=(size, size)).clip(0, 4095).astype(np.uint16)
    file_meta = pydicom.Dataset()
    file_meta.MediaStorageSOPClassUID = pydicom.uid.UID("1.2.840.10008.5.1.4.1.1.2")
    file_meta.MediaStorageSOPInstanceUID = pydicom.uid.generate_uid()
    file_meta.TransferSyntaxUID = ExplicitVRLittleEndian
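    # Keep the encoding flags consistent with the declared Explicit VR Little Endian syntax
    ds = FileDataset(path, {}, file_meta=file_meta, preamble=b"\0" * 128,
                     is_implicit_VR=False, is_little_endian=True)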
    ds.Rows, ds.Columns = size, size
    ds.SamplesPerPixel = 1
    ds.PhotometricInterpretation = "MONOCHROME2"
    ds.PixelRepresentation = 0
    ds.BitsAllocated = 16
    ds.BitsStored = 16
    ds.HighBit = 15
    ds.PixelData = pixels.tobytes()
    ds.save_as(path)

# 6 "normal" scans + 2 "anomalous" scans
for i in range(6):
    write_synthetic(os.path.join(tmp_dir, f"scan_{i:02d}.dcm"), mean_val=1000, std_val=150)
for i in range(6, 8):
    write_synthetic(os.path.join(tmp_dir, f"scan_{i:02d}.dcm"), mean_val=400, std_val=50)

records, matrix, labels, silhouette = run_qc(tmp_dir, n_clusters=2)

print(f"Files processed: {len(records)}")
print(f"Silhouette score: {silhouette:.3f}")
print("\nCluster assignments:")
for rec, label in zip(records, labels):
    print(f"  {rec.filename}  →  Group {label}  "
          f"(density={rec.avg_density:.0f}, contrast={rec.contrast:.0f})")


In [None]:
fig = plot_fleet_qc(records, labels, silhouette)
plt.show()
print("Scans in a different cluster may warrant investigation.")


## 6. What Production Looks Like

This prototype deliberately keeps things simple.  A production system would need:

| Gap | What's needed |
|---|---|
| UID remapping | Generate new, unlinkable UIDs for every study, series, and SOP instance |
| Pixel scrubbing | OCR or ML model to detect burned-in text annotations |
| Real upload | Authenticated HTTPS / S3 / DICOMweb over TLS |
| Compliance audit | HIPAA/GDPR specialist review of the de-identification pipeline |
| Containerisation | Docker image for consistent deployment on edge hardware |
| Validated segmentation | Model trained on annotated ground-truth data, not K-Means |
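
For the UID remapping row above, one possible approach is to derive each replacement UID from a keyed hash of the original, so the same input always maps to the same output while the mapping cannot be reproduced without the key. The sketch below uses only the standard library. The function name `remap_uid`, the key, and the example UID are hypothetical, and placing a 128-bit HMAC value under the UUID-based `2.25` root is an assumption to validate against your own conformance requirements.

```python
import hashlib
import hmac

def remap_uid(original_uid, secret_key):
    """Derive a replacement UID under the '2.25' root from an HMAC-SHA256 of the original."""
    digest = hmac.new(secret_key, original_uid.encode("ascii"), hashlib.sha256).digest()
    # First 128 bits as a decimal integer: '2.25.' plus at most 39 digits
    # stays within the 64-character UID limit.
    return "2.25." + str(int.from_bytes(digest[:16], "big"))

key = b"demo-key-not-for-production"      # hypothetical; a real key must stay secret
study_uid = "1.2.840.113619.2.55.3.1234"  # made-up example UID
new_uid = remap_uid(study_uid, key)
print(new_uid)
```

Deterministic remapping keeps the study, series, and instance relationships intact within one export, provided every UID passes through the same function with the same key.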

The `src/pipeline.py` module shows the exponential backoff pattern that
any real upload should use — the mock just needs to be swapped for a real HTTP call.

```python
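# Replace mock_upload with a real call, e.g.:
import requests

def real_upload(filename, url, auth_token, timeout=30):
    """Stream the file to url; raise requests.HTTPError on a 4xx/5xx response."""
    headers = {"Authorization": f"Bearer {auth_token}"}
    with open(filename, "rb") as f:
        # requests has no default timeout, so set one to avoid hanging on a stalled connection
        resp = requests.post(url, data=f, headers=headers, timeout=timeout)
    resp.raise_for_status()
    return True
```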
```
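
For readers who have not opened `src/pipeline.py`, the exponential backoff pattern can be sketched as below. This is a generic illustration rather than the module's actual code: the name `upload_with_backoff`, its parameters, and the injected `sleep` argument (used so the example runs without waiting) are assumptions.

```python
import random
import time

def upload_with_backoff(upload_fn, max_attempts=5, base_delay=1.0,
                        max_delay=30.0, sleep=time.sleep):
    """Call upload_fn() until it succeeds, doubling the wait after each failure."""
    for attempt in range(max_attempts):
        try:
            return upload_fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            delay = min(max_delay, base_delay * 2 ** attempt)
            # A little jitter keeps many units from retrying in lockstep.
            sleep(delay + random.uniform(0, delay * 0.1))

# Stand-in upload that fails twice before succeeding.
attempts = {"count": 0}

def flaky_upload():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("simulated network failure")
    return True

waits = []  # record the requested delays instead of actually sleeping
result = upload_with_backoff(flaky_upload, sleep=waits.append)
print(result, attempts["count"], len(waits))
```

In real use, the broad `except Exception` would normally be narrowed to transient errors such as `requests.ConnectionError` or `requests.Timeout`, so that a permanent failure like a rejected token is not retried.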
