# Day 3 - Patch Extraction Pipeline (128x128, stride=64)

---

## 1. Introduction & Reproducibility Notes
**Goal:** Extract 128x128 patches from normalized CT slices (Day 2 outputs), and produce a patch dataset with `patch_manifest.csv` and `patch_meta.json`.

**Why this matters:**
- Patch-level training increases the sample count and reduces GPU memory use per sample.
- The manifest and meta files ensure provenance: each patch knows its source slice, coordinates, and statistics.

**Assumptions from Day 2:**
- Each slice is `float32` and normalized to [0, 1].
- `train_manifest.csv` contains a `path` column (relative path).

**Reproducibility notes:**
- Use this notebook for exploration/debugging.
- Later migrate logic into a script `prepare_patches.py`.
- Keep config immutable (`frozen=True`) to prevent silent changes.
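
As a minimal sketch of what `frozen=True` provides, the snippet below uses a hypothetical `DemoConfig` stand-in (not the notebook's own class). It shows that in-place mutation raises `FrozenInstanceError`, and that `dataclasses.replace` is one way to derive a modified copy without touching the original.

```python
from dataclasses import dataclass, FrozenInstanceError, replace

@dataclass(frozen=True)
class DemoConfig:
    """Hypothetical stand-in for an immutable patch configuration."""
    patch_size: int = 128
    stride: int = 64

demo_cfg = DemoConfig()

try:
    demo_cfg.stride = 32            # attempted in-place mutation
    mutation_blocked = False
except FrozenInstanceError:
    mutation_blocked = True         # frozen dataclasses reject assignment

# To change a setting, build a new object instead of editing the old one.
demo_cfg_small = replace(demo_cfg, stride=32)

print(mutation_blocked, demo_cfg.stride, demo_cfg_small.stride)
```

The original object keeps `stride=64`, so any cell that already captured it cannot be affected by a later experiment.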

---

## 2. Configuration
**Purpose:** Define a single immutable configuration object for patch extraction.  
**Expected:** Only defines the `PatchConfig` dataclass; no output.

In [22]:
from dataclasses import dataclass
import numpy as np
import os
import pandas as pd
import math, json, time
from pathlib import Path
from typing import Optional

@dataclass(frozen=True)
class PatchConfig:
    """
    Configuration for grid-based patch extraction
    """
    patch_size: int = 128
    stride: int = 64
    pad_mode: str = "none"      #keep 'none' for reproducibility

---
## 3. Sliding Window Coordinates
**Purpose:** Compute top-left `(row, col)` coordinates of valid sliding windows.  
**Validation examples:**
- 512x512 → 49 coords (7x7 grid).
- 480x512 → 42 coords (6x7 grid).
- First few coords: `(0, 0), (0, 64), (0, 128)`.
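
The counts above follow from the per-axis formula `1 + (length - patch_size) // stride` used in this section's code. A small sketch of that arithmetic, with `grid_count` as a hypothetical helper name introduced only for illustration:

```python
def grid_count(length: int, patch_size: int = 128, stride: int = 64) -> int:
    """Number of full windows along one axis when no padding is applied."""
    if length < patch_size:
        return 0
    return 1 + (length - patch_size) // stride

for h, w in [(512, 512), (480, 512), (362, 362)]:
    n_rows, n_cols = grid_count(h), grid_count(w)
    print(f"{h}x{w} -> {n_rows} x {n_cols} = {n_rows * n_cols} patches")
```

For 480 rows this gives `1 + 352 // 64 = 6`, hence 6 x 7 = 42 coordinates for a 480x512 input.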


In [23]:
def sliding_window_coords(h: int, w: int, cfg: PatchConfig):
    """
    Return list of (row, col) top-left coordinates for valid patches.
    No padding; drops incomplete patches at the borders.
    """
    ps, st = cfg.patch_size, cfg.stride
    if h < ps or w < ps:
        return []
    
    n_rows = 1 + (h - ps) // st
    n_cols = 1 + (w - ps) // st
    rows = [r * st for r in range(n_rows)]
    cols = [c * st for c in range(n_cols)]
    return [(r, c) for r in rows for c in cols]

#quick smoke test
cfg = PatchConfig(patch_size=128, stride=64)
coords_A = sliding_window_coords(512, 512, cfg)     # expected 7*7 = 49
coords_B = sliding_window_coords(480, 512, cfg)     # expected 6*7 = 42
print("512x512 ->", len(coords_A), "coords (expected 49)")
print("480x512 ->", len(coords_B), "coords (expected 42)")
print("First 3 coords (A):", coords_A[:3])

512x512 -> 49 coords (expected 49)
480x512 -> 42 coords (expected 42)
First 3 coords (A): [(0, 0), (0, 64), (0, 128)]


---
## 4. Patch Extraction Core
**Purpose:** Extract patches from a 2D slice using coordinates.  
**Returns**
- `patches`: `(N, 128, 128)`, float32
- `coords`: list of `(row, col)`

**Validation:** a 512x512 array of zeros → `(49, 128, 128)` patches, min/max = 0.


In [25]:
def extract_patches_2d(img2d: np.ndarray, cfg: PatchConfig):
    """
    Extract (N, ps, ps) patches and their (row, col) coordinates from a 2D array.
    - img2d: ndarray (H, W), float32 preferred
    - returns: patches (N, ps, ps) float32, coords list[(row, col)]
    """
    assert img2d.ndim == 2, "img2d must be 2D"
    if img2d.dtype != np.float32:
        img2d = img2d.astype(np.float32, copy=False)
    
    H, W = img2d.shape
    coords = sliding_window_coords(H, W, cfg)
    ps = cfg.patch_size

    patches = np.stack([img2d[r:r+ps, c:c+ps] for (r, c) in coords], axis=0) if coords else \
                np.empty((0, ps, ps), dtype=np.float32)
    
    # Sanity checks
    assert patches.ndim == 3, "patches must be (N, ps, ps)"
    if patches.size > 0:
        assert patches.shape[1] == ps and patches.shape[2] == ps, "Wrong patch size"
        assert patches.dtype == np.float32

    return patches, coords

#Quick dry-run on zeros
cfg = PatchConfig(patch_size=128, stride=64)
imgA = np.zeros((512, 512), dtype=np.float32)
patchesA, coordsA = extract_patches_2d(imgA, cfg)
print("Extracted:", patchesA.shape, "expected (49, 128, 128)")
print("First patch min/max:", patchesA[0].min(), patchesA[0].max())

Extracted: (49, 128, 128) expected (49, 128, 128)
First patch min/max: 0.0 0.0
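
For comparison, the same grid can be produced without a Python loop. The sketch below is an alternative, not the notebook's method: it assumes NumPy 1.20 or newer (for `sliding_window_view`), and `extract_patches_view` is a hypothetical name. It checks the result against a plain slicing loop.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def extract_patches_view(img2d: np.ndarray, patch_size: int = 128, stride: int = 64) -> np.ndarray:
    """Hypothetical vectorized variant; returns (N, ps, ps) in row-major grid order."""
    h, w = img2d.shape
    if h < patch_size or w < patch_size:
        return np.empty((0, patch_size, patch_size), dtype=img2d.dtype)
    # All windows at every offset: shape (H-ps+1, W-ps+1, ps, ps), no data copied yet.
    windows = sliding_window_view(img2d, (patch_size, patch_size))
    # Keep only every stride-th window along both grid axes.
    grid = windows[::stride, ::stride]
    # Copy into a contiguous (N, ps, ps) array.
    return np.ascontiguousarray(grid).reshape(-1, patch_size, patch_size)

img_demo = np.arange(512 * 512, dtype=np.float32).reshape(512, 512)
loop_patches = np.stack(
    [img_demo[r:r + 128, c:c + 128] for r in range(0, 385, 64) for c in range(0, 385, 64)]
)
view_patches = extract_patches_view(img_demo)
print(view_patches.shape, np.array_equal(loop_patches, view_patches))
```

The loop version is easier to read and pairs naturally with the coordinate list; the view-based version mainly matters if extraction time becomes noticeable on large datasets.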


---
## 5. Synthetic Tests
**Purpose:** Validate patching on random arrays.  
**Why:** Ensures variability and avoids bugs where patches are identical.  
**Expected:** `(49,128,128)` patches, different mean/std per patch.

In [26]:
rng = np.random.default_rng(0)
imgR = rng.normal(loc=0.0, scale=1.0, size=(512, 512)).astype(np.float32)

cfg = PatchConfig(patch_size=128, stride=64)
pR, cR = extract_patches_2d(imgR, cfg)
print("Random image ->", pR.shape, "patches")

#compare a few patch stats to ensure they differ
for i in range(3):
    print(f"patch[{i}] mean/std:", float(pR[i].mean()), float(pR[i].std()))

Random image -> (49, 128, 128) patches
patch[0] mean/std: 0.005805877037346363 0.9954078793525696
patch[1] mean/std: 0.0017823539674282074 1.0042778253555298
patch[2] mean/std: 0.0008969246409833431 1.0121972560882568
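
Differing means show the patches are not identical, but a stricter synthetic check is possible. With `stride=64` and `patch_size=128`, horizontally adjacent patches share a 64-column overlap, so the right half of one must equal the left half of the next. A self-contained sketch using direct slicing (the variable names are illustrative):

```python
import numpy as np

ps, st = 128, 64
rng_demo = np.random.default_rng(0)
img_rand = rng_demo.normal(size=(512, 512)).astype(np.float32)

p0 = img_rand[0:ps, 0:ps]           # patch at (row=0, col=0)
p1 = img_rand[0:ps, st:st + ps]     # patch at (row=0, col=64)

# Columns 64..127 of the image appear in both patches.
overlap_matches = bool(np.array_equal(p0[:, st:], p1[:, :ps - st]))
patches_differ = not np.array_equal(p0, p1)

print("overlap matches:", overlap_matches, "| patches differ:", patches_differ)
```

The same comparison should hold for the first two patches returned by the extraction function on a random image, since they sit at `(0, 0)` and `(0, 64)`.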


---
## 6. Connect to Day-2 Data
**Purpose:** Load `train_manifest.csv`, resolve relative → absolute paths, test on real slices.  
**Expected:**
- Paths exist (`exist? True`).
- Real slice (e.g., 362x362) → ~16 patches per slice
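
Because no padding is applied, a 362x362 slice is not fully covered: the last window ends before the image border. The sketch below works out how many pixels per axis fall outside the patch grid; `covered_extent` is a hypothetical helper used only for this illustration.

```python
def covered_extent(length: int, patch_size: int = 128, stride: int = 64) -> int:
    """Pixels along one axis that lie inside at least one full window."""
    if length < patch_size:
        return 0
    n = 1 + (length - patch_size) // stride
    return (n - 1) * stride + patch_size

for length in (512, 480, 362):
    covered = covered_extent(length)
    print(f"length={length}: covered={covered}, dropped={length - covered}")
```

For 362 pixels the grid covers 320, so a 42-pixel strip along the bottom and right edges never appears in any patch.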

In [27]:
TRAIN_DIR = r'D:\cosc_4372\projects\lowdose_ct_project\data\prepared\lodopab\train'


# ---------- A) Locate and read manifest (RELATIVE paths) ----------
candidates = ["train_manifest.csv", "manifest.csv"]
manifest_path: Optional[str] = next(
    (os.path.join(TRAIN_DIR, n) for n in candidates if os.path.exists(os.path.join(TRAIN_DIR, n))),
    None
)
assert manifest_path is not None, f"No manifest found in: {TRAIN_DIR}"
print("Using manifest:", manifest_path)

df = pd.read_csv(manifest_path)
print("Columns:", list(df.columns), "| Num rows", len(df))
assert "path" in df.columns, "Manifest must contain a 'path' column (relative to data_root)."

# Infer data_root: .../data/prepared/lodopab/train  → hop up to .../data
data_root = os.path.abspath(os.path.join(TRAIN_DIR, "..", "..", ".."))   #hop up to ...\data
print("data_root", data_root)

def to_abs(p_rel: str) -> str:
    """
    Convert a repository-relative path (e.g., 'prepared/lodopab/train/train_000000.npy')
    into a local absolute path using `data_root`.

    NOTE:
    - This is ONLY for local loading/testing in the notebook.
    - Do NOT persist absolute paths into any public CSV/JSON.
    """
    p_rel_norm = os.path.normpath(p_rel)
    return os.path.normpath(os.path.join(data_root, p_rel_norm))

# ---------- B) Local-only resolution for quick sanity tests ----------
df["_abs_path_local"] = df["path"].apply(to_abs)

print(df[["path", "_abs_path_local"]].head(3).to_string(index=False))

# ---------- C) Quick sanity: load a couple slices and extract patches ----------
cfg = PatchConfig(patch_size=128, stride=64)

n_test = min(2, len(df))
for i in range(n_test):
    p_local = df.loc[i, "_abs_path_local"]
    exists = os.path.exists(p_local)
    print(f"[{i}] exist? {exists} -> {p_local}")
    assert exists, f"Local file not found: {p_local}"

    x = np.load(p_local).astype(np.float32)
    patches, coords = extract_patches_2d(x, cfg)
    print(f"    slice={x.shape}, patches={patches.shape[0]}, first={coords[0] if coords else None}")
    if patches.size > 0:
        print(f"    patch[0] min/max={float(patches[0].min())}/{float(patches[0].max())}, dtype={patches.dtype}")



Using manifest: D:\cosc_4372\projects\lowdose_ct_project\data\prepared\lodopab\train\train_manifest.csv
Columns: ['index', 'path', 'min', 'max'] | Num rows 10
data_root D:\cosc_4372\projects\lowdose_ct_project\data
                                   path                                                                       _abs_path_local
prepared/lodopab/train/train_000000.npy D:\cosc_4372\projects\lowdose_ct_project\data\prepared\lodopab\train\train_000000.npy
prepared/lodopab/train/train_000001.npy D:\cosc_4372\projects\lowdose_ct_project\data\prepared\lodopab\train\train_000001.npy
prepared/lodopab/train/train_000002.npy D:\cosc_4372\projects\lowdose_ct_project\data\prepared\lodopab\train\train_000002.npy
[0] exist? True -> D:\cosc_4372\projects\lowdose_ct_project\data\prepared\lodopab\train\train_000000.npy
    slice=(362, 362), patches=16, first=(0, 0)
    patch[0] min/max=0.0/0.369978129863739, dtype=float32
[1] exist? True -> D:\cosc_4372\projects\lowdose_ct_project\data\prepar

---
## 7. Estimation & Naming Preview
**Purpose:** Count expected patches and preview filename scheme.  
**Naming scheme:** `{split}_{sliceIndex:06d}_{row:03d}_{col:03d}.npy`  
Example: `train_000123_064_192.npy`.  
**Expected:** ~160 patches for 10 slices, plus filename previews.
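
A quick way to sanity-check the naming scheme is to round-trip it: format an ID, then parse it back into its components. The helpers below (`make_patch_id`, `parse_patch_id`) are hypothetical names for this sketch and assume the split name contains only lowercase letters.

```python
import re

def make_patch_id(split: str, slice_index: int, row: int, col: int) -> str:
    """Format an ID following {split}_{sliceIndex:06d}_{row:03d}_{col:03d}."""
    return f"{split}_{slice_index:06d}_{row:03d}_{col:03d}"

# Zero-padding sets a minimum width, so larger values may use more digits.
_PATCH_ID_RE = re.compile(r"^(?P<split>[a-z]+)_(?P<slice>\d{6,})_(?P<row>\d{3,})_(?P<col>\d{3,})$")

def parse_patch_id(patch_id: str):
    """Return (split, slice_index, row, col) or raise ValueError."""
    m = _PATCH_ID_RE.match(patch_id)
    if m is None:
        raise ValueError(f"Not a valid patch id: {patch_id!r}")
    return m["split"], int(m["slice"]), int(m["row"]), int(m["col"])

pid = make_patch_id("train", 123, 64, 192)
print(pid, "->", parse_patch_id(pid))
```

Fixed-width zero padding also keeps filenames in grid order under a plain lexicographic sort, as long as indices stay within the padded width.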

In [28]:
PATCH_ROOT = Path(r"D:\cosc_4372\projects\lowdose_ct_project\data\patches\lodopab\train")
PATCH_ROOT.mkdir(parents=True, exist_ok=True)

cfg = PatchConfig(patch_size=128, stride=64)

total_patches = 0
example_names = []

for i in range(len(df)):
    p = df.loc[i, "_abs_path_local"]
    x = np.load(p).astype(np.float32)
    patches, coords = extract_patches_2d(x, cfg)
    total_patches += len(patches)

    # preview first 2 filenames per slice (just strings; we don't save here)
    for j, (r, c) in enumerate(coords[:2]):
        patch_id = f"train_{i:06d}_{r:03d}_{c:03d}"
        fname = PATCH_ROOT / f"{patch_id}.npy"
        if len(example_names) < 4:
            example_names.append(str(fname))


print("Will save to:", PATCH_ROOT)
print("Total slices:", len(df))
print("Estimated total patches:", total_patches)
print("Example filenames:")
for s in example_names:
    print(" ", s)

Will save to: D:\cosc_4372\projects\lowdose_ct_project\data\patches\lodopab\train
Total slices: 10
Estimated total patches: 160
Example filenames:
  D:\cosc_4372\projects\lowdose_ct_project\data\patches\lodopab\train\train_000000_000_000.npy
  D:\cosc_4372\projects\lowdose_ct_project\data\patches\lodopab\train\train_000000_000_064.npy
  D:\cosc_4372\projects\lowdose_ct_project\data\patches\lodopab\train\train_000001_000_000.npy
  D:\cosc_4372\projects\lowdose_ct_project\data\patches\lodopab\train\train_000001_000_064.npy


---
## 8. Save Patches + Build Manifest/Meta
**Purpose:** Save all patches to disk, build manifest and meta for provenance.  

**Manifest fields:**
- patch_id, slice_index, src_path, row, col, patch_path
- min, max, mean, std, dtype  

**Meta fields:**
- num_slices, num_patches, patch_size, stride, source_manifest, patch_dir, created  

**Expected:** ~160 `.npy` files, plus `patch_manifest.csv` and `patch_meta.json`.

In [29]:
PATCH_ROOT = Path(r"D:\cosc_4372\projects\lowdose_ct_project\data\patches\lodopab\train")
PATCH_ROOT.mkdir(parents=True, exist_ok=True)

manifest_rows = []
start = time.time()

for i in range(len(df)):
    # Local load using absolute path (NOT saved to CSV)
    src_abs = df.loc[i, "_abs_path_local"]
    x = np.load(src_abs).astype(np.float32)

    patches, coords = extract_patches_2d(x, cfg)

    for (patch_arr, (r, c)) in zip(patches, coords):
        patch_id = f"train_{i:06d}_{r:03d}_{c:03d}"

        # Absolute path for local write
        patch_abs = PATCH_ROOT / f"{patch_id}.npy"
        np.save(patch_abs, patch_arr)

        # ---- Save RELATIVE paths in manifest 
        src_rel = os.path.relpath(src_abs, data_root)
        patch_rel = os.path.relpath(patch_abs, data_root)

        manifest_rows.append({
            "patch_id": patch_id,
            "slice_index": i,
            "src_path": src_rel,
            "row": r,
            "col": c,
            "patch_path": str(patch_rel),
            "min": float(patch_arr.min()),
            "max": float(patch_arr.max()),
            "mean": float(patch_arr.mean()),
            "std": float(patch_arr.std()),
            "dtype": str(patch_arr.dtype)
        })
    
print(f"Save {len(manifest_rows)} patches to {PATCH_ROOT} in {time.time()-start:.2f}s")

# Save manifest (CSV) - contains RELATIVE paths only
manifest_csv = PATCH_ROOT / "patch_manifest.csv"
pd.DataFrame(manifest_rows).to_csv(manifest_csv, index=False)
print("Wrote manifest:", manifest_csv)

# Save meta.json - keep relative pointers for portability
source_manifest_rel = os.path.relpath(manifest_path, data_root)
path_dir_rel = os.path.relpath(PATCH_ROOT, data_root)

meta = {
    "num_slices": int(len(df)),
    "num_patches": int(len(manifest_rows)),
    "patch_size": cfg.patch_size,
    "stride": cfg.stride,
    "source_manifest": source_manifest_rel,
    "patch_dir": path_dir_rel,
    "created": time.strftime("%Y-%m-%d %H:%M:%S"),
}
with open(PATCH_ROOT / "patch_meta.json", "w") as f:
    json.dump(meta, f, indent=2)

print("Wrote meta.json")

Save 160 patches to D:\cosc_4372\projects\lowdose_ct_project\data\patches\lodopab\train in 0.07s
Wrote manifest: D:\cosc_4372\projects\lowdose_ct_project\data\patches\lodopab\train\patch_manifest.csv
Wrote meta.json
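
One portability detail: `os.path.relpath` uses the host's separator, so on Windows the stored relative paths contain backslashes (visible in the meta summary printed in Section 9), whereas the Day-2 manifest uses forward slashes. If the dataset may be read on another OS, one option is to normalize separators before writing. The sketch below is an assumption about how that could be done; `to_rel_posix` is a hypothetical helper, not part of the notebook.

```python
import os
import tempfile
from pathlib import Path

def to_rel_posix(abs_path, root) -> str:
    """Hypothetical helper: path relative to root, always with forward slashes."""
    return Path(os.path.relpath(abs_path, root)).as_posix()

# Demo with a throwaway directory; the files themselves do not need to exist.
with tempfile.TemporaryDirectory() as tmp:
    demo_root = Path(tmp) / "data"
    demo_patch = demo_root / "patches" / "lodopab" / "train" / "train_000000_000_000.npy"
    rel_demo = to_rel_posix(demo_patch, demo_root)

print(rel_demo)
```

Forward-slash paths load correctly on Windows as well, so storing them this way costs nothing locally.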


--- 
## 9. Unit tests  
**Purpose:** Validate patch dataset integrity.
**Checks:**
- Counts match between manifest and meta.  
- Dtype = float32.  
- Values ∈ [0, 1].
- Shapes = (128,128).  

**Expected:** Output `Unit tests passed: shapes, dtype, range all valid`.

In [31]:
patch_manifest = PATCH_ROOT / "patch_manifest.csv"
patch_meta = PATCH_ROOT / "patch_meta.json"

df_p = pd.read_csv(patch_manifest)
with open(patch_meta, "r") as f:
    meta_p = json.load(f)

print("Manifest rows:", len(df_p))
print("Meta summary:", meta_p)

# --- Test ---
# 1) Count check
assert len(df_p) == meta_p["num_patches"], "Mismatch: manifest vs meta.json count"

# 2) Dtype check
assert df_p["dtype"].nunique() == 1 and df_p["dtype"].iloc[0] == "float32", "All patches must be float32"

# 3) Range check
assert (df_p["min"] >= 0).all(), "Patch min below 0!"
assert (df_p["max"] <= 1).all(), "Patch max above 1!"

# 4) Patch schema sanity
# Reject leading slashes/backslashes and Windows drive prefixes such as "D:"
assert df_p["patch_path"].str.contains(r"^(?![A-Za-z]:)[^\\/]").all(), "patch_path should be RELATIVE"
assert df_p["src_path"].str.contains(r"^(?![A-Za-z]:)[^\\/]").all(), "src_path should be RELATIVE"

# 5) Shape check on a few random samples (resolve to ABS using data_root)
shape_ok = True

# Randomly load 3 patches and check shape
sample_rows = df_p.sample(n=min(3, len(df_p)), random_state=42)
for _, row in sample_rows.iterrows():
    patch_abs = os.path.normpath(os.path.join(data_root, row["patch_path"]))
    x = np.load(patch_abs)
    if x.shape != (meta_p["patch_size"], meta_p["patch_size"]):
        shape_ok=False
    print(f"Check {row['patch_id']} -> shape={x.shape}, min={x.min():.4f}, max={x.max():.4f}")

assert shape_ok, "Some patch shapes not correct"
print("Unit tests passed: shapes, dtype, range all valid")

Manifest rows: 160
Meta summary: {'num_slices': 10, 'num_patches': 160, 'patch_size': 128, 'stride': 64, 'source_manifest': 'prepared\\lodopab\\train\\train_manifest.csv', 'patch_dir': 'patches\\lodopab\\train', 'created': '2025-10-03 23:43:24'}
Check train_000006_128_064 -> shape=(128, 128), min=0.0000, max=0.2908
Check train_000006_192_000 -> shape=(128, 128), min=0.0000, max=0.3913
Check train_000008_192_064 -> shape=(128, 128), min=0.0000, max=0.3979
Unit tests passed: shapes, dtype, range all valid
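
The checks above confirm shape, dtype, and range, but not that a patch actually holds the pixels its manifest row claims. A possible extra check, sketched here on synthetic data, compares a patch with the crop of its source slice at the recorded `(row, col)`. The name `patch_matches_source` is hypothetical.

```python
import numpy as np

def patch_matches_source(patch, source, row, col, patch_size=128) -> bool:
    """True if patch is float32, (ps, ps), and equals source[row:row+ps, col:col+ps]."""
    expected = source[row:row + patch_size, col:col + patch_size]
    return bool(
        patch.dtype == np.float32
        and patch.shape == (patch_size, patch_size)
        and np.array_equal(patch, expected)
    )

# Synthetic stand-in for a Day-2 slice (values in [0, 1), float32).
source_demo = np.random.default_rng(0).random((362, 362), dtype=np.float32)
patch_demo = source_demo[64:192, 128:256].copy()

ok_right_coords = patch_matches_source(patch_demo, source_demo, 64, 128)
ok_wrong_coords = patch_matches_source(patch_demo, source_demo, 0, 0)
print(ok_right_coords, ok_wrong_coords)
```

Applied to real data, the `row`, `col`, `src_path`, and `patch_path` columns of the manifest supply the arguments.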


---
## 10. Conclusion & Next Steps  
**Day 3 results:**
- Patch extraction pipeline implemented (valid crops, 128x128, stride 64).
- Train split processed with manifest/meta.
- Unit tests confirm dataset integrity (counts, dtype, value range, patch shapes).  

**Next steps:**
- Repeat for val/test splits.
- Package into `prepare_patches.py` CLI script.
- Optional: add patch filtering (e.g., remove low-variance patches).
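
For the optional filtering step, a rough sketch of a standard-deviation threshold is shown below. The function name `keep_mask_by_std` and the default `min_std=0.01` are assumptions for illustration; a suitable threshold would need to be chosen by inspecting the real `std` distribution.

```python
import numpy as np

def keep_mask_by_std(patches: np.ndarray, min_std: float = 0.01) -> np.ndarray:
    """Boolean mask over (N, ps, ps) patches; True where per-patch std >= min_std."""
    return patches.std(axis=(1, 2)) >= min_std

flat = np.zeros((2, 128, 128), dtype=np.float32)                       # constant, std = 0
textured = np.random.default_rng(0).random((3, 128, 128), dtype=np.float32)  # uniform noise
stack = np.concatenate([flat, textured])

mask = keep_mask_by_std(stack)
print(int(mask.sum()), "of", len(stack), "patches kept")
```

Since the patch manifest already stores a `std` column, the same filter could be applied to the manifest rows directly, without reloading any `.npy` files.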

---  

## 11. Re-run instructions
- Prereqs: Python 3.10+, NumPy, Pandas; Day-2 outputs ready.
- Run notebook top to bottom.
- Change only `PatchConfig` in Section 2.
- Expected runtime: seconds for 10 slices; linear scaling with dataset size.

---  

## 12. Common Pitfalls
- **Path not found:** manifest paths are relative to `data_root`; resolve them to absolute paths before loading (see `to_abs` in Section 6).
- **Values outside [0, 1]:** check Day-2 normalization.
- **Wrong patch count:** verify image size, patch_size, stride.
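
For the path pitfall, it can help to list every manifest entry that fails to resolve before starting a long run. The sketch below uses hypothetical helpers (`resolve_manifest_path`, `find_missing`) and assumes manifest paths never contain a literal backslash inside a file name, so both separators can be treated alike.

```python
import tempfile
from pathlib import Path, PurePosixPath

def resolve_manifest_path(rel_path: str, data_root) -> Path:
    """Join a manifest-relative path onto data_root, accepting either separator."""
    parts = PurePosixPath(rel_path.replace("\\", "/")).parts
    return Path(data_root).joinpath(*parts)

def find_missing(rel_paths, data_root):
    """Return the relative paths whose resolved file does not exist."""
    return [p for p in rel_paths if not resolve_manifest_path(p, data_root).exists()]

# Demo on a throwaway directory with one real file and one missing entry.
with tempfile.TemporaryDirectory() as tmp:
    demo_data_root = Path(tmp)
    (demo_data_root / "prepared" / "train").mkdir(parents=True)
    (demo_data_root / "prepared" / "train" / "a.npy").touch()
    missing = find_missing(
        ["prepared/train/a.npy", "prepared\\train\\a.npy", "prepared/train/b.npy"],
        demo_data_root,
    )

print("missing:", missing)
```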