
# DOLPHIN — Quickstart Demo Notebook (NDD Diagnostic)

This notebook helps you set up and run the forked repository **`DOLPHIN-NDD_diagnostic`** (fork of SCUT-DLVCLab/DOLPHIN).  
It guides you through environment setup, minimal preprocessing, quick evaluation, and feature/embedding extraction.

> **Tested context:** Windows 10/11 + Conda + Python 3.8–3.11.  
> **Repo:** `https://github.com/RichardLadislav/DOLPHIN-NDD_diagnostic` (fork of SCUT‑DLVCLab/DOLPHIN)

---



## 1) Environment setup

You can use either **Conda** or **pip**. The original repo pins Python 3.8.16; Python 3.10/3.11 may also work if you adjust dependencies, but this is not guaranteed.  
If you hit **NumPy 2.0** compatibility errors, prefer **NumPy < 2.0** or rebuild affected wheels.

### Option A — Conda (recommended on Windows)
```bash
# Create env (choose one Python version that works for you)
conda create -n dolphin python=3.10 -y
conda activate dolphin

# Install PyTorch (choose your CUDA version; see https://pytorch.org/get-started/locally/)
# Example: CUDA 12.x build
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

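# Core deps from requirements (run from the repo root; see section 2)
pip install -r requirements.txt --no-deps
# Quote the specifier so the shell does not treat "<2" as a redirect;
# pinning numpy<2 avoids ABI mismatches in some libs
pip install "numpy<2" matplotlib termcolor Pillow opencv-python timm pytorch-wavelets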
```

### Option B — Pure pip (virtualenv)
```bash
python -m venv .venv
# Windows:
.\.venv\Scripts\activate
# Linux/macOS:
source .venv/bin/activate

pip install --upgrade pip
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
pip install -r requirements.txt --no-deps
# Quote the specifier so the shell does not treat "<2" as a redirect
pip install "numpy<2" matplotlib termcolor Pillow opencv-python timm pytorch-wavelets
```
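
After installing, a quick version report can confirm that the pins took effect. The sketch below uses only the standard library (`importlib.metadata`). The helper names `installed_versions` and `is_pre_numpy2` are illustrative and not part of the repo, and the package list simply mirrors the install commands above.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    report = {}
    for name in packages:
        try:
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = None
    return report


def is_pre_numpy2(version):
    """True when a version string such as '1.26.4' has a major version below 2."""
    return int(version.split(".")[0]) < 2


# Distribution names taken from the pip commands above.
PACKAGES = ["torch", "numpy", "matplotlib", "termcolor", "Pillow",
            "opencv-python", "timm", "pytorch-wavelets"]

for name, version in installed_versions(PACKAGES).items():
    print(f"{name:20s} {version or 'NOT INSTALLED'}")

numpy_version = installed_versions(["numpy"])["numpy"]
if numpy_version is not None and not is_pre_numpy2(numpy_version):
    print("NumPy >= 2 detected; consider pinning numpy<2 as described above.")
```

Run it inside the activated environment; a `NOT INSTALLED` entry usually means the corresponding `pip install` line was skipped or failed.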



## 2) Clone the repository & set working directory

> If you've already cloned your fork, just update `REPO_DIR` below to match your path.

```bash
# Choose a parent folder
cd C:\dev

# Clone your fork (replace with your URL if different)
git clone https://github.com/RichardLadislav/DOLPHIN-NDD_diagnostic.git
cd DOLPHIN-NDD_diagnostic
```
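
Once the clone finishes, it can help to confirm that the folder really is a complete checkout before going further. The sketch below looks for the scripts this notebook calls later (`preprocess.py`, `divide.py`, `test.py`). That `model` and `dataset` are single files named `model.py` and `dataset.py` is an assumption based on the imports used further down, and `missing_repo_files` is an illustrative helper, not a repo function.

```python
from pathlib import Path

# File names inferred from the commands and imports in this notebook (assumption).
EXPECTED_FILES = ("preprocess.py", "divide.py", "test.py", "model.py", "dataset.py")


def missing_repo_files(repo_dir, expected=EXPECTED_FILES):
    """Return the expected file names that are not present directly under repo_dir."""
    repo_dir = Path(repo_dir)
    return [name for name in expected if not (repo_dir / name).is_file()]


# Usage: "." works when run from the repo root; otherwise pass your REPO_DIR.
missing = missing_repo_files(".")
print("Missing files:", missing if missing else "none")
```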


In [None]:

# Configure Python-side path to the repo so you can import modules directly in this notebook.
import sys, pathlib
REPO_DIR = pathlib.Path(r"C:\dev\DOLPHIN-NDD_diagnostic").resolve()  # <- CHANGE to your local path
if REPO_DIR.exists():
    sys.path.insert(0, str(REPO_DIR))
    print("Repo path added to sys.path:", REPO_DIR)
else:
    print("WARNING: REPO_DIR does not exist. Please edit the path above.")



## 3) Quick GPU / Torch check


In [None]:

import torch
print("Torch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("CUDA device:", torch.cuda.get_device_name(0))



## 4) Data layout and sanity check

Datasets used in the original paper/repo:
- **CASIA-OLHWDB2**
- **DCOH-E**
- **SCUT-COUCH2009**

Create the following structure and place the raw data accordingly:
```
DOLPHIN-NDD_diagnostic/   # repo root (REPO_DIR)
├─ data-raw/
│  ├─ OLHWDB2/           # raw .txt/.dat files
│  ├─ DCOH_E/
│  └─ SCUT_COUCH2009/
└─ data/
```

> If you only want a **tiny dry-run**, put a **very small** subset (e.g., a few writers with a couple of files each) into `data-raw/...`.
> This speeds up preprocessing and avoids large downloads.
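
One way to build such a dry-run subset is to copy a handful of files per dataset folder. The sketch below assumes nothing about the raw file formats. `copy_small_subset` and the source path in the usage note are hypothetical, and the per-writer directory layout of each dataset may call for a different selection rule.

```python
import shutil
from pathlib import Path


def copy_small_subset(src_dir, dst_dir, max_files=10):
    """Copy the first `max_files` files (sorted by path) from src_dir into
    dst_dir, preserving sub-folders. Returns the copied destination paths."""
    src_dir, dst_dir = Path(src_dir), Path(dst_dir)
    # Materialise the file list before copying anything.
    files = sorted(p for p in src_dir.rglob("*") if p.is_file())[:max_files]
    copied = []
    for src in files:
        dst = dst_dir / src.relative_to(src_dir)
        dst.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dst)
        copied.append(dst)
    return copied
```

For example, `copy_small_subset(r"D:\full-data\OLHWDB2", REPO_DIR / "data-raw" / "OLHWDB2", max_files=10)`, where `D:\full-data` is a placeholder for wherever the full download lives.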


In [None]:

from pathlib import Path

root = REPO_DIR
print("Expecting folders under:", root)

print("\nRaw data present?")
for d in ["data-raw/OLHWDB2", "data-raw/DCOH_E", "data-raw/SCUT_COUCH2009"]:
    p = root / d
    print(f"{d:30s}", "OK" if p.exists() else "MISSING")

print("\nProcessed data folder:")
print((root / "data").resolve(), "exists?" , (root / "data").exists())



## 5) Preprocess datasets and create the **OLIWER** split

Run the repo's preprocessing script(s), either from a terminal or from within this notebook.
Large datasets take time, so start with a tiny subset for verification.

### CLI (Terminal)
```bash
# From the repo root (adjust interp if you want to resample the trajectories)
python preprocess.py --src_root ./data-raw/OLHWDB2 --interp 4
python preprocess.py --src_root ./data-raw/DCOH_E --interp 4
python preprocess.py --src_root ./data-raw/SCUT_COUCH2009 --interp 4

# Merge and split into OLIWER (train/test)
python divide.py --data_root ./data --out_root ./data/OLIWER --train_ratio 0.8 --seed 42

# Optional: verify PKLs
python verify_pkls.py --root ./data/OLIWER
```

### From notebook (subprocess)
> Uncomment the lines below to run directly here.


In [None]:

# import subprocess, sys
# cmds = [
#     [sys.executable, "preprocess.py", "--src_root", "./data-raw/OLHWDB2", "--interp", "4"],
#     [sys.executable, "preprocess.py", "--src_root", "./data-raw/DCOH_E", "--interp", "4"],
#     [sys.executable, "preprocess.py", "--src_root", "./data-raw/SCUT_COUCH2009", "--interp", "4"],
#     [sys.executable, "divide.py", "--data_root", "./data", "--out_root", "./data/OLIWER", "--train_ratio", "0.8", "--seed", "42"],
#     [sys.executable, "verify_pkls.py", "--root", "./data/OLIWER"],
# ]
# for c in cmds:
#     print("Running:", " ".join(c))
#     subprocess.run(c, cwd=str(REPO_DIR))
print("Preprocessing commands prepared (commented out). Run them in your terminal for full datasets.")
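
If you do run the steps from the notebook, it is usually safer to stop at the first failing command than to continue with a half-built `./data` folder. A minimal sketch, assuming command lists shaped like `cmds` in the cell above; `run_steps` is an illustrative helper, not part of the repo, and the demo commands are harmless stand-ins.

```python
import subprocess
import sys


def run_steps(cmds, cwd=None):
    """Run commands in order and stop at the first non-zero exit code.

    Returns the number of commands that completed successfully."""
    done = 0
    for cmd in cmds:
        print("Running:", " ".join(cmd))
        result = subprocess.run(cmd, cwd=cwd)
        if result.returncode != 0:
            print(f"Step failed with exit code {result.returncode}; stopping.")
            break
        done += 1
    return done


# Self-contained demonstration: the second command fails,
# so the third one is never started.
demo = [
    [sys.executable, "-c", "print('step 1 ok')"],
    [sys.executable, "-c", "import sys; sys.exit(3)"],
    [sys.executable, "-c", "print('never reached')"],
]
completed = run_steps(demo)
print("Completed steps:", completed)
```

With the real pipeline you would call `run_steps(cmds, cwd=str(REPO_DIR))` and compare the return value with `len(cmds)`.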



## 6) Load pretrained weights & run a tiny evaluation

Place your weights (e.g., `model.pth`) under `./weights/` in the repo.  
Then run `test.py` with a tiny gallery/query to confirm everything works.

> **Security tip from PyTorch:** If you see a warning about `torch.load(..., weights_only=False)`, prefer `weights_only=True` whenever the checkpoint holds only a state dict (tensors and plain containers); otherwise load only files from a source you trust.

### CLI example
```bash
python test.py --weights ./weights/model.pth --gallery_root ./data/OLIWER/test-tf.pkl --query_root ./data/OLIWER/test-tf.pkl --topk 5
```
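
Regarding the `weights_only` tip above, loading a checkpoint that way typically looks like the sketch below (the `weights_only` argument requires PyTorch 1.13 or newer). It uses a stand-in `torch.nn.Linear` module and a temporary file so that it runs on its own. How `test.py` actually loads `model.pth`, and whether that file stores a bare state dict or a wrapper dictionary, is not confirmed here and should be checked against the repo.

```python
import os
import tempfile

import torch
from torch import nn

# Stand-in model; in practice this would be the DOLPHIN model instance.
model = nn.Linear(4, 2)

# Save a plain state dict to a temporary checkpoint file.
ckpt_path = os.path.join(tempfile.mkdtemp(), "model.pth")
torch.save(model.state_dict(), ckpt_path)

# weights_only=True restricts unpickling to tensors and basic containers,
# which avoids executing arbitrary code from an untrusted file.
state = torch.load(ckpt_path, map_location="cpu", weights_only=True)

# strict=False reports missing or unexpected keys instead of raising
# (shape mismatches still raise an error).
fresh = nn.Linear(4, 2)
result = fresh.load_state_dict(state, strict=False)
print("Missing keys:", result.missing_keys)
print("Unexpected keys:", result.unexpected_keys)
```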


In [None]:

# Example: call test.py programmatically (requires processed data & weights).
import subprocess, sys, pathlib

weights = REPO_DIR / "weights" / "model.pth"        # <- put your file here
gallery = REPO_DIR / "data" / "OLIWER" / "test-tf.pkl"
if weights.exists() and gallery.exists():
    cmd = [sys.executable, "test.py", "--weights", str(weights), "--gallery_root", str(gallery), "--query_root", str(gallery), "--topk", "5"]
    print("Running:", " ".join(cmd))
    result = subprocess.run(cmd, cwd=str(REPO_DIR))
    print("test.py exit code:", result.returncode)
else:
    print("Weights or gallery PKL missing. Please add them first.")



## 7) Extract embeddings for custom analysis

Below is a minimal example showing how to load the model class and extract embeddings (`x`) and logits (`y`) from the **Head**.
Adapt paths to your PKL dataset. For a quick check, use a tiny sample.


In [None]:
## Version 1.0 - simple forward pass on a tiny batch from PKL

import pickle, torch
import numpy as np
from pathlib import Path
from joblib import load
# Local imports from the repo
try:
    from model import DOLPHIN
    from dataset import Writing, collate_fn
    print("Imported repo modules successfully.")
except Exception as e:
    print("Import error:", e)
    print("Make sure REPO_DIR is correct and repo is on sys.path.")

# --- Config ---
PKL_PATH = REPO_DIR / "data" / "OLIWER" / "test-tf.pkl"   # <- change to a small PKL for a quick check
NUM_CLASSES = 1000                                        # placeholder; not used if we only take embeddings
BATCH_SIZE = 8

if PKL_PATH.exists():
    # Load a tiny subset
    data = load(PKL_PATH)  # joblib.load opens and closes the file itself
    # 'data' is typically a dict: {writer_id: [np.array(T, 3), ...]}
    # Flatten into list of (writer_id, array)
    samples = []
    for k, arrs in data.items():
        for a in arrs[:2]:   # take at most 2 per writer to keep tiny
            samples.append((k, a))
            if len(samples) >= 32:
                break
        if len(samples) >= 32:
            break

    # Build torch tensors (x, y, p): x=(T,3) ~ [x, y, p], y=label, p=valid_mask (optional)
    def to_tensor(sample):
        lbl, arr = sample
        arr = np.asarray(arr, dtype=np.float32)
        x = torch.tensor(arr.T, dtype=torch.float32)  # shape (3, T)
        y = torch.tensor(lbl, dtype=torch.long)
        p = torch.ones(x.shape[-1], dtype=torch.float32)
        return x, y, p

    tensor_samples = [to_tensor(s) for s in samples]

    # Collate (repo provides a collate_fn)
    batch = collate_fn(tensor_samples)
    xs, ys, ps, lens = batch  # shapes: [B, 3, T], [B], [B, T], [B]

    # Create model (dims per paper/code)
    model = DOLPHIN(d_in=3, num_classes=NUM_CLASSES).eval()
    with torch.no_grad():
        # Forward: same call as in the next cell, model(xs, lens), which
        # returns the embedding, the logits, and a third output (f3 there)
        emb, logits, _ = model(xs, lens)
    print("Embeddings shape:", emb.shape)   # (B, d_hidden) per Head
    print("Logits shape:", logits.shape)

else:
    print("PKL not found at:", PKL_PATH)
    print("Please run preprocessing and divide steps first.")


In [None]:
import torch, random, numpy as np
from torch.utils.data import DataLoader, Subset
from joblib import load
from tqdm import tqdm   # <-- progress bar
from model import DOLPHIN
from dataset import Writing  # your class

# 1) Load your joblib-saved file
PKL_PATH = REPO_DIR / "data" / "OLIWER" / "test-tf.pkl"
handwriting_info = load(str(PKL_PATH), mmap_mode='r')

# 2) Dataset
ds = Writing(handwriting_info, transform=None, train=False)
num_classes = ds.users_cnt
print(f"Users: {num_classes} | Samples: {len(ds)} | F={ds.feature_dims}")

# 3) Collate -> (xs[B,T,3], ys[B], ps[B,T], lens[B])
def collate_fn_dolphin(batch, cols=(0,1,2)):
    arrs, lens, labels = zip(*batch)
    lens = np.asarray(lens, dtype=np.int64)
    B, T_max = len(arrs), int(lens.max())

    xs = torch.zeros(B, T_max, len(cols), dtype=torch.float32)  # (B,T,3)
    ps = torch.zeros(B, T_max, dtype=torch.float32)
    ys = torch.tensor(labels, dtype=torch.long)
    lens_t = torch.tensor(lens, dtype=torch.int64)

    for i, (a, L) in enumerate(zip(arrs, lens)):
        a = torch.as_tensor(a, dtype=torch.float32)     # (T,F)
        sel = a[:L, cols]                               # select (x,y,p)
        xs[i, :L, :] = sel
        ps[i, :L] = 1.0

    return xs, ys, ps, lens_t

# 4) Small loader for a smoke test
idx = list(range(len(ds))); random.shuffle(idx); idx = idx[:64]
loader = DataLoader(
    Subset(ds, idx),
    batch_size=16,
    shuffle=False,
    num_workers=0,
    collate_fn=collate_fn_dolphin,  # needed: pads variable-length samples and yields (xs, ys, ps, lens)
)

# 5) Model (+ correct call)
device = "cuda" if torch.cuda.is_available() else "cpu"
model = DOLPHIN(d_in=3, num_classes=num_classes).to(device).eval()

# --- tqdm progress bar for inference ---
embeddings, logits_all = [], []
with torch.no_grad():
    for xs, ys, ps, lens in tqdm(loader, desc="Running DOLPHIN inference", unit="batch"):
        xs = xs.to(device)
        lens = lens.to(device)
        y_vec, y_prob, f3 = model(xs, lens)
        embeddings.append(y_vec.cpu())
        logits_all.append(y_prob.cpu())

# Combine all batches if needed
embeddings = torch.cat(embeddings)
logits_all = torch.cat(logits_all)

print(f"\nDone. Collected {len(embeddings)} embeddings.")
print("Embedding shape:", tuple(embeddings.shape))
print("Logits shape:", tuple(logits_all.shape))
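
As one example of custom analysis, the collected embeddings can be compared by cosine similarity to retrieve the most similar samples. The sketch below works on plain NumPy arrays (for instance `embeddings.numpy()`), uses random vectors as stand-in data, and is not necessarily the metric that `test.py` reports. `topk_cosine` is an illustrative name.

```python
import numpy as np


def topk_cosine(query, gallery, k=5):
    """Return indices (Q, k) and similarities (Q, k) of the k gallery rows
    most similar to each query row under cosine similarity."""
    q = query / (np.linalg.norm(query, axis=1, keepdims=True) + 1e-12)
    g = gallery / (np.linalg.norm(gallery, axis=1, keepdims=True) + 1e-12)
    sims = q @ g.T                               # (Q, G) cosine similarities
    k = min(k, gallery.shape[0])
    idx = np.argsort(-sims, axis=1)[:, :k]       # best matches first
    return idx, np.take_along_axis(sims, idx, axis=1)


# Stand-in data: 6 random "embeddings" of dimension 16.
rng = np.random.default_rng(0)
emb = rng.normal(size=(6, 16)).astype(np.float32)

idx, sims = topk_cosine(emb, emb, k=3)
print("Top-3 neighbours per sample:\n", idx)
```

When query and gallery are the same set, the first hit for each row is the sample itself, so column 0 should be skipped when judging retrieval quality.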



## 8) Visualize one handwriting trajectory

A quick 2D plot of the `(x, y)` trajectory, with pressure mapped to line width.  
This helps sanity-check that the trajectories look reasonable after preprocessing.


In [None]:
# ================================================
# 📊 Visualize a handwriting trajectory (x-y-p)
# ================================================

import matplotlib.pyplot as plt
import numpy as np
import random

def plot_trajectory(sample_array, title="Trajectory (x–y over time)"):
    """
    Plot handwriting trajectory with pressure-modulated line width.
    sample_array: (T, 3) -> [x, y, p]
    """
    arr = np.asarray(sample_array, dtype=np.float32)
    if arr.ndim != 2 or arr.shape[1] < 2:
        print("Unexpected shape:", arr.shape)
        return
    
    x, y = arr[:, 0], arr[:, 1]
    p = arr[:, 2] if arr.shape[1] >= 3 else np.ones_like(x)

    # Normalize pressure (0–1)
    p_norm = (p - p.min()) / (np.ptp(p) + 1e-6)  # np.ptp: the ndarray.ptp method was removed in NumPy 2.0

    plt.figure(figsize=(6, 6))
    for i in range(1, len(x)):
        lw = 0.5 + 3.0 * float(p_norm[i])  # line width based on pressure
        plt.plot([x[i - 1], x[i]], [y[i - 1], y[i]], color="black", linewidth=lw)
    plt.gca().invert_yaxis()
    plt.axis("equal")
    plt.xlabel("x")
    plt.ylabel("y")
    plt.title(title)
    plt.show()

# ================================================
# 🧩 Choose a sample to visualize
# ================================================
try:
    # You already have 'ds' from: ds = Writing(handwriting_info, transform=None, train=False)
    # Pick random or specific index
    idx = random.randint(0, len(ds) - 1)
    sample, length, label = ds[idx]    # sample: (T, F)
    print(f"Selected sample idx={idx}, label={label}, length={length}")

    # Pick columns corresponding to (x, y, p)
    COLS = (0, 1, 2)  # change if time_functions outputs differ (e.g., (x,y,t,p,...))
    arr_T3 = sample[:, COLS]

    # Plot
    plot_trajectory(arr_T3, title=f"Writer {label} – sample {idx}")

except Exception as e:
    print(f"Visualization error: {e}")
    print("Make sure you've already loaded the dataset `ds` and joblib file.")
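
For long trajectories, the per-segment `plt.plot` loop can be slow. One alternative is to build all segments and widths at once and hand them to matplotlib's `LineCollection`. The sketch below covers only the NumPy part, so it runs without a display. `segments_and_widths` is an illustrative helper, and its default width range mirrors the `0.5 + 3.0 * p_norm` rule used above.

```python
import numpy as np


def segments_and_widths(arr, min_lw=0.5, extra_lw=3.0):
    """Turn a (T, 3) [x, y, p] array into (T-1, 2, 2) line segments and
    (T-1,) line widths scaled by min-max normalised pressure."""
    arr = np.asarray(arr, dtype=np.float32)
    xy, p = arr[:, :2], arr[:, 2]
    p_norm = (p - p.min()) / (np.ptp(p) + 1e-6)
    segments = np.stack([xy[:-1], xy[1:]], axis=1)   # start and end point per segment
    widths = min_lw + extra_lw * p_norm[1:]          # width taken from the end point
    return segments, widths


# Tiny synthetic stroke: a diagonal line with increasing pressure.
demo = np.array([[0, 0, 0.0], [1, 1, 0.5], [2, 2, 1.0]], dtype=np.float32)
segments, widths = segments_and_widths(demo)
print(segments.shape, widths)
```

The result can be drawn with `ax.add_collection(LineCollection(segments, linewidths=widths, colors="black"))` followed by `ax.autoscale()`, where `LineCollection` is imported from `matplotlib.collections`.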



## 9) Troubleshooting notes

- **NumPy 2.0 ABI warning/error**: If a C-extension was built against NumPy 1.x, pin `numpy<2` or rebuild the package.  
- **`torch.load` pickle warning**: Use `weights_only=True` when the checkpoint contains only a state dict (tensors and plain containers); otherwise make sure the `.pth` file comes from a trusted source.  
- **Qt/PySide event loop warnings**: If you see GUI loop warnings when plotting, try running this notebook with the default Jupyter kernel (no PySide6 integration).  
- **OpenCV import errors**: On Windows, prefer `pip install opencv-python` (not `opencv-contrib-python` unless needed).  
- **CUDA not available**: The code can run on CPU but will be slow. Check that your NVIDIA driver is recent enough for the CUDA version of the installed PyTorch build (e.g., `cu121`).
- **Huge PKLs**: Start with a tiny subset to validate the pipeline; once it works, scale up.



## 10) Next steps

- Swap the **Head** for a clinical/diagnostic head (binary or multi-class) and fine-tune on your dataset.  
- Add early/late fusion with an image-based ViT on rendered strokes (if exploring multimodal models).  
- Export embeddings (CSV/NPZ) and test classical ML baselines (LR/SVM/RF) for quick diagnostics.
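
For the export step, a compressed NPZ file keeps embeddings and labels together in one archive. A minimal sketch with stand-in arrays: the file name `embeddings.npz` and the array keys are arbitrary choices, and in this notebook you would pass `embeddings.numpy()` together with the matching labels.

```python
import os
import tempfile

import numpy as np

# Stand-in data: 10 embeddings of dimension 8 with integer writer labels.
rng = np.random.default_rng(0)
emb = rng.normal(size=(10, 8)).astype(np.float32)
labels = np.arange(10) % 3

# Save both arrays under named keys in one compressed archive.
out_path = os.path.join(tempfile.mkdtemp(), "embeddings.npz")
np.savez_compressed(out_path, embeddings=emb, labels=labels)

# Reload and confirm that the round trip preserved shapes and values.
with np.load(out_path) as data:
    emb_loaded = data["embeddings"]
    labels_loaded = data["labels"]

print("Reloaded:", emb_loaded.shape, labels_loaded.shape)
```

Note that the inference loop in section 7 does not store `ys`, so the labels would need to be collected there alongside the embeddings before exporting.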
