# Extend existing patient `.npz` files with paper-faithful handcrafted features + path signatures (iisignature)

This notebook resolves every input and output path relative to a single root folder, so you only set one variable (`DATA_ROOT`).

What it creates:
- Reads your original arrays (`X`, `X_raw`, `y`) from each file
- Computes:
  - `X_hand`, `hand_cols`
  - `X_sig`, `sig_cols`
  - `X_plus`, `plus_cols`
- Writes new `.npz` files containing only `X_plus` and `y` into a *parallel* folder so nothing gets overwritten.

Paper-faithful defaults (Sepsis Signatures CinC 2019):
- Signature order: **3**
- Lookback window: **7**
- Transforms: **AddTime + LeadLag**
- Sig channels: **[PartialSOFA, BUN/CR, MAP]**

> Note: CSig (on many channels) is optional and off by default (can be huge).
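
To make the size of these feature blocks concrete, the sketch below counts signature terms for the defaults listed above. It relies on the standard count for a truncated signature (levels 1 through `depth`, which is also what `iisignature.siglength` reports); the helper name `signature_length` is hypothetical, and the 34-channel CSig figure assumes the LOW/NO layout with static columns, `ICULOS`, and `SepsisLabel` excluded.

```python
def signature_length(channels: int, depth: int) -> int:
    """Number of terms in a truncated signature, levels 1..depth."""
    return sum(channels ** k for k in range(1, depth + 1))

# Default signature block: 3 channels (PartialSOFA, BUN/CR, MAP).
sig_channels = 3
with_time = sig_channels + 1      # AddTime appends one time channel
with_leadlag = 2 * with_time      # LeadLag keeps a lead and a lag copy of each channel
n_sig = signature_length(with_leadlag, depth=3)

# Optional CSig block: 34 time-varying channels (assumption, see lead-in).
csig_channels = 2 * (34 + 1)
n_csig = signature_length(csig_channels, depth=3)

print(with_leadlag, n_sig)    # 8 584
print(csig_channels, n_csig)  # 70 347970
```

The second figure is why CSig stays off by default: at depth 3 the term count grows with the cube of the channel count.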


In [1]:
import os
import re
import glob
from pathlib import Path

import numpy as np
import iisignature

print('iisignature version:', getattr(iisignature, '__version__', 'unknown'))


iisignature version: 0.24


## 1) Set `DATA_ROOT`
Important: the `.npz` files must already exist under this folder; otherwise the later cells will fail.

In [2]:
DATA_ROOT = Path('/teamspace/studios/this_studio/detecting_Sepsis/3_Model/Time-Series-Library/dataset/NPZ_Patients_preprocessed')

## 2) Choose what to process

You can choose a whole split folder (recommended) and it will process all `*.npz` within.

Examples (relative to `DATA_ROOT`):
- `HIGH_PREPROC_NO_FE/PSV_Patients_TEST`
- `HIGH_PREPROC_NO_FE/PSV_Patients_TRAIN_FIT`
- `HIGH_PREPROC_NO_FE/PSV_Patients_TRAIN_THRESH`
- `LOW_PREPROC_NO_FE/PSV_Patients_TEST`
- `NO_PREPROC_NO_FE/PSV_Patients_TEST`

The output goes to parallel folders, e.g.
- `HIGH_PREPROC_SIGN/PSV_Patients_TEST`


In [3]:
# Pick one or more split folders:
SPLITS = [
    # HIGH_PREPROC
    #"HIGH_PREPROC_NO_FE/PSV_Patients_TEST",
    #"HIGH_PREPROC_NO_FE/PSV_Patients_TRAIN_FIT",
    #"HIGH_PREPROC_NO_FE/PSV_Patients_TRAIN_THRESH",

    # LOW_PREPROC
    "LOW_PREPROC_NO_FE/PSV_Patients_TEST",
    "LOW_PREPROC_NO_FE/PSV_Patients_TRAIN_FIT",
    "LOW_PREPROC_NO_FE/PSV_Patients_TRAIN_THRESH",

    # NO_PREPROC
    #"NO_PREPROC_NO_FE/PSV_Patients_TEST",
    #"NO_PREPROC_NO_FE/PSV_Patients_TRAIN_FIT",
    #"NO_PREPROC_NO_FE/PSV_Patients_TRAIN_THRESH",
]


# Optional: only process a small subset while testing
LIMIT_FILES = 0  # 0 = no limit

# Optional: also compute CSig (can be expensive / very high-dimensional)
COMPUTE_CSIG = False

print('Splits:', SPLITS)


Splits: ['LOW_PREPROC_NO_FE/PSV_Patients_TEST', 'LOW_PREPROC_NO_FE/PSV_Patients_TRAIN_FIT', 'LOW_PREPROC_NO_FE/PSV_Patients_TRAIN_THRESH']


## 3) Column sets (your NPZ layouts)

- LOW/NO: 41 cols including `SepsisLabel`
- HIGH: 40 base cols + `extra_0..extra_33` + `SepsisLabel` (75 cols)

We only need indices for a few columns (HR, SBP, BUN, Creatinine, Platelets, Bilirubin_total, MAP).


In [4]:
COLS_NO_LOW = [
    'HR','O2Sat','Temp','SBP','MAP','DBP','Resp','EtCO2','BaseExcess','HCO3','FiO2','pH','PaCO2','SaO2',
    'AST','BUN','Alkalinephos','Calcium','Chloride','Creatinine','Bilirubin_direct','Glucose','Lactate',
    'Magnesium','Phosphate','Potassium','Bilirubin_total','TroponinI','Hct','Hgb','PTT','WBC','Fibrinogen',
    'Platelets','Age','Gender','Unit1','Unit2','HospAdmTime','ICULOS','SepsisLabel'
]

COLS_HIGH_PREFIX = [
    'HR','O2Sat','Temp','SBP','MAP','DBP','Resp','EtCO2','BaseExcess','HCO3','FiO2','pH','PaCO2','SaO2',
    'AST','BUN','Alkalinephos','Calcium','Chloride','Creatinine','Bilirubin_direct','Glucose','Lactate',
    'Magnesium','Phosphate','Potassium','Bilirubin_total','TroponinI','Hct','Hgb','PTT','WBC','Fibrinogen',
    'Platelets','Age','Gender','Unit1','Unit2','HospAdmTime','ICULOS'
]

def cols_high():
    extras = [f'extra_{i}' for i in range(34)]
    return COLS_HIGH_PREFIX + extras + ['SepsisLabel']

STATIC_COLS = {'Age','Gender','Unit1','Unit2','HospAdmTime'}

def detect_mode(path: Path) -> str:
    up = str(path).upper()
    if 'HIGH_PREPROC' in up:
        return 'high'
    if 'LOW_PREPROC' in up:
        return 'low'
    if 'NO_PREPROC' in up:
        return 'none'
    return 'unknown'

def get_cols(mode: str):
    if mode == 'high':
        return cols_high()
    if mode in ('low','none'):
        return COLS_NO_LOW
    raise ValueError(f'Unknown mode: {mode}')

def build_index(cols):
    return {c:i for i,c in enumerate(cols)}

print('OK')


OK


## 4) Paper-faithful handcrafted features

- ShockIndex = HR / SBP
- BUN/CR = BUN / Creatinine
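- PartialSOFA = sum of the SOFA sub-scores for Platelets, Bilirubin_total, and Creatinine (0 to 4 each) plus a MAP-based cardiovascular component (1 if MAP < 70, else 0)
- SOFA_Deterioration = 1 if PartialSOFA is >= 2 above its minimum over the last 24 h, else 0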
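
As a worked example of the deterioration rule, the sketch below applies it to a short made-up score series. The function name `deterioration_flags`, the toy values, and the shortened 3-hour window are illustrative assumptions; the notebook itself uses a 24-hour window.

```python
import numpy as np

def deterioration_flags(partial_sofa, hours=24, delta=2):
    """Flag each step whose score is >= delta above the trailing-window minimum."""
    scores = np.asarray(partial_sofa, dtype=np.float32)
    flags = np.zeros(scores.shape[0], dtype=np.float32)
    for t in range(scores.shape[0]):
        start = max(0, t - hours)                     # window is clipped at the first row
        baseline = np.nanmin(scores[start:t + 1])     # lowest score in the window
        flags[t] = float(scores[t] - baseline >= delta)
    return flags

toy_scores = [0, 0, 1, 2, 2, 1, 3]                    # made-up hourly PartialSOFA values
flags = deterioration_flags(toy_scores, hours=3)
print(flags.tolist())  # [0.0, 0.0, 0.0, 1.0, 1.0, 0.0, 1.0]
```

At `t=5` the flag drops back to 0 because the window no longer contains the early zero scores, so the baseline has risen to 1.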


In [5]:
def safe_div(a, b, eps=1e-6):
    return a / np.clip(b, eps, None)

def sofa_platelets(plts):
    out = np.zeros_like(plts, dtype=np.float32)
    out = np.where(plts < 150, 1, out)
    out = np.where(plts < 100, 2, out)
    out = np.where(plts < 50, 3, out)
    out = np.where(plts < 20, 4, out)
    return out

def sofa_bilirubin(bili):
    out = np.zeros_like(bili, dtype=np.float32)
    out = np.where(bili >= 1.2, 1, out)
    out = np.where(bili >= 2.0, 2, out)
    out = np.where(bili >= 6.0, 3, out)
    out = np.where(bili >= 12.0, 4, out)
    return out

def sofa_creatinine(cr):
    out = np.zeros_like(cr, dtype=np.float32)
    out = np.where(cr >= 1.2, 1, out)
    out = np.where(cr >= 2.0, 2, out)
    out = np.where(cr >= 3.5, 3, out)
    out = np.where(cr >= 5.0, 4, out)
    return out

def sofa_map(mapv):
    return np.where(mapv < 70.0, 1.0, 0.0).astype(np.float32)

def compute_partial_sofa(X, idx):
    plts = X[:, idx['Platelets']]
    bili = X[:, idx['Bilirubin_total']]
    cr   = X[:, idx['Creatinine']]
    mapv = X[:, idx['MAP']]
    return (sofa_platelets(plts) + sofa_bilirubin(bili) + sofa_creatinine(cr) + sofa_map(mapv)).astype(np.float32)

def sofa_deterioration(partial_sofa, hours=24, delta=2):
    T = partial_sofa.shape[0]
    out = np.zeros(T, dtype=np.float32)
    for t in range(T):
        a = max(0, t - hours)
        base = np.nanmin(partial_sofa[a:t+1])
        out[t] = 1.0 if (partial_sofa[t] - base) >= delta else 0.0
    return out

print('OK')


OK


## 5) AddTime + LeadLag + Rolling Signatures (iisignature)

We compute a signature **at each time step** using a lookback window (default 7).

Pipeline per time `t`:
1. Take window `X[t-lookback+1 : t+1]` (shorter for the first `lookback-1` steps, where it starts at index 0)
2. AddTime
3. LeadLag
4. signature(order=3)
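
The two transforms are easiest to follow on a tiny window. The sketch below applies them to a made-up single-channel window of three steps using plain NumPy; it illustrates the shapes involved and stops before the signature call, so it does not need `iisignature`.

```python
import numpy as np

window = np.array([[1.0], [2.0], [4.0]])        # 3 time steps, 1 channel (toy values)

# AddTime: prepend a time channel that runs from 0 to 1 across the window.
t = np.linspace(0.0, 1.0, len(window)).reshape(-1, 1)
timed = np.hstack([t, window])                  # shape (3, 2)

# LeadLag: repeat every point twice, then offset the two copies by one row
# so the "lead" half is always at or one step ahead of the "lag" half.
rep = np.repeat(timed, 2, axis=0)               # shape (6, 2)
lead_lag_path = np.hstack([rep[1:], rep[:-1]])  # shape (5, 4)

print(timed.shape, lead_lag_path.shape)  # (3, 2) (5, 4)
print(lead_lag_path)
```

A window of `L` points with `d` channels therefore becomes a path of `2L-1` points with `2(d+1)` channels, which is the path handed to the signature step.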


In [6]:
def add_time(path):
    L = path.shape[0]
    t = np.linspace(0.0, 1.0, L, dtype=np.float64).reshape(L, 1)
    return np.concatenate([t, path.astype(np.float64)], axis=1)

def lead_lag(path):
    L, d = path.shape
    rep = np.repeat(path, 2, axis=0)  # (2L, d)
    lead = rep[1:]                   # (2L-1, d)
    lag  = rep[:-1]                  # (2L-1, d)
    return np.concatenate([lead, lag], axis=1)

def rolling_signature(Xch, depth=3, lookback=7, use_addtime=True, use_leadlag=True):
    T, d = Xch.shape
    dd = d + (1 if use_addtime else 0)
    dd = dd * (2 if use_leadlag else 1)
    sigdim = iisignature.siglength(dd, depth)
    out = np.zeros((T, sigdim), dtype=np.float32)

    for t in range(T):
        a = max(0, t - lookback + 1)
        seg = Xch[a:t+1].astype(np.float64)
        if use_addtime:
            seg = add_time(seg)
        if use_leadlag:
            seg = lead_lag(seg)
        out[t] = iisignature.sig(seg, depth).astype(np.float32)

    return out

print('OK')


OK


## 6) Output folder mapping (no overwrites)

Input folders:
- `HIGH_PREPROC_NO_FE/...`
- `LOW_PREPROC_NO_FE/...`
- `NO_PREPROC_NO_FE/...`

Output folders:
- `HIGH_PREPROC_SIGN/...`
- `LOW_PREPROC_SIGN/...`
- `NO_PREPROC_SIGN/...`


In [7]:
def out_split_folder(split: str) -> str:
    # Replace the root folder name only
    split = split.replace('HIGH_PREPROC_NO_FE', 'HIGH_PREPROC_SIGN')
    split = split.replace('LOW_PREPROC_NO_FE', 'LOW_PREPROC_SIGN')
    split = split.replace('NO_PREPROC_NO_FE', 'NO_PREPROC_SIGN')
    return split

print('Example:', out_split_folder('HIGH_PREPROC_NO_FE/PSV_Patients_TEST'))


Example: HIGH_PREPROC_SIGN/PSV_Patients_TEST


## 7) Extend one `.npz`

Builds `X_hand` and `X_sig` (+ optional `X_csig`), concatenates them with `X` into `X_plus`, and saves `X_plus` together with `y`.
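
The ratio features in this step are computed from `X_raw` rather than `X`. The sketch below shows why on made-up vitals, assuming missing values in `X` are filled with 0: a clipped division on filled data turns a missing denominator into a huge ratio, while a NaN-safe division on raw data keeps the gap visible as NaN.

```python
import numpy as np

hr_raw = np.array([80.0, 90.0, np.nan], dtype=np.float32)     # toy heart rates
sbp_raw = np.array([120.0, np.nan, 110.0], dtype=np.float32)  # toy systolic pressures

# Zero-filled copies, standing in for X (assumption: missing values become 0).
hr_filled, sbp_filled = np.nan_to_num(hr_raw), np.nan_to_num(sbp_raw)

# Clipped division on filled data: the missing SBP yields an enormous ShockIndex.
clipped = hr_filled / np.clip(sbp_filled, 1e-6, None)

# NaN-safe division on raw data: any missing or non-positive input yields NaN.
valid = np.isfinite(hr_raw) & np.isfinite(sbp_raw) & (sbp_raw > 0)
safe = np.full_like(hr_raw, np.nan)
safe[valid] = hr_raw[valid] / sbp_raw[valid]

print(clipped)
print(safe)
```

Tree models such as LightGBM can treat the NaN entries as missing, whereas the inflated ratio in the second position would look like a genuine extreme value.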


In [8]:
def extend_one_npz(in_path: Path, out_path: Path, *, compute_csig=False, csig_depth=3, csig_lookback=7,
                   exclude_extras_from_csig=True, exclude_iculos_from_csig=True,
                   sig_depth=3, sig_lookback=7):

    def safe_ratio(num, den):
        """NaN-safe ratio: returns NaN if either input missing or den <= 0."""
        num = num.astype(np.float32)
        den = den.astype(np.float32)
        out = np.full_like(num, np.nan, dtype=np.float32)
        valid = np.isfinite(num) & np.isfinite(den) & (den > 0)
        out[valid] = num[valid] / den[valid]
        return out

    mode = detect_mode(in_path)
    cols = get_cols(mode)
    idx = build_index(cols)

    # Load inside a context manager so the file handle is closed after each patient.
    with np.load(in_path, allow_pickle=True) as z:
        X = z['X'].astype(np.float32)          # zero-filled / preprocessed
        y = z['y'].astype(np.int64)
        X_raw = z['X_raw'].astype(np.float32)  # NaNs preserved

    # -------------------------
    # Handcrafted features
    # IMPORTANT FIX: compute ratios from X_raw (NaNs) not X (zeros)
    # -------------------------
    shock = safe_ratio(X_raw[:, idx['HR']], X_raw[:, idx['SBP']]).astype(np.float32)

    # Paper-faithful mapping: BUN / Creatinine (blood urea nitrogen, not bilirubin)
    buncr = safe_ratio(X_raw[:, idx['BUN']], X_raw[:, idx['Creatinine']]).astype(np.float32)

    # PartialSOFA / deterioration are scored on X (filled values).
    # Caveat: if missing values are zero-filled, a missing Platelets value scores 4
    # and a missing MAP scores 1; check the upstream fill strategy.
    psofa = compute_partial_sofa(X, idx)
    sofa_det = sofa_deterioration(psofa, hours=24, delta=2)

    X_hand = np.column_stack([shock, buncr, psofa, sofa_det]).astype(np.float32)
    hand_cols = np.array(['ShockIndex', 'BUN_CR', 'PartialSOFA', 'SOFA_Deterioration'], dtype='<U32')

    # -------------------------
    # Signatures on [PartialSOFA, BUN/Creatinine, MAP]
    # NaNs in buncr propagate into the signature terms of any window containing them.
    # -------------------------
    sig_channels = np.column_stack([psofa, buncr, X[:, idx['MAP']]]).astype(np.float32)

    X_sig = rolling_signature(sig_channels, depth=sig_depth, lookback=sig_lookback,
                              use_addtime=True, use_leadlag=True)
    sig_cols = np.array([f'Sig_{i}' for i in range(X_sig.shape[1])], dtype='<U32')

    # -------------------------
    # Optional csig
    # -------------------------
    X_csig = None
    csig_cols = None
    if compute_csig:
        nonstat = [c for c in cols if (c not in STATIC_COLS and c != 'SepsisLabel')]
        if exclude_iculos_from_csig and 'ICULOS' in nonstat:
            nonstat.remove('ICULOS')
        if exclude_extras_from_csig:
            nonstat = [c for c in nonstat if not c.startswith('extra_')]

        X_nonstat = X[:, [idx[c] for c in nonstat]].astype(np.float32)
        X_csig = rolling_signature(X_nonstat, depth=csig_depth, lookback=csig_lookback,
                                   use_addtime=True, use_leadlag=True)
        csig_cols = np.array([f'CSig_{i}' for i in range(X_csig.shape[1])], dtype='<U32')

    # -------------------------
    # Concat
    # -------------------------
    if X_csig is None:
        X_plus = np.concatenate([X, X_hand, X_sig], axis=1).astype(np.float32)
        plus_cols = np.concatenate([
            np.array([f'X_{i}' for i in range(X.shape[1])], dtype='<U16'),
            hand_cols, sig_cols
        ])
    else:
        X_plus = np.concatenate([X, X_hand, X_sig, X_csig], axis=1).astype(np.float32)
        plus_cols = np.concatenate([
            np.array([f'X_{i}' for i in range(X.shape[1])], dtype='<U16'),
            hand_cols, sig_cols, csig_cols
        ])

    out_path.parent.mkdir(parents=True, exist_ok=True)

    # Save ONLY what LightGBM needs
    assert X_plus.shape[0] == y.shape[0], 'Mismatch between X_plus and y'
    np.savez(
        out_path,
        X_plus=X_plus.astype(np.float32),
        y=y.astype(np.int64),
    )
    return out_path


## 8) Run on selected split folders

This processes all `.npz` in each selected split and writes to the corresponding `*_SIGN` folder.
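
Once a split has been written, it can be worth spot-checking a file before training. The sketch below is one way to do that, assuming only the two keys this notebook saves (`X_plus`, `y`); the helper `summarize_extended`, the file name, and the array sizes are hypothetical, and a temporary file stands in for a real patient file.

```python
import tempfile
from pathlib import Path

import numpy as np

def summarize_extended(npz_path):
    """Return (n_rows, n_features, nan_fraction) for one extended file."""
    with np.load(npz_path) as z:               # context manager closes the file
        X_plus, y = z["X_plus"], z["y"]
    assert X_plus.shape[0] == y.shape[0], "row count mismatch"
    return X_plus.shape[0], X_plus.shape[1], float(np.isnan(X_plus).mean())

with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "demo_patient.npz"      # hypothetical file name
    X_demo = np.zeros((5, 10), dtype=np.float32)
    X_demo[0, 0] = np.nan                      # one missing value out of 50
    np.savez(demo, X_plus=X_demo, y=np.zeros(5, dtype=np.int64))
    summary = summarize_extended(demo)

print(summary)  # (5, 10, 0.02)
```

On real output, point `summarize_extended` at any file inside a `*_SIGN` folder; a NaN fraction near 1.0 would suggest the ratio channel is missing for most windows.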


In [9]:
from tqdm.auto import tqdm

def process_split(split: str, limit_files: int = 0):
    in_dir = DATA_ROOT / split
    if not in_dir.exists():
        raise FileNotFoundError(f"Split folder not found: {in_dir}")

    out_dir = DATA_ROOT / out_split_folder(split)

    files = sorted(in_dir.glob('*.npz'))
    if limit_files and limit_files > 0:
        files = files[:limit_files]

    print(f"Processing {len(files)} files")
    print(" IN:", in_dir)
    print("OUT:", out_dir)

    for fp in tqdm(files):
        out_fp = out_dir / fp.name
        extend_one_npz(fp, out_fp, compute_csig=COMPUTE_CSIG)

    return out_dir

out_dirs = []
for s in SPLITS:
    out_dirs.append(process_split(s, limit_files=LIMIT_FILES))

print('Done. Output dirs:')
for d in out_dirs:
    print(' -', d)




Processing 8068 files
 IN: /teamspace/studios/this_studio/detecting_Sepsis/3_Model/Time-Series-Library/dataset/NPZ_Patients_preprocessed/LOW_PREPROC_NO_FE/PSV_Patients_TEST
OUT: /teamspace/studios/this_studio/detecting_Sepsis/3_Model/Time-Series-Library/dataset/NPZ_Patients_preprocessed/LOW_PREPROC_SIGN/PSV_Patients_TEST


100%|██████████| 8068/8068 [00:28<00:00, 279.52it/s]


Processing 30654 files
 IN: /teamspace/studios/this_studio/detecting_Sepsis/3_Model/Time-Series-Library/dataset/NPZ_Patients_preprocessed/LOW_PREPROC_NO_FE/PSV_Patients_TRAIN_FIT
OUT: /teamspace/studios/this_studio/detecting_Sepsis/3_Model/Time-Series-Library/dataset/NPZ_Patients_preprocessed/LOW_PREPROC_SIGN/PSV_Patients_TRAIN_FIT


100%|██████████| 30654/30654 [03:15<00:00, 156.86it/s]


Processing 1614 files
 IN: /teamspace/studios/this_studio/detecting_Sepsis/3_Model/Time-Series-Library/dataset/NPZ_Patients_preprocessed/LOW_PREPROC_NO_FE/PSV_Patients_TRAIN_THRESH
OUT: /teamspace/studios/this_studio/detecting_Sepsis/3_Model/Time-Series-Library/dataset/NPZ_Patients_preprocessed/LOW_PREPROC_SIGN/PSV_Patients_TRAIN_THRESH


100%|██████████| 1614/1614 [00:04<00:00, 334.89it/s]

Done. Output dirs:
 - /teamspace/studios/this_studio/detecting_Sepsis/3_Model/Time-Series-Library/dataset/NPZ_Patients_preprocessed/LOW_PREPROC_SIGN/PSV_Patients_TEST
 - /teamspace/studios/this_studio/detecting_Sepsis/3_Model/Time-Series-Library/dataset/NPZ_Patients_preprocessed/LOW_PREPROC_SIGN/PSV_Patients_TRAIN_FIT
 - /teamspace/studios/this_studio/detecting_Sepsis/3_Model/Time-Series-Library/dataset/NPZ_Patients_preprocessed/LOW_PREPROC_SIGN/PSV_Patients_TRAIN_THRESH



