# qSOFA Proxy Workbook (No ML)

This notebook computes a **qSOFA approximation** on your preprocessed sepsis dataset and exports the result in the **same folder/workflow layout** you used in *LIGHTGBMv2*, but **without** training any model.

Because your dataset does **not** include GCS (the altered-mentation item of the standard 3-item qSOFA), we use a **2-item qSOFA proxy**:

- Respiratory rate **Resp ≥ 22**
- Systolic blood pressure **SBP ≤ 100**

We compute:

- `qSOFA_score_2` in {0,1,2}
- `qSOFA_Label` = 1 iff `qSOFA_score_2 ≥ 2` (i.e., **both** criteria are met)

Finally, we export **per-patient `.psv` files** with a `PredictedProbability` column so you can reuse your existing evaluation notebook to compute the official **utility score**.


In [1]:
from pathlib import Path
import numpy as np
import pandas as pd

# =========================
# Set ONCE (same pattern as LIGHTGBMv2)
# =========================
DATA_DIR = Path("/teamspace/studios/this_studio/detecting_Sepsis/data")

# Point directly to the dataset variant you want to use:
MODE_DIR = DATA_DIR / "High_Preproc_NoFe_CSV"
#MODE_DIR = DATA_DIR / "Low_Preproc_NoFe_CSV"
#MODE_DIR = DATA_DIR / "raw_CSV"

# One run folder name (used under 4_Evaluation/Predictions/)
RUN_NAME = "qSOFA_PROXY_HighPreproc_NoFe"   # change once per experiment

# =========================
# Conventions
# =========================
PATIENT_COL = "Patient_ID"
TIME_COL    = "ICULOS"
LABEL_COL   = "SepsisLabel"  # original label (kept for reference)

# =========================
# Resolve dataset files from MODE_DIR (AUTO)
# =========================
def resolve_dataset_paths(mode_dir: Path):
    if not mode_dir.exists():
        raise FileNotFoundError(f"MODE_DIR does not exist: {mode_dir}")

    candidates = [
        # HIGH
        ("train_fit_HIGH_PREPROC_NO_FE.csv",
         "train_thresh_HIGH_PREPROC_NO_FE.csv",
         "test_HIGH_PREPROC_NO_FE.csv"),
        # LOW
        ("train_fit_LOW_PREPROC_NO_FE.csv",
         "train_thresh_LOW_PREPROC_NO_FE.csv",
         "test_LOW_PREPROC_NO_FE.csv"),
        # RAW
        ("train_fit.csv",
         "train_thresh.csv",
         "test.csv"),
    ]

    for fit, thresh, test in candidates:
        p_fit = mode_dir / fit
        p_thr = mode_dir / thresh
        p_tst = mode_dir / test
        if p_fit.exists() and p_thr.exists() and p_tst.exists():
            return p_fit, p_thr, p_tst

    existing = sorted([p.name for p in mode_dir.glob("*.csv")])
    raise FileNotFoundError(
        f"Could not resolve dataset files in MODE_DIR={mode_dir}\n"
        f"Expected one of these naming schemes:\n"
        f"  HIGH: train_fit_HIGH_PREPROC_NO_FE.csv + train_thresh_HIGH_PREPROC_NO_FE.csv + test_HIGH_PREPROC_NO_FE.csv\n"
        f"  LOW : train_fit_LOW_PREPROC_NO_FE.csv  + train_thresh_LOW_PREPROC_NO_FE.csv  + test_LOW_PREPROC_NO_FE.csv\n"
        f"  RAW : train_fit.csv + train_thresh.csv + test.csv\n"
        f"Existing CSV files in MODE_DIR:\n  {existing}"
    )

TRAIN_FIT_PATH, TRAIN_THRESH_PATH, TEST_PATH = resolve_dataset_paths(MODE_DIR)

# =========================
# Prediction output dirs (same layout)
# =========================
PRED_ROOT = DATA_DIR.parent / "4_Evaluation" / "Predictions" / RUN_NAME
PRED_DIR_THRESH_WORK = PRED_ROOT / "pred_psv_TRAIN_THRESH_WORK"
PRED_DIR_TEST_WORK   = PRED_ROOT / "pred_psv_TEST_WORK"

for d in [PRED_DIR_THRESH_WORK, PRED_DIR_TEST_WORK]:
    d.mkdir(parents=True, exist_ok=True)

print("MODE_DIR:", MODE_DIR)
print("TRAIN_FIT:", TRAIN_FIT_PATH)
print("TRAIN_THRESH:", TRAIN_THRESH_PATH)
print("TEST:", TEST_PATH)
print("PRED_ROOT:", PRED_ROOT)


MODE_DIR: /teamspace/studios/this_studio/detecting_Sepsis/data/High_Preproc_NoFe_CSV
TRAIN_FIT: /teamspace/studios/this_studio/detecting_Sepsis/data/High_Preproc_NoFe_CSV/train_fit_HIGH_PREPROC_NO_FE.csv
TRAIN_THRESH: /teamspace/studios/this_studio/detecting_Sepsis/data/High_Preproc_NoFe_CSV/train_thresh_HIGH_PREPROC_NO_FE.csv
TEST: /teamspace/studios/this_studio/detecting_Sepsis/data/High_Preproc_NoFe_CSV/test_HIGH_PREPROC_NO_FE.csv
PRED_ROOT: /teamspace/studios/this_studio/detecting_Sepsis/4_Evaluation/Predictions/qSOFA_PROXY_HighPreproc_NoFe


## Load the preprocessed CSVs

This follows the same pattern as your v2 notebook:
- `train_fit`: typically used for training (we **don’t** train anything here, but it’s useful for sanity checks)
- `train_thresh`: the split you used for threshold selection + official utility evaluation
- `test`: the official test split (no labels)
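
A quick way to confirm the three splits do not share patients is sketched below. This is not part of the original workflow: `patient_overlap` is a hypothetical helper, the toy frames are made up, and it assumes the splits are meant to be patient-disjoint, which the notebook does not state explicitly.

```python
import pandas as pd

def patient_overlap(a: pd.DataFrame, b: pd.DataFrame,
                    patient_col: str = "Patient_ID") -> set:
    """Return the set of patient IDs that appear in both frames (hypothetical helper)."""
    return set(a[patient_col].unique()) & set(b[patient_col].unique())

# Toy stand-ins for train_fit / train_thresh / test (made-up values).
toy_fit = pd.DataFrame({"Patient_ID": [1, 1, 2], "ICULOS": [1, 2, 1]})
toy_thresh = pd.DataFrame({"Patient_ID": [3, 3], "ICULOS": [1, 2]})
toy_test = pd.DataFrame({"Patient_ID": [2, 4], "ICULOS": [1, 1]})

overlap_fit_thresh = patient_overlap(toy_fit, toy_thresh)  # no shared patients
overlap_fit_test = patient_overlap(toy_fit, toy_test)      # patient 2 is in both

print("fit vs thresh:", overlap_fit_thresh)
print("fit vs test:  ", overlap_fit_test)
```

On the real data, the same call could be applied to `train_fit`, `train_thresh`, and `test_df` once they are loaded; an empty set for every pair would indicate no leakage at the patient level.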


In [2]:
def load_preproc_csv(path: Path) -> pd.DataFrame:
    if not path.exists():
        raise FileNotFoundError(path)
    df = pd.read_csv(path)
    # drop accidental index cols
    df = df.loc[:, ~df.columns.str.contains(r"^Unnamed")]
    return df

train_fit = load_preproc_csv(TRAIN_FIT_PATH)
train_thresh = load_preproc_csv(TRAIN_THRESH_PATH)
test_df = load_preproc_csv(TEST_PATH)

# basic checks
for df_name, df in [("train_fit", train_fit), ("train_thresh", train_thresh), ("test", test_df)]:
    print(df_name, df.shape, "patients:", df[PATIENT_COL].nunique())


train_fit (1180166, 78) patients: 30654
train_thresh (61120, 78) patients: 1614
test (310924, 78) patients: 8068


## Compute the qSOFA proxy

Your column names match the PhysioNet convention:

- `Resp` for respiratory rate
- `SBP` for systolic blood pressure

We compute a 2-component qSOFA proxy:

- `rr_flag  = 1 if Resp ≥ 22 else 0`
- `sbp_flag = 1 if SBP ≤ 100 else 0`

Then:

- `qSOFA_score_2 = rr_flag + sbp_flag`  (0–2)
- `qSOFA_Label = 1 if qSOFA_score_2 ≥ 2 else 0`

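**NaN behavior:** comparisons with NaN evaluate to False, so a missing `Resp` or `SBP` contributes 0 to the score, and a row with either value missing can never reach `qSOFA_Label = 1`. Any value filled in during preprocessing is compared like an ordinary measurement, so a placeholder such as `SBP = 0.0` satisfies `SBP ≤ 100`.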
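
The scoring rule and its handling of missing values can be checked on a handful of rows. The sketch below uses made-up values (not taken from the dataset) and mirrors the flag definitions listed above; the `any_missing` column is an illustrative addition.

```python
import numpy as np
import pandas as pd

# Toy rows (made-up values) covering each case, including missing inputs.
toy = pd.DataFrame({
    "Resp": [18.0, 22.0, 30.0, np.nan, 25.0],
    "SBP":  [120.0, 100.0, 130.0, 90.0, np.nan],
})

# Comparisons against NaN evaluate to False, so a missing value adds 0.
rr_flag = (toy["Resp"] >= 22).astype("int8")
sbp_flag = (toy["SBP"] <= 100).astype("int8")

toy["qSOFA_score_2"] = rr_flag + sbp_flag
toy["qSOFA_Label"] = (toy["qSOFA_score_2"] >= 2).astype("int8")

# Rows with at least one missing input can score at most 1.
toy["any_missing"] = toy[["Resp", "SBP"]].isna().any(axis=1)

print(toy)
```

Only the second row (Resp = 22, SBP = 100) meets both criteria; both thresholds are inclusive.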


In [3]:
def add_qsofa_proxy(df: pd.DataFrame, resp_col: str = "Resp", sbp_col: str = "SBP") -> pd.DataFrame:
    out = df.copy()

    if resp_col not in out.columns:
        raise KeyError(f"Missing expected resp column: {resp_col}")
    if sbp_col not in out.columns:
        raise KeyError(f"Missing expected SBP column: {sbp_col}")

    rr_flag  = (out[resp_col] >= 22).astype("int8")
    sbp_flag = (out[sbp_col] <= 100).astype("int8")

    out["qSOFA_score_2"] = (rr_flag + sbp_flag).astype("int8")
    out["qSOFA_Label"]   = (out["qSOFA_score_2"] >= 2).astype("int8")

    # Optional probability-style output:
    #  - If you want strict binary predictions: use qSOFA_Label (0/1).
    #  - If you want a graded score: use qSOFA_score_2 / 2 (0, 0.5, 1).
    out["qSOFA_Prob_binary"] = out["qSOFA_Label"].astype(float)
    out["qSOFA_Prob_graded"] = (out["qSOFA_score_2"].astype(float) / 2.0)

    return out

train_fit_q = add_qsofa_proxy(train_fit)
train_thresh_q = add_qsofa_proxy(train_thresh)
test_q = add_qsofa_proxy(test_df)

train_fit_q[["SBP","Resp","qSOFA_score_2","qSOFA_Label"]].head()


,SBP,Resp,qSOFA_score_2,qSOFA_Label
0,0.0,0.0,1,0
1,98.0,19.0,1,0
2,122.0,22.0,1,0
3,122.0,30.0,1,0
4,122.0,24.5,1,0


## Quick sanity checks

- Distribution of the proxy label
- How often each criterion fires
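
Row-level rates weight long ICU stays more heavily than short ones, so a per-patient view can be a useful complement. The sketch below is an optional extra on made-up rows, not part of the original checks: it compares the row-level positive rate with the share of patients who are flagged at least once.

```python
import pandas as pd

# Toy frame (made-up values): three patients with different stay lengths.
toy = pd.DataFrame({
    "Patient_ID":  [1, 1, 1, 2, 2, 3],
    "qSOFA_Label": [0, 1, 0, 0, 0, 1],
})

# Fraction of hourly rows that are flagged.
row_rate = toy["qSOFA_Label"].mean()

# 1 if the patient is flagged in at least one row, else 0.
ever_positive = toy.groupby("Patient_ID")["qSOFA_Label"].max()
patient_rate = ever_positive.mean()

print(f"row-level positive rate:     {row_rate:.3f}")
print(f"patients ever flagged (rate): {patient_rate:.3f}")
```

The same two lines could be applied to `train_fit_q`, `train_thresh_q`, and `test_q` using `PATIENT_COL`.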


In [4]:
def describe_qsofa(df: pd.DataFrame, name: str):
    rr = (df["Resp"] >= 22).mean()
    sbp = (df["SBP"] <= 100).mean()
    lbl = df["qSOFA_Label"].mean()
    print(f"{name}: rows={len(df):,} patients={df[PATIENT_COL].nunique():,}")
    print(f"  RR>=22 rate:  {rr:.3f}")
    print(f"  SBP<=100 rate:{sbp:.3f}")
    print(f"  qSOFA_Label=1:{lbl:.3f}")
    print()

describe_qsofa(train_fit_q, "train_fit")
describe_qsofa(train_thresh_q, "train_thresh")
describe_qsofa(test_q, "test")


train_fit: rows=1,180,166 patients=30,654
  RR>=22 rate:  0.236
  SBP<=100 rate:0.186
  qSOFA_Label=1:0.040

train_thresh: rows=61,120 patients=1,614
  RR>=22 rate:  0.240
  SBP<=100 rate:0.186
  qSOFA_Label=1:0.041

test: rows=310,924 patients=8,068
  RR>=22 rate:  0.241
  SBP<=100 rate:0.190
  qSOFA_Label=1:0.043



## Export per-patient `.psv` files (same evaluation workflow)

Your evaluation pipeline expects one `.psv` file per patient, with a `PredictedProbability` column, in folders like:

- `.../4_Evaluation/Predictions/<RUN_NAME>/pred_psv_TRAIN_THRESH_WORK/`
- `.../4_Evaluation/Predictions/<RUN_NAME>/pred_psv_TEST_WORK/`

We write those files using either:

- **Binary**: `PredictedProbability = qSOFA_Prob_binary` (0 or 1)
- **Graded**: `PredictedProbability = qSOFA_Prob_graded` (0, 0.5, 1)

For strict “qSOFA positive” behavior, binary is usually what you want.
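
To make the file format concrete, the sketch below writes a single-column prediction file into an in-memory buffer and reads it back. The probability values are made up, and the buffer stands in for a real path such as `p000011.psv`. Because there is only one column, the `|` delimiter never actually appears in the file.

```python
import io

import numpy as np
import pandas as pd

# Made-up hourly predictions for one patient.
prob = np.array([0.0, 0.0, 1.0, 1.0])

# Write in the same shape as the export: one header line, one value per hour.
buf = io.StringIO()
pd.DataFrame({"PredictedProbability": prob}).to_csv(buf, sep="|", index=False)
text = buf.getvalue()
print(text)

# Read it back the way a pipe-separated reader would.
back = pd.read_csv(io.StringIO(text), sep="|")
print(back.shape)
```

Each file therefore has exactly as many data rows as the patient has hourly records, in `ICULOS` order.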


In [5]:
import re

def ensure_empty_dir(dir_path: Path):
    dir_path.mkdir(parents=True, exist_ok=True)
    for f in dir_path.glob("*.psv"):
        f.unlink()

def write_prob_file(out_path: Path, prob: np.ndarray):
    pd.DataFrame({"PredictedProbability": prob.astype(float)}).to_csv(out_path, sep="|", index=False)

def format_patient_id(pid) -> str:
    """Match the PhysioNet-style naming: p000011.psv (zero-padded to 6 digits).

    Accepts pid as int/float/str, including values like:
      - 11
      - "11"
      - "p11"
      - "p000011"
    """
    s = str(pid)
    m = re.search(r"(\d+)", s)
    if not m:
        # Fallback: use raw string
        return s
    n = int(m.group(1))
    return f"p{n:06d}"

def write_psv_predictions_from_df(df: pd.DataFrame, out_dir: Path, prob_col: str):
    if prob_col not in df.columns:
        raise KeyError(f"Missing prob_col='{prob_col}' in df.columns")

    ensure_empty_dir(out_dir)

    # Ensure stable order within patient (by TIME_COL)
    df_sorted = df.sort_values([PATIENT_COL, TIME_COL]).reset_index(drop=True)

    for pid, g in df_sorted.groupby(PATIENT_COL, sort=False):
        prob = g[prob_col].to_numpy()
        fname = format_patient_id(pid)
        out_path = out_dir / f"{fname}.psv"
        write_prob_file(out_path, prob)

# Choose which probability definition to export:
PROB_COL_TO_EXPORT = "qSOFA_Prob_binary"   # strict qSOFA-positive rows
#PROB_COL_TO_EXPORT = "qSOFA_Prob_graded"  # graded {0, 0.5, 1}

write_psv_predictions_from_df(train_thresh_q, PRED_DIR_THRESH_WORK, PROB_COL_TO_EXPORT)
write_psv_predictions_from_df(test_q, PRED_DIR_TEST_WORK, PROB_COL_TO_EXPORT)

print("Wrote:")
print("  Thresh work:", PRED_DIR_THRESH_WORK)
print("  Test work:  ", PRED_DIR_TEST_WORK)


Wrote:
  Thresh work: /teamspace/studios/this_studio/detecting_Sepsis/4_Evaluation/Predictions/qSOFA_PROXY_HighPreproc_NoFe/pred_psv_TRAIN_THRESH_WORK
  Test work:   /teamspace/studios/this_studio/detecting_Sepsis/4_Evaluation/Predictions/qSOFA_PROXY_HighPreproc_NoFe/pred_psv_TEST_WORK


In [6]:
# --- Aggressive alarm policy export (same workflow / folder layout) ---
# Policy: once qSOFA triggers at any hour for a patient, all subsequent hours stay positive.

POLICY_RUN_NAME = "qSOFA_aggresive_Alarm_Policy"  # folder name under 4_Evaluation/Predictions/

PRED_ROOT_ALARM = DATA_DIR.parent / "4_Evaluation" / "Predictions" / POLICY_RUN_NAME
PRED_DIR_ALARM_THRESH_WORK = PRED_ROOT_ALARM / "pred_psv_TRAIN_THRESH_WORK"
PRED_DIR_ALARM_TEST_WORK   = PRED_ROOT_ALARM / "pred_psv_TEST_WORK"

for d in [PRED_DIR_ALARM_THRESH_WORK, PRED_DIR_ALARM_TEST_WORK]:
    d.mkdir(parents=True, exist_ok=True)

def add_aggressive_alarm_policy(
    df: pd.DataFrame,
    base_label_col: str = "qSOFA_Label",
    patient_col: str = PATIENT_COL,
    time_col: str = TIME_COL,
) -> pd.DataFrame:
    """Return df with alarmed label/probability.

    - base_label_col is the per-row trigger (0/1).
    - Alarm policy: within each patient, cumulative max over time.
    """
    if base_label_col not in df.columns:
        raise KeyError(f"Missing base_label_col='{base_label_col}' in df.columns")

    if patient_col not in df.columns:
        raise KeyError(f"Missing patient_col='{patient_col}' in df.columns")

    # Time col fallback (keeps notebook robust across variants)
    if time_col not in df.columns:
        if "Hour" not in df.columns:
            raise KeyError(f"Missing time column: expected '{time_col}' or 'Hour'")
        time_col = "Hour"

    out = df.sort_values([patient_col, time_col]).copy()

    out["qSOFA_AlarmLabel"] = (
        out.groupby(patient_col)[base_label_col]
           .transform(lambda s: s.fillna(0).astype(int).cummax())
           .astype("int8")
    )
    out["qSOFA_AlarmProb"] = out["qSOFA_AlarmLabel"].astype("float32")
    return out

# Apply policy to the same splits you already export for utility evaluation
train_thresh_alarm = add_aggressive_alarm_policy(train_thresh_q)
test_alarm         = add_aggressive_alarm_policy(test_q)

# Export with the same PSV conventions (one file per patient, p000000.psv naming)
write_psv_predictions_from_df(train_thresh_alarm, PRED_DIR_ALARM_THRESH_WORK, prob_col="qSOFA_AlarmProb")
write_psv_predictions_from_df(test_alarm,         PRED_DIR_ALARM_TEST_WORK,   prob_col="qSOFA_AlarmProb")

print("Wrote aggressive alarm-policy predictions to:")
print("  Thresh work:", PRED_DIR_ALARM_THRESH_WORK)
print("  Test work:  ", PRED_DIR_ALARM_TEST_WORK)


Wrote aggressive alarm-policy predictions to:
  Thresh work: /teamspace/studios/this_studio/detecting_Sepsis/4_Evaluation/Predictions/qSOFA_aggresive_Alarm_Policy/pred_psv_TRAIN_THRESH_WORK
  Test work:   /teamspace/studios/this_studio/detecting_Sepsis/4_Evaluation/Predictions/qSOFA_aggresive_Alarm_Policy/pred_psv_TEST_WORK
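
The latching behavior of the aggressive alarm policy can be illustrated on a toy frame. This is a minimal sketch with made-up values; it uses pandas' grouped `cummax` directly, which should give the same result as the cumulative maximum in the cell above when the label column has no missing values.

```python
import pandas as pd

# Toy frame (made-up values): patient 1 triggers at hour 2, patient 2 never does.
toy = pd.DataFrame({
    "Patient_ID":  [1, 1, 1, 1, 2, 2, 2],
    "ICULOS":      [1, 2, 3, 4, 1, 2, 3],
    "qSOFA_Label": [0, 1, 0, 0, 0, 0, 0],
})

# Sort by time within patient so the cumulative max runs forward in time.
toy = toy.sort_values(["Patient_ID", "ICULOS"])

# Once a patient has a 1, every later hour stays 1.
toy["qSOFA_AlarmLabel"] = toy.groupby("Patient_ID")["qSOFA_Label"].cummax()

print(toy)
```

The alarm for patient 1 stays on for hours 3 and 4 even though the per-row label drops back to 0, and the trigger never carries over to patient 2.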


## Next step: compute utility with your existing evaluation notebook

You can now run your existing `Threshold_Sweep_And_Official_Eval.ipynb` and point it to:

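- `PRED_DIR_THRESH_WORK` (for threshold selection + utility on the threshold split)
- `PRED_DIR_TEST_WORK` (for final evaluation/export)

To evaluate the aggressive alarm policy instead, use `PRED_DIR_ALARM_THRESH_WORK` and `PRED_DIR_ALARM_TEST_WORK`.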

Notes:
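- If you exported **binary** probabilities (0/1), every threshold strictly between 0 and 1 yields the same predictions, so the sweep reduces to a single fixed classifier.
- If you exported **graded** probabilities (0/0.5/1), the sweep has two distinct operating points: thresholds below 0.5 flag rows where at least one criterion fires, and thresholds above 0.5 flag only rows where both fire.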
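
The effect of the export choice on a threshold sweep can be seen by counting how many distinct prediction vectors the sweep produces. The sketch below uses made-up probabilities and assumes the decision rule `prob >= threshold`; the evaluation notebook may use a different rule, and `distinct_predictions` is a hypothetical helper.

```python
import numpy as np

# Made-up exports for the same five rows.
binary = np.array([0.0, 0.0, 1.0, 0.0, 1.0])
graded = np.array([0.0, 0.5, 1.0, 0.5, 1.0])

# Thresholds 0.05, 0.10, ..., 0.95.
thresholds = np.round(np.arange(0.05, 1.0, 0.05), 2)

def distinct_predictions(prob: np.ndarray, thresholds: np.ndarray) -> int:
    """Count distinct label vectors produced by `prob >= t` over the sweep."""
    return len({tuple((prob >= t).astype(int).tolist()) for t in thresholds})

n_binary = distinct_predictions(binary, thresholds)
n_graded = distinct_predictions(graded, thresholds)

print("distinct operating points (binary):", n_binary)
print("distinct operating points (graded):", n_graded)
```

With binary scores the sweep collapses to one operating point, while graded scores give two.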


In [7]:
print("RUN_NAME:", RUN_NAME)
print("Thresh work folder:", PRED_DIR_THRESH_WORK)
print("Test work folder:  ", PRED_DIR_TEST_WORK)
print("Exported prob col: ", PROB_COL_TO_EXPORT)


RUN_NAME: qSOFA_PROXY_HighPreproc_NoFe
Thresh work folder: /teamspace/studios/this_studio/detecting_Sepsis/4_Evaluation/Predictions/qSOFA_PROXY_HighPreproc_NoFe/pred_psv_TRAIN_THRESH_WORK
Test work folder:   /teamspace/studios/this_studio/detecting_Sepsis/4_Evaluation/Predictions/qSOFA_PROXY_HighPreproc_NoFe/pred_psv_TEST_WORK
Exported prob col:  qSOFA_Prob_binary
