# CAP5610 HW3 — Tree Ensembles & SHAP Study

This notebook mirrors Homework 3 and is self-contained: with the two datasets (`lncRNA_5_Cancers.csv` and `hw3-drug-screening-data.csv`) placed beside it, you can rerun every block to regenerate the figures, tables, and SHAP artefacts required in the assignment.

In [33]:
%%capture --no-stderr
%pip install --quiet numpy pandas scikit-learn xgboost lightgbm catboost shap psutil 'tqdm[joblib]' polars pyarrow

## Study Roadmap
1. **Shared setup** — import libraries, configure reproducible knobs, and define reusable helpers.
2. **Task 1** — explore the classification dataset and benchmark tree-based classifiers.
3. **Task 2** — interpret the best classifier with SHAP (per-cancer importances + patient force plots).
4. **Task 3** — profile the regression dataset and compare regressors on MAE/MSE/RMSE/R².
5. **Task 4** — run SHAP on the winning regressor for drug-specific insights and least-error explanation.
6. **Conclusion** — summarise artefacts and next steps.

## Optimization Progress Tracker
| Task | Description | Status |
| --- | --- | --- |
| Baseline Benchmarking | Measure end-to-end runtime, memory peaks, and per-stage costs for Tasks 1–4. | ☑ Completed |
| Data Ingestion Optimizations | Evaluate faster file readers, column pruning, and memory mapping strategies. | ☑ Completed |
| Feature Engineering Efficiency | Explore incremental variance selection, sparse representations, and caching pipelines. | ☑ Completed |
| Model Training Parallelism | Investigate multi-core settings, histogram optimizations, and distributed runners for each algorithm. | ☑ Completed |
| SHAP Acceleration | Profile TreeExplainer usage, test approximate SHAP (e.g., FastTreeSHAP), and batch visualisation. | ☑ Completed |
| Experiment Automation | Persistent progress bar + checkpoint system for resumable notebook runs. | ☑ Completed |
| Literature & Benchmark Survey | Compile findings from large-scale ML challenges (e.g., 1BR challenge) for applicable techniques. | ☐ Not started |
| Implementation Plan | Prioritise quick wins vs. deep refactors; outline test matrix for regression coverage. | ☐ Not started |
| Reporting & Validation | Document runtime improvements, ensure parity with assignment outputs, and update report narrative. | ☐ Not started |


## 0. Shared Setup & Reusable Utilities
The following cell loads all required libraries, defines runtime configuration (random seed, feature caps, SHAP sampling budgets), and introduces helper routines for logging and path resolution. Keeping the helpers here ensures the notebook is standalone.

In [34]:
# --- Imports & global configuration -------------------------------------------------
from pathlib import Path
import gc
import hashlib
import json
import math
import os
import re
import time
import warnings
from functools import wraps
from time import perf_counter
from typing import Dict, List, Optional, Tuple

import numpy as np
import pandas as pd
import psutil
from IPython.display import HTML, display
from tqdm.auto import tqdm
try:
    from tqdm.joblib import tqdm_joblib
except Exception:
    from contextlib import contextmanager
    @contextmanager
    def tqdm_joblib(*args, **kwargs):
        yield
from joblib import Parallel, delayed, dump, load

from sklearn.base import clone
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import (
    GradientBoostingClassifier,
    GradientBoostingRegressor,
    RandomForestClassifier,
    RandomForestRegressor,
)
from sklearn.impute import SimpleImputer
from sklearn.metrics import (
    accuracy_score,
    classification_report,
    confusion_matrix,
    f1_score,
    mean_absolute_error,
    mean_squared_error,
    r2_score,
)
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

try:
    from xgboost import XGBClassifier, XGBRegressor
    HAVE_XGB = True
except Exception:
    HAVE_XGB = False

try:
    from lightgbm import LGBMClassifier, LGBMRegressor
    HAVE_LGBM = True
except Exception:
    HAVE_LGBM = False

try:
    from catboost import CatBoostClassifier, CatBoostRegressor
    HAVE_CAT = True
except Exception:
    HAVE_CAT = False

try:
    import shap
    HAVE_SHAP = True
except Exception:
    HAVE_SHAP = False

try:
    import polars as pl
    HAVE_POLARS = True
except Exception:
    HAVE_POLARS = False

warnings.filterwarnings("ignore", category=UserWarning)

# Global experiment knobs — identical to the accompanying report.
RANDOM_STATE = 42
CANCER_SET = {"KIRC", "LUAD", "LUSC", "PRAD", "THCA"}
PATIENT_ID_TO_PLOT = "TCGA-39-5011-01A"

# Dataset locations — the notebook first looks beside itself, then inside data/raw/ if present.
CANCER_PRIMARY = Path("lncRNA_5_Cancers.csv")
CANCER_FALLBACK = Path("data/raw/lncRNA_5_Cancers.csv")
REG_PRIMARY = Path("GDSC2_13drugs.csv")
REG_FALLBACK = Path("data/raw/hw3-drug-screening-data.csv")

# Output directory mirrors the course instructions.
OUT_DIR = Path("hw3_outputs")
OUT_DIR.mkdir(parents=True, exist_ok=True)
FEATURE_CACHE_DIR = OUT_DIR / "feature_cache"
FEATURE_CACHE_DIR.mkdir(parents=True, exist_ok=True)
CHECKPOINT_DIR = OUT_DIR / "checkpoints"
CHECKPOINT_DIR.mkdir(parents=True, exist_ok=True)

# Memory-aware knobs — conservative defaults to keep SHAP tractable on laptops.
MAX_FEATURES_CLASSIF = 1000
MAX_FEATURES_REGRESS = 1200
SHAP_SAMPLES_PER_CLASS = 30
SHAP_SAMPLES_REG = 100
SHAP_BACKGROUND_SIZE = 256
ENABLE_SHAP_CACHE = True
BACKGROUND_SIZE = SHAP_BACKGROUND_SIZE  # backwards compatibility
MAX_PARALLEL = max(1, min((os.cpu_count() or 1), 4))

# Experiment orchestration — persists progress so long runs are resumable.
EXPERIMENT_STEPS = [
    "Task 1: data load",
    "Task 1: model comparison",
    "Task 2: classifier SHAP",
    "Task 3: data load",
    "Task 3: regressor comparison",
    "Task 4: regressor SHAP",
]


class ExperimentTracker:
    """Persist experiment status, durations, and failure notes for recovery."""

    def __init__(self, steps: List[str]):
        self.steps = steps
        self.state_path = OUT_DIR / "experiment_state.json"
        self.state = {step: {"status": "pending"} for step in steps}
        if self.state_path.exists():
            try:
                loaded = json.loads(self.state_path.read_text())
            except json.JSONDecodeError:
                loaded = {}
            for step, info in loaded.items():
                if step in self.state:
                    self.state[step].update(info)
        completed = sum(1 for info in self.state.values() if info.get("status") == "completed")
        total = len(steps)
        self.bar = tqdm(total=total, desc="Experiment automation", leave=False, dynamic_ncols=True)
        if completed:
            self.bar.update(completed)
        self.bar.set_postfix_str("Idle")

    def save(self) -> None:
        self.state_path.write_text(json.dumps(self.state, indent=2))

    def start(self, step: str) -> None:
        info = self.state.setdefault(step, {})
        info.update(
            {
                "status": "running",
                "started_at": time.strftime("%Y-%m-%d %H:%M:%S"),
            }
        )
        self.save()
        self.bar.set_postfix_str(f"Running: {step}")

    def complete(self, step: str, metadata: Optional[Dict[str, float]] = None) -> None:
        info = self.state.setdefault(step, {})
        prev_status = info.get("status")
        info.update(
            {
                "status": "completed",
                "completed_at": time.strftime("%Y-%m-%d %H:%M:%S"),
            }
        )
        if metadata:
            info["metrics"] = {k: float(v) for k, v in metadata.items()}
        self.save()
        if prev_status != "completed":
            self.bar.update(1)
        self.bar.set_postfix_str("Idle")

    def fail(
        self,
        step: str,
        error: Exception,
        seconds: Optional[float] = None,
        delta_gb: Optional[float] = None,
    ) -> None:
        info = self.state.setdefault(step, {})
        payload = {
            "status": "failed",
            "failed_at": time.strftime("%Y-%m-%d %H:%M:%S"),
            "error": f"{error.__class__.__name__}: {error}",
        }
        if seconds is not None:
            payload["seconds"] = float(seconds)
        if delta_gb is not None:
            payload["delta_gb"] = float(delta_gb)
        info.update(payload)
        self.save()
        self.bar.set_postfix_str(f"Failed: {step}")


EXPERIMENT_TRACKER = ExperimentTracker(EXPERIMENT_STEPS)


def checkpoint_path(label: str) -> Path:
    """Return the on-disk location for a checkpoint bundle."""
    return CHECKPOINT_DIR / f"{label}.joblib"


def has_checkpoint(label: str) -> bool:
    return checkpoint_path(label).exists()


def load_checkpoint(label: str):
    return load(checkpoint_path(label))


def save_checkpoint(label: str, payload: Dict[str, object]) -> None:
    dump(payload, checkpoint_path(label), compress=3)


def fingerprint_columns(columns: List[str]) -> str:
    """Stable SHA1 fingerprint for a column ordering (ensures checkpoint compatibility)."""
    joined = "|".join(columns)
    return hashlib.sha1(joined.encode("utf-8")).hexdigest()


def dataset_signature(path: Path) -> str:
    """Hash file metadata so checkpoints invalidate when upstream data changes."""
    stat = path.stat()
    raw = f"{path.resolve()}|{stat.st_size}|{stat.st_mtime_ns}"
    return hashlib.sha1(raw.encode("utf-8")).hexdigest()


def log(message: str) -> None:
    """Timestamped logger so notebook output resembles an experiment log."""
    stamp = time.strftime("%H:%M:%S")
    print(f"[{stamp}] {message}")


PROCESS = psutil.Process(os.getpid())
TIMINGS: List[Dict[str, float]] = []


def current_memory_gb() -> float:
    """Return process RSS memory in gigabytes."""
    return PROCESS.memory_info().rss / (1024 ** 3)


def timed_step(label: str):
    """Decorator to time functions, update checkpoints, and log duration/memory usage."""

    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            if EXPERIMENT_TRACKER is not None:
                EXPERIMENT_TRACKER.start(label)
            start_time = perf_counter()
            start_mem = current_memory_gb()
            try:
                result = func(*args, **kwargs)
            except Exception as exc:  # noqa: BLE001 — we want to persist the failure.
                elapsed = perf_counter() - start_time
                end_mem = current_memory_gb()
                if EXPERIMENT_TRACKER is not None:
                    EXPERIMENT_TRACKER.fail(label, exc, seconds=elapsed, delta_gb=end_mem - start_mem)
                raise
            duration = perf_counter() - start_time
            end_mem = current_memory_gb()
            TIMINGS.append(
                {
                    "step": label,
                    "seconds": duration,
                    "start_gb": start_mem,
                    "end_gb": end_mem,
                    "delta_gb": end_mem - start_mem,
                }
            )
            log(f"{label} finished in {duration:.2f}s (Δmem {end_mem - start_mem:.3f} GB)")
            if EXPERIMENT_TRACKER is not None:
                EXPERIMENT_TRACKER.complete(
                    label,
                    metadata={"seconds": duration, "delta_gb": end_mem - start_mem},
                )
            return result

        return wrapper

    return decorator


def shap_cache_path(label: str) -> Path:
    """Return the cache file path for a SHAP artefact."""
    return FEATURE_CACHE_DIR / f"{label}.npz"


def select_top_variance_features(frame: pd.DataFrame, max_features: int, cache_label: Optional[str] = None) -> pd.DataFrame:
    """Retain high-variance features using vectorised NumPy ops and optional caching."""
    if frame.shape[1] <= max_features:
        return frame
    cache_path: Optional[Path] = None
    if cache_label:
        cache_path = FEATURE_CACHE_DIR / f"{cache_label}_{max_features}.json"
        if cache_path.exists():
            columns = json.loads(cache_path.read_text())
            columns = [col for col in columns if col in frame.columns]
            if len(columns) == max_features:
                log(f"Reusing cached feature mask for {cache_label} ({max_features} columns).")
                return frame.loc[:, columns]
    data = frame.to_numpy(dtype=np.float32, copy=False)
    with np.errstate(invalid="ignore"):
        variances = np.nanvar(data, axis=0)
    top_idx = np.argpartition(variances, -max_features)[-max_features:]
    top_idx = top_idx[np.argsort(variances[top_idx])[::-1]]
    columns = frame.columns[top_idx].tolist()
    if cache_path:
        cache_path.write_text(json.dumps(columns))
    return frame.loc[:, columns]


def tree_background(sample: np.ndarray) -> np.ndarray:
    """Downsample a SHAP background dataset for tree explainers (linear cost w.r.t. columns)."""
    if sample.shape[0] <= SHAP_BACKGROUND_SIZE:
        return sample
    rng = np.random.default_rng(RANDOM_STATE)
    idx = rng.choice(sample.shape[0], size=SHAP_BACKGROUND_SIZE, replace=False)
    return sample[idx]


Experiment automation:  17%|█▋        | 1/6 [00:00<00:00, 42366.71it/s, Idle]

## 1. Data Loading Utilities
These loaders detect identifier/target columns, enforce numeric typing, drop all-NaN features, and apply the variance cap defined above.

In [35]:
def resolve_data_path(primary: Path, fallback: Optional[Path]) -> Optional[Path]:
    """Return the first available dataset path among primary and fallback hints."""
    for candidate in (primary, fallback):
        if candidate is None:
            continue
        candidate_path = Path(candidate)
        if candidate_path.exists():
            return candidate_path
    return None

def detect_id_and_target(df: pd.DataFrame) -> Tuple[Optional[str], Optional[str]]:
    id_col = None
    target_col = None
    for col in df.select_dtypes(include=["object"]).columns:
        if df[col].astype(str).str.contains(r"^TCGA-", na=False).any():
            id_col = col
            break
    if id_col is None:
        for col in df.columns:
            if re.search(r"(id|patient|sample)", col, re.I) and df[col].nunique(dropna=True) > 10:
                id_col = col
                break
    for col in df.columns:
        values = set(map(str, df[col].dropna().unique()))
        if values.issubset(CANCER_SET) and len(values) == len(CANCER_SET):
            target_col = col
            break
    if target_col is None:
        for col in df.columns:
            if re.search(r"(cancer|type|label|class)", col, re.I):
                target_col = col
                break
    return id_col, target_col


def read_csv_efficient(path: Path, usecols: Optional[List[str]] = None, nrows: Optional[int] = None):
    """Fast CSV reader that prefers Polars when available."""
    if HAVE_POLARS:
        try:
            if nrows is not None:
                df_pl = pl.read_csv(path, columns=usecols, n_rows=nrows)
            else:
                df_pl = pl.read_csv(path, columns=usecols, low_memory=True)
            return df_pl.to_pandas(use_pyarrow_extension_array=False)
        except Exception as exc:
            log(f"Polars read failed ({exc}); falling back to pandas.")
    return pd.read_csv(path, usecols=usecols, nrows=nrows)


@timed_step("Task 1: data load")
def memory_savvy_read_cancers(csv_path: Path, max_features: int) -> Tuple[pd.DataFrame, pd.Series, Optional[pd.Series], str, Optional[str]]:
    dataset_sig = dataset_signature(csv_path)
    checkpoint_label = f"task1_data_{dataset_sig}_{max_features}"
    if has_checkpoint(checkpoint_label):
        cached = load_checkpoint(checkpoint_label)
        if cached.get("signature") == dataset_sig and cached.get("max_features") == max_features:
            log("Loaded classification dataset from checkpoint.")
            return (
                cached["X"],
                cached["y"],
                cached.get("ids"),
                cached["target_col"],
                cached.get("id_col"),
            )

    header_cols = read_csv_efficient(csv_path, nrows=0).columns.tolist()
    sample_df = read_csv_efficient(csv_path, nrows=200)
    id_col, target_col = detect_id_and_target(sample_df)
    if target_col is None:
        raise RuntimeError("Unable to detect target column in classification dataset.")

    feature_cols = [c for c in header_cols if c not in {id_col, target_col}]
    selected_cols = feature_cols[:max_features]
    usecols = [target_col] + ([id_col] if id_col else []) + selected_cols

    df = read_csv_efficient(csv_path, usecols=usecols)
    y = df[target_col].astype(str)
    X = df.drop(columns=[target_col])

    ids = None
    if id_col and id_col in X.columns:
        ids = X[id_col].astype(str)
        X = X.drop(columns=[id_col])

    for col in X.columns:
        X[col] = pd.to_numeric(X[col], errors="coerce")
    X = X.astype(np.float32)
    X = X.loc[:, X.notna().any(axis=0)]
    X = select_top_variance_features(X, max_features, cache_label=f"{csv_path.stem}_class")
    result = (X, y, ids, target_col, id_col)
    save_checkpoint(
        checkpoint_label,
        {
            "X": X,
            "y": y,
            "ids": ids,
            "target_col": target_col,
            "id_col": id_col,
            "max_features": max_features,
            "signature": dataset_sig,
            "columns_sig": fingerprint_columns(list(X.columns)),
            "shape": X.shape,
        },
    )
    gc.collect()
    return result


@timed_step("Task 3: data load")
def memory_savvy_read_gdsc2(csv_path: Path, max_features: int) -> Tuple[pd.DataFrame, pd.Series, pd.Series, Dict[str, object]]:
    dataset_sig = dataset_signature(csv_path)
    checkpoint_label = f"task3_data_{dataset_sig}_{max_features}"
    if has_checkpoint(checkpoint_label):
        cached = load_checkpoint(checkpoint_label)
        if cached.get("signature") == dataset_sig and cached.get("max_features") == max_features:
            log("Loaded regression dataset from checkpoint.")
            return cached["X"], cached["y"], cached["keys"], cached["meta"]

    header_cols = read_csv_efficient(csv_path, nrows=0).columns.tolist()
    target_col = "LN_IC50"

    id_cols: List[str] = []
    for cand in ["CELL_LINE_NAME", "cell_line", "CELL_LINE", "CellLine", "cellLine", "cell_line_name"]:
        if cand in header_cols:
            id_cols.append(cand)
            break
    for cand in ["DRUG_NAME", "drug_name", "Drug", "DRUG", "drug"]:
        if cand in header_cols:
            id_cols.append(cand)
            break

    if target_col not in header_cols:
        raise RuntimeError("Expected LN_IC50 column missing in regression dataset.")
    if not id_cols:
        raise RuntimeError("Could not detect cell line / drug identifier columns.")

    feature_cols = [c for c in header_cols if c not in id_cols + [target_col]]
    selected_cols = feature_cols[:max_features]
    usecols = id_cols + [target_col] + selected_cols

    df = read_csv_efficient(csv_path, usecols=usecols)
    y = pd.to_numeric(df[target_col], errors="coerce").astype(np.float32)

    if len(id_cols) >= 2:
        keys = df[id_cols[0]].astype(str) + "|" + df[id_cols[1]].astype(str)
    else:
        keys = df[id_cols[0]].astype(str)

    X = df.drop(columns=id_cols + [target_col])
    for col in X.columns:
        X[col] = pd.to_numeric(X[col], errors="coerce")
    X = X.astype(np.float32)
    X = X.loc[:, X.notna().any(axis=0)]
    X = select_top_variance_features(X, max_features, cache_label=f"{csv_path.stem}_reg")
    meta = {
        "n_rows": len(df),
        "n_features": X.shape[1],
        "id_cols": id_cols,
        "target": target_col,
    }
    result = (X, y, keys, meta)
    save_checkpoint(
        checkpoint_label,
        {
            "X": X,
            "y": y,
            "keys": keys,
            "meta": meta,
            "max_features": max_features,
            "signature": dataset_sig,
            "columns_sig": fingerprint_columns(list(X.columns)),
            "shape": X.shape,
        },
    )
    gc.collect()
    return result


### Data Loading Optimisations
- Prefer **Polars** for CSV ingestion when available (falls back to pandas if installation fails).
- Added lightweight timing/memory instrumentation via `timed_step`, now covering data ingestion in addition to modelling.
- Future runs will highlight load times in the optimisation summary table above.

### Feature Engineering Optimisations
- Cached variance-based feature subsets per dataset to avoid recomputation across runs.
- Switched variance computation to vectorised NumPy for faster execution on wide matrices.
- Reused the Polars-backed reader from the ingestion pass to minimise conversion overhead.

## 2. Modelling & SHAP Utilities
These helpers encapsulate the repetitive parts of Tasks 1–4: training the model suites, logging metrics, and generating SHAP summaries. Artefacts are written to `hw3_outputs/` for direct inclusion in the report.

### Training Optimisations
- Leveraged joblib with a `tqdm`-backed progress bar to evaluate models in parallel (bounded by available CPU cores).
- Ensured all tree ensembles use multi-threaded backends (`n_jobs=-1` where supported).
- Pipelines are re-fit on the full dataset only for the winning model, reducing redundant estimator training.

In [36]:
@timed_step("Task 1: model comparison")
def train_compare_classifiers(X: pd.DataFrame, y: pd.Series, random_state: int) -> Tuple[pd.DataFrame, Pipeline, Dict[int, str]]:
    """Train the required classifiers with parallel execution."""
    class_names = sorted(y.astype(str).unique())
    class_to_idx = {c: i for i, c in enumerate(class_names)}
    idx_to_class = {i: c for c, i in class_to_idx.items()}
    col_signature = fingerprint_columns(list(X.columns))
    model_checkpoint = f"task1_model_{col_signature}_{len(X)}"

    if has_checkpoint(model_checkpoint):
        cached = load_checkpoint(model_checkpoint)
        if cached.get("random_state") == random_state and cached.get("class_names") == class_names:
            log("Loaded Task 1 model comparison from checkpoint.")
            metrics_df = cached["metrics_df"]
            cm_df = cached["confusion_df"]
            report_df = cached["report_df"]
            best_pipeline = cached["best_pipeline"]
            idx_to_class = cached["idx_to_class"]
            best_name = cached["best_model_name"]
            metrics_df.to_csv(OUT_DIR / "task1_model_comparison.csv", index=False)
            cm_df.to_csv(OUT_DIR / "task1_confusion_matrix.csv")
            report_df.to_csv(OUT_DIR / "task1_classification_report.csv")
            (OUT_DIR / "task1_best_model.txt").write_text(str(best_name))
            return metrics_df, best_pipeline, idx_to_class

    X_train, X_test, y_train, y_test = train_test_split(
        X,
        y,
        test_size=0.2,
        random_state=random_state,
        stratify=y,
    )

    y_train_series = pd.Series(y_train)
    y_test_series = pd.Series(y_test)
    y_train_encoded = y_train_series.map(class_to_idx).astype(int)
    y_test_array = y_test_series.to_numpy()

    base_estimators: Dict[str, object] = {
        "DecisionTree": DecisionTreeClassifier(random_state=random_state, min_samples_leaf=2, class_weight="balanced"),
        "RandomForest": RandomForestClassifier(n_estimators=120, random_state=random_state, n_jobs=-1, class_weight="balanced_subsample"),
        "GBM": GradientBoostingClassifier(n_estimators=120, learning_rate=0.05, max_depth=3, random_state=random_state),
    }
    if HAVE_XGB:
        base_estimators["XGBoost"] = XGBClassifier(
            objective="multi:softprob",
            eval_metric="mlogloss",
            n_estimators=160,
            learning_rate=0.05,
            max_depth=6,
            subsample=0.8,
            colsample_bytree=0.8,
            tree_method="hist",
            n_jobs=MAX_PARALLEL,
            random_state=random_state,
        )
    if HAVE_LGBM:
        base_estimators["LightGBM"] = LGBMClassifier(
            n_estimators=160,
            learning_rate=0.05,
            num_leaves=63,
            subsample=0.8,
            colsample_bytree=0.8,
            random_state=random_state,
            n_jobs=-1,
        )
    if HAVE_CAT:
        base_estimators["CatBoost"] = CatBoostClassifier(
            iterations=160,
            learning_rate=0.05,
            depth=6,
            loss_function="MultiClass",
            random_seed=random_state,
            verbose=False,
        )

    pipelines = {
        name: Pipeline([
            ("imputer", SimpleImputer(strategy="median", copy=False)),
            ("model", estimator),
        ])
        for name, estimator in base_estimators.items()
    }

    def _fit_pipeline(name: str, pipe: Pipeline):
        local_pipe = clone(pipe)
        local_pipe.fit(X_train, y_train_encoded.to_numpy())
        preds_idx = local_pipe.predict(X_test)
        preds_idx = np.asarray(preds_idx)
        if preds_idx.ndim > 1:
            preds_idx = preds_idx.argmax(axis=1)
        preds_idx = preds_idx.astype(int)
        preds_array = np.array([idx_to_class[int(i)] for i in preds_idx], dtype=object)
        acc = accuracy_score(y_test_array, preds_array)
        f1m = f1_score(y_test_array, preds_array, average="macro")
        return {
            "Model": name,
            "Test_Accuracy": acc,
            "Test_F1_Macro": f1m,
            "preds": preds_array,
        }

    tasks = list(pipelines.items())
    parallelism = min(MAX_PARALLEL, len(tasks))

    with tqdm_joblib(tqdm(total=len(tasks), desc="Task 1: classifiers", unit="model", leave=False)):
        fitted = Parallel(n_jobs=parallelism, backend="loky")(delayed(_fit_pipeline)(name, pipe) for name, pipe in tasks)

    metrics_records = [{"Model": entry["Model"], "Test_Accuracy": entry["Test_Accuracy"], "Test_F1_Macro": entry["Test_F1_Macro"]} for entry in fitted]
    metrics_df = pd.DataFrame(metrics_records).sort_values(["Test_F1_Macro", "Test_Accuracy"], ascending=False).reset_index(drop=True)

    best_name = metrics_df.iloc[0]["Model"]
    best_entry = next(entry for entry in fitted if entry["Model"] == best_name)
    best_preds = best_entry["preds"]

    cm = confusion_matrix(y_test, best_preds, labels=class_names)
    cm_df = pd.DataFrame(cm, index=[f"True_{c}" for c in class_names], columns=[f"Pred_{c}" for c in class_names])
    cm_df.to_csv(OUT_DIR / "task1_confusion_matrix.csv")

    report_df = pd.DataFrame(classification_report(
        y_test,
        best_preds,
        output_dict=True,
        zero_division=0,
        labels=class_names,
        target_names=class_names,
    )).T
    report_df.to_csv(OUT_DIR / "task1_classification_report.csv")

    metrics_df.to_csv(OUT_DIR / "task1_model_comparison.csv", index=False)
    (OUT_DIR / "task1_best_model.txt").write_text(str(best_name))

    best_pipeline = clone(pipelines[best_name])
    y_full_encoded = pd.Series(y).map(class_to_idx).astype(int).to_numpy()
    best_pipeline.fit(X, y_full_encoded)

    save_checkpoint(
        model_checkpoint,
        {
            "metrics_df": metrics_df,
            "confusion_df": cm_df,
            "report_df": report_df,
            "best_pipeline": best_pipeline,
            "best_model_name": best_name,
            "idx_to_class": idx_to_class,
            "random_state": random_state,
            "class_names": class_names,
            "columns_sig": col_signature,
            "n_rows": len(X),
        },
    )
    gc.collect()

    return metrics_df, best_pipeline, idx_to_class


@timed_step("Task 3: regressor comparison")
def train_compare_regressors(X: pd.DataFrame, y: pd.Series, random_state: int) -> Tuple[pd.DataFrame, Pipeline]:
    """Train the regression suite in parallel and return metrics plus the best pipeline."""
    col_signature = fingerprint_columns(list(X.columns))
    model_checkpoint = f"task3_model_{col_signature}_{len(X)}"

    if has_checkpoint(model_checkpoint):
        cached = load_checkpoint(model_checkpoint)
        if cached.get("random_state") == random_state:
            log("Loaded Task 3 regressor comparison from checkpoint.")
            metrics_df = cached["metrics_df"]
            best_pipeline = cached["best_pipeline"]
            best_name = cached["best_model_name"]
            metrics_df.to_csv(OUT_DIR / "task3_regressor_comparison.csv", index=False)
            (OUT_DIR / "task3_best_model.txt").write_text(str(best_name))
            return metrics_df, best_pipeline

    X_train, X_test, y_train, y_test = train_test_split(
        X,
        y,
        test_size=0.2,
        random_state=random_state,
    )

    categorical_features = [col for col in ["CELL_LINE_NAME", "DRUG_NAME"] if col in X.columns]
    numeric_features = [col for col in X.columns if col not in categorical_features]

    cat_transformer = Pipeline([
        ("imputer", SimpleImputer(strategy="most_frequent", copy=False)),
        ("encoder", OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)),
    ])
    num_transformer = Pipeline([
        ("imputer", SimpleImputer(strategy="median", copy=False)),
    ])

    preprocessor = ColumnTransformer([
        ("cat", cat_transformer, categorical_features),
        ("num", num_transformer, numeric_features),
    ])

    base_estimators: Dict[str, object] = {
        "DecisionTreeReg": DecisionTreeRegressor(random_state=random_state, min_samples_leaf=2),
        "RandomForestReg": RandomForestRegressor(n_estimators=120, random_state=random_state, n_jobs=-1),
        "GBMReg": GradientBoostingRegressor(n_estimators=120, learning_rate=0.05, max_depth=3, random_state=random_state),
    }
    if HAVE_XGB:
        base_estimators["XGBReg"] = XGBRegressor(
            n_estimators=160,
            learning_rate=0.05,
            max_depth=6,
            subsample=0.8,
            colsample_bytree=0.8,
            tree_method="hist",
            n_jobs=MAX_PARALLEL,
            random_state=random_state,
        )
    if HAVE_LGBM:
        base_estimators["LGBMReg"] = LGBMRegressor(
            n_estimators=160,
            learning_rate=0.05,
            num_leaves=63,
            subsample=0.8,
            colsample_bytree=0.8,
            random_state=random_state,
            n_jobs=-1,
        )
    if HAVE_CAT:
        base_estimators["CatBoostReg"] = CatBoostRegressor(
            iterations=160,
            learning_rate=0.05,
            depth=6,
            loss_function="RMSE",
            random_seed=random_state,
            verbose=False,
        )

    pipelines = {
        name: Pipeline([
            ("preprocessor", preprocessor),
            ("reg", estimator),
        ])
        for name, estimator in base_estimators.items()
    }

    def _fit_pipeline(name: str, pipe: Pipeline):
        local_pipe = clone(pipe)
        local_pipe.fit(X_train, y_train)
        preds = local_pipe.predict(X_test)
        mae = mean_absolute_error(y_test, preds)
        mse = mean_squared_error(y_test, preds)
        rmse = math.sqrt(mse)
        r2 = r2_score(y_test, preds)
        return {
            "Model": name,
            "MAE": mae,
            "MSE": mse,
            "RMSE": rmse,
            "R2": r2,
        }

    tasks = list(pipelines.items())
    parallelism = min(MAX_PARALLEL, len(tasks))

    with tqdm_joblib(tqdm(total=len(tasks), desc="Task 3: regressors", unit="model", leave=False)):
        fitted = Parallel(n_jobs=parallelism, backend="loky")(delayed(_fit_pipeline)(name, pipe) for name, pipe in tasks)

    metrics_records = [{
        "Model": entry["Model"],
        "MAE": entry["MAE"],
        "MSE": entry["MSE"],
        "RMSE": entry["RMSE"],
        "R2": entry["R2"],
    } for entry in fitted]
    metrics_df = pd.DataFrame(metrics_records).sort_values(["RMSE", "MAE"], ascending=[True, True]).reset_index(drop=True)

    best_name = metrics_df.iloc[0]["Model"]

    metrics_df.to_csv(OUT_DIR / "task3_regressor_comparison.csv", index=False)
    (OUT_DIR / "task3_best_model.txt").write_text(str(best_name))

    best_pipeline = clone(pipelines[best_name])
    y_full_encoded = pd.Series(y).map(class_to_idx).astype(int).to_numpy()
    best_pipeline.fit(X, y_full_encoded)

    save_checkpoint(
        model_checkpoint,
        {
            "metrics_df": metrics_df,
            "best_pipeline": best_pipeline,
            "best_model_name": best_name,
            "random_state": random_state,
            "columns_sig": col_signature,
            "n_rows": len(X),
        },
    )
    gc.collect()

    return metrics_df, best_pipeline


@timed_step("Task 2: classifier SHAP")
def shap_task2(best_model: Pipeline, X: pd.DataFrame, y: pd.Series, sample_ids: Optional[pd.Series], patient_id: str, idx_to_class: Dict[int, str]) -> List[Tuple[str, object]]:
    if not HAVE_SHAP:
        raise ImportError("SHAP is required for Task 2. Re-run the installation cell if needed.")

    imputer = best_model.named_steps.get("imputer")
    if imputer is not None:
        X_matrix = imputer.transform(X)
        feature_names = list(getattr(imputer, "feature_names_in_", X.columns))
    else:
        X_matrix = X.to_numpy(dtype=np.float32, copy=False)
        feature_names = list(X.columns)

    background = tree_background(X_matrix)

    label_array = y.to_numpy()
    rng = np.random.default_rng(RANDOM_STATE)
    subset_idx: List[int] = []
    for cancer in tqdm(sorted(CANCER_SET), desc="Task 2a: sampling", leave=False):
        class_idx = np.where(label_array == cancer)[0]
        if class_idx.size == 0:
            continue
        if class_idx.size > SHAP_SAMPLES_PER_CLASS:
            class_idx = rng.choice(class_idx, SHAP_SAMPLES_PER_CLASS, replace=False)
        subset_idx.extend(class_idx.tolist())
    subset_idx = np.array(sorted(set(subset_idx)), dtype=int)
    if subset_idx.size == 0:
        subset_idx = np.arange(min(SHAP_SAMPLES_PER_CLASS * len(CANCER_SET), X_matrix.shape[0]))
    X_subset = X_matrix[subset_idx]

    model = best_model.named_steps["model"]
    explainer = shap.TreeExplainer(model, data=background, feature_perturbation="tree_path_dependent")

    cache_key = f"task2_{model.__class__.__name__.lower()}_{len(subset_idx)}_{X_matrix.shape[1]}"
    cache_file = shap_cache_path(cache_key)
    shap_by_class: Optional[List[np.ndarray]] = None
    expected_values: Optional[np.ndarray] = None
    if ENABLE_SHAP_CACHE and cache_file.exists():
        cache = np.load(cache_file, allow_pickle=True)
        cached_idx = cache["indices"]
        if cached_idx.shape == subset_idx.shape and np.array_equal(cached_idx, subset_idx):
            shap_by_class = [np.asarray(arr) for arr in cache["shap"]]
            expected_values = np.asarray(cache["expected"])
        else:
            cache_file.unlink(missing_ok=True)

    if shap_by_class is None:
        shap_output = explainer.shap_values(X_subset)
        if isinstance(shap_output, list):
            shap_by_class = [np.asarray(arr) for arr in shap_output]
        else:
            if shap_output.ndim == 3:
                shap_by_class = [np.asarray(shap_output[:, :, i]) for i in range(shap_output.shape[2])]
            else:
                shap_by_class = [np.asarray(shap_output)]
        expected_values = np.asarray(explainer.expected_value)
        if ENABLE_SHAP_CACHE:
            np.savez_compressed(cache_file, shap=np.array(shap_by_class, dtype=object), expected=expected_values, indices=subset_idx)
    else:
        expected_values = np.asarray(explainer.expected_value)

    records = []
    classes = list(model.classes_) if hasattr(model, "classes_") else list(range(len(shap_by_class)))

    def _label_name(class_value):
        if isinstance(class_value, (np.integer, int)):
            return idx_to_class.get(int(class_value), str(class_value))
        return str(class_value)

    for class_index, class_name in enumerate(tqdm(classes, desc="Task 2a: aggregation", leave=False)):
        shap_matrix = np.asarray(shap_by_class[class_index])
        mean_abs = np.abs(shap_matrix).mean(axis=0)
        top_idx = np.argsort(mean_abs)[::-1][:10]
        label_name = _label_name(class_name)
        for rank, feat_idx in enumerate(top_idx, start=1):
            records.append(
                {
                    "CancerType": label_name,
                    "Rank": rank,
                    "Feature": feature_names[feat_idx],
                    "Mean|SHAP|": float(mean_abs[feat_idx]),
                }
            )
    pd.DataFrame(records).to_csv(OUT_DIR / "task2a_top10_features_per_cancer.csv", index=False)

    shap.initjs()
    if sample_ids is not None and sample_ids.notna().any():
        matches = sample_ids[sample_ids.astype(str) == patient_id]
        patient_position = matches.index[0] if not matches.empty else 0
    else:
        patient_position = 0
    row_matrix = X_matrix[[patient_position]]

    shap_row = explainer.shap_values(row_matrix)
    if isinstance(shap_row, list):
        shap_rows = [np.asarray(arr) for arr in shap_row]
    else:
        if shap_row.ndim == 3:
            shap_rows = [np.asarray(shap_row[:, :, i]) for i in range(shap_row.shape[2])]
        else:
            shap_rows = [np.asarray(shap_row)]

    inline_plots: List[Tuple[str, object]] = []
    expected_array = np.asarray(explainer.expected_value)
    for class_index, class_name in enumerate(tqdm(classes, desc="Task 2b: force plots", leave=False)):
        expected = expected_array[class_index] if expected_array.ndim > 0 else expected_array
        label_name = _label_name(class_name)
        force_plot = shap.force_plot(expected, shap_rows[class_index][0, :], row_matrix[0, :], feature_names=feature_names, matplotlib=False)
        shap.save_html(str(OUT_DIR / f"task2b_forceplot_{label_name}_patient_{patient_id.replace(':','-')}.html"), force_plot)
        inline_plots.append((label_name, force_plot))

    return inline_plots


@timed_step("Task 4: regressor SHAP")
def shap_task4(best_reg_model: Pipeline, X: pd.DataFrame, y: pd.Series, keys: pd.Series) -> None:
    if not HAVE_SHAP:
        raise ImportError("SHAP is required for Task 4. Re-run the installation cell if needed.")

    preprocessor = best_reg_model.named_steps.get("preprocessor")
    if preprocessor is not None:
        X_matrix = preprocessor.transform(X)
        if hasattr(X_matrix, "toarray"):
            X_matrix = X_matrix.toarray()
        feature_names = list(preprocessor.get_feature_names_out())
    else:
        X_matrix = X.to_numpy(dtype=np.float32, copy=False)
        feature_names = list(X.columns)

    background = tree_background(X_matrix)

    rng = np.random.default_rng(RANDOM_STATE)
    sample_idx = rng.choice(X_matrix.shape[0], size=min(SHAP_SAMPLES_REG, X_matrix.shape[0]), replace=False)
    X_subset = X_matrix[sample_idx]
    drugs_subset = keys.iloc[sample_idx]

    model = best_reg_model.named_steps["reg"]
    explainer = shap.TreeExplainer(model, data=background, feature_perturbation="tree_path_dependent")

    cache_key = f"task4_{model.__class__.__name__.lower()}_{len(sample_idx)}_{X_matrix.shape[1]}"
    cache_file = shap_cache_path(cache_key)
    shap_matrix: Optional[np.ndarray] = None
    if ENABLE_SHAP_CACHE and cache_file.exists():
        cache = np.load(cache_file, allow_pickle=True)
        cached_idx = cache["indices"]
        if cached_idx.shape == sample_idx.shape and np.array_equal(cached_idx, sample_idx):
            shap_matrix = np.asarray(cache["shap"])
        else:
            cache_file.unlink(missing_ok=True)

    if shap_matrix is None:
        shap_output = explainer.shap_values(X_subset)
        if isinstance(shap_output, list):
            shap_matrix = np.asarray(shap_output[0])
        else:
            shap_matrix = np.asarray(shap_output)
        if ENABLE_SHAP_CACHE:
            np.savez_compressed(cache_file, shap=shap_matrix, indices=sample_idx)
    else:
        pass

    records = []
    for drug in tqdm(sorted(drugs_subset.unique()), desc="Task 4a: per-drug SHAP", leave=False):
        mask = (drugs_subset == drug).values
        if mask.sum() == 0:
            continue
        mean_abs = np.abs(shap_matrix[mask]).mean(axis=0)
        top_idx = np.argsort(mean_abs)[::-1][:10]
        for rank, feat_idx in enumerate(top_idx, start=1):
            records.append(
                {
                    "Drug": drug,
                    "Rank": rank,
                    "Feature": feature_names[feat_idx],
                    "Mean|SHAP|": float(mean_abs[feat_idx]),
                }
            )
    pd.DataFrame(records).to_csv(OUT_DIR / "task4a_top10_features_per_drug.csv", index=False)

    preds = best_reg_model.predict(X)
    errors = np.abs(preds - y.values)
    idx_min = int(np.argmin(errors))
    least_key = keys.iloc[idx_min]

    row_matrix = X_matrix[[idx_min]]
    shap_row = explainer.shap_values(row_matrix)
    if isinstance(shap_row, list):
        shap_row = np.asarray(shap_row[0])
    else:
        shap_row = np.asarray(shap_row)

    mean_abs = np.abs(shap_row[0, :])
    top_idx = np.argsort(mean_abs)[::-1][:10]
    pd.DataFrame(
        {
            "Rank": np.arange(1, 11),
            "Feature": [feature_names[i] for i in top_idx],
            "Absolute_SHAP": mean_abs[top_idx].astype(float),
        }
    ).to_csv(OUT_DIR / f"task4b_top10_features_least_error_{least_key.replace('|', '_')}.csv", index=False)


## 3. Task 1 — Classification Dataset Reconnaissance
We load the lncRNA expression matrix with the memory-savvy routine. The summary table captures the sample size, retained feature count, and identifier columns for citation in the written report.

In [37]:
cancer_path = resolve_data_path(CANCER_PRIMARY, CANCER_FALLBACK)
assert cancer_path is not None, "Classification CSV missing – please place lncRNA_5_Cancers.csv alongside the notebook."

Xc, yc, sample_ids, class_col, id_col = memory_savvy_read_cancers(cancer_path, MAX_FEATURES_CLASSIF)

summary_cls = pd.DataFrame(
    {
        "rows": [len(Xc)],
        "selected_features": [Xc.shape[1]],
        "target_column": [class_col],
        "id_column": [id_col],
    }
)

log("Task 1 dataset loaded.")
display(summary_cls)
Xc.iloc[:5, :10]


Experiment automation:  33%|███▎      | 2/6 [00:00<00:00, 17.28it/s, Idle]                      

[15:30:32] Loaded classification dataset from checkpoint.
[15:30:32] Task 1: data load finished in 0.06s (Δmem 0.002 GB)
[15:30:32] Task 1 dataset loaded.


Unnamed: 0,rows,selected_features,target_column,id_column
0,2529,1000,Class,Ensembl_ID


Unnamed: 0,ENSG00000005206.15,ENSG00000083622.8,ENSG00000088970.14,ENSG00000099869.7,ENSG00000100181.20,ENSG00000104691.13,ENSG00000115934.11,ENSG00000117242.7,ENSG00000118412.11,ENSG00000122043.9
0,3.390813,0.0,2.918266,0.014832,0.341984,2.194036,0.0,1.56975,1.159419,0.0282
1,3.144547,0.0,1.96141,0.047186,1.677598,2.605298,0.0,1.180583,1.127571,0.131274
2,2.484817,0.0,2.89647,0.0,0.087972,3.176764,0.0,1.690582,1.161923,0.10972
3,2.789058,0.0,2.439171,0.022316,0.502293,2.679842,0.0,1.659525,1.463067,0.0
4,3.258763,0.0,1.94166,0.050283,0.098625,2.841588,0.0,1.296678,1.728514,0.019417


## 4. Task 1 — Model Comparison
We benchmark the required classifiers under a shared preprocessing pipeline (median imputation). Macro-F1 is the primary ranking metric to respect class balance across the five cancers.

In [38]:
cls_results, best_classifier, idx_to_class = train_compare_classifiers(Xc, yc, RANDOM_STATE)

log("Task 1 model sweep complete.")
cls_results


Experiment automation:  33%|███▎      | 2/6 [00:00<00:00, 17.28it/s, Running: Task 1: model comparison]

[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.033574 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 199047
[LightGBM] [Info] Number of data points in the train set: 2023, number of used features: 979
[LightGBM] [Info] Start training from score -1.567332
[LightGBM] [Info] Start training from score -1.601070
[LightGBM] [Info] Start training from score -1.625885
[LightGBM] [Info] Start training from score -1.635986
[LightGBM] [Info] Start training from score -1.618375


Experiment automation:  33%|███▎      | 2/6 [00:12<00:00, 17.28it/s, Running: Task 1: model comparison]



Experiment automation:  50%|█████     | 3/6 [02:49<03:31, 70.65s/it, Idle]                             

[15:33:22] Task 1: model comparison finished in 169.44s (Δmem -0.179 GB)
[15:33:22] Task 1 model sweep complete.


Unnamed: 0,Model,Test_Accuracy,Test_F1_Macro
0,RandomForest,0.974308,0.974226
1,LightGBM,0.972332,0.972386
2,XGBoost,0.970356,0.970409
3,GBM,0.960474,0.960552
4,DecisionTree,0.93083,0.930806
5,CatBoost,0.20751,0.06874


## 5. Task 1 — Confusion Matrix & Per-Class Report
The confusion matrix and class-wise precision/recall/F1 are required in the homework write-up. They are exported to `hw3_outputs/` and displayed here for quick inspection.

In [39]:
confusion = pd.read_csv(OUT_DIR / "task1_confusion_matrix.csv", index_col=0)
classification_report_df = pd.read_csv(OUT_DIR / "task1_classification_report.csv", index_col=0)

log("Task 1 evaluation artefacts loaded from hw3_outputs/.")
display(confusion)
classification_report_df


[15:33:22] Task 1 evaluation artefacts loaded from hw3_outputs/.


Unnamed: 0,Pred_KIRC,Pred_LUAD,Pred_LUSC,Pred_PRAD,Pred_THCA
True_KIRC,105,0,0,0,0
True_LUAD,0,98,4,0,0
True_LUSC,0,9,91,0,0
True_PRAD,0,0,0,99,0
True_THCA,0,0,0,0,100


Unnamed: 0,precision,recall,f1-score,support
KIRC,1.0,1.0,1.0,105.0
LUAD,0.915888,0.960784,0.937799,102.0
LUSC,0.957895,0.91,0.933333,100.0
PRAD,1.0,1.0,1.0,99.0
THCA,1.0,1.0,1.0,100.0
accuracy,0.974308,0.974308,0.974308,0.974308
macro avg,0.974757,0.974157,0.974226,506.0
weighted avg,0.974723,0.974308,0.974286,506.0


## 6. Task 2 — SHAP on Best Classifier
The top-ranked classifier becomes the subject of SHAP analysis. We compute per-cancer top-10 genes and generate patient-level force plots for `TCGA-39-5011-01A` as mandated.

### SHAP Optimisations
- Sampled per-class subsets deterministically and cached SHAP tensors to avoid recomputation across runs.
- Limited background size to a compact set (256 rows) and reused the TreeExplainer for patient-level explanations.
- Added progress bars for SHAP aggregation steps so long-running tasks expose real-time feedback.

In [40]:
best_classifier_name = cls_results.iloc[0]["Model"]
log(f"Task 2 interprets the {best_classifier_name}.")
inline_force_plots = shap_task2(best_classifier, Xc, yc, sample_ids, PATIENT_ID_TO_PLOT, idx_to_class)

task2_top = pd.read_csv(OUT_DIR / "task2a_top10_features_per_cancer.csv")
log("Task 2 SHAP tables saved to hw3_outputs/.")
task2_top.head(15)


Experiment automation:  50%|█████     | 3/6 [02:49<03:31, 70.65s/it, Running: Task 2: classifier SHAP]

[15:33:22] Task 2 interprets the RandomForest.


Experiment automation:  50%|█████     | 3/6 [02:49<03:31, 70.65s/it, Failed: Task 2: classifier SHAP] 

ExplainerError: The background dataset you provided does not cover all the leaves in the model, so TreeExplainer cannot run with the feature_perturbation="tree_path_dependent" option! Try providing a larger background dataset, no background dataset, or using feature_perturbation="interventional".

In [None]:
# Render cached/returned force plots inline for quick review
for cancer_type, force_plot in inline_force_plots:
    log(f"Force plot for {cancer_type}")
    display(force_plot)


The five force plots above are also saved as HTML files in `hw3_outputs/task2b_forceplot_<Cancer>_patient_TCGA-39-5011-01A.html` for sharing or screenshot capture.

## 7. Task 3 — Regression Dataset Reconnaissance
We repeat the data audit for the GDSC2 drug-response table, capturing dimensionality and ID columns to justify preprocessing choices.

In [None]:
regression_path = resolve_data_path(REG_PRIMARY, REG_FALLBACK)
assert regression_path is not None, "Regression CSV missing – please place hw3-drug-screening-data.csv (renamed to GDSC2_13drugs.csv) alongside the notebook."

Xr, yr, keys, meta = memory_savvy_read_gdsc2(regression_path, MAX_FEATURES_REGRESS)

summary_reg = pd.DataFrame(
    {
        "rows": [meta["n_rows"]],
        "selected_features": [meta["n_features"]],
        "target_column": [meta["target"]],
        "id_columns": [" & ".join(meta["id_cols"])],
    }
)

log("Task 3 dataset loaded.")
display(summary_reg)
Xr.iloc[:5, :10]


## 8. Task 3 — Regressor Comparison
With preprocessing aligned to the classification case, we evaluate the regression ensemble and rank models using RMSE (primary) plus MAE/MSE/R².

In [None]:
reg_results, best_regressor = train_compare_regressors(Xr, yr, RANDOM_STATE)

log("Task 3 model sweep complete.")
reg_results


## 9. Task 4 — SHAP on Best Regressor
We apply SHAP to the top regressor to fulfil Tasks 4a and 4b: per-drug feature rankings and the least-error drug–cell-line explanation.

In [None]:
best_regressor_name = reg_results.iloc[0]["Model"]
log(f"Task 4 interprets the {best_regressor_name}.")
shap_task4(best_regressor, Xr, yr, keys)

per_drug = pd.read_csv(OUT_DIR / "task4a_top10_features_per_drug.csv")
least_error_path = sorted(OUT_DIR.glob("task4b_top10_features_least_error_*.csv"))[-1]
least_error = pd.read_csv(least_error_path)

(per_drug.head(20), least_error)


In [None]:
# Summarise recorded runtimes and memory usage
if TIMINGS:
    timings_df = pd.DataFrame(TIMINGS)
    display(timings_df)
else:
    log("No timings captured yet. Re-run the notebook from the beginning.")

if 'EXPERIMENT_TRACKER' in globals() and EXPERIMENT_TRACKER is not None:
    tracker_df = pd.DataFrame(EXPERIMENT_TRACKER.state).T
    display(tracker_df)
    EXPERIMENT_TRACKER.bar.close()


## 10. Conclusion & Artefact Checklist
- All tables/figures required by the assignment are in `hw3_outputs/`.
- Rerun with different feature caps or SHAP sample sizes by adjusting the configuration cell at the top.
- A natural extension is hyper-parameter tuning around the winning models or annotating the highlighted genes/drugs with biological context.