# CAP5610 HW3 — Tree Ensembles & SHAP Study

This notebook mirrors the Homework 3 brief and documents the full experimental path: data curation, model selection, and SHAP-based interpretation for both the classification and regression tasks. The narrative is intentionally research-style—each section introduces objectives, explains decisions, and records findings.

## Study Roadmap
1. **Environment checks** — load shared helpers, set global knobs, confirm data availability.
2. **Task 1: Classification** — inspect the gene-expression matrix, train the tree-based suite, analyse accuracy/F1.
3. **Task 2: Classifier SHAP** — extract cancer-specific feature importances and patient-level force plots.
4. **Task 3: Regression** — profile the drug-response table, compare regressors using MAE/MSE/RMSE/R².
5. **Task 4: Regressor SHAP** — quantify drug biomarkers and zoom in on the best-predicted drug–cell-line pair.
6. **Conclusion** — catalogue artefacts and possible extensions.

In [None]:
# --- 0. Environment priming -------------------------------------------------------
from importlib import reload
from pathlib import Path

import numpy as np
import pandas as pd
from IPython.display import display

import src.leo.HW3_Mendez as hw3
reload(hw3)  # re-import so edits to the script are picked up without restarting the kernel

# Resource knobs (kept conservative to avoid memory spikes on laptops).
hw3.MAX_FEATURES_CLASSIF = 1000  # cap genes for Task 1/2
hw3.MAX_FEATURES_REGRESS = 1200  # cap features for Task 3/4
hw3.SHAP_SAMPLES_PER_CLASS = 30  # per-class sample size for SHAP aggregation
hw3.SHAP_SAMPLES_REG = 100       # regression SHAP sampling budget

ROOT = Path.cwd()
CANCER_PATH = hw3.resolve_data_path(hw3.CANCER_CSV, hw3.CANCER_FALLBACK)
REG_PATH = hw3.resolve_data_path(hw3.GDSC2_CSV, hw3.GDSC2_FALLBACK)

assert CANCER_PATH is not None, "Classification CSV missing – checked primary and fallback paths."
assert REG_PATH is not None, "Regression CSV missing – checked primary and fallback paths."

hw3.log(f"Notebook ready. Using {Path(CANCER_PATH).name} and {Path(REG_PATH).name}.")
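
The implementation of `hw3.resolve_data_path` is not shown in this notebook. As a rough illustration of the primary/fallback lookup it appears to perform, the sketch below uses a hypothetical helper, `first_existing_path`, that returns the first candidate present on disk and `None` otherwise. The name and behaviour are assumptions, not the script's actual code.

```python
import tempfile
from pathlib import Path
from typing import Optional


def first_existing_path(*candidates: str) -> Optional[Path]:
    """Hypothetical helper: return the first candidate file that exists, else None."""
    for candidate in candidates:
        path = Path(candidate)
        if path.is_file():
            return path
    return None


# Demonstration with a temporary file standing in for the fallback CSV.
with tempfile.TemporaryDirectory() as tmp:
    missing = Path(tmp) / "primary.csv"    # deliberately never created
    fallback = Path(tmp) / "fallback.csv"
    fallback.write_text("gene_a,gene_b\n1,2\n")

    resolved = first_existing_path(str(missing), str(fallback))
    resolved_name = resolved.name if resolved is not None else None
    unresolved = first_existing_path(str(missing))

print(resolved_name, unresolved)
```

Returning `None` instead of raising lets the caller decide how to fail, which is what the two `assert` statements above rely on.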

## Task 1 — Classification Dataset Reconnaissance
We first sanity-check the lncRNA expression matrix: confirm sample size, feature count after variance selection, and the presence of a TCGA identifier column. This mirrors the context section of a research paper.

In [None]:
# Load classification data using the memory-aware helper.
Xc, yc, sample_ids, class_col, id_col = hw3.memory_savvy_read_cancers(str(CANCER_PATH), hw3.MAX_FEATURES_CLASSIF)

summary_cls = pd.DataFrame(
    {
        "rows": [len(Xc)],
        "selected_features": [Xc.shape[1]],
        "target_column": [class_col],
        "id_column": [id_col],
    }
)

hw3.log("Task 1 dataset loaded.")
display(summary_cls)

# Inspect a small slice (5 samples × 10 genes) to verify numeric coercion worked as expected.
Xc.iloc[:5, :10]
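
The variance selection itself happens inside `hw3.memory_savvy_read_cancers`, whose code is not reproduced here. A minimal sketch of one common way to cap features by variance follows; `top_variance_features` and the toy column names are hypothetical, and the real helper may differ (for example by reading the CSV in chunks).

```python
import numpy as np
import pandas as pd


def top_variance_features(frame: pd.DataFrame, max_features: int) -> pd.DataFrame:
    """Hypothetical helper: keep the `max_features` columns with the largest variance."""
    numeric = frame.apply(pd.to_numeric, errors="coerce")  # non-numeric entries become NaN
    variances = numeric.var(axis=0, skipna=True)
    keep = set(variances.nlargest(max_features).index)
    # Preserve the original column order among the retained features.
    return numeric.loc[:, [col for col in numeric.columns if col in keep]]


rng = np.random.default_rng(0)
toy = pd.DataFrame(
    {
        "flat_gene": np.full(50, 3.0),            # zero variance
        "quiet_gene": rng.normal(0.0, 0.1, 50),   # low variance
        "noisy_gene": rng.normal(0.0, 5.0, 50),   # high variance
        "wild_gene": rng.normal(0.0, 10.0, 50),   # highest variance
    }
)
selected = top_variance_features(toy, max_features=2)
print(list(selected.columns))
```

With `max_features=2` only the two high-variance columns survive, which is the same effect `hw3.MAX_FEATURES_CLASSIF` is meant to have on the gene columns.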


## Task 1 — Model Comparison Strategy
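We compare six tree-based classifiers under a shared preprocessing pipeline (median imputation). Stratified splits preserve the cancer-type proportions in both the training and test sets. Metrics emphasise macro-F1, which averages the per-class F1 scores so that every cancer type counts equally regardless of its sample count, consistent with the assignment brief.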

In [None]:
# Train the tree ensemble family and collect results.
cls_results, cls_models, idx_to_class = hw3.train_compare_classifiers(Xc, yc, hw3.RANDOM_STATE)

hw3.log("Task 1 model sweep complete.")
display(cls_results)

# Assumes the helper returns the results table sorted best-first.
best_cls_name = cls_results.iloc[0]["Model"]
best_classifier = cls_models[best_cls_name]

hw3.log(f"Task 1 best classifier: {best_cls_name}")
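
The internals of `hw3.train_compare_classifiers` are not shown in this notebook. The sketch below illustrates the strategy described above (stratified split, median imputation, macro-F1) for a single model using scikit-learn. The synthetic data, the 80/20 split, and the choice of `RandomForestClassifier` are assumptions made for illustration only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Synthetic stand-in for the expression matrix (assumption: 3 classes, 20 features).
X, y = make_classification(
    n_samples=300, n_features=20, n_informative=8, n_classes=3, random_state=0
)
rng = np.random.default_rng(0)
X[rng.random(X.shape) < 0.05] = np.nan  # inject roughly 5% missing values

# stratify=y keeps the class proportions the same in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

model = Pipeline(
    [
        ("impute", SimpleImputer(strategy="median")),  # fitted on training data only
        ("forest", RandomForestClassifier(n_estimators=100, random_state=0)),
    ]
)
model.fit(X_train, y_train)
pred = model.predict(X_test)

accuracy = accuracy_score(y_test, pred)
macro_f1 = f1_score(y_test, pred, average="macro")
print(f"accuracy={accuracy:.3f}  macro-F1={macro_f1:.3f}")
```

Wrapping the imputer in the pipeline means the medians are learned from the training split alone, so no test-set statistics leak into the fit.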


## Task 1 — Confusion Matrix & Per-Class Metrics
The report is expected to include per-class precision and recall alongside the overall scores. We reload the CSVs exported by the helper so the figures match the saved artefacts exactly, and display the confusion matrix inline.

In [None]:
confusion = pd.read_csv(hw3.OUT_DIR / "task1_confusion_matrix.csv", index_col=0)
classification_report = pd.read_csv(hw3.OUT_DIR / "task1_classification_report.csv", index_col=0)

hw3.log("Task 1 evaluation artefacts loaded from hw3_outputs/")
display(confusion)
classification_report
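
For reference, per-class precision, recall, and F1 can be recomputed directly from a confusion matrix. The sketch below uses a small made-up 3×3 matrix and assumes rows are true classes and columns are predicted classes (the scikit-learn convention). Whether the exported CSV follows the same orientation should be checked against the helper.

```python
import numpy as np

# Made-up counts; rows = true class, columns = predicted class (assumed orientation).
confusion_counts = np.array(
    [
        [8, 2, 0],
        [1, 7, 2],
        [0, 1, 9],
    ]
)

true_positives = np.diag(confusion_counts)
precision = true_positives / confusion_counts.sum(axis=0)  # column sums = predicted totals
recall = true_positives / confusion_counts.sum(axis=1)     # row sums = true totals
f1 = 2 * precision * recall / (precision + recall)
macro_f1 = f1.mean()  # unweighted mean: every class counts equally

print(np.round(precision, 3), np.round(recall, 3), round(float(macro_f1), 3))
```

Because the macro average is unweighted, a poorly predicted minority class lowers the score as much as a poorly predicted majority class would.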


## Task 2 — SHAP Rationale
With the winning classifier in hand, we quantify how much each gene contributes to its predictions. Mean |SHAP| scores highlight cancer-specific genes, while force plots satisfy the assignment’s patient-level interpretability requirement.

In [None]:
# Run SHAP analysis with memory-aware sampling.
hw3.shap_task2(best_classifier, Xc, yc, sample_ids, hw3.PATIENT_ID_TO_PLOT, idx_to_class)

# Show the first 15 rows of the per-cancer top-10 importance table.
task2_top = pd.read_csv(hw3.OUT_DIR / "task2a_top10_features_per_cancer.csv")

hw3.log("Task 2 SHAP outputs generated.")
task2_top.head(15)


Force plots were saved in `hw3_outputs/task2b_forceplot_*`. Open them in a browser to capture screenshots or embed HTML in the written report.
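
The mean |SHAP| ranking is computed inside `hw3.shap_task2`, which is not reproduced here. The sketch below shows one plausible form of that aggregation on a synthetic array. It assumes SHAP values arrive with shape `(n_samples, n_features, n_classes)`; some SHAP versions return a list with one array per class instead. `top_features_per_class` is a hypothetical helper.

```python
import numpy as np
import pandas as pd


def top_features_per_class(shap_values, labels, feature_names, top_k=10):
    """Hypothetical helper: rank features by mean |SHAP| within each class."""
    rows = []
    for cls in np.unique(labels):
        # Samples belonging to this class, explained with respect to this class's output.
        class_slice = shap_values[labels == cls][:, :, cls]
        mean_abs = np.abs(class_slice).mean(axis=0)
        order = np.argsort(mean_abs)[::-1][:top_k]
        for rank, idx in enumerate(order, start=1):
            rows.append(
                {
                    "class": int(cls),
                    "rank": rank,
                    "feature": feature_names[idx],
                    "mean_abs_shap": float(mean_abs[idx]),
                }
            )
    return pd.DataFrame(rows)


# Synthetic stand-in for explainer output (not real SHAP values).
rng = np.random.default_rng(0)
n_samples, n_features, n_classes = 12, 5, 3
feature_names = [f"gene_{i}" for i in range(n_features)]
labels = np.repeat(np.arange(n_classes), n_samples // n_classes)
shap_values = rng.normal(0.0, 0.01, size=(n_samples, n_features, n_classes))
for cls in range(n_classes):
    shap_values[labels == cls, cls, cls] += 1.0  # make gene_<cls> dominate class <cls>

ranking = top_features_per_class(shap_values, labels, feature_names, top_k=2)
print(ranking)
```

Taking the absolute value before averaging prevents positive and negative contributions from cancelling, so a gene that pushes predictions strongly in either direction still ranks highly.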

## Task 3 — Regression Dataset Reconnaissance
We repeat the dataset audit for the drug-response table: number of rows, features retained after variance filtering, and ID columns used to define unique drug–cell-line pairs.

In [None]:
Xr, yr, keys, meta = hw3.memory_savvy_read_gdsc2(str(REG_PATH), hw3.MAX_FEATURES_REGRESS)

summary_reg = pd.DataFrame(
    {
        "rows": [meta["n_rows"]],
        "selected_features": [meta["n_features"]],
        "target_column": [meta["target"]],
        "id_columns": [" & ".join(meta["id_cols"])],
    }
)

hw3.log("Task 3 dataset loaded.")
display(summary_reg)
Xr.iloc[:5, :10]


## Task 3 — Regressor Comparison
Analogous to Task 1, we benchmark the regressors with shared preprocessing. The assignment requests MAE, MSE, RMSE, and R²; we present them in a sortable DataFrame for inclusion in the report.

In [None]:
reg_results, reg_models = hw3.train_compare_regressors(Xr, yr, hw3.RANDOM_STATE)

hw3.log("Task 3 model sweep complete.")
display(reg_results)

best_reg_name = reg_results.iloc[0]["Model"]
best_regressor = reg_models[best_reg_name]

hw3.log(f"Task 3 best regressor: {best_reg_name}")
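
For reference, the four requested metrics can be written out by hand. The sketch below uses a hypothetical `regression_metrics` helper and a tiny made-up example. The actual numbers in `reg_results` come from `hw3.train_compare_regressors`, whose code is not shown here.

```python
import numpy as np


def regression_metrics(y_true, y_pred):
    """Hypothetical helper: MAE, MSE, RMSE and R² from their definitions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred

    mae = np.mean(np.abs(residuals))
    mse = np.mean(residuals**2)
    rmse = np.sqrt(mse)  # same units as the target
    ss_res = np.sum(residuals**2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot  # fraction of variance explained

    return {"MAE": float(mae), "MSE": float(mse), "RMSE": float(rmse), "R2": float(r2)}


metrics = regression_metrics([3.0, -0.5, 2.0, 7.0], [2.5, 0.0, 2.0, 8.0])
print(metrics)
```

RMSE is the square root of MSE, so the two always rank models identically. Reporting both mainly helps readers who want the error in the target's own units.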


## Task 4 — SHAP on Best Regressor
We apply TreeExplainer to the winning regressor. Mean |SHAP| per drug corresponds to Task 4a, while Task 4b singles out the drug–cell-line pair with the smallest prediction error.

In [None]:
hw3.shap_task4(best_regressor, Xr, yr, keys)

hw3.log("Task 4 SHAP outputs generated.")

per_drug = pd.read_csv(hw3.OUT_DIR / "task4a_top10_features_per_drug.csv")
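least_error_files = sorted(hw3.OUT_DIR.glob("task4b_top10_features_least_error_*.csv"))
assert least_error_files, "No Task 4b least-error CSV found in the output directory."
least_error_path = least_error_files[-1]  # lexicographically last match
least_error = pd.read_csv(least_error_path)

# Render both tables as DataFrames instead of a plain-text tuple.
display(per_drug.head(20))
least_error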
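
The two Task 4 outputs boil down to a group-wise aggregation and an arg-min. The sketch below illustrates both on made-up numbers. The column names, the SHAP array, and the use of absolute error are assumptions, and `hw3.shap_task4` may implement the details differently.

```python
import numpy as np
import pandas as pd

# Made-up predictions for four drug/cell-line pairs.
predictions = pd.DataFrame(
    {
        "drug": ["drug_a", "drug_a", "drug_b", "drug_b"],
        "cell_line": ["cl_1", "cl_2", "cl_1", "cl_3"],
        "y_true": [2.0, 3.5, -1.0, 0.5],
        "y_pred": [2.4, 3.45, -0.2, 1.5],
    }
)
feature_names = ["gene_x", "gene_y", "gene_z"]
# Made-up SHAP values with shape (n_rows, n_features), aligned with `predictions`.
shap_values = np.array(
    [
        [0.5, -0.1, 0.0],
        [-0.3, 0.2, 0.1],
        [0.0, 0.1, -0.9],
        [0.1, -0.1, 0.7],
    ]
)
abs_shap = pd.DataFrame(np.abs(shap_values), columns=feature_names)

# Task 4a analogue: mean |SHAP| per drug, then the strongest feature for each drug.
per_drug_importance = abs_shap.groupby(predictions["drug"]).mean()
top_feature_per_drug = per_drug_importance.idxmax(axis=1)

# Task 4b analogue: the pair with the smallest absolute prediction error.
best_row = (predictions["y_true"] - predictions["y_pred"]).abs().idxmin()
best_pair = (predictions.loc[best_row, "drug"], predictions.loc[best_row, "cell_line"])
best_pair_shap = abs_shap.loc[best_row].sort_values(ascending=False)

print(top_feature_per_drug.to_dict(), best_pair)
```

Selecting the best-predicted pair keeps the explanation focused on a case where the model is demonstrably accurate, so the attributions are less likely to reflect a modelling error.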


## Conclusion & Deliverables
- Metrics and SHAP artefacts live in `hw3_outputs/` for direct ingestion into the written report.
- Adjust the feature caps or SHAP sample budgets in the top cell if you replicate on a workstation with more memory.
- Next steps (optional): hyper-parameter tuning around the winning models, or integrating biological annotations for the highlighted genes.