# CAP5610 HW3 – Tree Ensembles & SHAP

This notebook documents the end-to-end pipeline required for Homework 3.
Each section mirrors the assignment tasks and records both rationale and results so the workflow reads like a mini research report.

## Experimental Frame
- **Data**: `lncRNA_5_Cancers.csv` (classification) and `GDSC2_13drugs.csv` (regression).
- **Models**: Decision Tree, Random Forest, Gradient Boosting, XGBoost, LightGBM, CatBoost.
- **Metrics**: Accuracy/F1 for Task 1; MAE/MSE/RMSE/R² for Task 3.
- **Interpretability**: SHAP TreeExplainer on the winning models (Tasks 2 & 4).
- **Reproducibility**: shared config (random seed, feature caps, SHAP sample limits) defined once below.

In [None]:
# Environment priming – import helper module and align configuration.
from importlib import reload
from pathlib import Path

import numpy as np
import pandas as pd
from IPython.display import display

import src.leo.HW3_Mendez as hw3
reload(hw3)  # guarantee we are using the latest script revision when iterating in the notebook

# Tweakable laboratory knobs (kept modest so laptops do not choke)
hw3.MAX_FEATURES_CLASSIF = 1000
hw3.MAX_FEATURES_REGRESS = 1200
hw3.SHAP_SAMPLES_PER_CLASS = 30
hw3.SHAP_SAMPLES_REG = 100

ROOT = Path.cwd()
CANCER_PATH = hw3.resolve_data_path(hw3.CANCER_CSV, hw3.CANCER_FALLBACK)
REG_PATH = hw3.resolve_data_path(hw3.GDSC2_CSV, hw3.GDSC2_FALLBACK)

assert CANCER_PATH is not None, "Classification CSV missing – checked primary and fallback paths."
assert REG_PATH is not None, "Regression CSV missing – checked primary and fallback paths."

hw3.log(f"Notebook ready. Using {Path(CANCER_PATH).name} and {Path(REG_PATH).name}.")

## Task 1 — Classification Dataset Recon
We begin by loading the lncRNA expression matrix, enforcing the same feature cap used in the script.
Capturing shapes and identifier columns up front makes it easy to track provenance in the write-up.
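
This notebook does not show how the helper enforces the feature cap. One common rule for wide expression matrices is to keep the highest-variance columns. The sketch below assumes that rule; `cap_features_by_variance` is a hypothetical name and is not part of `HW3_Mendez`.

```python
import pandas as pd


def cap_features_by_variance(X: pd.DataFrame, max_features: int) -> pd.DataFrame:
    """Keep at most `max_features` columns, preferring the highest variance.

    Hypothetical helper: the real cap inside hw3 may use a different rule.
    """
    if X.shape[1] <= max_features:
        return X  # nothing to trim
    variances = X.var(axis=0, skipna=True)
    keep = set(variances.nlargest(max_features).index)
    # Preserve the original column order among the kept columns.
    return X.loc[:, [col for col in X.columns if col in keep]]
```

Keeping the original column order makes it easier to line the trimmed matrix up against the source CSV when checking provenance.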

In [None]:
Xc, yc, ids, target_col, id_col = hw3.memory_savvy_read_cancers(str(CANCER_PATH), hw3.MAX_FEATURES_CLASSIF)

summary_cls = pd.DataFrame(
    {
        "rows": [len(Xc)],
        "features": [Xc.shape[1]],
        "target": [target_col],
        "id_column": [id_col],
    }
)

display(summary_cls)
# Peek at a slice of the feature matrix (5 samples × 10 genes) for sanity.
Xc.iloc[:5, :10]


## Task 1 — Model Selection
We run the tree-ensemble suite with shared preprocessing (median imputation).
The helper returns both the comparison table and the fitted pipelines so we can reuse the best model downstream.
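
As a rough illustration of what such a comparison can look like, the sketch below fits three scikit-learn tree models behind a shared median imputer and ranks them by macro-F1. The 80/20 stratified split, the macro averaging, the synthetic data, and the name `compare_classifiers` are all assumptions for illustration; the actual `train_compare_classifiers` may differ.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier


def compare_classifiers(X, y, seed=42):
    """Fit each candidate with median imputation; return (table, fitted pipelines)."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=seed
    )
    candidates = {
        "DecisionTree": DecisionTreeClassifier(random_state=seed),
        "RandomForest": RandomForestClassifier(n_estimators=100, random_state=seed),
        "GradientBoosting": GradientBoostingClassifier(random_state=seed),
    }
    rows, fitted = [], {}
    for name, model in candidates.items():
        # Same preprocessing for every model: median imputation, then the tree model.
        pipe = make_pipeline(SimpleImputer(strategy="median"), model)
        pipe.fit(X_tr, y_tr)
        pred = pipe.predict(X_te)
        rows.append(
            {
                "Model": name,
                "Accuracy": accuracy_score(y_te, pred),
                "F1_macro": f1_score(y_te, pred, average="macro"),
            }
        )
        fitted[name] = pipe
    # Sort best-first so that row 0 is the winner.
    table = (
        pd.DataFrame(rows)
        .sort_values("F1_macro", ascending=False)
        .reset_index(drop=True)
    )
    return table, fitted


# Synthetic stand-in data, not the lncRNA matrix.
X_demo, y_demo = make_classification(
    n_samples=200, n_features=20, n_informative=5, n_classes=3, random_state=0
)
results, models = compare_classifiers(X_demo, y_demo)
print(results)
```

Returning the fitted pipelines alongside the table means the winning model can be passed straight to the SHAP step without refitting.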

In [None]:
cls_results, cls_models, idx_to_class = hw3.train_compare_classifiers(Xc, yc, hw3.RANDOM_STATE)

display(cls_results)

# Row 0 is the winner only if the helper returns the table sorted best-first.
best_cls_name = cls_results.iloc[0]["Model"]
best_classifier = cls_models[best_cls_name]

hw3.log(f"Task 1 best classifier: {best_cls_name}")


## Task 2 — SHAP Analysis on Winning Classifier
Using the same sampled dataset as the script, we compute per-cancer SHAP importance tables and force plots for the specified patient.
Outputs land in `hw3_outputs/` so they can be embedded into the report.
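
A per-class importance table is typically built by averaging absolute SHAP values over samples. The sketch below shows only that ranking step, on a tiny hand-made array. It assumes the SHAP values are already stacked as `(n_samples, n_features, n_classes)`; depending on the `shap` version, a tree explainer may return a list of per-class arrays instead, which would need stacking first. `top_features_per_class` is a hypothetical helper, and `shap_task2` may aggregate differently.

```python
import numpy as np
import pandas as pd


def top_features_per_class(shap_values, feature_names, class_names, k=10):
    """Rank features per class by mean absolute SHAP value.

    shap_values: array-like of shape (n_samples, n_features, n_classes).
    """
    shap_values = np.asarray(shap_values, dtype=float)
    mean_abs = np.abs(shap_values).mean(axis=0)  # shape: (n_features, n_classes)
    frames = []
    for j, cls in enumerate(class_names):
        order = np.argsort(mean_abs[:, j])[::-1][:k]  # largest first
        frames.append(
            pd.DataFrame(
                {
                    "class": cls,
                    "rank": np.arange(1, len(order) + 1),
                    "feature": [feature_names[i] for i in order],
                    "mean_abs_shap": mean_abs[order, j],
                }
            )
        )
    return pd.concat(frames, ignore_index=True)


# Hand-made placeholder values: 2 samples, 3 features, 2 classes.
demo_vals = np.zeros((2, 3, 2))
demo_vals[:, 0, 0] = [1.0, -1.0]  # feature f0 matters for class A
demo_vals[:, 2, 0] = [0.5, 0.5]
demo_vals[:, 1, 1] = [2.0, 2.0]   # feature f1 matters for class B
demo_table = top_features_per_class(demo_vals, ["f0", "f1", "f2"], ["A", "B"], k=1)
print(demo_table)
```

Taking the absolute value before averaging matters: positive and negative contributions would otherwise cancel, as they would for `f0` in the example.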

In [None]:
hw3.shap_task2(best_classifier, Xc, yc, ids, hw3.PATIENT_ID_TO_PLOT, idx_to_class)

# Display the aggregated top-10 table to keep the narrative close to the numbers.
task2_table = pd.read_csv(hw3.OUT_DIR / "task2a_top10_features_per_cancer.csv")

task2_table.head(15)


## Task 3 — Regression Dataset Recon
Next, load the drug screening panel, summarise the dimensionality, and retain the composite key (`CELL_LINE_NAME|DRUG_NAME`).
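
For reference, a composite key of this form can be built by joining the two identifier columns with a `|` separator. The sketch below is a minimal version with placeholder rows; `make_composite_key` is a hypothetical helper, and `memory_savvy_read_gdsc2` may construct its keys differently.

```python
import pandas as pd


def make_composite_key(df, cols=("CELL_LINE_NAME", "DRUG_NAME"), sep="|"):
    """Join identifier columns into one string key per row."""
    key = df[cols[0]].astype(str)
    for col in cols[1:]:
        key = key + sep + df[col].astype(str)
    return key


# Placeholder identifiers, not real GDSC2 entries.
demo = pd.DataFrame(
    {
        "CELL_LINE_NAME": ["CELL_A", "CELL_A", "CELL_B"],
        "DRUG_NAME": ["DRUG_X", "DRUG_Y", "DRUG_X"],
    }
)
demo_keys = make_composite_key(demo)
print(demo_keys.tolist())
print("unique:", demo_keys.is_unique)
```

Checking `is_unique` on the result is a cheap way to confirm that each drug and cell line combination appears only once before the keys are used to look up individual predictions.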

In [None]:
Xr, yr, keys, meta = hw3.memory_savvy_read_gdsc2(str(REG_PATH), hw3.MAX_FEATURES_REGRESS)

summary_reg = pd.DataFrame(
    {
        "rows": [meta["n_rows"]],
        "features": [meta["n_features"]],
        "target": [meta["target"]],
        "id_columns": [" & ".join(meta["id_cols"])]
    }
)

display(summary_reg)
Xr.iloc[:5, :10]


## Task 3 — Model Selection
Repeat the ensemble sweep for regression, collecting MAE, MSE, RMSE, and R² for each model. The winning pipeline feeds Task 4.
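
The regression metrics can be computed with scikit-learn as sketched below. `regression_report` is a hypothetical helper shown for illustration; the helper module may compute or name these values differently.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score


def regression_report(y_true, y_pred):
    """Return MAE, MSE, RMSE and R² for one set of predictions."""
    mse = mean_squared_error(y_true, y_pred)
    return {
        "MAE": float(mean_absolute_error(y_true, y_pred)),
        "MSE": float(mse),
        "RMSE": float(np.sqrt(mse)),  # RMSE is the square root of MSE
        "R2": float(r2_score(y_true, y_pred)),
    }


# Small hand-checkable example: only the last prediction is off, by 2.
report = regression_report([1, 2, 3, 4], [1, 2, 3, 6])
print(report)
```

Since RMSE is a monotonic transform of MSE, the two always rank models identically; RMSE is reported mainly because it is in the same units as the target.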

In [None]:
reg_results, reg_models = hw3.train_compare_regressors(Xr, yr, hw3.RANDOM_STATE)

display(reg_results)

# Row 0 is the winner only if the helper returns the table sorted best-first.
best_reg_name = reg_results.iloc[0]["Model"]
best_regressor = reg_models[best_reg_name]

hw3.log(f"Task 3 best regressor: {best_reg_name}")


## Task 4 — SHAP on Winning Regressor
Finally, we generate per-drug importances and use SHAP contributions to inspect the drug–cell line pair with the smallest prediction error.
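
Locating that pair reduces to finding the smallest absolute residual. The sketch below assumes "least error" means minimum absolute error on the evaluated rows; `least_error_index` and the placeholder keys are hypothetical, and `shap_task4` may select the pair differently.

```python
import numpy as np


def least_error_index(y_true, y_pred):
    """Return (position, absolute error) of the best-predicted row."""
    abs_err = np.abs(
        np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    )
    i = int(np.argmin(abs_err))
    return i, float(abs_err[i])


# Placeholder targets, predictions, and keys for illustration only.
demo_true = [1.0, 2.0, 3.0]
demo_pred = [1.5, 2.1, 2.0]
demo_pair_keys = ["CELL_A|DRUG_X", "CELL_A|DRUG_Y", "CELL_B|DRUG_X"]

best_i, best_err = least_error_index(demo_true, demo_pred)
print(demo_pair_keys[best_i], best_err)
```

The returned position can then be used to pull that row's SHAP values and list the features with the largest absolute contributions.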

In [None]:
hw3.shap_task4(best_regressor, Xr, yr, keys)

task4a_table = pd.read_csv(hw3.OUT_DIR / "task4a_top10_features_per_drug.csv")
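# If several least-error exports exist, take the lexicographically last one.
task4b_files = sorted(hw3.OUT_DIR.glob("task4b_top10_features_least_error_*.csv"))
assert task4b_files, "No Task 4b export found under hw3.OUT_DIR."
task4b_path = task4b_files[-1]
task4b_table = pd.read_csv(task4b_path)

# Display each table separately so both render as rich tables rather than a tuple repr.
display(task4a_table.head(20))
display(task4b_table)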


## Conclusion & Artifacts
- Metrics and SHAP exports live under `hw3_outputs/` for inclusion in the written report.
- Adjust `MAX_FEATURES_*` or SHAP sample counts above if you need faster prototypes or deeper feature sweeps.
- Re-run individual cells to regenerate specific figures without repeating the full pipeline.