# Multivariate prediction models (no tree heuristics)

The multivariate experiments now live in their own notebook, which reports the regularized/logistic models and the forest-based ensembles, including the balanced random forest (BRF) track. The old decision-tree heuristics are intentionally omitted, per the updated analysis plan.


## Dataset and design matrix

We reuse the persistence filter so the multivariate models are aligned with the univariate regressions.


In [None]:

from analysis_utils import (
    load_base_dataset,
    engineer_baseline_features,
    prepare_persistence_dataset,
    evaluate_multivariate_models,
)
from IPython.display import display


In [None]:

raw_df = load_base_dataset()
feature_df, feature_sets = engineer_baseline_features(raw_df)
persistence_df = prepare_persistence_dataset(feature_df, feature_sets["all_features"])
print(f"Design matrix shape: {persistence_df.shape}")


## Model comparison summary

Each pipeline uses stratified 5-fold cross-validation and reports ROC-AUC, PR-AUC, balanced accuracy, F1, and accuracy. The BRF implementation uses the in-repo helper class so we no longer depend on the Colab-only imbalanced-learn install.
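
As a reminder of what "stratified" means here, the following is a minimal plain-Python sketch of stratified fold assignment. It is not the code inside `evaluate_multivariate_models`, whose internals are not shown in this notebook. The helper name `stratified_fold_ids` and the toy labels are assumptions made purely for illustration.

```python
from collections import defaultdict
import random


def stratified_fold_ids(labels, n_splits=5, seed=0):
    """Assign each sample a fold id so every fold keeps roughly the overall class ratio.

    Hypothetical helper for illustration; not part of analysis_utils.
    """
    rng = random.Random(seed)

    # Group sample indices by class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    fold_ids = [None] * len(labels)
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the shuffled indices of each class round-robin across the folds.
        for position, idx in enumerate(indices):
            fold_ids[idx] = position % n_splits
    return fold_ids


# Toy imbalanced labels (illustrative only): 10 positives, 40 negatives.
toy_labels = [1] * 10 + [0] * 40
toy_folds = stratified_fold_ids(toy_labels, n_splits=5)

for fold in range(5):
    members = [toy_labels[i] for i, f in enumerate(toy_folds) if f == fold]
    print(f"fold {fold}: n={len(members)}, positives={sum(members)}")
```

With these toy labels every fold receives 10 samples, 2 of them positive, so each held-out fold mirrors the overall 20% positive rate.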


In [None]:

metrics_df, feature_tables = evaluate_multivariate_models(
    persistence_df,
    feature_sets["all_features"],
    target_col="aan_persistence",
)
print("Cross-validated metrics:")
display(metrics_df)
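
When reading the table, it can help to recall how the threshold-based columns relate to one another. The sketch below derives accuracy, balanced accuracy, and F1 from confusion counts in plain Python. The function names and the toy labels are assumptions for illustration; the notebook's actual metrics come from `evaluate_multivariate_models`.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels coded as 0/1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn


def threshold_metrics(y_true, y_pred):
    """Compute accuracy, balanced accuracy, and F1 (hypothetical helper)."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {
        "accuracy": (tp + tn) / len(y_true),
        # Balanced accuracy is the mean of recall on each class.
        "balanced_accuracy": (recall + specificity) / 2,
        "f1": f1,
    }


# Toy example (illustrative only): 2 positives, 8 negatives,
# scored by a classifier that always predicts the negative class.
toy_true = [1, 1] + [0] * 8
toy_pred = [0] * 10
toy_metrics = threshold_metrics(toy_true, toy_pred)
print(toy_metrics)
```

On this toy input the always-negative classifier reaches 0.8 accuracy but only 0.5 balanced accuracy and 0.0 F1, which is why the accuracy column should be read alongside the others if the target is imbalanced.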


## Feature importance snapshots

Tree-based models expose feature importances. This cell surfaces the top signals per model so we can track stability between BRF and the vanilla random forest.


In [None]:

for model_name, table in feature_tables.items():
    print(f"\nTop features for {model_name}")
    display(table.head(20))
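
One simple way to put a number on the stability between two models is the Jaccard overlap of their top-k feature sets. The sketch below is an illustration under assumptions: `top_k_jaccard` is a hypothetical helper, the feature names are invented, and the column layout of the tables in `feature_tables` is not shown here, so the rankings are passed in as plain lists ordered from most to least important.

```python
def top_k_jaccard(ranked_a, ranked_b, k=20):
    """Jaccard overlap between the top-k entries of two ranked feature lists."""
    top_a, top_b = set(ranked_a[:k]), set(ranked_b[:k])
    union = top_a | top_b
    # Two empty rankings are treated as identical.
    return len(top_a & top_b) / len(union) if union else 1.0


# Hypothetical rankings (feature names are invented for illustration).
brf_ranking = ["feat_a", "feat_b", "feat_c", "feat_d"]
rf_ranking = ["feat_b", "feat_a", "feat_e", "feat_c"]

overlap = top_k_jaccard(brf_ranking, rf_ranking, k=3)
print(f"Top-3 Jaccard overlap: {overlap:.2f}")
```

Here the two top-3 sets share 2 of 4 distinct features, giving an overlap of 0.50. A value near 1 would suggest both forests rely on the same signals, while a low value flags rankings worth inspecting by hand.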
