# 2.0.2: Feature importances

To understand how individual predictors and datasets influence model performance, we can retrieve feature importances from the trained models. Knowing feature importance is also crucial when assessing the Area of Applicability (Meyer and Pebesma, 2022).

## Imports and config

In [36]:
import joblib
from pathlib import Path

import dask.dataframe as dd
import pandas as pd
from autogluon.tabular import TabularDataset, TabularPredictor

from src.conf.conf import get_config
from src.conf.environment import log
from src.utils.autogluon_utils import get_best_model_ag
from src.utils.dataset_utils import (
    get_models_dir,
    get_cv_models_dir,
    get_train_fn,
    get_cv_splits,
)

cfg = get_config()

To retrieve feature importances (FI) from an AutoGluon `TabularPredictor`, we can simply call `TabularPredictor.feature_importance()`. However, two considerations apply when calculating FI:

1. FI should be calculated from data not seen during model training to avoid bias due to overfitting.
1. To properly assess FI under cross-validation, FI scores should be computed for each cross-validation fold model on its respective held-out fold.
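
AutoGluon's `feature_importance()` is based on permutation importance, so the first point can be illustrated without any library code. The following is a minimal sketch in plain Python, not AutoGluon's implementation: `permutation_importance`, `toy_predict`, and the toy data are hypothetical names introduced for illustration. It shuffles one feature at a time in held-out data and measures how much the error grows.

```python
import random


def permutation_importance(predict, X, y, n_repeats=5, seed=0):
    """Mean increase in MSE on (X, y) when each feature is shuffled."""
    rng = random.Random(seed)

    def mse(rows):
        return sum((predict(r) - t) ** 2 for r, t in zip(rows, y)) / len(y)

    baseline = mse(X)
    importances = {}
    for name in X[0]:
        increases = []
        for _ in range(n_repeats):
            # Shuffle one column while leaving all other columns intact
            shuffled = [row[name] for row in X]
            rng.shuffle(shuffled)
            permuted = [{**row, name: v} for row, v in zip(X, shuffled)]
            increases.append(mse(permuted) - baseline)
        importances[name] = sum(increases) / n_repeats
    return importances


# Toy held-out data: the target depends on "a" only; "b" is pure noise
data_rng = random.Random(42)
X_held_out = [{"a": data_rng.random(), "b": data_rng.random()} for _ in range(200)]
y_held_out = [3.0 * row["a"] for row in X_held_out]


def toy_predict(row):
    # Stand-in for a trained model that learned the true relationship
    return 3.0 * row["a"]


fi = permutation_importance(toy_predict, X_held_out, y_held_out)
```

In this toy case, shuffling `b` leaves the error unchanged while shuffling `a` increases it. If the same scores were computed on training data with an overfitted model, features the model had merely memorized could appear important, which is why held-out data is used.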

## Prepare the cross-validation models

In [14]:
models_dir: Path = get_models_dir(cfg)
# Use the first model run directory found (note: iterdir() order is arbitrary)
model_runs_dir: Path = [d for d in models_dir.iterdir() if d.is_dir()][0]
model_dir = get_best_model_ag(model_runs_dir)
model = TabularPredictor.load(str(model_dir))

cv_models = get_cv_models_dir(model)

## Load train data and CV splits

Load full training data and drop all Y columns except that of the current model.

In [30]:
full_train = dd.read_parquet(get_train_fn(cfg)).drop(columns=["x", "y"])
y_cols = full_train.columns[full_train.columns.str.startswith("X")].to_list()
# Select the y column that matches the current model's label (None if absent)
y_col = next((y for y in y_cols if y == model.label), None)

if y_col is None:
    raise ValueError(f"Could not find y column for model {model.label}")

x_cols = full_train.columns[~full_train.columns.str.startswith("X")].to_list()
full_train = full_train[x_cols + [y_col]].compute().reset_index(drop=True)

Load the CV splits and apply them to `full_train`.

In [34]:
cv_splits = get_cv_splits(cfg, model.label)

for i, (_, valid_idx) in enumerate(cv_splits):
    full_train.loc[valid_idx, "split"] = i
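
The loop above relies on every row being held out in exactly one fold. Otherwise `split` would contain missing or overwritten values. A minimal sketch of that check in plain Python follows, assuming the splits are `(train_idx, valid_idx)` pairs of positional indices, as the unpacking above suggests. `assign_folds` and `toy_splits` are hypothetical names used only for illustration.

```python
def assign_folds(n_rows, cv_splits):
    """Label each row with the index of the fold in which it is held out."""
    fold_of_row = [None] * n_rows
    for fold, (_, valid_idx) in enumerate(cv_splits):
        for idx in valid_idx:
            if fold_of_row[idx] is not None:
                raise ValueError(f"Row {idx} is held out in more than one fold")
            fold_of_row[idx] = fold

    # Every row should have been assigned to some fold
    missing = [i for i, f in enumerate(fold_of_row) if f is None]
    if missing:
        raise ValueError(f"{len(missing)} rows are never held out")
    return fold_of_row


# Hypothetical 3-fold split over 6 rows, as (train_idx, valid_idx) pairs
toy_splits = [
    ([2, 3, 4, 5], [0, 1]),
    ([0, 1, 4, 5], [2, 3]),
    ([0, 1, 2, 3], [4, 5]),
]
folds = assign_folds(6, toy_splits)
```

Here `folds` holds one fold label per row, playing the same role as the `split` column in `full_train`.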

## Get feature importances from the held-out fold for each split

In [39]:
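fold_fis = []

for i in range(len(cv_splits)):
    # Fold model directories use a 1-based fold index (S1F1, S1F2, ...)
    fold_model_fn = cv_models / f"S1F{i + 1}" / "model.pkl"
    if not fold_model_fn.exists():
        raise FileNotFoundError(f"Model {fold_model_fn} does not exist")

    fold_model = joblib.load(fold_model_fn)

    # Rows assigned to this fold's validation set
    held_out = full_train[full_train["split"] == i]

    fold_fi = fold_model.compute_feature_importance(
        X=held_out[x_cols], y=held_out[y_col]
    )
    # Keep the result for every fold instead of stopping after the first
    fold_fis.append(fold_fi)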
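
Once importance scores exist for every fold, they can be summarized per feature. The sketch below is one possible approach in plain Python and not part of the original notebook. It assumes each fold's scores have been reduced to a feature-to-importance mapping; `summarize_fold_importances`, the feature names, and the values are made up for illustration.

```python
from statistics import mean, stdev


def summarize_fold_importances(fold_importances):
    """Mean and standard deviation of each feature's importance across folds."""
    features = fold_importances[0].keys()
    summary = {
        f: {
            "mean": mean(fi[f] for fi in fold_importances),
            "std": stdev(fi[f] for fi in fold_importances),
        }
        for f in features
    }
    # Sort from most to least important by mean score
    return dict(sorted(summary.items(), key=lambda kv: kv[1]["mean"], reverse=True))


# Made-up importance scores for three folds (illustrative values only)
toy_fold_importances = [
    {"elevation": 0.10, "temperature": 0.30},
    {"elevation": 0.15, "temperature": 0.25},
    {"elevation": 0.05, "temperature": 0.35},
]
summary = summarize_fold_importances(toy_fold_importances)
```

Reporting the spread alongside the mean shows whether a feature's importance is stable across folds or driven by a single split.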