# Benchmark feature selection

This notebook contains the benchmarking of feature selection methods and models. The goal is to compare the performance of different feature selection methods and models on the same dataset. 

We first create seven feature sets: one from our proposed multi-stage feature selection (implemented in the `multi_stage_feature_selection.ipynb` notebook) and six from the general-purpose methods listed below:

- **Spearman**, done with [`spearmanr`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.spearmanr.html) from `scipy.stats`.
- **Lasso**, done with [`LassoCV`](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LassoCV.html), which selects the `alpha` parameter by cross-validation.
- **SVM + Sequential Forward Selection**, done with [`SequentialFeatureSelector`](https://rasbt.github.io/mlxtend/api_subpackages/mlxtend.feature_selection/#sequentialfeatureselector) with `forward=True`. The underlying `estimator` is [`LinearSVR`](https://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVR.html#sklearn.svm.LinearSVR) from `sklearn.svm`.
- **SVM + Sequential Backward Selection**, done with [`SequentialFeatureSelector`](https://rasbt.github.io/mlxtend/api_subpackages/mlxtend.feature_selection/#sequentialfeatureselector) with `forward=False`. The underlying `estimator` is [`LinearSVR`](https://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVR.html#sklearn.svm.LinearSVR) from `sklearn.svm`.
- **AIC + Sequential Forward Selection**, done with Sequential Forward Selection (SFS) scored by the Akaike Information Criterion (AIC) of an [`OLS`](https://www.statsmodels.org/stable/examples/notebooks/generated/ols.html) fit.
- **AIC + Sequential Backward Selection**, done with Sequential Backward Selection (SBS) scored by the Akaike Information Criterion (AIC) of an [`OLS`](https://www.statsmodels.org/stable/examples/notebooks/generated/ols.html) fit.

**Inputs**: 

Please specify `DATA_HOME` in the ["Loading the data models"](#Loading-the-data-models) section. The following files are needed:
- Processed data `AD_processed_final.csv`, `RA_processed_final.csv`
- Selected features from multi-stage feature selection `ad_multi_stage_features_{TOP_K}.txt`, `ra_multi_stage_features_{TOP_K}.txt`

**Outputs**:

- `ad_selected_features/` and `ra_selected_features/` directories with the selected features from each method.

# Table of contents

 - [Benchmark feature selection](#Benchmark-feature-selection)<br>
 - [Table of contents](#Table-of-contents)<br>
 - [Preparation](#Preparation)<br>
 - [Loading the data models](#Loading-the-data-models)<br>
 - [Getting `train` and `test` data sets as separate dataframes](#Getting-%60train%60-and-%60test%60-data-sets-as-separate-dataframes)<br>
 - [Feature selection pipeline](#Feature-selection-pipeline)<br>
   - [Get features based on multi-stage feature selection](#Get-features-based-on-multi-stage-feature-selection)<br>
   - [Get features based on `spearmanr` test](#Get-features-based-on-%60spearmanr%60-test)<br>
   - [Get features based on `Lasso`](#Get-features-based-on-%60Lasso%60)<br>
   - [Get features based on `SFS+SVM.LinearSVR`](#Get-features-based-on-%60SFS%2BSVM.LinearSVR%60)<br>
   - [Get features based on `SBS+SVM.LinearSVR`](#Get-features-based-on-%60SBS%2BSVM.LinearSVR%60)<br>
   - [Get features based on `SFS+AIC`](#Get-features-based-on-%60SFS%2BAIC%60)<br>
   - [Get features based on `SBS+AIC`](#Get-features-based-on-%60SBS%2BAIC%60)<br>
 - [Show the results](#Show-the-results)<br>

# Preparation

In [None]:
# System
import os
import sys

import numpy as np

# Data science
import pandas as pd

# Adding the modules to the PYTHONPATH
# add the path to your repository below
REPO_PATH = ""
sys.path.append(REPO_PATH)

from src.utils import *

# Reproducibility
SEED = 42

# Loading the data models

In [None]:
DATA_HOME = ""

# Loading the data
ad_proc_data = pd.read_csv(os.path.join(DATA_HOME, "AD_processed_final.csv"))
ra_proc_data = pd.read_csv(os.path.join(DATA_HOME, "RA_processed_final.csv"))

# Setting up the target, patient ID column names
ad_target_col, ad_patient_id = "endpt_lb_easi1_total_score", "patient_id"
ra_target_col, ra_patient_id = "endpt_lb_das28__crp_", "patient_id"

# Getting the feature sets for both use cases (dropping every feature column that has any missing value)
ad_features = ad_proc_data.filter(regex="^ft").dropna(axis="columns").columns.to_list()
ra_features = ra_proc_data.filter(regex="^ft_").dropna(axis="columns").columns.to_list()

# Drop records where target is NaN
ad_proc_data = ad_proc_data.dropna(subset=[ad_target_col]).reset_index(drop=True)
ra_proc_data = ra_proc_data.dropna(
    subset=[ra_target_col, "ft_lb_c_reactive_protein__mg_l_"]
).reset_index(drop=True)

# Creating block lists for both datasets based on clinical and technical teams
ad_block_list = [
    "ft_sl_actarm",
    "ft_sl_actarmcd",
    "ft_sl_ageu",
    "ft_sl_arm",
    "ft_sl_armcd",
    "ft_sl_domain",
    "ft_sl_dthdtc",
    "ft_sl_ethnic",
    "ft_sl_invid",
    "ft_sl_invnam",
    "ft_sl_rfstdtc",
    "ft_sl_rfendtc",
    "ft_sl_rficdtc",
    "ft_sl_rfpendtc",
    "ft_sl_rfxendtc",
    "ft_sl_rfxstdtc",
    "ft_sl_rowid",
    "ft_sl_dthfl",
    "ft_sl_subjid",
    "ft_sl_studyid",
    "ft_cc_easi1_total_score",
] + ad_proc_data.filter(regex="^ft_cc").columns.to_list()
ra_block_list = (
    [
        "ft_sl_actarm",
        "ft_sl_actarmcd",
        "ft_sl_ageu",
        "ft_sl_arm",
        "ft_sl_armcd",
        "ft_sl_domain",
        "ft_sl_dthdtc",
        "ft_sl_dthfl",
        "ft_sl_ethnic",
        "ft_sl_invid",
        "ft_sl_invnam",
        "ft_sl_rfendtc",
        "ft_sl_rficdtc",
        "ft_sl_rfpendtc",
        "ft_sl_rfstdtc",
        "ft_sl_rfxendtc",
        "ft_sl_rfxstdtc",
        "ft_sl_studyid",
        "ft_sl_subjid",
        "ft_sl_siteid",
    ]
    + ra_proc_data.filter(regex="ft_eff").columns.to_list()
    + ["endpt_lb_das28__esr_"]
)

# Getting features used for modeling
ad_features_to_use = [f for f in ad_features if f not in ad_block_list]
ra_features_to_use = [f for f in ra_features if f not in ra_block_list]

print(
    f"AD dataset contains {ad_proc_data.shape[0]} samples and {ad_proc_data.shape[1]} columns."
)
print()
print(
    f"RA dataset contains {ra_proc_data.shape[0]} samples and {ra_proc_data.shape[1]} columns."
)
print()
print("AD dataset:")
display(ad_proc_data)
print()
print("RA dataset:")
display(ra_proc_data)

# Getting `train` and `test` data sets as separate dataframes

In [None]:
# Getting `X`s
ad_X_train, ad_X_test = (
    ad_proc_data.loc[ad_proc_data["split"] == "TRAIN", ad_features_to_use],
    ad_proc_data.loc[ad_proc_data["split"] == "TEST", ad_features_to_use],
)
ra_X_train, ra_X_test = (
    ra_proc_data.loc[ra_proc_data["split"] == "TRAIN", ra_features_to_use],
    ra_proc_data.loc[ra_proc_data["split"] == "TEST", ra_features_to_use],
)

# Getting `groups`s
ad_groups_train, ad_groups_test = (
    ad_proc_data.loc[ad_proc_data["split"] == "TRAIN", ad_patient_id],
    ad_proc_data.loc[ad_proc_data["split"] == "TEST", ad_patient_id],
)
ra_groups_train, ra_groups_test = (
    ra_proc_data.loc[ra_proc_data["split"] == "TRAIN", ra_patient_id],
    ra_proc_data.loc[ra_proc_data["split"] == "TEST", ra_patient_id],
)

# Creating `y`s
ad_y_train, ad_y_test = (
    ad_proc_data.loc[ad_proc_data["split"] == "TRAIN", ad_target_col],
    ad_proc_data.loc[ad_proc_data["split"] == "TEST", ad_target_col],
)
ra_y_train, ra_y_test = (
    ra_proc_data.loc[ra_proc_data["split"] == "TRAIN", ra_target_col],
    ra_proc_data.loc[ra_proc_data["split"] == "TEST", ra_target_col],
)

print(
    f"AD dataset contains:\n\t"
    f"TRAIN: {ad_X_train.shape[1]} features and {ad_X_train.shape[0]} samples!\n\t"
    f"TEST: {ad_X_test.shape[1]} features and {ad_X_test.shape[0]} samples!\n\t"
)
print(
    f"RA dataset contains:\n\t"
    f"TRAIN: {ra_X_train.shape[1]} features and {ra_X_train.shape[0]} samples!\n\t"
    f"TEST: {ra_X_test.shape[1]} features and {ra_X_test.shape[0]} samples!\n\t"
)

# Feature selection pipeline

## Get features based on multi-stage feature selection

Let us first list the features selected by multi-stage feature selection (please look at `multi_stage_feature_selection.ipynb` for the implementation).
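
The `read_list` and `write_list` helpers used throughout this notebook come from `src.utils`, which is not shown here. As a rough sketch of what such helpers might look like, assuming a plain-text file with one feature name per line (the `_sketch` functions and the feature names below are hypothetical stand-ins, not the repository's actual code):

```python
import os
import tempfile


def write_list_sketch(items, path):
    """Hypothetical stand-in for `write_list`: writes one item per line."""
    parent = os.path.dirname(path)
    if parent:
        # Create the output directory if it does not exist yet
        os.makedirs(parent, exist_ok=True)
    with open(path, "w", encoding="utf-8") as fh:
        fh.writelines(f"{item}\n" for item in items)


def read_list_sketch(path):
    """Hypothetical stand-in for `read_list`: returns the non-empty lines."""
    with open(path, encoding="utf-8") as fh:
        return [line.strip() for line in fh if line.strip()]


# Round trip in a temporary directory; the feature names are made up
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "ad_selected_features", "demo_features.txt")
    write_list_sketch(["ft_lb_example_a", "ft_sl_example_b"], demo_path)
    round_trip = read_list_sketch(demo_path)

print(round_trip)
```

If the real helpers use a different file layout, the feature files listed under **Inputs** have to follow that layout instead.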

In [None]:
# AD - read features from files
fn = os.path.join(DATA_HOME, "ad_selected_features", "ad_multi_stage_features_0.05.txt")
ad_multi_features_0_05 = read_list(fn)

# RA - read features from files
fn = os.path.join(DATA_HOME, "ra_selected_features", "ra_multi_stage_features_0.05.txt")
ra_multi_features_0_05 = read_list(fn)

## Get features based on `spearmanr` test

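In this section we will select features with the Spearman rank correlation test using [`spearmanr`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.spearmanr.html). A feature is kept if its p-value does not exceed the `0.05` significance level and its correlation coefficient with the target is at least `0.2`. Note that this one-sided cutoff excludes negatively correlated features.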
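
To illustrate the filter on its own, here is a small sketch on synthetic data (the column names `ft_pos`, `ft_neg`, `ft_noise` and all numbers are made up for illustration). It applies the same two conditions as the cell below: a p-value threshold and a lower bound on the correlation coefficient.

```python
import numpy as np
import pandas as pd
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
n = 200
signal = rng.normal(size=n)

# Synthetic features: one positively related, one negatively related, one unrelated
X_demo = pd.DataFrame(
    {
        "ft_pos": signal + 0.1 * rng.normal(size=n),
        "ft_neg": -signal + 0.1 * rng.normal(size=n),
        "ft_noise": rng.normal(size=n),
    }
)
y_demo = pd.Series(signal)

selected = []
for col in X_demo.columns:
    # The result unpacks into (correlation, p-value)
    rho, pvalue = spearmanr(X_demo[col], y_demo)
    if pvalue <= 0.05 and rho >= 0.2:
        selected.append(col)

print(selected)
```

Here `ft_neg` is strongly (but negatively) correlated with the target and is therefore not selected; comparing `abs(rho)` against the cutoff would be one way to keep such features, if that is desired.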

In [None]:
from scipy.stats import spearmanr

significance_level = 0.05
statistic_cutoff = 0.2

# AD dataset
print("AD dataset:")
# Run the test once per feature and reuse the result for both conditions
ad_spearman_results = {f: spearmanr(ad_X_train[f], ad_y_train) for f in ad_X_train.columns}
ad_spearman_features = [
    f
    for f, res in ad_spearman_results.items()
    if res.pvalue <= significance_level and res.statistic >= statistic_cutoff
]
print(
    f"The following features were selected ({len(ad_spearman_features)}): {ad_spearman_features}"
)
print()

# Save the features to a file in DATA_HOME
fn = os.path.join(DATA_HOME, "ad_selected_features", "ad_spearman_features.txt")
write_list(ad_spearman_features, fn)


# RA dataset
print("RA dataset:")
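# Run the test once per feature and reuse the result for both conditions
# (same `<=` comparison as for the AD dataset)
ra_spearman_results = {f: spearmanr(ra_X_train[f], ra_y_train) for f in ra_X_train.columns}
ra_spearman_features = [
    f
    for f, res in ra_spearman_results.items()
    if res.pvalue <= significance_level and res.statistic >= statistic_cutoff
]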
print(
    f"The following features were selected ({len(ra_spearman_features)}): {ra_spearman_features}"
)
# Save the features to a file in DATA_HOME
fn = os.path.join(DATA_HOME, "ra_selected_features", "ra_spearman_features.txt")
write_list(ra_spearman_features, fn)

## Get features based on `Lasso`

In this section we will select our features based on [`LassoCV`](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LassoCV.html).
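
As a minimal sketch of the idea on synthetic data (column names and coefficients are invented for illustration): `LassoCV` picks `alpha` by cross-validation, and the features whose fitted coefficient is non-zero form the selected set.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(0)
n = 300

# Six synthetic features; only ft_0 and ft_1 drive the target
X_demo = pd.DataFrame(
    rng.normal(size=(n, 6)), columns=[f"ft_{i}" for i in range(6)]
)
y_demo = 3.0 * X_demo["ft_0"] - 2.0 * X_demo["ft_1"] + 0.1 * rng.normal(size=n)

# Cross-validated choice of alpha, then keep features with non-zero coefficients
lasso = LassoCV(cv=5, random_state=0).fit(X_demo, y_demo)
selected = X_demo.columns[lasso.coef_ != 0].to_list()

print(f"alpha={lasso.alpha_:.4f}, selected={selected}")
```

Lasso penalizes coefficient magnitudes, so the result depends on feature scale; whether the processed data are already standardized is not stated in this notebook.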

In [None]:
from sklearn.linear_model import LassoCV

cv = 5

# AD dataset
print("AD dataset:")
ad_lasso = LassoCV(cv=cv, random_state=SEED, n_jobs=-1).fit(ad_X_train, ad_y_train)
ad_lasso_features = ad_X_train.columns[ad_lasso.coef_ != 0].to_list()
print(
    f"The following features were selected ({len(ad_lasso_features)}): {ad_lasso_features}"
)
print()

# Save the features to a file in DATA_HOME
fn = os.path.join(DATA_HOME, "ad_selected_features", "ad_lasso_features.txt")
write_list(ad_lasso_features, fn)

# RA dataset
print("RA dataset:")
ra_lasso = LassoCV(cv=cv, random_state=SEED).fit(ra_X_train, ra_y_train)
ra_lasso_features = ra_X_train.columns[ra_lasso.coef_ != 0].to_list()
print(
    f"The following features were selected ({len(ra_lasso_features)}): {ra_lasso_features}"
)
# Save the features to a file in DATA_HOME
fn = os.path.join(DATA_HOME, "ra_selected_features", "ra_lasso_features.txt")
write_list(ra_lasso_features, fn)

## Get features based on `SFS+SVM.LinearSVR`

In this section we will select features with [`SequentialFeatureSelector`](https://rasbt.github.io/mlxtend/api_subpackages/mlxtend.feature_selection/#sequentialfeatureselector) in forward mode (`forward=True`). The underlying `estimator` is [`LinearSVR`](https://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVR.html#sklearn.svm.LinearSVR) from `sklearn.svm`. We use the [`mlxtend`](https://github.com/rasbt/mlxtend/tree/master) implementation instead of the [`scikit-learn`](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SequentialFeatureSelector.html#sklearn.feature_selection.SequentialFeatureSelector) one because its `k_features` parameter accepts `'best'` or a `(min, max)` range and then returns the feature subset with the best cross-validation performance, which makes the selection data-driven. Below we use `k_features=(5, 100)`.
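
The cells below pass `groups` (patient IDs) together with an integer `cv`. In scikit-learn, an integer `cv` for a regressor resolves to a plain `KFold`, which does not use `groups`, so records of the same patient can land in both the training and the validation part of a fold. The sketch below, on made-up patient labels, contrasts this with `GroupKFold`. Whether `mlxtend` forwards a splitter object unchanged is an assumption worth checking against its documentation.

```python
import numpy as np
from sklearn.model_selection import GroupKFold, KFold

# 12 samples from 4 hypothetical patients, visits interleaved
groups = np.tile(["p1", "p2", "p3", "p4"], 3)
X_demo = np.arange(12).reshape(-1, 1)


def count_leaky_folds(splitter):
    """Count folds where some patient is in both the train and validation part."""
    leaks = 0
    for train_idx, val_idx in splitter.split(X_demo, groups=groups):
        if set(groups[train_idx]) & set(groups[val_idx]):
            leaks += 1
    return leaks


kfold_leaks = count_leaky_folds(KFold(n_splits=4))  # groups are ignored here
group_leaks = count_leaky_folds(GroupKFold(n_splits=4))  # groups are respected

print(f"KFold leaky folds: {kfold_leaks}, GroupKFold leaky folds: {group_leaks}")
```

If patient-level separation is intended, passing a group-aware splitter such as `GroupKFold(n_splits=5)` as `cv` is one option to consider.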

In [None]:
from mlxtend.feature_selection import SequentialFeatureSelector
from sklearn.svm import LinearSVR

cv = 5

# AD dataset
print("AD dataset:")
estimator = LinearSVR(dual=False, loss="squared_epsilon_insensitive", random_state=SEED)
ad_sfs = SequentialFeatureSelector(
    estimator=estimator, cv=cv, k_features=(5, 100), n_jobs=-1, forward=True, verbose=1
).fit(ad_X_train, ad_y_train, groups=ad_groups_train)
ad_sfs_features = ad_X_train.iloc[:, list(ad_sfs.k_feature_idx_)].columns.to_list()
print(
    f"The following features were selected ({len(ad_sfs_features)}): {ad_sfs_features}"
)
print()

# Save the features to a file in DATA_HOME
fn = os.path.join(DATA_HOME, "ad_selected_features", "ad_sfs_features.txt")
write_list(ad_sfs_features, fn)

# RA dataset
print("RA dataset:")
estimator = LinearSVR(dual=False, loss="squared_epsilon_insensitive", random_state=SEED)
ra_sfs = SequentialFeatureSelector(
    estimator=estimator, cv=cv, k_features=(5, 100), n_jobs=-1, forward=True, verbose=1
).fit(ra_X_train, ra_y_train, groups=ra_groups_train)
ra_sfs_features = ra_X_train.iloc[:, list(ra_sfs.k_feature_idx_)].columns.to_list()
print(
    f"The following features were selected ({len(ra_sfs_features)}): {ra_sfs_features}"
)
# Save the features to a file in DATA_HOME
fn = os.path.join(DATA_HOME, "ra_selected_features", "ra_sfs_features.txt")
write_list(ra_sfs_features, fn)

## Get features based on `SBS+SVM.LinearSVR`

In this section we will select features with [`SequentialFeatureSelector`](https://rasbt.github.io/mlxtend/api_subpackages/mlxtend.feature_selection/#sequentialfeatureselector) in backward mode (`forward=False`). The underlying `estimator` is [`LinearSVR`](https://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVR.html#sklearn.svm.LinearSVR) from `sklearn.svm`. We use the [`mlxtend`](https://github.com/rasbt/mlxtend/tree/master) implementation instead of the [`scikit-learn`](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SequentialFeatureSelector.html#sklearn.feature_selection.SequentialFeatureSelector) one because its `k_features` parameter accepts `'best'` or a `(min, max)` range and then returns the feature subset with the best cross-validation performance, which makes the selection data-driven. Below we use `k_features=(5, 100)`.

In [None]:
from mlxtend.feature_selection import SequentialFeatureSelector
from sklearn.svm import LinearSVR

cv = 5

# AD dataset
print("AD dataset:")
estimator = LinearSVR(dual=False, loss="squared_epsilon_insensitive", random_state=SEED)
ad_sbs = SequentialFeatureSelector(
    estimator=estimator, cv=cv, k_features=(5, 100), n_jobs=-1, forward=False, verbose=5
).fit(ad_X_train, ad_y_train, groups=ad_groups_train)
ad_sbs_features = ad_X_train.iloc[:, list(ad_sbs.k_feature_idx_)].columns.to_list()
print(
    f"The following features were selected ({len(ad_sbs_features)}): {ad_sbs_features}"
)
print()

# Save the features to a file in DATA_HOME
fn = os.path.join(DATA_HOME, "ad_selected_features", "ad_sbs_features.txt")
write_list(ad_sbs_features, fn)

# RA dataset
print("RA dataset:")
estimator = LinearSVR(dual=False, loss="squared_epsilon_insensitive", random_state=SEED)
ra_sbs = SequentialFeatureSelector(
    estimator=estimator, cv=cv, k_features=(5, 100), n_jobs=-1, forward=False, verbose=5
).fit(ra_X_train, ra_y_train, groups=ra_groups_train)
ra_sbs_features = ra_X_train.iloc[:, list(ra_sbs.k_feature_idx_)].columns.to_list()
print(
    f"The following features were selected ({len(ra_sbs_features)}): {ra_sbs_features}"
)
# Save the features to a file in DATA_HOME
fn = os.path.join(DATA_HOME, "ra_selected_features", "ra_sbs_features.txt")
write_list(ra_sbs_features, fn)

## Get features based on `SFS+AIC`

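In this section we will select features with Sequential Forward Selection (SFS), scoring each candidate subset by the Akaike Information Criterion (AIC) of an ordinary least squares ([`OLS`](https://www.statsmodels.org/stable/examples/notebooks/generated/ols.html)) fit. The code below fits the model with scikit-learn's `LinearRegression` and averages the AIC over the cross-validation folds.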
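
The AIC used in the cell below has the form `n * ln(MSE) + 2 * k`, where `n` is the number of samples scored and `k` the number of features. A small worked example with made-up numbers shows how the `2 * k` penalty can outweigh a marginal gain in fit (the function name `aic_from_mse` is illustrative only):

```python
import math


def aic_from_mse(n_samples, mse, n_params):
    """AIC (up to an additive constant) for a least-squares model: n * ln(MSE) + 2 * k."""
    return n_samples * math.log(mse) + 2 * n_params


# Hypothetical comparison: three extra features lower the MSE only slightly
aic_small = aic_from_mse(n_samples=100, mse=1.00, n_params=3)
aic_large = aic_from_mse(n_samples=100, mse=0.99, n_params=6)

print(f"small model: {aic_small:.3f}, large model: {aic_large:.3f}")
```

The smaller model has the lower AIC here (6.0 versus roughly 10.995), so it would be preferred. Lower is better, which is why the selection loop takes the minimum.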

In [None]:
from joblib import Parallel, delayed
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score

cv = 5
max_features = 100


def _calculate_aic(model, data, target):
    """AIC from here: https://machinelearningmastery.com/probabilistic-model-selection-measures/"""
    n, num_params = data.shape[0], data.shape[1]
    mse = mean_squared_error(target, model.predict(data))
    aic = n * np.log(mse) + 2 * num_params

    return aic


def aic_iter(feature, curr_X_train, X_train, y_train, groups, cv):
    temp_X_train = curr_X_train.join(X_train[feature])
    curr_aic = cross_val_score(
        LinearRegression(),
        temp_X_train,
        y_train,
        groups=groups,
        cv=cv,
        n_jobs=-1,
        scoring=_calculate_aic,
    ).mean()
    return curr_aic, feature


def SFS_AIC(X_train, y_train, groups):
    """A helper function to select features based on AIC using SFS technique. Similar to
    https://rasbt.github.io/mlxtend/api_subpackages/mlxtend.feature_selection/#sequentialfeatureselector
    with `forward=True`."""

    aics, added_features = [], []

    # Start with no features
    curr_X_train = pd.DataFrame(index=X_train.index)

    # Never try to add more features than the dataset contains
    for _ in range(min(max_features, X_train.shape[1])):
        best_aic, best_feature = float("inf"), None

        # Try adding each feature
        results = Parallel(n_jobs=-1)(
            delayed(aic_iter)(feature, curr_X_train, X_train, y_train, groups, cv)
            for feature in X_train.columns
            if feature not in curr_X_train.columns
        )

        # Extract best feature and best AIC from results according to first element of tuple
        best_idx = np.argmin([result[0] for result in results])
        best_aic, best_feature = results[best_idx]

        # Add best feature
        curr_X_train = curr_X_train.join(X_train[best_feature])

        # Adding for output
        aics.append(best_aic)
        added_features.append(best_feature)

    # Finding idx of best AIC
    best_idx = np.argmin(aics)

    # Get selected features
    selected_features = added_features[: best_idx + 1]

    return selected_features, aics, added_features


# AD dataset
print("AD dataset:")
ad_sfs_aic_features, ad_sfs_aics, ad_sfs_aic_added_features = SFS_AIC(
    ad_X_train, ad_y_train, ad_groups_train
)
print(
    f"The following features were selected ({len(ad_sfs_aic_features)}): {ad_sfs_aic_features}"
)
print()

# Save the features to a file in DATA_HOME
fn = os.path.join(DATA_HOME, "ad_selected_features", "ad_sfs_aic_features.txt")
write_list(ad_sfs_aic_features, fn)

# RA dataset
print("RA dataset:")
ra_sfs_aic_features, ra_sfs_aics, ra_sfs_aic_added_features = SFS_AIC(
    ra_X_train, ra_y_train, ra_groups_train
)
print(
    f"The following features were selected ({len(ra_sfs_aic_features)}): {ra_sfs_aic_features}"
)
print()

# Save the features to a file in DATA_HOME
fn = os.path.join(DATA_HOME, "ra_selected_features", "ra_sfs_aic_features.txt")
write_list(ra_sfs_aic_features, fn)

## Get features based on `SBS+AIC`

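In this section we will select features with Sequential Backward Selection (SBS), scoring each candidate subset by the Akaike Information Criterion (AIC) of an ordinary least squares ([`OLS`](https://www.statsmodels.org/stable/examples/notebooks/generated/ols.html)) fit. The code below fits the model with scikit-learn's `LinearRegression` and averages the AIC over the cross-validation folds.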
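
The bookkeeping at the end of backward selection can be easy to misread, so here is a tiny sketch with invented feature names and AIC values. Features are removed one at a time, the step with the lowest AIC is located, and everything removed up to and including that step is dropped from the full feature list.

```python
# Hypothetical removal history from a backward pass over four features
all_features = ["ft_a", "ft_b", "ft_c", "ft_d"]
removed_in_order = ["ft_c", "ft_a", "ft_d"]
aics = [10.0, 8.5, 9.2]  # AIC after each removal (made-up values)

# Step with the lowest AIC
best_step = min(range(len(aics)), key=aics.__getitem__)

# Drop everything removed up to and including that step,
# keeping the original column order of the remaining features
dropped = set(removed_in_order[: best_step + 1])
selected = [f for f in all_features if f not in dropped]

print(best_step, selected)
```

Filtering the original column list, rather than taking a set difference, keeps the output order stable between runs.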

In [None]:
from joblib import Parallel, delayed
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score

cv = 5


def _calculate_aic(model, data, target):
    """AIC from here: https://machinelearningmastery.com/probabilistic-model-selection-measures/"""
    n, num_params = data.shape[0], data.shape[1]
    mse = mean_squared_error(target, model.predict(data))
    aic = n * np.log(mse) + 2 * num_params

    return aic


def aic_iter(feature, curr_X_train, y_train, groups, cv):
    temp_X_train = curr_X_train.drop(columns=feature)
    curr_aic = cross_val_score(
        LinearRegression(),
        temp_X_train,
        y_train,
        groups=groups,
        cv=cv,
        n_jobs=-1,
        scoring=_calculate_aic,
    ).mean()
    return curr_aic, feature


def SBS_AIC(X_train, y_train, groups):
    """A helper function to select features based on AIC using SBS technique. Similar to
    https://rasbt.github.io/mlxtend/api_subpackages/mlxtend.feature_selection/#sequentialfeatureselector
    with `forward=False`."""

    aics, removed_features = [], []

    # Start with all features
    curr_X_train = X_train.copy()

    for _ in range(X_train.shape[1] - 1):
        best_aic, best_feature = float("inf"), None

        # Try removing each feature
        results = Parallel(n_jobs=-1)(
            delayed(aic_iter)(feature, curr_X_train, y_train, groups, cv)
            for feature in curr_X_train.columns
        )

        # Extract best feature and best AIC from results according to first element of tuple
        best_idx = np.argmin([result[0] for result in results])
        best_aic, best_feature = results[best_idx]

        # Remove best feature
        curr_X_train = curr_X_train.drop(columns=best_feature)

        # Adding for output
        aics.append(best_aic)
        removed_features.append(best_feature)

    # Finding idx of best AIC
    best_idx = np.argmin(aics)

    # Get selected features
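    # (filtering the column list keeps a stable, reproducible feature order)
    removed = set(removed_features[: best_idx + 1])
    selected_features = [f for f in X_train.columns if f not in removed]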

    return selected_features, aics, removed_features


# AD dataset
print("AD dataset:")
ad_sbs_aic_features, ad_sbs_aics, ad_sbs_aic_removed_features = SBS_AIC(
    ad_X_train, ad_y_train, ad_groups_train
)
print(
    f"The following features were selected ({len(ad_sbs_aic_features)}): {ad_sbs_aic_features}"
)
print()

# Save the features to a file in DATA_HOME
fn = os.path.join(DATA_HOME, "ad_selected_features", "ad_sbs_aic_features.txt")
write_list(ad_sbs_aic_features, fn)

# RA dataset
print("RA dataset:")
ra_sbs_aic_features, ra_sbs_aics, ra_sbs_aic_removed_features = SBS_AIC(
    ra_X_train, ra_y_train, ra_groups_train
)
print(
    f"The following features were selected ({len(ra_sbs_aic_features)}): {ra_sbs_aic_features}"
)
print()

# Save the features to a file in DATA_HOME
fn = os.path.join(DATA_HOME, "ra_selected_features", "ra_sbs_aic_features.txt")
write_list(ra_sbs_aic_features, fn)

# Show the results

In [None]:
# Get the features and print them
ad_selected_features = {
    "multi_stage_0_05": ad_multi_features_0_05,
    "spearman": ad_spearman_features,
    "lasso": ad_lasso_features,
    "sfs": ad_sfs_features,
    "sbs": ad_sbs_features,
    "sfs_aic": ad_sfs_aic_features,
    "sbs_aic": ad_sbs_aic_features,
}
ra_selected_features = {
    "multi_stage_0_05": ra_multi_features_0_05,
    "spearman": ra_spearman_features,
    "lasso": ra_lasso_features,
    "sfs": ra_sfs_features,
    "sbs": ra_sbs_features,
    "sfs_aic": ra_sfs_aic_features,
    "sbs_aic": ra_sbs_aic_features,
}

print("AD:")
print(ad_selected_features)
print()
print("RA:")
print(ra_selected_features)

In [None]:
for tag, features in ad_selected_features.items():
    print(f"{tag} ({len(features)}):")
    for pref in ["ft_sl", "ft_lb", "ft_mh", "ft_cm"]:
        curr_fts = [t.replace(f"{pref}_", "") for t in features if t.startswith(pref)]
        print(f"{pref} ({len(curr_fts)}): {', '.join(curr_fts)}")
        print()
    print()

In [None]:
for tag, features in ra_selected_features.items():
    print(f"{tag} ({len(features)}):")
    for pref in ["ft_sl", "ft_lb", "ft_mh", "ft_cm"]:
        curr_fts = [t.replace(f"{pref}_", "") for t in features if t.startswith(pref)]
        print(f"{pref} ({len(curr_fts)}): {', '.join(curr_fts)}")
        print()
    print()