# Comparing feature selection with STABL (Random Forest, XGBoost, Lasso, ElasticNet)

This notebook compares feature selection and classification/regression performance between STABL with tree-based models (Random Forest, XGBoost) tuned by grid search, and STABL with linear models (Lasso, ElasticNet). Results are evaluated by cross-validation, in terms of predictive performance and number of selected features.

In [None]:
# Import the required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.linear_model import LogisticRegression, Lasso, ElasticNet
from sklearn.metrics import accuracy_score, roc_auc_score, mean_squared_error
from stabl.stabl import Stabl
from stabl.adaptive import ALasso, ALogitLasso
from stabl.preprocessing import LowInfoFilter
# from stabl.visualization import plot_fdr_graph
from sklearn.feature_selection import VarianceThreshold
from sklearn.base import clone
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline



## 1. Load and explore the data

We load a tabular dataset (for example CyTOF.csv and outcome.csv) and display its dimensions, variable types, and a few descriptive statistics.

In [None]:
# 1. Load the data

features_path = "/Users/noeamar/Documents/Stanford/data/olivier_data/ina_13OG_df_168_filtered_allstim_new.csv"
outcome_path = "/Users/noeamar/Documents/Stanford/data/olivier_data/outcome_table_all_pre.csv"

features = pd.read_csv(features_path, index_col=0)
outcome = pd.read_csv(outcome_path, index_col=0, dtype={"DOS": int})

# 2. Keep only the patients present in both files
common_idx = features.index.intersection(outcome.index)
features = features.loc[common_idx]
outcome  = outcome.loc[common_idx]

# 3. Extract the target series
y = outcome["DOS"]

# 4. Quick diagnostics
print("Features shape:", features.shape)
print("Outcome shape:",  outcome.shape)
print("Features columns:", features.columns.tolist()[:10], "…")
print("Outcome columns:",  outcome.columns.tolist())

print("\n=== Descriptive statistics of the features ===")
print(features.describe().T)

print("\n=== Descriptive statistics of DOS ===")
print(y.describe())

# 5. Histogram of the continuous target DOS
plt.figure(figsize=(6,4))
y.hist(bins=20)
plt.title("Distribution of the target variable (DOS)")
plt.xlabel("DOS")
plt.ylabel("Count")
plt.tight_layout()
plt.show()


## 2. Define the preprocessing

Preprocessing pipeline: zero-variance filtering, low-information filtering, median imputation, standardization.

## Configuring the STABL pipelines with tree-based and linear models

In this section, we configure several feature-selection and modeling pipelines:
- **STABL + Random Forest**: feature selection with STABL using Random Forest as the base estimator, with a grid search over hyperparameters and cross-validation.
- **STABL + XGBoost**: same principle, with XGBoost as the base estimator.
- **STABL + Lasso**: linear pipeline using Lasso.
- **STABL + ElasticNet**: linear pipeline using ElasticNet.

For each pipeline, we apply feature selection, then evaluate cross-validated performance and the number of selected features. The results are compared at the end of the notebook.
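
To make the idea behind these pipelines concrete, here is a minimal sketch of stability selection with artificial (permuted) features, written with scikit-learn only. It is not the `Stabl` implementation: the synthetic data, the fixed `alpha`, and the helper name `selection_frequencies` are assumptions for illustration, and the threshold is simply set to the highest selection frequency reached by any decoy rather than chosen by minimising an FDR estimate.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

def selection_frequencies(X, y, alpha=0.2, n_bootstraps=50, sample_fraction=0.5, seed=0):
    """Fraction of subsamples in which each column gets a non-zero Lasso coefficient."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    counts = np.zeros(X.shape[1])
    for _ in range(n_bootstraps):
        # Subsample without replacement, as with replace=False in the notebook
        idx = rng.choice(n, size=int(sample_fraction * n), replace=False)
        model = Lasso(alpha=alpha, max_iter=10_000).fit(X[idx], y[idx])
        counts += (model.coef_ != 0)
    return counts / n_bootstraps

rng = np.random.default_rng(0)
n_samples, n_features = 200, 20
X = rng.normal(size=(n_samples, n_features))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=n_samples)  # 2 informative columns

# Decoys: each column permuted independently, which breaks any link with y
X_decoy = np.column_stack([rng.permutation(X[:, j]) for j in range(n_features)])
X_all = StandardScaler().fit_transform(np.hstack([X, X_decoy]))

freq = selection_frequencies(X_all, y)
freq_real, freq_decoy = freq[:n_features], freq[n_features:]

# Keep the real features selected more often than the most frequently selected decoy
threshold = freq_decoy.max()
selected = np.flatnonzero(freq_real > threshold)
print("threshold:", threshold, "selected:", selected)
```

The decoys give an empirical reference for how often an uninformative feature gets picked, which is what makes a data-driven threshold possible.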

## Cross-validation, saving results, and visualization

For each model, we:
- Run a (stratified) cross-validation on the dataset.
- Save the ROC and PR curves in the `Benchmarks results/` folder, named after the model and dataset.
- Save the importances of the selected features and their count.
- Compare performance (AUC, accuracy, etc.) and the number of selected features across all models.

The results are presented as tables and plots to make the comparison easier.
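
As a rough illustration of this evaluation loop, the sketch below runs a nested cross-validation on synthetic regression data and records, per outer fold, the test score and the number of non-zero coefficients. It uses R² because the code cells later in the notebook are regression benchmarks; the data, the two candidate models, and the grid are assumptions, not the notebook's actual configuration.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet, Lasso
from sklearn.metrics import r2_score
from sklearn.model_selection import GridSearchCV, KFold

X, y = make_regression(n_samples=150, n_features=30, n_informative=5,
                       noise=10.0, random_state=0)

outer_cv = KFold(n_splits=5, shuffle=True, random_state=0)
inner_cv = KFold(n_splits=3, shuffle=True, random_state=0)
alphas = np.logspace(-2, 2, 10)

candidates = {
    "Lasso": GridSearchCV(Lasso(max_iter=10_000), {"alpha": alphas},
                          scoring="r2", cv=inner_cv),
    "ElasticNet": GridSearchCV(ElasticNet(max_iter=10_000, l1_ratio=0.7), {"alpha": alphas},
                               scoring="r2", cv=inner_cv),
}

rows = []
for name, search in candidates.items():
    for fold, (tr, te) in enumerate(outer_cv.split(X)):
        search.fit(X[tr], y[tr])          # tuning only sees the outer training fold
        best = search.best_estimator_
        rows.append({
            "model": name,
            "fold": fold,
            "r2": r2_score(y[te], best.predict(X[te])),
            "n_selected": int(np.count_nonzero(best.coef_)),
        })

results = pd.DataFrame(rows)
summary = results.groupby("model")[["r2", "n_selected"]].mean()
print(summary)
```

Keeping one row per model and fold makes it easy to build both the summary table and per-fold plots from the same DataFrame.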

## Conclusion

This notebook compares feature selection and classification performance between STABL variants (Random Forest, XGBoost, Lasso, ElasticNet) on the CyTOF dataset. The ROC and PR curves, feature importances, and result tables are saved in the `Benchmarks results/` folder with explicit names for each model and dataset. You can adapt this pipeline to other datasets by modifying the data-loading section.

In [None]:
# Preprocessing pipeline (to be reused in each pipeline)
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import VarianceThreshold
from stabl.preprocessing import LowInfoFilter

preprocessing = Pipeline([
    ("variance_threshold", VarianceThreshold(threshold=0)),
    ("low_info_filter", LowInfoFilter(max_nan_fraction=0.2)),
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])

In [None]:
from pathlib import Path
import numpy as np
import pandas as pd
from tqdm.auto import tqdm, trange

from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import RepeatedKFold, RandomizedSearchCV
from sklearn.metrics import r2_score

from xgboost import XGBRegressor
from stabl.data import load_onset_data
from stabl.stacked_generalization import stacked_multi_omic

# -------------------------
# 0. Global config
# -------------------------
RANDOM_STATE = 42
np.random.seed(RANDOM_STATE)

# -------------------------
# 1. Load data
# -------------------------
print("📥 Loading onset data …")
X_dict, _, y, _, _, _ = load_onset_data(features_path, outcome_path)
print(f"✔ Loaded {len(X_dict)} omics, {y.shape[0]} samples, outcome: {y.name}")

outer_cv = RepeatedKFold(n_splits=5, n_repeats=3, random_state=RANDOM_STATE)

# -------------------------
# 2. Pre-processing pipeline
# -------------------------
prepro = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("std",     StandardScaler()),
])

# -------------------------
# 3. Hyper-parameter space
# -------------------------
param_dist = {
    "xgb__learning_rate":    [0.03, 0.05, 0.07, 0.1],
    "xgb__n_estimators":     [300, 500, 800],
    "xgb__max_depth":        [2, 3, 4],
    "xgb__min_child_weight": [1, 5, 10],
    "xgb__gamma":            [0, 0.1, 0.3],
    "xgb__subsample":        [0.6, 0.8, 1.0],
    "xgb__colsample_bytree": [0.6, 0.8, 1.0],
    "xgb__reg_alpha":        [0, 0.5, 1, 2],
    "xgb__reg_lambda":       [0.5, 1, 2, 4],
}

inner_cv = RepeatedKFold(n_splits=5, n_repeats=5, random_state=RANDOM_STATE)


def make_xgb_pipeline():
    return Pipeline([
        ("prepro", prepro),
        ("xgb", XGBRegressor(
            objective="reg:squarederror",
            random_state=RANDOM_STATE,
            tree_method="hist",
            eval_metric="rmse",
        )),
    ])

# -------------------------
# 4. Per-omic tuning + OOF preds
# -------------------------
print("\n🔧 Hyper-parameter optimisation per omic …")

preds_train = pd.DataFrame(index=y.index)
best_params_by_omic = {}

for omic_name, X_omic in tqdm(X_dict.items(), desc="Omics", unit="omic"):

    pipe = make_xgb_pipeline()
    rand_xgb = RandomizedSearchCV(
        estimator=pipe,
        param_distributions=param_dist,
        n_iter=120,
        scoring="r2",
        cv=inner_cv,
        n_jobs=-1,
        verbose=0,
        random_state=RANDOM_STATE,
        refit=True,
    )

    rand_xgb.fit(X_omic, y)  # no early stopping, which avoids the fit_params error

    best_params_by_omic[omic_name] = rand_xgb.best_params_
    print(f"✓ {omic_name}: best CV R² = {rand_xgb.best_score_:.3f}")

    # OOF predictions with outer CV
    fold_preds = pd.Series(index=y.index, dtype=float)
    oof_bar = trange(outer_cv.get_n_splits(), desc=f"OOF {omic_name}", leave=False)
    for k, (tr, te) in enumerate(outer_cv.split(X_omic, y)):
        X_tr, X_te = X_omic.iloc[tr], X_omic.iloc[te]
        y_tr       = y.iloc[tr]

        best_pipe = make_xgb_pipeline().set_params(**rand_xgb.best_params_)
        best_pipe.fit(X_tr, y_tr)
        fold_preds.iloc[te] = best_pipe.predict(X_te)
        oof_bar.update(1)
    oof_bar.close()
    preds_train[omic_name] = fold_preds

# -------------------------
# 5. Late-fusion (stacking)
# -------------------------
print("\n🔗 Late-fusion stacking …")
stacked_df, weights = stacked_multi_omic(preds_train, y, task_type="regression", n_iter=10000)
print("✓ R² stacked = {:.3f}".format(r2_score(y, stacked_df["Stacked Gen. Predictions"])))
print("Omic weights:\n", weights)

# -------------------------
# 6. Save artefacts
# -------------------------
output_dir = Path("results_multiomic")
output_dir.mkdir(exist_ok=True)

print("💾 Saving artefacts …")

pd.concat({k: pd.Series(v) for k, v in best_params_by_omic.items()}, axis=1).to_csv(output_dir / "best_params_by_omic.csv")
preds_train.to_csv(output_dir / "preds_per_omic.csv")
stacked_df.to_csv(output_dir / "stacked_predictions.csv")
weights.to_csv(output_dir / "stacked_weights.csv")

print("\n🎉 Pipeline finished. Results saved in", output_dir)
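
One point worth noting about the out-of-fold loop in the cell above: with `RepeatedKFold(n_repeats=3)`, every sample lands in a test fold three times, so assigning into `fold_preds.iloc[te]` keeps only the prediction from the last repeat. A possible alternative, sketched below on synthetic data with a plain `Ridge` model standing in for the tuned pipeline (both are assumptions), accumulates the predictions and averages them per sample.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import RepeatedKFold

X_arr, y_arr = make_regression(n_samples=60, n_features=8, noise=5.0, random_state=0)
X_demo = pd.DataFrame(X_arr)
y_demo = pd.Series(y_arr)

cv = RepeatedKFold(n_splits=5, n_repeats=3, random_state=42)

pred_sum = np.zeros(len(y_demo))     # running sum of test-fold predictions
pred_count = np.zeros(len(y_demo))   # number of times each sample was in a test fold

for tr, te in cv.split(X_demo, y_demo):
    model = Ridge(alpha=1.0).fit(X_demo.iloc[tr], y_demo.iloc[tr])
    pred_sum[te] += model.predict(X_demo.iloc[te])
    pred_count[te] += 1

# Average over repeats: one out-of-fold prediction per sample
oof_mean = pd.Series(pred_sum / pred_count, index=y_demo.index)
print(pred_count[:5])   # each sample is predicted once per repeat
```

Whether averaging or keeping a single repeat is preferable depends on how the stacking step is meant to consume these predictions.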


In [None]:
from pathlib import Path
import numpy as np
import pandas as pd
from tqdm.auto import tqdm, trange

from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import RepeatedKFold, RandomizedSearchCV
from sklearn.metrics import r2_score

from sklearn.ensemble import RandomForestRegressor          # ← RF
from stabl.data import load_onset_data
from stabl.stacked_generalization import stacked_multi_omic

# ------------------------- 0. Global config
RANDOM_STATE = 42
np.random.seed(RANDOM_STATE)

# ------------------------- 1. Load data
print("📥 Loading onset data …")
X_dict, _, y, _, _, _ = load_onset_data(features_path, outcome_path)
print(f"✔ Loaded {len(X_dict)} omics, {y.shape[0]} samples, outcome: {y.name}")

outer_cv = RepeatedKFold(n_splits=5, n_repeats=3, random_state=RANDOM_STATE)

# ------------------------- 2. Pre-processing pipeline
prepro = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("std",     StandardScaler()),
])

# ------------------------- 3. Hyper-parameter space (RF)
param_dist = {
    "rf__n_estimators"     : [300, 500, 800, 1200],
    "rf__max_depth"        : [None, 5, 10, 20],
    "rf__max_features"     : [0.2, 0.4, 0.6, 1.0],   # 1.0 = all features ("auto" was removed in scikit-learn 1.3)
    "rf__min_samples_split": [2, 5, 10],
    "rf__min_samples_leaf" : [1, 2, 4],
    "rf__bootstrap"        : [True, False],
}

inner_cv = RepeatedKFold(n_splits=5, n_repeats=5, random_state=RANDOM_STATE)

def make_rf_pipeline():
    return Pipeline([
        ("prepro", prepro),
        ("rf", RandomForestRegressor(
            random_state   = RANDOM_STATE,
            n_jobs         = -1,        # internal parallelism
            oob_score      = False,
        )),
    ])

# ------------------------- 4. Per-omic tuning + OOF preds
print("\n🔧 Hyper-parameter optimisation RF per omic …")

preds_train = pd.DataFrame(index=y.index)
best_params_by_omic = {}

for omic_name, X_omic in tqdm(X_dict.items(), desc="Omics", unit="omic"):

    pipe = make_rf_pipeline()
    rand_rf = RandomizedSearchCV(
        estimator              = pipe,
        param_distributions    = param_dist,
        n_iter                 = 120,
        scoring                = "r2",
        cv                     = inner_cv,
        n_jobs                 = -1,
        verbose                = 0,
        random_state           = RANDOM_STATE,
        refit                  = True,
    )

    rand_rf.fit(X_omic, y)

    best_params_by_omic[omic_name] = rand_rf.best_params_
    print(f"✓ {omic_name}: best CV R² = {rand_rf.best_score_:.3f}")

    # OOF predictions with outer CV
    fold_preds = pd.Series(index=y.index, dtype=float)
    oof_bar = trange(outer_cv.get_n_splits(), desc=f"OOF {omic_name}", leave=False)
    for k, (tr, te) in enumerate(outer_cv.split(X_omic, y)):
        X_tr, X_te = X_omic.iloc[tr], X_omic.iloc[te]
        y_tr       = y.iloc[tr]

        best_pipe = make_rf_pipeline().set_params(**rand_rf.best_params_)
        best_pipe.fit(X_tr, y_tr)
        fold_preds.iloc[te] = best_pipe.predict(X_te)
        oof_bar.update(1)
    oof_bar.close()
    preds_train[omic_name] = fold_preds

# ------------------------- 5. Late-fusion (stacking)
print("\n🔗 Late-fusion stacking …")
stacked_df, weights = stacked_multi_omic(preds_train, y, task_type="regression", n_iter=10000)
print("✓ R² stacked = {:.3f}".format(r2_score(y, stacked_df["Stacked Gen. Predictions"])))
print("Omic weights:\n", weights)

# ------------------------- 6. Save artefacts
output_dir = Path("results_multiomic_RF")
output_dir.mkdir(exist_ok=True)

print("💾 Saving artefacts …")
pd.concat({k: pd.Series(v) for k, v in best_params_by_omic.items()}, axis=1).to_csv(output_dir / "best_params_by_omic.csv")
preds_train.to_csv(output_dir / "preds_per_omic.csv")
stacked_df.to_csv(output_dir / "stacked_predictions.csv")
weights.to_csv(output_dir / "stacked_weights.csv")

print("\n🎉 Pipeline finished. Results saved in", output_dir)
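
The late-fusion step combines the per-omic out-of-fold predictions into a single prediction. The internals of `stacked_multi_omic` are not shown in this notebook, so the sketch below is only a generic illustration of the idea: it fits non-negative linear weights on synthetic per-omic predictions with scikit-learn's `LinearRegression(positive=True)`. The column names and noise levels are made up.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
n = 200
y_true = pd.Series(rng.normal(size=n), name="outcome")

# Hypothetical out-of-fold predictions from three omics of decreasing quality
preds = pd.DataFrame({
    "omic_A": y_true + rng.normal(scale=0.5, size=n),
    "omic_B": y_true + rng.normal(scale=1.0, size=n),
    "omic_C": rng.normal(size=n),            # pure noise, unrelated to y
})

# Non-negative weights: an omic can be ignored but never counted against the outcome
meta = LinearRegression(positive=True).fit(preds, y_true)
fusion_weights = pd.Series(meta.coef_, index=preds.columns)
stacked = pd.Series(meta.predict(preds), index=preds.index)

print(fusion_weights.round(3))
print("R² omic_A :", round(r2_score(y_true, preds["omic_A"]), 3))
print("R² stacked:", round(r2_score(y_true, stacked), 3))
```

Because the weights here are fitted and scored on the same samples, the stacked R² is optimistic; it is meant to show the mechanics, not to estimate generalisation.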


In [None]:
from pathlib import Path
import numpy as np
import pandas as pd
from tqdm.auto import tqdm, trange

from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import RepeatedKFold, RandomizedSearchCV
from sklearn.metrics import r2_score

from lightgbm import LGBMRegressor                      # ← LightGBM
from stabl.data import load_onset_data
from stabl.stacked_generalization import stacked_multi_omic

# ------------------------- 0. Global config
RANDOM_STATE = 42
np.random.seed(RANDOM_STATE)

# ------------------------- 1. Load data
print("📥 Loading onset data …")
X_dict, _, y, _, _, _ = load_onset_data(features_path, outcome_path)
print(f"✔ Loaded {len(X_dict)} omics, {y.shape[0]} samples, outcome: {y.name}")

outer_cv = RepeatedKFold(n_splits=5, n_repeats=3, random_state=RANDOM_STATE)

# ------------------------- 2. Pre-processing pipeline
prepro = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("std",     StandardScaler()),
])

# ------------------------- 3. Hyper-parameter space (LightGBM)
param_dist = {
    "lgb__learning_rate" : [0.01, 0.03, 0.05, 0.07],
    "lgb__n_estimators"  : [500, 800, 1200],
    "lgb__max_depth"     : [-1, 4, 6, 8],
    "lgb__num_leaves"    : [31, 63, 127, 255],       # should stay ≤ 2**max_depth
    "lgb__subsample"     : [0.6, 0.8, 1.0],
    "lgb__colsample_bytree":[0.6, 0.8, 1.0],
    "lgb__min_child_samples":[5, 10, 20],
    "lgb__reg_alpha"     : [0, 0.5, 1],
    "lgb__reg_lambda"    : [0, 0.5, 1],
}

inner_cv = RepeatedKFold(n_splits=5, n_repeats=5, random_state=RANDOM_STATE)

def make_lgb_pipeline():
    return Pipeline([
        ("prepro", prepro),
        ("lgb", LGBMRegressor(
            objective     = "regression",
            random_state  = RANDOM_STATE,
            boosting_type = "gbdt",
            n_jobs        = -1,
            verbose       = -1,
        )),
    ])

# ------------------------- 4. Per-omic tuning + OOF preds
print("\n🔧 Hyper-parameter optimisation LightGBM per omic …")

preds_train = pd.DataFrame(index=y.index)
best_params_by_omic = {}

for omic_name, X_omic in tqdm(X_dict.items(), desc="Omics", unit="omic"):

    pipe = make_lgb_pipeline()
    rand_lgb = RandomizedSearchCV(
        estimator           = pipe,
        param_distributions = param_dist,
        n_iter              = 120,
        scoring             = "r2",
        cv                  = inner_cv,
        n_jobs              = -1,
        random_state        = RANDOM_STATE,
        verbose             = 0,
        refit               = True,
    )

    rand_lgb.fit(X_omic, y)

    best_params_by_omic[omic_name] = rand_lgb.best_params_
    print(f"✓ {omic_name}: best CV R² = {rand_lgb.best_score_:.3f}")

    # OOF predictions with outer CV
    fold_preds = pd.Series(index=y.index, dtype=float)
    oof_bar = trange(outer_cv.get_n_splits(), desc=f"OOF {omic_name}", leave=False)
    for k, (tr, te) in enumerate(outer_cv.split(X_omic, y)):
        X_tr, X_te = X_omic.iloc[tr], X_omic.iloc[te]
        y_tr       = y.iloc[tr]

        best_pipe = make_lgb_pipeline().set_params(**rand_lgb.best_params_)
        best_pipe.fit(X_tr, y_tr)
        fold_preds.iloc[te] = best_pipe.predict(X_te)
        oof_bar.update(1)
    oof_bar.close()
    preds_train[omic_name] = fold_preds

# ------------------------- 5. Late-fusion (stacking)
print("\n🔗 Late-fusion stacking …")
stacked_df, weights = stacked_multi_omic(preds_train, y, task_type="regression", n_iter=10000)
print("✓ R² stacked = {:.3f}".format(r2_score(y, stacked_df["Stacked Gen. Predictions"])))
print("Omic weights:\n", weights)

# ------------------------- 6. Save artefacts
output_dir = Path("results_multiomic_LightGBM")
output_dir.mkdir(exist_ok=True)

print("💾 Saving artefacts …")
pd.concat({k: pd.Series(v) for k, v in best_params_by_omic.items()}, axis=1).to_csv(output_dir / "best_params_by_omic.csv")
preds_train.to_csv(output_dir / "preds_per_omic.csv")
stacked_df.to_csv(output_dir / "stacked_predictions.csv")
weights.to_csv(output_dir / "stacked_weights.csv")

print("\n🎉 Pipeline finished. Results saved in", output_dir)
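
The comment in the LightGBM search space says `num_leaves` should stay at or below `2**max_depth`, but sampling both lists independently can produce combinations such as `max_depth=4` with `num_leaves=255`. Since `RandomizedSearchCV` accepts a list of dictionaries for `param_distributions`, one option is to build one dictionary per `max_depth`. The sketch below only constructs and checks such a space; the helper name `constrained_space` is hypothetical, and when no listed value fits it falls back to `2**max_depth` itself.

```python
from sklearn.model_selection import ParameterGrid

MAX_DEPTHS = [-1, 4, 6, 8]
NUM_LEAVES = [31, 63, 127, 255]

def constrained_space(max_depths, num_leaves):
    """One dict per max_depth, keeping only num_leaves <= 2**max_depth (-1 means no depth limit)."""
    space = []
    for depth in max_depths:
        if depth == -1:
            allowed = list(num_leaves)
        else:
            # Fall back to the largest admissible value if none of the listed ones fits
            allowed = [n for n in num_leaves if n <= 2 ** depth] or [2 ** depth]
        space.append({"lgb__max_depth": [depth], "lgb__num_leaves": allowed})
    return space

space = constrained_space(MAX_DEPTHS, NUM_LEAVES)
for combo in ParameterGrid(space):
    print(combo)
```

The remaining hyperparameters from `param_dist` can be merged into each dictionary before passing the list to `RandomizedSearchCV`.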


In [2]:
# ===============================
# 1. Imports and general configuration
# ===============================
import os
import shutil
import numpy as np
import pandas as pd
from stabl import data
from stabl.multi_omic_pipelines import multi_omic_stabl_cv
from sklearn.model_selection import RepeatedStratifiedKFold, GridSearchCV, RepeatedKFold
from sklearn.linear_model import LogisticRegression
from stabl.stabl import Stabl
from stabl.adaptive import ALogitLasso
from sklearn.base import clone
from xgboost import XGBRegressor
from sklearn.ensemble import RandomForestRegressor
from catboost import CatBoostRegressor
from lightgbm import LGBMRegressor
from sklearn.linear_model import Lasso
from stabl.data import load_onset_data
from pathlib import Path
from sklearn.linear_model import ElasticNet

np.random.seed(42)

# ===============================
# 2. Cross-validation splits
# ===============================
# Outer CV for the overall evaluation, inner CV for the hyperparameter search
outer_cv = RepeatedKFold(n_splits=5, n_repeats=3, random_state=42)
inner_cv = RepeatedKFold(n_splits=5, n_repeats=5, random_state=42)

# ===============================
# 3. Estimators and hyperparameter grids
# ===============================
artificial_type = "knockoff"  # or "random_permutation"

# Lasso
lasso = Lasso(max_iter=int(1e6), random_state=42)
lasso_cv = GridSearchCV(lasso, param_grid={"alpha": np.logspace(-2, 2, 30)}, scoring="r2", cv=inner_cv, n_jobs=-1)

#ElasticNet
en = ElasticNet(max_iter=int(1e6), random_state=42)
en_params = {"alpha": np.logspace(-2, 2, 10), "l1_ratio": [0.5, 0.7, 0.9]}
en_cv = GridSearchCV(en, param_grid=en_params, scoring="r2", cv=inner_cv, n_jobs=-1)

# RandomForest
rf = RandomForestRegressor(random_state=42, max_features=0.2)
rf_grid = {"max_depth": [3, 5, 7, 9, 11]}
rf_cv = GridSearchCV(rf, scoring='r2', param_grid=rf_grid, cv=inner_cv, n_jobs=-1)

# XGBoost
xgb = XGBRegressor(random_state=42, importance_type="gain", objective="reg:squarederror")
xgb_grid = {"max_depth": [3, 6, 9], "reg_alpha": [0, 0.5, 1, 2]}
xgb_cv = GridSearchCV(xgb, scoring='r2', param_grid=xgb_grid, cv=inner_cv, n_jobs=-1)

# CatBoost
cb = CatBoostRegressor(random_state=42)
cb_grid = {"depth": [3, 5, 7], "learning_rate": [0.01, 0.1, 0.2], "l2_leaf_reg": [1, 3, 5]}
cb_cv = GridSearchCV(cb, scoring='r2', param_grid=cb_grid, cv=inner_cv, n_jobs=-1, verbose=0)

# LightGBM
lgb = LGBMRegressor(random_state=42)
lgb_grid = {"max_depth": [4, 6, 8], "learning_rate": [0.01, 0.1], "num_leaves": [31, 63, 127], "reg_alpha": [0, 1], "reg_lambda": [0, 1]}
lgb_cv = GridSearchCV(estimator=lgb, param_grid=lgb_grid, scoring="r2", cv=inner_cv, n_jobs=-1)

# ===============================
# 4. STABL estimators
# ===============================
stabl_lasso = Stabl(
    base_estimator=lasso,
    n_bootstraps=200,
    artificial_type=artificial_type,
    artificial_proportion=1.,
    replace=False,
    fdr_threshold_range=np.arange(0.1, 1, 0.01),
    sample_fraction=0.5,
    random_state=42,
    lambda_grid={"alpha": np.logspace(-2, 2, 10)},
    verbose=1
)

stabl_en = clone(stabl_lasso).set_params(
    base_estimator=en,
    n_bootstraps=100,
    lambda_grid=[{"alpha": np.logspace(-2, 1, 5), "l1_ratio": [0.5, 0.9]}],  # ElasticNet is regularised by alpha, not C
    verbose=1
)

stabl_rf = clone(stabl_lasso).set_params(
    base_estimator=rf,
    n_bootstraps=200,
    lambda_grid=rf_grid,
    verbose=1
)

stabl_xgb = clone(stabl_lasso).set_params(
    base_estimator=xgb,
    n_bootstraps=200,
    lambda_grid=[xgb_grid],
    verbose=1
)

stabl_cb = clone(stabl_lasso).set_params(
    base_estimator=cb,
    n_bootstraps=200,
    lambda_grid=[cb_grid],
    verbose=1
)

stabl_lgb = clone(stabl_lasso).set_params(
    base_estimator=lgb,
    n_bootstraps=200,
    lambda_grid=[lgb_grid],
    verbose=1
)

# ===============================
# 5. Dictionary of estimators for the benchmark
# ===============================
estimators = {
    "lasso": lasso_cv,
    "rf": rf_cv,
    "xgb": xgb_cv,
#   "cb": cb_cv,
#   "lgb": lgb_cv,
    "stabl_lasso": stabl_lasso,
    "stabl_rf": stabl_rf,
    "stabl_xgb": stabl_xgb,
#   "stabl_cb" : stabl_cb,
#   "stabl_lgb": stabl_lgb,
    }

models = [
    "Lasso",
    "RandomForest",
    "XGBoost",
#    "CatBoost",
#    "LightGBM",
    "STABL Lasso",
    "STABL RandomForest",
    "STABL XGBoost",
#    "STABL CatBoost"
#    "STABL LightGBM"
]


# right after building the estimators dict, add these placeholder entries:
estimators["en"]        = estimators["lasso"]        # empty placeholder
estimators["stabl_en"]  = estimators["stabl_lasso"]  # empty placeholder

estimators["cb"]        = estimators["lasso"]        # empty placeholder
estimators["stabl_cb"]  = estimators["stabl_lasso"]  # empty placeholder

estimators["lgb"]        = estimators["lasso"]        # empty placeholder
estimators["stabl_lgb"]  = estimators["stabl_lasso"]  # empty placeholder


# final_classifiers = {"Logit": LogisticRegression(penalty="l1", solver="liblinear", class_weight="balanced", max_iter=int(1e6), random_state=42),
#                      "RandomForest": RandomForestClassifier(n_estimators=500, random_state=42),
#                      "XGBoost": XGBClassifier(use_label_encoder=False, eval_metric="logloss", n_estimators=200,random_state=42),
#                     }


# ===============================
# 6. Load the data (the COVID-19 example is left commented out)
# ===============================
# X_train, X_valid, y_train, y_valid, ids, task_type = data.load_covid_19("data/COVID-19")
X_train, X_val, y_train, y_val, groups, task_type = load_onset_data(features_path, outcome_path)

# ===============================
# 7. Run the multi-omic STABL benchmark
# ===============================
save_path = "./Benchmarks results/Regresssion data Olivier + XGB+ 200 bootstraps/KO"

# Clean up the output folder if it already exists
if os.path.exists(save_path):
    shutil.rmtree(save_path)

print("Run CV on Olivier dataset")
# print(groups.value_counts())

multi_omic_stabl_cv(
    data_dict=X_train,
    y=y_train,
    outer_splitter=outer_cv,
    estimators=estimators,
    task_type=task_type,
    save_path=save_path,
    outer_groups=groups,
    early_fusion=False,
    late_fusion=True,
    n_iter_lf=1000,
    models=models
)



ValueError: No stim detected: check your column/feature names.

In [1]:
# ===============================
# 1. Imports and general configuration
# ===============================
import os
import shutil
import numpy as np
import pandas as pd
from stabl import data
from stabl.multi_omic_pipelines import multi_omic_stabl_cv
from sklearn.model_selection import RepeatedStratifiedKFold, GridSearchCV, RepeatedKFold
from sklearn.linear_model import LogisticRegression
from stabl.stabl import Stabl
from stabl.adaptive import ALogitLasso
from sklearn.base import clone
from xgboost import XGBRegressor
from sklearn.ensemble import RandomForestRegressor
from catboost import CatBoostRegressor
from lightgbm import LGBMRegressor
from sklearn.linear_model import Lasso
from stabl.data import load_onset_data, load_artificial_data
from pathlib import Path
from sklearn.linear_model import ElasticNet

np.random.seed(42)

# ===============================
# 2. Cross-validation splits
# ===============================
# Outer CV for the overall evaluation, inner CV for the hyperparameter search
outer_cv = RepeatedKFold(n_splits=5, n_repeats=3, random_state=42)
inner_cv = RepeatedKFold(n_splits=5, n_repeats=5, random_state=42)

features_path = "/Users/noeamar/Documents/Stanford/datasets/X_linear.csv"
outcome_path = "/Users/noeamar/Documents/Stanford/datasets/y_linear.csv"
# ===============================
# 3. Estimators and hyperparameter grids
# ===============================
artificial_type = "knockoff"  # or "random_permutation"

# Lasso
lasso = Lasso(max_iter=int(1e6), random_state=42)
lasso_cv = GridSearchCV(lasso, param_grid={"alpha": np.logspace(-2, 2, 30)}, scoring="r2", cv=inner_cv, n_jobs=-1)

#ElasticNet
en = ElasticNet(max_iter=int(1e6), random_state=42)
en_params = {"alpha": np.logspace(-2, 2, 10), "l1_ratio": [0.5, 0.7, 0.9]}
en_cv = GridSearchCV(en, param_grid=en_params, scoring="r2", cv=inner_cv, n_jobs=-1)

# RandomForest
rf = RandomForestRegressor(random_state=42, max_features=0.2)
rf_grid = {"max_depth": [3, 5, 7, 9, 11]}
rf_cv = GridSearchCV(rf, scoring='r2', param_grid=rf_grid, cv=inner_cv, n_jobs=-1)

# XGBoost
xgb = XGBRegressor(random_state=42, importance_type="gain", objective="reg:squarederror")
xgb_grid = {"max_depth": [3, 6, 9], "reg_alpha": [0, 0.5, 1, 2]}
xgb_cv = GridSearchCV(xgb, scoring='r2', param_grid=xgb_grid, cv=inner_cv, n_jobs=-1)

# CatBoost
cb = CatBoostRegressor(random_state=42)
cb_grid = {"depth": [3, 5, 7], "learning_rate": [0.01, 0.1, 0.2], "l2_leaf_reg": [1, 3, 5]}
cb_cv = GridSearchCV(cb, scoring='r2', param_grid=cb_grid, cv=inner_cv, n_jobs=-1, verbose=0)

# LightGBM
lgb = LGBMRegressor(random_state=42)
lgb_grid = {"max_depth": [4, 6, 8], "learning_rate": [0.01, 0.1], "num_leaves": [31, 63, 127], "reg_alpha": [0, 1], "reg_lambda": [0, 1]}
lgb_cv = GridSearchCV(estimator=lgb, param_grid=lgb_grid, scoring="r2", cv=inner_cv, n_jobs=-1)

# ===============================
# 4. STABL estimators
# ===============================
stabl_lasso = Stabl(
    base_estimator=lasso,
    n_bootstraps=300,
    artificial_type=artificial_type,
    artificial_proportion=1.,
    replace=False,
    fdr_threshold_range=np.arange(0.1, 1, 0.01),
    sample_fraction=0.5,
    random_state=42,
    lambda_grid={"alpha": np.logspace(-2, 2, 10)},
    verbose=1
)

stabl_en = clone(stabl_lasso).set_params(
    base_estimator=en,
    n_bootstraps=100,
    lambda_grid=[{"alpha": np.logspace(-2, 1, 5), "l1_ratio": [0.5, 0.9]}],  # ElasticNet is regularised by alpha, not C
    verbose=1
)

stabl_rf = clone(stabl_lasso).set_params(
    base_estimator=rf,
    n_bootstraps=300,
    lambda_grid=rf_grid,
    verbose=1
)

stabl_xgb = clone(stabl_lasso).set_params(
    base_estimator=xgb,
    n_bootstraps=300,
    lambda_grid=[xgb_grid],
    verbose=1
)

stabl_cb = clone(stabl_lasso).set_params(
    base_estimator=cb,
    n_bootstraps=300,
    lambda_grid=[cb_grid],
    verbose=1
)

stabl_lgb = clone(stabl_lasso).set_params(
    base_estimator=lgb,
    n_bootstraps=300,
    lambda_grid=[lgb_grid],
    verbose=1
)

# ===============================
# 5. Dictionary of estimators for the benchmark
# ===============================
estimators = {
    "lasso": lasso_cv,
    "rf": rf_cv,
    "xgb": xgb_cv,
#   "cb": cb_cv,
#   "lgb": lgb_cv,
    "stabl_lasso": stabl_lasso,
    "stabl_rf": stabl_rf,
    "stabl_xgb": stabl_xgb,
#   "stabl_cb" : stabl_cb,
#   "stabl_lgb": stabl_lgb,
    }

models = [
    "Lasso",
    "RandomForest",
    "XGBoost",
#    "CatBoost",
#    "LightGBM",
    "STABL Lasso",
    "STABL RandomForest",
    "STABL XGBoost",
#    "STABL CatBoost"
#    "STABL LightGBM"
]


# right after building the estimators dict, add these placeholder entries:
estimators["en"]        = estimators["lasso"]        # empty placeholder
estimators["stabl_en"]  = estimators["stabl_lasso"]  # empty placeholder

estimators["cb"]        = estimators["lasso"]        # empty placeholder
estimators["stabl_cb"]  = estimators["stabl_lasso"]  # empty placeholder

estimators["lgb"]        = estimators["lasso"]        # empty placeholder
estimators["stabl_lgb"]  = estimators["stabl_lasso"]  # empty placeholder


# final_classifiers = {"Logit": LogisticRegression(penalty="l1", solver="liblinear", class_weight="balanced", max_iter=int(1e6), random_state=42),
#                      "RandomForest": RandomForestClassifier(n_estimators=500, random_state=42),
#                      "XGBoost": XGBClassifier(use_label_encoder=False, eval_metric="logloss", n_estimators=200,random_state=42),
#                     }


# ===============================
# 6. Load the data (the COVID-19 example is left commented out)
# ===============================
# X_train, X_valid, y_train, y_valid, ids, task_type = data.load_covid_19("data/COVID-19")
X_train, X_val, y_train, y_val, groups, task_type = load_artificial_data(features_path, outcome_path)

# ===============================
# 7. Run the multi-omic STABL benchmark
# ===============================
save_path = "./Benchmarks results/Regresssion Artificial data + XGB + 200 bootstraps/KO"

# Clean up the output folder if it already exists
if os.path.exists(save_path):
    shutil.rmtree(save_path)

print("Run CV on artificial dataset")
# print(groups.value_counts())

multi_omic_stabl_cv(
    data_dict=X_train,
    y=y_train,
    outer_splitter=outer_cv,
    estimators=estimators,
    task_type=task_type,
    save_path=save_path,
    outer_groups=groups,
    early_fusion=False,
    late_fusion=True,
    n_iter_lf=1000,
    models=models
)



ValueError: The y file contains more rows than X.
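
The error above reports that the outcome file has more rows than the feature file. A possible pre-check, mirroring the index intersection used in the first loading cell, is sketched below on small made-up frames; the helper name `align_on_index` is hypothetical and not part of `stabl`.

```python
import pandas as pd

def align_on_index(X, y):
    """Keep only the samples present in both X and y, in the same order."""
    common = X.index.intersection(y.index)
    dropped_from_y = y.index.difference(common)
    if len(dropped_from_y) > 0:
        print(f"Dropping {len(dropped_from_y)} outcome rows with no matching features")
    return X.loc[common], y.loc[common]

# Toy example: the outcome has one extra sample ("s4") without features
X_demo = pd.DataFrame({"f1": [0.1, 0.2, 0.3]}, index=["s1", "s2", "s3"])
y_demo = pd.Series([1.0, 2.0, 3.0, 4.0], index=["s1", "s2", "s3", "s4"], name="target")

X_aligned, y_aligned = align_on_index(X_demo, y_demo)
print(X_aligned.shape, y_aligned.shape)
```

This assumes both files share a meaningful sample identifier as their index; if they are only aligned by row position, the mismatch has to be resolved at the source instead.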

In [None]:
# ===============================
# 1. Imports and general configuration
# ===============================
import os
import shutil
import numpy as np
import pandas as pd
from stabl import data
from stabl.multi_omic_pipelines import multi_omic_stabl_cv_combined  # new function
from sklearn.model_selection import RepeatedKFold, GridSearchCV
from sklearn.linear_model import LogisticRegression
from stabl.stabl import Stabl
from sklearn.base import clone
from xgboost import XGBRegressor
from sklearn.ensemble import RandomForestRegressor
from catboost import CatBoostRegressor
from lightgbm import LGBMRegressor
from sklearn.linear_model import Lasso, ElasticNet
from stabl.data import load_onset_data
from pathlib import Path

np.random.seed(42)

features_path = "/Users/noeamar/Documents/Stanford/data/olivier_data/ina_13OG_df_168_filtered_allstim_new.csv"
outcome_path = "/Users/noeamar/Documents/Stanford/data/olivier_data/outcome_table_all_pre.csv"

# ===============================
# 2. Cross-validation splits
# ===============================
outer_cv = RepeatedKFold(n_splits=5, n_repeats=3, random_state=42)
inner_cv = RepeatedKFold(n_splits=5, n_repeats=5, random_state=42)

# ===============================
# 3. Estimators and hyperparameter grids
# ===============================
artificial_type = "knockoff"
# base estimators
lasso = Lasso(max_iter=int(1e6), random_state=42)
lasso_cv = GridSearchCV(lasso, {'alpha': np.logspace(-2,2,30)}, scoring='r2', cv=inner_cv, n_jobs=-1)

en = ElasticNet(max_iter=int(1e6), random_state=42)
en_cv = GridSearchCV(en, {'alpha': np.logspace(-2,2,10), 'l1_ratio':[0.5,0.7,0.9]}, scoring='r2', cv=inner_cv, n_jobs=-1)

rf = RandomForestRegressor(random_state=42, max_features=0.2)
rf_cv = GridSearchCV(rf, {'max_depth':[3,5,7,9,11]}, scoring='r2', cv=inner_cv, n_jobs=-1)

xgb = XGBRegressor(random_state=42, importance_type='gain', objective='reg:squarederror')
xgb_cv = GridSearchCV(xgb, {'max_depth':[3,6,9], 'reg_alpha':[0,0.5,1,2]}, scoring='r2', cv=inner_cv, n_jobs=-1)

cb = CatBoostRegressor(random_state=42)
cb_cv = GridSearchCV(cb, {'depth':[3,5,7], 'learning_rate':[0.01,0.1,0.2], 'l2_leaf_reg':[1,3,5]}, scoring='r2', cv=inner_cv, n_jobs=-1, verbose=0)

lgb = LGBMRegressor(random_state=42)
lgb_cv = GridSearchCV(lgb, {'max_depth':[4,6,8], 'learning_rate':[0.01,0.1], 'num_leaves':[31,63,127], 'reg_alpha':[0,1], 'reg_lambda':[0,1]}, scoring='r2', cv=inner_cv, n_jobs=-1)

# ===============================
# 4. STABL estimators
# ===============================
stabl_lasso = Stabl(base_estimator=lasso, n_bootstraps=100, artificial_type=artificial_type,
                    artificial_proportion=1., replace=False,
                    fdr_threshold_range=np.arange(0.1,1,0.01), sample_fraction=0.5,
                    random_state=42, lambda_grid={'alpha':np.logspace(-2,2,10)}, verbose=1)
stabl_rf = clone(stabl_lasso).set_params(base_estimator=rf, lambda_grid={'max_depth':[3,5,7,9,11]})
stabl_xgb = clone(stabl_lasso).set_params(base_estimator=xgb, lambda_grid=[{'max_depth':[3,6,9], 'reg_alpha':[0,0.5,1,2]}])
stabl_cb  = clone(stabl_lasso).set_params(base_estimator=cb, lambda_grid=[{'depth':[3,5,7], 'learning_rate':[0.01,0.1,0.2], 'l2_leaf_reg':[1,3,5]}])
stabl_lgb = clone(stabl_lasso).set_params(base_estimator=lgb, lambda_grid=[{'max_depth':[4,6,8], 'learning_rate':[0.01,0.1], 'num_leaves':[31,63,127], 'reg_alpha':[0,1], 'reg_lambda':[0,1]}])

# ===============================
# 5. Estimator dictionary for the benchmark
# ===============================
estimators = {
#    'lasso': lasso_cv,
#    'rf':   rf_cv,
    'xgb':  xgb_cv,
#    'cb':   cb_cv,
#    'lgb':  lgb_cv,
#    'stabl_lasso': stabl_lasso,
#    'stabl_rf':    stabl_rf,
    'stabl_xgb':   stabl_xgb,
#    'stabl_cb':    stabl_cb,
#    'stabl_lgb':   stabl_lgb
}
models = [
#    'Lasso',
#    'RandomForest',
    'XGBoost',
#    'CatBoost',
#    'LightGBM',
#    'STABL Lasso',
#    'STABL RandomForest',
    'STABL XGBoost',
#    'STABL CatBoost',
#    'STABL LightGBM',
    'STABL_Combined'
]

# ===============================
# 6. Load the data
# ===============================
X_train, X_val, y_train, y_val, groups, task_type = load_onset_data(features_path, outcome_path)

# ===============================
# 7. Launch the combined multi-omic STABL benchmark
# ===============================
save_path = './Benchmarks results/Combined Selection/KO'
if os.path.exists(save_path): shutil.rmtree(save_path)

# Right after building the estimators dict, register placeholder entries.
# Each unused key aliases the XGBoost estimator (it is not an empty object).
for name in ("en", "cb", "lgb", "rf", "lasso"):
    estimators[name] = estimators["xgb"]
    estimators[f"stabl_{name}"] = estimators["stabl_xgb"]

multi_omic_stabl_cv_combined(
    data_dict     = X_train,
    y             = y_train,
    outer_splitter= outer_cv,
    estimators    = estimators,
    task_type     = task_type,
    save_path     = save_path,
    outer_groups  = groups,
    early_fusion  = False,
    late_fusion   = False,
    n_iter_lf     = 1000,
    vote_weights  = [0.4, 0.3, 0.3],  # weights for Lasso, RF, XGB
    fdr_alpha     = 0.3,
    models        = models
)

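The internals of `multi_omic_stabl_cv_combined` are not shown in this notebook, so the following is only an illustration of what weights such as `[0.4, 0.3, 0.3]` could mean: a weighted vote over per-model feature selections. The function `weighted_vote`, the 0/1 selection matrix, and the 0.5 threshold are all assumptions made for this sketch, not the function's actual behavior.

```python
import numpy as np

def weighted_vote(selections, weights, threshold=0.5):
    """Return a boolean mask of features whose weighted vote reaches `threshold`.

    selections: (n_models, n_features) array of 0/1 selection indicators.
    weights:    one non-negative weight per model (normalized to sum to 1).
    """
    selections = np.asarray(selections, dtype=float)
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()
    scores = weights @ selections  # weighted share of models selecting each feature
    return scores >= threshold

# Hypothetical selections for four features
selections = [
    [1, 1, 0, 0],  # Lasso
    [1, 0, 1, 0],  # Random Forest
    [1, 0, 1, 0],  # XGBoost
]
mask = weighted_vote(selections, [0.4, 0.3, 0.3])
print(mask)
```

Under these assumptions, a feature picked only by the highest-weighted model (score 0.4) is dropped, while one picked by the two lower-weighted models (score 0.6) is kept.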

In [None]:
# ===============================
# 0. Imports, general configuration & parallelism
# ===============================
import os
import shutil
from pathlib import Path

# Explicit limit: a single BLAS/OpenMP thread per process
# (avoids nested parallelism when Joblib already spawns several processes).
# These variables are typically read when the numerical libraries load, so they
# are set before importing NumPy; if NumPy was already imported in this kernel,
# restart it for the limits to apply.
os.environ["OMP_NUM_THREADS"] = "1"
os.environ["OPENBLAS_NUM_THREADS"] = "1"
os.environ["MKL_NUM_THREADS"] = "1"
os.environ["VECLIB_MAXIMUM_THREADS"] = "1"
os.environ["NUMEXPR_NUM_THREADS"] = "1"

import numpy as np
import pandas as pd

# STABL & data utils
from stabl.multi_omic_pipelines import multi_omic_stabl_cv
from stabl.stabl import Stabl
from stabl.adaptive import ALogitLasso
from stabl.data import load_onset_data

# Scikit-learn models & helpers
from sklearn.model_selection import RepeatedKFold, GridSearchCV
from sklearn.linear_model import Lasso, ElasticNet
from sklearn.ensemble import RandomForestRegressor
from sklearn.base import clone

# Gradient-boosting libs
from xgboost import XGBRegressor
# from catboost import CatBoostRegressor
# from lightgbm import LGBMRegressor

np.random.seed(42)

# ===============================
# 1. Cross-validation splitters
# ===============================
outer_cv = RepeatedKFold(n_splits=5, n_repeats=3, random_state=42)
inner_cv = RepeatedKFold(n_splits=5, n_repeats=5, random_state=42)

# ===============================
# 2. Estimators & hyperparameter grids
# ===============================
artificial_type = "knockoff"  # or "random_permutation"

# -- Lasso
lasso      = Lasso(max_iter=int(1e6), random_state=42)
lasso_grid = {"alpha": np.logspace(-2, 2, 30)}
lasso_cv   = GridSearchCV(lasso, param_grid=lasso_grid, scoring="r2", cv=inner_cv, n_jobs=-1)

# -- ElasticNet
en        = ElasticNet(max_iter=int(1e6), random_state=42)
en_grid   = {"alpha": np.logspace(-2, 2, 10), "l1_ratio": [0.5, 0.7, 0.9]}
en_cv     = GridSearchCV(en, param_grid=en_grid, scoring="r2", cv=inner_cv, n_jobs=-1)

# -- RandomForest | a single thread per Joblib process
rf      = RandomForestRegressor(random_state=42, max_features=0.2, n_jobs=1)
rf_grid = {"max_depth": [3, 5, 7, 9, 11]}
rf_cv   = GridSearchCV(rf, param_grid=rf_grid, scoring="r2", cv=inner_cv, n_jobs=-1)

# -- XGBoost | a single internal thread to avoid nested parallelism
xgb      = XGBRegressor(
    random_state=42,
    objective="reg:squarederror",
    importance_type="gain",
    tree_method="hist",
    n_jobs=1,           # key setting: one thread per model (scikit-learn API name for nthread)
)
xgb_grid = {"max_depth": [3, 6, 9], "reg_alpha": [0, 0.5, 1, 2]}
xgb_cv   = GridSearchCV(xgb, param_grid=xgb_grid, scoring="r2", cv=inner_cv, n_jobs=-1)

# (CatBoost / LightGBM disabled for now)

# ===============================
# 3. STABL wrappers (n_jobs=1 is meant to avoid a third level of parallelism;
#    note that the kwargs below do not actually set n_jobs)
# ===============================
stabl_base_kwargs = dict(
    n_bootstraps=100,
    artificial_type=artificial_type,
    artificial_proportion=1.0,
    replace=False,
    fdr_threshold_range=np.arange(0.1, 1, 0.01),
    sample_fraction=0.5,
    random_state=42,
    verbose=1, 
)

stabl_lasso = Stabl(base_estimator=lasso,  lambda_grid={"alpha": np.logspace(-2, 2, 10)}, **stabl_base_kwargs)

stabl_en = clone(stabl_lasso).set_params(
    base_estimator=en,
    lambda_grid=en_grid
)

stabl_rf = clone(stabl_lasso).set_params(
    base_estimator=rf,
    lambda_grid=rf_grid
)

stabl_xgb = clone(stabl_lasso).set_params(
    base_estimator=xgb,
    lambda_grid=xgb_grid
)

# ===============================
# 4. Estimator dictionary
# ===============================
estimators = {
    "lasso":       lasso_cv,
#    "en":          en_cv,
    "rf":          rf_cv,
    "xgb":         xgb_cv,
    "stabl_lasso": stabl_lasso,
#    "stabl_en":    stabl_en,
    "stabl_rf":    stabl_rf,
    "stabl_xgb":   stabl_xgb,
}

models = [
    "Lasso",
#    "ElasticNet",
    "RandomForest",
    "XGBoost",
    "STABL Lasso",
#    "STABL ElasticNet",
    "STABL RandomForest",
    "STABL XGBoost",
]

# Right after building the estimators dict, register placeholder entries
# (aliases of the XGBoost estimators, not empty objects):
for name in ("en", "cb", "lgb"):
    estimators[name] = estimators["xgb"]
    estimators[f"stabl_{name}"] = estimators["stabl_xgb"]

# ===============================
# 5. Load the data
# ===============================
features_path = "/Users/noeamar/Documents/Stanford/data/olivier_data/ina_13OG_df_168_filtered_allstim_new.csv"  # adjust to your setup
outcome_path  = "/Users/noeamar/Documents/Stanford/data/olivier_data/outcome_table_all_pre.csv"   # adjust to your setup

X_train, X_val, y_train, y_val, groups, task_type = load_onset_data(features_path, outcome_path)

# ===============================
# 6. Launch the benchmark
# ===============================
save_path = Path("./benchmark_results_time/KO")
if save_path.exists():
    shutil.rmtree(save_path)

print("\n🚀 Launching the STABL benchmark…")

multi_omic_stabl_cv(
    data_dict=X_train,
    y=y_train,
    outer_splitter=outer_cv,
    estimators=estimators,
    task_type=task_type,
    save_path=str(save_path),
    outer_groups=groups,
    early_fusion=False,
    late_fusion=True,
    n_iter_lf=1000,
    models=models,
)

print("\n✅ Benchmark finished. Results in", save_path)

# Runtime: roughly 5 h 30 min

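The runtime of roughly 5 h 30 noted above is easier to reason about with a count of model fits. The sketch below counts fits for the three `GridSearchCV` baselines only, assuming the pipeline fits each of them once per outer fold and that `GridSearchCV` refits the best model (its default). The helper names are hypothetical, and the STABL bootstraps and late fusion are not counted.

```python
from math import prod

def grid_size(param_grid):
    """Number of hyperparameter combinations in a dict-style grid."""
    return prod(len(values) for values in param_grid.values())

def nested_cv_fits(param_grid, inner_folds, outer_folds, refit=True):
    """Model fits for one GridSearchCV estimator evaluated inside an outer CV loop."""
    per_outer_fold = grid_size(param_grid) * inner_folds + (1 if refit else 0)
    return outer_folds * per_outer_fold

# Splitters as configured in the cell above
inner_folds = 5 * 5  # RepeatedKFold(n_splits=5, n_repeats=5)
outer_folds = 5 * 3  # RepeatedKFold(n_splits=5, n_repeats=3)

# Grids as configured above; for Lasso only the number of alpha values matters
grids = {
    "lasso": {"alpha": list(range(30))},
    "rf": {"max_depth": [3, 5, 7, 9, 11]},
    "xgb": {"max_depth": [3, 6, 9], "reg_alpha": [0, 0.5, 1, 2]},
}

fit_counts = {name: nested_cv_fits(grid, inner_folds, outer_folds)
              for name, grid in grids.items()}
print(fit_counts)
```

Under these assumptions the XGBoost baseline alone accounts for 4515 fits, so reducing `n_repeats` in `inner_cv` or coarsening a grid shrinks the count proportionally.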