## ⚠️ Important Configuration Note

**NIRS4ALL Pipeline Configuration Format:**

All configurations in this notebook use the correct NIRS4ALL pipeline format:

```python
config = {
    "pipeline": [
        # CV splitter (if needed)
        ShuffleSplit(n_splits=3, test_size=0.25),
        
        # Model configuration
        {
            "model": PLSRegression(),
            "finetune_params": {
                "cv_mode": "per_fold",
                "param_strategy": "global_average",
                "use_full_train_for_final": True,  # Optional
                "n_trials": 10,
                "model_params": {
                    "n_components": ("int", 1, 15)
                }
            }
        }
    ]
}
```

Key points:
- Use `"pipeline"` as the top-level key (not `"name"` or `"steps"`).
- The pipeline is a **list** of transformers, splitters, and models.
- Each model configuration is a **dictionary** within the pipeline list.
- `finetune_params` contains all optimization settings.
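
The key points above can be checked mechanically before a run. The sketch below is a hypothetical helper (`check_pipeline_config` is not part of NIRS4ALL) that only encodes the structure described in this note. Placeholder strings stand in for the scikit-learn objects so the snippet runs without dependencies.

```python
def check_pipeline_config(config):
    """Return a list of problems found in a config dict (empty list = looks fine).

    Hypothetical helper for illustration only; it is not NIRS4ALL's own validation.
    """
    if "pipeline" not in config:
        return ['missing top-level "pipeline" key']
    steps = config["pipeline"]
    if not isinstance(steps, list):
        return ['"pipeline" must be a list']

    problems = []
    for position, step in enumerate(steps):
        if not isinstance(step, dict):
            continue  # splitters and transformers are passed as plain objects
        if "finetune_params" in step:
            if "model" not in step:
                problems.append(f'step {position}: "finetune_params" without "model"')
            if not isinstance(step["finetune_params"], dict):
                problems.append(f'step {position}: "finetune_params" must be a dict')
    return problems


good = {
    "pipeline": [
        "splitter-placeholder",
        {
            "model": "model-placeholder",
            "finetune_params": {
                "n_trials": 10,
                "model_params": {"n_components": ("int", 1, 15)},
            },
        },
    ]
}
bad = {"steps": []}

print(check_pipeline_config(good))  # []
print(check_pipeline_config(bad))   # ['missing top-level "pipeline" key']
```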

# NIRS4ALL Finetuning Strategies Demo

This notebook demonstrates the finetuning strategies and cross-validation modes available in NIRS4ALL.

**Features Demonstrated:**
- Cross-validation modes: `simple`, `per_fold`, `nested`
- Parameter strategies: `per_fold_best`, `global_best`, `global_average`
- Full training option: `use_full_train_for_final`
- Different model types and parameter spaces

All examples use small synthetic datasets for fast execution.
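
To make the parameter-space notation concrete: every search space in this notebook is written as `("int", low, high)`. The sketch below shows one way such a spec could be turned into random-search draws. It assumes both bounds are inclusive, and `sample_params` is a hypothetical name, not the function NIRS4ALL uses internally.

```python
import random


def sample_params(space, rng):
    """Draw one parameter set from a {name: ("int", low, high)} search space.

    Only the "int" spec used in this notebook is handled; other kinds are
    rejected rather than guessed.
    """
    params = {}
    for name, (kind, low, high) in space.items():
        if kind != "int":
            raise ValueError(f"unsupported spec kind: {kind!r}")
        params[name] = rng.randint(low, high)  # inclusive on both ends
    return params


rng = random.Random(0)
space = {"n_components": ("int", 1, 15)}
trials = [sample_params(space, rng) for _ in range(10)]
print(trials)
```

Random search can draw the same value more than once, so a small `n_trials` over a narrow integer range will often repeat a configuration.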

In [1]:
# Setup and imports
import numpy as np
import pandas as pd
import time
from sklearn.cross_decomposition import PLSRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import ShuffleSplit

from nirs4all.pipeline.runner import PipelineRunner
from nirs4all.pipeline.config import PipelineConfigs

# Generate synthetic dataset for demonstrations
np.random.seed(42)

def create_demo_dataset():
    """Create small synthetic dataset for fast demos."""
    n_samples = 80
    n_features = 30

    # Create synthetic spectra
    X = np.random.randn(n_samples, n_features)

    # Add spectral-like structure (the same baseline is broadcast to every row)
    wavelengths = np.linspace(1000, 2000, n_features)
    X += 0.3 * np.sin(wavelengths / 150) + 0.2 * np.cos(wavelengths / 200)

    # Create target with relationship to spectra
    y = (np.sum(X[:, 5:15], axis=1) +
         0.5 * np.sum(X[:, 20:25], axis=1) +
         0.3 * np.random.randn(n_samples))

    return {
        'X': X,
        'y': y,
        'folds': 3,
        'train': 0.7,
        'val': 0.15,
        'test': 0.15,
        'random_state': 42
    }

demo_data_config = create_demo_dataset()
print(f"📊 Demo dataset: {demo_data_config['X'].shape[0]} samples, {demo_data_config['X'].shape[1]} features")
print(f"🎯 Target range: {demo_data_config['y'].min():.2f} to {demo_data_config['y'].max():.2f}")

📊 Demo dataset: 80 samples, 30 features
🎯 Target range: -4.45 to 10.94


## 1. Cross-Validation Modes

NIRS4ALL supports three CV modes (`simple`, `per_fold`, and `nested`), listed here in increasing order of rigor and computational cost.
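
A rough way to reason about the cost difference is to count model fits. The counting rule below is an assumption for illustration, not something read from NIRS4ALL's source: one fit per trial in `simple` mode, one search per fold in `per_fold` mode, one inner-CV search per outer fold in `nested` mode, plus one final fit per fold in every mode.

```python
def estimated_fits(cv_mode, n_folds, n_trials, inner_cv=1):
    """Estimate the number of model fits under the assumed counting rule."""
    if cv_mode == "simple":
        return n_trials + n_folds                   # one search, then one model per fold
    if cv_mode == "per_fold":
        return n_folds * (n_trials + 1)             # one search and one final fit per fold
    if cv_mode == "nested":
        return n_folds * (n_trials * inner_cv + 1)  # each trial is scored on inner_cv splits
    raise ValueError(f"unknown cv_mode: {cv_mode!r}")


# Same budget for every mode: 3 folds, 10 trials, 3 inner splits.
for mode in ("simple", "per_fold", "nested"):
    print(mode, estimated_fits(mode, n_folds=3, n_trials=10, inner_cv=3))
```

Under these assumptions the counts are 13, 33, and 93 fits, which is roughly 2.5 times and 7 times the cost of `simple` mode.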

In [2]:
# 1. SIMPLE CV: Fastest, least rigorous
print("🚀 1. SIMPLE CV - Optimize on combined data, train on folds")

simple_config = {
    "pipeline": [
        ShuffleSplit(n_splits=3, test_size=.25),  # Create CV folds
        {
            "model": PLSRegression(),
            "finetune_params": {
                "cv_mode": "simple",  # 🎯 SIMPLE MODE
                "param_strategy": "per_fold_best",
                "n_trials": 5,
                "verbose": 1,
                "model_params": {
                    "n_components": ("int", 1, 8)
                },
                "train_params": {"verbose": 0}
            }
        }
    ]
}

from nirs4all.dataset.loader import create_synthetic_dataset
data = create_synthetic_dataset(demo_data_config)

start = time.time()
config = PipelineConfigs(simple_config, "simple_demo")
runner = PipelineRunner()
result, _, _ = runner.run(config, data)
elapsed = time.time() - start

print(f"✅ Simple CV completed in {elapsed:.1f}s")
print(f"📊 Generated {len(result._predictions)} prediction sets")
if result._predictions:
    print(f"🔑 Prediction keys: {result._predictions.list_keys()}")

🚀 1. SIMPLE CV - Optimize on combined data, train on folds
🚀 Starting pipeline config_simple_demo_1206ac on dataset synthetic_test_dataset
--------------------------------------------------------------------------------
🔄 Running 2 steps in sequential mode
🔷 Step 1: ShuffleSplit(n_splits=3, test_size=0.25)
🔹 Executing controller CrossValidatorController with operator ShuffleSplit
Generated 3 folds.
💾 Saved 1_folds_ShuffleSplit.csv to results\synthetic_test_dataset\config_simple_demo_1206ac\1_folds_ShuffleSplit.csv
--------------------------------------------------------------------------------
Update: 📊 Dataset: synthetic_test_dataset
Features (samples=80, sources=1):
- Source 0: (80, 1, 30), processings=['raw'], min=-

[I 2025-09-26 15:06:53,557] A new study created in memory with name: no-name-32f12f4a-4da6-4e00-99b6-245e75db5b4c


🔍 Optimizing 1 parameters with random search (5 trials)...
(126, 30) (126,) (14, 30) (14, 1)


[I 2025-09-26 15:06:53,563] Trial 0 finished with value: 2.831748843780041 and parameters: {'n_components': 1}. Best is trial 0 with value: 2.831748843780041.
[I 2025-09-26 15:06:53,568] Trial 1 finished with value: 0.13372231918262292 and parameters: {'n_components': 5}. Best is trial 1 with value: 0.13372231918262292.
[I 2025-09-26 15:06:53,575] Trial 2 finished with value: 0.04757569019788793 and parameters: {'n_components': 8}. Best is trial 2 with value: 0.04757569019788793.
[I 2025-09-26 15:06:53,580] Trial 3 finished with value: 2.831748843780041 and parameters: {'n_components': 1}. Best is trial 2 with value: 0.04757569019788793.
[I 2025-09-26 15:06:53,586] Trial 4 finished with value: 0.09074267040675044 and parameters: {'n_components': 6}. Best is trial 2 with value: 0.04757569019788793.


🏆 Best parameters found: {'n_components': 8}
🔄 Training 3 fold models with best parameters...
(42, 30) (42,) (14, 30) (14, 1)
(42, 30) (42,) (14, 30) (14, 1)
(42, 30) (42,) (14, 30) (14, 1)
✅ Simple CV completed successfully
💾 Saved 2_finetuned_PLSRegression_1.pkl to results\synthetic_test_dataset\config_simple_demo_1206ac\2_finetuned_PLSRegression_1.pkl
💾 Saved 2_predictions_finetuned_2.csv to results\synthetic_test_dataset\config_simple_demo_1206ac\2_predictions_finetuned_2.csv
💾 Saved 2_trained_PLSRegression_3_simple_cv_fold1.pkl to results\synthetic_test_dataset\config_simple_demo_1206ac\2_trained_PLSRegression_3_simple_cv_fold1.pkl
💾 Saved 2_predictions_trained_4_simple_cv_fold1.csv to results\synthetic_test_dataset\config_simple_demo_1206ac\2_predictions_trained_4_simple_cv_fold1.csv
💾 Saved 2_trained_PLSRegression_5_simple_cv_fold2.pkl to results\synthetic_test_dataset\config_simple_demo_1206ac\2_trained_PLSRegression_5_simple_cv_fold2.pkl
💾 Saved 2_predictions_trained_6_simple_
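
The shapes printed in the log above, `(126, 30)` during optimization and `(42, 30)` for each fold model, are consistent with simple mode concatenating the training rows of all three folds (3 × 42 = 126). The NumPy sketch below reproduces that arithmetic. It assumes 56 training rows (42 + 14) and hand-rolled shuffle splits, and it illustrates the pooling idea only, not NIRS4ALL's implementation.

```python
import numpy as np

rng = np.random.default_rng(42)
n_rows, n_features, n_folds, n_val = 56, 30, 3, 14  # 56 = 42 + 14, from the log shapes

X = rng.normal(size=(n_rows, n_features))
y = rng.normal(size=n_rows)

train_idx, val_idx = [], []
for _ in range(n_folds):
    order = rng.permutation(n_rows)
    val_idx.append(order[:n_val])    # 14 held-out rows for this fold
    train_idx.append(order[n_val:])  # 42 training rows for this fold

# Pool the training rows of every fold into one optimization set.
pooled = np.concatenate(train_idx)
X_pooled, y_pooled = X[pooled], y[pooled]

print(X_pooled.shape, y_pooled.shape)  # (126, 30) (126,)
print("distinct rows in the pooled set:", np.unique(pooled).size)
print("fold-1 validation rows also in the pooled set:",
      np.intersect1d(val_idx[0], pooled).size)
```

Because shuffle splits overlap, the pooled set repeats rows and can contain rows that another fold holds out. If the library pools folds this way, that would help explain why simple mode is the least rigorous option, although the truncated log does not show which rows are actually used for validation.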

In [3]:
# 2. PER-FOLD CV: Standard approach, good balance
print("\n⚖️ 2. PER-FOLD CV - Optimize on each fold separately")

per_fold_config = {
    "pipeline": [
        ShuffleSplit(n_splits=3, test_size=.25),  # Create CV folds
        {
            "model": PLSRegression(),
            "finetune_params": {
                "cv_mode": "per_fold",  # 🎯 PER-FOLD MODE
                "param_strategy": "per_fold_best",
                "n_trials": 4,  # Fewer trials since it runs on each fold
                "verbose": 1,
                "model_params": {
                    "n_components": ("int", 1, 8)
                },
                "train_params": {"verbose": 0}
            }
        }
    ]
}

start = time.time()
config = PipelineConfigs(per_fold_config, "per_fold_demo")
result, _, _ = runner.run(config, data)
elapsed = time.time() - start

print(f"✅ Per-fold CV completed in {elapsed:.1f}s")
print(f"📊 Generated {len(result._predictions)} prediction sets")
if result._predictions:
    print(f"🔑 Prediction keys: {result._predictions.list_keys()}")

[I 2025-09-26 15:06:53,622] A new study created in memory with name: no-name-bd9baffc-0e32-401a-bc8e-8f624b8dd87a
[I 2025-09-26 15:06:53,628] Trial 0 finished with value: 4.32007458039583 and parameters: {'n_components': 3}. Best is trial 0 with value: 4.32007458039583.
[I 2025-09-26 15:06:53,634] Trial 1 finished with value: 2.9699056444356553 and parameters: {'n_components': 5}. Best is trial 1 with value: 2.9699056444356553.
[I 2025-09-26 15:06:53,640] Trial 2 finished with value: 3.51816225848337 and parameters: {'n_components': 4}. Best is trial 1 with value: 2.9699056444356553.
[I 2025-09-26 15:06:53,647] Trial 3 finished with value: 2.5745211957465703 and parameters: {'n_components': 6}. Best is trial 3 with value: 2.5745211957465703.
[I 2025-09-26 15:06:53,648] A new study created in memory with name: no-name-efc94447-b19c-4d32-9741-60e03c385621
[I 2025-09-26 15:06:53,653] Trial 0 finished with value: 3.2403212969365516 and parameters: {'n_components': 5}. Best is trial 0 with 


⚖️ 2. PER-FOLD CV - Optimize on each fold separately
🚀 Starting pipeline config_per_fold_demo_4c61a9 on dataset synthetic_test_dataset
--------------------------------------------------------------------------------
🔄 Running 2 steps in sequential mode
🔷 Step 3: ShuffleSplit(n_splits=3, test_size=0.25)
🔹 Executing controller CrossValidatorController with operator ShuffleSplit
Generated 3 folds.
💾 Saved 3_folds_ShuffleSplit.csv to results\synthetic_test_dataset\config_per_fold_demo_4c61a9\3_folds_ShuffleSplit.csv
--------------------------------------------------------------------------------
🔷 Step 4: (finetune) PLSRegression()
🔹 Executing controller SklearnModelController with operator PLSRegression
🔍 Per-fold CV

In [4]:
# 3. NESTED CV: Most rigorous, highest computational cost
print("\n🔬 3. NESTED CV - Inner CV for optimization, outer CV for evaluation")

nested_config = {
    "pipeline": [
        ShuffleSplit(n_splits=3, test_size=.25),  # Create CV folds
        {
            "model": PLSRegression(),
            "finetune_params": {
                "cv_mode": "nested",  # 🎯 NESTED MODE
                "inner_cv": 2,  # Small for demo
                "param_strategy": "per_fold_best",
                "n_trials": 3,  # Very small for demo
                "verbose": 1,
                "model_params": {
                    "n_components": ("int", 1, 6)
                },
                "train_params": {"verbose": 0}
            }
        }
    ]
}

start = time.time()
config = PipelineConfigs(nested_config, "nested_demo")
result, _, _ = runner.run(config, data)
elapsed = time.time() - start

print(f"✅ Nested CV completed in {elapsed:.1f}s")
print(f"📊 Generated {len(result._predictions)} prediction sets")
if result._predictions:
    print(f"🔑 Prediction keys: {result._predictions.list_keys()}")

# Rough relative costs; these lines are fixed text, not timings measured above.
print("\n⏱️ CV Mode Comparison (rough relative cost):")
print("  Simple CV:    Fastest")
print("  Per-fold CV:  roughly 3x slower")
print("  Nested CV:    roughly 5-10x slower")

[I 2025-09-26 15:06:53,732] A new study created in memory with name: no-name-47ba55bd-6a50-425a-a78b-906938284dba



🔬 3. NESTED CV - Inner CV for optimization, outer CV for evaluation
🚀 Starting pipeline config_nested_demo_d52524 on dataset synthetic_test_dataset
--------------------------------------------------------------------------------
🔄 Running 2 steps in sequential mode
🔷 Step 5: ShuffleSplit(n_splits=3, test_size=0.25)
🔹 Executing controller CrossValidatorController with operator ShuffleSplit
Generated 3 folds.
💾 Saved 5_folds_ShuffleSplit.csv to results\synthetic_test_dataset\config_nested_demo_d52524\5_folds_ShuffleSplit.csv
--------------------------------------------------------------------------------
🔷 Step 6: (finetune) PLSRegression()
🔹 Executing controller SklearnModelController with operator PLSRegression
🔍 

[I 2025-09-26 15:06:53,743] Trial 0 finished with value: 6.590670929381243 and parameters: {'n_components': 4}. Best is trial 0 with value: 6.590670929381243.
[I 2025-09-26 15:06:53,751] Trial 1 finished with value: 6.590670929381243 and parameters: {'n_components': 4}. Best is trial 0 with value: 6.590670929381243.
[I 2025-09-26 15:06:53,761] Trial 2 finished with value: 6.378044723814769 and parameters: {'n_components': 6}. Best is trial 2 with value: 6.378044723814769.


🏋️ Outer fold 2/3...

[I 2025-09-26 15:06:53,767] A new study created in memory with name: no-name-d880af6f-d287-4463-8533-a577757c6b59
[I 2025-09-26 15:06:53,775] Trial 0 finished with value: 6.667593489768158 and parameters: {'n_components': 4}. Best is trial 0 with value: 6.667593489768158.
[I 2025-09-26 15:06:53,785] Trial 1 finished with value: 6.667593489768158 and parameters: {'n_components': 4}. Best is trial 0 with value: 6.667593489768158.
[I 2025-09-26 15:06:53,793] Trial 2 finished with value: 6.695758060721871 and parameters: {'n_components': 5}. Best is trial 0 with value: 6.667593489768158.



🏋️ Outer fold 3/3...


[I 2025-09-26 15:06:53,799] A new study created in memory with name: no-name-29aa942b-778d-4942-a6d5-b17451162c8c
[I 2025-09-26 15:06:53,809] Trial 0 finished with value: 8.76296574269475 and parameters: {'n_components': 3}. Best is trial 0 with value: 8.76296574269475.
[I 2025-09-26 15:06:53,817] Trial 1 finished with value: 8.76296574269475 and parameters: {'n_components': 3}. Best is trial 0 with value: 8.76296574269475.
[I 2025-09-26 15:06:53,827] Trial 2 finished with value: 8.55846215593177 and parameters: {'n_components': 6}. Best is trial 2 with value: 8.55846215593177.


✅ Nested CV completed successfully
💾 Saved 6_nested_cv_outer_fold1_PLSRegression_15.pkl to results\synthetic_test_dataset\config_nested_demo_d52524\6_nested_cv_outer_fold1_PLSRegression_15.pkl
💾 Saved 6_predictions_nested_cv_outer_fold1_16.csv to results\synthetic_test_dataset\config_nested_demo_d52524\6_predictions_nested_cv_outer_fold1_16.csv
💾 Saved 6_nested_cv_outer_fold2_PLSRegression_17.pkl to results\synthetic_test_dataset\config_nested_demo_d52524\6_nested_cv_outer_fold2_PLSRegression_17.pkl
💾 Saved 6_predictions_nested_cv_outer_fold2_18.csv to results\synthetic_test_dataset\config_nested_demo_d52524\6_predictions_nested_cv_outer_fold2_18.csv
💾 Saved 6_nested_cv_outer_fold3_PLSRegression_19.pkl to results\synthetic_test_dataset\config_nested_demo_d52524\6_nested_cv_outer_fold3_PLSRegression_19.pkl
💾 Saved 6_predictions_nested_cv_outer_fold3_20.csv to results\synthetic_test_dataset\config_nested_demo_d52524\6_predictions_nested_cv_outer_fold3_20.csv
-----------------------------
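
To show what "inner CV for optimization, outer CV for evaluation" means in practice, here is a minimal nested-CV sketch in plain NumPy. It swaps in closed-form ridge regression for PLS and a small grid for random search so that it has no dependencies beyond NumPy. The helper names are hypothetical, and this is not NIRS4ALL's implementation.

```python
import numpy as np


def ridge_fit(X, y, alpha):
    """Closed-form ridge regression without intercept."""
    return np.linalg.solve(X.T @ X + alpha * np.eye(X.shape[1]), X.T @ y)


def mse(X, y, coef):
    return float(np.mean((X @ coef - y) ** 2))


def shuffle_splits(n_samples, n_splits, test_fraction, rng):
    """Return (train_idx, test_idx) pairs, similar in spirit to ShuffleSplit."""
    n_test = int(np.ceil(n_samples * test_fraction))
    splits = []
    for _ in range(n_splits):
        order = rng.permutation(n_samples)
        splits.append((order[n_test:], order[:n_test]))
    return splits


rng = np.random.default_rng(0)
X = rng.normal(size=(80, 10))
y = X[:, :3].sum(axis=1) + 0.3 * rng.normal(size=80)
alphas = [0.01, 1.0, 100.0]  # candidate hyperparameters

outer_scores = []
for train_idx, test_idx in shuffle_splits(len(X), 3, 0.25, rng):
    X_tr, y_tr = X[train_idx], y[train_idx]

    # Inner loop: choose alpha using only the outer-training rows.
    inner_splits = shuffle_splits(len(X_tr), 2, 0.25, rng)
    inner_error = [
        np.mean([mse(X_tr[va], y_tr[va], ridge_fit(X_tr[tr], y_tr[tr], alpha))
                 for tr, va in inner_splits])
        for alpha in alphas
    ]
    best_alpha = alphas[int(np.argmin(inner_error))]

    # Outer loop: refit on all outer-training rows, score on untouched rows.
    coef = ridge_fit(X_tr, y_tr, best_alpha)
    outer_scores.append(mse(X[test_idx], y[test_idx], coef))

print("outer-fold MSE:", np.round(outer_scores, 3))
```

The outer test rows never influence the choice of `best_alpha`, which is what makes the outer scores a less optimistic estimate than scores taken from the search itself.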

## 2. Parameter Strategies

The `param_strategy` setting controls how optimized parameters are aggregated across cross-validation folds.
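
The three strategies can give different answers on the same scores. The sketch below applies each one to a small table of made-up validation errors (lower is better). It assumes every fold evaluated the same candidates, which a random search does not guarantee, and the selection rules are a reading of the strategy names rather than NIRS4ALL's code.

```python
# Hypothetical validation errors: scores[fold][n_components]
scores = {
    "fold1": {4: 3.5, 6: 2.6, 7: 2.4},
    "fold2": {4: 2.9, 6: 2.7, 7: 3.1},
    "fold3": {4: 3.0, 6: 3.4, 7: 4.7},
}


def best_of(candidate_scores):
    """Return the candidate with the lowest error."""
    return min(candidate_scores, key=candidate_scores.get)


# per_fold_best: every fold keeps its own winner.
per_fold_best = {fold: best_of(s) for fold, s in scores.items()}

# global_best: the single lowest error seen in any fold decides for all folds.
fold_with_lowest = min(scores, key=lambda fold: min(scores[fold].values()))
global_best = best_of(scores[fold_with_lowest])

# global_average: the candidate with the lowest mean error across folds wins.
candidates = list(scores["fold1"])
mean_score = {c: sum(s[c] for s in scores.values()) / len(scores) for c in candidates}
global_average = best_of(mean_score)

print(per_fold_best)   # {'fold1': 7, 'fold2': 6, 'fold3': 4}
print(global_best)     # 7
print(global_average)  # 6
```

Here `global_best` picks 7 components because of one very good fold, while `global_average` prefers 6, which is never the worst choice in any fold.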

In [5]:
# Parameter Strategy 1: PER_FOLD_BEST (default)
print("🎯 1. PER_FOLD_BEST - Each fold uses its own optimized parameters")

per_fold_best_config = {
    "pipeline": [
        ShuffleSplit(n_splits=3, test_size=.25),
        {
            "model": PLSRegression(),
            "finetune_params": {
                "cv_mode": "per_fold",
                "param_strategy": "per_fold_best",  # 🎯 DEFAULT STRATEGY
                "n_trials": 4,
                "verbose": 1,
                "model_params": {
                    "n_components": ("int", 1, 8)
                },
                "train_params": {"verbose": 0}
            }
        }
    ]
}

start = time.time()
config = PipelineConfigs(per_fold_best_config, "per_fold_best_demo")
result, _, _ = runner.run(config, data)
elapsed = time.time() - start

print(f"✅ Per-fold best completed in {elapsed:.1f}s")
print(f"📊 Generated {len(result._predictions)} prediction sets")

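# Get performance. NOTE: the run logs name the dataset "synthetic_test_dataset";
# if `combined` is empty, check the dataset name passed to combine_folds below.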
if result._predictions:
    combined = result._predictions.combine_folds(
        "sample_data", config.name, "PLSRegression", "test_fold"
    )
    if combined:
        y_true = combined['y_true'].flatten()
        y_pred = combined['y_pred'].flatten()
        rmse = np.sqrt(mean_squared_error(y_true, y_pred))
        r2 = r2_score(y_true, y_pred)
        print(f"🎯 Performance: RMSE={rmse:.3f}, R²={r2:.3f}")

[I 2025-09-26 15:06:53,870] A new study created in memory with name: no-name-5a72dbb9-3963-4172-9283-978c8b9d61d4
[I 2025-09-26 15:06:53,876] Trial 0 finished with value: 3.1536673068704446 and parameters: {'n_components': 5}. Best is trial 0 with value: 3.1536673068704446.
[I 2025-09-26 15:06:53,882] Trial 1 finished with value: 2.9214336046847986 and parameters: {'n_components': 6}. Best is trial 1 with value: 2.9214336046847986.
[I 2025-09-26 15:06:53,887] Trial 2 finished with value: 3.3056925963368826 and parameters: {'n_components': 4}. Best is trial 1 with value: 2.9214336046847986.
[I 2025-09-26 15:06:53,891] Trial 3 finished with value: 3.1536673068704446 and parameters: {'n_components': 5}. Best is trial 1 with value: 2.9214336046847986.
[I 2025-09-26 15:06:53,893] A new study created in memory with name: no-name-744be270-befa-4e5e-9228-0a1e2eb09935
[I 2025-09-26 15:06:53,899] Trial 0 finished with value: 3.8714537446639987 and parameters: {'n_components': 2}. Best is trial 0

🎯 1. PER_FOLD_BEST - Each fold uses its own optimized parameters
🚀 Starting pipeline config_per_fold_best_demo_4c61a9 on dataset synthetic_test_dataset
--------------------------------------------------------------------------------
🔄 Running 2 steps in sequential mode
🔷 Step 7: ShuffleSplit(n_splits=3, test_size=0.25)
🔹 Executing controller CrossValidatorController with operator ShuffleSplit
Generated 3 folds.
💾 Saved 7_folds_ShuffleSplit.csv to results\synthetic_test_dataset\config_per_fold_best_demo_4c61a9\7_folds_ShuffleSplit.csv
--------------------------------------------------------------------------------
🔷 Step 8: (finetune) PLSRegression()
🔹 Executing controller SklearnModelController with operator PLSReg

[I 2025-09-26 15:06:53,907] Trial 2 finished with value: 3.050189854233274 and parameters: {'n_components': 3}. Best is trial 2 with value: 3.050189854233274.
[I 2025-09-26 15:06:53,913] Trial 3 finished with value: 0.9683112065412375 and parameters: {'n_components': 8}. Best is trial 3 with value: 0.9683112065412375.
[I 2025-09-26 15:06:53,915] A new study created in memory with name: no-name-14090daf-bc03-4e32-bce1-ccae842f78f6
[I 2025-09-26 15:06:53,921] Trial 0 finished with value: 1.7624151984669005 and parameters: {'n_components': 6}. Best is trial 0 with value: 1.7624151984669005.
[I 2025-09-26 15:06:53,926] Trial 1 finished with value: 1.9075781663768288 and parameters: {'n_components': 5}. Best is trial 0 with value: 1.7624151984669005.
[I 2025-09-26 15:06:53,930] Trial 2 finished with value: 1.7624151984669005 and parameters: {'n_components': 6}. Best is trial 0 with value: 1.7624151984669005.
[I 2025-09-26 15:06:53,935] Trial 3 finished with value: 3.7553081951153033 and par

🎛️ Finetuning fold 3/3...
🔍 Optimizing 1 parameters with random search (4 trials)...
(42, 30) (42,) (14, 30) (14, 1)
✅ Per-fold CV completed successfully
💾 Saved 8_finetuned_PLSRegression_21_per_fold_cv_fold1.pkl to results\synthetic_test_dataset\config_per_fold_best_demo_4c61a9\8_finetuned_PLSRegression_21_per_fold_cv_fold1.pkl
💾 Saved 8_predictions_finetuned_22_per_fold_cv_fold1.csv to results\synthetic_test_dataset\config_per_fold_best_demo_4c61a9\8_predictions_finetuned_22_per_fold_cv_fold1.csv
💾 Saved 8_finetuned_PLSRegression_23_per_fold_cv_fold2.pkl to results\synthetic_test_dataset\config_per_fold_best_demo_4c61a9\8_finetuned_PLSRegression_23_per_fold_cv_fold2.pkl
💾 Saved 8_predictions_finetuned_24_per_fold_cv_fold2.csv to results\synthetic_test_dataset\config_per_fold_best_demo_4c61a9\8_predictions_finetuned_24_per_fold_cv_fold2.csv
💾 Saved 8_finetuned_PLSRegression_25_per_fold_cv_fold3.pkl to results\synthetic_test_dataset\config_per_fold_best_demo_4c61a9\8_finetuned_PLSRegre

In [6]:
# Parameter Strategy 2: GLOBAL_BEST
print("\n🏆 2. GLOBAL_BEST - Single best parameter set for all folds")

global_best_config = {
    "pipeline": [
        ShuffleSplit(n_splits=3, test_size=.25),
        {
            "model": PLSRegression(),
            "finetune_params": {
                "cv_mode": "per_fold",
                "param_strategy": "global_best",  # 🎯 GLOBAL BEST STRATEGY
                "n_trials": 4,
                "verbose": 1,
                "model_params": {
                    "n_components": ("int", 1, 8)
                },
                "train_params": {"verbose": 0}
            }
        }
    ]
}

start = time.time()
config = PipelineConfigs(global_best_config, "global_best_demo")
result, _, _ = runner.run(config, data)
elapsed = time.time() - start

print(f"✅ Global best completed in {elapsed:.1f}s")
print(f"📊 Generated {len(result._predictions)} prediction sets")

if result._predictions:
    combined = result._predictions.combine_folds(
        "sample_data", config.name, "PLSRegression", "test_fold"
    )
    if combined:
        y_true = combined['y_true'].flatten()
        y_pred = combined['y_pred'].flatten()
        rmse = np.sqrt(mean_squared_error(y_true, y_pred))
        r2 = r2_score(y_true, y_pred)
        print(f"🎯 Performance: RMSE={rmse:.3f}, R²={r2:.3f}")

[I 2025-09-26 15:06:53,963] A new study created in memory with name: no-name-54213c7f-d0e3-4a4f-8f87-64cfbbb76ab2
[I 2025-09-26 15:06:53,969] Trial 0 finished with value: 3.119835572618817 and parameters: {'n_components': 5}. Best is trial 0 with value: 3.119835572618817.
[I 2025-09-26 15:06:53,974] Trial 1 finished with value: 2.4273903584050704 and parameters: {'n_components': 7}. Best is trial 1 with value: 2.4273903584050704.
[I 2025-09-26 15:06:53,979] Trial 2 finished with value: 3.502549045081746 and parameters: {'n_components': 4}. Best is trial 1 with value: 2.4273903584050704.
[I 2025-09-26 15:06:53,984] Trial 3 finished with value: 2.6440505475265668 and parameters: {'n_components': 6}. Best is trial 1 with value: 2.4273903584050704.
[I 2025-09-26 15:06:53,986] A new study created in memory with name: no-name-6058c8d7-f695-469b-98d3-73b4dbffaa84
[I 2025-09-26 15:06:53,991] Trial 0 finished with value: 2.6886753336663625 and parameters: {'n_components': 5}. Best is trial 0 wi


🏆 2. GLOBAL_BEST - Single best parameter set for all folds
🚀 Starting pipeline config_global_best_demo_c0374c on dataset synthetic_test_dataset
--------------------------------------------------------------------------------
🔄 Running 2 steps in sequential mode
🔷 Step 9: ShuffleSplit(n_splits=3, test_size=0.25)
🔹 Executing controller CrossValidatorController with operator ShuffleSplit
Generated 3 folds.
💾 Saved 9_folds_ShuffleSplit.csv to results\synthetic_test_dataset\config_global_best_demo_c0374c\9_folds_ShuffleSplit.csv
--------------------------------------------------------------------------------
🔷 Step 10: (finetune) PLSRegression()
🔹 Executing controller SklearnModelController with operator PLSRegression


[I 2025-09-26 15:06:54,020] Trial 1 finished with value: 3.3758056643189853 and parameters: {'n_components': 8}. Best is trial 0 with value: 3.3758056643189853.
[I 2025-09-26 15:06:54,025] Trial 2 finished with value: 4.724794511830299 and parameters: {'n_components': 3}. Best is trial 0 with value: 3.3758056643189853.
[I 2025-09-26 15:06:54,031] Trial 3 finished with value: 4.724794511830299 and parameters: {'n_components': 3}. Best is trial 0 with value: 3.3758056643189853.


🏆 Global best parameters: {'n_components': 7}
✅ Per-fold CV completed successfully
💾 Saved 10_finetuned_PLSRegression_27_per_fold_cv_fold1.pkl to results\synthetic_test_dataset\config_global_best_demo_c0374c\10_finetuned_PLSRegression_27_per_fold_cv_fold1.pkl
💾 Saved 10_predictions_finetuned_28_per_fold_cv_fold1.csv to results\synthetic_test_dataset\config_global_best_demo_c0374c\10_predictions_finetuned_28_per_fold_cv_fold1.csv
💾 Saved 10_finetuned_PLSRegression_29_per_fold_cv_fold2.pkl to results\synthetic_test_dataset\config_global_best_demo_c0374c\10_finetuned_PLSRegression_29_per_fold_cv_fold2.pkl
💾 Saved 10_predictions_finetuned_30_per_fold_cv_fold2.csv to results\synthetic_test_dataset\config_global_best_demo_c0374c\10_predictions_finetuned_30_per_fold_cv_fold2.csv
💾 Saved 10_finetuned_PLSRegression_31_per_fold_cv_fold3.pkl to results\synthetic_test_dataset\config_global_best_demo_c0374c\10_finetuned_PLSRegression_31_per_fold_cv_fold3.pkl
💾 Saved 10_predictions_finetuned_32_per_

In [7]:
# Parameter Strategy 3: GLOBAL_AVERAGE ⭐ NEW
print("\n🌍 3. GLOBAL_AVERAGE - Optimize by averaging across ALL folds")

global_average_config = {
    "pipeline": [
        ShuffleSplit(n_splits=3, test_size=.25),
        {
            "model": PLSRegression(),
            "finetune_params": {
                "cv_mode": "per_fold",
                "param_strategy": "global_average",  # 🎯 NEW STRATEGY
                "n_trials": 3,  # Fewer trials since more expensive per trial
                "verbose": 1,
                "model_params": {
                    "n_components": ("int", 1, 8)
                },
                "train_params": {"verbose": 0}
            }
        }
    ]
}

start = time.time()
config = PipelineConfigs(global_average_config, "global_average_demo")
result, _, _ = runner.run(config, data)
elapsed = time.time() - start

print(f"✅ Global average completed in {elapsed:.1f}s")
print(f"📊 Generated {len(result._predictions)} prediction sets")

if result._predictions:
    combined = result._predictions.combine_folds(
        "sample_data", config.name, "PLSRegression", "test_fold"
    )
    if combined:
        y_true = combined['y_true'].flatten()
        y_pred = combined['y_pred'].flatten()
        rmse = np.sqrt(mean_squared_error(y_true, y_pred))
        r2 = r2_score(y_true, y_pred)
        print(f"🎯 Performance: RMSE={rmse:.3f}, R²={r2:.3f}")

print(f"\n📊 Parameter Strategy Comparison:")
print(f"  per_fold_best:  Each fold optimized individually")
print(f"  global_best:    Best performing params used for all folds")
print(f"  global_average: Params optimized for average performance ⭐")

[I 2025-09-26 15:06:54,068] A new study created in memory with name: no-name-ea4b3d99-c209-4ee4-83e9-591d10e0ce1d
[I 2025-09-26 15:06:54,082] Trial 0 finished with value: 10.513848496645641 and parameters: {'n_components': 4}. Best is trial 0 with value: 10.513848496645641.
[I 2025-09-26 15:06:54,095] Trial 1 finished with value: 10.513848496645641 and parameters: {'n_components': 4}. Best is trial 0 with value: 10.513848496645641.



🌍 3. GLOBAL_AVERAGE - Optimize by averaging across ALL folds
🚀 Starting pipeline config_global_average_demo_d7f2f7 on dataset synthetic_test_dataset
--------------------------------------------------------------------------------
🔄 Running 2 steps in sequential mode
🔷 Step 11: ShuffleSplit(n_splits=3, test_size=0.25)
🔹 Executing controller CrossValidatorController with operator ShuffleSplit
Generated 3 folds.
💾 Saved 11_folds_ShuffleSplit.csv to results\synthetic_test_dataset\config_global_average_demo_d7f2f7\11_folds_ShuffleSplit.csv
--------------------------------------------------------------------------------
🔷 Step 12: (finetune) PLSRegression()
🔹 Executing controller SklearnModelController with operator PLS

[I 2025-09-26 15:06:54,119] Trial 2 finished with value: 11.2829692139246 and parameters: {'n_components': 1}. Best is trial 0 with value: 10.513848496645641.


🏆 Global best parameters: {'n_components': 4}
📊 Best average score: 10.5138
🔄 Training 3 final models with global best parameters...
(42, 30) (42,) (14, 30) (14, 1)
(42, 30) (42,) (14, 30) (14, 1)
(42, 30) (42,) (14, 30) (14, 1)
✅ Global Average CV completed successfully
💾 Saved 12_trained_PLSRegression_33_global_avg_cv_fold1.pkl to results\synthetic_test_dataset\config_global_average_demo_d7f2f7\12_trained_PLSRegression_33_global_avg_cv_fold1.pkl
💾 Saved 12_predictions_trained_34_global_avg_cv_fold1.csv to results\synthetic_test_dataset\config_global_average_demo_d7f2f7\12_predictions_trained_34_global_avg_cv_fold1.csv
💾 Saved 12_trained_PLSRegression_35_global_avg_cv_fold2.pkl to results\synthetic_test_dataset\config_global_average_demo_d7f2f7\12_trained_PLSRegression_35_global_avg_cv_fold2.pkl
💾 Saved 12_predictions_trained_36_global_avg_cv_fold2.csv to results\synthetic_test_dataset\config_global_average_demo_d7f2f7\12_predictions_trained_36_global_avg_cv_fold2.csv
💾 Saved 12_train

## 3. Full Training Option ⭐ NEW

The `use_full_train_for_final` option lets you use CV for optimization but train the final model on all available training data.
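
The idea can be sketched without NIRS4ALL: choose a hyperparameter by average CV error, then either keep one model per fold or refit once on every training row. The example uses closed-form ridge regression in NumPy as a stand-in for PLS, with hypothetical helper names; it illustrates the pattern, not the library's internals.

```python
import numpy as np


def ridge_fit(X, y, alpha):
    """Closed-form ridge regression without intercept."""
    return np.linalg.solve(X.T @ X + alpha * np.eye(X.shape[1]), X.T @ y)


rng = np.random.default_rng(1)
X = rng.normal(size=(56, 10))
y = X[:, :3].sum(axis=1) + 0.3 * rng.normal(size=56)

# Three shuffle splits with 14 validation rows each.
folds = []
for _ in range(3):
    order = rng.permutation(len(X))
    folds.append((order[14:], order[:14]))


def cv_error(alpha):
    """Mean validation MSE across all folds for one candidate."""
    errors = []
    for tr, va in folds:
        coef = ridge_fit(X[tr], y[tr], alpha)
        errors.append(np.mean((X[va] @ coef - y[va]) ** 2))
    return float(np.mean(errors))


alphas = [0.01, 1.0, 100.0]
best_alpha = min(alphas, key=cv_error)

# Analogue of use_full_train_for_final=False: one model per fold.
fold_models = [ridge_fit(X[tr], y[tr], best_alpha) for tr, _ in folds]

# Analogue of use_full_train_for_final=True: a single model on all training rows.
final_model = ridge_fit(X, y, best_alpha)

print(len(fold_models), "fold models vs. 1 final model, alpha =", best_alpha)
```

The single refit sees every training row, at the price of having no held-out fold left to score that particular model.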

In [8]:
# Traditional approach: separate models per fold
print("🔄 TRADITIONAL: Separate models per fold")

traditional_config = {
    "pipeline": [
        ShuffleSplit(n_splits=3, test_size=.25),
        {
            "model": PLSRegression(),
            "finetune_params": {
                "cv_mode": "per_fold",
                "param_strategy": "global_average",
                "use_full_train_for_final": False,  # 🎯 TRADITIONAL
                "n_trials": 3,
                "verbose": 1,
                "model_params": {
                    "n_components": ("int", 1, 8)
                },
                "train_params": {"verbose": 0}
            }
        }
    ]
}

start = time.time()
config = PipelineConfigs(traditional_config, "traditional_demo")
result_traditional, _, _ = runner.run(config, data)
elapsed = time.time() - start

print(f"✅ Traditional approach completed in {elapsed:.1f}s")
print(f"📊 Generated {len(result_traditional._predictions)} prediction sets")
print(f"🔑 Keys: {result_traditional._predictions.list_keys()}")

[I 2025-09-26 15:06:54,162] A new study created in memory with name: no-name-65d9b59b-4921-4f93-8c3d-c60b5ecb2bae
[I 2025-09-26 15:06:54,177] Trial 0 finished with value: 11.53709932036127 and parameters: {'n_components': 4}. Best is trial 0 with value: 11.53709932036127.
[I 2025-09-26 15:06:54,191] Trial 1 finished with value: 11.590515481026053 and parameters: {'n_components': 8}. Best is trial 0 with value: 11.53709932036127.
[I 2025-09-26 15:06:54,204] Trial 2 finished with value: 11.543476125180026 and parameters: {'n_components': 3}. Best is trial 0 with value: 11.53709932036127.


🔄 TRADITIONAL: Separate models per fold
[94m🚀 Starting pipeline config_traditional_demo_2f0f5c on dataset synthetic_test_dataset[0m
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
🔄 Running 2 steps in sequential mode
🔷 Step 13: ShuffleSplit(n_splits=3, test_size=0.25)
🔹 Executing controller CrossValidatorController with operator ShuffleSplit
Generated 3 folds.
💾 Saved 13_folds_ShuffleSplit.csv to results\synthetic_test_dataset\config_traditional_demo_2f0f5c\13_folds_ShuffleSplit.csv
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
🔷 Step 14: (finetune) PLSRegression()
🔹 Executing controller SklearnModelController with operator PLSRegression
🌍 Global Average 

In [9]:
# NEW Full Training approach: single model on full data
print("\n🎯 FULL TRAINING: Single model on combined data ⭐")

full_train_config = {
    "pipeline": [
        ShuffleSplit(n_splits=3, test_size=.25),
        {
            "model": PLSRegression(),
            "finetune_params": {
                "cv_mode": "per_fold",
                "param_strategy": "global_average",
                "use_full_train_for_final": True,  # 🎯 NEW OPTION
                "n_trials": 3,
                "verbose": 1,
                "model_params": {
                    "n_components": ("int", 1, 8)
                },
                "train_params": {"verbose": 0}
            }
        }
    ]
}

start = time.time()
config = PipelineConfigs(full_train_config, "full_train_demo")
result_full_train, _, _ = runner.run(config, data)
elapsed = time.time() - start

print(f"✅ Full training approach completed in {elapsed:.1f}s")
print(f"📊 Generated {len(result_full_train._predictions)} prediction sets")
print(f"🔑 Keys: {result_full_train._predictions.list_keys()}")

# Compare approaches
print("\n📊 COMPARISON:")
print(f"  Traditional: {len(result_traditional._predictions)} prediction sets (multiple models)")
print(f"  Full Train:  {len(result_full_train._predictions)} prediction sets (single model)")
print("\n💡 Full Training Benefits:")
print("  ✓ Uses all available training data")
print("  ✓ Single model for deployment")
print("  ✓ Often better performance, since the final model sees more data than any per-fold model")
print("  ✓ Simpler model management")

[I 2025-09-26 15:06:54,237] A new study created in memory with name: no-name-2b3dcbba-0aaf-4e7b-a996-223d833689b2
[I 2025-09-26 15:06:54,253] Trial 0 finished with value: 9.578812769031337 and parameters: {'n_components': 6}. Best is trial 0 with value: 9.578812769031337.



🎯 FULL TRAINING: Single model on combined data ⭐
🚀 Starting pipeline config_full_train_demo_c97b52 on dataset synthetic_test_dataset
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
🔄 Running 2 steps in sequential mode
🔷 Step 15: ShuffleSplit(n_splits=3, test_size=0.25)
🔹 Executing controller CrossValidatorController with operator ShuffleSplit
Generated 3 folds.
💾 Saved 15_folds_ShuffleSplit.csv to results\synthetic_test_dataset\config_full_train_demo_c97b52\15_folds_ShuffleSplit.csv
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
🔷 Step 16: (finetune) PLSRegression()
🔹 Executing controller SklearnModelController with operator PLSRegression
🌍 Global 

[I 2025-09-26 15:06:54,266] Trial 1 finished with value: 9.792100848166182 and parameters: {'n_components': 3}. Best is trial 0 with value: 9.578812769031337.
[I 2025-09-26 15:06:54,280] Trial 2 finished with value: 9.578812769031337 and parameters: {'n_components': 6}. Best is trial 0 with value: 9.578812769031337.


🏆 Global best parameters: {'n_components': 6}
📊 Best average score: 9.5788
🎯 Training single model on full training data (global_avg)...
📊 Combined training data: 126 samples
📊 Combined test data: 42 samples
✅ Applied optimized parameters: {'n_components': 6}
🏋️ Training model with 126 samples...
✅ Single model training on full data completed successfully
💾 Saved 16_global_avg_model_PLSRegression_45.pkl to results\synthetic_test_dataset\config_full_train_demo_c97b52\16_global_avg_model_PLSRegression_45.pkl
💾 Saved 16_predictions_global_avg_model_46.csv to results\synthetic_test_dataset\config_full_train_demo_c97b52\16_predictions_global_avg_model_46.csv
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
✅ Pipeline config_full_train_demo_c97b52 completed successfully on dataset synthetic_test_dataset
✅ Full training approach compl

## 4. Different Model Types

This section applies finetuning to three model types, each with a different kind of parameter space: an integer range, a float range, and categorical lists.
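The cells in this section write search spaces in three notations: `("int", min, max)`, `("float", min, max)`, and a plain list of choices. The sketch below shows one way such a space could be interpreted. It is an illustration only: `sample_params` is a hypothetical helper, and it draws values with Python's `random` module as a stand-in, whereas the logs in this notebook show that the actual search is run through Optuna studies.

```python
import random


def sample_params(space, rng):
    """Draw one candidate from a search space written in the notebook's notation.

    Hypothetical helper for illustration; not part of NIRS4ALL.
    """
    params = {}
    for name, spec in space.items():
        if isinstance(spec, list):
            # Categorical: pick one of the listed options.
            params[name] = rng.choice(spec)
        elif isinstance(spec, tuple) and len(spec) == 3 and spec[0] == "int":
            # Integer range, both bounds inclusive.
            params[name] = rng.randint(spec[1], spec[2])
        elif isinstance(spec, tuple) and len(spec) == 3 and spec[0] == "float":
            # Float range between the two bounds.
            params[name] = rng.uniform(spec[1], spec[2])
        else:
            raise ValueError(f"Unsupported spec for {name!r}: {spec!r}")
    return params


rng = random.Random(0)
pls_space = {"n_components": ("int", 1, 10)}
ridge_space = {"alpha": ("float", 0.1, 10.0)}
rf_space = {"n_estimators": [10, 20, 30], "max_depth": [3, 5, 7]}

for label, space in [("PLS", pls_space), ("Ridge", ridge_space), ("RF", rf_space)]:
    print(label, sample_params(space, rng))
```

A list of integers such as `[10, 20, 30]` is treated as a set of discrete choices, so values between the listed options are never tried.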

In [10]:
# Model 1: PLS Regression (integer parameter)
print("🧮 1. PLS Regression - Integer parameter optimization")

pls_config = {
    "pipeline": [
        ShuffleSplit(n_splits=3, test_size=.25),
        {
            "model": PLSRegression(),
            "finetune_params": {
                "cv_mode": "simple",  # Fast for demo
                "param_strategy": "global_average",
                "n_trials": 5,
                "verbose": 1,
                "model_params": {
                    "n_components": ("int", 1, 10)  # 🎯 Integer parameter
                },
                "train_params": {"verbose": 0}
            }
        }
    ]
}

start = time.time()
config = PipelineConfigs(pls_config, "pls_demo")
result, _, _ = runner.run(config, data)
elapsed = time.time() - start

print(f"✅ PLS completed in {elapsed:.1f}s")
if result._predictions:
    keys = result._predictions.list_keys()
    if keys:
        key_parts = keys[0].split('_', 3)
        if len(key_parts) >= 4:
            pred_data = result._predictions.get_prediction_data(*key_parts)
            if pred_data:
                y_true = pred_data['y_true'].flatten()
                y_pred = pred_data['y_pred'].flatten()
                rmse = np.sqrt(mean_squared_error(y_true, y_pred))
                r2 = r2_score(y_true, y_pred)
                print(f"🎯 PLS Performance: RMSE={rmse:.3f}, R²={r2:.3f}")

[I 2025-09-26 15:06:54,321] A new study created in memory with name: no-name-54b71f89-56b4-45b1-9fb2-4f5cc442086b
[I 2025-09-26 15:06:54,327] Trial 0 finished with value: 0.06641292204265331 and parameters: {'n_components': 6}. Best is trial 0 with value: 0.06641292204265331.
[I 2025-09-26 15:06:54,332] Trial 1 finished with value: 0.10276666343404751 and parameters: {'n_components': 5}. Best is trial 0 with value: 0.06641292204265331.
[I 2025-09-26 15:06:54,337] Trial 2 finished with value: 0.4159393790482339 and parameters: {'n_components': 3}. Best is trial 0 with value: 0.06641292204265331.
[I 2025-09-26 15:06:54,343] Trial 3 finished with value: 0.06641292204265331 and parameters: {'n_components': 6}. Best is trial 0 with value: 0.06641292204265331.


🧮 1. PLS Regression - Continuous parameter optimization
🚀 Starting pipeline config_pls_demo_febb66 on dataset synthetic_test_dataset
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
🔄 Running 2 steps in sequential mode
🔷 Step 17: ShuffleSplit(n_splits=3, test_size=0.25)
🔹 Executing controller CrossValidatorController with operator ShuffleSplit
Generated 3 folds.
💾 Saved 17_folds_ShuffleSplit.csv to results\synthetic_test_dataset\config_pls_demo_febb66\17_folds_ShuffleSplit.csv
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
🔷 Step 18: (finetune) PLSRegression()
🔹 Executing controller SklearnModelController with operator PLSRegression
🔍 Simple CV: Fine

[I 2025-09-26 15:06:54,349] Trial 4 finished with value: 0.0409985299268884 and parameters: {'n_components': 8}. Best is trial 4 with value: 0.0409985299268884.


🏆 Best parameters found: {'n_components': 8}
🔄 Training 3 fold models with best parameters...
(42, 30) (42,) (14, 30) (14, 1)
(42, 30) (42,) (14, 30) (14, 1)
(42, 30) (42,) (14, 30) (14, 1)
✅ Simple CV completed successfully
💾 Saved 18_finetuned_PLSRegression_47.pkl to results\synthetic_test_dataset\config_pls_demo_febb66\18_finetuned_PLSRegression_47.pkl
💾 Saved 18_predictions_finetuned_48.csv to results\synthetic_test_dataset\config_pls_demo_febb66\18_predictions_finetuned_48.csv
💾 Saved 18_trained_PLSRegression_49_simple_cv_fold1.pkl to results\synthetic_test_dataset\config_pls_demo_febb66\18_trained_PLSRegression_49_simple_cv_fold1.pkl
💾 Saved 18_predictions_trained_50_simple_cv_fold1.csv to results\synthetic_test_dataset\config_pls_demo_febb66\18_predictions_trained_50_simple_cv_fold1.csv
💾 Saved 18_trained_PLSRegression_51_simple_cv_fold2.pkl to results\synthetic_test_dataset\config_pls_demo_febb66\18_trained_PLSRegression_51_simple_cv_fold2.pkl
💾 Saved 18_predictions_trained_52_

In [11]:
# Model 2: Ridge Regression (float parameter)
print("\n📐 2. Ridge Regression - Float parameter optimization")

ridge_config = {
    "pipeline": [
        ShuffleSplit(n_splits=3, test_size=.25),
        {
            "model": Ridge(),
            "finetune_params": {
                "cv_mode": "simple",
                "param_strategy": "global_average",
                "n_trials": 5,
                "verbose": 1,
                "model_params": {
                    "alpha": ("float", 0.1, 10.0)  # 🎯 Float parameter
                },
                "train_params": {"verbose": 0}
            }
        }
    ]
}

start = time.time()
config = PipelineConfigs(ridge_config, "ridge_demo")
result, _, _ = runner.run(config, data)
elapsed = time.time() - start

print(f"✅ Ridge completed in {elapsed:.1f}s")
if result._predictions:
    keys = result._predictions.list_keys()
    if keys:
        key_parts = keys[0].split('_', 3)
        if len(key_parts) >= 4:
            pred_data = result._predictions.get_prediction_data(*key_parts)
            if pred_data:
                y_true = pred_data['y_true'].flatten()
                y_pred = pred_data['y_pred'].flatten()
                rmse = np.sqrt(mean_squared_error(y_true, y_pred))
                r2 = r2_score(y_true, y_pred)
                print(f"🎯 Ridge Performance: RMSE={rmse:.3f}, R²={r2:.3f}")

[I 2025-09-26 15:06:54,400] A new study created in memory with name: no-name-4a660835-4a59-4865-9879-2563a975a4a4



📐 2. Ridge Regression - Float parameter optimization
🚀 Starting pipeline config_ridge_demo_ca3b40 on dataset synthetic_test_dataset
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
🔄 Running 2 steps in sequential mode
🔷 Step 19: ShuffleSplit(n_splits=3, test_size=0.25)
🔹 Executing controller CrossValidatorController with operator ShuffleSplit
Generated 3 folds.
💾 Saved 19_folds_ShuffleSplit.csv to results\synthetic_test_dataset\config_ridge_demo_ca3b40\19_folds_ShuffleSplit.csv
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
🔷 Step 20: (finetune) Ridge()
🔹 Executing controller SklearnModelController with operator Ridge
🔍 Simple CV: Finetuning on full

[I 2025-09-26 15:06:54,405] Trial 0 finished with value: 0.23077033956845602 and parameters: {'alpha': 7.7804537396434705}. Best is trial 0 with value: 0.23077033956845602.
[I 2025-09-26 15:06:54,411] Trial 1 finished with value: 0.11347101877133052 and parameters: {'alpha': 3.984846289449217}. Best is trial 1 with value: 0.11347101877133052.
[I 2025-09-26 15:06:54,415] Trial 2 finished with value: 0.11366061866283417 and parameters: {'alpha': 3.991897068025448}. Best is trial 1 with value: 0.11347101877133052.
[I 2025-09-26 15:06:54,419] Trial 3 finished with value: 0.22531107564767203 and parameters: {'alpha': 7.619147142300146}. Best is trial 1 with value: 0.11347101877133052.
[I 2025-09-26 15:06:54,423] Trial 4 finished with value: 0.2102480779091517 and parameters: {'alpha': 7.168890403267846}. Best is trial 1 with value: 0.11347101877133052.


🏆 Best parameters found: {'alpha': 3.984846289449217}
🔄 Training 3 fold models with best parameters...
(42, 30) (42,) (14, 30) (14, 1)
(42, 30) (42,) (14, 30) (14, 1)
(42, 30) (42,) (14, 30) (14, 1)
✅ Simple CV completed successfully
💾 Saved 20_finetuned_Ridge_55.pkl to results\synthetic_test_dataset\config_ridge_demo_ca3b40\20_finetuned_Ridge_55.pkl
💾 Saved 20_predictions_finetuned_56.csv to results\synthetic_test_dataset\config_ridge_demo_ca3b40\20_predictions_finetuned_56.csv
💾 Saved 20_trained_Ridge_57_simple_cv_fold1.pkl to results\synthetic_test_dataset\config_ridge_demo_ca3b40\20_trained_Ridge_57_simple_cv_fold1.pkl
💾 Saved 20_predictions_trained_58_simple_cv_fold1.csv to results\synthetic_test_dataset\config_ridge_demo_ca3b40\20_predictions_trained_58_simple_cv_fold1.csv
💾 Saved 20_trained_Ridge_59_simple_cv_fold2.pkl to results\synthetic_test_dataset\config_ridge_demo_ca3b40\20_trained_Ridge_59_simple_cv_fold2.pkl
💾 Saved 20_predictions_trained_60_simple_cv_fold2.csv to result

In [12]:
# Model 3: Random Forest (categorical parameters given as lists of choices)
print("\n🌲 3. Random Forest - Categorical parameter optimization")

rf_config = {
    "pipeline": [
        ShuffleSplit(n_splits=3, test_size=.25),
        {
            "model": RandomForestRegressor(random_state=42),
            "finetune_params": {
                "cv_mode": "simple",
                "param_strategy": "per_fold_best",  # Faster for RF
                "n_trials": 4,  # RF can be slow
                "verbose": 1,
                "model_params": {
                    "n_estimators": [10, 20, 30],  # 🎯 Categorical parameter
                    "max_depth": [3, 5, 7]         # 🎯 Categorical parameter
                },
                "train_params": {"verbose": 0}
            }
        }
    ]
}

start = time.time()
config = PipelineConfigs(rf_config, "rf_demo")
result, _, _ = runner.run(config, data)
elapsed = time.time() - start

print(f"✅ Random Forest completed in {elapsed:.1f}s")
if result._predictions:
    keys = result._predictions.list_keys()
    if keys:
        key_parts = keys[0].split('_', 3)
        if len(key_parts) >= 4:
            pred_data = result._predictions.get_prediction_data(*key_parts)
            if pred_data:
                y_true = pred_data['y_true'].flatten()
                y_pred = pred_data['y_pred'].flatten()
                rmse = np.sqrt(mean_squared_error(y_true, y_pred))
                r2 = r2_score(y_true, y_pred)
                print(f"🎯 RF Performance: RMSE={rmse:.3f}, R²={r2:.3f}")

print(f"\n🎛️ Parameter Type Summary:")
print(f"  PLS:     ('int', min, max) - Integer range")
print(f"  Ridge:   ('float', min, max) - Float range")
print(f"  RF:      [val1, val2, val3] - Categorical list")

[I 2025-09-26 15:06:54,459] A new study created in memory with name: no-name-726571ba-5209-43e1-886c-3956ce3d8069



🌲 3. Random Forest - Mixed parameter types
🚀 Starting pipeline config_rf_demo_707130 on dataset synthetic_test_dataset
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
🔄 Running 2 steps in sequential mode
🔷 Step 21: ShuffleSplit(n_splits=3, test_size=0.25)
🔹 Executing controller CrossValidatorController with operator ShuffleSplit
Generated 3 folds.
💾 Saved 21_folds_ShuffleSplit.csv to results\synthetic_test_dataset\config_rf_demo_707130\21_folds_ShuffleSplit.csv
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
🔷 Step 22: (finetune) RandomForestRegressor(random_state=42)
🔹 Executing controller SklearnModelController with operator RandomForestRegressor


[I 2025-09-26 15:06:54,554] Trial 0 finished with value: 2.900292467644007 and parameters: {'n_estimators': 30, 'max_depth': 3}. Best is trial 0 with value: 2.900292467644007.
[I 2025-09-26 15:06:54,683] Trial 1 finished with value: 1.214873214161031 and parameters: {'n_estimators': 30, 'max_depth': 7}. Best is trial 1 with value: 1.214873214161031.
[I 2025-09-26 15:06:54,732] Trial 2 finished with value: 1.4688841824561942 and parameters: {'n_estimators': 10, 'max_depth': 7}. Best is trial 1 with value: 1.214873214161031.
[I 2025-09-26 15:06:54,773] Trial 3 finished with value: 3.169007985609278 and parameters: {'n_estimators': 10, 'max_depth': 3}. Best is trial 1 with value: 1.214873214161031.


🏆 Best parameters found: {'n_estimators': 30, 'max_depth': 7}
🔄 Training 3 fold models with best parameters...
(42, 30) (42,) (14, 30) (14, 1)
(42, 30) (42,) (14, 30) (14, 1)
(42, 30) (42,) (14, 30) (14, 1)
✅ Simple CV completed successfully
💾 Saved 22_finetuned_RandomForestRegressor_63.pkl to results\synthetic_test_dataset\config_rf_demo_707130\22_finetuned_RandomForestRegressor_63.pkl
💾 Saved 22_predictions_finetuned_64.csv to results\synthetic_test_dataset\config_rf_demo_707130\22_predictions_finetuned_64.csv
💾 Saved 22_trained_RandomForestRegressor_65_simple_cv_fold1.pkl to results\synthetic_test_dataset\config_rf_demo_707130\22_trained_RandomForestRegressor_65_simple_cv_fold1.pkl
💾 Saved 22_predictions_trained_66_simple_cv_fold1.csv to results\synthetic_test_dataset\config_rf_demo_707130\22_predictions_trained_66_simple_cv_fold1.csv
💾 Saved 22_trained_RandomForestRegressor_67_simple_cv_fold2.pkl to results\synthetic_test_dataset\config_rf_demo_707130\22_trained_RandomForestRegress

## 5. Best Practice Combinations

The three configurations below are recommended starting points for quick prototyping, production deployment, and research.

In [13]:
# Use Case 1: Quick Prototyping
print("🚀 PROTOTYPING: Fast and simple")

prototype_config = {
    "pipeline": [
        ShuffleSplit(n_splits=3, test_size=.25),
        {
            "model": PLSRegression(),
            "finetune_params": {
                "cv_mode": "simple",        # ⚡ Fastest CV mode
                "param_strategy": "per_fold_best",  # 🎯 Standard strategy
                "n_trials": 8,              # 📊 More trials since it's fast
                "verbose": 1,
                "model_params": {
                    "n_components": ("int", 1, 10)
                },
                "train_params": {"verbose": 0}
            }
        }
    ]
}

start = time.time()
config = PipelineConfigs(prototype_config, "prototype_demo")
result, _, _ = runner.run(config, data)
prototype_time = time.time() - start

print(f"✅ Prototyping completed in {prototype_time:.1f}s")
print("💡 Use for: Quick experiments, initial model testing")

[I 2025-09-26 15:06:55,088] A new study created in memory with name: no-name-d32fd691-786c-4c7c-82dc-45250a413eb3
[I 2025-09-26 15:06:55,095] Trial 0 finished with value: 0.14286875830932175 and parameters: {'n_components': 4}. Best is trial 0 with value: 0.14286875830932175.
[I 2025-09-26 15:06:55,099] Trial 1 finished with value: 0.7513361955536707 and parameters: {'n_components': 2}. Best is trial 0 with value: 0.14286875830932175.
[I 2025-09-26 15:06:55,110] Trial 2 finished with value: 0.14286875830932175 and parameters: {'n_components': 4}. Best is trial 0 with value: 0.14286875830932175.
[I 2025-09-26 15:06:55,117] Trial 3 finished with value: 0.06591775050504516 and parameters: {'n_components': 6}. Best is trial 3 with value: 0.06591775050504516.
[I 2025-09-26 15:06:55,125] Trial 4 finished with value: 0.05478362071166592 and parameters: {'n_components': 7}. Best is trial 4 with value: 0.05478362071166592.
[I 2025-09-26 15:06:55,132] Trial 5 finished with value: 0.0547836207116

🚀 PROTOTYPING: Fast and simple
🚀 Starting pipeline config_prototype_demo_608a92 on dataset synthetic_test_dataset
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
🔄 Running 2 steps in sequential mode
🔷 Step 23: ShuffleSplit(n_splits=3, test_size=0.25)
🔹 Executing controller CrossValidatorController with operator ShuffleSplit
Generated 3 folds.
💾 Saved 23_folds_ShuffleSplit.csv to results\synthetic_test_dataset\config_prototype_demo_608a92\23_folds_ShuffleSplit.csv
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
🔷 Step 24: (finetune) PLSRegression()
🔹 Executing controller SklearnModelController with operator PLSRegression
🔍 Simple CV: Finetuning on ful

In [14]:
# Use Case 2: Production Deployment
print("\n🎯 PRODUCTION: Best generalization + Single model")

production_config = {
    "pipeline": [
        ShuffleSplit(n_splits=3, test_size=.25),
        {
            "model": PLSRegression(),
            "finetune_params": {
                "cv_mode": "per_fold",              # ⚖️ Good balance of rigor and speed
                "param_strategy": "global_average", # 🌍 Most generalizable parameters
                "use_full_train_for_final": True,  # 🎯 Single model for deployment
                "n_trials": 6,                     # 📊 Moderate trials
                "verbose": 1,
                "model_params": {
                    "n_components": ("int", 1, 12)
                },
                "train_params": {"verbose": 0}
            }
        }
    ]
}

start = time.time()
config = PipelineConfigs(production_config, "production_demo")
result, _, _ = runner.run(config, data)
production_time = time.time() - start

print(f"✅ Production approach completed in {production_time:.1f}s")
print("💡 Use for: Production deployment, single model needed")
print(f"📊 Prediction sets generated: {len(result._predictions)} (expect 1 for a single deployment model)")

[I 2025-09-26 15:06:55,183] A new study created in memory with name: no-name-1f9f64b2-d1c5-4049-95fc-86970aab3d7b


[I 2025-09-26 15:06:55,195] Trial 0 finished with value: 14.475699861645126 and parameters: {'n_components': 1}. Best is trial 0 with value: 14.475699861645126.
[I 2025-09-26 15:06:55,211] Trial 1 finished with value: 14.389765533135874 and parameters: {'n_components': 8}. Best is trial 1 with value: 14.389765533135874.



🎯 PRODUCTION: Best generalization + Single model
🚀 Starting pipeline config_production_demo_facf04 on dataset synthetic_test_dataset
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
🔄 Running 2 steps in sequential mode
🔷 Step 25: ShuffleSplit(n_splits=3, test_size=0.25)
🔹 Executing controller CrossValidatorController with operator ShuffleSplit
Generated 3 folds.
💾 Saved 25_folds_ShuffleSplit.csv to results\synthetic_test_dataset\config_production_demo_facf04\25_folds_ShuffleSplit.csv
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
🔷 Step 26: (finetune) PLSRegression()
🔹 Executing controller SklearnModelController with operator PLSRegression
🌍 Global 

[I 2025-09-26 15:06:55,226] Trial 2 finished with value: 14.389765533135874 and parameters: {'n_components': 8}. Best is trial 1 with value: 14.389765533135874.
[I 2025-09-26 15:06:55,237] Trial 3 finished with value: 14.475699861645126 and parameters: {'n_components': 1}. Best is trial 1 with value: 14.389765533135874.
[I 2025-09-26 15:06:55,252] Trial 4 finished with value: 14.389158122016283 and parameters: {'n_components': 7}. Best is trial 4 with value: 14.389158122016283.
[I 2025-09-26 15:06:55,264] Trial 5 finished with value: 14.189845278287764 and parameters: {'n_components': 2}. Best is trial 5 with value: 14.189845278287764.


🏆 Global best parameters: {'n_components': 2}
📊 Best average score: 14.1898
🎯 Training single model on full training data (global_avg)...
📊 Combined training data: 126 samples
📊 Combined test data: 42 samples
✅ Applied optimized parameters: {'n_components': 2}
🏋️ Training model with 126 samples...
✅ Single model training on full data completed successfully
💾 Saved 26_global_avg_model_PLSRegression_79.pkl to results\synthetic_test_dataset\config_production_demo_facf04\26_global_avg_model_PLSRegression_79.pkl
💾 Saved 26_predictions_global_avg_model_80.csv to results\synthetic_test_dataset\config_production_demo_facf04\26_predictions_global_avg_model_80.csv
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
✅ Pipeline config_production_demo_facf04 completed successfully on dataset synthetic_test_dataset
✅ Production approach complet

In [15]:
# Use Case 3: Research/Academic
print("\n🔬 RESEARCH: Maximum rigor")

research_config = {
    "pipeline": [
        ShuffleSplit(n_splits=3, test_size=.25),
        {
            "model": PLSRegression(),
            "finetune_params": {
                "cv_mode": "nested",                # 🎓 Most rigorous CV
                "inner_cv": 2,                      # 🔄 Inner folds (small for demo)
                "param_strategy": "global_average", # 🌍 Unbiased parameter selection
                "n_trials": 4,                      # 📊 Fewer trials due to high cost
                "verbose": 1,
                "model_params": {
                    "n_components": ("int", 1, 8)
                },
                "train_params": {"verbose": 0}
            }
        }
    ]
}

start = time.time()
config = PipelineConfigs(research_config, "research_demo")
result, _, _ = runner.run(config, data)
research_time = time.time() - start

print(f"✅ Research approach completed in {research_time:.1f}s")
print("💡 Use for: Academic publications, unbiased evaluation")

# Time comparison
print(f"\n⏱️ Use Case Time Comparison:")
print(f"  Prototyping: {prototype_time:.1f}s (fastest)")
print(f"  Production:  {production_time:.1f}s (balanced)")
print(f"  Research:    {research_time:.1f}s (most rigorous)")


🔬 RESEARCH: Maximum rigor
🚀 Starting pipeline config_research_demo_4d2c11 on dataset synthetic_test_dataset
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
🔄 Running 2 steps in sequential mode
🔷 Step 27: ShuffleSplit(n_splits=3, test_size=0.25)
🔹 Executing controller CrossValidatorController with operator ShuffleSplit
Generated 3 folds.
💾 Saved 27_folds_ShuffleSplit.csv to results\synthetic_test_dataset\config_research_demo_4d2c11\27_folds_ShuffleSplit.csv
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
🔷 Step 28: (finetune) PLSRegression()
🔹 Executing controller SklearnModelController with operator PLSRegression
🔍 Nested CV: 3 outer folds with inne

## 6. Summary and Recommendations

**Quick Reference Guide:**

### CV Modes:
- **`simple`**: Fast prototyping, limited rigor
- **`per_fold`**: Standard choice, good balance ⭐
- **`nested`**: Research/academic, maximum rigor
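
The extra cost and rigor of `nested` come from running a separate search inside each outer training set. The sketch below shows the structure of textbook nested CV in plain Python; it is not NIRS4ALL's code. The `shuffle_split` and `nested_splits` helpers are hypothetical, the inner splitting scheme is an assumption, and the sizes (56 samples, 3 outer folds, 2 inner folds, 4 trials) are chosen to mirror the research demo and the 42/14 shapes logged earlier.

```python
import random


def shuffle_split(indices, n_splits, test_size, rng):
    """Yield (train, test) index lists from independent random shuffles."""
    n_test = max(1, round(test_size * len(indices)))
    for _ in range(n_splits):
        shuffled = indices[:]
        rng.shuffle(shuffled)
        yield shuffled[n_test:], shuffled[:n_test]


def nested_splits(n_samples, n_outer, inner_cv, test_size, rng):
    """Build outer folds, each with inner folds drawn only from its training part."""
    plan = []
    all_indices = list(range(n_samples))
    for outer_train, outer_test in shuffle_split(all_indices, n_outer, test_size, rng):
        inner = list(shuffle_split(outer_train, inner_cv, test_size, rng))
        plan.append((outer_train, outer_test, inner))
    return plan


rng = random.Random(0)
plan = nested_splits(n_samples=56, n_outer=3, inner_cv=2, test_size=0.25, rng=rng)

# Every trial is fitted once per inner fold; each outer fold then refits once.
n_trials = 4
search_fits = sum(len(inner) for _, _, inner in plan) * n_trials
total_fits = search_fits + len(plan)
print(f"search fits: {search_fits}, total fits: {total_fits}")
```

Because the inner folds never contain outer test samples, the outer test score is not influenced by the hyperparameter search, which is the reason this mode is the most rigorous and also the slowest.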

### Parameter Strategies:
- **`per_fold_best`**: Default, each fold optimized individually
- **`global_best`**: Consistent parameters across folds
- **`global_average`**: One parameter set selected by its average score across folds; most generalizable ⭐
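
The difference between per-fold and averaged selection can be shown on a small table of validation errors. The sketch below is an illustration under stated assumptions: the error values are made up, lower is taken to be better (as in the trial logs above), and the two helper functions are hypothetical rather than NIRS4ALL functions. `global_best` is left out because this notebook does not spell out how it picks its single parameter set.

```python
from statistics import mean

# Hypothetical validation errors (lower is better): errors[fold][n_components].
errors = {
    "fold0": {2: 0.75, 4: 0.14, 6: 0.07},
    "fold1": {2: 0.70, 4: 0.10, 6: 0.12},
    "fold2": {2: 0.80, 4: 0.09, 6: 0.15},
}


def per_fold_best(errors):
    """Pick the lowest-error candidate separately for every fold."""
    return {fold: min(table, key=table.get) for fold, table in errors.items()}


def global_average(errors):
    """Pick the single candidate with the lowest error averaged over all folds."""
    candidates = next(iter(errors.values())).keys()
    averaged = {c: mean(table[c] for table in errors.values()) for c in candidates}
    return min(averaged, key=averaged.get), averaged


print("per_fold_best:", per_fold_best(errors))
best, averaged = global_average(errors)
print("global_average:", best, {c: round(v, 4) for c, v in averaged.items()})
```

In this made-up table the first fold prefers 6 components on its own, while averaging over the folds favors 4, which is the kind of disagreement the averaged strategy is meant to smooth out.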

### Full Training Option:
- **`use_full_train_for_final: True`**: Single model, more training data ⭐
- **`use_full_train_for_final: False`**: Multiple models, traditional approach

### Recommended Combinations:
1. **Prototyping**: `simple` + `per_fold_best`
2. **Production**: `per_fold` + `global_average` + `use_full_train_for_final: True` ⭐
3. **Research**: `nested` + `global_average`

### Parameter Types:
- **Integer**: `("int", min_val, max_val)`
- **Float**: `("float", min_val, max_val)`
- **Categorical**: `[option1, option2, option3]`

In [16]:
# Final demonstration: The recommended production setup
print("🏆 RECOMMENDED SETUP: Production-ready configuration")

recommended_config = {
    "pipeline": [
        ShuffleSplit(n_splits=3, test_size=.25),
        {
            "model": PLSRegression(),
            "finetune_params": {
                "cv_mode": "per_fold",               # ⚖️ Good balance
                "param_strategy": "global_average",  # 🌍 Best generalization
                "use_full_train_for_final": True,   # 🎯 Single deployment model
                "n_trials": 10,                     # 📊 Thorough optimization
                "verbose": 1,
                "model_params": {
                    "n_components": ("int", 1, 15)
                },
                "train_params": {"verbose": 0}
            }
        }
    ]
}

print("Configuration:", recommended_config["pipeline"][1]["finetune_params"])

start = time.time()
config = PipelineConfigs(recommended_config, "recommended_demo")
result, _, _ = runner.run(config, data)
elapsed = time.time() - start

print(f"\n✅ Recommended setup completed in {elapsed:.1f}s")
print(f"📊 Generated {len(result._predictions)} prediction sets")

if result._predictions:
    keys = result._predictions.list_keys()
    print(f"🔑 Final prediction key: {keys[0] if keys else 'None'}")

    if keys:
        key_parts = keys[0].split('_', 3)
        if len(key_parts) >= 4:
            pred_data = result._predictions.get_prediction_data(*key_parts)
            if pred_data:
                y_true = pred_data['y_true'].flatten()
                y_pred = pred_data['y_pred'].flatten()
                rmse = np.sqrt(mean_squared_error(y_true, y_pred))
                r2 = r2_score(y_true, y_pred)

                print(f"\n🎯 Final Model Performance:")
                print(f"  RMSE: {rmse:.4f}")
                print(f"  R²:   {r2:.4f}")
                print(f"  Test samples: {len(y_true)}")

print(f"\n🎉 Demo completed! You've seen all the finetuning strategies in action.")
print(f"💡 Try different combinations for your specific use case.")



🏆 RECOMMENDED SETUP: Production-ready configuration
Configuration: {'cv_mode': 'per_fold', 'param_strategy': 'global_average', 'use_full_train_for_final': True, 'n_trials': 10, 'verbose': 1, 'model_params': {'n_components': ('int', 1, 15)}, 'train_params': {'verbose': 0}}
🚀 Starting pipeline config_recommended_demo_7b54c0 on dataset synthetic_test_dataset
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
🔄 Running 2 steps in sequential mode
🔷 Step 29: ShuffleSplit(n_splits=3, test_size=0.25)
🔹 Executing controller CrossValidatorController with operator ShuffleSplit
Generated 3 folds.
💾 Saved 29_folds_ShuffleSplit.csv to results\synthetic_test_dataset\config_recommended_demo_7b54c0\29_folds_ShuffleSplit.csv
----------------------------------------------------------------------------------------------------------

