# Tutorial 2: Advanced NIRS Analysis

Welcome to the advanced NIRS4All tutorial! This guide covers sophisticated techniques for professional NIRS data analysis, including multi-source analysis, hyperparameter optimization, and advanced visualizations.

## What You'll Master
1. **Multi-Source Analysis** - Combine several spectral sources measured on the same samples
2. **Hyperparameter Optimization** - Automated model tuning with Optuna
3. **Configuration Generation** - Advanced pipeline customization
4. **Advanced Visualizations** - Professional-grade charts and analysis
5. **Neural Networks** - Deep learning for NIRS data
6. **Everything Together** - Complex real-world workflows

## Prerequisites
- Completion of Tutorial 1 (Beginner's Guide)
- Understanding of cross-validation and model evaluation
- Basic knowledge of hyperparameter tuning concepts

Let's dive into advanced NIRS analysis! 🚀

## Part 1: Multi-Source Data Analysis

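Multi-source analysis refers to a dataset whose samples are described by several feature sources, that is, several `train_x`/`test_x` files for the same samples. This is different from working with multiple datasets: here we have one dataset, and the models still predict a single target. The sample dataset used below has 3 sources and 1 target, as the run log in Step 1.3 shows.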
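
The run log in Step 1.3 reports three feature sources and one target for `sample_data/multi`. As a rough illustration of that layout (not NIRS4All's internal representation), the sketch below builds three feature blocks that share the same samples and concatenates them along the feature axis. The array sizes, the random data, and the choice of concatenation are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 6

# Three hypothetical sources describing the same samples, row for row.
sources = [rng.normal(size=(n_samples, n_features)) for n_features in (4, 4, 3)]
y = rng.normal(size=n_samples)  # a single target value per sample

# Every source must have one row per sample.
assert all(block.shape[0] == n_samples for block in sources)

# One simple way to feed a single model: concatenate along the feature axis.
X_concat = np.hstack(sources)
print(X_concat.shape)  # (6, 11)
```

Concatenation is only one option; a framework may also keep the sources separate and preprocess each one independently.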

### Step 1.1: Import Advanced Libraries

In [6]:
# Standard library imports
import os
import numpy as np
from pathlib import Path
import matplotlib.pyplot as plt

# Third-party imports
from sklearn.cross_decomposition import PLSRegression
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import ShuffleSplit, RepeatedKFold
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVR

# NIRS4All imports
from nirs4all.dataset import DatasetConfigs
from nirs4all.dataset.predictions import Predictions
from nirs4all.dataset.prediction_analyzer import PredictionAnalyzer
from nirs4all.operators.models.cirad_tf import nicon, customizable_nicon
from nirs4all.operators.transformations import (
    Gaussian, SavitzkyGolay, StandardNormalVariate, Haar,
    MultiplicativeScatterCorrection, Detrend, FirstDerivative
)
from nirs4all.pipeline import PipelineConfigs, PipelineRunner

# Enable visual feedback
os.environ['DISABLE_EMOJIS'] = '0'

print("✅ Advanced libraries imported successfully!")
print("🧠 Neural network models available")
print("🔬 Multiple spectral preprocessing techniques loaded")

✅ Advanced libraries imported successfully!
🧠 Neural network models available
🔬 Multiple spectral preprocessing techniques loaded


### Step 1.2: Configure Multi-Source Pipeline

Let's build a pipeline that combines several preprocessing strategies with a diverse set of models:

In [None]:
# Advanced multi-source pipeline
multi_source_pipeline = [
    # Advanced scaling with custom range
    MinMaxScaler(feature_range=(0.1, 0.8)),

    # Comprehensive feature augmentation
    {"feature_augmentation": [
        MultiplicativeScatterCorrection(),
        Gaussian(),
        StandardNormalVariate(),
        SavitzkyGolay(),
        Haar(),
        [MultiplicativeScatterCorrection(), Gaussian()],
        [MultiplicativeScatterCorrection(), StandardNormalVariate()],
        [StandardNormalVariate(), SavitzkyGolay()],
        [Detrend(), FirstDerivative()],
    ]},

    # Robust cross-validation
    ShuffleSplit(n_splits=5, test_size=0.2, random_state=42),
    {"y_processing": MinMaxScaler()},

    # Diverse model ensemble
    {"model": PLSRegression(15), "name": "PLS-15"},
    {"model": PLSRegression(25), "name": "PLS-25"},
    {"model": ElasticNet(alpha=0.1), "name": "ElasticNet"},
    {"model": RandomForestRegressor(n_estimators=10, max_depth=5), "name": "RandomForest"},

    # Neural network with custom training parameters
    {
        "model": nicon,
        "name": "DeepNIRS",
        "train_params": {
            "epochs": 20,
            "patience": 20,
            "batch_size": 64,
            "verbose": 0
        }
    }
]

print("🔧 Multi-source pipeline configured with:")
print("   • 9 preprocessing strategies (including combinations)")
print("   • 5 models across 4 model types")
print("   • 5 shuffle-split iterations (80/20)")
print("   • Neural network with custom training")
print(f"   • Total configurations: 5 models x 9 preprocessings = {5 * 9}")

🔧 Multi-source pipeline configured with:
   • 9 preprocessing strategies (including combinations)
   • 5 models across 4 model types
   • 5 shuffle-split iterations (80/20)
   • Neural network with custom training
   • Total configurations: 5 models x 9 preprocessings = 45
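
The pipeline above validates with `ShuffleSplit` rather than K-fold splitting. As a small self-contained sketch (the 100 placeholder samples are an assumption for illustration), the snippet below shows what that implies: each iteration draws an independent random 80/20 partition, so test sets from different iterations can overlap.

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

X = np.arange(100).reshape(-1, 1)  # 100 placeholder samples
splitter = ShuffleSplit(n_splits=5, test_size=0.2, random_state=42)

test_sets = []
for train_idx, test_idx in splitter.split(X):
    # Each iteration is an independent random 80/20 partition.
    assert len(train_idx) == 80 and len(test_idx) == 20
    assert set(train_idx).isdisjoint(test_idx)
    test_sets.append(set(test_idx))

# Unlike K-fold, test sets from different iterations may share samples,
# and some samples may never land in any test set.
covered = set().union(*test_sets)
print(f"{len(covered)} of {len(X)} samples appear in at least one test set")
```

If every sample must be tested exactly once, `KFold` or `RepeatedKFold` would be the usual alternative.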


### Step 1.3: Run Multi-Source Analysis

In [10]:
# Configure multi-source dataset (several feature sources, one target)
multi_source_data = 'sample_data/multi'  # Single dataset with multiple train_x/test_x files
multi_config = PipelineConfigs(multi_source_pipeline, "Advanced_MultiSource")
dataset_config = DatasetConfigs(multi_source_data)

print("🏃‍♂️ Running multi-source analysis...")
print("⚠️  This may take several minutes due to neural network training")

runner = PipelineRunner(save_files=False, verbose=1)
multi_predictions, predictions_per_dataset = runner.run(multi_config, dataset_config)

print("✅ Multi-source analysis completed!")
print(f"📊 Total predictions: {len(multi_predictions)}")
print(f"📁 Multi-source results: {len(predictions_per_dataset)}")

📊 Multiple train_x files found for sample_data/multi: 3 sources detected.
📊 Multiple test_x files found for sample_data/multi: 3 sources detected.
🏃‍♂️ Running multi-source analysis...
⚠️  This may take several minutes due to neural network training
🚀 Starting Nirs4all run(s) with 1 pipeline on 1 dataset (1 total runs).
📊 Dataset: multi (regression)
Features (samples=189, sources=3):
- Source 0: (189, 1, 2151), processings=['raw'], min=-0.265, max=1.436, mean=0.466, var=0.149)
- Source 1: (189, 1, 2151), processings=['raw'], min=-0.265, max=1.436, mean=0.466, var=0.149)
- Source 2: (189, 1, 2151), processings=['raw'], min=-0.265, max=1.436, mean=0.466, var=0.149)
Targets: (samples=189, targets=1, processings=['numeric'])
- numeric: min=1.33, max=128.31, mean=30.779
Indexes:
- "train", ['raw']: 130 samples
- "test", ['raw']: 59 samples
🚀 Starting pipeline Advanced_MultiSource_e90c40 on dataset multi

### Step 1.4: Analyze Multi-Source Results

Let's examine which models perform best on the multi-source dataset:

In [None]:
# Analyze multi-source results
print("📊 Multi-Source Analysis Results:")
print("=" * 60)

# Per-dataset results (a single entry here, since one dataset was configured)
if predictions_per_dataset:
    for source_name, source_result in predictions_per_dataset.items():
        source_predictions = source_result['run_predictions']
        top_3 = source_predictions.top_k(3, 'rmse')

        print(f"\n🎯 Multi-Source Data: {source_name}")
        print(f"   Total predictions: {len(source_predictions)}")
        print("   Top 3 models:")

        for idx, model in enumerate(top_3):
            preprocessing = model['preprocessings'] if model['preprocessings'] else 'None'
            print(f"     {idx+1}. {model['model_name']} | RMSE: {model['rmse']:.4f} | R²: {model['r2']:.4f}")
            print(f"        Preprocessing: {preprocessing}")
            # Report the prediction shape when it has more than one dimension
            if hasattr(model.get('y_pred', []), 'shape') and len(model['y_pred'].shape) > 1:
                print(f"        Target dimensions: {model['y_pred'].shape}")

# Overall ranking across all predictions
print("\n🏆 Best Models Overall:")
overall_top = multi_predictions.top_k(5, 'rmse')
for idx, model in enumerate(overall_top):
    print(f"{idx+1}. {model['model_name']} | RMSE: {model['rmse']:.4f} | R²: {model['r2']:.4f}")
    if 'dataset_name' in model:
        print(f"    Source: {model['dataset_name']}")
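
The cell above relies on `top_k(n, 'rmse')` to rank predictions. As a hedged illustration of that kind of ranking (not the `Predictions` implementation), the sketch below selects the records with the smallest error metric from a list of dictionaries. The records and their values are hypothetical, and the helper name `top_k_by` is introduced only for this example.

```python
import heapq

# Hypothetical prediction records with the same keys the tutorial reads.
records = [
    {"model_name": "PLS-10", "rmse": 0.142, "r2": 0.91},
    {"model_name": "PLS-20", "rmse": 0.128, "r2": 0.93},
    {"model_name": "ElasticNet", "rmse": 0.201, "r2": 0.82},
    {"model_name": "RandomForest", "rmse": 0.173, "r2": 0.87},
]

def top_k_by(records, k, metric):
    """Return the k records with the smallest value of an error metric."""
    return heapq.nsmallest(k, records, key=lambda rec: rec[metric])

for rank, rec in enumerate(top_k_by(records, 2, "rmse"), start=1):
    print(f"{rank}. {rec['model_name']} | RMSE: {rec['rmse']:.4f}")
# 1. PLS-20 | RMSE: 0.1280
# 2. PLS-10 | RMSE: 0.1420
```

For score metrics such as R², where larger is better, `heapq.nlargest` would be the matching choice.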


### Step 1.5: Multi-Source Visualizations

In [None]:
# Create multi-source visualizations
multi_analyzer = PredictionAnalyzer(multi_predictions)

# 1. Performance across preprocessing methods
print("📊 Creating visualization 1/3: Models vs Preprocessing")
fig1 = multi_analyzer.plot_variable_heatmap(
    x_var="model_name",
    y_var="preprocessings",
    metric='rmse'
)
plt.title("Multi-Source Performance: Models vs Preprocessing")

# 2. Top models comparison
print("📊 Creating visualization 2/3: Top Models Comparison")
fig2 = multi_analyzer.plot_top_k_comparison(k=8, metric='rmse')
plt.title("Top 8 Multi-Source Models")

# 3. Performance distribution
print("📊 Creating visualization 3/3: Performance Distribution")
fig3 = multi_analyzer.plot_variable_candlestick(
    filters={"partition": "test"},
    variable="model_name"
)
plt.title("Multi-Source Model Performance Distribution")

plt.show()
print("🎨 Multi-source visualizations completed!")

## Part 2: Hyperparameter Optimization with Optuna

NIRS4All integrates with Optuna for sophisticated hyperparameter optimization. Let's optimize PLS and neural network models:

### Step 2.1: Configure Hyperparameter Optimization

In [None]:
# Hyperparameter optimization pipeline
optimization_pipeline = [
    "chart_2d",
    MinMaxScaler(),
    {"y_processing": MinMaxScaler()},

    # Feature augmentation for optimization
    {"feature_augmentation": {
        "_or_": [
            StandardNormalVariate(),
            SavitzkyGolay(),
            MultiplicativeScatterCorrection(),
            Gaussian()
        ],
        "size": [1, 2],
        "count": 4
    }},

    ShuffleSplit(n_splits=3, test_size=0.25),

    # PLS with hyperparameter optimization
    {
        "model": PLSRegression(),
        "name": "OptimizedPLS",
        "finetune_params": {
            "n_trials": 30,
            "verbose": 2,
            "approach": "single",
            "eval_mode": "best",
            "sample": "tpe",  # Tree-structured Parzen Estimator
            "model_params": {
                'n_components': ('int', 5, 30),
            }
        }
    },

    # Neural network with hyperparameter optimization
    {
        "model": customizable_nicon,
        "name": "OptimizedNeuralNet",
        "finetune_params": {
            "n_trials": 20,
            "verbose": 2,
            "sample": "hyperband",
            "approach": "single",
            "model_params": {
                "filters_1": [16, 32, 64, 128],
                "filters_2": [16, 32, 64, 128],
                "filters_3": [16, 32, 64, 128]
            },
            "train_params": {
                "epochs": 20,
                "verbose": 0
            }
        },
        "train_params": {
            "epochs": 100,
            "patience": 15,
            "verbose": 0
        }
    }
]

# Add baseline models for comparison
for n_comp in [10, 15, 20]:
    optimization_pipeline.append({
        "model": PLSRegression(n_components=n_comp),
        "name": f"Baseline-PLS-{n_comp}"
    })

print("🎯 Hyperparameter optimization pipeline configured:")
print("   • PLS optimization: 30 trials (n_components: 5-30)")
print("   • Neural net optimization: 20 trials (filter sizes)")
print("   • 3 baseline PLS models for comparison")
print("   • 4 preprocessing combinations")
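
The tutorial does not spell out how the `_or_` generator expands. One plausible reading, assumed here and not confirmed by the source, is that it forms every combination of the listed options with an allowed `size` and keeps `count` of them. The sketch below illustrates that reading with strings standing in for the transformer objects.

```python
import random
from itertools import combinations

options = ["SNV", "SavGol", "MSC", "Gaussian"]  # stand-ins for transformer objects
sizes = [1, 2]
count = 4

# Every subset of the options with an allowed size: 4 singles + 6 pairs = 10.
candidates = [combo for size in sizes for combo in combinations(options, size)]

# Keep `count` of them; a seeded sample keeps the sketch reproducible.
selected = random.Random(0).sample(candidates, count)
print(len(candidates), "candidates,", len(selected), "selected")
```

Under this assumption, `count` caps the number of preprocessing variants, which is consistent with the "4 preprocessing combinations" printed above.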

### Step 2.2: Run Hyperparameter Optimization

In [None]:
# Run optimization
opt_config = PipelineConfigs(optimization_pipeline, "Advanced_Optimization")
dataset_config = DatasetConfigs('sample_data/regression')

print("🎯 Starting hyperparameter optimization...")
print("⚠️  This will take several minutes - Optuna is testing many configurations")
print("📊 Watch the verbose output to see optimization progress")

runner = PipelineRunner(save_files=True, verbose=1)  # Save optimized models
opt_predictions, _ = runner.run(opt_config, dataset_config)

print(f"✅ Hyperparameter optimization completed!")
print(f"🔍 Total configurations tested: {len(opt_predictions)}")

### Step 2.3: Analyze Optimization Results

In [None]:
# Compare optimized vs baseline models
opt_top = opt_predictions.top_k(8, 'rmse')

print("🏆 Top Models After Optimization:")
print("=" * 80)

optimized_models = []
baseline_models = []

for idx, model in enumerate(opt_top):
    model_type = "🎯 OPTIMIZED" if "Optimized" in model['model_name'] else "📊 Baseline"
    preprocessing = model['preprocessings'] if model['preprocessings'] else 'None'

    print(f"{idx+1}. {model_type} | {model['model_name']}")
    print(f"    RMSE: {model['rmse']:.4f} | R²: {model['r2']:.4f} | MAE: {model['mae']:.4f}")
    print(f"    Preprocessing: {preprocessing}")
    print()

    if "Optimized" in model['model_name']:
        optimized_models.append(model)
    else:
        baseline_models.append(model)

# Calculate improvement
if optimized_models and baseline_models:
    best_optimized = min(optimized_models, key=lambda x: x['rmse'])
    best_baseline = min(baseline_models, key=lambda x: x['rmse'])

    improvement = ((best_baseline['rmse'] - best_optimized['rmse']) / best_baseline['rmse']) * 100
    print(f"📈 Optimization Improvement: {improvement:.2f}% RMSE reduction")
    print(f"   Best Optimized: {best_optimized['rmse']:.4f} RMSE")
    print(f"   Best Baseline: {best_baseline['rmse']:.4f} RMSE")
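
The improvement figure above is a relative RMSE reduction. A tiny worked example of the same formula, with made-up RMSE values and a hypothetical helper name, shows how to read its sign:

```python
def rmse_reduction_percent(baseline_rmse, optimized_rmse):
    """Relative RMSE reduction of the optimized model, in percent."""
    if baseline_rmse <= 0:
        raise ValueError("baseline_rmse must be positive")
    return (baseline_rmse - optimized_rmse) / baseline_rmse * 100

print(rmse_reduction_percent(2.0, 1.5))  # 25.0
print(rmse_reduction_percent(2.0, 2.5))  # -25.0 (the "optimized" model is worse)
```

A negative value therefore means the tuned model did not beat the best baseline in the compared set.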

### Step 2.4: Optimization Visualizations

In [None]:
# Create optimization-focused visualizations
opt_analyzer = PredictionAnalyzer(opt_predictions)

# 1. Optimized vs Baseline comparison
print("📊 Creating optimization visualizations...")

fig1 = opt_analyzer.plot_top_k_comparison(k=8, metric='rmse')
plt.title("Hyperparameter Optimization Results")

# 2. Model performance heatmap
fig2 = opt_analyzer.plot_variable_heatmap(
    x_var="model_name",
    y_var="preprocessings",
    metric='rmse',
    best_only=False
)
plt.title("Optimization: Models vs Preprocessing")

plt.show()
print("🎨 Optimization visualizations completed!")

## Part 3: Custom Models and Transformers

NIRS4All allows you to integrate custom models and transformers into your pipelines. Let's create custom components:

### Step 3.1: Create a Custom Transformer

In [None]:
from sklearn.base import BaseEstimator, TransformerMixin
import numpy as np

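# Mixins go before BaseEstimator, as scikit-learn recommends
class CustomSpectralNormalizer(TransformerMixin, BaseEstimator):
    """
    Custom transformer that normalizes each spectrum by its area,
    maximum value, or L2 norm (selected with `method`)
    """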

    def __init__(self, method='area'):
        self.method = method

    def fit(self, X, y=None):
        # Stateless transformer: only record the feature count,
        # which get_feature_names_out() relies on
        self.n_features_in_ = np.asarray(X).shape[1]
        return self

    def transform(self, X):
        X = np.array(X)

        if self.method == 'area':
            # Normalize by area under the curve
            areas = np.trapz(X, axis=1, dx=1.0)
            # Avoid division by zero
            areas = np.where(areas == 0, 1, areas)
            X_normalized = X / areas[:, np.newaxis]

        elif self.method == 'max':
            # Normalize by maximum value
            max_vals = np.max(X, axis=1)
            max_vals = np.where(max_vals == 0, 1, max_vals)
            X_normalized = X / max_vals[:, np.newaxis]

        elif self.method == 'l2':
            # L2 normalization
            norms = np.linalg.norm(X, axis=1)
            norms = np.where(norms == 0, 1, norms)
            X_normalized = X / norms[:, np.newaxis]

        else:
            raise ValueError(f"Unknown normalization method: {self.method}")

        return X_normalized

    def get_feature_names_out(self, input_features=None):
        # Required for newer sklearn versions
        if input_features is None:
            return np.array([f"feature_{i}" for i in range(self.n_features_in_)])
        return input_features

print("✅ Custom transformer created: CustomSpectralNormalizer")
print("   • Methods: 'area', 'max', 'l2'")
print("   • Follows sklearn TransformerMixin pattern")
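
As a quick check of the scikit-learn conventions a custom transformer depends on, the sketch below defines a deliberately minimal transformer (a hypothetical `RowL2Normalizer`, not part of NIRS4All), clones it, and uses it inside a standard scikit-learn pipeline on random data. It lists the mixin before `BaseEstimator`, which is the order the scikit-learn developer guide recommends.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin, clone
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline

class RowL2Normalizer(TransformerMixin, BaseEstimator):
    """Hypothetical minimal transformer: scales each spectrum to unit L2 norm."""

    def __init__(self, eps=1e-12):
        self.eps = eps

    def fit(self, X, y=None):
        self.n_features_in_ = np.asarray(X).shape[1]
        return self

    def transform(self, X):
        X = np.asarray(X, dtype=float)
        norms = np.linalg.norm(X, axis=1, keepdims=True)
        return X / np.maximum(norms, self.eps)

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 8))
y = rng.normal(size=20)

# clone() relies on get_params(), which BaseEstimator derives from __init__.
copy = clone(RowL2Normalizer(eps=1e-9))
print(copy.get_params())  # {'eps': 1e-09}

# The transformer drops into a standard scikit-learn pipeline.
pipe = make_pipeline(RowL2Normalizer(), Ridge(alpha=1.0)).fit(X, y)
X_norm = pipe[0].transform(X)
print(np.allclose(np.linalg.norm(X_norm, axis=1), 1.0))  # True
```

Keeping `__init__` free of logic (it only stores its arguments) is what makes `clone` and parameter search work without extra code.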

### Step 3.2: Create a Custom Model

In [None]:
from sklearn.base import BaseEstimator, RegressorMixin
from sklearn.linear_model import Ridge
from sklearn.decomposition import PCA

# Mixins go before BaseEstimator, as scikit-learn recommends
class CustomPCARegressor(RegressorMixin, BaseEstimator):
    """
    Custom model that combines PCA dimensionality reduction with Ridge regression
    """

    def __init__(self, n_components=10, ridge_alpha=1.0):
        self.n_components = n_components
        self.ridge_alpha = ridge_alpha

    def fit(self, X, y):
        # Record the number of outputs (1 unless y is 2D with several columns)
        if hasattr(y, 'shape') and len(y.shape) > 1 and y.shape[1] > 1:
            self.n_outputs_ = y.shape[1]
        else:
            self.n_outputs_ = 1

        # Initialize and fit PCA
        self.pca_ = PCA(n_components=self.n_components)
        X_pca = self.pca_.fit_transform(X)

        # Initialize and fit Ridge regression
        self.ridge_ = Ridge(alpha=self.ridge_alpha)
        self.ridge_.fit(X_pca, y)

        return self

    def predict(self, X):
        # Transform with PCA then predict with Ridge
        X_pca = self.pca_.transform(X)
        return self.ridge_.predict(X_pca)

    def get_params(self, deep=True):
        return {
            'n_components': self.n_components,
            'ridge_alpha': self.ridge_alpha
        }

    def set_params(self, **params):
        for key, value in params.items():
            setattr(self, key, value)
        return self

print("✅ Custom model created: CustomPCARegressor")
print("   • Combines PCA + Ridge regression")
print("   • Hyperparameters: n_components, ridge_alpha")
print("   • Follows sklearn RegressorMixin pattern")
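
For comparison, the same PCA-then-Ridge idea can be written with scikit-learn's built-in `make_pipeline`. The sketch below, on synthetic data chosen only for illustration, fits the two steps by hand (mirroring the `fit`/`predict` logic of the class above) and checks that a pipeline gives matching predictions.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 30))
y = X[:, 0] - 2.0 * X[:, 1] + 0.05 * rng.normal(size=60)

# Manual two-step version: fit PCA, then fit Ridge on the projected data.
pca = PCA(n_components=10).fit(X)
ridge = Ridge(alpha=1.0).fit(pca.transform(X), y)
manual_pred = ridge.predict(pca.transform(X))

# Equivalent built-in composition.
pipe = make_pipeline(PCA(n_components=10), Ridge(alpha=1.0)).fit(X, y)
pipe_pred = pipe.predict(X)

print(np.allclose(manual_pred, pipe_pred))  # True
```

A dedicated class is still useful when the wrapped model needs a flat set of hyperparameters (`n_components`, `ridge_alpha`) under a single estimator name.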

### Step 3.3: Test Custom Components in Pipeline

In [None]:
# Create a pipeline with custom components
custom_pipeline = [
    MinMaxScaler(),
    {"y_processing": MinMaxScaler()},

    # Use our custom transformer
    CustomSpectralNormalizer(method='area'),

    # Add some standard preprocessing for comparison
    {"feature_augmentation": {
        "_or_": [
            StandardNormalVariate(),
            SavitzkyGolay(),
            CustomSpectralNormalizer(method='max'),  # Another custom transformer
            CustomSpectralNormalizer(method='l2')
        ],
        "size": [1, 2],
        "count": 3
    }},

    ShuffleSplit(n_splits=3, test_size=0.25),

    # Standard models
    {"model": PLSRegression(15), "name": "StandardPLS"},

    # Our custom model
    {"model": CustomPCARegressor(n_components=12, ridge_alpha=0.1), "name": "CustomPCA-Ridge"},
    {"model": CustomPCARegressor(n_components=20, ridge_alpha=1.0), "name": "CustomPCA-Ridge-L2"},
]

print("🔧 Custom pipeline configured with:")
print("   • Custom spectral normalization transformer")
print("   • Custom PCA+Ridge regression model")
print("   • Mixed with standard NIRS4All components")

### Step 3.4: Run Custom Pipeline

In [None]:
# Run custom pipeline
custom_config = PipelineConfigs(custom_pipeline, "Advanced_Custom")
dataset_config = DatasetConfigs('sample_data/regression')

print("🏃‍♂️ Running pipeline with custom components...")
runner = PipelineRunner(save_files=False, verbose=1)
custom_predictions, _ = runner.run(custom_config, dataset_config)

print(f"✅ Custom pipeline completed!")
print(f"📊 Generated {len(custom_predictions)} predictions")

# Analyze custom results
top_custom = custom_predictions.top_k(5, 'rmse')
print("\n🏆 Top 5 Models (Including Custom):")
for idx, model in enumerate(top_custom):
    model_type = "🔧 CUSTOM" if "Custom" in model['model_name'] else "📊 Standard"
    preprocessing = model['preprocessings'] if model['preprocessings'] else 'None'
    print(f"{idx+1}. {model_type} | {model['model_name']}")
    print(f"    RMSE: {model['rmse']:.4f} | R²: {model['r2']:.4f}")
    print(f"    Preprocessing: {preprocessing}\n")

### Step 3.5: Visualize Custom Component Performance

In [None]:
# Create custom component analysis
custom_analyzer = PredictionAnalyzer(custom_predictions)

# Compare custom vs standard models
fig1 = custom_analyzer.plot_top_k_comparison(k=6, metric='rmse')
plt.title("Custom vs Standard Models Performance")

# Show preprocessing effects including custom transformers
fig2 = custom_analyzer.plot_variable_heatmap(
    x_var="model_name",
    y_var="preprocessings",
    metric='rmse',
    best_only=False
)
plt.title("Custom Components: Models vs Preprocessing")

plt.show()

print("🎨 Custom component visualizations completed!")
print("💡 Custom components successfully integrated into NIRS4All pipeline!")

## Part 4: Advanced Configuration Generation

Let's explore advanced pipeline configuration techniques for complex workflows:

### Step 4.1: Dynamic Pipeline Generation

In [None]:
# Advanced configuration generation function
def create_advanced_pipeline(complexity_level="medium", include_neural_nets=True, custom_preprocessing=None):
    """
    Generate dynamic pipelines based on complexity requirements

    Args:
        complexity_level: 'simple', 'medium', 'complex'
        include_neural_nets: Whether to include deep learning models
        custom_preprocessing: List of custom preprocessing techniques
    """

    # Base pipeline
    pipeline = []

    # Visualization based on complexity
    if complexity_level in ['medium', 'complex']:
        pipeline.append("chart_2d")

    # Scaling
    if complexity_level == 'simple':
        pipeline.append(MinMaxScaler())
    else:
        pipeline.append(MinMaxScaler(feature_range=(0.1, 0.9)))

    # Target processing
    pipeline.append({"y_processing": MinMaxScaler()})

    # Preprocessing based on complexity
    if custom_preprocessing:
        preprocessing_options = custom_preprocessing
    elif complexity_level == 'simple':
        preprocessing_options = [StandardNormalVariate(), SavitzkyGolay()]
    elif complexity_level == 'medium':
        preprocessing_options = [
            StandardNormalVariate(), SavitzkyGolay(),
            MultiplicativeScatterCorrection(), Gaussian()
        ]
    else:  # complex
        preprocessing_options = [
            StandardNormalVariate(), SavitzkyGolay(),
            MultiplicativeScatterCorrection(), Gaussian(),
            Detrend(), FirstDerivative(), Haar()
        ]

    # Feature augmentation configuration
    if complexity_level == 'simple':
        augmentation_config = {"_or_": preprocessing_options, "size": [1], "count": 2}
    elif complexity_level == 'medium':
        augmentation_config = {"_or_": preprocessing_options, "size": [1, 2], "count": 4}
    else:  # complex
        augmentation_config = {"_or_": preprocessing_options, "size": [1, (1, 2), (2, 3)], "count": 6}

    pipeline.append({"feature_augmentation": augmentation_config})

    # Cross-validation
    if complexity_level == 'simple':
        pipeline.append(ShuffleSplit(n_splits=3, test_size=0.25))
    elif complexity_level == 'medium':
        pipeline.append(ShuffleSplit(n_splits=5, test_size=0.2))
    else:  # complex
        pipeline.append(RepeatedKFold(n_splits=5, n_repeats=2, random_state=42))

    # Models based on complexity
    if complexity_level == 'simple':
        models = [
            {"model": PLSRegression(10), "name": "PLS-10"},
            {"model": PLSRegression(15), "name": "PLS-15"}
        ]
    elif complexity_level == 'medium':
        models = [
            {"model": PLSRegression(10), "name": "PLS-10"},
            {"model": PLSRegression(20), "name": "PLS-20"},
            {"model": ElasticNet(), "name": "ElasticNet"},
            {"model": RandomForestRegressor(n_estimators=50), "name": "RandomForest"}
        ]
    else:  # complex
        models = [
            {"model": PLSRegression(15), "name": "PLS-15"},
            {"model": PLSRegression(25), "name": "PLS-25"},
            {"model": ElasticNet(alpha=0.1), "name": "ElasticNet"},
            {"model": RandomForestRegressor(n_estimators=100), "name": "RandomForest"},
            {"model": GradientBoostingRegressor(n_estimators=50), "name": "GradientBoosting"},
            {"model": SVR(kernel='rbf'), "name": "SVR"}
        ]

    # Add neural networks if requested
    if include_neural_nets and complexity_level in ['medium', 'complex']:
        if complexity_level == 'medium':
            models.append({
                "model": nicon,
                "name": "SimpleNeuralNet",
                "train_params": {"epochs": 50, "verbose": 0}
            })
        else:  # complex
            models.append({
                "model": nicon,
                "name": "DeepNeuralNet",
                "train_params": {"epochs": 150, "patience": 20, "verbose": 0}
            })

    pipeline.extend(models)

    return pipeline

# Generate different complexity pipelines
simple_config = create_advanced_pipeline("simple", include_neural_nets=False)
medium_config = create_advanced_pipeline("medium", include_neural_nets=True)
complex_config = create_advanced_pipeline("complex", include_neural_nets=True)

print("🔧 Dynamic pipeline generation completed:")
print(f"   • Simple pipeline: {sum(isinstance(x, dict) and 'model' in x for x in simple_config)} models")
print(f"   • Medium pipeline: {sum(isinstance(x, dict) and 'model' in x for x in medium_config)} models")
print(f"   • Complex pipeline: {sum(isinstance(x, dict) and 'model' in x for x in complex_config)} models")
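
Counting model steps is more reliable when it checks for a `"model"` key than when it searches the step's string form, because a substring test also matches any step whose representation happens to contain the word "model". The toy pipeline below uses strings as stand-ins for the real scaler, splitter, and model objects; the helper name `count_model_steps` is hypothetical.

```python
def count_model_steps(pipeline):
    """Count pipeline steps that are dicts carrying a "model" key."""
    return sum(1 for step in pipeline if isinstance(step, dict) and "model" in step)

# Toy pipeline: strings stand in for real scaler, splitter, and model objects.
toy_pipeline = [
    "chart_2d",
    "MinMaxScaler()",
    {"y_processing": "MinMaxScaler()"},
    {"feature_augmentation": {"_or_": ["model_free_step"], "size": [1], "count": 1}},
    {"model": "PLSRegression(10)", "name": "PLS-10"},
    {"model": "PLSRegression(15)", "name": "PLS-15"},
]

print(count_model_steps(toy_pipeline))                      # 2
print(len([x for x in toy_pipeline if "model" in str(x)]))  # 3 (substring false positive)
```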

### Step 4.2: Run Dynamic Configurations

In [None]:
# Test the medium complexity configuration
print("🏃‍♂️ Testing medium complexity configuration...")

dynamic_config = PipelineConfigs(medium_config, "Advanced_Dynamic")
dataset_config = DatasetConfigs('sample_data/regression')

runner = PipelineRunner(save_files=False, verbose=1)
dynamic_predictions, _ = runner.run(dynamic_config, dataset_config)

print(f"✅ Dynamic configuration completed!")
print(f"📊 Generated {len(dynamic_predictions)} predictions")

# Quick analysis
top_dynamic = dynamic_predictions.top_k(5, 'rmse')
print("\n🏆 Top 5 Models from Dynamic Configuration:")
for idx, model in enumerate(top_dynamic):
    print(f"{idx+1}. {model['model_name']} | RMSE: {model['rmse']:.4f} | R²: {model['r2']:.4f}")
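
To make the ranking above concrete, here is a minimal sketch of how RMSE and R² are computed and how records can be ordered by a metric. It assumes that lower RMSE is better and that each record is a dict with `model_name`, `rmse`, and `r2` keys, as in the loop above. The helper names and the toy values are hypothetical, and this is not the implementation behind `top_k`.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def rank_records(records, metric, k, higher_is_better=False):
    """Return the k best records for a metric (ascending unless higher is better)."""
    return sorted(records, key=lambda r: r[metric], reverse=higher_is_better)[:k]


# Toy reference values and two hypothetical sets of predictions
y_true = [1.0, 2.0, 3.0]
candidates = {"A": [1.0, 2.0, 4.0], "B": [1.5, 2.0, 2.5]}

records = [
    {"model_name": name, "rmse": rmse(y_true, pred), "r2": r2(y_true, pred)}
    for name, pred in candidates.items()
]

best = rank_records(records, "rmse", k=1)[0]
print(f"Best by RMSE: {best['model_name']} | RMSE: {best['rmse']:.4f} | R²: {best['r2']:.4f}")
```

Note the direction of each metric: RMSE is ranked ascending, while R² would need `higher_is_better=True`.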

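## Part 4: Advanced Visualization Techniques

Let's create professional-grade visualizations for comprehensive analysis:

### Step 4.1: Custom Visualization Functions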

In [None]:
def create_comprehensive_analysis(predictions, title="Advanced NIRS Analysis"):
    """
    Create a comprehensive analysis dashboard
    """
    analyzer = PredictionAnalyzer(predictions)

    # Create a large figure with multiple subplots
    fig = plt.figure(figsize=(20, 15))
    fig.suptitle(title, fontsize=16, fontweight='bold')

    # 1. Top models comparison (top left)
    plt.subplot(2, 3, 1)
    analyzer.plot_top_k_comparison(k=6, metric='rmse')
    plt.title("Top 6 Models: RMSE")

    # 2. Model vs preprocessing heatmap (top center)
    plt.subplot(2, 3, 2)
    analyzer.plot_variable_heatmap(
        x_var="model_name",
        y_var="preprocessings",
        metric='rmse',
        best_only=True
    )
    plt.title("Best Models: Preprocessing Heatmap")

    # 3. Performance distribution (top right)
    plt.subplot(2, 3, 3)
    analyzer.plot_variable_candlestick(
        filters={"partition": "test"},
        variable="model_name"
    )
    plt.title("Performance Distribution")

    # 4. R² comparison (bottom left)
    plt.subplot(2, 3, 4)
    analyzer.plot_top_k_comparison(k=6, metric='r2')
    plt.title("Top 6 Models: R²")

    # 5. Full preprocessing heatmap (bottom center)
    plt.subplot(2, 3, 5)
    analyzer.plot_variable_heatmap(
        x_var="model_name",
        y_var="preprocessings",
        metric='rmse',
        best_only=False,
        display_n=False
    )
    plt.title("All Results: Preprocessing Heatmap")

    # 6. Model performance summary (bottom right)
    plt.subplot(2, 3, 6)

    # Get performance statistics
    top_models = predictions.top_k(10, 'rmse')
    model_names = [model['model_name'] for model in top_models]
    rmse_values = [model['rmse'] for model in top_models]
    r2_values = [model['r2'] for model in top_models]

    # Create scatter plot
    plt.scatter(rmse_values, r2_values, s=100, alpha=0.7, c=range(len(rmse_values)), cmap='viridis')
    plt.xlabel('RMSE')
    plt.ylabel('R²')
    plt.title('RMSE vs R² (Top 10 Models)')
    plt.grid(True, alpha=0.3)

    # Add annotations for top 3
    for i in range(min(3, len(top_models))):
        plt.annotate(f'{i+1}', (rmse_values[i], r2_values[i]),
                    xytext=(5, 5), textcoords='offset points', fontweight='bold')

    plt.tight_layout()
    plt.show()

    return fig

print("📊 Advanced visualization functions defined")
print("🎨 Ready to create comprehensive analysis dashboards")

### Step 4.2: Generate Advanced Visualizations

In [None]:
# Create comprehensive analysis for our best dataset
print("🎨 Creating comprehensive analysis dashboard...")

# Use the multi-source predictions for rich visualizations
comprehensive_fig = create_comprehensive_analysis(
    multi_predictions,
    "NIRS4All Advanced Analysis Dashboard"
)

print("✅ Comprehensive dashboard created!")

### Step 4.3: Performance Summary Report

In [None]:
def generate_performance_report(predictions, report_name="Advanced Analysis Report"):
    """
    Generate a detailed performance report
    """
    print(f"\n📋 {report_name}")
    print("=" * 60)

    # Overall statistics
    print(f"Total predictions: {len(predictions)}")

    # Get all metrics
    all_predictions = list(predictions)
    rmse_values = [p['rmse'] for p in all_predictions]
    r2_values = [p['r2'] for p in all_predictions]
    mae_values = [p['mae'] for p in all_predictions]

    print(f"\n📊 Performance Statistics:")
    print(f"   RMSE - Mean: {np.mean(rmse_values):.4f}, Std: {np.std(rmse_values):.4f}, Min: {np.min(rmse_values):.4f}")
    print(f"   R²   - Mean: {np.mean(r2_values):.4f}, Std: {np.std(r2_values):.4f}, Max: {np.max(r2_values):.4f}")
    print(f"   MAE  - Mean: {np.mean(mae_values):.4f}, Std: {np.std(mae_values):.4f}, Min: {np.min(mae_values):.4f}")

    # Model type analysis
    model_types = {}
    for pred in all_predictions:
        model_name = pred['model_name']
        model_type = model_name.split('-')[0] if '-' in model_name else model_name
        if model_type not in model_types:
            model_types[model_type] = []
        model_types[model_type].append(pred['rmse'])

    print(f"\n🔬 Model Type Performance (Average RMSE):")
    for model_type, rmse_list in sorted(model_types.items(), key=lambda x: np.mean(x[1])):
        avg_rmse = np.mean(rmse_list)
        count = len(rmse_list)
        print(f"   {model_type:15s}: {avg_rmse:.4f} (n={count})")

    # Top performers
    top_5 = predictions.top_k(5, 'rmse')
    print(f"\n🏆 Top 5 Performers:")
    for idx, model in enumerate(top_5):
        preprocessing = model['preprocessings'] if model['preprocessings'] else 'None'
        print(f"   {idx+1}. {model['model_name']} | RMSE: {model['rmse']:.4f} | R²: {model['r2']:.4f}")
        print(f"      Preprocessing: {preprocessing}")
        if 'dataset_name' in model:
            print(f"      Dataset: {model['dataset_name']}")

    print("\n" + "=" * 60)

# Generate reports for different analyses
generate_performance_report(multi_predictions, "Multi-Source Analysis Report")
generate_performance_report(opt_predictions, "Hyperparameter Optimization Report")
generate_performance_report(dynamic_predictions, "Dynamic Configuration Report")

## Part 5: Putting It All Together - Complete Workflow

Let's create a comprehensive workflow that combines all advanced techniques:

### Step 5.1: Ultimate NIRS Pipeline

In [None]:
# Ultimate comprehensive pipeline
ultimate_pipeline = [
    # Comprehensive visualization
    "chart_2d",

    # Advanced scaling
    MinMaxScaler(feature_range=(0.05, 0.95)),

    # Ultimate feature augmentation
    {"feature_augmentation": {
        "_or_": [
            StandardNormalVariate(),
            SavitzkyGolay(),
            MultiplicativeScatterCorrection(),
            Gaussian(),
            Detrend(),
            FirstDerivative(),
            [StandardNormalVariate(), SavitzkyGolay()],
            [MultiplicativeScatterCorrection(), Gaussian()],
            [Detrend(), FirstDerivative()],
            [StandardNormalVariate(), MultiplicativeScatterCorrection(), SavitzkyGolay()]
        ],
        "size": [1, 2, 3],
        "count": 8
    }},

    # Robust cross-validation
    RepeatedKFold(n_splits=5, n_repeats=2, random_state=42),
    {"y_processing": MinMaxScaler()},

    # Optimized PLS models
    {
        "model": PLSRegression(),
        "name": "UltimatePLS",
        "finetune_params": {
            "n_trials": 25,
            "verbose": 1,
            "approach": "single",
            "sample": "tpe",
            "model_params": {
                'n_components': ('int', 5, 35),
            }
        }
    },

    # Diverse model ensemble
    {"model": ElasticNet(alpha=0.01), "name": "TunedElasticNet"},
    {"model": RandomForestRegressor(n_estimators=200, max_depth=15), "name": "TunedRandomForest"},
    {"model": GradientBoostingRegressor(n_estimators=100, learning_rate=0.1), "name": "TunedGradientBoosting"},

    # Advanced neural network
    {
        "model": nicon,
        "name": "UltimateDeepNIRS",
        "train_params": {
            "epochs": 200,
            "patience": 25,
            "batch_size": 32,
            "verbose": 0
        }
    }
]

print("🚀 Ultimate NIRS pipeline configured:")
print("   • 10 preprocessing strategies (including a 3-step combination)")
print("   • Hyperparameter-optimized PLS (25 trials)")
print("   • 3 additional tuned models (ElasticNet, RandomForest, GradientBoosting)")
print("   • Advanced neural network with extended training")
print("   • Repeated 5-fold cross-validation (2 repeats)")
print(f"   • Estimated total configurations: 5 models x 8 preprocessing variants = {5 * 8}+")
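
The estimate of 40 configurations can be reproduced with a little arithmetic. The sketch below rests on two assumptions about the `feature_augmentation` keys that this tutorial does not confirm: that `"size"` sets how many alternatives from `"_or_"` are combined, and that `"count"` caps the number of combinations actually evaluated. Treat the result as an order-of-magnitude guide, not as what NIRS4All will report.

```python
from math import comb

n_alternatives = 10   # entries in the "_or_" list of the ultimate pipeline
sizes = [1, 2, 3]     # "size": assumed number of alternatives combined together
count_cap = 8         # "count": assumed cap on the combinations evaluated

# Distinct unordered combinations available for each size
per_size = {k: comb(n_alternatives, k) for k in sizes}
total_combinations = sum(per_size.values())

# Under the assumption above, only `count_cap` combinations are kept
sampled = min(count_cap, total_combinations)

n_models = 5          # PLS, ElasticNet, RandomForest, GradientBoosting, nicon
n_folds = 5 * 2       # RepeatedKFold(n_splits=5, n_repeats=2) yields 10 splits

configurations = sampled * n_models
fold_level_fits = configurations * n_folds

print(f"Combinations per size: {per_size}")
print(f"Available: {total_combinations}, evaluated: {sampled}")
print(f"Configurations: {configurations}, fold-level fits: {fold_level_fits}")
```

The fit count leaves out the 25 tuning trials for PLS and any multiplication by the number of datasets, so the real workload is larger.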

### Step 5.2: Run the Ultimate Analysis

In [None]:
# Configure ultimate analysis
ultimate_config = PipelineConfigs(ultimate_pipeline, "Ultimate_NIRS_Analysis")

# Combine two regression datasets for the final analysis
ultimate_datasets = ['sample_data/regression', 'sample_data/regression_2']
ultimate_dataset_config = DatasetConfigs(ultimate_datasets)

print("🚀 Starting ultimate NIRS analysis...")
print("⚠️  This is the most comprehensive analysis - expect 10-15 minutes")
print("📊 Processing multiple datasets with advanced optimization")

# Run with full capabilities
ultimate_runner = PipelineRunner(save_files=True, verbose=1)
ultimate_predictions, ultimate_per_dataset = ultimate_runner.run(ultimate_config, ultimate_dataset_config)

print(f"🎉 Ultimate analysis completed!")
print(f"📊 Total predictions: {len(ultimate_predictions)}")
print(f"🗂️  Datasets processed: {len(ultimate_per_dataset)}")

### Step 5.3: Ultimate Results Analysis

In [None]:
# Comprehensive ultimate analysis
print("🔍 Analyzing ultimate results...")

# Generate comprehensive report
generate_performance_report(ultimate_predictions, "🚀 ULTIMATE NIRS ANALYSIS REPORT 🚀")

# Create ultimate dashboard
print("\n🎨 Creating ultimate analysis dashboard...")
ultimate_fig = create_comprehensive_analysis(
    ultimate_predictions,
    "🚀 Ultimate NIRS4All Analysis Dashboard 🚀"
)

# Retrieve the ID of the best model so it can be reused for prediction later
ultimate_best = ultimate_predictions.top_k(1, 'rmse')[0]
ultimate_model_id = ultimate_best['id']

print(f"\n💾 Best model saved with ID: {ultimate_model_id}")
print(f"🏆 Ultimate champion: {ultimate_best['model_name']}")
print(f"📈 Performance: RMSE={ultimate_best['rmse']:.4f}, R²={ultimate_best['r2']:.4f}")
print(f"🔬 Preprocessing: {ultimate_best['preprocessings'] if ultimate_best['preprocessings'] else 'None'}")
print(f"📁 Dataset: {ultimate_best.get('dataset_name', 'N/A')}")

### Step 5.4: Final Model Testing

In [None]:
# Test the ultimate model on new data
print("🧪 Testing ultimate model on independent dataset...")

# Use a different dataset for final validation
test_dataset = DatasetConfigs('sample_data/regression_3')

# Make predictions with the ultimate model
final_predictor = PipelineRunner(save_files=False, verbose=1)
final_predictions, _ = final_predictor.predict(ultimate_model_id, test_dataset, verbose=1)

print(f"✅ Final testing completed!")
print(f"📊 Generated {len(final_predictions)} final predictions")
print(f"📈 Prediction range: {final_predictions.min():.3f} to {final_predictions.max():.3f}")
print(f"📊 Mean prediction: {final_predictions.mean():.3f} ± {final_predictions.std():.3f}")

# Final visualization
plt.figure(figsize=(12, 8))

# Prediction distribution
plt.subplot(2, 2, 1)
plt.hist(final_predictions.flatten(), bins=25, alpha=0.7, edgecolor='black', color='skyblue')
plt.title("Final Prediction Distribution")
plt.xlabel("Predicted Values")
plt.ylabel("Frequency")
plt.grid(True, alpha=0.3)

# Box plot
plt.subplot(2, 2, 2)
plt.boxplot(final_predictions.flatten())
plt.title("Final Prediction Statistics")
plt.ylabel("Predicted Values")
plt.grid(True, alpha=0.3)

# Time series style plot
plt.subplot(2, 1, 2)
plt.plot(final_predictions.flatten(), 'o-', alpha=0.7, markersize=4)
plt.title("Final Predictions Sequence")
plt.xlabel("Sample Index")
plt.ylabel("Predicted Value")
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("🎨 Final visualizations completed!")
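
The box plot above summarizes the predictions through their quartiles. As a numeric companion, the sketch below computes the same summary and flags values outside the conventional 1.5 × IQR fences. `boxplot_summary` and `demo_predictions` are hypothetical names with made-up values, used here only to show the calculation.

```python
import statistics


def boxplot_summary(values, whisker=1.5):
    """Quartiles, fences, and values outside the fences (1.5 x IQR rule by default)."""
    data = sorted(values)
    # "inclusive" interpolates linearly between data points
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - whisker * iqr
    upper_fence = q3 + whisker * iqr
    outliers = [v for v in data if v < lower_fence or v > upper_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "lower_fence": lower_fence,
        "upper_fence": upper_fence,
        "outliers": outliers,
    }


# Made-up values for illustration; not output from any model in this tutorial
demo_predictions = [1.0, 2.0, 3.0, 4.0, 5.0, 20.0]

stats = boxplot_summary(demo_predictions)
print(f"Q1={stats['q1']:.2f}, median={stats['median']:.2f}, Q3={stats['q3']:.2f}")
print(f"Values outside the fences: {stats['outliers']}")
```

In the notebook, passing `final_predictions.flatten().tolist()` should give the summary for the real predictions, assuming `final_predictions` is a NumPy array as the plotting code above implies.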

## Summary and Conclusion

🎉 **Congratulations!** You've mastered advanced NIRS analysis with NIRS4All!

### What You've Accomplished:
1. ✅ **Multi-Source Analysis** - Handled multiple datasets simultaneously with sophisticated preprocessing
2. ✅ **Hyperparameter Optimization** - Used Optuna for automated model tuning
3. ✅ **Dynamic Configuration** - Created flexible, complexity-based pipeline generation
4. ✅ **Advanced Visualizations** - Built comprehensive analysis dashboards
5. ✅ **Neural Networks** - Integrated deep learning models with custom training
6. ✅ **Complete Workflow** - Executed end-to-end professional NIRS analysis

### Advanced Techniques Mastered:
- **Complex Feature Augmentation**: Multi-step preprocessing combinations
- **Ensemble Methods**: Multiple model types working together
- **Robust Validation**: Repeated k-fold cross-validation
- **Model Persistence**: Save/load optimized models
- **Performance Analysis**: Comprehensive statistical reporting
- **Professional Visualization**: Multi-panel analysis dashboards

### Key Performance Insights:
- 🔬 **Preprocessing Impact**: Spectral preprocessing can significantly improve model performance
- 🎯 **Optimization Benefits**: Hyperparameter tuning can improve on default settings; compare tuned and untuned results to confirm the gain
- 🧠 **Neural Networks**: Deep learning can perform well given sufficient data and proper training
- 📊 **Multi-Source Robustness**: Validating across datasets helps assess how well a model generalizes
- 🔄 **Pipeline Flexibility**: Dynamic configuration enables rapid experimentation

### Production-Ready Workflow:
You now have the skills to:
- 🚀 Build production-grade NIRS analysis pipelines
- 📈 Optimize models for specific applications
- 🔍 Validate models across multiple datasets
- 📊 Create professional analysis reports
- 💾 Deploy saved models for real-time prediction

### Next Steps for Your Projects:
1. 🎯 **Domain Adaptation**: Customize preprocessing for your specific spectral data
2. 🔬 **Experiment Design**: Use cross-validation strategies appropriate for your study
3. 📊 **Metric Selection**: Choose evaluation metrics that align with your application goals
4. 🤖 **Model Selection**: Balance complexity with interpretability based on your needs
5. 📈 **Continuous Improvement**: Regularly retrain models as new data becomes available

### Advanced Tips:
- 💡 **Start Complex, Simplify**: Begin with comprehensive analysis, then streamline
- 🔄 **Iterate Rapidly**: Use dynamic configurations for quick experimentation
- 📊 **Visualize Everything**: Rich visualizations reveal insights numbers can't
- 💾 **Save Everything**: Model persistence enables reproducible research
- 🔬 **Validate Thoroughly**: Multi-dataset testing gives a more realistic estimate of real-world performance

You're now ready to tackle any NIRS analysis challenge! 🚀🔬📊