# Comprehensive SpectraSet Workflow Demo

This notebook demonstrates a detailed workflow using the `SpectraSet` class, including:
- Custom data generation for NIRS and Raman spectroscopy.
- Sample augmentation with multiple custom transformers.
- Source-specific transformations.
- Feature augmentation by creating new spectral sources.
- Custom grouping of samples.
- Stratified train/test splitting.
- Repeated K-Fold cross-validation.
- Unpacking and inspecting data by groups.

In [1]:
# 1. Setup and Imports
# Standard library imports
import itertools

# Third-party imports
import numpy as np
import pandas as pd
import xarray as xr
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.model_selection import RepeatedKFold, train_test_split
from sklearn.preprocessing import LabelEncoder

# Import the SpectraSet class (ensure nirs4all.data.spectraset is in PYTHONPATH or installed)
# Assuming spectraset.py is in d:\Workspace\ML\NIRS\nirs4all\nirs4all\data\
import sys
sys.path.append('d:\\Workspace\\ML\\NIRS\\nirs4all') # Adjust if necessary
from nirs4all.data.spectraset_grok import SpectraSet

# IPython magic for autoreloading external modules (optional)
%load_ext autoreload
%autoreload 2

print("Imports successful.")

Imports successful.


## 2. Data Generation

We will generate a dataset with 10 samples. Each sample corresponds to an "n" value from 1 to 5 (each n appearing twice).
- **NIRS spectra**: 4 values, calculated as `(n-1) * i` for `i` in `0,1,2,3`.
- **Raman spectra**: 5 values, calculated as `(n-1) * i + 5` for `i` in `0,1,2,3,4`.
- **Target (y)**: The value of `n`.
- **Metadata**:
    - `n_numeric`: `n` (integer).
    - `n_str`: `n` (string).
    - `n_sum_str`: `f"{n}+{n}"` (string, e.g., "1+1").
Initially, each sample has one augmentation (the original spectrum).

In [2]:
# --- Configuration for Data Generation ---
total_samples = 10

# --- Initialize data structures ---
spectra_data_v1 = {'nirs': [], 'raman': []}
target_y_v1 = np.zeros(total_samples, dtype=int)
metadata_v1 = {
    'n_numeric': np.zeros(total_samples, dtype=int),
    'n_str': np.zeros(total_samples, dtype=object),
    'n_sum_str': np.zeros(total_samples, dtype=object)
}

# --- Populate data ---
for s_idx in range(total_samples):
    nirs_spectrum = np.array([s_idx for i in range(4)], dtype=float)
    spectra_data_v1['nirs'].append([nirs_spectrum])
    raman_spectrum = np.array([s_idx + 5 for i in range(5)], dtype=float)
    spectra_data_v1['raman'].append([raman_spectrum])
    # Target
    target_y_v1[s_idx] = s_idx
    # Metadata
    metadata_v1['n_numeric'][s_idx] = s_idx
    metadata_v1['n_str'][s_idx] = str(s_idx)
    metadata_v1['n_sum_str'][s_idx] = f"{s_idx}+{s_idx}"

spec = SpectraSet.build(
    spectra=spectra_data_v1,
    target=target_y_v1,
    metadata=metadata_v1
)

print("--- Initial SpectraSet (spec) ---")
print(f"Shape of spec: {spec.X().shape}, {spec.y().shape}")
print("*"*50)

for i in range(10):
    nirs = spec.X(sources=["nirs"])[i]
    raman = spec.X(sources=["raman"])[i]
    print(f"{i}: {nirs} - {raman}: y = {spec.y()[i]}")
    print(f"{i}: {spec.X()[i]} - {spec.y()[i]}")
    

--- Initial SpectraSet (spec) ---
Shape of spec: (10, 9), (10, 1)
**************************************************
0: [0. 0. 0. 0.] - [5. 5. 5. 5. 5.]: y = [0]
0: [0. 0. 0. 0. 5. 5. 5. 5. 5.] - [0]
1: [1. 1. 1. 1.] - [6. 6. 6. 6. 6.]: y = [1]
1: [1. 1. 1. 1. 6. 6. 6. 6. 6.] - [1]
2: [2. 2. 2. 2.] - [7. 7. 7. 7. 7.]: y = [2]
2: [2. 2. 2. 2. 7. 7. 7. 7. 7.] - [2]
3: [3. 3. 3. 3.] - [8. 8. 8. 8. 8.]: y = [3]
3: [3. 3. 3. 3. 8. 8. 8. 8. 8.] - [3]
4: [4. 4. 4. 4.] - [9. 9. 9. 9. 9.]: y = [4]
4: [4. 4. 4. 4. 9. 9. 9. 9. 9.] - [4]
5: [5. 5. 5. 5.] - [10. 10. 10. 10. 10.]: y = [5]
5: [ 5.  5.  5.  5. 10. 10. 10. 10. 10.] - [5]
6: [6. 6. 6. 6.] - [11. 11. 11. 11. 11.]: y = [6]
6: [ 6.  6.  6.  6. 11. 11. 11. 11. 11.] - [6]
7: [7. 7. 7. 7.] - [12. 12. 12. 12. 12.]: y = [7]
7: [ 7.  7.  7.  7. 12. 12. 12. 12. 12.] - [7]
8: [8. 8. 8. 8.] - [13. 13. 13. 13. 13.]: y = [8]
8: [ 8.  8.  8.  8. 13. 13. 13. 13. 13.] - [8]
9: [9. 9. 9. 9.] - [14. 14. 14. 14. 14.]: y = [9]
9: [ 9.  9.  9.  9. 14. 14. 14

## 3. Custom Transformers for Augmentation

We define `TransformerMixin` classes to add specific values to spectra.
- `AddNTransformer`: Adds a specified value `N` to each element of a spectrum.

In [3]:
# --- Define Custom Transformers ---
class AddNTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, value_to_add):
        self.value_to_add = value_to_add

    def fit(self, X, y=None):
        return self # Nothing to fit

    def transform(self, X_list):
        if isinstance(X_list, np.ndarray): # Single spectrum
             return X_list + self.value_to_add
        return [x + self.value_to_add for x in X_list]


# --- Instantiate Transformers ---
add1_transformer = AddNTransformer(value_to_add=0.1)
add2_transformer = AddNTransformer(value_to_add=0.2)

print("Custom transformers defined: AddNTransformer")

Custom transformers defined: AddNTransformer


In [None]:
# # 1) grab both data & obs‐indices
# X_nirs, obs_idx = spec.X_with_labels(sources=["nirs"])
# print(f"X_nirs shape: {X_nirs.shape}")
# print(obs_idx, X_nirs)
# new_spectra_nirs = add1_transformer.transform(X_nirs)
# print(new_spectra_nirs)
# print("="*50)

# X_raman, obs_idx = spec.X_with_labels(sources=["raman"])
# print(f"X_raman shape: {X_raman.shape}")
# print(obs_idx, X_raman)
# new_spectra_raman = add2_transformer.transform(X_raman)
# print(new_spectra_raman)
# print("="*50)


# spec.augment_samples(
#     new_spectra_by_source = {
#         "nirs": new_spectra_nirs[:5],
#         "raman": new_spectra_raman[:5]
#     },
#     original_obs_indices=obs_idx[:5] # Use the actual obs label
# )



# print(f"Shape of spec: {spec.X().shape}, {spec.y().shape}")
# print("*"*50)

# for i in range(len(spec)):
#     nirs = spec.X(sources=["nirs"])[i]
#     raman = spec.X(sources=["raman"])[i]
#     print(f"{i}: {spec.X()[i]} - {spec.y()[i]}")


# spec.augment_samples(
#     new_spectra_by_source = {
#         "nirs": new_spectra_nirs[:5],
#         "raman": new_spectra_raman[:5]
#     },
#     original_obs_indices=obs_idx[:5] # Use the actual obs label
# )


X_nirs shape: (10, 4)
[0 1 2 3 4 5 6 7 8 9] [[0. 0. 0. 0.]
 [1. 1. 1. 1.]
 [2. 2. 2. 2.]
 [3. 3. 3. 3.]
 [4. 4. 4. 4.]
 [5. 5. 5. 5.]
 [6. 6. 6. 6.]
 [7. 7. 7. 7.]
 [8. 8. 8. 8.]
 [9. 9. 9. 9.]]
[[0.1 0.1 0.1 0.1]
 [1.1 1.1 1.1 1.1]
 [2.1 2.1 2.1 2.1]
 [3.1 3.1 3.1 3.1]
 [4.1 4.1 4.1 4.1]
 [5.1 5.1 5.1 5.1]
 [6.1 6.1 6.1 6.1]
 [7.1 7.1 7.1 7.1]
 [8.1 8.1 8.1 8.1]
 [9.1 9.1 9.1 9.1]]
X_raman shape: (10, 5)
[0 1 2 3 4 5 6 7 8 9] [[ 5.  5.  5.  5.  5.]
 [ 6.  6.  6.  6.  6.]
 [ 7.  7.  7.  7.  7.]
 [ 8.  8.  8.  8.  8.]
 [ 9.  9.  9.  9.  9.]
 [10. 10. 10. 10. 10.]
 [11. 11. 11. 11. 11.]
 [12. 12. 12. 12. 12.]
 [13. 13. 13. 13. 13.]
 [14. 14. 14. 14. 14.]]
[[ 5.2  5.2  5.2  5.2  5.2]
 [ 6.2  6.2  6.2  6.2  6.2]
 [ 7.2  7.2  7.2  7.2  7.2]
 [ 8.2  8.2  8.2  8.2  8.2]
 [ 9.2  9.2  9.2  9.2  9.2]
 [10.2 10.2 10.2 10.2 10.2]
 [11.2 11.2 11.2 11.2 11.2]
 [12.2 12.2 12.2 12.2 12.2]
 [13.2 13.2 13.2 13.2 13.2]
 [14.2 14.2 14.2 14.2 14.2]]
Shape of spec: (15, 9), (15, 1)
*************************

<nirs4all.data.spectraset_grok.SpectraSet at 0x25ce7177610>

In [8]:
print("--- Test: Adding features to 'nirs' source ---")

# 0. Ensure AddNTransformer is defined (it's in cell c7b5dea8)
# And numpy is imported (as np, in cell caffebc2)

# 1. Get current NIRS data and its feature names
print("\n--- Step 1: Get current NIRS data and feature names ---")

available_sources = spec.available_sources
if 'nirs' not in available_sources:
    print("Error: 'nirs' source not found in spec. Skipping feature addition.")
else:
    nirs_data_original = spec.X(sources="nirs") # Shape: (n_obs, n_current_nirs_features)
    print(f"Original NIRS data shape: {nirs_data_original.shape}")

    # Get current feature names for the 'nirs' source
    nirs_feature_dim_name = spec.ds['spectra_nirs'].dims[-1] # e.g., 'feature_nirs'
    current_nirs_feature_names = spec.ds['spectra_nirs'].coords[nirs_feature_dim_name].values.tolist()
    print(f"Current NIRS feature names: {current_nirs_feature_names}")

    # 2. Transform the NIRS data
    print("\n--- Step 2: Transform NIRS data (add 0.01) ---")
    add_0_01_transformer = AddNTransformer(value_to_add=0.01)
    # The transform method in AddNTransformer should handle 2D arrays (applying to each row)
    nirs_data_transformed = add_0_01_transformer.transform(nirs_data_original.copy())
    print(f"Transformed NIRS data shape: {nirs_data_transformed.shape}") # Should be same as nirs_data_original

    # 3. Define feature names for the new (transformed) features
    print("\n--- Step 3: Define new feature names ---")
    new_feature_names = [f"{name}_plus_0.01" for name in current_nirs_feature_names]
    print(f"Names for new features to be added: {new_feature_names}")
    
    if len(new_feature_names) != nirs_data_transformed.shape[1]:
        print(f"ERROR: Mismatch! New names count {len(new_feature_names)}, transformed data features {nirs_data_transformed.shape[1]}")
        # Fallback if error (e.g. if current_nirs_feature_names was empty or incorrect)
        num_new_feats = nirs_data_transformed.shape[1]
        new_feature_names = [f"new_nirs_f{j}_plus_0.01" for j in range(num_new_feats)]
        print(f"Using fallback new feature names: {new_feature_names}")

    # 4. Add these transformed features back to the 'nirs' source
    print("\n--- Step 4: Call spec.add_features ---")
    spec.add_features(
        source_name="nirs",
        new_features=nirs_data_transformed,
        feature_names=new_feature_names
    )
    print("spec.add_features called.")

    # 5. Verification
    print("\n--- Step 5: Verification ---")
    print("Updated spec.ds['spectra_nirs'] structure:")
    print(spec.ds['spectra_nirs'])

    X_nirs_after, obs_labels_nirs_after = spec.X_with_labels(sources="nirs", feature_shape="2d")
    print(f"\nShape of NIRS data after add_features: {X_nirs_after.shape}")
    
    expected_total_nirs_features = len(current_nirs_feature_names) + len(new_feature_names)
    print(f"Expected total NIRS features: {expected_total_nirs_features}")
    
    if X_nirs_after.shape[1] == expected_total_nirs_features:
        print("Total NIRS feature count is as expected.")
    else:
        print(f"ERROR: Total NIRS feature count mismatch! Got {X_nirs_after.shape[1]}, expected {expected_total_nirs_features}")

    final_nirs_feature_names = spec.ds['spectra_nirs'].coords[nirs_feature_dim_name].values.tolist()
    print(f"\nFinal NIRS feature names: {final_nirs_feature_names}")
    
    missing_names = [name for name in new_feature_names if name not in final_nirs_feature_names]
    if not missing_names:
        print("All new feature names are present in the final NIRS feature list.")
    else:
        print(f"ERROR: Some new feature names are missing: {missing_names}")

    print("\n--- NIRS data for first observation after add_features ---")
    if X_nirs_after.shape[0] > 0:
        obs_idx_to_check = 0 # Check the first observation in the returned array    
        obs_label = obs_labels_nirs_after[obs_idx_to_check]
        print(f"Observation label: {obs_label}")
        
        # The original features should be first, followed by the newly added features.
        original_part_from_X = X_nirs_after[obs_idx_to_check, :len(current_nirs_feature_names)]
        added_part_from_X = X_nirs_after[obs_idx_to_check, len(current_nirs_feature_names):len(current_nirs_feature_names) + len(new_feature_names)]
        
        print(f"  Original NIRS data part (from spec.X, first {len(current_nirs_feature_names)} features): {original_part_from_X}")
        print(f"  Added NIRS data part (from spec.X, next {len(new_feature_names)} features): {added_part_from_X}")
        
        # Find the corresponding row in nirs_data_transformed.
        expected_added_data_for_this_obs = nirs_data_transformed[obs_idx_to_check] 
        print(f"  Expected added data (from transformed input for row {obs_idx_to_check}): {expected_added_data_for_this_obs}")

        if np.allclose(added_part_from_X, expected_added_data_for_this_obs):
            print(f"  Verification: Added NIRS features for obs {obs_label} match the transformed input values.")
        else:
            print(f"  VERIFICATION FAILED: Added NIRS features for obs {obs_label} DO NOT match transformed input values.")
    else:
        print("No NIRS data to display for value verification.")

    print("\n--- Concatenated X and y for first observation (if any) ---")
    if len(spec) > 0:
        first_obs_label_in_spec = spec.ds.obs.values[0]
        print(f"Data for first observation in current spec (label: {first_obs_label_in_spec}):")
        print(f"  spec.X()[0] (shape {spec.X()[0].shape}): {spec.X()[0]}")
        print(f"  spec.y()[0]: {spec.y()[0]}")

--- Test: Adding features to 'nirs' source ---

--- Step 1: Get current NIRS data and feature names ---
Original NIRS data shape: (20, 4)
Current NIRS feature names: ['nirs_f0', 'nirs_f1', 'nirs_f2', 'nirs_f3']

--- Step 2: Transform NIRS data (add 0.01) ---
Transformed NIRS data shape: (20, 4)

--- Step 3: Define new feature names ---
Names for new features to be added: ['nirs_f0_plus_0.01', 'nirs_f1_plus_0.01', 'nirs_f2_plus_0.01', 'nirs_f3_plus_0.01']

--- Step 4: Call spec.add_features ---
spec.add_features called.

--- Step 5: Verification ---
Updated spec.ds['spectra_nirs'] structure:
<xarray.DataArray 'spectra_nirs' (obs: 20, feature_nirs: 4)> Size: 640B
array([[0. , 0. , 0. , 0. ],
       [1. , 1. , 1. , 1. ],
       [2. , 2. , 2. , 2. ],
       [3. , 3. , 3. , 3. ],
       [4. , 4. , 4. , 4. ],
       [5. , 5. , 5. , 5. ],
       [6. , 6. , 6. , 6. ],
       [7. , 7. , 7. , 7. ],
       [8. , 8. , 8. , 8. ],
       [9. , 9. , 9. , 9. ],
       [0.1, 0.1, 0.1, 0.1],
       [1.1

In [6]:
# Retrieve X with all sources, concatenated (default)
X_concat = spec.X()
print(f"X (concatenate) shape: {X_concat.shape}")

# Retrieve X with features interlaced
X_interlace = spec.X(feature_shape="interlace")
print(f"X (interlace) shape: {X_interlace.shape}")

# Retrieve X as a 3D array (n_obs, n_sources, max_features_per_source)
X_2d = spec.X(feature_shape="2d")
print(f"X (2d) shape: {X_2d.shape}")

# Retrieve X as a 3D array (n_obs, max_features_per_source, n_sources)
X_transpose2d = spec.X(feature_shape="transpose2d")
print(f"X (transpose2d) shape: {X_transpose2d.shape}")

X (concatenate) shape: (20, 9)
X (interlace) shape: (20, 10)
X (2d) shape: (20, 2, 5)
X (transpose2d) shape: (20, 5, 2)

X (interlace) shape: (20, 10)
X (2d) shape: (20, 2, 5)
X (transpose2d) shape: (20, 5, 2)


In [7]:
# Get encoded target labels
y_encoded = spec.y(encode_labels=True)
print(f"Encoded y: {y_encoded}")

# Inverse transform to get original labels
y_original = spec.inverse_transform_y(y_encoded)
print(f"Original y: {y_original}")

Encoded y: [[0]
 [1]
 [2]
 [3]
 [4]
 [5]
 [6]
 [7]
 [8]
 [9]
 [0]
 [1]
 [2]
 [3]
 [4]
 [0]
 [1]
 [2]
 [3]
 [4]]


RuntimeError: LabelEncoder not fitted. Call .y(encode_labels=True) first.

## 4. Sample Augmentation

Now, we augment the samples based on their `n` value:
- Samples with `n < 4`: Augment with `add1_transformer` (add 1). This adds one new augmentation.
- Samples with `n >= 4`: Augment with `add1_transformer` and `add2_transformer` (add 1, then add 2 to original). This adds two new augmentations.

The augmentations are applied to the *original* spectra of each sample.

In [None]:
# --- Deep copy initial spectra data to modify for augmentation ---

# --- Apply Sample Augmentation ---
for s_idx in range(total_samples):
    # Original spectra for this sample (first augmentation)
    original_nirs_aug = spectra_data_v1['nirs'][s_idx][0] 
    original_raman_aug = spectra_data_v1['raman'][s_idx][0]

    if s_idx < 4:
        # Augment with add1
        spectra_data_v2['nirs'][s_idx].append(add1_transformer.transform(original_nirs_aug))
        spectra_data_v2['raman'][s_idx].append(add1_transformer.transform(original_raman_aug))
    else: # current_n >= 4
        # Augment with add1
        spectra_data_v2['nirs'][s_idx].append(add1_transformer.transform(original_nirs_aug))
        spectra_data_v2['raman'][s_idx].append(add1_transformer.transform(original_raman_aug))
        # Augment with add2 (applied to original)
        spectra_data_v2['nirs'][s_idx].append(add2_transformer.transform(original_nirs_aug))
        spectra_data_v2['raman'][s_idx].append(add2_transformer.transform(original_raman_aug))


print("--- SpectraSet after Sample Augmentation (ss_v2) ---")
print(f"Shape of X from ss_v2: {ss_v2.X().shape}") 

# for i in range(len(ss_v2)):
#     nirs = ss_v2.X(include_sources=["nirs"])[i]
#     raman = ss_v2.X(include_sources=["raman"])[i]
#     print(f"{i}: {nirs} - {raman}: y = {ss_v2.y()[i]}")
    
# print(ss_v2.X(include_sources=["nirs"], augment="augmented"))
print(ss_v2.subset_by_samples([0, 4]).X(sources=["raman"], augment="augmented"))

NameError: name 'spectra_data_v2' is not defined

## 5. Source-Specific Transformation

Next, we transform all spectra (original and their augmentations) based on their source:
- **NIRS spectra**: Apply a "minus 1" transformation.
- **Raman spectra**: Apply a "plus 100" transformation.

These transformations are applied to *all existing augmentations* for each sample.

In [None]:
# --- Define Source-Specific Transformers ---
class Minus1Transformer(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None): return self
    def transform(self, X_spectrum): return X_spectrum - 1

class Plus100Transformer(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None): return self
    def transform(self, X_spectrum): return X_spectrum + 100

minus1_transformer = Minus1Transformer()
plus100_transformer = Plus100Transformer()

# --- Deep copy augmented spectra data for source transformation ---
spectra_data_v3 = {
    src: [[np.copy(aug) for aug in augs_list] for augs_list in sample_augs_list]
    for src, sample_augs_list in spectra_data_v2.items()
}

# --- Apply Source-Specific Transformations ---
# For NIRS source
for s_idx in range(total_samples):
    for aug_idx in range(len(spectra_data_v3['nirs'][s_idx])):
        spectra_data_v3['nirs'][s_idx][aug_idx] = minus1_transformer.transform(
            spectra_data_v3['nirs'][s_idx][aug_idx]
        )

# For Raman source
for s_idx in range(total_samples):
    for aug_idx in range(len(spectra_data_v3['raman'][s_idx])):
        spectra_data_v3['raman'][s_idx][aug_idx] = plus100_transformer.transform(
            spectra_data_v3['raman'][s_idx][aug_idx]
        )

# --- Build SpectraSet with Source-Transformed Data ---
ss_v3 = SpectraSet.build(
    spectra=spectra_data_v3,
    target=target_y_v1,
    metadata=metadata_v1
)

# print("--- SpectraSet after Source-Specific Transformation (ss_v3) ---")
# print(ss_v3.ds)
# print(f"\nShape of X from ss_v3: {ss_v3.X().shape}") # Shape should be same as ss_v2

# # Verify transformations for one sample
# print(f"\nNIRS for sample 0, aug 0 (original was {spectra_data_v2['nirs'][0][0]}):")
# print(f"  Transformed: {spectra_data_v3['nirs'][0][0]}")
# print(f"Raman for sample 0, aug 0 (original was {spectra_data_v2['raman'][0][0]}):")
# print(f"  Transformed: {spectra_data_v3['raman'][0][0]}")


## 6. Feature Augmentation (Adding New Sources)

We create new spectral sources by applying a transformation to existing ones.
- For every sample and its existing augmentations (which are now source-transformed):
    - Create `nirs_plus_01` by adding 0.1 to each NIRS spectrum.
    - Create `raman_plus_01` by adding 0.1 to each Raman spectrum.

In [None]:
# --- Define Feature Augmentation Transformer ---
class AddConstantTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, constant_to_add):
        self.constant_to_add = constant_to_add
    def fit(self, X, y=None): return self
    def transform(self, X_spectrum): return X_spectrum + self.constant_to_add

add_point_one_transformer = AddConstantTransformer(0.1)

# --- Deep copy source-transformed spectra data for feature augmentation ---
spectra_data_v4 = {
    src: [[np.copy(aug) for aug in augs_list] for augs_list in sample_augs_list]
    for src, sample_augs_list in spectra_data_v3.items()
}

# --- Initialize new sources ---
spectra_data_v4['nirs_plus_01'] = [[] for _ in range(total_samples)]
spectra_data_v4['raman_plus_01'] = [[] for _ in range(total_samples)]

# --- Apply Feature Augmentation ---
for s_idx in range(total_samples):
    # For nirs_plus_01
    for aug_spectrum in spectra_data_v3['nirs'][s_idx]: # Iterate existing NIRS augs
        spectra_data_v4['nirs_plus_01'][s_idx].append(
            add_point_one_transformer.transform(aug_spectrum)
        )
    # For raman_plus_01
    for aug_spectrum in spectra_data_v3['raman'][s_idx]: # Iterate existing Raman augs
        spectra_data_v4['raman_plus_01'][s_idx].append(
            add_point_one_transformer.transform(aug_spectrum)
        )
        
# --- Build SpectraSet with Feature-Augmented Data ---
# This will be our main SpectraSet for subsequent steps
ss_v4_pre_group_split = SpectraSet.build(
    spectra=spectra_data_v4,
    target=target_y_v1,
    metadata=metadata_v1
)

print("--- SpectraSet after Feature Augmentation (ss_v4_pre_group_split) ---")
print(ss_v4_pre_group_split.ds)
# X shape should be (24, 9+9) = (24, 18) if concatenated (nirs, raman, nirs_plus_01, raman_plus_01)
# NIRS (4) + Raman (5) + NIRS_plus_01 (4) + Raman_plus_01 (5) = 18 features
print(f"\nShape of X from ss_v4_pre_group_split: {ss_v4_pre_group_split.X().shape}")
print(f"Sources in ss_v4_pre_group_split: {[k for k in ss_v4_pre_group_split.ds.data_vars if k.startswith('spectra_')]}")

# Verify new source for one sample
print(f"\nNIRS for sample 0, aug 0: {spectra_data_v4['nirs'][0][0]}")
print(f"NIRS_plus_01 for sample 0, aug 0: {spectra_data_v4['nirs_plus_01'][0][0]}")

--- SpectraSet after Feature Augmentation (ss_v4_pre_group_split) ---
<xarray.Dataset> Size: 6kB
Dimensions:                (obs: 26, feature_nirs: 4, feature_raman: 5,
                            feature_nirs_plus_01: 4, feature_raman_plus_01: 5,
                            variable: 1)
Coordinates:
  * obs                    (obs) int64 208B 0 1 2 3 4 5 6 ... 20 21 22 23 24 25
  * feature_nirs           (feature_nirs) <U7 112B 'nirs_f0' ... 'nirs_f3'
  * feature_raman          (feature_raman) <U8 160B 'raman_f0' ... 'raman_f4'
  * feature_nirs_plus_01   (feature_nirs_plus_01) <U15 240B 'nirs_plus_01_f0'...
  * feature_raman_plus_01  (feature_raman_plus_01) <U16 320B 'raman_plus_01_f...
  * variable               (variable) int64 8B 0
    sample                 (obs) int64 208B 0 0 1 1 2 2 3 3 ... 7 7 8 8 8 9 9 9
    augmentation           (obs) int64 208B 0 1 0 1 0 1 0 1 ... 1 2 0 1 2 0 1 2
Data variables:
    spectra_nirs           (obs, feature_nirs) float64 832B -1.0 -1.0 ... 8.2


## 7. Grouping Samples

We group samples based on whether the first value of their *original* NIRS spectrum is odd or even.
The original NIRS spectra are taken from `spectra_data_v1`.

In [None]:
# --- Define Groups ---
group_labels_per_sample = []
for s_idx in range(total_samples):
    # Access the first value of the original NIRS spectrum for the sample
    # spectra_data_v1['nirs'][s_idx] is a list containing one np.array (the original aug)
    original_nirs_spectrum = spectra_data_v1['nirs'][s_idx][0]
    first_nirs_value = original_nirs_spectrum[0]
    
    if first_nirs_value % 2 != 0: # Check if odd
        group_labels_per_sample.append('odd_first_nirs')
    else:
        group_labels_per_sample.append('even_first_nirs')

group_labels_np = np.array(group_labels_per_sample)
custom_groups = {'nirs_parity_group': group_labels_np}

# --- Rebuild SpectraSet with Groups ---
# Using spectra_data_v4 (which has all spectral transformations and augmentations)
ss_v5_with_groups = SpectraSet.build(
    spectra=spectra_data_v4,
    target=target_y_v1,
    metadata=metadata_v1,
    groups=custom_groups
)

print("--- SpectraSet with Custom Groups (ss_v5_with_groups) ---")
print(ss_v5_with_groups.ds)
print(f"\nShape of X from ss_v5_with_groups: {ss_v5_with_groups.X().shape}") # Shape should be same as ss_v4
print(f"\nGroup labels per sample (first {total_samples}): {np.unique(group_labels_np)[:total_samples]}")
print(f"\nValue counts for 'nirs_parity_group' (per observation):")
print(pd.Series(ss_v5_with_groups.ds['group_id_nirs_parity_group'].values).value_counts())
# To see counts per original sample:
print(f"\nValue counts for 'nirs_parity_group' (per original sample):")
print(pd.Series(group_labels_np).value_counts())

# Example: n=1 (s_idx=0,1), NIRS starts with 0 (even)
# n=2 (s_idx=2,3), NIRS starts with 0 (even) for (2-1)*0
# n=3 (s_idx=4,5), NIRS starts with 0 (even) for (3-1)*0
# n=4 (s_idx=6,7), NIRS starts with 0 (even) for (4-1)*0
# n=5 (s_idx=8,9), NIRS starts with 0 (even) for (5-1)*0
# It seems all will be 'even_first_nirs' with current (n-1)*i logic. Let's check.
# (n-1)*0 is always 0. So all are 'even'.
# Let's adjust the grouping logic to be more interesting, e.g. based on `current_n` itself.
# Or, let's use the sum of the first NIRS spectrum.

print("\n--- Adjusting Grouping Logic for Better Distribution ---")
group_labels_per_sample_adj = []
for s_idx in range(total_samples):
    original_nirs_spectrum = spectra_data_v1['nirs'][s_idx][0]
    sum_of_nirs_values = np.sum(original_nirs_spectrum)
    if sum_of_nirs_values % 2 != 0: # Check if sum is odd
        group_labels_per_sample_adj.append('odd_sum_nirs')
    else:
        group_labels_per_sample_adj.append('even_sum_nirs')

group_labels_adj_np = np.array(group_labels_per_sample_adj)
custom_groups_adj = {'nirs_sum_parity_group': group_labels_adj_np}

ss_final_with_groups = SpectraSet.build(
    spectra=spectra_data_v4,
    target=target_y_v1,
    metadata=metadata_v1,
    groups=custom_groups_adj # Use adjusted groups
)
print("--- SpectraSet with Adjusted Custom Groups (ss_final_with_groups) ---")
print(ss_final_with_groups.ds)
print(f"\nValue counts for 'nirs_sum_parity_group' (per original sample):")
print(pd.Series(group_labels_adj_np).value_counts())
# n=1, NIRS=[0,0,0,0], sum=0 (even)
# n=2, NIRS=[0,1,2,3], sum=6 (even)
# n=3, NIRS=[0,2,4,6], sum=12 (even)
# n=4, NIRS=[0,3,6,9], sum=18 (even)
# n=5, NIRS=[0,4,8,12], sum=24 (even)
# Still all even. The data generation is very regular.
# Let's group by n % 2 for simplicity to get a split.
print("\n--- Final Grouping Logic: n % 2 ---")
group_labels_n_parity = ['n_odd' if n_values_for_samples[s_idx] % 2 != 0 else 'n_even' for s_idx in range(total_samples)]
group_labels_n_parity_np = np.array(group_labels_n_parity)
final_custom_groups = {'n_parity_group': group_labels_n_parity_np}

ss_final_grouped = SpectraSet.build(
    spectra=spectra_data_v4,
    target=target_y_v1,
    metadata=metadata_v1,
    groups=final_custom_groups
)
print("--- SpectraSet with Final Custom Groups (ss_final_grouped) ---")
print(ss_final_grouped.ds)
print(f"\nValue counts for 'n_parity_group' (per original sample):")
print(pd.Series(group_labels_n_parity_np).value_counts())
# n=1 (odd), n=2 (even), n=3 (odd), n=4 (even), n=5 (odd)
# Each n appears twice. So 6 'n_odd' samples, 4 'n_even' samples. This is good for stratification.

--- SpectraSet with Custom Groups (ss_v5_with_groups) ---
<xarray.Dataset> Size: 8kB
Dimensions:                     (obs: 26, feature_nirs: 4, feature_raman: 5,
                                 feature_nirs_plus_01: 4,
                                 feature_raman_plus_01: 5, variable: 1)
Coordinates:
  * obs                         (obs) int64 208B 0 1 2 3 4 5 ... 21 22 23 24 25
  * feature_nirs                (feature_nirs) <U7 112B 'nirs_f0' ... 'nirs_f3'
  * feature_raman               (feature_raman) <U8 160B 'raman_f0' ... 'rama...
  * feature_nirs_plus_01        (feature_nirs_plus_01) <U15 240B 'nirs_plus_0...
  * feature_raman_plus_01       (feature_raman_plus_01) <U16 320B 'raman_plus...
  * variable                    (variable) int64 8B 0
    sample                      (obs) int64 208B 0 0 1 1 2 2 3 ... 7 8 8 8 9 9 9
    augmentation                (obs) int64 208B 0 1 0 1 0 1 0 ... 2 0 1 2 0 1 2
    group_id_nirs_parity_group  (obs) <U15 2kB 'even_first_nirs' ... 'odd_fi

NameError: name 'n_values_for_samples' is not defined

## 8. Train/Test Split

We split the data into training (80%) and testing (20%) sets.
The split is stratified based on the `n_parity_group` to ensure proportional representation of each group in both sets.
The split is performed on the original sample indices.

In [None]:
# --- Perform Train/Test Split on Sample Indices ---
# We need unique sample indices and their corresponding group labels for stratification
sample_indices = np.arange(total_samples) # 0 to 9 for our 10 original samples

# Stratify based on group_labels_n_parity_np (one label per original sample)
train_sample_idx, test_sample_idx = train_test_split(
    sample_indices,
    test_size=0.2, # 20% for test
    stratify=group_labels_n_parity_np, # Stratify by the group labels of original samples
    random_state=42
)

# --- Create splits array for SpectraSet ---
# This array indicates 'train' or 'test' for each original sample
splits_array = np.full(total_samples, 'train', dtype=object)
splits_array[test_sample_idx] = 'test'

# --- Rebuild SpectraSet with Splits Information ---
ss_final_split = SpectraSet.build(
    spectra=spectra_data_v4, # Use the fully processed spectra
    target=target_y_v1,
    metadata=metadata_v1,
    groups=final_custom_groups, # Keep the groups
    splits=splits_array        # Add the split information
)

print("--- SpectraSet with Train/Test Splits (ss_final_split) ---")
print(ss_final_split.ds)

# --- Extract Train and Test Data using SpectraSet methods ---
X_train = ss_final_split.X(split='train')
y_train = ss_final_split.y(split='train')
X_test = ss_final_split.X(split='test')
y_test = ss_final_split.y(split='test')

print(f"\nShape of X_train: {X_train.shape}, y_train: {y_train.shape}")
print(f"Shape of X_test: {X_test.shape}, y_test: {y_test.shape}")

# Verify stratification (counts of groups in train/test based on original samples)
print(f"\nOriginal sample group counts: {pd.Series(group_labels_n_parity_np).value_counts()}")
print(f"Train sample group counts: {pd.Series(group_labels_n_parity_np[train_sample_idx]).value_counts()}")
print(f"Test sample group counts: {pd.Series(group_labels_n_parity_np[test_sample_idx]).value_counts()}")
# Test set should have 2 samples (20% of 10).
# If n_odd=6, n_even=4:
# Test (2 samples): 20% of 6 n_odd is ~1.2 (so 1 or 2), 20% of 4 n_even is ~0.8 (so 0 or 1).
# train_test_split will try its best. For small N, it might not be perfect but aims for proportion.
# With 2 test samples: if 1 n_odd, 1 n_even, that's good.
# Test sample indices: {test_sample_idx}, their groups: {group_labels_n_parity_np[test_sample_idx]}

--- SpectraSet with Train/Test Splits (ss_final_split) ---
<xarray.Dataset> Size: 6kB
Dimensions:                  (obs: 24, feature: 5, variable: 1)
Coordinates:
  * obs                      (obs) object 192B MultiIndex
  * sample                   (obs) int64 192B 0 0 1 1 2 2 3 3 ... 7 8 8 8 9 9 9
  * augmentation             (obs) int64 192B 0 1 0 1 0 1 0 1 ... 2 0 1 2 0 1 2
  * feature                  (feature) int64 40B 0 1 2 3 4
  * variable                 (variable) int64 8B 0
    split                    (obs) <U5 480B 'train' 'train' ... 'train' 'train'
    group_id_n_parity_group  (obs) <U6 576B 'n_odd' 'n_odd' ... 'n_odd' 'n_odd'
Data variables:
    spectra_nirs             (obs, feature) float64 960B -1.0 -1.0 ... 13.0 nan
    spectra_raman            (obs, feature) float64 960B 105.0 105.0 ... 123.0
    spectra_nirs_plus_01     (obs, feature) float64 960B -0.9 -0.9 ... 13.1 nan
    spectra_raman_plus_01    (obs, feature) float64 960B 105.1 105.1 ... 123.1
    target     

## 9. Repeated K-Fold Cross-Validation

We apply `RepeatedKFold` to the 80% training data.
- `n_splits=5`, `n_repeats=2`.
- The folding is done on the *sample indices* of the training set.
- The prompt mentioned "excluding only a fold from one random group". Standard `RepeatedKFold` operates on the provided samples. If group-specific exclusion per fold is needed, a more custom loop would be required. Here, we perform standard `RepeatedKFold` on the training samples.

In [None]:
# --- Setup RepeatedKFold ---
rkf = RepeatedKFold(n_splits=5, n_repeats=2, random_state=42)

# We need to apply KFold to the *samples* within the training set.
# `train_sample_idx` contains the original indices (0-9) of the samples in the training set.
# Let's get a SpectraSet view of only the training data samples.
ss_train_samples_only = ss_final_split.subset_by_samples(train_sample_idx)

# The KFold split will be on the *unique sample IDs present in ss_train_samples_only*.
# These are not necessarily contiguous from 0 to len(train_sample_idx)-1 if we directly use train_sample_idx.
# Instead, we split based on the number of unique samples in the training set.
unique_train_sample_ids_in_ss = ss_train_samples_only.ds.sample.to_series().unique()
# unique_train_sample_ids_in_ss contains the actual sample IDs (from 0-9 range) that are in training.

print(f"Number of unique samples in training set: {len(unique_train_sample_ids_in_ss)}")
print(f"Unique sample IDs in training set: {unique_train_sample_ids_in_ss}")


print("\n--- Iterating through RepeatedKFold Splits ---")
for fold_num, (train_fold_indices, val_fold_indices) in enumerate(rkf.split(unique_train_sample_ids_in_ss)):
    # train_fold_indices and val_fold_indices are indices *within* unique_train_sample_ids_in_ss
    
    # Get the actual sample IDs for this fold's train and validation sets
    current_fold_train_sample_ids = unique_train_sample_ids_in_ss[train_fold_indices]
    current_fold_val_sample_ids = unique_train_sample_ids_in_ss[val_fold_indices]
    
    # Get SpectraSet views for this specific fold's train and validation samples
    ss_fold_train = ss_train_samples_only.subset_by_samples(current_fold_train_sample_ids)
    ss_fold_val = ss_train_samples_only.subset_by_samples(current_fold_val_sample_ids)
    
    # Extract X and y for this fold
    X_fold_tr = ss_fold_train.X()
    y_fold_tr = ss_fold_train.y()
    X_fold_val = ss_fold_val.X()
    y_fold_val = ss_fold_val.y()
    
    print(f"Fold {fold_num + 1}:")
    print(f"  Train samples in fold (IDs): {current_fold_train_sample_ids}")
    print(f"  Validation samples in fold (IDs): {current_fold_val_sample_ids}")
    print(f"  X_train_fold shape: {X_fold_tr.shape}, y_train_fold shape: {y_fold_tr.shape}")
    print(f"  X_val_fold shape: {X_fold_val.shape}, y_val_fold shape: {y_fold_val.shape}")
    if fold_num == 0: # Print details for the first fold
        print(f"  Example y_val_fold (first 5): {y_fold_val[:5].ravel()}")

# Note: With n_splits=5 and 8 training samples, some validation sets might be small (1-2 samples).
# This means X_val/y_val shapes will reflect observations from those 1-2 samples.
# E.g., if a val sample has 2 augs, its y_val will have 2 identical target values.

Number of unique samples in training set: 8
Unique sample IDs in training set: [0 1 2 3 5 7 8 9]

--- Iterating through RepeatedKFold Splits ---
Fold 1:
  Train samples in fold (IDs): [0 2 3 5 8 9]
  Validation samples in fold (IDs): [1 7]
  X_train_fold shape: (14, 20), y_train_fold shape: (14, 1)
  X_val_fold shape: (5, 20), y_val_fold shape: (5, 1)
  Example y_val_fold (first 5): [1 1 4 4 4]
Fold 2:
  Train samples in fold (IDs): [1 2 3 5 7 8]
  Validation samples in fold (IDs): [0 9]
  X_train_fold shape: (14, 20), y_train_fold shape: (14, 1)
  X_val_fold shape: (5, 20), y_val_fold shape: (5, 1)
Fold 3:
  Train samples in fold (IDs): [0 1 3 7 8 9]
  Validation samples in fold (IDs): [2 5]
  X_train_fold shape: (15, 20), y_train_fold shape: (15, 1)
  X_val_fold shape: (4, 20), y_val_fold shape: (4, 1)
Fold 4:
  Train samples in fold (IDs): [0 1 2 5 7 8 9]
  Validation samples in fold (IDs): [3]
  X_train_fold shape: (17, 20), y_train_fold shape: (17, 1)
  X_val_fold shape: (2, 20), 

## 10. Unpacking Groups

Finally, we iterate through the defined groups (`n_parity_group`) in the full dataset (before train/test split, but with all transformations and augmentations - `ss_final_grouped`) and display the shape of `X` and `y` for each group.

In [None]:
# --- Unpack and Inspect Data by Group ---
# Using ss_final_grouped, which contains all samples and their group assignments.
unique_group_values = np.unique(ss_final_grouped.ds['group_id_n_parity_group'].values)

print(f"--- Data Shapes per Group (from ss_final_grouped) ---")
for group_val in unique_group_values:
    # Filter SpectraSet by the current group value
    # The groups argument in X() and y() filters observations.
    X_group = ss_final_grouped.X(groups={'n_parity_group': group_val})
    y_group = ss_final_grouped.y(groups={'n_parity_group': group_val})
    
    print(f"\nGroup: '{group_val}'")
    print(f"  Shape of X for this group: {X_group.shape}")
    print(f"  Shape of y for this group: {y_group.shape}")
    # Verify target values in this group (should be consistent with n_parity)
    unique_n_in_group_y = np.unique(y_group.ravel())
    print(f"  Unique 'n' values (target y) in this group: {unique_n_in_group_y}")
    if group_val == 'n_odd':
        all_odd = all(n % 2 != 0 for n in unique_n_in_group_y)
        print(f"    All target 'n' values are odd: {all_odd}")
    elif group_val == 'n_even':
        all_even = all(n % 2 == 0 for n in unique_n_in_group_y)
        print(f"    All target 'n' values are even: {all_even}")


--- Data Shapes per Group (from ss_final_grouped) ---

Group: 'n_even'
  Shape of X for this group: (10, 20)
  Shape of y for this group: (10, 1)
  Unique 'n' values (target y) in this group: [2 4]
    All target 'n' values are even: True

Group: 'n_odd'
  Shape of X for this group: (14, 20)
  Shape of y for this group: (14, 1)
  Unique 'n' values (target y) in this group: [1 3 5]
    All target 'n' values are odd: True


## 11. Conclusion

This notebook demonstrated a comprehensive workflow using `SpectraSet`, covering data generation, various types of augmentations and transformations, grouping, splitting, and cross-validation preparation. Each step involved careful manipulation of the spectral data and rebuilding or querying the `SpectraSet` object to reflect the changes, while displaying shapes to track the data transformations.