# Notebook 2: Data Wrangling and Hybrid Preprocessing

**Project:** `PharmaControl-Pro`
**Goal:** Use our advanced simulator to generate a large, diverse dataset suitable for training a dynamic model. We will then perform the crucial preprocessing steps: chronological splitting, feature scaling, and creating a PyTorch `Dataset` to serve data in the correct sequence format.

### Table of Contents
1. [System Identification: Generating Rich Data](#1.-System-Identification:-Generating-Rich-Data)
2. [Time-Series Splitting: Avoiding Data Leakage](#2.-Time-Series-Splitting:-Avoiding-Data-Leakage)
3. [Hybrid Modeling: Engineering Soft Sensors](#3.-Hybrid-Modeling:-Engineering-Soft-Sensors)
4. [Data Scaling](#4.-Data-Scaling)
5. [Creating a PyTorch Time-Series Dataset](#5.-Creating-a-PyTorch-Time-Series-Dataset)

--- 
## 1. System Identification: Generating Rich Data

To train a model that understands process dynamics, we need to show it a wide variety of conditions. A simple step-change experiment is not enough. We need to 'excite' the system by changing the inputs (CPPs) frequently and randomly within their operating ranges.

Designing this kind of input excitation is the first step of **System Identification**. We will run our simulator for 15,000 time steps, drawing a new random set of CPPs every 75 steps (200 setpoint changes in total). This generates a rich time-series dataset that captures how the system responds to a wide range of inputs and transitions.
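Before running the full simulator, it can help to look at the excitation signal on its own. The sketch below builds a piecewise-constant random input with NumPy. The helper name `piecewise_constant_inputs` and the ranges in `example_ranges` are illustrative assumptions; the real ranges come from the simulator's `CPP_RANGES`.

```python
import numpy as np

def piecewise_constant_inputs(ranges, n_steps, hold, seed=0):
    """Draw a new uniform random level every `hold` steps and hold it constant in between."""
    rng = np.random.default_rng(seed)
    n_levels = -(-n_steps // hold)  # ceiling division: number of setpoint changes
    signals = {}
    for name, (low, high) in ranges.items():
        levels = rng.uniform(low, high, size=n_levels)
        # Repeat each level `hold` times, then trim to the requested length
        signals[name] = np.repeat(levels, hold)[:n_steps]
    return signals

# Hypothetical operating ranges, for illustration only
example_ranges = {"spray_rate": (80.0, 180.0), "air_flow": (400.0, 700.0)}
sig = piecewise_constant_inputs(example_ranges, n_steps=15000, hold=75)

print(len(sig["spray_rate"]))   # one value per simulation step
print(sig["spray_rate"][:3])    # identical values within the first hold period
```

Shorter hold periods excite faster dynamics but give the process less time to settle; the 75-step interval used in this notebook is one possible trade-off.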

In [1]:
import pandas as pd
import numpy as np
import os, sys
from tqdm.notebook import tqdm
from V1.src.plant_simulator import AdvancedPlantSimulator

# --- Configuration Constants ---
# Data generation configuration
SIMULATION_STEPS = 15000      # Total simulation time steps
CPP_CHANGE_INTERVAL = 75      # Steps between control parameter changes  
RANDOM_SEED = 42              # For reproducible data generation

# Time series configuration  
LOOKBACK = 36                 # Historical context window (steps)
HORIZON = 72                  # Prediction horizon (steps)
BATCH_SIZE = 64               # Training batch size

# Data management
DATA_DIR = '../data'
RAW_DATA_FILE = os.path.join(DATA_DIR, 'granulation_data_raw.csv')  # Unscaled data

# Define column groups
CMA_COLS = ['d50', 'lod']  # Critical Material Attributes
CPP_COLS = ['spray_rate', 'air_flow', 'carousel_speed', 'specific_energy', 'froude_number_proxy']  # Critical Process Parameters, plus the two soft sensors added in Section 3

# Set random seed for reproducibility
np.random.seed(RANDOM_SEED)

# Create data directories if they don't exist
try:
    os.makedirs(DATA_DIR, exist_ok=True)
    #os.makedirs(PROCESSED_DATA_DIR, exist_ok=True)
    print(f"✓ Data directories created: {DATA_DIR}")
except OSError as e:
    raise RuntimeError(f"Failed to create data directories: {e}")

# --- Data Generation ---
if os.path.exists(RAW_DATA_FILE):
    print(f"Dataset '{RAW_DATA_FILE}' already exists. Loading...")
    try:
        df_raw = pd.read_csv(RAW_DATA_FILE)
        print(f"✓ Loaded existing dataset with {len(df_raw)} records")
    except Exception as e:
        raise RuntimeError(f"Failed to load existing dataset: {e}")
else:
    print("Generating new dataset...")
    
    # Initialize plant simulator with reproducible seed
    plant = AdvancedPlantSimulator(random_seed=RANDOM_SEED)
    
    # Use CPP ranges from plant simulator for consistency
    cpp_ranges = plant.CPP_RANGES
    print(f"✓ Using CPP ranges from plant simulator: {cpp_ranges}")
    
    # Initialize with a random valid state
    current_cpps = {key: np.random.uniform(min_v, max_v) for key, (min_v, max_v) in cpp_ranges.items()}

    log = []
    try:
        for t in tqdm(range(SIMULATION_STEPS), desc="Generating data"):
            # Randomly change CPPs at specified intervals
            if t % CPP_CHANGE_INTERVAL == 0:
                current_cpps = {key: np.random.uniform(min_v, max_v) for key, (min_v, max_v) in cpp_ranges.items()}
            
            state = plant.step(current_cpps)
            record = {**current_cpps, **state}
            log.append(record)

        df_raw = pd.DataFrame(log)
        
        # Validate generated data
        expected_cols = set(cpp_ranges) | set(CMA_COLS)
        actual_cols = set(df_raw.columns)
        if not expected_cols.issubset(actual_cols):
            missing = expected_cols - actual_cols
            raise ValueError(f"Generated data missing expected columns: {missing}")
        
        # Save raw unscaled data
        df_raw.to_csv(RAW_DATA_FILE, index=False)
        print(f"✓ Dataset with {len(df_raw)} records saved to '{RAW_DATA_FILE}'")
        
    except Exception as e:
        raise RuntimeError(f"Data generation failed: {e}")

# Validate data integrity
if df_raw.empty:
    raise ValueError("Generated dataset is empty")
if len(df_raw) < LOOKBACK + HORIZON:
    raise ValueError(f"Dataset too small ({len(df_raw)} rows) for lookback={LOOKBACK} + horizon={HORIZON}")

print(f"✓ Data validation passed: {len(df_raw)} rows, {len(df_raw.columns)} columns")
df_raw.head()

✓ Data directories created: ../data
Dataset '../data/granulation_data_raw.csv' already exists. Loading...
✓ Loaded existing dataset with 15000 records
✓ Data validation passed: 15000 rows, 7 columns


| | spray_rate | air_flow | carousel_speed | d50 | lod | specific_energy | froude_number_proxy |
|---|---|---|---|---|---|---|---|
| 0 | 139.865848 | 446.805592 | 23.11989 | 418.117522 | 1.563103 | 3.233683 | 54.488209 |
| 1 | 139.865848 | 446.805592 | 23.11989 | 418.095544 | 1.576609 | 3.233683 | 54.488209 |
| 2 | 139.865848 | 446.805592 | 23.11989 | 428.347286 | 1.551784 | 3.233683 | 54.488209 |
| 3 | 139.865848 | 446.805592 | 23.11989 | 442.095644 | 1.506116 | 3.233683 | 54.488209 |
| 4 | 139.865848 | 446.805592 | 23.11989 | 442.482701 | 1.600427 | 3.233683 | 54.488209 |


---
## 2. Time-Series Splitting: Avoiding Data Leakage

This is one of the most critical steps in any time-series modeling project. **You cannot use a random split (like `sklearn.model_selection.train_test_split`) for time-series data.**

Why? A random split shuffles the rows, so the model could be trained on an observation from time `t` and then tested on the one from time `t-1`. Consecutive samples of a dynamic process are highly correlated, so the model has effectively seen the future of its own test points. This **data leakage** leads to overly optimistic performance metrics and to models that fail badly in real-world deployment.

The correct approach is a **chronological split**. We must train the model on the past and validate/test it on the future, mimicking how it will be used in production.
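To make the leakage concrete, the sketch below compares the two strategies on a toy index of 1,000 time steps (an illustrative size, not our dataset). The helper `neighbour_leak_fraction` is a hypothetical name introduced here; it measures the fraction of test rows that have an immediate temporal neighbour in the training set.

```python
import random

def neighbour_leak_fraction(train_idx, test_idx):
    """Fraction of test rows whose neighbour at t-1 or t+1 belongs to the training set."""
    train = set(train_idx)
    hits = sum(1 for i in test_idx if (i - 1) in train or (i + 1) in train)
    return hits / len(test_idx)

n_rows = 1000   # toy series length (illustrative)
n_train = 700   # 70% of the rows go to training

# Random split: shuffle the row indices, then cut
shuffled = list(range(n_rows))
random.Random(0).shuffle(shuffled)
random_leak = neighbour_leak_fraction(shuffled[:n_train], shuffled[n_train:])

# Chronological split: cut the ordered indices
ordered = list(range(n_rows))
chrono_leak = neighbour_leak_fraction(ordered[:n_train], ordered[n_train:])

print(f"random split:        {random_leak:.1%} of test rows touch a training row")
print(f"chronological split: {chrono_leak:.1%} of test rows touch a training row")
```

With the chronological split, only the first test row borders the training set (1 of 300). With the random split, the large majority of test rows do.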

We will split our data as follows:
*   **Training Set (70%):** The earliest data, used for model training.
*   **Validation Set (15%):** The next block of data, used for hyperparameter tuning.
*   **Test Set (15%):** The most recent data, held out for final, unbiased performance evaluation.

In [2]:
# Data Splitting Configuration
TRAIN_SPLIT = 0.7    # 70% for training
VAL_SPLIT = 0.15     # 15% for validation  
TEST_SPLIT = 0.15    # 15% for testing

# Validate split ratios
if abs(TRAIN_SPLIT + VAL_SPLIT + TEST_SPLIT - 1.0) > 1e-6:
    raise ValueError(f"Split ratios must sum to 1.0, got {TRAIN_SPLIT + VAL_SPLIT + TEST_SPLIT}")

# Calculate split indices
n = len(df_raw)
train_end_idx = int(n * TRAIN_SPLIT)
val_end_idx = int(n * (TRAIN_SPLIT + VAL_SPLIT))

# Validate indices create meaningful datasets
min_dataset_size = LOOKBACK + HORIZON
if train_end_idx < min_dataset_size:
    raise ValueError(f"Training set too small ({train_end_idx}) for sequence requirements ({min_dataset_size})")
if (val_end_idx - train_end_idx) < min_dataset_size:
    raise ValueError(f"Validation set too small ({val_end_idx - train_end_idx}) for sequence requirements ({min_dataset_size})")
if (n - val_end_idx) < min_dataset_size:
    raise ValueError(f"Test set too small ({n - val_end_idx}) for sequence requirements ({min_dataset_size})")

# The actual split is performed in the next step, after the soft sensor columns are added
# Perform the chronological split
# try:
#     df_train_raw = df_raw.iloc[:train_end_idx].copy()
#     df_val_raw = df_raw.iloc[train_end_idx:val_end_idx].copy()
#     df_test_raw = df_raw.iloc[val_end_idx:].copy()
    
#     print(f"✓ Chronological split completed:")
#     print(f"  Training set:   {df_train_raw.shape[0]:,} rows ({TRAIN_SPLIT:.1%})")
#     print(f"  Validation set: {df_val_raw.shape[0]:,} rows ({VAL_SPLIT:.1%})")
#     print(f"  Test set:       {df_test_raw.shape[0]:,} rows ({TEST_SPLIT:.1%})")
    
#     # Validate no data overlap
#     assert len(set(df_train_raw.index) & set(df_val_raw.index)) == 0, "Training/validation overlap detected"
#     assert len(set(df_val_raw.index) & set(df_test_raw.index)) == 0, "Validation/test overlap detected"
#     assert len(set(df_train_raw.index) & set(df_test_raw.index)) == 0, "Training/test overlap detected"
    
#     print("✓ Data integrity validation passed - no temporal leakage detected")
    
# except Exception as e:
#     raise RuntimeError(f"Data splitting failed: {e}")

--- 
## 3. Hybrid Modeling: Engineering Soft Sensors

A purely data-driven model can be improved by injecting domain knowledge. We can compute 'soft sensors'—features derived from physical principles or mechanistic models—and add them to our dataset.

This helps the model by:
*   **Providing Context:** A feature like `specific_energy` is more informative than raw `spray_rate` and `carousel_speed` values alone.
*   **Improving Generalization:** Models trained with physics-informed features tend to perform better on unseen data.

We will create two simplified soft sensors as a demonstration.

In [3]:
def add_soft_sensors(df):
    """Calculate and add soft sensor columns to the DataFrame.
    
    Args:
        df: DataFrame containing CPP columns
        
    Returns:
        DataFrame with added soft sensor columns
        
    Raises:
        ValueError: If required columns are missing
    """
    # Validate required columns exist
    required_cols = {'spray_rate', 'carousel_speed'}
    missing_cols = required_cols - set(df.columns)
    if missing_cols:
        raise ValueError(f"Missing required columns for soft sensors: {missing_cols}")
    
    df = df.copy()  # Avoid modifying original DataFrame
    
    try:
        # Proxy for Specific Energy (SE): relates energy input to material throughput
        # A more complex model would use torque, but we use a proxy.
        df['specific_energy'] = (df['spray_rate'] * df['carousel_speed']) / 1000.0
        
        # Proxy for Froude Number (Fr): dimensionless ratio of centrifugal to gravitational forces (mixing intensity)
        # For rotating equipment Fr = (speed^2 * radius) / g. We omit the radius and use speed^2 / g.
        df['froude_number_proxy'] = (df['carousel_speed']**2) / 9.81
        
        # Validate soft sensor calculations
        if df['specific_energy'].isnull().any() or df['froude_number_proxy'].isnull().any():
            raise ValueError("Soft sensor calculations resulted in null values")
            
        print(f"✓ Soft sensors added successfully: specific_energy, froude_number_proxy")
        return df
        
    except Exception as e:
        raise RuntimeError(f"Soft sensor calculation failed: {e}")

# Add soft sensors to raw data BEFORE splitting
try:
    df_raw = add_soft_sensors(df_raw)
    print("✓ Soft sensors added to raw data")
    print(f"Updated columns: {df_raw.columns.tolist()}")
    
    # Save the updated raw data with soft sensors
    df_raw.to_csv(RAW_DATA_FILE, index=False)
    print(f"✓ Updated raw data with soft sensors saved to '{RAW_DATA_FILE}'")
    
except Exception as e:
    raise RuntimeError(f"Failed to add soft sensors: {e}")

✓ Soft sensors added successfully: specific_energy, froude_number_proxy
✓ Soft sensors added to raw data
Updated columns: ['spray_rate', 'air_flow', 'carousel_speed', 'd50', 'lod', 'specific_energy', 'froude_number_proxy']
✓ Updated raw data with soft sensors saved to '../data/granulation_data_raw.csv'
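As a quick sanity check, the two proxy formulas can be applied by hand to the first row shown by `df_raw.head()` in Section 1. The scalar helper `soft_sensors` below is a hypothetical stand-in that mirrors the arithmetic of `add_soft_sensors`.

```python
G = 9.81  # gravitational acceleration (m/s^2), the divisor used in add_soft_sensors

def soft_sensors(spray_rate, carousel_speed):
    """Scalar version of the two soft sensor proxies."""
    specific_energy = spray_rate * carousel_speed / 1000.0
    froude_number_proxy = carousel_speed ** 2 / G
    return specific_energy, froude_number_proxy

# Inputs copied from row 0 of df_raw.head()
se, fr = soft_sensors(spray_rate=139.865848, carousel_speed=23.11989)

print(f"specific_energy     = {se:.6f}")  # df_raw.head() lists 3.233683
print(f"froude_number_proxy = {fr:.6f}")  # df_raw.head() lists 54.488209
```

Small differences in the last digits are expected, because the displayed inputs are themselves rounded.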


In [4]:
# Split the data now that df_raw includes the soft sensors (the previous step only computed indices)
# Data Splitting Configuration  
TRAIN_SPLIT = 0.7    # 70% for training
VAL_SPLIT = 0.15     # 15% for validation  
TEST_SPLIT = 0.15    # 15% for testing

# Validate split ratios
if abs(TRAIN_SPLIT + VAL_SPLIT + TEST_SPLIT - 1.0) > 1e-6:
    raise ValueError(f"Split ratios must sum to 1.0, got {TRAIN_SPLIT + VAL_SPLIT + TEST_SPLIT}")

# Calculate split indices
n = len(df_raw)
train_end_idx = int(n * TRAIN_SPLIT)
val_end_idx = int(n * (TRAIN_SPLIT + VAL_SPLIT))

# Validate indices create meaningful datasets
min_dataset_size = LOOKBACK + HORIZON
if train_end_idx < min_dataset_size:
    raise ValueError(f"Training set too small ({train_end_idx}) for sequence requirements ({min_dataset_size})")
if (val_end_idx - train_end_idx) < min_dataset_size:
    raise ValueError(f"Validation set too small ({val_end_idx - train_end_idx}) for sequence requirements ({min_dataset_size})")
if (n - val_end_idx) < min_dataset_size:
    raise ValueError(f"Test set too small ({n - val_end_idx}) for sequence requirements ({min_dataset_size})")

# Perform the chronological split WITH soft sensors
try:
    df_train_raw_final = df_raw.iloc[:train_end_idx].copy()
    df_val_raw_final = df_raw.iloc[train_end_idx:val_end_idx].copy()
    df_test_raw_final = df_raw.iloc[val_end_idx:].copy()
    
    print(f"✓ Chronological split completed (with soft sensors):")
    print(f"  Training set:   {df_train_raw_final.shape[0]:,} rows ({TRAIN_SPLIT:.1%})")
    print(f"  Validation set: {df_val_raw_final.shape[0]:,} rows ({VAL_SPLIT:.1%})")
    print(f"  Test set:       {df_test_raw_final.shape[0]:,} rows ({TEST_SPLIT:.1%})")
    
    # Validate no data overlap
    assert len(set(df_train_raw_final.index) & set(df_val_raw_final.index)) == 0, "Training/validation overlap detected"
    assert len(set(df_val_raw_final.index) & set(df_test_raw_final.index)) == 0, "Validation/test overlap detected"
    assert len(set(df_train_raw_final.index) & set(df_test_raw_final.index)) == 0, "Training/test overlap detected"
    
    # Validate soft sensors are present in all splits
    required_cols = set(CMA_COLS + CPP_COLS)
    for name, df in [("training", df_train_raw_final), ("validation", df_val_raw_final), ("test", df_test_raw_final)]:
        missing_cols = required_cols - set(df.columns)
        if missing_cols:
            raise ValueError(f"{name} split missing required columns: {missing_cols}")
    
    print("✓ Data integrity validation passed - no temporal leakage detected")
    print("✓ All splits contain required soft sensor columns")
    
except Exception as e:
    raise RuntimeError(f"Data splitting failed: {e}")

# Save the final splits with soft sensors
RAW_TRAIN_FILE = os.path.join(DATA_DIR, 'train_data_raw.csv')
RAW_VAL_FILE = os.path.join(DATA_DIR, 'validation_data_raw.csv') 
RAW_TEST_FILE = os.path.join(DATA_DIR, 'test_data_raw.csv')

try:
    # Save final datasets with soft sensors included
    df_train_raw_final.to_csv(RAW_TRAIN_FILE, index=False)
    df_val_raw_final.to_csv(RAW_VAL_FILE, index=False)
    df_test_raw_final.to_csv(RAW_TEST_FILE, index=False)
    
    print(f"✓ Final raw datasets with soft sensors saved:")
    print(f"  Training: '{RAW_TRAIN_FILE}' ({len(df_train_raw_final.columns)} columns)")
    print(f"  Validation: '{RAW_VAL_FILE}' ({len(df_val_raw_final.columns)} columns)")
    print(f"  Test: '{RAW_TEST_FILE}' ({len(df_test_raw_final.columns)} columns)")
    
    # Verify file integrity and column presence
    for name, filepath in [("Training", RAW_TRAIN_FILE), ("Validation", RAW_VAL_FILE), ("Test", RAW_TEST_FILE)]:
        if not os.path.exists(filepath):
            raise FileNotFoundError(f"Failed to create {filepath}")
        if os.path.getsize(filepath) == 0:
            raise ValueError(f"Created empty file: {filepath}")
        
        # Verify all required columns are present
        test_df = pd.read_csv(filepath, nrows=1)
        required_columns = set(CMA_COLS + CPP_COLS)
        missing_columns = required_columns - set(test_df.columns)
        if missing_columns:
            raise ValueError(f"{name} dataset missing required columns: {missing_columns}")
    
    print("✓ File integrity and column verification passed")
    
except Exception as e:
    raise RuntimeError(f"Failed to save final datasets: {e}")

print("\n" + "="*60)
print("DATA PREPROCESSING SUMMARY")
print("="*60)
print(f"📊 Raw data generated: {len(df_raw):,} total records")
print(f"🧪 Soft sensors added: specific_energy, froude_number_proxy") 
print(f"📈 Train/Val/Test split: {TRAIN_SPLIT:.0%}/{VAL_SPLIT:.0%}/{TEST_SPLIT:.0%}")
print(f"💾 Unscaled data with soft sensors saved to: {DATA_DIR}")
print(f"🎯 Ready for model training with LOOKBACK={LOOKBACK}, HORIZON={HORIZON}")
print(f"📋 All columns: {df_raw.columns.tolist()}")
print("="*60)

✓ Chronological split completed (with soft sensors):
  Training set:   10,500 rows (70.0%)
  Validation set: 2,250 rows (15.0%)
  Test set:       2,250 rows (15.0%)
✓ Data integrity validation passed - no temporal leakage detected
✓ All splits contain required soft sensor columns
✓ Final raw datasets with soft sensors saved:
  Training: '../data/train_data_raw.csv' (7 columns)
  Validation: '../data/validation_data_raw.csv' (7 columns)
  Test: '../data/test_data_raw.csv' (7 columns)
✓ File integrity and column verification passed

DATA PREPROCESSING SUMMARY
📊 Raw data generated: 15,000 total records
🧪 Soft sensors added: specific_energy, froude_number_proxy
📈 Train/Val/Test split: 70%/15%/15%
💾 Unscaled data with soft sensors saved to: ../data
🎯 Ready for model training with LOOKBACK=36, HORIZON=72
📋 All columns: ['spray_rate', 'air_flow', 'carousel_speed', 'd50', 'lod', 'specific_energy', 'froude_number_proxy']
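
The splits saved above are unscaled. When scaling is applied downstream, the leakage argument from Section 2 applies again: the scaling statistics should come from the training split only. The following is a minimal sketch under the assumption of min-max scaling; the scaler actually used by this project is not shown here, and `fit_min_max` and `apply_min_max` are hypothetical helpers.

```python
import numpy as np

def fit_min_max(train):
    """Compute per-column offset and span from the training split only."""
    low = train.min(axis=0)
    span = train.max(axis=0) - low
    span = np.where(span == 0, 1.0, span)  # guard against constant columns
    return low, span

def apply_min_max(x, low, span):
    """Scale any split with statistics that were fitted on the training split."""
    return (x - low) / span

rng = np.random.default_rng(0)
series = rng.normal(size=(1000, 3)).cumsum(axis=0)  # toy drifting series
train, val = series[:700], series[700:]

low, span = fit_min_max(train)
train_scaled = apply_min_max(train, low, span)
val_scaled = apply_min_max(val, low, span)

print("train range:", train_scaled.min(), train_scaled.max())
print("val range:  ", val_scaled.min(), val_scaled.max())
```

The training split maps onto [0, 1] by construction. Validation and test values may fall outside that interval, which is expected and is itself useful information about distribution shift.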


---
## 5. Creating a PyTorch Time-Series Dataset

Our model needs data in a specific format: sequences of past and future information. A standard PyTorch `DataLoader` expects to receive data from a `Dataset` object. We will create a custom `Dataset` class that, given an index `i`, returns a complete sample tuple:

`(past_CMAs, past_CPPs, future_CPPs, future_CMAs_target)`

This class is defined in `src/dataset.py` for reusability and imported in the cell below.
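
The source of `GranulationDataset` is not shown in this notebook, so the following is only a minimal NumPy sketch of the windowing such a class might perform in `__getitem__`. It assumes stride-1 windows that start at row `i`. The names `num_windows` and `make_window` are hypothetical.

```python
import numpy as np

def num_windows(n_rows, lookback, horizon):
    """Number of complete (lookback + horizon) windows in a series of n_rows rows."""
    return max(0, n_rows - lookback - horizon + 1)

def make_window(cmas, cpps, i, lookback, horizon):
    """Return (past_cmas, past_cpps, future_cpps, future_cmas_target) for window start i."""
    past_end = i + lookback
    future_end = past_end + horizon
    if i < 0 or future_end > len(cmas):
        raise IndexError(f"window starting at {i} does not fit in {len(cmas)} rows")
    return (
        cmas[i:past_end],           # what was measured during the lookback window
        cpps[i:past_end],           # which inputs were applied during the lookback window
        cpps[past_end:future_end],  # planned inputs over the prediction horizon
        cmas[past_end:future_end],  # the targets the model should predict
    )

# Toy arrays with 2 CMA columns and 5 CPP columns
n_rows, lookback, horizon = 200, 36, 72
cmas = np.arange(n_rows * 2, dtype=np.float32).reshape(n_rows, 2)
cpps = np.arange(n_rows * 5, dtype=np.float32).reshape(n_rows, 5)

past_cmas, past_cpps, future_cpps, future_cmas_target = make_window(cmas, cpps, 0, lookback, horizon)
print(past_cmas.shape, past_cpps.shape, future_cpps.shape, future_cmas_target.shape)
print(num_windows(10500, lookback, horizon))
```

Under this assumption a 10,500-row training split yields 10,393 windows, which is 163 batches of 64 and is consistent with the batch count reported in the output below.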

In [5]:
from torch.utils.data import DataLoader
from V1.src.dataset import GranulationDataset

# Load the final datasets with soft sensors included
try:
    df_train_raw = pd.read_csv(RAW_TRAIN_FILE)
    df_val_raw = pd.read_csv(RAW_VAL_FILE) 
    df_test_raw = pd.read_csv(RAW_TEST_FILE)
    
    print(f"✓ Loaded final datasets for PyTorch processing")
    
    # Display column information for verification
    print(f"  Training dataset columns: {df_train_raw.columns.tolist()}")
    
    # Validate all required columns exist
    all_required_cols = set(CMA_COLS + CPP_COLS)
    for name, df in [("training", df_train_raw), ("validation", df_val_raw), ("test", df_test_raw)]:
        missing_cols = all_required_cols - set(df.columns)
        if missing_cols:
            raise ValueError(f"{name} dataset missing required columns: {missing_cols}")
    
    print(f"✓ Column validation passed for all datasets")
    print(f"✓ All datasets contain required columns: {CMA_COLS + CPP_COLS}")
    
except Exception as e:
    raise RuntimeError(f"Failed to load datasets for PyTorch processing: {e}")

# --- Create Datasets and DataLoaders ---
try:
    # Note: Scaling will be handled in the dataset class or training loop
    # This preserves the original unscaled data for analysis
    train_dataset = GranulationDataset(df_train_raw, CMA_COLS, CPP_COLS, LOOKBACK, HORIZON)
    val_dataset = GranulationDataset(df_val_raw, CMA_COLS, CPP_COLS, LOOKBACK, HORIZON)
    test_dataset = GranulationDataset(df_test_raw, CMA_COLS, CPP_COLS, LOOKBACK, HORIZON)

    train_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True, num_workers=4)
    val_loader = DataLoader(val_dataset, batch_size=BATCH_SIZE, shuffle=False, num_workers=4)
    test_loader = DataLoader(test_dataset, batch_size=BATCH_SIZE, shuffle=False, num_workers=4)

    print(f"✓ Created PyTorch datasets and dataloaders:")
    print(f"  Training: {len(train_loader)} batches of size {BATCH_SIZE}")
    print(f"  Validation: {len(val_loader)} batches of size {BATCH_SIZE}")
    print(f"  Test: {len(test_loader)} batches of size {BATCH_SIZE}")
    
except Exception as e:
    raise RuntimeError(f"Failed to create PyTorch datasets: {e}")

# --- Verify a sample ---
try:
    past_cmas, past_cpps, future_cpps, future_cmas_target = next(iter(train_loader))

    print(f"\n--- Batch Shape Verification ---")
    print(f"Past CMAs shape:         {past_cmas.shape}")
    print(f"Past CPPs shape:         {past_cpps.shape}")
    print(f"Future CPPs shape:       {future_cpps.shape}")
    print(f"Future CMAs target shape: {future_cmas_target.shape}")

    # Validate expected shapes
    expected_shapes = {
        'past_cmas': (BATCH_SIZE, LOOKBACK, len(CMA_COLS)),
        'past_cpps': (BATCH_SIZE, LOOKBACK, len(CPP_COLS)), 
        'future_cpps': (BATCH_SIZE, HORIZON, len(CPP_COLS)),
        'future_cmas_target': (BATCH_SIZE, HORIZON, len(CMA_COLS))
    }
    
    actual_shapes = {
        'past_cmas': tuple(past_cmas.shape),
        'past_cpps': tuple(past_cpps.shape),
        'future_cpps': tuple(future_cpps.shape), 
        'future_cmas_target': tuple(future_cmas_target.shape)
    }
    
    for tensor_name, expected_shape in expected_shapes.items():
        actual_shape = actual_shapes[tensor_name]
        if actual_shape != expected_shape:
            raise ValueError(f"{tensor_name} shape mismatch: expected {expected_shape}, got {actual_shape}")
    
    print(f"✓ All tensor shapes match expected dimensions")
    
    # Check data value ranges (should be unscaled)
    print(f"\n--- Data Range Verification (Unscaled) ---")
    print(f"d50 range: [{past_cmas[:,:,0].min():.1f}, {past_cmas[:,:,0].max():.1f}] μm")
    print(f"lod range: [{past_cmas[:,:,1].min():.3f}, {past_cmas[:,:,1].max():.3f}] %")
    print(f"spray_rate range: [{past_cpps[:,:,0].min():.1f}, {past_cpps[:,:,0].max():.1f}] g/min")
    
    # Check soft sensor ranges
    specific_energy_idx = CPP_COLS.index('specific_energy')
    froude_idx = CPP_COLS.index('froude_number_proxy')
    print(f"specific_energy range: [{past_cpps[:,:,specific_energy_idx].min():.3f}, {past_cpps[:,:,specific_energy_idx].max():.3f}]")
    print(f"froude_number_proxy range: [{past_cpps[:,:,froude_idx].min():.3f}, {past_cpps[:,:,froude_idx].max():.3f}]")
    
except Exception as e:
    raise RuntimeError(f"Batch verification failed: {e}")

✓ Loaded final datasets for PyTorch processing
  Training dataset columns: ['spray_rate', 'air_flow', 'carousel_speed', 'd50', 'lod', 'specific_energy', 'froude_number_proxy']
✓ Column validation passed for all datasets
✓ All datasets contain required columns: ['d50', 'lod', 'spray_rate', 'air_flow', 'carousel_speed', 'specific_energy', 'froude_number_proxy']
✓ Created PyTorch datasets and dataloaders:
  Training: 163 batches of size 64
  Validation: 34 batches of size 64
  Test: 34 batches of size 64

--- Batch Shape Verification ---
Past CMAs shape:         torch.Size([64, 36, 2])
Past CPPs shape:         torch.Size([64, 36, 5])
Future CPPs shape:       torch.Size([64, 72, 5])
Future CMAs target shape: torch.Size([64, 72, 2])
✓ All tensor shapes match expected dimensions

--- Data Range Verification (Unscaled) ---
d50 range: [291.8, 623.5] μm
lod range: [0.503, 7.613] %
spray_rate range: [80.5, 179.7] g/min
specific_energy range: [2.022, 6.435]
froude_number_proxy range: [43.677, 156