# Notebook 2: Data Wrangling and Hybrid Preprocessing

**Project:** `PharmaControl-Pro`
**Goal:** Use our advanced simulator to generate a large, diverse dataset suitable for training a dynamic model. We will then perform the crucial preprocessing steps: chronological splitting, soft-sensor feature engineering, feature scaling, and creating a PyTorch `Dataset` to serve data in the correct sequence format.

### Table of Contents
1. [System Identification: Generating Rich Data](#1.-System-Identification:-Generating-Rich-Data)
2. [Time-Series Splitting: Avoiding Data Leakage](#2.-Time-Series-Splitting:-Avoiding-Data-Leakage)
3. [Hybrid Modeling: Engineering Soft Sensors](#3.-Hybrid-Modeling:-Engineering-Soft-Sensors)
4. [Data Scaling](#4.-Data-Scaling)
5. [Creating a PyTorch Time-Series Dataset](#5.-Creating-a-PyTorch-Time-Series-Dataset)

--- 
## 1. System Identification: Generating Rich Data

To train a model that understands process dynamics, we need to show it a wide variety of conditions. A simple step-change experiment is not enough. We need to 'excite' the system by changing the inputs (CPPs) frequently and randomly within their operating ranges.

Designing such an excitation experiment is the data-collection step of **System Identification**, the practice of building dynamic models from measured input-output data. We will run our simulator for a long duration, randomly redrawing the CPPs at regular intervals, so that the resulting time series captures how the system responds to a wide range of inputs and transitions.
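The excitation idea can be sketched in isolation, without the simulator. The helper below is hypothetical (`make_excitation` and its `seed` argument are not part of this notebook); it draws one random level per hold interval and repeats it, and it uses a seeded generator so that the same signal can be regenerated later. The ranges and the 75-step hold match the configuration used in this notebook; the 300-step length is only for illustration.

```python
import numpy as np

# Hypothetical helper (not part of the notebook): builds a piecewise-constant
# random excitation signal for each CPP.
def make_excitation(ranges, n_steps, hold, seed=0):
    rng = np.random.default_rng(seed)   # seeded generator -> reproducible draws
    n_segments = -(-n_steps // hold)    # ceiling division
    signals = {}
    for name, (low, high) in ranges.items():
        levels = rng.uniform(low, high, size=n_segments)
        # Hold each drawn level for `hold` consecutive steps, then trim.
        signals[name] = np.repeat(levels, hold)[:n_steps]
    return signals

ranges = {
    'spray_rate': (80.0, 180.0),
    'air_flow': (400.0, 700.0),
    'carousel_speed': (20.0, 40.0),
}
signals = make_excitation(ranges, n_steps=300, hold=75, seed=42)
print({name: sig.shape for name, sig in signals.items()})
```

Fixing the seed is a design choice for reproducibility; the cell below draws from NumPy's global random state instead, so each fresh generation produces a different dataset.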

In [1]:
import pandas as pd
import numpy as np
import os
from tqdm.notebook import tqdm
from V1.src.plant_simulator import AdvancedPlantSimulator

# --- Configuration ---
DATA_DIR = '../data'
DATA_FILE = os.path.join(DATA_DIR, 'granulation_data.csv')
SIMULATION_STEPS = 15000
CPP_CHANGE_INTERVAL = 75 # Change CPPs every 75 steps

# Define operating ranges for our CPPs
CPP_RANGES = {
    'spray_rate': (80.0, 180.0),
    'air_flow': (400.0, 700.0),
    'carousel_speed': (20.0, 40.0)
}

# Create data directory if it doesn't exist
os.makedirs(DATA_DIR, exist_ok=True)

# --- Data Generation ---
if os.path.exists(DATA_FILE):
    print(f"Dataset '{DATA_FILE}' already exists. Skipping generation.")
    df_raw = pd.read_csv(DATA_FILE)
else:
    print("Generating new dataset...")
    plant = AdvancedPlantSimulator()
    
    # The first set of CPPs is drawn inside the loop at t = 0,
    # because 0 % CPP_CHANGE_INTERVAL == 0.
    current_cpps = None

    log = []
    for t in tqdm(range(SIMULATION_STEPS)):
        # Randomly change CPPs at specified intervals
        if t % CPP_CHANGE_INTERVAL == 0:
            current_cpps = {key: np.random.uniform(min_v, max_v) for key, (min_v, max_v) in CPP_RANGES.items()}
        
        state = plant.step(current_cpps)
        record = {**current_cpps, **state}
        log.append(record)

    df_raw = pd.DataFrame(log)
    df_raw.to_csv(DATA_FILE, index=False)
    print(f"Dataset with {len(df_raw)} records saved to '{DATA_FILE}'.")

df_raw.head()

Dataset '../data/granulation_data.csv' already exists. Skipping generation.


| | spray_rate | air_flow | carousel_speed | d50 | lod |
|---|---|---|---|---|---|
| 0 | 118.028124 | 425.265825 | 24.669616 | 400.398937 | 1.585391 |
| 1 | 118.028124 | 425.265825 | 24.669616 | 419.201792 | 1.578726 |
| 2 | 118.028124 | 425.265825 | 24.669616 | 406.536867 | 1.61126 |
| 3 | 118.028124 | 425.265825 | 24.669616 | 415.168112 | 1.589788 |
| 4 | 118.028124 | 425.265825 | 24.669616 | 416.203793 | 1.689158 |


---
## 2. Time-Series Splitting: Avoiding Data Leakage

This is one of the most critical steps in any time-series modeling project. **You cannot use a random split (like `sklearn.model_selection.train_test_split`) for time-series data.**

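Why? A random split shuffles the rows, so the model could be trained on a sample from time `t` and tested on one from time `t-1`. That is 'cheating': the model has already seen what happens after the test point. Because consecutive samples in a process time series are highly correlated, most shuffled test points also sit right next to nearly identical training points. This **data leakage** leads to overly optimistic performance metrics and models that fail badly in real-world deployment.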

The correct approach is a **chronological split**. We must train the model on the past and validate/test it on the future, mimicking how it will be used in production.

We will split our data as follows:
*   **Training Set (70%):** The earliest data, used for model training.
*   **Validation Set (15%):** The next block of data, used for hyperparameter tuning.
*   **Test Set (15%):** The most recent data, held out for final, unbiased performance evaluation.
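To make the leakage argument concrete, the sketch below compares the two strategies on a stand-in index array (1000 rows, 85/15 cut, seed chosen arbitrarily; none of these numbers come from the actual dataset). It checks that a chronological split keeps all training rows before all test rows, and measures how many randomly selected test rows have an immediate temporal neighbor in the training set.

```python
import numpy as np

n = 1000
idx = np.arange(n)  # stand-in for time-ordered row indices

# Chronological split: every training index precedes every test index.
split = int(n * 0.85)
chrono_train, chrono_test = idx[:split], idx[split:]

# Random split: shuffle first, then cut (seed chosen arbitrarily).
rng = np.random.default_rng(0)
shuffled = rng.permutation(idx)
rand_train, rand_test = shuffled[:split], shuffled[split:]

# Fraction of random-split test rows whose previous or next row is in training.
train_set = set(rand_train.tolist())
has_train_neighbor = [
    (i - 1 in train_set) or (i + 1 in train_set) for i in rand_test.tolist()
]
neighbor_fraction = float(np.mean(has_train_neighbor))

print("chronological: train strictly before test ->",
      bool(chrono_train.max() < chrono_test.min()))
print("random: fraction of test rows adjacent to a training row ->",
      neighbor_fraction)
```

With the random split, the large majority of test rows are one time step away from a training row, which is exactly the situation the chronological split avoids.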

In [2]:
# Calculate split indices
n = len(df_raw)
train_end_idx = int(n * 0.7)
val_end_idx = int(n * 0.85)

# Perform the chronological split
df_train = df_raw.iloc[:train_end_idx].copy()
df_val = df_raw.iloc[train_end_idx:val_end_idx].copy()
df_test = df_raw.iloc[val_end_idx:].copy()

print(f"Training set shape:   {df_train.shape}")
print(f"Validation set shape: {df_val.shape}")
print(f"Test set shape:       {df_test.shape}")

Training set shape:   (10500, 5)
Validation set shape: (2250, 5)
Test set shape:       (2250, 5)


--- 
## 3. Hybrid Modeling: Engineering Soft Sensors

As outlined in the paper, a purely data-driven model can be improved by injecting domain knowledge. We can compute 'soft sensors'—features derived from physical principles or mechanistic models—and add them to our dataset.

This helps the model by:
*   **Providing Context:** A feature like `specific_energy` is more informative than raw `spray_rate` and `carousel_speed` values alone.
*   **Improving Generalization:** Models trained with physics-informed features tend to perform better on unseen data.

We will create two simplified soft sensors as a demonstration.

In [3]:
def add_soft_sensors(df):
    """Calculates and adds soft sensor columns to the DataFrame."""
    # Proxy for Specific Energy (SE): relates energy input to material throughput
    # A more complex model would use torque, but we use a proxy.
    df['specific_energy'] = (df['spray_rate'] * df['carousel_speed']) / 1000.0
    
    # Proxy for Froude Number (Fr): dimensionless ratio of inertial to gravitational
    # forces, commonly used to characterize mixing intensity.
    # Fr = v^2 / (g * L). We drop the characteristic length L and divide only by g = 9.81 m/s^2.
    df['froude_number_proxy'] = (df['carousel_speed']**2) / 9.81
    return df

df_train = add_soft_sensors(df_train)
df_val = add_soft_sensors(df_val)
df_test = add_soft_sensors(df_test)

print("Soft sensors added. New training set columns:")
print(df_train.columns.tolist())

Soft sensors added. New training set columns:
['spray_rate', 'air_flow', 'carousel_speed', 'd50', 'lod', 'specific_energy', 'froude_number_proxy']
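As a worked check of the two formulas, the snippet below applies them by hand to the CPP values from the first row of `df_raw.head()` shown earlier. It only restates the arithmetic of `add_soft_sensors`; the variable names are illustrative.

```python
# CPP values taken from the first row of the raw dataset shown above.
spray_rate = 118.028124
carousel_speed = 24.669616

# Same formulas as in add_soft_sensors, written out for a single row.
specific_energy = (spray_rate * carousel_speed) / 1000.0
froude_number_proxy = (carousel_speed ** 2) / 9.81

print(f"specific_energy     = {specific_energy:.3f}")      # about 2.912
print(f"froude_number_proxy = {froude_number_proxy:.3f}")  # about 62.038
```

Both values are unscaled proxies, so their absolute magnitudes carry no physical meaning on their own. The constant divisors (1000 and 9.81) also have no effect once the columns are min-max scaled, because that scaling is invariant to multiplying a feature by a positive constant.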


--- 
## 4. Data Scaling

Neural networks perform best when input features are on a similar scale, typically between 0 and 1 or with a mean of 0 and standard deviation of 1. We will use `MinMaxScaler` to scale our data to the `[0, 1]` range.

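**Crucial Rule:** The scaler must be **fitted ONLY on the training data**. We then use this fitted scaler to transform the validation and test sets, which prevents any information from those sets from 'leaking' into the training process. One consequence is that scaled validation and test values may fall slightly outside `[0, 1]` whenever they exceed the range seen during training; this is expected and should not be 'fixed' by refitting.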
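The rule can be illustrated with plain NumPy on made-up numbers. The sketch applies the same formula that `MinMaxScaler` uses with its default `[0, 1]` range, `(x - min) / (max - min)`, taking `min` and `max` from the training values only. The arrays and the helper names `scale` and `unscale` are assumptions for illustration, not part of the notebook.

```python
import numpy as np

train = np.array([80.0, 100.0, 150.0, 180.0])  # made-up training values
val = np.array([70.0, 120.0, 190.0])           # made-up validation values

# The statistics come from the training split only.
train_min, train_max = train.min(), train.max()

def scale(x):
    """Min-max scale using the training statistics."""
    return (x - train_min) / (train_max - train_min)

def unscale(x_scaled):
    """Map scaled values back to physical units."""
    return x_scaled * (train_max - train_min) + train_min

train_scaled = scale(train)
val_scaled = scale(val)

print(train_scaled)          # spans exactly 0 to 1
print(val_scaled)            # 70 maps below 0, 190 maps above 1
print(unscale(val_scaled))   # recovers the original validation values
```

The inverse mapping is what a deployed model needs in order to report predictions in physical units, which is why the fitted scalers are worth persisting alongside the data.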

In [4]:
from sklearn.preprocessing import MinMaxScaler
import joblib

FEATURE_COLUMNS = df_train.columns.tolist()
SCALER_FILE = os.path.join(DATA_DIR, 'scalers.joblib')

# Initialize a dictionary to hold a scaler for each column
scalers = {col: MinMaxScaler() for col in FEATURE_COLUMNS}

# Fit the scalers ONLY on the training data
for col, scaler in scalers.items():
    df_train[col] = scaler.fit_transform(df_train[[col]])

# Transform the validation and test data using the FITTED scalers
for col, scaler in scalers.items():
    df_val[col] = scaler.transform(df_val[[col]])
    df_test[col] = scaler.transform(df_test[[col]])

# Save the fitted scalers for later use (e.g., in deployment)
joblib.dump(scalers, SCALER_FILE)

print("Data scaled successfully. Scalers saved.")
df_train.describe()

Data scaled successfully. Scalers saved.


| | spray_rate | air_flow | carousel_speed | d50 | lod | specific_energy | froude_number_proxy |
|---|---|---|---|---|---|---|---|
| count | 10500.0 | 10500.0 | 10500.0 | 10500.0 | 10500.0 | 10500.0 | 10500.0 |
| mean | 0.463489 | 0.535058 | 0.503027 | 0.501117 | 0.523148 | 0.425269 | 0.450857 |
| std | 0.293077 | 0.286889 | 0.304963 | 0.234866 | 0.232319 | 0.241163 | 0.304186 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 0.207271 | 0.328495 | 0.237472 | 0.312424 | 0.321943 | 0.226837 | 0.177303 |
| 50% | 0.431652 | 0.520181 | 0.506139 | 0.489725 | 0.523881 | 0.370006 | 0.423088 |
| 75% | 0.703208 | 0.795711 | 0.75935 | 0.706873 | 0.723336 | 0.620675 | 0.698626 |
| max | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |


In [5]:
TRAIN_DATA_FILE = os.path.join(DATA_DIR, 'train_data.csv')
df_train.to_csv(TRAIN_DATA_FILE, index=False)
print(f"Training data saved to '{TRAIN_DATA_FILE}'.")
VALIDATION_DATA_FILE = os.path.join(DATA_DIR, 'validation_data.csv')
df_val.to_csv(VALIDATION_DATA_FILE, index=False)
print(f"Validation data saved to '{VALIDATION_DATA_FILE}'.")
TEST_DATA_FILE = os.path.join(DATA_DIR, 'test_data.csv')
df_test.to_csv(TEST_DATA_FILE, index=False)
print(f"Test data saved to '{TEST_DATA_FILE}'.")

Training data saved to '../data/train_data.csv'.
Validation data saved to '../data/validation_data.csv'.
Test data saved to '../data/test_data.csv'.


---
## 5. Creating a PyTorch Time-Series Dataset

Our model needs data in a specific format: sequences of past and future information. A standard PyTorch `DataLoader` expects to receive data from a `Dataset` object. We will create a custom `Dataset` class that, given an index `i`, returns a complete sample tuple:

`(past_CMAs, past_CPPs, future_CPPs, future_CMAs_target)`

We will define this class in `src/dataset.py` for reusability.
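Before reading the class, it may help to preview the window indexing on a toy series. The sketch below uses made-up sizes (10 rows, lookback 3, horizon 2) and a single integer column, so each slice can be read off directly; it mirrors the slicing pattern of a lookback window followed immediately by a horizon window.

```python
import numpy as np

N, L, H = 10, 3, 2        # toy sizes: rows, lookback, horizon
series = np.arange(N)     # stand-in for one time-ordered column

# A start index i is valid only if both windows fit inside the series.
n_samples = N - L - H + 1

windows = []
for i in range(n_samples):
    past = series[i : i + L]              # rows i .. i+L-1
    future = series[i + L : i + L + H]    # rows i+L .. i+L+H-1
    windows.append((past, future))
    print(i, past, future)
```

The last valid start index is `N - L - H`, whose future window ends exactly on the final row. Every sample therefore has a full-length past and future, with no padding required.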

In [6]:
%%writefile ../src/dataset.py
import torch
from torch.utils.data import Dataset
import pandas as pd

class GranulationDataset(Dataset):
    """
    Custom PyTorch Dataset for creating time-series sequences for the 
    granulation process predictive model.
    """
    def __init__(self, df, cma_cols, cpp_cols, lookback, horizon):
        self.df = df
        self.cma_cols = cma_cols
        self.cpp_cols = cpp_cols
        self.lookback = lookback
        self.horizon = horizon
        
        # Convert to numpy for faster slicing
        self.cma_data = df[cma_cols].to_numpy()
        self.cpp_data = df[cpp_cols].to_numpy()

    def __len__(self):
        # The number of possible start points for a complete sequence
        # (clamped at 0 so a too-short DataFrame yields an empty dataset)
        return max(0, len(self.df) - self.lookback - self.horizon + 1)

    def __getitem__(self, idx):
        # Define the slice boundaries for the sample
        past_start = idx
        past_end = idx + self.lookback
        future_end = past_end + self.horizon
        
        # --- Extract sequences ---
        # Historical CMAs (what we observed)
        past_cmas = self.cma_data[past_start:past_end, :]
        
        # Historical CPPs (what we did)
        past_cpps = self.cpp_data[past_start:past_end, :]
        
        # Future CPPs (what we plan to do)
        future_cpps = self.cpp_data[past_end:future_end, :]
        
        # Future CMAs (the ground truth we want to predict)
        future_cmas_target = self.cma_data[past_end:future_end, :]

        # Convert to PyTorch tensors
        return (
            torch.tensor(past_cmas, dtype=torch.float32),
            torch.tensor(past_cpps, dtype=torch.float32),
            torch.tensor(future_cpps, dtype=torch.float32),
            torch.tensor(future_cmas_target, dtype=torch.float32)
        )

Overwriting ../src/dataset.py


In [8]:
from torch.utils.data import DataLoader
from V1.src.dataset import GranulationDataset

# --- Configuration ---
LOOKBACK = 36 # L: Use 36 past steps (e.g., 3 minutes of data)
HORIZON = 72  # H: Predict 72 future steps (e.g., 6 minutes)
BATCH_SIZE = 64

# Define column groups
CMA_COLS = ['d50', 'lod']
CPP_COLS = ['spray_rate', 'air_flow', 'carousel_speed', 'specific_energy', 'froude_number_proxy']

# --- Create Datasets and DataLoaders ---
train_dataset = GranulationDataset(df_train, CMA_COLS, CPP_COLS, LOOKBACK, HORIZON)
val_dataset = GranulationDataset(df_val, CMA_COLS, CPP_COLS, LOOKBACK, HORIZON)
test_dataset = GranulationDataset(df_test, CMA_COLS, CPP_COLS, LOOKBACK, HORIZON)

train_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True, num_workers=4)
val_loader = DataLoader(val_dataset, batch_size=BATCH_SIZE, shuffle=False, num_workers=4)
test_loader = DataLoader(test_dataset, batch_size=BATCH_SIZE, shuffle=False, num_workers=4)

print(f"Created {len(train_loader)} training batches of size {BATCH_SIZE}.")
print(f"Created {len(val_loader)} validation batches of size {BATCH_SIZE}.")

# --- Verify a sample ---
past_cmas, past_cpps, future_cpps, future_cmas_target = next(iter(train_loader))

print("\n--- Verifying batch shapes ---")
print(f"Past CMAs shape:        {past_cmas.shape}")
print(f"Past CPPs shape:        {past_cpps.shape}")
print(f"Future CPPs shape:      {future_cpps.shape}")
print(f"Future CMAs (Target) shape: {future_cmas_target.shape}")

print("\nExpected Past CMAs shape:        (batch_size, lookback, num_cmas) -> (64, 36, 2)")
print("Expected Future CMAs (Target) shape: (batch_size, horizon, num_cmas) -> (64, 72, 2)")

Created 163 training batches of size 64.
Created 34 validation batches of size 64.



--- Verifying batch shapes ---
Past CMAs shape:        torch.Size([64, 36, 2])
Past CPPs shape:        torch.Size([64, 36, 5])
Future CPPs shape:      torch.Size([64, 72, 5])
Future CMAs (Target) shape: torch.Size([64, 72, 2])

Expected Past CMAs shape:        (batch_size, lookback, num_cmas) -> (64, 36, 2)
Expected Future CMAs (Target) shape: (batch_size, horizon, num_cmas) -> (64, 72, 2)
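The batch counts printed above can be reproduced by hand. The sketch below assumes the `DataLoader` default of keeping the final partial batch (`drop_last=False`) and uses the split sizes reported in Section 2; the helper name `n_batches` is illustrative.

```python
import math

LOOKBACK, HORIZON, BATCH_SIZE = 36, 72, 64

def n_batches(n_rows):
    """Return (number of samples, number of batches) for a split of n_rows rows."""
    n_samples = n_rows - LOOKBACK - HORIZON + 1
    return n_samples, math.ceil(n_samples / BATCH_SIZE)

train_samples, train_batches = n_batches(10500)
val_samples, val_batches = n_batches(2250)

print(train_samples, train_batches)  # 10393 163
print(val_samples, val_batches)      # 2143 34

# Size of the final, partial batch in each loader.
print(train_samples % BATCH_SIZE, val_samples % BATCH_SIZE)  # 25 31
```

Note that the last batch in each loader is smaller than 64, so any code that assumes a fixed batch dimension should either handle this case or set `drop_last=True`.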


### Conclusion

We have successfully transformed our raw simulated data into a clean, structured, and properly formatted input for our machine learning model. We have:
1. Generated a diverse dataset.
2. Performed a chronologically correct train/validation/test split.
3. Enriched the data with physics-informed soft sensors.
4. Scaled all features appropriately.
5. Encapsulated the complex sequencing logic into a reusable PyTorch `Dataset`.

We are now ready to move on to the most exciting part: building and training the predictive model in Notebook 3.