# Tutorial 2: Producing a Dataset for Prediction

This tutorial guides you through generating a dataset for **inference/prediction** mode, where target (IST) data is not available.

## Overview

The dataset generation follows the same structure as Tutorial 1:
1. **Target Pipeline (IST)**: ⚠️ **DIFFERENCE**: Instead of processing real IST data, we generate placeholders
2. **Input Pipeline (Process)**: Processes manufacturing parameters (same as Tutorial 1)

## Key Difference from Training

| Aspect | Tutorial 1 (Training) | Tutorial 2 (Prediction) |
|--------|----------------------|-------------------------|
| Step 1 | Process real IST data | Generate placeholder IST |
| Y values | Actual resistance deltas | Zeros (placeholders) |
| Stratified split | Enabled | Disabled |
| Purpose | Train model | Generate predictions |

## Prerequisites

Before starting, ensure you have:
- Installed all dependencies (`pip install -r requirements.txt`)
- Obtained access to the raw process data files
- Placed data files according to `data/README_DATA.md`
- Prepared a list of sample groups (batches/panels) to generate predictions for

## Step 0: Setup

First, let's set up the environment and verify the data structure.

In [1]:
import sys
import os
from os.path import exists, join, dirname, abspath

# Add project root to path
PROJECT_ROOT = dirname(abspath(os.getcwd()))
if PROJECT_ROOT not in sys.path:
    sys.path.insert(0, PROJECT_ROOT)

print(f"Project root: {PROJECT_ROOT}")

Project root: c:\Users\ScipioneFrancesco\Documents\Projects\proT_pipeline


In [2]:
# Verify data structure
from proT_pipeline.labels import get_target_dirs, get_input_dirs, get_root_dir

ROOT = get_root_dir()
TARGET_INPUT, TARGET_BUILDS = get_target_dirs(ROOT)

print("Checking data directories...")
print(f"  Target input dir: {TARGET_INPUT}")
print(f"    Exists: {exists(TARGET_INPUT)}")
print(f"  Target builds dir: {TARGET_BUILDS}")
print(f"    Exists: {exists(TARGET_BUILDS)}")

Checking data directories...
  Target input dir: c:\Users\ScipioneFrancesco\Documents\Projects\proT_pipeline\data\target\input
    Exists: True
  Target builds dir: c:\Users\ScipioneFrancesco\Documents\Projects\proT_pipeline\data\target\builds
    Exists: True


## Step 1: Process Target Data (IST)

⚠️ **PREDICTION MODE DIFFERENCE**: Instead of processing real IST data with `target_main()`, we generate **placeholder targets** using `generate_ist_placeholders()`.

The placeholder targets have:
- Same structure as real IST data (`group`, `position`, `date`, `variable`, and `value` columns)
- Zero values instead of actual resistance measurements
- Same sequence length (`MAX_LEN`) as training data
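
To make the layout concrete, here is a minimal sketch of how such a placeholder table could be built in long format. It is not the implementation of `generate_ist_placeholders`: the helper name `build_placeholder_rows` is hypothetical, and the column names and the 1-based numbering of positions and variables are assumptions taken from the preview printed later in this step.

```python
from datetime import datetime
from itertools import product


def build_placeholder_rows(group_ids, max_len, num_vars, value=0.0):
    """Return one row per (group, variable, position) with a constant value.

    Hypothetical helper for illustration only; the real pipeline function
    may order or timestamp its rows differently.
    """
    stamp = datetime.now().strftime("%Y-%m-%d %H:%M:%S")  # one creation timestamp for all rows
    rows = []
    for group, variable, position in product(
        group_ids, range(1, num_vars + 1), range(1, max_len + 1)
    ):
        rows.append(
            {
                "group": group,
                "position": position,
                "date": stamp,
                "variable": variable,
                "value": value,
            }
        )
    return rows


rows = build_placeholder_rows(["CYDH_01", "CYDH_05"], max_len=200, num_vars=2)
print(len(rows))  # 2 groups x 2 variables x 200 positions = 800 rows
```

With 16 groups, the same arithmetic gives 16 x 2 x 200 = 6400 rows, which matches the dataframe shape reported in the preview.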

### Input Options for Group IDs

The `generate_ist_placeholders` function accepts group IDs in multiple formats:
- **List**: `["CYDH_01", "CYDH_05", ...]`
- **CSV file**: Path to file with 'group' column
- **NPY file**: Path to numpy array
- **TXT file**: Path to text file (one ID per line)
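
One way to picture this dispatch is a small normalizer that turns any of the accepted inputs into a plain list of strings. The sketch below is an assumption about the behavior, not the library's code: `normalize_group_ids` is a hypothetical name, and the NPY branch assumes NumPy is installed.

```python
import csv
from pathlib import Path


def normalize_group_ids(group_ids):
    """Return group IDs as a list of strings (hypothetical helper)."""
    if isinstance(group_ids, (list, tuple)):
        return [str(g) for g in group_ids]

    path = Path(group_ids)
    suffix = path.suffix.lower()
    if suffix == ".csv":
        # Expect a header row containing a 'group' column
        with path.open(newline="", encoding="utf-8") as fh:
            return [row["group"] for row in csv.DictReader(fh)]
    if suffix == ".txt":
        # One ID per line; blank lines are skipped
        lines = path.read_text(encoding="utf-8").splitlines()
        return [line.strip() for line in lines if line.strip()]
    if suffix == ".npy":
        import numpy as np  # imported lazily so the other branches need no NumPy

        return [str(g) for g in np.load(path, allow_pickle=False).ravel()]
    raise ValueError(f"Unsupported group ID source: {group_ids!r}")


print(normalize_group_ids(["CYDH_01", "CYDH_05"]))
```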

In [3]:
# Configuration for target pipeline (PREDICTION MODE)
BUILD_ID = "prediction_tutorial"      # Change this to your build name
GROUPING_METHOD = "panel"             # Group by panel (must match training)
MAX_LEN = 200                         # Maximum sequence length (must match training)
NUM_VARS = 2                          # Number of target variables (Sense A, B)

# Reference build for copying control files (Step 3)
REFERENCE_BUILD = "dyconex_251117"    # An existing build with control files

In [4]:
# Define the group IDs to generate predictions for
# Option A: Define as a list
PREDICTION_GROUPS = [
    "CYDH_01",
    "CYDH_05",
    "CYDH_13",
    "CYDH_15",
    "CYDH_22",
    "CYDH_28",
    "CYDH_30",
    "CYDH_38",
    "CYEI_04",
    "CYEI_10",
    "CYEI_20",
    "CYEI_26",
    "CYEI_28",
    "CYEI_29",
    "CYEI_39",
    "CYEI_41",
]

# Option B: Load from file (uncomment to use)
# PREDICTION_GROUPS = "path/to/sample_ids.csv"    # CSV with 'group' column
# PREDICTION_GROUPS = "path/to/sample_ids.npy"    # NumPy array
# PREDICTION_GROUPS = "path/to/sample_ids.txt"    # Text file, one ID per line

# Option C: Load from existing dataset (uncomment to use)
# from proT_pipeline.target_processing.placeholders import load_group_ids_from_process_data
# PREDICTION_GROUPS = load_group_ids_from_process_data("dyconex_251117", limit=20)

print(f"Groups to predict: {len(PREDICTION_GROUPS) if isinstance(PREDICTION_GROUPS, list) else PREDICTION_GROUPS}")

Groups to predict: 16


In [5]:
# Generate placeholder targets (PREDICTION MODE)
# This replaces the target_main() call from Tutorial 1
from proT_pipeline.target_processing.placeholders import generate_ist_placeholders
from os import makedirs

# Create target build directory
TARGET_BUILD_DIR = join(TARGET_BUILDS, BUILD_ID)
makedirs(TARGET_BUILD_DIR, exist_ok=True)

print("Generating placeholder targets (PREDICTION MODE)...")
print(f"  Build ID: {BUILD_ID}")
print(f"  Max length: {MAX_LEN}")
print(f"  Num variables: {NUM_VARS}")
print()

df_trg = generate_ist_placeholders(
    group_ids=PREDICTION_GROUPS,
    max_len=MAX_LEN,
    num_vars=NUM_VARS,
    output_path=TARGET_BUILD_DIR  # Saves df_trg.csv here
)

print("\nPlaceholder target generation complete!")
print(f"Output: data/target/builds/{BUILD_ID}/df_trg.csv")

Generating placeholder targets (PREDICTION MODE)...
  Build ID: prediction_tutorial
  Max length: 200
  Num variables: 2

Saved placeholder target to: c:\Users\ScipioneFrancesco\Documents\Projects\proT_pipeline\data\target\builds\prediction_tutorial\df_trg.csv

Placeholder target generation complete!
Output: data/target/builds/prediction_tutorial/df_trg.csv


In [6]:
# Verify output and preview
import pandas as pd

df_trg_path = join(TARGET_BUILDS, BUILD_ID, "df_trg.csv")
print(f"Loading: {df_trg_path}")

df_trg = pd.read_csv(df_trg_path)
print(f"\nTarget dataframe shape: {df_trg.shape}")
print(f"Unique groups (samples): {df_trg['group'].nunique()}")
print(f"Unique variables: {df_trg['variable'].unique()}")
print(f"Value range: [{df_trg['value'].min()}, {df_trg['value'].max()}] (should be all zeros)")
print(f"\nFirst few rows:")
df_trg.head(10)

Loading: c:\Users\ScipioneFrancesco\Documents\Projects\proT_pipeline\data\target\builds\prediction_tutorial\df_trg.csv

Target dataframe shape: (6400, 5)
Unique groups (samples): 16
Unique variables: [1 2]
Value range: [0.0, 0.0] (should be all zeros)

First few rows:


|   | group | position | date | variable | value |
|---|-------|----------|------|----------|-------|
| 0 | CYDH_01 | 1 | 2026-02-03 16:45:21 | 1 | 0.0 |
| 1 | CYDH_01 | 2 | 2026-02-03 16:45:21 | 1 | 0.0 |
| 2 | CYDH_01 | 3 | 2026-02-03 16:45:21 | 1 | 0.0 |
| 3 | CYDH_01 | 4 | 2026-02-03 16:45:21 | 1 | 0.0 |
| 4 | CYDH_01 | 5 | 2026-02-03 16:45:21 | 1 | 0.0 |
| 5 | CYDH_01 | 6 | 2026-02-03 16:45:21 | 1 | 0.0 |
| 6 | CYDH_01 | 7 | 2026-02-03 16:45:21 | 1 | 0.0 |
| 7 | CYDH_01 | 8 | 2026-02-03 16:45:21 | 1 | 0.0 |
| 8 | CYDH_01 | 9 | 2026-02-03 16:45:21 | 1 | 0.0 |
| 9 | CYDH_01 | 10 | 2026-02-03 16:45:21 | 1 | 0.0 |


## Step 2: Copy Target to Build Folder

The input pipeline expects `df_trg.csv` in the build's control folder. We need to copy it there.

In [7]:
# Configuration for input pipeline
DATASET_ID = "prediction_tutorial"    # Build folder name under data/builds/ (created below if missing)

# Create build folder structure if it doesn't exist
BUILD_DIR = join(ROOT, "data", "builds", DATASET_ID)
CONTROL_DIR = join(BUILD_DIR, "control")
OUTPUT_DIR = join(BUILD_DIR, "output")

os.makedirs(CONTROL_DIR, exist_ok=True)
os.makedirs(OUTPUT_DIR, exist_ok=True)

print(f"Build directory: {BUILD_DIR}")
print(f"Control directory: {CONTROL_DIR}")
print(f"Output directory: {OUTPUT_DIR}")

Build directory: c:\Users\ScipioneFrancesco\Documents\Projects\proT_pipeline\data\builds\prediction_tutorial
Control directory: c:\Users\ScipioneFrancesco\Documents\Projects\proT_pipeline\data\builds\prediction_tutorial\control
Output directory: c:\Users\ScipioneFrancesco\Documents\Projects\proT_pipeline\data\builds\prediction_tutorial\output


In [8]:
# Copy df_trg.csv to control folder
import shutil

source_path = join(TARGET_BUILDS, BUILD_ID, "df_trg.csv")
dest_path = join(CONTROL_DIR, "df_trg.csv")

if exists(source_path):
    shutil.copy(source_path, dest_path)
    print(f"Copied df_trg.csv to {dest_path}")
else:
    print(f"ERROR: Source file not found: {source_path}")
    print("Make sure the placeholder generation ran successfully.")

Copied df_trg.csv to c:\Users\ScipioneFrancesco\Documents\Projects\proT_pipeline\data\builds\prediction_tutorial\control\df_trg.csv


## Step 3: Prepare Control Files

The input pipeline requires several control files. You should copy these from an existing build or create them according to the schema in `data/builds/template_build/control/README.md`.

**Required files:**
- `config.yaml` - Points to input dataset folder
- `lookup_selected.xlsx` - Variable selection per process
- `steps_selected.xlsx` - Process steps to include
- `Prozessfolgen_MSEI.xlsx` - Layer/occurrence mapping
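
The configuration cell in Step 1 defines `REFERENCE_BUILD` for exactly this purpose. Below is a minimal sketch of the copy, assuming the reference build keeps its control files under `data/builds/<REFERENCE_BUILD>/control` (an assumption, so check your own layout). `copy_control_files` is a hypothetical helper, not part of `proT_pipeline`. `df_trg.csv` is deliberately left out of the list so the placeholder target copied in Step 2 is not overwritten.

```python
import shutil
from pathlib import Path

CONTROL_FILES = [
    "config.yaml",
    "lookup_selected.xlsx",
    "steps_selected.xlsx",
    "Prozessfolgen_MSEI.xlsx",
]


def copy_control_files(src_dir, dst_dir, filenames=CONTROL_FILES, overwrite=False):
    """Copy control files between build folders (hypothetical helper).

    Returns (copied, missing): names that were copied and names not found
    in the source folder. Existing destination files are kept unless
    overwrite=True.
    """
    src_dir, dst_dir = Path(src_dir), Path(dst_dir)
    dst_dir.mkdir(parents=True, exist_ok=True)
    copied, missing = [], []
    for name in filenames:
        src, dst = src_dir / name, dst_dir / name
        if not src.exists():
            missing.append(name)
        elif overwrite or not dst.exists():
            shutil.copy2(src, dst)
            copied.append(name)
    return copied, missing


# Example call with the notebook's variables (uncomment inside the notebook):
# reference_control = Path(ROOT) / "data" / "builds" / REFERENCE_BUILD / "control"
# copied, missing = copy_control_files(reference_control, CONTROL_DIR)
# print(f"Copied: {copied}\nMissing in reference build: {missing}")
```

After copying, it is worth reviewing `config.yaml`, since it points to the input dataset folder and may need adjusting for the new build.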

In [9]:
# Check for required control files
required_files = [
    "config.yaml",
    "df_trg.csv",
    "lookup_selected.xlsx",
    "steps_selected.xlsx",
    "Prozessfolgen_MSEI.xlsx"
]

print("Checking control files:")
all_present = True
for f in required_files:
    path = join(CONTROL_DIR, f)
    status = "✓" if exists(path) else "✗ MISSING"
    if not exists(path):
        all_present = False
    print(f"  {status} {f}")

if not all_present:
    print("\n⚠️ Some control files are missing!")
    print("Copy them from an existing build or create them following the schema.")

Checking control files:
  ✓ config.yaml
  ✓ df_trg.csv
  ✓ lookup_selected.xlsx
  ✓ steps_selected.xlsx
  ✓ Prozessfolgen_MSEI.xlsx


## Step 4: Run Input Pipeline

Now we can run the input pipeline to process manufacturing data and generate the final dataset.

⚠️ **PREDICTION MODE DIFFERENCE**: We disable stratified splitting since this is for prediction, not training.

### Configuration Parameters

| Parameter | Description | Prediction Value |
|-----------|-------------|------------------|
| `dataset_id` | Build folder name | Same as your folder in data/builds/ |
| `missing_threshold` | Max % missing values per variable | 30 (same as training) |
| `use_stratified_split` | Enable stratified train/test split | **False** (disabled) |
| `grouping_method` | How samples are grouped | "panel" (must match training) |
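
To illustrate what `missing_threshold` means, the sketch below computes the percentage of missing entries per variable and keeps only the variables at or below the threshold. It is a simplified stand-in written in plain Python, not the pipeline's actual filtering code; `filter_by_missing` and the toy variable names are hypothetical.

```python
import math


def is_missing(value):
    """Treat None and NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def filter_by_missing(columns, missing_threshold=30):
    """Keep variables whose share of missing values is <= missing_threshold (%).

    columns: dict mapping variable name -> list of measurements.
    Hypothetical helper for illustration only.
    """
    kept = {}
    for name, values in columns.items():
        if not values:
            continue  # a variable with no measurements carries no information
        pct_missing = 100 * sum(is_missing(v) for v in values) / len(values)
        if pct_missing <= missing_threshold:
            kept[name] = values
    return kept


toy = {
    "var_a": [1.0, 2.0, None, 4.0],           # 25% missing -> kept
    "var_b": [None, float("nan"), 3.0, 4.0],  # 50% missing -> dropped
}
print(list(filter_by_missing(toy)))  # ['var_a']
```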

In [10]:
# Input pipeline configuration (PREDICTION MODE)
MISSING_THRESHOLD = 30                # Remove variables with >30% missing values
USE_STRATIFIED_SPLIT = False          # ⚠️ DISABLED for prediction mode
DEBUG = False                         # Set True for quick test with subset

In [11]:
# Run input pipeline
from proT_pipeline.main import main as input_main

print("Running input (process) pipeline...")
print(f"  Dataset ID: {DATASET_ID}")
print(f"  Missing threshold: {MISSING_THRESHOLD}%")
print(f"  Stratified split: {USE_STRATIFIED_SPLIT} (disabled for prediction)")
print()

input_main(
    dataset_id=DATASET_ID,
    missing_threshold=MISSING_THRESHOLD,
    select_test=False,
    use_stratified_split=USE_STRATIFIED_SPLIT,
    grouping_method=GROUPING_METHOD,
    grouping_column=None,
    debug=DEBUG
)

print("\nInput pipeline complete!")
print(f"Output: data/builds/{DATASET_ID}/output/")

Running input (process) pipeline...
  Dataset ID: prediction_tutorial
  Missing threshold: 30%
  Stratified split: False (disabled for prediction)

Error occurred 'UCL_Spüle Bakterienbefall-1.04'
Error occurred 'UCL_Galv. Cu Cl--1.11/1.12'
Error occurred 'UCL_Galv. Cu Cl--1.09/1.10'
Error occurred '25TrocknerUmsetzzeit'
Error occurred '25TrocknerDelta_time'
Error occurred '25TrocknerDelta_time_%'


100%|██████████| 16/16 [00:00<00:00, 548.20it/s]


Flattening successful, dataset correctly generated!
Found the following sequence lengths
        length_count                        ids
length                                         
320                3                    0, 1, 8
925                3                    5, 6, 7
933                3                    2, 3, 4
1155               7  9, 10, 11, 12, 13, 14, 15


100%|██████████| 16/16 [00:00<00:00, 815.38it/s]

Flattening successful, dataset correctly generated!
Found the following sequence lengths
        length_count                                                ids
length                                                                 
400               16  0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, ...

Input pipeline complete!
Output: data/builds/prediction_tutorial/output/





## Step 5: Verify Output

Let's examine the generated dataset to ensure everything worked correctly.

In [12]:
import numpy as np

# Load the generated dataset
dataset_path = join(OUTPUT_DIR, f"ds_{DATASET_ID}", "data.npz")

if exists(dataset_path):
    data = np.load(dataset_path)
    X = data['x']
    Y = data['y']
    
    print("Dataset loaded successfully!")
    print(f"\nX (input) shape: {X.shape}")
    print(f"  - Samples: {X.shape[0]}")
    print(f"  - Max sequence length: {X.shape[1]}")
    print(f"  - Features: {X.shape[2]}")
    print(f"\nY (target) shape: {Y.shape}")
    print(f"  - Samples: {Y.shape[0]}")
    print(f"  - Max sequence length: {Y.shape[1]}")
    print(f"  - Features: {Y.shape[2]}")
else:
    print(f"Dataset not found at: {dataset_path}")

Dataset loaded successfully!

X (input) shape: (16, 1155, 12)
  - Samples: 16
  - Max sequence length: 1155
  - Features: 12

Y (target) shape: (16, 400, 9)
  - Samples: 16
  - Max sequence length: 400
  - Features: 9


In [13]:
# Check vocabulary files
import json

vocab_files = [
    "group_vocabulary.json",
    "process_vocabulary.json",
    "variables_vocabulary.json_input",
    "variables_vocabulary.json_trg",
    "features_dict"
]

print("Vocabulary files:")
for vf in vocab_files:
    path = join(OUTPUT_DIR, vf)
    if exists(path):
        with open(path, 'r') as f:
            vocab = json.load(f)
        print(f"\n{vf}: {len(vocab)} entries")
        if len(vocab) <= 10:
            print(f"  {vocab}")
        else:
            print(f"  First 5: {dict(list(vocab.items())[:5])}")

Vocabulary files:

group_vocabulary.json: 16 entries
  First 5: {'CYDH_01': 0, 'CYDH_05': 1, 'CYDH_13': 2, 'CYDH_15': 3, 'CYDH_22': 4}

process_vocabulary.json: 5 entries
  {'Multibond': 1, 'Microetch': 2, 'Laser': 3, 'Plasma': 4, 'Galvanic': 5}

variables_vocabulary.json_input: 416 entries
  First 5: {'mul_10': 1, 'mul_100': 2, 'mul_101': 3, 'mul_102': 4, 'mul_103': 5}

variables_vocabulary.json_trg: 2 entries
  {'1': 1, '2': 2}

features_dict: 2 entries
  {'input': {'0': 'group', '1': 'process', '2': 'occurrence', '3': 'step', '4': 'variable', '5': 'value_norm', '6': 'order', '7': 'year', '8': 'month', '9': 'day', '10': 'hour', '11': 'minute'}, 'target': {'0': 'group', '1': 'position', '2': 'variable', '3': 'value', '4': 'year', '5': 'month', '6': 'day', '7': 'hour', '8': 'minute'}}


In [14]:
# Verify that Y contains placeholder values (zeros)
# Y structure: [sample, position, features] where feature 3 is the value
if exists(dataset_path):
    y_values = Y[:, :, 3]  # Value column (index 3 based on features_dict)
    
    zero_count = (y_values == 0).sum()
    total_count = y_values.size
    zero_percentage = (zero_count / total_count) * 100
    
    print(f"Target values verification:")
    print(f"  Zero count: {zero_count} / {total_count} ({zero_percentage:.1f}%)")
    
    if zero_percentage > 90:
        print("  ✓ Target contains placeholder zeros as expected for prediction mode")
    else:
        print("  ⚠️ Some non-zero values found (may be from padding)")

Target values verification:
  Zero count: 6400 / 6400 (100.0%)
  ✓ Target contains placeholder zeros as expected for prediction mode


## Summary

You have successfully generated a prediction dataset! The output includes:

| File | Description |
|------|-------------|
| `data.npz` | Full dataset (X with real process data, Y with placeholder zeros) |
| `*_vocabulary.json` | Mapping dictionaries |
| `features_dict` | Feature index documentation |
| `sample_metrics.parquet` | Sample-level metrics for analysis |

**Note**: No `train_data.npz` or `test_data.npz` is produced because stratified splitting is disabled.

## Using the Prediction Dataset

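```python
import numpy as np
import torch

# Load the prediction dataset (path is relative to the project root)
data = np.load('data/builds/prediction_tutorial/output/ds_prediction_tutorial/data.npz')
X_pred = torch.from_numpy(data['x']).float()
Y_placeholder = torch.from_numpy(data['y']).float()

# `model` is your trained model, loaded beforehand (see INTEGRATION_GUIDE.md);
# it is assumed here to be a torch.nn.Module
model.eval()  # switch off dropout and other training-only behavior

# Run inference without tracking gradients
with torch.no_grad():
    predictions = model(X_pred, Y_placeholder)
```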

## Next Steps

1. Load the prediction dataset into your trained transformer model
2. Run inference to generate IST predictions
3. Map predictions back to group names using `group_vocabulary.json`
4. Refer to `INTEGRATION_GUIDE.md` for model integration details
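
For step 3, the vocabulary printed in Step 5 maps group names to integer indices, so it has to be inverted to go from a sample's index back to its name. Here is a minimal sketch, assuming the group index of each sample is stored in feature 0 of the arrays (as `features_dict` indicates) and that the first sequence position is not padding; `invert_vocabulary` and `decode_groups` are hypothetical helpers.

```python
import json


def invert_vocabulary(vocab):
    """Turn {name: index} into {index: name}."""
    return {int(index): name for name, index in vocab.items()}


def decode_groups(group_indices, vocab):
    """Map a sequence of integer group indices to group names."""
    index_to_name = invert_vocabulary(vocab)
    return [index_to_name[int(i)] for i in group_indices]


# In the notebook the vocabulary would come from the output folder, e.g.:
# with open(join(OUTPUT_DIR, "group_vocabulary.json")) as f:
#     vocab = json.load(f)
vocab = json.loads('{"CYDH_01": 0, "CYDH_05": 1, "CYDH_13": 2}')

# With the real dataset: group_indices = X[:, 0, 0]  (feature 0 = 'group')
group_indices = [2, 0, 1]
print(decode_groups(group_indices, vocab))  # ['CYDH_13', 'CYDH_01', 'CYDH_05']
```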