# Tutorial 1: Producing a Training Dataset

This tutorial guides you through the complete process of generating a training dataset from raw manufacturing data.

## Overview

The dataset generation involves two main pipelines:
1. **Target Pipeline (IST)**: Processes resistance test data to create `df_trg.csv`
2. **Input Pipeline (Process)**: Processes manufacturing parameters and combines with target data

## Prerequisites

Before starting, ensure you have:
- Installed all dependencies (`pip install -r requirements.txt`)
- Obtained access to the raw data files
- Placed data files according to `data/README_DATA.md`

## Step 0: Setup

First, let's set up the environment and verify the data structure.

In [1]:
import sys
import os
from os.path import exists, join, dirname, abspath

# Add project root to path (assumes the notebook runs from a direct subfolder of the project root)
PROJECT_ROOT = dirname(abspath(os.getcwd()))
if PROJECT_ROOT not in sys.path:
    sys.path.insert(0, PROJECT_ROOT)

print(f"Project root: {PROJECT_ROOT}")

Project root: c:\Users\ScipioneFrancesco\Documents\Projects\proT_pipeline


In [2]:
# Verify data structure
from proT_pipeline.labels import get_target_dirs, get_input_dirs, get_root_dir

ROOT = get_root_dir()
TARGET_INPUT, TARGET_BUILDS = get_target_dirs(ROOT)

print("Checking data directories...")
print(f"  Target input dir: {TARGET_INPUT}")
print(f"    Exists: {exists(TARGET_INPUT)}")
print(f"  Target builds dir: {TARGET_BUILDS}")
print(f"    Exists: {exists(TARGET_BUILDS)}")

Checking data directories...
  Target input dir: c:\Users\ScipioneFrancesco\Documents\Projects\proT_pipeline\data\target\input
    Exists: True
  Target builds dir: c:\Users\ScipioneFrancesco\Documents\Projects\proT_pipeline\data\target\builds
    Exists: True


## Step 1: Process Target Data (IST)

The target pipeline processes the raw IST (Interconnect Stress Test) resistance data.

### Configuration Parameters

| Parameter | Description | Typical Value |
|-----------|-------------|---------------|
| `build_id` | Output folder name | e.g., "my_build_2026" |
| `grouping_method` | How to group samples | "panel" or "column" |
| `max_len` | Maximum sequence length | 200 |
| `filter_type` | Coupon type filter | "C" (canary) or "P" (product) |
| `uni_method` | How to bring sequences to a uniform length | "clip" |
| `max_len_mode` | How to handle long sequences | "clip" or "remove" |
| `mean_bool` | Calculate mean across groups | False |
| `std_bool` | Calculate std across groups | False |
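
To make the `max_len_mode` options concrete, here is a minimal sketch on toy data. It assumes that "clip" truncates every sequence to its first `max_len` positions and that "remove" drops any sequence longer than `max_len`. The helper `apply_max_len` is hypothetical and is not part of `proT_pipeline`.

```python
import pandas as pd

def apply_max_len(df, max_len, mode):
    """Illustrative only: limit per-group sequence length (assumed semantics)."""
    if mode == "clip":
        # Keep only positions 1..max_len of every group
        return df[df["position"] <= max_len].reset_index(drop=True)
    if mode == "remove":
        # Drop every group whose longest position exceeds max_len
        longest = df.groupby("group")["position"].transform("max")
        return df[longest <= max_len].reset_index(drop=True)
    raise ValueError(f"Unknown mode: {mode!r}")

# Toy long-format data: group "A" has 3 positions, group "B" has 5
toy = pd.DataFrame({
    "group": ["A"] * 3 + ["B"] * 5,
    "position": [1, 2, 3, 1, 2, 3, 4, 5],
})

clipped = apply_max_len(toy, max_len=4, mode="clip")
removed = apply_max_len(toy, max_len=4, mode="remove")
print(clipped.groupby("group").size().to_dict())  # {'A': 3, 'B': 4}
print(removed.groupby("group").size().to_dict())  # {'A': 3}
```

Under these assumed semantics, "clip" keeps every sample but shortens long ones, while "remove" keeps fewer samples at their full length.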

In [3]:
# Configuration for target pipeline
BUILD_ID = "tutorial_build"          # Change this to your build name
GROUPING_METHOD = "panel"             # Group by panel (individual samples)
MAX_LEN = 200                         # Maximum 200 thermal cycles
FILTER_TYPE = "C"                     # Canary coupons only
UNI_METHOD = "clip"                   # Clip for uniform length
MAX_LEN_MODE = "clip"                 # Clip sequences exceeding max_len
MEAN_BOOL = False                     # Don't calculate mean (keep individual measurements)
STD_BOOL = False                      # Don't calculate std

In [4]:
# Run target pipeline
from proT_pipeline.target_processing.main import main as target_main

print("Running target (IST) pipeline...")
print(f"  Build ID: {BUILD_ID}")
print(f"  Grouping: {GROUPING_METHOD}")
print(f"  Max length: {MAX_LEN}")
print()

target_main(
    build_id=BUILD_ID,
    grouping_method=GROUPING_METHOD,
    grouping_column=None,
    max_len=MAX_LEN,
    filter_type=FILTER_TYPE,
    uni_method=UNI_METHOD,
    max_len_mode=MAX_LEN_MODE,
    mean_bool=MEAN_BOOL,
    std_bool=STD_BOOL
)

print("\nTarget pipeline complete!")
print(f"Output: data/target/builds/{BUILD_ID}/df_trg.csv")

Running target (IST) pipeline...
  Build ID: tutorial_build
  Grouping: panel
  Max length: 200

Processing ist dataframe...
Normalizing values...
Some groups have multiple ids
group
CUFR_28    2
CVGW_13    2
CVGW_15    2
CVGW_17    2
CVGW_21    2
CVGW_30    2
CVGW_44    2
CVGW_47    2
CVGW_8     2
Name: id, dtype: int64
Since mean_bool=False, proceed selecting the dominating one
Each group has exactly one unique id now.
Target filtered dataframe assembled: index unique: True

Target pipeline complete!
Output: data/target/builds/tutorial_build/df_trg.csv
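
The log above reports that some groups map to more than one `id` and that, with `mean_bool=False`, the pipeline keeps the dominating one. The sketch below shows one way to do this in pandas. It assumes "dominating" means the `id` with the most rows in a group, which the log does not confirm, and the toy frame is invented for illustration.

```python
import pandas as pd

# Toy data: group "G1" has two ids (10 with 3 rows, 11 with 1 row)
toy = pd.DataFrame({
    "group": ["G1", "G1", "G1", "G1", "G2", "G2"],
    "id":    [10,   10,   10,   11,   20,   20],
    "value": [0.0,  0.1,  0.2,  0.9,  0.0,  0.3],
})

# Count ids per group to detect ambiguous groups
ids_per_group = toy.groupby("group")["id"].nunique()
print(ids_per_group[ids_per_group > 1])  # lists G1, which has 2 ids

# For every group, pick the id with the most rows (assumed meaning of "dominating")
dominant = toy.groupby("group")["id"].agg(lambda s: s.value_counts().idxmax())

# Keep only rows whose id is the dominant id of their group
filtered = toy[toy["id"] == toy["group"].map(dominant)]

# After filtering, each group has exactly one id
assert (filtered.groupby("group")["id"].nunique() == 1).all()
print(filtered)
```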


In [5]:
# Verify output and preview
import pandas as pd

df_trg_path = join(TARGET_BUILDS, BUILD_ID, "df_trg.csv")
print(f"Loading: {df_trg_path}")

df_trg = pd.read_csv(df_trg_path)
print(f"\nTarget dataframe shape: {df_trg.shape}")
print(f"Unique groups (samples): {df_trg['group'].nunique()}")
print(f"Unique variables: {df_trg['variable'].unique()}")
print(f"\nFirst few rows:")
df_trg.head(10)

Loading: c:\Users\ScipioneFrancesco\Documents\Projects\proT_pipeline\data\target\builds\tutorial_build\df_trg.csv

Target dataframe shape: (782267, 10)
Unique groups (samples): 1957
Unique variables: ['delta_A_norm' 'delta_B_norm']

First few rows:


|   | Unnamed: 0 | index | group | position | id | date | design | version | variable | value |
|---|------------|-------|-------|----------|----|------|--------|---------|----------|-------|
| 0 | 0 | 354905 | CFWQ_10 | 1.0 | 49135 | 2023-08-12 08:38:00 | 453828 | B | delta_A_norm | 0.0 |
| 1 | 1 | 354906 | CFWQ_10 | 2.0 | 49135 | 2023-08-12 08:38:00 | 453828 | B | delta_A_norm | 0.003704 |
| 2 | 2 | 354907 | CFWQ_10 | 3.0 | 49135 | 2023-08-12 08:38:00 | 453828 | B | delta_A_norm | 0.012594 |
| 3 | 3 | 354908 | CFWQ_10 | 4.0 | 49135 | 2023-08-12 08:38:00 | 453828 | B | delta_A_norm | 0.012594 |
| 4 | 4 | 354909 | CFWQ_10 | 5.0 | 49135 | 2023-08-12 08:38:00 | 453828 | B | delta_A_norm | 0.020744 |
| 5 | 5 | 354910 | CFWQ_10 | 6.0 | 49135 | 2023-08-12 08:38:00 | 453828 | B | delta_A_norm | 0.014817 |
| 6 | 6 | 354911 | CFWQ_10 | 7.0 | 49135 | 2023-08-12 08:38:00 | 453828 | B | delta_A_norm | 0.024448 |
| 7 | 7 | 354912 | CFWQ_10 | 8.0 | 49135 | 2023-08-12 08:38:00 | 453828 | B | delta_A_norm | 0.026671 |
| 8 | 8 | 354913 | CFWQ_10 | 9.0 | 49135 | 2023-08-12 08:38:00 | 453828 | B | delta_A_norm | 0.030375 |
| 9 | 9 | 354914 | CFWQ_10 | 10.0 | 49135 | 2023-08-12 08:38:00 | 453828 | B | delta_A_norm | 0.031857 |
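
The `Unnamed: 0` column in the preview is what pandas produces when a CSV that was written with its index is read back without `index_col`. The sketch below reproduces this on an in-memory CSV. Whether `df_trg.csv` was written this way is an assumption based on the column name.

```python
import io
import pandas as pd

# Write a small frame to CSV with its index (the pandas default)
toy = pd.DataFrame({"group": ["A", "B"], "value": [0.0, 0.1]})
buffer = io.StringIO()
toy.to_csv(buffer)

# Reading without index_col turns the saved index into an "Unnamed: 0" column
buffer.seek(0)
with_extra = pd.read_csv(buffer)
print(list(with_extra.columns))  # ['Unnamed: 0', 'group', 'value']

# Reading with index_col=0 restores it as the index instead
buffer.seek(0)
clean = pd.read_csv(buffer, index_col=0)
print(list(clean.columns))  # ['group', 'value']
```

If the assumption holds, `pd.read_csv(df_trg_path, index_col=0)` would load the target frame without the extra column.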


## Step 2: Copy Target to Build Folder

The input pipeline expects `df_trg.csv` in the build's control folder. We need to copy it there.

In [6]:
# Configuration for input pipeline
DATASET_ID = "tutorial_build"         # Build folder name under data/builds/ (created below if missing)

# Create build folder structure if it doesn't exist
BUILD_DIR = join(ROOT, "data", "builds", DATASET_ID)
CONTROL_DIR = join(BUILD_DIR, "control")
OUTPUT_DIR = join(BUILD_DIR, "output")

os.makedirs(CONTROL_DIR, exist_ok=True)
os.makedirs(OUTPUT_DIR, exist_ok=True)

print(f"Build directory: {BUILD_DIR}")
print(f"Control directory: {CONTROL_DIR}")
print(f"Output directory: {OUTPUT_DIR}")

Build directory: c:\Users\ScipioneFrancesco\Documents\Projects\proT_pipeline\data\builds\tutorial_build
Control directory: c:\Users\ScipioneFrancesco\Documents\Projects\proT_pipeline\data\builds\tutorial_build\control
Output directory: c:\Users\ScipioneFrancesco\Documents\Projects\proT_pipeline\data\builds\tutorial_build\output


In [7]:
# Copy df_trg.csv to control folder
import shutil

source_path = join(TARGET_BUILDS, BUILD_ID, "df_trg.csv")
dest_path = join(CONTROL_DIR, "df_trg.csv")

if exists(source_path):
    shutil.copy(source_path, dest_path)
    print(f"Copied df_trg.csv to {dest_path}")
else:
    print(f"ERROR: Source file not found: {source_path}")
    print("Make sure the target pipeline ran successfully.")

Copied df_trg.csv to c:\Users\ScipioneFrancesco\Documents\Projects\proT_pipeline\data\builds\tutorial_build\control\df_trg.csv


## Step 3: Prepare Control Files

The input pipeline requires several control files. You should copy these from an existing build or create them according to the schema in `data/builds/template_build/control/README.md`.

**Required files:**
- `config.yaml` - Points to input dataset folder
- `lookup_selected.xlsx` - Variable selection per process
- `steps_selected.xlsx` - Process steps to include
- `Prozessfolgen_MSEI.xlsx` - Layer/occurrence mapping
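
If you copy the control files from an existing build, a small helper can do it without overwriting files that are already present. This is a sketch: `copy_control_files` is a hypothetical helper, and the directories in the demo are temporary stand-ins for real control folders.

```python
import shutil
import tempfile
from pathlib import Path

def copy_control_files(src_control, dst_control, names):
    """Copy the named files from src_control to dst_control; return the names copied.

    Files already present in dst_control are left untouched, and names missing
    from src_control are skipped.
    """
    src, dst = Path(src_control), Path(dst_control)
    dst.mkdir(parents=True, exist_ok=True)
    copied = []
    for name in names:
        if (src / name).exists() and not (dst / name).exists():
            shutil.copy(src / name, dst / name)
            copied.append(name)
    return copied

# Demo with temporary folders standing in for two builds
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "existing_build" / "control"
    dst = Path(tmp) / "tutorial_build" / "control"
    src.mkdir(parents=True)
    (src / "config.yaml").write_text("placeholder: true\n")
    (src / "steps_selected.xlsx").write_bytes(b"")

    wanted = ["config.yaml", "lookup_selected.xlsx", "steps_selected.xlsx"]
    copied = copy_control_files(src, dst, wanted)
    print(copied)  # ['config.yaml', 'steps_selected.xlsx']
```

Because existing files are skipped, the `df_trg.csv` copied in Step 2 would not be replaced by an older copy from another build.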

In [9]:
# Check for required control files
required_files = [
    "config.yaml",
    "df_trg.csv",
    "lookup_selected.xlsx",
    "steps_selected.xlsx",
    "Prozessfolgen_MSEI.xlsx"
]

print("Checking control files:")
all_present = True
for f in required_files:
    present = exists(join(CONTROL_DIR, f))  # check each file once
    all_present = all_present and present
    print(f"  {'✓' if present else '✗ MISSING'} {f}")

if not all_present:
    print("\n⚠️ Some control files are missing!")
    print("Copy them from an existing build or create them following the schema.")

Checking control files:
  ✓ config.yaml
  ✓ df_trg.csv
  ✓ lookup_selected.xlsx
  ✓ steps_selected.xlsx
  ✓ Prozessfolgen_MSEI.xlsx


## Step 4: Run Input Pipeline

Now we can run the input pipeline to process manufacturing data and generate the final dataset.

### Configuration Parameters

| Parameter | Description | Typical Value |
|-----------|-------------|---------------|
| `dataset_id` | Build folder name | Same as your folder in data/builds/ |
| `missing_threshold` | Max % missing values per variable | 30 |
| `use_stratified_split` | Enable stratified train/test split | True |
| `train_ratio` | Proportion for training set | 0.8 |
| `n_bins` | Number of bins for stratification | 50 |
| `grouping_method` | How samples are grouped | "panel" |
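
As an illustration of `missing_threshold`, the sketch below drops every variable whose share of missing values exceeds the threshold. It assumes the threshold is a percentage applied per variable over long-format rows. The column names and the helper `drop_sparse_variables` are hypothetical, not the pipeline's own.

```python
import numpy as np
import pandas as pd

def drop_sparse_variables(df, threshold_pct):
    """Remove rows of variables with more than threshold_pct % missing values."""
    # Percentage of missing values per variable
    missing_pct = df["value"].isna().groupby(df["variable"]).mean() * 100
    keep = missing_pct[missing_pct <= threshold_pct].index
    return df[df["variable"].isin(keep)], missing_pct

toy = pd.DataFrame({
    "variable": ["las_11"] * 4 + ["las_12"] * 4,
    "value": [1.0, 2.0, 3.0, np.nan,      # las_11: 25% missing
              1.0, np.nan, np.nan, 4.0],  # las_12: 50% missing
})

kept, missing_pct = drop_sparse_variables(toy, threshold_pct=30)
print(missing_pct.to_dict())               # {'las_11': 25.0, 'las_12': 50.0}
print(kept["variable"].unique().tolist())  # ['las_11']
```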

In [10]:
# Input pipeline configuration
MISSING_THRESHOLD = 30                # Remove variables with >30% missing values
USE_STRATIFIED_SPLIT = True           # Enable stratified splitting
STRATIFIED_METRIC = 'rarity_last_value'  # Metric for stratification
TRAIN_RATIO = 0.8                     # 80% train, 20% test
N_BINS = 50                           # Bins for stratification
SPLIT_SHUFFLE = False                 # Don't shuffle within bins
SPLIT_SEED = 42                       # Random seed for reproducibility
DEBUG = False                         # Set True for quick test with subset
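
The stratified split options can be pictured with the following sketch: bin a per-sample metric, then split each bin by `train_ratio`. This is not the pipeline's implementation. The equal-frequency binning, the per-bin rounding, and the helper `stratified_split` are assumptions made for illustration, and the metric is random toy data.

```python
import numpy as np

def stratified_split(metric, train_ratio=0.8, n_bins=50, shuffle=False, seed=42):
    """Return (train_idx, test_idx) so that each metric bin is split by train_ratio."""
    metric = np.asarray(metric, dtype=float)
    rng = np.random.default_rng(seed)
    # Equal-frequency bin edges from quantiles; duplicates collapse for tied values
    edges = np.unique(np.quantile(metric, np.linspace(0, 1, n_bins + 1)))
    # Assign each sample to a bin using the interior edges
    bins = np.digitize(metric, edges[1:-1])
    train, test = [], []
    for b in np.unique(bins):
        idx = np.flatnonzero(bins == b)
        if shuffle:
            rng.shuffle(idx)  # the seed only matters when shuffling is on
        n_train = int(round(train_ratio * len(idx)))
        train.extend(idx[:n_train])
        test.extend(idx[n_train:])
    return np.sort(np.array(train, dtype=int)), np.sort(np.array(test, dtype=int))

# Toy metric for 1000 samples
metric = np.random.default_rng(0).normal(size=1000)
train_idx, test_idx = stratified_split(metric, train_ratio=0.8, n_bins=50)
print(len(train_idx), len(test_idx))
```

Because each bin is rounded separately in this sketch, the overall train share can differ slightly from `train_ratio`.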

In [11]:
# Run input pipeline
from proT_pipeline.main import main as input_main

print("Running input (process) pipeline...")
print(f"  Dataset ID: {DATASET_ID}")
print(f"  Missing threshold: {MISSING_THRESHOLD}%")
print(f"  Stratified split: {USE_STRATIFIED_SPLIT}")
print(f"  Train ratio: {TRAIN_RATIO}")
print()

input_main(
    dataset_id=DATASET_ID,
    missing_threshold=MISSING_THRESHOLD,
    select_test=False,
    use_stratified_split=USE_STRATIFIED_SPLIT,
    stratified_metric=STRATIFIED_METRIC,
    train_ratio=TRAIN_RATIO,
    n_bins=N_BINS,
    split_shuffle=SPLIT_SHUFFLE,
    split_seed=SPLIT_SEED,
    grouping_method=GROUPING_METHOD,
    grouping_column=None,
    debug=DEBUG
)

print("\nInput pipeline complete!")
print(f"Output: data/builds/{DATASET_ID}/output/")

Running input (process) pipeline...
  Dataset ID: tutorial_build
  Missing threshold: 30%
  Stratified split: True
  Train ratio: 0.8

Error occurred 'UCL_Spüle Bakterienbefall-1.04'
Error occurred 'UCL_Galv. Cu Cl--1.11/1.12'
Error occurred 'UCL_Galv. Cu Cl--1.09/1.10'
Error occurred '25TrocknerUmsetzzeit'
Error occurred '25TrocknerDelta_time'
Error occurred '25TrocknerDelta_time_%'


  df_['numeric_part'] = df_[process.panel_label].astype(str).str.extract(r'(\d+)')[0].astype("Int64")
  df_[trans_group_id] = df_[process.WA_label] + '_' + df_['numeric_part'].astype(str)
  df_[trans_group_id] = df_[process.WA_label] + '_*'
100%|██████████| 1920/1920 [05:11<00:00,  6.16it/s]


Flattening successful, dataset correctly generated!
Found the following sequence lengths
        length_count                                                ids
length                                                                 
111                1                                                153
127                5                            148, 149, 150, 152, 155
305               10   151, 154, 796, 797, 798, 799, 800, 801, 802, 803
312                3                                    141, 1355, 1365
330                1                                               1798
...              ...                                                ...
989              163  34, 35, 37, 39, 40, 41, 42, 43, 44, 45, 46, 47...
991               10  949, 1473, 1566, 1670, 1740, 1741, 1742, 1790,...
1007               8      416, 1841, 1842, 1843, 1844, 1845, 1847, 1848
1015              54  8, 9, 12, 14, 15, 105, 106, 108, 186, 188, 189...
1023             224  3, 48, 49, 50, 51, 53, 54

100%|██████████| 1920/1920 [02:13<00:00, 14.36it/s]


Flattening successful, dataset correctly generated!
Found the following sequence lengths
        length_count                                                ids
length                                                                 
200                1                                                656
236                1                                               1899
335                1                                               1473
363                1                                               1807
381                1                                               1468
383                1                                                739
384                1                                               1758
390                1                                               1598
397                1                                                733
399                2                                          573, 1134
400             1909  0, 1, 2, 3, 4, 5, 6, 7, 8
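
The length tables above show that samples have different sequence lengths, while the saved arrays have one fixed length per side. The sketch below shows one common way to get there, padding shorter sequences up to the longest one. Padding with NaN at the end is an assumption: the tutorial output does not show which padding value or side the pipeline uses.

```python
import numpy as np

def pad_sequences(sequences, pad_value=np.nan):
    """Stack (length_i, n_features) arrays into (n_samples, max_length, n_features)."""
    max_len = max(len(s) for s in sequences)
    n_features = sequences[0].shape[1]
    out = np.full((len(sequences), max_len, n_features), pad_value, dtype=float)
    for i, seq in enumerate(sequences):
        out[i, :len(seq), :] = seq  # copy real rows; the tail keeps pad_value
    return out

# Three toy samples with 2, 4 and 3 time steps and 12 features each
rng = np.random.default_rng(0)
samples = [rng.random((n, 12)) for n in (2, 4, 3)]

X_toy = pad_sequences(samples)
print(X_toy.shape)  # (3, 4, 12)

# Recover the true length of each sample by counting non-padded rows
lengths = (~np.isnan(X_toy).all(axis=2)).sum(axis=1)
print(lengths)  # [2 4 3]
```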

## Step 5: Verify Output

Let's examine the generated dataset to ensure everything worked correctly.

In [12]:
import numpy as np

# Load the generated dataset
dataset_path = join(OUTPUT_DIR, f"ds_{DATASET_ID}", "data.npz")

if exists(dataset_path):
    data = np.load(dataset_path)
    X = data['x']
    Y = data['y']
    
    print("Dataset loaded successfully!")
    print(f"\nX (input) shape: {X.shape}")
    print(f"  - Samples: {X.shape[0]}")
    print(f"  - Max sequence length: {X.shape[1]}")
    print(f"  - Features: {X.shape[2]}")
    print(f"\nY (target) shape: {Y.shape}")
    print(f"  - Samples: {Y.shape[0]}")
    print(f"  - Max sequence length: {Y.shape[1]}")
    print(f"  - Features: {Y.shape[2]}")
else:
    print(f"Dataset not found at: {dataset_path}")

Dataset loaded successfully!

X (input) shape: (1920, 1023, 12)
  - Samples: 1920
  - Max sequence length: 1023
  - Features: 12

Y (target) shape: (1920, 400, 9)
  - Samples: 1920
  - Max sequence length: 400
  - Features: 9


In [13]:
# Check vocabulary files
import json

vocab_files = [
    "group_vocabulary.json",
    "process_vocabulary.json",
    "variables_vocabulary.json_input",
    "variables_vocabulary.json_trg",
    "features_dict"
]

print("Vocabulary files:")
for vf in vocab_files:
    path = join(OUTPUT_DIR, vf)
    if exists(path):
        with open(path, 'r') as f:
            vocab = json.load(f)
        print(f"\n{vf}: {len(vocab)} entries")
        if len(vocab) <= 10:
            print(f"  {vocab}")
        else:
            print(f"  First 5: {dict(list(vocab.items())[:5])}")

Vocabulary files:

group_vocabulary.json: 1920 entries
  First 5: {'CUEX_13': 0, 'CUEX_17': 1, 'CUEX_20': 2, 'CUEX_28': 3, 'CUEX_34': 4}

process_vocabulary.json: 5 entries
  {'Laser': 1, 'Plasma': 2, 'Galvanic': 3, 'Multibond': 4, 'Microetch': 5}

variables_vocabulary.json_input: 372 entries
  First 5: {'las_11': 1, 'las_12': 2, 'las_13': 3, 'las_15': 4, 'las_16': 5}

variables_vocabulary.json_trg: 2 entries
  {'delta_A_norm': 1, 'delta_B_norm': 2}

features_dict: 2 entries
  {'input': {'0': 'group', '1': 'process', '2': 'occurrence', '3': 'step', '4': 'variable', '5': 'value_norm', '6': 'order', '7': 'year', '8': 'month', '9': 'day', '10': 'hour', '11': 'minute'}, 'target': {'0': 'group', '1': 'position', '2': 'variable', '3': 'value', '4': 'year', '5': 'month', '6': 'day', '7': 'hour', '8': 'minute'}}
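
`features_dict` maps positions along the last array axis to feature names, so it can be inverted to select a feature by name. The sketch below uses the `input` mapping printed above on a small random array standing in for `X`. The variable names are illustrative.

```python
import numpy as np

# The "input" mapping as printed above (index -> feature name)
input_features = {
    "0": "group", "1": "process", "2": "occurrence", "3": "step",
    "4": "variable", "5": "value_norm", "6": "order", "7": "year",
    "8": "month", "9": "day", "10": "hour", "11": "minute",
}

# Invert to feature name -> integer index
feature_index = {name: int(i) for i, name in input_features.items()}

# Stand-in for X with shape (samples, sequence length, features)
X_toy = np.random.default_rng(0).random((4, 6, len(feature_index)))

# Select one feature by name instead of a hard-coded position
values = X_toy[:, :, feature_index["value_norm"]]
print(feature_index["value_norm"], values.shape)  # 5 (4, 6)
```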


In [14]:
# Check train/test splits
train_path = join(OUTPUT_DIR, f"ds_{DATASET_ID}", "train_data.npz")
test_path = join(OUTPUT_DIR, f"ds_{DATASET_ID}", "test_data.npz")

if exists(train_path) and exists(test_path):
    train_data = np.load(train_path)
    test_data = np.load(test_path)
    
    print("Train/Test splits:")
    print(f"  Train samples: {train_data['x'].shape[0]}")
    print(f"  Test samples: {test_data['x'].shape[0]}")
    print(f"  Train ratio: {train_data['x'].shape[0] / (train_data['x'].shape[0] + test_data['x'].shape[0]):.2%}")
else:
    print("Train/test splits not found.")
    print("Run with use_stratified_split=True to generate splits.")

Train/Test splits:
  Train samples: 1526
  Test samples: 394
  Train ratio: 79.48%


## Summary

You have successfully generated a training dataset! The output includes:

| File | Description |
|------|-------------|
| `data.npz` | Full dataset (X and Y arrays) |
| `train_data.npz` | Training split |
| `test_data.npz` | Test split |
| `*_vocabulary.json*` | Mapping dictionaries (group, process, input and target variables) |
| `features_dict` | Feature index documentation |
| `sample_metrics.parquet` | Sample-level metrics for analysis |

## Next Steps

1. Use the dataset for training your transformer model
2. See Tutorial 2 for generating prediction datasets (without targets)
3. Refer to `INTEGRATION_GUIDE.md` for advanced configuration options