# Dataset Creation: Tennessee Eastman Process (TEP)

This notebook creates reproducible, balanced datasets for training and evaluating fault detection models.

**Purpose:**
- Load raw TEP data from .RData files
- Create balanced train/validation/test splits
- Generate both multiclass and binary classification datasets
- Ensure no data leakage between splits

**Data Source:**
- Rieth et al. (2017) enriched TEP dataset
- 4 source files: fault-free training/testing, faulty training/testing
- Each contains 500 independent simulation runs per fault type

**Outputs (Option A - Practical Balance):**
- `data/multiclass_train.csv` - 320 normal + 340 faulty runs (20 per fault)
- `data/multiclass_val.csv` - 160 normal + 153 faulty runs (9 per fault)
- `data/multiclass_test.csv` - 240 normal + **850 faulty runs (50 per fault)** ← 3.3× MORE
- `data/binary_train.csv` - 320 normal runs only
- `data/binary_val.csv` - 160 normal runs only
- `data/binary_test.csv` - 120 normal + **850 faulty runs (50 per fault)** ← 5× MORE

**Key Improvement:**
The test set grows from 15 to 50 runs per fault (multiclass) and from 10 to 50 runs per fault (binary).
The larger test set gives tighter confidence intervals (95% CI: roughly ±14% vs ±26%) while keeping file sizes manageable (~270 MB).
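
To illustrate how the run count drives interval width, the sketch below computes the normal-approximation (Wald) half-width of a 95% confidence interval for an accuracy estimated from `n` independent runs. It is only an illustration: the notebook does not state which interval method or accuracy level produced the ±14% and ±26% figures, so the values printed here need not match them exactly. `ci_half_width` is a hypothetical helper name.

```python
from math import sqrt
from statistics import NormalDist


def ci_half_width(p, n, confidence=0.95):
    """Wald (normal-approximation) half-width for a proportion p observed over n trials."""
    # Two-sided critical value (about 1.96 for a 95% interval)
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return z * sqrt(p * (1 - p) / n)


# Compare the old and new per-fault run counts at an assumed 80% accuracy
for n_runs in (10, 15, 50):
    print(f"n={n_runs:>2}: ±{ci_half_width(0.8, n_runs):.1%}")
```

Under this approximation the half-width scales with 1/sqrt(n), so moving from 15 to 50 runs narrows the interval by a factor of about sqrt(50/15) ≈ 1.8 at any accuracy level.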

**Reference:**
Rieth, C. A., Amsel, B. D., Tran, R., & Cook, M. B. (2017). Additional Tennessee Eastman Process 
Simulation Data for Anomaly Detection Evaluation. Harvard Dataverse.

In [1]:
import numpy as np
import pandas as pd
import pyreadr
import os
from pathlib import Path
import zipfile

# Set random seed for reproducibility
np.random.seed(42)

# Create data output directory
os.makedirs('../data', exist_ok=True)

print("Dataset creation notebook initialized")
print(f"Random seed: 42")
print(f"Output directory: ../data/")

Dataset creation notebook initialized
Random seed: 42
Output directory: ../data/


## 1. Load Raw Data from .RData Files

The raw data is stored in compressed .RData.zip files. We'll extract and load them.
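
As a sketch of the extraction step, the snippet below locates the `.RData` member inside an archive and extracts it into a temporary directory, so the source folder stays untouched. `extract_rdata_member` is a hypothetical helper, and the archive it runs on is a stand-in created on the fly, not one of the real TEP files.

```python
import tempfile
import zipfile
from pathlib import Path


def extract_rdata_member(zip_path, dest_dir):
    """Extract the first .RData member of zip_path into dest_dir and return its path."""
    with zipfile.ZipFile(zip_path, "r") as archive:
        members = [name for name in archive.namelist() if name.endswith(".RData")]
        if not members:
            raise ValueError(f"No .RData file found in {zip_path}")
        # ZipFile.extract returns the path of the file it created
        return Path(archive.extract(members[0], path=dest_dir))


with tempfile.TemporaryDirectory() as tmp:
    # Build a tiny stand-in archive (placeholder bytes, not real R data)
    demo_zip = Path(tmp) / "demo.RData.zip"
    with zipfile.ZipFile(demo_zip, "w") as archive:
        archive.writestr("demo.RData", b"placeholder")

    extracted = extract_rdata_member(demo_zip, Path(tmp) / "extracted")
    extracted_name = extracted.name
    extracted_ok = extracted.exists()  # checked before the temp dir is removed
    print(extracted_name, extracted_ok)
```

The returned path could then be handed to `pyreadr.read_r`, and the temporary directory is cleaned up automatically when the `with` block exits.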

In [2]:
# Define paths to the raw data files
dataset_dir = Path('../Dataset')

data_files = {
    'fftr': dataset_dir / 'TEP_FaultFree_Training.RData.zip',
    'ftr': dataset_dir / 'TEP_Faulty_Training.RData.zip',
    'ffte': dataset_dir / 'TEP_FaultFree_Testing.RData.zip',
    'fte': dataset_dir / 'TEP_Faulty_Testing.RData.zip'
}

# Verify all files exist
print("Checking for data files:")
for key, path in data_files.items():
    if path.exists():
        size_mb = path.stat().st_size / (1024 * 1024)
        print(f"  ✓ {key}: {path.name} ({size_mb:.1f} MB)")
    else:
        print(f"  ✗ {key}: {path.name} NOT FOUND")
        raise FileNotFoundError(f"Required data file not found: {path}")

Checking for data files:
  ✓ fftr: TEP_FaultFree_Training.RData.zip (23.0 MB)
  ✓ ftr: TEP_Faulty_Training.RData.zip (461.7 MB)
  ✓ ffte: TEP_FaultFree_Testing.RData.zip (44.1 MB)
  ✓ fte: TEP_Faulty_Testing.RData.zip (778.5 MB)


In [3]:
def load_rdata_from_zip(zip_path, expected_key=None):
    """
    Extract and load .RData file from zip archive.
    
    Parameters:
    -----------
    zip_path : Path
        Path to the .zip file containing .RData
    expected_key : str, optional
        Expected key in the RData dictionary
        
    Returns:
    --------
    pd.DataFrame
        Loaded data
    """
    # Extract the .RData file from zip
    with zipfile.ZipFile(zip_path, 'r') as zip_ref:
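        # Get the RData filename (should be the only file in the archive)
        rdata_members = [f for f in zip_ref.namelist() if f.endswith('.RData')]
        if not rdata_members:
            raise ValueError(f"No .RData file found in {zip_path}")
        rdata_filename = rdata_members[0]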
        
        # Extract to temporary location
        temp_path = zip_path.parent / rdata_filename
        zip_ref.extract(rdata_filename, path=zip_path.parent)
    
    try:
        # Load RData file
        result = pyreadr.read_r(str(temp_path))
        
        # Get the dataframe (pyreadr.read_r returns an OrderedDict of DataFrames)
        if expected_key and expected_key in result:
            df = result[expected_key]
        else:
            # Take the first (and typically only) dataframe
            df = list(result.values())[0]
        
        return df
    finally:
        # Clean up extracted file
        if temp_path.exists():
            temp_path.unlink()

print("Loading data files...")
print("This may take a few minutes due to file size.\n")

# Load each dataset
fftr = load_rdata_from_zip(data_files['fftr'], 'fault_free_training')
print(f"✓ Fault-free training loaded: {fftr.shape}")

ftr = load_rdata_from_zip(data_files['ftr'], 'faulty_training')
print(f"✓ Faulty training loaded: {ftr.shape}")

ffte = load_rdata_from_zip(data_files['ffte'], 'fault_free_testing')
print(f"✓ Fault-free testing loaded: {ffte.shape}")

fte = load_rdata_from_zip(data_files['fte'], 'faulty_testing')
print(f"✓ Faulty testing loaded: {fte.shape}")

Loading data files...
This may take a few minutes due to file size.

✓ Fault-free training loaded: (250000, 55)
✓ Faulty training loaded: (5000000, 55)
✓ Fault-free testing loaded: (480000, 55)
✓ Faulty testing loaded: (9600000, 55)


## 2. Data Structure Inspection

In [4]:
print("Dataset Structure:")
print("=" * 70)
print(f"\nColumns ({len(fftr.columns)}):")
print(fftr.columns.tolist())

print(f"\nFault-free training:")
print(f"  Unique simulation runs: {fftr['simulationRun'].nunique()}")
print(f"  Samples per run: {fftr['sample'].nunique()}")
print(f"  Fault numbers: {sorted(fftr['faultNumber'].unique())}")

print(f"\nFaulty training:")
print(f"  Unique simulation runs: {ftr['simulationRun'].nunique()}")
print(f"  Samples per run: {ftr['sample'].nunique()}")
print(f"  Fault numbers: {sorted(ftr['faultNumber'].unique())}")

print(f"\nFault-free testing:")
print(f"  Unique simulation runs: {ffte['simulationRun'].nunique()}")
print(f"  Samples per run: {ffte['sample'].nunique()}")
print(f"  Fault numbers: {sorted(ffte['faultNumber'].unique())}")

print(f"\nFaulty testing:")
print(f"  Unique simulation runs: {fte['simulationRun'].nunique()}")
print(f"  Samples per run: {fte['sample'].nunique()}")
print(f"  Fault numbers: {sorted(fte['faultNumber'].unique())}")

print("\n" + "=" * 70)
print("Sample data:")
fftr.head()

Dataset Structure:

Columns (55):
['faultNumber', 'simulationRun', 'sample', 'xmeas_1', 'xmeas_2', 'xmeas_3', 'xmeas_4', 'xmeas_5', 'xmeas_6', 'xmeas_7', 'xmeas_8', 'xmeas_9', 'xmeas_10', 'xmeas_11', 'xmeas_12', 'xmeas_13', 'xmeas_14', 'xmeas_15', 'xmeas_16', 'xmeas_17', 'xmeas_18', 'xmeas_19', 'xmeas_20', 'xmeas_21', 'xmeas_22', 'xmeas_23', 'xmeas_24', 'xmeas_25', 'xmeas_26', 'xmeas_27', 'xmeas_28', 'xmeas_29', 'xmeas_30', 'xmeas_31', 'xmeas_32', 'xmeas_33', 'xmeas_34', 'xmeas_35', 'xmeas_36', 'xmeas_37', 'xmeas_38', 'xmeas_39', 'xmeas_40', 'xmeas_41', 'xmv_1', 'xmv_2', 'xmv_3', 'xmv_4', 'xmv_5', 'xmv_6', 'xmv_7', 'xmv_8', 'xmv_9', 'xmv_10', 'xmv_11']

Fault-free training:
  Unique simulation runs: 500
  Samples per run: 500
  Fault numbers: [0.0]

Faulty training:
  Unique simulation runs: 500
  Samples per run: 500
  Fault numbers: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20]

Fault-free testing:
  Unique simulation runs: 500
  Samples per run: 960
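  Fault numbers: [0]

Faulty testing:
  Unique simulation runs: 500
  Samples per run: 960
  Fault numbers: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20]

Sample data: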

Unnamed: 0,faultNumber,simulationRun,sample,xmeas_1,xmeas_2,xmeas_3,xmeas_4,xmeas_5,xmeas_6,xmeas_7,...,xmv_2,xmv_3,xmv_4,xmv_5,xmv_6,xmv_7,xmv_8,xmv_9,xmv_10,xmv_11
0,0.0,1.0,1,0.25038,3674.0,4529.0,9.232,26.889,42.402,2704.3,...,53.744,24.657,62.544,22.137,39.935,42.323,47.757,47.51,41.258,18.447
1,0.0,1.0,2,0.25109,3659.4,4556.6,9.4264,26.721,42.576,2705.0,...,53.414,24.588,59.259,22.084,40.176,38.554,43.692,47.427,41.359,17.194
2,0.0,1.0,3,0.25038,3660.3,4477.8,9.4426,26.875,42.07,2706.2,...,54.357,24.666,61.275,22.38,40.244,38.99,46.699,47.468,41.199,20.53
3,0.0,1.0,4,0.24977,3661.3,4512.1,9.4776,26.758,42.063,2707.2,...,53.946,24.725,59.856,22.277,40.257,38.072,47.541,47.658,41.643,18.089
4,0.0,1.0,5,0.29405,3679.0,4497.0,9.3381,26.889,42.65,2705.1,...,53.658,28.797,60.717,21.947,39.144,41.955,47.645,47.346,41.507,18.461


## 3. Add Unique Trajectory Identifiers

To prevent data leakage, we create unique identifiers for each simulation run
that combine:
- **origin**: Source dataset (fftr, ftr, ffte, fte)
- **faultNumber**: Fault type (0-20)
- **simulationRun**: Run ID (1-500)

This ensures that simulation runs from different source files are kept separate,
even if they have the same simulationRun number.
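
The toy example below sketches why the origin label matters. The two frames stand in for two source files that reuse the same `simulationRun` numbers; `make_key` is a hypothetical helper that mirrors the key format described above.

```python
import pandas as pd

# Toy frames: the same (faultNumber, simulationRun) pairs appear in two source files
train_src = pd.DataFrame({"faultNumber": [0, 0], "simulationRun": [1, 2]})
test_src = pd.DataFrame({"faultNumber": [0, 0], "simulationRun": [1, 2]})


def make_key(df, origin=None):
    """Build a run key, optionally prefixed with the source-file label."""
    key = (
        "f" + df["faultNumber"].astype(int).astype(str)
        + "_r" + df["simulationRun"].astype(int).astype(str)
    )
    return key if origin is None else origin + "_" + key


without_origin = set(make_key(train_src)) & set(make_key(test_src))
with_origin = set(make_key(train_src, "fftr")) & set(make_key(test_src, "ffte"))

print(f"Colliding keys without origin: {sorted(without_origin)}")
print(f"Colliding keys with origin:    {sorted(with_origin)}")
```

Without the prefix, distinct simulations from different files would share a key and could be mistaken for the same trajectory.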

In [5]:
def attach_origin_and_traj_key(df, origin_name):
    """
    Add origin label and unique trajectory key to dataframe.
    
    Parameters:
    -----------
    df : pd.DataFrame
        TEP data
    origin_name : str
        Origin identifier (fftr, ftr, ffte, fte)
        
    Returns:
    --------
    pd.DataFrame
        DataFrame with added 'origin' and 'traj_key' columns
    """
    df = df.copy()
    df['origin'] = origin_name
    
    # Create unique key: origin_f{fault}_r{run}
    df['traj_key'] = (
        df['origin'].astype(str)
        + '_f' + df['faultNumber'].astype(int).astype(str)
        + '_r' + df['simulationRun'].astype(int).astype(str)
    )
    
    return df

# Add identifiers to all datasets
fftr = attach_origin_and_traj_key(fftr, 'fftr')
ftr = attach_origin_and_traj_key(ftr, 'ftr')
ffte = attach_origin_and_traj_key(ffte, 'ffte')
fte = attach_origin_and_traj_key(fte, 'fte')

print("Unique trajectory identifiers added.")
print(f"\nExample traj_key values:")
print(f"  fftr: {fftr['traj_key'].iloc[0]}")
print(f"  ftr:  {ftr['traj_key'].iloc[0]}")
print(f"  ffte: {ffte['traj_key'].iloc[0]}")
print(f"  fte:  {fte['traj_key'].iloc[0]}")

Unique trajectory identifiers added.

Example traj_key values:
  fftr: fftr_f0_r1
  ftr:  ftr_f1_r1
  ffte: ffte_f0_r1
  fte:  fte_f1_r1


In [6]:
# Verify no overlaps between source datasets
fftr_keys = set(fftr['traj_key'])
ftr_keys = set(ftr['traj_key'])
ffte_keys = set(ffte['traj_key'])
fte_keys = set(fte['traj_key'])

print("Verifying no trajectory key overlaps between source datasets:")
print(f"  fftr ∩ ftr:   {len(fftr_keys & ftr_keys)} (should be 0)")
print(f"  fftr ∩ ffte:  {len(fftr_keys & ffte_keys)} (should be 0)")
print(f"  fftr ∩ fte:   {len(fftr_keys & fte_keys)} (should be 0)")
print(f"  ftr  ∩ ffte:  {len(ftr_keys & ffte_keys)} (should be 0)")
print(f"  ftr  ∩ fte:   {len(ftr_keys & fte_keys)} (should be 0)")
print(f"  ffte ∩ fte:   {len(ffte_keys & fte_keys)} (should be 0)")

all_overlaps_zero = all([
    len(fftr_keys & ftr_keys) == 0,
    len(fftr_keys & ffte_keys) == 0,
    len(fftr_keys & fte_keys) == 0,
    len(ftr_keys & ffte_keys) == 0,
    len(ftr_keys & fte_keys) == 0,
    len(ffte_keys & fte_keys) == 0
])

if all_overlaps_zero:
    print("\n✓ All source datasets have unique trajectory keys (no leakage)")
else:
    print("\n✗ WARNING: Overlapping trajectory keys detected!")

Verifying no trajectory key overlaps between source datasets:
  fftr ∩ ftr:   0 (should be 0)
  fftr ∩ ffte:  0 (should be 0)
  fftr ∩ fte:   0 (should be 0)
  ftr  ∩ ffte:  0 (should be 0)
  ftr  ∩ fte:   0 (should be 0)
  ffte ∩ fte:   0 (should be 0)

✓ All source datasets have unique trajectory keys (no leakage)
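
The pairwise check above could also be written as a small reusable helper. The sketch below is one possible shape, with `find_overlaps` as a hypothetical name and toy key sets in place of the real ones.

```python
from itertools import combinations


def find_overlaps(named_key_sets):
    """Return {(name_a, name_b): shared_keys} for every pair of sets that overlaps."""
    overlaps = {}
    for (name_a, keys_a), (name_b, keys_b) in combinations(named_key_sets.items(), 2):
        shared = keys_a & keys_b
        if shared:
            overlaps[(name_a, name_b)] = shared
    return overlaps


# Toy key sets in the notebook's traj_key format
demo = {
    "fftr": {"fftr_f0_r1", "fftr_f0_r2"},
    "ffte": {"ffte_f0_r1", "ffte_f0_r2"},
    "bad": {"fftr_f0_r2"},  # deliberately overlaps with "fftr"
}
overlaps = find_overlaps(demo)
print(overlaps)
```

Wrapping the call in `assert not find_overlaps(...)` would stop execution on an overlap instead of only printing a warning.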


## 4. Sampling Function

This function samples simulation runs from a dataset while:
- Restricting the selection to an explicit list of run IDs (for split control)
- For normal operation (fault 0): using the entire trajectory
- For faulty operation: using only post-fault samples (after fault_start)
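
The selection rules above can be sketched with a single boolean mask on toy data. This is an alternative formulation, not the notebook's implementation: `select_segment` is a hypothetical name, and unlike a per-run loop it orders rows by run ID rather than by the order of `allowed_runs`.

```python
import pandas as pd

# Toy data: 2 runs x 5 samples for fault 1
toy = pd.DataFrame({
    "faultNumber": [1] * 10,
    "simulationRun": [1] * 5 + [2] * 5,
    "sample": list(range(1, 6)) * 2,
})


def select_segment(df, fault_number, allowed_runs, fault_start=0):
    """Keep rows of the chosen fault and runs; for faults, drop pre-fault samples."""
    mask = (df["faultNumber"] == fault_number) & df["simulationRun"].isin(allowed_runs)
    if fault_number != 0:
        # Faulty operation: keep only samples at or after the fault onset
        mask &= df["sample"] >= fault_start
    return df[mask].sort_values(["simulationRun", "sample"]).reset_index(drop=True)


post_fault = select_segment(toy, fault_number=1, allowed_runs=[2], fault_start=3)
print(post_fault)
```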

In [7]:
def sample_runs(df, fault_number, allowed_runs, fault_start=0):
    """
    Sample specific simulation runs for a given fault number.
    
    Parameters:
    -----------
    df : pd.DataFrame
        Source dataframe (fftr, ftr, ffte, or fte)
    fault_number : int/float
        Fault number to filter (0 for normal, 1-20 for faults)
    allowed_runs : array-like
        List of simulationRun IDs to include
    fault_start : int
        For faulty data, only keep samples >= this value (default: 0)
        
    Returns:
    --------
    pd.DataFrame
        Sampled data
    """
    # Filter by fault number and allowed runs
    selected_df = df[
        (df['faultNumber'] == fault_number) &
        (df['simulationRun'].isin(allowed_runs))
    ]
    
    frames = []
    for run in allowed_runs:
        run_df = selected_df[selected_df['simulationRun'] == run].sort_values('sample')
        
        if run_df.empty:
            continue
        
        if fault_number == 0:
            # Normal operation: use full trajectory
            frames.append(run_df)
        else:
            # Faulty operation: use only post-fault segment
            frames.append(run_df[run_df['sample'] >= fault_start])
    
    if not frames:
        return pd.DataFrame(columns=df.columns)
    
    return pd.concat(frames, ignore_index=True)

print("Sampling function defined.")

Sampling function defined.


## 5. Create Supervised Learning Datasets

**Supervised datasets** contain both normal and faulty examples for multiclass classification.

### Data Split Strategy (Option A - Maximize Test Set with Practical Constraints):
- **Training:** 320 normal runs + 20 runs per fault (17 faults) = 660 runs total
- **Validation:** 160 normal runs + 9 runs per fault (17 faults) = 313 runs total
- **Test:** 240 normal runs + **50 runs per fault** (17 faults) = 1,090 runs total
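
The normal-run allocation above can be sketched with a local NumPy `Generator` instead of the global seed. This is an assumption-laden variant: it would not reproduce the exact split produced by `np.random.seed(42)` with `np.random.shuffle`, and `run_ids` is a stand-in for the 500 fault-free run IDs.

```python
import numpy as np

# Local generator; the seed value mirrors the notebook's, the stream does not
rng = np.random.default_rng(42)

run_ids = np.arange(1, 501)          # stand-in for the 500 fault-free run IDs
shuffled = rng.permutation(run_ids)  # shuffled copy; run_ids itself is untouched

train_runs = shuffled[:320]
val_runs = shuffled[320:480]
unused_runs = shuffled[480:]         # 20 runs are left out of both splits

print(len(train_runs), len(val_runs), len(unused_runs))
```

A local generator keeps the split independent of how many other random calls ran earlier in the session, which a global seed does not guarantee.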

### Rationale:
A larger test set makes performance estimates more reliable. We use 50 runs per fault 
(3.3× the original 15), which provides:
- A 95% CI of roughly ±14% instead of ±26% at 80% accuracy
- Manageable file sizes (~270 MB vs ~2.7 GB for all 500)
- Sufficient statistical power to detect meaningful model differences

### Fault Timing:
- Training/validation faulty runs (from `ftr`): Fault occurs at sample 21
- Test faulty runs (from `fte`): Fault occurs at sample 161 (after 8 hours normal operation)
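
As a quick arithmetic check, the run lengths and fault onsets reported in this notebook determine how many rows each multiclass split should contain. The constant names below are illustrative only.

```python
# Run lengths and fault onsets as reported in this notebook
TRAIN_LEN, TEST_LEN = 500, 960       # samples per run in the training / testing sources
TRAIN_ONSET, TEST_ONSET = 21, 161    # first faulty sample in each source
N_FAULTS = 17                        # 20 faults minus the 3 excluded ones

post_fault_train = TRAIN_LEN - TRAIN_ONSET + 1  # samples kept per faulty training run
post_fault_test = TEST_LEN - TEST_ONSET + 1     # samples kept per faulty test run

# Normal runs contribute full trajectories, faulty runs only the post-fault segment
train_rows = 320 * TRAIN_LEN + N_FAULTS * 20 * post_fault_train
val_rows = 160 * TRAIN_LEN + N_FAULTS * 9 * post_fault_train
test_rows = 240 * TEST_LEN + N_FAULTS * 50 * post_fault_test

print(post_fault_train, post_fault_test)   # 480 800
print(train_rows, val_rows, test_rows)     # 323200 153440 910400
```

These totals agree with the dataset shapes the notebook prints for the multiclass train, validation, and test sets.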

### Excluded Faults:
Faults 3, 9, and 15 are excluded as they are too subtle to detect reliably.

In [8]:
print("Creating Supervised Learning Datasets")
print("=" * 70)

# Define excluded faults
excluded_faults = [3, 9, 15]
all_faults = [f for f in sorted(ftr['faultNumber'].unique()) 
              if f not in excluded_faults]

print(f"Excluded faults: {excluded_faults}")
print(f"Included faults: {all_faults}")
print(f"Total fault classes (including normal): {len(all_faults) + 1}\n")

# 1. Normal (fault-free) data split from fftr
all_normal_runs = fftr['simulationRun'].unique()
np.random.shuffle(all_normal_runs)

train_normal_runs = all_normal_runs[:320]
val_normal_runs = all_normal_runs[320:480]

print(f"Normal run allocation:")
print(f"  Training:   {len(train_normal_runs)} runs (runs {train_normal_runs.min()}-{train_normal_runs.max()})")
print(f"  Validation: {len(val_normal_runs)} runs (runs {val_normal_runs.min()}-{val_normal_runs.max()})")

# Sample normal runs
supervised_train = sample_runs(fftr, fault_number=0, 
                               allowed_runs=train_normal_runs, 
                               fault_start=0)

supervised_val = sample_runs(fftr, fault_number=0, 
                             allowed_runs=val_normal_runs, 
                             fault_start=0)

print(f"\nNormal data sampled:")
print(f"  Train: {supervised_train.shape}")
print(f"  Val:   {supervised_val.shape}")

Creating Supervised Learning Datasets
Excluded faults: [3, 9, 15]
Included faults: [1, 2, 4, 5, 6, 7, 8, 10, 11, 12, 13, 14, 16, 17, 18, 19, 20]
Total fault classes (including normal): 18

Normal run allocation:
  Training:   320 runs (runs 1.0-498.0)
  Validation: 160 runs (runs 2.0-499.0)

Normal data sampled:
  Train: (160000, 57)
  Val:   (80000, 57)


In [9]:
# 2. Faulty data split from ftr (training source)
print("\nSampling faulty runs from training source (ftr):")
print(f"  Fault start time: sample 21")
print(f"  Strategy: 20 runs for training, 9 runs for validation per fault\n")

for fault in all_faults:
    fault_df = ftr[ftr['faultNumber'] == fault]
    fault_runs = fault_df['simulationRun'].unique()
    np.random.shuffle(fault_runs)
    
    # Split: 20 train, 9 val (total 29 from ftr)
    train_fault_runs = fault_runs[:20]
    val_fault_runs = fault_runs[20:29]
    
    # Sample post-fault data (sample >= 21)
    train_fault_df = sample_runs(ftr, fault_number=fault,
                                 allowed_runs=train_fault_runs,
                                 fault_start=21)
    
    val_fault_df = sample_runs(ftr, fault_number=fault,
                               allowed_runs=val_fault_runs,
                               fault_start=21)
    
    supervised_train = pd.concat([supervised_train, train_fault_df], ignore_index=True)
    supervised_val = pd.concat([supervised_val, val_fault_df], ignore_index=True)
    
    print(f"  Fault {int(fault):2d}: Train {len(train_fault_runs)} runs, Val {len(val_fault_runs)} runs")

print(f"\nSupervised datasets after adding faulty runs:")
print(f"  Train: {supervised_train.shape}")
print(f"  Val:   {supervised_val.shape}")


Sampling faulty runs from training source (ftr):
  Fault start time: sample 21
  Strategy: 20 runs for training, 9 runs for validation per fault

  Fault  1: Train 20 runs, Val 9 runs
  Fault  2: Train 20 runs, Val 9 runs
  Fault  4: Train 20 runs, Val 9 runs
  Fault  5: Train 20 runs, Val 9 runs
  Fault  6: Train 20 runs, Val 9 runs
  Fault  7: Train 20 runs, Val 9 runs
  Fault  8: Train 20 runs, Val 9 runs
  Fault 10: Train 20 runs, Val 9 runs
  Fault 11: Train 20 runs, Val 9 runs
  Fault 12: Train 20 runs, Val 9 runs
  Fault 13: Train 20 runs, Val 9 runs
  Fault 14: Train 20 runs, Val 9 runs
  Fault 16: Train 20 runs, Val 9 runs
  Fault 17: Train 20 runs, Val 9 runs
  Fault 18: Train 20 runs, Val 9 runs
  Fault 19: Train 20 runs, Val 9 runs
  Fault 20: Train 20 runs, Val 9 runs

Supervised datasets after adding faulty runs:
  Train: (323200, 57)
  Val:   (153440, 57)


In [10]:
# 3. Test set from ffte (normal) and fte (faulty)
# OPTION A: Use 50 runs per fault (practical balance of size vs. statistical power)
print("\nCreating test set from ffte and fte:")
print(f"  Strategy: Use 50 runs per fault for good statistical power")
print(f"  Normal runs: 240 from ffte")
print(f"  Faulty runs: 50 runs per fault from fte (3.3× more than before)")
print(f"  Fault start time: sample 161 (after 8 hours normal operation)\n")

# Normal test runs
normal_test_runs = ffte['simulationRun'].unique()
np.random.shuffle(normal_test_runs)
test_normal_runs = normal_test_runs[:240]

supervised_test_normal = sample_runs(ffte, fault_number=0,
                                     allowed_runs=test_normal_runs,
                                     fault_start=0)

print(f"Normal test data: {supervised_test_normal.shape}")

# Faulty test runs - use 50 runs per fault
print(f"\nUsing 50 runs per fault from fte:")
supervised_test_faulty_frames = []
for fault in all_faults:
    fault_df = fte[fte['faultNumber'] == fault]
    fault_runs = fault_df['simulationRun'].unique()
    np.random.shuffle(fault_runs)
    
    # Use first 50 runs
    test_fault_runs = fault_runs[:50]
    
    test_fault_df = sample_runs(fte, fault_number=fault,
                                allowed_runs=test_fault_runs,
                                fault_start=161)
    
    supervised_test_faulty_frames.append(test_fault_df)
    print(f"  Fault {int(fault):2d}: {len(test_fault_runs)} runs, {len(test_fault_df)} samples")

supervised_test_faulty = pd.concat(supervised_test_faulty_frames, ignore_index=True)
supervised_test = pd.concat([supervised_test_normal, supervised_test_faulty], ignore_index=True)

# Sort by fault number, run, and sample for consistency
supervised_test = supervised_test.sort_values(
    ['faultNumber', 'simulationRun', 'sample']
).reset_index(drop=True)

print(f"\nFinal supervised test set: {supervised_test.shape}")
print(f"  Total test runs: {supervised_test['traj_key'].nunique()}")
print(f"  Normal runs: 240, Fault runs: {len(all_faults)} × 50 = {len(all_faults) * 50}")


Creating test set from ffte and fte:
  Strategy: Use 50 runs per fault for good statistical power
  Normal runs: 240 from ffte
  Faulty runs: 50 runs per fault from fte (3.3× more than before)
  Fault start time: sample 161 (after 8 hours normal operation)

Normal test data: (230400, 57)

Using 50 runs per fault from fte:
  Fault  1: 50 runs, 40000 samples
  Fault  2: 50 runs, 40000 samples
  Fault  4: 50 runs, 40000 samples
  Fault  5: 50 runs, 40000 samples
  Fault  6: 50 runs, 40000 samples
  Fault  7: 50 runs, 40000 samples
  Fault  8: 50 runs, 40000 samples
  Fault 10: 50 runs, 40000 samples
  Fault 11: 50 runs, 40000 samples
  Fault 12: 50 runs, 40000 samples
  Fault 13: 50 runs, 40000 samples
  Fault 14: 50 runs, 40000 samples
  Fault 16: 50 runs, 40000 samples
  Fault 17: 50 runs, 40000 samples
  Fault 18: 50 runs, 40000 samples
  Fault 19: 50 runs, 40000 samples
  Fault 20: 50 runs, 40000 samples

Final supervised test set: (910400, 57)
  Total test runs: 1090
  Normal runs: 240, Fault runs: 17 × 50 = 850

### Verify Supervised Dataset Properties

In [11]:
print("Supervised Dataset Summary:")
print("=" * 70)
print(f"\nTraining set:")
print(f"  Shape: {supervised_train.shape}")
print(f"  Faults: {sorted(supervised_train['faultNumber'].unique())}")
print(f"  Unique trajectories: {supervised_train['traj_key'].nunique()}")

print(f"\nValidation set:")
print(f"  Shape: {supervised_val.shape}")
print(f"  Faults: {sorted(supervised_val['faultNumber'].unique())}")
print(f"  Unique trajectories: {supervised_val['traj_key'].nunique()}")

print(f"\nTest set:")
print(f"  Shape: {supervised_test.shape}")
print(f"  Faults: {sorted(supervised_test['faultNumber'].unique())}")
print(f"  Unique trajectories: {supervised_test['traj_key'].nunique()}")

# Check for leakage
train_keys = set(supervised_train['traj_key'])
val_keys = set(supervised_val['traj_key'])
test_keys = set(supervised_test['traj_key'])

print(f"\nData leakage check:")
print(f"  Train ∩ Val:  {len(train_keys & val_keys)} trajectories (should be 0)")
print(f"  Train ∩ Test: {len(train_keys & test_keys)} trajectories (should be 0)")
print(f"  Val ∩ Test:   {len(val_keys & test_keys)} trajectories (should be 0)")

if len(train_keys & val_keys) == 0 and len(train_keys & test_keys) == 0 and len(val_keys & test_keys) == 0:
    print("\n  ✓ No data leakage detected")
else:
    print("\n  ✗ WARNING: Data leakage detected!")

Supervised Dataset Summary:

Training set:
  Shape: (323200, 57)
  Faults: [0.0, 1.0, 2.0, 4.0, 5.0, 6.0, 7.0, 8.0, 10.0, 11.0, 12.0, 13.0, 14.0, 16.0, 17.0, 18.0, 19.0, 20.0]
  Unique trajectories: 660

Validation set:
  Shape: (153440, 57)
  Faults: [0.0, 1.0, 2.0, 4.0, 5.0, 6.0, 7.0, 8.0, 10.0, 11.0, 12.0, 13.0, 14.0, 16.0, 17.0, 18.0, 19.0, 20.0]
  Unique trajectories: 313

Test set:
  Shape: (910400, 57)
  Faults: [0, 1, 2, 4, 5, 6, 7, 8, 10, 11, 12, 13, 14, 16, 17, 18, 19, 20]
  Unique trajectories: 1090

Data leakage check:
  Train ∩ Val:  0 trajectories (should be 0)
  Train ∩ Test: 0 trajectories (should be 0)
  Val ∩ Test:   0 trajectories (should be 0)

  ✓ No data leakage detected


## 6. Create Semi-Supervised Learning Datasets

**Semi-supervised datasets** support anomaly detection (binary classification): models are trained only on normal data and evaluated on both normal and faulty runs.

### Data Split Strategy:
- **Training:** 320 normal runs (fault-free only)
- **Validation:** 160 normal runs (fault-free only)
- **Test:** 120 normal runs + **50 runs per fault** (17 faults) = 970 runs total

The training and validation sets use the **same normal runs** as the supervised learning case
to enable fair comparison between approaches.

The test set uses 50 runs per fault for robust anomaly detection evaluation (5× more than the original 10 runs).
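
The binary files keep the original `faultNumber` column, so a binary target has to be derived downstream. A minimal sketch, assuming a hypothetical `is_anomaly` column name and toy rows:

```python
import pandas as pd

# Toy rows with the notebook's faultNumber column (0 = normal, 1-20 = fault)
toy = pd.DataFrame({"faultNumber": [0, 0, 1, 4, 20]})

# Binary target: 1 for any fault, 0 for normal operation
toy["is_anomaly"] = (toy["faultNumber"] != 0).astype(int)

print(toy)
```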

In [12]:
print("Creating Semi-Supervised Learning Datasets")
print("=" * 70)

# Training: same 320 normal runs as supervised (from fftr)
semisupervised_train = sample_runs(fftr, fault_number=0,
                                   allowed_runs=train_normal_runs,
                                   fault_start=0)

# Validation: same 160 normal runs as supervised (from fftr)
semisupervised_val = sample_runs(fftr, fault_number=0,
                                 allowed_runs=val_normal_runs,
                                 fault_start=0)

print(f"Train (normal only): {semisupervised_train.shape}")
print(f"Val (normal only):   {semisupervised_val.shape}")

# Test: 120 normal runs from ffte
semisupervised_test_normal = sample_runs(ffte, fault_number=0,
                                         allowed_runs=test_normal_runs[:120],
                                         fault_start=0)

print(f"\nTest normal data: {semisupervised_test_normal.shape}")

# Test: 50 runs per fault from fte (post-fault samples)
print(f"\nUsing 50 fault runs per fault from fte for anomaly detection testing:")
semisupervised_test_faulty_frames = []

for fault in all_faults:
    fault_df = fte[fte['faultNumber'] == fault]
    fault_runs = fault_df['simulationRun'].unique()
    np.random.shuffle(fault_runs)
    
    # Use first 50 runs of a fresh shuffle (not necessarily the same runs as the multiclass test set)
    test_fault_runs = fault_runs[:50]
    
    test_fault_df = sample_runs(fte, fault_number=fault,
                                allowed_runs=test_fault_runs,
                                fault_start=161)
    
    semisupervised_test_faulty_frames.append(test_fault_df)
    print(f"  Fault {int(fault):2d}: {len(test_fault_runs)} runs, {len(test_fault_df)} samples")

semisupervised_test_faulty = pd.concat(semisupervised_test_faulty_frames, ignore_index=True)
semisupervised_test = pd.concat([semisupervised_test_normal, semisupervised_test_faulty],
                                ignore_index=True)

# Sort for consistency
semisupervised_test = semisupervised_test.sort_values(
    ['faultNumber', 'simulationRun', 'sample']
).reset_index(drop=True)

print(f"\nFinal semi-supervised test set: {semisupervised_test.shape}")
print(f"  Total test runs: {semisupervised_test['traj_key'].nunique()}")
print(f"  Normal runs: 120, Fault runs: {len(all_faults)} × 50 = {len(all_faults) * 50}")

Creating Semi-Supervised Learning Datasets
Train (normal only): (160000, 57)
Val (normal only):   (80000, 57)

Test normal data: (115200, 57)

Using 50 fault runs per fault from fte for anomaly detection testing:
  Fault  1: 50 runs, 40000 samples
  Fault  2: 50 runs, 40000 samples
  Fault  4: 50 runs, 40000 samples
  Fault  5: 50 runs, 40000 samples
  Fault  6: 50 runs, 40000 samples
  Fault  7: 50 runs, 40000 samples
  Fault  8: 50 runs, 40000 samples
  Fault 10: 50 runs, 40000 samples
  Fault 11: 50 runs, 40000 samples
  Fault 12: 50 runs, 40000 samples
  Fault 13: 50 runs, 40000 samples
  Fault 14: 50 runs, 40000 samples
  Fault 16: 50 runs, 40000 samples
  Fault 17: 50 runs, 40000 samples
  Fault 18: 50 runs, 40000 samples
  Fault 19: 50 runs, 40000 samples
  Fault 20: 50 runs, 40000 samples

Final semi-supervised test set: (795200, 57)
  Total test runs: 970
  Normal runs: 120, Fault runs: 17 × 50 = 850


### Verify Semi-Supervised Dataset Properties

In [13]:
print("Semi-Supervised Dataset Summary:")
print("=" * 70)
print(f"\nTraining set (normal only):")
print(f"  Shape: {semisupervised_train.shape}")
print(f"  Faults: {sorted(semisupervised_train['faultNumber'].unique())}")
print(f"  Unique trajectories: {semisupervised_train['traj_key'].nunique()}")

print(f"\nValidation set (normal only):")
print(f"  Shape: {semisupervised_val.shape}")
print(f"  Faults: {sorted(semisupervised_val['faultNumber'].unique())}")
print(f"  Unique trajectories: {semisupervised_val['traj_key'].nunique()}")

print(f"\nTest set (normal + faulty):")
print(f"  Shape: {semisupervised_test.shape}")
print(f"  Faults: {sorted(semisupervised_test['faultNumber'].unique())}")
print(f"  Unique trajectories: {semisupervised_test['traj_key'].nunique()}")

# Check for leakage
train_keys_ss = set(semisupervised_train['traj_key'])
val_keys_ss = set(semisupervised_val['traj_key'])
test_keys_ss = set(semisupervised_test['traj_key'])

print(f"\nData leakage check:")
print(f"  Train ∩ Val:  {len(train_keys_ss & val_keys_ss)} trajectories (should be 0)")
print(f"  Train ∩ Test: {len(train_keys_ss & test_keys_ss)} trajectories (should be 0)")
print(f"  Val ∩ Test:   {len(val_keys_ss & test_keys_ss)} trajectories (should be 0)")

if len(train_keys_ss & val_keys_ss) == 0 and len(train_keys_ss & test_keys_ss) == 0 and len(val_keys_ss & test_keys_ss) == 0:
    print("\n  ✓ No data leakage detected")
else:
    print("\n  ✗ WARNING: Data leakage detected!")

Semi-Supervised Dataset Summary:

Training set (normal only):
  Shape: (160000, 57)
  Faults: [0.0]
  Unique trajectories: 320

Validation set (normal only):
  Shape: (80000, 57)
  Faults: [0.0]
  Unique trajectories: 160

Test set (normal + faulty):
  Shape: (795200, 57)
  Faults: [0, 1, 2, 4, 5, 6, 7, 8, 10, 11, 12, 13, 14, 16, 17, 18, 19, 20]
  Unique trajectories: 970

Data leakage check:
  Train ∩ Val:  0 trajectories (should be 0)
  Train ∩ Test: 0 trajectories (should be 0)
  Val ∩ Test:   0 trajectories (should be 0)

  ✓ No data leakage detected


## 7. Save Datasets to CSV

In [14]:
print("Saving datasets to ../data/")
print("=" * 70)

# Save multiclass (supervised) datasets
supervised_train.to_csv('../data/multiclass_train.csv', index=False)
print(f"✓ Saved: multiclass_train.csv ({supervised_train.shape})")

supervised_val.to_csv('../data/multiclass_val.csv', index=False)
print(f"✓ Saved: multiclass_val.csv ({supervised_val.shape})")

supervised_test.to_csv('../data/multiclass_test.csv', index=False)
print(f"✓ Saved: multiclass_test.csv ({supervised_test.shape})")

# Save binary (semi-supervised) datasets
semisupervised_train.to_csv('../data/binary_train.csv', index=False)
print(f"✓ Saved: binary_train.csv ({semisupervised_train.shape})")

semisupervised_val.to_csv('../data/binary_val.csv', index=False)
print(f"✓ Saved: binary_val.csv ({semisupervised_val.shape})")

semisupervised_test.to_csv('../data/binary_test.csv', index=False)
print(f"✓ Saved: binary_test.csv ({semisupervised_test.shape})")

print("\nAll datasets saved successfully!")

Saving datasets to ../data/
✓ Saved: multiclass_train.csv ((323200, 57))
✓ Saved: multiclass_val.csv ((153440, 57))
✓ Saved: multiclass_test.csv ((910400, 57))
✓ Saved: binary_train.csv ((160000, 57))
✓ Saved: binary_val.csv ((80000, 57))
✓ Saved: binary_test.csv ((795200, 57))

All datasets saved successfully!


## 8. Summary Statistics

In [15]:
print("\n" + "=" * 70)
print("FINAL DATASET SUMMARY")
print("=" * 70)

print("\nMULTICLASS (18-way classification):")
print(f"  Training:   {supervised_train.shape[0]:>8,} samples, {supervised_train['traj_key'].nunique():>5} runs")
print(f"  Validation: {supervised_val.shape[0]:>8,} samples, {supervised_val['traj_key'].nunique():>5} runs")
print(f"  Test:       {supervised_test.shape[0]:>8,} samples, {supervised_test['traj_key'].nunique():>5} runs ← 3.3× MORE")
print(f"  Total:      {supervised_train.shape[0] + supervised_val.shape[0] + supervised_test.shape[0]:>8,} samples")

print("\nBINARY (anomaly detection):")
print(f"  Training:   {semisupervised_train.shape[0]:>8,} samples, {semisupervised_train['traj_key'].nunique():>5} runs (normal only)")
print(f"  Validation: {semisupervised_val.shape[0]:>8,} samples, {semisupervised_val['traj_key'].nunique():>5} runs (normal only)")
print(f"  Test:       {semisupervised_test.shape[0]:>8,} samples, {semisupervised_test['traj_key'].nunique():>5} runs ← 5× MORE")
print(f"  Total:      {semisupervised_train.shape[0] + semisupervised_val.shape[0] + semisupervised_test.shape[0]:>8,} samples")

print("\nTEST SET IMPROVEMENT:")
print(f"  Multiclass: 15 → 50 runs per fault (3.3× improvement)")
print(f"  Binary:     10 → 50 runs per fault (5.0× improvement)")
print(f"  95% CI width: ±26% → ±14% (for 80% accuracy)")
print(f"  File size: Manageable ~270 MB (vs. ~2.7 GB for all 500 runs)")

print("\nFEATURES:")
feature_cols = [col for col in supervised_train.columns if col.startswith('xmeas') or col.startswith('xmv')]
print(f"  Measurements (xmeas): {len([c for c in feature_cols if c.startswith('xmeas')])}")
print(f"  Manipulated (xmv):    {len([c for c in feature_cols if c.startswith('xmv')])}")
print(f"  Total features:       {len(feature_cols)}")

print("\nFAULT CLASSES:")
print(f"  Included faults: {all_faults}")
print(f"  Excluded faults: {excluded_faults}")
print(f"  Total classes:   {len(all_faults) + 1} (including normal)")

print("\n" + "=" * 70)
print("Dataset creation complete!")
print("=" * 70)


FINAL DATASET SUMMARY

MULTICLASS (18-way classification):
  Training:    323,200 samples,   660 runs
  Validation:  153,440 samples,   313 runs
  Test:        910,400 samples,  1090 runs ← 3.3× MORE
  Total:      1,387,040 samples

BINARY (anomaly detection):
  Training:    160,000 samples,   320 runs (normal only)
  Validation:   80,000 samples,   160 runs (normal only)
  Test:        795,200 samples,   970 runs ← 5× MORE
  Total:      1,035,200 samples

TEST SET IMPROVEMENT:
  Multiclass: 15 → 50 runs per fault (3.3× improvement)
  Binary:     10 → 50 runs per fault (5.0× improvement)
  95% CI width: ±26% → ±14% (for 80% accuracy)
  File size: Manageable ~270 MB (vs. ~2.7 GB for all 500 runs)

FEATURES:
  Measurements (xmeas): 41
  Manipulated (xmv):    11
  Total features:       52

FAULT CLASSES:
  Included faults: [1, 2, 4, 5, 6, 7, 8, 10, 11, 12, 13, 14, 16, 17, 18, 19, 20]
  Excluded faults: [3, 9, 15]
  Total classes:   18 (including normal)

Dataset creation complete!
