# Suffix Prediction Model Training

This notebook demonstrates how to train a sequence-to-sequence LSTM model for suffix prediction.

The model:
- Filters event logs to only "start" and "complete" lifecycle transitions
- Learns to predict the entire remaining sequence (suffix) of activities for a case
- Uses an encoder-decoder architecture to predict sequences instead of single activities
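
To make the prediction target concrete, the sketch below splits a single trace into every (prefix, suffix) pair. It is only an illustration, not the package's preprocessing code: the helper name `prefix_suffix_pairs` and the `<END>` marker are assumptions chosen for this example.

```python
from typing import List, Tuple

END = "<END>"  # assumed end-of-case marker; the package's actual token may differ


def prefix_suffix_pairs(
    trace: List[str], min_prefix_length: int = 1
) -> List[Tuple[List[str], List[str]]]:
    """Return every (prefix, suffix) split of a trace, appending END to each suffix."""
    pairs = []
    for cut in range(min_prefix_length, len(trace) + 1):
        prefix = trace[:cut]
        suffix = trace[cut:] + [END]  # the model should learn where the case stops
        pairs.append((prefix, suffix))
    return pairs


trace = ["A_Create Application", "A_Submitted", "A_Concept"]
for prefix, suffix in prefix_suffix_pairs(trace):
    print(prefix, "->", suffix)
```

A trace of three activities yields three pairs here; the last pair uses the full trace as the prefix and only the end marker as the suffix.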


## 1. Setup and Imports


In [5]:
import sys
from pathlib import Path
import logging

# Add project root to Python path
project_root = Path.cwd().parent if Path.cwd().name == 'next_activity_prediction' else Path.cwd()
if str(project_root) not in sys.path:
    sys.path.insert(0, str(project_root))

# Setup logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
)

from next_activity_prediction import train_suffix_model


## 2. Configuration

Configure the suffix prediction model architecture and training parameters.


In [6]:
# Path to your event log
EVENT_LOG_PATH = r"..\eventlog\eventlog.xes.gz"  # Update this to your event log path

# Model configuration
config = {
    "event_log_path": EVENT_LOG_PATH,
    "model_dir": "models/suffix_prediction_lstm",  # Where to save the model
    
    # Sequence lengths
    "prefix_length": 50,      # Maximum prefix length (input)
    "suffix_length": 30,      # Maximum suffix length (output)
    "min_prefix_length": 1,   # Minimum prefix length to consider
    
    # Model architecture
    "embedding_dim": 128,     # Embedding dimension for activity tokens
    "encoder_lstm_units": 256,    # Number of LSTM units in encoder
    "decoder_lstm_units": 256,    # Number of LSTM units in decoder
    "encoder_lstm_layers": 2,     # Number of encoder LSTM layers
    "decoder_lstm_layers": 2,     # Number of decoder LSTM layers
    "dropout_rate": 0.3,      # Dropout rate for regularization
    
    # Training parameters
    "batch_size": 64,         # Batch size for training
    "learning_rate": 0.001,   # Initial learning rate
    "epochs": 1,              # Maximum number of training epochs (early stopping cannot trigger with a single epoch)
    "validation_split": 0.2,  # Fraction of data for validation
    "early_stopping_patience": 10,  # Early stopping patience
    
    # Data preprocessing
    "min_case_length": 2,     # Minimum case length to include
    "max_case_length": 200,   # Maximum case length (longer cases are truncated)
    "random_seed": 42
}

print("Configuration:")
print(f"  Event log: {config['event_log_path']}")
print(f"  Model directory: {config['model_dir']}")
print(f"  Prefix length: {config['prefix_length']}")
print(f"  Suffix length: {config['suffix_length']}")
print(f"  Encoder LSTM units: {config['encoder_lstm_units']}")
print(f"  Decoder LSTM units: {config['decoder_lstm_units']}")
print(f"  Encoder layers: {config['encoder_lstm_layers']}")
print(f"  Decoder layers: {config['decoder_lstm_layers']}")
print(f"  Batch size: {config['batch_size']}")
print(f"  Epochs: {config['epochs']}")


Configuration:
  Event log: ..\eventlog\eventlog.xes.gz
  Model directory: models/suffix_prediction_lstm
  Prefix length: 50
  Suffix length: 30
  Encoder LSTM units: 256
  Decoder LSTM units: 256
  Encoder layers: 2
  Decoder layers: 2
  Batch size: 64
  Epochs: 1


## 3. Verify Event Log

Check that the event log exists and has the required columns.


In [7]:
import pandas as pd
import pm4py

# Load and inspect event log
event_log_path = config['event_log_path']
if Path(event_log_path).exists():
    log = pm4py.read_xes(event_log_path)
    df = pm4py.convert_to_dataframe(log)
    
    print(f"Event log loaded: {len(df)} events, {df['case:concept:name'].nunique()} cases")
    print(f"\nColumns: {list(df.columns)}")
    
    # Check for lifecycle column
    if 'lifecycle:transition' in df.columns:
        lifecycle_counts = df['lifecycle:transition'].value_counts()
        print("\nLifecycle transitions:")
        print(lifecycle_counts)
        
        start_complete = df[df['lifecycle:transition'].isin(['start', 'complete'])]
        print(f"\nStart/Complete events: {len(start_complete)} ({len(start_complete)/len(df):.1%})")
    else:
        print("\nWarning: No 'lifecycle:transition' column found")
    
    # Show sample activities
    if 'concept:name' in df.columns:
        activities = df['concept:name'].unique()
        print(f"\nUnique activities: {len(activities)}")
        print(f"Sample activities: {list(activities[:10])}")
else:
    print(f"Event log not found: {event_log_path}")
    print("Please update EVENT_LOG_PATH in the configuration cell above.")


Event log loaded: 1202267 events, 31509 cases

Columns: ['case:RequestedAmount', 'OfferedAmount', 'lifecycle:transition', 'concept:name', 'time:timestamp', 'OfferID', 'Accepted', 'case:concept:name', 'case:ApplicationType', 'EventOrigin', 'org:resource', 'Action', 'FirstWithdrawalAmount', 'EventID', 'CreditScore', 'MonthlyCost', 'NumberOfTerms', 'case:LoanGoal', 'Selected']

Lifecycle transitions:
lifecycle:transition
complete     475306
suspend      215402
schedule     149104
start        128227
resume       127160
ate_abort     85224
withdraw      21844
Name: count, dtype: int64

Start/Complete events: 603533 (50.2%)

Unique activities: 26
Sample activities: ['A_Create Application', 'A_Submitted', 'W_Handle leads', 'W_Complete application', 'A_Concept', 'A_Accepted', 'O_Create Offer', 'O_Created', 'O_Sent (mail and online)', 'W_Call after offers']
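
The reported share of start/complete events can be checked directly from the lifecycle counts printed above. The short sketch below only redoes that arithmetic; the dictionary is copied by hand from the output and is not produced by the package.

```python
# Lifecycle counts copied from the output above.
lifecycle_counts = {
    "complete": 475306,
    "suspend": 215402,
    "schedule": 149104,
    "start": 128227,
    "resume": 127160,
    "ate_abort": 85224,
    "withdraw": 21844,
}

total_events = sum(lifecycle_counts.values())
kept_events = lifecycle_counts["start"] + lifecycle_counts["complete"]
share_kept = f"{kept_events / total_events:.1%}"

print(f"Total events: {total_events}")
print(f"Kept events:  {kept_events}")
print(f"Share kept:   {share_kept}")
```

The totals agree with the 1202267 events and 603533 start/complete events reported by the cell, so roughly half of the log is discarded before training.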


## 4. Train the Model

Train the sequence-to-sequence LSTM model on the event log. This will:
1. Load and preprocess the event log
2. Filter to start/complete lifecycles
3. Extract activity sequences and create prefix-suffix pairs
4. Create training/validation splits
5. Train the encoder-decoder model with early stopping
6. Save the trained model and metadata
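
Step 3 can be pictured as turning activity names into fixed-length integer sequences. The sketch below is a minimal illustration under stated assumptions: the helper names `build_vocab` and `encode`, the `<PAD>`/`<END>` spellings, padding index 0, and right-padding are all choices made for this example and may differ from what `train_suffix_model` does internally.

```python
from typing import Dict, List

PAD, END = "<PAD>", "<END>"  # token spellings assumed for this sketch


def build_vocab(activities: List[str]) -> Dict[str, int]:
    """Map PAD to 0, activities to 1..n (sorted by name), and END to the last index."""
    vocab = {PAD: 0}
    for name in sorted(set(activities)):
        vocab[name] = len(vocab)
    vocab[END] = len(vocab)
    return vocab


def encode(tokens: List[str], vocab: Dict[str, int], length: int) -> List[int]:
    """Encode tokens as indices, keep at most the first `length`, and right-pad with PAD."""
    ids = [vocab[t] for t in tokens][:length]
    return ids + [vocab[PAD]] * (length - len(ids))


vocab = build_vocab(["A_Create Application", "A_Submitted", "A_Concept"])
prefix_ids = encode(["A_Create Application", "A_Submitted"], vocab, length=5)
suffix_ids = encode(["A_Concept", END], vocab, length=4)
print(vocab)
print(prefix_ids, suffix_ids)
```

Under this scheme, 26 activities plus the two special tokens give 28 entries with the end marker at index 27, which would be consistent with the vocabulary size and END token index reported elsewhere in this notebook, though the package may build its vocabulary differently.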


In [8]:
# Train the suffix prediction model
model, metadata = train_suffix_model(**config)

print("\nTraining completed!")


2026-01-09 15:01:37,217 - next_activity_prediction.suffix_trainer - INFO - Preprocessing event log...
2026-01-09 15:01:37,218 - next_activity_prediction.suffix_data_preprocessing - INFO - Loading event log from ..\eventlog\eventlog.xes.gz
2026-01-09 15:01:44,345 - next_activity_prediction.suffix_data_preprocessing - INFO - Loaded 1202267 events, 31509 cases
2026-01-09 15:01:44,500 - next_activity_prediction.suffix_data_preprocessing - INFO - Filtered to start/complete lifecycles: 1,202,267 -> 603,533 (50.2%)
2026-01-09 15:01:45,730 - next_activity_prediction.suffix_data_preprocessing - INFO - Extracted 31509 case sequences
2026-01-09 15:01:45,735 - next_activity_prediction.suffix_data_preprocessing - INFO - Average sequence length: 20.2 (including END)
2026-01-09 15:01:45,738 - next_activity_prediction.suffix_data_preprocessing - INFO - Min/Max sequence length: 9/68
2026-01-09 15:01:45,797 - next_activity_prediction.suffix_data_preprocessing - INFO - Created vocabulary with 28 activiti

2026-01-09 15:01:50,614 - next_activity_prediction.suffix_trainer - INFO - Model: "suffix_prediction_lstm"
┏━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┓
┃ Layer (type)        ┃ Output Shape      ┃    Param # ┃ Connected to      ┃
┡━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━┩
│ prefix_input        │ (None, 50)        │          0 │ -                 │
│ (InputLayer)        │                   │            │                   │
├─────────────────────┼───────────────────┼────────────┼───────────────────┤
│ activity_embedding  │ (None, 50, 128)   │      3,584 │ prefix_input[0][… │
│ (Embedding)         │                   │            │                   │
├─────────────────────┼───────────────────┼────────────┼───────────────────┤
│ not_equal_1         │ (None, 50)        │          0 │ prefix_input[0][… │
│ (NotEqual)          │                   │            │                   │
├─────────────────────┼───────────────────┼───

7545/7545 ━━━━━━━━━━━━━━━━━━━━ 0s 262ms/step - loss: 1.0386 - sparse_categorical_accuracy: 0.7139
Epoch 1: val_loss improved from None to 0.88825, saving model to models\suffix_prediction_lstm\checkpoints\best_model.keras

Epoch 1: finished saving model to models\suffix_prediction_lstm\checkpoints\best_model.keras
7545/7545 ━━━━━━━━━━━━━━━━━━━━ 2111s 279ms/step - loss: 0.9503 - sparse_categorical_accuracy: 0.7331 - val_loss: 0.8882 - val_sparse_categorical_accuracy: 0.7466
Restoring model weights from the end of the best epoch: 1.


2026-01-09 15:37:02,152 - next_activity_prediction.suffix_trainer - INFO - Saved model to models\suffix_prediction_lstm\model.keras
2026-01-09 15:37:02,154 - next_activity_prediction.suffix_trainer - INFO - Saved metadata to models\suffix_prediction_lstm\metadata.json



Training completed!
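
The step count in the progress bar can be cross-checked against the configuration. The sketch below assumes one (prefix, suffix) pair per retained start/complete event and an 80/20 split at the pair level. Neither assumption is confirmed by the logs, but under them the arithmetic reproduces the 7545 steps per epoch, and the vocabulary size times the embedding dimension reproduces the 3,584 embedding parameters in the model summary.

```python
import math

retained_events = 603_533   # start/complete events reported in the log above
validation_split = 0.2      # from the configuration
batch_size = 64             # from the configuration
vocab_size = 28             # from the preprocessing log
embedding_dim = 128         # from the configuration

# Assumption: one (prefix, suffix) pair per retained event, split at the pair level.
train_pairs = int(retained_events * (1 - validation_split))
steps_per_epoch = math.ceil(train_pairs / batch_size)

# An embedding layer stores one vector of size embedding_dim per vocabulary entry.
embedding_params = vocab_size * embedding_dim

print(f"Training pairs:       {train_pairs}")
print(f"Steps per epoch:      {steps_per_epoch}")
print(f"Embedding parameters: {embedding_params}")
```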


## 5. Training Results

The training history is stored in the metadata. Let's check the final metrics.


In [9]:
# Print training results from metadata
def fmt_metric(value):
    """Format numeric metrics to 4 decimals; print anything else as-is."""
    return f"{value:.4f}" if isinstance(value, (int, float)) else str(value)

if 'training_history' in metadata:
    history = metadata['training_history']
    print("Final Training Metrics:")
    print(f"  Loss: {fmt_metric(history.get('final_train_loss', 'N/A'))}")
    print(f"  Accuracy: {fmt_metric(history.get('final_train_acc', 'N/A'))}")
    print("\nFinal Validation Metrics:")
    print(f"  Loss: {fmt_metric(history.get('final_val_loss', 'N/A'))}")
    print(f"  Accuracy: {fmt_metric(history.get('final_val_acc', 'N/A'))}")
    print(f"\nEpochs trained: {history.get('epochs_trained', 'N/A')}")
else:
    print("Training history not found in metadata")


Final Training Metrics:
  Loss: 0.9503
  Accuracy: 0.7331

Final Validation Metrics:
  Loss: 0.8882
  Accuracy: 0.7466

Epochs trained: 1


## 6. Verify Model Files

Check that the model and metadata were saved correctly.


In [15]:
model_dir = Path(config['model_dir'])

print(f"Model directory: {model_dir}")
print(f"\nFiles in model directory:")
if model_dir.exists():
    for file in sorted(model_dir.rglob('*')):
        if file.is_file():
            size = file.stat().st_size / (1024 * 1024)  # Size in MB
            print(f"  {file.relative_to(model_dir.parent)} ({size:.2f} MB)")
else:
    print(f"  Directory does not exist: {model_dir}")

# Check metadata
metadata_file = model_dir / "metadata.json"
if metadata_file.exists():
    print(f"\n✓ Metadata file exists: {metadata_file}")
    print(f"\nModel metadata:")
    print(f"  Model type: {metadata.get('model_type', 'N/A')}")
    print(f"  Vocabulary size: {metadata.get('vocab_size', 'N/A')}")
    print(f"  Prefix length: {metadata.get('prefix_length', 'N/A')}")
    print(f"  Suffix length: {metadata.get('suffix_length', 'N/A')}")
    print(f"  END token index: {metadata.get('end_token_idx', 'N/A')}")
else:
    print(f"\n✗ Metadata file not found: {metadata_file}")


Model directory: models\suffix_prediction_lstm

Files in model directory:
  suffix_prediction_lstm\checkpoints\best_model.keras (22.73 MB)
  suffix_prediction_lstm\metadata.json (0.00 MB)
  suffix_prediction_lstm\model.keras (22.73 MB)

✓ Metadata file exists: models\suffix_prediction_lstm\metadata.json

Model metadata:
  Model type: suffix_prediction_lstm
  Vocabulary size: 28
  Prefix length: 50
  Suffix length: 30
  END token index: 27
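
The cell above confirms that `metadata.json` exists but prints the in-memory `metadata` dictionary rather than the file contents. A round-trip check could look like the sketch below. It assumes the file is plain JSON, and `load_metadata` is a hypothetical helper written for this example; a temporary directory stands in for the real model directory so the snippet runs on its own.

```python
import json
import tempfile
from pathlib import Path


def load_metadata(model_dir: Path) -> dict:
    """Read metadata.json from a model directory (assumed to be plain JSON)."""
    with open(model_dir / "metadata.json", encoding="utf-8") as fh:
        return json.load(fh)


# Demonstration with a temporary directory standing in for the model directory.
with tempfile.TemporaryDirectory() as tmp:
    demo_dir = Path(tmp)
    in_memory = {"vocab_size": 28, "prefix_length": 50, "suffix_length": 30, "end_token_idx": 27}
    (demo_dir / "metadata.json").write_text(json.dumps(in_memory), encoding="utf-8")

    on_disk = load_metadata(demo_dir)
    # Collect keys whose saved value differs from the in-memory value.
    mismatched = {k for k in in_memory if on_disk.get(k) != in_memory[k]}
    print("Mismatched keys:", mismatched or "none")
```

In the notebook itself, the same comparison would use `load_metadata(model_dir)` against the `metadata` object returned by training.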


## 7. Test the Predictor

Test loading the model and making predictions.


In [16]:
from next_activity_prediction import LSTMSuffixPredictor

# Load the predictor
predictor = LSTMSuffixPredictor(model_path=config['model_dir'])

print("✓ Suffix predictor loaded successfully")
print(f"\nModel configuration:")
print(f"  Prefix length: {predictor.prefix_length}")
print(f"  Suffix length: {predictor.suffix_length}")
print(f"  Vocabulary size: {len(predictor.activity_to_idx)}")
print(f"  END token index: {predictor.end_token_idx}")

# Test with a sample prefix
sample_prefix = ["A_Create Application", "A_Submitted", "A_Concept"]
print(f"\nTesting with prefix: {sample_prefix}")

# Lightweight stand-in for a simulation case state, so this test does not
# depend on the simulation package
class MockCaseState:
    def __init__(self, case_id, activity_history):
        self.case_id = case_id
        self.activity_history = activity_history

case_state = MockCaseState("test_case_1", sample_prefix)

# Predict next activity (this will predict the entire suffix internally)
next_activity, is_end = predictor.predict(case_state)
print(f"\nPredicted next activity: {next_activity}")
print(f"Is case ended: {is_end}")

# Check if suffix was cached
if "test_case_1" in predictor.predicted_suffixes:
    suffix = predictor.predicted_suffixes["test_case_1"]
    print(f"\nPredicted suffix (first 15 activities): {suffix[:15]}")
    
    if predictor.suffix_positions.get("test_case_1") is not None:
        print(f"Current position in suffix: {predictor.suffix_positions['test_case_1']}")
else:
    print("\nNote: Suffix not cached (case may have ended immediately)")


2026-01-09 15:58:28,920 - next_activity_prediction.suffix_predictor - INFO - Loading suffix prediction model from models/suffix_prediction_lstm...
2026-01-09 15:58:29,144 - next_activity_prediction.suffix_predictor - INFO - Loaded suffix prediction model
2026-01-09 15:58:29,145 - next_activity_prediction.suffix_predictor - INFO - Prefix length: 50, Suffix length: 30
2026-01-09 15:58:29,145 - next_activity_prediction.suffix_predictor - INFO - Vocabulary size: 28
2026-01-09 15:58:29,146 - next_activity_prediction.suffix_predictor - INFO - END token index: 27


✓ Suffix predictor loaded successfully

Model configuration:
  Prefix length: 50
  Suffix length: 30
  Vocabulary size: 28
  END token index: 27

Testing with prefix: ['A_Create Application', 'A_Submitted', 'A_Concept']

Predicted next activity: A_Concept
Is case ended: False

Predicted suffix (first 15 activities): ['A_Concept', 'W_Complete application', 'A_Accepted', 'O_Create Offer', 'O_Created', 'O_Sent (mail and online)', 'W_Complete application', 'W_Call after offers', 'A_Complete', 'W_Validate application', 'A_Validating', 'O_Returned', '<PAD>', '<PAD>', '<PAD>']
Current position in suffix: 1
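
The output above suggests that the predictor computes the whole suffix once, caches it per case, and then hands out one activity per call. The sketch below shows one way such stepping could work. `CachedSuffixStepper` is a hypothetical class written for illustration, not the actual `LSTMSuffixPredictor` logic, and treating `<PAD>` or `<END>` as the end of the case is an assumption.

```python
from typing import Dict, List, Optional, Tuple

PAD, END = "<PAD>", "<END>"  # token spellings assumed for this sketch


class CachedSuffixStepper:
    """Serve a predicted suffix one activity at a time, per case."""

    def __init__(self) -> None:
        self.predicted_suffixes: Dict[str, List[str]] = {}
        self.suffix_positions: Dict[str, int] = {}

    def next_activity(self, case_id: str, suffix: List[str]) -> Tuple[Optional[str], bool]:
        """Return (activity, is_end); cache `suffix` on the first call for a case."""
        if case_id not in self.predicted_suffixes:
            self.predicted_suffixes[case_id] = suffix
            self.suffix_positions[case_id] = 0

        cached = self.predicted_suffixes[case_id]
        pos = self.suffix_positions[case_id]
        # Treat exhaustion, END, or padding as the end of the case.
        if pos >= len(cached) or cached[pos] in (PAD, END):
            return None, True

        self.suffix_positions[case_id] = pos + 1
        return cached[pos], False


stepper = CachedSuffixStepper()
suffix = ["A_Concept", "W_Complete application", PAD]
for _ in range(3):
    print(stepper.next_activity("test_case_1", suffix))
```

After the first call the stored position is 1, mirroring the "Current position in suffix: 1" line in the notebook output; later calls ignore the `suffix` argument and read from the cache.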
