This repository implements a research pipeline for resource-centric next activity prediction using process event logs. The core question is how different sequence encoding techniques affect predictive performance on downstream classifiers.
The project evaluates 7 embedding and encoding approaches using two classifiers:
- Random Forest (`run_experiments_rf.py`)
- LightGBM (`run_experiments_lgbm.py`)
It also includes an experiment script to measure encoding complexity and a results analysis utility.
- Resource-centric prediction: The dataset is split by resource (`org:resource`), so resources in the test set do not appear in the training set.
- Next activity prediction: For each partial sequence of resource activity, predict the next activity.
- Prefix strategies:
  - `prefix`: fixed prefix of the first `n` activities
  - `last_k`: last `k` activities before the target point
  - `sliding_window`: windowed prefixes across the trace
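The three prefix strategies can be illustrated with a minimal sketch. The function names, the `(prefix, next activity)` sample format, and the choice of the final activity as the target for `last_k` are assumptions made for illustration; the actual logic in `modules/data_loader.py` may differ.

```python
from typing import List, Tuple

Trace = List[str]
Sample = Tuple[List[str], str]  # (prefix activities, next activity)


def fixed_prefix(trace: Trace, n: int) -> List[Sample]:
    """The first n activities predict the activity that follows them."""
    if len(trace) <= n:
        return []
    return [(trace[:n], trace[n])]


def last_k(trace: Trace, k: int) -> List[Sample]:
    """The k activities right before the target predict it.

    Assumption: the target point is the final activity of the trace.
    """
    if len(trace) <= k:
        return []
    return [(trace[-k - 1:-1], trace[-1])]


def sliding_window(trace: Trace, size: int) -> List[Sample]:
    """Every window of `size` activities predicts the activity after it."""
    return [(trace[i:i + size], trace[i + size])
            for i in range(len(trace) - size)]


trace = ["A", "B", "C", "D", "E"]
print(fixed_prefix(trace, 2))    # [(['A', 'B'], 'C')]
print(last_k(trace, 2))          # [(['C', 'D'], 'E')]
print(sliding_window(trace, 2))  # three samples, targets 'C', 'D', 'E'
```

Traces shorter than the requested length yield no samples in this sketch, which avoids padding decisions that the source does not specify.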
The repository supports seven methods:
- `Baseline` - position-wise label encoding of the prefix
- `OHE` - one-hot encoding of positional activity features
- `Bigram` - bigram transition count features from activity sequences
- `W2V` - Word2Vec embeddings trained on activity prefixes
- `D2V` - Doc2Vec embeddings trained on full resource traces
- `BERT` - custom BERT-style encoder pre-trained on activity sequences
- `ACF` - Activity Context Frequency embeddings with PMI post-processing
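To make the two simplest encodings concrete, here is a rough sketch of position-wise label encoding and bigram transition counts. The helper names, the vocabulary layout, and the reservation of index 0 for padding are assumptions; the repository's encoders may be organised differently.

```python
from collections import Counter
from itertools import product
from typing import Dict, List, Sequence


def label_encode(prefix: Sequence[str], vocab: Dict[str, int]) -> List[int]:
    """Baseline-style encoding: one integer id per prefix position."""
    return [vocab[activity] for activity in prefix]


def bigram_counts(prefix: Sequence[str], activities: Sequence[str]) -> List[int]:
    """Count each ordered pair of consecutive activities in the prefix.

    The feature vector has one entry per possible (a, b) pair, so its
    length is len(activities) ** 2.
    """
    counts = Counter(zip(prefix, prefix[1:]))
    return [counts[pair] for pair in product(activities, repeat=2)]


activities = ["A", "B", "C"]
# Assumption: ids start at 1 so that 0 stays free for padding.
vocab = {activity: i + 1 for i, activity in enumerate(activities)}

prefix = ["A", "B", "A", "B"]
print(label_encode(prefix, vocab))        # [1, 2, 1, 2]
print(bigram_counts(prefix, activities))  # [0, 2, 0, 1, 0, 0, 0, 0, 0]
```

The bigram vector grows quadratically with the number of distinct activities, which is one reason feature count and memory can differ sharply between methods.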
The repository uses BPI Challenge event logs stored in datasets/:
- `BPI_Challenge_2013_Incidents.xes`
- `BPI_Challenge_2017.xes`
- `BPI_Challenge_2018.xes`
- `BPI_Challenge_2019.xes`
These logs are loaded and converted to a resource-centric format containing:
- `case:concept:name`
- `concept:name`
- `org:resource`
- `time:timestamp`
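As an illustration of what a resource-centric view with these four columns implies, the sketch below groups events into one activity trace per resource and then splits by resource so that train and test share no resource. It uses plain dictionaries instead of pm4py or pandas, and the function names, the timestamp-based ordering, the 20% test fraction, and the toy events are all assumptions.

```python
import random
from collections import defaultdict
from typing import Dict, List, Tuple

Event = Dict[str, str]
Traces = Dict[str, List[str]]


def build_resource_traces(events: List[Event]) -> Traces:
    """Group activities by org:resource, ordered by time:timestamp.

    Assumption: timestamps are ISO 8601 strings, which sort chronologically.
    """
    by_resource = defaultdict(list)
    for event in sorted(events, key=lambda e: e["time:timestamp"]):
        by_resource[event["org:resource"]].append(event["concept:name"])
    return dict(by_resource)


def split_by_resource(traces: Traces, test_fraction: float = 0.2,
                      seed: int = 42) -> Tuple[Traces, Traces]:
    """Assign whole resources to either train or test, never both."""
    resources = sorted(traces)
    random.Random(seed).shuffle(resources)
    n_test = max(1, int(len(resources) * test_fraction))
    test_ids = set(resources[:n_test])
    train = {r: t for r, t in traces.items() if r not in test_ids}
    test = {r: t for r, t in traces.items() if r in test_ids}
    return train, test


# Toy events, invented for illustration only.
events = [
    {"case:concept:name": "c1", "concept:name": "Accepted",
     "org:resource": "r1", "time:timestamp": "2013-01-01T09:00:00"},
    {"case:concept:name": "c2", "concept:name": "Queued",
     "org:resource": "r2", "time:timestamp": "2013-01-01T09:05:00"},
    {"case:concept:name": "c1", "concept:name": "Completed",
     "org:resource": "r1", "time:timestamp": "2013-01-01T10:00:00"},
    {"case:concept:name": "c2", "concept:name": "Accepted",
     "org:resource": "r3", "time:timestamp": "2013-01-01T11:00:00"},
]

traces = build_resource_traces(events)
train, test = split_by_resource(traces)
print(traces["r1"])            # ['Accepted', 'Completed']
print(set(train) & set(test))  # set()
```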
`run_experiments_rf.py`: Run Random Forest experiments across the configured datasets, strategies, prefix lengths, and embedding methods.
`run_experiments_lgbm.py`: Run LightGBM experiments across the configured datasets, strategies, prefix lengths, and embedding methods.
`run_complexity_experiments.py`: Measure encoding complexity for each method, including feature count, matrix memory size, and encoding time.
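The three complexity quantities (feature count, matrix memory size, encoding time) could be collected roughly as follows. This is a sketch under stated assumptions: `measure_encoding` and `toy_one_hot` are hypothetical names, the encoder returns a dense list of rows, and memory is estimated at 8 bytes per value as for a float64 matrix.

```python
import time
from typing import Callable, Dict, List, Sequence

Prefixes = Sequence[Sequence[str]]
Matrix = List[List[float]]


def measure_encoding(encode: Callable[[Prefixes], Matrix],
                     prefixes: Prefixes,
                     bytes_per_value: int = 8) -> Dict[str, float]:
    """Time one encoding pass and report the size of the resulting matrix."""
    start = time.perf_counter()
    matrix = encode(prefixes)
    elapsed = time.perf_counter() - start

    n_rows = len(matrix)
    n_features = len(matrix[0]) if matrix else 0
    return {
        "n_features": n_features,
        # Dense estimate: rows * columns * bytes per stored value.
        "memory_bytes": n_rows * n_features * bytes_per_value,
        "encoding_seconds": elapsed,
    }


def toy_one_hot(prefixes: Prefixes) -> Matrix:
    """Positional one-hot encoding for equal-length prefixes (toy encoder)."""
    vocab = sorted({a for prefix in prefixes for a in prefix})
    return [[1.0 if a == v else 0.0 for a in prefix for v in vocab]
            for prefix in prefixes]


stats = measure_encoding(toy_one_hot, [["A", "B"], ["B", "C"], ["C", "A"]])
print(stats["n_features"], stats["memory_bytes"])  # 6 144
```

A single timed pass is noisy; repeating the measurement and reporting a median would give steadier numbers.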
`win_counts.py`: Analyze existing experiment CSV results and compute ranking statistics such as:
- total wins per method
- wins per dataset/model configuration
- total points using a top-7 scoring system
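The ranking statistics above might be computed along these lines. The source does not define the top-7 scoring system, so this sketch assumes that rank 1 earns 7 points, rank 2 earns 6, and so on down to 1 point for rank 7, with ties broken alphabetically. The function names and the scores are made up for illustration and are not experiment results.

```python
from collections import Counter
from typing import Dict, List, Tuple


def rank_methods(scores: Dict[str, float]) -> List[str]:
    """Order methods from best to worst score (ties: alphabetical)."""
    return sorted(scores, key=lambda method: (-scores[method], method))


def tally(results: List[Dict[str, float]],
          top: int = 7) -> Tuple[Counter, Counter]:
    """Return (wins, points) accumulated over all configurations."""
    wins, points = Counter(), Counter()
    for scores in results:
        ranking = rank_methods(scores)
        wins[ranking[0]] += 1
        # Assumed scheme: `top` points for rank 1, one fewer per rank.
        for position, method in enumerate(ranking[:top]):
            points[method] += top - position
    return wins, points


# One dict per dataset/model configuration; values are invented.
results = [
    {"Baseline": 0.61, "OHE": 0.64, "Bigram": 0.66},
    {"Baseline": 0.70, "OHE": 0.68, "Bigram": 0.69},
    {"Baseline": 0.55, "OHE": 0.58, "Bigram": 0.60},
]

wins, points = tally(results)
print(dict(wins))    # {'Bigram': 2, 'Baseline': 1}
print(dict(points))  # {'Bigram': 20, 'OHE': 17, 'Baseline': 17}
```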
- `modules/`
  - `data_loader.py` - dataset import, cleaning, resource split, and prefix generation
  - `rf_trainer.py` - Random Forest training and grid search utilities
  - `lgbm_trainer.py` - LightGBM training and grid search utilities
  - `encoders/` - encoding and embedding implementations
    - `baseline.py`
    - `one_hot_encoding.py`
    - `bigram.py`
    - `word2vec.py`
    - `doc2vec.py`
    - `bert.py`
    - `acf.py`
- `ACF_code/` - custom implementation of activity-context frequency / PMI embedding generation
- `datasets/` - input XES event logs
- `results/` - experiment output CSV files
- `plots/` - generated visualizations and plots
- `notebooks/` - exploratory analysis notebooks
The code requires at least the following Python packages:
- pandas
- numpy
- pm4py
- scikit-learn
- gensim
- torch
- transformers
- lightgbm
- matplotlib
- seaborn
If you use the LightGBM GPU settings in modules/lgbm_trainer.py, ensure a compatible GPU environment is available or update the trainer configuration to use CPU.
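Switching between GPU and CPU in LightGBM is controlled by the `device_type` parameter. The sketch below only builds a parameter dictionary; the helper name `lgbm_params` and the other hyperparameter values are assumptions, and the way `modules/lgbm_trainer.py` actually stores its settings may differ.

```python
from typing import Any, Dict


def lgbm_params(use_gpu: bool) -> Dict[str, Any]:
    """Build LightGBM parameters with the device chosen explicitly."""
    params: Dict[str, Any] = {
        # Placeholder hyperparameters, not the repository's values.
        "n_estimators": 200,
        "learning_rate": 0.1,
    }
    # "gpu" needs a GPU-enabled LightGBM build; "cpu" is the default.
    params["device_type"] = "gpu" if use_gpu else "cpu"
    return params


print(lgbm_params(use_gpu=False)["device_type"])  # cpu
```

The resulting dictionary can be passed on as keyword arguments, for example `lightgbm.LGBMClassifier(**lgbm_params(use_gpu=False))`.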
- Install dependencies in your Python environment.
- Place the target XES datasets in `datasets/`.
- Run the desired script:
  - `python run_experiments_rf.py`
  - `python run_experiments_lgbm.py`
  - `python run_complexity_experiments.py`
  - `python win_counts.py`
- Inspect the generated CSVs in `results/` and the plots in `plots/` for visual summaries.
- Most experiment scripts are configurable via the constants defined at the top of each file (`DATASETS`, `PREFIX_LENGTHS`, `STRATEGIES`, `METHODS`, `USE_GRID_SEARCH`).
- The repository is designed for research experimentation.
- The `data_analysis.ipynb` and `notebooks/RandomForest.ipynb` notebooks contain additional exploratory work and visualization examples. The latter may not be fully up to date with the current encoders, as it was mainly used for early experimentation.
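For orientation, the configuration constants named in the notes above could drive an experiment loop along these lines. The constant names come from the scripts; the values, the dataset identifier format, and the `experiment_grid` helper are assumptions for illustration.

```python
from itertools import product
from typing import Dict, Iterator, Union

# Example values only; edit the constants at the top of each script instead.
DATASETS = ["BPI_Challenge_2013_Incidents"]
PREFIX_LENGTHS = [5, 10]
STRATEGIES = ["prefix", "last_k", "sliding_window"]
METHODS = ["Baseline", "OHE", "Bigram", "W2V", "D2V", "BERT", "ACF"]
USE_GRID_SEARCH = False

Config = Dict[str, Union[str, int, bool]]


def experiment_grid() -> Iterator[Config]:
    """Yield one configuration per dataset/strategy/length/method combination."""
    for dataset, strategy, length, method in product(
            DATASETS, STRATEGIES, PREFIX_LENGTHS, METHODS):
        yield {
            "dataset": dataset,
            "strategy": strategy,
            "prefix_length": length,
            "method": method,
            "grid_search": USE_GRID_SEARCH,
        }


runs = list(experiment_grid())
print(len(runs))  # 1 * 3 * 2 * 7 = 42
print(runs[0])
```

Because the run count is the product of all list lengths, trimming `METHODS` or `PREFIX_LENGTHS` is the quickest way to shorten a trial run.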